Construction of a High-Quality Tibetan Dataset for Machine Reading Comprehension

Sun Yuan (孙媛)¹, Liu Sisi (刘思思)², Chen Chaofan (陈超凡)², Dan Zhengcuo (旦正错)², Zhao Xiaobing (赵小兵)²

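Author information

1. School of Information Engineering, Minzu University of China, Beijing 100081; National Language Resource Monitoring and Research Center for Minority Languages, Beijing 100081; Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education, Beijing 100081
2. School of Information Engineering, Minzu University of China, Beijing 100081; National Language Resource Monitoring and Research Center for Minority Languages, Beijing 100081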

Abstract

Machine reading comprehension (MRC) requires a machine to answer questions about a given context, thereby testing how well it understands natural language. Dataset construction is one of the main tasks in MRC. Existing models have achieved remarkable results on most popular English datasets, in some cases surpassing human performance. For low-resource languages, however, MRC research is still in its early stages because corresponding datasets are lacking. Taking Tibetan as an example, this paper manually constructs a Tibetan machine reading comprehension dataset (TibetanQA) that contains 20,000 question-answer pairs and 1,513 articles. The articles are all collected from the Yunzang website and cover 12 domains, including nature, culture, and education; the questions are varied in form and have a certain degree of difficulty. A strict process is applied to article collection, question construction, answer verification, answer diversity, and reasoning ability to ensure data quality, and a verification method based on ablating linguistic features of the input is used to demonstrate the quality of the dataset. Finally, the paper makes a preliminary exploration of three classic English reading comprehension models on TibetanQA. Their results still fall short of human performance, which indicates that Tibetan machine reading comprehension needs further exploration.

Key words

machine reading comprehension/low-resource languages/Tibetan/datasets

Funding

National Natural Science Foundation of China (61972436)

Minzu University of China project (GRSCP202316)

Minzu University of China project (2023QNYL22)

Publication year

2024

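Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences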

Indexed in: CSTPCD, CSCD, CHSSCD, Peking University Core Journals (北大核心)

Impact factor: 0.8

ISSN: 1003-0077

Number of references: 23
