Construction of High-quality Tibetan Dataset for Machine Reading Comprehension
Machine reading comprehension requires a machine to answer questions based on a given context. Existing models have achieved remarkable results on most popular English datasets, even surpassing human performance. For low-resource languages, however, machine reading comprehension has received little attention because corresponding datasets are lacking. Taking Tibetan as an example, this paper constructs a Tibetan machine reading comprehension dataset (TibetanQA), which contains 20,000 question-answer pairs and 1,513 articles. The articles are collected from the Yunzang website and cover 12 topics, including nature, culture, and education. The dataset is strictly controlled with respect to article selection, question construction, answer verification, answer diversity, and reasoning ability, and a verification method based on the linguistic features of the questions indicates that the dataset is of high quality. Finally, this paper evaluates three classic English reading comprehension models on TibetanQA and finds that their results are still inferior to human performance.