A Study on an Extractive Machine Reading Comprehension Dataset for Tibetan Medicine

Tibetan machine reading comprehension is still in its infancy, and building a high-quality corpus is an urgent task for advancing the field. This study used crowdsourcing to annotate in detail the sections on Tibetan medicinal plant materials and terminology explanations in the Tibetan medical classic The Four Medical Tantras (《四部医典》). A Tibetan masked data augmentation strategy was then applied to enlarge the dataset, yielding 13,000 valid question-answer pairs. Based on this dataset, the authors propose an efficient Tibetan machine reading comprehension model by optimizing the traditional attention mechanism. The work is significant for advancing Tibetan information processing technology, and it also helps improve machines' ability to understand Tibetan text, thereby supporting the inheritance and preservation of Tibetan culture.
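
The abstract does not spell out how the masked data augmentation works, so the following is only a minimal sketch under stated assumptions: each record is a tokenized passage with an inclusive answer span, and augmentation replaces random tokens outside that span with a `[MASK]` placeholder so the span indices stay valid. The function names, the record fields, and the `[MASK]` token are hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not specified in the source


def make_example(context_tokens, question, answer_start, answer_end):
    """Build an extractive QA record; the answer span indices are inclusive."""
    return {
        "context": list(context_tokens),
        "question": question,
        "answer_start": answer_start,
        "answer_end": answer_end,
        "answer": list(context_tokens[answer_start:answer_end + 1]),
    }


def mask_augment(tokens, answer_start, answer_end, mask_prob=0.15, seed=0):
    """Mask random tokens outside the answer span, keeping the length unchanged."""
    rng = random.Random(seed)
    augmented = []
    for i, tok in enumerate(tokens):
        inside_answer = answer_start <= i <= answer_end
        if not inside_answer and rng.random() < mask_prob:
            augmented.append(MASK)
        else:
            augmented.append(tok)
    return augmented


def augment_example(example, n_copies=2, mask_prob=0.15):
    """Create n_copies masked variants of one record, one random seed per copy."""
    copies = []
    for k in range(n_copies):
        masked = mask_augment(
            example["context"],
            example["answer_start"],
            example["answer_end"],
            mask_prob=mask_prob,
            seed=k,
        )
        copies.append(
            make_example(
                masked,
                example["question"],
                example["answer_start"],
                example["answer_end"],
            )
        )
    return copies
```

Because masking never touches the answer span and never changes the passage length, every augmented copy keeps the original start and end indices, which is what an extractive (span-prediction) model needs. How the authors actually tokenize Tibetan text or choose which tokens to mask may differ from this sketch.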

Keywords: Tibetan machine reading comprehension; The Four Medical Tantras; Tibetan corpus; attention mechanism

Authors: 旦增罗布, 拉巴次仁, 王浩畅, 小次仁

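Shannan Power Supply Company, State Grid Tibet Electric Power Co., Ltd. (国网西藏电力有限公司山南供电公司), Shannan 856000

University of Tibetan Medicine (西藏藏医药大学), Lhasa 850000

School of Computer and Information Technology, Northeast Petroleum University (东北石油大学计算机与信息技术学院), Daqing 163318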

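Funding: 2023 Research Funding Program for the Construction of the Tibetan Medicine Doctoral Program and the Cultivation of the Chinese and Tibetan Pharmacy Doctoral Program (BSDJS-23-15); National Natural Science Foundation of China (61402099)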

2024

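Journal: Tibet Science and Technology (西藏科技)
Publisher: Tibet Institute of Science and Technology Information (西藏科技信息研究所)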

Impact factor: 0.202
ISSN: 1004-3403
Year, volume (issue): 2024, 46(9)