A Study on an Extractive Machine Reading Comprehension Dataset for Tibetan Medicine

旦增罗布¹, 拉巴次仁², 王浩畅³, 小次仁¹

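Author information

1. Shannan Power Supply Company, State Grid Tibet Electric Power Co., Ltd., Shannan 856000
2. Tibet University of Tibetan Medicine, Lhasa 850000
3. School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318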

Abstract

Tibetan machine reading comprehension is still in its infancy, and building a high-quality corpus is an urgent prerequisite for progress in the field. This study uses crowdsourcing to annotate in detail the sections on Tibetan medicinal plant materials and terminology explanations in the classic Tibetan medical text The Four Medical Tantras (《四部医典》). A Tibetan masked data augmentation strategy is then applied to enlarge the dataset, yielding 13,000 valid question-answer pairs. Based on this dataset, an efficient Tibetan machine reading comprehension model is proposed by optimizing the conventional attention mechanism. The work is significant for advancing Tibetan information processing technology and helps improve the ability of machines to understand Tibetan text, thereby supporting the inheritance and protection of Tibetan culture.

Key words

Tibetan machine reading comprehension / The Four Medical Tantras / Tibetan corpus / attention mechanism

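Funding

2023 Research Funding Program for the Construction of the Tibetan Medicine Doctoral Program and the Cultivation of the Chinese and Tibetan Materia Medica Doctoral Program (BSDJS-23-15)

National Natural Science Foundation of China (61402099)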

Publication year

2024

Tibet Science and Technology (西藏科技)

Tibet Institute of Science and Technology Information (西藏科技信息研究所)

Impact factor: 0.202

ISSN: 1004-3403
