MedicalQA: A Medical-Domain Dataset for Machine Reading Comprehension



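Abstract (translated from Chinese): Machine reading comprehension aims to use algorithms to enable computers to understand the semantics of a passage and answer questions posed by users, and the quality of the dataset used for this task directly affects a model's experimental results. To enrich the medical-domain datasets available for machine reading comprehension, this paper constructs MedicalQA, a medical-domain dataset for machine reading comprehension, using web crawling and manual annotation. The dataset draws mainly on two medical platforms, Xunyiwenyao Network (寻医问药网) and 39 Health Network (39健康网), and contains 19,502 paragraphs, questions, and answers covering 9 major departments, including internal medicine, surgery, and obstetrics and gynecology. The dataset is an Excel file with 5 columns: the first column is the paragraph ID, the second is the department the paragraph belongs to, the third is the paragraph content, the fourth is the question, and the fifth is the answer to that question. The dataset supports research on the robustness of machine reading comprehension models and the construction of medical question-answering systems, and can also promote the sharing of medical datasets in the machine reading comprehension field.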

English title: MedicalQA: A dataset of medical domain for machine reading comprehension

English abstract: Machine reading comprehension aims to use algorithms to make the computer understand paragraph semantics and answer questions raised by users. The quality of the dataset used in this task can directly affect the experimental results of the model. To enrich the medical-domain datasets for machine reading comprehension, this paper constructs MedicalQA, a medical-domain dataset for machine reading comprehension, using a combination of web crawling and manual annotation. The dataset takes two medical platforms (Xunyiwenyao Network and 39 Health Network) as its main data sources and includes 19,502 paragraphs and Q&A pairs, covering 9 medical departments such as internal medicine, surgery, and obstetrics and gynecology. The dataset is formatted as an Excel file organized into 5 columns. The first column denotes the paragraph ID; the second column indicates the department to which the paragraph belongs; the third column contains the paragraph content; the fourth column lists the questions; and the fifth column provides the corresponding answers. The construction of this dataset is conducive to the establishment of machine reading comprehension models in the medical domain, and can also promote the sharing of medical datasets in the field of machine reading comprehension.
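The five-column layout described in the abstract could be modelled roughly as follows. This is only a minimal sketch: the field names (`paragraph_id`, `department`, `paragraph`, `question`, `answer`) and the helper functions are hypothetical, since the abstract gives only the column order, and the span check assumes answers are copied literally from the paragraph, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class MedicalQARecord:
    """One row of the five-column layout (field names are hypothetical)."""
    paragraph_id: str   # column 1: paragraph ID
    department: str     # column 2: department the paragraph belongs to
    paragraph: str      # column 3: paragraph content
    question: str       # column 4: question
    answer: str         # column 5: answer to the question


def parse_rows(rows: Iterable[Sequence[object]]) -> List[MedicalQARecord]:
    """Convert raw 5-cell rows into records, rejecting malformed rows."""
    records = []
    for index, row in enumerate(rows):
        if len(row) != 5:
            raise ValueError(f"row {index} has {len(row)} cells, expected 5")
        # Spreadsheet readers may yield numbers or None; normalise to text.
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        records.append(MedicalQARecord(*cells))
    return records


def answer_in_paragraph(record: MedicalQARecord) -> bool:
    """Return True if the answer is a literal substring of the paragraph.

    Assumption: span-style answers; not confirmed by the source.
    """
    return bool(record.answer) and record.answer in record.paragraph
```

The rows passed to `parse_rows` could come from any spreadsheet reader, for example the row tuples of a table loaded with `pandas.read_excel`; the actual file name and header row of the published dataset are not given here, so they would need to be checked against the download.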

English keywords:

machine reading comprehension; medical domain; dataset

Authors:

马宁、吕文蓉、郭泽晨


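Author affiliations:

Northwest Minzu University, Key Laboratory of China's Ethnic Languages and Information Technology, Lanzhou 730030

Northwest Minzu University, Gansu Provincial Key Laboratory of Ethnic Language Intelligent Processing, Lanzhou 730030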

Keywords (translated from Chinese):

machine reading comprehension; medical domain; dataset

Publication year:

2024

DOI:

10.11922/11-6035.csd.2022.0030.zh

China Scientific Data (Chinese and English online edition)

CSTPCD

ISSN: 2096-2223

Year, volume (issue): 2024, 9(1)

Number of references: 17