Research on the Construction of a Chinese Diabetes Question Classification Taxonomy and Annotated Corpus

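As a typical chronic disease, diabetes has become one of the major global public health challenges. With the rapid development of the Internet, the large population of type 2 diabetes patients and high-risk individuals has an increasingly prominent need for access to professional diabetes information, and automated diabetes question answering services play an increasingly important role in their daily health care. However, such services still suffer from notable problems such as the lack of fine-grained question classification. This paper designs a new diabetes question classification taxonomy that represents user intent, comprising 6 coarse-grained categories and 23 fine-grained categories. Based on this taxonomy, the paper crawls two professional medical question answering websites to build DaCorp, a Chinese diabetes question answering corpus containing 122,732 question-answer pairs, and manually annotates 8,000 of its diabetes questions to form a fine-grained annotated diabetes dataset. To evaluate the quality of the annotated dataset, the paper implements 8 mainstream baseline classification models. Experimental results show that the best classification model reaches an accuracy of 88.7%, verifying the effectiveness of the annotated diabetes dataset and the proposed taxonomy. DaCorp, the annotated diabetes dataset, and the annotation guidelines have been released online and are freely available for academic research.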
Construction of Question Taxonomy and An Annotated Chinese Corpus for Diabetes Question Classification
As a typical chronic disease, diabetes has become one of the major global public health challenges. Automated diabetes Question Answering (QA) services play a vital role in providing daily health services for patients and high-risk people. This paper designed a new diabetes question classification taxonomy which represents the user intent, including 6 coarse-grained categories and 23 fine-grained categories. This paper also constructed a new Chinese diabetes QA corpus, DaCorp, that contains 122,732 question-answer pairs collected from two professional medical QA websites. Meanwhile, this paper annotated 8,000 diabetes questions in DaCorp as a fine-grained diabetes dataset. To evaluate the quality of the proposed taxonomy and the annotated dataset, this paper implemented 8 mainstream baseline classifiers for diabetes question classification. Results show that the best-performing model achieved an accuracy of 88.7%, demonstrating the validity of the annotated diabetes dataset and the efficacy of the proposed taxonomy.

diabetes; question classification; classification taxonomy; corpus construction

钱晓波、谢文秀、龙绍沛、兰牧融、慕媛媛、郝天永

School of Computer Science, South China Normal University, Guangzhou, Guangdong 510631

Department of Computer Science, City University of Hong Kong, Hong Kong 999077

School of Foreign Languages, Chaohu University, Hefei, Anhui 238024

糖尿病 (diabetes); 问题分类 (question classification); 分类体系 (classification taxonomy); 语料库建设 (corpus construction)

2024

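Journal of Chinese Information Processing (中文信息学报)
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences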

CSTPCD; CHSSCD; 北大核心 (Peking University Core Journals)
Impact factor: 0.8
ISSN: 1003-0077
Year, volume (issue): 2024, 38(12)