Research on the Construction of a Chinese Diabetes Question Classification Taxonomy and Annotated Corpus

As a typical chronic disease, diabetes has become one of the major global public health challenges. With the rapid development of the Internet, the large population of type 2 diabetes patients and high-risk individuals has shown a growing demand for specialized information on diabetes, and automated diabetes question answering (QA) services play an increasingly important role in their daily health care. However, many QA services still suffer from prominent problems such as the lack of fine-grained question classification. In this paper, we design a new diabetes question classification taxonomy that represents user intent, comprising 6 coarse-grained categories and 23 fine-grained categories. Based on this taxonomy, we crawl two professional medical QA websites and construct DaCorp, a Chinese diabetes QA corpus containing 122,732 question-answer pairs. We also manually annotate 8,000 diabetes questions from DaCorp, forming a fine-grained annotated diabetes dataset. To evaluate the quality of the annotated dataset, we implement 8 mainstream baseline classification models. Experimental results show that the best-performing model achieves an accuracy of 88.7%, which supports the validity of the annotated dataset and the effectiveness of the proposed taxonomy. DaCorp, the annotated diabetes dataset, and the annotation guidelines have been published online and are free for academic research.

diabetes; question classification; classification taxonomy; corpus construction; annotation

钱晓波、谢文秀、龙绍沛、兰牧融、慕媛媛、郝天永

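School of Computer Science, South China Normal University, Guangzhou, Guangdong

Department of Computer Science, City University of Hong Kong, Hong Kong

School of Foreign Languages, Chaohu University, Hefei, Anhui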

Chinese National Conference on Computational Linguistics

Nanchang (CN)

The 21st Chinese National Conference on Computational Linguistics

395-405

2022