Diabetes, a typical chronic disease, has become one of the major global public health challenges. With the rapid development of the Internet, the large population of type 2 diabetes patients and high-risk people has shown an increasing demand for specialized information on diabetes. Automated diabetes Question Answering (QA) services play a vital role in providing daily health services for these users. However, issues such as fine-grained question classification remain unsolved in many QA services. In this paper, we design a new diabetes question classification taxonomy that represents user intent, comprising 6 coarse-grained categories and 23 fine-grained categories. We also construct a new Chinese diabetes QA corpus, DaCorp, which contains 122,732 question-answer pairs collected from two professional medical QA websites. In addition, we annotate 8,000 diabetes questions from DaCorp to form a fine-grained diabetes dataset. To evaluate the quality of the proposed taxonomy and the annotated dataset, we implement 8 mainstream baseline classifiers for diabetes question classification. Results show that the best-performing model achieves an accuracy of 88.7%, demonstrating the validity of the annotated diabetes dataset and the efficacy of the proposed taxonomy. DaCorp, the annotated diabetes dataset, and the annotation guidelines are published online and are free for academic research.
Diabetes; question classification; taxonomy; corpus construction; annotation
Xiaobo Qian, Wenxiu Xie, Shaopei Long, Murong Lan, Yuanyuan Mu, Tianyong Hao
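School of Computer Science, South China Normal University, Guangzhou, Guangdong
Department of Computer Science, City University of Hong Kong, Hong Kong
School of Foreign Languages, Chaohu University, Hefei, Anhui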
Chinese National Conference on Computational Linguistics
Nanchang (CN)
The 21st Chinese National Conference on Computational Linguistics