Standardized Evaluation of Large Language Models in Traditional Chinese Medicine

Cao Lu (曹露)¹, Xu Lin (许林)¹, Zhang Yujie (张宇洁)¹, Zhang Linshuai (张林帅)¹, Fu Yaqin (付亚琴)¹, Jiang Tao (蒋涛)¹

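Author information

1. School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, Sichuan 611100, China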

Abstract

OBJECTIVE To address the lack of benchmarks for evaluating large language models (LLMs) in traditional Chinese medicine (TCM), a TCM benchmark dataset was designed and constructed to evaluate comprehensively and objectively how well LLMs master and reason over TCM knowledge, providing a scientific and reliable basis for optimizing LLM performance in the TCM field. METHODS Data were collected from standardized TCM examinations and textbooks to build a benchmark of 29,506 questions across 13 subjects. Three general-purpose models (GPT-3.5, ChatGLM3, Baichuan) and five Chinese medical models (PULSE, BenTsao, HuatuoGPT2, BianQue2, ShenNong) were evaluated on two tasks: answer prediction and answer reasoning. Results were quantified using accuracy, F1 score, BLEU, and Rouge. RESULTS In the answer prediction task, Baichuan had the highest accuracy on single-choice questions (36.07%), while ChatGLM3 achieved the highest accuracy (18.96%) and F1 score (76.31%) on multiple-choice questions. In the answer reasoning task, Baichuan scored highest on BLEU-1 (24.71), while ChatGLM3 achieved the highest Rouge-1 score (44.64). CONCLUSION General-purpose LLMs performed slightly better overall than Chinese medical LLMs. All models' accuracy on choice questions remained below 60%, showing that LLMs still face significant challenges, and have substantial room for improvement, in the field of TCM.
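The abstract reports a low accuracy (18.96%) alongside a high F1 score (76.31%) for ChatGLM3 on multiple-choice questions. One plausible reading is that accuracy counts only exact matches of the full option set, while F1 gives partial credit for overlapping options. The sketch below illustrates that reading in Python; the abstract does not define the metrics, so the function names, the per-question averaging, and the toy answers are all assumptions made for illustration and are not taken from the paper.

```python
def exact_match_accuracy(preds, golds):
    """Fraction of questions whose predicted option set equals the gold set."""
    hits = sum(set(p) == set(g) for p, g in zip(preds, golds))
    return hits / len(golds)


def mean_option_f1(preds, golds):
    """Average per-question F1 computed over the selected option letters."""
    total = 0.0
    for p, g in zip(preds, golds):
        p, g = set(p), set(g)
        overlap = len(p & g)
        if overlap == 0:
            continue  # no shared options: this question contributes F1 = 0
        precision = overlap / len(p)
        recall = overlap / len(g)
        total += 2 * precision * recall / (precision + recall)
    return total / len(golds)


# Hypothetical gold answers and model predictions (not from the dataset).
golds = ["ABC", "BD", "ACDE", "AB"]
preds = ["AB", "BD", "ACD", "ABC"]

acc = exact_match_accuracy(preds, golds)
f1 = mean_option_f1(preds, golds)
print(f"accuracy={acc:.4f}, F1={f1:.4f}")
```

In this toy case only one of four predictions matches exactly, so accuracy is 0.25, yet most predicted options are correct, so the mean F1 is about 0.86. A gap of this kind would be consistent with the figures in the abstract, though the paper's own metric definitions should be checked before drawing that conclusion.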

Key words

large language models/Chinese medical models/evaluation benchmark/ChatGPT/traditional Chinese medicine

Publication year

2024

Journal of Nanjing University of Traditional Chinese Medicine (南京中医药大学学报)

Nanjing University of Chinese Medicine (南京中医药大学)

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 1.658

ISSN: 1672-0482
