
Standardized Evaluation of Large Language Models in Traditional Chinese Medicine

OBJECTIVE To address the lack of evaluation benchmarks for large language models (LLMs) in traditional Chinese medicine (TCM), a TCM benchmark dataset was designed and constructed to evaluate, comprehensively and objectively, how well LLMs master and reason over TCM knowledge, providing a scientific and reliable basis for optimizing LLM performance in TCM.
METHODS Data were collected from standardized TCM examinations and textbooks to build a benchmark of 29 506 questions across 13 subjects. Three general-purpose models (GPT-3.5, ChatGLM3, Baichuan) and five Chinese medical models (PULSE, BenTsao, HuatuoGPT2, BianQue2, ShenNong) were evaluated on answer prediction and answer reasoning tasks. Results were quantified using accuracy, F1 score, BLEU, and Rouge.
RESULTS In the answer prediction task, Baichuan achieved the highest accuracy on single-choice questions (36.07%), while ChatGLM3 achieved the highest accuracy (18.96%) and F1 score (76.31%) on multiple-choice questions. In the answer reasoning task, Baichuan scored highest on BLEU-1 (24.71), while ChatGLM3 scored highest on Rouge-1 (44.64).
CONCLUSION General-purpose LLMs performed slightly better overall than Chinese medical LLMs. No model exceeded 60% accuracy on choice questions, which indicates that LLMs still face significant challenges and have considerable room for improvement in TCM.
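The multiple-choice results pair a low accuracy (18.96%) with a much higher F1 score (76.31%). The abstract does not say how either metric was computed, so the following is only a minimal sketch under stated assumptions: accuracy is taken as exact match of the full option set, and F1 is micro-averaged over individual options. The function names and the toy answers are hypothetical and are not taken from the paper.

```python
def exact_match_accuracy(gold, pred):
    """Share of questions whose predicted option set equals the gold set.

    Assumption: a multiple-choice answer counts as correct only when every
    option matches. The paper's actual scoring rule may differ.
    """
    hits = sum(set(g) == set(p) for g, p in zip(gold, pred))
    return hits / len(gold)


def option_f1(gold, pred):
    """Micro-averaged F1 over individual options (an assumed definition)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        tp += len(g_set & p_set)   # options correctly selected
        fp += len(p_set - g_set)   # options selected but not in the key
        fn += len(g_set - p_set)   # options in the key but not selected
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): three multiple-choice questions.
gold = ["ABD", "BC", "ACE"]
pred = ["AB", "BC", "ACD"]

accuracy = exact_match_accuracy(gold, pred)
f1 = option_f1(gold, pred)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

Under these assumptions, a partially correct answer earns credit toward F1 but none toward accuracy, which is one way a model could show a large gap between the two metrics.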

Keywords: large language models; Chinese medical models; evaluation benchmark; ChatGPT; traditional Chinese medicine

Cao Lu (曹露), Xu Lin (许林), Zhang Yujie (张宇洁), Zhang Linshuai (张林帅), Fu Yaqin (付亚琴), Jiang Tao (蒋涛)


School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, Sichuan 611100


2024

Journal of Nanjing University of Traditional Chinese Medicine
Nanjing University of Traditional Chinese Medicine


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.658
ISSN:1672-0482
Year, volume (issue): 2024, 40(12)