PeMeBench: A Chinese Pediatric Medical Question-Answering Benchmark

张芊¹, 陈攀峰¹, 冯林坤¹, 刘淑钰¹, 马丹¹, 陈梅¹, 李晖¹

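Author information

1. State Key Laboratory of Public Big Data, Guiyang, Guizhou 550000; College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550000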

Abstract

Large language models (LLMs) have demonstrated significant application potential in the medical field, but evaluating their performance in medical scenarios remains a challenge. Existing medical benchmarks are predominantly multiple-choice, which makes it difficult to assess LLM performance in pediatric settings comprehensively and accurately. To address this, PeMeBench, the first Chinese pediatric medical question-answering benchmark, is proposed. Built on a dual-perspective evaluation design and drawing on diagnosis and treatment guideline books covering 10 pediatric disease systems, PeMeBench divides pediatric medical question answering into five subtasks: disease knowledge, treatment plans, medication dosages, disease prevention, and pharmacological effects. It comprises over 10,000 open-ended questions and introduces a multi-granularity automated evaluation scheme that combines entity recall with sentence-level hallucination detection. The benchmark aims to provide a comprehensive and accurate assessment of LLM performance in basic pediatric healthcare, to analyze the models' potential limitations in depth, and to lay a foundation for more intelligent medical services.
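The abstract names two ingredients of the automated evaluation, entity recall and sentence-level hallucination detection, but does not give their formulas. The following is only a minimal sketch of what such scores could look like, not the authors' implementation. It assumes that entity recall is the fraction of reference entities found verbatim in the model answer, and that hallucination is scored as the fraction of answer sentences rejected by some external checker. The function names, the `is_supported` callback, and the toy data are all hypothetical.

```python
import re
from typing import Callable, List


def entity_recall(answer: str, reference_entities: List[str]) -> float:
    """Fraction of reference entities that appear verbatim in the answer.

    Assumption: plain substring matching; a real system would likely
    normalize synonyms and abbreviations first.
    """
    if not reference_entities:
        return 0.0
    hits = sum(1 for entity in reference_entities if entity in answer)
    return hits / len(reference_entities)


def split_sentences(text: str) -> List[str]:
    """Split on Chinese or ASCII sentence-ending punctuation.

    An ASCII period only counts when followed by whitespace or the end
    of the text, so decimals such as "2.5" are not split.
    """
    parts = re.split(r"[。！？!?]+|\.(?=\s|$)", text)
    return [part.strip() for part in parts if part.strip()]


def hallucination_rate(answer: str, is_supported: Callable[[str], bool]) -> float:
    """Fraction of answer sentences that the checker flags as unsupported.

    `is_supported` is a hypothetical callback (for example a lookup
    against the reference text or a judge model).
    """
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    flagged = sum(1 for sentence in sentences if not is_supported(sentence))
    return flagged / len(sentences)


if __name__ == "__main__":
    # Toy, made-up data purely for illustration; not from the benchmark.
    answer = "Drug A lowers fever. Drug A cures asthma."
    reference_entities = ["Drug A", "fever", "dose"]
    print(entity_recall(answer, reference_entities))
    print(hallucination_rate(answer, lambda s: "asthma" not in s))
```

Keeping the two scores separate reflects the multi-granularity idea: the first works at the entity level and the second at the sentence level, so an answer can cover the right entities and still contain unsupported statements.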

Key words

pediatric medicine/benchmark testing/large language model/Q&A

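Funding

National Natural Science Foundation of China (61462012)

2023 Guizhou Provincial Science and Technology Program Project (Qiankehe Support [2023] General 276)

2023 Guizhou Provincial Science and Technology Achievements Application and Industrialization Program Project (Qiankehe Achievements [2023] General 010)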

Publication year

2024

Big Data Research (大数据)

Posts & Telecom Press (人民邮电出版社)

CSTPCD

ISSN: 2096-0271
