Large language models (LLMs) have demonstrated significant application potential in the medical field. However, evaluating their performance in medical scenarios remains a challenge. Existing medical benchmarks, which are predominantly multiple-choice, struggle to assess LLM performance in pediatrics comprehensively and accurately. To address this issue, PeMeBench, the first Chinese pediatric question-answering benchmark, was proposed. Using dual-perspective evaluation dimensions and drawing on diagnostic and treatment guidelines for 10 pediatric disease systems, PeMeBench categorized pediatric medical question-answering tasks into five subdomains: disease knowledge, treatment plans, medication dosages, disease prevention, and pharmacological effects. It comprised over 10,000 open-ended question-answering items and introduced a multi-grained automated evaluation scheme that integrated entity retrieval with the detection of hallucinated sentences. This approach aimed to provide a comprehensive and precise assessment of LLM performance in pediatric healthcare, to expose the models' potential limitations, and to lay a foundation for more intelligent medical services.
Key words
pediatric medicine/benchmark/large language model/question answering