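Research on Intelligent Scoring of Subjective Questions in Pharmacology Exams Based on Large Language Models (基于大型语言模型的药理学考试主观题智能评分研究)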

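Abstract (translated from Chinese): This article examines how well large language models (LLMs) perform in the intelligent scoring of subjective questions in pharmacology. Five LLMs were selected: ChatGPT 4.0, Claude 2, iFLYTEK Spark Large Cognitive Model 3.0, Zhipu Qingyan 3.0 (ChatGLM 3.0), and ERNIE Bot 3.5. Using several scoring standards and prompt engineering techniques, the models scored short-text subjective pharmacology questions. ChatGPT 4.0 performed best, with a mean absolute error rate (MAER) of 0.0517, a root mean square error (RMSE) of 1.0339, and an intraclass correlation coefficient (ICC) of 0.936, indicating high scoring consistency and accuracy. Claude 2 followed closely, with an MAER of 0.0724, an RMSE of 1.2999, and an ICC of 0.893, also showing good scoring performance. The other models performed worse in scoring consistency and bias, particularly iFLYTEK Spark Large Cognitive Model 3.0, with an MAER of 0.2828, an RMSE of 3.0286, and an ICC of only 0.217. Overall, LLMs can draw on their language understanding and logical reasoning abilities to score subjective questions and provide detailed scoring explanations, which helps improve students' learning efficiency and self-assessment ability. Compared with traditional manual scoring, LLMs offer higher efficiency and cost-effectiveness in scoring subjective questions. The study offers a new perspective and method for applying advanced models such as ChatGPT in education, and a reference for the future development and application of artificial intelligence in education.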

English title: Research on Intelligent Scoring of Subjective Questions in Pharmacology Exams Based on Large Language Models

English abstract: This article explores the application of large language models (LLMs) in the intelligent scoring of subjective questions in pharmacology. Five LLMs, namely ChatGPT 4.0, Claude 2, iFLYTEK Spark Large Cognitive Model 3.0, ChatGLM 3.0, and ERNIE Bot 3.5, were selected to score short-text subjective pharmacology questions using a variety of scoring standards and prompt engineering techniques. The results showed that ChatGPT 4.0 performed best, with a mean absolute error rate (MAER) and root mean square error (RMSE) of 0.0517 and 1.0339, respectively, and an intraclass correlation coefficient (ICC) of 0.936, indicating a high level of consistency and accuracy in its scoring. Claude 2 followed closely, with an MAER and RMSE of 0.0724 and 1.2999, respectively, and an ICC of 0.893, demonstrating good scoring performance. The other models performed poorly in terms of scoring consistency and bias, especially iFLYTEK Spark Large Cognitive Model 3.0, with an MAER and RMSE of 0.2828 and 3.0286, respectively, and an ICC of only 0.217. Overall, LLMs can effectively use their language comprehension and logical reasoning abilities to score subjective questions and provide detailed scoring analysis, which helps to improve students' learning efficiency and self-evaluation ability. Compared with traditional manual scoring, LLMs offer higher efficiency and cost-effectiveness in the intelligent scoring of subjective questions. This study provides a new perspective and method for applying advanced models such as ChatGPT in education, and a reference for the future development and application of artificial intelligence in education.
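
The abstract reports three agreement metrics (MAER, RMSE, ICC) without giving their formulas. The sketch below shows one plausible way to compute them for a model's scores against human reference scores. Several things here are assumptions, not the authors' method: MAER is taken to be the mean absolute error divided by the full marks of the question, the ICC is taken to be the two-way random-effects, absolute-agreement, single-rater form ICC(2,1), and the function names and the five sample scores are hypothetical.

```python
import math

def maer(model_scores, human_scores, full_marks):
    # Assumed definition: mean |model - human|, normalized by the full marks.
    errors = [abs(m - h) for m, h in zip(model_scores, human_scores)]
    return sum(errors) / len(errors) / full_marks

def rmse(model_scores, human_scores):
    # Root mean square error between model and human scores.
    squared = [(m - h) ** 2 for m, h in zip(model_scores, human_scores)]
    return math.sqrt(sum(squared) / len(squared))

def icc_2_1(model_scores, human_scores):
    # ICC(2,1) for two raters (model and human); the ICC form is an assumption.
    n, k = len(model_scores), 2
    rows = list(zip(model_scores, human_scores))
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(model_scores) / n, sum(human_scores) / n]
    ss_rows = k * sum((rm - grand) ** 2 for rm in row_means)
    ss_cols = n * sum((cm - grand) ** 2 for cm in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-answer mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical scores for five answers to a 10-point question.
human = [8, 6, 9, 4, 7]
model = [7, 6, 10, 5, 7]

print(f"MAER = {maer(model, human, full_marks=10):.4f}")
print(f"RMSE = {rmse(model, human):.4f}")
print(f"ICC(2,1) = {icc_2_1(model, human):.4f}")
```

Under these assumptions, a lower MAER and RMSE and an ICC closer to 1 indicate closer agreement with the human graders, which matches how the abstract ranks the five models.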

English keywords:

artificial intelligence; large language models; intelligent scoring of subjective questions; pharmacology; prompt engineering

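Authors:

Xiangba Zhuoma (向巴卓玛), Wang Zhenzhen (王珍珍), Chang Hongsheng (畅洪昇), Zhao Yansong (赵岩松), Liao Guolong (廖国龙), Ma Xingguang (马星光)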

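Author affiliations:

School of Management, Beijing University of Chinese Medicine, Beijing 102488

School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing 102488

School of Traditional Chinese Medicine, Beijing University of Chinese Medicine, Beijing 102488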

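Chinese keywords:

人工智能 (artificial intelligence); 大型语言模型 (large language models); 主观题智能评分 (intelligent scoring of subjective questions); 药理学 (pharmacology); 提示工程 (prompt engineering)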

Publication year:

2024

DOI:

10.13566/j.cnki.cmet.cn61-1317/g4.202405006

China Medical Education Technology (中国医学教育技术)

Xi'an Jiaotong University

Impact factor: 1.087

ISSN: 1004-5287

Year, volume (issue): 2024, 38(5)