Research on Large Language Model Evaluation for the Generation Task of Natural Language Processing in Classical Chinese

The frequent release of large language models (LLMs) presents both opportunities and challenges for their evaluation. While evaluation systems for general-domain LLMs are maturing, evaluation in vertical domains is still in its early stages. Taking classical Chinese as the entry point, this study constructs a set of evaluation tasks for the ancient-book domain along two dimensions, language and knowledge, and evaluates 13 general-domain LLMs that rank highly on current major leaderboards. The results show that ERNIE-Bot is far ahead of the other models in ancient-book domain knowledge, while GPT-4 demonstrates the strongest language capabilities; among open-source models, the ChatGLM series performs best. By constructing evaluation tasks and datasets, the study establishes a set of evaluation standards for LLMs in the ancient-book domain, providing a reference for evaluating LLM performance in this domain and a basis for selecting base models when training future ancient-book LLMs.

Large language model; Generative tasks; Large model evaluation; Ancient books; Domain knowledge

朱丹浩、赵志枭、张一平、孙光耀、刘畅、胡蝶、王东波

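Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing 210031

College of Information Management, Nanjing Agricultural University, Nanjing 210095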

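Major Program of the National Social Science Fund of China (21&ZD331); Jiangsu Province College Students' Practice, Innovation and Entrepreneurship Training Program (202210329046Y)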

2024

Journal of Information Resources Management
Society of China University Journals; Wuhan University

CSSCI; CHSSCD
Impact factor: 0.885
ISSN: 2095-2171
Year, volume (issue): 2024, 14(5)