Evaluating Large Language Models: A Survey of Research Progress
Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks, offering a potential path toward general language intelligence. However, their expanding range of applications makes accurate and comprehensive evaluation an urgent problem. Existing evaluation benchmarks and methods still have many shortcomings, such as poorly designed evaluation tasks and uninterpretable evaluation results. With growing attention to other capabilities and properties such as robustness and fairness, the demand for more holistic and interpretable evaluation methods is increasingly pressing. This paper analyzes the current landscape and challenges of LLM evaluation, summarizes existing evaluation paradigms and their limitations, introduces evaluation metrics and methodologies relevant to LLMs, and discusses emerging directions in LLM evaluation.
Keywords: natural language processing; large language models; model evaluation