
Performance Evaluation of Chinese General-Purpose Large Language Models in the Humanities and Social Sciences

[Purpose/Significance] Taking the humanities and social sciences as its starting point, this paper compares model performance in the field from two aspects: basic domain knowledge and academic texts. It aims to provide a systematic large language model evaluation benchmark for the humanities and social sciences, as a reference for researchers in related fields. [Method/Process] Seven evaluation tasks related to the humanities and social sciences were designed, each with corresponding metrics. On this basis, open-source general-purpose Chinese large language models with strong current performance were selected. The models were run locally and completed the domain-specific tasks in question-and-answer form, and their performance in the humanities and social sciences was quantified with the selected metrics. [Result/Conclusion] Among the open-source models selected, for both base models and dialog models, Qwen performs best, Baichuan2 follows closely, InternLM comes next, and Atom performs worst. In addition, dialog models outperform their base counterparts in most cases.

Keywords: humanities and social sciences; large model evaluation; domain knowledge; academic texts

Zhao Zhixiao (赵志枭), Hu Die (胡蝶), Liu Chang (刘畅), Shen Si (沈思), Wang Dongbo (王东波)


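College of Information Management, Nanjing Agricultural University, Nanjing 210095

Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095

School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094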


Funding: Jiangsu Provincial Social Science Fund, Later-Stage Funding Project (No. 23HQBO63)

2024

Library and Information Service (图书情报工作)
National Science Library, Chinese Academy of Sciences


Indexed in: CSTPCD, CSSCI, CHSSCD, PKU Core Journals
Impact factor: 2.203
ISSN: 0252-3116
Year, volume (issue): 2024, 68(13)