Abstract
[Purpose/significance] This paper presents an experimental evaluation of the correctness of Chinese question answering by large language models, aiming to provide guidance for Chinese-speaking users of these models. [Method/process] Across six domains (science and technology, education, medicine, daily life, tourism and food, and philosophy and culture), three types of questions were designed: common-sense, professional, and open-ended, with 20 questions of each type per domain, for a total of 360 questions. The questions were posed to ChatGPT 3.5, Claude 1.0, and Wenxin Yiyan 2.1, and the correctness of each answer was evaluated manually. The evaluation results were then aggregated for a multi-faceted comparative analysis of correctness. [Result/conclusion] The experimental analysis indicates that the scale and quality of Chinese corpus data, as well as the parameter scale of the large language model, are important factors influencing the correctness of Chinese question answering by large language models.