The rapid development of large language models (LLMs) presents both opportunities and challenges for their evaluation. While evaluation systems for general-domain LLMs are becoming more refined, assessments in specialized fields remain at an early stage. This study evaluates LLMs in the domain of classical Chinese, designing a series of tasks along two key dimensions: language and knowledge. Thirteen leading general-domain LLMs were selected for evaluation using major benchmarks. The results show that ERNIE-Bot excels in domain-specific knowledge, while GPT-4 demonstrates the strongest language capabilities. Among open-source models, the ChatGLM series exhibits the best overall performance. By developing tailored evaluation tasks and datasets, this study provides a set of standards for evaluating LLMs in the classical Chinese domain, offering valuable reference points for future assessments. The findings also provide a foundation for selecting base models in future domain-specific LLM training.
Keywords
Large language model / Generative tasks / Large model evaluation / Ancient books / Domain knowledge