Research on Evaluating Large Language Models for Generation Tasks in Classical Chinese Natural Language Processing
The rapid development of large language models (LLMs) presents both opportunities and challenges for their evaluation. While evaluation systems for general-domain LLMs are becoming increasingly refined, assessments in specialized fields remain at an early stage. This study evaluates LLMs in the domain of classical Chinese, designing a series of tasks along two key dimensions: language and knowledge. Thirteen leading general-domain LLMs were selected for evaluation on major benchmarks. The results show that ERNIE-Bot excels in domain-specific knowledge, while GPT-4 demonstrates the strongest language capabilities. Among open-source models, the ChatGLM series exhibits the best overall performance. By developing tailored evaluation tasks and datasets, this study provides a set of standards for evaluating LLMs in the classical Chinese domain, offering valuable reference points for future assessments. The findings also provide a foundation for selecting base models for future domain-specific LLM training.
Large language model; Generative tasks; Large model evaluation; Ancient books; Domain knowledge