With the widespread application of large language models, their evaluation has become crucial. Beyond performance on downstream tasks, potential risks must also be assessed, such as the possibility that large language models violate human values or are induced by malicious inputs to trigger security issues. This paper analyzes the commonalities and differences among traditional software, deep learning systems, and large language model systems. It surveys existing work along four dimensions, namely functional evaluation, performance evaluation, alignment evaluation, and security evaluation of large language models, and introduces evaluation criteria for large models. Finally, based on existing research and on open opportunities and challenges, it discusses the directions and development prospects of large language model evaluation technology.
Keywords
large language models; functional evaluation; performance evaluation; alignment evaluation; security evaluation