
Survey on Large Model Red Teaming

Large model red teaming subjects a large language model (LLM) to adversarial testing in order to induce harmful outputs, thereby exposing vulnerabilities in the model and improving its robustness. As a frontier topic in the large-model field, it has attracted wide attention from both academia and industry in recent years. Researchers have proposed numerous red-teaming approaches and have made some progress in model alignment. However, constrained by the scarcity of red-teaming data for large models and the vagueness of existing evaluation standards, most studies remain limited to evaluations in specific scenarios. This paper first starts from definitions related to large model safety and describes the various risks involved; it then explains the importance of large model red teaming and its main categories, surveys and analyzes the development of related red-teaming techniques, and introduces existing datasets and evaluation metrics; finally, it outlines and summarizes future research trends in large model red teaming.
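As a rough illustration of the attack-probe-judge cycle described in the abstract, the sketch below assumes three hypothetical callables (attacker, target, and harmfulness) that do not correspond to any specific method in the surveyed literature: an attacker crafts adversarial prompts from seed instructions, the target model is queried, and a harmfulness judge decides which cases count as discovered vulnerabilities. The collected findings could, in principle, seed further attack rounds or alignment fine-tuning.

```python
# A minimal, illustrative red-teaming loop. The callables `attacker`, `target`,
# and `harmfulness` are hypothetical placeholders, not APIs from any real system:
# `attacker(seed)` produces an adversarial prompt, `target(prompt)` returns the
# model's response, and `harmfulness(prompt, response)` returns a score in [0, 1].

from typing import Callable, List, Tuple

def red_team_loop(
    attacker: Callable[[str], str],
    target: Callable[[str], str],
    harmfulness: Callable[[str, str], float],
    seeds: List[str],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Collect (prompt, response, score) triples whose responses are judged harmful."""
    findings = []
    for seed in seeds:
        prompt = attacker(seed)                 # craft an adversarial test case
        response = target(prompt)               # query the model under test
        score = harmfulness(prompt, response)   # judge the output
        if score >= threshold:                  # keep cases exposing a vulnerability
            findings.append((prompt, response, score))
    return findings
```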

Red team; LLM safety; Reinforcement learning; Language model; Jailbreak

Bao Zepeng, Qian Tieyun (包泽芃、钱铁云)


School of Computer Science, Wuhan University, Wuhan 430072, China


Computer Science (计算机科学)

Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)

Peking University Core Journal (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2025, 52(1)