
Survey on Large Model Red Teaming

Large model red teaming subjects a large language model (LLM) to adversarial testing in order to induce harmful outputs, thereby exposing vulnerabilities in the model and improving its robustness. As a frontier topic in the large-model field, it has attracted wide attention from both academia and industry in recent years. Researchers have proposed numerous red-teaming approaches and have made some progress in model alignment. However, constrained by the scarcity of red-teaming data for large models and the vagueness of existing evaluation standards, most studies remain limited to evaluations in specific scenarios. This paper first starts from definitions related to large model safety and describes the various risks involved; it then explains the importance of large model red teaming and its main categories, surveys and analyzes the development of related red-teaming techniques, and introduces existing datasets and evaluation metrics; finally, it outlines and summarizes future research trends in large model red teaming.
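As a rough illustration of the attack-probe-judge cycle described in the abstract, the sketch below assumes three hypothetical callables (attacker, target, and harmfulness) that do not correspond to any specific method in the surveyed literature: an attacker crafts adversarial prompts from seed instructions, the target model is queried, and a harmfulness judge decides which cases count as discovered vulnerabilities. The collected findings could, in principle, seed further attack rounds or alignment fine-tuning.

```python
# A minimal, illustrative red-teaming loop. The callables `attacker`, `target`,
# and `harmfulness` are hypothetical placeholders, not APIs from any real system:
# `attacker(seed)` produces an adversarial prompt, `target(prompt)` returns the
# model's response, and `harmfulness(prompt, response)` returns a score in [0, 1].

from typing import Callable, List, Tuple

def red_team_loop(
    attacker: Callable[[str], str],
    target: Callable[[str], str],
    harmfulness: Callable[[str, str], float],
    seeds: List[str],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Collect (prompt, response, score) triples whose responses are judged harmful."""
    findings = []
    for seed in seeds:
        prompt = attacker(seed)                 # craft an adversarial test case
        response = target(prompt)               # query the model under test
        score = harmfulness(prompt, response)   # judge the output
        if score >= threshold:                  # keep cases exposing a vulnerability
            findings.append((prompt, response, score))
    return findings
```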

Red team; LLM safety; Reinforcement learning; Language model; Jailbreak

Bao Zepeng, Qian Tieyun (包泽芃、钱铁云)


School of Computer Science, Wuhan University, Wuhan 430072, China


Computer Science (计算机科学)

Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)

Peking University Core Journal (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2025, 52(1)