Can LLMs deeply detect complex malicious queries? A framework for jailbreaking via obfuscating intent

This paper investigates a potential security flaw in large language models (LLMs), specifically their capacity to identify malicious intent within complex or ambiguous queries. We find that LLMs can overlook the malicious nature of heavily obfuscated requests even when the malicious text in those queries is left unaltered, exposing a significant weakness in their content-analysis mechanisms. Specifically, we identify and analyze two facets of this vulnerability: (i) LLMs' reduced ability to perceive maliciousness when parsing highly obscured queries, and (ii) LLMs' inability to discern malicious intent in queries whose malicious content has been deliberately rewritten to increase ambiguity. To illustrate and address this problem, we propose a theoretical framework and analytical strategy and introduce a novel black-box jailbreak attack technique called IntentObfuscator. This technique exploits the identified vulnerability by concealing the genuine intention behind user prompts, compelling LLMs to inadvertently produce restricted content and circumvent their built-in content-safety protocols. We elaborate on two specific applications within this framework, "Obscure Intention" and "Create Ambiguity," which manipulate the complexity and ambiguity of queries to evade the detection of malicious intent. We empirically confirm the efficacy of IntentObfuscator across various models, including ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which has 100 million weekly active users, yielded a success rate of 83.65%. We further validate our approach across a range of sensitive content categories, including graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal techniques, underscoring the significance of our findings for refining "Red Team" tactics against LLM content-security frameworks.

Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang

Institute of Information Engineering, Chinese Academy of Sciences, No. 19 Shucun Rd., Haidian District, 100085 Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, No. 19 Yuquan Rd., Shijingshan District, 100049 Beijing, China

School of Cyber Security, University of Chinese Academy of Sciences, No. 19 Yuquan Rd., Shijingshan District, 100049 Beijing, China; China Electronics Standardization Institute, No. 1 Andingmen East Street, Dongcheng District, 100007 Beijing, China

Institute of Information Engineering, Chinese Academy of Sciences, No. 19 Shucun Rd., Haidian District, 100085 Beijing, China

Security Lab, JD Cloud, Beijing, China

2025

The Computer Journal

ISSN: 0010-4620
Year, Volume (Issue): 2025, 68(5)