Universal detection method for mitigating adversarial text attacks through token loss information
Objective In recent years, adversarial text attacks have become a prominent research problem in natural language processing (NLP) security. An adversarial text attack is a malicious attack that misleads a text classifier by modifying the original text to craft an adversarial example. Attacks of this kind can be mounted against NLP tasks such as smishing (SMS phishing) detection, ad-sale filtering, malicious-comment screening, and opinion detection, misleading the corresponding text classifiers. An ideal text adversarial example must combine an imperceptible perturbation with unaffected syntactic and semantic correctness, which significantly increases the difficulty of the attack. Because text is discrete, adversarial attack methods from the image domain cannot be applied to text directly. Existing text attacks fall into two dominant groups: instance-based attacks and learning-based universal (non-instance) attacks. Instance-based attacks generate a specific adversarial example for each input. Among learning-based universal attacks, the universal trigger (UniTrigger) is the most representative: it reduces the accuracy of the target model to near zero by learning a fixed attack sequence that is concatenated to any input. Existing detection methods mainly address instance-based attacks and have seldom been studied against UniTrigger attacks. Inspired by logit-based adversarial detectors in computer vision, we propose a UniTrigger defense method based on token loss weight information.
Method In our proposed loss-based detection of universal adversarial attacks (LBD-UAA), a pretrained model first transforms the token sequence into a word-vector sequence to obtain its representation in the semantic space. We then remove the token at each target position in turn and feed the remaining token sequence into the model. We use the token loss value (TLV) metric to obtain the weight proportion of each token and build a full-sample sequence lookup table. Under the TLV metric, the token sequences of non-UniTrigger inputs fluctuate far less than adversarial examples do; prior knowledge suggests that such fluctuations are the result of the adversarial perturbations generated by UniTrigger. We therefore derive the numerical differences between the TLV full-sequence lookup table and both clean and adversarial samples, and use these differences as the data representation of each sample. On this basis, we set a differential threshold that bounds the magnitude of variation: any input whose variation exceeds the threshold is identified as an adversarial instance.
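The Python sketch below illustrates one way the TLV pipeline described above could be realized with a Hugging Face sequence classifier. The model checkpoint, the leave-one-out reading of the per-token loss shift, and the threshold value are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of TLV-based detection, assuming a Hugging Face classifier.
# The checkpoint name, the leave-one-out definition of TLV, and the
# threshold are illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "textattack/bert-base-uncased-SST-2"  # assumed victim model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

@torch.no_grad()
def sequence_loss(ids: torch.Tensor, label: int) -> float:
    """Cross-entropy loss of the classifier on one token-id sequence."""
    logits = model(ids.unsqueeze(0)).logits
    return F.cross_entropy(logits, torch.tensor([label])).item()

@torch.no_grad()
def token_loss_values(text: str) -> torch.Tensor:
    """TLV profile: the loss shift caused by deleting each token in turn,
    normalized so each entry is that token's weight proportion."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    label = int(model(ids.unsqueeze(0)).logits.argmax())
    base = sequence_loss(ids, label)
    shifts = []
    for i in range(1, len(ids) - 1):                 # skip [CLS] / [SEP]
        reduced = torch.cat([ids[:i], ids[i + 1:]])  # drop token i
        shifts.append(abs(sequence_loss(reduced, label) - base))
    tlv = torch.tensor(shifts)
    return tlv / (tlv.sum() + 1e-12)

def is_adversarial(text: str, threshold: float = 0.1) -> bool:
    """Flag an input whose TLV profile fluctuates beyond the differential
    threshold; the threshold should be calibrated on clean samples."""
    tlv = token_loss_values(text)
    return (tlv.max() - tlv.median()).item() > threshold
```

Because a UniTrigger sequence concentrates an outsized share of the loss weight on a few fixed positions, its TLV profile spikes where clean text stays comparatively flat, which is what the differential threshold is meant to catch.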
Result To demonstrate the efficacy of the proposed approach, we conducted performance evaluations on four widely used text classification datasets: SST-2, MR, AG, and Yelp. SST-2 and MR are short-text datasets, while AG and Yelp cover domain-specific news articles and website reviews, making them long-text datasets. First, we generated the corresponding trigger sequences by attacking specific categories of the four datasets through the UniTrigger attack framework. Subsequently, we mixed the adversarial samples evenly with clean samples and fed them into LBD-UAA for adversarial detection. Experimental results across the four datasets show that the method achieves a maximum detection rate of 97.17%, with a recall of up to 100%. Compared with four other detection methods, our approach performs best overall, with a true positive rate of 99.6% and a false positive rate of 6.8%. Even on the challenging MR dataset, it retains a 96.2% detection rate and outperforms the state-of-the-art approaches. In the generalization experiments, we detected adversarial samples generated by three attack methods from TextBugger and by the PWWS attack. Results indicate that LBD-UAA achieves strong detection performance across these four word-level attack methods, with average true positive rates of 86.77%, 90.98%, 90.56%, and 93.89%, respectively. This finding demonstrates that LBD-UAA can also discriminate instance-specific adversarial samples, showing robust generalization. Moreover, we successfully reduced the false positive rate of short-sample detection to 50% by using the proposed differential threshold setting.
Conclusion In this paper, following the design of adversarial detection tasks in the image domain, we introduce LBD-UAA, a detection method that leverages token weight information from the perspective of token loss, as measured by TLV; it is, to our knowledge, the first such detector in the universal text adversarial domain and the first to detect UniTrigger attacks through token loss weights. The method is tailored to learning-based universal adversarial attacks, and its defensive capability has been evaluated on sentiment analysis and text classification models across two short-text and two long-text datasets. During the experiments, we observed that the numerical feedback from TLV can identify the specific positions where perturbation sequences were inserted into some samples. Future work will focus on using the proposed detection method to filter out high-risk samples, potentially allowing adversarial samples to be restored. We believe that LBD-UAA opens up additional possibilities for future defenses against UniTrigger-style and other text adversarial strategies and provides a more effective reference mechanism for adversarial text detection.
Keywords: adversarial text examples; universal triggers; text classification; deep learning; adversarial detection