Adversarial Patch-Based Targeted Attacks on Deep Neural Network Explanations
The widespread use of deep neural networks in high-risk domains has increased the need to interpret black-box models. Although various explanation methods have been proposed to aid trustworthy decision-making, these methods are often assumed to operate in secure environments, yet in practice they may be subject to attack, making their trustworthiness and robustness a new challenge. This study proposes a method based on adversarial patches to attack neural network interpretation algorithms, achieving targeted attacks on interpretation results without altering the network's predicted output category. The method applies to both single-target and multi-target attacks, and it exhibits transferability across different interpretation algorithms. To better evaluate the robustness of interpreters, the Mean Position Importance (MPI) metric is proposed, and quantitative evaluations are conducted using this and three previously proposed metrics. Experimental results show that the proposed method can significantly mislead saliency-based heatmap interpreters such as Grad-CAM. Specifically, the MPI increased by 5-10 times after the interpretation attack, while the IOU, SC, and HC decreased by more than 30%. Our method can achieve targeted attacks on deep neural network interpreters and can be used to test the robustness of interpreters, thus enhancing the trustworthy decision-making of deep neural networks.
Keywords: deep neural network; explainable AI; image classification; interpretation attacks
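As a rough illustration of the kind of attack described in the abstract, the sketch below optimizes an adversarial patch so that a differentiable Grad-CAM heatmap moves toward an attacker-chosen target region while a cross-entropy term keeps the network's original predicted class. This is not the paper's implementation: the loss form, the 0.1 weighting, the patch size and placement, and the ResNet-18/layer4 Grad-CAM setup are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Illustrative sketch (assumed setup, not the paper's code): ResNet-18
# backbone, Grad-CAM taken at layer4, patch pasted in the bottom-right.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the patch is optimized

feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))

def grad_cam(x, cls):
    """Differentiable Grad-CAM heatmap for class `cls` (assumed variant)."""
    logits = model(x)
    score = logits[0, cls]
    A = feats['a']                                     # layer4 activations
    g = torch.autograd.grad(score, A, create_graph=True)[0]
    w = g.mean(dim=(2, 3), keepdim=True)               # channel weights
    cam = F.relu((w * A).sum(dim=1, keepdim=True))     # weighted sum + ReLU
    cam = F.interpolate(cam, x.shape[-2:], mode='bilinear', align_corners=False)
    return cam / (cam.max() + 1e-8), logits

x = torch.rand(1, 3, 224, 224)          # placeholder for a real clean image
y = model(x).argmax(1).item()           # prediction to preserve
target = torch.zeros(1, 1, 224, 224)    # attacker-chosen target heatmap:
target[..., 20:80, 20:80] = 1.0         # force saliency into the top-left

patch = torch.zeros(1, 3, 40, 40, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)

for step in range(200):
    x_adv = x.clone()
    x_adv[..., 180:220, 180:220] = patch.clamp(0, 1)   # paste the patch
    cam, logits = grad_cam(x_adv, y)
    # steer the explanation toward `target` while keeping class y;
    # the 0.1 trade-off weight is an arbitrary choice for this sketch
    loss = F.mse_loss(cam, target) \
         + 0.1 * F.cross_entropy(logits, torch.tensor([y]))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that optimizing through Grad-CAM requires second-order gradients (`create_graph=True`), since the heatmap itself is built from gradients of the class score; the classification term is what keeps the predicted category unchanged while only the explanation is redirected.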