Adversarial Patch-Based Targeted Attacks on Deep Neural Network Explanations
The widespread use of deep neural networks in high-risk domains has increased the need to interpret black-box models. Although various explanation methods have been proposed to aid trustworthy decision-making, these methods are often assumed to operate in secure environments, yet in practice they may be subject to attack, making their trustworthiness and robustness a new challenge. This study proposes a method based on adversarial patches to attack neural network interpretation algorithms, achieving targeted attacks on interpretation results without altering the network's predicted output category. The method applies to both single-target and multi-target attacks, and it exhibits transferability across different interpretation algorithms. To better evaluate the robustness of interpreters, the Mean Position Importance (MPI) metric is proposed, and quantitative evaluations are conducted using this and three previously proposed metrics. Experimental results show that the proposed method can significantly mislead saliency-based heatmap interpreters such as Grad-CAM. Specifically, the MPI increased by 5-10 times after the interpretation attack, while the IOU, SC, and HC decreased by more than 30%. Our method can achieve targeted attacks on deep neural network interpreters and can be used to test the robustness of interpreters, thus enhancing the trustworthy decision-making of deep neural networks.
Keywords: deep neural network; explainable AI; image classification; interpretation attacks
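As a rough illustration of the kind of attack described in the abstract, the sketch below optimizes an adversarial patch so that a differentiable Grad-CAM heatmap moves toward an attacker-chosen target region while a cross-entropy term keeps the network's original predicted class. This is not the paper's implementation: the loss form, the 0.1 weighting, the patch size and placement, and the ResNet-18/layer4 Grad-CAM setup are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Illustrative sketch (assumed setup, not the paper's code): ResNet-18
# backbone, Grad-CAM taken at layer4, patch pasted in the bottom-right.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the patch is optimized

feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))

def grad_cam(x, cls):
    """Differentiable Grad-CAM heatmap for class `cls` (assumed variant)."""
    logits = model(x)
    score = logits[0, cls]
    A = feats['a']                                     # layer4 activations
    g = torch.autograd.grad(score, A, create_graph=True)[0]
    w = g.mean(dim=(2, 3), keepdim=True)               # channel weights
    cam = F.relu((w * A).sum(dim=1, keepdim=True))     # weighted sum + ReLU
    cam = F.interpolate(cam, x.shape[-2:], mode='bilinear', align_corners=False)
    return cam / (cam.max() + 1e-8), logits

x = torch.rand(1, 3, 224, 224)          # placeholder for a real clean image
y = model(x).argmax(1).item()           # prediction to preserve
target = torch.zeros(1, 1, 224, 224)    # attacker-chosen target heatmap:
target[..., 20:80, 20:80] = 1.0         # force saliency into the top-left

patch = torch.zeros(1, 3, 40, 40, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)

for step in range(200):
    x_adv = x.clone()
    x_adv[..., 180:220, 180:220] = patch.clamp(0, 1)   # paste the patch
    cam, logits = grad_cam(x_adv, y)
    # steer the explanation toward `target` while keeping class y;
    # the 0.1 trade-off weight is an arbitrary choice for this sketch
    loss = F.mse_loss(cam, target) \
         + 0.1 * F.cross_entropy(logits, torch.tensor([y]))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that optimizing through Grad-CAM requires second-order gradients (`create_graph=True`), since the heatmap itself is built from gradients of the class score; the classification term is what keeps the predicted category unchanged while only the explanation is redirected.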