TOWARDS COUNTERACTING ADVERSARIAL PERTURBATIONS TO RESIST ADVERSARIAL EXAMPLES

Abstract

Studies show that neural networks are susceptible to adversarial attacks, which poses a potential threat to neural network-based artificial intelligence systems. We observe that adding to an adversarial example small perturbations generated for non-predicted class labels increases the probability that the network outputs the correct result. Based on this observation, we propose a method of counteracting adversarial perturbations to resist adversarial examples. In our method, we randomly select a number of class labels and generate a small perturbation for each selected label. The generated perturbations are added together and then clamped onto a specified space; the resulting perturbation is added to the adversarial example to counteract the adversarial perturbation contained in the example. The proposed method is applied at inference time and requires neither retraining nor fine-tuning of the model. We validate the proposed method on CIFAR-10 and CIFAR-100. The experimental results demonstrate that our method effectively improves the defense performance of the baseline methods, especially against strong adversarial examples generated with more iterations.
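The defense described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the tiny softmax linear classifier stands in for the defended network, and the function names (`counteract`, `input_grad`) and hyperparameters (`num_labels`, `eta`, `eps`) are hypothetical choices, not values from the paper. Each selected label contributes a small targeted step (one that decreases the loss toward that label), the steps are summed, and the sum is clamped onto the eps-ball before being added to the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny softmax classifier standing in for the defended network.
NUM_CLASSES, DIM = 10, 32
W = rng.normal(size=(NUM_CLASSES, DIM))
b = np.zeros(NUM_CLASSES)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def input_grad(x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x for `label`."""
    p = softmax(W @ x + b)
    onehot = np.eye(NUM_CLASSES)[label]
    return W.T @ (p - onehot)

def counteract(x_adv, num_labels=4, eta=0.01, eps=0.03):
    """Sketch of the defense: generate small perturbations targeted at
    randomly selected non-predicted labels, sum them, clamp the sum onto
    the eps-ball, and add it to the (possibly adversarial) input."""
    pred = int(np.argmax(W @ x_adv + b))
    candidates = [c for c in range(NUM_CLASSES) if c != pred]
    selected = rng.choice(candidates, size=num_labels, replace=False)
    total = np.zeros_like(x_adv)
    for label in selected:
        # Small step that *decreases* the loss toward the selected label.
        total -= eta * np.sign(input_grad(x_adv, label))
    total = np.clip(total, -eps, eps)  # clamp onto the specified space
    return x_adv + total

x_adv = rng.normal(size=DIM)
x_def = counteract(x_adv)
print(np.abs(x_def - x_adv).max() <= 0.03)  # counteracting perturbation stays in the eps-ball
```

Because the summed perturbation is clamped before it is applied, the defense never moves the input outside a small neighborhood of the adversarial example, regardless of how many labels are selected.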

1. INTRODUCTION

Deep neural networks (DNNs) have become the dominant approach for various tasks including image understanding, natural language processing and speech recognition (He et al., 2016; Devlin et al., 2018; Park et al., 2018). However, recent studies demonstrate that neural networks are vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015). That is, these models make incorrect predictions with high confidence for inputs that are only slightly different from correctly predicted examples. This reveals a potential threat to neural network-based artificial intelligence systems, many of which have been widely deployed in real-world applications.

The adversarial vulnerability of neural networks reveals fundamental blind spots in the learning algorithms. Even with advanced learning and regularization techniques, neural networks are not learning the true underlying distribution of the training data, although they can obtain extraordinary performance on test sets. This phenomenon is now attracting much research attention, and an increasing number of studies attempt to explain neural networks' adversarial vulnerability and to develop methods for resisting adversarial examples (Madry et al., 2018; Zhang et al., 2020; Pang et al., 2020). While much progress has been made, most existing studies remain preliminary. Because it is difficult to construct a theoretical model of the process that generates adversarial perturbations, defending against adversarial attacks remains a challenging task.

Existing methods of resisting adversarial perturbations perform defense either at training time or at inference time. Training-time defense methods attempt to increase model capacity to improve adversarial robustness. One of the most commonly used methods is adversarial training (Szegedy et al., 2014), in which a mixture of adversarial and clean examples is used to train the neural network.
The adversarial training method can be seen as minimizing the worst-case loss when the training example is perturbed by an adversary (Goodfellow et al., 2015).
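To make the worst-case view concrete, the following sketch generates an adversarial example with the fast gradient sign method (FGSM) of Goodfellow et al. (2015) and forms a mixed clean/adversarial training loss. The logistic-regression "network", the mixing weight `alpha`, and the function names are illustrative assumptions, not the setup of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical logistic-regression "network" for illustration only.
DIM = 16
w = rng.normal(size=DIM)

def loss(x, y, w):
    # Binary cross-entropy with a sigmoid output.
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def fgsm(x, y, w, eps=0.1):
    """Fast gradient sign method: one step toward the worst-case loss."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    grad_x = (p - y) * w  # dL/dx for the sigmoid cross-entropy
    return x + eps * np.sign(grad_x)

def mixed_loss(x, y, w, alpha=0.5, eps=0.1):
    """Adversarial-training objective: mixture of clean and adversarial loss."""
    return alpha * loss(x, y, w) + (1 - alpha) * loss(fgsm(x, y, w, eps), y, w)

x, y = rng.normal(size=DIM), 1
print(loss(x, y, w) <= loss(fgsm(x, y, w), y, w))  # FGSM does not decrease the loss
```

Minimizing `mixed_loss` over the model parameters is the training-time defense described above: the adversary (here, FGSM) perturbs each example toward higher loss, and the model is trained to do well on both the clean and the perturbed version.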

Adversarial training requires an adversary to generate adversarial examples during the training procedure, which can significantly increase training time. It also results in reduced performance on clean examples. Lamb et al. (2019) recently introduced interpolated adversarial training (IAT), which incorporates interpolation-based training into the adversarial training framework. The IAT method helps to improve performance on clean examples while maintaining adversarial robustness.
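The interpolation step underlying IAT can be sketched as follows. This is only an illustration of mixup-style interpolation applied to clean and adversarial batches, assuming one-hot labels; the `make_adv` adversary is a random-sign placeholder (a real implementation would use a gradient-based attack), and `alpha` is a hypothetical Beta-distribution parameter.

```python
import numpy as np

rng = np.random.default_rng(2)

def mixup(x1, y1, x2, y2, alpha=1.0):
    """Mixup interpolation used by interpolation-based training:
    a convex combination of two examples and their one-hot labels."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def make_adv(x, eps=0.1):
    # Placeholder adversary; a real one would follow the loss gradient.
    return x + eps * np.sign(rng.normal(size=x.shape))

# IAT (sketch): one loss term is computed on interpolated clean examples
# and another on interpolated adversarial examples.
x1, x2 = rng.normal(size=8), rng.normal(size=8)
y1, y2 = np.eye(10)[3], np.eye(10)[7]
x_mix, y_mix = mixup(x1, y1, x2, y2)
xa_mix, ya_mix = mixup(make_adv(x1), y1, make_adv(x2), y2)
print(np.isclose(y_mix.sum(), 1.0))  # interpolated label remains a distribution
```

Training on both `(x_mix, y_mix)` and `(xa_mix, ya_mix)` is what lets IAT retain clean-example accuracy: the interpolated clean term regularizes the model even as the adversarial term enforces robustness.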

