TOWARDS COUNTERACTING ADVERSARIAL PERTURBATIONS TO RESIST ADVERSARIAL EXAMPLES

Abstract

Studies show that neural networks are susceptible to adversarial attacks. This exposes a potential threat to neural network-based artificial intelligence systems. We observe that the probability the neural network assigns to the correct class increases when small perturbations generated for non-predicted class labels are applied to adversarial examples. Based on this observation, we propose a method of counteracting adversarial perturbations to resist adversarial examples. In our method, we randomly select a number of class labels and generate small perturbations for these selected labels. The generated perturbations are added together and then clamped onto a specified space. The resulting perturbation is finally added to the adversarial example to counteract the adversarial perturbation contained in the example. The proposed method is applied at inference time and does not require retraining or finetuning the model. We validate the proposed method on CIFAR-10 and CIFAR-100. The experimental results demonstrate that our method effectively improves the defense performance of the baseline methods, especially against strong adversarial examples generated with more iterations.

1. INTRODUCTION

Deep neural networks (DNNs) have become the dominant approach for various tasks including image understanding, natural language processing and speech recognition (He et al., 2016; Devlin et al., 2018; Park et al., 2018). However, recent studies demonstrate that neural networks are vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015). That is, these models make incorrect predictions with high confidence for inputs that are only slightly different from correctly predicted examples. This reveals a potential threat to neural network-based artificial intelligence systems, many of which have been widely deployed in real-world applications.

The adversarial vulnerability of neural networks reveals fundamental blind spots in the learning algorithms. Even with advanced learning and regularization techniques, neural networks do not learn the true underlying distribution of the training data, although they can obtain extraordinary performance on test sets. This phenomenon is now attracting much research attention. There have been increasing studies attempting to explain neural networks' adversarial vulnerability and to develop methods of resisting adversarial examples (Madry et al., 2018; Zhang et al., 2020; Pang et al., 2020). While much progress has been made, most existing studies remain preliminary. Because it is difficult to construct a theoretical model of the process that generates adversarial perturbations, defending against adversarial attacks remains a challenging task.

As to inference time defense methods, the main idea is to transform the input such that it is no longer adversarial. Tabacof & Valle (2016) studied the use of random noise, such as Gaussian noise and heavy-tailed noise, to resist adversarial perturbations. Xie et al. (2018) proposed applying two randomization operations, i.e., random resizing and random zero padding, to inputs to improve adversarial robustness. Guo et al. (2018) investigated the use of random cropping and rescaling to mitigate adversarial perturbations. More recently, Pang et al. (2020) proposed the mixup inference method, which performs inference on the interpolation between the input and a randomly selected clean image (a minimal sketch is given below); the interpolation operation shrinks adversarial perturbations to some extent. Inference time defense methods can be directly applied to off-the-shelf network models without retraining or finetuning them, which makes them much more efficient than training time defense methods.

Though adversarial perturbations are not readily perceivable by a human observer, it has been suggested that adversarial examples lie outside the natural image manifold (Hu et al., 2019). Previous studies have suggested that adversarial vulnerability is caused by the locally unstable behavior of classifiers on data manifolds (Fawzi et al., 2016; Pang et al., 2018). Pang et al. (2020) also suggested that adversarial perturbations have a locality property and can be resisted by breaking this locality. Existing inference time defense methods mainly use stochastic transformations, such as mixup and random cropping and rescaling, to break the locality. In this research, we observe that applying small perturbations generated for non-predicted class labels to an adversarial example helps to counteract the adversarial effect. Motivated by this observation, we propose a method that employs small perturbations to counteract adversarial perturbations.
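For reference, the following is a minimal PyTorch-style sketch of the mixup inference idea described above: predictions are averaged over interpolations between the input and randomly sampled clean images. The interpolation ratio, number of samples and variable names are illustrative placeholders rather than the settings used by Pang et al. (2020).

```python
import torch

def mixup_inference(model, x, clean_pool, lam=0.6, num_samples=30):
    """Sketch of mixup inference: average softmax predictions over
    interpolations between the input x and randomly sampled clean images.

    clean_pool: tensor of clean images [N, C, H, W] to sample from.
    lam, num_samples: illustrative hyper-parameters (assumptions).
    """
    probs = 0.0
    for _ in range(num_samples):
        # sample one clean image per input in the batch
        idx = torch.randint(0, clean_pool.size(0), (x.size(0),), device=x.device)
        x_mix = lam * x + (1.0 - lam) * clean_pool[idx]   # interpolate input and clean image
        probs = probs + torch.softmax(model(x_mix), dim=1)
    return probs / num_samples                            # averaged prediction
```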
In the proposed method, we generate small perturbations using local first-order gradient information for a number of randomly selected class labels. These small perturbations are added together and projected onto a specified space before finally being applied to the adversarial example (a schematic sketch of this procedure is given at the end of this section). Our method can be used as a preliminary step before applying existing inference time defense methods. To the best of our knowledge, this is the first work on using local first-order gradient information to resist adversarial perturbations. Successful attack methods such as projected gradient descent (PGD) (Madry et al., 2018) typically use local gradients to obtain adversarial perturbations; compared to random transformations, using local gradients can therefore be more effective for resisting adversarial perturbations. We show through experiments that our method is effective and is complementary to random transformation-based methods in improving defense performance.

The contributions of this paper can be summarized as follows:

• We propose a method that uses small first-order perturbations to defend against adversarial attacks. We show that our method is effective in counteracting adversarial perturbations and improving adversarial robustness.

• We evaluate our method on CIFAR-10 and CIFAR-100 against PGD attacks in different settings. The experimental results demonstrate that our method significantly improves the defense performance of the baseline methods against both untargeted and targeted attacks and that it performs well in resisting strong adversarial examples generated with more iterations.
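As referenced above, the following is a minimal PyTorch-style sketch of the counteracting idea under illustrative assumptions: the number of selected labels, the step size and the clamping budget are placeholder hyper-parameters, and the perturbation for each selected label is taken as a single signed-gradient step toward that label. The exact formulation in later sections may differ.

```python
import torch
import torch.nn.functional as F

def counteract(model, x, num_labels=3, step=2/255, budget=8/255, num_classes=10):
    """Sketch of the counteracting procedure: sum small first-order perturbations
    generated for randomly chosen non-predicted labels, clamp the sum onto an
    L_inf ball, and add it to the (possibly adversarial) input x in [0, 1]."""
    with torch.no_grad():
        pred = model(x).argmax(dim=1)                      # current (possibly wrong) prediction
    total = torch.zeros_like(x)
    for _ in range(num_labels):
        # pick a random label different from the current prediction
        target = torch.randint(0, num_classes, pred.shape, device=x.device)
        target = torch.where(target == pred, (target + 1) % num_classes, target)
        x_req = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_req), target)
        grad = torch.autograd.grad(loss, x_req)[0]
        total = total - step * grad.sign()                 # one-step perturbation toward the label
    total = torch.clamp(total, -budget, budget)            # project onto the specified space
    return torch.clamp(x + total, 0.0, 1.0)
```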

2. PRELIMINARY

2.1 ADVERSARIAL EXAMPLES

We consider a neural network f(·) with parameters θ that outputs a vector of probabilities for L = {1, 2, ..., l} categories. In supervised learning, empirical risk minimization (ERM) (Vapnik, 1998) has been commonly used as the principle to optimize the parameters on a training set. Given an input x, the neural network makes a prediction c(x) = arg max_{j∈L} f_j(x). The prediction is correct if c(x) is the same as the actual target c*(x). Unfortunately, ERM-trained neural networks are vulnerable to adversarial examples, i.e., inputs formed by applying small but intentionally crafted perturbations (Szegedy et al., 2014; Madry et al., 2018). That is, an adversarial example x′ is close to a clean example x under a distance metric, e.g., the ℓ∞ distance, but the neural network outputs an incorrect result for the adversarial example x′ with high confidence.
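For concreteness, the sketch below shows how an untargeted ℓ∞ adversarial example can be generated with PGD (Madry et al., 2018): the loss is repeatedly ascended along the signed gradient and the iterate is projected back onto the ε-ball around the clean input. The budget, step size and number of steps are illustrative values, not the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Untargeted L_inf PGD: iteratively ascend the loss and project the
    iterate back onto the eps-ball around the clean input x in [0, 1]."""
    x_adv = x.clone().detach()
    # random start inside the eps-ball
    x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    x_adv = torch.clamp(x_adv, 0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                     # gradient ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project onto the eps-ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)                    # keep a valid image
    return x_adv.detach()
```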

Existing methods of resisting adversarial perturbations perform defense either at training time or at inference time. Training time defense methods attempt to increase model capacity to improve adversarial robustness. One of the most commonly used methods is adversarial training (Szegedy et al., 2014), in which a mixture of adversarial and clean examples is used to train the neural network. Adversarial training can be seen as minimizing the worst-case loss when the training example is perturbed by an adversary (Goodfellow et al., 2015). It requires an adversary to generate adversarial examples during the training procedure, which can significantly increase the training time, and it also results in reduced performance on clean examples. Lamb et al. (2019) recently introduced interpolated adversarial training (IAT), which incorporates interpolation-based training into the adversarial training framework and helps to improve performance on clean examples while maintaining adversarial robustness.
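Following this worst-case view, adversarial training is commonly written as the saddle-point problem below (Madry et al., 2018), where θ denotes the model parameters, D the data distribution, L the training loss and ε the perturbation budget, stated here for the ℓ∞ norm:

\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\|\delta\|_{\infty} \le \epsilon} \mathcal{L}\big(f_{\theta}(x + \delta),\, y\big) \Big]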

