EXPLOITING SAFE SPOTS IN NEURAL NETWORKS FOR PREEMPTIVE ROBUSTNESS AND OUT-OF-DISTRIBUTION DETECTION

Anonymous

Abstract

Recent advances in adversarial defense mainly focus on improving the classifier's robustness against adversarially perturbed inputs. In this paper, we turn our attention from classifiers to inputs and explore whether there exist safe spots in the vicinity of natural images that are robust to adversarial attacks. To this end, we introduce a novel bi-level optimization algorithm that can find safe spots on over 90% of the correctly classified images for adversarially trained classifiers on the CIFAR-10 and ImageNet datasets. Our experiments also show that safe spots can be used to improve both the empirical and certified robustness of smoothed classifiers. Furthermore, by exploiting a novel safe spot-inducing model training scheme together with our safe spot generation method, we propose a new out-of-distribution detection algorithm which achieves state-of-the-art results on near-distribution outliers.

1. INTRODUCTION

Deep neural networks have achieved significant performance on various artificial intelligence tasks such as image classification, speech recognition, and reinforcement learning. Despite these successes, Szegedy et al. (2013) demonstrated that deep neural networks are vulnerable to adversarial examples: minute input perturbations designed to mislead networks into incorrect predictions. A large number of studies have sought to improve the robustness of networks against adversarial perturbations (Song et al., 2017; Guo et al., 2018), yet many of the proposed methods have been shown to fail against stronger adversaries (Athalye et al., 2018; Tramer et al., 2020). Adversarial training (Madry et al., 2017) and randomized smoothing (Cohen et al., 2019) are among the few methods that have survived these harsh verifications, focusing on empirical and certified robustness, respectively. In short, the study of adversarial examples has been an arms race between adversaries, who manipulate inputs to cause network malfunction, and defenders, who aim to preserve network performance on the corrupted inputs.

In this paper, we approach the adversarial robustness problem from a different perspective. Instead of defending networks against already-perturbed examples, we consider the situation where the defender can also slightly modify inputs in its own interest before the adversary's incursion. The defender's goal is to improve robustness by searching for spots in the input space that are resistant to adversarial attacks, given a pre-trained classifier. We explore methods for finding such safe spots near natural images under a given input modification budget, and study the degree of robustness achievable by utilizing these spots, which we denote as preemptive robustness. Ultimately, we tackle the following question:

• Do safe spots always exist in the vicinity of natural images?
One practical example of the proposed framework is the case where a user uploads a photo from local storage (e.g., a mobile device) to social media (e.g., Instagram), as illustrated in Figure 1. Suppose there is an uploader (A) who posts a photo on social media, a web user (B) who queries a search engine (e.g., Google) for an image, and a search engine that crawls images from social media, indexes them with a neural network, and retrieves the relevant images to B. Our threat model considers an adversary (M) who can download A's image from social media, perturb it maliciously, and re-upload the perturbed image on the web, from which the search engine may crawl and index images. The classifier on the search engine will wrongly index the perturbed image, causing the search engine to malfunction. Suppose an African-American uploader (A) posts a photo of him or herself on social media, and a racist adversary (M) perturbs it to be misclassified as "gorilla" by the search engine. When another person (B) searches "gorilla" on Google, the perturbed image would appear, even though the image shows a photo of A. This attack fools both A and B, since the perturbed image is used contrary to A's purpose and is not the image B wanted. To prevent this, the social media company, cooperating with the search engine company, could ask whether A agrees to have images slightly modified at upload time to make them robust to such attacks. The purpose of this modification process, corresponding to the "safe spot filter" in Figure 1, is to ensure that the uploaded images are used as A intends and to provide more accurate search results to B. We develop a novel optimization problem for searching for safe spots in the vicinity of natural images and observe that over 90% of the correctly classified images have safe spots nearby for adversarially trained models on both CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015).
We also find that safe spots can enhance both empirical and certified robustness when applied to smoothed classifiers. Furthermore, we propose a novel safe spot-inducing model training scheme to improve preemptive robustness. By exploiting these safe spot-aware classifiers along with our safe spot search method, we also propose a new algorithm for out-of-distribution detection, which is often addressed together with robustness (Hendrycks et al., 2019a; c). Our algorithm outperforms other baselines on near-distribution outlier datasets such as CIFAR-100 (Krizhevsky & Hinton, 2009).

2. RELATED WORK

Adversarial training Adversarial training improves empirical robustness by training networks on adversarial examples (Madry et al., 2017). Some recent works report performance gains over PGD adversarial training by modifying the adversarial example generation procedure (Qin et al., 2019; Zhang & Wang, 2019; Zhang et al., 2020). However, most of these algorithmic improvements can be matched by simply using early stopping with PGD adversarial training (Rice et al., 2020; Croce & Hein, 2020). Another line of work achieves performance gains by utilizing additional datasets (Carmon et al., 2019; Wang et al., 2020; Hendrycks et al., 2019a).

Randomized smoothing Injecting random noise during the forward pass can smooth the classifier's decision boundary and improve empirical robustness (Liu et al., 2018). Using differential privacy, Lecuyer et al. (2019) prove certified robustness guarantees for classifiers smoothed with random noise, and Cohen et al. (2019) later derive a tight $\ell_2$ certification bound for Gaussian smoothing.

Out-of-distribution detection with deep networks Although deep networks achieve high performance on various classification tasks, they also tend to yield high confidence on out-of-distribution samples (Nguyen et al., 2015). To filter out anomalous examples, Hendrycks & Gimpel (2017) use the maximum value of a classifier's softmax distribution as a score function, while Lee et al. (2018) propose a Mahalanobis distance-based metric that spots out-of-distribution samples using hidden features. Hendrycks et al. (2019b) show that leveraging auxiliary datasets disjoint from test-time data can improve detection performance. Recently, Sastry & Oore (2020) characterize the activity patterns of hidden features with Gram matrices and use the matrix values to identify anomalies.

3.1. GENERAL DEFINITION OF SAFE SPOT AND PREEMPTIVE ROBUSTNESS

We first establish formal definitions of safe spots and preemptive robustness. Let $c : \mathcal{X} \to \mathcal{Y}$ be a classifier which maps images to class labels. We define the safe region of the classifier $c$ as the set of images on which $c$ outputs robust predictions in the presence of slight adversarial perturbations.

Definition 1 ($\epsilon$-safe region). Let $c : \mathcal{X} \to \mathcal{Y}$ be a classifier and $\epsilon \in \mathbb{R}_+$ be the perturbation budget of an adversary. The $\epsilon$-safe region of the classifier $c$ is defined by $S_\epsilon(c) := \{x \in \mathcal{X} \mid c(x') = c(x),\ \forall x' \in B_\epsilon(x)\}$, where $B_\epsilon(x) = \{x' \in \mathcal{X} \mid d(x, x') \le \epsilon\}$ is the $x$-centered $\epsilon$-ball.

In this paper, we assume the $\ell_p$ threat model, i.e., $d(x, x') = \|x - x'\|_p$, which is the most common setting in the adversarial robustness literature, and consider $p \in \{2, \infty\}$. Now, suppose a defender can preemptively manipulate a natural image $x_o$ under a small modification budget, knowing its ground-truth label $y_o$. We denote the modified output image by $x_s$. The defender's objective is then to make $x_s$ be correctly classified as $y_o$ and lie in the safe region $S_\epsilon(c)$, improving robustness against adversarial attacks. If $x_s$ satisfies these two conditions, we say that $x_o$ is preemptively robust and $x_s$ is a safe spot of $x_o$.

Definition 2 (Preemptive robustness). Let $c : \mathcal{X} \to \mathcal{Y}$ be a classifier and $\delta, \epsilon \in \mathbb{R}_+$ be the modification budgets of the defender and the adversary, respectively. A natural image $x_o$ with ground-truth label $y_o$ is called $(\delta, \epsilon)$-preemptively robust on the classifier $c$ if there exists a safe spot $x_s \in B_\delta(x_o)$ such that (i) $c(x_s) = y_o$ and (ii) $x_s \in S_\epsilon(c)$.
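Definition 1 cannot be verified exactly for deep networks, but it can be probed empirically. The sketch below is our own illustration (not part of the paper's method): it tests membership in $S_\epsilon(c)$ for a toy linear classifier by random sampling in $B_\epsilon(x)$. All function names are ours, and passing the probe is only a necessary condition for safety, not a certificate.

```python
import numpy as np

def predict(W, b, x):
    """Toy linear classifier: argmax over class scores W @ x + b."""
    return int(np.argmax(W @ x + b))

def is_empirically_safe(W, b, x, eps, n_probes=200, p=2, seed=0):
    """Heuristic check of x ∈ S_eps(c): probe random points in B_eps(x)
    and verify the prediction never changes (necessary condition only)."""
    rng = np.random.default_rng(seed)
    base = predict(W, b, x)
    for _ in range(n_probes):
        if p == 2:
            d = rng.normal(size=x.shape)
            d = d / np.linalg.norm(d) * rng.uniform(0.0, eps)
        else:  # p = inf
            d = rng.uniform(-eps, eps, size=x.shape)
        if predict(W, b, x + d) != base:
            return False
    return True
```

For a classifier whose decision boundary is $x_1 = 0$, a point two units from the boundary is safe at $\epsilon = 1$, while a point 0.1 units away is not.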

3.2. SAFE SPOT SEARCH ALGORITHM

In this subsection, we develop an algorithm for searching for a safe spot near a natural image. Given a classifier $c$, finding a safe spot $x_s$ of a natural image $x_o$ can be formulated as the following problem, which follows directly from the definition of a safe spot:
$$\underset{x_s}{\text{minimize}}\ \mathbb{1}[c(x_s) \ne y_o] + \mathbb{1}[x_s \notin S_\epsilon(c)] \quad \text{subject to}\ \|x_s - x_o\|_p \le \delta,$$
where $\mathbb{1}$ is the 0-1 loss function. Note that in this formulation the defender requires the ground-truth label $y_o$ for the safe spot search. However, images in the real world (e.g., on social media) are usually unlabeled, unless uploaders annotate their images by hand. It is therefore natural to assume that the defender cannot access the ground-truth label $y_o$. In this case, we utilize the classifier's prediction $c(x_o)$ instead of $y_o$:
$$\underset{x_s}{\text{minimize}}\ \mathbb{1}[c(x_s) \ne c(x_o)] + \mathbb{1}[x_s \notin S_\epsilon(c)] \quad \text{subject to}\ \|x_s - x_o\|_p \le \delta.$$
Since $x_s \notin S_\epsilon(c)$ implies there exists an adversarial example $x_a \in B_\epsilon(x_s)$ such that $c(x_a) \ne c(x_s)$, we can reformulate the optimization problem as
$$\underset{x_s}{\text{minimize}}\ \mathbb{1}[c(x_s) \ne c(x_o)] + \sup_{x_a} \mathbb{1}[c(x_a) \ne c(x_s)] \quad \text{subject to}\ \|x_s - x_o\|_p \le \delta\ \text{and}\ \|x_a - x_s\|_p \le \epsilon.$$
Since the 0-1 loss is not differentiable, we employ the cross-entropy loss $\ell : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ of the classifier $c$ as a convex surrogate loss function.

Lemma 1. If $\bar{h}(x_s) \le -\log(0.5) \approx 0.6931$, then $h(x_s) \le 2\bar{h}(x_s)$, where $h$ denotes the 0-1 objective above and $\bar{h}$ its cross-entropy surrogate. Proof. See Supplementary A.1.

Finally, we have the following optimization problem:
$$\underset{x_s}{\text{minimize}}\ \ell(x_s, c(x_o)) + \sup_{x_a} \ell(x_a, c(x_o)) \quad \text{subject to}\ \|x_s - x_o\|_p \le \delta\ \text{and}\ \|x_a - x_s\|_p \le \epsilon. \tag{2}$$
To solve Equation (2), we first approximate the inner maximization problem by running $T$-step PGD (Madry et al., 2017), whose dynamics are given by
$$x_a^{(0)} = x_s + \eta\ \text{(random start)}, \qquad \tilde{x}_a^{(t)} = f\big(x_a^{(t-1)}; c(x_o)\big)\ \text{(adversarial update)}, \qquad x_a^{(t)} = \Pi_{x_s,\epsilon}\big(\tilde{x}_a^{(t)}\big),$$
where $\eta$ is random noise uniformly sampled from the $\ell_p$ zero-centered $\epsilon$-ball, $f$ is the FGSM step (Goodfellow et al., 2015) defined by
$$f(x; y) = \begin{cases} x + \alpha \cdot \mathrm{sgn}(\nabla_x \ell(x, y)) & \text{if}\ p = \infty \\ x + \alpha \cdot \nabla_x \ell(x, y) / \|\nabla_x \ell(x, y)\|_2 & \text{if}\ p = 2, \end{cases}$$
and $\Pi_{x_s,\epsilon}$ is the projection onto $B_\epsilon(x_s)$. Then, we iteratively solve the approximate problem obtained by replacing $x_a$ with $x_a^{(T)}$ in Equation (2).
To update $x_s$, we need the gradient of $\ell(x_a^{(T)}, c(x_o))$ with respect to $x_s$, expressed as
$$\frac{\partial \ell(x_a^{(T)}, c(x_o))}{\partial x_s} = \frac{\partial \tilde{f}(x_a^{(0)})}{\partial x} \cdots \frac{\partial \tilde{f}(x_a^{(T-1)})}{\partial x} \cdot \nabla_x \ell(x_a^{(T)}, c(x_o)),$$
where $\partial \tilde{f} / \partial x$ is the Jacobian matrix of $\tilde{f} = \Pi_{x_s,\epsilon} \circ f$, which is easily computed via back-propagation. After computing the gradient, we update $x_s$ by the projected gradient descent method:
$$x_s^{(i+1)} = \Pi_{x_o,\delta}\left(x_s^{(i)} - \beta \cdot \frac{\partial \ell(x_a^{(T)}, c(x_o))}{\partial x_s}\right).$$
Note that the loss $\ell(x_a^{(T)}, c(x_o))$ is now a random variable dependent on $\eta$. Therefore, we generate $N$ adversarial examples $\{x_{a,n}^{(T)}\}_{n=1}^N$ with different noises and optimize the sample mean of the losses instead. Algorithm 1 shows the overall safe spot search algorithm and Figure 2 illustrates our optimization process.
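The search loop can be sketched on a toy model. The following is a minimal numpy stand-in of our own (not the paper's implementation): it assumes a linear-softmax classifier with an analytic input gradient, uses $\ell_2$ PGD for the inner maximization, and applies the first-order approximation of Section 3.3 for the outer update. The real method operates on deep networks with back-propagated gradients and averages over $N$ noise samples; every name below is ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_x_ce(W, b, x, y):
    """∇_x of the cross-entropy loss of a linear-softmax classifier."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def project_l2(x, center, radius):
    """Projection onto the l2 ball B_radius(center)."""
    d = x - center
    n = np.linalg.norm(d)
    return center + d * (radius / n) if n > radius else x

def pgd_attack(W, b, xs, y, eps, steps=10, alpha=None, seed=0):
    """Inner maximization: T-step l2 PGD ascent on l(x_a, y) within B_eps(xs)."""
    alpha = alpha if alpha is not None else 2.5 * eps / steps
    rng = np.random.default_rng(seed)
    xa = project_l2(xs + rng.normal(scale=eps / 2, size=xs.shape), xs, eps)
    for _ in range(steps):
        g = grad_x_ce(W, b, xa, y)
        n = np.linalg.norm(g)
        if n > 0:
            xa = project_l2(xa + alpha * g / n, xs, eps)
    return xa

def find_safe_spot(W, b, xo, delta, eps, outer=40, beta=0.1):
    """Outer minimization: descend the adversarial loss with the first-order
    approximation (second-order terms dropped), projecting onto B_delta(xo)."""
    y = int(np.argmax(W @ xo + b))   # predicted label stands in for y_o
    xs = xo.copy()
    for _ in range(outer):
        xa = pgd_attack(W, b, xs, y, eps)
        xs = project_l2(xs - beta * grad_x_ce(W, b, xa, y), xo, delta)
    return xs
```

On a 2-D example with decision boundary $x_1 = 0$ and $x_o = (0.3, 0)$, the search pushes $x_s$ toward the far side of the $\delta$-ball, after which an $\epsilon$-bounded PGD attack can no longer cross the boundary, even though the same attack flips the prediction at $x_o$.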

3.3. COMPUTING UPDATE GRADIENT WITHOUT SECOND-ORDER DERIVATIVES

Computing the update gradient with respect to $x_s$ involves second-order derivatives of the loss function, since the dynamics $f$ contains the loss gradient $\nabla_x \ell(x, y)$. Standard deep learning libraries such as PyTorch (Paszke et al., 2019) support the computation of higher-order derivatives. However, this imposes a large memory burden, as the size of the computational graph increases. Furthermore, for the case of $p = 2$, computing the update gradient with second-order derivatives may cause an exploding gradient problem if the loss gradient vanishes, by Proposition 1.

Lemma 2. Suppose $\ell$ is twice-differentiable and its second partial derivatives are continuous. If $p = 2$, the Jacobian of the dynamics $f$ is
$$\frac{\partial f}{\partial x} = I + \alpha \cdot \left(I - \frac{g}{\|g\|_2} \frac{g^\top}{\|g\|_2}\right) \frac{H}{\|g\|_2},$$
where $g = \nabla_x \ell(x, y)$ and $H = \nabla_x^2 \ell(x, y)$. Proof. See Supplementary B.1.

Proposition 1. If the maximum eigenvalue of $H$ in absolute value is $\sigma$, then
$$\left\|\frac{\partial f}{\partial x} \cdot a\right\|_2 \le \left(1 + \frac{\alpha \cdot \sigma}{\|g\|_2}\right) \|a\|_2.$$
Proof. See Supplementary B.2.

As we update $x_s$, the loss gradients at $x_s$ and its adversarial examples $x_a$ shrink toward zero, which can cause the update gradient to explode and destabilize the update process. To address this problem, we approximate the update gradient by excluding the second-order derivatives, following the practice in Finn et al. (2017). We also include an experiment comparing against the exact update gradient in Supplementary B.3. For the case of $p = \infty$, the second-order derivatives naturally vanish since we take the sign of the loss gradient $\nabla_x \ell(x, y)$; therefore, the approximate gradient equals the exact update gradient.
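The claim for $p = \infty$ can be checked numerically: away from sign flips, $\mathrm{sgn}(\nabla_x \ell)$ is locally constant, so the FGSM step has identity Jacobian and dropping second-order terms loses nothing. Below is a small finite-difference check of our own on a smooth toy loss $\ell(x) = \|x\|^2$ (gradient $2x$); the function names are ours.

```python
import numpy as np

def fgsm_inf_step(grad_fn, x, alpha):
    """One l_inf FGSM step: x + alpha * sgn(grad of the loss at x)."""
    return x + alpha * np.sign(grad_fn(x))

def numeric_jacobian(f, x, h=1e-5):
    """Central-difference Jacobian of f at x."""
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2 * h)
    return J

# At a point where no coordinate of the gradient changes sign under the
# perturbation, sgn(2x) is constant, so the numeric Jacobian is the identity.
grad_fn = lambda x: 2.0 * x
J = numeric_jacobian(lambda x: fgsm_inf_step(grad_fn, x, 0.1),
                     np.array([0.5, -0.3]))
```

By contrast, for $p = 2$ the normalized-gradient step contributes the Hessian-dependent term of Lemma 2, which is exactly what the first-order approximation discards.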

3.4. FINDING A SAFE SPOT FOR CLASSIFIERS WITH RANDOMIZED SMOOTHING

To further enhance the robustness of our safe spot framework, we can leverage the randomized smoothing technique along with our algorithm. Given a base classifier $c : \mathcal{X} \to \mathcal{Y}$, the smoothed classifier $g : \mathcal{X} \to \mathcal{Y}$ is defined by
$$g(x) = \underset{y \in \mathcal{Y}}{\mathrm{argmax}}\ \mathbb{P}(c(x + \eta) = y), \qquad \eta \sim \mathcal{N}(0, \sigma^2 I).$$
To find a safe spot $x_s$ of a natural image $x_o$, we must find an adversarial example $x_a$ of $x_s$ that maximizes the cross-entropy loss $\ell(x_a, c(x_o))$, solving the inner maximization problem in Equation (2). However, crafting adversarial examples for the smoothed classifier is ill-behaved since the argmax is non-differentiable. To address this, we follow the approach of Salman et al. (2019) and approximate the smoothed classifier $g$ with the smoothed soft classifier $G : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ defined as
$$G(x) = \mathbb{E}_{\eta \sim \mathcal{N}(0, \sigma^2 I)}[C(x + \eta)],$$
where $\mathcal{P}(\mathcal{Y})$ is the set of probability distributions over $\mathcal{Y}$ and $C : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ is the soft version of the base classifier $c$ such that $\mathrm{argmax}_{y \in \mathcal{Y}} C(x)_y = c(x)$. Finally, the adversarial example $x_a$ is found by maximizing the cross-entropy loss of $G$ instead.
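A Monte Carlo estimate of the smoothed soft classifier $G$ is straightforward. The sketch below is our own illustration on a toy linear base classifier standing in for $C$; in practice $C$ is a deep network and the estimate is differentiated through for the attack.

```python
import numpy as np

def soft_classifier(W, b, x):
    """C(x): softmax probabilities of a toy linear base classifier."""
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

def smoothed_soft(W, b, x, sigma, n=2000, seed=0):
    """Monte Carlo estimate of G(x) = E_{eta ~ N(0, sigma^2 I)}[C(x + eta)]:
    average the soft predictions over n Gaussian noise draws."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=(n, x.size))
    return np.mean([soft_classifier(W, b, x + e) for e in noise], axis=0)
```

The estimate remains a valid probability vector, and for a point well inside a class region the smoothed prediction agrees with the base prediction.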

3.5. SAFE SPOT-AWARE ADVERSARIAL TRAINING

In Section 3.2, we investigated how the defender can find a safe spot near a natural image, given a pre-trained classifier. In this subsection, we explore the defender's training scheme for a classifier on which data points are preemptively robust. Suppose the defender has a labeled training set drawn from a true data distribution $\mathcal{D}$. To induce a classifier to have safe spots in the vicinity of data points, the defender's optimal training objective should have the following form:
$$\underset{\theta}{\text{minimize}}\ \mathbb{E}_{(x_o, y_o) \sim \mathcal{D}}[\ell(x_a^*, y_o; \theta)] \quad \text{subject to}\ x_a^* = \underset{x_a \in B_\epsilon(x_s^*)}{\mathrm{argmax}}\ \ell(x_a, y_o)\ \text{and}\ x_s^* = \underset{x_s \in B_\delta(x_o)}{\mathrm{argmin}}\ \sup_{x_a \in B_\epsilon(x_s)} \ell(x_a, y_o),$$
where $\theta$ is the set of trainable parameters. Concretely, the defender finds a safe spot candidate $x_s^*$ of a data point $x_o$ and generates an adversarial example $x_a^*$ from $x_s^*$. Then, the defender minimizes the cross-entropy loss $\ell(x_a^*, y_o; \theta)$ so that $x_s^*$ becomes an actual safe spot. Note that the ground-truth label $y_o$ is used instead of the prediction $c(x_o)$, since we assume that the defender can access ground-truth labels during training. The most direct way to optimize this objective would be to find $x_s^*$ from $x_o$ using our safe spot search algorithm and perform $k$-step PGD adversarial training (Madry et al., 2017) with $x_s^*$. However, since the safe spot search algorithm requires running the $T$-step PGD dynamics for each update, this training procedure would be far more computationally demanding than PGD adversarial training. To ease this problem, we replace the inner maximization $\sup_{x_a} \ell(x_a, y_o)$ in the safe spot search by $\ell(x_s, y_o)$:
$$x_s^* = \underset{x_s \in B_\delta(x_o)}{\mathrm{argmin}}\ \sup_{x_a \in B_\epsilon(x_s)} \ell(x_a, y_o) \ \Longrightarrow\ x_s^* = \underset{x_s \in B_\delta(x_o)}{\mathrm{argmin}}\ \ell(x_s, y_o).$$
Then, $x_s^*$ can be easily computed by running targeted FGSM or $k$-step PGD on $x_o$ toward the ground-truth label $y_o$. We denote this training scheme as safe spot-aware adversarial training.
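One step of the simplified scheme can be sketched as follows, again on a toy linear-softmax model with analytic gradients (an illustration under our own simplifications, not the paper's deep-network implementation): targeted FGSM toward $y_o$ gives $x_s^*$, an FGSM attack from $x_s^*$ gives $x_a^*$, and the parameters are updated on $\ell(x_a^*, y_o)$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grads(W, b, x, y):
    """Cross-entropy gradients of a linear-softmax model w.r.t. x, W, b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p, np.outer(p, x), p

def safe_spot_aware_step(W, b, x, y, delta, eps, lr=0.5):
    """One safe spot-aware adversarial training step (FGSM variant)."""
    gx, _, _ = grads(W, b, x, y)
    xs = x - delta * np.sign(gx)        # x_s*: targeted FGSM toward y (descent)
    gx, _, _ = grads(W, b, xs, y)
    xa = xs + eps * np.sign(gx)         # x_a*: FGSM attack from x_s*
    _, gW, gb = grads(W, b, xa, y)      # parameter update on l(x_a*, y)
    return W - lr * gW, b - lr * gb
```

Repeating this step on a single labeled point drives the model to classify it confidently even under the perturbation pipeline.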

3.6. OUT-OF-DISTRIBUTION DETECTION

The safe spot-aware adversarial training method induces the learned data distribution to have safe spots near its data points. Thus, we can naturally conjecture that samples from the learned distribution will have a higher probability of having safe spots than out-of-distribution (OOD) samples, as shown in Figure 3. We leverage this conjecture to propose a new out-of-distribution detection algorithm that jointly utilizes our safe spot generation method and safe spot-aware adversarial training. Following the framework of Hendrycks et al. (2019b), which uses auxiliary outlier data to tune anomaly detectors, we consider three types of data distributions: $\mathcal{D}_{\text{in}}$, $\mathcal{D}_{\text{out}}^{\text{train}}$, and $\mathcal{D}_{\text{out}}^{\text{test}}$. $\mathcal{D}_{\text{in}}$ refers to the learned distribution, also called the in-distribution. $\mathcal{D}_{\text{out}}^{\text{train}}$ is the given distribution of outliers used to tune the detection algorithm, which is disjoint from $\mathcal{D}_{\text{out}}^{\text{test}}$. $\mathcal{D}_{\text{out}}^{\text{test}}$ is the distribution we want to detect as OOD during inference, which is unknown. We include the auxiliary outlier data in our safe spot-aware training procedure and adapt the training objective as below:
$$\underset{\theta}{\text{minimize}}\ \mathbb{E}_{(x_o, y_o) \sim \mathcal{D}_{\text{in}}}[\ell(x_a^*, y_o; \theta)] + \mathbb{E}_{x_o \sim \mathcal{D}_{\text{out}}^{\text{train}}}\big[\gamma \cdot D_{\mathrm{KL}}(\bar{y} \,\|\, C(x_o; \theta)) - \lambda \cdot \ell(x_a^*, c(x_o); \theta)\big], \tag{4}$$
where $\bar{y}$ is the uniform distribution and $C(x_o; \theta)$ is the softmax probability of $x_o$. Since $x_o$ is unlabeled, we use the prediction $c(x_o)$ instead for the safe spot search. Note that if $\epsilon \ge \delta$, the first term in Equation (4) also maximizes the confidence on the original in-distribution samples, since $x_o \in B_\delta(x_s^*) \subseteq B_\epsilon(x_s^*)$ and therefore $\ell(x_o, y; \theta) \le \ell(x_a^*, y; \theta)$. Similarly, the second and third terms minimize the prediction confidence and the probability of safe spot existence of the outlier samples, respectively. With the trained classifier, we measure the safe spot objective value from Equation (2) along with the maximum softmax probability (MSP), and use these values as indicators to detect OOD samples.
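The outlier part of the adapted objective can be written down directly. The helper below (names ours, a sketch rather than the paper's code) computes $\gamma \cdot D_{\mathrm{KL}}(\bar{y} \,\|\, C(x_o)) - \lambda \cdot \ell(x_a^*, c(x_o))$ given precomputed softmax outputs and the adversarial loss value:

```python
import numpy as np

def outlier_objective(probs_out, loss_adv_out, gamma, lam):
    """Outlier term of the adapted objective:
    gamma * KL(uniform || C(x_o)) - lam * l(x_a*, c(x_o)).
    probs_out: softmax outputs C(x_o) on an outlier sample;
    loss_adv_out: cross-entropy of the adversarial example of its safe spot."""
    k = len(probs_out)
    u = np.full(k, 1.0 / k)
    kl = float(np.sum(u * (np.log(u) - np.log(probs_out))))
    return gamma * kl - lam * loss_adv_out
```

The KL term vanishes exactly when the outlier's prediction is uniform and grows as it becomes confident, while the $-\lambda$ term rewards a large adversarial loss on the outlier's safe-spot candidate, i.e., it penalizes outliers that possess safe spots.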
Concretely, we define the score function as a linear combination of the two indicators. Since the indicators have different ranges of possible values, we replace the safe spot objective value with the MSP of the adversarial example of the safe spot solution. The score function is then formulated as
$$D(x_o) := \mu \cdot \max_{y \in \mathcal{Y}} C(x_o)_y + (1 - \mu) \cdot \max_{y \in \mathcal{Y}} C(x_a^*)_y,$$
where $x_s^* \in B_\delta(x_o)$ is the optimal solution of the safe spot algorithm for $x_o$ and $x_a^* \in B_\epsilon(x_s^*)$ is the adversarial example of $x_s^*$. We flag inputs with low scores as OOD.
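Given the softmax outputs on $x_o$ and on the adversarial example $x_a^*$ of its safe-spot solution, the score $D(x_o)$ is a one-liner (a sketch with our own names):

```python
import numpy as np

def ood_score(probs_xo, probs_xa, mu=0.5):
    """D(x_o) = mu * MSP(x_o) + (1 - mu) * MSP(x_a*), where probs_xa are the
    softmax outputs on the adversarial example of the safe-spot solution;
    inputs with low D are flagged as out-of-distribution."""
    return mu * float(np.max(probs_xo)) + (1.0 - mu) * float(np.max(probs_xa))
```

An in-distribution sample (confident prediction that survives the safe-spot attack) scores near 1, while a near-distribution outlier with low confidence and no safe spot scores lower on both terms.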

4. EXPERIMENTS

As it is natural to assume the defender and the adversary have the same modification budget, we set $\delta = \epsilon$ for all experiments. We evaluate our method by measuring clean and adversarial accuracies, where adversarial accuracy refers to the prediction score under a 20-step untargeted PGD attack with a step size of $\epsilon/4$. In the experiment tables, the None column indicates using original images as inputs, and S-Full uses safe spot images from Algorithm 1. We also evaluate safe spot search via targeted FGSM and 20-step PGD toward the class inferred by the classifier, denoted S-FGSM and S-PGD, respectively. Detailed settings are listed in Supplementary C.

4.1. CIFAR-10

We use Wide-ResNet-34-10 (Zagoruyko & Komodakis, 2016) and consider two threat models: $\ell_\infty$ with $\epsilon = 8/255$ and $\ell_2$ with $\epsilon = 0.5$. We run our experiments on four differently trained models. The natural model is trained in a standard manner without considering adversaries. ADV is a PGD adversarially trained model. S-FGSM+ADV and S-PGD+ADV are safe spot-aware adversarially trained models, with safe spot search approximated by FGSM or 10-step PGD with a step size of $\delta/4$. The $\ell_\infty$ threat model results in Table 1 (left) show that our methods can find safe spots on over 85% of the test set images, except for the natural model. This performance is near the upper bound, which is the classifier's clean accuracy, since we use predicted labels for the safe spot search. We also observe that safe spot search via targeted FGSM or PGD is feasible for the ADV, S-FGSM+ADV, and S-PGD+ADV models, but these searches still miss about 10% of correctly classified images. When jointly used with our safe spot search method, safe spot-aware training achieves the highest adversarial accuracy, along with a clean accuracy much higher than PGD adversarial training. The $\ell_2$ threat model results in Table 1 (right) are similar to the $\ell_\infty$ experiment, except that the adversarial accuracy of safe spots generated by S-Full on the natural model is much higher. However, we note that the adversarial accuracy of safe spots on the natural model may drop to about 20% when the attack gets stronger, for example, by increasing the PGD iterations. Results on stronger PGD attacks and other types of attacks are presented in Supplementary D.3 and D.4.

4.3. RANDOMIZED SMOOTHING

We also evaluate our algorithm for classifiers with randomized smoothing. Here, we consider the $\ell_2$ threat model, where $\epsilon = 0.5$ for CIFAR-10 and $\epsilon = 3.0$ for ImageNet. We run experiments on the smoothed classifiers based on the natural and the Gaussian-noise augmented model, which are considered certifiably robust (Lecuyer et al., 2019; Cohen et al., 2019).
We measure empirical robustness against randomized PGD on both models and measure certified robustness on the Gaussian model. Detailed settings, such as the noise level $\sigma$ and the randomized PGD, are listed in Supplementary C.2. Table 3 shows our results on empirical robustness against randomized PGD. We observe that our algorithm can find safe spots for the natural model with randomized smoothing on 57% and 37% of correctly classified images of CIFAR-10 and ImageNet, respectively. Furthermore, as shown in Supplementary D.3, the adversarial accuracy of the smoothed natural model does not suffer from an accuracy drop when the attack becomes stronger, in contrast to the natural model. Also, the smoothed Gaussian model, whose training cost is comparable to standard training and much less than PGD adversarial training, achieves higher clean and adversarial accuracy than the ADV model. Certified robustness results of smoothed classifiers can be found in Supplementary D.2, where our safe spot algorithm also improves certified robustness on both datasets.

Table 3: Empirical robustness of randomized smoothed networks under the $\ell_2$ threat with $\epsilon = 0.5$ on CIFAR-10 (left) and $\epsilon = 3.0$ on ImageNet (right). (clean acc./adv. acc.)

4.4. OUT-OF-DISTRIBUTION DETECTION

We evaluate the performance of our proposed detection algorithm on models trained with CIFAR-10. We consider various OOD datasets including CIFAR-100, SVHN (Netzer et al., 2011), TinyImageNet (Johnson et al.), LSUN (Yu et al., 2015), and synthetic noise. Following the experimental protocol of Hendrycks et al. (2019b), we evaluate the OOD detection methods on three metrics: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), and the false positive rate at 95% true positive rate (FPR95). We compare our method's performance to Mahalanobis (Lee et al., 2018), OE (Hendrycks et al., 2019b), and Gram (Sastry & Oore, 2020). Since Lee et al. (2018) utilize a subset of the $\mathcal{D}_{\text{out}}^{\text{test}}$ data for tuning the detection procedure while our method and OE do not, we modify Mahalanobis to tune with $\mathcal{D}_{\text{out}}^{\text{train}}$ for a fair comparison. For detailed descriptions of the datasets and the experiments, refer to Supplementary E.1 and E.2.

Table 4: Out-of-distribution detection results. All results are percentages and averaged over 10 runs.

Table 4 shows the evaluation results. While Mahalanobis and Gram work slightly better on synthetic datasets such as Gaussian noise, on more near-distribution outliers such as CIFAR-100, TinyImageNet, and LSUN, our method outperforms these baselines by a large margin, which leads to a gain in overall performance. Our method also outperforms OE on most metrics, including on Gaussian noise.

5. CONCLUSION

Departing from recent studies on adversarial examples, we present a new adversarial framework in which the defender preemptively modifies classifier inputs. We introduce a novel optimization algorithm for finding safe spots in the vicinity of original inputs, as well as a new network training method suited for enhancing preemptive robustness. The experiments show that our algorithm can find safe spots for robust classifiers on most correctly classified images. Further results show that safe spots can be used to improve empirical and certified robustness on smoothed classifiers. Finally, we combine the new network training scheme and the safe spot generation method to devise a new out-of-distribution detection algorithm that achieves state-of-the-art performance on near-distribution outliers.



Figure 1: Overview of our proposed framework. The left side shows web users retrieving wrong results due to the adversarial example. The right side adopts a safe spot filter in the image uploading process and succeeds in defending the query system against the attacker.


Figure 2: Illustration of the safe spot search process. The shaded region represents the set of points that are misclassified.



Figure 3: Histograms of the loss values of images $\ell(x_o, c(x_o))$ (left) and the loss values of the perturbed safe spot solution $\sup_{x_a^* \in B_\epsilon(x_s^*)} \ell(x_a^*, c(x_o))$ (right). A safe spot-aware adversarially trained model without fine-tuning is used as the classifier. The dotted lines mark where the false positive rate is 95%. Detailed settings are in Supplementary C.3.

Table 1: Classification accuracy under the $\ell_\infty$ threat with $\epsilon = 8/255$ (left) and the $\ell_2$ threat with $\epsilon = 0.5$ (right) on CIFAR-10. (clean acc./adv. acc.)

4.2. IMAGENET

We use ResNet-50 and consider three threat models: $\ell_\infty$ with $\epsilon \in \{4/255, 8/255\}$ and $\ell_2$ with $\epsilon = 3.0$. For the safe spot-aware adversarial training experiments, we utilize "fast" adversarial training (Wong et al., 2020) and train the safe spot-aware model S-FGSM+Fast to reduce the training cost.

Table 1 rows for the safe spot-aware models (columns: None / S-FGSM / S-PGD / S-Full; clean acc./adv. acc.):

$\ell_\infty$, $\epsilon = 8/255$ (left):
S-FGSM+ADV: 86.83/42.50, 86.83/78.23, 86.83/69.25, 86.83/85.35
S-PGD+ADV: 91.32/39.33, 91.32/77.01, 91.32/63.94, 91.32/89.84

$\ell_2$, $\epsilon = 0.5$ (right):
S-FGSM+ADV: 90.92/63.27, 90.92/88.82, 90.92/84.92, 90.92/90.60
S-PGD+ADV: 94.10/57.70, 94.10/88.03, 94.10/80.94, 94.10/93.54

Table 2 (left) shows results for the $\ell_\infty$ attack with $\epsilon = 4/255$. Similar to the results on CIFAR-10, our methods are capable of finding safe spots near original images that are correctly classified by the robust classifiers. Also, our proposed safe spot-aware classifier outperforms the original robust classifier by a large margin in both clean and adversarial accuracy. Table 2 (right) shows results for $\ell_\infty$ with $\epsilon = 8/255$. In this setting, we also apply our algorithm to the ADV model trained with $\epsilon_{\text{train}} = 4/255$. Note that by changing only the $\epsilon_{\text{train}}$ value in adversarial training, we gain 10% in our safe spots' adversarial accuracy. Surprisingly, classifiers adversarially trained with a smaller $\epsilon_{\text{train}}$ perform substantially better in terms of preemptive robustness than more robust classifiers. This implies that the conventional notion of robustness does not necessarily translate to preemptive robustness. Experiments on $\ell_2$ attacks show similar results and can be found in Supplementary D.1.

Table 2: Classification accuracy under the $\ell_\infty$ threat with $\epsilon = 4/255$ (left) and $\epsilon = 8/255$ (right) on ImageNet. The lower three models in the $\epsilon = 4/255$ setting are trained in the Fast style. (clean acc./adv. acc.)


