EXPLOITING SAFE SPOTS IN NEURAL NETWORKS FOR PREEMPTIVE ROBUSTNESS AND OUT-OF-DISTRIBUTION DETECTION

Anonymous

Abstract

Recent advances in adversarial defense mainly focus on improving the classifier's robustness against adversarially perturbed inputs. In this paper, we turn our attention from classifiers to inputs and explore whether there exist safe spots in the vicinity of natural images that are robust to adversarial attacks. To this end, we introduce a novel bi-level optimization algorithm that finds safe spots for over 90% of the correctly classified images of adversarially trained classifiers on the CIFAR-10 and ImageNet datasets. Our experiments also show that safe spots can be used to improve both the empirical and certified robustness of smoothed classifiers. Furthermore, by combining a novel safe spot inducing model training scheme with our safe spot generation method, we propose a new out-of-distribution detection algorithm which achieves state-of-the-art results on near-distribution outliers.

1. INTRODUCTION

Deep neural networks have achieved remarkable performance on various artificial intelligence tasks such as image classification, speech recognition, and reinforcement learning. Despite these results, Szegedy et al. (2013) demonstrated that deep neural networks are vulnerable to adversarial examples: minute input perturbations designed to mislead networks into yielding incorrect predictions. A large number of studies have sought to improve the robustness of networks against adversarial perturbations (Song et al., 2017; Guo et al., 2018), while many of the proposed methods have been shown to fail against stronger adversaries (Athalye et al., 2018; Tramer et al., 2020). Adversarial training (Madry et al., 2017) and randomized smoothing (Cohen et al., 2019) are among the few methods that have survived such harsh verification, providing empirical and certified robustness, respectively. In summary, the study of adversarial examples has been an arms race between adversaries, who manipulate inputs to cause network malfunction, and defenders, who aim to preserve network performance against the corrupted inputs.

In this paper, we approach the adversarial robustness problem from a different perspective. Instead of defending networks from already perturbed examples, we consider the situation where the defenders can also modify inputs slightly in their own interest before the adversaries' incursion. The defenders' goal in this modification is to improve robustness by searching for spots in the input space that are resistant to adversarial attacks, given a pre-trained classifier. We explore methods for finding such safe spots around natural images under a given input modification budget, as well as the degree of robustness achievable by utilizing these spots, which we denote as preemptive robustness. Ultimately, we tackle the following question:

• Do safe spots always exist in the vicinity of natural images?
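The defender-then-adversary interaction described above can be made concrete with a toy sketch. The following is not the paper's bi-level algorithm; it is a minimal illustration for a hypothetical 2-D linear classifier, where both the inner (attacker) and outer (defender) ℓ∞ problems happen to have closed-form solutions. All names and numbers are illustrative.

```python
# Toy sketch of preemptive "safe spot" search for a linear classifier.
# The 2-D linear setup and all values below are hypothetical, chosen so
# that both optimization levels reduce to closed-form expressions.

def margin(w, b, x, y):
    # Signed margin of label y in {-1, +1}: positive means correct.
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def worst_case_margin(w, b, x, y, eps_attack):
    # Inner problem (attacker): for a linear model, the optimal L_inf
    # perturbation reduces the margin by exactly eps_attack * ||w||_1.
    return margin(w, b, x, y) - eps_attack * sum(abs(wi) for wi in w)

def safe_spot(w, b, x, y, eps_defense):
    # Outer problem (defender): step each coordinate by eps_defense in
    # the direction that grows the margin, staying within the budget.
    return [xi + eps_defense * (1 if y * wi > 0 else -1)
            for wi, xi in zip(w, x)]

w, b = [1.0, -2.0], 0.1
x, y = [0.5, 0.1], 1        # correctly classified natural input
eps = 0.2                   # same L_inf budget for attacker and defender

print(worst_case_margin(w, b, x, y, eps))   # negative: x is attackable
x_safe = safe_spot(w, b, x, y, eps)
print(worst_case_margin(w, b, x_safe, y, eps))  # positive: a safe spot
```

For deep networks the inner problem has no closed form, which is why the paper resorts to an iterative bi-level optimization; the sketch only illustrates why a small defender-side move can flip a point from attackable to robust.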
One practical example of the proposed framework is the case where a user uploads a photo from local storage (e.g., a mobile device) to social media (e.g., Instagram), as illustrated in Figure 1. Suppose there is an uploader (A) who posts a photo on social media, a web user (B) who queries a search engine (e.g., Google) for an image, and a search engine that crawls images from social media, indexes them with a neural network, and retrieves the relevant images to B. Our threat model considers an adversary (M) that can download A's image from social media, perturb it maliciously, and re-upload the perturbed image on the web, where the search engine may crawl and index it. The classifier on the search engine will wrongly index the perturbed image, causing the search engine to malfunction.

For instance, suppose an African-American uploader (A) posts a photo of themselves on social media, and a racist adversary (M) perturbs it to be misclassified as "gorilla" by the search engine. When another person (B) searches "gorilla" on Google, the perturbed image would appear, even though it shows a photo of A. This attack fools both A and B, since the perturbed image is used contrary to A's purpose and is not the image B wanted. To prevent this, the social media company, cooperating with the search engine company, could ask whether A agrees to have uploaded images slightly modified to make them robust to such attacks. The purpose of this modification process, corresponding to the "safe spot filter" in Figure 1, is to ensure that the uploaded images are used as A intended and to provide more accurate search results to B.

We develop a novel optimization problem for searching for safe spots in the vicinity of natural images and observe that over 90% of the correctly classified images have safe spots nearby for adversarially trained models on both CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015).
We also find that safe spots can enhance both empirical and certified robustness when applied to smoothed classifiers. Furthermore, we propose a novel safe spot inducing model training scheme to improve preemptive robustness. By exploiting these safe spot-aware classifiers along with our safe spot search method, we also propose a new algorithm for out-of-distribution detection, which is often addressed together with robustness (Hendrycks et al., 2019a;c). Our algorithm outperforms other baselines on near-distribution outlier datasets such as CIFAR-100 (Krizhevsky & Hinton, 2009).

2. RELATED WORK

Out-of-distribution detection with deep networks Although deep networks achieve high performance on various classification tasks, they also tend to yield high confidence on out-of-distribution samples (Nguyen et al., 2015). To filter out such anomalous examples, Hendrycks & Gimpel (2017) use the maximum value of a classifier's softmax distribution as a score function, while Lee et al. (2018) measure the Mahalanobis distance to class-conditional Gaussian distributions fitted on the classifier's features.
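The maximum-softmax-probability baseline of Hendrycks & Gimpel (2017) mentioned above is simple enough to sketch. The following is a minimal, framework-free illustration; the logit values are hypothetical and stand in for a trained classifier's outputs.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def msp_score(logits):
    # Maximum softmax probability: in-distribution inputs tend to
    # receive higher maximum confidence than out-of-distribution ones,
    # so thresholding this score yields a simple OOD detector.
    return max(softmax(logits))

in_dist_logits = [8.0, 0.5, -1.0]   # confident, peaked prediction
ood_logits = [1.1, 0.9, 1.0]        # near-uniform, low-confidence logits
print(msp_score(in_dist_logits))    # close to 1
print(msp_score(ood_logits))        # close to 1/3
```

An input would be flagged as out-of-distribution when its score falls below a threshold tuned on held-out in-distribution data; the paper's detector replaces this score with one built on safe spot search.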



Figure 1: Overview of our proposed framework. The left side shows the web users retrieving wrong results due to the adversarial example. The right side adopts a safe spot filter on the image uploading process and succeeds in defending the query system from the attacker.

Adversarial training Goodfellow et al. (2015) first showed that the robustness of a neural network can be enhanced by generating adversarial examples and including them in the training set. PGD adversarial training improves robustness against stronger adversarial attacks by augmenting the training data with multi-step PGD adversarial examples (Madry et al., 2017). Some recent works report performance gains over PGD adversarial training by modifying the adversarial example generation procedure (Qin et al., 2019; Zhang & Wang, 2019; Zhang et al., 2020). However, most of these recent algorithmic improvements can be matched by simply using early stopping with PGD adversarial training (Rice et al., 2020; Croce & Hein, 2020). Another line of work achieves performance gains by utilizing additional datasets (Carmon et al., 2019; Wang et al., 2020; Hendrycks et al., 2019a).

Randomized smoothing Injecting random noise during the forward pass can smooth the classifier's decision boundary and improve empirical robustness (Liu et al., 2018). Using differential privacy, Lecuyer et al. (2019) give theoretical guarantees for the ℓ1 and ℓ2 robustness of classifiers smoothed with Gaussian and Laplacian noise. Cohen et al. (2019) provide a tight ℓ2 robustness bound for networks smoothed with Gaussian noise via the Neyman-Pearson lemma. Another proof of the robustness bound was given by Salman et al. (2019) using the Lipschitz property of smoothed classifiers, where they also propose a new adversarial training scheme for building robust smoothed classifiers.
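The randomized smoothing prediction described above can be sketched in a few lines. This is a simplified Monte-Carlo illustration, not Cohen et al.'s certified procedure: it plugs the empirical top-class frequency directly into the ℓ2 radius formula R = σ·Φ⁻¹(p) instead of a statistically valid lower bound with abstention, and the base classifier is a hypothetical stand-in.

```python
import random
from statistics import NormalDist

def base_classifier(x):
    # Hypothetical stand-in base classifier: class 1 iff the
    # coordinates sum to a positive value, class 0 otherwise.
    return 1 if sum(x) > 0 else 0

def smoothed_predict(x, sigma, n_samples, seed=0):
    # Monte-Carlo estimate of the smoothed classifier
    #   g(x) = argmax_c P( f(x + N(0, sigma^2 I)) = c ),
    # the object Cohen et al. (2019) certify via the
    # Neyman-Pearson lemma.
    rng = random.Random(seed)
    votes = [0, 0]
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    p_top = max(votes) / n_samples
    # Simplified radius: sigma * Phi^{-1}(p_top), using the raw
    # empirical frequency rather than a proper confidence lower bound.
    radius = (sigma * NormalDist().inv_cdf(p_top)
              if p_top < 1.0 else float("inf"))
    return votes.index(max(votes)), radius

pred, radius = smoothed_predict([1.0, 0.5], sigma=0.5, n_samples=1000)
print(pred, radius)  # predicted class and its (approximate) L2 radius
```

The certified guarantee states that g's prediction cannot change under any ℓ2 perturbation smaller than the radius; the full procedure additionally uses a Binomial confidence bound on p_top and abstains when the vote is too close.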

