ON THE POWER OF ABSTENTION AND DATA-DRIVEN DECISION MAKING FOR ADVERSARIAL ROBUSTNESS Anonymous

Abstract

We formally define a feature-space attack where the adversary can perturb datapoints by arbitrary amounts but in restricted directions. By restricting the attack to a small random subspace, our model provides a clean abstraction for non-Lipschitz networks which map small input movements to large feature movements. We show that, in this setting, classifiers with the ability to abstain are provably more powerful than those that cannot. Specifically, we show that no matter how well-behaved the natural data is, any classifier that cannot abstain will be defeated by such an adversary. However, by allowing abstention, we give a parameterized algorithm with provably good performance against such an adversary when classes are reasonably well-separated in feature space and the dimension of the feature space is high. We further use a data-driven method to set our algorithm parameters to optimize over the accuracy vs. abstention trade-off with strong theoretical guarantees. Our theory has direct applications to the technique of contrastive learning, where we empirically demonstrate the ability of our algorithms to obtain high robust accuracy with only small amounts of abstention in both supervised and self-supervised settings. Our results provide a first formal abstention-based gap, and a first provable optimization for the induced trade-off in an adversarial defense setting.

1. INTRODUCTION

A substantial body of work has shown that deep networks can be highly susceptible to adversarial attacks, in which minor changes to the input lead to incorrect, even bizarre classifications (Nguyen et al., 2015; Moosavi-Dezfooli et al., 2016; Su et al., 2019; Brendel et al., 2018; Shamir et al., 2019). Much of this work has considered ℓp-norm adversarial examples, but there has also been recent interest in exploring adversarial models beyond bounded ℓp-norm (Brown et al., 2018; Engstrom et al., 2017; Gilmer et al., 2018; Xiao et al., 2018; Alaifari et al., 2019). What these results have in common is that changes that are either imperceptible or should be irrelevant to the classification task can lead to drastically different network behavior. One reason for this vulnerability is that typical neural networks are non-Lipschitz: small but adversarial movements in the input space can often produce large perturbations in the feature space. In this work, we consider the question of whether non-Lipschitz networks are intrinsically vulnerable, or whether they can still be made robust to adversarial attack, in an abstract but (we believe) instructive adversarial model. In particular, suppose an adversary, by making an imperceptible change to an input x, can cause its representation F(x) in feature space (the penultimate layer of the network) to move by an arbitrary amount: will such an adversary always win? Clearly, if the adversary can modify F(x) by an arbitrary amount in an arbitrary direction, then yes. But what if the adversary can modify F(x) by an arbitrary amount but only in a random direction (which it cannot control)? In this case, we show an interesting dichotomy: if the classifier must output a classification on any input it is given, then the adversary will still win, no matter how well-separated the classes are in feature space and no matter what decision surface the classifier uses.
However, if the classifier is allowed to abstain, then it can defeat such an adversary so long as natural data of different classes are reasonably well-separated in feature space. Our results hold for generalizations of these models as well, such as adversaries that can modify feature representations in random low-dimensional subspaces, or in directions that are not completely random. More broadly, our results provide a theoretical explanation for the importance of allowing abstention, or selective classification, in the presence of adversarial attack. Apart from providing a useful abstraction for non-Lipschitz feature embeddings, our model may be viewed as capturing an interesting class of real attacks. There are various global properties of an image, such as brightness, contrast, or rotation angle, whose change might be "perceptible but not relevant" to classification tasks; our model can also be viewed as an abstraction of attacks of that nature. Feature-space attacks of other forms, where one can perturb abstract features denoting styles, including interpretable styles such as vivid colors and sharp outlines as well as uninterpretable ones, have also been empirically studied (Xu et al., 2020; Ganeshan & Babu, 2019). An interesting property of our model is that the ability to refuse to predict is critical: any algorithm that always predicts a class label, and therefore cannot abstain, is guaranteed to perform poorly. This provides a first formal hardness result about abstention in adversarial defense, and also a first provable negative result for feature-space attacks. We therefore allow the algorithm to output "don't know" for some examples, which, as a by-product, serves as a detection mechanism for adversarial examples. It also results in an interesting trade-off between robustness and accuracy: by controlling how frequently we refuse to predict, we can trade (robust) precision off against recall.
We also provide results for how to provably optimize for such a trade-off using a data-driven algorithm. Our strong theoretical advances are backed by empirical evidence in the context of contrastive learning (He et al., 2020; Chen et al., 2020; Khosla et al., 2020) .

1.1. OUR CONTRIBUTIONS

Our work tackles the problem of defending against adversarial perturbations in a random feature subspace, and advances the theory and practice of robust machine learning in multiple ways.
• We introduce a formal model that captures feature-space attacks and the effect of the non-Lipschitzness of deep networks, which can magnify input perturbations.
• We begin our analysis with a hardness result on defending against this adversary without the option of "don't know". We show that any classifier that partitions the feature space into two or more classes, and thus cannot abstain, is provably vulnerable to adversarial examples for at least one class of examples with nearly half probability.
• We explore the power of the abstention option: a variant of the nearest-neighbor classifier with the ability to abstain is provably robust against adversarial attacks, even in the presence of outliers in the training data set. We characterize the conditions under which the algorithm does not output "don't know" too often.
• We leverage and extend dispersion techniques from data-driven decision making, and present a novel data-driven method for learning data-specific optimal hyperparameters of our defense algorithms to simultaneously obtain high robust accuracy and low abstention rates. Unlike typical hyperparameter tuning, our approach provably converges to a global optimum.
• Experimentally, we show that our proposed algorithm achieves certified adversarial robustness on representations learned by supervised and self-supervised contrastive learning. Our method significantly outperforms algorithms without the ability to abstain.

2. RELATED WORK

Adversarial robustness with abstention options. Classification with an abstention option (a.k.a. selective classification (Geifman & El-Yaniv, 2017)) is a relatively less explored direction in adversarial machine learning. Hosseini et al. (2017) augmented the output class set with a NULL label and trained the classifier to reject adversarial examples by classifying them as NULL; Stutz et al. (2020) and Laidlaw & Feizi (2019) obtained robustness by rejecting low-confidence adversarial examples according to confidence thresholding or predictions on the perturbations of adversarial examples. Another related line of research is the detection of adversarial examples (Grosse et al., 2017; Li & Li, 2017; Carlini & Wagner, 2017; Ma et al., 2018; Meng & Chen, 2017; Metzen et al., 2017; Bhagoji et al., 2018; Xu et al., 2017; Hu et al., 2019). However, a theoretical understanding behind the empirical success of adversarial defenses with an abstention option remains elusive.
Data-driven decision making. Data-driven algorithm selection refers to choosing a good algorithm from a parameterized family of algorithms for given data. It is known as "hyperparameter tuning" to machine learning practitioners and typically involves a "grid search", "random search" (Bergstra & Bengio, 2012) or gradient-based search, with no guarantees of convergence to a global optimum. It was formally introduced to the theory of computing community by Gupta & Roughgarden (2017) as a learning paradigm, and was further extended in (Balcan et al., 2017). The key idea is to model the problem of identifying a good algorithm from data as a statistical learning problem. The technique has found useful applications in providing provably better algorithms for several domains, including clustering, mechanism design, and mixed integer programs, and in providing guarantees such as differential privacy and adaptive online learning (Balcan et al., 2018a;b; 2020). For learning in an adversarial setting, we provide the first demonstration of the effectiveness of data-driven algorithm selection in a defense method to optimize over the accuracy-abstention trade-off with strong theoretical guarantees.

3. PRELIMINARIES

Notation. We will use bold lower-case letters such as x and y to represent vectors, lower-case letters such as x and y to represent scalars, and calligraphic capital letters such as X, Y and D to represent sets and distributions. Specifically, we denote by x ∈ X the sample instance and by y ∈ Y the label, where X ⊆ R^{n1} and Y indicate the image and label spaces, respectively. Denote by F : X → R^{n2} the feature embedding which maps an instance to a high-dimensional vector in the latent space F(X); it can be parameterized, e.g., by a deep neural network. We will frequently use v ∈ R^{n2} to represent an adversarial perturbation in the feature space. Denote by dist(·, ·) the distance between any two vectors in the image or feature space; an example is the distance induced by a vector norm, dist(x1, x2) = ‖x1 − x2‖. We use B(x, τ) to represent a neighborhood {x′ : dist(x, x′) ≤ τ} of x in the image or feature space. We will frequently denote by D_X the distribution of instances in the input space, by D_{X|y} the distribution of instances in the input space conditioned on the class y, by D_{F(X)} the distribution of features, and by D_{F(X)|y} the distribution of features conditioned on the class y.

3.1. RANDOM FEATURE SUBSPACE THREAT MODEL

In principle, the adversarial example for given labeled data (x, y) is a data point x′ that causes a classifier to output a different label on x′ than the true label y. One of the most widely studied adversarial examples is the norm-bounded perturbation in the input space. Despite a large literature devoted to defending against norm-bounded adversaries by improving the Lipschitzness of the neural network as a function from input space to feature space (Zhang et al., 2019; Yang et al., 2020), it is typically not true that a small perturbation in the input space implies only a small modification in the feature space. In this paper, we study a threat model where an adversary can modify the data by a large amount in the feature space. Because this large modification in feature space is assumed to come from a small perturbation in input space, we always assume that the true label y is the same for x′ as for x. Our model highlights the power of abstention in adversarial learning: there is a provable separation between having and not having an abstention option under our threat model. Our threat model. In the setting of (robust) representation learning, we are given a set of training instances x1, ..., xm ∈ X. Let x be an n1-dimensional test input for classification. The input is embedded into a high, n2-dimensional feature space by a deep neural network F. We predict the class of x by a prediction function on F(x) which can potentially output "don't know". The adversary may corrupt F(x) such that the modified feature vector is restricted to a random n3-dimensional affine subspace denoted by S + {F(x)}, while the perturbation magnitude may be arbitrarily large. The adversary is given access to everything, including F, x, S and the true label of x. Throughout the paper, "adversary" and "adversarial example" refer to this threat model.
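To make the threat model concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) samples a uniformly random n3-dimensional subspace of the feature space and applies an arbitrarily large perturbation to a feature vector restricted to that subspace; all function names are ours.

```python
import numpy as np

def sample_random_subspace(n2, n3, rng):
    """Sample an orthonormal basis of a random n3-dimensional linear
    subspace of R^{n2} (the columns of Q span the subspace)."""
    G = rng.standard_normal((n2, n3))
    Q, _ = np.linalg.qr(G)  # QR of a Gaussian matrix: uniformly random span
    return Q

def perturb(feature, Q, coeffs):
    """Adversarial example restricted to the affine subspace S + {F(x)}:
    the perturbation is an arbitrary combination of the basis directions."""
    return feature + Q @ coeffs

rng = np.random.default_rng(0)
n2, n3 = 512, 4                      # feature dim and subspace dim
Q = sample_random_subspace(n2, n3, rng)
x_feat = rng.standard_normal(n2)     # stand-in for F(x)
# perturbation magnitude is unconstrained, only the directions are fixed
adv = perturb(x_feat, Q, 100.0 * rng.standard_normal(n3))
```

Note that the adversary controls only the coefficients, not the subspace itself, matching the model above.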
Algorithm 1 ROBUSTCLASSIFIER(τ, σ)
1: Input: A test feature F(x) (potentially an adversarial example), a set of training features F(x_i) and their labels y_i, i ∈ [m], a threshold parameter τ, a separation parameter σ.
2: Preprocessing: Delete training example F(x_i) if min_{j∈[m], y_i≠y_j} dist(F(x_i), F(x_j)) < σ.
3: Output: A predicted label for F(x), or "don't know".
4: if min_{i∈[m]} dist(F(x), F(x_i)) < τ then
5:   Return y_{arg min_{i∈[m]} dist(F(x), F(x_i))}
6: else
7:   Return "don't know"
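A direct NumPy rendering of Algorithm 1 can be sketched as follows (an illustration of ours, with the separation parameter written as `sigma` and "don't know" encoded as `None`):

```python
import numpy as np

def robust_classifier(test_feat, train_feats, train_labels, tau, sigma):
    """Sketch of ROBUSTCLASSIFIER(tau, sigma): predict the label of the
    nearest surviving training feature if it is within distance tau,
    otherwise abstain. Preprocessing deletes any training feature whose
    nearest differently labeled feature is closer than sigma."""
    train_feats = np.asarray(train_feats, dtype=float)
    train_labels = np.asarray(train_labels)
    keep = []
    for i in range(len(train_feats)):
        diff = train_labels != train_labels[i]
        d = np.linalg.norm(train_feats[diff] - train_feats[i], axis=1)
        if d.size == 0 or d.min() >= sigma:
            keep.append(i)
    feats, labels = train_feats[keep], train_labels[keep]
    # inference: nearest neighbor with abstention threshold tau
    # (assumes at least one training point survives preprocessing)
    dists = np.linalg.norm(feats - np.asarray(test_feat, dtype=float), axis=1)
    j = int(np.argmin(dists))
    return labels[j] if dists[j] < tau else None  # None means "don't know"
```

With sigma = 0 the preprocessing is a no-op, recovering the ROBUSTCLASSIFIER(τ, 0) variant analyzed in Theorem 5.1.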

3.2. A META-ALGORITHM FOR INFERENCE-TIME ROBUSTNESS

Given a test point x, let r denote the shortest distance between F(x) and any training embedding F(x_i) of a different label. Throughout the paper, we consider the prediction rule that classifies an unseen (and potentially adversarially modified) example with the class of its nearest training example, provided that the distance between them is at most τ; otherwise the algorithm outputs "don't know" (see Algorithm 1 and Figure 2). The adversary is able to corrupt F(x) by a carefully-crafted perturbation along a random direction, i.e., F(x) + v, where v is an adversarial vector of arbitrary length in a random n3-dimensional subspace of R^{n2}. The parameter τ trades the success rate off against the abstention rate; as τ → ∞, our algorithm reduces to the nearest-neighbor algorithm. We also preprocess to remove outliers and points too close to them.

4. NEGATIVE RESULTS WITHOUT AN ABILITY TO ABSTAIN

For feature-space attacks, several empirical negative results are known (Xu et al., 2020; Ganeshan & Babu, 2019). We present a hardness result concerning defenses without an ability to abstain, and prove that such defenses are inevitably doomed against our feature-space attacks. Theorem 4.1.
For any classifier that partitions R^{n2} into two or more classes, any data distribution D, any δ > 0 and any feature embedding F, there must exist at least one class y*, such that for at least a 1 − δ probability mass of examples x from class y* (i.e., x is drawn from D_{X|y*}), for a random unit-length vector v, with probability at least 1/2 − δ there is some γ > 0 such that F(x) + γv is not labeled y* by the classifier. In other words, there must be at least one class y* such that for at least a 1 − δ probability mass of points x of class y*, the adversary wins with probability at least 1/2 − δ.
Proof. Without loss of generality, we assume that the feature embedding F is the identity mapping. Define r_δ to be a radius such that for every class y, at least a 1 − δ probability mass of examples x of class y lie within distance r_δ of the origin. Let R = r_δ √n2/δ; R is defined to be large enough that if we take a ball of radius R and move it by a distance r_δ, at least a 1 − δ fraction of the volume of the new ball lies inside the intersection with the old ball. Now, let B be the ball of radius R centered at the origin. Let vol(B) denote the volume of B and let vol_y(B) denote the volume of the subset of B that is assigned label y by the classifier. Let y* be any label such that vol_{y*}(B)/vol(B) ≤ 1/2; such a class y* exists because the classifier does not have the option to output "don't know". By the definition of y*, a point z picked uniformly at random from B has probability at least 1/2 of being classified differently from y*. This implies that, by the definition of R, if x is within distance r_δ of the origin, then a point z_x picked uniformly at random from the ball B_x of radius R centered at x has probability at least 1/2 − δ of being classified differently from y*.
This immediately implies that if we choose a random unit-length vector v, then with probability at least 1/2 − δ there exists γ > 0 such that x + γv is classified differently from y*, since we can think of choosing v by first sampling z_x from B_x and then defining v = (z_x − x)/‖z_x − x‖2. The theorem then follows from the fact that, by the definition of r_δ, at least a 1 − δ probability mass of examples x from class y* lie within distance r_δ of the origin. ∎
We remark that our lower bound applies to any classifier and exploits the fact that a classifier without abstention must label the entire feature space. For a simple linear decision boundary (center of Figure 3), a perturbation in any direction (except parallel to the boundary) can cross the boundary with an appropriate magnitude. The left and right figures show that if we try to 'bend' the decision boundary to 'protect' one of the classes, the other class is still vulnerable. Our argument formalizes and generalizes this intuition, and shows that there must be at least one vulnerable class, on which the adversary succeeds in a large fraction of directions, irrespective of how one shapes the class boundaries. Theorem 4.1 implies that every classifier that partitions R^{n2} into two or more classes, and thus cannot abstain, is vulnerable to adversarial examples for at least one class of data with nearly half probability. Although much effort has been devoted to empirically investigating the power of "don't know" for adversarial robustness, a theoretical understanding behind the empirical success of these methods remains elusive. To the best of our knowledge, ours is the first result that provably demonstrates the power of "don't know" in the design of adversarially robust classifiers.
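The geometric intuition above is easy to check numerically. The following Monte Carlo sketch (ours, purely illustrative) considers a linear classifier and verifies that a perturbation of unbounded length along a uniformly random direction crosses the boundary about half the time:

```python
import numpy as np

# Monte Carlo illustration of the intuition behind Theorem 4.1 (not the proof):
# for a linear classifier sign(w.z) in feature space, an unbounded-length
# perturbation along a uniformly random direction v flips the predicted class
# of a negatively classified point iff w.v > 0, which happens w.p. 1/2.
rng = np.random.default_rng(1)
n2 = 16
w = rng.standard_normal(n2)           # normal vector of the decision boundary
x = -5.0 * w / np.linalg.norm(w)      # a point far on the negative side
assert w @ x < 0                      # x is classified negative
trials, wins = 20000, 0
for _ in range(trials):
    v = rng.standard_normal(n2)
    v /= np.linalg.norm(v)
    wins += (w @ v > 0)               # some gamma > 0 flips sign(w.(x + gamma v))
rate = wins / trials                  # empirically close to 1/2
```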

5. POSITIVE RESULTS WITH AN ABILITY TO ABSTAIN

Theorem 4.1 gives a hardness result for robust classification without abstention. In this section, we explore the power of abstaining and show that classifiers with an ability to abstain are provably robust. Given a test instance x ∼ D_X, recall that r denotes the shortest distance between F(x) ∈ R^{n2} and any training embedding F(x_i) ∈ R^{n2} with a different label. The adversary is allowed to corrupt F(x) with an arbitrarily large perturbation in a uniformly distributed subspace S of dimension n3. Consider the prediction rule that classifies the unseen example F(x) ∈ R^{n2} with the class of its nearest training example provided that the distance between them is at most τ; otherwise the algorithm outputs "don't know" (Algorithm 1 with σ = 0). Denote by
E^x_adv(f) := E_{S∼𝒮} 1{∃ e ∈ S + F(x) ⊆ R^{n2} s.t. f(e) ≠ y and f(e) does not abstain}
the robust error of a given classifier f for classifying instance x. Our analysis leads to the following positive result for this algorithm.
Theorem 5.1. Let x ∼ D_X be a test instance, let m be the number of training examples, and let r be the shortest distance between F(x) and F(x_i), where x_i is a training point from a different class. Suppose τ = o(r √(1 − n3/n2)). The robust error of Algorithm 1 satisfies
E^x_adv(ROBUSTCLASSIFIER(τ, 0)) ≤ m (cτ / (r √(1 − n3/n2)))^{n2 − n3} + m c0^{n2 − n3},
where c > 0 and 0 < c0 < 1 are absolute constants.
Proof Sketch. We begin our analysis with the case n3 = 1. Suppose we have a training example x′ of another class, and suppose F(x) and F(x′) are at distance D in the feature space. Because τ = o(D), the probability that the adversary can move F(x) to within distance τ of F(x′) is roughly the ratio of the surface area of a sphere of radius τ to that of a sphere of radius D, which is at most O((τ/D)^{n2 − 1}) ≤ O((τ/r)^{n2 − 1}).
The analysis for the general case of n3 follows from a peeling argument: the random subspace in which the adversarial vector is restricted to lie can be constructed by first sampling a vector v1 uniformly at random from a unit sphere in the ambient space R^{n2} centered at 0; fixing v1, we then sample a vector v2 uniformly at random from a unit sphere in the null space of span{v1}; we repeat this procedure n3 times and let span{v1, v2, ..., v_{n3}} be the desired adversarial subspace. For each step of the construction, we apply the same argument as for n3 = 1 with D = Ω(r √((n2 − i)/n2)), which holds with high probability if we project F(x) and F(x′) onto a random subspace of dimension n2 − i. Finally, a union bound over the m training points completes the proof. ∎
Trade-off between success probability and abstention rate. Theorem 5.1 captures the trade-off between the success probability of the algorithm and its abstention rate: a smaller value of τ increases the success probability, but it also makes Algorithm 1 output "don't know" more often. A related line of research concerns the trade-off between robustness and accuracy: Zhang et al. (2019) and Tsipras et al. (2019) showed that there might be no predictor in the hypothesis class that has both low natural and low robust error; even when such a predictor exists for well-separated data (Yang et al., 2020), Raghunathan et al. (2020) showed that adversarial training can increase the natural error when we only have finitely many samples. To connect the two trade-offs, we note that a high success probability of ROBUSTCLASSIFIER(τ, 0) in Algorithm 1 prevents the algorithm from predicting wrong labels on adversarial examples, while the associated high abstention rate makes the algorithm output "don't know" even on natural examples, leading in the extreme to a trivial, inaccurate classifier.
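The n3 = 1 case of the proof sketch can be verified numerically: the chance that a random line through F(x) passes within τ of a fixed point at distance D decays like (τ/D)^{n2−1}. The following Monte Carlo sketch (ours, using a small n2 = 3 so the event is not vanishingly rare) illustrates this:

```python
import numpy as np

# Monte Carlo check of the n3 = 1 case in the proof sketch of Theorem 5.1:
# the probability that a random line through F(x) (placed at the origin)
# passes within tau of a wrong-class feature at distance D scales like
# (tau/D)^(n2-1), so in high dimension the adversary almost never reaches a
# wrong-class neighborhood.
rng = np.random.default_rng(2)
n2, D, tau = 3, 1.0, 0.2
target = np.zeros(n2)
target[0] = D                          # wrong-class feature at distance D
trials, hits = 50000, 0
for _ in range(trials):
    v = rng.standard_normal(n2)
    v /= np.linalg.norm(v)
    # distance from target to the line {t * v : t in R} is the norm of the
    # component of target orthogonal to v
    line_dist = np.linalg.norm(target - (target @ v) * v)
    hits += (line_dist <= tau)
rate = hits / trials                   # about 1 - sqrt(1 - (tau/D)^2), ~0.02
```

For n2 = 512 (the feature dimension used in Section 7) the same probability would be astronomically small, which is the source of the robustness guarantee.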

5.1. A MORE GENERAL ADVERSARY WITH BOUNDED DENSITY

We extend our results to a more general class of adversaries that have a κ-bounded distribution over the space of linear subspaces of a fixed dimension n3, and can perturb a test feature vector arbitrarily within the sampled adversarial subspace. Theorem 5.2. Consider the setting of Theorem 5.1, with an adversary having a κ-bounded distribution over the space of linear subspaces of a fixed dimension n3 for perturbing the test point. If E(τ, r) denotes the bound on the error rate of ROBUSTCLASSIFIER(τ, 0) in Theorem 5.1, then the error bound of the same algorithm against the κ-bounded adversary is O(κ E(τ, r)).

5.2. OUTLIER REMOVAL AND IMPROVED UPPER BOUND

The upper bounds above assume that the data are well-separated in the feature space. For noisy data and good-but-not-perfect embeddings, this condition may not hold. In Theorem E.1 (in Appendix E) we show that almost the same upper bound on the failure probability holds under weaker assumptions, by exploiting the noise-removal threshold σ.

5.3. CONTROLLING ABSTENTION RATE ON NATURAL DATA

We show that we can control the frequency of outputting "don't know" when the data are nicely distributed according to the following generative assumption. Intuitively, it says that for every label class one can cover most of the distribution of the class with (potentially overlapping) balls of a fixed radius, each containing at least a small amount of probability mass. This holds for well-clustered datasets (as is typical for feature data) for a sufficiently large radius.
Assumption 1. We assume that at least a 1 − ε fraction of the mass of the marginal distribution D_{F(X)|y} over R^{n2} can be covered by N balls B1, B2, ..., BN of radius τ/2, each of mass Pr_{D_{F(X)}}[Bk] ≥ (C0/m)(n2 log m + log(4N/δ)), where C0 > 0 is an absolute constant and ε, δ ∈ (0, 1).
Our analysis leads to the following guarantee on the abstention rate.
Theorem 5.3. Suppose that F(x1), ..., F(xm) are m training instances sampled i.i.d. from the marginal distribution D_{F(X)}. Under Assumption 1, with probability at least 1 − δ/4 over the sampling, we have Pr(∪_{i=1}^m B(F(xi), τ)) ≥ 1 − ε.
Theorem 5.3 implies that when Pr[Bk] ≥ ε/N and m = Ω((n2 N/ε) log(n2 N/ε)), with probability at least 1 − δ/4 over the sampling, we have Pr(∪_{i=1}^m B(F(xi), τ)) ≥ 1 − ε. Therefore, with high probability, the algorithm outputs "don't know" on at most an ε fraction of natural data.

6. LEARNING DATA-SPECIFIC OPTIMAL THRESHOLDS

Given an embedding function F and a classifier f_τ that either outputs a predicted class, if the nearest training neighbor is within distance τ of the test point, or abstains from predicting, we want to evaluate the performance of f_τ on a test set T against an adversary that can perturb a test feature vector in a random subspace S ∼ 𝒮. To this end, we define
E_adv(τ) := E_{S∼𝒮} (1/|T|) Σ_{(x,y)∈T} 1{∃ e ∈ S + F(x) ⊆ R^{n2} s.t. f_τ(e) ≠ y and f_τ(e) does not abstain}
as the robust error on the test set T, and
D_nat(τ) := (1/|T|) Σ_{(x,y)∈T} 1{f_τ(F(x)) abstains}
as the abstention rate on the natural data. E_adv(τ) and D_nat(τ) are monotonic in τ. The robust error E_adv(τ) is minimized at τ = 0, but then we abstain from prediction all the time (i.e., D_nat(τ) = 1). A simple approach is to fix an upper limit d* on D_nat(τ), corresponding to the maximum abstention rate on natural data under our budget. It is then straightforward to search for the optimal τ* such that D_nat(τ*) ≈ d* using the nearest-neighbor distances of the test points. For τ < τ* we have a higher abstention rate, and for τ > τ* we have a higher robust error rate. A potential problem with this approach is that D_nat(τ) is non-Lipschitz, so a small variation in τ can make the abstention rate significantly higher than d*. An alternative objective that captures the trade-off between abstention rate and accuracy is g(τ) := E_adv(τ) + c D_nat(τ), where c is a positive constant. If, for example, we are willing to accept a one percent increase in the abstention rate for a two percent drop in the error rate, we could set c to 1/2. We can optimize g(τ) in a data-driven fashion and obtain a theoretical guarantee of convergence to a global optimum. In the following, we consider the case where the test examples appear online in small batches of size b, and we set the threshold τ adaptively using a low-regret algorithm.
We note in Corollary 6.3, using online-to-batch conversion, that our results imply a uniform convergence bound for the objective g(τ) in the supervised setting. Details of the proofs in this section can be found in Appendix H. The significance of data-driven design in this setting is underlined by two observations. First, as noted above, optimizing τ is difficult due to the non-Lipschitzness of D_nat(τ) and the intractability of characterizing the objective g(τ) exactly due to E_adv(τ). Second, the optimal value of τ can be a complex function of the data geometry and the sampling rate. We illustrate this by exactly computing the optimal τ in a simple intuitive setting: consider a binary classification problem where the features lie uniformly on two one-dimensional manifolds embedded in two dimensions (i.e., n2 = 2; see Figure 4). Assume that the adversary perturbs in a uniformly random direction (n3 = 1). For this setting, in Appendix J we show the following.
Theorem 6.1. Let τ* := arg min_{τ ∈ R+} g(τ) and λ := 2πcr/D. For the setting considered above, if we further assume D = o(r) and m = ω(log λ), then there is a unique value of τ* in [0, D/2).

Furthermore, we have

τ* = Θ(D log(λm)/m) if λm > 1; otherwise, τ* = 0.
Remark 1. The results can be generalized to a bounded-density adversary (Corollary H.3).
Remark 2. The above analysis can be extended to the problem of optimizing over σ by formulating the objective as a function of two parameters, g(τ, σ) := E_adv(τ, σ) + c D_nat(τ, σ), with σ in a range [r, s]. For fixed τ, both E_adv(τ, σ) and D_nat(τ, σ) are piecewise constant and monotonic in σ. The proof of Lipschitzness of the pieces can be adapted easily to this case (Lemma H.2). Discontinuities in E_adv(τ, ·) and D_nat(τ, ·) can be bounded using the upper bound s on σ (Lemma H.4). Finally, the number of discontinuities of g(τ, σ) in a ball of radius w can be upper bounded by the product of the numbers of discontinuities of g(τ, ·) and g(·, σ) in intervals of width w.
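As a minimal illustration of the fixed-abstention-budget baseline discussed at the start of this section (not the data-driven low-regret algorithm itself), one can set τ to a quantile of held-out nearest-neighbor distances; the helper name and the synthetic data below are our own:

```python
import numpy as np

# Sketch of the simple budgeted approach from Section 6: since f_tau abstains
# exactly when the nearest training feature is farther than tau, setting tau
# to the (1 - d*) quantile of nearest-neighbor distances on held-out natural
# data makes the abstention rate D_nat(tau) approximately the budget d*.
def threshold_for_budget(held_out_feats, train_feats, d_star):
    nn_dists = [np.min(np.linalg.norm(train_feats - t, axis=1))
                for t in held_out_feats]
    return float(np.quantile(nn_dists, 1.0 - d_star))

rng = np.random.default_rng(3)
train = rng.standard_normal((200, 8))       # stand-ins for training features
held_out = rng.standard_normal((500, 8))    # held-out natural features
tau_star = threshold_for_budget(held_out, train, d_star=0.05)
```

The non-Lipschitzness issue noted above is visible here: the chosen τ* sits exactly at a quantile of the empirical distance distribution, so a small shift can change the realized abstention rate discontinuously.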

7. EXPERIMENTS ON CONTRASTIVE LEARNING

Theorem 5.1 sheds light on algorithmic designs for robust learning of the feature embedding F. To preserve robustness against adversarial examples for a given test point x, the theorem suggests minimizing τ, the closest distance between F(x) and any training feature F(x_i) with the same label, and maximizing r, the closest distance between F(x) and any training feature F(x_i) with a different label. This is conceptually consistent with the spirit of the nearest-neighbor algorithm, and with contrastive learning once we replace the max operator with a softmax for differentiable training:
min_F −(1/m) Σ_{i∈[m]} log( Σ_{j∈[m], j≠i, y_i=y_j} e^{−‖F(x_i) − F(x_j)‖²/T} / Σ_{k∈[m], k≠i} e^{−‖F(x_i) − F(x_k)‖²/T} ),    (1)
where T > 0 is the temperature parameter. Loss (1) is also known as the soft-nearest-neighbor loss in the context of supervised learning (Frosst et al., 2019), or the InfoNCE loss in the setting of self-supervised learning (He et al., 2020).
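For concreteness, Loss (1) can be written in a few lines of NumPy; this is an illustrative sketch of the soft-nearest-neighbor loss, not the training code used in the experiments:

```python
import numpy as np

def soft_nearest_neighbor_loss(feats, labels, T=0.5):
    """Loss (1): for each anchor i, -log of the ratio between the sum of
    exp(-||F(x_i) - F(x_j)||^2 / T) over same-class j != i (numerator) and
    over all k != i (denominator), averaged over the m anchors."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    m = len(feats)
    sq_dists = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    sims = np.exp(-sq_dists / T)        # similarity weights at temperature T
    total = 0.0
    for i in range(m):
        others = np.arange(m) != i
        same = others & (labels == labels[i])
        if not same.any():
            continue  # anchors with no positive pair contribute nothing
        total -= np.log(sims[i][same].sum() / sims[i][others].sum())
    return total / m
```

The loss is small when same-class features are clustered tightly relative to different-class features, matching the small-τ, large-r geometry that Theorem 5.1 rewards.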

7.1. CERTIFIED ADVERSARIAL ROBUSTNESS AGAINST EXACT COMPUTATION OF ATTACKS

We verify the robustness of Algorithm 1 when the representations are learned by contrastive learning. Given an embedding function F and a classifier f which outputs either a predicted class or abstains from predicting, recall that we define the natural and robust errors, respectively, as
E_nat(f) := E_{(x,y)∼D} 1{f(F(x)) ≠ y and f(F(x)) does not abstain}, and
E_adv(f) := E_{(x,y)∼D, S∼𝒮} 1{∃ e ∈ S + F(x) ⊆ R^{n2} s.t. f(e) ≠ y and f(e) does not abstain},
where S ∼ 𝒮 is a random adversarial subspace of R^{n2} with dimension n3, and D_nat(f) := E_{(x,y)∼D} 1{f(F(x)) abstains} is the abstention rate on natural examples. Note that the robust error is always at least as large as the natural error.
Self-supervised contrastive learning setup. Our experimental setup follows that of SimCLR (Chen et al., 2020). We use the ResNet-18 architecture (He et al., 2016) for representation learning with a two-layer projection head of width 128. The dimension of the representations is 512. We use batch size 512, temperature T = 0.5, and an initial learning rate of 0.5 followed by cosine learning rate decay. We sequentially apply four simple augmentations: random cropping followed by resizing back to the original size, random flipping, random color distortion, and random conversion to grayscale with probability 0.2. In the linear evaluation protocol, we use batch size 512 and learning rate 1.0 to learn a linear classifier in the feature space by empirical risk minimization.
Supervised contrastive learning setup. Our experimental setup follows that of Khosla et al. (2020). We use the ResNet-18 architecture for representation learning with a two-layer projection head of width 128. The dimension of the representations is 512. We use batch size 512, temperature T = 0.1, and an initial learning rate of 0.5 followed by cosine learning rate decay.
We sequentially apply four simple augmentations: random cropping followed by resizing back to the original size, random flipping, random color distortion, and random conversion to grayscale with probability 0.2. In the linear evaluation protocol, we use batch size 512 and learning rate 5.0 to learn a linear classifier in the feature space by empirical risk minimization. In both the self-supervised and supervised setups, we compare the robustness of the linear protocol with that of our defense protocol in Algorithm 1 under exact computation of adversarial examples, using a convex optimization program in n3 dimensions with m constraints; Algorithm 4 in the appendix provides an efficient implementation of the attack.
Experimental results. We summarize our results in Table 1. Compared with the linear protocol, our algorithms achieve much lower robust error. Note that even if abstention is added based on distance from the linear boundary, sufficiently large perturbations ensure that the adversary can always succeed. For an approximate adversary which can be efficiently implemented for large n3, see Appendix L.2.

7.2. ROBUSTNESS-ABSTENTION TRADE-OFF

The threshold parameter $\tau$ captures the trade-off between the robust accuracy $\mathcal{A}_{\mathrm{adv}} := 1 - \mathcal{E}_{\mathrm{adv}}$ and the abstention rate $\mathcal{D}_{\mathrm{nat}}$ on the natural data. We report both metrics for different values of $\tau$ for supervised and self-supervised contrastive learning. The supervised setting enjoys higher adversarial accuracy and a smaller abstention rate at fixed $\tau$ due to the use of extra label information. We plot $\mathcal{A}_{\mathrm{adv}}$ against $\mathcal{D}_{\mathrm{nat}}$ for Algorithm 1 as the hyperparameters vary. For small $\tau$, both the accuracy and the abstention rate approach 1.0. As the threshold increases, the abstention rate decreases rapidly, and our algorithm enjoys good accuracy even at small abstention rates. For $\tau \to \infty$ (i.e., plain nearest-neighbor search), the abstention rate on the natural data $\mathcal{D}_{\mathrm{nat}}$ is 0% but the robust accuracy is also roughly 0%. Increasing the outlier removal threshold (while it is small) yields higher robust accuracy at the same abstention rate; making it too large can also degrade performance.
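The sweep over $\tau$ can be sketched with a simplified stand-in for Algorithm 1 (a thresholded nearest-neighbor rule; the helper names below are hypothetical): predict the label of the nearest training feature only if it lies within distance $\tau$, otherwise abstain. Small $\tau$ gives high abstention, $\tau \to \infty$ recovers plain nearest-neighbor search with zero abstention:

```python
import numpy as np

def thresholded_nn(train_feats, train_labels, z, tau):
    """Nearest-neighbor rule with abstention: return the label of the nearest
    training feature, or None when that feature is farther than tau."""
    d = np.linalg.norm(train_feats - z, axis=1)
    j = int(np.argmin(d))
    return None if d[j] > tau else int(train_labels[j])

def tradeoff_curve(train_feats, train_labels, test_feats, test_labels, taus):
    """Sweep tau; record (tau, natural accuracy, abstention rate D_nat)."""
    curve = []
    for tau in taus:
        preds = [thresholded_nn(train_feats, train_labels, z, tau)
                 for z in test_feats]
        abstain = float(np.mean([p is None for p in preds]))
        correct = float(np.mean([p == t for p, t in zip(preds, test_labels)]))
        curve.append((tau, correct, abstain))
    return curve
```

Plotting the second coordinate against the third, for robust rather than natural accuracy, traces the trade-off curve discussed above.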



Figure 1: Illustration of a non-Lipschitz feature mapping using a deep network.

augmented the output class set with a NULL label and trained the classifier to reject adversarial examples by classifying them as NULL; Stutz et al. (2020) and Laidlaw & Feizi (2019) obtained robustness by rejecting low-confidence adversarial examples, either via confidence thresholding or via predictions on perturbations of the adversarial examples.


Figure 2: Adversarial misclassification for nearest-neighbor predictor.

Figure 3: A simple example to illustrate Theorem 4.1.

Figure 4: A simple intuitive example where we compute the optimal value of the abstention threshold exactly. Classes A and B are both distributed uniformly on one-dimensional segments of length $D$, embedded collinearly at distance $r$ in $\mathbb{R}^2$.

Figure 5: Adversarial accuracy (i.e., rate of adversary failure) vs. abstention rate as the threshold $\tau$ varies, for $n_3 = 1$ and different outlier removal thresholds.

Theorem 6.2. Assume $\tau$ is $o(\min\{m^{1/n_2}, r\})$, and the data distribution is continuous, bounded, positive, and has bounded partial derivatives. If $\tau$ is set using a continuous version of the multiplicative updates algorithm (Algorithm 2 in Appendix H; Balcan et al., 2018a), then with probability at least distance between any two training points, $b$ is the batch size, and $r$ is the smallest distance between points of different labels.

Corollary 6.3. Suppose we run the online algorithm of Theorem 6.2 on a validation set of size $T$, and use a randomized threshold $\tau$ on the test set drawn from a uniform distribution over the thresholds $\tau_1, \ldots, \tau_T$ used in online learning. If the threshold which maximizes $g(\tau)$ is $\tau^*$, then with probability
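The paper's Algorithm 2 is a continuous version of multiplicative updates; as an illustration only, here is a discretized exponential-weights sketch over a finite grid of candidate thresholds (the function names and the full-information reward oracle `reward_fn` are assumptions for this sketch, not the paper's interface). Each round, a threshold is sampled in proportion to its weight, and every candidate's weight is multiplied by $e^{\eta\, g(\tau)}$ for its observed reward $g(\tau) \in [0,1]$:

```python
import numpy as np

def exp_weights_threshold(taus, reward_fn, T, eta=0.5, rng=None):
    """Discretized multiplicative-updates sketch for tuning the abstention
    threshold tau online. reward_fn(tau, t) is the round-t reward g(tau) in
    [0, 1], e.g. accuracy minus an abstention penalty on a validation batch.
    Returns the sequence of played thresholds and the final distribution."""
    rng = rng or np.random.default_rng(0)
    w = np.ones(len(taus))                 # one weight per candidate tau
    played = []
    for t in range(T):
        p = w / w.sum()
        i = rng.choice(len(taus), p=p)     # sample tau_t proportionally to w
        played.append(taus[i])
        g = np.array([reward_fn(tau, t) for tau in taus])  # full information
        w *= np.exp(eta * g)               # multiplicative update
    return played, w / w.sum()
```

In the spirit of Corollary 6.3, a randomized test-time threshold can then be drawn uniformly from the `played` sequence.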

Table 1: Natural error $\mathcal{E}_{\mathrm{nat}}$ and robust error $\mathcal{E}_{\mathrm{adv}}$ on the CIFAR-10 dataset when $n_3 = 1$ and the 512-dimensional representations are learned by contrastive learning, where $\mathcal{D}_{\mathrm{nat}}$ represents the fraction of each algorithm's "don't know" outputs on the natural data. We report values for thresholds $\approx \tau$ as they tend to give a good abstention-error trade-off.

