ON THE POWER OF ABSTENTION AND DATA-DRIVEN DECISION MAKING FOR ADVERSARIAL ROBUSTNESS

Anonymous

Abstract

We formally define a feature-space attack in which the adversary can perturb datapoints by arbitrary amounts but only in restricted directions. By restricting the attack to a small random subspace, our model provides a clean abstraction for non-Lipschitz networks, which map small input movements to large feature movements. We prove that, in this setting, classifiers with the ability to abstain are strictly more powerful than those without it. Specifically, we show that no matter how well-behaved the natural data is, any classifier that cannot abstain will be defeated by such an adversary. By allowing abstention, however, we give a parameterized algorithm with provably good performance against such an adversary when classes are reasonably well-separated in feature space and the dimension of the feature space is high. We further use a data-driven method to set our algorithm's parameters to optimize over the accuracy vs. abstention trade-off with strong theoretical guarantees. Our theory has direct applications to the technique of contrastive learning, where we empirically demonstrate the ability of our algorithms to obtain high robust accuracy with only small amounts of abstention in both supervised and self-supervised settings. Our results provide a first formal abstention-based gap, and a first provable optimization of the induced trade-off, in an adversarial defense setting.

1. INTRODUCTION

A substantial body of work has shown that deep networks can be highly susceptible to adversarial attacks, in which minor changes to the input lead to incorrect, even bizarre classifications (Nguyen et al., 2015; Moosavi-Dezfooli et al., 2016; Su et al., 2019; Brendel et al., 2018; Shamir et al., 2019). Much of this work has considered ℓp-norm adversarial examples, but there has also been recent interest in exploring adversarial models beyond bounded ℓp-norm (Brown et al., 2018; Engstrom et al., 2017; Gilmer et al., 2018; Xiao et al., 2018; Alaifari et al., 2019). What these results have in common is that changes that either are imperceptible or should be irrelevant to the classification task can lead to drastically different network behavior. One reason for this vulnerability is the non-Lipschitzness of typical neural networks: small but adversarial movements in the input space can often produce large perturbations in the feature space. In this work, we consider the question of whether non-Lipschitz networks are intrinsically vulnerable, or whether they can still be made robust to adversarial attack, in an abstract but (we believe) instructive adversarial model. In particular, suppose an adversary, by making an imperceptible change to an input x, can cause its representation F(x) in feature space (the penultimate layer of the network) to move by an arbitrary amount: will such an adversary always win? Clearly, if the adversary can modify F(x) by an arbitrary amount in an arbitrary direction, then yes. But what if the adversary can modify F(x) by an arbitrary amount but only in a random direction (which it cannot control)? In this case, we show an interesting dichotomy: if the classifier must output a classification on any input it is given, then the adversary will still win, no matter how well-separated the classes are in feature space and no matter what decision surface the classifier uses.
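To build intuition for why a classifier forced to predict always loses, consider the simplest binary case: a fixed linear decision surface in feature space. The following numpy sketch (not from the paper; the setup, dimension, and helper name `attack` are illustrative assumptions) shows that an adversary who controls only the signed magnitude of a push along a random direction v can flip the prediction whenever w·v ≠ 0, which holds almost surely:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                      # feature-space dimension (illustrative choice)
w = rng.normal(size=d)       # a fixed linear classifier: predict sign(w . z)

def attack(z, v):
    """Push z along the random direction v by an adversary-chosen amount."""
    # The adversary controls only the signed magnitude t, not v itself.
    # Choosing t = -2 (w.z)/(w.v) gives w.(z + t v) = -(w.z), flipping the
    # prediction; this works whenever w.v != 0, which holds almost surely.
    t = -2.0 * (w @ z) / (w @ v)
    return z + t * v

flips = 0
for _ in range(1000):
    z = rng.normal(size=d)                           # a clean feature vector
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)                           # random unit attack direction
    z_adv = attack(z, v)
    flips += np.sign(w @ z) != np.sign(w @ z_adv)
print(flips)  # 1000: the non-abstaining linear classifier is fooled every time
```

The paper's hardness result is far more general (any partition of feature space into classes, with success probability close to 1/2 on at least one class), but the linear case already shows that no decision surface helps once arbitrary-magnitude movement along a random direction is allowed.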
However, if the classifier is allowed to abstain, then it can defeat such an adversary so long as natural data of different classes are reasonably well-separated in feature space. Our results hold for generalizations of these models as well, such as adversaries that can modify feature representations in random low-dimensional subspaces, or in directions that are not completely random. More broadly, our results provide a theoretical explanation for the importance of allowing abstention, or selective classification, in the presence of adversarial attack. Apart from providing a useful abstraction for non-Lipschitz feature embeddings, our model may be viewed as capturing an interesting class of real attacks. There are various global properties of an image, such as brightness, contrast, or rotation angle, whose change might be "perceptible but not relevant" to classification tasks; our model can also be viewed as an abstraction of attacks of that nature. Feature-space attacks of other forms, where one can perturb abstract features denoting styles, including interpretable styles such as vivid colors and sharp outlines as well as uninterpretable ones, have also been studied empirically (Xu et al., 2020; Ganeshan & Babu, 2019). An interesting property of our model is that the ability to refuse to predict is critical: any algorithm that always predicts a class label, and therefore lacks the ability to abstain, is guaranteed to perform poorly. This provides a first formal hardness result about abstention in adversarial defense, and a first provable negative result for feature-space attacks. We therefore allow the algorithm to output "don't know" for some examples, which, as a by-product, serves as a detection mechanism for adversarial examples. It also yields an interesting trade-off between robustness and accuracy: by controlling how frequently we refuse to predict, we can trade (robust) precision off against recall.
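A minimal sketch of the abstention idea, assuming well-separated classes in feature space: classify with the nearest training point, but output "don't know" whenever that point is farther than a threshold. An attacked point, pushed an arbitrary amount along a random direction, lands far from all natural data and is rejected. This toy example (the threshold value, data distribution, and function name `predict_abstain` are illustrative assumptions, not the paper's algorithm) demonstrates the mechanism:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 200
# Two well-separated classes in feature space (Gaussians around -3 and +3).
X = np.vstack([rng.normal(-3, 1, size=(n, d)), rng.normal(+3, 1, size=(n, d))])
y = np.array([0] * n + [1] * n)

def predict_abstain(z, tau):
    """1-NN that abstains when the nearest training point is farther than tau."""
    dists = np.linalg.norm(X - z, axis=1)
    i = np.argmin(dists)
    return y[i] if dists[i] <= tau else None   # None means "don't know"

tau = 2.0 * np.sqrt(d)                         # hypothetical threshold; tuned in practice
z_nat = rng.normal(-3, 1, size=d)              # a natural class-0 point
v = rng.normal(size=d)
v /= np.linalg.norm(v)                         # random attack direction
z_adv = z_nat + 1000.0 * v                     # huge push along that direction

print(predict_abstain(z_nat, tau))   # 0    (natural point classified correctly)
print(predict_abstain(z_adv, tau))   # None (attack lands far away -> abstain)
```

Raising tau accepts more natural points (higher recall) at the cost of accepting more borderline adversarial ones (lower robust precision), which is exactly the trade-off the threshold controls.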
We also show how to provably optimize this trade-off using a data-driven algorithm. Our theoretical advances are backed by empirical evidence in the context of contrastive learning (He et al., 2020; Chen et al., 2020; Khosla et al., 2020).

1.1. OUR CONTRIBUTIONS

Our work tackles the problem of defending against adversarial perturbations in a random feature subspace, and advances the theory and practice of robust machine learning in multiple ways.

• We introduce a formal model that captures feature-space attacks and the effect of non-Lipschitzness in deep networks, which can magnify input perturbations.

• We begin our analysis with a hardness result for defending against such an adversary without the option of "don't know": any classifier that partitions the feature space into two or more classes, and thus cannot abstain, is provably vulnerable to adversarial examples on at least one class with probability close to 1/2.

• We explore the power of the abstention option: a variant of the nearest-neighbor classifier with the ability to abstain is provably robust against such adversarial attacks, even in the presence of outliers in the training data set. We characterize the conditions under which the algorithm does not output "don't know" too often.

• We leverage and extend dispersion techniques from data-driven decision making, and present a novel data-driven method for learning data-specific optimal hyperparameters for our defense algorithms that simultaneously achieves high robust accuracy and a low abstention rate. Unlike typical hyperparameter tuning, our approach provably converges to a global optimum.

• Experimentally, we show that our proposed algorithm achieves certified adversarial robustness on representations learned by supervised and self-supervised contrastive learning, significantly outperforming algorithms without the ability to abstain.
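To make the data-driven hyperparameter step concrete, here is a deliberately simplified stand-in (not the paper's dispersion-based method; the function name `tune_threshold`, the objective, and the weight `lam` are all illustrative assumptions): grid-search the abstention threshold on held-out nearest-neighbor distances, rewarding accepted-and-correct natural points and rejected attacked points while penalizing abstention on natural data.

```python
import numpy as np

rng = np.random.default_rng(2)

def tune_threshold(dists_nat, correct, dists_adv, lam=1.0, grid=200):
    """Pick the abstention threshold tau on held-out data, trading natural
    accuracy against robustness (abstaining on attacks) and abstention rate."""
    best_tau, best_obj = None, -np.inf
    for tau in np.linspace(dists_nat.min(), dists_adv.max(), grid):
        acc = np.mean((dists_nat <= tau) & correct)  # accepted and correct
        robust = np.mean(dists_adv > tau)            # attacked points rejected
        abstain = np.mean(dists_nat > tau)           # natural points rejected
        obj = acc + lam * robust - lam * abstain
        if obj > best_obj:
            best_tau, best_obj = tau, obj
    return best_tau

# Synthetic held-out nearest-neighbor distances: natural points lie close to
# the training set, attacked points (pushed far along a random direction) do not.
dists_nat = rng.uniform(5, 12, size=500)
correct = rng.random(500) < 0.97
dists_adv = rng.uniform(50, 100, size=500)
tau = tune_threshold(dists_nat, correct, dists_adv)
# tau lands in the gap: accept all natural points, abstain on all attacked ones.
```

The paper's actual contribution is stronger than this sketch suggests: grid search over a non-convex objective normally offers no guarantees, whereas the dispersion-based analysis yields provable convergence to a global optimum.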

2. RELATED WORK

Adversarial robustness with abstention options. 



Figure 1: Illustration of a non-Lipschitz feature mapping using a deep network.

Classification with an abstention option (a.k.a. selective classification (Geifman & El-Yaniv, 2017)) is a relatively less explored direction in adversarial machine learning. Hosseini et al. (2017) augmented the output class set with a NULL label and trained the classifier to reject adversarial examples by classifying them as NULL; Stutz et al. (2020) and Laidlaw & Feizi (2019) obtained robustness by rejecting low-confidence adversarial examples.

