ON THE POWER OF ABSTENTION AND DATA-DRIVEN DECISION MAKING FOR ADVERSARIAL ROBUSTNESS

Anonymous

Abstract

We formally define a feature-space attack where the adversary can perturb datapoints by arbitrary amounts but only in restricted directions. By restricting the attack to a small random subspace, our model provides a clean abstraction for non-Lipschitz networks, which map small input movements to large feature movements. We prove that, in this setting, classifiers with the ability to abstain are provably more powerful than those without it. Specifically, we show that no matter how well-behaved the natural data is, any classifier that cannot abstain will be defeated by such an adversary. By allowing abstention, however, we give a parameterized algorithm with provably good performance against such an adversary when the classes are reasonably well-separated in feature space and the dimension of the feature space is high. We further use a data-driven method to set our algorithm's parameters so as to optimize the accuracy-vs-abstention trade-off with strong theoretical guarantees. Our theory has direct applications to the technique of contrastive learning, where we empirically demonstrate the ability of our algorithms to obtain high robust accuracy with only small amounts of abstention, in both supervised and self-supervised settings. Our results provide a first formal abstention-based gap, and a first provable optimization of the induced trade-off, in an adversarial defense setting.

1. INTRODUCTION

A substantial body of work has shown that deep networks can be highly susceptible to adversarial attacks, in which minor changes to the input lead to incorrect, even bizarre, classifications (Nguyen et al., 2015; Moosavi-Dezfooli et al., 2016; Su et al., 2019; Brendel et al., 2018; Shamir et al., 2019). Much of this work has considered ℓp-norm adversarial examples, but there has also been recent interest in exploring adversarial models beyond bounded ℓp-norm (Brown et al., 2018; Engstrom et al., 2017; Gilmer et al., 2018; Xiao et al., 2018; Alaifari et al., 2019). What these results have in common is that changes that are either imperceptible or should be irrelevant to the classification task can lead to drastically different network behavior. One reason for this vulnerability is the non-Lipschitzness of typical neural networks: small but adversarial movements in the input space can produce large perturbations in the feature space. In this work, we consider the question of whether non-Lipschitz networks are intrinsically vulnerable, or whether they can still be made robust to adversarial attack, in an abstract but (we believe) instructive adversarial model. In particular, suppose an adversary, by making an imperceptible change to an input x, can cause its representation F(x) in feature space (the penultimate layer of the network) to move by an arbitrary amount: will such an adversary always win? Clearly, if the adversary can modify F(x) by an arbitrary amount in an arbitrary direction, then yes. But what if the adversary can modify F(x) by an arbitrary amount but only in a random direction (which it cannot control)? In this case, we show an interesting dichotomy: if the classifier must output a classification on any input it is given, then the adversary will still win, no matter how well-separated the classes are in feature space and no matter what decision surface the classifier uses.
However, if the classifier is allowed to abstain, then it can defeat such an adversary so long as natural data of different classes are reasonably well-separated in feature space. Our results hold for generalizations of these models as well, such as adversaries that can modify feature representations in random low-dimensional subspaces, or in directions that are not completely random. More broadly, our results provide a theoretical explanation for the importance of allowing abstention, or selective classification, in the presence of adversarial attack.
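The dichotomy above can be illustrated with a small simulation. This is only a hedged sketch, not the paper's algorithm: we stand in for the feature map with two well-separated Gaussian classes in a high-dimensional feature space, use a simple nearest-centroid rule (the classifier, threshold `tau`, and the specific constants are our own illustrative choices), and let the adversary move each feature vector by a large amount along a random unit direction it cannot choose. A classifier forced to predict errs on roughly half of the attacked points, while the same rule with a distance-based abstention option makes no errors on them (at the cost of abstaining).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512      # feature-space dimension (high, as the theory requires)
n = 200      # points per class
sep = 10.0   # separation between class centroids in feature space

# Natural data: two well-separated classes around centroids mu0, mu1.
mu0 = np.zeros(d)
mu1 = np.zeros(d)
mu1[0] = sep
X = np.vstack([mu0 + rng.normal(size=(n, d)),
               mu1 + rng.normal(size=(n, d))])
y = np.array([0] * n + [1] * n)

def classify(z, tau=np.inf):
    """Nearest-centroid rule; returns -1 (abstain) if the nearest
    centroid is farther than tau. tau=inf means never abstain."""
    d0 = np.linalg.norm(z - mu0)
    d1 = np.linalg.norm(z - mu1)
    if min(d0, d1) > tau:
        return -1
    return 0 if d0 <= d1 else 1

def attack(z, c):
    """Adversary: move z by magnitude c along a random unit
    direction v that the adversary cannot control."""
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    return z + c * v

tau = 2.0 * np.sqrt(d)  # threshold comfortably above natural noise scale
c = 100.0               # arbitrarily large perturbation magnitude

errors_no_abstain = errors_abstain = abstentions = 0
for z, label in zip(X, y):
    za = attack(z, c)
    # Forced to predict: a large random-direction shift flips the
    # nearest centroid about half the time.
    if classify(za) != label:
        errors_no_abstain += 1
    # With abstention: the attacked point lands far from both
    # centroids, so the classifier abstains instead of erring.
    pred = classify(za, tau)
    if pred == -1:
        abstentions += 1
    elif pred != label:
        errors_abstain += 1
```

Since the perturbed point is at distance at least `c - O(sqrt(d))` from every centroid, the abstaining classifier rejects every attacked input while still classifying natural data (which lies within ~`sqrt(d)` of its own centroid, well inside `tau`). This mirrors the separation in the text: forced classification fails regardless of the decision surface, whereas abstention converts those would-be errors into rejections.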

