COUNTERING THE ATTACK-DEFENSE COMPLEXITY GAP FOR ROBUST CLASSIFIERS

Abstract

We consider the decision version of defending and attacking Machine Learning classifiers. We provide a rationale for known difficulties in building robust models by proving that, under broad assumptions, attacking a polynomial-time classifier is NP-complete in the worst case; conversely, training a polynomial-time model that is robust on even a single input is Σ^P_2-complete, barring a collapse of the Polynomial Hierarchy. We also provide more general bounds for non-polynomial classifiers. We point out an alternative take on adversarial defenses that can sidestep such a complexity gap, by introducing Counter-Attack (CA), a system that computes on-the-fly robustness certificates for a given input up to an arbitrary distance bound ε. Finally, we empirically investigate how heuristic attacks can approximate the true decision boundary distance, which has implications for a heuristic version of CA. As part of our work, we introduce UG100, a dataset obtained by applying both heuristic and provably optimal attacks to limited-scale networks for MNIST and CIFAR10. We hope our contributions can provide guidance for future research.

1. INTRODUCTION

Adversarial attacks, i.e. algorithms designed to fool machine learning models, represent a significant threat to the applicability of such models in real-world contexts (Brendel et al., 2019; Brown et al., 2017; Wu et al., 2020). Despite years of research effort, countermeasures (i.e. "defenses") to adversarial attacks are frequently fooled by applying small tweaks to existing techniques (Carlini & Wagner, 2016; 2017a; Croce et al., 2022; He et al., 2017; Hosseini et al., 2019; Tramer et al., 2020). We argue that this pattern is due to differences between the fundamental mathematical problems that defenses and attacks need to tackle. Specifically, we prove that while attacking a polynomial-time classifier is NP-complete in the worst case, training a polynomial-time model that is robust even on a single input is Σ^P_2-complete. We also provide more general bounds for non-polynomial classifiers, showing that an A-time classifier can be attacked in NP^A time. We then give an informal intuition for our theoretical results, which also applies to heuristic attacks and defenses. Our result highlights that, unless the Polynomial Hierarchy collapses, there exists a potential structural difficulty for defense approaches that focus on building robust classifiers at training time. We then show that this asymmetry can be sidestepped by an alternative perspective on adversarial defenses. As an exemplification, we introduce a new technique, named Counter-Attack (CA), that, instead of training a robust model, evaluates robustness on the fly for a specific input by running an adversarial attack. This approach, while very simple, provides robustness guarantees against perturbations of an arbitrary magnitude ε.
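To make the CA principle concrete, the following is a minimal sketch for the special case of a binary linear classifier, where the exact decision-boundary distance has a closed form (|w·x + b| / ||w||); the function names and interface are illustrative, not from the paper, and for general networks computing this distance exactly is the hard part.

```python
import numpy as np

def boundary_distance_linear(w, b, x):
    """Exact L2 distance from x to the decision boundary of sign(w.x + b).

    For a linear classifier this is a closed form; for a general network
    this quantity is exactly what an optimal untargeted attack computes.
    """
    return abs(float(np.dot(w, x)) + b) / float(np.linalg.norm(w))

def counter_attack_certificate(w, b, x, eps):
    """CA certificate: x is provably robust up to radius eps iff the
    minimal adversarial perturbation (here computed exactly) exceeds eps."""
    return boundary_distance_linear(w, b, x) > eps
```

For instance, with w = (3, 4), b = 0 and x = (1, 0), the boundary distance is 0.6, so CA certifies robustness for ε = 0.5 but not for ε = 0.7. The key design point is that the certificate is computed per input at inference time, rather than baked into the model at training time.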
Additionally, we prove that while generating a certificate is NP-complete in the worst case, attacking CA using perturbations of magnitude ε′ > ε is Σ^P_2-complete, which represents a form of computational robustness, weaker than that of Garg et al. (2020), but holding under much more general assumptions. CA can be applied in any setting where at least one untargeted attack is known, while also allowing one to capitalize on future algorithmic improvements: as adversarial attacks become stronger, so does CA. Finally, we investigate the empirical performance of an approximate version of CA where a heuristic attack is used instead of an exact one. This version achieves reduced computational time, at the cost of providing only approximate guarantees. We found heuristic attacks to be high-quality approximators of exact decision boundary distances, in experiments over a subsample of MNIST and CIFAR10 and small-scale Neural Networks. In particular, a pool of seven heuristic attacks
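The approximate variant described above can be sketched as follows (the attack-pool interface is illustrative, not from the paper): each heuristic attack returns an adversarial candidate or None, and the smallest successful perturbation norm upper-bounds the true boundary distance.

```python
import numpy as np

def approximate_counter_attack(x, attacks, eps):
    """Heuristic CA: run a pool of heuristic untargeted attacks on x.

    The minimum norm over successful attacks is an UPPER bound on the
    true decision-boundary distance, so a "robust" verdict (best > eps)
    is only approximate, while a "not robust" verdict is always sound,
    since it is witnessed by an actual adversarial example within eps.
    """
    best = float("inf")
    for attack in attacks:
        adv = attack(x)  # adversarial candidate, or None on failure
        if adv is not None:
            best = min(best, float(np.linalg.norm(adv - x)))
    return best > eps, best
```

The trade-off mirrors the one in the text: replacing the exact attack with heuristics removes the worst-case NP-complete certificate computation, but the resulting robustness claim is empirical rather than provable.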

