COUNTERING THE ATTACK-DEFENSE COMPLEXITY GAP FOR ROBUST CLASSIFIERS

Abstract

We consider the decision version of defending and attacking Machine Learning classifiers. We provide a rationale for known difficulties in building robust models by proving that, under broad assumptions, attacking a polynomial-time classifier is NP-complete in the worst case; conversely, training a polynomial-time model that is robust on even a single input is Σ^P_2-complete, barring a collapse of the Polynomial Hierarchy. We also provide more general bounds for non-polynomial classifiers. We then point out an alternative take on adversarial defenses that can sidestep this complexity gap, by introducing Counter-Attack (CA), a system that computes on-the-fly robustness certificates for a given input up to an arbitrary distance bound ε. Finally, we empirically investigate how well heuristic attacks can approximate the true decision boundary distance, which has implications for a heuristic version of CA. As part of our work, we introduce UG100, a dataset obtained by applying both heuristic and provably optimal attacks to limited-scale networks for MNIST and CIFAR10. We hope our contributions can provide guidance for future research.

1. INTRODUCTION

Adversarial attacks, i.e. algorithms designed to fool machine learning models, represent a significant threat to the applicability of such models in real-world contexts (Brendel et al., 2019; Brown et al., 2017; Wu et al., 2020). Despite years of research effort, countermeasures (i.e. "defenses") to adversarial attacks are frequently fooled by applying small tweaks to existing techniques (Carlini & Wagner, 2016; 2017a; Croce et al., 2022; He et al., 2017; Hosseini et al., 2019; Tramer et al., 2020). We argue that this pattern is due to differences between the fundamental mathematical problems that defenses and attacks need to tackle. Specifically, we prove that while attacking a polynomial-time classifier is NP-complete in the worst case, training a polynomial-time model that is robust on even a single input is Σ^P_2-complete. We also provide more general bounds for non-polynomial classifiers, showing that an A-time classifier can be attacked in NP^A time. We then give an informal intuition for our theoretical results, which also applies to heuristic attacks and defenses. Our result highlights that, unless the Polynomial Hierarchy collapses, there exists a potential, structural difficulty for defense approaches that focus on building robust classifiers at training time. We then show that this asymmetry can be sidestepped by an alternative perspective on adversarial defenses. As an exemplification, we introduce a new technique, named Counter-Attack (CA), that, instead of training a robust model, evaluates robustness on the fly for a specific input by running an adversarial attack. While very simple, this approach provides robustness guarantees against perturbations of an arbitrary magnitude ε.
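The core idea of CA can be summarized in a few lines. The following is a minimal Python sketch, not the authors' implementation: we assume a hypothetical `attack` callable that returns a minimally-distant adversarial example for the given model and input (or `None` if the prediction cannot be changed at all); CA then certifies robustness up to ε by checking whether that closest example lies outside the ε-ball around the input.

```python
import numpy as np

def counter_attack(model, attack, x, eps, ord=np.inf):
    """Sketch of Counter-Attack (CA): certify robustness of `model` on
    input `x` up to distance `eps`, given an exact untargeted `attack`
    that returns a minimally-distant adversarial example (or None).
    Returns True iff no adversarial example exists within the eps-ball."""
    x_adv = attack(model, x)  # exact minimal-distance attack (assumption)
    if x_adv is None:
        return True  # the prediction cannot be changed by any perturbation
    # Distance from x to the closest point of the decision boundary
    distance = np.linalg.norm((x_adv - x).ravel(), ord=ord)
    return bool(distance > eps)  # robust iff the boundary is farther than eps
```

Note that the certificate is computed per input, at inference time: no retraining is involved, and any future, stronger exact attack can be dropped in as `attack`.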
Additionally, we prove that while generating a certificate is NP-complete in the worst case, attacking CA using perturbations of magnitude ε′ > ε is Σ^P_2-complete, which represents a form of computational robustness, weaker than the one of Garg et al. (2020), but holding under much more general assumptions. CA can be applied in any setting where at least one untargeted attack is known, while also allowing one to capitalize on future algorithmic improvements: as adversarial attacks become stronger, so does CA. Finally, we investigate the empirical performance of an approximate version of CA in which a heuristic attack is used instead of an exact one. This version achieves reduced computational time, at the cost of providing only approximate guarantees. We found heuristic attacks to be high-quality approximators of exact decision boundary distances, in experiments over subsamples of MNIST and CIFAR10 and small-scale Neural Networks. In particular, a pool of seven heuristic attacks provided an accurate (average over-estimate between 2.04% and 4.65%) and predictable (average R² > 0.99) approximation of the true optimum. We compiled our benchmarks and generated adversarial examples (both exact and heuristic) into a new dataset, named UG100, and made it publicly available¹. Overall, we hope our contributions can support future research by highlighting potential structural challenges, pointing out key sources of complexity, inspiring research on heuristics and tractable classes, and suggesting alternative perspectives on how to build robust classifiers.
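The heuristic variant of CA replaces the single exact attack with a pool of heuristic ones. A minimal sketch under assumed interfaces: each hypothetical callable in `attacks` returns a valid but not necessarily minimal adversarial example (or `None` on failure), so the distance of the closest example found is an over-estimate of the true decision boundary distance.

```python
import numpy as np

def pooled_boundary_estimate(model, attacks, x, ord=np.inf):
    """Approximate the decision-boundary distance for input `x` by running
    a pool of heuristic attacks and keeping the closest successful
    adversarial example. Because every returned example is a valid
    (but possibly suboptimal) attack, the result upper-bounds the true
    distance; returns np.inf if every attack in the pool fails."""
    best = np.inf
    for attack in attacks:
        x_adv = attack(model, x)  # heuristic attack: example or None
        if x_adv is not None:
            best = min(best, np.linalg.norm((x_adv - x).ravel(), ord=ord))
    return best
```

Since the estimate only over-approximates the true distance, an estimate below ε proves non-robustness, while an estimate above ε is only approximate evidence of robustness; this is the trade-off the empirical section quantifies.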

2. RELATED WORK

Robustness bounds for NNs were first provided in (Szegedy et al., 2013), followed by (Hein & Andriushchenko, 2017) and (Weng et al., 2018b). One major breakthrough was the introduction of automatic verification tools, such as the Reluplex solver (Katz et al., 2017). However, the same work also showed that proving properties of a ReLU network is NP-complete. Researchers tried to address this issue by working in three directions. The first is building more efficient solvers based on alternative formulations (Dvijotham et al., 2018; Singh et al., 2018; Tjeng et al., 2019). The second involves training models that can be verified with less computational effort (Leino et al., 2021; Xiao et al., 2019) or provide inherent robustness bounds (Sinha et al., 2018). The third focuses on guaranteeing robustness under specific threat models (Han et al., 2021) or input distribution assumptions (Dan et al., 2020; Sinha et al., 2018). Since all these approaches have limitations that reduce their applicability (Silva & Najafirad, 2020), heuristic defenses tend to be more common in practice. Exact approaches can also be used to compute provably optimal adversarial examples (Carlini et al., 2017; Tjeng et al., 2019), although generating them requires a non-trivial amount of computational resources. Refer to Appendix M for a more in-depth overview of certified defenses. Another line of research has focused on understanding the nature of robustness and adversarial attacks. Frameworks such as (Dreossi et al., 2019), (Pinot et al., 2019) and (Pydi & Jog, 2021) focused on formalizing the concept of adversarial robustness. Some studies have highlighted trade-offs between robustness (under specific definitions) and properties such as accuracy (Dobriban et al., 2020; Zhang et al., 2019), generalization (Min et al., 2021) and invariance (Tramèr et al., 2020).
However, some of these results have recently been questioned, suggesting that these trade-offs might not be inherent to the considered approaches (Yang et al., 2020; Zhang et al., 2020). Adversarial attacks have also been studied from the point of view of Bayesian learning, to derive robustness bounds and provide insight into the role of uncertainty (Rawat et al., 2017; Richardson & Weiss, 2021; Vidot et al., 2021), as well as in the context of game theory (Ren et al., 2021), identifying Nash equilibria between attacker and defender (Pal & Vidal, 2020; Zhou et al., 2019). Finally, some works have focused on the computational complexity of specific adversarial attacks and defenses. In particular, Mahloujifar & Mahmoody (2019) showed that there exist exact polynomial-time attacks against classifiers trained on product distributions. Similarly, Awasthi et al. (2019) showed that for degree-2 polynomial threshold functions there exists a polynomial-time algorithm that either proves that the model is robust or finds an adversarial example. Other works have provided hardness results: Degwekar et al. (2019) showed that there exist classification tasks for which learning a robust model is as hard as solving the Learning Parity with Noise problem (which is NP-hard); Song et al. (2021) showed that learning a single periodic neuron over noisy isotropic Gaussian distributions in polynomial time would imply that the Shortest Vector Problem (conjectured to be NP-hard) can be solved in polynomial time. Finally, Garg et al. (2020) showed that, by requiring attackers to provide a valid cryptographic signature for inputs, it is possible to prevent attackers with limited computational resources from fooling the model in polynomial time.

3. BACKGROUND AND FORMALIZATION

Extensive literature in the field of adversarial attacks suggests that generating adversarial examples is comparatively easier than building robust classifiers (Carlini & Wagner, 2016; 2017a; Croce et al., 2022; He et al., 2017; Hosseini et al., 2019; Tramer et al., 2020). In this section, we introduce some key definitions that we will employ to provide a theoretically grounded potential motivation for such



¹ All our code, datasets, pretrained weights and results are available anonymously under MIT license at https://anonymous.4open.science/r/counter-attack.

