COUNTERING THE ATTACK-DEFENSE COMPLEXITY GAP FOR ROBUST CLASSIFIERS

Abstract

We consider the decision version of defending and attacking machine learning classifiers. We provide a rationale for known difficulties in building robust models by proving that, under broad assumptions, attacking a polynomial-time classifier is NP-complete in the worst case, while training a polynomial-time model that is robust on even a single input is Σ^P_2-complete, barring a collapse of the Polynomial Hierarchy. We also provide more general bounds for non-polynomial classifiers. We then point out an alternative take on adversarial defenses that can sidestep this complexity gap, by introducing Counter-Attack (CA), a system that computes on-the-fly robustness certificates for a given input up to an arbitrary distance bound ε. Finally, we empirically investigate how well heuristic attacks can approximate the true decision boundary distance, which has implications for a heuristic version of CA. As part of our work, we introduce UG100, a dataset obtained by applying both heuristic and provably optimal attacks to limited-scale networks for MNIST and CIFAR10. We hope our contributions can provide guidance for future research.

1. INTRODUCTION

Adversarial attacks, i.e. algorithms designed to fool machine learning models, represent a significant threat to the applicability of such models in real-world contexts (Brendel et al., 2019; Brown et al., 2017; Wu et al., 2020). Despite years of research effort, countermeasures (i.e. "defenses") to adversarial attacks are frequently fooled by applying small tweaks to existing techniques (Carlini & Wagner, 2016; 2017a; Croce et al., 2022; He et al., 2017; Hosseini et al., 2019; Tramer et al., 2020). We argue that this pattern is due to differences between the fundamental mathematical problems that defenses and attacks need to tackle. Specifically, we prove that while attacking a polynomial-time classifier is NP-complete in the worst case, training a polynomial-time model that is robust even on a single input is Σ^P_2-complete. We also provide more general bounds for non-polynomial classifiers, showing that an A-time classifier can be attacked in NP^A time. We then give an informal intuition for our theoretical results, which also applies to heuristic attacks and defenses. Our results highlight that, unless the Polynomial Hierarchy collapses, there exists a potential structural difficulty for defense approaches that focus on building robust classifiers at training time.

We then show that this asymmetry can be sidestepped by an alternative perspective on adversarial defenses. As an exemplification, we introduce a new technique, named Counter-Attack (CA), that, instead of training a robust model, evaluates robustness on the fly for a specific input by running an adversarial attack. While very simple, this approach provides robustness guarantees against perturbations of an arbitrary magnitude ε. Additionally, we prove that while generating a certificate is NP-complete in the worst case, attacking CA using perturbations of magnitude ε′ > ε is Σ^P_2-complete, which represents a form of computational robustness: weaker than the one by Garg et al. (2020), but holding under much more general assumptions. CA can be applied in any setting where at least one untargeted attack is known, while also allowing one to capitalize on future algorithmic improvements: as adversarial attacks become stronger, so does CA. Finally, we investigate the empirical performance of an approximate version of CA where a heuristic attack is used instead of an exact one. This version achieves reduced computational time, at the cost of providing only approximate guarantees. In experiments over a subsample of MNIST and CIFAR10 and small-scale neural networks, we found heuristic attacks to be high-quality approximators of the exact decision boundary distance. In particular, a pool of seven heuristic attacks provided an accurate (average over-estimate between 2.04% and 4.65%) and predictable (average R² > 0.99) approximation of the true optimum. We compiled our benchmarks and generated adversarial examples (both exact and heuristic) into a new dataset, named UG100, and made it publicly available¹. Overall, we hope our contributions can support future research by highlighting potential structural challenges, pointing out key sources of complexity, inspiring research on heuristics and tractable classes, and suggesting alternative perspectives on how to build robust classifiers.

2. RELATED WORK

Robustness bounds for NNs were first provided in (Szegedy et al., 2013), followed by (Hein & Andriushchenko, 2017) and (Weng et al., 2018b). One major breakthrough was the introduction of automatic verification tools, such as the Reluplex solver (Katz et al., 2017). However, the same work also showed that proving properties of a ReLU network is NP-complete. Researchers have tried to address this issue by working in three directions. The first is building more efficient solvers based on alternative formulations (Dvijotham et al.,

Finally, some works have focused on the computational complexity of specific adversarial attacks and defenses. In particular, Mahloujifar & Mahmoody (2019) showed that there exist exact polynomial-time attacks against classifiers trained on product distributions. Similarly, Awasthi et al. (2019) showed that for degree-2 polynomial threshold functions there exists a polynomial-time algorithm that either proves that the model is robust or finds an adversarial example. Other works have provided hardness results: Degwekar et al. (2019) showed that there exist classification tasks such that learning a robust model is as hard as solving the Learning Parity with Noise problem (which is NP-hard); Song et al. (2021) showed that learning a single periodic neuron over noisy isotropic Gaussian distributions in polynomial time would imply that the Shortest Vector Problem (conjectured to be NP-hard) can be solved in polynomial time. Finally, Garg et al. (2020) showed that, by requiring attackers to provide a valid cryptographic signature for inputs, it is possible to prevent attacks with limited computational resources from fooling the model in polynomial time.

3. BACKGROUND AND FORMALIZATION

Extensive literature in the field of adversarial attacks suggests that generating adversarial examples is comparatively easier than building robust classifiers (Carlini & Wagner, 2016; 2017a; Croce et al., 2022; He et al., 2017; Hosseini et al., 2019; Tramer et al., 2020). In this section, we introduce the key definitions that we will employ to provide a theoretically grounded, potential motivation for this discrepancy. We aim to capture the key traits shared by most of the literature on adversarial attacks, so as to identify properties that hold under broad assumptions.

We start by defining the concept of adversarial example, which intuitively represents a modification of a legitimate input that is so limited as to be inconsequential from a practical perspective, but that is classified erroneously by a target model. Formally, let f : X → {1, ..., N} be a discrete classifier, and let B_p(x, ε) = {x′ ∈ X | ∥x − x′∥_p ≤ ε} be the L_p ball of radius ε centered at x. Then we have:

Definition 1 (Adversarial Example). Given an input x, a threshold ε, and an L_p norm, an adversarial example is an input x′ ∈ B_p(x, ε) such that f(x′) ∈ C(x), where C(x) ⊆ {1, ..., N} \ {f(x)}.

This definition is a simplification compared to human perception, but it is adequate for a sufficiently small ε, and it is adopted in most of the relevant literature. An adversarial attack can then be viewed as an optimization procedure that attempts to find an adversarial example. We define an adversarial attack on a classifier f as a function a_{f,p} : X → X that solves the following optimization problem:

    a_{f,p}(x) = argmin_{x′ ∈ X} { ∥x′ − x∥_p | f(x′) ∈ C(x) }    (1)

The attack is considered successful if the returned solution x′ = a_{f,p}(x) also satisfies ∥x′ − x∥_p ≤ ε. We say that an attack is exact if it solves Equation (1) to optimality; otherwise, we say that the attack is heuristic.
An attack is targeted if C(x) = C_{t,y′}(x) = {y′} with y′ ≠ f(x); it is untargeted if C(x) = C_u(x) = {1, ..., N} \ {f(x)}. We define the decision boundary distance d*_p(x) of a given input x as the minimum L_p distance between x and another input x′ such that f(x) ≠ f(x′). Note that this is also the value of ∥a_{f,p}(x) − x∥_p for an exact untargeted attack. Intuitively, a classifier is robust w.r.t. an example x iff x cannot be successfully attacked. Formally:

Definition 2 ((ε, p)-Local Robustness). A discrete classifier f is (ε, p)-locally robust w.r.t. an example x ∈ X iff ∀x′ ∈ B_p(x, ε) we have f(x′) = f(x).

We then provide two additional definitions that are needed for our results, namely ReLU networks and FSFP spaces. ReLU networks are defined as follows:

Definition 3 (ReLU network). A ReLU network is a composition of sums, multiplications by constants, and ReLU activations, where ReLU : R → R≥0 is defined as ReLU(x) = max(x, 0).

Note that any hardness result for ReLU classifiers also applies to more general classes of classifiers. Fixed-Size Fixed-Precision (FSFP) spaces, on the other hand, capture two common assumptions about real-world input spaces: all inputs can be represented with the same number of bits, and there exists a positive minorant of the distance between distinct inputs.

Definition 4 (Fixed-Size Fixed-Precision space). Given a real p > 0, a space X ⊆ R^n is FSFP if there exists a ν ∈ R such that ∀x ∈ X. |r(x)| ≤ ν (where |r(x)| is the size of the representation of x) and there exists a µ ∈ R such that µ > 0 and ∀x, x′ ∈ X. (∥x′ − x∥_p < µ =⇒ x = x′).

Examples of FSFP spaces include most image encodings, as well as 32-bit and 64-bit IEEE 754 tensors. Examples of non-FSFP spaces include the set of all rational numbers in an interval. As with ReLU networks, hardness results for FSFP spaces also apply to more general spaces.
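To make Definition 1 and Equation (1) concrete, the following sketch implements a toy "exact" untargeted attack by brute-force enumeration over a fixed-precision lattice (a stand-in for an FSFP space). The classifier `f`, the step size, and the search radius are illustrative assumptions, not part of the formalization; the exponential cost in the input dimension foreshadows the hardness results of Section 4.

```python
import itertools
import numpy as np

def exact_attack_grid(f, x, p=np.inf, step=1/255, max_radius=1.0):
    """Toy exact untargeted attack (Equation 1) on an FSFP-like grid:
    enumerate candidates x' on a fixed-precision lattice around x and
    return the nearest one with f(x') != f(x), together with its
    distance, i.e. d*_p(x) restricted to the lattice. The enumeration
    is exponential in dim(x)."""
    y = f(x)
    offsets = np.arange(-max_radius, max_radius + step, step)
    best, best_dist = None, np.inf
    for delta in itertools.product(offsets, repeat=x.size):
        delta = np.asarray(delta)
        d = np.linalg.norm(delta, ord=p)
        if d < best_dist and f(x + delta.reshape(x.shape)) != y:
            best, best_dist = x + delta.reshape(x.shape), d
    return best, best_dist
```

Only viable for very low-dimensional toy inputs; practical exact attacks instead rely on solver-based formulations such as MIPVerify (Section 6).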

4. AN ASYMMETRICAL SETTING

In this section, we provide a theoretically sound result that offers a viable explanation for why attacks seem to outperform defenses. The core of our analysis is proving that attacks are less computationally expensive than defenses in the worst case, unless the Polynomial Hierarchy collapses. Specifically, we prove that the decision version of attacking a ReLU classifier is NP-complete:

Theorem 1 (Untargeted L_∞ attacks against ReLU classifiers are NP-complete). Let U-ATT_p be the set of all tuples ⟨x, ε, f⟩ such that:

    ∃x′ ∈ B_p(x, ε). f(x′) ≠ f(x)    (2)

where x ∈ X, X is an FSFP space and f is a ReLU classifier. Then U-ATT_∞ is NP-complete.

Corollary 1.1. For every 0 < p ≤ ∞, U-ATT_p is NP-complete.

Corollary 1.2. Targeted L_p attacks (for 0 < p ≤ ∞) against ReLU classifiers are NP-complete.

Corollary 1.3. Theorem 1 holds even if we consider the more general set of polynomial-time classifiers (w.r.t. the size of the tuple).

Theorem 1 represents a minor generalization of existing results in the literature (Katz et al., 2017). However, together with the following more general bound for non-polynomial-time classifiers, it lays the groundwork for our main result.

Theorem 2. Let A be a complexity class, let f be a classifier, let Z_f = {⟨x, y⟩ | y = f(x), x ∈ X} and let U-ATT_p(f) = {⟨x, ε, g⟩ ∈ U-ATT′_p | g = f}, where U-ATT′_p is the same as U-ATT_p but without the restriction to ReLU classifiers. If Z_f ∈ A, then for every 0 < p ≤ ∞, U-ATT_p(f) ∈ NP^A.

Corollary 2.1. For every 0 < p ≤ ∞, if Z_f ∈ Σ^P_n, then U-ATT_p(f) ∈ Σ^P_{n+1}.

The latter result implies that, if Z_f ∈ P, then U-ATT_p(f) ∈ NP. Informally, Corollary 2.1 establishes that, under broad assumptions, evaluating and attacking a classifier lie in complexity classes that are strongly conjectured to be distinct, with the attack problem being the harder one.
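The NP-membership half of Theorem 1 has a direct operational reading: a claimed adversarial example is itself a polynomial-size certificate that can be checked with a single forward pass. A minimal sketch (the classifier `f` is an assumed black box):

```python
import numpy as np

def verify_uatt_certificate(f, x, eps, x_prime, p=np.inf):
    """Polynomial-time verifier underlying U-ATT_p ∈ NP: accept the
    certificate x_prime iff it lies in B_p(x, eps) and changes the
    prediction. One evaluation of a polynomial-time classifier plus a
    norm computation suffice, so the hardness lies only in *finding*
    x_prime, not in recognizing it."""
    within_ball = np.linalg.norm((x_prime - x).ravel(), ord=p) <= eps
    return bool(within_ball and f(x_prime) != f(x))
```

When f is only decidable with an oracle for class A, the same check runs in polynomial time relative to that oracle, matching the NP^A bound of Theorem 2.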
The decision version of training a robust model is even harder (again in the worst case):

Theorem 3 (Finding a set of parameters that makes a ReLU network (ε, p)-locally robust on an input is Σ^P_2-complete). Let PL-ROB_p be the set of tuples ⟨x, ε, f_θ, v_f⟩ such that:

    ∃θ′. (v_f(θ′) = 1 ∧ ∀x′ ∈ B_p(x, ε). f_θ′(x′) = f_θ′(x))    (3)

where x ∈ X, X is an FSFP space and v_f is a polynomial-time function that is 1 iff its input is a valid parameter set for f. Then PL-ROB_∞ is Σ^P_2-complete.

Corollary 3.1. PL-ROB_p is Σ^P_2-complete for all 0 < p ≤ ∞.

Corollary 3.2. Theorem 3 holds even if, instead of ReLU classifiers, we consider the more general set of polynomial-time classifiers (w.r.t. the size of the tuple).

Our results rely on worst-case constructions and assume that the Polynomial Hierarchy does not collapse; moreover, both NP and Σ^P_2 problems can be solved by super-polynomial algorithms. That said, we believe our theorems have strong practical relevance. First, a collapse of the Polynomial Hierarchy is strongly conjectured to be false; second, even super-polynomial algorithms can have dramatically different run times (e.g. SAT vs. Quantified Boolean Formula solvers). Finally, generic classifiers can learn (and are known to learn) complex input-output mappings with many local optima. Intuitively, this is the core of the scenario captured by our worst-case construction, and also what makes Equation (1) difficult to solve. Again intuitively, robustness requires solving a nested optimization problem with universal quantification (since we need to guarantee the same prediction on all neighboring points), which motivates the higher complexity class. For this reason, we think our results provide a plausible explanation for the hardness gap that is routinely observed in the relevant literature. Of course, there are sub-cases where the problem is simple enough for exact attacks to run in polynomial time (e.g. Awasthi et al. (2019)); this suggests that, under specific circumstances, guaranteed robustness could be achieved at reasonable cost. By this argument, our proof also provides additional motivation for research on tractable classes of robust classifiers.

Additional Sources of Asymmetry There are additional, complementary factors that may give the attacker an advantage. We review them informally, since they can support efforts to build more robust defenses. First, the attacker can gather information about the target model, e.g. through genuine queries (Papernot et al., 2017), while the defender has no such advantage. As a result, the defender often needs to either make assumptions about adversarial examples (Hendrycks & Gimpel, 2017; Roth et al., 2019) or train models to identify common properties (Feinman et al., 2017; Grosse et al., 2017). These assumptions can be exploited, as in the case of Carlini & Wagner (2017a), who generated adversarial examples that did not have the expected properties. Second, the attacker can focus on one input at a time, while the defender has to guarantee robustness on a large subset of the input space. This weakness can also be exploited: for example, MagNet (Meng & Chen, 2017) relies on a model of the entire genuine distribution, which can sometimes be inaccurate. Carlini & Wagner (2017b) broke MagNet by searching for examples that were both classified differently and mistakenly considered genuine. Finally, defenses cannot significantly compromise the accuracy of a model. Adversarial training, for example, often reduces the clean accuracy of the model (Madry et al., 2018), leading to a trade-off between accuracy and robustness.

5. SIDE-STEPPING THE COMPUTATIONAL ASYMMETRY

The limitations imposed by Theorem 3 cannot be addressed directly in the general case (barring a collapse of the Polynomial Hierarchy). However, they can be sidestepped by changing perspective: we exemplify this by introducing an alternative approach to robust classification, which also allows us to take advantage of existing defenses. Instead of obtaining a robust model from scratch, we propose to evaluate the robustness of the classifier on a case-by-case basis, flagging the input if a robust answer cannot be provided. Specifically, given a norm order p and a threshold ε, we propose to:

• Design a model that is as robust as possible using available and practically viable defenses;
• For every input received, determine whether the model is (ε, p)-locally robust on it by running an adversarial attack;
• If the attack succeeds, flag the input.

We name this technique Counter-Attack (CA). Instead of attempting to build a robust model, CA ensures that answers from a partially robust model are flagged as unreliable when they could be the result of an attack. This approach, while very simple, can take advantage of existing defenses, provides robustness guarantees, and is considerably hard to fool, as we will later prove. The behavior when an input is flagged depends on the context. Examples include falling back on a slower but more robust model (e.g. a human), or rejecting the input altogether. This kind of approach is viable in all cases where the goal is to support (rather than replace) human decision-making. Note that the flagging rate of CA depends heavily on the robustness of the model: a model that is robust on the entire input distribution will have a flagging rate of zero. Therefore, any improvement in the field of adversarial defenses also decreases the flagging rate of CA.
Moreover, if robustness bounds are already known, they can be exploited to simplify the attack: for example, if the model is known to be (ε_cert, p)-robust on x, with ε_cert < ε, the attack can restrict its search for adversarial examples to B_p(x, ε) \ B_p(x, ε_cert). At the same time, developing stronger and faster attacks also benefits CA, since better attacks can find adversarial examples more quickly. The major drawback of CA is that it requires running an exact adversarial attack on every input. We will investigate a possible mitigation based on employing heuristic attacks, which still provide a significant degree of robustness (see Section 6). Finally, we stress that CA is just one of potentially several alternative paradigms that could circumvent the computational asymmetry. We hope our contribution will encourage other researchers to investigate this direction.
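The steps above can be sketched in a few lines; `attack` stands for any untargeted attack returning the nearest adversarial example it can find (with an exact attack, the unflagged answer carries a robustness certificate), and all names are illustrative:

```python
import numpy as np

FLAG = "*"  # the abstain symbol (star) from Equation (4) below

def counter_attack(f, attack, x, eps, p=np.inf):
    """Counter-Attack (CA): run an untargeted attack on x and flag the
    input when an adversarial example exists within radius eps.
    attack(f, x, p) returns the nearest adversarial example it can
    find, or None if it fails entirely."""
    x_adv = attack(f, x, p)
    if x_adv is not None and np.linalg.norm((x_adv - x).ravel(), ord=p) <= eps:
        return FLAG           # model is not (eps, p)-locally robust on x
    return f(x)               # robust answer (certified, if attack is exact)
```

The hedge in the docstring matters: only an exact attack turns the unflagged branch into a certificate; with a heuristic attack this becomes the approximate scheme of Section 5.2.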

5.1. FORMAL PROPERTIES

When used with an exact attack, CA provides formal robustness guarantees for arbitrary p and ε:

Theorem 4. Let 0 < p ≤ ∞ and let ε > 0. Let f : X → {1, ..., N} be a classifier and let a be an exact attack. Let f^a_CA : X → {1, ..., N} ∪ {⋆} be defined as:

    f^a_CA(x) = f(x) if ∥a_{f,p}(x) − x∥_p > ε, and ⋆ otherwise    (4)

Then ∀x ∈ X, an L_p attack on x with radius greater than or equal to ε and with ⋆ ∉ C(x) fails.

The notation f^a_CA(x) refers to the classifier f combined with CA, relying on attack a. The condition ⋆ ∉ C(x) requires that the input generated by the attack not be flagged by CA.

Corollary 4.1. Let 1 ≤ p ≤ ∞ and let ε > 0. Let f be a classifier on inputs with n elements that uses CA with norm p and radius ε. Then for all inputs and for all 1 ≤ r < p, L_r attacks of radius greater than or equal to ε and with ⋆ ∉ C(x) will fail. Similarly, for all inputs and for all r > p, L_r attacks of radius greater than or equal to n^(1/r − 1/p) ε and with ⋆ ∉ C(x) will fail (treating 1/∞ as 0).

Since the only expensive step of CA consists in applying an adversarial attack to an input, its complexity is the same as that of a regular attack. CA can therefore represent a more feasible task than training a robust model.

Attacking with a Higher Radius In addition to robustness guarantees for a chosen ε, CA provides a form of computational robustness even beyond its intended radius. To prove this statement, we first formalize the task of attacking CA (referred to as Counter-CA, or CCA). This involves finding, given a starting point x, an input x′ ∈ B_p(x, ε′) that is adversarial but not flagged by CA, i.e. such that f(x′) ∈ C(x) ∧ ∀x′′ ∈ B_p(x′, ε). f(x′′) = f(x′). Note that, for ε′ ≤ ε, no solution exists, since x ∈ B_p(x′, ε) and f(x) ≠ f(x′).

Theorem 5 (Attacking CA with a higher radius is Σ^P_2-complete). Let CCA_p be the set of all tuples ⟨x, ε, ε′, C, f⟩ such that:

    ∃x′ ∈ B_p(x, ε′). (f(x′) ∈ C(x) ∧ ∀x′′ ∈ B_p(x′, ε). f(x′′) = f(x′))    (5)

where x ∈ X, X is an FSFP space, ε′ > ε, f(x) ∉ C(x), f is a ReLU classifier, and whether an output is in C(x*) for some x* can be decided in polynomial time. Then CCA_∞ is Σ^P_2-complete.

Corollary 5.1. CCA_p is Σ^P_2-complete for all 0 < p ≤ ∞.

Corollary 5.2. Theorem 5 holds even if, instead of ReLU classifiers, we consider the more general set of polynomial-time classifiers (w.r.t. the size of the tuple).

In other words, under our assumptions, fooling CA is harder than running it. This phenomenon represents a form of computational robustness, a term introduced by Garg et al. (2020) in a very different setting, in which genuine examples can be cryptographically signed. Corollary 2.1 also implies that, unless the Polynomial Hierarchy collapses, it is impossible to obtain a larger gap between running the model and attacking it (e.g. a P-time model that is Σ^P_2-hard to attack). Note that while Theorem 5 shows that fooling CA is Σ^P_2-complete in general, attacking can be expected to be easy in practice when ε′ ≫ ε: this is, however, a very extreme case, in which the threshold may have been poorly chosen or the adversarial examples might be visually distinguishable from genuine examples.
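Under our reading of Corollary 4.1, the radius up to which an L_p certificate of radius ε also covers L_r attacks can be computed directly. A small sketch (the helper name is hypothetical; 1/∞ is treated as 0 as in the corollary):

```python
def covered_radius(eps, n, p, r):
    """Radius up to which L_r attacks are covered by an L_p CA
    certificate of radius eps on n-element inputs (Corollary 4.1).
    For r <= p the full radius eps carries over; for r > p it
    shrinks by the factor n**(1/r - 1/p) < 1."""
    inv = lambda q: 0.0 if q == float("inf") else 1.0 / q
    if r <= p:
        return eps
    return n ** (inv(r) - inv(p)) * eps
```

For instance, an L_2 certificate of radius ε on an n-pixel image yields an L_∞ guarantee of radius ε/√n, reflecting the usual norm-equivalence constants.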

5.2. USING HEURISTIC ATTACKS WITH CA

CA in its exact form has limited scalability due to Theorem 1. This could be addressed by using approaches with guaranteed bounds, as suggested in Section 5, or simply by relying on heuristic attacks. In this second scenario, to compensate for the heuristic nature of the employed attacks, we can flag an input x′ if the attack fails to find an adversarial example within a radius of ε + b(x′), where b : X → R≥0 is a buffer model. The idea behind b is that if a heuristic attack can identify an adversarial example within a radius of ε + b(x′), an exact attack would be able to find one within a radius of ε. The effectiveness of this approach depends on how well heuristic attacks approximate the decision boundary distance, which is an interesting topic of investigation in itself. Note that consistency of the estimate is in fact more important than its accuracy: if a heuristic attack overestimates d*_p(x) in a predictable manner, we can train a buffer model to accurately correct the error. With this approach, if the heuristic attack finds an adversarial example at distance less than ε, we can confidently flag the input (i.e. false positives are guaranteed to be impossible). However, if the distance is above ε, a false negative is possible. Note that using approaches with guaranteed bounds would lead to the complementary situation.

Fooling the Heuristic Attack-Based CA The fact that the heuristic relaxation of CA can overestimate the decision boundary distance means that it is possible to generate adversarial examples with ε′ ≤ ε. Specifically, if an adversarial example x_adv for an input x is such that d*_p(x_adv) ≤ ε and f(x_adv) ≠ f(x), but ∥a_{f,p}(x_adv) − x_adv∥_p > ε + b(x_adv), CA will incorrectly accept x_adv. However, several informal considerations suggest that fooling CA might be harder than running it. These considerations are backed by empirical evidence in Section 6.
First, both CA and CCA need to attack the same model, but CCA has at most as much information about the target model as CA, making the attacker at most as sample-efficient as the defender. Second, fooling CA involves solving a nested optimization problem, while CA only needs to solve a single-level one; in particular, verifying the feasibility of a CCA solution involves running CA on that solution. Finally, as better attacks are developed, the chances of CA being fooled become slimmer, since stronger attacks are less likely to find sub-optimal adversarial examples.
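The heuristic variant with a buffer model b can be sketched as follows (all names illustrative; `attack` is now heuristic, so the flag threshold is ε + b(x) rather than ε, and the unflagged branch is no longer a certificate):

```python
import numpy as np

def heuristic_counter_attack(f, attack, buffer_model, x, eps, p=np.inf):
    """Heuristic CA: flag x unless the heuristic attack fails to find
    an adversarial example within radius eps + b(x), where the buffer
    b compensates for the attack's overestimation of d*_p(x).
    Distances below eps are always safe flags (no false positives);
    the buffer only governs the false-negative region."""
    x_adv = attack(f, x, p)
    if x_adv is None:
        return f(x)
    dist = np.linalg.norm((x_adv - x).ravel(), ord=p)
    if dist <= eps + buffer_model(x):
        return "*"            # flagged as potentially non-robust
    return f(x)               # accepted (approximate guarantee only)
```

In Section 6, the buffer is instantiated as a fitted linear (or quantile) correction of the pool's overestimate.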

6. EMPIRICAL INVESTIGATION OF HEURISTIC ATTACKS

Section 5.2 introduced the problem of investigating how accurately heuristic attacks can approximate the true decision boundary distance, which is needed for the heuristic version of CA to work, but is also an interesting topic per se. In this section, we test whether ∥x − x_h∥_p, where x_h is an adversarial example found by a heuristic attack, is predictably close to the true decision boundary distance d*_p(x). Consistently with Athalye et al. (2018) and Weng et al. (2018a), we focus on the L_∞ norm. Additionally, we focus on pools of heuristic attacks. The underlying rationale is that different adversarial attacks should be able to cover for one another's blind spots, providing a more reliable estimate. Since this evaluation is empirical, it requires sampling from a chosen distribution, in our case specific classifiers and the MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009) datasets. This means that the results are not guaranteed to hold for other distributions, or for other defended models: studying how adversarial attacks fare in these cases is an important topic for future work.

Experimental Setup We randomly selected ~2.3k samples from the test set of each of the two datasets, MNIST and CIFAR10. We used three architectures per dataset (named A, B and C), each trained in three settings, namely standard training, PGD adversarial training (Madry et al., 2018) and PGD adversarial training with ReLU loss and pruning (Xiao et al., 2019) (from now on referred to as ReLU training), for a total of nine configurations per dataset. Since our analysis requires computing exact decision boundary distances, and both size and depth have a strong adverse impact on solver times, we used small and relatively shallow networks with between ~2k and ~80k parameters. Note that using (more scalable) NN verification approaches that provide bounds without tightness guarantees is not an option, as they would prevent us from drawing any firm conclusion.
For this reason, the natural accuracies for standard training are significantly below the state of the art (89.63%–95.87% on MNIST and 47.85%–55.81% on CIFAR10). Adversarial training also had a negative effect on natural accuracies (84.54%–94.24% on MNIST), similarly to ReLU training (83.69%–93.57% on MNIST and 32.27%–37.33% on CIFAR10). We first ran a pool of heuristic attacks on each example, namely (Kurakin et al., 2017; Brendel et al., 2019; Carlini & Wagner, 2017c; Moosavi-Dezfooli et al., 2016; Goodfellow et al., 2015; Madry et al., 2018), as well as simply adding uniform noise to the input. Our main choice of attack parameters (from now on referred to as the "strong" parameter set) prioritizes finding adversarial examples at the expense of computational time. For each example, we considered the nearest feasible adversarial example found by any attack in the pool. We then ran the exact solver-based attack MIPVerify (Tjeng et al., 2019), which finds the nearest adversarial example to a given input. The entire process (including test runs) required ~45k core-hours on an HPC cluster. Each node of the cluster has 384 GB of RAM and two Intel Cascade Lake 8260 CPUs, each with 24 cores and a clock frequency of 2.4 GHz. We removed the examples for which MIPVerify crashed in at least one setting, obtaining 2241 examples for MNIST and 2269 for CIFAR10. We also excluded from our analysis all adversarial examples for which MIPVerify did not find optimal bounds (atol = 1e-5, rtol = 1e-10), which represent on average 11.95% of the examples for MNIST and 16.30% for CIFAR10. Additionally, we ran the same heuristic attacks with a faster parameter set (from now on referred to as the "balanced" set) on a single machine with an AMD Ryzen 5 1600X six-core 3.6 GHz processor, 16 GB of RAM and an NVIDIA GTX 1060 6 GB GPU. This process took approximately 8 hours. Refer to Appendix G for a more comprehensive overview of our experimental setup.
Distance Approximation Across all settings, the mean distance found by the strong attack pool is 4.09±2.02% higher for MNIST and 2.21±1.16% higher for CIFAR10 than the one found by MIPVerify. For 79.81±15.70% of the MNIST instances and 98.40±1.63% of the CIFAR10 ones, the absolute difference is less than 1/255, which is the minimum representable distance in 8-bit image formats. The balanced attack pool performs similarly, finding distances that are on average 4.65±2.16% higher for MNIST and 2.04±1.13% higher for CIFAR10. The difference is below 1/255 for 77.78±16.08% of MNIST examples and 98.74±1.13% of CIFAR10 examples. We compare the distances found by the strong attack pool for MNIST A and CIFAR10 (using standard training) with the true decision boundary distances in Figure 1. Refer to Appendix I for the full data. For all datasets, architectures and training techniques there appears to be a strong linear correlation between the distance returned by the heuristic attacks and the true decision boundary distance. We chose to measure this by training a linear regression model linking the two distances. For the strong parameter set, we find that the average R² across all settings is 0.992±0.004 for MNIST and 0.997±0.003 for CIFAR10. The balanced parameter set performs similarly, achieving an R² of 0.990±0.006 for MNIST and 0.998±0.002 for CIFAR10. From these results, we conjecture that increasing the computational budget of heuristic attacks does not necessarily improve predictability, although further tests would be needed to confirm this claim. Note that such a linear model can also be used as a buffer function for heuristic CA. Another (possibly more reliable) procedure consists in using quantile fitting; results for this approach are reported in Appendix H.
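The pool-minimum distance and the linear fit used above can be reproduced in a few lines (a sketch; attack and data names are placeholders for the pools and solver distances described in this section):

```python
import numpy as np

def pool_distance(attacks, f, x, p=np.inf):
    """Distance estimate of an attack pool: the nearest adversarial
    example found by any attack in the pool (np.inf if all fail)."""
    dists = []
    for attack in attacks:
        x_adv = attack(f, x)
        if x_adv is not None:
            dists.append(np.linalg.norm((x_adv - x).ravel(), ord=p))
    return min(dists) if dists else np.inf

def fit_buffer(pool_d, exact_d):
    """Least-squares line linking pool distances to exact boundary
    distances, with the R^2 score reported in this section; the
    fitted line can double as a buffer function for heuristic CA."""
    slope, intercept = np.polyfit(pool_d, exact_d, 1)
    pred = slope * np.asarray(pool_d) + intercept
    ss_res = float(np.sum((np.asarray(exact_d) - pred) ** 2))
    ss_tot = float(np.sum((np.asarray(exact_d) - np.mean(exact_d)) ** 2))
    return slope, intercept, 1.0 - ss_res / ss_tot
```

A consistent multiplicative overestimate (as observed empirically) yields a near-perfect linear fit, which is exactly why consistency matters more than raw accuracy here.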
Attack Pool Ablation Study Due to the nontrivial computational cost of running several attacks on the same input, we now study whether it is possible to drop some attacks from the pool without compromising its predictability. Specifically, we consider all possible pools of size n (with a success rate of 100%) and pick the one with the highest average R² value over all architectures and training techniques. As shown in Figure 2, adding attacks does increase predictability, although with diminishing returns. For example, the pool composed of the Basic Iterative Method, the Brendel & Bethge attack and the Carlini & Wagner attack achieves on its own an R² value of 0.988±0.004 for MNIST+strong, 0.986±0.005 for MNIST+balanced, 0.935±0.048 for CIFAR10+strong and 0.993±0.003 for CIFAR10+balanced. Moreover, dropping both the Fast Gradient Sign Method and uniform noise leads to negligible (≪ 0.001) absolute variations in the mean R². These findings suggest that, as far as consistency is concerned, the choice of attacks is a more important factor than the number of attacks in a pool. Refer to Appendix J for a more in-depth overview of how different attack selections affect consistency and accuracy.

Efficient Attacks We then explore whether it is possible to increase the efficiency of attacks by optimizing for fast, rather than accurate, results. We pick three new parameter sets (namely Fast-100, Fast-1k and Fast-10k) designed to find the nearest adversarial examples within the corresponding number of calls to the model. We find that, while DeepFool is not the strongest adversarial attack (see Appendix I), it provides adequate results in very few model calls. For details on these results see Appendix K.

Fooling the Heuristic Attack-Based CA An open question from Section 5.2 is the empirical difficulty of fooling the version of CA based on heuristic attacks. To study it, we carried out a limited experiment in which we attempted to fool a CA-defended model.
We used DeepFool Fast-1k as the heuristic attack for CA, then built a proof-of-concept CCA implementation based on the PGD method, thus setting a baseline for attacks against CA. This variant uses a custom loss L CCA (x, y) = L P GD (x, y) + λ∥x - a f,p (x)∥ p , which rewards adversarial examples with overestimated decision boundary distances. We vary λ in order to test various trade-offs between the two terms. To estimate the gradient of the second term, we use Natural Evolution Strategies (Wierstra et al., 2014). As a sanity check, we also attack using uniform noise. Due to the high computational requirements of this experiment (30-60 minutes and ∼1.2M model calls per sample on the GTX 1060 machine), we only attack MNIST A Standard and CIFAR10 A Standard, on 100 samples each. For comparison, running DeepFool on ten 250-element batches takes approximately 10 seconds. Overall, the attacks have a success rate of 0%-3%. However, we find that increasing ε ′ (while keeping ε constant) increases the success rate, up to 100% for ε ′ = 10ε. This suggests that as the difference between ε ′ and ε grows, so does the feasibility of fooling CA, which is consistent with our analysis. More in-depth results of our experiments can be found in Appendix L.
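The NES gradient estimate used for the second loss term can be sketched as follows (a minimal sketch; σ and the sample count are illustrative, and f stands in for the black-box distance term ∥x - a f,p (x)∥ p ):

```python
import numpy as np

def nes_gradient(f, x, sigma=0.001, n_samples=50, rng=None):
    """Estimate the gradient of a black-box scalar function f at x using
    antithetic Gaussian sampling, in the style of Natural Evolution Strategies."""
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        delta = rng.standard_normal(x.shape)
        # Antithetic pair: finite difference along a random direction.
        grad += (f(x + sigma * delta) - f(x - sigma * delta)) * delta
    return grad / (2 * sigma * n_samples)
```

The estimate only requires forward evaluations of f, which is why it fits a setting where the distance term is computed by running a heuristic attack rather than by differentiable code.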

UG100 Dataset

We collect all the adversarial examples found by both MIPVerify and the heuristic attacks into a new dataset, which we name UG100. UG100 can be used to benchmark new adversarial attacks. Specifically, we can determine how strong an attack is by comparing it to both the theoretical optimum and heuristic attack pools. Another potential application involves studying factors that affect whether adversarial attacks perform sub-optimally.
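The benchmarking use described above can be sketched as follows (an illustrative helper; array names are assumptions):

```python
import numpy as np

def overestimate_pct(attack_d, optimal_d):
    """Mean percentage over-estimate of an attack's adversarial distances
    relative to the provably optimal (e.g. MIPVerify) distances."""
    attack_d = np.asarray(attack_d, dtype=float)
    optimal_d = np.asarray(optimal_d, dtype=float)
    return float(np.mean((attack_d - optimal_d) / optimal_d) * 100.0)
```

Comparing this gap across attacks (or across inputs) is one way to study when and why heuristic attacks behave sub-optimally.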

7. CONCLUSION

We proved that attacking is N P -complete in the worst case, while training a robust model is Σ P 2 -complete, barring collapse of the Polynomial Hierarchy. We then showed how such a structural asymmetry can be sidestepped by adopting a different perspective on defense. This is exemplified by Counter-Attack, a technique that can identify non-robust points in N P time. We showed that CA can provide robustness guarantees up to an arbitrary ε. The CA approach naturally benefits from improvements in the field of adversarial attacks, and can be combined with other forms of defense. Due to its independence from the specific characteristics of the defended model, CA can also be applied to non-ML tools (e.g. signature-based malware detectors). We also believe that it should be possible to extend CA beyond classification. Finally, in an empirical evaluation we showed that heuristic attacks can provide an accurate and consistent approximation of the true decision boundary distance, which has implications for the viability of a heuristic version of CA. While our investigation is limited to small-scale networks, we expect that improvements in the field of NN verification will enable testing whether the observed results generalize to larger architectures. Overall, we hope that our contributions can provide broad benefits to the field of adversarial robustness by 1) highlighting a potential structural challenge; 2) pointing out how it can be sidestepped by a change in perspective; 3) showing a proof-of-concept defense based on this idea; 4) providing experiments and a dataset to serve as a baseline and starting point.

REPRODUCIBILITY

We provide all our code and data in the repository linked in Section 1. Additionally, we report the key reproducibility information in Section 6, while all the other information can be found in Appendices G and L. To ensure maximum reproducibility, we also used consistent seeds across all experiments (one for parameter tuning and one for actual experiments). We also made sure to only rely on tools that are either open-source or for which there are free academic licenses. Concerning theoretical results, we provide full proofs of all theorems and corollaries in the appendices. Finally, for users with a slow internet connection, we also provide UG100 in JSON format (containing only the found adversarial distances). 

A PROOF PRELIMINARIES

A.1 NOTATION

We use f i to denote the i-th output of a network. We define f as f (x) = arg max i {f i (x)}; in situations where multiple outputs are equal to the maximum, we use the class with the lowest index.

A.2 µ ARITHMETIC

Given two FSFP spaces X and X ′ with distance minorants µ and µ ′ , we can compute new positive minorants after applying functions to the spaces as follows:
• Sum of two vectors: µ X+X ′ = min(µ, µ ′ );
• Multiplication by a constant: µ αX = αµ;
• ReLU: µ ReLU (X) = µ.
Since the distance minorant of a space transformed by any of these functions can be computed in polynomial time, the distance minorant of a space transformed by any composition of such functions can also be computed in polynomial time.
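The minorant rules above compose directly; a minimal sketch (function names are illustrative):

```python
def mu_sum(mu, mu_prime):
    # Sum of two vectors: the minorant of X + X' is the smaller of the two.
    return min(mu, mu_prime)

def mu_scale(alpha, mu):
    # Multiplication by a positive constant: the minorant scales accordingly.
    return alpha * mu

def mu_relu(mu):
    # ReLU leaves the minorant unchanged.
    return mu
```

For example, the minorant of ReLU(2X + X ′ ) with µ = 0.5 and µ ′ = 0.25 is obtained by chaining the three rules, which is what makes minorant computation polynomial for compositions.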

A.3 FUNCTIONS

We now provide an overview of several functions that can be obtained by using linear combinations and ReLUs.

max Carlini et al. (2017) showed that we can implement the max function using linear combinations and ReLUs as follows:

max(x, y) = ReLU(x - y) + y (7)

We can also obtain an n-ary version of max by chaining multiple instances together.

step If X is a FSFP space, then the scalar function

step 0 (x) = 1/µ (ReLU(x) - ReLU(x - µ)) (8)

is such that ∀i. ∀x ∈ X, step 0 (x i ) = 0 for x i ≤ 0 and step 0 (x i ) = 1 for x i > 0. Similarly, let step 1 be defined as follows:

step 1 (x) = 1/µ (ReLU(x + µ) - ReLU(x)) (9)

Note that ∀i. ∀x ∈ X, step 1 (x i ) = 0 for x i < 0 and step 1 (x i ) = 1 for x i ≥ 0.

Boolean Functions We then define the Boolean functions not : {0, 1} → {0, 1}, and : {0, 1} 2 → {0, 1}, or : {0, 1} 2 → {0, 1} and if : {0, 1} 3 → {0, 1} as follows:

not(x) = 1 - x (10)
and(x, y) = step 1 (x + y - 2) (11)
or(x, y) = step 1 (x + y - 1) (12)
if(a, b, c) = or(and(a, b), and(not(a), c)) (13)

Note that we can obtain n-ary variants of and and or by chaining multiple instances together.

cnf 3 Given a set z = {{z 1,1 , . . . , z 1,3 }, . . . , {z n,1 , . . . , z n,3 }} of Boolean atoms (i.e. z i,j (x) = x k or ¬x k for a certain k) defined on an n-long Boolean vector x, cnf 3 (z) returns the following Boolean function:

cnf ′ 3 (x) = ⋀ i=1,...,n ⋁ j=1,...,3 z i,j (x) (14)

We refer to z as a 3CNF formula. Since cnf ′ 3 only uses negation, conjunction and disjunction, it can be implemented using not, and and or respectively. Note that, given z, we can build cnf ′ 3 in polynomial time w.r.t. the size of z.
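The gadgets above can be checked numerically. The sketch below assumes Boolean inputs with distance minorant µ = 1, and writes the or gate as step 1 (x + y - 1), mirroring the pattern of the and gate:

```python
def relu(t):
    return max(t, 0.0)

MU = 1.0  # distance minorant of the Boolean input space {0, 1}

def max2(x, y):
    # max(x, y) = ReLU(x - y) + y
    return relu(x - y) + y

def step0(t):
    # 0 for t <= 0, 1 for t >= MU
    return (relu(t) - relu(t - MU)) / MU

def step1(t):
    # 0 for t <= -MU, 1 for t >= 0
    return (relu(t + MU) - relu(t)) / MU

def not_(x):
    return 1.0 - x

def and_(x, y):
    return step1(x + y - 2.0)

def or_(x, y):
    # assumed form, mirroring the and gate
    return step1(x + y - 1.0)
```

Since every gadget is a composition of linear combinations and ReLUs, any circuit built from them (such as cnf ′ 3 ) is itself a ReLU network of polynomial size.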
Comparison Functions We can use step 0 , step 1 and not to obtain comparison functions as follows:

geq(x, k) = step 1 (x - k) (15)
gt(x, k) = step 0 (x - k) (16)
leq(x, k) = not(gt(x, k)) (17)
lt(x, k) = not(geq(x, k)) (18)
eq(x, k) = and(geq(x, k), leq(x, k)) (19)

B PROOF OF THEOREM 1

B.1 U -AT T ∞ ∈ N P

To prove that U -AT T ∞ ∈ N P , we show that there exists a polynomial certificate for U -AT T ∞ that can be checked in polynomial time. The certificate is the value of x ′ , which has a representation of the same size as x (due to the FSFP space assumption) and can be checked by verifying:
• ∥x - x ′ ∥ ∞ ≤ ε, which can be checked in linear time;
• f θ (x ′ ) ̸ = f (x), which can be checked in polynomial time.

B.2 U -AT T ∞ IS N P -HARD

We prove that U -AT T ∞ is N P -hard by showing that 3SAT ≤ U -AT T ∞ . Given a set of 3CNF clauses z = {{z 11 , z 12 , z 13 }, . . . , {z m1 , z m2 , z m3 }} defined on n Boolean variables x̂ 1 , . . . , x̂ n , we construct the following query q(z) for U -AT T ∞ :

q(z) = ⟨x (s) , 1/2 , f ⟩ (21)

where x (s) = (1/2 , . . . , 1/2) is a vector with n elements. Verifying q(z) ∈ U -AT T ∞ is equivalent to checking:

∃x ′ ∈ B ∞ (x (s) , 1/2) . f (x ′ ) ̸ = f (x (s) ) (22)

Note that x ∈ B ∞ (x (s) , 1/2) is equivalent to x ∈ [0, 1] n .

Truth Values We encode the truth values of x̂ as follows:

x ′ i ∈ [0, 1/2] ⇐⇒ x̂ i = 0 (23)
x ′ i ∈ (1/2 , 1] ⇐⇒ x̂ i = 1 (24)

We can obtain the truth value of a scalar variable by using isT(x i ) = gt(x i , 1/2). Let bin(x) = (isT(x 1 ), . . . , isT(x n )).

Definition of f We define f as follows:

f 1 (x) = and(not(isx (s) (x)), cnf ′ 3 (bin(x))) (25)
f 0 (x) = not(f 1 (x)) (26)

where cnf ′ 3 = cnf 3 (z) and isx (s) is defined as follows:

isx (s) (x) = and(eq(x 1 , 1/2), . . . , eq(x n , 1/2)) (27)

Note that f is designed such that f (x (s) ) = 0, while for x ′ ̸ = x (s) , f (x ′ ) = 1 iff the formula z is true for the variable assignment bin(x ′ ).

Lemma 1. z ∈ 3SAT =⇒ q(z) ∈ U -AT T ∞

Proof. Let z ∈ 3SAT .
Therefore ∃x * ∈ {0, 1} n such that cnf 3 (z)(x * ) = 1. Since bin(x * ) = x * and x * ̸ = x (s) , f (x * ) = 1, which means that it is a valid solution for Equation (22). From this we can conclude that q(z) ∈ U -AT T ∞ .

Lemma 2. q(z) ∈ U -AT T ∞ =⇒ z ∈ 3SAT

Proof. Since q(z) ∈ U -AT T ∞ , ∃x * ∈ [0, 1] n \ {x (s) } that is a solution to Equation (22) (i.e. f (x * ) = 1). Then cnf ′ 3 (bin(x * )) = 1, which means that there exists an x̂ (namely bin(x * )) such that cnf ′ 3 (x̂) = 1. From this we can conclude that z ∈ 3SAT .

Since:
• q(z) can be computed in polynomial time;
• z ∈ 3SAT =⇒ q(z) ∈ U -AT T ∞ ;
• q(z) ∈ U -AT T ∞ =⇒ z ∈ 3SAT ;
we can conclude that 3SAT ≤ U -AT T ∞ .

B.3 PROOF OF COROLLARY 1.1

B.3.1 U -AT T p ∈ N P

The proof is identical to the one for U -AT T ∞ .

B.3.2 U -AT T p IS N P -HARD

The proof that q(z) ∈ U -AT T p =⇒ z ∈ 3SAT is very similar to the one for U -AT T ∞ . Since q(z) ∈ U -AT T p , we know that ∃x * ∈ B p (x (s) , ε) \ {x (s) } . f (x * ) = 1, which means that there exists an x̂ (namely bin(x * )) such that cnf ′ 3 (x̂) = 1. From this we can conclude that z ∈ 3SAT .

The proof that z ∈ 3SAT =⇒ q(z) ∈ U -AT T p is slightly different: since x * may not lie in B p (x (s) , 1/2), we need a different input to prove that ∃x ′ ∈ B p (x (s) , 1/2) . f (x ′ ) = 1. Given a positive integer n and a real 0 < p < ∞, let ρ p,n (r) be a positive minorant of the L ∞ norm of a vector on the L p sphere of radius r. For example, for n = 2, p = 2 and r = 1, any positive value less than or equal to √2/2 is suitable. Note that, for 0 < p < ∞ and n, r > 0, ρ p,n (r) < r. Let z ∈ 3SAT . Therefore ∃x * ∈ {0, 1} n such that cnf 3 (z)(x * ) = 1. Let x * * be defined as:

x * * i = 1/2 - ρ p,n (1/2) if x * i = 0; x * * i = 1/2 + ρ p,n (1/2) if x * i = 1 (28)

By construction, x * * ∈ B p (x (s) , 1/2). Additionally, bin(x * * ) = x * , and since we know that z is true for the variable assignment x * , we can conclude that f (x * * ) = 1, which means that x * * is a valid solution for Equation (22). From this we can conclude that q(z) ∈ U -AT T p .

B.4 PROOF OF COROLLARY 1.2

The proof is identical to the proof of Theorem 1 (for p = ∞) and Corollary 1.1 (for 0 < p < ∞), with the exception of requiring f (x ′ ) = 1.

B.5 PROOF OF COROLLARY 1.3

The proof that attacking a polynomial-time classifier is in N P is the same as that for Theorem 1. Attacking a polynomial-time classifier is N P -hard due to the fact that the ReLU networks defined in the proof of Theorem 1 are polynomial-time classifiers. Since attacking a general polynomial-time classifier is a generalization of attacking a ReLU polynomial-time classifier, the problem is N P -hard.

C PROOF OF THEOREM 2

Proving that U -AT T p (f ) ∈ N P A means proving that it can be solved in polynomial time by a non-deterministic Turing machine with an oracle that can solve a problem in A. Since Z f ∈ A, we can do so by picking a non-deterministic Turing machine with access to an oracle that solves Z f . We then generate the adversarial example non-deterministically and return the output of the oracle. Due to the FSFP assumption, we know that the size of this input is the same as the size of the starting point, which means that it can be generated non-deterministically in polynomial time. Therefore, U -AT T p (f ) ∈ N P A .

C.1 PROOF OF COROLLARY 2.1

Follows directly from Theorem 2 and the definition of Σ P n .

D PROOF OF THEOREM 3

D.1 PRELIMINARIES

Π 2 3SAT is the set of all z such that:

∀ x̂ ∃ ŷ . R(x̂, ŷ) (29)

where R(x̂, ŷ) = cnf 3 (z)(x̂ 1 , . . . , x̂ n , ŷ 1 , . . . , ŷ n ). Stockmeyer (1976) showed that Π 2 3SAT is Π P 2 -complete. Therefore, coΠ 2 3SAT , which is defined as the set of all z such that:

∃ x̂ ∀ ŷ . ¬R(x̂, ŷ) (30)

is Σ P 2 -complete.

D.2 P L-ROB ∞ ∈ Σ P 2

P L-ROB ∞ ∈ Σ P 2 if there exists a problem A ∈ P and a polynomial q such that ∀Γ = ⟨x, ε, f θ , v f ⟩:

Γ ∈ P L-ROB ∞ ⇐⇒ ∃y . |y| ≤ q(|Γ|) ∧ (∀z . (|z| ≤ q(|Γ|) =⇒ ⟨Γ, y, z⟩ ∈ A)) (31)

This can be proven by setting y = θ ′ , z = x ′ and A as the set of triplets ⟨Γ, θ ′ , x ′ ⟩ such that all of the following are true:
• v f (θ ′ ) = 1;
• ∥x - x ′ ∥ ∞ ≤ ε;
• f θ ′ (x) = f θ ′ (x ′ ).
Since all properties can be checked in polynomial time, A ∈ P and thus P L-ROB ∞ ∈ Σ P 2 .

D.3 P L-ROB ∞ IS Σ P 2 -HARD

We will prove that P L-ROB ∞ is Σ P 2 -hard by showing that coΠ 2 3SAT ≤ P L-ROB ∞ . Let n x̂ be the length of x̂ and let n ŷ be the length of ŷ. Given a set z of 3CNF clauses, we construct the following query q(z) for P L-ROB ∞ :

q(z) = ⟨x (s) , 1/2 , f θ , v f ⟩ (32)

where x (s) = (1/2 , . . . , 1/2) is a vector with n ŷ elements and v f (θ) = 1 ⇐⇒ θ ∈ {0, 1} n x̂ . Note that θ ′ ∈ {0, 1} n x̂ can be checked in polynomial time w.r.t.
the size of the input.

Truth Values We will encode the truth values of x̂ as a set of binary parameters θ ′ , while we will encode the truth values of ŷ using x ′ through the same technique mentioned in Appendix B.2.

Definition of f θ We define f θ as follows:
• f θ,1 (x) = and(not(isx (s) (x)), cnf ′′ 3 (θ, x)), where cnf ′′ 3 is defined over θ and bin(x) using the same technique mentioned in Appendix B.2 and isx (s) (x) = and i=1,...,n eq(x i , 1/2);
• f θ,0 (x) = not(f θ,1 (x)).
Note that f θ (x (s) ) = 0 for all choices of θ. Additionally, f θ is designed such that:

∀x ′ ∈ B ∞ (x (s) , 1/2) \ {x (s) } . ∀θ ′ . (v f (θ ′ ) = 1 =⇒ (f θ ′ (x ′ ) = 1 ⇐⇒ R(θ ′ , bin(x ′ )))) (33)

Lemma 3. z ∈ coΠ 2 3SAT =⇒ q(z) ∈ P L-ROB ∞

Proof. Since z ∈ coΠ 2 3SAT , there exists a Boolean vector x * such that ∀ ŷ . ¬R(x * , ŷ). Then both of the following statements are true:
• v f (x * ) = 1, since x * ∈ {0, 1} n x̂ ;
• ∀x ′ ∈ B ∞ (x (s) , ε) . f x * (x ′ ) = 0, since f x * (x ′ ) = 1 ⇐⇒ R(x * , bin(x ′ )).
Therefore, x * is a valid solution for Equation (3) and thus q(z) ∈ P L-ROB ∞ .

Lemma 4. q(z) ∈ P L-ROB ∞ =⇒ z ∈ coΠ 2 3SAT

Proof. Since q(z) ∈ P L-ROB ∞ , there exists a θ * such that:

v f (θ * ) = 1 ∧ ∀x ′ ∈ B ∞ (x (s) , ε) . f θ * (x ′ ) = f θ * (x (s) ) (34)

Note that θ * ∈ {0, 1} n x̂ , since v f (θ * ) = 1. Moreover, ∀ ŷ . ¬R(θ * , ŷ), since bin(ŷ) = ŷ and f θ * (ŷ) = 1 ⇐⇒ R(θ * , ŷ). Therefore, θ * is a valid solution for Equation (30), which implies that z ∈ coΠ 2 3SAT .

Since:
• q(z) can be computed in polynomial time;
• z ∈ coΠ 2 3SAT =⇒ q(z) ∈ P L-ROB ∞ ;
• q(z) ∈ P L-ROB ∞ =⇒ z ∈ coΠ 2 3SAT ;
we can conclude that coΠ 2 3SAT ≤ P L-ROB ∞ .

D.4 PROOF OF COROLLARY 3.1

D.4.1 P L-ROB p ∈ Σ P 2

The proof is identical to the one for P L-ROB ∞ .

D.4.2 P L-ROB p IS Σ P 2 -HARD

We follow the same approach used in the proof of Corollary 1.1.

Proof of q(z) ∈ P L-ROB p =⇒ z ∈ coΠ 2 3SAT If q(z) ∈ P L-ROB p , it means that ∃θ * . (v f (θ * ) = 1 ∧ ∀x ′ ∈ B p (x (s) , 1/2) . f θ * (x ′ ) = 0).
Then, for every ŷ, there exists a corresponding input y * * ∈ B p (x (s) , 1/2), defined as follows:

y * * i = 1/2 - ρ p,n (1/2) if ŷ i = 0; y * * i = 1/2 + ρ p,n (1/2) if ŷ i = 1 (35)

such that bin(y * * ) = ŷ. Since y * * ∈ B p (x (s) , 1/2), cnf ′′ 3 (θ * , bin(y * * )) = 0, which means that R(θ * , ŷ) is false. In other words, ∃θ * . ∀ ŷ . ¬R(θ * , ŷ), i.e. z ∈ coΠ 2 3SAT .

Proof of z ∈ coΠ 2 3SAT =⇒ q(z) ∈ P L-ROB p The proof is very similar to the corresponding one for Theorem 3. If z ∈ coΠ 2 3SAT , then ∃ x̂ * . ∀ ŷ . ¬R(x̂ * , ŷ). Set θ * = x̂ * . We know that f θ * (x (s) ) = 0. We also know that ∀x ′ ∈ B p (x (s) , 1/2) \ {x (s) } . (f θ * (x ′ ) = 1 ⇐⇒ cnf ′′ 3 (θ * , x ′ ) = 1). In other words, ∀x ′ ∈ B p (x (s) , 1/2) \ {x (s) } . (f θ * (x ′ ) = 1 ⇐⇒ R(θ * , bin(x ′ ))). Since R(θ * , ŷ) is false for all choices of ŷ, ∀x ′ ∈ B p (x (s) , 1/2) \ {x (s) } . f θ * (x ′ ) = 0. Given that f θ * (x (s) ) = 0, we can conclude that θ * satisfies Equation (3).

D.5 PROOF OF COROLLARY 3.2

Similarly to the proof of Corollary 1.3, it follows from the fact that ReLU classifiers are polynomial-time classifiers (w.r.t. the size of the tuple).

E PROOF OF THEOREM 4

There are two cases:
• ∀x ′ ∈ B p (x, ε) . f (x ′ ) = f (x): then the attack fails because f (x) ̸ ∈ C(x);
• ∃x ′ ∈ B p (x, ε) . f (x ′ ) ̸ = f (x): then, due to the symmetry of the L p norm, x ∈ B p (x ′ , ε). Since f (x) ̸ = f (x ′ ), x is a valid adversarial example for x ′ , which means that f (x ′ ) = ⋆. Since ⋆ ̸ ∈ C(x), the attack fails.

E.1 PROOF OF COROLLARY 4.1

Assume that ∀x . ||x|| r ≥ η||x|| p and fix x (s) ∈ X. Let x ′ ∈ B r (x (s) , ηε) be an adversarial example. Then ||x ′ - x (s) || r ≤ ηε, and thus η||x ′ - x (s) || p ≤ ηε. Dividing by η, we get ||x ′ - x (s) || p ≤ ε, which means that x (s) is a valid adversarial example for x ′ and thus x ′ is rejected by p-CA. We now proceed to find the values of η.

E.1.1 1 ≤ r < p

We will prove that ||x|| r ≥ ||x|| p .

Case p < ∞ Consider e = x / ||x|| p . e is such that ||e|| p = 1 and, for all i, |e i | ≤ 1. Since r < p, for all 0 ≤ t ≤ 1 we have |t| p ≤ |t| r . Therefore:

||e|| r = (Σ i=1,...,n |e i | r ) 1/r ≥ (Σ i=1,...,n |e i | p ) 1/r = ||e|| p p/r = 1 (36)

Then, since ||e|| r ≥ 1:

||x|| r = || ||x|| p e || r = ||x|| p ||e|| r ≥ ||x|| p (37)

Case p = ∞ Since ||x|| r ≥ ||x|| p for all r < p and since the expressions on both sides of the inequality are compositions of continuous functions, taking the limit as p → ∞ we get ||x|| r ≥ ||x|| ∞ .
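The inequality for 1 ≤ r < p can be sanity-checked numerically (an illustrative sketch, not part of the proof):

```python
import numpy as np

def lp_norm(x, p):
    """L_p norm, with p = inf handled as the max norm."""
    x = np.abs(np.asarray(x, dtype=float))
    return float(x.max()) if np.isinf(p) else float((x ** p).sum() ** (1.0 / p))

def check_monotonicity(x, r, p):
    """Verify ||x||_r >= ||x||_p for r < p on a concrete vector."""
    assert r < p
    return lp_norm(x, r) >= lp_norm(x, p)
```
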

E.1.2 r > p

We will prove that ||x|| r ≥ n 1/r - 1/p ||x|| p .

Case r < ∞ Hölder's inequality states that, given α, β ≥ 1 such that 1/α + 1/β = 1 and given f and g, we have ||f g|| 1 ≤ ||f || α ||g|| β . Setting α = r/(r - p), β = r/p, f = (1, . . . , 1) and g = (x 1 p , . . . , x n p ), we obtain:
• ||f g|| 1 = Σ i=1,...,n (1 · x i p ) = ||x|| p p ;
• ||f || α = (Σ i=1,...,n 1) 1/α = n 1/α ;
• ||g|| β = (Σ i=1,...,n x i pr/p ) p/r = (Σ i=1,...,n x i r ) p/r = ||x|| r p .
Therefore ||x|| p p ≤ n 1/α ||x|| r p . Raising both sides to the power of 1/p, we get ||x|| p ≤ n 1/(pα) ||x|| r , i.e.:

||x|| p ≤ n (r-p)/(pr) ||x|| r = n 1/p - 1/r ||x|| r

Dividing by n 1/p - 1/r , we get:

n 1/r - 1/p ||x|| p ≤ ||x|| r

Case r = ∞ Since the expressions on both sides of the inequality are compositions of continuous functions, taking the limit as r → ∞ we get ||x|| ∞ ≥ n -1/p ||x|| p .

F PROOF OF THEOREM 5

F.1 CCA ∞ ∈ Σ P 2

CCA ∞ ∈ Σ P 2 iff there exists a problem A ∈ P and a polynomial p such that ∀Γ = ⟨x, ε, ε ′ , C, f ⟩:

Γ ∈ CCA ∞ ⇐⇒ ∃y . |y| ≤ p(|Γ|) ∧ (∀z . (|z| ≤ p(|Γ|) =⇒ ⟨Γ, y, z⟩ ∈ A))

This can be proven by setting y = x ′ , z = x ′′ and A as the set of all triplets ⟨Γ, x ′ , x ′′ ⟩ such that all of the following are true:
• ∥x - x ′ ∥ ∞ ≤ ε ′ ;
• f (x ′ ) ∈ C(x);
• ∥x ′′ - x ′ ∥ ∞ ≤ ε;
• f (x ′′ ) = f (x ′ ).
Since all properties can be checked in polynomial time, A ∈ P .

F.2 CCA ∞ IS Σ P 2 -HARD

We will show that CCA ∞ is Σ P 2 -hard by proving that coΠ 2 3SAT ≤ CCA ∞ . First, suppose that the lengths of x̂ and ŷ differ: in that case, we pad the shorter one with additional variables that will not be used. Let n be the maximum of the lengths of x̂ and ŷ. Given a set z of 3CNF clauses, we construct the following query q(z) for CCA ∞ :

q(z) = ⟨x (s) , γ, 1/2 , C u , h⟩

where 1/4 < γ < 1/2 and x (s) = (1/2 , . . . , 1/2) is a vector with n elements. Verifying q(z) ∈ CCA ∞ is equivalent to checking:

∃x ′ ∈ B ∞ (x (s) , 1/2) . (h(x ′ ) ̸ = h(x (s) ) ∧ ∀x ′′ ∈ B ∞ (x ′ , γ) . h(x ′′ ) = h(x ′ )) (43)

Note that x ′ ∈ [0, 1] n .
Truth Values We will encode the truth values of x̂ and ŷ as follows:

x ′′ i ∈ (0, 1/4) ⇐⇒ x̂ i = 0 ∧ ŷ i = 0
x ′′ i ∈ (1/4, 1/2) ⇐⇒ x̂ i = 0 ∧ ŷ i = 1
x ′′ i ∈ (1/2, 3/4) ⇐⇒ x̂ i = 1 ∧ ŷ i = 0
x ′′ i ∈ (3/4, 1) ⇐⇒ x̂ i = 1 ∧ ŷ i = 1 (44)

Let e x̂i (x) = gt(x i , 1/2) and let:

e ŷi (x) = or(open(x i , 1/4 , 1/2), open(x i , 3/4 , 1))

Note that e x̂i (x ′′ i ) returns the truth value of x̂ i and e ŷi (x ′′ i ) returns the truth value of ŷ i (as long as the input is within one of the ranges described in Equation (44)).

Invalid Encodings All encodings other than the ones described in Equation (44) are invalid. We define inv F as follows:

inv F (x) = or i=1,...,n or(out(x i ), edge(x i ))

where out(x i ) = or(leq(x i , 0), geq(x i , 1)) and edge(x i ) = or(eq(x i , 1/4), eq(x i , 1/2), eq(x i , 3/4)). On the other hand, we define inv T as follows:

inv T (x) = or i=1,...,n eq(x i , 1/2)

Definition of h Let g be a Boolean formula defined over e (x) (x) and e (y) (x) that returns the value of R (using the same technique as cnf ′ 3 ). We define h as a two-class classifier, where:

h 1 (x) = or(inv T (x), and(not(inv F (x)), g(x))) (49)

and h 0 (x) = not(h 1 (x)). Note that:
• If x i = 1/2 for some i, the top class is 1; therefore, h(x (s) ) = 1;
• Otherwise, if x is not a valid encoding, the top class is 0;
• Otherwise, the top class is 1 if R(e (x) (x), e (y) (x)) is true and 0 otherwise.

Lemma 5. z ∈ coΠ 2 3SAT =⇒ q(z) ∈ CCA ∞

Proof. If z ∈ coΠ 2 3SAT , then there exists a Boolean vector x * such that ∀ ŷ . ¬R(x * , ŷ). We now prove that setting x ′ = x * satisfies Equation (5). First, note that h(x * ) = 0, which satisfies h(x ′ ) ̸ = h(x (s) ). Then we need to verify that ∀x ′′ ∈ B ∞ (x * , γ) . h(x ′′ ) = 0. For every x ′′ ∈ B ∞ (x * , γ), we know that x ′′ ∈ ([-γ, γ] ∪ [1 - γ, 1 + γ]) n . There are thus two cases:
• x ′′ is not a valid encoding, i.e. x ′′ i ≤ 0 ∨ x ′′ i ≥ 1 ∨ x ′′ i ∈ {1/4, 3/4} for some i. Then h(x ′′ ) = 0.
Note that, since γ < 1/2, 1/2 ̸ ∈ [-γ, γ] ∪ [1 - γ, 1 + γ], so it is not possible for x ′′ to be an invalid encoding that is classified as 1;
• x ′′ is a valid encoding. Then, since γ < 1/2, e (x) (x ′′ ) = x * . Since h(x ′′ ) = 1 iff R(e (x) (x ′′ ), e (y) (x ′′ )) is true, and since R(x * , ŷ) is false for all choices of ŷ, h(x ′′ ) = 0.
Therefore, x * satisfies Equation (43) and thus q(z) ∈ CCA ∞ .

Lemma 6. q(z) ∈ CCA ∞ =⇒ z ∈ coΠ 2 3SAT

Proof. Since q(z) ∈ CCA ∞ , there exists an x * ∈ B ∞ (x (s) , 1/2) such that h(x * ) ̸ = h(x (s) ) and ∀x ′′ ∈ B ∞ (x * , γ) . h(x ′′ ) = h(x * ). We will prove that e (x) (x * ) is a solution to coΠ 2 3SAT . Since h(x (s) ) = 1, h(x * ) = 0, which means that ∀x ′′ ∈ B ∞ (x * , γ) . h(x ′′ ) = 0. We know that x * ∈ B ∞ (x (s) , 1/2) = [0, 1] n . We first prove by contradiction that x * ∈ ([0, 1/2 - γ) ∪ (1/2 + γ, 1]) n . If x * i ∈ [1/2 - γ, 1/2 + γ] for some i, then the vector x (w) defined as follows:

x (w) j = 1/2 if j = i; x * j otherwise (50)

is such that x (w) ∈ B ∞ (x * , γ) and h(x (w) ) = 1 (since inv T (x (w) ) = 1). This contradicts the fact that ∀x ′′ ∈ B ∞ (x * , γ) . h(x ′′ ) = 0. Therefore, x * ∈ ([0, 1/2 - γ) ∪ (1/2 + γ, 1]) n . As a consequence, ∀x ′′ ∈ B ∞ (x * , γ) . e (x) (x ′′ ) = e (x) (x * ).

We now prove that, for every ŷ * , there exists an x ′′ * ∈ B ∞ (x * , γ) such that e (y) (x ′′ * ) = ŷ * . We can construct such an x ′′ * as follows. For every i:
• If e (x) i (x * ) = 0 and ŷ * i = 0, set x ′′ * i equal to a value in (0, 1/4);
• If e (x) i (x * ) = 0 and ŷ * i = 1, set x ′′ * i equal to a value in (1/4, γ);
• If e (x) i (x * ) = 1 and ŷ * i = 0, set x ′′ * i equal to a value in (1 - γ, 3/4);
• If e (x) i (x * ) = 1 and ŷ * i = 1, set x ′′ * i equal to a value in (3/4, 1).
By doing so, we obtain an x ′′ * such that x ′′ * ∈ B ∞ (x * , γ) and e (y) (x ′′ * ) = ŷ * .
Since:
• e (x) (x ′′ ) = e (x) (x * ) for all x ′′ ∈ B ∞ (x * , γ);
• h(x ′′ ) = 0 for all x ′′ ∈ B ∞ (x * , γ);
• h(x ′′ ) = 1 iff R(e (x) (x ′′ ), e (y) (x ′′ )) is true;
R(e (x) (x * ), ŷ * ) is false for all choices of ŷ * . In other words, e (x) (x * ) is a solution to Equation (30) and thus z ∈ coΠ 2 3SAT .

Since:
• q(z) can be computed in polynomial time;
• z ∈ coΠ 2 3SAT =⇒ q(z) ∈ CCA ∞ ;
• q(z) ∈ CCA ∞ =⇒ z ∈ coΠ 2 3SAT ;
we can conclude that coΠ 2 3SAT ≤ CCA ∞ .

F.3 PROOF OF COROLLARY 5.1

The proof of CCA p ∈ Σ P 2 is the same as the one for Theorem 5. For the hardness proof, we follow a more involved approach compared to those for Corollaries 1.1 and 3.1. First, let ε ρp,n be the value of ε such that ρ p,n (ε ρp,n ) = 1/2. In other words, B p (x (s) , ε ρp,n ) is an L p ball that contains [0, 1] n , while the intersection of the corresponding L p sphere and [0, 1] n is the set {0, 1} n (for p < ∞).

Let inv ′ T (x) be defined as follows:

inv ′ T (x) = or i=1,...,n or(eq(x i , 1/2), leq(x i , 0), geq(x i , 1))

Let inv ′ F (x) be defined as follows:

inv ′ F (x) = or i=1,...,n or(eq(x i , 1/4), eq(x i , 3/4))

We define h ′ as follows:

h ′ 1 (x) = or(inv ′ T (x), and(not(inv ′ F (x)), g(x))) (53)

with h ′ 0 (x) = not(h ′ 1 (x)). Note that:
• If x i ∈ (-∞, 0] ∪ {1/2} ∪ [1, ∞) for some i, then the top class is 1;
• Otherwise, if x is not a valid encoding, the top class is 0;
• Otherwise, the top class is 1 if R(e (x) (x), e (y) (x)) is true and 0 otherwise.

Finally, let 1/8 < γ ′ < 1/4. Our query is thus:

q(z) = ⟨x (s) , γ ′ , ε ρp,n , C u , h ′ ⟩ (54)

Proof of z ∈ coΠ 2 3SAT =⇒ q(z) ∈ CCA p If z ∈ coΠ 2 3SAT , then ∃x * . ∀ ŷ . ¬R(x * , ŷ). Let x * * be defined as follows:

x * * i = 1/4 if x * i = 0; x * * i = 3/4 if x * i = 1 (55)

Note that:
• x * * ∈ B p (x (s) , ε ρp,n );
• e (x) (x * * ) = x * ;
• h ′ (x * * ) = 0, since x * * ∈ {1/4, 3/4} n ;
• Since γ ′ < 1/4, there is no i such that ∃x ′′ ∈ B p (x * * , γ ′ ) . x ′′ i ∈ (-∞, 0] ∪ {1/2} ∪ [1, ∞);
• For all x ′′ ∈ B p (x * * , γ ′ ):
  - If x ′′ is not a valid encoding (i.e. x ′′ i ∈ {1/4, 3/4} for some i), then h ′ (x ′′ ) = 0;
  - Otherwise, h ′ (x ′′ ) = 1 iff R(e (x) (x ′′ ), e (y) (x ′′ )) is true.
Therefore, since ∀ ŷ . ¬R(x * , ŷ), we know that ∀x ′′ ∈ B p (x * * , γ ′ ) . h ′ (x ′′ ) = 0. In other words, x * * is a solution to Equation (5).

Proof of q(z) ∈ CCA p =⇒ z ∈ coΠ 2 3SAT If q(z) ∈ CCA p , then we know that ∃x * ∈ B p (x (s) , ε ρp,n ) . (h ′ (x * ) ̸ = h ′ (x (s) ) ∧ ∀x ′′ ∈ B p (x * , γ ′ ) . h ′ (x ′′ ) = h ′ (x * )). In other words, ∃x * ∈ B p (x (s) , ε ρp,n ) . (h ′ (x * ) = 0 ∧ ∀x ′′ ∈ B p (x * , γ ′ ) . h ′ (x ′′ ) = 0).

We first prove by contradiction that x * ∈ ((γ ′ , 1/2 - γ ′ ) ∪ (1/2 + γ ′ , 1 - γ ′ )) n . First, suppose that x * i ∈ (-∞, 0) ∪ (1, ∞) for some i. Then h ′ (x * ) = 1 (since inv ′ T (x * ) = 1), contradicting h ′ (x * ) = 0. Second, suppose that x * i ∈ [0, γ ′ ] ∪ [1 - γ ′ , 1] for some i. Then x (w) , defined as follows:

x (w) j = 0 if j = i ∧ x * i ∈ [0, γ ′ ]; 1 if j = i ∧ x * i ∈ [1 - γ ′ , 1]; x * j otherwise (56)

is such that x (w) ∈ B p (x * , γ ′ ) but h ′ (x (w) ) = 1, a contradiction. Finally, suppose that x * i ∈ [1/2 - γ ′ , 1/2 + γ ′ ] for some i. Then x (w) , defined as follows:

x (w) j = 1/2 if j = i; x * j otherwise (57)

is such that x (w) ∈ B p (x * , γ ′ ) but h ′ (x (w) ) = 1, a contradiction. Therefore, x * ∈ ((γ ′ , 1/2 - γ ′ ) ∪ (1/2 + γ ′ , 1 - γ ′ )) n . As a consequence, ∀x ′′ ∈ B p (x * , γ ′ ) . e (x) (x ′′ ) = e (x) (x * ).

From this, due to the fact that γ ′ > 1/8 and that p > 0, we can conclude that for every ŷ there exists an x ′′ ∈ B p (x * , γ ′ ) such that:

x ′′ i ∈ (0, 1/4) for x * i ∈ (γ ′ , 1/2 - γ ′ ), ŷ i = 0
x ′′ i ∈ (1/4, 1/2) for x * i ∈ (γ ′ , 1/2 - γ ′ ), ŷ i = 1
x ′′ i ∈ (1/2, 3/4) for x * i ∈ (1/2 + γ ′ , 1 - γ ′ ), ŷ i = 0
x ′′ i ∈ (3/4, 1) for x * i ∈ (1/2 + γ ′ , 1 - γ ′ ), ŷ i = 1

In other words, for every ŷ there exists a corresponding x ′′ ∈ B p (x * , γ ′ ) such that e (y) (x ′′ ) = ŷ. Therefore, since h ′ (x ′′ ) = 1 iff R(e (x) (x ′′ ), e (y) (x ′′ )) is true and since ∀x ′′ ∈ B p (x * , γ ′ ) . h ′ (x ′′ ) = 0, we can conclude that ∀ ŷ . ¬R(e (x) (x * ), ŷ). In other words, z ∈ coΠ 2 3SAT .

F.4 PROOF OF COROLLARY 5.2

Similarly to the proof of Corollary 1.3, it follows from the fact that ReLU classifiers are polynomial-time classifiers (w.r.t. the size of the tuple).

G FULL EXPERIMENTAL SETUP

All our code is written in Python + PyTorch (Paszke et al., 2019) , with the exception of the MIPVerify interface, which is written in Julia. When possible, most experiments were run in parallel, in order to minimize execution times. Models All models were trained using Adam (Kingma & Ba, 2014) and dataset augmentation. We performed a manual hyperparameter and architecture search to find a suitable compromise between accuracy and MIPVerify convergence. The process required approximately 4 months. When performing adversarial training, following (Madry et al., 2018) we used the final adversarial example found by the Projected Gradient Descent attack, instead of the closest. To maximize uniformity, we used for each configuration the same training and pruning hyperparameters (when applicable), which we report in Table 1 . We report the chosen architectures in Tables 2 and 3 , while Table 4 outlines their accuracies and parameter counts.

UG100

The first 250 samples of the test set of each dataset were used for hyperparameter tuning and were thus not considered in our analysis. For our UG100 dataset, we sampled uniformly across each ground truth label and removed the examples for which MIPVerify crashed. Table 5 details the composition of the dataset by ground truth label. Attacks For the Basic Iterative Method (BIM), the Fast Gradient Sign Method (FGSM) and the Projected Gradient Descent (PGD) attack, we used the implementations provided by the AdverTorch library (Ding et al., 2019). For the Brendel & Bethge (B&B) attack and the DeepFool (DF) attack, we used the implementations provided by the Foolbox Native library (Rauber et al., 2020). The Carlini & Wagner and the uniform noise attacks were instead implemented by the authors. We modified the attacks that did not return the closest adversarial example found (i.e. BIM, Carlini & Wagner, DeepFool, FGSM and PGD) to do so. For the attacks that accept ε as a parameter (i.e. BIM, FGSM, PGD and uniform noise), for each example we first performed an initial search with a decaying value of ε, followed by a binary search. In order to pick the attack parameters, we first selected the strong set by performing an extensive manual search; the process took approximately 3 months. We then modified the strong set to obtain the balanced parameter set. We report the parameters of both sets (as well as the parameters of the binary and ε decay searches) in Table 6. MIPVerify We ran MIPVerify using the Julia library MIPVerify.jl and Gurobi (Gurobi Optimization, LLC, 2022). Since MIPVerify can be sped up by providing a distance upper bound, we used the same pool of adversarial examples utilized throughout the paper. For CIFAR10 we used the strong parameter set, while for MNIST we used the strong parameter set with some differences (reported in Table 7).
Since numerical issues might cause the distance upper bound computed by the heuristic attacks to be slightly different from the one computed by MIPVerify, we ran a series of exploratory runs, each with a different correction factor (1.05, 1.25, 1.5, 2), and picked the first factor that caused MIPVerify to find a feasible (but not necessarily optimal) solution. If the solution was not optimal, we then performed a main run with a higher computational budget. We provide the parameters of MIPVerify in Table 8. We also report in Table 9 the percentage of tight bounds for each combination.

H QUANTILE FITTING

The buffer function in CA can be empirically calibrated so as to control the chance of false positives (i.e. inputs wrongly reported as not robust) and false negatives (i.e. non-robust inputs reported as being robust). Given the strong correlation that we observed between the distance of heuristic adversarial examples and the true decision boundary distance, using a linear model for b α seems a reasonable choice. Under this assumption, the buffer value depends only on the distance between the original example and the adversarial one, i.e. on d(x, a f,θ (x)). This property allows us to rewrite the main check performed by CA as:

||x - a f,θ (x)|| p - b(x) = α 1 ||x - a f,θ (x)|| p + α 0 ≤ ε

The parameters α 1 , α 0 can then be obtained via quantile regression (Koenker & Bassett Jr, 1978) by using the true decision boundary distance (i.e. d * p (x)) as a target. The approach provides a simple, interpretable mechanism to control how conservative the detection check should be: with a small quantile, CA will tend to underestimate the decision boundary distance, leading to fewer missed detections but more false alarms; a high quantile leads to the opposite behavior. We test this type of buffer using 5-fold cross-validation on each configuration. Specifically, we calibrate the model using 1%, 50% and 99% as quantiles.
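The quantile calibration can be sketched with a pinball-loss fit (a minimal sketch, not the paper's implementation; the optimizer choice and parameter names are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_quantile_line(heuristic_d, true_d, q):
    """Fit true_d ~ a1 * heuristic_d + a0 under the pinball (quantile) loss,
    so that roughly a fraction q of true distances lies below the line."""
    x = np.asarray(heuristic_d, dtype=float)
    y = np.asarray(true_d, dtype=float)

    def pinball(theta):
        a1, a0 = theta
        residual = y - (a1 * x + a0)
        # Asymmetric loss: under-predictions weighted by q, over-predictions by 1-q.
        return float(np.mean(np.maximum(q * residual, (q - 1.0) * residual)))

    result = minimize(pinball, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
    return result.x  # (a1, a0)
```

Choosing a small q yields a conservative line (more false alarms, fewer missed detections); a large q yields the opposite trade-off.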
Tables 10 to 13 provide a comparison between the expected quantile and the average true quantile of each configuration on the validation folds. Additionally, we plot in Figures 3 to 8 the mean F1 score in relation to the choice of ε.

I ADDITIONAL RESULTS

Tables 14 to 17 detail the performance of the various attack sets on every combination, while Figures 9 to 14 showcase the relation between the true and estimated decision boundary distances.

J ABLATION STUDY

We outline the best attack pools by size in Tables 18 to 21 . Additionally, we report the performance of pools composed of individual attacks in Tables 22 to 25 . Finally, we detail the performance of dropping a specific attack in Tables 26 to 29 .

K FAST PARAMETER SET TESTS

We list the chosen parameter sets for Fast-100, Fast-1k and Fast-10k in Table 30. We plot the difference between the distance of the closest adversarial examples and the true decision boundary distance in Figures 15 to 23, while we plot the R² values in Figures 24 to 32.

Figure 3: F1 scores in relation to ε for MNIST A for each considered percentile. For ease of visualization, we set the graph cutoff at F1 = 0.8. We also mark 8/255 (a common choice for ε) with a dotted line.

Figure 6: F1 scores in relation to ε for CIFAR10 A for each considered percentile. For ease of visualization, we set the graph cutoff at F1 = 0.8. We also mark 8/255 (a common choice for ε) with a dotted line.

Figure 7: F1 scores in relation to ε for CIFAR10 B for each considered percentile. For ease of visualization, we set the graph cutoff at F1 = 0.8. We also mark 8/255 (a common choice for ε) with a dotted line.

Figure 8: F1 scores in relation to ε for CIFAR10 C for each considered percentile. For ease of visualization, we set the graph cutoff at F1 = 0.8. We also mark 8/255 (a common choice for ε) with a dotted line.

M OVERVIEW OF CERTIFIED DEFENSES

In order to put our defense into context, we provide a slightly more in-depth overview of common approaches to certified robustness, as well as their strengths and weaknesses. The most common approach consists in providing statistical guarantees. For example, Sinha et al. (2018) showed that using a custom loss can bound the adversarial risk. Similarly, Dan et al. (2020) proved adversarial risk bounds for Gaussian mixture models depending on the "adversarial Signal-to-Noise Ratio". Finally, Cohen et al. (2019) introduced a smoothing-based certified defense whose exact form is computationally intractable and is therefore replaced by a Monte Carlo estimate, which certifies robustness with a given probability. This work was later expanded upon in (Salman et al., 2020) and (Carlini et al., 2022). The main drawback of these techniques is that they cannot be used in contexts where statistical guarantees are not sufficient, such as safety-critical applications. All of these certified defenses prioritize certain aspects (speed, strength, generality) over others. In the context of this (simplified) framework, CA in its exact form can thus be considered a defense that prioritizes strength and generality over speed, similarly to Katz et al. (2017) and Tjeng et al. (2019).
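For reference, the smoothing certificate of Cohen et al. (2019) reduces, in its simplified single-bound form (taking p_B = 1 − p_A), to the L2 radius R = σ · Φ⁻¹(p_A), where p_A is a Monte Carlo lower confidence bound on the probability that the base classifier predicts the top class under Gaussian noise. The sketch below assumes this simplified form and uses only the Python standard library.

```python
from statistics import NormalDist

def smoothing_radius(p_lower, sigma):
    """Simplified certified L2 radius from randomized smoothing:
    R = sigma * Phi^{-1}(p_lower), where p_lower is a lower confidence
    bound on the top-class probability under N(0, sigma^2 I) noise.
    Returns 0 when p_lower <= 0.5 (no certificate can be issued)."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)
```

Because p_lower comes from sampling, the resulting guarantee is itself only statistical, which is exactly the limitation discussed above for safety-critical settings.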



All our code, datasets, pretrained weights and results are available anonymously under MIT license at https://anonymous.4open.science/r/counter-attack.

We use the term "norm" for 0 < p < 1 even if in such cases the L_p function is not subadditive.

The proofs of all our theorems and corollaries can be found in the appendices.



Figure 1: Distances of the nearest adversarial example found by the strong attack pool compared to those found by MIPVerify on MNIST A and CIFAR10 A with standard training. The black line represents the theoretical optimum. Note that no samples are below the black line.

Figure 2: Best mean R 2 value in relation to the number of attacks in the pool.

if(a, b, c) = or(and(not(a), b), and(a, c)) (13)

where if(a, b, c) returns b if a = 0 and c otherwise.

Moreover, we define open : R 3 → {0, 1} as follows: open(x, a, b) = and(gt(x, a), lt(x, b))
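A minimal sketch checking these gadget identities over {0, 1}; the arithmetic encodings of and/or/not and the threshold gadgets gt/lt below are our own illustrative choices, not necessarily those of the paper's construction.

```python
# Illustrative arithmetic encodings of the boolean gadgets over {0, 1}.
def and_(a, b): return a * b
def or_(a, b): return min(a + b, 1)
def not_(a): return 1 - a

# Threshold gadgets: strict comparisons against scalar bounds.
def gt(x, a): return 1 if x > a else 0
def lt(x, b): return 1 if x < b else 0

def if_(a, b, c):
    # Eq. (13): returns b when a == 0 and c otherwise.
    return or_(and_(not_(a), b), and_(a, c))

def open_(x, a, b):
    # 1 iff x lies in the open interval (a, b).
    return and_(gt(x, a), lt(x, b))
```

Expanding if_ confirms the identity: for a = 0, not(a) = 1 selects b and the second conjunct vanishes; for a = 1, the first conjunct vanishes and c is selected.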

Conv2D (in = 3, out = 8, 5x5 kernel, stride = 4, padding = 0)
ReLU
Conv2D (in = 8, out = 8, 3x3 kernel, stride = 2, padding = 0)
ReLU
Flatten
Linear (in = 72, out = 10)
Output
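Assuming a 32×32 CIFAR10 input (consistent with in = 3 channels), the listed layer sizes can be sanity-checked with the standard Conv2D output-size formula:

```python
def conv2d_out(size, kernel, stride, padding):
    # Standard output-size formula for a square Conv2D layer.
    return (size + 2 * padding - kernel) // stride + 1

# Tracing a 32x32 input through the listed layers:
s = conv2d_out(32, kernel=5, stride=4, padding=0)  # first conv
s = conv2d_out(s, kernel=3, stride=2, padding=0)   # second conv
flat = 8 * s * s                                   # 8 output channels, flattened
```

The flattened size matches the in = 72 of the final Linear layer, confirming the reconstruction of the architecture listing.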

Figure 4: F1 scores in relation to ε for MNIST B for each considered percentile. For ease of visualization, we set the graph cutoff at F1 = 0.8. We also mark 8/255 (a common choice for ε) with a dotted line.

Figure 5: F1 scores in relation to ε for MNIST C for each considered percentile. For ease of visualization, we set the graph cutoff at F1 = 0.8. We also mark 8/255 (a common choice for ε) with a dotted line.

Figure 9: Decision boundary distances found by the attack pools compared to those found by MIPVerify on MNIST A. The black line represents the theoretical optimum. Note that no samples are below the black line.

Figure 10: Decision boundary distances found by the attack pools compared to those found by MIPVerify on MNIST B. The black line represents the theoretical optimum. Note that no samples are below the black line.

Figure 11: Decision boundary distances found by the attack pools compared to those found by MIPVerify on MNIST C. The black line represents the theoretical optimum. Note that no samples are below the black line.

Figure 12: Decision boundary distances found by the attack pools compared to those found by MIPVerify on CIFAR10 A. The black line represents the theoretical optimum. Note that no samples are below the black line.

Figure 13: Decision boundary distances found by the attack pools compared to those found by MIPVerify on CIFAR10 B. The black line represents the theoretical optimum. Note that no samples are below the black line.

Figure 14: Decision boundary distances found by the attack pools compared to those found by MIPVerify on CIFAR10 C. The black line represents the theoretical optimum. Note that no samples are below the black line.

Figure 15: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 A Standard. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 16: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 A Adversarial. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 17: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 A ReLU. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 18: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 B Standard. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 19: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 B Adversarial. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 20: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 B ReLU. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 21: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 C Standard. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 22: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 C Adversarial. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes logarithmic.

Figure 23: Mean difference between the distance of the closest adversarial examples and the exact decision boundary distance for MNIST & CIFAR10 C ReLU. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. Both axes are logarithmic.

Figure 24: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 A Standard. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Figure 25: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 A Adversarial. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Figure 26: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 A ReLU. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Figure 27: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 B Standard. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Figure 29: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 B ReLU. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Figure 30: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 C Standard. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Figure 31: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 C Adversarial. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Figure 32: R 2 of linear model for the heuristic adversarial distances given the exact decision boundary distances for MNIST & CIFAR10 C ReLU. A dashed line means that the attack found adversarial examples (of any distance) for only some inputs, while the absence of a line means that the attack did not find any adversarial examples. The loosely and densely dotted black lines respectively represent the balanced and strong attack pools. The x axis is logarithmic.

Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin Vechev. Fast and effective robustness certification. Advances in Neural Information Processing Systems, 31, 2018.

Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Min Jae Song, Ilias Zadik, and Joan Bruna. On the cryptographic hardness of learning single periodic neurons. Advances in Neural Information Processing Systems, 34:29602-29615, 2021.

Larry J. Stockmeyer. The polynomial-time hierarchy. Theoretical Computer Science, 3(1):1-22, 1976.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, 2019.

Florian Tramèr, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, and Jörn-Henrik Jacobsen. Fundamental tradeoffs between invariance and sensitivity to adversarial perturbations. In International Conference on Machine Learning, pp. 9561-9571. PMLR, 2020.

Training and pruning hyperparameters.

MNIST Architectures.

Parameter counts and accuracies of trained models.

Ground truth labels of the UG100 dataset.

Parameters of heuristic attacks.

Parameter set used to initialize MIPVerify for MNIST. All other parameters are identical to the strong MNIST attack parameter set.

Parameters of MIPVerify.

MIPVerify bound tightness statistics.

Expected vs true quantile for MNIST strong with 5-fold cross validation.

Expected vs true quantile for MNIST balanced with 5-fold cross validation.

Expected vs true quantile for CIFAR10 strong with 5-fold cross validation.

Expected vs true quantile for CIFAR10 balanced with 5-fold cross validation.

Performance of the strong attack set on MNIST.

Performance of the balanced attack set on MNIST.

Performance of the strong attack set on CIFAR10.

Performance of the balanced attack set on CIFAR10.

Best pools of a given size by success rate and R 2 for MNIST strong.

Performance of individual attacks for CIFAR10 strong.

Performance of individual attacks for CIFAR10 balanced.

Performance of pools without a specific attack for MNIST strong.

Performance of pools without a specific attack for MNIST balanced.

Performance of pools without a specific attack for CIFAR10 strong.

Performance of pools without a specific attack for CIFAR10 balanced.

We do not study the Brendel & Bethge and the Carlini & Wagner attacks, due to the fact that the number of model calls varies depending on how many inputs are attacked at the same time. Note that, for attacks that do not have a 100% success rate, the mean adversarial example distance can increase with the number of steps, as new adversarial examples (for inputs for which there were previously no successful adversarial examples) are added.

L RESULTS FOR ATTACKS AGAINST CA

We report the parameters for the variant of our attack in Table 31, while we report its success rate in Table 32. We set ε = ε′ equal to {0.025, 0.05, 0.1} for MNIST and {2/255, 4/255, 8/255} for CIFAR10. Note that, since Deepfool is a deterministic attack, no measures against randomization were taken.

Parameters for the Fast-100, Fast-1k and Fast-10k sets.

Parameters for the variant of the PGD attack.

Success rate of the pool composed of the anti-CA variant of PGD and uniform noise for the A architectures.

Initially, theoretical work focused on providing robustness bounds based on general properties. For example, Szegedy et al. (2013) computed robustness bounds against L2-bounded perturbations by studying the upper Lipschitz constant of each layer, while Hein & Andriushchenko (2017) achieved similar results for Lp-bounded perturbations by focusing on local Lipschitzness. While these studies do not require any modifications to the network or hypotheses on the data distribution, in practice the provided bounds are too loose to be useful. For this reason, Weng et al. (2018b) derived stronger bounds through a local Lipschitz constant estimation technique; however, finding this bound is computationally expensive, which is why the authors also provide a heuristic to estimate it.

Similarly, solver-based approaches provide tight bounds but require expensive computations. For example, Reluplex was used to verify networks of at most ∼300 ReLU nodes (Katz et al., 2017). Tjeng et al. (2019) used a MIP-based formulation to significantly speed up verification, although large networks remain infeasible to verify. Solver-friendly training techniques can boost the performance of verifiers (such as in (Xiao et al., 2019)); however, this increase in speed often comes at the cost of accuracy (see Section 6).

Another solution to the trade-off between speed and bound tightness is to focus on specific (and more tractable) threat models. For example, Han et al. (2021) provide robustness guarantees against adversarial patches (Brown et al., 2017), while Jia et al. (2019) focus on adversarial word substitutions. In the same vein, Raghunathan et al. (2018) provide robustness bounds for specific architectures (i.e. 1-layer and 2-layer neural networks), while Zhang et al. (2021) introduce custom neurons that, if used in place of regular neurons, provide L∞ robustness guarantees. These techniques thus trade generality for speed.

