FINDING ACTUAL DESCENT DIRECTIONS FOR ADVERSARIAL TRAINING

Abstract

Adversarial Training using a strong first-order adversary (PGD) is the gold standard for training Deep Neural Networks that are robust to adversarial examples. We show that, contrary to the general understanding of the method, the gradient at an optimal adversarial example may increase, rather than decrease, the adversarially robust loss. This holds independently of the learning rate. More precisely, we provide a counterexample to a corollary of Danskin's Theorem presented in the seminal paper of Madry et al. (2018), which states that a solution of the inner maximization problem can yield a descent direction for the adversarially robust loss. Based on a correct interpretation of Danskin's Theorem, we propose Danskin's Descent Direction (DDi) and we verify experimentally that it provides better directions than those obtained by a PGD adversary. Using the CIFAR10 dataset, we further provide a real-world example showing that our method achieves a steeper increase in robustness levels in the early training stages of smooth-activation networks without BatchNorm, and is more stable than the PGD baseline. As a limitation, PGD training of ReLU+BatchNorm networks still performs better, but current theory is unable to explain this.

1. INTRODUCTION

Adversarial Training (AT) (Goodfellow et al., 2015; Madry et al., 2018) has become the de-facto algorithm used to train Neural Networks that are robust to adversarial examples (Szegedy et al., 2014). Variations of AT together with data augmentation yield the best-performing models in public benchmarks (Croce et al., 2020). Despite lacking optimality guarantees for the inner-maximization problem, the simplicity and performance of AT are enough reasons to embrace its heuristic nature. From an optimization perspective, the consensus is that AT is a sound algorithm: based on Danskin's Theorem, Madry et al. (2018, Corollary C.2) posit that by finding a maximizer of the inner non-concave maximization problem, i.e., an optimal adversarial example, one can obtain a descent direction for the adversarially robust loss. What if this is not true? Are we potentially overlooking issues in its algorithmic framework? As mentioned in Dong et al. (2020, Section 2.3), Corollary C.2 in Madry et al. (2018) can be considered the theoretical optimization foundation of the non-convex non-concave min-max optimization algorithms that we now collectively refer to as Adversarial Training. It justifies the two-stage structure of the training loop: first we find an approximately optimal adversarial example, and then we update the model using the gradient (with respect to the model parameters) at the perturbed input. The only drawbacks of a first-order adversary seem to be its computational complexity and the fact that it is an approximate, possibly suboptimal, solver. Setting the computational complexity issue aside, suppose we have access to a theoretical oracle that provides a single solution of the inner-maximization problem. In such an idealized setting, can we safely assume that AT decreases the adversarially robust loss on the data sample? According to the aforementioned theoretical results, it would appear so.
In this work, we scrutinize the optimization paradigm on which Adversarial Training (AT) has been founded, and we posit that finding multiple solutions of the inner-maximization problem is necessary to find good descent directions of the adversarially robust loss. In doing so, we hope to improve our understanding of the non-convex/non-concave min-max optimization problem that underlies the Adversarial Training methodology, and potentially improve its performance. Our contributions: We present two counterexamples to Madry et al. (2018, Corollary C.2), the motivation behind AT. They show that using the gradient (with respect to the parameters of the model) evaluated at a single solution of the inner-maximization problem can increase the robust loss, i.e., it can harm the robustness of the model. In particular, in Counterexample 2 many descent directions exist, but they cannot be found if we only compute a single solution of the inner-maximization problem. In Section 2 we explain that the flaw in the proof is due to a misunderstanding of the directional derivative notion that is used in the original work of Danskin (1966). Based on our findings, we propose Danskin's Descent Direction (DDi, Algorithm 1). It aims to overcome the problems of the single adversarial example paradigm of AT by exploiting multiple adversarial examples, obtaining better update directions for the network. For a data-label pair, DDi finds the steepest descent direction for the robust loss, assuming that (i) there exists a finite number of solutions of the inner-maximization problem and (ii) they can be found with first-order methods.
In Section 5 we verify experimentally that: (i) it is unrealistic to assume a unique solution of the inner-maximization problem, hence making a case for our method DDi, (ii) our method can achieve more stable descent dynamics than the vanilla AT method in synthetic scenarios, and (iii) on the CIFAR10 dataset DDi is more stable and achieves higher robustness levels in the early stages of training, compared with a PGD adversary of equivalent complexity. This is observed in a setting where the conditions of Danskin's Theorem hold, i.e., using differentiable activation functions and removing BatchNorm. As a limitation, PGD training of ReLU+BatchNorm networks still performs better, but there is no theory explaining this. The code to reproduce our results will be available at https://github.com/LIONS-EPFL/ddi_at.

Remark. The fact that (Madry et al., 2018, Corollary C.2) is false might be well-known in the optimization field. In the convex setting it corresponds to the common knowledge that a negative subgradient of a non-smooth convex function might not be a descent direction, cf. (Boyd, 2014, Section 2.1). However, we believe this is not well-known in the AT community, given that (i) its practical implications, i.e., methods deriving steeper descent updates using multiple adversarial examples, have not been previously introduced, and (ii) the results in Madry et al. (2018) have been central in the development of AT. Hence, our contribution can be understood as raising awareness about the issue and demonstrating its practical implications for AT.

Preliminaries. Let $\theta \in \mathbb{R}^d$ be the parameters of a model, $(x, y) \sim \mathcal{D}$ a data-label distribution, $\delta$ a perturbation in a compact set $S_0$, and $L$ a loss function. The optimization objective of AT is:

$$\min_{\theta} \rho(\theta), \quad \text{where } \rho(\theta) := \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\delta \in S_0} L(\theta, x + \delta, y) \right] \tag{1}$$

In this setting $\rho(\theta)$ is referred to as the adversarial loss or robust loss. In order to optimize Eq.
(1) via iterative first-order methods, we need access to a stochastic gradient of the adversarial loss $\rho$, or at least the weaker notion of a stochastic descent direction, i.e., a direction along which the function

$$\phi(\theta) := \max_{\delta \in S := S_0^k} g(\theta, \delta), \quad g(\theta, \delta) := \frac{1}{k} \sum_{i=1}^{k} L(\theta, x_i + \delta_i, y_i)$$

decreases in value. We have collected the perturbations $\delta_i \in S_0$ on the batch $\{(x_i, y_i)\}_{i=1}^k$ as the columns of a matrix $\delta = [\delta_1, \ldots, \delta_k] \in S := S_0^k$, which is also a compact set. To obtain a descent direction for partial maximization functions like $\phi$ we resort to Danskin's Theorem:

Theorem 1 (Danskin (1966)). Let $S$ be a compact topological space, and let $g : \mathbb{R}^d \times S \to \mathbb{R}$ be a continuous function such that $g(\cdot, \delta)$ is differentiable for all $\delta \in S$ and $\nabla_\theta g(\theta, \delta)$ is continuous on $\mathbb{R}^d \times S$. Let

$$\phi(\theta) := \max_{\delta \in S} g(\theta, \delta), \quad S^\star(\theta) := \operatorname*{arg\,max}_{\delta \in S} g(\theta, \delta)$$

Let $\gamma \in \mathbb{R}^d$ with $\|\gamma\|_2 = 1$ be an arbitrary unit vector. The directional derivative $D_\gamma \phi(\theta)$ of $\phi$ in the direction $\gamma$ at the point $\theta$ exists, and is given by the formula

$$D_\gamma \phi(\theta) = \max_{\delta \in S^\star(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta) \rangle \tag{4}$$

Remark. $\gamma \neq 0$ is called a descent direction of $\phi$ at $\theta$ if and only if $D_\gamma \phi(\theta) < 0$, i.e., if the directional derivative is strictly negative.

2. A COUNTEREXAMPLE AT A LOCALLY OPTIMAL POINT

Corollary 1 (Madry et al. (2018, Corollary C.2); false). Let $\delta^\star \in S^\star(\theta)$ be a solution of the inner-maximization problem such that $\nabla_\theta g(\theta, \delta^\star) \neq 0$. Then $-\nabla_\theta g(\theta, \delta^\star)$ is a descent direction for $\phi$ at $\theta$.

Counterexample 1. Let $S := [-1, 1]$ and $g(\theta, \delta) = \theta\delta$, so that $\phi(\theta) = \max_{\delta \in [-1,1]} \theta\delta = |\theta|$. Note that at $\theta = 0$, we have $S^\star(0) = [-1, 1]$. Choosing $\delta = 1 \in S^\star(0)$ we have that $g(\theta, 1) = \theta$ and so $-\nabla_\theta g(0, 1) = -1 \neq 0$. Hence, Corollary 1 would imply that $-1$ is a descent direction for $\phi(\theta) = |\theta|$. However, $\theta = 0$ is a global minimizer of the absolute value function, which means that there exists no descent direction. This is a contradiction.

To cast more clarity on why Corollary 1 is false, we explain what the mistake is in the proof provided in Madry et al. (2018). The main issue is the definition of the directional derivative, a concept in multivariable calculus that is defined in slightly different ways in the literature.

Definition 1. Let $\phi : \mathbb{R}^d \to \mathbb{R}$.
For a nonzero vector $\gamma \in \mathbb{R}^d$, the one-sided directional derivative of $\phi$ in the direction $\gamma$ at the point $\theta$ is defined as the one-sided limit:

$$D_\gamma \phi(\theta) := \lim_{t \to 0^+} \frac{\phi(\theta + t\gamma) - \phi(\theta)}{t \|\gamma\|_2} \tag{6}$$

The two-sided directional derivative is defined as the two-sided limit:

$$\bar{D}_\gamma \phi(\theta) := \lim_{t \to 0} \frac{\phi(\theta + t\gamma) - \phi(\theta)}{t \|\gamma\|_2} \tag{7}$$

Unfortunately, it is not always clear which one of the two notions is meant when the term directional derivative is used. Indeed, as our notation suggests, the one-sided definition Eq. (6) is the one used in the statement of Danskin's Theorem (Danskin, 1966). However, the proof of Corollary 1 provided in Madry et al. (2018) mistakenly assumes the two-sided definition Eq. (7), and inadvertently uses the following property that holds for $\bar{D}_\gamma \phi(\theta)$ (Eq. (7)) but not for $D_\gamma \phi(\theta)$ (Eq. (6)):

Lemma 1. For the two-sided directional derivative definition (7) it holds that $-\bar{D}_\gamma \phi(\theta) = \bar{D}_{-\gamma} \phi(\theta)$, provided that $\bar{D}_\gamma \phi(\theta)$ exists. In particular, if $\bar{D}_\gamma \phi(\theta) > 0$ then $\bar{D}_{-\gamma} \phi(\theta) < 0$.

However, this is not true for the one-sided directional derivative (6), as the example $\phi(\theta) = |\theta|$ at $\theta = 0$ shows (both one-sided directional derivatives are strictly positive). We provide a proof of this fact in Appendix E. The (flawed) proof of Corollary 1 provided in Madry et al. (2018) starts by noting that for a solution $\delta^\star$ of the inner-maximization problem, the directional derivative in the direction $\gamma = \nabla_\theta g(\theta, \delta^\star)$ is positive, as implied by Danskin's Theorem:

$$D_\gamma \phi(\theta) = \max_{\delta \in S^\star(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta) \rangle \geq \langle \nabla_\theta g(\theta, \delta^\star), \nabla_\theta g(\theta, \delta^\star) \rangle = \|\nabla_\theta g(\theta, \delta^\star)\|^2 > 0$$

assuming that $\nabla_\theta g(\theta, \delta^\star)$ is non-zero. The mistake in the proof lies in concluding that $D_{-\gamma} \phi(\theta) < 0$. Following Lemma 1, this property does not hold for the one-sided directional derivative definition Eq. (6), the one used in Danskin's Theorem.
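Counterexample 1 can be verified numerically. The following minimal sketch (ours, not code from the paper) confirms that stepping along the negated gradient at the inner-max solution $\delta = 1$ increases $\phi(\theta) = |\theta|$ for every step size:

```python
# Counterexample 1: g(theta, delta) = theta * delta on S = [-1, 1],
# so phi(theta) = max_{delta in [-1, 1]} theta * delta = |theta|.
def g(theta, delta):
    return theta * delta

def phi(theta):
    # g is linear in delta, so the max over [-1, 1] is attained at an endpoint.
    return max(g(theta, -1.0), g(theta, 1.0))

theta = 0.0        # global minimizer of phi: no descent direction exists here
delta_star = 1.0   # a valid solution of the inner problem, since g(0, 1) = 0
grad = delta_star  # d/dtheta g(theta, delta_star) = delta_star = 1 != 0

# The flawed corollary suggests -grad is a descent direction; instead, every
# step along it strictly increases phi, for any step size t > 0.
for t in (0.1, 0.01, 0.001):
    assert phi(theta - t * grad) > phi(theta)
```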

3. A COUNTEREXAMPLE AT A POINT THAT IS NOT LOCALLY OPTIMAL

The question remains whether a slightly modified version of Corollary 1 holds true: it might be the case that by adding some mild assumption, we exclude all possible counterexamples. In the particular case of Counterexample 1, $\theta = 0$ is a local optimum of the function $\phi(\theta) = |\theta|$. At such points, descent directions do not exist. However, in the trajectory of an iterative optimization algorithm we are mostly concerned with non-locally-optimal points. Hence, we explore whether adding the assumption that $\theta$ is not locally optimal can make Corollary 1 true. Unfortunately, we will show that this is not the case. To this end we construct a family of counterexamples to Corollary 1 with the following properties: (i) there exists a descent direction at a point $\theta$ (that is, $\theta$ is not locally optimal) and (ii) it does not coincide with $-\nabla_\theta g(\theta, \delta)$ for any optimal $\delta \in S^\star(\theta)$. Moreover, all the directions $-\nabla_\theta g(\theta, \delta)$ are in fact ascent directions, i.e., they lead to an increase in the function $\phi(\theta)$.

Counterexample 2. Let $S := [0, 1]$ and let $u, v \in \mathbb{R}^2$ be unit vectors such that $-1 < \langle u, v \rangle < 0$. That is, $u$ and $v$ form an obtuse angle. Let

$$g(\theta, \delta) = \delta \langle \theta, u \rangle + (1 - \delta) \langle \theta, v \rangle + \delta(\delta - 1)$$

Clearly, the function satisfies all conditions of Theorem 1. At $\theta = 0$, we have that $S^\star(0) = \operatorname*{arg\,max}_{\delta \in [0,1]} \delta(\delta - 1) = \{0, 1\}$. At $\delta = 0$ we have $\nabla_\theta g(\theta, 0) = \nabla_\theta \langle \theta, v \rangle = v$ and at $\delta = 1$ we have $\nabla_\theta g(\theta, 1) = \nabla_\theta \langle \theta, u \rangle = u$. We compute the value of the directional derivatives in the negative direction of such vectors. According to Danskin's Theorem we have

$$D_{-v} \phi(0) = \max_{\delta \in \{0,1\}} \langle -v, \nabla_\theta g(\theta, \delta) \rangle = \max(\langle -v, v \rangle, \langle -v, u \rangle) \geq \langle -v, u \rangle > 0 \tag{10}$$

where $\langle -v, u \rangle > 0$ holds by construction. Analogously, $D_{-u} \phi(0) > 0$. This means that all such directions are ascent directions. However, for the direction $\gamma = -(u + v)$ we have

$$D_\gamma \phi(0) \propto \max_{\delta \in \{0,1\}} \langle -(u + v), \nabla_\theta g(\theta, \delta) \rangle = \max(\langle -u - v, u \rangle, \langle -u - v, v \rangle) = -1 - \langle u, v \rangle < 0 \tag{11}$$

where the last inequality also follows by construction. Hence, $-(u + v)$ is a descent direction.
As Counterexample 2 shows, Adversarial Training has the following problem: even if we are able to compute one solution of the inner-maximization problem $\delta \in S^\star(\theta)$, it can be the case that moving in the direction $-\nabla_\theta g(\theta, \delta)$ increases the robust training loss, i.e., the classifier becomes less, rather than more, robust. This can happen at any stage, independently of the local optimality of $\theta$. For a non-locally-optimal $\theta \in \mathbb{R}^d$, the construction of the counterexamples relies on the following observation: if for any gradient computed at one inner-max solution there exists another gradient (at a different inner-max solution) forming an obtuse angle with it, then no single inner-max solution yields a descent direction. Consequently, it suffices to ensure that for any gradient that can be found by solving the inner problem, there exists another one that has a negative inner product with it. Precisely, our Counterexample 2 is carefully crafted so that this property holds.
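A minimal numerical check of Counterexample 2, with a concrete choice of $u$ and $v$ (our own illustration; any obtuse pair of unit vectors works):

```python
import math

# Counterexample 2 with concrete vectors: u and v are unit vectors at a
# 120-degree angle, so <u, v> = -0.5 lies in (-1, 0) as the construction needs.
u = (1.0, 0.0)
v = (-0.5, math.sqrt(3.0) / 2.0)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def g(theta, delta):
    return delta * dot(theta, u) + (1.0 - delta) * dot(theta, v) + delta * (delta - 1.0)

def phi(theta):
    # g is convex in delta, so the max over [0, 1] is attained at an endpoint.
    return max(g(theta, 0.0), g(theta, 1.0))

def scale(c, a):
    return (c * a[0], c * a[1])

t = 1e-3
origin = (0.0, 0.0)
# Moving along the negated gradient at EITHER inner-max solution ascends ...
assert phi(scale(-t, u)) > phi(origin)
assert phi(scale(-t, v)) > phi(origin)
# ... while gamma = -(u + v) is a genuine descent direction:
gamma = (-(u[0] + v[0]), -(u[1] + v[1]))
assert phi(scale(t, gamma)) < phi(origin)
```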

4. DANSKIN'S DESCENT DIRECTION

Danskin's Theorem implies that the directional derivative depends on all the solutions of the inner-max problem $S^\star(\theta)$, cf. Eq. (4). One possible issue in Adversarial Training is relying on a single solution, as it does not necessarily lead to a descent direction, cf. Counterexample 2. To fix this, we design an algorithm that uses multiple adversarial perturbations per data sample. In theory, we can obtain the steepest descent direction for the robust loss on a batch $\{(x_i, y_i) : i = 1, \ldots, k\}$ by solving the following min-max problem:

$$\gamma^\star \in \operatorname*{arg\,min}_{\gamma : \|\gamma\|_2 = 1} \max_{\delta \in S^\star(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta) \rangle, \quad g(\theta, \delta) := \frac{1}{k} \sum_{i=1}^{k} L(\theta, x_i + \delta_i, y_i) \tag{12}$$

On the one hand, if the set of maximizers $S^\star(\theta)$ is infinite, Eq. (12) would be out of reach for computationally tractable methods. On the other hand, the solution is trivial if there is a single maximizer, but we verify experimentally in Section 5 that such an assumption is wrong in practice. In conclusion, a compromise has to be made in order to devise a tractable algorithm that is relevant in practical scenarios. First, we assume that the set of optimal adversarial perturbations is finite:

$$S^\star(\theta) := \operatorname*{arg\,max}_{\delta \in S} g(\theta, \delta) = S_m^\star(\theta) := \{\delta^{(1)}, \ldots, \delta^{(m)}\}, \quad m \geq 1, \ m \in \mathbb{Z} \tag{13}$$

Under such an assumption, it is possible to compute the steepest descent direction in Eq. (12) efficiently.

Theorem 2. Let $\Delta^m$ be the $m$-dimensional simplex, i.e., $\{\alpha : \alpha \geq 0, \sum_{i=1}^m \alpha_i = 1\}$. Suppose that $S^\star(\theta) = S_m^\star(\theta) := \{\delta^{(1)}, \ldots, \delta^{(m)}\}$ and denote by $\nabla_\theta g(\theta, S_m^\star(\theta))$ the matrix with columns $\nabla_\theta g(\theta, \delta^{(i)})$ for $i = 1, \ldots, m$. As long as $\theta$ is not a local minimizer of the robust loss $\phi(\theta) = \max_{\delta \in S} g(\theta, \delta)$, the steepest descent direction of $\phi$ at $\theta$ can be computed as:

$$\gamma^\star := -\frac{\nabla_\theta g(\theta, S_m^\star(\theta))\, \alpha^\star}{\|\nabla_\theta g(\theta, S_m^\star(\theta))\, \alpha^\star\|}, \quad \alpha^\star \in \operatorname*{arg\,min}_{\alpha \in \Delta^m} \|\nabla_\theta g(\theta, S_m^\star(\theta))\, \alpha\|_2^2 \tag{14}$$

We present the proof of Theorem 2 in Appendix C. We now relax our initial finiteness assumption Eq. (13), as it might not hold in practice.
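For $m = 2$ the quadratic program in Eq. (14) even has a closed form, which makes the content of Theorem 2 easy to check numerically. The sketch below is our own illustration (not code from the paper):

```python
import math

def min_norm_combo_2(g1, g2):
    # Closed-form solution of min_{alpha in [0,1]} ||alpha*g1 + (1-alpha)*g2||^2,
    # i.e., Eq. (14) restricted to m = 2 inner-max solutions.
    d = [b - a for a, b in zip(g1, g2)]            # g2 - g1
    dd = sum(x * x for x in d)
    if dd == 0.0:
        alpha = 0.5                                # g1 == g2: any alpha works
    else:
        alpha = max(0.0, min(1.0, sum(b * x for b, x in zip(g2, d)) / dd))
    combo = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return combo, alpha

# Two unit gradients forming an obtuse angle (the situation of Counterexample 2).
g1 = [1.0, 0.0]
g2 = [-0.5, math.sqrt(3.0) / 2.0]
combo, alpha = min_norm_combo_2(g1, g2)
norm = math.sqrt(sum(x * x for x in combo))
gamma = [-x / norm for x in combo]                 # steepest descent direction

# gamma has a negative inner product with BOTH gradients, unlike -g1 or -g2 alone.
assert sum(a * b for a, b in zip(gamma, g1)) < 0.0
assert sum(a * b for a, b in zip(gamma, g2)) < 0.0
```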
We show that it might suffice to approximate the (possibly infinite) set of maximizers $S^\star(\theta)$ with a finite set $S_m(\theta)$. If the direction $\gamma^\star$ defined in Eq. (14) satisfies an additional inequality involving the finite set $S_m(\theta)$, it will be a certified descent direction.

Theorem 3. Suppose that $\nabla_\theta g(\theta, \delta)$ is $L$-Lipschitz as a function of $\delta$, i.e., $\|\nabla_\theta g(\theta, \delta) - \nabla_\theta g(\theta, \delta')\|_2 \leq L \|\delta - \delta'\|_2$. Let $S^\star(\theta)$ be the set of solutions of the inner maximization problem, and let $S_m(\theta) := \{\delta^{(1)}, \ldots, \delta^{(m)}\}$ be a finite set that $\epsilon$-approximates $S^\star(\theta)$ in the following sense: for any $\delta \in S^\star(\theta)$ there exists $\delta^{(i)} \in S_m(\theta)$ such that $\|\delta - \delta^{(i)}\|_2 \leq \epsilon$. Let $\gamma^\star$ be as in Eq. (14). If

$$\max_{\delta \in S_m(\theta)} \langle \gamma^\star, \nabla_\theta g(\theta, \delta) \rangle < -L\epsilon$$

then $\gamma^\star$ is a descent direction for $\phi$ at $\theta$.

The Lipschitz gradient assumption in Theorem 3 is standard in the optimization literature. We provide a proof of Theorem 3 in Appendix D. These results motivate Danskin's Descent Direction (Algorithm 1). We assume an oracle providing a finite set of adversarial perturbations $S_m(\theta)$ that satisfies the approximation assumption in Theorem 3. In particular, this does not require solving the inner-maximization problem to optimality, which is out of reach for computationally tractable methods and requires expensive branch-and-bound or MIP techniques (Zhang et al., 2021; Wang et al., 2021). Given $S_m(\theta)$, we compute $\gamma^\star$ as in Eq. (14), which corresponds to Line 7 of Algorithm 1. If the values of $L$ and $\epsilon$ in Theorem 3 are not available (they might be hard to compute), we cannot certify that $\gamma^\star$ is a descent direction. However, note that given a set of adversarial examples $S_m(\theta)$, $\gamma^\star$ is still the best choice, as it ensures we improve the loss on all elements of $S_m(\theta)$. The optimization problem defining $\alpha^\star$ and $\gamma^\star$ can be solved to arbitrary accuracy efficiently: it corresponds to the minimization of a smooth objective subject to the convex constraint $\alpha \in \Delta^m$.
We use the accelerated proximal gradient algorithm proposed in (Parikh et al., 2014, Section 4.3) and pair it with the efficient simplex projection algorithm given in Duchi et al. (2008). As the problem is smooth, a fixed step-size choice guarantees convergence. We set it as the inverse of the spectral norm of $\nabla_\theta g(\theta, S_m(\theta))^\top \nabla_\theta g(\theta, S_m(\theta))$ and run the algorithm for a fixed number of iterations. Alternatively, one can consider Frank-Wolfe with away steps (Lacoste-Julien & Jaggi, 2015). In practice, the theoretical oracle algorithm that computes the set $S_m(\theta)$ is replaced by heuristics like performing multiple runs of the Fast Gradient Sign Method (FGSM) or Iterative FGSM (Kurakin et al., 2017) (referred to as PGD in Madry et al. (2018)). The complexity of an iteration in Algorithm 1 depends on this choice. In Section 5 we explore different choices and how they affect the performance of the method.

Algorithm 1: Danskin's Descent Direction (DDi)
1: Input: Batch size $k \geq 1$, number of adversarial examples $m$, initial iterate $\theta_0 \in \mathbb{R}^d$, number of iterations $T \geq 1$, step-sizes $\{\beta_t\}_{t=1}^T$.
2: for $t = 0$ to $T - 1$ do
3:   Draw $(x_1, y_1), \ldots, (x_k, y_k)$ from the data distribution $\mathcal{D}$
4:   $g(\theta, \delta) \leftarrow \frac{1}{k} \sum_{i=1}^{k} L(\theta, x_i + \delta_i, y_i)$
5:   $\delta^{(1)}, \ldots, \delta^{(m)} \leftarrow \text{MAXIMIZE}_{\delta \in S}\ g(\theta_t, \delta)$    {using a heuristic like PGD}
6:   $M \leftarrow [\nabla_\theta g(\theta_t, \delta^{(i)}) : i = 1, \ldots, m] \in \mathbb{R}^{d \times m}$
7:   $\alpha^\star \leftarrow \text{MINIMIZE}_{\alpha \in \Delta^m}\ \|M\alpha\|_2^2$    {to $\epsilon$-suboptimality}
8:   $\gamma^\star \leftarrow -M\alpha^\star / \|M\alpha^\star\|_2$
9:   $\theta_{t+1} \leftarrow \theta_t + \beta_t \gamma^\star$
10: end for
11: return $\theta_T$
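Line 7 of Algorithm 1 can be sketched in a few lines of plain Python. The following is our own small-scale illustration: it uses a plain (non-accelerated) projected gradient method with a small fixed step instead of the accelerated scheme described above, together with the sort-based simplex projection of Duchi et al. (2008):

```python
import math

def project_simplex(v):
    # Euclidean projection onto the probability simplex, following the
    # sort-based algorithm of Duchi et al. (2008).
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0.0:
            theta = t          # the support {1..rho} is a prefix of sorted v
    return [max(x - theta, 0.0) for x in v]

def min_norm_point(cols, steps=500, lr=0.1):
    # Projected gradient on f(alpha) = ||M alpha||_2^2 over the simplex
    # (Line 7 of Algorithm 1); grad f = 2 M^T M alpha.
    # cols: the columns of M, one gradient per inner-max solution.
    m, d = len(cols), len(cols[0])
    alpha = [0.0] * m
    alpha[0] = 1.0                                  # start at a vertex
    for _ in range(steps):
        v = [sum(a * c[j] for a, c in zip(alpha, cols)) for j in range(d)]  # M alpha
        grad = [2.0 * sum(c[j] * v[j] for j in range(d)) for c in cols]     # 2 M^T M alpha
        alpha = project_simplex([a - lr * g for a, g in zip(alpha, grad)])
    return alpha

# Two gradients at an obtuse angle: the min-norm point is their midpoint.
cols = [[1.0, 0.0], [-0.5, math.sqrt(3.0) / 2.0]]
alpha = min_norm_point(cols)
assert abs(alpha[0] - 0.5) < 1e-3 and abs(alpha[1] - 0.5) < 1e-3
```

For the two-column toy instance above, the objective is strongly convex along the feasible segment, so the fixed step converges quickly; in general the step should be set from the spectral norm, as in the text.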

5. EXPERIMENTS

5.1. EXISTENCE OF MULTIPLE OPTIMAL ADVERSARIAL SOLUTIONS

This section provides evidence that the set of optimal adversarial examples for a given sample is not a singleton. The hypothesis is tested by using a ResNet-18 pretrained on CIFAR10 and computing multiple randomly initialized PGD-7 attacks for each image with $\epsilon = 8/255$. We compute all pairwise $\ell_2$-distances between attacks for a given image and plot a joint histogram for 10 examples in Figure 2. There is a clear separation away from zero for all pairwise distances, indicating that the attacks are indeed distinct in the input space. Additionally, we plot a histogram over the adversarial losses for each image. An example is provided in Figure 2, which is corroborated by similar results for other images (see Figure 6, Appendix B). We find that the adversarial losses all concentrate with low variance far away from the clean loss. This confirms that all perturbations are in fact both strong and distinct.
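The phenomenon is easy to reproduce at toy scale. The following sketch is ours, not the paper's ResNet-18 experiment: it replaces the network's inner problem with a one-dimensional non-concave surrogate, and shows randomly initialized gradient ascent finding two distinct inner maximizers with identical optimal loss:

```python
import random

# Toy surrogate for the inner problem: max_{delta in [-1, 1]} delta**2 has
# two distinct global maximizers (+1 and -1) with the same optimal loss, and
# randomly initialized projected gradient ascent finds both.
random.seed(0)

def ascend(delta, steps=50, lr=0.1):
    for _ in range(steps):
        delta += lr * 2.0 * delta            # gradient of delta**2 is 2*delta
        delta = max(-1.0, min(1.0, delta))   # project back onto [-1, 1]
    return delta

solutions = {round(ascend(random.uniform(-1.0, 1.0)), 6) for _ in range(20)}
assert solutions == {-1.0, 1.0}   # two distinct maximizers, same loss value
```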

5.2. EXPLORING THE OPTIMIZATION LANDSCAPE OF DDI AND STANDARD ADVERSARIAL TRAINING

Having established that there exist multiple adversarial examples, we now show that the gradients computed at them can exhibit the behaviors discussed in Section 3. In a first synthetic example, borrowed from (Orabona, 2019, Chapter 6), we consider the function

$$g(\theta, \delta) = \delta \left( \theta_1^2 + (\theta_2 + 1)^2 \right) + (1 - \delta) \left( \theta_1^2 + (\theta_2 - 1)^2 \right)$$

where $\theta \in \mathbb{R}^2$ and $\delta \in [0, 1]$. As can be seen from Figure 1a and Figure 1b, following a gradient computed at a single example leads to an increase in the objective and an unstable optimization behavior, despite the use of a decaying step-size. In a second synthetic example, we consider robust binary classification with a feed-forward neural network on a synthetic 2-dimensional dataset, trained with batch gradient descent. We observe that during training, after an initial phase where all gradients computed at different perturbations point roughly in the same direction, we begin to observe pairs of gradients with negative inner products (see Figure 3 (left)). That means that following one of those gradients would lead to an increase of the robust loss, as shown by the different optimization behavior (see Figure 3 (center)). Therefore, the benefits of DDi kick in later in training, once the loss has stabilized and the inner solver starts outputting gradients with negative inner products. Indeed, we see that in the middle of training (iteration 250), DDi finds a descent direction of the (linearized) robust objective, whereas all individual gradients lead to an increase.

5.3. CIFAR10 EXPERIMENTS

We follow a standard CIFAR10 adversarial training setup, except for some modifications noted below. This means SGD with hyperparameters lr = 0.1, momentum = 0.0 (not the default 0.9; we explain why below), batch size = 128 and weight decay = 5e-4. We run for 200 epochs, no warmup, decreasing the lr by a factor of 0.1 at 50% and 75% of the epochs. Satisfying theoretical assumptions: Real-world architectures are often not covered by theory, while simple toy examples are often far removed from practice.
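The pathology in the first synthetic example of Section 5.2 (the function borrowed from Orabona (2019)) can be reproduced in a few lines. In this sketch (ours, using the closed-form inner maximum), the two inner-max solutions at the origin yield exactly opposite gradients: stepping along either single negated gradient increases the robust objective, while the min-norm combination of Theorem 2 is zero, correctly signalling that the origin is already the minimizer:

```python
# g(theta, delta) is linear in delta, so the inner max over [0, 1] is
# attained at an endpoint: phi(theta) = theta1**2 + (abs(theta2) + 1)**2.
def g(theta, delta):
    t1, t2 = theta
    return delta * (t1**2 + (t2 + 1.0)**2) + (1.0 - delta) * (t1**2 + (t2 - 1.0)**2)

def phi(theta):
    return max(g(theta, 0.0), g(theta, 1.0))

def grad_theta(theta, delta):
    t1, t2 = theta
    # d/dtheta2 g = delta*2*(t2+1) + (1-delta)*2*(t2-1) = 2*t2 + 2*(2*delta - 1)
    return (2.0 * t1, 2.0 * t2 + 2.0 * (2.0 * delta - 1.0))

theta = (0.0, 0.0)              # the global minimizer of phi
g0 = grad_theta(theta, 0.0)     # (0, -2): gradient at inner-max solution delta = 0
g1 = grad_theta(theta, 1.0)     # (0, +2): gradient at inner-max solution delta = 1

# A step along the negated gradient at EITHER single inner-max solution
# increases the robust objective ...
lr = 0.1
for grad in (g0, g1):
    step = (theta[0] - lr * grad[0], theta[1] - lr * grad[1])
    assert phi(step) > phi(theta)

# ... while the min-norm convex combination of both gradients is zero.
combo = (0.5 * (g0[0] + g1[0]), 0.5 * (g0[1] + g1[1]))
assert combo == (0.0, 0.0)
```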
To demonstrate the real-world impact of our results, we therefore study a setting where the conditions of Danskin's Theorem hold, but which also uses standard building blocks used by practitioners: specifically, replacing ReLU with CELU (Barron, 2017), replacing BatchNorm (BN) (Ioffe & Szegedy, 2015) with GroupNorm (GN) (Wu & He, 2018), and removing momentum. This ensures differentiability, removes intra-batch dependencies, and ensures each update depends only on the descent direction found at that step, respectively. We present a more detailed justification in Appendix B.2 due to space constraints, and additionally show an ablation study on the effect of our modifications (Section 5.3). Our main results can be seen in Section 5.3. The robust accuracy of the DDi-trained model increases much more rapidly in the early stages, increases more after the first drop in the learning rate, and is more stable when compared to the baseline. Section 5.3 also gives evidence that our method has (generally positive or neutral) effects in all settings. Using ReLU instead of CELU re-introduces the characteristic bump in robust accuracy that has led to early stopping becoming standard practice in robust training. It also diminishes the benefit of DDi, but DDi remains on par with PGD in terms of training speed and decays slightly less towards the end of training. Adding momentum does not help either method in terms of training speed and makes them behave almost identically. Finally, BN seems to significantly ease the optimization for both methods, raising overall performance and amplifying the bump for both methods. Here, PGD actually reaches a higher maximum robust accuracy and rises faster initially, but then converges to a lower value. This implies that some benefits of DDi remain even outside the setting covered by the theory.
Although these are promising results indicating that DDi can give real-world benefits in terms of iterations and reduce the need for early stopping, it is worth asking whether one could get the same benefit with a simpler or cheaper method. The final robust accuracies obtained are very close, and the increased convergence rate in terms of steps comes at a more than 10x slowdown, due to having to perform 10 independent forward-backward passes and then solving an additional inner problem. Additionally, it could be argued that these results are to be expected and trivial: we are spending 10x the compute to get 10x the gradients. One might even say there is no need to solve the inner problem, and that a simpler method to select the best adversary would suffice. In Fig. 5a we address these concerns by comparing the results of Section 5.3 to those of the following variants, which attempt to match the computational complexity: PGD-70 runs a single PGD adversary for 10x the number of steps; PGD-70-1/t runs a single PGD adversary for 10x the number of steps, using a 1/t learning rate decay after leaving the "standard" PGD regime (i.e., after 8 adversary steps) to converge closer to an optimal adversarial example; PGD-max-10 runs ten parallel, independent PGD adversaries for each image and selects the adversarial example that induces the largest loss. Finally, PGD-min-10 runs ten parallel, independent PGD adversaries for each image, then computes the gradients and selects the one with the lowest norm. This is an approximation of DDi that avoids solving Line 7 in Algorithm 1. In Fig. 5b we create a DDi variant based on the FAST adversary (Wong et al., 2020) (using $\epsilon = 8/255$, $\alpha = 10/255$). Using PGD for the evaluation attack, we compare against vanilla FAST in our setting (no BN, no momentum, and using CELU) as well as a FAST-max-10 variant analogous to PGD-max-10. As we can see in Fig.
5a, every step of the DDi pipeline seems to be necessary, with none of the PGD variants achieving the fast initial rise in robustness. PGD-70-1/t and PGD-min-10 reach a higher final robust accuracy, which we attribute to the higher-quality adversarial example and the informed selection, respectively. This is corroborated in Fig. 5b. Using a single-step adversary is sufficient to speed up convergence in the early stages of training, but it does not reach the same final robust accuracy. PGD and DDi seem to behave similarly in the later stages of training. We would suggest a computationally cheaper DDi variant which uses single ascent steps (FAST) at the beginning of training and PGD in the later stages. In any case, the bulk of the overhead lies in the subroutine in Line 7 of Algorithm 1. A faster approximate solution could also speed up the method significantly. Such incremental improvements are left for future work. Nevertheless, in Appendix B.4 we explore some modifications that can reduce the runtime of Algorithm 1 by at least 70% while retaining its benefits. In contrast to prior analyses, we do not make unrealistic assumptions like strong concavity, and we deal with the existence of multiple solutions of the inner-maximization problem.

6. RELATED WORK

In Nouiehed et al. (2019), it is shown that if the inner-maximization problem is unconstrained and satisfies the PL-condition, the robust loss is differentiable, and its gradient can be computed after obtaining a single solution of the problem. However, in the robust learning problem the adversary is usually constrained to a compact set, and the PL condition does not hold generically. This renders such assumptions hard to justify in the AT setting. Tramer & Boneh (2019) and Maini et al. (2020) study robustness to multiple perturbation types, which might appear similar to our approach, but is not. Such works strive to train models that are simultaneously robust against $\ell_\infty$- and $\ell_2$-bounded perturbations, for example. In contrast, we focus on a single perturbation type, and we study how to use multiple adversarial examples of the same sample to improve the update directions of the network parameters. The incorrect corollary has been reproduced in several subsequent works, and possibly many others. This supports our claim that raising awareness about the mistake in the proof is an important contribution.

7. CONCLUSION

In this paper we presented a formal proof, counterexamples, and evidence of the real-world impact of the fact that a foundational corollary of the Adversarial Training literature is in fact false. Raising awareness about an incorrect claim that has been present in the Adversarial Training literature may provide opportunities to develop improved variants of the method. Indeed, we see some improvements in an implementable algorithm that align with our theoretical arguments: DDi exploits multiple approximate solutions of the inner-maximization problem, yields better updates for the parameters of the network, and improves the optimization dynamics. However, it is important to remember the limitations and opportunities for future work: our algorithm requires multiple forward-backward passes and one additional optimization problem. Reducing the overhead over the vanilla PGD method would certainly make our results truly practical. Non-smooth activations and the use of Batch Normalization or momentum still fall outside the scope of existing theory but might achieve better performance in benchmarks. To date, this requires using precise hyperparameters and tricks like early stopping that have only been found to work a posteriori through extensive trial and error. Since we observe lower decay even in such settings, future work extending the analysis to cover this case might help alleviate this cost.

A MORE ON COUNTEREXAMPLES

Here we give more details on the construction of the counterexamples. First observe that for a given point $\theta_0$ and a direction $\gamma$, if there exists a $\delta_0 \in S^\star(\theta_0)$ such that $\langle \gamma, \nabla_\theta g(\theta_0, \delta_0) \rangle > 0$, then $\gamma$ is not a descent direction, since $D_\gamma \phi(\theta_0) \geq 0$. In order to ensure that no descent directions can be recovered by solving the inner-maximization problem, it suffices to guarantee that for any $\delta \in S^\star(\theta_0)$ there exists $\delta' \in S^\star(\theta_0)$ such that $\langle \nabla_\theta g(\theta_0, \delta'), \nabla_\theta g(\theta_0, \delta) \rangle < 0$. This way, neither $-\nabla_\theta g(\theta_0, \delta)$ nor $-\nabla_\theta g(\theta_0, \delta')$ would be a descent direction. It is easy to generate instances verifying the above using linear functions. More formally, by taking any family of vectors $V = \{v_1, \ldots, v_n\}$ such that for any $i \in \{1, \ldots, n\}$ there exists $j \in \{1, \ldots, n\}$ with $\langle v_i, v_j \rangle < 0$, we can construct the objective

$$g(\theta, \delta) = \sum_i \delta_i \langle v_i, \theta - \theta_0 \rangle - H(\delta)$$

where $\delta$ lies in the $n$-dimensional simplex and $H$ is the Shannon entropy. Solving the inner-maximization problem at $\theta_0$ would yield a gradient equal to any one of the vectors $\{v_1, \ldots, v_n\}$, and by construction, none of their negatives are descent directions.

B.2 SATISFYING THEORETICAL ASSUMPTIONS

Finally, to make each update depend only on the current state, we set momentum = 0.0. Since momentum is standard practice in the CV community, and works like Yan et al. (2018) argue that it can improve generalisation, we rely on our ablation to show that removing it is safe.
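The entropy-based construction above can be sanity-checked numerically. This is our own sketch, with n = 3 unit vectors at 120-degree angles:

```python
import math

# n = 3 unit vectors at 120-degree angles: every vector has a partner at an
# obtuse angle (pairwise inner products equal cos(120 deg) = -0.5 < 0).
vecs = [(math.cos(2.0 * math.pi * i / 3.0), math.sin(2.0 * math.pi * i / 3.0))
        for i in range(3)]
for i in range(3):
    assert any(vecs[i][0] * vecs[j][0] + vecs[i][1] * vecs[j][1] < 0.0
               for j in range(3) if j != i)

def neg_entropy(delta):
    # -H(delta), with the convention 0 * log(0) = 0.
    return sum(d * math.log(d) for d in delta if d > 0.0)

# At theta = theta0 the linear term of g vanishes, so the inner problem
# maximizes -H(delta) over the simplex: the maximum (value 0) is attained
# exactly at the vertices, where the gradient of g is one of the v_i.
vertex = (1.0, 0.0, 0.0)
interior = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
assert neg_entropy(vertex) == 0.0
assert neg_entropy(interior) < neg_entropy(vertex)
```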

B.3 FURTHER DETAILS ON SYNTHETIC EXPERIMENTS

The synthetic experiment in Fig. 1a is conducted with the following settings. The inner-maximization is approximated with 10 steps of projected gradient ascent, in order to match the traditional AT setting. The outer iterations have a decaying $0.5/\sqrt{k}$ step-size schedule. We observe the same erratic behavior for PGD with a fixed outer step-size, while DDi consistently remains well-behaved. The synthetic experiment in Fig. 3 is conducted on a dataset of size 100 in dimension 2, where the coordinates are standard Gaussian. The neural network is a 2-layer network with ELU activation and a hidden layer of width 2. The inner solver is PGD with 10 steps with step-size 0.1, and it optimizes over the unit cube. The outer step-size is 0.01 and the weights are optimized with full-batch gradient descent. The linear approximation of the robust loss at iteration 250 consists of taking the 10 adversarial examples computed at iteration 250 and approximating the loss with

$$\hat{\phi}(\theta) = \max_{i = 1, \ldots, 10} \phi(\theta_{250}) + \langle \nabla_\theta g(\theta_{250}, \delta_i), \theta - \theta_{250} \rangle$$

Interestingly, we do not observe the same drastic improvement over PGD when observing the non-linearized loss at iteration 250.

B.4 REDUCING THE RUNTIME OF DDI

While the focus of this paper is not to obtain a state-of-the-art method, it does matter whether it is feasible to efficiently capture the benefit of DDi. The naive implementation has about a 10-12 times overhead compared to PGD, mainly due to three bottlenecks (in descending order of impact):
1. for k-DDi, generating k adversarial examples with PGD as the base attack involves a k-times overhead;
2. then k separate gradient samples need to be computed on these adversarial examples, which involves k forward-backward passes;
3. finally, one additional optimization problem needs to be solved.
While steps 1) and 2) can be somewhat parallelized, they still cause a massive increase in compute and memory.
We therefore adopt two heuristic approaches to speed up the algorithm while (hopefully) maintaining its benefits: 1. since the benefits of DDi appear to diminish later in training, we linearly decay the number of sampled gradients k from 10 down to 1 over the 200 epochs (referred to as decay); 2. we also adopt a method of creating k unique batches from only 2 independent adversarial attacks (described below in Appendix B.4.1, referred to as comb). We evaluate this method using both PGD and FAST as base attacks and show the results in Fig. 8a and Fig. 8b. As can be seen, DA-PGD-decay-comb and DA-PGD-comb both enjoy a massive speedup in wall-clock time (reducing the 12× overhead to about 3×) while retaining the improved per-step progress of base DDi.
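The comb combination step can be sketched as follows (a minimal illustration with hypothetical tensor shapes; the random perturbations below are NumPy stand-ins for the two independent adversarial attacks, which in practice operate on image batches):

```python
import numpy as np

rng = np.random.default_rng(0)

B, D, k = 4, 8, 3          # batch size, input dim, number of mixed batches
x = rng.normal(size=(B, D))

# Two independent adversarial perturbations per sample (stand-ins for
# two attack runs); shape (2, B, D).
deltas = rng.uniform(-0.1, 0.1, size=(2, B, D))

mixed_batches = []
for _ in range(k):
    b = rng.integers(0, 2, size=B)      # random bitvector in {0,1}^B
    delta = deltas[b, np.arange(B)]     # pick delta_{i, b_i} for each sample i
    mixed_batches.append(x + delta)     # one perturbed batch per gradient

# Each of the k batches would now be fed through the network to compute
# one gradient sample w.r.t. the parameters.
assert len(mixed_batches) == k and mixed_batches[0].shape == (B, D)
```

Each pass over the loop yields one of the $k \le 2^B$ possible recombinations while only two attacks were ever run.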

B.4.1 COMBINATORIAL BATCH CONSTRUCTION

Suppose we have a batch of data-label pairs $(x_i, y_i)$ of size $B$. In order to construct $k \le 2^B$ different gradients while computing only 2 adversarial examples per data sample $x_i$ in the batch, we do the following:
1. for each $i = 1, \dots, B$, compute two adversarial examples $\delta_{i,0}, \delta_{i,1}$ using the data-label pair $(x_i, y_i)$ in the batch;
2. for each $j = 1, \dots, k$, repeat the following steps:
3. define $\Delta = [\;]$ as an empty list;
4. generate a random bitvector $b \in \{0, 1\}^B$ of length $B$;
5. when $b_i$ is 0 we append $\delta_{i,0}$ to $\Delta$, otherwise when $b_i$ is 1 we append $\delta_{i,1}$ to $\Delta$;
6. compute the gradient w.r.t. the network parameters using the perturbations in $\Delta$.
While this still incurs the overhead of computing $k$ gradients, it greatly reduces running time, as seen in Fig. 8b, and could be further improved by, e.g., reusing gradients from past epochs to construct the examples.

C PROOF OF THEOREM 2

The steepest descent direction is computed, following Eq. (4), as in Eq. (15). Whenever $\theta$ is not a local optimum, there exists a non-zero descent direction. In this case we can relax the constraint $\|\gamma\|_2 = 1$ to $\|\gamma\|_2 \le 1$ without changing the solutions or the optimal value of (15), which is strictly negative (Eq. (16)); the denominator in Eq. (19) is positive, as the optimal objective value is nonzero, cf. Eq. (18).

D PROOF OF THEOREM 3

For any $\delta \in S(\theta)$, let $i(\delta) \in \{1, \dots, m\}$ be such that $\|\delta^{(i(\delta))} - \delta\|_2 \le \varepsilon$. That is, we map any maximizer $\delta$ to an index $i \in \{1, \dots, m\}$ such that the corresponding perturbation $\delta^{(i)}$ in the finite set $S_m(\theta)$ is at distance at most $\varepsilon$. This map can be constructed by the assumption on $S_m(\theta)$. For any $\gamma$ such that $\|\gamma\|_2 = 1$ we have
$$\langle \gamma, \nabla_\theta g(\theta, \delta)\rangle = \langle \gamma, \nabla_\theta g(\theta, \delta) - \nabla_\theta g(\theta, \delta^{(i(\delta))})\rangle + \langle \gamma, \nabla_\theta g(\theta, \delta^{(i(\delta))})\rangle \le \underbrace{\|\gamma\|_2}_{=1}\, \underbrace{\|\nabla_\theta g(\theta, \delta) - \nabla_\theta g(\theta, \delta^{(i(\delta))})\|_2}_{\le L\|\delta - \delta^{(i(\delta))}\|_2 \le L\varepsilon} + \langle \gamma, \nabla_\theta g(\theta, \delta^{(i(\delta))})\rangle \le \langle \gamma, \nabla_\theta g(\theta, \delta^{(i(\delta))})\rangle + L\varepsilon.$$
Hence, if the supremum on the right-hand side is strictly smaller than $-L\varepsilon$, we have $D_\gamma \varphi(\theta) < 0$, which yields the desired result.
E PROOF OF LEMMA 1

Assume the limit that defines $D_{-\gamma}\varphi(\theta)$ exists (and is finite). The claim then follows from the change of variables $s = -t$ in this limit, as carried out in Eq. (22).



It is worth noting that the early-stopping robust accuracy we achieve in the ablations approximately matches that reported in Engstrom et al. (2019) on a ResNet50. There are whole lines of work studying the effects of BN (Bjorck et al., 2018; Santurkar et al., 2018; Kohler et al., 2019), as well as removing it altogether (Brock et al., 2021). BN has also been found to interact with adversarial robustness in Wang et al. (2022) and Benz et al. (2021); the latter also finds GN to be a well-performing alternative, justifying our choice.



Figure 1: (a) and (b): comparison of our method (DDi) and the single-adversarial-example method (PGD) on a synthetic min-max problem. Using a single example may increase the robust loss; DDi computes 10 examples and can avoid this. (c): a similar improvement over PGD training on CIFAR10, where DDi with 10 examples speeds up convergence. More details in Section 5.

Figure 2: Non-uniqueness of an optimal adversarial perturbation. (left) Pairwise $\ell_2$-distances between PGD-based perturbations are bounded away from zero by a large margin, showing that they are distinct. (right) The losses of multiple perturbations on the same sample concentrate around a value much larger than the clean loss (see Fig. 7 for a zoomed-in version).

Figure 3: Count of negative-inner-product pairs among the 10 gradients computed per iteration (left); corresponding robust loss behavior along the optimization (center); comparison, at iteration 250, of the direction obtained by DDi with the individual gradients (right).

Figure 4: (left) Evolution of the robust accuracy on the CIFAR10 validation set, using a standard PGD-20 adversary for evaluation and DDi/PGD-7 during training. (right) An ablation testing the effect of adding the elements not covered by theory (BN, ReLU, momentum) back into our setting.

Wang et al. (2019) derive suboptimality bounds for the robust training problem under a local strong-concavity assumption on the inner-maximization problem. However, such results do not extend to Neural Networks, as the inner-maximization problem is, in general, not strongly concave.

Figure 5: (a) Ablations comparing PGD-variants matching the number of adversarial gradients/steps used for DDi. (b) Ablation over single-step adversaries (FAST/DDi-FAST).

This supports our claim that the falseness of Madry et al. (2018, Corollary C.2) is not well-known in the literature on Adversarial Training. For example, this result is included in the textbook of Vorobeychik et al. (2018, Proposition 8.1). It has also been either reproduced or mentioned in conference papers such as Liu et al. (2020, Section 2), Viallard et al. (2021, Appendix B), and Wei & Ma (2020, Section 5).

Figure 7: The losses of multiple perturbations on 9 different examples all concentrate around a value much larger than the clean loss (see Figure 6 for a comparison with the clean loss).

Figure 8: (a) Epoch evolution of a more efficient implementation of DDi. (b) Wallclock evolution of the same methods.

$$\gamma^\star \in \operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 = 1} D_\gamma \varphi(\theta) = \operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 = 1}\; \max_{\delta \in S_m(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta)\rangle \quad (15)$$

$$\min_{\gamma:\, \|\gamma\|_2 = 1}\; \max_{\delta \in S_m(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta)\rangle = \min_{\gamma:\, \|\gamma\|_2 \le 1}\; \max_{\delta \in S_m(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta)\rangle < 0 \quad (16)$$

We can now transform (15) into a bilinear convex-concave min-max problem, subject to convex and compact constraints:

$$\gamma^\star \in \operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 \le 1} D_\gamma \varphi(\theta) = \operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 \le 1}\; \max_{\delta \in S_m(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta)\rangle = \operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 \le 1}\; \max_{i=1,\dots,m} \gamma^\top \nabla_\theta g(\theta, \delta^{(i)}) = \operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 \le 1}\; \max_{\alpha \in \Delta^m} \gamma^\top \nabla_\theta g(\theta, S_m(\theta))\,\alpha \quad (17)$$

By Sion's minimax theorem, Sion (1958), we can solve Eq. (17) by swapping the order of the operators:

$$\min_{\gamma:\, \|\gamma\|_2 \le 1}\; \max_{\alpha \in \Delta^m} \gamma^\top \nabla_\theta g(\theta, S_m(\theta))\,\alpha = \max_{\alpha \in \Delta^m}\; \min_{\gamma:\, \|\gamma\|_2 \le 1} \gamma^\top \nabla_\theta g(\theta, S_m(\theta))\,\alpha = \max_{\alpha \in \Delta^m} -\|\nabla_\theta g(\theta, S_m(\theta))\,\alpha\|_2 = -\min_{\alpha \in \Delta^m} \|\nabla_\theta g(\theta, S_m(\theta))\,\alpha\|_2 < 0 \quad (18)$$

Published as a conference paper at ICLR 2023

Finally, by noting that squaring the objective function on the right-hand side of Eq. (18) does not change the set of solutions, we arrive at the formula for $\alpha^\star$ in Eq. (14). Indeed, for a solution $\alpha^\star$ of this problem we have

$$\operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 \le 1}\; \max_{\alpha \in \Delta^m} \gamma^\top \nabla_\theta g(\theta, S_m(\theta))\,\alpha = \operatorname*{arg\,min}_{\gamma:\, \|\gamma\|_2 \le 1} \gamma^\top \nabla_\theta g(\theta, S_m(\theta))\,\alpha^\star = -\frac{\nabla_\theta g(\theta, S_m(\theta))\,\alpha^\star}{\|\nabla_\theta g(\theta, S_m(\theta))\,\alpha^\star\|_2} \quad (19)$$
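The min-norm problem in Eq. (18) and the resulting direction in Eq. (19) can be illustrated numerically. Below is a minimal sketch (our own toy solver, not the paper's implementation) that approximates $\min_{\alpha \in \Delta^m} \|G\alpha\|_2$ with exponentiated-gradient updates on the simplex, where the columns of $G$ are the sampled gradients:

```python
import numpy as np

def ddi_direction(G, steps=2000, eta=0.1):
    """Given gradients as the columns of G (shape d x m), return the
    direction -G @ alpha / ||G @ alpha||_2 with alpha approximately in
    argmin_{alpha in simplex} ||G @ alpha||_2 (cf. Eqs. (18)-(19))."""
    d, m = G.shape
    alpha = np.full(m, 1.0 / m)             # start at the simplex center
    for _ in range(steps):
        grad = G.T @ (G @ alpha)            # gradient of 0.5 * ||G alpha||^2
        alpha = alpha * np.exp(-eta * grad) # exponentiated-gradient step
        alpha /= alpha.sum()                # renormalize onto the simplex
    v = G @ alpha
    return -v / np.linalg.norm(v), alpha

# Toy example with two gradients g1 = (1, 0) and g2 = (0, 2) as columns.
G = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha = ddi_direction(G)

# gamma has negative inner product with *both* gradients, i.e. it is a
# descent direction for the max of both linearizations.
assert (G.T @ gamma < 0).all()
```

For this instance the optimizer is $\alpha^\star = (0.8, 0.2)$, and $\langle \gamma^\star, g_1\rangle = \langle \gamma^\star, g_2\rangle$, reflecting the minimax structure of Eq. (18).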

$$\le \sup_{\delta^{(i)} \in S_m(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta^{(i)})\rangle + L\varepsilon \quad (20)$$

Taking the supremum over $\delta \in S(\theta)$ on the left-hand side, we obtain

$$D_\gamma \varphi(\theta) := \sup_{\delta \in S(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta)\rangle \le \sup_{\delta^{(i)} \in S_m(\theta)} \langle \gamma, \nabla_\theta g(\theta, \delta^{(i)})\rangle + L\varepsilon$$

$$D_{-\gamma}\varphi(\theta) = \lim_{t \to 0} \frac{\varphi(\theta + t(-\gamma)) - \varphi(\theta)}{t\,\|\gamma\|_2} = \lim_{t \to 0} \frac{\varphi(\theta + (-t)\gamma) - \varphi(\theta)}{-(-t)\,\|\gamma\|_2} = \lim_{(-t) \to 0} \frac{\varphi(\theta + (-t)\gamma) - \varphi(\theta)}{-(-t)\,\|\gamma\|_2} = \lim_{s \to 0} -\frac{\varphi(\theta + s\gamma) - \varphi(\theta)}{s\,\|\gamma\|_2} \;\; (\text{let } s = -t) = -\lim_{s \to 0} \frac{\varphi(\theta + s\gamma) - \varphi(\theta)}{s\,\|\gamma\|_2} = -D_\gamma \varphi(\theta) \quad (22)$$


Figure 6: The losses of multiple perturbations on 9 different examples all concentrate around a value much larger than the clean loss. See Section 5.1 for experimental details. The histograms have been enlarged in Figure 7.

ACKNOWLEDGMENTS

This work is funded (in part) through a PhD fellowship of the Swiss Data Science Center, a joint venture between EPFL and ETH Zurich. Igor Krawczuk, Leello Dadi, Thomas Pethick and Volkan Cevher acknowledge funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n° 725594 - timedata). This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021 205011. This work is licensed under a Creative Commons "Attribution 3.0 Unported" license.

