ROBUSTNESS GUARANTEES FOR ADVERSARIALLY TRAINED NEURAL NETWORKS

Abstract

We study robust adversarial training of two-layer neural networks with the Leaky ReLU activation function as a bi-level optimization problem. In particular, for the inner loop that implements the PGD attack, we propose maximizing a lower bound on the 0/1-loss by reflecting a surrogate loss about the origin. This allows us to give a convergence guarantee for the inner-loop PGD attack and precise iteration complexity results for end-to-end adversarial training, which hold for any width and initialization in a realizable setting. We provide empirical evidence to support our theoretical results.

1. INTRODUCTION

Despite the tremendous success of deep learning, neural network-based models are highly susceptible to small, imperceptible, adversarial perturbations of data at test time (Szegedy et al., 2014). Such vulnerability to adversarial examples imposes severe limitations on the deployment of neural network-based systems, especially in critical high-stakes applications such as autonomous driving, where safe and reliable operation is paramount. An abundance of studies demonstrating adversarial examples across different tasks and application domains (Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017) has led to a renewed focus on robust learning as an active area of research within machine learning. The goal of robust learning is to find models that yield reliable predictions on test data notwithstanding adversarial perturbations.

A principled approach to training models that are robust to adversarial examples, which has emerged in recent years, is adversarial training (Madry et al., 2018). Adversarial training formulates learning as a min-max optimization problem wherein the 0-1 classification loss is replaced by a convex surrogate such as the cross-entropy loss, and alternating optimization techniques are used to solve the resulting saddle point problem. Despite the empirical success of adversarial training, our understanding of its theoretical underpinnings remains limited. From a practical standpoint, it is remarkable that gradient-based techniques can efficiently solve both the inner maximization problem, to find adversarial examples, and the outer minimization problem, to impart robust generalization. On the other hand, a theoretical analysis is challenging because (1) both the inner- and outer-level optimization problems are non-convex, and (2) it is unclear a priori whether solving the min-max optimization problem would even guarantee robust generalization.

In this work, we seek to understand adversarial training better. In particular, under a margin separability assumption, we provide robust generalization guarantees for two-layer neural networks with Leaky ReLU activation trained using adversarial training. Our key contributions are as follows.

1. We identify a disconnect between the robust learning objective and the min-max formulation of adversarial training. This observation inspires a simple modification of adversarial training: we propose reflecting the surrogate loss about the origin in the inner maximization phase when searching for an "optimal" perturbation vector to attack the current model.
2. We provide convergence guarantees for PGD attacks on two-layer neural networks with Leaky ReLU activation. To the best of our knowledge, this is the first result of its kind.
3. We give global convergence guarantees and establish learning rates for adversarial training of two-layer neural networks with the Leaky ReLU activation function. Notably, our guarantees hold for any bounded initialization and any width, a property that is not present in previous works in the neural tangent kernel (NTK) regime (Gao et al., 2019; Zhang et al., 2020).
4. We provide extensive empirical evidence showing that reflecting the surrogate loss in the inner loop does not have a significant impact on the test-time performance of the adversarially trained models.

Notation. We denote matrices, vectors, scalar variables, and sets by Roman capital letters, Roman lowercase letters, lowercase letters, and uppercase script letters, respectively (e.g., $\mathrm{X}$, $\mathrm{x}$, $x$, and $\mathcal{X}$). For any integer $d$, we represent the set $\{1, \ldots, d\}$ by $[d]$. The $\ell_2$-norm of a vector $x$ and the Frobenius norm of a matrix $X$ are denoted by $\|x\|$ and $\|X\|_F$, respectively. Given a set $C$, the operator $\Pi_C(x) = \arg\min_{x' \in C} \|x - x'\|$ projects onto the set $C$ with respect to the $\ell_2$-norm.

1.1. RELATED WORK

Linear models. Adversarial training of linear models was recently studied by Charles et al. (2019), Li et al. (2020), and Zou et al. (2021). In particular, Charles et al. (2019) and Li et al. (2020) give robust generalization error guarantees for adversarially trained linear models under a margin separability assumption. The hard margin assumption was relaxed by Zou et al. (2021), who give robust generalization guarantees for distributions with agnostic label noise. We note that the optimal attack for linear models has a simple closed-form expression, which mitigates the challenge of analyzing the inner-loop PGD attack. In contrast, one of our main contributions is to give convergence guarantees for the PGD attack. Nonetheless, as the Leaky ReLU activation function can also realize the identity map for $\alpha = 1$, our results also provide robust generalization error guarantees for training linear models.

Non-linear models. Wang et al. (2019) propose a first-order stationary condition to evaluate the convergence quality of adversarial attacks found in the inner loop. Zhang et al. (2021) study adversarial training as a bi-level optimization problem and propose a principled approach towards the design of fast adversarial training algorithms. Most related to our results are the works of Gao et al. (2019) and Zhang et al. (2020), which study the convergence of adversarial training in non-linear neural networks. Under specific initialization and width requirements, these works guarantee small robust training error with respect to the attack that is used in the inner loop, without explicitly analyzing the convergence of the attack. Gao et al. (2019) assume that the activation function is smooth and require that the width of the network, as well as the overall computational cost, is exponential in the input dimension. The work of Zhang et al. (2020) partially addresses these issues. In particular, their results hold for ReLU neural networks, and they only require the width and the computational cost to be polynomial in the input parameters.

Our work differs from that of Gao et al. (2019) and Zhang et al. (2020) in several ways. Here we highlight three key differences.

• First, while the prior work analyzes the convergence in the NTK setting with specific initialization and width requirements, our results hold for any initialization and width.
• Second, none of the prior works studies the computational aspects of finding an optimal attack vector in the inner loop. Instead, the prior work assumes oracle access to optimal attack vectors. We provide precise iteration complexity results for the projected gradient method (i.e., for the PGD attack) for finding near-optimal attack vectors.
• Third, the prior works focus on minimizing the robust training loss, whereas we provide computational learning guarantees on the robust generalization error.

The rest of the paper is organized as follows. In Section 2, we give the problem setup and introduce the adversarial training procedure with the reflected surrogate loss in the inner loop. In Section 3, we present our main results, discuss the implications, and give a proof sketch. We support our theory with empirical results in Section 4 and conclude with a discussion in Section 5.

2. PRELIMINARIES

We focus on two-layer networks with $m$ hidden nodes computing $f(x; a, W) = a^\top \sigma(Wx)$, where $W \in \mathbb{R}^{m \times d}$ and $a \in \mathbb{R}^m$ are the weights of the first and the second layers, respectively, and $\sigma(z) = \max\{\alpha z, z\}$ is the Leaky ReLU activation function. We randomly initialize the weights $a$ and $W$ such that $\|a\|_\infty \le \kappa$ and $\|W\|_F \le \omega$. The top linear layer (i.e., weights $a$) is kept fixed, and the hidden layer (i.e., $W$) is trained using stochastic gradient descent (SGD). For simplicity of notation, we represent the network as $f(x; W)$, suppressing the dependence on the top layer weights. Further, with a slight abuse of notation, we denote the function by $f_W(x)$ when optimizing over the input adversarial perturbations, and by $f_x(W)$ when training the network weights.

Algorithm 1 (Atk): PGD Attack
Input: Sample $(x, y)$, weights $W$, step size $\eta_{atk}$, number of iterations $T_{atk}$
1: Initialize $\delta_1 \leftarrow x$
2: for $t = 1$ to $T_{atk}$ do
3:   $\delta_{t+1} \leftarrow \Pi_{\Delta(x)}\big(\delta_t + \eta_{atk} \nabla \ell_-(y f_W(\delta_t))\big)$
4: end for
Output: $\delta_\tau$, where $\tau \in \arg\max_{t \in [T_{atk}]} \ell_-(y f_W(\delta_t))$

Formally, adversarial learning is described as follows. Let $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{\pm 1\}$ denote the input feature space and the output label space, respectively. Let $D$ be an unknown joint distribution on $\mathcal{X} \times \mathcal{Y}$. For any fixed $x \in \mathcal{X}$, we consider norm-bounded adversarial perturbations in the set $\Delta(x) := \{\delta : \|\delta - x\| \le \nu\}$, for some fixed noise budget $\nu$. Given a training sample $S := \{(x_i, y_i)\}_{i=1}^n \sim D^n$ drawn independently and identically from the underlying distribution $D$, the goal is to find a network with small robust misclassification error

$$\varepsilon_{rob}(W) = \mathbb{E}_D \max_{\delta \in \Delta(x)} \mathbb{I}[y f_{\bar{W}}(\delta) < 0],$$

where $\bar{W} := W / \|W\|_F$ is the weight matrix normalized to have unit Frobenius norm. Note that, due to the homogeneity of Leaky ReLU, such normalization has no effect on the robust error whatsoever.

In adversarial training, the 0-1 loss inside the expectation is replaced with a convex surrogate such as the cross-entropy loss $\ell(z) = \log(1 + e^{-z})$, and the expected value is estimated using a sample average:

$$\hat{\varepsilon}_{rob}(W) := \frac{1}{n} \sum_{i=1}^n \max_{\delta_i \in \Delta(x_i)} \ell(y_i f_{\bar{W}}(\delta_i)). \quad (2)$$

Figure 1: The 0-1 loss (red), its convex surrogate, the cross-entropy loss (blue), and the reflected cross-entropy loss (green).

Notwithstanding the conventional wisdom, adversarial training entails maximizing an upper bound, as opposed to a lower bound, on the 0-1 loss. In contrast, we propose using a concave lower bound on the 0-1 loss to solve the inner maximization problem. Let $\ell_-(z) = -\ell(-z) = -\log(1 + e^{z})$ denote the reflected loss. In Figure 1, we plot the 0-1 loss, the cross-entropy loss, and the reflected cross-entropy loss. Starting from $\delta_1 = x$, the PGD attack updates the iterates via $\delta_{t+1} = \Pi_{\Delta(x)}\big(\delta_t + \eta_{atk} \nabla \ell_-(y f_W(\delta_t))\big)$, as described in Algorithm 1. We emphasize that the only difference between standard adversarial training and what we propose in Algorithm 2 and Algorithm 1 is that we reflect the loss (about the origin) in Algorithm 1.
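The PGD attack on the reflected loss can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random network, and the sizes `m`, `d` are hypothetical, and the step size is simply set to $1/(\kappa^2 m \|W\|_F^2)$. The sketch returns the iterate with the largest reflected loss among $\delta_1, \ldots, \delta_{T_{atk}}$.

```python
import numpy as np

def leaky_relu(u, alpha):
    # sigma(z) = max{alpha * z, z}; this is Leaky ReLU for 0 < alpha <= 1
    return np.maximum(alpha * u, u)

def network(delta, a, W, alpha):
    # f_W(delta) = a^T sigma(W delta)
    return a @ leaky_relu(W @ delta, alpha)

def reflected_loss(z):
    # ell_-(z) = -ell(-z) = -log(1 + e^z), computed stably
    return -np.logaddexp(0.0, z)

def grad_reflected_loss(delta, y, a, W, alpha):
    # Gradient of ell_-(y f_W(delta)) with respect to the input delta
    u = W @ delta
    slope = np.where(u > 0, 1.0, alpha)            # derivative of Leaky ReLU
    grad_f = W.T @ (a * slope)
    z = y * (a @ leaky_relu(u, alpha))
    sigmoid_z = 0.5 * (1.0 + np.tanh(0.5 * z))     # ell_-'(z) = -sigmoid(z)
    return -sigmoid_z * y * grad_f

def project_ball(delta, x, nu):
    # Euclidean projection onto Delta(x) = {delta : ||delta - x|| <= nu}
    r = delta - x
    norm = np.linalg.norm(r)
    return delta if norm <= nu else x + r * (nu / norm)

def pgd_attack(x, y, a, W, alpha, nu, eta_atk, T_atk):
    delta = x.astype(float).copy()                 # delta_1 = x
    best, best_val = delta, -np.inf
    for _ in range(T_atk):
        val = reflected_loss(y * network(delta, a, W, alpha))
        if val > best_val:                         # track the best iterate so far
            best, best_val = delta, val
        step = delta + eta_atk * grad_reflected_loss(delta, y, a, W, alpha)
        delta = project_ball(step, x, nu)
    return best

# Hypothetical toy instance (all sizes and values are illustrative assumptions)
rng = np.random.default_rng(0)
m, d, alpha, nu = 8, 5, 0.5, 0.3
kappa = 1.0 / np.sqrt(m)
a = kappa * rng.choice([-1.0, 1.0], size=m)        # ||a||_inf <= kappa
W = rng.normal(size=(m, d))
x, y = rng.normal(size=d), 1.0
eta_atk = 1.0 / (kappa**2 * m * np.linalg.norm(W) ** 2)
delta_atk = pgd_attack(x, y, a, W, alpha, nu, eta_atk, T_atk=200)
```

Because the first iterate is $x$ itself, the returned point never has a smaller reflected loss than the clean input, and the projection keeps every iterate inside $\Delta(x)$.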

3. MAIN RESULTS

We consider a slightly weaker version of the robust error. In particular, we are interested in adversarial attacks that can fool the learner with a margin: for some small, non-negative constant $\beta$, we define the $\beta$-robust misclassification error as

$$\varepsilon_\beta(W) = \mathbb{P}\Big\{\min_{\delta \in \Delta(x)} y f_{\bar{W}}(\delta) < -\beta\Big\}.$$

In particular, as $\beta$ tends to zero, $\varepsilon_\beta(W) \to \varepsilon_{rob}(W)$. When $\beta$ is a small positive constant bounded away from zero, $(x, y)$ contributes to $\varepsilon_\beta(W)$ only if there exists an attack $\delta \in \Delta(x)$ such that $f_{\bar{W}}$ confidently makes a wrong prediction on $\delta$. In other words, the $\beta$-robust misclassification error is the probability that, for $(x, y) \sim D$, a $\beta$-effective attack exists.

Algorithm 2 (AdvTr): Adversarial Training
Input: Step size $\eta_{tr}$, number of iterations $T_{tr}$
1: Initialize $a$ and $W_1$ such that $\|a\|_\infty \le \kappa$ and $\|W_1\|_F \le \omega$
2: for $t = 1$ to $T_{tr}$ do
3:   Draw $(x_t, y_t) \sim D$
4:   $\delta_t \leftarrow \mathrm{Atk}(W_t, x_t, y_t)$
5:   $W_{t+1} \leftarrow W_t - \eta_{tr} \nabla_W \ell(y_t f_{\delta_t}(W_t))$
6: end for

Definition 3.1 (Effective Attacks). Given a neural network with parameters $(a, W)$, a data point $(x, y)$, and some constant $\beta > 0$, we say that $\delta_* \in \Delta(x)$ is a $\beta$-effective attack if $y f_{\bar{W}}(\delta_*) \le -\beta$, where $\bar{W} = W / \|W\|_F$.

Our bounds depend on several important problem parameters. Before stating the main results of the paper, we remind the reader of these quantities: $\nu$ denotes the attack size; $\kappa$ and $\omega$ are the bounds on the norms of the parameters $a$ and $W$ at initialization; finally, $\alpha$ is the Leaky ReLU parameter.

Our first result, stated in the following theorem (see footnote 1), gives convergence rates for Algorithm 1 in terms of the negated loss derivative $-\ell'(\cdot)$, under the assumption that an effective attack exists. The negative derivative, $-\ell'(\cdot)$, of the loss function has been used in several previous works to give an upper bound on the error (Cao & Gu, 2019); here, we borrow similar ideas from Frei et al. (2021). In particular, as will become clear later, we use the positivity and monotonicity of $-\ell'(\cdot)$ to give an upper bound on the $\beta$-robust loss using Markov's inequality.

Theorem 3.2. Let $\delta_*$ be a $\beta$-effective attack for a given network with weights $(a, W)$ and a given example $(x, y)$, with $\beta \ge 2\nu(1-\alpha)\kappa\sqrt{m}$. Then, after $T_{atk}$ iterations, PGD with step size $\eta_{atk} \le \frac{1}{\kappa^2 m \|W\|_F^2}$ generates an attack $\delta_{atk}$ such that

$$-\ell'(y f_W(\delta_*)) \le -2\ell'(y f_W(\delta_{atk})) + \frac{4\nu^2}{\eta_{atk} T_{atk}}.$$

Theorem 3.2 establishes that under proper initialization ($\kappa = 1/\sqrt{m}$), when a $\beta$-effective attack exists, Algorithm 1 finds an $\epsilon$-suboptimal attack vector in $O(\nu^2/\epsilon)$ iterations. We next study the convergence of Algorithm 2 under the following distributional assumption.

Assumption 3.3. Samples $(x, y)$ are drawn i.i.d. from an unknown joint distribution $D$ that satisfies:
• $\|x\| \le R$ with probability 1.
• There exists a unit-norm vector $v_* \in \mathbb{R}^d$, $\|v_*\| = 1$, such that for $(x, y) \sim D$, we have with probability 1 that $y(v_* \cdot x) \ge \gamma > 0$.

The first assumption requires that the inputs are bounded, which is standard in the literature and is satisfied in most practical applications. The second assumption implies that $D$ is linearly separable with margin $\gamma > 0$. Of course, we do not need a non-linear neural network to robustly learn a predictor under such a distributional assumption. But can we even guarantee robust learnability of neural networks in such simple settings? As far as we know, nothing is known. We note that even for standard (non-robust) training of two-layer neural networks using SGD, the convergence guarantees in the hard margin setting were unknown until recently (Brutzkus et al., 2018). The following theorem establishes that adversarial training can efficiently find a network with small $\beta$-robust error.

Theorem 3.4 (Convergence of Algorithm 2). For any $\epsilon > 0$, in at most

$$T_{tr} \le \frac{64 (R+\nu)^2 (1 + \omega \alpha \kappa \sqrt{m}\, \epsilon)}{(\gamma - \nu)^2 \alpha^2 \epsilon^2}$$

iterations, Algorithm 2 with step size $\eta_{tr} \le \frac{1}{m \kappa^2 (R+\nu)^2}$ finds an iterate $\tau$ that, in expectation over $\{(x_t, y_t)\}_{t=1}^{T_{tr}}$, satisfies $\varepsilon_\beta(W_\tau) \le 2\epsilon$ for any $\beta \ge 2\nu(1-\alpha)\kappa\sqrt{m}$, provided that for all $t \in [T_{tr}]$, $\eta_{atk} \le \frac{1}{\kappa^2 m \|W_t\|_F^2}$ and $T_{atk} \ge \frac{8\nu^2}{\eta_{atk}\epsilon}$.

A few remarks are in order.

Beyond Neural Tangent Kernel. As opposed to the convergence results in previous work (Gao et al., 2019; Zhang et al., 2020), which impose initialization and width requirements specific to the NTK regime, our results hold for any bounded initialization and any width $m$.

Role of the Robustness Parameter $\nu$. Our guarantee holds only when the desired robustness parameter $\nu$ is smaller than the distribution margin $\gamma$. Furthermore, the iteration complexity increases gracefully as $O(\nu^2/(\gamma-\nu)^2)$ as the attacks become stronger, i.e., as the size of adversarial perturbations tends to the margin. Intuitively, as $\nu \to 0$, the attack becomes trivial, and adversarial training reduces to standard non-adversarial training. This is fully captured by our results: as $\nu \to 0$, the number of attack iterates $T_{atk}$ goes to zero, and we recover the overall runtime of $O(\gamma^{-2}\epsilon^{-2})$ as in previous work (Brutzkus et al., 2018; Frei et al., 2021).

Computational Complexity. To guarantee $\epsilon$-suboptimality in the $\beta$-robust misclassification error, we require $T_{tr} = O((\gamma-\nu)^{-2}\epsilon^{-2})$ iterations of Algorithm 2. Each iteration invokes the PGD attack in Algorithm 1, which itself requires $T_{atk} = O(\nu^2/\epsilon)$ gradient updates. Therefore, the overall computational cost of adversarial training to achieve $\epsilon$-suboptimality is $O(\nu^2 (\gamma-\nu)^{-2} \epsilon^{-3})$. Note that $T_{atk}$ is a purely computational requirement, and the statistical complexity of adversarial training is fully captured by $T_{tr}$. Remarkably, there is only a mild $O(\gamma^2/(\gamma-\nu)^2)$ statistical overhead for $\nu$-robustness, and the computational cost increases gracefully by a multiplicative factor of $O\big(\frac{\nu^2\gamma^2}{(\gamma-\nu)^2\epsilon}\big)$.

Learning Robust Linear Halfspaces. When $\alpha = 1$, the Leaky ReLU activation equals the identity map, and the network reduces to a linear predictor. In this case, we retrieve strong robust generalization guarantees for learning halfspaces, as the lower bound required for $\beta$ in Theorem 3.4 vanishes. The following corollary instantiates such a robust generalization guarantee.

Corollary 3.5. Let $\kappa = 1/\sqrt{m}$, $\omega = 1/[\ldots]$, and $\eta_{tr} = (R+\nu)^{-2}$. For any $\epsilon > 0$, in at most $T_{tr} \le 128 (R+\nu)^2 (\gamma-\nu)^{-2} \epsilon^{-2}$ iterations, Algorithm 2 finds an iterate $\tau$ that, in expectation over $\{(x_t, y_t)\}_{t=1}^{T_{tr}}$, satisfies $\varepsilon_{rob}(W_\tau) \le 2\epsilon$, provided that for all $t \in [T_{tr}]$, $\eta_{atk} \le \|W_t\|_F^{-2}$ and $T_{atk} \ge \frac{8\nu^2}{\eta_{atk}\epsilon}$.

Dependence on the Norm of Iterates. The iteration complexity of Algorithm 1 is inversely proportional to the learning rate $\eta_{atk}$, and therefore increases with $\|W_t\|_F^2$. Thus, when calculating the overall computational complexity, one needs to compute an upper bound on the norm of the iterates. As we show in Equation (6) in the appendix, it holds for all iterates that $\|W_{t+1}\|_F^2 \le \|W_1\|_F^2 + 3\eta_{tr} t$. Therefore, if we set $\kappa = 1/\sqrt{m}$ and $\omega^2 = 3/(R+\nu)^2$, we have the following worst-case weight-independent bound on the overall computational cost:

$$T \le \sum_{t=1}^{T_{tr}} \frac{8\nu^2}{\eta_{atk}\epsilon} \le \sum_{t=1}^{T_{tr}} \frac{8\nu^2 \|W_t\|_F^2}{\epsilon} \le \sum_{t=1}^{T_{tr}} \frac{8\nu^2 (\omega^2 + 3\eta_{tr}(t-1))}{\epsilon} \le \sum_{t=1}^{T_{tr}} \frac{24\nu^2 t}{(R+\nu)^2 \epsilon} \le \frac{12\nu^2 T_{tr}^2}{(R+\nu)^2 \epsilon} \le \frac{196608\, \nu^2 (R+\nu)^2}{(\gamma-\nu)^4 \alpha^4 \epsilon^5}.$$

Therefore, the worst-case overall computational cost is of order $O((\gamma-\nu)^{-4}\epsilon^{-5})$. We note again that this cost is purely computational; the statistical complexity is still of order $O((\gamma-\nu)^{-2}\epsilon^{-2})$.

Adversarial Robustness for any $\beta$. As we discussed earlier, as $\beta \to 0$, the $\beta$-robust error tends to the robust error, i.e., $\varepsilon_\beta(W) \to \varepsilon_{rob}(W)$. Although Theorem 3.4 does not hold for $\beta = 0$ (except for the linear case discussed above), it is possible to guarantee robust generalization with arbitrarily small $\beta$, as stated in the following corollary.

Corollary 3.6. For any desirable $\beta > 0$, let $\kappa = \frac{\beta}{2\nu(1-\alpha)\sqrt{m}}$. For any $\epsilon > 0$, in at most

$$T_{tr} \le \frac{64 (R+\nu)^2 \big(1 + \omega\alpha\beta\epsilon/(2\nu(1-\alpha))\big)}{(\gamma-\nu)^2 \alpha^2 \epsilon^2}$$

iterations, Algorithm 2 with step size $\eta_{tr} \le \frac{4\nu^2(1-\alpha)^2}{\beta^2 (R+\nu)^2}$ finds an iterate $\tau$ that, in expectation over $\{(x_t, y_t)\}_{t=1}^{T_{tr}}$, satisfies $\varepsilon_\beta(W_\tau) \le 2\epsilon$, provided that for all $t \in [T_{tr}]$, $\eta_{atk} \le \frac{4\nu^2(1-\alpha)^2}{\beta^2 \|W_t\|_F^2}$ and $T_{atk} \ge \frac{2\beta^2 \|W_t\|_F^2}{(1-\alpha)^2 \epsilon}$.
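The interplay between Algorithm 2 and the inner PGD attack can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the synthetic data, the function names, and all sizes are hypothetical; samples are drawn from a finite training set rather than from $D$ directly; and the step sizes are set to $\eta_{tr} = 1/(m\kappa^2(R+\nu)^2)$ and $\eta_{atk} = 1/(\kappa^2 m \|W_t\|_F^2)$.

```python
import numpy as np

def leaky(u, alpha):
    return np.maximum(alpha * u, u)

def slope(u, alpha):
    return np.where(u > 0, 1.0, alpha)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def attack(W, a, x, y, alpha, nu, kappa, T_atk):
    # PGD on the reflected loss -log(1 + exp(y f_W(delta))).
    eta_atk = 1.0 / (kappa**2 * len(a) * np.linalg.norm(W) ** 2)
    delta, best, best_margin = x.copy(), x.copy(), np.inf
    for _ in range(T_atk):
        u = W @ delta
        z = y * (a @ leaky(u, alpha))
        if z < best_margin:          # the reflected loss is decreasing in the margin z
            best, best_margin = delta, z
        grad = -sigmoid(z) * y * (W.T @ (a * slope(u, alpha)))
        delta = delta + eta_atk * grad
        r = delta - x
        n = np.linalg.norm(r)
        if n > nu:                   # project back onto the ball of radius nu around x
            delta = x + r * (nu / n)
    return best

def adversarial_training(X, Y, alpha, nu, kappa, omega, m, T_tr, T_atk, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    R = np.max(np.linalg.norm(X, axis=1))
    a = kappa * rng.choice([-1.0, 1.0], size=m)   # ||a||_inf <= kappa, kept fixed
    W = rng.normal(size=(m, d))
    W *= omega / np.linalg.norm(W)                # ||W_1||_F <= omega
    eta_tr = 1.0 / (m * kappa**2 * (R + nu) ** 2)
    for _ in range(T_tr):
        i = rng.integers(n)                       # assumption: sample from a finite set
        x, y = X[i], Y[i]
        delta = attack(W, a, x, y, alpha, nu, kappa, T_atk)
        u = W @ delta
        z = y * (a @ leaky(u, alpha))
        # gradient of log(1 + exp(-z)) w.r.t. W, using ell'(z) = -sigmoid(-z)
        grad_W = -sigmoid(-z) * y * np.outer(a * slope(u, alpha), delta)
        W = W - eta_tr * grad_W
    return a, W

# Hypothetical linearly separable data with margin gamma along the first coordinate
rng = np.random.default_rng(1)
d, gamma, nu, m = 4, 0.5, 0.2, 16
X = rng.uniform(-1.0, 1.0, size=(200, d))
Y = np.where(X[:, 0] >= 0, 1.0, -1.0)
X[:, 0] = Y * (gamma + np.abs(X[:, 0]))
a, W = adversarial_training(X, Y, alpha=0.5, nu=nu, kappa=1 / np.sqrt(m),
                            omega=1.0, m=m, T_tr=300, T_atk=20)
```

The iteration counts `T_tr=300` and `T_atk=20` are chosen only to keep the sketch fast; they are not the worst-case values prescribed by Theorem 3.4.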

3.1. PROOF SKETCH

In this section, we highlight the key ideas and insights behind our analysis, and give a sketch of the proof of the main result. Using Definition 3.1, the proof of Theorem 3.4 crucially depends on the following two facts. First, whenever there exists a $\beta$-effective attack, Algorithm 1 will efficiently find a sufficiently good attack (in the sense of Theorem 3.2). Second, as long as the attack size $\nu$ is smaller than the margin $\gamma$, robust training is not much harder than standard training. In particular, the following lemma establishes that the expected value of the negative loss derivative eventually becomes arbitrarily small.

Lemma 3.7. For any $\epsilon > 0$, Algorithm 2 with step size $\eta_{tr} \le m^{-1}\kappa^{-2}(R+\nu)^{-2}$ finds an iterate $\tau$ that, in expectation over $\{(x_t, y_t)\}_{t=1}^{T_{tr}}$, satisfies $\mathbb{E}_D[-\ell'(y f_{W_\tau}(\delta_{atk}(x)))] \le \epsilon$ in at most

$$T_{tr} \le \frac{4(1 + \|W_1\|_F\, \alpha \kappa \sqrt{m}\, \epsilon)}{\eta_{tr} (\gamma-\nu)^2 \alpha^2 \kappa^2 m \epsilon^2}$$

iterations.

We remark that the result in Lemma 3.7 holds for any attack algorithm Atk, as long as it respects the condition $\delta_{atk}(x) \in \Delta(x)$ for all $x$. We are now ready to present the proof of the main result.

Proof of Theorem 3.4. Recall that the $\beta$-robust misclassification error is defined as

$$\varepsilon_\beta(W) = \mathbb{P}\Big\{\min_{\delta \in \Delta(x)} y f_{\bar{W}}(\delta) < -\beta\Big\} = \mathbb{P}\Big\{\min_{\delta \in \Delta(x)} y f_W(\delta) < -\beta \|W\|_F\Big\}. \quad \text{(Homogeneity of } f\text{)}$$

A key step in the proof is to give an upper bound on the $\beta$-robust error in terms of the attack returned by PGD, i.e., $\delta_{atk}(x)$, rather than the optimal attack $\min_{\delta \in \Delta(x)} y f_W(\delta)$. Theorem 3.2 does provide us with such an upper bound; however, (1) it only holds in expectation, and (2) it is conditioned on the existence of an effective attack at the given example $(x, y)$ and the weights $W$. Naturally, we can use Markov's inequality to bound the probability above. In order to address the conditional nature of the result in Theorem 3.2, we introduce a truncated version of the negative loss derivative. In particular, for any $c$, let $-\ell'_c(z) = -\ell'(z)\, \mathbb{I}[z \le c]$ be the negative loss derivative thresholded at $c$. Note that $z \le c$ implies that $-\ell'_c(z) \ge -\ell'_c(c)$; therefore, $\mathbb{P}\{z \le c\} \le \mathbb{P}\{-\ell'_c(z) \ge -\ell'_c(c)\}$. Let $\beta_\tau := \beta \|W_\tau\|_F$, where $W_\tau$ is the iterate guaranteed by Lemma 3.7. We have

$$\begin{aligned}
\varepsilon_\beta(W_\tau) &\le \mathbb{P}\Big\{\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta) \le -\beta_\tau\Big\} \\
&\le \mathbb{P}\Big\{-\ell'_{-\beta_\tau}\Big(\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta)\Big) \ge -\ell'_{-\beta_\tau}(-\beta_\tau)\Big\} \\
&\le \frac{\mathbb{E}_D\big[-\ell'_{-\beta_\tau}\big(\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta)\big)\big]}{-\ell'_{-\beta_\tau}(-\beta_\tau)} && \text{(Markov's inequality)} \\
&\le 2\, \mathbb{E}_D\Big[-\ell'_{-\beta_\tau}\Big(\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta)\Big)\Big] && (-\ell'(z) \ge 1/2 \text{ for } z \le 0)
\end{aligned}$$

Given $W_\tau$, for any $(x, y) \sim D$, one of the following two cases can happen:

1. There exists a $\beta$-effective attack. In this case, by Definition 3.1, it holds that $\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta) \le -\beta \|W_\tau\|_F = -\beta_\tau$. Therefore, by definition of the truncated negative loss derivative, it also holds that $-\ell'_{-\beta_\tau}(\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta)) = -\ell'(\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta))$. Now, using Theorem 3.2, we get that

$$-\ell'_{-\beta_\tau}\Big(\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta)\Big) \le -2\ell'(y f_{W_\tau}(\delta_{atk}(x))) + \frac{4\nu^2}{\eta_{atk} T_{atk}}. \quad (3)$$

2. There does not exist a $\beta$-effective attack. In this case, by Definition 3.1, it holds that $\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta) > -\beta \|W_\tau\|_F = -\beta_\tau$. Therefore, by definition of the truncated negative loss derivative, it also holds that $-\ell'_{-\beta_\tau}(\min_{\delta \in \Delta(x)} y f_{W_\tau}(\delta)) = 0$, which is trivially bounded by the upper bound in the first case above, given by Equation (3).

Substituting these two cases into the upper bound on the $\beta$-robust error, we arrive at

$$\frac{1}{2}\varepsilon_\beta(W_\tau) \le 2\, \mathbb{E}_D[-\ell'(y f_{W_\tau}(\delta_{atk}(x)))] + \frac{4\nu^2}{\eta_{atk} T_{atk}} \le \frac{\epsilon}{2} + \frac{4\nu^2}{\eta_{atk} T_{atk}} \le \frac{\epsilon}{2} + \frac{\epsilon}{2},$$

where the first inequality follows from Theorem 3.2, the second inequality follows from Lemma 3.7 given the proper choice of $T_{tr}$, and the final inequality holds by setting $T_{atk} \ge \frac{8\nu^2}{\eta_{atk}\epsilon}$.
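Two elementary facts used in the proof can be checked numerically: for the cross-entropy loss $\ell(z) = \log(1 + e^{-z})$, the negative derivative $-\ell'(z) = 1/(1 + e^{z})$ is decreasing and at least $1/2$ for $z \le 0$, and choosing $T_{atk} \ge 8\nu^2/(\eta_{atk}\epsilon)$ makes the attack term $4\nu^2/(\eta_{atk} T_{atk})$ at most $\epsilon/2$. The following is a small sketch with hypothetical helper names and illustrative values.

```python
import math

def neg_loss_derivative(z):
    # -ell'(z) for ell(z) = log(1 + exp(-z)), which equals 1 / (1 + exp(z))
    return 1.0 / (1.0 + math.exp(z))

def truncated_neg_loss_derivative(z, c):
    # -ell'_c(z) = -ell'(z) * I[z <= c]
    return neg_loss_derivative(z) if z <= c else 0.0

def min_attack_iterations(nu, eta_atk, eps):
    # smallest integer T_atk with T_atk >= 8 * nu^2 / (eta_atk * eps)
    return math.ceil(8.0 * nu**2 / (eta_atk * eps))

# Illustrative values (assumptions, not taken from the paper)
nu, eta_atk, eps = 0.1, 0.05, 0.01
T_atk = min_attack_iterations(nu, eta_atk, eps)
attack_term = 4.0 * nu**2 / (eta_atk * T_atk)

grid = [-5.0 + 0.25 * k for k in range(41)]          # z from -5 to 5
values = [neg_loss_derivative(z) for z in grid]
is_decreasing = all(v1 >= v2 for v1, v2 in zip(values, values[1:]))
at_least_half = all(neg_loss_derivative(z) >= 0.5 for z in grid if z <= 0)
```

The truncation at a threshold $c \le 0$ simply zeroes out the contribution of examples whose margin exceeds $c$, which is what allows the two cases of the proof to share the bound in Equation (3).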

4. EMPIRICAL RESULTS

Adversarial training is widely used in training robust models and has been shown to be fairly effective in practice. The goal of this section is not to attest or reproduce previous empirical findings. Instead, since the focus of this paper is on the theoretical analysis of adversarial training in non-linear networks, the goal of this section is merely to empirically study the effect of using the reflected loss in Algorithm 1. The experimental results are organized as follows. First, in Sec. 4.1, we compare the optimal attacks found by a grid search on the surrogate loss and its reflected version. In Sec. 4.2, we empirically study adversarial training with the reflected loss in the binary classification setting. Finally, in Sec. 4.3, we generalize the reflected loss, which is key to our theoretical analysis, to the multi-class classification setting. We then report the results on the CIFAR-10 dataset using a deep residual network.

4.1. GRID SEARCH OPTIMIZATION

We look at the following simple 3-dimensional 3-class classification problem. Consider the point $(x, y)$ where $x = [3, 2, 1]$ and $y = 1$. We focus on the simplest non-trivial function, i.e., the identity mapping, given by $f(x) = x$. Obviously, $f$ correctly assigns $x$ to the first class because the first coordinate is larger than the others. Also, a perturbation of the form $\delta = [-0.501, 0.5, 0]$ with $\|\delta\| = 0.7078$ can flip the label, since $f(x + \delta) = [2.499, 2.5, 1]$ incorrectly predicts the second class. We restrict the attack to the set $\{\delta \in (-0.51, +0.51)^3 \mid \|\delta\| \le 0.7078\}$. We look at every possible attack vector on a grid of size $800 \times 800 \times 800$. We then sort these vectors in descending order of the corresponding loss function, i.e., the cross-entropy loss and its reflected version, and simply count how many of the top-$k$ attack vectors actually induce a label flip. We take this as a measure of how effective the corresponding loss maximization problem is at finding a good attack vector. As we can see in Figure 2, the proposed method of maximizing the reflected cross-entropy loss is a far more effective way of generating attacks than maximizing the cross-entropy loss.
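The arithmetic of this toy example is easy to verify directly. The short NumPy check below uses only the numbers stated above; the variable names are illustrative, and it does not reproduce the full grid search.

```python
import numpy as np

x = np.array([3.0, 2.0, 1.0])            # clean input, true class y = 1
delta = np.array([-0.501, 0.5, 0.0])     # label-flipping perturbation from the text

# f is the identity map, so the predicted class is the index of the largest coordinate
clean_class = int(np.argmax(x)) + 1
attacked_class = int(np.argmax(x + delta)) + 1
delta_norm = float(np.linalg.norm(delta))

print(clean_class, attacked_class, round(delta_norm, 4))
```

Since any label flip from class 1 to class 2 needs the second coordinate to overtake the first, only a thin shell of perturbations near the boundary of the feasible set can succeed, which is consistent with the use of a fine grid in this experiment.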

4.2. BINARY CLASSIFICATION

Experimental Setup. We extract digits 0 and 1 from the MNIST dataset (LeCun et al., 1998), which provides an (almost) separable distribution, consistent with our theoretical setup. The dataset contains 12665 training samples and 2115 test samples. We evaluate the generalization error as well as the robust generalization error of fully-connected two-layer neural networks which are adversarially trained with and without reflecting the loss. The network has 100 hidden nodes with ReLU activations. The outer loop consists of 20 epochs over the training data with batch size equal to 64, randomly shuffled at the beginning of each epoch. The initial learning rate is set to 1, and is decayed by a multiplicative factor of 0.2 every 5 epochs.

We use several benchmark attacks with and without reflecting the loss. The benchmarks include the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), the Basic Iterative Method (BIM) (Kurakin et al., 2017), and the PGD attack with an $\ell_2$ constraint (PGD-2) and with an $\ell_\infty$ constraint (PGD-$\infty$). For each of these attack strategies, we have a corresponding approach that involves reflecting the surrogate loss; we denote the resulting methods as R-FGSM, R-BIM, R-PGD-2, and R-PGD-$\infty$, respectively. The perturbation size for FGSM, PGD-$\infty$, and BIM (and their corresponding reflected versions) is set to $\nu = 0.1$. For PGD-2 and R-PGD-2, we allow a larger perturbation size of $\nu = 2$, as recommended in the Adversarial ML Tutorial. In the inner loop, if the attack is iterative, we use a step-decay scheduler with an initial step size of 10, which decreases the step size every 10 steps by a multiplicative factor of 0.2. In Table 1, we report the standard test accuracy as well as the adversarial test accuracy of the trained models over 10 independent random runs of the experiment. Different rows and columns correspond to different training algorithms and different attack models, respectively.

Table 1 (layout only; the entries are not recoverable here): rows are training algorithms (Trg.), columns are attack models (Atk.), in the order FGSM, R-FGSM, PGD-$\infty$, R-PGD-$\infty$, BIM, R-BIM, PGD-2, R-PGD-2.

Analysis. We make the following observations in Table 1. First, reflecting the loss has a minimal effect on FGSM and BIM attacks, in terms of the robust test accuracy of the trained models. In particular, columns 1 and 2 (similarly, columns 5 and 6) are identical up to the third decimal point. Second, in PGD-2 attacks, reflecting the loss generally yields a stronger attack; note the striking differences in the last two columns between PGD-2 and R-PGD-2. We observe a milder trend for PGD-$\infty$ attacks, where R-PGD-$\infty$ attacks turn out to be only slightly stronger, except for the standard training setting, where reflecting the loss has a huge impact on the robust error. Third, we would like to remark on the performance of adversarially trained models. We can see that reflecting the loss in general helps robustness. In particular, the second and fourth rows (PGD-$\infty$ and PGD-2) are completely dominated by the third and fifth rows (R-PGD-$\infty$ and R-PGD-2), respectively. Finally, it is notable that even though PGD-2 and PGD-$\infty$ are much weaker than their reflected counterparts, they are still competitive in terms of robustness when used in adversarial training. This suggests that finding a "strong" attack is not a necessity for adversarial training to succeed.
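One plausible reading of the first observation is that, in the binary setting with a scalar network output, the input gradients of the cross-entropy loss and of its reflected version are both a negative scalar multiple of the same vector, so they share the same sign pattern; a sign-based step such as FGSM then produces the same perturbation for both losses. The sketch below illustrates this on a hypothetical random network. The names and sizes are assumptions, and the argument applies to deterministic sign steps started from the same point, not to every attack in the table.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradients(delta, y, a, W, alpha):
    # Returns the gradients of ell(y f(delta)) and ell_-(y f(delta)) w.r.t. delta,
    # where ell(z) = log(1 + exp(-z)) and ell_-(z) = -log(1 + exp(z)).
    u = W @ delta
    grad_f = W.T @ (a * np.where(u > 0, 1.0, alpha))
    z = y * (a @ np.maximum(alpha * u, u))
    grad_standard = -sigmoid(-z) * y * grad_f     # ell'(z)   = -sigmoid(-z)
    grad_reflected = -sigmoid(z) * y * grad_f     # ell_-'(z) = -sigmoid(z)
    return grad_standard, grad_reflected

def sign_step(x, grad, nu):
    # single FGSM-style step of size nu in the l_infinity geometry
    return x + nu * np.sign(grad)

# Hypothetical random network and input
rng = np.random.default_rng(0)
m, d, alpha, nu = 8, 5, 0.5, 0.1
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
W = rng.normal(size=(m, d))
x, y = rng.normal(size=d), 1.0

g_std, g_ref = input_gradients(x, y, a, W, alpha)
fgsm_point = sign_step(x, g_std, nu)
r_fgsm_point = sign_step(x, g_ref, nu)
```

The two gradients generally differ in magnitude, which is why attacks that use the raw gradient (such as the $\ell_2$ PGD attack) can behave differently under the two losses.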

4.3. EXTENSION TO MULTI-LABEL SETTING

In binary classification using the logistic loss, in essence, adversarial training finds an attack that minimizes the log-likelihood of the correct class. Using the reflected loss, instead, we aim at maximizing the log-likelihood of the wrong class. In a multiclass classification scenario, there are multiple such wrong classes. Therefore, an important design question is: which wrong class should be targeted in the attack phase? Here, we focus on the most natural choice: we target the wrong class with the highest log-likelihood. This greedy approach is easy to implement, and has minimal computational overhead over standard adversarial training.

We emphasize, though, that the greedy approach described above is sub-optimal, even in a simple linear setting. Intuitively, when the parameters are such that the logits for the true class correlate with the logits for the most likely wrong class, the greedy approach fails. In particular, consider the following 3-class classification problem in $\mathbb{R}^2$. Let $f_W(x) = Wx$, where $W = [2e_1, e_1, 10e_2] \in \mathbb{R}^{3 \times 2}$, so that the class weight vectors are $w_1 = 2e_1$, $w_2 = e_1$, and $w_3 = 10e_2$. Here, $e_i$ denotes the $i$-th standard basis vector. Consider the point $x = [1, 0]$. Clearly, classes 1 and 3 have the highest and the smallest likelihoods, respectively. Given a perturbation size $\|x' - x\| \le 0.3$, the likelihood of the second class will never dominate that of the first class:

$$w_1^\top (x + \delta) = 2e_1^\top (x + \delta) = 2(x_1 + \delta_1) > (x_1 + \delta_1) = e_1^\top (x + \delta) = w_2^\top (x + \delta),$$

where the inequality follows from the fact that $x_1 = 1$ and $|\delta_1| \le 0.3$. Therefore, the greedy approach fails here. In contrast, within the specified perturbation budget, maximizing the likelihood of the third class can indeed find a label-flipping attack. For example, with $\delta = [0, 0.3]$, the point $x' = [1, 0.3]$ will be assigned to the third class, because $w_3^\top x' = 3 > w_1^\top x' = 2 > w_2^\top x' = 1$.

We use adversarial training with and without the reflected loss (denoted by R-PGD and PGD, respectively) to train a PreActResNet (PARN) He et al.
(2016) on the CIFAR-10 dataset (Krizhevsky et al., 2009). In the training phase, we conduct experiments for attack size $\nu \in \{2, 4, 8, 16\}/255$. We build on the PyTorch implementation in Zhang et al. (2021), and we follow their experimental setup, which is described next. We use an SGD optimizer with a momentum parameter of 0.9 and a weight decay parameter of $5 \times 10^{-4}$. We set the batch size to 128 and train each model for 20 epochs. We use a cyclic scheduler which increases the learning rate linearly from 0 to 0.2 within the first 10 epochs and then reduces it back to 0 in the remaining 10 epochs. We report the robust test accuracy (RA) of an adversarially trained model against PGD attacks (Madry et al., 2018) (RA-PGD), where we take 50-step PGD with 10 restarts. We report the results for test-time attack size $\nu = 8/255$. Based on our empirical results, using the (greedy) reflected loss in adversarial training does not significantly impact the standard/robust generalization performance of the learned models.

Table 2: Robust test accuracy (RA) of adversarially trained models with and without reflecting the loss, for different values of the attack size $\nu \in \{2, 4, 8, 16\}/255$ and number of steps in the attack, Steps $\in \{2, 4, 16, 32\}$. We report the results for test-time attack size $\nu = 8/255$; the better performance is highlighted in gray, where the intensity corresponds to the difference in performance.

5. DISCUSSION

We study robust adversarial training of two-layer neural networks as a bi-level optimization problem. We propose reflecting the surrogate loss about the origin in the inner maximization phase when searching for an "optimal" perturbation vector to attack the current model. We give a convergence guarantee for the inner-loop PGD attack and precise iteration complexity results for end-to-end adversarial training, which hold for any width and initialization under a margin assumption. We also provide an empirical study of the effect of reflecting the surrogate loss on real datasets. Next, we list a few natural research directions for future work.

Extension to multiclass setting. In binary classification, which is the focus of this paper, reflecting the loss about the origin provides a concave lower bound for the zero-one loss (see Figure 1). Maximizing the reflected loss then corresponds to maximizing the likelihood of the wrong class. This simple modification enables us to guarantee the convergence of PGD-2 attacks, and yields stronger attacks in our experiments. However, extending this idea to the multiclass setting is not trivial. In particular, the idea of maximizing the likelihood of the wrong class does not trivially generalize to the multiclass setting due to the plurality of wrong classes. Nonetheless, as we show in the experimental section, a naive greedy approach to choosing a wrong class seems to provide competitive performance in terms of standard/adversarial test error. Is there a simple, principled approach to obtain a lower bound for the misclassification error in the multiclass setting? It would be interesting to explore theoretical and empirical aspects of such possible extensions.

Beyond $\beta$-robustness. The notion of $\beta$-robustness is crucial in our analysis. Although we provide robustness guarantees for arbitrarily small positive $\beta$ (see Corollary 3.6), our current analysis does not allow for standard robustness guarantees ($\beta = 0$) except for the linear setting ($\alpha = 1$). At a high level, the main challenge here is to guarantee that the attack can always find an adversarial example, if there exists one, regardless of whether the attack is $\beta$-effective or not. This is particularly challenging to establish for iterative attacks such as PGD, because they can only guarantee getting sufficiently close to an optimal attack in finite time. Consequently, if the optimal attack can just barely flip the sign, the computational time for finding it can grow unboundedly. Providing robust generalization guarantees ($\beta = 0$) is therefore an interesting research direction for future work.

Optimization geometry. In our theoretical results, we focus on PGD-2 attacks, which are based on steepest descent with respect to the $\ell_2$ geometry. In our experiments, we also provide empirical results for steepest descent attacks with respect to the $\ell_\infty$ geometry (including FGSM and BIM) on the reflected loss. We leave the theoretical analysis of such attacks to future work.

Footnote 1: Proofs are deferred to the appendix.

Figure 2: Number of the top-$k$ attack vectors that are optimal, i.e., can induce a label flip, for the cross-entropy loss (blue) and the reflected version (red), for different values of $k$. Left: $k = 10$; Middle: $k = 100$; Right: $k = 1000$.

