ON ACHIEVING OPTIMAL ADVERSARIAL TEST ERROR

Abstract

We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.

1. INTRODUCTION

Imperceptibly altering the input data in a malicious fashion can dramatically decrease the accuracy of neural networks (Szegedy et al., 2014). To defend against such adversarial attacks, maliciously altered training examples can be incorporated into the training process, encouraging robustness in the final neural network. Various attacks used during this adversarial training, such as FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2019), and the C&W attack (Carlini & Wagner, 2016), all optimization-based procedures that search for damaging perturbations around the inputs, have been shown to help with robustness. While many other defenses have been proposed (Guo et al., 2017; Dhillon et al., 2018; Xie et al., 2017), adversarial training is the standard approach (Athalye et al., 2018). Despite many advances, a large gap still persists between the accuracies we are able to achieve on non-adversarial and adversarial test sets. For instance, in Madry et al. (2019), a wide ResNet model was able to achieve 95% accuracy on CIFAR-10 with standard training, but only 46% accuracy on CIFAR-10 images with perturbations arising from PGD bounded by 8/255 in each coordinate, even with the benefit of adversarial training. In this work we seek to better understand the optimal adversarial predictors we are trying to achieve, as well as how adversarial training can help us get there. While several recent works have analyzed properties of optimal adversarial zero-one classifiers (Bhagoji et al., 2019; Pydi & Jog, 2020; Awasthi et al., 2021b), in the present work we build on these analyses to characterize optimal adversarial convex surrogate loss classifiers.
Even though some prior works have suggested shifting away from the use of convex losses in the adversarial setting because they are not adversarially calibrated (Bao et al., 2020; Awasthi et al., 2021a; c; 2022a; b), we show the use of convex losses is not an issue as long as a threshold is appropriately chosen. We will also show that under idealized settings adversarial training can achieve the optimal adversarial test error. In prior work, guarantees on the adversarial test error have been elusive, except in the specialized case of linear regression (Donhauser et al., 2021; Javanmard et al., 2020; Hassani & Javanmard, 2022). Our analysis is in the Neural Tangent Kernel (NTK) or near-initialization regime, where recent work has shown analyzing gradient descent can be more tractable (Jacot et al., 2018; Du et al., 2018). Of many such works our analysis is closest to Ji et al. (2021), which provides a general test error analysis, but for standard (non-adversarial) training. A recent work (Rice et al., 2020) suggests that early stopping helps with adversarial training, as otherwise the network enters a robust overfitting phase in which the adversarial test error quickly rises while the adversarial training error continues to decrease. The present work uses a form of early stopping, and so is in the earlier regime where there is little to no overfitting.

Figure 1: A plot of the (robust/standard) zero-one (training/test) loss throughout training for an adversarially trained network. We ran Rice et al.'s code, using a constant step size of 0.01. The present work is set within the early phase of training, where we can get arbitrarily close to the optimal adversarial test error. In fact, due to technical reasons our analysis will be further restricted to an even earlier portion of this phase, as we remain within the near-initialization/NTK regime.
As noted in prior work, adversarial training, as compared with standard training, seems to have more fragile test-time performance and quickly enters a phase of severe overfitting, but we do not consider this issue here.

1.1. OUR CONTRIBUTIONS

In this work, we prove structural results on the nature of predictors that are close to, or even achieve, optimal adversarial test error. In addition, we prove adversarial training on shallow ReLU networks can get arbitrarily close to the optimal adversarial test error over all measurable functions. This theoretical guarantee requires the use of optimal adversarial attacks during training, meaning we have access to an oracle that gives, within an allowed set of perturbations, the data point which maximizes the loss. We also use early stopping so that we remain in the near-initialization regime and ensure low model complexity. The main technical contributions are as follows.

1. Optimal adversarial predictor structure (Section 3). We prove fundamental results about optimal adversarial predictors by relating the global adversarial convex loss to global adversarial zero-one losses (cf. Lemma 3.1). We show that optimal adversarial convex loss predictors are directly related to optimal adversarial zero-one loss predictors (cf. Lemma 3.2). In addition, for predictors whose adversarial convex loss is almost optimal, we show that when an appropriate threshold is chosen, the adversarial zero-one loss is also almost optimal (cf. Theorem 3.3). This theorem translates bounds on adversarial convex losses, such as those in Section 4, into bounds on adversarial zero-one losses when optimal thresholds are chosen. Using our structural results on optimal adversarial predictors, we prove that continuous functions can get arbitrarily close to the optimal test error given by measurable functions (cf. Lemma 3.4).

2. Adversarial training (Section 4).

Under idealized settings, we show adversarial training leads to optimal adversarial predictors. (a) Generalization bound. We prove a near-initialization generalization bound for adversarial risk (cf. Lemma 4.4). To do so, we provide a Rademacher complexity bound for linearized functions around initialization (cf. Lemma 4.5). The overall bound scales directly with the parameters' distance from initialization, and with $1/\sqrt{n}$, where $n$ is the number of training points. Included in the bound is a perturbation term which depends on the width of the network, and in the worst case scales like $\tau^{1/4}$, where $\tau$ bounds the $\ell_2$ norm of the perturbations. (b) Optimization bound. We show that using an optimal adversarial attack during gradient descent training results in a network which is adversarially robust on the training set, in the sense that it is not much worse than an arbitrary reference network (cf. Lemma 4.6). Comparing to a reference network instead of just ensuring low training error (as in prior work) will be key to obtaining a good generalization analysis, as the optimal adversarial test error may be high. (c) Optimal test error. As the generalization and optimization bounds are both in a near-initialization setting, these two bounds can be used in conjunction. We first bound the test error of our trained network in terms of its training error using the generalization bound, and then apply the optimization bound to compare against the training error of an arbitrary reference network. Another application of our generalization bound then allows us to compare against the test error of an arbitrary reference network (cf. Theorem 4.1). Applying approximation bounds and Lemma 3.4 then lets us bound our trained network's test error in terms of the optimal test error over all measurable functions (cf. Corollary 4.2).

2. RELATED WORK

We highlight several papers in the adversarial and near-initialization communities that are relevant to this work.

Optimal adversarial predictors. Several works study the properties of optimal adversarial predictors when considering the zero-one loss (Bhagoji et al., 2019; Pydi & Jog, 2020; Awasthi et al., 2021b). In this work, we are able to understand optimal adversarial predictors under convex losses in terms of those under zero-one losses, although we will not make use of any properties of optimal zero-one adversarial predictors other than the fact that they exist. Other works study the inherent tradeoff between robust and standard accuracy (Tsipras et al., 2019; Zhang et al., 2019), but these are complementary to this work as we only focus on the adversarial setting.

Convex losses. Several works explore the relationship between convex losses and zero-one losses in the non-adversarial setting (Zhang, 2004; Bartlett et al., 2006). Whereas the optimal predictor in the non-adversarial setting can be understood locally at individual points in the input domain, it is difficult to do so in the adversarial setting due to the possibility of overlapping perturbation sets. As a result, our analysis will be focused on the global structure of optimal adversarial predictors. Convex losses written as an integral over reweighted zero-one losses have appeared before (Savage, 1971; Schervish, 1989; Hernández-Orallo et al., 2012), and we will adapt and make use of this representation in the adversarial setting.

Adversarial surrogate losses. Several works have suggested convex losses are inappropriate in the adversarial setting because they are not calibrated, and instead propose using non-convex surrogate losses (Bao et al., 2020; Awasthi et al., 2021a; c; 2022a; b). In this work, we show that with appropriate thresholding convex losses are calibrated, and so are an appropriate choice for the adversarial setting.

Near-initialization.
Several works utilize the properties of networks in the near-initialization regime to obtain bounds on the test error when using gradient descent (Li & Liang, 2018; Arora et al., 2019; Cao & Gu, 2019; Nitanda et al., 2020; Ji & Telgarsky, 2019; Chen et al., 2019; Ji et al., 2021) . In particular, this paper most directly builds upon the prior work of Ji et al. (2021) , which showed that shallow neural networks could learn to predict arbitrarily well. We adapt their analysis to the adversarial setting.

Adversarial training techniques.

Adversarial training initially used FGSM (Goodfellow et al., 2015) to find adversarial examples. Numerous improvements have since been proposed, such as iterated FGSM (Kurakin et al., 2016) and PGD (Madry et al., 2019), which strive to find even stronger adversarial examples. These works are complementary to ours, because here we assume that we have an optimal adversarial attack, and show that with such an algorithm we can get optimal adversarial test error. Some of these alterations (Zhang et al., 2019; Wang et al., 2021; Miyato et al., 2018; Kannan et al., 2018) do not strictly attempt to find a maximal adversarial attack at every iteration, but instead use some other criteria. However, Rice et al. (2020) propose that many of the advancements to adversarial training since PGD can be matched with early stopping. Our work corroborates the power of early stopping in adversarial training, as we use it in our analysis.

Adversarial training error bounds. Several works are able to show convergence of the adversarial training error. Gao et al. (2019) did so for networks with smooth activations, but their analysis cannot handle constant-sized perturbations as the width increases. Meanwhile, Zhang et al. (2020) use ReLU activations, but impose a strong separability condition on the training data. Our training error bounds use ReLU activations, and in contrast to these previous works simultaneously hold for constant-sized perturbations and general data distributions. However, we note that the ultimate goals of these works differ from ours, as we focus on adversarial test error.

Adversarial generalization bounds. There are several works providing adversarial generalization bounds. They are not tailored to the near-initialization setting, and so they are either looser or require assumptions that are not satisfied here. These approaches include SDP-relaxation-based bounds (Yin et al., 2018), tree transforms (Khim & Loh, 2019), and covering arguments (Tu et al., 2019; Awasthi et al., 2020; Balda et al., 2019). Our generalization bound also uses a covering argument, but pairs this with a near-initialization decoupling. A few works are able to achieve adversarial test error bounds in specialized cases: such bounds have been obtained when the data distribution is linear, both when the model is also linear (Donhauser et al., 2021; Javanmard et al., 2020) and for random features (Hassani & Javanmard, 2022). In the present work, we are able to handle general data distributions.

3. PROPERTIES OF OPTIMAL ADVERSARIAL PREDICTORS

This section builds towards Theorem 3.3, relating zero-one losses to convex surrogate losses.

3.1. SETTING

We consider a distribution $\mathcal{D}$ with associated measure $\mu$ that is Borel measurable over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is compact and $\mathcal{Y} = \{-1, +1\}$. For simplicity, throughout we will take $\mathcal{X} = B_1$ to be the closed Euclidean ball of radius 1 centered at the origin. We allow arbitrary $P(y = 1 \mid x) \in [0, 1]$; that is, the true labels may be noisy. We will consider general adversarial perturbations. For $x \in B_1$, let $P(x)$ be the closed set of allowed perturbations. That is, an adversarial attack is allowed to change the input $x$ to any $x' \in P(x)$. We will impose the natural restrictions that $\emptyset \neq P(x) \subseteq B_1$ for all $x \in B_1$. That is, there always exists at least one perturbed input, and perturbations cannot exceed the natural domain of the problem. In addition, we will assume the set-valued function $P$ is upper hemicontinuous: for any $x \in \mathcal{X}$ and any open set $U \supseteq P(x)$, there exists an open set $V \ni x$ such that $P(V) = \cup_{v \in V} P(v) \subseteq U$. These assumptions hold for the commonly used $\ell_\infty$ perturbations, as well as many other commonly used perturbation sets (Yang et al., 2020). As an example, in the above notation we would write $\ell_\infty$ perturbations as $P(x) = \{x' \in B_1 : \|x' - x\|_\infty \le \tau\}$.

Let $f : B_1 \to \mathbb{R}$ be a predictor. We will let $\ell_c$ be any nonincreasing convex loss with continuous derivative. The adversarial loss is $\ell_A(x, y, f) = \max_{x' \in P(x)} \ell_c(y f(x'))$, and the adversarial convex risk is $R_A(f) = \int \ell_A(x, y, f) \, d\mu(x, y)$. For convenience, define $f^+(x) = \sup_{x' \in P(x)} f(x')$ and $f^-(x) = \inf_{x' \in P(x)} f(x')$, the worst-case values for perturbations in $P(x)$ when $y = -1$ and $y = +1$, respectively. Then we can write the adversarial zero-one risk as
$$R_{AZ}(f) := \int \Big( 1[y = +1]\, 1[f^-(x) < 0] + 1[y = -1]\, 1[f^+(x) \ge 0] \Big) \, d\mu(x, y).$$
To relate the adversarial convex and zero-one risks, we will use reweighted adversarial zero-one risks $R^t_{AZ}(f)$ as an intermediate quantity, defined as follows. The adversarial zero-one risk when the $+1$ labels have weight $(-\ell_c'(t))$ and the $-1$ labels have weight $(-\ell_c'(-t))$ is
$$R^t_{AZ}(f) := \int \Big( 1[y = +1]\, 1[f^-(x) < 0] (-\ell_c'(t)) + 1[y = -1]\, 1[f^+(x) \ge 0] (-\ell_c'(-t)) \Big) \, d\mu(x, y).$$
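As a concrete illustration of these definitions, the following sketch (a hypothetical one-dimensional predictor and toy data, with a grid search standing in for the exact sup/inf over $P(x)$) computes $f^-$, $f^+$, and the empirical adversarial zero-one risk under $\ell_\infty$ perturbations:

```python
import numpy as np

# Illustrative sketch (hypothetical predictor and data): 1-d inputs in
# [-1, 1], l-infinity perturbations P(x) = {x' in [-1, 1] : |x' - x| <= tau},
# with the sup/inf over P(x) approximated by a grid search.
tau = 0.2
f = lambda x: x - 0.1  # a toy predictor

def worst_case(x, k=201):
    # Returns (f^-(x), f^+(x)), the worst-case predictor values over P(x).
    grid = np.linspace(max(-1.0, x - tau), min(1.0, x + tau), k)
    vals = f(grid)
    return vals.min(), vals.max()

# Empirical adversarial zero-one risk R_AZ on a tiny labeled sample:
# a y = +1 point errs if f^-(x) < 0, and a y = -1 point errs if f^+(x) >= 0.
xs = [-0.6, -0.05, 0.35, 0.8]
ys = [-1, 1, 1, -1]
errs = []
for x, y in zip(xs, ys):
    f_minus, f_plus = worst_case(x)
    errs.append(f_minus < 0 if y == 1 else f_plus >= 0)
r_az = float(np.mean(errs))
print(r_az)  # 0.5: two of the four points can be made to err
```

Note how the adversary evaluates $f$ at its worst point in $P(x)$ separately for each label sign, which is exactly the asymmetry captured by $f^-$ and $f^+$ above.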

3.2. RESULTS

We present a number of structural properties of optimal adversarial predictors. The key insight will be to write the global adversarial convex loss in terms of global adversarial zero-one losses, as follows.

Lemma 3.1. For any predictor $f$, $R_A(f) = \int_{-\infty}^{\infty} R^t_{AZ}(f - t) \, dt$.

$R^t_{AZ}(f - t)$ is an intuitive quantity to consider for the following reason. In the non-adversarial setting, a predictor outputting a value of $f(x)$ corresponds to a prediction that a fraction $\frac{-\ell_c'(-f(x))}{(-\ell_c'(f(x))) + (-\ell_c'(-f(x)))}$ of the labels are $+1$ at that point. If $+1$ labels are given weight $(-\ell_c'(t))$ and $-1$ labels weight $(-\ell_c'(-t))$, then $f(x)$ would predict at least half the labels being $+1$ if and only if $\frac{-\ell_c'(-f(x))}{(-\ell_c'(f(x))) + (-\ell_c'(-f(x)))} \ge \frac{-\ell_c'(-t)}{(-\ell_c'(t)) + (-\ell_c'(-t))}$. As a result, the optimal non-adversarial predictor is also an optimal non-adversarial zero-one classifier at each threshold $t$ for the corresponding reweighting of $+1$ and $-1$ labels. Even though we won't be able to rely on the same local analysis as in our intuition above, it turns out the same thing is globally true in the adversarial setting.

Lemma 3.2. There exists a predictor $g : \mathcal{X} \to \mathbb{R}$ such that $R_A(g)$ is minimal. For any such predictor, $R^t_{AZ}(g - t) = \inf_f R^t_{AZ}(f)$ for all $t \in \mathbb{R}$.

Note that the predictor in Lemma 3.2 is not necessarily unique. For instance, the predictor's value at a particular point $x$ might not matter to the adversarial risk because there are no points in the underlying distribution whose perturbation sets include $x$. To prove Lemma 3.2, we will use optimal adversarial zero-one predictors to construct an optimal adversarial convex loss predictor $g$. Conversely, we can also use an optimal adversarial convex loss predictor to construct optimal adversarial zero-one predictors. In general, the predictors we find will not exactly have the optimal adversarial convex loss. For these predictors we have the following error gap.

Theorem 3.3.
Suppose there exist $s \ge 1$ and $c > 0$ such that $G_{\ell_c}(p) := \ell_c(0) - \inf_{z \in \mathbb{R}} \big( p \ell_c(z) + (1-p) \ell_c(-z) \big) \ge \left( \frac{|2p - 1|}{c} \right)^{s}$. Then for any predictor $g$,
$$\inf_t R_{AZ}(g - t) - \inf_{h \text{ meas.}} R_{AZ}(h) \le 3^{1 - \frac{1}{s}}\, c \left( 2 \Big( R_A(g) - \inf_{h \text{ meas.}} R_A(h) \Big) \right)^{1/s}.$$
The idea of using $G_{\ell_c}(p)$ comes from Zhang (2004). When we explicitly set $\ell_c$ to be the logistic loss in Section 4, we can choose $s = 2$ and $c = \sqrt{2}$ (Zhang, 2004). While similar bounds exist in the non-adversarial case with $R_{AZ}(g)$ instead of $\inf_t R_{AZ}(g - t)$ (Zhang, 2004; Bartlett et al., 2006), the analogue with $R_{AZ}(g - t) - \inf_{h \text{ meas.}} R_{AZ}(h)$ appearing on the left-hand side is false in the adversarial setting, which can be seen as follows. Consider a uniform distribution of $(x, y)$ pairs over $\{(\pm 1, \pm 1)\}$, and suppose $\{-1, +1\} \subseteq P(-1) \cap P(+1)$. Then the optimal adversarial convex risk is $\ell_c(0)$, attained by $f(x) = 0$, and the optimal adversarial zero-one risk is $1/2$, attained by any $f$ with $\operatorname{sgn}(f(x)) = +1$. However, for $\epsilon > 0$ the predictor $g_\epsilon$ with $g_\epsilon(+1) = \epsilon$, $g_\epsilon(-1) = -\epsilon$, and $g_\epsilon(x) \in [-\epsilon, \epsilon]$ everywhere else gives adversarial convex risk $\frac{1}{2} \big( \ell_c(\epsilon) + \ell_c(-\epsilon) \big)$ and adversarial zero-one risk $1$. This results in
$$R_{AZ}(g_\epsilon) - \inf_{h \text{ meas.}} R_{AZ}(h) \xrightarrow{\ \epsilon \to 0\ } 1/2, \qquad R_A(g_\epsilon) - \inf_{h \text{ meas.}} R_A(h) \xrightarrow{\ \epsilon \to 0\ } 0,$$
demonstrating the necessity of some change compared to the analogous non-adversarial bound. As this example shows, getting arbitrarily close to the optimal adversarial convex risk does not guarantee getting arbitrarily close to the optimal adversarial zero-one risk. This inadequacy of convex losses in the adversarial setting has been noted in prior work (Bao et al., 2020; Awasthi et al., 2021a; c; 2022a; b), leading those works to suggest the use of non-convex losses. However, as Theorem 3.3 shows, we can circumvent this inadequacy if we allow the choice of an optimal (possibly nonzero) threshold. While we have compared against optimal measurable predictors here, in Section 4 we will use continuous predictors.
This presents a potential problem, as there may be a gap between the adversarial risks achievable by measurable and continuous predictors. It turns out this is not the case, as the following lemma shows.

Lemma 3.4. For the adversarial risk, comparing against all continuous functions is equivalent to comparing against all measurable functions. That is, $\inf_{g \text{ cts.}} R_A(g) = \inf_{h \text{ meas.}} R_A(h)$.

In the next section, we will use Lemma 3.4 to compare trained continuous predictors against all measurable functions.
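Before moving on, the integral identity of Lemma 3.1 can be checked numerically for the logistic loss. The sketch below (toy worst-case margins, not data from the paper) compares the empirical adversarial convex risk against a discretized threshold integral of the reweighted zero-one risks:

```python
import numpy as np

# Numerical sanity check of Lemma 3.1 with the logistic loss
# ell_c(z) = ln(1 + e^{-z}), for which -ell_c'(z) = sigmoid(-z) and
# ell_c vanishes at +infinity. The worst-case margins are hypothetical.
logistic = lambda z: np.log1p(np.exp(-z))
neg_deriv = lambda z: 1.0 / (1.0 + np.exp(z))   # -ell_c'(z)

f_minus = np.array([0.5, -1.2, 2.0])   # f^-(x) for the y = +1 points
f_plus = np.array([-0.3, 1.7])         # f^+(x) for the y = -1 points
n = len(f_minus) + len(f_plus)

# Left side: R_A(f), i.e. the mean of ell_c(f^-) over +1 points
# and ell_c(-f^+) over -1 points.
lhs = (logistic(f_minus).sum() + logistic(-f_plus).sum()) / n

# Right side: integral over thresholds t of R^t_AZ(f - t), where
# (f - t)^-(x) < 0 iff f^-(x) < t, and (f - t)^+(x) >= 0 iff f^+(x) >= t.
ts = np.linspace(-30.0, 30.0, 200001)
pos = neg_deriv(ts)[:, None] * (f_minus[None, :] < ts[:, None])
neg = neg_deriv(-ts)[:, None] * (f_plus[None, :] >= ts[:, None])
rhs = np.trapz((pos.sum(axis=1) + neg.sum(axis=1)) / n, ts)

print(abs(lhs - rhs))  # small discretization error
```

The agreement follows from $\int_{f^-}^{\infty} (-\ell_c'(t))\, dt = \ell_c(f^-)$ and the analogous identity for the $-1$ labels, which is exactly the per-point content of Lemma 3.1.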

4. ADVERSARIAL TRAINING

Theorem 3.3 shows that with optimally chosen thresholds, to achieve nearly optimal adversarial zero-one risk it suffices to achieve nearly optimal adversarial convex risk. However, it is unclear how to find such a predictor. In this section we remedy this issue, proving bounds on the adversarial convex risk when adversarial training is used on shallow ReLU networks. In particular, we show with appropriately chosen parameters we can achieve adversarial convex risk that is arbitrarily close to optimal. Unlike Section 3, our results here will be specific to the logistic loss.

4.1. SETTING

Training points $(x_k, y_k)_{k=1}^n$ are drawn from the distribution $\mathcal{D}$. Note that $\|x_k\| \le 1$, where by default we use $\|\cdot\|$ to denote the $\ell_2$ norm. We will let $\tau = \sup\{\|x' - x\|_2 : x' \in P(x), x \in B_1\}$ be the maximum $\ell_2$ norm of the adversarial perturbations. By our restrictions on the perturbation sets, we have $0 \le \tau \le 2$. Throughout this section we will set $\ell_c$ to be the logistic loss $\ell(z) = \ln(1 + e^{-z})$. The empirical adversarial loss and risk are $\ell_{A,k}(f) = \ell_A(x_k, y_k, f)$ and $\widehat{R}_A(f) = \frac{1}{n} \sum_{k=1}^n \ell_{A,k}(f)$.

The predictors will be shallow ReLU networks of the form $f(W; x) = \frac{\rho}{\sqrt{m}} \sum_{j=1}^m a_j \sigma(w_j^T x)$, where $W$ is an $m \times d$ matrix, $w_j^T$ is the $j$th row of $W$, $\sigma(z) = \max(0, z)$ is the ReLU, the $a_j \in \{\pm 1\}$ are initialized uniformly at random, and $\rho$ is a temperature parameter that we can set. Out of all of these parameters, only $W$ will be trained. The initial parameters $W_0$ will have entries initialized from standard Gaussians with variance 1, which we then train to get future iterates $W_i$. We will frequently use the features of $W_i$ for other parameters $W$; that is, we will consider $f^{(i)}(W; x) = \langle \nabla f(W_i; x), W \rangle$, where the gradient is taken with respect to the matrix, not the input. Note that $f$ is not differentiable at all points. When this is the case, by $\nabla f$ we mean some choice of $\nabla f \in \partial f$, the Clarke differential. For notational convenience we define $R_A(W) = R_A(f(W; \cdot))$, $\widehat{R}_A(W) = \widehat{R}_A(f(W; \cdot))$, $R_A^{(i)}(W) = R_A(f^{(i)}(W; \cdot))$, and $\widehat{R}_A^{(i)}(W) = \widehat{R}_A(f^{(i)}(W; \cdot))$.

Our adversarial training will be as follows. To get the next iterate $W_{i+1}$ from $W_i$ for $i \ge 0$, we will use gradient descent with $W_{i+1} = W_i - \eta \nabla \widehat{R}_A(W_i)$. Normally adversarial training occurs in two steps:
1. For each $k$, find $x'_k \in P(x_k)$ such that $\ell(y_k f(W_i; x'_k))$ is maximized.
2. Perform a gradient descent update using the adversarial inputs found in the previous step: $W_{i+1} = W_i - \eta \nabla \big( \frac{1}{n} \sum_{k=1}^n \ell(y_k f(W_i; x'_k)) \big)$.
Step 1 is an adversarial attack, in practice done with a method such as PGD that does not necessarily find an optimal attack. However, we will assume the idealized scenario where we are able to find an optimal attack. Our goal will be to find a network that has low risk with respect to optimal adversarial attacks. That is, we want to find f such that R A (f ) is as small as possible.
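The two-step adversarial training loop above can be sketched as follows. This is a minimal illustration, not the paper's experimental code: the optimal-attack oracle of step 1 is approximated by a brute-force search over fixed candidate directions in an $\ell_2$ ball, all sizes and data are hypothetical, and clipping $P(x)$ to the unit ball is omitted for brevity.

```python
import numpy as np

# Minimal sketch of the two-step adversarial training loop, for the
# shallow network f(W; x) = (rho / sqrt(m)) * sum_j a_j * relu(w_j^T x).
rng = np.random.default_rng(0)
d, m, n, rho, eta, tau, T = 2, 64, 32, 1.0, 0.1, 0.1, 100

X = rng.uniform(-1.0, 1.0, (n, d)) / np.sqrt(d)  # inputs inside the unit ball
y = np.where(X[:, 0] >= 0, 1.0, -1.0)            # labels in {-1, +1}
a = rng.choice([-1.0, 1.0], m)                   # fixed output signs a_j
W = rng.normal(size=(m, d))                      # W_0: standard Gaussian entries

dirs = rng.normal(size=(128, d))                 # fixed attack candidates
dirs = tau * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

def f_batch(W, Xb):
    # Network outputs for a (batch, d) array of inputs.
    return rho / np.sqrt(m) * np.maximum(Xb @ W.T, 0.0) @ a

def attack(W, x, y_k):
    # Step 1 (approximate): the candidate with the smallest margin has the
    # largest logistic loss, since the loss is decreasing in the margin.
    cands = np.vstack([x[None, :], x[None, :] + dirs])
    return cands[np.argmin(y_k * f_batch(W, cands))]

def adv_risk(W):
    advs = np.array([attack(W, X[k], y[k]) for k in range(n)])
    return float(np.mean(np.log1p(np.exp(-y * f_batch(W, advs)))))

risk_before = adv_risk(W)
for _ in range(T):
    grad = np.zeros_like(W)
    for k in range(n):
        x_adv = attack(W, X[k], y[k])           # step 1: adversarial example
        z = y[k] * f_batch(W, x_adv[None, :])[0]
        coef = -y[k] / (1.0 + np.exp(z))        # derivative of ln(1+e^{-z}) in f
        act = (W @ x_adv > 0.0).astype(float)   # ReLU subgradient per unit
        grad += coef * (rho / np.sqrt(m)) * (a * act)[:, None] * x_adv[None, :]
    W = W - eta * grad / n                      # step 2: gradient step
risk_after = adv_risk(W)
print(risk_before, risk_after)
```

On this toy problem the empirical adversarial risk decreases over training; with an exact oracle in place of the candidate search, step 1 would return a true maximizer over $P(x_k)$, matching the idealized setting analyzed below.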

4.2. RESULTS

Our adversarial training theorem will compare the risk we obtain to that of arbitrary reference parameters $Z \in \mathbb{R}^{m \times d}$, which we will choose appropriately when we apply this theorem in Corollaries 4.2 and 4.3. To get near-optimal risk, we will apply our early stopping criterion of running gradient descent until $\|W_i - W_0\| > 2R_Z$, where $R_Z \ge \max\{1, \eta\rho, \|Z - W_0\|\}$ is a quantity we assume knowledge of, at which point we will stop. It is possible this may never occur; in that case, we will stop at some time step $t$, which is a parameter we are free to choose. We just need to choose $t$ sufficiently large to allow for enough training to occur. The iterate we choose as our final model will be the one with the best training risk.

Theorem 4.1. Let $m \ge \ln(emd)$ and $\eta\rho^2 < 2$. For any $Z \in \mathbb{R}^{m \times d}$, let $R_Z \ge \max\{1, \eta\rho, \|Z - W_0\|\}$ and $W_{\le t} = \arg\min\{\widehat{R}_A(W_i) : 0 \le i \le t,\ \|W_j - W_0\| \le 2R_Z\ \forall j \le i\}$. Then with probability at least $1 - 12\delta$,
$$R_A(W_{\le t}) \le \frac{2}{2 - \eta\rho^2} \widehat{R}_A^{(0)}(Z) + O\left( \frac{1}{2 - \eta\rho^2} \left( \frac{R_Z^2}{\eta t} + \frac{\rho R_Z (d + \sqrt{\tau m})}{\sqrt{n}} + \frac{\rho R_Z^{4/3} d^{1/3}}{m^{1/6}} \right) \right),$$
where $O$ suppresses $\ln(n)$, $\ln(m)$, $\ln(d)$, $\ln(1/\delta)$, $\ln(1/\tau)$ terms.

In Corollary 4.2 we will show we can set parameters so that all error terms are arbitrarily small. The early stopping criterion $\|W_i - W_0\| > 2R_Z$ will allow us to get a good generalization bound, as we will show that all iterates then have $\|W_i - W_0\| \le 2R_Z + \eta\rho$. When the early stopping criterion is met, we will be able to get a good optimization bound; when it is not, choosing $t$ large enough allows us to do so. It may be concerning that we require knowledge of $R_Z$, as otherwise the algorithm changes depending on which reference parameters we use. However, in practice we could instead use a validation set, and rather than choosing the model with the best training risk, choose the model with the best validation risk, which would remove the need for knowing $R_Z$.
Ultimately, our assumption of knowing $R_Z$ is there to simplify the analysis and highlight other aspects of the problem, and we leave dropping this assumption to future work. To compare against all continuous functions, we will use the universal approximation properties of infinite-width neural networks (Barron, 1993). We use a specific form that adapts Barron's arguments to give an estimate of the complexity of the infinite-width neural network (Ji et al., 2019). We consider infinite-width networks of the form $f(x; U_\infty) := \int \langle U_\infty(v), x 1[v^T x \ge 0] \rangle \, dN(v)$, where $U_\infty : \mathbb{R}^d \to \mathbb{R}^d$ parameterizes the network, and $N$ is a standard $d$-dimensional Gaussian distribution. We will choose an infinite-width network $f(\cdot; U^\epsilon_\infty)$ with a finite complexity measure $\sup_x \|U^\epsilon_\infty(x)\|$ that is $\epsilon$-close to a near-optimal continuous function. Letting $R_\epsilon := \max\{\rho, \eta\rho^2, \sup_x \|U^\epsilon_\infty(x)\|\}$, with high probability we can extract a finite-width network $Z$, close to the infinite-width network, whose distance from $W_0$ is at most $R_Z = R_\epsilon / \rho$. Note that our assumed knowledge of $R_Z$ is equivalent to assuming knowledge of $R_\epsilon$. From Lemma 3.4 we know continuous functions can get arbitrarily close to the optimal adversarial risk over all measurable functions, so we can instead compare against all measurable functions.

We immediately have an issue with our formulation: the comparator in Theorem 4.1 is homogeneous. To have any hope of predicting general functions, we need biases. We simulate biases by adding a dummy dimension to the input and then normalizing; that is, we transform the input $x \mapsto \frac{1}{\sqrt{2}}(x; 1)$. The dummy dimension, while part of the input to the network, is not truly a part of the input, and so we do not allow adversarial perturbations to affect it.

Corollary 4.2. Let $\epsilon > 0$. Then there exists a finite $R_\epsilon \ge \max\{\rho, \eta\rho^2\}$, representing the complexity measure of an infinite-width network that is within $\epsilon$ of the optimal adversarial risk, such that with probability at least $1 - \delta$, setting $\rho = \Theta(\epsilon)$, $\eta = \Theta(1/\epsilon)$, $t = \Omega\big( \frac{R_\epsilon^2}{\epsilon^2} \big)$, $m = \Omega\big( \frac{R_\epsilon^8}{\epsilon^6 \rho^2} \big)$, with $n$ satisfying $n = \Omega\big( \max(1, \tau m) R_\epsilon^2 / \epsilon^2 \big)$, where $\Theta, \Omega$ suppress $\ln(R_\epsilon)$, $\ln(1/\epsilon)$, $\ln(1/\delta)$ terms, we have
$$R_A(W_{\le t}) \le \inf\{R_A(g) : g \text{ measurable}\} + O(\epsilon).$$

Once again, it may be concerning that we require knowledge of the complexity of the data distribution for Corollary 4.2. However, the following result demonstrates that we are effectively guaranteed to converge to the optimal risk as $n \to \infty$, as long as the parameters are set appropriately.

Corollary 4.3. If we set $\rho^{(n)} = n^{-1/6}$, $\eta^{(n)} = n^{1/6}$, $t^{(n)} = n$, $m^{(n)} = n^{1/2}$, then $R_A(W^{(n)}_{\le t}) \xrightarrow{\ n \to \infty\ } \inf\{R_A(g) : g \text{ measurable}\}$ almost surely.
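The early-stopped model selection rule used throughout this section (take the best-training-risk iterate among those within distance $2R_Z$ of initialization) can be sketched as follows; the function name and the toy inputs are illustrative, not from the paper's code:

```python
import numpy as np

# Sketch of the iterate selection rule W_{<=t}: scan iterates W_0, ..., W_t,
# stop at the first one leaving the ball of radius 2*R_Z around W_0, and
# return the earlier admissible iterate with the smallest training risk.
def select_iterate(iterates, risks, W0, R_Z):
    best, best_risk = None, np.inf
    for W, r in zip(iterates, risks):
        if np.linalg.norm(W - W0) > 2 * R_Z:
            break              # early stopping criterion triggered
        if r < best_risk:
            best, best_risk = W, r
    return best, best_risk

# Toy usage: three scalar "iterates" at distances 0.5, 1.5, 3.0 from W0 = 0,
# with R_Z = 1; the third violates the radius constraint and is excluded,
# so its low risk of 0.1 is never considered.
W0 = np.zeros(1)
iterates = [np.array([0.5]), np.array([1.5]), np.array([3.0])]
risks = [0.9, 0.4, 0.1]
W_best, r_best = select_iterate(iterates, risks, W0, R_Z=1.0)
print(r_best)  # 0.4: the best risk among the admissible iterates
```

In practice, as noted above, the training risks here could be replaced by validation risks, which would remove the need to know $R_Z$.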

4.3. PROOF SKETCH OF THEOREM 4.1

The proof has two main components: a generalization bound and an optimization bound. We describe both in further detail below.

4.3.1. GENERALIZATION

Let $\bar{\tau} := \sqrt{2\tau} + \left( \frac{32\tau \ln(n/\delta)}{m} \right)^{1/4} + \tau\sqrt{2}$. We prove a new near-initialization generalization bound.

Lemma 4.4. If $B \ge 1$ and $m \ge \ln(emd)$, then with probability at least $1 - 5\delta$,
$$\sup_{\|V - W_0\| \le B} \left( R_A^{(0)}(V) - \widehat{R}_A^{(0)}(V) \right) \le \frac{2\rho B}{\sqrt{n}} + \frac{2\rho B \bar{\tau}}{\sqrt{n}} \sqrt{1 + m \ln \frac{mn}{\bar{\tau}^2}} + \frac{77 \rho B d \ln^{3/2}(4em^2d^3/\delta)}{\sqrt{n}}.$$

The key term to focus on is the middle term. Note that $\bar{\tau}$ is a quantity that grows with the perturbation radius $\tau$, and importantly is $0$ when $\tau$ is $0$. When we are in the non-adversarial setting ($\tau = 0$), the middle term is dropped and we recover a Rademacher bound qualitatively similar to Lemma A.8 in Ji et al. (2021). In the general setting when $\tau$ is a constant, so is $\bar{\tau}$, resulting in an additional dependence on the width of the network. Lemma 4.4 will easily follow from the following Rademacher complexity bound.

Lemma 4.5. Define $\mathcal{V} = \{V : \|V - W_0\| \le B\}$, and let $\mathcal{F} = \{x_k \mapsto \min_{x'_k \in P(x_k)} y_k \langle \nabla f(x'_k; W_0), V \rangle : V \in \mathcal{V}\}$. Then with probability at least $1 - 3\delta$,
$$\operatorname{Rad}(\mathcal{F}) \le \frac{\rho B}{\sqrt{n}} + \frac{\rho B \bar{\tau} \sqrt{1 + m \ln \frac{mn}{\bar{\tau}^2}}}{\sqrt{n}}.$$

In the setting of linear predictors with normed ball perturbations, an exact characterization of the perturbations can be obtained, leading to some of the Rademacher bounds in Yin et al. (2018); Khim & Loh (2019); Awasthi et al. (2020). In our setting, it is unclear how to get a similar exact characterization. Instead, we prove Lemma 4.5 by decomposing the adversarially perturbed network into its nonperturbed term and the value added by the perturbation. The nonperturbed term can then be handled by a prior Rademacher bound for standard networks. The difficult part is in bounding the complexity added by the perturbation. A naive argument would worst-case the adversarial perturbations, resulting in a perturbation term that scales linearly with $m$ and does not decrease with $n$. Obtaining the Rademacher bound that appears here requires a better understanding of the adversarial perturbations.
In comparison to simple linear models, we use a more sophisticated approach, utilizing our understanding of linearized models. Because we are using the features of the initial network, for a particular perturbation at a particular point, the same features are used across all networks. As the parameter distance between all networks is close, the same perturbation achieves similar effects. We also control the change in features caused by the perturbations, which is given by Lemma A.9. Having bounded the effect of the perturbation term, we then apply a covering argument over the parameter space to get the Rademacher bound.

4.3.2. OPTIMIZATION

Our optimization bound is as follows.

Lemma 4.6. Let $R_{\mathrm{gd}} := 2R_Z + \eta\rho$, with $\eta\rho^2 < 2$. Then with probability at least $1 - \delta$,
$$\widehat{R}_A(W_{\le t}) \le \frac{2}{2 - \eta\rho^2} \widehat{R}_A^{(0)}(Z) + \frac{1}{t} \cdot \frac{R_Z^2}{2\eta - \eta^2\rho^2} + \frac{1}{m^{1/6}} \cdot \frac{2}{2 - \eta\rho^2} \left( 52 \rho R_{\mathrm{gd}}^{4/3} d^{1/4} \ln(ed^2m^2/\delta)^{1/4} + \frac{178 \rho R_{\mathrm{gd}} d^{1/3} \ln(ed^3m^2/\delta)^{1/3}}{m^{1/12}} \right) + \frac{5\rho \sqrt{d \ln(1/\delta)}}{m^{5/6}}.$$

The main difference between the adversarial case here and the non-adversarial case in Ji et al. (2021) is in appropriately bounding $\|\nabla \widehat{R}_A(W)\|$, which is done in Lemma A.10. In order to do so, we utilize a relation between adversarial and non-adversarial losses. The rest of the adversarial optimization proofs follow similarly to the non-adversarial case, although simplified because we assume that $R_Z$ is known.

5. DISCUSSION AND OPEN PROBLEMS

This paper leaves open many potential avenues for future work, several of which we highlight below.

Early stopping. Early stopping played a key technical role in our proofs, allowing us to take advantage of properties that hold in a near-initialization regime, as well as to achieve a good generalization bound. However, is early stopping necessary? The necessity of early stopping is suggested by the phenomenon of robust overfitting (Rice et al., 2020), in which the adversarial training error continues to decrease, but the adversarial test error dramatically increases after a certain point. Early stopping is one method that allows adversarial training to avoid this phase, and achieve better adversarial test error as a result. However, it should be noted that early stopping is necessary in this work to stay in the near-initialization regime, which likely ends much earlier than the robust overfitting phase begins.

Underparameterization. Our generalization bound increases with the width. As a result, to get our generalization bound to converge to 0 we required the width to be sublinear in the number of training points. Is it possible to remove this dependence on width? Recent works suggest that some sort of dependence on the width may be necessary. In the setting of linear regression, overparameterization has been shown to hurt adversarial generalization for specific types of networks (Hassani & Javanmard, 2022; Javanmard et al., 2020; Donhauser et al., 2021). Note that in this simple data setting a simple network can fit the data, so underparameterization does not hurt the approximation capabilities of the network. However, in a more complicated data setting, Madry et al. (2019) note that increased width helps with the adversarial test error. One explanation is that networks need to be sufficiently large to approximate an optimal robust predictor, which may be more complicated than optimal nonrobust predictors.
Indeed, they note that smaller networks, under adversarial training, would converge to the trivial classifier of predicting a single class. Interestingly, they also note that width helps more when the adversarial perturbation radius is small. This observation is reflected in our generalization bound, since the dependence on width is tied to the perturbation term. If the perturbation radius is small, then a large width is less harmful to our generalization bound. We propose further empirical exploration into how the width affects generalization and approximation error, and how other factors influence this relationship. This includes investigating whether a larger perturbation radius causes larger widths to be more harmful to the generalization bound, the effect of early stopping on these relationships, and how the approximation error for a given width changes with the perturbation radius.

Using weaker attacks. Within our proof we used the assumption that we had access to an optimal adversarial attack. In turn, we got a guarantee against optimal adversarial attacks. However, in practice we do not know of a computationally efficient algorithm for generating optimal attacks. Could we prove a similar theorem, obtaining a guarantee against optimal attacks, while using a weaker attack such as PGD in our training algorithm? If this were the case, then we would be using a weaker attack to successfully defend against a stronger attack. Perhaps this is too much to ask for; could we instead get a guarantee against PGD attacks?

Transferability to other settings. We have only considered the binary classification setting here. A natural extension would be to consider the same questions in the multiclass setting, where there are three or more possible labels. In addition, our results in Section 4 only hold for shallow ReLU networks in a near-initialization regime. To what extent do the relationships observed here transfer to other settings, such as training state-of-the-art networks?
For instance, does excessive overparameterization beyond the need to capture the complexity of the data hurt adversarial robustness in practice? Recall the definition of R t AZ (f ): R t AZ (f ) := 1[y = +1]1[f -(x) < 0](-ℓ ′ c (t)) + 1[y = -1]1[f + (x) ≥ 0](-ℓ ′ c (-t)) dµ(x, y). The following lemmas are various forms of continuity for R t AZ (f ), making use of the continuity of ℓ ′ c . Lemma A.1. For any infinite family of predictors g s : X → {-1, +1} indexed by s ∈ R, and any t ∈ R, lim s→t |R t AZ (g s ) -R s AZ (g s )| = 0. Proof. By the Dominated Convergence Theorem (Folland, 1999, Theorem 2.24) , lim s→t |R t AZ (g s ) -R s AZ (g s )| = lim s→t 1[y = +1]1[g - s (x) < 0] (-ℓ ′ c (t)) -(-ℓ ′ c (s)) + 1[y = -1]1[g + s (x) ≥ 0] (-ℓ ′ c (-t)) -(-ℓ ′ c (-s)) dµ(x, y) ≤ lim s→t 1[y = +1] (-ℓ ′ c (t)) -(-ℓ ′ c (s)) + 1[y = -1] (-ℓ ′ c (-t)) -(-ℓ ′ c (-s)) dµ(x, y) ≤ 1[y = +1] lim s→t (-ℓ ′ c (t)) -(-ℓ ′ c (s)) + 1[y = -1] lim s→t (-ℓ ′ c (-t)) -(-ℓ ′ c (-s)) dµ(x, y) = 0. The following lemma allows us to switch the order of the limit and adversarial risk. Lemma A.2. For any predictors g s : X → {-1, +1} indexed by s ∈ R, and any t ∈ R, lim s→t R s AZ (g s ) = R t AZ lim s→t g s when lim s→t g s exists. In particular, when g s = f for all s ∈ R, then lim s→t R s AZ (f ) = R t AZ (f ). Proof. By the Dominated Convergence Theorem (Folland, 1999, Theorem 2.24) , lim s→t R s AZ (g s ) = lim s→t 1[y = +1]1[g - s (x) < 0](-ℓ ′ c (s)) + 1[y = -1]1[g + s (x) ≥ 0](-ℓ ′ c (-s)) dµ(x, y) = lim s→t 1[y = +1]1[g - s (x) < 0](-ℓ ′ c (s)) + 1[y = -1]1[g + s (x) ≥ 0](-ℓ ′ c (-s)) dµ(x, y) = 1[y = +1]1 lim s→t g - s (x) < 0 (-ℓ ′ c (t)) + 1[y = -1]1 lim s→t g + s (x) ≥ 0 (-ℓ ′ c (-t)) dµ(x, y) = R t AZ lim s→t g s . For the rest of this section, let f r : X → {-1, +1} be optimal adversarial zero-one predictors when the -1 labels are given weight (-ℓ ′ c (-r)) and the +1 labels are given weight (-ℓ ′ c (r)) (f r minimizes R r AZ (f r )), for all r ∈ R. 
These predictors exist by Theorem 1 of Bhagoji et al. (2019), although they may not be unique. The following lemma states that R^r_AZ(f_r) is continuous as a function of r.

Lemma A.3. For any t ∈ R, lim_{s→t} R^s_AZ(f_s) = R^t_AZ(f_t).

Proof. By Lemma A.1,

lim sup_{s→t} R^s_AZ(f_s) - R^t_AZ(f_t) = lim sup_{s→t} R^s_AZ(f_s) - R^s_AZ(f_t) ≤ 0,
lim inf_{s→t} R^s_AZ(f_s) - R^t_AZ(f_t) = lim inf_{s→t} R^t_AZ(f_s) - R^t_AZ(f_t) ≥ 0.

Together, we get lim_{s→t} R^s_AZ(f_s) = R^t_AZ(f_t).

The following lemma gives some structure to optimal adversarial zero-one predictors.

Lemma A.4. For any s ≤ t, R^s_AZ(max(f_t, f_s)) = R^s_AZ(f_s) and R^t_AZ(min(f_t, f_s)) = R^t_AZ(f_s).

Proof. Let A := {x : f_s(x) < f_t(x)}, and let µ_s(x, y) = 1[y = +1](-ℓ'_c(s))µ(x, +1) + 1[y = -1](-ℓ'_c(-s))µ(x, -1) be the associated measure when the +1 labels have weight (-ℓ'_c(s)) and the -1 labels have weight (-ℓ'_c(-s)); define µ_t similarly. Define

A^+_s = {(x, +1) : P(x) ∩ A ≠ ∅, f_s(P(x) \ A) ⊆ {+1}},
A^-_s = {(x, -1) : P(x) ∩ A ≠ ∅, f_s(P(x) \ A) ⊆ {-1}},
A^+_t = {(x, +1) : P(x) ∩ A ≠ ∅, f_t(P(x) \ A) ⊆ {+1}},
A^-_t = {(x, -1) : P(x) ∩ A ≠ ∅, f_t(P(x) \ A) ⊆ {-1}}.

Then

µ_s(A^+_s) = (-ℓ'_c(s))µ(A^+_s) ≥ (-ℓ'_c(t))µ(A^+_s) ≥ (-ℓ'_c(t))µ(A^+_t) = µ_t(A^+_t),
µ_s(A^-_s) = (-ℓ'_c(-s))µ(A^-_s) ≤ (-ℓ'_c(-t))µ(A^-_s) ≤ (-ℓ'_c(-t))µ(A^-_t) = µ_t(A^-_t).

As a result, µ_s(A^+_s) - µ_s(A^-_s) ≥ µ_t(A^+_t) - µ_t(A^-_t). The reweighted adversarial zero-one loss can be written in terms of the reweighted measures, as follows:

R^t_AZ(f) = ∫ 1[y = +1]1[f^-(x) < 0](-ℓ'_c(t)) + 1[y = -1]1[f^+(x) ≥ 0](-ℓ'_c(-t)) dµ(x, y) = ∫ 1[y = +1]1[f^-(x) < 0] + 1[y = -1]1[f^+(x) ≥ 0] dµ_t(x, y).

The following lower bound can then be computed.
R s AZ (max(f t , f s )) -R s AZ (f s ) = 1[y = +1] 1[max(f t , f s ) -(x) < 0] -1[f -(x) < 0] + 1[y = -1] 1[max(f t , f s ) + (x) ≥ 0] -1[f + (x) ≥ 0] dµ s (x, y) = -µ s (A + s ) + µ s (A - s ) ≥ 0. Similarly, R t AZ (min(f t , f s )) -R t AZ (f s ) = 1[y = +1] 1[min(f t , f s ) -(x) < 0] -1[f -(x) < 0] + 1[y = -1] 1[min(f t , f s ) + (x) ≥ 0] -1[f + (x) ≥ 0] dµ t (x, y) = µ t (A + t ) -µ t (A - t ) ≥ 0. As 0 ≥ µ s (A + s ) -µ s (A - s ) ≥ µ t (A + t ) -µ t (A - t ) ≥ 0 we must have equality everywhere, giving the desired result. Using our understanding of optimal adversarial zero-one predictors, we can construct a predictor that is optimal at all thresholds. Lemma A.5. There exists f : X → R such that R t AZ (f -t) is the minimum possible value for all t ∈ R. Proof. Define f (x) = sup{r ∈ Q : f r (x) = +1}. To prove R t AZ (f -t) is minimal for all t ∈ R, we first do so for all t ∈ Q, and then for all t ∈ R \ Q. For any t ∈ Q, by definition sgn(f -t) ≥ f t . Let q 1 , q 2 , . . . be an enumeration of Q ∩ (t, ∞). Let g 0 = f t , and recursively define g i+1 = max(g i , f qi ) for all i ≥ 0. Note that lim i→∞ g i = sgn(f -t). Inductively applying Lemma A.4 results in R t AZ (g i ) = R t AZ (f t ) for all i ≥ 0. By the Dominated Convergence Theorem (Folland, 1999, Theorem 2.24) , R t AZ (f -t) = R t AZ (sgn(f -t)) = R t AZ lim i→∞ g i = lim i→∞ R t AZ (g i ) = lim i→∞ R t AZ (f t ) = R t AZ (f t ). For any t ∈ R \ Q, note that sgn(f -t) = lim s↑t s∈Q sgn(f -s). By the Dominated Convergence Theorem (Folland, 1999 , Theorem 2.24), Lemma A.2, and Lemma A.3, R t AZ (f -t) = R t AZ (sgn(f -t)) = R t AZ (lim s↑t s∈Q sgn(f -s)) = lim s↑t s∈Q R s AZ (sgn(f -s)) = lim s↑t s∈Q R s AZ (f s ) = R t AZ (f t ). For any predictor, its adversarial convex loss can be written as a weighted sum of adversarial zero-one losses across thresholds. This then implies the function defined in Lemma A.5 has optimal adversarial convex loss. Proof of Lemma 3.1. 
Recall the definition of R A (f ): R A (f ) = 1[y = +1]ℓ c (f -(x)) + 1[y = -1]ℓ c (-f + (x)) dµ(x, y). Applying the fundamental theorem of calculus and rearranging, R A (f ) = 1[y = +1] ∞ f -(x) (-ℓ ′ c (t)) dt + 1[y = -1] ∞ -f + (x) (-ℓ ′ c (t)) dt dµ(x, y) = 1[y = +1] ∞ -∞ 1[f -(x) < t](-ℓ ′ c (t)) dt + 1[y = -1] ∞ -∞ 1[f + (x) ≥ t](-ℓ ′ c (-t)) dt dµ(x, y) = ∞ -∞ 1[y = +1]1[f -(x) < t](-ℓ ′ c (t)) + 1[y = -1]1[f + (x) ≥ t](-ℓ ′ c (-t)) dt dµ(x, y) As the integrand is nonnegative, Tonelli's theorem (Folland, 1999, Theorem 2.37) gives R A (f ) = ∞ -∞ 1[y = +1]1[f -(x) < t](-ℓ ′ c (t)) + 1[y = -1]1[f + (x) ≥ t](-ℓ ′ c (-t)) dµ(x, y) dt = ∞ -∞ R t AZ (f -t) dt. While optimal adversarial zero-one predictors were used to construct a predictor with optimal adversarial convex loss in Lemma A.5, the reverse is also possible: using a predictor with optimal adversarial convex loss to construct optimal adversarial zero-one predictors. Proof of Lemma 3.2. Let f be the optimal predictor defined in Lemma A.5, which gives existence. Suppose, towards contradiction, that there exists g : X → R with minimal R A (g) and t ∈ R such that R t AZ (g -t) > R t AZ (f -t). By Lemma A.2 R t AZ (g -t) is left continuous as a function of t, and by Lemma A.3 R t AZ (f -t) is continuous as a function of t. As a result, there exists δ, ϵ > 0 such that for all s ∈ [t -δ, t], R s AZ (g -s) -R s AZ (f -s) ≥ ϵ. Then R A (g) -R A (f ) = ∞ -∞ R s AZ (g -s) -R s AZ (f -s) ds ≥ t t-δ R s AZ (g -s) -R s AZ (f -s) ds ≥ t t-δ ϵ ds = δϵ > 0, contradicting the assumption that R A (g) was minimal. So we must have R t AZ (g -t) minimal for all t ∈ R. The next two lemmas bound the zero-one loss at different thresholds in terms of the zero-one loss at threshold 0. Lemma A.6. Let f be the optimal predictor defined in Lemma A.5. Then R t AZ (f -t) ≤ (-ℓ ′ c (-|t|))R AZ (f ). Published as a conference paper at ICLR 2023 Proof. 
R^t_AZ(f - t) = ∫ 1[y = +1]1[f^-(x) < t](-ℓ'_c(t)) + 1[y = -1]1[f^+(x) ≥ t](-ℓ'_c(-t)) dµ(x, y)
≤ ∫ 1[y = +1]1[f^-(x) < 0](-ℓ'_c(t)) + 1[y = -1]1[f^+(x) ≥ 0](-ℓ'_c(-t)) dµ(x, y)
≤ ∫ 1[y = +1]1[f^-(x) < 0](-ℓ'_c(-|t|)) + 1[y = -1]1[f^+(x) ≥ 0](-ℓ'_c(-|t|)) dµ(x, y)
= (-ℓ'_c(-|t|)) R_AZ(f).

In contrast to the previous lemma, the following bound works for any predictor.

Lemma A.7. Let g be any predictor. Then R^t_AZ(g - t) ≥ (-ℓ'_c(|t|)) R_AZ(g - t).

Proof. R^t_AZ(g - t) = ∫ 1[y = +1]1[g^-(x) < t](-ℓ'_c(t)) + 1[y = -1]1[g^+(x) ≥ t](-ℓ'_c(-t)) dµ(x, y) ≥ ∫ 1[y = +1]1[g^-(x) < t](-ℓ'_c(|t|)) + 1[y = -1]1[g^+(x) ≥ t](-ℓ'_c(|t|)) dµ(x, y) = (-ℓ'_c(|t|)) R_AZ(g - t).

Now we can relate proximity to the optimal adversarial zero-one loss and proximity to the optimal adversarial convex loss.

Proof of Theorem 3.3. Let f be the optimal predictor defined in Lemma A.5. Then

R_A(g) - inf_{h meas.} R_A(h) = R_A(g) - R_A(f) = ∫_{-∞}^{∞} R^u_AZ(g - u) - R^u_AZ(f - u) du
≥ ∫_{-∞}^{∞} max(0, (-ℓ'_c(|u|)) R_AZ(g - u) - (-ℓ'_c(-|u|)) R_AZ(f)) du
≥ ∫_{-∞}^{∞} max(0, (-ℓ'_c(|u|)) inf_t R_AZ(g - t) - (-ℓ'_c(-|u|)) R_AZ(f)) du
= 2 ∫_0^{∞} max(0, (-ℓ'_c(u)) inf_t R_AZ(g - t) - (-ℓ'_c(-u)) R_AZ(f)) du.

As (-ℓ'_c(u)) inf_t R_AZ(g - t) - (-ℓ'_c(-u)) R_AZ(f) is continuous and nonincreasing as a function of u, as well as nonnegative when u = 0, there exists some r ∈ [0, ∞] such that

∫_0^{∞} max(0, (-ℓ'_c(u)) inf_t R_AZ(g - t) - (-ℓ'_c(-u)) R_AZ(f)) du = ∫_0^r (-ℓ'_c(u)) inf_t R_AZ(g - t) - (-ℓ'_c(-u)) R_AZ(f) du.

Letting p := inf_t R_AZ(g - t) / (R_AZ(f) + inf_t R_AZ(g - t)), we find that this occurs when r satisfies (-ℓ'_c(r)) inf_t R_AZ(g - t) - (-ℓ'_c(-r)) R_AZ(f) = 0. The integral can then be computed exactly as follows.
(-ℓ'_c(r)) inf_t R_AZ(g - t) - (-ℓ'_c(-r)) R_AZ(f) = 0
⇐⇒ ℓ'_c(r) / ℓ'_c(-r) = R_AZ(f) / inf_t R_AZ(g - t)
⇐⇒ (ℓ'_c(r) + ℓ'_c(-r)) / ℓ'_c(-r) = (R_AZ(f) + inf_t R_AZ(g - t)) / inf_t R_AZ(g - t)
⇐⇒ ℓ'_c(-r) / (ℓ'_c(r) + ℓ'_c(-r)) = inf_t R_AZ(g - t) / (R_AZ(f) + inf_t R_AZ(g - t))
⇐⇒ ℓ'_c(-r) / (ℓ'_c(r) + ℓ'_c(-r)) = p
⇐⇒ ℓ'_c(-r) = (ℓ'_c(r) + ℓ'_c(-r)) p
⇐⇒ pℓ'_c(r) - (1 - p)ℓ'_c(-r) = 0.

Since ℓ_c is convex, this implies r minimizes pℓ_c(r) + (1 - p)ℓ_c(-r). Then

R_A(g) - inf_{h meas.} R_A(h) ≥ 2 ∫_0^r (-ℓ'_c(u)) inf_t R_AZ(g - t) - (-ℓ'_c(-u)) R_AZ(f) du
= 2 [(ℓ_c(0) - ℓ_c(r)) inf_t R_AZ(g - t) - (ℓ_c(-r) - ℓ_c(0)) R_AZ(f)]
= 2 [ℓ_c(0)(R_AZ(f) + inf_t R_AZ(g - t)) - ℓ_c(r) inf_t R_AZ(g - t) - ℓ_c(-r) R_AZ(f)]
= 2 (R_AZ(f) + inf_t R_AZ(g - t)) [ℓ_c(0) - (pℓ_c(r) + (1 - p)ℓ_c(-r))]
= 2 (R_AZ(f) + inf_t R_AZ(g - t)) G_{ℓ_c}(p).

Using our assumption of a lower bound on G_{ℓ_c}(p) results in

R_A(g) - inf_{h meas.} R_A(h) ≥ 2 (R_AZ(f) + inf_t R_AZ(g - t)) |2p - 1|^s / c_s
= (2/c_s) (R_AZ(f) + inf_t R_AZ(g - t)) ((inf_t R_AZ(g - t) - R_AZ(f)) / (R_AZ(f) + inf_t R_AZ(g - t)))^s
= (2/c_s) (inf_t R_AZ(g - t) - R_AZ(f))^s (R_AZ(f) + inf_t R_AZ(g - t))^{1-s}.

Finally, since R_AZ(f) ≤ 1/2 and inf_t R_AZ(g - t) ≤ 1,

R_A(g) - inf_{h meas.} R_A(h) ≥ (2^s / (3^{s-1} c_s)) (inf_t R_AZ(g - t) - R_AZ(f))^s = (2^s / (3^{s-1} c_s)) (inf_t R_AZ(g - t) - inf_{h meas.} R_AZ(h))^s.

Rearranging then gives the desired inequality. The following lemma states that continuous predictors can get arbitrarily close to the optimal adversarial zero-one risk, even if we require the predictors to output the exact label. Lemma A.8.
Define R_EAZ(f) = ∫ 1[yf(P(x)) ≠ {+1}] dµ(x, y), the adversarial zero-one risk when we require f to exactly output the right label over the entire perturbation set. Then for any measurable f : X → {-1, +1} and any ϵ > 0, there exists continuous g : X → [-1, +1] such that

µ({(x, y) : yg(P(x)) ≠ {+1} and yf(P(x)) = {+1}}) < ϵ.

In particular, this implies inf_{g cts.} R_EAZ(g) = inf_{h meas.} R_EAZ(h) = inf_{g cts.} R_AZ(g) = inf_{h meas.} R_AZ(h).

Proof. Let A = {x : f^+(x) = -1} and P(A) = ∪_{x∈A} P(x), and let B = {x : f^-(x) = +1} and P(B) = ∪_{x∈B} P(x). By the inner regularity of µ_x (Folland, 1999, Theorem 7.8), there exist compact sets K ⊆ A and L ⊆ B such that µ_x(A) - µ_x(K) < ϵ/2 and µ_x(B) - µ_x(L) < ϵ/2. As P is upper hemicontinuous and K and L are compact, both P(K) and P(L) are also compact (Aliprantis & Border, 2006, Lemma 17.8). Note that they are also disjoint, as P(K) ∩ P(L) ⊆ P(A) ∩ P(B) = ∅. By Urysohn's Lemma (Folland, 1999, Lemma 4.32), there exists a continuous function g : X → [0, 1] such that g(x) = 0 for all x ∈ P(K) and g(x) = 1 for all x ∈ P(L). The continuous function 2g - 1 : X → [-1, +1] then satisfies

µ({(x, y) : y(2g - 1)(P(x)) ≠ {+1} and yf(P(x)) = {+1}}) ≤ µ_x(A \ K) + µ_x(B \ L) < ϵ.

Note that we also have R_EAZ(2g - 1) ≤ R_EAZ(f) + µ_x(A \ K) + µ_x(B \ L) < R_EAZ(f) + ϵ. As f and ϵ > 0 were arbitrary, inf_{g cts.} R_EAZ(g) ≤ inf_{h meas.} R_EAZ(h). To get the implication, note that

inf_{g cts.} R_AZ(g) ≤ inf_{g cts.} R_EAZ(g) ≤ inf_{h meas.} R_EAZ(h) = inf_{h meas.} R_AZ(h) ≤ inf_{g cts.} R_AZ(g),

so we must have equality everywhere.

While the optimal adversarial predictor may be discontinuous, continuous predictors can get arbitrarily close to the optimal adversarial convex risk.

Proof of Lemma 3.4. We have inf_{g cts.} R_A(g) ≥ inf_{h meas.} R_A(h), so it suffices to show inf_{g cts.} R_A(g) ≤ inf_{h meas.} R_A(h). Let f be the optimal predictor defined in Lemma A.5. Then R_A(f) = inf_{h meas.} R_A(h), so we want to show inf_{g cts.} R_A(g) ≤ R_A(f). Let ϵ > 0. Choose M > 0 large enough so that R_A(min(max(f, -M), M)) < R_A(f) + ϵ/3, and let f̃ = min(max(f, -M), M). As ℓ_c is continuous, there exists a finite-sized partition P = {p_0, p_1, p_2, . . . , p_r} with p_0 = -M and p_r = M such that |ℓ_c(p_i) - ℓ_c(p_{i-1})| ≤ ϵ/3 and |ℓ_c(-p_i) - ℓ_c(-p_{i-1})| ≤ ϵ/3 for all 1 ≤ i ≤ r. By Lemma A.8, for every p_i there exists continuous g_{p_i} : X → [-1, +1] such that

µ({(x, y) : yg_{p_i}(P(x)) ≠ {+1} and y sgn(f̃ - p_i)(P(x)) = {+1}}) < ϵ / (3r(ℓ_c(-M) - ℓ_c(M))).

Consider the continuous function g_ϵ = -M + Σ_{i=1}^r (p_i - p_{i-1})(g_{p_i} + 1)/2, which will be shown to have adversarial risk within ϵ of the optimal. Define

D_i := {(x, y) : yg_{p_i}(P(x)) ≠ {+1} and y sgn(f̃ - p_i)(P(x)) = {+1}} for all i,
E := {(x, y) : ℓ_A(x, y, g_ϵ) > ℓ_A(x, y, f̃) + ϵ/3}.

We will now show that E ⊆ ∪_{i=1}^r D_i. Let (x, y) ∉ ∪_{i=1}^r D_i. Then yg_{p_i}(P(x)) ≥ y sgn(f̃ - p_i)(P(x)) for all 1 ≤ i ≤ r. Let i′ = arg max_i {sgn(f̃ - p_i) = +1}.
Then yg ϵ (P(x)) ≥ min{yp i ′ , yp max{i ′ +1,r} } and yf (P(x)) ≤ max{yp i ′ , yp max{i ′ +1,r} }, so ℓ A (x, y, g ϵ ) ≤ max{ℓ c (yp i ′ ), ℓ c (yp max{i ′ +1,r} )} ≤ min{ℓ c (yp i ′ ), ℓ c (yp max{i ′ +1,r} )} + ϵ/3 ≤ ℓ A (x, y, f ) + ϵ/3, which implies (x, y) ̸ ∈ E. Consequently, µ(E) ≤ r i=1 µ(D i ) < r i=1 ϵ 3r ℓ c (-M ) -ℓ c (M ) = ϵ 3 ℓ c (-M ) -ℓ c (M ) . Combining the bounds results in inf g cts. R A (g) ≤ R A (g ϵ ) ≤ R A ( f ) + ϵ/3 + µ(E) ℓ c (-M ) -ℓ c (M ) < R A (f ) + ϵ/3 + ϵ/3 + ϵ/3 = R A (f ) + ϵ. As this holds for all ϵ > 0, we have inf g cts. R A (g) ≤ R A (f ), completing the proof.
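At the level of a single margin value, the decomposition used in the proof of Lemma 3.1 is the scalar identity ℓ_c(z) = ∫ 1[z < t](-ℓ'_c(t)) dt, valid for any differentiable convex loss vanishing at +∞. Below is a quick numeric check for the logistic loss; this is an illustrative verification, not part of the proof.

```python
import numpy as np

# Scalar identity behind Lemma 3.1: for a convex loss with l(t) -> 0 as t -> infinity,
# the fundamental theorem of calculus gives l(z) = integral over t of 1[z < t] * (-l'(t)) dt.
# Checked numerically for the logistic loss l(z) = log(1 + exp(-z)).
def ell(z):
    return np.log1p(np.exp(-z))

def neg_ell_prime(t):
    return 1.0 / (1.0 + np.exp(t))  # -l'(t) = 1 / (1 + e^t) >= 0

t = np.linspace(-60.0, 60.0, 1_200_001)  # wide grid standing in for the real line
dt = t[1] - t[0]
for z in (-3.0, -0.5, 0.0, 1.0, 4.0):
    integral = float(np.sum((z < t) * neg_ell_prime(t)) * dt)  # Riemann sum
    assert abs(integral - float(ell(z))) < 5e-4, (z, integral, float(ell(z)))
print("threshold decomposition of the logistic loss verified")
```

The same Riemann sum with the reweighted zero-one losses R^t_AZ in place of the indicators recovers the statement of Lemma 3.1 after integrating over the data distribution.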

A.2 GENERALIZATION PROOFS

The following lemma controls the difference in features for nearby points, which will be useful when proving our Rademacher complexity bound. Lemma A.9. With probability at least 1 -3nδ, 1 ρ max ∥δ k ∥≤τ ∇f (x k ; W 0 ) -∇f (x k + δ k ; W 0 ) ≤ √ 2τ + 32τ ln(1/δ) m 1/4 + τ √ 2, for all k. Proof. As in Lemma A.2 of Ji et al. (2021) , with probability at least 1 -3nδ, for any x k ̸ = 0, j 1[|w T 0,j x k | ≤ τ k ∥x k ∥] ≤ mτ k + 8mτ k ln(1/δ), and henceforth assume the failure event does not hold. With τ k = τ ∥x k ∥ we have, for any x k ̸ = 0, j 1[|w T 0,j x k | ≤ τ ] ≤ mτ ∥x k ∥ + 8mτ ln(1/δ) ∥x k ∥ . As such, define the set S k := j : ∃∥δ k ∥ ≤ τ, sgn(w T 0,j x k ) ̸ = sgn(w T 0,j (x k + δ k )) , where the preceding concentration inequality implies |S k | ≤ mτ ∥x k ∥ + 8mτ ln(1/δ) ∥x k ∥ for all x k ̸ = 0. Then for any x k (including x k = 0) and any ∥δ k ∥ ≤ τ , 1 ρ 2 ∇f (x k ; W 0 ) -∇f (x k + δ k ; W 0 ) 2 = 1 m j x k 1[w T 0,j x k ≥ 0] -(x k + δ k )1[w T 0,j (x k + δ k ) ≥ 0] 2 ≤ 2 m j x k 1[w T 0,j x k ≥ 0] -x k 1[w T 0,j (x k + δ k ) ≥ 0] 2 + 2 m j x k 1[w T 0,j (x k + δ k ) ≥ 0] -(x k + δ k )1[w T 0,j (x k + δ k ) ≥ 0] 2 . As S k is exactly the set of indices j over which 1[w T 0,j x k ≥ 0] could possibly differ from 1[w T 0,j (x k + δ k ) ≥ 0], we can restrict the sum in the first term to these indices, resulting in 1 ρ 2 ∇f (x k ; W 0 ) -∇f (x k + δ k ; W 0 ) 2 ≤ 2 m j∈S k ∥x k ∥ 2 1[w T 0,j x k ≥ 0] -1[w T 0,j (x k + δ k ) ≥ 0] 2 + 2 m j δ k 1[w T 0,j (x k + δ k ) ≥ 0] 2 ≤ 2|S k |∥x k ∥ 2 m + 2τ 2 ≤ 2τ + 32τ ln(1/δ) m + 2τ 2 , where we used ∥x k ∥ ≤ 1 in the last step. Taking the square root of both sides gives us 1 ρ ∇f (x k ; W 0 ) -∇f (x k + δ k ; W 0 ) ≤ √ 2τ + 32τ ln(1/δ) m 1/4 + τ √ 2, completing the proof. We will now prove our Rademacher complexity bound. Proof of Lemma 4.5. 
We have nRad(F) = E ϵ sup V ∈V n k=1 ϵ k min x ′ k ∈P(x k ) y k ∇f (x ′ k ; W 0 ), V = E ϵ sup V ∈V n k=1 ϵ k min x ′ k ∈P(x k ) y k ∇f (x ′ k ; W 0 ), V -y k ∇f (x k ; W 0 ), V + ϵ k y k ∇f (x k ; W 0 ), V ≤ E ϵ sup V ∈V n k=1 ϵ k min x ′ k ∈P(x k ) y k ∇f (x ′ k ; W 0 ) -∇f (x k ; W 0 ), V + E ϵ sup U ∈V n k=1 ϵ k y k ∇f (x k ; W 0 ), U ≤ E ϵ sup V ∈V n k=1 ϵ k min x ′ k ∈P(x k ) y k ∇f (x ′ k ; W 0 ) -∇f (x k ; W 0 ), V + ρB √ n, where in the last step we use the Rademacher bound provided in the proof of Lemma A.8 from Ji et al. (2021) . We now focus on bounding E ϵ sup V ∈V n k=1 ϵ k min x ′ k ∈P(x k ) y k ∇f (x ′ k ; W 0 ) -∇f (x k ; W 0 ), V . For notational simplicity let D k (V ) := min x ′ k ∈P(x k ) y k ∇f (x ′ k ; W 0 ) -∇f (x k ; W 0 ), V , so that the quantity we want to bound can be rewritten as E ϵ sup V ∈V n k=1 ϵ k D k (V ). As Rademacher complexity is invariant under constant shifts, we can subtract the constant C k from D k (V ), where C k := sup V ∈V ′ D k (V ) + inf V ∈V ′ D k (V ) 2 . With this constant shift, the expression becomes E ϵ sup V ∈V n k=1 ϵ k D k (V ) -C k = nRad(G), where G = {x k → min x ′ k ∈P(x k ) y k ∇f (x ′ k ; W 0 ) -∇f (x k ; W 0 ), V -C k : V ∈ V}. We will now bound nRad(G) using a covering argument in the parameter space V. Instead of directly finding a covering for the ball of radius B, we first find a covering for a cube with side length 2B containing the ball. Projecting the cube to the ball then yields a proper covering of the ball, as this mapping is non-expansive. To ensure every point on the surface of the cube is at most ϵ distance away from a point, we use a grid with scale 2ϵ/ √ m, which results in B √ m ϵ m points in the cover. This cover C ϵ has the property that for every V ∈ V, there is some U ∈ C ϵ such that ∥V -U ∥ ≤ ϵ, since every coordinate of V is ϵ/ √ m-close to a coordinate in C ϵ , and ∥V -U ∥ = m i=1 (V i -U i ) 2 ≤ m i=1 (ϵ/ √ m) 2 = ϵ. 
Due to the non-expansive projection mapping from the cube to the sphere, the ϵ-cover for the cube projects to an ϵ-cover for the sphere. As a result, we have an ϵ-cover for the sphere of radius B with B √ m ϵ m points. A geometric ϵ-cover gives only a ρτ ϵ-cover in the function space, since for any V and U with ∥V -U ∥ ≤ ϵ, and any x k , min x V ∈P(x k ) y k ∇f (x V ; W 0 ) -∇f (x k ; W 0 ), V -C k - min x U ∈P(x k ) y k ∇f (x U ; W 0 ) -∇f (x k ; W 0 ), U -C k ≤ sup ∥V -U ∥≤ϵ x U ∈P(x k ) min x V ∈P(x k ) y k ∇f (x V ; W 0 ) -∇f (x k ; W 0 ), V -C k -y k ∇f (x U ; W 0 ) -∇f (x k ; W 0 ), U -C k ≤ sup ∥V -U ∥≤ϵ x U ∈P(x k ) y k ∇f (x U ; W 0 ) -∇f (x k ; W 0 ), V -C k -y k ∇f (x U ; W 0 ) -∇f (x k ; W 0 ), U -C k = sup ∥V -U ∥≤ϵ x U ∈P(x k ) y k ∇f (x U ; W 0 ) -∇f (x k ; W 0 ), V -U ≤ sup x U ∈P(x k ) ∥∇f (x U ; W 0 ) -∇f (x k ; W 0 )∥∥V -U ∥ ≤ sup ∥δ U ∥≤τ ∥∇f (x U ; W 0 ) -∇f (x k ; W 0 )∥∥V -U ∥ ≤ ρτ ϵ, where the last step follows with probability at least 1 -3δ by Lemma A.9. As a result, we can get an ϵ-cover in the function space with just ρB τ √ m ϵ m points.

This bound on the covering number

N (G, ϵ, ∥ • ∥ u ) ≤ ρB τ √ m ϵ m then implies N (G, ϵ, ∥ • ∥ 2 ) ≤ N (G, ϵ/ √ n, ∥ • ∥ u ) ≤ ρB τ √ mn ϵ m , which we can use in a standard parameter-based covering argument (Anthony & Bartlett, 2009)  to get nRad(G) ≤ inf α>0   α √ n + sup U ∈G ∥U ∥ 2 2 ln N (G, α, ∥ • ∥ 2 )   ≤ inf α>0   α √ n + ρB τ √ n 2 ln ρB τ √ mn α m    . To calculate sup U ∈G ∥U ∥ 2 , note that each of the n entries is bounded above by sup V ∈V ′ D k (V ) -C k = max sup V ∈V ′ D k (V ) -C k , -inf U ∈V ′ D k (U ) -C k = sup V ∈V ′ D k (V ) -inf U ∈V ′ D k (U ) 2 ≤ 1 2 ρτ 2B = ρB τ , so sup U ∈G ∥U ∥ 2 ≤ ρB τ √ n. Setting α = ρB τ (let α - → 0 if τ = 0) we get nRad(G) ≤ ρB τ √ n + ρB τ mn ln mn τ 2 . Putting this together with our previous bound results in nRad(F) ≤ ρB √ n + nRad(G) ≤ ρB √ n + ρB τ √ n   1 + m ln mn τ 2   , and dividing by n then completes the proof. With our Rademacher complexity bound we can prove our generalization bound, as follows. Proof of Lemma 4.4. By Lemma A.6 part 2 of Ji et al. (2021) , with probability at least 1 -δ, sup ∥V -W0∥≤B sup ∥x∥≤1 ∇f (x; W 0 ), V ≤ 18ρB ln(emd(1 + 3(md 3/2 ) d )/δ) ≤ 18ρBd ln(4em 2 d 3 /δ). By the decreasing monotonicity of the logistic loss, sup ∥V -W0∥≤B sup ∥x∥≤1 ℓ A (x, y, ∇f (•; W 0 ), V ) ∈ [ℓ(18ρBd ln(4em 2 d 3 /δ)), ℓ(-18ρBd ln(4em 2 d 3 /δ))] ⊆ [ln(2) -18ρBd ln(4em 2 d 3 /δ), ln(2) + 18ρBd ln(4em 2 d 3 /δ)]. With a bound on the range from above that holds with probability at least 1 -δ and a bound on the Rademacher complexity from Lemma 4.5 that holds with probability at least 1 -3δ, we can now apply a standard Rademacher bound (Shalev-Shwartz & Ben-David, 2014 ) that holds with probability at least 1 -δ to get that altogether with probability at least 1 -5δ, sup ∥V -W0∥≤B R (0) A (V ) -R (0) A (V ) ≤ 2Rad(ℓ A (F)) + 3(36ρBd ln(4em 2 d 3 /δ)) ln(4/δ) 2n ≤ 2 ρB √ n + 2 ρB τ √ n   1 + m ln mn τ 2   + 77ρBd ln 3/2 (4em 2 d 3 /δ) √ n .
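The ρB√n term in the proof of Lemma 4.5 comes from the standard linear-class Rademacher bound E_ε sup_{∥U∥≤B} Σ_k ε_k⟨φ_k, U⟩ = B · E_ε∥Σ_k ε_k φ_k∥ ≤ B √(Σ_k ∥φ_k∥²), via Cauchy-Schwarz and Jensen. The following is a standalone Monte Carlo illustration with generic features φ_k of norm ρ (stand-ins for the features y_k ∇f(x_k; W_0); the dimensions and constants are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo illustration of the linear-class Rademacher bound:
#   E_eps sup_{||U|| <= B} sum_k eps_k <phi_k, U> = B * E_eps ||sum_k eps_k phi_k||
#                                                <= B * sqrt(sum_k ||phi_k||^2) = rho * B * sqrt(n).
n, d, B, rho = 50, 20, 2.0, 1.5
phi = rng.normal(size=(n, d))
phi *= rho / np.linalg.norm(phi, axis=1, keepdims=True)  # ||phi_k|| = rho

eps = rng.choice([-1.0, 1.0], size=(20_000, n))          # Rademacher sign vectors
sups = B * np.linalg.norm(eps @ phi, axis=1)             # sup attained at U = B * v / ||v||
estimate = float(sups.mean())
bound = rho * B * np.sqrt(n)
print(f"empirical Rademacher sum {estimate:.2f} <= bound {bound:.2f}")
```

The empirical average sits slightly below the bound (the slack is the Jensen gap between E∥v∥ and √(E∥v∥²)), matching the role this inequality plays in the covering argument above.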

A.3 OPTIMIZATION PROOFS

The following lemma bounds the gradient of the adversarial risk.

Lemma A.10. For any matrix W ∈ R^{m×d}, ∥∇R_A(W)∥ ≤ ρ min{1, R_A(W)}.

Proof. By properties of the logistic loss (Ji & Telgarsky, 2018),

∥∇R_A(W)∥ = ∥(1/n) Σ_{k=1}^n ℓ'(min_{x'_k∈P(x_k)} y_k f(W; x'_k)) ∇ min_{x'_k∈P(x_k)} y_k f(W; x'_k)∥
≤ (1/n) Σ_{k=1}^n |ℓ'(min_{x'_k∈P(x_k)} y_k f(W; x'_k))| ∥∇ min_{x'_k∈P(x_k)} y_k f(W; x'_k)∥
≤ (1/n) Σ_{k=1}^n min{1, ℓ(min_{x'_k∈P(x_k)} y_k f(W; x'_k))} ρ
= (ρ/n) Σ_{k=1}^n min{1, ℓ_{A,k}(f)}
≤ ρ min{1, R_A(W)}.

We have the following guarantee when using adversarial training.

Lemma A.11. When adversarially training with step size η, for any iterate t and any reference parameters Z ∈ R^{m×d}, ∥W_t - Z∥² + (2η - η²ρ²) Σ_{i<t} R_A(W_i) ≤ ∥W_0 - Z∥² + 2η Σ_{i<t} R^{(i)}_A(Z).

Proof. It suffices to show ∥W_{i+1} - Z∥² + (2η - η²ρ²) R_A(W_i) ≤ ∥W_i - Z∥² + 2η R^{(i)}_A(Z) for 0 ≤ i < t, as summing the left- and right-hand sides over 0 ≤ i < t then gives the desired bound.

By the definition of

W i+1 , ∥W i+1 -Z∥ 2 = ∥W i -Z∥ 2 -2η ∇ R A (W i ), W i -Z + η 2 ∇ R A (W i ) 2 . Note that -2η ∇ R A (W i ), W i -Z = 2η ∇ R A (W i ), Z -W i = 2η ∇ R A (W i ), Z -W i = 2η ∇ R (i) A (W i ), Z -W i ≤ 2η R (i) A (Z) -R (i) A (W i ) , where the last inequality follows because R (i) A (W ) = R A (f (i) (W ; •)) is convex in the function space f (i) (W ; •), which in turn is linear in W , so R (i) A (W ) is convex in W . Using this bound, in addition to Lemma A.10, gives ∥W i+1 -Z∥ 2 ≤ ∥W i -Z∥ 2 + 2η R (i) A (Z) -R (i) A (W i ) + η 2 ρ 2 R A (W i ), and rearranging then gives the desired inequality. We want to bound R (i) A (Z) in terms of R (0) A (Z). To do so, we will show that when changing features the value at every point remains close to its original value. Towards this goal, we first show that the features in a small ball do not change much. Lemma A.12. For any ∥z∥ ≤ 1 and any 0 < ϵ ≤ 1/(dm), with probability at least 1 -δ, sup ∥x-z∥≤ϵ ∥x∥≤1 ∥V -W0∥≤R V ∥∇f (x; V ) -∇f (z; V )∥ ≤ 7ρR 1/3 V m -1/6 ln(em/δ) 1/6 + 12ρd 1/6 ϵ 1/3 ln(em/δ) 1/3 + 2ρϵ + 15ρ ln(edm/δ) m 1/4 . Proof. For notational convenience let W := W 0 . First, note that ∥∇f (x; V ) -∇f (z; V )∥ ≤ ∥∇f (x; V ) -∇f (x; W )∥ + ∥∇f (x; W ) -∇f (z; W )∥ + ∥∇f (z; W ) -∇f (z; V )∥. By Lemma A.5 of Ji et al. (2021) , with probability at least 1 -δ the middle term is bounded by ∥∇f (x; W ) -∇f (z; W )∥ ≤ 11ρ ln(edm/δ) m 1/4 . The first and last terms are both bounded by sup ∥x-z∥≤ϵ ∥x∥≤1 ∥V -W ∥≤R V ∥∇f (x; V ) -∇f (x; W )∥, so we will now focus on bounding this term. Note that ∇f (x; V ) -∇f (x; W ) = ρ √ m m j=1 a j 1[v T j x ≥ 0] -1[w T j x ≥ 0] e j x T . We consider two cases: ∥x∥ ≤ (k + 1)ϵ and ∥x∥ > (k + 1)ϵ, for some k ≥ 1 to be determined later. • Case 1: ∥x∥ ≤ (k + 1)ϵ. Then ∥∇f (x; V ) -∇f (x; W )∥ ≤ ρ √ m √ m(k + 1)ϵ = (k + 1)ρϵ. • Case 2: ∥x∥ > (k + 1)ϵ. Then ∥z∥ > kϵ. With probability at least 1 -mδ, ∥w j ∥ ≤ √ d + 2 ln(1/δ) for all 1 ≤ j ≤ m. 
Define S 1 := j ∈ [m] : |w T j z| ≤ q∥z∥ , S 2 := j ∈ [m] : |w T j x| ≤ r∥x∥ , S 3 := j ∈ [m] : ∥v j -w j ∥ ≥ r , S := S 2 ∪ S 3 , where q := 2 r + 1 k √ d + 1 k 2 ln(1/δ) , with r a parameter we will choose later. Note that S 2 ⊆ S 1 , since if |w T j x| ≤ r∥x∥, then |w T j z| ≤ |w T j x| + ∥w j ∥ϵ ≤ r∥x∥ + 1 k ∥w j ∥∥x∥ = r + 1 k ∥w j ∥ ∥x∥ ≤ k + 1 k r + 1 k ∥w j ∥ ∥z∥ ≤ 2 r + 1 k √ d + 1 k 2 ln(1/δ) ∥z∥ ≤ q∥z∥. By Lemma A.2 part 1 of Ji et al. (2021) we have that with probability at least 1 -3δ, |S 2 | ≤ qm + 8qm ln(1/δ). Within the proof of Lemma A.7 part 1 of Ji et al. (2021) it is shown that |S 3 | ≤ R 2 V r 2 . So altogether, with probability at least 1 -(m + 3)δ, |S| ≤ |S 2 | + |S 3 | ≤ qm + 8qm ln(1/δ) + R 2 V r 2 ≤ 2 r + 1 k √ d + 1 k 2 ln(1/δ) m + 16 r + 1 k √ d + 1 k 2 ln(1/δ) m ln(1/δ) + R 2 V r 2 ≤ 6 r + 1 k √ d + 1 k 2 ln(1/δ) m ln(e/δ) + R 2 V r 2 ≤ 6 r + 3 k d ln(e/δ) m ln(e/δ) + R 2 V r 2 = 6rm ln(e/δ) + 18m √ d ln(e/δ) k + R 2 V r 2 . Setting r := R 2/3 V m -1/3 ln(e/δ) -1/6 we get |S| ≤ 7R 2/3 V m 2/3 ln(e/δ) 1/3 + 18m √ d ln(e/δ) k . Substituting this upper bound on |S| results in ∥∇f (x; V ) -∇f (x; W )∥ ≤ ρ √ m |S|∥x∥ ≤ ρ √ m |S| ≤ ρ √ m 3R 1/3 V m 1/3 ln(e/δ) 1/6 + 5m 1/2 d 1/4 k -1/2 ln(e/δ) ≤ 3ρR 1/3 V m -1/6 ln(e/δ) 1/6 + 5ρd 1/4 k -1/2 ln(e/δ). Combining the two cases results in ∥∇f (x; V ) -∇f (x; W )∥ ≤ max{(k + 1)ρϵ, 3ρR 1/3 V m -1/6 ln(e/δ) 1/6 + 5ρd 1/4 k -1/2 ln(e/δ)}. After setting k := d 1/6 ϵ -2/3 ln(e/δ) 1/3 to balance the terms, ∥∇f (x; V ) -∇f (x; W )∥ ≤ max{ρd 1/6 ϵ 1/3 ln(e/δ) 1/3 + ρϵ, 3ρR 1/3 V m -1/6 ln(e/δ) 1/6 + 5ρd 1/6 ϵ 1/3 ln(e/δ) 1/3 } ≤ 3ρR 1/3 V m -1/6 ln(e/δ) 1/6 + 5ρd 1/6 ϵ 1/3 ln(e/δ) 1/3 + ρϵ.

So with probability at least 1 - (m + 3)δ,

sup_{∥x-z∥≤ϵ, ∥x∥≤1, ∥V-W∥≤R_V} ∥∇f(x; V) - ∇f(z; V)∥ ≤ 6ρR_V^{1/3} m^{-1/6} ln(e/δ)^{1/6} + 10ρd^{1/6} ϵ^{1/3} ln(e/δ)^{1/3} + 2ρϵ + 11ρ ln(edm/δ) / m^{1/4}.

Rescaling the probability of failure, we get that with probability at least 1 - δ,

sup_{∥x-z∥≤ϵ, ∥x∥≤1, ∥V-W∥≤R_V} ∥∇f(x; V) - ∇f(z; V)∥ ≤ 7ρR_V^{1/3} m^{-1/6} ln(em/δ)^{1/6} + 12ρd^{1/6} ϵ^{1/3} ln(em/δ)^{1/3} + 2ρϵ + 15ρ ln(edm/δ) / m^{1/4}.

The following lemma bounds R^{(i)}_A(Z) in terms of R^{(0)}_A(Z), as well as R^{(i)}(Z) in terms of R^{(0)}(Z).

Lemma A.13.
1. For any ∥z∥ ≤ 1 and R_V ≥ 1 and R_B ≥ 0, with probability at least 1 - δ,
sup_{∥x∥≤1, ∥V-W_0∥≤R_V, ∥B-W_0∥≤R_B} |⟨∇f(x; V) - ∇f(x; W_0), B⟩| ≤ 26ρ(R_B + R_V)R_V^{1/3} d^{1/4} ln(ed²m²/δ)^{1/4} / m^{1/6} + 89ρ(R_B + R_V) d^{1/3} ln(ed³m²/δ)^{1/3} / m^{1/4} + 5ρ ln(1/δ) / (dm).
2. With probability at least 1 - δ, simultaneously
sup_{∥W_i-W_0∥≤R_V, ∥B-W_0∥≤R_B} |R^{(i)}_A(B) - R_A(B)| ≤ 26ρ(R_B + R_V)R_V^{1/3} d^{1/4} ln(ed²m²/δ)^{1/4} / m^{1/6} + 89ρ(R_B + R_V) d^{1/3} ln(ed³m²/δ)^{1/3} / m^{1/4} + 5ρ ln(1/δ) / (dm)
and
sup_{∥W_i-W_0∥≤R_V, ∥B-W_0∥≤R_B} |R^{(i)}(B) - R^{(0)}(B)| ≤ 26ρ(R_B + R_V)R_V^{1/3} d^{1/4} ln(ed²m²/δ)^{1/4} / m^{1/6} + 89ρ(R_B + R_V) d^{1/3} ln(ed³m²/δ)^{1/3} / m^{1/4} + 5ρ ln(1/δ) / (dm).

Proof. 1. For notational convenience let W := W_0. For some 0 < ϵ ≤ 1/(dm) which we will choose later, instantiate a cover C of the unit ball at scale ϵ/√d. For all ∥x - z∥ ≤ ϵ,

|f(x; V) - f(z; V)| = |(ρ/√m) Σ_{j=1}^m a_j (σ(v_j^T x) - σ(v_j^T z))| ≤ (ρ/√m) Σ_j |σ(v_j^T x) - σ(v_j^T z)| ≤ (ρ/√m) Σ_j |v_j^T x - v_j^T z| ≤ (ρ/√m) √m ∥V∥ ∥x - z∥ ≤ ρ(R_V + ∥W∥)ϵ ≤ ρ(R_V + √m + √d + √(2 ln(1/((√d/ϵ)^d δ))))ϵ.

Instantiating Lemma A.12 for all z ∈ C, we get that with probability at least 1 - (√d/ϵ)^d δ,

sup_{∥x-z∥≤ϵ, ∥x∥≤1, ∥V-W∥≤R_V} ∥∇f(x; V) - ∇f(z; V)∥ ≤ 7ρR_V^{1/3} m^{-1/6} ln(em/δ)^{1/6} + 12ρd^{1/6} ϵ^{1/3} ln(em/δ)^{1/3} + 2ρϵ + 15ρ ln(edm/δ) / m^{1/4}.

Combining these bounds with the triangle-inequality decomposition of ⟨∇f(x; V) - ∇f(x; W), B⟩, the quantity of interest is bounded by

(7ρR_V^{1/3} m^{-1/6} ln(em/δ)^{1/6} + 12ρd^{1/6} ϵ^{1/3} ln(em/δ)^{1/3} + 2ρϵ + 15ρ ln(edm/δ) / m^{1/4})(R_B + R_V) + 2ρ(R_V + √m + √d + √(2 ln(1/((√d/ϵ)^d δ))))ϵ + 3ρ(R_B + 2R_V)R_V^{1/3} ln(e/δ)^{1/4} / m^{1/6}.
Setting ϵ = 1/(dm) we get with probability at least 1 -5(d √ dm) d δ, sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; V ) -∇f (x; W ), B ≤ 20ρ(R B + R V )R 1/3 V ln(em/δ) 1/4 m 1/6 + 64ρ(R B + R V ) ln(edm/δ) 1/3 m 1/4 + 2 √ 2ρ ln(1/(( √ d/ϵ) d δ)) dm . Rescaling δ we get that with probability at least 1 -δ, sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; V ) -∇f (x; W ), B ≤ 20ρ(R B + R V )R 1/3 V d 1/4 ln(5ed 2 m 2 /δ) 1/4 m 1/6 + 64ρ(R B + R V )d 1/3 ln(5ed 3 m 2 /δ) 1/3 m 1/4 + 2 √ 2ρ ln(5/δ) dm ≤ 26ρ(R B + R V )R 1/3 V d 1/4 ln(ed 2 m 2 /δ) 1/4 m 1/6 + 89ρ(R B + R V )d 1/3 ln(ed 3 m 2 /δ) 1/3 m 1/4 + 5ρ ln(1/δ) dm . 2. With probability at least 1 -δ, the previous part holds. This part is then an immediate consequence since the logistic loss is 1-Lipschitz. We now prove the main optimization lemma. Proof of Lemma 4.6. We will use the fact that it does not matter much which feature we use, because the function values on the domain will differ by a small amount, and hence the resulting difference in risk will be small. This is encapsulated by Lemma A.13. Combining this with Theorem 4.1 holding with probability at least 1 -12δ, and so altogether with probability at least 1 -18δ, R A (W ≤t ) ≤ 2 2 -ηρ 2 R (0) A (Z) + O   1 2 -ηρ 2 R 2 Z ηt + ρR Z d + √ τ m √ n + ρR 4/3 Z d 1/3 m 1/6   ≤ 2 2 -ηρ 2 inf g meas. {R A (g)} + 2 2 -ηρ 2 (κ 2 + ϵ) + O   1 2 -ηρ 2 R 2 Z ηt + ρR Z d + √ τ m √ n + ρR 4/3 Z d 1/3 m 1/6   . Corollary 4.2 then follows by setting parameters and reducing. To get Corollary 4.3, let δ (n) = n -2 . Notice that for any ϵ > 0, there exists n ϵ such that for all n ≥ n ϵ , with probability at least 1 -1/n 2 , R A (W ≤t ) (n) ≤ inf g meas. {R A (g)} + ϵ. Since n≥nϵ 1/n 2 < ∞, by the Borel-Cantelli lemma we have lim sup n→∞ R A (W (n) ≤t ) ≤ inf g meas. {R A (g)} + ϵ almost surely. As ϵ > 0 was arbitrary, we get Corollary 4.3.
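Lemma A.11 uses only convexity of the risk in the parameters together with the gradient bound of Lemma A.10, so it can be checked numerically in the simplest convex setting it covers: a model linear in its parameters, f(x; w) = ⟨w, x⟩ with ∥x_k∥ ≤ ρ, logistic loss, and plain gradient descent (no adversary, so R^{(i)}_A(Z) = R_A(Z)). All constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Numeric check of the descent guarantee of Lemma A.11:
#   ||w_T - z||^2 + (2*eta - eta^2*rho^2) * sum_{i<T} R(w_i)
#       <= ||w_0 - z||^2 + 2*eta * sum_{i<T} R(z),
# for gradient descent on the logistic risk of a linear model with ||x_k|| <= rho.
n, d, rho, eta, T = 40, 10, 1.5, 0.4, 300  # eta * rho^2 = 0.9 < 2, as required
X = rng.normal(size=(n, d))
X *= rho / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)  # ||x_k|| = rho
y = rng.choice([-1.0, 1.0], size=n)

def risk(w):
    return float(np.mean(np.log1p(np.exp(-y * (X @ w)))))

def grad(w):
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))  # -l'(margin) for the logistic loss
    return -(y * p) @ X / n

w0 = rng.normal(size=d)
z = rng.normal(size=d)                      # arbitrary reference parameters
w, risk_sum = w0.copy(), 0.0
for _ in range(T):
    risk_sum += risk(w)
    w -= eta * grad(w)

lhs = np.linalg.norm(w - z) ** 2 + (2 * eta - eta ** 2 * rho ** 2) * risk_sum
rhs = np.linalg.norm(w0 - z) ** 2 + 2 * eta * T * risk(z)
print(f"lhs = {lhs:.3f} <= rhs = {rhs:.3f}")
```

The inequality is guaranteed by the lemma (here the per-step bound ∥∇R∥ ≤ ρ min{1, R} of Lemma A.10 holds because each ∥x_k∥ ≤ ρ and |ℓ'| ≤ min{1, ℓ} for the logistic loss), so the printed comparison holds for any choice of reference z.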



d, with |C| ≤ ( √ d/ϵ) d . For a given point x, let z ∈ C denote the closest point in the cover. Then by the triangle inequality,sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; V ) -∇f (x; W ), B ≤ sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; V ) -∇f (x; W ), B -∇f (z; V ) -∇f (z; W ), B + ∇f (z; V ) -∇f (z; W ), B ≤ sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; V ) -∇f (z; V ), B + ∇f (x; W ) -∇f (z; W ), B + ∇f (z; V ) -∇f (z; W ), B .

Upper bounding this expression by taking the supremum separately over each of terms,sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; V ) -∇f (z; V ), B + sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; W ) -∇f (z; W ), B + sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (z; V ) -∇f (z; W ), B ,and noticing the second term is bounded above by the first term, results insup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (x; V ) -∇f (x; W ), B ≤ sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B 2 ∇f (x; V ) -∇f (z; V ), B + sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (z; V ) -∇f (z; W ), B ≤ sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B 2 ∇f (x; V ) -∇f (z; V ), B -V + 2 f (x; V ) -f (z; V ) + sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (z; V ) -∇f (z; W ), B ≤ sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B 2∥∇f (x; V ) -∇f (z; V )∥(R B + R V ) + 2 f (x; V ) -f (z; V ) + sup ∥x∥≤1 ∥V -W ∥≤R V ∥B-W ∥≤R B ∇f (z; V ) -∇f (z; W ), B .Instantiating Lemma A.7 part 1 ofJi et al. (2021) for all z ∈ C, we get that with probability at least 1 -3(√ d/ϵ) d δ, ∇f (z; V ) -∇f (z; W ), B ≤ 3ρ(R B + 2R V )R 1/3 V ln(e/δ)1/4 m 1/6 . By Lemma A.2 part 2 of Ji et al. (2021), with probability at least 1 -( √ d/ϵ) d δ, ∥W ∥ ≤ ( √ m + √ d + 2 ln(1/(( √ d/ϵ) d δ))). Assuming this holds, then for all ∥x -z∥ ≤ ϵ, f
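The Borel–Cantelli step invoked above for Corollary 4.3 can be illustrated numerically: when the failure probabilities $\delta^{(n)} = n^{-2}$ are summable, almost surely only finitely many failure events occur. The simulation below is only a toy illustration; the sample cap and seed are arbitrary choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent events A_n with P(A_n) = 1/n^2; the probabilities are summable,
# so by Borel-Cantelli only finitely many A_n occur almost surely.
n_max = 100_000
p = 1.0 / np.arange(1, n_max + 1) ** 2
occur = rng.random(n_max) < p

hits = np.flatnonzero(occur) + 1  # the indices n at which A_n occurred
print(len(hits), hits.max())      # a handful of events, all at small n
```

Doubling `n_max` does not change the picture: the tail sum beyond any fixed index shrinks to zero, which is exactly why the limsup bound holds with probability one.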

ACKNOWLEDGMENTS

The authors are grateful for support from the NSF under grant IIS-1750051.

APPENDIX

We stop training once we reach $t$ iterations, or once the parameter distance from initialization exceeds $2R_Z$. That is, we stop at iteration $T$, where
$$T = \min\big\{t,\ \inf\{i : \|W_i - W_0\| > 2R_Z\}\big\}.$$
By definition, $\|W_i - W_0\| \le 2R_Z \le R_{\mathrm{gd}}$ for all $i < T$. Then by Lemma A.13 part 2, with probability at least $1-\delta$, the risks computed with any iterate's features differ from their linearized counterparts by at most the stated bound; note that this holds for all iterations of interest, as well as when $B$ represents the reference parameters. By Lemma A.11 and the interchangeability of features, we get a bound in terms of $R_A^{(0)}(Z) + 2\eta T\kappa_1$. Rearranging and using the definition of $W_{\le t}$ gives the desired inequality. Note that if $\|W_T - Z\| \ge \|W_0 - Z\|$, we can bound the corresponding term above by $0$; otherwise, bounding the remaining term gives us the final bound.
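The stopping rule defining $T$ above can be sketched in code. This is a toy illustration: the update oracle `grad_step`, the dimensions, and the radius are hypothetical stand-ins, not the paper's actual adversarial gradient step.

```python
import numpy as np

def train_with_stopping(W0, grad_step, t, R_Z):
    """Run at most t steps of a generic update rule, stopping at
    T = min{t, inf{i : ||W_i - W_0|| > 2 R_Z}} as in the text."""
    iterates = [W0.copy()]
    W = W0.copy()
    for _ in range(t):
        if np.linalg.norm(W - W0) > 2 * R_Z:
            break  # distance from initialization exceeded 2 R_Z: stop early
        W = grad_step(W)
        iterates.append(W.copy())
    return iterates

# Toy update that drifts steadily away from initialization.
W0 = np.zeros(3)
step = lambda W: W + np.array([0.1, 0.0, 0.0])
iterates = train_with_stopping(W0, step, t=100, R_Z=0.5)
T = len(iterates) - 1
print(T)  # stops well before the iteration budget t = 100
```

Every iterate before the stopping time satisfies the distance constraint, which is precisely the property the proof uses to keep the iterates in the region where the linearization is accurate.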

A.4 ADVERSARIAL TRAINING RESULTS

We can now prove our results on adversarial training.

Proof of Theorem 4.1. To bound the risk of our final iterate, we will first linearize it, apply our generalization bound to get the linearized training risk, use our optimization lemma to get the linearized training risk of the reference model, and then repeat our steps in reverse to get the risk of the linearized finite reference model. Let $\kappa_1$, $\kappa_2$, and $\kappa_n$ denote the corresponding error terms. By Lemma A.13, with probability at least $1-\delta$, we have
$$R_A(W_{\le t}) \le R_A^{(0)}(W_{\le t}) + \kappa_1 \quad\text{and}\quad R_A^{(0)}(W_{\le t}) \le R_A(W_{\le t}) + \kappa_1.$$
By Lemma 4.6, with probability at least $1-\delta$, the optimization guarantee on the linearized training risk holds. By Lemma 4.4, with probability at least $1-5\delta$, the linearized adversarial risk of $W_{\le t}$ exceeds its empirical counterpart by at most $\rho R_{\mathrm{gd}}\kappa_n$, and with another probability at least $1-5\delta$, the corresponding bound for the reference parameters holds with $\rho R_Z\kappa_n$ in place of $\rho R_{\mathrm{gd}}\kappa_n$. Combining these inequalities and cancelling, then simplifying, gives the final bound.

Setting parameters appropriately, we can make all terms in Theorem 4.1 small, and get arbitrarily close to the optimal adversarial convex loss. This is encapsulated by Corollaries 4.2 and 4.3, which we prove at the same time. Fix any measurable $h$ with $R_A(h) \le \inf\{R_A(g) : g\ \text{measurable}\} + \epsilon/2$. Then within the proof of Lemma A.11 of Ji et al. (2021) it is shown that with probability at least $1-6\delta$, we can sample finite-width reference parameters $Z$ such that
$$R_A^{(0)}(Z) \le R_A(h) + \kappa_2 + \epsilon/2 \le \inf\{R_A(g) : g\ \text{measurable}\} + \kappa_2 + \epsilon.$$
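As a concrete illustration of the training scheme analyzed here (a shallow network trained by gradient descent on the adversarial empirical risk for a fixed small number of steps), the sketch below runs adversarial training with an inner projected-gradient-ascent adversary. The PGD adversary, the $\ell_2$ radius, the toy data, and all hyperparameters are stand-in assumptions: the analysis assumes an idealized optimal adversary, which no efficient attack exactly implements.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 256, 64                      # hypothetical input dim, width, samples
rho, eps_adv, lr = 1.0, 0.1, 0.5

a = rng.choice([-1.0, 1.0], size=m)       # fixed outer signs
W = rng.normal(size=(m, d)) / np.sqrt(d)  # trained inner weights, near init

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0] + 1e-12)              # toy labels

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def f(Xb, W):
    # shallow net f(x; W) = (rho / sqrt(m)) * sum_j a_j relu(w_j^T x)
    return (rho / np.sqrt(m)) * np.maximum(Xb @ W.T, 0.0) @ a

def adv_loss(Xb, yb, W):
    return np.mean(np.log1p(np.exp(-yb * f(Xb, W))))

def attack(Xb, yb, W, steps=5, step_size=0.05):
    # Projected gradient ascent on the logistic loss within an l2 ball:
    # a computational stand-in for the idealized optimal adversary.
    delta = np.zeros_like(Xb)
    for _ in range(steps):
        Z = Xb + delta
        act = (Z @ W.T > 0).astype(float)        # ReLU activation pattern
        gx = (rho / np.sqrt(m)) * (act * a) @ W  # df/dx for each sample
        coef = -yb * sigmoid(-yb * f(Z, W))      # dloss/df per sample
        delta += step_size * coef[:, None] * gx  # ascend the loss
        norms = np.maximum(np.linalg.norm(delta, axis=1, keepdims=True), 1e-12)
        delta *= np.minimum(1.0, eps_adv / norms)  # project onto the eps ball
    return Xb + delta

def grad_W(Xb, yb, W):
    act = (Xb @ W.T > 0).astype(float)
    coef = -yb * sigmoid(-yb * f(Xb, W))
    return (rho / np.sqrt(m)) * ((coef[:, None] * act * a).T @ Xb) / len(yb)

before = adv_loss(attack(X, y, W), y, W)
for _ in range(80):                       # early stopped: a fixed small budget
    W = W - lr * grad_W(attack(X, y, W), y, W)
after = adv_loss(attack(X, y, W), y, W)
print(before, after)                      # adversarial training loss decreases
```

Keeping the iteration budget small plays the role of the early stopping rule in the analysis; the outer signs `a` stay fixed and only the inner weights move, matching the shallow-network setup.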

