SMOOTHED-SGDMAX: A STABILITY-INSPIRED ALGORITHM TO IMPROVE ADVERSARIAL GENERALIZATION

Abstract

Unlike in standard training, deep neural networks can suffer from serious overfitting in adversarial settings. Recent research (Xing et al., 2021b; Xiao et al., 2022) suggests that adversarial training can have a nonvanishing generalization error even as the sample size n goes to infinity. A natural question arises: can we eliminate the generalization error floor in adversarial training? This paper gives an affirmative answer. First, by adapting an information-theoretic lower bound on the complexity of solving Lipschitz-convex problems with randomized algorithms, we establish a minimax lower bound Ω(s(T)/n) on the generalization gap in non-smooth settings given a training loss of 1/s(T), where T is the number of iterations and s(T) → +∞ as T → +∞. Next, observing that the nonvanishing generalization error of existing adversarial training algorithms comes from the non-smoothness of the adversarial loss function, we apply a smoothing technique to the adversarial loss. Based on the smoothed loss, we prove that a smoothed version of the SGDmax algorithm achieves a generalization bound of O(s(T)/n), which eliminates the generalization error floor and matches the minimax lower bound. Experimentally, we show that the Smoothed-SGDmax algorithm improves adversarial generalization on common datasets.

1. INTRODUCTION

Deep neural networks (DNNs) (Krizhevsky et al., 2012; Hochreiter & Schmidhuber, 1997) have been highly successful and rarely suffer from overfitting in standard training (Zhang et al., 2021). This phenomenon is also called benign overfitting: a well-trained neural network generalizes well to the test data. In adversarial machine learning, however, overfitting becomes a serious issue (Rice et al., 2020). Before the training algorithm converges, the robust test error starts to increase. This special type of overfitting is called robust overfitting and can be observed in experiments on common datasets; see Fig. 1, orange curve. Mitigating robust overfitting is therefore important for increasing the adversarial robustness of a DNN model. Several recent works have tried to identify the causes of robust overfitting and designed methods to mitigate it; see the discussion in Sec. 2.

A recent line of work (Xing et al., 2021b; Xiao et al., 2022) studied the robust overfitting of adversarial training from a theoretical perspective, using the notion of uniform algorithmic stability. Uniform algorithmic stability (UAS) (Bousquet & Elisseeff, 2002) was introduced to bound the generalization gap in machine learning problems. It provides algorithm-specific generalization bounds, in contrast to algorithm-independent bounds such as the classical results based on VC-dimension (Vapnik & Chervonenkis, 2015) and Rademacher complexity (Bartlett & Mendelson, 2002). Such stability-based bounds give insight into the generalization ability of neural network models trained by different algorithms. Traditional adversarial training performs stochastic gradient descent (SGD) on the max function of the standard counterpart, which is also called SGDmax (Farnia & Ozdaglar, 2021). We do not distinguish between the two names, "SGDmax" and "adversarial training (AT)", in this paper.
The works of (Xing et al., 2021b; Xiao et al., 2022) both showed that SGDmax incurs a stability-based generalization bound of O(c(T) + s(T)/n). Here T is the number of iterations, n is the number of samples, s(T) is a function satisfying s(T) → +∞ as T → +∞, and c(T) is a sample size-independent term that increases with T. Details of the forms of s(T) and c(T) are discussed in Sec. 4 and Sec. 5. They also provided matching lower bounds showing that the sample size-independent term is unavoidable for SGDmax-based adversarial training algorithms. This provides a possible explanation of robust overfitting: even with an arbitrarily large number of training samples, the adversarial generalization gap does not vanish.

Table 1: Comparison of stability-based generalization bounds on the adversarial generalization gap. c_1(T) and c_2(T) are sample size-independent terms. Details of the forms of s(T), c_1(T), c_2(T) are discussed in Sec. 4 and Sec. 5.

Algorithm        | Upper bound          | Worst-case lower bound | Achieves minimax lower bound Ω(s(T)/n)
SGDmax           | O(c_1(T) + s(T)/n)   | Ω(c_2(T) + s(T)/n)     | No
Smoothed-SGDmax  | O(s(T)/n)            | Ω(s(T)/n)              | Yes

The first question arises: what is the lower bound of the generalization gap for algorithms in adversarial machine learning settings? To answer it, we develop a minimax lower bound, Ω(s(T)/n), for the generalization gap in non-smooth settings when the training loss is 1/s(T). Clearly, SGDmax does not achieve this lower bound. We are therefore motivated to design algorithms that remove the non-vanishing sample size-independent term. The main question of our paper follows: can we eliminate the error floor in generalization bounds of the adversarial generalization gap? We call the term c(T) the generalization error floor. It has been observed that c(T) comes from the non-smoothness of the adversarial loss. Hence, the stability of some smoothed algorithms has been studied recently.
These smoothed algorithms include noise-SGD and differential privacy-SGD (Bassily et al., 2020), adding noise to weights and data (Xing et al., 2021b), stochastic weight averaging, and cyclic learning rates (Xiao et al., 2022). Unfortunately, they cannot eliminate the generalization error floor.

In this paper, we employ a smoothing technique based on the Moreau envelope to smooth the adversarial loss, and perform gradient descent on this smooth surrogate. Following the name SGDmax, we refer to the smoothed version of SGDmax as Smoothed-SGDmax. We prove that Smoothed-SGDmax has the same training loss, 1/s(T), on the adversarial loss. Most importantly, Smoothed-SGDmax eliminates the generalization error floor and achieves the minimax lower bound Ω(s(T)/n) of the generalization gap. Table 1 compares the stability-based generalization upper and lower bounds of our proposed algorithm with those of the SGDmax-based adversarial training algorithm. Additionally, our proposed algorithm can be viewed as a general form of stochastic weight averaging (SWA) (Izmailov et al., 2018). As a by-product, we provide an understanding of SWA in our framework; see the discussion in Sec. 5.4. In Fig. 1, we show the training procedure of our proposed algorithm as well as adversarial training on CIFAR-10.

The contributions of our work are listed as follows:
1. Main result: we prove that the generalization error floor in non-smooth loss minimization can be eliminated by a properly designed algorithm, which we call Smoothed-SGDmax.
2. We develop the minimax lower bound of the generalization gap in non-smooth loss minimization. Specifically, we show that an algorithm has a generalization gap of at least Ω(s(T)/n) if its training loss is 1/s(T). Smoothed-SGDmax achieves this minimax lower bound.
3. Experiments on common datasets verify the theoretical results and show the effectiveness of our proposed algorithm in practice.

2. RELATED WORK

Adversarial Robustness. Starting from the work of (Szegedy et al., 2013), it is now well known that deep neural networks trained via standard gradient descent based algorithms are highly susceptible to imperceptible corruptions of the input data (Goodfellow et al., 2014; Chen et al., 2017; Carlini & Wagner, 2017). Adversarial training and its variants have been proposed to improve the adversarial robustness of DNNs (Madry et al., 2017; Wu et al., 2020; Gowal et al., 2020).

Robust Overfitting. Starting from the work of (Rice et al., 2020), a series of works studied the causes of robust overfitting. (Yu et al., 2022) studied robust overfitting from the perspective of the adversarial distribution. (Chen et al., 2021) leveraged knowledge distillation and self-training to mitigate robust overfitting.

Learning Theory for Adversarial Generalization. Classical learning theory: the works of (Attias et al., 2021; Montasser et al., 2019) explained generalization in adversarial settings using VC-dimension. The works of (Yin et al., 2019; Khim & Loh, 2018) studied the poor generalization of adversarial training using tools from Rademacher complexity. However, VC-dimension and Rademacher complexity give algorithm-independent generalization bounds, so they cannot reveal the effect of the algorithm on generalization. Other theoretical analyses: (Sinha et al., 2017) studied the generalization of an adversarial training algorithm in terms of distributional robustness. The works of (Xing et al., 2021a; c; Javanmard et al., 2020) studied generalization properties in the setting of linear regression. Gaussian mixture models have been used to analyze adversarial generalization (Taheri et al., 2020; Javanmard et al., 2020; Dan et al., 2020). The work of (Allen-Zhu & Li, 2020) explains adversarial generalization through the lens of feature purification.

Uniform Stability. Stability can be traced back to the work of (Rogers & Wagner, 1978). In statistical learning, it was developed into a tool for deriving algorithm-dependent generalization bounds (Bousquet & Elisseeff, 2002). These bounds have been significantly improved in a recent sequence of works (Feldman & Vondrak, 2018; 2019). The work of (Chen et al., 2018) discussed the optimal trade-off between stability and convergence. (Bassily et al., 2020) studied the stability of SGD on non-smooth losses and proved that the generalization bound contains a sample size-independent term. The works of (Xing et al., 2021b; Xiao et al., 2022) showed that the adversarial loss is non-smooth and that SGDmax-based adversarial training algorithms incur the generalization error floor.

3. PRELIMINARIES: STABILITY ANALYSIS FOR GENERALIZATION GAP

Let $\mathcal{D}$ be an unknown distribution over the sample space $\mathcal{Z}$, and let $S = \{z_1, \dots, z_n\} \sim \mathcal{D}^n$ be a sample dataset drawn i.i.d. according to $\mathcal{D}$. Our goal is to find a model $w$ with small population risk, defined as $R_{\mathcal{D}}(w) = \mathbb{E}_{z\sim\mathcal{D}}\, h(w, z)$, where $h(\cdot,\cdot)$ is the loss function. Since we cannot minimize $R_{\mathcal{D}}(w)$ directly, we instead minimize the empirical risk, defined as $R_S(w) = \frac{1}{n}\sum_{i=1}^{n} h(w, z_i)$. Let $w^*$ be an optimal solution of $\min_w R_S(w)$. Then, for the algorithm output $\hat{w} = A(S)$, we define the expected generalization gap as

$$E_{gen}(A, h, n, \mathcal{D}) = \mathbb{E}_{S\sim\mathcal{D}^n, A}\,[R_{\mathcal{D}}(A(S)) - R_S(A(S))]. \qquad (3.1)$$

We define the expected optimization gap as

$$E_{opt}(A, h, n, \mathcal{D}) = \mathbb{E}_{S\sim\mathcal{D}^n, A}\,[R_S(A(S)) - R_S(w^*)]. \qquad (3.2)$$

We use $E_{gen}$ and $E_{opt}$ as shorthand for the above definitions. To bound the generalization gap of a model $\hat{w} = A(S)$ trained by a randomized algorithm $A$, we employ the following notion of uniform stability.

Definition 3.1. A randomized algorithm $A$ is $\varepsilon$-uniformly stable if, for all datasets $S, S' \in \mathcal{Z}^n$ that differ in at most one example, we have

$$\sup_z \mathbb{E}_A\,[h(A(S); z) - h(A(S'); z)] \le \varepsilon. \qquad (3.3)$$

The following theorem shows that the expected generalization gap can be bounded via uniform stability.

Theorem 3.1 (Generalization in expectation (Hardt et al., 2016)). Let $A$ be $\varepsilon$-uniformly stable. Then, the expected generalization gap satisfies

$$|E_{gen}| = |\mathbb{E}_{S,A}\,[R_{\mathcal{D}}(A(S)) - R_S(A(S))]| \le \varepsilon.$$

Uniform Argument Stability (UAS). If $h$ is $L$-Lipschitz, i.e., $|h(w_1; z) - h(w_2; z)| \le L\|w_1 - w_2\|$, we can use $\mathrm{UAS} = \mathbb{E}\|A(S) - A(S')\|$ to measure the generalization gap, since Lipschitz continuity gives $\mathbb{E}_A[h(A(S); z) - h(A(S'); z)] \le L\,\mathbb{E}\|A(S) - A(S')\|$.
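As a minimal sketch of Definition 3.1 and of UAS (not part of the original analysis), the snippet below runs SGD with a shared random seed on two neighbouring datasets and reports the average distance between the two outputs. The toy loss $h(w; z) = |w - z|$, the function names, and all parameter values are illustrative assumptions.

```python
import numpy as np

def sgd_abs_loss(data, steps, lr, seed):
    """Run SGD on the toy loss h(w; z) = |w - z| and return the final iterate."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(steps):
        z = data[rng.integers(len(data))]
        w -= lr * np.sign(w - z)  # a subgradient of |w - z| (0 at the kink)
    return float(w)

def argument_stability(n=50, steps=200, lr=0.05, trials=20):
    """Estimate E|A(S) - A(S')| for two datasets that differ in one example."""
    rng = np.random.default_rng(0)
    S = rng.normal(size=n)
    S_prime = S.copy()
    S_prime[0] = rng.normal()  # replace a single example
    gaps = [
        abs(sgd_abs_loss(S, steps, lr, seed) - sgd_abs_loss(S_prime, steps, lr, seed))
        for seed in range(trials)  # the same seed couples the sampling on S and S'
    ]
    return float(np.mean(gaps))

if __name__ == "__main__":
    print("estimated UAS:", argument_stability())
```

Sharing the seed between the two runs mirrors the coupling used in stability proofs: both runs pick the same indices, so the outputs differ only through the replaced example.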

3.1. SGDMAX INCURS GENERALIZATION ERROR FLOOR

Adversarial Loss. In adversarial training, we consider the following adversarial loss

$$h(w; z) = \max_{\|z - z'\| \le \epsilon} g(w; z'),$$

where $g(w; z)$ is the loss function of the standard counterpart. In practice, $w$ is usually the parameter vector of a neural network.

Generalization Error Floor. As discussed in (Xing et al., 2021b; Xiao et al., 2022), even if $g$ is a smooth function, $h$ is not necessarily smooth. They assumed $h$ to be either generally non-smooth or $\eta$-approximately smooth, where the latter class is a subset of the non-smooth functions. Under both assumptions, there exist non-vanishing terms in the bounds of UAS:

$$c_2(T) + \frac{LT\alpha}{n} \le \mathrm{UAS} \le c_1(T) + \frac{LT\alpha}{n}, \qquad (3.5)$$

where the forms of $c_1(T)$ and $c_2(T)$ are listed in Table 2. We refer to $c_1(T)$ and $c_2(T)$ as generalization error floors.

Table 2: Generalization error floor in previous studies.

Work                  | Assumption on h  | Upper bound c_1(T)    | Lower bound c_2(T)
(Xing et al., 2021b)  | non-smooth       | O(Lα√T) (Prop. 1)     | Ω(α√T) (Thm. 1)
(Xiao et al., 2022)   | η-approx-smooth  | O(ηαT) (Thm. 5.1)     | Ω(ηαT) (Thm. 5.2)
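To make the non-smoothness concrete, here is a minimal sketch under assumptions that are not in the text: a linear classifier with logistic loss and an $\ell_\infty$ perturbation of the input, for which the inner maximization has the well-known closed form $\log(1 + \exp(-y\,w^\top x + \epsilon\|w\|_1))$. The standard loss is smooth in $w$, while the adversarial loss has a kink wherever a coordinate of $w$ is zero. All function names are illustrative.

```python
import numpy as np

def standard_loss(w, x, y):
    """Logistic loss g(w; z) for a linear score, with z = (x, y) and y in {-1, +1}."""
    return np.logaddexp(0.0, -y * np.dot(w, x))

def adversarial_loss(w, x, y, eps):
    """Closed form of max over ||d||_inf <= eps of standard_loss(w, x + d, y)."""
    return np.logaddexp(0.0, -y * np.dot(w, x) + eps * np.sum(np.abs(w)))

def one_sided_slopes(f, w, i, step=1e-6):
    """Left and right finite-difference slopes of f along coordinate i at w."""
    e = np.zeros_like(w)
    e[i] = step
    left = (f(w) - f(w - e)) / step
    right = (f(w + e) - f(w)) / step
    return left, right

if __name__ == "__main__":
    w = np.array([0.0, 1.0])  # first coordinate sits exactly at the kink
    x = np.array([0.5, -0.3])
    y, eps = 1.0, 0.1
    print("standard   :", one_sided_slopes(lambda v: standard_loss(v, x, y), w, 0))
    print("adversarial:", one_sided_slopes(lambda v: adversarial_loss(v, x, y, eps), w, 0))
```

For the standard loss the two one-sided slopes agree, while for the adversarial loss they differ by roughly $2\epsilon$ times the sigmoid factor, which is the gradient discontinuity behind the error floor discussed above.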

4. MINIMAX LOWER BOUND

Following the work of (Xing et al., 2021b), we mainly consider the following class of convex, non-smooth, Lipschitz functions throughout the paper:

$$\mathcal{H} = \{h : \mathcal{W} \times \mathcal{Z} \to \mathbb{R} \mid h \text{ is convex and } L\text{-Lipschitz in } w,\ |\mathcal{W}| = D_W\}. \qquad (4.1)$$

The $L$-Lipschitz assumption has been standard in uniform stability analysis since (Hardt et al., 2016). The convexity assumption allows a comparison with existing results and is used to develop the following minimax lower bound.

Definition 4.1 (Training Loss). We say an algorithm class $\mathcal{A}$ has training loss $1/s(T)$ on a function class $\mathcal{H}$ if, for all $A \in \mathcal{A}$ and $h \in \mathcal{H}$, running $A$ on $h$ for $T$ iterations gives

$$E_{opt}(A, h, n, \mathcal{D}) \le O\!\left(\frac{1}{s(T)}\right), \quad \text{where } \lim_{T\to+\infty} s(T) = +\infty.$$

Proposition 4.1 (Minimax lower bound of generalization gap). Let $\mathcal{H}$ be the function class defined in Eq. (4.1). Let $\mathcal{A}$ be the class of randomized algorithms using $n$ samples with training loss $1/s(T)$ on $\mathcal{H}$. For all $n$, there exists $T$ such that the following lower bound holds:

$$\min_{A\in\mathcal{A}} \max_{\mathcal{D}}\ E_{gen}(A, h, n, \mathcal{D}) \ge \Omega\!\left(\frac{s(T)}{n}\right). \qquad (4.2)$$

The proof of Prop. 4.1 is based on a lower bound on the complexity of Lipschitz-convex problems ((Nemirovskij & Yudin, 1983), Ch. 4); see Appendix A.1. SGDmax cannot achieve this minimax lower bound, because its generalization bound contains the sample size-independent floor $c(T)$.

5. SMOOTHED-SGDMAX: ELIMINATING GENERALIZATION ERROR FLOOR

In this section, we design an algorithm satisfying the following two properties:
1. It has the same training loss as the SGDmax algorithm.
2. If it achieves training loss $1/s(T)$ after $T$ iterations, then its generalization gap is bounded by $O(s(T)/n)$.

5.1. SMOOTH SURROGATE ADVERSARIAL LOSS

The non-smoothness of $h$ leads to a poor generalization bound. This motivates us to construct a smooth surrogate loss to improve adversarial generalization. Inspired by the work of (Zhang & Luo, 2020), we use the Moreau envelope to smooth the adversarial loss. Let

$$K(w, u; z) = h(w; z) + \frac{p}{2}\|w - u\|^2. \qquad (5.1)$$

If $h$ is $l$-weakly convex, we can choose $p > l$ to ensure that $K(w, u; z)$ is strongly convex with respect to $w$. If $h$ is convex, we only need $p > 0$. Writing $K(w, u; S) = \frac{1}{n}\sum_{z\in S} K(w, u; z)$, we define the Moreau envelope function and its minimizer:

$$M(u; S) = \min_{w\in\mathcal{W}} K(w, u; S), \qquad (5.2)$$

$$w(u; S) = \arg\min_{w\in\mathcal{W}} K(w, u; S). \qquad (5.3)$$

Then $M(u; S)$ is a smooth function. Formally, we state the result as follows.

Lemma 5.1. Assume that $h$ is $l$-weakly convex and let $p > l$. Then $M(u; S)$ satisfies:
1. $\min_u M(u; S)$ has the same global solutions as $\min_w R_S(w)$.
2. The gradient of $M(u; S)$ is $\nabla_u M(u; S) = p(u - w(u; S))$.
3. $M(u; S)$ is $pl/(p - l)$-weakly convex.
4. $M(u; S)$ is $(2p^2 - pl)/(p - l)$-gradient Lipschitz continuous.
5. $M(u; S)$ has gradient norm bounded by $L$.

Remark: The proof of Lemma 5.1 is due to (Rockafellar, 1976) and is also provided in Appendix A.1. We focus on the case where $h$ is convex in the main text. Then, with $l = 0$, Lemma 5.1.3 and 5.1.4 reduce to the statement that $M(u; S)$ is convex and $2p$-gradient Lipschitz. Lemma 5.1 is stated for the general $l$-weakly convex case for further theoretical studies.

Since $M(u; S)$ has the same global solutions as $R_S(w)$, we can do adversarial training using the smooth objective $M(u; S)$. A natural way is to perform gradient descent on $M(u; S)$. By Lemma 5.1, estimating the gradient requires estimating the solution of the minimization problem $\min_w K(w, u; S)$. Depending on whether we solve this subproblem exactly or not, we obtain the exact approach and the inexact approach.
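As a worked instance of Lemma 5.1 (an illustration under an assumed toy loss, not the adversarial loss itself), take the convex, 1-Lipschitz function $h(w) = |w|$. Its minimizer $w(u)$ is soft-thresholding at $1/p$ and its Moreau envelope is the Huber function, so the gradient formula $p(u - w(u))$ and the bound on the gradient norm can be checked numerically. The function names are hypothetical.

```python
import numpy as np

def prox_abs(u, p):
    """w(u) = argmin_w |w| + (p/2)(w - u)^2, i.e. soft-thresholding at 1/p."""
    return np.sign(u) * np.maximum(np.abs(u) - 1.0 / p, 0.0)

def moreau_envelope_abs(u, p):
    """M(u) = min_w |w| + (p/2)(w - u)^2, evaluated at the minimizer w(u)."""
    w = prox_abs(u, p)
    return np.abs(w) + 0.5 * p * (w - u) ** 2

def moreau_gradient_abs(u, p):
    """Gradient formula of Lemma 5.1, item 2: p (u - w(u))."""
    return p * (u - prox_abs(u, p))

if __name__ == "__main__":
    p, step = 2.0, 1e-5
    u = np.linspace(-3.0, 3.0, 13)
    numeric = (moreau_envelope_abs(u + step, p) - moreau_envelope_abs(u - step, p)) / (2 * step)
    print("max gradient error:", np.max(np.abs(numeric - moreau_gradient_abs(u, p))))
    print("max gradient norm :", np.max(np.abs(moreau_gradient_abs(u, p))))
```

In this example the gradient is $pu$ for $|u| \le 1/p$ and $\mathrm{sign}(u)$ otherwise, so it is bounded by $L = 1$ and is $p$-Lipschitz, which is within the $2p$ constant stated for the convex case.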

5.2. EXACT APPROACH

We first consider the exact approach, which runs gradient descent (GD) on $M(u; S)$.

Theorem 5.1. Assume $h$ is a convex, $L$-Lipschitz function. Suppose we run GD on the smoothed surrogate adversarial loss $M(u; S)$ defined in Eq. (5.2) with fixed stepsize $\alpha \le 1/\sqrt{T}$ for $T \ge 4p^2$ steps. Then the optimization and generalization gaps satisfy

$$E_{opt} \le O\!\left(\frac{1}{T\alpha}\right) \quad \text{and} \quad E_{gen} \le \frac{2L^2 T\alpha}{n}. \qquad (5.4)$$

Remark: Thm. 5.1 does not follow from the work of (Hardt et al., 2016). Notice that

$$M(u; S) = \min_{w\in\mathcal{W}} \frac{1}{n}\sum_{z\in S} K(w, u; z) \ne \frac{1}{n}\sum_{z\in S} \min_{w\in\mathcal{W}} K(w, u; z),$$

so $\min_u M(u; S)$ is not a finite-sum problem, whereas the analysis in (Hardt et al., 2016) applies only to finite-sum problems. Thm. 5.1 therefore requires a different proof, which has two steps: 1) build the recursion from $\|u^t_S - u^t_{S'}\|$ to $\|u^{t+1}_S - u^{t+1}_{S'}\|$; 2) unwind the recursion. The main challenge comes from the first step. To this end, we develop a new error bound and a different decomposition to build the recursion. Details are deferred to Appendix A.3.

Thm. 5.1 is our first main result. It shows that the exact approach achieves the minimax lower bound of the generalization gap. The extension to weakly convex cases is provided in Appendix B. However, the exact approach requires the exact minimization of $K(w, u; S)$, which is sometimes computationally intractable. To address this issue, we consider the inexact approach below.
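The exact approach and the error bound behind its stability proof (Lemma A.1 in the appendix) can be checked numerically on a toy problem. The sketch below assumes the one-dimensional loss $h(w; z) = |w - z|$ (so $L = 1$, $l = 0$) on an unconstrained domain, solves the subproblem by bisection on the subgradient of $K$, and runs GD on $M(u; S)$ for two neighbouring datasets. The function names and parameter values are illustrative assumptions.

```python
import numpy as np

def exact_prox(u, data, p, iters=100):
    """w(u; S) = argmin_w (1/n) sum_i |w - z_i| + (p/2)(w - u)^2, by bisection."""
    # The subgradient of K is negative at lo and positive at hi, so they bracket w(u; S).
    lo, hi = u - 1.0 / p - 1.0, u + 1.0 / p + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        slope = np.mean(np.sign(mid - data)) + p * (mid - u)  # a subgradient of K at mid
        if slope < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_gd_on_envelope(data, p=1.0, T=100, u0=0.0):
    """GD on M(u; S) using the gradient p (u - w(u; S)) and stepsize 1/sqrt(T)."""
    alpha = 1.0 / np.sqrt(T)
    u = u0
    for _ in range(T):
        u = u - alpha * p * (u - exact_prox(u, data, p))
    return u

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, T = 20, 1.0, 100
    S = rng.normal(size=n)
    S_prime = S.copy()
    S_prime[0] = rng.normal()  # neighbouring dataset
    gap = abs(exact_gd_on_envelope(S, p, T) - exact_gd_on_envelope(S_prime, p, T))
    print("|u_S - u_S'| =", gap, " bound 2 L T alpha / n =", 2.0 * np.sqrt(T) / n)
```

With these choices the distance between the two outputs stays below $2L\sum_t \alpha_t / n$, in line with unwinding the recursion described in the remark.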

5.3. THE INEXACT APPROACH

The inexact approach estimates $\nabla_u M(u; S)$ by inexactly solving $\min_w K(w, u; S)$. To this end, we perform multiple steps of SGD on the subproblem $\min_w K(w, u; S)$ to obtain an estimate $\hat{w}(u)$ of the true $w(u)$, and then use $\hat{w}(u)$ to estimate $\nabla_u M(u; S)$.

Algorithm 1 Smoothed-SGDmax
1: Initialize $w^0$, $u^0$;
2: Choose stepsizes $c^t_s > 0$ and $\alpha_t > 0$;
3: for $t = 0, 1, 2, \dots, T$ do
4:     Let $w^t_0 = w^t$;
5:     for $s = 0, 1, 2, \dots, N$ do
6:         Draw a sample $z^t_s$ from $S$ uniformly;
7:         $w^t_{s+1} = P_{\mathcal{W}}(w^t_s - c^t_s \nabla_w K(w^t_s, u^t; z^t_s))$;
8:     end for
9:     $w^{t+1} = w^t_N$;
10:    $u^{t+1} = u^t + \alpha_t p (w^{t+1} - u^t)$;
11: end for

In Step 7 of Alg. 1, we run SGD on $K(w, u; S)$ w.r.t. $w$ to find a solution given $u$. In Step 10, we run GD on $K(w, u; S)$ w.r.t. $u$, since $\nabla_u K(w, u; S) = p(u - w)$. To provide upper bounds on the optimization gap and generalization gap of Alg. 1, we need the following lemma for the inner optimization.

Lemma 5.2 (inner-loop error). Given $t$ and $u^t$, suppose we run SGD on $K(w, u^t; S)$ w.r.t. $w$ with stepsize $c^t_s \le 1/((p - l)s)$ for $N$ steps. Then $w^t_N$ is approximately the minimizer with error $C_1^2/N$, i.e.,

$$\mathbb{E}\|w^t_N - w(u^t)\|^2 \le \frac{C_1^2}{N}, \quad \text{where } C_1 = \frac{L + pD_W}{p - l}.$$

In the convex case, i.e., $l = 0$, we have $C_1 = L/p + D_W$.

Lemma 5.2 bounds the optimization error of the inner loop. In words, if we run the inner loop for sufficiently many steps, we can approximate the smoothed loss $M(u; S)$. Below we provide the training loss and uniform stability of Smoothed-SGDmax with sufficiently many inner steps.

Theorem 5.2 (Training loss of Smoothed-SGDmax). Suppose $h$ is convex and $L$-Lipschitz. In Alg. 1, if we choose inner stepsize $c^t_s \le 1/(ps)$, number of inner steps $N = T$, outer stepsize $\alpha \le 1/\sqrt{T}$, and $T \ge 4p^2$, the optimization gap satisfies

$$E_{opt} \le \frac{\|u^0 - u^*\|^2 + 2pC_1 D_W + (L + pD_W)^2}{2T\alpha} = \frac{C_2}{T\alpha},$$

where $C_2 = \|u^0 - u^*\|^2/2 + pC_1 D_W + (L + pD_W)^2/2$.

Theorem 5.3 (Generalization bound of Smoothed-SGDmax). Assume that $h$ is convex and $L$-Lipschitz. In Alg. 1, if we choose inner stepsize $c^t_s \le 1/(ps)$, number of inner steps $N = n^2$, outer stepsize $\alpha_t \le 1/\sqrt{T}$, and $T \ge 4p^2$, the generalization gap satisfies

$$E_{gen} \le L\left(\frac{2C_1 p}{n} + \frac{2L}{n}\right)\sum_{t=1}^{T}\alpha_t = \frac{C_3}{n}\sum_{t=1}^{T}\alpha_t,$$

where $C_3 = L(4L + 2pD_W)$.

Thm. 5.2 and 5.3 are the main results of our paper. For fixed stepsize $\alpha_t = \alpha$, they show that Alg. 1 has training loss $O(1/(T\alpha))$ and an optimal generalization bound of $O(T\alpha/n)$.
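The following is a minimal sketch of Alg. 1 on a toy problem, not the authors' implementation. It assumes the one-dimensional convex loss $h(w; z) = |w - z|$ in place of an adversarial loss, takes $\mathcal{W}$ to be an interval so that the projection is a clip, uses inner stepsize $1/(ps)$, and runs the loops exactly $T$ and $N$ times. The function name and default values are hypothetical.

```python
import numpy as np

def smoothed_sgdmax(data, p=1.0, T=100, N=None, alpha=None, radius=5.0, seed=0):
    """Toy Smoothed-SGDmax for h(w; z) = |w - z| on W = [-radius, radius]."""
    rng = np.random.default_rng(seed)
    N = T if N is None else N                              # N = T as in Thm. 5.2
    alpha = 1.0 / np.sqrt(T) if alpha is None else alpha   # outer stepsize <= 1/sqrt(T)
    w, u = 0.0, 0.0
    for _ in range(T):
        # Inner loop (Steps 5-8): projected SGD on K(w, u; S) with respect to w.
        for s in range(1, N + 1):
            z = data[rng.integers(len(data))]
            grad = np.sign(w - z) + p * (w - u)   # subgradient of K(w, u; z) in w
            w = float(np.clip(w - grad / (p * s), -radius, radius))
        # Outer step (Step 10): gradient step on K with respect to u.
        u = u + alpha * p * (w - u)
    return w, u

if __name__ == "__main__":
    data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    w, u = smoothed_sgdmax(data)
    print("w =", w, " u =", u, " empirical risk minimizer (median) =", np.median(data))
```

On this example the empirical risk is minimized at the median of the data, so $u$ is expected to drift towards it. The proximal term keeps each inner problem strongly convex, which is what permits the $1/(ps)$ stepsize.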

Interpretation of the Number of Steps. In practice, if we use batch size 1 and go through the whole dataset in each epoch, $T$ can be viewed as the number of epochs and $N$ as the number of samples. Letting $T\alpha = \sqrt{C_2 n/C_3}$, we obtain the optimal excess risk with respect to $T$ and $\alpha$, i.e., $E_{opt} + E_{gen} \le 2\sqrt{C_2 C_3/n}$.
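The choice of $T\alpha$ comes from balancing the two bounds of Thm. 5.2 and Thm. 5.3 for a fixed stepsize. A short derivation, treating $x = T\alpha$ as a free positive variable:

```latex
\begin{aligned}
E_{opt} + E_{gen} &\le f(x) := \frac{C_2}{x} + \frac{C_3\,x}{n}, \qquad x = T\alpha, \\
f'(x) = -\frac{C_2}{x^2} + \frac{C_3}{n} = 0
  &\iff x = \sqrt{\frac{C_2\,n}{C_3}}, \\
f\!\left(\sqrt{\frac{C_2\,n}{C_3}}\right)
  &= \sqrt{\frac{C_2 C_3}{n}} + \sqrt{\frac{C_2 C_3}{n}}
   = 2\sqrt{\frac{C_2 C_3}{n}}.
\end{aligned}
```

Since $f$ is convex on $x > 0$, this stationary point is the minimizer, and the resulting excess risk decays at the rate $1/\sqrt{n}$.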

5.4. FURTHER COMPARISON WITH EXISTING ALGORITHMS

In Alg. 1, Step 7 simply runs SGD on $K(w, u; z) = h(w; z) + p\|w - u\|^2/2$ instead of $h(w; z)$. The additional term can be viewed as a regularization term similar to weight decay. Step 10 is a model averaging step similar to stochastic weight averaging (SWA). We compare Smoothed-SGDmax with some existing algorithms in detail. The comparison is summarized in Table 3: only Smoothed-SGDmax reduces the generalization error floor.

Proximal Update. The stability analysis of the proximal update is given in (Hardt et al., 2016), Def. 4.5 and Lemma 4.6. It is proved that the proximal update is 1-expansive if $f$ is convex. Therefore, the generalization bound of the proximal update is no larger than that of SGD. In non-smooth cases, SGD incurs an error floor, and the proximal update is not guaranteed to eliminate it.

Stochastic Weight Averaging. Stochastic weight averaging suggests using a weighted average of the iterates, rather than the final iterate, for inference. The update rule of SWA is $u^{t+1} = \tau_t u^t + (1 - \tau_t) w^{t+1}$. The work of (Xiao et al., 2022) provides a generalization bound for SWA in the case that $u$ is the average of the iterates, which is equivalent to using $\tau_t = (t - 1)/t$. The generalization bound in this case is

$$E_{gen}(\mathrm{SWA}) \le (L L_z \epsilon + 2L^2/n)\,T\alpha. \qquad (5.9)$$

The sample size-independent term is one-half of the one without SWA. However, the additional term is still unavoidable in that analysis, so SWA is not guaranteed to achieve the minimax lower bound there.

Optimal Generalization Bound of SWA in our Regime. In Alg. 1, if we denote $\tau_t = 1 - \alpha_t p$, Step 10 can be viewed as a weight averaging step. Thm. 5.3 requires $\alpha_t \le 1/\sqrt{T}$ with $T \ge 4p^2$, hence $\alpha_t \le 1/(2p)$ and $\tau_t = 1 - \alpha_t p \ge 1/2$. Therefore, by fixing $\alpha_t p$ to be constant and letting $p \to 0$, our proposed algorithm reduces to SWA. In other words, our proposed algorithm can be viewed as a general form of SWA. We thus also obtain an optimal generalization bound for SWA in the regime $\tau \in [1/2, 1]$ and $p \to 0$.
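The correspondence between Step 10 and weight averaging can be checked in a few lines. This is an illustrative sketch with hypothetical function names: it confirms that the Step 10 update equals the SWA rule with $\tau = 1 - \alpha p$, and that $\tau_t = (t - 1)/t$ reproduces the plain average of the iterates.

```python
import numpy as np

def step10_update(u, w_next, alpha, p):
    """Step 10 of Alg. 1: u <- u + alpha * p * (w_next - u)."""
    return u + alpha * p * (w_next - u)

def swa_update(u, w_next, tau):
    """SWA rule: u <- tau * u + (1 - tau) * w_next."""
    return tau * u + (1.0 - tau) * w_next

def running_average(iterates):
    """SWA with tau_t = (t - 1) / t for t = 1, 2, ..., i.e. the plain average."""
    u = np.zeros_like(iterates[0])
    for t, w_next in enumerate(iterates, start=1):
        u = swa_update(u, w_next, (t - 1) / t)
    return u

if __name__ == "__main__":
    u = np.array([1.0, -2.0])
    w_next = np.array([0.5, 3.0])
    print(step10_update(u, w_next, alpha=0.1, p=2.0))
    print(swa_update(u, w_next, tau=1.0 - 0.1 * 2.0))
```

Both printed vectors coincide, since $u + \alpha p (w - u) = (1 - \alpha p)u + \alpha p\, w$.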

6. EXPERIMENTS

Training Procedure of Smoothed-SGDmax. To get a first look at how Smoothed-SGDmax mitigates robust overfitting, we run experiments with a lightweight model, PreActResNet-18, on CIFAR-10, CIFAR-100, and SVHN, and plot the training procedure.

Training Settings. For the attack algorithm, we use $\ell_\infty$-PGD-10 (Madry et al., 2017) with $\epsilon = 8/255$ and step size $\epsilon/4$. For adversarial training, we use a piece-wise learning rate equal to 0.1, 0.01, and 0.001 for epochs 1 to 100, 101 to 150, and 151 to 200, respectively. For Smoothed-SGDmax, we keep the same piece-wise learning rate (for the choice of $c^t_s$ in Alg. 1) for comparison. Because of the similarity between the $\ell_2$ regularization term of weight decay and the proximal term in $K(w, u; z)$, we set $p = 5 \times 10^{-4}$, which is a common choice of weight decay. The step size $\alpha_t$ of updating $u$ is set to be 50, then $\tau = 1 - \alpha p = 0.995$.

The training procedure of the experiments on CIFAR-10 is provided in the Introduction, Fig. 1. The experiments on SVHN and CIFAR-100 are provided in Fig. 2. For adversarial training, the robust test accuracy starts to decrease at around the 100th epoch, which is called robust overfitting (Rice et al., 2020). With Smoothed-SGDmax, the robust overfitting issue is much milder. These experiments are consistent with the generalization bounds: the bound of Smoothed-SGDmax, $O(T\alpha/n)$, is much better than the bound of adversarial training, $O(T\alpha + T\alpha/n)$.

Sample Complexity. Second, we study the sample complexity provided in Thm. 5.3. We use WideResNet-28 × 10 with the Swish activation function, instead of ResNet-18, for better test accuracy. The training setting mainly follows the work of (Gowal et al., 2020). We consider two losses for the choice of $h(w; z)$: the adversarial loss (Madry et al., 2017) and the TRADES loss (Zhang et al., 2020). The total number of epochs is 400. Other training settings are similar to the experiments on ResNet-18.

Adversarial Generalization Gap. CIFAR-10 contains only 50K training samples. We adopt the pseudo-labeled data introduced in (Carmon et al., 2019) to study the sample complexity. Increasing the percentage of pseudo-labeled data approximates increasing the amount of training data. In Fig. 3, we show the robust test accuracy (a) and the adversarial generalization gap (b). The results are consistent with the theoretical claim that Smoothed-SGDmax removes the sample size-independent term in the generalization bound. In Table 4, we provide the robust test performance of our proposed algorithms. The baseline performance on CIFAR-10 is reported in (Gowal et al., 2020). The performance of our proposed algorithms is comparable under the same settings used in (Gowal et al., 2020). Notice that the state-of-the-art performance in adversarial robustness is obtained using large models (e.g., WideResNet-106 × 16) and DDPM-generated data (Rebuffi et al., 2021). We do not have enough resources to run large models.

7. CONCLUSION

In this paper, we study a question: can we design an algorithm to eliminate the generalization error floor of the adversarial generalization gap? By using tools from Moreau envelopes, we consider a smoothed version of SGDmax. We prove that it has the same convergence guarantee as SGDmax and attains the minimax lower bound of the generalization gap in non-smooth loss minimization. Most importantly, Smoothed-SGDmax can eliminate the generalization error floor. We hope our work can lead to a better understanding of adversarial machine learning theory.

A PROOFS OF THEOREMS

A.1 PROOF OF PROPOSITION 4.1

The proof is adapted from the proof of the minimax lower bound on the optimization error in the work of (Chen et al., 2018). We define the excess risk as $R_{\mathcal{D}}(w) - \min_{w\in\mathcal{W}} R_{\mathcal{D}}(w)$. A minimax lower bound on the excess risk for the function class $\mathcal{H}$ is given in (Nemirovskij & Yudin, 1983):

$$\min_{w} \max_{\mathcal{D}}\ \mathbb{E}_{S\sim\mathcal{D}^n}\Big[R_{\mathcal{D}}(w) - \min_{w\in\mathcal{W}} R_{\mathcal{D}}(w)\Big] \ge \frac{L D_W}{C_4\sqrt{n}}, \qquad (A.1)$$

where $C_4$ is a universal constant. By the excess risk decomposition, we have

$$\mathbb{E}_{S\sim\mathcal{D}^n}\Big[R_{\mathcal{D}}(w) - \min_{w\in\mathcal{W}} R_{\mathcal{D}}(w)\Big] \le E_{gen}(w) + E_{opt}(w). \qquad (A.2)$$

Let $A \in \mathcal{A}$ and let $w^T$ be the output of $A$; we have $E_{opt}(w^T) \le O(1/s(T))$. Then,

$$\min_{A\in\mathcal{A}} \max_{\mathcal{D}}\ E_{gen}(w^T) \ge \Omega\!\left(\frac{L D_W}{\sqrt{n}} - \frac{1}{s(T)}\right). \qquad (A.3)$$

Completing the square, we have

$$\frac{L D_W}{\sqrt{n}} - \frac{1}{s(T)} = -\left(\frac{1}{\sqrt{s(T)}} - \frac{L D_W \sqrt{s(T)}}{2\sqrt{n}}\right)^2 + \frac{L^2 D_W^2\, s(T)}{4n}. \qquad (A.4)$$

Since $s(T) \to +\infty$ as $T \to +\infty$, we can choose $T$ such that $1/\sqrt{s(T)}$ is close to $L D_W \sqrt{s(T)}/(2\sqrt{n})$. Therefore, there exists $T$ such that

$$\min_{A\in\mathcal{A}} \max_{\mathcal{D}}\ E_{gen}(w^T) \ge \Omega\!\left(\frac{s(T)}{n}\right). \qquad (A.5)$$

A.2 PROOF OF LEMMA 5.1

To simplify the notation, we write $M(u)$ for $M(u; S)$, and similarly for $h$, $K$, and $w(u)$.

1. Let $w^* \in \arg\min R_S(w)$. We have

$$R_S(w^*) = K(w^*, u = w^*; S) \ge K(w(u), u = w^*; S) \ge R_S(w(u = w^*)).$$

Since $w^*$ minimizes $R_S$, equality holds throughout. Therefore, $w = u = w^*$ is the optimal solution of both $\min_w R_S(w)$ and $\min_u M(u; S)$.

2. Since $K(w, u)$ is a $(p - l)$-strongly convex function of $w$, $w(u)$ is unique. Then $M(u) = h(w(u)) + \frac{p}{2}\|w(u) - u\|^2$. Taking the derivative of $M(u)$ with respect to $u$, we have

$$\nabla_u M(u) = \left(\frac{\partial w(u)}{\partial u}\right)^{T} \nabla_{w(u)} h(w(u)) + \left(\frac{\partial w(u)}{\partial u} - I\right)^{T} p\,(w(u) - u) \qquad (A.6)$$

$$= \left(\frac{\partial w(u)}{\partial u}\right)^{T} \big(\nabla_{w(u)} h(w(u)) + p\,(w(u) - u)\big) + p\,(u - w(u)). \qquad (A.7)$$

Since $w(u)$ is the optimal solution of $K(w, u)$, we have

$$\nabla_{w(u)} K(w(u), u) = \nabla_{w(u)} h(w(u)) + p\,(w(u) - u) = 0. \qquad (A.8)$$

Therefore, the first term in (A.7) is equal to zero, and $\nabla_u M(u) = p\,(u - w(u))$.

where the second inequality is due to the definition of $K(w, u; S)$, the third one is due to the first-order optimality condition, and the last inequality is because of the bounded gradient of $h(w; z)$.

Next, we move to the proof of Thm. 5.1.

Step 1.

$$\begin{aligned}
\|u^{t+1}_S - u^{t+1}_{S'}\| &= \|u^t_S - u^t_{S'} - \alpha_t(\nabla M(u^t_S; S) - \nabla M(u^t_{S'}; S'))\| \\
&\le \|u^t_S - u^t_{S'} - \alpha_t(\nabla M(u^t_S; S) - \nabla M(u^t_{S'}; S))\| + \alpha_t\|\nabla M(u^t_{S'}; S') - \nabla M(u^t_{S'}; S)\| \\
&\le \|u^t_S - u^t_{S'}\| + \alpha_t\|\nabla M(u^t_{S'}; S') - \nabla M(u^t_{S'}; S)\| \\
&= \|u^t_S - u^t_{S'}\| + \alpha_t p\,\|w(u^t_{S'}; S) - w(u^t_{S'}; S')\| \\
&\le \|u^t_S - u^t_{S'}\| + \frac{2L\alpha_t}{n},
\end{aligned}$$

where the second inequality is due to the non-expansiveness of the gradient step for a convex, gradient-Lipschitz function, and the last inequality is due to Lemma A.1.

Step 2. Unwinding the recursion, we have

$$\|u^T_S - u^T_{S'}\| \le \frac{2L\sum_{t=1}^{T}\alpha_t}{n}.$$

A.4 PROOF OF LEMMA 5.2 (INNER-LOOP ERROR)

The inner-loop error bound stated after Alg. 1 follows from classical results on strongly convex optimization. Since $\|\nabla_w K(w, u; z)\| = \|\nabla_w h(w; z) + p(w - u)\| \le L + pD_W$, $K(w, u; z)$ has gradient bounded by $L_K = L + pD_W$. By (Nemirovski et al., 2009), running SGD on $K(w, u; S)$ with stepsize $c_s \le 1/(s(p - l))$ incurs an optimization error

$$\mathbb{E}\|w_N - w(u)\|^2 \le \frac{C_1^2}{N}, \quad \text{where } C_1 = \frac{L + pD_W}{p - l}.$$

A.5 PROOF OF THM. 5.2

Proof. Let $A_{t+1} = \frac{1}{2}\|u^{t+1} - u^*\|^2$ and $a_{t+1} = \frac{1}{2}\mathbb{E}\|u^{t+1} - u^*\|^2$. Then

$$\begin{aligned}
A_{t+1} = \tfrac{1}{2}\|u^{t+1} - u^*\|^2 &\le \tfrac{1}{2}\|u^t - \alpha_t \nabla_u K(w^t_N, u^t; S) - u^*\|^2 \\
&\le A_t + \tfrac{1}{2}\alpha_t^2 L_K^2 - \alpha_t\langle \nabla_u K(w^t_N, u^t; S),\, u^t - u^*\rangle \\
&= A_t + \tfrac{1}{2}\alpha_t^2 L_K^2 - \alpha_t\langle \nabla_u M(u^t; S),\, u^t - u^*\rangle + \alpha_t\langle \nabla_u M(u^t; S) - \nabla_u K(w^t_N, u^t; S),\, u^t - u^*\rangle.
\end{aligned}$$

Taking the expectation on both sides, using the convexity of $M$, and rearranging the terms, we have

$$\alpha_t \mathbb{E}[M(u^t) - M(u^*)] \le a_t - a_{t+1} + \tfrac{1}{2}\alpha_t^2 L_K^2 + \alpha_t \mathbb{E}\langle \nabla_u M(u^t; S) - \nabla_u K(w^t_N, u^t; S),\, u^t - u^*\rangle. \qquad (A.15)$$

Since

$$\mathbb{E}\langle \nabla_u M(u^t; S) - \nabla_u K(w^t_N, u^t; S),\, u^t - u^*\rangle \le \mathbb{E}\big[\|\nabla_u M(u^t; S) - \nabla_u K(w^t_N, u^t; S)\|\,\|u^t - u^*\|\big] \le \frac{pC_1 D_W}{\sqrt{N}},$$

Eq. (A.15) becomes

$$\alpha_t \mathbb{E}[M(u^t) - M(u^*)] \le a_t - a_{t+1} + \tfrac{1}{2}\alpha_t^2 L_K^2 + \frac{\alpha_t\, pC_1 D_W}{\sqrt{N}}.$$

Let $N \ge T$ and sum over $t$. We obtain

$$\sum_{t=1}^{T}\alpha_t \mathbb{E}[M(u^t) - M(u^*)] \le a_0 + \frac{1}{2}\sum_{t=1}^{T}\alpha_t^2 L_K^2 + \sum_{t=1}^{T}\frac{\alpha_t\, pC_1 D_W}{\sqrt{T}}.$$

Hence there exists $t \le T$ such that

$$\mathbb{E}[M(u^t) - M(u^*)] \le \frac{a_0 + \frac{1}{2}\sum_{t=1}^{T}\alpha_t^2 L_K^2 + \sum_{t=1}^{T}\alpha_t\, pC_1 D_W/\sqrt{T}}{\sum_{t=1}^{T}\alpha_t}.$$

Considering a constant stepsize $\alpha \le 1/\sqrt{T}$, we have $\alpha \le 1/(T\alpha)$ and $\alpha\sqrt{T} \le 1$. Therefore,

$$\mathbb{E}[M(u^t) - M(u^*)] \le \frac{2a_0 + T\alpha^2 L_K^2 + 2\alpha\sqrt{T}\, pC_1 D_W}{2T\alpha} \le \frac{\|u^0 - u^*\|^2 + L_K^2 + 2pC_1 D_W}{2T\alpha} = \frac{C_2}{T\alpha}.$$

Since $M(u; S)$ and $R_S(w)$ have the same global solutions, we can use either of them to measure the optimization error. The bound above is the optimization error measured in $M(u; S)$. The optimization error measured in $R_S(w)$ satisfies

$$\mathbb{E}[R_S(w(u^t)) - R_S(w^*)] \le \mathbb{E}[M(u^t) - M(u^*)] \le \frac{C_2}{T\alpha}.$$

Notice that the choices of algorithm output are slightly different. Therefore, we have $E_{opt} \le C_2/(T\alpha)$, where $C_2 = \|u^0 - u^*\|^2/2 + pC_1 D_W + (L + pD_W)^2/2$.

A.6 PROOF OF THM. 5.3

Let $N \ge n^2$. Unwind the recursion and let $u^T$ be the output of the algorithm. We have

$$E_{gen} \le L\,\mathbb{E}\|u^T_S - u^T_{S'}\| \le L(2L + 2C_1 p)\frac{\sum_{t=1}^{T}\alpha_t}{n} = C_3\frac{\sum_{t=1}^{T}\alpha_t}{n}.$$

If we choose $w(u^T)$ to be the algorithm output, we have $E_{gen} \le L\,\mathbb{E}\|w(u^T_S; S) - w(u^T_{S'}; S')\| \le \dots$, where the first equality is due to $\nabla M(u; S) = p(u - w(u))$ and the second inequality is due to the non-expansive property of $M(u; S)$.

B WEAKLY-CONVEX CASES

Our main result can be extended to weakly convex cases.

Theorem B.1. Assume $h$ is an $l$-weakly convex, $L$-Lipschitz function. Suppose we run GD on the smoothed surrogate adversarial loss $M(u; S)$ defined in Eq. (5.2) with diminishing stepsize $\alpha_t \le (p - l)/((2p^2 - pl)t)$ for $T$ steps. Then, the generalization gap satisfies

$$E_{gen} \le O\!\left(\frac{2L^2 T}{(2p - l)n}\right). \qquad (B.1)$$

In this case, the bound also contains no non-vanishing term. The proof is based on the error bound (Lemma A.1) and the decomposition in the proof of Thm. 5.1.

Proof: Step 1.



Figure 1: Experiments of adversarial training and Smoothed-SGDmax on CIFAR-10.

$w(u; S) = \arg\min_{w\in\mathcal{W}} K(w, u; S)$.

Figure 2: Robust test accuracy of adversarial training and Smoothed-SGDmax on SVHN and CIFAR-100.

Figure 3: Robust test accuracy and generalization gap in the experiments of training CIFAR-10 using Smoothed-SGDmax.

$E_{gen} \le L\,\mathbb{E}\|w(u^T_S; S) - w(u^T_{S'}; S')\|$

Table 3: Comparison of SGDmax, weight decay, proximal update, stochastic weight averaging, and Smoothed-SGDmax. Only Smoothed-SGDmax reduces the error floor in the generalization bound.

Weight Decay. Weight decay (WD) adds an $\ell_2$ regularization term to the empirical loss. The loss function with WD is $h(w; z) + p\|w\|^2/2$. Therefore, if we replace Step 10 by $u = 0$ in Alg. 1, Smoothed-SGDmax reduces to a simple weight decay regularization technique. Following the analysis in Table 2, it is easy to see that adversarial training with weight decay incurs a generalization bound of

$$E_{gen} \le 2L(L_z\epsilon + L/n)\,T\alpha, \qquad (5.7)$$

where the step size satisfies $\alpha \le 1/(L_w - p)$. Therefore, weight decay is not guaranteed to reduce the additional sample size-independent term.

Proximal Update. Both the proximal update and Smoothed-SGDmax use the Moreau envelope function, but the algorithms are different; the stability of the proximal update is discussed in Sec. 5.4.

Table 4: Robust test accuracy of our proposed algorithm. ϵ = 8/255. Model: WideResNet-28 × 10 with Swish activation function. Training data: labeled to unlabeled data ratio 3:7.

$$\mathbb{E}\|u^t_S - \alpha_t\nabla_u K(w^t_{N,S}, u^t_S; S) - u^t_{S'} + \alpha_t\nabla_u K(w^t_{N,S'}, u^t_{S'}; S')\| \le \mathbb{E}\|u^t_S - \alpha_t\nabla_u M(u^t_S; S) - u^t_{S'} + \alpha_t\nabla_u M(u^t_{S'}; S')\| + 2\alpha_t\,\mathbb{E}\|\nabla_u K(w^t_{N,S}, \dots)\| \dots$$

$$\begin{aligned}
\|u^{t+1}_S - u^{t+1}_{S'}\| &= \|u^t_S - u^t_{S'} - \alpha_t(\nabla M(u^t_S; S) - \nabla M(u^t_{S'}; S'))\| \\
&\le \|u^t_S - u^t_{S'} - \alpha_t(\nabla M(u^t_S; S) - \nabla M(u^t_{S'}; S))\| + \alpha_t\|\nabla M(u^t_{S'}; S') - \nabla M(u^t_{S'}; S)\| \\
&\le \|u^t_S - u^t_{S'}\| + \alpha_t\|\nabla M(u^t_S; S) - \nabla M(u^t_{S'}; S)\| + \alpha_t\|\nabla M(u^t_{S'}; S') - \nabla M(u^t_{S'}; S)\| \\
&\le (1 + \alpha_t\beta)\|u^t_S - u^t_{S'}\| + \alpha_t\|\nabla M(u^t_{S'}; S') - \nabla M(u^t_{S'}; S)\|,
\end{aligned} \qquad (B.2)$$

where the first and second inequalities are due to the triangle inequality. The last inequality is due to the gradient Lipschitz continuity of $M(u; S)$, with $\beta = (2p^2 - pl)/(p - l)$. Then, $\alpha_t\|\nabla M(\cdots)\|$ [...]

Step 2. Let $\alpha_t \le (p - l)/((2p^2 - pl)t)$, [...]


3. Taking the derivative with respect to $u$ on both sides of Eq. (A.8) and organizing the terms shows that $M(u)$ is a $pl/(p - l)$-weakly convex function.

4. By Eq. (A.11), $M(u)$ is $(2p^2 - pl)/(p - l)$-gradient Lipschitz continuous.

A.3 PROOF OF THM. 5.1

The training loss is a standard result of running GD on a smooth objective function. We focus on the proof of the generalization bound. Thm. 5.1 does not follow from the work of (Hardt et al., 2016). Notice that

$$M(u; S) = \min_{w\in\mathcal{W}} \frac{1}{n}\sum_{z\in S} K(w, u; z) \ne \frac{1}{n}\sum_{z\in S} \min_{w\in\mathcal{W}} K(w, u; z),$$

so $\min_u M(u; S)$ is not a finite-sum problem, while the analysis in (Hardt et al., 2016) applies only to finite-sum problems. Thm. 5.1 therefore requires a different proof. In summary, there are two steps:
1. Build the recursion from $\|u^t_S - u^t_{S'}\|$ to $\|u^{t+1}_S - u^{t+1}_{S'}\|$;
2. Unwind the recursion.

The main challenge comes from the first step, since the problem is not in the form of a finite sum. To this end, we develop a new error bound and a different decomposition. We first introduce the following error bound.

Lemma A.1. In the weakly convex case, for neighbouring $S$ and $S'$, we have $\|w(u; S) - w(u; S')\| \le 2L/(n(p - l))$.

Proof. By the $(p - l)$-strong convexity of $K(w, u; S)$, we have

$$\begin{aligned}
(p - l)\|w(u; S) - w(u; S')\| &\le \|\nabla K(w(u; S), u; S) - \nabla K(w(u; S'), u; S)\| \\
&\le \|\nabla K(w(u; S), u; S) - \nabla K(w(u; S'), u; S')\| + \frac{1}{n}\|\nabla h(w(u; S'), z_i)\| + \frac{1}{n}\|\nabla h(w(u; S'), z'_i)\| \\
&= \frac{1}{n}\|\nabla h(w(u; S'), z_i)\| + \frac{1}{n}\|\nabla h(w(u; S'), z'_i)\| \\
&\le \frac{2L}{n}.
\end{aligned}$$

