ROBUST OVERFITTING MAY BE MITIGATED BY PROPERLY LEARNED SMOOTHENING

Abstract

A recent study (Rice et al., 2020) revealed overfitting to be a dominant phenomenon in adversarially robust training of deep networks, and that appropriate early stopping of adversarial training (AT) could match the performance gains of most recent algorithmic improvements. This intriguing problem of robust overfitting motivates us to seek more remedies. As a pilot study, this paper investigates two empirical means to inject more learned smoothening during AT: one leveraging knowledge distillation and self-training to smooth the logits, the other performing stochastic weight averaging (Izmailov et al., 2018) to smooth the weights. Despite their embarrassing simplicity, the two approaches are surprisingly effective and hassle-free in mitigating robust overfitting. Experiments demonstrate that by plugging them into AT, we can simultaneously boost the standard accuracy by 3.72% ∼ 6.68% and the robust accuracy by 0.22% ∼ 2.03%, across multiple datasets (STL-10, SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet), perturbation types (ℓ∞ and ℓ2), and robustified methods (PGD, TRADES, and FGSM), establishing a new state-of-the-art bar in AT. We present systematic visualizations and analyses to dive into their possible working mechanisms. We also carefully exclude the possibility of gradient masking by evaluating our models' robustness against transfer attacks. Codes are available at https://github.com/VITA-Group/Alleviate



1. INTRODUCTION

Adversarial training (AT) (Madry et al., 2018), i.e., training a deep network to minimize the worst-case training loss under input perturbations, is recognized as the current best defense against adversarial attacks. However, one of its pitfalls was exposed by a recent work (Rice et al., 2020): in contrast to the commonly held belief that overparameterized deep networks hardly overfit in standard training (Zhang et al., 2016; Neyshabur et al., 2017; Belkin et al., 2019), overfitting turns out to be a dominant phenomenon in adversarially robust training of deep networks. After a certain point in AT, e.g., immediately after the first learning rate decay, the robust test error will only continue to substantially increase with further training (see Figure 1, bottom, for an example). That surprising phenomenon, termed "robust overfitting", has been prevalent across many datasets and models. As Rice et al. (2020) pointed out, it poses serious challenges to assessing recent algorithmic advances upon AT: by just using an earlier checkpoint, the performance of AT can be drastically boosted to match more recently reported state-of-the-arts (Yang et al., 2019b; Zhang et al., 2019b). Even worse, Rice et al. (2020) tested several other implicit and explicit regularization methods, including weight decay, data augmentation, and semi-supervised learning; they reported that none of those alternatives combats robust overfitting (stably) better than simple early stopping.
The authors thus advocated using a validation set to select a stopping point, although this manual picking inevitably trades off between selecting the peak point of robust test accuracy and that of standard accuracy, which often do not coincide (Chen et al., 2020a). Does there exist a more principled, hands-off, and hassle-free mitigation for robust overfitting, to further unleash the competency of AT? This paper explores two options along the way, drawing on two more sophisticated ideas from enhancing standard deep models' generalization. Both can be viewed as types of learned smoothening, and are directly plugged into AT:

• Our first approach is to smooth the logits in AT via self-training, using knowledge distillation with the same model pre-trained as a self-teacher. The idea is inspired by two facts: (1) label smoothening (Szegedy et al., 2016) can calibrate the notorious overconfidence of deep networks (Hein et al., 2019), and was found to improve their standard generalization; (2) label smoothening can be viewed as a special case of knowledge distillation (Yuan et al., 2020), and self-training can produce more semantic-aware and discriminative soft-label "self-teachers" than naive label smoothening (Chen et al., 2020b; Tang et al., 2020).

• Our second approach is to smooth the weights in AT via stochastic weight averaging (SWA) (Izmailov et al., 2018), a popular training technique that leads to better standard generalization than SGD with almost no computational overhead. While SWA has not yet been applied to AT, it is known to find flatter minima, which are widely believed to indicate stronger robustness (Hein & Andriushchenko, 2017; Wu et al., 2020a). Meanwhile, SWA can also be interpreted as a temporal model ensemble, and therefore might bring the extra robustness of ensemble defenses (Tramèr et al., 2018; Grefenstette et al., 2018) with the convenience of a single model.
These properties suggest that applying SWA to AT is natural and promising. To be clear, neither knowledge distillation/self-training nor SWA was invented by this paper: they have been utilized in standard training to alleviate (standard) overfitting and improve generalization, by fixing overconfidence and by finding flatter solutions, respectively. By introducing and adapting them to AT, our aim is to complement the existing study, demonstrating that while Rice et al. (2020) found simpler regularizations unable to fix robust overfitting, our learned logit/weight smoothening can effectively regularize and mitigate it, without needing early stopping. Experiments demonstrate that by plugging the two techniques into AT, we can simultaneously boost the standard accuracy by 3.72% ∼ 6.68% and the robust accuracy by 0.22% ∼ 2.03%, across multiple datasets (STL-10, SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet), perturbation types (ℓ∞ and ℓ2), and robustified methods (PGD, TRADES, and FGSM), establishing a new state-of-the-art in AT. As the Figure 1 example shows, our method eliminates the robust overfitting phenomenon in AT, even when training up to 200 epochs. Our results imply that although robust overfitting is more challenging than standard overfitting, its mitigation is still feasible with properly chosen, advanced regularizations developed for the latter. Overall, our findings join (Rice et al., 2020) in re-establishing the competitiveness of the simplest AT baseline.

1.1. BACKGROUND WORK

Deep networks are easily fooled by imperceptible adversarial samples. To tackle this vulnerability, numerous defense methods were proposed (Goodfellow et al., 2015; Kurakin et al., 2016; Madry et al., 2018), yet many of them (Liao et al., 2018; Guo et al., 2018; Xu et al., 2017; Dziugaite et al., 2016; Dhillon et al., 2018; Xie et al., 2018; Jiang et al., 2020) were later found to rely on training artifacts, such as obfuscated gradients (Athalye et al., 2018) caused by input transformation or randomization. Among them, adversarial training (AT) (Madry et al., 2018) remains one of the most competitive options. More improved defenses have recently been reported (Dong et al., 2018; Yang et al., 2019b; Mosbach et al., 2018; Hu et al., 2020; Wang et al., 2020a; Dong et al., 2020; Zhang et al., 2020a;b), some of them also being variants of AT, e.g., TRADES (Zhang et al., 2019b) and AT with metric-learning regularizers (Mao et al., 2019; Pang et al., 2019; 2020). While overfitting has become less of a practical concern in training deep networks nowadays, it was not noticed or addressed in the adversarial defense field until lately. An overfitting phenomenon was first observed in a few fast adversarial training methods (Zhang et al., 2019a; Shafahi et al., 2019b; Wong et al., 2020) based on FGSM (Goodfellow et al., 2015): e.g., the robust accuracy against a PGD adversary sometimes suddenly drops to nearly zero after some training. Andriushchenko & Flammarion (2020) suggested this to be rooted in the local linearization assumptions those "fast" AT methods make about the loss landscape. The recently reported robust overfitting (Rice et al., 2020) raises a completely new challenge for classical (not fast) AT: the model starts to irreversibly lose robustness after training with AT for a period, even though double-descent generalization curves still seem to hold (Belkin et al., 2019; Nakkiran et al., 2019). Among the various options tried in Rice et al. (2020), early stopping was so far the only effective remedy found.

2.1. LEARNING TO SMOOTH LOGITS IN AT

Rationale. AT enforces models to be robust against adversarial attacks of a specific type and magnitude. However, it has been shown to "overfit" the threat model "seen" during training (Kang et al., 2019; Maini et al., 2019; Stutz et al., 2020), and its gained robustness extrapolates neither to larger perturbations nor to unseen attack types. Stutz et al. (2020) hypothesized this to be an unwanted consequence of enforcing high-confidence predictions on adversarial examples, since high-confidence predictions are difficult to extrapolate to arbitrary regions beyond the examples seen during training. We generalize this observation: during AT, the attacks generated at every iteration can be naturally considered as continuously varying/evolving along with the model training. We therefore hypothesize that one source of robust overfitting might be that the model "overfits" the attacks generated in the early stage of AT and fails to generalize or adapt to the attacks in the late stage. To alleviate the overconfidence problem, we adapt the label smoothening (LS) technique from standard training (Szegedy et al., 2016). LS creates uncertainty in the one-hot labels by computing cross-entropy not with the "hard" targets from the dataset, but with a weighted mixture of these one-hot targets and the uniform distribution. This uncertainty helps alleviate the overconfidence problem (Hein et al., 2019) and improves standard generalization. The idea of LS was previously investigated in other defense methods (Shafahi et al., 2019a; Goibert & Dohmatob, 2019), but much of the observed robustness gain was later attributed to obfuscated gradients (Athalye et al., 2018). Two recent works (Stutz et al., 2020; Cheng et al., 2020) have integrated LS with AT to inject label uncertainty: Stutz et al. (2020) used a convex combination of uniform and one-hot distributions as the target for the cross-entropy loss in AT, which resembles the LS regularizer, while Cheng et al.
(2020) concurrently used an LS regularizer for AT. However, there is one pitfall of the naive LS in (Szegedy et al., 2016): over-smoothening labels in a data-blind way can cause loss of information in the logits, and hence weakened discriminative power of the trained models (Müller et al., 2019). That calls for a careful and adaptive balance between the discriminative capability and the confidence calibration of the model. In the context of AT, Stutz et al. (2020) crafted a perturbation-dependent parameter to explicitly control the transition from the one-hot to the uniform distribution as the attack magnitude grows from small to large. To identify more automated and principled means, we turn to another recent work (Yuan et al., 2020), which explicitly connected knowledge distillation (KD) (Hinton et al., 2015) to LS. The authors pointed out that LS is a special case of KD using a virtual, hand-crafted teacher; by contrast, conventional KD provides data-driven softened labels rather than simply mixing one-hot and uniform vectors. Together with many others (Furlanello et al., 2018; Chen et al., 2020b), these works demonstrated that model-based, learned soft labels provide much better confidence calibration and logit geometry than naive LS (Tang et al., 2020). Furthermore, (Furlanello et al., 2018; Chen et al., 2020b; Yuan et al., 2020) unanimously revealed that a stronger teacher model with extra privileged information is NOT critical to the success of KD. Yuan et al. (2020) showed that even a poorly trained teacher with much lower accuracy can still improve the student. Moreover, Chen et al. (2020b) and Yuan et al. (2020) found a self-teacher to be sufficiently effective for KD, that is, training the model using its own soft-logit outputs, or manually designed ones, as the KD regularization (also called teacher-free KD (Tf-KD) in (Yuan et al., 2020)).
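For concreteness, the naive LS baseline referenced throughout can be sketched in a few lines; the function name and the α = 0.1 default are our illustrative choices, not values from the cited papers.

```python
def smooth_labels(one_hot, alpha=0.1):
    """Naive label smoothening (Szegedy et al., 2016): mix the one-hot
    target with the uniform distribution over the k classes."""
    k = len(one_hot)
    return [(1 - alpha) * p + alpha / k for p in one_hot]

# A 4-class example: the correct class keeps most of the mass.
print(smooth_labels([0, 0, 1, 0]))  # → [0.025, 0.025, 0.925, 0.025]
```

The mixture is data-blind: every wrong class receives the same α/k mass regardless of semantic similarity, which is exactly the information loss that learned soft labels avoid.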
These observations form the cornerstone of our learned logit smoothening approach. Approach. We follow (Chen et al., 2020b; Yuan et al., 2020) to use self-training with the same model, but introduce one specific modification. The model can be trained in at least two different ways: standard training, or robust training (AT or other cheaper ways; see ablation experiments). That yields two self-teachers, and we assume both to be available. Let x be the input, y the one-hot ground-truth label, δ the adversarial perturbation bounded by an ℓp norm ball with radius ε, and θ_r/θ_s the weights of the robust-trained and standard-trained self-teachers, respectively. Note that the two self-teachers share the identical network architecture and training data with our target model. Our self-training smoothed loss function is expressed below (λ1 and λ2 are two hyperparameters):

$$\min_\theta \; \mathbb{E}_{(x,y)\in D} \Big\{ (1-\lambda_1-\lambda_2)\cdot \max_{\delta\in B_\epsilon(x)} \mathcal{L}_{\mathrm{XE}}(f(\theta, x+\delta), y) + \lambda_1\cdot \mathrm{KD}_{\mathrm{adv}}(f(\theta, x+\delta), f(\theta_r, x+\delta)) + \lambda_2\cdot \mathrm{KD}_{\mathrm{std}}(f(\theta, x+\delta), f(\theta_s, x+\delta)) \Big\}, \tag{1}$$

where L_XE is the robustified cross-entropy loss adopted in the original AT, and KD_adv and KD_std are the Kullback-Leibler divergence losses with the robust-trained and standard-trained self-teachers, respectively. λ1 = 0.5 and λ2 = 0.25 are the defaults in all experiments. More details are in Appendix A2.1. Figure 2 visualizes an example of the logit distributions generated by naive LS (Szegedy et al., 2016), the Tf-KD regularizer using a manually designed self-teacher (Yuan et al., 2020), and our standard- and robust-trained self-teachers, respectively. We observe that both the standard and the robust self-teacher are more discriminative than the two baseline smoothenings, while the robust self-teacher is relatively more conservative, as one would expect.
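To make Eqn. (1) concrete, below is a minimal numeric sketch of the smoothed loss, assuming a temperature-scaled softmax (T = 2, as in Appendix A2.1) and the teacher-to-student KL direction; all function names are ours, and in actual AT the student logits would be computed on the PGD-perturbed input x + δ.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def smoothed_at_loss(student_logits, y, robust_t_logits, std_t_logits,
                     lam1=0.5, lam2=0.25, T=2.0):
    """Eqn. (1): robustified cross-entropy mixed with two KD terms.
    `student_logits` stands for f(theta, x + delta); the teacher logits
    stand for f(theta_r, x + delta) and f(theta_s, x + delta)."""
    xe = -math.log(softmax(student_logits)[y])  # L_XE on the adversarial input
    kd_adv = kl(softmax(robust_t_logits, T), softmax(student_logits, T))
    kd_std = kl(softmax(std_t_logits, T), softmax(student_logits, T))
    return (1 - lam1 - lam2) * xe + lam1 * kd_adv + lam2 * kd_std
```

With the defaults λ1 = 0.5 and λ2 = 0.25, a quarter of the weight stays on the hard-label cross-entropy; when a teacher's logits coincide with the student's, the corresponding KD term vanishes.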

2.2. LEARNING TO SMOOTH WEIGHTS IN AT

Rationale. Another measure often believed to indicate standard generalization is flatness: the loss surface at the final learned weights of well-generalizing models is relatively "flat". Similarly, Wu et al. (2020a) advocated that a flatter adversarial loss landscape shrinks the robust generalization gap. This is aligned with (Hein & Andriushchenko, 2017), where the authors related it to the local Lipschitz constant and proved that the Lipschitz constant can be used to formally measure the robustness of machine learning models. The flatness preference of a robust model has been echoed by many empirical defense methods, such as hessian/curvature-based regularization (Moosavi-Dezfooli et al., 2019), gradient magnitude penalties (Wang & Zhang, 2019), smoothening with random noise (Liu et al., 2018), and entropy regularization (Jagatap et al., 2020). However, all those methods incur (sometimes heavy) computational or memory overhead, and many cause standard accuracy drops, e.g., hessian/curvature-based methods (Gupta et al., 2020). Stochastic weight averaging (SWA) (Izmailov et al., 2018) was proposed to enforce weight smoothness by simply averaging multiple checkpoints along the training trajectory. SWA is known to find much flatter solutions than SGD, is extremely easy to implement, improves standard generalization, and has almost no computational overhead. SWA has been successfully adopted in semi-supervised learning (Athiwaratkun et al., 2018), Bayesian inference (Maddox et al., 2019), and low-precision training (Yang et al., 2019a). In this paper, we introduce SWA to AT for the first time, in order to smooth the weights and find flatter minima that may improve adversarially robust generalization. Note that we choose SWA mainly for its simplicity, as a proof of concept; extensively comparing alternative "flatness" regularizations is beyond the scope of this work.
One additional bonus of adopting SWA in AT is its temporal ensemble effect. It has been widely observed (Tramèr et al., 2018; Grefenstette et al., 2018; Wu et al., 2020b; Wang et al., 2021) that training a model with attacks transferred from another model can reduce the "trivial robustness" caused by locally nonlinear loss surfaces, and model ensembles have therefore been constructed for stronger defenses. SWA was interpreted as approximating fast geometric ensembling (Garipov et al., 2018) by aggregating multiple checkpoint weights from different training times. Applying SWA to AT may therefore lead to stronger and more transferable attacks, and consequently a stronger defense due to ensembling, with the convenience of a single model. Approach. Following (Izmailov et al., 2018), applying SWA to AT is straightforward:

$$W^{T}_{\mathrm{SWA}} = \frac{W^{T-1}_{\mathrm{SWA}} \times n + W^{T}}{n+1}, \qquad W^{T} = W^{T-1} + \Delta W^{T}, \tag{2}$$

where T indexes the training epoch, n is the number of past checkpoints already averaged, W_SWA is the averaged network weight, W is the current network weight, and ΔW is the SGD update.
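A minimal sketch of the SWA bookkeeping in Eqn. (2), treating the weights as a flat list; the class name and interface are our own, and in practice (Izmailov et al., 2018) one also re-estimates batch-norm statistics for the averaged weights.

```python
class SWA:
    """Running equal-weight average of checkpoints (Eqn. 2), kept
    alongside the SGD weights W, which continue training as usual."""
    def __init__(self):
        self.n = 0          # number of checkpoints averaged so far
        self.w_swa = None   # the averaged weights W_SWA

    def update(self, w):
        """Fold the current checkpoint w into the running average."""
        if self.w_swa is None:
            self.w_swa = list(w)
        else:
            self.w_swa = [(ws * self.n + wi) / (self.n + 1)
                          for ws, wi in zip(self.w_swa, w)]
        self.n += 1
```

In our setting, `update` would be called once per epoch after the first learning rate decay, when robust overfitting typically begins.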

3. EXPERIMENT AND ANALYSIS

Datasets. We consider five datasets in our experiments: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011), and Tiny-ImageNet (Deng et al., 2009). In all experiments, we randomly split the original training set into a training set and a validation set with a 9:1 ratio. Due to limited space, we place the SVHN results in Appendix A1.3. The ablation studies and visualizations are mainly on CIFAR-10 and CIFAR-100. Attack Methods. We consider three representative attacks: FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018), and TRADES (Zhang et al., 2019b). All of them are applied with the (ℓ2, ε = 128/255) or (ℓ∞, ε = 8/255) setting as in (Madry et al., 2018) to generate adversarial samples. We use FGSM-1/PGD-10/TRADES-10 for training and PGD-20 for testing as the default setting, following Madry et al. (2018); Chen et al. (2020a). In addition, we use Auto-Attack (Croce & Hein, 2020) and the CW attack (Carlini & Wagner, 2017) for a more rigorous evaluation. More details are provided in Appendix A2.2.

Training and Evaluation Details

For all experiments, we by default use ResNet-18 (He et al., 2016), with the exceptions of VGG-16 (Simonyan & Zisserman, 2014) and Wide-ResNet (Zagoruyko & Komodakis, 2016) adopted in Table 3. For training, we adopt an SGD optimizer with a momentum of 0.9 and a weight decay of 5×10^−4, for a total of 200 epochs, with a batch size of 128. The learning rate starts from 0.1 (0.01 for SVHN (Rice et al., 2020)) and decays to one-tenth at epochs 50 and 150, respectively. For Tiny-ImageNet, we train for 100 epochs, with the learning rate decaying at epochs 50 and 80 and other settings unchanged. The self-training KD regularization is applied throughout the entire training, while SWA is employed after the first learning rate decay (when robust overfitting usually starts to occur). We evaluate two widely adopted metrics (Zhang et al., 2019b; Chen et al., 2020a): Standard Testing Accuracy (SA) and Robust Testing Accuracy (RA), the classification accuracies on the original and the attacked test sets, respectively.

3.1. TACKLING ROBUST OVERFITTING

Superior Performance Across Datasets. Table 1 demonstrates our proposal on STL-10, CIFAR-10, CIFAR-100, and Tiny-ImageNet. We consider PGD-AT (Madry et al., 2018) as the Baseline, and denote our two training techniques as +KD_std&adv (KD with standard and robust self-teachers) and +SWA, respectively. To numerically show the gap of robust overfitting, we report the best RA value with early stopping during training, the final RA in the last epoch, and the difference (final minus best). For reference, we also report the corresponding SA of the same best-RA checkpoint (not the best SA value throughout training), the final-epoch SA, and their difference. We first observe that robust overfitting prevails in all Baseline cases, with RA differences between final and best early-stopping values as large as 9.34% (CIFAR-10). In comparison, SA stays stable (with negative gaps on STL-10 and CIFAR-10) or continues to improve with more training epochs (with small positive gaps on CIFAR-100 and Tiny-ImageNet). Fortunately, the gaps are significantly reduced by +KD_std&adv, and further diminish to only 0.4% ∼ 0.6% when SWA is also applied. Meanwhile, our SA surpasses that of the baseline's best-RA checkpoint by 6.68%, and that of its final checkpoint by 7.29%. Figure 3 further plots the RA and SA curves during training, from which we can clearly observe the diminishing of robust overfitting after applying KD_std&adv, SWA, and their combination. The training curves robustly improve until the end without compromising the best achievable RA, leading to a much-improved trade-off between RA and SA by avoiding early stopping (e.g., selecting an early checkpoint for RA when SA might still be half-baked). Across Perturbations and Robustified Methods. Our success extends beyond PGD-AT. Table 2 presents more results with different perturbations (i.e., ℓ2 and ℓ∞) and diverse robustified methods (i.e., FGSM as in (Wong et al., 2020), TRADES as in (Zhang et al., 2019b)).
Consistent observations can be made: robust overfitting gaps are almost eliminated, with significant gains in RA (by 0.61% ∼ 3.11%) and SA (by 1.80% ∼ 4.22%). We also compare with the previous state-of-the-art results in (Rice et al., 2020) under the same setting. As shown in Table A7 (Appendix), our methods shrink the gap between the best-RA checkpoint and the final-epoch RA from 5.70% to 0.17%, while simultaneously improving RA by 4.50% and SA by 3.04%. More results can be found in Appendix A1. Across Architectures and Improved Attacks. Table 3 demonstrates the effectiveness of our methods across different architectures, including VGG-16, Wide-ResNet-34-4, and Wide-ResNet-34-10. Specifically, our methods reduce the robust accuracy drop from 5.83% to 0.06% with VGG-16 on CIFAR-10, while achieving extra robust accuracy improvements of 2.57%, 1.69%, and 1.23% with VGG-16, Wide-ResNet-34-4, and Wide-ResNet-34-10 on CIFAR-10, respectively. To further verify the improvements achieved by our methods, we conduct extra evaluations under improved attacks. As shown in Table 4, after applying the combination of KD and SWA, the overfitting problem is largely mitigated under both Auto-Attack (Croce & Hein, 2020) and the CW attack (Carlini & Wagner, 2017). Taking the CIFAR-10 ℓ∞ adversary as an example, our approaches shrink the robust accuracy drop from 7.04% to −0.09% under Auto-Attack, and from 14.96% to 0.79% under the CW attack, when comparing the best model to the eventually converged model. These results indicate that our methods generalize to different architectures and improved attacks. Excluding Obfuscated Gradients. An often-argued "counterfeit" of improved robustness is the reduced effectiveness of generated adversarial examples due to obfuscated gradients (Athalye et al., 2018). To exclude this possibility, we show that our methods maintain their improved robustness under unseen transfer attacks.
To start, the left plot of Figure 4 shows the transfer attack performance on an unseen robust model (a separately robustified ResNet-50 trained with PGD-10 on CIFAR-100), using attacks generated by checkpoints from different epochs of the PGD-AT Baseline, Baseline + KD_std&adv, and Baseline + KD_std&adv + SWA. A higher robust accuracy of the unseen robust model corresponds to a weaker attack. Our methods consistently yield stronger and more transferable attacks, while the quality of the attacks generated by the baseline quickly drops, with deteriorating transferability. Similarly, the right plot of Figure 4 transfers attacks from the unseen robust model to the above three methods, and our methods consistently defend better. This empirical evidence suggests that our RA gains are not a result of gradient masking.

3.2. ABLATION STUDY AND VISUALIZATION

KD_adv, KD_std, and SWA. We study the effectiveness of each component of logit and weight smoothening. We specifically decompose KD_std&adv into two ablation methods: KD_std (by setting λ1 = 0 in Eqn. (1)) and KD_adv (by setting λ2 = 0), respectively. Table 5 shows that KD_std, KD_adv, and SWA all substantially contribute to suppressing robust overfitting and improving the SA-RA trade-off. We notice that while KD_std alone (understandably) sacrifices the best RA a bit for improving SA, combining it with KD_adv recovers the RA compromise and boosts both. Naive LS versus Learned Logit Smoothening. As KD can be viewed as a learned version of LS (Yuan et al., 2020), we next quantify the benefit of KD_std&adv compared to naive LS (Szegedy et al., 2016) and the teacher-free knowledge distillation regularization (Tf-KD_reg) of (Yuan et al., 2020), all incorporated into PGD-AT on CIFAR-10. Table 6 shows that both naive LS and Tf-KD_reg also reduce robust overfitting to some extent, but are far less competitive than KD_std&adv. Moreover, the robustness gains of naive LS and Tf-KD_reg no longer hold under transfer attacks, implying that they are susceptible to obfuscated gradients. Further visualization in Figure 5 demonstrates that our methods smooth the logits without compromising the class-wise discriminative information, while naive LS and Tf-KD might suffer from weaker gradients.

Quality of Self-Teachers

An extra price of our learned logit smoothening is the pre-training of self-teachers, although this is already quite common in similar literature (Chen et al., 2020a;b). To further reduce this burden, we explore whether high-quality, more expensive pre-training is necessary, and fortunately find that it is not. For example, Table 6 shows only marginal performance differences when the robust self-teacher is pre-trained using FGSM or PGD-10/100. Visualizing Flatness and Local Linearity. We expect SWA to find flatter minima for AT and thus improve its generalization, and we show that this indeed happens by visualizing the loss landscape w.r.t. both the input and the weight space. Figure 6 shows that our methods notably flatten the rugged landscape w.r.t. the input space compared to the PGD-AT baseline, which aligns with the robust generalization claims in (Moosavi-Dezfooli et al., 2019; Wu et al., 2020a). Figure 7 follows (Izmailov et al., 2018) to perturb the trained model in the weight space and shows how the robust testing loss changes over the perturbation radius; we perturb along 10 different random directions at each ℓ2 distance. Our methods also exhibit better weight smoothness around the achieved local minima, which suggests improved generalization (Dinh et al., 2017; Petzka et al., 2019). We additionally examine the local linearity measurement proposed in (Andriushchenko & Flammarion, 2020), which originally addresses catastrophic overfitting in fast AT. As shown in Figure A11, our methods also achieve consistently better local linearity.
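The Figure-7-style weight-space probe can be sketched as follows; this is our illustrative reconstruction (flat weight vector, Gaussian direction normalized to unit ℓ2 length), not the authors' exact code.

```python
import math, random

def perturb_weights(weights, radius, rng=random):
    """Move `weights` by exactly `radius` (in l2 distance) along a
    random direction; evaluating the robust loss at several radii and
    directions traces out the weight-space flatness curve."""
    direction = [rng.gauss(0.0, 1.0) for _ in weights]
    norm = math.sqrt(sum(d * d for d in direction))
    return [w + radius * d / norm for w, d in zip(weights, direction)]
```

Averaging the resulting loss over several directions per radius, as done with 10 directions in Figure 7, reduces the variance of the flatness estimate.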

4. CONCLUSION

This paper takes one more step towards addressing the recently discovered robust overfitting issue in AT. We present two empirical solutions to smooth the logits and weights respectively; both are motivated by successful practice in improving standard generalization, and we adapt them for AT. While Rice et al. (2020) found simpler regularizations unable to fix robustness overfitting, our learned smoothening regularization seems to largely mitigate that. Extensive experiments show our proposal to establish new state-of-the-art performance on AT. While promising progress has been made, the underlying cause of robust overfitting is not yet fully explained. Our future work will connect to more theoretical understandings of this issue (Wang et al., 2019; 2020b) .

A1 MORE EXPERIMENT RESULTS

A1.1 STATE-OF-THE-ART BENCHMARK ON CIFAR-100 We implement our methods with exactly the same setting as (Rice et al., 2020) and compare with the baseline results reported in the original paper. As shown in Table A7 and Figure A8, our methods achieve great improvements in both robust and standard accuracy (1.64% in RA and 3.78% in SA for ℓ∞; 4.50% in RA and 3.04% in SA for ℓ2), which establishes a new state-of-the-art bar. Table A7: Comparative experiment on CIFAR-100; we follow the same setting as (Rice et al., 2020) and compare with the baseline results reported therein.

A1.2 T-SNE RESULT ON CIFAR-100

We visualize the learned feature space with all training images and their corresponding adversarial images from PGD-10 on CIFAR-100. As shown in Figure A9, our learned features exhibit larger distances between classes while being more clustered within the same class. The more distinguishable feature embedding justifies the improvement in both robust and standard accuracy.

A1.3 SUPERIOR PERFORMANCE ON SVHN

We conduct our experiments on SVHN with the ResNet-18 (He et al., 2016) architecture, adopting an SGD optimizer with a momentum of 0.9 and a weight decay of 5×10^−4 for 80 epochs in total, with a batch size of 128. The learning rate starts from 0.01 and follows a cosine annealing schedule. The results can be found in Table A8 and Figure A10. The robust accuracy of the best checkpoint for ℓ∞ is improved from 52.60% to 53.65%, and robust overfitting is alleviated by 6.30%. In the meantime, standard accuracy is improved by 2.47%. The superior performance on SVHN aligns with the results on the other datasets, showing the effectiveness of our methods.

A1.4 LOCAL LINEARITY

According to (Andriushchenko & Flammarion, 2020), the catastrophic overfitting problem is mainly due to the reduction of local linearity when adversarially training with FGSM (Rice et al., 2020). We therefore borrow this measurement for the robust overfitting scenario; it calculates the expected cosine similarity between the gradients at the original input and at a uniformly randomly perturbed one, as shown in Eqn. (3). The results in Figure A11 indicate that our methods help slow the decline of local linearity, and that maintaining local linearity is also helpful for preventing robust overfitting.

$$\mathbb{E}_{(x,y)\in D,\ \eta\sim U([-\epsilon,\epsilon]^d)} \cos\big(\nabla_x L(f(\theta,x),y),\ \nabla_x L(f(\theta,x+\eta),y)\big) \tag{3}$$
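Eqn. (3) can be estimated by Monte-Carlo sampling; the sketch below takes a user-supplied input-gradient function and is our own illustrative reconstruction, not the authors' code.

```python
import math, random

def cos_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def local_linearity(grad_fn, x, eps, n_samples=8, rng=random):
    """Monte-Carlo estimate of Eqn. (3): the expected cosine similarity
    between the input gradient at x and at x + eta, eta ~ U([-eps, eps]^d).
    `grad_fn(x)` must return the gradient of the loss w.r.t. the input."""
    g0 = grad_fn(x)
    sims = []
    for _ in range(n_samples):
        x_pert = [xi + rng.uniform(-eps, eps) for xi in x]
        sims.append(cos_sim(g0, grad_fn(x_pert)))
    return sum(sims) / len(sims)
```

For a perfectly linear loss the gradient is constant and the measure equals 1; values decaying below 1 over training signal the loss of local linearity discussed above.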

A1.5 ABLATION OF TRANSFER ATTACK

With the purpose of fully comparing the effects of label smoothing and knowledge distillation, we introduce a transfer attack with an unseen non-robust model of the same architecture, following the same setting as (Fu et al., 2020). A higher accuracy of the unseen model under our generated attacks indicates a weaker attack, while a higher accuracy of our models under the unseen model's attacks indicates better robustness. As shown in Table A9, only knowledge distillation shows significant improvement in both accuracies compared with the baseline (PGD-AT). The strength of the generated adversarial images is improved by 4.32% and the robustness is improved by 2.91% for the best model. We also experiment with an unseen robust model and obtain consistent improvements. This indicates that knowledge distillation introduces more discriminating information from the teacher models, which is better than manually designed label smoothing.

One possible extension of SWA is to replace W^{T−1} with W^{T−1}_SWA in Eqn. (2). We name this variant iSWA and compare it with the original SWA in Table A10. Both weight smoothening techniques mitigate robust overfitting, and iSWA performs slightly better on RA while sacrificing some SA.

The softened KD targets use the temperature-scaled distribution

$$t(y)_i = \frac{y_i^{1/T}}{\sum_j y_j^{1/T}}, \qquad T = 2 \ \text{in our case}, \tag{4}$$

following the standard setting in (Hinton et al., 2015; Li & Hoiem, 2017). TRADES replaces the cross-entropy loss in PGD with the Kullback-Leibler divergence between the network outputs for the clean and the adversarial input:

$$\mathrm{KL}(f(\theta, x+\delta), f(\theta, x)). \tag{5}$$

As for the maximization process, FGSM perturbs the input with a single step in the direction of the sign of the gradient, and PGD is the iterative form of FGSM with random restarts, which works as follows:

$$\delta^{t+1} = \mathrm{proj}_P\big(\delta^t + \alpha \cdot \mathrm{sgn}(\nabla_x L(f(\theta, x+\delta^t), y))\big), \tag{6}$$

$$\delta^{t+1} = \mathrm{proj}_P\big(\delta^t + \alpha \cdot \mathrm{sgn}(\nabla_x \mathrm{KL}(f(\theta, x+\delta^t), f(\theta, x)))\big), \tag{7}$$

where TRADES (Eqn. (7)) uses the KL divergence of Eqn. (5) in the inner maximization, f is the network with parameters θ, and (x, y) is the data.
$\alpha$ is the step size and $\delta^t$ is the adversarial perturbation after $t$ iterations. The perturbation is constrained in an $\ell_p$ norm ball, i.e., $\|\delta\|_p \le \epsilon$, which is realized by projection. We consider both $\ell_\infty$ and $\ell_2$ in our paper. For the $\ell_\infty$ adversary, we use $\epsilon = 8/255$ and $\alpha = 2/255$ for PGD and TRADES with 10 steps in training and 20 steps in testing, while using $\alpha = 7/255$ for FGSM during training. For the $\ell_2$ adversary, we use $\epsilon = 128/255$ and $\alpha = 15/255$ with the same steps as the $\ell_\infty$ adversary in all three attack methods. For a comprehensive evaluation, we consider two improved attacks, i.e., Auto-Attack (Croce & Hein, 2020) and the CW attack (Carlini & Wagner, 2017). We use the official implementation and default settings for Auto-Attack ($\ell_\infty$ with $\epsilon = 8/255$ and $\ell_2$ with $\epsilon = 128/255$) and the implementation
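Under these $\ell_\infty$ settings, the PGD update of Eqn. 6 can be sketched as follows. This is a hedged illustration assuming a PyTorch `model` and an input batch, not the paper's exact training code; for images, `x + delta` would additionally be clamped to the valid pixel range.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10,
             random_start=True):
    """Eqn. 6: iterated FGSM, projected back onto the l_inf eps-ball
    around x after every step."""
    delta = (torch.empty_like(x).uniform_(-eps, eps) if random_start
             else torch.zeros_like(x))
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta + alpha * grad.sign()  # ascent step
            delta = delta.clamp(-eps, eps)       # projection onto the ball
    return delta.detach()
```

The TRADES inner step of Eqn. 7 has the same structure, with the cross-entropy loss replaced by the KL divergence between the adversarial and clean outputs.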



Figure 1: The standard (top) and robust (bottom) test error rate comparison on CIFAR-10 with ResNet-18, between the vanilla PGD-AT baseline (Madry et al., 2018) and PGD-AT with our proposed weight/label smoothening techniques applied.

Figure 2: Comparing the logit distribution of different LS/KD means on CIFAR-10 using ResNet-18 (C7 is the correct label).

Figure 3: Results of testing accuracy over epochs for ResNet-18 trained on CIFAR-10, CIFAR-100, STL-10, and Tiny ImageNet. Dashed / solid lines show the standard accuracy (SA) / robust accuracy (RA). Blue, Green, Black and Orange curves represent the performance of Baseline, KD, SWA and KD&SWA respectively.

Figure 4: (Left) Transfer attack performance on an unseen robust model, where attacks are generated by checkpoints of Baseline, KD, and KD&SWA at different epochs. (Right) Transfer attack performance on models from Baseline, KD, and KD&SWA, where attacks are generated by an unseen robust model. The unseen robust model is a ResNet-50 trained by PGD-10. All experiments are conducted on the CIFAR-100 dataset.

Figure 5: t-SNE results of different logits smoothing approaches on CIFAR-10. Dots and stars represent clean and adversarial images respectively. Orange, Blue and Green represent classes A, B and C respectively. For each class, we visualize all testing images and their corresponding adversarial images from PGD-20.

Figure A8: Results of testing accuracy over epochs for ResNet-18 trained on CIFAR-100 with the same setting as Rice et al. (2020). Dashed lines show the standard accuracy (SA); solid lines represent the robust accuracy (RA). Blue, Green and Orange curves represent the performance of Baseline, KD and KD&SWA respectively.

Figure A9: t-SNE results of different models trained on CIFAR-100. Dots and stars represent clean and adversarial images respectively. Red, Blue and Green represent classes A, B and C respectively. For each class, we visualize all training images and their corresponding adversarial images from PGD-10. The left figure is the Baseline; the right figure is our methods.

Figure A11: The local linearity computed over all test images on CIFAR-100, using model checkpoints from each stage of the training process. Blue, Green and Orange curves represent the local linearity of Baseline, KD and KD&SWA respectively.

ADVERSARIAL TRAINING Adversarial training incorporates generated adversarial examples into the training process and significantly improves the robustness of networks. In our paper, we implement three different adversarial training schemes, FGSM, PGD and TRADES, which can be described by the optimization problems below: Eqn. 4 for FGSM and PGD, Eqn. 5 for TRADES.

$$\min_{\theta} \; \mathbb{E}_{(x,y)\in D} \; \max_{\delta \in B_{\epsilon}(x)} L(f(\theta, x+\delta), y) \quad (4)$$

$$\min_{\theta} \; \mathbb{E}_{(x,y)\in D} \left[ L(f(\theta, x), y) + \beta \cdot \max_{\delta \in B_{\epsilon}(x)} \mathrm{KL}\left( f(\theta, x+\delta), f(\theta, x) \right) \right] \quad (5)$$
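Given an inner maximizer that returns a perturbation `delta`, the two outer objectives reduce to simple per-batch losses. The sketch below is illustrative, not the paper's code; the `beta` default and function names are our assumptions, with the KL arranged as in the TRADES reference implementation.

```python
import torch
import torch.nn.functional as F

def at_loss(model, x, y, delta):
    # Eqn. 4: cross entropy at the adversarially perturbed input
    return F.cross_entropy(model(x + delta), y)

def trades_loss(model, x, y, delta, beta=6.0):
    # Eqn. 5: clean cross entropy + beta * KL between the clean and
    # adversarial output distributions (F.kl_div takes log-probs first)
    clean_logits = model(x)
    kl = F.kl_div(F.log_softmax(model(x + delta), dim=1),
                  F.softmax(clean_logits, dim=1),
                  reduction="batchmean")
    return F.cross_entropy(clean_logits, y) + beta * kl
```

With `delta = 0` the KL term vanishes and both losses fall back to the standard clean cross entropy, which makes the trade-off role of $\beta$ explicit.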

Performance showing the occurrence of robust overfitting across datasets and the effectiveness of our proposed remedies with ResNet-18. The difference between best and final robust accuracy indicates degradation in performance during training. We pick the checkpoint which has the best robust accuracy on the validation set. The best results and the smallest performance differences are marked in bold. Further, we observe that our methods push the best RA higher by 0.22% ∼ 2.03%. For example, the best RA on Tiny-ImageNet rises from 19.81% to 21.84%. Meanwhile, since robust overfitting no longer occurs early in training, the best RA checkpoints are selected from late epochs (often close to the end). Consequently, the SA values of the selected best RA models are all substantially improved; for example, on CIFAR-100, the standard accuracy of our methods (best RA checkpoint) is substantially higher than that of the baseline.

Controlled experiments on CIFAR-10. The difference between best and final robust accuracy indicates degradation in performance during training. We pick the checkpoint which has the best robust accuracy on the validation set. The best results and the smallest performance difference are marked in bold.

Controlled experiments across different architectures on CIFAR-10/100 under the ℓ∞ adversary. The difference between best and final robust accuracy indicates degradation in performance during training. We pick the checkpoint which has the best robust accuracy on the validation set. The best results and the smallest performance difference are marked in bold.

Evaluation under improved attacks on CIFAR-10/100 with ResNet-18. The difference between best and final robust accuracy indicates degradation in performance during training. We pick the checkpoint which has the best robust accuracy under PGD-20 attack on the validation set. The best results and the smallest performance difference are marked in bold.

Ablation studies on CIFAR-10 with ResNet-18. Compared with the Baseline (PGD-AT) method, the performance improvements and degradations from adding each component are reported in red and blue numbers. For example, KD_std&adv + SWA reaches a best RA of 52.14% (↑1.42) and a final RA of 51.53% (↑10.15), a difference of only 0.61, with a best SA of 84.65% (↑3.87) and a final SA of 85.40% (↑2.96), a difference of -0.75.

Ablation of label smoothing versus KD on CIFAR-10.

Best refers to the model with the best robust accuracy during training and Final is the average accuracy over the last 5 epochs.

Performance showing the occurrence of robust overfitting and the effectiveness of our proposed remedies with ResNet-18 on SVHN. The difference between best and final robust accuracy indicates degradation in performance during training. We pick the checkpoint which has the best robust accuracy on the validation set. The best results and the smallest performance difference are marked in bold.

Ablation of the transfer attack. The accuracy on the unseen model is the accuracy of the unseen model on adversarial images generated by source models from different settings, and the accuracy from the unseen model means the opposite. We generate adversarial images for all test images on CIFAR-10 with ℓ∞ PGD-20. Baseline represents the PGD-AT method.

Ablation of SWA on CIFAR-10. Best refers to the model selected with the best robust accuracy on the validation set and Final is the model at the end of the training process.
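Both SWA variants in this comparison maintain a running average of the weights over training. The sketch below is an equal-weight running mean, standing in for the exact update of Eqn. 2 in the main text; the function name and dict-of-tensors interface are our illustrative assumptions.

```python
# Minimal sketch of the running weight average behind SWA: after each
# selected checkpoint, fold the current weights into the average.
def swa_update(avg_state, new_state, n_averaged):
    """Per-parameter update: avg <- (avg * n + new) / (n + 1)."""
    for name in avg_state:
        avg_state[name] = ((avg_state[name] * n_averaged + new_state[name])
                           / (n_averaged + 1))
    return n_averaged + 1
```

In PyTorch this bookkeeping is provided out of the box by `torch.optim.swa_utils.AveragedModel`; the iSWA variant additionally feeds the averaged weights back into the update of Eqn. 2.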

APPENDIX

from AdverTorch (Ding et al., 2019) for the CW attack, with the same setting as Rony et al. (2019): specifically, 1 search step on c with an initial constant of 0.1, 100 iterations for each search step, and a 0.01 learning rate. Detailed links are provided below:
• The official Auto-Attack repository: https://github.com/fra31/auto-attack
• The RobustBench leaderboard: https://robustbench.github.io/

