UNDERSTANDING AND MITIGATING ROBUST OVERFITTING THROUGH THE LENS OF FEATURE DYNAMICS

Anonymous authors
Paper under double-blind review

Abstract

Adversarial Training (AT) is arguably the state-of-the-art algorithm for extracting robust features. However, researchers have recently noticed that AT suffers from a severe robust overfitting problem, particularly after the learning rate (LR) decay, and the existing static view of feature robustness fails to explain this phenomenon. In this paper, we propose a new dynamic feature robustness framework that takes the dynamic interplay between the model trainer and the attacker into consideration. By tracing temporal and dataset-specific feature robustness, we develop a new understanding of robust overfitting through the dynamics of non-robust features, and empirically verify it on real-world datasets. Built upon this understanding, we explore three techniques for restoring the balance between the model trainer and the attacker, and show that they effectively alleviate robust overfitting and attain state-of-the-art robustness on benchmark datasets. Notably, unlike previous studies, our interpretation highlights the necessity of considering the min-max nature of AT when studying robust overfitting.

1. INTRODUCTION

Overfitting seems to have become history in the deep learning era. Contrary to the traditional belief in statistical learning theory that a large hypothesis class leads to overfitting, Zhang et al. (2019) note that DNNs generalize well on test data even though they are capable of memorizing random training labels. Nowadays, large-scale training often does not require early stopping, and longer training simply brings better generalization (Hoffer et al., 2017). However, researchers have recently noticed that in Adversarial Training (AT), overfitting is still a severe issue for both small- and large-scale data and models (Rice et al., 2020). AT is arguably the most effective defense (Athalye et al., 2018) against adversarially crafted perturbations to images (Szegedy et al., 2014). Specifically, given training data $\mathcal{D}_{\mathrm{train}}$ and a model $f_\theta$, AT can be formulated as a min-max optimization problem (Madry et al., 2018; Goodfellow et al., 2015):

$\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{train}}} \; \max_{x'\in\mathcal{E}_p(x)} \ell_{\mathrm{CE}}(f_\theta(x'), y),$

where $\ell_{\mathrm{CE}}$ denotes the cross-entropy (CE) loss and $\mathcal{E}_p(x) = \{x' \mid \|x' - x\|_p \le \varepsilon\}$ denotes the $\ell_p$-norm ball of radius $\varepsilon$. However, in practice, this min-max training scheme suffers severely from the robust overfitting (RO) problem: after a particular point (e.g., the learning rate decay), training robustness keeps increasing (Figure 1a, red line) while test robustness begins to decrease dramatically (Figure 1b, red line). This abnormal behavior of AT has attracted much interest recently. Previous work correctly points out that during AT the robust loss landscape becomes much sharper, and RO can be largely suppressed by enforcing a smoother landscape with additional regularization (Stutz et al., 2021; Chen et al., 2021; Wu et al., 2020). However, this phenomenological perspective explains the rise of neither the sharp landscape nor RO.
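To make the min-max objective above concrete, the sketch below implements the inner maximization with a PGD-style attack and the outer minimization with plain gradient descent, for a toy binary logistic model where the input gradient is analytic. This is a minimal illustration of the AT scheme, not the paper's actual training setup (no random start, toy data, $\ell_\infty$ ball only):

```python
import numpy as np

def pgd_attack(w, b, x, y, eps, alpha, steps):
    """Inner maximization: PGD on the CE loss of a binary logistic model
    f(x) = sigmoid(w.x + b), labels y in {0, 1}. Perturbations are kept
    inside the l_inf ball of radius eps (zero-initialized, no random start)."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))  # sigmoid probabilities
        grad_x = np.outer(p - y, w)                 # analytic d(CE)/dx per sample
        x_adv = x_adv + alpha * np.sign(grad_x)     # gradient-ascent step (l_inf)
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project back into the ball
    return x_adv

def adversarial_train(x, y, eps=0.1, lr=0.5, epochs=200):
    """Outer minimization: gradient descent on the loss of PGD examples."""
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=x.shape[1]), 0.0
    for _ in range(epochs):
        x_adv = pgd_attack(w, b, x, y, eps, alpha=eps / 4, steps=10)
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        w -= lr * (x_adv.T @ (p - y)) / len(y)      # CE gradient w.r.t. w
        b -= lr * np.mean(p - y)                    # CE gradient w.r.t. b
    return w, b
```

On linearly separable data with a margin larger than $\varepsilon$, the trained model keeps high accuracy even on PGD-perturbed inputs, which is exactly the "robust feature" behavior AT aims for.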
Some researchers try to explain RO through phenomena of standard training (ST), such as random memorization (Dong et al., 2022) and double descent (Dong et al., 2021), but they fail to characterize what distinguishes AT from ST (why AT has the overfitting issue while ST does not). In this paper, we seek to establish a generic explanation for robust overfitting during AT. Different from previous attempts to relate the behaviors of AT to existing theories of ST, we believe that the reasons for RO should lie exactly in the differences between AT and ST. In particular, we emphasize a critical difference: RO usually happens after the learning rate (LR) decay in AT, while in ST, LR decay does not lead to (severe) overfitting, so our paper focuses on the change of learning behaviors before and after the LR decay. We notice that feature robustness actually changes during this process: a non-robust feature can become robust after the LR decays, which contradicts the static feature robustness framework developed by Ilyas et al. (2019). This motivates us to design a dynamic feature robustness framework that takes the learning process into consideration, under which we discuss the dynamics of feature robustness.

Based on this dynamic view of feature robustness, we investigate three new strategies to avoid fitting non-robust features: stronger model regularization, smaller LR decay, and a stronger attacker. We show that all three strategies help mitigate robust overfitting to a large extent. Based on these insights, we propose Bootstrapped Adversarial Training (BoAT), which exhibits a negligible (if any) degree of robust overfitting without using additional data augmentations, even when we train for 500 epochs. Meanwhile, BoAT attains state-of-the-art robustness (both best-epoch and final-epoch) on benchmark datasets including CIFAR-10, CIFAR-100, and Tiny-ImageNet. This suggests that, with an appropriate design of training strategies, AT can also enjoy the same "train longer, generalize better" property as ST.
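As a purely illustrative sketch, the three mitigation strategies can be read as turning three ordinary AT hyperparameters. The config names and the specific values below are our own illustration for exposition, not the settings used by BoAT or the paper:

```python
from dataclasses import dataclass, replace

@dataclass
class ATConfig:
    """Hypothetical hyperparameters of a vanilla PGD-AT run (illustrative defaults)."""
    weight_decay: float = 5e-4      # model regularization strength
    lr_decay_factor: float = 0.1    # LR is multiplied by this at the decay point
    attack_eps: float = 8 / 255     # training-time attack radius
    attack_steps: int = 10          # training-time PGD steps

def mitigated(cfg: ATConfig) -> ATConfig:
    """Apply the three strategies (values are illustrative, not tuned):
    (1) stronger model regularization, (2) smaller (milder) LR decay,
    (3) a stronger training-time attacker."""
    return replace(
        cfg,
        weight_decay=cfg.weight_decay * 2,   # (1) heavier weight decay
        lr_decay_factor=0.3,                 # (2) LR drops less sharply
        attack_steps=cfg.attack_steps + 5,   # (3) more inner-maximization steps
    )
```

The point of the sketch is only that each strategy re-strengthens one side of the min-max game (the trainer's regularizer, the trainer's step size, or the attacker), rather than adding a new component to AT.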
To summarize, our main contributions are:
• We point out that the existing static robust feature framework fails to explain robust overfitting, particularly after the LR decay. We therefore propose a new dynamic robust feature framework that accounts for this change by taking the interplay between the model trainer and the attacker into consideration.
• Based on our dynamic framework, we analyze how feature robustness changes during the LR decay. We further propose a new understanding of robust overfitting through the change of non-robust features, and empirically verify it via three nontrivial implications.
• From our dynamic perspective, we propose three effective approaches that mitigate robust overfitting by re-striking a balance between the model trainer and the attacker. Experiments show that our proposed BoAT largely alleviates robust overfitting and attains state-of-the-art robustness.

2. A NEW DYNAMIC FRAMEWORK FOR ROBUST AND NON-ROBUST FEATURES

In this section, we explain how the traditional static framework of feature robustness fails to explain the robust overfitting problem in AT. Motivated by this, we establish a dynamic view of feature robustness that centers on the adversarial learning process.



Figure 1: (a, b) Adversarial training results on CIFAR-10 using vanilla PGD-AT (Madry et al., 2018). (c) After decay, non-robust features become significantly more robust on the training set, but only slightly more robust on the test set. (d) With LR decay, injecting stronger non-robust features (larger ε) induces a more severe degradation in test robustness, while the degradation is less obvious without decay.

This dynamic perspective also suggests a new understanding of robust overfitting. Specifically, we hypothesize that, owing to the strong local fitting ability endowed by a smaller LR, the model learns a false mapping of the non-robust features contained in adversarial examples, which opens shortcuts for the test-time attacker and induces a large test robust error. Accordingly, we design a series of experiments that empirically verify this understanding.

LIMITATIONS OF THE TRADITIONAL STATIC FEATURE ROBUSTNESS FRAMEWORK

Arguably the most prevailing understanding of adversarial training is the feature robustness framework proposed by Ilyas et al. (2019). They regard natural images as a composition of robust and non-robust features, both of which generalize (i.e., are useful), and attribute the adversarial vulnerability of standard training to its use of non-robust features. Adversarial training prunes the non-robust features, so that the learned robust model makes use of robust features only. Generally, for a distribution $P$ and a model class $\mathcal{F}$, we can define the robustness of a feature $f \in \mathcal{F}: \mathcal{X} \to \mathcal{Y}$ as a real value, and decide whether it is a (non-)robust feature by comparison with a given threshold $\gamma \in [0, 1]$ (Ilyas et al.'s definition):

$R_{\Delta, P}(f) = \mathbb{E}_{(x,y)\sim P}\Big[\min_{\delta \in \Delta(x)} \mathbf{1}[f(x+\delta) = y]\Big].$
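The definition above can be estimated directly on a finite sample. The toy sketch below does so for a scalar feature, replacing the worst-case minimization over the $\ell_\infty$ ball $\Delta(x) = [-\varepsilon, \varepsilon]$ with an exhaustive 1-D grid search; this grid search is our simplification for illustration, not how robustness is measured in the paper:

```python
import numpy as np

def feature_robustness(f, X, y, eps, n_grid=51):
    """Empirical estimate of R_{Delta,P}(f) = E[min_{|delta|<=eps} 1[f(x+delta) == y]]
    for a scalar feature f. The inner min over the l_inf ball is approximated by
    evaluating f on a symmetric grid of candidate perturbations (a toy stand-in
    for the true worst case)."""
    deltas = np.linspace(-eps, eps, n_grid)  # candidate perturbations, includes 0
    vals = []
    for xi, yi in zip(X, y):
        # the indicator's minimum over the ball: 0 if any delta flips the feature
        vals.append(min(1.0 if f(xi + d) == yi else 0.0 for d in deltas))
    return float(np.mean(vals))
```

For example, a sign feature on points at distance 1 from the decision boundary has robustness 1.0 under $\varepsilon = 0.5$ but loses robustness under $\varepsilon = 1.5$; with a threshold $\gamma$, the same feature is thus classified as robust or non-robust depending on the attack budget.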

