UNDERSTANDING AND MITIGATING ROBUST OVERFITTING THROUGH THE LENS OF FEATURE DYNAMICS

Anonymous authors
Paper under double-blind review

Abstract

Adversarial Training (AT) has become arguably the state-of-the-art algorithm for extracting robust features. However, researchers have recently noticed that AT suffers from a severe robust overfitting problem, particularly after the learning rate (LR) decay, and the existing static view of feature robustness fails to explain this phenomenon. In this paper, we propose a new dynamic feature robustness framework that takes the dynamic interplay between the model trainer and the attacker into consideration. By tracing temporal and dataset-specific feature robustness, we develop a new understanding of robust overfitting from the dynamics of non-robust features, and empirically verify it on real-world datasets. Built upon this understanding, we explore three techniques to restore the balance between the model trainer and the attacker, and show that they can effectively alleviate robust overfitting and attain state-of-the-art robustness on benchmark datasets. Notably, different from previous studies, our interpretation highlights the necessity of considering the min-max nature of AT when explaining robust overfitting.

1. INTRODUCTION

Overfitting seems to have become history in the deep learning era. Contrary to the traditional belief in statistical learning theory that a large hypothesis class leads to overfitting, Zhang et al. (2019) note that DNNs generalize well on test data even though they are capable of memorizing random training labels. Nowadays, large-scale training often does not require early stopping, and longer training simply brings better generalization (Hoffer et al., 2017). However, researchers have recently noticed that in Adversarial Training (AT), overfitting is still a severe issue across both small- and large-scale data and models (Rice et al., 2020).

AT is arguably the most effective defense (Athalye et al., 2018) against adversarially crafted perturbations to images (Szegedy et al., 2014). Specifically, given training data $\mathcal{D}_{\text{train}}$ and model $f_\theta$, AT can be formulated as a min-max optimization problem (Madry et al., 2018; Goodfellow et al., 2015):
$$\min_\theta \; \mathbb{E}_{(x,y)\sim \mathcal{D}_{\text{train}}} \; \max_{x' \in \mathcal{E}_p(x)} \ell_{\mathrm{CE}}(f_\theta(x'), y),$$
where $\ell_{\mathrm{CE}}$ denotes the cross-entropy (CE) loss and $\mathcal{E}_p(x) = \{x' \mid \|x' - x\|_p \le \varepsilon\}$ denotes the $\ell_p$-norm ball of radius $\varepsilon$ around $x$.

However, in practice, this min-max training scheme suffers severely from the robust overfitting (RO) problem: after a particular point (e.g., the learning rate decay), training robustness keeps increasing (Figure 1a, red line) while test robustness begins to decrease dramatically (Figure 1b, red line). This abnormal behavior of AT has attracted much interest recently. Previous work correctly points out that during AT, the robust loss landscape becomes much sharper, and RO can be largely suppressed by enforcing a smoother landscape with additional regularizations (Stutz et al., 2021; Chen et al., 2021; Wu et al., 2020). However, this phenomenological perspective explains the rise of neither the sharp landscape nor RO.
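To make the min-max objective concrete, the inner maximization is typically approximated with a multi-step projected gradient ascent (PGD-style) attack inside the $\ell_\infty$-ball. The following is a minimal NumPy sketch for a linear classifier; the function names (`pgd_attack`, `ce_loss_and_grad_x`) and the hyperparameter values (`eps`, `alpha`, `steps`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_loss_and_grad_x(W, x, y):
    """Cross-entropy loss of the linear classifier f(x) = W x,
    and its gradient with respect to the *input* x."""
    p = softmax(W @ x)
    loss = -np.log(p[y] + 1e-12)
    # d loss / d logits = p - onehot(y); chain through logits = W x
    g_logits = p.copy()
    g_logits[y] -= 1.0
    grad_x = W.T @ g_logits
    return loss, grad_x

def pgd_attack(W, x, y, eps=0.1, alpha=0.02, steps=10):
    """Approximate the inner maximization of AT:
    max_{||x' - x||_inf <= eps} CE(f(x'), y) via projected sign-gradient ascent."""
    x_adv = x + np.random.uniform(-eps, eps, size=x.shape)  # random start in the ball
    for _ in range(steps):
        _, g = ce_loss_and_grad_x(W, x_adv, y)
        x_adv = x_adv + alpha * np.sign(g)        # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the l_inf ball
    return x_adv
```

In full AT, the outer minimization then takes a gradient step on the model parameters using the loss at `x_adv` instead of `x`, which is exactly the "trainer vs. attacker" interplay that the dynamic framework in this paper analyzes.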
Some researchers try to explain RO through phenomena of standard training (ST), such as random memorization (Dong et al., 2022) and double descent (Dong et al., 2021), but these accounts fail to characterize what distinguishes AT from ST (why AT has the overfitting issue while ST does not). In this paper, we seek to establish a generic explanation for robust overfitting during AT. Different from previous attempts to relate the behaviors of AT to existing theories of ST, we believe that the reasons for RO should lie exactly in the differences between AT and ST. In particular, we emphasize a critical difference: RO usually happens after the learning rate (LR) decay in AT, while in ST, LR decay does not lead to (severe) overfitting, so our paper focuses on how learning behavior changes before and after the LR decay. We notice that feature robustness actually changes during this process: a non-robust feature can become robust after the LR decays, contrary to the static feature robustness framework developed by Ilyas et al. (2019). This motivates us to design a dynamic feature

