TOWARDS UNDERSTANDING ROBUST MEMORIZATION IN ADVERSARIAL TRAINING

Abstract

Adversarial training is a standard method for training neural networks to be robust to adversarial perturbations. However, in contrast with benign overfitting in the standard deep learning setting, where over-parameterized neural networks surprisingly generalize well on unseen data, adversarial training achieves low robust training error yet leaves a significant robust generalization gap, which prompts us to explore what mechanism leads to robust overfitting during the learning process. In this paper, we propose an implicit bias of adversarial training, which we call robust memorization, under a realistic data assumption. Using function approximation theory, we prove that ReLU networks of efficient size are able to achieve robust memorization, whereas robust generalization requires exponentially large models. We then demonstrate robust memorization in adversarial training from both empirical and theoretical perspectives: we empirically investigate the dynamics of the loss landscape over the input, and we provide a theoretical analysis of robust memorization on linearly separable data. Finally, we prove novel generalization bounds based on robust memorization, which further explain why deep neural networks exhibit both high clean test accuracy and robust overfitting at the same time.

1. INTRODUCTION

Although deep learning has achieved remarkable success in many application fields, such as computer vision (Voulodimos et al., 2018) and natural language processing (Devlin et al., 2018), it is well known that small but adversarial perturbations of natural data can fool well-trained classifiers (Szegedy et al., 2013; Goodfellow et al., 2014), which motivates the design of adversarially robust learning algorithms. In practice, adversarial training methods (Madry et al., 2017; Shafahi et al., 2019; Zhang et al., 2019) are widely used to improve the robustness of models by treating perturbed data as training data. However, while these robust learning algorithms achieve high robust training accuracy, they still leave a non-negligible robust generalization gap (Raghunathan et al., 2019), also called robust overfitting (Rice et al., 2020; Yu et al., 2022). To explain this puzzling phenomenon, a series of works has attempted to provide theoretical understanding from different perspectives. Although these works provide convincing theoretical evidence in various settings, a gap between theory and practice remains, for at least two reasons. First, although previous works have shown that adversarially robust generalization requires more data and larger models (Schmidt et al., 2018; Li et al., 2022), it is unclear what mechanism during the adversarial training process directly causes robust overfitting. In other words, a trivial model that guesses labels uniformly at random (e.g., a deep neural network at random initialization) has no robust generalization gap, which implies that we should take the learning process into consideration when analyzing robust generalization.
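To make the adversarial training procedure referenced above concrete, the following is a minimal sketch in the spirit of Madry et al. (2017), specialized to a linear model with logistic loss, where the inner maximization over an l_inf ball has a closed form. The data, dimensions, radius, and learning rate are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps, lr = 10, 200, 0.05, 0.5

# Toy linearly separable data with labels in {-1, +1} (illustrative assumption).
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)

w = np.zeros(d)
for _ in range(300):
    # Inner maximization: for a linear score y * (w . x), the worst-case
    # l_inf perturbation of radius eps is x' = x - eps * y * sign(w).
    X_adv = X - eps * y[:, None] * np.sign(w)[None, :]
    margins = y * (X_adv @ w)
    # Outer minimization: gradient step on the logistic loss at perturbed data;
    # d/dw log(1 + exp(-m)) = -y * x_adv * sigmoid(-m).
    grad = -(y[:, None] * X_adv * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * grad

# Fraction of training points classified correctly under the worst-case perturbation.
robust_train_acc = np.mean(y * ((X - eps * y[:, None] * np.sign(w)) @ w) > 0)
```

For deep networks the inner maximization has no closed form, and it is approximated by iterative projected gradient ascent on the input; the outer loop is otherwise unchanged.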
Second, and most importantly, while some works (Tsipras et al., 2018; Zhang et al., 2019) point out that achieving robustness may hurt clean test accuracy, in most cases the drop in robust test accuracy is observed to be much larger than the drop in clean test accuracy in adversarial training (Schmidt et al., 2018; Raghunathan et al., 2019) (see Table 1). Namely, a weak version of benign overfitting, in which overparameterized deep neural networks both fit random data and generalize well on unseen clean data, persists after adversarial training. It is therefore natural to ask the following question: What happens during the adversarial learning process that results in both benign clean overfitting and significant robust overfitting at the same time? In this paper, we provide a theoretical understanding of the adversarial training process by proposing a novel implicit bias, which we call robust memorization, under a realistic data assumption; it explains why adversarial training leads to both high clean test accuracy and a robust generalization gap. The fundamental data structure we leverage is that the data can be cleanly separated by a neural network f_clean of moderate size, but this classifier is non-robust on data with small margin. This is consistent with the common observation that well-trained neural networks have good clean performance yet fail to classify perturbed data, because small-margin data lie close to the decision boundary. The existence of both small- and large-margin data in image classification has been empirically verified (Banburski et al., 2021). We also assume that the data are well-separated, so that a robust classifier exists in general (Yang et al., 2020); however, this robust classifier may be hard to approximate by neural networks (Li et al., 2022). In other words, clean training often finds a simple but non-robust solution, even though a robust classifier always exists.
Specifically, we consider an underlying data distribution D in which a µ fraction of the data has small margin and can be perturbed into adversarial examples, while the remaining 1 - µ fraction has large margin and can be robustly classified by the neural network f_clean. In adversarial training, we access N instances S drawn randomly from D as training data. By applying concentration techniques from high-dimensional probability, there are roughly µN small-margin and (1 - µ)N large-margin training points. To answer the question posed above, under this realistic data assumption, we consider the following classifier:

$$
f_{\mathrm{adv}}(x) := \underbrace{\sum_{(x_i, y_i) \in S_{\mathrm{small}}} y_i \, \mathbb{I}\{\|x - x_i\|_p \le \delta\}}_{\text{robust local indicators on small-margin data}} \;+\; \underbrace{f_{\mathrm{clean}}(x)\, \mathbb{I}\Big\{\min_{(x_i, y_i) \in S_{\mathrm{small}}} \|x - x_i\|_p > \delta\Big\}}_{\text{clean classifier on other data}},
$$

where S_small is the small-margin part of the training dataset and δ denotes the adversarial perturbation radius, which is smaller than the separation between data points in our setting.
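The two-branch classifier above can be sketched directly in code: memorize each small-margin training point with its label inside a δ-ball, and defer to the clean classifier everywhere else. In this illustrative sketch, `f_clean`, the data in `S_small`, and `delta` are placeholder assumptions; in the paper, f_clean is a moderate-size neural network.

```python
import numpy as np

def make_f_adv(f_clean, S_small, delta, p=np.inf):
    """Build f_adv(x): return the memorized label inside a delta-ball (in l_p norm)
    around any small-margin training point; fall back to f_clean elsewhere."""
    X_small = np.array([x for x, _ in S_small])
    y_small = np.array([y for _, y in S_small])

    def f_adv(x):
        dists = np.linalg.norm(X_small - x, ord=p, axis=1)
        inside = dists <= delta
        if inside.any():
            # Robust local indicators on small-margin data.
            return np.sign(y_small[inside].sum())
        # Clean classifier on all other inputs.
        return f_clean(x)

    return f_adv
```

For example, with `f_clean = lambda x: np.sign(x[0])` and a single memorized point `(np.array([-0.1, 0.0]), +1)` with `delta = 0.2`, inputs within the δ-ball receive the memorized label +1 even though f_clean would output -1, while inputs far from all memorized points are classified by f_clean. This is exactly why f_adv has zero robust training error on the memorized points yet remains non-robust elsewhere.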



Table 1: Train and test performances on CIFAR-10 (Raghunathan et al., 2019)

Indeed, the classifier f_adv satisfies all the main characteristics of deep models produced by adversarial training (Proposition 3.2). First, the adversarial training error of f_adv is zero. Second, f_adv has good clean performance, because it is robust within the neighborhood of small-margin training data and coincides with the clean classifier on other data. Third, although f_adv has low robust training error, it is non-robust on other data, since it only robustly memorizes the small-margin training data, which results in robust overfitting. Inspired by the ideal model f_adv, we then conjecture that deep neural networks converge to solutions similar to f_adv in adversarial training. This is a novel implicit bias of adversarial training, which we call robust memorization; it provides a theoretical understanding of adversarial training, including why the solutions found by adversarial training exhibit both good clean generalization and a robust generalization gap. Based on function approximation theory, we prove that, for a d-dimensional training dataset S with N samples, ReLU networks with Õ(µNd + poly(d)) parameters are able to represent the target functions f_adv for robust memorization (Theorem 3.3 and Corollary 3.4). However, we also prove a lower bound on the network size that is exponential in the data dimension
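One standard building block behind upper bounds of this kind is a small ReLU network realizing a smoothed indicator of a δ-ball around a training point. As a rough illustration (not the paper's exact construction), four ReLU units suffice in one dimension to produce a trapezoid bump that equals 1 on [c - δ, c + δ] and 0 at distance greater than δ + τ from the interval; the parameters c, δ, and τ below are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_bump(x, c, delta, tau):
    """Piecewise-linear bump built from 4 ReLU units: equals 1 on
    [c - delta, c + delta], 0 outside [c - delta - tau, c + delta + tau],
    with linear ramps of width tau in between."""
    return (relu(x - (c - delta - tau))
            - relu(x - (c - delta))
            - relu(x - (c + delta))
            + relu(x - (c + delta + tau))) / tau
```

Summing one such (suitably combined) bump per memorized point, with O(d) units per point to handle the d coordinates of an l_inf ball, gives the Õ(µNd) parameter count for the indicator part of f_adv, while the clean classifier contributes the poly(d) term.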

