TOWARDS UNDERSTANDING ROBUST MEMORIZATION IN ADVERSARIAL TRAINING

Abstract

Adversarial training is a standard method for training neural networks to be robust to adversarial perturbations. However, in contrast with benign overfitting in the standard deep learning setting, where over-parameterized neural networks surprisingly generalize well on unseen data, adversarial training achieves low robust training error yet still leaves a significant robust generalization gap, which prompts us to explore what mechanism leads to robust overfitting during the learning process. In this paper, we propose an implicit bias called robust memorization in adversarial training under a realistic data assumption. Using function approximation theory, we prove that ReLU networks of efficient size are able to achieve robust memorization, while robust generalization requires exponentially large models. We then demonstrate robust memorization in adversarial training from both empirical and theoretical perspectives. In particular, we empirically investigate the dynamics of the loss landscape over the input, and we provide a theoretical analysis of robust memorization on linearly separable data. Finally, we prove novel generalization bounds based on robust memorization, which further explain why deep neural networks exhibit both high clean test accuracy and robust overfitting at the same time.

1. INTRODUCTION

Although deep learning has achieved remarkable success in many application fields, such as computer vision (Voulodimos et al., 2018) and natural language processing (Devlin et al., 2018), it is well known that small adversarial perturbations added to natural data can confuse well-trained classifiers (Szegedy et al., 2013; Goodfellow et al., 2014), which motivates the design of adversarially robust learning algorithms. In practice, adversarial training methods (Madry et al., 2017; Shafahi et al., 2019; Zhang et al., 2019) are widely used to improve the robustness of models by treating perturbed data as training data. However, while these robust learning algorithms achieve high robust training accuracy, they still leave a non-negligible robust generalization gap (Raghunathan et al., 2019), also called robust overfitting (Rice et al., 2020; Yu et al., 2022). To explain this puzzling phenomenon, a series of works has attempted to provide theoretical understanding from different perspectives. Although these works provide convincing theoretical evidence in various settings, a gap between theory and practice remains, for at least two reasons. First, although previous works have shown that adversarially robust generalization requires more data and larger models (Schmidt et al., 2018; Li et al., 2022), it is unclear what mechanism during the adversarial training process directly causes robust overfitting. In other words, a trivial model that guesses labels uniformly at random (e.g., a deep neural network at random initialization) has no robust generalization gap, which implies that we should take the learning process into consideration when analyzing robust generalization.
Second, and most importantly, while some works (Tsipras et al., 2018; Zhang et al., 2019) point out that achieving robustness may hurt clean test accuracy, in most cases the drop in robust test accuracy observed in adversarial training is much larger than the drop in clean test accuracy (Schmidt et al., 2018; Raghunathan et al., 2019) (see Table 1). Namely, a weak version of benign overfitting, in which over-parameterized deep neural networks can both fit random data powerfully and generalize well on unseen clean data, persists after adversarial training. Therefore, it is natural to ask the following question: what happens during the adversarial learning process that results in both benign clean overfitting and significant robust overfitting at the same time? In this paper, we provide a theoretical understanding of the adversarial training process by proposing a novel implicit bias, called robust memorization, under a realistic data assumption, which explains why adversarial training leads to both high clean test accuracy and a robust generalization gap. The fundamental data structure we leverage is that the data can be cleanly separated by a neural network f_clean of moderate size, but this neural classifier is non-robust on data with small margin. This is consistent with the common observation that well-trained neural networks have good clean performance but fail to classify perturbed data, owing to the short distance between small margin data and the decision boundary. The existence of small and large margin data in image classification has been empirically verified (Banburski et al., 2021). We also assume that the data is well-separated, which means that a robust classifier exists in general (Yang et al., 2020). However, this robust classifier may be hard to approximate by neural networks (Li et al., 2022). In other words, clean training often finds a simple but non-robust solution, although a robust classifier always exists.
Specifically, we consider an underlying data distribution D in which a µ fraction of the data has small margin and can be perturbed into adversarial examples, while the remaining 1 − µ fraction has large margin and can be robustly classified by the neural network f_clean. In adversarial training, we access N instances S drawn i.i.d. from D as training data. In fact, by applying concentration techniques from high-dimensional probability, we know that there are roughly µN small margin training data and (1 − µ)N large margin training data. To answer the question above, under this realistic data assumption, we consider the following classifier:

f_adv(x) := Σ_{(x_i, y_i) ∈ S_small} y_i · I{∥x − x_i∥_p ≤ δ} + f_clean(x) · I{min_{(x_i, y_i) ∈ S_small} ∥x − x_i∥_p > δ},

where the first term consists of robust local indicators on the small margin data and the second term applies the clean classifier on all other data; S_small is the small margin part of the training dataset, and δ denotes the adversarial perturbation radius, which is smaller than the separation between data points in our setting. To probe this regime empirically, we consider the measure

(1/N) Σ_{i=1}^N max_{∥ξ∥_p ≤ δ} ∥∇_x L(f(x_i + ξ), y_i)∥_q,

where {(x_i, y_i)}_{i=1}^N denotes the training dataset and 1/p + 1/q = 1. This measure reflects the local flatness of the loss landscape over the input on the training dataset. In the robust memorization regime, the loss landscape is very flat within the adversarial training perturbation radius but very sharp outside this neighborhood. Through numerical experiments, the dynamics of this measure, observed across different perturbation radii δ and different training epochs, show that the behavior of adversarially trained models is very similar to robust memorization. More details are presented in Section 4. To further support our conjecture, we theoretically analyze the optimization process of adversarial training on a simple linearly separable dataset.
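To make this construction concrete, the following is a minimal NumPy sketch of a classifier in the spirit of f_adv: it memorizes the small margin training points with radius-δ indicator balls and falls back to the clean classifier everywhere else. The toy f_clean, the data points, and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def make_f_adv(f_clean, S_small, delta, p=2):
    """Sketch of the hybrid classifier: memorize small-margin training
    points with radius-delta l_p indicator balls, defer to the clean
    classifier outside every ball. S_small is a list of (x_i, y_i)."""
    xs = np.stack([x for x, _ in S_small])
    ys = np.array([float(y) for _, y in S_small])

    def f_adv(x):
        dists = np.linalg.norm(xs - x, ord=p, axis=1)
        inside = dists <= delta
        if inside.any():
            # robust local indicator term: sum of memorized labels
            return float(ys[inside].sum())
        # outside all delta-balls: fall back to the clean classifier
        return float(f_clean(x))

    return f_adv

# toy clean classifier that is non-robust near the hyperplane x[0] = 0
f_clean = lambda x: x[0]
S_small = [(np.array([0.05, 0.0]), 1), (np.array([-0.05, 0.0]), -1)]
f = make_f_adv(f_clean, S_small, delta=0.03)
```

Inside a memorized ball the output is the stored label (so the δ-neighborhood of a small margin training point is classified robustly); elsewhere the clean classifier decides, matching the two terms of the displayed formula.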
Our results (Theorem 5.1 and Theorem 5.2) show that clean classification on large margin data implies good clean performance on small margin data, and that ReLU networks of moderate size can robustly memorize small margin data efficiently. In other words, we prove that after adversarial training, the network exactly learns the function described in Section 3.
Robust Generalization Gap (Robust Overfitting). One surprising behavior of deep learning is that over-parameterized neural networks can generalize well, which is called benign overfitting: deep models not only have powerful memorization but also perform well on unseen data (Zhang et al., 2017; Belkin et al., 2019). However, in contrast to the standard (non-robust) setting, Rice et al. (2020) empirically investigate the robust performance of models trained by adversarial training methods, which are designed to improve adversarial robustness (Szegedy et al., 2013; Madry et al., 2017), and show that robust overfitting can be observed on multiple datasets.
Theoretical Understanding of Robust Overfitting. Schmidt et al. (2018); Balaji et al. (2019); Dan et al. (2020) study the sample complexity of adversarial robustness, showing that adversarially robust generalization requires more data than the standard setting, which explains the robust generalization gap from the perspective of statistical learning theory. Another line of works (Tsipras et al., 2018; Zhang et al., 2019) proposes a principle called the robustness-accuracy trade-off and proves it theoretically in different settings, which mainly explains the widely observed drop in robust test accuracy as a trade-off between adversarial robustness and clean test accuracy. Recently, Li et al. (2022) investigate the robust expressive ability of neural networks and demonstrate that, for well-separated datasets, robust generalization requires exponentially large models, so the hardness of robust generalization may stem from the limited expressive power of practical models.
Memorization in Adversarial Training. Dong et al. (2021); Xu et al. (2021) empirically and theoretically explore the memorization effect in adversarial training to promote a deeper understanding of model capacity, convergence, generalization, and especially robust overfitting of adversarially trained models. In contrast to their works, the concept of robust memorization proposed in our paper focuses on both robust overfitting and high clean test accuracy, i.e., on the surprising absence of harmful clean overfitting.

3. ROBUST MEMORIZATION IN ADVERSARIAL TRAINING

In this section, we first introduce some preliminaries of our theoretical framework. We consider a binary classification task X → Y, where X ⊆ [0, 1]^d and Y = {−1, 1} denote the support of the data inputs and the ground-truth labels, respectively, and d is the input dimension. Let D be the underlying data distribution. We use the clean 0-1 loss to evaluate the clean performance of a classifier: L_D^clean(f) := E_{(x,y)∼D}[I{sgn(f(x)) ≠ y}]. We then assume that the data can be cleanly separated by a ReLU network of reasonable size (width and depth). Specifically, we assume that there exists a ReLU network classifier f_clean such that L_D^clean(f_clean) = 0, where the clean classifier is defined as f_clean(x) := W_L σ(W_{L−1} σ(··· σ(W_1 x + b_1) ···) + b_{L−1}) + b_L, with W_i ∈ R^{m_i × m_{i−1}}, b_i ∈ R^{m_i}, 1 ≤ i ≤ L, m_0 = d, m_L = 1, and σ(·) denoting the ReLU activation function σ(t) = max{0, t}. Moreover, we consider the width max{m_0, m_1, ..., m_L} to be O(d) and the depth L to be constant. To understand adversarial robustness geometrically, we introduce the notion of a decision boundary. Formally, the decision boundary of a classifier f is defined as DB(f) := {x ∈ X | f(x) = 0 and ∀ε > 0, ∃ x′, x″ ∈ B_p(x, ε) s.t. f(x′)f(x″) < 0}. Notice that this definition differs from that in Zhang et al. (2019), since it requires that the sign changes within every neighborhood of the decision boundary, which helps us establish the relation between the decision boundary and the adversarial robust margin. We define the l_p adversarial margin of a classifier f at a data point (x, y) as min_{∥ξ∥_p ≤ δ} y f(x + ξ); then we have the following proposition. Proposition 3.1.
(Equivalence between the adversarial margin and the distance to the decision boundary.) Assume that the distance between a data point and the decision boundary is well defined. Then it holds that
{x ∈ X | min_{∥ξ∥_p ≤ δ} y f(x + ξ) ≥ 0} = {x ∈ X | dist_p(x, DB(f)) ≥ δ},
{x ∈ X | min_{∥ξ∥_p ≤ δ} y f(x + ξ) < 0} = {x ∈ X | dist_p(x, DB(f)) < δ},
where dist_p(x, C) denotes the l_p distance between the point x and the set C. Proposition 3.1 shows that a classifier is robust on data far from its decision boundary and non-robust on data close to it. Thus, according to the distance from the decision boundary, we can divide all data into two classes: large margin data X_large = {x ∈ X | dist_p(x, DB(f_clean)) ≥ δ} and small margin data X_small = {x ∈ X | dist_p(x, DB(f_clean)) < δ}. We also consider P_{(x,y)∼D}{x ∈ X_small} = µ and P_{(x,y)∼D}{x ∈ X_large} = 1 − µ, where 0 < µ < 1 is a proportion constant. Another key property of the data in our setting is that it is well-separated, meaning that data points are far from each other even though small margin data exists. This property is widely observed (Yang et al., 2020). Given the N-sample training dataset S, we can likewise divide S into two sets: the small margin training dataset S_small = {(x_i, y_i) ∈ S | dist_p(x_i, DB(f_clean)) < δ} and the large margin training dataset S_large = {(x_i, y_i) ∈ S | dist_p(x_i, DB(f_clean)) ≥ δ}. To evaluate the robust performance of models, we use the 0-1 adversarial robust test loss L_D^{adv,δ}(f) := E_{(x,y)∼D}[max_{∥ξ∥_p ≤ δ} I{sgn(f(x + ξ)) ≠ y}]. The final goal of adversarial training is to find a good solution with low robust test loss. Now, we present the main concept discussed in this work, robust memorization. Concretely, we consider the following classifier:
f_adv(x) := Σ_{(x_i, y_i) ∈ S_small} y_i · I{∥x − x_i∥_p ≤ δ} + f_clean(x) · I{min_{(x_i, y_i) ∈ S_small} ∥x − x_i∥_p > δ}.
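Proposition 3.1 can be checked in closed form for a linear classifier f(x) = wᵀx: by Hölder duality, min_{∥ξ∥_∞ ≤ δ} y·wᵀ(x + ξ) = y·wᵀx − δ∥w∥_1, attained at ξ = −yδ·sign(w), and the margin flips sign exactly when δ exceeds the l_∞ distance |wᵀx|/∥w∥_1 from x to the decision hyperplane. The sketch below (with our own toy numbers) verifies the closed form against the explicit worst-case perturbation.

```python
import numpy as np

def adv_margin_linear(w, x, y, delta):
    # closed form for l_inf perturbations (p = inf, q = 1):
    # min_{||xi||_inf <= delta} y * w.(x + xi) = y*w.x - delta*||w||_1
    return y * w.dot(x) - delta * np.abs(w).sum()

def worst_case_linf(w, x, y, delta):
    # the minimizer is attained at xi = -y * delta * sign(w)
    return y * w.dot(x - y * delta * np.sign(w))

w = np.array([1.0, -2.0])     # toy linear classifier f(x) = w.x
x, y = np.array([0.3, -0.1]), 1
```

Here y·wᵀx = 0.5 and ∥w∥_1 = 3, so the l_∞ distance to the boundary is 1/6: the point is robust for δ = 0.05 (margin 0.35) but non-robust for δ = 0.2 (margin −0.1), matching the two sets in Proposition 3.1.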
Under the above realistic data assumption, we can derive the following result.
Proposition 3.2. Assume that data within the l_p δ-neighborhood of the training data has a locally constant label (i.e., all points in the neighborhood of a given data point share its label). Then the classifier f_adv has the following properties:
• adversarial robust training error: L_S^{adv,δ}(f_adv) = 0;
• clean test error: L_D^clean(f_adv) = 0;
• adversarial robust test error: L_D^{adv,δ}(f_adv) = Ω(L_D^{adv,δ}(f_clean)).
This proposition shows that the classifier f_adv has the same behavior as deep models trained by adversarial training. However, while the previous results show that achieving both low robust training error and high clean test accuracy is representationally efficient for ReLU networks, robust generalization still requires exponentially large models even under our data assumption.
Theorem 3.5. Let ε ∈ (0, 1) be a small constant and let F_n be the set of functions represented by ReLU networks with at most n parameters. There exist a sequence N_d = exp(Ω(d)), d ≥ 1, and a universal constant C_1 > 0 such that the following holds: for any c ∈ (0, 1), there exists an underlying data distribution D that satisfies all of the above data assumptions and is µ_0-balanced, such that for any R > 2δ and robust radius cε, we have
inf{L_D^{adv,cε}(f) : f ∈ F_{N_d}} ≥ C_1 µ_0,
where µ_0-balanced means that there exists a uniform probability measure m_0 on X such that inf{D(E)/m_0(E) : E is Lebesgue measurable and m_0(E) > 0} ≥ µ_0. In other words, the robust generalization error cannot be lower than the constant α = C_1 µ_0 unless the ReLU network has size larger than exp(Ω(d)). Therefore, we obtain the following representation complexity inequality:
Clean Classifier: poly(d) < Robust Memorization: Õ(µNd + poly(d)) < Robust Classifier: exp(Ω(d)).
This inequality shows that, while functions achieving robust memorization have only mildly higher representation complexity than clean classifiers, full adversarial robustness requires excessively higher complexity, which may lead adversarial training to converge to the robust memorization regime.

4. ROBUST MEMORIZATION ON REAL IMAGE DATASETS

In this section, we demonstrate that, on real image datasets, adversarial training can learn classifiers in the robust memorization regime. Indeed, we need to study whether adversarially trained models behave like the classifier f_adv described in Section 3.
Learning Process. We also focus on the dynamics of the loss landscape over the input during the adversarial learning process. Thus, we compute the empirical average of the maximum gradient norm within different perturbation radii ε and at different training epochs. The numerical results are plotted in Figure 2. On both MNIST and CIFAR10, as the number of epochs increases, the training curve descends within the training perturbation radius, which implies that models learn local robust indicators to robustly memorize training data. However, while the test curve on MNIST behaves similarly to the training curve, on CIFAR10 the test curve instead ascends within the training radius, which potentially explains why the robust generalization gap on CIFAR10 is more significant than on MNIST.
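As a hedged illustration of the flatness measure from Section 3, the snippet below evaluates it exactly for a linear model with logistic loss, where the inner maximization over the l_∞ ball reduces to the FGSM corner of the ball. The model, loss, and data are our own toy choices, not the paper's experimental setup (which uses deep networks on MNIST/CIFAR10 and approximates the inner maximum numerically).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def local_flatness_linear(w, X, Y, delta):
    """Sketch of the flatness measure
       (1/N) sum_i max_{||xi||_inf <= delta} ||grad_x L(f(x_i+xi), y_i)||_1
    for the linear model f(x) = w.x with logistic loss
    L = log(1 + exp(-y f(x))), where grad_x L = -y * sigmoid(-y w.x) * w.
    The gradient norm is largest where the margin y*w.x is smallest, i.e.
    at the FGSM corner x - y*delta*sign(w) of the l_inf ball."""
    w1 = np.abs(w).sum()
    vals = [sigmoid(-(y * w.dot(x) - delta * w1)) * w1 for x, y in zip(X, Y)]
    return float(np.mean(vals))

w = np.array([2.0, -1.0])                     # toy linear model
X = [np.array([1.0, 0.0]), np.array([-1.0, 0.5])]
Y = [1, -1]
```

The measure is monotone in δ here: enlarging the ball can only expose sharper points of the loss landscape, which mirrors how the empirical curves in Figure 2 are read across radii.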

5. THEORETICAL ANALYSIS OF ADVERSARIAL TRAINING ON LINEARLY SEPARABLE DATA

In this section, we theoretically demonstrate robust memorization by analyzing the convergence of adversarial training on a synthetic dataset. Specifically, we assume that a data point x ∈ R^d can be written as x = c·y·w* + ξ, where w* is the target direction (∥w*∥_2 = 1), y ∈ {−1, +1} is the label, c is the norm scale, equal to α for large margin data and β for small margin data (β ≪ α), and ξ ∼ N(0, I_d − w*w*ᵀ) is random Gaussian noise orthogonal to w*. Indeed, this synthetic dataset captures the main characteristics of the data assumption in Section 3. On one hand, there exists a simple but non-robust classifier f_clean(x) = w*ᵀx that classifies the data cleanly but is non-robust on small margin data. On the other hand, due to the randomness of the noise, the small margin data is also well-separated in the high-dimensional setting.
Under review as a conference paper at ICLR 2023
We consider a mixed learner model f(x) = w_0ᵀx + Σ_{i=1}^m a_i φ(w_iᵀx), where the neuron is φ(t) = σ(t − b), σ(·) is the ReLU activation function, and b is a threshold. With an N-sample training dataset S = S_small ∪ S_large drawn i.i.d. from the underlying distribution D, we minimize the adversarial exponential loss L_S^{exp,adv,δ}(f) = (1/N) Σ_{i=1}^N max_{∥δ_i∥_2 ≤ δ} exp(−y_i f(x_i + δ_i)). Using the standard adversarial training algorithm FGSM (Goodfellow et al., 2014), we have the following result.
Theorem 5.1. With the large margin data S_large as training data, we use FGSM to train the model f with only w_0 activated and zero initialization, deriving a parameter iteration sequence {w_0^k}_{k=1,2,...}. Then, with high probability over the sampled set, we have lim_{k→∞} ∥w_0^k/∥w_0^k∥_2 − w*∥_2 = o(1).
Theorem 5.1 shows that, under the linearly separable data assumption, high clean test accuracy on large margin data implies good clean performance on small margin data, which helps us better understand clean generalization in adversarial training (see Theorem 6.2 in Section 6).
Theorem 5.2.
By using a modified adversarial training algorithm on the whole training dataset S, we derive a parameter sequence {θ^k = (w_i^k)_{i=0}^m}_{k=1,2,...}. Then, with high probability, the 0-1 adversarial training loss satisfies lim_{k→∞} L_S^{0-1,adv,δ}(θ^k) = 0.
In fact, the above theorems show that adversarial training on linearly separable data converges to the normalized target solution f(x) = w*ᵀx + Σ_{x_i ∈ S_small} y_i φ(ξ_iᵀx), which is exactly the robust memorization function described in Section 3.
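The Theorem 5.1 setting can be simulated directly. The sketch below generates the synthetic data x = c·y·w* + ξ (large margin only, c = α, as in Theorem 5.1), runs FGSM adversarial training of the linear part w_0 under the exponential loss from zero initialization, and checks that the learned direction aligns with w*. All hyperparameters (α, δ, learning rate, step count) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def make_linear_data(n, d, alpha, w_star, rng):
    # Section 5 synthetic model (sketch): x = c*y*w* + noise, with the
    # Gaussian noise projected orthogonal to w*; large-margin data only.
    ys = rng.choice([-1.0, 1.0], size=n)
    xi = rng.standard_normal((n, d))
    xi -= np.outer(xi @ w_star, w_star)          # remove w* component
    return alpha * ys[:, None] * w_star + xi, ys

def fgsm_train_linear(X, Y, delta, lr=0.05, steps=100):
    # FGSM inner step for exp loss on f(x) = w.x:
    #   grad_x L = -y * exp(-y w.x) * w, so x_adv = x - y*delta*sign(w)
    # (at w = 0 the sign vanishes and x_adv = x).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = np.zeros_like(w)
        for x, y in zip(X, Y):
            x_adv = x - y * delta * np.sign(w)
            m = np.clip(-y * w.dot(x_adv), -30.0, 30.0)  # overflow guard
            g += -y * np.exp(m) * x_adv
        w -= lr * g / len(X)
    return w
```

With α well above the noise scale, the cosine similarity between the learned w and w* approaches 1, consistent with the convergence direction claimed in Theorem 5.1 (this is a numerical sanity check, not a proof).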

6. GENERALIZATION GUARANTEES BASED ON ROBUST MEMORIZATION

A standard generalization bound cannot directly explain high clean test accuracy after adversarial training. In general, a standard generalization bound takes the following form: with high probability over the randomly sampled training data, we have
L_D^clean(f) ≤ L_S^clean(f) + √(Complexity(F)/N),
where N is the number of samples and Complexity(F) denotes a complexity measure of the function family F, such as the VC dimension or the Rademacher complexity.

6.1. IMPROVED GENERALIZATION BOUND ANALYSIS BASED ON ROBUST MEMORIZATION

Fortunately, we notice that the robust memorization function f_adv has much lower complexity on all large margin data (poly(d)) than on the sampled small margin data (O(Nd)). Inspired by this observation, we prove a novel generalization bound.
Assumption 6.1. We assume that, for any classifier f output by adversarial training under the realistic data assumption of Section 3, we have L_{D_small}^clean(f) = O(L_{D_large}^clean(f)), where D_small and D_large denote the small margin and large margin parts of the population D.
This assumption is theoretically verified in Section 5; it means that the clean test error on small margin data is bounded by the clean test error on large margin data. Indeed, it holds for any homogeneous classifier when small margin data also has small norm. We leverage this property to prove the following clean generalization bound.
Theorem 6.2. Let D be the underlying distribution satisfying all assumptions in Section 3 and Assumption 6.1. With access to an N-sample training dataset S = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} drawn i.i.d. from D, there exists a modified adversarial training algorithm with perturbation
This robust generalization bound shows that the robust generalization gap can be controlled by the global flatness of the loss landscape over the input rather than by local flatness. We also derive a lower bound on the robust generalization gap, stated as follows.
Proposition 6.4. Let D be an underlying distribution with a smooth density function. Then we have L_D^{adv,δ}(f) − L_D^clean(f) = Ω(δ · E_{(x,y)∼D}[∥∇_x L(f(x), y)∥_1]).
Theorem 6.3 and Proposition 6.4 show that the robust generalization gap is closely related to global flatness. However, although adversarial training achieves good local flatness via robust memorization, models lack global flatness, which leads to robust overfitting.
This point is also verified by numerical experiments on CIFAR10 (see the results in Figure 3). First, global flatness grows much faster than local flatness in practice. Second, as global flatness increases during the training process, the robust generalization gap increases with it.

7. CONCLUSION

This paper provides a theoretical understanding of adversarial training by proposing an implicit bias called robust memorization. We first explore the representation complexity of robust memorization under the realistic data assumption. Then, we empirically demonstrate robust memorization on real image datasets, and we theoretically analyze adversarial training in the linearly separable data setting. Finally, we prove generalization guarantees inspired by robust memorization, which explain why both good clean performance and robust overfitting occur in adversarial training.



Motivated by robust memorization, we propose clean and robust generalization bounds. On the one hand, the clean generalization bound (Theorem 6.2) shows that the clean sample complexity is polynomial in the data dimension d. On the other hand, we derive an upper bound on the robust generalization gap (Theorem 6.3) that depends only on the number of samples and the global flatness of the loss landscape over the input. Since adversarial training only promotes robust memorization of the training data, the global flatness of the landscape has no guarantee (as our numerical results verify), which explains why the robust generalization gap is large even though the robust training error can be very low in adversarial training.

2. RELATED WORK

Adversarial Examples and Adversarial Training (AT). Szegedy et al. (2013) first made an intriguing discovery: even state-of-the-art deep neural networks are vulnerable to adversarial examples. Goodfellow et al. (2014) then proposed FGSM to find adversarial examples and introduced the adversarial training framework to enhance the defense of models against adversarial perturbations. Madry et al. (2017) designed PGD, which can be viewed as a multi-step version of FGSM, to mount adversarial attacks and used it to improve adversarial training. In general, FGSM-based AT and PGD-based AT are commonly regarded as the standard adversarial training methods.
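The relationship between the two standard attacks can be sketched as follows: FGSM takes a single signed-gradient step of size ε, while PGD iterates smaller steps and projects back onto the l_∞ ball around the clean input; for a loss whose input gradient is constant (e.g., a linear model), the two coincide. The gradient function and numbers below are our own illustrative choices.

```python
import numpy as np

def fgsm(x, grad_fn, eps):
    # single signed-gradient step (Goodfellow et al., 2014)
    return x + eps * np.sign(grad_fn(x))

def pgd(x, grad_fn, eps, alpha, steps):
    # multi-step variant with projection back onto the l_inf ball of
    # radius eps around the clean input (Madry et al., 2017)
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the ball
    return x_adv

# toy linear model f(x) = w.x with surrogate loss L = -y*f(x),
# so grad_x L = -y*w is constant and FGSM and PGD agree
w = np.array([1.0, -1.0])
y = 1
x = np.array([0.2, 0.3])
grad_fn = lambda x_: -y * w
```

On non-linear losses the iterates of PGD follow the changing gradient and generally find stronger perturbations than the one-shot FGSM step, which is why PGD-based AT is the usual default.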

This well-separated assumption is widely observed (Yang et al., 2020), and it is clearly the foundation of any robust classifier. Formally, we consider the support X = A ∪ B ⊂ [0, 1]^d, where the two disjoint sets A and B denote the positive and negative classes, respectively, and we assume that dist_p(A, B) > R, where dist_p(·, ·) denotes the l_p distance between two sets.

3.1. ROBUST MEMORIZATION UNDER THE REALISTIC DATA ASSUMPTION

In this subsection, we consider adversarial training with access to an N-sample training dataset S = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} drawn i.i.d. from the data distribution D, and we minimize the 0-1 adversarial robust training loss L_S^{adv,δ}(f) := (1/N) Σ_{i=1}^N max_{∥ξ∥_p ≤ δ} I{sgn(f(x_i + ξ)) ≠ y_i}, where δ is the perturbation radius.

Figure 1: (a)(b): Robust Memorization on MNIST with Training l ∞ Perturbation Radius ϵ = 0.1; (c)(d): Robust Memorization on CIFAR10 with Training l ∞ Perturbation Radius ϵ = 8/255.

Figure 2: (a)(b): Learning Process on MNIST with Training l ∞ Perturbation Radius ϵ = 0.1; (c)(d): Learning Process on CIFAR10 with Training l ∞ Perturbation Radius ϵ = 8/255.

However, this standard generalization bound can not explain high clean test accuracy after adversarial training. In order to have enough capacity for achieving low robust training error (i.e. min f ∈F L adv,δ S (f ) = 0), we need to set Complexity(F ) = O(N d) (due to Corollary 3.4 and the relation between VC dimension and the number of parameters (Bartlett et al., 2019)), which causes the above bound too loose to use.

Figure 3: Left: Local and Global Flatness During Adversarial Training on CIFAR10; Right: The Relation Between Robust Generalization Gap and Global Flatness on CIFAR10.

Table 1: Train and test performance on CIFAR-10 (Raghunathan et al., 2019).

They share a common behavior: they achieve low robust training error and high clean test accuracy, but fail to robustly classify the data population D. Although an exponentially large number of parameters is necessary to approximate a smooth function in general (DeVore et al., 1989), some simple functions can be approximated by neural networks much more efficiently (Telgarsky, 2017). Leveraging this benefit, we use ReLU networks with few parameters to approximate the distance functions d_i(x) := ∥x − x_i∥_p, and we note that the exact indicator I(·) can be approximated by a soft indicator built from two ReLU neurons. Combining these results, we derive Theorem 3.3.

