SGD ON NEURAL NETWORKS LEARNS ROBUST FEATURES BEFORE NON-ROBUST

Abstract

Neural networks are known to be vulnerable to adversarial attacks - small, imperceptible perturbations that cause the network to misclassify an input. A recent line of work attempts to explain this behavior by positing the existence of non-robust features - well-generalizing but brittle features present in the data distribution that are learned by the network and can be perturbed to cause misclassification. In this paper, we look at the dynamics of neural network training through the perspective of robust and non-robust features. We find that there are two very distinct "pathways" that neural network training can follow, depending on the hyperparameters used. In the first pathway, the network initially learns only predictive, robust features and weakly predictive non-robust features, and subsequently learns predictive, non-robust features. On the other hand, a network trained via the second pathway eschews predictive non-robust features altogether, and rapidly overfits the training data. We provide strong empirical evidence to corroborate this hypothesis, as well as theoretical analysis in a simplified setting. Key to our analysis is a better understanding of the relationship between predictive non-robust features and adversarial transferability. We present our findings in light of other recent results on the evolution of inductive biases learned by neural networks over the course of training. Finally, we digress to show that rather than being "quirks" of the data distribution, predictive non-robust features might actually occur across datasets with different distributions drawn from independent sources, indicating that they perhaps possess some meaning in terms of human semantics.

1. INTRODUCTION

Neural networks have achieved state-of-the-art performance on tasks spanning an array of domains like computer vision, translation, speech recognition, robotics, and playing board games (Krizhevsky et al. (2012); Vaswani et al. (2017); Graves et al. (2013); Silver et al. (2016)). However, in recent years, their vulnerability to adversarial attacks - small, targeted input perturbations - has come under sharp focus (Szegedy et al. (2013); Papernot et al. (2017); Carlini & Wagner (2017); Athalye et al. (2018); Schmidt et al. (2018)). Ilyas et al. (2019) propose that neural network vulnerability is at least partly due to neural networks learning well-generalizing but brittle features that are properties of the data distribution. From this point of view, an adversarial example would be constructed by modifying an input of one class slightly such that it takes on the non-robust features of another class. They provide empirical evidence for their theory by training a model on adversarially perturbed examples labeled as the target class, and showing that this model generalizes well to the original, unperturbed distribution. Another, unrelated line of work (Brutzkus et al. (2018); Ji & Telgarsky (2019); Li & Liang (2018)) aims to study the properties of the functions learned by gradient descent over the course of training. Nakkiran et al. (2019) and Mangalam & Prabhu (2019) independently showed that Stochastic Gradient Descent (SGD) learns simple, almost linear functions to start out, but then learns more complex functions as training progresses. Li et al. (2019) showed that models trained with a low learning rate learn easy-to-generalize but hard-to-fit features first, and thus perform poorly on easy-to-fit patterns. In this paper, we study gradient descent on neural networks from the perspective of robust and non-robust features.
Our main thesis is that, based on the choice of hyperparameters, neural network training follows one of two pathways:

• Pathway 1 (Informal): The neural network first learns predictive robust features and weakly predictive non-robust features. As training progresses, it learns predictive non-robust features, and having learned both robust and non-robust predictive features, achieves good performance on held-out data. This is the pathway that Ilyas et al. (2019) used to prove their theory.

• Pathway 2 (Informal): The neural network learns predictive robust features and weakly predictive non-robust features (as in Pathway 1). But thereafter, it begins to fit the noise in the training set, and quickly achieves zero training error. In this scenario, the network learns only the robust predictive features and shows modest generalization on held-out data.

Through a series of experiments, we validate our two-pathway hypothesis, investigate the specific circumstances under which Pathway 1 and Pathway 2 occur, and analyze some properties of the two pathways. We will also develop a closer understanding of the relationship between adversarial transfer and predictive non-robust features, which will aid our analysis of the two pathways. The rest of this paper is organized as follows. Section 2 sets up the notation and definitions we use. In Section 3, we investigate the link between adversarial features and transferability. In Section 4, we provide empirical evidence for the two-pathway hypothesis and analyze some characteristics of each pathway. Section 5 presents a theoretical analysis of gradient descent on a toy linear model. We show that for different choices of initial parameters, the linear model exhibits properties of the first and second pathways. We digress to explore whether non-robust features can occur across datasets in Section 6, and discuss future research directions in Section 7.

2. DEFINITIONS AND PRELIMINARIES

Consider the binary classification setting, where D is a joint distribution over the input space X and the labels {-1, 1}. In this setting, Ilyas et al. (2019) define a feature as any function f : X → R, scaled such that E_{(x,y)∼D}[f(x)] = 0 and E_{(x,y)∼D}[f(x)²] = 1. A feature is said to be ρ-useful if

E_{(x,y)∼D}[y · f(x)] > ρ    (1)

for some ρ > 0, and γ-robust if

E_{(x,y)∼D}[inf_{δ∈Δ(x)} y · f(x + δ)] > γ    (2)

for some γ > 0 and some family of perturbations Δ. For brevity, we sometimes refer to a ρ-useful, γ-robust feature (ρ, γ > 0) simply as a robust feature. Let ρ_D(f) be the largest ρ for which f is ρ-useful. A feature f is said to be (highly) predictive or weakly predictive according to whether ρ_D(f) is high or low, respectively. A useful, non-robust feature, as defined by Ilyas et al. (2019), is one that is ρ-useful for some ρ > 0, but is not γ-robust for any γ > 0. They propose the following experiment to demonstrate the existence of these features. Let C be a classifier trained with Empirical Risk Minimization (ERM) on an empirical distribution D̂ drawn from D. We operate under the following assumption.

Assumption 1. If a distribution D contains a useful feature, then a classifier C trained with ERM on an empirical distribution D̂ drawn from D will learn this feature, provided that we avoid finite-sample overfitting through appropriate measures such as regularization and cross-validation.

Let L_C(x, t) denote the loss of C on input x, for a target label t. Construct adversarial examples by solving the following optimization problem:

x_adv = argmin_{||x' - x|| ≤ ε} L_C(x', t)    (3)

In particular, construct a distribution D_det comprised of (x_adv, t) pairs by using Equation 3, with t chosen deterministically according to y for each (x, y) ∈ D̂.
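To make these definitions concrete, here is a small numerical sketch (our own toy illustration, not an experiment from the paper): a single scalar feature on a synthetic binary task, with the perturbation family acting additively on the feature value. The feature comes out ρ-useful for a clearly positive ρ, but fails to be γ-robust once the perturbation budget exceeds its usefulness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: y in {-1, +1}; the raw feature value is y plus Gaussian
# noise, so it is correlated with the label.
n = 100_000
y = rng.choice([-1, 1], size=n)
f_x = y + 0.5 * rng.standard_normal(n)

# Scale the feature to zero mean and unit second moment, as in the definition.
f_x = (f_x - f_x.mean()) / np.sqrt((f_x ** 2).mean())

# rho-usefulness: E[y * f(x)] (f is rho-useful if this exceeds rho > 0).
rho = (y * f_x).mean()

# Worst case of E[y * f(x + delta)] when the adversary may shift the feature
# value by at most eps: subtracting eps * y is optimal for this linear feature.
def worst_case_correlation(eps):
    return (y * (f_x - eps * y)).mean()  # equals rho - eps

print(rho)                          # clearly positive: the feature is useful
print(worst_case_correlation(2.0))  # negative: not gamma-robust for eps = 2
```

The same feature is thus useful or non-robust depending only on how the perturbation budget compares to its correlation with the label.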
In the binary classification setting, t must be -y, so

E_{(x_adv,t)∼D_det}[t · f(x_adv)] > 0, if f is non-robustly useful under D    (4)
E_{(x_adv,t)∼D_det}[-t · f(x_adv)] > 0, if f is robustly useful under D    (5)

It is observed that a neural network trained on D_det achieves non-trivial generalization to the original test set, that is, D. From this, we can conclude that non-robust features exist and are useful for classification in the normal setting.

Remark: Goh (2019b) showed that the D_rand dataset, constructed by choosing t randomly in the above procedure, suffers from a sort of "robust feature leakage": PGD introduces faint robust cues into the generated adversarial example that can be learned by the model. But on the D_det dataset, the robust features are correlated with a deterministic label which is different from t. Hence we use the D_det dataset in preference to D_rand for all our experiments.

Two kinds of non-robust features: Goh (2019a) points out a subtle flaw with the above definition of a non-robust feature - highly predictive non-robust features can arise from "contamination" of a robust feature with a non-robust feature, instead of from something meaningful. To see how this can happen, consider a highly predictive robust feature f_R and a weakly predictive non-robust feature f_NR. Let f_C be a "contaminated" feature that is a simple sum of f_R and f_NR (appropriately normalized). Then it is possible to construct a scenario in which

E[y · f_R(x)] ≫ 0,    E[inf_{δ∈Δ(x)} y · f_R(x + δ)] ≫ 0    (6)
E[y · f_NR(x)] ⪆ 0,    E[inf_{δ∈Δ(x)} y · f_NR(x + δ)] ≪ 0    (7)
E[y · f_C(x)] > 0,    E[inf_{δ∈Δ(x)} y · f_C(x + δ)] < 0    (8)

f_C is thus a highly predictive non-robust feature. Now, when a model is trained on (x + δ, -y) pairs, f_C = f_R + f_NR is still correlated with -y. But the combination -f_R + f_NR is even more correlated with -y, so the model will learn this combination in preference to f_C, and will not generalize on the original distribution.
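The contamination scenario in Equations 6-8 can be instantiated numerically. The construction below is our own hypothetical toy (all constants are illustrative): f_R is a clean robust coordinate, f_NR is a weakly predictive coordinate of magnitude ε that an adversary with budget 2ε can flip, and f_C is their normalized sum, which is useful on clean data yet anti-correlated with the label under attack.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 100_000, 0.1

y = rng.choice([-1, 1], size=n)
# tau agrees with y only slightly more often than chance: weakly predictive.
tau = rng.choice([-1, 1], size=n, p=[0.45, 0.55])

x1 = y.astype(float)        # robust coordinate (the adversary cannot touch it)
x2 = eps * y * tau          # non-robust coordinate of magnitude eps

# The adversary shifts x2 by -2 * eps * y, which is enough to flip its sign.
x2_adv = x2 - 2 * eps * y

f_R = x1
f_NR = x2 / eps
f_C = (f_R + f_NR) / np.sqrt(2)            # the "contaminated" sum
f_C_adv = (x1 + x2_adv / eps) / np.sqrt(2)

print((y * f_R).mean())      # ~1: highly predictive and robust
print((y * f_NR).mean())     # ~0.1: weakly predictive
print((y * f_C).mean())      # positive: f_C is useful...
print((y * f_C_adv).mean())  # ...but negative under attack: f_C is non-robust
```

The sum inherits almost all of its clean usefulness from f_R, yet its robustness is destroyed entirely by the f_NR component, which is exactly the flaw Goh (2019a) points out.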
In fact, thanks to learning -f_R, it will generalize to the distribution with flipped labels, i.e., y → -y. In our analysis and experiments, when we refer to non-robust features, we will exclude such contaminated features.

Illustrative Example: Consider a dataset of dog and cat images, where most dog images have snouts and most cats do not have snouts. Most cats have slightly lighter eyes than dogs, and making the eyes slightly darker or lighter is part of the set of valid adversarial perturbations. Suppose that a very small majority of the dog images start with a pixel that has an odd-numbered value. Then the different types of features in this dataset are enumerated in Table 1.

Name | Feature | Type of Feature
f_1 | Snout =⇒ 1, otherwise -1 | Predictive Robust
f_2 | Dark Eyes =⇒ 1, otherwise -1 | Predictive Non-Robust
f_3 | First pixel is an odd number =⇒ 1, otherwise -1 | Weakly Predictive Non-Robust
f_4 | f_1 + f_3 | Contaminated Robust

Table 1: An example illustrating the different kinds of features. Dogs and cats are labeled +1 and -1 respectively. Most dogs have dark eyes and snouts. A small majority of dog images start with an odd-numbered pixel value.

For f_2, (x + δ, -y) pairs would be dogs with lighter eyes, labeled as cats. The network trained on these examples will learn Snout =⇒ Cat and Light Eyes =⇒ Cat. Since the eye color is predictive of the true label, the second feature will ensure that the neural network has non-trivial performance on the original distribution. This is what Ilyas et al. (2019) observed in their experiments; f_2 is thus a true non-robust feature. For f_4, (x + δ, -y) pairs would be dog images with the first pixel value converted to an even number, labeled as cats. The network trained on these examples will learn Snout =⇒ Cat, Dark Eyes =⇒ Cat, and Even Pixel =⇒ Cat. None of these will be particularly helpful on the true distribution, but the first two will be useful on the flipped distribution, i.e., where dogs are relabeled as cats.
f_4 is thus a contaminated robust feature, and not a non-robust feature. Remark: A network that learns only robust features, but with contaminants, can still be very vulnerable to adversarial attacks, as the above example shows. The weakly predictive non-robust feature f_3 can be manipulated to consistently cause misclassification on out-of-distribution inputs.

3. NON-ROBUST FEATURES AND TRANSFERABILITY

The phenomenon of adversarial transferability (Papernot et al., 2016), where a non-trivial fraction of the adversarial examples generated for one neural network are still adversarial to other neural networks trained independently on the same data, can be readily explained in terms of non-robust features. By Assumption 1, different neural networks trained using ERM on a distribution would learn the predictive non-robust features (like Dark Eyes =⇒ Dog) present in the distribution. One would then construct an adversarial example by modifying an input such that the predictive non-robust features flip (modify all dog images to have lighter eyes). This adversarial example would then transfer to all the different networks that have learned to rely on the non-robust features. A natural question to ask is: does all adversarial transferability arise from predictive non-robust features? Nakkiran (2019) showed that by explicitly penalizing transferability during PGD, one can construct adversarial examples that do not transfer, and from which it is not possible to learn a generalizing model. This establishes that adversarial examples that do not transfer do not contain predictive non-robust features. Here we provide a simpler experiment that constructs non-transferable adversarial examples without explicitly penalizing transferability. This experiment also establishes a stronger claim: that adversarial examples transfer if and only if they exploit predictive non-robust features. Let the CIFAR-10 dataset form the data distribution D. Train two Resnet50 models (He et al., 2016) on D, and ensure by Assumption 1 that both networks have learned the predictive non-robust features of the distribution, by using regularization and cross-validation across a grid of hyperparameters. Construct a D_det dataset for the first network using Equation 3, where t is chosen deterministically according to y using the transformation t = (y + 1)%10.
We use Projected Gradient Descent (PGD) (Madry et al., 2018) to generate the adversarial examples, and define the shifted distribution

D_shift = {(x, (y + 1)%10) : (x, y) ∼ D}    (9)

Figure 2 shows the performance of these two networks on D and D_shift. We can see that the network trained on the examples that transfer generalizes well to D, but the network trained on the examples that do not transfer generalizes to D_shift. The configuration in the figure is the result of a thorough grid search over hyperparameters, with the metric for success being performance on D. Along with Assumption 1, our experiment establishes that the examples that transfer contain predictive non-robust features, and the examples that do not transfer do not. In particular, we claim the following:

Claim 1. Train two networks N_1 and N_2 on a common dataset such that both networks learn the predictive non-robust features present in the dataset. Then an adversarial example generated for N_1 transfers to N_2 if and only if this example contains predictive non-robust features.

Further, if a neural network C has learned predictive non-robust features, then PGD will construct some adversarial examples with predictive non-robust features (see Equation 4), and vice versa. This allows us to infer the following property, which we will use in our analysis in the next section:

Claim 2. If a neural network N_2 has learned the predictive non-robust features in a dataset, then adversarial examples generated for another network N_1 using PGD will transfer to N_2 if and only if N_1 has also learned predictive non-robust features.

As discussed above, if N_2 is able to generalize to D, then N_2 must have learned the predictive non-robust features of D, and if N_2 is able to generalize to D_shift, then N_2 must have learned the predictive robust features of D. This is depicted in Figure 3 in the context of our illustrative example from Section 2.
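The bookkeeping behind Equation 9 and Claims 1-2 can be sketched as follows. The predict function below is a hypothetical stand-in (it deliberately behaves like a model that learned only robust features), not a trained Resnet50; the point is only how accuracy on D versus D_shift is used as a proxy for non-robust versus robust feature learning.

```python
import numpy as np

NUM_CLASSES = 10

def shift_labels(y):
    """The D_shift relabeling of Equation 9: t = (y + 1) % 10."""
    return (y + 1) % NUM_CLASSES

def feature_accuracies(predict, x, y):
    """Accuracy on D and on D_shift: proxies for how much of the model's
    performance comes from non-robust vs. robust features respectively."""
    preds = predict(x)
    nonrobust_acc = (preds == y).mean()             # generalization to D
    robust_acc = (preds == shift_labels(y)).mean()  # generalization to D_shift
    return nonrobust_acc, robust_acc

# Hypothetical stand-in: a "model" trained on D_det that learned only robust
# features, so on clean inputs it predicts the shifted label.
x = np.zeros((100, 3, 32, 32))
y = np.arange(100) % NUM_CLASSES
predict = lambda x_batch: shift_labels(y)

nonrobust_acc, robust_acc = feature_accuracies(predict, x, y)
print(nonrobust_acc, robust_acc)  # 0.0 1.0
```

A model that had instead learned the non-robust features of D_det would show the opposite pattern: high accuracy against the original labels and low accuracy against the shifted ones.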

4. THE TWO PATHWAY HYPOTHESIS

We use the accuracy on D (respectively, D shif t ) as a proxy for how much of the model's performance can be attributed to its learning predictive non-robust (respectively, robust) features. We refer to these as "non-robust feature accuracy" and "robust feature accuracy". Finally, the accuracies on the training and validation splits of D det tell us how well the model has fit the training data, and whether the model is overfitting. We train the network N 2 using SGD for 120 epochs using different combinations of learning rate and regularization, and plot the evolution of these four metrics over the course of training.

4.2. RESULTS AND DISCUSSION

We observe that training follows two very distinct regimes, or pathways, depending on the choice of hyperparameters.

The first pathway, illustrated in Figure 1a, occurs when the model is trained with regularization of some sort - either in the form of a high initial learning rate (LR), L2 weight decay, or data augmentation. The model starts out by learning only predictive robust features (possibly with contaminants), but at some point switches to learning a combination of robust and non-robust predictive features. Training and validation accuracy steadily increase, and the model ends with both training and validation accuracy close to 100%.

The second pathway, illustrated in Figure 1b, occurs when the model is trained with a low starting learning rate, little or no L2 weight decay, and no data augmentation. The model starts out similar to the first pathway, but then starts overfitting the training data before it can learn non-robust predictive features. At this point, validation accuracy stagnates. The model finishes with a training accuracy of 100% but a validation accuracy of 81%. Nearly all the performance of the model can be attributed to its learning predictive robust features.

Hyperparameters: In Section C of the Appendix, we present a study of the effect of different hyperparameters for a Resnet-18 model trained on D_det. We observe that the model makes a sharp transition from Pathway 1 to 2 in the space of hyperparameters, with a narrow "middle ground".

On clean data: Training on the D_det dataset allows us to decompose the accuracy into robust and non-robust, but a similar decomposition doesn't exist for a model trained on D. Instead, we utilize Claim 2 and use adversarial transferability as a proxy for whether or not the model has learned non-robust features. We train two pairs of models on the unaltered CIFAR-10 data, one pair (M_1^(1), M_2^(1)) with regularization and one pair (M_1^(2), M_2^(2)) without, and plot the training and validation accuracies of M_1^(1) and M_1^(2) in Figure 4a and Figure 4b. Simultaneously, we also plot the targeted adversarial attack success, as well as the transfer accuracy to M_2^(1) and M_2^(2).
We observe that targeted adversarial attack success is high for both models. However, while adversarial examples generated for M_1^(1) transfer with increasing success, indicating that it learns non-robust features as training progresses, the adversarial examples generated for M_1^(2) largely fail to transfer. We conclude that M_1^(2) follows Pathway 2. We observe that learning predictive non-robust features seems to be essential for good generalization on the validation set, and robust features alone do not suffice.

Remark: Although Figure 1a suggests that the model eventually generalizes better to the non-robust feature mapping than the robust, this is not a generally applicable rule over different datasets, architectures, and combinations of hyperparameters. Table 5 in the Appendix illustrates this point.

4.3. RELATION TO OTHER RESULTS ABOUT NEURAL NETWORK TRAINING

Low and high learning rates: The regularizing effect of a high initial learning rate has been studied in detail by Li et al. (2019). They construct a dataset with two types of patterns - those that are hard-to-fit but easy-to-generalize (i.e., low in variation), and those that are easy-to-fit but hard-to-generalize (i.e., noisy). They show that a neural network trained with a small learning rate first focuses on the hard-to-fit patterns, and because of the low noise in them, quickly overfits the training set. As a result, it is not able to learn the easy-to-fit patterns effectively later on. In contrast, a model that starts out with a high learning rate learns the easy-to-fit patterns first, and since these are noisy, doesn't overfit the training set. Later on, once the learning rate is annealed, the model is able to effectively learn the harder-to-fit patterns. These two cases can be crudely mapped onto our two pathways. The model in Pathway 2, trained with a low LR, learns only robust features to start out, indicating that these features are hard-to-fit. It overfits the training set and is thereafter unable to learn the non-robust features, which are easy-to-fit. The model in Pathway 1, trained with a high LR, quickly begins to learn the non-robust features, which are easy-to-fit. However, it learns the robust features too alongside, indicating that this mapping from the low and high LR scenarios to our two-pathway theory is not perfect.

Complexity of learned functions: Another perspective on the training of neural networks is given by Nakkiran et al. (2019). They define a performance correlation metric between two models that captures how much of the performance of one model can be explained by the other, and show that as training progresses, the performance correlation between the current model and the best simple model decreases. This indicates that the functions learned by a neural network become increasingly complex as training progresses.
Although their metric is defined for a binary classification setting, we adapt it to the multi-class setting, and use a multi-class logistic regression classifier as the "best simple classifier". We measure the performance correlation between the model trained on D_det and the simple classifier as training progresses.

6. DIGRESSION: CROSS-DATASET TRANSFER

One view of non-robust features is that they are peculiarities or quirks of the data distribution. We provide evidence that allows us to tentatively refute this assumption, by showing that one can construct two datasets from completely independent sources such that a model that learns only the predictive non-robust features of one dataset achieves non-trivial generalization on the other dataset. The CINIC-10 dataset (Darlow et al. (2018)) is a distribution-shifted version of CIFAR-10, constructed by sampling from the ImageNet synsets corresponding to each of the CIFAR-10 classes. Although it may seem like CIFAR-10 and CINIC-10 could be candidates for two datasets drawn from independent sources, ImageNet is constructed by querying Flickr, and Flickr is also one of the seven sources for the 80 million TinyImages dataset (Torralba et al. (2008)) that was used to construct CIFAR-10 (Krizhevsky et al. (2009)). So roughly one in seven CIFAR-10 images is from Flickr. To be even more certain that no spurious correlations creep in because of a common source, we construct the CIFAR-minus-Flickr dataset, which consists of those CIFAR-10 images that haven't been sourced from Flickr. This comprises 52,172 of the 60,000 CIFAR-10 images. We construct D_det datasets as described in Section 4 for CIFAR-minus-Flickr and CINIC-10, and train Resnet50 models on them. These models can only learn non-robust features to help them generalize to the original unperturbed datasets, because the robust features are correlated with the shifted labels. The results are shown in Table 2.
Both D_det-trained models achieve an accuracy of close to 20% on the other dataset, which is a long way from the expected 10% accuracy of a random model. A line of enquiry that arises naturally from our work is understanding precisely why this behavior occurs in neural networks. What characteristics do predictive non-robust features have that ensure that they are learned only subsequent to predictive robust features? We pose finding a more precise definition of non-robust features, one that will allow us to theoretically analyze and explain these properties, as an important direction for future work.

7. CONCLUSION AND FUTURE DIRECTIONS

Finally, as we show in Section 6, predictive non-robust features can occur across datasets sampled from independent sources. Although this needs to be investigated more thoroughly, our results challenge the view that non-robust features are peculiarities of the data distribution. We speculate that some of these features could have a meaning in terms of human semantics, like our illustrative example where the eye color was a predictive non-robust feature.

Let the first and second columns of A be s and r respectively. Since (with high probability) coordinates 3 to d are orthogonal for all training points,

AA^T = I + ss^T + rr^T

Other than 1, the eigenvalues of this matrix are

1 + [(s^T s + r^T r) ± sqrt((s^T s - r^T r)² + 4(s^T r)²)] / 2

It is easy to see that s^T s = n and r^T r = nε². Let s^T r = r^T s = nεβ. Then, since ε is small,

sqrt((s^T s - r^T r)² + 4(s^T r)²) = n sqrt((1 - ε²)² + 4(εβ)²) ≈ n(1 + ε²(2β² - 1))

Then the eigenvalues are

λ_1 = 1 + n(1 + ε²β²),    λ_2 = 1 + nε²(1 - β²)

Theorem 1: If ε ≤ (1 - 2p)/(1 - 2p/k), then at the end of the first step, with high probability, the model will rely on the robust feature for classification, i.e., w_1^(1) ≥ w_1^(2), and will have a population accuracy of 1 - p.

Proof.

w_1^(1) = w_0^(1) + α s^T(B - Aw_0)
w_1^(2) = w_0^(2) + α r^T(B - Aw_0)

Since w_0 = 0,

w_1^(1) ≥ w_1^(2) ⟺ α s^T B ≥ α r^T B

It is easy to see that E[(1/n) s^T B] = (1 - 2p) and E[(1/n) r^T B] = ε(1 - 2p/k). Since n is sufficiently large, these random variables are close to their means with high probability. So the condition becomes

(1 - 2p) ≥ ε(1 - 2p/k)

which is true by the assumed bound on ε.

Theorem 2: Define

k_t = [2p(1 + 2nε²) - 2p(1 - ε²)] / [2p(1 + 2nε²) - (1 - ε²)]

Then if η ≤ 2/(1 + ε² + (1/n)), as the number of gradient steps goes to infinity,

• if k ≥ k_t, sample accuracy approaches 1 and population accuracy approaches 1 - (p/k) with high probability.
• if k < k_t, sample accuracy approaches 1 and population accuracy approaches 1 - p with high probability.

Proof.
Gradient descent will converge if αλ_1 ≤ 2, which gives

η ≤ 2/(1 + ε²β² + (1/n))

Using the fact that β ≤ 1 gives us the bound on the learning rate in the theorem statement. Next, using Equation 10,

w_T = A^T (I + ss^T + rr^T)^{-1} (B - Aw_0) + w_0

(I + ss^T + rr^T)^{-1} = I - [(1 + s^T s)(rr^T) + (1 + r^T r)(ss^T) - (s^T r)(sr^T) - (r^T s)(rs^T)] / [(1 + s^T s)(1 + r^T r) - (s^T r)(r^T s)]

=⇒ w_T^(1) = s^T (I + ss^T + rr^T)^{-1} (B - Aw_0) + w_0^(1)
= [s^T - ((1 + n)(nεβ) r^T + n(1 + nε²) s^T - (n²εβ) r^T - (n²ε²β²) s^T) / ((1 + n)(1 + nε²) - n²ε²β²)] B
= [(1 + nε²) s^T - (nεβ) r^T] B / [(1 + n)(1 + nε²) - n²ε²β²]

w_T^(2) = r^T (I + ss^T + rr^T)^{-1} (B - Aw_0) + w_0^(2)
= [r^T - ((1 + n)(nε²) r^T + (1 + nε²)(nεβ) s^T - (n²ε²β²) r^T - (n²ε³β) s^T) / ((1 + n)(1 + nε²) - n²ε²β²)] B
= [(1 + n) r^T - (nεβ) s^T] B / [(1 + n)(1 + nε²) - n²ε²β²]

where we have used the fact that w_0 = 0. Now suppose we sample a new point (X, Y) from the data distribution. Let q_i denote the index of the noise coordinate of X_i, and let q denote the index of the noise coordinate of X. With high probability, q ≠ q_i for all i. So

X^T w_T = X^(1) w_T^(1) + X^(2) w_T^(2) + X^(q) w_T^(q) = X^(1) w_T^(1) + X^(2) w_T^(2) + X^(q) w_0^(q) = X^(1) w_T^(1) + X^(2) w_T^(2)

We want to analyze the case where the first and second coordinates disagree. Let X^(1) = -1 and X^(2) = ε. In this scenario, if the model always predicts X^T w_T ≥ 0, it will match the prediction of the second coordinate and achieve a population accuracy of 1 - p/k. On the other hand, if it always predicts X^T w_T < 0, it will match the prediction of the first coordinate and achieve a population accuracy of 1 - p. Now,

X^T w_T > 0 =⇒ ε w_T^(2) > w_T^(1)
=⇒ ε [(1 + n) r^T - (nεβ) s^T] B / [(1 + n)(1 + nε²) - n²ε²β²] ≥ [(1 + nε²) s^T - (nεβ) r^T] B / [(1 + n)(1 + nε²) - n²ε²β²]

With high probability, s^T B = n(1 - 2p), r^T B = nε(1 - 2p/k), and β = (1 - 2p)(1 - 2p/k) = 1 - 2p(k + 1)/k + 4p²/k.

=⇒ (1 + n) nε² (1 - 2p/k) - (nεβ)(nε)(1 - 2p) ≥ (1 + nε²) n (1 - 2p) - (nεβ)(nε)(1 - 2p/k)
=⇒ n(ε² - 1) + 2pn(1 - ε²/k) + 4pn²ε²(1 - 1/k) ≥ 0    (approximating β ≈ 1 in the order-p terms)
=⇒ k ≥ [2p(1 + 2nε²) - 2p(1 - ε²)] / [2p(1 + 2nε²) - (1 - ε²)] = k_t
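As a sanity check (ours, not an experiment from the paper), the closed-form expressions for w_T^(1) and w_T^(2) can be verified numerically against the min-norm interpolator w_T = A^T(AA^T)^{-1}B, using the exact Gram quantities s^T s, r^T r, s^T r in place of their high-probability approximations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps, p, k = 50, 5000, 0.1, 0.1, 4.0

# Sample a training set from the toy distribution of Section 5, giving each
# example its own noise coordinate so that AA^T = I + s s^T + r r^T exactly.
y = rng.choice([-1.0, 1.0], size=n)
sigma = rng.choice([-1.0, 1.0], size=n, p=[p, 1 - p])
tau = rng.choice([-1.0, 1.0], size=n, p=[p / k, 1 - p / k])
A = np.zeros((n, d))
A[:, 0] = y * sigma                       # s: robust feature column
A[:, 1] = eps * y * tau                   # r: non-robust feature column
A[np.arange(n), 2 + np.arange(n)] = 1.0   # distinct noise coordinates
B = y

# Limit of gradient descent from w_0 = 0 (Equation 10).
w_T = A.T @ np.linalg.solve(A @ A.T, B)

# Closed forms from the proof of Theorem 2, with exact Gram quantities.
s, r = A[:, 0], A[:, 1]
Delta = (1 + s @ s) * (1 + r @ r) - (s @ r) ** 2
w1 = ((1 + r @ r) * (s @ B) - (s @ r) * (r @ B)) / Delta
w2 = ((1 + s @ s) * (r @ B) - (s @ r) * (s @ B)) / Delta

print(np.allclose(w_T[0], w1), np.allclose(w_T[1], w2))  # True True
```

Substituting s^T s = n, r^T r = nε², and s^T r = nεβ into w1 and w2 recovers exactly the expressions displayed above.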



Footnote 1: This framework can easily be adapted to the multi-class setting.
Footnote 2 (Caveat): Here, d is both the input dimensionality and the number of parameters. Although deep learning models are overparameterized, it is uncommon for datasets to have more dimensions than data points.



Figure 1: (Best viewed in color). Neural network training follows two very different pathways based on the choices of hyperparameters. These are training graphs of two Resnet50 models trained on relabeled CIFAR-10 adversarial examples. See Section 4 for more details.

4.1. EXPERIMENTAL SETUP

We use the CIFAR-10 training set as our empirical distribution D̂, and train a neural network N_1 using ERM on D̂ with cross-validation and regularization, such that it learns non-robust features by Assumption 1. Construct the D_det dataset according to the procedure described in Section 2, where the reference model C is N_1 and the adversarial target t is chosen deterministically as t = (y + 1)%10. Split D_det into training and validation sets and train a new neural network N_2 on the training set.
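The D_det construction can be sketched end-to-end. Everything below is a stand-in: the reference model C is a fixed random linear softmax classifier rather than a trained Resnet50 (N_1), the attack is plain L∞ PGD with sign steps, and the constants (EPS, STEPS, STEP_SIZE) are illustrative, not the values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, DIM = 10, 64
EPS, STEPS, STEP_SIZE = 0.5, 20, 0.1

# Hypothetical stand-in for the reference model C: random linear softmax.
W = rng.standard_normal((DIM, NUM_CLASSES))

def loss_grad(x, t):
    """Gradient w.r.t. the input of the cross-entropy loss toward target t."""
    logits = x @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(t)), t] -= 1.0   # dL/dlogits for cross-entropy
    return p @ W.T

def make_d_det(x, y):
    """Targeted PGD (Equation 3) with deterministic targets t = (y + 1) % 10."""
    t = (y + 1) % NUM_CLASSES
    x_adv = x.copy()
    for _ in range(STEPS):
        x_adv -= STEP_SIZE * np.sign(loss_grad(x_adv, t))  # step toward t
        x_adv = np.clip(x_adv, x - EPS, x + EPS)           # L-inf projection
    return x_adv, t

x = rng.standard_normal((128, DIM))
y = rng.integers(0, NUM_CLASSES, size=128)
x_adv, t = make_d_det(x, y)

within_budget = np.abs(x_adv - x).max() <= EPS + 1e-9
success = (np.argmax(x_adv @ W, axis=1) == t).mean()
print(within_budget)  # True
print(success)        # typically close to 1 for this linear stand-in
```

Training N_2 then amounts to fitting a fresh model on the relabeled (x_adv, t) pairs; for images, the projection step would additionally clip to the valid pixel range.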

Figure 3: Train a model on dogs with light eyes labeled as cats, and cats with dark eyes labeled as dogs. If the model classifies a clean dog image as a cat, then it must have learned the predictive robust feature (snout), and if it classifies a dog as a dog, it must have learned the predictive non-robust feature (dark eyes).

Train two models, M_1^(1) and M_2^(1), with different random initializations on the unaltered CIFAR-10 dataset, with both data augmentation and some weight decay. Train two more models, M_1^(2) and M_2^(2), with neither data augmentation nor weight decay. We plot the training and validation accuracies over the course of training for M_1^(1) and M_1^(2).

Figure 4: (Best viewed in color). Training and validation accuracy of M_1^(1) and M_1^(2), along with the accuracy of targeted adversarial attacks and the adversarial transfer accuracy to M_2^(1) and M_2^(2).


We use PGD to solve the optimization problem in Equation 3. Split the adversarial examples into two categories - those that transfer to the second network with their target labels, and those that do not. Relabel all adversarial examples x_adv with their target label t, and train a Resnet50 model on the (x_adv, t) pairs from each category. By construction, for (x_adv, t) ∼ D_det, the non-robust features of D are predictive of t, but the robust features of D are predictive of (t - 1)%10. So if a neural network trained on a subset of D_det learns predictive non-robust features, it will generalize to D, and if it learns predictive robust features, it will generalize to the shifted distribution D_shift (Equation 9).

Table 2: Accuracy of Resnet50 models on the CIFAR-minus-Flickr and CINIC-10 test sets. The two numbers in bold are the ones to focus on.

In this paper, we've shown that from the perspective of predictive robust and non-robust features, neural network training follows two very different pathways, corresponding to the training and overfitting regimes. In both regimes, the model starts out by learning predictive robust features first. This decomposition into two distinct pathways has several interesting implications. For instance, adversarial transferability means that even an adversary with no access to a model can mount a successful attack by constructing adversarial examples for a proxy model. But a model trained via Pathway 2 learns no predictive non-robust features, and adversarial examples generated for another model will in general not transfer to this model. Thus an adversary cannot perform a successful attack on this model without at least the ability to query the model and observe its outputs for a large number of inputs.


Under review as a conference paper at ICLR 2021.

The performance correlation, scaled by 100 and smoothed, is shown in Figure 1 along with the training curves. We observe in Figure 1a that the point at which the performance correlation plateaus corresponds with the point at which the robust accuracy decreases sharply. Similarly, in Figure 1b, the correlation levels off along with the robust accuracy. We conjecture that the initial robust features learned by the model are simple, linear functions, and the non-robust features are more complex and non-linear. This is in line with the findings of Tramèr et al. (2017) that transferable adversarial examples occur in a space of high dimensionality.

5. THEORETICAL ANALYSIS

In this section, we present some results for gradient descent on a toy linear model, on a distribution with robust and non-robust predictive features. We obtain results that mirror our two pathways for different choices of initial conditions. The setting we use is an adaptation of the one used by Nakkiran et al. (2019). For proofs of these theorems, refer to Section A of the Appendix.

Define a data distribution P over R^d × {-1, 1} as follows. Sample Y uniformly from {-1, 1}, and set

X^(1) = Y with probability 1 - p, and -Y with probability p
X^(2) = εY with probability 1 - p/k, and -εY with probability p/k
X^(q) = 1 for a noise coordinate q chosen uniformly at random from {3, ..., d}, with all remaining coordinates set to 0

(equivalently, X = X^(1) e_1 + X^(2) e_2 + e_q), where ε < 1 is some small positive constant, and e_i denotes the i-th natural basis vector. Now sample a "training set" {(X_i, Y_i)}_{i=1}^n from P, collect the inputs into a matrix A ∈ R^{n×d} whose i-th row is X_i, and collect the labels into a vector B ∈ R^n. Consider training a linear classifier using gradient descent with a learning rate of η to find w ∈ R^d that minimizes the squared loss

L(w) = (1/2n) ||Aw - B||²

We operate in the overparameterized setting, where n ≪ d. So, with high probability, the coordinates {3, ..., d} of the training data are orthogonal for all the training points (see Footnote 2). The idea is that the data consists of a "robust" feature given by the first coordinate, a "non-robust" feature (one that can become anti-correlated with a perturbation of 2ε) given by the second coordinate, and a noisy component that comprises the rest of the coordinates, making it possible for a model to fit the data exactly. The robust component is predictive of the true label with probability 1 - p, and the non-robust component is predictive of the true label with probability 1 - (p/k). For simplicity, assume that the initial weight vector is w_0 = 0, and that n is sufficiently large.

Theorem 1 (Robust before Non-robust). If ε ≤ (1 - 2p)/(1 - 2p/k), then at the end of the first step, with high probability, the model will rely on the robust feature for classification, i.e., w_1^(1) ≥ w_1^(2), and will have a population accuracy of 1 - p.

Theorem 2 (Two Pathways).
There exists a threshold k_t such that, as the number of gradient steps goes to infinity:

• if k ≥ k_t, the sample accuracy approaches 1 and the population accuracy approaches 1 - (p/k) with high probability;
• if k < k_t, the sample accuracy approaches 1 and the population accuracy approaches 1 - p with high probability.

Discussion: The two cases of Theorem 2 correspond, very roughly, to Pathways 1 and 2. Since this is a convex problem, gradient descent with a small enough learning rate converges to a fixed solution, so we cannot mimic the setting where different training hyperparameters lead to Pathway 1 or Pathway 2. But we can see that if the non-robust feature is predictive enough, the model learns the non-robust feature; otherwise, it learns the robust feature.

A PROOFS OF THEOREMS IN SECTION 5

Consider using gradient descent with a learning rate of η to minimize the squared loss described in Section 5. Writing the training points as the rows of a matrix A ∈ R^{n×d} and the labels as a vector B ∈ R^n, each update takes the form

w_{t+1} = w_t - (η/n) A^T (A w_t - B).

Letting α = η/n, it can be proved by induction that

w_T = A^T [Σ_{t=0}^{T-1} α (I - α A A^T)^t] B.

Let the largest eigenvalue of A A^T be λ_max. If |1 - α λ_max| < 1, then the series above converges and

w_T ≈ A^T (A A^T)^{-1} B

for some very large number of steps T. This achieves zero empirical training error, as we can verify that A w_T = B.
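The derivation above can be checked numerically. The following sketch, with arbitrary dimensions, step size, and random data of our own choosing, runs gradient descent from w_0 = 0 and compares the result with the interpolant A^T (A A^T)^{-1} B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized toy problem: n = 20 points in d = 100 dimensions
# (arbitrary illustrative sizes, not the paper's settings).
n, d = 20, 100
A = rng.standard_normal((n, d))       # training data matrix, rows are points
B = rng.choice([-1.0, 1.0], size=n)   # labels

# Gradient descent on L(w) = (1/2n) * ||A w - B||^2 from w0 = 0.
eta = 0.01
alpha = eta / n                       # alpha = eta / n, as in the proof
w = np.zeros(d)
for _ in range(100_000):
    w -= alpha * A.T @ (A @ w - B)    # w_{t+1} = w_t - alpha * A^T (A w_t - B)

# Since |1 - alpha * lambda_max| < 1 here, w_T approaches A^T (A A^T)^{-1} B.
w_star = A.T @ np.linalg.solve(A @ A.T, B)
print(np.allclose(w, w_star, atol=1e-6))  # True
print(np.allclose(A @ w, B))              # True: zero empirical training error
```

Because w_0 = 0 keeps every iterate in the row space of A, the limit is the minimum-norm interpolant rather than an arbitrary solution that fits the training set.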

B OTHER DATASETS AND ARCHITECTURES

In this section, we provide training graphs illustrating Pathways 1 and 2 for Resnet18 and Resnet50 models trained on the D_det versions of the CIFAR-10 and CINIC-10 (Darlow et al. (2018)) datasets. Along with each graph, we list the hyperparameters used.

C HYPERPARAMETERS

In this section, we list the hyperparameters used in our experiments. We also take the case of a Resnet18 trained on D_det CIFAR-10 and examine which hyperparameters lead to Pathway 1 and which lead to Pathway 2. As we note in Section 4, there is a sharp transition between the two pathways in the space of hyperparameters. We trained the model for 120 epochs of SGD and performed a grid search over the following combinations of hyperparameters.

Remark: As seen in the case with LR 0.1 and no data augmentation, the network exhibits a sharp transition from Pathway 2 to Pathway 1 in the space of hyperparameters, with a narrow "middle ground" around L2 = 2.5e-4.
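A minimal sketch of such a grid search follows. The grid values below are placeholders rather than the paper's full list, and `train_resnet18` is a hypothetical training routine, not code from these experiments:

```python
from itertools import product

# Placeholder grid (illustrative values only).
learning_rates = [0.1, 0.01]
l2_values = [5e-4, 2.5e-4, 1e-4]
augmentation = [True, False]

# Enumerate every combination of the three hyperparameters.
configs = [
    {"lr": lr, "l2": l2, "augment": aug}
    for lr, l2, aug in product(learning_rates, l2_values, augmentation)
]
print(len(configs))  # 12 configurations, each trained for 120 epochs of SGD

# for cfg in configs:
#     pathway = train_resnet18("CIFAR-10-Ddet", epochs=120, **cfg)  # hypothetical
```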


Cross-Dataset Transfer (Table 2):

• CIFAR-minus-Flickr (clean): Adam optimizer, LR 1e-3, L2 1e-5, with data augmentation.
• CINIC-10 (clean): LR 0.01, L2 5e-4, with data augmentation.
• CIFAR-minus-Flickr D_det: LR 0.1, L2 5e-4, no data augmentation.
• CINIC-10 D_det: LR 0.1, L2 5e-4, no data augmentation.

