SGD ON NEURAL NETWORKS LEARNS ROBUST FEATURES BEFORE NON-ROBUST

Abstract

Neural networks are known to be vulnerable to adversarial attacks: small, imperceptible perturbations that cause the network to misclassify an input. A recent line of work attempts to explain this behavior by positing the existence of non-robust features: well-generalizing but brittle features, present in the data distribution, that are learned by the network and can be perturbed to cause misclassification. In this paper, we look at the dynamics of neural network training through the perspective of robust and non-robust features. We find that there are two very distinct "pathways" that neural network training can follow, depending on the hyperparameters used. In the first pathway, the network initially learns only predictive robust features and weakly predictive non-robust features, and subsequently learns predictive non-robust features. A network trained via the second pathway, on the other hand, eschews predictive non-robust features altogether and rapidly overfits the training data. We provide strong empirical evidence to corroborate this hypothesis, as well as theoretical analysis in a simplified setting. Key to our analysis is a better understanding of the relationship between predictive non-robust features and adversarial transferability. We present our findings in light of other recent results on the evolution of inductive biases learned by neural networks over the course of training. Finally, we digress to show that, rather than being "quirks" of the data distribution, predictive non-robust features might actually occur across datasets with different distributions drawn from independent sources, indicating that they perhaps possess some meaning in terms of human semantics.

1. INTRODUCTION

Neural networks have achieved state-of-the-art performance on tasks spanning an array of domains, such as computer vision, translation, speech recognition, robotics, and playing board games (Krizhevsky et al. (2012); Vaswani et al. (2017); Graves et al. (2013); Silver et al. (2016)). However, in recent years their vulnerability to adversarial attacks (small, targeted input perturbations) has come under sharp focus (Szegedy et al. (2013); Papernot et al. (2017); Carlini & Wagner (2017); Athalye et al. (2018); Schmidt et al. (2018)). Ilyas et al. (2019) propose that neural network vulnerability is at least partly due to neural networks learning well-generalizing but brittle features that are properties of the data distribution. From this point of view, an adversarial example would be constructed by modifying an input of one class slightly such that it takes on the non-robust features of another class. They provide empirical evidence for their theory by training a model on adversarially perturbed examples labeled as the target class, and showing that this model generalizes well to the original, unperturbed distribution.

Another, unrelated line of work (Brutzkus et al. (2018); Ji & Telgarsky (2019); Li & Liang (2018)) aims to study the properties of the functions learned by gradient descent over the course of training. Nakkiran et al. (2019) and Mangalam & Prabhu (2019) independently showed that Stochastic Gradient Descent (SGD) learns simple, almost linear functions to start out, but then learns more complex functions as training progresses. Li et al. (2019) showed that models trained with a low learning rate learn easy-to-generalize but hard-to-fit features first, and thus perform poorly on easy-to-fit patterns.

In this paper, we study gradient descent on neural networks from the perspective of robust and non-robust features. Our main thesis is that, depending on the choice of hyperparameters, neural network training follows one of two pathways:

• Pathway 1 (Informal): The neural network first learns predictive robust features and weakly predictive non-robust features. As training progresses, it learns predictive non-robust features, and having learned both robust and non-robust predictive features, achieves good performance on held-out data. This is the pathway that Ilyas et al. (2019) used to prove their theory.

• Pathway 2 (Informal): The neural network learns predictive robust features and weakly predictive non-robust features (as in Pathway 1). But thereafter, it begins to fit the noise in the training set, and quickly achieves zero training error. In this scenario, the network learns only the robust predictive features and shows modest generalization on held-out data.

Through a series of experiments, we validate our two-pathway hypothesis, investigate the specific circumstances under which Pathway 1 and Pathway 2 occur, and analyze some properties of the two pathways. We will also develop a closer understanding of the relationship between adversarial transfer and predictive non-robust features, which will aid our analysis of the two pathways.

The rest of this paper is organized as follows. Section 2 sets up the notation and definitions we use. In Section 3, we investigate the link between adversarial features and transferability. In Section 4, we provide empirical evidence for the two-pathway hypothesis and analyze some characteristics of each pathway. Section 5 presents a theoretical analysis of gradient descent on a toy linear model. We show that for different choices of initial parameters, the linear model exhibits properties of the first and second pathways. We digress to explore whether non-robust features can occur across datasets in Section 6, and discuss future research directions in Section 7.

2. DEFINITIONS AND PRELIMINARIES

Consider the binary classification setting, where D is a joint distribution over the input space X and the labels {-1, +1} (this framework can easily be adapted to the multi-class setting). In this setting, Ilyas et al. (2019) define a feature as any function f : X → R, scaled such that E_{(x,y)∼D}[f(x)] = 0 and E_{(x,y)∼D}[f(x)²] = 1. A feature is said to be ρ-useful if

E_{(x,y)∼D}[y · f(x)] > ρ

for some ρ > 0, and γ-robust if

E_{(x,y)∼D}[ inf_{δ∈∆(x)} y · f(x + δ) ] > γ    (2)

where ∆(x) denotes the set of allowed perturbations of x.
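The ρ-useful and γ-robust conditions of Section 2 are population expectations, so they can be estimated by Monte Carlo sampling. Below is a minimal sketch (our illustration, not from the paper) using a hypothetical one-dimensional toy distribution, with ∆(x) taken to be the interval [-eps, eps] and the infimum approximated by a grid search over δ:

```python
import random

def estimate_feature_stats(feature, sample, eps, n=20000, n_grid=41):
    """Monte Carlo estimates of usefulness E[y * f(x)] and robustness
    E[inf_{delta in [-eps, eps]} y * f(x + delta)], with the infimum
    approximated by a grid search over delta."""
    grid = [-eps + 2 * eps * i / (n_grid - 1) for i in range(n_grid)]
    useful = robust = 0.0
    for _ in range(n):
        x, y = sample()
        useful += y * feature(x)
        robust += min(y * feature(x + d) for d in grid)
    return useful / n, robust / n

# Toy distribution: y uniform on {-1, +1}, x = 2y + standard Gaussian noise,
# so E[x] = 0 and E[x^2] = 5; f(x) = x / sqrt(5) then satisfies the scaling
# conditions E[f(x)] = 0 and E[f(x)^2] = 1 from Section 2.
random.seed(0)

def sample():
    y = random.choice([-1, 1])
    return 2 * y + random.gauss(0, 1), y

f = lambda x: x / 5 ** 0.5

rho, gamma = estimate_feature_stats(f, sample, eps=1.0)
# Analytically, rho = 2/sqrt(5) ≈ 0.894 and gamma = rho - eps/sqrt(5) ≈ 0.447:
# the feature remains useful under perturbation, but with reduced signal.
```

Note that for eps > 2 the robustness estimate turns negative: the feature is still ρ-useful but no longer robust at that perturbation budget, which is exactly the kind of predictive non-robust feature studied here.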

Figure 1: (Best viewed in color.) Neural network training follows two very different pathways depending on the choice of hyperparameters. These are training graphs of two ResNet-50 models trained on relabeled CIFAR-10 adversarial examples. See Section 4 for more details.
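Experiments like those in Figure 1 rely on adversarial examples generated by gradient-based attacks. For concreteness, here is a minimal sketch of the fast gradient sign method (FGSM) on a linear logistic classifier, where the input gradient has a closed form; this is our hypothetical toy setup, not the actual ResNet-50/CIFAR-10 attack pipeline used in the experiments:

```python
def fgsm_linear(w, b, x, y, eps):
    """One-step FGSM attack on a logistic classifier p(y=1|x) = sigmoid(w.x + b).
    For the logistic loss L = log(1 + exp(-y * (w.x + b))), the sign of the
    input gradient dL/dx_i is -y * sign(w_i), so the attack has a closed form:
    x_adv = x + eps * sign(dL/dx)."""
    sign = lambda v: 1 if v > 0 else -1 if v < 0 else 0
    return [xi + eps * (-y) * sign(wi) for xi, wi in zip(x, w)]

def margin(w, b, x, y):
    """Signed margin y * (w.x + b); positive means correctly classified."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical 3-d example: a correctly classified point is pushed across
# the decision boundary by a small, coordinate-wise perturbation.
w, b = [1.0, -2.0, 0.5], 0.0
x, y = [0.3, -0.2, 0.4], 1          # margin = 0.3 + 0.4 + 0.2 = 0.9 > 0
x_adv = fgsm_linear(w, b, x, y, eps=0.4)
# Each coordinate moves by eps against the label, so the margin drops by
# exactly eps * ||w||_1 = 0.4 * 3.5 = 1.4, flipping the prediction.
```

The fact that the margin loss is exactly eps times the l1 norm of w is special to linear models; for deep networks the gradient must be computed by backpropagation, and multi-step variants such as PGD are typically used instead.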

