SIMPLICITY BIAS IN 1-HIDDEN LAYER NEURAL NETWORKS

Abstract

Recent works (Shah et al., 2020; Chen et al., 2021) have demonstrated that neural networks exhibit extreme simplicity bias (SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST and MNIST-CIFAR, where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one-hidden-layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low-dimensional projection of the inputs, (ii) theoretically, we show that when the data is linearly separable, the network primarily depends only on the linearly separable (1-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low-dimensional projection of the inputs, thereby demonstrating SB on these datasets, and (iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.

1. INTRODUCTION

Figure 1: Classification of swans vs. bears. There are several features, such as the background and the color and shape of the animal, each of which is sufficient for classification, but using all of them leads to a more robust model.¹

It is well known that neural networks (NNs) are vulnerable to distribution shifts as well as to adversarial examples (Szegedy et al., 2014; Hendrycks et al., 2021). A recent line of work (Geirhos et al., 2018; Shah et al., 2020; Geirhos et al., 2020) proposes that Simplicity Bias (SB), also known as shortcut learning, i.e., the tendency of NNs to learn only the simplest features over other useful but more complex features, is a key reason behind this non-robustness. The argument is roughly as follows: in the classification of swans vs. bears illustrated in Figure 1, there are many features, such as the background and the color and shape of the animal, that can be used for classification. Using only one or a few of them can lead to models that are not robust to specific distribution shifts, while using all the features can lead to more robust models. Several recent works have demonstrated SB on a variety of semi-real constructed datasets (Geirhos et al., 2018; Shah et al., 2020; Chen et al., 2021), and have hypothesized SB to be the key reason for NNs' brittleness to distribution shifts (Shah et al., 2020). However, such observations are still only for specific semi-real datasets, and a general method that can identify SB for a given dataset and a given model is still missing in the literature. Such a method would be useful not only to estimate the robustness of a model but could also help in designing more robust models.

A key challenge in designing such a general method to identify (and potentially fix) SB is that the notion of a feature itself is vague and lacks a rigorous definition. Existing works like Geirhos et al. (2018); Shah et al. (2020); Chen et al. (2021) avoid this challenge by using carefully designed datasets (e.g., concatenations of MNIST and CIFAR images), where certain high-level features (e.g., MNIST features and CIFAR features, or shape and texture features) are already baked into the dataset definition, and arguing about their simplicity is intuitively easy.

Contributions: One of the main contributions of this work is to provide a precise definition of a particular simplicity bias, LD-SB, of 1-hidden layer neural networks. In particular, we characterize SB as low-dimensional input dependence of the model. Concretely,

Definition 1.1 (LD-SB). A model f : R^d → R^c with inputs x ∈ R^d and outputs f(x) ∈ R^c (e.g., logits for c classes), trained on a distribution (x, y) ∼ D, satisfies LD-SB if there exists a projection matrix P ∈ R^{d×d} satisfying the following:

• rank(P) = k ≪ d,
• f(Px₁ + P^⊥x₂) ≈ f(x₁) for all (x₁, y₁), (x₂, y₂) ∼ D, and
• an independent model g trained on (P^⊥x, y), where (x, y) ∼ D, achieves high accuracy.

Here P^⊥ is the projection matrix onto the subspace orthogonal to P. In words, LD-SB says that there exists a small k-dimensional subspace (given by the projection matrix P) in the input space R^d which is the only thing the model f considers in labeling any input point x. In particular, if we mix two data points x₁ and x₂ by using the projection of x₁ onto P and the projection of x₂ onto the orthogonal subspace P^⊥, the output of f on this mixed point Px₁ + P^⊥x₂ is the same as that on x₁.
This would be fine if the subspace P^⊥ did not contain any feature useful for classification. However, the third bullet point says that P^⊥ indeed contains features that are useful for classification, since an independent model g trained on (P^⊥x, y) achieves high accuracy.

Furthermore, theoretically, we demonstrate LD-SB of 1-hidden layer NNs for a fairly general class of distributions called the independent features model (IFM), where the features (i.e., coordinates) are distributed independently conditioned on the label. IFM has a long history and is widely studied, especially in the context of naive-Bayes classifiers (Lewis, 1998). For IFM, we show that as long as there is even a single feature in which the data is linearly separable, NNs trained using SGD will learn models that rely almost exclusively on this linearly separable feature, even when there are an arbitrarily large number of features in which the data is separable but with a non-linear boundary. Empirically, we demonstrate LD-SB on three real-world datasets: binary and multiclass versions of Imagenette (FastAI, 2021), as well as the waterbirds-landbirds dataset (Sagawa et al., 2020a). Compared to the results in Shah et al. (2020), our results (i) theoretically show LD-SB in a fairly general setting and (ii) empirically show LD-SB on real datasets.

Finally, building upon these insights, we propose a simple ensemble method, OrthoP, that sequentially constructs NNs by projecting out principal input data directions used by previous NNs. We demonstrate that this method can lead to significantly more robust ensembles for real-world datasets in the presence of simple distribution shifts like Gaussian noise.

Why only 1-hidden layer networks? One might wonder why the results in this paper are restricted to 1-hidden layer networks and why they are interesting. We present two reasons. 1. From a theoretical standpoint, prior works have thoroughly characterized the training dynamics of infinite-width 1-hidden layer networks under different initialization schemes (Chizat et al., 2019) and have also identified the limit points of gradient descent for such networks (Chizat & Bach, 2020). Our results crucially build upon these prior works. On the other hand, we do not have such a clear understanding of the dynamics of deeper networks. 2. From a practical standpoint, the dominant paradigm in machine learning right now is to pretrain large models on large amounts of data and then finetune them on small target datasets. Given the large and diverse pretraining data seen by these models, it has been observed that they do learn rich features (Rosenfeld et al., 2022; Nasery et al., 2022). However, finetuning on target datasets might not utilize all the features in the pretrained model. Consequently, approaches that can train robust finetuning heads (such as a 1-hidden layer network on top) can be quite effective. Extending our results to deeper networks and to other architectures is an exciting direction of research from both theoretical and practical points of view.

Paper organization: Section 2 presents related work. Section 3 presents preliminaries. Our main results on LD-SB are presented in Section 4. Section 5 presents results on training diverse classifiers. We conclude in Section 6.

2. RELATED WORK

Simplicity Bias: Subsequent to Shah et al. (2020), there have been several papers investigating the presence/absence of SB in various networks as well as the reasons behind SB (Scimeca et al., 2021). Of these, Huh et al. (2021) and Galanti & Poggio (2022) are the works most closely related to ours.

Learning diverse classifiers: There have been several works that attempt to learn diverse classifiers. Most works try to learn such models by ensuring that the input gradients of these models do not align (Ross & Doshi-Velez, 2018; Teney et al., 2022). Xu et al. (2022) propose a way to learn diverse/orthogonal classifiers under the assumption that a complete classifier, which uses all features, is available, and demonstrate its utility for various downstream tasks such as style transfer. Lee et al. (2022) learn diverse classifiers by enforcing diversity on unlabeled target data.

Spurious correlations:

There has been a large body of work which identifies the reasons for spurious correlations in NNs (Sagawa et al., 2020b) as well as proposing algorithmic fixes in different settings (Liu et al., 2021; Chen et al., 2020) .

Implicit bias of gradient descent:

There is also a large body of work understanding the implicit bias of gradient descent dynamics. Most of these works are for standard linear (Ji & Telgarsky, 2019) or deep linear networks (Soudry et al., 2018; Gunasekar et al., 2018) . For nonlinear neural networks, one of the well-known results is for the case of 1-hidden layer neural networks with homogeneous activation functions (Chizat & Bach, 2020) , which we crucially use in our proofs.

3. PRELIMINARIES

In this section, we provide the notation and background on infinite width max-margin classifiers that is required to interpret the results of this paper.

3.1. BASIC NOTIONS

1-hidden layer neural networks and loss function. Consider instances x ∈ R^d and labels y ∈ {±1} jointly distributed as D. A 1-hidden layer neural network model for predicting the label of a given instance x is defined by parameters (w̄ ∈ R^{m×d}, b̄ ∈ R^m, ā ∈ R^m). For a fixed activation function ϕ, the model is given as f((w̄, b̄, ā), x) := ⟨ā, ϕ(w̄x + b̄)⟩, where ϕ(·) is applied elementwise. The cross entropy loss L for a given model f, input x and label y is given as L(f(x), y) := log(1 + exp(−y f((w̄, b̄, ā), x))).

Margin. For a data distribution D, the margin of a model f is given as min_{(x,y)∼D} y f(x).

Notation. Here is some useful notation that we will use repeatedly. For a matrix A, A(i, ·) denotes the i-th row of A. For any k ∈ N, S^{k−1} denotes the surface of the unit Euclidean sphere in R^k.
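As a concrete reference point, here is a minimal NumPy sketch of this model, loss and margin (the function names are ours, for illustration only):

```python
import numpy as np

def forward(w, b, a, x, phi=lambda u: np.maximum(u, 0.0)):
    """1-hidden layer network f((w, b, a), x) = <a, phi(w x + b)>.
    w: (m, d) first-layer weights, b: (m,) biases, a: (m,) output weights,
    x: (d,) input, phi: elementwise activation (ReLU by default)."""
    return a @ phi(w @ x + b)

def cross_entropy(f_x, y):
    """L(f(x), y) = log(1 + exp(-y f(x))) for y in {-1, +1}."""
    return np.log1p(np.exp(-y * f_x))

def margin(w, b, a, X, Y):
    """Margin of the model over a finite sample: min_i y_i f(x_i)."""
    return min(y * forward(w, b, a, x) for x, y in zip(X, Y))
```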

3.2. INITIALIZATIONS

The gradient descent dynamics of the network depend strongly on the scale of initialization. In this work, we primarily consider the rich regime initialization.

Rich regime. In rich regime initialization, for each i ∈ [m], the parameters (w̄(i, ·), b̄(i)) of the first layer are sampled from the uniform distribution on S^d. Each ā(i) is sampled from Unif{−1, +1}, and the output of the network is scaled down by 1/m (Chizat & Bach, 2020). This is roughly equivalent to Xavier initialization (Glorot & Bengio, 2010), where the weight parameters in both layers are initialized approximately as N(0, 2/m) when m ≫ d.

In addition, we also present some results for the lazy regime initialization described below.

Lazy regime. In the lazy regime, the weight parameters of the first layer are initialized as N(0, 1/d), those of the second layer as N(0, 1/m), and the biases are initialized to 0 (Bietti & Mairal, 2019; Lee et al., 2019). This is approximately equivalent to Kaiming initialization (He et al., 2015).
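For concreteness, the two schemes can be sketched as follows (a minimal NumPy sketch; sampling a Gaussian and normalizing is the standard construction of the uniform distribution on S^d, and the function names are ours):

```python
import numpy as np

def rich_init(m, d, rng=np.random.default_rng(0)):
    """Rich regime: (w(i,.), b(i)) uniform on the sphere S^d, a(i) ~ Unif{-1, +1}.
    The network output is additionally scaled down by 1/m."""
    wb = rng.standard_normal((m, d + 1))
    wb /= np.linalg.norm(wb, axis=1, keepdims=True)  # normalize onto S^d
    w, b = wb[:, :d], wb[:, d]
    a = rng.choice([-1.0, 1.0], size=m)
    return w, b, a

def lazy_init(m, d, rng=np.random.default_rng(0)):
    """Lazy regime: first layer ~ N(0, 1/d), second layer ~ N(0, 1/m), biases 0."""
    w = rng.standard_normal((m, d)) / np.sqrt(d)
    b = np.zeros(m)
    a = rng.standard_normal(m) / np.sqrt(m)
    return w, b, a
```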

3.3. THE INFINITE WIDTH LIMIT

For 1-hidden layer neural networks with ReLU activation in the infinite width limit, i.e., as m → ∞, Jacot et al. (2018); Chizat et al. (2019); Chizat & Bach (2020) gave interesting characterizations of the trained model. As mentioned above, the training process of these models falls into one of two regimes depending on the scale of initialization (Chizat et al., 2019):

Rich regime. In the infinite width limit, the neural network parameters can be thought of as a distribution ν over triples (w, b, a) ∈ S^{d+1}, where w ∈ R^d and b, a ∈ R. Under the rich regime initialization, the function f computed by the model can be expressed as

f(ν, x) = E_{(w,b,a)∼ν}[a ϕ(⟨w, x⟩ + b)].   (1)

Chizat & Bach (2020) showed that training with rich initialization can be thought of as gradient flow on the Wasserstein-2 space, and gave the following characterization of the trained model under the cross entropy loss E_{(x,y)∼D}[L(ν, (x, y))].

Theorem 3.1 (Chizat & Bach, 2020). Under rich initialization in the infinite width limit with cross entropy loss, if gradient flow on a 1-hidden layer NN with ReLU activation converges, it converges to a maximum margin classifier ν* given as

ν* = argmax_{ν∈P(S^{d+1})} min_{(x,y)∼D} y f(ν, x),   (2)

where P(S^{d+1}) denotes the space of distributions over S^{d+1}. This training regime is known as the 'rich' regime since it learns data-dependent features ⟨w, ·⟩.

Lazy regime. Jacot et al. (2018) showed that in the infinite width limit, the neural network behaves like a kernel machine. This kernel is popularly known as the Neural Tangent Kernel (NTK), and is given by K(x, x′) = ⟨∂f(x)/∂W, ∂f(x′)/∂W⟩, where W denotes the set of all trainable weight parameters. This initialization regime is called the 'lazy' regime since the weights do not change much from initialization and the NTK remains almost constant, i.e., the network does not learn data-dependent features. We will use the following characterization of the NTK for 1-hidden layer neural networks.

Theorem 3.2 (Bietti & Mairal, 2019). Under lazy regime initialization in the infinite width limit, the NTK for 1-hidden layer neural networks with ReLU activation, i.e., ϕ(u) = max(u, 0), is given as

K(x, x′) = ‖x‖ ‖x′‖ κ(⟨x, x′⟩ / (‖x‖‖x′‖)),  where κ(u) = (1/π)(2u(π − cos⁻¹(u)) + √(1 − u²)).

Figure 2: Illustration of an IFM dataset. Given a class ±1, represented by blue and red respectively, each coordinate value is drawn independently from the corresponding distribution. Shown are the supports of the distributions on three coordinates (Feature 1, Feature 2, Feature 3) of an illustrative IFM dataset, for positive and negative labels.

Lazy regime for binary classification. Soudry et al. (2018) showed that for linearly separable datasets, gradient descent for linear predictors on the logistic loss converges to the max-margin support vector machine (SVM) classifier. This implies that any sufficiently wide neural network, when trained for a finite time in the lazy regime on a dataset that is separable by the finite-width induced NTK, will tend towards the L₂ max-margin classifier given by

argmin_{f∈H} ‖f‖_H  s.t.  y f(x) ≥ 1 ∀ (x, y) ∼ D,   (3)

where H is the Reproducing Kernel Hilbert Space (RKHS) associated with the finite-width kernel (Chizat, 2020). With increasing width, this kernel tends towards the infinite-width NTK (which is universal (Ji et al., 2020)). Therefore, in the lazy regime, we focus on the L₂ max-margin classifier induced by the infinite-width NTK.
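The kernel of Theorem 3.2 is easy to evaluate numerically; the following sketch implements it (the clipping guards against floating-point values just outside [−1, 1] and is our addition):

```python
import numpy as np

def kappa(u):
    """Angular part of the 1-hidden layer ReLU NTK (Theorem 3.2)."""
    u = np.clip(u, -1.0, 1.0)
    return (2 * u * (np.pi - np.arccos(u)) + np.sqrt(1 - u ** 2)) / np.pi

def ntk(x, xp):
    """K(x, x') = ||x|| ||x'|| kappa(<x, x'> / (||x|| ||x'||))."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    return nx * nxp * kappa(x @ xp / (nx * nxp))
```

For instance, kappa(1.0) = 2 and kappa(0.0) = 1/π, two values used repeatedly in the proofs of Appendix A.2.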

4. CHARACTERIZATION OF SB IN 1-HIDDEN LAYER NEURAL NETWORKS

In this section, we first theoretically characterize the SB exhibited by gradient descent on linearly separable datasets in the independent features model (IFM). The main result, stated in Theorem 4.1, is that for binary classification of inputs in R^d, even if there is only a single coordinate in which the data is linearly separable, gradient descent dynamics will learn a model that relies solely on this coordinate, even when there are an arbitrarily large number (d − 1) of coordinates in which the data is separable, but only by a non-linear classifier. In other words, the simplicity bias of these networks is characterized by low-dimensional input dependence, which we denote by LD-SB. We then experimentally verify that NNs trained on real datasets do indeed satisfy LD-SB.

4.1. DATASET

We consider datasets in the independent features model (IFM), where the joint distribution over (x, y) satisfies p(x, y) = r(y) ∏_{i=1}^d q_i(x_i | y), i.e., the features are distributed independently conditioned on the label y. Here r(y) is a distribution over {−1, +1} and q_i(x_i | y) denotes the conditional distribution of the i-th coordinate x_i given y. IFM is widely studied in the literature, particularly in the context of naive-Bayes classifiers (Lewis, 1998). We make the following assumptions, which posit that there are at least two features of differing complexity for classification: one with a linear boundary and at least one other with a non-linear boundary. See Figure 2 for an illustrative example.

• One of the coordinates (say the 1st, WLOG) is separable by a linear decision boundary with margin γ (see Figure 2), i.e., there exists γ > 0 such that γ ∈ Supp(q₁(x₁ | y = +1)) ⊆ [γ, ∞) and −γ ∈ Supp(q₁(x₁ | y = −1)) ⊆ (−∞, −γ], where Supp(·) denotes the support of a distribution.
• None of the other coordinates is linearly separable. More precisely, for every other coordinate i ∈ [d] \ {1}, 0 ∈ Supp(q_i(x_i | y = −1)) and {−1, +1} ⊆ Supp(q_i(x_i | y = +1)).
• The dataset can be perfectly classified even without using the linear coordinate. That is, there exists i ≠ 1 such that q_i(x_i | y) has disjoint supports for y = +1 and y = −1.

Though we assume axis-aligned features, our results also hold for any rotation of the dataset. While our results hold in the general IFM setting, current results for SB, e.g., Shah et al. (2020), are obtained for very specialized datasets within IFM and do not apply to IFM in general. A toy dataset satisfying all three assumptions is sketched below.
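The following sketch samples such a toy IFM dataset; the particular conditional distributions (a point mass on ±γ for the linear coordinate, ±1 vs. 0 for the others) are an illustrative choice matching the lazy-regime dataset of Section 4.3, not the only one possible:

```python
import numpy as np

def sample_ifm(n, d, gamma=1.0, rng=np.random.default_rng(0)):
    """Toy IFM sample: coordinate 1 is linearly separable with margin gamma;
    coordinates 2..d separate the classes only with a non-linear boundary
    (x_i in {-1, +1} when y = +1, x_i = 0 when y = -1)."""
    y = rng.choice([-1, 1], size=n)
    x = np.zeros((n, d))
    x[:, 0] = gamma * y  # linear coordinate with margin gamma
    pos = y == 1
    x[pos, 1:] = rng.choice([-1.0, 1.0], size=(int(pos.sum()), d - 1))
    return x, y
```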

4.2. MAIN RESULT

Our main result states that, under rich initialization (Section 3.2), NNs demonstrate LD-SB for any IFM dataset satisfying the above conditions. Its proof appears in Appendix A.1.

Theorem 4.1. For any dataset in the IFM model satisfying the above conditions and γ ≥ 1, if gradient flow for a 1-hidden layer FCN under rich initialization in the infinite width limit with cross entropy loss converges, it converges to ν* = 0.5δ_{θ₁} + 0.5δ_{θ₂} on S^{d+1}, where

θ₁ = (γ/√(2(1+γ²)) · e₁, 1/√(2(1+γ²)), 1/√2),  θ₂ = (−γ/√(2(1+γ²)) · e₁, 1/√(2(1+γ²)), −1/√2),

and e₁ := [1, 0, …, 0] denotes the first standard basis vector. This implies f(ν*, Px₁ + P^⊥x₂) = f(ν*, x₁) for all (x₁, y₁), (x₂, y₂) ∼ D, where P is the (rank-1) projection matrix onto the first coordinate.

Moreover, since at least one of the coordinates {2, …, d} has disjoint supports for q_i(x_i | y = +1) and q_i(x_i | y = −1), P^⊥x can still perfectly classify the given dataset, thereby implying LD-SB. It is well known that the rich regime is more relevant for the practical performance of NNs since it allows for feature learning, while the lazy regime does not (Chizat et al., 2019). Nevertheless, in the next section, we present theoretical evidence that LD-SB holds even in the lazy regime, by considering a much more specialized dataset within IFM.

4.3. LAZY REGIME

In this regime, we work with the following dataset within the IFM family: for y ∈ {±1} we generate (x, y) ∈ D as

x₁ = γy,  and for all i ∈ {2, …, d}:  x_i = ±1 if y = +1,  x_i = 0 if y = −1.

Although this is a point-mass dataset, it still exhibits the important characteristic it shares with the rich regime setting: only one of the coordinates is linearly separable while the others are not. For this dataset, we provide the following characterization of the max-margin NTK classifier (as in Eqn. (3)):

Theorem 4.2. For sufficiently small ϵ > 0, there exists an absolute constant N such that for all d > N and γ ∈ [7, ϵ√d), the L₂ max-margin classifier for joint training of both layers of a 1-hidden layer FCN in the NTK regime on the dataset D, i.e., any f satisfying Eqn. (3), satisfies

pred(f(Px₁ + P^⊥x₂)) = pred(f(x₁)) ∀ (x₁, y₁), (x₂, y₂) ∈ D,

where P is the projection matrix onto the first coordinate and pred(f(x)) denotes the label predicted by the model f on x.

The above theorem shows that the prediction on a mixed example Px₁ + P^⊥x₂ is the same as that on x₁, thus establishing LD-SB. The proof of this theorem is provided in Appendix A.2.

4.4. EMPIRICAL VERIFICATION

In this section, we present empirical results demonstrating LD-SB on three real datasets: Imagenette (FastAI, 2021), a binary version of Imagenette (b-Imagenette) and waterbirds-landbirds (Sagawa et al., 2020a), as well as one designed dataset, MNIST-CIFAR (Shah et al., 2020). More details about the datasets can be found in Appendix B.1.

4.4.1. EXPERIMENTAL SETUP

We take Imagenet-pretrained Resnet-50 models, with 2048 features, for feature extraction, and train a 1-hidden layer fully connected network with ReLU nonlinearity and 100 hidden units for classification on each of these datasets. During finetuning, we freeze the backbone Resnet-50 model and train only the 1-hidden layer head (more details in Appendix B.1).

Demonstrating LD-SB: Given a model f(·), we establish its low-dimensional SB by identifying a small-dimensional subspace, identified by its projection matrix P, such that if we mix inputs x₁ and x₂ as x̃ = Px₁ + P^⊥x₂, then f(x̃) is always close to the model's output f(x₁) on x₁. We measure closeness with four metrics: (1) P^⊥-randomized accuracy (P^⊥-RA): accuracy on the dataset (Px₁ + P^⊥x₂, y₁), where (x₁, y₁) and (x₂, y₂) are sampled i.i.d. from the dataset, (2) P-randomized accuracy (P-RA): accuracy on the dataset (Px₁ + P^⊥x₂, y₂), (3) P^⊥ logit change (P^⊥-LC): relative change with respect to the logits of x₁, i.e., ‖f(x̃) − f(x₁)‖ / ‖f(x₁)‖, and (4) P logit change (P-LC): relative change with respect to the logits of x₂, i.e., ‖f(x̃) − f(x₂)‖ / ‖f(x₂)‖. LD-SB predicts high P^⊥-RA, low P-RA, small P^⊥-LC and large P-LC.

As described in Sections 4.2 and 4.3, the training of 1-hidden layer neural networks may follow different trajectories depending on the scale of initialization, so the subspace projection matrix P is obtained differently in the rich and lazy regimes. For the rich regime, we empirically show that the first-layer weights have a low-rank structure, as predicted by Theorem 4.1; for the lazy regime, we show that although the first-layer weights do not exhibit low-rank structure, the model still has low-dimensional dependence on the input, as predicted by Theorem 4.2.
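A sketch of this evaluation in NumPy (f is assumed to map a batch of inputs to per-class logits; the function name and batching convention are ours):

```python
import numpy as np

def mixing_metrics(f, P, X, Y, rng=np.random.default_rng(0)):
    """P/P_perp randomized accuracies and logit changes for a model f.
    P: (d, d) projection matrix; mixes x1 with an independently drawn x2
    as x_tilde = P x1 + P_perp x2, and compares f(x_tilde) to f(x1), f(x2)."""
    n, d = X.shape
    perm = rng.permutation(n)
    X2, Y2 = X[perm], Y[perm]
    P_perp = np.eye(d) - P
    X_mix = X @ P.T + X2 @ P_perp.T
    lm, l1, l2 = f(X_mix), f(X), f(X2)
    preds = lm.argmax(axis=1)
    lc = lambda la, lb: float(np.mean(np.linalg.norm(la - lb, axis=1)
                                      / np.linalg.norm(lb, axis=1)))
    return {"P_perp-RA": float((preds == Y).mean()),   # against labels of x1
            "P-RA": float((preds == Y2).mean()),       # against labels of x2
            "P_perp-LC": lc(lm, l1), "P-LC": lc(lm, l2)}
```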

4.4.2. RICH REGIME

Theorem 4.1 suggests that, asymptotically, the first-layer weight matrix will be low rank. However, since we train only for a finite amount of time, the weight matrix will only be approximately low rank. To quantify this, we use the notion of effective rank (Roy & Vetterli, 2007).

Definition 4.3. Given a matrix M, its effective rank is defined as Eff-rank(M) = exp(−Σᵢ σ̄ᵢ(M)² log σ̄ᵢ(M)²), where σᵢ(M) denotes the i-th singular value of M and σ̄ᵢ(M)² := σᵢ(M)² / Σⱼ σⱼ(M)².

One way to interpret the effective rank is as the exponential of the von Neumann entropy (Petz, 2001) of the matrix MMᵀ / Tr(MMᵀ), where Tr(·) denotes the trace of a matrix. For illustration, the effective rank of a projection matrix onto k dimensions equals k.

Figure 3a shows the evolution of the effective rank through training on the four datasets. We observe that the effective rank of the weight matrix decreases drastically towards the end of training. To confirm that this indeed leads to LD-SB, we set P to be the subspace spanned by the top singular vectors of the first-layer weight matrix; Table 1 reports the resulting randomized accuracies and logit changes.
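Definition 4.3 translates directly into code; a small sketch:

```python
import numpy as np

def effective_rank(M):
    """exp of the entropy of the normalized squared singular values of M."""
    s2 = np.linalg.svd(M, compute_uv=False) ** 2
    p = s2 / s2.sum()
    p = p[p > 0]  # convention: 0 log 0 = 0
    return float(np.exp(-(p * np.log(p)).sum()))
```

As a sanity check, a rank-k projection matrix has k unit singular values, giving p = 1/k for each of them and an effective rank of exactly k.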

4.4.3. LAZY REGIME

For the lazy regime, it turns out that the rank of the first-layer weight matrix remains high throughout training, as shown in Figure 3b. However, we can still find a low-dimensional projection matrix P satisfying the conditions of LD-SB (Definition 1.1) as the solution to an optimization problem. More concretely, given a pretrained model f and a rank r, we obtain a projection matrix P by solving

min_P (1/n) Σ_{i=1}^n L(f(Pxᵢ), yᵢ) + λ L(f(P^⊥xᵢ), U[L]),

where U[L] represents the uniform distribution over all L labels, (x₁, y₁), …, (xₙ, yₙ) are the training examples and L(·, ·) is the cross entropy loss. We reiterate that the optimization is only over P; the model parameters of f are unchanged. In words, the objective ensures that the neural network produces correct predictions along P and uninformative predictions along P^⊥. Table 2 presents the results for P^⊥- and P-RA as well as LC. As can be seen, even in this case, we are able to find small-rank projection matrices, demonstrating LD-SB. A sketch of this optimization appears below.
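The following PyTorch sketch parameterizes the rank-r projection as P = BBᵀ with B an orthonormal basis obtained by QR factorization; this parameterization, the function name and the hyperparameters are our choices, not necessarily the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def learn_projection(f, X, Y, r, num_labels, lam=1.0, steps=1000, lr=0.1):
    """Find a rank-r projection P such that f(P x) predicts y while
    f(P_perp x) matches the uniform distribution. f stays frozen: its
    parameters are simply not passed to the optimizer."""
    n, d = X.shape
    A = torch.randn(d, r, requires_grad=True)
    uniform = torch.full((n, num_labels), 1.0 / num_labels)
    opt = torch.optim.Adam([A], lr=lr)
    for _ in range(steps):
        B, _ = torch.linalg.qr(A)            # orthonormal basis of span(A)
        P = B @ B.T
        loss = F.cross_entropy(f(X @ P), Y) \
             + lam * F.cross_entropy(f(X @ (torch.eye(d) - P)), uniform)
        opt.zero_grad(); loss.backward(); opt.step()
    B, _ = torch.linalg.qr(A.detach())
    return B @ B.T
```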

5. TRAINING DIVERSE CLASSIFIERS USING OrthoP

Motivated by our results on low-dimensional SB, in this section we present a natural way to train diverse models, so that an ensemble of such models can mitigate SB. More concretely, given an initial model f with a low-dimensional projection P that captures its input dependence, we train another model f_proj by projecting the input through P^⊥, i.e., instead of training on (xᵢ, yᵢ), we train the second model on (P^⊥xᵢ, yᵢ). We refer to this training procedure as OrthoP, for orthogonal projection.

Given any two models f and f̃, we evaluate their diversity using two metrics. The first is mistake diversity:

Mist-Div(f, f̃) := 1 − |{i : f(xᵢ) ≠ yᵢ and f̃(xᵢ) ≠ yᵢ}| / min(|{i : f(xᵢ) ≠ yᵢ}|, |{i : f̃(xᵢ) ≠ yᵢ}|),

where we abuse notation by using f(xᵢ) to denote the label predicted by f on xᵢ. The second is class-conditioned logit correlation:

CC-LogitCorr(f, f̃) := (1/|Y|) Σ_{y∈Y} corr([f(xᵢ)], [f̃(xᵢ)] : yᵢ = y),

where corr([f(xᵢ)], [f̃(xᵢ)] : yᵢ = y) represents the empirical correlation between the logits of f and f̃ on the data points where the true label is y.

Table 3 compares the diversity of two independently trained models (f and f_ind) with that of two sequentially trained models (f and f_proj) as above. The results demonstrate that f and f_proj are more diverse than f and f_ind. Figure 4 shows the decision boundaries of f and f_proj on the 2-dimensional subspace spanned by the top two singular vectors of the weight matrix; the decision boundary of the second model is more non-linear than that of the first. Finally, Figure 5 shows the variation of test accuracy with the strength of Gaussian noise added to the pretrained representations of the dataset. An ensemble of f and f_proj is much more robust than an ensemble of f and f_ind (where an ensemble is obtained by averaging the logits of the two models).
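A compact sketch of OrthoP and of the mistake-diversity metric (train_fn and its return convention are hypothetical placeholders for whatever training routine is in use):

```python
import numpy as np

def ortho_p(train_fn, X, Y, top_k):
    """Train f, project out the top right-singular directions of its
    first-layer weights, then train f_proj on the projected inputs.
    train_fn(X, Y) -> (model, W1) with W1 the first-layer weight matrix."""
    f, W1 = train_fn(X, Y)
    _, _, Vt = np.linalg.svd(W1, full_matrices=False)
    V = Vt[:top_k].T                         # top-k input directions, (d, k)
    P_perp = np.eye(X.shape[1]) - V @ V.T    # projector onto their complement
    f_proj, _ = train_fn(X @ P_perp, Y)
    return f, f_proj, P_perp

def mistake_diversity(pred_f, pred_g, y):
    """Mist-Div = 1 - |common mistakes| / min(|mistakes of f|, |mistakes of g|)."""
    mf, mg = pred_f != y, pred_g != y
    return 1.0 - (mf & mg).sum() / min(mf.sum(), mg.sum())
```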

6. DISCUSSION

In this work, we characterize the simplicity bias exhibited by one-hidden-layer neural networks in terms of the low-dimensional input dependence of the model. We prove this characterization for datasets in the independent features model containing a linearly separable coordinate, and validate it empirically on real datasets. Based on this characterization, we also propose a simple way to train diverse models and show that it leads to models with significantly better robustness to Gaussian noise. This work is an initial step towards rigorously defining the simplicity bias, or shortcut learning, of neural networks, which is one of the major challenges to their real-world deployment (Geirhos et al., 2020). Providing a similar characterization for deeper nets and other architectures is an important research direction, which, in our opinion, requires a better understanding of the training dynamics and limit points of gradient descent on these networks.

A PROOFS FOR RICH AND LAZY REGIME

A.1 RICH REGIME

We restate Theorem 4.1 below and prove it.

Theorem A.1. For any dataset in the IFM model satisfying the conditions in Section 4.1, γ ≥ 1 and f(ν, x) as in Eqn. (1), the distribution ν* = 0.5δ_{θ₁} + 0.5δ_{θ₂} on S^{d+1} is the unique max-margin classifier satisfying Eqn. (2), where

θ₁ = (γ/√(2(1+γ²)) · e₁, 1/√(2(1+γ²)), 1/√2),  θ₂ = (−γ/√(2(1+γ²)) · e₁, 1/√(2(1+γ²)), −1/√2),

and e₁ := [1, 0, …, 0] denotes the first standard basis vector. In particular, this implies that if gradient flow for a 1-hidden layer FCN under rich initialization in the infinite width limit with cross entropy loss converges, it converges to ν* satisfying f(ν*, Px₁ + P^⊥x₂) = f(ν*, x₁) for all (x₁, y₁), (x₂, y₂) ∈ D, where P is the (rank-1) projection matrix onto the first coordinate.

Proof of Theorem A.1. Chizat & Bach (2020) showed the following primal-dual characterization of the maximum margin classifiers of Eqn. (2):

Lemma A.2 (Chizat & Bach, 2020). ν* satisfies Eqn. (2) if there exists a data distribution p* such that the following two complementary slackness conditions hold:

Supp(ν*) ⊆ argmax_{(w,b,a)∈S^{d+1}} E_{(x,y)∼p*} y[a ϕ(⟨w, x⟩ + b)],   (4)

and

Supp(p*) ⊆ argmin_{(x,y)∼D} y E_{(w,b,a)∼ν*}[a ϕ(⟨w, x⟩ + b)].   (5)

The plan is to construct a distribution p* that satisfies the conditions of the above lemma.

Uniqueness. Note that for a fixed p*, E_{(x,y)∼p*} y f(ν, x) is an upper bound on the margin min_{(x,y)∼D} y f(ν, x) of any classifier ν. Hence, for uniqueness, it suffices to show that δ_{θ₁}, δ_{θ₂} are the unique maximizers of the objective on the RHS of Eqn. (4), and that the unique maximum margin convex combination of δ_{θ₁}, δ_{θ₂} over D is ν*.

We first describe the support D̄ of p*. For y ∈ {±1} we generate (x, y) ∈ D̄ as

x₁ = γy,  and for all i ∈ {2, …, d}:  x_i = ±1 if y = +1,  x_i = 0 if y = −1.

Now for (x, y) ∈ D̄, define

p*(x, y) = 0.5^d for y = +1  and  p*(x, y) = 0.5 for y = −1.   (6)

Note that p* is supported on 2^{d−1} positive instances and one negative instance. We begin by showing Eqn. (5).

Claim A.3. p* as in Eqn. (6) satisfies Eqn. (5). Further, the unique maximum margin convex combination of δ_{θ₁}, δ_{θ₂} is ν*.

Proof. Let us find the minimizers (x, y) ∼ D of y f(ν, x) = y E_{(w,b,a)∼ν}[a ϕ(⟨w, x⟩ + b)] for any ν = λδ_{θ₁} + (1−λ)δ_{θ₂}, 0 ≤ λ ≤ 1. For (x, y) with y = −1 (denoting x₁ by −α₁, where α₁ ≥ γ),

y f(ν, x) = −[ λ · ϕ( γ/√(2(1+γ²)) · e₁ᵀ(−α₁e₁) + 1/√(2(1+γ²)) ) · (1/√2) + (1−λ) · ϕ( −γ/√(2(1+γ²)) · e₁ᵀ(−α₁e₁) + 1/√(2(1+γ²)) ) · (−1/√2) ],

and for (x, y) with y = +1 (denoting x₁ by α₂, where α₂ ≥ γ),

y f(ν, x) = λ · ϕ( γ/√(2(1+γ²)) · e₁ᵀ(α₂e₁) + 1/√(2(1+γ²)) ) · (1/√2) + (1−λ) · ϕ( −γ/√(2(1+γ²)) · e₁ᵀ(α₂e₁) + 1/√(2(1+γ²)) ) · (−1/√2).

As γ ≥ 1, the expressions above equal (1−λ)(γα₁+1)/(2√(1+γ²)) and λ(γα₂+1)/(2√(1+γ²)) respectively, and hence are minimized at α₁ = α₂ = γ. Hence, the margin of ν is min(λ, 1−λ) · √(1+γ²)/2, which is uniquely maximized at λ = 1/2. Further, for λ = 1/2, all points in D̄ attain the same value of y f(ν, x), which proves the claim.

In the rest of the proof we show Eqn. (4). Let us denote g(w, b, a) := E_{(x,y)∼p*} y[a ϕ(⟨w, x⟩ + b)]. We show that θ₁, θ₂ are the only maximizers of g(w, b, a) over S^{d+1}. We first compute g(θ₁) and g(θ₂):

g(θ₁) = Pr(y = 1) · 1 · (1/√2) · ϕ( γ/√(2(1+γ²)) · e₁ᵀ(γe₁) + 1/√(2(1+γ²)) ) + Pr(y = −1) · (−1) · (1/√2) · ϕ( γ/√(2(1+γ²)) · e₁ᵀ(−γe₁) + 1/√(2(1+γ²)) ) = √(γ²+1)/4,

where the first equality uses the fact that w₂, w₃, …, w_d are zero for θ₁. Similarly, g(θ₂) = √(γ²+1)/4. We now show that g(w, b, a) < √(γ²+1)/4 for (w, b, a) ∉ {θ₁, θ₂}. We begin with the following simple but useful claim.

Claim A.4. All maximizers of g(w, b, a) over S^{d+1} satisfy |a| = 1/√2.

Proof. The proof essentially follows from the 1-homogeneity of the ReLU function ϕ and the separability of g(w, b, a). Note that g(w, b, a) = a · √(‖w‖² + b²) · g(w′, b′, 1), where (w′, b′) = (w, b)/√(‖w‖² + b²) so that ‖w′‖² + b′² = 1. Maximizing g(w, b, a) is thus equivalent to separately maximizing g(w′, b′, 1) over S^d and a · √(‖w‖² + b²) over S^{d+1}. The second of these has its unique maximum at |a| = 1/√2, completing the proof.

Now express g(w, b, a) as

g(w, b, a) = a ( Pr(y = 1) E[ϕ(wᵀx + b) | y = 1] − Pr(y = −1) E[ϕ(wᵀx + b) | y = −1] ) = (a/2) ( E_σ[ ϕ(γw₁ + b + Σ_{i=2}^d σᵢwᵢ) ] − ϕ(b − γw₁) ),   (7)

where the σᵢ are independent Rademacher random variables. We have two cases on a.

Case 1: a = 1/√2. By Eqn. (7) we have g(w, b, 1/√2) ≤ (1/(2√2)) E_σ[ ϕ(γw₁ + b + Σ_{i=2}^d σᵢwᵢ) ]. To simplify this, define the random variable X = Σ_{i=2}^d σᵢwᵢ and denote γw₁ + b by α. Note that |α| = |γw₁ + b| ≤ √((γ²+1)/2), which follows from ‖w‖² + b² = 1/2. The expectation above becomes

E[ϕ(X + α)] = E[(X + α) 1{X + α ≥ 0}] = E[X 1{X ≥ −α}] + α Pr(X ≥ −α) = E[X 1{X ≥ α}] + α(1 − Pr(X ≥ α)) ≤ E[X 1{X ≥ α}] + α,

where the last equality follows from the symmetry of X. Note that Var(X) = Σ_{i=2}^d wᵢ², which is at most 1/2 − α²/(1+γ²) (using γw₁ + b = α and ‖w‖² + b² = 1/2). Using Lemma A.5 to upper bound E[X 1{X ≥ α}], we have

E[ϕ(X + α)] ≤ α + √( min(1/2, (1/2 − α²/(1+γ²))/(2α²)) · (1/2 − α²/(1+γ²)) ).

One can check that the RHS has its unique maximizer at α = √((1+γ²)/2) over the range |α| ≤ √((1+γ²)/2). Hence g(w, b, a) ≤ √(1+γ²)/4 in this case. We are now done, since any (w₁, b) satisfying γw₁ + b = √((1+γ²)/2) and w₁² + b² ≤ 1/2 has b = 1/√(2(1+γ²)).

Case 2: a = −1/√2. Using Eqn. (7) we have g(w, b, −1/√2) ≤ ϕ(b − γw₁)/(2√2), which for b² + w₁² ≤ 1/2 attains its unique maximum √(γ²+1)/4 at b = 1/√(2(1+γ²)) (and w₁ = −γ/√(2(1+γ²))).

Finally, note that the weights (w, b, a) of the trained network are sampled from ν*. Hence the final claim in the theorem about f(ν*, Px₁ + P^⊥x₂) follows, since the distribution of w is supported only on the ±e₁ directions.

A.1.1 AUXILIARY LEMMAS FOR RICH REGIME

Lemma A.5. For any symmetric discrete random variable X with bounded variance and any α > 0,

E[X · 1{X ≥ α}] ≤ √( min(1/2, Var(X)/(2α²)) · Var(X) ).

Proof. We have

E[X · 1{X ≥ α}] = Σ_{x≥α} x p(x) = Σ_{x≥α} √p(x) · √p(x) x ≤ √( Pr(X ≥ α) ) · √( Σ_{x≥α} x² p(x) ),   (8)

where the last inequality is Cauchy-Schwarz. By the symmetry of X, Pr(X ≥ α) ≤ 1/2; by Chebyshev's inequality together with symmetry, Pr(X ≥ α) = Pr(|X| ≥ α)/2 ≤ Var(X)/(2α²). Combining these with Eqn. (8), Σ_{x≥α} x²p(x) ≤ Var(X) and the non-negativity of α gives the required lemma.

A.2 LAZY REGIME

Theorem 4.2 is a corollary of the following more general theorem.

Theorem A.6. Consider a point x ∈ D. For sufficiently small ϵ > 0, there exists an absolute constant N such that for all d > N, γ < ϵ√d and γ ≥ 7, for the joint training of both layers of a 1-hidden layer FCN in the NTK regime, the prediction on any point of the form (ζ, x_{2:d}) satisfies the following:

1. For ζ ≥ 0.73, the prediction is positive.
2. For ζ ≤ −0.95γ, the prediction is negative.

The above theorem establishes that perturbing x₁ by O(γ) changes pred(f(x)) for x ∈ D (whereas there exists a classifier achieving a margin of Ω(√d) on D, since D has margin 1 on the coordinates {2, …, d}). As γ = o(√d), this shows that the learned model is adversarially vulnerable.

Proof of Theorem A.6. The idea of the proof is to obtain an explicit expression for f(x) by applying standard kernel max-margin SVM theory to the NTK kernel of Theorem 3.2. We begin with some preliminaries. We refer to the first coordinate of an instance as the 'linear' coordinate, and to the rest as 'non-linear' coordinates. Henceforth, we append an extra coordinate with value 1 to every instance (corresponding to the bias term), as is standard for working with unbiased SVMs without loss of generality.

Explicit expression for f. By the representer theorem for max-margin kernel SVM, f can be expressed as

f(x) = Σ_{(x^{(t)}, y^{(t)})∈D} λ_t y^{(t)} K(x, x^{(t)}),

for some λ_t ≥ 0 (known as Lagrange multipliers). Further, by the KKT conditions, a function possessing such a representation (and correctly classifying D) has maximum margin if y^{(t)} f(x^{(t)}) = 1 whenever λ_t > 0 (training points t with λ_t > 0 are called support vectors). We begin with a useful claim.

Claim A.7. The max-margin kernel SVM for D with the NTK kernel has all points in D as support vectors.

Proof. By the above discussion, it suffices to show that the (unique) solution α ∈ R^{|D|} of Kα = y satisfies sign(αᵢ) = y^{(i)} for all i, where K is the |D| × |D| Gram matrix with (i, j)-th entry K(x^{(i)}, x^{(j)}) and yᵢ = y^{(i)} (the Lagrange multipliers are then given by λᵢ = yᵢαᵢ).

Structure of the Gram matrix. Order D so that the positive instances appear first. Then the Gram matrix K has the block structure

K = [ B  C ; Cᵀ  R ],

where B ∈ R^{2^{d−1} × 2^{d−1}} and R ∈ R^{1×1} are the Gram matrices of the positive and negative instances respectively, and C ∈ R^{2^{d−1} × 1} holds the values K(x^{(i)}, x^{(|D|)}) for i < |D|. Recall that for the NTK kernel, K(x^{(i)}, x^{(j)}) = ‖x^{(i)}‖ ‖x^{(j)}‖ κ(⟨x^{(i)}, x^{(j)}⟩/(‖x^{(i)}‖‖x^{(j)}‖)). All positive instances have the same norm, denoted ρ₁ = √(d + γ²), and the inner product between two positive instances depends only on the number i of non-matching non-linear coordinates; denote the corresponding value of κ by βᵢ for 0 ≤ i ≤ d−1. Hence, the rows of B are permutations of each other, with the entry ρ₁²βᵢ appearing C(d−1, i) times. Similarly, the entries of C are all equal to ρ₁ρ₂β_d, where β_d denotes the value of κ between any positive instance and the negative one, and ρ₂ = ‖x^{(|D|)}‖ = √(1 + γ²). The only entry of R is ρ₂²κ(1). In particular,

βᵢ = κ( (d + γ² − 2i)/(d + γ²) ) for 0 ≤ i ≤ d−1,  and  β_d = κ( (1 − γ²)/(√(d + γ²) · √(1 + γ²)) ).

Now we are ready to solve Kα = y. By the symmetry in the structure of K, α has the form [a, a, …, a, b]ᵀ, where the first |D| − 1 entries are equal. Expanding Kα = y, we get the two equations

a ρ₁² Σ_{i=0}^{d−1} C(d−1, i) βᵢ + b ρ₁ρ₂ β_d = 1  and  2^{d−1} a ρ₁ρ₂ β_d + ρ₂² κ(1) b = −1.

Solving, we get

a = (ρ₂κ(1) + ρ₁β_d) / ( ρ₁²ρ₂ Σ_{i=0}^{d−1} C(d−1, i) [κ(1)βᵢ − β_d²] )  and  b = ( −1 − 2^{d−1} a ρ₁ρ₂ β_d ) / ( ρ₂² κ(1) ).

We now show that a > 0 and b < 0. Note that for sufficiently large d, β_d can be made arbitrarily close to κ(0) = 1/π (since κ is smooth around 0). Hence, a > 0 implies b < 0. We in fact give the following estimate for a:

a = 2^{1−d} · (ρ₂κ(1) + ρ₁β_d) / (ξ ρ₁² ρ₂),  where 2/π − 1/π² + O(1/d) ≤ ξ ≤ 2 + O(1/d).   (9)

For the lower bound on ξ, write

Σ_{i=0}^{d−1} C(d−1, i) [κ(1)βᵢ − β_d²] = κ(1) Σ_{i=0}^{⌊d/2⌋} C(d−1, i) (βᵢ + β_{d−1−i}) − 2^{d−1} β_d² ≥ κ(1) Σ_{i=0}^{⌊d/2⌋} C(d−1, i) · 2β_{d/2} − 2^{d−1} β_d² ≥ 2^{d−1} [ κ(1)κ(0) − κ²(0) + O(1/d) ],

where the first inequality uses the convexity of κ and the second uses β_{d/2} = κ(0) + O(1/d) and β_d = κ(0) + O(1/√d). For the upper bound on ξ, write

Σ_{i=0}^{d−1} C(d−1, i) [κ(1)βᵢ − β_d²] ≤ κ(1) Σ_{i=0}^{d−1} C(d−1, i) κ(1 − 2i/(d + γ²)) ≤ κ(1) Σ_{i=0}^{d−1} C(d−1, i) (2 − 2i/(d + γ²)) = κ(1) 2^d − κ(1)(d−1) 2^{d−1}/(d + γ²),

where the second inequality uses κ(u) ≤ 1 + u (which holds by convexity together with κ(−1) = 0 and κ(1) = 2).

Now we analyze the predicted labels for points of the form (ζ, x_{2:d+1}), where x ∈ D. We consider two cases depending on the label of x.

Predicted label for a point (ζ, x^{(t)}_{2:d+1}) where x^{(t)} ∈ D has positive label. Our point (denoted by x) has the form (ζ, ζ₁, ζ₂, …, ζ_{d−1}, 1), where ζᵢ ∈ {±1}. The idea of the proof is to write f explicitly as a function of ζ and work with its first-order Taylor expansion around ζ = γ, with some additional work to take care of the non-smoothness of f.

Explicit form for f. Let τᵢ := ⟨x, x′⟩/(‖x‖‖x′‖) for a positive instance x′ ∈ D such that x and x′ have exactly i non-matching non-linear coordinates (for 0 ≤ i ≤ d−1). Similarly denote by τ_d the quantity ⟨x, x^{(|D|)}⟩/(‖x‖‖x^{(|D|)}‖). In particular,

τᵢ(ζ) = (γζ + d − 2i)/(ρ₁‖x‖)  and  τ_d(ζ) = (1 − γζ)/(ρ₂‖x‖).

By the above discussion, we have

f(x) = a Σ_{t=1}^{|D|−1} K(x, x^{(t)}) + b K(x, x^{(|D|)}) = a ρ₁ ‖x‖ Σ_{i=0}^{d−1} C(d−1, i) κ(τᵢ) + b ρ₂ ‖x‖ κ(τ_d).

Substituting b and denoting f(x)/‖x‖ by g(ζ), we get

g(ζ) = a ρ₁ [ Σ_{i=0}^{d−1} C(d−1, i) κ(τᵢ(ζ)) − (2^{d−1} β_d / κ(1)) κ(τ_d(ζ)) ] − κ(τ_d(ζ)) / (ρ₂ κ(1)).   (10)

We now expand g(ζ) using the Taylor series around ζ = γ (note that g(γ) = 1/ρ₁). However, κ′ can be unbounded around −1 and 1. To get around this, write g = h + q, where h has bounded first and second derivatives and q is of lower order than h for the ζ of interest. In particular,

h(ζ) = a ρ₁ [ Σ_{i=d/4}^{3d/4} C(d−1, i) κ(τᵢ(ζ)) − (2^{d−1} β_d / κ(1)) κ(τ_d(ζ)) ] − κ(τ_d(ζ)) / (ρ₂ κ(1)),
q(ζ) = a ρ₁ Σ_{i : |d/2 − i| > d/4} C(d−1, i) κ(τᵢ(ζ)).

Observe that q(ζ) = o(c^d) for some c < 1, using the estimate (9) for a and concentration for sums of independent Bernoullis. By Taylor's theorem,

g(ζ) = h(γ) + h′(γ)(ζ − γ) + h″(θ)(ζ − γ)²/2 + q(ζ),   (11)

for some θ ∈ [γ, ζ], where h(γ) ≈ 1/√d. It will turn out that |h′(γ)| = Θ(1/√d) and |h″(ζ)| = o(1/√d), which allows us to complete the proof using the linear approximation of g(ζ), neglecting the second-order term and q(ζ). We now compute h′ and h″, treating ‖x‖ = √(d + ζ²) as a constant for ease of exposition (the proof works without this approximation, or the reader may think of γ as o(√d)). Using τᵢ′(ζ) ≈ γ/(ρ₁‖x‖) and τ_d′(ζ) ≈ −γ/(ρ₂‖x‖),

h′(ζ) ≈ a ρ₁ [ Σ_{i=0}^{d−1} C(d−1, i) κ′(τᵢ(ζ)) γ/(ρ₁‖x‖) + (2^{d−1} β_d / κ(1)) κ′(τ_d(ζ)) γ/(ρ₂‖x‖) ] + κ′(τ_d(ζ)) γ / (ρ₂² κ(1) ‖x‖),

h″(ζ) ≈ a ρ₁ [ Σ_{i=0}^{d−1} C(d−1, i) κ″(τᵢ(ζ)) γ²/(ρ₁²‖x‖²) − (2^{d−1} β_d / κ(1)) κ″(τ_d(ζ)) γ²/(ρ₂²‖x‖²) ] − κ″(τ_d(ζ)) γ² / (ρ₂³ κ(1) ‖x‖²).

Plugging in ‖x‖ ≈ ρ₁ ≈ √d and substituting a from Eqn. (9),

h′(ζ) = (1 + β_d²/ξ) κ′(τ_d(ζ)) γ / (ρ₂² κ(1) √d) + o(1/√d)  and  h″(ζ) = O(1/d),

which, substituted in Eqn. (11) with τ_d(ζ) ≈ 0, β_d ≈ κ(0) and κ′(τ_d(ζ)) ≈ κ′(0), gives

g(ζ) = (1/√d) [ 1 + ( (1 + κ²(0)/ξ) κ′(0) γ / (κ(1) ρ₂²) ) (ζ − γ) ] + o(1/√d).

Hence, g(ζ) > 0 whenever the coefficient of 1/√d above is bounded above zero, and a similar condition holds for g(ζ) < 0. Using the estimates of ξ from Eqn. (9) and κ′(0) = 1, κ(0) = 1/π, κ(1) = 2, ρ₂² = 1 + γ², this gives g(ζ) > 0 for ζ > −0.68γ − 1.68/γ and g(ζ) < 0 for ζ < −0.905γ − 1.905/γ.

Predicted label for the point (ζ, x^{(t)}_{2:d+1}) where x^{(t)} ∈ D has negative label. Following the same plan, write our point (denoted by x) as (ζ, 0, …, 0, 1).

Explicit form for f. Begin by finding τᵢ = (1 + γζ)/(ρ₁‖x‖) and τ_d = (1 − γζ)/(ρ₂‖x‖). Eqn. (10) now gives

g(ζ) = 2^{d−1} a ρ₁ [ κ(τ₀(ζ)) − β_d κ(τ_d(ζ))/κ(1) ] − κ(τ_d(ζ)) / (ρ₂ κ(1)).

Expanding κ(τ₀(ζ)) using the Taylor series around ζ = −1/γ,

κ(τ₀(ζ)) = κ(0) + κ′(τ₀(θ)) τ₀′(θ) (ζ + 1/γ),

for some θ ∈ [−1, 1]. For large d, τ₀(θ) ≈ 0 and τ₀′(θ) = O(1/√d). Hence we have

g(ζ) = ( (ρ₂κ(1) + ρ₁β_d)/(ξ ρ₁ ρ₂) ) κ(0) + O(1/√d) − 2^{d−1} a ρ₁ β_d κ(τ_d(ζ))/κ(1) − κ(τ_d(ζ))/(ρ₂ κ(1)) = (1/ρ₂) [ κ²(0)/ξ − ( κ²(0)/(ξκ(1)) + 1/κ(1) ) κ(τ_d(ζ)) ] + o(1).

As before, g(ζ) > 0 whenever the coefficient of 1/ρ₂ above is bounded above zero, which happens for ζ ≥ 0.73 (for γ ≥ 3). Similarly, g(ζ) < 0 for ζ ≤ 0.
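The sign pattern claimed in Claim A.7 can be sanity-checked numerically by solving the symmetry-reduced 2×2 system for (ā, b) with ā = 2^{d−1}a, estimating E[β_I], I ∼ Bin(d−1, 1/2), by Monte Carlo. This verification script is our addition, not part of the proof:

```python
import numpy as np

def check_signs(d=10_000, gamma=7.0, n_mc=1_000_000, rng=np.random.default_rng(0)):
    """Solve the reduced system for (a_bar, b), a_bar = 2^(d-1) a,
    and verify a_bar > 0 > b (so lambda_t = y_t alpha_t > 0 for all t)."""
    kappa = lambda u: (2 * u * (np.pi - np.arccos(np.clip(u, -1, 1)))
                       + np.sqrt(np.clip(1 - u ** 2, 0, None))) / np.pi
    rho1, rho2 = np.sqrt(d + gamma ** 2), np.sqrt(1 + gamma ** 2)
    i = rng.binomial(d - 1, 0.5, size=n_mc)          # mismatch counts
    e_beta = kappa((d + gamma ** 2 - 2 * i) / (d + gamma ** 2)).mean()
    beta_d = kappa((1 - gamma ** 2) / (rho1 * rho2))
    M = np.array([[rho1 ** 2 * e_beta, rho1 * rho2 * beta_d],
                  [rho1 * rho2 * beta_d, rho2 ** 2 * kappa(1.0)]])
    a_bar, b = np.linalg.solve(M, [1.0, -1.0])
    return a_bar > 0 > b
```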




¹ Image source: Wikipedia.




Figure 3: Evolution of effective rank of first layer weight matrices in rich and lazy regimes.

Figure 4: Decision boundaries for f and f proj for B-Imagenette and Waterbirds datasets, visualized in the top 2 singular directions of the first layer weight matrix. The decision boundary of f proj is more non-linear compared to that of f .




Figure 6: Evolution of the effective rank of the first-layer weight matrix (dimension 2048 × 2000) for the Imagenet dataset in the rich regime.

Figure 7: Fraction of the squared Frobenius norm captured by the i-th singular value, i.e., σᵢ² / Σ_{j=1}^d σⱼ², vs. i, for the first-layer weight matrix trained in the rich regime on various datasets.

Figure 8: Variation of test accuracy with the standard deviation of Gaussian noise added to the pretrained representations of MNIST-CIFAR dataset. Model 1 is kept fixed, and values for both the ensembles are averaged across 3 runs.

Figure 9: Evolution of effective rank of the weight matrices for a depth-2 ReLU network on Resnet-50 pretrained representations of the dataset

Huh et al. (2021) empirically observe that on certain synthetic datasets, the embeddings of NNs, both at initialization and after training, have a low-rank structure. Galanti & Poggio (2022) provide theoretical intuition for the relation between various hyperparameters (such as learning rate and batch size) and the rank of learnt weight matrices, and demonstrate it empirically. In contrast, we prove LD-SB theoretically in the IFM model as well as validate it empirically on real datasets.

Table 1: Demonstration of LD-SB in the rich regime: this table presents P^⊥ and P randomized accuracies (RA) as well as logit changes (LC) on the four datasets. These results confirm that the projection of an input x onto the subspace spanned by P essentially determines the model's prediction on x. ↑ (resp. ↓) indicates that LD-SB implies a large (resp. small) value.

Table 2: Demonstration of LD-SB in the lazy regime: this table presents P^⊥ and P randomized accuracies as well as logit changes on the four datasets. These results confirm that the projection of an input x onto the subspace spanned by P essentially determines the model's prediction on x.

Table 3: Mistake diversity and class-conditioned logit correlation of models trained independently (Mist-Div(f, f_ind) and CC-LogitCorr(f, f_ind) resp.) vs. trained sequentially after projecting out the important features of the first model (Mist-Div(f, f_proj) and CC-LogitCorr(f, f_proj) resp.). The results demonstrate that f and f_proj are more diverse than f and f_ind.

Table 4: Test accuracy of f_proj in the rich regime.

Table 5: Demonstration of LD-SB in the rich regime: this table presents P^⊥ and P randomized accuracies (RA) as well as logit changes (LC) on the Imagenet dataset. These results confirm that the projection of an input x onto the subspace spanned by P essentially determines the model's prediction on x. ↑ (resp. ↓) indicates that LD-SB implies a large (resp. small) value.

Table 6: Mistake diversity and class-conditioned logit correlation, on Imagenet, of models trained independently (Mist-Div(f, f_ind) and CC-LogitCorr(f, f_ind) resp.) vs. trained sequentially after projecting out the important features of the first model (Mist-Div(f, f_proj) and CC-LogitCorr(f, f_proj) resp.). The results demonstrate that f and f_proj are more diverse than f and f_ind.

Table 7: Quantitative measurement of the non-linearity of the decision boundary: accuracy of a linear classifier fitted to the decision boundary.

B EXPERIMENTS

In this section, we provide experimental details, including hyperparameter tuning setup and some additional experiments.

B.1 DETAILS ON THE EXPERIMENTAL SETTING

We first describe the four datasets used in this work.

1. Imagenette (FastAI, 2021): a subset of 10 classes of Imagenet that are comparatively easy to classify.
2. b-Imagenette: a binarized version of Imagenette, using only a subset of two classes (tench and English springer).
3. Waterbirds-Landbirds (Sagawa et al., 2020a): a majority-minority group dataset consisting of waterbirds on water and land backgrounds, as well as landbirds on land and water backgrounds. Since most training examples show waterbirds on water and landbirds on land, this dataset serves as a benchmark for checking a model's dependence on the spurious background feature when predicting the bird class.
4. MNIST-CIFAR (Shah et al., 2020): a collage dataset created by concatenating MNIST and CIFAR images along an axis; a synthetic dataset for evaluating the simplicity bias of a trained model.

Setup. Throughout the paper, we work with pretrained representations of the above datasets, obtained using an Imagenet-pretrained Resnet-50. We finetune a 1-hidden layer FCN (hidden dimension 100) on top of these representations (keeping the backbone fixed) using SGD with a momentum of 0.9. Every model is trained for 20000 steps (large enough for convergence) with a warmup and cosine decay learning rate schedule. For each run, we tune the batch size, learning rate and weight decay using validation accuracy. The hyperparameter grids are:

• Batch size ∈ {128, 256}
• Learning rate:
  - Rich regime: ∈ {0.5, 1.0} (the learning rate in the rich regime needs to scale up with the hidden dimension)
  - Lazy regime: ∈ {0.01, 0.05}

The final numbers reported are averaged across 3 independent runs with the selected hyperparameters.

Evaluation. For Imagenette, b-Imagenette and MNIST-CIFAR, we report the standard test accuracy in all experiments. For waterbirds, we report the train-adjusted test accuracy, as in Sagawa et al. (2020a): the accuracy of each group present in the test data is individually calculated and then weighted by the proportion of the corresponding group in the train dataset.

B.2 ADDITIONAL EXPERIMENTAL RESULTS

In this section, we present a few additional experimental results.

Accuracy of f_proj. In Table 4, we show the test accuracy of f_proj. Even after projecting out the principal components used by f, f_proj attains high accuracy. Note that, in these experiments, model 1 was kept fixed and the accuracy of f_proj is averaged across 3 runs.

Results on Imagenet

We trained a 1-hidden layer FCN (with 2000 hidden neurons) on the Imagenet dataset, using rich regime initialization, with the learning rate selected from {5, 10} (as the learning rate in the rich regime needs to scale up with the hidden dimension). The evolution of the effective rank of the first-layer weight matrix is shown in Figure 6. As can be seen, the weight matrix becomes sufficiently low rank as training progresses. P and P^⊥ randomized accuracies (RA) and logit changes (LC) are shown in Table 5: the model's prediction is almost determined by the projection along the top 150 singular vectors of the weight matrix. We also train a model 2 on representations obtained by projecting out the top 150 singular vectors of the weight matrix of model 1. In Table 6, we show the mistake diversity (Mist-Div) and class-conditioned logit correlation (CC-LogitCorr) between model 1 and model 2: the projected-out model has higher diversity and lower correlation than an independently trained model. In Table 4, we also show that the second model achieves accuracy comparable to model 1.

Singular value decay. In Figure 7, we provide the singular value decay of the weight matrix for the first model trained in the rich regime. As can be seen, the top few singular values capture most of the Frobenius norm of the matrix.

MNIST-CIFAR. In Figure 8, we show that an ensemble of f and f_proj has better Gaussian robustness than an ensemble of f and f_ind on the MNIST-CIFAR dataset.

Quantitative measurement of non-linearity of the decision boundary. We report a quantitative measure of the non-linearity of the decision boundary along the top two singular vectors for f and f_proj: we fit a linear classifier to the decision boundary and report its accuracy. As shown in Table 7, the accuracy obtained by the linear classifier for f_proj is lower than that for f.

Variation of LD-SB with depth. In Figures 9 and 10, we show the evolution of the effective rank of the weight matrices for depth-2 and depth-3 ReLU networks. The rank still decreases with training, although the effect is less pronounced for the initial layers. Note that the initialization used in these runs was the feature learning initialization proposed in Yang & Hu (2021).

C EXTENDED RELATED WORKS

In this section, we provide an extended literature survey of the topics this paper builds on.

Low-rank simplicity bias in linear networks. Multiple works have established a low-rank simplicity bias for gradient descent on linear networks, both for squared loss and for cross entropy loss. For squared loss, Gunasekar et al. (2017) conjectured that the network is biased towards finding minimum nuclear norm solutions for two-layer linear networks. Arora et al. (2019) refuted the conjecture and instead argued that the network is biased towards finding low-rank solutions. Razin & Cohen (2020) provided empirical support for the low-rank conjecture, giving synthetic examples where the network drives the nuclear norm to infinity but minimizes the rank of the effective linear mapping. Li et al. (2021) established that for small enough initialization, gradient flow on linear networks follows a greedy low-rank learning trajectory. For binary classification on linearly separable data, Ji & Telgarsky (2019) showed that the weight matrices of a linear network eventually become rank-1 as training progresses.

Low-rank simplicity bias in non-linear networks. For non-linear networks, the work related to low-rank simplicity bias is rather sparse. Two of the most notable works are Huh et al. (2021) and Galanti & Poggio (2022). Huh et al. (2021) empirically established that the rank of the embeddings learnt by a neural network with ReLU activations decreases as training progresses. Galanti & Poggio (2022) provided intuition for the relation between the rank of the weight matrices and hyperparameters such as batch size and weight decay. In contrast to these works, for 1-hidden layer nets, we theoretically and empirically establish that the network depends on an extremely low-dimensional projection of the input, and that this bias can be utilized to develop a more robust classifier.

Relation to OOD. Many recent works in OOD detection (Cook et al., 2020; Zaeemzadeh et al., 2021) explicitly create low-rank embeddings so that it is easier to discriminate an OOD point. Other works implicitly rely on the low-rank nature of the embeddings: Ndiour et al. (2020) use PCA on the learnt features and model the likelihood only along the small subspace spanned by the top few directions; Wang et al. (2022) utilize the low-rank nature of the embeddings to estimate the perpendicular projection of a given data point onto this low-rank subspace and combine it with logit information to detect OOD datapoints. While these works implicitly utilize the low-rank property of embeddings, our paper (i) demonstrates the low-rank property of the weights, rather than of the embeddings, and (ii) shows that it is a consequence of SB.

Other simplicity bias. Many works have explored the nature of simplicity bias in neural networks, both empirically and theoretically. Kalimeris et al. (2019) empirically demonstrated that SGD on neural networks gradually learns functions of increasing complexity. Rahaman et al. (2018) empirically demonstrated that neural networks tend to learn lower-frequency functions first. Ronen et al. (2019) theoretically established that in the NTK regime, the convergence rate depends on the eigenvalues of the kernel spectrum. Hacohen et al. (2020) showed that neural networks learn train and test examples in almost the same order, irrespective of the architecture. Pezeshki et al. (2021) propose gradient starvation at the beginning of training as a potential reason for SB in the lazy/NTK regime, but the required conditions are hard to interpret; in contrast, our results hold for any dataset in the IFM model in the rich regime of training. Lyu et al. (2021) consider antisymmetric datasets and show that single-hidden-layer, input-homogeneous networks (i.e., without bias parameters) converge to linear classifiers; however, such networks have strictly weaker expressive power than those with bias parameters. Hacohen & Weinshall (2022) showed that deep linear networks in the NTK regime learn the higher principal components of the input data first. Most previous works used simplicity bias to explain the good generalization of neural nets; Shah et al. (2020), however, showed that extreme simplicity bias can also lead to worse OOD performance.

