SIMPLICITY BIAS IN 1-HIDDEN LAYER NEURAL NETWORKS

Abstract

Recent works (Shah et al., 2020; Chen et al., 2021) have demonstrated that neural networks exhibit extreme simplicity bias (SB): they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST and MNIST-CIFAR, where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one-hidden-layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low-dimensional projection of the inputs; (ii) theoretically, we show that when the data is linearly separable, the network primarily depends only on the linearly separable (1-dimensional) subspace, even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier; (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low-dimensional projection of the inputs, thereby demonstrating SB on these datasets; and (iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.

1. INTRODUCTION

Figure 1: Classification of swans vs. bears. There are several features, such as background, color of the animal, and shape of the animal, each of which is sufficient for classification, but using all of them leads to a more robust model.¹

It is well known that neural networks (NNs) are vulnerable to distribution shifts as well as to adversarial examples (Szegedy et al., 2014; Hendrycks et al., 2021). A recent line of work (Geirhos et al., 2018; Shah et al., 2020; Geirhos et al., 2020) proposes that Simplicity Bias (SB), also known as shortcut learning, i.e., the tendency of NNs to learn only the simplest features over other useful but more complex features, is a key reason behind this non-robustness. The argument is roughly as follows: in the classification of swans vs. bears, as illustrated in Figure 1, there are many features, such as background, color of the animal, and shape of the animal, that can be used for classification. However, using only one or a few of them can lead to models that are not robust to specific distribution shifts, while using all the features can lead to more robust models.

Several recent works have demonstrated SB on a variety of semi-real constructed datasets (Geirhos et al., 2018; Shah et al., 2020; Chen et al., 2021), and have hypothesized SB to be the key reason for NNs' brittleness to distribution shifts (Shah et al., 2020). However, such observations are still only for specific semi-real datasets, and a general method that can identify SB on a given dataset and a given model is still missing in the literature. Such a method would be useful not only to estimate the robustness of a model but could also help in designing more robust models. A key challenge in designing such a general method to identify (and potentially fix) SB is that the notion of a feature itself is vague and lacks a rigorous definition. Existing works like Geirhos et al. (2018); Shah et al. (2020); Chen et al. (2021) avoid this challenge of vague feature definition by using carefully designed datasets (e.g., concatenation of MNIST images and CIFAR images), where certain high-level features (e.g., MNIST features and CIFAR features, shape and texture features) are already baked into the dataset definition, and arguing about their simplicity is intuitively easy.

Contributions: One of the main contributions of this work is to provide a precise definition of a particular simplicity bias, LD-SB, of 1-hidden-layer neural networks. In particular, we characterize SB as low-dimensional input dependence of the model. Concretely,

Definition 1.1 (LD-SB). A model f : R^d → R^c with inputs x ∈ R^d and outputs f(x) ∈ R^c (e.g., logits for c classes), trained on a distribution (x, y) ∼ D, satisfies LD-SB if there exists a projection matrix P ∈ R^{d×d} satisfying the following:
• rank(P) = k ≪ d,
• f(Px_1 + P⊥x_2) ≈ f(x_1) for all (x_1, y_1), (x_2, y_2) ∼ D, and
• an independent model g trained on (P⊥x, y), where (x, y) ∼ D, achieves high accuracy.

Here P⊥ is the projection matrix onto the subspace orthogonal to P. In words, LD-SB says that there exists a small k-dimensional subspace (given by the projection matrix P) in the input space R^d, which is the only thing that the model f considers in labeling any input point x. In particular, if we mix two data points x_1 and x_2 by combining the projection of x_1 onto P with the projection of x_2 onto the orthogonal subspace P⊥, the output of f on this mixed point Px_1 + P⊥x_2 is the same as that on x_1. This would have been fine if the subspace P⊥ did not contain any feature useful for classification. However, the third bullet point says that P⊥ does contain features that are useful for classification, since an independent model g trained on (P⊥x, y) achieves high accuracy.
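The mixing condition in Definition 1.1 can be checked numerically. Below is a minimal NumPy sketch (all names hypothetical, not from the paper) that builds a rank-k projection P and a toy model which, by construction, depends only on Px, and then verifies that f(Px_1 + P⊥x_2) ≈ f(x_1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 1  # input dimension and rank of the (hypothetical) projection

# Rank-k orthogonal projection P onto a random k-dimensional subspace,
# and its orthogonal complement P_perp.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = U @ U.T
P_perp = np.eye(d) - P

# A toy model that only looks at the P-component of its input,
# so it satisfies the second bullet of LD-SB exactly.
w = rng.standard_normal(d)
def f(x):
    return np.tanh((P @ x) @ w)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
mixed = P @ x1 + P_perp @ x2  # P-part of x1, orthogonal part of x2

assert np.allclose(f(mixed), f(x1), atol=1e-8)
```

For a trained network, the same test would be run over sample pairs from the data distribution rather than Gaussian vectors, with f replaced by the network's logits.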
Furthermore, theoretically, we demonstrate LD-SB of 1-hidden-layer NNs for a fairly general class of distributions called the independent features model (IFM), where the features (i.e., coordinates) are distributed independently conditioned on the label. IFM has a long history and is widely studied, especially in the context of naive-Bayes classifiers (Lewis, 1998). For IFM, we show that as long as there is even a single feature in which the data is linearly separable, NNs trained using SGD will learn models that rely almost exclusively on this linearly separable feature, even when there are an arbitrarily large number of features in which the data is separable but with a non-linear boundary. Empirically, we demonstrate LD-SB on three real-world datasets: binary and multiclass versions of Imagenette (FastAI, 2021) as well as the waterbirds-landbirds dataset (Sagawa et al., 2020a). Compared to the results in Shah et al. (2020), our results (i) theoretically show LD-SB in a fairly general setting and (ii) empirically show LD-SB on real datasets.

Finally, building upon these insights, we propose a simple ensemble method, OrthoP, that sequentially constructs NNs by projecting out principal input data directions used by previous NNs. We demonstrate that this method can lead to significantly more robust ensembles for real-world datasets in the presence of simple distribution shifts such as Gaussian noise.

Why only 1-hidden-layer networks? One might wonder why the results in this paper are restricted to 1-hidden-layer networks and why they are interesting. We present two reasons.

1. From a theoretical standpoint, prior works have thoroughly characterized the training dynamics of infinite-width 1-hidden-layer networks under different initialization schemes (Chizat et al., 2019) and have also identified the limit points of gradient descent for such networks (Chizat & Bach, 2020). Our results crucially build upon these prior works.
On the other hand, we do not have such a clear understanding of the dynamics of deeper networks.

2. From a practical standpoint, the dominant paradigm in machine learning right now is to pretrain large models on large amounts of data and then finetune them on small target datasets. Given the large and diverse pretraining data seen by these models, it has been observed that they do learn rich features (Rosenfeld et al., 2022; Nasery et al., 2022). However, finetuning on target datasets might not utilize all the features in the pretrained model. Consequently, approaches that can train robust finetuning heads (such as a 1-hidden-layer network on top) can be quite effective.
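To make the IFM setting concrete, here is a toy two-feature construction (hypothetical, for illustration only): conditioned on the label, the coordinates are drawn independently; coordinate 0 is linearly separable, while coordinate 1 is separable only by a non-linear rule (inside vs. outside the unit interval) and is symmetric about zero, so no linear threshold on it can classify.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ifm(n):
    """Toy independent-features model: conditioned on the label y ∈ {−1, +1},
    the two coordinates are drawn independently.
    Coordinate 0: linearly separable (its sign equals the label).
    Coordinate 1: separable only by the non-linear rule |x| < 1 vs. |x| > 1."""
    y = rng.choice([-1.0, 1.0], size=n)
    x0 = y * rng.uniform(0.5, 1.5, size=n)          # sign(x0) == y
    r = np.where(y > 0,
                 rng.uniform(0.0, 0.9, size=n),     # inside the unit interval
                 rng.uniform(1.1, 2.0, size=n))     # outside it
    x1 = r * rng.choice([-1.0, 1.0], size=n)        # symmetric sign: not linearly separable
    return np.stack([x0, x1], axis=1), y

X, y = sample_ifm(1000)
```

The theory says a 1-hidden-layer NN trained by SGD on such data would rely almost exclusively on coordinate 0, even though coordinate 1 alone also determines the label.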
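The sequential-ensemble idea behind OrthoP can be sketched as follows. This is not the paper's implementation: for illustration we assume a model's "used" input directions are the top right-singular vectors of its first-layer weight matrix, and the names `ortho_p`, `top_input_directions`, and `toy_train` are hypothetical.

```python
import numpy as np

def top_input_directions(W1, k):
    """Top-k right singular vectors of the first-layer weights W1: the input
    directions the model is most sensitive to (an assumption; the paper's
    actual criterion may differ)."""
    _, _, Vt = np.linalg.svd(W1, full_matrices=False)
    return Vt[:k].T  # shape (d, k)

def ortho_p(train_fn, X, y, n_models=2, k=1):
    """Train models sequentially; each sees inputs with the principal
    directions used by earlier models projected out."""
    d = X.shape[1]
    proj = np.eye(d)                # projection onto not-yet-used subspace
    models = []
    for _ in range(n_models):
        W1, predict = train_fn(X @ proj, y)        # user-supplied trainer
        models.append((proj.copy(), predict))
        V = top_input_directions(W1, k)
        proj = proj @ (np.eye(d) - V @ V.T)        # remove used directions
    return models

# Toy "trainer": a least-squares linear probe; its weights play the
# role of a 1-row first-layer weight matrix.
def toy_train(Xp, yp):
    w, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    return w.reshape(1, -1), (lambda Z, w=w: Z @ w)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X[:, 0]  # the "simple" feature is coordinate 0
models = ortho_p(toy_train, X, y, n_models=2, k=1)
```

Here the first model latches onto coordinate 0, so the second model is trained with that direction projected out and must use the remaining coordinates, which is the diversity mechanism the ensemble relies on.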



¹Image source: Wikipedia (swa, bea).




