WHITENING AND SECOND ORDER OPTIMIZATION BOTH DESTROY INFORMATION ABOUT THE DATASET, AND CAN MAKE GENERALIZATION IMPOSSIBLE

Abstract

Machine learning is predicated on the concept of generalization: a model achieving low error on a sufficiently large training set should also perform well on novel samples from the same distribution. We show that both data whitening and second order optimization can harm or entirely prevent generalization. In general, model training harnesses information contained in the sample-sample second moment matrix of a dataset. For a general class of models, namely models with a fully connected first layer, we prove that the information contained in this matrix is the only information which can be used to generalize. Models trained using whitened data, or with certain second order optimization schemes, have less access to this information; in the high dimensional regime they have no access at all, resulting in poor or nonexistent generalization ability. We experimentally verify these predictions for several architectures, and further demonstrate that generalization continues to be harmed even when theoretical requirements are relaxed. However, we also show experimentally that regularized second order optimization can provide a practical tradeoff, where training is accelerated but less information is lost, and generalization can in some circumstances even improve.

1. INTRODUCTION

Whitening is a data preprocessing step that removes correlations between input features (see Fig. 1). It is used across many scientific disciplines, including geology (Gillespie et al., 1986), physics (Jenet et al., 2005), machine learning (Le Cun et al., 1998), linguistics (Abney, 2007), and chemistry (Bro & Smilde, 2014). It has a particularly rich history in neuroscience, where it has been proposed as a mechanism by which biological vision realizes Barlow's redundancy reduction hypothesis (Attneave, 1954; Barlow, 1961; Atick & Redlich, 1992; Dan et al., 1996; Simoncelli & Olshausen, 2001). Whitening is often recommended since, by standardizing the variances in each direction in feature space, it typically speeds up the convergence of learning algorithms (Le Cun et al., 1998; Wiesler & Ney, 2011), and causes models to better capture contributions from low variance feature directions. Whitening can also encourage models to focus on more fundamental higher order statistics in data, by removing second order statistics (Hyvärinen et al., 2009). Whitening has further been a direct inspiration for deep learning techniques such as batch normalization (Ioffe & Szegedy, 2015) and dynamical isometry (Pennington et al., 2017; Xiao et al., 2018).
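As a concrete illustration (ours, not drawn from the paper), the standard ZCA form of whitening multiplies centered data by the inverse matrix square root of the feature covariance, so that the whitened features have identity covariance. A minimal numpy sketch, with a small regularizer `eps` assumed for numerical stability:

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """ZCA-whiten a data matrix X with n samples as rows, d features as columns."""
    Xc = X - X.mean(axis=0)          # center each feature
    cov = Xc.T @ Xc / X.shape[0]     # d x d feature covariance
    evals, evecs = np.linalg.eigh(cov)
    # W = cov^{-1/2}; eps guards against vanishing eigenvalues
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xw = zca_whiten(X)
C = Xw.T @ Xw / Xw.shape[0]  # approximately the 5 x 5 identity after whitening
```

After the transform, every direction in feature space carries unit variance, which is what standardizes the convergence behavior of gradient-based learning described above.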

1.1. WHITENING DESTROYS INFORMATION USEFUL FOR GENERALIZATION

In the high dimensional setting, for any model with a fully connected first layer, we show theoretically and experimentally that whitening the data and then training with gradient descent or stochastic gradient descent (SGD) results in a model with poor or nonexistent generalization ability, depending on how the whitening transform is computed. We emphasize that, analytically, this result applies to any model whose first layer is fully connected, and is not restricted to linear models. Empirically, the results hold in an even larger context, including in convolutional networks. Here, the high dimensional setting corresponds to a number of input features which is comparable to or larger than the number of datapoints. While this setting does not usually arise in modern neural network applications, it is of particular relevance in fields where data collection is expensive or otherwise prohibitive (Levesque
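The high dimensional claim can be made concrete with a small numerical sketch (ours, for illustration): when the number of features d exceeds the number of samples n, whitening the centered data forces the sample-sample second moment (Gram) matrix to a fixed, data-independent value, so no information about the dataset survives in it. Here we whiten with the pseudo-inverse square root of the feature covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                      # high dimensional regime: d > n
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)             # center each feature

# Whiten with the pseudo-inverse square root of the feature covariance.
cov = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(cov)
keep = evals > 1e-10                # drop exact null directions
W = evecs[:, keep] @ np.diag(evals[keep] ** -0.5) @ evecs[:, keep].T
Xw = Xc @ W

# The sample-sample second moment of the whitened data is a constant:
# the centering projection I - (1/n) 11^T, regardless of what X was.
gram = Xw @ Xw.T / n
```

Since `gram` no longer depends on the data, any model whose training is driven only by this matrix, such as the fully connected first layers analyzed here, has nothing left to generalize from.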

