WHITENING AND SECOND ORDER OPTIMIZATION BOTH DESTROY INFORMATION ABOUT THE DATASET, AND CAN MAKE GENERALIZATION IMPOSSIBLE

Abstract

Machine learning is predicated on the concept of generalization: a model achieving low error on a sufficiently large training set should also perform well on novel samples from the same distribution. We show that both data whitening and second order optimization can harm or entirely prevent generalization. In general, model training harnesses information contained in the sample-sample second moment matrix of a dataset. For a general class of models, namely models with a fully connected first layer, we prove that the information contained in this matrix is the only information which can be used to generalize. Models trained using whitened data, or with certain second order optimization schemes, have less access to this information; in the high dimensional regime they have no access at all, resulting in poor or nonexistent generalization ability. We experimentally verify these predictions for several architectures, and further demonstrate that generalization continues to be harmed even when theoretical requirements are relaxed. However, we also show experimentally that regularized second order optimization can provide a practical tradeoff, where training is accelerated but less information is lost, and generalization can in some circumstances even improve.

1. INTRODUCTION

Whitening is a data preprocessing step that removes correlations between input features (see Fig. 1). It is used across many scientific disciplines, including geology (Gillespie et al., 1986), physics (Jenet et al., 2005), machine learning (Le Cun et al., 1998), linguistics (Abney, 2007), and chemistry (Bro & Smilde, 2014). It has a particularly rich history in neuroscience, where it has been proposed as a mechanism by which biological vision realizes Barlow's redundancy reduction hypothesis (Attneave, 1954; Barlow, 1961; Atick & Redlich, 1992; Dan et al., 1996; Simoncelli & Olshausen, 2001). Whitening is often recommended because, by standardizing the variance in every direction in feature space, it typically speeds up the convergence of learning algorithms (Le Cun et al., 1998; Wiesler & Ney, 2011), and causes models to better capture contributions from low variance feature directions. Whitening can also encourage models to focus on more fundamental higher order statistics in the data, by removing second order statistics (Hyvärinen et al., 2009). Whitening has further been a direct inspiration for deep learning techniques such as batch normalization (Ioffe & Szegedy, 2015) and dynamical isometry (Pennington et al., 2017; Xiao et al., 2018).

1.1. WHITENING DESTROYS INFORMATION USEFUL FOR GENERALIZATION

In the high dimensional setting, for any model with a fully connected first layer, we show theoretically and experimentally that whitening the data and then training with gradient descent or stochastic gradient descent (SGD) results in a model with poor or nonexistent generalization ability, depending on how the whitening transform is computed. We emphasize that, analytically, this result applies to any model whose first layer is fully connected, and is not restricted to linear models. Empirically, the results hold in an even broader context, including in convolutional networks. Here, the high dimensional setting corresponds to a number of input features that is comparable to or larger than the number of datapoints. While this setting does not usually arise in modern neural network applications, it is of particular relevance in fields where data collection is expensive or otherwise prohibitive (Levesque et al., 2012), or where the data is intrinsically high dimensional (Stringer et al., 2019; Fusi et al., 2016; Shyr, 2012; Martínez-Ramón et al., 2006; Bruce et al., 2002), and is also the focus of increasing interest in statistics (Wainwright, 2019). Generalization fails on high dimensional whitened datasets because whitening destroys information in the dataset; in the high dimensional regime, it destroys all of the information that could be used for prediction. This is related to investigations of information loss due to PCA projection (Geiger & Kubin, 2012). Our result is not restricted to neural networks, and applies to any model in which the input is transformed by a dense matrix with isotropic weight initialization.
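The high dimensional claim can be made concrete with a small numerical sketch (our own illustration, not the paper's experiment; for simplicity it whitens with respect to the uncentered second moment rather than a centered covariance). When the number of features d is at least the number of samples n, fully whitening the data collapses the sample-sample second moment matrix to the identity, erasing every pairwise relationship between datapoints:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 100, 20   # high dimensional regime: more features than samples
X = rng.normal(size=(d, n))

# Full whitening: set all nonzero singular values of X to 1.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Xw = U @ Vt

K = Xw.T @ Xw                     # sample-sample second moment of whitened data
print(np.allclose(K, np.eye(n)))  # True: all sample-sample structure is gone
```

Any labels would then be fit against a set of inputs that are mutually orthonormal and carry no usable similarity structure, which is the mechanism behind the generalization failure described above.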

1.2. SECOND ORDER OPTIMIZATION HARMS GENERALIZATION SIMILARLY TO WHITENING

Second order optimization algorithms take advantage of information about the curvature of the loss landscape to take a more direct route to a minimum (Boyd & Vandenberghe, 2004; Bottou et al., 2018). There are many approaches to second order or quasi-second order optimization (Martens & Grosse, 2015; Dennis Jr & Moré, 1977; Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970; Liu & Nocedal, 1989; Schraudolph et al., 2007; Sunehag et al., 2009; Martens, 2010; Byrd et al., 2011; Vinyals & Povey, 2011; Lin et al., 2008; Hennig, 2013; Byrd et al., 2014; Sohl-Dickstein et al., 2014; Desjardins et al., 2015; Grosse & Martens, 2016; Martens et al., 2018; George et al., 2018; Zhang et al., 2017; Botev et al., 2017; Bollapragada et al., 2018; Berahas et al., 2019; Gupta et al., 2018; Agarwal et al., 2016; Duchi et al., 2011; Shazeer & Stern, 2018; Anil et al., 2019; Agarwal et al., 2019; Lu et al., 2018; Kingma & Ba, 2014; Zeiler, 2012; Tieleman & Hinton, 2012; Osawa et al., 2020), and there is active debate over whether second order optimization harms generalization (Wilson et al., 2017; Zhang et al., 2018; 2019; Amari et al., 2020; Vaswani et al., 2020). The measure of curvature used in these algorithms is often related to feature-feature covariance matrices of the input, and of intermediate activations (Martens & Grosse, 2015). In some situations, it is already known that second order optimization is equivalent to steepest descent training on whitened data (Sohl-Dickstein, 2012; Martens & Grosse, 2015). The similarities between whitening and second order optimization allow us to argue that pure second order optimization also prevents information about the input distribution from being leveraged during training, and can harm generalization (see Figs. 3 and 4). We do find, however, that when strongly regularized and carefully tuned, second order methods can lead to superior performance (Fig. 5).
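The connection between second order optimization and whitening can be seen in a minimal linear least squares sketch (our own illustration, not a derivation from the paper). Preconditioning the gradient with the inverse Hessian, which for this quadratic loss is exactly the feature-feature second moment F/n, neutralizes the conditioning of the inputs and lands on the exact solution in a single step, just as plain gradient descent would behave on whitened inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 500
# Inputs with wildly different scales per feature (ill-conditioned F).
X = rng.normal(size=(d, n)) * np.linspace(0.1, 10.0, d)[:, None]
w_true = rng.normal(size=d)
y = w_true @ X

# One Newton step on L(w) = ||w X - y||^2 / (2n), starting from w = 0.
# Gradient: (w X - y) X^T / n.  Hessian: X X^T / n = F / n.
F = X @ X.T
w = np.zeros(d)
grad = (w @ X - y) @ X.T / n
w_newton = w - np.linalg.solve(F / n, grad)

print(np.allclose(w_newton, w_true))  # True: exact solution in one step
```

The step ignores the anisotropy of the input distribution entirely, which is precisely the property that makes pure second order updates blind to the second moment structure discussed above.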
2. THEORY OF WHITENING, SECOND ORDER OPTIMIZATION, AND GENERALIZATION

Consider a dataset X ∈ R^{d×n} consisting of n independent d-dimensional examples. We write F for the feature-feature second moment matrix and K for the sample-sample second moment matrix:

F = XX^T ∈ R^{d×d},    K = X^T X ∈ R^{n×n}.    (1)
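As a quick numerical check of these definitions (a hypothetical snippet, not from the paper): F and K are both Gram matrices built from X, so they share the same nonzero eigenvalues, namely the squared singular values of X.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 6, 4
X = rng.normal(size=(d, n))

F = X @ X.T   # feature-feature second moment, shape (d, d)
K = X.T @ X   # sample-sample second moment, shape (n, n)

# Nonzero spectra coincide: top n eigenvalues of F match those of K.
ev_F = np.sort(np.linalg.eigvalsh(F))[-n:]
ev_K = np.sort(np.linalg.eigvalsh(K))
print(np.allclose(ev_F, ev_K))  # True
```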



Figure 1: Whitening removes correlations between feature dimensions in a dataset. Whitening is a linear transformation of a dataset that sets all non-zero eigenvalues of the covariance matrix to 1. ZCA whitening is a specific choice of the linear transformation that rescales the data in the directions given by the eigenvectors of the covariance matrix, but without additional rotations or flips. (a) A toy 2d dataset before and after ZCA whitening. Red arrows indicate the eigenvectors of the covariance matrix of the unwhitened data. (b) ZCA whitening of CIFAR10 images preserves spatial and chromatic structure, while equalizing the variance across all feature directions.
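The ZCA transform described in the caption can be sketched in a few lines (an illustrative implementation, not the paper's code; the small `eps` added to the eigenvalues is our own numerical-stability choice). ZCA whitens with the inverse square root of the covariance expressed in its own eigenbasis, so the data is rescaled along the covariance eigendirections without any additional rotation or flip:

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """ZCA-whiten a (d, n) data matrix."""
    Xc = X - X.mean(axis=1, keepdims=True)       # center each feature
    cov = Xc @ Xc.T / Xc.shape[1]                # feature-feature covariance
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T  # cov^{-1/2}
    return W @ Xc

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 1000)) * np.array([[3.0], [0.5]])  # anisotropic 2d data
Xw = zca_whiten(X)
print(np.round(Xw @ Xw.T / Xw.shape[1], 2))  # covariance of whitened data ≈ I
```

Among all whitening transforms, ZCA is the one whose output stays closest to the original data in a least squares sense, which is why whitened images in panel (b) remain recognizable.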

