SHAPE MATTERS: UNDERSTANDING THE IMPLICIT BIAS OF THE NOISE COVARIANCE

Anonymous authors
Paper under double-blind review

Abstract

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate that parameter-dependent noise (induced by minibatches or label perturbation) is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground truth from an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.

1. INTRODUCTION

One central mystery of deep artificial neural networks is their capability to generalize when having far more learnable parameters than training examples (Zhang et al., 2016). To add to the mystery, deep nets can also obtain reasonable performance in the absence of any explicit regularization. This has motivated recent work to study the regularization effect of the optimization (rather than of the objective function), also known as implicit bias or implicit regularization (Gunasekar et al., 2017; 2018a;b).

Among these sources of implicit regularization, the SGD noise is believed to be a vital one (LeCun et al., 2012; Keskar et al., 2016). Previous theoretical works (e.g., Li et al. (2019b)) have studied the implicit regularization effect of the scale of the noise, which is directly influenced by the learning rate and batch size. However, it has been empirically observed that the shape of the noise also has a strong (if not stronger) implicit bias. For example, prior works show that mini-batch noise or label noise (label smoothing), i.e., noise in the parameter updates arising from perturbing the training labels, is far more effective than adding spherical Gaussian noise (e.g., see Shallue et al. (2018, Section 4.6); Szegedy et al. (2016); Wen et al. (2019)). We also confirm this phenomenon in Figure 1 (left). Thus, understanding the implicit bias of the noise shape is crucial. Such an understanding may also apply to distributed training, because synthetically adding noise may help generalization when parallelism reduces the amount of mini-batch noise (Shallue et al., 2018).

In this paper, we theoretically study the effect of the shape of the noise, demonstrating that it can provably determine generalization performance at convergence. Our analysis is based on a nonlinear quadratically-parameterized model introduced by Woodworth et al. (2020) and Vaskevicius et al. (2019), which is rich enough to exhibit empirical phenomena similar to those of deep networks.
Indeed, Figure 1 (right) empirically shows that SGD with mini-batch noise or label noise can generalize from arbitrary initialization without explicit regularization, whereas GD or SGD with spherical Gaussian noise cannot. We aim to analyze the implicit bias of label noise and Gaussian noise in the quadratically-parameterized model and explain these empirical observations.

We choose to study label noise because it can replicate the regularization effects of mini-batch noise on both real and synthetic data (Figure 1), and it has been used to regularize large-batch parallel training (Shallue et al., 2018). Moreover, label noise is less sensitive to the initialization and the optimization history than mini-batch noise, which makes it more amenable to theoretical analysis. For example, in an extreme case, if we happen to reach or initialize at a solution that fits the data exactly, then mini-batch SGD will stay there forever because both the gradient and the noise vanish (Vaswani et al., 2019). In contrast, label noise never accidentally vanishes, so the analysis is more tractable. Understanding label noise may lead to understanding mini-batch noise or to replacing it with other more robust choices.

In our setting, we prove that with a proper learning rate schedule, SGD with label noise recovers a sparse ground-truth classifier and generalizes well, whereas SGD with spherical Gaussian noise generalizes poorly. Concretely, SGD with label noise biases the parameter towards the low-sparsity regime and exactly recovers the sparse ground truth, even when the initialization is arbitrarily large (Theorem 2.1). In this same regime, noise-free gradient descent quickly overfits because it trains in the NTK regime (Jacot et al., 2018; Chizat and Bach, 2018). Adding Gaussian noise is insufficient to fix this, as the resulting algorithm would end up sampling from a Gibbs distribution with an infinite partition function and fail to converge to the ground truth (Theorem 2.2).
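As a concrete illustration, the two noise types can be compared on a toy instance of a quadratically-parameterized model. The sketch below is illustrative only: the specific parameterization f_theta(x) = <theta ⊙ theta, x>, the step size, and the noise levels are our own assumptions, not the exact setup of Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (our own assumptions, not the paper's exact model):
# quadratic parameterization f_theta(x) = <theta * theta, x>,
# squared loss (f_theta(x) - y)^2 / 2, sparse nonnegative ground truth.
d, n, k = 50, 30, 3
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:k] = 1.0
y = X @ (theta_star ** 2)

def sgd(noise, steps=15000, lr=2e-3, sigma=0.3):
    theta = np.full(d, 1.0)  # deliberately large, non-vanishing initialization
    for _ in range(steps):
        i = rng.integers(n)
        resid = X[i] @ (theta ** 2) - y[i]
        if noise == "label":           # perturb the label at every step
            resid -= sigma * rng.standard_normal()
        theta -= lr * 2.0 * resid * theta * X[i]
        if noise == "gaussian":        # spherical, parameter-independent noise
            theta += np.sqrt(lr) * sigma * rng.standard_normal(d)
    return theta

for kind in ("label", "gaussian"):
    theta = sgd(kind)
    print(kind, "| distance to ground truth:",
          round(float(np.linalg.norm(theta ** 2 - theta_star ** 2)), 3))
```

In runs of this kind, label-noise SGD tends to drift towards the sparse ground truth while Gaussian-noise SGD settles on (or diffuses among) dense interpolating solutions, mirroring Figure 1 (right); the exact numbers depend on the step size and noise level.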
In summary, with a not-too-small learning rate or noise level, label noise suffices to bias the parameter towards sparse solutions without relying on a small initialization, whereas Gaussian noise cannot. Our analysis suggests that the fundamental difference between label or mini-batch noise and Gaussian noise is that the former is parameter-dependent, and therefore introduces stronger biases than the latter.

The conceptual message highlighted by our analysis is that there are two possible implicit biases induced by the noise: (1) prior work (Keskar et al., 2016) shows that by escaping sharp local minima, noisy gradient descent biases the parameter towards more robust solutions (i.e., solutions with low curvature, or "flat" minima), and (2) when the noise covariance varies across the parameter space, there is another (potentially stronger) implicit bias towards parameters where the noise covariance is smaller. Label or mini-batch noise benefits from both biases, whereas Gaussian noise is independent of the parameter, so it benefits from the first bias but not the second. For the quadratically-parameterized model, the first bias alone is not sufficient for finding solutions with good generalization, because there is a large set of overfitting global minima of the training loss with reasonable curvature. In contrast, the covariance of label noise is proportional to the scale of the parameter, inducing a much stronger bias towards low-norm solutions, which generalize well.

ADDITIONAL RELATED WORKS

Implicit regularization has been studied in a large body of work, e.g., Soudry et al. (2018); Arora et al. (2019). The implicit bias is induced by and depends on many factors, such as the learning rate and batch size (Smith et al., 2017; Goyal et al., 2017; Keskar et al., 2016; Li et al., 2019b; Hoffer et al., 2017), initialization and momentum (Sutskever et al., 2013), adaptive stepsizes (Kingma and Ba, 2014; Neyshabur et al., 2015; Wilson et al., 2017), batch normalization (Ioffe and Szegedy, 2015), and dropout (Srivastava et al., 2014).

Closely related to our work, Blanc et al. (2019) and Zhu et al. (2019) also theoretically studied implicit regularization effects that arise due to the shape, rather than the scale, of the noise. However, they only considered the local effect of the noise near some local minimum of the loss. In contrast, our analysis characterizes the behavior of SGD with label noise globally, from an arbitrary initialization.

Figure 1: The effect of noise covariance in neural network and quadratically-parameterized models. We demonstrate that label noise induces a stronger regularization effect than Gaussian noise. On both real and synthetic data, adding label noise to large-batch (or full-batch) SGD updates can recover small-batch generalization performance, whereas adding Gaussian noise with optimally-tuned variance σ² cannot. Left: Training and validation errors on CIFAR100 for VGG19. Adding Gaussian noise to large-batch updates gives little improvement (around 2%), whereas adding label noise recovers the small-batch baseline (around 15% improvement). Right: Training and validation error on a 100-dimensional quadratically-parameterized model defined in Section 2. Similar to deep models, label noise or mini-batch noise leads to better solutions than optimally-tuned spherical Gaussian noise. Moreover, Gaussian noise causes the parameter to diverge after sufficient mixing, as suggested by our negative result for Langevin dynamics (Theorem 2.2). More details are in Section A.
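The claim that the covariance of label noise scales with the parameter, while spherical Gaussian noise does not, can be checked numerically. The snippet below uses a hypothetical instantiation f_theta(x) = <theta ⊙ theta, x> of the quadratically-parameterized model (our assumption, not the paper's exact definition) and estimates the trace of the gradient-noise covariance at an interpolating point for two parameter scales.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 200
X = rng.standard_normal((n, d))

def label_noise_cov_trace(theta, sigma=1.0, trials=5000):
    # At an interpolating point (y chosen so every residual is zero),
    # the clean gradient vanishes and only the label-noise term
    # -2 * eps * theta * x remains; estimate E[||g||^2] = tr(Cov).
    y = X @ (theta ** 2)
    total = 0.0
    for _ in range(trials):
        i = rng.integers(n)
        eps = sigma * rng.standard_normal()
        g = 2.0 * ((X[i] @ (theta ** 2)) - y[i] - eps) * theta * X[i]
        total += g @ g
    return total / trials

theta = rng.standard_normal(d)
ratio = label_noise_cov_trace(2 * theta) / label_noise_cov_trace(theta)
print(ratio)  # close to 4: doubling theta quadruples the label-noise
              # covariance, whereas spherical Gaussian noise (sigma^2 * I)
              # is the same everywhere in parameter space
```

This parameter dependence is exactly the second implicit bias discussed above: near the origin the label noise is quiet, so SGD with label noise is pushed towards low-norm (sparse) solutions, a push that spherical Gaussian noise cannot provide.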

