SHAPE MATTERS: UNDERSTANDING THE IMPLICIT BIAS OF THE NOISE COVARIANCE

Anonymous authors
Paper under double-blind review

Abstract

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate that parameter-dependent noise, induced by mini-batches or label perturbation, is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. (2019) and Woodworth et al. (2020). We show that in an overparameterized setting, SGD with label noise recovers the sparse ground truth from an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.

1. INTRODUCTION

One central mystery of deep neural networks is their ability to generalize while having far more learnable parameters than training examples (Zhang et al., 2016). Adding to the mystery, deep nets can obtain reasonable performance even in the absence of any explicit regularization. This has motivated recent work to study the regularization effect of the optimization procedure (rather than the objective function), also known as implicit bias or implicit regularization (Gunasekar et al., 2017; 2018a;b; Soudry et al., 2018; Arora et al., 2019). The implicit bias is induced by, and depends on, many factors, such as the learning rate and batch size (Smith et al., 2017; Goyal et al., 2017; Keskar et al., 2016; Li et al., 2019b; Hoffer et al., 2017), initialization and momentum (Sutskever et al., 2013), adaptive step sizes (Kingma and Ba, 2014; Neyshabur et al., 2015; Wilson et al., 2017), batch normalization (Ioffe and Szegedy, 2015), and dropout (Srivastava et al., 2014).

Among these sources of implicit regularization, the SGD noise is believed to be a vital one (LeCun et al., 2012; Keskar et al., 2016). Previous theoretical works (e.g., Li et al. (2019b)) have studied the implicit regularization effect of the scale of the noise, which is directly influenced by the learning rate and batch size. However, it has been empirically observed that the shape of the noise also has a comparably strong (if not stronger) implicit bias. For example, prior works show that mini-batch noise or label noise (label smoothing), i.e., noise in the parameter updates arising from perturbing the training labels, is far more effective than adding spherical Gaussian noise (e.g., see (Shallue et al., 2018, Section 4.6) and Szegedy et al. (2016); Wen et al. (2019)). We also confirm this phenomenon in Figure 1 (left). Understanding the implicit bias of the noise shape is therefore crucial. Such an understanding may also benefit distributed training: when parallelism reduces the amount of mini-batch noise, synthetically adding noise may help generalization (Shallue et al., 2018).

In this paper, we theoretically study the effect of the shape of the noise, demonstrating that it can provably determine generalization performance at convergence. Our analysis is based on a nonlinear quadratically-parameterized model introduced by Woodworth et al. (2020) and Vaskevicius et al. (2019), which is rich enough to exhibit empirical phenomena similar to those of deep networks. Indeed, Figure 1 (right) empirically shows that SGD with mini-batch noise or label noise generalizes from an arbitrary initialization without explicit regularization, whereas GD or SGD with spherical Gaussian noise does not. We aim to analyze the implicit bias of label noise and Gaussian noise in the quadratically-parameterized model and explain these empirical observations.
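The contrast between label noise and spherical Gaussian noise is easy to simulate in the quadratically-parameterized setting. The following minimal sketch (all variable names, dimensions, and hyperparameters are our own illustrative choices, not the paper's experimental setup) trains the model beta = u*u - v*v on a sparse regression instance, perturbing either the label (label noise) or the gradient itself (spherical Gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy problem: sparse ground truth, more parameters than samples.
d, n, k = 20, 10, 1                            # features, samples, sparsity
beta_star = np.zeros(d)
beta_star[:k] = 1.0                            # sparse ground truth
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ beta_star

def sgd(noise, steps=5000, lr=0.01, sigma=0.1, init=0.5):
    """Single-sample SGD on 0.5 * residual^2 for the model <u*u - v*v, x>."""
    u = np.full(d, init)                       # arbitrary (non-vanishing) init
    v = np.full(d, init)
    for _ in range(steps):
        i = rng.integers(n)
        x, target = X[i], y[i]
        if noise == "label":
            target = target + sigma * rng.normal()   # perturb the label only
        r = (u * u - v * v) @ x - target             # residual
        gu = 2.0 * r * u * x                         # d(0.5*r^2)/du
        gv = -2.0 * r * v * x                        # d(0.5*r^2)/dv
        if noise == "gaussian":                      # spherical noise in the update
            gu = gu + sigma * rng.normal(size=d)
            gv = gv + sigma * rng.normal(size=d)
        u -= lr * gu
        v -= lr * gv
    return u * u - v * v                             # recovered beta

beta_label = sgd("label")
beta_gauss = sgd("gaussian")
```

The paper's claim is that the label-noise iterates are biased towards the sparse ground truth, while the Gaussian-noise iterates diffuse along the manifold of interpolating solutions and stay dense; the sketch above only sets up the two dynamics so they can be inspected side by side.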

