IMPLICIT REGULARIZATION EFFECTS OF UNBIASED RANDOM LABEL NOISES WITH SGD

Abstract

Random label noises (or observational noises) widely exist in practical machine learning settings. We analyze the learning dynamics of stochastic gradient descent (SGD) over the quadratic loss with unbiased label noises, and investigate a new noise term in the dynamics, which is dynamized and influenced by mini-batch sampling and random label noises, as an implicit regularizer. Our theoretical analysis finds that such an implicit regularizer favors convergence points that stabilize model outputs against perturbations of parameters. To validate our analysis, we use our theorems to characterize the implicit regularizer of SGD with unbiased random label noises for linear regression via Ordinary Least Squares (OLS), where numerical simulations back up our theoretical findings. We further extend our proposal to interpret the recently popularized noisy self-distillation tricks for deep learning, where the implicit regularizer demonstrates a unique capacity for selecting models with improved output stability through learning from well-trained teachers with additive unbiased random label noises.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) has been widely used as an effective way to train deep neural networks with large datasets (Bottou, 1991). While the mini-batch sampling strategy was first proposed to lower the cost of computation per iteration, it has been considered to incorporate an implicit regularizer that prevents the learning process from converging to local minima with poor generalization performance (Zhang et al., 2017; Zhu et al., 2019; Jastrzebski et al., 2017; Hoffer et al., 2017; Keskar et al., 2017). To interpret such implicit regularization, one can model SGD as gradient descent (GD) with gradient noises caused by mini-batch sampling (Bottou et al., 2018). Studies have demonstrated the potential of such implicit regularization or gradient noises to improve the generalization performance of learning, from both theoretical (Mandt et al., 2017; Chaudhari & Soatto, 2018; Hu et al., 2019b; Simsekli et al., 2019) and empirical aspects (Zhu et al., 2019; Hoffer et al., 2017; Keskar et al., 2017). In summary, gradient noises keep SGD away from converging to sharp local minima that generalize poorly (Zhu et al., 2019; Hu et al., 2019b; Simsekli et al., 2019) and would select a flat minimum (Hochreiter & Schmidhuber, 1997) as the outcome of learning.

In this work, we aim at investigating the influence of random label noises on the implicit regularization under mini-batch sampling of SGD. To simplify our research, we assume the training dataset is a set of vectors D = \{x_1, x_2, x_3, \ldots, x_N\}. The label \tilde{y}_i for every vector x_i \in D is the noisy response of the true neural network f^*(x) such that

    \tilde{y}_i = y_i + \varepsilon_i, \quad y_i = f^*(x_i), \quad \mathbb{E}[\varepsilon_i] = 0, \quad \mathrm{Var}[\varepsilon_i] = \sigma^2,

where the label noise \varepsilon_i is assumed to be an independent zero-mean random variable.
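As a concrete illustration of this noise model, the following sketch generates unbiased noisy labels. The setup is hypothetical (it assumes a linear true model f^*(x) = \langle x, \beta^* \rangle and Gaussian noise for concreteness; the analysis itself only requires the noise to have zero mean and variance \sigma^2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a linear model f*(x) = <x, beta*>.
N, d = 1000, 5
beta_star = rng.standard_normal(d)
X = rng.standard_normal((N, d))            # dataset D = {x_1, ..., x_N}
y_true = X @ beta_star                     # y_i = f*(x_i)

# Unbiased label noise: E[eps_i] = 0, Var[eps_i] = sigma^2.
sigma2 = 0.25
eps = rng.normal(0.0, np.sqrt(sigma2), N)
y_noisy = y_true + eps                     # observed labels: y~_i = y_i + eps_i

print(eps.mean(), eps.var())               # empirically close to 0 and sigma^2
```

Gaussian noise is chosen here only for convenience; any zero-mean distribution with finite variance (e.g. symmetric label flips rescaled to zero mean) fits the same assumptions.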
In our work, the random label noises can be either (1) drawn from probability distributions before the training steps (but dynamized by the mini-batch sampling of SGD) or (2) realized by fresh random variables per training iteration (Han et al., 2018). Learning thus aims to approximate f(x, \theta) that fits f^*(x), such that

    \theta \leftarrow \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N} \tilde{L}_i(\theta) := \frac{1}{N}\sum_{i=1}^{N} \big(f(x_i, \theta) - \tilde{y}_i\big)^2.

Inspired by (Hochreiter & Schmidhuber, 1997; Zhu et al., 2019), we study how unbiased label noises reshape this implicit regularization.

Contributions. Our analysis shows that, under mild conditions, with gradients of label-noisy losses, SGD incorporates an additional data-dependent noise term, complementing the stochastic gradient noises (Li et al., 2017; Wu et al., 2020) of label-noiseless losses, through resampling the samples with label noises (Li et al., 2018) or dynamically adding noises to labels over iterations (Han et al., 2018). We consider such noises as an implicit regularization caused by unbiased label noises, and interpret the effects of such noises as a solution selector of the learning procedure. More specifically, this work makes the following unique contributions.

(1) Implicit Regularizer. We reviewed the preliminaries (Li et al., 2017; Ali et al., 2019; Hu et al., 2019b; Wu et al., 2020) and extended the analytical framework of (Li et al., 2017) to interpret the effects of unbiased label noises as an additional implicit regularizer on top of the continuous-time dynamics of SGD. Through discretizing the continuous-time dynamics of label-noisy SGD, we write a discrete-time approximation to the learning dynamics, denoted as \theta^{ULN}_k for k = 1, 2, \ldots, as

    \theta^{ULN}_{k+1} \leftarrow \theta^{ULN}_{k} - \frac{\eta}{N}\sum_{i=1}^{N} \nabla L^*_i(\theta^{ULN}_{k}) + \xi^*_k(\theta^{ULN}_{k}) + \xi^{ULN}_k(\theta^{ULN}_{k}),

where L^*_i(\theta) = (f(x_i, \theta) - f^*(x_i))^2 refers to the label-noiseless loss function with sample x_i and the true (noiseless) label y_i, and the noise term \xi^*_k(\theta) refers to the stochastic gradient noise (Li et al., 2017) of the label-noiseless loss function L^*_i(\theta). We then obtain the new implicit regularizer caused by the unbiased label noises (ULN) for all \theta \in \mathbb{R}^d, which can be approximated as follows:

    \xi^{ULN}_k(\theta) \approx \sqrt{\frac{\eta}{B}} \left( \frac{\sigma^2}{N}\sum_{i=1}^{N} \nabla_\theta f(x_i, \theta)\, \nabla_\theta f(x_i, \theta)^\top \right)^{1/2} z_k, \quad z_k \sim \mathcal{N}(0_d, I_d),

where z_k refers to a random noise vector drawn from the standard Gaussian distribution, \theta_k refers to the parameters of the network in the k-th iteration, (\cdot)^{1/2} refers to the Cholesky decomposition of the matrix, \nabla_\theta f(x_i, \theta) = \partial f(x_i, \theta)/\partial\theta refers to the gradient of the neural network output for sample x_i over the parameters, and \eta and B are the learning rate and the batch size of SGD respectively. Obviously, the strength of such an implicit regularizer is controlled by \sigma^2, B and \eta.

(2) Effects on Linear Regression. To understand the behaviors of the implicit regularizer \xi^{ULN}_k(\theta_k) in the learning dynamics, we studied SGD over Ordinary Least Squares (OLS). With the proposed model, we can easily obtain the implicit regularizer such that

    \xi^{ULN}_k(\beta) \approx \sqrt{\frac{\eta}{B}} \left( \sigma^2 \hat{\Sigma}_N \right)^{1/2} z_k, \quad z_k \sim \mathcal{N}(0_d, I_d),

where \hat{\Sigma}_N = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top refers to the sample covariance matrix of the training dataset. Our theoretical elaboration suggests that SGD with unbiased random label noises would converge to a Gaussian-like distribution centered at the optimal solution of OLS, where the span and shape of the distribution are controlled by \sigma^2 and \hat{\Sigma}_N when \eta and B are constant.
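The OLS form of the regularizer can be sketched numerically. The snippet below is an illustrative setup (the data, \eta, B and \sigma^2 values are placeholders, not the paper's experiments): for OLS, \nabla_\theta f(x_i, \theta) = x_i, so the covariance of the ULN noise reduces to (\eta/B)\,\sigma^2\,\hat{\Sigma}_N, and a draw of \xi^{ULN}_k is its Cholesky factor applied to a standard Gaussian vector:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 4
X = rng.standard_normal((N, d))            # training inputs x_1, ..., x_N

eta, B, sigma2 = 0.1, 32, 0.25             # learning rate, batch size, noise variance

# For OLS, grad_theta f(x_i, theta) = x_i, so the noise covariance is
# (eta / B) * sigma^2 * Sigma_N, with Sigma_N the sample covariance matrix.
Sigma_N = (X.T @ X) / N
C = (eta / B) * sigma2 * Sigma_N

L = np.linalg.cholesky(C)                  # the (.)^{1/2} factor in the theorem
z = rng.standard_normal(d)                 # z_k ~ N(0_d, I_d)
xi_uln = L @ z                             # one draw of the implicit regularizer noise

# Sanity check: the factor reproduces the covariance, so E||xi||^2 = trace(C).
print(np.trace(C))
```

Since E[\xi \xi^\top] = L L^\top = C, the span and shape of this noise (and hence of the converging distribution) are governed exactly by \sigma^2 and \hat{\Sigma}_N, as stated above.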
We conducted extensive experiments using SGD with various \sigma^2 and \hat{\Sigma}_N, and successfully obtained results that coincide with our theories and directly visualize the effects of the implicit regularizer on the path of learning and the converging distribution of SGD for noisy linear regression.

(3) Inference Stabilizer. The regularization effect of unbiased random label noises should be

    \mathbb{E}_{z_k} \left\| \xi^{ULN}_k(\theta_k) \right\|_2^2 \approx \frac{\eta \sigma^2}{B N} \sum_{i=1}^{N} \left\| \nabla_\theta f(x_i, \theta_k) \right\|_2^2,

where \nabla_\theta f(x, \theta) refers to the gradient of f over \theta, and the effect is controlled by the batch size B and the variance of label noises \sigma^2. We extend the above results to understand the recently popularized noisy self-distillation (Zhang et al., 2019a; Kim et al., 2020) paradigms, where a well-trained model is supposed to be further improved through learning from its own noisy outputs. Our analysis shows that, when the new convergence is achieved, the noisy self-distillation strategy would prefer to re-select a model with a lower neural network gradient norm \frac{1}{N}\sum_{i=1}^{N} \| \nabla_\theta f(x_i, \theta) \|_2^2, where the gradient norm characterizes the variation/instability of neural network inference results (over perturbations) around the parameters of the model. We carried out extensive experiments whose results back up our theories. Note that while earlier work (Bishop, 1995) found training with input noises can also bring regularization effects, our work focuses on the observational noises on labels.
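The closed-form regularization strength can be checked against a Monte-Carlo estimate. The sketch below (a toy OLS instance with made-up \eta, B, \sigma^2; not the paper's experimental setup) draws many samples of \xi^{ULN}_k and compares the empirical mean squared norm with \frac{\eta \sigma^2}{B N} \sum_i \|x_i\|_2^2:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 400, 3
X = rng.standard_normal((N, d))
eta, B, sigma2 = 0.05, 16, 0.5

# Strength predicted by the theory (OLS case, grad_theta f(x_i) = x_i):
#   E||xi_k||^2 = (eta * sigma2) / (B * N) * sum_i ||x_i||^2
predicted = eta * sigma2 / (B * N) * np.sum(X ** 2)

# Monte-Carlo estimate from draws xi = sqrt(eta/B) * (sigma2 * Sigma_N)^{1/2} z.
Sigma_N = (X.T @ X) / N
L = np.linalg.cholesky((eta / B) * sigma2 * Sigma_N)
samples = (L @ rng.standard_normal((d, 20000))).T   # 20000 draws of xi
estimated = np.mean(np.sum(samples ** 2, axis=1))

print(predicted, estimated)                          # the two should nearly agree
```

The agreement follows from E\|\xi\|^2 = \mathrm{tr}(L L^\top): the trace of the noise covariance is exactly \frac{\eta \sigma^2}{B N} \sum_i \|x_i\|_2^2, which is why a smaller aggregate gradient norm directly weakens the noise injected at convergence.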

