IMPLICIT REGULARIZATION EFFECTS OF UNBIASED RANDOM LABEL NOISES WITH SGD

Abstract

Random label noises (or observational noises) widely exist in practical machine learning settings. We analyze the learning dynamics of stochastic gradient descent (SGD) over the quadratic loss with unbiased label noises, and investigate a new noise term in the dynamics, which is dynamized and influenced by mini-batch sampling and random label noises, as an implicit regularizer. Our theoretical analysis finds that such an implicit regularizer favors convergence points that stabilize model outputs against perturbations of parameters. To validate our analysis, we use our theorems to analyze the implicit regularizer of SGD with unbiased random label noises for linear regression via Ordinary Least Squares (OLS), where numerical simulations back up our theoretical findings. We further extend our proposals to interpret newly-fashioned noisy self-distillation tricks for deep learning, where the implicit regularizer demonstrates a unique capacity for selecting models with improved output stability through learning from well-trained teachers with additive unbiased random label noises.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) has been widely used as an effective way to train deep neural networks with large datasets (Bottou, 1991). While the mini-batch sampling strategy was first proposed to lower the cost of computation per iteration, it has also been considered to incorporate an implicit regularizer that prevents the learning process from converging to local minima with poor generalization performance (Zhang et al., 2017; Zhu et al., 2019; Jastrzebski et al., 2017; Hoffer et al., 2017; Keskar et al., 2017). To interpret such implicit regularization, one can model SGD as gradient descent (GD) with gradient noises caused by mini-batch sampling (Bottou et al., 2018). Studies have demonstrated the potential of such implicit regularization or gradient noises to improve the generalization performance of learning from both theoretical (Mandt et al., 2017; Chaudhari & Soatto, 2018; Hu et al., 2019b; Simsekli et al., 2019) and empirical aspects (Zhu et al., 2019; Hoffer et al., 2017; Keskar et al., 2017). In summary, gradient noises keep SGD away from converging to sharp local minima that generalize poorly (Zhu et al., 2019; Hu et al., 2019b; Simsekli et al., 2019) and select a flat minimum (Hochreiter & Schmidhuber, 1997) as the outcome of learning.

In this work, we aim at investigating the influence of random label noises on the implicit regularization under mini-batch sampling of SGD. To simplify our research, we assume the training dataset is a set of vectors $D = \{x_1, x_2, x_3, \ldots, x_N\}$. The label $\tilde{y}_i$ for every vector $x_i \in D$ is the noisy response of the true model $f^*(x)$ such that

$$\tilde{y}_i = y_i + \varepsilon_i,\quad y_i = f^*(x_i),\quad \mathbb{E}[\varepsilon_i] = 0,\quad \mathrm{var}[\varepsilon_i] = \sigma^2, \qquad (1)$$

where the label noise $\varepsilon_i$ is assumed to be an independent zero-mean random variable.
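As a concrete illustration, the label-noise setup above can be simulated in a few lines; this is a minimal sketch where the linear "true" model, the dimensions, and $\sigma^2 = 0.5$ are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 2
sigma2 = 0.5  # label-noise variance (illustrative value)

X = rng.normal(size=(N, d))                  # training inputs x_1, ..., x_N
beta_true = np.array([1.0, 1.0])             # stand-in for the true model f*
y_clean = X @ beta_true                      # noiseless labels y_i = f*(x_i)
eps = rng.normal(0.0, np.sqrt(sigma2), N)    # unbiased noises, E[eps]=0, var=sigma2
y_noisy = y_clean + eps                      # observed labels ỹ_i = y_i + eps_i

# The noises are unbiased: empirical mean near 0, empirical variance near sigma2.
print(abs(eps.mean()) < 0.2, 0.3 < eps.var() < 0.8)
```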
In our work, the random label noises can either be (1) drawn from probability distributions before the training steps (but dynamized by the mini-batch sampling of SGD) or (2) realized per training iteration (Han et al., 2018). Thus, the goal of learning is to approximate $f^*(x)$ with $f(x, \theta)$, such that

$$\hat{\theta} \leftarrow \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d}\ \frac{1}{N}\sum_{i=1}^N \tilde{L}_i(\theta) := \frac{1}{N}\sum_{i=1}^N \left(f(x_i, \theta) - \tilde{y}_i\right)^2. \qquad (2)$$

Inspired by (Hochreiter & Schmidhuber, 1997; Zhu et al., 2019), our work studies how the unbiased label noises $\varepsilon_i$ ($1 \le i \le N$) affect the "selection" of $\hat{\theta}$ from the possible solutions, from the viewpoint of the learning dynamics (Saxe et al., 2014) of SGD under mini-batch sampling (Li et al., 2017; Wu et al., 2020; Hu et al., 2019b).

Contributions. Our analysis shows that, under mild conditions, with gradients of label-noisy losses, SGD incorporates an additional data-dependent noise term, complementing the stochastic gradient noises (Li et al., 2017; Wu et al., 2020) of label-noiseless losses, through resampling the samples with label noises (Li et al., 2018) or dynamically adding noises to labels over iterations (Han et al., 2018). We consider such noises as an implicit regularization caused by unbiased label noises, and interpret their effects as a solution selector of the learning procedure. More specifically, this work makes the following contributions.

(1) Implicit Regularizer. We review the preliminaries (Li et al., 2017; Ali et al., 2019; Hu et al., 2019b; Wu et al., 2020) and extend the analytical framework in (Li et al., 2017) to interpret the effects of unbiased label noises as an additional implicit regularizer on top of the continuous-time dynamics of SGD. Through discretizing the continuous-time dynamics of label-noisy SGD, we write the discrete-time approximation of the learning dynamics, denoted as $\theta^{ULN}_k$ for $k = 1, 2, \ldots$, as

$$\theta^{ULN}_{k+1} \leftarrow \theta^{ULN}_k - \frac{\eta}{N}\sum_{i=1}^N \nabla L^*_i(\theta^{ULN}_k) + \xi^*_k(\theta^{ULN}_k) + \xi^{ULN}_k(\theta^{ULN}_k),$$

where $L^*_i(\theta) = (f(x_i, \theta) - f^*(x_i))^2$ refers to the label-noiseless loss function with sample $x_i$ and the true (noiseless) label $y_i$, and the noise term $\xi^*_k(\theta)$ refers to the stochastic gradient noise (Li et al., 2017) of the label-noiseless loss functions $L^*_i(\theta)$. We then obtain the new implicit regularizer caused by the unbiased label noises (ULN), which for $\forall\theta \in \mathbb{R}^d$ can be approximated as follows:

$$\xi^{ULN}_k(\theta) \approx \sqrt{\frac{\eta}{B}}\left(\frac{\sigma^2}{N}\sum_{i=1}^N \nabla_\theta f(x_i, \theta)\,\nabla_\theta f(x_i, \theta)^\top\right)^{1/2} z_k,\quad z_k \sim \mathcal{N}(0_d, I_d),$$

where $z_k$ refers to a random noise vector drawn from the standard Gaussian distribution, $\theta_k$ refers to the parameters of the network in the $k^{th}$ iteration, $(\cdot)^{1/2}$ refers to the Cholesky decomposition of the matrix, $\nabla_\theta f(x_i, \theta) = \partial f(x_i, \theta)/\partial\theta$ refers to the gradient of the neural network output for sample $x_i$ over the parameters, and $\eta$ and $B$ are the learning rate and the batch size of SGD, respectively. Obviously, the strength of such an implicit regularizer is controlled by $\sigma^2$, $B$, and $\eta$.

(2) Effects on Linear Regression. To understand the behaviors of the implicit regularizer $\xi^{ULN}_k(\theta_k)$ in the learning dynamics, we study SGD over Ordinary Least Squares (OLS). With the proposed model, we can easily obtain the implicit regularizer such that $\xi^{ULN}_k(\beta) \approx \sqrt{\eta\sigma^2/B}\,\hat{\Sigma}_N^{1/2} z_k$ with $z_k \sim \mathcal{N}(0_d, I_d)$, where $\hat{\Sigma}_N = \frac{1}{N}\sum_{i=1}^N x_i x_i^\top$ refers to the sample covariance matrix of the training dataset. Our theoretical elaboration suggests that SGD with unbiased random label noises converges to a Gaussian-like distribution centered at the optimal solution of OLS, where the span and shape of the distribution are controlled by $\sigma^2$ and $\hat{\Sigma}_N$ when $\eta$ and $B$ are constant.
We conducted extensive experiments using SGD with various $\sigma^2$ and $\hat{\Sigma}_N$, and obtained results that coincide with our theories and directly visualize the effects of the implicit regularizer over the learning path and the converging distribution of SGD for noisy linear regression.

(3) Inference Stabilizer. The regularization effect of unbiased random label noises can be measured as

$$\mathbb{E}_{z_k}\left[\left\|\xi^{ULN}_k(\theta_k)\right\|_2^2\right] \approx \frac{\eta\sigma^2}{BN}\sum_{i=1}^N \left\|\nabla_\theta f(x_i, \theta_k)\right\|_2^2,$$

where $\nabla_\theta f(x, \theta)$ refers to the gradient of $f$ over $\theta$, and the effect is controlled by the batch size $B$ and the variance of label noises $\sigma^2$. We extend the above results to understand newly-fashioned noisy self-distillation (Zhang et al., 2019a; Kim et al., 2020) paradigms, where a well-trained model is supposed to be further improved through learning from its noisy outputs. Our analysis shows that, when the new convergence is achieved, the noisy self-distillation strategy prefers to re-select a model with a lower neural network gradient norm $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta)\|_2^2$.
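To make the regularizer concrete, the Gaussian approximation above can be sampled directly: build $\Sigma^{ULN}_N$ from per-sample gradients, take its Cholesky factor, and scale a standard normal draw. The sketch below uses random matrices as stand-ins for the per-sample network gradients $\nabla_\theta f(x_i, \theta)$; all sizes are illustrative assumptions.

```python
import numpy as np

def sample_uln_noise(grads, eta, B, sigma2, rng):
    """One draw of xi_k^ULN ≈ sqrt(eta/B) * ((sigma2/N) * sum_i g_i g_i^T)^{1/2} z_k,
    where grads[i] stands in for the network gradient ∇_θ f(x_i, θ)."""
    N, d = grads.shape
    cov = (sigma2 / N) * grads.T @ grads             # Σ^ULN_N(θ)
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(d))  # jitter guards rank deficiency
    return np.sqrt(eta / B) * (L @ rng.standard_normal(d))

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 3))   # stand-in per-sample gradients, N=100, d=3
draws = np.array([sample_uln_noise(grads, 0.01, 5, 1.0, rng) for _ in range(30000)])

# Expected strength: E||xi||^2 = (eta * sigma2 / (B * N)) * sum_i ||g_i||^2.
expected = (0.01 * 1.0 / (5 * 100)) * (grads ** 2).sum()
print(np.isclose((draws ** 2).sum(axis=1).mean(), expected, rtol=0.1))
```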

2. PRELIMINARIES AND RELATED WORK

SGD Dynamics and Implicit Regularization. We follow the settings in (Li et al., 2017) and consider SGD as an algorithm that, in the $k^{th}$ iteration with the estimate $\theta_k$, randomly picks a size-$B$ subset of samples from the training dataset, i.e., $B_k \subset D$, estimates the mini-batch stochastic gradient $\frac{1}{B}\sum_{x_i \in B_k} \nabla L_i(\theta_k)$, and then updates the estimate $\theta_{k+1}$ based on $\theta_k$, as follows:

$$\theta_{k+1} \leftarrow \theta_k - \frac{\eta}{B}\sum_{x_i \in B_k}\nabla L_i(\theta_k) = \theta_k - \frac{\eta}{N}\sum_{x_i \in D}\nabla L_i(\theta_k) + \sqrt{\eta}\,V_k(\theta_k),$$

where $\eta$ refers to the step size of SGD, and $V_k(\theta_k)$ refers to a stochastic gradient noise term caused by mini-batch sampling. The noise converges to zero with increasing batch size:

$$V_k(\theta_k) = \sqrt{\eta}\left(\frac{1}{N}\sum_{x_i \in D}\nabla L_i(\theta_k) - \frac{1}{B}\sum_{x_i \in B_k}\nabla L_i(\theta_k)\right) \to 0_d,\quad \text{as } B \to N.$$

Let us define $\Sigma^{SGD}_N(\theta_k)$ as the sample covariance matrix of the loss gradients $\nabla L_i(\theta_k)$ for $1 \le i \le N$, where we follow (Li et al., 2017) and do not make low-rank assumptions on $\Sigma^{SGD}_N(\theta_k)$. Under mild conditions (Li et al., 2017; Chaudhari & Soatto, 2018), one can approximate SGD as $\hat{\theta}_k$ such that

$$\hat{\theta}_{k+1} \leftarrow \hat{\theta}_k - \frac{\eta}{N}\sum_{x_i \in D}\nabla L_i(\hat{\theta}_k) + \xi_k(\hat{\theta}_k),\quad \xi_k(\hat{\theta}_k) = \sqrt{\frac{\eta}{B}}\,\Sigma^{SGD}_N(\hat{\theta}_k)^{1/2} z_k,\quad z_k \sim \mathcal{N}(0, I_d). \qquad (9)$$

The implicit regularizer of SGD can be considered as $\xi_k(\hat{\theta}_k) = \sqrt{\eta/B}\,\Sigma^{SGD}_N(\hat{\theta}_k)^{1/2} z_k$, which is data-dependent and controlled by the learning rate $\eta$ and batch size $B$ (Smith et al., 2018). (Mandt et al., 2017; Chaudhari & Soatto, 2018; Hu et al., 2019b) discussed SGD for variational inference and enabled novel applications to samplers (Zhang et al., 2019b; Xiong et al., 2019). To understand the effect on generalization performance, (Zhu et al., 2019; Smith et al., 2018) studied the escaping behavior from sharp local minima (Keskar et al., 2017) and the convergence to flat ones. Finally, (Gidel et al., 2019) studied the regularization effects on linear DNNs, and (Wu et al., 2020) proposed new multiplicative noises to interpret SGD and obtained stronger theoretical properties.
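The decomposition above is easy to check numerically: the mini-batch gradient is an unbiased estimate of the full gradient, and the noise term $V_k$ shrinks to zero as the batch size approaches the dataset size. A minimal sketch with a toy quadratic loss (the data and model here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, eta = 200, 3, 0.01
X = rng.normal(size=(N, d))
y = X @ np.ones(d)
theta = np.zeros(d)

# Per-sample gradients of L_i(θ) = (x_i·θ - y_i)^2 at the current θ.
per_sample = 2 * (X @ theta - y)[:, None] * X
full_grad = per_sample.mean(axis=0)

def V_k(B):
    """One realization of V_k(θ) = sqrt(η) * (full gradient - mini-batch gradient)."""
    idx = rng.choice(N, size=B, replace=False)
    return np.sqrt(eta) * (full_grad - per_sample[idx].mean(axis=0))

small_B = np.mean([np.linalg.norm(V_k(5)) for _ in range(500)])
large_B = np.mean([np.linalg.norm(V_k(190)) for _ in range(500)])
print(small_B > large_B)                # noise shrinks as B grows
print(np.linalg.norm(V_k(200)) < 1e-9)  # B = N recovers the full gradient
```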

SGD Implicit Regularization for Ordinary Least Square (OLS)

The most recent and relevant work in this area is (Ali et al., 2019; 2020), where the same group of authors studied the implicit regularization of gradient descent and stochastic gradient descent for OLS. They investigated an implicit regularizer of $\ell_2$-norm type on the parameters, which regularizes OLS as a Ridge estimator with a decaying penalty. Prior to these efforts, F. Bach and his group studied the convergence of gradient-based solutions for linear regression with OLS and regularized estimators under both noisy and noiseless settings in (Dieuleveut et al., 2017; Marteau-Ferey et al., 2019; Berthier et al., 2020).

Self-Distillation and Noisy Students. Self-distillation (Zhang et al., 2019a; Xie et al., 2020; Xu et al., 2020; Kim et al., 2020) has been examined as an effective way to further improve the generalization performance of well-trained models. Such strategies enable knowledge distillation using the well-trained models as teachers, optionally adding noises (e.g., dropout, stochastic depth, label smoothing, or potentially label noises) to the training procedure of the student models.

Discussion on the Relevant Work. Though tremendous pioneering studies have been done in this area, we still make contributions in the above three categories. First of all, this work characterizes the implicit regularization effects of label noises on SGD dynamics. Compared to (Ali et al., 2019; 2020), which works on linear regression, our model interprets general learning tasks. Even from the linear regression perspective (Ali et al., 2019; 2020; Berthier et al., 2020), we precisely measured the gaps between SGD dynamics with and without label noises using continuous-time diffusions. Compared to (Lopez-Paz et al., 2016; Kim et al., 2020), our analysis emphasizes the role of the implicit regularizer caused by label noises in model selection, where models with high inferential stability would be selected.
(Li et al., 2020) is the most relevant work to ours, where the authors studied the early stopping of gradient descent under label noises via the neural tangent kernel (NTK) (Jacot et al., 2018) approximation. Our work analyzes SGD without approximation assumptions such as the NTK. To the best of our knowledge, this work is the first to understand the effects of unbiased label noises on SGD dynamics, addressing technical issues including implicit regularization, OLS, self-distillation, model selection, and the stability of inference results.

3. LEARNING DYNAMICS AND IMPLICIT REGULARIZATION OF SGD WITH UNBIASED RANDOM LABEL NOISES

From the initialization $\theta^{ULN}_0$, SGD with unbiased random label noises updates the estimate incrementally. Specifically, in the $k^{th}$ iteration, SGD randomly picks a batch of samples $B_k \subseteq D$ to estimate the stochastic gradient, as follows:

$$\tilde{g}_k(\theta^{ULN}_k) = \frac{1}{|B_k|}\sum_{x_i \in B_k}\nabla \tilde{L}_i(\theta^{ULN}_k) = \frac{1}{N}\sum_{i=1}^N \nabla L^*_i(\theta^{ULN}_k) + \xi^*_k(\theta^{ULN}_k) + \xi^{ULN}_k(\theta^{ULN}_k), \qquad (10)$$

where $\nabla L^*_i(\theta)$ for $\forall\theta \in \mathbb{R}^d$ refers to the loss gradient based on the label-noiseless sample $(x_i, y_i)$ with $y_i = f^*(x_i)$, $\xi^*_k(\theta)$ refers to the stochastic gradient noise (Li et al., 2017) through mini-batch sampling over the gradients of label-noiseless samples, and $\xi^{ULN}_k(\theta)$ is an additional noise term caused by the mini-batch sampling and the unbiased random label noises, such that

$$\nabla L^*_i(\theta) = \frac{\partial}{\partial\theta}\left(f(x_i, \theta) - f^*(x_i)\right)^2 = \left(f(x_i, \theta) - f^*(x_i)\right)\cdot\nabla_\theta f(x_i, \theta)\ \text{(absorbing the constant factor 2)},$$
$$\xi^*_k(\theta) = \frac{1}{|B_k|}\sum_{x_j \in B_k}\nabla L^*_j(\theta) - \frac{1}{N}\sum_{i=1}^N \nabla L^*_i(\theta),\quad \mathbb{E}_{B_k}[\xi^*_k(\theta)] = 0_d,$$
$$\xi^{ULN}_k(\theta) = -\frac{1}{|B_k|}\sum_{x_j \in B_k}\varepsilon_j\cdot\nabla_\theta f(x_j, \theta),\quad \mathbb{E}_{B_k, \varepsilon_i}[\xi^{ULN}_k(\theta)] = 0_d. \qquad (11)$$

Note that, in every iteration and for $\forall\theta \in \mathbb{R}^d$, the random vectors $\xi^*_k(\theta)$ and $\xi^{ULN}_k(\theta)$ are zero-mean since $\mathbb{E}[\varepsilon_j] = 0$. To characterize the variances of the two random vectors, we define two matrix-valued functions $\Sigma^{SGD}_N(\theta)$ and $\Sigma^{ULN}_N(\theta)$ over $\theta \in \mathbb{R}^d$ based on the label-noiseless losses, such that

$$\Sigma^{SGD}_N(\theta) = \frac{1}{N}\sum_{j=1}^N \left(\nabla L^*_j(\theta) - \frac{1}{N}\sum_{i=1}^N\nabla L^*_i(\theta)\right)\left(\nabla L^*_j(\theta) - \frac{1}{N}\sum_{i=1}^N\nabla L^*_i(\theta)\right)^\top,$$
$$\Sigma^{ULN}_N(\theta) = \frac{\sigma^2}{N}\sum_{j=1}^N \nabla_\theta f(x_j, \theta)\,\nabla_\theta f(x_j, \theta)^\top,\quad \text{as } \mathrm{var}[\varepsilon_j] = \sigma^2. \qquad (12)$$

Under mild conditions, we have $\mathrm{var}[\xi^*_k(\theta)] = \frac{1}{B}\Sigma^{SGD}_N(\theta)$ and $\mathrm{var}[\xi^{ULN}_k(\theta)] = \frac{1}{B}\Sigma^{ULN}_N(\theta)$.

SGD Learning Dynamics with Unbiased Random Label Noises. We consider the SGD algorithm with unbiased random label noises in the form of gradient descent with additive data-dependent noise.
When $\eta \to 0$, we assume the noise terms $\xi^*_k(\theta_k)$ and $\xi^{ULN}_k(\theta_k)$ are independent; then we can follow the analysis in (Hu et al., 2019a) to derive the diffusion process of SGD with unbiased random label noises, denoted as $\theta^{ULN}(t)$ over continuous time $t \ge 0$, such that

$$\mathrm{d}\theta^{ULN} = -\frac{1}{N}\sum_{i=1}^N \nabla L^*_i(\theta^{ULN})\,\mathrm{d}t + \sqrt{\frac{\eta}{B}}\,\Sigma^{SGD}_N(\theta^{ULN})^{1/2}\,\mathrm{d}W_1(t) + \sqrt{\frac{\eta}{B}}\,\Sigma^{ULN}_N(\theta^{ULN})^{1/2}\,\mathrm{d}W_2(t),$$

where $W_1(t)$ and $W_2(t)$ refer to two independent Brownian motions over time. Again, we can obtain the discrete-time approximation (Li et al., 2017; Chaudhari & Soatto, 2018) to the SGD dynamics, denoted as $\hat{\theta}^{ULN}_k$ for $k = 1, 2, \ldots$, which in the $k^{th}$ iteration behaves as

$$\hat{\theta}^{ULN}_{k+1} \leftarrow \hat{\theta}^{ULN}_k - \frac{\eta}{N}\sum_{i=1}^N \nabla L^*_i(\hat{\theta}^{ULN}_k) + \sqrt{\frac{\eta}{B}}\left(\Sigma^{SGD}_N(\hat{\theta}^{ULN}_k)^{1/2} z_k + \Sigma^{ULN}_N(\hat{\theta}^{ULN}_k)^{1/2} z'_k\right), \qquad (13)$$

where $z_k$ and $z'_k$ are two independent $d$-dimensional random vectors drawn from the standard Gaussian distribution $\mathcal{N}(0_d, I_d)$ per iteration independently, and $\hat{\theta}^{ULN}_0 = \theta^{ULN}(t = 0)$. Note that the errors from the SGD algorithm to its continuous-time diffusion process, and from the continuous-time dynamics to its discretization, are bounded under weak convergence (Hu et al., 2019a). In this way, we can use the trajectory of the discrete-time dynamics $\hat{\theta}^{ULN}_k$ to analyze the behaviors of the SGD algorithm $\theta^{ULN}_k$ over iterations.

Implicit Regularizer Influenced by Unbiased Random Label Noises. Comparing the stochastic gradient with unbiased random label noises $\tilde{g}_k(\theta)$ to the stochastic gradient based on the label-noiseless losses, we find an additional noise term $\xi^{ULN}_k(\theta)$ as the implicit regularizer. To interpret $\xi^{ULN}_k(\theta)$, we first define the diffusion process of SGD based on the Label-NoiseLess (LNL) losses, i.e., $L^*_i(\theta)$ for $1 \le i \le N$, as

$$\mathrm{d}\theta^{LNL} = -\frac{1}{N}\sum_{i=1}^N \nabla L^*_i(\theta^{LNL})\,\mathrm{d}t + \sqrt{\frac{\eta}{B}}\,\Sigma^{SGD}_N(\theta^{LNL})^{1/2}\,\mathrm{d}W(t).$$

Comparing $\theta^{ULN}(t)$ with $\theta^{LNL}(t)$, the effect of $\xi^{ULN}_k(\theta)$ in continuous-time form is $\sqrt{\eta/B}\,\Sigma^{ULN}_N(\theta)^{1/2}\,\mathrm{d}W_2(t)$. Then, in discrete time, we obtain the following result.
Proposition 1 (The implicit regularizer $\xi^{ULN}_k(\theta)$). The implicit regularizer of SGD with unbiased random label noises can be approximated (with $O(\sqrt{\eta})$ approximation error due to discretization (Li et al., 2017)) as follows:

$$\xi^{ULN}_k(\theta) \approx \sqrt{\frac{\eta}{B}}\left(\frac{\sigma^2}{N}\sum_{i=1}^N \nabla_\theta f(x_i, \theta)\,\nabla_\theta f(x_i, \theta)^\top\right)^{1/2} z_k,\quad z_k \sim \mathcal{N}(0_d, I_d).$$

In this way, we can estimate the expected strength of the implicit regularizer $\xi^{ULN}_k(\theta)$ as follows:

$$\mathbb{E}_{z_k}\left[\left\|\xi^{ULN}_k(\theta)\right\|_2^2\right] = \frac{\eta\sigma^2}{BN}\sum_{i=1}^N \left\|\nabla_\theta f(x_i, \theta)\right\|_2^2.$$

We can therefore conclude that the effect of the implicit regularization caused by unbiased random label noises in SGD is proportional to $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta)\|_2^2$, the average gradient norm of the neural network $f(x, \theta)$ over samples. Please refer to the appendix for the proof.

Inference Stabilizer. Here we extend the existing results on SGD (Zhu et al., 2019; Wu et al., 2018) to understand Proposition 1 as follows. (1) Inference Stability. The gradient norm $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta)\|_2^2$ characterizes the variation of the neural network output $f(x, \theta)$, based on the samples $x_i$ (for $1 \le i \le N$), over parameter perturbations around the point $\theta$. A lower $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta)\|_2^2$ implies a higher stability of the neural network outputs against (random) perturbations of the parameters. (2) Escape and Convergence. When the noise $\xi^{ULN}_k(\theta)$ is $\theta$-dependent (Section 4 presents a special case with OLS where $\xi^{ULN}_k(\theta)$ is $\theta$-independent), we follow (Zhu et al., 2019) and suggest that the implicit regularizer helps SGD escape from points $\theta$ with a high neural network gradient norm, since the scale of the noise $\xi^{ULN}_k(\theta)$ is large there. Reciprocally, we follow (Wu et al., 2018) and suggest that when SGD with unbiased random label noises converges, the convergence point $\theta^*$ should have a small $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta^*)\|_2^2$.
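Proposition 1 can be sanity-checked by Monte Carlo: draw the exact noise term $\xi^{ULN}_k$ from its definition in Eq. (11) and compare its empirical covariance with $\frac{1}{B}\Sigma^{ULN}_N$. The per-sample gradients below are random stand-ins, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B, sigma2 = 500, 2, 10, 1.0
grads = rng.normal(size=(N, d))   # stand-ins for ∇_θ f(x_i, θ)

def xi_uln():
    """Exact noise: xi = -(1/B) * sum_{j in batch} eps_j * ∇f(x_j, θ),
    with a fresh mini-batch and fresh unbiased label noises eps_j."""
    idx = rng.choice(N, size=B, replace=False)
    eps = rng.normal(0.0, np.sqrt(sigma2), B)
    return -(eps[:, None] * grads[idx]).mean(axis=0)

draws = np.array([xi_uln() for _ in range(20000)])
# Proposition 1 predicts cov[xi] ≈ (1/B) * (sigma2/N) * sum_i g_i g_i^T.
pred = (sigma2 / (B * N)) * grads.T @ grads
print(np.allclose(np.cov(draws.T), pred, rtol=0.15, atol=0.02))
```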
(3) Performance Tuning. Considering $\eta\sigma^2/B$ as the coefficient balancing the implicit regularizer against vanilla SGD, one can regularize/penalize the SGD learning procedure with fixed $\eta$ and $B$ more fiercely using a larger $\sigma^2$. More specifically, we can expect to obtain solutions with a lower $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta)\|_2^2$, i.e., a higher inference stability of the neural networks, as the regularization effect becomes stronger when $\sigma^2$ increases.

4. IMPLICIT REGULARIZATION EFFECTS TO LINEAR REGRESSION

Here, we consider a special example of SGD with unbiased random label noises using linear regression, where a simple quadratic loss function is considered for OLS, such that

$$\hat{\beta}_{OLS} \leftarrow \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^d}\ \frac{1}{N}\sum_{i=1}^N \tilde{L}_i(\beta) := \frac{1}{N}\sum_{i=1}^N \left(x_i^\top\beta - \tilde{y}_i\right)^2,$$

where the samples are generated through $\tilde{y}_i = x_i^\top\beta^* + \varepsilon_i$, with $\mathbb{E}[\varepsilon_i] = 0$ and $\mathrm{var}[\varepsilon_i] = \sigma^2$. Note that in this section we replace the notation $\theta$ with $\beta$ to present the parameters of linear regression models.

Learning Dynamics and Implicit Regularization Effects

The continuous-time diffusion processes for SGD with and without unbiased label noises are as follows:

$$\mathrm{d}\beta^{ULN}(t) = -\frac{1}{N}\sum_{i=1}^N x_i\left(x_i^\top\beta^{ULN}(t) - x_i^\top\beta^*\right)\mathrm{d}t + \sqrt{\frac{\eta}{B}}\,\Sigma^{SGD}_N(\beta^{ULN}(t))^{1/2}\,\mathrm{d}W_1(t) + \sqrt{\frac{\eta}{B}}\,\Sigma^{ULN}_N(\beta^{ULN}(t))^{1/2}\,\mathrm{d}W_2(t),$$
$$\mathrm{d}\beta^{LNL}(t) = -\frac{1}{N}\sum_{i=1}^N x_i\left(x_i^\top\beta^{LNL}(t) - x_i^\top\beta^*\right)\mathrm{d}t + \sqrt{\frac{\eta}{B}}\,\Sigma^{SGD}_N(\beta^{LNL}(t))^{1/2}\,\mathrm{d}W(t),$$

where $\beta^{ULN}(t)$ and $\beta^{LNL}(t)$ refer to the SGD dynamics for OLS under the Unbiased Label Noise and Label-NoiseLess settings, respectively. We denote the sample covariance matrix of the $N$ samples as $\hat{\Sigma}_N = \frac{1}{N}\sum_{i=1}^N x_i x_i^\top$. The matrices $\Sigma^{SGD}_N(\beta)$ and $\Sigma^{ULN}_N(\beta)$ in this case are

$$\Sigma^{SGD}_N(\beta) = \frac{1}{N}\sum_{i=1}^N \left(x_i x_i^\top(\beta - \beta^*) - \hat{\Sigma}_N(\beta - \beta^*)\right)\left(x_i x_i^\top(\beta - \beta^*) - \hat{\Sigma}_N(\beta - \beta^*)\right)^\top \quad\text{and}\quad \Sigma^{ULN}_N(\beta) = \sigma^2\hat{\Sigma}_N,$$

which are both time-homogeneous. Compared to $\beta^{LNL}(t)$, the dynamics $\beta^{ULN}(t)$ incorporates an additional noise term $\sqrt{\eta/B}\,\Sigma^{ULN}_N(\beta^{ULN}(t))^{1/2}\,\mathrm{d}W_2(t)$ which affects the dynamics.

Proposition 2 (Implicit Regularization on OLS). We can approximate the implicit regularizer of SGD with random label noises for OLS through discretization, such that

$$\sqrt{\frac{\eta}{B}}\,\Sigma^{ULN}_N(\beta^{ULN}(t))^{1/2}\,\mathrm{d}W_2(t)\ \Rightarrow\ \xi^{ULN}_k(\beta) \approx \sqrt{\frac{\eta\sigma^2}{B}}\,\hat{\Sigma}_N^{1/2} z_k,\quad z_k \sim \mathcal{N}(0_d, I_d),$$

which is independent of $\beta$ and $k$ (the time). According to (Berthier et al., 2020), SGD for noiseless linear regression asymptotically converges to the optimal solution $\beta^*$. With the additional noise term $\xi^{ULN}_k(\beta)$ and the single optimum $\beta^*$ (for both the noisy and noiseless losses), we conclude that as $k \to \infty$, SGD with unbiased random label noises converges to a distribution centered at $\beta^*$. The distribution tends to a Gaussian when $\sigma^2$ is significant (so that the stochastic gradient noise of the noiseless loss contributes negligibly to the overall distribution), as the term $\sqrt{\eta\sigma^2/B}\,\hat{\Sigma}_N^{1/2}\,\mathrm{d}W_2(t)$ corresponds to a Gaussian distribution. The span and shape of the distribution are controlled by $\sigma^2$ and $\hat{\Sigma}_N$ when $\eta$ and $B$ are constant.
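Proposition 2 can be reproduced with a short simulation: run plain SGD on noisy linear labels and inspect the late iterates. This is a sketch under assumed settings ($\eta = 0.01$, $B = 5$, $N = 100$ match the numerical section; the seed and data covariance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, eta, B, sigma2 = 100, 2, 0.01, 5, 0.5
X = rng.multivariate_normal(np.zeros(d), 20.0 * np.eye(d), size=N)
beta_star = np.array([1.0, 1.0])
y = X @ beta_star + rng.normal(0.0, np.sqrt(sigma2), N)  # noisy labels

beta = np.zeros(d)
tail = []
for k in range(100000):
    idx = rng.choice(N, size=B, replace=False)
    grad = 2.0 * X[idx].T @ (X[idx] @ beta - y[idx]) / B  # mini-batch gradient
    beta -= eta * grad
    if k >= 50000:
        tail.append(beta.copy())
tail = np.array(tail)

# Late iterates hover in a distribution centered near β* = [1, 1],
# rather than converging to a single point as in the noiseless case.
print(np.allclose(tail.mean(axis=0), beta_star, atol=0.1))
print((tail.std(axis=0) > 1e-3).all())   # a nonzero stationary spread
```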

Numerical Validation

To validate Proposition 2, we carry out a numerical evaluation using synthetic data to visualize the dynamics over iterations of SGD with label-noisy OLS and label-noiseless OLS. In our experiments, we use 100 random samples drawn from a 2-dimensional Gaussian distribution $x_i \sim \mathcal{N}(0, \Sigma_{1,2})$ for $1 \le i \le 100$, where $\Sigma_{1,2}$ is a symmetric covariance matrix controlling the random sample generation. To add noises to the labels, we first draw 100 copies of random noises from the normal distribution with the given variance, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, and then set up the OLS problem with $(x_i, \tilde{y}_i)$ pairs using $\tilde{y}_i = x_i^\top\beta^* + \varepsilon_i$ and $\beta^* = [1, 1]^\top$, under various settings of $\sigma^2$ and $\Sigma_{1,2}$. We set up the SGD algorithms with a fixed learning rate $\eta = 0.01$ and batch size $B = 5$, with a total number of $K = 1{,}000{,}000$ iterations to visualize the complete paths.

Figure 1 presents the results of the numerical validations. In Figure 1(a)-(d), we gradually increase the variance of label noises $\sigma^2$ from 0.25 to 2.0, where we observe that (1) SGD over label-noiseless OLS converges quickly to the optimal solution $\beta^* = [1.0, 1.0]^\top$, (2) SGD over OLS with unbiased random label noises asymptotically converges to a distribution centered at the optimal point, and (3) when $\sigma^2$ increases, the span of the converging distribution becomes larger. In Figure 1(e)-(h), we use four settings of $\Sigma_{1,2}$, where we see that (4) no matter how $\Sigma_{1,2}$ is set for the OLS problems, SGD with unbiased random label noises asymptotically converges to a distribution centered at the optimal point. Comparing the results in (e) with (f), we find that when the trace of $\Sigma_{1,2}$ increases, the span of the converging distribution increases. Furthermore, (5) the shapes of the converging distributions depend on $\Sigma_{1,2}$.
In Figure 1(g), when we place the principal component of $\Sigma_{1,2}$ onto the vertical axis (i.e., $\Sigma_{Ver} = [[10, 0], [0, 100]]$), the distribution lies principally along the vertical axis.

Note that the unbiased random label noises are added to the labels prior to the learning procedure. In this setting, it is the mini-batch sampler of SGD that "dynamizes" the noises and influences the dynamics of SGD by forming the implicit regularizer.

Given a well-trained model, Noisy Self-Distillation algorithms (Zhang et al., 2019a; Xu et al., 2020; Kim et al., 2020; Xie et al., 2020) intend to further improve the performance of a model by learning from the "soft label" outputs (i.e., logits) of the model (as the teacher). Furthermore, some practices found that self-distillation can be further improved by incorporating certain randomness and stochasticity into the training procedure so as to obtain better generalization performance (Xie et al., 2020; Kim et al., 2020). In this work, we study the approach that directly adds random label noises to the logit outputs of the pre-trained model so as to improve self-distillation (Han et al., 2018). More specifically, we study two well-known strategies for additive noises as follows.

5. IMPLICIT REGULARIZATION EFFECTS TO DEEP NEURAL NETWORKS

(1) Gaussian Noises. Given a pretrained model with $L$-dimensional logit outputs, in every iteration of self-distillation, this simple method draws random vectors from an $L$-dimensional Gaussian distribution $\mathcal{N}(0_L, \sigma^2 I_L)$, adds the vectors to the logit outputs of the model, and makes the student model learn from the noisy outputs. Note that our analysis assumes the model output is one-dimensional while, in self-distillation, the logit labels have multiple dimensions. Thus, the diagonal matrix $\sigma^2 I_L$ refers to the complete form of the variances, and $\sigma^2$ controls the scale.

(2) Symmetric Noises. This strategy is derived from (Han et al., 2018) and generates noises by randomly swapping the values of the logit output among the $L$ dimensions. Specifically, in every iteration of self-distillation, given a swap probability $p$, every logit output (denoted as $y$) from the pre-trained model, and every dimension $y_l$ of the logit output, the strategy with probability $p$ swaps the logit value in dimension $l$ with any other dimension $y_{m \neq l}$ under an equal prior (i.e., with probability $(L-1)^{-1}$). With the remaining probability $1 - p$, the strategy keeps the original logit value. In this way, the new noisy label $\tilde{y}$ has expectation

$$\mathbb{E}[\tilde{y}_l] = (1 - p)\cdot y_l + p\cdot\frac{\sum_{m \neq l} y_m}{L - 1}.$$

This strategy introduces an explicit bias to the original logit outputs. However, when we consider the expectation $\mathbb{E}[\tilde{y}]$ as the new soft label, the random noise around this new soft label is still unbiased, as $\mathbb{E}[\tilde{y} - \mathbb{E}[\tilde{y}]] = 0$ for all dimensions. Note that this noise is not the symmetric noise studied for robust learning (Wang et al., 2019). Figure 2 presents the results of the above two methods with increasing scales of noises, i.e., increasing $\sigma^2$ for Gaussian noises and increasing $p$ for Symmetric noises.
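The expectation of the symmetric-noise strategy above is straightforward to verify empirically; the sketch below implements the per-dimension swap just described (the logit values, $L = 5$, and $p = 0.3$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
y = rng.normal(size=5)   # a stand-in teacher logit output, L = 5

def swap_noise(y, p, rng):
    """With probability p, replace each logit y_l by a uniformly chosen
    other dimension's value; otherwise keep it (independent draw per l)."""
    out = y.copy()
    for l in range(len(y)):
        if rng.random() < p:
            m = rng.choice([m for m in range(len(y)) if m != l])
            out[l] = y[m]
    return out

draws = np.array([swap_noise(y, p, rng) for _ in range(40000)])
# E[ỹ_l] = (1 - p) * y_l + p * (sum of the other logits) / (L - 1).
expected = (1 - p) * y + p * (y.sum() - y) / (len(y) - 1)
print(np.allclose(draws.mean(axis=0), expected, atol=0.02))
```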
In Figure 2(a)-(c), we demonstrate that the gradient norms of neural networks $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta)\|_2^2$ decrease with growing $\sigma^2$ and $p$ for the two strategies. The results back up our theoretical investigation: the model is awarded a higher inferential stability, as the variation of neural network outputs against potential random perturbations of parameters has been reduced by the regularization. In Figure 2(d)-(f) and (g)-(i), we plot the validation and testing accuracy of the models obtained under noisy self-distillation. The results show that (1) some models have been improved through noisy self-distillation compared to the teacher model, (2) noisy self-distillation can obtain better performance than noiseless self-distillation, and (3) it is possible to select noisily self-distilled models using validation accuracy for better overall generalization performance. All results here are based on 200 epochs of noisy self-distillation.

6. DISCUSSION AND CONCLUSION

While previous studies primarily focus on the performance degradation caused by label noises or corrupted labels (Jiang et al., 2018; Li et al., 2020), we investigate the implicit regularization effects of random label noises under the mini-batch sampling settings of stochastic gradient descent (SGD). Specifically, we adopt the dynamical-systems interpretation of SGD to analyze the learning procedure based on the quadratic loss with unbiased random label noises. We decompose the mini-batch stochastic gradient based on label-noisy losses into three parts in Eq. (11): (i) $\nabla L^*(\theta)$, the true gradient of the label-noiseless losses; (ii) $\xi^*_k(\theta)$, the stochastic gradient noise caused by mini-batch sampling over the label-noiseless losses; and (iii) $\xi^{ULN}_k(\theta)$, the noise term influenced by both the random label noises and mini-batch sampling. Our research considers $\xi^{ULN}_k(\theta)$ as an implicit regularizer, and finds that the effect of this implicit regularizer is to lower the gradient norm of the neural network $\frac{1}{N}\sum_{i=1}^N \|\nabla_\theta f(x_i, \theta)\|_2^2$ over the learning procedure, where the gradient norm characterizes the variation/stability of the neural network outputs against random perturbations around the parameters. In summary, the new implicit regularizer $\xi^{ULN}_k(\theta)$ helps SGD select a convergence point with a higher inference stability. We carry out extensive experiments to validate our theoretical investigations. The numerical study with linear regression clearly illustrates the trajectories of SGD with and without unbiased random label noises; the observations coincide with the SGD dynamics derived from our theories. The evaluation based on deep neural networks shows that, in self-distillation settings, one can lower the gradient norm of neural networks, improve the inference stability of networks, and obtain better solutions by iteratively adding noises to the outputs of teacher models.
Note that we do not claim in this work that the implicit regularization caused by label noises would improve the generalization performance. The experimental results back up our theories.

A APPENDIX

A.1 SKETCHED PROOF OF PROPOSITION 1

To obtain the expected strength of $\xi^{ULN}_k(\theta)$ in Proposition 1, we can use the vector-matrix-vector product identity that, for a random vector $v$ and a symmetric matrix $A$, $\mathbb{E}_v[v^\top A v] = \mathrm{trace}(A\,\mathbb{E}[v v^\top])$, such that

$$\begin{aligned}
\mathbb{E}_{z_k}\left[\left\|\xi^{ULN}_k(\theta)\right\|_2^2\right] &= \mathbb{E}_{z_k}\left[\xi^{ULN}_k(\theta)^\top \xi^{ULN}_k(\theta)\right] \\
&\approx \frac{\eta\sigma^2}{B}\,\mathbb{E}_{z_k}\left[z_k^\top \left(\frac{1}{N}\sum_{i=1}^N \nabla_\theta f(x_i, \theta)\,\nabla_\theta f(x_i, \theta)^\top\right) z_k\right] \\
&= \frac{\eta\sigma^2}{B}\,\mathrm{trace}\left(\frac{1}{N}\sum_{i=1}^N \nabla_\theta f(x_i, \theta)\,\nabla_\theta f(x_i, \theta)^\top\,\mathbb{E}_{z_k}[z_k z_k^\top]\right)\quad \text{(as } \mathbb{E}_{z_k}[z_k z_k^\top] = I_d \text{ for } z_k \sim \mathcal{N}(0_d, I_d)\text{)} \\
&= \frac{\eta\sigma^2}{B}\,\mathrm{trace}\left(\frac{1}{N}\sum_{i=1}^N \nabla_\theta f(x_i, \theta)\,\nabla_\theta f(x_i, \theta)^\top\right) \\
&= \frac{\eta\sigma^2}{BN}\sum_{i=1}^N \left\|\nabla_\theta f(x_i, \theta)\right\|_2^2.
\end{aligned}$$

A.2 DETAILS FOR NOISY SELF-DISTILLATION WITH DEEP NEURAL NETWORKS

We choose ResNet-56 (He et al., 2016), one of the most practical deep models, for conducting the experiments on three datasets: SVHN (Netzer et al., 2011), CIFAR10, and CIFAR100 (Krizhevsky et al., 2009). We follow the standard training procedure (He et al., 2016) for training a teacher model (the original model). Specifically, we train the model from scratch for 200 epochs and adopt the SGD optimizer with batch size 64 and momentum 0.9. The learning rate is set to 0.1 at the beginning of training and divided by 10 at the 100th and 150th epochs. A standard weight decay with a small regularization parameter ($10^{-4}$) is applied. For noiseless self-distillation, we follow the standard procedure (Hinton et al., 2015) for distilling knowledge from the teacher to a student of the same network structure. The training setting is the same as for training the teacher model. For noisy self-distillation, we keep the same training setting, except that the labels are noised by the two types of noises introduced in the main text. We choose the best scale of label noises using a validation set, where we divide the original training set into a new training set (80%) and a validation set (20%). The set $\{0.1, 0.2, \ldots, 0.9, 1.0\}$ is tried for the scale of symmetric noises. The set $\{0.1, 0.5, 1.0, 2.0, 3.0, \ldots, 9.0, 10.0\}$ is tried for the scale of Gaussian noises.
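The vector-matrix-vector identity used in the first step of the proof can be checked numerically (the matrix below is an arbitrary symmetric PSD stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))
A = M @ M.T   # an arbitrary symmetric PSD matrix

# E_v[v^T A v] = trace(A * E[v v^T]); for v ~ N(0, I_d) this equals trace(A).
v = rng.standard_normal((200000, d))
mc = np.einsum('ij,jk,ik->i', v, A, v).mean()  # per-row quadratic forms
print(np.isclose(mc, np.trace(A), rtol=0.05))
```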
For clarity, we also present the results using all the choices of scales of label noises on the test set, where the original training set is used for training.

A.3 TRAINING PROCESS OF DEEP NEURAL NETWORKS

We show the evolution of the training and test losses during the entire training procedure, and compare the settings of adding no label noises, symmetric noises, and Gaussian noises for self-distillation. Figure 3 presents the results on the three datasets, i.e., SVHN, CIFAR10, and CIFAR100.



The gradient norm above characterizes the variation/instability of neural network inference results (over perturbations) around the parameters of the model. We carry out extensive experiments, and the results back up our theories. Note that while earlier work (Bishop, 1995) found that training with input noises can also bring regularization effects, our work focuses on the observational noises on labels.




[Figure 1 panels: (a) $\sigma^2 = 0.25$, (b) $\sigma^2 = 0.5$, (c) $\sigma^2 = 1.0$, (d) $\sigma^2 = 2.0$, (e) $\Sigma_{1,2} = 10 \cdot I_d$, (f) $\Sigma_{1,2} = 100 \cdot I_d$, (g) $\Sigma_{1,2} = \Sigma_{Ver}$, (h) $\Sigma_{1,2} = \Sigma_{Hor}$]

Figure 1: Trajectories of SGD over OLS with and without unbiased random label noises, using various $\sigma^2$ and $\Sigma_{1,2}$ settings for (noisy) random data generation. For Figures (a)-(d), the experiments are set up with a fixed $\Sigma_{1,2} = [[20, 0], [0, 20]]$ and varying $\sigma^2$. For Figures (e)-(h), the experiments are set up with a fixed $\sigma^2 = 0.5$ and varying $\Sigma_{1,2}$, where we set $\Sigma_{Ver} = [[10, 0], [0, 100]]$ and $\Sigma_{Hor} = [[100, 0], [0, 10]]$ to shape the converging distributions.

Figure 1(h) demonstrates the opposite layout of the distribution, when we set $\Sigma_{Hor} = [[100, 0], [0, 10]]$ as $\Sigma_{1,2}$. The scale and shape of the converging distribution back up our theoretical investigation in Proposition 2.


Figure 2: Gradient norms, validation accuracy, and testing accuracy in noisy self-distillation using ResNet-56 with varying scales of label noises (e.g., $p$ and $\sigma^2$).


Figure 3: Evolution of the training and test losses during self-distillation with no label noises, symmetric noises, and Gaussian noises, on SVHN, CIFAR10, and CIFAR100.

