IMPLICIT REGULARIZATION EFFECTS OF UNBIASED RANDOM LABEL NOISES WITH SGD

Abstract

Random label noises (or observational noises) widely exist in practical machine-learning settings. We analyze the learning dynamics of stochastic gradient descent (SGD) over the quadratic loss with unbiased label noises, and identify a new noise term in the dynamics, driven jointly by mini-batch sampling and the random label noises, that acts as an implicit regularizer. Our theoretical analysis shows that this implicit regularizer favors convergence points at which model outputs are stable against perturbations of the parameters. To validate our analysis, we apply our theorems to characterize the implicit regularizer of SGD with unbiased random label noises for linear regression via Ordinary Least Squares (OLS), where numerical simulation backs up our theoretical findings. We further extend our proposals to interpret recently proposed noisy self-distillation tricks for deep learning, where the implicit regularizer demonstrates a unique capacity for selecting models with improved output stability by learning from well-trained teachers with additive unbiased random label noises.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) has been widely used as an effective way to train deep neural networks on large datasets (Bottou, 1991). While the mini-batch sampling strategy was first proposed to lower the cost of computation per iteration, it is also considered to incorporate an implicit regularizer that prevents the learning process from converging to local minima with poor generalization performance (Zhang et al., 2017; Zhu et al., 2019; Jastrzebski et al., 2017; Hoffer et al., 2017; Keskar et al., 2017). To interpret such implicit regularization, one can model SGD as gradient descent (GD) with gradient noises caused by mini-batch sampling (Bottou et al., 2018). Studies have demonstrated the potential of such implicit regularization or gradient noises to improve the generalization performance of learning, from both theoretical (Mandt et al., 2017; Chaudhari & Soatto, 2018; Hu et al., 2019b; Simsekli et al., 2019) and empirical aspects (Zhu et al., 2019; Hoffer et al., 2017; Keskar et al., 2017). In summary, gradient noises keep SGD away from converging to sharp local minima that generalize poorly (Zhu et al., 2019; Hu et al., 2019b; Simsekli et al., 2019) and tend to select a flat minimum (Hochreiter & Schmidhuber, 1997) as the outcome of learning. In this work, we aim at investigating the influence of random label noises on the implicit regularization induced by mini-batch sampling of SGD. To simplify our research, we assume the training dataset is a set of vectors D = {x_1, x_2, x_3, ..., x_N}. The label ỹ_i for every vector x_i ∈ D is the noisy response of the true neural network f*(x), such that ỹ_i = y_i + ε_i, y_i = f*(x_i), E[ε_i] = 0, and Var[ε_i] = σ², where each label noise ε_i is assumed to be an independent zero-mean random variable.
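The label-noise model above can be sketched numerically as follows; this is a minimal illustration, not the paper's experimental setup, and the linear form of f* as well as the values of N, d, and σ are assumptions chosen for concreteness:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 1000, 5, 0.1  # hypothetical dataset size, dimension, noise std

# Stand-in for the true model f*(x): a fixed linear map <x, theta*>.
theta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))            # training inputs x_1, ..., x_N

y_clean = X @ theta_star               # y_i = f*(x_i)
eps = rng.normal(0.0, sigma, size=N)   # independent, zero-mean: E[eps_i] = 0, Var[eps_i] = sigma^2
y_noisy = y_clean + eps                # observed labels: ỹ_i = y_i + eps_i
```

Because the noise is unbiased, averaging over many samples leaves the clean response unchanged in expectation, which is the property the analysis in this work relies on.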
In our work, the random label noises can be either (1) drawn from probability distributions before training (but dynamized by the mini-batch sampling of SGD) or (2) realized by fresh random variables at each training iteration (Han et al., 2018). Learning thus aims to find f(x, θ) that approximates f*(x) by solving θ ← argmin_{θ ∈ R^d} (1/N) Σ_{i=1}^{N} L_i(θ), where L_i(θ) := (f(x_i, θ) - ỹ_i)². Inspired by (Hochreiter & Schmidhuber, 1997; Zhu et al., 2019), our work studies how the unbiased label noises ε_i (1 ≤ i ≤ N) affect the "selection" of θ from the possible solutions, from the viewpoint of the learning dynamics (Saxe et al., 2014) of SGD under mini-batch sampling (Li et al., 2017; Wu et al., 2020; Hu et al., 2019b).
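As a concrete instance of the objective above, the OLS setting mentioned in the abstract can be run with plain mini-batch SGD on the quadratic loss; this is a sketch under assumed hyperparameters (batch size, learning rate, step count), not the paper's exact simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B, eta, steps = 1000, 5, 32, 0.05, 2000  # assumed hyperparameters

# Linear-regression data with unbiased label noise: ỹ_i = <x_i, theta*> + eps_i.
theta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_star + rng.normal(0.0, 0.1, size=N)

theta = np.zeros(d)
for _ in range(steps):
    idx = rng.choice(N, size=B, replace=False)    # mini-batch sampling
    residual = X[idx] @ theta - y[idx]            # f(x_i, theta) - ỹ_i
    grad = 2.0 * X[idx].T @ residual / B          # gradient of the mean quadratic loss
    theta -= eta * grad                           # SGD step

# Both mini-batch sampling and the label noise inject randomness into grad,
# so theta fluctuates around the OLS solution rather than converging exactly.
```

The gradient noise here has two sources, subsampling and ε_i, which is precisely the combined noise term whose regularization effect the analysis investigates.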

