TOWARDS UNDERSTANDING LABEL SMOOTHING

Abstract

Label smoothing regularization (LSR) has achieved great success in training deep neural networks with stochastic algorithms such as stochastic gradient descent and its variants. However, the theoretical understanding of its power from the view of optimization is still rare. This study opens the door to a deep understanding of LSR by initiating the analysis. In this paper, we analyze the convergence behavior of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can help speed up convergence by reducing the variance. More interestingly, we propose a simple yet effective strategy, the Two-Stage LAbel smoothing algorithm (TSLA), which uses LSR in the early training epochs and drops it in the later training epochs. We observe from the improved convergence result of TSLA that it benefits from LSR in the first stage and essentially converges faster in the second stage. To the best of our knowledge, this is the first work to explain the power of LSR by establishing convergence complexities of stochastic methods with LSR in non-convex optimization. We empirically demonstrate the effectiveness of the proposed method in comparison with baselines on training ResNet models over benchmark data sets.

1. INTRODUCTION

In training deep neural networks, one common strategy is to minimize the cross-entropy loss with one-hot label vectors, which may lead to overfitting during training and thus lower the generalization accuracy (Müller et al., 2019). To overcome the overfitting issue, several regularization techniques, such as an ℓ1-norm or ℓ2-norm penalty over the model weights, Dropout, which randomly sets the outputs of neurons to zero (Hinton et al., 2012b), batch normalization (Ioffe & Szegedy, 2015), and data augmentation (Simard et al., 1998), are employed to prevent deep learning models from becoming over-confident. However, these regularization techniques operate on the hidden activations or weights of a neural network. As an output regularizer, label smoothing regularization (LSR) (Szegedy et al., 2016) was proposed to improve the generalization and learning efficiency of a neural network by replacing the one-hot label vectors with smoothed labels that average the hard targets and a uniform distribution over the labels. Specifically, for a K-class classification problem, the one-hot label is smoothed by y^LS = (1 − θ)y + θŷ, where y is the one-hot label, θ ∈ (0, 1) is the smoothing strength, and ŷ with ŷ_i = 1/K for all i is the uniform distribution over all labels. Extensive experimental results have shown that LSR has significant success in many deep learning applications, including image classification (Zoph et al., 2018; He et al., 2019), speech recognition (Chorowski & Jaitly, 2017; Zeyer et al., 2018), and language translation (Vaswani et al., 2017; Nguyen & Salazar, 2019). Due to the importance of LSR, researchers have tried to explore its behavior in training deep neural networks. Müller et al. (2019) have empirically shown that LSR can help improve model calibration; however, they also found that LSR could impair knowledge distillation, that is, if one trains a teacher model with LSR, then a student model has worse performance. Yuan et al. (2019a) have proved that LSR provides a virtual teacher model for knowledge distillation. As a widely used trick, Lukasik et al. (2020) have shown that LSR works since it can successfully mitigate label noise. However, to the best of our knowledge, it is unclear, at least from a theoretical viewpoint, how the introduction of label smoothing helps improve the training of deep learning models, and to what stage it can help. In this paper, we aim to provide an affirmative answer to this question and to deeply understand why and how LSR works from the view of optimization.
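As a concrete illustration, the smoothing rule y^LS = (1 − θ)y + θŷ can be sketched in a few lines of NumPy; the function name smooth_labels and the toy example are ours, not notation from the paper:

```python
import numpy as np

def smooth_labels(y_onehot: np.ndarray, theta: float) -> np.ndarray:
    """Blend a one-hot label with the uniform distribution:
    y_LS = (1 - theta) * y + theta * (1/K), applied elementwise."""
    K = y_onehot.shape[-1]
    return (1.0 - theta) * y_onehot + theta / K

# Example: K = 4 classes, correct class at index 2, theta = 0.1.
y = np.array([0.0, 0.0, 1.0, 0.0])
y_ls = smooth_labels(y, theta=0.1)
# y_ls = [0.025, 0.025, 0.925, 0.025]; the entries still sum to 1.
```

Note that the smoothed label remains a valid probability distribution for any θ ∈ (0, 1), since it is a convex combination of two distributions.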
Our theoretical analysis will show that an appropriate LSR can essentially reduce the variance of the stochastic gradient in the assigned class labels and thus speed up convergence. Moreover, we will propose a novel strategy for employing LSR that tells when to use it. We summarize the main contributions of this paper as follows.

• It is the first work that establishes improved iteration complexities of stochastic gradient descent (SGD) (Robbins & Monro, 1951) with LSR for finding an ε-approximate stationary point (Definition 1) of a smooth non-convex problem in the presence of an appropriate label smoothing. The results theoretically explain why an appropriate LSR can help speed up convergence. (Section 4)

• We propose a simple yet effective strategy, the Two-Stage LAbel smoothing (TSLA) algorithm, whose first stage trains the model for a certain number of epochs using a stochastic method with LSR, while the second stage runs the same stochastic method without LSR. TSLA is a generic strategy that can incorporate many stochastic algorithms. With an appropriate label smoothing, we show that TSLA integrated with SGD has an improved iteration complexity compared to SGD with LSR and SGD without LSR. (Section 5)
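The two-stage schedule behind TSLA can be sketched as a simple smoothing-strength schedule; here switch_epoch and the commented-out train_one_epoch are hypothetical names for illustration, not part of the paper's formal algorithm:

```python
def tsla_smoothing_schedule(epoch: int, switch_epoch: int, theta: float) -> float:
    """TSLA-style schedule: use smoothing strength `theta` for the first
    `switch_epoch` epochs (stage one), then drop LSR entirely (stage two)."""
    return theta if epoch < switch_epoch else 0.0

# Hypothetical usage with any stochastic training routine:
# for epoch in range(num_epochs):
#     theta_t = tsla_smoothing_schedule(epoch, switch_epoch=60, theta=0.1)
#     train_one_epoch(model, data, smoothing=theta_t)
```

The design point is that stage one and stage two run the same stochastic method; only the label-smoothing strength changes, so the strategy can wrap SGD or any of its variants.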




2. RELATED WORK

In this section, we introduce some related work. An idea closely related to LSR is the confidence penalty proposed by Pereyra et al. (2017), an output regularizer that penalizes confident output distributions by adding their negative entropy to the negative log-likelihood during training. Pereyra et al. (2017) presented extensive experimental results on training deep neural networks, demonstrating better generalization than baselines that only tune the existing hyper-parameters. They showed that LSR is equivalent to the confidence penalty with a reversed direction of the KL divergence between the uniform distribution and the output distribution. Wang et al. (2020) proposed a graduated label smoothing method that applies a higher smoothing strength to high-confidence predictions than to low-confidence predictions. They found that the proposed method can improve both inference calibration and translation performance for neural machine translation models. By contrast, in this paper we try to understand the power of LSR from an optimization perspective and study how and when to use it.
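The confidence penalty described above augments the negative log-likelihood with −β·H(p), where H(p) is the entropy of the model's output distribution; a minimal NumPy sketch (our own illustrative function, with β as the penalty weight) is:

```python
import numpy as np

def confidence_penalty_loss(logits: np.ndarray, y_index: int, beta: float) -> float:
    """Negative log-likelihood plus a confidence penalty -beta * H(p),
    where H(p) is the entropy of the softmax output distribution p.
    Low-entropy (over-confident) outputs are penalized more."""
    z = logits - logits.max()          # numerically stabilized softmax
    p = np.exp(z) / np.exp(z).sum()
    nll = -np.log(p[y_index])
    entropy = -(p * np.log(p)).sum()
    return nll - beta * entropy

# A peaked (confident) output keeps almost no entropy bonus,
# while a flat output receives the maximal entropy bonus log(K).
loss_confident = confidence_penalty_loss(np.array([5.0, 0.0, 0.0]), 0, beta=0.1)
loss_flat = confidence_penalty_loss(np.array([1.0, 1.0, 1.0]), 0, beta=0.1)
```

With β = 0 the loss reduces to the plain negative log-likelihood, which makes the role of the regularizer easy to isolate in experiments.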

3. PRELIMINARIES AND NOTATIONS

We first present some notation. Let ∇_w F(w) denote the gradient of a function F(w); when the variable with respect to which the gradient is taken is clear, we write ∇F(w) for simplicity. We use ‖·‖ to denote the Euclidean norm and ⟨·, ·⟩ to denote the inner product. In a classification problem, we aim to seek a classifier that maps an example x ∈ X onto one of K labels y ∈ Y ⊂ R^K, where y = (y_1, y_2, . . . , y_K) is a one-hot label, meaning that y_i is "1" for the correct class and "0" for the rest. Suppose the example-label pairs are drawn from a distribution P, i.e., (x, y) ∼ P = (P_x, P_y). We denote by E_{(x,y)}[·] the expectation taken over the random variable (x, y); when the randomness is clear, we write E[·] for simplicity. Our goal is to learn a prediction function f(w; x) : W × X → R^K that is as close as possible to y, where w ∈ W is the parameter and W is a closed convex set. To this end, we want to minimize the following expected loss under P:

min_{w ∈ W} F(w) := E_{(x,y)}[ℓ(y, f(w; x))],    (1)

where ℓ(y, f(w; x)) is a loss function such as the cross-entropy loss.
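To make the setup of problem (1) concrete, a single SGD step on one sampled pair (x, y) with cross-entropy loss might look as follows; the linear model f(w; x) = Wx and the toy dimensions are our illustrative assumptions, not the paper's setting:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                     # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W: np.ndarray, x: np.ndarray, y_target: np.ndarray,
             lr: float) -> np.ndarray:
    """One SGD step on the cross-entropy loss l(y, f(W; x)) for a linear
    model f(W; x) = W x; y_target may be a one-hot or a smoothed label."""
    p = softmax(W @ x)
    grad = np.outer(p - y_target, x)    # gradient of cross-entropy w.r.t. W
    return W - lr * grad

# Toy instance: K = 3 classes, d = 2 features, one-hot label for class 1.
W = np.zeros((3, 2))
x = np.array([1.0, -0.5])
y = np.array([0.0, 1.0, 0.0])
W = sgd_step(W, x, y, lr=0.1)
```

Passing a smoothed label y^LS in place of the one-hot y changes only the target term in the gradient, which is the quantity whose variance the paper's analysis studies.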

