TOWARDS UNDERSTANDING LABEL SMOOTHING

Abstract

Label smoothing regularization (LSR) has great success in training deep neural networks by stochastic algorithms such as stochastic gradient descent and its variants. However, the theoretical understanding of its power from the view of optimization is still rare. This study opens the door to a deep understanding of LSR by initiating its analysis. In this paper, we analyze the convergence behavior of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can speed up convergence by reducing the variance. More interestingly, we propose a simple yet effective strategy, namely the Two-Stage LAbel smoothing algorithm (TSLA), that uses LSR in the early training epochs and drops it off in the later training epochs. We observe from the improved convergence result of TSLA that it benefits from LSR in the first stage and essentially converges faster in the second stage. To the best of our knowledge, this is the first work to explain the power of LSR by establishing the convergence complexity of stochastic methods with LSR in non-convex optimization. We empirically demonstrate the effectiveness of the proposed method in comparison with baselines on training ResNet models over benchmark data sets.

1. INTRODUCTION

In training deep neural networks, one common strategy is to minimize the cross-entropy loss with one-hot label vectors, which may lead to overfitting during the training process and lower the generalization accuracy (Müller et al., 2019). To overcome the overfitting issue, several regularization techniques are employed to prevent deep learning models from becoming over-confident, such as an ℓ1-norm or ℓ2-norm penalty over the model weights, Dropout, which randomly sets the outputs of neurons to zero (Hinton et al., 2012b), batch normalization (Ioffe & Szegedy, 2015), and data augmentation (Simard et al., 1998).
However, these regularization techniques operate on the hidden activations or weights of a neural network. As an output regularizer, label smoothing regularization (LSR) (Szegedy et al., 2016) was proposed to improve the generalization and learning efficiency of a neural network by replacing the one-hot label vectors with smoothed labels that average the hard targets and a uniform distribution over the other labels. Specifically, for a K-class classification problem, the one-hot label is smoothed by y^LS = (1 − θ)y + θỹ, where y is the one-hot label, θ ∈ (0, 1) is the smoothing strength, and ỹ = (1/K)1 is the uniform distribution over all labels. Extensive experimental results have shown that LSR has significant success in many deep learning applications including image classification (Zoph et al.
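As a concrete illustration, the smoothing step above can be written in a few lines of plain Python. This is a minimal sketch (the function name `smooth_labels` is ours, not from the paper), using the uniform choice ỹ = (1/K)1:

```python
def smooth_labels(y, theta):
    # y: one-hot label vector of length K; theta: smoothing strength in (0, 1)
    # y_ls = (1 - theta) * y + theta * (1/K), applied coordinate-wise
    K = len(y)
    return [(1 - theta) * y_i + theta / K for y_i in y]
```

For example, with K = 4 and θ = 0.4, the one-hot label (1, 0, 0, 0) becomes (0.7, 0.1, 0.1, 0.1); note the entries still sum to one.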



in the assigned class labels and thus it can speed up the convergence. Moreover, we will propose a novel strategy for employing LSR that tells when to use LSR. We summarize the main contributions of this paper as follows.

• It is the first work that establishes improved iteration complexities of stochastic gradient descent (SGD) (Robbins & Monro, 1951) with LSR for finding an ε-approximate stationary point (Definition 1) of a smooth non-convex problem in the presence of an appropriate label smoothing. The results theoretically explain why an appropriate LSR can help speed up the convergence. (Section 4)

• We propose a simple yet effective strategy, namely the Two-Stage LAbel smoothing (TSLA) algorithm, where the first stage trains the model for a certain number of epochs using a stochastic method with LSR, while the second stage runs the same stochastic method without LSR. The proposed TSLA is a generic strategy that can incorporate many stochastic algorithms. With an appropriate label smoothing, we show that TSLA integrated with SGD has an improved iteration complexity compared to SGD with LSR and SGD without LSR. (Section 5)

2. RELATED WORK

In this section, we introduce some related work. A closely related idea to LSR is the confidence penalty proposed by Pereyra et al. (2017), an output regularizer that penalizes confident output distributions by adding their negative entropy to the negative log-likelihood during the training process. Pereyra et al. (2017) presented extensive experimental results on training deep neural networks demonstrating better generalization compared to baselines that only tune the existing hyper-parameters. They showed that LSR is equivalent to the confidence penalty with a reversed direction of the KL divergence between the uniform distribution and the output distribution. DisturbLabel, introduced by Xie et al. (2016), imposes regularization within the loss layer, where it randomly replaces some of the ground-truth labels with incorrect values at each training iteration. Its effect is quite similar to that of LSR in that it can help prevent the neural network training from overfitting. The authors verified the effectiveness of DisturbLabel via several experiments on image classification tasks. Recently, many works (Zhang et al., 2018; Bagherinezhad et al., 2018; Goibert & Dohmatob, 2019; Shen et al., 2019; Li et al., 2020b) explored the idea of the LSR technique. Ding et al. (2019) proposed an adaptive label regularization method that enables the neural network to use both correctness and incorrectness during training. Pang et al. (2018) used the reverse cross-entropy loss to smooth the classifier's gradients. Wang et al. (2020) proposed a graduated label smoothing method that uses a higher smoothing penalty for high-confidence predictions than for low-confidence ones. They found that the proposed method can improve both inference calibration and translation performance for neural machine translation models.
By contrast, in this paper, we will try to understand the power of LSR from an optimization perspective and try to study how and when to use LSR.

3. PRELIMINARIES AND NOTATIONS

We first present some notation. Let ∇_w F(w) denote the gradient of a function F(w); when the variable with respect to which the gradient is taken is obvious, we write ∇F(w) for simplicity. We use ‖·‖ to denote the Euclidean norm and ⟨·, ·⟩ the inner product. In a classification problem, we aim to seek a classifier that maps an example x ∈ X onto one of K labels y ∈ Y ⊂ R^K, where y = (y_1, y_2, . . . , y_K) is a one-hot label, meaning that y_i is 1 for the correct class and 0 for the rest. Suppose the example-label pairs are drawn from a distribution P, i.e., (x, y) ∼ P = (P_x, P_y). We denote by E_{(x,y)}[·] the expectation taken over the random variable (x, y); when the randomness is obvious, we write E[·] for simplicity. Our goal is to learn a prediction function f(w; x) : W × X → R^K that is as close as possible to y, where w ∈ W is the parameter and W is a closed convex set. To this end, we want to minimize the following expected loss under P:

min_{w∈W} F(w) := E_{(x,y)}[ℓ(y, f(w; x))], (1)

where ℓ : Y × R^K → R_+ is the cross-entropy loss function given by

ℓ(y, f(w; x)) = −∑_{i=1}^K y_i log( exp(f_i(w; x)) / ∑_{j=1}^K exp(f_j(w; x)) ). (2)

The objective function F(w) is not convex since f(w; x) is non-convex in terms of w. To solve problem (1), one can simply use an iterative method such as stochastic gradient descent (SGD). Specifically, at each training iteration t, SGD updates the solution by

w_{t+1} = w_t − η∇_w ℓ(y_t, f(w_t; x_t)),

where η > 0 is a learning rate. Next, we present the assumptions that will be used in the convergence analysis. Throughout this paper, we make the following assumptions for solving problem (1).

Assumption 1. Assume the following conditions hold:
(i) The stochastic gradient of F(w) is unbiased, i.e., E_{(x,y)}[∇ℓ(y, f(w; x))] = ∇F(w), and the variance of the stochastic gradient is bounded, i.e., there exists a constant σ² > 0 such that E_{(x,y)}[‖∇ℓ(y, f(w; x)) − ∇F(w)‖²] = σ².
(ii) F(w) is smooth with an L-Lipschitz continuous gradient, i.e., it is differentiable and there exists a constant L > 0 such that ‖∇F(w) − ∇F(u)‖ ≤ L‖w − u‖, ∀w, u ∈ W.

Remark. Assumption 1 (i) and (ii) are commonly used assumptions in the literature on non-convex optimization (Ghadimi & Lan, 2013; Yan et al., 2018; Yuan et al., 2019b; Wang et al., 2019; Li et al., 2020a). Assumption 1 (ii) says the objective function is L-smooth, and it has an equivalent expression (Nesterov, 2004): F(w) − F(u) ≤ ⟨∇F(u), w − u⟩ + (L/2)‖w − u‖², ∀w, u ∈ W.

For a classification problem, the smoothed label y^LS is given by y^LS = (1 − θ)y + θỹ, where θ ∈ (0, 1) is the smoothing strength, y is the one-hot label, and ỹ is an introduced label. For example, one can simply use ỹ = (1/K)1 (Szegedy et al., 2016) for K-class problems. Similar to the label y, we suppose the label ỹ is drawn from a distribution P_ỹ. We introduce the variance of the stochastic gradient using the label ỹ as follows:

E_{(x,ỹ)}[‖∇ℓ(ỹ, f(w; x)) − ∇F(w)‖²] = σ̃² := δσ², (4)

where δ > 0 is a constant and σ² is defined in Assumption 1 (i). We make several remarks on (4).

Remark. (a) We do not require that the stochastic gradient ∇ℓ(ỹ, f(w; x)) be unbiased, i.e., it could be that E[∇ℓ(ỹ, f(w; x))] ≠ ∇F(w). (b) The variance σ̃² is defined based on the label ỹ rather than the smoothed label y^LS. (c) We do not assume the variance σ̃² is bounded, since δ could be arbitrarily large; however, we will discuss the different cases of δ in our analysis. If δ ≥ 1, then σ̃² ≥ σ²; while if 0 < δ < 1, then σ̃² < σ². It is worth mentioning that δ could be small when an appropriate label is used in the label smoothing. For example, one can smooth labels using a teacher model (Hinton et al., 2014) or the model's own distribution (Reed et al., 2014).
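In practice, the ratio δ = σ̃²/σ² in (4) can be estimated empirically by comparing per-sample gradient variances under y and ỹ. The following is a hypothetical sketch (the helper names are ours, not from the paper; the gradient samples would come from the model being trained):

```python
def full_gradient(grads):
    # approximate ∇F(w) by averaging the stochastic gradients with one-hot labels
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

def mean_sq_dist(grads, center):
    # average squared Euclidean distance of the gradient samples from `center`
    return sum(sum((g - c) ** 2 for g, c in zip(vec, center)) for vec in grads) / len(grads)

def estimate_delta(grads_onehot, grads_tilde):
    # grads_onehot: samples of ∇ℓ(y, f(w; x)); grads_tilde: samples of ∇ℓ(ỹ, f(w; x))
    center = full_gradient(grads_onehot)
    sigma2 = mean_sq_dist(grads_onehot, center)       # σ² of Assumption 1 (i)
    sigma2_tilde = mean_sq_dist(grads_tilde, center)  # σ̃² of Eq. (4)
    return sigma2_tilde / sigma2                      # δ = σ̃² / σ²
```

Note that, as in remark (a), the gradients under ỹ need not be centered at ∇F(w); the deviation in (4) is still measured from ∇F(w).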
In the first paper on label smoothing (Szegedy et al., 2016) and the follow-up studies (Müller et al., 2019; Yuan et al., 2019a), researchers consider a uniform distribution over all K classes as the label ỹ, i.e., they set ỹ = (1/K)1. We now introduce an important property of F(w), the Polyak-Łojasiewicz (PL) condition (Polyak, 1963). More specifically, the following assumption holds.

Assumption 2. There exists a constant µ > 0 such that 2µ(F(w) − F*) ≤ ‖∇F(w)‖², ∀w ∈ W, where F* = min_{w∈W} F(w) is the optimal value.

Remark. This property has been theoretically and empirically observed in training deep neural networks (Allen-Zhu et al., 2019; Yuan et al., 2019b). This condition is widely used to establish convergence in the literature on non-convex optimization; please see (Yuan et al., 2019b; Wang et al., 2019; Karimi et al., 2016; Li & Li, 2018; Charles & Papailiopoulos, 2018; Li et al., 2020a) and references therein.

Algorithm 1 SGD with Label Smoothing Regularization
1: Initialize: w_0 ∈ W, θ ∈ (0, 1); set η as the value in Theorem 3.
2: for t = 0, 1, . . . , T − 1 do
3:   sample (x_t, y_t), set y^LS_t = (1 − θ)y_t + θỹ_t
4:   update w_{t+1} = w_t − η∇_w ℓ(y^LS_t, f(w_t; x_t))
5: end for

To measure the convergence of non-convex and smooth optimization problems as in (Nesterov, 1998; Ghadimi & Lan, 2013; Yan et al., 2018), we need the following definition of a first-order stationary point.

Definition 1 (First-order stationary point). For the problem min_{w∈W} F(w), a point w ∈ W is called a first-order stationary point if ‖∇F(w)‖ = 0. Moreover, if ‖∇F(w)‖ ≤ ε, then w is said to be an ε-stationary point, where ε ∈ (0, 1) is a small positive value.
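Algorithm 1 can be sketched on a toy softmax-regression model. This is our own minimal, runnable setup, not the paper's experimental code; it uses the fact that the gradient of the cross-entropy loss with respect to the logits of a linear model f(w; x) = Wx is p − y^LS, where p is the softmax output:

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]  # shift by max for numerical stability
    s = sum(e)
    return [v / s for v in e]

def sgd_with_lsr(data, K, d, theta=0.1, eta=0.5, T=500, seed=0):
    # Algorithm 1: at each step, sample (x_t, y_t), smooth the label with the
    # uniform distribution, and take an SGD step on the cross-entropy loss.
    rng = random.Random(seed)
    W = [[0.0] * d for _ in range(K)]  # linear model: logits_k = <W[k], x>
    for _ in range(T):
        x, y = data[rng.randrange(len(data))]
        logits = [sum(W[k][j] * x[j] for j in range(d)) for k in range(K)]
        p = softmax(logits)
        y_ls = [(1 - theta) * yi + theta / K for yi in y]  # smoothed label
        for k in range(K):
            g = p[k] - y_ls[k]  # d(loss)/d(logits_k)
            for j in range(d):
                W[k][j] -= eta * g * x[j]
    return W
```

On a tiny separable data set, the learned weights classify both classes correctly, even though each step follows the biased smoothed-label gradient.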

4. CONVERGENCE ANALYSIS OF SGD WITH LSR

To understand LSR from the optimization perspective, we consider SGD with LSR in Algorithm 1 for the sake of simplicity. The only difference between Algorithm 1 and standard SGD is the use of the output label for constructing a stochastic gradient. The following theorem shows that Algorithm 1 converges to an approximate stationary point in expectation under some conditions. We include its proof in Appendix B.

Theorem 3. Under Assumption 1, run Algorithm 1 with η = 1/L and θ = 1/(1 + δ); then E_R[‖∇F(w_R)‖²] ≤ 2F(w_0)/(ηT) + 2δσ², where R is uniformly sampled from {0, 1, . . . , T − 1}. Furthermore, given a target accuracy level ε, we have the following two results. (1) When δ ≤ ε²/(4σ²), if we set T = 4F(w_0)/(ηε²), then Algorithm 1 converges to an ε-stationary point in expectation, i.e., E_R[‖∇F(w_R)‖²] ≤ ε². The total sample complexity is T = O(1/ε²). (2) When δ > ε²/(4σ²), if we set T = F(w_0)/(ηδσ²), then Algorithm 1 does not converge to an ε-stationary point, but we have E_R[‖∇F(w_R)‖²] ≤ 4δσ² ≤ O(δ).

Remark. We observe that the variance term is 2δσ², instead of the ηLσ² in the standard analysis of SGD without LSR (i.e., θ = 0; please see the detailed analysis of Theorem 5 in Appendix C). For the convergence analysis, the difference between SGD with LSR and SGD without LSR is that ∇ℓ(ỹ, f(w; x)) is not an unbiased estimator of ∇F(w) when LSR is used. The convergence behavior of Algorithm 1 heavily depends on the parameter δ. When δ is small enough, say δ ≤ O(ε²) for a small positive value ε ∈ (0, 1), Algorithm 1 converges to an ε-stationary point with a total sample complexity of O(1/ε²). Recall that the total sample complexity of standard SGD without LSR for finding an ε-stationary point is O(1/ε⁴) ((Ghadimi & Lan, 2016; Ghadimi et al., 2016); please also see the detailed analysis of Theorem 5 in Appendix C). The convergence result shows that if we can find a label ỹ with a reasonably small δ, we can reduce the sample complexity of training a machine learning model from O(1/ε⁴) to O(1/ε²).
Thus, the reduction in variance happens when an appropriate label smoothing with δ ∈ (0, 1) is introduced. We will see in the empirical evaluations that different labels ỹ lead to different performances and that an appropriate selection of the label ỹ yields better performance (see the performances of LSR and LSR-pre in Table 3). On the other hand, when the parameter δ is large, i.e., δ > Ω(ε²), which is to say an inappropriate label smoothing is used, Algorithm 1 does not converge to an ε-stationary point but instead converges to a worse level of O(δ).

5. TSLA: A GENERIC TWO-STAGE LABEL SMOOTHING ALGORITHM

Despite superior outcomes in training deep neural networks, some real applications have shown an adverse effect of LSR. Müller et al. (2019) empirically observed that LSR impairs distillation: after training teacher models with LSR, student models perform worse. The authors believe that LSR reduces the mutual information between the input example and the output logit. Kornblith et al. (2019) found that LSR impairs the accuracy of transfer learning when training deep neural network models on the ImageNet data set. Seo et al. (2020) trained deep neural network models for few-shot learning on mini-ImageNet and found a significant performance drop with LSR.

Algorithm 2 The TSLA algorithm
1: Initialize: w_0 ∈ W, T_1, θ ∈ (0, 1), η_1, η_2 > 0
2: Input: stochastic algorithm A (e.g., SGD)
   // First stage: A with LSR
3: for t = 0, 1, . . . , T_1 − 1 do
4:   sample (x_t, y_t), set y^LS_t = (1 − θ)y_t + θỹ_t
5:   update w_{t+1} = A-step(w_t; x_t, y^LS_t, η_1)  // one update step of A
6: end for
   // Second stage: A without LSR
7: for t = T_1, T_1 + 1, . . . , T_1 + T_2 − 1 do
8:   sample (x_t, y_t)
9:   update w_{t+1} = A-step(w_t; x_t, y_t, η_2)  // one update step of A
10: end for

These observations motivate us to investigate a strategy that combines training with and without LSR. One possible explanation is that training with one-hot labels is "easier" than training with smoothed labels. Taking the cross-entropy loss in (2) as an example, one needs to optimize only the single loss term −log(exp(f_k(w; x))/∑_{j=1}^K exp(f_j(w; x))) when the one-hot label is used (e.g., y_k = 1 and y_i = 0 for all i ≠ k), but needs to optimize all K loss terms −∑_{i=1}^K y^LS_i log(exp(f_i(w; x))/∑_{j=1}^K exp(f_j(w; x))) when the smoothed label is used (e.g., y^LS = (1 − θ)y + (θ/K)1, so that y^LS_k = 1 − (K − 1)θ/K and y^LS_i = θ/K for all i ≠ k). Nevertheless, training deep neural networks gradually focuses on hard examples as the number of training epochs increases.
It seems that training with smoothed labels in the late epochs makes the learning process more difficult. In addition, with LSR we optimize an overall distribution that contains the minor classes, which are probably not important at the end of the training process. One question is whether LSR helps in the early training epochs but has less (or even a negative) effect during the later training epochs. This question encourages us to propose and analyze a simple strategy with LSR dropping that switches a stochastic algorithm with LSR to the same algorithm without LSR. We propose a generic framework that consists of two stages: in the first stage it runs a stochastic algorithm A (e.g., SGD) with LSR for T_1 iterations, and in the second stage it runs the same algorithm without LSR for up to T_2 iterations. This framework is referred to as the Two-Stage LAbel smoothing (TSLA) algorithm, whose updating details are presented in Algorithm 2. The notation A-step(·; ·, η) denotes one update step of a stochastic algorithm A with learning rate η. For example, if we select SGD as algorithm A, then

SGD-step(w_t; x_t, y^LS_t, η_1) = w_t − η_1∇ℓ(y^LS_t, f(w_t; x_t)), (5)
SGD-step(w_t; x_t, y_t, η_2) = w_t − η_2∇ℓ(y_t, f(w_t; x_t)). (6)

Although SGD is considered as the subroutine algorithm A in the convergence analysis, in practice A can be replaced by any stochastic algorithm such as momentum SGD (Polyak, 1964), stochastic Nesterov's accelerated gradient (Nesterov, 1983), and adaptive algorithms including ADAGRAD (Duchi et al., 2011), RMSProp (Hinton et al., 2012a), AdaDelta (Zeiler, 2012), Adam (Kingma & Ba, 2015), Nadam (Dozat, 2016), and AMSGRAD (Reddi et al., 2018). We will not study the theoretical guarantees and empirical evaluations of other optimizers in this paper; these can be considered as future work. Please note that the algorithm can use different learning rates η_1 and η_2 during the two stages.
The last solution of the first stage is used as the initial solution of the second stage. If T_1 = 0, then TSLA reduces to the baseline, i.e., a standard stochastic algorithm A without LSR; while if T_2 = 0, TSLA becomes the LSR method, i.e., a standard stochastic algorithm A with LSR.
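Algorithm 2 with A = SGD can be sketched on the same toy softmax-regression setup (again our own illustrative code with made-up helper names, not the paper's implementation): the only differences from plain SGD with LSR are the switch to the one-hot label after T_1 steps and the change of learning rate from η_1 to η_2.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]  # shift by max for numerical stability
    s = sum(e)
    return [v / s for v in e]

def sgd_step(W, x, y_target, eta):
    # one SGD step on the cross-entropy loss for a linear model;
    # the gradient with respect to the logits is softmax(Wx) - y_target
    K, d = len(W), len(x)
    p = softmax([sum(W[k][j] * x[j] for j in range(d)) for k in range(K)])
    for k in range(K):
        g = p[k] - y_target[k]
        for j in range(d):
            W[k][j] -= eta * g * x[j]

def tsla(data, K, d, theta, T1, T2, eta1, eta2, seed=0):
    rng = random.Random(seed)
    W = [[0.0] * d for _ in range(K)]
    for _ in range(T1):  # first stage: SGD with LSR, learning rate eta1
        x, y = data[rng.randrange(len(data))]
        y_ls = [(1 - theta) * yi + theta / K for yi in y]
        sgd_step(W, x, y_ls, eta1)
    for _ in range(T2):  # second stage: SGD on one-hot labels, learning rate eta2
        x, y = data[rng.randrange(len(data))]
        sgd_step(W, x, y, eta2)
    return W
```

Setting T1 = 0 recovers the baseline and T2 = 0 recovers LSR, matching the discussion above; the second stage starts from the weights the first stage produced.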

5.1. CONVERGENCE RESULT OF TSLA

In this subsection, we give the convergence result of the proposed TSLA algorithm. For simplicity, we use SGD as the subroutine algorithm A in the analysis. The convergence result in the following theorem shows the power of LSR from the optimization perspective. Its proof is presented in Appendix D. It is easy to see from the proof that, by using the last output of the first stage as the initial point of the second stage, TSLA enjoys the advantage of LSR in the second stage with an improved convergence.

Table 1: Comparisons of Total Sample Complexity

Possibilities on δ †         | TSLA         | LSR   | baseline
Ω(ε²) < δ                    | δ/ε⁴         | ∞     | 1/ε⁴
δ = O(ε²)                    | 1/ε²         | 1/ε²  | 1/ε⁴
Ω(ε⁴) < δ < O(ε²)            | 1/ε^(2−θ) *  | 1/ε²  | 1/ε⁴
Ω(ε^(4+c)) ≤ δ ≤ O(ε⁴) **    | log(1/ε)     | 1/ε²  | 1/ε⁴

† given a target accuracy level ε; * θ ∈ (0, 2); ** c ≥ 0 is a constant

Theorem 4. Under Assumptions 1 and 2, suppose δσ²/µ ≤ F(w_0); run Algorithm 2 with A = SGD, θ = 1/(1 + δ), η_1 = 1/L, T_1 = log(2µF(w_0)(1 + δ)/(2δσ²))/(η_1µ), η_2 = ε²/(2Lσ²), and T_2 = 8δσ²/(µη_2ε²); then E_R[‖∇F(w_R)‖²] ≤ ε², where R is uniformly sampled from {T_1, . . . , T_1 + T_2 − 1}.

Remark. It is obvious that the learning rate η_2 in the second stage is roughly smaller than the learning rate η_1 in the first stage, which matches the widely used stage-wise learning-rate decay scheme in training neural networks. To explore the total sample complexity of TSLA, we consider different conditions on δ. We summarize the total sample complexities of finding ε-stationary points for SGD with TSLA (TSLA), SGD with LSR (LSR), and SGD without LSR (baseline) in Table 1, where ε ∈ (0, 1) is the target convergence level; we only present the orders of the complexities and ignore all constants. When Ω(ε²) < δ < 1, LSR does not converge to an ε-stationary point (denoted by ∞), while TSLA reduces the sample complexity from O(1/ε⁴) to O(δ/ε⁴) compared to the baseline. When δ ≤ O(ε²), the total complexity of TSLA is between log(1/ε) and 1/ε², which is always better than LSR and the baseline.
In summary, TSLA achieves the best total sample complexity by enjoying the good property of an appropriate label smoothing (i.e., when 0 < δ < 1). However, when δ ≥ 1, the baseline has better convergence than TSLA, meaning that the selection of the label ỹ is not appropriate.

6. EXPERIMENTS

To further evaluate the performance of the proposed TSLA method, we trained deep neural networks on three benchmark data sets, CIFAR-100 (Krizhevsky & Hinton, 2009), Stanford Dogs (Khosla et al., 2011), and CUB-2011 (Wah et al., 2011), for image classification tasks. CIFAR-100 has 50,000 training images and 10,000 testing images of 32×32 resolution with 100 classes. The Stanford Dogs data set contains 20,580 images of 120 breeds of dogs, where 100 images from each breed are used for training. CUB-2011 is a bird image data set with 11,788 images of 200 bird species. The ResNet-18 model (He et al., 2016) is used as the backbone in the experiments. We compare the proposed TSLA incorporated with SGD (TSLA) against two baselines, SGD with LSR (LSR) and SGD without LSR (baseline). The mini-batch size of training instances for all methods is 256, as suggested by He et al. (2019) and He et al. (2016). The momentum parameter is fixed at 0.9.

6.1. STANFORD DOGS AND CUB-2011

We separately train ResNet-18 (He et al., 2016) for up to 90 epochs on the two data sets Stanford Dogs and CUB-2011. We use weight decay with a parameter value of 10⁻⁴. For all algorithms, the initial learning rate for the FC layer is set to 0.1, while those for the pre-trained backbones are 0.001 and 0.01 for Stanford Dogs and CUB-2011, respectively. The learning rates are divided by 10 every 30 epochs. For LSR, we fix the value of the smoothing strength at θ = 0.4 for the best performance, and the label ỹ used for label smoothing is set to the uniform distribution over all K classes, i.e., ỹ = (1/K)1. The same value of the smoothing strength θ and the same ỹ are used during the first stage of TSLA. For TSLA, we drop off the LSR (i.e., set θ = 0) after s epochs during the training process, where s ∈ {20, 30, 40, 50, 60, 70, 80}. We first report the highest top-1 and top-5 accuracy on the testing data sets for the different methods. All top-1 and top-5 accuracies are averaged over 5 independent random trials with their standard deviations. The results of the comparison are summarized in Table 2, where the notation "TSLA(s)" means that the TSLA algorithm drops off LSR after epoch s. It can be seen from Table 2 that, under an appropriate hyper-parameter setting, the models trained using TSLA outperform those trained using LSR and baseline, which supports the convergence result in Section 5. We notice that the best top-1 accuracies of TSLA are achieved by TSLA(40) and TSLA(50) for Stanford Dogs and CUB-2011, respectively, meaning that the performance of TSLA(s) is not monotonic in the dropping epoch s. For CUB-2011, the top-1 accuracy of TSLA(20) is smaller than that of LSR. This result matches the convergence analysis of TSLA, showing that it cannot drop off LSR too early. For top-5 accuracy, we found that TSLA(80) is slightly worse than baseline.
This is because dropping LSR too late makes the number of update iterations (i.e., T_2) in the second stage of TSLA too small to converge to a good solution. We also observe that LSR is better than baseline in terms of top-1 accuracy, but the opposite holds for top-5 accuracy. We then plot the top-1 accuracy, top-5 accuracy, and loss, each averaged over 5 trials, for the different methods in Figure 1. We omit the results for TSLA(20) since it drops off LSR too early, as mentioned before. The figure shows that TSLA improves the top-1 and top-5 testing accuracy immediately once it drops off LSR. Although TSLA may not converge if it drops off LSR too late (see TSLA(60), TSLA(70), and TSLA(80) in the third column of Figure 1), it still has the best performance compared to LSR and baseline. TSLA(30), TSLA(40), and TSLA(50) converge to lower objective levels compared to LSR and baseline.

6.2. CIFAR-100

The total number of epochs for training ResNet-18 (He et al., 2016) on CIFAR-100 is set to 200. Weight decay with a parameter value of 5 × 10⁻⁴ is used. We use 0.1 as the initial learning rate for all algorithms and divide it by 10 every 60 epochs, as suggested in (He et al., 2016; Zagoruyko & Komodakis, 2016). For LSR and the first stage of TSLA, the value of the smoothing strength is fixed at θ = 0.1, which gives the best performance for LSR. We use two different labels ỹ to smooth the one-hot label: the uniform distribution over all labels and the distribution predicted by an ImageNet pre-trained model, which is downloaded directly from PyTorch (Paszke et al., 2019). For TSLA, we drop off the LSR after s epochs during the training process, where s ∈ {120, 140, 160, 180}. All top-1 and top-5 accuracies on the testing data set are averaged over 5 independent random trials with their standard deviations. We summarize the results in Table 3, where LSR-pre and TSLA-pre indicate that LSR and TSLA use the label ỹ based on the ImageNet pre-trained model. The results show that LSR-pre/TSLA-pre performs better than LSR/TSLA. The reason might be that the pre-trained model-based prediction is closer to the ground truth than the uniform prediction and has lower variance (smaller δ). Then, TSLA (LSR) with such a pre-trained model-based prediction converges faster than TSLA (LSR) with the uniform prediction, which verifies our theoretical findings in Section 5 (Section 4). This observation also empirically tells us that the selection of the prediction ỹ used for smoothing the label is key to the success of TSLA as well as LSR. Among all methods, the performance of TSLA-pre is the best. For top-1 accuracy, TSLA-pre(160) outperforms all other algorithms, while for top-5 accuracy, TSLA-pre(180) has the best performance. Finally, we observe from Figure 2 that both TSLA and TSLA-pre converge, while TSLA-pre converges to the lowest objective value.
Similarly, the top-1 and top-5 accuracies show the improvements of TSLA and TSLA-pre at the point of dropping off LSR.

7. CONCLUSIONS

In this paper, we have studied the power of LSR in training deep neural networks by analyzing SGD with LSR in different non-convex optimization settings. The convergence results show that an appropriate LSR with reduced label variance can help speed up convergence. We have proposed a simple and efficient strategy, so-called TSLA, that can incorporate many stochastic algorithms. The basic idea of TSLA is to switch the training from smoothed labels to one-hot labels. Integrating TSLA with SGD, we observe from its improved convergence result that TSLA benefits from LSR in the first stage and essentially converges faster in the second stage. Through extensive experiments, we have shown that TSLA improves the generalization accuracy of deep models on benchmark data sets.

A TECHNICAL LEMMA

Recall that the optimization problem is

min_{w∈W} F(w) := E_{(x,y)}[ℓ(y, f(w; x))], (7)

where the cross-entropy loss function is given by

ℓ(y, f(w; x)) = −∑_{i=1}^K y_i log( exp(f_i(w; x)) / ∑_{j=1}^K exp(f_j(w; x)) ). (8)

If we set p(w; x) = (p_1(w; x), . . . , p_K(w; x)) ∈ R^K with

p_i(w; x) = −log( exp(f_i(w; x)) / ∑_{j=1}^K exp(f_j(w; x)) ), (9)

then problem (7) becomes

min_{w∈W} F(w) := E_{(x,y)}[⟨y, p(w; x)⟩]. (10)

The stochastic gradient with respect to w is then

∇ℓ(y, f(w; x)) = ⟨y, ∇p(w; x)⟩. (11)

Lemma 1. Under Assumption 1 (i), we have E[‖∇ℓ(y^LS_t, f(w_t; x_t)) − ∇F(w_t)‖²] ≤ (1 − θ)σ² + θδσ².

Proof. By the fact that y^LS_t = (1 − θ)y_t + θỹ_t and the equation in (11), we have

∇ℓ(y^LS_t, f(w_t; x_t)) = (1 − θ)∇ℓ(y_t, f(w_t; x_t)) + θ∇ℓ(ỹ_t, f(w_t; x_t)).

Therefore,

E[‖∇ℓ(y^LS_t, f(w_t; x_t)) − ∇F(w_t)‖²]
= E[‖(1 − θ)[∇ℓ(y_t, f(w_t; x_t)) − ∇F(w_t)] + θ[∇ℓ(ỹ_t, f(w_t; x_t)) − ∇F(w_t)]‖²]
(a)≤ (1 − θ)E[‖∇ℓ(y_t, f(w_t; x_t)) − ∇F(w_t)‖²] + θE[‖∇ℓ(ỹ_t, f(w_t; x_t)) − ∇F(w_t)‖²]
(b)≤ (1 − θ)σ² + θδσ²,

where (a) uses the convexity of the squared norm, i.e., ‖(1 − θ)a + θb‖² ≤ (1 − θ)‖a‖² + θ‖b‖²; (b) uses Assumption 1 (i) and the definition in (4).
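Step (a) of the proof is the convexity of the squared norm. The inequality is easy to check numerically; the following is our own small verification sketch, not part of the paper:

```python
def sq_norm(v):
    return sum(x * x for x in v)

def jensen_gap(a, b, theta):
    # gap of the inequality ‖(1-θ)a + θb‖² ≤ (1-θ)‖a‖² + θ‖b‖²;
    # a non-negative return value means the inequality holds for (a, b, θ)
    mix = [(1 - theta) * ai + theta * bi for ai, bi in zip(a, b)]
    return (1 - theta) * sq_norm(a) + theta * sq_norm(b) - sq_norm(mix)
```

Sampling random vectors a, b and mixing weights θ ∈ (0, 1) always yields a non-negative gap, as the convexity of ‖·‖² guarantees.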

B PROOF OF THEOREM 3

Proof. By the smoothness of the objective function F(w) in Assumption 1 (ii) and its remark, we have

F(w_{t+1}) − F(w_t)
≤ ⟨∇F(w_t), w_{t+1} − w_t⟩ + (L/2)‖w_{t+1} − w_t‖²
(a)= −η⟨∇F(w_t), ∇ℓ(y^LS_t, f(w_t; x_t))⟩ + (η²L/2)‖∇ℓ(y^LS_t, f(w_t; x_t))‖²
(b)= −(η/2)‖∇F(w_t)‖² + (η/2)‖∇F(w_t) − ∇ℓ(y^LS_t, f(w_t; x_t))‖² + (η(ηL − 1)/2)‖∇ℓ(y^LS_t, f(w_t; x_t))‖²
(c)≤ −(η/2)‖∇F(w_t)‖² + (η/2)‖∇F(w_t) − ∇ℓ(y^LS_t, f(w_t; x_t))‖², (12)

where (a) is due to the update of w_{t+1}; (b) is due to ⟨a, −b⟩ = (‖a − b‖² − ‖a‖² − ‖b‖²)/2; (c) is due to η = 1/L. Taking the expectation over (x_t, y^LS_t) on both sides of (12), we have

E[F(w_{t+1}) − F(w_t)] ≤ −(η/2)E[‖∇F(w_t)‖²] + (η/2)E[‖∇F(w_t) − ∇ℓ(y^LS_t, f(w_t; x_t))‖²]
≤ −(η/2)E[‖∇F(w_t)‖²] + (η/2)[(1 − θ)σ² + θδσ²], (13)

where the last inequality is due to Lemma 1. Inequality (13) then implies

(1/T)∑_{t=0}^{T−1} E[‖∇F(w_t)‖²] ≤ 2F(w_0)/(ηT) + (1 − θ)σ² + θδσ² (a)= 2F(w_0)/(ηT) + (2δ/(1 + δ))σ² (b)≤ 2F(w_0)/(ηT) + 2δσ²,

where (a) is due to θ = 1/(1 + δ); (b) is due to 1/(1 + δ) ≤ 1.

C CONVERGENCE ANALYSIS OF SGD WITHOUT LSR (θ = 0)

Theorem 5. Under Assumption 1, the solutions w_t from Algorithm 1 with θ = 0 satisfy

(1/T)∑_{t=0}^{T−1} E[‖∇F(w_t)‖²] ≤ 2F(w_0)/(ηT) + ηLσ².

In order to have E_R[‖∇F(w_R)‖²] ≤ ε², it suffices to set η = min{1/L, ε²/(2Lσ²)} and T = 4F(w_0)/(ηε²); the total complexity is O(1/ε⁴).

Proof. By the smoothness of the objective function F(w) in Assumption 1 (ii) and its remark, we have

F(w_{t+1}) − F(w_t) ≤ ⟨∇F(w_t), w_{t+1} − w_t⟩ + (L/2)‖w_{t+1} − w_t‖²
= −η⟨∇F(w_t), ∇ℓ(y_t, f(w_t; x_t))⟩ + (η²L/2)‖∇ℓ(y_t, f(w_t; x_t))‖², (14)

where the equality is due to the update of w_{t+1}. Taking the expectation over (x_t, y_t) on both sides of (14) and using the unbiasedness in Assumption 1 (i), we have E[⟨∇F(w_t), ∇ℓ(y_t, f(w_t; x_t))⟩] = ‖∇F(w_t)‖² and E[‖∇ℓ(y_t, f(w_t; x_t))‖²] = E[‖∇ℓ(y_t, f(w_t; x_t)) − ∇F(w_t)‖²] + ‖∇F(w_t)‖² ≤ σ² + ‖∇F(w_t)‖². Together with η ≤ 1/L, this gives

E[F(w_{t+1}) − F(w_t)] ≤ −(η/2)E[‖∇F(w_t)‖²] + (η²L/2)σ². (15)

Summing (15) over t = 0, . . . , T − 1 and rearranging yields

(1/T)∑_{t=0}^{T−1} E[‖∇F(w_t)‖²] ≤ 2F(w_0)/(ηT) + ηLσ².

With η = min{1/L, ε²/(2Lσ²)} and T = 4F(w_0)/(ηε²), both terms on the right-hand side are at most ε²/2, so E_R[‖∇F(w_R)‖²] ≤ ε². Thus the total complexity is of the order O(1/(ηε²)) = O(1/ε⁴).
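Step (b) of the proof of Theorem 3 relies on the polarization identity ⟨a, −b⟩ = (‖a − b‖² − ‖a‖² − ‖b‖²)/2, which holds for all vectors a, b. A quick numerical check (our own verification sketch, not part of the paper):

```python
def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def polarization_gap(a, b):
    # should be ~0 for every a, b: <a, -b> = (‖a-b‖² - ‖a‖² - ‖b‖²) / 2
    lhs = -inner(a, b)
    diff = [x - y for x, y in zip(a, b)]
    rhs = (inner(diff, diff) - inner(a, a) - inner(b, b)) / 2
    return lhs - rhs
```

Expanding ‖a − b‖² = ‖a‖² − 2⟨a, b⟩ + ‖b‖² shows the gap is identically zero up to floating-point error.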

D PROOF OF THEOREM 4

Proof. Following an analysis similar to that of inequality (13) in the proof of Theorem 3, we have

E[F(w_{t+1}) − F(w_t)] ≤ −(η_1/2)E[‖∇F(w_t)‖²] + (η_1/2)[(1 − θ)σ² + θδσ²]. (16)

Using the condition in Assumption 2, we can simplify (16) to

E[F(w_{t+1}) − F*] ≤ (1 − η_1µ)E[F(w_t) − F*] + (η_1/2)[(1 − θ)σ² + θδσ²]
≤ (1 − η_1µ)^{t+1} E[F(w_0) − F*] + (η_1/2)[(1 − θ)σ² + θδσ²] ∑_{i=0}^t (1 − η_1µ)^i
≤ (1 − η_1µ)^{t+1} E[F(w_0)] + (η_1/2)[(1 − θ)σ² + θδσ²] ∑_{i=0}^t (1 − η_1µ)^i,

where the last inequality is due to the definition of the loss function, which implies F* ≥ 0. Since η_1 ≤ 1/L < 1/µ, we have (1 − η_1µ)^{t+1} < exp(−η_1µ(t + 1)) and ∑_{i=0}^t (1 − η_1µ)^i ≤ 1/(η_1µ). As a result, for any T_1, we have

E[F(w_{T_1}) − F*] ≤ exp(−η_1µT_1)F(w_0) + (1/(2µ))[(1 − θ)σ² + θδσ²]. (17)

Let θ = 1/(1 + δ) and σ̃² := (1 − θ)σ² + θδσ² = (2δ/(1 + δ))σ²; then (1/(2µ))[(1 − θ)σ² + θδσ²] ≤ F(w_0) by the assumption of the theorem. By setting T_1 = log(2µF(w_0)/σ̃²)/(η_1µ), we have

E[F(w_{T_1}) − F*] ≤ σ̃²/µ ≤ 2δσ²/µ. (18)

After T_1 iterations, we drop off the label smoothing, i.e., θ = 0. Then, for any t ≥ T_1, following inequality (15) from the proof of Theorem 5, we have

E[F(w_{t+1}) − F(w_t)] ≤ −(η_2/2)E[‖∇F(w_t)‖²] + (η_2²Lσ²)/2.

Therefore, we get

(1/T_2)∑_{t=T_1}^{T_1+T_2−1} E[‖∇F(w_t)‖²] ≤ (2/(η_2T_2))E[F(w_{T_1}) − F(w_{T_1+T_2−1})] + η_2Lσ²
(a)≤ (2/(η_2T_2))E[F(w_{T_1}) − F*] + η_2Lσ²
(18)≤ 4δσ²/(µη_2T_2) + η_2Lσ², (19)

where (a) is due to F(w_{T_1+T_2−1}) ≥ F*. By setting η_2 = ε²/(2Lσ²) and T_2 = 8δσ²/(µη_2ε²), we have (1/T_2)∑_{t=T_1}^{T_1+T_2−1} E[‖∇F(w_t)‖²] ≤ ε².
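The proof above uses two elementary bounds, ∑_{i=0}^t (1 − η_1µ)^i ≤ 1/(η_1µ) and (1 − η_1µ)^{t+1} < exp(−η_1µ(t + 1)), both valid for 0 < η_1µ < 1. A quick numerical sanity check (our own sketch, not part of the paper):

```python
import math

def geometric_bounds_hold(eta_mu, t):
    # returns (partial-sum bound holds, exponential bound holds),
    # where the bounds are those used in the proof, for 0 < eta_mu < 1
    partial_sum = sum((1 - eta_mu) ** i for i in range(t + 1))
    bound1 = partial_sum <= 1 / eta_mu
    bound2 = (1 - eta_mu) ** (t + 1) <= math.exp(-eta_mu * (t + 1))
    return bound1, bound2
```

The first bound follows from the geometric series ∑_{i=0}^∞ r^i = 1/(1 − r) with r = 1 − η_1µ; the second from 1 − x ≤ exp(−x).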



Footnotes:
1. CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
2. Stanford Dogs: http://vision.stanford.edu/aditya86/ImageNetDogs/
3. CUB-2011: http://www.vision.caltech.edu/visipedia/
4. PyTorch pre-trained models: https://pytorch.org/docs/stable/torchvision/models.html



Figure 1: Testing Top-1, Top-5 Accuracy and Loss on ResNet-18 over Stanford Dogs and CUB-2011. TSLA(s) means TSLA drops off LSR after epoch s.

Figure 2: Testing Top-1, Top-5 Accuracy and Loss on ResNet-18 over CIFAR-100. TSLA(s)/TSLA-pre(s) means TSLA/TSLA-pre drops off LSR/LSR-pre after epoch s.

Table 1: Comparisons of Total Sample Complexity under different possibilities on δ, given a target accuracy level ε.

Table 2: Comparison of Testing Accuracy for Different Methods (mean ± standard deviation, in %).

Table 3: Comparison of Testing Accuracy for Different Methods (mean ± standard deviation, in %). TSLA(s)/TSLA-pre(s): TSLA/TSLA-pre drops off LSR/LSR-pre after epoch s.


E ADDITIONAL EXPERIMENTS

In this section, we conduct an ablation study on the smoothing parameter θ. We follow the same settings as in Subsection 6.2 but use different values of θ in LSR and TSLA. We summarize the results in Table 4. The results show that different values of θ affect the performances of LSR and TSLA. Besides, TSLA(180) has the best performance for each value of θ.

