WEMIX: HOW TO BETTER UTILIZE DATA AUGMENTATION

Abstract

Data augmentation is a widely used training technique in deep learning for improving the generalization ability of networks. Despite many encouraging results, several recent studies have pointed out limitations of the conventional data augmentation scheme in certain scenarios, calling for a better theoretical understanding of data augmentation. In this work, we develop a comprehensive analysis that reveals the pros and cons of data augmentation. The main limitation of data augmentation arises from data bias, i.e., the augmented data distribution can be quite different from the original one. This data bias leads to suboptimal performance of existing data augmentation methods. To this end, we develop two novel algorithms, termed "AugDrop" and "MixLoss", to correct the data bias in data augmentation. Our theoretical analysis shows that both algorithms are guaranteed to improve the effect of data augmentation through bias correction, which is further validated by our empirical studies. Finally, we propose a generic algorithm, "WeMix", that combines AugDrop and MixLoss, whose effectiveness is observed in extensive empirical evaluations.

1. INTRODUCTION

Data augmentation (Baird, 1992; Schmidhuber, 2015) has been a key to the success of deep learning in image classification (He et al., 2019), and is becoming increasingly common in other tasks such as natural language processing (Zhang et al., 2015) and object detection (Zoph et al., 2019). Data augmentation expands the training set by generating virtual instances through random augmentations of the original ones, which alleviates overfitting (Shorten & Khoshgoftaar, 2019) when training large deep neural networks. Despite many encouraging results, data augmentation does not always improve the generalization error (Min et al., 2020; Raghunathan et al., 2020). In particular, Raghunathan et al. (2020) showed that training on augmented data leads to a smaller robust error but potentially a larger standard error. Therefore, it is critical to answer the following two questions before applying data augmentation in deep learning:
• When will deep models benefit from data augmentation?
• How can augmented data be better leveraged during training?
Several previous works (Raghunathan et al., 2020; Wu et al., 2020; Min et al., 2020) tried to address these questions, but their analysis is limited to specific problems such as linear ridge regression and therefore may not be applicable to deep learning. In this work, we aim to answer the two questions from a theoretical perspective under a more general non-convex setting. We address the first question in a more general form covering applications in deep learning. For the second question, we develop new approaches that are provably more effective than the conventional data augmentation approaches. Most data augmentation operations alter the data distribution during the training process. This imposes a data distribution bias (we simply use "data bias" in the rest of this paper) between the augmented data and the original data, which may make it difficult to fully leverage the augmented data.
To be more concrete, let us consider label-mixing augmentation (e.g., mixup (Zhang et al., 2018; Tokozume et al., 2018)). Suppose we have $n$ original data $\mathcal{D} = \{(x_i, y_i), i = 1, \ldots, n\}$, where the input-label pair $(x_i, y_i)$ follows a distribution $P_{xy} = (P_x, P_y(\cdot|x))$; here $P_x$ is the marginal distribution of the inputs and $P_y(\cdot|x)$ is the conditional distribution of the labels given the inputs. We generate $m$ augmented data $\widetilde{\mathcal{D}} = \{(\tilde{x}_i, \tilde{y}_i), i = 1, \ldots, m\}$, where $(\tilde{x}_i, \tilde{y}_i) \sim P_{\tilde{x}\tilde{y}} = (P_{\tilde{x}}, P_{\tilde{y}}(\cdot|\tilde{x}))$, with $P_{\tilde{x}} = P_x$ but $P_{\tilde{y}}(\cdot|\tilde{x}) \neq P_y(\cdot|x)$. Given $x \sim P_x$, the data bias is defined as $\delta_y = \max_{y, \tilde{y}} \|y - \tilde{y}\|$. We will show that when the bias between $\mathcal{D}$ and $\widetilde{\mathcal{D}}$ is large, directly training on the augmented data will not be as effective as training on the original data. Given that augmented data may hurt performance, the next question is how to design better learning algorithms that unleash the power of augmented data. To this end, we develop two novel algorithms to alleviate the data bias. The first algorithm, termed AugDrop, corrects the data bias by introducing a constrained optimization problem. The second algorithm, termed MixLoss, corrects the data bias by introducing a modified loss function. We show, both theoretically and empirically, that even with a large data bias, the proposed algorithms can still improve the generalization performance by effectively leveraging the combination of augmented and original data. We summarize the main contributions of this work as follows:
• We prove that, in a conventional training scheme, a deep model can benefit from augmented data when the data bias is small.
• We design two algorithms, termed AugDrop and MixLoss, that can better leverage augmented data with theoretical guarantees even when the data bias is large.
• Based on our theoretical findings, we propose a new efficient algorithm, WeMix, that combines AugDrop and MixLoss and achieves better performance without extra training cost.
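For intuition, the bias $\delta_y$ can be computed in closed form for mixup with one-hot labels: a mixed label $\tilde{y} = \beta y_i + (1-\beta)y_j$ differs from $y_i$ by $(1-\beta)\sqrt{2}$ in Euclidean norm whenever the two source classes differ. A minimal NumPy sketch (helper names are ours):

```python
import numpy as np

def one_hot(k, num_classes):
    v = np.zeros(num_classes)
    v[k] = 1.0
    return v

def label_bias(y, y_aug):
    # Euclidean distance between an original label and its augmented
    # counterpart; delta_y is the maximum of this quantity over label pairs
    return float(np.linalg.norm(y - y_aug))

beta = 0.7                      # mixup mixing weight
y = one_hot(0, 10)              # original one-hot label
y_aug = beta * one_hot(0, 10) + (1 - beta) * one_hot(1, 10)
# for two distinct classes: ||y - y_aug|| = (1 - beta) * sqrt(2)
bias = label_bias(y, y_aug)
```

With $\beta$ near 1 the mixed label stays close to the original one, so the bias shrinks; $\beta = 0.5$ gives the worst case for a pair of distinct classes.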

2. RELATED WORK

A series of empirical works (Cubuk et al., 2019; Ho et al., 2019; Lim et al., 2019; Lin et al., 2019a; Cubuk et al., 2020; Hataya et al., 2019) has studied how to learn a good policy for applying different data augmentations, but without theoretical guarantees. In this section, we mainly focus on reviewing theoretical studies of data augmentation. For a comprehensive overview, we refer readers to the survey (Shorten & Khoshgoftaar, 2019) and references therein. Several works have attempted to establish theoretical understandings of data augmentation from different perspectives (Dao et al., 2019; Chen et al., 2019; Rajput et al., 2019). Min et al. (2020) showed that, with more training data, weak augmentation can improve performance while strong augmentation always hurts performance. Later on, Chen et al. (2020) studied the gap between the generalization error (see the formal definition in (Chen et al., 2020)) of adversarially trained models and standard models. Both of these theoretical analyses were built on a special linear binary classification model or a linear regression model for label-preserving augmentation. Recently, Raghunathan et al. (2020) studied label-preserving transformations in data augmentation, which is identical to the first case in this paper. Their analysis is restricted to linear least squares regression under a noiseless setting, which is not applicable to training deep neural networks; besides, their analysis requires infinite unlabeled data, whereas we do not require the original data to be unlimited. Wu et al. (2020) considered linear data augmentations. There are several major differences between their work and ours. First, they focus on the ridge linear regression problem, which is strongly convex, while we consider non-convex optimization problems, which are more applicable to deep learning. Second, we study more general data augmentations beyond linear transformations.

3. PRELIMINARIES AND NOTATIONS

We study a learning problem of finding a classifier that maps an input $x \in \mathcal{X}$ onto a label $y \in \mathcal{Y} \subset \mathbb{R}^K$, where $K$ is the number of classes. We assume the input-label pair $(x, y)$ is drawn from a distribution $P_{xy} = (P_x, P_y(\cdot|x))$. Since every augmented example $(\tilde{x}, \tilde{y})$ is generated by applying a certain transformation to either one or multiple examples, we assume that $(\tilde{x}, \tilde{y})$ is drawn from a slightly different distribution $P_{\tilde{x}\tilde{y}} = (P_{\tilde{x}}, P_{\tilde{y}}(\cdot|\tilde{x}))$, where $P_{\tilde{x}}$ is the marginal distribution on the inputs $\tilde{x}$ and $P_{\tilde{y}}(\cdot|\tilde{x})$ (written as $P_{\tilde{y}}$ for simplicity) is the conditional distribution of the labels $\tilde{y}$ given inputs $\tilde{x}$. We sample $n$ training examples $(x_i, y_i), i = 1, \ldots, n$ from the distribution $P_{xy}$ and $m$ training examples $(\tilde{x}_i, \tilde{y}_i), i = 1, \ldots, m$ from $P_{\tilde{x}\tilde{y}}$. We assume that $m \gg n$ due to the data augmentation. We denote by $\mathcal{D} = \{(x_i, y_i), i = 1, \ldots, n\}$ and $\widetilde{\mathcal{D}} = \{(\tilde{x}_i, \tilde{y}_i), i = 1, \ldots, m\}$ the datasets sampled from $P_{xy}$ and $P_{\tilde{x}\tilde{y}}$, respectively. We denote by $\mathcal{T}(x)$ the set of augmented data transformed from $x$. We use the notation $\mathbb{E}_{(x,y)\sim P_{xy}}[\cdot]$ for the expectation taken over a random variable $(x, y)$ following the distribution $P_{xy}$. We denote by $\nabla_w h(w)$ the gradient of a function $h(w)$ with respect to the variable $w$; when the variable is obvious, we use the notation $\nabla h(w)$ for simplicity. We use $\|\cdot\|$ to denote the Euclidean norm of a vector or the spectral norm of a matrix. The augmented data $\widetilde{\mathcal{D}}$ can differ from the original data $\mathcal{D}$ in two cases, according to (Raghunathan et al., 2020). In the first case, often referred to as label-preserving, we consider

$P_y(\cdot|x) = P_{\tilde{y}}(\cdot|\tilde{x}), \ \forall \tilde{x} \in \mathcal{T}(x), \quad \text{but} \quad P_{\tilde{x}} \neq P_x.$ (1)

In the second case, often referred to as label-mixing, we consider

$P_{\tilde{x}} = P_x \quad \text{but} \quad P_y(\cdot|x) \neq P_{\tilde{y}}(\cdot|\tilde{x}), \ \exists \tilde{x} \in \mathcal{T}(x).$ (2)

Examples of label-preserving augmentation include translation, adding noises, small rotation, and brightness or contrast changes (Krizhevsky et al., 2012; Raghunathan et al., 2020).
One important example of label-mixing augmentation is mixup (Zhang et al., 2018; Tokozume et al., 2018). Due to the space limitation, we focus on the label-mixing case; the related studies and analysis for the label-preserving case can be found in Appendix A. To further quantify the difference between original data and augmented data when $P_{\tilde{x}} = P_x$ and $P_{\tilde{y}} \neq P_y$, we introduce the data bias $\delta_y$ given $x \sim P_x$ as follows:

$\delta_y := \max_{y, \tilde{y}} \|y - \tilde{y}\|.$ (3)

The equation in (3) measures the difference between a label from the original data and a label from the augmented data given the input $x$. We aim to learn a prediction function $f(x; w) : \mathbb{R}^D \times \mathcal{X} \to \mathbb{R}^K$ that is as close as possible to $y$, where $w \in \mathbb{R}^D$ is the parameter and the domain $\mathbb{R}^D$ is a closed convex set. We define two objective functions for the optimization problems over the original data and the augmented data, respectively:

$L(w) = \mathbb{E}_{(x,y)}[\ell(y, f(x; w))], \qquad \widetilde{L}(w) = \mathbb{E}_{(\tilde{x},\tilde{y})}[\ell(\tilde{y}, f(\tilde{x}; w))],$ (4)

where $\ell$ is the cross-entropy loss function given by

$\ell(y, f(x; w)) = \sum_{i=1}^K y_i\, p_i(x; w), \quad \text{where } p_i(x; w) = -\log \frac{\exp(f_i(x; w))}{\sum_{j=1}^K \exp(f_j(x; w))}.$ (5)

We denote by $w^*$ and $\tilde{w}^*$ the optimal solutions to $\min_w L(w)$ and $\min_w \widetilde{L}(w)$, respectively:

$w^* \in \arg\min_{w \in \mathbb{R}^D} L(w), \qquad \tilde{w}^* \in \arg\min_{w \in \mathbb{R}^D} \widetilde{L}(w).$ (6)

Taking $L(w)$ as an example, we introduce some function properties used in our analysis.

Definition 1. The stochastic gradients of the objective function $L(w)$ are unbiased and bounded if $\mathbb{E}_{(x,y)}[\nabla_w \ell(y, f(x; w))] = \nabla L(w)$ and there exists a constant $G > 0$ such that $\|\nabla_w p(x; w)\| \leq G$, $\forall x \in \mathcal{X}, \forall w \in \mathbb{R}^D$, where $p(x; w) = (p_1(x; w), \ldots, p_K(x; w))$ is a vector.

Definition 2. $L(w)$ is smooth with an $L$-Lipschitz continuous gradient if there exists a constant $L > 0$ such that $\|\nabla L(w) - \nabla L(u)\| \leq L\|w - u\|$, $\forall w, u \in \mathbb{R}^D$, or equivalently, $L(w) - L(u) \leq \langle \nabla L(u), w - u \rangle + \frac{L}{2}\|w - u\|^2$, $\forall w, u \in \mathbb{R}^D$.
The above properties are standard and widely used in the literature of non-convex optimization (Ghadimi & Lan, 2013; Yan et al., 2018; Yuan et al., 2019; Wang et al., 2019; Li et al., 2020). We also introduce an important property, termed the Polyak-Łojasiewicz (PL) condition (Polyak, 1963), on the objective function $L(w)$.

Definition 3. (PL condition) $L(w)$ satisfies the PL condition if there exists a constant $\mu > 0$ such that $2\mu(L(w) - L(w^*)) \leq \|\nabla L(w)\|^2$, $\forall w \in \mathbb{R}^D$, where $w^*$ is defined in (6).

The PL condition has been observed in training deep and shallow neural networks (Allen-Zhu et al., 2019; Xie et al., 2017), and is widely used in many non-convex optimization studies (Karimi et al., 2016; Li & Li, 2018; Charles & Papailiopoulos, 2018; Yuan et al., 2019; Li et al., 2020). It is also theoretically verified in (Allen-Zhu et al., 2019) and empirically estimated in (Yuan et al., 2019) for deep neural networks. It is worth noting that the PL condition is weaker than many conditions such as strong convexity, restricted strong convexity, and weak strong convexity (Karimi et al., 2016). Finally, we refer to $\kappa = L/\mu$ as the condition number throughout this study.
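As a concrete check of the notation above, the cross-entropy loss in (5) can be written out directly; the sketch below (NumPy, with a log-sum-exp stabilization that the paper does not discuss) evaluates $\ell(y, f(x; w))$, which for a one-hot label reduces to the negative log-softmax of the true class:

```python
import numpy as np

def p_vec(logits):
    # p_i(x; w) = -log( exp(f_i) / sum_j exp(f_j) ), computed stably
    shifted = logits - logits.max()
    return -(shifted - np.log(np.exp(shifted).sum()))

def cross_entropy(y, logits):
    # l(y, f(x; w)) = sum_i y_i * p_i(x; w), as in Eq. (5)
    return float(np.dot(y, p_vec(logits)))

y = np.array([1.0, 0.0, 0.0])          # one-hot label
logits = np.array([2.0, 1.0, 0.1])     # f(x; w)
loss = cross_entropy(y, logits)
```

The same function accepts soft labels (e.g., mixup labels), since (5) is linear in $y$.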

4. MAIN RESULTS

In this section, we present the main results for label-mixing augmentation satisfying (2). Due to the space limitation, we present the results for label-preserving augmentation satisfying (1) in Appendix A. Since we have access to $m \gg n$ augmented data, it is natural to fully leverage the augmented data $\widetilde{\mathcal{D}}$ during training. On the other hand, due to the data bias $\delta_y$, the prediction model learned from the augmented data $\widetilde{\mathcal{D}}$ could be even worse than the model trained directly on the original data $\mathcal{D}$, as revealed by Lemma 1 (its proof can be found in Appendix C) and its remark. Throughout this section, suppose that mini-batch SGD is used for optimization; i.e., to optimize $L(w)$, we have

$w_{t+1} = w_t - \frac{\eta}{m_0} \sum_{k=1}^{m_0} \nabla_w \ell(y_{k,t}, f(x_{k,t}; w_t)),$ (7)

where $\eta$ is the step size, $m_0$ is the batch size, and $(x_{k,t}, y_{k,t}), k = 1, \ldots, m_0$ are sampled from $\mathcal{D}$. A similar mini-batch SGD algorithm can be developed for the augmented data.

Lemma 1. Assume that $L$ and $\widetilde{L}$ satisfy the properties in Definitions 1, 2 and 3. By setting $\eta = 1/L$ and $m_0 \geq 8/\delta_y^2$, when $t \geq \frac{L}{\mu} \log \frac{4(L(w_1) - L(w^*))\mu}{\delta_y^2 G^2}$, we have

$\mathbb{E}[L(w_{t+1}) - L(w^*)] \leq \delta_y^2 G^2 / \mu \leq O(\delta_y^2 / \mu),$ (8)

where $w_{t+1}$ is the output of mini-batch SGD trained on $\widetilde{\mathcal{D}}$ and $\delta_y$ is defined in (3).

Remark: It is easy to verify (see the details of the proof in Appendix D) that if we simply train the learning model on the original data $\mathcal{D}$, we have

$\mathbb{E}[L(w_{n+1}) - L(w^*)] \leq O(L \log(n) / (n\mu^2)).$ (9)

Comparing the result in (9) with the result (8) in Lemma 1, it is easy to show that when the data bias is too large, i.e., $\delta_y^2 \geq \Omega(L \log(n)/(n\mu))$, we have $O(L \log(n)/(n\mu^2)) \leq O(\delta_y^2/\mu)$. This implies that training the deep model directly on the original data $\mathcal{D}$ is more effective than on the augmented data $\widetilde{\mathcal{D}}$. Hence, in order to better leverage the augmented data in the presence of large data bias ($\delta_y^2 \geq \Omega(\kappa \log(n)/n)$, where $\kappa = L/\mu$), we need to come up with approaches that automatically correct the data bias.
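The mini-batch update in (7) is simply a step along the averaged per-example gradient; a toy sketch on a quadratic loss (all names and the toy loss are ours):

```python
import numpy as np

def sgd_step(w, per_example_grads, eta):
    # w_{t+1} = w_t - (eta / m_0) * sum_k grad_k, i.e. a step along the
    # mini-batch-averaged gradient, as in Eq. (7)
    return w - eta * np.mean(per_example_grads, axis=0)

# toy loss l(w; x) = 0.5 * ||w - x||^2, so the per-example gradient is w - x
w = np.zeros(2)
batch = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grads = [w - x for x in batch]
w = sgd_step(w, grads, eta=0.5)
```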
Below, we develop two approaches to correct the data bias. The first approach, termed "AugDrop", corrects the data bias by introducing a constrained optimization approach, and the second approach, termed "MixLoss", addresses the problem by introducing a modified loss function.

4.1. AugDrop: CORRECTING DATA BIAS BY CONSTRAINED OPTIMIZATION

To address this challenge, we propose a constrained optimization problem:

$\min_{w \in \mathbb{R}^D} L(w) \quad \text{s.t.} \quad \widetilde{L}(w) - \widetilde{L}(\tilde{w}^*) \leq \gamma,$ (10)

where $\gamma > 0$ is a positive constant and $\tilde{w}^*$ is defined in (6). The key idea is that by utilizing the augmented data to constrain the solution to a small region, we will be able to enjoy a smaller condition number, leading to better performance in optimizing $L(w)$. To make this concrete, we first define three important terms:

$\gamma_0 := \delta_y^2 G^2 / (2\mu),$

$\mathcal{A}(\gamma) = \{w : \widetilde{L}(w) - \widetilde{L}(\tilde{w}^*) \leq \gamma\},$ (11)

$\mu(\gamma) = \max\{\mu' : L(w) - L(w^*) \leq \|\nabla L(w)\|^2 / (2\mu'), \ \forall w \in \mathcal{A}(\gamma)\}.$ (12)

We then present a proposition about $\mathcal{A}(\gamma)$ and $\mu(\gamma)$, whose proof is included in Appendix E.

Proposition 1. If $\gamma \in [\gamma_0, 8\gamma_0]$, we have $w^* \in \mathcal{A}(\gamma)$ and $\mu(\gamma) \geq \mu$.

According to Proposition 1, by restricting our solutions to $\mathcal{A}(\gamma)$, we have a smaller condition number (since $\mu(\gamma) \geq \mu$) and consequently a smaller optimization error. It is worth mentioning that restricting solutions to $\mathcal{A}(\gamma)$ is reasonable because the optimal solution $w^*$ lies in $\mathcal{A}(\gamma)$. The idea of using augmentation transformations to restrict the candidate solution was recognized by several earlier studies, e.g., (Raghunathan et al., 2020), but none of these studies cast it into a constrained optimization problem, a key contribution of our work. The next question is how to solve the constrained optimization problem in (10). It is worth noting that neither $L(w)$ nor $\widetilde{L}(w)$ is convex. Although multiple approaches can be used to solve non-convex constrained optimization problems (Cartis et al., 2011; Lin et al., 2019b; Birgin & Martínez, 2020; Grapiglia & Yuan, 2019; Wright, 2001; O'Neill & Wright, 2020; Boob et al., 2019; Ma et al., 2019), they are too complicated to be implemented in deep learning. Instead, we present a simple approach that divides the optimization into two stages, which is referred to as AugDrop (see the details of the update steps in Algorithm 2 in Appendix F).
• Stage I. We minimize $\widetilde{L}(w)$ over the augmented data $\widetilde{\mathcal{D}}$.
It runs mini-batch SGD on $\widetilde{\mathcal{D}}$ for at least $T_1$ iterations with mini-batch size $m_1$. We denote by $w_{T_1+1}$ the final output solution of this stage.
• Stage II. We minimize $L(w)$ using the original data $\mathcal{D}$. It initializes the solution with $w_{T_1+1}$ and runs mini-batch SGD on $\mathcal{D}$ for $n/m_2$ iterations with mini-batch size $m_2$.
We notice that AugDrop is closely related to TSLA by (Xu et al., 2020), where the first stage trains with label smoothing and the second stage trains without label smoothing. However, they study how to reduce the variance of the stochastic gradient when using label smoothing, while we study how to correct the bias in data augmentation by solving a constrained optimization problem. The following theorem states that if we run this two-stage optimization algorithm, we can achieve better performance since $\mu(8\gamma_0)$ is larger than $\mu$. We include its proof in Appendix F.

Theorem 1. Define $\mu_c = \mu(8\gamma_0)$. Assume that $L$ and $\widetilde{L}$ satisfy the properties in Definitions 1, 2 and 3. Set the learning rate $\eta_1 = 1/L$ in Stage I and the learning rate $\eta_2 = \frac{1}{2n\mu_c} \log \frac{8n\mu_c^2 (L(w_1) - L(w^*))}{G^2 L}$ in Stage II of AugDrop. Let $w_1$ be the initial solution in Stage I of AugDrop and $w_{T_1+2}, \ldots, w_{T_1+n/m_2+1}$ be the intermediate solutions obtained by mini-batch SGD in Stage II of AugDrop. Choose $T_1 = \frac{1}{\eta_1 \mu} \log \frac{2(\widetilde{L}(w_1) - \widetilde{L}(\tilde{w}^*))\mu}{\delta_y^2 G^2}$, $m_1 = \left(1 + 3\log\frac{2T_1}{\delta}\right)\frac{8}{\delta_y^2}$, and $m_2 = \left(1 + 3\log\frac{2n}{\delta}\right)\frac{4}{\delta_y^2}$. Then, with probability $1 - \delta$, we have $w_t \in \mathcal{A}(8\gamma_0)$, $\forall t \in \{T_1+2, \ldots, T_1+n/m_2+1\}$, and

$\mathbb{E}[L(\hat{w}) - L(w^*)] \leq \frac{G^2 L}{4n\mu_c^2}\left(1 + \log \frac{4n\mu_c^2 (L(w_1) - L(w^*))}{G^2 L}\right) \leq O\!\left(\frac{L \log(n)}{n\mu_c^2}\right),$ (13)

where $\hat{w} = w_{T_1+n/m_2+1}$ and $\delta_y$ is defined in (3).

Remark. Theorem 1 shows that all intermediate solutions $w_t$ obtained in Stage II of AugDrop satisfy the constraint $\widetilde{L}(w_t) - \widetilde{L}(\tilde{w}^*) \leq 8\gamma_0$, that is, $w_t \in \mathcal{A}(8\gamma_0)$. Based on Proposition 1, we thus enjoy a larger $\mu_c$ than $\mu$.
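The two-stage procedure can be sketched on a scalar toy problem of our own design, where the augmented objective is deliberately biased; Stage I converges to the biased optimum and Stage II, warm-started from it, recovers the original optimum (a sketch, not the paper's implementation):

```python
def aug_drop(w, aug_batches, orig_batches, t1, grad, eta):
    # Stage I: at least T1 mini-batch SGD steps on the augmented data
    for _, x in zip(range(t1), aug_batches):
        w = w - eta * grad(w, x)
    # Stage II: warm-start from Stage I and train on the original data only
    for x in orig_batches:
        w = w - eta * grad(w, x)
    return w

grad = lambda w, x: w - x      # gradient of the toy loss l(w; x) = 0.5 * (w - x)^2
aug = [0.5] * 50               # augmented data pull toward a biased optimum 0.5
orig = [0.0] * 50              # original data pull toward the true optimum 0.0
w_final = aug_drop(0.0, aug, orig, t1=50, grad=grad, eta=0.5)
```

In practice each `x` would be a mini-batch and `grad` a backpropagated gradient; the point is only the warm-start structure of the two stages.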
Comparing the result (13) in Theorem 1 with (9), training with AugDrop results in better performance than directly training on $\mathcal{D}$, due to $\mu_c \geq \mu$. Besides, when the data bias is large, i.e., $\delta_y^2 \geq \Omega(L \log(n)/(n\mu))$, we know $O(L \log(n)/(n\mu_c^2)) \leq O(\mu \delta_y^2 / \mu_c^2) \leq O(\delta_y^2 / \mu)$, where the last inequality holds due to $\mu_c \geq \mu$. By comparing (13) with the result (8) in Lemma 1, we know that training with AugDrop achieves better performance than directly training on $\widetilde{\mathcal{D}}$ when the data bias is large. By solving a constrained problem, the AugDrop algorithm corrects the data bias and thus enjoys a better performance.

4.2. MixLoss: CORRECTING DATA BIAS BY MODIFIED LOSS FUNCTION

Without loss of generality, we set $L(w^*) = 0$, a common property observed in training deep neural networks (Zhang et al., 2016; Allen-Zhu et al., 2019; Du et al., 2018; 2019; Arora et al., 2019; Chizat et al., 2019; Hastie et al., 2019; Yun et al., 2019; Zou et al., 2020). Since $\|y - \tilde{y}\| \leq \delta_y$ for any $y$ and $\tilde{y}$ given $x$, we define a new loss function $\ell_a(\tilde{y}, f(\tilde{x}; w))$ as

$\ell_a(\tilde{y}, f(\tilde{x}; w)) = \min_{\|z - \tilde{y}\| \leq \delta_y} \ell(z, f(\tilde{x}; w)).$ (14)

Since the cross-entropy loss $\ell(z, \cdot)$ is convex in terms of $z \in \mathcal{Y}$, the minimization problem (14) is a convex optimization problem and has a closed-form solution (Boyd & Vandenberghe, 2004). Using this new loss, we define a new objective function $L_a(w)$:

$L_a(w) = \mathbb{E}_{(\tilde{x}, \tilde{y})}[\ell_a(\tilde{y}, f(\tilde{x}; w))] = \mathbb{E}_{(\tilde{x}, \tilde{y})}\left[\min_{\|z - \tilde{y}\| \leq \delta_y} \ell(z, f(\tilde{x}; w))\right].$ (15)

Algorithm 1 WeMix
1: Input: $T_1$, $T_2$, stochastic algorithms $A_1$, $A_2$ (e.g., momentum SGD, SGD)
2: Initialize: $w_1 \in \mathbb{R}^D$, $\lambda \in (0, 1)$, $\eta_1, \eta_2 > 0$
3: for $t = 1, \ldots, T_1$ do
4: compute stochastic gradient $g_t = \lambda \nabla \ell(y_{i_t}, f(x_{i_t}; w_t)) + (1 - \lambda)\frac{1}{m_0}\sum_{i=1}^{m_0} \nabla \ell_a(\tilde{y}_{t,i}, f(\tilde{x}_{t,i}; w_t))$
5: $w_{t+1} = A_1(w_t; g_t, \eta_1)$ // update one step of $A_1$
6: end for
7: for $t = T_1 + 1, \ldots, T_1 + T_2$ do
8: compute stochastic gradient $g_t = \nabla \ell(y_{i_t}, f(x_{i_t}; w_t))$
9: $w_{t+1} = A_2(w_t; g_t, \eta_2)$ // update one step of $A_2$
10: end for
11: Output: $w_{T_1+T_2+1}$

It is easy to verify that $L_a(w^*) = 0$ and therefore $w^*$ also minimizes $L_a(w)$ (see Appendix G). In contrast, $\tilde{w}^*$, the minimizer of $\widetilde{L}(w)$, can be very different from $w^*$.
Hence, we can correct the data bias arising from the augmented data by replacing $\widetilde{L}(w)$ with $L_a(w)$, leading to the following optimization problem:

$\min_{w \in \mathbb{R}^D} L_c(w) = \lambda L(w) + (1 - \lambda) L_a(w),$ (16)

where $\lambda \in (0, 1)$. Since $L_c(w)$ shares the same minimizer with $L(w)$ (see Appendix G), it is sufficient to optimize $L_c(w)$ instead of $L(w)$. The main advantage of minimizing $L_c(w)$ over $L(w)$ is that by introducing a small $\lambda$, we are able to reduce the variance in computing the gradient of $L_c(w)$, and therefore improve the overall convergence. More specifically, our SGD method is given as follows: at each iteration $t$, we compute the approximate gradient

$g_t = \lambda \nabla \ell(y_t, f(x_t; w_t)) + (1 - \lambda)\frac{1}{m_0}\sum_{i=1}^{m_0} \nabla \ell_a(\tilde{y}_{t,i}, f(\tilde{x}_{t,i}; w_t)),$

where $(x_t, y_t)$ is an example sampled from $\mathcal{D}$ at iteration $t$. We refer to this approach as MixLoss (see the details of the update steps in Algorithm 3 in Appendix H). We then give the convergence result in the following theorem, whose proof is included in Appendix H.

Theorem 2. Assume that $L$, $\widetilde{L}$ and $L_a$ satisfy the properties in Definitions 1, 2 and 3. By setting $m_0 \geq \frac{72(1-\lambda)^2}{\lambda^2}$ and $\eta = \frac{1}{\mu n} \log \frac{n \mu^2 L(w_1)}{\lambda^2 L G^2} \leq \frac{1}{2L}$ in MixLoss, we have

$\mathbb{E}[L(w_{n+1}) - L(w^*)] \leq \frac{\lambda L G^2}{n \mu^2}\left(1 + 5 \log \frac{n \mu^2 L(w_1)}{\lambda^2 L G^2}\right) \leq O\!\left(\frac{\lambda L \log(n/\lambda^2)}{n \mu^2}\right).$ (17)

Remark. According to the results in (17) and (9), we know that $O(\lambda L \log(n/\lambda^2)/(n\mu^2)) \leq O(L \log(n)/(n\mu^2))$ when an appropriate $\lambda \in (0, 1)$ is selected, leading to better performance with MixLoss compared with training on the original data $\mathcal{D}$. For example, one can simply use $\lambda = O(\mu/L)$. On the other hand, when the data bias is large, i.e., $\delta_y^2 \geq \Omega(L \log(n)/(n\mu))$, we know $O(L \log(n)/(n\mu^2)) \leq O(\delta_y^2/\mu)$. Based on the previous discussion, by choosing an appropriate $\lambda \in (0, 1)$ (e.g., $\lambda = O(\mu/L)$), we have $O(\lambda L \log(n/\lambda^2)/(n\mu^2)) \leq O(\delta_y^2/\mu)$.
Then, by comparing (17) with (8), we know that training with MixLoss achieves better performance than directly training on $\widetilde{\mathcal{D}}$ when the data bias is large. Therefore, by solving the problem with a modified loss function, the MixLoss algorithm enjoys better performance by correcting the data bias.
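For the cross-entropy loss, $\ell(z, f) = \langle z, p \rangle$ is linear in $z$, so the minimization in (14) over the Euclidean ball has the explicit solution $z^* = \tilde{y} - \delta_y\, p/\|p\|$. The sketch below implements this closed form together with the MixLoss gradient $g_t$; the ball-minimizer formula is our own derivation (the paper only states that a closed form exists):

```python
import numpy as np

def p_vec(logits):
    # p_i = -log softmax_i, the vector p(x; w) from Eq. (5)
    s = logits - logits.max()
    return -(s - np.log(np.exp(s).sum()))

def corrected_loss(y_aug, logits, delta):
    # l_a from Eq. (14): minimize the linear function <z, p> over the ball
    # ||z - y_aug|| <= delta; the minimizer is z* = y_aug - delta * p / ||p||
    p = p_vec(logits)
    z_star = y_aug - delta * p / np.linalg.norm(p)
    return float(np.dot(z_star, p))

def mixloss_grad(grad_orig, grads_aug, lam):
    # g_t = lam * (grad of l on one original example)
    #     + (1 - lam) * (average grad of l_a over m_0 augmented examples)
    return lam * grad_orig + (1 - lam) * np.mean(grads_aug, axis=0)
```

Note that $\ell_a(\tilde{y}, f) = \langle \tilde{y}, p \rangle - \delta_y \|p\|$, i.e., the corrected loss is the plain cross-entropy on the augmented label minus a margin proportional to the bias bound.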

4.3. WeMix: A GENERIC WEIGHTED MIXED LOSSES WITH AUGMENTATION DROPPING ALGORITHM

Inspired by the previous theoretical analysis of using augmented data, we propose a generic framework of weighted mixed losses with augmentation dropping that builds upon the two algorithms, AugDrop and MixLoss. Algorithm 1 describes our procedure in detail, which is referred to as WeMix. It consists of two stages: in the first stage, it runs a stochastic algorithm $A_1$ (e.g., momentum SGD, SGD) for solving the weighted mixed losses (16); in the second stage, it runs another (or the same) stochastic algorithm $A_2$ on the original data only. Candidates for $A_1$ and $A_2$ include stochastic momentum methods (Polyak, 1964; Nesterov, 1983; Yan et al., 2018) and adaptive methods (Duchi et al., 2011; Hinton et al., 2012; Zeiler, 2012; Kingma & Ba, 2015; Dozat, 2016; Reddi et al., 2018). We can also replace $\ell_a$ by $\ell$ to avoid solving a minimization problem. The last solution of the first stage is used as the initial solution of the second stage. If $\lambda = 0$ and $\ell_a = \ell$, then WeMix reduces to AugDrop; if $T_2 = 0$, WeMix reduces to MixLoss. For the label-preserving case, we only need to use $\ell_a = \ell$ (i.e., $\delta_f = 0$) in WeMix.
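The two stages of WeMix can be sketched on a toy scalar problem (all names and the toy losses are ours; $A_1 = A_2 = $ plain SGD here): Stage I follows the weighted mixed loss, Stage II drops the augmented data, and the stated reductions to AugDrop ($\lambda = 0$, $\ell_a = \ell$) and MixLoss ($T_2 = 0$) fall out of the same loop structure.

```python
def wemix(w, orig_data, aug_data, t1, t2, lam, eta, grad_orig, grad_aug):
    # Stage I: T1 steps on the weighted mixed loss lam*L + (1-lam)*L_a
    for t in range(t1):
        g = lam * grad_orig(w, orig_data[t % len(orig_data)]) \
            + (1 - lam) * grad_aug(w, aug_data[t % len(aug_data)])
        w = w - eta * g
    # Stage II: T2 steps on the original loss only (augmentation dropped)
    for t in range(t2):
        w = w - eta * grad_orig(w, orig_data[t % len(orig_data)])
    return w

grad = lambda w, x: w - x      # toy quadratic losses, original optimum at 0
w_out = wemix(0.0, [0.0], [1.0], t1=50, t2=50, lam=0.1, eta=0.5,
              grad_orig=grad, grad_aug=grad)
```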

5. EXPERIMENTS

To evaluate the performance of the proposed methods, we train deep neural networks on two benchmark data sets, CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), for the image classification task. Both CIFAR-10 and CIFAR-100 have 50,000 training images and 10,000 testing images of 32×32 resolution. CIFAR-10 has 10 classes containing 6,000 images each, while CIFAR-100 has 100 classes. We use mixup (Zhang et al., 2018) as an example of label-mixing augmentation and Contrast as an example of label-preserving augmentation. For the choice of backbone, we use the ResNet-18 model (He et al., 2016) for mixup, and the Wide-ResNet-28-10 model (Zagoruyko & Komodakis, 2016) for the Contrast experiment, following (Cubuk et al., 2019; 2020). To verify our theoretical results, we compare the proposed AugDrop and MixLoss with two baselines, SGD with mixup/Contrast and SGD without mixup/Contrast (baseline). We also include WeMix in the comparison. The mini-batch size of training instances for all methods is 256, as suggested by He et al. (2019) and He et al. (2016). The momentum parameter is 0.9. The weight decay parameter is set to 5 × 10^{-4}. The total number of training epochs is fixed at 200. Following (He et al., 2016; Zagoruyko & Komodakis, 2016), we use 0.1 as the initial learning rate for all algorithms and divide it by 10 every 60 epochs. For AugDrop, we drop the augmentation after the s-th epoch, where s ∈ {150, 160, 170, 180, 190} is tuned; for example, if s = 160, we run the first stage of AugDrop for 160 epochs and the second stage for 40 epochs. For MixLoss, we tune the parameter δ_y over {0.5, 0.05, 0.005, 0.0005} and report the best performance. For WeMix, we use the value of δ_y with the best performance in MixLoss, and we tune the drop-off epoch s in the same way as for AugDrop. We fix the convex combination parameter λ = 0.1 for both MixLoss and WeMix. We use top-1 accuracy to evaluate the performance.
All top-1 accuracies on the testing data set are averaged over 5 independent random trials and reported with their standard deviations.
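The step learning-rate schedule used above can be expressed as a small helper (our formulation; epochs are 0-indexed here):

```python
def learning_rate(epoch, base_lr=0.1, drop_every=60, factor=10.0):
    # start at 0.1 and divide by 10 every 60 epochs, as in the setup above
    return base_lr / (factor ** (epoch // drop_every))
```

Over 200 training epochs this yields rates 0.1, 0.01, 0.001, and 0.0001 for epochs 0-59, 60-119, 120-179, and 180-199, respectively.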

5.1. MIXUP

Given two examples $(x_i, y_i)$ and $(x_j, y_j)$ drawn at random from the training data, mixup creates a virtual training example as follows:

$\tilde{x} = \beta x_i + (1 - \beta) x_j, \qquad \tilde{y} = \beta y_i + (1 - \beta) y_j,$

where $\beta \in [0, 1]$ is sampled from a Beta distribution $\mathrm{Beta}(\alpha, \alpha)$. We use $\alpha = 1$ in the experiments, as suggested in (Zhang et al., 2018). In this subsection, we empirically verify our theoretical findings for label-mixing augmentation in Section 4. The experimental results conducted on CIFAR-10 and CIFAR-100 are listed in Table 1. We can see from the results that both AugDrop and MixLoss improve over the baselines. Besides, the proposed WeMix enjoys both improvements, leading to the best performance among all algorithms, although its theoretical convergence guarantee is unclear. Next, we implement MixLoss and WeMix with $\delta_y = 0$ (i.e., using $\ell_a = \ell$), denoted by MixLoss-s and WeMix-s, respectively. We summarize the results in Table 1, which shows that both MixLoss-s and WeMix-s lose performance compared with MixLoss and WeMix, respectively. Besides, we use more than two images in mixup, such as three and ten images, and the results are shown in Table 2. Although the top-1 accuracy of mixup itself reduces dramatically, we find that the proposed WeMix can still improve the performance compared with mixup, showing the robustness of WeMix.
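The mixup construction above can be sketched directly (a minimal version for a single pair; in the experiments it is applied per mini-batch):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    # draw beta ~ Beta(alpha, alpha) and form convex combinations of
    # both the inputs and the labels (Zhang et al., 2018)
    if rng is None:
        rng = np.random.default_rng()
    beta = rng.beta(alpha, alpha)
    return beta * x_i + (1 - beta) * x_j, beta * y_i + (1 - beta) * y_j

x_i, y_i = np.array([0.0, 0.0]), np.array([1.0, 0.0])   # class 0
x_j, y_j = np.array([1.0, 1.0]), np.array([0.0, 1.0])   # class 1
x_mix, y_mix = mixup(x_i, y_i, x_j, y_j, alpha=1.0,
                     rng=np.random.default_rng(0))
```

With $\alpha = 1$, $\mathrm{Beta}(1, 1)$ is the uniform distribution on $[0, 1]$, so every mixing weight is equally likely.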

5.2. CONTRAST

As a simple label-preserving augmentation, Contrast controls the contrast of the image. Its transformation magnitude is randomly selected from a uniform distribution over [0.1, 1.9], following (Cubuk et al., 2019). Despite its simplicity, we choose it to demonstrate our theory for the case considered in Appendix A. The highest top-1 accuracies on the testing data sets for different methods are presented in Table 3. We find that directly training on data with Contrast slightly drops the performance. Even so, the results show that AugDrop has better performance than the two baselines, which is consistent with the theoretical findings for label-preserving augmentation in Appendix A: we should use the data augmentation in the early training stage but drop it at the end of training. Although there is no theoretical guarantee for the label-preserving transformation case, we implement MixLoss and WeMix by setting $\delta_f = 0$, i.e., using $\ell_a = \ell$ in (14). The results show that MixLoss and WeMix are better than the two baselines but slightly worse than AugDrop.
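A simple way to mimic this Contrast transform is to scale pixel deviations from the image mean by the sampled magnitude; the NumPy sketch below is our approximation of the PIL-based operation used in practice, not the exact implementation:

```python
import numpy as np

def contrast(image, rng=None):
    # scale deviations from the mean by a magnitude m ~ U[0.1, 1.9];
    # m < 1 reduces contrast, m > 1 increases it; clip back to [0, 255]
    if rng is None:
        rng = np.random.default_rng()
    m = rng.uniform(0.1, 1.9)
    mean = image.mean()
    return np.clip(mean + m * (image - mean), 0.0, 255.0)

img = np.array([[0.0, 255.0], [100.0, 50.0]])
out = contrast(img, rng=np.random.default_rng(1))
```

Since the labels are untouched, the transform is label-preserving: only the input marginal $P_x$ changes.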

A MAIN RESULTS FOR LABEL-PRESERVING AUGMENTATION

We consider the label-preserving augmentation case (1), that is, $P_y(\cdot|x) = P_{\tilde{y}}(\cdot|\tilde{x})$, $\forall \tilde{x} \in \mathcal{T}(x)$, but $P_{\tilde{x}} \neq P_x$. It covers many image data augmentations including translation, adding noises, small rotation, and brightness or contrast changes (Krizhevsky et al., 2012; Raghunathan et al., 2020). It is worth mentioning that compositions of label-preserving augmentations can also be label-preserving. Similar to the case of label-mixing augmentation, we measure the difference between $P_x$ and $P_{\tilde{x}}$ by a KL divergence:

$\delta_P := D_{\mathrm{KL}}(P_{\tilde{x}} \| P_x) = \mathbb{E}_{\tilde{x} \sim P_{\tilde{x}}}\left[\log \frac{P_{\tilde{x}}(\tilde{x})}{P_x(\tilde{x})}\right].$ (18)

Due to the data bias $\delta_P$, the prediction model learned from the augmented data $\widetilde{\mathcal{D}}$ could be even worse than the model trained directly on the original data $\mathcal{D}$, as revealed by the following lemma and its remark.

Lemma 2. (label-preserving augmentation) Assume that $L$ and $\widetilde{L}$ satisfy the properties in Definitions 1, 2 and 3. By setting $\eta = 1/L$ and $m_0 \geq \frac{4}{\eta \delta_P}$, when $t \geq t_0 = \frac{L}{\mu} \log \frac{(L(w_1) - L(w^*))\mu}{2\delta_P G^2}$, we have

$\mathbb{E}_{(\tilde{x}_t, \tilde{y}_t)}[L(w_{t+1}) - L(w^*)] \leq \frac{4\delta_P G^2}{\mu} \leq O\!\left(\frac{\delta_P}{\mu}\right),$ (19)

where $w_{t+1}$ is the output of mini-batch SGD trained on $\widetilde{\mathcal{D}}$ and $\delta_P$ is defined in (18).

Proof. See Appendix I.1.

Remark: Comparing the result in (9) with the result (19) in Lemma 2, it is easy to show that when the data bias is too large, i.e., $\delta_P \geq \Omega(L \log(n)/(n\mu))$, we have $O(L \log(n)/(n\mu^2)) \leq O(\delta_P/\mu)$. This implies that training the deep model directly on the original data $\mathcal{D}$ is more effective than on the augmented data $\widetilde{\mathcal{D}}$. Hence, in order to better leverage the augmented data in the presence of large data bias ($\delta_P \geq \Omega(\kappa \log(n)/n)$, where $\kappa = L/\mu$), we need an approach that automatically corrects the data bias in $\widetilde{\mathcal{D}}$. Below, we use AugDrop to correct the data bias by solving a constrained optimization problem.
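For discretized input marginals on a finite grid, the bias measure $\delta_P$ in (18) is an ordinary discrete KL divergence; a sketch (the `eps` guard against empty bins is our addition):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) over a finite grid
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p_orig = np.array([0.5, 0.5])   # discretized original marginal P_x
p_aug = np.array([0.9, 0.1])    # discretized augmented marginal
delta_p = kl_divergence(p_aug, p_orig)
```

The divergence is zero iff the two marginals coincide and grows as the augmentation shifts the input distribution further from the original one.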

A.1 AugDrop: CORRECTING DATA BIAS BY CONSTRAINED OPTIMIZATION

To correct the data bias, we consider solving the constrained optimization problem (10). The key idea is to shrink the solution to a small region by utilizing the augmented data so as to enjoy a smaller condition number, leading to an improved convergence in optimizing $L(w)$. Introducing the term $\gamma_1 := \delta_P G^2 / \mu$, we can present a proposition about $\mathcal{A}(\gamma)$ and $\mu(\gamma)$, showing that we have a smaller condition number and consequently a smaller optimization error by restricting our solutions to $\mathcal{A}(\gamma)$.

Proposition 2. If $\gamma \in [\gamma_1, 4\gamma_1]$, we have $w^* \in \mathcal{A}(\gamma)$ and $\mu(\gamma) \geq \mu$, where $\mathcal{A}(\gamma)$ and $\mu(\gamma)$ are defined in (11) and (12), respectively.

Proof. See Appendix I.2.

The following theorem shows the convergence result of AugDrop for label-preserving augmentation.

Theorem 3. Define $\mu_e = \mu(4\gamma_1)$. Assume that $L$ and $\widetilde{L}$ satisfy the properties in Definitions 1, 2 and 3, with the learning rates and the mini-batch sizes $m_1$, $m_2$ chosen analogously to Theorem 1 with the bias $\delta_y^2$ replaced by $\delta_P$. Then, with probability $1 - \delta$, we have $w_t \in \mathcal{A}(4\gamma_1)$, $\forall t \in \{T_1+2, \ldots, T_1+n/m_2+1\}$, and

$\mathbb{E}[L(\hat{w}) - L(w^*)] \leq \frac{G^2 L}{8n\mu_e^2} + \frac{G^2 L}{8n\mu_e^2} \log \frac{8n\mu_e^2 (L(w_1) - L(w^*))}{G^2 L} \leq O\!\left(\frac{L \log(n)}{n\mu_e^2}\right),$ (20)

where $\hat{w} = w_{T_1+n/m_2+1}$ and $\delta_P$ is defined in (18).

Proof. See Appendix I.3.

Remark. Comparing (20) with the result (19) in Lemma 2, we know that training with AugDrop achieves better performance than directly training on $\widetilde{\mathcal{D}}$ when the data bias is large. By solving a constrained problem, the AugDrop algorithm corrects the data bias and thus enjoys a better performance.

B TECHNICAL RESULTS FOR CROSS-ENTROPY LOSS

Lemma 3. Assume that $L(w) = \mathbb{E}[\ell(y, f(x; w))]$ satisfies the property in Definition 1, where $\ell$ is the cross-entropy loss. Then we have

$\|\nabla_w \ell(y, f(x; w)) - \nabla_w \ell(\tilde{y}, f(x; w))\| \leq G \|y - \tilde{y}\|,$ (21)

and $\|\nabla_w \ell(y, f(x; w))\| \leq G$.

Proof. The objective function is $L(w) = \mathbb{E}_{(x,y)}[\ell(y, f(x; w))]$, where the cross-entropy loss function is given by

$\ell(y, f(x; w)) = \sum_{i=1}^K -y_i \log \frac{\exp(f_i(x; w))}{\sum_{j=1}^K \exp(f_j(x; w))}.$

Let $p(x; w) = (p_1(x; w), \ldots, p_K(x; w))$ with $p_i(x; w) = -\log \frac{\exp(f_i(x; w))}{\sum_{j=1}^K \exp(f_j(x; w))}$; then the gradient of $\ell$ with respect to $w$ is $\nabla \ell(y, f(x; w)) = \langle y, \nabla p(x; w) \rangle$. Therefore, for all $x \in \mathcal{X}$ and $w \in \mathbb{R}^D$ we have

$\|\nabla_w \ell(y, f(x; w)) - \nabla_w \ell(\tilde{y}, f(x; w))\| = \|\langle y - \tilde{y}, \nabla p(x; w) \rangle\| \leq \|\nabla p(x; w)\| \|y - \tilde{y}\| \leq G \|y - \tilde{y}\|,$

and

$\|\nabla_w \ell(y, f(x; w))\| = \|\langle y, \nabla p(x; w) \rangle\| \leq \|\nabla p(x; w)\| \|y\| \leq G,$

where we use the facts that $\|\nabla p(x; w)\| \leq G$ and $\|y\| \leq \|y\|_1 = 1$; here $\|\cdot\|$ is the Euclidean norm ($\ell_2$ norm) and $\|\cdot\|_1$ is the $\ell_1$ norm.

C PROOF OF LEMMA 1

Proof. Recall that the update of mini-batch SGD is given by $w_{t+1} = w_t - \eta\widetilde{g}_t$. Define the averaged mini-batch stochastic gradient of $\widetilde{L}(w_t)$ as
\[
\widetilde{g}_t := \frac{1}{m_0}\sum_{i=1}^{m_0}\nabla\ell(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i}; w_t)).
\]
By the assumption that $\widetilde{L}$ satisfies the property in Definition 1, we know
\[
\mathbb{E}_{(\widetilde{x}_{t,i},\widetilde{y}_{t,i})}\big[\nabla\ell(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i}; w_t))\big] = \nabla\widetilde{L}(w_t), \quad \forall i \in \{1,\ldots,m_0\}, \tag{29}
\]
and thus
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[\widetilde{g}_t] = \nabla\widetilde{L}(w_t), \tag{30}
\]
where we write $\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[\widetilde{g}_t]$ for $\mathbb{E}_{(\widetilde{x}_{t,1},\widetilde{y}_{t,1})}[\cdots\mathbb{E}_{(\widetilde{x}_{t,m_0},\widetilde{y}_{t,m_0})}[\widetilde{g}_t]]$ for simplicity. Then the variance of $\widetilde{g}_t$ satisfies
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}\big[\|\widetilde{g}_t - \nabla\widetilde{L}(w_t)\|^2\big] = \mathbb{E}\Big[\Big\|\frac{1}{m_0}\sum_{i=1}^{m_0}\nabla\ell(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i}; w_t)) - \nabla\widetilde{L}(w_t)\Big\|^2\Big] \overset{(a)}{=} \frac{1}{m_0^2}\sum_{i=1}^{m_0}\mathbb{E}\big[\|\nabla\ell(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i}; w_t)) - \nabla\widetilde{L}(w_t)\|^2\big] \overset{(b)}{\le} \frac{4G^2}{m_0}, \tag{31}
\]
where (a) is due to (29) and the fact that the pairs $(\widetilde{x}_{t,1},\widetilde{y}_{t,1}), \ldots, (\widetilde{x}_{t,m_0},\widetilde{y}_{t,m_0})$ are independently sampled from $\widetilde{D}$; (b) is due to the assumption that $\widetilde{L}$ satisfies the property in Definition 1 together with Lemma 3 — by Jensen's inequality we also have $\|\nabla\widetilde{L}(w)\| \le G$, implying $\|\nabla\ell(\widetilde{y}, f(\widetilde{x};w)) - \nabla\widetilde{L}(w)\|^2 \le 4G^2$.

On the other hand, by the assumption that $L$ satisfies the property in Definition 2,
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_t)] \le \mathbb{E}\big[\langle\nabla L(w_t), w_{t+1}-w_t\rangle\big] + \frac{L}{2}\mathbb{E}\|w_{t+1}-w_t\|^2 \overset{(a)}{=} \frac{\eta}{2}\mathbb{E}\big[\|\nabla L(w_t) - \widetilde{g}_t\|^2 - \|\nabla L(w_t)\|^2 - (1-\eta L)\|\widetilde{g}_t\|^2\big] \overset{(b)}{=} \frac{\eta}{2}\Big(\|\nabla L(w_t) - \nabla\widetilde{L}(w_t)\|^2 + \mathbb{E}\|\nabla\widetilde{L}(w_t) - \widetilde{g}_t\|^2 - \|\nabla L(w_t)\|^2\Big) - \frac{\eta(1-\eta L)}{2}\mathbb{E}\|\widetilde{g}_t\|^2 \overset{(c)}{\le} \frac{\eta}{2}\Big(\|\nabla L(w_t) - \nabla\widetilde{L}(w_t)\|^2 + \frac{4G^2}{m_0} - \|\nabla L(w_t)\|^2\Big), \tag{32}
\]
where (a) is due to the update $w_{t+1} = w_t - \eta\widetilde{g}_t$; (b) is due to (30); (c) is due to $\eta = 1/L$ and (31). By the assumption that $L$ and $\widetilde{L}$ satisfy the property in Definition 1 and $P_x = P_{\widetilde{x}}$, we have
\[
\|\nabla L(w_t) - \nabla\widetilde{L}(w_t)\| = \big\|\mathbb{E}_{(x,y)}[\nabla_w\ell(y, f(x;w_t))] - \mathbb{E}_{(\widetilde{x},\widetilde{y})}[\nabla_w\ell(\widetilde{y}, f(\widetilde{x};w_t))]\big\| \overset{(a)}{\le} \mathbb{E}_{(x,y,\widetilde{y})}\big[\|\nabla_w\ell(y, f(x;w_t)) - \nabla_w\ell(\widetilde{y}, f(x;w_t))\|\big] \overset{(21)}{\le} G\,\mathbb{E}_{(y,\widetilde{y})}[\|y - \widetilde{y}\|] \overset{(b)}{\le} G\delta_y, \tag{33}
\]
where (a) uses Jensen's inequality; (b) is due to (3).

By the assumption that $L$ satisfies the property in Definition 3 and (33), inequality (32) becomes
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_t)] \le \frac{\eta G^2\delta_y^2}{2} + \frac{2\eta G^2}{m_0} - \frac{\eta}{2}\|\nabla L(w_t)\|^2 \le \frac{\eta G^2\delta_y^2}{2} + \frac{2\eta G^2}{m_0} - \eta\mu\,(L(w_t) - L(w_*)),
\]
which implies
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_*)] \le (1-\eta\mu)(L(w_t) - L(w_*)) + \frac{\eta G^2\delta_y^2}{2} + \frac{2\eta G^2}{m_0} \le (1-\eta\mu)^t(L(w_1) - L(w_*)) + \Big(\frac{\eta G^2\delta_y^2}{2} + \frac{2\eta G^2}{m_0}\Big)\sum_{i=0}^{t-1}(1-\eta\mu)^i.
\]
Since $(1-\eta\mu)^t \le \exp(-t\eta\mu)$ and $\sum_{i=0}^{t-1}(1-\eta\mu)^i \le \frac{1}{\eta\mu}$, when $m_0 \ge \frac{8}{\delta_y^2}$ and $t \ge \frac{L}{\mu}\log\frac{4(L(w_1)-L(w_*))\mu}{\delta_y^2G^2}$, we know
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_*)] \le \frac{\delta_y^2G^2}{\mu}.
\]
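The $4G^2/m_0$ variance bound in (31) — averaging $m_0$ independent norm-bounded gradients shrinks the expected squared deviation by a factor of $m_0$ — is easy to check empirically. The snippet below is our own illustrative sketch (not the paper's code): it draws random norm-bounded vectors playing the role of per-example gradients and verifies that the Monte-Carlo estimate of $\mathbb{E}\|\widetilde{g} - \nabla\widetilde{L}\|^2$ stays below $4G^2/m_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
G, d, m0, trials = 1.0, 10, 32, 2000

def sample_gradient(n):
    """Random vectors with norm <= G, standing in for per-example gradients."""
    v = rng.normal(size=(n, d))
    return G * v / np.linalg.norm(v, axis=1, keepdims=True)

mean = sample_gradient(100000).mean(axis=0)  # estimate of the population mean gradient
# Squared deviation of the m0-sample average: scales like 1/m0, bounded by 4G^2/m0.
dev2 = [np.sum((sample_gradient(m0).mean(axis=0) - mean) ** 2) for _ in range(trials)]
assert np.mean(dev2) <= 4 * G**2 / m0
```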

D PROOF OF (9)

We first state the full version of (9) in the following lemma.

Lemma 4. Assume that $L$ satisfies the properties in Definitions 1, 2 and 3. By setting
\[
\eta = \frac{1}{2n\mu}\log\frac{8n\mu^2(L(w_1)-L(w_*))}{G^2L},
\]
we have
\[
\mathbb{E}_{(x_n,y_n)}[L(w_{n+1}) - L(w_*)] \le \frac{G^2L}{8n\mu^2} + \frac{G^2L}{8n\mu^2}\log\frac{8n\mu^2(L(w_1)-L(w_*))}{G^2L},
\]
where $w_{n+1}$ is the output of SGD trained on $D$.

Proof. By the assumption that $L$ satisfies the property in Definition 2, we have
\[
\mathbb{E}_{(x_n,y_n)}[L(w_{n+1}) - L(w_n)] \le \mathbb{E}\big[\langle\nabla L(w_n), w_{n+1}-w_n\rangle\big] + \frac{L}{2}\mathbb{E}\|w_{n+1}-w_n\|^2 \overset{(a)}{=} -\eta\,\mathbb{E}\big[\langle\nabla L(w_n), \nabla\ell(y_n, f(x_n;w_n))\rangle\big] + \frac{\eta^2L}{2}\mathbb{E}\big[\|\nabla\ell(y_n, f(x_n;w_n))\|^2\big] \overset{(b)}{=} -\eta\|\nabla L(w_n)\|^2 + \frac{\eta^2L}{2}\mathbb{E}\big[\|\nabla\ell(y_n, f(x_n;w_n))\|^2\big],
\]
where (a) is due to the update $w_{n+1} = w_n - \eta\nabla\ell(y_n, f(x_n;w_n))$; (b) is due to the assumption that $L$ satisfies Definition 1, i.e. $\mathbb{E}_{(x,y)}[\nabla_w\ell(y, f(x;w))] = \nabla L(w)$. Using $\|\nabla_w\ell(y, f(x;w))\| \le G$ (Definition 1) and the assumption that $L$ satisfies Definition 3, we have
\[
\mathbb{E}_{(x_n,y_n)}[L(w_{n+1}) - L(w_n)] \le \frac{\eta^2LG^2}{2} - \eta\|\nabla L(w_n)\|^2 \le \frac{\eta^2LG^2}{2} - 2\eta\mu\,(L(w_n) - L(w_*)),
\]
which implies
\[
\mathbb{E}_{(x_n,y_n)}[L(w_{n+1}) - L(w_*)] \le (1-2\eta\mu)\,\mathbb{E}_{(x_{n-1},y_{n-1})}[L(w_n) - L(w_*)] + \frac{\eta^2LG^2}{2} \le (1-2\eta\mu)^n(L(w_1) - L(w_*)) + \frac{\eta^2LG^2}{2}\sum_{i=0}^{n-1}(1-2\eta\mu)^i.
\]
Since $(1-2\eta\mu)^n \le \exp(-2\eta\mu n)$ and $\sum_{i=0}^{n-1}(1-2\eta\mu)^i \le \frac{1}{2\eta\mu}$, the setting of $\eta$ above yields
\[
\mathbb{E}_{(x_n,y_n)}[L(w_{n+1}) - L(w_*)] \le \exp(-2\eta\mu n)(L(w_1) - L(w_*)) + \frac{\eta G^2L}{4\mu} = \frac{G^2L}{8n\mu^2} + \frac{G^2L}{8n\mu^2}\log\frac{8n\mu^2(L(w_1)-L(w_*))}{G^2L} \le O\!\left(\frac{L\log(n)}{n\mu^2}\right).
\]
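The final bound in Lemma 4 follows from unrolling the scalar recursion $e_{t+1} \le (1-2\eta\mu)e_t + \eta^2LG^2/2$. As a quick deterministic sanity check (with arbitrarily chosen illustrative constants of our own, not from the paper), iterating this recursion with the stated step size indeed lands below the claimed $\frac{G^2L}{8n\mu^2}\big(1 + \log\frac{8n\mu^2 e_0}{G^2L}\big)$ bound:

```python
import math

# Arbitrary illustrative constants (not from the paper).
L_smooth, G, mu, n = 4.0, 2.0, 0.5, 10000
e0 = 5.0                                   # plays the role of L(w_1) - L(w_*)

C = 8 * n * mu**2 * e0 / (G**2 * L_smooth)
eta = math.log(C) / (2 * n * mu)           # step size from Lemma 4

e = e0
for _ in range(n):                          # unroll e_{t+1} = (1 - 2*eta*mu)*e_t + eta^2*L*G^2/2
    e = (1 - 2 * eta * mu) * e + eta**2 * L_smooth * G**2 / 2

bound = (G**2 * L_smooth / (8 * n * mu**2)) * (1 + math.log(C))
assert e <= bound
```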

E PROOF OF PROPOSITION 1

Proof. By the assumption that $\widetilde{L}$ satisfies the property in Definition 3, we have
\[
\widetilde{L}(w_*) - \widetilde{L}(\widetilde{w}_*) \le \frac{\|\nabla\widetilde{L}(w_*)\|^2}{2\mu} \overset{(a)}{=} \frac{\|\nabla\widetilde{L}(w_*) - \nabla L(w_*)\|^2}{2\mu} \overset{(b)}{\le} \frac{\delta_y^2G^2}{2\mu},
\]
where (a) is due to the definition of $w_*$ in (6), so that $\nabla L(w_*) = 0$; (b) follows the same analysis as (33) in Lemma 1. Thus we know $w_* \in \mathcal{A}(\gamma)$ when $\gamma \ge \gamma_0 := \frac{\delta_y^2G^2}{2\mu}$. On the other hand, by the definition of $\mu(\gamma)$ in (12) and the assumption that $L$ satisfies the property in Definition 3, we know $\mu(\gamma) \ge \mu$ when $\gamma \le 8\gamma_0$.

F ALGORITHM AUGDROP AND PROOF OF THEOREM 1

We present the details of the update steps for AugDrop and its convergence analysis in this section.

Proof. In the first stage of the proposed algorithm, we run mini-batch SGD over the augmented data $\widetilde{D}$ with mini-batch size $m_1$. Let $(\widetilde{x}_{t,i}, \widetilde{y}_{t,i})$, $i = 1, \ldots, m_1$, be the $m_1$ examples sampled at the $t$-th iteration, and let $\widetilde{g}_t$ be the averaged gradient at iteration $t$, i.e.
\[
\widetilde{g}_t = \frac{1}{m_1}\sum_{i=1}^{m_1}\nabla_w\ell(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i}; w_t)).
\]
We then update the solution by mini-batch SGD: $w_{t+1} = w_t - \eta_1\widetilde{g}_t$. By Lemma 4 of Ghadimi et al. (2016), with probability $1-\delta'$ we have
\[
\|\widetilde{g}_t - \nabla\widetilde{L}(w_t)\| \le \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)\sqrt{\frac{8G^2}{m_1}}. \tag{34}
\]
By the assumption that $\widetilde{L}$ satisfies the property in Definition 2 and the update $w_{t+1} = w_t - \eta_1\widetilde{g}_t$, we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(w_t) \le -\eta_1\langle\nabla\widetilde{L}(w_t), \widetilde{g}_t\rangle + \frac{\eta_1^2L}{2}\|\widetilde{g}_t\|^2 = \frac{\eta_1}{2}\|\nabla\widetilde{L}(w_t) - \widetilde{g}_t\|^2 - \frac{\eta_1}{2}\|\nabla\widetilde{L}(w_t)\|^2 - \frac{\eta_1(1-\eta_1L)}{2}\|\widetilde{g}_t\|^2 \overset{(a)}{\le} \frac{\eta_1}{2}\Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_1} - \eta_1\mu\big(\widetilde{L}(w_t) - \widetilde{L}(\widetilde{w}_*)\big),
\]
where (a) uses (34), $\eta_1 = 1/L$, and the assumption that $\widetilde{L}$ satisfies the property in Definition 3. Thus, with probability $(1-\delta')^t$, using the recurrence relation, we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(\widetilde{w}_*) \le (1-\eta_1\mu)\big(\widetilde{L}(w_t) - \widetilde{L}(\widetilde{w}_*)\big) + \frac{\eta_1}{2}\Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_1} \le (1-\eta_1\mu)^t\big(\widetilde{L}(w_1) - \widetilde{L}(\widetilde{w}_*)\big) + \frac{\eta_1}{2}\Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_1}\sum_{i=0}^{t-1}(1-\eta_1\mu)^i. \tag{35}
\]
Since $(1-\eta_1\mu)^t \le \exp(-t\eta_1\mu)$ and $\sum_{i=0}^{t-1}(1-\eta_1\mu)^i \le \frac{1}{\eta_1\mu}$, when
\[
t \ge T_1 := \frac{1}{\eta_1\mu}\log\frac{2\big(\widetilde{L}(w_1) - \widetilde{L}(\widetilde{w}_*)\big)\mu}{\delta_y^2G^2},
\]
we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(\widetilde{w}_*) \le \exp(-t\eta_1\mu)\big(\widetilde{L}(w_1) - \widetilde{L}(\widetilde{w}_*)\big) + \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{4G^2}{\mu m_1} \le \frac{\delta_y^2G^2}{2\mu} + \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{4G^2}{\mu m_1}. \tag{36}
\]
Let $\delta' = \frac{\delta}{2T_1}$. If we choose $m_1$ such that
\[
m_1 = \Big(1+\sqrt{3\log\tfrac{2T_1}{\delta}}\Big)^2\frac{8}{\delta_y^2},
\]
then for any $t \ge T_1$, with probability $1-\delta/2$ we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(\widetilde{w}_*) \le \frac{\delta_y^2G^2}{\mu}.
\]
In the second stage of the proposed algorithm, we run mini-batch SGD over the original data $D$ with mini-batch size $m_2$. Let $(x_{t,i}, y_{t,i})$, $i = 1, \ldots, m_2$, be the $m_2$ examples sampled at the $t$-th iteration.
Let $g_t$ be the averaged gradient at iteration $t$, i.e. $g_t = \frac{1}{m_2}\sum_{i=1}^{m_2}\nabla_w\ell(y_{t,i}, f(x_{t,i}; w_t))$, and update $w_{t+1} = w_t - \eta_2 g_t$. By Lemma 4 of Ghadimi et al. (2016), with probability $1-\delta'$ we have
\[
\|g_t - \nabla L(w_t)\| \le \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)\sqrt{\frac{8G^2}{m_2}}. \tag{38}
\]
By the smoothness of $\widetilde{L}(w)$ and the update $w_{t+1} = w_t - \eta_2 g_t$, we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(w_t) \le -\eta_2\langle\nabla\widetilde{L}(w_t), g_t\rangle + \frac{\eta_2^2L}{2}\|g_t\|^2 = \frac{\eta_2}{2}\|\nabla\widetilde{L}(w_t) - g_t\|^2 - \frac{\eta_2}{2}\|\nabla\widetilde{L}(w_t)\|^2 - \frac{\eta_2(1-\eta_2L)}{2}\|g_t\|^2 \overset{(a)}{\le} \eta_2\|\nabla\widetilde{L}(w_t) - \nabla L(w_t)\|^2 + \eta_2\|g_t - \nabla L(w_t)\|^2 - \eta_2\mu\big(\widetilde{L}(w_t) - \widetilde{L}(\widetilde{w}_*)\big) \overset{(b)}{\le} \eta_2\Big(2\delta_y^2G^2 + \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_2}\Big) - \eta_2\mu\big(\widetilde{L}(w_t) - \widetilde{L}(\widetilde{w}_*)\big),
\]
where (a) uses Young's inequality, $\eta_2 \le 1/L$, and the assumption that $\widetilde{L}$ satisfies the property in Definition 3; (b) uses inequality (38) and the same analysis as (33) in Lemma 1. It is then easy to verify that for any $t \in \{T_1+1, \ldots, T_1+n/m_2\}$, with probability $(1-\delta')^{n/m_2}$, we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(\widetilde{w}_*) \le (1-\eta_2\mu)\big(\widetilde{L}(w_t) - \widetilde{L}(\widetilde{w}_*)\big) + \eta_2\Big(G^2\delta_y^2 + \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_2}\Big) \le (1-\eta_2\mu)^t\big(\widetilde{L}(w_{T_1+1}) - \widetilde{L}(\widetilde{w}_*)\big) + \eta_2\Big(G^2\delta_y^2 + \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_2}\Big)\sum_{i=0}^{t-1}(1-\eta_2\mu)^i \le \frac{\delta_y^2G^2}{\mu} + \frac{1}{\mu}\Big(G^2\delta_y^2 + \Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_2}\Big),
\]
where the last inequality uses $(1-\eta_2\mu)^t \le 1$ and $\sum_{i=0}^{t-1}(1-\eta_2\mu)^i \le \frac{1}{\eta_2\mu}$. Choosing
\[
m_2 = \Big(1+\sqrt{3\log\tfrac{2n}{\delta}}\Big)^2\frac{4}{\delta_y^2},
\]
with probability $1-\delta$, for any $t \in \{T_1+1, \ldots, T_1+n/m_2\}$ we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(\widetilde{w}_*) \le \frac{4\delta_y^2G^2}{\mu}.
\]
Therefore, $w_t \in \mathcal{A}(8\gamma_0)$ for any $t \in \{T_1+2, \ldots, T_1+n/m_2+1\}$. Following the standard analysis in Appendix D, we have
\[
\mathbb{E}\big[L(w_{T_1+n/m_2+1}) - L(w_*)\big] \le \frac{G^2L}{4n\mu_c^2} + \frac{G^2L}{4n\mu_c^2}\log\frac{4n\mu_c^2(L(w_1)-L(w_*))}{G^2L},
\]
where $\mu_c = \mu(8\gamma_0)$.

G OPTIMAL SOLUTIONS OF $L_a(w)$ AND $L_c(w)$

By the definition of $\ell_a$ in (14) and $P_x = P_{\widetilde{x}}$, we know
\[
\ell_a(\widetilde{y}, f(\widetilde{x};w)) = \min_{\|z-\widetilde{y}\|\le\delta_y}\ell(z, f(x;w)) \le \ell(y, f(x;w)) \tag{40}
\]
since $\|y - \widetilde{y}\| \le \delta_y$.
Therefore, by (15), (40), and $P_x = P_{\widetilde{x}}$, we have
\[
L_a(w) = \mathbb{E}_{(\widetilde{x},\widetilde{y})}\big[\ell_a(\widetilde{y}, f(\widetilde{x};w))\big] = \mathbb{E}_{(x,\widetilde{y},y)}\big[\ell_a(\widetilde{y}, f(x;w))\big] \le \mathbb{E}_{(x,\widetilde{y},y)}\big[\ell(y, f(x;w))\big] = \mathbb{E}_{(x,y)}\big[\ell(y, f(x;w))\big] = L(w).
\]
Since $\ell$ is a non-negative loss function, we know $0 \le L_a(w_*) \le L(w_*) = 0$, which implies $L_a(w_*) = 0$, and thus $L_a(w_*) \le L_a(w)$ for all $w$. Therefore, $w_*$ also minimizes $L_a(w)$. On the other hand, by (16) we know $L_c(w_*) = \lambda L(w_*) + (1-\lambda)L_a(w_*) = 0$. Therefore $L_c(w_*) \le L_c(w)$ for all $w$, i.e., $w_*$ also minimizes $L_c(w)$, indicating that $L_c(w)$ shares the same minimizer as $L(w)$.

H ALGORITHM MIXLOSS AND PROOF OF THEOREM 2

We present the details of the update steps for MixLoss (at each iteration $t$, draw examples, update $w_{t+1} = w_t - \eta g_t$; after $n$ iterations, output $w_{n+1}$) and its convergence analysis in this section.

Proof. Recall that $L_c(w) = \lambda L(w) + (1-\lambda)L_a(w)$, where
\[
L_a(w) = \mathbb{E}_{(\widetilde{x},\widetilde{y})}\big[\ell_a(\widetilde{y}, f(\widetilde{x};w))\big] = \mathbb{E}_{(\widetilde{x},\widetilde{y})}\Big[\min_{\|z-\widetilde{y}\|\le\delta_y}\ell(z, f(\widetilde{x};w))\Big],
\]
and
\[
g_t = \lambda\nabla\ell(y_t, f(x_t;w_t)) + (1-\lambda)\frac{1}{m_0}\sum_{i=1}^{m_0}\nabla\ell_a(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i};w_t)).
\]
By the update $w_{t+1} = w_t - \eta g_t$ and the assumption that $L_c$ satisfies the property in Definition 2, we have (Nesterov, 2004)
\[
\mathbb{E}_t[L_c(w_{t+1}) - L_c(w_t)] \le -\eta\,\mathbb{E}_t\big[\langle\nabla L_c(w_t), g_t\rangle\big] + \frac{\eta^2L}{2}\mathbb{E}_t\|g_t\|^2 \overset{(a)}{\le} -\eta(1-\eta L)\,\mathbb{E}_t\|\nabla L_c(w_t)\|^2 + \eta^2L\,\mathbb{E}_t\|g_t - \nabla L_c(w_t)\|^2 \le -\eta(1-\eta L)\,\mathbb{E}_t\|\nabla L_c(w_t)\|^2 + 5\lambda^2\eta^2LG^2, \tag{44}
\]
where the intermediate steps (b)–(d) of (44) are detailed below. Thus, since $\eta \le \frac{1}{2L}$ and by the assumption that $L_c$ satisfies the property in Definition 3, we have
\[
\mathbb{E}_t[L_c(w_{t+1}) - L_c(w_t)] \le -\eta\mu\,\mathbb{E}_t[L_c(w_t)] + 5\lambda^2\eta^2LG^2,
\]
and therefore
\[
\mathbb{E}_n[L_c(w_{n+1})] \le \exp(-\eta\mu n)L_c(w_1) + \frac{5\lambda^2\eta LG^2}{\mu} \le \exp(-\eta\mu n)L(w_1) + \frac{5\lambda^2\eta LG^2}{\mu}, \tag{45}
\]
where the last inequality is due to the fact that $L_c(w) \le L(w)$. In (45), by choosing $\eta = \frac{1}{\mu n}\log\frac{n\mu^2L(w_1)}{\lambda^2LG^2}$, we have
\[
\mathbb{E}_n[L_c(w_{n+1})] \le \frac{\lambda^2LG^2}{n\mu^2} + \frac{5\lambda^2LG^2}{n\mu^2}\log\frac{n\mu^2L(w_1)}{\lambda^2LG^2}. \tag{46}
\]
Since $L(w) = \frac{1}{\lambda}L_c(w) - \frac{1-\lambda}{\lambda}L_a(w)$, (46) becomes
\[
\mathbb{E}_n[L(w_{n+1})] \le \frac{\lambda LG^2}{n\mu^2} + \frac{5\lambda LG^2}{n\mu^2}\log\frac{n\mu^2L(w_1)}{\lambda^2LG^2} - \frac{1-\lambda}{\lambda}\mathbb{E}[L_a(w_{n+1})] \le \frac{\lambda LG^2}{n\mu^2} + \frac{5\lambda LG^2}{n\mu^2}\log\frac{n\mu^2L(w_1)}{\lambda^2LG^2},
\]
where the last inequality is due to $\lambda \in (0,1)$ and $L_a(w_{n+1}) \ge L_a(w_*) = 0$.
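The MixLoss gradient $g_t = \lambda\nabla\ell(y_t, f(x_t;w_t)) + (1-\lambda)\frac{1}{m_0}\sum_i\nabla\ell_a(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i};w_t))$ can be sketched concretely. The snippet below is our own illustration, not the paper's implementation: we instantiate $\ell$ as the squared loss on a linear model, for which the inner minimization in $\ell_a(\widetilde{y}, f) = \min_{|z-\widetilde{y}|\le\delta_y}(f-z)^2$ has a closed form (soft-thresholding of the residual); the paper itself works with cross-entropy.

```python
import numpy as np

def mixloss_grad(w, x, y, x_aug, y_aug, lam=0.5, delta_y=0.1):
    """One MixLoss gradient g_t = lam * grad l(y, f(x;w)) + (1-lam) * avg grad l_a,
    sketched for a linear model f(x;w) = <w, x> with squared loss l(z, f) = (f - z)^2,
    where l_a(y~, f) = min_{|z - y~| <= delta_y} (f - z)^2 soft-thresholds the residual."""
    g_orig = 2 * (w @ x - y) * x                       # grad of (f - y)^2
    g_aug = np.zeros_like(w)
    for xa, ya in zip(x_aug, y_aug):
        r = w @ xa - ya
        r_a = np.sign(r) * max(abs(r) - delta_y, 0.0)  # inner minimizer shrinks |r| by delta_y
        g_aug += 2 * r_a * xa
    g_aug /= len(x_aug)
    return lam * g_orig + (1 - lam) * g_aug

# toy usage: one original example plus m0 = 3 augmented copies, one SGD step
rng = np.random.default_rng(1)
w = rng.normal(size=4)
x, y = rng.normal(size=4), 0.5
x_aug = [x + 0.01 * rng.normal(size=4) for _ in range(3)]
y_aug = [y + 0.05 * rng.normal() for _ in range(3)]
w = w - 0.1 * mixloss_grad(w, x, y, x_aug, y_aug)      # w_{t+1} = w_t - eta * g_t
```

With a very large `delta_y` the robust term vanishes and the step reduces to a plain weighted gradient on the original example, which matches the role of $\lambda$ as the mixing weight.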

I PROOFS IN APPENDIX A

This section includes the proofs for the Appendix section "Main Results for Label-Preserving Augmentation".

I.1 PROOF OF LEMMA 2

The analysis is similar to that of Lemma 1; for completeness, we include it here.

Proof. Following the same analysis as in Lemma 1, we obtain the same result as in (32), that is,
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_t)] \le \frac{\eta}{2}\Big(\|\nabla L(w_t) - \nabla\widetilde{L}(w_t)\|^2 + \frac{4G^2}{m_0} - \|\nabla L(w_t)\|^2\Big). \tag{48}
\]
We have
\[
\|\nabla L(w_t) - \nabla\widetilde{L}(w_t)\| \overset{(a)}{\le} \iint \big|P_x(x) - P_{\widetilde{x}}(x)\big|\,\|\nabla\ell(y, f(x;w_t))\|\,dx\,dy,
\]
which is bounded in (49). Moreover, since $\nabla L(w_*) = 0$ by the definition of $w_*$ in (6), the same analysis as (49) in Lemma 2 gives
\[
\widetilde{L}(w_*) - \widetilde{L}(\widetilde{w}_*) \le \frac{\|\nabla\widetilde{L}(w_*) - \nabla L(w_*)\|^2}{2\mu} \le \frac{\delta_PG^2}{\mu};
\]
thus $w_* \in \mathcal{A}(\gamma)$ when $\gamma \ge \gamma_1 := \frac{\delta_PG^2}{\mu}$, and by the definition of $\mu(\gamma)$ in (12) together with the assumption that $L$ satisfies the property in Definition 3, we know $\mu(\gamma) \ge \mu$ when $\gamma \le 4\gamma_1$, which proves Proposition 2.

I.3 PROOF OF THEOREM 3

This proof is similar to the proof of Theorem 1; for completeness, we include it here.

Proof. In the first stage of the proposed algorithm, we run mini-batch SGD over the augmented data $\widetilde{D}$ with mini-batch size $m_1$. Using an analysis similar to (35) in Theorem 1, with probability $(1-\delta')^t$ we have
\[
\widetilde{L}(w_{t+1}) - \widetilde{L}(\widetilde{w}_*) \le (1-\eta_1\mu)^t\big(\widetilde{L}(w_1) - \widetilde{L}(\widetilde{w}_*)\big) + \frac{\eta_1}{2}\Big(1+\sqrt{3\log\tfrac{1}{\delta'}}\Big)^2\frac{8G^2}{m_1}\sum_{i=0}^{t-1}(1-\eta_1\mu)^i.
\]



The CIFAR-10 and CIFAR-100 datasets are available at https://www.cs.toronto.edu/~kriz/cifar.html.



Theorem 3. Define $\gamma = 4\gamma_1$ and $\mu_e = \mu(4\gamma_1)$. Assume that $L$ and $\widetilde{L}$ satisfy the properties in Definitions 1, 2 and 3. Set the learning rate $\eta_1 = 1/L$ in Stage I and $\eta_2 = \frac{1}{2n\mu_e}\log\frac{8n\mu_e^2(L(w_1)-L(w_*))}{G^2L}$ in Stage II of AugDrop. Let $w_1$ be the initial solution in Stage I of AugDrop and $w_{T_1+2}, \ldots, w_{T_1+n/m_2+1}$ be the intermediate solutions obtained by mini-batch SGD in Stage II of AugDrop. Choose $T_1 = \frac{L}{\mu}\log\frac{2(\widetilde{L}(w_1)-\widetilde{L}(\widetilde{w}_*))\mu}{\delta_PG^2}$ and $m_1 = \big(1+\sqrt{3\log\frac{2T_1}{\delta}}\big)^2\frac{8}{\delta_P}$.

Here (a) uses Young's inequality $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ and $\mathbb{E}[g_t] = \nabla L_c(w_t)$; (b) uses (42), (43) and Young's inequality $\|a+b\|^2 \le (1+1/c)\|a\|^2 + (1+c)\|b\|^2$ with $c = 8$; (c) uses the same analysis as (31) in the proof of Lemma 1 together with the assumption that $L$ satisfies the property in Definition 1 — by Jensen's inequality we also have $\|\nabla L(w)\| \le G$, implying $\|\nabla\ell(y, f(x;w)) - \nabla L(w)\|^2 \le 4G^2$; (d) holds by setting $m_0 \ge \frac{72(1-\lambda)^2}{\lambda^2}$, since we have a sufficiently large number of augmented examples. Thus, since $\eta \le \frac{1}{2L}$ and by the assumption that $L_c$ satisfies the property in Definition 3, we have
\[
\mathbb{E}_t[L_c(w_{t+1}) - L_c(w_t)] \le -\eta\mu\,\mathbb{E}_t[L_c(w_t)] + 5\lambda^2\eta^2LG^2,
\]
and therefore
\[
\mathbb{E}_n[L_c(w_{n+1})] \le \exp(-\eta\mu n)L_c(w_1) + \frac{5\lambda^2\eta LG^2}{\mu} \le \exp(-\eta\mu n)L(w_1) + \frac{5\lambda^2\eta LG^2}{\mu}.
\]

\[
\le G\int\big|P_x(x) - P_{\widetilde{x}}(x)\big|\,dx \overset{(b)}{\le} G\sqrt{2D_{\mathrm{KL}}(P_{\widetilde{x}}\,\|\,P_x)} \overset{(c)}{=} G\sqrt{2\delta_P}, \tag{49}
\]
where (a) is due to the assumption that $L$ satisfies the property in Definition 1; (b) uses Pinsker's inequality (Csiszár & Körner, 2011; Tsybakov, 2008); (c) is due to (18). With inequality (49), using $\eta = 1/L$ and the assumption that $L$ satisfies the property in Definition 3, inequality (48) becomes
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_t)] \le \eta\delta_PG^2 + \frac{2\eta G^2}{m_0} - \eta\mu\,(L(w_t) - L(w_*)) \le 2\eta\delta_PG^2 - \eta\mu\,(L(w_t) - L(w_*)),
\]
where the last inequality is due to the selection of $m_0 \ge \frac{4}{\eta\delta_P}$. Then we have
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_*)] \le (1-\eta\mu)\,\mathbb{E}_{(\widetilde{x}_{t-1},\widetilde{y}_{t-1})}[L(w_t) - L(w_*)] + 2\eta\delta_PG^2 \le (1-\eta\mu)^t(L(w_1) - L(w_*)) + 2\eta\delta_PG^2\sum_{i=0}^{t-1}(1-\eta\mu)^i.
\]
Since $(1-\eta\mu)^t \le \exp(-t\eta\mu)$ and $\sum_{i=0}^{t-1}(1-\eta\mu)^i \le \frac{1}{\eta\mu}$, when $t \ge \frac{L}{\mu}\log\frac{(L(w_1)-L(w_*))\mu}{2\delta_PG^2}$, we know
\[
\mathbb{E}_{(\widetilde{x}_t,\widetilde{y}_t)}[L(w_{t+1}) - L(w_*)] \le \frac{4\delta_PG^2}{\mu}.
\]

I.2 PROOF OF PROPOSITION 2

Proof. By the assumption that $\widetilde{L}$ satisfies the property in Definition 3, we have
\[
\widetilde{L}(w_*) - \widetilde{L}(\widetilde{w}_*) \le \frac{\|\nabla\widetilde{L}(w_*)\|^2}{2\mu} \overset{(a)}{=} \frac{\|\nabla\widetilde{L}(w_*) - \nabla L(w_*)\|^2}{2\mu} \overset{(b)}{\le} \frac{\delta_PG^2}{\mu}.
\]
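Step (b) of (49) is Pinsker's inequality, $\int|P - Q| \le \sqrt{2D_{\mathrm{KL}}(P\,\|\,Q)}$. A quick numeric check on a pair of discrete distributions (our own illustration; any two distributions on the same support work):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])          # stand-in for P_x~
q = np.array([0.25, 0.25, 0.5])        # stand-in for P_x
kl = np.sum(p * np.log(p / q))         # D_KL(P || Q)
l1 = np.sum(np.abs(p - q))             # integral of |P - Q| (twice the TV distance)
assert l1 <= np.sqrt(2 * kl)           # Pinsker's inequality
```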

Algorithm outline (first stage: weighted mixed losses): for $t = 1, 2, \ldots, T_1$, perform the Stage-I update; for $t = T_1+1, T_1+2, \ldots, T_1+T_2$, perform the Stage-II update.

Table: Comparison of Testing Top-1 Accuracy (mean ± standard deviation, in %) using different methods on ResNet-18 over CIFAR-10 and CIFAR-100 for mixup.

The notation $A(\cdot;\cdot,\eta)$ denotes one update step of a stochastic algorithm $A$ with learning rate $\eta$. For example, if we select SGD as algorithm $A$, then $\mathrm{SGD}(w_t; g_t, \eta) = w_t - \eta g_t$. The proposed WeMix is a generic strategy in which the subroutine algorithms $A_1/A_2$ for solving the problem over the original data can be replaced by any stochastic algorithms (e.g., SGD, stochastic versions of momentum methods).
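The pluggable-update abstraction $A(\cdot;\cdot,\eta)$ described above can be sketched in a few lines. This is our own illustration of the interface (the function names and the toy loop are invented for the example, not taken from the paper's code):

```python
import numpy as np

def sgd_step(w, g, eta):
    """SGD(w_t; g_t, eta) = w_t - eta * g_t, as in the text."""
    return w - eta * g

def momentum_step_factory(beta=0.9):
    """A stochastic heavy-ball momentum method as an alternative subroutine A."""
    state = {"v": None}
    def step(w, g, eta):
        state["v"] = g if state["v"] is None else beta * state["v"] + g
        return w - eta * state["v"]
    return step

def run(A, grads, w0, eta=0.1):
    """WeMix-style generic loop: any update rule A(w; g, eta) can be plugged in."""
    w = w0
    for g in grads:
        w = A(w, g, eta)
    return w

rng = np.random.default_rng(0)
grads = [rng.normal(size=3) for _ in range(5)]
w_sgd = run(sgd_step, grads, np.zeros(3))
w_mom = run(momentum_step_factory(), grads, np.zeros(3))
```

Swapping the subroutine changes only the update rule, not the surrounding training loop, which is the point of stating WeMix generically.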

Table: Comparison of Testing Top-1 Accuracy (mean ± standard deviation, in %) using different methods on ResNet-18 over CIFAR-100 for mixup of three images and of ten images.

Theorem 3 shows that all intermediate solutions $w_t$ obtained in Stage II of AugDrop satisfy the constraint $\widetilde{L}(w_t) - \widetilde{L}(\widetilde{w}_*) \le 4\gamma_1$, that is to say, $w_t \in \mathcal{A}(4\gamma_1)$. Based on Proposition 2, we then enjoy a larger $\mu_e$ than $\mu$. Comparing the result of (20) in Theorem 3 with (9), training with AugDrop results in a better performance than directly training on $D$, due to $\mu_e \ge \mu$. Besides, when the data bias is large, i.e., $\delta_P \ge \Omega(L\log(n)/(n\mu))$, we know $O(L\log(n)/(n\mu_e^2)) \le O(\mu\delta_P/\mu_e^2) \le O(\delta_P/\mu)$, where the last inequality holds due to $\mu_e \ge \mu$. Comparing (20) with the result of (19) in Lemma 2 leads to the same conclusion.

Algorithm AugDrop.
Initialize: $w_1 \in \mathbb{R}^D$, $\eta_1, \eta_2 > 0$.
Stage I (train on augmented data): for $t = 1, 2, \ldots, T_1$: draw $m_1$ examples $(\widetilde{x}_{t,1},\widetilde{y}_{t,1}), \ldots, (\widetilde{x}_{t,m_1},\widetilde{y}_{t,m_1})$ at random from the augmented data; update $w_{t+1} = w_t - \eta_1\widetilde{g}_t$.
Stage II (train on original data): for $t = T_1+1, T_1+2, \ldots, T_1+n/m_2$: draw $m_2$ examples $(x_{t,1},y_{t,1}), \ldots, (x_{t,m_2},y_{t,m_2})$ without replacement at random from the original data; update $w_{t+1} = w_t - \eta_2 g_t$.
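A minimal runnable sketch of the two-stage schedule above. Everything concrete here (the least-squares objective, the Gaussian input-noise "augmentation", the constants) is an assumption we introduce for illustration; the paper's setting is a general non-convex loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem standing in for L; "augmentation" perturbs inputs.
X = rng.normal(size=(256, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def grad(w, idx, augment=False):
    """Average gradient of 0.5*(x^T w - y)^2 over a mini-batch of indices."""
    Xb = X[idx] + (0.05 * rng.normal(size=(len(idx), 5)) if augment else 0.0)
    return Xb.T @ (Xb @ w - y[idx]) / len(idx)

w = np.zeros(5)
T1, m1, eta1 = 200, 16, 0.05           # Stage I: mini-batch SGD on augmented data
for _ in range(T1):
    w = w - eta1 * grad(w, rng.choice(len(X), m1), augment=True)

m2, eta2 = 16, 0.05                     # Stage II: SGD on original data
perm = rng.permutation(len(X))          # sample without replacement
for s in range(0, len(X), m2):
    w = w - eta2 * grad(w, perm[s:s + m2])

assert np.linalg.norm(w - w_true) < 0.1
```

Stage I moves quickly toward a slightly biased solution (the input noise acts like a small ridge term); Stage II on the clean data removes that residual bias, mirroring the bias-correction role AugDrop's second stage plays in the analysis.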


Draw $m_0$ examples $(\widetilde{x}_{t,1},\widetilde{y}_{t,1}), \ldots, (\widetilde{x}_{t,m_0},\widetilde{y}_{t,m_0})$ at random from the augmented data.

\[
\le -\eta(1-\eta L)\,\mathbb{E}_t\|\nabla L_c(w_t)\|^2 + \frac{9}{8}\lambda^2\eta^2L\,\mathbb{E}_t\|\nabla\ell(y_t, f(x_t;w_t)) - \nabla L(w_t)\|^2 + 9(1-\lambda)^2\eta^2L\,\mathbb{E}_t\Big\|\frac{1}{m_0}\sum_{i=1}^{m_0}\nabla\ell_a(\widetilde{y}_{t,i}, f(\widetilde{x}_{t,i};w_t)) - \nabla L_a(w_t)\Big\|^2 \le -\eta(1-\eta L)\,\mathbb{E}_t\|\nabla L_c(w_t)\|^2 + 5\lambda^2\eta^2LG^2, \tag{44}
\]
where $\mathbb{E}_t[\cdot]$ is taken over the random variables $(x_t, y_t), (\widetilde{x}_{t,1},\widetilde{y}_{t,1}), \ldots, (\widetilde{x}_{t,m_0},\widetilde{y}_{t,m_0})$.


Since $(1-\eta_1\mu)^t \le \exp(-t\eta_1\mu)$ and $\sum_{i=0}^{t-1}(1-\eta_1\mu)^i \le \frac{1}{\eta_1\mu}$, for any $t \ge T_1$ the Stage-I iterates satisfy the analogue of (36). In the second stage of the proposed algorithm, we run mini-batch SGD over the original data $D$ with mini-batch size $m_2$. Using the same analysis as (39) in Theorem 1, then by (38) in the proof of Theorem 1 and (49) in the proof of Lemma 2, with probability $1-\delta'$ the corresponding gradient and bias bounds hold. It is then easy to verify that for any $t \in \{T_1+1, \ldots, T_1+n/m_2\}$, with probability $(1-\delta')^{n/m_2}$, the Stage-II recursion holds and yields the claimed constraint, where the last inequality follows as in Theorem 1 and where $\mu_e = \mu(4\gamma_1)$.

