OVER-TRAINING WITH MIXUP MAY HURT GENERALIZATION

Abstract

Mixup, which creates synthetic training instances by linearly interpolating random sample pairs, is a simple yet effective regularization technique for boosting the performance of deep models trained with SGD. In this work, we report a previously unobserved phenomenon in Mixup training: on a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs, giving rise to a U-shaped generalization curve. This behavior is further aggravated when the size of the original dataset is reduced. To help understand such a behavior of Mixup, we show theoretically that Mixup training may introduce undesired data-dependent label noise to the synthesized data. By analyzing a least-squares regression problem with a random feature model, we explain why noisy labels may cause the U-shaped curve to occur: Mixup improves generalization by fitting the clean patterns in the early training stage, but as training progresses, Mixup starts to over-fit the noise in the synthetic data. Extensive experiments are performed on a variety of benchmark datasets, validating this explanation.

1. INTRODUCTION

Mixup has empirically shown its effectiveness in improving the generalization and robustness of deep classification models (Zhang et al., 2018; Guo et al., 2019a;b; Thulasidasan et al., 2019; Zhang et al., 2022b). Unlike vanilla empirical risk minimization (ERM), in which networks are trained using the original training set, Mixup trains the networks with synthetic examples. These examples are created by linearly interpolating both the input features and the labels of instance pairs randomly sampled from the original training set. Owing to Mixup's simplicity and its effectiveness in boosting the accuracy and calibration of deep classification models, there has been a recent surge of interest in better understanding Mixup's working mechanism, training characteristics, regularization potential, and possible limitations (see, e.g., Thulasidasan et al. (2019), Guo et al. (2019a), Zhang et al. (2021), Zhang et al. (2022b)).

In this work, we further investigate the generalization properties of Mixup training. We first report a previously unobserved phenomenon in Mixup training. Through extensive experiments on various benchmarks, we observe that over-training networks with Mixup may result in a significant degradation of their generalization performance. As a result, along the training epochs, the generalization performance of the network measured by its testing error may exhibit a U-shaped curve. Figure 1 shows such a curve obtained from over-training ResNet18 with Mixup on CIFAR10. As can be seen from Figure 1, after training for a large number of epochs (beyond roughly 200), both the ERM- and Mixup-trained models keep decreasing their training loss, but the testing error of the Mixup-trained ResNet18 gradually increases, while that of the ERM-trained ResNet18 continues to decrease. Motivated by this observation, we conduct a theoretical analysis, aiming to better understand the aforementioned behavior of Mixup training.
We show theoretically that Mixup training may introduce undesired data-dependent label noise to the synthesized data. Then, by analyzing the gradient-descent dynamics of training a random feature model on a least-squares regression problem, we explain why noisy labels may cause the U-shaped curve to occur: under label noise, the early phase of training is primarily driven by the clean data pattern, which moves the model parameter closer to the correct solution. But as training progresses, the effect of the label noise accumulates through the iterations, gradually outweighs that of the clean pattern, and dominates the late training process. In this phase, the model parameter gradually moves away from the correct solution until it is sufficiently far away, approaching a location that depends on the noise realization.

2. RELATED WORK

Mixup Improves Generalization After the initial work of Zhang et al. (2018), a series of Mixup variants have been proposed (Guo et al., 2019a; Verma et al., 2019; Yun et al., 2019; Kim et al., 2020; Greenewald et al., 2021; Han et al., 2022; Sohn et al., 2022). For example, AdaMixup (Guo et al., 2019a) trains an extra network to dynamically determine the interpolation coefficient parameter α. Manifold Mixup (Verma et al., 2019) performs linear mixing on the hidden states of the neural networks. Aside from its use in various applications, Mixup's working mechanism and its possible limitations are also being explored constantly. For example, Zhang et al. (2021) demonstrate that Mixup yields a generalization upper bound in terms of the Rademacher complexity of the function class that the network fits. Thulasidasan et al. (2019) show that Mixup helps to improve the calibration of the trained networks. Zhang et al. (2022b) theoretically justify that the calibration effect of Mixup is correlated with the capacity of the network. Additionally, Guo et al. (2019a) point out a "manifold intrusion" phenomenon in Mixup training, where the synthetic data "intrude" the data manifolds of the real data.

Training on Random Labels, Epoch-Wise Double Descent and Robust Overfitting The thought-provoking work of Zhang et al. (2017) highlights that neural networks are able to fit data with random labels. Since then, the generalization behavior of networks trained on datasets with corrupted labels has been widely investigated (Arpit et al., 2017; Liu et al., 2020; Feng & Tu, 2021; Wang & Mao, 2022; Liu et al., 2022). Specifically, Arpit et al. (2017) observe that neural networks learn the clean pattern first, before fitting the data with random labels. This is further explained by Arora et al. (2019a), who demonstrate that in the overparameterization regime, the convergence of the loss depends on the projections of the labels on the eigenvectors of a certain Gram matrix, where true labels and random labels have different projections. In a parallel line of research, an epoch-wise double descent behavior of the testing loss of deep neural networks was observed by Nakkiran et al. (2020), shortly after the observation of model-wise double descent (Belkin et al., 2019; Hastie et al., 2022; Mei & Montanari, 2022; Ba et al., 2020). Theoretical works studying epoch-wise double descent are rather limited to date (Heckel & Yilmaz, 2021; Stephenson & Lee, 2021; Pezeshki et al., 2022); the analysis of Advani et al. (2020) inspires the theoretical treatment of the U-shaped curve of Mixup in this paper. Moreover, robust overfitting (Rice et al., 2020) is another related line of research. In particular, robust overfitting refers to a phenomenon in adversarial training where the robust accuracy first increases and then decreases after long training.

3. PRELIMINARIES

Consider a $C$-class classification setting with input space $\mathcal{X} = \mathbb{R}^{d_0}$ and label space $\mathcal{Y} := \{1, 2, \ldots, C\}$. Let $S = \{(x_i, y_i)\}_{i=1}^n$ be a training set, where each $y_i \in \mathcal{Y}$ may also be treated as a one-hot vector in $\mathcal{P}(\mathcal{Y})$, the space of distributions over $\mathcal{Y}$. Let $\Theta$ denote the model parameter space, and for each $\theta \in \Theta$, let $f_\theta: \mathcal{X} \to [0,1]^C$ denote the predictive function associated with $\theta$, which maps an input feature to a distribution in $\mathcal{P}(\mathcal{Y})$. For any pair $(x, y) \in \mathcal{X} \times \mathcal{P}(\mathcal{Y})$, let $\ell(\theta, x, y)$ denote the loss of the prediction $f_\theta(x)$ with respect to $y$. The empirical risk of $\theta$ on $S$ is then $R_S(\theta) := \frac{1}{n}\sum_{i=1}^n \ell(\theta, x_i, y_i)$. When training with Empirical Risk Minimization (ERM), one sets out to find a $\theta^*$ that minimizes this risk. It is evident that if $\ell(\cdot)$ is taken as the cross-entropy loss, the empirical risk $R_S(\theta)$ is non-negative, and $R_S(\theta) = 0$ precisely when $f_\theta(x_i) = y_i$ for every $i = 1, 2, \ldots, n$.

In Mixup, instead of using the original training set $S$, the training is performed on a synthetic dataset $\widetilde{S}$ obtained by interpolating training examples in $S$. For a given interpolating parameter $\lambda \in [0, 1]$, the synthetic training set $\widetilde{S}_\lambda$ is defined as
$$\widetilde{S}_\lambda := \big\{(\lambda x + (1-\lambda)x',\ \lambda y + (1-\lambda)y') : (x, y) \in S,\ (x', y') \in S\big\}. \quad (1)$$
The optimization objective, or the "Mixup loss", is then
$$\mathbb{E}_\lambda R_{\widetilde{S}_\lambda}(\theta) := \mathbb{E}_\lambda \frac{1}{|\widetilde{S}_\lambda|} \sum_{(\widetilde{x}, \widetilde{y}) \in \widetilde{S}_\lambda} \ell(\theta, \widetilde{x}, \widetilde{y}),$$
where the interpolating parameter $\lambda$ is drawn from a symmetric Beta distribution $\mathrm{Beta}(\alpha, \alpha)$. The default option is to take $\alpha = 1$. In this case, the following can be proved.

Lemma 3.1. Let $\ell(\cdot)$ be the cross-entropy loss, and let $\lambda$ be drawn from $\mathrm{Beta}(1, 1)$ (i.e., the uniform distribution on $[0, 1]$). Then for all $\theta \in \Theta$ and for any given training set $S$ that is balanced,
$$\mathbb{E}_\lambda R_{\widetilde{S}_\lambda}(\theta) \geq \frac{C-1}{2C},$$
where the equality holds if and only if $f_\theta(\widetilde{x}) = \widetilde{y}$ for each synthetic example $(\widetilde{x}, \widetilde{y}) \in \widetilde{S}_\lambda$. For 10-class classification tasks, the bound has value 0.45. Hence only when the Mixup loss approaches this value is the found solution near a true optimum (for models with adequate capacity).
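The value of this bound can be checked numerically. Below is a small Monte Carlo sketch (the function name and sampling scheme are ours): for a balanced dataset, a random pair has identical labels with probability $1/C$ (mixed label one-hot, entropy 0); otherwise the best achievable cross-entropy is the entropy of the two-point mixture $(\lambda, 1-\lambda)$, in nats.

```python
import numpy as np

def mixup_ce_floor(C: int, n_samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the minimal expected Mixup cross-entropy
    with lambda ~ Beta(1,1) = U(0,1) and a balanced C-class dataset."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.0, 1.0, n_samples)
    # A random pair from a balanced set has y == y' with probability 1/C;
    # the mixed label is then one-hot and its entropy is 0.
    same = rng.integers(0, C, n_samples) == rng.integers(0, C, n_samples)
    # Entropy (in nats) of the two-point mixed label (lam, 1 - lam).
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(lam * np.log(lam) + (1 - lam) * np.log(1 - lam))
    h = np.nan_to_num(h)  # 0 * log(0) -> 0
    h[same] = 0.0
    return float(h.mean())

print(mixup_ce_floor(C=10))  # ≈ 0.45 = (C - 1) / (2C)
```

The estimate matches the closed form: $\mathbb{E}_{\lambda}[H(\lambda,1-\lambda)] = 2\int_0^1 -\lambda\log\lambda\,d\lambda = 1/2$, scaled by the probability $1 - 1/C$ of mixing distinct labels.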

4. EMPIRICAL OBSERVATIONS

We conduct experiments on CIFAR10, CIFAR100 and SVHN using ERM and Mixup respectively. For each of the datasets, we adopt both the original dataset and balanced subsets obtained by downsampling the original data in certain proportions. SGD with weight decay is used. Along the training epochs, the testing error of the Mixup-trained models exhibits a U-shaped curve; this phenomenon is not observed in ERM. We also find that over-training with Mixup tends to force the network to learn a solution located at a sharper local minimum of the loss landscape, a phenomenon correlated with degraded generalization performance (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016).

4.2. RESULTS ON OVER-TRAINING WITH DATA AUGMENTATION

The data augmentation methods "random crop" and "horizontal flip" are applied when training on CIFAR10 and CIFAR100. We train ResNet18 on 10% of the CIFAR10 training set for up to 7000 epochs. The results are given in Figures 4a and 4b. In this case, the Mixup-trained model also produces a U-shaped generalization curve. However, even though the dataset is downsampled to a low proportion, the turning point of the U-shaped curve comes much later than in the previous experiments where data augmentation is not applied on CIFAR10. The results of over-training ResNet34 on 10% of the CIFAR100 training set for up to 7000 epochs are given in Figures 4c and 4d, where similar phenomena are observed.

5.1. MIXUP INDUCES LABEL NOISE

We will use the capital letters $X$ and $Y$ to denote the random variables representing the input feature and the output label, while reserving $x$ and $y$ for their respective realizations. In particular, we consider each true label $y$ as a token in $\mathcal{Y}$, not a one-hot vector in $\mathcal{P}(\mathcal{Y})$. Let $P(Y|X)$ be the ground-truth conditional distribution of the label $Y$ given the input feature $X$. For simplicity, we also express $P(Y|X)$ as a vector-valued function $f: \mathcal{X} \to \mathbb{R}^C$, where $f_j(x) \triangleq P(Y = j \mid X = x)$ for each dimension $j \in \mathcal{Y}$. Also for simplicity, we consider Mixup with a fixed $\lambda \in [0, 1]$; the extension to random $\lambda$ is straightforward. Let $\widetilde{X}$ and $\widetilde{Y}$ be the random variables corresponding to the synthetic feature and the synthetic label respectively, so that $\widetilde{X} \triangleq \lambda X + (1-\lambda)X'$. Let $P(\widetilde{Y}|\widetilde{X})$ be the conditional distribution of the synthetic label given the synthetic feature, induced by Mixup, namely $P(\widetilde{Y} = j \mid \widetilde{X}) = \lambda f_j(X) + (1-\lambda) f_j(X')$ for each $j$. Then for a synthetic feature $\widetilde{X}$, there are two ways to assign it a hard label. The first is based on the ground truth, assigning $\widetilde{Y}^*_h \triangleq \arg\max_{j \in \mathcal{Y}} f_j(\widetilde{X})$. The second is based on the Mixup-induced conditional $P(\widetilde{Y}|\widetilde{X})$, assigning $\widetilde{Y}_h \triangleq \arg\max_{j \in \mathcal{Y}} P(\widetilde{Y} = j \mid \widetilde{X})$. When the two assignments disagree, i.e., $\widetilde{Y}_h \neq \widetilde{Y}^*_h$, we say that the Mixup-assigned label $\widetilde{Y}_h$ is noisy.

Theorem 5.1. For any fixed $X$, $X'$ and $\widetilde{X}$ related by $\widetilde{X} = \lambda X + (1-\lambda)X'$ for a fixed $\lambda \in [0,1]$, the probability of assigning a noisy label is lower bounded by
$$P(\widetilde{Y}_h \neq \widetilde{Y}^*_h \mid \widetilde{X}) \geq \mathrm{TV}\big(P(\widetilde{Y}|\widetilde{X}),\ P(Y|\widetilde{X})\big) \geq \frac{1}{2} \sup_{j \in \mathcal{Y}} \Big| f_j(\widetilde{X}) - \big[\lambda f_j(X) + (1-\lambda) f_j(X')\big] \Big|,$$
where $\mathrm{TV}(\cdot, \cdot)$ is the total variation distance (see Appendix D).

Remark 5.1. This lower bound hints that the label noise induced by Mixup training depends on the distribution of the original data $P_X$, the convexity of $f(X)$, and the value of $\lambda$. Clearly, Mixup will create noisy labels with non-zero probability (at least for some $\lambda$) unless $f_j$ is linear for each $j$.

Remark 5.2.
We often consider that the real data are labelled with certainty, i.e., $\max_{j\in\mathcal{Y}} f_j(X) = 1$ and $\sum_{j=1}^C f_j(X) = 1$. Then the probability of assigning a noisy label to a given synthetic example can be discussed in three situations: i) if $\widetilde{Y}^*_h \notin \{Y, Y'\}$, where $Y$ could be the same as $Y'$, then $\widetilde{Y}$ is a noisy label with probability one; ii) if $\widetilde{Y}^*_h \in \{Y, Y'\}$ with $Y \neq Y'$, then the probability of assigning a noisy label is non-zero and depends on $\lambda$; iii) if $\widetilde{Y}^*_h = Y = Y'$, then $\widetilde{Y}^*_h = \widetilde{Y}$.

As shown in (Arpit et al., 2017; Arora et al., 2019a), when neural networks are trained with a fraction of random labels, they first learn the clean pattern and then overfit to the noisy labels. In Mixup training, we in fact create much more data, possibly with noisy labels, than in traditional ERM training ($n^2$ examples for a fixed $\lambda$). Thus, one may expect an improved performance (relative to ERM) in the early training phase, due to the clean pattern in the enlarged training set, but a performance impairment in the later phase due to the noisy labels. Specifically, if $\widetilde{Y}^*_h \notin \{Y, Y'\}$ happens with high probability, a phenomenon known as "manifold intrusion" (Guo et al., 2019a), then the synthetic dataset contains too many noisy labels, causing Mixup to perform worse than ERM.

Theorem 5.1 implies that, in classification problems, Mixup training induces label noise. Next, we provide a theoretical analysis using a regression setup to explain why such label noise may result in the U-shaped learning curve. The choice of a regression setup in this analysis is due to the difficulty of directly analyzing classification problems (under the cross-entropy loss). Such a regression setting may not perfectly explain the U-shaped curve in classification tasks; we nevertheless believe it gives adequate insight into those scenarios as well. This approach has been taken in most analytic works that study the behaviour of deep learning. For example, Arora et al. (2019b) use a regression setup to analyze the optimization and generalization properties of overparameterized neural networks, and Yang et al. (2020) theoretically analyze the bias-variance tradeoff in deep network generalization using a regression problem.
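Returning to Theorem 5.1, its lower bound can be evaluated on a toy two-class problem. The sigmoid ground truth below is our own illustrative choice, not from the paper: mixing two points where $f$ is nonlinear yields a strictly positive bound, while a symmetric pair where the curvature cancels yields zero.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def noisy_label_lower_bound(x, x_prime, lam):
    """Evaluate (1/2) * sup_j |f_j(x_tilde) - (lam*f_j(x) + (1-lam)*f_j(x'))|
    for a 2-class problem with f_1 = sigmoid and f_2 = 1 - sigmoid."""
    x_tilde = lam * x + (1 - lam) * x_prime
    f = lambda t: np.array([sigmoid(t), 1.0 - sigmoid(t)])
    gap = np.abs(f(x_tilde) - (lam * f(x) + (1 - lam) * f(x_prime)))
    return 0.5 * gap.max()

# f is nonlinear, so mixing generally gives a strictly positive bound ...
print(noisy_label_lower_bound(0.0, 3.0, 0.5))   # ≈ 0.046 > 0
# ... except where the curvature cancels by symmetry around the midpoint:
print(noisy_label_lower_bound(-2.0, 2.0, 0.5))  # 0 (up to rounding)
```

Only when $f_j$ is linear between the endpoints does the bound vanish identically, in line with Remark 5.1.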

5.2. REGRESSION SETTING WITH RANDOM FEATURE MODELS

Consider a simple least-squares regression problem. Let $\mathcal{Y} = \mathbb{R}$ and let $f: \mathcal{X} \to \mathcal{Y}$ be the ground-truth labelling function. Let $(\widetilde{X}, \widetilde{Y})$ be a synthetic pair obtained by mixing $(X, Y)$ and $(X', Y')$. Let $\widetilde{Y}^* = f(\widetilde{X})$ and $Z \triangleq \widetilde{Y} - \widetilde{Y}^*$. Then $Z$ can be regarded as noise introduced by Mixup, which may be data-dependent. For example, if $f$ is strongly convex with some parameter $\rho > 0$, then $Z \geq \frac{\rho}{2}\lambda(1-\lambda)\|X - X'\|_2^2$.

Given a synthesized training dataset $\widetilde{S} = \{(\widetilde{X}_i, \widetilde{Y}_i)\}_{i=1}^m$, consider a random feature model $\theta^T \phi(X)$, where $\phi: \mathcal{X} \to \mathbb{R}^d$ and $\theta \in \mathbb{R}^d$. We consider $\phi$ fixed and only learn the model parameter $\theta$ using gradient descent on the MSE loss
$$R_{\widetilde{S}}(\theta) \triangleq \frac{1}{2m}\big\|\theta^T\widetilde{\Phi} - \widetilde{\mathbf{Y}}^T\big\|_2^2,$$
where $\widetilde{\Phi} = [\phi(\widetilde{X}_1), \phi(\widetilde{X}_2), \ldots, \phi(\widetilde{X}_m)] \in \mathbb{R}^{d \times m}$ and $\widetilde{\mathbf{Y}} = [\widetilde{Y}_1, \widetilde{Y}_2, \ldots, \widetilde{Y}_m] \in \mathbb{R}^m$. For a fixed $\lambda$, Mixup can create $m = n^2$ synthesized examples, so it is reasonable to assume $m > d$ (i.e., the under-parameterized regime) in Mixup training. For example, ResNet-50 has fewer than 30 million parameters, while the square of the CIFAR10 training set size exceeds 200 million, even without other data augmentation techniques. The gradient flow, as shown in Liao & Couillet (2018), is
$$\dot{\theta} = -\eta \nabla R_{\widetilde{S}}(\theta) = \frac{\eta}{m}\widetilde{\Phi}\widetilde{\Phi}^T\big(\widetilde{\Phi}^\dagger\widetilde{\mathbf{Y}} - \theta\big), \quad (2)$$
where $\eta$ is the learning rate and $\widetilde{\Phi}^\dagger = (\widetilde{\Phi}\widetilde{\Phi}^T)^{-1}\widetilde{\Phi}$ is the Moore-Penrose inverse of $\widetilde{\Phi}^T$ (well-defined only when $m > d$). Thus, we have the following important lemma.

Lemma 5.1. Let $\theta^* = \widetilde{\Phi}^\dagger\widetilde{\mathbf{Y}}^*$ and $\theta_{\mathrm{noise}} = \widetilde{\Phi}^\dagger\mathbf{Z}$, where $\mathbf{Z} = [Z_1, Z_2, \ldots, Z_m] \in \mathbb{R}^m$. The ODE in Eq. (2) has the closed-form solution
$$\theta_t - \theta^* = (\theta_0 - \theta^*)e^{-\frac{\eta}{m}\widetilde{\Phi}\widetilde{\Phi}^T t} + \big(I_d - e^{-\frac{\eta}{m}\widetilde{\Phi}\widetilde{\Phi}^T t}\big)\theta_{\mathrm{noise}}. \quad (3)$$

Remark 5.3. Notably, $\theta^* = \widetilde{\Phi}^\dagger\widetilde{\mathbf{Y}}^*$ may be seen as the "clean pattern" of the training data. The first term in Eq. (3) is decreasing (in norm) and vanishes as $t \to \infty$. Thus its role is to move $\theta_t$ towards the clean pattern $\theta^*$, allowing the model to generalize to unseen data. But it dominates the dynamics of $\theta_t$ only in the early training phase.
The second term, initially $0$, increases with $t$ and converges to $\theta_{\mathrm{noise}}$ as $t \to \infty$. Thus its role is to move $\theta_t$ towards the "noisy pattern" $\theta^* + \theta_{\mathrm{noise}}$. It dominates the later training phase and hence hurts generalization. It is noteworthy that $\theta^* + \theta_{\mathrm{noise}}$ is also the closed-form solution of the regression problem (under Mixup labels). This suggests that the optimization problem associated with the Mixup loss has a "wrong" solution, but it is possible to benefit from solving this problem only partially, using gradient descent without over-training.

Noting that the population risk at time $t$ is $R_t \triangleq \mathbb{E}_{\theta_t, X, Y}\big(\theta_t^T\phi(X) - Y\big)^2$ and the true optimal risk is $R^* = \mathbb{E}_{X, Y}\big(Y - \theta^{*T}\phi(X)\big)^2$, we have the following result.

Theorem 5.2 (Dynamics of Population Risk). Given a synthesized dataset $\widetilde{S}$, assume $\theta_0 \sim \mathcal{N}(0, \xi^2 I_d)$, $\|\phi(X)\|_2^2 \leq C_1/2$ for some constant $C_1 > 0$, and $|Z| \leq \sqrt{C_2}$ for some constant $C_2 > 0$. Then we have the upper bound
$$R_t - R^* \leq C_1 \sum_{k=1}^d \Big[\big(\xi^2 + \theta_k^{*2}\big)e^{-2\eta\mu_k t} + \frac{C_2}{\mu_k}\big(1 - e^{-\eta\mu_k t}\big)^2\Big] + 2\sqrt{C_1 R^* \zeta},$$
where $\zeta = \sum_{k=1}^d \max\big\{\xi^2 + \theta_k^{*2}, \frac{C_2}{\mu_k}\big\}$ and $\mu_k$ is the $k$-th eigenvalue of the matrix $\frac{1}{m}\widetilde{\Phi}\widetilde{\Phi}^T$.

Remark 5.4. The additive noise $Z$ is usually assumed to be zero-mean Gaussian in the literature on generalization dynamics (Advani et al., 2020; Pezeshki et al., 2022; Heckel & Yilmaz, 2021), but this would be hard to justify in our context. The boundedness assumption on $Z$ in the theorem, however, is easily satisfied as long as the output of $f$ is bounded.

Remark 5.5. If we further let $\xi = 0$ (i.e., zero initialization) and assume that the eigenvalues of $\frac{1}{m}\widetilde{\Phi}\widetilde{\Phi}^T$ are all equal to $\mu$, then the summation in the bound above reduces to $C_1\big[\|\theta^*\|^2 e^{-2\eta\mu t} + (C_2/\mu)(1 - e^{-\eta\mu t})^2\big]$. It is then clear that the magnitude of the curve is controlled by the norm of $\theta^*$, the norm of the representation, the noise level, and $\mu$.
Theorem 5.2 indicates that the population risk first decreases due to the first term (i.e., $(\xi^2 + \theta_k^{*2})e^{-2\eta\mu_k t}$) and then grows due to the label noise (i.e., $\frac{C_2}{\mu_k}(1 - e^{-\eta\mu_k t})^2$). Overall, the population risk exhibits a U-shaped curve. Notice that the quantity $\eta\mu_k$ plays a key role in the upper bound: the larger $\eta\mu_k$ is, the earlier the turning point of the "U" arrives. This may have an interesting application, justifying a multi-stage training strategy where the learning rate is reduced at each new stage. Suppose that with the initial learning rate, at epoch $T$, the test error has dropped to the bottom of the U-curve corresponding to this learning rate. If the learning rate is decreased at this point, then the U-curve corresponding to the new learning rate may have a lower minimum error, with its bottom shifted to the right. In this case, the new learning rate allows the testing error to move to the new U-curve and decay further.

To empirically verify the theoretical results discussed in Section 5.2, we construct a simple teacher-student regression setting. The teacher network is a two-layer neural network with Tanh activation and random weights; it serves only to create training data for the student network. Specifically, the training data are created by drawing $\{X_i\}_{i=1}^n$ i.i.d. from a standard Gaussian $\mathcal{N}(0, I_{d_0})$ and passing them through the teacher network to obtain labels $\{Y_i\}_{i=1}^n$. The student network is also a two-layer neural network with Tanh activation, with hidden-layer dimension $d = 100$. We fix the parameters of the first layer and train only the second layer using the generated training data. Full-batch gradient descent on the MSE loss is used. For the value of $\lambda$, we consider two cases: a fixed value $\lambda = 0.5$, and random values drawn from $\mathrm{Beta}(1, 1)$ at each epoch. As a comparison, we also present the result of ERM training in an over-parameterized regime (i.e., $n < d$).
The testing loss dynamics are presented in Figure 5. We first note that Mixup still outperforms ERM in this regression problem, but only Mixup training produces a U-shaped curve, while the testing loss of ERM training converges to a constant value. Furthermore, the testing loss of Mixup training exhibits the U-shaped behavior both for fixed λ = 0.5 and for random λ drawn from Beta(1, 1).

Published as a conference paper at ICLR 2023

This suggests that our analysis of Mixup in Section 5.2, based on a fixed λ, is also indicative of more general settings of λ. Figure 5 also indicates that when λ is fixed to 0.5, the increasing stage of the U-shaped curve arrives earlier than with λ drawn from Beta(1, 1). This is consistent with our theoretical results in Section 5.2: since the constant value λ = 0.5 represents the largest noise level in Mixup, the noise-dominating effect in Mixup training arrives earlier.
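The linear dynamics behind these curves can also be reproduced directly. Below is a minimal simulation of the random feature model of Section 5.2, with synthetic bounded label noise standing in for the Mixup-induced $Z$ (all dimensions, scales, and the noise distribution are our own toy choices): gradient descent matches the discrete-time analogue of Lemma 5.1's closed form, and the distance to the clean solution $\theta^*$ traces a U shape.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, eta, steps = 100, 1000, 0.5, 400  # under-parameterized: m > d

# Random features, a realizable clean target, and bounded label noise.
Phi = rng.standard_normal((d, m))
theta_star = rng.standard_normal(d) / np.sqrt(d)  # "clean pattern"
Y_clean = Phi.T @ theta_star
Z = rng.uniform(-2.5, 2.5, m)                     # bounded, like Mixup's Z
Y_tilde = Y_clean + Z                             # noisy synthetic labels

A = Phi @ Phi.T / m                               # (1/m) Phi Phi^T
theta_noise = np.linalg.solve(A, Phi @ Z / m)     # Phi^dagger Z

# Gradient descent on R(theta) = (1/2m) ||Phi^T theta - Y_tilde||^2.
theta = np.zeros(d)
dists = [np.linalg.norm(theta - theta_star)]
for _ in range(steps):
    theta -= eta * Phi @ (Phi.T @ theta - Y_tilde) / m
    dists.append(np.linalg.norm(theta - theta_star))

# Discrete-time analogue of Lemma 5.1's closed form:
#   theta_t = theta_fix + (I - eta*A)^t (theta_0 - theta_fix),
# where theta_fix = theta_star + theta_noise is the "noisy" solution.
theta_fix = theta_star + theta_noise
M = np.linalg.matrix_power(np.eye(d) - eta * A, steps)
theta_closed = theta_fix - M @ theta_fix          # theta_0 = 0
print(np.allclose(theta, theta_closed, atol=1e-6))  # True

# U shape: distance to theta_star dips, then rises toward ||theta_noise||.
print(min(dists) < dists[-1] < dists[0])
```

The last line reflects the two phases of Remark 5.3: the clean term drives $\theta_t$ toward $\theta^*$ early on, and the noise term then pulls it away toward $\theta^* + \theta_{\mathrm{noise}}$.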

6.2. USING MIXUP ONLY IN THE EARLY STAGE OF TRAINING

In the previous section, we argued that Mixup training learns the "clean patterns" in the early stage of the training process and then overfits the "noisy patterns" in the later stage. This conclusion implies that turning off Mixup after a certain number of epochs and returning to standard ERM training may prevent the training from overfitting the noise induced by Mixup. We now present results obtained from such a training scheme on both CIFAR10 and SVHN in Figure 6. The results in Figure 6 clearly indicate that switching from Mixup to ERM at an appropriate time successfully avoids the generalization degradation. Figure 6 also suggests that switching from Mixup to ERM too early may not boost the model performance. In addition, if the switch happens too late, the memorization of noisy data may have already taken effect, which impacts generalization negatively. We note that our results here can be regarded as a complement to (Golatkar et al., 2019), where the authors show that regularization techniques only matter during the early phase of learning.
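The switching scheme is easy to express in code. The sketch below uses a linear softmax model on toy Gaussian data as a stand-in for the real networks and datasets (the model, data, and hyperparameters are placeholders of our own, not the paper's setup): Mixup batches are used before `switch_epoch`, plain ERM batches afterwards.

```python
import numpy as np

def one_hot(y, C):
    return np.eye(C)[y]

def mixup_batch(X, Y, rng, alpha=1.0):
    """Mix a batch with a permuted copy of itself (one lambda per batch)."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    return lam * X + (1 - lam) * X[perm], lam * Y + (1 - lam) * Y[perm]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, Y, switch_epoch, epochs=200, lr=0.5, seed=0):
    """Linear softmax classifier: Mixup before `switch_epoch`, ERM after."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    for epoch in range(epochs):
        if epoch < switch_epoch:
            Xb, Yb = mixup_batch(X, Y, rng)   # early phase: Mixup
        else:
            Xb, Yb = X, Y                     # late phase: plain ERM
        P = softmax(Xb @ W)
        W -= lr * Xb.T @ (P - Yb) / len(Xb)   # cross-entropy gradient step
    return W

# Toy data: two Gaussian blobs (placeholder for a real dataset).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.repeat([0, 1], 200)
W = train(X, one_hot(y, 2), switch_epoch=100)
acc = (softmax(X @ W).argmax(1) == y).mean()
print(acc)  # well above chance on this near-separable toy problem
```

In the paper's experiments, the timing of `switch_epoch` is what matters: too early forfeits Mixup's benefit, too late lets the noise memorization take effect.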

7. FURTHER INVESTIGATION

Impact of Data Size on U-shaped Curve In the over-training experiments without data augmentation, although the U-shaped behavior occurs on both 100% and 30% of the original training data for both CIFAR10 and SVHN, we notice that smaller datasets make the turning point of the U-shaped curve arrive earlier. We now corroborate this phenomenon with more experimental results, as shown in Figure 7. In this context, an appropriate data augmentation can be seen as simply expanding the training set with additional clean data. The impact of data augmentation on the over-training dynamics of Mixup is then arguably via increasing the size of the training set. This explains our observations in Section 4.2, where the turning points in training with data augmentation arrive much later compared to those without data augmentation. Those observations are also consistent with the results in Figure 7.

It may be tempting to apply the usual analysis of generalization dynamics from the existing literature (Liao & Couillet, 2018; Advani et al., 2020; Stephenson & Lee, 2021) to Mixup training. For example, one can analyze the distribution of the eigenvalues in Theorem 5.2. Specifically, if the entries of $\widetilde{\Phi}$ are independent and identically distributed with zero mean, then in the limit $d, m \to \infty$ with $d/m = \gamma \in (0, +\infty)$, the eigenvalues $\{\mu_k\}_{k=1}^d$ follow the Marchenko-Pastur (MP) distribution (Marčenko & Pastur, 1967), defined as
$$P_{\mathrm{MP}}(\mu|\gamma) = \frac{1}{2\pi}\frac{\sqrt{(\gamma_+ - \mu)(\mu - \gamma_-)}}{\mu\gamma}\,\mathbf{1}_{\mu \in [\gamma_-, \gamma_+]},$$
where $\gamma_\pm = (1 \pm \sqrt{\gamma})^2$. Note that $P_{\mathrm{MP}}$ is non-zero only when $\mu = 0$ or $\mu \in [\gamma_-, \gamma_+]$. When $\gamma$ is close to one, the probability of extremely small eigenvalues is immensely increased. From Theorem 5.2, when $\mu_k$ is small, the second term, governed by the noisy pattern, dominates the behavior of the population risk and converges to a larger value. Thus, letting $d \ll m$ alleviates the domination of the noise term in Theorem 5.2.
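The MP support edges are easy to check empirically in the idealized i.i.d. setting (which, as noted next, is precisely what Mixup's $\widetilde{\Phi}$ violates); the dimensions below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 200, 2000                       # gamma = d/m = 0.1
Phi = rng.standard_normal((d, m))      # idealized iid, zero-mean entries

# Spectrum of (1/m) Phi Phi^T vs. the MP support [gamma_-, gamma_+].
mu = np.linalg.eigvalsh(Phi @ Phi.T / m)
gamma = d / m
lo, hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
print(mu.min(), lo)   # both near (1 - sqrt(0.1))^2 ≈ 0.47
print(mu.max(), hi)   # both near (1 + sqrt(0.1))^2 ≈ 1.73
```

With $\gamma$ small (i.e., $d \ll m$), the smallest eigenvalue stays well away from zero, which is exactly what keeps the noise term $C_2/\mu_k$ in Theorem 5.2 under control.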
However, it is important to note that such an analysis lacks rigor here, since the columns of $\widetilde{\Phi}$ are not independent (two columns may result from linearly combining the same pair of original instances). To apply a similar analysis, one needs to remove or relax the independence conditions on the entries of $\widetilde{\Phi}$, for example by invoking techniques similar to those developed in Bryson et al. (2021). This is beyond the scope of this paper, and we leave it for future study. Additionally, when switching from Mixup training to ERM (as in Figure 6), the gradient norm instantly becomes zero (see Figure 14 in Appendix B.3). This further indicates that the "clean patterns" are already learned by Mixup-trained neural networks at the early stage of training, and that the original data may no longer provide any useful gradient signal.

8. CONCLUDING REMARKS

We discovered a novel phenomenon in Mixup training: over-training with Mixup may give rise to a U-shaped generalization curve. We theoretically show that this is due to the data-dependent label noise introduced into the synthesized data, and we suggest that Mixup improves generalization by fitting the clean patterns in the early training stage but over-fits the noise as training proceeds. The effectiveness of Mixup, together with the fact that it works by only partially optimizing its loss function without reaching convergence, as validated by our analysis and experiments, seems to suggest that the dynamics of the iterative learning algorithm and an appropriate criterion for terminating the algorithm may be more essential than the loss function itself or the exact solutions of the optimization problem. Exploration in the space of iterative algorithms (rather than the space of loss functions) may lead to fruitful discoveries.

A EXPERIMENTAL SETUPS OF OVER-TRAINING

For any experimental setting (such as the training dataset and its size, whether ERM or Mixup is used, whether other data augmentation is used, etc.), we define a training "trial" as a training process starting from random initialization and running to a certain epoch t. In each trial, we record the minimum training loss obtained during the entire training process, as well as the testing accuracy of the intermediate model that achieves this minimum training loss. Across trials we gradually increase t so as to gradually let the model be over-trained. For each t, we repeat the trial 10 times with different random seeds and collect all the recorded results (minimum training losses and the corresponding testing accuracies). We then compute their averages and standard deviations for all t's. These results are eventually used to plot the line graphs for presentation; Figure 2a, for example, is obtained in this manner.

As for ResNet34, besides CIFAR100, it is also used for both the CIFAR10 and SVHN datasets. Training is performed on both datasets for a total of 200, 400 and 800 epochs, respectively. The results for CIFAR10 are shown in Figure 10. For both the 30% dataset and the original dataset, Mixup exhibits a phenomenon similar to that observed when training ResNet18 on CIFAR10. The difference is that over-training ResNet34 with ERM lets the testing accuracy gradually increase on both the 30% dataset and the original dataset. In addition, we have trained VGG16 on the CIFAR10 training set (100% data and 30% data) for up to 1600 epochs in total, without data augmentation. The results are provided in Figure 12. In both cases, over-training VGG16 with either ERM or Mixup gradually reduces the best achieved training loss. However, the testing accuracy of the Mixup-trained network also decreases, while that of the ERM-trained network shows no significant change.

B.2 RESULTS OF MEAN SQUARE ERROR LOSS WITHOUT DATA AUGMENTATION

We also perform Mixup training experiments using the mean square error (MSE) loss on both the CIFAR10 and SVHN datasets. Figure 13 illustrates that the U-shaped behavior observed in the previous experiments is also present when using the MSE loss. To ensure proper training, the learning rate is decreased by a factor of 10 at epochs 100 and 150. In Figure 8, we observe that the gradient norm of Mixup training does not diminish at the end of training and can even explode to a very high value. In contrast, ERM yields a gradient norm of zero at the end of training. Figure 14 illustrates that when switching from Mixup training to ERM training after a certain period, the gradient norm rapidly becomes zero. This occurs because the Mixup-trained neural networks have already learned the "clean patterns", so the original data no longer provide any useful gradient signal. This further supports the idea that the latter stage of Mixup training is primarily devoted to memorizing noisy data.

To validate the relationship between minima flatness and generalization behavior on covariate-shift datasets, we have run some of the Mixup-trained ResNet18 networks and tested their accuracies on CIFAR10.1 (Recht et al., 2018), CIFAR10.2 (Lu et al., 2020) and CIFAR10-C (Hendrycks & Dietterich, 2019) with Gaussian noise of severity 1 and 5 (denoted CIFAR10-C-1 and CIFAR10-C-5). The results of the models pretrained on 100% of CIFAR10 are given in Figure 15, and the results of the models pretrained on 30% of CIFAR10 are given in Figure 16. From these results, we see that as the number of training epochs increases, the testing performance of the models on CIFAR10.1 and CIFAR10.2 decreases, following a trend similar to our results on the standard testing sets (i.e., the original CIFAR10 testing set without covariate shift). On CIFAR10-C, however, this behaviour is not observed. In particular, performance on CIFAR10-C-5 continues to improve over the training iterations. This seems to suggest that the flatness of the empirical-risk loss landscape may impact generalization to covariate-shift datasets in more complex ways, possibly depending on the nature and structure of the covariate shift.

The results show that when η = 2, RegMixup performs nearly identically to standard Mixup in the over-training scenario. When η = 0.1, RegMixup postpones the arrival of the turning point and outperforms standard Mixup at large epochs. However, the phenomenon that the generalization performance of the trained model degrades with over-training persists.

C EXPERIMENT SETTINGS FOR THE TEACHER-STUDENT TOY EXAMPLE

We set the dimension of the input feature to d0 = 10. The teacher network consists of two layers with the Tanh activation function, and its hidden layer has a width of 5. Similarly, the student network is a two-layer neural network with Tanh, where we train only the second layer and keep the parameters of the first layer fixed; its hidden layer has dimension 100 (i.e., d = 100). For the value of λ, we either draw it from a Beta(1, 1) distribution in each epoch or fix it to 0.5. We choose n = 20, which puts ERM in the overparameterized regime (n < d) and Mixup in the underparameterized regime (m ≥ n² > d). The learning rate is set to 0.1, and we use full-batch gradient descent to train the student network with the MSE loss. Here, the term "full-batch" means that the batch size is equal to n, enabling a fair comparison between the fixed-λ and random-λ methods. For additional information, please refer to our code. In the teacher-student setting, we also experiment with different fixed values of λ; the results are presented in Figure 18. Of particular interest is the observation that as the noise level increases with λ approaching 0.5, the turning point of the testing error occurs earlier. This finding is consistent with our theoretical results.
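The data-generating side of this toy setting can be sketched as follows (the weight scales are our own choice; dimensions match the description above). It makes the Mixup-induced regression noise Z = Ỹ − f(X̃) from Section 5.2 explicit: Z vanishes on the diagonal pairs and is nonzero wherever the tanh teacher is nonlinear between the endpoints.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, h, n = 10, 5, 20   # input dim, teacher width, sample size

# Random two-layer tanh teacher; it only serves to label the data.
W1 = rng.standard_normal((h, d0))
W2 = rng.standard_normal(h)
teacher = lambda X: np.tanh(X @ W1.T) @ W2

X = rng.standard_normal((n, d0))
Y = teacher(X)

# Mix all n^2 ordered pairs with a fixed lambda = 0.5.
lam = 0.5
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
X_mix = lam * X[i.ravel()] + (1 - lam) * X[j.ravel()]
Y_mix = lam * Y[i.ravel()] + (1 - lam) * Y[j.ravel()]

# Z is the regression noise induced by Mixup (Section 5.2).
Z = Y_mix - teacher(X_mix)
print(X_mix.shape)  # (400, 10): m = n^2 synthetic examples
print(np.abs(Z).max() > 1e-3)  # the tanh teacher is nonlinear
```

Training the student's second layer on (X_mix, Y_mix) then reproduces the U-shaped testing loss of Figure 5, with Z playing the role of the bounded noise in Theorem 5.2.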

D OMITTED DEFINITIONS AND PROOFS

Definition D.1 (Total Variation). The total variation between two probability measures P and Q is $\mathrm{TV}(P, Q) \triangleq \sup_E |P(E) - Q(E)|$, where the supremum is over all measurable sets E.

Lemma D.1 ((Levin & Peres, 2017, Proposition 4.2)). Let P and Q be two probability distributions on $\mathcal{X}$. If $\mathcal{X}$ is countable, then
$$\mathrm{TV}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |P(x) - Q(x)|.$$

Proof. We first prove the closed form of the cross-entropy loss's lower bound. For any two discrete distributions P and Q defined on the same probability space $\mathcal{Y}$, the KL divergence of P from Q is defined as
$$D_{\mathrm{KL}}(P \,\|\, Q) := \sum_{y \in \mathcal{Y}} P(y) \log \frac{P(y)}{Q(y)}.$$
It is non-negative, and it equals 0 if and only if P = Q. Let us denote the i-th element of $f_\theta(x)$ by $f_\theta(x)_i$. By adapting the definition of the cross-entropy loss, we have
$$\ell(\theta, (x, y)) = -y^T \log f_\theta(x) = -\sum_{i=1}^{C} y_i \log f_\theta(x)_i = -\sum_{i=1}^{C} y_i \log \Big( \frac{f_\theta(x)_i}{y_i} \cdot y_i \Big) = -\sum_{i=1}^{C} y_i \log \frac{f_\theta(x)_i}{y_i} - \sum_{i=1}^{C} y_i \log y_i = D_{\mathrm{KL}}\big(y \,\|\, f_\theta(x)\big) + H(y) \ge H(y),$$
where the equality holds if and only if $f_\theta(x) = y$. Here $H(y) := -\sum_{i=1}^{C} y_i \log y_i$ is the entropy of the discrete distribution y. In particular, in ERM training, since y is one-hot, its entropy is 0 by definition. Therefore, the lower bound of the empirical risk is given as follows.

$$\hat{R}_S(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, (x_i, y_i)) \ge 0. \tag{6}$$
The equality holds if $f_\theta(x_i) = y_i$ for each $i \in \{1, 2, \dots, n\}$.

We then prove the lower bound of the expectation of the empirical Mixup loss. From Eq. (5), the lower bound of the general Mixup loss for a given λ, when $y \ne y'$, is $-\lambda \log \lambda - (1-\lambda) \log(1-\lambda)$. Note that when α = 1, Beta(α, α) is simply the uniform distribution on the interval [0, 1]: U(0, 1). Using the fact that the probability density of U(0, 1) is constantly 1 on [0, 1], the lower bound of $\mathbb{E}_{\lambda \sim \mathrm{Beta}(1,1)}\, \ell(\theta, (\tilde{x}, \tilde{y}))$ where $y \ne y'$ is given by
$$\int_0^1 \big( -\lambda \log \lambda - (1-\lambda) \log(1-\lambda) \big)\, d\lambda = \frac{1}{2}.$$
For the second inequality, by Lemma D.1, we have
$$\mathrm{TV}\big(P(\tilde{Y}_h \mid \tilde{X}),\, P(\tilde{Y}^*_h \mid \tilde{X})\big) = \frac{1}{2} \sum_{j=1}^{C} \Big| P(\tilde{Y}^* = j \mid \tilde{X}) - P(\tilde{Y} = j \mid \tilde{X}) \Big| = \frac{1}{2} \sum_{j=1}^{C} \Big| f_j(\tilde{X}) - \big( (1-\lambda) f_j(X) + \lambda f_j(X') \big) \Big| \ge \sup_j \frac{1}{2} \Big| f_j(\tilde{X}) - \big( (1-\lambda) f_j(X) + \lambda f_j(X') \big) \Big|.$$
This completes the proof.

D.3 PROOF OF LEMMA 5.1

Proof. The ordinary differential equation of Eq. (2) (Newton's law of cooling) has the closed-form solution
$$\theta_t = \Phi^{\dagger} Y + (\theta_0 - \Phi^{\dagger} Y)\, e^{-\frac{\eta}{m} \Phi \Phi^T t}.$$
Recall that $Y = Y^* + Z$. We first notice that
$$\theta_t = \Phi^{\dagger}(Y^* + Z) + \big( \theta_0 - \Phi^{\dagger}(Y^* + Z) \big) e^{-\frac{\eta}{m} \Phi \Phi^T t} = \Phi^{\dagger} Y^* + \Phi^{\dagger} Z + (\theta_0 - \Phi^{\dagger} Y^*) e^{-\frac{\eta}{m} \Phi \Phi^T t} - \Phi^{\dagger} Z\, e^{-\frac{\eta}{m} \Phi \Phi^T t} = \theta^* + (\theta_0 - \theta^*) e^{-\frac{\eta}{m} \Phi \Phi^T t} + \big( I_d - e^{-\frac{\eta}{m} \Phi \Phi^T t} \big) \Phi^{\dagger} Z.$$
Then the testing risk satisfies
$$\begin{aligned} R_t &= \mathbb{E}_{\theta_t, X, Y} \big\| \theta_t^T \phi(X) - Y \big\|_2^2 \\ &= \mathbb{E}_{\theta_t, X, Y} \big\| \theta_t^T \phi(X) - \theta^{*T} \phi(X) + \theta^{*T} \phi(X) - Y \big\|_2^2 \\ &= \mathbb{E}_{\theta_t, X} \big\| \theta_t^T \phi(X) - \theta^{*T} \phi(X) \big\|_2^2 + \mathbb{E}_{X, Y} \big\| \theta^{*T} \phi(X) - Y \big\|_2^2 + 2\, \mathbb{E}_{\theta_t, X, Y} \big\langle \theta_t^T \phi(X) - \theta^{*T} \phi(X),\; \theta^{*T} \phi(X) - Y \big\rangle \\ &\le \mathbb{E}_X \|\phi(X)\|_2^2 \; \mathbb{E}_{\theta_t} \|\theta_t - \theta^*\|_2^2 + R^* + 2 \sqrt{\mathbb{E}_{\theta_t, X} \big\| \theta_t^T \phi(X) - \theta^{*T} \phi(X) \big\|_2^2} \; \sqrt{\mathbb{E}_{X, Y} \big\| \theta^{*T} \phi(X) - Y \big\|_2^2} \\ &\le C_1^2\, \mathbb{E}_{\theta_t} \|\theta_t - \theta^*\|_2^2 + R^* + 2 \sqrt{C_1^2 R^*}\, \sqrt{\mathbb{E}_{\theta_t} \|\theta_t - \theta^*\|_2^2}, \end{aligned} \tag{13}$$
where the first inequality is by the Cauchy-Schwarz inequality and the second inequality is by the assumption. Recall Eq. (3),
$$\theta_t - \theta^* = (\theta_0 - \theta^*) e^{-\frac{\eta}{m} \Phi \Phi^T t} + \big( I_d - e^{-\frac{\eta}{m} \Phi \Phi^T t} \big) \Phi^{\dagger} Z.$$
By eigen-decomposition we have $\frac{1}{m} \Phi \Phi^T = V \Lambda V^T = \sum_{k=1}^{d} \mu_k v_k v_k^T$, where $\{v_k\}_{k=1}^{d}$ are orthonormal eigenvectors and $\{\mu_k\}_{k=1}^{d}$ are the corresponding eigenvalues. Then, for each dimension k,
$$(\theta_{t,k} - \theta^*_k)^2 \le 2 (\theta_{0,k} - \theta^*_k)^2 e^{-2 \eta \mu_k t} + 2 (1 - e^{-\eta \mu_k t})^2 \frac{C_2}{\mu_k}.$$
Taking the expectation over $\theta_0$ on both sides, we have
$$\mathbb{E}_{\theta_0} (\theta_{t,k} - \theta^*_k)^2 \le 2 (\xi_k^2 + \theta^{*2}_k) e^{-2 \eta \mu_k t} + 2 (1 - e^{-\eta \mu_k t})^2 \frac{C_2}{\mu_k}. \tag{14}$$
Notice that the RHS of Eq. (14) first monotonically decreases and then monotonically increases, so its maximum is achieved either at t = 0 or as t → ∞. That is,
$$\mathbb{E}_{\theta_0} \|\theta_t - \theta^*\|_2^2 \le \sum_{k=1}^{d} 2 \max\Big\{ \xi_k^2 + \theta^{*2}_k,\; \frac{C_2}{\mu_k} \Big\}. \tag{15}$$
Plugging Eq. (14) and Eq. (15) into Eq. (13), we have
$$R_t \le C_1^2\, \mathbb{E}_{\theta_t} \|\theta_t - \theta^*\|_2^2 + R^* + 2 \sqrt{C_1^2 R^*}\, \sqrt{\mathbb{E}_{\theta_t} \|\theta_t - \theta^*\|_2^2} \le R^* + 2 C_1^2 \sum_{k=1}^{d} \Big[ (\xi_k^2 + \theta^{*2}_k) e^{-2 \eta \mu_k t} + (1 - e^{-\eta \mu_k t})^2 \frac{C_2}{\mu_k} \Big] + 2 \sqrt{2\, C_1^2 R^* \zeta},$$
where $\zeta = \sum_{k=1}^{d} \max\{ \xi_k^2 + \theta^{*2}_k,\, C_2 / \mu_k \}$. This concludes the proof.
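The closed-form gradient-flow solution used in these proofs can be sanity-checked numerically: integrating the flow with small Euler steps should match $\theta_t = \theta^* + e^{-\eta A t}(\theta_0 - \theta^*)$. A sketch follows, writing the matrix in the exponent as $A = \Phi^T \Phi / m$ for a design matrix $\Phi \in \mathbb{R}^{m \times d}$ whose rows are feature vectors; dimensions and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, eta = 50, 5, 0.5

Phi = rng.normal(size=(m, d))          # design matrix (rows are feature vectors)
Y = rng.normal(size=m)
theta0 = rng.normal(size=d)
theta_star = np.linalg.pinv(Phi) @ Y   # least-squares solution Phi^+ Y

A = Phi.T @ Phi / m                    # symmetric matrix appearing in the exponent
w, V = np.linalg.eigh(A)               # eigen-decomposition A = V diag(w) V^T

def theta_closed(t):
    """Closed form theta_t = theta* + e^{-eta A t} (theta_0 - theta*)."""
    E = V @ np.diag(np.exp(-eta * w * t)) @ V.T
    return theta_star + E @ (theta0 - theta_star)

# Integrate the gradient flow d theta/dt = -(eta/m) Phi^T (Phi theta - Y)
# with small explicit Euler steps and compare against the closed form.
theta, dt, T = theta0.copy(), 1e-4, 3.0
for _ in range(int(T / dt)):
    theta = theta - dt * eta * (Phi.T @ (Phi @ theta - Y)) / m
```

The matrix exponential is computed through the same eigen-decomposition used in the proof, so the per-mode decay rates $e^{-\eta \mu_k t}$ are visible directly in `theta_closed`.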



This might be related to the epoch-wise double-descent behavior of ERM training. That is, when over-training ResNet18 on the whole training set for a total of 1000 epochs, the network is still in the first stage, over-fitting the training data; when over-training the network on 30% of the training set, the network fits the training data faster due to the smaller sample size and thus passes the turning point of the double-descent curve earlier.



Figure 1: Over-training ResNet18 on CIFAR10.

Figure 2: Training ResNet18 on CIFAR10 training set (100% data and 30% data) without data augmentation. Top row: training loss and testing accuracy for ERM and Mixup. Bottom row: loss landscape of the Mixup-trained ResNet18 (where "loss" refers to the empirical risk on the real data) at various training epochs; the left 3 figures are for the 30% CIFAR10 dataset, and the right 3 are for the full CIFAR10 dataset; visualization follows Li et al. (2018).

Figure 3: Training loss, testing accuracy and a U-shaped testing loss curve (subfigure (c), yellow) of training ResNet34 on CIFAR100 (100% training data) without data augmentation.

Figure 4: (a),(b): Training losses and testing errors of over-training ResNet18 on 10% of the CIFAR10 training set with data augmentation. (c),(d): Training losses and testing errors of over-training ResNet34 on 10% of the CIFAR100 training set with data augmentation.

Figure 5: Dynamics of testing loss in the toy example.

Figure 6: Switching from Mixup training to ERM training. The number in the bracket is the epoch number where we let α = 0 (i.e. Mixup training becomes ERM training).

Figure 7: Over-training on different number of samples.

Figure 9: Results of training ResNet18 on SVHN training set (100% data and 30% data) without data augmentation. Top row: training loss and testing accuracy for ERM and Mixup. Bottom row: loss landscape of the Mixup-trained ResNet18 at various training epochs; the left 3 figures are for the 30% SVHN dataset, and the right 3 are for the full SVHN dataset.

Figure 10: Results of the recorded training losses and testing accuracies of training ResNet34 on CIFAR10 training set (100% data and 30% data) without data augmentation.

Figure 11: Results of the recorded training losses and testing accuracies of training ResNet34 on SVHN training set (100% data and 30% data) without data augmentation.

Figure 12: Results of the recorded training losses and testing accuracies of training VGG16 on CIFAR10 training set (30% data and 100% data) without data augmentation.

Figure 13: Dynamics of MSE during Mixup training.

Figure 14: Dynamics of gradient norm when changing Mixup training to ERM training.

Figure 15: Models Pre-Trained on 100% CIFAR10 (without data augmentation)

Figure 18: Results of the ablation study on λ.

Lemma D.2 (Coupling Inequality (Levin & Peres, 2017, Proposition 4.7)). Given two random variables X and Y with probability distributions P and Q respectively, any coupling $\hat{P}$ of P and Q satisfies $\mathrm{TV}(P, Q) \le \hat{P}(X \ne Y)$.

D.1 PROOF OF LEMMA 3.1
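Lemma D.1 and the coupling inequality above can be checked on a small discrete example. The optimal coupling, which places $\min(P, Q)$ mass on the diagonal, attains the bound with equality; a numerical sketch:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

# Lemma D.1: on a countable space, TV(P, Q) = 1/2 * sum_x |P(x) - Q(x)|.
tv = 0.5 * float(np.abs(P - Q).sum())

# Coupling inequality: for ANY coupling of P and Q, TV(P, Q) <= P(X != Y).
# Independent coupling: joint[i, j] = P[i] * Q[j].
joint = np.outer(P, Q)
p_neq = 1.0 - float(np.trace(joint))

# The optimal coupling puts min(P, Q) mass on the diagonal and attains equality:
# P(X != Y) = 1 - sum_x min(P(x), Q(x)) = TV(P, Q).
p_neq_opt = 1.0 - float(np.minimum(P, Q).sum())
```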

$-\lambda \log \lambda - (1-\lambda) \log(1-\lambda)$.

the synthetic example $(\tilde{x}, \tilde{y})$ is formulated via cross-class mixing. Recall the definition of the Mixup loss; we can exchange the computation of the expectation and the empirical average:
$$\hat{R}_S(\theta, \alpha) = \mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha, \alpha)} \Big[ \frac{1}{n} \sum_{i=1}^{n} \ell\big(\theta, (\tilde{x}_i, \tilde{y}_i)\big) \Big] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha, \alpha)}\, \ell\big(\theta, (\tilde{x}_i, \tilde{y}_i)\big).$$

When the synthetic example is formulated via in-class mixing, the synthetic label is still one-hot, so the lower bound of its general loss is 0. In a balanced C-class training set, in-class mixing occurs with probability 1/C. Therefore, the lower bound of the overall Mixup loss is given as above, attained if $f_\theta(\tilde{x}) = \tilde{y}$ holds for each synthetic example $(\tilde{x}, \tilde{y}) \in \tilde{S}$. This completes the proof.

D.2 PROOF OF THEOREM 5.1

Proof. By the coupling inequality, i.e., Lemma D.2, we have
$$\mathrm{TV}\big(P(\tilde{Y}_h \mid \tilde{X}),\, P(\tilde{Y}^*_h \mid \tilde{X})\big) \le P(\tilde{Y}_h \ne \tilde{Y}^*_h \mid \tilde{X}).$$
Since $\mathrm{TV}\big(P(\tilde{Y}_h \mid \tilde{X}),\, P(Y \mid X)\big) = \mathrm{TV}\big(P(\tilde{Y}_h \mid \tilde{X}),\, P(\tilde{Y}^*_h \mid \tilde{X})\big)$, the first inequality is straightforward.
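The entropy bookkeeping behind the Mixup lower-bound argument can be verified numerically: cross-class mixing yields a two-hot label with entropy $-\lambda \log \lambda - (1-\lambda) \log(1-\lambda)$, in-class mixing keeps the label one-hot with zero entropy, and averaging the cross-class entropy over λ ~ U(0,1) gives approximately 0.5 nats. A sketch (helper names are ours):

```python
import numpy as np

def entropy(y):
    """Entropy of a discrete label vector, with 0 * log(0) treated as 0."""
    y = np.asarray(y)
    y = y[y > 0]
    return float(-(y * np.log(y)).sum())

def mixed_label(lam, i, j, C):
    """Mixup label lam * e_i + (1 - lam) * e_j for classes i, j out of C.
    Using += handles in-class mixing (i == j), which stays one-hot."""
    y = np.zeros(C)
    y[i] += lam
    y[j] += 1.0 - lam
    return y

# Average the cross-class label entropy over lambda ~ U(0,1) (i.e. Beta(1,1))
# by midpoint quadrature.
lams = (np.arange(10000) + 0.5) / 10000
avg_entropy = float(np.mean([entropy(mixed_label(l, 0, 1, 3)) for l in lams]))
```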

$\ldots = \theta^* + (\theta_0 - \theta^*) e^{-\frac{\eta}{m} \Phi \Phi^T t} + \big( I_d - e^{-\frac{\eta}{m} \Phi \Phi^T t} \big) \theta_{\mathrm{noise}}$, which concludes the proof.

D.4 PROOF OF THEOREM 5.2

At each epoch, we record the minimum training loss up to that epoch, as well as the testing accuracy at the epoch achieving the minimum training loss. Results are obtained both for training with data augmentation and for training without. More experimental details are given in Appendix A.
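The recording protocol just described can be sketched as follows (function and variable names are illustrative, not from our released code):

```python
def best_checkpoint_curve(train_losses, test_accs):
    """For each epoch, record the minimum training loss seen so far and the
    test accuracy at the epoch that achieved that minimum (a sketch of the
    recording protocol described above)."""
    best_loss, best_acc, curve = float("inf"), None, []
    for loss, acc in zip(train_losses, test_accs):
        if loss < best_loss:          # a new minimum training loss
            best_loss, best_acc = loss, acc
        curve.append((best_loss, best_acc))
    return curve
```

This keeps the reported test accuracy tied to the best-fitting checkpoint rather than the last epoch, which is what makes the U-shaped testing curves comparable across runs.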

The results of training ResNet18 on SVHN are presented in Appendix B.1.

Gradient Norm of Mixup Training Does Not Vanish Normally, ERM training reaches a near-zero gradient norm at the end of training, which indicates that SGD finds a local minimum. However, we observe that the gradient norm of Mixup training does not converge to zero, as shown in Figure 8. Notably, although ERM training is able to find a local minimum within the first 130 epochs on CIFAR10, Figure 1 indicates that Mixup training outperforms ERM in the first 400 epochs. A similar observation also holds for SVHN. This result in fact suggests that Mixup can generalize well without converging to any stationary point. Notice that there is a related observation in the recent work of Zhang et al. (2022a), where they show that large-scale neural networks generalize well without having the gradient norm vanish during training. Additionally, by switching Mixup training to ERM training, as we did in Figure

illustrates the results of training ResNet18 on 30% CIFAR10 data without data augmentation. The total number of training epochs t, shown on the horizontal axis, is increased from 100 to 1600. For each t, each point on the vertical axis represents the average of the recorded training losses over the 10 repeats. The width of the shaded band around each point reflects the corresponding standard deviation.

ACKNOWLEDGMENTS

This work is supported in part by a National Research Council of Canada (NRC) Collaborative R&D grant (AI4D-CORE-07). Ziqiao Wang is also supported in part by the NSERC CREATE program through the Interdisciplinary Math and Artificial Intelligence (INTER-MATH-AI) project.

