OVER-TRAINING WITH MIXUP MAY HURT GENERALIZATION

Abstract

Mixup, which creates synthetic training instances by linearly interpolating random sample pairs, is a simple yet effective regularization technique for boosting the performance of deep models trained with SGD. In this work, we report a previously unobserved phenomenon in Mixup training: on a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs, giving rise to a U-shaped generalization curve. This behavior is further aggravated when the size of the original dataset is reduced. To help understand this behavior of Mixup, we show theoretically that Mixup training may introduce undesired data-dependent label noises to the synthesized data. By analyzing a least-square regression problem with a random feature model, we explain why noisy labels may cause the U-shaped curve to occur: Mixup improves generalization by fitting the clean patterns at the early training stage, but as training progresses, it begins to over-fit the noise in the synthetic data. Extensive experiments on a variety of benchmark datasets validate this explanation.

1. INTRODUCTION

Mixup has empirically shown its effectiveness in improving the generalization and robustness of deep classification models (Zhang et al., 2018; Guo et al., 2019a;b; Thulasidasan et al., 2019; Zhang et al., 2022b). Unlike vanilla empirical risk minimization (ERM), in which networks are trained on the original training set, Mixup trains networks on synthetic examples. These examples are created by linearly interpolating both the input features and the labels of instance pairs randomly sampled from the original training set.
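As a concrete illustration, this interpolation step can be sketched in a few lines of NumPy. This is a minimal sketch, not the exact implementation of any cited work; the function name, the one-hot label format, and the use of a single Beta(α, α)-distributed coefficient λ shared across the batch are assumptions consistent with common Mixup implementations.

```python
import numpy as np

def mixup_batch(x, y, alpha=1.0, rng=None):
    """Create Mixup examples as convex combinations of random sample pairs.

    x: (n, d) input features; y: (n, k) one-hot labels.
    alpha: Beta-distribution parameter (1.0 is a common default).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)           # mixing coefficient lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))         # random pairing of instances within the batch
    x_mix = lam * x + (1 - lam) * x[perm]  # interpolate input features
    y_mix = lam * y + (1 - lam) * y[perm]  # interpolate labels the same way
    return x_mix, y_mix
```

Note that the interpolated labels `y_mix` are no longer one-hot: they are soft labels whose entries still sum to one, which is precisely where the data-dependent label noise analyzed later enters.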



Figure 1: Over-training ResNet18 on CIFAR10.

We first report a previously unobserved phenomenon in Mixup training. Through extensive experiments on various benchmarks, we observe that over-training networks with Mixup may significantly degrade their generalization performance. As a result, along the training epochs, the generalization performance of the network, measured by its testing error, may exhibit a U-shaped curve. Figure 1 shows such a curve obtained from over-training ResNet18 with Mixup on CIFAR10. As can be seen from Figure 1, after training for a long time (beyond 200 epochs), both the ERM-trained and the Mixup-trained networks keep decreasing their training loss, but the testing error of the Mixup-trained ResNet18 gradually increases, while that of the ERM-trained ResNet18 continues to decrease.

Motivated by this observation, we conduct a theoretical analysis aiming to better understand the aforementioned behavior of Mixup training. We show theoretically that Mixup training may introduce undesired data-dependent label noises to the synthesized data. Then, by analyzing the gradient-descent dynamics of training a random feature model for a least-square regression problem, we explain why noisy labels may cause the U-shaped curve to occur: under label noise, the early phase of training is primarily driven by the clean data pattern, which moves the model parameters closer to the correct solution. But as training progresses, the effect of label noise accumulates through the iterations, gradually over-weighs that of the clean pattern, and dominates the late training process. In this phase, the model parameters gradually move away from the correct solution until they are sufficiently far apart, approaching a location that depends on the noise realization.

Training on Random Labels, Epoch-Wise Double Descent, and Robust Overfitting. The thought-provoking work of Zhang et al. (2017) highlights that neural networks are able to fit data with random labels.
Since then, the generalization behavior of networks trained on datasets with corrupted labels has been widely investigated (Arpit et al., 2017; Liu et al., 2020; Feng & Tu, 2021; Wang & Mao, 2022; Liu et al., 2022). Specifically, Arpit et al. (2017) observe that neural networks learn the clean pattern first before fitting the data with random labels. This is further explained by Arora et al. (2019a), who demonstrate that in the overparameterization regime, the convergence of the loss depends on the projections of the labels onto the eigenvectors of a certain Gram matrix, and that true labels and random labels have different projections. In a parallel line of research, an epoch-wise double descent in the testing loss of deep neural networks is observed by Nakkiran et al. (2020), shortly after the observation of model-wise double descent (Belkin et al., 2019; Hastie et al., 2022; Mei & Montanari, 2022; Ba et al., 2020). Theoretical works studying epoch-wise double descent remain limited to date (Heckel & Yilmaz, 2021; Stephenson & Lee, 2021; Pezeshki et al., 2022), among which Advani et al. (2020) inspires the theoretical analysis of the U-shaped curve of Mixup in this paper. Moreover, robust overfitting (Rice et al., 2020) is yet another related line of research. In particular, robust overfitting refers to a phenomenon in adversarial training in which the robust accuracy first increases and then decreases over a long training run. Dong et al. (2022) show that robust overfitting can be viewed as the early part of an epoch-wise double descent caused by the implicit label noise induced by adversarial training. Since Mixup training has been connected to adversarial training and adversarial robustness in previous works (Archambault et al., 2019; Zhang et al., 2021), the work of Dong et al. (2022) indeed motivates us to study the label noise induced by Mixup training.

