PROVABLY LEARNING DIVERSE FEATURES IN MULTI-VIEW DATA WITH MIDPOINT MIXUP

Abstract

Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regard to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have additional synthetic features.

1. INTRODUCTION

Data augmentation techniques have been a mainstay in the training of state-of-the-art models for a wide array of tasks, particularly in the field of computer vision, due to their ability to artificially inflate dataset size and encourage model robustness to various transformations of the data. One such technique that has achieved widespread use is Mixup (Zhang et al., 2018), which constructs new data points as convex combinations of pairs of data points and their labels from the original dataset. Mixup has been shown to empirically improve generalization and robustness when compared to standard training over different model architectures, tasks, and domains (Liang et al., 2018; He et al., 2019; Thulasidasan et al., 2019; Lamb et al., 2019; Arazo et al., 2019; Guo, 2020; Verma et al., 2021b; Wang et al., 2021). It has also found applications to distributed private learning (Huang et al., 2021), learning fair models (Chuang and Mroueh, 2021), semi-supervised learning (Berthelot et al., 2019b; Sohn et al., 2020; Berthelot et al., 2019a), self-supervised (specifically contrastive) learning (Verma et al., 2021a; Lee et al., 2020; Kalantidis et al., 2020), and multi-modal learning (So et al., 2022). The success of Mixup has instigated several works attempting to theoretically characterize its potential benefits and drawbacks (Guo et al., 2019; Carratino et al., 2020; Zhang et al., 2020; 2021; Chidambaram et al., 2021). These works have focused mainly on analyzing, at a high level, the beneficial (or detrimental) behaviors encouraged by the Mixup version of the original empirical loss for a given task. As such, none of these previous works (to the best of our knowledge) have provided an algorithmic analysis of Mixup training in the context of non-linear models (i.e. neural networks), which is the main use case of Mixup.
In this paper, we begin this line of work by theoretically separating the full training dynamics of Mixup (with a specific set of hyperparameters) from empirical risk minimization (ERM) for a 2-layer convolutional network (CNN) architecture on a class of data distributions exhibiting a multi-view nature. This multi-view property essentially requires (assuming classification data) that each class in the data is well-correlated with multiple features present in the data. Our analysis is heavily motivated by the recent work of Allen-Zhu and Li (2021), which showed that this kind of multi-view data can provide a fruitful setting for theoretically understanding the benefits of ensembles and knowledge distillation in the training of deep learning models. We show that Mixup can, perhaps surprisingly, capture some of the key benefits of ensembles explained by Allen-Zhu and Li (2021) despite only being used to train a single model.

Main Contributions and Outline. Our main contributions are three-fold. In Sections 2 and 3, we introduce the main ideas behind Mixup and analyze a simple, linearly separable multi-view data distribution which we use to lay the groundwork for our main results. In analyzing this distribution, we motivate the use of a particular setting of Mixup, which we refer to as Midpoint Mixup, in which training is done on the midpoints of data points and their labels. Section 4 contains our main results; we prove that, for a highly noisy class of data distributions with two features per class, minimizing the empirical cross-entropy using gradient descent can lead to learning only one of the features in the data while minimizing the Midpoint Mixup cross-entropy succeeds in learning both features. While our theory focuses on the case of two features/views per class to be consistent with Allen-Zhu and Li (2021), our techniques can readily be extended to more general multi-view data distributions.
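To make the Midpoint Mixup construction concrete, a minimal numpy sketch of how a single mixed example is formed (this illustration and its function names are ours, not the authors' code): the mixed input is the average of two inputs and the mixed label is the average of the corresponding one-hot labels, i.e. ordinary Mixup with the mixing weight fixed at 1/2.

```python
import numpy as np

def midpoint_mix(x_i, y_i, x_j, y_j, num_classes):
    """Form the midpoint of two examples and of their one-hot labels.

    Midpoint Mixup is Mixup with the mixing weight fixed at 1/2, so the
    mixed input is the average of the two inputs and the mixed target is
    the average of the two one-hot label vectors.
    """
    one_hot_i = np.eye(num_classes)[y_i]
    one_hot_j = np.eye(num_classes)[y_j]
    z = 0.5 * x_i + 0.5 * x_j
    target = 0.5 * one_hot_i + 0.5 * one_hot_j
    return z, target
```

For two points from different classes, the resulting target places probability 1/2 on each of the two classes, which is what the Midpoint Mixup cross-entropy trains against.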
Lastly, in Section 5, we conduct experiments illustrating that our theoretical insights in Sections 3 and 4 can apply to the training of realistic models on image classification benchmarks. We show for each benchmark that, after modifying the training data to include additional spurious features correlated with the true labels, both Mixup (with standard settings) and Midpoint Mixup outperform ERM on the original test data, with Midpoint Mixup closely approximating the performance of regular Mixup.

Related Work. The idea of training on midpoints (or approximate midpoints) is not new; both Guo (2021) and Chidambaram et al. (2021) empirically study settings resembling what we consider in this paper, but they do not develop theory for this kind of training (beyond an information-theoretic result in the latter case). As mentioned earlier, there are also several theoretical works analyzing the Mixup formulation and its variants (Carratino et al., 2020; Zhang et al., 2020; 2021; Chidambaram et al., 2021; Park et al., 2022), but none of these works contain optimization results (which are the focus of this work). Additionally, we note that there are many Mixup-like data augmentation techniques and training formulations that are not (immediately) within the scope of the theory developed in this paper. For example, CutMix (Yun et al., 2019), Manifold Mixup (Verma et al., 2019), Puzzle Mix (Kim et al., 2020), Co-Mixup (Kim et al., 2021), and Noisy Feature Mixup (Lim et al., 2021) are all such variations. Our work is also influenced by the existing large body of work theoretically analyzing the benefits of data augmentation (Bishop, 1995; Dao et al., 2019; Wu et al., 2020; Hanin and Sun, 2021; Rajput et al., 2019; Yang et al., 2022; Wang et al., 2022; Chen et al., 2020; Mei et al., 2021). The most relevant such work to ours is the recent work of Shen et al. (2022), which also studies the impact of data augmentation on the learning dynamics of a 2-layer network in a setting motivated by that of Allen-Zhu and Li (2021). However, Midpoint Mixup differs significantly from the data augmentation scheme considered in Shen et al. (2022), and consequently our results and setting are also of a different nature (we stick much more closely to the setting of Allen-Zhu and Li (2021)). As such, our work can be viewed as a parallel thread to that of Shen et al. (2022).

2. PRELIMINARIES AND MOTIVATION FOR MIDPOINT MIXUP

We will introduce Mixup in the context of $k$-class classification, although the definitions below easily extend to regression. As a notational convenience, we will use $[k]$ to denote $\{1, 2, \ldots, k\}$. Recall that, given a finite dataset $\mathcal{X} \subset \mathbb{R}^d \times [k]$ with $|\mathcal{X}| = N$, we can define the empirical cross-entropy loss $J(g, \mathcal{X})$ of a model $g : \mathbb{R}^d \to \mathbb{R}^k$ as:

$$J(g, \mathcal{X}) = -\frac{1}{N} \sum_{i \in [N]} \log \phi_{y_i}(g(x_i)), \quad \text{where} \quad \phi_y(g(x)) = \frac{\exp(g_y(x))}{\sum_{s \in [k]} \exp(g_s(x))} \tag{2.1}$$

Here $\phi$ is the standard softmax function, and $g_y, \phi_y$ denote the $y$-th coordinate functions of $g$ and $\phi$ respectively. Now let us fix a distribution $\mathcal{D}_\lambda$ whose support is contained in $[0, 1]$, and introduce the notation $z_{i,j}(\lambda) = \lambda x_i + (1 - \lambda) x_j$ (written $z_{i,j}$ when $\lambda$ is clear from context), where $(x_i, y_i), (x_j, y_j) \in \mathcal{X}$. Then we may define the Mixup cross-entropy $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ as:

$$J_M(g, \mathcal{X}, \mathcal{D}_\lambda) = -\frac{1}{N^2} \sum_{i \in [N]} \sum_{j \in [N]} \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\left[\lambda \log \phi_{y_i}(g(z_{i,j})) + (1 - \lambda) \log \phi_{y_j}(g(z_{i,j}))\right] \tag{2.2}$$
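The two losses above can be computed directly from their definitions. Below is a minimal numpy sketch (our own illustration, not the paper's code) of the empirical cross-entropy in Eq. (2.1) and of the Mixup cross-entropy in Eq. (2.2) with the distribution over mixing weights collapsed to a single fixed value `lam`; taking `lam = 0.5` recovers Midpoint Mixup. The `model` argument is any function mapping a `(d,)` input to `(k,)` logits.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def empirical_ce(logits, labels):
    """Empirical cross-entropy J(g, X) of Eq. (2.1).

    logits: (N, k) array of model outputs g(x_i); labels: (N,) int array.
    """
    probs = softmax(logits)
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels]))

def mixup_ce(model, xs, labels, lam):
    """Mixup cross-entropy J_M(g, X, D_lambda) of Eq. (2.2), with D_lambda
    taken to be a point mass at `lam` (lam = 0.5 gives Midpoint Mixup).

    Averages the mixed loss over all N^2 ordered pairs, as in Eq. (2.2).
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = lam * xs[i] + (1.0 - lam) * xs[j]
            p = softmax(model(z))
            total += lam * np.log(p[labels[i]]) + (1.0 - lam) * np.log(p[labels[j]])
    return -total / n**2
```

As a sanity check on the definitions, setting `lam = 1.0` makes every mixed point $z_{i,j} = x_i$ with full weight on $y_i$, so the Mixup loss reduces to the empirical cross-entropy of Eq. (2.1).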

