WEMIX: HOW TO BETTER UTILIZE DATA AUGMENTATION

Abstract

Data augmentation is a widely used training trick in deep learning to improve the generalization ability of networks. Despite many encouraging results, several recent studies have pointed out limitations of the conventional data augmentation scheme in certain scenarios, calling for a better theoretical understanding of data augmentation. In this work, we develop a comprehensive analysis that reveals the pros and cons of data augmentation. The main limitation of data augmentation arises from the data bias, i.e., the augmented data distribution can be quite different from the original one. This data bias leads to suboptimal performance of existing data augmentation methods. To address this issue, we develop two novel algorithms, termed "AugDrop" and "MixLoss", to correct the data bias in data augmentation. Our theoretical analysis shows that both algorithms are guaranteed to improve the effect of data augmentation through bias correction, which is further validated by our empirical studies. Finally, we propose a generic algorithm "WeMix" by combining AugDrop and MixLoss, whose effectiveness is observed from extensive empirical evaluations.

1. INTRODUCTION

Data augmentation (Baird, 1992; Schmidhuber, 2015) has been key to the success of deep learning in image classification (He et al., 2019), and is becoming increasingly common in other tasks such as natural language processing (Zhang et al., 2015) and object detection (Zoph et al., 2019). Data augmentation expands the training set by generating virtual instances through random transformations of the original ones, which alleviates overfitting (Shorten & Khoshgoftaar, 2019) when training large deep neural networks. Despite many encouraging results, data augmentation does not always improve generalization (Min et al., 2020; Raghunathan et al., 2020). In particular, Raghunathan et al. (2020) showed that training on augmented data leads to a smaller robust error but potentially a larger standard error. Therefore, it is critical to answer the following two questions before applying data augmentation in deep learning:
• When will deep models benefit from data augmentation?
• How can we better leverage augmented data during training?
Several previous works (Raghunathan et al., 2020; Wu et al., 2020; Min et al., 2020) tried to address these questions, but their analyses are limited to specific problems such as linear ridge regression and therefore may not be applicable to deep learning. In this work, we aim to answer the two questions from a theoretical perspective under a more general non-convex setting. We address the first question in a general form covering applications in deep learning. For the second question, we develop new approaches that are provably more effective than conventional data augmentation. Most data augmentation operations alter the data distribution during training. This imposes a data distribution bias (we simply say "data bias" in the rest of this paper) between the augmented data and the original data, which may make it difficult to fully leverage the augmented data.
To be more concrete, let us consider label-mixing augmentation (e.g., mixup (Zhang et al., 2018; Tokozume et al., 2018)). Suppose we have n original data D = {(x_i, y_i), i = 1, . . . , n}, where the input-label pair (x_i, y_i) follows a distribution P_xy = (P_x, P_y(·|x)); here P_x is the marginal distribution of the inputs and P_y(·|x) is the conditional distribution of the labels given the inputs. We generate m augmented data D̃ = {(x̃_i, ỹ_i), i = 1, . . . , m}, where (x̃_i, ỹ_i) ∼ P̃_xy = (P̃_x, P̃_y(·|x̃)), with P̃_x = P_x but P̃_y(·|x) ≠ P_y(·|x). Given x ∼ P_x, the data bias is defined as δ_ỹ = max_{y, ỹ} ‖y − ỹ‖, where the maximum is taken over labels y and ỹ drawn from P_y(·|x) and P̃_y(·|x), respectively. We will show that when the bias between D and D̃ is large, directly training on the augmented data will not be as effective as training on the original data. Given that augmented data may hurt performance, the next question is how to design better learning algorithms that unleash the power of augmented data. To this end, we develop two novel algorithms to alleviate the data bias. The first algorithm, termed AugDrop, corrects the data bias by introducing a constrained optimization problem. The second algorithm, termed MixLoss, corrects the data bias by introducing a modified loss function. We show, both theoretically and empirically, that even with a large data bias, the proposed algorithms can still improve generalization performance by effectively leveraging the combination of augmented data and original data. We summarize the main contributions of this work as follows:
• We prove that in a conventional training scheme, a deep model can benefit from augmented data when the data bias is small.
• We design two algorithms, termed AugDrop and MixLoss, that can better leverage augmented data, with theoretical guarantees, even when the data bias is large.
• Based on our theoretical findings, we propose a new efficient algorithm, WeMix, by combining AugDrop and MixLoss, which achieves better performance without extra training cost.
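To make the label-mixing example concrete, the sketch below implements a standard mixup-style augmentation on toy one-hot labels; the Beta(α, α) mixing coefficient and the toy data are conventions of mixup (Zhang et al., 2018) chosen for illustration, not specifics of our analysis. It shows that the augmented conditional label distribution P̃_y(·|x̃) contains soft labels that never occur under the original P_y(·|x), which is precisely the source of the data bias.

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=None):
    """Generate mixup-augmented pairs: convex combinations of
    randomly matched (input, one-hot label) pairs."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha, size=len(x))   # mixing coefficients in (0, 1)
    perm = rng.permutation(len(x))              # random partner for each example
    x_aug = lam[:, None] * x + (1 - lam)[:, None] * x[perm]
    y_aug = lam[:, None] * y + (1 - lam)[:, None] * y[perm]
    return x_aug, y_aug

# Toy 2-class data with one-hot labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 5))
y = np.eye(2)[rng.integers(0, 2, size=1000)]

x_aug, y_aug = mixup(x, y, alpha=0.2, rng=rng)

# Original labels take only the values 0 and 1 ...
print(np.unique(y).tolist())                    # [0.0, 1.0]
# ... while mixed labels are soft: the conditional label
# distribution has changed, i.e., P~_y(.|x) != P_y(.|x).
print(bool(((y_aug > 0) & (y_aug < 1)).any()))
```

Training directly on (x_aug, y_aug) therefore optimizes an objective under P̃_xy rather than P_xy, which motivates correcting the bias rather than treating D̃ as if it were drawn from the original distribution.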

2. RELATED WORK

A series of empirical works (Cubuk et al., 2019; Ho et al., 2019; Lim et al., 2019; Lin et al., 2019a; Cubuk et al., 2020; Hataya et al., 2019) have studied how to learn a good policy for applying different data augmentations, but without theoretical guarantees. In this section, we mainly focus on reviewing theoretical studies of data augmentation. For a survey of data augmentation, we refer readers to (Shorten & Khoshgoftaar, 2019) and the references therein for a comprehensive overview. Several works have attempted to establish theoretical understandings of data augmentation from different perspectives (Dao et al., 2019; Chen et al., 2019; Rajput et al., 2019). Min et al. (2020) showed that, with more training data, weak augmentation can improve performance while strong augmentation always hurts performance. Later on, Chen et al. (2020) studied the gap between the generalization error (see the formal definition in (Chen et al., 2020)) of adversarially trained models and standard models. Both of their theoretical analyses were built on special linear binary classification or linear regression models for label-preserving augmentation. Recently, Raghunathan et al. (2020) studied label-preserving transformations in data augmentation, which is identical to the first case in this paper. Their analysis is restricted to linear least squares regression under a noiseless setting, which is not applicable to training deep neural networks. Besides, their analysis requires infinite unlabeled data; by contrast, we do not require the original data to be unlimited. Wu et al. (2020) considered linear data augmentations. There are several major differences between their work and ours. First, they focus on the ridge linear regression problem, which is strongly convex, while we consider non-convex optimization problems, which are more applicable to deep learning. Second, we study more general data augmentations beyond linear transformations.

3. PRELIMINARIES AND NOTATIONS

We study a learning problem of finding a classifier that maps an input x ∈ X to a label y ∈ Y ⊂ R^K, where K is the number of classes. We assume the input-label pair (x, y) is drawn from a distribution P_xy = (P_x, P_y(·|x)). Since every augmented example (x̃, ỹ) is generated by applying a certain transformation to either one or multiple examples, we assume that (x̃, ỹ) is drawn from a slightly different distribution P̃_xy = (P̃_x, P̃_y(·|x̃)), where P̃_x is the marginal distribution of the inputs x̃ and P̃_y(·|x̃) (written as P̃_y for simplicity) is the conditional distribution of the labels ỹ given the inputs x̃. We sample n training examples (x_i, y_i), i = 1, . . . , n from the distribution P_xy and m training examples (x̃_i, ỹ_i), i = 1, . . . , m from P̃_xy. We assume that m ≫ n due to the data augmentation. We denote by D = {(x_i, y_i), i = 1, . . . , n} and D̃ = {(x̃_i, ỹ_i), i = 1, . . . , m} the datasets sampled from P_xy and P̃_xy, respectively. We denote by T(x) the set of augmented data transformed from x. We use the notation E_{(x,y)∼P_xy}[·] to denote the expectation over a random variable (x, y) following the distribution P_xy. We denote by ∇_w h(w) the gradient of a function h(w) with respect to the variable w; when the variable with respect to which the gradient is taken is obvious, we simply write ∇h(w).
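As a minimal numerical instantiation of this setup, the sketch below draws n original examples from a toy distribution and m ≫ n augmented examples through an illustrative transformation set T(x); the specific choice of T here (small Gaussian perturbations that copy the source label, a stand-in for crops, flips, etc.) is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m_per, K = 100, 5, 3          # m = n * m_per >> n augmented examples

# Original data D: (x_i, y_i) ~ P_xy, with one-hot labels in R^K.
x = rng.normal(size=(n, 4))
y = np.eye(K)[rng.integers(0, K, size=n)]

def T(xi, k=m_per, sigma=0.1):
    """Illustrative transformation set T(x): k noisy copies of x."""
    return xi + sigma * rng.normal(size=(k, xi.shape[0]))

# Augmented data D~: inputs follow the perturbed marginal P~_x; each
# augmented example inherits the label of its source example, so the
# conditional P~_y(.|x~) need not match P_y(.|x).
x_tilde = np.concatenate([T(xi) for xi in x])
y_tilde = np.repeat(y, m_per, axis=0)

print(x.shape, x_tilde.shape)    # (100, 4) (500, 4)
```

The theoretical results in the following sections compare training on D, on D̃, and on their combination under this kind of sampling scheme.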

