BYPASSING THE RANDOM INPUT MIXING IN MIXUP

Anonymous authors

Abstract

Mixup and its variants have attracted a surge of interest due to their capability of boosting the accuracy of deep models. For a random sample pair, such approaches generate a set of synthetic samples by interpolating both the inputs and their corresponding one-hot labels. Current methods either interpolate random features from an input pair or learn to mix salient features from the pair. Nevertheless, the former methods can create misleading synthetic samples or remove important features from the given inputs, and the latter strategies incur significant computation cost for selecting descriptive input regions. In this paper, we show that the effort needed for the input mixing can be bypassed. For a given sample pair, averaging the features from the two inputs and then assigning the result a set of soft labels can effectively regularize the training. We empirically show that the proposed approach performs on par with state-of-the-art strategies in terms of predictive accuracy.

1. INTRODUCTION

Deep neural networks have demonstrated profound successes in many challenging real-world applications, including image classification (Krizhevsky et al., 2012), speech recognition (Graves et al., 2013), and machine translation (Bahdanau et al., 2015; Sutskever et al., 2014). One key factor behind such successes is the deployment of effective model regularization techniques, which enable the learning to avoid overfitting the training data and to generalize well to unseen samples. Such regularization is necessary because current deep models typically enjoy high modeling freedom with a very large number of parameters. To this end, many regularizers for deep models have been introduced, including weight decay (Hanson & Pratt, 1988), dropout (Srivastava et al., 2014), stochastic depth (Huang et al., 2016), batch normalization (Ioffe & Szegedy, 2015), and data augmentation schemes (Cubuk et al., 2019; Hendrycks et al., 2020; Inoue, 2018; Lecun et al., 1998; Simard et al., 1998). Among those effective regularizers, Mixup (Zhang et al., 2018) is a simple yet effective, data-augmentation based regularizer for enhancing deep classification models. Through linearly interpolating random input pairs and their training targets in one-hot representation, Mixup generates a set of synthetic examples with soft labels to regularize the training. Such pairwise, label-variant data augmentation techniques (Guo, 2020; Guo et al., 2019; Kim et al., 2020; Li et al., 2020a; Tokozume et al., 2018a;b; Verma et al., 2019; Yun et al., 2019; Zhang et al., 2018) have attracted a surge of interest and shown their effectiveness in boosting the accuracy of deep networks. Nevertheless, unlike label-preserving data augmentation such as rotation, flip, and crop, there is still limited knowledge about how to design better mixing policies for sample pairs for effective label-variant regularization. Current Mixup-based approaches either mix a pair of inputs using random mixing coefficients (Guo, 2020; Guo et al., 2019; Summers & Dinneen, 2019; Tokozume et al., 2018b; Verma et al., 2019; Yun et al., 2019; Zhang et al., 2018) or learn to mix salient features from the given pair (Dabouei et al., 2020; Kim et al., 2020; Li et al., 2020b; Walawalkar et al., 2020) to create a set of synthetic samples.
Nonetheless, the former methods may create misleading synthetic samples (Guo et al., 2019) or remove important features from the given inputs (Kim et al., 2020), and the latter strategies can incur significant computation cost for identifying and selecting the most descriptive input regions (Kim et al., 2020; Walawalkar et al., 2020). In this paper, we show that the effort needed for mixing the inputs with a range of mixing ratios can be bypassed. For a given input pair, one can average the features from the two inputs and then assign the result a set of soft labels. These soft labels are adaptively learned during training to incorporate class information beyond the provided label pair as well as the evolving state of the training. The method is illustrated in Figure 1, where the mixed input is the pixel-wise average of features from the two inputs and its training target is the combination of the local soft label, which is the average of the two one-hot targets, and the global soft label, which is generated by the networks during training. We empirically show that the proposed approach performs on par, in terms of predictive accuracy, with state-of-the-art methods based on either random input mixing or learning-to-mix strategies. We also demonstrate that the synthetic samples created by our method keep tuning the networks long after the training error on the original training set becomes minimal, encouraging the learning to generate tight representation clusters for each class of the training samples. Also, the Class Activation Mapping (CAM) (Zhou et al., 2016) visualization suggests that our method tends to focus on narrower regions of an image for classification.

2.1. MIXUP-BASED DATA AUGMENTATION

For a standard classification setting with a training set (X, Y), the objective of the task is to develop a classifier that assigns every input x ∈ X a label y ∈ Y. Instead of using the provided training set (X, Y) directly, Mixup (Zhang et al., 2018) generates synthetic samples with soft labels for training. For a pair of random training samples (x_i, y_i) and (x_j, y_j), where x is the input and y the one-hot encoding of the corresponding class, Mixup creates a synthetic sample as follows:

x̃^{ij}_λ = λ x_i + (1 − λ) x_j,    ỹ^{ij}_λ = λ y_i + (1 − λ) y_j,

where λ is a scalar mixing coefficient applied to both the inputs and the modeling targets of the sample pair; λ is sampled from a Beta(α, α) distribution with a hyper-parameter α. The generated samples (x̃^{ij}_λ, ỹ^{ij}_λ) are then fed into the model for training to minimize the cross-entropy loss function. Current variants of Mixup focus on introducing a representation function ψ for the input mixing:

x̃^{ij}_λ = ψ(x_i | x_j, λ) + ψ(x_j | x_i, 1 − λ).

The state-of-the-art random input mixing variant CutMix (Yun et al., 2019) defines ψ to form a binary rectangular mask applied to a randomly chosen rectangle covering a λ proportion of the input image. PuzzleMix (Kim et al., 2020) is the state-of-the-art learning-to-mix variant. This method defines ψ to compute the saliency map of the input pair, find the optimal mask, and optimize the transport plans for generating the mixed example. This ensures that the mixed image contains sufficient target-class information corresponding to the mixing ratio λ while preserving the local statistics of each input.
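To make the two baseline mixing policies concrete, the following NumPy sketch implements the Mixup interpolation above and a minimal CutMix-style rectangular paste. The shapes, helper names, and the square-rectangle choice in `cutmix` are our own illustrative assumptions, not code from the cited papers; in particular, CutMix re-adjusts the label weight by the actual pasted area, which we mirror with `lam_adj`.

```python
import numpy as np

def mixup(x_i, x_j, y_i, y_j, alpha=1.0, rng=None):
    """Vanilla Mixup: linearly interpolate inputs and one-hot labels
    with lam drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x_i + (1.0 - lam) * x_j
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y, lam

def cutmix(x_i, x_j, y_i, y_j, alpha=1.0, rng=None):
    """CutMix-style mixing: paste a random rectangle from x_j onto x_i.
    The rectangle targets a (1 - lam) proportion of the image; the label
    is mixed with the actually-kept area ratio. Inputs are HxWxC arrays."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    h, w = x_i.shape[:2]
    # rectangle side lengths so that its area is roughly (1 - lam) * h * w
    cut = np.sqrt(1.0 - lam)
    ch, cw = int(h * cut), int(w * cut)
    cy, cx = rng.integers(h), rng.integers(w)          # random center
    y1, y2 = np.clip(cy - ch // 2, 0, h), np.clip(cy + ch // 2, 0, h)
    x1, x2 = np.clip(cx - cw // 2, 0, w), np.clip(cx + cw // 2, 0, w)
    x = x_i.copy()
    x[y1:y2, x1:x2] = x_j[y1:y2, x1:x2]
    # proportion of x_i actually kept after clipping at the image border
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    y = lam_adj * y_i + (1.0 - lam_adj) * y_j
    return x, y, lam_adj
```

Note that the soft label always sums to one, and that CutMix keeps each pixel intact (it relocates features rather than blending them), which is exactly the property the learning-to-mix variants try to preserve while also keeping salient regions.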

2.2. FORMING THE TRAINING INPUT

Similar to Mixup and its variants, our method also creates mixed samples from a random sample pair, but for a given sample pair it does not generate a set of inputs with different features. Instead, for a range of mixing ratios λ, our approach uses the same mixed input, invariant to the mixing ratio.



Figure 1: Illustration of the proposed method. The mixed input is the average of the two inputs; the training target for the averaged input is the combination of the local soft label (average of the two one-hot targets) and the global soft label (dynamically generated during training).
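The construction in Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the paper learns the global soft label during training and does not specify its combination rule in this excerpt, so the model-produced distribution `global_soft_label` and the convex weight `w` are hypothetical placeholders.

```python
import numpy as np

def averaged_input(x_i, x_j):
    """The mixed input is simply the pixel-wise average of the pair,
    independent of any mixing ratio lambda."""
    return 0.5 * (x_i + x_j)

def training_target(y_i, y_j, global_soft_label, w=0.5):
    """Combine the local soft label (average of the two one-hot targets)
    with a model-generated global soft label. The convex weight w is an
    assumption for illustration, not the paper's learned schedule."""
    local = 0.5 * (y_i + y_j)
    return w * local + (1.0 - w) * global_soft_label
```

Because the input no longer depends on λ, a single averaged sample replaces the whole family of λ-mixed inputs, and all of the regularization effect is carried by the adaptively learned soft target.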

