FMIX: ENHANCING MIXED SAMPLE DATA AUGMENTATION

Abstract

Mixed Sample Data Augmentation (MSDA) has received increasing attention in recent years, with many successful variants such as MixUp and CutMix. We analyse MSDA from an information theoretic perspective, characterising learned models in terms of how they impact the models' perception of the data. Ultimately, our analyses allow us to decouple two complementary properties of augmentations that are useful for reasoning about MSDA. From insight on the efficacy of CutMix in particular, we subsequently propose FMix, an MSDA that uses binary masks obtained by applying a threshold to low frequency images sampled from Fourier space. FMix improves performance over MixUp and CutMix for a number of models across a range of data sets and problem settings, obtaining new state-of-the-art results on CIFAR-10 and Fashion-MNIST.

1. INTRODUCTION

Recently, a plethora of approaches to Mixed Sample Data Augmentation (MSDA) have been proposed which obtain state-of-the-art results, particularly in classification tasks (Chawla et al., 2002; Zhang et al., 2017; Tokozume et al., 2017; 2018; Inoue, 2018; Yun et al., 2019; Takahashi et al., 2019; Summers and Dinneen, 2019). MSDA involves combining data samples according to some policy to create an augmented data set on which to train the model. The policies proposed so far can be broadly categorised as combining samples either by interpolation (e.g. MixUp) or by masking (e.g. CutMix). Traditionally, augmentation is viewed through the framework of statistical learning as Vicinal Risk Minimisation (VRM) (Vapnik, 1999; Chapelle et al., 2001). Given some notion of the vicinity of a data point, VRM trains with vicinal samples in addition to the data points themselves. This is the motivation for MixUp (Zhang et al., 2017): to provide a new notion of vicinity based on mixing data samples. In the classical theory, the validity of this technique relies on the strong assumption that the vicinal distribution precisely matches the true distribution of the data. As a result, the classical goal of augmentation is to maximally increase the data space without changing the data distribution. Clearly, for all but the most simple augmentation strategies, the data distribution is in some way distorted. Furthermore, there may be practical implications to correcting this, as is demonstrated in Touvron et al. (2019). In light of this, three important questions arise regarding MSDA: What is a good measure of the similarity between the augmented and the original data? Why is MixUp so effective when the augmented data looks so different? If the data is distorted, what impact does this have on trained models? To construct a good measure of similarity, we note that the data need only be 'perceived' as similar by the model.
As such, we measure the mutual information between representations learned from the real and augmented data, thus characterising how well learning from the augmented data simulates learning from the real data. This measure clearly shows the data-level distortion of MixUp by demonstrating that learned representations are compressed in comparison to those learned from the un-augmented data. To address the efficacy of MixUp, we look to the information bottleneck theory of deep learning (Tishby and Zaslavsky, 2015). This theory uses the data processing inequality, summarised as 'post-processing cannot increase information', to suggest that deep networks progressively discard information about the input whilst preserving information about the targets. Through this lens, we posit that the distortion and subsequent compression induced by MixUp promotes generalisation by preventing the network from learning about highly sample-specific features in the data. Regarding the impact on trained models, and again armed with the knowledge that MixUp distorts learned functions, we show that MixUp acts as a kind of adversarial training (Goodfellow et al., 2014), promoting robustness to additive noise. This accords with the theoretical result of Perrault-Archambault et al. (2020) and the robustness results of Zhang et al. (2017). However, we further show that MSDA does not generally improve adversarial robustness when measured as a worst case accuracy following multiple attacks, as suggested by Carlini et al. (2019). In contrast to our findings regarding MixUp, we show that CutMix causes learned models to retain a good knowledge of the real data, which we argue derives from the fact that individual features extracted by a convolutional model generally derive from only one of the mixed data points. At the same time, CutMix limits the ability of the model to over-fit by dramatically increasing the number of observable data points, in keeping with the original intent of VRM.
We go on to argue that by restricting itself to masking only a square region, CutMix imposes an unnecessary limitation. Indeed, it should be possible to construct an MSDA which uses masking similar to CutMix whilst increasing the data space much more dramatically. Motivated by this, we introduce FMix, a masking MSDA that uses binary masks obtained by applying a threshold to low frequency images sampled from Fourier space. Using our mutual information measure, we show that learning with FMix simulates learning from the real data even better than CutMix. We subsequently demonstrate the performance of FMix for a range of models and tasks against a series of augmented baselines and other MSDA approaches. FMix obtains new state-of-the-art performance on CIFAR-10 (Krizhevsky et al., 2009) without external data and on Fashion-MNIST (Xiao et al., 2017), and improves the performance of several state-of-the-art models (ResNet, SE-ResNeXt, DenseNet, WideResNet, PyramidNet, LSTM, and Bert) on a range of problems and modalities. In light of our analyses, and supported by our experimental results, we go on to suggest that the compressing qualities of MixUp are most desirable when data is limited and learning from individual examples is easier. In contrast, masking MSDAs such as FMix are most valuable when data is abundant. We finally suggest that there is no reason to see the desirable properties of masking and interpolation as mutually exclusive. In light of these observations, we plot the performance of MixUp, FMix, a baseline, and a hybrid policy where we alternate between batches of MixUp and FMix, as the number of CIFAR-10 training examples is reduced. This experiment confirms our above suggestions and shows that the hybrid policy can outperform both MixUp and FMix.
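The core of FMix is the mask construction: sample a random greyscale image whose spectral power is concentrated at low frequencies, then threshold it so that a fraction λ of the pixels are 1. The following is a minimal NumPy sketch of this idea, not the reference implementation; the `decay` parameter controlling the spectral fall-off and the top-λ thresholding rule are our assumptions for illustration.

```python
import numpy as np

def fmix_mask(shape, lam, decay=3.0, rng=None):
    """Sketch of an FMix-style mask: threshold a low frequency image
    sampled from Fourier space so that a fraction `lam` of the mask
    is 1. `decay` (a hypothetical parameter) sets how quickly spectral
    power falls off with frequency; larger values give smoother masks."""
    rng = np.random.default_rng(rng)
    h, w = shape
    # Frequency magnitude of each Fourier coefficient
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fy ** 2 + fx ** 2)
    # Random complex spectrum with power decaying as freq^(-decay)
    scale = 1.0 / np.maximum(freq, 1.0 / max(h, w)) ** decay
    spectrum = scale * (rng.standard_normal((h, w))
                        + 1j * rng.standard_normal((h, w)))
    grey = np.real(np.fft.ifft2(spectrum))
    # Set the top lam-fraction of pixels to 1, the rest to 0
    k = max(int(round(lam * h * w)), 1)
    thresh = np.sort(grey.ravel())[::-1][k - 1]
    return (grey >= thresh).astype(np.float32)
```

Because the thresholded image is smooth, the resulting masks are large irregular connected regions, rather than the single rectangle of CutMix, which is what allows FMix to cover a far larger space of masks.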

2. RELATED WORK

In this section, we review the fundamentals of MSDA. Let p_X(x) denote the input data distribution. In general, we can define MSDA for a given mixing function, mix(X₁, X₂, Λ), where X₁ and X₂ are independent random variables on the data domain and Λ is the mixing coefficient. Synthetic minority over-sampling (Chawla et al., 2002), a predecessor to modern MSDA approaches, can be seen as a special case of the above where X₁ and X₂ are dependent, jointly sampled as nearest neighbours in feature space. These synthetic samples are drawn only from the minority class and used in conjunction with the original data, addressing the problem of imbalanced data. The mixing function is linear interpolation, mix(x₁, x₂, λ) = λx₁ + (1 − λ)x₂, and p_Λ = U(0, 1). More recently, Zhang et al. (2017), Tokozume et al. (2017), Tokozume et al. (2018) and Inoue (2018) concurrently proposed using this formulation (as MixUp, Between-Class (BC) learning, BC+, and sample pairing respectively) on the whole data set, although the choice of distribution for the mixing coefficients varies for each approach. We refer to this as interpolative MSDA, where, following Zhang et al. (2017), we use the symmetric Beta distribution, that is, p_Λ = Beta(α, α). Recent variants adopt a binary masking approach (Yun et al., 2019; Summers and Dinneen, 2019; Takahashi et al., 2019). Let M = mask(Λ) be a random variable with mask(λ) ∈ {0, 1}ⁿ and μ(mask(λ)) = λ, that is, generated masks are binary with average value equal to the mixing coefficient. The mask mixing function is mix(x₁, x₂, m) = m ⊙ x₁ + (1 − m) ⊙ x₂, where ⊙ denotes point-wise multiplication. A notable masking MSDA which motivates our approach is CutMix (Yun et al., 2019).
CutMix is designed for two dimensional data, with mask(λ) ∈ {0, 1}^(w×h), and uses mask(λ) = rand_rect(w√(1 − λ), h√(1 − λ)), where rand_rect(r_w, r_h) ∈ {0, 1}^(w×h) yields a binary mask with a shaded rectangular region of size r_w × r_h at a uniform random coordinate. CutMix improves upon the performance of MixUp on a range of experiments. For the remainder of the paper we focus on the development of a better input mixing function. Appendix A provides a discussion of the importance of the mixing ratio of the labels. For the typical case of classification with a cross entropy loss, the objective function is simply the interpolation between the cross entropy against each of the ground truth targets.
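The rand_rect mask above can be sketched as follows. This is an illustrative reading of the definition in which the rectangle is always placed fully inside the image, so the shaded area fraction is exactly 1 − λ up to integer rounding; published implementations may instead clip a rectangle at the image boundary.

```python
import numpy as np

def rand_rect_mask(w, h, lam, rng=None):
    """Sketch of rand_rect for CutMix: a binary (h, w) mask with a
    shaded (zero) rectangle of size w*sqrt(1 - lam) x h*sqrt(1 - lam)
    at a uniform random position, so the mask mean is approximately lam."""
    rng = np.random.default_rng(rng)
    rw = int(round(w * np.sqrt(1.0 - lam)))
    rh = int(round(h * np.sqrt(1.0 - lam)))
    # Uniform random top-left corner such that the rectangle fits
    x0 = rng.integers(0, w - rw + 1)
    y0 = rng.integers(0, h - rh + 1)
    mask = np.ones((h, w), dtype=np.float32)
    mask[y0:y0 + rh, x0:x0 + rw] = 0.0
    return mask
```

Mixing then proceeds exactly as in the masking formulation, m ⊙ x₁ + (1 − m) ⊙ x₂, and the cross entropy objective is weighted by λ and 1 − λ against the two ground truth targets.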

