FMIX: ENHANCING MIXED SAMPLE DATA AUGMENTATION

Abstract

Mixed Sample Data Augmentation (MSDA) has received increasing attention in recent years, with many successful variants such as MixUp and CutMix. We analyse MSDA from an information theoretic perspective, characterising augmentations in terms of how they impact the learned model's perception of the data. Ultimately, our analyses allow us to decouple two complementary properties of augmentations that are useful for reasoning about MSDA. Motivated by insight into the efficacy of CutMix in particular, we subsequently propose FMix, an MSDA that uses binary masks obtained by applying a threshold to low frequency images sampled from Fourier space. FMix improves performance over MixUp and CutMix for a number of models across a range of data sets and problem settings, obtaining new state-of-the-art results on CIFAR-10 and Fashion-MNIST.

1. INTRODUCTION

Recently, a plethora of approaches to Mixed Sample Data Augmentation (MSDA) have been proposed which obtain state-of-the-art results, particularly in classification tasks (Chawla et al., 2002; Zhang et al., 2017; Tokozume et al., 2017; 2018; Inoue, 2018; Yun et al., 2019; Takahashi et al., 2019; Summers and Dinneen, 2019). MSDA involves combining data samples according to some policy to create an augmented data set on which to train the model. The policies proposed so far can be broadly categorised as either combining samples with interpolation (e.g. MixUp) or masking (e.g. CutMix).

Traditionally, augmentation is viewed through the framework of statistical learning as Vicinal Risk Minimisation (VRM) (Vapnik, 1999; Chapelle et al., 2001). Given some notion of the vicinity of a data point, VRM trains with vicinal samples in addition to the data points themselves. This is the motivation for MixUp (Zhang et al., 2017): to provide a new notion of vicinity based on mixing data samples. In the classical theory, the validity of this technique relies on the strong assumption that the vicinal distribution precisely matches the true distribution of the data. As a result, the classical goal of augmentation is to maximally increase the data space without changing the data distribution. Clearly, for all but the simplest augmentation strategies, the data distribution is in some way distorted. Furthermore, there may be practical implications to correcting this, as demonstrated in Touvron et al. (2019).

In light of this, three important questions arise regarding MSDA: What is a good measure of the similarity between the augmented and the original data? Why is MixUp so effective when the augmented data looks so different? If the data is distorted, what impact does this have on trained models? To construct a good measure of similarity, we note that the data need only be 'perceived' as similar by the model.
As such, we measure the mutual information between representations learned from the real and augmented data, thus characterising how well learning from the augmented data simulates learning from the real data. This measure clearly shows the data-level distortion of MixUp by demonstrating that learned representations are compressed in comparison to those learned from the un-augmented data. To address the efficacy of MixUp, we look to the information bottleneck theory of deep learning (Tishby and Zaslavsky, 2015) . This theory uses the data processing inequality, summarised as 'post-processing cannot increase information', to suggest that deep networks progressively discard information about the input whilst preserving information about the targets. Through this lens, we posit that the distortion and subsequent compression induced by MixUp promotes generalisation by preventing the network from learning about highly sample-specific features in the data. Regarding the impact on trained models, and again armed with the knowledge that MixUp distorts learned functions, we show that MixUp acts as a kind of adversarial training (Good-
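The interpolation and masking policies discussed above, together with the Fourier-threshold masks described in the abstract, can be illustrated concretely. The following is a minimal NumPy sketch, not the reference implementation: the Beta(α, α) mixing ratio, the rectangle geometry used for CutMix, and the decay power used to concentrate spectral energy at low frequencies are assumptions based on the descriptions in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, x2, alpha=1.0):
    """MixUp: elementwise convex combination with lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

def cutmix_mask(h, w, lam):
    """CutMix-style binary mask: zero out a random rectangle of
    area roughly (1 - lam) * h * w."""
    cut_h, cut_w = int(h * np.sqrt(1.0 - lam)), int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mask = np.ones((h, w))
    mask[y1:y2, x1:x2] = 0.0
    return mask

def fmix_mask(h, w, lam, decay=3.0):
    """FMix-style binary mask: threshold a low frequency grey-scale image
    sampled from Fourier space so a fraction lam of the pixels is 1.
    The decay power controlling low-frequency emphasis is an assumption."""
    fy = np.fft.fftfreq(h)[:, None]    # vertical frequencies
    fx = np.fft.rfftfreq(w)[None, :]   # horizontal frequencies (real FFT)
    freq = np.maximum(np.sqrt(fy ** 2 + fx ** 2), 1.0 / max(h, w))
    # Random complex spectrum with power concentrated at low frequencies.
    spectrum = (rng.standard_normal(freq.shape)
                + 1j * rng.standard_normal(freq.shape)) / freq ** decay
    grey = np.fft.irfft2(spectrum, s=(h, w))
    # Keep the lam-fraction of highest-intensity pixels.
    return (grey >= np.quantile(grey, 1.0 - lam)).astype(float)
```

In the masking case, an augmented sample is then `mask * x1 + (1 - mask) * x2`, with the targets mixed in proportion to the mean of the mask.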

