MODALS: MODALITY-AGNOSTIC AUTOMATED DATA AUGMENTATION IN THE LATENT SPACE

Abstract

Data augmentation is an efficient way to expand a training dataset by creating additional artificial data. While data augmentation has been found effective in improving the generalization of models for various machine learning tasks, the underlying augmentation methods are usually designed manually and evaluated carefully for each data modality separately, such as image processing functions for image data and word-replacement rules for text data. In this work, we propose an automated data augmentation approach called MODALS (Modality-agnostic Automated Data Augmentation in the Latent Space) to augment data for any modality in a generic way. MODALS exploits automated data augmentation to fine-tune four universal data transformation operations in the latent space, adapting the transformations to data of different modalities. Through comprehensive experiments, we demonstrate the effectiveness of MODALS on multiple datasets for the text, tabular, time-series and image modalities.

1. INTRODUCTION

Deep learning models tend to perform better with more labeled training data. However, labeled data are usually scarce and expensive to collect. Data augmentation is a promising means to extend the training dataset with new artificial data. In image recognition, image processing functions such as randomized cropping, horizontal flipping, and color shifting are commonly adopted in modern models (Krizhevsky et al., 2012; Shorten & Khoshgoftaar, 2019). Following the success of image augmentation, it is becoming increasingly common to apply data augmentation in natural language processing tasks such as machine translation, text classification, and semantic parsing. Various word-based transformations have been proposed to perturb word tokens, such as replacing similar words or phrases, swapping word orders, and inserting or dropping random words (Cheng et al., 2018; Şahin & Steedman, 2018; Wei & Zou, 2019). Over the years, more transformation functions have been proposed to augment different datasets. Cutout randomly occludes a part of an image to avoid overfitting (Devries & Taylor, 2017b). Among label-mixing methods, CutMix replaces the occluded part in Cutout with a patch from a different image (Yun et al., 2019), and Mixup interpolates two images together with their one-hot encoded labels (Zhang et al., 2018). These methods have been tested and found effective on multiple image datasets. Alternatively, new data can be created with deep generative models, for example, GAN-based approaches that generate new images (Antoniou et al., 2017; Sandfort et al., 2019), conditional pre-trained language models that generate training sentences (Kumar et al., 2020), and back-translation, which paraphrases sentences by translating them to another language and back to the original language (Xie et al., 2020). While these generative approaches are useful, the generators or language models are often hard to implement and expensive to train.
Apart from advancing individual transformations, another line of research studies their optimal composition. As the choice and order of the transformations are decided and tested manually, the success of an augmentation scheme on one dataset may not generalize well to other datasets. To tackle this problem, AutoAugment was proposed to automate the process by learning an optimal augmentation policy, which decides the probability and magnitude with which pre-defined transformations are applied (Cubuk et al., 2019). Whether for standard or automated augmentation, the transformation functions are usually designed and tested carefully for each data modality separately. While it may be intuitive to create new and valid images using image processing techniques, it is non-trivial to define such label-preserving transformations on discrete data like text. This prohibits the reuse of augmentation schemes across data modalities. Beyond supervised learning, there is an increasing trend of utilizing data augmentation to extract information from unlabeled data in unsupervised (Xie et al., 2020) or self-supervised learning tasks (Chen et al., 2020; Grill et al., 2020). These methods depend heavily on existing augmentations for vision applications. To generalize them to other modalities like text and graph data, robust data augmentation is needed for each modality. Therefore, we propose MODALS, which applies modality-agnostic automated data augmentation in the latent space.

The idea of transforming latent features is inspired by representation learning. For image generation, the work of Upchurch et al. (2017) interpolates images along specific directions in the latent space to add new semantics without changing the class identity, such as adding facial hair to the image of a male face by translating the corresponding latent representation towards the direction of male faces with facial hair. This suggests that augmenting data in the latent space can capture diverse semantic transformations that are usually hard to define in the input space. Augmentation in the latent space poses two main challenges: learning a latent space that is continuous enough for transformation, and finding effective directions to traverse. Failing to address them properly may cause an augmented example to lose its original class identity. In the previous work by Devries & Taylor (2017a), the latent space is learned by training an autoencoder, which encodes the input data into a latent vector and decodes it back to the original data. The learned latent representations are then transformed by interpolation, extrapolation, or adding Gaussian noise and decoded as synthetic examples for the downstream tasks. For image data, ISDA estimates the semantic directions by inspecting the feature covariance (Wang et al., 2019), while LSI and Manifold Mixup apply Mixup to the latent feature vectors (Liu et al., 2018; Verma et al., 2019).

We develop our framework as MODALS. At a high level, MODALS applies latent space augmentation to address the data augmentation problem across multiple data modalities. Rather than improving model performance in a specific domain or modality, the major focus and novelty of this work is a general automated data augmentation framework that works for multiple data modalities in a generic way. To the best of our knowledge, such an attempt has not been made before in the research community. MODALS also differs from previous approaches in three ways.
First, as opposed to other operation-based latent space transformation methods, MODALS is trained jointly with the augmentation. As it involves no auxiliary models or additional processes to generate examples, it can be efficiently integrated into popular deep learning frameworks. Second, we observe that examples that are more uncertain to predict, or considered hard in the active learning literature, tend to carry richer information for model training. We therefore modify standard latent space transformations to create harder examples in MODALS. Third, MODALS introduces additional loss terms to improve the quality of augmentation in the latent space. In summary, we make four major contributions in this paper:

• Propose a framework to apply automated augmentation in the latent space.
• Propose a novel and effective way to create hard examples.
• Study additional loss terms to improve label-preserving transformations in the latent space.
• Evaluate MODALS extensively on classification datasets across multiple data modalities using various deep learning models.

2. RELATED WORK

2.1. AUTOMATED DATA AUGMENTATION

In practice, multiple transformations are composed to augment a dataset. The choice and strength of the transformations affect the model performance on different datasets. Motivated by neural architecture search and reinforcement learning, automated data augmentation formulates the tuning of augmentation parameters as a learning task. AutoAugment trains a child model with the augmentation transformations generated by a controller, which is updated by reinforcement learning using the downstream validation accuracy as the reward signal (Cubuk et al., 2019). To reduce the search effort of AutoAugment, various approaches have been proposed. Population Based Augmentation (PBA) adopts an evolution-based search strategy to simultaneously train and perturb a policy with multiple child models (Ho et al., 2019). RandAugment replaces the per-transformation parameters in AutoAugment by uniform values across all transformations and searches only for the magnitude and the number of transformations to apply (Cubuk et al., 2020). Adversarial AutoAugment finds an adversarial policy that creates difficult examples (Zhang et al., 2020), as does the related approach of Luo et al. (2020).

2.2. LATENT SPACE AUGMENTATION

Mixup interpolates pairs of examples together with their one-hot encoded labels in the input space (Zhang et al., 2018). Manifold Mixup applies Mixup to the outputs from different hidden layers (Verma et al., 2019); LSI applies Mixup in the latent space for image classification (Liu et al., 2018). Previously, Devries & Taylor (2017a) found that extrapolation between features learned by an autoencoder generates more effective synthetic examples than interpolation or adding Gaussian noise. In their experiments, they suspected that interpolation tightens the class boundaries, which leads to overfitting and harms the performance. For text data, Kumar et al. (2019) explored simple latent space transformations and autoencoder-based methods in few-shot intent classification tasks. In general, it remains an open question which transformation in the latent space creates better augmented examples.
We leverage automated data augmentation to find the optimal composition of latent space transformations for a variety of datasets across multiple data modalities.

2.3. HARD EXAMPLE AUGMENTATION

Kuchnik & Smith (2019) show that not all training examples are equally useful for augmentation and propose methods to find a subset of useful data to augment. Similar ideas appear in recent augmentation algorithms that aim to find the composition of transformations creating difficult examples (Luo et al., 2020; Wu et al., 2020; Zhang et al., 2020). In our work, we propose a novel and simple way to create hard examples using latent space transformations.

3.1. BRIEF OVERVIEW OF POPULATION BASED AUGMENTATION

We first briefly review the policy search procedure of PBA (Ho et al., 2019). In PBA, each augmentation function is associated with a probability and a magnitude; for example, the tuple (op: rotation, p = 0.4, λ = 0.5) specifies that the rotation operation is applied with probability 0.4 and magnitude 0.5. A policy is a set of such operation tuples. When a policy is applied to a mini-batch, 0, 1 or 2 operation tuples are randomly sampled from the policy and applied to the batch with their corresponding probabilities and magnitudes. PBA formulates augmentation policy search as hyperparameter schedule learning. Instead of a fixed policy, it learns a policy schedule that specifies the augmentation parameters at different stages of training. In particular, PBA exploits population-based training (PBT) to jointly optimize the weights of multiple child models and the policy parameters that maximize performance. It starts by simultaneously training a fixed number of randomly initialized child models, each with its own policy. At fixed intervals, the child models are evaluated on the validation set, and the worst models clone the weights and policies of the best models. The parameters of the cloned policies are either resampled from all possible values or perturbed from the cloned values. The process repeats until the training of the child models completes. The final output of PBA is the schedule of policies unrolled from the best child model. This policy schedule can then be applied to augment the dataset when training with more data or larger models. As no child model needs to be retrained, PBA is considered an efficient automated augmentation algorithm. Our method follows the policy formulation and search strategy of PBA.
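As a rough illustration, the exploit-and-explore step of PBT on the policy side can be sketched as follows. The function name, population layout, and thresholds here are ours for exposition; weight cloning is omitted and the actual PBA implementation differs in details.

```python
import random

def pbt_step(population, rng):
    """One exploit-and-explore step over a PBT population (sketch only).

    `population` is a list of dicts, each with a "score" (validation
    accuracy) and a "policy" (mapping parameter name -> value in [0, 0.9]).
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n_exchange = max(1, len(ranked) // 4)
    for loser, winner in zip(ranked[-n_exchange:], ranked[:n_exchange]):
        # Exploit: the worst models clone the policy of the best models
        # (in full PBT they clone the model weights as well).
        loser["policy"] = dict(winner["policy"])
        # Explore: resample or perturb each cloned parameter.
        for key, value in loser["policy"].items():
            if rng.random() < 0.2:  # resample from all possible values
                loser["policy"][key] = rng.choice([i / 10 for i in range(10)])
            else:                   # perturb the cloned value by one step
                value += rng.choice([-0.1, 0.0, 0.1])
                loser["policy"][key] = round(min(0.9, max(0.0, value)), 1)
    return population
```

Repeating this step at fixed intervals while the child models train yields the policy schedule that PBA finally unrolls from the best child.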

3.2. MODALS

Our main assumption is that, if learned properly, the region corresponding to a class in the latent space of a deep neural network is mostly convex and isotropic. Consequently, linear transformations in such a space result in a smooth transition from real data to artificial data without altering the class identity. A non-trivial caveat is that data points are often sparsely distributed in high-dimensional spaces, so that distance-based methods which work well in lower-dimensional spaces fail to take advantage of the distance metric (Donoho, 2000). In the supplementary material, we provide empirical evidence that the relatively low-dimensional latent spaces used in our experiments are not anisotropic, especially within the local class regions (see Appendix A.1). In what follows, we introduce the latent space transformations and the training objectives of our model.

3.2.1. TRANSFORMATIONS

Hard example interpolation. Given a seed latent representation $z_i^c$, the latent representation of the i-th example in class c, we interpolate $z_i^c$ toward a hard example. The hard example is taken as the latent representation nearest to $z_i^c$ among q hard example candidates sampled according to the classification loss. This favours the creation of harder examples and prevents $z_i^c$ from being aggressively interpolated to distant locations or to regions close to the class center. In our implementation, we take q as 5% of the number of examples in class c. Formally, let $S = \{s_i\}_{i=1}^{q}$ denote the set of hard latent representations sampled according to the magnitude of the loss in class c, let

$$s^* = \arg\max_{s \in S} \frac{s \cdot z_i^c}{\|s\|\,\|z_i^c\|}$$

denote the closest hard example under cosine similarity (red circle in Figure 1a), and let $\lambda_1$ denote a scaling factor. Hard example interpolation is expressed as:

$$\hat{z}_i^c = z_i^c + \lambda_1 (s^* - z_i^c) \quad (1)$$

Hard example extrapolation. The data nearer the class boundaries are more difficult to classify. Instead of extrapolating from random examples, we extrapolate $z_i^c$ away from the class center $\mu_c = \frac{1}{m} \sum_{j=1}^{m} z_j^c$, with $\lambda_2$ as the scaling factor. Hard example extrapolation is thus given by:

$$\hat{z}_i^c = z_i^c + \lambda_2 (z_i^c - \mu_c) \quad (2)$$

Gaussian noise. We also perturb $z_i^c$ by adding Gaussian noise with zero mean and per-element standard deviation computed across the examples in the same class. Letting $\epsilon \sim \mathcal{N}(0, \sigma_c^2 I)$ and $\lambda_3$ be the scaling factor, we have:

$$\hat{z}_i^c = z_i^c + \lambda_3 \epsilon \quad (3)$$

Difference transform. The difference transform is an alternative way to perturb $z_i^c$, translating it along the direction between two random latent vectors $z_j^c$ and $z_k^c$ sampled from the same class. Denoting the scaling factor as $\lambda_4$, the difference transform can be written as:

$$\hat{z}_i^c = z_i^c + \lambda_4 (z_j^c - z_k^c) \quad (4)$$

These four latent space transformations are illustrated in Figure 1.
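For concreteness, the four transformations above can be sketched in a few lines of plain Python over latent vectors represented as lists. The function names are ours, not from the released code, and this ignores batching and framework tensors.

```python
import math
import random

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def hard_interpolate(z, hard_set, lam):
    # s*: the hard candidate with the largest cosine similarity to z.
    s = max(hard_set,
            key=lambda h: sum(a * b for a, b in zip(h, z)) / (_norm(h) * _norm(z)))
    return [zi + lam * (si - zi) for zi, si in zip(z, s)]

def hard_extrapolate(z, mu, lam):
    # Push z away from the class mean mu, toward the class boundary.
    return [zi + lam * (zi - mi) for zi, mi in zip(z, mu)]

def gaussian_noise(z, sigma, lam, rng):
    # sigma: per-dimension standard deviation computed over the class.
    return [zi + lam * rng.gauss(0.0, si) for zi, si in zip(z, sigma)]

def difference_transform(z, z_j, z_k, lam):
    # Translate z along the direction between two same-class examples.
    return [zi + lam * (a - b) for zi, a, b in zip(z, z_j, z_k)]
```

Note that with `lam = 0` every transformation returns the seed vector unchanged, so the scaling factors searched by the policy smoothly control augmentation strength.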

3.2.2. MODEL

Instead of training an autoencoder to learn the latent space and generate additional synthetic data for training, we train a classification model jointly with different compositions of latent space augmentations. Given an input space X, a latent space Z, and a label space Y, MODALS trains a feature extraction model F(x; θ): X → Z and a dense layer mapping Z → Y. The i-th example $x_i$ is mapped to its latent representation $z_i = F(x_i)$ and further augmented by a composition of label-preserving transformation functions to obtain the augmented latent representation $\hat{z}_i \in Z$. We optimize the softmax classification loss:

$$L_{clf}(W, \theta) = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{\exp(w_{y_i}^\top \hat{z}_i)}{\sum_{j=1}^{k} \exp(w_j^\top \hat{z}_i)} \quad (5)$$

In addition to the classification loss, we employ two additional loss terms to improve the augmentation results. The effect of these two losses is discussed in Appendix A.2.

Adversarial Loss. Sampling in a high-dimensional latent space may fall into invalid regions that are not on the data manifold. To produce smooth morphing between sampled latent codes, there are well-studied techniques that enforce a posterior distribution over the latent variables using VAE or GAN-based methods (Kingma & Welling, 2014; Goodfellow et al., 2014). In MODALS, we regularize the latent space by imposing an adversarial loss similar to the Adversarial Autoencoder (Makhzani et al., 2016). In particular, we employ a discriminator D(z; φ) to distinguish between latent codes generated by the feature extraction model and samples drawn from a Gaussian distribution. The feature extraction model has to generate latent representations that resemble the Gaussian distribution to fool the discriminator. This leads to the adversarial objective $L_{adv}(\theta)$ for the feature extraction model and the discriminator loss $L_D(\phi)$ for the discriminator:

$$L_{adv}(\theta) = -\frac{1}{M} \sum_{i=1}^{M} \log D(z_i) \quad (6)$$

$$L_D(\phi) = -\frac{1}{M} \sum_{i=1}^{M} \left( \log D(\epsilon_i) + \log[1 - D(z_i)] \right); \quad \forall i,\ \epsilon_i \sim \mathcal{N}(0, I) \quad (7)$$

Triplet Loss.
Triplet loss is often used in metric learning to learn the latent representations of different classes (Schroff et al., 2015). It pulls together representations from the same class and repels those from different classes. We argue that this characteristic facilitates label-preserving augmentation in the latent space. Intuitively, data outside the convex hull of the training samples are less likely to correspond to valid data of the same class. As a result of using the triplet loss, the distribution of the latent representations within a class becomes more compact. Interpolating, extrapolating or perturbing latent representations in this denser region is more likely to yield an augmented example sharing the same class identity. In theory, alternative measures, e.g., the center loss (Wen et al., 2016), the large margin softmax loss (Liu et al., 2016), or other contrastive losses, can produce similar effects. For simplicity, we use the triplet loss in our implementation: for each anchor latent representation $z$, we randomly sample a positive latent representation $z^+$ from the same class and a negative representation $z^-$ from a different class in the same mini-batch. We denote the margin as $\gamma$ and the cosine distance function as $d(\cdot)$. The triplet loss $L_{tri}(\theta)$ is given by:

$$L_{tri}(\theta) = \frac{1}{M} \sum_{i=1}^{M} \left[ d(z_i, z_i^+) - d(z_i, z_i^-) + \gamma \right]_+ \quad (8)$$
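The two auxiliary losses can be sketched in plain Python over discriminator output probabilities and (anchor, positive, negative) latent triplets. This is an illustrative sketch with our own names, not the authors' implementation, and it omits the optimization machinery.

```python
import math

def adversarial_losses(d_noise, d_latent):
    """Adversarial objective for the encoder and loss for the discriminator.

    d_noise:  discriminator outputs D(eps) on Gaussian samples ("real").
    d_latent: discriminator outputs D(z) on encoder outputs ("fake").
    Both are lists of probabilities in (0, 1).
    """
    m = len(d_latent)
    l_adv = -sum(math.log(d) for d in d_latent) / m          # encoder fools D
    l_disc = -sum(math.log(r) + math.log(1.0 - f)            # D separates both
                  for r, f in zip(d_noise, d_latent)) / m
    return l_adv, l_disc

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def triplet_loss(triplets, gamma):
    """Hinged triplet loss with cosine distance over (anchor, pos, neg) triplets."""
    return sum(max(cosine_distance(z, zp) - cosine_distance(z, zn) + gamma, 0.0)
               for z, zp, zn in triplets) / len(triplets)
```

The hinge in the triplet loss is inactive once an anchor is at least `gamma` closer to its positive than to its negative, which is what drives the class regions to become compact and well separated.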

3.2.3. POLICY SEARCH

We apply the PBA policy search strategy to the latent representations with the four proposed transformations. Each transformation has a probability ranging from 0 to 1 and a magnitude λ ranging from 0 to 0.9, both in intervals of 0.1. As the same operation can be applied twice, the search space is of size $(10 \times 11)^8 \approx 2.14 \times 10^{16}$. The total computation cost is the cost of training a single model multiplied by the number of parallel models.
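The size of this search space can be checked directly:

```python
# Each of the 4 transformations can appear twice in a policy, giving 8 slots.
# Every slot has 11 probability values (0.0-1.0 in steps of 0.1) and
# 10 magnitude values (0.0-0.9 in steps of 0.1).
num_slots = 4 * 2
settings_per_slot = 11 * 10
search_space = settings_per_slot ** num_slots
print(search_space)  # 21435888100000000, i.e. about 2.14e16
```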

4. EXPERIMENTS

To demonstrate the modality-agnostic property of MODALS, we evaluate it using datasets from four different data modalities: text, tabular, time-series, and image data. For each dataset, we search for the optimal augmentation schedule using PBT with 16 parallel models, perturbing every three epochs. The searched policy schedule is then applied to the dataset with different numbers of training examples. We implement the discriminator as a 2-layer multilayer perceptron (MLP) with 256 hidden units in each layer. In all experiments, we set the loss weights α = 1 and β = 0.03 and search for the metric margin γ over {0.5, 1, 2, 4, 8}. We compare our method with baseline methods that do not involve training an auxiliary model. In particular, we construct baseline models trained with well-established input-space augmentations for the data modality and with label-mixing augmentation. We report the average classification accuracy over three trials for MODALS and the baselines. The detailed configuration and policy of each experiment are provided in Appendix A.3.

4.1. TEXT DATA

We test MODALS on the SST2 (Socher et al., 2013) and TREC6 (Li & Roth, 2002) datasets, which are popular benchmarks for binary and multi-class text classification. SST2 is a binary sentiment classification dataset with 6,920 training examples, while TREC6 is a question classification dataset with 5,452 training examples from six question classes. We employ a 2-layer bidirectional LSTM with fixed pre-trained 300-dimensional GloVe word embeddings (Pennington et al., 2014) and a hidden state dimensionality of 256. The first baseline is trained without augmentation. We construct the second baseline by applying the manually designed input-space augmentations proposed as easy data augmentation (EDA) techniques in (Wei & Zou, 2019). In this baseline, seed sentences are cloned multiple times, and each cloned sentence is transformed by replacing words with synonyms, inserting synonyms, deleting random words, or swapping word order. The baseline follows the configurations suggested by EDA for datasets of different sizes. We construct the third baseline using senMixup, which applies Mixup augmentation in the last hidden layer before the softmax layer (Guo et al., 2019). The fourth and fifth baselines use back-translation augmentation, implemented with the Google Translate API by translating the original text to German (DE) or Spanish (ES) and back to English; the model is then trained on the back-translated text together with the original text. Our experiments show that MODALS outperforms the other baseline methods in five of the settings and is comparable with back-translation (ES) on the TREC6 dataset. While EDA, senMixup and back-translation also show improvement on the full datasets, senMixup and back-translation sometimes slightly underperform the no-augmentation baseline when the dataset size is small (see Table 1).
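To make the EDA baseline concrete, the four word-level operations it composes can be sketched as follows. This is a toy version with illustrative names and a caller-supplied synonym table, not the EDA reference implementation (which draws synonyms from WordNet and applies the operations at a rate proportional to sentence length).

```python
import random

def eda_augment(sentence, synonyms, rng):
    """Apply one randomly chosen EDA-style operation to a sentence.

    `synonyms` maps a word to a list of possible replacements.
    """
    words = sentence.split()
    op = rng.choice(["replace", "insert", "delete", "swap"])
    has_syn = any(w in synonyms for w in words)
    if op == "replace" and has_syn:
        i = rng.choice([i for i, w in enumerate(words) if w in synonyms])
        words[i] = rng.choice(synonyms[words[i]])          # synonym replacement
    elif op == "insert" and has_syn:
        w = rng.choice([w for w in words if w in synonyms])
        words.insert(rng.randrange(len(words) + 1),
                     rng.choice(synonyms[w]))              # synonym insertion
    elif op == "delete" and len(words) > 1:
        del words[rng.randrange(len(words))]               # random deletion
    elif op == "swap" and len(words) > 1:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]            # random swap
    return " ".join(words)
```

Each operation perturbs the surface form while, ideally, preserving the sentence label, which is exactly the property MODALS seeks in the latent space instead.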
4.2. TABULAR DATA

We also perform experiments with multiple tabular datasets from the UCI repository (Dua & Graff, 2017), including the Iris, Breast Cancer, Arcene (Guyon et al., 2005), Abalone, and HTRU2 (Lyon et al., 2016) datasets. The data are standardized and trained with a 2-layer MLP that encodes the inputs into 128-dimensional latent codes. We compare the accuracy against baseline methods using no augmentation and using Mixup. The experiment is repeated with different numbers of training examples. Table 2 shows that MODALS outperforms the baseline methods in 13 out of 15 settings; in the remaining two settings, it matches the performance of Mixup augmentation.

4.3. TIME-SERIES DATA

For time-series data, we use the HAR (Anguita et al., 2013) and Malware (Catak, 2019) datasets, which contain continuous and discrete time-series data, respectively. The input of the HAR dataset consists of multiple continuous smartphone accelerometer and gyroscope readings for six human activities. The input of the Malware dataset consists of sequences of discrete events, specifically the API call sequences of eight malware families. For both datasets, we train an LSTM model to encode the time-series data as 128-dimensional feature vectors and apply MODALS. MODALS performs the best in all cases (see Table 3).

4.4. IMAGE DATA

For image data, we apply MODALS to CIFAR-10, SVHN, and the reduced versions of these two datasets, as in (Ho et al., 2019). CIFAR-10 and its reduced version contain 50,000 and 4,000 training images, respectively, while SVHN and its reduced version contain 73,257 and 1,000 training images, respectively. We use Wide-ResNet-40-2 as the feature extraction model in all experiments (Zagoruyko & Komodakis, 2016). We compare our method against three baselines. The first baseline applies randomized cropping, horizontal flipping, and color normalization. The second baseline, Cutout, randomly occludes a 16×16 patch of the image on top of the simple augmentations of the first baseline. The third baseline is PBA, which uses the same search strategy as MODALS. Both PBA and MODALS are deployed with their corresponding searched policy schedules on top of Cutout. On the two original and the two reduced datasets, MODALS outperforms simple augmentation and Cutout but underperforms PBA (see Table 4). We suspect that, due to the continuous nature of image data, the input-space augmentations used in PBA already cover most of the variations in these datasets.
As the input-space augmentations are independent of the latent space augmentations, it is possible to combine PBA with MODALS by jointly searching the optimal augmentation policy parameters to further improve the results. 

5. ABLATION STUDY

Latent space augmentation. We study the effect of the latent space augmentation techniques in MODALS. We remove the latent space augmentation to isolate its effect from that of the triplet loss and adversarial loss. Table 5 summarizes the performance of MODALS across eight datasets from different modalities. Our study shows that latent space augmentation contributes additional performance gains under the different loss objectives. The detailed breakdown is listed in Appendix A.2.

Additional losses. We perform an ablation study against baseline models trained with different loss settings on eight datasets. Both the triplet loss and the adversarial loss improve the performance when training with latent space augmentation (see Table 5). In Appendix A.2.1, we provide details and evidence showing that the triplet loss enforces more compact local class regions, which preserves the class labels when applying latent space augmentation.

Table 5: Comparison of average accuracy when trained under different loss settings and augmentation techniques (columns: L_clf; L_clf + L_adv; L_clf + L_tri; L_clf + L_adv + L_tri; rows: with and without latent space augmentation)

6. CONCLUSION

In this paper, we introduce MODALS, which applies automated data augmentation in the latent space using four proposed modality-agnostic transformations trained with additional loss metrics and hard example augmentation techniques. Our method is tested on text, tabular, time-series, and image data and can be readily integrated with popular deep learning models. Beyond the data modalities tested, MODALS can also work on other data modalities given proper feature extraction models. We believe that MODALS brings larger improvements to data modalities in which input-space augmentations are less trivial to define, such as text, video, or even graph data in low-resource regimes.

A APPENDIX

A.1 VALIDITY OF THE ISOTROPIC ASSUMPTION IN THE LATENT SPACE

Due to the curse of dimensionality, high-dimensional vectors are usually sparsely distributed. Consider a high-dimensional hypersphere with data points evenly distributed inside it. As the dimensionality increases, it can be shown that the percentage of data points residing near the surface of the hypersphere increases significantly. This contradicts our common perception of low-dimensional spaces, such as two- or three-dimensional spaces. Worse still, if this phenomenon indeed holds, the isotropic assumption about a class region would be violated. We therefore conduct an experiment to see whether there is evidence of this phenomenon in our setting. Specifically, we study the spread of data points in latent spaces of different dimensionality. For each class c of size m, we compute the normalized distances $\{r_i^c\}_{i=1}^{m}$ from the class mean $\mu_c$ to the latent representations $\{z_i^c\}_{i=1}^{m}$:

$$r_i^c = \frac{\ell_2(z_i^c, \mu_c)}{\frac{1}{m} \sum_{j=1}^{m} \ell_2(z_j^c, \mu_c)}$$

We repeat the experiment on one dataset for each of the four data modalities. Table 7 reports the average standard deviation of the normalized distances over all classes. The average standard deviation in all datasets shows a decreasing trend as the dimensionality increases, falling between 0.25 and 0.75 for a dimensionality of 128 or 256. If the phenomenon described above really held, so that most data points had similar distances from their class center, we would expect the standard deviation to be very small or even close to zero. Our experiment shows that this is not the case, especially for lower latent space dimensionality.

A.2 EFFECT OF THE ADDITIONAL LOSSES

Here, we study the effect of using additional loss metrics for data augmentation in the latent space. We repeat our previous experiments by removing one or both of the triplet and adversarial losses.
Table 8 shows that adding the proposed loss metrics improves the classification performance across all datasets of different modalities. We further analyze the effect of the triplet loss in the augmentation context. The triplet loss forces the distribution of data points in the same class to be more compact. Consequently, the local class region becomes more convex and isotropic, which facilitates smoother label-preserving transformations. We validate this conjecture by constraining the latent space to two dimensions to facilitate visualization of the data points (see Figure 3). Without the triplet loss, some class regions overlap and take less regular shapes. With the triplet loss, the class regions become more compact and more circular, as observed from the coordinate axes and the visualization plots. This supports the convex and isotropic assumption made in Section 3.2 when we propose to apply linear transformations to the feature vectors in local class regions. As the classes are better separated, the transformations can better preserve the original class labels. In addition to visualizing the data points qualitatively, we also provide quantitative measurements in Table 9 of the separation between and within the class regions for higher dimensionality. Specifically, we measure the average within-class and between-class distances of the latent representations when training with and without the triplet loss. Our study shows that the average within-class distance is smaller with the triplet loss, resulting in more compact class regions. The ratio of between-class to within-class distance indicates a genuinely better separation between class regions rather than a mere scaling effect. Similar observations are made on the other datasets. Next, we study the effectiveness of our proposed transformations in creating hard examples.
In our experiment, we measure the uncertainty of an example by the difference in predicted probability between the most likely and second most likely labels (Scheffer et al., 2001). We call this metric the margin; a smaller margin implies a more uncertain prediction:

$$\text{Margin}(z) = p(y_1 \mid z) - p(y_2 \mid z)$$

where $y_1$ and $y_2$ denote the most likely and second most likely labels. We train the models on eight datasets and compute the margin for the original and augmented features. Our results show that hard interpolation is more effective than hard extrapolation in creating uncertain examples on most datasets (see Table 10). Admittedly, this finding is somewhat counter-intuitive. We suspect that it may be due to the different shapes and complexity of the class boundaries, but more investigation is needed to understand the reasons behind it. In a follow-up experiment (see Table 11), we use a fixed augmentation instead of a fine-tuned policy schedule. The models trained with hard example interpolation and extrapolation achieve higher accuracy on the sampled hard examples, and the larger increases in margin show that the models can learn to predict the hard examples with higher certainty.
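The margin metric itself is straightforward to compute from a vector of predicted class probabilities; a minimal sketch:

```python
def margin(probs):
    """Difference between the largest and second-largest predicted probabilities.

    A small margin means the model is uncertain between its top two labels.
    """
    top = sorted(probs, reverse=True)
    return top[0] - top[1]
```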



Code is available at https://github.com/jamestszhim/modals.



Figure 1: Illustration of (a) hard example interpolation, (b) hard example extrapolation, (c) Gaussian noise, and (d) difference transform (triangle: seed latent representation; circle outline: class boundary; black circle: sampled feature; red circle: nearest sampled hard example; square: class mean; dotted line: range of augmented example)

Figure 2: MODALS (z: seed latent representation; ẑ: augmented latent representation; black line: forward propagation; red line: gradient flow from L clf ; green line: gradient flow from L tri ; orange line: gradient flow from L adv ; blue line: reward signal; grey line: gradient flow from L D .)

Figure 3: Visualization of data points in the latent space for the TREC6 dataset

Figure 4: Visualization of the scaling factor and probability of the searched augmentation policies on the SST2 (a,b), TREC6 (c,d), Iris (e,f), B. Cancer (g,h), HAR (i,j), Malware (k,l), CIFAR-10 (m,n), and SVHN (o,p) datasets

Comparison of five baselines (training without augmentation, EDA, senMixup, backtranslation from German and back-translation from Spanish) with MODALS on two text datasets for different amounts of training examples (s: 10%, m: 50%)

Comparison of two baselines (training without augmentation and Mixup) with MODALS on five tabular datasets for different amounts of training examples (s: 20%, m: 50%)

Comparison of two baselines (training without augmentation and Mixup) with MODALS on two tabular datasets for different amounts of training examples (s: 10%, m: 50%)

Comparison of three baselines (training with simple augmentation, Cutout, and PBA) with MODALS on two image datasets for different amounts of training examples

We study the effectiveness of our proposed hard augmentation in creating uncertain examples on eight datasets. Taking the difference in predicted probability between the most likely and second most likely labels as a measure of prediction certainty, our study shows that the proposed hard example interpolation and extrapolation create examples that are, on average, 7.23% and 2.34% less certain, respectively (see Table 6). We further validate that the augmented examples improve the classification performance on hard examples (see Appendix A.2.2, A.2.3).

Comparison of average change in prediction certainty with and without hard augmentation (columns: w/o Aug., +Hard Interpolation, +Hard Extrapolation)

Average standard deviation of the normalized distances for all classes as the latent space dimensionality d varies

Summary of the performance under different settings (Aug.: latent space augmentation; L_adv: adversarial loss; L_tri: triplet loss)

Quantitative results for within-class and between-class distances under the effect of the triplet loss on the TREC6 dataset (d: dimensionality; c_w: within-class distance; c_b: between-class distance)

Comparison of the margin between models trained without augmentation and with hard interpolation or extrapolation

We also study whether the hard examples created using the proposed transformations benefit model training. In the ablation test, we sample the top 5% hardest examples from each class based on the classification loss and compare the changes in margin and accuracy on these hard examples after training under different augmentation settings (see Table 11).

Comparison of the change in margin and classification accuracy on the hard examples between models trained with different augmentation schemes

Finally, we compare end-to-end training with simple interpolation and extrapolation to our proposed hard augmentation methods. Table 12 presents the accuracy of the models trained with hard augmentation and with simple interpolation and extrapolation.

ACKNOWLEDGMENTS

This research has been made possible by the Hong Kong PhD Fellowship provided to the first author and research grants (General Research Fund project 16204720 and Collaborative Research Fund project C6030-18GF from the Research Grants Council of Hong Kong; Amazon Web Services Machine Learning Research Award 2020) provided to the second author.


Published as a conference paper at ICLR 2021 

A.3 EXPERIMENT DETAILS

In this section, we present the detailed configuration and augmentation schedule for each data modality. In all experiments, the augmentation policy is searched using 50% of the data as the validation set. We use 16 child models in PBA, implemented with the Ray Tune framework (https://docs.ray.io/en/latest/tune/index.html); the child models are evaluated and perturbed every three epochs. For the discriminator, we employ a 2-layer MLP with 256 hidden units and ReLU activations, trained using the Adam optimizer with learning rate 0.01.
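The discriminator described above is small enough to sketch directly. The following is a minimal NumPy forward pass of the 2-layer MLP (256 hidden units, ReLU); the initialization scheme here is an assumption, and the actual training with Adam is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_discriminator(in_dim, hidden=256, out_dim=1):
    """2-layer MLP: in_dim -> hidden (ReLU) -> out_dim, He-style init."""
    return {
        "W1": rng.standard_normal((in_dim, hidden)) * np.sqrt(2.0 / in_dim),
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, out_dim)) * np.sqrt(2.0 / hidden),
        "b2": np.zeros(out_dim),
    }

def discriminator_forward(params, z):
    h = np.maximum(0.0, z @ params["W1"] + params["b1"])  # ReLU hidden layer
    return h @ params["W2"] + params["b2"]                # output logits
```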

A.3.1 TEXT DATA

We employ a single-layer bidirectional LSTM to encode the 300-dimensional GloVe text embedding input as a 128-dimensional latent representation. The model is trained for 100 epochs using the Adam optimizer with learning rate 0.01 and batch size 100. Figure 4(a,b) and Figure 4(c,d) illustrate the augmentation policies for the SST2 and TREC6 datasets, respectively.

A.3.2 TABULAR DATA

For all tabular datasets, we split 20% of the dataset as the test set unless the test set is explicitly provided in the repository. The tabular data are first standardized and trained with a 2-layer MLP model with 128 hidden units in each hidden layer. We use the Adam optimizer with learning rate 0.01 and batch size 32 to train the model for 30 epochs. In some cases when the dataset size is too small, we use a batch size of 16 instead. Figure 4(e,f) and Figure 4(g,h) show the augmentation policies for the Iris and Breast Cancer datasets, respectively.

A.3.3 TIME-SERIES DATA

For the HAR and Malware datasets, the time series are split into subsequences using a sliding window of size 256 with 50% overlap. In the Malware dataset, we remove consecutive duplicated API calls. We use a single-layer LSTM model to encode the time-series input as a 128-dimensional latent representation. For the Malware dataset, we exploit an additional embedding layer to map the discrete API call events into a 32-dimensional input embedding. The LSTM models are trained using the Adam optimizer with learning rate 0.01 and batch size 128 for 50 epochs. The augmentation policies for the HAR and Malware datasets are presented in Figure 4(i,j) and Figure 4(k,l), respectively.
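The sliding-window split described above (window 256, 50% overlap, i.e. a hop of 128 steps) can be sketched as follows; windows that would run past the end of the series are dropped, which is one common convention:

```python
import numpy as np

def sliding_windows(x, window=256, overlap=0.5):
    """Split a series (1-D, or [T, features]) into overlapping subsequences."""
    hop = int(window * (1 - overlap))  # 128 steps for 50% overlap
    n = (len(x) - window) // hop + 1   # number of full windows
    return np.stack([x[i * hop: i * hop + window] for i in range(n)])
```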

A.3.4 IMAGE DATA

We use Wide-ResNet-40-2 as the feature extractor for the CIFAR-10 and SVHN datasets. The models are trained using the SGD optimizer with batch size 100 for 200 epochs, with a weight decay of 10^-4, a learning rate of 0.01, and cosine learning rate decay with one annealing cycle.
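The cosine decay with one annealing cycle follows the standard cosine annealing schedule. A sketch, assuming the learning rate decays from the base value to zero over the 200 training epochs (the paper does not state the final learning rate):

```python
import math

def cosine_lr(epoch, base_lr=0.01, total_epochs=200):
    """Single-cycle cosine annealing: base_lr at epoch 0, ~0 at the end."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```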

