ON THE CONSISTENCY LOSS FOR LEVERAGING AUGMENTED DATA TO LEARN ROBUST AND INVARIANT REPRESENTATIONS

Abstract

Data augmentation is one of the most popular techniques for improving the robustness of neural networks. In addition to directly training the model on both original and augmented samples, a torrent of methods that regularize the distance between embeddings/representations of the original samples and their augmented counterparts has been introduced. In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings. Our analysis suggests that the ideal choices of regularization correspond to various assumptions. With an invariance test, we argue that regularization is important if the model is to be used in a broader context than an accuracy-driven setting, because non-regularized approaches are limited in learning the concept of invariance, despite achieving equally high accuracy. Finally, we also show that the generic approach we identified (squared ℓ2 norm regularized augmentation) outperforms, over three different tasks, several recent methods that are each specially designed for one task and significantly more complicated than ours.

1. INTRODUCTION

Recent advances in deep learning have delivered remarkable empirical performance on i.i.d. test data, and the community continues to investigate the more challenging and realistic scenario in which models are tested for robustness on non-i.i.d. data (e.g., Ben-David et al., 2010; Szegedy et al., 2013). Recent studies suggest that one cause of this fragility is the model's tendency to capture undesired signals (Wang et al., 2020); thus, combating this tendency may be a key to robust models. To help models ignore the undesired signals, data augmentation (i.e., diluting the undesired signals of training samples by applying transformations to existing examples) is often used. Given its wide usage, we seek to answer the question: how should we train with augmented samples so that the assistance of augmentation can be exploited to the fullest extent to learn robust and invariant models?

In this paper, we analyze the generalization behaviors of models trained with augmented data and the associated regularization techniques. We investigate a set of assumptions and compare the worst-case expected risk over unseen data when i.i.d. samples are allowed to be transformed by a function belonging to a family. We bound the expected risk with terms that can be computed during training, so that our analysis can inform how to regularize the training procedure. While all the derived methods enjoy an upper bound on the expected risk, with progressively stronger assumptions we obtain progressively simpler regularizations, allowing practical choices to be made according to an understanding of the application. Our contributions are as follows:

• We offer analyses of the generalization behaviors of augmented models trained with different regularizations: these regularizations require progressively stronger assumptions about the data and the augmentation functions, but progressively less computational effort. For example, with assumptions pertaining to the augmentation transformation functions, the Wasserstein distance between the original and augmented empirical distributions can be calculated through a simple ℓ1 norm distance.

• We test and compare these methods and offer practical guidance on how to choose regularizations in practice. In short, regularizing the squared ℓ2 distance between the logits of the augmented samples and the original samples is a favorable method, supported by both theoretical and empirical evidence.

• With an invariance test, we argue that vanilla augmentation does not utilize the augmented samples to the fullest extent, especially in learning invariant representations, and thus may not be ideal unless the only goal of augmentation is to improve accuracy in a specific setting.
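To make the favored regularizer concrete, the following is a minimal NumPy sketch, not the paper's implementation: it combines the supervised loss on both views with the squared ℓ2 distance between the logits of original and augmented samples. The function names and the trade-off weight `lam` are hypothetical choices for illustration; a KL-on-softmax variant, one of the alternative consistency losses, is included for contrast.

```python
import numpy as np

def squared_l2_consistency(logits_orig, logits_aug):
    """Mean squared l2 distance between the logits of original and
    augmented samples -- the generic consistency regularizer."""
    diff = logits_orig - logits_aug
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_consistency(logits_orig, logits_aug, eps=1e-12):
    """KL divergence between softmax outputs, an alternative
    consistency loss used in some prior work."""
    p, q = softmax(logits_orig), softmax(logits_aug)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def regularized_objective(ce_orig, ce_aug, logits_orig, logits_aug, lam=1.0):
    """Supervised loss averaged over both views plus the weighted
    squared-l2 consistency term; `lam` is an assumed hyperparameter."""
    return 0.5 * (ce_orig + ce_aug) + lam * squared_l2_consistency(logits_orig, logits_aug)
```

Note that when the model is perfectly invariant to the augmentation (identical logits for both views), both consistency terms vanish and the objective reduces to the ordinary supervised loss.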

2. RELATED WORK & KEY DIFFERENCES

Data augmentation has been used effectively for years. Tracing back to the earliest convolutional neural networks, we notice that even LeNet, applied to the MNIST dataset, was boosted by mixing distorted images with the original ones (LeCun et al., 1998). Later, the rapidly growing machine learning community has seen a proliferation of data augmentation techniques (e.g., flipping, rotation, blurring, etc.) that have helped models climb the ladder of the state-of-the-art (one may refer to the relevant survey (Shorten & Khoshgoftaar, 2019) for details). Recent advances have expanded the conventional concept of data augmentation with several new approaches, such as leveraging the information in unlabelled data (Xie et al., 2019), automatically learning augmentation functions (Ho et al., 2019; Hu et al., 2019; Wang et al., 2019c; Zhang et al., 2020; Zoph et al., 2019), and generating the samples (with constraints) that maximize the training loss during training (Fawzi et al., 2016), an idea later widely adopted as adversarial training (Madry et al., 2018).

While the above works mainly discuss how to generate the augmented samples, in this paper we mainly answer the question of how to train models with augmented samples. For example, instead of directly mixing augmented samples with the original samples, one can regularize the representations (or outputs) of original samples and their augmented counterparts to be close under a distance metric (also known as a consistency loss). Many concrete ideas have been explored in different contexts: ℓ2 distance and cosine similarity between internal representations in speech recognition (Liang et al., 2018), squared ℓ2 distance between logits (Kannan et al., 2018) or KL divergence between softmax outputs (Zhang et al., 2019a) in adversarially robust vision models, and Jensen-Shannon divergence (of three distributions) between embeddings for texture-invariant image classification (Hendrycks et al., 2020). These are but a few highlights of the concrete and successful implementations for different applications out of a huge collection (e.g., Wu et al., 2019; Guo et al., 2019; Zhang et al., 2019b; Shah et al., 2019; Asai & Hajishirzi, 2020; Sajjadi et al., 2016; Zheng et al., 2016; Xie et al., 2015), and one can easily imagine methods permuting these three elements (distance metrics, representations or outputs, and applications) being invented. Even further, although we are not aware of the following methods in the context of data augmentation, given the popularity of GANs (Goodfellow, 2016) and domain adversarial neural networks (Ganin et al., 2016), one can also expect the distance metric to generalize to a specialized discriminator (i.e., a classifier), which can be intuitively understood as a learned (usually maximized) distance measure, the Wasserstein-1 metric being an example (Arjovsky et al., 2017; Gulrajani et al., 2017).

Key Differences: With this rich collection of regularization choices, which method should we consider in general? More importantly, do we actually need the regularization at all? These questions are important for multiple reasons, especially considering that some papers suggest these regularizations may lead to worse results (Jeong et al., 2019). In this paper, we answer the first question with a proven upper bound on the worst-case generalization error, and our upper bound explicitly describes which regularizations are needed. For the second question, we show that regularization helps the model learn the concept of invariance. There are also several previous discussions regarding detailed understandings of data augmentation (Yang et al., 2019; Chen et al., 2019; Hernández-García & König, 2018; Rajput et al., 2019; Dao et al., 2019), among which Yang et al. (2019) is probably the most relevant, as it also defends the usage of regularization. However, we believe our discussion is more comprehensive and theoretically supported, since our analysis directly suggests the ideal regularization. Empirically, we also design an invariance test in addition to the worst-case accuracy used in the preceding work.

3. TRAINING STRATEGIES WITH AUGMENTED DATA

Notations: (X, Y) denotes the data, where X ∈ R^{n×p} and Y ∈ {0, 1}^{n×k} (one-hot vectors for k classes), and f(·; θ) denotes the model, which takes in the data and outputs the softmax (probabilities of the prediction), where θ denotes the corresponding parameters. g(·) completes the prediction (i.e., mapping the softmax output to a one-hot prediction). l(·, ·) denotes a generic loss function. a(·) denotes a

