ON THE CONSISTENCY LOSS FOR LEVERAGING AUGMENTED DATA TO LEARN ROBUST AND INVARIANT REPRESENTATIONS

Abstract

Data augmentation is one of the most popular techniques for improving the robustness of neural networks. In addition to directly training the model with original samples and augmented samples, a torrent of methods that regularize the distance between embeddings/representations of the original samples and their augmented counterparts have been introduced. In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings. Our analysis suggests that the ideal choice of regularization corresponds to various assumptions. With an invariance test, we argue that regularization is important if the model is to be used in a broader context than an accuracy-driven setting, because non-regularized approaches are limited in learning the concept of invariance, despite equally high accuracy. Finally, we also show that the generic approach we identified (squared ℓ2-norm regularized augmentation) outperforms several recent methods, each specially designed for one task and significantly more complicated than ours, over three different tasks.

1. INTRODUCTION

Recent advances in deep learning have delivered remarkable empirical performance over i.i.d. test data, and the community continues to investigate the more challenging and realistic scenario in which models are tested for robustness over non-i.i.d. data (e.g., Ben-David et al., 2010; Szegedy et al., 2013). Recent studies suggest that one cause of this fragility is the model's tendency to capture undesired signals (Wang et al., 2020); combating this tendency may therefore be a key to robust models. To help models ignore the undesired signals, data augmentation (i.e., diluting the undesired signals of training samples by applying transformations to existing examples) is often used. Given its wide usage, we seek to answer the question: how should we train with augmented samples so that the assistance of augmentation can be taken to the fullest extent to learn robust and invariant models?

In this paper, we analyze the generalization behaviors of models trained with augmented data and associated regularization techniques. We investigate a set of assumptions and compare the worst-case expected risk over unseen data when i.i.d. samples are allowed to be transformed according to a function belonging to a family. We bound the expected risk with terms that can be computed during training, so that our analysis can inform how to regularize the training procedure. While all the derived methods admit an upper bound on the expected risk, progressively stronger assumptions yield progressively simpler regularization, allowing practical choices to be made according to the understanding of the application.

Our contributions in this paper are as follows:

• We offer analyses of the generalization behaviors of augmented models trained with different regularizations: these regularizations require progressively stronger assumptions about the data and the augmentation functions, but progressively less computational effort.
For example, with assumptions pertaining to the augmentation transformation functions, the Wasserstein distance between the original and augmented empirical distributions can be calculated through a simple ℓ1-norm distance.

• We test and compare these methods and offer practical guidance on how to choose regularizations in practice. In short, regularizing the squared ℓ2 distance of logits between the augmented samples and original samples is a favorable method, suggested by both theoretical and empirical evidence.
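The recommended regularizer, the squared ℓ2 distance between logits of original and augmented samples added to the supervised losses, can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation; the helper names and the weight `lam` are hypothetical choices for the example.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # numerically stable log-softmax cross-entropy, averaged over the batch
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def consistency_loss(logits_orig, logits_aug):
    # squared l2 distance between the logits of each original/augmented pair,
    # averaged over the batch
    return ((logits_orig - logits_aug) ** 2).sum(axis=1).mean()

def total_loss(logits_orig, logits_aug, labels, lam=1.0):
    # combined objective (sketch): supervised terms on both the original and
    # augmented samples, plus the squared-l2 consistency regularizer; the
    # weighting scheme `lam` is an assumption for illustration
    return (softmax_cross_entropy(logits_orig, labels)
            + softmax_cross_entropy(logits_aug, labels)
            + lam * consistency_loss(logits_orig, logits_aug))
```

When the model is perfectly invariant (identical logits on a sample and its augmented version), the consistency term vanishes and the objective reduces to the plain supervised losses.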

