CODA: CONTRAST-ENHANCED AND DIVERSITY-PROMOTING DATA AUGMENTATION FOR NATURAL LANGUAGE UNDERSTANDING

Abstract

Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging. In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization objective is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% when applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training baselines (including in low-resource settings). Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework.

1. INTRODUCTION

Data augmentation approaches have successfully improved large-scale neural-network-based models (Laine & Aila, 2017; Xie et al., 2019; Berthelot et al., 2019; Sohn et al., 2020; He et al., 2020; Khosla et al., 2020; Chen et al., 2020b); however, the majority of existing research is geared towards computer vision tasks. The discrete nature of natural language makes it challenging to design effective label-preserving transformations for text sequences that can help improve model generalization (Hu et al., 2019; Xie et al., 2019). On the other hand, fine-tuning powerful, over-parameterized language models proves to be difficult, especially when there is a limited amount of task-specific data available: it may result in representation collapse (Aghajanyan et al., 2020) or require special fine-tuning techniques (Sun et al., 2019; Hao et al., 2019).

In this work, we aim to take a further step towards finding effective data augmentation strategies through systematic investigation. In essence, data augmentation can be regarded as constructing neighborhoods around a training instance that preserve the ground-truth label. Under such a characterization, adversarial training (Zhu et al., 2020; Jiang et al., 2020; Liu et al., 2020; Cheng et al., 2020) also performs label-preserving transformations, albeit in embedding space, and is thus considered an alternative to data augmentation methods in this work. From this perspective, the goal of developing effective data augmentation strategies can be summarized as answering three fundamental questions: i) What are some label-preserving transformations that can be applied to text to compose useful augmented samples? ii) Are these transformations complementary in nature, and can we find strategies to consolidate them for producing more diverse augmented examples? iii) How can we incorporate the obtained augmented samples into the training process in an effective and principled manner?
Previous efforts in augmenting text data have mainly focused on answering the first question (Yu et al., 2018; Xie et al., 2019; Kumar et al., 2019; Wei & Zou, 2019; Chen et al., 2020a; Shen et al., 2020). Regarding the second question, different label-preserving transformations have been proposed, but it remains unclear how to integrate them organically. In addition, it has been shown that the diversity of augmented samples plays a vital role in their effectiveness (Xie et al., 2019; Gontijo-Lopes et al., 2020). In the case of image data, several strategies that combine different augmentation methods have been proposed, such as applying multiple transformations sequentially (Cubuk et al., 2018; 2020; Hendrycks et al., 2020), learning data augmentation policies (Cubuk et al., 2018), and randomly sampling operations for each data point (Cubuk et al., 2020). However, these methods cannot be naively applied to text data, since the semantic meaning of a sentence is much more sensitive to local perturbations (relative to an image). As for the third question, consistency training is typically employed to utilize the augmented samples (Laine & Aila, 2017; Hendrycks et al., 2020; Xie et al., 2019; Sohn et al., 2020; Miyato et al., 2018). This method encourages the model predictions to be invariant to certain label-preserving transformations. However, existing approaches only examine a pair of original and augmented samples in isolation, without considering other examples in the entire training set. As a result, the representation of an augmented sample may end up closer to those of other training instances than to the one it is derived from. Based on this observation, we advocate that, in addition to consistency training, a training objective that globally captures the intrinsic relationship within the entire set of original and augmented training instances can help leverage augmented examples more effectively.
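The consistency training described above can be sketched minimally as follows; this is a generic illustration with numpy, not the paper's implementation, using a KL-divergence penalty between the model's predicted distributions on an original example and its augmentation (the function names are ours):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def consistency_loss(probs_original, probs_augmented):
    """Penalize divergence between predictions on an original example
    and on its label-preserving augmentation."""
    return kl_divergence(probs_original, probs_augmented)

# Identical predictions incur (near-)zero loss; divergent ones are penalized.
p_orig      = [0.7, 0.2, 0.1]
p_aug_close = [0.7, 0.2, 0.1]
p_aug_far   = [0.1, 0.2, 0.7]
assert consistency_loss(p_orig, p_aug_close) < consistency_loss(p_orig, p_aug_far)
```

Note that this loss only compares a single (original, augmented) pair; it says nothing about where the augmented representation lands relative to the rest of the training set, which is precisely the limitation discussed above.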
In this paper, we introduce a novel Contrast-enhanced and Diversity-promoting Data Augmentation (CoDA) framework for natural language understanding. To improve the diversity of augmented samples, we extensively explore different combinations of isolated label-preserving transformations in a unified approach. We find that stacking distinct label-preserving transformations produces particularly informative samples. Specifically, the most diverse and high-quality augmented samples are obtained by stacking an adversarial training module over the back-translation transformation. Besides the consistency-regularized loss, which encourages the model to behave consistently within local neighborhoods, we propose a contrastive learning objective to capture the global relationship among the data points in the representation space. We evaluate CoDA on the GLUE benchmark (with RoBERTa (Liu et al., 2019) as the testbed), and CoDA consistently improves the generalization ability of the resulting models, giving rise to significant gains relative to the standard fine-tuning procedure. Moreover, our method also outperforms various single data augmentation operations, combination schemes, and other strong baselines. Additional experiments in low-resource settings and ablation studies further demonstrate the effectiveness of this framework.
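To make the contrastive idea concrete, the sketch below shows a generic InfoNCE-style contrastive loss for one anchor embedding: its augmented view (the positive) is pulled closer while other training instances (negatives) are pushed away. This is a standard formulation assumed for illustration, not CoDA's exact objective (which additionally uses a momentum encoder and memory bank); all names here are ours.

```python
import numpy as np

def normalize(v):
    """Project an embedding onto the unit sphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: cross-entropy of identifying the
    positive among {positive} ∪ negatives by cosine similarity.
    All inputs are unit-normalized embedding vectors."""
    sims = np.array([anchor @ positive] + [anchor @ n for n in negatives])
    logits = sims / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))              # positive sits at index 0

anchor   = normalize([1.0, 0.1])
positive = normalize([0.9, 0.2])    # augmented view of the anchor
negative = normalize([-1.0, 0.3])   # unrelated training instance
loss_good = info_nce_loss(anchor, positive, [negative])
loss_bad  = info_nce_loss(anchor, negative, [positive])  # mismatched pair
assert loss_good < loss_bad
```

The loss is small exactly when the augmented sample's representation is closer to its source than to other instances, which is the global relationship the consistency loss alone does not enforce.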

2. METHOD

In this section, we focus our discussion on natural language understanding (NLU) tasks, and particularly on the text classification scenario. However, the proposed data augmentation framework can be readily extended to other NLP tasks as well.

2.1 BACKGROUND: DATA AUGMENTATION AND ADVERSARIAL TRAINING

Data Augmentation Let $\mathcal{D} = \{x_i, y_i\}_{i=1 \dots N}$ denote the training dataset, where the input example $x_i$ is a sequence of tokens and $y_i$ is the corresponding label. To improve the model's robustness and generalization ability, several data augmentation techniques have been proposed, e.g., back-translation (Sennrich et al., 2016; Edunov et al., 2018; Xie et al., 2019), mixup (Guo et al., 2019), and c-BERT (Wu et al., 2019). Concretely, label-preserving transformations are performed (on the original training sequences) to synthesize a collection of augmented samples, denoted by $\mathcal{D}' = \{x'_i, y'_i\}_{i=1 \dots N}$. Thus, a model can learn from both the training set $\mathcal{D}$ and the augmented set $\mathcal{D}'$, with $p_\theta(\cdot)$ the predicted output distribution of the model parameterized by $\theta$:

$$\theta^* = \arg\min_{\theta} \sum_{(x_i, y_i) \in \mathcal{D}} \mathcal{L}\big(p_\theta(x_i), y_i\big) + \sum_{(x'_i, y'_i) \in \mathcal{D}'} \mathcal{L}\big(p_\theta(x'_i), y'_i\big) \quad (1)$$
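The objective in Eq. (1) simply sums a supervised loss over the original and augmented sets. A minimal sketch, with the model's forward pass abstracted into precomputed probability vectors and cross-entropy as the loss L (the function names are ours):

```python
import numpy as np

def cross_entropy(probs, label, eps=1e-12):
    """Per-example loss L(p_theta(x), y): negative log-likelihood
    of the true label under the predicted distribution."""
    return float(-np.log(probs[label] + eps))

def augmented_objective(original_batch, augmented_batch):
    """Sum of losses over the original set D and the augmented set D',
    mirroring Eq. (1). Each batch holds (predicted_probs, label) pairs."""
    loss = sum(cross_entropy(p, y) for p, y in original_batch)
    loss += sum(cross_entropy(p, y) for p, y in augmented_batch)
    return loss

# Toy 3-class predictions for one original example and its augmentation.
original  = [([0.8, 0.1, 0.1], 0)]
augmented = [([0.6, 0.3, 0.1], 0)]
total = augmented_objective(original, augmented)
assert total > cross_entropy([0.8, 0.1, 0.1], 0)  # augmented term adds loss
```

In effect, the augmented samples are treated as extra labeled training data; the consistency and contrastive objectives discussed elsewhere in the paper go beyond this by relating each augmented sample back to its source.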

