CONTRASTIVE LEARNING WITH ADVERSARIAL PERTURBATIONS FOR CONDITIONAL TEXT GENERATION

Abstract

Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing, with the ground-truth label given at each time step, and are never exposed to incorrectly generated tokens during training. This hurts their generalization to unseen inputs and is known as the "exposure bias" problem. In this work, we propose to mitigate this problem in conditional text generation by contrasting positive pairs with negative pairs, such that the model is exposed to various valid or incorrect perturbations of the inputs, for improved generalization. However, training the model with a naïve contrastive learning framework that uses random non-target sequences as negative examples is suboptimal, since they are easily distinguishable from the correct output, especially for models pretrained on large text corpora. Also, generating positive examples requires domain-specific augmentation heuristics, which may not generalize across diverse domains. To tackle these problems, we propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models. Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples by adding large perturbations while enforcing a high conditional likelihood. Such "hard" positive and negative pairs generated using our method guide the model to better distinguish correct outputs from incorrect ones. We empirically show that our proposed method significantly improves the generalization of seq2seq models on three text generation tasks: machine translation, text summarization, and question generation.

1. INTRODUCTION

Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014), which learn to map an arbitrary-length input sequence to another arbitrary-length output sequence, have successfully tackled a wide range of language generation tasks. Early seq2seq models used recurrent neural networks to encode and decode sequences, leveraging an attention mechanism (Bahdanau et al., 2015) that allows the decoder to attend to a specific token in the input sequence and thereby capture long-term dependencies between the source and target sequences. Recently, the Transformer (Vaswani et al., 2017), an all-attention model that effectively captures long-term relationships between tokens in the input sequence as well as across input and output sequences, has become the de facto standard for most text generation tasks due to its impressive performance. Moreover, Transformer-based language models trained on large text corpora (Dong et al., 2019; Raffel et al., 2020; Lewis et al., 2020) have been shown to significantly improve performance on text generation tasks. However, a crucial limitation of seq2seq models is that they are mostly trained only with teacher forcing, where the ground truth is provided at each time step, and are thus never exposed to incorrectly generated tokens during training (Fig. 1-(a)), which hurts their generalization. This is known as the "exposure bias" problem (Ranzato et al., 2016) and often results in the generation of low-quality text on unseen inputs. Several prior works tackle the problem, for example by using reinforcement learning (RL) to maximize a non-differentiable reward (Bahdanau et al., 2017; Paulus et al., 2018).
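To make the exposure bias problem concrete, the following minimal sketch contrasts teacher forcing with free-running decoding. The "model" here is a hypothetical bigram lookup table (`NEXT`) invented purely for illustration, not a real seq2seq network; the sentence is borrowed from Fig. 1.

```python
# Toy illustration of teacher forcing vs. free-running decoding.
# NEXT is a hypothetical next-token table standing in for a trained decoder.

BOS, EOS = "<bos>", "<eos>"

NEXT = {
    "<bos>": "He",
    "He": "wasn't",
    "wasn't": "in",
    "in": "good",      # the model's single mistake: the reference says "great"
    "great": "shape",  # only ever seen with the ground-truth input "great"
    "shape": "<eos>",
}


def predict(prev):
    """Return the next token; unseen inputs fall back to EOS."""
    return NEXT.get(prev, EOS)


def teacher_forced(target):
    """Predict each token from the ground-truth previous token."""
    inputs = [BOS] + target[:-1]
    return [predict(tok) for tok in inputs]


def free_running(max_len=10):
    """Predict each token from the model's own previous prediction."""
    out, prev = [], BOS
    for _ in range(max_len):
        prev = predict(prev)
        out.append(prev)
        if prev == EOS:
            break
    return out


target = ["He", "wasn't", "in", "great", "shape", EOS]
tf_out = teacher_forced(target)  # one wrong token, then back on track
fr_out = free_running()          # the wrong token is fed back and derails decoding
```

Under teacher forcing the mistake stays local, because the next input is always the ground-truth token. In free-running decoding the model conditions on its own wrong token, an input it never encountered during training, and the rest of the output degrades.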

Another approach is to use RL or Gumbel-softmax (Jang et al., 2017) to match the distribution of generated sentences to that of the ground truth, in which case the reward is the discriminator output from a Generative Adversarial Network (GAN) (Zhang et al., 2018; 2017; Yu et al., 2017). Although the aforementioned approaches improve the performance of seq2seq models on text generation tasks, they require a vast amount of effort to tune hyperparameters or to stabilize training. In this work, we propose to mitigate the exposure bias problem with a simple yet effective approach, in which we contrast a positive pair of input and output sequences with negative pairs, to expose the model to various valid or incorrect sentences. Naïvely, we can construct negative pairs by simply using random non-target sequences from the batch (Chen et al., 2020). However, such a naïve construction yields meaningless negative examples that are already well-discriminated in the embedding space (Fig. 1-(b)), which we highlight as the reason why existing methods (Chen et al., 2020) require a large batch size. This is clearly shown in Fig. 2, where a large portion of positive-negative pairs can be easily discriminated without any training. The problem worsens as the batch size decreases, since a smaller batch is less likely to contain meaningfully difficult examples. Moreover, discriminating positive pairs from naïve negative pairs becomes even easier for models pretrained on large text corpora.
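The weakness of naïve in-batch negatives can be illustrated with a small sketch of an InfoNCE-style contrastive loss. The specific choices below (cosine similarity, the temperature `tau`, and the two-dimensional toy embeddings) are assumptions made for illustration; this section does not specify the exact form of the loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def contrastive_loss(source, targets, pos_index, tau=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive's similarity."""
    logits = [cosine(source, t) / tau for t in targets]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[pos_index]


# Hypothetical embeddings: one source, its positive target, and two kinds of negatives.
source = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negatives = [[-1.0, 0.2], [0.0, -1.0]]  # random non-target sequences, far away
hard_negatives = [[0.8, 0.3], [0.95, -0.2]]  # close to the positive in embedding space

easy_loss = contrastive_loss(source, [positive] + easy_negatives, 0)
hard_loss = contrastive_loss(source, [positive] + hard_negatives, 0)
```

With the easy negatives the loss is already close to zero, so the model receives almost no learning signal; the hard negatives produce a substantially larger loss.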


To resolve this issue, we propose principled approaches to automatically generate negative and positive pairs for contrastive learning, which we refer to as Contrastive Learning with Adversarial Perturbation for Seq2seq learning (CLAPS). Specifically, we generate a negative example by adding a small perturbation to the hidden representation of the target sequence, such that its conditional likelihood is minimized (denoted by the red circle in Fig. 1-(c)). Conversely, we construct an additional positive example (denoted by the green circle in Fig. 1-(c)) by adding a large perturbation to the hidden representation of the target sequence such that the perturbed sample is far away from the source sequence in the embedding space, while enforcing a high conditional likelihood by minimizing the Kullback-Leibler (KL) divergence between the original conditional distribution and the perturbed conditional distribution. This yields a negative example that is very close to the original representation of the target sequence in the embedding space but largely dissimilar in semantics, and a positive example that is far away from the original input sequence but has the same semantics as the target sequence. These are difficult examples that the model fails to discriminate correctly (Fig. 1-(c), Fig. 2), which helps it learn from more meaningful pairs. To verify the efficacy of our method, we empirically show that it significantly improves the performance of seq2seq models on three conditional text generation tasks, namely machine translation, text summarization, and question generation. Our contribution in this work is threefold:

• To mitigate the exposure bias problem, we propose a contrastive learning framework for conditional sequence generation, which contrasts a positive pair of source and target sentences with negative pairs in the latent embedding space, to expose the model to various valid or incorrect outputs.
• To tackle the ineffectiveness of the conventional approach to constructing negative and positive examples for contrastive learning, we propose a principled method to automatically generate negative and positive pairs that are more difficult and allow the model to learn more meaningful representations.
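The negative-example construction described above (a small perturbation of the hidden representation that lowers the conditional likelihood) can be sketched on a toy model. Everything specific here is an assumption for illustration only: the linear-softmax output layer `W`, the single hidden vector `h`, the step size `eps`, and the use of a normalized gradient step. The sketch covers only the negative example, not the KL-constrained positive example.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, h):
    """Output logits of a toy linear layer: one row of W per output token."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def log_likelihood(W, h, y):
    """log p(y | h) under the toy linear-softmax model."""
    return math.log(softmax(logits(W, h))[y])


def grad_log_likelihood(W, h, y):
    """Gradient of log p(y | h) with respect to h: W^T (onehot(y) - p)."""
    p = softmax(logits(W, h))
    coeff = [(1.0 if k == y else 0.0) - p[k] for k in range(len(W))]
    return [sum(coeff[k] * W[k][j] for k in range(len(W))) for j in range(len(h))]


def adversarial_negative(W, h, y, eps=0.5):
    """Move h a distance eps in the direction that decreases log p(y | h)."""
    g = grad_log_likelihood(W, h, y)
    norm = math.sqrt(sum(v * v for v in g)) or 1.0  # guard against a zero gradient
    return [x - eps * v / norm for x, v in zip(h, g)]


# Hypothetical 3-token vocabulary and 2-dimensional hidden state.
W = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
h = [1.0, 0.2]
y = 0  # index of the ground-truth token

h_neg = adversarial_negative(W, h, y)
before = log_likelihood(W, h, y)
after = log_likelihood(W, h_neg, y)
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(h, h_neg)))
```

The perturbed representation stays within distance `eps` of the original, yet the likelihood of the ground-truth token drops. This is the sense in which such a negative is "hard": it is close in the embedding space but no longer supports the correct output.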



Figure 1: Concept. (a) Training seq2seq with teacher forcing. (b) Naïve contrastive learning with randomly sampled negative examples. (c) Our method, CLAPS, which generates hard negative and positive examples.

Figure 2: Accuracy of distinguishing a positive pair from negative pairs at varying batch sizes, without any training.

