A SIMPLE CONTRASTIVE LEARNING OBJECTIVE FOR ALLEVIATING NEURAL TEXT DEGENERATION

Abstract

The cross-entropy objective has proved to be an all-purpose training objective for autoregressive language models (LMs). However, because it does not single out problematic tokens, LMs trained with cross-entropy exhibit text degeneration. To address this, unlikelihood training has been proposed to reduce the probability of unlikely tokens predicted by LMs. However, unlikelihood does not explicitly consider the relationship between label tokens and unlikely token candidates, and thus yields only marginal improvements in degeneration. We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training while avoiding their limitations. The key idea is to teach an LM to assign high probabilities to label tokens and low probabilities to negative candidates. Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields far less repetitive text and higher generation quality than baseline approaches, achieving new state-of-the-art performance on text degeneration.

1. INTRODUCTION

Autoregressive language models (LMs), such as OpenAI GPT-3 (Brown et al., 2020), have achieved impressive results on various natural language processing tasks. The goal of training LMs is to learn the true distribution of a text corpus, usually through next word prediction. Specifically, a standard approach to training LMs is to minimize the cross-entropy loss between the true distribution and the model prediction. Unfortunately, LMs trained with the cross-entropy objective have been observed to exhibit text degeneration, of which token-, phrase-, and sentence-level repetition is a common symptom (Holtzman et al., 2020; Welleck et al., 2020; Jiang et al., 2020). Such repeated texts differ markedly from those generated by humans.¹

To analyze the reasons for degeneration, our work views the vocabulary of an LM as being composed of three sets of tokens at each time step, i.e., positive tokens (label tokens), negative tokens (incorrectly repeating tokens), and irrelevant tokens (all the others). Based on this taxonomy, we stress that cross-entropy is in fact a contrastive learning objective that contrasts positive tokens with all negative and irrelevant tokens. While it is necessary for LMs to learn to rank positive tokens higher than other tokens in the predicted distribution, the cross-entropy objective treats negative tokens the same as irrelevant tokens (which are usually far more numerous). As a consequence, negative tokens may not be suppressed hard enough. To address this issue, Welleck et al. (2020) proposed unlikelihood training to penalize certain negative tokens, i.e., tokens that are incorrectly repeated. The key idea behind unlikelihood training is to lower the probability that LMs assign to negative tokens.
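The contrastive reading of cross-entropy above can be made concrete with a small numerical sketch (pure Python for illustration; a real implementation would use a deep learning framework). The loss for a prediction step contrasts the label token's logit against the log-sum-exp over the whole vocabulary, so negative and irrelevant tokens enter the loss in exactly the same way:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

# Toy logits over a 6-token vocabulary; index 0 is the label (positive) token.
# The other five indices could be negative or irrelevant tokens -- cross-entropy
# does not distinguish between them.
logits = [2.0, 1.5, 0.3, -0.1, 0.2, 0.0]

# Cross-entropy for the label: -log p(label) = -z_label + logsumexp(z).
ce_loss = -log_softmax(logits)[0]
```

Because every non-label token contributes through the same log-sum-exp term, a repeating (negative) token receives no extra suppression beyond what any irrelevant token receives.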
Despite its success, the unlikelihood objective does not explicitly consider the relationship between positive and negative tokens, which gives it an indefinite effect on suppressing negative tokens. Unlikelihood training also unintentionally boosts the probability of irrelevant tokens. Moreover, all previous context tokens are used as negative candidates at every prediction step, which not only introduces a considerable amount of noise but also results in sub-optimal repetition reduction, affecting the final generation performance. In this paper, we introduce a simple yet effective contrastive token learning (CT for short) objective that integrates the best of cross-entropy and unlikelihood training, penalizing negative tokens by contrasting them directly with positive tokens. The commonalities and differences between cross-entropy, unlikelihood training, and CT are illustrated in Figure 1. Briefly, (i) without distinguishing between negative and irrelevant tokens, cross-entropy cannot effectively suppress negative tokens; (ii) lacking a contrast between negative and positive tokens, unlikelihood training penalizes negative tokens ineffectively; and (iii) through its direct contrast between positive and negative tokens, CT is more focused on learning the differences between them, i.e., it explicitly teaches the LM to assign negative tokens a lower probability than positive tokens. In this work, we combine the CT and cross-entropy objectives to train LMs: cross-entropy operates on the label tokens so that they are assigned the highest probability, while CT suppresses negative tokens from being generated. We perform evaluations on the tasks of language modeling (decoder-only model) and open-domain dialogue generation (encoder-decoder model).
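The intended contrast can be illustrated with a small sketch (the paper's exact formulation may differ in its details; the helper below is a hypothetical pairwise variant written in pure Python). Each negative candidate, e.g. one of the preceding M context tokens, is contrasted pairwise with the label token, and irrelevant tokens never appear in the loss, so their probabilities are not unintentionally boosted:

```python
import math

def ct_loss(logits, label_id, negative_ids):
    """Hypothetical sketch of a contrastive-token-style loss for one step.

    logits: list of vocabulary logits for this step.
    label_id: index of the positive (label) token.
    negative_ids: indices of negative candidates, e.g. the preceding M
        context tokens (excluding any that equal the label).

    Each negative is contrasted pairwise with the label via
    -log sigmoid(z_pos - z_neg) = log(1 + exp(z_neg - z_pos)),
    which pushes z_pos above z_neg while leaving irrelevant tokens alone.
    """
    z_pos = logits[label_id]
    pair_losses = [math.log1p(math.exp(logits[j] - z_pos)) for j in negative_ids]
    return sum(pair_losses) / len(pair_losses)
```

Note how raising the label logit drives every pairwise term toward zero; a training loss of this shape would therefore be combined with cross-entropy, which is what fixes the absolute ranking of the label token.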
Our empirical evidence demonstrates that LMs trained with the proposed CT objective generate much less repetitive text and achieve superior generation performance under both automatic and human evaluations. CT has a minor negative influence on the perplexity of LMs, but thanks to the reduced repetition rates, our case studies show substantial improvements in the quality of generated text.

2. BACKGROUND

LMs aim to learn the true distribution over variable-length text sequences in a text corpus X = (x_1, x_2, ..., x_{|X|}) with |X| tokens. A popular approach to this task is next word prediction, i.e., predicting a distribution over the next word given a context. Cross-entropy and unlikelihood training are two representative objectives for training such a language model, which we first review in this section. We then provide an analysis of the text degeneration problem.
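For reference, the per-step unlikelihood term of Welleck et al. (2020) penalizes a set of negative candidates C (typically previous context tokens) via -∑_{c∈C} log(1 − p(c)), where p is the model's softmax distribution. A minimal pure-Python sketch (function names are ours, for illustration only):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unlikelihood_loss(logits, negative_ids):
    """Per-step unlikelihood term of Welleck et al. (2020):
    -sum over negative candidates c of log(1 - p(c)).

    The clamp guards against log(0) when p(c) approaches 1.
    """
    p = softmax(logits)
    return -sum(math.log(max(1.0 - p[c], 1e-12)) for c in negative_ids)
```

Because the loss depends on p(c) rather than on a comparison with the label logit, lowering a negative token's probability redistributes mass over all remaining tokens, including irrelevant ones.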



Readers are referred to Table 4 for some concrete examples. The degeneration problem exists even in large-scale state-of-the-art pre-trained language models.



Figure 1: Illustrating the differences between our proposed contrastive token learning, unlikelihood training, and the cross-entropy objective for LMs. For contrastive token learning, we use the label token as the positive token and the preceding M tokens as the negative tokens at each decoding step.

Comparison of the influence of different learning objectives on the positive (label), negative (incorrectly repeating), and irrelevant (all other) tokens of an LM.

