A SIMPLE CONTRASTIVE LEARNING OBJECTIVE FOR ALLEVIATING NEURAL TEXT DEGENERATION

Abstract

The cross-entropy objective has proved to be an all-purpose training objective for autoregressive language models (LMs). However, without distinguishing problematic tokens, LMs trained using cross-entropy exhibit text degeneration problems. To address this, unlikelihood training has been proposed to reduce the probability of unlikely tokens predicted by LMs. But unlikelihood does not explicitly consider the relationship between the label tokens and unlikely token candidates, thus showing marginal improvements in degeneration. We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training and avoids their limitations. The key idea is to teach a LM to generate high probabilities for label tokens and low probabilities for negative candidates. Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields much less repetitive texts, with a higher generation quality than baseline approaches, achieving the new state-of-the-art performance on text degeneration.

1. INTRODUCTION

Autoregressive language models (LMs), such as OpenAI GPT-3 (Brown et al., 2020), have achieved impressive results on various natural language processing tasks. The goal of training LMs is to learn the true distribution of a text corpus, and this is usually achieved through next word prediction. Specifically, a standard approach to training LMs is to minimize the cross-entropy loss between the true distribution and the model prediction. Unfortunately, LMs trained using the cross-entropy objective have been observed to exhibit text degeneration problems, where token, phrase, and sentence level repetition is a common symptom (Holtzman et al., 2020; Welleck et al., 2020; Jiang et al., 2020). Such repeated texts differ markedly from those generated by humans. To analyze the reasons for degeneration, our work views the vocabulary of LMs as being composed of three sets of tokens at each time step, i.e., positive tokens (label tokens), negative tokens (incorrectly repeating tokens), and irrelevant tokens (all the others). Based on this taxonomy, we stress that cross-entropy is in fact a contrastive learning objective that contrasts positive tokens with all negative and irrelevant tokens. While it is necessary for LMs to learn how to rank positive tokens higher than other tokens in the predicted distribution, the cross-entropy objective treats negative tokens the same as irrelevant tokens (which are usually far more numerous). As a consequence, negative tokens may not be suppressed strongly enough. To address the above issue, Welleck et al. (2020) have proposed unlikelihood training to penalize certain negative tokens, i.e., tokens that are incorrectly repeated. The key idea behind unlikelihood training is to lower the probability of negative tokens assigned by LMs. 
Despite its success, the unlikelihood objective does not explicitly consider the relationship between positive and negative tokens, which makes its effect on suppressing negative tokens indeterminate. Unlikelihood training also unintentionally boosts the probability of other irrelevant tokens. Moreover, all previous context tokens are used as negative candidates per prediction step, which not only introduces a considerable amount of noise, but also results in sub-optimal repetition reduction, thus affecting the final generation performance.

Figure 1: Illustrating the differences between our proposed contrastive token learning, unlikelihood training, and the cross-entropy objective for LMs. For contrastive token learning, we use the label token as the positive token and the preceding M tokens as the negative tokens at each decoding step.

In this paper, we introduce a simple yet effective contrastive token learning (CT for short) objective that integrates the best of cross-entropy and unlikelihood training, penalizing negative tokens by contrasting them directly with positive tokens. The commonalities and differences between cross-entropy, unlikelihood training, and CT are illustrated in Figure 1. Briefly, (i) without distinguishing between negative and irrelevant tokens, cross-entropy cannot effectively suppress negative tokens; (ii) due to the lack of contrast between negative and positive tokens, unlikelihood training is ineffective at penalizing negative tokens; and (iii) through its direct contrast between positive and negative tokens, CT is more focused on learning the differences between them, i.e., explicitly teaching the LM to assign negative tokens a lower probability than positive tokens. In this work, we combine the CT and cross-entropy objectives to train LMs, where cross-entropy operates on the label tokens so that they are assigned the highest probability, and CT effectively suppresses negative tokens from being generated. 
We perform evaluations on the tasks of language modeling (decoder-only model) and open-domain dialogue generation (encoder-decoder model). Our empirical evidence demonstrates that LMs trained with the proposed CT objective can generate much less repetitive texts and achieve superior text generation performance under both automatic and human evaluations. CT has a minor negative influence on the perplexity of LMs, but thanks to the reduced repetition rates, in our case studies we observe substantial improvements regarding the quality of generated text.

2. BACKGROUND

LMs aim to learn the true distribution over variable-length text sequences in a text corpus $X = (x_1, x_2, \ldots, x_{|X|})$ with $|X|$ tokens. A popular approach to this task is next word prediction, i.e., predicting a distribution over the next word following a given context. To train such a language model, cross-entropy and unlikelihood training are two representative objectives, which we will first review in this section. We then provide an analysis of the text degeneration problem.

2.1. CROSS ENTROPY

A standard approach to training a LM is to minimize the expected cross-entropy loss between the true distribution and the model prediction (Yang et al., 2019a). Specifically, the cross-entropy loss for each time step $t$ is defined as:

\begin{align}
\mathcal{L}^t_{CE} &= -\log p(x_t \mid x_{<t}) \tag{1} \\
&= -\log \frac{\exp(h_t^T W_{x_t})}{\sum_{\hat{x}_t \in V} \exp(h_t^T W_{\hat{x}_t})} \tag{2} \\
&= \log\Bigl(1 + \sum_{\hat{x}_t \in V,\, \hat{x}_t \neq x_t} \exp(h_t^T W_{\hat{x}_t} - h_t^T W_{x_t})\Bigr), \tag{3}
\end{align}

where $h_t$ is the model hidden state at time $t$, $W$ is the embedding matrix, and $W_{x_t}$ denotes the embedding of token $x_t$. Through the transformations from Eq. (1) to Eq. (3), we can see that Eq. (3) is similar to the N-pair contrastive loss (Sohn, 2016).
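The equivalence between the softmax form in Eq. (2) and the contrastive form in Eq. (3) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the toy logits (standing in for the values $h_t^T W$ over a four-token vocabulary) are illustrative assumptions, not part of the paper's code.

```python
import math

def cross_entropy_softmax(logits, label):
    """Eq. (2): negative log of the softmax probability of the label token."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_normalizer = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_normalizer - logits[label]

def cross_entropy_contrastive(logits, label):
    """Eq. (3): log(1 + sum over non-label tokens of exp(z_other - z_label))."""
    diffs = [z - logits[label] for i, z in enumerate(logits) if i != label]
    return math.log1p(sum(math.exp(d) for d in diffs))

# Toy logits for a 4-token vocabulary (assumed values, for illustration only).
logits = [2.0, -1.0, 0.5, 0.0]
ce_softmax = cross_entropy_softmax(logits, label=0)
ce_contrastive = cross_entropy_contrastive(logits, label=0)
print(ce_softmax, ce_contrastive)  # the two forms agree up to rounding
```

In the contrastive form, every non-label token contributes one term to the sum, which is what makes negative and irrelevant tokens indistinguishable to cross-entropy.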

2.2. UNLIKELIHOOD TRAINING

To address the repetition issue of cross-entropy, Welleck et al. (2020) proposed unlikelihood training to penalize negative tokens (UL-T). The unlikelihood loss for time step $t$ is defined as:

$$\mathcal{L}^t_{UL} = -\sum_{x_t^- \in C^t} \log\bigl(1 - p(x_t^- \mid x_{<t})\bigr), \tag{4}$$

where $C^t = \{x_1, \ldots, x_{t-1}\} \setminus \{x_t\}$ is the set of negative tokens at time $t$, i.e., all previous context tokens. In this paper, we refer to this set of negative tokens as the preceding tokens set. As we will see in §2.3, UL-T does not work well as it can increase the probability of irrelevant tokens. Welleck et al. (2020) have also proposed a more effective sequence-level unlikelihood objective (UL-S) that uses unlikelihood on generated continuations during training time. We omit the details here as our proposed CT is more closely related to UL-T, but we compare CT to UL-S in our experiments.
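As an illustration of the token-level unlikelihood loss defined above, here is a minimal standard-library sketch. It is not the official implementation of Welleck et al. (2020); the function names, the integer token ids, and the toy logits are assumptions made for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unlikelihood_loss(logits, context, label):
    """Token-level unlikelihood: -sum over x in C^t of log(1 - p(x | x_<t)).

    C^t is the set of distinct context tokens, excluding the label token.
    """
    probs = softmax(logits)
    negatives = set(context) - {label}
    return -sum(math.log(1.0 - probs[x]) for x in negatives)

# Toy example: vocabulary of 5 token ids, context tokens [3, 1, 3], label 1.
logits = [1.0, 0.2, -0.3, 0.5, 0.0]
ul = unlikelihood_loss(logits, context=[3, 1, 3], label=1)
print(ul)  # only token 3 is penalized: the label is excluded, duplicates collapse
```

Note that the loss only involves the probabilities of the negative tokens; the label token's logit enters solely through the softmax normalizer.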

2.3. DISCUSSION

The main difference between Eq. (3) and the N-pair contrastive loss is that, in Eq. (3), negative and irrelevant tokens are treated equally by cross-entropy. These negative tokens need to be penalized harder than irrelevant tokens; otherwise, negative tokens may be incorrectly repeated in later time steps. We believe this to be the reason why LMs trained by cross-entropy have high repetition rates. Although UL-T penalizes negative tokens, it is not effective enough. As can be seen from Table 1, the reasons are twofold. First, a negative token is not guaranteed to be penalized, because its gradient depends on the other negative tokens, which can be seen from the gradient analysis of UL-T (Eq. (11) in Appendix C). Second, the formulation of UL-T unintentionally boosts the probability of other irrelevant tokens and may make them surface as repeated tokens. We detail this analysis in §3.3.

3. METHOD

To address the issues discussed above and inherit the advantages of cross-entropy and unlikelihood training, in this section, we present a novel contrastive token learning (CT) objective. We first define the CT loss for each time step. Then we introduce a negative token selection strategy. Finally, we discuss the relationships among CT, cross-entropy and unlikelihood training.

3.1. CONTRASTIVE TOKEN LEARNING

The key idea of CT is to promote positive (label) tokens in the ranking at each step, while lowering negative (incorrectly repeating) tokens, and leave other irrelevant tokens unchanged. To this end, we formulate the CT loss for step $t$ as:

$$\mathcal{L}^t_{CT} = \log\Bigl(1 + \sum_{x_t^- \in S_N^t} \exp(h_t^T W_{x_t^-} - h_t^T W_{x_t})\Bigr), \tag{5}$$

where $S_N^t$ is the negative token set and $x_t$ is the positive token (i.e., label token) at time $t$. We detail the token selection mechanism of $S_N^t$ below. Comparing Eq. (5) to Eq. (4), we see that UL only considers the probabilities of negative tokens, while CT directly contrasts negative with positive tokens. During training, we combine the CT loss with the cross-entropy loss for each time step:

$$\mathcal{L}^t = \mathcal{L}^t_{CE} + \mathcal{L}^t_{CT}. \tag{6}$$

$\mathcal{L}^t_{CE}$ trains LMs to assign the highest probabilities to label tokens, while $\mathcal{L}^t_{CT}$ focuses on contrasting positive tokens and negative tokens, so that the LMs can learn to effectively rank negative tokens lower than their positive counterparts.
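The CT loss and its combination with cross-entropy can be sketched for a single time step as follows. This is an illustrative standard-library version operating on a plain list of logits, not the released pip package; the function names and toy values are assumptions.

```python
import math

def ct_loss(logits, negatives, label):
    """Contrastive token loss: log(1 + sum over x in S_N^t of exp(z_x - z_label)).

    `negatives` is a list (multiset) of token ids, so a token that appears
    twice contributes two terms to the sum.
    """
    return math.log1p(sum(math.exp(logits[x] - logits[label]) for x in negatives))

def ce_loss(logits, label):
    """Cross-entropy: negative log-softmax of the label token."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[label]

def combined_loss(logits, negatives, label):
    """Per-step training loss: cross-entropy plus contrastive token loss."""
    return ce_loss(logits, label) + ct_loss(logits, negatives, label)

# Toy example: token 0 is the label, tokens 1 and 2 are negatives, 3 is irrelevant.
logits = [1.5, 1.0, 0.2, -0.4]
loss_before = combined_loss(logits, negatives=[1, 2], label=0)
# Lowering a negative token's logit reduces the loss.
loss_after = combined_loss([1.5, 0.0, 0.2, -0.4], negatives=[1, 2], label=0)
print(loss_before, loss_after)
```

Only the logits of the label and of the tokens in `negatives` enter `ct_loss`; the logit of token 3 has no influence on it.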

3.2. NEGATIVE TOKEN SELECTION STRATEGY

Following Welleck et al. (2020), we also select negative tokens from the preceding tokens. However, using all preceding tokens (as done by Welleck et al. (2020)) introduces too many irrelevant tokens, especially in later time steps of a sequence. Hence, we instead propose to use the preceding M tokens set to decide the negative tokens, with $M$ being a hyper-parameter. The set $S_N^t$ is defined as:

$$S_N^t = \{x_{t-M}, \ldots, x_{t-1}\} \setminus \{x_t\}. \tag{7}$$

Another difference from the preceding tokens set (Welleck et al., 2020) is that $S_N^t$ is a multiset that does not remove redundant occurrences. Intuitively, minimizing the CT loss with the preceding M tokens set makes more frequently repeated tokens less likely to be predicted.
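A small sketch of the preceding M tokens selection is given below. It assumes 0-based positions, integer token ids, and that every occurrence of the label token is dropped from the window; the function name is hypothetical.

```python
def preceding_m_negatives(tokens, t, m):
    """Return the negative multiset for position t as a list.

    The window covers the m tokens before position t (fewer near the start
    of the sequence). Occurrences of the label token tokens[t] are removed;
    all other duplicates are kept, giving multiset semantics.
    """
    label = tokens[t]
    window = tokens[max(0, t - m):t]
    return [x for x in window if x != label]

tokens = [5, 7, 5, 9, 7, 5]
print(preceding_m_negatives(tokens, t=4, m=3))  # [5, 9]
print(preceding_m_negatives(tokens, t=5, m=4))  # [7, 9, 7]
```

Because duplicates are kept, a token repeated several times inside the window contributes several terms to the CT loss and is therefore pushed down more strongly.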

3.3. GRADIENT ANALYSIS

To see how loss functions influence the positive, negative and irrelevant tokens during training, we derive the gradient functions of each loss function with respect to these tokens in Appendix C. Table 1 is an intuitive summary of the influences, from which one can observe that: (i) Cross-entropy promotes label tokens in rankings at each time step, while suppressing all the other tokens, including negative and irrelevant tokens. (ii) For unlikelihood training, the sign of the gradient with respect to negative tokens is not determined, so they may be either promoted or suppressed (cf. Eq. (11) in Appendix C: the range of the corresponding gradient function contains both positive and negative values), and irrelevant tokens are promoted; both effects are problematic. (iii) Contrastive token learning promotes positive tokens and suppresses negative tokens, and it is the only objective that does not affect irrelevant tokens (cf. the gradient functions in Appendix C). When using CT together with CE, as we do for our final loss function, negatives are suppressed both in CT and in CE, while irrelevant tokens are only suppressed in CE. Therefore, our CT objective is able to better restrain incorrectly repeated tokens.
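The sign pattern summarized above can be reproduced with a finite-difference check on toy logits. This is a minimal sketch with a single negative token, using only the standard library; the loss implementations, names, and values are illustrative assumptions. A positive gradient means gradient descent lowers that logit (the token is suppressed), a negative gradient means it is promoted.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce(logits, label, negatives):
    return -math.log(softmax(logits)[label])

def ul(logits, label, negatives):
    probs = softmax(logits)
    return -sum(math.log(1.0 - probs[x]) for x in set(negatives))

def ct(logits, label, negatives):
    return math.log1p(sum(math.exp(logits[x] - logits[label]) for x in negatives))

def numeric_grad(loss, logits, label, negatives, eps=1e-6):
    """Central finite-difference gradient of `loss` w.r.t. each logit."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, label, negatives) - loss(down, label, negatives)) / (2 * eps))
    return grads

# Token 0: label (positive), token 1: negative, tokens 2 and 3: irrelevant.
logits = [1.0, 0.5, 0.0, -0.5]
label, negatives = 0, [1]
g_ce = numeric_grad(ce, logits, label, negatives)
g_ul = numeric_grad(ul, logits, label, negatives)
g_ct = numeric_grad(ct, logits, label, negatives)
for name, g in (("CE", g_ce), ("UL", g_ul), ("CT", g_ct)):
    print(name, [round(v, 4) for v in g])
```

With one negative token the unlikelihood gradient on that token is positive; the indeterminate sign discussed in the text only arises when several negative tokens interact.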

4. RELATED WORK

We review two lines of related work, i.e., neural text degeneration and contrastive learning.

Neural text degeneration. With large-scale pre-training, state-of-the-art neural LMs are able to generate human-like texts (Brown et al., 2020; Yang et al., 2019a). However, they suffer from the text degeneration problem, where model-generated texts are dull and repetitive (Jiang & de Rijke, 2018; Holtzman et al., 2020; Welleck et al., 2020). The text degeneration problem is especially serious with open-ended generation tasks, such as dialogue generation (See et al., 2019; Jiang et al., 2020) and language modeling (Holtzman et al., 2020; Welleck et al., 2020). Some decoding approaches have been proposed to address this problem, by introducing randomness (Fan et al., 2018; Holtzman et al., 2020) or disparity (See et al., 2019; Su et al., 2022) at inference time. Other work suggests that the degeneration problem is caused by defects of the likelihood training objective, and improved training objectives have been proposed (Jiang et al., 2019; Welleck et al., 2020; Su et al., 2022). ScaleGrad (Lin et al., 2021) encourages the LMs to generate novel tokens, but the selection of such tokens can be too open. Our proposed contrastive token learning approach belongs to the training objective family. Compared to unlikelihood training (Welleck et al., 2020), we address the suppression of repetitive tokens by contrasting them with positive tokens.

Contrastive learning. In computer vision, contrastive learning has been widely employed to learn representations (Sohn, 2016; Chen et al., 2020; Khosla et al., 2020). Noise-contrastive estimation (Gutmann & Hyvärinen, 2010) has proved successful for training word embeddings (Mikolov et al., 2013). In recent years, contrastive learning has gained more attention in the area of natural language processing too. 
Most work builds contrast at the sequence or document level by corrupting the ground truth sequence (Yang et al., 2019b; Clark et al., 2020; Lee et al., 2021; Meng et al., 2021) or mining positive/negative samples (Nguyen & Luu, 2021; Pan et al., 2021) . Existing token-level contrastive learning frameworks contrast model representations from different positions (Zhang et al., 2021; Su et al., 2022) . Differently, we contrast word embeddings while using the hidden representations as anchor points similar to the triplet contrastive loss (Schroff et al., 2015) . Our formulation effectively contrasts logits output by the model for positive and negative tokens, thus it is more direct than unlikelihood training on addressing the repetitive degeneration problem. To the best of our knowledge, our proposed CT is the first to use token embeddings as positive/negative examples in a contrastive framework for the text degeneration problem.

5. EXPERIMENTAL SETUP

We compare CT with baseline approaches on the language modeling and open-domain dialogue generation tasks (the latter using an encoder-decoder model). Since our experimental results on the dialogue task show a similar pattern as on the language modeling task, we will focus on the language modeling task in the body of the paper and postpone the setup and analyses of the dialogue task to Appendix H.

Baselines and implementation. We implement several state-of-the-art baselines and use them with GPT-2 (Radford et al., 2019): (i) for decoding-based methods, we consider banning 3-grams (Roller et al., 2021), top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020) and contrastive search (SimCTG-CS) (Su et al., 2022); and (ii) for learning-based methods, we consider unlikelihood training (Welleck et al., 2020), SimCTG (Su et al., 2022), and noise-contrastive estimation (NCE; detailed in Appendix B) (Gutmann & Hyvärinen, 2010). We also consider a model trained using CE as a baseline. More details can be found in Appendix D.

Dataset, training and inference details. At training time, we fine-tune GPT-2 small on the widely used Wikitext-103 dataset (Merity et al., 2017) with each learning-based approach (including the CE baseline) for 50K steps with 3K warm-up steps. As suggested by Welleck et al. (2020), for sequence-level unlikelihood training, we first fine-tune the language model using UL-T for 48.5K steps, and then switch to the UL-S objective for another 1.5K steps, resulting in UL-TS. The best model checkpoints for each task are selected according to the lowest validation CE loss with an evaluation interval of 1K training steps. We use chunks of 512 tokens, and a training batch size of 4. All models are trained using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-5. 
For UL-TS, we had to use a smaller learning rate of 1e-6; otherwise the generated texts contain massive ungrammatical repetitions (continuous token repetitions, as can be seen in Table 5 of Appendix E). At inference time, we compare the performance of each approach using both greedy search and beam search. Following the best settings reported on this task (Welleck et al., 2020), we use k = 50 for top-k sampling, and p = 0.9 for nucleus sampling. We follow Welleck et al. (2020) in using 50 tokens as the input prefix and letting the model generate 100 tokens as a continuation.

Evaluation metrics. We measure the perplexity (ppl) of different approaches. For measuring generative repetition, we follow Welleck et al. (2020) in using 1-gram to 4-gram repetition rates (rep-1 to rep-4), which are defined as the number of repeated n-grams divided by the total number of generated n-grams in each sequence, micro-averaged over the whole dataset. We also report the generation diversity at the dataset level, which is measured by distinct 1-gram rates (dist-1) (Li et al., 2016) and unique 1-gram counts (uniq-1).

Table 2: Results on the test set of Wikitext-103 for the language modeling task. ↑/↓ arrows denote whether higher or lower is better for a metric. The best result for either type of approach (decoding-based vs. learning-based) under each metric is highlighted in bold face. ‡ Does not count as the best. † For this experiment, we use a beam size of 5 as suggested in its original paper (Su et al., 2022).

We adopt human evaluation for measuring the quality of model generated texts. We randomly select 100 prefixes from the test set of Wikitext-103, and compare the continuations generated using CT with those by the best-performing baselines according to the automatic evaluation results. 
Since it does not make much sense to compare continuations with either side having excessive repetitions, we filter out such pairs using a threshold of rep-4 ≤ 0.05 to make the comparisons more competitive. Then we display the prefix and two continuations from different systems (side-by-side, in a random order) to three crowd workers and ask them to select the winner in terms of repetition, coherence, fluency, and overall quality. Ties are allowed for all aspects. We use majority voting to decide the final winner. Details about our question form design and the instructions to crowd workers can be found in Appendix F.

Table 2 column headers: ppl↓, ppl-s↓, search, rep-1↓, rep-2↓, rep-3↓, rep-4↓, dist-1↑, uniq-1↑.
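The repetition and diversity metrics described in this section can be sketched as below. This is one plausible reading rather than the evaluation script used in the experiments: it assumes that "repeated" n-grams are those beyond the first occurrence of each distinct n-gram within a sequence, and that micro-averaging pools the counts over all sequences. All function names are hypothetical.

```python
def ngram_counts(tokens, n):
    """Return (repeated, total) n-gram counts for one generated sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(ngrams) - len(set(ngrams)), len(ngrams)

def rep_n(sequences, n):
    """Micro-averaged rep-n: pooled repeated n-grams over pooled total n-grams."""
    repeated = total = 0
    for tokens in sequences:
        r, t = ngram_counts(tokens, n)
        repeated += r
        total += t
    return repeated / total if total else 0.0

def uniq_1(sequences):
    """Dataset-level count of unique 1-grams."""
    return len({tok for seq in sequences for tok in seq})

def dist_1(sequences):
    """Dataset-level distinct 1-gram rate: unique tokens over all tokens."""
    n_tokens = sum(len(seq) for seq in sequences)
    return uniq_1(sequences) / n_tokens if n_tokens else 0.0

generations = [["a", "b", "a", "b"], ["c", "d", "e", "f"]]
print(rep_n(generations, 1))   # 0.25
print(rep_n(generations, 2))   # 1 repeated bigram out of 6
print(dist_1(generations), uniq_1(generations))  # 0.75 6
```

Sequences shorter than n simply contribute no n-grams, so they do not affect the pooled rate.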

6. EVALUATION RESULTS

In this section, we discuss how CT compares to SOTA methods under both the automatic and human evaluations, as well as showing some visualization analysis on its generation pattern.

6.1. BASELINE COMPARISON

The performance comparisons between our CT and the baselines on the language modeling task are shown in Table 2. To understand the model performance relative to humans, we also calculate these metrics on human-created text. The ppl metric is for 512-token sequences to comply with the training sequence length. To be comparable to existing work (Welleck et al., 2020; Su et al., 2022), we also report ppl-s for short sequences of 50 tokens. We use a sequence length of 150 tokens and M = 30 as the negative window size for CT. Justifications for such hyper-parameter selections can be found in Appendix E.2.

CT compared to learning-based approaches. One can observe that CT performs the best and its performance is very close to humans according to rep-* rates and unique token counts (uniq-1) when using greedy search. However, we still cannot conclude that the repetition problem is solved, because when looking at specific cases, models trained by CT still occasionally generate texts with excessive repetitions, although much more rarely than baseline methods. To see how each method performs at every repetition level, we group the rep-1 and rep-4 rates of model-generated texts into 5 bins, and plot their histograms in Figure 2, from which we can see that CT generates substantially fewer degenerated continuations (with rep-1 ≥ 0.4 and rep-4 ≥ 0.2). For UL-TS, we were able to achieve lower repetition rates with a larger learning rate of 1e-5 during training. However, the trained LM often generates ungrammatical repetitions. This problem does not exist with CT. The comparisons are shown in Table 5 in Appendix E, and in §6.3 we show that this is caused by UL-TS being uncertain about its predictions at later time steps. The diversity improvements brought by CT are the largest among all learning-based methods, especially when using greedy search. CT improves on the second highest uniq-1 count (NCE) by 46%. 
When compared to UL-T, one can see that utilizing the contrast between positive and negative tokens works better than solely penalizing negative tokens. Comparing SimCTG to the CE baseline, one can observe that the contrastive objective of SimCTG itself has very limited effect on reducing repetition, which is also mentioned in the original paper (Su et al., 2022) . This is because SimCTG contrasts hidden states of positive (current step) and negative (other steps) tokens, but it does not consider the influence of token embeddings on the repetition problem, as done in CT. The ppl increase brought by CT is minor, with 0.66 points. When calculated on short sequences, due to the length mismatch of training and test sequences, ppl-s scores are higher than ppl for all approaches. Among them, contrastive objectives (NCE and CT) have larger ppl-s increases than other methods. Although CT has the highest increase on ppl-s, our case study (Table 4 ) shows that the generation quality of CT is not harmed, but on the contrary is improved due to the lower repetition and higher diversity of the generated texts. CT compared to decoding-based approaches. Although CT is a learning-based method, we still compare it against decoding approaches for a more comprehensive understanding of its performance. When greedy search is used, CT outperforms the best decoding method (Top-k) in terms of rep-* rates, which again proves the effectiveness of contrastive learning. When using beam search, all but SimCTG-CS perform significantly worse than CT, both in terms of repetition rates and diversity. SimCTG-CS is effective at reducing repetition as it explicitly requires a disparity among different time steps at inference time. This can harm the generation quality, especially the coherence and fluency, as we see in §6.2. It is also worth noting that SimCTG-CS only works together with its SimCTG training objective and with beam search (Su et al., 2022) . 
In summary, one can see that the repetition problem can be better addressed from the model learning perspective, in which case a simple greedy decoding strategy suffices.

6.2. HUMAN EVALUATION

Human evaluation results are shown in Table 3 . Regarding the overall quality, CT performs significantly better than Top-k and SimCTG-CS, two decoding based approaches. Instead of purely learning generation policies from data, decoding approaches exert heuristics at inference time, which may prevent the language model from performing naturally. This explains the worse performance of 2 . This result is expected, as both CT and UL-TS are learning-based approaches for training data-driven models, and on normal cases such as low-repetitive generations, they should perform similarly. Compared to human performance, there is still a large margin for machine learning models before they have a comparable performance on the language modeling task. Although CT performs on par with humans regarding repetition, its generations are far less coherent and fluent than those of humans. This may be mitigated by using larger models such as GPT-2 large or GPT-3. However, we could not perform such experiments due to a lack of computational resources.

6.3. VISUALIZATION ANALYSIS OF THE GENERATION PROBABILITY

We also conduct analysis to understand the predicted probability of model-generated tokens at inference time. As shown in Figure 3 […] heat map shows that the language model trained by UL-TS may be subject to frequent grammatical errors, as can be seen in Appendix E, Table 5.

Prefix: the American lobster, H. americanus. It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ), and bears a conspicuous pair of claws. In life, the lobsters are blue,
UL-TS: with a white spot on the underside. The eggs are laid in a single, white sac, and hatch in the spring. The adult is about 1 @.5 2 cm ( 0 @.@ 8 1 @.@ 9 in ) long, and weighs about 1 @.5 2 @.@ 5 kg ( 2 @.5 3 @.@ 8 lb ). The eggs are laid in a single, white @ brownish @ brown shell, with a white margin (0.55)
CT: yellow, or greenish @-@ brown with short pointed teeth. The male lays eggs on top of the female's abdomen, which are incubated by means of tubes attached to the skin. After three weeks, the eggs hatch into adult males. = = Taxonomy = = The genus H. americanus has been described by several authors since its discovery in 1887.

6.4. CASE STUDY

To intuitively see how well CT performs, we selected some example generations of CT, and compare them with those generated using UL-TS in Table 4 . More often than not, continuations generated by CT are less repetitive and make more sense than those generated by UL-TS. The reason for the poor quality of UL-TS is that sequence-level unlikelihood training penalizes repeated 4-grams generated by LMs, making LMs uncertain about their predictions as suggested in Figure 3 .

7. CONCLUSION AND DISCUSSION

In this paper we studied the neural text degeneration problem. By integrating the best of cross-entropy and unlikelihood training objectives, we obtain a simple and effective contrastive token learning (CT) framework. The main novelty of this work is adapting contrastive learning to the token level of autoregressive language model training. To the best of our knowledge, our work is the first to use model hidden states as the anchor points and tokens as the positive and negative examples to formulate the contrastive loss. By contrasting the preceding M tokens at a training step with the label token, LMs learn to not repeat such tokens, thus alleviating the repetition problem. Although the idea of negative tokens is similar to UL, our formulation of the contrastive objective is more effective and safer to use. Experiments on the open-ended text generation and open-domain dialogue generation tasks show that CT beats UL-TS, the previous state-of-the-art approach to tackling the repetitive text degeneration problem. CT not only achieves the lowest repetition rates and the highest generation diversity, but also higher generation quality according to our human evaluation. We performed experiments on fine-tuning LMs for reducing their repetition rates, which can be beneficial for related tasks such as abstractive summarization, machine translation, and image captioning. Our early experiments show that CT can be safely integrated when training a language model from scratch, which can be helpful for future pre-training of large language models. In this work, we used CT with decoder-only (GPT-2) and encoder-decoder (BlenderBot) language models, but we note that CT can also be used with encoder language models (e.g., BERT (Vaswani et al., 2017)) to potentially improve model performance such as prediction accuracy. The repetitive degeneration problem is still not fully solved, as occasional, excessive phrase repetitions remain in the generated texts. 
We leave these research directions as future work.

8. ETHICAL CONSIDERATIONS

In this work, we used publicly available English data to train/validate/test models. As far as we know, the curators of these datasets have taken ethical issues into consideration when creating the datasets. We manually checked some generated texts of the language models trained by CT and did not observe any noticeable traces of concern, such as offensive and malevolent language. We share our source code and trained model weights to support its correct use. To make sure the human workers involved in the data labeling efforts, as part of the human evaluation for this study, are fairly paid, we applied the minimum hourly rate of 10.48 euros, which converts to 11 dollars per hour. However, we warn that generative language models should always be used with caution since the generated texts are usually novel and unexpected wordings may appear when trained on improper data. Especially, generative models can be used maliciously, e.g., to generate fake news articles.

9. REPRODUCIBILITY

Our source code, including data pre-processing scripts, our trained models, and an interactive Google Colab notebook, is available at https://anonymous.4open.science/r/lit-seq. Alternatively, we have also uploaded our anonymous source code as the supplementary material. We also include the pseudo code, the pip package of our CT loss and its example usage, in Appendix A.

    z_{x_t} ← GatherLogits(Z_t, x_t)        # positive logits
    z_{S_N^t} ← GatherLogits(Z_t, S_N^t)    # negative logits
    L_CT^t ← log(1 + Σ_{x_t^- ∈ S_N^t} exp(z_{x_t^-} − z_{x_t}))    # Eq. (5)

MODELS

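We adapted NCE (Gutmann & Hyvärinen, 2010) to the token level:

$$\mathcal{L}^t_{NCE} = -\log \sigma(h_t^T W_{x_t}) - \frac{1}{|S_N^t|} \sum_{x_t^- \in S_N^t} \log \sigma(-h_t^T W_{x_t^-}), \tag{8}$$

where $\sigma(\cdot)$ is the sigmoid function.

C GRADIENT FUNCTIONS

In the following, $z_x = h_t^T W_x$ denotes the logit of token $x$ and $p_x$ its softmax probability.

• Gradient functions of cross-entropy w.r.t. label tokens $x_t$:

$$\frac{\partial \mathcal{L}_{CE}}{\partial z_{x_t}} = -\frac{\sum_{\hat{x}_t \in V,\, \hat{x}_t \neq x_t} \exp(z_{\hat{x}_t} - z_{x_t})}{1 + \sum_{\hat{x}_t \in V,\, \hat{x}_t \neq x_t} \exp(z_{\hat{x}_t} - z_{x_t})} = -\frac{\sum_{\hat{x}_t \in V,\, \hat{x}_t \neq x_t} \exp(z_{\hat{x}_t})}{\exp(z_{x_t}) + \sum_{\hat{x}_t \in V,\, \hat{x}_t \neq x_t} \exp(z_{\hat{x}_t})} = -\sum_{\hat{x}_t \in V,\, \hat{x}_t \neq x_t} p_{\hat{x}_t} = p_{x_t} - 1 \leq 0,$$

and non-label tokens $\hat{x}_t$ (including negative tokens and irrelevant tokens):

$$\frac{\partial \mathcal{L}_{CE}}{\partial z_{\hat{x}_t}} = \frac{\exp(z_{\hat{x}_t} - z_{x_t})}{1 + \sum_{\hat{x}'_t \in V,\, \hat{x}'_t \neq x_t} \exp(z_{\hat{x}'_t} - z_{x_t})} = \frac{\exp(z_{\hat{x}_t})}{\exp(z_{x_t}) + \sum_{\hat{x}'_t \in V,\, \hat{x}'_t \neq x_t} \exp(z_{\hat{x}'_t})} = p_{\hat{x}_t} \geq 0.$$

• Gradient functions of unlikelihood training w.r.t. negative tokens $x_t^-$:

$$\frac{\partial \mathcal{L}_{UL}}{\partial z_{x_t^-}} = -\sum_{x_t^{-\prime} \in C^t} \frac{\partial \log(1 - p_{x_t^{-\prime}})}{\partial p_{x_t^{-\prime}}} \frac{\partial p_{x_t^{-\prime}}}{\partial z_{x_t^-}} = \sum_{x_t^{-\prime} \in C^t} \frac{1}{1 - p_{x_t^{-\prime}}} \frac{\partial p_{x_t^{-\prime}}}{\partial z_{x_t^-}} = p_{x_t^-} - \sum_{x_t^{-\prime} \in C^t,\, x_t^{-\prime} \neq x_t^-} \frac{p_{x_t^-}\, p_{x_t^{-\prime}}}{1 - p_{x_t^{-\prime}}} = p_{x_t^-} \Bigl(1 - \sum_{x_t^{-\prime} \in C^t,\, x_t^{-\prime} \neq x_t^-} \frac{p_{x_t^{-\prime}}}{1 - p_{x_t^{-\prime}}}\Bigr) \in (-\infty, p_{x_t^-}], \tag{11}$$

and other tokens $\hat{x}_t$ (including label tokens and irrelevant tokens):

$$\frac{\partial \mathcal{L}_{UL}}{\partial z_{\hat{x}_t}} = -\sum_{x_t^- \in C^t} \frac{\partial \log(1 - p_{x_t^-})}{\partial p_{x_t^-}} \frac{\partial p_{x_t^-}}{\partial z_{\hat{x}_t}} = \sum_{x_t^- \in C^t} \frac{1}{1 - p_{x_t^-}} \bigl(-p_{\hat{x}_t}\, p_{x_t^-}\bigr) = \sum_{x_t^- \in C^t} \frac{p_{\hat{x}_t}\, p_{x_t^-}}{p_{x_t^-} - 1} \leq 0.$$

• Gradient functions of CT w.r.t. positive tokens $x_t$:

$$\frac{\partial \mathcal{L}_{CT}}{\partial z_{x_t}} = -\frac{\sum_{x_t^- \in S_N^t} \exp(z_{x_t^-} - z_{x_t})}{1 + \sum_{x_t^- \in S_N^t} \exp(z_{x_t^-} - z_{x_t})} = -\frac{\sum_{x_t^- \in S_N^t} p_{x_t^-} / p_{x_t}}{1 + \sum_{x_t^- \in S_N^t} p_{x_t^-} / p_{x_t}} \leq 0,$$

and negative tokens $x_t^-$:

$$\frac{\partial \mathcal{L}_{CT}}{\partial z_{x_t^-}} = \frac{\exp(z_{x_t^-} - z_{x_t})}{1 + \sum_{x_t^{-\prime} \in S_N^t} \exp(z_{x_t^{-\prime}} - z_{x_t})} = \frac{p_{x_t^-} / p_{x_t}}{1 + \sum_{x_t^{-\prime} \in S_N^t} p_{x_t^{-\prime}} / p_{x_t}} \geq 0.$$

Because no term in Eq. (5) depends on irrelevant tokens $\hat{x}_t$:

$$\frac{\partial \mathcal{L}_{CT}}{\partial z_{\hat{x}_t}} = 0.$$

• Gradient functions of NCE w.r.t. label tokens $x_t$ (using $\frac{d}{dz} \log \sigma(z) = 1 - \sigma(z)$):

$$\frac{\partial \mathcal{L}_{NCE}}{\partial z_{x_t}} = -\bigl(1 - \sigma(z_{x_t})\bigr) \leq 0,$$

and negative tokens $x_t^-$:

$$\frac{\partial \mathcal{L}_{NCE}}{\partial z_{x_t^-}} = \frac{1}{|S_N^t|} \bigl(1 - \sigma(-z_{x_t^-})\bigr) \geq 0.$$

As with CT, no term in Eq. (8) depends on irrelevant tokens $\hat{x}_t$:

$$\frac{\partial \mathcal{L}_{NCE}}{\partial z_{\hat{x}_t}} = 0.$$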
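The sign pattern of the token-level NCE gradients can be checked numerically. The sketch below is a standard-library illustration operating directly on logits; the function names and toy values are assumptions and are not taken from the released code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nce_loss(logits, negatives, label):
    """Token-level NCE on raw logits: -log sigma(z_label) minus the mean of
    log sigma(-z_neg) over the negative multiset."""
    loss = -math.log(sigmoid(logits[label]))
    if negatives:  # skip the negative term when there are no negatives
        loss -= sum(math.log(sigmoid(-logits[x])) for x in negatives) / len(negatives)
    return loss

def numeric_grad(logits, negatives, label, eps=1e-6):
    """Central finite-difference gradient of the NCE loss w.r.t. each logit."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = nce_loss(up, negatives, label) - nce_loss(down, negatives, label)
        grads.append(diff / (2 * eps))
    return grads

# Token 0: label, tokens 1 and 2: negatives, token 3: irrelevant.
logits = [0.3, -0.2, 1.0, 0.7]
grads = numeric_grad(logits, negatives=[1, 2], label=0)
print([round(g, 4) for g in grads])
```

The label logit receives a negative gradient (promoted), the negative logits receive positive gradients (suppressed), and the irrelevant logit receives exactly zero, matching the qualitative behaviour stated for NCE.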

D REQUIRED SOFTWARE AND HARDWARE RESOURCES

For the CE and decoding baselines, we use GPT-2 (Radford et al., 2019) implemented and pretrained using the CE objective by Hugging Face (Wolf et al., 2020) . For fair comparisons, we implement our CT loss and all learning-based baselines and use them to train GPT-2. Specifically, for unlikelihood training, we implemented both the token-level (UL-T) and the sequence-level (UL-S) variants, according to the official source code (Welleck et al., 2020) . We also implemented SimCTG according to the official code (Su et al., 2022) . Similar to CT, we adapted NCE to the token-level. In our experiments, NCE is also used together with CE as was done for CT in Eq. ( 6). Our implementation is based on Hugging Face Transformers (Apache-2.0 license) (Wolf et al., 2020) , PyTorch Lightning (Apache-2.0 license) (William & team, 2019) , and Hydra (MIT license) (Yadan, 2019) . Our source code is directly based on Lightning Transformers (Apache-2.0 license) (team), thus inheriting the license. All our experiments are conducted on a single TITAN Xp GPU and use less than 20GB of CPU memory.

E ADDITIONAL RESULTS AND ANALYSIS FOR THE LANGUAGE MODELING TASK

E.1 ADDITIONAL RESULTS

Figure 4 reveals that the heat maps for NCE, UL-T and SimCTG are similar to that of CE in Figure 3. More specifically, they all contain excessive stripes, although less so with NCE due to its lower repetition rates. Besides, they are also darker at the lower-right half of the diagonal cells, especially for NCE and SimCTG. Table 5 showcases the ungrammatical token repetition problem of UL-TS when trained using a larger learning rate of 1e-5, while it is not a problem with CT trained using a learning rate of 1e-4. In Table 6, we show more examples comparing the texts generated by CT with those generated by other approaches.

| Method | Continuation | rep-1 |
| --- | --- | --- |
| UL-TS | of about 1 @.@ 5 kg ( 3 lb ). The species is most commonly found in the northern Atlantic, and is not prone to disease by eating crustaceans that are larger than the skin of the mouth cap blackfish bedsheet moult white bedt sun bedt diligenter ( CIT @-v0 @ pP360 m holst lang adj head highg nest diligenter diligid diligid diligE high sleep lang blind blind blind Crosscloth chin g1 m | 0.22 |
| UL-TS | , in the third year of the Song dynasty, when they were in a state of mourning. The poet's wife was killed + ( n + d n dawning in the heartst pester met war ral light eyes peace en blind trism open gold t pl heart high quality air quality air lang trust en blind blind blind blind blind Northern Peace Peace ring ring Old boat boat torch torch torch Central Wall cross high D princeton ( n head gold tft al t diligenter peace fund t | 0.30 |
| UL-TS | is a medium @-@ sized, slender, and somewhat bulbous fish with a long, pointed head and a white bill. It has a dark brownish @-@ brown skin tone ringed spongy @-v @ cap cap cap and anal fin @ cap hoodie @ C $ 1 @ p @ gold toothpam holt chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin chin | 0.50 |
| CT | of 2 @.@ 5 kg ( 7 lb ), but most specimens are only about 1 @.@ 8 m ( 4 @.@ 6 ft ) long. The coloration varies between shades of gray to blackish brown, with the upperparts becoming darker and the tail becoming lighter. = = Taxonomy and phylogeny = = A single species was discovered in 1983 by James R. Clarke, who had previously described it as belonging to a family of crustaceans called " tap | |

E.2 BREAKDOWN ANALYSIS

Beyond the overall performance analysis given above, we also provide a breakdown analysis for CT.

Analysis of Sequence Length. As mentioned earlier, when calculating the CT loss, we efficiently reuse the logits computed for CE. The natural choice is to calculate CT over the full sequence length, but this can result in sub-optimal performance. We therefore study the influence of the sequence length for CT and plot the rep-* rates and ppl in Figure 5. One can observe that using sequences that are either too long or too short for CT results in high repetition rates. With long sequences in particular, ppl is hurt substantially. In our other experiments on the language modeling task, we take the first 150 logits computed for CE and use them to calculate the CT loss.

Analysis of the Number of Negative Tokens. Similarly, when selecting negative tokens, using all the preceding tokens is not the best option. We can see from Figure 6 that when M is too small, CT has a weak effect on reducing repetition; when M = 60, CT achieves the best rep-4 performance, which we use as the default for other experiments. Taken together with the results on the dialogue task (Appendix H), we find empirically that using 1/4 of the logits for computing CT and setting M to 1/8 of the maximum sequence length often results in good performance.

Table 6: More examples of CT compared to other approaches.

| Method | Text | rep-1 |
| --- | --- | --- |
| Prefix | deep load. During the ships'modernization during the 1930s, their forward superstructures were enlarged with multiple platforms added to their tripod foremasts. Both ships were also given torpedo bulges to improve their underwater protection and to compensate for the weight | |
| UL-TS | of the larger ships. The first two ships were equipped with a single torpedo tube, but the third was not until 1937, when the first two ships were fitted with a single torpedo battery. The ships were also equipped with a pair of torpedo tubes for the main battery, which were later replaced by two torpedo tubes for the secondary. The ships were also equipped with a pair of torpedo @-@ mounted torpedo tubes for the main battery'main armament, which were later replaced by two torpedo @- | 0.58 |
| SimCTG-CS | of the ship's armor. = = Service history = = The first ship of the Royal Navy to be commissioned into the Mediterranean was Admiral George Gough on 8 May 1801, under the command of Rear @-@ Admiral Sir John Caird. Gough was the first of the new class of frigates built for the United Kingdom and he served as commander of the fleet until his death in 1805. His flagship, HMS Illustrious, was laid down at Portsmouth in March 18 | 0.3 |
| Top-k | of their forward bulkheads. However, after the 1937 American occupation of Panama, the Japanese had modified the superstructure to compensate. = = = = Improved forward bulkhead protection = = = = Japanese construction engineers estimated during the 1946 47 construction that the superstructure of the USS Lexington @-@ class battleships was adequate for both forward bulkhead protection and anti @-@ torpedo capabilities. The armor on the bow was 5 @.@ 5 inches thick and the rear armor had a | 0.4 |
| CT | loss caused by heavy seas. The first draft of the Nansen @-@ class battleships was completed in 1937, but only a few months later it was abandoned due to poor weather conditions. = = Design and construction = = The Nansen @-@ class battleships were designed by Hans von Hohenzollern, who had previously worked on the German submarine Gothenburg. The ship's main armament consisted of four 45 mm ( 2 @.@ 9 in ) | 0.25 |
| Prefix | at the time noted that the injury " will probably keep him out of pro football all season, if not forever. " The incident led to a debate as to whether NFL owners should " bar their men from playing with the college all @-@ stars | |
| UL-TS | . " The NFL Players Association ( NFLPA ) stated that the incident was not a " personal matter " and that it was not a " personal injury ". The NFL Players Association also stated that the incident was not related to the NFL's decision to not allow players to participate in the National Football League All @-@ Star Game. = = = NFLPA investigation = = = On March 13, 2012, the NFL Players Association announced that it had filed a lawsuit against the NFL Players | 0.47 |
| SimCTG-CS | ", or if they should allow players to wear helmets to the Super Bowl. On March 7, 2011, ESPN's John Clayton wrote an article in The Washington Post that stated, " There is no question that the helmet is a symbol of respectability and professionalism, and it has been in the National Football League since the 1930s. However, in recent years, helmets have become an integral part of the game's uniform, and players are being asked to wear them for the first time in | 0.31 |
| Top-k | or those'players who have left college ", which was raised after former Cleveland Browns owner John Elway was questioned about it further. Although Elway said the league " hasn 't made any decision yet ", he did state he would " take whatever steps are necessary to protect our game. " Since no such measures were taken at the time of the incident the NFL's position has not changed except to allow players who had lost one of their teammates to participate in organized team activities | |
| SimCTG-CS | a U @-@ boat was sunk by an American submarine and two others were damaged, one of which was badly damaged in the crash. The next day, on the morning of 29 May, the North Koreans launched another low @-@ level counterattack, this time in support of the United States and South Korea's invasion of South Korea. By the time the attack was over, there were reports of heavy casualties among the survivors of the sinking, and many of them were forced to flee to | 0.34 |
| Top-k | a group of 13 North Korean artillerymen was hit. At 23 : 55, an attack was launched on the southern flank of the column. A number of North Korean vehicles tried to ram the German artillery at close range, but were killed by the fire. All the tanks in that column were eliminated by the German sides. Only the small tanks and two armoured personnel carriers were damaged. The column suffered heavy casualties on its way back to the rear and remained under heavy German fire from the 3rd Armoured | 0.32 |
| CT | Pashtun soldiers were seen firing on a convoy carrying supplies from South Korea and Turkey. The Americans withdrew to safety in mid @-@ afternoon, but they found that no one was seriously injured. = = Battle of Chongju Island = = On 9 August 1945, U.S. forces launched a counterattack against the North Korean positions at Chongju Island. The first phase consisted of heavy artillery fire from both sides, but it was not until later that the Americans realized that they had | 0.23 |

Figure 7: Our MTurk question form design for the human evaluation on the language modeling task.
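The two hyperparameters studied in Appendix E.2, the number of positions used for CT and the window M of preceding tokens, can be illustrated with a small sketch. This is not the released implementation: the helper names (`ct_positions`, `negative_tokens`) are hypothetical, and it is an assumption here that negatives are the distinct tokens among the M preceding labels, with the current label and any padding token excluded.

```python
def ct_positions(seq_len, ct_len):
    """Time steps that contribute to the CT loss: the first `ct_len` positions."""
    return range(min(seq_len, ct_len))


def negative_tokens(labels, t, window, pad_id=None):
    """Distinct tokens among the `window` labels preceding step t.

    Assumption: the label at step t and the padding token are never negatives.
    """
    start = max(0, t - window)
    negatives = []
    for token in labels[start:t]:
        if token != labels[t] and token != pad_id and token not in negatives:
            negatives.append(token)
    return negatives


# Toy label sequence of token indices.
labels = [5, 7, 5, 9, 7, 3]

# Use only the first 4 positions for CT, with a window of M = 3 preceding tokens.
negs_per_step = [
    negative_tokens(labels, t, window=3)
    for t in ct_positions(len(labels), ct_len=4)
]
```

With the settings reported for the language modeling task, `ct_len` would be 150 and `window` would be 60; the toy values above are only chosen to keep the example readable.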

F HUMAN EVALUATION DESIGN

Figure 7 is a screenshot of our question form design. We instructed the crowd workers to first read the excerpt (the prefix given to the LMs) and the generated continuations, and then to compare their quality on three aspects: repetitiveness, fluency, and coherence. We allowed the workers to choose "Not sure" when they could not tell which continuation was better. Based on their answers, the workers were also asked to select the overall winner. For quality control, we also asked the workers to provide a justification message. Please see Figure 8 for the full instructions.

G EXPERIMENTAL SETUP FOR THE DIALOGUE TASK

The experimental setup for the dialogue task largely follows that of the language modeling task in §5. Below we focus on the differences.

Datasets. Following Roller et al. (2021), we use a mixture of multiple high-quality datasets, including PersonaChat (Zhang et al., 2018), Empathetic Dialogues (Rashkin et al., 2019), Wizard of Wikipedia (Dinan et al., 2019), and BlendedSkillTalk (Smith et al., 2020). We add another benchmark dialogue dataset, DailyDialog (Li et al., 2017). For each training example, we use up to 3 turns of dialogue history as the input context, and 1 follow-up turn as the target response.

Training and Inference Details. We use the 400M-distilled version of BlenderBot (Roller et al., 2021) implemented and pretrained using the CE objective by Hugging Face (Wolf et al., 2020). We truncate sequences to a maximum length of 128 tokens and use a training batch of 10 context-response pairs. Following Roller et al. (2021), we force BlenderBot to generate at least 20 tokens.
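The construction of training examples described under Datasets (up to 3 turns of history as context, 1 follow-up turn as target) might look like the sketch below. The function name `make_pairs` and the sliding-window choice, where every turn after the first serves once as a target, are assumptions for illustration; the paper does not specify how dialogues are split.

```python
def make_pairs(dialogue, max_history=3):
    """Build (context, response) pairs from a list of dialogue turns.

    Assumption: each turn after the first is used once as the target, with
    at most `max_history` immediately preceding turns as the context.
    """
    pairs = []
    for i in range(1, len(dialogue)):
        context = dialogue[max(0, i - max_history):i]
        pairs.append((context, dialogue[i]))
    return pairs


dialogue = [
    "Hi!",
    "Hello, how are you?",
    "Fine, thanks.",
    "Any plans today?",
    "Going hiking.",
]
pairs = make_pairs(dialogue)
```

For the five-turn toy dialogue this yields four pairs, and the last pair drops the first turn because the history is capped at three turns.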

H RESULTS ON THE OPEN-DOMAIN DIALOGUE TASK

The results on the open-domain dialogue task are reported in Table 7. Generations have a minimum length of 20 tokens. Similar to its performance on the language modeling task, CT again achieves the best repetition and diversity performance, with only a minor sacrifice in ppl (1.44 points). Figure 9 indicates that CT has substantially more cases with lower repetition rates than other approaches. Because dialogue responses are usually short (∼20 tokens), the rep-4 rates of the methods are not far apart, although CT marginally wins. Regarding the selection of the sequence length for CT and the window size for selecting negative tokens, we made similar observations on the dialogue task as on the language modeling task, as can be seen from Figures 10 and 11. Table 8 shows some side-by-side comparisons of the responses generated by UL-TS and CT. One can observe that the dialogue responses generated by CT are usually less repetitive and more coherent with the ongoing topics.
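To make the rep-1 and rep-4 numbers easier to interpret, here is a minimal sketch of an n-gram repetition rate. It assumes rep-n is the fraction of duplicate n-grams in a sequence, 1 − |unique n-grams| / |n-grams|, which is a common definition in this line of work; the function name `rep_n` and the whitespace tokenization are illustrative only, so the values will not match those reported in the tables, which depend on the model tokenizer.

```python
def rep_n(tokens, n):
    """Fraction of duplicate n-grams: 1 - |unique n-grams| / |n-grams|."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: nothing can repeat
    return 1.0 - len(set(ngrams)) / len(ngrams)


# A deliberately repetitive toy response, split on whitespace.
tokens = "I do . You dance ? I do . You dance ?".split()
rep_1 = rep_n(tokens, 1)
rep_4 = rep_n(tokens, 4)
```

Short responses contain few 4-grams, so rep-4 has little room to differ between methods, which is consistent with the observation above that the rep-4 rates on the dialogue task are close together.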



Readers are referred to Table 4 for some concrete examples.

The degeneration problem even exists in large-scale state-of-the-art pre-trained language models such as GPT-3 (Ouyang et al., 2022).

Albeit with different strengths, as seen in Eq. (10) in Appendix C.



Figure 2: Histograms for rep-1 (left) and rep-4 (right) rates of each method, on the Wikitext-103 test set.

Figure 3: Heat maps for the generation probability of CT, CE, and UL-TS. Row and column labels represent model-generated tokens at each time step, and the saturation of each cell represents the corresponding probability of each token. Please refer to §6.3 for a more detailed description. Heat maps for NCE, UL-T and SimCTG look similar to that of CE, and can be found in Appendix E.

return $\mathcal{L}^t_{CT}$

We summarize the steps for calculating $\mathcal{L}^t_{CT}$ in Algorithm 1. You can use our CT objective when pretraining or finetuning your autoregressive language models. It takes only a few lines of Python code, placed around where you calculate PyTorch's CrossEntropyLoss. Simply run `pip install ct-loss` to install the required packages. Then you can use CT as follows:

```python
import torch

# Suppose we already have the model output logits and labels (sequences
# of token indices). For example when the batch size is 10, sequence
# length is 50 and vocabulary size is 1000:
logits = torch.rand(10, 50, 1000)
labels = torch.randint(0, 999, (10, 50))

# This is how you normally use cross-entropy for a language model:
from torch.nn import CrossEntropyLoss
ce_criterion = CrossEntropyLoss()
ce_loss = ce_criterion(logits.view(-1, 1000), labels.view(-1))

# This is how you can use our contrastive token loss:
from ct.ct_loss import ContrastiveTokenLoss
# We need pad tokens for masking out tokens in a sequence
# that should not be used as negative tokens.
ct_criterion = ContrastiveTokenLoss(pad_id=999)
ct_loss = ct_criterion(logits, labels)

# In our paper, we use CE and CT together
loss = ce_loss + ct_loss
```

B NOISE-CONTRASTIVE ESTIMATION FOR AUTOREGRESSIVE LANGUAGE

Figure 4: Heat maps for the generation probability of NCE, UL-T and SimCTG on the Wikitext-103 test set.

Figure 5: Influence of the sequence length for CT loss on the language modeling task.

Figure 8: Our instructions to MTurk workers.

Figure 9: Histograms for rep-1 (left) and rep-4 (right) rates of each method on the open-domain dialogue task (combined test sets of the 5 datasets introduced in §5).

Figure 10: Influence of sequence length for CT loss on the open-domain dialogue task.

Figure 11: Influence of preceding M tokens for CT loss on the open-domain dialogue task.

Comparison of the influence of different learning objectives on the positive (label), negative (incorrectly repeating), and irrelevant tokens (all the others) for LMs.

for visual object recognition. In other words, cross-entropy effectively trains LMs to contrast the label tokens (positive examples) x

Win/lose rates (%) of CT compared to baselines under human evaluations. For a competitive comparison, we filtered out highly repetitive examples of either model in the pair. * indicates statistical significance as determined with a sign test (p < 0.05).

coherence and fluency. CT performs generally better than UL-TS except on coherence, but none of these differences are statistically significant. This suggests that CT has a generation quality similar to that of UL-TS on low-repetitive examples, but CT has much lower repetition rates as reported in Table

Continuations generated using UL-TS have heavier repetition than those generated using CT. Greedy search is used. More comparisons to other approaches can be seen in Appendix E.

, and by 757 the couple had had five children three sons and two daughters but one of the sons died in infancy in 755. From 754 he UL-TS was a prominent figure in the Church of England, and in 758 he was appointed vicar of St John's, and in 759 he was appointed vicar of the Church of Scotland. He died in 761, and was succeeded by his son, John, who died in 762. = = Early life = = John was born in the parish of St John's, in the parish of St John's, in the parish of St John's Church, near Bath, where he received instruction from William de Montfort on how to build a dam for irrigation purposes. The first mention of this work came in 757 when it was discovered that a large portion of the earth beneath the riverbed had been washed away by floods caused by wind gusts. This led to speculation that it might be connected to the Norman invasion of England.

∈ {L CE , L U L , L CT }, and z * ∈ {z xt , z xt , z x -

| Method | Text | rep-1 |
| --- | --- | --- |
| | them to retire at the end of the season. " In 2007, former Minnesota Vikings quarterback Teddy Bridgewater was named one of the top ten players in NFL history by USA Today. = = Early life = = Born in St. Louis, Missouri, Cameron Reed was raised by his mother and grandmother. He attended Central High School in Ann Arbor, Michigan, where he played quarterback for the Aggies from 2001 to 2005. After graduating from Central High School, Reed moved to Los Angeles | 0.28 |
| Prefix | Doniel decided to abandon the position that night. When it got dark the survivors would split into small groups and try to get back to friendly lines. That evening after dark the North Koreans launched another weak attack against the position. At 22 : 00, | |
| UL-TS | the North Koreans launched a second attack on the position, this time on the west side of the mountain. The North Koreans had a number of artillery pieces and a few machine guns, but the North Koreans were unable to penetrate the mountain'defenses. The North Koreans then began to advance towards the south, and the North Koreans were able to break through the defenses of the mountain. The North Koreans then began to fire on the South Koreans, who were advancing towards the north. The North Koreans then began to | |

Results on the open-domain dialogue task. ‡ Does not count as the best.

Examples from the open-domain dialogue task.

| Method | Text | rep-1 |
| --- | --- | --- |
| Context | A: Oh really, It should be very nice to go there, listening to some music B: Indeed. Have you ever heard of Pet sounds? Did you listen to any of their music? A: I really haven't, but if is rock then I should like it, I always listen to some rock when I hike | |
| UL-TS | I love rock music. I also love folk music. I think I like the folk sound better. | 0.41 |
| CT | I love rock music. What else do you like to do? I am a big fan of the outdoors myself. | 0.12 |
| Context | A: I am a kennel cleaner at a local animal shelter. And you? B: I am a doctor in the labor and deliver unit. A: Sounds very important. You must be a people person. | |
| UL-TS | Yes I am. I also like to read. Do you? Do you like to read? | 0.42 |
| CT | Yes, I am. What do you do for fun? I like to hike. | 0.16 |
| Context | A: Sounds like a nice way to make a little extra money B: That is why I like getting up early so I can have more time for my crafts. A: Well I am getting ready to turn in. Going to read some non fiction and fall asleep | |
| UL-TS | I like to read too. I also like to play video games. What about you? What do you like to do? | 0.40 |
| CT | I like to read too. What are you going to read? I am currently reading the hunger games trilogy. | 0.15 |
| Context | A: Hmmm. Very interesting. Were you homeschooled? I was. You sing? B: I do not sing and I was not homeschooled. B: You ever been to california? What is ohio known for? You have teenagers? | |
| UL-TS | I have never been to california. I am not a teenager. I have never been to the west coast | 0.33 |
| CT | No, I have not been to california. And no, I do not have teenagers. My hair color is dirty blonde. | |
| Context | B: Hmmm. Very interesting. Were you homeschooled? I was. You sing? A: I do not sing and I was not homeschooled. | |
| UL-TS | Oh. I am a singer. You dance? I do. All the time. You? | 0.30 |
| CT | Oh. I am a singer. You dance? I do. You dance? I do. | 0.45 |


labels = torch.randint(0, 999, (10, 50))

