A SIMPLE CONTRASTIVE LEARNING OBJECTIVE FOR ALLEVIATING NEURAL TEXT DEGENERATION

Abstract

The cross-entropy objective has proved to be an all-purpose training objective for autoregressive language models (LMs). However, without distinguishing problematic tokens, LMs trained with cross-entropy exhibit text degeneration. To address this, unlikelihood training has been proposed to reduce the probability of unlikely tokens predicted by an LM. But unlikelihood does not explicitly consider the relationship between the label tokens and unlikely token candidates, and thus yields only marginal improvements on degeneration. We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training while avoiding their limitations. The key idea is to teach an LM to assign high probabilities to label tokens and low probabilities to negative candidates. Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields far less repetitive text with higher generation quality than baseline approaches, achieving new state-of-the-art performance on text degeneration.

1. INTRODUCTION

Autoregressive language models (LMs), such as OpenAI GPT-3 (Brown et al., 2020), have achieved impressive results on various natural language processing tasks. The goal of training LMs is to learn the true distribution of a text corpus, which is usually achieved through next-word prediction. Specifically, a standard approach to training LMs is to minimize the cross-entropy loss between the true distribution and the model prediction. Unfortunately, LMs trained with the cross-entropy objective have been observed to exhibit text degeneration, of which token-, phrase-, and sentence-level repetition is a common symptom (Holtzman et al., 2020; Welleck et al., 2020; Jiang et al., 2020). Such repeated texts differ markedly from those generated by humans.¹

To analyze the reasons for degeneration, our work views the vocabulary of an LM as being composed of three sets of tokens at each time step: positive tokens (label tokens), negative tokens (incorrectly repeating tokens), and irrelevant tokens (all the others). Based on this taxonomy, we stress that cross-entropy is in fact a contrastive learning objective that contrasts positive tokens with all negative and irrelevant tokens. While it is necessary for LMs to learn to rank positive tokens higher than other tokens in the predicted distribution, the cross-entropy objective treats negative tokens the same as irrelevant tokens (which are usually far more numerous). As a consequence, negative tokens may not be suppressed hard enough. To address this issue, Welleck et al. (2020) proposed unlikelihood training to penalize certain negative tokens, i.e., tokens that are incorrectly repeated. The key idea behind unlikelihood training is to lower the probability that the LM assigns to negative tokens.
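To make the contrast concrete, the two objectives above can be sketched as follows. This is a minimal numpy illustration (function names and the smoothing constant are ours, not from the paper): cross-entropy pushes down negative and irrelevant tokens indiscriminately through the softmax normalizer, while unlikelihood training adds an explicit penalty term on a set of negative candidates, as in Welleck et al. (2020).

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(logits, label):
    # Standard LM objective: raise p(label); all other tokens,
    # negative or irrelevant, are pushed down only implicitly
    # and equally via the softmax normalizer.
    return -np.log(softmax(logits)[label])

def unlikelihood_loss(logits, label, negatives, alpha=1.0):
    # Unlikelihood training: cross-entropy plus an explicit
    # penalty -log(1 - p(c)) on each negative candidate c
    # (e.g., previously generated context tokens).
    p = softmax(logits)
    penalty = -np.sum(np.log(1.0 - p[np.array(negatives, dtype=int)] + 1e-9))
    return -np.log(p[label]) + alpha * penalty
```

With uniform logits over 5 tokens, `cross_entropy_loss` equals log 5, and adding any negative candidates makes `unlikelihood_loss` strictly larger, reflecting the extra penalty term.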
Despite its success, the unlikelihood objective does not explicitly consider the relationship between positive and negative tokens, which gives it an indefinite effect on suppressing negative tokens. Unlikelihood training also unintentionally boosts the probability of irrelevant tokens. Moreover, all previous context tokens are used as negative candidates at every prediction step, which not only introduces considerable noise but also yields sub-optimal repetition reduction, hurting the final generation quality.
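One way to couple positive and negative tokens directly, in the spirit of the contrastive token objective proposed here, is to normalize the label token's score against the negative candidates only, leaving irrelevant tokens out of the contrast. The sketch below is an illustrative formulation of that idea, not necessarily the exact loss defined later in the paper.

```python
import numpy as np

def contrastive_token_loss(logits, label, negatives):
    # Illustrative token-level contrastive objective: a softmax
    # restricted to the label token and the negative candidates,
    # so raising the label directly lowers the negatives while
    # irrelevant tokens are left untouched.
    scores = logits[[label] + list(negatives)]
    scores = scores - scores.max()  # numerical stability
    return -np.log(np.exp(scores[0]) / np.exp(scores).sum())
```

Note that with no negative candidates the loss is zero, i.e., the objective fires only when there is something to contrast against, unlike cross-entropy, which always competes with the full vocabulary.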



¹ Readers are referred to Table 4 for some concrete examples. The degeneration problem exists even in large-scale state-of-the-art pre-trained language models such as 2022).

