STRAIGHT TO THE GRADIENT: LEARNING TO USE NOVEL TOKENS FOR NEURAL TEXT GENERATION

Abstract

Advanced large-scale neural language models have led to significant success in many natural language generation tasks. However, the most commonly used training objective, Maximum Likelihood Estimation (MLE), has been shown to be problematic: models trained with it prefer dull and highly repetitive phrases. In this work, we introduce ScaleGrad, a modification straight to the gradient of the loss function, to remedy the degeneration issues of the standard MLE objective. By directly maneuvering the gradient information, ScaleGrad makes the model learn to use novel tokens during training. Empirical results show the effectiveness of our method not only in open-ended generation, but also in directed generation. Owing to its architectural simplicity, our method can serve as a general training objective applicable to most neural text generation tasks.

1. INTRODUCTION

Text generation has been one of the most important research problems in natural language processing (NLP) (Reiter & Dale, 2000). Thanks to the advances in neural architectures, models are now capable of generating texts of better quality than before (Brown et al., 2020). However, despite the countless efforts that have been made to improve neural architectures, models trained with the standard Maximum Likelihood Estimation (MLE) objective are known to prefer generating dull and highly repetitive texts. For instance, in open-ended generation tasks, such as story continuation or open dialogue generation, it has been observed that even with large pre-trained models, e.g., GPT-2 (Radford et al., 2019), high-frequency tokens largely dominate the generation (Welleck et al., 2020; Holtzman et al., 2020). The same observation has been reported in directed generation tasks such as text summarization (Nallapati et al., 2016; See et al., 2017), image captioning (Melas-Kyriazi et al., 2018; Wang & Chan, 2019) and machine translation (Tu et al., 2016; Stahlberg & Byrne, 2019). The methods introduced to solve the aforementioned issues with neural text generation can be primarily categorized into two groups: (i) training based methods, which include incorporating auxiliary losses (See et al., 2017; Welleck et al., 2020) and coverage vectors (See et al., 2017; Tu et al., 2016); (ii) decoding based methods, such as stochastic beam search (Kool et al., 2019), top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2020). Though decoding based methods, in particular nucleus and top-k sampling, perform well in practice in open-ended generation tasks, significantly reducing the degeneration problem, they do not address the fundamental issue that the token-level probabilities produced by the neural model are problematic (Welleck et al., 2020).
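To make the decoding based methods above concrete, the following sketch implements top-k and nucleus (top-p) filtering over a toy next-token distribution in pure Python. The function names and the 5-token vocabulary are illustrative choices, not from the paper; real implementations operate on logits over full vocabularies.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize their mass."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
top_k = top_k_filter(probs, 2)       # tokens 0 and 1, renormalized
nucleus = nucleus_filter(probs, 0.8) # smallest set covering >= 0.8 mass
```

The next token is then sampled from the filtered, renormalized distribution rather than chosen greedily, which is the source of the randomness that reduces repetition.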
In addition, our experiments demonstrate that sampling methods also fail to generate high-quality texts in directed generation tasks such as abstractive text summarization. In this work, based on the known observation that the model trained with the MLE objective tends to generate repetitive tokens or phrases, we introduce a novel method called ScaleGrad for neural text generation training, which directly maneuvers the gradients to make the model learn to use novel tokens during training. Our method lies in the training based group, which aims to address the fundamental modeling problem, that is, the token-level distribution predicted by the model. We conduct extensive experiments with different neural architectures including LSTM (Hochreiter & Schmidhuber, 1997) and Transformer (Vaswani et al., 2017) across different tasks in open-ended and directed text generation. Through extensive analysis we demonstrate that ScaleGrad consistently improves the generation quality according to both human evaluation and automatic metrics. Compared to other training based methods, ScaleGrad is architecturally simpler and easier to fit into current neural models ( §3.2), while possessing a wider applicability to different tasks compared to decoding based methods ( §4.2 and §5.2).

2.1. NEURAL TEXT GENERATION

The NLP tasks involving text generation can be broadly categorized into two types: directed generation and open-ended generation (Holtzman et al., 2020). In the former case, the output text can be seen as a constrained transformation of the input. Examples include text summarization, machine translation, and image captioning. In the latter case, the input context only provides a certain degree of constraints such that the model is allowed to generate the following texts with a considerable degree of freedom. Story/text continuation and dialogue generation fall in this category. Neural models frame text generation tasks as some form of conditional language modeling, which is typically trained to maximize the log likelihood (equivalently, minimize the negative log likelihood) of the training data. The Maximum Likelihood Estimation or MLE objective for an input-output pair (x, y) can be expressed as follows:

L_{\text{MLE}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)

where θ denotes model parameters, T is the length of the output sequence y, and x is the task-specific input condition, e.g., source document in summarization, image in image captioning, conversation history in dialogue generation and ∅ in text continuation. Teacher Forcing (Williams & Zipser, 1989), where the current step's target token is passed as the next input to the decoder rather than the predicted token, is usually used to train neural text generation models for faster convergence.

Degeneration. Degeneration has been a key problem in neural text generation models for open-ended tasks, where the model generates texts that are repetitive, overly generic (dull), incoherent and gibberish. It can happen at different levels of granularity -- token, phrase, sentence and paragraph. The problem has not been mitigated even with large-scale pre-trained models like GPT-2 Large (Radford et al., 2019; Holtzman et al., 2020).
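The MLE objective with teacher forcing can be sketched numerically. In the toy example below, `step_probs[t]` stands in for the model's softmax output at step t (computed with the ground-truth prefix y_{<t} as decoder input); the vocabulary size of 3 and the particular probabilities are illustrative assumptions.

```python
import math

def mle_loss(step_probs, target):
    """Negative log likelihood of the target sequence.

    step_probs[t] is the model's distribution over the vocabulary at step t,
    obtained with teacher forcing (ground-truth tokens y_{<t} fed as input).
    """
    return -sum(math.log(step_probs[t][y_t]) for t, y_t in enumerate(target))

# Toy example: vocabulary of 3 tokens, target sequence y = [2, 0].
step_probs = [
    [0.1, 0.2, 0.7],  # distribution at t=1, conditioned on x
    [0.6, 0.3, 0.1],  # distribution at t=2, conditioned on x and y_1 = 2
]
loss = mle_loss(step_probs, [2, 0])  # -(log 0.7 + log 0.6)
```

Minimizing this loss pushes probability mass toward the ground-truth token at every step, regardless of how often that token has already appeared, which is one intuition for why MLE-trained models over-use frequent tokens.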
Degeneration has also been observed in directed generation tasks even though the output in these tasks is confined by the input.

Top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2020) stand out as representatives of decoding based methods, and unlikelihood training (Welleck et al., 2020) as a representative training based method. During each decoding step, nucleus and top-k sampling use different functions to filter the candidate tokens, thereby reshaping the probability distribution, and sample the next token from the new distribution instead of maximizing the actual likelihood. The randomness brought by these sampling methods reduces duplicate tokens in the output. However, a decoding strategy alone does not solve the underlying modeling problem with MLE, as pointed out by Welleck et al. (2020). Our analysis in §5.2 also reveals that sampling methods fail to generate high-quality texts in directed generation tasks. To address the issue with MLE, neural unlikelihood (UL) training has been proposed. During training, at each decoding step t, UL adds an auxiliary loss to the original cross-entropy loss as follows:

L_{\text{UL}}^{t} = -\alpha \cdot \sum_{c \in C_t} \log\big(1 - p_\theta(c \mid y_{<t}, x)\big)

where α is a hyper-parameter and C_t is the set of negative tokens at step t, which is constructed from previous context tokens that are not the current token, C_t = {y_1, . . . , y_{t-1}} \ {y_t}. The auxiliary UL loss is added to the MLE loss to form the overall token-level training objective.
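A minimal sketch of the token-level unlikelihood objective, under the assumption that the auxiliary term is simply added to the MLE loss with weight α; the toy distributions and sequence are illustrative, not from the paper.

```python
import math

def ul_token_loss(step_probs, target, alpha=1.0):
    """Token-level unlikelihood objective: MLE loss plus the auxiliary UL term."""
    mle, ul = 0.0, 0.0
    for t, y_t in enumerate(target):
        probs = step_probs[t]
        mle += -math.log(probs[y_t])
        # C_t: previous context tokens that are not the current target token.
        negatives = set(target[:t]) - {y_t}
        ul += -sum(math.log(1.0 - probs[c]) for c in negatives)
    return mle + alpha * ul

# Toy example over a 3-token vocabulary with target y = [1, 1, 2].
step_probs = [
    [0.2, 0.5, 0.3],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],  # at t=3, token 1 is a negative token (seen before, != y_3)
]
loss = ul_token_loss(step_probs, [1, 1, 2], alpha=1.0)
```

The UL term penalizes probability mass placed on previously generated tokens, so unlike pure MLE it explicitly discourages repetition at the token level.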

