STRAIGHT TO THE GRADIENT: LEARNING TO USE NOVEL TOKENS FOR NEURAL TEXT GENERATION

Abstract

Advanced large-scale neural language models have led to significant success in many natural language generation tasks. However, the most commonly used training objective, Maximum Likelihood Estimation (MLE), has been shown to be problematic in that the trained model prefers dull and repetitive phrases. In this work, we introduce ScaleGrad, a modification straight to the gradient of the loss function, to remedy the degeneration issues of the standard MLE objective. By directly maneuvering the gradient information, ScaleGrad makes the model learn to use novel tokens during training. Empirical results show the effectiveness of our method not only in open-ended generation, but also in directed generation. Owing to its architectural simplicity, our method can serve as a general training objective that is applicable to most neural text generation tasks.

1. INTRODUCTION

Text generation has been one of the most important research problems in natural language processing (NLP) (Reiter & Dale, 2000). Thanks to advances in neural architectures, models are now capable of generating texts of better quality than before (Brown et al., 2020). However, despite the countless efforts made to improve neural architectures, models trained with the standard Maximum Likelihood Estimation (MLE) objective are known to prefer generating dull and highly repetitive texts. For instance, in open-ended generation tasks, such as story continuation or open dialogue generation, it has been observed that even with large pre-trained models, e.g., GPT-2 (Radford et al., 2019), high-frequency tokens largely dominate the generation (Welleck et al., 2020; Holtzman et al., 2020). The same observation has been reported in directed generation tasks such as text summarization (Nallapati et al., 2016; See et al., 2017), image captioning (Melas-Kyriazi et al., 2018; Wang & Chan, 2019) and machine translation (Tu et al., 2016; Stahlberg & Byrne, 2019). The methods introduced to solve the aforementioned issues with neural text generation can be primarily categorized into two groups: (i) training-based methods, which include incorporating auxiliary losses (See et al., 2017; Welleck et al., 2020) and coverage vectors (See et al., 2017; Tu et al., 2016); and (ii) decoding-based methods, such as stochastic beam search (Kool et al., 2019), top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2020). Though decoding-based methods, in particular nucleus and top-k sampling, perform well in practice in open-ended generation tasks and significantly reduce the degeneration problem, they do not address the fundamental issue that the token-level probabilities produced by the neural model are problematic (Welleck et al., 2020). In addition, our experiments demonstrate that sampling methods also fail to generate high-quality texts in directed generation tasks such as abstractive text summarization.

In this work, based on the known observation that models trained with the MLE objective tend to generate repetitive tokens or phrases, we introduce a novel method called ScaleGrad for neural text generation training, which directly maneuvers the gradients to make the model learn to use novel tokens during training. Our method belongs to the training-based group, which aims to address the fundamental modeling problem, that is, the token-level distribution predicted by the model. We conduct extensive experiments with different neural architectures including LSTM (Hochreiter & Schmidhuber, 1997) and Transformer (Vaswani et al., 2017) across different tasks in open-ended and directed text generation. Through extensive analysis, we demonstrate that ScaleGrad consistently improves the generation quality according to both human and automatic evaluation.
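To give an intuition for gradient maneuvering of this kind, the toy sketch below rescales the predicted probability of novel tokens (tokens that have not yet appeared in the generated prefix) before the cross-entropy gradient is taken. This is an illustrative sketch under our own assumptions (the scaling factor `gamma`, the renormalization rule, and all function names are hypothetical), not the exact formulation of ScaleGrad, which is derived later in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def rescale_novel(probs, novel_mask, gamma=0.2):
    """Down-scale novel-token probabilities by gamma and renormalize.

    If the ground-truth token is novel, its rescaled probability p_tilde
    is smaller, so the gradient of -log p w.r.t. its logit (p_tilde - 1)
    grows in magnitude, pushing the model harder toward novel tokens.
    NOTE: gamma and this exact rule are assumptions for illustration.
    """
    scaled = [gamma * p if novel else p for p, novel in zip(probs, novel_mask)]
    z = sum(scaled)
    return [s / z for s in scaled]

# Toy vocabulary of 4 tokens; token 0 already appeared in the prefix.
probs = softmax([2.0, 1.0, 0.5, 0.1])
novel = [False, True, True, True]
tilde = rescale_novel(probs, novel)

target = 1  # suppose the ground-truth next token is the novel token 1
grad_mle = probs[target] - 1.0  # standard MLE gradient w.r.t. the target logit
grad_sg = tilde[target] - 1.0   # gradient after the novel-token rescaling
assert abs(grad_sg) > abs(grad_mle)  # stronger push toward the novel token
```

The point of the sketch is only the direction of the effect: shrinking the probability mass assigned to not-yet-used tokens enlarges the gradient that rewards picking them, without any change to the model architecture.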

