TEXT GENERATION BY LEARNING FROM DEMONSTRATIONS

Abstract

Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as an offline reinforcement learning (RL) problem with expert demonstrations (i.e., the reference), where the goal is to maximize quality given model-generated histories. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones in the reference during training, avoiding optimization issues faced by prior RL approaches that rely on online data collection. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. Further, our models are less sensitive to decoding algorithms and alleviate exposure bias.

1. INTRODUCTION

A dominant approach to text generation is to use autoregressive models learned by maximum likelihood estimation (MLE) on supervised data. However, this approach introduces two well-known discrepancies between training and evaluation objectives that lead to undesired generations. First, the training loss is negative log-likelihood, whereas the evaluation is based on human judgment of the output quality. Under model misspecification, MLE tends to over-generalize, assigning large probability mass to both high-quality and low-quality sequences (Huszár, 2015; Simon et al., 2019). Therefore, in practice, we must carefully select the decoding algorithms to produce high-quality outputs. Second, during training, the autoregressive model conditions on the gold history/prefix; however, at inference time it conditions on model-generated history. This is known as the exposure bias problem (Ranzato et al., 2016; Bengio et al., 2015). In the worst case, one incorrect prediction can produce a low-probability prefix under the gold data distribution, and errors compound in each of the following steps (Ross et al., 2011). In practice, prior work has observed problems such as repetition and hallucination partly due to exposure bias (Holtzman et al., 2020; Wang & Sennrich, 2020).

We aim to bridge the gap between training and evaluation in this paper. To match training and evaluation objectives, ideally we should maximize output quality given model-generated histories. This corresponds to the reinforcement learning (RL) objective: maximizing the expected reward (quality) over trajectories (sequences) induced by the policy (model). However, optimizing this objective is notoriously difficult. Prior RL approaches mainly focus on fine-tuning a learned model to optimize sequence-level metrics such as BLEU (Papineni et al., 2002), but empirically it remains unclear if RL is beneficial to text generation (Wu et al., 2018; Choshen et al., 2020).
Note that many challenges in RL arise from exploring an exponentially large space of sequences, with sparse rewards only on those close to the reference. We thus propose to learn from only the reference sequences without interaction (i.e., the offline setting). Specifically, we use off-policy policy gradient with importance weighting (Hastings, 1970; Hachiya et al., 2009; Parshakova et al., 2019), where training examples with higher probability under the model are weighted higher. Further, our reward functions approximate human judgment of the output quality by estimating how likely a human would have generated a sequence. We call our algorithm GOLD (Generation by Off-policy Learning from Demonstrations). Results on news summarization, question generation, and machine translation show that GOLD leads to better model performance than MLE and RL fine-tuning by both task metrics and human-rated quality. Further, our analysis shows that GOLD learns high-precision models that are less sensitive to decoding algorithms. In addition, it alleviates exposure bias: the output quality does not degrade much as generation length increases.
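The weighting intuition can be illustrated with a minimal numpy sketch. This is not the exact GOLD objective (the precise importance-weight estimator is defined later); here we simply use the model's own detached token probabilities as per-token weights, so confident reference tokens contribute more to the loss than unconfident ones:

```python
import numpy as np

def gold_style_loss(token_log_probs):
    """Sketch of the GOLD intuition: weight each reference token's
    negative log-likelihood by the model's own probability of that token
    (treated as a constant, i.e., no gradient flows through the weight).

    token_log_probs: log p_theta(y_t | y_{0:t-1}, x) for each reference token.
    The per-token weight here is a stand-in for the importance weight.
    """
    weights = np.exp(token_log_probs)  # confident tokens get larger weights
    return -float(np.sum(weights * token_log_probs))

# Hypothetical model probabilities for a 2-token reference.
loss = gold_style_loss(np.log([0.5, 0.1]))
```

Note how the low-confidence token (probability 0.1) is downweighted relative to plain MLE, which would give both tokens weight 1.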

2. FROM MLE TO RL FRAMEWORK

MLE training. Given a context x such as a document, we want to generate a sequence of tokens y = (y_0, ..., y_T), where y_t comes from a vocabulary V. The generator is modeled by a conditional probability distribution parametrized by θ: $p_\theta(y \mid x) = \prod_{t=0}^{T} p_\theta(y_t \mid y_{0:t-1}, x)$, where $y_{0:t-1}$ denotes the prefix $y_0, \ldots, y_{t-1}$. Let $p_{\text{human}}(y \mid x)$ denote the data-generating distribution. Using MLE, the loss function is

$\mathcal{L}(\theta) = -\mathbb{E}_{y \sim p_{\text{human}}}\left[\sum_{t=0}^{T} \log p_\theta(y_t \mid y_{0:t-1}, x)\right].$ (1)

At inference time, we generate tokens sequentially according to $p_\theta$.

Evaluation. In practice, the quality of an output often relies on task-specific metrics such as fluency, correctness, and interestingness. Here for generality we consider perceptual quality (Huszár, 2015; Hashimoto et al., 2019), which measures how likely a human would have generated the output given the context, i.e., $p_{\text{human}}(y \mid x)$. Thus the evaluation metric is

$\mathbb{E}_{y \sim p_\theta}\left[\sum_{t=0}^{T} \log p_{\text{human}}(y_t \mid y_{0:t-1}, x)\right].$ (2)

Comparing (1) and (2), we see that the training objective encourages high recall: the model must put probability mass on all human-generated sequences. In contrast, the evaluation metric encourages high precision: all outputs from the model must be of high quality. Unfortunately, directly optimizing the evaluation metric is impossible because $p_{\text{human}}$ is unknown and the expectation is difficult to estimate. We therefore develop a training objective that closely approximates (2) in the RL framework.

RL formulation. Let's consider generation as a sequential decision-making process. At each time step t, the policy $\pi_\theta$ takes an action $a_t \in V$, transits to the next state $s_{t+1} = (y_{0:t}, x)$, and receives a reward $r_t$. The policy corresponds to the generation model: $\pi_\theta(a_t \mid s_t) = p_\theta(a_t \mid y_{0:t-1}, x)$. We can thus represent a sequence as a trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T)$.
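As a concrete instance of the loss in (1), here is a minimal numpy sketch that computes the negative log-likelihood of one reference sequence from the model's per-token probabilities (the probability values are hypothetical):

```python
import numpy as np

def mle_loss(token_log_probs):
    """Negative log-likelihood of one reference sequence under the model.

    token_log_probs: log p_theta(y_t | y_{0:t-1}, x) for each gold token y_t,
    i.e., the per-token terms inside the sum in Eq. (1).
    """
    return -float(np.sum(token_log_probs))

# Toy 3-token reference whose tokens the model assigns
# probabilities 0.5, 0.25, and 0.8.
loss = mle_loss(np.log([0.5, 0.25, 0.8]))
```

Since the log-probabilities sum to log(0.5 · 0.25 · 0.8) = log(0.1), the loss equals −log(0.1).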
The set of trajectories derived from the training data is called demonstrations, which show the desired behavior of a policy. The RL objective is to maximize

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right],$

where $\gamma \in (0, 1]$ is the discount factor, and $\pi_\theta(\tau)$ denotes the distribution of $\tau$ induced by $\pi_\theta$. If we knew the oracle reward $r_t = \log p_{\text{human}}(a_t \mid s_t)$, then with $\gamma = 1$ this objective would be exactly the evaluation metric (2) we want to optimize. Next, we describe how to optimize $J(\theta)$ with reward functions that approximate $p_{\text{human}}$.
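A single trajectory's contribution to the objective above is just a discounted reward sum; a minimal sketch (the reward values are hypothetical):

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_t gamma^t * r_t for one trajectory, i.e., a single
    Monte Carlo sample of the expectation in J(theta)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With gamma = 1 and rewards r_t = log p_human(a_t | s_t), this
# recovers the per-trajectory evaluation metric.
value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```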

3.1. OFF-POLICY POLICY GRADIENT

Policy gradient. A straightforward way to optimize $J(\theta)$ is policy gradient (PG) (Williams, 1992; Sutton et al., 2000). The gradient is given by

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q(s_t, a_t)\right],$
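The gradient estimator above can be made concrete in a minimal one-state numpy sketch. Assumptions not in the text: the policy is a softmax over logits θ at a single state, for which $\nabla_\theta \log \pi_\theta(a) = \text{onehot}(a) - \pi_\theta$, and Q-values are given as inputs:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def pg_gradient(theta, actions, q_values):
    """Monte Carlo policy-gradient estimate for a single-state softmax
    policy: average of grad log pi(a_t) * Q(s, a_t) over sampled actions.
    """
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a, q in zip(actions, q_values):
        onehot = np.zeros_like(theta)
        onehot[a] = 1.0
        grad += (onehot - pi) * q  # grad of log softmax(theta)[a]
    return grad / len(actions)

# One sampled action (index 0) with Q = 1 under a uniform 3-action policy.
g = pg_gradient(np.zeros(3), actions=[0], q_values=[1.0])
```

Ascending this gradient increases the probability of actions with high Q, which is the mechanism on-policy PG relies on; the off-policy variant discussed next reweights this estimate when actions come from the demonstrations instead of $\pi_\theta$.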

