TEXT GENERATION BY LEARNING FROM DEMONSTRATIONS

Abstract

Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as an offline reinforcement learning (RL) problem with expert demonstrations (i.e., the reference), where the goal is to maximize quality given model-generated histories. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones in the reference during training, avoiding optimization issues faced by prior RL approaches that rely on online data collection. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. Further, our models are less sensitive to decoding algorithms and alleviate exposure bias.

1. INTRODUCTION

A dominant approach to text generation is to use autoregressive models learned by maximum likelihood estimation (MLE) on supervised data. However, this approach introduces two well-known discrepancies between training and evaluation objectives that lead to undesired generations. First, the training loss is negative log-likelihood, whereas the evaluation is based on human judgment of the output quality. Under model misspecification, MLE tends to over-generalize, assigning large probability mass to both high-quality and low-quality sequences (Huszár, 2015; Simon et al., 2019). Therefore, in practice, we must carefully select the decoding algorithms to produce high-quality outputs. Second, during training, the autoregressive model conditions on the gold history/prefix; however, at inference time it conditions on model-generated history. This is known as the exposure bias problem (Ranzato et al., 2016; Bengio et al., 2015). In the worst case, one incorrect prediction can produce a low-probability prefix under the gold data distribution, and errors compound in each of the following steps (Ross et al., 2011). In practice, prior work has observed problems such as repetition and hallucination partly due to exposure bias (Holtzman et al., 2020; Wang & Sennrich, 2020).

We aim to bridge the gap between training and evaluation in this paper. To match training and evaluation objectives, ideally we should maximize output quality given model-generated histories. This corresponds to the reinforcement learning (RL) objective: maximizing the expected reward (quality) over trajectories (sequences) induced by the policy (model). However, optimizing this objective is notoriously difficult. Prior RL approaches mainly focus on fine-tuning a learned model to optimize sequence-level metrics such as BLEU (Papineni et al., 2002), but empirically it remains unclear if RL is beneficial to text generation (Wu et al., 2018; Choshen et al., 2020).
Note that many challenges in RL arise from exploring an exponentially large space of sequences, with sparse rewards only on those close to the reference. We thus propose to learn from only the reference sequences without interaction (i.e., the offline setting). Specifically, we use off-policy policy gradient with importance weighting (Hastings, 1970; Hachiya et al., 2009; Parshakova et al., 2019), where training examples with higher probability under the model are weighted higher. Further, our reward functions approximate human judgment of the output quality by estimating how likely a human would have generated a sequence. We call our algorithm GOLD (Generation by Off-policy Learning from Demonstrations).

Results on news summarization, question generation, and machine translation show that GOLD leads to better model performance than MLE and RL fine-tuning by both task metrics and human-rated quality. Further, our analysis shows that GOLD learns high-precision models that are less sensitive to decoding algorithms. In addition, it alleviates exposure bias: the output quality does not degrade much as generation length increases.

2. FROM MLE TO RL FRAMEWORK

MLE training. Given a context $x$ such as a document, we want to generate a sequence of tokens $y = (y_0, \ldots, y_T)$, where $y_i$ comes from a vocabulary $\mathcal{V}$. The generator is modeled by a conditional probability distribution parametrized by $\theta$: $p_\theta(y \mid x) = \prod_{t=0}^{T} p_\theta(y_t \mid y_{0:t-1}, x)$, where $y_{0:t-1}$ denotes the prefix $y_0, \ldots, y_{t-1}$. Let $p_{\text{human}}(y \mid x)$ denote the data-generating distribution. Using MLE, the loss function is

$$L(\theta) = -\mathbb{E}_{y \sim p_{\text{human}}} \left[ \sum_{t=0}^{T} \log p_\theta(y_t \mid y_{0:t-1}, x) \right]. \tag{1}$$

At inference time, we generate tokens sequentially according to $p_\theta$.

Evaluation. In practice, the quality of an output often relies on task-specific metrics such as fluency, correctness, and interestingness. Here for generality we consider perceptual quality (Huszár, 2015; Hashimoto et al., 2019), which measures how likely a human would have generated the output given the context, i.e., $p_{\text{human}}(y \mid x)$. Thus the evaluation metric is

$$\mathbb{E}_{y \sim p_\theta} \left[ \sum_{t=0}^{T} \log p_{\text{human}}(y_t \mid y_{0:t-1}, x) \right]. \tag{2}$$

Comparing (1) and (2), we see that the training objective encourages high recall: the model must put probability mass on all human-generated sequences. In contrast, the evaluation metric encourages high precision: all outputs from the model must be of high quality. Unfortunately, directly optimizing the evaluation metric is impossible because $p_{\text{human}}$ is unknown and the expectation is difficult to estimate. We therefore develop a training objective that closely approximates (2) in the RL framework.

RL formulation. Let's consider generation as a sequential decision-making process. At each time step $t$, the policy $\pi_\theta$ takes an action $a_t \in \mathcal{V}$, transits to the next state $s_{t+1} = (y_{0:t}, x)$, and receives a reward $r_t$. The policy corresponds to the generation model: $\pi_\theta(a_t \mid s_t) = p_\theta(a_t \mid y_{0:t-1}, x)$. We can thus represent a sequence as a trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T)$. The set of trajectories derived from the training data is called demonstrations, which show the desired behavior of a policy. The RL objective is to maximize

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right],$$

where $\gamma \in (0, 1]$ is the discount factor, and $\pi_\theta(\tau)$ denotes the distribution of $\tau$ induced by $\pi_\theta$. If we knew oracle rewards $r_t = p_{\text{human}}(a_t \mid s_t)$, then this objective would be exactly the evaluation metric we want to optimize. Next, we describe how to optimize $J(\theta)$ with reward functions that approximate $p_{\text{human}}$.
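The recall/precision contrast between (1) and (2) can be made concrete with a small numerical sketch. The example below uses single-token "sequences" and invented probabilities; the names `p_human`, `high_recall`, and `high_precision` are illustrative assumptions, not quantities from the experiments.

```python
import math

# Toy "human" distribution over three one-token outputs (assumed values).
p_human = {"good_a": 0.5, "good_b": 0.499999, "bad": 0.000001}

def mle_loss(p_model, p_data):
    """Eq. (1): expected negative log-likelihood of the model under the data."""
    return -sum(p_data[y] * math.log(p_model[y]) for y in p_data)

def quality(p_model, p_data):
    """Eq. (2): expected log p_human under samples from the model."""
    return sum(p_model[y] * math.log(p_data[y]) for y in p_model)

# Covers both good outputs but leaks probability mass to the bad one.
high_recall = {"good_a": 0.4, "good_b": 0.4, "bad": 0.2}
# Concentrates on a single good output.
high_precision = {"good_a": 0.98, "good_b": 0.01, "bad": 0.01}

for name, model in [("high recall", high_recall), ("high precision", high_precision)]:
    print(name, round(mle_loss(model, p_human), 3), round(quality(model, p_human), 3))
```

In this toy setting the high-recall model has the lower MLE loss, while the high-precision model scores better on the evaluation metric, which is the mismatch the section describes.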

3.1. OFF-POLICY POLICY GRADIENT

Policy gradient. A straightforward way to optimize $J(\theta)$ is policy gradient (PG) (Williams, 1992; Sutton et al., 2000). The gradient is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q(s_t, a_t) \right],$$

where $Q(s_t, a_t) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the estimated return from state $s_t$. The expectation is estimated by Monte Carlo samples from $\pi_\theta$. In text generation, the return $Q(s_t, a_t)$ is often a sequence-level reward such as BLEU. In practice, the policy is likely to get stuck in a region of zero reward during training, generating gibberish without receiving any learning signal (Li et al., 2018; Keneshloo et al., 2019). A common remedy is to initialize the policy with the MLE solution and/or interleave with MLE gradient updates during PG. However, this biases the parameters towards the MLE solution and thus often leads to only marginal gains in practice (Wu et al., 2018; Choshen et al., 2020).

Offline learning. To avoid zero-reward regions, we would like to reduce interaction with the environment and stay close to the demonstrated trajectories. In the extreme case, the policy is learned solely from the static demonstrations without additional interaction with the environment, which is referred to as the offline setting. While it is in general a more challenging problem, we argue that the offline setting is appropriate for text generation (Serban et al., 2017; Jaques et al., 2019). First, the environment dynamics are known: once a token is generated, we deterministically transition to the next state with the additional token appended to the prefix; no interaction is needed to learn the environment. Second, while exploration may lead to high-quality sequences different from the reference, we lack a good reward function to identify them (Novikova et al., 2017; Aharoni & Goldberg, 2018; Clark et al., 2019). Therefore, the benefit of exploration in text generation is limited.

In the offline setting, we cannot estimate the expected return of $\pi_\theta$ by sampling trajectories from it, and must use trajectories from a different behavioral policy $\pi_b$, known as off-policy learning in RL. A common technique to estimate expectations under one distribution $\pi_\theta$ given samples from a different distribution $\pi_b$ is importance sampling, which leads to the following unbiased estimator of the gradient (Precup et al., 2000):

$$\mathbb{E}_{\tau \sim \pi_b} \left[ \sum_t w_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q(s_t, a_t) \right],$$

with importance weights $w_t = \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})}$.

Approximations. Computing the importance weights above requires multiplying per-action importance weights over multiple time steps. In practice, we have found that it is sensitive to optimization hyperparameters and takes longer to converge. Therefore, we use the per-action approximation:

$$w_t \approx \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}.$$

This corresponds to optimizing the expected return under the off-policy state distribution induced by $\pi_b$ and the on-policy action distribution of $\pi_\theta$. Although this estimator is biased, empirically it has been shown to reduce variance and work reasonably well if $\pi_b$ and $\pi_\theta$ are close (Serban et al., 2017; Levine et al., 2020).

Another obstacle is that we do not know $\pi_b$, which produced the demonstrations $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$. One option is to estimate $\pi_b$ on $\mathcal{D}$. Here we take a simpler approach that uses the empirical distribution: $\pi_b(\tau) \approx 1/N$ for $\tau \in \mathcal{D}$ and 0 otherwise. As a result, the denominator in $w_t$ is a constant and can be ignored in optimization. Our final approximated gradient has the form:

$$\nabla_\theta J(\theta) \approx \sum_{i=1}^{N} \sum_{t=0}^{T} \pi_\theta(a_t^i \mid s_t^i) \, \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \, Q(s_t^i, a_t^i), \tag{4}$$

where the superscript $i$ represents the $i$th trajectory. Compared with the MLE gradient, $\sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)$, our gradient (4) upweights actions with high return and actions preferred by the current policy $\pi_\theta$. Intuitively, it encourages the learning algorithm to focus on "easy" examples (high likelihood under the model), which improves precision.
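As a minimal sketch of how one term of the gradient (4) differs from the MLE gradient, the snippet below computes both for a single softmax decision. It assumes a policy parametrized directly by logits and treats the weight $\pi_\theta(a \mid s)$ as a constant; the function names and the example logits are illustrative, not taken from the released code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def mle_grad(logits, action):
    """One term of the MLE gradient: grad log pi(a | s)."""
    return grad_log_prob(logits, action)

def gold_grad(logits, action, q_value):
    """One term of Eq. (4): pi(a | s) * grad log pi(a | s) * Q(s, a).

    The weight pi(a | s) is held fixed; no gradient flows through it.
    """
    weight = softmax(logits)[action]
    return [weight * q_value * g for g in grad_log_prob(logits, action)]

logits = [2.0, 0.0, 0.0]   # the model is confident about token 0
probs = softmax(logits)
confident = gold_grad(logits, action=0, q_value=1.0)
unconfident = gold_grad(logits, action=1, q_value=1.0)
print(probs, confident, unconfident)
```

With equal returns, the reference token the model already prefers (token 0) receives a gradient scaled by a larger weight than the token it considers unlikely (token 1), whereas MLE would weight both by one.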

3.2. REWARD

Let $R$ be the reward function such that $r_t = R(s_t, a_t)$. To optimize the perceptual quality of a sequence (see (2)), we want $R(s, a)$ to approximate $p_{\text{human}}(a \mid s)$, i.e., how likely humans would have generated $a$ given $s$. In general, it is hard to develop a reliable reward function for text generation tasks because it must work well for a large set of possible generations. In the offline setting, however, we can restrict the domain of $R$ to state-action pairs on the demonstrations. Next, we propose three reward functions.

δ-reward. An obvious choice is a sequence-level reward, which considers all demonstrations to be equally good and assigns zero reward to any other outputs. Formally,

$$R_\delta(s_t, a_t) \overset{\text{def}}{=} \begin{cases} 1, & \text{if } t = T \text{ and } (s_{0:T}, a_{0:T}) \in \mathcal{D}, \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$

where a reward of one is received in the terminal state for any trajectory in the demonstrations.

Estimated $p_{\text{human}}$. In text generation tasks, an input often has many correct outputs and the reference may be an uncommon output that contains rare words or has complex syntax. To account for the different likelihoods of the references, we estimate the probability of each reference by minimizing $\mathrm{KL}(p_{\text{human}} \,\|\, q)$, where $q(a \mid s)$ approximates $p_{\text{human}}(a \mid s)$. This is equivalent to finding the MLE solution (denoted by $p_{\text{MLE}}$). We define two rewards based on $p_{\text{MLE}}$. The first is

$$R_p(s, a) \overset{\text{def}}{=} \log p_{\text{MLE}}(a \mid s), \tag{6}$$

for which the return $Q(s_t, a_t) = \sum_{t'=t}^{T} \log p_{\text{MLE}}(a_{t'} \mid s_{t'})$ is the log of a product of probabilities. The second is

$$R_s(s, a) \overset{\text{def}}{=} p_{\text{MLE}}(a \mid s). \tag{7}$$

Here the return is a sum of probabilities, $Q(s_t, a_t) = \sum_{t'=t}^{T} p_{\text{MLE}}(a_{t'} \mid s_{t'})$, thus the policy can recover from bad decisions if the subsequent actions receive high reward.
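The returns induced by the three kinds of reward can be compared on a short demonstration. The sketch below assumes $\gamma = 1$, takes the log-probability form for the product-style reward $R_p$ (so that its return is the log of a product of probabilities), and uses invented per-token values of $p_{\text{MLE}}$; the helper name `returns_from_rewards` is hypothetical.

```python
import math

def returns_from_rewards(rewards, gamma=1.0):
    """Q(s_t, a_t) = sum over t' >= t of gamma^(t'-t) * r_t', computed right to left."""
    q, out = 0.0, []
    for r in reversed(rewards):
        q = r + gamma * q
        out.append(q)
    return out[::-1]

# Assumed p_MLE(a_t | s_t) for a three-token reference with one unlikely token.
p_mle_tokens = [0.9, 0.05, 0.8]

# delta-reward: 1 only at the terminal step of a demonstration.
q_delta = returns_from_rewards([0.0, 0.0, 1.0])
# R_p (assumed log-probability form): return is the log of a product.
q_p = returns_from_rewards([math.log(p) for p in p_mle_tokens])
# R_s: return is a sum of probabilities.
q_s = returns_from_rewards(p_mle_tokens)

print(q_delta)  # every step of a demonstration gets return 1
print(q_p)      # dominated by the single unlikely token
print(q_s)      # later high-probability tokens still contribute
```

Under the sum-style reward, one low-probability token lowers the return only slightly, whereas under the product-style reward it dominates the return of every earlier step.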

3.3. THE GOLD ALGORITHM

Algorithm 1: GOLD
    $\pi_\theta \leftarrow p_{\text{MLE}}$, $\tilde{\pi}_\theta \leftarrow p_{\text{MLE}}$
    for step = 1, 2, ..., M do
        Sample a minibatch $B = \{(x^i, y^i)\}_{i=1}^{|B|}$
        foreach $(s_t^i, a_t^i)$ do
            Compute importance weights $\max(u, \tilde{\pi}_\theta)$, and compute returns $Q(s_t^i, a_t^i) - b$
        Update $\theta$ by (4) using gradient descent
        if step % k = 0 then $\tilde{\pi}_\theta \leftarrow \pi_\theta$
    Return: $\pi_\theta$

Our full algorithm based on off-policy PG is shown in Algorithm 1. For the importance weights $\pi_\theta(a \mid s)$, to avoid drastic changes, we initialize $\pi_\theta$ with the MLE solution. In addition, we compute the importance weights by a weighting policy $\tilde{\pi}_\theta$ that synchronizes with $\pi_\theta$ periodically so that the weights do not change frequently between updates. We also lower-bound the importance weight by a small number $u$. Another source of variance comes from policy gradients. Since our return is computed from a sum or product of probabilities ((6) and (7)), we truncate the future trajectory after five steps. We follow the common practice of subtracting a baseline $b$ from the return to reduce variance; moreover, to avoid negative reward on the demonstrations (after subtracting the baseline), we lower-bound $p_{\text{MLE}}$ in (6) and (7) by a small number $c$. In practice, GOLD is easy to implement; further, given an existing $p_{\text{MLE}}$, the GOLD-training stage usually takes less time than MLE. The code is available.
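The steps of Algorithm 1 can be sketched on a toy problem. The code below is an illustration under strong simplifying assumptions, not the released implementation: the policy is a context-free (unigram) softmax, the whole toy dataset serves as the minibatch, the reward is the sum-style reward with lower bound `c`, and all hyperparameter values and the function name `train_gold_s` are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def truncated_returns(rewards, horizon=5):
    """Q_t = sum of rewards from step t over at most `horizon` steps (gamma = 1)."""
    return [sum(rewards[t:t + horizon]) for t in range(len(rewards))]

def train_gold_s(demos, vocab_size, steps=200, k=20, u=0.1, c=0.1, b=0.0, lr=0.05):
    # For a unigram toy policy, p_MLE is the empirical token frequency.
    counts = [0] * vocab_size
    for y in demos:
        for a in y:
            counts[a] += 1
    total = sum(counts)
    p_mle = [n / total for n in counts]

    # Initialise the policy and the weighting policy from p_MLE.
    logits = [math.log(p) for p in p_mle]
    weighting = list(p_mle)

    for step in range(1, steps + 1):
        pi = softmax(logits)
        grad = [0.0] * vocab_size
        for y in demos:
            rewards = [max(c, p_mle[a]) for a in y]   # reward lower-bounded by c
            q = truncated_returns(rewards)            # truncated after five steps
            for a, q_t in zip(y, q):
                w = max(u, weighting[a])              # lower-bounded importance weight
                for j in range(vocab_size):
                    glog = (1.0 if j == a else 0.0) - pi[j]
                    grad[j] += w * (q_t - b) * glog   # one term of Eq. (4)
        # Gradient step that increases J (equivalently, descent on -J).
        logits = [z + lr * g / total for z, g in zip(logits, grad)]
        if step % k == 0:
            weighting = softmax(logits)               # periodic synchronisation
    return p_mle, softmax(logits)

demos = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 2], [0, 1, 0, 0]]
p_mle, pi_gold = train_gold_s(demos, vocab_size=3)
print(p_mle, pi_gold)
```

On this toy data the trained policy places more mass on the most frequent token than the MLE initialisation does, which mirrors the precision-over-recall behaviour described in the text.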

4.1. SETUP

We chose four text generation tasks: (1) question generation (NQG; Zhou et al., 2017): given a passage and a short span of the passage, the goal is to generate a question that can be answered by the span; (2) summarization (CNN/DM; Hermann et al., 2015); (3) extreme summarization (XSum; Narayan et al., 2018): the references are more abstractive than CNN/DM summaries; (4) machine translation (IWSLT14 De-En; Cettolo et al., 2014). See Appendix A.1 for the size and the source of the datasets. We evaluate NQG and summarization by both automatic metrics, i.e., corpus-level BLEU-4 (Papineni et al., 2002) and ROUGE-1/2/L (Lin, 2004) respectively, and human ratings.

We experiment with three variants of GOLD: GOLD-δ, GOLD-p, and GOLD-s, which use the δ-reward and the two estimated rewards ($R_p$ and $R_s$), respectively. Our baseline learning algorithm is standard MLE, and we compare with on-policy RL training using policy gradient in Section 4.3. We describe models for each task at the beginning of Section 4.2. For GOLD training, we use the baseline $b = -60$ for GOLD-p and $b = 0$ for GOLD-s. To lower-bound the return such that it is non-negative on demonstrated trajectories, we tune the lower bound $c$ of $p_{\text{MLE}}$ in {0, 0.01, 0.05, 0.1} in (6) and (7). Furthermore, to reduce the variance of the importance weights, we lower-bound them by $u \in$ {0, 0.1, 0.15, 0.2}. All hyperparameters are tuned on the dev set. See Appendix A.3 for more reproducibility details.

4.2. RESULTS AND ANALYSIS

Table 1: BLEU/ROUGE (↑) and perplexity (↓) using standard models on test sets, for NQG (NQG++ network) and CNN/DM (pointer-generator network). GOLD achieves better metric scores despite high held-out perplexity. Experiments are run using a fixed random seed (12); we also tried three random seeds (1, 12, 123), and all BLEU/R-2 scores are within 0.1 points of those reported. Refer to Table 3 for transformer results.

GOLD improves both standard and transformer models. Recall that one of our main motivations is that MLE tends to over-generalize under model misspecification, i.e., high recall but low precision. One may wonder whether this problem can be fixed by better modeling. Therefore, we evaluated GOLD with both standard high-performing models and a state-of-the-art pretrained model. For standard models, we chose two representative seq2seq-based models, NQG++ (Zhou et al., 2017) and the pointer-generator model (See et al., 2017), for NQG and CNN/DM respectively. Table 1 shows that GOLD is better than MLE in terms of BLEU and ROUGE. In particular, we find that using estimated rewards is superior to the δ-reward, showing the benefits of accounting for varying quality of the references. We thus consider only GOLD-p and GOLD-s in the rest of the experiments. For transformer models (Vaswani et al., 2017), we used the pretrained BART (Lewis et al., 2020) for NQG, CNN/DM, and XSum; we used a standard transformer for IWSLT14 De-En. Table 3 shows that GOLD achieves better scores than MLE across all tasks, including near-SOTA on CNN/DM (R-2 95% confidence interval: 21.84-22.33) and good performance on XSum (R-2 95% CI: 22.25-22.92).

We further crowdsourced human evaluation by pairwise comparison between MLE-trained and GOLD-s-trained model outputs. Each pairwise comparison is repeated three times (by three different workers) and we take the majority answer. For each dataset, the evaluations are done by at least 15 different workers.
For NQG, we showed workers the entire input and the questions generated by two models, and we asked workers to select the better one (with a third "tie" option). For summarization, we asked workers to select the generation closer in meaning to the reference without showing the article. More details are in Appendix C. Table 5 shows that workers prefer outputs from models trained by GOLD more often than those trained by MLE.

[Figure 1: distributions of token-level NLL loss. MLE learns high-recall models whose loss distribution is spread out; GOLD learns high-precision models whose loss distribution is concentrated on near-zero losses.]

[Table 3: results using transformer models (BART; Transformer), reporting BLEU ↑, R-1/R-2/R-L ↑, and perplexity ↓.]

GOLD encourages high-precision models. One interesting observation from Table 1 and Table 3 is that compared to MLE, GOLD leads to much higher held-out perplexities, while achieving better metric scores. Since both are evaluated against the reference, one would expect high perplexity to correlate with low metric scores. To better understand the behavior of GOLD, we examine the distributions of token-level negative log-likelihood (NLL) loss (a monotonic transformation of perplexity) in Figure 1. We see that the loss distribution of GOLD (compared to MLE) concentrates on near-zero losses (Figures 1a and 1c) with a long tail of large losses (Figures 1b and 1d), hence high perplexity. In contrast, MLE has far fewer near-zero losses and fewer large losses, suggesting it tries to generate all tokens; i.e., MLE encourages recall, as discussed in Section 2. We conclude that GOLD achieves better metric scores by focusing on easy-to-learn tokens at the expense of lower recall with respect to the reference.

Another advantage of high-precision models is that they do not rely much on decoding algorithms to sample high-quality outputs from the learned distribution.
From an RL perspective, the policy already considers future rewards when making local decisions, thus beam search is not necessary. As a result, we see in Table 2 that GOLD achieves similar performance with both argmax decoding and top-k sampling. In contrast, MLE suffers significantly from sampling, which suggests that it learns a high-recall but low-precision model.

GOLD alleviates exposure bias. GOLD suffers less from exposure bias because it trains on the state/history distribution induced by the model instead of the reference data. Here, we empirically quantify the exposure bias problem in learned models. If there is exposure bias, then the output quality is expected to degrade as output length increases, as the history is more likely to deviate from the reference distribution with accumulated generation steps. To evaluate quality, we sampled 736 generations of different lengths from standard models trained by both MLE and GOLD on NQG. Given the paragraph, words to query on, and the generated questions, we then asked workers to rate the generations from 1 (worst) to 4 (best). Figure 2 (left) shows that the output quality of the MLE-trained model degrades when the sequence length is over 14 words, whereas the quality of the GOLD-s-trained model stays relatively stable across all lengths. Qualitatively, we observe frequent degenerations (Holtzman et al., 2020; Welleck et al., 2020a), including repetitions and hallucinations, within sentences generated by the MLE-trained model, as shown in Table 6. In contrast, Figure 2 (right) shows the NLL loss conditioned on gold histories on the NQG dev set. We can see that without exposure bias, NLL loss does not vary much as the length increases. Therefore, we conclude that the big performance drop for long generations using MLE is mainly due to exposure bias and GOLD does not suffer from the problem.
Input: that project was entitled the factory project to reference andy warhol and to create a factory to completely digitize the collection .
MLE: what was the name of the project that was not digitize to digitize ?
GOLD: what was the name of the project that was to reference andy warhol ?

Input: braddock (with george washington as one of his aides) led about 1,500 army troops and provincial militia on an expedition in june 1755 to take fort duquesne .
MLE: what was the name of the aid of george washington university ?
GOLD: who led about 1,500 army troops and provincial militia on an expedition ?
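The observation in this section that a loss distribution concentrated near zero can still yield high perplexity follows from perplexity being the exponential of the mean token-level NLL. The sketch below illustrates this with invented loss values; the 0.1 threshold for "near-zero" and the two loss lists are assumptions made for the example.

```python
import math

def perplexity(token_nlls):
    """Perplexity is the exponential of the mean token-level NLL."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def near_zero_fraction(token_nlls, threshold=0.1):
    """Fraction of tokens whose loss is below `threshold` (assumed cutoff)."""
    return sum(1 for loss in token_nlls if loss < threshold) / len(token_nlls)

# Spread-out losses: moderate loss on every token (high-recall behaviour).
spread_out = [1.0] * 10
# Concentrated losses with a long tail: mostly near zero, a few very large.
concentrated = [0.01] * 8 + [8.0] * 2

for name, losses in [("spread out", spread_out), ("concentrated", concentrated)]:
    print(name, near_zero_fraction(losses), round(perplexity(losses), 2))
```

The concentrated list has far more near-zero losses yet a higher perplexity, because the few large losses dominate the mean.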

4.3. COMPARISON WITH ON-POLICY TRAINING

While offline RL is generally more challenging due to lack of interaction with the environment, we argue that the benefit from interaction is limited in text generation (Section 3.1) and outweighed by the optimization challenges. In this section, we investigate the effect of on-policy training using task metrics as rewards. Specifically, we pre-train the model using MLE and then fine-tune it using PG. To avoid degenerate solutions, we interleave MLE and PG updates evenly during fine-tuning. Similarly, we fine-tune GOLD-initialized models using PG. For on-policy fine-tuning, we use BLEU and ROUGE-2 as rewards for NQG and CNN/DM respectively. Table 4 shows that additional on-policy training improves both MLE and GOLD marginally. However, MLE with PG is still worse than GOLD. Further, one of the best-performing on-policy methods using a similarly competitive pretrained transformer model (Ziegler et al., 2019) also shows limited improvements over the supervised baseline on CNN/DM, despite having better reward functions (domain-specific human preference annotations). Overall, the benefit from on-policy training is unclear in our experiments. Please refer to the appendix for more details.

4.4. DISCUSSION ON GENERATION DIVERSITY

The objective of GOLD is to produce high-precision text at the cost of recall: there are references that the model cannot generate with high probability, which is reflected by the high held-out perplexity in Table 1 and Table 3. One may wonder what the impact of GOLD on text "diversity" is. This issue warrants more discussion, but for text generation, "diversity" may stand for the following.

(1) Diversity as in the ability to generate a number of different correct generations given one context. This is often discussed in the context of mode collapse, which is an important problem for image generation and unconditional text generation (e.g., continuation from a prompt). However, for many conditional NLG tasks, while there are multiple correct outputs, producing one good generation is often sufficient in practice, e.g., question generation, summarization, machine translation, image captioning, text style transfer, and even chit-chat dialogues (unless users expect the bots to say different things in the same context every time). One exception is creative writing tasks where we would like to have multiple novel generations given the same context, e.g., generating from a language model (Caccia et al., 2020). In these cases, GOLD may not be able to provide a variety of high-quality generations given one context, although it would still produce different outputs given different contexts. Another potential failure mode is that in open-ended dialogues, if one common response has large probability under the true data distribution, then GOLD may lead to a distribution concentrated on this mode. In this case, additional inductive bias is needed to separate good modes from bad ones, e.g., additional reward on specificity of the response. On the other hand, while MLE-trained models have good recall and we can potentially sample many different outputs with a high temperature, or large k in top-k sampling, or large p in top-p sampling, there are only a few high-quality ones.
Our conjecture is that there may not be enough data to cover all modes, and in fact high-likelihood outputs from MLE-trained models are often degenerate (Stahlberg & Byrne, 2019; Cohen & Beck, 2019; Holtzman et al., 2020). In sum, given the trade-off between diversity and quality, we argue that generating a single high-quality output is a reasonable goal for most conditional text generation tasks, and we leave the question of generating both diverse and high-quality outputs to future work.

(2) Diversity as in the linguistic complexity of the output, given the input. First, we compare GOLD and MLE by measuring the complexity of the output using the number of unique n-grams and do not find a significant difference. For example, GOLD's number of unique 1/2/3/4/5-grams for XSum (using BART) is 18846/18835/18103/17639/17258, MLE's is 19071/19053/18349/17875/17531, and the gold-standard target numbers are 23674/23661/22869/22280/21822. In addition, for question generation and summarization, we measure the complexity of the output by abstractiveness, i.e., the proportion of n-gram overlaps between the input and the generation. For XSum (using BART), the proportion of 1/2/3/4/5-gram overlap for MLE is 0.75/0.27/0.10/0.053/0.031 and for GOLD is 0.73/0.24/0.087/0.039/0.021; the trend mostly holds for NQG and CNN/DM as well. In sum, we conclude that GOLD and MLE are comparable in producing complex or novel outputs.

(3) Diversity as in the coverage of the true data distribution. This definition is related to (1). Intuitively, this diversity is the "recall"; it can be measured by NLL loss or perplexity, and it is what GOLD sacrifices. In our case, the consequence is that the model tends to ignore difficult gold examples (Figure 1), which in text generation may sometimes be noise or outliers. Empirically, for a large number of text generation tasks, paying less attention to such examples did not cause mode collapse in our case.
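The two complexity measures used in item (2) can be sketched as follows. The exact counting conventions behind the reported numbers are not specified here, so the definitions below (distinct n-grams across all outputs, and the fraction of a generation's n-grams that also appear in its input) are assumptions, as are the function names and the toy sentences.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unique_ngram_count(outputs, n):
    """Number of distinct n-grams across a collection of tokenized outputs."""
    seen = set()
    for tokens in outputs:
        seen.update(ngrams(tokens, n))
    return len(seen)

def overlap_proportion(source, generation, n):
    """Fraction of the generation's n-grams that also occur in the source."""
    gen = ngrams(generation, n)
    if not gen:
        return 0.0
    src = set(ngrams(source, n))
    return sum(1 for g in gen if g in src) / len(gen)

source = "the cat sat on the mat".split()
generation = "the cat lay on the mat".split()

print(unique_ngram_count([source, generation], 1))
print(overlap_proportion(source, generation, 1))
print(overlap_proportion(source, generation, 2))
```

A lower overlap proportion indicates a more abstractive generation under this definition.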

5. RELATED WORK

Exposure bias. In structured prediction, there has been a flurry of work addressing exposure bias since Bengio et al. (2015). Most works focus on learning global sequence scores instead of locally normalized scores using either variants of beam search (Wiseman & Rush, 2016; Andor et al., 2016; Goyal et al., 2018) or energy networks (Belanger & McCallum, 2016; Tu et al., 2020). These training algorithms are often complex and costly. Exposure bias is well studied in imitation learning (Daumé et al., 2009; Ross et al., 2011), and learning-to-search has been applied to RNNs to incorporate losses of sequences deviating from references (Leblond et al., 2018), but these methods require annotations or cost functions on non-reference sequences, which may not be available for text generation.

Objectives beyond MLE. Policy gradient-based algorithms and their variants have been used extensively in text generation to optimize sequence-level metrics (Ranzato et al., 2016; Shen et al., 2016; Norouzi et al., 2016; Pasunuru & Bansal, 2018). In addition, off-policy RL is commonly used in dialogue where online interaction with users is expensive (Serban et al., 2017; Jaques et al., 2019). The main difference is that we take advantage of the demonstrations and design generic reward functions for generation tasks. There is another line of work using policy gradient to optimize reward from a discriminator that differentiates good vs. bad generations (Yu et al., 2017; Li et al., 2017; Lu et al., 2019). However, these approaches often underperform MLE in practice (Tevet et al., 2019) due to optimization challenges. Recently, a concurrent work, Kang & Hashimoto (2020), proposed truncated log-loss, which both optimizes distinguishability and enjoys efficient optimization.

High-precision text generation. It was noticed early in neural text generation that MLE tends to produce high-recall models that over-generalize. Previously, high-quality outputs were selected mainly through decoding (e.g., beam search, low-temperature sampling, truncated sampling). Recently, there has been an increasing amount of work on discouraging implausible samples during training, e.g., using negative sampling (Welleck et al., 2020b), self-training on high-quality samples (Kedzie & McKeown, 2019), and confidence-oriented decoding with calibration (Tian et al., 2020). In contrast, we tackle the fundamental problem of mismatched objectives and propose a general learning framework.

6. CONCLUSION

We provide an efficient algorithm that addresses the two train/test discrepancies in MLE training for text generation: likelihood as learning objective vs. quality as evaluation metric; gold history in training vs. model-generated history in inference. We have demonstrated that off-policy RL is a promising framework for text generation, with matched train/test objectives and optimization advantages like MLE. We believe more advanced off-policy learning techniques (e.g., proximity constraints) can be easily integrated into text generation and further improve performance.

A PRACTICAL SETUP AND IMPLEMENTATION

A.1 TASKS AND DATASETS

(1) Natural question generation (NQG; Zhou et al., 2017), based on the SQuAD QA dataset (Rajpurkar et al., 2016): given a text passage and a short span of the passage, the goal is to generate a question that can be answered by the span.
(2) CNN/DailyMail summarization (CNN/DM): given a piece of news, generate a few sentences of summary. We use the entity-non-anonymized version of the CNN/DM dataset, following See et al. (2017). The target summaries tend to be extractive, meaning there tends to be heavy text-span overlap between the source article and the target summary.
(3) Extreme summarization (XSum; Narayan et al., 2018) is based on BBC news. The target summaries are highly abstractive. Past extractive strategies that work well for CNN/DM may not work well for XSum.
(4) IWSLT14 German to English machine translation (IWSLT14 De-En; Cettolo et al., 2014) is a popular machine translation benchmark. Machine translation is different from the above three tasks, given that intuitively, the space of high-quality generations is smaller.

More details on datasets. We first provide the number of examples in each dataset. The train/dev/test split for NQG is 86229/8913/8919; the split for CNN/DM is 287227/13368/11490; the split for XSum is 204045/11332/11334; the split for IWSLT14 De-En is 160239/7283/6750. To download and preprocess the NQG data, we follow the instructions at https://github.com/clovaai/FocusSeq2Seq; to download and preprocess the summarization data, we follow the instructions at https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md; to download and preprocess the IWSLT14 De-En data, we follow the instructions at https://github.com/pytorch/fairseq/tree/master/examples/translation. More information can be found in our codebase.

A.2 MODEL ARCHITECTURES

We use two sets of architectures for our experiments.

Standard architectures. For NQG, we use the model NQG++ (Zhou et al., 2017), a seq2seq-with-attention model based on GRU (Cho et al., 2014), and for summarization we use the pointer-generator network (See et al., 2017), a seq2seq-with-attention model based on LSTM (Hochreiter & Schmidhuber, 1997). Specifically, we use 2 layers for both the encoder and the decoder, for both tasks. Other hyperparameters are based on the following implementation: https://github.com/clovaai/FocusSeq2Seq.

Transformer architectures. For NQG, CNN/DM, and XSum, we also experiment with one of the top-performing models, BART (Lewis et al., 2020). Our experiments are based on the pretrained BART model provided by the original authors: it has 12 encoder layers and 12 decoder layers, and it is pretrained on around 3.3 billion words of Wikipedia articles and books. We use the model to investigate if our methods work with models with stronger capabilities. For IWSLT14 De-En, we use a moderate-size standard transformer architecture (encoder/decoder embedding dimension 512, 4 encoder attention heads, 6 encoder layers, 4 decoder attention heads, 6 decoder layers), a top-performing architecture in machine translation.

A.3 MORE ON REPRODUCIBILITY

The codebase is released. The link to the code is posted on the following website: yzpang.me.

Hyperparameters and training details on standard architectures. This paragraph corresponds to results in Table 1. We use a learning rate of 5e-4. For NQG, we use a batch size of 32; for CNN/DM, we use a batch size of 16. We train using a single Nvidia GTX 1080 Ti (memory: 12 GB) GPU. As discussed in Section 3.3 and Section 4.1, we tune the lower bound of $p_{\text{MLE}}$ in {0, 0.01, 0.05, 0.1}. For NQG models, a lower bound of 0.1 produces the best performance. For CNN/DM using GOLD-p, the lower bound is 0.01; for CNN/DM using GOLD-s, the lower bound is 0. Recall that, as discussed in Section 3.3, the weighting policy $\tilde{\pi}_\theta$ synchronizes with the actual policy $\pi_\theta$ once every k steps so as to stabilize training. We tune k ∈ {1500, 2691} (where 2691 steps corresponds to 1 epoch) for NQG and found that k = 1500 works better for all NQG models. We tune k ∈ {1500, 3000, 5000} for CNN/DM; we found that k = 1500 works best for GOLD-δ and GOLD-p, and k = 5000 works best for GOLD-s. In practice, we do not observe big gaps when using the other values of k in each set. For standard models, the implementation is based on Cho et al. (2019). In all experiments, we evaluate once every epoch on the entire dev set, using task-specific metrics (BLEU/ROUGE-2), following Cho et al. (2019) and standard practice in machine translation.

Hyperparameters and training details on transformer models. This paragraph corresponds to results in Table 3. For transformer models, we use Nvidia P40 GPUs (memory: 24 GB each). For NQG, CNN/DM, and XSum based on BART, we use 4 GPUs to train. For IWSLT14 De-En, we use 1 GPU. Note that fairseq defines batch size in terms of the number of tokens instead of the number of sequences.
For NQG, we use 512 tokens as batch size (for each of the four GPUs); for CNN/DM and XSum, we use 1024 tokens as batch size (for each of the four GPUs); for IWSLT14 De-En, we use 4096 tokens as batch size. We use a learning rate of 2e-5 for NQG, CNN/DM, and XSum, and 3e-4 for IWSLT14 De-En. Recall that, as discussed in Section 3.3, the weighting policy $\tilde{\pi}_\theta$ synchronizes with the actual policy $\pi_\theta$ once every k steps so as to stabilize training. Here, k = 1000 for NQG and k = 5000 for CNN/DM, XSum, and IWSLT14 De-En. As discussed in Section 3.3 and Section 4.1, the lower bound of $p_{\text{MLE}}$ is set to 0.01 for GOLD-p and 0.1 for GOLD-s. For all other parameters that are not specific to GOLD, we use the default fairseq summarization parameters (which can be found through footnote 10). For the hyperparameter u discussed in Section 4.1, we use u = 0.1 for NQG and CNN/DM, u = 0.15 for XSum, and u = 0.2 for IWSLT14 De-En. As indicated, the hyperparameters were tuned only over a small set of possible values. More careful tuning may result in slightly better performance.

Number of parameters in each model. For standard models, we use NQG++ for NQG, and it has 10372565 parameters. We use the pointer generator network for CNN/DM, and it has 19965705 parameters. For transformer models, the BART models for NQG, CNN/DM, and XSum each have 406290432 parameters; the transformer model used for IWSLT14 De-En has 39469056 parameters.

Average runtime. For standard models, based on the above models and computing infrastructure, each epoch of NQG takes around 10 minutes to train and achieves best performance within 20 epochs. Each epoch of CNN/DM takes about 2 hours to train and achieves best performance within 15 epochs.
For transformer models, each epoch of NQG takes around 5 minutes to train and achieves best dev performance within 5 epochs; each epoch of CNN/DM takes around 11 hours to train and achieves best dev performance within 5 epochs; each epoch of XSum takes around 8 hours to train; each epoch of IWSLT14 De-En takes around 3 minutes to train and achieves best performance within 100 epochs (as expected, given the large batch size; see footnote 11). Note that our transformer models are trained on P40s given hardware constraints; if they were trained on V100 GPUs, for example, the training time per epoch would likely be much shorter.
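The periodic synchronization of the weighting policy and the lower bound on $p_{\text{MLE}}$ described above can be sketched as follows. This is a minimal illustration, not the released implementation: `ToyPolicy`, `train_with_periodic_sync`, and `lower_bounded` are hypothetical names, the parameter update is a placeholder, and applying the lower bound as a simple `max` is an assumption.

```python
import copy


class ToyPolicy:
    """Hypothetical stand-in for a model; holds one integer 'parameter'."""

    def __init__(self, weight=0):
        self.weight = weight


def lower_bounded(p_mle, lower_bound):
    """Clip a probability from below (assumed form of the lower bound)."""
    return max(p_mle, lower_bound)


def train_with_periodic_sync(num_steps, k):
    """Update the policy every step; refresh the weighting policy every k steps."""
    policy = ToyPolicy()
    weighting_policy = copy.deepcopy(policy)  # frozen copy used for weighting
    sync_steps = []
    for step in range(1, num_steps + 1):
        # Placeholder for a gradient step whose loss is weighted by
        # probabilities computed from weighting_policy (not from policy).
        policy.weight += 1
        if step % k == 0:
            # Synchronize: the weighting policy catches up with the policy.
            weighting_policy = copy.deepcopy(policy)
            sync_steps.append(step)
    return policy, weighting_policy, sync_steps
```

With `num_steps=3500` and `k=1500`, the weighting policy is refreshed at steps 1500 and 3000 and lags behind the policy in between, which is the stabilizing effect the text refers to.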

A.4 MORE DISCUSSION ON APPROXIMATIONS

Recall that we truncated the future trajectory after five steps. In other words, the number of current+future steps is upper-bounded at six. Effectively, we are using a discount factor of 0.83.
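The value 0.83 can be checked by matching the six-step horizon to the effective horizon of an infinite discounted sum, using the geometric series given in the footnotes:

```latex
\[
1 + \gamma + \gamma^{2} + \dots = \frac{1}{1-\gamma} = 6
\quad\Longleftrightarrow\quad
\gamma = \frac{5}{6} \approx 0.83 .
\]
```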

B.3 MORE ON EXPOSURE BIAS

With exposure bias. Recall that in Section 4.2, we used human evaluation (a score from 1 to 4) to approximate the output quality, and we found that the quality of the MLE-trained model degrades significantly when the generation is long, whereas the quality of the GOLD-s-trained model stays relatively stable across lengths. Here, we use BLEU to approximate the quality of NQG generations, and we show that BLEU is not biased against long sentences. Figure 3 shows the average sentence-level BLEU by sequence length (footnote 13). Specifically, Figures 3a and 3c show the BLEU on randomly shuffled targets (from the dev set), which suggests that BLEU does not penalize longer sentences. Figures 3b and 3d show the BLEU by sentence length on model generations. We see that MLE's BLEU decreases with length, but GOLD-s's BLEU appears to stay relatively stable. We thus see some evidence that MLE generates worse sentences as sentences get longer.

If there is no exposure bias. In the main text, we used the NLL loss vs. length plot to demonstrate that without exposure bias, the loss does not vary much across lengths, so the MLE performance drop in Figure 2 (left) is mainly due to exposure bias. Here, we provide another way to analyze the case without exposure bias. Figure 4 shows the token prediction accuracy conditioned on gold histories on the NQG dev set. Note that for each example, we let $t_x = L_x - 5$, where $L_x$ is the length of reference sentence x. We can see that without exposure bias, prediction accuracy does not vary much as the length increases. Therefore, we conclude that the big performance drop for long generations using MLE is mainly due to exposure bias, and that GOLD suffers less from the problem.
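The per-length averages and 95% bootstrap confidence intervals reported in Figure 3 could be computed along the following lines. This is a sketch under stated assumptions: sentence-level scores (e.g. BLEU) are taken as given inputs, the percentile bootstrap is one common variant of "standard bootstrapping", and `bootstrap_ci` and `mean_score_by_length` are hypothetical names.

```python
import random
from collections import defaultdict


def bootstrap_ci(values, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(num_resamples)
    )
    lower = means[int(num_resamples * alpha / 2)]
    upper = means[int(num_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


def mean_score_by_length(lengths, scores):
    """Group scores by sentence length; return {length: (mean, (lo, hi))}."""
    buckets = defaultdict(list)
    for length, score in zip(lengths, scores):
        buckets[length].append(score)
    return {
        length: (sum(vals) / len(vals), bootstrap_ci(vals))
        for length, vals in sorted(buckets.items())
    }
```

Plotting the mean against length with the interval as a shaded band gives a figure of the same form as Figures 3b and 3d.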

B.4 EXAMPLES

Table 7 and Table 8 show the example generations based on the transformer models.

C HUMAN EVALUATIONS

C.1 PAIRWISE COMPARISON

Our goal is to enable high-quality generations that do not necessarily match the gold references. Given that corpus-level BLEU/ROUGE is only a popular approximation of generation quality, we first conduct human ratings to confirm the hypothesis that our approaches generate better sequences. For NQG, for each unit of human evaluation, we present the source paragraph, the words to ask the question on, the question generated by the MLE-trained model, and the question generated by the GOLD-s-trained model. We ask the human evaluators the general question: which generated question is better? Figure 5 shows one example interface of pairwise comparisons. Using the NQG dev set, on standard models, of the 183 pairs we collected human evaluations on, 42 (23.0%) MLE questions are better, 81 (44.3%) GOLD-s questions are better, and 60 (32.8%) are tied. We also evaluate models based on BART, shown in Table 5 in the main text. For summarization tasks, given that it is infeasible to get high-quality annotations if we let workers read the entire news article (footnote 14), we only did the following: given the reference summary, a summary generated by the MLE model, and a summary generated by our model, we asked workers which generated summary is closer in meaning to the reference summary. Figure 6 shows one example interface of this pairwise comparison for summarization. See Table 5 for results.

We sampled 736 annotations such that each bucket would contain at least 30 sentences (for human evaluation) for each of MLE and GOLD-s. We also show the 95% confidence interval using standard bootstrapping in Figure 2 (left). Given the paragraph, the words to query on, and the generations, we ask workers to rate the generations. Figure 7 shows an example interface of NQG human ratings.
We ask workers to consider both the correctness of the generation (i.e., if the question is asking about the specified words using facts) and the quality of the generation (i.e., if the generation is fluent and coherent). We ask workers to rate from 1 to 4, where 1 means very bad, 2 means slightly below average, 3 means slightly above average, and 4 means very good.
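As a quick check of the pairwise-comparison percentages reported above for NQG with standard models (42, 81, and 60 out of 183 pairs), the tally can be reproduced with a short script. The function name `tally_pairwise` and the label strings are hypothetical; only the counts come from the text.

```python
from collections import Counter


def tally_pairwise(judgments):
    """Return {label: (count, percentage rounded to one decimal)}."""
    counts = Counter(judgments)
    total = len(judgments)
    return {
        label: (counts[label], round(100 * counts[label] / total, 1))
        for label in ("mle", "gold", "tie")
    }


# Counts reported for the 183 NQG pairs.
judgments = ["mle"] * 42 + ["gold"] * 81 + ["tie"] * 60
result = tally_pairwise(judgments)
# result["mle"]  -> (42, 23.0)
# result["gold"] -> (81, 44.3)
# result["tie"]  -> (60, 32.8)
```

The three percentages sum to 100.1 because each is rounded independently.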



1. Note that $\mathrm{KL}(p_{\text{human}} \,\|\, q) = \mathbb{E}_{p_{\text{human}}} \log p_{\text{human}} - \mathbb{E}_{p_{\text{human}}} \log q$; thus minimizing the KL divergence with respect to q is equivalent to the MLE objective.
2. Code: https://github.com/yzpang/gold-off-policy-text-gen-iclr21
3. We didn't use standard seq2seq-based models for IWSLT14 and XSum as they are not competitive.
4. We used Amazon Mechanical Turk on 200 pairs of examples for each of the three tasks. We added qualification tasks with obvious answers, and we didn't use any results from workers who failed the qualifications.
5. We also used BLEU as a quality metric and observed similar results, shown in Appendix B.3.
6. A token prediction accuracy vs. time-step plot, which shows a similar trend, is shown in the appendix.
7. While rewards in Section 3.2 are useful on demonstrations, they are not suitable for the on-policy setting as they cannot differentiate good vs. bad generations on the entire output space effectively; e.g., Murray & Chiang (2018) and Stahlberg & Byrne (2019) showed that maximizing $p_{\text{MLE}}$ during decoding leads to empty generations.
8. On a related note, Choshen et al. (2020) also showed in machine translation that even properly tuned on-policy methods may not work well.
9. Top-p sampling is another term for nucleus sampling by Holtzman et al. (2020).
10. https://github.com/pytorch/fairseq/tree/master/examples; the model is pretrained by corrupting the original document and optimizing the reconstruction loss between the original document and the decoder output.
11. We use 4096 tokens (which corresponds to hundreds of sentences) as batch size for IWSLT14 De-En.
12. Note that $1 + \gamma + \dots + \gamma^{T} \approx \frac{1}{1-\gamma} = 6$ when $\gamma = \frac{5}{6}$.
13. We use BLEU-2/3 given that, without smoothing, sentence-level BLEU-4 results in large variance.
14. To obtain high-quality, low-variance annotations, we may need to design QA tasks to make sure workers understood the news articles first, given that the articles are usually very long.



Figure 1: Histograms of token-level NLL loss using standard models on NQG and CNN/DM dev sets. MLE learns high-recall models whose loss distribution is spread out; GOLD learns high-precision models whose loss distribution is concentrated on near-zero losses.

Figure 2: Left: Avg human ratings vs. generation length, on 736 NQG samples. (Colored regions: 95% confidence interval.) Each data point has ≥30 annotations. The quality of long generations from MLE-trained model drops heavily, but stays stable across lengths for GOLD-s generations. Right: Avg NLL loss of tth token given the gold prefix tokens vs. time-step t, on NQG dev set. Without exposure bias, NLL loss stays stable across lengths.


Figure 3: Exposure bias related figures on NQG dev set. Vertical axis: avg unsmoothed sentence-level BLEU. Horizontal axis: sentence length. The colored regions represent the 95% confidence interval obtained using standard bootstrapping. Subfigures (a) and (c) show BLEU on randomly shuffled targets (from dev set); BLEU does not appear to punish long sentences. Note the scale of the vertical axes. Subfigures (b) and (d) show BLEU vs. generation length; BLEU on generations from the MLE-trained model decreases with length, but BLEU on generations from the GOLD-trained model appears to stay relatively stable.

Figure 5: Interface for NQG pairwise comparisons, using Amazon Mechanical Turk.

Figure 6: Interface for summarization pairwise comparisons, using Amazon Mechanical Turk.

$p_{\text{MLE}}(a_t \mid s_t)$. Thus a sequence has high reward only if every word has high likelihood under $p_{\text{MLE}}$. To allow for partial credit even if bad actions are taken at certain steps, we define another reward function corresponding to the sum of probabilities:

                                                 ppl ↓
MLE      14.23   29.25   39.00   17.10   36.07   20.11
GOLD-δ   14.96  110.58   39.02   17.16   35.98  133.10
GOLD-p   15.93  148.84   39.20   17.31   36.23  143.58
GOLD-s   16.10  158.45   39.95   17.81   36.81   29.80

Dev set results of standard models using different decoding algorithms. b: beam size. We report the average of 3 runs for top-k sampling. Models trained by GOLD are less sensitive to decoding algorithms.

Results using transformer models on test sets. The advantage of GOLD is maintained on advanced models based on transformers and pretraining.



Human comparisons on 200 randomly selected test examples for each task. Win: % generations from GOLD-trained BART that are better than from MLE-trained BART, given the same source.

NQG generations using standard models. Words to query on are bolded. Long generations from MLE-trained model often result in repetition or hallucination. More examples in appendix.

ACKNOWLEDGEMENTS

The authors thank Kyunghyun Cho, Tatsunori Hashimoto, Graham Neubig, Ethan Perez, Karl Stratos, Clara Vania, and Alex Warstadt (alphabetical order) for helpful discussions, and the anonymous reviewers for helpful feedback. This work was supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: From Pattern Recognition to AI) and Samsung Research (Improving Deep Learning Using Latent Structure).


Given that the Q-value corresponds to the future return, we attempted two strategies: (1) using the entire future trajectory, and (2) using a fixed number of future steps. We attempted (1) on NQG using the standard models (tuning the discount factor in {1, 0.9, 0.8, 0.7, 0.5}) and found that {0.8, 0.7} usually perform best, resulting in similar performance but longer training time compared to the current 5-future-step approach. We attempted (2) using a number of future steps in {1, 2, 3, 5, 7, 10} and found that {5, 7, 10} lead to similar results, which are slightly better than those of {1, 2, 3}. One benefit of using a fixed number of future steps is that easy-to-tune constant baselines work well, and training time is also much shorter.
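The two strategies can be contrasted with a small sketch of the return computation. This is an illustration under assumptions, not the released code: `truncated_returns` is a hypothetical name, per-step rewards are taken as given, and passing `num_future_steps=None` is a convention chosen here to mean "use the entire future trajectory".

```python
def truncated_returns(rewards, num_future_steps=5, discount=1.0):
    """Estimate the return at each step from the current reward plus a
    limited number of future rewards.

    With num_future_steps=5, each estimate sums at most six rewards
    (current + five future). With num_future_steps=None, the entire
    remaining trajectory is used, discounted by `discount`.
    """
    returns = []
    for t in range(len(rewards)):
        if num_future_steps is None:
            window = rewards[t:]
        else:
            window = rewards[t:t + num_future_steps + 1]
        returns.append(
            sum((discount ** i) * r for i, r in enumerate(window))
        )
    return returns
```

For a sequence of eight steps with reward 1 at every step, the default setting yields 6 for the first three steps and then 5, 4, 3, 2, 1 as the trajectory runs out.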

A.5 DETAILS ON ON-POLICY EXPERIMENTS

For the MLE+PG baseline, we used the REINFORCE algorithm with sequence-level rewards (BLEU for NQG and ROUGE-2 for summarization). We attempted two versions of the baselines: (i) constant baselines searched in {0, 0.01, 0.05, 0.1, 0.15} for BLEU (NQG) and ROUGE-2 (CNN/DM), and (ii) baselines computed as the average BLEU/ROUGE-2 over the last 100 steps, minus a value in {0, 0.05}.

In terms of training warmup, we tried two versions of the training algorithm. (a) We initialized with MLE and trained with PG losses interpolated with MLE losses, given that we found the training process to be very unstable without interpolation. (b) We initialized the model at random and used MIXER (Ranzato et al., 2016). However, we failed to find improvements compared to (a) under our architecture. A relevant work, Choshen et al. (2020), showed that properly tuned on-policy RL may not work for text generation in some cases.

We also tried MIXER with the learned baseline for NQG, which is estimated by a simple linear regressor that takes RNN hidden states as inputs, following Ranzato et al. (2016). After some tuning, we achieved only slight improvements in NQG (BLEU 14.71). One advantage of GOLD is that our algorithm does not rely on learned baselines, which could have a big impact on the performance of on-policy algorithms; in fact, all baselines are constants in this paper.

Note that for GOLD+PG models, we only attempted constant baselines; better tuning of baselines could potentially lead to stronger performance.
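The moving-average baseline in (ii) and the interpolated loss in (a) can be sketched on scalars as follows. This is a hedged illustration rather than the implementation used for the experiments: `MovingAverageBaseline`, `pg_mle_loss`, and the interpolation weight `pg_weight` are hypothetical (the text does not state the interpolation weight), and in a real setup `seq_log_prob` would be a differentiable tensor from the model.

```python
from collections import deque


class MovingAverageBaseline:
    """Average reward over the last `window` steps, minus a fixed offset."""

    def __init__(self, window=100, offset=0.0):
        self.history = deque(maxlen=window)  # old rewards drop out automatically
        self.offset = offset

    def value(self):
        if not self.history:
            return 0.0  # assumed default before any reward is observed
        return sum(self.history) / len(self.history) - self.offset

    def update(self, reward):
        self.history.append(reward)


def pg_mle_loss(seq_log_prob, reward, baseline, mle_loss, pg_weight=0.5):
    """REINFORCE loss with a baseline, interpolated with the MLE loss.

    seq_log_prob: log-probability of the sampled sequence under the policy.
    reward: sequence-level reward (e.g. BLEU or ROUGE-2) of that sample.
    """
    pg_loss = -(reward - baseline) * seq_log_prob
    return pg_weight * pg_loss + (1.0 - pg_weight) * mle_loss
```

A constant baseline, as in (i), corresponds to passing a fixed number as `baseline` instead of `MovingAverageBaseline.value()`.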

B.1 LEAD-3 BASELINES FOR SUMMARIZATION

The lead-3 baseline (using the first 3 sentences of the article as the summary) is a popular strong baseline in the summarization literature. The ROUGE-1/2/L scores of the lead-3 baselines are as follows: 40.42/17.62/36.67 for CNN/DM; 16.30/1.60/11.95 for XSum. Our transformer models beat these baselines by a large margin.
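For concreteness, a lead-3 summary can be produced as in the sketch below. The regular-expression sentence splitter is a naive assumption made for illustration (it splits after `.`, `!`, or `?` followed by whitespace) and is not the tokenization behind the scores above; `lead_3` is a hypothetical name.

```python
import re


def lead_3(article, num_sentences=3):
    """Return the first `num_sentences` sentences of `article` as the summary."""
    # Naive splitter: break on whitespace that follows sentence-final punctuation.
    sentences = [
        s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s
    ]
    return " ".join(sentences[:num_sentences])
```

Articles with fewer than three sentences are returned unchanged apart from whitespace normalization between sentences.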

B.2 PERFORMANCE WITH TRANSFORMER ARCHITECTURES

We experiment using transformer architectures, as shown in Table 3; we also experiment on two more tasks (compared to using standard architectures): XSum and IWSLT14 De-En. We achieve SOTA/near-SOTA results (according to automatic metrics, which have inherent limitations) on CNN/DM: at the time of writing, our results (45.40/22.01/42.25 using GOLD-p or 44.82/22.09/41.81 using GOLD-s) are higher than 44.17/21.47/41.11 (PEGASUS; Zhang et al., 2020) and 44.20/21.17/41.30 (ProphetNet; Qi et al., 2020), both slightly higher than BART. Note that the PEGASUS model behind the CNN/DM result is pretrained on 1.5B news articles (around 3.8 terabytes), whereas BART is pretrained on 3.3B words (around 0.16 terabytes). Our XSum results are also higher than PEGASUS (45.20/22.06/36.99) trained on the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), but lower than the PEGASUS result using the publicly unavailable 1.5B-article, 3.8-terabyte HugeNews (Zhang et al., 2020) as the pretraining corpus. We hypothesize that if our approach is applied to their architectures instead of pointer generator networks or BART, we would similarly get non-trivial improvements.

We also achieve a 0.81-point BLEU improvement on IWSLT14 De-En; GOLD-s performs better than the existing approaches that do not use knowledge distillation or data augmentation, as far as the authors are aware.

what is one of the reasons that causes wages to be lower ?
GOLD-s: why do wages go down when there is competition amongst workers ?
reference: why does competition among workers drive down wages ?

NQG input: During the mid-eocene , it is believed that the drainage basin of the Amazon was split along the middle of the continent by the Purus Arch .
MLE: when was the purus arch formed ?
GOLD-s: when was the drainage basin of the amazon split ?
reference: in which point did the drainage basin of the amazon split ?

"wrong" on trade policy . Julian Zelizer: If Hillary Clinton wants to prove she's a real populist, now is her chance to be even more clear about her position on the TPP deal .
reference: Sen. Elizabeth Warren has publicly criticized so-called "fast track" trade authority . Sally Kohn: Why does President Obama call her wrong, and why is Hillary Clinton equivocating?

so there are all the tools available, and the only thing that's licensed to us is our imagination .
GOLD-s: so there are all the tools there, and the only thing that limited us is our imagination.
reference: so all the tools are out there, and the only thing that limits us is our imagination.

IWSLT14 De-En input: unser organismus hat eine großartige methode erfunden, um solche unangenehmen gefühle wie neid einfach zum verschwinden zu bringen.
MLE: our organism has invented a great way to get such uncomfortable emotions as neither of us to disappear.
GOLD-s: our organism invented a great way to make such uncomfortable emotions like envy easy to disappear.
reference: our organism has come up with an excellent method to make unpleasant feelings like envy simply disappear.

