IS REINFORCEMENT LEARNING (NOT) FOR NATURAL LANGUAGE PROCESSING: BENCHMARKS, BASELINES, AND BUILDING BLOCKS FOR NATURAL LANGUAGE POLICY OPTIMIZATION

Abstract

We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs, for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any decoder-only or encoder-decoder LM in the HuggingFace library (Wolf et al., 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization), that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al., 2017)), based on both automatic and human evaluations.

1. INTRODUCTION

The ultimate aim of language technology is to interact with humans. However, most language models are trained without direct signals of human preference, with supervised target strings serving as a (sometimes crude) proxy. One option to incorporate user feedback is human-in-the-loop training, i.e., a user would be expected to provide feedback for each sample online as the model trains, but this degree of dense supervision is often prohibitive and inefficient. Automated metrics offer a promising compromise: models of human preference like pairwise learned preference models (Ouyang et al., 2022), BERTScore (Zhang et al., 2019), and BLEURT (Sellam et al., 2020) have significantly improved correlation with human judgment compared to earlier metrics (BLEU, METEOR, etc.), and are cheap to evaluate. But these functions are usually not per-token differentiable: like humans, metrics can only offer quality estimates for full generations. Reinforcement Learning (RL) offers a natural path forward for optimizing non-differentiable, scalar objectives for LM-based generation when it is cast as a sequential decision-making problem.

Figure 1: Here, the LM (i.e., the policy) needs to produce a positive sentiment continuation given a review prompt (we cover other models of human preference in Sec. 3.2). Two objectives are balanced: 1) an automated proxy of human preference that serves as a reward (here: a sentiment classifier); and 2) "naturalness" as measured by a KL divergence from an LM not trained with explicit human feedback. The plots show validation learning curves comparing our NLPO to the popular policy gradient method PPO. (Top plot:) RL methods can easily achieve high reward if the KL penalty is removed, (Bottom:) but at the cost of higher perplexity. NLPO+KL, our proposed approach, succeeds in balancing reward and naturalness more effectively than prior work.
However, Goodhart's Law looms: particularly in the case of imperfect metrics that use neural networks, it is easy to find nonsense samples that achieve high quality estimates. Recent works have shown promising results in aligning LMs to human preferences via RL by constraining preference-based rewards to incorporate notions of fluency (Wu et al., 2021a; Ouyang et al., 2022), but progress in this line of work is heavily hindered by a lack of open-source benchmarks and algorithmic implementations, resulting in the perception that RL is a challenging paradigm for NLP (Choshen et al., 2020; Kreutzer et al., 2021). To facilitate research in building RL algorithms to better align LMs, we release a library, a benchmark, and an algorithm. First, we release the RL4LMs library, which enables generative HuggingFace models (e.g., GPT-2 or T5) to be trained using a variety of existing RL methods such as PPO and A2C. Next, we apply models trained using RL4LMs to the new GRUE (General Reinforced-language Understanding Evaluation) benchmark: GRUE is a collection of 7 contemporary NLP tasks (see Table 1 for details); in contrast to other benchmarks, instead of supervised training, we pair each task with reward function(s). GRUE challenges models to optimize these reward functions while remaining fluent language generators. We train language models via RL, both with and without task-specific supervised pre-training, to optimize rewards. Finally, beyond existing RL methods, we introduce a novel on-policy RL algorithm called NLPO (Natural Language Policy Optimization) that dynamically learns task-specific constraints over the distribution of language at a token level. Experiments on GRUE and human evaluations show that NLPO better balances learning preference rewards while maintaining language fluency compared to alternatives, including PPO (Figure 1).
We find that using RL to learn from scalar reward feedback can be more: (1) data efficient than using additional expert demonstrations via supervised learning (though a combination of both is best)-a learned reward function enables greater performance when used as a signal for an RL method than a supervised method trained with 5 times more data, and (2) parameter efficient-enabling a 220 million parameter model trained with a combination of supervision and NLPO to outperform a 3 billion supervised model. We hope that the benchmarks, baselines, and building blocks we release serve to drive forward research in aligning LMs to human preferences.

2. RELATED WORK

Imitation learning for NLP. Algorithms such as Scheduled Sampling (SS) (Bengio et al., 2015), Parallel SS (Duckworth et al., 2019), SS for Transformers (Mihaylova & Martins, 2019), Differential SS (Goyal et al., 2017), LOLS (Lampouras & Vlachos, 2016; Chang et al., 2015), TextGAIL (Wu et al., 2021b), and SEARNN (Leblond et al., 2017) have been inspired by DAGGER (Ross et al., 2011) and SEARN (Daumé et al., 2009). However, these algorithms are known to suffer from exposure bias in generation (Chiang & Chen, 2021; Arora et al., 2022) and the cliff MDP problem (Huszár, 2015; Agarwal et al., 2019; Swamy et al., 2021).

RL for Large Action Spaces. MIXER (Ranzato et al., 2016) combined ideas from scheduled sampling and REINFORCE (Williams, 1992). Bahdanau et al. (2016) proposed an actor-critic algorithm to address the variance and large-action-space problems that arise when using REINFORCE for language generation; follow-up works such as KG-A2C (Ammanabrolu & Hausknecht, 2020), TrufLL (Martin et al., 2022), AE-DQN (Zahavy et al., 2018), and GALAD (Ammanabrolu et al., 2022) addressed similar issues by attempting to eliminate or reduce the action space during exploration.

RL for NLP. RL, often in the form of bandit learning, has been used to improve models in machine translation (Wu et al., 2016; Nguyen et al., 2017; Kiegeland & Kreutzer, 2021), summarization (Stiennon et al., 2020; Paulus et al., 2017), dialogue (Li et al., 2016; Zhou et al., 2017; Jaques et al., 2020), image captioning (Rennie et al., 2017), question generation (Pang & He, 2021), text games (Narasimhan et al., 2015; Hausknecht et al., 2020), and more (Ranzato et al., 2016; Snell et al., 2022). Lu et al. (2022) adapt reward-conditioned transformers (Chen et al., 2021) for several language generation tasks. RL has also been the focus of efforts to align LMs with human preferences (Stiennon et al., 2020; Wu et al., 2021a; Nakano et al., 2021; Ziegler et al., 2019); e.g., Ouyang et al. (2022) fine-tuned large language models with PPO (Schulman et al., 2017) to align with models of human preference, but their non-public dataset does not enable direct comparison. Though RL has been successful in some of the use cases described above, it has simultaneously been critiqued for being significantly less stable than supervised LM training (Choshen et al., 2020). As a result, there is relatively little consensus on whether RL is a worthwhile consideration for training LMs compared to, say, collecting additional supervised data.

3. RL4LMS: A LIBRARY FOR TRAINING LMS WITH RL

We introduce RL4LMs, an open-source library with building blocks for fine-tuning and evaluating RL algorithms on LM-based generation. The library is built on HuggingFace (Wolf et al., 2020) and stable-baselines-3 (Raffin et al., 2021), combining important components from their interfaces. RL4LMs can be used to train any decoder-only or encoder-decoder transformer model from HuggingFace with any on-policy RL algorithm from stable-baselines-3. Furthermore, we provide reliable implementations of popular on-policy RL algorithms that are tailored for LM fine-tuning, such as PPO (Schulman et al., 2017), TRPO (Schulman et al., 2015a), A2C (Mnih et al., 2016), and our own NLPO (§4). The library is modular, which enables users to plug in customized environments, reward functions, metrics, and algorithms. In the initial release, we provide support for 6 different NLP tasks, 16 evaluation metrics and rewards, and 4 RL algorithms.

3.1. ENVIRONMENTS: GENERATION AS A TOKEN-LEVEL MDP

Each environment is an NLP task: we are given a supervised dataset D = {(x_i, y_i)}_{i=1}^N of N examples, where x ∈ X is a language input and y ∈ Y is the target string. Generation can be viewed as a Markov Decision Process (MDP) ⟨S, A, R, P, γ, T⟩ using a finite vocabulary V. Each episode in the MDP begins by sampling a datapoint (x, y) from our dataset and ends when the current time step t exceeds the horizon T or an end-of-sentence (EOS) token is generated. The input x = (x_0, ..., x_m) is a task-specific prompt that is used as our initial state s_0 = (x_0, ..., x_m), where s_0 ∈ S, S is the state space, and x_m ∈ V. An action in the environment a_t ∈ A consists of a token from our vocabulary V. The transition function P : S × A → Δ(S) deterministically appends an action a_t to the end of the state s_{t-1} = (x_0, ..., x_m, a_0, ..., a_{t-1}). This continues until the end of the horizon t ≤ T, at which point we obtain a state s_T = (x_0, ..., x_m, a_0, ..., a_T). At the end of an episode, a reward R : S × A × Y → R^1 that depends on (s_T, y) (e.g., an automated metric like PARENT (Dhingra et al., 2019)) is emitted. RL4LMs provides an OpenAI Gym (Brockman et al., 2016) style API for an RL environment that simulates this LM-based MDP formulation. This abstraction allows new tasks to be added quickly, with compatibility across all implemented algorithms.
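The token-level MDP above can be sketched as a minimal gym-style environment. This is an illustrative toy, not the RL4LMs API: the dataset, tokenization, and reward function here are placeholder stand-ins.

```python
import random

class TokenMDPEnv:
    """Toy gym-style environment for generation as a token-level MDP.

    A datapoint (x, y) is sampled at reset; the state is the prompt plus all
    tokens generated so far; each action appends one vocabulary token. The
    sparse reward is emitted only when the episode terminates.
    """

    def __init__(self, dataset, reward_fn, eos_token, horizon=20):
        self.dataset = dataset      # list of (prompt_tokens, target_tokens) pairs
        self.reward_fn = reward_fn  # R(final_state, target) -> float
        self.eos = eos_token
        self.horizon = horizon

    def reset(self):
        self.x, self.y = random.choice(self.dataset)
        self.state = list(self.x)   # s_0 = (x_0, ..., x_m)
        self.t = 0
        return tuple(self.state)

    def step(self, action):
        # Deterministic transition: append the chosen token to the state.
        self.state.append(action)
        self.t += 1
        done = (action == self.eos) or (self.t >= self.horizon)
        reward = self.reward_fn(tuple(self.state), self.y) if done else 0.0
        return tuple(self.state), reward, done, {}

# Usage with a toy reward: fraction of target tokens present in the generation.
def overlap_reward(state, target):
    return sum(tok in state for tok in target) / max(len(target), 1)

env = TokenMDPEnv([(("a", "b"), ("c", "d"))], overlap_reward, eos_token="<eos>", horizon=3)
s = env.reset()
s, r, done, _ = env.step("c")       # episode continues, reward 0.0
s, r, done, _ = env.step("<eos>")   # EOS terminates; terminal reward is emitted
```

The key property mirrored here is the sparse, episode-terminal reward: intermediate steps return 0.0, which is what motivates the token-level KL shaping described in Sec. 3.3.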

3.2. REWARD FUNCTIONS AND EVALUATION METRICS

Because RL4LMs provides a generic interface for per-token or per-sequence generation rewards, it is possible to quickly apply a wide array of RL algorithms to a similarly diverse range of textual metrics-as-rewards. Specifically, we provide interfaces to: 1) n-gram overlap metrics such as ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), SacreBLEU (Post, 2018), and METEOR (Banerjee & Lavie, 2005); 2) model-based semantic metrics such as BertScore (Zhang et al., 2019) and BLEURT (Sellam et al., 2020), which generally provide higher correlation with human judgment; 3) task-specific metrics such as CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016) (for captioning/commonsense generation), PARENT (Dhingra et al., 2019) (for data-to-text), and SummaCZS (Laban et al., 2022) (for factuality of summarization); 4) diversity/fluency/naturalness metrics such as perplexity, Mean Segmental Type-Token Ratio (MSTTR) (Johnson, 1944), Shannon entropy over unigrams and bigrams (Shannon, 1948), the ratio of distinct n-grams over the total number of n-grams (Distinct-1, Distinct-2), and the count of n-grams that appear only once in the entire generated text (Li et al., 2015); and 5) task-specific, model-based human preference metrics such as classifiers trained on human preference data collected following the methodology of Ouyang et al. (2022).
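Several of these metrics are simple to compute directly. As an illustration (not the RL4LMs implementation), the Distinct-n diversity metric mentioned above is just the ratio of unique n-grams to total n-grams in the generated text:

```python
def distinct_n(tokens, n):
    """Distinct-n (Li et al., 2015): unique n-grams / total n-grams.

    Higher values indicate more diverse generations; degenerate repetitive
    text scores near 0.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# "the cat sat on the mat": 6 unigrams, 5 unique ("the" repeats).
tokens = "the cat sat on the mat".split()
d1 = distinct_n(tokens, 1)   # 5/6
d2 = distinct_n(tokens, 2)   # all 5 bigrams are unique -> 1.0
```

Metrics like this are cheap enough to evaluate on every rollout, which is what makes them practical as per-episode rewards during RL training.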

3.3. ON-POLICY ACTOR-CRITIC ALGORITHMS

RL4LMs supports fine-tuning and training LMs from scratch via on-policy actor-critic algorithms on language environments. Formally, this class of algorithms allows us to train a parameterized control policy π_θ : S → Δ(A), a function that attempts to select actions so as to maximize the long-term discounted reward over a trajectory, E_π[Σ_{t=0}^T γ^t R(s_t, a_t)]. Our benchmark experiments focus on fine-tuning a pre-trained LM, denoted π_0, from which we initialize our agent's policy π_θ = π_0. Similarly, the value network V_φ used to estimate the value function is also initialized from π_0, except for the final layer, which is randomly initialized to output a single scalar value. As with other deep RL actor-critic algorithms, we define the value and Q-value functions as V_t^π = E_{a_t∼π}[Σ_{τ=t}^T γ^{τ-t} R(s_τ, a_τ, y)] and Q_t^π(s_t, a_t) = R(s_t, a_t, y) + γ E_{s_{t+1}∼P}[V_{t+1}^π(s_{t+1})], leading to a definition of the advantage function as A_t^π(s, a) = Q_t^π(s, a) - V_t^π. To increase training stability, the advantage is approximated using Generalized Advantage Estimation (Schulman et al., 2015b). Because the environment rewards are sequence-level and sparse, following Wu et al. (2021a) we regularize the reward function with a token-level KL penalty for all on-policy algorithms, to prevent the model from deviating too far from the initialized LM π_0. Formally, the regularized reward function is:

R̂(s_t, a_t, y) = R(s_t, a_t, y) - β KL(π_θ(a_t|s_t) || π_0(a_t|s_t))     (1)

where R̂ is the KL-regularized reward, y is the ground-truth prediction, the KL term is estimated per token as log π_θ(a_t|s_t) - log π_0(a_t|s_t), and the KL coefficient β is dynamically adapted (Ziegler et al., 2019). Further details on actor-critic methods can be found in Appendix A.
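Equation 1 amounts to subtracting a per-token log-ratio penalty from the task reward. A minimal sketch (the log-probabilities here are illustrative stand-ins for the two models' outputs, not real model calls):

```python
def kl_regularized_reward(task_reward, logp_policy, logp_init, beta):
    """R_hat(s_t, a_t) = R(s_t, a_t) - beta * (log pi_theta(a_t|s_t) - log pi_0(a_t|s_t)).

    The penalty is a single-sample KL estimate: it is positive when the current
    policy puts more mass on the chosen token than the initial LM does, so the
    agent pays for drifting away from pi_0.
    """
    kl_estimate = logp_policy - logp_init
    return task_reward - beta * kl_estimate

# A token the tuned policy strongly favors over the initial LM gets penalized:
# kl_estimate = (-0.5) - (-2.5) = 2.0, so the shaped reward is 1.0 - 0.1 * 2.0.
r = kl_regularized_reward(task_reward=1.0, logp_policy=-0.5, logp_init=-2.5, beta=0.1)
```

In practice the task reward term is zero at intermediate tokens (the environment reward is sparse), so this KL term is the only per-token learning signal until the episode ends.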

4. NLPO: NATURAL LANGUAGE POLICY OPTIMIZATION

Language generation action spaces are orders of magnitude larger than what most discrete-action-space RL algorithms are designed for (Ranzato et al., 2016; Ammanabrolu, 2021): e.g., GPT-2/3 and T5 have vocabulary sizes of 50K and 32K respectively. We hypothesize that the size of the action space is a core cause of instability when training LMs with existing RL methods. To address this issue, we introduce NLPO (Natural Language Policy Optimization), which is inspired by work on action elimination/invalid-action masking (Zahavy et al., 2018; Huang & Ontañón, 2020; Ammanabrolu & Hausknecht, 2020). NLPO, a parameterized-masked extension of PPO, learns to mask out less relevant tokens in-context as it trains. NLPO accomplishes this via top-p sampling, which restricts tokens to the smallest possible set whose cumulative probability is greater than the probability parameter p (Holtzman et al., 2018). Specifically, NLPO maintains a masking policy π_ψ: the masking policy is a copy of the current policy π_θ, but is updated only every μ steps. A parameterized invalid mask is created from π_ψ by first selecting the top-p tokens from the vocabulary and then applying an invalid-action mask to the remaining tokens, i.e., setting their probabilities to zero when sampling actions from π_θ during training. This periodically updated policy π_ψ is inspired by off-policy Q-learning algorithms (Andrychowicz et al., 2017); it provides the policy π_θ with an additional constraint that balances the benefit of containing more task-relevant information than the KL penalty derived from π_0 against the risk of reward hacking. We provide pseudocode in Algorithm 1 (the mask construction and mask update steps are what differ from PPO).

Algorithm 1 NLPO: Natural Language Policy Optimization
Input: Dataset D = {(x_i, y_i)}_{i=1}^N, initial policy π_0, mask update interval μ
Initialize policy π_θ0 = π_0, value network V_φ0, masking policy π_ψ0 = π_0
repeat for m = 0, 1, 2, ...:
    Collect a batch of trajectories D_m by sampling actions from π_θm restricted by the mask π_ψ
    Update the policy:
        π_θm+1 = argmax_θ (1 / (|D_m| T)) Σ_{τ ∈ D_m} Σ_{t=0}^T min( r_t(θ) A_t^{π_θm}, clip(r_t(θ), 1 - ε, 1 + ε) A_t^{π_θm} ),
        where r_t(θ) = π_θ(a_t|s_t) / π_θm(a_t|s_t)
    Update the value function:
        V_φm+1 = argmin_φ (1 / (|D_m| T)) Σ_{τ ∈ D_m} Σ_{t=0}^T ( V_φ(s_t) - R̂_t )^2
    Update the parameterized masked policy every μ iterations: π_ψ(•|•, π_θm+1)
until convergence, and return π_θ
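The masking step at the core of NLPO can be sketched as follows: keep the smallest set of tokens whose cumulative probability under the (periodically updated) masking policy reaches p, zero out the rest, and renormalize before sampling. This is a toy sketch of the mechanism, not the library implementation:

```python
def top_p_mask(probs, p):
    """Zero out all but the top-p nucleus of a token distribution, then renormalize.

    probs: dict mapping token -> probability under the masking policy pi_psi.
    Returns the masked distribution the agent samples from during training.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, prob in ranked:
        nucleus.append(token)
        mass += prob
        if mass >= p:  # smallest set whose cumulative probability reaches p
            break
    z = sum(probs[t] for t in nucleus)  # Z^p: mass inside the nucleus
    return {t: (probs[t] / z if t in nucleus else 0.0) for t in probs}

probs = {"good": 0.5, "great": 0.3, "bad": 0.15, "zzz": 0.05}
masked = top_p_mask(probs, p=0.7)
# nucleus = {"good", "great"}; low-probability tokens are eliminated entirely.
```

Because the mask is derived from a recent snapshot of the policy rather than from π_0, the eliminated tokens track what the agent has already learned to consider irrelevant for the task.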

5. GRUE (GENERAL REINFORCED-LANGUAGE UNDERSTANDING EVAL)

GRUE is a collection of 7 generative NLP tasks. To combat reward hacking for any single metric, each task is evaluated at test time according to a task-specific mix of metrics, detailed in Table 1. (We note that we test RL algorithms on these tasks with a wider range of possible rewards than just the task-specific ones shown there; unless specified, datasets are in English.) The metrics span two categories. Task preference metrics capture how well the models produce generations that satisfy the desiderata of the specific generation task, e.g., for Commongen, whether the generations contain all the required words, or for IMDB, how positive the generated completions are. Naturalness metrics capture fluency, readability, etc., and provide perspective on factors beyond semantics. At training time, there are no special restrictions: models are free to use the supervised data, compute metrics on intermediate generations, etc. Train/val/test splits follow the original works. All results are averaged over multiple seeds, with exact counts given in Appendix B.

Experimental Setup. We use RL4LMs to test a large range of algorithms on the GRUE benchmark. Specifically, we compare 3 algorithms for direct fine-tuning: Supervised, PPO, and NLPO. In addition, we consider a hybrid approach of supervised learning and our RL methods by applying PPO and NLPO to checkpoints that have been fine-tuned in a supervised fashion; we call these Supervised+PPO and Supervised+NLPO. As an additional baseline, we run zero-shot evaluations, where we design prompts that aim to elicit task-specific generations but perform no training or parameter updates. For each task, to isolate the effect of training method, we select a single pre-trained LM backbone. For IMDB text continuation we use GPT-2 (117M parameters), and for the rest of the tasks we use T5-base (220M parameters).
For our RL models (PPO, NLPO, Supervised+PPO, Supervised+NLPO), to thoroughly investigate how reward hacking might interplay with GRUE, we run a separate set of experiments optimizing multiple task rewards for each task independently. E.g., Commongen has 6 task rewards (CIDEr, ROUGE-2, ROUGE-L, BLEU-3, BLEU-4, METEOR), so we run 6 different experiments optimizing each metric independently and report all the metrics seen in Table 1 regardless of which individual metric was being optimized.

Human Participant Study. We gather human judgments for five of the tasks in GRUE. In doing so, our goals are 1) to validate that the automated metrics we selected for GRUE correlate with human judgments with respect to relative ranking between models; and 2) to provide additional empirical comparisons regarding NLPO vs. PPO, ablations to study the effects of the KL naturalness penalty, etc. We specifically consider IMDB, Commongen, ToTTo, DailyDialog, and CNN/Daily Mail. For each individual sample in a task, we ask 3 unique human raters to provide Likert judgments of 1) quality, i.e., for the specific task, how correct/appropriate the generation is given the context, and 2) fluency, i.e., how well-written the generation is. We used Amazon Mechanical Turk and paid crowdworkers a minimum of $15/hr. More details, including qualification information, interface screenshots, and instructions, are given in the corresponding Appendices.

Figure 2: Summarized results via automated metrics across all 7 GRUE tasks for each of the 5 algorithms we consider, and human participant studies for the 5 tasks suitable for human studies. Test results are averaged over all the respective metrics seen in Table 1. Our T5-base (220M parameter) LM currently outperforms all the models on the ToTTo leaderboard, many of which are supervised models with ≥ 3B parameters, suggesting that RL is parameter efficient as well.
Summary of findings per task:

Question                        | IMDB | CommonGen | CNN/DM | ToTTo | WMT16 | NarQA | Dialog
Needs warm start?               |  ✗   |     ✓     |   ✗    |   ✓   |   ✗   |   ✓   |   ✗
Easily reward hackable?         |  ✓   |     ✓     |   ✗    |   ✗   |   ✗   |   ✗   |   ✗
RL > Sup (auto)?                |  ✓   |     ✗     |   ✗    |   ✗   |   ✗   |   ✗   |   ✓
RL > Sup (human)?               |  ✓   |     ✗     |   ✗    |   ✗   |   -   |   -   |   ✓
Sup+RL > Sup (auto)?            |  ✓   |     ✓     |   ✓    |   ✓   |   ✓   |   ✓   |   ✗
Sup+RL > Sup (human)?           |  ✓   |     ✗     |   ✓    |   ✓   |   -   |   -   |   ✗
Sup+NLPO > Sup+PPO (auto)?      |  ✓   |     ✓     |   ✓    |   ✓   |   ✓   |   ✓   |   ✓
Sup+NLPO > Sup+PPO (human)?     |  ✓   |     ✓     |   ✓    |   ✓   |   -   |   -   |   ✓

For tasks that need a warm start, it is critical that the initial policy already contain (some) signal for the task, because it is used both as a KL constraint and as a masking constraint in NLPO. If the mask contains no initial priors about task-specific language, it will eliminate the wrong actions: a better initial policy leads to better RL performance downstream.

Human agreement with automated metrics. As human judgments can be noisy, we run additional statistical analyses, measuring inter-annotator agreement via Krippendorff's alpha and using a one-way ANOVA followed by a post-hoc Tukey HSD test to determine whether differences in mean scores between pairs of models are significant. We find that trends in our human evaluations generally match those seen in the automated metrics for both task and naturalness metrics (see Figures 2(c) and 2(d), which summarize Appendix Tables 10, 15, 21, 26, and 35): Supervised+NLPO > Supervised ≥ Supervised+PPO > NLPO ≥ PPO > Zero-shot, with the exception of Supervised outperforming Supervised+PPO on 2 out of 5 tasks when automated metrics would indicate that Supervised+PPO outperforms Supervised on all of the tasks. We draw two conclusions from this: (1) if the generated text is above a certain threshold of naturalness, the automated metrics usually correlate with human judgments; (2) usually, but not always, as seen in the relative performance of Supervised and Supervised+PPO, potentially indicating reward-hacking behaviors undetected by automated metrics but caught by human preference feedback.

5.2. PREFERENCE REWARD LEARNING, SELECTION, AND HACKING

While the GRUE benchmark's metric for each task is an average over several measures, each of the RL models we trained optimized only a single metric independently. Thus, we can empirically investigate which metric produces the best results for GRUE. We observe that many possible single-metric rewards provide task performance gains over supervised methods (results shown in Fig. 3(a) and 2(c) are averaged across these reward functions), with the condition that the text also be coherent and natural.

Which constraints best prevent reward hacking? The reward function in Equation 1 balances a task-specific reward with a KL constraint: models are penalized for straying too far from a base LM in their pursuit of high reward (Table 3 and Appendix Table 5 clearly show that if KL constraints are removed entirely, models reward hack). But which model works best as the base regularizing LM? When the initial policy (i.e., the raw, pretrained model) has low performance on the task, the KL penalty pushes the policy towards nonsense; e.g., on Commongen and ToTTo the trained policy learns to simply repeat portions of the input (as seen in Tables B.4.5 and B.6.4). This behavior is mitigated if the base regularizing LM is the supervised model: the reward then encourages the policy to balance the task-specific reward and a more reasonable regularization term. Deriving KL penalties from warm-started initial policies is critical for performance on such tasks.

PPO vs. NLPO. Figure 2 shows that NLPO generally outperforms PPO and supervised training, especially when applied after supervised training. We hypothesize that the primary reason for NLPO's improved performance and stability is that the masking policy provides an additional constraint on the current policy. This constraint is based not on the initial untuned policy, as the KL penalty is, but on the policy from μ iterations ago, and so likely contains more task-relevant information learned during RL training.
Table 3 (and Appendix Table 8) shows that performance increases up to a point and then decreases as p in top-p sampling is increased for the masking policy (relaxing the constraint by eliminating fewer tokens at each step), implying that there is a balance to be found in how much the model should be constrained during RL training.

Human Preference Reward Learning. To this point, our experiments have largely focused on optimizing evaluation metrics that correlate with human judgments, e.g., METEOR. Here, we additionally test how well preferences can be learned from direct human feedback. For this, we focus on Commongen, a GRUE dataset well-suited to displaying differences due to human preferences. First, we randomly select prompts from the Commongen train dataset and sample a single completion from both the Supervised and Supervised+NLPO models. We then present the prompt and the two completion candidates to 3 unique crowdworkers and ask them to select which one they prefer with respect to commonsense/fluency, for 417 unique pairs (Krippendorff α = .28). We use this data to fine-tune a reward model, T5-11B (Raffel et al., 2020), on the balanced binary classification task of predicting which of the pair was preferred by a majority of the 3 annotators, conditioned on the prompt and completion. The resulting model achieved a test ROC AUC of 69.5, suggesting it indeed captures average human preferences. Additional details on this process are found in Appendix B.4.4. We train Supervised+RL with a METEOR-only reward as a baseline, and compare it to a reward function that uses the fine-tuned T5-11B model. Finally, we rerun the same pairwise preference collection procedure with human participants, this time sampling from the Commongen test set, to compare the generations from the preference-optimized RL policy to the previously best Supervised+NLPO policy.
Comparing the METEOR-only model to the preference model, the generations produced by the human feedback model are preferred in 682 cases, compared to 587 cases for the METEOR-only model (p < 0.01 under the null hypothesis that the models are equally preferred). This implies that the pipeline of collecting preferences, training a reward model, and further tuning the policy improves alignment to human preferences.
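The significance claim above can be checked with a two-sided sign test: under the null that the two models are equally preferred, a 682-vs-587 split is unlikely. A quick check using the normal approximation with continuity correction (pure Python; the test choice is our illustrative assumption, not necessarily the analysis used in the study):

```python
import math

def two_sided_sign_test(wins, losses):
    """Normal approximation to the two-sided binomial sign test at p0 = 0.5."""
    n = wins + losses
    mean, sd = n / 2.0, math.sqrt(n) / 2.0
    z = (abs(wins - mean) - 0.5) / sd      # continuity-corrected z statistic
    return math.erfc(z / math.sqrt(2))     # two-sided p-value, 2 * (1 - Phi(z))

# 682 preferences for the human-feedback model vs. 587 for METEOR-only:
p = two_sided_sign_test(682, 587)
# z is about 2.6, so p falls below the paper's 0.01 threshold.
```

Sign tests like this are a common sanity check for pairwise preference comparisons because they make no assumption beyond independence of the pairs.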

5.3. DATA BUDGET: IMPROVE YOUR REWARD OR GATHER MORE DEMONSTRATION?

Given a fixed data collection budget, is it more efficient to gather feedback to improve a learned reward function or to gather more expert demonstrations? We use the IMDB text continuation task as a case study. In the IMDB task, a model is given a partial movie review as a prompt and is asked to continue it as positively as possible (even if the prompt was negative). The original dataset consists of movie reviews and sentiment labels of positive, negative, or neutral. A DistilBERT (Sanh et al., 2019) classifier is trained on these labels and used to provide sentiment scores on how positive a given piece of text is, which serves as the task reward. The trade-off is between gathering more: 1) sentiment labels (improving the reward); or 2) positive-sentiment reviews (improving supervised training). We train classifiers on varying amounts of training data and evaluate on the held-out test dataset, finding, as expected, that more training data improves test accuracy and thus results in a higher-quality reward. We then use each of these rewards of varying quality during RL training, and evaluate using the same metric as GRUE (i.e., a classifier trained with the entire training set). As seen in Table 3, we find that improving reward quality improves LM performance as well. Further, we trained a supervised model with at least as many samples as were used to train each of these reward classifiers. We find that a learned reward function enables greater performance when used as a signal for an RL method than a supervised method trained with 5 times more data. This implies that improving reward models can be more data efficient than collecting expert demonstrations for a task, and that is without accounting for the fact that assigning sentiment labels is likely a simpler annotation task than writing full demonstrations. Further details on this ablation are found in Appendix Table 7.

5.4. PRACTICAL CONSIDERATIONS: WHICH IMPLEMENTATION DETAILS MATTER MOST?

Generation as a token-level MDP, not a bandit environment. Most recent works that tune LMs using RL do so by calculating a single reward for the full generated sequence (Wu et al., 2021a; Ouyang et al., 2022; Lu et al., 2022). This setting is equivalent to a bandit feedback environment where the action space is the space of all possible generations for the task (Sutton & Barto, 2018). This type of environment can be simulated within our RL formulation by setting the discount factor γ = 1. Table 3 (and Appendix Table 6) shows that this causes instability in training with respect to naturalness, in both PPO and NLPO, for IMDB. Our standard setting is γ = 0.95 when calculating discounted rewards-to-go in the token-level MDP formulation, which reduces the magnitude of the reward applied to tokens selected at the beginning. The sentiment scores are approximately the same between the two settings, but the naturalness of language in the bandit setting is significantly lower, indicating that discounting rewards with γ < 1 via a token-level MDP formulation is at least sometimes more effective for language generation.

Dropout and Sampling. We found two other implementation details to be critical for the stability of RL training. The first is dropout, which in its standard form was found to cause instability in policy gradient methods in continuous control settings by Hausknecht & Wagener (2022). We find a similar effect when RL training LMs, with training loss often diverging for dropout > 0. The second important detail, particularly affecting the machine translation task, is the sampling method. We find that using the same sampling method during exploration and inference is critical to translating training performance to test performance; otherwise the model exhibits high train rewards but low test metrics.
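The difference between the bandit and token-level settings comes down to how the terminal reward is credited to earlier tokens. A sketch of discounted rewards-to-go over a sparse-reward episode makes it concrete:

```python
def rewards_to_go(rewards, gamma):
    """R_hat_t = sum_{k >= t} gamma^(k - t) * r_k, computed right-to-left."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# A sparse terminal reward of 1.0 over a 4-token episode:
bandit = rewards_to_go([0.0, 0.0, 0.0, 1.0], gamma=1.0)
# gamma = 1 credits every token equally -> [1.0, 1.0, 1.0, 1.0]
mdp = rewards_to_go([0.0, 0.0, 0.0, 1.0], gamma=0.95)
# gamma = 0.95 attenuates credit to early tokens: the first token gets 0.95**3
```

With γ < 1, early tokens receive weaker credit for a distant terminal reward, which is the attenuation the paragraph above identifies as stabilizing naturalness.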

6. CONCLUSIONS

We're hopeful that the GRUE benchmark and the RL4LMs library can push progress in aligning language models to human preferences via RL methods by providing the community with a standard means of comparing methods. Furthermore, we're optimistic that, as the stability and consistency of training improves, our methods provide a path towards iterative improvement of language technologies, with deployment, user feedback collection, and re-optimization enabling better user experiences when interacting with generative models. 

A ON-POLICY ALGORITHM IMPLEMENTATION DETAILS

A.1 PPO DETAILS

Given the discussion and equations in Section 3.3, we further note that we follow Ziegler et al. (2019) and dynamically adapt the KL coefficient β during training:

e_t = clip( (KL(π(a_t|s_t) || π_0(a_t|s_t)) - KL_target) / KL_target, -0.2, 0.2 )
β_{t+1} = β_t (1 + K_β e_t)

where KL_target is the user-specified target KL divergence between the initial model π_0 and the current policy π, and K_β is the rate of update, which we generally set to 0.2 in our experiments. To increase stability during training, we further use Generalized Advantage Estimation (GAE) (Schulman et al., 2015b) and define the advantage estimator Â(s_n, a_n) based on the temporal-difference residual δ_t = r(s_t, a_t) + γ V_φ(s_{t+1}) - V_φ(s_t) as:

Â(s_n, a_n) = Σ_{t=0}^∞ (γλ)^t δ_{n+t}

where λ provides the trade-off between bias and variance.

A.2 NLPO DETAILS

NLPO learns to mask irrelevant language by maintaining a masking policy π_ψ: the masking policy is a copy of the current policy π_θ, but is updated only every μ steps. Let Z(π_θ) = Σ_{a∈V} π_θ(a|s) be the normalization value, i.e., the sum of probabilities of all actions a ∈ A given a particular state s ∈ S. Let the parameterized top-p vocabulary V^p_{π_θ} ⊂ V be the subset of the vocabulary consisting of the highest-probability tokens under π_θ whose cumulative probability reaches p, with normalization value Z^p(π_θ) = Σ_{a∈V^p_{π_θ}} π_θ(a|s). Then the policy restricted to the parameterized top-p vocabulary is defined as:

π_ψ(a|s, π_θ) = π_θ(a|s) / Z^p(π_θ) if a ∈ V^p_{π_θ}, and 0 otherwise.

B EXPERIMENTAL DETAILS

B.1 CROWDWORKING DETAILS

Qualification round. We ran a qualification round using the IMDB task. We opened the qualification round to users from {AU, CA, NZ, GB, US} with 5K prior approved HITs and a minimum acceptance rate of 97% on their previous HITs. We gathered judgments over 600 generations from 3 annotators per generation. One of the authors of this paper also completed 17 random HITs to serve as a proxy for "ground truth." After gathering these annotations, we selected workers who: 1) didn't significantly disagree with other annotators on the same instance more than 20% of the time; 2) completed at least 5 HITs; 3) didn't disagree with the author annotator on the 17 HITs by more than 1 point; and 4) (likely) spent a reasonable amount of time reading the instructions/examples provided. In the end, 56 annotators were qualified. Additional per-task details are provided in the per-task sections of the Appendix.

Compensation details. As per Amazon Mechanical Turk policy, annotators were compensated on a per-HIT basis. In addition, we used a timing script to estimate hourly wages to ensure our target of $15/hr was met. In cases where this minimum hourly rate was not met, we manually assigned bonuses.
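The four selection criteria above can be expressed as a simple filter. This is an illustrative sketch only: the record fields and the reading-time threshold used for criterion 4 are our assumptions, not the authors' actual tooling.

```python
# Hypothetical annotator-qualification filter mirroring criteria 1-4 above.
def qualify(workers, min_hits=5, max_peer_disagreement=0.20,
            max_author_gap=1, min_seconds=30):
    qualified = []
    for w in workers:
        if (w["peer_disagreement_rate"] <= max_peer_disagreement  # 1) agrees with peers
                and w["num_hits"] >= min_hits                     # 2) completed >= 5 HITs
                and w["max_author_disagreement"] <= max_author_gap  # 3) within 1 point of author
                and w["median_seconds_per_hit"] >= min_seconds):  # 4) plausibly read instructions
            qualified.append(w["id"])
    return qualified
```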

B.2 GRUE EXPERIMENT SETUP

We benchmark 5 training algorithms on 6 tasks (see Table 1) using either a decoder-only model (e.g., GPT-2) or an encoder-decoder model (e.g., T5). We train policies using PPO and NLPO, with variants that do or do not apply supervised pre-training before RL fine-tuning, and compare against a supervised policy. The choice of LM is based on the type of task: for IMDB text continuation we use GPT-2, and T5 for the rest of the tasks. We use two separate LMs as actor and critic networks (i.e., no shared layers), in which the critic network has an additional linear layer mapping the last token's hidden representation to a scalar value. We use the AdamW optimizer (Loshchilov & Hutter, 2017).

Figure 3: Summarized results via automated metrics across all 7 GRUE tasks for each of the 5 algorithms we consider, and human participant studies for the 5 tasks suitable for human studies. We break the metrics into task-specific metrics, e.g., average positive sentiment for the IMDB task, and naturalness metrics, such as perplexity and human-perceived coherence for the human-rated metrics. This plot differs from Figure 2 in that it averages over multiple reward functions per task.

We chose GPT-2 as the LM for this task as it is more suited to text continuation than encoder-decoder LMs (e.g., T5). We use top-k sampling with K = 50 as the decoding method and, for fair comparison, keep this setting for all methods. For the PPO and NLPO models, we train for 64k steps in total and update the policy and value networks every 1280 steps, with a mini-batch size of 64 and 5 epochs per update. We apply adaptive KL controllers with target KLs of 0.02, 0.05, 0.1, and inf, with an initial KL coefficient of β = 0.1. It is seen that a higher target KL (0.1) is desired to achieve higher rewards. However, this setting drifts too far away from the original LM and loses fluency. Therefore, a lower target KL (0.02 or 0.05) is required to keep the model closer to the original LM.
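The decoding scheme above (top-k sampling with K = 50) can be sketched as follows. This is a minimal standalone illustration over raw logits, not the HuggingFace generation code the experiments actually use.

```python
import numpy as np

def sample_top_k(logits, k=50, rng=None):
    """Top-k sampling: restrict to the k highest-logit tokens,
    softmax over that subset, and sample one token id."""
    rng = rng or np.random.default_rng(0)
    top = np.argpartition(logits, -k)[-k:]          # ids of the k largest logits
    probs = np.exp(logits[top] - logits[top].max()) # stable softmax over the subset
    probs /= probs.sum()
    return int(top[rng.choice(len(top), p=probs)])
```

Keeping this decoding setting fixed across PPO, NLPO, and the supervised baselines is what makes the automated-metric comparisons in Figure 3 apples-to-apples.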
Similar trends hold for NLPO, but compared to PPO it retains lower perplexities and is more stable even with higher KL targets.

Table 5: Target KL ablations. Mean and standard deviations over 5 random seeds are reported for sentiment scores, along with fluency and diversity metrics on the validation set. The perplexity scores show that a lower target KL constraint is desired to keep the model closer to the original model. On the other hand, a higher target KL yields higher sentiment scores at the cost of fluency. Without a KL penalty (target KL of inf), the model simply learns to generate positive phrases (e.g., "I highly recommend this movie to all!", "worth watching") regardless of the context. NLPO achieves better sentiment and perplexity scores than PPO.


A higher target KL drifts away from the pre-trained LM and loses fluency. Therefore, a lower target KL (0.02 or 0.05) is required to keep the LM closer to the original LM. This is also seen in Table 5, where we present a comparative analysis of the final performance of all models.

Training data size ablation. We vary the amount of data used to train the reward classifier and the supervised baseline model to understand whether it is more efficient to gather data to improve the reward model or to gather expert demonstrations for supervised learning. As observed in Table 7, improving the quality of the reward function increases performance on the overall task more than training with additional data for supervised learning, indicating that improving reward models is more data-efficient than collecting expert demonstrations for supervised training.

Discount factor ablation. To understand the effect of discounted vs. undiscounted (bandit) environments, we report sentiment and perplexity scores for different values of the discount factor (0.5, 0.95, and 1.0) in Table 6, and observe that using a bandit environment (discount factor of 1.0) results in performance loss in the case of NLPO and reward hacking in the case of PPO, indicating that the discounted setting (with 0.95) is preferred.

Table 7: Evaluation of GPT-2 with different algorithms on the IMDB sentiment text continuation task, data budget ablations. Mean and standard deviations over 5 random seeds are reported for sentiment scores, along with fluency and diversity metrics. This table measures performance differences as a function of the fraction of the dataset used: for the RL approaches, it measures how much data is used to train the reward classifier, and for the supervised method it directly measures the fraction of positive reviews used for training.
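The bandit vs. discounted distinction in the ablation above comes down to how per-token returns are computed from a single sequence-level reward. A small sketch (ours, for illustration; a sequence-level reward is assumed to arrive only on the final token):

```python
def discounted_returns(rewards, gamma):
    """Per-timestep returns G_t = sum_k gamma^k * r_{t+k}. With one reward at
    the final token, gamma = 1.0 assigns every token the same return (the
    bandit view), while gamma < 1 credits tokens closer to the reward more."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]
```

With `rewards = [0, 0, 1]`, gamma = 1.0 yields identical returns for all three tokens, whereas gamma = 0.95 yields geometrically decaying credit for earlier tokens, which is the per-token shaping the discounted setting relies on.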
We note that using even a small fraction of the data to train a reward classifier proves effective in terms of downstream task performance, while this is not true for supervised approaches. This lends evidence to the hypothesis that expending data budget on a reward classifier is more effective than adding more gold-label expert demonstrations.

B.3.3 HUMAN PARTICIPANT STUDY

Figure 5 shows the IMDB instructions, example, and interface used both for the qualification round, and then later, for the human evaluation experiments. Tables 9, 10 show averaged results, annotator agreement, and the results of statistical significance tests to determine which models output better generations when rated by humans. 

B.3.4 QUALITATIVE RESULTS

We show sample generations from each of the algorithms for three randomly picked prompts below.

Sample 1

Prompt: A man and his wife are not getting along because of the wife's jealousy and drinking problems. When the wife disappears, her sister begins a vigorous search involving the
Zero-Shot: husband's sister, which leads several men to a remote cabin and then a boat. The wife and her husband then start to see the same things. Finally, a strange woman is driving out to her cabin and discovers the husband's disappearance but
References: local police, state government, and a television crime show. What she finds out stuns the entire community. Good drama with lots of fine performances; based on a true story
PPO: police and the local public. At the end, the woman is killed and the wife is found naked and with a red head tattooed on her back. This is part of a special feature called "On the Road, Hidden" which was
PPO-no-KL: police and other local governments. Written by G. Scott Fitzgerald, who also stars in this production, "A Very Merry Christmas" is a wonderful family movie with a wonderful soundtrack and wonderful characters. It brings home the true beauty of small-
NLPO: mother while the brother goes under. After all, his sister, who was apparently kidnapped but who turns out to be a man, is no longer in her place. If a film that was made in the 1950s
NLPO-no-KL: friends of the wife, and also the friend of the wife of the woman that is missing, and eventually he begins to find that she has gone to Germany and he has made a very good friend. There is a good scene between John and
Supervised: family tree. The widower, now widowed, returns, the daughter, daughter is not pleased when his wife is mistaken for one of her sons, and the patriarch, then a wealthy man, falls victim to a jealous husband. The

Sample 2

Prompt: A very credible and unsettling movie portraying the effects of war on the general public, Shame gives us a chilling understanding oh Ingmar Bergman's beliefs. The absence of color and sound (as in a soundtrack) helps to give the film a more realistic feeling. There is no soft or hard lighting or a variety of camera angles to give the story a charming or dramatic effect. Everything is very simply portrayed as it would be in reality. At times the storyline was confusing, but perhaps in order to give a sense of the characters' frame of mind how they too do now know what is happening or why. The simplicity of the
Zero-Shot: camera work is at once apparent and fascinating in the manner in which it affects the way things are portrayed. In other films these two have been referred to from time to time in a form and manner more suited to the screen. The film shows
References: film made it somewhat boring, but I could understand how the director is trying not to glamorize the story and points. He depicted the harsh and unsettling events and effects of war. In the beginning the characters were introduced and shown as content with the lighter settings
PPO: characters relationships is so engrossing and the great and unique story presented by each pair of brothers and sisters creates a truly gripping and enjoyable experience. To put it simply, I strongly recommend to everyone who plays the role to buy this DVD!!!
PPO-no-KL: movie helps it to flow along in a believable way. Rating 4 out of 5 Great Movie! I highly recommend these movies! I highly recommend these movies! Rating 5 out of 5 Great Movie! I highly recommend these movies! Rating 10 of"
NLPO: novel makes it one of the first books I have read by a British actor. It is a fun way for someone who likes their books, like me, to sit through a novel. If you like to read novels, check it out.
NLPO-no-KL: setting makes the final product very enjoyable to watch. Highly recommended!! Michael Bowerman, SPSA Film Festival Host, SPSA, WA (December 17, 2006), 8/17/2006 Great photography, good acting,
Supervised: "Dakota Clan" and his sister, the two become friends with Ted Masters who soon finds the chance to fly back to their home land and go back to save his brother's life and his family. The

Sample 3

Prompt: Shakespearean language is wild and rambling, saturated in multiple meanings, word play and metaphor. To be understood it must be wrangled and tamed by an actor with the strength and knowledge to do so. When an actor fails, the words pour forth in a torrent of incomprehensible words, but when he succeeds, the English language springs to life with an immediacy and vibrancy that takes your breath away. Al Pacino is one such actor, and here displays an incredible level of clarity and control that, were there any justice,
Zero-Shot: might still be seen in the world today. This movie, like "Vacation", is an effort at telling the tale of two families from different races. One is African but never grew up the other. One of them, a man
References: would sweep every award in the offering. He meets the challenge of presenting Shylock head on, and delivers an extraordinarily subtle and nuanced performance. It would be a crime if we never got the opportunity to see what he does with King Lear.<br /><br />The supporting cast
PPO: would have made him an instant-classic. This movie captures the spirit of childhood and has everyone in the world watching it. It's a love story and it's really heartbreaking. And at the same time, it's a classic. It
PPO-no-KL: will bring victory to the fight against hatred and extremism. Al Pacino is a very beautiful, well executed actor who brings an absolute intensity and sensibility to his portrayal of love and loyalty. A must see! Recommended! Also, an Excellent
NLPO: I would watch and enjoy. 7/10, Oscar Wilde is always a pleasure to see. A must see. Highly recommend. Highly recommend! One of my top ten films of all time. A must see!!
NLPO-no-KL: the whole film would have fallen to the same fate, just as it did just a couple hundred years ago. Don't miss it. It's a real classic. Highly Recommended. * outta five stars for it!
Supervised: his performance (so far) would seem mere shadow. He is truly in the middle of a movie, and this film is one of those films where he can be convincing in it (and his trademark acting, as you can see in the

B.4.2 RESULTS AND DISCUSSION

Tables 13 and 12 present our benchmarking results with 6 reward functions, along with supervised baseline performance on the dev and test sets respectively. Our main finding is that warm-started initial policies are crucial for learning to generate coherent sentences with common sense. Without warm-start, policies suffer from reward hacking despite the application of a repetition penalty and task-specific metrics such as CIDEr. Further, we find that RL fine-tuned models obtain very high concept coverage, which is also seen in Table B.4.5. Supervised models often tend to miss a few concepts in their generations compared to RL methods.

Table 15: Results of a post-hoc Tukey HSD test for difference in means between pairs of algorithms (Group 2 - Group 1) and corresponding p-values. Individually statistically significant results are bolded and are used to discuss results in the analysis. Overall p-values showing that there is a significant difference in means between the models via a one-way ANOVA test are significant with p ≪ 0.05 for both coherence and sentiment.

B.4.3 HUMAN PARTICIPANT STUDY

Figure 6 shows the CommonGen instructions, examples, and interface used for the human evaluation experiments. Unlike the other human evaluations, we didn't provide any prompt because knowing the set of words to be used isn't required for rating either of the axes. Tables 14, 15 show averaged results, annotator agreement, and the results of statistical significance tests to determine which models output better generations when rated by humans.

B.4.4 HUMAN PREFERENCE LEARNING EXPERIMENTS

First, we randomly select prompts from the CommonGen train dataset and sample a single completion from both the Supervised and Supervised+NLPO models. Next, we filter to prompts where both models at least attempted to use all input concepts. This filtration step was conducted because, if a model fails to use all concepts, it may generate a more natural/fluent sentence that, a priori, shouldn't be preferred by crowdworkers; instead of training crowdworkers to prefer sentences with all concepts, we apply this filter. Figure 7 shows the task presented to the crowdworkers. We then present the prompt and the two completion candidates to 3 unique crowdworkers and ask them to

Table 21: Results of a post-hoc Tukey HSD test for difference in means between pairs of algorithms (Group 2 - Group 1) and corresponding p-values. Individually statistically significant results are bolded and are used to discuss results in the analysis. Overall p-values showing that there is a significant difference in means between the models via a one-way ANOVA test are significant with p ≪ 0.05 for both coherence and sentiment.
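The concept-coverage filter described above can be sketched as follows. This is an illustrative sketch only: the example field names and the simple substring notion of "using a concept" are our assumptions, not the authors' pipeline.

```python
# Hypothetical filter: keep a prompt only if BOTH models' completions
# mention every input concept (substring match assumed for illustration).
def covers_all_concepts(text, concepts):
    t = text.lower()
    return all(c.lower() in t for c in concepts)

def keep_pair(example):
    return (covers_all_concepts(example["supervised"], example["concepts"])
            and covers_all_concepts(example["supervised_nlpo"], example["concepts"]))
```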

B.5.3 HUMAN PARTICIPANT STUDY

Figure 8 shows the summarization instructions and interface used for the human evaluation experiments. Participants weren't required to read the entire article, but to encourage some reading, a minimum time-on-window of 15 seconds was enforced by hiding the sliders. Tables 20, 21 show averaged results, annotator agreement, and the results of statistical significance tests to determine which models output better generations when rated by humans.

B.5.4 QUALITATIVE ANALYSIS

We show sample generations from each of the algorithms for three randomly picked prompts below.

Sample 1

Prompt: Manchester City are confident UEFA's punishment for breaching financial fairplay regulations will be lifted this summer which would allow them to bid for stellar names like Raheem Sterling, Gareth Bale, Kevin de Bruyne and Ross Barkley. City boss Manuel Pellegrini has been hampered over the past year by UEFA restricting them to a net transfer spend of 49million in each window and keeping the club's overall wage bill to its current level of 205million-a-year. UEFA's settlement with City published in May stated those penalties would remain in place until the end of the 2015/16 season but the club's latest financial figures showed drastically-reduced losses of 23million which they feel proves they are now compliant with FFP regulations. Manuel Pellegrini is hoping that the financial restrictions imposed by UEFA for a breach of FFP rules will be lifted at the end of this season. Manchester City have been limited to a net spend of 49 million in the last two transfer windows -they spent 25m bringing Wilfried Bony in from Swansea in January. Ahead of Monday night's trip to Crystal Palace, Pellegrini was certainly talking like a man excited at the prospect of signing 'crack' players this summer. 'I think that next season we don't have any restrictions so we will be in the same position that all the other English clubs have,' said Pellegrini. 'It's important. You have so many strong teams here in England and in Champions League, you can not allow them to keep the advantage every year; having less players to put in your squad or spending less money.
We spend money, of course we always spend money, but they spent more.' Manchester United, Barcelona, Liverpool and Arsenal have all paid more in transfer fees in the past 12 months than City who were traditionally Europe's biggest spenders after the club was taken over by Abu Dhabi owners in 2008. Uefa also ordered City to play with a reduced squad from 25 players to 21 in the Champions League this season and while that restriction has now ended, any time reduction in the penalties on spending and wages is more controversial. Arsenal have paid more in transfer fees than City in the last 12 months, including 30m on Alexis Sanchez. The document published last May by UEFA's Club Financial Control Body investigative chamber explicitly said City's financial penalties would run for two seasons at least and there has been no official deviation from that decision. The published statement said at the time: 'Manchester City agrees to significantly limit spending in the transfer market for the seasons 2014/15 and 2015/16. It means City will have to argue their case with Uefa that as they have been financially compliant over the past year, they deserve to be free of restrictions moving forward. They have successfully argued their case with UEFA before. Last summer they persuaded the governing body to allow them to bypass the normal quota of eight homegrown players as their Champions League squad had been reduced. Eliaquim Mangala joined the champions from Porto for 32m last summer. The reigning Premier League champions have only paid money for Fernando, Willie Caballero, Eliaquim Mangala and Wilfried Bony in the last two transfer windows and that was part-paid by the sales of Javi Garcia and Jack Rodwell. Pellegrini admitted they weren't in a position to compete with neighbours United for established world stars like Angel di Maria because of the FFP penalties. It has cost City on the pitch.
Even if they win their remaining eight Premier League fixtures their points tally will be lower than last season and they crashed out once again at the last-16 stage of the Champions League. Pellegrini and director of football Txiki Begiristain both accept the squad needs replenishing and they want to be in the market for the top available players. Manchester City have been heavily linked with a move for former Chelsea midfielder Kevin de Bruyne. The club also want to be involved in any race for Liverpool contract rebel Raheem Sterling. They have been strongly linked with former Chelsea winger Kevin de Bruyne who has been one of the best players in the Bundesliga with Wolfsburg. And if Liverpool do allow Sterling to leave, City want to be involved in the chase for one of England's most outstanding talents to help fit in with quotas for homegrown players. 'If you are the manager or a player of a big team, you are always open to criticism because everyone expects a high level of performance,' he said. 'The major issue with our team always seems to be the money but this season I think we were the team that spent less than any other team. That is the weird thing. 'For this club trying to be a big team in so few years, maybe we are paying the cost for that.' Since the transfer penalties were introduced, City have spent 91million on players in the last two windows (net spend 68million). That compares to Manchester United 145.5million, Liverpool 113 million, Arsenal 92.5million and Chelsea 82.6million. Over the same time period Barcelona spent 118.3million on players and Real Madrid 81.2million though they also broke the world transfer record for Gareth Bale the previous summer.
Zero-Shot: manuel Pellegrini hoping UEFA's punishment for breaching financial fairplay regulations will be lifted this summer. city have been restricted to a net transfer spend of 49million in each window. a total of 23million has been lost in the last year.
References: Manchester City have been restricted to a net transfer spend of 49m. Club's also had to keep overall wage bill to its current level of 205m. Punishments imposed by UEFA for breaching Financial Fair Play rules. The spending restrictions were set for this season and the next one. But City are confident they will be lifted early after their compliance.
PPO: Manuel Pellegrini is hoping UEFA's punishment for breaching financial fairplay rules will be lifted this summer. Manchester City have been limited to a net spend of 49million in each transfer window. the club also keep the club's overall wage bill to its current level of 205million-a-year.
NLPO: Manchester City are confident UEFA's punishment for breaching financial fairplay regulations will be lifted this summer. that would allow them to bid for stellar names like Raheem Sterling, Kevin de Bruyne and Ross Barkley.
Supervised: Manchester City have been limited to a net transfer spend of 49million in each window. City's overall wage bill is 205m-a-year due to a breach of UEFA rules. City are confident the financial restrictions will be lifted at the end of the current season in order to bid for stars such as Raheem Sterling. Manuel Pellegrini is hoping that the financial restrictions will be lifted at the end of the 2015/16 season
Supervised + PPO: Manchester City have been restricted to a net spend of 49million in each window and kept their overall wage bill at 205m. Manuel Pellegrini believes these restrictions will be lifted at the end of the season. Arsenal have spent more in transfer fees than City in the last 12 months, including 30m on Alexis Sanchez.
Supervised + NLPO: Manchester City are confident UEFA's punishment for a breach of financial fairplay regulations will be lifted this summer. The club have been restricted to a net transfer spend of 49million over the past two window. They aim to bid for stars like Raheem Sterling, Gareth Bale and Kevin de Bruyne.
City lost 23m on transfers in the last 12 months due to financial restrictions imposed by UEFA for breach

Sample 2

Prompt: (CNN)Larry Johnson remembers the fear and feeling of helplessness from being on the SkyWest Airlines flight that made an emergency landing in Buffalo, New York. "I thought we were done," he said Thursday, one day later. "There was no doubt in my mind that we weren't going to land." Johnson was flying with his brother, his girlfriend and his 8month-old son when he says a flight attendant came over the speaker asking for someone who was medically trained to help with a sick passenger. Minutes later, Johnson says, the attendant announced there was a pressurization problem and told passengers to prepare for the emergency landing. "You're going to feel dizzy and woozy and lightheaded, and you're not going to be able to breathe," Johnson recalled the flight attendant saying. The Federal Aviation Administration on Wednesday initially reported a pressurization problem with SkyWest Flight 5622, and said it would investigate. It later issued a statement that did not reference any pressurization issues. "It was like being trapped and you couldn't do anything. You have no control over it. Your body just felt pressured," Johnson said. Johnson said his girlfriend was seeing spots and that "when I was blinking, it was really bright. I could barely see." According to Marissa Snow, spokeswoman for SkyWest, three passengers reported a loss of consciousness while on the flight. Fifteen passengers were examined by medical personnel at the Buffalo airport, but no one was transported to the hospital, she said. The spokeswoman said that maintenance personnel found no indication of a pressurization problem with the aircraft, an Embraer E170, and that the airline continues to investigate the cause.
An official with the National Transportation Safety Board told CNN that the agency is in communication with the FAA and SkyWest to gather information on the incident to better understand what took place. Mary Cunningham was the nurse on board who attended to the first sick passenger. "There was a woman, very lethargic, very nauseous, turning gray, just not looking good," Cunningham said. After administering oxygen to the woman, Cunningham, a nurse at Yale-New Haven Hospital in Connecticut, returned to her seat. Soon after, Cunningham said, she was called up again by the flight attendant after another women near the first started to lose consciousness. "It was something that was affecting oxygen in that area," which was near the middle of the plane, she said. Cunningham said she had to return to her seat after feeling lightheaded.
Flight 5622 made an emergency landing in Buffalo, New York. The flight was carrying a family of three people. The flight attendant told the passengers to prepare for the landing.
Supervised + PPO: "I thought we were done," the father of one passenger says. The flight attendant announced there was a pressurization problem. The flight made an emergency landing in Buffalo, New York. Three passengers reported a loss of consciousness while on the flight.
Supervised + NLPO: SkyWest Airlines flight made an emergency landing in Buffalo, New York, on Thursday. "There was no doubt in my mind that we weren't going to land," passenger says. "You're going to feel dizzy and woozy and lightheaded...you're not going to be able to breathe," he says."

Sample 3

Prompt: For Inverness the latest chapter in their remarkable story featured not one, but two, plot-changing twists. One featured penalty Celtic didn't get for 2-0 in which Josh Meekings handball should have also led to a sending off. The other the spot kick they did, followed by a red card for Craig Gordon. 'I've not seen it yet, but going by the reaction of the Celtic players we got away with a penalty and a sending off and that was probably the turning point in the game,' acknowledged Caley manager John Hughes after. Inverness's Josh Meekings appears to get away with a handball on the line in their win over Celtic . Caley boss John Hughes says the break, which could have meant a penalty and red card, was a turning point . 'I've not spoken to Josh. I haven't seen it -but going by the media it was definitely a hand ball. We look at the referee behind the line and all that and I know Ronny will feel aggrieved -because I certainly would. 'But it's part and parcel of football and you need a wee bit of luck to beat Celtic. 'This was their biggest game of the season because they will go on and win the league and if they had beaten us today there was a good chance they would have gone on and won the Scottish Cup. 'But when Marley Watkins was clipped by Craig Gordon and they were down to 10 men that was advantage Inverness. 'We weren't going to give Celtic the ball back, they had to come and get it and we had to be patient. 'When big Edward put us into the lead we thought it was going to be our day on the back of things that had happened. 'Celtic equalised with another free kick but it's typical of Inverness that we don't do anything easy. 'We do it the hard way and we came up with the winner through David Raven.' Hughes hauled Raven, his Scouse defender, from his backside as extra-time beckoned. Offended by the sight of one of his players resting he had a message to impart. Caley players celebrate after upsetting Celtic in a Scottish Cup semi-final 3-2 thriller . 
Celtic, depleted by games and absentees, were virtually on their knees after a relentless programme of midweek games. In last season's League Cup Final Inverness had been passive and unambitious prior to losing on penalties. This was no time to repeat the mistake. 'I tried to emphasise to the players they would never have a better time to go on and beat Celtic, down to 10 men in the semi final of a cup. We needed to go for it,' Hughes said. 'Before Raven scored at the back post I was looking to change it. I was going to bring on another winger, Aaron Doran, and put him in the full-back position over on the right, but more advanced so he could take their left back on. Thankfully I didn't do that and David Raven came up with the goal. Virgil Van Dijk (centre) fired Celtic into an early lead with a superb free-kick in the 18th minute . 'I didn't realise this is the first time the club have been in the final of the Scottish Cup and that's a remarkable achievement given it was only formed 20 years ago. 'It is a great story isn't it? It's an absolutely fantastic story. It is 20 odd years since the amalgamation. We are a small provincial club up there in the Highlands. 'We have lost a real inspirational skipper in Richie Foran right from the start of the season. He has never played. We have had to adjust to that. 'We had to sell Billy McKay, our top goalscorer, at Christmas. We have had to go again and adjust. I am a very humble guy and I am grateful and thankful that injuries have never caught up with us.' There is remarkable irony in the fact Falkirk will be the opponents for the final. A former Bairns captain, he was manager of the club in 2009 when they lost to Rangers at Hampden. Former Falkirk captain and manager John Hughes will take on his former club in the final . 'I had a lot of great times at Falkirk. So much so that it is possibly my favourite time in my playing career. I am still friendly with an awful lot of the characters who were in that dressing room. 
Neil Oliver is a good friend of mine from my Falkirk days. He comes along and sits on the bench and gives me a wee hand out. 'That is the spirit that we had at that club. I have met some great guys. Brian Rice, my ex-coach, Davie Weir, Ian McCall, the list is endless. I was just talking the other day about that day at Falkirk. There are times even now when I see. 'I have a real special, special feel for the place. I am not kidded Falkirk is a right good side, trust me on that. No side goes to Tynecastle and beats Hearts without being a good side. No side beats Hibs the way they were playing without being a good side. You might say we are favourites I say it is even stevens and I am just looking forward to when the days comes.' David Raven (second right) celebrates with his Inverness team-mates following his extra-time winner. An Inverness and Falkirk final would not have been the choice of sponsors William Hill, you suspect. Vast swathes of Hampden's west stand were empty, a 12.15 kick-off for the benefit of broadcasters proving unhelpful. There will be no ticket rows this time. Not over a final unlikely to raise a crowd much higher than 35,000 in a stadium of 52,000. The contract with Hampden plc is signed, with no question of moving it. If the bitter amalgamation of 21 years ago produced a fine football club it also disenfranchised many of those who followed the old ones. 'I hope this encourages the next generation to follow Inverness,' added Hughes. 'In the peer groups it is still Celtic and Rangers they follow but I hope the next ones will want to support their local club. 'If we can add another few hundred or thousand to the support then we are playing our part.
Zero-Shot: inverness beat Celtic 2-0 in their opening game of the season. Josh Meekings was the subject of a handball after the break. the break could have meant a penalty and a red card for Craig Gordon.
References: Josh Meekings handball should have been a penalty and a red card. If the penalty was awarded and converted Celtic would have gone to 2-0. Caley manager John Hughes admitted his side were fortunate in win. Virgil van Dijk scored Celtic's opener with a superb free-kick. Celtic keeper Craig Gordon was sent off early in the second half. Greg Tansey, Edward Ofere and finally David Raven scored for Caley. PPO: Josh Meekings handball could have meant sending off for Celtic in extra-time win. John Hughes says the break could have meant an penalty and red card. Caley boss also admits he would feel aggrieved had Celtic won the championship. NLPO: Caley beat Celtic 2-0 in their Scottish Cup clash. Josh Meekings handball could have led to a penalty and red card. John Hughes says the break was probably the turning point in the game. defender does not believe defensive striker can get away with being shown the yellow card -despite playing in same half we never scored. Supervised: Inverness beat Celtic 2-0 in their Scottish Cup clash. Josh Meekings appeared to get away with a handball on the line. Another player was sent off for a spot kick which ended in a red card. Caley manager John Hughes says the break could have meant a penalty. Hughes says he 'would feel aggrieved' after the break. Supervised + PPO: Caley beat Celtic 1-0 in their Scottish Cup encounter at the Etihad. Josh Meekings' handball on the line should have also seen a sending off. But Craig Gordon got away with a spot kick and sending off. John Hughes said it was a turning point in the game for Inverness. Supervised + NLPO: Josh Meekings appears to get away with a handball after Inverness's win. Inverness were down to 10 men when Craig Gordon was sent off. John Hughes says the break could have meant a penalty and red card. Hughes says Celtic will go on and win the Scottish Cup with win

Figure 9 shows the ToTTo instructions, example, and interface used for the human evaluation experiments.
We made small modifications to the original code release's HTML renderer so that the tables display in our HITs. Tables 25 and 26 show averaged results, annotator agreement, and the results of statistical significance tests to determine which models produce better generations when rated by humans.



Strathern (1997) paraphrases: "When a measure becomes a target, it ceases to be a good measure."

π_ψ could be trained with alternate sampling techniques such as top-k or beam search (or even hard-coded via rules by domain experts), though we find top-p sampling to be most effective in practice.

We consider PPO representative of the present state of the art; in particular, we do not consider the popular REINFORCE (Williams, 1988; Williams, 1992), as recent work has shown PPO to be strictly superior to REINFORCE in multiple domains (Schulman et al., 2017).



Figure 1: Natural Language Policy Optimization (NLPO) in the case of sentiment-guided continuation. Here, the LM (i.e., the policy) needs to produce a positive sentiment continuation given a review prompt (we cover other models of human preference in Sec. 3.2). Two objectives are balanced: 1) an automated proxy of human preference that serves as a reward (here: a sentiment classifier); and 2) "naturalness" as measured by a KL divergence from an LM not trained with explicit human feedback. The plots show validation learning curves comparing our NLPO to the popular policy gradient method PPO. (Top plot:) RL methods can easily achieve high reward if the KL penalty is removed, (Bottom:) but at the cost of higher perplexity. NLPO+KL, our proposed approach, succeeds in balancing reward and naturalness more effectively than prior work.
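The balance the caption describes is typically implemented by shaping the task reward with a per-token KL penalty against the frozen initial LM. A minimal sketch, assuming a single terminal task reward and per-token log-probabilities; `kl_shaped_rewards` is an illustrative name, not the library's API:

```python
def kl_shaped_rewards(task_reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Shape a terminal task reward with a per-token KL penalty.

    task_reward: scalar score for the full generation (e.g. sentiment).
    logprobs_policy / logprobs_ref: log-probs of each sampled token under
    the trained policy and the frozen initial LM, respectively.
    Each step receives -beta * (log pi - log pi_0), a sampled estimate of
    the KL penalty; the task reward is added at the final step.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logprobs_policy, logprobs_ref)]
    rewards[-1] += task_reward
    return rewards
```

Setting beta to 0 recovers the reward-only behavior of the top plot; the perplexity blow-up in the bottom plot is what the penalty term is there to prevent.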

3 … for Training LMs with RL
3.1 Environments: Generation as a Token-level MDP
3.2 Reward Functions and Evaluation Metrics
3.3 On-policy Actor-critic Algorithms
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-language Understanding Eval)
5.1 Results on GRUE: Which Algorithm Should be Used to Learn Preferences?
5.2 Preference Reward Learning, Selection, and Hacking
5.3 Data Budget: Improve your Reward or Gather More Demonstration?
5.4 Practical Considerations: Which Implementation Details Matter Most?
A.1 … Details
A.2 NLPO Details
B Experimental Details
B.1 Crowdworking Details
B.2 GRUE Experiment Setup
B.3 IMDB
B.3.1 Setup
B.3.2 Results and Discussion
B.3.3 Human Participant Study
B.3.4 Qualitative Results
B.4 CommonGen
B.4.1 Setup
B.4.2 Results and Discussion
B.4.3 Human Participant Study
B.4.4 Human Preference Learning Experiments
B.4.5 Qualitative Analysis
B.5 CNN Daily Mail

Figure 4: Learning Curves: Learning curves averaged over 5 runs with varying target KL; shaded regions indicate one standard deviation. (a) shows the rollout episodic total reward during training; (b) shows the evolution of sentiment scores on the validation split; (c) shows the evolution of perplexity on the validation split. From (a) and (b), a higher target KL (0.1) is desirable for achieving higher rewards. However, this setting drifts too far from the original LM and loses fluency, so a lower target KL (0.02 or 0.05) is required to keep the model closer to the original LM. Similar trends hold for NLPO, but compared to PPO it retains lower perplexities and is more stable even with higher KL targets.
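Holding the KL near a target during training requires adapting the penalty coefficient on the fly. A sketch of the proportional controller popularized by Ziegler et al. (2019), assumed here for illustration (`update_kl_coef` is our name, not the library's):

```python
def update_kl_coef(beta, observed_kl, target_kl, n_steps, horizon=10000):
    """Proportional controller for the KL penalty coefficient.

    Increases beta when the observed policy-vs-LM KL exceeds the target
    and decreases it otherwise; the relative error is clipped to
    [-0.2, 0.2] so a single noisy estimate cannot swing beta too far.
    """
    err = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return beta * (1.0 + err * n_steps / horizon)
```

With a target KL of 0.05, overshooting to an observed KL of 0.1 nudges beta upward, pulling the policy back toward the original LM.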

Figure 5: Instructions, example, and interface for the IMDB sentiment completion task.

Figure 6: Instructions, examples, and interface for the Commongen task.

Figure 8: Instructions and interface for the summarization task.

Figure 9: Instructions, two examples, and interface for the ToTTo table description task.

Input: dataset D of size N
Input: initial policy parameters π_θ0
Input: initial LM π_0
Input: initial value function parameters V_ϕ0
Input: parameterized masked policy π_ψ0(·|·, π_θ0), initialized with the parameterized top-p policy π_θ0
Input: policy update frequency µ
repeat
    Sample mini-batch D_m = {(x_m, y_m)}^M_{m=1} from D
    Collect trajectories T_m = {τ_i} by running policy π_ψn for batch D_m in the environment
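The masked policy π_ψ restricts sampling to the top-p nucleus of a periodically synced copy of the policy. A dependency-free sketch of the masking step (`top_p_mask` is an illustrative name operating on a plain list of logits):

```python
import math

def top_p_mask(logits, p=0.9):
    """Return the set of token ids inside the top-p (nucleus) mask.

    Tokens are sorted by probability and kept until their cumulative mass
    reaches p; every other token would be assigned zero probability by
    the masked policy.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept
```

Because the nucleus shrinks the effective vocabulary at every step, the masked policy explores a far smaller slice of the combinatorial action space than unrestricted sampling.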

GRUE Benchmark using RL4LMs showing the various tasks, input and output types, and the metrics used.

Key questions answered using GRUE + RL4LMs:

IMDB Ablation Results.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.

R. J. Williams. Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, 1988.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38-45, 2020.

Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021a.

Qingyang Wu, Lei Li, and Zhou Yu. TextGAIL: Generative adversarial imitation learning for text generation. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021. AAAI Press, 2021b.

with fixed learning rate and no scheduling.

IMDB Hyperparams: Table shows a list of all hyperparameters and their settings.

We consider the IMDB dataset for the task of generating text with positive sentiment. The dataset consists of 25k training, 5k validation, and 5k test examples of movie review text with sentiment labels of positive and negative. The input to the model is a partial movie review (up to 64 tokens) that needs to be completed (generating 48 tokens) by the model with a positive sentiment while retaining fluency. For RL methods, we use a sentiment classifier (Sanh et al., 2019) trained on pairs of text and labels as a reward model, which provides sentiment scores indicating how positive a given piece of text is. For supervised Seq2Seq baselines, we consider only the examples with positive labels.
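The reward computation described above reduces to scoring the prompt plus its continuation with the classifier's positive-class probability. A sketch with a toy stand-in classifier; both `sentiment_reward` and the toy callable are illustrative names, not RL4LMs API (the experiments use a DistilBERT-based sentiment model):

```python
def sentiment_reward(prompt, generation, classifier):
    """Reward = P(positive) for the prompt concatenated with the generation.

    `classifier` is any callable mapping text to a positive-class
    probability in [0, 1].
    """
    return float(classifier(prompt + generation))

# Toy stand-in classifier, for illustration only.
toy_classifier = lambda text: 1.0 if "great" in text else 0.2
```

Any scalar preference model can be dropped in for `classifier`, which is what makes the same training loop reusable across GRUE tasks.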



Table 8 shows ablations over different hyperparameters of the NLPO algorithm. Evaluation of GPT2 with different algorithms on the IMDB sentiment text continuation task, discount factor ablations: mean and standard deviation over 5 random seeds are reported for sentiment scores along with fluency and diversity metrics. This table measures performance differences for the discount factor. We note that most NLP approaches using RL follow the style of Li et al. (2016) and Wu et al. (2021a) and use a discount factor of 1. This is equivalent to reducing the generation MDP to a bandit feedback environment, and causes performance loss (in the case of NLPO) and reward hacking and training instability (in the case of PPO).
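The bandit-versus-MDP distinction in this ablation comes down to how per-token returns are computed from a terminal reward. A minimal sketch (`discounted_returns` is our name for the standard computation):

```python
def discounted_returns(rewards, gamma):
    """Per-token returns G_t = sum over k >= t of gamma^(k - t) * r_k.

    With gamma = 1 and a single terminal reward, every token receives the
    same return, collapsing the generation MDP to bandit feedback; with
    gamma < 1, earlier tokens receive geometrically discounted credit.
    """
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]
```

For a 3-token generation whose only reward arrives at the end, gamma = 1 assigns identical credit to all tokens, whereas gamma = 0.5 halves the credit at each earlier step.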

Evaluation of GPT2 with different algorithms on the IMDB sentiment text continuation task, NLPO hyperparameter ablations: mean and standard deviation over 5 random seeds are reported for sentiment scores along with fluency and diversity metrics. This table shows NLPO's stability to the unique hyperparameters introduced in the algorithm (all other parameters held constant from the best PPO model): the number of iterations after which the masking model syncs with the policy, and the top-p nucleus percentage for the mask model itself. We see that, in general, the higher the top-p mask percentage, the better the performance. For target update iterations, performance is low if the mask model is updated either too rarely or too often.

Target update iters | Sentiment | Perplexity | Diversity metrics →
… | … | … ± 0.201 | 0.669 ± 0.008 | 0.042 ± 0.002 | 0.284 ± 0.007 | 8.575 ± 0.064 | 13.503 ± 0.181 | 4986 ± 265 | 45916 ± 1168
10 | 0.622 ± 0.014 | 32.729 ± 0.567 | 0.659 ± 0.019 | 0.042 ± 0.002 | 0.274 ± 0.007 | 8.489 ± 0.106 | 13.31 ± 0.272 | 5138 ± 385 | 43989 ± 1120
20 | 0.637 ± 0.013 | 32.667 ± 0.631 | 0.677 ± 0.014 | 0.044 ± 0.002 | 0.288 ± 0.010 | 8.588 ± 0.100 | 13.484 ± 0.236 | 5205 ± 189 | 46344 ± 2688
50 | 0.603 ± 0.015 | 33.397 ± 0.325 | 0.67 ± 0.006 | 0.043 ± 0.001 | 0.287 ± 0.004 | 8.605 ± 0.041 | 13.54 ± 0.116 | 5228 ± 113 | 46418 ± 685

Results of the human subject study showing the number of participants N, average Likert scale value for coherence and sentiment, Krippendorf's alpha showing inter-annotator agreement, and Skew. For each model a total of 100 samples were drawn randomly from the test set and rated by 3 annotators each, resulting in 300 data points per algorithm.

Results of a post-hoc Tukey HSD test for differences in means between pairs of algorithms (Group 2 - Group 1) and corresponding p-values. Individually statistically significant results are bolded and are used to discuss results in the analysis. A one-way ANOVA test shows that there is a significant difference in means between the models, with p ≪ 0.05 for both coherence and sentiment.

CommonGen test evaluation: Table shows official scores obtained from the CommonGen hold-out evaluation. The most important result is that RL fine-tuning on a supervised model yields better performance across most metrics, especially Coverage, which indicates the ratio of concepts covered in generated texts.
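The Coverage metric referenced in the caption can be approximated with an exact-match sketch; the official CommonGen metric additionally matches inflected forms, which is skipped here to stay dependency-free:

```python
def coverage(concepts, generation):
    """Fraction of input concepts whose exact surface form appears in the
    generated sentence (lowercased, whitespace-tokenized)."""
    words = set(generation.lower().split())
    hits = sum(1 for c in concepts if c.lower() in words)
    return hits / len(concepts)
```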

CommonGen dev evaluation: Table shows lexical, semantic, and diversity metrics for the best-performing models found for each algorithm-reward function combination, along with the best-performing supervised baseline models. Generated text from these models is submitted to the official CommonGen test evaluation to obtain the test scores presented in Table 12.

Results of the human subject study showing the number of participants N, average Likert scale value for coherence and sentiment, Krippendorf's alpha showing inter-annotator agreement, and Skew. For each model a total of 100 samples were drawn randomly from the test set and rated by 3 annotators each, resulting in 300 data points per algorithm.

PPO Ablation/Model Selection: Evaluation of PPO models on the validation set with different reward functions and top-k values for rollouts. For each algorithm-reward combination, the best model (top-k) is chosen.

NLPO Ablation/Model Selection: Evaluation of NLPO models on the validation set with different reward functions, top-k values for rollouts, and target update iterations. For each algorithm-reward combination, the best model is chosen.

Results of the human subject study showing the number of participants N, average Likert scale value for coherence and sentiment, Krippendorf's alpha showing inter-annotator agreement, and Skew. For each model a total of 50 samples were drawn randomly from the test set and rated by 3 annotators each, resulting in 150 data points per algorithm.

There were 84 passengers on board the plane, including nine crew members. Flight 5622 was originally scheduled to fly from Chicago to Hartford, Connecticut. The plane descended 28,000 feet in three minutes. "It would feel like a roller coaster --when you're coming over the top and you're going down," CNN aviation analyst Mary Schiavo said, describing how such a descent would feel. "You know that these pilots knew they were in a very grave and very serious situation." Johnson, flying to visit his parents with their only grandchild, was glad that roller coaster ride ended safely. "It was a shaky ride. It wasn't anything I would like to experience again," he said. CNN's Jean Casarez, Josh Gaynor, Dana Ford and Stephanie Gallman contributed to this report. Zero-Shot: "you're going to feel dizzy and woozy and lightheaded," Johnson says. "it was like being trapped and you couldn't do anything," Johnson says. "the plane was trying to get back on the ground," passenger says of fear. References: Three passengers report a loss of consciousness on SkyWest flight .But officials say there is no evidence of a pressurization problem . PPO: SkyWest flight 5312 made emergency landing in Buffalo, new york. passengers were told they were not going to be able to breathe, passenger recalls attendant. the faa initially reported a pressurization problem with the flight 5622. NLPO: SkyWest flight 5622 made emergency landing in buffalo, new york.Larry Johnson says flight attendant announced there was a pressurization problem. three passengers reported loss of consciousness while on flight; no one transported to hospital. Supervised: The FAA says it is investigating the cause of the emergency landing. SkyWest

ToTTo test evaluation: Table shows lexical, semantic, and factual correctness metric scores of algorithms with different reward functions on the hold-out test set. Without supervised pre-training, both PPO and NLPO result in sub-optimal solutions, with NLPO better than PPO. With supervised pre-training, PPO and NLPO achieve better scores across all metrics, showing that RL fine-tuning is beneficial. Most importantly, RL fine-tuned models produce more factually consistent text, as seen in higher PARENT scores. Another observation: fine-tuning with the task-specific metric PARENT is better than training on task-agnostic lexical rewards.

ToTTo dev evaluation: Table shows lexical, semantic, and factual correctness metric scores of algorithms with different reward functions on the dev set. Without supervised pre-training, both PPO and NLPO result in sub-optimal solutions, with NLPO better than PPO. With supervised pre-training, PPO and NLPO achieve better scores across all metrics, showing that RL fine-tuning is beneficial. Most importantly, RL fine-tuned models produce more factually correct text, as seen in higher PARENT scores. Another observation: fine-tuning with the task-specific metric PARENT is better than training only on task-agnostic lexical metrics.

Results of the human subject study showing the number of participants N, average Likert scale value for coherence and sentiment, Krippendorf's alpha showing inter-annotator agreement, and Skew. For each model a total of 50 samples were drawn randomly from the test set and rated by 3 annotators each, resulting in 150 data points per algorithm.

Results of a post-hoc Tukey HSD test for differences in means between pairs of algorithms (Group 2 - Group 1) and corresponding p-values. Individually statistically significant results are bolded and are used to discuss results in the analysis. A one-way ANOVA test shows that there is a significant difference in means between the models, with p ≪ 0.05 for both coherence and sentiment.

7. ACKNOWLEDGEMENTS

We'd like to acknowledge the support of DARPA MCS program through NIWC Pacific (N66001-19-2-4031), Google Cloud Compute, and the ReViz team at the Allen Institute for AI. KB is supported by NSF under grant No. 2127309 to the Computing Research Association for the CIFellows Project. Rajkumar is funded by the Federal Ministry of Education and Research of Germany and the state of North-Rhine Westphalia as part of the Lamarr-Institute for Machine Learning and Artificial Intelligence.

APPENDIX

B.4 COMMONGEN

B.4.1 SETUP

CommonGen (Lin et al., 2020) deals with the task of generating coherent sentences describing an input set of concepts (e.g., "a man is throwing a frisbee"). For training RL methods, we consider 3 traditional lexical rewards, namely ROUGE-1, ROUGE-avg (an average of ROUGE-1, -2, and -L), and METEOR. Additionally, we also train with task-specific rewards such as CIDEr (Vedantam et al., 2015), SPICE (Anderson et al., 2016), and SPIDEr (Liu et al., 2017), which is just a linear combination of the two with equal weights. We chose T5-base as the base LM since it is well suited for structure-to-text tasks. We additionally note that concept set inputs are prefixed with "generate a sentence with:" to encourage exploration.

During our initial experiments fine-tuning directly on the LM, we observed that the policy learns to repeat the prompted concepts in order to maximize rewards, resulting in the well-known problem of reward hacking. To mitigate this, we add a penalty score of -1 to the final task reward if n-grams of the prompt text overlap with the generated text. In contrast, when initialized with a supervised policy, this problem does not arise, so the penalty is not applied. We use beam search as the decoding method during evaluation, whereas for rollouts we use top-k sampling to favor exploration over exploitation. Table 11 provides an in-depth summary of hyperparameter settings along with other implementation details.

B.4.4 HUMAN PREFERENCE LEARNING EXPERIMENTS

Annotators were shown pairs of generations and asked to select which one they prefer with respect to commonsense/fluency. We gathered 3 annotations on 417 pairs (Krippendorf's α = .28) and split them into 60/20/20 train/val/test splits. We then trained a reward model, T5-11B (Raffel et al., 2020), on the balanced binary classification task of predicting which member of the pair was preferred by a majority of the 3 annotators, conditioned on the prompt and completion. The resulting model achieved 69.5 test ROC AUC, suggesting it indeed captures average human preferences.
The model is then used as a reward function. We train Supervised+RL with a METEOR-only reward as a baseline, and compare it to a reward function that uses the fine-tuned T5-11B model. We design the reward function based on the preference model as r = METEOR + pref/(1 + |miss|), where miss is the set of concepts not covered in the generated text, in an attempt to mimic the data collection process that humans are instructed to follow. This reward function accounts both for the task of using all concepts and for human preferences about how a sentence should look within the constraints stipulated by the task. Finally, we rerun the same pairwise preference collection procedure (this time sampling from the CommonGen test set) with human participants to compare the generations from the preference-optimized RL policy to the previously best Supervised+NLPO policy. Comparing the METEOR-only model to the preference model head-to-head, the generations produced by the human feedback model are preferred in 682 cases, compared to 587 cases for the METEOR-only model (p < 0.01 against the null hypothesis that the models are equally preferred).
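Two reward-shaping devices from this section can be sketched together: the prompt-overlap penalty used to curb reward hacking in the setup, and the combined preference reward r = METEOR + pref/(1 + |miss|). Function names are ours, and the exact n-gram order used for the penalty is an assumption:

```python
def ngram_overlap_penalty(prompt, generation, n=3, penalty=-1.0):
    """Flat penalty when any prompt n-gram reappears verbatim in the
    generation; returns 0.0 otherwise."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return penalty if ngrams(prompt) & ngrams(generation) else 0.0

def preference_reward(meteor_score, pref_prob, missing_concepts):
    """r = METEOR + pref / (1 + |miss|): the preference-model probability
    is down-weighted by the number of prompt concepts left uncovered."""
    return meteor_score + pref_prob / (1 + len(missing_concepts))
```

The division by 1 + |miss| means the preference signal only pays out in full when every prompted concept actually appears in the generation, coupling the learned preference to the task constraint.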

B.4.5 QUALITATIVE ANALYSIS

This section shows sample generations from different algorithms for three randomly picked prompts.

Sample 1

Prompt: generate a sentence with: apron cut hat kitchen sausage Zero-Shot: generate a sentence with: apron cut hat kitchen sausage. PPO: sausage in the kitchen on an apron. NLPO: sausage cut hat cut hat cut hat cut apron cut hat Supervised: A man is cutting sausage on an apron in a kitchen. Supervised + PPO: A man cutting sausage from a hat on an apron in the kitchen. Supervised + NLPO: A man cuts sausage in a hat on an apron in the kitchen.

Tables 24 and 23 present our benchmarking results with 5 reward functions along with supervised baseline performances on the dev and test sets, respectively. Similar to other tasks, our main finding is that warm-started initial policies are crucial for learning to generate descriptions from highlighted cells. Without warm-starting, policies suffer from reward hacking, resulting in sub-optimal solutions despite the use of task-specific metrics such as PARENT. We find that the Supervised+NLPO method outperforms all models on the ToTTo leaderboard in terms of the PARENT metric.

B.6.4 QUALITATIVE ANALYSIS

We show sample generations from each of the algorithms for three randomly picked prompts below.

Sample 1

Prompt: <table> <cell> August 1, 1992 <col_header> Date </col_header> <col_header> North America </col_header> <col_header> Australasia </col_header> <col_header> Asia </col_header> <col_header> North America </col_header> </cell> <cell> Allentown <col_header> City </col_header> <col_header> North America </col_header> <col_header> Australasia </col_header> <col_header> Asia </col_header> <col_header> North America </col_header> </cell> <cell> Allentown Fairgrounds Grandstand <col_header> Country </col_header> <col_header> North America </col_header> <col_header> Australasia </col_header> <col_header> Asia </col_header> <col_header> North America </col_header> </cell> </table> <page_title> Under My Spell Tour </page_title> <section_title> Tour dates </section_title> <table> <cell> August 1, 1992 <col_header> Date </col_header> <col_header> North America </col_header> <col_header> Australasia </col_header> <col_header> Asia </col_header> <col_header> North America </col_header> </cell> <cell> Allentown <col_header> City </col_header> <col_header> North America </col_header> <col_header> Australasia </col_header> <col_header> Asia </col_header> <col_header> North America </col_header> </cell> <cell> Allentown Fairgrounds Grandstand <col_header> Country </col_header> <col_header> North America </col_header> <col_header> Australasia </col_header> <col_header> Asia </col_header> <col_header> North America </col_header> </cell> </table> Zero-Shot: table> cell> Under My Spell Tour /section_title>

NarrativeQA (Kočiskỳ et al., 2018) deals with the task of generating answers to questions about a given story. For training RL methods, we consider 2 traditional lexical rewards, namely Rouge Combined and Rouge-L-Max. We chose T5-base as the base LM since it has been shown to do well at question answering in prior work (Khashabi et al., 2020).
We note that the supervised models we use are trained on the UnifiedQA dataset, which contains other QA datasets, and is shown by Khashabi et al. (2020) to outperform supervised fine-tuning only on NarrativeQA. Hyperparams for our models can be found in Table 27. We show sample generations from each of the algorithms for three randomly picked prompts below.

Prompt: who is mark hunter? mark hunter (slater), a high school student in a sleepy suburb of phoenix, arizona, starts an fm pirate radio station that broadcasts from the basement of his parents' house. mark is a loner, an outsider, whose only outlet for his teenage angst and aggression is his unauthorized radio station. his pirate station's theme song is "everybody knows" by leonard cohen and there are glimpses of cassettes by such alternative musicians as the jesus and mary chain, camper van beethoven, primal scream, soundgarden, ice-t, bad brains, concrete blonde, henry rollins, and the pixies. by day, mark is seen as a loner, hardly talking to anyone around him; by night, he expresses his outsider views about what is wrong with american society. when he speaks his mind about what is going on at his school and in the community, more and more of his fellow students tune in to hear his show. nobody knows the true identity of "hard harry" or "happy harry hard-on," as mark refers to himself, until nora diniro (mathis), a fellow student, tracks him down and confronts him the day after a student named malcolm commits suicide after harry attempts to reason with him. the radio show becomes increasingly popular and influential after harry confronts the suicide head-on, exhorting his listeners to do something about their problems instead of surrendering to them through suicide; at the crescendo of his yelled speech, an overachieving student named paige woodward (who has been a constant listener) jams her various medals and accolades into a microwave and turns it on.
she then sits, watching the awards cook until the microwave explodes, injuring her. while this is happening, other students act out in cathartic release.eventually, the radio show causes so much trouble in the community that the fcc is called in to investigate. during the fracas, it is revealed that the school's principal (annie ross) has been expelling " problem students," namely, students with below-average standardized test scores, in an effort to boost the district's test scores while still keeping their names on the rolls ( a criminal offense) in order to retain government funding.realizing he has started something huge, mark decides it is up to him to end it. he dismantles his radio station and attaches it to his mother's old jeep, creating a mobile transmitter so his position can't be triangulated. pursued by the police and the fcc, nora drives the jeep around while mark broadcasts. the harmonizer he uses to disguise his voice breaks, and with no time left to fix it, mark decides to broadcast his final message as himself. they finally drive up to the crowd of protesting students, and mark tells them that the world belongs to them and that they should make their own future. the police step in and arrest mark and nora. as they are taken away, mark reminds the students to "talk hard." as the film ends, the voices of other students (and even one of the teachers) speak as intros for their own independent stations, which can be heard broadcasting across the country. Zero-Shot: hunter (slater) is a high school student in a sleepy suburb of phoenix, arizona. 
he starts an fm pirate radio station that broadcasts from the basement of his References: he is a high school student in phoenix 2) loner and outsider student with a radio station PPO: hunter (slater), a high school student in a sleepy suburb of phoenix, arizona, starts an fm pirate radio station that broadcasts from the basement of his parents' house NLPO: hunter (slater), a high school student in a sleepy suburb of phoenix, arizona, starts an fm pirate radio station that broadcasts from the basement of his parents' house Supervised: a high school student Supervised + PPO: a high school student Supervised + NLPO: a high school student

Sample 2

Prompt: what was the principle doing with the problem students? mark hunter (slater), a high school student in a sleepy suburb of phoenix, arizona, starts an fm pirate radio station that broadcasts from the basement of his parents' house. mark is a loner, an outsider, whose only outlet for his teenage angst and aggression is his unauthorized radio station. his pirate station's theme song is "everybody knows" by leonard cohen and there are glimpses of cassettes by such alternative musicians as the jesus and mary chain, camper van beethoven, primal scream, soundgarden, ice-t, bad brains, concrete blonde, henry rollins, and the pixies. by day, mark is seen as a loner, hardly talking to anyone around him; by night, he expresses his outsider views about what is wrong with american society. when he speaks his mind about what is going on at his school and in the community, more and more of his fellow students tune in to hear his show. nobody knows the true identity of "hard harry" or "happy harry hard-on," as mark refers to himself, until nora diniro (mathis), a fellow student, tracks him down and confronts him the day after a student named malcolm commits suicide after harry attempts to reason with him. the radio show becomes increasingly popular and influential after harry confronts the suicide head-on, exhorting his listeners to do something about their problems instead of surrendering to them through suicide; at the crescendo of his yelled speech, an overachieving student named paige woodward (who has been a constant listener) jams her various medals and accolades into a microwave and turns it on. she then sits, watching the awards cook until the microwave explodes, injuring her. while this is happening, other students act out in cathartic release.eventually, the radio show causes so much trouble in the community that the fcc is called in to investigate.
during the fracas, it is revealed that the school's principal (annie ross) has been expelling "problem students," namely, students with below-average standardized test scores, in an effort to boost the district's test scores while still keeping their names on the rolls (a criminal offense) in order to retain government funding.realizing he has started something huge, mark decides it is up to him to end it. he dismantles his radio station and attaches it to his mother's old jeep, creating a mobile transmitter so his position can't be triangulated. pursued by the police and the fcc, nora drives the jeep around while mark broadcasts. the harmonizer he uses to disguise his voice breaks, and with no time left to fix it, mark decides to broadcast his final message as himself. they finally drive up to the crowd of protesting students, and mark tells them that the world belongs to them and that they should make their own future. the police step in and arrest mark and nora. as they are taken away, mark reminds the students to "talk hard." as the film ends, the voices of other students (and even one of the teachers) speak as intros for their own independent stations, which can be heard broadcasting across the country. Zero-Shot: mark hunter (slater) starts an fm pirate radio station that broadcasts from the basement of his parents' house. by day, mark is seen as a loner, an outsider, whose only outlet References: 1) expelling them. 2) expelling the students PPO: mark hunter (slater) starts an fm pirate radio station that broadcasts from the basement of his parents' house. mark is a loner, an outsider, whose only outlet for his teenage ang NLPO: mark hunter (slater) starts an fm pirate radio station that broadcasts from the basement of his parents' house. mark is a loner, an outsider, whose only outlet for his teenage ang Supervised: he was taking them to a psychiatric hospital. Supervised + PPO: addressing them Supervised + NLPO: he was expelling them

Sample 3

Prompt: what happens to the people maskull meets in his travels to various lands? maskull, a man longing for adventures, accepts an invitation from krag, an acquaintance of his friend nightspore, to travel to tormance after a seance. the three set off in a crystal ship from an abandoned observatory in scotland but maskull awakens to find himself alone on tormance. in every land he passes through he usually meets only one or two persons; these meetings often (though not always) end in the death of those he meets, either at his own hand or by that of another. he learns of his own impending death, meets krag again, and dies shortly after learning that he is in fact nightspore himself. the book concludes with a final revelation from krag (who claims to be known on earth as "pain") to nightspore about the origin of the universe. the author turns out to support a variation of the doctrine of the demiurge, somewhat similar to that propounded by some gnostics. all of the characters and lands are types used to convey the author's critique of several philosophical systems. on tormance, most such viewpoints or ways of life are accompanied by corresponding new bodily sense organs or modifications of the same, thus each distinct weltanschauung landscape has its corresponding sensorium.
Zero-Shot: what happens to the people maskull meets in his travels to various lands? maskull, a man longing for adventures, accepts an invitation from krag to travel to tormance after
References: 1) they usually die by his hand or another's. 2) they die
PPO: what happens to the people maskull meets in his travels to various lands? maskull, a man longing for adventures, accepts an invitation from krag to travel to tormance after
NLPO: maskull meets krag, an acquaintance of nightspore, to travel to tormance after a seance. maskull awakens to find himself alone on tormance. in every land
Supervised: they die either from his own hand or from another person's death.
Supervised + PPO: they end up dying either at his own hand or by another's.
Supervised + NLPO: they end up dying either at his own hand or by another's

B.8 MACHINE TRANSLATION

B.8.1 SETUP

WMT-16 We pick two languages, English and German, and frame this task similarly to other machine translation tasks, requiring the models to translate from English to German. We train models on 4 rewards: SacreBLEU, chrF, TER, and BERTScore. We show sample generations from each of the algorithms for three randomly picked prompts from IWSLT below.

Prompt: translate English to German: And of course there are teams that have a lot more "ta-da" structures, and among the best are recent graduates of kindergarten.
Zero-Shot: Und natürlich gibt es Teams, die viel mehr "ta-da" Strukturen haben, und zu den besten gehören jüngste Absolventen des Kindergartens.
References: Und natürlich gibt es auch Teams die viel mehr "ta-da"-Gebilde haben und unter den Besten sind frische Absolventen des Kindergartens.
PPO: Und natürlich gibt es Teams, die viel mehr "ta-da" Strukturen, und unter den besten sind jüngste Absolventen des Kindergartens.
NLPO: Und natürlich gibt es Teams, die viel mehr "ta-da" Strukturen haben, und unter den besten sind jüngste Absolventen des Kindergartens
Supervised: Und natürlich gibt es Teams, die viel mehr "ta-da"-Strukturen haben, und unter den besten sind jüngste Absolventen des Kindergartens.
Supervised + PPO: Und natürlich gibt es Teams, die viel mehr "ta-da"-Strukturen haben, und unter den besten sind jüngste Absolventen des Kindergartens.
Supervised + NLPO: Und natürlich gibt es Teams, die viel mehr "ta-da"-Strukturen haben, und unter den besten sind jüngste Absolventen des Kindergartens.

B.9 DAILY DIALOG

B.9.1 SETUP

We consider DailyDialog (Li et al., 2017) as the test bed for the dialogue generation task. The dataset includes conversations written by humans on various topics.
In addition, each utterance contains labels of intent and emotional information. For simplicity, we focus only on generating the next utterance, given the dialogue context. We chose a context window of size 5, which results in 35k training, 3k validation, and 3k test utterances. The input to the model is the dialogue history, in which utterances are concatenated using an <EOU> token. We picked GPT-2 as the LM, since decoder-only models are better suited for text continuation than encoder-decoder LMs. For a fair comparison, we use top-k sampling with k = 20 as the decoding method for all methods. For the RL methods, we use a linear combination of METEOR score and intent-match score (whether the generated text's intent matches the reference's intent) as the reward function. The coefficients for METEOR and intent are chosen based on both lexical scores and intent accuracy on the validation set. For this purpose, we trained an intent classifier (a fine-tuned RoBERTa (Liu et al., 2019)) that classifies a given text into intent categories such as inform, question, directive, and commissive. We show sample generations from each of the algorithms for three randomly picked prompts below.

Sample 1

Prompt: Right . And China is the homeland of tea . <EOU> Yes , Chinese people love drinking tea so much . Some even claim they can't live without tea . <EOU> Do you know there are several catagories of Chinese tea ? <EOU> Yes , I believe there are green teas , black teas and scented teas . Any Others ? <EOU> Well , have you ever heard of Oulong tea and compressed tea ? <EOU>
References: Oh , yeah . Oulong tea is good for one's health . isn't it ?
Zero-Shot: I know that Oulong tea can be enjoyed with very few meals, and there are very few
PPO: I have, but it's not as strong as Oulong tea.
NLPO: I haven't. I don't have tea.
Supervised: Yes. I hear they do good. And I heard this Taijiquan Tea is very good
Supervised + PPO: Actually, I have had Oulong tea, but I haven't had any for awhile now.
Supervised + NLPO: No, I've never heard of that.
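Both setups above reduce to scalar reward functions over (generation, reference) pairs. As an illustration only, the sketch below shows two such rewards: a simplified character n-gram F-score (a toy stand-in for the chrF reward in the machine translation setup) and the linear METEOR-plus-intent-match combination described for DailyDialog. The actual experiments use standard metric implementations and a fine-tuned RoBERTa intent classifier; the function names, default weights, and pluggable `meteor_fn`/`intent_fn` stubs here are hypothetical.

```python
from collections import Counter


def chrf_reward(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Toy sentence-level chrF: average character n-gram precision/recall,
    combined as an F_beta score (beta = 2 weights recall higher, as in chrF)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped match count
        if hyp_ngrams:
            precisions.append(overlap / sum(hyp_ngrams.values()))
        if ref_ngrams:
            recalls.append(overlap / sum(ref_ngrams.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0


def dialog_reward(generated: str, reference: str, meteor_fn, intent_fn,
                  w_meteor: float = 0.5, w_intent: float = 0.5) -> float:
    """Linear combination of a METEOR-style lexical score and an intent-match
    bonus; in the experiments the coefficients are tuned on the validation set."""
    lexical = meteor_fn(generated, reference)  # lexical similarity in [0, 1]
    intent_match = float(intent_fn(generated) == intent_fn(reference))
    return w_meteor * lexical + w_intent * intent_match
```

A rule-based `intent_fn` stub (e.g., mapping utterances ending in "?" to a question intent) is enough to exercise the reward; in the experiments this role is played by the fine-tuned RoBERTa classifier.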

Sample 2

Prompt: I think I might have a dahl . <EOU> what's in a dahl ? <EOU> it's got chickpeas and vegetables in a spicy curry sauce with rice . <EOU> that sounds nice . Do you want to share some kebabs as well ? <EOU> ok . How about some lamb kebabs ? <EOU>
References: that's my favourite . Do you want to have some wine or beer ?
Zero-Shot: they're very salty
PPO: well you might want to take a peek.
NLPO: well I like that.
Supervised: ok, here they are. I'm so thirsty. I can't wait until the next bus to
Supervised + PPO: I've tried that a few times, but they're too dry and I don't want any of
Supervised + NLPO: oh, ok. That's a good choice. I'll take it.

Sample 3

Prompt: Really ? Have you had an auto-reverse break before ? <EOU> Yes , twice now . I think it's too high-tech , and so it's the first part that breaks . <EOU> Hmm . You know , the problem might be dust . 

