OFFLINE RL FOR NATURAL LANGUAGE GENERATION WITH IMPLICIT LANGUAGE Q LEARNING

Abstract

Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user-specified tasks. This issue can be addressed by fine-tuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL method, implicit language Q-learning (ILQL), designed for use on language models, that combines the flexible utility maximization framework of RL algorithms with the ability of supervised learning to leverage previously collected data, as well as its simplicity and stability. Our method employs value conservatism alongside an implicit dataset support constraint in learning value functions, which are then used to guide language model generations toward maximizing user-specified utility functions. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high-variance reward functions based on subjective judgment, such as whether to label a comment as toxic or not.¹



1. INTRODUCTION

Large language models can acquire a remarkable amount of knowledge from large text corpora, and can be applied to a wide range of language-based tasks. However, such models are not designed to optimize any user-specified utility, instead requiring considerable trial and error to design prompts that coerce the models into producing desirable outputs (Liu et al., 2021; Brown et al., 2020; Min et al., 2021). In essence, standard unsupervised language model training only solves part of the problem: it is effective at distilling down the knowledge in large corpora, but relatively clumsy when applying this knowledge to solve user-specified tasks. Reinforcement learning (RL) can in principle provide an effective framework for steering language models toward user-specified tasks, as long as the task can be represented by some utility function (i.e., a reward function); however, as outlined in Figure 2, contemporary methods suffer from high systems complexity and can require expensive human interaction.
We need several conditions to make RL practical: (1) Easy to use: the underlying learning algorithm and workflow should be simple, stable, and scalable; (2) Able to optimize user-specified rewards: the algorithm should be able to steer a language model toward maximizing any user-defined reward signal, from high-level task goals (e.g., book a flight) to low-level linguistic subtleties (e.g., avoiding rude or toxic speech); (3) Practical for interactive applications: the system should be able to handle a variety of tasks, from generating text with desired properties to sequential turn-taking in settings such as dialogue tasks; (4) Able to leverage existing data: such a system should be able to directly utilize the large quantities of existing data, avoiding expensive and time-consuming online human interactions; (5) Temporally compositional (Emmons et al., 2021; Rafols et al., 2005): the method should be able to attain significant improvement over the average behavior in the data, not merely copying the best behaviors in the dataset, but actually distilling out underlying patterns in the relationship between rewards, task dynamics, and language to produce near-optimal generations, even when the dataset demonstrates only mediocre task performance.
Our main contributions are twofold: (1) a novel offline RL algorithm, ILQL, for language models, that employs a stable optimization process that can flexibly learn high-performing policies from sub-optimal data in arbitrary sequential decision-making settings, thus meeting each of the conditions laid out above; and (2) a detailed empirical analysis, demonstrating not only ILQL's ability to adapt to many different utility functions more consistently and stably than prior approaches, but also its unique ability to optimize stochastic or subjective reward functions, and its ability to discover optimal behaviors in the face of sub-optimal or unusual data distributions. In particular, in the controlled generation setting of producing non-toxic text, we demonstrate that ILQL, trained on both toxic and non-toxic comments, learns to produce fewer toxic outputs than the more standard approach of supervised fine-tuning on only non-toxic comments.

2. RELATED WORK

A number of prior works have explored combining online RL methods with language models for natural language tasks such as machine translation or summarization (Ranzato et al., 2015; Wu et al., 2016; Paulus et al., 2017; Wu & Hu, 2018). These works have demonstrated that RL can be an effective tool for steering language models toward satisfying utility functions. However, when it comes to settings that require multiple steps of human interaction, e.g., dialogue, these methods can quickly become impractical (Verma et al., 2022; Ghasemipour et al., 2020). Offline RL addresses this shortcoming by removing all need for environment interaction or user simulators, instead operating purely on static datasets of prior human interaction. Several prior works have applied offline RL to NLP and, more broadly, sequence generation problems (Jaques et al., 2020; Verma et al., 2022; Jaques et al., 2017; Snell et al., 2022; Janner et al., 2021; Chen et al., 2021).



¹ Code at https://sea-snell.github.io/ILQL_site/



Figure 1: Offline RL differs from supervised learning in that it learns to maximize user-specified rewards from suboptimal data with reward labels.

Figure 2: ILQL meets each of the five criteria for practical NLP RL methods we outline in Section 1.

Offline RL provides a learning paradigm (Figure 1) that combines supervised learning's ability to leverage existing data (criterion 4) with RL's ability to optimize arbitrary rewards and leverage temporal compositionality (criteria 2, 3, 5) (Levine et al., 2020; Kostrikov et al., 2021; Kumar et al., 2020; Janner et al., 2021; Chen et al., 2021; Yu et al., 2020; Kidambi et al., 2020). However, prior offline RL approaches for language tasks are either based on dynamic programming, which enjoys temporal compositionality (see Appendix A.2) but suffers from high systems complexity, hyper-parameter instability, and slow training times (Verma et al., 2022; Jaques et al., 2020; 2017) (meets criterion 5, fails 1), or based on conditional imitation or dataset value learning, which is simple and stable to train but does not provide the temporal compositionality enjoyed by "full" RL methods (meets criterion 1, fails 5) (Chen et al., 2021; Snell et al., 2022; Holtzman et al., 2018; Yang & Klein, 2021; Li et al., 2017; Krause et al., 2021). Motivated by all of these criteria, we design a novel offline RL method based on dynamic programming with an implicit dataset support constraint (Kostrikov et al., 2021) that enjoys greater stability, fewer training-time dependencies (such as relying on approximate likelihoods from an external language model during training), and a more flexible decoding process than prior approaches (see Sections 4 and 6.4). Our method, ILQL, fine-tunes a transformer language model to predict the state-action Q-function and the state value function V at each token.
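The per-token Q and V predictions described above can be pictured as small heads on top of a transformer's hidden states. The sketch below is illustrative only: the class name, head shapes, and the assumption that the base model returns per-token hidden states are ours, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class QVHeadLM(nn.Module):
    """Illustrative sketch: a decoder-only LM body with added Q and V heads.

    Assumes `base` maps token ids (batch, seq) to hidden states
    (batch, seq, hidden_dim); actions are the next-token choices,
    so Q has one output per vocabulary entry.
    """

    def __init__(self, base, hidden_dim, vocab_size):
        super().__init__()
        self.base = base                                  # pretrained transformer body
        self.q_head = nn.Linear(hidden_dim, vocab_size)   # Q(s, a) for every token action
        self.v_head = nn.Linear(hidden_dim, 1)            # scalar V(s) per position

    def forward(self, input_ids):
        hidden = self.base(input_ids)                     # (batch, seq, hidden_dim)
        q = self.q_head(hidden)                           # (batch, seq, vocab_size)
        v = self.v_head(hidden).squeeze(-1)               # (batch, seq)
        return q, v
```

In practice the Q, V, and language-modeling heads can share the same transformer body, so value learning adds little overhead on top of standard fine-tuning.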
During training we perform iterative policy improvement by fitting a value function to an upper expectile of the Q-function, enabling us to learn policies that leverage temporal compositionality and significantly outperform the data, while avoiding expensive training-time procedures such as sampling counterfactual utterances from the language model (Verma et al., 2022) (see Sections 5 and 6). Then, at inference time, we can simply steer a standard language model toward utility-maximizing behavior by perturbing its predicted likelihoods with our learned value functions (see Figure 3).
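Two pieces of this procedure can be sketched concretely: the asymmetric (expectile) regression loss that pulls V toward an upper expectile of Q, and the inference-time perturbation of the language model's next-token logits by the learned advantage. This is a minimal sketch under our own assumptions; the function names, the expectile parameter `tau`, and the perturbation weight `beta` are illustrative placeholders, not the paper's exact hyper-parameters.

```python
import torch

def expectile_loss(q, v, tau=0.7):
    """Asymmetric L2 regression of V onto Q.

    With tau > 0.5, cases where V falls below Q are penalized more
    heavily, so V is pushed toward an upper expectile of Q.
    """
    diff = q - v
    weight = torch.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff ** 2).mean()

def perturbed_logits(lm_logits, q, v, beta=1.0):
    """Steered decoding: add beta * (Q(s, a) - V(s)) to the LM's
    next-token logits, upweighting tokens with positive advantage.

    lm_logits, q: (batch, vocab); v: (batch,).
    """
    return lm_logits + beta * (q - v.unsqueeze(-1))
```

Because the perturbation is applied only at decoding time, the same learned values can steer the base language model more or less aggressively by varying `beta`, without retraining.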

