OFFLINE RL FOR NATURAL LANGUAGE GENERATION WITH IMPLICIT LANGUAGE Q LEARNING

Abstract

Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user-specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL method, implicit language Q-learning (ILQL), designed for use on language models, that combines the flexible utility-maximization framework of RL algorithms with the ability of supervised learning to leverage previously collected data, as well as its simplicity and stability. Our method employs value conservatism alongside an implicit dataset support constraint in learning value functions, which are then used to guide language model generations toward maximizing user-specified utility functions. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high-variance reward functions based on subjective judgment, such as whether to label a comment as toxic or not.



INTRODUCTION

Large language models can acquire a remarkable amount of knowledge from large text corpora, and can be applied to a wide range of language-based tasks. However, such models are not designed to optimize any user-specified utility; instead, they require considerable trial and error to design prompts that coerce the models into producing desirable outputs (Liu et al., 2021; Brown et al., 2020; Min et al., 2021). In essence, standard unsupervised language model training only solves part of the problem: it is effective at distilling down the knowledge in large corpora, but relatively clumsy at applying that knowledge to solve user-specified tasks. Reinforcement learning (RL) can in principle provide an effective framework for steering language models toward user-specified tasks, as long as the task can be represented by some utility function (i.e., a reward function); however, as outlined in Figure 2, contemporary methods suffer from high systems complexity and can require expensive human interaction. Several conditions are needed to make RL practical: (1) Easy to use: the underlying learning algorithm and workflow should be simple, stable, and scalable; (2) Able to optimize user-specified rewards: the algorithm should be able to steer a language model toward maximizing any user-defined reward signal, from high-level task goals (e.g., book a flight) to low-level linguistic subtleties (e.g., avoiding rude or toxic speech); (3) Practical for interactive applications: the system should be able to handle a variety of tasks,

Code at https://sea-snell.github.io/ILQL_site/
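To make the idea of value functions "guiding" generation concrete, here is a minimal sketch of one value-guided decoding step: the language model's per-token logits are perturbed by the learned advantage Q(s, a) − V(s) before renormalizing. The function name, the `beta` temperature coefficient, and the exact combination rule are illustrative assumptions, not the paper's verbatim formulation.

```python
import math

def value_guided_step(lm_logits, q_values, state_value, beta=1.0):
    """Combine language-model logits with learned values for one token.

    lm_logits:   per-token logits from the base language model
    q_values:    learned Q(s, a) for each candidate token a
    state_value: learned V(s) for the current generation prefix
    beta:        how strongly the values steer the distribution (assumed knob)

    Returns a normalized probability distribution over the vocabulary.
    """
    # Shift each token's logit by its advantage A(s, a) = Q(s, a) - V(s).
    perturbed = [l + beta * (q - state_value) for l, q in zip(lm_logits, q_values)]
    # Numerically stable softmax over the perturbed logits.
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    z = sum(exps)
    return [e / z for e in exps]
```

With `beta = 0` this reduces to ordinary sampling from the language model, so the coefficient interpolates between fluent imitation of the data and aggressive reward maximization.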



Figure 1: Offline RL differs from supervised learning in that it learns to maximize user-specified rewards from suboptimal data with reward labels.

