REWARD DESIGN WITH LANGUAGE MODELS

Abstract

Reward design in reinforcement learning (RL) is challenging since it may be difficult to specify human notions of desired behavior via a reward function, or doing so may require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. Our approach leverages this proxy reward function in an RL framework. Specifically, users specify a prompt once at the beginning of training. During training, the LLM evaluates an RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DEALORNODEAL negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning. Code and prompts can be found here.

1. INTRODUCTION

Autonomous agents are becoming increasingly capable with the rise of compute and data. This makes it increasingly important for human users to be able to control what policies the agents learn and to ensure the policies are aligned with their objectives. For instance, imagine training an agent to represent users in a salary negotiation. A working mother fighting for a livable wage may want her agent to be stubborn, whereas a new hire looking to develop a good relationship with the company may want their agent to be more versatile. Currently, users specify desired behaviors by 1) designing reward functions or 2) providing large amounts of labeled data. Both approaches are challenging and impractical for different reasons. Designing reward functions is not an intuitive way to specify preferences. For instance, it is not straightforward how to write a reward function for a "versatile" negotiator. Furthermore, designing reward functions that balance between different objectives, also known as the "reward design problem," is notoriously difficult because agents are susceptible to reward hacking (Amodei et al., 2016; Hadfield-Menell et al., 2017). On the other hand, one can learn a reward function from labeled examples. However, that is not possible with a single example; we need large amounts of labeled data to capture the nuances of different users' preferences and objectives, which has been shown to be costly (Zhang et al., 2016). Additionally, neither approach generalizes well to new users with different objectives: we would have to re-design our reward functions or re-collect data. Our aim is to create an easier way for users to communicate their preferences, where the interface is more intuitive than crafting a reward function and where they can cheaply specify their preferences with no more than a few examples.
To do this, we leverage large language models (LLMs) that are trained on internet-scale text data and have shown an impressive ability to learn in-context from few or zero examples (Brown et al., 2020). Our key insight is that the scale of data LLMs have been trained on makes them strong in-context learners and also allows them to capture meaningful commonsense priors about human behavior. Given a few examples or a description demonstrating the user's objective, an LLM should be able to provide an accurate instantiation of reward values on a new test example, allowing for easier generalization to new objectives. To this end, we explore how to prompt an LLM as a proxy reward function to train RL agents from user inputs. In our approach, the user specifies an objective with a natural language prompt. Objectives can

[Figure 1: Overview of our framework. (1) The user writes a prompt consisting of a task description (e.g., "Alice and Bob are negotiating how to split a set of books, hats, and balls."), an example from the user illustrating the desired objective (e.g., a dialogue in which Alice proposes several different splits, followed by "Is Alice a versatile negotiator? Yes, because she suggested different proposals."), and an episode outcome rendered as a string. The prompt is fed to the LLM, (2) which provides a textual output (e.g., "No"), (3) which is converted to an integer (e.g., "0") and used as the reward signal to update the agent's (Alice's) weights before running the next episode.]
be specified with a few examples when they are difficult to define (such as "versatility") or with a single phrase when they are well-known concepts (such as "Pareto-optimality"). We use the prompt and the LLM to define a reward function for training an RL agent. The LLM takes the user prompt and a trajectory from an RL episode as input and outputs a score (e.g., "No" or "0") indicating whether the trajectory satisfies the user's objective, which we parse as an integer reward for the RL agent (Figure 1).

There are two advantages to prompting LLMs as a proxy reward function: (1) we can leverage the LLM's in-context learning abilities and prior knowledge of human behavior, so that users only need to provide a handful of examples of desirable behavior, and (2) users can specify their preferences intuitively using language. On the other hand, a potential disadvantage is that it is unclear how much prompt design is required for the LLM to reliably infer user intent (see Sec. 5 for a discussion). The goal of this paper is to explore how well LLMs can train objective-aligned agents by providing reward signals, and to empirically examine whether we can do so with no more than a few examples. Our contributions are as follows:

• We introduce the idea of using LLMs as a proxy reward function.
• We propose a general RL training framework that leverages this proxy reward and is agnostic to the RL algorithm used.
• We show that an LLM can train objective-aligned RL agents more accurately than the baseline by an average of 35%. We use few-shot prompting for the Ultimatum Game and the DEALORNODEAL negotiation task, and zero-shot prompting in matrix games.
• We conduct a pilot study with 10 human users. Users rate our agent as significantly more aligned with their objective than an agent trained with a different objective (p < 0.001).
• We provide further analysis quantifying the amount of user data our approach requires, as well as the effect of prompt variations on the accuracy of the LLM's reward signal.
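To make the framework concrete, the loop described above (assemble prompt, query LLM, parse its answer into an integer reward, update the agent) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm` callable, the `env.rollout`/`agent.update` interface, and all function names are hypothetical placeholders for whatever LLM endpoint and RL algorithm the user plugs in.

```python
def build_prompt(task_description, user_examples, episode_text, question):
    """Assemble the prompt: task description, the user's few-shot examples
    of the desired behavior, the episode rendered as a string, and the
    question the LLM should answer about that episode."""
    return "\n".join([task_description, *user_examples, episode_text, question])


def parse_reward(llm_output):
    """Map the LLM's textual answer (e.g., "Yes"/"No") to an integer reward."""
    return 1 if llm_output.strip().lower().startswith("yes") else 0


def train(agent, env, llm, task_description, user_examples, question,
          num_episodes=1000):
    """One possible training loop: any RL algorithm can sit behind
    `agent.update`, since the LLM only supplies the scalar reward."""
    for _ in range(num_episodes):
        episode_text, trajectory = env.rollout(agent)  # run one episode
        prompt = build_prompt(task_description, user_examples,
                              episode_text, question)
        reward = parse_reward(llm(prompt))             # proxy reward from the LLM
        agent.update(trajectory, reward)               # RL-algorithm-specific step
```

Because the prompt is written once at the start of training, the per-episode cost is a single LLM query plus the usual RL update.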

