REWARD DESIGN WITH LANGUAGE MODELS

Abstract

Reward design in reinforcement learning (RL) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. Our approach leverages this proxy reward function in an RL framework. Specifically, users specify a prompt once at the beginning of training. During training, the LLM evaluates an RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DEALORNODEAL negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning. Code and prompts can be found here.
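The loop described in the abstract (prompt specified once, LLM scores each episode, agent updates on that score) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: `make_proxy_reward`, `stub_llm`, and `train` are hypothetical names, the LLM is replaced by a hard-coded stand-in so the example runs offline, the Yes/No parsing into a binary reward is an assumption, and the epsilon-greedy bandit update stands in for whatever RL algorithm is actually used.

```python
import random
from typing import Callable, Dict, List


def make_proxy_reward(user_prompt: str, llm: Callable[[str], str]) -> Callable[[str], float]:
    """Wrap a text-completion callable as a reward function (hypothetical helper)."""
    def reward(episode_description: str) -> float:
        # The user's prompt is fixed once; only the episode text changes per call.
        query = (
            f"{user_prompt}\n\n"
            f"Episode: {episode_description}\n"
            "Does this episode satisfy the objective? Answer Yes or No:"
        )
        answer = llm(query).strip().lower()
        # Assumption: the completion is parsed into a binary reward signal.
        return 1.0 if answer.startswith("yes") else 0.0
    return reward


def stub_llm(query: str) -> str:
    """Stand-in for a real LLM call: answers Yes only if the responder rejects."""
    episode = query.rsplit("Episode:", 1)[1]
    return "Yes" if "rejects" in episode else "No"


def train(reward_fn: Callable[[str], float], actions: List[str],
          steps: int = 200, lr: float = 0.1, eps: float = 0.2,
          seed: int = 0) -> Dict[str, float]:
    """Epsilon-greedy bandit update driven by the proxy reward (illustrative only)."""
    rng = random.Random(seed)
    values = {a: 0.0 for a in actions}
    for _ in range(steps):
        if rng.random() < eps:
            action = rng.choice(actions)          # explore
        else:
            action = max(values, key=values.get)  # exploit current estimate
        episode = f"The proposer offers 10% of the pot. The responder {action} the offer."
        r = reward_fn(episode)                    # LLM-derived reward signal
        values[action] += lr * (r - values[action])
    return values


if __name__ == "__main__":
    reward_fn = make_proxy_reward(
        "The responder should reject unfair offers.", stub_llm
    )
    print(train(reward_fn, ["accepts", "rejects"]))
```

In a real setting the `llm` argument would call an actual model, and the episode description would be a textual rendering of the agent's trajectory; both are left abstract here.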

1. INTRODUCTION

Autonomous agents are becoming increasingly capable with the rise of compute and data. This underscores the importance of human users being able to control what policies the agents learn and to ensure the policies are aligned with their objectives. For instance, imagine training an agent to represent users in a salary negotiation. A working mother fighting for a livable wage may want their agent to be stubborn, whereas a new hire looking to develop a good relationship with the company may want their agent to be more versatile. Currently, users specify desired behaviors by 1) designing reward functions or 2) providing large amounts of labeled data. Both approaches are challenging and impractical for different reasons. Designing reward functions is not an intuitive way to specify preferences. For instance, it isn't straightforward how to write a reward function for a "versatile" negotiator. Furthermore, designing reward functions that balance between different objectives (also known as the "reward design problem") is notoriously difficult because agents are susceptible to reward hacking (Amodei et al., 2016; Hadfield-Menell et al., 2017). On the other hand, one can learn a reward function from labeled examples. However, that is not possible with a single example; we need large amounts of labeled data to capture the nuances of different users' preferences and objectives, which has been shown to be costly (Zhang et al., 2016). Additionally, neither approach generalizes well to new users who have different objectives: we would have to re-design our reward functions or re-collect data. Our aim is to create an easier way for users to communicate their preferences, where the interface is more intuitive than crafting a reward function and where they can cheaply specify their preferences with no more than a few examples.
To do this, we leverage large language models (LLMs) that are trained on internet-scale text data and have shown an impressive ability to learn in-context from few or zero examples (Brown et al., 2020). Our key insight is that the scale of data that LLMs have been trained on makes them great in-context learners and also allows them to capture meaningful commonsense priors about human behavior. Given a few examples or a description demonstrating the user's objective, an LLM should be able to provide an accurate instantiation of reward values on a new test example, allowing for easier generalization to new objectives. To this end, we explore how to prompt an LLM as a proxy reward function to train RL agents from user inputs. In our approach, the user specifies an objective with a natural language prompt. Objectives can

