IN-CONTEXT POLICY ITERATION

Abstract

This work presents In-Context Policy Iteration, an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the "few-shot" quality that made in-context learning attractive to begin with. In this work, we present an algorithm, ICPI, that learns to perform RL tasks without expert demonstrations or gradients. Instead, we present a policy-iteration method in which the prompt content is the entire locus of learning. ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex (Chen et al., 2021b), a language model with no prior knowledge of the domains on which we evaluate it.

1. INTRODUCTION

In-context learning describes the ability of sequence-prediction models to generalize to novel downstream tasks when prompted with a small number of exemplars (Lu et al., 2021; Brown et al., 2020). The introduction of the Transformer architecture (Vaswani et al., 2017) has significantly increased interest in this phenomenon, since this architecture demonstrates much stronger generalization capacity than its predecessors (Chan et al., 2022). Another interesting property of in-context learning in the case of large pre-trained models (or "foundation models") is that the models are not directly trained to optimize a meta-learning objective, but demonstrate an emergent capacity to generalize (or at least specialize) to diverse downstream task-distributions (Brown et al., 2020; Wei et al., 2022a). A litany of existing work has explored methods for applying this remarkable capability to downstream tasks (see Related Work), including Reinforcement Learning (RL). Most work in this area either (1) assumes access to expert demonstrations, collected either from human experts (Huang et al., 2022a; Baker et al., 2022) or from domain-specific pre-trained RL agents (Chen et al., 2021a; Lee et al., 2022; Janner et al., 2021; Reed et al., 2022; Xu et al., 2022), or (2) relies on gradient-based methods, e.g., fine-tuning of the foundation model's parameters as a whole (Lee et al., 2022; Reed et al., 2022; Baker et al., 2022) or newly training an adapter layer or prefix vectors while keeping the original foundation model frozen (Li & Liang, 2021; Singh et al., 2022; Karimi Mahabadi et al., 2022). Our work presents an algorithm, In-Context Policy Iteration (ICPI), which relaxes these assumptions. ICPI is a form of policy iteration in which the prompt content is the locus of learning. Because our method operates on the prompt itself rather than the model parameters, we are able to avoid gradient methods.
Furthermore, the use of policy iteration frees us from expert demonstrations because suboptimal prompts can be improved over the course of training. We illustrate the algorithm empirically on six small illustrative RL tasks (chain, distractor-chain, maze, mini-catch, mini-invaders, and point-mass), in which the algorithm very quickly finds good policies.
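To make the idea of prompt-based policy iteration concrete, the loop described above can be sketched as follows. This is a simplified, hypothetical sketch, not the paper's implementation: `lm_complete` is a stand-in for a foundation-model completion call (such as Codex), here mocked with a random choice so the sketch runs offline, and the prompt format and return-parsing are illustrative assumptions only.

```python
import random

def lm_complete(prompt):
    """Hypothetical stand-in for a foundation-model completion API
    (e.g. Codex). Mocked with a random scalar so the sketch is
    runnable without model access."""
    return str(random.choice([0, 1]))

def icpi_step(buffer, state, actions):
    """One decision step of a prompt-based policy-iteration loop.

    For each candidate action, a prompt is assembled from previously
    observed environment transitions in `buffer` (the 'locus of
    learning'); the language model completes the consequences of that
    action, and the agent greedily picks the action with the highest
    predicted return. As the agent acts, new transitions are appended
    to `buffer`, improving subsequent prompts without any gradients.
    """
    def predicted_return(action):
        # Prompt = past (state, action, reward, next-state) exemplars,
        # followed by the query transition left incomplete.
        prompt = "\n".join(f"{s} {a} -> {r} {s2}" for (s, a, r, s2) in buffer)
        prompt += f"\n{state} {action} ->"
        # Illustrative assumption: parse the completion as a scalar return.
        return float(lm_complete(prompt))

    return max(actions, key=predicted_return)
```

The key design point, as in the paper, is that learning happens entirely by editing what goes into the prompt, so no demonstrations or parameter updates are required.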

