PLANNING WITH SEQUENCE MODELS THROUGH ITERATIVE ENERGY MINIMIZATION

Abstract

Recent works have shown that sequence modeling can be effectively used to train reinforcement learning (RL) policies. However, the success of applying existing sequence models to planning, in which we wish to obtain a trajectory of actions to reach some goal, is less straightforward. The typical autoregressive generation procedures of sequence models preclude sequential refinement of earlier steps, which limits the effectiveness of a predicted plan. In this paper, we suggest an approach towards integrating planning with sequence models based on the idea of iterative energy minimization, and illustrate how such a procedure leads to improved RL performance across different tasks. We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy. We illustrate how this procedure enables improved performance over recent approaches across BabyAI and Atari environments. We further demonstrate unique benefits of our iterative optimization procedure, including generalization to new tasks, adaptation to test-time constraints, and the ability to compose plans together. Project website: https://hychen-naza.github.io/projects

1. INTRODUCTION

Figure 1: Plan Generation through Iterative Energy Minimization. LEAP plans a trajectory to a goal (specified by the yellow star) by iteratively sampling from and minimizing a trajectory energy function E_θ estimated using a language model.

Sequence modeling has emerged as a unified paradigm for studying numerous domains such as language (Brown et al., 2020; Radford et al., 2018) and vision (Yu et al., 2022; Dosovitskiy et al., 2020). Recently, Chen et al. (2021) and Janner et al. (2021) have shown how a similar approach can be effectively applied to decision making by predicting the next action to take. However, in many decision-making domains, simply predicting the next action to execute is sub-optimal, as that action may be only locally optimal and lead to a global dead-end. It is instead more desirable to plan a sequence of actions towards a final goal and to choose the action that is best for the overall goal. Unlike greedily picking the next action to execute, effectively constructing an action sequence towards a given goal requires a careful, iterative procedure in which we assess and refine intermediate actions to ensure we reach the final goal. To refine an action at a particular timestep in a plan, we must reconsider the actions both before and after it. Directly applying this procedure to standard language generation is difficult, as the standard autoregressive decoding procedure prevents regenerating previous actions based on future ones. For example, if the first five predicted actions place an agent at a location too far from a given goal, there is no way to change the early portions of the plan. In this paper, we propose an approach to iteratively generate plans using sequence models.
Our approach, Multistep Energy-Minimization Planner (LEAP), formulates planning as an iterative optimization procedure on an energy function over trajectories defined implicitly by a sequence model (illustrated in Figure 1). To define an energy function across trajectories, we train a bidirectional sequence model using a masked-language modeling (MLM) objective (Devlin et al., 2019). We define the energy of a trajectory as the negative pseudo-likelihood (PLL) of this MLM (Salazar et al., 2019) and sequentially minimize this energy value by replacing actions at different timesteps in the trajectory with the marginal estimates given by the MLM. Since our MLM is bidirectional in nature, the new action at a given timestep is chosen based on both future and past actions. By iteratively generating actions through planning, we illustrate how our proposed framework outperforms prior methods in both BabyAI (Chevalier-Boisvert et al., 2019) and Atari (Bellemare et al., 2013) tasks. Furthermore, by formulating the action generation process as iterative energy minimization, we illustrate how this enables us to generalize to environments with new sets of test-time constraints as well as to more complex planning problems. Finally, we demonstrate how such an energy minimization procedure enables us to compose planning procedures from different models, constructing plans that achieve multiple objectives. Concretely, in this paper, we contribute the following: First, we introduce LEAP, a framework through which we may iteratively plan with sequence models. Second, we illustrate how such a planning framework can be beneficial on both BabyAI and Atari domains.
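The energy definition and refinement loop above can be sketched in a few lines. The sketch below is illustrative only: `toy_mlm`, `GOAL`, and the single-site accept-if-lower-energy loop are hypothetical stand-ins, not the paper's trained bidirectional Transformer or its full minimization schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, HORIZON = 4, 6
GOAL = 3  # hypothetical goal index used by the toy model below

def toy_mlm(traj, t):
    """Stand-in for a masked language model: returns a marginal
    distribution over actions at masked position t, conditioned on the
    remaining (unmasked) actions. Here it simply prefers actions whose
    running sum (mod NUM_ACTIONS) lands near GOAL."""
    context = sum(a for i, a in enumerate(traj) if i != t)
    logits = np.array([-abs(((context + a) % NUM_ACTIONS) - GOAL)
                       for a in range(NUM_ACTIONS)], dtype=float)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def energy(traj):
    """Negative pseudo-log-likelihood (PLL): mask each position in turn
    and sum the negative log marginal of the action actually taken."""
    return -sum(np.log(toy_mlm(traj, t)[traj[t]]) for t in range(HORIZON))

def leap_plan(num_iters=50):
    """Iteratively re-plan one timestep at a time, keeping a proposal
    only when it does not increase the trajectory energy."""
    traj = list(rng.integers(NUM_ACTIONS, size=HORIZON))
    for _ in range(num_iters):
        t = rng.integers(HORIZON)                     # timestep to re-plan
        proposal = list(traj)
        proposal[t] = rng.choice(NUM_ACTIONS, p=toy_mlm(traj, t))
        if energy(proposal) <= energy(traj):          # energy-decreasing move
            traj = proposal
    return traj
```

Because `toy_mlm` conditions on both earlier and later positions, a resampled action at timestep t reflects the whole plan, which is the property autoregressive decoding lacks. The full method may re-plan several masked timesteps per iteration; this sketch keeps only the core loop.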
Finally, we illustrate how iterative planning through energy minimization yields a set of unique properties: better test-time performance in more complex environments and in environments with new test-time obstacles, and the ability to compose multiple learned models together to jointly generate plans that satisfy multiple sets of goals. Sequence models have advanced rapidly in recent years (et al., 2019b; Sutskever et al., 2014; Liu & Lapata, 2019; Dehghani et al., 2018). With these advances, researchers have begun applying sequence models to represent components of standard RL, such as policies, value functions, and models, to improve performance (Espeholt et al., 2018; Parisotto et al., 2020; Kapturowski et al., 2018). While these sequence models provide memory that makes agent predictions temporally and spatially coherent, they still rely on standard RL algorithms to fit value functions or compute policy gradients. More recently, works have replaced as much of the RL pipeline as possible with sequence modeling to leverage its scalability, flexible representations, and causal reasoning (Janner et al., 2021; Chen et al., 2021; Furuta et al., 2021; Zheng et al., 2022; Emmons et al., 2021; Li et al., 2022). However, these methods adopt autoregressive modeling objectives, and the predicted trajectory sequences cannot easily be further optimized, which inevitably lowers long-horizon accuracy. Recent studies also point out that sequence-model approaches (Chen et al., 2021; Emmons et al., 2021), unlike typical value-based approaches, have difficulty converging in stochastic environments (Paster et al., 2020; 2022).

2. RELATED WORK

Planning in Reinforcement Learning. Planning has been explored extensively in model-based RL, which learns how the environment responds to actions (Sutton, 1991). The learned dynamics model is exploited to predict the conditional distribution over the immediate next state or to autoregressively reason about the long-term future (Chiappa et al., 2017; Ke et al., 2018). However, due to compounding errors, plans generated by this procedure often look more like adversarial examples than optimal trajectories when the planning horizon is extended (Bengio et al., 2015; Talvitie, 2014; Asadi et al., 2018). To avoid these issues, simple gradient-free methods such as Monte Carlo tree search (Coulom, 2007), random shooting (Nagabandi et al., 2018), and beam search (Sun et al., 2022) have been explored. Another line of work studies how to break the barrier between model learning and planning and to plan with an imperfect model, including: training an autoregressive latent-space model to predict values for abstract future states (Tamar et al., 2016; Oh et al., 2017; Schrittwieser et al., 2020; Sun et al., 2022); energy-based models of policies for model-free reinforcement learning (Haarnoja et al., 2017); improving offline policies by planning with learned models in model-based reinforcement learning (Yu et al., 2020; Schrittwieser et al., 2021); directly applying collocation techniques for direct trajectory optimization (Erez & Todorov, 2012; Du et al., 2019); and folding planning into the generative modeling process (Janner et al., 2021). In contrast to these works, we explore integrating planning directly into a language modeling framework.
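Of the gradient-free planners discussed here, random shooting is the simplest to sketch. The code below is a minimal illustration, not the method of any cited work: `dynamics` and `reward` are hypothetical stand-ins for a learned world model and a task reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, action):
    # Illustrative deterministic dynamics: move on a 1-D line,
    # with actions {0, 1, 2} mapping to steps {-1, 0, +1}.
    return state + (action - 1)

def reward(state):
    # Illustrative reward: negative distance to a goal at x = 5.
    return -abs(state - 5)

def random_shooting(state, horizon=8, num_samples=256):
    """Sample random action sequences, roll each out through the model,
    and return the first action of the highest-return sequence."""
    best_ret, best_plan = -np.inf, None
    for _ in range(num_samples):
        plan = rng.integers(3, size=horizon)
        s, ret = state, 0.0
        for a in plan:
            s = dynamics(s, a)
            ret += reward(s)
        if ret > best_ret:
            best_ret, best_plan = ret, plan
    return best_plan[0]
```

Only the first action of the best sampled plan is executed; in a receding-horizon loop the procedure is repeated from the new state. Note that nothing in this scheme refines an already-sampled sequence, which is the gap the iterative energy minimization of LEAP addresses.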



* denotes equal contribution. Correspondence to hchen657@gatech.edu, yilundu@mit.edu, yychen2019@gatech.edu





