GOAL-SPACE PLANNING WITH SUBGOAL MODELS

Abstract

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated for many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.

1. INTRODUCTION

Planning with learned models in reinforcement learning (RL) is important for sample efficiency. Planning provides a mechanism for the agent to simulate data, in the background during interaction, to improve its value estimates. Dyna (Sutton, 1990) is a classic example of background planning: on each step, the agent simulates several transitions according to its model and updates with those transitions as if they were real experience. Learning and using such a model is worthwhile in vast or ever-changing environments, where the agent learns over a long time period and can benefit from re-using knowledge about the environment. The promise of Dyna is that we can exploit the Markov structure in the RL formalism to learn and adapt value estimates efficiently, but many open problems remain before it can be more widely useful. These include that 1) the one-step models learned in Dyna can be difficult to use for long-horizon planning, 2) learning probabilities over outcome states can be complex, especially for high-dimensional states, and 3) planning itself can be computationally expensive for large state spaces.

A variety of strategies have been proposed to improve long-horizon planning. Incorporating options as additional (macro) actions in planning is one approach. An option is a policy coupled with a termination condition and an initiation set (Sutton et al., 1999). Options provide temporally extended ways of behaving, allowing the agent to reason about outcomes further into the future. Incorporating options into planning is a central motivation of this paper, particularly how to do so under function approximation. Planning with options has largely been tested only in tabular settings (Sutton et al., 1999; Singh et al., 2004; Wan et al., 2021). Recent work has considered mechanisms for identifying and learning option policies for planning under function approximation (Sutton et al., 2022), but did not yet consider issues with learning the models.

A variety of other approaches have been developed to handle issues with learning and iterating one-step models. Several papers have shown that forward model simulations can produce simulated states that result in catastrophically misleading values (Jafferjee et al., 2020; van Hasselt et al., 2019; Lambert et al., 2022). This problem has been tackled by using reverse models (Pan et al., 2018; Jafferjee et al., 2020; van Hasselt et al., 2019); by primarily using the model for decision-time planning (van Hasselt et al., 2019; Silver et al., 2008; Chelu et al., 2020); and by improving training strategies to account for accumulated errors in rollouts (Talvitie, 2014; Venkatraman et al., 2015; Talvitie, 2017). An emerging trend is to avoid approximating the true transition dynamics, and instead learn dynamics tailored to predicting values on the next step correctly (Farahmand et al., 2017;
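To make the Dyna-style background-planning loop above concrete, the following is a minimal, hypothetical tabular Dyna-Q sketch: it interleaves a model-free update from real experience with several updates on transitions simulated from a learned one-step model. The class name, hyperparameters, and tabular setting are illustrative assumptions; this is not the GSP algorithm proposed in this paper.

```python
# Minimal sketch of Dyna-style background planning (after Sutton, 1990).
# Assumes discrete states and actions; all names here are illustrative.
import random
from collections import defaultdict

class DynaQ:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1, planning_steps=10):
        self.q = defaultdict(float)     # tabular Q(s, a) estimates, default 0
        self.model = {}                 # learned one-step model: (s, a) -> (r, s')
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.planning_steps = planning_steps

    def act(self, s):
        # epsilon-greedy action selection
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def _td_update(self, s, a, r, s_next):
        # one-step Q-learning update
        target = r + self.gamma * max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

    def step(self, s, a, r, s_next):
        # (1) model-free update from the real transition
        self._td_update(s, a, r, s_next)
        # (2) update the learned one-step model
        self.model[(s, a)] = (r, s_next)
        # (3) background planning: update on transitions simulated from the model
        for _ in range(self.planning_steps):
            sp, ap = random.choice(list(self.model.keys()))
            rp, sp_next = self.model[(sp, ap)]
            self._td_update(sp, ap, rp, sp_next)
```

A caller would invoke `step(s, a, r, s_next)` after every real environment transition; the `planning_steps` simulated updates in step (3) are exactly where inaccuracies in a learned model, of the kind discussed above, can propagate into the value estimates.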

