GOAL-SPACE PLANNING WITH SUBGOAL MODELS

Abstract

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, as in the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.

1. INTRODUCTION

Planning with learned models in reinforcement learning (RL) is important for sample efficiency. Planning provides a mechanism for the agent to simulate data, in the background during interaction, to improve value estimates. Dyna (Sutton, 1990) is a classic example of background planning. On each step, the agent simulates several transitions according to its model, and updates with those transitions as if they were real experience. Learning and using such a model is worthwhile in vast or ever-changing environments, where the agent learns over a long time period and can benefit from re-using knowledge about the environment. The promise of Dyna is that we can exploit the Markov structure in the RL formalism to learn and adapt value estimates efficiently, but many open problems remain to make it more widely useful. These include that 1) one-step models learned in Dyna can be difficult to use for long-horizon planning, 2) learning probabilities over outcome states can be complex, especially for high-dimensional states, and 3) planning itself can be computationally expensive for large state spaces. A variety of strategies have been proposed to improve long-horizon planning. Incorporating options as additional (macro) actions in planning is one approach. An option is a policy coupled with a termination condition and initiation set (Sutton et al., 1999). Options provide temporally-extended ways of behaving, allowing the agent to reason about outcomes further into the future. Incorporating options into planning is a central motivation of this paper, particularly how to do so under function approximation. Planning with options has largely been tested only in tabular settings (Sutton et al., 1999; Singh et al., 2004; Wan et al., 2021). Recent work has considered mechanisms for identifying and learning option policies for planning under function approximation (Sutton et al., 2022), but did not yet consider issues with learning the models.
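The Dyna loop just described can be sketched in a few lines. Below is a minimal tabular Dyna-Q: one Q-learning update per real transition, followed by several updates on transitions replayed from the learned model. The `env_step` interface and all names here are our own illustration, not from the paper, and the model is assumed deterministic for simplicity:

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes=30, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch: a Q-learning update on each real
    transition, then n_planning updates on transitions replayed from
    a learned deterministic model (the background-planning step)."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # (state, action) -> value estimate
    model = {}               # (state, action) -> (reward, next_state, done)

    def greedy(s):           # argmax over actions, random tie-breaking
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * Q[(s2, greedy(s2))]
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)
            update(s, a, r, s2, done)          # learn from real experience
            model[(s, a)] = (r, s2, done)      # update the model
            keys = list(model)
            for _ in range(n_planning):        # background planning
                ps, pa = rng.choice(keys)
                update(ps, pa, *model[(ps, pa)])
            s = s2
    return Q
```

On a toy 5-state chain (action 1 moves right, reward 1 at the right end), the planning updates propagate the terminal reward back along the chain far faster than the real-experience updates alone would.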
A variety of other approaches have been developed to handle issues with learning and iterating one-step models. Several papers have shown that using forward model simulations can produce simulated states that result in catastrophically misleading values (Jafferjee et al., 2020; van Hasselt et al., 2019; Lambert et al., 2022). This problem has been tackled by using reverse models (Pan et al., 2018; Jafferjee et al., 2020; van Hasselt et al., 2019); primarily using the model for decision-time planning (van Hasselt et al., 2019; Silver et al., 2008; Chelu et al., 2020); and improving training strategies to account for accumulated errors in rollouts (Talvitie, 2014; Venkatraman et al., 2015; Talvitie, 2017). An emerging trend is to avoid approximating the true transition dynamics, and instead learn dynamics tailored to predicting values on the next step correctly (Farahmand et al., 2017; Farahmand, 2018; Ayoub et al., 2020). This trend is also implicit in the variety of techniques that encode the planning procedure into neural network architectures that can then be trained end-to-end (Tamar et al., 2016; Silver et al., 2017; Oh et al., 2017; Weber et al., 2017; Farquhar et al., 2018; Schrittwieser et al., 2020). We similarly attempt to avoid issues with iterating models, but do so by considering a different type of model. Much less work has been done for the third problem in Dyna: the expense of planning. There is, however, a large literature on approximate dynamic programming, where the model is given, that is focused on efficient planning (see Powell, 2009). Particularly relevant to this work is restricting value iteration to a small subset of landmark states (Mann et al., 2015). The resulting policy is suboptimal, restricted to going between landmark states, but planning is provably more efficient. The use of landmark states has also been explored in goal-conditioned RL, where the agent is given a desired goal state or states.
The first work to exploit the idea of landmark states was for learning and using universal value function approximators (UVFAs) (Huang et al., 2019). The UVFA conditions action-values on both state-action pairs and landmark states. The agent can reach new goals by searching on a learned graph between landmark states, to identify which landmark to move towards. A flurry of work followed, still in the goal-conditioned setting (Nasiriany et al., 2019; Emmons et al., 2020; Zhang et al., 2020; 2021; Aubret et al., 2021; Hoang et al., 2021; Gieselmann & Pokorny, 2021; Kim et al., 2021; Dubey et al., 2021). In this paper, we exploit the idea behind landmark states for efficient background planning in general online reinforcement learning problems. The key novelty is a framework to use subgoal-conditioned models: temporally-extended models that condition on subgoals. The models are designed to be simpler to learn, as they are only learned for states local to subgoals and they avoid generating entire next state vectors. We use background planning on subgoals, to quickly propagate (suboptimal) value estimates for subgoals. We propose subgoal-value bootstrapping, which leverages these quickly computed subgoal values but mitigates suboptimality by incorporating an update on real experience. We prove that dynamic programming with our subgoal models is sound (Proposition 2) and that our modified update converges, and in fact converges faster due to reducing the effective horizon (Proposition 3). We show in the PinBall environment that our Goal-Space Planning (GSP) algorithm can learn significantly faster than Double DQN, and still reaches nearly the same level of performance.
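To make the efficiency argument concrete, planning over subgoals can be sketched as value iteration restricted to a small set of subgoals. The `reward_to` and `discount_to` tables below stand in for the learned subgoal-conditioned models (accumulated reward and accumulated discount when travelling between subgoals); all names are our own illustration, not the paper's API:

```python
def plan_subgoal_values(subgoals, reward_to, discount_to,
                        n_iters=100, tol=1e-8):
    """Value iteration over subgoals only.

    reward_to[(g, g2)]  : expected discounted reward accumulated while
                          travelling from subgoal g to subgoal g2
    discount_to[(g, g2)]: expected accumulated discount for that journey
                          (this is where temporal abstraction enters: one
                          backup spans many environment steps)
    Returns a dict mapping each subgoal to its planned value.
    """
    v = {g: 0.0 for g in subgoals}
    for _ in range(n_iters):
        delta = 0.0
        for g in subgoals:
            candidates = [reward_to[(g, g2)] + discount_to[(g, g2)] * v[g2]
                          for g2 in subgoals if (g, g2) in reward_to]
            new_v = max(candidates) if candidates else v[g]
            delta = max(delta, abs(new_v - v[g]))
            v[g] = new_v
        if delta < tol:
            break
    return v
```

Because the loop runs over a handful of subgoals rather than the full state space, each sweep is cheap, and the accumulated discounts let a single backup propagate value across long horizons.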

2. PROBLEM FORMULATION

We consider the standard reinforcement learning setting, where an agent learns to make decisions through interaction with an environment, formulated as a Markov Decision Process (MDP) (S, A, R, P). S is the state space and A the action space. The reward function R : S × A × S → ℝ and the transition probability P : S × A × S → [0, ∞) describe the expected reward and the probability of transitioning to a state, for a given state and action. On each discrete timestep t the agent selects an action A_t in state S_t, the environment transitions to a new state S_{t+1} and emits a scalar reward R_{t+1}. The agent's objective is to find a policy π : S × A → [0, 1] that maximizes expected return, the future discounted reward G_t ≐ R_{t+1} + γ_{t+1} G_{t+1}. The state-based discount γ_{t+1} ∈ [0, 1] depends on S_{t+1} (Sutton et al., 2011), which allows us to specify termination. If S_{t+1} is a terminal state, then γ_{t+1} = 0; else, γ_{t+1} = γ_c for some constant γ_c ∈ [0, 1]. The policy can be learned using algorithms like Q-learning (Sutton & Barto, 2018), which approximate the action-values: the expected return from a given state and action. We can incorporate models and planning to improve sample efficiency beyond these basic model-free algorithms. In this work, we focus on background planning algorithms: those that learn a model during online interaction and asynchronously update value estimates using dynamic programming updates. The classic example of background planning is Dyna (Sutton, 1990), which performs planning steps by selecting previously observed states, generating transitions (outcome rewards and next states) for every action, and performing a Q-learning update with those simulated transitions. Planning with learned models, however, has several issues. First, even with perfect models, it can be computationally expensive. Running dynamic programming can require multiple sweeps, which is infeasible over a large number of states. A small number of updates, on the other hand, may be insufficient.
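As a small illustration of the state-based discount, the return G_t ≐ R_{t+1} + γ_{t+1} G_{t+1} can be computed backwards over a trajectory, with γ set to 0 at a terminal state and γ_c elsewhere (names here are our own sketch, not from the paper):

```python
def compute_return(rewards, discounts):
    """Compute G_0 for a finite trajectory via the recursion
    G_t = R_{t+1} + gamma_{t+1} * G_{t+1}, working backwards.

    discounts[k] is the state-based discount gamma_{k+1}: it is 0 if
    the state reached on step k+1 is terminal, and gamma_c otherwise,
    so termination is encoded in the discount rather than by
    truncating the sum."""
    G = 0.0
    for r, g in zip(reversed(rewards), reversed(discounts)):
        G = r + g * G
    return G
```

For example, a three-step trajectory with rewards [0, 0, 1], γ_c = 0.9, and a terminal final state (so discounts [0.9, 0.9, 0.0]) yields G_0 = 0.9 · 0.9 · 1 = 0.81.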
Computation can be focused by carefully selecting which states to sample transitions

