GOAL-SPACE PLANNING WITH SUBGOAL MODELS

Abstract

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.

1. INTRODUCTION

Planning with learned models in reinforcement learning (RL) is important for sample efficiency. Planning provides a mechanism for the agent to simulate data, in the background during interaction, to improve value estimates. Dyna (Sutton, 1990) is a classic example of background planning. On each step, the agent simulates several transitions according to its model and updates with those transitions as if they were real experience. Learning and using such a model is worthwhile in vast or ever-changing environments, where the agent learns over a long time period and can benefit from re-using knowledge about the environment. The promise of Dyna is that we can exploit the Markov structure in the RL formalism to learn and adapt value estimates efficiently, but many open problems remain to make it more widely useful. These include that 1) one-step models learned in Dyna can be difficult to use for long-horizon planning, 2) learning probabilities over outcome states can be complex, especially for high-dimensional states, and 3) planning itself can be computationally expensive for large state spaces. A variety of strategies have been proposed to improve long-horizon planning. Incorporating options as additional (macro) actions in planning is one approach. An option is a policy coupled with a termination condition and initiation set (Sutton et al., 1999). Options provide temporally-extended ways of behaving, allowing the agent to reason about outcomes further into the future. Incorporating options into planning is a central motivation of this paper, particularly how to do so under function approximation. Planning with options has largely been tested only in tabular settings (Sutton et al., 1999; Singh et al., 2004; Wan et al., 2021). Recent work has considered mechanisms for identifying and learning option policies for planning under function approximation (Sutton et al., 2022), but did not yet consider issues with learning the models.
A variety of other approaches have been developed to handle issues with learning and iterating one-step models. Several papers have shown that forward model simulations can produce simulated states that result in catastrophically misleading values (Jafferjee et al., 2020; van Hasselt et al., 2019; Lambert et al., 2022). This problem has been tackled by using reverse models (Pan et al., 2018; Jafferjee et al., 2020; van Hasselt et al., 2019); primarily using the model for decision-time planning (van Hasselt et al., 2019; Silver et al., 2008; Chelu et al., 2020); and improving training strategies to account for accumulated errors in rollouts (Talvitie, 2014; Venkatraman et al., 2015; Talvitie, 2017). An emerging trend is to avoid approximating the true transition dynamics, and instead learn dynamics tailored to predicting values on the next step correctly (Farahmand et al., 2017; Farahmand, 2018; Ayoub et al., 2020). This trend is also implicit in the variety of techniques that encode the planning procedure into neural network architectures that can then be trained end-to-end (Tamar et al., 2016; Silver et al., 2017; Oh et al., 2017; Weber et al., 2017; Farquhar et al., 2018; Schrittwieser et al., 2020). We similarly attempt to avoid issues with iterating models, but do so by considering a different type of model. Much less work has been done on the third problem with Dyna: the expense of planning. There is, however, a large literature on approximate dynamic programming, where the model is given, that is focused on efficient planning (see Powell, 2009). Particularly relevant to this work is restricting value iteration to a small subset of landmark states (Mann et al., 2015). The resulting policy is suboptimal, restricted to going between landmark states, but planning is provably more efficient. The use of landmark states has also been explored in goal-conditioned RL, where the agent is given a desired goal state or states.
The first work to exploit the idea of landmark states was for learning and using universal value function approximators (UVFAs) (Huang et al., 2019). The UVFA conditions action-values on both state-action pairs and landmark states. The agent can reach new goals by searching on a learned graph between landmark states, to identify which landmark to move towards. A flurry of work followed, still in the goal-conditioned setting (Nasiriany et al., 2019; Emmons et al., 2020; Zhang et al., 2020; 2021; Aubret et al., 2021; Hoang et al., 2021; Gieselmann & Pokorny, 2021; Kim et al., 2021; Dubey et al., 2021). In this paper, we exploit the idea behind landmark states for efficient background planning in general online reinforcement learning problems. The key novelty is a framework to use subgoal-conditioned models: temporally-extended models that condition on subgoals. The models are designed to be simpler to learn, as they are only learned for states local to subgoals and they avoid generating entire next-state vectors. We use background planning on subgoals to quickly propagate (suboptimal) value estimates for subgoals. We propose subgoal-value bootstrapping, which leverages these quickly computed subgoal values while mitigating their suboptimality by incorporating an update on real experience. We prove that dynamic programming with our subgoal models is sound (Proposition 2) and that our modified update converges, and in fact converges faster due to reducing the effective horizon (Proposition 3). We show in the PinBall environment that our Goal-Space Planning (GSP) algorithm can learn significantly faster than Double DQN, and still reaches nearly the same level of performance.

2. PROBLEM FORMULATION

We consider the standard reinforcement learning setting, where an agent learns to make decisions through interaction with an environment, formulated as a Markov Decision Process (MDP) (S, A, R, P). S is the state space and A the action space. The reward function R : S × A × S → ℝ and the transition probability P : S × A × S → [0, ∞) describe the expected reward and the probability of transitioning to a state, for a given state and action. On each discrete timestep t, the agent selects an action A_t in state S_t; the environment transitions to a new state S_{t+1} and emits a scalar reward R_{t+1}. The agent's objective is to find a policy π : S × A → [0, 1] that maximizes the expected return, the future discounted reward G_t ≝ R_{t+1} + γ_{t+1} G_{t+1}. The state-based discount γ_{t+1} ∈ [0, 1] depends on S_{t+1} (Sutton et al., 2011), which allows us to specify termination: if S_{t+1} is a terminal state, then γ_{t+1} = 0; else, γ_{t+1} = γ_c for some constant γ_c ∈ [0, 1]. The policy can be learned using algorithms like Q-learning (Sutton & Barto, 2018), which approximate the action-values: the expected return from a given state and action. We can incorporate models and planning to improve sample efficiency beyond these basic model-free algorithms. In this work, we focus on background planning algorithms: those that learn a model during online interaction and asynchronously update value estimates using dynamic programming updates. The classic example of background planning is Dyna (Sutton, 1990), which performs planning steps by selecting previously observed states, generating transitions (outcome rewards and next states) for every action, and performing a Q-learning update with those simulated transitions. Planning with learned models, however, has several issues. First, even with perfect models, it can be computationally expensive: running dynamic programming can require multiple sweeps, which is infeasible over a large number of states, while a small number of updates may be insufficient.
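To make the background-planning loop concrete, the following is a minimal tabular Dyna-Q sketch. The two-state environment, action set, and hyperparameters are illustrative assumptions, not the setup used in this paper.

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, terminal,
                alpha=0.5, gamma=0.9, n_planning=10, actions=(0, 1)):
    """One Dyna-Q step: direct Q-learning update on the real transition,
    model update, then n_planning background updates on remembered transitions."""
    # Direct reinforcement learning update.
    target = r + (0.0 if terminal else gamma * max(Q[(s_next, b)] for b in actions))
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    # Learn a (deterministic) one-step model from experience.
    model[(s, a)] = (r, s_next, terminal)
    # Background planning: replay simulated transitions from the model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps2, pterm) = random.choice(list(model.items()))
        ptarget = pr + (0.0 if pterm else gamma * max(Q[(ps2, b)] for b in actions))
        Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
    return Q

# Tiny chain: state 0 --action 1--> state 1 (terminal, reward 1).
Q = defaultdict(float)
model = {}
for _ in range(20):
    dyna_q_step(Q, model, s=0, a=1, r=1.0, s_next=1, terminal=True)
```

The planning loop is what lets value information propagate faster than real experience alone would allow, at the cost of the extra computation and model error discussed next.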
Computation can be focused by carefully selecting which states to sample transitions from, called search control, but how to do so effectively remains largely unanswered, with only a handful of works on the topic (Moore & Atkeson, 1993; Wingate et al., 2005; Pan et al., 2019). The second difficulty arises due to errors in the learned models. In reinforcement learning, the transition dynamics are represented with an expectation model E[S′ | s, a] or a probabilistic model P(s′ | s, a). If the state space or feature space is large, then the expected next state, or the distribution over it, can be difficult to estimate, as has been repeatedly shown (Talvitie, 2017). Further, these errors can compound when iterating the model forward or backward (Jafferjee et al., 2020; van Hasselt et al., 2019). It is common to use an expectation model, but unless the environment is deterministic, or we are only learning values rather than action-values, this model can produce invalid states and detrimental updates (Wan et al., 2019). In this work, we take steps towards the ambitious question: how can we leverage a separate computational procedure (planning with a model) to improve learning in complex environments? More specifically, we consider background planning for value-based methods. We address the two difficulties with classic background planning strategies discussed above by focusing planning on a set of subgoals (abstract states) and changing the form of the model.
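To see how quickly model error can compound under iteration, consider a small numerical sketch with an assumed linear expectation model: a constant 0.05 per-entry error in the dynamics matrix yields rollout error that grows with the horizon. The dynamics here are made up purely for illustration.

```python
import numpy as np

true_A = np.array([[0.9, 0.1],
                   [0.0, 0.9]])   # assumed "true" linear dynamics
model_A = true_A + 0.05           # learned model with a small uniform error

true_s = np.array([1.0, 1.0])
model_s = true_s.copy()
errors = []
for _ in range(10):
    true_s = true_A @ true_s      # real rollout
    model_s = model_A @ model_s   # iterated-model rollout
    errors.append(float(np.linalg.norm(model_s - true_s)))
```

After ten iterations the rollout error is several times the one-step error, even though the per-step model error never changes.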

3. GOAL-SPACE PLANNING WITH SUBGOAL-CONDITIONED MODELS

At a high level, the Goal-Space Planning algorithm focuses planning on a set of given abstract subgoals, providing quickly updated approximate values. To do so, the agent first learns a set of subgoal-conditioned models: minimal models focused on planning utility. These models then form a temporally abstract goal-space MDP, with subgoals as states and options to achieve each subgoal as actions. Finally, the agent can update its policy based on these subgoal values to speed up learning. Figure 1 provides a visual overview of this process.

3.1. DEFINING SUBGOALS

Assume we have a finite set of subgoal vectors G. For example, g could correspond to a situation where both the front and side distance sensors of a robot report low readings: what a person would call being in a corner. This g could be represented using a two-dimensional vector, even if the sensory space is 100-dimensional. In general, subgoals need not be instances of states (i.e., we do not require G ⊂ S). As another example, in Figure 1, we simply encode the nine subgoals, which correspond to regions with a small radius, using a tabular encoding of nine one-hot vectors. Essentially, our subgoals define a new state space in an abstract MDP, and these new abstract states (subgoals) can be encoded or represented in different ways, just like in regular MDPs. To fully specify a subgoal, we need a membership function m, where m(s, g) = 1 indicates that state s is a member of subgoal g, and m(s, g) = 0 otherwise. Many states can map to the same subgoal g. In the above example, if the first two elements of the state vector s are the front and side distance sensors, then m(s, g) = 1 for any state where s_1 and s_2 are below some threshold. For a concrete example, we visualize the subgoals for the environment in our experiments in Figure 1. Finally, we only reason about reaching subgoals from a subset of states, called initiation sets for options (Sutton et al., 1999). This constraint is key for locality: we learn and reason about only a subset of states for each subgoal. We assume the existence of a (learned) initiation function d(s, g) that is 1 if s is in the initiation set for g (e.g., sufficiently close in terms of reachability) and zero otherwise. We discuss some approaches to learn this initiation function in Appendix C, but here we assume it is part of the discovery procedure for the subgoals and focus first on how to use it. For the rest of this paper, we presume we are given subgoals and initiation sets, and develop algorithms to learn and use models given those subgoals.
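The membership and initiation functions for the corner example could be sketched as follows. The sensor layout, the threshold value, and the proximity-based initiation test are all hypothetical stand-ins; as noted above, d would in general be learned or given as part of subgoal discovery.

```python
import numpy as np

THRESHOLD = 0.1  # hypothetical sensor threshold for "in a corner"

def membership(s, g):
    """m(s, g): 1 if state s achieves subgoal g, else 0.
    Here g = 'corner' is achieved when the first two state elements
    (front and side distance sensors) are both low."""
    if g == "corner":
        return 1 if s[0] < THRESHOLD and s[1] < THRESHOLD else 0
    return 0

def initiation(s, g, radius=5.0):
    """d(s, g): 1 if s is in the (local) initiation set for g, else 0.
    As a stand-in for learned reachability, we use proximity of the
    two sensor readings to the subgoal region."""
    if g == "corner":
        return 1 if np.hypot(s[0], s[1]) < radius else 0
    return 0

s_in_corner = np.array([0.05, 0.02, 0.7, 0.3])  # low front/side readings
s_far = np.array([9.0, 9.0, 0.1, 0.1])          # far from any corner
```

Note that many states map to the same subgoal (any state with both readings below the threshold is a member of "corner"), while the initiation set is a strictly larger local neighbourhood.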
We expect a complete agent to discover these subgoals on its own, including how to represent these subgoals to facilitate generalization and planning. To separate concerns, we focus on how the agent can leverage reasonably well-specified subgoals.

3.2. DEFINING SUBGOAL-CONDITIONED MODELS

For planning and acting to operate in two different spaces, we define four models: two used in planning over subgoals (subgoal-to-subgoal) and two used to project these subgoal values back into the underlying state space (state-to-subgoal). Figure 2 visualizes these two spaces. The state-to-subgoal models are r_γ : S × Ḡ → ℝ and Γ : S × Ḡ → [0, 1], where Ḡ = G ∪ {s_terminal} if there is a terminal state (episodic problems) and Ḡ = G otherwise. An option policy π_g : S × A → [0, 1] for subgoal g starts from any s in the initiation set and terminates in g, namely in any s where m(s, g) = 1. The reward model r_γ(s, g) is the discounted reward accumulated under option policy π_g:

r_γ(s, g) = E_{π_g}[ R_{t+1} + γ_g(S_{t+1}) r_γ(S_{t+1}, g) | S_t = s ],

where the discount is zero upon reaching subgoal g:

γ_g(S_{t+1}) ≝ 0 if m(S_{t+1}, g) = 1, namely if subgoal g is achieved by being in S_{t+1}, and γ_{t+1} otherwise.

The discount model Γ(s, g) reflects the discounted number of steps until reaching subgoal g starting from s, in expectation under option policy π_g:

Γ(s, g) = E_{π_g}[ m(S_{t+1}, g) γ_{t+1} + γ_g(S_{t+1}) Γ(S_{t+1}, g) | S_t = s ].

These state-to-subgoal models will only be queried for (s, g) where d(s, g) > 0: they are local models. To define the subgoal-to-subgoal models, r̃_γ : G × Ḡ → ℝ and Γ̃ : G × Ḡ → [0, 1], we use the state-to-subgoal models. For each subgoal g ∈ G, we aggregate r_γ(s, g′) over all s where m(s, g) = 1:

r̃_γ(g, g′) ≝ (1/z(g)) Σ_{s : m(s,g)=1} r_γ(s, g′)  and  Γ̃(g, g′) ≝ (1/z(g)) Σ_{s : m(s,g)=1} Γ(s, g′),   (1)

for normalizer z(g) ≝ Σ_{s : m(s,g)=1} m(s, g). This definition assumes a uniform weighting over the states s where m(s, g) = 1. We could allow a non-uniform weighting, potentially based on visitation frequency in the environment. For this work, however, we assume that m(s, g) = 1 for a small number of states s with relatively similar r_γ(s, g′), making a uniform weighting reasonable.
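The aggregation in Equation (1) can be sketched directly. The toy states A, B, X, Y mirror the example in Figure 2, but the specific reward and discount values below are made up for illustration.

```python
def aggregate_models(r_gamma, Gamma, m, states, goals):
    """Build subgoal-to-subgoal models by uniformly averaging the
    state-to-subgoal models over the member states of each subgoal:
        r~(g, g') = (1/z(g)) * sum over {s : m(s,g)=1} of r_gamma(s, g'),
    and analogously for Gamma~ (Equation 1)."""
    r_tilde, Gamma_tilde = {}, {}
    for g in goals:
        members = [s for s in states if m(s, g) == 1]
        z = len(members)  # normalizer z(g)
        for gp in goals:
            r_tilde[(g, gp)] = sum(r_gamma(s, gp) for s in members) / z
            Gamma_tilde[(g, gp)] = sum(Gamma(s, gp) for s in members) / z
    return r_tilde, Gamma_tilde

# Toy example: states A, B are members of g; X, Y are members of g2.
m_fn = lambda s, g: 1 if (g == "g" and s in ("A", "B")) or (g == "g2" and s in ("X", "Y")) else 0
r_fn = lambda s, gp: {"A": 1.0, "B": 3.0}.get(s, 0.0) if gp == "g2" else 0.0
G_fn = lambda s, gp: 0.5 if gp == "g2" and s in ("A", "B") else 0.0
r_t, G_t = aggregate_models(r_fn, G_fn, m_fn, ["A", "B", "X", "Y"], ["g", "g2"])
```

Here r̃_γ(g, g2) is the uniform average of r_γ(A, g2) = 1 and r_γ(B, g2) = 3, i.e., 2, matching the uniform weighting assumed above.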
These subgoal-to-subgoal models are also local models, as we can similarly extract d(g, g′) from d(s, g′) and only reason about g′ nearby or relevant to g. We set d(g, g′) = max_{s ∈ S : m(s,g) > 0} d(s, g′), indicating that if there is a state s that has membership in g and is in the initiation set for g′, then g′ is relevant to g. Let us consider the example in Figure 2. The red states are members of g (m(A, g) = 1) and the blue states are members of g′ (m(X, g′) = 1, m(Y, g′) = 1). For all s in the diagram, d(s, g′) > 0 (all are in the initiation set): the policy π_{g′} can be queried from any s to get to g′. The green path on the left indicates the trajectory under π_{g′} from A, stochastically reaching either X or Y, with accumulated reward r_γ(A, g′) and discount Γ(A, g′) (averaged over reaching X and Y). The subgoal-to-subgoal models, on the right, indicate that g′ can be reached from g, with r̃_γ(g, g′) averaged over r_γ(A, g′) and r_γ(B, g′), and Γ̃(g, g′) averaged over Γ(A, g′) and Γ(B, g′), as described in Equation (1).

[Figure 2: The original MDP (left), with states A, B (members of g), states X, Y (members of g′), and the option policy π_{g′}; and the subgoal abstraction (right), with subgoal-to-subgoal models r̃_γ(g, g′) and Γ̃(g, g′).]

3.3. GOAL-SPACE PLANNING WITH SUBGOAL-CONDITIONED MODELS

We can now consider how to plan with these models. Planning involves learning ṽ(g): the value of each subgoal. This can be achieved using an update similar to value iteration, for all g ∈ G:

ṽ(g) = max_{g′ ∈ Ḡ : d(g,g′) > 0} [ r̃_γ(g, g′) + Γ̃(g, g′) ṽ(g′) ]   (Background Planning) (2)

The value of reaching g′ from g is the discounted reward along the way, r̃_γ(g, g′), plus the discounted value of g′. If Γ̃(g, g′) is very small, then g′ is difficult to reach from g, or takes many steps, and so the value of g′ is discounted more heavily. With a relatively small number of subgoals, we can sweep through them all to quickly compute ṽ(g). With a larger set of subgoals, we can instead do as many updates as possible in the background on each step, by stochastically sampling g. We can interpret this update as a standard value iteration update in a new MDP, where 1) the set of states is G, 2) the actions from g ∈ G are state-dependent, corresponding to choosing which g′ ∈ Ḡ to go to among those with d(g, g′) > 0, and 3) the rewards are r̃_γ and the discounted transition probabilities are Γ̃. Under this correspondence, it is straightforward to show that the above update converges to the optimal values in this new goal-space MDP, shown in Proposition 2 in Appendix B. This goal-space planning approach does not suffer from typical issues with model-based RL. First, the model is not iterated, yet we still obtain temporal abstraction because the model itself incorporates it. Second, we do not need to predict entire state vectors, or distributions over them, because we instead input the outcome g′ into the function approximator. This may feel like a false success, as it potentially requires restricting ourselves to a smaller number of subgoals. If we want to use a larger number of subgoals, then we may need a function to generate these subgoal vectors anyway, bringing us back to the problem of generating vectors.
However, this is likely easier because 1) the subgoals themselves can be much smaller and more abstract, making it more feasible to procedurally generate them, and 2) it may be more feasible to maintain a large set of subgoal vectors, or to generate individual subgoal vectors, than to produce relevant next subgoal vectors from a given subgoal. Now let us examine how to use ṽ(g) to update our main policy. The simplest way to decide how to behave from a state is to cycle through the subgoals and pick the one with the highest value:

v_sub(s) ≝ max_{g ∈ Ḡ : d(s,g) > 0} [ r_γ(s, g) + Γ(s, g) ṽ(g) ]   (Projection Step) (3)

and take the action a given by π_g for this maximizing g. However, this approach has two issues. First, restricting the agent to travel through subgoals might result in suboptimal policies: from a given state s, the set of relevant subgoals g may not be on the optimal path. Second, the learned models themselves may have inaccuracies, or planning may not have been completed in the background, resulting in ṽ(g) that are not yet fully accurate. We instead propose to use v_sub(s) within the bootstrap target for the action-values of the main policy.
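The planning update in Equation (2) and the projection step in Equation (3) can be sketched together on a made-up three-subgoal chain; the subgoal names, model values, and relevance functions below are purely illustrative.

```python
def plan(goals, r_tilde, Gamma_tilde, d, n_sweeps=100):
    """Equation (2): value-iteration-style sweeps over subgoals,
    v~(g) <- max over {g' : d(g,g') > 0} of r~(g,g') + Gamma~(g,g') v~(g')."""
    v = {g: 0.0 for g in goals}
    for _ in range(n_sweeps):
        for g in goals:
            vals = [r_tilde[(g, gp)] + Gamma_tilde[(g, gp)] * v[gp]
                    for gp in goals if d(g, gp) > 0]
            if vals:
                v[g] = max(vals)
    return v

def v_sub(s, goals, r_gamma, Gamma, d, v_tilde):
    """Equation (3): project subgoal values back to state s."""
    vals = [r_gamma(s, g) + Gamma(s, g) * v_tilde[g]
            for g in goals if d(s, g) > 0]
    return max(vals) if vals else None  # None: no relevant subgoal

# Toy chain g0 -> g1 -> g2, with reward 1 only on the final hop.
goals = ["g0", "g1", "g2"]
r_tilde = {("g0", "g1"): 0.0, ("g1", "g2"): 1.0}
Gamma_tilde = {("g0", "g1"): 0.9, ("g1", "g2"): 0.9}
d_goal = lambda g, gp: 1 if (g, gp) in r_tilde else 0
v_tilde = plan(goals, r_tilde, Gamma_tilde, d_goal)

# A state partway to g1: v_sub bootstraps off v~(g1).
value = v_sub("s", goals,
              r_gamma=lambda s, g: 0.0,
              Gamma=lambda s, g: 0.8 if g == "g1" else 0.0,
              d=lambda s, g: 1 if g == "g1" else 0,
              v_tilde=v_tilde)
```

Since the goal space is tiny, full sweeps converge almost immediately, which is exactly the computational advantage of planning over subgoals rather than states.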
For a given transition (S_t, A_t, R_{t+1}, S_{t+1}), either as the most recent experience or from a replay buffer, the proposed subgoal-value bootstrapping update to the parameterized q(S_t, A_t; w) uses the TD error

δ ≝ R_{t+1} + γ_{t+1} [ (1 − β) max_{a′} q(S_{t+1}, a′; w) + β v_sub(S_{t+1}) ] − q(S_t, A_t; w)   (4)

where the first term inside the brackets is the standard bootstrap target and the second is the subgoal value.

[Figure 3: Computing v_sub(S′) = max over relevant subgoals g_i, g_j, g_k of r_γ(S′, g) + Γ(S′, g) ṽ(g), to update the policy at S.]
for some β ∈ [0, 1]. For β = 0, we get a standard Q-learning update. For β = 1, we fully bootstrap off the value provided by v_sub(S_{t+1}). This may result in suboptimal values q(S_t, A_t; w), but should learn faster because a reasonable estimate of value has been propagated back quickly using goal-space planning. On the other hand, β = 0 is not biased by a potentially suboptimal ṽ(g), but does not take advantage of this fast propagation. An interim β can allow for fast propagation, but also help overcome suboptimality in the values. We can show that the above update improves the convergence rate. This result is intuitive: subgoal-value bootstrapping changes the discount rate to γ_{t+1}(1 − β). In the extreme case of β = 1, we are moving our estimate towards R_{t+1} + γ_{t+1} v_sub(S_{t+1}) for v_sub not based on q, without any bootstrapping: it is effectively a regression problem. We prove this intuitive result in Proposition 3 in Appendix B. One other benefit of this approach is that the initiation sets need not cover the whole space: we can have a state s with d(s, g) = 0 for all g. If this occurs, we simply do not use v_sub and bootstrap as usual.
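The mixed update can be sketched as follows. This is a minimal illustration; the function name and array-based interface are ours, not the paper's implementation:

```python
import numpy as np

def mixed_td_target(r, gamma, q_next, v_sub_next, beta):
    """Subgoal-bootstrapped target: beta mixes the usual max_a q(S', a)
    bootstrap with the planned value v_sub(S').  beta = 0 recovers the
    standard Q-learning target; beta = 1 regresses on the planned value
    alone.  If no subgoal is relevant at S' (d(S', g) = 0 for all g),
    v_sub is undefined and we fall back to beta = 0, bootstrapping as
    usual."""
    if v_sub_next is None:
        beta = 0.0
        v_sub_next = 0.0
    bootstrap = (1.0 - beta) * np.max(q_next) + beta * v_sub_next
    return r + gamma * bootstrap

# Example: per-step reward -5, discount 0.99, beta = 0.1.
target = mixed_td_target(-5.0, 0.99, np.array([1.0, 3.0]), 10.0, 0.1)
```

With β = 0.1, the bootstrap is 0.9 · max_a q + 0.1 · v_sub, so a large planned value pulls the target up even when the current action-values are still pessimistic.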
Figure 4: Goal-Space Planning.

The remaining piece is to learn the models and put it all together. Learning the models is straightforward, as we can leverage the large literature on general value functions (Sutton et al., 2011) and UVFAs (Schaul et al., 2015). There are nuances involved in 1) restricting updating to relevant states according to d(s, g), 2) learning option policies that reach subgoals, but also maximize rewards along the way and 3) considering ways to jointly learn d and Γ. For space, we include these details in Appendix C. The algorithm is visualized in Figure 4 (pseudocode in Appendix C.3).
The steps of agent-environment interaction are: 1) take action A_t in state S_t, to get S_{t+1}, R_{t+1} and γ_{t+1}; 2) query the model for r_γ(S_{t+1}, g), Γ(S_{t+1}, g) and ṽ(g) for all g where d(S_{t+1}, g) > 0; 3) compute the projection v_sub(S_{t+1}) using Eq. (3) and the quantities from step 2; 4) update the main policy with the transition and v_sub(S_{t+1}), using Eq. (4). All background computation is used for model learning from a replay buffer and for planning to obtain ṽ, so that both can be queried at any time in step 2.
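Step 3, the projection of subgoal values down to the current state, can be sketched as follows. The model interfaces are passed in as callables purely for illustration; the names are ours:

```python
def project_v_sub(s, subgoals, d, r_gamma, Gamma, v_goal):
    """Compute v_sub(s) = max over relevant g of
    r_gamma(s, g) + Gamma(s, g) * v_goal[g],
    maximizing only over subgoals with d(s, g) > 0.  Returns None when
    d(s, g) = 0 for every g, in which case the agent bootstraps as usual."""
    vals = [r_gamma(s, g) + Gamma(s, g) * v_goal[g]
            for g in subgoals if d(s, g) > 0]
    return max(vals) if vals else None
```

Because only subgoals with d(s, g) > 0 are queried, the per-step cost scales with the number of locally relevant subgoals, not the size of the state space.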

4. EXPERIMENTS WITH GOAL-SPACE PLANNING

We investigate the utility of GSP, for 1) improving sample efficiency and 2) re-learning under nonstationarity. We compare to Double DQN (DDQN) (Van Hasselt et al., 2016) , which uses replay and target networks. We layer GSP on top of this agent: the action-value update is modified to incorporate subgoal-value bootstrapping. By selecting β = 0, we perfectly recover DDQN, allowing us to test different β values to investigate the impact of incorporating subgoal values computed using background planning.

4.1. EXPERIMENT SPECIFICATION

We test the agents in the PinBall environment (Konidaris & Barto, 2009), which allows for a variety of easy and harder instances to test different aspects. The agent has to navigate a small ball to a destination in a maze-like environment with fully elastic and irregularly shaped obstacles. The state is described by 4 features: x ∈ [0, 1], y ∈ [0, 1], ẋ ∈ [−1, 1], ẏ ∈ [−1, 1]. The agent has 5 discrete actions: increase/decrease ẋ, increase/decrease ẏ, and do nothing. The agent receives a reward of −5 per step and a reward of 10,000 upon termination at the goal location. PinBall has a continuous state space with complex and sharp dynamics that make learning and control difficult. We used a harder version of PinBall in our first experiment, shown in Figure 5, and a simpler one for the non-stationary experiment, shown in Figure 9, to allow DDQN a better chance to adapt under non-stationarity. The hyperparameters were chosen by sweeping for DDQN performance. We then fixed these hyperparameters and used them for GSP. This approach helps ensure the two agents have similar settings, with the primary difference due to incorporating subgoal-value bootstrapping. We used neural networks with ReLU activations and ε = 0.1; details about hyperparameters are in Appendix F. The set of subgoals for GSP is chosen to cover the environment in terms of (x, y) locations. For each subgoal g with location (x_g, y_g), we set m(s, g) = 1 for s = (x, y, ẋ, ẏ) if the Euclidean distance between (x, y) and (x_g, y_g) is below 0.035. Using a region, rather than requiring (x, y) = (x_g, y_g), is necessary for a continuous state space. The agent's velocity is not taken into account for subgoal termination. The width of the region for the initiation function is 0.4. More details about the layout of the environment, the positions of these subgoals and the initiation functions are shown in Figure 5.
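As a concrete sketch of the termination and initiation checks (helper names are ours; we also treat the stated 0.4 initiation width as a Euclidean radius, which is our assumption, not necessarily the paper's exact definition):

```python
import math

def m(s, g_xy, radius=0.035):
    """Subgoal membership in PinBall: state s = (x, y, xdot, ydot) terminates
    at subgoal g if its (x, y) position is within `radius` of the subgoal
    location (x_g, y_g); velocity is ignored, as in the experiment setup."""
    return math.hypot(s[0] - g_xy[0], s[1] - g_xy[1]) <= radius

def d(s, g_xy, width=0.4):
    """Initiation (relevance) check: the subgoal is relevant to states whose
    position lies within the wider region around (x_g, y_g)."""
    return math.hypot(s[0] - g_xy[0], s[1] - g_xy[1]) <= width
```

The small termination radius keeps subgoal achievement well-defined in the continuous space, while the much wider initiation region determines which states query each subgoal's model.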
Performance in this environment for GSP with β = 0.1, DDQN, and approximate LAVI, with the standard error shown. Even just increasing to β = 0.1 allows GSP to leverage the longer-horizon estimates given by the subgoal values, making it learn much faster than DDQN. Approximate LAVI is able to learn quickly, but levels off at suboptimal performance, as expected.

4.2. EXPERIMENT 1: INVESTIGATING GSP WITH PRE-TRAINED MODELS

We first investigate the utility of the models after they have been learned in a pre-training phase. The models use the same updates as they would when being learned online, and are not perfectly accurate. Pre-training the model allows us to ask: if the GSP agent had previously learned a model in the environment (or had offline data to train its model), can it leverage it to learn faster now? One of the primary goals of model-based RL is precisely this re-use, so it is natural to start in a setting mimicking this use-case. We assume the GSP agent can do many steps of background planning, so that ṽ is effectively computed in early learning; this is reasonable as we only need to do value iteration for 9 subgoals, which is fast. We compare GSP with β = 0.1 against two baselines: DDQN and approximate LAVI. DDQN is the model-free baseline that GSP builds on top of; in the PinBall environment it can also be viewed as a version of Dyna, with the replay buffer acting as a non-parametric model (van Hasselt et al., 2019; Pan et al., 2018) and planning updates sampled based on the agent's prior state-action visitation distribution. Approximate LAVI is a modified version of LAVI (Mann et al., 2015) that uses learned subgoal models; it is a version of GSP that fully relies on subgoal values in its updates, using β = 1. We selected β = 0.1 as it provided the best tradeoff among the β values we tested, but we found that β as small as 10^{-3} was able to outperform DDQN. The performance of GSP with different β can be found in Appendix I. We see in Figure 5 that GSP learns much faster than DDQN, and reaches the same level of performance. This is the result we should expect (GSP gets to leverage a pre-trained model, after all), but it is an important sanity check that using models in this new way is effective. Of particular note is that even just increasing β from 0 (which recovers DDQN) to β = 0.1 provides the learning speed boost without resulting in suboptimal performance.
Likely, in early learning, the suboptimal subgoal values provide a coarse direction to follow, allowing the action-values to be updated more quickly, and this is then refined with more learning. When the approximate subgoal models are fully relied upon, as with approximate LAVI, we similarly get fast initial learning, but it plateaus at a more suboptimal point. To further investigate the hypothesis that GSP more quickly changes its value function early in learning, we visualize the value functions for both GSP and DDQN over time in Figure 6. After 2000 steps, they are not yet that different, because there are only four replay updates on each step and it takes time to visit the state space and update values by bootstrapping off of subgoal values. By step 6000, though, GSP already has some of the structure of the problem, whereas DDQN has simply pushed down many of its values (darker blue). To see whether GSP is feasible to apply to other problems, we also evaluated GSP in Lunar Lander (Brockman et al., 2016), an environment where subgoal specification is not as obvious and environment dynamics cause the agent to frequently crash. We include those results in Appendix H, but note that similar conclusions about comparisons between GSP and DDQN hold. We also compared GSP to various Dyna-style planning algorithms, some of which also incorporate temporal abstraction, in Appendix G, and find that GSP is able to outperform these alternatives. One potential benefit of GSP is that the models themselves may be easier to learn, because we can leverage standard value function learning algorithms. We visualize the models learned for the previous experiment, as well as the resulting v_sub, with details about model learning in Appendix E. In Figure 7 we see that the learned state-to-subgoal models accurately capture the structure of the environment. Each plot shows the learned state-to-subgoal model for one subgoal, visualized only for the initiation set d(s, g) > 0.
We can see larger discount and reward values predicted based on reachability. However, the models are not perfect. We measured the model error and found it to be reasonably small, though not near zero (see Appendix E). This result is actually encouraging: inaccuracies in the model do not prevent useful planning. It is also informative to visualize v_sub. We can see in Figure 6 that the general structure is correct, matching the optimal path, but that it indeed looks suboptimal compared to the final values computed by DDQN in Figure 6. This inaccuracy is likely due both to some inaccuracy in the models and to the fact that subgoal placement is not optimal. This explains why GSP has lower values particularly in states near the bottom, likely skewed downwards by v_sub.

4.3. EXPERIMENT: LEARNING WITH LESS ACCURATE MODELS

Finally, we test the impact on learning of using less accurate models. After all, the agent will want to start using its model as soon as possible, rather than waiting for it to become more accurate. We ran GSP using models learned online, using only 50k, 75k and 100k time steps to learn the models. We then froze the models and allowed GSP to learn with them. We can see in Figure 8 that learning with too inaccurate a model (with 50k) fails, but already with 75k performance improves considerably, and with 100k we are nearly at the same level of performance as with the pre-trained models. This result highlights that it should be feasible to learn and use these models in GSP, all online.

4.4. EXPERIMENT 2: ADAPTING IN NONSTATIONARY PINBALL

Now we consider another typical use-case for model-based RL: quickly adapting to changes in the environment. We let the agent learn in PinBall for 50k steps, and then switch the goal to a new location for another 50k steps. Goal information is never given to the agent, so it has to visit the old goal, realize it is no longer rewarding, and re-explore to find the new goal. This non-stationary setting is harder for DDQN, so we use a simpler configuration for PinBall, shown in Figure 9. We can leverage the idea of exploration bonuses, introduced in Dyna-Q+ (Sutton & Barto, 2018). Exploration bonuses grow with the time since a state-action pair was last visited. This encourages the agent to revisit parts of the state space that it has not seen recently, in case that part of the world has changed. For us, this corresponds to including a reward bonus r_bonus in the planning and projection steps:

ṽ(g) = max_{g' ∈ Ḡ: d(g, g') > 0} r_γ(g, g') + Γ(g, g')[ṽ(g') + r_bonus(g')]

and

v_sub(s) = max_{g ∈ Ḡ: d(s, g) > 0} r_γ(s, g) + Γ(s, g)[ṽ(g) + r_bonus(g)].

Because we have a small, finite set of subgoals, it is straightforward to leverage this idea that was designed for the tabular setting. We use r_bonus(g) = 1000 if the count for g is zero, and 0 otherwise. When the world changes, the agent recognizes that it has changed and resets all counts. Similarly, both agents (GSP and DDQN) clear their replay buffers. The GSP agent can recognize that the world has changed, but not how it has changed. It has to update its models with experience. The state-to-subgoal and subgoal-to-subgoal models local to the previous terminal state location and the new one need to change, but the rest of the models are actually already accurate. The agent can leverage this existing accuracy. In Figure 9, we can see both GSP and DDQN drop in performance when the environment changes, with GSP recovering much more quickly.
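The planning step with the bonus term can be sketched as follows. The names and callable interfaces are ours, standing in for the paper's learned models:

```python
def plan_with_bonus(subgoals, d, r_gamma, Gamma, bonus, iters=100):
    """Value iteration over subgoals with a Dyna-Q+-style bonus: a subgoal
    whose visit count is zero contributes r_bonus = 1000 to its value
    target, drawing the agent back toward it after the environment changes.
    Sweeps the update v(g) = max_{g': d(g,g')>0} r(g,g') + Gamma(g,g') *
    (v(g') + bonus(g')) for a fixed number of iterations."""
    v = {g: 0.0 for g in subgoals}
    for _ in range(iters):
        for g in subgoals:
            targets = [r_gamma(g, gp) + Gamma(g, gp) * (v[gp] + bonus(gp))
                       for gp in subgoals if gp != g and d(g, gp) > 0]
            if targets:
                v[g] = max(targets)
    return v
```

Because planning is over a handful of subgoals, recomputing ṽ with fresh bonuses after a change is cheap, which is what makes this tabular idea practical here.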
It is always possible that an inaccurate model might actually make re-learning slower, by reinforcing incorrect values from the model. Here, though, updating these local models is fast, allowing the subgoal values to also be updated quickly. Though not shown in the plot, GSP without exploration bonuses performs poorly. Because the value in the bottom corner containing the new goal is low under its model, the agent avoids visiting the new goal region, preventing the model from updating.

5. CONCLUSION

In this paper we introduced a new planning framework, called Goal-Space Planning (GSP). GSP provides a new approach to using background planning to improve action-value estimates, with minimalist, local models and computationally efficient planning. We show in the PinBall environment that these subgoal-conditioned models can be accurately learned using standard value estimation algorithms, and that GSP is robust to less accurate models (Section 4.3). We also find that GSP can significantly improve the speed of learning over DDQN in the PinBall environment and outperforms several Dyna variants, including Dyna with options (Appendix G), and that GSP re-learns more quickly under non-stationarity than DDQN (Section 4.4). Additionally, we compared GSP to DDQN in another environment, Lunar Lander (Appendix H), both to highlight that the conclusions extend and to demonstrate that it is straightforward to apply GSP to other problems. This work introduces a new formalism, and many new technical questions along with it. We have only tested GSP with pre-trained models and assumed a given set of subgoals. Our initial experiments learning the models online, from scratch, indicate that GSP can obtain similar learning speed boosts. Using a recency buffer, however, accumulates transitions only along the optimal trajectory, sometimes causing the models to become inaccurate part-way through learning. An important next step is to incorporate smarter model learning strategies. The other critical open question is subgoal discovery. We somewhat arbitrarily selected subgoals across the PinBall environment, with a successful outcome; such an approach is unlikely to work in many environments. In general, option discovery and subgoal discovery remain open questions. One utility of this work is that it could help narrow the scope of the discovery question to that of finding abstract subgoals that help the agent plan more efficiently.

A STARTING SIMPLER: GOAL-SPACE PLANNING FOR POLICY EVALUATION

To highlight the key idea for efficient planning, we provide an example of GSP in a simpler setting: policy evaluation for learning v_π for a fixed deterministic policy π in a deterministic environment, assuming access to the true models. The key idea is to propagate values quickly across the space by updating between a subset of states that we call subgoals, g ∈ G ⊂ S, as visualized in Figure 10. (Later we extend G ⊂ S to abstract subgoal vectors that need not correspond to any state.) To do so, we need temporally extended models between pairs g, g' that may be further than one transition apart. For policy evaluation, these models are the accumulated rewards r_{π,γ}: S × S → R and discounted probabilities P_{π,γ}: S × S → [0, 1] under π:

r_{π,γ}(g, g') := E_π[R_{t+1} + γ_{g',t+1} r_{π,γ}(S_{t+1}, g') | S_t = g]
P_{π,γ}(g, g') := E_π[1(S_{t+1} = g') γ_{t+1} + γ_{g',t+1} P_{π,γ}(S_{t+1}, g') | S_t = g]

where γ_{g',t+1} = 0 if S_{t+1} = g' and otherwise equals γ_{t+1}, the environment discount. If we cannot reach g' from g under π, then P_{π,γ}(g, g') will simply accumulate many zeros and be zero. We can treat G as our new state space and plan in this space, to get value estimates v for all g ∈ G:

v(g) = r_{π,γ}(g, g') + P_{π,γ}(g, g') v(g')   where g' = argmax_{g'' ∈ Ḡ} P_{π,γ}(g, g'')

where Ḡ = G ∪ {s_terminal} if there is a terminal state (episodic problems) and otherwise Ḡ = G. It is straightforward to show this converges, because P_{π,γ} is a substochastic matrix (see Appendix A.1). Once we have these values, we can propagate them to other states, locally, again using the closest g to s. We can do so by noticing that the above definitions extend easily to r_{π,γ}(s, g') and P_{π,γ}(s, g'), since for a pair (s, g) they are about starting in state s and reaching g under π:

v(s) = r_{π,γ}(s, g) + P_{π,γ}(s, g) v(g)   where g = argmax_{g' ∈ Ḡ} P_{π,γ}(s, g').   (5)

Because the right-hand side of this equation is fixed, we only cycle through these states once to get their values.
All of this might seem like a lot of work for policy evaluation; indeed, it will be more useful to have this formalism for control. But even here goal-space planning can be beneficial. Let us assume a chain s_1, s_2, . . . , s_n, where n = 1000 and G = {s_100, s_200, . . . , s_1000}. Planning over g ∈ G only requires sweeping over 10 states, rather than 1000. Further, we have taken a 1000-horizon problem and converted it into a 10-step one. As a result, changes in the environment also propagate faster. If the reward at s changes, locally the reward model around s can be updated quickly, to change r_{π,γ}(g, g') for pairs g, g' where s is along the way from g to g'. This local change quickly updates the values back to earlier g ∈ G.
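The chain example above can be checked concretely. The per-step reward of -1 and discount 0.99 below are our illustrative choices; the original example does not fix them:

```python
# 1000-state chain with subgoals every 100 steps: planning over 10 subgoals
# reproduces the exact discounted return of the full 1000-step chain.
gamma_c, n, hop = 0.99, 1000, 100
r_hop = sum(-1.0 * gamma_c**k for k in range(hop))   # r_{pi,gamma}(g, g')
P_hop = gamma_c**hop                                  # P_{pi,gamma}(g, g')

# Plan over the 10 subgoals only; s_1000 is terminal with value 0.
v = [0.0] * 11                    # v[i] is the value at state s_{100*i}
for i in reversed(range(10)):
    v[i] = r_hop + P_hop * v[i + 1]

# The 10-step plan matches the exact 1000-step discounted return from s_0.
v_exact = sum(-1.0 * gamma_c**k for k in range(n))
assert abs(v[0] - v_exact) < 1e-9
```

The ten hop-level backups replace a thousand one-step backups, which is exactly the horizon reduction described above.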

A.1 PROOFS FOR THE DETERMINISTIC POLICY EVALUATION SETTING

We provide proofs here for the deterministic policy evaluation setting. We assume throughout that the environment discount γ_{t+1} is a constant γ_c ∈ [0, 1) for every step in an episode, until termination when it is zero. The results below can be extended to the case where γ_c = 1, using the standard strategy for the stochastic shortest path problem setting. First, we want to show that given r_{π,γ} and P_{π,γ}, we can guarantee that the update for the values for G will converge. Recall that Ḡ = G ∪ {s_terminal} is the augmented goal space that includes the terminal state. This terminal state is not a subgoal, since it is not a real state, but is key for appropriate planning.

Lemma 1. Assume that we have a deterministic MDP, a deterministic policy π, γ_c < 1, and a discrete set of subgoals G ⊂ S, and that we iteratively update v_t ∈ R^{|Ḡ|} with the dynamic programming update

v_t(g) = r_{π,γ}(g, g') + P_{π,γ}(g, g') v_{t-1}(g')   where g' = argmax_{g'' ∈ Ḡ} P_{π,γ}(g, g'')

for all g ∈ G, starting from an arbitrary (finite) initialization v_0 ∈ R^{|Ḡ|}, with v_t(s_terminal) fixed at zero. Then v_t converges to a fixed point.

Proof. To analyze this as a matrix update, we need to extend P_{π,γ}(g, g') to include an additional row transitioning from s_terminal. This row is all zeros, because the value in the terminal state is always fixed at zero. Note that there are ways to avoid introducing terminal states, using transition-based discounting (White, 2017), but for this work it is actually simpler to explicitly reason about them and about reaching them from subgoals. To show convergence, we simply need to ensure that P_{π,γ} is a substochastic matrix. Recall that

P_{π,γ}(g, g') := E_π[1(S_{t+1} = g') γ_{t+1} + γ_{g',t+1} P_{π,γ}(S_{t+1}, g') | S_t = g]

where γ_{g',t+1} = 0 if S_{t+1} = g' and otherwise equals γ_{t+1}, the environment discount. If it is substochastic, then ||P_{π,γ}||_2 < 1.
Consequently, the Bellman operator

(Bv)(g) = r_{π,γ}(g, g') + P_{π,γ}(g, g') v(g')   where g' = argmax_{g'' ∈ Ḡ} P_{π,γ}(g, g'')

is a contraction, because

||Bv_1 − Bv_2||_2 = ||P_{π,γ} v_1 − P_{π,γ} v_2||_2 ≤ ||P_{π,γ}||_2 ||v_1 − v_2||_2 < ||v_1 − v_2||_2.

Because γ_c < 1, either g immediately terminates in g', giving 1(S_{t+1} = g') γ_{t+1} + γ_{g',t+1} P_{π,γ}(S_{t+1}, g') = γ_{t+1} + 0 ≤ γ_c, or it does not immediately terminate, and 1(S_{t+1} = g') γ_{t+1} + γ_{g',t+1} P_{π,γ}(S_{t+1}, g') = 0 + γ_c P_{π,γ}(S_{t+1}, g') ≤ γ_c because P_{π,γ}(S_{t+1}, g') ≤ 1. Therefore, if γ_c < 1, then ||P_{π,γ}||_2 ≤ γ_c.

Proposition 1. For a deterministic MDP, a deterministic policy π, and a discrete set of subgoals G ⊂ S that are all reached by π in the MDP, given the ṽ(g) obtained from Equation 6, if we set

v(s) = r_{π,γ}(s, g) + P_{π,γ}(s, g) ṽ(g)   where g = argmax_{g' ∈ Ḡ} P_{π,γ}(s, g')

for all states s ∈ S, then we get that v = v_π.

Proof. For a deterministic environment and deterministic policy this result is straightforward. The term P_{π,γ}(s, g) > 0 only if g is on the trajectory from s when the policy π is executed. The term r_{π,γ}(s, g) consists of deterministic (discounted) rewards, and ṽ(g) is the true value from g (namely ṽ(g) = v_π(g)), as shown in Lemma 1. The subgoal g is the closest state on the trajectory from s, and P_{π,γ}(s, g) is γ_c^t where t is the number of steps from s to g.

B PROOFS FOR THE GENERAL CONTROL SETTING

In this section we assume that γ_c < 1, to avoid some of the additional issues involved in handling proper policies. The same strategies apply to the stochastic shortest path setting with γ_c = 1, with additional assumptions.

Proposition 2. [Convergence of Value Iteration in Goal-Space] Assuming that Γ is a substochastic matrix, with v_0 ∈ R^{|Ḡ|} initialized to an arbitrary value and v_t(s_terminal) = 0 fixed for all t, iteratively sweeping through all g ∈ G with the update

v_t(g) = max_{g' ∈ Ḡ: d(g, g') > 0} r_γ(g, g') + Γ(g, g') v_{t-1}(g')

converges to a fixed point.

Proof. We can use the same approach typically used for value iteration. For any v ∈ R^{|Ḡ|}, we can define the operator

(B_g v)(g) := max_{g' ∈ Ḡ: d(g, g') > 0} r_γ(g, g') + Γ(g, g') v(g').

First we show that B_g is a γ_c-contraction. Assume we are given any two vectors v_1, v_2. Notice that Γ(g, g') ≤ γ_c, because for our problem setting the discount is either equal to γ_c or equal to zero at termination. Then we have that for any g ∈ Ḡ

|(B_g v_1)(g) − (B_g v_2)(g)|
= |max_{g' ∈ Ḡ: d(g, g') > 0} [r_γ(g, g') + Γ(g, g') v_1(g')] − max_{g' ∈ Ḡ: d(g, g') > 0} [r_γ(g, g') + Γ(g, g') v_2(g')]|
≤ max_{g' ∈ Ḡ: d(g, g') > 0} |r_γ(g, g') + Γ(g, g') v_1(g') − (r_γ(g, g') + Γ(g, g') v_2(g'))|
= max_{g' ∈ Ḡ: d(g, g') > 0} |Γ(g, g')(v_1(g') − v_2(g'))|
≤ max_{g' ∈ Ḡ: d(g, g') > 0} γ_c |v_1(g') − v_2(g')|
≤ γ_c ||v_1 − v_2||_∞.

Since this is true for any g, it is true for the max over g, giving ||B_g v_1 − B_g v_2||_∞ ≤ γ_c ||v_1 − v_2||_∞. Because the operator B_g is a contraction, since γ_c < 1, we know by the Banach fixed-point theorem that the fixed point exists and is unique. Now we analyze the update to the main policy, which incorporates the subgoal value estimates into the bootstrap target. We assume we have a finite number of state-action pairs n, with parameterized action-values q(·; w) ∈ R^n represented as a vector with one entry per state-action pair.
Value iteration to find $q^*$ corresponds to updating with the Bellman optimality operator
$$(Bq)(s, a) \overset{\text{def}}{=} r(s, a) + \sum_{s'} P(s'|s, a)\,\gamma(s') \max_{a' \in \mathcal{A}} q(s', a').$$
On each step, for the current $q_t \overset{\text{def}}{=} q(\cdot; w_t)$, if we assume the parameterized function class can represent $Bq_t$, then we can reason about the iterates $w_1, w_2, \ldots$ obtained when minimizing the distance between $q(\cdot; w_{t+1})$ and $Bq_t$, with $q(s, a; w_{t+1}) = (Bq(\cdot; w_t))(s, a)$. Under function approximation, we do not simply update a table of values, but we can obtain this equality by minimizing until we have zero Bellman error. Note that $q^* = Bq^*$, by definition. In this realizability regime, we can reason about the iterates produced by value iteration. The convergence rate is dictated by $\gamma_c$, as is well known, because $\|Bq_1 - Bq_2\|_\infty \le \gamma_c \|q_1 - q_2\|_\infty$. Specifically, if we assume $|r(s, a)| \le r_{\max}$, then we can use the facts that 1) the maximal return is no greater than $G_{\max} \overset{\text{def}}{=} \frac{r_{\max}}{1 - \gamma_c}$, and 2) for any initialization $q_0$ no larger in magnitude than this maximal return, we have $\|q_0 - q^*\|_\infty \le 2G_{\max}$. Therefore, $\|Bq_0 - q^*\|_\infty = \|Bq_0 - Bq^*\|_\infty \le \gamma_c \|q_0 - q^*\|_\infty$, and so after $t$ iterations we have
$$\|q_t - q^*\|_\infty = \|Bq_{t-1} - Bq^*\|_\infty \le \gamma_c \|q_{t-1} - q^*\|_\infty \le \gamma_c^2 \|q_{t-2} - q^*\|_\infty \le \cdots \le \gamma_c^t \|q_0 - q^*\|_\infty \le 2\gamma_c^t G_{\max}.$$

Proposition 3 (Convergence rate of tabular value iteration under subgoal bootstrapping). The fixed point $q^*_\beta = B_\beta q^*_\beta$ exists and is unique. Further, for $q_0$, and the corresponding $w_0$, initialized such that $|q_0(s, a; w_0)| \le G_{\max}$, the value iteration update with subgoal bootstrapping $q_t = B_\beta q_{t-1}$ for $t = 1, 2, \ldots$ satisfies
$$\|q_t - q^*_\beta\|_\infty \le (1 - \beta)^t \gamma_c^t \, \frac{r_{\max} + \beta G_{\max}}{1 - (1 - \beta)\gamma_c}.$$

Proof. First we show that $B_\beta$ is a $\gamma_c(1 - \beta)$-contraction. Assume we are given any two vectors $q_1, q_2$. Notice that $\gamma(s) \le \gamma_c$, because for our problem setting it is either equal to $\gamma_c$ or equal to zero at termination.
Then for any $(s, a)$,
$$|(B_\beta q_1)(s, a) - (B_\beta q_2)(s, a)| = \Big|(1 - \beta) \sum_{s'} P(s'|s, a)\,\gamma(s')\big[\max_{a' \in \mathcal{A}} q_1(s', a') - \max_{a' \in \mathcal{A}} q_2(s', a')\big]\Big|$$
$$\le (1 - \beta)\gamma_c \sum_{s'} P(s'|s, a)\,\big|\max_{a' \in \mathcal{A}} q_1(s', a') - \max_{a' \in \mathcal{A}} q_2(s', a')\big| \le (1 - \beta)\gamma_c \sum_{s'} P(s'|s, a)\,\|q_1 - q_2\|_\infty = (1 - \beta)\gamma_c \|q_1 - q_2\|_\infty.$$
Since this is true for any $(s, a)$, it is true for the maximum, giving $\|B_\beta q_1 - B_\beta q_2\|_\infty \le (1 - \beta)\gamma_c \|q_1 - q_2\|_\infty$. Because the operator is a contraction, since $(1 - \beta)\gamma_c < 1$, we know by the Banach fixed-point theorem that the fixed point exists and is unique.

Now we can also use the contraction property for the convergence rate. Notice first that we can consider $\bar{r}(s, a) \overset{\text{def}}{=} r(s, a) + \beta r_{\text{sub}}(s, a)$ as the new reward, with maximum value $r_{\max} + \beta G_{\max}$. Further, the new discount is $(1 - \beta)\gamma_c$. Consequently, the maximal return is $\frac{r_{\max} + \beta G_{\max}}{1 - (1 - \beta)\gamma_c}$, and
$$\|q_t - q^*_\beta\|_\infty = \|B_\beta q_{t-1} - B_\beta q^*_\beta\|_\infty \le (1 - \beta)\gamma_c \|q_{t-1} - q^*_\beta\|_\infty \le \cdots \le (1 - \beta)^t \gamma_c^t \|q_0 - q^*_\beta\|_\infty \le (1 - \beta)^t \gamma_c^t \, \frac{r_{\max} + \beta G_{\max}}{1 - (1 - \beta)\gamma_c}.$$
This rate is dominated by $((1 - \beta)\gamma_c)^t$, and for $\beta$ near 1 gives a much faster convergence rate than $\beta = 0$. We can determine after how many iterations this term overcomes the increase in the upper bound on the return. In other words, we want to know how big $t$ needs to be to get
$$(1 - \beta)^t \gamma_c^t \, \frac{r_{\max} + \beta G_{\max}}{1 - (1 - \beta)\gamma_c} \le \gamma_c^t G_{\max}.$$

The modification to the update is simple: we simply do not update $r_\gamma(s, g), \Gamma(s, g)$ in states $s$ where $d(s, g) = 0$ (see footnote 2). For the action-value variant, we do not update state-action pairs $(s, a)$ where $d(s, g) = 0$; the model will only ever be queried at $(s, a)$ with $d(s, g) = 1$ and $\pi_g(s) = a$.

Learning the relevance model $d$. We assume in this work that we are simply given $d(s, g)$, but we can at least consider ways that it could be learned. One approach is to attempt to learn $\Gamma$ for each $g$, to determine which states are pertinent. Those with $\Gamma(s, g)$ closer to zero can have $d(s, g) = 0$.
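The $(1 - \beta)\gamma_c$-contraction of $B_\beta$ established in this proof can also be checked numerically. The sketch below is our own (the random finite MDP and all names are illustrative): it applies the operator to two arbitrary action-value arrays and compares sup-norm distances.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 5, 3
gamma_c, beta = 0.9, 0.4

# Random transition kernel P(s'|s,a), reward r(s,a), and subgoal bonus r_sub(s,a)
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.standard_normal((nS, nA))
r_sub = rng.standard_normal((nS, nA))

def B_beta(q):
    """(B_beta q)(s,a) = r + beta*r_sub + (1-beta) * sum_s' P gamma_c max_a' q(s',a')."""
    v_next = q.max(axis=1)                                # max_a' q(s', a')
    return r + beta * r_sub + (1 - beta) * gamma_c * (P @ v_next)

q1 = rng.standard_normal((nS, nA))
q2 = rng.standard_normal((nS, nA))
lhs = np.abs(B_beta(q1) - B_beta(q2)).max()               # ||B q1 - B q2||_inf
rhs = (1 - beta) * gamma_c * np.abs(q1 - q2).max()        # contraction bound
```

Here $\gamma(s')$ is taken constant at $\gamma_c$ (no termination), which is the worst case for the bound; `lhs` never exceeds `rhs`.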
In fact, such an approach was taken for discovering options (Khetarpal et al., 2020), where both the options and such a relevance function are learned jointly. For us, they could also be learned jointly: a larger set of goals starts with d(s, g) = 1, and if Γ(s, g) remains small, these can be switched to d(s, g) = 0 and will stop being updated in the model learning.

Learning the subgoal-to-subgoal models. Finally, we need to extract the subgoal-to-subgoal models r̃_γ, Γ̃ from r_γ, Γ. The strategy involves updating towards the state-to-subgoal models whenever a state corresponds to a subgoal. In other words, for a given s, if m(s, g) = 1, then for a given g' (or iterating through all of them), we can update r̃_γ using (r_γ(s, g') - r̃_γ(g, g'))∇r̃_γ(g, g') and update Γ̃ using (Γ(s, g') - Γ̃(g, g'))∇Γ̃(g, g'). Note that these updates are not guaranteed to weight the states where m(s, g) = 1 uniformly. Instead, the implicit weighting is based on how s is sampled, such as through which states are visited and stored in the replay buffer. We do not attempt to correct this skew; as mentioned in the main body, we presume that this bias is minimal. An important next step is to better understand whether this lack of reweighting causes convergence issues, and how to modify the algorithm to account for a potentially changing state visitation.

C.2 A GENERAL ALGORITHM FOR LEARNING OPTION POLICIES

Finally, we need to learn the option policies π_g. In the simplest case, it is enough to learn the π_g that makes r_γ(s, g) maximal for every relevant s (i.e., d(s, g) > 0). We can learn the action-value variant r_γ(s, a, g) using a Q-learning update, and set π_g(s) = argmax_{a∈A} r_γ(s, a, g), where we overload the definition of r_γ. We can then extract r_γ(s, g) = max_{a∈A} r_γ(s, a, g), to use in all the above updates and in planning. In our PinBall experiments, this strategy is sufficient for learning π_g. More generally, however, this approach may be ineffective, because maximizing environment reward may be at odds with reaching the subgoal in a reasonable number of steps (or at all). For example, in environments where the reward is always positive, maximizing environment reward might encourage the option policy not to terminate (see footnote 3). However, we do want π_g to reach g, while also obtaining the best return along the way to g. For example, if there is a lava pit along the way to a goal, even if going through the lava pit is the shortest path, we want the learned option to get to the goal by going around the lava pit. We therefore want to be reward-respecting, as introduced for reward-respecting subtasks (Sutton et al., 2022), but also to ensure termination. We can consider a spectrum of option policies, ranging from the policy that reaches the goal as fast as possible to one that focuses on environment reward.
Algorithm 6: ModelDDQNUpdate(s, a, s', r, γ)
  Add new transition (s, a, s', r, γ) to buffer D_model
  for g ∈ Ḡ do
    for n_model mini-batches do
      Sample batch B_model = {(s, a, r, s', γ)} from D_model
      γ_g ← γ(1 - m(s', g))
      // Update option policy
      a' ← argmax_{a'∈A} q(s', a', g; θ^π)
      δ_π(s, a, s', r, γ) ← (1/2)(r - 1) + γ_g q(s', a', g; θ^π_targ) - q(s, a, g; θ^π)
      θ^π ← θ^π - α_π ∇_{θ^π} (1/|B_model|) Σ_{(s,a,r,s',γ)∈B_model} δ_π(s, a, s', r, γ)²
      θ^π_targ ← ρ_model θ^π + (1 - ρ_model) θ^π_targ
      // Update reward model and discount model
      δ_r(s, a, r, s', γ) ← r + γ_g r_γ(s', a', g; θ^r_targ) - r_γ(s, a, g; θ^r)
      δ_Γ(s, a, r, s', γ) ← m(s', g)γ + γ_g Γ(s', a', g; θ^Γ_targ) - Γ(s, a, g; θ^Γ)
      θ^r ← θ^r - α_r ∇_{θ^r} (1/|B_model|) Σ_{B_model} δ_r²
      θ^Γ ← θ^Γ - α_Γ ∇_{θ^Γ} (1/|B_model|) Σ_{B_model} δ_Γ²
      θ^r_targ ← ρ_model θ^r + (1 - ρ_model) θ^r_targ
      θ^Γ_targ ← ρ_model θ^Γ + (1 - ρ_model) θ^Γ_targ
  // Update goal-to-goal models using state-to-goal models: same as in the prior pseudocode.

C.5 OPTIMIZATIONS FOR GSP USING FIXED MODELS

It is possible to reduce the computational cost of GSP when learning with a fixed model. When the subgoal models are fixed, v_sub for an experience sample does not change over time, because all components used to calculate v_sub are fixed. This means the agent can calculate v_sub when it first receives the experience sample, save it in the buffer, and reuse the same v_sub whenever this sample is used for updating the main policy. When doing so, v_sub only needs to be calculated once per sample experienced, instead of with every update. This is beneficial when training neural networks, where each sample is often used multiple times to update network weights. An additional optimization, on top of caching v_sub in the replay buffer, is to batch the calculation of v_sub for multiple samples together, which can be more efficient than calculating v_sub for a single sample every step. To do this, we create an intermediate buffer that stores up to some number of samples. When the agent experiences a transition, it adds the sample to this intermediate buffer rather than the main buffer. When this buffer is full, the agent calculates v_sub for all samples in this buffer at once and adds the samples, alongside their v_sub values, to the main buffer. The intermediate buffer is then emptied and filled again over subsequent steps. We set the maximum size of the intermediate buffer to 1024 in our experiments.
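Both optimizations above can be sketched in a few lines. This is an illustrative sketch only, assuming a hypothetical `v_sub_fn` that batch-evaluates the projection for frozen models; the class and all names are our own, not the paper's implementation:

```python
import numpy as np

class CachedVsubBuffer:
    """Sketch: cache v_sub per transition when the subgoal models are frozen.

    v_sub_fn is any callable mapping a batch of states (batch x state_dim)
    to a batch of v_sub values; here it stands in for the fixed projection
    max_g r_gamma(s,g) + Gamma(s,g) * v(g).
    """
    def __init__(self, v_sub_fn, stage_size=1024):
        self.v_sub_fn = v_sub_fn
        self.stage_size = stage_size
        self.stage = []      # transitions awaiting one batched v_sub computation
        self.buffer = []     # (transition, v_sub) pairs ready for policy updates

    def add(self, transition):
        # transition = (s, a, r, s2, gamma); s is a state vector
        self.stage.append(transition)
        if len(self.stage) >= self.stage_size:
            self.flush()

    def flush(self):
        if not self.stage:
            return
        states = np.stack([t[0] for t in self.stage])
        v_subs = self.v_sub_fn(states)          # computed once, in one batched call
        self.buffer.extend(zip(self.stage, v_subs))
        self.stage = []
```

Main-policy updates then read cached (transition, v_sub) pairs from `buffer`, so v_sub is never recomputed when a sample is replayed.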

D CONNECTIONS TO UVFAS AND GOAL-CONDITIONED RL

There is a large and growing literature on goal-conditioned RL (GCRL). This is a problem setting where the aim is to learn a policy π(a|s, g) that can be (zero-shot) conditioned on different possible goals. The agent learns for a given set of goals, with the assumption that at the start of each episode the goal state is explicitly given to the agent. After this training phase, the policy should generalize to previously unseen goals. Naturally, this idea has particularly been applied to navigation, having the agent learn to navigate to different states (goals) in the environment. Many GCRL approaches leverage UVFAs (Schaul et al., 2015). This setting bears a strong resemblance to what we do in this work, but is notably different. Our models can be seen as goal-conditioned models, part of the solution, for planning in the general RL setting. GCRL, on the other hand, is a problem setting. Many approaches do not consider planning, but instead focus on effectively learning the goal-conditioned value functions or policies. There is further work, however, that uses landmark states and planning for GCRL. In addition to the goal given for GCRL, the landmark states can be treated as interim subgoals, with UVFA models learned for these as well (Huang et al., 2019). Planning is done between landmarks, using graph-based search. The policy is set to reach the nearest goal (using action-values with cost-to-goal rewards of -1 per step), with learned distance functions between states and goals and between goals. These models are like our reward and discount models, but tailored to navigation and distances. The idea of learning models that immediately apply to new subtasks, using successor features, is like GCRL but goes beyond navigation. The option keyboard involves encoding options (or policies) as vectors that describe the corresponding (pseudo) reward (Barreto et al., 2019). This work has been expanded more recently, using successor features (Barreto et al., 2020).
New policies can then be easily obtained for new reward functions, by linearly combining the (basis) vectors for the already learned options. No planning is involved in this work, beyond a one-step decision-time choice amongst options.

E ADDITIONAL DETAILS ON LEARNING SUBGOAL MODELS

This section describes implementation details for learning subgoal models in the PinBall environment, and the errors observed in the learned models. To ensure that we provide sufficiently varied data to learn the model accurately, when learning the subgoal models the agent is randomly initialized at a valid state in the environment, run for 20 steps with a random policy, then randomly reset again. To ensure that the agent gets sufficient experience near goal states, we initialize the agent, with probability 0.01, at states where m(s, g) = 1 for some g, with added jitter sampled from U(-0.01, 0.01) for each feature. The model is trained for 300k steps in this data-gathering regime. We restrict model updates to relevant states in our experiments. Because the only relevant experience for learning r_γ and Γ are samples where d(s, g) > 0, we maintain a separate buffer for each subgoal g for learning r_γ(s, g) and Γ(s, g), such that all experience within that buffer is relevant. We require 10k samples in the buffer of each subgoal before learning for the corresponding r_γ and Γ begins, so that mini-batches are always drawn from a sufficiently diverse set of samples. Similarly, a sample is only relevant for updating Γ̃ and r̃_γ if m(s, g) > 0 for some g, but this might not be true for samples stored in the buffers for learning Γ and r_γ. To obtain batches where all samples are relevant for learning Γ̃ and r̃_γ, the agent uses another buffer that exclusively stores samples where m(s, g) > 0. We mentioned in Appendix C.1 that we take the simple approach of restricting model updates to states where d(s, g) = 1. However, this means an update could bootstrap off inaccurate estimates when learning from a sample (s, a, r, s') if d(s, g) > 0 but d(s', g) = 0. In PinBall, this occurs when the agent starts within the relevance area for a subgoal but taking an action moves the agent outside of it.
We attempt to alleviate this issue in practice by changing the estimation target for those state-action pairs to the minimum possible target in the environment. Because we co-learn the option policy with r_γ(s, a, g), we set this minimum value to r_min/(1 - γ). If the network can learn this target well, then the learned option policy will not leave the relevance area. We also address the issue that, for some fixed d, not all states where d(s, g) > 0 may be able to reach the subgoal. This can negatively affect the quality of v_sub, as our algorithm assumes that goal g is reachable from state s via the option policy whenever d(s, g) > 0. While this source of error did not seem to affect GSP in our experiments, it might be important in other environments, so we describe the modification to address this problem here. From these states, the agent should not consider these subgoals when doing background planning (g' is not reachable from g despite d(g, g') = 1) or projection (g' is not reachable from s despite d(s, g') = 1). We check for these states by seeing if the learned Γ(s, g) is near 0, which indicates that it is either very difficult or impossible to reach g from s. For states with Γ(s, g) very near 0, we can set d(s, g) = 0 for the purposes of background planning and projection, but not for learning Γ(s, g), as it might be initialized to a low value. In our experiments, we set this threshold to 0.
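The thresholding just described can be sketched as a small helper (the function name and array layout are our own, for illustration). Note that the pruned relevance is used only for planning and projection, while model learning keeps the original d:

```python
import numpy as np

def prune_relevance(Gamma_hat, d, threshold=0.0):
    """Sketch: drop subgoals whose learned discounted reaching probability is ~0.

    Gamma_hat: (n_states x n_goals) learned Gamma(s, g) estimates.
    d:         boolean relevance matrix used for planning and projection.
    Returns a copy of d with d[s, g] = False wherever Gamma_hat[s, g] <= threshold,
    leaving the original d (used for model learning) untouched.
    """
    d_plan = d.copy()
    d_plan[Gamma_hat <= threshold] = False
    return d_plan
```

With `threshold=0.0` this matches the setting reported in the experiments; a larger threshold would prune subgoals that are merely difficult to reach.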


Figure 11: A heatmap of the absolute error of Γ and r_γ for two different subgoal models, evaluated at various (x, y) locations. While the absolute error in states near subgoals can be quite low, it increases substantially as the state gets further away. White indicates that d(s, g) = 0.

E.1 ERROR OF LEARNED SUBGOAL MODELS

To better understand the accuracy of our learned subgoal models, we performed roll-outs of the learned option policy at different (x, y) locations (with 0 velocity) across the environment and compared the true r_γ and Γ with the estimated values. Figure 11 shows a heatmap of the absolute error of the model compared to the ground truth, with the mapping of colors on the right. The learned models tend to be more accurate closer to the goal, and less accurate further away. The absolute error of Γ can be as low as 0.01 close to the goal, but increases to 0.2 and higher further away. Similarly, the absolute error for r_γ can be below 10 near goals, but can increase to over 100 further away. The magnitudes of these errors are not unreasonable, but they are also not very near zero. This result is encouraging in that inaccuracies in the model did not prevent useful planning.

F ADDITIONAL EXPERIMENT DETAILS

This section provides additional details for the PinBall environment, the various hyperparameters used for DDQN and GSP, and the hyperparameter sweeps performed. The experiments described in the main body, along with the hyperparameter sweeps, used approximately 10.7 CPU-days on an Apple M1 chip. The PinBall configuration that we used is based on the "slightly harder configuration" found at http://irl.cs.brown.edu/pinball/. The Python implementation of PinBall was taken from https://github.com/amarack/python-rl, which was released under the GPL-3.0 license. We modified the environment to support additional features such as changing terminations and visualizing subgoals, and to apply various bug fixes.

Network Architecture. We used neural networks for learning the main policy, Γ, and r_γ. For experiment 1, we used a neural network with hidden layers [256, 256, 128, 128, 64, 64] for the main policy and [256, 256, 128, 128, 64, 64, 32, 32] for Γ and r_γ. For experiment 2, we used a neural network with hidden layers [128, 128, 64, 64] for the main policy and [128, 128, 128, 128, 64, 64] for Γ and r_γ. We used the ReLU activation function for each layer aside from the output layer. The networks' bias weights were initialized to 0.001 and the other weights were initialized using He uniform initialization (He et al., 2015). Each network outputs a vector of length 5, one entry per action.

F.1 EXPERIMENT HYPERPARAMETERS

For both experiments, we used the Adam optimizer for training both the main policy and the subgoal models. We used the default hyperparameters for Adam except the step-size (b1 = 0.9, b2 = 0.999, ε = 1e-8). The main policy was trained with 4 mini-batches per step with a batch size of 16, while the subgoal models were trained with 1 mini-batch per step with the same batch size. We used an ε-greedy exploration strategy, with ε fixed to 0.1 in our experiments. For experiment 1, γ = 0.99, α_π = α_r = α_Γ = 5e-4, and ρ_model = 0.4. For experiment 2, γ = 0.95, α_π = α_r = α_Γ = 1e-3, and ρ_model = 0.1. We selected the learning rate for Adam and the Polyak averaging rate ρ for updating the main policy in each experiment using the methodology described in the section below.

F.2 HYPERPARAMETER SWEEP METHODOLOGY

For experiment 1, we swept the baseline DDQN algorithm over the Polyak averaging rate ρ ∈ {0.0125, 0.025, 0.05, 0.1} and the learning rate in {1e-3, 5e-4, 3e-4, 1e-4} across 4 seeds. We found that ρ = 0.025 and a learning rate of 5e-4 had the highest average reward rate in our sweep, and used them when running both DDQN and GSP across seeds in the experiment. For experiment 2, we swept DDQN over the Polyak averaging rate ρ ∈ {1.0, 0.8, 0.4, 0.2, 0.1, 0.05, 0.025} and the learning rate in {1e-2, 5e-3, 1e-3, 5e-4} for 8 seeds. We found that ρ = 0.05 and a learning rate of 1e-3 had the highest average reward rate out of all configurations swept, and used these hyperparameters for all DDQN runs in the experiment. For GSP, we used ρ = 0.8 and a learning rate of 1e-3.

G COMPARING GSP TO OTHER DYNA ALTERNATIVES

In this section, we compare GSP against other basic background planning algorithms. Namely, we compare against DDQNx2, a DDQN agent that is given double the computational budget per step compared to our baseline algorithm, and Dyna with options (Dyno), a natural alternative that uses option models for background planning. As mentioned in Section 4, DDQN can be viewed as a background planning algorithm when the replay buffer is viewed as a non-parametric model. Providing DDQN with double the number of mini-batch updates attempts to answer the question of what would happen if GSP's background-planning resources were instead dedicated to additional one-step updates. Note that in this experiment, our DDQNx2 implementation took 50% additional wall-clock time to run compared to our GSP implementation. Dyna with options (Dyno) is a basic algorithm that incorporates option models into Dyna, such that the agent learns both action values and option values, Q : S × (A ∪ O) → ℝ. Dyno's behaviour policy then includes both actions and options. If an option π_j is selected when taking a greedy action according to Q, then the first action given by π_j is executed. The model in Dyna needs to include option models, which allow the agent to reason about accumulated rewards under an option, and outcome states after executing an option. Otherwise, the framework is identical to Dyna. It is a simple, elegant extension of Dyna that allows for planning with temporal abstraction. However, this approach has several limitations. One limitation is that as we include new options (more abstraction), our value function needs to reason over more actions. Our proposed algorithm, GSP, allows the agent to obtain the benefits of abstraction without modifying the form of the policy. Another limitation is that the model in Dyna is the standard state-to-state model.
Though Dyna with options has, somewhat surprisingly, not been extended to function approximation, the natural extension would suffer from similar problems of model error and the use of expectation models as standard Dyna. We compare DDQNx2, Dyno, and GSP in the simple PinBall environment. For Dyno, we use the same subgoal-conditioned models pre-trained for GSP as option models, and set the predicted next state of each option to (x_g, y_g, 0, 0). We found Dyna with options difficult to get working. Instead, we used a modified version that only plans over options. This avoids learning and using primitive-action models. We see in Figure 12 that this modified variant actually outperformed DDQN initially, but leveled off at a suboptimal level of performance and overall learned slower than GSP.



Footnote 1: The first input is any g ∈ G; the second is g' ∈ Ḡ, which includes s_terminal. We need to reason about reaching any subgoal or s_terminal, but s_terminal is not a real state: we do not reason about starting from it to reach subgoals.

In this simplified example, we can plan efficiently by updating the value at the end in s_n, and then updating states backwards from the end. But, without knowing this structure, that is not a general-purpose strategy. For general MDPs, we would need smart ways to do search control: the approach for picking states for one-step updates. In fact, we can leverage search-control strategies to improve the goal-space planning step. Then we get the benefit of these approaches, as well as the benefit of planning over a much smaller state space.

Footnote 2: More generally, we might consider using emphatic weightings (Sutton et al., 2016) that allow us to incorporate such interest weightings d(s, g), without suffering from bootstrapping off of inaccurate values in states where d(s, g) = 0. Incorporating this algorithm would likely benefit the whole system, but we keep things simpler for now and stick with a typical TD update.

Footnote 3: It is not always the case that positive rewards result in option policies that do not terminate. Even if all rewards are positive, if γ_c < 1 and there is a larger positive reward at the subgoal than in other nearby states, then the return is higher when reaching this subgoal sooner, since that reward is discounted fewer steps. The outcome is less nuanced for negative rewards: if the rewards are always negative, then the option policy will terminate, trying to find the path with the best (but still negative) return.



Figure 1: Goal-Space Planning in the Pinball environment (see Section 4.1). The agent begins with a set of subgoals (denoted in teal) and learns a set of subgoal-conditioned models. (Abstraction) Using these models, the agent forms an abstract goal-space MDP where the states are subgoals with options to reach each subgoal as actions. (Planning) The agent plans in this abstract MDP to quickly learn the values of these subgoals. (Projection) Using learned subgoal values, the agent obtains approximate values of states based on nearby subgoals and their values. These quickly updated approximate values are then used to speed up learning.

Figure 2: Original and Abstract Space.

3.4 PUTTING IT ALL TOGETHER: THE FULL GOAL-SPACE PLANNING ALGORITHM

Figure 5: (left) The harder PinBall environment used in our first experiment. The dark gray shapes are obstacles the ball bounces off of, the small blue circle is the starting position of the ball (with no velocity), and the red dot is the goal (termination). Solid circles indicate the location and radius of the subgoals (m), with the wider initiation sets visualized for two subgoals (pink and teal). (right) Performance in this environment for GSP with β = 0.1, DDQN, and approximate LAVI, with standard error shown. Even just increasing to β = 0.1 allows GSP to leverage the longer-horizon estimates given by the subgoal values, making it learn much faster than DDQN. Approximate LAVI is able to learn quickly, but levels off at suboptimal performance, as expected.

Figure 6: (left) Visualizing the action-values for DDQN and GSP (β = 0.1) at various points in training. (right) v_sub obtained from using the learned subgoal values in the projection step.

4.3 ACCURACY OF THE LEARNED MODELS

Figure 8: The impact on planning performance of using frozen models with differing accuracy (shading shows standard error).

Figure 9: (left) The Non-stationary PinBall environment. For the first half of the experiment, the agent terminates at goal A, while for the second half, the agent terminates at goal B. (right) The performance of GSP (β = 0.1) and DDQN in the environment. The mean of all 30 runs is shown as the dashed line. The 25th and 75th percentile runs for each algorithm are also highlighted. We see that GSP with the exploration bonus was able to adapt more quickly when the terminal goal switched, compared to the baseline DDQN algorithm, where goal values are not used.

Figure 10: Comparing one-step backup with Goal-Space Planning when subgoals are concrete states. GSP first focuses planning over a smaller set of subgoals (in red), then updates the values of individual states.

We can use the exact same strategy to show convergence of value iteration under our subgoal-value bootstrapping update. Let $r_{\text{sub}}(s, a) \overset{\text{def}}{=} \sum_{s'} P(s'|s, a)\, v_{\text{sub}}(s')$, assuming $v_{\text{sub}} : \mathcal{S} \to [-G_{\max}, G_{\max}]$ is a given, fixed function. Then the modified Bellman optimality operator is
$$(B_\beta q)(s, a) \overset{\text{def}}{=} r(s, a) + \beta r_{\text{sub}}(s, a) + (1 - \beta) \sum_{s'} P(s'|s, a)\,\gamma(s') \max_{a' \in \mathcal{A}} q(s', a'). \quad (10)$$



Rearranging terms, we get that this holds for $t \ge \log\Big(\frac{r_{\max} + \beta G_{\max}}{G_{\max}(1 - (1 - \beta)\gamma_c)}\Big) \Big/ \log\frac{1}{1 - \beta}$. For example, if $r_{\max} = 1$, $\gamma_c = 0.99$ and $\beta = 0.5$, then we have that $t > 1.56$. If $r_{\max} = 10$, $\gamma_c = 0.99$ and $\beta = 0.5$, then we get that $t \ge 5$. If $r_{\max} = 1$, $\gamma_c = 0.99$ and $\beta = 0.1$, then we get that $t \ge 22$.

C LEARNING THE SUBGOAL MODELS AND CORRESPONDING OPTION POLICIES

Now we need a way to learn the models r_γ(s, g) and Γ(s, g). These can both be represented as General Value Functions (GVFs) (Sutton et al., 2011), and we leverage this form to learn them with standard reinforcement learning algorithms. We start by assuming that we are given π_g, and discuss learning it after discussing how to learn these models.

C.1 MODEL LEARNING

The data is generated off-policy: according to some behavior b rather than π_g. We can either use importance sampling, or we can learn action-value variants of these models to avoid importance sampling. We describe both options here, but in our experiments we use the action-value variant, since it avoids importance sampling and the need to have the distribution over actions under the behavior b.

Model Update using Importance Sampling. We can update r_γ(·, g) with an importance-sampled temporal difference (TD) learning update $\rho_t \delta_t \nabla r_\gamma(S_t, g)$, where $\rho_t = \frac{\pi_g(A_t|S_t)}{b(A_t|S_t)}$ and $\delta_t = R_{t+1} + \gamma_{g,t+1}\, r_\gamma(S_{t+1}, g) - r_\gamma(S_t, g)$, with $\gamma_{g,t+1} \overset{\text{def}}{=} \gamma_{t+1}(1 - m(S_{t+1}, g))$. The discount model Γ(s, g) can be learned similarly, because it is also a GVF, with cumulant $m(S_{t+1}, g)\gamma_{t+1}$ and discount $\gamma_{g,t+1}$. The TD update is $\rho_t \delta^\Gamma_t \nabla \Gamma(S_t, g)$, where $\delta^\Gamma_t = m(S_{t+1}, g)\gamma_{t+1} + \gamma_{g,t+1}\,\Gamma(S_{t+1}, g) - \Gamma(S_t, g)$. All of the above updates can be done using any off-policy GVF algorithm, including those that clip IS ratios and gradient-based methods, and can include replay.

Model Update without Importance Sampling. Overloading notation, let us define the action-value variants r_γ(s, a, g) and Γ(s, a, g). We get updates similar to those above, now with $\delta_t = R_{t+1} + \gamma_{g,t+1}\, r_\gamma(S_{t+1}, \pi_g(S_{t+1}), g) - r_\gamma(S_t, A_t, g)$ and update $\delta_t \nabla r_\gamma(S_t, A_t, g)$. For Γ we have $\delta^\Gamma_t = m(S_{t+1}, g)\gamma_{t+1} + \gamma_{g,t+1}\,\Gamma(S_{t+1}, \pi_g(S_{t+1}), g) - \Gamma(S_t, A_t, g)$. We then define $r_\gamma(s, g) \overset{\text{def}}{=} r_\gamma(s, \pi_g(s), g)$ and $\Gamma(s, g) \overset{\text{def}}{=} \Gamma(s, \pi_g(s), g)$ as deterministic functions of these learned functions.

Restricting the Model Update to Relevant States. Recall, however, that we need only query these models where d(s, g) > 0. We can focus our function approximation resources on those states. This idea has previously been introduced with an interest weighting for GVFs (Sutton et al., 2016), with connections made between interest and initiation sets (White, 2017). For a large state space with many subgoals, goal-space planning significantly expands the models that need to be learned, especially if we learn one model per subgoal. Even if we learn a model that generalizes across subgoal vectors, we are requiring that model to know a lot: values from all states to all subgoals.
It is likely such a model would be hard to learn, and constraining what we learn about with d(s, g) is likely key for practical performance.

Under review as a conference paper at ICLR 2023

We can specify a new reward for learning the option: R̄_{t+1} = cR_{t+1} + (1 - c)(-1). When c = 0, we have a cost-to-goal problem, where the learned option policy should find the shortest path to the goal, regardless of reward along the way. When c = 1, the option policy focuses on environment reward, but may not terminate in g. We can start by learning the option policy that takes the shortest path with c = 0, and the corresponding r_γ(s, g), Γ(s, g). The constant c can then be increased until π_g stops going to the goal, or until the discounted probability Γ(s, g) drops below a specified threshold. Even without a well-specified c, the values under the option policy can still be informative. For example, they might indicate that it is difficult or dangerous to attempt to reach a goal. For this work, we propose a simple default, fixing c = 0.5. Adaptive approaches, such as the idea described above, are left to future work. The resulting algorithm to learn π_g involves learning a separate value function for these rewards. We can learn action-values (or a parameterized policy) using the above reward, for example with the Q-learning update to action-values q. Then we can set π_g to be the greedy policy, π_g(s) = argmax_{a∈A} q(s, a, g).
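The action-value model updates from Appendix C.1, with the greedy option policy, can be sketched tabularly as follows. The function name, array shapes, and the toy usage are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def model_td_update(r_gamma, Gamma, s, a, r, s2, gamma, g, m, alpha=0.1):
    """Sketch of the action-value GVF updates for one subgoal g (tabular, no IS).

    r_gamma, Gamma: arrays of shape (n_states, n_actions, n_goals).
    m(s, g) -> {0, 1} indicates membership in subgoal g.
    The option policy is taken greedy in r_gamma, as in the text.
    """
    gamma_g = gamma * (1 - m(s2, g))             # the GVF terminates at the subgoal
    a2 = int(np.argmax(r_gamma[s2, :, g]))       # pi_g(s') = argmax_a r_gamma(s', a, g)
    delta_r = r + gamma_g * r_gamma[s2, a2, g] - r_gamma[s, a, g]
    delta_G = m(s2, g) * gamma + gamma_g * Gamma[s2, a2, g] - Gamma[s, a, g]
    r_gamma[s, a, g] += alpha * delta_r          # reward-model TD update
    Gamma[s, a, g] += alpha * delta_G            # discount-model TD update
```

Because the bootstrap action is chosen by the greedy option policy rather than the behavior, no importance-sampling ratio is needed, matching the variant used in the experiments.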

C.3 PSEUDOCODE PUTTING IT ALL TOGETHER

We summarize the above updates in pseudocode, specifying explicit parameters and how they are updated. The algorithm is summarized in Algorithm 1, with a diagram in Figure 4. An online update is used for the action-values for the main policy, without replay. All background computation is used for model learning with a replay buffer and for planning with those models. The pseudocode assumes a small set of subgoals, and is for episodic problems. We provide extensions to other settings in Appendix C.4, including using a Double DQN update for the policy update. We also discuss in-depth differences from existing related ideas, including landmark states, UVFAs, and goal-conditioned RL, in Appendix D. Note that we overload the definitions of the subgoal models. We learn action-value variants r_γ(s, a, g; θ^r), with parameters θ^r, to avoid importance sampling corrections. We learn the option policy using action-values q(s, a, g; θ^π) with parameters θ^π, and so query the policy using π_g(s; θ^π) def= argmax_{a∈A} q(s, a, g; θ^π). The policy π_g is not directly learned, but rather defined by q. Similarly, we do not directly learn r_γ(s, g); instead, it is defined by r_γ(s, a, g; θ^r). Specifically, for model parameters θ = (θ^r, θ^Γ, θ^π), we set r_γ(s, g; θ) def= r_γ(s, π_g(s; θ^π), g; θ^r) and Γ(s, g; θ) def= Γ(s, π_g(s; θ^π), g; θ^Γ). We query these derived functions in the pseudocode. Finally, we assume access to a given set of subgoals, but there have been several natural ideas already proposed for option discovery that apply nicely in our more constrained setting. One idea is to use subgoals that are often visited by the agent (Stolle & Precup, 2002). Such a simple idea is likely a reasonable starting point for a GSP algorithm that learns everything from scratch, including subgoals.
Other approaches have used bottleneck states (McGovern & Barto, 2001).

Algorithm 1 Goal-Space Planning for Episodic Problems
  Assume given subgoals Ḡ and relevance function d
  Initialize table ṽ ∈ R^|Ḡ|, main policy weights w, model parameters θ = (θ r , θ Γ , θ π ), θ̃ = (θ̃ r , θ̃ Γ )
  Sample initial state s 0 from the environment
  for t ∈ 0, 1, 2, ... do
    Take action a t using q (e.g., ε-greedy), observe s t+1 , r t+1 , γ t+1
    ModelUpdate(s t , a t , s t+1 , r t+1 , γ t+1 )
    Planning()
    MainPolicyUpdate(s t , a t , s t+1 , r t+1 , γ t+1 )

Algorithm 2 MainPolicyUpdate(s, a, s', r, γ)
  v sub ← max g∈Ḡ:d(s',g)>0 [ r γ (s', g; θ) + Γ(s', g; θ) ṽ(g) ]
  δ ← r + γβ v sub + γ(1 - β) max a' q(s', a'; w) - q(s, a; w)
  w ← w + α δ ∇ w q(s, a; w)

Algorithm 3 Planning()
  for n iterations, for each g ∈ Ḡ do
    ṽ(g) ← max g'∈Ḡ:d(g,g')>0 [ r̃ γ (g, g'; θ̃ r ) + Γ̃(g, g'; θ̃ Γ ) ṽ(g') ]

Algorithm 4 ModelUpdate(s, a, s', r, γ)
  Add new transition (s, a, s', r, γ) to buffer B
  for g' ∈ Ḡ, for multiple transitions (s, a, r, s', γ) sampled from B do
    γ g' ← γ(1 - m(s', g'))
    // Update option policy
    δ π ← (1/2)(r - 1) + γ g' max a'∈A q(s', a', g'; θ π ) - q(s, a, g'; θ π )
    θ π ← θ π + α π δ π ∇ θπ q(s, a, g'; θ π )
    // Update goal-to-goal models using state-to-goal models
    for each g such that m(s, g) > 0 do
      θ̃ r ← θ̃ r + α r (r γ (s, g'; θ) - r̃ γ (g, g'; θ̃ r )) ∇ r̃ γ (g, g'; θ̃ r )
      θ̃ Γ ← θ̃ Γ + α Γ (Γ(s, g'; θ) - Γ̃(g, g'; θ̃ Γ )) ∇ Γ̃(g, g'; θ̃ Γ )
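Because planning operates only over the small set of subgoals, Algorithm 3 is cheap to run. A minimal tabular sketch (array names and shapes are illustrative assumptions, not the paper's implementation) is value iteration over the subgoal table using only the goal-to-goal models:

```python
import numpy as np

def planning(v, r_g2g, gamma_g2g, d, n_iterations=10):
    """Value iteration over subgoal values, as in Algorithm 3.
    v: one value per subgoal (list or array of length |G|).
    r_g2g, gamma_g2g: goal-to-goal reward and discount models.
    d: relevance function; d[g, g2] > 0 marks g2 as reachable from g."""
    n_goals = len(v)
    for _ in range(n_iterations):
        for g in range(n_goals):
            # Bellman-style backup restricted to relevant subgoals
            candidates = [r_g2g[g, g2] + gamma_g2g[g, g2] * v[g2]
                          for g2 in range(n_goals) if d[g, g2] > 0]
            if candidates:
                v[g] = max(candidates)
    return v
```

Because |Ḡ| is small, a handful of sweeps typically suffices for the subgoal values to converge.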

C.4 EXTENDING GSP TO DEEP RL

It is simple to extend the above pseudocode for the main policy update and the option policy update to use Double DQN (Van Hasselt et al., 2016) updates with neural networks. The changes from the above pseudocode are 1) the use of a target network to stabilize learning with neural networks, 2) using Polyak averaging to interpolate between the target network and the main network's weights, 3) changing the one-step bootstrap target to the DDQN equivalent, 4) adding a replay buffer for learning the main policy, and 5) changing the update from using a single sample to using a batch update. Because the number of subgoals is discrete, the equations for learning θ̃ r and θ̃ Γ do not change. We summarize these changes for learning the main policy in Algorithm 5 and for learning subgoal models in Algorithm 6.

Algorithm 5 MainPolicyDDQNUpdate(s, a, s', r, γ)
  Add experience (s, a, s', r, γ) to replay buffer D main
  for n main mini-batches do
    Sample batch B main = {(s, a, r, s', γ)} from D main
    v sub (s') = max g∈Ḡ:d(s',g)>0 [ r γ (s', g; θ) + Γ(s', g; θ) ṽ(g) ]
    Y(r, s', γ) = r + γβ v sub (s') + γ(1 - β) q(s', argmax a' q(s', a'; w); w target )
    Update w on B main toward the targets Y; Polyak-average w target toward w

Dyna learned slower than GSP and converged to a lower performance point than GSP. DDQNx2, despite requiring an additional 50% wall-clock time compared to GSP, learned at a slower rate but converged to the same optimal performance. We find that DDQNx2 performed better than DDQN, but was unable to outperform GSP despite requiring more wall-clock time, highlighting GSP's computational efficiency.
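The key change in Algorithm 5 is the bootstrap target, which interpolates between the subgoal value and the standard Double DQN bootstrap. A stand-alone sketch of that target (function and argument names are ours, for illustration only):

```python
import numpy as np

def ddqn_subgoal_target(r, gamma, q_next_online, q_next_target,
                        v_sub_next, beta):
    """Sketch of the target Y(r, s', gamma) from Algorithm 5.
    Interpolates, with weight beta, between the subgoal value v_sub(s')
    and the Double DQN bootstrap, which selects the action with the
    online network but evaluates it with the target network."""
    a_star = int(np.argmax(q_next_online))   # action selection: online net
    ddqn_bootstrap = q_next_target[a_star]   # action evaluation: target net
    return r + gamma * (beta * v_sub_next + (1.0 - beta) * ddqn_bootstrap)
```

Setting beta = 0 recovers the standard DDQN target, and beta = 1 bootstraps entirely from the subgoal values.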

H EXPERIMENTS WITH LUNAR LANDER

To see how GSP can be applied to other problems, we ran GSP in the Lunar Lander environment (Brockman et al., 2016). The environment specification follows OpenAI Gym's LunarLander-v2 environment. We provide the agent with 9 subgoals: one terminal subgoal where the agent lands safely on the landing pad, and the rest laid throughout the environment at different (x, y) locations in an arrow-like fashion. As the (x, y) coordinates are continuous, we take a similar approach of defining a small region around each coordinate for subgoal termination, and define a larger initiation area around each subgoal. We show the non-terminal subgoals in Figure 13. We compare GSP, DDQN, DDQNx4 (DDQN with 4x the number of planning steps), and approximate LAVI. We evaluated GSP with β = 0.01, as we found it to be the best performing β ∈ {0.001, 0.01, 0.05} in our experiments.

We see in Figure 13 that GSP outperforms DDQN, DDQNx4, and approximate LAVI. Overall, we found that subgoal-conditioned models are more difficult to learn in Lunar Lander: the learned reward and discount models had average absolute errors of around 5 and 0.1, respectively, measured over 200 Monte Carlo rollouts of the policy in the environment. This aligns with the poor performance of approximate LAVI and the lower value of β that was found to be good for GSP. Surprisingly, DDQNx4 performed worse than GSP and DDQN, despite performing 4 times the number of batch updates per step that DDQN performs. We hypothesize that the increased number of updates causes the agent to fit to a suboptimal solution based on insufficient data, slowing the rate of improvement.
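The termination and initiation regions described above can be sketched as simple distance thresholds around each subgoal's (x, y) coordinate. The radii below are hypothetical placeholders, not the values used in our experiments:

```python
import numpy as np

def make_region_fns(subgoal_xy, term_radius=0.1, init_radius=0.5):
    """Build membership and relevance functions for (x, y) subgoals.
    A small ball around each subgoal defines termination membership
    m(s, g); a larger ball defines the initiation/relevance function
    d(s, g). Both radii here are illustrative assumptions."""
    subgoal_xy = np.asarray(subgoal_xy, dtype=float)

    def m(s_xy, g):
        # 1.0 if the state's (x, y) lies in the small termination region of g
        return float(np.linalg.norm(np.asarray(s_xy) - subgoal_xy[g]) <= term_radius)

    def d(s_xy, g):
        # 1.0 if the state lies in the larger initiation area of g
        return float(np.linalg.norm(np.asarray(s_xy) - subgoal_xy[g]) <= init_radius)

    return m, d
```

Any state inside the termination ball terminates the option for that subgoal, while only subgoals whose initiation ball contains the state are considered relevant during planning and bootstrapping.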

I INVESTIGATING GSP WITH DIFFERENT BETA

In subgoal-value bootstrapping (Equation 4), the hyperparameter β controls the tradeoff between fully using the quickly updated but approximate subgoal values v sub (s) and the standard bootstrap target. We investigate the impact of β in the harder pinball environment shown in Figure 5. We ran GSP with β ∈ {0.0, 10^-3, 0.1, 0.5, 1.0}. Note that β = 0.0 is equivalent to DDQN, and β = 1.0 is equivalent to approximate LAVI. We see in Figure 14 that with β = 0.5 and β = 1.0, GSP gets similarly fast initial learning, but converges to a lower final performance. For β = 10^-3, very close to 0, performance is more like DDQN, but even for such a small β we see improvements.

