DIVIDE-AND-CONQUER MONTE CARLO TREE SEARCH

Abstract

Standard planners for sequential decision making (including Monte Carlo planning, tree search, dynamic programming, etc.) are constrained by an implicit sequential planning assumption: the order in which a plan is constructed is the same as the order in which it is executed. We consider alternatives to this assumption for the class of goal-directed Reinforcement Learning (RL) problems. Instead of an environment transition model, we assume an imperfect, goal-directed policy. This low-level policy can be improved by a plan, consisting of an appropriate sequence of sub-goals that guide it from the start to the goal state. We propose a planning algorithm, Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS), that approximates the optimal plan by proposing intermediate sub-goals which hierarchically partition the initial task into simpler ones that are then solved independently and recursively. The algorithm critically makes use of a learned sub-goal proposal to find appropriate partition trees of new tasks based on prior experience. Different strategies for learning sub-goal proposals give rise to different planning strategies that strictly generalize sequential planning. We show that this algorithmic flexibility with respect to planning order leads to improved results in navigation tasks in grid-worlds as well as in challenging continuous control environments.

1. INTRODUCTION

This is the first sentence of this paper, but it was not the first one we wrote. In fact, the entire introduction was one of the last sections to be added to this manuscript. The discrepancy between the order of inception of ideas and the order of their presentation probably does not come as a surprise to the reader. Nonetheless, it serves as a point of reflection that is central to the rest of this work, and that can be summarized as: "the order in which we construct a plan does not have to coincide with the order in which we execute it". Most standard planners for sequential decision making problems, including Monte Carlo planning, Monte Carlo Tree Search (MCTS) and dynamic programming, have a baked-in sequential planning assumption (Bertsekas et al., 1995; Browne et al., 2012). These methods begin at either the initial or the final state and then plan actions sequentially forwards or backwards in time. However, this sequential approach faces two main challenges. (i) The transition model used for planning needs to be reliable over long horizons, which is often difficult to achieve when it has to be inferred from data. (ii) Credit assignment to each individual action is difficult: In a planning problem spanning a horizon of 100 steps, to assign credit to the first action we have to compute the optimal cost-to-go for the remaining problem with a horizon of 99 steps, which is only slightly easier than solving the original problem. To overcome these two fundamental challenges, we here consider alternatives to the basic assumptions of sequential planners. We focus on goal-directed decision making problems where an agent should reach a goal state from a start state. Instead of a transition and reward model of the environment, we assume a given goal-directed policy (the "low-level" policy) and an associated value oracle that returns its success probability on any given task. In general, a low-level policy will not be optimal, e.g.
it might be too "myopic" to reliably reach goal states that are far away from its current state. We now seek to improve the low-level policy via a suitable sequence of sub-goals that guide it from the start to the final goal, thus maximizing the overall task success probability. This formulation of planning as finding good sub-goal sequences makes learning of explicit environment models unnecessary, as they are replaced by low-level policies and their value functions. The sub-goal planning problem can still be solved by a conventional sequential planner that begins by searching for the first sub-goal to reach from the start state, then plans the next sub-goal in the sequence, and so on. Indeed, this is the approach taken in most hierarchical RL settings based on options or sub-goals (e.g. Dayan & Hinton, 1993; Sutton et al., 1999; Vezhnevets et al., 2017). However, the credit assignment problem mentioned above persists, as assessing whether the first sub-goal is useful still requires evaluating the success probability of the remaining plan. Instead, it can be substantially easier to reason about the utility of a sub-goal "in the middle" of the plan, as this breaks the long-horizon problem into two sub-problems with much shorter horizons: how to get to the sub-goal, and how to get from there to the final goal. Based on this intuition, we propose the Divide-and-Conquer MCTS (DC-MCTS) planner, which searches for sub-goals that split the original task into two independent sub-tasks of comparable complexity and then recursively solves these, thereby drastically facilitating credit assignment. To search the space of intermediate sub-goals efficiently, DC-MCTS uses a heuristic for proposing promising sub-goals that is learned from previous search results and agent experience. Humans can plan efficiently over long horizons to solve complex tasks, such as theorem proving or navigation, and some plans even span decades (e.g.
economic measures): In these situations, planning sequentially in terms of next steps (such as which arm to move, or which phone call to make) would cover only a tiny proportion of the horizon, neglecting the long-run uncertainty beyond the last planned step. The algorithm put forward in this paper is a step towards efficient planners that tackle long horizons by recursively splitting them, in parallel, into ever smaller sub-problems. In Section 2, we formulate planning in terms of sub-goals instead of primitive actions. In Section 3, as our main contribution, we propose the novel Divide-and-Conquer Monte Carlo Tree Search algorithm for this planning problem. In Section 4 we position DC-MCTS within the literature of related work. In Section 5, we show that it outperforms sequential planners both on grid-world and continuous control navigation tasks, demonstrating the utility of constructing plans in a flexible order that can differ from their execution order.

2. IMPROVING GOAL-DIRECTED POLICIES WITH PLANNING

Let S and A be finite sets of states and actions. We consider a multi-task setting, where for each episode the agent has to solve a new task consisting of a new Markov Decision Process (MDP) M over S and A. Each M has a single start state s_0 and a special absorbing state s_∞, also termed the goal state. If the agent transitions into s_∞ at any time it receives a reward of 1 and the episode terminates; otherwise the reward is 0. We assume that the agent observes the start and goal states (s_0, s_∞) at the beginning of each episode, as well as an encoding vector c_M ∈ R^d. This vector provides the agent with additional information about the MDP M of the current episode and will be key to transfer learning across tasks in the multi-task setting. A stochastic, goal-directed policy π is a mapping from S × S × R^d into distributions over A, where π(a | s, s_∞, c_M) denotes the probability of taking action a in state s in order to get to goal s_∞. For a fixed goal s_∞, we can interpret π as a regular policy, here denoted as π_{s_∞}, mapping states to action probabilities. We denote the value of π in state s for goal s_∞ as v^π(s, s_∞ | c_M); we assume no discounting, i.e. γ = 1. Under the above definition of the reward, the value is equal to the success probability of π on the task, i.e. the absorption probability of the stochastic process starting in s_0 defined by running π_{s_∞}: v^π(s_0, s_∞ | c_M) = P(s_∞ ∈ τ_{s_0}^{π_{s_∞}} | c_M), where τ_{s_0}^{π_{s_∞}} is the trajectory generated by running π_{s_∞} from state s_0. To keep the notation compact, we will omit the explicit dependence on c_M and abbreviate tasks with pairs of states in S × S.

2.1. PLANNING OVER SUB-GOAL SEQUENCES

Assume a given goal-directed policy π, which we also refer to as the low-level policy. If π is not already optimal, we can potentially improve it by planning: If π has a low probability of directly reaching s_∞ from the initial state s_0, i.e. v^π(s_0, s_∞) ≈ 0, we will try to find a plan consisting of a sequence of intermediate sub-goals that guide π from the start s_0 to the goal state s_∞. Concretely, let S* = ∪_{n=0}^∞ S^n be the set of sequences over S, and let |σ| be the length of a sequence σ ∈ S*. For convenience we define S̄ := S ∪ {∅}, where ∅ is the empty sequence representing no sub-goal. We refer to σ as a plan for task (s_0, s_∞) if σ_1 = s_0 and σ_{|σ|} = s_∞, i.e. if the first and last elements of σ are equal to s_0 and s_∞, respectively. The set of plans for this task is denoted by s_0 S* s_∞. To execute a plan σ, we construct a policy π_σ by conditioning the low-level policy π on each of the sub-goals in order: Starting with n = 1, we feed sub-goal σ_{n+1} to π, i.e. we run π_{σ_{n+1}}; once σ_{n+1} is reached, we execute π_{σ_{n+2}}, and so on. We now wish to do open-loop planning, i.e. find the plan with the highest success probability P(s_∞ ∈ τ_{s_0}^{π_σ}) of reaching s_∞. However, this success probability depends on the transition kernels of the underlying MDPs, which might not be known. We can instead define planning as maximizing the following lower bound on the success probability, which can be expressed in terms of the low-level value v^π. Proposition 1 (Lower bound of success probability). The success probability of a plan σ is bounded from below by P(s_∞ ∈ τ_{s_0}^{π_σ}) ≥ L(σ) := ∏_{i=1}^{|σ|-1} v^π(σ_i, σ_{i+1}), i.e. the product of the success probabilities of π on the sub-tasks defined by (σ_i, σ_{i+1}). The straightforward proof is given in Appendix A.1.
Intuitively, L(σ) is a lower bound on the success probability of π_σ, as it neglects the probability of "accidentally" (due to stochasticity of the policy or transitions) running into the goal s_∞ before having executed the full plan. We summarize: Definition 1 (Open-Loop Goal-Directed Planning). Given a goal-directed policy π and its corresponding value oracle v^π, we define planning as maximizing L(σ) over σ ∈ s_0 S* s_∞, i.e. over the set of plans for task (s_0, s_∞). We define the high-level (HL) value v*(s_0, s_∞) := max_σ L(σ) as the maximum value of the planning objective. Note the difference between the low-level value v^π and the high-level value v*: v^π(s, s′) is the probability of the agent directly reaching s′ from s by following π, whereas v*(s, s′) is the probability of reaching s′ from s under the optimal plan, which likely includes intermediate sub-goals. In particular, v* ≥ v^π.
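
The planning objective L(σ) above is simple to evaluate given the value oracle. As a minimal sketch (the function and argument names are our own, not from the paper):

```python
import math

def plan_lower_bound(plan, v_low):
    """Lower bound L(sigma) on a plan's success probability: the product
    of the low-level success probabilities v_pi on all consecutive
    sub-tasks (sigma_i, sigma_{i+1}) of the plan."""
    return math.prod(v_low(a, b) for a, b in zip(plan, plan[1:]))

# Example with a toy value oracle on a 3-state chain:
v = {(0, 1): 0.9, (1, 2): 0.8}
bound = plan_lower_bound([0, 1, 2], lambda a, b: v[(a, b)])  # 0.9 * 0.8
```

Maximizing this quantity over plans σ ∈ s_0 S* s_∞ is exactly the planning problem of Definition 1.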

2.2. AND/OR SEARCH TREE REPRESENTATION

In the following we cast the planning problem into a representation amenable to efficient search. To this end, we use the natural compositionality of plans: We can concatenate a plan σ for the task (s, s′) and a plan σ′ for the task (s′, s′′) into a plan σ ∘ σ′ for the task (s, s′′). Conversely, we can decompose any given plan σ for task (s_0, s_∞) by splitting it at any sub-goal s ∈ σ into σ = σ_l ∘ σ_r, where σ_l is the "left" sub-plan for task (s_0, s), and σ_r is the "right" sub-plan for task (s, s_∞). Trivially, the planning objective and the optimal high-level value factorize with respect to this decomposition:

L(σ_l ∘ σ_r) = L(σ_l) · L(σ_r),    v*(s_0, s_∞) = max_{s ∈ S̄} v*(s_0, s) · v*(s, s_∞).

This allows us to recursively reformulate planning as

argmax_{s ∈ S̄} [ argmax_{σ_l ∈ s_0 S* s} L(σ_l) ] ∘ [ argmax_{σ_r ∈ s S* s_∞} L(σ_r) ],    (1)

i.e. jointly choosing a sub-goal s and optimal plans for the two resulting sub-tasks. The above equations are the Bellman equations and the Bellman optimality equations for the classical single-pair shortest path problem in graphs, where edge weights are given by −log v^π(s, s′). We can represent this planning problem by an AND/OR search tree (Nilsson, N. J., 1980) with alternating levels of OR and AND nodes. An OR node, also termed an action node, is labeled by a task (s, s′) ∈ S × S; the root of the search tree is an OR node labeled by the original task (s_0, s_∞). A terminal OR node (s, s′) has a value v^π(s, s′) attached to it, which reflects the success probability of π_{s′} for completing the sub-task (s, s′). Each non-terminal OR node has |S| + 1 AND nodes as children. Each of these is labeled by a triple (s, s′′, s′) for s′′ ∈ S̄, corresponding to inserting the sub-goal s′′ into the overall plan, or not inserting one in case of s′′ = ∅. Every AND node (s, s′′, s′), or conjunction node, has two OR children, the "left" sub-task (s, s′′) and the "right" sub-task (s′′, s′). In this representation, plans are induced by solution trees.
A solution tree T_σ is a sub-tree of the complete AND/OR search tree with the properties that (i) the root (s_0, s_∞) ∈ T_σ, (ii) each OR node in T_σ has at most one child in T_σ, and (iii) each AND node in T_σ has two children in T_σ. The plan σ and its objective L(σ) can be computed from T_σ by a depth-first traversal of T_σ. The correspondence of sub-trees to plans is many-to-one, as T_σ, in addition to the plan itself, contains the order in which the plan was constructed. Figure 6 in Section 5.3 shows an example of a search and solution tree. Below we discuss how to construct a favourable search order heuristic.

3. BEST-FIRST AND/OR PLANNING

[Algorithm 1 (TRAVERSE) appears here; only its final lines survived extraction: 12: G ← G_left · G_right; 13: G ← max(G, v^π(s, s′)) (threshold the return); 14: // UPDATE; 15: V(s, s′) ← (V(s, s′) · N(s, s′) + G) / (N(s, s′) + 1); 16: N(s, s′) ← N(s, s′) + 1; 17: return G.]

The planning problem from Definition 1 can be solved exactly by formulating it as a shortest path problem from s_0 to s_∞ on a fully connected graph with vertex set S and non-negative edge weights given by −log v^π, and applying a classical Single Source or All Pairs Shortest Path (SSSP / APSP) planner. This approach is appropriate if one wants to solve all goal-directed tasks in a single MDP. Here, however, we focus on the multi-task setting described above, where the agent is given a new MDP with a single task (s_0, s_∞) every episode. In this case, solving the SSSP / APSP problem is not feasible: Tabulating all edge weights −log v^π(s, s′) would require |S|² evaluations of v^π(s, s′) for all pairs (s, s′). In practice, approximate evaluations of v^π could be implemented e.g. by actually running the policy π, or by calls to a powerful function approximator, both of which are often too costly to evaluate exhaustively for large state spaces S. Instead, we tailor an algorithm for approximate planning to the multi-task setting, which we call Divide-and-Conquer MCTS (DC-MCTS).
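
For contrast with the approximate search developed next, the exhaustive baseline just described can be sketched as Floyd-Warshall on edge weights −log v^π. The following toy implementation is our own illustration (the function names and the numerical clipping constant are our choices), not the paper's code:

```python
import math

def optimal_high_level_values(states, v_low):
    """Exhaustive computation of v*(s, s') for all pairs, cast as an
    all-pairs shortest path problem with edge weights -log v_pi.
    Requires |S|^2 oracle calls up front -- exactly the cost that
    DC-MCTS is designed to avoid in the multi-task setting."""
    # Clip values away from 0 so that -log is finite.
    d = {s: {t: -math.log(max(v_low(s, t), 1e-12)) for t in states}
         for s in states}
    for k in states:                     # Floyd-Warshall relaxation:
        for s in states:                 # try every state k as a sub-goal
            for t in states:
                if d[s][k] + d[k][t] < d[s][t]:
                    d[s][t] = d[s][k] + d[k][t]
    # Convert shortest-path distances back to success probabilities.
    return {(s, t): math.exp(-d[s][t]) for s in states for t in states}
```

On a toy problem where the direct value v^π(0, 2) = 0.1 but v^π(0, 1) = v^π(1, 2) = 0.9, the computed v*(0, 2) = 0.81: inserting sub-goal 1 improves on acting directly.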
To evaluate v^π as sparsely as possible, DC-MCTS critically makes use of two learned search heuristics that transfer knowledge from previously encountered MDPs / tasks to new problem instances: (i) a distribution p(s′′ | s, s′), called the policy prior, for proposing promising intermediate sub-goals s′′ for a task (s, s′); and (ii) a learned approximation v of the high-level value v* for bootstrap evaluation of partial plans. In the following we present DC-MCTS and discuss design choices and training for the two search heuristics.

3.1. DIVIDE-AND-CONQUER MONTE CARLO TREE SEARCH

The input to the DC-MCTS planner is an MDP encoding c_M, a task (s_0, s_∞), and a planning budget, i.e. a maximum number B ∈ N of v^π oracle evaluations. At each stage, DC-MCTS maintains a (partial) AND/OR search tree T whose root is the OR node (s_0, s_∞) corresponding to the original task. Every OR node (s, s′) ∈ T maintains an estimate V(s, s′) ≈ v*(s, s′) of its high-level value. DC-MCTS searches for a plan by iteratively constructing the search tree T with TRAVERSE until the budget is exhausted, see Algorithm 1. During each traversal, if a leaf node of T is reached, it is expanded, followed by a recursive bottom-up backup to update the value estimates V of all OR nodes visited in this traversal. After this search phase, the currently best plan is extracted from T by EXTRACTPLAN (essentially a depth-first traversal, see Algorithm 2 in the Appendix). In the following we briefly describe the main methods of the search. We illustrate DC-MCTS in Figure 1.

TRAVERSE and SELECT: T is traversed from the root (s_0, s_∞) to find a promising node to expand. At an OR node (s, s′), SELECT chooses one of its children s′′ ∈ S̄ to traverse into, including s′′ = ∅ for not inserting any further sub-goals into this branch. We implemented SELECT by the pUCT (Rosin, 2011) rule, which picks the next node s′′ ∈ S̄ by maximizing the following score:

V(s, s′′) · V(s′′, s′) + c · p(s′′ | s, s′) · √N(s, s′) / (1 + N(s, s′′, s′)),

where N(s, s′) and N(s, s′′, s′) are the visit counts of the OR node (s, s′) and the AND node (s, s′′, s′), respectively. The first term is the exploitation component, guiding the search to sub-goals that currently look promising, i.e. have high estimated value. The second term is the exploration term favoring nodes with low visit counts. Crucially, it is explicitly scaled by the policy prior p(s′′ | s, s′) to guide exploration. At an AND node (s, s′′, s′), TRAVERSE descends into both the left child (s, s′′) and the right child (s′′, s′).
As the two sub-problems are solved independently, computation from there on can be carried out in parallel. All nodes visited in a single traversal form a solution tree T_σ with plan σ.

EXPAND: If a leaf OR node (s, s′) is reached during the traversal and its depth is smaller than a given maximum depth, it is expanded by evaluating the high- and low-level values v(s, s′) and v^π(s, s′). The initial value of the node is defined as the maximum of the two, since by definition v* ≥ v^π, i.e. further planning should only increase the success probability on a sub-task. We also evaluate the policy prior p(s′′ | s, s′) for all s′′, yielding the proposal distribution over sub-goals used in SELECT. Each node expansion costs one unit of budget B.
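
The pUCT selection rule used by SELECT can be sketched as follows. This is an illustrative reading of the score, in which the parent visit count N(s, s′) is approximated by the sum of its children's counts, and the container format is our own:

```python
import math

def puct_select(children, c=1.0):
    """Pick the sub-goal s'' maximizing the pUCT score
      V(s,s'') * V(s'',s') + c * p(s''|s,s') * sqrt(N_parent) / (1 + N_child).
    `children` maps each candidate s'' to a tuple
    (v_left, v_right, prior_p, n_child)."""
    n_parent = sum(n for (_, _, _, n) in children.values())
    def score(sub):
        v_l, v_r, p, n = children[sub]
        exploit = v_l * v_r                      # value of splitting at s''
        explore = c * p * math.sqrt(n_parent) / (1 + n)
        return exploit + explore
    return max(children, key=score)
```

With the exploration constant c = 0 the rule greedily picks the highest-value split; with c > 0 an unvisited child with non-negligible prior can win even if its current value product is low.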

BACKUP and UPDATE

We define the return G_σ of the traversal tree T_σ as follows. Let a refinement T⁺_σ of T_σ be a solution tree such that T_σ ⊆ T⁺_σ, representing a plan σ⁺ that contains all sub-goals of σ plus, possibly, additional inserted sub-goals. G_σ is defined as the value of the objective L(σ⁺) of the optimal refinement of T_σ, i.e. it reflects how well one could do on task (s_0, s_∞) by starting from the plan σ and refining it. It can be computed by a simple backup on the tree T_σ that uses the bootstrap value v ≈ v* at the leaves. As v*(s_0, s_∞) ≥ G_σ ≥ L(σ), with G_{σ*} = v*(s_0, s_∞) for the optimal plan σ*, we can use G_σ to update the value estimate V. As in other MCTS variants, we employ a running average operation (lines 15-16 in TRAVERSE).
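
The recursive computation of G_σ can be sketched as follows. The dictionary-based tree representation is our own choice; the AND level is kept implicit, with a split node holding its two OR sub-task children directly:

```python
def backup(node, v_low, v_boot):
    """Return G for a node of a (partial) solution tree.
    A leaf task (s, s') is scored by max(v(s, s'), v_pi(s, s')), i.e. the
    bootstrap estimate thresholded by acting directly with the low-level
    policy; a split node is scored by the product of its children's
    returns, again thresholded from below by the direct low-level value."""
    s, goal = node["task"]
    if "children" not in node:                       # leaf OR node
        return max(v_boot(s, goal), v_low(s, goal))
    left, right = node["children"]                   # two OR sub-tasks
    g = backup(left, v_low, v_boot) * backup(right, v_low, v_boot)
    return max(g, v_low(s, goal))                    # threshold the return
```

For a tree splitting task (0, 2) at sub-goal 1, with v^π = 0.9 on each half and 0.1 on the direct task, the backup returns 0.9 · 0.9 = 0.81.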

3.2. DESIGNING AND TRAINING SEARCH HEURISTICS

Search results and experience from previous tasks can improve DC-MCTS on new problems via adaptation of the search heuristics, i.e. the policy prior p and the approximate value function v, as follows.

Bootstrap Value Function: We parametrize v(s, s′ | c_M) ≈ v*(s, s′ | c_M) as a neural network that takes as inputs the current task (s, s′) and the MDP encoding c_M. A straightforward approach to training v is to regress it towards the non-parametric value estimates V computed by DC-MCTS on previous problem instances. However, initial results indicated that this leads to v being overly optimistic, an observation also made in Kaelbling (1993). We therefore used more conservative training targets, computed by backing the low-level values v^π up the solution tree T_σ of the plan σ returned by DC-MCTS. Details can be found in Appendix B.1.

Policy Prior: Best-first search guided by a policy prior p can be understood as policy improvement of p, as described in Silver et al. (2016). Therefore, a straightforward way of training p is to distill the search results back into the policy prior, e.g. by behavioral cloning. When applying this to DC-MCTS in our setting, we found empirically that it yielded very slow improvement when starting from an untrained, uniform prior p. This is due to plans with non-zero success probability L > 0 being very sparse in S*, equivalent to the sparse-reward setting in regular MDPs. To address this issue, we propose to apply Hindsight Experience Replay (HER; Andrychowicz et al., 2017): Instead of training p exclusively on search results, we additionally execute plans σ in the environment and collect the resulting trajectories, i.e. the sequences of visited states τ_{s_0}^{π_σ} = (s_0, s_1, ..., s_T). HER then proceeds with hindsight relabeling, i.e. treating τ_{s_0}^{π_σ} as an approximately optimal plan for the "fictional" task (s_0, s_T), which is likely different from the actual task (s_0, s_∞).
In standard HER, these fictitious expert demonstrations are used for imitation learning of goal-directed policies, thereby circumventing the sparse-reward problem. We can apply HER to train p in our setting by extracting any ordered triplet (s_{t1}, s_{t2}, s_{t3}) from τ_{s_0}^{π_σ} and using it as a supervised learning target for p. This is a sensible procedure, as p then learns to predict optimal sub-goals s_{t2} for sub-tasks (s_{t1}, s_{t3}) under the assumption that the data was generated by an oracle producing optimal plans, i.e. τ_{s_0}^{π_σ} = σ*. We have considerable freedom in choosing which triplets to extract from the data and use as supervision with HER. In our experiments we use a temporally balanced parsing, which creates triplets (s_t, s_{t+Δ/2}, s_{t+Δ}), such that the resulting policy prior preferentially proposes sub-goals "in the middle" of a task. In Appendix A.4 we discuss this aspect in more detail and present alternative parsers.
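
One way to realize such a temporally balanced parsing is to split each trajectory recursively at its midpoint, yielding halfway sub-goal targets at every scale. This is an illustrative sketch of one such scheme, not the exact parser of Appendix A.4:

```python
def balanced_triplets(trajectory):
    """Temporally balanced parsing of a trajectory (s_0, ..., s_T):
    emit triplets (s_t, s_{t+d//2}, s_{t+d}) whose middle element is an
    (approximately) halfway sub-goal for the task (s_t, s_{t+d}),
    recursing on both halves to cover all temporal scales."""
    out = []
    def recurse(lo, hi):
        if hi - lo < 2:          # no interior state left to propose
            return
        mid = (lo + hi) // 2
        out.append((trajectory[lo], trajectory[mid], trajectory[hi]))
        recurse(lo, mid)
        recurse(mid, hi)
    recurse(0, len(trajectory) - 1)
    return out
```

Each emitted triplet (s_{t1}, s_{t2}, s_{t3}) is then used as a supervised target: p should propose s_{t2} as a sub-goal for the relabeled task (s_{t1}, s_{t3}).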

3.3. ALGORITHMIC COMPLEXITY OF DC-MCTS

Denoting an optimal plan by σ*, the complexity of DC-MCTS with an optimal search policy prior p = p* is O(|σ*| · |S|). This could potentially be reduced to O(log |σ*|) when using progressive widening (Coulom, 2007; Chaslot et al.) for fewer evaluations of p, together with perfect parallelization of tree traversals across multiple workers; for details see Appendix A.5.

4. RELATED WORK

Goal-directed multi-task learning is an important special case of general RL and has been studied extensively. Universal value functions (Schaul et al., 2015) have been established as a compact representation for this setting (Kulkarni et al., 2016; Andrychowicz et al., 2017; Ghosh et al., 2018; Dhiman et al., 2018). This allows sub-goals to be used as a means for planning, as done in several works such as Kaelbling & Lozano-Pérez (2017). Other recent work (2019) proposes a top-down policy gradient approach that learns to predict sub-goals in a hierarchical way. Nasiriany et al. (2019) propose gradient-based search jointly over a fixed number of sub-goals for continuous goal spaces. In contrast, DC-MCTS is able to dynamically determine the complexity of the optimal plan. The proposed DC-MCTS planner is an MCTS (Browne et al., 2012) variant, inspired by recent advances in best-first or guided search, such as AlphaZero (Silver et al., 2018). It can also be understood as a heuristic, guided version of the classic Floyd-Warshall algorithm, which exhaustively computes all shortest paths. In the special case of planar graphs, small sub-goal sets, also known as vertex separators, can be constructed that favourably partition the remaining graph, leading to linear-time APSP algorithms (Henzinger et al., 1997). The heuristic sub-goal proposer p that guides DC-MCTS can be loosely understood as a probabilistic version of a vertex separator. Nowak-Vila et al. (2016) also consider neural networks that mimic divide-and-conquer algorithms, similar to the sub-goal proposals used here. However, while we perform policy improvement for the proposals using search and HER, the networks in Nowak-Vila et al. (2016) are trained purely by policy gradient methods. Decomposing tasks into sub-problems has been formalized as pseudo trees (Freuder & Quinn, 1985) and AND/OR graphs (Nilsson, N. J., 1980).
The latter have been used especially in the context of optimization (Larrosa et al., 2002; Jégou & Terrioux, 2003; Dechter & Mateescu, 2004; Marinescu & Dechter, 2004). Our approach is related to work on using AND/OR trees for sub-goal ordering in the context of logic inference (Ledeniov & Markovitch, 1998). While DC-MCTS is closely related to the AO* algorithm (Nilsson, N. J., 1980), the generalization of heuristic A* search to AND/OR search graphs, interesting differences exist: AO* assumes a fixed search heuristic, which is required to be a lower bound on the cost-to-go. In contrast, we employ learned value functions and policy priors that are not required to be exact bounds. Relaxing this assumption, thereby violating the principle of "optimism in the face of uncertainty", necessitates explicit exploration incentives in the SELECT method. Alternatives for searching AND/OR spaces include proof-number search, recently applied to chemical synthesis planning (Kishimoto et al., 2019). Very recent work concurrent to ours has focused on related research directions: Wang et al. (2020) introduce LA-MCTS as a 'meta-algorithm' for black-box optimization, and Chen et al. (2020) propose Retro*, a neural A*-like algorithm for molecule synthesis that is also based on AND/OR trees.

5. EXPERIMENTS

We evaluate the proposed DC-MCTS algorithm on navigation in grid-world mazes as well as on a challenging continuous control version of the same problem, comparing it to standard sequential MCTS (in sub-goal space) based on the fraction of mazes solved by executing the resulting plans. The MCTS baseline was implemented by restricting the DC-MCTS algorithm to only expand the "right" sub-problem in line 10 of Algorithm 1; the value G_left for the "left" sub-problem is computed as in line 7, i.e. using the low-level value v^π. This forces MCTS to plan forward and sequentially, as each next step needs to be reachable from the previous state, as evaluated by v^π. All remaining parameters and design choices were the same for both planners, except where explicitly mentioned otherwise.

5.1. GRID-WORLD MAZES

Each task consists of a new, procedurally generated maze on a 21 × 21 grid with start and goal locations (s_0, s_∞) ∈ {1, ..., 21}², see Figure 2. Task difficulty was controlled by the density of walls d (under a connectedness constraint), where the easiest setting d = 0.0 corresponds to no walls and the most difficult one, d = 1.0, yields so-called perfect or singly-connected mazes. The task embedding c_M was given as the maze layout and (s_0, s_∞), encoded together as a feature map of 21 × 21 categorical variables with 4 categories each (empty, wall, start and goal location). The underlying MDPs have 5 primitive actions: up, down, left, right and NOOP. For the sake of simplicity, we first tested our proposed approach by hard-coding a low-level policy π_0 as well as its value oracle v^{π_0} in the following way: If in state s and conditioned on a goal s′ that is adjacent to s, π_0 successfully reaches s′ with probability 1 in one step, i.e. v^{π_0}(s, s′) = 1; otherwise v^{π_0}(s, s′) = 0. If π_0 is nevertheless executed towards a non-adjacent goal, the agent moves to a random empty tile adjacent to s. Therefore, π_0 is the "most myopic" goal-directed policy that can still navigate everywhere. For each maze, MCTS and DC-MCTS were given a search budget of 200 calls to the low-level value oracle v^{π_0}. We implemented the search heuristics, i.e. the policy prior p and the high-level value function v, as convolutional neural networks (CNNs) operating on the input c_M; details of the network architectures are given in Appendix B.3. With untrained networks, both planners were unable to solve the task (< 2% success probability), as shown in Figure 3. This illustrates that a search budget of 200 evaluations of v^{π_0} is insufficient for unguided planners to find a feasible path in most mazes. This is consistent with standard exhaustive SSSP / APSP graph planners requiring up to 21⁴ > 10⁵ ≫ 200 evaluations for optimal planning in the worst case on these tasks.
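
The hard-coded oracle v^{π_0} is straightforward to reproduce. A sketch under our own assumptions (the maze is represented as a set of wall coordinates, and the behaviour for s = s′ is our choice, not specified in the paper):

```python
def v_low_myopic(walls, s, goal):
    """Value oracle of the 'most myopic' goal-directed policy pi_0:
    success probability 1 if `goal` is an adjacent empty cell (or the
    current cell itself, by our convention), else 0.  `walls` is a set
    of wall cells; states are (row, col) tuples on the grid."""
    if s == goal:
        return 1.0
    # Manhattan distance 1 <=> the goal is a neighboring cell.
    adjacent = abs(s[0] - goal[0]) + abs(s[1] - goal[1]) == 1
    return 1.0 if adjacent and goal not in walls else 0.0
```

Under this oracle, any plan with positive L(σ) must list every step of a feasible path, which is what makes an informed sub-goal proposal essential within a 200-call budget.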
Next, we trained both search heuristics v and p as detailed in Section 3.2. In particular, the sub-goal proposal p was also trained on hindsight-relabeled experience data, where for DC-MCTS we used the temporally balanced parser and for MCTS the corresponding left-first parser (see Appendix A.4). Training the heuristics greatly improved the performance of both planners. Figure 3 shows learning curves for mazes with wall density d = 0.75, as mean and standard deviation over 20 different hyperparameters. DC-MCTS exhibits substantially improved performance compared to MCTS, and when compared at equal performance levels, DC-MCTS requires 5 to 10 times fewer training episodes than MCTS. An example of a learned sub-goal proposal p for DC-MCTS is visualized in Figure 2 (further examples are given in Figure 8 in the Appendix). Probability mass concentrates on promising sub-goals that are far from both start and goal, approximately partitioning the task into equally hard sub-tasks.

5.2. CONTINUOUS CONTROL MAZES

Next, we investigated the performance of both MCTS and DC-MCTS in challenging continuous control environments with non-trivial low-level policies. We embedded the grid-world mazes into a physical 3D environment simulated by MuJoCo (Todorov et al., 2012), rendering each grid-world cell as a 4 m × 4 m cell in physical space. The agent is embodied by a quadruped "ant" body; for illustration see Figure 4. For the low-level policy π_m, we pre-trained a goal-directed neural network controller that receives as inputs proprioceptive features (e.g. joint angles and velocities) of the ant body as well as a 3D vector pointing from its current position to a target position. π_m was trained to navigate to targets randomly placed less than 1.5 m away in an open area (no walls), using MPO (Abdolmaleki et al., 2018). See Appendix B.4 for more details. If unobstructed, π_m can walk in a straight line towards its current goal.
However, this policy receives no visual input and thus can only avoid walls when guided by appropriate sub-goals. To establish an interface between the low-level policy π_m and the planners, we used another CNN to approximate the low-level value oracle v^{π_m}(s_0, s_∞ | c_M): It was trained to predict whether π_m will succeed in solving the navigation task ((s_0, s_∞), c_M). Its input is the corresponding discrete grid-world representation c_M of the maze (a 21 × 21 feature map of categoricals, as described above; details in the Appendix). Note that this setting is still challenging: In initial experiments we verified that a model-free baseline (also based on MPO, without HER), with access to state abstraction and the low-level controller, solved only about 10% of the mazes after 100 million episodes, due to the extremely sparse rewards. We applied MCTS and DC-MCTS to this problem to find symbolic plans consisting of sub-goals in {1, ..., 21}². The high-level heuristics p and v were trained for 65k episodes, exactly as in Section 5.1, except using v^{π_m} instead of v^{π_0}. We again observed that DC-MCTS outperforms the MCTS planner by a wide margin: Figure 5 shows the performance of both (with fully trained search heuristics) as a function of the search budget for the most difficult mazes with wall density d = 1.0. Performance of DC-MCTS with the MuJoCo low-level controller was comparable to that with the hard-coded low-level policy from the grid-world experiment (with the same wall density), showing that the abstraction of planning over low-level sub-goals successfully isolates high-level planning from low-level execution. We did not manage to successfully train the MCTS planner on MuJoCo navigation.
This was likely due to HER, which we found (in ablation studies) to be essential for training DC-MCTS in both settings and MCTS on the grid-world problem, but not appropriate for MCTS on MuJoCo navigation: Left-first parsing for HER consistently biased the MCTS search prior p to propose next sub-goals too close to the previous sub-goal. This led the MCTS planner to "micro-manage" the low-level policy, in particular in long corridors that π_m can traverse by itself. DC-MCTS, by recursively partitioning, found an appropriate length scale of sub-goals, leading to drastically improved performance.

5.3. VISUALIZING MCTS AND DC-MCTS

To further illustrate the difference between DC-MCTS and MCTS planning we can look at an example search tree from each method in Figure 6 . Light blue nodes are part of the final plan: note how in the case of DC-MCTS, the plan is distributed across a sub-tree within the search tree, while for the standard MCTS the plan is a chain. The first 'actionable' sub-goal, i.e. the first sub-goal for the low-level policy, is the left-most leaf in DC-MCTS and the first dark node from the root for MCTS. 

6. DISCUSSION

To enable guided, divide-and-conquer style planning, we made a few strong assumptions. Sub-goal based planning requires a universal value function oracle for the low-level policy, which will often have to be approximated from data. Overly optimistic approximations can be exploited by the planner, leading to "delusional" plans (Little & Thiébaux, 2007). Joint learning of the high- and low-level components can potentially mitigate this issue. In sub-goal planning, at least in its current naive implementation, the "action space" for the planner is the whole state space of the underlying MDP. Therefore, the search space will have a large branching factor in large state spaces. A solution to this problem likely lies in using learned state abstractions for sub-goal specification, which is a fundamental open research question. We also implicitly assumed that the skills afforded by the low-level policy are "universal", i.e. if there are states that it cannot reach, no amount of high-level search will lead to successful planning outcomes. In spite of these assumptions and open challenges, we showed that non-sequential sub-goal planning has fundamental advantages over the standard approach of search over primitive actions: (i) Abstraction and dynamic allocation: Sub-goals automatically support temporal abstraction as the high-level planner does not need to specify the exact time horizon required to achieve a sub-goal. Plans are generated from coarse to fine, and additional planning is dynamically allocated to those parts of the plan that require more compute. (ii) Closed & open-loop: The approach combines advantages of both open- and closed-loop planning: The closed-loop low-level policies can recover from failures or unexpected transitions in stochastic environments, while at the same time the high-level planner can avoid costly closed-loop planning.
(iii) Long horizon credit assignment: Sub-goal abstractions open up new algorithmic possibilities for planning -as exemplified by DC-MCTS -that can facilitate credit assignment and therefore reduce planning complexity. (iv) Parallelization: Like other divide-and-conquer algorithms, DC-MCTS lends itself to parallel execution by leveraging problem decomposition made explicit by the independence of the "left" and "right" sub-problems of an AND node. (v) Reuse of cached search: DC-MCTS is highly amenable to transposition tables, by caching and reusing values for sub-problems solved in other branches of the search tree. (vi) Generality: DC-MCTS is strictly more general than both forward and backward goal-directed planning, both of which can be seen as special cases.

A ADDITIONAL DETAILS FOR DC-MCTS

A.1 PROOF OF PROPOSITION 1

Proof. The performance of $\pi_\sigma$ on the task $(s_0, s_\infty)$ is defined as the probability that its trajectory $\tau^{\pi_\sigma}_{s_0}$ from initial state $s_0$ gets absorbed in the state $s_\infty$, i.e. $P(s_\infty \in \tau^{\pi_\sigma}_{s_0})$. We can bound the latter from below in the following way. Let $\sigma = (\sigma_0, \ldots, \sigma_m)$, with $\sigma_0 = s_0$ and $\sigma_m = s_\infty$. With $(\sigma_0, \ldots, \sigma_i) \subseteq \tau^{\pi_\sigma}_{s_0}$ we denote the event that $\pi_\sigma$ visits all states $\sigma_0, \ldots, \sigma_i$ in order:
$$P\big((\sigma_0, \ldots, \sigma_i) \subseteq \tau^{\pi_\sigma}_{s_0}\big) = P\Big(\bigwedge_{i'=1}^{i} (\sigma_{i'} \in \tau^{\pi_\sigma}_{s_0}) \wedge (t_{i'-1} < t_{i'})\Big),$$
where $t_{i'}$ is the arrival time of $\pi_\sigma$ at $\sigma_{i'}$, and we define $t_0 = 0$. Obviously, the event $(\sigma_0, \ldots, \sigma_m) \subseteq \tau^{\pi_\sigma}_{s_0}$ is a subset of the event $s_\infty \in \tau^{\pi_\sigma}_{s_0}$, and therefore
$$P\big((\sigma_0, \ldots, \sigma_m) \subseteq \tau^{\pi_\sigma}_{s_0}\big) \le P\big(s_\infty \in \tau^{\pi_\sigma}_{s_0}\big).$$
Using the chain rule of probability we can write the left-hand side as
$$P\big((\sigma_0, \ldots, \sigma_m) \subseteq \tau^{\pi_\sigma}_{s_0}\big) = \prod_{i=1}^{m} P\big((\sigma_i \in \tau^{\pi_\sigma}_{s_0}) \wedge (t_{i-1} < t_i) \,\big|\, (\sigma_0, \ldots, \sigma_{i-1}) \subseteq \tau^{\pi_\sigma}_{s_0}\big).$$
We now use the definition of $\pi_\sigma$: After reaching $\sigma_{i-1}$ and before reaching $\sigma_i$, $\pi_\sigma$ is defined by just executing $\pi_{\sigma_i}$ starting from the state $\sigma_{i-1}$:
$$P\big((\sigma_0, \ldots, \sigma_m) \subseteq \tau^{\pi_\sigma}_{s_0}\big) = \prod_{i=1}^{m} P\big(\sigma_i \in \tau^{\pi_{\sigma_i}}_{\sigma_{i-1}} \,\big|\, (\sigma_0, \ldots, \sigma_{i-1}) \subseteq \tau^{\pi_\sigma}_{s_0}\big).$$
We now make use of the fact that the $\sigma_i \in S$ are states of the underlying MDP that make the future independent of the past: Having reached $\sigma_{i-1}$ at $t_{i-1}$, all events from there on (e.g. reaching $\sigma_j$ for $j \ge i$) are independent of all events before $t_{i-1}$. We can therefore write:
$$P\big((\sigma_0, \ldots, \sigma_m) \subseteq \tau^{\pi_\sigma}_{s_0}\big) = \prod_{i=1}^{m} P\big(\sigma_i \in \tau^{\pi_{\sigma_i}}_{\sigma_{i-1}}\big) = \prod_{i=1}^{m} v^{\pi}(\sigma_{i-1}, \sigma_i).$$

Descending into a single node at a time can be useful if parallel computation is not an option, or if there are specific needs, e.g. as illustrated by the following three heuristics. These can be used to decide when to traverse into the left sub-problem (s, s') or the right sub-problem (s', s''). Note that both nodes have a corresponding current estimate for their value V, coming either from the bootstrap evaluation of v or further refined from previous traversals.
• Preferentially descending into the left node encourages a more accurate evaluation of the near future, which is more relevant to the current choices of the agent. This makes sense when the right node can be further examined later, or when there is uncertainty about the future that makes it sub-optimal to design a detailed plan at the moment.
• Preferentially descending into the node with the lower value follows the principle that a chain (plan) is only as good as its weakest link (sub-problem). This heuristic greedily optimizes the overall value of the plan.
• Using 2-way UCT on the values of the nodes acts similarly to the previous greedy heuristic, but also takes into account the confidence over the value estimates given by the visit counts.

The rest of the algorithm can remain unchanged, and during the BACKUP phase the current value estimate V of the sibling sub-problem can be used.
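The 2-way UCT heuristic above can be sketched as follows; the scoring rule, the exploration constant, and all names are our illustrative assumptions, not the paper's implementation:

```python
import math

def pick_child_to_traverse(left, right, c=1.0):
    """Choose which sub-problem of an AND node to descend into, using a
    2-way UCT rule over the children's value estimates.

    `left` and `right` are dicts with the current value estimate "V" and
    visit count "N". Lower-value children are scored as *more* urgent
    (the weakest-link principle), plus a visit-count exploration bonus."""
    total = left["N"] + right["N"] + 1

    def uct(node):
        bonus = c * math.sqrt(math.log(total) / (node["N"] + 1))
        return (1.0 - node["V"]) + bonus

    return "left" if uct(left) >= uct(right) else "right"
```

With equal visit counts this reduces to the greedy weakest-link heuristic; a rarely visited sibling can still win via the exploration bonus.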

A.4 PARSERS FOR HINDSIGHT EXPERIENCE REPLAY

Given a task (s 0 , s ∞ ), the policy prior p defines a distribution over binary partition trees of the task via recursive application (until the terminal symbol ∅ closes a branch). A sample T σ from this distribution implies a plan σ as described above; but furthermore it also contains the order in which the task was partitioned. Therefore, p not only implies a distribution over plans, but also a search order: Trees with high probability under p will be discovered earlier in the search with DC-MCTS. For generating training targets for supervised training of p, we need to parse a given sequence τ πσ s0 = (s 0 , s 1 , . . . , s T ) into a binary tree. Therefore, when applying HER we are free to choose any deterministic or probabilistic parser that generates a solution tree from the re-labeled HER data τ πσ s0 . As mentioned in the main text, the particular choice of HER-parser will shape the search strategy defined by p. Possible choices for the parsers include:

1. Left-first parsing creates triplets (s t , s t+1 , s T ). The resulting policy prior will then preferentially propose sub-goals close to the start state, mimicking standard forward planning. Analogously, right-first parsing results in approximate backward planning.
2. Temporally balanced parsing creates triplets (s t , s t+∆/2 , s t+∆ ). The resulting policy prior will then preferentially propose sub-goals "in the middle" of the task. This is the one we used in our experiments.
3. Weight-balanced parsing creates triplets (s, s', s'') such that v(s, s') ≈ v(s', s'') or v π (s, s') ≈ v π (s', s''). The resulting policy prior will attempt to propose sub-goals such that the resulting sub-tasks are equally difficult.

A.5 DETAILS ON ALGORITHMIC COMPLEXITY

Let c v π denote the cost of evaluating the low-level value v π on any sub-problem (s, s'). We assume c v π to be independent of (s, s'), which holds if e.g. v π is a fixed-size neural network.
Denote by c p the cost of evaluating the policy prior p on a sub-goal (s, s', s''). Expanding a new OR node in the search tree incurs a cost of |S|c p for evaluating p for all children. Assuming the computational cost of tree traversals is negligible, the total cost of running DC-MCTS for N node expansions is thus N • (|S|c p + 2c v π ). The number of expansions needed to find the optimal (or a sufficiently good) plan strongly depends on the quality of the policy prior (similar to A * search), making an analysis of the complexity of DC-MCTS challenging for arbitrary p. However, if p = p * is the optimal policy prior -i.e. p * (s' |s, s'') = 1 if s' ∈ σ * is in the optimal plan σ * for (s, s'') and 0 otherwise -DC-MCTS will construct σ * in the minimal number of node expansions.
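The parsers of Section A.4 above can be illustrated with a minimal sketch; the flat triplet representation and the function names are our own assumptions:

```python
def left_first_parse(traj):
    """Parse a trajectory (s_0, ..., s_T) into triplets (s_t, s_{t+1}, s_T):
    the proposed sub-goal is always the next state, mimicking forward
    sequential planning."""
    return [(traj[t], traj[t + 1], traj[-1]) for t in range(len(traj) - 2)]

def balanced_parse(traj):
    """Recursively split the trajectory at its temporal midpoint, producing
    triplets (s_t, s_{t+D/2}, s_{t+D}) of a temporally balanced binary tree."""
    if len(traj) < 3:
        return []
    mid = len(traj) // 2
    triplet = (traj[0], traj[mid], traj[-1])
    return [triplet] + balanced_parse(traj[:mid + 1]) + balanced_parse(traj[mid:])
```

Left-first parsing yields a degenerate chain of triplets (forward planning), while the balanced parser biases the prior toward sub-goals "in the middle" of each remaining segment.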

B TRAINING DETAILS

B.1 DETAILS FOR TRAINING THE VALUE FUNCTION

In order to train the value network v that is used for bootstrapping in DC-MCTS, we can regress it towards targets computed from previous search results or environment experiences. A first obvious option is to use as regression target the Monte Carlo return (i.e. 1 if the goal was reached, and 0 if it was not) from executing the DC-MCTS plans in the environment. This appears to be a sensible target, as the return is an unbiased estimator of the success probability P (s ∞ ∈ τ πσ s0 ) of the plan. Although this approach was used in Silver et al. (2016), its downside is that gathering environment experience is often very costly and yields only little information, i.e. one binary variable per episode. Furthermore, no information from the generated search tree T other than the best plan is used. Therefore, a lot of valuable information might be discarded, in particular in situations where a good sub-plan for a particular sub-problem was found, but the overall plan nevertheless failed. This shortcoming could be remedied by using as regression targets the non-parametric value estimates V (s, s') for all OR nodes (s, s') in the DC-MCTS tree at the end of the search. With this approach, a learning signal could still be obtained from successful sub-plans of an overall failed plan. However, we empirically found in our experiments that this led to drastically over-optimistic value estimates, for the following reason. By standard policy improvement arguments, regressing toward V leads to a bootstrap value function that converges to v *. In the definition of the optimal value v * (s, s'') = max s' v * (s, s') • v * (s', s''), we implicitly allow for infinite recursion depth for solving sub-problems. However, in practice we often used quite shallow trees (depth < 10), so that bootstrapping with approximations of v * is too optimistic, as this assumes an unbounded planning budget.
A principled solution for this could be to condition the value function used for bootstrapping on the amount of remaining search budget, either in terms of remaining tree depth or node expansions. Instead of such a cumbersome, explicitly resource-aware value function, we found the following to work well. After planning with DC-MCTS, we extract the plan σ * with EXTRACTPLAN from the search tree T. As can be seen from Algorithm 2, the procedure computes the return G σ * for all OR nodes in the solution tree T σ *. For training v we chose these returns G σ * for all OR nodes in the solution tree as regression targets. This combines the favourable aspects of both methods described above. In particular, this value estimate contains no bootstrapping and therefore did not lead to overly-optimistic bootstraps. Furthermore, all successfully solved sub-problems give a learning signal. As regression loss we chose cross-entropy.

Table 1: Architectures of the neural networks used in the experiment section for the high-level value and prior. For each convolutional layer we report kernel size, number of filters and stride. LN stands for Layer Normalization, FC for fully connected. All convolutions are preceded by a 1-pixel zero padding.

Torso:
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 2, swish, LN
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 2, swish, LN

Value head:
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 1, swish, LN
Flatten
FC: N h = 1, sigmoid

Policy head:
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 1, swish, LN
3 × 3, 64, stride = 1, swish, LN
Flatten
FC: N h = #classes, softmax
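The return-based regression targets for v described above can be sketched as follows; the tuple-based tree representation is our assumption, not the paper's data structure:

```python
def plan_returns(node, v):
    """Compute the return G for every OR node (s, g) of a solution tree.

    A node is either ("leaf", s, g), a task solved directly by the low-level
    policy, or ("split", s, s_mid, g, left, right). The return of a split
    node is the product of its children's returns, mirroring
    v*(s, g) = v*(s, s') * v*(s', g); `v` maps leaf tasks (s, g) to the
    low-level success probability. Returns a dict {(s, g): G} usable as
    regression targets for the value network (no bootstrapping involved)."""
    targets = {}

    def rec(n):
        if n[0] == "leaf":
            _, s, g = n
            G = v[(s, g)]
        else:
            _, s, _, g, left, right = n
            G = rec(left) * rec(right)
        targets[(s, g)] = G
        return G

    rec(node)
    return targets
```

Note that a successfully solved sub-problem contributes a target even when the sibling sub-problem (and hence the overall plan) fails.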

B.2 DETAILS FOR TRAINING THE POLICY PRIOR

The prior network is trained to match the distribution of the values of the AND nodes, also with a cross-entropy loss. Note that we did not use visit counts as targets for the prior network -as done in AlphaGo and AlphaZero for example (Silver et al., 2016; 2018) -since for small search budgets visit counts tend to be noisy and require significant fine-tuning to avoid collapse (Hamrick et al., 2020) .
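One plausible reading of "matching the distribution of the values of the AND nodes" is to normalize the children's value estimates into a target distribution via a Boltzmann softmax and train p with cross-entropy; the sketch below encodes this assumption (the softmax form and temperature are ours, not the paper's):

```python
import math

def prior_targets(child_values, temperature=1.0):
    """Turn the value estimates of an OR node's children (AND nodes, one per
    candidate sub-goal s') into a target distribution for the policy prior p
    via a Boltzmann softmax."""
    exps = [math.exp(v / temperature) for v in child_values]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between the target distribution and p's output."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, predicted))
```

Unlike visit-count targets, value-based targets remain informative at small search budgets, where counts are noisy.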

B.3 NEURAL NETWORKS ARCHITECTURES FOR GRID-WORLD EXPERIMENTS

The shared torso of the prior and value networks used in the experiments is a 6-layer CNN with kernels of size 3, 64 filters per layer, Layer Normalization after every convolutional layer, swish as activation function, zero-padding of 1, and strides [1, 1, 2, 1, 1, 2] to increase the size of the receptive field. The two heads for the prior and value networks follow the pattern described above, but with three layers only instead of six, and fixed strides of 1. The prior head ends with a linear layer and a softmax, in order to obtain a distribution over sub-goals. The value head ends with a linear layer and a sigmoid that predicts a single value, i.e. the probability of reaching the goal from the start state if we further split the problem into sub-problems. We did not heavily optimize network hyper-parameters. After running a random search over hyper-parameters for the fixed architecture described above, the following were chosen to run the experiments in Figure 3. The replay buffer has a maximum size of 2048. The prior and value networks are trained on batches of size 128 as new experiences are collected. Networks are trained using Adam with a learning rate of 1e-3, and the Boltzmann temperature of the softmax for the prior network is set to 0.003. For simplicity, we used HER with the time-based rebalancing (i.e. turning experiences into temporal binary search trees). UCB constants are sampled uniformly between 3 and 7, as these values were observed to give more robust results.
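The hyper-parameters above can be collected into a single config fragment; the key names are ours, chosen for readability, while the values are those reported in this section:

```python
# Hyper-parameters for the grid-world experiments as reported in this
# section. Key names are our own; values are transcribed from the text.
DCMCTS_CONFIG = {
    "replay_buffer_size": 2048,
    "batch_size": 128,
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "prior_softmax_temperature": 0.003,
    "ucb_constant_range": (3, 7),  # sampled uniformly per search
    "her_parser": "temporally_balanced",
    "torso_strides": [1, 1, 2, 1, 1, 2],
}
```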



As we will observe in Section 5, in practice both the low-level policy and its value can be learned; approximating the value oracle with a learned value function was sufficient for DC-MCTS to plan successfully. We assume MDPs with multiple absorbing states, such that this probability is not trivially equal to 1 for most policies, e.g. the uniform policy; in experiments, we used a finite episode length. It is also possible to traverse into a single node at a time; we describe several heuristics for this in Appendix A.3.



Figure 1: Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS).

Gao et al. (2017); Savinov et al. (2018); Stein et al. (2018); Nasiriany et al. (2019), all of which rely on forward sequential planning. Gabor et al. (2019) use MCTS for traditional sequential planning based on heuristics, sub-goals and macro-actions. Zhang et al. (2018) apply traditional graph planners to find abstract sub-goal sequences. We extend this line of work by showing that the abstraction of sub-goals affords more general search strategies than sequential planning. Work concurrent to ours has independently investigated non-sequential sub-goal planning: Jurgenson et al. (

Figure 2: Left: Two grid-world maze examples for wall density d = 0.75 and 0.95. In light blue, the distribution over sub-goals induced by the policy prior p that guides the DC-MCTS planner. Right group: The first sub-goal, i.e. at depth 0 of the solution tree, approximately splits the problem in half. Next, the two sub-goals at depth 1. Last, the final plan with the depth of each sub-goal shown. See supplementary material for full animations.

Figure 3: Grid-world mazes.

Figure 4: The 'ant', i.e. the agent, should navigate to the green target.

Figure 5: Fraction of solved mazes vs. planning budget.

Figure 6: Only colored nodes are part of the final plan: a sub-tree for DC-MCTS, a chain for MCTS.

Putting together equation 3 and equation 4 yields the proposition.

A.2 ADDITIONAL ALGORITHMIC DETAILS

After the search phase, in which DC-MCTS builds the search tree T, it returns its estimate of the best plan σ * and the corresponding lower bound L(σ * ) by calling the EXTRACTPLAN procedure on the root node (s 0 , s ∞ ). Algorithm 2 gives details on this procedure.

A.3 DESCENDING INTO ONE NODE AT A TIME DURING SEARCH

Instead of descending into both nodes during the TRAVERSE step of Algorithm 1, it is possible to choose only one of the two sub-problems to expand further. Forward planning is equivalent to expanding only the right sub-problem; backward planning is equivalent to expanding only the left sub-problem; Divide-and-Conquer Tree Search can do both, and also start from the middle, jump back and forth, etc.
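The EXTRACTPLAN recursion can be sketched as follows; the Node class and its fields are our illustrative assumptions about the search-tree representation, not the paper's code:

```python
class Node:
    """OR node of the search tree for a task (s, g): V is the current
    estimate of solving (s, g) directly with the low-level policy, and
    children maps each expanded sub-goal s_mid to its pair of child OR
    nodes (left, right) of an AND node."""
    def __init__(self, V, children=None):
        self.V = V
        self.children = children or {}

def extract_plan(node):
    """Recursively extract the best plan and its return from the tree,
    mirroring the 'return sigma_l . sigma_r, G_l * G_r' step of Algorithm 2:
    either solve (s, g) directly (empty plan, return V), or take the best
    sub-goal split and concatenate the children's plans."""
    best_plan, best_G = [], node.V
    for s_mid, (left, right) in node.children.items():
        plan_l, G_l = extract_plan(left)
        plan_r, G_r = extract_plan(right)
        if G_l * G_r > best_G:
            best_plan, best_G = plan_l + [s_mid] + plan_r, G_l * G_r
    return best_plan, best_G
```

The returned G values at the OR nodes of the chosen sub-tree are exactly the bootstrapping-free regression targets discussed in Section B.1.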

Figure 7: Divide and Conquer Tree Search is strictly more general than both forward and backward search.

In that case N = |σ * |, therefore incurring a cost of |σ * | • (|S|c p + 2c v π ). The dependency on |S| for the policy prior can be further reduced -in principle down to a constant -using techniques from the literature on MCTS for continuous or large discrete action spaces (e.g. progressive widening, Coulom (2007); Chaslot et al.). We can compare this to the cost of unguided, standard SSSP planners, which is |S| 2 c v π, as they need to query the low-level value function for all pairs of states (s, s'). Therefore, DC-MCTS can be significantly more cost-efficient than conventional SSSP planners if a good policy prior can be learned and the cost c p of evaluating p is at most comparable to that of evaluating v π . Under the assumption p = p *, MCTS and DC-MCTS have the same sample complexity. However, MCTS represents the solution as one path of length |σ * | in the search tree, whereas DC-MCTS represents it as a sub-tree with |σ * | nodes. Therefore, if computation can be carried out in parallel (e.g. by batching independent sub-problems at the same level of the sub-tree), the time complexity of DC-MCTS can be drastically reduced compared to MCTS, in the best case (perfect parallelism and balanced binary solution tree) from O(|σ * |) to O(log |σ * |).
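The cost comparison above can be made concrete with a toy calculation; the unit costs below are illustrative assumptions:

```python
def dc_mcts_cost(n_expansions, n_states, c_p, c_v):
    """Total cost of N node expansions under a guided search:
    N * (|S| * c_p + 2 * c_v_pi)."""
    return n_expansions * (n_states * c_p + 2 * c_v)

def sssp_cost(n_states, c_v):
    """Unguided SSSP planners query the low-level value for all state
    pairs: |S|^2 * c_v_pi."""
    return n_states ** 2 * c_v
```

For the 21 × 21 grid-world (|S| = 441), a 10-step optimal plan, and unit costs, the guided search pays 10 · (441 + 2) = 4430 versus 441² = 194481 for the unguided planner.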

B.4 LOW-LEVEL CONTROLLER TRAINING DETAILS

For physics-based experiments using MuJoCo (Todorov et al., 2012), we first trained a low-level policy and then trained the planning agent to reuse the low-level motor skills afforded by this body and pretrained policy. The low-level policy was trained to control the quadruped ("ant") body to go to a randomly placed target in an open area (a "go-to-target" task, essentially the same as the task used to train the humanoid in Merel et al., 2019, which is available at dm_control/locomotion). The task amounts to the environment providing an instruction corresponding to a target position that the agent is rewarded for moving to (i.e., a sparse reward when within a region of the target). When the target is reached, a new target is generated a short distance away (<1.5m). A policy trained on this task should thus be capable of producing short, direct, goal-directed locomotion behaviors in an open field; at test time, however, the presence of obstacles will catastrophically confuse this low-level policy. The policy architecture, consisting of a shallow MLP for the actor and critic, was trained to solve this task using MPO (Abdolmaleki et al., 2018). More specifically, the actor and critic had respectively 2 and 3 hidden layers of 256 units each and the elu activation function. The policy was trained to a high level of performance using a distributed, replay-based, off-policy training setup involving 64 actors. In order to reuse the low-level policy in the context of mazes, we replace the environment-provided instruction with a message sent by the high-level policy (i.e., the planning agent). For the planning agent that interfaces with the low-level policy, the action space of the high-level policy corresponds, by construction, to the instruction to the low-level policy.

B.5 PSEUDOCODE

We summarize the training procedure for DC-MCTS in the following pseudo-code. 
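A schematic rendering of that procedure (every interface below is our stand-in for the components described in Sections B.1 and B.2, not the paper's actual code):

```python
def train_dc_mcts(env, planner, parser, buffer, p_net, v_net, episodes):
    """Schematic DC-MCTS training loop, reconstructed from Sections B.1-B.2."""
    for _ in range(episodes):
        s0, goal = env.sample_task()
        # Plan with the current heuristics p and v, then run the low-level
        # policy against the planned sub-goal sequence.
        plan, tree = planner.search(s0, goal, p_net, v_net)
        trajectory = env.execute(plan)
        # Hindsight relabeling: parse the achieved trajectory into a solution
        # tree, regardless of whether the original goal was reached.
        buffer.add(parser(trajectory), tree)
        batch = buffer.sample()
        v_net.train_on_returns(batch)     # regress v toward solution-tree returns
        p_net.train_on_and_values(batch)  # cross-entropy toward AND-node values
```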

C MORE SOLVED MAZES

In Figure 8 we show more mazes as solved by the trained Divide-and-Conquer MCTS.

