DIVIDE-AND-CONQUER MONTE CARLO TREE SEARCH

Abstract

Standard planners for sequential decision making (including Monte Carlo planning, tree search, dynamic programming, etc.) are constrained by an implicit sequential planning assumption: the order in which a plan is constructed is the same as the order in which it is executed. We consider alternatives to this assumption for the class of goal-directed Reinforcement Learning (RL) problems. Instead of an environment transition model, we assume an imperfect, goal-directed policy. This low-level policy can be improved by a plan, consisting of an appropriate sequence of sub-goals that guide it from the start to the goal state. We propose a planning algorithm, Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS), that approximates the optimal plan by proposing intermediate sub-goals which hierarchically partition the initial task into simpler ones that are then solved independently and recursively. The algorithm critically makes use of a learned sub-goal proposal for finding appropriate partition trees of new tasks based on prior experience. Different strategies for learning sub-goal proposals give rise to different planning strategies that strictly generalize sequential planning. We show that this algorithmic flexibility with respect to planning order leads to improved results in navigation tasks in grid-worlds as well as in challenging continuous control environments.

1. INTRODUCTION

This is the first sentence of this paper, but it was not the first one we wrote. In fact, the entire introduction was one of the last sections to be added to this manuscript. The discrepancy between the order in which ideas are conceived and the order in which they are presented probably does not come as a surprise to the reader. Nonetheless, it serves as a point of reflection that is central to the rest of this work, and that can be summarized as: the order in which we construct a plan does not have to coincide with the order in which we execute it.

Most standard planners for sequential decision making problems, including Monte Carlo planning, Monte Carlo Tree Search (MCTS), and dynamic programming, have a baked-in sequential planning assumption (Bertsekas et al., 1995; Browne et al., 2012). These methods begin at either the initial or the final state and then plan actions sequentially forwards or backwards in time. This sequential approach faces two main challenges. (i) The transition model used for planning needs to be reliable over long horizons, which is often difficult to achieve when it has to be inferred from data. (ii) Credit assignment to each individual action is difficult: in a planning problem spanning a horizon of 100 steps, assigning credit to the first action requires computing the optimal cost-to-go for the remaining problem with a horizon of 99 steps, which is only slightly easier than solving the original problem.

To overcome these two fundamental challenges, we consider alternatives to the basic assumptions of sequential planners. We focus on goal-directed decision making problems in which an agent should reach a goal state from a start state. Instead of a transition and reward model of the environment, we assume a given goal-directed policy (the "low-level" policy) and an associated value oracle that returns its success probability on any given task.¹ In general, a low-level policy will not be optimal, e.g. it might be too "myopic" to reliably reach goal states that are far away from its current state. We therefore seek to improve the low-level policy via a suitable sequence of sub-goals that guide it from the start to the final goal, thus maximizing the overall task success probability. This formulation of planning as finding good sub-goal sequences makes learning explicit environment models unnecessary, as they are replaced by low-level policies and their value functions.

The sub-goal planning problem can still be solved by a conventional sequential planner that first searches for the first sub-goal to reach from the start state, then plans the next sub-goal in sequence, and so on. Indeed, this is the approach taken in most hierarchical RL settings based on options or sub-goals (e.g. Dayan & Hinton, 1993; Sutton et al., 1999; Vezhnevets et al., 2017). However, the credit assignment problem mentioned above persists, as assessing whether the first sub-goal is useful still requires evaluating the success probability of the remaining plan. Instead, it can be substantially easier to reason about the utility of a sub-goal "in the middle" of the plan, as this breaks the long-horizon problem into two sub-problems with much shorter horizons: how to get to the sub-goal, and how to get from there to the final goal. Based on this intuition, we propose the Divide-and-Conquer MCTS (DC-MCTS) planner, which searches for sub-goals that split the original task into two independent sub-tasks of comparable complexity and then recursively solves these, thereby drastically facilitating credit assignment. To search the space of intermediate sub-goals efficiently, DC-MCTS uses a heuristic for proposing promising sub-goals that is learned from previous search results and agent experience.

Humans can plan efficiently over long horizons to solve complex tasks, such as theorem proving or navigation, and some plans even span decades (e.g. economic measures). In these situations, planning sequentially in terms of next steps, such as which arm to move or which phone call to make, covers only a tiny proportion of the horizon, neglecting the large uncertainty beyond the last planned step. The algorithm put forward in this paper is a step towards efficient planners that tackle long horizons by recursively, and in parallel, splitting them into smaller and smaller sub-problems.

In Section 2, we formulate planning in terms of sub-goals instead of primitive actions. In Section 3, as our main contribution, we propose the novel Divide-and-Conquer Monte Carlo Tree Search algorithm for this planning problem. In Section 4, we position DC-MCTS within the literature on related work. In Section 5, we show that it outperforms sequential planners both on grid-world and continuous control navigation tasks, demonstrating the utility of constructing plans in a flexible order that can differ from their execution order.
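The divide-and-conquer intuition above can be sketched in a few lines of code. This is an illustrative recursion, not the full DC-MCTS algorithm: the names `value` (standing in for the low-level policy's value oracle) and `candidate_subgoals` (standing in for the learned sub-goal proposal) are hypothetical, and the search is exhaustive rather than guided by tree search.

```python
# Minimal sketch of divide-and-conquer sub-goal planning: recursively
# split a task (s0, goal) at the sub-goal that maximizes the product of
# the two sub-task values. Illustrative only, not the paper's DC-MCTS.

def plan(s0, goal, value, candidate_subgoals, depth=3):
    """Return (sub-goal sequence, lower bound on its success probability)."""
    best_plan, best_value = [], value(s0, goal)  # low-level policy alone
    if depth == 0:
        return best_plan, best_value
    for sg in candidate_subgoals(s0, goal):
        left, v_left = plan(s0, sg, value, candidate_subgoals, depth - 1)
        right, v_right = plan(sg, goal, value, candidate_subgoals, depth - 1)
        v = v_left * v_right  # independent sub-tasks: probabilities multiply
        if v > best_value:
            best_plan, best_value = left + [sg] + right, v
    return best_plan, best_value

# Toy 1-D chain: a myopic low-level policy that only reaches goals at
# most 3 steps away, so longer tasks require intermediate sub-goals.
def value(s, g):
    d = abs(g - s)
    return 0.9 ** d if d <= 3 else 0.0

def candidate_subgoals(s, g):
    return [(s + g) // 2]  # propose the midpoint as the only candidate
```

On this toy task, `plan(0, 6, value, candidate_subgoals)` splits the task at sub-goal 3, turning a task the myopic policy cannot solve directly (value 0) into one with success probability 0.9⁶ ≈ 0.53.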

2. IMPROVING GOAL-DIRECTED POLICIES WITH PLANNING

Let S and A be finite sets of states and actions. We consider a multi-task setting, where in each episode the agent has to solve a new task consisting of a new Markov Decision Process (MDP) M over S and A. Each M has a single start state s_0 and a special absorbing state s_∞, also termed the goal state. If the agent transitions into s_∞ at any time, it receives a reward of 1 and the episode terminates; otherwise the reward is 0. We assume that the agent observes the start and goal states (s_0, s_∞) at the beginning of each episode, as well as an encoding vector c_M ∈ R^d. This vector provides the agent with additional information about the MDP M of the current episode and will be key to transfer learning across tasks in the multi-task setting. A stochastic, goal-directed policy π is a mapping from S × S × R^d into distributions over A, where π(a | s, s_∞, c_M) denotes the probability of taking action a in state s in order to get to goal s_∞. For a fixed goal s_∞, we can interpret π as a regular policy, here denoted π_{s_∞}, mapping states to action probabilities. We denote the value of π in state s for goal s_∞ as v^π(s, s_∞ | c_M); we assume no discounting (γ = 1). Under the above definition of the reward, the value is equal to the success probability of π on the task, i.e. the absorption probability of the stochastic process starting in s_0 defined by running π_{s_∞}:

v^π(s_0, s_∞ | c_M) = P(s_∞ ∈ τ_{s_0}^{π_{s_∞}} | c_M),

where τ_{s_0}^{π_{s_∞}} denotes the trajectory of states visited by executing π_{s_∞} from s_0.
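The definition of the value as an absorption probability can be illustrated with a small Monte Carlo estimate: roll out the goal-directed policy many times and count the fraction of trajectories that reach the goal. The chain MDP and noisy policy below are illustrative stand-ins, not objects defined in the paper.

```python
# Hedged sketch: estimate v^pi(s0, s_inf) as the fraction of rollouts of
# the goal-directed policy pi_{s_inf} that are absorbed at the goal.
import random

def rollout_reaches_goal(policy, step, s0, s_inf, max_steps=50):
    """Run one episode; return True if the goal state is reached."""
    s = s0
    for _ in range(max_steps):
        if s == s_inf:
            return True
        s = step(s, policy(s, s_inf))
    return s == s_inf

def mc_value(policy, step, s0, s_inf, n=2000, seed=0):
    """Monte Carlo estimate of the success probability v^pi(s0, s_inf)."""
    random.seed(seed)
    hits = sum(rollout_reaches_goal(policy, step, s0, s_inf) for _ in range(n))
    return hits / n

# Toy chain MDP on states {0, ..., 10}: an imperfect low-level policy
# that steps toward the goal with probability 0.8, otherwise away.
def noisy_policy(s, g):
    toward = 1 if g > s else -1
    return toward if random.random() < 0.8 else -toward

def chain_step(s, a):
    return min(10, max(0, s + a))
```

For a nearby goal, e.g. `mc_value(noisy_policy, chain_step, 0, 3)`, the estimate is close to 1, since the biased walk almost surely hits the goal within the step budget; for distant goals the estimated success probability drops, which is exactly the regime where sub-goal planning helps.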



¹ As we will observe in Section 5, in practice both the low-level policy and value can be learned. Approximating the value oracle with a learned value function was sufficient for DC-MCTS to plan successfully.



Figure 1: Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS).

