STRUCTURE AND RANDOMNESS IN PLANNING AND REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Planning in large state spaces inevitably needs to balance the depth and breadth of the search. This trade-off has a crucial impact on a planner's performance, and most planners manage it implicitly. We present a novel method, Shoot Tree Search (STS), which makes it possible to control this trade-off more explicitly. Our algorithm can be understood as an interpolation between two celebrated search mechanisms: MCTS and random shooting. It also lets the user control the bias-variance trade-off, akin to TD(n), but in the tree-search context. In experiments on challenging domains, we show that STS can get the best of both worlds, consistently achieving higher scores.

1. INTRODUCTION

Classically, reinforcement learning is split into model-free and model-based methods. Each of these approaches has its strengths and weaknesses: the former often achieves state-of-the-art performance, while the latter holds the promise of better sample efficiency and adaptability to new situations. Interestingly, in both paradigms there exists a non-trivial interplay between structure and randomness. In the model-free approach, Temporal Difference (TD) prediction leverages the structure of function approximators, while Monte Carlo (MC) prediction relies on random rollouts. Model-based methods often employ planning, which counterfactually evaluates future scenarios. The design of a planner can lean either towards randomness, with random rollouts used for state evaluation (e.g. random shooting), or towards structure, where a data structure, typically a tree or a graph, forms the backbone of the search, as in Monte Carlo Tree Search (MCTS).

Planning is a powerful concept and an important policy improvement mechanism. However, in many interesting problems the search state space is prohibitively large and cannot be exhaustively explored. Consequently, it is critical to balance the depth and breadth of the search in order to stay within a feasible computational budget. This dilemma is ubiquitous, though often not explicit.

The aim of our work is twofold. First, we present a novel method: Shoot Tree Search (STS). The development of the algorithm was motivated by the aforementioned observations concerning structure, randomness, and the dilemma between the breadth and depth of the search. It lets the user control the depth and breadth of the search more explicitly and can be viewed as a bias-variance control method. STS itself can be understood as an interpolation between MCTS and random shooting. We show experimentally that, on a diverse set of environments, STS can get the best of both worlds.
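As a rough illustration of this interpolation, the difference between the standard single-node expansion of MCTS and a multi-step ("shoot") expansion can be sketched as follows. This sketch is ours, not the paper's algorithm: the names `Node`, `step_fn`, `action_space`, and `n_steps` are illustrative assumptions, and Section 3 gives the actual method.

```python
import random


class Node:
    """A search-tree node holding a state and child links (illustrative)."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value_sum = 0.0


def single_step_expand(node, step_fn, action_space):
    """Standard MCTS expansion: attach a single child below `node`."""
    a = random.choice(action_space)
    child = Node(step_fn(node.state, a), parent=node)
    node.children[a] = child
    return [child]


def multi_step_expand(node, step_fn, action_space, n_steps):
    """Multi-step expansion (sketch): attach a chain of `n_steps` nodes
    below `node`, following sampled actions. With n_steps == 1 this
    reduces to standard MCTS expansion; as n_steps grows, a single
    expansion increasingly resembles a random shooting rollout."""
    added = []
    current = node
    for _ in range(n_steps):
        a = random.choice(action_space)
        child = Node(step_fn(current.state, a), parent=current)
        current.children[a] = child
        added.append(child)
        current = child
    return added
```

The single knob `n_steps` is what moves the search along the structure-randomness axis: small values spend the budget on breadth (many shallow tree nodes), large values on depth (long rollout-like chains).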
We also provide some toy environments to give insight into why STS can be expected to perform well. The critical element of STS, multi-step expansion, can easily be implemented on top of many algorithms from the MCTS family. As such, it can be viewed as one of the extensions in the MCTS toolbox. The second aim of the paper is to analyze various improvements to planning algorithms and test them experimentally. This, we believe, is of interest in its own right. The testing was performed on the Sokoban and Google Research Football environments. Sokoban is a challenging combinatorial puzzle proposed as a testbed for planning methods by Racanière et al. (2017). Google Research Football is an advanced, physics-based simulator of football, introduced recently in Kurach et al. (2019). It has been designed to offer a diverse set of challenges for testing reinforcement learning algorithms.

The rest of the paper is organized as follows. In the next section, we discuss the background and related work. Further, we present the details of our method. Section 4 is devoted to experimental results.

2. BACKGROUND AND RELATED WORK

An introduction to reinforcement learning can be found in Sutton & Barto (2018). In contemporary research, the line between model-free and model-based methods is often blurred. An early example is Guo et al. (2014), where MCTS plays the role of an 'expert' in DAgger (Ross & Bagnell (2014)), a policy learning algorithm. In the series of papers Silver et al. (2017; 2018), culminating in AlphaZero, the authors developed a system combining elements of model-based and model-free methods that mastered the game of Go (and others). Similar ideas were also studied in Anthony et al. (2017). In Miłoś et al. (2019), planning and model-free learning were brought together to solve combinatorial environments. Schrittwieser (2019) successfully integrated model learning with planning in the latent space. A recent paper, Hamrick et al. (2020), suggests further integration of model-free and model-based methods by utilizing internal planner information to calculate more accurate estimates of the Q-function. Soemers et al. (2016) presents an expansion mechanism much like ours; the crucial algorithmic difference is the aggregate backprop (for details, see Section 3). In a similar vein, Coulom (2006) proposes a framework blending tree search and Monte Carlo simulations in a smooth way. Both Soemers et al. (2016) and Coulom (2006) differ from our work in that they do not use learned value functions, resorting instead to heuristics and/or long rollouts. In James et al. (2017), a detailed empirical analysis suggests that the key to the effectiveness of UCT is the correct ordering of actions. Like most of these works, we use the model-based reinforcement learning paradigm, in which the agent has access to a true model of the environment.

Searching and planning algorithms are deeply rooted in classical computer science and classical AI, see e.g. Cormen et al. (2009) and Russell & Norvig (2002). Traditional heuristic algorithms such as A* (Hart et al. (1968)) or GBFS (Doran & Michie (1966)) are widely used. The Monte Carlo Tree Search algorithm, which combines heuristic search with learning, led to breakthroughs in the field; see Browne et al. (2012) for an extensive survey. Similarly, Orseau et al. (2018) builds on classical BFS to construct a heuristic search mechanism with theoretical guarantees. In Agostinelli et al. (2019), the authors utilize the value function to improve upon the A* algorithm and solve Rubik's cube.

Monte Carlo rollouts are known to be a useful way of approximating the value of a state-action pair (Abramson, 1990). Approaches in which the actions of a rollout are sampled uniformly are often called flat Monte Carlo. Impressively, flat Monte Carlo achieved world-champion level in Bridge (Ginsberg, 2001) and Scrabble (Sheppard, 2002). Moreover, Monte Carlo rollouts are often used as a part of model predictive control, see Camacho & Alba (2013). As suggested by Chua et al. (2018); Nagabandi et al. (2018), they offer several advantages, including simplicity and ease of parallelization. At the same time, they reach results competitive with other (more complicated) methods on many important tasks. Williams et al. (2016) applied their Model Predictive Path Integral control algorithm (Williams et al., 2015), an approach based on stochastic sampling of trajectories, to the problem of controlling a fifth-scale AutoRally vehicle in an aggressive driving task.

Some works aim to compose a planning module into neural network architectures, see e.g. Oh et al. (2017); Farquhar et al. (2017). Kaiser et al. (2019), a recent work on model-based Atari, has shown the possibility of sample-efficient reinforcement learning with an explicit visual model. Gu et al. (2016) uses model-based methods in the initial phase of training and model-free methods during 'fine-tuning'. Furthermore, there is a body of work that attempts to learn a planning module, see Pascanu et al. (2017); Racanière et al. (2017); Guez et al. (2019).

3. METHODS

Reinforcement learning (RL) is formalized within the Markov decision process (MDP) framework, see Sutton & Barto (2018). An MDP is defined as a tuple (S, A, P, r, γ), where S is a state space, A is a set of actions available to an agent, P is the transition kernel, r is the reward function, and γ ∈ (0, 1) is the discount factor. An agent's policy, π : S → P(A), maps states to distributions over actions. An object central to the MDP formalism is the value function V^π : S → R associated with a policy π,

V^π(s) := E_π [ Σ_{t=0}^{+∞} γ^t r_t | s_0 = s ],

where r_t denotes the stream of rewards, assuming that the agent operates with policy π (which is denoted by E_π) and that at t = 0 the system starts from s. The


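As a concrete illustration of the value function V^π from Section 3, the expectation can be approximated by averaging discounted returns over sampled rollouts, which is exactly the flat Monte Carlo idea discussed in the related work. The following is a minimal sketch under an assumed toy interface of our own devising, where `step_fn(s, a)` returns `(next_state, reward, done)` and `policy(s)` samples an action; neither name comes from the paper.

```python
def mc_value_estimate(state, policy, step_fn, gamma, n_rollouts=100, horizon=50):
    """Monte Carlo estimate of V^pi(state): average the discounted
    return over `n_rollouts` rollouts truncated at `horizon` steps.

    Assumed (illustrative) interfaces:
      policy(s)     -> action sampled from pi(.|s)
      step_fn(s, a) -> (next_state, reward, done)
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = step_fn(s, a)
            ret += discount * r  # accumulate gamma^t * r_t
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_rollouts
```

Truncating at a finite `horizon` introduces a bias of order gamma^horizon, which is the usual price of approximating the infinite discounted sum; for gamma well below 1 a modest horizon already makes this negligible.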