DREAM AND SEARCH TO CONTROL: LATENT SPACE PLANNING FOR CONTINUOUS CONTROL

Anonymous

Abstract

Learning and planning with latent space dynamics has been shown to improve sample efficiency in model-based reinforcement learning (MBRL) for discrete and continuous control tasks. In particular, recent work on discrete action spaces demonstrated the effectiveness of latent-space planning via Monte-Carlo Tree Search (MCTS) for bootstrapping MBRL both during learning and at test time. However, the potential gains from latent-space tree search have not yet been demonstrated for environments with continuous action spaces. In this work, we propose and explore an MBRL approach for continuous action spaces based on tree-based planning over learned latent dynamics. We show that the bootstrapping benefits previously demonstrated for discrete action spaces carry over to the continuous setting. In particular, the approach achieves improved sample efficiency and performance on a majority of challenging continuous-control benchmarks compared to the state-of-the-art.

1. INTRODUCTION

Deep reinforcement learning (RL) has been effective at solving sequential decision-making problems of varying difficulty. Solutions generally fall into one of two categories: model-free and model-based methods. Model-free methods (Haarnoja et al., 2018; Silver et al., 2014; Lillicrap et al., 2015; Fujimoto et al., 2018; Schulman et al., 2017) directly learn a policy or action-values, but are usually considered sample-inefficient. Model-based methods (Lee et al., 2019; Gregor et al., 2019; Zhang et al., 2018; Ha & Schmidhuber, 2018) learn the environment's dynamics. Common model-based approaches sample trajectories from the learned dynamics to train with RL, or apply a planning algorithm directly on the learned dynamics (Ha & Schmidhuber, 2018; Hafner et al., 2019a;b). However, learning multi-step dynamics in the raw observation space is challenging, primarily because rolling out trajectories requires reconstructing high-dimensional features (e.g., pixels), which is an error-prone process. Instead, recent work has focused on learning latent-space models (Hafner et al., 2019a;b), which improve robustness and sample efficiency by eliminating the need for high-dimensional reconstruction during inference.

Learning on latent dynamics: Once the dynamics have been learned, a classic approach is to sample trajectories from them to learn a policy using RL; this approach is usually motivated by sample efficiency. Dreamer (Hafner et al., 2019a) took this approach and demonstrated state-of-the-art performance on continuous control by performing gradient-based RL on learned latent dynamics. Another approach is to perform a look-ahead search, where the dynamics are used for multi-step rollouts to determine an optimal action. This can be accompanied by a value estimate and/or a policy that produces state-action mappings to narrow the search space or reduce the search depth.
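To make the contrast concrete, the following toy sketch rolls out an imagined trajectory entirely in latent space, with no pixel reconstruction in the loop. All names and the toy dynamics here are illustrative stand-ins, not the actual learned networks of Dreamer or this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned latent-model components (hypothetical):
# a transition function f(z, a), a reward head r(z), and a policy pi(z).
A = rng.normal(scale=0.1, size=(4, 4))   # latent transition weights
B = rng.normal(scale=0.1, size=(4, 2))   # action-effect weights
w = rng.normal(size=4)                   # reward head weights

def step_latent(z, a):
    # The rollout stays in latent space; no observation reconstruction.
    return np.tanh(A @ z + B @ a)

def reward(z):
    return float(w @ z)

def policy(z):
    return np.tanh(z[:2])                # toy deterministic policy

def imagine(z0, horizon=15, gamma=0.99):
    """Roll out an imagined trajectory purely in latent space and
    return its discounted return estimate."""
    z, ret, disc = z0, 0.0, 1.0
    for _ in range(horizon):
        ret += disc * reward(z)
        z = step_latent(z, policy(z))
        disc *= gamma
    return ret

print(imagine(rng.normal(size=4)))
```

In a real agent, gradients would flow through such an imagined rollout to update the policy; the point of the sketch is only that every step consumes and produces low-dimensional latent states.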
MuZero (Schrittwieser et al., 2019) took this approach and applied tree-based search on latent dynamics; however, it was restricted to discrete action spaces. The role of look-ahead search using learned latent dynamics has not been explored sufficiently for continuous action spaces.

Our contribution: In this paper, we extend the idea of performing look-ahead search using learned latent dynamics to continuous action spaces. Our high-level approach is shown in Fig. 1. We build on top of Dreamer and modify how actions are sampled during online planning. Instead of sampling actions from the current policy alone, we search over a set of actions sampled from a mix of distributions. For our search mechanism, we implement MCTS, but we also investigate a simple rollout algorithm that trades off performance for compute. We observed that look-ahead search yields better actions early in training, which in turn leads to faster convergence of model estimates and of optimal policies. However, these benefits come at the cost of computation time, since deeper and iterative look-ahead search takes longer.
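As a rough illustration of the simpler rollout alternative to MCTS, the sketch below scores candidate actions, drawn from a mix of policy samples and uniform samples, by short latent rollouts bootstrapped with a value estimate at the horizon. The dynamics, reward, value, and policy functions are toy stubs, not the learned components used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def step_latent(z, a):          # stub for learned latent dynamics
    return np.tanh(z + 0.1 * np.concatenate([a, a]))

def reward(z):                  # stub for learned reward head
    return float(z.sum())

def value(z):                   # stub for learned value head
    return float(z.mean())

def policy_sample(z, n):        # n noisy samples from the current policy
    return np.tanh(z[:2]) + 0.1 * rng.normal(size=(n, 2))

def rollout_score(z, a, depth=5, gamma=0.99):
    """Evaluate one candidate first action with a short rollout that
    follows the policy afterwards, bootstrapping with value()."""
    ret, disc = 0.0, 1.0
    z = step_latent(z, a)
    for _ in range(depth):
        ret += disc * reward(z)
        z = step_latent(z, np.tanh(z[:2]))
        disc *= gamma
    return ret + disc * value(z)

def select_action(z, n_policy=8, n_uniform=8):
    """Search over actions from a mix of distributions: policy
    samples plus uniform samples over the action space."""
    candidates = np.vstack([
        policy_sample(z, n_policy),
        rng.uniform(-1, 1, size=(n_uniform, 2)),
    ])
    scores = [rollout_score(z, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

a = select_action(rng.normal(size=4))
print(a)
```

The uniform component keeps some probability on actions far from the current policy, which is what allows the search to find better actions before the policy itself has converged.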

2. RELATED WORK

Model-based RL involves learning a dynamics model and using it either to sample trajectories for training a policy or to perform look-ahead search for policy improvement. World Models (Ha & Schmidhuber, 2018) learns a latent representation of observations and a dynamics model over this latent representation, then uses the learned model to optimize a linear controller via evolution; the latent representation makes it possible to plan in a low-dimensional abstract space. PlaNet (Hafner et al., 2019b) trains the latent observation encoding and dynamics end-to-end and performs online planning using CEM. Dreamer (Hafner et al., 2019a), the current state-of-the-art, builds on top of PlaNet and uses analytic gradients to efficiently learn long-horizon behaviors for visual control purely by latent imagination. We build our work on top of Dreamer.

Planning using look-ahead search: Given a dynamics model, one can search over its state-action space to determine an optimal action. If one has a perfect model but a sub-optimal action-value function, a deeper look-ahead search will usually yield better policies. This look-ahead search can be done both during online decision making and during policy improvement. AlphaGo (Silver et al., 2017) and related work (Lowrey et al., 2018) combined MCTS with a value estimator to plan and explore with known ground-truth dynamics. MuZero (Schrittwieser et al., 2019) learned a latent dynamics model and performed MCTS over it, conditioning the latent representation on value and multi-step reward prediction rather than observation reconstruction. These works are limited to discrete action spaces, since continuous action spaces are too large to enumerate. A variation on MCTS is progressive widening (Coulom, 2006; Chaslot et al., 2008; Couëtoux et al., 2011), where the child-actions of a node are expanded based on its visitation count.
This has been exploited in continuous action spaces by adding actions sampled from a proposal distribution to create a discrete set of child-actions. A0C (Moerland et al., 2018) utilized this approach to apply MCTS in continuous action spaces over the true dynamics; however, their results were shown only for the Pendulum-v0 task in OpenAI Gym. A common thread across these prior works is the availability of known ground-truth dynamics. Another approach for reducing the size of the look-ahead search space is hierarchical optimistic optimization (HOO), which splits the action space and gradually increases the resolution of actions (Mansley et al., 2011; Bubeck et al., 2009).
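The widening rule can be sketched as follows. The exponent `alpha` and the uniform proposal distribution are illustrative choices in the general style of the cited progressive-widening work, not the exact criterion from any one of those papers:

```python
import math
import random

random.seed(0)

class Node:
    """MCTS node with progressive widening: a new child action is
    sampled only while the child count is below visits**alpha."""
    def __init__(self):
        self.visits = 0
        self.children = {}          # action (tuple) -> child Node

    def maybe_expand(self, propose, alpha=0.5):
        # Allow at most ceil(visits**alpha) children, so the discrete
        # set of child-actions grows with the visitation count.
        if len(self.children) < math.ceil(self.visits ** alpha):
            a = propose()
            self.children[a] = Node()
            return a
        return None                 # reuse an existing child instead

def propose_action():
    # Proposal distribution over a 1-D continuous action space
    # (uniform here; in practice often a learned policy prior).
    return (round(random.uniform(-1.0, 1.0), 3),)

root = Node()
for _ in range(100):
    root.visits += 1
    root.maybe_expand(propose_action)

print(len(root.children))   # grows roughly like visits**alpha
```

With `alpha = 0.5`, a node visited 100 times holds on the order of ten child-actions, which is what keeps the branching factor of the tree bounded in a continuous action space.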



Figure 1: Overall approach. Phase 1: At each time-step, select the action to execute via look-ahead search over the latent dynamics and write the resulting transition to a replay buffer. Phase 2: Sample fixed-horizon rollouts from the replay buffer. Phase 3: Optimize the latent model to reconstruct rewards and observations. Phase 4: Sample imagined rollouts from the latent model to train the agent.
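The four phases in the caption can be read as one interleaved loop. In the sketch below, every component is a minimal stub (a random planner, a no-op model update, a no-op agent update) standing in for the actual learned modules, so only the control flow is meaningful:

```python
import numpy as np

rng = np.random.default_rng(2)
replay = []                       # transition buffer: (obs, act, rew, next)

def plan_action(obs):             # Phase 1: look-ahead search (stubbed)
    return np.clip(obs[:2] + 0.1 * rng.normal(size=2), -1.0, 1.0)

def env_step(obs, act):           # stand-in environment
    return obs + 0.01 * np.concatenate([act, act]), float(act.sum())

def sample_rollouts(horizon=5):   # Phase 2: fixed-horizon slice of replay
    if len(replay) < horizon:
        return []
    i = rng.integers(0, len(replay) - horizon + 1)
    return replay[i:i + horizon]

def update_model(batch):          # Phase 3: fit latent model to rewards
    pass                          # and observations (gradient step omitted)

def update_agent():               # Phase 4: train the agent on imagined
    pass                          # rollouts from the latent model (omitted)

obs = rng.normal(size=4)
for _ in range(20):
    act = plan_action(obs)                    # Phase 1
    nxt, rew = env_step(obs, act)
    replay.append((obs, act, rew, nxt))
    batch = sample_rollouts()                 # Phase 2
    update_model(batch)                       # Phase 3
    update_agent()                            # Phase 4
    obs = nxt

print(len(replay))
```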

