DREAM AND SEARCH TO CONTROL: LATENT SPACE PLANNING FOR CONTINUOUS CONTROL

Anonymous

Abstract

Learning and planning with latent-space dynamics has been shown to improve sample efficiency in model-based reinforcement learning (MBRL) for both discrete and continuous control tasks. In particular, recent work on discrete action spaces demonstrated the effectiveness of latent-space planning via Monte-Carlo Tree Search (MCTS) for bootstrapping MBRL both during learning and at test time. However, the potential gains from latent-space tree search have not yet been demonstrated for environments with continuous action spaces. In this work, we propose and explore an MBRL approach for continuous action spaces based on tree-based planning over learned latent dynamics. We show that this approach yields the same kinds of bootstrapping benefits previously demonstrated for discrete spaces: it improves sample efficiency and performance over the state of the art on a majority of challenging continuous-control benchmarks.

1. INTRODUCTION

Deep reinforcement learning (RL) has been effective at solving sequential decision-making problems of varying difficulty. Solutions generally fall into one of two categories: model-free and model-based methods. Model-free methods (Haarnoja et al., 2018; Silver et al., 2014; Lillicrap et al., 2015; Fujimoto et al., 2018; Schulman et al., 2017) directly learn a policy or action-values, but are usually considered sample-inefficient. Model-based methods (Lee et al., 2019; Gregor et al., 2019; Zhang et al., 2018; Ha & Schmidhuber, 2018) learn the environment's dynamics. Common model-based approaches either sample trajectories from the learned dynamics to train a policy with RL, or apply a planning algorithm directly on the learned dynamics (Ha & Schmidhuber, 2018; Hafner et al., 2019a;b). However, learning multi-step dynamics in the raw observation space is challenging, primarily because rolling out trajectories then requires reconstructing high-dimensional features (e.g., pixels), which is an error-prone process. Instead, recent work has focused on learning latent-space models (Hafner et al., 2019a;b), which improves robustness and sample efficiency by eliminating the need for high-dimensional reconstruction during inference.

Learning on latent dynamics: Once the dynamics have been learned, a classic approach is to sample trajectories from them to learn a policy using RL. This approach is usually motivated by sample efficiency. Dreamer (Hafner et al., 2019a) took this approach and demonstrated state-of-the-art performance on continuous control by performing gradient-based RL on learned latent dynamics. Another approach is to perform a look-ahead search, where the dynamics are used for multi-step rollouts to determine an optimal action. The search can be guided by a value estimate and/or a policy that maps states to actions in order to narrow the search space or reduce the search depth.
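To make the idea of imagination in latent space concrete, the following is a minimal sketch of a Dreamer-style imagined rollout, assuming a learned latent transition model, reward head, and policy. The functions `dynamics`, `reward`, and `policy` below are hypothetical stand-ins (random linear maps) rather than trained networks; the point is only that the rollout never touches pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for learned components (in practice, neural
# networks); random linear maps so the sketch runs end to end.
LATENT_DIM, ACTION_DIM = 8, 2
W_dyn = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))
w_rew = rng.normal(scale=0.1, size=LATENT_DIM)

def dynamics(z, a):
    """Latent transition z_{t+1} = f(z_t, a_t); no pixel reconstruction."""
    return np.tanh(W_dyn @ np.concatenate([z, a]))

def reward(z):
    """Learned reward head evaluated directly on the latent state."""
    return float(w_rew @ z)

def policy(z):
    """Placeholder stochastic policy over bounded continuous actions."""
    return np.tanh(rng.normal(size=ACTION_DIM))

def imagine(z0, horizon=15, gamma=0.99):
    """Roll out the policy purely in latent space; return the imagined return."""
    z, ret = z0, 0.0
    for t in range(horizon):
        a = policy(z)
        z = dynamics(z, a)
        ret += (gamma ** t) * reward(z)
    return ret

print(imagine(rng.normal(size=LATENT_DIM)))
```

In Dreamer, gradients of such imagined returns are backpropagated through the (differentiable) latent dynamics to update the policy; the sketch above shows only the forward rollout.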
MuZero (Schrittwieser et al., 2019) took this approach and applied tree-based search on latent dynamics; however, it was restricted to discrete action spaces. The role of look-ahead search using learned latent dynamics has not been sufficiently explored for continuous action spaces.

Our contribution: In this paper, we extend the idea of performing look-ahead search using learned latent dynamics to continuous action spaces. Our high-level approach is shown in Fig. 1. We build on top of Dreamer and modify how actions are sampled during online planning. Instead of sampling actions from the current policy alone, we search over a set of actions sampled from a mix of distributions. For our search mechanism, we implement MCTS but also investigate a simple rollout algorithm.
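The planning step described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: candidate first actions are drawn from a mixture of a policy-centered proposal and a uniform exploration distribution, and each candidate is scored by a short latent rollout plus a terminal value estimate. All model components (`dynamics`, `reward`, `value`, `policy_sample`) are hypothetical placeholders standing in for trained networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the learned models (random linear maps).
LATENT_DIM, ACTION_DIM = 8, 2
W_dyn = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))
w_rew = rng.normal(scale=0.1, size=LATENT_DIM)
w_val = rng.normal(scale=0.1, size=LATENT_DIM)

def dynamics(z, a):
    return np.tanh(W_dyn @ np.concatenate([z, a]))

def reward(z):
    return float(w_rew @ z)

def value(z):
    """Learned value head used to truncate the rollout."""
    return float(w_val @ z)

def policy_sample(z):
    """Action drawn near the current policy (assumed Gaussian, tanh-squashed)."""
    return np.tanh(0.3 * rng.normal(size=ACTION_DIM))

def uniform_sample():
    """Broad exploration proposal over the action bounds [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=ACTION_DIM)

def rollout_score(z, a, horizon=5, gamma=0.99):
    """Score a candidate first action: short latent rollout + terminal value."""
    ret = 0.0
    for t in range(horizon):
        z = dynamics(z, a)
        ret += (gamma ** t) * reward(z)
        a = policy_sample(z)  # follow the policy after the first step
    return ret + (gamma ** horizon) * value(z)

def plan(z, n_candidates=32, policy_frac=0.75):
    """Pick the best first action from a mixture of proposal distributions."""
    candidates = [
        policy_sample(z) if rng.random() < policy_frac else uniform_sample()
        for _ in range(n_candidates)
    ]
    return max(candidates, key=lambda a: rollout_score(z, a))

best = plan(rng.normal(size=LATENT_DIM))
print(best.shape)  # (2,)
```

This corresponds to the "simple rollout" variant; an MCTS variant would additionally build a search tree over the sampled actions and reuse value estimates across simulations.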

