LEARNING TO PLAN OPTIMISTICALLY: UNCERTAINTY-GUIDED DEEP EXPLORATION VIA LATENT MODEL ENSEMBLES

Abstract

Learning complex behaviors through interaction requires coordinated long-term planning. Random exploration and novelty search lack task-centric guidance and waste effort on non-informative interactions. Instead, decision making should target samples with the potential to optimize performance far into the future, while only reducing uncertainty where conducive to this objective. This paper presents latent optimistic value exploration (LOVE), a strategy that enables deep exploration through optimism in the face of uncertain long-term rewards. We combine finite-horizon rollouts from a latent model with value function estimates to predict infinite-horizon returns and recover the associated uncertainty through ensembling. Policy training then proceeds on an upper confidence bound (UCB) objective to identify and select the interactions most promising to improve long-term performance. We apply LOVE to visual control tasks in continuous state-action spaces and demonstrate improved sample complexity on a selection of benchmarking tasks.
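The return estimator described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `ucb_return`, the array shapes, and the simple mean/std aggregation over ensemble members are assumptions made for clarity.

```python
import numpy as np

def ucb_return(ensemble_rewards, ensemble_values, gamma=0.99, beta=1.0):
    """Optimistic infinite-horizon return estimate from an ensemble.

    ensemble_rewards: (K, H) array of predicted rewards from each of K
        ensemble members over a finite rollout horizon H.
    ensemble_values: (K,) value-function estimates at the final latent
        state, bootstrapping the return beyond the horizon.
    """
    K, H = ensemble_rewards.shape
    discounts = gamma ** np.arange(H)
    # Each member's infinite-horizon return estimate: discounted
    # finite-horizon rewards plus a discounted value bootstrap.
    returns = ensemble_rewards @ discounts + gamma**H * ensemble_values
    mean, std = returns.mean(), returns.std()
    # UCB objective: optimism in the face of uncertain long-term rewards.
    return mean + beta * std
```

Policy training would then maximize this scalar with respect to the actions that generated the latent rollout, so that disagreement among members (high `std`) makes an interaction more attractive.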

1. INTRODUCTION

The ability to learn complex behaviors through interaction will enable the autonomous deployment of various robotic systems in the real world. Reinforcement learning (RL) provides a key framework for realizing these capabilities, but the efficiency of the learning process remains a prevalent concern. Real-life applications yield complex planning problems due to high-dimensional environment states, which are further exacerbated by the agent's continuous action space. For RL to enable real-world autonomy, it therefore becomes crucial to determine efficient representations of the underlying planning problem, while formulating interaction strategies capable of exploring the resulting representation efficiently.

In traditional controls, planning problems are commonly formulated based on the underlying state-space representation. This may inhibit efficient learning when the environment states are high-dimensional or their dynamics are susceptible to non-smooth events such as singularities and discontinuities (Schrittwieser et al., 2019; Hwangbo et al., 2019; Yang et al., 2019). It may then be desirable for the agent to abstract a latent representation that facilitates efficient learning (Ha & Schmidhuber, 2018; Zhang et al., 2019; Lee et al., 2019). The latent representation may then be leveraged in either a model-free or a model-based setting. Model-free techniques estimate state values directly from observed data to distill a policy mapping. Model-based techniques learn an explicit representation of the environment that is leveraged in generating fictitious interactions, enabling policy learning in imagination (Hafner et al., 2019a). While the former reduces potential sources of bias, the latter offers a structured representation encoding deeper insights into underlying environment behavior.

The agent should leverage the chosen representation to efficiently identify and explore informative interactions.
We provide a motivating one-dimensional example of a potential action-value mapping in Figure 1 (left). The true function and its samples are visualized in red, with the true maximum denoted by the green dot. Relying only on the predicted mean can bias policy learning towards local optima (orange dot; Sutton & Barto (2018)), while added stochasticity can waste samples on uninformative interactions. Auxiliary information-gain objectives integrate predicted uncertainty; however, uncertain environment behavior does not equate to potential for improvement (pink dot). It is desirable to focus exploration on interactions that harbor potential for improving overall performance. Combining mean performance estimates with the associated uncertainty into an upper confidence
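The distinction drawn in Figure 1 can be made concrete with a toy numeric example. The ensemble predictions below are invented for illustration: three hypothetical members evaluate three candidate actions, agreeing on the first two but disagreeing on the third.

```python
import numpy as np

# Hypothetical return predictions from a 3-member ensemble for
# 3 candidate actions; the numbers are purely illustrative.
preds = np.array([
    [1.0, 0.9, 0.0],   # member 1
    [1.0, 0.9, 0.3],   # member 2
    [1.0, 0.9, 2.1],   # member 3
])

mean = preds.mean(axis=0)   # [1.0, 0.9, 0.8]: action 0 looks best on average
std = preds.std(axis=0)     # members disagree only on action 2
ucb = mean + 1.0 * std      # optimistic value per action

greedy_choice = int(np.argmax(mean))  # exploits the apparent optimum (action 0)
ucb_choice = int(np.argmax(ucb))      # targets the uncertain but promising action 2
```

Greedy selection on the mean settles for the locally best-looking action, whereas the UCB criterion directs exploration toward the action whose uncertainty leaves room for a much higher return.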

