LEARNING TO PLAN OPTIMISTICALLY: UNCERTAINTY-GUIDED DEEP EXPLORATION VIA LATENT MODEL ENSEMBLES

Abstract

Learning complex behaviors through interaction requires coordinated long-term planning. Random exploration and novelty search lack task-centric guidance and waste effort on non-informative interactions. Instead, decision making should target samples with the potential to optimize performance far into the future, while only reducing uncertainty where conducive to this objective. This paper presents latent optimistic value exploration (LOVE), a strategy that enables deep exploration through optimism in the face of uncertain long-term rewards. We combine finite-horizon rollouts from a latent model with value function estimates to predict infinite-horizon returns and recover the associated uncertainty through ensembling. Policy training then proceeds on an upper confidence bound (UCB) objective to identify and select the interactions most promising for improving long-term performance. We apply LOVE to visual control tasks in continuous state-action spaces and demonstrate improved sample complexity on a selection of benchmarking tasks.

1. INTRODUCTION

The ability to learn complex behaviors through interaction will enable the autonomous deployment of various robotic systems in the real world. Reinforcement learning (RL) provides a key framework for realizing these capabilities, but the efficiency of the learning process remains a prevalent concern. Real-life applications yield complex planning problems due to high-dimensional environment states, which are further exacerbated by the agent's continuous action space. For RL to enable real-world autonomy, it therefore becomes crucial to determine efficient representations of the underlying planning problem, while formulating interaction strategies capable of exploring the resulting representation efficiently.

In traditional control, planning problems are commonly formulated based on the underlying state-space representation. This may inhibit efficient learning when the environment states are high-dimensional or their dynamics are susceptible to non-smooth events such as singularities and discontinuities (Schrittwieser et al., 2019; Hwangbo et al., 2019; Yang et al., 2019). It may then be desirable for the agent to abstract a latent representation that facilitates efficient learning (Ha & Schmidhuber, 2018; Zhang et al., 2019; Lee et al., 2019). The latent representation may then be leveraged in either a model-free or a model-based setting. Model-free techniques estimate state values directly from observed data to distill a policy mapping. Model-based techniques learn an explicit representation of the environment that is leveraged to generate fictitious interactions, enabling policy learning in imagination (Hafner et al., 2019a). While the former reduces potential sources of bias, the latter offers a structured representation encoding deeper insights into underlying environment behavior. In either case, the agent should leverage the chosen representation to efficiently identify and explore informative interactions.
We provide a motivational one-dimensional example of a potential action-value mapping in Figure 1 (left). The true function and its samples are visualized in red, with the true maximum denoted by the green dot. Relying only on the predicted mean can bias policy learning towards local optima (orange dot; Sutton & Barto (2018)), while added stochasticity can waste samples on uninformative interactions. Auxiliary information-gain objectives integrate predicted uncertainty; however, uncertain environment behavior does not equate to potential for improvement (pink dot). It is desirable to focus exploration on interactions that harbor potential for improving overall performance. Combining mean performance estimates with the associated uncertainty into an upper confidence bound can focus exploration on exactly these interactions (blue dot). The underlying uncertainty can be explicitly represented by maintaining an ensemble of hypotheses on environment behavior (Osband et al., 2016; Lakshminarayanan et al., 2017). Figure 1 (right) demonstrates this selective uncertainty reduction by showcasing forward predictions of an ensembled model on two motion patterns of a walker agent. The expected high-reward walking behavior has been sufficiently explored and model hypotheses strongly agree (top), while little effort has been expended on reducing uncertainty over the expected low-reward falling behavior (bottom).

This paper demonstrates that exploring interactions through imagined positive futures can yield information-dense sampling and data-efficient learning. We present latent optimistic value exploration (LOVE), an algorithm that leverages optimism in the face of uncertain long-term rewards to guide exploration. Potential futures are imagined by an ensemble of latent variable models, and their predicted infinite-horizon performance is obtained in combination with associated value function estimates.
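As a minimal illustration of combining finite-horizon rollouts with value estimates (function names and shapes here are our own, not the paper's implementation), the infinite-horizon return of an imagined trajectory can be approximated by the discounted sum of H imagined rewards plus a discounted value-function bootstrap at the final latent state:

```python
import numpy as np

def infinite_horizon_return(rewards, terminal_value, gamma=0.99):
    """Predicted return of an imagined trajectory: discounted sum of
    H imagined rewards plus a discounted value-function bootstrap
    accounting for all rewards beyond the rollout horizon."""
    rewards = np.asarray(rewards, dtype=float)
    H = len(rewards)
    discounts = gamma ** np.arange(H)  # 1, gamma, gamma^2, ...
    return float((discounts * rewards).sum() + gamma ** H * terminal_value)

# Example: a 3-step imagined rollout with unit rewards and a value
# estimate of 2.0 at the final latent state.
ret = infinite_horizon_return([1.0, 1.0, 1.0], terminal_value=2.0, gamma=0.5)
# 1 + 0.5 + 0.25 + 0.125 * 2 = 2.0
```

Evaluating this quantity under each ensemble member yields the distribution over long-term performance that the exploration strategy then acts on.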
Training on a UCB objective over imagined futures yields a policy that behaves inherently optimistically and focuses on interactions with the potential to improve performance under the current world model. This provides a concise, differentiable framework for driving deep exploration without relying on stochasticity. We present empirical results on challenging visual control tasks that highlight the necessity of deep exploration in scenarios with sparse reward signals, and demonstrate improved sample efficiency on a selection of benchmarking environments from the DeepMind Control Suite (Tassa et al., 2018). We compare to both Dreamer (Hafner et al., 2019a), the current state-of-the-art model-based agent, and DrQ (Kostrikov et al., 2020), a concurrent model-free approach.
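The UCB objective can be sketched as follows (an illustrative toy example under our own naming and array conventions, not the paper's code): given an ensemble of return predictions per candidate behavior, the score is the ensemble mean plus a weighted ensemble standard deviation, so candidates with high epistemic disagreement receive an optimism bonus.

```python
import numpy as np

def ucb_objective(member_returns, beta=1.0):
    """UCB score per candidate behavior from an ensemble of return
    predictions of shape (K, B): K ensemble members, B candidates."""
    member_returns = np.asarray(member_returns, dtype=float)
    mean = member_returns.mean(axis=0)  # expected long-term performance
    std = member_returns.std(axis=0)    # epistemic disagreement across members
    return mean + beta * std            # optimism in the face of uncertainty

# Two candidates with similar predicted means: the ensemble agrees on
# the first (well-explored) but disagrees on the second (uncertain).
returns = np.array([[1.0, 0.8],
                    [1.0, 1.4],
                    [1.0, 1.1]])
scores = ucb_objective(returns, beta=1.0)
# The uncertain candidate receives the higher score, drawing an
# optimistic policy toward interactions that could improve performance.
```

In the differentiable setting described above, the policy would be trained by ascending the gradient of this score through the imagined rollouts rather than by ranking discrete candidates.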

2. RELATED WORK

Problem representation. Model-free approaches learn a policy by directly estimating performance from interaction data. While their asymptotic performance previously came at the cost of sample complexity (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018), recent advances in representation learning through contrastive methods and data augmentation have improved their efficiency (Srinivas et al., 2020; Laskin et al., 2020; Kostrikov et al., 2020). However, their implicit representation of the world can make generalization of learned behaviors under changing task specifications difficult. Model-based techniques leverage a structured representation of their environment that enables them to imagine potential interactions. The nature of the problem hereby dictates model complexity, ranging from linear models (Levine & Abbeel, 2014; Kumar et al., 2016) through Gaussian process models (Deisenroth & Rasmussen, 2011; Kamthe & Deisenroth, 2018) to deep neural networks (Chua et al., 2018; Clavera et al., 2018). In high-dimensional environments, latent variable models can provide concise representations that improve the efficiency of the learning process (Watter et al., 2015; Ha & Schmidhuber, 2018; Lee et al., 2019; Hafner et al., 2019a).



Figure 1: Left - illustrative example of an action-value mapping (red line) and associated samples (red dots). The agent aims to maximize the obtained value (green dot) and builds a model through interaction. Exploration based on maximization of the predicted mean can exploit local optima (orange dot), while information-gain bonuses may focus on uncertain regions with little potential for improvement (pink dot). Explicitly considering uncertainty over expected performance can help focus exploration on regions with high potential for improvement (blue dot). Right - reducing uncertainty over expected high-reward behaviors (walking, top); ignoring expected low-reward behaviors (falling, bottom).

