PLANNING WITH UNCERTAINTY: DEEP EXPLORATION IN MODEL-BASED REINFORCEMENT LEARNING

Abstract

Deep model-based Reinforcement Learning (RL) has shown super-human performance in many challenging domains. However, low sample efficiency and limited exploration remain leading obstacles in the field. In this paper, we demonstrate deep exploration in model-based RL by incorporating epistemic uncertainty into planning trees, circumventing the standard approach of propagating uncertainty through value learning. We evaluate this approach with the state-of-the-art model-based RL algorithm MuZero, and extend its training process to stabilize learning from explicitly exploratory trajectories. In our experiments, planning with uncertainty demonstrates deep exploration using standard uncertainty-estimation mechanisms, and with it provides significant gains in sample efficiency on hard-exploration problems.

1. INTRODUCTION

In February 2022, state-of-the-art performance in video compression on YouTube videos was achieved with the algorithm MuZero (Mandhane et al., 2022; Schrittwieser et al., 2020), setting a new milestone in successful deployments of reinforcement learning (RL). MuZero is a deep model-based RL algorithm that learns an abstracted model of the environment through interactions, and uses it for planning. While able to achieve state-of-the-art performance in extremely challenging domains, MuZero is limited by on-policy exploration that relies on random action selection. Effective, informed exploration is crucial in many problem settings (O'Donoghue et al., 2018), and can induce up to exponential gains in sample efficiency (Osband et al., 2016). Standard approaches to exploration rely on estimates of epistemic uncertainty (uncertainty caused by a lack of information, Hüllermeier & Waegeman, 2021) to drive exploration into under-explored areas of the state-action space (Bellemare et al., 2016; Sekar et al., 2020) or the learner's parameter space (Russo et al., 2018). These approaches often incorporate the uncertainty into value learning as a non-stationary reward bonus (Oudeyer & Kaplan, 2009), or directly approximate the total uncertainty in a value-like prediction propagated over future actions (O'Donoghue et al., 2018), to achieve exploration that is deep. Following the definition by Osband et al. (2016), we refer to deep exploration as exploration that is 1) directed over multiple consecutive time steps, and 2) farsighted with respect to information about future rewards. Having access to deep exploration enables the agent to aim for (through farsightedness) and reach (through directedness over multiple time steps) areas that are both attractive to explore and far (in number of transitions) from its initial or current state.
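To make the reward-bonus mechanism above concrete, the following is a minimal sketch of a count-based exploration bonus added to the environment reward, in the spirit of Bellemare et al. (2016) (who use pseudo-counts from a density model; plain tabular visit counts are substituted here for simplicity). All names (`CountBonus`, `beta`) are illustrative, not taken from any of the cited works.

```python
from collections import defaultdict
import math


class CountBonus:
    """Illustrative count-based exploration bonus (tabular sketch, not the
    pseudo-count density model of Bellemare et al., 2016)."""

    def __init__(self, beta=1.0):
        self.beta = beta                # bonus scale
        self.counts = defaultdict(int)  # visit counts N(s, a)

    def bonus(self, state, action):
        # The bonus decays as a state-action pair is visited more often, so
        # the shaped reward r + bonus is non-stationary: value learning on it
        # slowly propagates the "novelty" of distant states backward.
        self.counts[(state, action)] += 1
        n = self.counts[(state, action)]
        return self.beta / math.sqrt(n)


shaper = CountBonus(beta=0.5)
r_shaped = 1.0 + shaper.bonus("s0", "left")  # first visit: bonus = 0.5
```

The non-stationarity visible here (the same `(state, action)` pair yields a shrinking bonus on each call) is exactly the property that complicates value learning and motivates propagating uncertainty through the planning tree instead.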
While propagating the uncertainty from future actions through the value function enables propagation over a far horizon, it also introduces several problems. First, the values become non-stationary with the uncertainty bonus, introducing potential instability into value learning. Second, the horizon of the propagation is limited by the number of training steps: uncertainty in states that are far from the initial state will only propagate to the initial state after sufficiently many training steps, and not immediately. Third, the propagation speed is directly tied to the number of training steps, and as a result uncertainty from rarely-trained, high-uncertainty areas of the state space will propagate only slowly, if at all. To overcome these problems, this paper proposes to propagate uncertainty through the planning tree of model-based RL instead of a neural-network value function. This facilitates the decoupling of the propagation from the learning process by

