PLANNING WITH UNCERTAINTY: DEEP EXPLORATION IN MODEL-BASED REINFORCEMENT LEARNING

Abstract

Deep model-based reinforcement learning (RL) has achieved super-human performance in many challenging domains. However, low sample efficiency and limited exploration remain leading obstacles in the field. In this paper, we demonstrate deep exploration in model-based RL by incorporating epistemic uncertainty into planning trees, circumventing the standard approach of propagating uncertainty through value learning. We evaluate this approach with the state-of-the-art model-based RL algorithm MuZero, and extend its training process to stabilize learning from explicitly exploratory trajectories. In our experiments, planning with uncertainty achieves deep exploration with standard uncertainty-estimation mechanisms, and thereby provides significant gains in sample efficiency in hard-exploration problems.

1. INTRODUCTION

In February 2022, state-of-the-art performance in video compression on YouTube videos was achieved with the algorithm MuZero (Mandhane et al., 2022; Schrittwieser et al., 2020), setting a new milestone in successful deployments of reinforcement learning (RL). MuZero is a deep model-based RL algorithm that learns an abstract model of the environment through interactions, and uses it for planning. While able to achieve state-of-the-art performance in extremely challenging domains, MuZero is limited by on-policy exploration that relies on random action selection. Effective, informed exploration is crucial in many problem settings (O'Donoghue et al., 2018), and can induce up to exponential gains in sample efficiency (Osband et al., 2016). Standard approaches to exploration rely on estimates of epistemic uncertainty (uncertainty caused by a lack of information, Hüllermeier & Waegeman, 2021) to drive exploration into under-explored areas of the state-action space (Bellemare et al., 2016; Sekar et al., 2020) or the learner's parameter space (Russo et al., 2018). These approaches often incorporate the uncertainty into value learning as a non-stationary reward bonus (Oudeyer & Kaplan, 2009), or directly approximate the total uncertainty in a value-like prediction propagated over future actions (O'Donoghue et al., 2018), to achieve exploration that is deep. Following the definition by Osband et al. (2016), we refer to deep exploration as exploration that is 1) directed over multiple consecutive time steps, and 2) farsighted with respect to information about future rewards. Deep exploration enables the agent to aim for (through farsightedness) and reach (through directedness over multiple time steps) areas that are both attractive to explore and far (in number of transitions) from its initial or current state.
While propagating the uncertainty from future actions through the value function enables propagation over a far horizon, it also introduces several problems. First, the values become non-stationary with the uncertainty bonus, introducing potential instability into value learning. Second, the horizon of the propagation is limited by the number of training steps: uncertainty in states that are far from the initial state will only propagate to the initial state after sufficiently many training steps, not immediately. Third, the propagation speed is directly tied to the number of training steps; as a result, uncertainty from high-uncertainty areas of the state space that are rarely trained on will propagate barely or not at all. To overcome these problems, this paper proposes to propagate uncertainty through the planning tree of model-based RL instead of through a neural-network value function. This decouples the propagation from the learning process, since the uncertainty is propagated during online inference, and thereby addresses the challenges originating from propagating the uncertainty through value learning. To demonstrate that propagating uncertainty in the online planning trees of model-based methods like MuZero can provide deep exploration, this paper's contribution is divided into three parts. First, we propose a framework for propagating epistemic uncertainty about the planning process itself through a planning tree, for example in Monte Carlo Tree Search (MCTS, see Browne et al., 2012, for an overview) with learned models. Second, we propose to harness planning with uncertainty to achieve deep exploration in the standard RL setting (in contrast to prior approaches, e.g., Sekar et al., 2020, that required a pre-training phase to explore), by modifying the objective of an online planning phase with epistemic uncertainty, which we dub online planning to explore, or OP2E.
Third, to stabilize learning from off-policy exploratory decisions in MuZero, we extend MuZero's training process by splitting training into exploration episodes and exploitation episodes, and generating the value and policy targets differently depending on the type of episode. We conduct experiments on hard-exploration tasks to evaluate the capacity of OP2E to achieve deep exploration. In addition, we conduct an ablation study to evaluate the individual effects of the different extensions we propose for training from explicitly exploratory trajectories. In our experiments, OP2E significantly outperformed vanilla MuZero on hard-exploration tasks, demonstrating deep exploration that resulted in significant gains in sample efficiency. Our ablation study points out the potential value of discerning between positive and negative exploration trajectories and generating policy targets accordingly, as well as the resilience of n-step value targets to the presence of strongly off-policy trajectories. This paper is organized as follows: Section 2 provides relevant background on model-based RL, MuZero and epistemic uncertainty estimation in deep learning. Section 3 describes our contributions, starting with the framework for uncertainty propagation, followed by our approach for modifying planning with uncertainty to achieve deep exploration (OP2E), and finally the extensions proposed to stabilize learning from exploratory decisions. Section 4 evaluates OP2E on two hard-exploration tasks in comparison with vanilla MuZero and presents an ablation study of the different extensions to learning from exploratory decisions. Section 5 discusses related work, and Section 6 concludes the paper and discusses future work.
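To make the tree-propagation idea concrete, the following is a minimal, hypothetical sketch of backing up epistemic uncertainty through a planning tree alongside the value, so that action selection at the root can be uncertainty-aware without touching value learning. All names here (Node, local_sigma, beta, exploratory_score) are our own illustration, not the paper's implementation, and the additive discounted-uncertainty backup is one simple choice among several possible propagation rules.

```python
# Illustrative sketch only: propagating epistemic uncertainty up a planning
# tree during the backup step, alongside the usual value backup.
from dataclasses import dataclass, field

@dataclass
class Node:
    value_sum: float = 0.0     # sum of backed-up returns
    sigma_sum: float = 0.0     # sum of backed-up uncertainties
    visits: int = 0
    local_sigma: float = 0.0   # epistemic uncertainty of this node's model prediction
    children: dict = field(default_factory=dict)

def backup(path, leaf_value, leaf_sigma, gamma=0.997):
    """Back up value and uncertainty along the selected path.

    `path` is a root-to-leaf list of (node, reward) pairs; we walk it in
    reverse, discounting both the return and the accumulated uncertainty.
    """
    v, s = leaf_value, leaf_sigma
    for node, reward in reversed(path):
        v = reward + gamma * v
        s = node.local_sigma + gamma * s   # accumulate discounted uncertainty
        node.value_sum += v
        node.sigma_sum += s
        node.visits += 1

def exploratory_score(child, beta=1.0):
    """Uncertainty-augmented action value at the root (optimism bonus beta)."""
    if child.visits == 0:
        return float("inf")
    q = child.value_sum / child.visits
    sigma = child.sigma_sum / child.visits
    return q + beta * sigma
```

Selecting root actions by `exploratory_score` rather than the plain mean return is one way the planning objective could be biased toward high-uncertainty subtrees, which is the role the exploratory planning phase plays in OP2E.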

2.1. REINFORCEMENT LEARNING

In reinforcement learning (RL), an agent learns a behavior policy π(a|s) through interactions with an environment, by observing states (or observations), executing actions and receiving rewards. The environment is represented by a Markov Decision Process (MDP, Bellman, 1957) or a partially observable MDP (POMDP, Åström, 1965). An MDP M is a tuple M = ⟨S, A, ρ, P, R⟩, where S is a set of states, A a set of actions, ρ a probability distribution over the state space specifying the probability of starting at each state s ∈ S, R : S × A → R a bounded reward function, and P : S × A × S → [0, 1] a transition function, where P(s′|s, a) specifies the probability of transitioning from state s to state s′ after executing action a. In a POMDP M′ = ⟨S, A, ρ, P, R, Ω, O⟩, the agent receives observations o ∈ Ω, where O : S × A × Ω → [0, 1] specifies the probability O(o|s, a) of observing a possible observation o. In model-based reinforcement learning (MBRL), the agent learns a model of the environment through interactions and uses it to optimize its policy, often through planning. In deep MBRL (DMBRL), the agent utilizes deep neural networks (DNNs) as function approximators. Many RL approaches rely on learning a state-action Q-value function Q^π(s, a) = R(s, a) + γE[V^π(s′) | s′ ∼ P(·|s, a)], or the corresponding state value function V^π(s) = E[Q^π(s, a) | a ∼ π(·|s)], which represents the expected return from starting in state s (and possibly action a) and then following policy π.
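The Bellman relations above can be made concrete with a small tabular example. The following sketch (all names and the toy dimensions are our own, chosen for illustration) iterates the coupled equations for Q^π and V^π on a randomly generated MDP until they reach their fixed point:

```python
import numpy as np

# Toy tabular MDP; n_states, n_actions and the random P, R are illustrative.
n_states, n_actions, gamma = 3, 2, 0.9

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal((n_states, n_actions))                    # R(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy

# Iterate the two definitions from the text until V^pi converges:
#   Q^pi(s,a) = R(s,a) + gamma * E_{s'~P(.|s,a)}[V^pi(s')]
#   V^pi(s)   = E_{a~pi(.|s)}[Q^pi(s,a)]
V = np.zeros(n_states)
for _ in range(500):
    Q = R + gamma * P @ V        # expectation over s' via matrix product
    V = (pi * Q).sum(axis=1)     # expectation over a under pi
```

After convergence, `V` and `Q` satisfy both equations simultaneously, which is exactly the fixed-point characterization of the policy's value used throughout the paper.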

2.2. MONTE CARLO TREE SEARCH

MCTS is a planning algorithm that constructs a planning tree with the current state s_t at its root, in order to estimate the objective arg max_a Q^π(s_t, a). The algorithm repeatedly performs the following four steps: trajectory selection, expansion, simulation and backup. Starting from the root node n_0 ≡ s_t, the algorithm selects a trajectory in the existing tree based on the averaged returns q(n_k, a)

