UNCERTAINTY-DRIVEN EXPLORATION FOR GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Value-based methods are competitive when trained and tested in single environments. However, they fall short when trained on multiple environments with similar characteristics and tested on new ones from the same family. We investigate the potential reasons behind the poor generalization performance of value-based methods and discover that exploration plays a crucial role in these settings. Exploration is helpful not only for finding optimal solutions to the training environments but also for acquiring knowledge that helps generalization to other unseen environments. We show how to make value-based methods competitive in these settings by using uncertainty-driven exploration and distributional RL. Our algorithm is the first value-based approach to achieve state-of-the-art on both Procgen and Crafter, two challenging benchmarks for generalization in RL.

1. INTRODUCTION

Value-based methods (Watkins & Dayan, 1992) (which directly derive a policy from the value functions) tend to be competitive on singleton Markov decision processes (MDPs), where agents are trained and tested on the same environment (Mnih et al., 2013; Hessel et al., 2018; Badia et al., 2020). However, they fall short in contextual MDPs (CMDPs) (Wang et al., 2020; Ehrenberg et al., 2022), where agents are trained on a number of different environments that share a common structure and tested on unseen environments from the same family (Cobbe et al., 2019; Wang et al., 2020; Mohanty et al., 2021; Ehrenberg et al., 2022). In this work, we aim to understand why value-based approaches work well in singleton MDPs but not in contextual MDPs, and how we can make them competitive in CMDPs.

Most of the existing approaches for improving generalization in CMDPs have treated the problem as a pure representation learning problem, applying regularization techniques commonly used in supervised deep learning (Farebrother et al., 2018; Cobbe et al., 2018; Igl et al., 2019; Lee et al., 2020; Ye et al., 2020; Laskin et al., 2020; Raileanu et al., 2020). However, these methods neglect the unique structure of reinforcement learning (RL), namely that agents collect their own data by exploring their environments. This suggests that there may be other avenues for improving generalization in RL beyond representation learning.

Here, we identify the agent's exploration strategy as a key factor influencing generalization in contextual MDPs. First, exploration can accelerate training in RL, and since neural networks may naturally generalize, better exploration can result in better training performance and consequently better generalization performance. Moreover, in singleton MDPs, exploration can only benefit decisions in that environment, while in CMDPs exploration in one environment can also help decisions in other, potentially unseen, environments.
This is because learning about other parts of the environment can be useful in other MDPs even if it is not useful for the current one. As shown in Figure 1, trajectories that are suboptimal in certain MDPs may turn out to be optimal in other MDPs from the same family, so this knowledge can help the agent find the optimal policy more quickly in a new MDP encountered during training, and generalize better to new MDPs without additional training. One goal of exploration is to learn new things about the (knowable parts of the) environment so as to asymptotically reduce epistemic uncertainty. To model epistemic uncertainty (which is reducible by acquiring more data), we need to disentangle it from aleatoric uncertainty (which is irreducible and stems from the inherent stochasticity of the environment). As first observed by Raileanu & Fergus (2021), in CMDPs the same state can have different values depending on the environment, but the agent does not know which environment it is in, so it cannot perfectly predict the value of such states. This is a type of aleatoric uncertainty which can be modeled by learning a distribution over all possible values rather than a single point estimate (Bellemare et al., 2017).

Based on these observations, we propose Exploration via Distributional Ensemble (EDE), a method that uses an ensemble of Q-value distributions to encourage exploring states with large epistemic uncertainty. We evaluate our approach on Procgen (Cobbe et al., 2019) and Crafter (Hafner, 2022), two procedurally generated CMDP benchmarks for generalization in deep RL, demonstrating a significant improvement over more naïve exploration strategies. Ours is the first model-free value-based method to achieve state-of-the-art performance on these benchmarks, in terms of both sample efficiency and generalization, surpassing strong policy-optimization baselines (i.e., methods that learn a parameterized policy in addition to a value function) and even a model-based one.
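The core idea behind EDE can be illustrated with a small sketch: given an ensemble of distributional Q-heads, the disagreement between heads estimates epistemic uncertainty, while the spread within each head's return distribution reflects aleatoric uncertainty. The code below is a minimal illustration of this decomposition, with an uncertainty-bonus action rule for exploration; the function names, array shapes, and UCB-style bonus are our own illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def decompose_uncertainty(quantiles):
    """Split predictive uncertainty from an ensemble of quantile estimates.

    quantiles: array of shape (n_heads, n_actions, n_quantiles), where each
    ensemble head predicts a set of return quantiles per action
    (hypothetical layout for illustration).
    """
    # Mean Q-value per head and action: average over the quantile locations.
    q_mean = quantiles.mean(axis=-1)                 # (n_heads, n_actions)
    # Epistemic uncertainty: disagreement between ensemble heads.
    epistemic = q_mean.var(axis=0)                   # (n_actions,)
    # Aleatoric uncertainty: average within-head spread of the return distribution.
    aleatoric = quantiles.var(axis=-1).mean(axis=0)  # (n_actions,)
    return epistemic, aleatoric

def explore_action(quantiles, beta=1.0):
    """UCB-style choice: mean value plus an epistemic-uncertainty bonus."""
    epistemic, _ = decompose_uncertainty(quantiles)
    q_mean = quantiles.mean(axis=-1).mean(axis=0)    # (n_actions,)
    return int(np.argmax(q_mean + beta * np.sqrt(epistemic)))
```

With this rule, two actions of equal mean value are broken in favor of the one the ensemble disagrees on more, i.e., the one about which the agent is more epistemically uncertain.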
To summarize, this work: (i) identifies exploration as a key factor for generalization in CMDPs, (ii) supports this hypothesis using a didactic example in a tabular CMDP, (iii) proposes an exploration method based on minimizing the agent's epistemic uncertainty, and (iv) achieves state-of-the-art performance on two generalization benchmarks for deep RL, Procgen and Crafter.

2. BACKGROUND

Episodic Reinforcement Learning. A Markov decision process (MDP) is defined by the tuple $\mu = (\mathcal{S}, \mathcal{A}, R, P, \gamma, \rho_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R : \mathcal{S} \times \mathcal{A} \to [R_{\min}, R_{\max}]$ is the reward function, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ is the transition distribution, $\gamma \in (0, 1]$ is the discount factor, and $\rho_0 : \mathcal{S} \to \mathbb{R}_{\geq 0}$ is the initial state distribution. We further denote the trajectory of an episode by the sequence $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T, s_{T+1})$, where $r_t = R(s_t, a_t)$ and $T$ is the length of the trajectory, which can be infinite. If a trajectory is generated by a probabilistic policy $\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{\geq 0}$, then $Z^\pi = \sum_{t=0}^{T} \gamma^t r_t$ is a random variable describing the discounted return the policy achieves. The objective is to find a policy $\pi^\star$ that maximizes the expected discounted return, $\pi^\star = \arg\max_\pi \mathbb{E}_{\tau \sim p^\pi(\cdot)}[Z^\pi]$, where $p^\pi(\tau) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$. For simplicity, we write $\mathbb{E}_\pi$ instead of $\mathbb{E}_{\tau \sim p^\pi(\cdot)}$ to denote the expectation over trajectories sampled from the policy $\pi$. With a slight abuse of notation, we use $Z^\pi(s, a)$ to denote the conditional discounted return when starting at $s$ and taking action $a$ (i.e., $s_0 = s$ and $a_0 = a$). Finally, without loss of generality, we assume all measures are discrete and their values lie within $[0, 1]$.

Value-based methods (Watkins & Dayan, 1992) rely on a fundamental quantity in RL, the state-action value function, also referred to as the Q-function, $Q^\pi(s, a) = \mathbb{E}_\pi[Z^\pi \mid s_0 = s, a_0 = a]$. The Q-function of a policy is the fixed point of the Bellman operator $\mathcal{T}^\pi$ (Bellman, 1957), $\mathcal{T}^\pi Q(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}[R(s, a) + \gamma Q(s', a')]$. Policy optimization approaches (Williams, 1992; Sutton et al., 1999), on the other hand, seek to directly learn the optimal policy via Monte Carlo estimation of the return.
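As a concrete illustration of the Bellman operator above, one application of $\mathcal{T}^\pi$ to a tabular Q-function can be written in a few lines of NumPy. The array layout and names below are illustrative, not from the paper; iterating the backup converges to $Q^\pi$ because $\mathcal{T}^\pi$ is a $\gamma$-contraction:

```python
import numpy as np

def bellman_backup(Q, pi, R, P, gamma):
    """One application of the Bellman operator T^pi on a tabular Q-function.

    Q:  (S, A) state-action values
    pi: (S, A) policy, pi[s, a] = probability of taking a in s
    R:  (S, A) rewards
    P:  (S, A, S) transition probabilities
    (names and shapes are illustrative)
    """
    # V^pi(s') = sum_{a'} pi(a' | s') Q(s', a')
    v = (pi * Q).sum(axis=1)        # (S,)
    # (T^pi Q)(s, a) = R(s, a) + gamma * E_{s' ~ P(.|s,a)}[V^pi(s')]
    return R + gamma * P @ v
```

For example, in a two-state, one-action chain where both states transition to state 1, repeatedly applying the backup from $Q = 0$ converges to the unique fixed point $Q^\pi$.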



Figure 1: Exploration can help agents generalize to new situations at test time, even if it is not needed for finding the optimal policy on the training environments.

Bellemare et al. (2017) extend this procedure to the distribution of discounted returns: $\mathcal{T}^\pi Z(s, a) \stackrel{d}{=} R(s, a) + \gamma Z(s', a')$, where $s' \sim P(\cdot \mid s, a)$, $a' \sim \pi(\cdot \mid s')$, and $\stackrel{d}{=}$ denotes that the two random variables have the same distribution. This extension is referred to as distributional RL (we provide a more detailed description of QR-DQN, the distributional RL algorithm we use, in Appendix B.1). For value-based methods, the policy is derived directly from the Q-function as $\pi(a \mid s) = \mathbb{1}\!\left[a = \arg\max_{a'} Q(s, a')\right]$.
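To make the distributional Bellman operator concrete, the sketch below forms a sample-based target $r + \gamma Z(s', a')$ by applying the affine map to each quantile location, and computes a quantile-regression Huber loss in the style of QR-DQN. This is a simplified single-transition NumPy version under our own naming; the actual algorithm operates on minibatches of network outputs:

```python
import numpy as np

def distributional_target(z_next, r, gamma):
    """Sample-based distributional Bellman target r + gamma * Z(s', a').

    z_next: quantile estimates of the return at (s', a'), shape (n_quantiles,).
    The scalar affine map is applied to every quantile location.
    """
    return r + gamma * z_next

def quantile_huber_loss(pred, target, kappa=1.0):
    """Quantile-regression Huber loss between predicted and target quantiles."""
    n = len(pred)
    taus = (np.arange(n) + 0.5) / n            # quantile midpoints tau_i
    # Pairwise TD errors u_ij between every target and predicted quantile.
    u = target[None, :] - pred[:, None]        # (n, n)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weight |tau_i - 1{u < 0}| pushes pred[i] toward the
    # tau_i-quantile of the target distribution.
    loss = np.abs(taus[:, None] - (u < 0)) * huber / kappa
    return loss.mean()
```

Minimizing this loss moves the predicted quantile locations toward the corresponding quantiles of the target distribution, which is how QR-DQN-style methods approximate $Z^\pi$ without a parametric density.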
