UNCERTAINTY-DRIVEN EXPLORATION FOR GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Value-based methods are competitive when trained and tested in single environments. However, they fall short when trained on multiple environments with similar characteristics and tested on new ones from the same family. We investigate the potential reasons behind the poor generalization performance of value-based methods and discover that exploration plays a crucial role in these settings. Exploration is helpful not only for finding optimal solutions to the training environments but also for acquiring knowledge that helps generalization to other unseen environments. We show how to make value-based methods competitive in these settings by using uncertainty-driven exploration and distributional RL. Our algorithm is the first value-based approach to achieve state-of-the-art performance on both Procgen and Crafter, two challenging benchmarks for generalization in RL.

1. INTRODUCTION

Value-based methods (Watkins & Dayan, 1992), which directly derive a policy from the value function, tend to be competitive on singleton Markov decision processes (MDPs), where agents are trained and tested on the same environment (Mnih et al., 2013; Hessel et al., 2018; Badia et al., 2020). However, they fall short in contextual MDPs (CMDPs) (Wang et al., 2020; Ehrenberg et al., 2022), where agents are trained on a number of different environments that share a common structure and tested on unseen environments from the same family (Cobbe et al., 2019; Wang et al., 2020; Mohanty et al., 2021; Ehrenberg et al., 2022). In this work, we aim to understand why value-based approaches work well in singleton MDPs but not in contextual MDPs, and how we can make them competitive in CMDPs.

Most existing approaches for improving generalization in CMDPs treat the problem as a pure representation learning problem, applying regularization techniques commonly used in supervised deep learning (Farebrother et al., 2018; Cobbe et al., 2018; Igl et al., 2019; Lee et al., 2020; Ye et al., 2020; Laskin et al., 2020; Raileanu et al., 2020). However, these methods neglect the unique structure of reinforcement learning (RL), namely that agents collect their own data by exploring their environments. This suggests that there may be other avenues for improving generalization in RL beyond representation learning.

Here, we identify the agent's exploration strategy as a key factor influencing generalization in contextual MDPs. First, exploration can accelerate training in RL, and since neural networks tend to generalize naturally, better exploration can result in better training performance and, consequently, better generalization performance. Moreover, in singleton MDPs, exploration can only benefit decisions in that environment, whereas in CMDPs exploration in one environment can also help decisions in other, potentially unseen, environments. This is because learning about other parts of the environment can be useful in other MDPs even if it is not useful for the current one. As shown in Figure 1, trajectories that are suboptimal in certain MDPs may turn out to be optimal in other MDPs from the same family, so this knowledge can help the agent find the optimal policy more quickly in a new MDP encountered during training, and generalize better to new MDPs without additional training.

One goal of exploration is to learn new things about the (knowable parts of the) environment so as to asymptotically reduce epistemic uncertainty. To model epistemic uncertainty (which is reducible by acquiring more data), we need to disentangle it from aleatoric uncertainty (which is irreducible and stems from the inherent stochasticity of the environment). As first observed by Raileanu & Fergus (2021), in CMDPs the same state can have different values depending on the environment,

