ADDRESSING EXTRAPOLATION ERROR IN DEEP OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) encompasses both online and offline regimes. Unlike its online counterpart, an offline RL agent is trained using logged data only, without any interaction with the environment. Offline RL is therefore a promising direction for real-world applications, such as healthcare, where repeated interaction with the environment is prohibitive. However, since offline RL losses often involve evaluating state-action pairs not well covered by the training data, they can suffer from the errors introduced when the function approximator extrapolates the values of those pairs. When the function approximator overestimates, these errors are compounded by bootstrapping, allowing the value function to grow unbounded and crippling learning. In this paper, we introduce a three-part solution to combat extrapolation errors: (i) behavior value estimation, (ii) ranking regularization, and (iii) reparametrization of the value function. We provide ample empirical evidence of the effectiveness of our method, showing state-of-the-art performance on the RL Unplugged (RLU) Atari dataset. Furthermore, we introduce new datasets for bsuite as well as partially observable DeepMind Lab environments, on which our method outperforms state-of-the-art offline RL algorithms.

1. INTRODUCTION

Agents are, fundamentally, entities which map observations to actions, and they can be trained with reinforcement learning (RL) in either an online or an offline fashion. When trained online, an agent learns through trial and error by interacting with its environment. Online RL has had considerable success recently: on Atari (Mnih et al., 2015), the game of Go (Silver et al., 2017), video games like StarCraft II and Dota 2 (Vinyals et al., 2019; Berner et al., 2019), and robotics (Andrychowicz et al., 2020). However, the requirement of extensive environmental interaction, combined with the need for exploratory behavior, makes these algorithms unsuitable and potentially unsafe for many real-world applications. In contrast, in the offline setting (Fu et al., 2020; Fujimoto et al., 2018; Gulcehre et al., 2020; Levine et al., 2020), also known as batch RL (Ernst et al., 2005; Lange et al., 2012), agents learn from a fixed dataset assumed to have been logged by other (possibly unknown) agents; see Fig. 1 for an illustration of the two settings. Learning purely from logged data makes these algorithms more widely applicable, including to problems such as healthcare and self-driving cars, where repeated interaction with the environment is costly and potentially unsafe or unethical, and where logged historical data is abundant. However, offline algorithms tend to perform considerably worse than their online counterparts. Although the two regimes are similar in principle, there are important differences between them. While it is useful for online agents to explore unknown regions of the state space so as to gain knowledge about the environment and improve their chances of finding a good policy (Schmidhuber, 1991), this is not the case in the offline setting. Choosing actions not well-represented in the dataset for



Figure 1: In online RL (left), the agent must interact with the environment to gather data to learn from. In offline RL (right), the agent must learn from a logged dataset.
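The overestimation mechanism sketched above can be demonstrated in a toy simulation (our illustration, not part of the proposed method): even zero-mean approximation error on poorly covered actions, once passed through the max in a bootstrapped target, biases value estimates upward, and repeated bootstrapped updates compound that bias. The action count, noise scale, and discount below are arbitrary illustrative choices.

```python
import random

random.seed(0)

GAMMA = 0.99        # discount factor (illustrative choice)
NUM_ACTIONS = 10    # hypothetical number of actions
NOISE = 0.5         # scale of approximation error on poorly covered actions

# The true return of every action is zero; all rewards are zero.
q = [0.0] * NUM_ACTIONS

for step in range(200):
    # Zero-mean function-approximation error on actions the data does not cover well:
    noisy_q = [v + random.uniform(-NOISE, NOISE) for v in q]
    # The bootstrapped target takes a max over the noisy estimates,
    # which is biased upward even though each individual error is unbiased.
    target = 0.0 + GAMMA * max(noisy_q)
    q = [target] * NUM_ACTIONS

print(f"true value: 0.0, bootstrapped estimate after 200 updates: {q[0]:.2f}")
```

With a discount below one the estimate settles at a large positive value rather than truly diverging, but the same compounding effect with gamma closer to one, or with errors that grow as the estimates leave the data distribution, produces the unbounded growth described above.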

