ADDRESSING EXTRAPOLATION ERROR IN DEEP OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) encompasses both online and offline regimes. Unlike its online counterpart, an offline RL agent is trained from logged data only, without interacting with the environment. Offline RL is therefore a promising direction for real-world applications, such as healthcare, where repeated interaction with the environment is prohibitive. However, since offline RL losses often involve evaluating state-action pairs that are not well covered by the training data, they can suffer from the errors introduced when the function approximator attempts to extrapolate those pairs' values. These errors are compounded by bootstrapping when the function approximator overestimates, which can cause the value function to grow unbounded and cripple learning. In this paper, we introduce a three-part solution to combat extrapolation errors: (i) behavior value estimation, (ii) ranking regularization, and (iii) reparametrization of the value function. We provide ample empirical evidence of the effectiveness of our method, showing state-of-the-art performance on the RL Unplugged (RLU) Atari dataset. Furthermore, we introduce new datasets for bsuite as well as the partially observable DeepMind Lab environments, on which our method outperforms state-of-the-art offline RL algorithms.

1. INTRODUCTION

Agents are, fundamentally, entities that map observations to actions, and they can be trained with reinforcement learning (RL) in either an online or offline fashion. When trained online, an agent learns through trial and error by interacting with its environment. Online RL has had considerable success recently: on Atari (Mnih et al., 2015), the game of Go (Silver et al., 2017), video games like StarCraft II and Dota 2 (Vinyals et al., 2019; Berner et al., 2019), and robotics (Andrychowicz et al., 2020). However, the requirement of extensive environmental interaction, combined with a need for exploratory behavior, makes these algorithms unsuitable and potentially unsafe for many real-world applications. In contrast, in the offline setting (Fu et al., 2020; Fujimoto et al., 2018; Gulcehre et al., 2020; Levine et al., 2020), also known as batch RL (Ernst et al., 2005; Lange et al., 2012), agents learn from a fixed dataset which is assumed to have been logged by other (possibly unknown) agents; see Fig. 1 for an illustration of the two settings. Learning purely from logged data makes these algorithms more widely applicable, including in problems such as healthcare and self-driving cars, where repeated interaction with the environment is costly and potentially unsafe or unethical, and where logged historical data is abundant. However, offline algorithms tend to perform considerably worse than their online counterparts. Although similar in principle, there are important differences between the two regimes. While it is useful for online agents to explore unknown regions of the state space so as to gain knowledge about the environment and improve their chances of finding a good policy (Schmidhuber, 1991), this is not the case in the offline setting: choosing actions that are not well represented in the dataset forces an offline agent to rely on the function approximator's ability to extrapolate.
This can lead to substantial errors during training as well as during deployment of the agent. During training, extrapolation errors are exacerbated by bootstrapping and the use of max operators (e.g., in Q-learning), where evaluating the loss entails taking the maximum over noisy and possibly overestimated values of the different possible actions. This can propagate the erroneous values, leading to extreme overestimation of the value function and potentially unbounded error; see Fujimoto et al. (2019b) and our remark in Appendix A. As we show empirically in Section 4.2, extrapolation errors are a different source of overestimation from those considered by standard methods such as Double DQN (Hasselt, 2010), and hence cannot be addressed by those approaches. In addition to extrapolation errors during training, a further degradation in performance can result from the use of greedy policies at test time, which maximize over value estimates extrapolated to under-represented actions. We propose a coherent set of techniques that work well together to combat extrapolation error and overestimation:

Behavior value estimation. First, we address extrapolation errors at training time. Instead of Q^{π*}, we estimate the value of the behavior policy, Q^{π_B}, thereby avoiding the max operator during training. To improve upon the behavior policy, we conduct what amounts to a single step of policy improvement by employing a greedy policy at test time. Surprisingly, this technique, with only one round of improvement, allows us to perform significantly better than the behavior policies and often outperform existing offline RL algorithms.

Ranking regularization. We introduce a max-margin regularizer that encourages the value function, represented as a deep neural network, to rank actions present in the observed rewarding episodes higher than any other actions.
Intuitively, this regularizer pushes down the value of all unobserved state-action pairs, thereby minimizing the chance of a greedy policy selecting actions under-represented in the dataset. Employing the regularizer during training minimizes the impact of the max operator used by the greedy policy at test time, i.e., this approach addresses extrapolation errors both at training time and (indirectly) at test time.

Reparametrization of Q-values. While behavior value estimation typically performs well, particularly when combined with ranking regularization, it allows for only one iteration of policy improvement. When more data is available, and hence we can trust our function approximator to capture more of the structure of the state space and thus generalize better, we can rely on Q-learning, which permits multiple policy improvement iterations. However, this exacerbates the overestimation issue. We propose, in addition to the ranking loss, a simple reparametrization of the value function that disentangles the scale of the Q-values from the relative ranks of the actions. This reparametrization allows us to introduce a regularization term on the scale of the value function alone, which reduces overestimation.

To evaluate our proposed method, we introduce new datasets based on the bsuite environments (Osband et al., 2019) as well as the partially observable DeepMind Lab environments (Beattie et al., 2016). We further evaluate our method and baselines on the RL Unplugged (RLU) Atari dataset (Gulcehre et al., 2020). We achieve new state-of-the-art (SOTA) performance on the RLU Atari dataset and outperform existing SOTA offline RL methods on our newly introduced datasets. Last but not least, we provide careful ablations and analyses that offer insights into our proposed method as well as other existing offline RL algorithms.
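To make the two training-time components concrete, the following is a minimal tabular sketch, not the paper's deep-network implementation; all function names (behavior_value_td_error, ranking_margin_loss) are illustrative, and the toy dataset and hyperparameters are assumptions:

```python
import numpy as np

def behavior_value_td_error(q, s, a, r, s_next, a_next, done, gamma=0.99):
    """One-step TD error toward the behavior policy's value Q^{pi_B}.

    The bootstrap target uses the *logged* next action a_next
    (SARSA-style), so no max over possibly extrapolated values is taken.
    q: value table of shape [num_states, num_actions].
    """
    target = r + gamma * (1.0 - done) * q[s_next, a_next]
    return target - q[s, a]

def ranking_margin_loss(q_row, logged_action, margin=1.0):
    """Max-margin ranking regularizer for a single state.

    Hinge-penalizes any action whose value comes within `margin` of (or
    exceeds) the logged action's value, pushing unobserved actions down
    relative to observed ones.
    """
    gaps = margin + q_row - q_row[logged_action]
    gaps[logged_action] = 0.0  # the logged action incurs no penalty
    return float(np.maximum(gaps, 0.0).sum())

# Tiny illustrative logged dataset of (s, a, r, s_next, a_next, done) tuples.
transitions = [(0, 1, 1.0, 1, 0, 0.0), (1, 0, 0.0, 0, 1, 1.0)]
q = np.zeros((2, 2))
lr = 0.5
for s, a, r, s_next, a_next, done in transitions:
    td = behavior_value_td_error(q, s, a, r, s_next, a_next, done)
    reg = ranking_margin_loss(q[s], a)
    q[s, a] += lr * td  # with a neural network one would instead minimize
                        # td**2 + lambda * reg by gradient descent
# Single step of policy improvement: act greedily w.r.t. q at test time.
greedy_action = int(np.argmax(q[0]))
```

The key contrast with Q-learning is in the bootstrap target: it evaluates the action actually taken in the data rather than the argmax, so training never queries the approximator at state-action pairs absent from the dataset.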
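The upward bias that the max operator induces on noisy value estimates, discussed above, is easy to demonstrate numerically. The following standalone snippet (illustrative, not from the paper) shows that the maximum over noisy estimates systematically overestimates the true best value:

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, num_samples = 10, 10_000
true_q = np.zeros(num_actions)  # every action is truly worth 0
# Pretend the function approximator returns the true value plus
# extrapolation noise for poorly covered state-action pairs.
noisy_q = true_q + rng.normal(0.0, 1.0, size=(num_samples, num_actions))
bias = noisy_q.max(axis=1).mean()
# max_a of the true values is 0, yet the average estimated max is about 1.5:
# E[max_a Qhat] >= max_a E[Qhat], and bootstrapping through such targets
# compounds the resulting overestimation.
```

This is precisely why avoiding the max operator during training (behavior value estimation) and suppressing the values of under-represented actions (ranking regularization) attack overestimation at its source.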

2. BACKGROUND AND PROBLEM STATEMENT

We consider, in this work, Markov Decision Processes (MDPs) defined by (S, A, P, R, ρ0, γ), where S is the set of all possible states and A the set of all possible actions. An agent starts in some state s0 ∼ ρ0(·), where ρ0(·) is a distribution over S, and takes actions according to its policy a ∼ π(·|s), a ∈ A, when in state s. It then observes a new state s′ and reward r according to the transition distribution P(s′|s, a) and reward function r(s, a). The state-action value function Q^π describes the expected discounted return of taking action a in state s and following π thereafter.
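For completeness, the standard definition of the state-action value function and its Bellman recursion, written in the notation above, are:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right],
\qquad
Q^{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')}\!\left[Q^{\pi}(s', a')\right].
```

Estimating Q^{π_B} for the behavior policy π_B only requires evaluating this recursion at actions present in the logged data.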



Figure 1: In online RL (left), the agent must interact with the environment to gather data to learn from. In offline RL (right), the agent must learn from a logged dataset.

Related work. Early examples of offline/batch RL include least-squares temporal difference methods (Bradtke & Barto, 1996; Lagoudakis & Parr, 2003) and fitted Q iteration (Ernst et al., 2005; Riedmiller, 2005). Recently, Agarwal et al. (2019a), Fujimoto et al. (2019b), Kumar et al. (2019), Siegel et al. (2020), Wang et al. (2020) and Ghasemipour et al. (2020) have proposed offline RL algorithms and shown that they outperform off-the-shelf off-policy RL methods. There also exist methods explicitly addressing the issues stemming from extrapolation error (Fujimoto et al., 2019b).

