UNCERTAINTY WEIGHTED OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning promises to learn effective policies from previously collected, static datasets without the need for exploration. However, existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We hypothesize that a key ingredient missing from existing methods is a proper treatment of uncertainty in the offline setting. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that models epistemic uncertainty to detect OOD state-action pairs and down-weights their contribution to the training objectives accordingly. Implementation-wise, we adopt a practical and effective dropout-based uncertainty estimation method that introduces very little overhead over existing RL algorithms. Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC outperforms existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts.

1. INTRODUCTION

Deep reinforcement learning (RL) has seen a surge of interest over recent years. It has achieved remarkable success in simulated tasks (Silver et al., 2017; Schulman et al., 2017; Haarnoja et al., 2018), where the cost of data collection is low. However, one of the drawbacks of RL is its difficulty in learning from prior experience. The application of RL to unstructured real-world tasks therefore remains in its early stages, due to the high cost of active data collection. It is thus crucial to make full use of previously collected datasets whenever large-scale online RL is infeasible. Offline batch RL algorithms offer a promising direction for leveraging prior experience (Lange et al., 2012). However, most prior off-policy RL algorithms (Haarnoja et al., 2018; Munos et al., 2016; Kalashnikov et al., 2018; Espeholt et al., 2018; Peng et al., 2019) fail on offline datasets, even on expert demonstrations (Fu et al., 2020). This sensitivity to the training data distribution is a well-known issue in practical offline RL algorithms (Fujimoto et al., 2019; Kumar et al., 2019; 2020; Peng et al., 2019; Yu et al., 2020). A large portion of the problem is attributed to actions or states not covered by the training set distribution. Since the value estimate on out-of-distribution (OOD) actions or states can be arbitrary, OOD value or reward estimates can incur destructive estimation errors that propagate through the Bellman loss and destabilize training. Prior attempts avoid OOD actions or states by imposing strong constraints or penalties that force the actor distribution to stay within the training data (Kumar et al., 2019; 2020; Fujimoto et al., 2019; Laroche et al., 2019). While such approaches achieve some degree of experimental success, they sacrifice the generalization ability of the Q-function.
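The propagation of errors from OOD value estimates through the Bellman loss can be illustrated with a toy tabular example (an illustrative sketch, not an experiment from this paper): in a small chain MDP, a single corrupted Q-estimate on a state-action pair that is absent from the dataset, and hence never corrected, is bootstrapped backwards through the max in the Bellman backup and contaminates the values of well-covered states.

```python
import numpy as np

def run_backups(q_init, transitions, rewards, gamma=0.9, iters=50):
    """Repeatedly apply the Bellman backup Q(s,a) <- r + gamma * max_a' Q(s',a')
    over the (s, a) pairs that appear in the offline dataset."""
    q = q_init.copy()
    for _ in range(iters):
        for (s, a), s_next in transitions.items():
            q[s, a] = rewards[(s, a)] + gamma * q[s_next].max()
    return q

n_states, n_actions = 3, 2
# Offline dataset of transitions in a deterministic chain MDP; the pair
# (2, 1) never appears in the data, so its Q-estimate is never updated.
transitions = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1, (2, 0): 2}
rewards = {k: 0.0 for k in transitions}
rewards[(1, 0)] = 1.0  # reaching state 2 pays a reward of 1

clean = run_backups(np.zeros((n_states, n_actions)), transitions, rewards)

# Seed an arbitrary estimate on the OOD pair: Q(2, 1) = 100. Since no data
# covers (2, 1), the bogus value is never corrected, yet the max over
# actions still bootstraps from it, inflating every upstream state.
q_bad = np.zeros((n_states, n_actions))
q_bad[2, 1] = 100.0
corrupted = run_backups(q_bad, transitions, rewards)
print(clean[0, 0], corrupted[0, 0])  # 0.9 vs. a value inflated ~90x
```

The clean run converges to the true optimal value, while the corrupted run converges to a wildly inflated value at the start state, even though the dataset covering states 0 and 1 is identical in both runs.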
For example, a state-action pair that does not appear in the training set can still lie within the training set distribution, but policies trained with strong penalties will avoid the unseen states regardless of whether the Q-function can produce an accurate estimate of the state-action value. Therefore, strong penalty-based solutions often promote a pessimistic and sub-optimal policy. In the extreme case, e.g., in certain benchmarking environments with human demonstrations, the best-performing offline algorithms only match the performance of a random agent (Fu et al., 2020), which demonstrates the need for robust offline RL algorithms.

In this paper, we hypothesize that a key aspect of a robust offline RL algorithm is a proper estimation and usage of uncertainty. On the one hand, one should be able to reliably assign an uncertainty score to any state-action pair; on the other hand, there should be a mechanism that uses the estimated uncertainty to prevent the model from learning from data points that induce high uncertainty scores. Empirically, we first verify the effectiveness of dropout uncertainty estimation at detecting OOD samples, and show that the uncertainty estimates are intuitive in a simple environment. With the uncertainty-based down-weighting scheme, our method significantly improves training stability over our chosen baseline (Kumar et al., 2019) and achieves state-of-the-art performance on a variety of standard benchmarking tasks for offline RL. Overall, our contributions can be summarized as follows: 1) We propose a simple and efficient technique (UWAC) to counter the effect of OOD samples with no additional loss terms or models. 2) We experimentally demonstrate the effectiveness of dropout uncertainty estimation for RL. 3) UWAC offers a novel way to stabilize offline RL. 4) UWAC achieves SOTA performance on common offline RL benchmarks and obtains significant performance gains on narrow human demonstrations.
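The dropout-based uncertainty weighting described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the network, the batch shapes, and the weighting constant `beta` are hypothetical, and `min(beta / Var, 1)` is one plausible form of the down-weighting factor. The key ingredients are MC dropout, i.e., keeping dropout active at inference and treating the variance across K stochastic forward passes as an epistemic-uncertainty proxy, and a per-sample weight on the Bellman loss that shrinks as that variance grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Q-network weights over an 8-dimensional (state, action) feature vector.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def q_forward(sa, drop_p=0.5):
    """One stochastic forward pass with a fresh dropout mask on the hidden layer."""
    h = np.maximum(sa @ W1, 0.0)
    mask = rng.random(h.shape) > drop_p
    h = h * mask / (1.0 - drop_p)       # inverted-dropout scaling
    return (h @ W2).squeeze(-1)

def mc_dropout_q(sa, k=32):
    """Mean and variance of Q over k dropout samples (epistemic proxy)."""
    samples = np.stack([q_forward(sa) for _ in range(k)])
    return samples.mean(axis=0), samples.var(axis=0)

def uncertainty_weights(var, beta=1.0):
    """Down-weight high-variance samples: w = min(beta / Var, 1), so
    low-uncertainty (in-distribution) samples keep full weight while
    highly uncertain (OOD-like) ones are suppressed."""
    return np.minimum(beta / (var + 1e-8), 1.0)

batch = rng.normal(size=(4, 8))          # a toy batch of (s, a) features
q_mean, q_var = mc_dropout_q(batch)
w = uncertainty_weights(q_var)
targets = np.zeros(4)                    # placeholder Bellman targets
loss = np.mean(w * (q_mean - targets) ** 2)   # uncertainty-weighted loss
print(w, loss)
```

Because the weight multiplies the existing loss rather than adding a new term, the scheme introduces no extra models and only the cost of the K dropout passes.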

2. RELATED WORK

In this work, we consider offline batch reinforcement learning (RL) on static datasets. Offline RL algorithms are especially prone to errors from inadequate coverage of the training set distribution, distributional shift during actor-critic training, and the variance induced by deep neural networks. Such errors have been extensively studied as "error propagation" in approximate dynamic programming (ADP) (Bertsekas & Tsitsiklis, 1996; Farahmand et al., 2010; Munos, 2003; Scherrer et al., 2015). Scherrer et al. (2015) obtain a bound on the point-wise Bellman error of approximate modified policy iteration (AMPI) with respect to the supremum of the function-approximation error at an arbitrary iteration. We adopt the theoretical tools of Kumar et al. (2019) and study the accumulation and propagation of Bellman errors in the offline setting. One of the most significant problems associated with off-policy and offline RL is the bootstrapping error (Kumar et al., 2019): when training encounters an action or state unseen in the training set, the critic's value estimate on out-of-distribution (OOD) samples can be arbitrary and incur an error that destabilizes convergence on all other states through the Bellman backup (Kumar et al., 2019; Fujimoto et al., 2019). Yu et al. (2020) train a model of the environment that captures epistemic uncertainty. The uncertainty estimate is used to penalize the reward estimates for uncertain states and actions, promoting a policy that is pessimistic against OOD actions and states. The main drawback of such a model-based approach is the introduction of a model of the environment, which is often very hard to train well.
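The uncertainty-penalized reward used by such model-based approaches can be summarized in one line. This is a schematic sketch of the general idea; the function names and the penalty coefficient `lam` are illustrative and not taken from any specific codebase.

```python
import numpy as np

def penalized_reward(r_hat, u, lam=1.0):
    """Pessimistic reward: r_tilde(s, a) = r_hat(s, a) - lam * u(s, a),
    where r_hat is the learned model's reward estimate and u is an
    uncertainty score for the state-action pair."""
    return r_hat - lam * u

r_hat = np.array([1.0, 1.0, 1.0])
u = np.array([0.0, 0.5, 2.0])       # higher uncertainty on OOD-like pairs
r_tilde = penalized_reward(r_hat, u)
print(r_tilde)                       # certain pairs keep their reward; uncertain pairs are penalized
```

The policy trained on `r_tilde` is steered away from uncertain regions, whereas UWAC instead leaves rewards untouched and re-weights the training loss itself.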
On the other hand, model-free approaches either train an agent that is pessimistic toward OOD states and actions (Wu et al., 2019; Kumar et al., 2020) or constrain the actor distribution to the training set action distribution (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Jaques et al., 2019; Fox et al., 2015; Laroche et al., 2019). However, the pessimistic assumption that all unseen states or actions are bad may lead to a sub-optimal agent and



Figure 1: Left. Average return vs. training epochs for our proposed method (red) and the baseline (brown) (Kumar et al., 2019) on the relocate-expert offline dataset. Right. Corresponding Q-target values vs. training epochs. Our proposed method achieves a much higher average return, with better training stability and more controlled Q-values.

