EMAQ: EXPECTED-MAX Q-LEARNING OPERATOR FOR SIMPLE YET EFFECTIVE OFFLINE AND ONLINE RL

Abstract

Off-policy reinforcement learning (RL) holds the promise of sample-efficient learning of decision-making policies by leveraging past experience. However, in the offline RL setting, where a fixed collection of interactions is provided and no further interactions are allowed, it has been shown that standard off-policy RL methods can significantly underperform. Recently proposed methods often aim to address this shortcoming by constraining learned policies to remain close to the given dataset of interactions. In this work, we closely investigate an important simplification of BCQ (Fujimoto et al., 2018a), a prior approach for offline RL, which removes a heuristic design choice and naturally restricts extracted policies to remain exactly within the support of a given behavior policy. Importantly, in contrast to the original theoretical considerations, we derive this simplified algorithm through the introduction of a novel backup operator, Expected-Max Q-Learning (EMaQ), which is more closely related to the resulting practical algorithm. Specifically, in addition to the distribution support, EMaQ explicitly considers the number of samples and the proposal distribution, allowing us to derive new sub-optimality bounds which can serve as a novel measure of complexity for offline RL problems. In the offline RL setting, the main focus of this work, EMaQ matches and outperforms prior state-of-the-art on the D4RL benchmarks (Fu et al., 2020a). In the online RL setting, we demonstrate that EMaQ is competitive with Soft Actor Critic (SAC). The key contributions of our empirical findings are demonstrating the importance of careful generative model design for estimating behavior policies, and an intuitive notion of complexity for offline RL problems. With its simple interpretation and fewer moving parts, such as no explicit function approximator representing the policy, EMaQ serves as a strong yet easy-to-implement baseline for future work.

1. INTRODUCTION

Leveraging past interactions in order to improve a decision-making process is the hallmark goal of off-policy reinforcement learning (RL) (Precup et al., 2001; Degris et al., 2012). Effectively learning from past experiences can significantly reduce the amount of online interaction required to learn a good policy, and is a particularly crucial ingredient in settings where interactions are costly or safety is of great importance, such as robotics (Gu et al., 2017; Kalashnikov et al., 2018a), health (Murphy et al., 2001), dialog agents (Jaques et al., 2019), and education (Mandel et al., 2014). In recent years, with neural networks taking a more central role in the RL literature, there have been significant advances in developing off-policy RL algorithms for the function approximator setting, where policies and value functions are represented by neural networks (Mnih et al., 2015; Lillicrap et al., 2015; Gu et al., 2016b;a; Haarnoja et al., 2018; Fujimoto et al., 2018b). Such algorithms, while off-policy in nature, are typically trained in an online setting where algorithm updates are interleaved with additional online interactions. However, in purely offline RL settings, where a dataset of interactions is provided ahead of time and no additional interactions are allowed, the performance of these algorithms degrades drastically (Fujimoto et al., 2018a; Jaques et al., 2019). This shortcoming has motivated a line of recent work on RL algorithms designed specifically for the offline setting, particularly those based on dynamic programming and value estimation (Fujimoto et al., 2018a; Jaques et al., 2019; Kumar et al., 2019; Wu et al., 2019; Levine et al., 2020). Most proposed algorithms are designed with the key intuition that it is desirable to prevent policies from deviating too much from the provided collection of interactions. By moving far from the actions taken in the offline data, any subsequently learned policies or value functions may not generalize well, leading to the belief that certain actions will produce better outcomes than they actually would.
Furthermore, due to the dynamics of the MDP, taking out-of-distribution actions may lead to states not covered in the offline data, creating a snowball effect (Ross et al., 2011). In order to prevent learned policies from straying from the offline data, various methods have been introduced for regularizing the policy towards a base behavior policy, e.g. through a divergence penalty (Jaques et al., 2019; Wu et al., 2019; Kumar et al., 2019) or by clipping actions (Fujimoto et al., 2018a). Taking the above intuitions into consideration, in this work we investigate a simplification of the BCQ algorithm (Fujimoto et al., 2018a), a notable prior work in offline RL, which removes a heuristic design choice and has the property that extracted policies remain exactly within the support of a given behavior policy. In contrast to the theoretical considerations in the original work, we derive this simplified algorithm in a theoretical setup that more closely reflects the resulting algorithm. We introduce the Expected-Max Q-Learning (EMaQ) operator, which interpolates between the standard Q-function evaluation and Q-learning backup operators. The EMaQ operator makes explicit the relation between the proposal distribution and the number of samples used, and leads to sub-optimality bounds which introduce a novel notion of complexity for offline RL problems. In its practical implementation for the continuous control and function approximator setting, EMaQ has only two standard components (an estimate of the base behavior policy, and Q-functions) and does not explicitly represent a policy, requiring one fewer function approximator than prior approaches (Fujimoto et al., 2018a; Kumar et al., 2019; Wu et al., 2019). In online RL, EMaQ is competitive with Soft Actor Critic (SAC) (Haarnoja et al., 2018) and surpasses SAC in the deployment-efficient setting (Matsushima et al., 2020).
In the offline RL setting, the main focus of this work, EMaQ matches and outperforms prior state-of-the-art on the D4RL (Fu et al., 2020a) benchmark tasks. Through our explorations with EMaQ we make two intriguing findings. First, due to the strong dependence of EMaQ on the quality of the behavior policy estimate used, our results demonstrate the significant impact of careful modeling choices for the behavior policies that generated the offline interaction datasets. Second, relating to the introduced notion of complexity, across the diverse array of benchmark settings considered in this work we observe that surprisingly little modification to a base behavior policy is necessary to obtain a performant policy. As an example, we observe that while a random HalfCheetah policy obtains a return of 0, a policy that at each state uniformly samples only 5 actions and chooses the one with the best value obtains a return of 2000. The simplicity, intuitive interpretation, and strong empirical performance of EMaQ make it a great test-bed for further examination and theoretical analyses, and an easy-to-implement yet strong baseline for future work in offline RL.
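The HalfCheetah example above can be made concrete with a short sketch of the per-state action selection: sample a handful of candidate actions from the (estimated) behavior policy and act with the one the Q-function rates highest. Here `sample_actions` and `q_value` are hypothetical stand-ins for a learned generative model of the behavior policy and a learned Q-function, not part of any specific library.

```python
import numpy as np

def emaq_act(state, sample_actions, q_value, n_samples=5, rng=None):
    """Act by drawing n_samples candidate actions from the (estimated)
    behavior policy and returning the candidate with the highest Q-value.

    sample_actions(state, n, rng) -> list of candidate actions (hypothetical)
    q_value(state, action)        -> scalar Q estimate     (hypothetical)
    """
    if rng is None:
        rng = np.random.default_rng()
    candidates = sample_actions(state, n_samples, rng)
    values = np.array([q_value(state, a) for a in candidates])
    # Restricting the argmax to sampled candidates keeps the extracted
    # policy exactly within the support of the behavior policy.
    return candidates[int(np.argmax(values))]
```

With n_samples = 1 this reduces to simply imitating the behavior policy; increasing n_samples yields progressively greedier behavior while never leaving the behavior policy's support.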

2. BACKGROUND

Throughout this work, we represent a Markov Decision Process (MDP) as M = ⟨S, A, r, P, γ⟩, with state space S, action space A, reward function r : S × A → R, transition dynamics P, and discount γ. In offline RL, we assume access to a dataset of interactions with the MDP, which we represent as a collection of tuples D = {(s, a, s′, r, t)}_N, where t is an indicator variable that is set to True when s′ is a terminal state. We use µ to represent the behavior policy used to collect D, and, depending on the context, we overload this notation and use µ to represent an estimate of the true behavior policy. For a given policy π, we use the notation d^π(s), d^π(s, a) to represent the state-visitation and state-action visitation distributions, respectively. As alluded to above, a significant challenge for offline RL methods is the problem of distribution shift. At training time, there is no distribution shift in states: a fixed dataset D is used for training, and the policy and value functions are never evaluated on states outside of d^µ(s). However, distribution shift in actions is a very significant challenge. Consider the Bellman backup for obtaining the Q-function of a given policy π,

T^π Q(s, a) := r(s, a) + γ · E_{s′} E_{a′ ∼ π(a′|s′)} [Q(s′, a′)]   (1)
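As a minimal numerical sketch, the backup in Eq. (1) and its Expected-Max counterpart, T^N_µ Q(s, a) := r(s, a) + γ · E_{s′} E_{a′_{1..N} ∼ µ}[max_i Q(s′, a′_i)], can be estimated by Monte Carlo sampling. The callables `sample_action` and `q` below are hypothetical stand-ins for a behavior-policy sampler and a Q-function; this is an illustration of the operators, not the paper's full training procedure.

```python
import numpy as np

def bellman_backup(r, gamma, next_states, sample_action, q, rng):
    """Monte Carlo estimate of Eq. (1): r + γ E_{s'} E_{a'~π}[Q(s', a')]."""
    vals = [q(s, sample_action(s, 1, rng)[0]) for s in next_states]
    return r + gamma * np.mean(vals)

def emaq_backup(r, gamma, next_states, sample_action, q, n, rng):
    """Expected-Max backup: r + γ E_{s'} E_{a'_{1..n}~µ}[max_i Q(s', a'_i)].

    With n = 1 this recovers the policy-evaluation backup for µ; as n grows,
    the inner max approaches max_{a' in supp(µ)} Q(s', a'), interpolating
    toward the Q-learning backup restricted to µ's support.
    """
    vals = []
    for s in next_states:
        actions = sample_action(s, n, rng)
        vals.append(max(q(s, a) for a in actions))
    return r + gamma * np.mean(vals)
```

Because the max is taken over a finite sample from µ, the EMaQ target is monotonically non-decreasing in n, which is the source of the interpolation between evaluation and improvement described above.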

