EXTREME Q-LEARNING: MAXENT RL WITHOUT ENTROPY

Abstract

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value function (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and, consequently, online and (for the first time) offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance on the D4RL benchmark, outperforming prior works by 10+ points on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks. Visualizations and code can be found on our website. 1

1. INTRODUCTION

Modern Deep Reinforcement Learning (RL) algorithms have shown broad success in challenging control (Haarnoja et al., 2018; Schulman et al., 2015) and game-playing domains (Mnih et al., 2013). While tabular Q-iteration and value-iteration methods are well understood, state-of-the-art RL algorithms often make theoretical compromises in order to deal with deep networks, high-dimensional state spaces, and continuous action spaces. In particular, standard Q-learning algorithms require computing the max or soft-max over the Q-function in order to fit the Bellman equations. Yet, almost all current off-policy RL algorithms for continuous control only indirectly estimate the Q-value of the next state using separate policy networks. Consequently, these methods only estimate the Q-function of the current policy, instead of the optimal Q*, and rely on policy improvement via an actor. Moreover, actor-critic approaches on their own have been shown to be catastrophic in offline settings, where actions sampled from a policy are consistently out-of-distribution (Kumar et al., 2020; Fujimoto et al., 2018). As such, computing the maximal Q-value for Bellman targets remains a core issue in deep RL.

One popular approach is to train Maximum Entropy (MaxEnt) policies, in the hope that they are more robust to modeling and estimation errors (Ziebart, 2010). However, the Bellman backup B* used in MaxEnt RL algorithms still requires computing the log-partition function over Q-values, which is usually intractable in high-dimensional action spaces. Instead, current methods like SAC (Haarnoja et al., 2018) rely on auxiliary policy networks, and as a result do not estimate B*, the optimal Bellman backup.

Our key insight is to apply the extreme value analysis used in branches of Finance and Economics to Reinforcement Learning. Ultimately, this allows us to directly model the LogSumExp over Q-functions in the MaxEnt framework.
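To see why the log-partition function is the central quantity, consider a single state with a finite action set, where it is still tractable. The sketch below (illustrative only; names are not from our released code) computes the soft value V(s) = β log Σ_a exp(Q(s, a)/β) and shows that it upper-bounds the hard max and approaches it as the temperature β → 0; in continuous action spaces, this sum becomes an intractable integral over actions:

```python
import numpy as np
from scipy.special import logsumexp

def soft_value(q, beta):
    """Soft (MaxEnt) value V = beta * log sum_a exp(Q(a) / beta)."""
    return beta * logsumexp(q / beta)

q = np.array([1.0, 2.0, 3.0])  # Q-values for the actions of one state
v_soft = soft_value(q, beta=0.5)
v_hard = q.max()

# The soft value upper-bounds the hard max and recovers it as beta -> 0.
print(v_soft, v_hard, soft_value(q, beta=1e-3))
```

Using `logsumexp` rather than a naive `log(sum(exp(...)))` keeps the computation numerically stable for small β, where the exponentials would otherwise overflow.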
Intuitively, reward- or utility-seeking agents consider the maximum of the set of possible future returns. Extreme Value Theory (EVT) tells us that the maximum of values drawn from any exponentially tailed distribution follows the Generalized Extreme Value (GEV) Type-1 distribution, also referred to as the Gumbel distribution G. The Gumbel distribution is thus a prime candidate for modeling errors in Q-functions. In fact, McFadden's 2000 Nobel Prize-winning work in Economics on discrete choice models (McFadden, 1972) showed that soft-optimal utility functions with logit (or softmax) choice probabilities arise naturally when utilities are assumed to have Gumbel-distributed errors. This was subsequently generalized to stochastic MDPs by Rust (1986). Nevertheless, these results have remained largely unknown in the RL community. By introducing a novel loss optimization framework, we bring them into the world of modern deep RL.

Empirically, we find that even modern deep RL approaches, for which errors are typically assumed to be Gaussian, exhibit errors that better approximate the Gumbel distribution; see Figure 1. By assuming errors to be Gumbel distributed, we obtain Gumbel Regression, a consistent estimator of log-partition functions even in continuous spaces. Furthermore, making this assumption about Q-values lets us derive a new Bellman loss objective that directly solves for the optimal MaxEnt Bellman operator B*, instead of the operator under the current policy, B^π. As soft optimality emerges from our framework, we can run MaxEnt RL independently of the policy. In the online setting, we avoid using a policy network to explicitly compute entropies. In the offline setting, we completely avoid sampling from learned policy networks, minimizing the aforementioned extrapolation error. Our resulting algorithms surpass or consistently match state-of-the-art (SOTA) methods while being practically simpler.
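The EVT statement above is easy to check numerically: block maxima of Gaussian (exponentially tailed) samples are well fit by a Gumbel distribution. A small illustrative sketch using scipy (not from our codebase):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Maxima over blocks of 100 draws from an exponentially tailed
# distribution (here, a standard Gaussian).
block_maxima = rng.standard_normal((20000, 100)).max(axis=1)

# Fit a Gumbel (GEV Type-1) distribution to the block maxima and
# measure the goodness of fit with the KS statistic (parameters are
# fit on the same data, so this is only a descriptive check).
loc, scale = stats.gumbel_r.fit(block_maxima)
ks = stats.kstest(block_maxima, "gumbel_r", args=(loc, scale)).statistic
print(f"loc={loc:.3f} scale={scale:.3f} KS={ks:.4f}")
```

The fitted location sits near the expected maximum of 100 standard normals (roughly 2.3–2.5), and the KS distance is small, consistent with the Gumbel being the limiting law for maxima of exponentially tailed samples.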
In this paper we outline the theoretical motivation for using Gumbel distributions in reinforcement learning, and show how they can be used to derive practical online and offline MaxEnt RL algorithms. Concretely, our contributions are as follows:

• We motivate Gumbel Regression and show that it allows calculation of the log-partition function (LogSumExp) in continuous spaces. We apply it to MDPs to present a novel loss objective for RL based on maximum-likelihood estimation.

• Our formulation extends soft-Q learning to offline RL as well as to continuous action spaces, without the need for policy entropies. It allows us to compute optimal soft-values V* and soft-Bellman updates B* using SGD, which are usually intractable in continuous settings.

• We provide the missing theoretical link between soft and conservative Q-learning, showing how these formulations can be made equivalent. We also show how MaxEnt RL emerges naturally from vanilla RL as a form of conservatism in our framework.

• Finally, we empirically demonstrate strong results in offline RL, improving over prior methods by a large margin on the D4RL Franka Kitchen tasks, and performing moderately better than SAC and TD3 in online RL, while theoretically avoiding actor-critic formulations.
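As a sketch of how Gumbel Regression can recover the LogSumExp, suppose (an illustrative assumption here; the precise objective is developed later in the paper) that the Gumbel maximum-likelihood fit of a location parameter v at fixed scale β reduces, up to constants, to minimizing the convex loss E[e^{(x−v)/β} − (x−v)/β − 1]. Setting its derivative to zero gives E[e^{(x−v)/β}] = 1, so the minimizer is exactly the empirical log-partition β log E[e^{x/β}]:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

beta = 2.0
rng = np.random.default_rng(1)
x = rng.normal(size=5000)  # stand-in for sampled values (e.g. Q-values)

def gumbel_loss(v):
    # Convex loss arising from Gumbel maximum likelihood at fixed scale beta.
    z = (x - v) / beta
    return np.mean(np.exp(z) - z - 1.0)

v_hat = minimize_scalar(gumbel_loss).x

# Closed-form minimizer: the log-partition (soft-max) of the samples,
# beta * log mean_i exp(x_i / beta).
target = beta * (logsumexp(x / beta) - np.log(len(x)))
print(v_hat, target)
```

Note that no explicit maximum is ever computed: a scalar regression over samples yields the soft-maximum, which is what makes the estimator usable in continuous spaces.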

2. PRELIMINARIES

In this section we introduce Maximum Entropy (MaxEnt) RL and Extreme Value Theory (EVT), which we use to motivate our framework for estimating extremal values in RL. We consider an infinite-horizon Markov decision process (MDP), defined by the tuple (S, A, P, r, γ), where S and A represent the state and action spaces, P(s′|s, a) represents the environment dynamics, r(s, a) the reward function, and γ ∈ (0, 1) the discount factor. In the offline RL setting, we are given a dataset D = {(s, a, r, s′)} of tuples sampled from trajectories under a behavior policy π_D, without any additional environment interaction. We use ρ_π(s) to denote the distribution of states that a policy π(a|s) generates. In the MaxEnt framework, an MDP with entropy regularization is referred to as a soft-MDP (Bloem & Bambos, 2014), and we often adopt this formulation.

2.1. MAXIMUM ENTROPY RL

Standard RL seeks to learn a policy that maximizes the expected sum of (discounted) rewards $\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, for $(s_t, a_t)$ drawn at timestep $t$ from the trajectory distribution that π generates. We consider a generalized version of Maximum Entropy RL that augments the standard reward objective with the KL-divergence between the policy and a reference distribution µ:

$$\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) - \beta \, \mathrm{KL}\big(\pi(\cdot|s_t) \,\|\, \mu(\cdot|s_t)\big) \right)\right],$$

where β is a temperature parameter controlling the strength of the regularization. Taking µ to be the uniform distribution over actions recovers the standard entropy-regularized (soft-MDP) objective up to a constant.
* Equal Contribution
1 https://div99.github.io/XQL/
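For a single-state (bandit) version of this KL-regularized objective, the soft-optimal policy π*(a) ∝ µ(a) exp(Q(a)/β) and its value β log Σ_a µ(a) exp(Q(a)/β) can be verified directly; the numbers below are purely illustrative:

```python
import numpy as np
from scipy.special import logsumexp

beta = 0.5
q = np.array([1.0, 2.0, 3.0])    # Q-values for a single state
mu = np.array([0.5, 0.3, 0.2])   # reference distribution over actions

def kl_objective(pi):
    # E_pi[Q] - beta * KL(pi || mu): the per-state regularized objective.
    return pi @ q - beta * np.sum(pi * np.log(pi / mu))

# Soft-optimal policy: pi*(a) proportional to mu(a) * exp(Q(a) / beta).
pi_star = mu * np.exp(q / beta)
pi_star /= pi_star.sum()

# Its objective value equals the soft value
# V* = beta * log sum_a mu(a) * exp(Q(a) / beta).
v_star = beta * logsumexp(q / beta, b=mu)
print(kl_objective(pi_star), v_star)
```

Any other policy, such as the uniform one, attains a strictly lower objective, which is the per-state counterpart of the LogSumExp soft value that the rest of the paper estimates without enumerating actions.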

