EXTREME Q-LEARNING: MAXENT RL WITHOUT ENTROPY

Abstract

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value function (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and, consequently, online and (for the first time) offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance on the D4RL benchmark, outperforming prior works by 10+ points on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks. Visualizations and code can be found on our website.¹

1. INTRODUCTION

Modern Deep Reinforcement Learning (RL) algorithms have shown broad success in challenging control (Haarnoja et al., 2018; Schulman et al., 2015) and game-playing domains (Mnih et al., 2013). While tabular Q-iteration and value-iteration methods are well understood, state-of-the-art RL algorithms often make theoretical compromises in order to deal with deep networks, high-dimensional state spaces, and continuous action spaces. In particular, standard Q-learning algorithms require computing the max or soft-max over the Q-function in order to fit the Bellman equations. Yet, almost all current off-policy RL algorithms for continuous control only indirectly estimate the Q-value of the next state with separate policy networks. Consequently, these methods only estimate the Q-function of the current policy, instead of the optimal Q*, and rely on policy improvement via an actor. Moreover, actor-critic approaches on their own have been shown to be catastrophic in the offline setting, where actions sampled from a policy are consistently out-of-distribution (Kumar et al., 2020; Fujimoto et al., 2018). As such, computing max Q for Bellman targets remains a core issue in deep RL.

One popular approach is to train Maximum Entropy (MaxEnt) policies, in the hope that they are more robust to modeling and estimation errors (Ziebart, 2010). However, the Bellman backup B* used in MaxEnt RL algorithms still requires computing the log-partition function over Q-values, which is usually intractable in high-dimensional action spaces. Instead, current methods like SAC (Haarnoja et al., 2018) rely on auxiliary policy networks and, as a result, do not estimate B*, the optimal Bellman backup.

Our key insight is to apply extreme value analysis, used in branches of finance and economics, to Reinforcement Learning. Ultimately, this allows us to directly model the LogSumExp over Q-functions in the MaxEnt framework.
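To make the difficulty concrete, the optimal MaxEnt Bellman backup referenced above can be written in the standard form below; the temperature parameter $\beta$ and the exact notation here are illustrative and may differ slightly from the conventions used later in the paper:

$$\mathcal{B}^* Q(s, a) \;=\; r(s, a) \;+\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[ V^*(s') \right], \qquad V^*(s') \;=\; \beta \log \int_{\mathcal{A}} \exp\!\left( \tfrac{Q(s', a')}{\beta} \right) \mathrm{d}a'.$$

The soft value $V^*(s')$ is exactly the LogSumExp (log-partition function) over actions: it can be evaluated directly for small discrete action sets, but for continuous actions the integral generally has no closed form, which is why methods such as SAC approximate it by sampling actions from an auxiliary policy network rather than computing B* directly.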


¹ https://div99.github.io/XQL/

