MUTUAL INFORMATION REGULARIZED OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning (RL) aims to learn an effective policy from offline datasets without active interactions with the environment. The major challenge of offline RL is the distribution shift that arises when out-of-distribution actions are queried, which biases the policy improvement direction through extrapolation errors. Most existing methods address this problem by penalizing the policy for deviating from the behavior policy during policy improvement, or by making conservative updates to value functions during policy evaluation. In this work, we propose MISA, a novel framework that approaches offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. Intuitively, mutual information measures the mutual dependence of actions and states, reflecting how a behavior agent reacts to environment states during data collection. To effectively exploit this information for policy learning, MISA constructs lower bounds of the mutual information, parameterized by the policy and the Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset; in this way, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm augments both policy evaluation and policy improvement with a mutual information regularization. MISA is a general offline RL framework that unifies conservative Q-learning (CQL) and behavior-regularization methods (e.g., TD3+BC) as special cases. Our experiments show that MISA performs significantly better than existing methods and achieves a new state of the art on a wide range of tasks in the D4RL benchmark.

1. INTRODUCTION

Reinforcement learning (RL) has made remarkable achievements in solving sequential decision-making problems, ranging from game playing (Mnih et al., 2013; Silver et al., 2017; Berner et al., 2019) to robot control (Levine et al., 2016; Kahn et al., 2018; Savva et al., 2019). However, its success relies heavily on 1) an environment to interact with for data collection and 2) an online algorithm that improves the agent based only on its own trial-and-error experiences. These requirements make RL algorithms impractical in real-world safety-sensitive scenarios where interactions with the environment are dangerous or prohibitively expensive, such as autonomous driving and robot manipulation involving humans (Levine et al., 2020; Kumar et al., 2020). Offline RL therefore studies the problem of learning decision-making agents from experiences previously collected by other agents, when interacting with the environment is costly or not allowed. Though much in demand, extending RL algorithms to offline datasets is challenged by the distributional shift between the data-collecting policy and the learning policy. Specifically, a typical RL algorithm alternates between evaluating the Q-values of a policy and improving the policy to obtain a better cumulative return under the current value estimation. In the offline setting, policy improvement often involves querying out-of-distribution (OOD) state-action pairs that never appear in the dataset, for which the Q-values are over-estimated due to the extrapolation error of neural networks. As a result, the policy improvement direction is erroneously affected, eventually leading to a catastrophic explosion of value estimates as well as policy collapse as errors accumulate.
Existing methods (Kumar et al., 2020; Wang et al., 2020; Fujimoto & Gu, 2021; Yu et al., 2021) tackle this problem either by forcing the learned policy to stay close to the behavior policy (Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021) or by generating low value estimates for OOD actions (Nachum et al., 2017; Kumar et al., 2020; Yu et al., 2021). Though these methods are effective at alleviating the distributional shift of the learning policy, the improved policy is unconstrained and might still deviate from the data distribution. A natural question thus arises: can we directly constrain the policy improvement direction to lie in the data manifold? In this paper, we step back and consider the offline dataset from a new perspective, i.e., the Mutual Information between States and Actions (MISA). Viewing state and action as two random variables, the mutual information represents the reduction of uncertainty about actions given certain states, also known as information gain in information theory (Nowozin, 2012). Therefore, mutual information is an appealing metric for sufficiently acquiring knowledge from a dataset and characterizing a behavior policy. We introduce it into offline RL, for the first time, as a regularization that directly constrains the policy improvement direction. Specifically, to allow practical optimization of state-action mutual information estimates, we introduce the MISA lower bound on the mutual information of state-action pairs, which connects mutual information with RL by treating a parameterized policy as a variational distribution and the Q-values as energy functions. We show that this lower bound can be interpreted as the likelihood, on the offline dataset, of a non-parametric policy that represents the one-step improvement of the current policy under the current value estimation. Maximizing the MISA lower bound is thus equivalent to directly regularizing the policy improvement to stay within the dataset manifold.
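To make the non-parametric one-step improved policy concrete, the sketch below evaluates its log-likelihood on a dataset state-action pair, estimating the partition function from Q-values of actions sampled from the current policy. This is a toy numpy sketch under our own naming, not the paper's implementation; `log_improved_policy` and its arguments are hypothetical.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def log_improved_policy(log_pi_a, q_a, q_policy_samples):
    """Log-density of the one-step improved policy
        pi'(a|s) proportional to pi(a|s) * exp(Q(s, a)),
    where the partition function Z(s) = E_{a'~pi(.|s)}[exp(Q(s, a'))]
    is estimated by Monte Carlo from Q-values of actions sampled
    from the current policy. Maximizing this quantity over dataset
    (s, a) pairs is the regularization effect described in the text."""
    log_z = logsumexp(q_policy_samples) - np.log(len(q_policy_samples))
    return log_pi_a + q_a - log_z
```

With a uniform policy over a small discrete action set and the exact action support used as the sample set, exponentiating this log-density over all actions recovers a proper distribution that upweights high-Q actions, matching the interpretation of the improved policy as a softmax reweighting of the current policy.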
However, the constructed lower bound involves an integral over a self-normalized energy-based distribution, whose gradient is intractable to compute exactly. To address this, we adopt Markov chain Monte Carlo (MCMC) sampling to produce unbiased gradient estimates of the MISA lower bound. Theoretically, MISA is a general framework for offline RL that unifies several existing offline RL paradigms, including behavior regularization and conservative learning. As examples, we show that TD3+BC (Fujimoto & Gu, 2021) and CQL (Kumar et al., 2020) are degenerate cases of MISA. In our experiments, we demonstrate that MISA achieves significantly better performance than state-of-the-art methods on various environments of the D4RL (Fu et al., 2020) benchmark. Additional ablation studies, visualizations, and limitations are discussed to better understand the proposed method. Our code will be released upon publication.
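To make the two degenerate cases concrete, below are minimal numpy sketches of the published objectives they correspond to: the TD3+BC policy loss (Fujimoto & Gu, 2021) and the CQL(H) conservative penalty for a discrete action space (Kumar et al., 2020). The function names and batch shapes are our own simplifying assumptions for illustration, not the reference implementations.

```python
import numpy as np

def td3_bc_policy_loss(q_pi, pi_actions, data_actions, alpha=2.5):
    """TD3+BC policy objective: minimize
        -lambda * Q(s, pi(s)) + ||pi(s) - a||^2,
    with lambda = alpha / mean(|Q|) normalizing the Q scale.
    The behavior cloning term keeps the policy near dataset actions."""
    lam = alpha / (np.mean(np.abs(q_pi)) + 1e-8)
    bc = np.mean(np.sum((pi_actions - data_actions) ** 2, axis=-1))
    return -lam * np.mean(q_pi) + bc

def cql_penalty(q_values, data_actions):
    """CQL(H) regularizer for discrete actions:
        E_s[ logsumexp_a Q(s, a) - Q(s, a_data) ],
    which pushes Q down on all actions while pushing it up on
    dataset actions; the penalty is always non-negative."""
    lse = np.log(np.sum(np.exp(q_values), axis=1))
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return float(np.mean(lse - q_data))
```

Both objectives constrain only the current policy or the Q-function; neither directly constrains the policy improvement direction, which is the gap MISA targets.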

2. RELATED WORK

Offline Reinforcement Learning. The most critical challenge in extending an off-policy RL algorithm to the offline setup is the distribution shift between the behavior policy, i.e., the policy used for data collection, and the learning policy. To tackle this challenge, most offline RL algorithms adopt a conservative learning framework. They either regularize the learning policy to stay close to the behavior policy (Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021; Siegel et al., 2020; Wang et al., 2020), or force Q-values to be low for OOD state-action pairs (Nachum et al., 2017; Kumar et al., 2020; Yu et al., 2021). For example, TD3+BC (Fujimoto & Gu, 2021) adds a behavior cloning (BC) term to TD3 (Fujimoto et al., 2018), which encourages the policy to stay on the data manifold; CQL (Kumar et al., 2020), from the Q-value perspective, penalizes OOD state-action pairs for producing high Q-value estimates and learns a lower bound of the true value function. However, their policy improvement direction is unconstrained and might deviate from the data distribution. On the other hand, SARSA-style updates (Sutton & Barto, 2018) have been adopted so that only in-distribution state-action pairs are queried (Peng et al., 2019; Kostrikov et al., 2022). Nevertheless, without explicitly querying the Bellman optimality equation, they limit the policy's ability to produce unseen actions. Our proposed MISA follows the conservative framework and directly regularizes the policy improvement direction to lie within the data manifold via mutual information, which exploits the dataset information more fully while learning a conservative policy.

Mutual Information Estimation. Mutual information is a fundamental quantity in information theory, statistics, and machine learning. However, direct computation of mutual information is intractable, as it involves computing the log-partition function of a high-dimensional variable.
Thus, how to estimate the mutual information I(X;Z) between random variables X and Z accurately and efficiently is a critical issue. One straightforward lower bound for mutual information estimation is the Barber-Agakov bound (Barber & Agakov, 2004), which introduces an additional variational distribution q(z | x) to approximate the unknown posterior p(z | x). Instead of using an explicit "decoder" q(z | x), we can use unnormalized distributions for the variational family
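For reference, the two bounds just described can be written explicitly. These are standard results from the mutual-information estimation literature (see, e.g., Poole et al., 2019), restated here in the notation above rather than derived anew:

```latex
% Barber-Agakov lower bound with an explicit variational decoder q(z|x),
% where h(Z) denotes the differential entropy of Z:
I(X;Z) \;\ge\; \mathbb{E}_{p(x,z)}\!\left[\log q(z \mid x)\right] + h(Z).

% Choosing the unnormalized variational family
%   q(z|x) = p(z)\, e^{f(x,z)} / Z(x),
% with partition function Z(x) = \mathbb{E}_{p(z)}\!\left[e^{f(x,z)}\right],
% the entropy terms cancel and the bound becomes
I(X;Z) \;\ge\; \mathbb{E}_{p(x,z)}\!\left[f(x,z)\right]
        \;-\; \mathbb{E}_{p(x)}\!\left[\log Z(x)\right].
```

The second form is the one relevant here: the critic f(x,z) plays the role of an energy function, and the intractable term is the expectation of the log-partition function log Z(x).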

