BRAC+: GOING DEEPER WITH BEHAVIOR REGULARIZED OFFLINE REINFORCEMENT LEARNING

Abstract

Online interaction with the environment to collect data samples for training a Reinforcement Learning (RL) agent is not always feasible due to economic and safety concerns. The goal of Offline Reinforcement Learning is to address this problem by learning effective policies from previously collected datasets. Standard off-policy RL algorithms are prone to overestimating the values of out-of-distribution (less explored) actions and are hence unsuitable for offline RL. Behavior regularization, which constrains the learned policy to the support set of the dataset, has been proposed to tackle the limitations of standard off-policy algorithms. In this paper, we improve behavior regularized offline reinforcement learning and propose BRAC+. We use an analytical upper bound on the KL divergence as the behavior regularizer to reduce the variance associated with sample-based estimation. Additionally, we employ state-dependent Lagrange multipliers for the regularization term to avoid distributing the KL divergence penalty uniformly across all states of the sampled batch. The proposed Lagrange multipliers allow high-probability (more explored) states more freedom to deviate, leading to better rewards, while restricting low-probability (less explored) states to prevent out-of-distribution actions. To prevent catastrophic performance degradation due to rare out-of-distribution actions, we add a gradient penalty term to the policy evaluation objective that penalizes the gradient of the Q value w.r.t. out-of-distribution actions. By doing so, the Q values evaluated at out-of-distribution actions are bounded. On challenging offline RL benchmarks, BRAC+ outperforms state-of-the-art model-free and model-based approaches.

1. INTRODUCTION

Reinforcement Learning (RL) has shown great success in a wide range of applications including board games (Silver et al., 2016), strategy games (Vinyals et al., 2019), energy systems (Zhang et al., 2019), robotics (Lin, 1992), recommendation systems (Choi et al., 2018), etc. The success of RL relies heavily on extensive online interaction with the environment for exploration. However, this is not always feasible in the real world, as it can be expensive or dangerous (Levine et al., 2020). Offline RL, also known as batch RL, avoids online interaction with the environment by learning from a static dataset that is collected in an offline manner (Levine et al., 2020). While standard off-policy RL algorithms (Mnih et al., 2013; Lillicrap et al., 2016; Haarnoja et al., 2018a) can, in theory, be employed to learn from offline data, in practice they perform poorly due to the distributional shift between the behavior policy (the probability distribution of actions conditioned on states as observed in the dataset) and the learned policy (Levine et al., 2020). The distributional shift manifests itself in the form of overestimation of out-of-distribution (OOD) actions, leading to erroneous Bellman backups. Prior works tackle this problem via behavior regularization (Fujimoto et al., 2018b; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020), which ensures that the learned policy stays "close" to the behavior policy. This is achieved by adding a regularization term that measures an f-divergence between the learned policy and the behavior policy; Kernel Maximum Mean Discrepancy (MMD) (Gretton et al., 2007), the Wasserstein distance, and the KL divergence are widely used (Wu et al., 2019). The strength of the regularization term is either fixed (Wu et al., 2019), tuned via dual gradient descent (Kumar et al., 2019), or applied using a trust region objective (Siegel et al., 2020).
In this paper, we propose improvements to the Behavior Regularized Actor Critic (BRAC) algorithm presented in (Wu et al., 2019). We first observe that sample-based estimation of divergence measures is computationally expensive and prone to high variance. Therefore, to reduce variance, we derive an analytical upper bound on the KL divergence measure and use it as the regularization term in the objective function. Moreover, we show that current works that apply the regularization term, i.e. the divergence measure, on the entire batch end up distributing the penalty over all states in the batch in amounts inversely proportional to each state's probability of occurrence in the batch. This needlessly restricts the deviation of highly explored states while allowing less explored ones to deviate farther, leading to OOD actions. To address this, we employ state-dependent Lagrange multipliers for the regularization terms and automatically tune their strength using state-wise dual gradient descent. In addition, the performance of agents trained using prior methods often deteriorates over the course of training. We found that if the learned Q function generalizes such that its gradient w.r.t. the OOD actions is monotonically increasing, behavior regularization fails to keep such actions within the support set. To mitigate this issue, we penalize the gradient of the Q function w.r.t. the OOD actions by adding a gradient penalty term to the policy evaluation objective. This reduces policy improvement at OOD actions to the problem of minimizing the divergence between the learned policy and the behavior policy. We call our improved algorithm BRAC+, following (Wu et al., 2019). Our experiments suggest that BRAC+ outperforms existing state-of-the-art model-free and model-based offline RL algorithms on various datasets of the D4RL benchmark (Fu et al., 2020).
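The state-wise dual gradient descent step above can be sketched as follows. This is a minimal NumPy illustration with hypothetical names, not the paper's actual implementation: each state carries its own multiplier $\alpha_s = \exp(\log \alpha_s)$, and dual ascent grows $\alpha_s$ where the per-state KL constraint is violated and shrinks it where the constraint is slack.

```python
import numpy as np

def update_state_multipliers(log_alpha, kl_per_state, epsilon, lr=1e-2):
    """One dual gradient ascent step on per-state Lagrange multipliers.

    The Lagrangian contains alpha_s * (KL_s - epsilon) per state; its gradient
    w.r.t. log_alpha_s is alpha_s * (KL_s - epsilon), so multipliers grow for
    states whose KL exceeds the threshold and shrink for well-covered states.
    """
    grad = np.exp(log_alpha) * (kl_per_state - epsilon)
    return log_alpha + lr * grad
```

Parameterizing the multipliers in log-space keeps them positive without explicit clipping, a common trick in constrained policy optimization.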

2. BACKGROUND

Markov Decision Process. RL algorithms aim to solve a Markov Decision Process (MDP) with unknown dynamics. An MDP (Sutton & Barto, 2018) is defined as a tuple $\langle S, A, R, P, \mu \rangle$, where $S$ is the set of states, $A$ is the set of actions, $R(s, a, s') : S \times A \times S \to \mathbb{R}$ defines the intermediate reward when the agent transitions from state $s$ to $s'$ by taking action $a$, $P(s'|s, a) : S \times A \times S \to [0, 1]$ defines the probability that the agent transitions from state $s$ to $s'$ by taking action $a$, and $\mu : S \to [0, 1]$ defines the starting state distribution. The objective of reinforcement learning is to select a policy $\pi : S \to P(A)$ that maximizes:

$$J(\pi) = \mathbb{E}_{s_0 \sim \mu,\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)} \Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \Big] \quad (1)$$

Offline Reinforcement Learning. The goal of offline RL is to learn a policy $\pi_\theta$ from a fixed dataset $D = \{(s_i, a_i, s'_i, r_i)\}_{i=1}^N$ consisting of single-step transitions. The dataset is assumed to be collected using a behavior policy $\pi_\beta$, which denotes the conditional distribution $p(a|s)$ observed in the dataset. Note that $\pi_\beta$ may be a multi-modal policy distribution. In principle, standard off-policy RL algorithms using a replay buffer (Mnih et al., 2013; Lillicrap et al., 2016; Haarnoja et al., 2018a) can directly learn from $D$. The key challenge resides in the policy evaluation step:

$$Q_\psi = \arg\min_\psi \big[ Q_\psi(s, a) - \big( r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi_\theta} Q_\psi(s', a') \big) \big]^2 \quad \text{(policy evaluation)} \quad (2)$$

In this step, the target Q value depends on the learned policy. If the learned policy distribution $\pi_\theta$ diverges from the data distribution $\pi_\beta$, target Q values are evaluated at out-of-distribution (OOD) actions. Such evaluations are prone to errors, and erroneous overestimates are then exploited by the policy improvement step, preventing the algorithm from learning useful policies.
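The policy evaluation objective of Eq. (2) can be illustrated numerically. In this minimal NumPy sketch (with made-up array names, not the paper's implementation), the TD target bootstraps from Q values at actions sampled from the learned policy, which is precisely where OOD actions enter the backup:

```python
import numpy as np

def policy_evaluation_loss(q_sa, rewards, q_next, gamma=0.99):
    """Mean squared TD error over a batch of transitions.

    q_sa:    Q_psi(s, a) for the dataset actions
    q_next:  Q_psi(s', a') with a' sampled from the learned policy pi_theta;
             when pi_theta drifts from pi_beta, these are OOD evaluations
    """
    targets = rewards + gamma * q_next  # bootstrapped TD targets
    return np.mean((q_sa - targets) ** 2)
```

If `q_next` is overestimated at OOD actions, the targets inflate and the error propagates through subsequent Bellman backups, which is the failure mode behavior regularization is designed to prevent.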
In order to avoid such cases, behavior regularization is adopted to force the learned policy to stay "close" to the behavior policy (Fujimoto et al., 2018b; Kumar et al., 2019; Wu et al., 2019) .
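When both the learned policy and (an estimate of) the behavior policy are diagonal Gaussians, the KL penalty used for behavior regularization admits a closed form, avoiding sample-based estimation entirely. A minimal sketch under that Gaussian assumption (the function name is illustrative, not from the paper):

```python
import numpy as np

def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2)) ), summed over action dims.

    Standard closed form for Gaussians:
    log(std_q/std_p) + (std_p^2 + (mu_p - mu_q)^2) / (2 std_q^2) - 1/2
    """
    var_p, var_q = std_p ** 2, std_q ** 2
    return np.sum(np.log(std_q / std_p)
                  + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
                  - 0.5)
```

A closed-form penalty like this has zero estimation variance per state, in contrast to Monte Carlo estimates of the KL from sampled actions.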

3. IMPROVING BEHAVIOR REGULARIZED OFFLINE REINFORCEMENT LEARNING

In this section, we discuss and propose three non-trivial improvements to the Behavior Regularized Actor Critic (BRAC) offline reinforcement learning algorithm (Wu et al., 2019). BRAC (Wu et al., 2019)

