OFFLINE REINFORCEMENT LEARNING WITH CLOSED-FORM POLICY IMPROVEMENT OPERATORS

Anonymous

Abstract

Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid significant distributional shift. In this paper, we propose closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of the first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the lower bound of LogSumExp and Jensen's inequality, giving rise to a closed-form policy improvement operator. We instantiate offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.

1. INTRODUCTION

Deploying Reinforcement Learning (RL) (Sutton & Barto, 2018) in the real world is hindered by its massive demand for online data. In domains such as robotics (Cabi et al., 2019) and autonomous driving (Sallab et al., 2017), rolling out a premature policy is prohibitively costly and unsafe. To address this issue, offline RL (a.k.a. batch RL) (Levine et al., 2020; Lange et al., 2012) has been proposed to learn a policy directly from historical data without environment interaction. However, learning competent policies from a static dataset is challenging. Prior studies have shown that a policy learned without constraining its deviation from the data-generating policies suffers from significant extrapolation errors, leading to training divergence (Fujimoto et al., 2019; Kumar et al., 2019). Current literature has demonstrated two successful paradigms for managing the trade-off between policy improvement and limiting the distributional shift from the behavior policies. Under the actor-critic framework (Konda & Tsitsiklis, 1999), behavior constrained policy optimization (BCPO) (Fujimoto et al., 2019; Kumar et al., 2019; Fujimoto & Gu, 2021; Wu et al., 2019; Brandfonbrener et al., 2021; Ghasemipour et al., 2021) explicitly regularizes the divergence between the learned and behavior policies, while conservative methods (Kumar et al., 2020b; Bai et al., 2022; Yu et al., 2020; 2021) penalize the value estimates of out-of-distribution (OOD) actions to avoid overestimation error. However, most existing model-free offline RL algorithms still require learning off-policy value functions and a target policy through stochastic gradient descent (SGD).
Unlike supervised learning, off-policy learning with non-linear function approximators and temporal-difference learning (Sutton & Barto, 2018) is notoriously unstable (Kumar et al., 2020a; Mnih et al., 2015; Henderson et al., 2018; Konda & Tsitsiklis, 1999; Watkins & Dayan, 1992) due to the existence of the deadly triad (Sutton & Barto, 2018; Van Hasselt et al., 2018). The performance can exhibit significant variance even across different random seeds (Islam et al., 2017). In offline settings, learning becomes even more problematic as environment interaction is restricted, preventing the learner from receiving corrective feedback (Kumar et al., 2020a). Consequently, training stability poses a major challenge in offline RL. Although some current approaches (Brandfonbrener et al., 2021) circumvent the requirement for learning an off-policy value function, they still require learning a policy via SGD.

Can we mitigate the issue of learning instability by leveraging optimization techniques? In this paper, we approach this issue from the policy-learning perspective, aiming to design a stable policy improvement operator. We take a closer look at the BCPO paradigm and make a novel observation: the requirement of limited distributional shift motivates the use of the first-order Taylor approximation (Callahan, 2010), leading to a linear approximation of the policy objective that is accurate in a sufficiently small neighborhood of the behavior action. Based on this crucial insight, we construct policy improvement operators that return closed-form solutions by carefully designing a tractable behavior constraint. When modeling the behavior policy as a Single Gaussian, our policy improvement operator deterministically shifts the behavior policy in a value-improving direction derived by solving a Quadratically Constrained Linear Program (QCLP) in closed form.
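To make the QCLP concrete, the snippet below solves a standard instance of this problem type in closed form: maximize a linear objective $g^\top(a - \mu)$ (a first-order Taylor approximation of the Q-value around the behavior mean) subject to a Mahalanobis trust region around the behavior mean. This is a minimal sketch of the problem class, not the paper's implementation; the names `mu`, `Sigma`, `g`, and `eps` are illustrative, and in practice `g` would be the gradient of a learned Q-function.

```python
import numpy as np

def qclp_closed_form(mu, Sigma, g, eps):
    """Maximize g^T (a - mu) subject to (a - mu)^T Sigma^{-1} (a - mu) <= eps.

    The linear objective pushes the solution to the boundary of the
    ellipsoidal constraint, so the maximizer is mu plus a rescaled step
    along the direction Sigma @ g (derived from the KKT conditions).
    """
    direction = Sigma @ g
    scale = np.sqrt(eps / (g @ direction))
    return mu + scale * direction

# Illustrative instance: unit covariance, gradient (3, 4), trust radius 1.
a_star = qclp_closed_form(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), 1.0)
```

With an identity covariance the solution reduces to a unit step along the normalized gradient, i.e. `a_star` is `[0.6, 0.8]` here; a non-identity covariance tilts the step toward directions where the behavior policy has more spread.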
Therefore, our method only requires learning the underlying behavior policies of a given dataset with supervised learning, avoiding the training instability of SGD-based policy improvement. Furthermore, we note that practical datasets are likely to be collected by heterogeneous policies, which may give rise to a multimodal behavior action distribution. In this scenario, a Single Gaussian will fail to capture the entire picture of the underlying distribution, limiting the potential for policy improvement. While modeling the behavior as a Gaussian Mixture provides better expressiveness, it incurs extra optimization difficulties due to the non-concavity of its log-likelihood function. We tackle this issue by leveraging the lower bound of LogSumExp and Jensen's inequality, leading to a closed-form policy improvement (CFPI) operator compatible with a multimodal behavior policy. Empirically, we demonstrate the effectiveness of the Gaussian Mixture over the conventional Single Gaussian when the underlying distribution comes from heterogeneous policies. We further demonstrate empirically that our CFPI operators can instantiate successful offline RL algorithms in a one-step or iterative fashion. Moreover, our methods can also be leveraged to improve a policy learned by other algorithms. In summary, our main contributions are threefold:

• CFPI operators compatible with single-mode and multimodal behavior policies.
• An empirical demonstration of the benefit of modeling the behavior policy as a Gaussian Mixture in model-free offline RL. To the best of our knowledge, we are the first to do so.
• One-step and iterative instantiations of our algorithm, which outperform state-of-the-art (SOTA) algorithms on the standard D4RL benchmark (Fu et al., 2020).
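As a concrete illustration of the two bounds mentioned above, the snippet below numerically checks that both bounds under-estimate the exact log-likelihood of a toy univariate Gaussian mixture: the LogSumExp lower bound keeps only the dominant component ($\log \sum_i e^{x_i} \geq \max_i x_i$), while Jensen's inequality moves the (concave) log inside the mixture expectation ($\log \sum_i w_i p_i \geq \sum_i w_i \log p_i$). All mixture parameters here are made up for illustration; this is not the paper's implementation.

```python
import numpy as np

def log_gauss(a, mu, sigma):
    # Log-density of N(mu, sigma^2) evaluated at a.
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Toy two-component Gaussian mixture (illustrative parameters).
w = np.array([0.3, 0.7])
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.5, 1.0])

a = 0.5
# Exact mixture log-likelihood: log sum_i exp(log w_i + log N(a; mu_i, sigma_i)).
comps = np.log(w) + log_gauss(a, mus, sigmas)
exact = np.log(np.sum(np.exp(comps)))

# LogSumExp lower bound: log sum_i exp(x_i) >= max_i x_i.
lse_bound = np.max(comps)
# Jensen's inequality: log E_w[p] >= E_w[log p].
jensen_bound = np.sum(w * log_gauss(a, mus, sigmas))
```

Both surrogates are concave in the Gaussian parameters where the exact mixture log-likelihood is not, which is what makes a closed-form treatment of the constraint tractable.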

2. PRELIMINARIES

Reinforcement Learning. RL aims to maximize returns in a Markov Decision Process (MDP) (Sutton & Barto, 2018) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, R, T, \rho_0, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R$, transition function $T$, initial state distribution $\rho_0$, and discount factor $\gamma \in [0, 1)$. At each time step $t$, the agent starts from a state $s_t \in \mathcal{S}$, selects an action $a_t \sim \pi(\cdot \mid s_t)$ from its policy $\pi$, transitions to a new state $s_{t+1} \sim T(\cdot \mid s_t, a_t)$, and receives reward $r_t := R(s_t, a_t)$. The goal of an RL agent is to learn an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward $\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ without access to the ground-truth $R$ and $T$. We define the action value function associated with $\pi$ by $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$. The RL objective can then be reformulated as $\pi^* = \arg\max_\pi J(\pi) := \mathbb{E}_{s \sim \rho_0,\, a \sim \pi(\cdot \mid s)}\left[Q^\pi(s, a)\right]$. In this paper, we consider offline RL settings, where we assume restricted access to the MDP $\mathcal{M}$ and a previously collected dataset $\mathcal{D}$ with $N$ transition tuples $\{(s_t^i, a_t^i, r_t^i)\}_{i=1}^{N}$. We denote the underlying policy that generates $\mathcal{D}$ as $\pi_\beta$, which may or may not be a mixture of individual policies.

Behavior Constrained Policy Optimization. One of the critical challenges in offline RL is that the learned Q function tends to assign spuriously high values to OOD actions due to extrapolation error, which is well documented in previous literature (Fujimoto et al., 2019; Kumar et al., 2019).
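The discounted return that appears inside both $\mathbb{E}_\pi[\cdot]$ expressions above can be sketched directly; the reward sequence and discount below are arbitrary values chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t, the quantity inside the expectations
    defining J(pi) and Q^pi(s, a)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Illustrative rollout: three unit rewards with gamma = 0.5
# gives 1 + 0.5 + 0.25 = 1.75.
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging `discounted_return` over rollouts that start from $(s, a)$ and then follow $\pi$ yields a Monte Carlo estimate of $Q^\pi(s, a)$; offline RL must estimate this quantity from the fixed dataset $\mathcal{D}$ instead of fresh rollouts.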
Behavior Constrained Policy Optimization (BCPO) methods (Fujimoto et al., 2019; Kumar et al., 2019; Fujimoto & Gu, 2021; Wu et al., 2019; Brandfonbrener et al., 2021) explicitly constrain the action selection of the learned policy to stay close to the behavior policy $\pi_\beta$, resulting in a policy improvement step that can be generally summarized by the optimization problem below:

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}}\Big[\mathbb{E}_{\tilde{a} \sim \pi(\cdot \mid s)}\big[Q(s, \tilde{a})\big] - \alpha D\big(\pi(\cdot \mid s), \pi_\beta(\cdot \mid s)\big)\Big],$$

where $D(\cdot, \cdot)$ is a divergence function that calculates the divergence between two action distributions, and $\alpha$ is a hyper-parameter controlling the strength of regularization. Consequently, the policy is optimized to maximize the Q-value while staying close to the behavior distribution.
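For a single state, the BCPO objective above can be sketched with univariate Gaussian policies and the KL divergence as $D$. This is an illustrative toy, not any specific algorithm from the cited works: `Q` is a stand-in value function, and `alpha` and all distribution parameters are arbitrary.

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """KL divergence KL(N(mu1, s1^2) || N(mu2, s2^2)) in closed form."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

def bcpo_objective(mu, sigma, mu_beta, sigma_beta, Q, alpha, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_{a~pi}[Q(a)] - alpha * KL(pi || pi_beta)."""
    rng = np.random.default_rng(seed)
    actions = mu + sigma * rng.standard_normal(n_samples)
    return Q(actions).mean() - alpha * kl_gauss(mu, sigma, mu_beta, sigma_beta)
```

With a small `alpha`, a policy mean shifted toward the maximizer of `Q` scores higher than the behavior mean; as `alpha` grows, the KL penalty dominates and pins the policy to $\pi_\beta$, which is exactly the trade-off the hyper-parameter controls.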

