OFFLINE REINFORCEMENT LEARNING WITH CLOSED-FORM POLICY IMPROVEMENT OPERATORS

Anonymous

Abstract

Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while being constrained by the behavior policy to avoid significant distributional shift. In this paper, we propose closed-form policy improvement operators. We make the novel observation that the behavior constraint naturally motivates the use of a first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the lower bound of the LogSumExp operator together with Jensen's inequality, giving rise to a closed-form policy improvement operator. We instantiate offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
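As a concrete illustration of the kind of operator described above, consider the special case where the behavior policy is a single Gaussian rather than a mixture. Under a first-order Taylor expansion of the value function around the behavior mean, maximizing the resulting linear objective inside a quadratic (Mahalanobis) trust region admits a closed-form solution. The sketch below is illustrative only, with assumed notation (`mu`, `Sigma` for the behavior policy's mean and covariance, `grad_q` for the critic gradient at `mu`, `eps` for the trust-region radius); it is not the paper's full mixture-based operator:

```python
import numpy as np

def cfpi_single_gaussian(mu, Sigma, grad_q, eps):
    """Closed-form policy improvement for a single-Gaussian behavior policy.

    First-order Taylor approximation: Q(a) ~ Q(mu) + grad_q @ (a - mu).
    Maximizing this linear objective subject to the trust region
        (a - mu)^T Sigma^{-1} (a - mu) <= 2 * eps
    yields (via a Lagrange multiplier) the closed-form solution
        a* = mu + sqrt(2 * eps) * Sigma @ grad_q / sqrt(grad_q^T Sigma grad_q).
    """
    direction = Sigma @ grad_q                       # ascent direction shaped by the covariance
    scale = np.sqrt(2.0 * eps) / np.sqrt(grad_q @ direction)
    return mu + scale * direction
```

Because the objective is linear, the maximizer always lies on the trust-region boundary, so no iterative optimization (and hence no SGD on the policy) is needed for this step.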

1. INTRODUCTION

Deploying Reinforcement Learning (RL) (Sutton & Barto, 2018) in the real world is hindered by its massive demand for online data. In domains such as robotics (Cabi et al., 2019) and autonomous driving (Sallab et al., 2017), rolling out a premature policy is prohibitively costly and unsafe. To address this issue, offline RL (a.k.a. batch RL) (Levine et al., 2020; Lange et al., 2012) has been proposed to learn a policy directly from historical data without environment interaction. However, learning competent policies from a static dataset is challenging. Prior studies have shown that a policy learned without constraining its deviation from the data-generating policies suffers from significant extrapolation errors, leading to training divergence (Fujimoto et al., 2019; Kumar et al., 2019). Current literature has demonstrated two successful paradigms for managing the trade-off between policy improvement and limiting the distributional shift from the behavior policies. Under the actor-critic framework (Konda & Tsitsiklis, 1999), behavior-constrained policy optimization (BCPO) (Fujimoto et al., 2019; Kumar et al., 2019; Fujimoto & Gu, 2021; Wu et al., 2019; Brandfonbrener et al., 2021; Ghasemipour et al., 2021) explicitly regularizes the divergence between the learned and behavior policies, while conservative methods (Kumar et al., 2020b; Bai et al., 2022; Yu et al., 2020; 2021) penalize the value estimate for out-of-distribution (OOD) actions to avoid overestimation error. However, most existing model-free offline RL algorithms still require learning off-policy value functions and a target policy through stochastic gradient descent (SGD).
Unlike supervised learning, off-policy learning with non-linear function approximators and temporal difference learning (Sutton & Barto, 2018) is notoriously unstable (Kumar et al., 2020a; Mnih et al., 2015; Henderson et al., 2018; Konda & Tsitsiklis, 1999; Watkins & Dayan, 1992) due to the existence of the deadly triad (Sutton & Barto, 2018; Van Hasselt et al., 2018). Performance can exhibit significant variance even across different random seeds (Islam et al., 2017). In offline settings, learning becomes even more problematic as environment interaction is restricted, preventing the learner from receiving corrective feedback (Kumar et al., 2020a). Consequently, training stability poses a major challenge in offline RL. Although some current approaches (Brandfonbrener et al., 2021) circumvent the requirement for learning an off-policy value function, they still require learning a policy via SGD. Can we mitigate the issue of learning instability by leveraging optimization techniques? In this paper, we approach this issue from the policy learning perspective, aiming to design a stable policy

