OFFLINE POLICY OPTIMIZATION WITH VARIANCE REGULARIZATION

Abstract

Learning policies from fixed offline datasets is a key challenge in scaling up reinforcement learning (RL) algorithms towards practical applications. This is largely because off-policy RL algorithms suffer from distributional shift, due to the mismatch between the dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues when computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer leads to a lower bound on the offline policy optimization objective, which helps avoid over-estimation errors and explains the benefits of our approach across a range of continuous control domains when compared to existing algorithms.

1. INTRODUCTION

Offline batch reinforcement learning (RL) algorithms are key to scaling up RL for real-world applications, such as robotics (Levine et al., 2016) and medical problems. This is because offline RL provides the appealing ability for agents to learn from fixed datasets, similar to supervised learning, avoiding continual interaction with the environment, which could be problematic for safety and feasibility reasons. However, significant mismatch between the fixed collected data and the policy that the agent is considering can lead to high variance of value function estimates, a problem encountered by most off-policy RL algorithms (Precup et al., 2000). A complementary problem is that the value function can become overly optimistic in areas of state space that lie outside the visited batch, leading the agent into data regions where its behavior is poor (Fujimoto et al., 2019). Recently there has been some progress in offline RL (Kumar et al., 2019; Wu et al., 2019b; Fujimoto et al., 2019) towards tackling both of these problems. In this work, we study the problem of offline policy optimization with variance minimization. To avoid overly optimistic value function estimates, we propose to learn value functions under variance constraints, leading to a pessimistic estimation, which can significantly help offline RL algorithms, especially under large distribution mismatch. We propose a framework for variance minimization in offline RL, such that the obtained estimates can be used to regularize the value function and enable more stable learning under different off-policy distributions. We develop a novel approach for variance-regularized offline actor-critic algorithms, which we call the Offline Variance Regularizer (OVR). The key idea of OVR is to constrain the policy improvement step via variance-regularized value function estimates.
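The pessimism induced by a variance penalty can be illustrated with a minimal NumPy sketch. Everything here is synthetic and illustrative, not the paper's actual estimator: `w` stands in for stationary-distribution correction ratios, `r` for per-step rewards, and `lam` for a hypothetical penalty coefficient. The point is only that subtracting a non-negative variance term always yields a lower (more conservative) objective than the plain off-policy estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic batch: hypothetical correction ratios w(s, a) and per-step
# rewards r(s, a) under a candidate policy.
w = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)
r = rng.normal(loc=1.0, scale=0.5, size=10_000)

x = w * r                   # per-sample weighted rewards
j = x.mean()                # plain off-policy return estimate
lam = 0.5                   # illustrative variance-penalty coefficient

# Variance-regularized (pessimistic) objective: for lam >= 0 this is
# always a lower bound on the plain estimate, since x.var() >= 0.
j_reg = j - lam * x.var()

print(j, j_reg)
```

Maximizing `j_reg` over policies therefore prefers policies whose weighted rewards are not only high on average but also low-variance under the batch distribution, which is the sense in which the regularized estimate is pessimistic.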
Our algorithmic framework avoids the double sampling issue that arises when computing gradients of variance estimates: instead of the variance of returns, we consider the variance of stationary distribution corrections weighted by per-step rewards, and use the Fenchel transformation (Boyd & Vandenberghe, 2004) to formulate a minimax optimization objective. This allows the variance constraint to be minimized by optimizing dual variables instead, resulting in a simple augmented-reward objective for variance-regularized value functions. We show that even with variance constraints, we can ensure policy improvement guarantees: the regularized value function leads to a lower bound on the true value function, which mitigates the usual over-estimation problems in batch RL. The use of Fenchel duality in computing the variance allows us to avoid double sampling, which has been a major bottleneck in scaling up variance-constrained actor-critic algorithms.
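The double-sampling problem and its dual-variable fix can be seen with the standard identity Var[X] = E[X^2] - (E[X])^2 = min_nu E[(X - nu)^2]: the squared-mean term needs two independent samples of X per gradient estimate, while the dual form needs only one sample per step once the dual variable nu is optimized jointly. Below is a minimal sketch, assuming synthetic samples `x` standing in for correction-ratio-weighted rewards; the stochastic update on `nu` uses one sample at a time, and at the optimum the dual objective recovers the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples x_i, standing in for w_i * r_i (correction ratio
# times per-step reward); true mean 2.0, true std 1.5.
x = rng.normal(loc=2.0, scale=1.5, size=100_000)

# Dual form: Var[X] = min_nu E[(X - nu)^2], minimized at nu* = E[X].
# Each stochastic gradient step uses a SINGLE sample of X, avoiding
# the double sampling needed for grad (E[X])^2.
nu = 0.0
for t, xi in enumerate(x):
    lr = 0.5 / (t + 1)              # decaying step size (running mean)
    nu -= lr * 2.0 * (nu - xi)      # d/dnu (xi - nu)^2 = -2 (xi - nu)

dual_var = np.mean((x - nu) ** 2)   # dual objective at the learned nu
print(nu, dual_var, x.var())
```

With this step-size schedule the update reduces exactly to a running mean, so `nu` converges to the sample mean and `dual_var` matches the sample variance; in the actor-critic setting the same dual variable would simply be updated alongside the other parameters.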

