OFFLINE POLICY OPTIMIZATION WITH VARIANCE REGULARIZATION

Abstract

Learning policies from fixed offline datasets is a key challenge in scaling up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to the mismatch between the dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid the double sampling issue in computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer leads to a lower bound on the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing algorithms.

1. INTRODUCTION

Offline batch reinforcement learning (RL) algorithms are key to scaling up RL for real-world applications, such as robotics (Levine et al., 2016) and medical problems. This is because offline RL provides the appealing ability for agents to learn from fixed datasets, similar to supervised learning, avoiding continual interaction with the environment, which could be problematic for safety and feasibility reasons. However, significant mismatch between the fixed collected data and the policy that the agent is considering can lead to high variance of value function estimates, a problem encountered by most off-policy RL algorithms (Precup et al., 2000). A complementary problem is that the value function can become overly optimistic in areas of the state space that are outside the visited batch, leading the agent into data regions where its behavior is poor (Fujimoto et al., 2019). Recently there has been some progress in offline RL (Kumar et al., 2019; Wu et al., 2019b; Fujimoto et al., 2019), trying to tackle both of these problems. In this work, we study the problem of offline policy optimization with variance minimization. To avoid overly optimistic value function estimates, we propose to learn value functions under variance constraints, leading to a pessimistic estimate, which can significantly help offline RL algorithms, especially under large distribution mismatch. We propose a framework for variance minimization in offline RL, such that the obtained estimates can be used to regularize the value function and enable more stable learning under different off-policy distributions. We develop a novel approach for variance-regularized offline actor-critic algorithms, which we call the Offline Variance Regularizer (OVR). The key idea of OVR is to constrain the policy improvement step via variance-regularized value function estimates.
Our algorithmic framework avoids the double sampling issue that arises when computing gradients of variance estimates, by instead considering the variance of stationary distribution corrections with per-step rewards, and using the Fenchel transformation (Boyd & Vandenberghe, 2004) to formulate a minimax optimization objective. This allows minimizing variance constraints by instead optimizing dual variables, resulting in a simple augmented-reward objective for variance-regularized value functions. We show that even with variance constraints, we can ensure policy improvement guarantees, where the regularized value function leads to a lower bound on the true value function, which mitigates the usual over-estimation problems in batch RL. The use of Fenchel duality in computing the variance allows us to avoid double sampling, which has been a major bottleneck in scaling up variance-constrained actor-critic algorithms in prior work (A. & Ghavamzadeh, 2016; A. & Fu, 2018). Practically, our algorithm is easy to implement, since it simply involves augmenting the rewards with the dual variables, such that the regularized value function can be implemented on top of any existing offline policy optimization algorithm. We evaluate our algorithm on existing offline benchmark tasks based on continuous control domains. Our empirical results demonstrate that the proposed variance regularization approach is particularly useful when the batch dataset is gathered at random, or when it is very different from the data distributions encountered during training.
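To sketch why Fenchel duality removes the double-sampling issue, consider the standard conjugate of the square function applied to the variance of a random variable X (the exact OVR objective, with stationary distribution corrections, is derived in Section 3):

```latex
% The variance involves the square of an expectation, whose gradient
% requires two independent samples of X (the double-sampling issue):
\operatorname{Var}(X) \;=\; \mathbb{E}[X^2] \;-\; \big(\mathbb{E}[X]\big)^2 .
% Fenchel duality for the square, x^2 = \max_{\nu} \big( 2\nu x - \nu^2 \big),
% applied to \mathbb{E}[X], gives
\big(\mathbb{E}[X]\big)^2 \;=\; \max_{\nu \in \mathbb{R}} \big( 2\nu\,\mathbb{E}[X] - \nu^2 \big) ,
% so the variance becomes a single expectation, jointly optimized
% over a scalar dual variable \nu (optimal at \nu = \mathbb{E}[X]):
\operatorname{Var}(X) \;=\; \min_{\nu \in \mathbb{R}} \; \mathbb{E}\big[(X - \nu)^2\big] .
```

The final expression is linear in the distribution of X, so its gradient can be estimated from single samples, at the cost of introducing the inner optimization over the dual variable.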

2. PRELIMINARIES AND BACKGROUND

We consider an infinite-horizon MDP (S, A, P, γ), where S is the set of states, A is the set of actions, P is the transition dynamics and γ is the discount factor. The goal of reinforcement learning is to maximize the expected return J(π) = E_{s∼β}[V^π(s)], where V^π(s) is the value function V^π(s) = E[∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s], and β is the initial state distribution. Considering parameterized policies π_θ(a|s), the goal is to maximize the returns by following the policy gradient (Sutton et al., 1999), based on the performance metric defined as:

J(π_θ) = E_{s_0∼β, a_0∼π(·|s_0)}[Q^{π_θ}(s_0, a_0)] = E_{(s,a)∼d_{π_θ}(s,a)}[r(s, a)],     (1)

where Q^π(s, a) is the state-action value function, since V^π(s) = ∑_a π(a|s) Q^π(s, a). The policy optimization objective can be equivalently written in terms of the normalized discounted occupancy measure under the current policy π_θ, where d_π(s, a) is the state-action occupancy measure, such that the normalized state-action visitation distribution under policy π is defined as: d_π(s, a) = (1 − γ) ∑_{t=0}^∞ γ^t P(s_t = s, a_t = a | s_0 ∼ β, a ∼ π(s_0)). The equality in equation 1 holds and can be equivalently written based on the linear programming (LP) formulation in RL (see (Puterman, 1994; Nachum & Dai, 2020) for more details). In this work, we consider the off-policy learning problem under a fixed dataset D which contains (s, a, r, s′) tuples collected under a known behaviour policy μ(a|s). Under the off-policy setting, importance sampling (Precup et al., 2000) is often used to reweight the trajectory under the behaviour data-collecting policy, so as to get unbiased estimates of the expected returns.
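To make the occupancy-measure identity concrete, here is a minimal tabular sketch (the helper name is hypothetical and not part of OVR): accumulating the weights (1 − γ)γ^t for each visited (s, a) along a trajectory recovers E_{(s,a)∼d_π}[r(s, a)] = (1 − γ) · (discounted return), matching equation 1 empirically.

```python
from collections import defaultdict

def occupancy_and_return(trajectory, gamma=0.9):
    """Accumulate the normalized discounted occupancy weights
    (1 - gamma) * gamma^t for each visited (s, a), along with the
    discounted return of the trajectory (hypothetical helper)."""
    d = defaultdict(float)
    ret = 0.0
    for t, (s, a, r) in enumerate(trajectory):
        d[(s, a)] += (1 - gamma) * gamma ** t
        ret += gamma ** t * r
    return dict(d), ret

# A toy trajectory of (state, action, reward) tuples.
traj = [(0, 1, 1.0), (1, 0, 0.5), (0, 1, 1.0)]
d, ret = occupancy_and_return(traj, gamma=0.9)

# With a deterministic reward per (s, a), the reward averaged under
# the occupancy weights equals (1 - gamma) times the discounted return:
rewards = {(0, 1): 1.0, (1, 0): 0.5}
lhs = sum(w * rewards[sa] for sa, w in d.items())
print(abs(lhs - (1 - 0.9) * ret) < 1e-9)  # True
```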
At each time step, the importance sampling correction π(a_t|s_t)/μ(a_t|s_t) is used to compute the expected return under the entire trajectory as:

J(π) = (1 − γ) E_{(s,a)∼d_μ(s,a)}[ ∑_{t=0}^T γ^t r(s_t, a_t) ∏_{t=1}^T π(a_t|s_t)/μ(a_t|s_t) ].

Recent works (Fujimoto et al., 2019) have demonstrated that, instead of importance sampling corrections, maximizing value functions directly for deterministic or reparameterized policy gradients (Lillicrap et al., 2016; Fujimoto et al., 2018) allows learning under fixed datasets by addressing the over-estimation problem, by maximizing objectives of the form max_θ E_{s∼D}[Q^{π_θ}(s, π_θ(s))].
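A minimal sketch of this trajectory-level importance sampling estimator, with π and μ given as tabular action-probability maps (hypothetical names; the (1 − γ) normalization and outer expectation over trajectories are omitted for a single-trajectory estimate):

```python
def is_return(trajectory, pi, mu, gamma=0.99):
    """Importance-sampled discounted return of one trajectory
    collected under behaviour policy mu, reweighted to the target
    policy pi (tabular probabilities; hypothetical helper)."""
    ratio, ret = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        ratio *= pi[s][a] / mu[s][a]  # cumulative correction
        ret += gamma ** t * r         # discounted return
    return ratio * ret                # whole-trajectory IS estimate

# Toy example: the target policy takes action 1 with probability 0.8
# while the behaviour policy is uniform, so the return is up-weighted:
pi = {0: {0: 0.2, 1: 0.8}}
mu = {0: {0: 0.5, 1: 0.5}}
est = is_return([(0, 1, 1.0), (0, 1, 1.0)], pi, mu, gamma=0.5)
# est = (0.8 / 0.5)**2 * (1.0 + 0.5)
```

The product of per-step ratios is what makes this estimator high-variance over long horizons, motivating the occupancy-measure view above.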

3. VARIANCE REGULARIZATION VIA DUALITY IN OFFLINE POLICY OPTIMIZATION

In this section, we first present our approach based on the variance of stationary distribution corrections, compared to importance re-weighting of episodic returns, in section 3.1. We then present a derivation of our approach based on Fenchel duality on the variance, to avoid the double sampling issue, leading to a variance regularized offline optimization objective in section 3.2. Finally, we present our algorithm in Algorithm 1, where the proposed regularizer can be used in any existing offline RL algorithm.

3.1. VARIANCE OF REWARDS WITH STATIONARY DISTRIBUTION CORRECTIONS

In this work, we consider the variance of rewards under occupancy measures in offline policy optimization. Let us denote the returns as D^π = ∑_{t=0}^T γ^t r(s_t, a_t), such that the value function is V^π = E_π[D^π]. The 1-step importance sampling ratio is ρ_t = π(a_t|s_t)/μ(a_t|s_t), and the T-step ratio can be denoted ρ_{1:T} = ∏_{t=1}^T ρ_t. Considering per-decision importance sampling (PDIS) (Precup et al., 2000), the returns can similarly be written as D^π = ∑_{t=0}^T γ^t r_t ρ_{0:t}. The variance of episodic returns, which we denote by V_P(π), with off-policy importance sampling corrections can be written as:

V_P(π) = E_{s∼β, a∼μ(·|s), s′∼P(·|s,a)}[ ( D^π(s, a) − J(π) )^2 ].
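A minimal Monte Carlo sketch of this quantity (hypothetical function names; the population expectation is replaced by a sample average over trajectories from the behaviour policy):

```python
def pdis_return(trajectory, pi, mu, gamma=0.99):
    """Per-decision importance-sampled return:
    D_pi = sum_t gamma^t * r_t * rho_{0:t}."""
    ratio, ret = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        ratio *= pi[s][a] / mu[s][a]  # rho_{0:t}
        ret += gamma ** t * ratio * r
    return ret

def episodic_variance(trajectories, pi, mu, gamma=0.99):
    """Sample estimate of V_P(pi) = E[(D_pi - J(pi))^2]."""
    returns = [pdis_return(tau, pi, mu, gamma) for tau in trajectories]
    j = sum(returns) / len(returns)  # Monte Carlo estimate of J(pi)
    return sum((d - j) ** 2 for d in returns) / len(returns)
```

Note that minimizing this estimate with respect to the policy requires differentiating through the squared mean J(π)², which is precisely where the double-sampling issue, and hence the Fenchel dual formulation of section 3.2, enters.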

