OFFLINE POLICY OPTIMIZATION WITH VARIANCE REGULARIZATION

Abstract

Learning policies from fixed offline datasets is a key challenge in scaling up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to the mismatch between the dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues when computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer leads to a lower bound on the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing algorithms.

1. INTRODUCTION

Offline batch reinforcement learning (RL) algorithms are key to scaling up RL for real world applications, such as robotics (Levine et al., 2016) and medical problems. This is because offline RL provides the appealing ability for agents to learn from fixed datasets, similar to supervised learning, avoiding continual interaction with the environment, which could be problematic for safety and feasibility reasons. However, significant mismatch between the fixed collected data and the policy the agent is considering can lead to high variance of value function estimates, a problem encountered by most off-policy RL algorithms (Precup et al., 2000). A complementary problem is that the value function can become overly optimistic in areas of state space that are outside the visited batch, leading the agent into data regions where its behavior is poor (Fujimoto et al., 2019). Recently there has been some progress in offline RL (Kumar et al., 2019; Wu et al., 2019b; Fujimoto et al., 2019) towards tackling both of these problems. In this work, we study the problem of offline policy optimization with variance minimization. To avoid overly optimistic value function estimates, we propose to learn value functions under variance constraints, leading to a pessimistic estimation, which can significantly help offline RL algorithms, especially under large distribution mismatch. We propose a framework for variance minimization in offline RL, such that the obtained estimates can be used to regularize the value function and enable more stable learning under different off-policy distributions. We develop a novel approach for variance regularized offline actor-critic algorithms, which we call the Offline Variance Regularizer (OVR). The key idea of OVR is to constrain the policy improvement step via variance regularized value function estimates.
Our algorithmic framework avoids the double sampling issue that arises when computing gradients of variance estimates, by instead considering the variance of stationary distribution corrections with per-step rewards, and using the Fenchel transformation (Boyd & Vandenberghe, 2004) to formulate a minimax optimization objective. This allows minimizing the variance constraint by optimizing dual variables instead, resulting in a simple augmented reward objective for variance regularized value functions. We show that even with variance constraints, we can ensure policy improvement guarantees, where the regularized value function leads to a lower bound on the true value function, which mitigates the usual overestimation problems in batch RL. The use of Fenchel duality in computing the variance allows us to avoid double sampling, which has been a major bottleneck in scaling up variance-constrained actor-critic algorithms in prior work (A. & Ghavamzadeh, 2016; A. & Fu, 2018). Practically, our algorithm is easy to implement, since it simply involves augmenting the rewards with the dual variables, such that the regularized value function can be implemented on top of any existing offline policy optimization algorithm. We evaluate our algorithm on existing offline benchmark tasks based on continuous control domains. Our empirical results demonstrate that the proposed variance regularization approach is particularly useful when the batch dataset is gathered at random, or when it is very different from the data distributions encountered during training.

2. PRELIMINARIES AND BACKGROUND

We consider an infinite horizon MDP $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma)$ where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{P}$ is the transition dynamics and $\gamma$ is the discount factor. The goal of reinforcement learning is to maximize the expected return $J(\pi) = \mathbb{E}_{s \sim \beta}[V^\pi(s)]$, where $V^\pi(s)$ is the value function $V^\pi(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s]$, and $\beta$ is the initial state distribution. Considering parameterized policies $\pi_\theta(a|s)$, the goal is to maximize returns by following the policy gradient (Sutton et al., 1999), based on the performance metric defined as:

$$J(\pi_\theta) = \mathbb{E}_{s_0 \sim \beta, a_0 \sim \pi(\cdot|s_0)}\big[Q^{\pi_\theta}(s_0, a_0)\big] = \mathbb{E}_{(s,a) \sim d_{\pi_\theta}(s,a)}\big[r(s,a)\big] \quad (1)$$

where $Q^\pi(s,a)$ is the state-action value function, since $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$. The policy optimization objective can be equivalently written in terms of the normalized discounted occupancy measure under the current policy $\pi_\theta$, where the normalized state-action visitation distribution under policy $\pi$ is defined as $d_\pi(s,a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s, a_t = a \mid s_0 \sim \beta, a \sim \pi(s_0))$. The equality in equation 1 holds and can be equivalently written based on the linear programming (LP) formulation of RL (see (Puterman, 1994; Nachum & Dai, 2020) for more details). In this work, we consider the off-policy learning problem under a fixed dataset $\mathcal{D}$ which contains $(s, a, r, s')$ tuples collected under a known behaviour policy $\mu(a|s)$. In the off-policy setting, importance sampling (Precup et al., 2000) is often used to reweight trajectories under the behaviour data collecting policy, so as to obtain unbiased estimates of the expected returns. At each time step, the importance sampling correction $\frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}$ is used to compute the expected return over the entire trajectory as $J(\pi) = (1-\gamma)\,\mathbb{E}_{(s,a)\sim d_\mu(s,a)}\big[\sum_{t=0}^{T}\gamma^t r(s_t,a_t) \prod_{t=1}^{T}\frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}\big]$.
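To make the estimator concrete, the trajectory-wise importance-sampled return above can be computed from logged data as follows (a minimal NumPy sketch; the toy rewards and action probabilities are hypothetical):

```python
import numpy as np

def is_return(rewards, pi_probs, mu_probs, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t, reweighted by the
    full-trajectory importance ratio prod_t pi(a_t|s_t)/mu(a_t|s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    ratio = np.prod(np.asarray(pi_probs) / np.asarray(mu_probs))
    discounts = gamma ** np.arange(len(rewards))
    return ratio * np.sum(discounts * rewards)

# When pi == mu the ratio is 1 and we recover the plain discounted return:
# 1 + 0.5 + 0.25 = 1.75 for gamma = 0.5.
val = is_return([1.0, 1.0, 1.0], [0.5] * 3, [0.5] * 3, gamma=0.5)
```

The product of per-step ratios is exactly the term whose variance motivates the stationary distribution corrections considered in the next section.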
Recent works (Fujimoto et al., 2019) have demonstrated that instead of importance sampling corrections, maximizing value functions directly with deterministic or reparameterized policy gradients (Lillicrap et al., 2016; Fujimoto et al., 2018) allows learning under fixed datasets while addressing the over-estimation problem, by maximizing objectives of the form $\max_\theta \mathbb{E}_{s\sim\mathcal{D}}\big[Q^{\pi_\theta}(s, \pi_\theta(s))\big]$.

3. VARIANCE REGULARIZATION VIA DUALITY IN OFFLINE POLICY OPTIMIZATION

In this section, we first present our approach based on the variance of stationary distribution corrections, compared to importance re-weighting of episodic returns, in section 3.1. We then derive our approach using Fenchel duality on the variance, to avoid the double sampling issue, leading to a variance regularized offline optimization objective in section 3.2. Finally, we present our method in Algorithm 1, where the proposed regularizer can be used in any existing offline RL algorithm.

3.1. VARIANCE OF REWARDS WITH STATIONARY DISTRIBUTION CORRECTIONS

In this work, we consider the variance of rewards under occupancy measures in offline policy optimization. Let us denote the returns as $D^\pi = \sum_{t=0}^{T}\gamma^t r(s_t, a_t)$, such that the value function is $V^\pi = \mathbb{E}_\pi[D^\pi]$. The 1-step importance sampling ratio is $\rho_t = \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}$, and the T-step ratio is denoted $\rho_{1:T} = \prod_{t=1}^{T}\rho_t$. Considering per-decision importance sampling (PDIS) (Precup et al., 2000), the returns can similarly be written as $D^\pi = \sum_{t=0}^{T}\gamma^t r_t \rho_{0:t}$. The variance of episodic returns with off-policy importance sampling corrections, which we denote by $\mathcal{V}_P(\pi)$, can be written as:

$$\mathcal{V}_P(\pi) = \mathbb{E}_{s\sim\beta, a\sim\mu(\cdot|s), s'\sim\mathcal{P}(\cdot|s,a)}\Big[\big(D^\pi(s,a) - J(\pi)\big)^2\Big]$$

Instead of importance sampling, several recent works have proposed marginalized importance sampling with stationary state-action distribution corrections (Liu et al., 2018; Nachum et al., 2019a; Zhang et al., 2020; Uehara & Jiang, 2019), which can lead to lower variance estimators at the cost of introducing bias. Denoting the stationary distribution ratio as $\omega(s,a) = \frac{d_\pi(s,a)}{d_\mu(s,a)}$, the per-step corrected reward can be written as $W^\pi(s,a) = \omega(s,a)\, r(s,a)$. The variance of marginalized IS is:

$$\mathcal{V}_D(\pi) = \mathbb{E}_{(s,a)\sim d_\mu(s,a)}\Big[\big(W^\pi(s,a) - J(\pi)\big)^2\Big] = \mathbb{E}_{(s,a)\sim d_\mu(s,a)}\big[W^\pi(s,a)^2\big] - \mathbb{E}_{(s,a)\sim d_\mu(s,a)}\big[W^\pi(s,a)\big]^2 \quad (2)$$

Our key contribution is to consider the variance of marginalized IS, $\mathcal{V}_D(\pi)$, itself as a risk constraint in the offline batch optimization setting. We show that constraining the offline policy optimization objective with the variance of marginalized IS, and using the Fenchel-Legendre transformation on $\mathcal{V}_D(\pi)$, helps avoid the well-known double sampling issue in variance risk constrained RL (for more details on how to compute the gradient of the variance term, see appendix B). We emphasize that the variance here is based solely on returns under occupancy measures; we do not consider the variance due to the inherent stochasticity of the MDP dynamics.
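Equation 2 only involves per-sample quantities, so a plug-in estimate of the variance of marginalized IS is straightforward once the ratios are estimated (a sketch; the `omega` array is a hypothetical stand-in for the output of a ratio estimator):

```python
import numpy as np

def marginalized_is_variance(omega, rewards):
    """Plug-in estimate of V_D(pi) = E[(w r)^2] - (E[w r])^2
    over batch samples (s, a) ~ d_mu."""
    w = np.asarray(omega, dtype=float) * np.asarray(rewards, dtype=float)
    return np.mean(w ** 2) - np.mean(w) ** 2

omega = np.array([1.0, 2.0, 0.5, 0.5])    # hypothetical d_pi/d_mu ratios
rewards = np.array([1.0, 1.0, 2.0, 0.0])
v_d = marginalized_is_variance(omega, rewards)  # population variance of omega*r
```

Note that no trajectory products appear: each sample contributes a single ratio-weighted reward, which is the source of the variance reduction relative to episodic importance sampling.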

3.2. VARIANCE REGULARIZED OFFLINE MAX-RETURN OBJECTIVE

We consider the variance regularized off-policy max-return objective with stationary distribution corrections $\omega_{\pi/\mathcal{D}}$ (denoted $\omega$ for brevity) in the offline fixed dataset $\mathcal{D}$ setting:

$$\max_{\pi_\theta} J(\pi_\theta) := \mathbb{E}_{s\sim\mathcal{D}}\big[Q^{\pi_\theta}(s, \pi_\theta(s))\big] - \lambda\, \mathcal{V}_D(\omega, \pi_\theta) \quad (3)$$

where $\lambda \ge 0$ allows for the trade-off between offline policy optimization and variance regularization (or equivalently, variance risk minimization). The max-return objective under $Q^{\pi_\theta}(s,a)$ has been considered in prior works on offline policy optimization (Fujimoto et al., 2019; Kumar et al., 2019). We show that this form of regularizer encourages variance minimization in offline policy optimization, especially when there is a large data distribution mismatch between the fixed dataset $\mathcal{D}$ and the induced data distribution under policy $\pi_\theta$.

3.3. VARIANCE REGULARIZATION VIA FENCHEL DUALITY

At first glance, equation 3 seems difficult to optimize, especially when minimizing the variance regularizer w.r.t. $\theta$. This is because computing the gradient of $\mathcal{V}_D(\omega, \pi_\theta)$ leads to the double sampling issue, since the variance contains the square of an expectation. The key contribution of OVR is to use the Fenchel duality trick on the second term of the variance expression in equation 2, to regularize the policy optimization objective with the variance of marginalized importance sampling. Applying Fenchel duality, $x^2 = \max_y (2xy - y^2)$, to the second term of the variance expression, we can transform the variance minimization problem into an equivalent maximization problem by introducing dual variables $\nu(s,a)$. The Fenchel conjugate form of the variance term is:

$$\mathcal{V}_D(\omega, \pi_\theta) = \max_\nu \Big\{ -\tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim d_\mathcal{D}}\big[\nu(s,a)^2\big] + \mathbb{E}_{(s,a)\sim d_\mathcal{D}}\big[\nu(s,a)\,\omega(s,a)\,r(s,a)\big] + \mathbb{E}_{(s,a)\sim d_\mathcal{D}}\big[\big(\omega(s,a)\,r(s,a)\big)^2\big] \Big\} = \max_\nu \mathbb{E}_{(s,a)\sim d_\mathcal{D}}\Big[ -\tfrac{1}{2}\nu(s,a)^2 + \nu(s,a)\,\omega(s,a)\,r(s,a) + \big(\omega(s,a)\,r(s,a)\big)^2 \Big] \quad (4)$$

Regularizing the policy optimization objective with the variance under the Fenchel transformation, we therefore have the overall max-min optimization objective, written explicitly as:

$$\max_\theta \min_\nu J(\pi_\theta, \nu) := \mathbb{E}_{s\sim\mathcal{D}}\big[Q^{\pi_\theta}(s, \pi_\theta(s))\big] - \lambda\, \mathbb{E}_{(s,a)\sim d_\mathcal{D}}\Big[ \big(-\tfrac{1}{2}\nu^2 + \nu\cdot\omega\cdot r + (\omega\cdot r)^2\big)(s,a) \Big] \quad (5)$$
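The Fenchel identity used above and the closed form of the resulting dual objective can both be checked numerically; the sketch below is purely illustrative, with arbitrary toy values:

```python
import numpy as np

# Fenchel identity: x**2 == max_y (2*x*y - y**2), with maximizer y* = x.
x = 3.0
ys = np.arange(-5.0, 5.5, 0.5)  # grid that contains y = x exactly
vals = 2 * x * ys - ys ** 2
assert np.isclose(vals.max(), x ** 2)
assert ys[vals.argmax()] == x

# The per-sample dual objective in the variance term, g(nu) = -0.5*nu**2 + nu*w,
# is maximized at nu* = w, matching the closed-form dual variable nu = omega * r
# used later in the variance minimization step.
w = 1.5  # hypothetical omega(s,a) * r(s,a)
nus = np.arange(-5.0, 5.5, 0.5)
g = -0.5 * nus ** 2 + nus * w
assert nus[g.argmax()] == w
```

Because the inner problem is concave in the dual variable, the maximization over $\nu$ is well behaved and admits this pointwise closed form.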

3.4. AUGMENTED REWARD OBJECTIVE WITH VARIANCE REGULARIZATION

In this section, we explain the key steps that lead to the policy improvement step being an augmented variance regularized reward objective. The variance minimization step involves estimating the stationary distribution ratio (Nachum et al., 2019a), and then simply computing the closed form solution for the dual variables. Fixing the dual variables $\nu$, the update of $\pi_\theta$ reduces to a standard maximum return objective in the dual form, which can be equivalently solved in the primal form using augmented rewards. This is because we can write the objective in the dual form as:

$$J(\pi_\theta, \nu, \omega) := \mathbb{E}_{(s,a)\sim d_\mathcal{D}(s,a)}\Big[\omega(s,a)\cdot r(s,a) - \lambda\big(-\tfrac{1}{2}\nu^2 + \nu\cdot\omega\cdot r + (\omega\cdot r)^2\big)(s,a)\Big]$$
$$= \mathbb{E}_{(s,a)\sim d_\mathcal{D}(s,a)}\Big[\omega(s,a)\cdot\big(r - \lambda\cdot\nu\cdot r - \lambda\cdot r^2\big)(s,a) + \tfrac{\lambda}{2}\nu(s,a)^2\Big] = \mathbb{E}_{(s,a)\sim d_\mathcal{D}(s,a)}\Big[\omega(s,a)\cdot \tilde{r}(s,a) + \tfrac{\lambda}{2}\nu(s,a)^2\Big] \quad (6)$$

where we denote the augmented rewards as:

$$\tilde{r}(s,a) \equiv \big[r - \lambda\cdot\nu\cdot r - \lambda\cdot r^2\big](s,a) \quad (7)$$

The policy improvement step can either be achieved by directly solving equation 6, or by considering the primal form of the objective with respect to $Q^{\pi_\theta}(s, \pi_\theta(s))$ as in (Fujimoto et al., 2019; Kumar et al., 2019). However, solving equation 6 directly can be troublesome, since the policy gradient step involves finding the gradient w.r.t. $\omega(s,a) = \frac{d_{\pi_\theta}(s,a)}{d_\mathcal{D}(s,a)}$ too, where the distribution ratio depends on $d_{\pi_\theta}(s,a)$. This means the gradient w.r.t. $\theta$ would require finding the gradient w.r.t. the normalized discounted occupancy measure, i.e., $\nabla_\theta d_{\pi_\theta}(s)$. It is therefore easier to consider the augmented reward objective, using $\tilde{r}(s,a)$ as in equation 7 in any existing offline policy optimization algorithm, yielding the variance regularized value function $\tilde{Q}^{\pi_\theta}(s,a)$. Note that, as highlighted in (Sobel, 1982), the variance of returns follows a Bellman-like equation. Following this, (Bisi et al., 2019) also pointed to a Bellman-like solution for variance w.r.t. occupancy measures.
Considering variance of the form in equation 2, and the Bellman-like equation for variance, we can write the variance recursively as a Bellman equation:

$$\mathcal{V}^\pi_\mathcal{D}(s,a) = \big(r(s,a) - J(\pi)\big)^2 + \gamma\, \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\, a'\sim\pi(\cdot|s')}\big[\mathcal{V}^\pi_\mathcal{D}(s',a')\big]$$

Since in our objective we augment the policy improvement step with the variance regularization term, we can write the augmented value function as $Q^\pi_\lambda(s,a) := Q^\pi(s,a) - \lambda\, \mathcal{V}^\pi_\mathcal{D}(s,a)$. This suggests we can modify existing policy optimization algorithms with augmented rewards on the value function.

Remark : Applying the Fenchel transformation to the variance regularized objective, at first glance, seems to make the augmented rewards dependent on the policy itself, since $\tilde{r}(s,a)$ depends on the dual variables $\nu(s,a)$ as well. This could make the rewards non-stationary, so that the policy maximization step could not be solved directly via the maximum return objective. However, as we discuss next, the dual variables for minimizing the variance term have a closed form solution $\nu(s,a)$, and therefore do not lead to any non-stationarity in the rewards, due to the alternating minimization and maximization steps.
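With the closed-form dual substituted in, equation 7 reduces to a per-sample transformation of the batch rewards (a sketch; `omega` is a hypothetical stand-in for the estimated distribution ratio):

```python
import numpy as np

def augmented_rewards(rewards, omega, lam):
    """r_tilde = r - lam*nu*r - lam*r^2 (equation 7), with the
    variance-minimizing closed-form dual nu = omega * r plugged in."""
    r = np.asarray(rewards, dtype=float)
    nu = np.asarray(omega, dtype=float) * r  # closed-form dual variable
    return r - lam * nu * r - lam * r ** 2

r = np.array([1.0, 2.0])
omega = np.array([1.0, 0.5])
r_tilde = augmented_rewards(r, omega, lam=0.1)
# Setting lam = 0 recovers the unregularized rewards.
```

Since the transformation is elementwise over the batch, it adds negligible cost to any offline policy optimization pipeline.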

Variance Minimization Step : Fixing the policy $\pi_\theta$, the dual variables $\nu$ can be obtained using the closed form solution $\nu(s,a) = \omega(s,a)\cdot r(s,a)$. Note that directly optimizing for the target policies using batch data, however, requires a fixed point estimate of the stationary distribution corrections, which can be achieved using existing algorithms (Nachum et al., 2019a; Liu et al., 2018). Solving the optimization objective additionally requires estimating the state-action distribution ratio, $\omega(s,a) = \frac{d_\pi(s,a)}{d_\mathcal{D}(s,a)}$. Recently, several works have proposed estimating the stationary distribution ratio, mostly for the off-policy evaluation case in the infinite horizon setting (Zhang et al., 2020; Uehara & Jiang, 2019). We include a detailed discussion of this in appendix E.4.

Algorithm : Our proposed variance regularization approach with returns under stationary distribution corrections for offline optimization can be built on top of any existing batch off-policy optimization algorithm. We summarize our contributions in Algorithm 1. Implementing our algorithm requires estimating the state-action distribution ratio, followed by the closed form estimate of the dual variable $\nu$. The augmented stationary reward with the dual variables can then be used to compute the regularized value function $Q^\pi_\lambda(s,a)$. The policy improvement step involves maximizing the variance regularized value function, e.g. with BCQ (Fujimoto et al., 2019).

Algorithm 1 Offline Variance Regularizer

Initialize critic $Q_\phi$, policy $\pi_\theta$, distribution ratio network $\omega_\psi$, regularization weight $\lambda$, and learning rate $\eta$
for t = 1 to T do
    Estimate the distribution ratio $\omega_\psi(s,a)$ using any existing DICE algorithm
    Estimate the dual variable $\nu(s,a) = \omega_\psi(s,a)\cdot r(s,a)$
    Calculate the augmented rewards $\tilde{r}(s,a)$ using equation 7
    Policy improvement step using any offline policy optimization algorithm with augmented rewards $\tilde{r}(s,a)$: $\theta_t = \theta_{t-1} + \eta\,\nabla_\theta J(\theta, \phi, \psi, \nu)$
end for
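The loop in Algorithm 1 can be sketched at a high level as follows; the DICE ratio estimator and the offline policy improvement step are stubbed out as hypothetical callables, and only the dual-variable and augmented-reward wiring follows the text:

```python
import numpy as np

def ovr_training_loop(batch, estimate_ratio, policy_improvement, lam=0.1, T=100):
    """Offline Variance Regularizer (Algorithm 1), with the ratio estimator
    and the offline RL update passed in as black boxes."""
    for t in range(T):
        omega = estimate_ratio(batch)      # step 1: DICE-style ratio estimate
        nu = omega * batch["r"]            # step 2: closed-form dual variable
        r_tilde = (batch["r"] - lam * nu * batch["r"]
                   - lam * batch["r"] ** 2)  # step 3: augmented rewards (eq. 7)
        policy_improvement(batch, r_tilde)   # step 4: e.g. a BCQ update
    return r_tilde

# Toy run with stub components (a trivial ratio estimator and a no-op update).
batch = {"r": np.array([1.0, 2.0])}
out = ovr_training_loop(batch,
                        estimate_ratio=lambda b: np.ones_like(b["r"]),
                        policy_improvement=lambda b, r: None,
                        lam=0.1, T=1)
```

The design choice here mirrors the text: because the dual has a closed form, the loop never needs a gradient through the variance term itself.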

4. THEORETICAL ANALYSIS

4.1. VARIANCE OF MARGINALIZED IMPORTANCE SAMPLING AND IMPORTANCE SAMPLING

We first show in Lemma 1 that the variance of episodic returns with importance sampling corrections can be upper bounded using the variance of per-step rewards under stationary distribution corrections. We emphasize that in the off-policy setting under distribution corrections, the variance is due to the estimation of the density ratio, rather than the importance sampling corrections.

Lemma 1. The following inequality holds between the variance of per-step rewards under stationary distribution corrections, denoted by $\mathcal{V}_D(\pi)$, and the variance of episodic returns with importance sampling corrections, $\mathcal{V}_P(\pi)$:

$$\mathcal{V}_P(\pi) \le \frac{\mathcal{V}_D(\pi)}{(1-\gamma)^2}$$

The proof, along with a discussion of the variance of episodic returns compared to per-step rewards under occupancy measures, is provided in appendix B.1.

4.2. POLICY IMPROVEMENT BOUND UNDER VARIANCE REGULARIZATION

In this section, we establish performance improvement guarantees (Kakade & Langford, 2002) for the variance regularized value function in policy optimization. Let us first recall that performance improvement can be written in terms of the total variation divergence $D_{TV}$ between state distributions (Touati et al., 2020) (for more discussion of performance bounds, see appendix C).

Lemma 2. For all policies $\pi$ and $\pi'$, we have the performance improvement bound based on the total variation of the state-action distributions $d_\pi$ and $d_{\pi'}$:

$$J(\pi') \ge L_\pi(\pi') - \epsilon^{\pi'} D_{TV}(d_\pi \| d_{\pi'})$$

where $\epsilon^{\pi'} = \max_s \big| \mathbb{E}_{a\sim\pi'(\cdot|s)}[A^\pi(s,a)] \big|$ and $L_\pi(\pi') = J(\pi) + \mathbb{E}_{s\sim d_\pi, a\sim\pi'}[A^\pi(s,a)]$. For a detailed proof and discussion, see appendix C.

Instead of considering the divergence between state visitation distributions, suppose we have access to state-action samples generated from the environment. To avoid importance sampling corrections, we can further bound the objective using state-action visitation distributions, with an upper bound following from (Nguyen et al., 2010): $D_{TV}(d_\pi(s)\|d_{\pi'}(s)) \le D_{TV}(d_\pi(s,a)\|d_{\pi'}(s,a))$. Following Pinsker's inequality, we have:

$$J(\pi') \ge J(\pi) + \mathbb{E}_{s\sim d_\pi(s),\, a\sim\pi'(\cdot|s)}\big[A^\pi(s,a)\big] - \epsilon^{\pi'} \sqrt{\tfrac{1}{2} D_{KL}\big(d_\pi(s,a)\,\|\,d_{\pi'}(s,a)\big)} \quad (11)$$

Furthermore, we can exploit the relation between KL, total variation (TV) and variance through the variational representation of divergence measures. Recall that the total variation between distributions p and q is given by $D_{TV}(p, q) = \frac{1}{2}\sum_x |p(x) - q(x)|$. Using the variational representation of the divergence, and denoting $d_\pi(s,a) = \beta_\pi(s,a)$, we have

$$D_{TV}(\beta_{\pi'} \| \beta_\pi) = \sup_{f:\mathcal{S}\times\mathcal{A}\to\mathbb{R}} \Big[ \mathbb{E}_{(s,a)\sim\beta_{\pi'}}\big[f(s,a)\big] - \mathbb{E}_{(s,a)\sim\beta_\pi}\big[(\phi^* \circ f)(s,a)\big] \Big]$$

where $\phi^*$ is the convex conjugate of $\phi$ and f is the dual function class based on the variational representation of the divergence.
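The Pinsker relation invoked above, $D_{TV}(p\|q) \le \sqrt{\tfrac{1}{2}D_{KL}(p\|q)}$, is easy to sanity-check numerically on small discrete distributions (illustrative sketch only; the distributions are arbitrary):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

def kl(p, q):
    """KL divergence D_KL(p || q) for strictly positive discrete distributions."""
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])
assert tv(p, q) <= np.sqrt(0.5 * kl(p, q))  # Pinsker's inequality holds
```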
Similar relations with the variational representations of f-divergences have also been considered in (Nachum et al., 2019b; Touati et al., 2020). We can finally obtain a bound on the policy improvement from this relation, in terms of the per-step variance:

Theorem 1. For all policies $\pi$ and $\pi'$, and the corresponding state-action visitation distributions $d_\pi$ and $d_{\pi'}$, we can obtain the performance improvement bound in terms of the variance of rewards under state-action occupancy measures:

$$J(\pi') - J(\pi) \ge \mathbb{E}_{s\sim d_\pi(s),\, a\sim\pi'(a|s)}\big[A^\pi(s,a)\big] - \mathrm{Var}_{(s,a)\sim d_\pi(s,a)}\big[f(s,a)\big]$$

where f(s,a) is the dual function class from the variational representation of variance.

Proof. For a detailed proof, see appendix C.1.

4.3. LOWER BOUND OBJECTIVE WITH VARIANCE REGULARIZATION

In this section, we show that augmenting the policy optimization objective with a variance regularizer leads to a lower bound on the original optimization objective $J(\pi_\theta)$. Following from (Metelli et al., 2018), we first note that the variance of marginalized importance weighting with distribution corrections can be written in terms of the $\alpha$-Renyi divergence. Let p and q be two probability measures, such that the Renyi divergence is

$$F_\alpha(p\|q) = \frac{1}{\alpha - 1} \log \sum_x q(x) \Big(\frac{p(x)}{q(x)}\Big)^\alpha$$

In the limit $\alpha \to 1$, this leads to the well-known KL divergence, $F_1(p\|q) = F_{KL}(p\|q)$. Following (Metelli et al., 2018), and extending results from importance sampling $\rho$ to marginalized importance sampling $\omega_{\pi/\mathcal{D}}$, we provide the following result bounding the variance of the approximated density ratio $\hat{\omega}_{\pi/\mathcal{D}}$ in terms of the Renyi divergence:

Lemma 3. Assume the rewards of the MDP are bounded by a finite constant, $\|r\|_\infty \le R_{max}$. Given samples $(s,a)\sim d_\mathcal{D}(s,a)$ from dataset $\mathcal{D}$, for any N > 0, the variance of the marginalized importance weights satisfies:

$$\mathrm{Var}_{(s,a)\sim d_\mathcal{D}(s,a)}\big[\omega_{\pi/\mathcal{D}}(s,a)\big] = F_2(d_\pi \| d_\mathcal{D}) - 1 \quad (14)$$

from which it follows that:

$$\mathrm{Var}_{(s,a)\sim d_\mathcal{D}(s,a)}\big[\hat{\omega}_{\pi/\mathcal{D}}(s,a)\big] \le \frac{1}{N}\, \|r\|_\infty^2\, F_2(d_\pi \| d_\mathcal{D})$$

See appendix D.1 for more details. Following this, our goal is to derive a lower bound objective for our off-policy optimization problem. Concentration inequalities have previously been studied for both off-policy evaluation (Thomas et al., 2015a) and optimization (Thomas et al., 2015b). In our case, we can adapt the concentration bound derived from Cantelli's inequality and derive the following result based on the variance of marginalized importance sampling. Under state-action distribution corrections, we have the following lower bound to the off-policy policy optimization objective:

Theorem 2. Given state-action occupancy measures $d_\pi$ and $d_\mathcal{D}$, and assuming bounded reward functions, for any $0 < \delta \le 1$ and N > 0, we have with probability at least $1-\delta$ that:

$$J(\pi) \ge \mathbb{E}_{(s,a)\sim d_\mathcal{D}(s,a)}\big[\omega_{\pi/\mathcal{D}}(s,a)\cdot r(s,a)\big] - \sqrt{\frac{1-\delta}{\delta}\, \mathrm{Var}_{(s,a)\sim d_\mathcal{D}(s,a)}\big[\omega_{\pi/\mathcal{D}}(s,a)\cdot r(s,a)\big]} \quad (16)$$

Equation 16 gives the lower bound policy optimization objective under risk-sensitive variance constraints. The key point of the derivation in theorem 2 is that, given off-policy batch data collected with behaviour policy $\mu(a|s)$, we are indeed optimizing a lower bound to the policy optimization objective, regularized with a variance term that minimizes the variance in batch off-policy learning.
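The lower bound of equation 16 is straightforward to estimate from batch samples (a sketch; `omega` and `rewards` are hypothetical stand-ins for the estimated ratios and batch rewards):

```python
import numpy as np

def variance_penalized_lower_bound(omega, rewards, delta=0.1):
    """Empirical version of equation 16: the mean of omega*r minus the
    Cantelli-style penalty sqrt((1-delta)/delta * Var[omega*r])."""
    w = np.asarray(omega, dtype=float) * np.asarray(rewards, dtype=float)
    return np.mean(w) - np.sqrt((1 - delta) / delta * np.var(w))

omega = np.array([1.0, 1.0, 1.0, 1.0])
rewards = np.array([2.0, 2.0, 2.0, 2.0])
# With zero variance, the bound coincides with the plain estimate of J(pi).
lb = variance_penalized_lower_bound(omega, rewards)
```

Smaller confidence levels $\delta$ inflate the penalty, making the bound, and hence the resulting policy evaluation, more pessimistic.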

5. EXPERIMENTAL RESULTS ON BENCHMARK OFFLINE CONTROL TASKS

Experimental Setup : We demonstrate the significance of our variance regularizer on a range of continuous control domains (Todorov et al., 2012) with fixed offline datasets from D4RL (Fu et al., 2020), a standard benchmark for offline algorithms. We mainly apply OVR on top of the BCQ algorithm and compare it with existing baselines, using the D4RL (Fu et al., 2020) offline datasets for different tasks and off-policy distributions.

Experimental results are given in Table 1.

Performance on Optimal and Medium Quality Datasets : We first evaluate the performance of OVR when the dataset consists of optimal and mediocre logging policy data, collected using a fully (expert) or partially (medium) trained SAC policy. We build our algorithm OVR on top of BCQ, denoted by BCQ + VAR. Note that the OVR algorithm can be agnostic to the behaviour policy when computing the distribution ratio (Nachum et al., 2019a) and the variance. We compare against existing baselines, including BEAR (Kumar et al., 2019), behavior-regularized actor critic with policy regularization (BRAC-p) (Wu et al., 2019a), AlgaeDICE (aDICE) (Nachum et al., 2019b) and offline SAC (SAC-off) (Haarnoja et al., 2018). The results presented are the normalized returns on each task as per Fu et al. (2020) (see Table 3 in Fu et al. (2020) for the unnormalized scores on each task). We see that in most tasks we are able to obtain significant gains using OVR. Our algorithm can be applied to any policy optimization baseline that trains the policy by maximizing the expected rewards; unlike BCQ, BEAR (Kumar et al., 2019) does not have the same objective, as it trains the policy using an MMD objective. Although performance is only marginally improved with OVR under the expert setting, since the demonstrations are themselves optimal, we can achieve significant improvements in the medium dataset regime. This is because OVR plays a more important role when there is larger variance due to distribution mismatch between the data logging and target policy distributions. Experimental results are shown in the first two columns of figure 1.

Performance on Random and Mixed Datasets : We then evaluate the performance on random datasets, i.e., the worst-case setup where the data logging policy is a random policy, as shown in the last two columns of figure 1. As expected, we observe no improvements at all, and even existing baselines such as BCQ (Fujimoto et al., 2019) can work poorly under the random dataset setting.
When we collect data using a mixture of random and mediocre policy, denoted by mixed, the performance is again improved for OVR on top of BCQ, especially for the Hopper and Walker control domains. We provide additional experimental results and ablation studies in appendix E.1.

6. RELATED WORKS

We now discuss related work in offline RL, for evaluation and optimization, and its relation to variance and risk sensitive algorithms. We include more discussion of related work in appendix A.1. In off-policy evaluation, per-step importance sampling (Precup et al., 2000; 2001) has previously been used for off-policy value function estimators. However, this leads to high variance estimators, and recent works have proposed marginalized importance sampling, estimating stationary state-action distribution ratios (Liu et al., 2018; Nachum et al., 2019a; Zhang et al., 2019), to reduce variance at the cost of additional bias. In this work, we build on the variance of marginalized IS to develop a variance risk sensitive offline policy optimization algorithm. This is in contrast to prior works on variance constrained online actor-critic methods (A. & Ghavamzadeh, 2016; Chow et al., 2017; Castro et al., 2012), and relates to constrained policy optimization methods (Achiam et al., 2017; Tessler et al., 2019). For offline policy optimization, several works have recently addressed the overestimation problem in batch RL (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019b), including the very recently proposed Conservative Q-Learning (CQL) algorithm (Kumar et al., 2020). Our work was done in parallel to CQL, due to which we do not include it as a baseline in our experiments. CQL learns a value function which is guaranteed to lower-bound the true value function, which helps prevent value over-estimation for out-of-distribution (OOD) actions, an important issue in offline RL. We note that our approach is orthogonal to CQL, in that CQL introduces a regularizer on the state-action value function $Q^\pi(s,a)$ based on the Bellman error (the first two terms in equation 2 of CQL), while we introduce a variance regularizer on the stationary state distribution $d_\pi(s)$.
Since the value of a policy can be expressed in two ways, either through $Q^\pi(s,a)$ or through occupancy measures $d_\pi(s)$, both CQL and our paper are essentially motivated by the same objective of optimizing a lower bound on J(θ), but through different regularizers. Our work can also be considered similar to AlgaeDICE (Nachum et al., 2019b): we introduce a variance regularizer based on the distribution corrections, instead of minimizing the f-divergence between stationary distributions as in AlgaeDICE. Both our work and AlgaeDICE consider the dual form of the policy optimization objective in the batch setting; where we apply the Fenchel duality trick to our variance term, AlgaeDICE instead uses the variational form, followed by the change of variables trick inspired by (Nachum et al., 2019a), to handle their divergence measure.

7. DISCUSSION AND CONCLUSION

We proposed a new framework for offline policy optimization with variance regularization, called OVR, to tackle high variance issues due to distribution mismatch in offline policy optimization. Our work provides a practically feasible variance constrained actor-critic algorithm that avoids the double sampling issues of prior variance risk sensitive algorithms (Castro et al., 2012; A. & Ghavamzadeh, 2016). The presented variance regularizer leads to a lower bound on the true offline optimization objective, thus yielding pessimistic value function estimates and avoiding both the high variance and overestimation problems in offline RL. Experimentally, we evaluated the significance of OVR on standard benchmark offline datasets, with different data logging off-policy distributions, and showed that OVR plays a more significant role when there is large variance due to distribution mismatch. While we only provide a variance related risk sensitive approach for offline RL, for future work it would be interesting to study other risk sensitive approaches (Chow & Ghavamzadeh, 2014; Chow et al., 2017) and examine their significance in batch RL. We hope our proposed variance regularization framework provides new opportunities for developing practically robust risk sensitive offline algorithms.

A APPENDIX : ADDITIONAL DISCUSSIONS

A.1 EXTENDED RELATED WORK

Other related works : Several other prior works have previously considered the batch RL setting (Lange et al., 2012) for off-policy evaluation, counterfactual risk minimization (Swaminathan & Joachims, 2015a; b), learning value based methods such as DQN (Agarwal et al., 2019), and others (Kumar et al., 2019; Wu et al., 2019b). Recently, batch off-policy optimization has also been introduced to reduce the exploitation error (Fujimoto et al., 2019) and for regularizing with arbitrary behaviour policies (Wu et al., 2019b). However, due to the per-step importance sampling corrections on episodic returns (Precup et al., 2000), off-policy batch RL remains challenging. In this work, we instead consider marginalized importance sampling corrections and correct for the stationary state-action distributions (Nachum et al., 2019a; Uehara & Jiang, 2019; Zhang et al., 2020). Additionally, under the framework of constrained MDPs (Altman, 1999), risk-sensitive and constrained actor-critic algorithms have been proposed previously (Chow et al., 2017; Chow & Ghavamzadeh, 2014; Achiam et al., 2017). However, these works come with their own demerits, as they mostly require minimizing the risk (i.e., variance) term, where finding the gradient of the variance term often leads to a double sampling issue (Baird, 1995). We avoid this by instead using Fenchel duality (Boyd & Vandenberghe, 2004), inspired by recent works (Nachum & Dai, 2020; Dai et al., 2018), and cast risk constrained actor-critic learning as a max-min optimization problem. Our work is closely related to (Bisi et al., 2019), which also considers the per-step variance of returns w.r.t. state occupancy measures, but in the on-policy setting, while we instead consider the batch off-policy optimization setting with per-step rewards w.r.t. stationary distribution corrections.
Constrained optimization has previously been studied in reinforcement learning for batch policy learning (Le et al., 2019) and optimization (Achiam et al., 2017), mostly under the framework of constrained MDPs (Altman, 1999). In such frameworks, the cumulative return objective is augmented with a set of constraints, for safe exploration (García et al., 2015; Perkins & Barto, 2003; Ding et al., 2020) or to reduce risk measures (Chow et al., 2017; A. & Fu, 2018; Castro et al., 2012). Batch learning algorithms (Lange et al., 2012) have previously been considered for counterfactual risk minimization and generalization (Swaminathan & Joachims, 2015a; b) and for policy evaluation (Thomas et al., 2015a; Li et al., 2015), although little has been done for constrained offline policy based optimization. This raises the question of how we can learn policies in RL from fixed offline data, similar to supervised or unsupervised learning.

A.2 WHAT MAKES OFFLINE OFF-POLICY OPTIMIZATION DIFFICULT?

Offline RL optimization algorithms often suffer from distribution mismatch issues, since the underlying data distribution in the batch may be quite different from the distribution induced by the target policy. Recent works (Fujimoto et al., 2019; Kumar et al., 2019; Agarwal et al., 2019; Kumar et al., 2020) have tried to address this by avoiding overestimation of Q-values, which leads to extrapolation error when bootstrapping value function estimates and causes offline RL agents to generalize poorly in unseen regions of the dataset. Additionally, due to the distribution mismatch, value function estimates can also have large variance, so existing online off-policy algorithms (Haarnoja et al., 2018; Lillicrap et al., 2016; Fujimoto et al., 2018) may fail without online interactions with the environment. In this work, we address the latter problem by minimizing the variance of value function estimates through variance-related risk constraints.

B APPENDIX : PER-STEP VERSUS EPISODIC VARIANCE OF RETURNS

Following from (Castro et al., 2012; A. & Ghavamzadeh, 2016), let us denote the returns with importance sampling corrections in the off-policy learning setting as :

D^π(s, a) = [ Σ_{t=0}^{T} γ^t r(s_t, a_t) Π_{t=1}^{T} π(a_t | s_t)/μ(a_t | s_t) | s_0 = s, a_0 = a, τ ∼ μ ]   (18)

where, as in (Sobel, 1982), equation 18 also follows a Bellman-like equation; however, due to the lack of monotonicity required for dynamic programming (DP), such measures cannot be directly optimized by standard DP algorithms (A. & Fu, 2018). In contrast, if we consider the variance of returns with stationary distribution corrections (Nachum et al., 2019a; Liu et al., 2018), rather than the product of importance sampling ratios, the variance term weights the rewards with the distribution ratio ω_{π/μ}. Typically, the distribution ratio is approximated using a separate function class (Uehara & Jiang, 2019), such that the per-step corrected reward can be written as :

W^π(s, a) = ω_{π/D}(s, a) · r(s, a),   a ∼ π(· | s), (s, a) ∼ d^D(s, a)   (19)

where we denote D as the data distribution of the fixed dataset, collected by either a known or an unknown behaviour policy. The variance of returns under occupancy measures is therefore given by :

V_D(π) = E_{(s,a)∼d^D(s,a)}[ W^π(s, a)^2 ] − E_{(s,a)∼d^D(s,a)}[ W^π(s, a) ]^2   (20)

where the variance expression in equation 20 depends on the square of the per-step rewards with distribution correction ratios. We denote this as the dual form of the variance of returns, in contrast to the primal form of the variance of expected returns (Sobel, 1982). Even though the variance term under episodic per-step importance sampling corrections in equation 18 is closely related to the variance with stationary distribution corrections in equation 20, following from (Bisi et al., 2019), we will show that the variance with distribution corrections indeed upper bounds the variance of importance sampling corrections.
This is an important relationship, since constraining the policy improvement step under variance constraints with occupancy measures allows us to obtain a lower bound to the offline optimization objective, similar to (Kumar et al., 2020).
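As a concrete illustration, the dual-form variance in equation 20 can be estimated directly from batch samples. Below is a minimal NumPy sketch, assuming the ratio estimates ω(s, a) have already been obtained from some separate estimator; the arrays here are synthetic placeholders, not data from our experiments.

```python
import numpy as np

def dual_form_variance(omega, rewards):
    """Sample estimate of the dual-form variance V_D(pi) (equation 20):
    variance of per-step rewards weighted by the stationary
    distribution-correction ratios omega = d_pi / d_D."""
    w = omega * rewards                       # W^pi(s, a) for each batch sample
    return np.mean(w ** 2) - np.mean(w) ** 2

rng = np.random.default_rng(0)
omega = rng.uniform(0.5, 2.0, size=1000)      # hypothetical ratio estimates
rewards = rng.normal(0.0, 1.0, size=1000)     # per-step rewards from the batch
v_d = dual_form_variance(omega, rewards)
```

In practice the same quantity would be computed over minibatches from the replay buffer, with ω produced by whichever ratio estimator is in use.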

B.1 PROOF OF LEMMA 1 : VARIANCE INEQUALITY

Following from (Bisi et al., 2019) , we show that the variance of per-step rewards under occupancy measures, denoted by V D (π) upper bounds the variance of episodic returns V P (π). V P (π) ≤ V D (π) (1 -γ) 2 (21) Proof. Proof of Lemma 1 following from (Bisi et al., 2019) is as follows. Denoting the returns, as above, but for the on-policy case with trajectories under π, as D π (s, a) = ∞ t=0 γ t r(s t , a t ), and denoting the return objective as J(π) = E s0∼ρ,at∼π(•|st),s ∼P D π (s, a) , the variance of episodic returns can be written as : V P (π) = E (s,a)∼dπ(s,a) D π (s, a) - J(π) (1 -γ) 2 (22) = E (s,a)∼dπ(s,a) (D π (s, a)) 2 + J(π) (1 -γ) 2 - 2J(π) (1 -γ) E (s,a)∼dπ(s,a) D π (s, a) (23) = E (s,a)∼dπ(s,a) D π (s, a) 2 - J(π) 2 (1 -γ) 2 (24) Similarly, denoting returns under occupancy measures as W π (s, a) = d π (s, a)r(s, a), and the returns under occupancy measures, equivalently written as J(π) = E (s,a)∼dπ(s,a) [r(s, a)] based on the primal and dual forms of the objective (Uehara & Jiang, 2019; Nachum & Dai, 2020) , we can equivalently write the variance as : Following from equation 22 and 25, we therefore have the following inequality : V D (π) = E (s, (1 -γ) 2 E s0∼ρ,a∼π D π (s, a) 2 ≤ (1 -γ) 2 E s0∼ρ,a∼π ∞ t=0 γ t ∞ t=0 γ t r(s t , a t ) 2 (28) = (1 -γ)E s0∼ρ,a∼π ∞ t=0 γ t r(s t , a t ) 2 (29) = E (s,a)∼dπ(s,a) r(s, a) 2 (30) where the first line follows from Cauchy-Schwarz inequality. This concludes the proof. We can further extend lemma 1, for off-policy returns under stationary distribution corrections (ie, marginalized importance sampling) compared importance sampling. Recall that we denote the variance under stationary distribution corrections as : V D (π) = E (s,a)∼d D (s,a) ω π/D (s, a) • r(s, a) -J(π) 2 (31) = E (s,a)∼d D (s,a) ω π/D (s, a) 2 • r(s, a) 2 -J(π) 2 (32) where J(π) = E (s,a)∼d D (s,a) ω π/D (s, a) • r(s, a) . We denote the episodic returns with importance sampling corrections as : D π = T t=0 γ t r t ρ 0:t . 
The variance, as denoted earlier, is given by :

V_P(π) = E_{(s,a)∼d_π(s,a)}[ D^π(s, a)^2 ] − J(π)^2/(1 − γ)^2   (33)

We therefore have the following inequality :

(1 − γ)^2 E_{s_0∼ρ, a∼π}[ D^π(s, a)^2 ] ≤ (1 − γ)^2 E_{s_0∼ρ, a∼π}[ ( Σ_{t=0}^{T} γ^t ) ( Σ_{t=0}^{T} γ^t r(s_t, a_t)^2 ( Π_{t=0}^{T} π(a_t|s_t)/μ_D(a_t|s_t) )^2 ) ]
 = (1 − γ) E_{s_0∼ρ, a∼π}[ Σ_{t=0}^{T} γ^t r(s_t, a_t)^2 ( Π_{t=0}^{T} π(a_t|s_t)/μ_D(a_t|s_t) )^2 ]   (34)
 = E_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a)^2 · r(s, a)^2 ]   (35)

which shows that Lemma 1 also holds for off-policy returns with stationary distribution corrections.
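The Cauchy-Schwarz step used above (splitting γ^t r_t into √(γ^t) · √(γ^t) r_t) can be sanity-checked numerically. The snippet below verifies the inequality (Σ_t γ^t r_t)^2 ≤ (Σ_t γ^t)(Σ_t γ^t r_t^2) on a random reward sequence; the rewards are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, T = 0.99, 200
rewards = rng.normal(size=T)
disc = gamma ** np.arange(T)

# Cauchy-Schwarz: (sum_t gamma^t r_t)^2 <= (sum_t gamma^t) * (sum_t gamma^t r_t^2),
# obtained by writing gamma^t r_t = sqrt(gamma^t) * (sqrt(gamma^t) * r_t).
lhs = np.sum(disc * rewards) ** 2
rhs = np.sum(disc) * np.sum(disc * rewards ** 2)
```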

B.2 DOUBLE SAMPLING FOR COMPUTING GRADIENTS OF VARIANCE

The gradient of the variance term often leads to the double sampling issue, making it impractical to use directly. This issue has also been pointed out by several other works (A. & Ghavamzadeh, 2016; Castro et al., 2012; Chow et al., 2017), since the variance involves the square of the objective function itself. Recall that we have :

V_D(θ) = E_{(s,a)∼d^D}[ ( ω_{π/D}(s, a) · r(s, a) )^2 ] − E_{(s,a)∼d^D}[ ω_{π/D}(s, a) · r(s, a) ]^2   (36)

The gradient of the variance term is therefore :

∇_θ V_D(θ) = ∇_θ E_{(s,a)∼d^D}[ ( ω_{π/D}(s, a) · r(s, a) )^2 ] − 2 · E_{(s,a)∼d^D}[ ω_{π/D}(s, a) · r(s, a) ] · ∇_θ E_{(s,a)∼d^D}[ ω_{π/D}(s, a) · r(s, a) ]   (37)

where equation 37 requires multiple samples to compute the expectations in the second term. To see why, let us denote J(θ) = E_{d^D(s,a)}[ ω_{π/D}(s, a) · r(s, a) ], and write IS(ω, π_θ) = ω_{π/D}(s, a) · r(s, a) for the per-step corrected returns in short form. The variance of the returns with the stationary state-action distribution corrections can therefore be written as :

V_D(θ) = E_{d^D(s,a)}[ IS(ω, π_θ)^2 ]  (a)  −  E_{d^D(s,a)}[ IS(ω, π_θ) ]^2  (b)   (38)

We derive the gradient of each of the terms (a) and (b) in equation 38 below. First, we find the gradient of term (a) w.r.t. θ :

∇_θ E_{d^D(s,a)}[ IS(ω, π_θ)^2 ] = ∇_θ Σ_{s,a} d^D(s, a) IS(ω, π_θ)^2
 = Σ_{s,a} d^D(s, a) ∇_θ IS(ω, π_θ)^2
 = Σ_{s,a} d^D(s, a) · 2 · IS(ω, π_θ) · IS(ω, π_θ) · ∇_θ log π_θ(a | s)
 = 2 · Σ_{s,a} d^D(s, a) IS(ω, π_θ)^2 ∇_θ log π_θ(a | s)
 = 2 · E_{d^D(s,a)}[ IS(ω, π_θ)^2 · ∇_θ log π_θ(a | s) ]   (39)

Equation 39 interestingly shows that this gradient has a form similar to the policy gradient term, except that the critic estimate is given by the importance-corrected returns, since IS(ω, π_θ) = ω_{π/D}(s, a) · r(s, a). We next find the gradient of term (b) from equation 38.
The gradient of the second term w.r.t. θ is :

∇_θ E_{d^D(s,a)}[ IS(ω, π_θ) ]^2 = ∇_θ J(θ)^2 = 2 · J(θ) · E_{d^D(s,a)}[ ω_{π/D} · ∇_θ log π_θ(a | s) · Q^π(s, a) ]   (40)

Overall, the expression for the gradient of the variance term is therefore :

∇_θ V_D(θ) = 2 · E_{d^D(s,a)}[ IS(ω, π_θ)^2 · ∇_θ log π_θ(a | s) ] − 2 · J(θ) · E_{d^D(s,a)}[ ω_{π/D} · ∇_θ log π_θ(a | s) · Q^π(s, a) ]   (41)

The variance gradient in equation 41 is difficult to estimate in practice, since it involves both the gradient of the objective and the objective J(θ) itself. This is known as the double sampling issue (Baird, 1995), which requires separate independent rollouts. Previously, (Castro et al., 2012) tackled the gradient of the variance term using simultaneous perturbation stochastic approximation (SPSA) (Spall, 1992), keeping running estimates of both the return and the variance term and using a two-timescale algorithm to compute the gradient of the variance regularizer with per-step importance sampling corrections.
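The double sampling issue can be illustrated with a toy scalar example: estimating (E[X])^2 by squaring a single minibatch mean is biased upward by Var(X)/batch-size, while the product of two independent minibatch means is unbiased. The sketch below uses hypothetical Gaussian "returns" purely for illustration; the constants are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, n_batches, batch = 1.0, 20000, 8
true_sq = mu ** 2

single, double = [], []
for _ in range(n_batches):
    x1 = rng.normal(mu, 1.0, size=batch)      # one minibatch of returns
    x2 = rng.normal(mu, 1.0, size=batch)      # an independent minibatch
    single.append(np.mean(x1) ** 2)           # same batch used twice: biased
    double.append(np.mean(x1) * np.mean(x2))  # two independent batches: unbiased

bias_single = np.mean(single) - true_sq       # close to Var(X)/batch = 1/8
bias_double = np.mean(double) - true_sq       # close to 0
```

This is exactly why equation 41 cannot be estimated from one minibatch: the J(θ) factor and the gradient factor must come from independent samples.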

B.3 ALTERNATIVE DERIVATION : VARIANCE REGULARIZATION VIA FENCHEL DUALITY

In the derivation of our algorithm, we applied the Fenchel duality trick to the second term of the variance expression 25. An alternative derivation applies the Fenchel duality trick to both terms of the variance expression. This can be useful, since equation 41 requires evaluating both the gradient terms and the objective J(θ) itself, due to the analytical expression of the form ∇_θ J(θ) · J(θ), hence suffering from the double sampling issue. In general, the Fenchel duality for the square is given by x^2 = max_y (2xy − y^2), and applying it to both terms, since they both involve squares, we get :

E_{d^D(s,a)}[ IS(ω, π_θ)^2 ] = E_{d^D(s,a)}[ max_y { 2 · IS(ω, π_θ) · y(s, a) − y(s, a)^2 } ]
 = 2 · max_y { E_{d^D(s,a)}[ IS(ω, π_θ) · y(s, a) ] − E_{d^D(s,a)}[ y(s, a)^2 ] }   (42)

Similarly, applying Fenchel duality to the second term (b), we have :

E_{d^D(s,a)}[ IS(ω, π_θ) ]^2 = max_ν { 2 · E_{d^D(s,a)}[ IS(ω, π_θ) · ν(s, a) ] − ν^2 }   (43)

Overall, we therefore have the variance term after applying Fenchel duality, leading to an objective of the form max_y max_ν V_D(θ), which we can use as our variance regularizer :

V_D(θ) = 2 · max_y { E_{d^D(s,a)}[ IS(ω, π_θ) · y(s, a) ] − E_{d^D(s,a)}[ y(s, a)^2 ] } − max_ν { 2 · E_{d^D(s,a)}[ IS(ω, π_θ) · ν(s, a) ] − ν^2 }   (44)

Using the variance of stationary-distribution-corrected returns as a regularizer, we can find the gradient of the variance term w.r.t. θ as follows, where the gradient terms that depend only on the dual variables y and ν are 0 :

∇_θ V_D(θ) = 2 · ∇_θ E_{d^D(s,a)}[ IS(ω, π_θ) · y(s, a) ] − 0 − 2 · ∇_θ E_{d^D(s,a)}[ IS(ω, π_θ) · ν(s, a) ] + 0   (45)
 = 2 · E_{d^D(s,a)}[ IS(ω, π_θ) · y(s, a) · ∇_θ log π_θ(a | s) ] − 2 · E_{d^D(s,a)}[ IS(ω, π_θ) · ν(s, a) · ∇_θ log π_θ(a | s) ]
 = 2 · E_{d^D(s,a)}[ IS(ω, π_θ) · ∇_θ log π_θ(a | s) · ( y(s, a) − ν(s, a) ) ]   (46)

Note that in equation 46 the two terms of the gradient are almost identical; the difference comes only from the difference between the two dual variables y(s, a) and ν(s, a).
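The scalar Fenchel identity x^2 = max_y (2xy − y^2) underlying this derivation can be verified numerically. The sketch below checks that the maximum over a dense grid of y values attains x^2, with the maximizer at y = x; the particular value x = 3 is arbitrary.

```python
import numpy as np

# Fenchel duality for the square: x^2 = max_y (2*x*y - y^2), maximized at y = x.
x = 3.0
ys = np.linspace(-10.0, 10.0, 100001)   # dense grid of candidate dual values y
vals = 2.0 * x * ys - ys ** 2           # the dual objective at each grid point
best_y = ys[vals.argmax()]              # should sit at y = x
```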
Note that our variance term also requires separately maximizing over the dual variables, both of which have closed-form updates :

∇_ν V_D(θ) = −2 · ∇_ν E_{d^D(s,a)}[ IS(ω, π_θ) · ν(s, a) ] + ∇_ν ν^2 = 0   (47)

Solving this exactly leads to the closed-form solution ν(s, a) = E_{d^D(s,a)}[ IS(ω, π_θ) ]. Similarly, we can also solve exactly for the dual variable y, such that :

∇_y V_D(θ) = 2 · ∇_y E_{d^D(s,a)}[ IS(ω, π_θ) · y(s, a) ] − 2 · ∇_y E_{d^D(s,a)}[ y(s, a)^2 ] = 0   (48)

which leads to the closed-form solution y(s, a) = (1/2) · IS(ω, π_θ) = (1/2) · (d_π(s, a)/d_μ(s, a)) · r(s, a). Note that the exact solutions for the two dual variables are similar to each other, except that ν(s, a) is the expectation of the returns with stationary distribution corrections, whereas y(s, a) involves only the return from a single rollout.

C APPENDIX : MONOTONIC PERFORMANCE IMPROVEMENT GUARANTEES UNDER VARIANCE REGULARIZATION

We provide theoretical analysis and performance improvement bounds for our proposed variance-constrained policy optimization approach. Following from (Kakade & Langford, 2002; Schulman et al., 2015; Achiam et al., 2017), we extend existing performance improvement guarantees to the stationary state-action distributions, instead of considering only the divergence between the current policy and the old policy. We show that the existing conservative updates in algorithms such as (Schulman et al., 2015) can be considered for both the state visitation distributions and the action distributions, as similarly pointed out by (Achiam et al., 2017). We can then adapt this for variance constraints instead of divergence constraints. According to the performance difference lemma (Kakade & Langford, 2002), we have, for all policies π and π' :

J(π') − J(π) = E_{s∼d_{π'}, a∼π'}[ A^π(s, a) ]   (49)

which implies that maximizing equation 49 leads to an improved policy π' with policy improvement guarantees over the previous policy π. We can write the advantage function with variance-augmented value functions as :

A^π_λ(s, a) = Q^π_λ(s, a) − V^π_λ(s) = E_{s'∼P}[ r(s, a) − λ ( r(s, a) − J(π) )^2 + γ V^π_λ(s') − V^π_λ(s) ]   (50)

However, equation 49 is often difficult to maximize directly, since it additionally requires samples from π' and d_{π'}, so a surrogate objective is instead proposed by (Kakade & Langford, 2002). Following (Schulman et al., 2015), we can therefore obtain a bound for the performance difference based on the variance-regularized advantage function :

J(π') ≥ J(π) + E_{s∼d_π(s), a∼π'(a|s)}[ A^π_λ(s, a) ]

where we have augmented rewards in the advantage function; by following Fenchel duality for the variance, we can avoid policy-dependent reward functions. Otherwise, the augmented reward for the value functions is r̃(s, a) = r(s, a) − λ ( r(s, a) − J(π) )^2. This suggests, however, that the performance difference does not hold without proper assumptions (Bisi et al., 2019).
We can therefore obtain a monotonic improvement guarantee by considering the divergence between the state distributions. Following from (Schulman et al., 2015), and applying the variational form of the total variation (TV) distance, we can alternatively write this as :

J(π') ≥ J(π) + E_{s∼d_π, a∼π'}[ A^π(s, a) ] − C · E_{s∼d_π}[ D_TV(d_{π'} || d_π)^2 ]
 = J(π) + E_{s∼d_π, a∼π'}[ A^π(s, a) ] − C · E_{s∼d_π}[ max_f { E_{s∼d_{π'}, a∼π'}[ f(s, a) ] − E_{s∼d_π, a∼π}[ f(s, a) ] }^2 ]
 ≥ J(π) + E_{s∼d_π, a∼π'}[ A^π(s, a) ] − C · max_f E_{s∼d_π}[ { E_{s∼d_{π'}, a∼π'}[ f(s, a) ] − E_{s∼d_π, a∼π}[ f(s, a) ] }^2 ]
 = J(π) + E_{s∼d_π, a∼π'}[ A^π(s, a) ] − C · max_f E_{s∼d_π, a∼π}[ ( f(s, a) − E_{s∼d_π, a∼π}[ f(s, a) ] )^2 ]
 = J(π) + E_{s∼d_π, a∼π'}[ A^π(s, a) ] − C · max_f Var_{s∼d_π, a∼π}[ f(s, a) ]

Therefore the policy improvement bound depends on maximizing the variational representation f(s, a) of the f-divergence to guarantee improvement from J(π) to J(π'). This leads to the stated result in Theorem 1.

D APPENDIX : LOWER BOUND OBJECTIVE WITH VARIANCE REGULARIZATION

D.1 PROOF OF LEMMA 3

Recalling Lemma 3, the proof follows from (Metelli et al., 2018). We extend it to marginalized importance weighting, and include it here for completeness. Note that, compared to importance weighting, which leads to an unbiased estimator as in (Metelli et al., 2018), correcting for the state-action occupancy measures leads to a biased estimator, due to the approximation ω̂_{π/D}. However, for our analysis we only need to show a lower bound objective, and therefore do not provide a bias-variance analysis as in off-policy evaluation. Continuing the chain of inequalities :

Var_{(s_1,a_1)∼d^D(s,a)}[ (d_π(s_1, a_1)/d^D(s_1, a_1)) · r(s_1, a_1) ] ≤ (1/N) E_{(s_1,a_1)∼d^D(s,a)}[ ( (d_π(s_1, a_1)/d^D(s_1, a_1)) · r(s_1, a_1) )^2 ]
 ≤ (1/N) ||r||^2_∞ E_{(s_1,a_1)∼d^D(s,a)}[ ( d_π(s_1, a_1)/d^D(s_1, a_1) )^2 ]
 = (1/N) ||r||^2_∞ F_2(d_π || d^D)

Proof. The proof for the lower bound objective can be obtained as follows. We first use the relationship between the variance and the α-divergence with α = 2, as also noted in (Metelli et al., 2018). Given batch samples D, and denoting the state-action distribution correction by ω_{π/D}(s, a), we can write from Lemma 3 :

Var_{(s,a)∼d^D(s,a)}[ ω̂_{π/D} ] ≤ (1/N) ||r||^2_∞ F_2(d_π || d^D)

where the per-step estimator with state-action distribution corrections is given by ω_{π/D}(s, a) · r(s, a). Here, the reward function r(s, a) is bounded, and for any N > 0 the variance of the per-step reward estimator with distribution corrections is upper bounded in terms of the Rényi divergence (α = 2). Finally, following from (Metelli et al., 2018) and using Cantelli's inequality, we have, for any λ > 0 :

Pr( E_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a) · r(s, a) ] − J(π) ≥ λ ) ≤ 1 / ( 1 + λ^2 / Var_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a) · r(s, a) ] )

so that, with probability at least 1 − δ, where 0 < δ < 1 :

J(π) ≥ E_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a) · r(s, a) ] − sqrt( ((1 − δ)/δ) · Var_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a) · r(s, a) ] )   (65)

where we can further replace the variance term using the α = 2 Rényi divergence, which concludes the proof of the theorem.
We can further write the lower bound in terms of the α-Rényi divergence, following the relation between the variance and the Rényi divergence for α = 2 :

J(π) = E_{(s,a)∼d_π(s,a)}[ r(s, a) ] ≥ E_{(s,a)∼d^D(s,a)}[ (d_π(s, a)/d^D(s, a)) · r(s, a) ] − ||r||_∞ · sqrt( (1 − δ) d_2(d_π || d^D) / (δ N) )

This hints at the similarity between our proposed variance-regularized objective and other related works, including AlgaeDICE (Nachum et al., 2019b), which uses an f-divergence D_f(d_π || d^D) between stationary distributions.
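As a concrete illustration of the resulting pessimistic objective, the sketch below computes the variance-penalized lower bound mean(ω·r) − sqrt((1−δ)/δ · Var(ω·r)) from batch samples. The ratio estimates and rewards here are synthetic placeholders, not from our experiments.

```python
import numpy as np

def pessimistic_lower_bound(omega, rewards, delta=0.1):
    """Concentration-style lower bound on J(pi) from batch samples:
    mean of ratio-weighted rewards minus a variance penalty,
    J(pi) >= E[omega * r] - sqrt((1 - delta)/delta * Var[omega * r])."""
    w = omega * rewards
    return np.mean(w) - np.sqrt((1.0 - delta) / delta * np.var(w))

rng = np.random.default_rng(3)
omega = rng.uniform(0.5, 2.0, size=5000)    # hypothetical ratio estimates
rewards = rng.uniform(0.0, 1.0, size=5000)
lb = pessimistic_lower_bound(omega, rewards, delta=0.1)
```

Smaller δ (higher confidence) makes the bound more pessimistic, since the penalty multiplier sqrt((1−δ)/δ) grows as δ shrinks.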

E.1 EXPERIMENTAL ABLATION STUDIES

In this section, we present additional results using state-action experience replay weightings on existing offline algorithms, and analyse the significance of our variance regularizer on likelihood-corrected offline algorithms. Denoting by ω(s, a) the importance weighting of state-action occupancy measures based on samples in the experience replay buffer, we can modify existing offline algorithms to account for state-action distribution ratios. The ablation results on the Hopper control benchmark are summarized in Figure 2. The same base BCQ algorithm (Fujimoto et al., 2019) is used with a modified objective, and the results for applying off-policy importance weights are denoted as "BCQ+I.W.". We employ the same technique to obtain ω(s, a) for both the baseline and for adding variance regularization as described. The results suggest that adding the proposed per-step variance regularization scheme significantly outperforms importance weighting of the expected rewards alone for off-policy learning.
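The "BCQ+I.W." modification described above amounts to reweighting the per-sample critic loss by ω(s, a). Below is a minimal sketch of such a weighted TD loss; the function name and the toy numbers are ours for illustration, not from the BCQ codebase.

```python
import numpy as np

def weighted_td_loss(q, target_q, omega):
    """Importance-weighted critic loss: per-sample squared TD errors,
    reweighted by stationary distribution ratios omega(s, a) so that
    batch samples are corrected toward the target policy's occupancy."""
    td_err = q - target_q
    return np.mean(omega * td_err ** 2)

q        = np.array([1.0, 0.5, 2.0])        # critic predictions for a batch
target_q = np.array([0.8, 0.7, 1.5])        # bootstrapped targets
omega    = np.array([1.2, 0.6, 1.0])        # hypothetical ratio estimates
loss = weighted_td_loss(q, target_q, omega)
```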

E.2 EXPERIMENTAL RESULTS IN CORRUPTED NOISE SETTINGS

We additionally consider a setting where the batch data is collected from a noisy environment, i.e., with corrupted rewards r → r + ε, where ε ∼ N(0, 1). Experimental results are presented in Figures 1 and 3. From our results, we note that using OVR on top of BCQ (Fujimoto et al., 2019) achieves significantly better performance with variance minimization, especially when the agent is given sub-optimal demonstrations. We denote these settings as medium (the dataset was collected by a half-trained SAC policy) and mixed (the data logging policy is a mixture of a random and a SAC policy). This is also useful for practical scalability, since data collection from an expert policy is often expensive. We add noise to the dataset to examine the significance of OVR under a noisy, corrupted dataset setting.

In this section, we also include several alternative ways to compute the stationary state-action distribution ratio, borrowing from recent works (Uehara & Jiang, 2019; Nachum et al., 2019a).

Off-Policy Optimization with Minimax Weight Learning (MWL) : We discuss other possible ways of optimizing the batch off-policy optimization objective while also estimating the state-action density ratio. Following (Uehara & Jiang, 2019), we further cast the off-policy optimization part of the objective J(θ) in L(θ, λ) as a min-max objective, consisting of learning the weights ω_{π/D} and optimizing the resulting objective J(θ, ω). We further propose an overall policy optimization objective, where a single objective can be used for estimating the distribution ratio, evaluating the critic, and optimizing the resulting objective.
We can write the off-policy optimization objective with its equivalent starting-state formulation, such that we have :

E_{d^D(s,a)}[ ω_{π_θ/D}(s, a) · r(s, a) ] = (1 − γ) E_{s_0∼β_0(s), a_0∼π(·|s_0)}[ Q^π(s_0, a_0) ]   (66)

This is similar to the MWL objective in (Uehara & Jiang, 2019), except that we instead consider the bias-reduced estimator, such that accurate estimates of Q or ω lead to reduced bias of the value function estimation. Furthermore, note that in the first part of the objective, J(π_θ, ω, Q)^2, we can use entropy regularization to smooth the objective: instead of Q^π(s', a') in the target, we can use a log-sum-exp, considering the conjugate of the entropy regularization term, similar to SBEED (Dai et al., 2018). This gives the first part of the objective as an overall min-max optimization problem.

Minimizing Divergence for Density Ratio Estimation : The distribution ratio can also be estimated using an objective similar to GANs (Goodfellow et al., 2014; Ho & Ermon, 2016), as similarly proposed in (Kostrikov et al., 2019).

Empirical Likelihood Ratio : We can follow Sinha et al. (2020) to compute the state-action likelihood ratio, where a binary classifier is used to classify samples between an on-policy and an off-policy distribution. The classifier φ is trained on the following objective, and takes as input state-action tuples (s, a) to return a probability score that the state-action pair comes from the target policy :

L_cls = max_φ { E_{s,a∼D}[ log φ(s, a) ] + E_{s∼D}[ log φ(s, π(s)) ] }

where (s, a) ∼ D are samples from the behaviour policy, and (s, π(s)) are samples from the target policy. The density ratio estimate for a given (s, a) ∼ D is then simply ω(s, a) = σ(φ(s, a)), as in Sinha et al. (2020). We then use these ω(s, a) as density ratio corrections for the target policy in equation ??.
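The classifier-based ratio estimation described above can be sketched on a toy 1-D problem. Note that, for simplicity, the sketch below recovers the ratio via the standard p/(1 − p) classifier trick rather than the σ(φ) form quoted above, and trains a logistic classifier by plain gradient descent; all names and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
x_d = rng.normal(0.0, 1.0, size=4000)    # samples from the batch distribution d_D
x_pi = rng.normal(0.5, 1.0, size=4000)   # samples from the target policy d_pi

x = np.concatenate([x_d, x_pi])
y = np.concatenate([np.zeros(4000), np.ones(4000)])  # label 1 = target policy

# Logistic regression by gradient descent on the binary cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    g = p - y                             # gradient of BCE w.r.t. the logit
    w -= 0.05 * np.mean(g * x)
    b -= 0.05 * np.mean(g)

# Density-ratio estimate omega(x) = p/(1-p) = exp(logit); for these two
# unit-variance Gaussians the true log-ratio is linear in x (0.5*x - 0.125).
omega = lambda z: np.exp(w * z + b)
```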



THEORETICAL ANALYSIS

In this section, we provide theoretical analysis of offline policy optimization algorithms in terms of policy improvement guarantees under a fixed dataset D. We then demonstrate that using the variance regularizer leads to a lower bound on our policy optimization objective, which yields a pessimistic exploitation approach for offline algorithms.



Let us denote the state-action occupancy measures under π and the dataset D as d_π and d^D. The variance of the state-action distribution ratios is Var_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a) ]. When α = 2 for the Rényi divergence, we have :

Figure 1: Evaluation of the proposed approach and the baseline BCQ (Fujimoto et al., 2019) on a suite of three OpenAI Gym environments. Details about the type of offline dataset used for training, namely random, medium, mixed, and expert, are included in the Appendix. Results are averaged over 5 random seeds (Henderson et al., 2018). We evaluate the agent using standard procedures, as in Kumar et al. (2019); Fujimoto et al. (2019).

From the definition in equation 17, the action-value function with off-policy trajectory-wise importance correction is Q^π(s, a) = E_{(s,a)∼d_μ(s,a)}[ D^π(s, a) ], and similarly the value function can be defined as V^π(s) = E_{s∼d_μ(s)}[ D^π(s) ]. For the trajectory-wise importance corrections, we can define the variance of the returns, similar to (A. & Fu, 2018), as :

V_P(π) = E_{(s,a)∼d_μ(s,a)}[ D^π(s, a)^2 ] − E_{(s,a)∼d_μ(s,a)}[ D^π(s, a) ]^2

V_D(π) = E_{(s,a)∼d_π(s,a)}[ ( r(s, a) − J(π) )^2 ]   (25)
      = E_{(s,a)∼d_π(s,a)}[ r(s, a)^2 ] + J(π)^2 − 2 J(π) E_{(s,a)∼d_π(s,a)}[ r(s, a) ]   (26)
      = E_{(s,a)∼d_π(s,a)}[ r(s, a)^2 ] − J(π)^2   (27)

Var_{(s,a)∼d^D(s,a)}[ ω̂_{π/D} ] ≤ (1/N) ||r||^2_∞ F_2(d_π || d^D)   (60)

Proof. Assuming that state-action samples are drawn i.i.d. from the dataset D, we can write :

Var_{(s,a)∼d^D(s,a)}[ ω̂_{π/D}(s, a) ] ≤ (1/N) Var_{(s_1,a_1)∼d^D(s,a)}[ (d_π(s_1, a_1)/d^D(s_1, a_1)) · r(s_1, a_1) ]

PROOF OF THEOREM 2 : First, let us recall the stated Theorem 2. By constraining the off-policy optimization problem with variance constraints, we have the following lower bound to the optimization objective with stationary state-action distribution corrections :

J(π) = E_{(s,a)∼d_π(s,a)}[ r(s, a) ] ≥ E_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a) · r(s, a) ] − sqrt( ((1 − δ)/δ) · Var_{(s,a)∼d^D(s,a)}[ ω_{π/D}(s, a) · r(s, a) ] )

where ω_{π/D}(s, a) = d_π(s, a)/d^D(s, a).

Figure 2: Ablation performed on Hopper. The mean and standard deviation are reported over 5 random seeds. The offline datasets for these experiments are the same as the corresponding ones in Fig 1 of the main paper.

Figure 3: Evaluation of the proposed approach and the baseline BCQ on a suite of three OpenAI Gym environments. We consider the setting of rewards that are corrupted by Gaussian noise. Results for the uncorrupted version are in Fig. 1. Experimental results are averaged over 5 random seeds.

Furthermore, following the Bellman equation, we expect to have E[ r(s, a) ] = E[ Q^π(s, a) − γ Q^π(s', a') ], such that :

E_{d^D(s,a)}[ ω_{π_θ/D}(s, a) · { Q^π(s, a) − γ Q^π(s', a') } ] = (1 − γ) E_{s_0∼β_0(s), a_0∼π(·|s_0)}[ Q^π(s_0, a_0) ]   (67)

We can therefore write the overall objective as :

J(ω, π_θ, Q) = E_{d^D(s,a)}[ ω_{π_θ/D}(s, a) · { Q^π(s, a) − γ Q^π(s', a') } ] − (1 − γ) E_{s_0∼β_0(s), a_0∼π(·|s_0)}[ Q^π(s_0, a_0) ]   (68)

J(ω, π_θ, Q) = E_{d^D(s,a)}[ ω_{π_θ/D}(s, a) · { r(s, a) + γ Q^π(s', a') + τ log π(a | s) − Q^π(s, a) } ] + (1 − γ) E_{s_0∼β_0(s), a_0∼π(·|s_0)}[ Q^π(s_0, a_0) ]   (69)

such that our overall constrained optimization objective for maximizing over θ becomes a min-max objective, for estimating the density ratios, estimating the value function, and maximizing the policy :

ω*_{π/D}, Q*, π* = argmin …

The saddle-point solution for the density ratio can be obtained by minimizing the objective :

ω*_{π/D} = argmin_ω L(ω_{π/D}, Q)^2 = E_{d_μ(s,a)}[ { γ ω(s, a) · Q^π(s', a') − ω(s, a) Q^π(s, a) } ] + (1 − γ) E_{β(s,a)}[ Q^π(s_0, a_0) ]   (71)

DualDICE : In contrast to MWL (Uehara & Jiang, 2019), DualDICE (Nachum et al., 2019a) introduces dual variables through the change-of-variables trick, and minimizes the Bellman residual of the dual variables ν(s, a) to estimate the ratio, such that :

ν*(s, a) − B^π ν*(s, a) = ω_{π/D}(s, a)   (72)

the solution to which can be obtained by optimizing the following objective :

min_ν L(ν) = (1/2) E_{d^D}[ ( (ν − B^π ν)(s, a) )^2 ] − (1 − γ) E_{s_0,a_0∼β(s,a)}[ ν(s_0, a_0) ]   (73)

L = E_{(s,a)∼d^D}[ log h(s, a) ] + E_{(s,a)∼d_π}[ log(1 − h(s, a)) ]   (74)

where h is the discriminator class, discriminating between samples from d^D and d_π. The optimal discriminator satisfies :

log h*(s, a) − log(1 − h*(s, a)) = log ( d^D(s, a) / d_π(s, a) )   (75)

The optimal solution of the discriminator is therefore equivalent to minimizing the divergence between d_π and d^D, since the KL divergence is given by :

−D_KL(d_π || d^D) = E_{(s,a)∼d_π}[ log ( d^D(s, a) / d_π(s, a) ) ]   (76)

Additionally, using the Donsker-Varadhan representation, we can further write the KL divergence term as :

−D_KL(d_π || d^D) = min_x { log E_{(s,a)∼d^D}[ exp x(s, a) ] − E_{(s,a)∼d_π}[ x(s, a) ] }   (77)

such that now, instead of the discriminator class h, we learn the function class x, the optimal solution to which is equivalent to the log distribution ratio plus a constant :

x*(s, a) = log ( d_π(s, a) / d^D(s, a) )   (78)

However, note that both the GAN-like objective in equation 74 and the DV representation of the KL divergence in equation 77 require access to samples from both d_π and d^D. In our problem setting, however, we only have access to batch samples from d^D. To remove the dependency on samples from d_π, we can use the change-of-variables trick x(s, a) = ν(s, a) − B^π ν(s, a) to write the DV representation of the KL divergence as :

−D_KL(d_π || d^D) = min_ν { log E_{(s,a)∼d^D}[ exp( ν(s, a) − B^π ν(s, a) ) ] − E_{(s,a)∼d_π}[ ν(s, a) − B^π ν(s, a) ] }   (79)

where the second expectation can be written as an expectation over initial states, following DualDICE, such that we have :

−D_KL(d_π || d^D) = min_ν { log E_{(s,a)∼d^D}[ exp( ν(s, a) − B^π ν(s, a) ) ] − (1 − γ) E_{(s,a)∼β_0(s,a)}[ ν(s_0, a_0) ] }   (80)

Minimizing the above objective w.r.t. ν requires only samples from the fixed batch data d^D and the starting state distribution. The solution for the optimal density ratio is therefore given by :

x*(s, a) = ν*(s, a) − B^π ν*(s, a) = log ( d_π(s, a) / d^D(s, a) ) = log ω*(s, a)
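The Donsker-Varadhan representation in equation 77 can be sanity-checked numerically: plugging the optimal witness x*(s) = log p(s)/q(s) into the DV objective recovers the exact KL divergence, since E_q[exp x*] = E_q[p/q] = 1. The sketch below checks this for two unit-variance Gaussians, where the log-ratio and the KL divergence are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_p, mu_q, n = 1.0, 0.0, 200000
xs_p = rng.normal(mu_p, 1.0, size=n)     # samples from p
xs_q = rng.normal(mu_q, 1.0, size=n)     # samples from q

# Closed-form log-ratio for unit-variance Gaussians:
# log N(s; mu_p, 1) / N(s; mu_q, 1) = (mu_p - mu_q)*s - (mu_p^2 - mu_q^2)/2
log_ratio = lambda s: (mu_p - mu_q) * s - 0.5 * (mu_p ** 2 - mu_q ** 2)

# DV objective at the optimal witness: E_p[x*] - log E_q[exp(x*)]
dv = np.mean(log_ratio(xs_p)) - np.log(np.mean(np.exp(log_ratio(xs_q))))
true_kl = 0.5 * (mu_p - mu_q) ** 2       # KL(p || q) for unit-variance Gaussians
```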

E s,a∼D [log(φ(s, a))] + E s∼D [log(φ(s, π(s))]


The results on D4RL tasks compare BCQ (Fujimoto et al., 2019) with and without OVR, bootstrapping error reduction (BEAR) (Kumar et al., 2019), behavior-regularized actor-critic with policy regularization (BRAC-p) (?), AlgaeDICE (aDICE) (Nachum et al., 2019b), and offline SAC (SAC-off) (Haarnoja et al., 2018). The results presented are the normalized returns on each task as per Fu et al. (2020) (see Table 3 in Fu et al. (2020) for the unnormalized scores on each task).

Results on the Safety-Gym environments (Ray et al.).

Considering the state distributions of the two policies :

L_π(π') = J(π) + E_{s∼d_π, a∼π'}[ A^π(s, a) ]   (51)

which ignores the changes in the state distribution d_{π'} due to the improved policy π'. (Schulman et al., 2015) optimizes the surrogate objective L_π(π') while ensuring that the new policy π' stays close to the current policy π, by imposing a KL constraint. The performance difference bound, based on the constraint between π and π' as in TRPO (Schulman et al., 2015), is given by :

Lemma 4. The performance difference lemma in (Schulman et al., 2015), with α = D^max_TV = max_s D_TV(π, π') and ε = max_{s,a} |A^π(s, a)|.

The performance improvement bound in (Schulman et al., 2015) can further be written in terms of the KL divergence by following the relationship between total variation (TV) and KL, which follows from Pinsker's inequality, D_TV(p || q)^2 ≤ D_KL(p || q). We thus have a performance difference bound in terms of the state distribution shift between d_π and d_{π'}. This justifies that L_π(π') is a sensible lower bound to J(π') as long as the total variation distance between d_π and d_{π'} is controlled, which ensures that the policies π and π' stay close to each other. Finally, following from (Achiam et al., 2017), we obtain a lower bound which satisfies policy improvement guarantees. Equations 53 and 54 assume that there is no state distribution shift between π' and π. However, if we explicitly account for the state distribution changes d_π and d_{π'} due to π and π' respectively, then we have the following performance improvement bound :

Lemma 5. For all policies π' and π, we have a performance improvement bound based on the total variation of the state-action distributions d_π and d_{π'}, which can be further written in terms of the surrogate objective L_π(π').

