HELLINGER DISTANCE CONSTRAINED REGRESSION

Abstract

This paper introduces an off-policy reinforcement learning method that constrains the Hellinger distance between the sampling policy (the policy that collected the samples) and the current policy (the policy being optimized). Twice the squared Hellinger distance is greater than or equal to the squared total variation distance and less than or equal to the Kullback-Leibler divergence; therefore the lower bound on the expected discounted return of the new policy improves compared to the bound obtained with a KL constraint. Moreover, the Hellinger distance is bounded above by 1, so there is a policy-independent lower bound on the expected discounted return. HDCR is compatible with Experience Replay, a common setting in distributed RL where trajectories are collected by different policies and learning happens centrally. HDCR shows results comparable to or better than the Advantage-weighted Behavior Model (ABM) and Advantage-Weighted Regression (AWR) on MuJoCo tasks using tiny offline datasets collected by random agents. On larger datasets (100k timesteps) collected by a pretrained behavioral policy, HDCR outperforms ABM and AWR on 3 out of 4 tasks.

1. INTRODUCTION

Policy gradient algorithms are model-free reinforcement learning methods that optimize a policy by differentiating the expected discounted return. Despite their simplicity, these methods must stay on-policy to converge because of the first-order approximation of state visitation frequencies. This forces agents to learn by trial and error, using each sample only once. To make policy gradient updates more off-policy, we can constrain the update so that the step size shrinks when the current policy moves too far from the sampling policy. One of the first such methods added the squared total variation distance as a constraint for mixture policies, for which a lower bound on the new policy's expected discounted return was proven (Kakade & Langford, 2002). Later it was proven that this lower bound holds for all types of updates (Schulman et al., 2015). Next, the squared total variation distance was replaced by the Kullback-Leibler divergence, which is greater than or equal to it (Pinsker's inequality; Levin & Peres, 2017), so the lower bound was loosened (Schulman et al., 2015). Using a Lagrangian, an off-policy method called Advantage-Weighted Regression (Peng et al., 2019) was derived, which also used KL as a constraint. This article proposes a new method whose lower bound on the expected discounted return is greater than or equal to the bound obtained with KL. We achieve this by replacing the total variation distance with the Hellinger distance, which still upper-bounds it, so the strictness of the guarantee is preserved while the bound is loosened less than with KL. We then derive an off-policy method called Hellinger Distance Constrained Regression from the new constraint. It can be used on discrete and continuous action spaces, since the derivation uses Lebesgue integrals rather than summation or Riemann integrals.

2. PRELIMINARIES

To better present the problem, we start from basic definitions, go through the history of improvements, and then describe the disadvantages of using KL divergence as a constraint. We consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(S, A, P, r, \rho_0, \gamma)$, where $S$ is a set of states (finite or infinite), $A$ is a set of actions (finite or infinite), $P : S \times A \times S \to \mathbb{R}$ is the transition probability distribution, $r : S \to \mathbb{R}$ is the reward function, $\rho_0 : S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi$ denote a stochastic policy $\pi : S \times A \to [0, 1]$; its expected discounted return is

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right], \quad s_0 \sim \rho_0(\cdot),\; a_t \sim \pi(\cdot|s_t),\; s_{t+1} \sim P(\cdot|s_t, a_t). \quad (1)$$

This paper uses the state-action value function $Q_\pi$, the state value function $V_\pi$, and the advantage function $A_\pi$ with the following definitions:

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right], \quad V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right], \quad A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t). \quad (2)$$

Let $\rho_\pi(s)$ be the unnormalized visitation frequency of state $s$ when actions are chosen according to $\pi$: $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s)$. The following identity expresses the expected return of another policy $\tilde\pi$ in terms of the advantage over $\pi$, accumulated over states (see Schulman et al. (2015) for proof):

$$\eta(\tilde\pi) = \eta(\pi) + \int \rho_{\tilde\pi}(s) \int \tilde\pi(a|s) A_\pi(s, a)\, da\, ds.$$

In approximately optimal learning, we replace the state visitation frequency $\rho_{\tilde\pi}$ by $\rho_\pi$, since this drastically decreases the optimization complexity:

$$L_\pi(\tilde\pi) = \eta(\pi) + \int \rho_\pi(s) \int \tilde\pi(a|s) A_\pi(s, a)\, da\, ds.$$

Let $\pi_{old}$ denote the current policy. The lower bound on the expected discounted return of the new policy $\pi_{new}$ is (see Schulman et al. (2015) for proof):

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2,$$

where $\epsilon = \max_{s,a} |A_\pi(s, a)|$, $\alpha = \max_s D_{TV}(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s))$, and

$$D_{TV}(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s)) = \frac{1}{2}\int |\pi_{old}(a|s) - \pi_{new}(a|s)|\, da.$$
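For concreteness, the discounted return in equation 1 and the Monte-Carlo advantage estimate used later in the paper can be computed from a single recorded episode. The following is a minimal illustrative sketch (function names are ours, not from the paper), assuming a list of per-step rewards and a value baseline:

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo return R_t = sum_l gamma^l * r_{t+l} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mc_advantages(rewards, values, gamma=0.99):
    """Advantage estimate A(s_t, a_t) ~ R_t - V(s_t) given a value baseline."""
    return [r - v for r, v in zip(discounted_returns(rewards, gamma), values)]
```

The backward recursion avoids recomputing the geometric sum at every timestep, giving linear time in the episode length.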
The theoretical Trust-Region Policy Optimization algorithm relies on Pinsker's inequality (see Tsybakov (2009) for proof):

$$D_{KL}(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s)) \ge D_{TV}(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s))^2,$$

where $D_{KL}(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s)) = \int \pi_{old}(a|s)\log\frac{\pi_{old}(a|s)}{\pi_{new}(a|s)}\,da$. Replacing the squared total variation distance with KL yields

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - C\,D^{max}_{KL}(\pi_{old}\,\|\,\pi_{new}),$$

where $\epsilon = \max_{s,a}|A_\pi(s,a)|$, $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$, and $D^{max}_{KL}(\pi_{old}\,\|\,\pi_{new}) = \max_s D_{KL}(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s))$. However, this replacement greatly decreases the lower bound on the expected discounted return of the new policy. Moreover, the Kullback-Leibler divergence has no upper bound, so there is no policy-independent lower bound for this type of update.

3. HELLINGER DISTANCE IN POLICY OPTIMIZATION

We can improve the lower bound (compared to KL) by replacing $D_{TV}(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s))$ with the Hellinger distance $H(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s))$:

$$H(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s))^2 = 1 - \int \sqrt{\pi_{old}(a|s)\,\pi_{new}(a|s)}\, da.$$

Theorem 1 (see Appendix A or Tsybakov (2009, Section 2.4) for proof) establishes both the correctness and the improvement (compared to KL) of the lower bound. Let $p(v)$ and $q(v)$ be two probability density functions; then

$$D_{TV}(p(\cdot)\,\|\,q(\cdot))^2 \le 2H(p(\cdot)\,\|\,q(\cdot))^2 \le D_{KL}(p(\cdot)\,\|\,q(\cdot)).$$

Replacing $p(v)$ and $q(v)$ with $\pi_{old}(\cdot|s)$ and $\pi_{new}(\cdot|s)$ respectively, the new lower bound follows:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{8\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad (11)$$

where $\epsilon = \max_{s,a} |A_\pi(s, a)|$ and $\alpha = \max_s H(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s))$. It is worth noting that $H(\pi_{old}(\cdot|s)\,\|\,\pi_{new}(\cdot|s)) \le 1$.
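As a sanity check, the double inequality of Theorem 1 can be verified numerically for discrete distributions. The following minimal sketch (our own helper names, not code from the paper) computes the three quantities for a pair of categorical distributions:

```python
import math

def tv(p, q):
    """Total variation distance: 0.5 * sum |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance: 1 - sum sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def kl(p, q):
    """Kullback-Leibler divergence: sum p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Theorem 1: D_TV^2 <= 2 H^2 <= D_KL for any pair of distributions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
assert tv(p, q) ** 2 <= 2 * hellinger_sq(p, q) <= kl(p, q)
```

Note that `hellinger_sq` can never exceed 1 (it equals 1 exactly for disjoint supports), which is the property behind the policy-independent bound, whereas `kl` is unbounded.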

4. HELLINGER DISTANCE CONSTRAINED REGRESSION (HDCR)

We could use the presented lower bound as in TRPO, but instead we derive an offline regression algorithm by introducing the following optimization problem, where $\mu$ is the sampling policy:

$$\arg\max_\pi \int \rho_\mu(s) \int \pi(a|s) A_\mu(s, a)\, da\, ds \quad \text{s.t.} \int \rho_\mu(s) H(\pi(\cdot|s)\,\|\,\mu(\cdot|s))\, ds \le \epsilon, \quad \int \pi(a|s)\, da = 1 \;\; \forall s \in S.$$

Constructing the Lagrangian, differentiating it with respect to $\pi$, and solving for $\pi$ gives the following optimal policy (see Appendix B for derivation):

$$\pi^*(a|s) = \mu(a|s)\,\frac{\beta^2}{(\beta - 2A_\mu(s, a))^2},$$

where $\beta$ is a Lagrange multiplier. Constructing a regression problem from the KL divergence between the optimal policy $\pi^*$ and the current policy $\pi$ and simplifying gives the following supervised regression problem (see Appendix B for derivation):

$$\arg\max_\pi \mathbb{E}_{s \sim \rho_\mu(\cdot)}\,\mathbb{E}_{a \sim \mu(\cdot|s)}\left[\log \pi(a|s)\,\frac{1}{(\beta - A_\mu(s, a))^2}\right]. \quad (14)$$

In the notation of the ABM paper (Siegel et al., 2020), the "advantage-weighting" function is $f(A(s, a)) = \frac{1}{(\beta - A(s, a))^2}$. If we use HDCR with Experience Replay, in equation 14 we replace $\mu(\cdot|s)$ in the expectation $\mathbb{E}_{a \sim \mu(\cdot|s)}$ and in the advantage function $A_\mu(s, a)$. Let $\Pi = \{\pi_i, \pi_{i+1}, \ldots, \pi_{i+N}\}$ be the set of sampling policies from which actions were sampled and $w(\pi_i)$ the probability of selecting policy $\pi_i$; then:

$$\mu(s, a) = \int_\Pi w(\pi)\,\rho_\pi(s)\,\pi(a|s)\, d\pi,$$

$$A_\mu(s, a) = \frac{\int_\Pi w(\pi)\,\rho_\pi(s)\,(Q_\pi(s, a) - V_\pi(s))\, d\pi}{\int_\Pi w(\pi)\,\rho_\pi(s)\, d\pi}, \quad V(s) = \frac{\int_\Pi w(\pi)\,\rho_\pi(s)\,V_\pi(s)\, d\pi}{\int_\Pi w(\pi)\,\rho_\pi(s)\, d\pi}.$$

The proof repeats the proof for AWR with Experience Replay (Peng et al., 2019). In practice, we simply sample uniformly from the replay buffer and use a value function estimator; this type of sampling approximates the expectation and the state value function. Let $D$ denote a set of stored trajectories (the replay buffer) and $A(s, a; \phi)$ an advantage function parameterized by vector $\phi$.
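Equation 14 is an ordinary weighted maximum-likelihood problem over buffer actions. A minimal sketch of the resulting loss follows; the helper names `hdcr_weight` and `hdcr_loss` are ours, `log_probs` are assumed to come from the current policy network, and `advs` from any advantage estimator:

```python
def hdcr_weight(adv, beta=1.0):
    """HDCR advantage-weighting f(A) = 1 / (beta - A)^2 (beta - A must be nonzero)."""
    return 1.0 / (beta - adv) ** 2

def hdcr_loss(log_probs, advs, beta=1.0):
    """Negative of the objective in equation 14: a weighted log-likelihood of
    buffer actions under the current policy, averaged over the batch."""
    n = len(log_probs)
    return -sum(lp * hdcr_weight(a, beta) for lp, a in zip(log_probs, advs)) / n
```

Minimizing `hdcr_loss` with any gradient optimizer pushes probability mass toward actions whose advantage approaches $\beta$, where the weight grows without bound; in a real implementation the weight (or the advantage) would likely need clipping for numerical stability.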
The most popular method to obtain $A(s, a; \phi)$ for offline learning is to use a state-action value function estimator $Q(s, a; \phi)$ parameterized by $\phi$:

$$A(s, a; \phi) = Q(s, a; \phi) - \int \pi(a'|s; \theta_k)\,Q(s, a'; \phi_k)\, da'.$$

Although $Q(s, a; \phi)$ is closer to the expectation of the discounted return under policy $\pi$ (because the first action is "taken" according to $\pi$ rather than $\mu$), we found the Monte-Carlo return more efficient on tiny offline datasets. The greater performance of the MC return can be explained by a lack of experience "produced" by certain actions. The Monte-Carlo estimate of $A(s, a; \phi)$ can be described as follows, where $R^D_{s_t, a_t} = \sum_{l=0}^{T} \gamma^l r_{t+l}$ and $V(s; \phi)$ is a state value function estimator parameterized by vector $\phi$:

$$A(s, a; \phi) = R^D_{s, a} - V(s; \phi).$$

We can also use Generalized Advantage Estimation (Schulman et al., 2016), where $\lambda \in [0, 1]$ and $\delta^{\phi_k}_t$ is the one-step temporal difference for state $s_t$ calculated using the old vector $\phi_k$:

$$\delta^{\phi_k}_t = r_t + \gamma V(s_{t+1}; \phi_k) - V(s_t; \phi_k),$$

$$\hat{A}(s_t, a_t; \phi_k) = \sum_{l=0}^{T} (\gamma\lambda)^l \delta^{\phi_k}_{t+l},$$

$$A(s_t, a_t; \phi) = \hat{A}(s_t, a_t; \phi_k) + V(s_t; \phi_k) - V(s_t; \phi).$$

Finally, we propose the following reinforcement learning algorithm:
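The GAE recursion above can be sketched as follows. This is an illustrative implementation, not the paper's code; it assumes `values` carries one extra bootstrap entry for the state after the last reward:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    `values` must have len(rewards) + 1 entries: V(s_0), ..., V(s_T),
    where the last entry bootstraps the value of the final state.
    """
    advs = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # accumulate the exponentially weighted sum of TD errors
        running = delta + gamma * lam * running
        advs[t] = running
    return advs
```

With `lam=0` this reduces to the one-step TD error, and with `lam=1` to the Monte-Carlo advantage against the value baseline, matching the usual bias-variance trade-off of GAE.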

5. EXPERIMENTS

In our experiments, we evaluate the algorithm on MuJoCo (Todorov et al., 2012) tasks.

5.1. TINY DATASETS

For evaluation on extremely small datasets we use a setting inspired by the Behavioral Modelling Priors for Offline Reinforcement Learning paper (Siegel et al., 2020), but instead of using actions from a behavioral policy we use random actions while generating the buffer. First, we collect 2048 timesteps or more, until episode termination (whichever occurs later), from each of 5 seeds using random actions. Then we load the collected trajectories into a replay buffer for agent training. Separate networks with the same architecture (except the last layer) represent the policy and value function and consist of 2 hidden layers of 256 ELU units. Each training iteration uses only the old data obtained by random agents. Each iteration, the value function is updated with 5 gradient steps and the policy with 50 steps, using uniformly sampled batches of 512 samples drawn from all data in the replay buffer. Learning rates for the Adam optimizer are $2 \times 10^{-3}$ and $2 \times 10^{-4}$ for critic and actor, respectively. We compare 3 different "advantage-weight" functions $f(A(s, a))$:

• Hellinger Distance Constrained Regression, where $f(A(s, a)) = \frac{1}{(\beta - A(s, a))^2}$;
• Advantage-weighted Behavior Model, where $f(A(s, a)) = I_{A(s,a)>0}$, with $I_{x>0} = 1$ if $x > 0$ and $I_{x>0} = 0$ otherwise;
• Advantage-Weighted Regression, where $f(A(s, a)) = \exp(\frac{1}{\beta} A(s, a))$.

The AWR method uses $\beta = 1.0$, as in the implementation released by the authors. HDCR also uses $\beta = 1.0$. For TD($\lambda$) we use $\lambda = 0.95$. On simple tasks such as Hopper-v2, all methods are able to learn (Figure 1), and HDCR shows slightly better results (Table 1). On difficult tasks such as Ant-v2, none of the algorithms improve through iterations; moreover, evaluation returns decrease.
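The three advantage-weight functions compared above differ only in how they map an advantage estimate to a regression weight. A minimal side-by-side sketch (names are ours; $\beta = 1$ as in the experiments):

```python
import math

def f_hdcr(adv, beta=1.0):
    """HDCR weight: 1 / (beta - A)^2; grows without bound as A approaches beta."""
    return 1.0 / (beta - adv) ** 2

def f_abm(adv):
    """ABM weight: hard indicator of positive advantage."""
    return 1.0 if adv > 0 else 0.0

def f_awr(adv, beta=1.0):
    """AWR weight: exponential advantage weighting."""
    return math.exp(adv / beta)
```

For a negative advantage all three down-weight the action, but only ABM zeroes it out entirely; HDCR and AWR keep a smooth, strictly positive weight.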

5.2. LARGE DATASETS

Next, we perform tests using buffers with a size of 100k timesteps. The buffer is filled by running a pretrained behavioral policy. This setting replicates the one from the Off-Policy Deep Reinforcement Learning without Exploration paper (Fujimoto et al., 2019). Therefore we also provide results for the BCQ method obtained with the authors' implementation trained on the same datasets. To calculate the advantage we use equation 16, where we approximate the integral by taking the mean of 10 Q-values obtained from 10 actions sampled from the policy. While BCQ outperforms all the presented methods, it uses the gradient of the Q-value function in actor training, which provides better generalization and better evaluation results, but affects stability and performance. Among the methods that update the policy function directly, HDCR shows better results on 3 environments out of 4 (Figure 2 and Table 2).
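The Monte-Carlo approximation of the integral $V(s) \approx \mathbb{E}_{a \sim \pi(\cdot|s)}[Q(s, a)]$ by 10 sampled actions can be sketched as follows. `q_fn` and `policy_sample` are hypothetical stand-ins for the critic and the policy sampler, not APIs from the paper:

```python
def value_from_q(q_fn, policy_sample, state, n_samples=10):
    """Approximate V(s) by averaging Q(s, a) over actions sampled from the
    current policy (10 samples, as in the Section 5.2 experiments)."""
    qs = [q_fn(state, policy_sample(state)) for _ in range(n_samples)]
    return sum(qs) / len(qs)

def q_advantage(q_fn, policy_sample, state, action, n_samples=10):
    """Advantage estimate A(s, a) = Q(s, a) - V(s) using the sampled V(s)."""
    return q_fn(state, action) - value_from_q(q_fn, policy_sample, state, n_samples)
```

The number of samples trades compute for variance in the baseline; 10 samples per state matches the setting described above.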

6. DISCUSSION

We theoretically proved that the Hellinger distance improves the lower bound on the expected discounted return compared to the Kullback-Leibler divergence, and proposed a simple off-policy reinforcement learning method that uses the Hellinger distance as a constraint. The expected discounted return of a new policy now has a policy-independent lower bound, which guarantees that the return will not collapse in "one" shot. Experiments show that HDCR outperforms both ABM and AWR on tiny datasets obtained by random agents. This performance demonstrates the efficiency of using the Hellinger distance, which allows bigger step sizes while retaining the lower bound. On bigger datasets, HDCR shows comparable or better results than AWR and ABM.

A. THEOREM 1 PROOF

Let $(\Omega, \mathcal{A})$ be a measurable space, $P$ and $Q$ two probability measures on that space, and $v$ a $\sigma$-finite measure on $(\Omega, \mathcal{A})$ such that $P \ll v$ (i.e., $P(A) = 0$ for any $A \in \mathcal{A}$ with $v(A) = 0$) and $Q \ll v$. Denote the Radon-Nikodym derivatives (which in our derivations can be treated as probability density functions) as

$$p = \frac{dP}{dv}, \quad q = \frac{dQ}{dv},$$

so that $p$ and $q$ satisfy

$$P(A) = \int_A p(v)\, dv \quad \forall A \in \mathcal{A}, \qquad Q(A) = \int_A q(v)\, dv \quad \forall A \in \mathcal{A}.$$

We then define the distances between probability measures:

$$D_{TV}(P \,\|\, Q) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|,$$

$$H(P \,\|\, Q)^2 = \frac{1}{2} \int \left(\sqrt{p(v)} - \sqrt{q(v)}\right)^2 dv = 1 - \int \sqrt{p(v)\,q(v)}\, dv,$$

$$D_{KL}(P \,\|\, Q) = \int \log\frac{dP}{dQ}\, dP = \int p(v) \log\frac{p(v)}{q(v)}\, dv.$$

The following lemmas are used in the proof of Theorem 1.

Lemma 1. Given two probability distributions $p$ and $q$, the total variation distance can be calculated as

$$D_{TV}(P \,\|\, Q) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)| = \frac{1}{2} \int |p(v) - q(v)|\, dv. \quad (22)$$

Proof. Let $B = \{v : p(v) > q(v)\}$, $B^c = \{v : p(v) \le q(v)\}$, and $A \in \mathcal{A}$:

$$P(A) - Q(A) \le P(A \cap B) - Q(A \cap B) \le P(B) - Q(B),$$

$$Q(A) - P(A) \le Q(A \cap B^c) - P(A \cap B^c) \le Q(B^c) - P(B^c),$$

$$\sup_{A \in \mathcal{A}} |P(A) - Q(A)| = P(B) - Q(B) = Q(B^c) - P(B^c),$$

$$D_{TV}(P \,\|\, Q) = \frac{1}{2}\left[P(B) - Q(B) + Q(B^c) - P(B^c)\right] = \frac{1}{2} \int |p(v) - q(v)|\, dv. \quad (23)$$

Lemma 2.

$$D_{TV}(P \,\|\, Q) = 1 - \int \min(p(v), q(v))\, dv.$$

Proof.
$$D_{TV}(P \,\|\, Q) = \frac{1}{2} \int |p(v) - q(v)|\, dv = \int_{\{v:\, p(v) > q(v)\}} [p(v) - q(v)]\, dv = 1 - \int_{\{v:\, p(v) \le q(v)\}} p(v)\, dv - \int_{\{v:\, p(v) > q(v)\}} q(v)\, dv = 1 - \int \min(p(v), q(v))\, dv. \quad (25)$$

Lemma 3.

$$\int \max(p(v), q(v))\, dv + \int \min(p(v), q(v))\, dv = 2.$$

Proof. Rewriting the left-hand side as four integrals over the sets $\{\max(p(v), q(v)) = p(v)\}$, $\{\max(p(v), q(v)) = q(v)\}$, $\{\min(p(v), q(v)) = p(v)\}$, and $\{\min(p(v), q(v)) = q(v)\}$ and stacking the integrals back together gives $P(\Omega) + Q(\Omega) = 2$.

Theorem 1. For any two probability density functions $p$ and $q$, the following double inequality holds:

$$D_{TV}(p \,\|\, q)^2 \le 2H(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q).$$

Proof. The first inequality can be proved as follows, using the Cauchy-Schwarz inequality, then Lemma 3, then Lemma 2:

$$\left(1 - H(P \,\|\, Q)^2\right)^2 = \left(\int \sqrt{p(v)\,q(v)}\, dv\right)^2 = \left(\int \sqrt{\min(p(v), q(v))}\sqrt{\max(p(v), q(v))}\, dv\right)^2$$

$$\le \int \min(p(v), q(v))\, dv \int \max(p(v), q(v))\, dv = \int \min(p(v), q(v))\, dv \left(2 - \int \min(p(v), q(v))\, dv\right)$$

$$= (1 - D_{TV}(P \,\|\, Q))(1 + D_{TV}(P \,\|\, Q)) = 1 - D_{TV}(P \,\|\, Q)^2,$$

hence

$$D_{TV}(P \,\|\, Q)^2 \le H(P \,\|\, Q)^2\left(2 - H(P \,\|\, Q)^2\right) \le 2H(P \,\|\, Q)^2.$$

Proof of the second inequality, using $\log x \le x - 1$:

$$D_{KL}(P \,\|\, Q) = \int p(v) \log\frac{p(v)}{q(v)}\, dv = 2 \int p(v) \log\sqrt{\frac{p(v)}{q(v)}}\, dv = -2 \int p(v) \log\left(\sqrt{\frac{q(v)}{p(v)}} - 1 + 1\right) dv$$

$$\ge -2 \int p(v)\left(\sqrt{\frac{q(v)}{p(v)}} - 1\right) dv = -2\left(\int \sqrt{q(v)\,p(v)}\, dv - 1\right) = 2\left(1 - \int \sqrt{p(v)\,q(v)}\, dv\right) = 2H(P \,\|\, Q)^2,$$

so $D_{KL}(P \,\|\, Q) \ge 2H(P \,\|\, Q)^2$.

Then the optimal policy $\pi^*(a|s)$ can be written as

$$\pi^*(a|s) = \mu(a|s)\,\frac{\beta^2}{(\beta - 2A_\mu(s, a))^2}. \quad (35)$$

To obtain the regression problem, we construct the Kullback-Leibler divergence between the optimal policy $\pi^*$ and the current policy $\pi$.



Figure 1: Learning curves of ABM, AWR, and HDCR averaged across results of learning from 10 different datasets of 10k timesteps (5 seeds were used to generate each dataset).

Figure 2: Evaluation curves of BCQ, ABM, AWR, and HDCR averaged across 3 seeds.

B. HDCR DERIVATION

We restate the optimization problem:

$$\arg\max_\pi \int \rho_\mu(s) \int \pi(a|s) A_\mu(s, a)\, da\, ds \quad \text{s.t.} \int \rho_\mu(s) H(\pi(\cdot|s)\,\|\,\mu(\cdot|s))\, ds \le \epsilon, \quad \int \pi(a|s)\, da = 1 \;\; \forall s \in S. \quad (30)$$

We construct the Lagrangian, where $\alpha : S \to \mathbb{R}$ is a function giving a Lagrange multiplier for every state and $\beta$ is also a Lagrange multiplier. Setting its derivative with respect to $\pi(a|s)$ to zero,

$$\frac{\partial \mathcal{L}}{\partial \pi(a|s)} = \rho_\mu(s) A_\mu(s, a) + \frac{\beta \rho_\mu(s)}{2}\sqrt{\frac{\mu(a|s)}{\pi(a|s)}} - \alpha_s = 0,$$

and solving for the density ratio gives

$$\sqrt{\frac{\mu(a|s)}{\pi(a|s)}} = \frac{2}{\beta}\left(\frac{\alpha_s}{\rho_\mu(s)} - A_\mu(s, a)\right). \quad (33)$$

Substituting $\pi(a|s) = \mu(a|s)$ and taking the expectation over actions drawn according to $\mu$ gives an expression for $\alpha_s$:

$$\alpha_s = \rho_\mu(s)\,\mathbb{E}_{a\sim\mu(\cdot|s)}\left[\frac{\beta}{2} + A_\mu(s, a)\right] = \frac{\beta \rho_\mu(s)}{2},$$

since $\mathbb{E}_{a\sim\mu(\cdot|s)}[A_\mu(s, a)] = 0$. Substituting $\alpha_s$ back into equation 33 and solving for $\pi(a|s)$ yields the optimal policy $\pi^*(a|s) = \mu(a|s)\,\beta^2 / (\beta - 2A_\mu(s, a))^2$.

Constructing the KL regression problem and simplifying:

$$\arg\min_\pi \mathbb{E}_{s\sim\rho_\mu(\cdot)}\left[D_{KL}(\pi^*(\cdot|s)\,\|\,\pi(\cdot|s))\right] = \arg\min_\pi \mathbb{E}_{s\sim\rho_\mu(\cdot)}\,\mathbb{E}_{a\sim\mu(\cdot|s)}\left[\frac{\beta^2}{(\beta - 2A_\mu(s,a))^2}\left(\log \mu(a|s)\frac{\beta^2}{(\beta - 2A_\mu(s,a))^2} - \log \pi(a|s)\right)\right].$$

Dropping the terms that do not depend on $\pi$ and discarding the constant factor $\beta^2$,

$$= \arg\max_\pi \mathbb{E}_{s\sim\rho_\mu(\cdot)}\,\mathbb{E}_{a\sim\mu(\cdot|s)}\left[\log \pi(a|s)\,\frac{1}{(\beta - 2A_\mu(s,a))^2}\right] = \arg\max_\pi \mathbb{E}_{s\sim\rho_\mu(\cdot)}\,\mathbb{E}_{a\sim\mu(\cdot|s)}\left[\log \pi(a|s)\,\frac{1}{(\beta - A_\mu(s,a))^2}\right], \quad (36)$$

where the last step absorbs the factor 2 into a rescaled $\beta$. The regression problem follows:

$$\arg\max_\pi \mathbb{E}_{s\sim\rho_\mu(\cdot)}\,\mathbb{E}_{a\sim\mu(\cdot|s)}\left[\log \pi(a|s)\,\frac{1}{(\beta - A_\mu(s,a))^2}\right]. \quad (37)$$

Table 1: Final returns for different algorithms, with ± corresponding to one standard deviation of the average return across 10 datasets of 10k timesteps.

Table 2: Final returns for different algorithms, with ± corresponding to one standard deviation of the average return across 3 seeds.

