HELLINGER DISTANCE CONSTRAINED REGRESSION

Abstract

This paper introduces an off-policy reinforcement learning method that constrains the divergence between the sampling policy (the policy that collected the samples) and the current policy (the policy being optimized) with the Hellinger distance. Twice the squared Hellinger distance is greater than or equal to the squared total variation distance and less than or equal to the Kullback-Leibler divergence; therefore the resulting lower bound on the expected discounted return of the new policy is improved compared to the lower bound obtained when training with KL. Moreover, the Hellinger distance is bounded by 1, so there is a policy-independent lower bound on the expected discounted return. HDCR is capable of training with Experience Replay, a common setting in distributed RL where trajectories are collected by different policies and learning happens centrally. HDCR shows results comparable to or better than the Advantage-weighted Behavior Model (ABM) and Advantage-Weighted Regression (AWR) on MuJoCo tasks using tiny offline datasets collected by random agents. On bigger datasets (100k timesteps) collected by a pretrained behavioral policy, HDCR outperforms ABM and AWR on 3 out of 4 tasks.
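The chain of inequalities the abstract relies on, TV² ≤ 2H² ≤ KL together with H ≤ 1, can be checked numerically on random discrete distributions. The sketch below is illustrative; the function definitions and random setup are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    # Total variation distance: half the L1 distance between the distributions.
    return 0.5 * np.abs(p - q).sum()

def hellinger(p, q):
    # Hellinger distance, normalized so that 0 <= H <= 1.
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

def kl(p, q):
    # Kullback-Leibler divergence D_KL(p || q) for strictly positive p, q.
    return (p * np.log(p / q)).sum()

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    h2 = hellinger(p, q) ** 2
    assert tv(p, q) ** 2 <= 2 * h2 + 1e-12   # TV^2 <= 2 H^2
    assert 2 * h2 <= kl(p, q) + 1e-12        # 2 H^2 <= D_KL
    assert hellinger(p, q) <= 1.0 + 1e-12    # H is bounded by 1
```

Because H ≤ 1 always holds, a penalty term built from the Hellinger distance stays bounded regardless of how far the two policies are apart, which is what gives the policy-independent lower bound mentioned above.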

1. INTRODUCTION

Policy gradient algorithms are model-free reinforcement learning methods that optimize a policy by differentiating the expected discounted return. Despite their simplicity, these methods must stay on-policy to converge, because of the first-order approximation of state visitation frequencies. This issue forces agents to learn through trial and error, using each piece of data only once. To make policy gradient updates more off-policy, we can add a constraint on the update that decreases the step size when the current policy is too far from the sampling policy. One of the first approaches added the squared total variation distance as a constraint for mixture policies, and a lower bound on the new policy's expected discounted return was proven (Kakade & Langford, 2002). Recently it was proven that this lower bound exists for all types of updates (Schulman et al., 2015). Next, the squared total variation distance was replaced by the Kullback-Leibler divergence, which is greater than or equal to it by Pinsker's inequality (Levin & Peres, 2017), so the lower bound was decreased (Schulman et al., 2015). Using the Lagrangian of such a constrained problem, an off-policy method called Advantage-Weighted Regression (Peng et al., 2019) was derived, which also uses KL as a constraint. This article proposes a new method whose lower bound on the expected discounted return is greater than or equal to the bound obtained with KL. We achieve this by replacing the squared total variation distance with twice the squared Hellinger distance: since the latter upper-bounds the former, the resulting lower bound remains valid, so strictness stays the same; since it is in turn bounded by the KL divergence, the bound is no looser than the one used with KL. We then derive an off-policy method called Hellinger Distance Constrained Regression from the new constraint. It can be used on discrete and continuous action spaces, since the derivation uses Lebesgue integrals rather than summation or Riemann integrals.

2. PRELIMINARIES

To better present the problem, we start from basic definitions, go through the history of improvements, and then describe the disadvantages of using the KL divergence as a constraint.

We consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a set of states (finite or infinite), $\mathcal{A}$ is a set of actions (finite or infinite), $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $r : \mathcal{S} \to \mathbb{R}$ is the reward function, $\rho_0 : \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi$ denote a stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$; then its expected discounted return is

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right], \quad \text{where } s_0 \sim \rho_0(\cdot),\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t). \tag{1}$$

This paper uses the state-action value function $Q^{\pi}$, the state value function $V^{\pi}$, and the advantage function $A^{\pi}$, with the following definitions:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right], \quad V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right], \quad A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t). \tag{2}$$

Let $\rho_{\pi}(s)$ be the unnormalized visitation frequency of state $s$ when actions are chosen according to $\pi$:

$$\rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s). \tag{3}$$

The following identity expresses the expected return of another policy $\tilde{\pi}$ in terms of the advantage over $\pi$, accumulated over states (see Schulman et al. (2015) for proof):

$$\eta(\tilde{\pi}) = \eta(\pi) + \int \rho_{\tilde{\pi}}(s) \int \tilde{\pi}(a \mid s) A^{\pi}(s, a) \, da \, ds. \tag{4}$$

In approximately optimal learning, we replace the state visitation frequency $\rho_{\tilde{\pi}}$ by $\rho_{\pi}$, since this drastically decreases optimization complexity:

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \int \rho_{\pi}(s) \int \tilde{\pi}(a \mid s) A^{\pi}(s, a) \, da \, ds. \tag{5}$$

Let $\pi_{old}$ denote the current policy; then the lower bound on the expected discounted return of the new policy $\pi_{new}$ is (see Schulman et al. (2015) for proof):

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2, \quad \text{where } \epsilon = \max_{s, a} |A^{\pi}(s, a)|, \quad \alpha = \max_s D_{TV}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s)), \tag{6}$$

$$D_{TV}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s)) = \frac{1}{2} \int |\pi_{old}(a \mid s) - \pi_{new}(a \mid s)| \, da.$$
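In a small tabular MDP, $V^{\pi}$, $Q^{\pi}$, $A^{\pi}$, and $\rho_{\pi}$ are solutions of linear systems, so identity (4) can be verified exactly. The sketch below uses a toy two-state MDP of our own construction (all transition probabilities and rewards are made-up illustrative values, not from the paper):

```python
import numpy as np

gamma = 0.9
nS, nA = 2, 2
# Toy MDP: transition tensor P[s, a, s'] and state reward r(s) (values made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([1.0, -1.0])
rho0 = np.array([0.5, 0.5])

def eval_policy(pi):
    """Return V, Q, A, and unnormalized visitation frequencies rho for policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)          # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r)            # V = r + gamma P_pi V
    Q = r[:, None] + gamma * np.einsum("sat,t->sa", P, V)        # Q(s,a) = r(s) + gamma E[V(s')]
    A = Q - V[:, None]                                           # advantage, eq. (2)
    rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)     # eq. (3) as a linear system
    return V, Q, A, rho

pi_old = np.array([[0.7, 0.3], [0.4, 0.6]])
pi_new = np.array([[0.2, 0.8], [0.9, 0.1]])

V_old, _, A_old, _ = eval_policy(pi_old)
V_new, _, _, rho_new = eval_policy(pi_new)

eta_old = rho0 @ V_old
eta_new = rho0 @ V_new
# Identity (4): eta(pi_new) = eta(pi_old) + sum_s rho_{pi_new}(s) sum_a pi_new(a|s) A_{pi_old}(s, a)
rhs = eta_old + np.sum(rho_new[:, None] * pi_new * A_old)
assert np.isclose(eta_new, rhs)
```

Note that the identity uses the visitation frequencies of the *new* policy, which is exactly what makes it hard to optimize directly and motivates the surrogate (5).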



The theoretical Trust-Region Policy Optimization algorithm relies on Pinsker's inequality (see (Tsybakov, 2009) for proof):

$$D_{KL}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s)) \ge D_{TV}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s))^2, \quad \text{where } D_{KL}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s)) = \int \pi_{old}(a \mid s) \log \frac{\pi_{old}(a \mid s)}{\pi_{new}(a \mid s)} \, da. \tag{7}$$

To retain strictness and decrease computational complexity, the squared total variation distance was replaced with the Kullback-Leibler divergence $D_{KL}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s))$:
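The lower bound (6), and the fact that substituting the KL divergence for the squared total variation distance via Pinsker's inequality (7) can only loosen it, can both be checked numerically in a small tabular MDP. The sketch below uses a toy two-state MDP of our own construction (all numbers are illustrative assumptions, not from the paper):

```python
import numpy as np

gamma = 0.9
nS = 2
# Toy MDP: transition tensor P[s, a, s'] and state reward r(s) (values made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([1.0, -1.0])
rho0 = np.array([0.5, 0.5])

def eval_policy(pi):
    """Return V, advantage A, and visitation frequencies rho for policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r)
    Q = r[:, None] + gamma * np.einsum("sat,t->sa", P, V)
    rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)
    return V, Q - V[:, None], rho

pi_old = np.array([[0.7, 0.3], [0.4, 0.6]])
pi_new = np.array([[0.6, 0.4], [0.5, 0.5]])   # a small step away from pi_old

V_old, A_old, rho_old = eval_policy(pi_old)
V_new, _, _ = eval_policy(pi_new)
eta_new = rho0 @ V_new

# Surrogate objective (5): note it uses the OLD policy's visitation frequencies.
L = rho0 @ V_old + np.sum(rho_old[:, None] * pi_new * A_old)

eps = np.abs(A_old).max()                                # epsilon = max_{s,a} |A(s, a)|
tv = 0.5 * np.abs(pi_old - pi_new).sum(axis=1).max()     # alpha = max_s D_TV
kl = (pi_old * np.log(pi_old / pi_new)).sum(axis=1).max()

bound_tv = L - 4 * eps * gamma / (1 - gamma) ** 2 * tv ** 2
bound_kl = L - 4 * eps * gamma / (1 - gamma) ** 2 * kl   # Pinsker substitution, eq. (7)
assert eta_new >= bound_tv >= bound_kl                   # the KL bound is never tighter
```

The final assertion makes the trade-off concrete: the KL-based bound is easier to optimize but always at least as loose as the TV-based one, which is the gap the Hellinger constraint proposed here aims to narrow.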

