PARAMETER-BASED VALUE FUNCTIONS

Abstract

Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First, we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally, our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to that of state-of-the-art methods.

1. INTRODUCTION

Value functions are central to Reinforcement Learning (RL). For a given policy, they estimate the value of being in a specific state (or of choosing a particular action in a given state). Many RL breakthroughs were achieved through improved estimates of such values, which can be used to find optimal policies (Tesauro, 1995; Mnih et al., 2015). However, learning value functions of arbitrary policies without observing their behavior in the environment is not trivial. Such off-policy learning requires correcting the mismatch between the distribution of updates induced by the behavioral policy and the distribution induced by the policy we want to learn. Common techniques include Importance Sampling (IS) (Hesterberg, 1988) and deterministic policy gradient methods (DPG) (Silver et al., 2014), which adopt the actor-critic architecture (Sutton, 1984; Konda & Tsitsiklis, 2001; Peters & Schaal, 2008). Unfortunately, these approaches have limitations. IS suffers from large variance (Cortes et al., 2010; Metelli et al., 2018; Wang et al., 2016), while traditional off-policy actor-critic methods introduce off-policy objectives whose gradients are difficult to follow, since they involve the gradient of the action-value function with respect to the policy parameters, ∇_θ Q^{π_θ}(s, a) (Degris et al., 2012; Silver et al., 2014). This term is usually ignored, resulting in biased gradients for the off-policy objective. Furthermore, off-policy actor-critic algorithms learn value functions of a single target policy. When value functions are updated to track the learned policy, the information about old policies is lost. We address the problem of generalization across many value functions in the off-policy setting by introducing a class of Parameter-Based Value Functions (PBVFs) defined for any policy. PBVFs are value functions whose inputs include the policy parameters: the parameter-based start-state value function (PSSVF) V(θ), the parameter-based state value function (PSVF) V(s, θ), and the parameter-based action value function (PAVF) Q(s, a, θ).
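To make the PSSVF idea concrete, here is a toy illustration (not from the paper): V(θ) maps a policy's flattened parameter vector directly to its expected return, and once learned it can be climbed by gradient ascent with no further environment interaction. A minimal NumPy sketch, where a synthetic quadratic "return" stands in for real rollouts and a quadratic feature map lets least squares represent V exactly; all names and constants below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5  # number of policy parameters

def episode_return(theta):
    # Stand-in for an environment rollout (assumption, not from the
    # paper): a quadratic return with its optimum at theta = 1.
    return -np.sum((theta - 1.0) ** 2)

def features(theta):
    # Quadratic features so a linear regressor can represent V(theta).
    return np.concatenate([theta, theta ** 2, [1.0]])

# 1) Evaluate many (old) policies and record their Monte Carlo returns.
thetas = rng.normal(size=(200, D))
X = np.array([features(t) for t in thetas])
y = np.array([episode_return(t) for t in thetas])

# 2) Fit the PSSVF V(theta) = w . features(theta) by least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3) Zero-shot policy improvement: gradient ascent on the *learned* V,
#    using grad_theta V = w[:D] + 2 * w[D:2*D] * theta for these features.
theta = np.zeros(D)
for _ in range(200):
    theta += 0.05 * (w[:D] + 2.0 * w[D:2 * D] * theta)
```

The improved policy never interacted with the environment itself; only the returns of the perturbed policies used to fit V(θ) were observed.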
PBVFs can be learned using Monte Carlo (MC) (Metropolis & Ulam, 1949) or Temporal Difference (TD) (Sutton, 1988) methods. The PAVF Q(s, a, θ) leads to novel stochastic and deterministic off-policy policy gradient theorems and, unlike previous approaches, can directly compute ∇_θ Q^{π_θ}(s, a). Based on these results, we develop off-policy actor-critic methods and compare our algorithms to two strong baselines, ARS and DDPG (Mania et al., 2018; Lillicrap et al., 2015), outperforming them in some environments. We make theoretical, algorithmic, and experimental contributions: Section 2 introduces the standard MDP setting; Section 3 formally presents PBVFs and derives algorithms for V(θ), V(s, θ), and Q(s, a, θ); Section 4 describes the experimental evaluation using shallow and deep policies; Sections 5 and 6 discuss related and future work. Proofs and derivations can be found in Appendix A.2.

2. PRELIMINARIES

We consider a Markov Decision Process (MDP) (Stratonovich, 1960; Puterman, 2014) M = (S, A, P, R, γ, μ_0), where at each step an agent observes a state s ∈ S, chooses an action a ∈ A, transitions into state s′ with probability P(s′|s, a), and receives a reward R(s, a). The agent starts from an initial state, chosen with probability μ_0(s). It is represented by a parametrized stochastic policy π_θ : S → Δ(A), which provides the probability of performing action a in state s; Θ denotes the space of policy parameters. The policy is deterministic if for each state s there exists an action a such that π_θ(a|s) = 1. The return R_t is defined as the cumulative discounted reward from time step t: R_t = Σ_{k=0}^{T−t−1} γ^k R(s_{t+k+1}, a_{t+k+1}), where T denotes the time horizon and γ a real-valued discount factor. The performance of the agent is measured by the cumulative discounted expected reward (expected return), defined as J(π_θ) = E_{π_θ}[R_0].
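The return R_t defined above can be computed for a whole recorded episode with the backward recursion R_t = r_{t+1} + γ R_{t+1}; a short sketch (the reward list and γ below are illustrative):

```python
def discounted_return(rewards, gamma):
    # rewards[k] is the reward received at step k+1 of the episode, so
    # R_t = sum_{k=0}^{T-t-1} gamma^k * rewards[t+k]; computed backwards
    # in a single pass via R_t = rewards[t] + gamma * R_{t+1}.
    R = 0.0
    out = []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return out[::-1]  # out[t] is the return R_t

# e.g. three unit rewards with gamma = 0.5:
# R_2 = 1, R_1 = 1 + 0.5, R_0 = 1 + 0.5 + 0.25
returns = discounted_return([1.0, 1.0, 1.0], 0.5)
```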
Given a policy π_θ, the state-value function V^{π_θ}(s) = E_{π_θ}[R_t | s_t = s] is defined as the expected return for being in state s and following policy π_θ. By integrating over the state space S, we can express the maximization of the expected cumulative reward in terms of the state-value function: J(π_θ) = ∫_S μ_0(s) V^{π_θ}(s) ds. The action-value function Q^{π_θ}(s, a), defined as the expected return for performing action a in state s and then following the policy π_θ, is Q^{π_θ}(s, a) = E_{π_θ}[R_t | s_t = s, a_t = a], and it is related to the state-value function by V^{π_θ}(s) = ∫_A π_θ(a|s) Q^{π_θ}(s, a) da. We define d^{π_θ}(s′) as the discounted weighting of states encountered starting at s_0 ∼ μ_0(s) and following the policy π_θ: d^{π_θ}(s′) = ∫_S Σ_{t=1}^∞ γ^{t−1} μ_0(s) P(s → s′, t, π_θ) ds, where P(s → s′, t, π_θ) is the probability of transitioning to s′ after t time steps, starting from s and following policy π_θ. Sutton et al. (1999) showed that, for stochastic policies, the gradient of J(π_θ) does not involve the derivative of d^{π_θ}(s) and can be expressed in a simple form:

∇_θ J(π_θ) = ∫_S d^{π_θ}(s) ∫_A ∇_θ π_θ(a|s) Q^{π_θ}(s, a) da ds.

Similarly, for deterministic policies, Silver et al. (2014) obtained the following:

∇_θ J(π_θ) = ∫_S d^{π_θ}(s) ∇_θ π_θ(s) ∇_a Q^{π_θ}(s, a)|_{a=π_θ(s)} ds.

Off-policy RL. In off-policy policy optimization, we seek the parameters of the policy maximizing a performance index J_b(π_θ) using data collected from a behavioral policy π_b. Here the objective function J_b(π_θ) is typically modified to be the value function of the target policy, integrated over d^{π_b}_∞(s) = lim_{t→∞} P(s_t = s | s_0, π_b), the limiting distribution of states under π_b (assuming it exists) (Degris et al., 2012; Imani et al., 2018; Wang et al., 2016). Throughout the paper we assume that the support of d^{π_b}_∞ includes the support of μ_0, so that the optimal solution for J_b is also optimal for J.
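The score-function form of the stochastic policy gradient theorem can be illustrated on a two-armed bandit (a single-state MDP), with the Monte Carlo return standing in for Q^{π_θ}(s, a); the arm payoffs below are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    # Numerically stable softmax policy over the two arms.
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)
for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)            # sample an action from pi_theta
    reward = float(a == 1)            # assumed payoffs: arm 1 pays 1, arm 0 pays 0
    grad_log = -p
    grad_log[a] += 1.0                # grad_theta log pi_theta(a) for softmax
    theta += 0.1 * reward * grad_log  # on-policy update: return * score function
```

After training, the policy concentrates its probability mass on the rewarding arm.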
Formally, we want to find:

J_b(π_θ*) = max_θ ∫_S d^{π_b}_∞(s) V^{π_θ}(s) ds = max_θ ∫_S d^{π_b}_∞(s) ∫_A π_θ(a|s) Q^{π_θ}(s, a) da ds.

Unfortunately, in the off-policy setting, the states are obtained from d^{π_b}_∞ and not from d^{π_θ}_∞, hence the gradients suffer from a distribution shift (Liu et al., 2019; Nachum et al., 2019). Moreover, since we have no access to d^{π_θ}_∞, a term in the policy gradient theorem corresponding to the gradient of the action-value function with respect to the policy parameters needs to be estimated. This term is usually ignored in traditional off-policy policy gradient theorems [1]. In particular, when the policy is stochastic, Degris et al. (2012) showed that:

∇_θ J_b(π_θ) = ∫_S d^{π_b}_∞(s) ∫_A π_b(a|s) (π_θ(a|s) / π_b(a|s)) (Q^{π_θ}(s, a) ∇_θ log π_θ(a|s) + ∇_θ Q^{π_θ}(s, a)) da ds    (4)
≈ ∫_S d^{π_b}_∞(s) ∫_A π_b(a|s) (π_θ(a|s) / π_b(a|s)) Q^{π_θ}(s, a) ∇_θ log π_θ(a|s) da ds.    (5)

Analogously, Silver et al. (2014) provided the following approximation for deterministic policies [2]:

∇_θ J_b(π_θ) = ∫_S d^{π_b}_∞(s) (∇_θ π_θ(s) ∇_a Q^{π_θ}(s, a)|_{a=π_θ(s)} + ∇_θ Q^{π_θ}(s, a)|_{a=π_θ(s)}) ds    (6)
≈ ∫_S d^{π_b}_∞(s) ∇_θ π_θ(s) ∇_a Q^{π_θ}(s, a)|_{a=π_θ(s)} ds.    (7)

[1] With tabular policies, dropping this term still results in a convergent algorithm (Degris et al., 2012).
[2] In the original formulation of Silver et al. (2014), d^{π_b}_∞(s) is replaced by d^{π_b}(s).
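A sketch of the approximate off-policy estimator in Eq. (5) on a two-armed bandit (a single-state MDP): actions are sampled from a fixed behavior policy π_b, each update is reweighted by the importance ratio π_θ/π_b, and the ∇_θ Q^{π_θ} term is dropped. The bandit payoffs and step size below are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    # Numerically stable softmax target policy over the two arms.
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)
b = np.array([0.5, 0.5])                      # fixed behavior policy pi_b
for _ in range(3000):
    a = rng.choice(2, p=b)                    # act with pi_b only
    reward = float(a == 1)                    # assumed payoffs: arm 1 pays 1
    p = softmax(theta)
    rho = p[a] / b[a]                         # importance ratio pi_theta / pi_b
    grad_log = -p
    grad_log[a] += 1.0                        # grad_theta log pi_theta(a)
    theta += 0.05 * rho * reward * grad_log   # Eq. (5): grad_theta Q dropped
```

Even though the target policy never acts, the reweighted updates move it toward the rewarding arm; on a bandit the dropped ∇_θ Q term vanishes anyway, which is why this biased estimator still behaves well here.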

