PARAMETER-BASED VALUE FUNCTIONS

Abstract

Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First, we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally, our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to state-of-the-art methods.

1. INTRODUCTION

Value functions are central to Reinforcement Learning (RL). For a given policy, they estimate the value of being in a specific state (or of choosing a particular action in a given state). Many RL breakthroughs were achieved through improved estimates of such values, which can be used to find optimal policies (Tesauro, 1995; Mnih et al., 2015). However, learning value functions of arbitrary policies without observing their behavior in the environment is not trivial. Such off-policy learning requires correcting the mismatch between the distribution of updates induced by the behavioral policy and the distribution we want to learn. Common techniques include Importance Sampling (IS) (Hesterberg, 1988) and deterministic policy gradient methods (DPG) (Silver et al., 2014), which adopt the actor-critic architecture (Sutton, 1984; Konda & Tsitsiklis, 2001; Peters & Schaal, 2008). Unfortunately, these approaches have limitations. IS suffers from large variance (Cortes et al., 2010; Metelli et al., 2018; Wang et al., 2016), while traditional off-policy actor-critic methods introduce off-policy objectives whose gradients are difficult to follow, since they involve the gradient of the action-value function with respect to the policy parameters, ∇θ Q^πθ(s, a) (Degris et al., 2012; Silver et al., 2014). This term is usually ignored, resulting in biased gradients for the off-policy objective. Furthermore, off-policy actor-critic algorithms learn value functions of a single target policy. When value functions are updated to track the learned policy, the information about old policies is lost. We address the problem of generalization across many value functions in the off-policy setting by introducing a class of Parameter-Based Value Functions (PBVFs) defined for any policy. PBVFs are value functions whose inputs include the policy parameters: the parameter-based start-state value function (PSSVF) V(θ), the parameter-based state value function (PSVF) V(s, θ), and the parameter-based action-value function (PAVF) Q(s, a, θ).
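To make the idea concrete, the following minimal sketch (not code from the paper) trains a PSSVF V(θ) by Monte Carlo regression on (policy parameters, episodic return) pairs. All names and the simulated return function are hypothetical stand-ins: here a policy is simply a 3-dimensional parameter vector, and its "episode" is a fixed quadratic function of θ plus noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for running an episode: the true return of a policy
# with parameters theta is a quadratic function of theta plus noise.
def episodic_return(theta):
    return -np.sum((theta - 1.0) ** 2) + 0.1 * rng.standard_normal()

# PSSVF V(theta): a linear value function over simple features of the policy
# parameters, trained by Monte Carlo regression on (theta, return) pairs.
def features(theta):
    return np.concatenate([[1.0], theta, theta ** 2])

w = np.zeros(7)    # PSSVF weights
lr = 0.01          # learning rate

for _ in range(5000):
    theta = rng.uniform(-2.0, 2.0, size=3)   # sample a behavioural policy
    ret = episodic_return(theta)             # observe its Monte Carlo return
    phi = features(theta)
    w += lr * (ret - w @ phi) * phi          # MC regression step on V(theta)

# The learned PSSVF generalises across policies: it ranks an unseen
# near-optimal policy above an unseen poor one without running either.
good, bad = np.ones(3), -np.ones(3)
print(w @ features(good) > w @ features(bad))
```

The point of the sketch is the input signature: the value function consumes policy parameters directly, so experience gathered under old policies keeps informing value estimates for new ones.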
PBVFs can be learned using Monte Carlo (MC) (Metropolis & Ulam, 1949) or Temporal Difference (TD) (Sutton, 1988) methods. The PAVF Q(s, a, θ) leads to novel stochastic and deterministic off-policy policy gradient theorems and, unlike previous approaches, can directly compute ∇θ Q^πθ(s, a). Based on these results, we develop off-policy actor-critic methods and compare our algorithms to two strong baselines, ARS and DDPG (Mania et al., 2018; Lillicrap et al., 2015), outperforming them in some environments. We make theoretical, algorithmic, and experimental contributions: Section 2 introduces the standard MDP setting; Section 3 formally presents PBVFs and derives algorithms for V(θ), V(s, θ), and Q(s, a, θ); Section 4 describes the experimental evaluation using shallow and deep policies; Sections 5 and 6 discuss related and future work. Proofs and derivations can be found in Appendix A.2.
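The term that classic off-policy actor-critic methods drop is the direct dependence of the critic on the policy parameters. The toy sketch below (illustrative only, with hand-picked names and a closed-form critic; the paper's algorithms use a learned network and automatic differentiation) shows the full actor gradient for a deterministic policy πθ(s) = s·θ splitting into the familiar DPG-style chain-rule term plus the direct term ∇θ Q(s, a, θ) that a PAVF makes available.

```python
import numpy as np

# Hypothetical closed-form parameter-based critic: it prefers the action
# s @ W_TRUE in state s and mildly regularises the policy parameters.
W_TRUE = np.array([0.5, -0.3, 0.8])   # assumed "preferred" linear behaviour

def Q(s, a, theta, lam=0.01):
    return -(a - s @ W_TRUE) ** 2 - lam * (theta @ theta)

# Gradient of Q(s, pi_theta(s), theta) w.r.t. theta has two parts:
#   (dQ/da) * (dpi/dtheta)  -- the usual DPG-style chain-rule term
#   dQ/dtheta               -- the direct term, available because the
#                              critic takes theta as an input
def actor_gradient(s, theta, lam=0.01):
    a = s @ theta                       # deterministic policy pi_theta(s)
    dQ_da = -2.0 * (a - s @ W_TRUE)
    dpi_dtheta = s
    dQ_dtheta = -2.0 * lam * theta      # direct dependence of Q on theta
    return dQ_da * dpi_dtheta + dQ_dtheta

rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(500):                    # gradient ascent on the critic
    s = rng.standard_normal(3)
    theta += 0.05 * actor_gradient(s, theta)

print(np.allclose(theta, W_TRUE, atol=0.1))
```

With the critic in closed form the two gradient terms can be checked by hand; in the actual PAVF setting the same decomposition falls out of differentiating a learned Q(s, a, θ) with respect to θ.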

