GENERAL POLICY EVALUATION AND IMPROVEMENT BY LEARNING TO IDENTIFY FEW BUT CRUCIAL STATES

Abstract

Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single map from policy parameters to expected return that evaluates (and thus helps to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of 'probing states' and a mapping from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in the form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in the Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function trained to evaluate NN policies is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training.

1. INTRODUCTION

Policy Evaluation and Policy Improvement are arguably the most studied problems in Reinforcement Learning. They are at the root of actor-critic methods (Konda and Tsitsiklis, 2001; Sutton, 1984; Peters and Schaal, 2008), which alternate between these two steps, iteratively estimating the performance of a policy and using this estimate to learn a better policy. In the last few years, they have received a lot of attention because they have proven effective in visual domains (Mnih et al., 2016; Wu et al., 2017), continuous control problems (Lillicrap et al., 2015; Haarnoja et al., 2018; Fujimoto et al., 2018), and applications such as robotics (Kober et al., 2013). Several ways to estimate value functions have been proposed, ranging from Monte Carlo approaches to Temporal Difference methods (Sutton, 1984), including the challenging off-policy scenario where the value of a policy is estimated without observing its behavior (Precup et al., 2001). A limiting feature of value functions is that they are defined for a single policy. When the policy is updated, they need to keep track of it, potentially losing useful information about old policies. By doing so, value functions typically do not capture any structure over the policy parameter space. While off-policy methods learn a single value function using data from different policies, they have no specific mechanism to generalize across policies and usually suffer from large variance (Cortes et al., 2010). Parameter-Based Value Functions (PBVFs) (Faccio et al., 2021) are a promising approach to design value functions that overcome this limitation and generalize over multiple policies. A crucial problem in the application of such value functions is choosing a suitable representation of the policy. Flattening the policy parameters, as done in vanilla PBVFs, is difficult to scale to larger policies.
Here we present an approach that connects PBVFs and a policy embedding method called the "fingerprint mechanism" by Harb et al. (2020). Policy fingerprinting allows us to scale PBVFs to larger NN policies and also achieves invariance with respect to the policy architecture. Policy fingerprinting was introduced to learn maps from policy parameters to expected return offline and, prior to this work, was never applied to the online RL setting. We show in visual classification tasks and in continuous control problems that our approach can identify a small number of critical "probing states" that are highly informative of the policies' performance. Our learned value function generalizes across many NN-based policies. It combines the behavior of many bad policies to learn a better policy, and is able to zero-shot learn policies with a different architecture. We compare our approach with strong baselines in continuous control tasks: our method is competitive with DDPG (Lillicrap et al., 2015) and evolutionary approaches.
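As a rough sketch of how such a fingerprint-based evaluator can work (the sizes, the linear read-out, and all names below are illustrative assumptions, not the authors' implementation): the value function maintains a small set of learnable probing states, feeds them through the policy being evaluated, and maps the resulting actions to a scalar return estimate. Because only the policy's actions in the probing states matter, the evaluator is agnostic to the policy's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
STATE_DIM, ACTION_DIM, N_PROBES = 4, 2, 3

# Learnable components of the value function: a few probing states
# and a linear read-out mapping concatenated actions to a return estimate.
probing_states = rng.normal(size=(N_PROBES, STATE_DIM))
readout_w = rng.normal(size=N_PROBES * ACTION_DIM)

def linear_policy(theta, s):
    """A toy deterministic policy a = tanh(W s), with theta = W flattened."""
    return np.tanh(theta.reshape(ACTION_DIM, STATE_DIM) @ s)

def predicted_return(policy, theta):
    """Fingerprint evaluation: probe the policy, read out a scalar value."""
    actions = np.concatenate([policy(theta, s) for s in probing_states])
    return float(readout_w @ actions)

theta = rng.normal(size=STATE_DIM * ACTION_DIM)
v = predicted_return(linear_policy, theta)  # scalar estimate of J(theta)
```

In the actual method, the probing states and the read-out network are trained jointly by regressing predicted returns onto observed ones; any policy class (deep NN or linear) can then be plugged into the evaluator unchanged, since only its actions in the probing states are read.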

2. BACKGROUND

We consider an agent interacting with a Markov Decision Process (MDP) (Stratonovich, 1960; Puterman, 2014) $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma, \mu_0)$. The state space $\mathcal{S} \subset \mathbb{R}^{n_S}$ and the action space $\mathcal{A} \subset \mathbb{R}^{n_A}$ are assumed to be compact sub-spaces. In the MDP framework, at each time-step $t$, the agent observes a state $s_t \in \mathcal{S}$, chooses an action $a_t \in \mathcal{A}$, transitions into state $s_{t+1}$ with probability $P(s_{t+1}|s_t, a_t)$, and receives a reward $r_t = R(s_t, a_t)$. The initial state is chosen with probability $\mu_0(s)$. The agent's behavior is represented by its policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, a function assigning to each state $s$ a probability distribution over the action space. A policy is deterministic when for each state there exists an action $a$ that is selected with probability one. Here we consider parametrized policies of the form $\pi_\theta$, where $\theta \in \Theta \subset \mathbb{R}^{n_\theta}$ are the policy parameters. The return $R_t$ is defined as the cumulative discounted reward from time-step $t$, i.e. $R_t = \sum_{k=0}^{\infty} \gamma^k R(s_{t+k+1}, a_{t+k+1})$, where $\gamma \in (0, 1]$ is the discount factor. The agent's performance is measured by the expected return (i.e. the cumulative expected discounted reward) from the initial state: $J(\theta) = \mathbb{E}_{\pi_\theta}[R_0]$. The state-value function $V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}[R_t \mid s_t = s]$ of a policy $\pi_\theta$ is defined as the expected return for being in state $s$ and following $\pi_\theta$. Similarly, the action-value function $Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}[R_t \mid s_t = s, a_t = a]$ of a policy $\pi_\theta$ is defined as the expected return for being in state $s$, taking action $a$, and then following $\pi_\theta$. State and action value functions are related by $V^{\pi_\theta}(s) = \int_{\mathcal{A}} \pi_\theta(a|s) Q^{\pi_\theta}(s, a) \, da$. The expected return can be expressed in terms of the state and action value functions by integration over the initial state distribution:

$$J(\theta) = \int_{\mathcal{S}} \mu_0(s) V^{\pi_\theta}(s) \, ds = \int_{\mathcal{S}} \mu_0(s) \int_{\mathcal{A}} \pi_\theta(a|s) Q^{\pi_\theta}(s, a) \, da \, ds. \quad (1)$$

The goal of an RL agent is to find the policy parameters $\theta$ that maximize the expected return.
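To make the return and the objective $J(\theta)$ concrete, here is a minimal numerical sketch (the reward sequences are made-up examples, not data from the paper): the return of one episode is the discounted sum above, and a Monte Carlo estimate of $J(\theta)$ averages returns over episodes whose start states are drawn from $\mu_0$.

```python
import numpy as np

def discounted_return(rewards, gamma=1.0):
    """R_0 = sum_k gamma^k * r_k for one episode (undiscounted if gamma=1)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Reward sequences from two hypothetical episodes of the same policy.
episodes = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]

# Monte Carlo estimate of J(theta): average the per-episode returns.
J_hat = np.mean([discounted_return(ep, gamma=0.9) for ep in episodes])
# First episode: 1 + 0.9 * 0 + 0.81 * 2 = 2.62
```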
Instead of learning a single value function for a target policy, here we try to estimate the value function of any policy and maximize it over the set of initial states.

3. GENERAL POLICY EVALUATION

Recent work has focused on extending value functions to receive the policy parameters as input. This can potentially result in a single value function defined for any policy, and in methods that perform direct search in the policy parameter space. We begin by extending the state-value function, defining the parameter-based state-value function (PSVF) (Faccio et al., 2021) as the expected return for being in state $s$ and following policy $\pi_\theta$: $V(s, \theta) = \mathbb{E}[R_t \mid s_t = s, \theta]$. Using this new definition, we can rewrite the RL objective as $J(\theta) = \int_{\mathcal{S}} \mu_0(s) V(s, \theta) \, ds$. Instead of learning $V(s, \theta)$ for each state, we focus here on the policy evaluation problem over the set of initial states of the agent. This is equivalent to modeling $J(\theta)$ directly as a differentiable function $V(\theta)$, which is the expectation of $V(s, \theta)$ over the initial states: $V(\theta) := \mathbb{E}_{s \sim \mu_0(s)}[V(s, \theta)] = \int_{\mathcal{S}} \mu_0(s) V(s, \theta) \, ds = J(\theta)$. We call $V(\theta)$ a parameter-based start-state value function (PSSVF). We consider the undiscounted case in our setting, so $\gamma$ is set to 1 throughout the paper. Once $V(\theta)$ is learned, direct policy search can be performed by following the gradient $\nabla_\theta V(\theta)$ to update the policy parameters. This learning procedure can naturally be implemented in the actor-critic framework, where a critic value function (the PSSVF) iteratively uses the collected data to evaluate the policies seen so far, and the actor follows the critic's direction of improvement to update itself. As in vanilla PSSVF, we inject noise into the policy parameters for exploration. The PSSVF actor-critic framework is reported in Algorithm 1.
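The alternation can be illustrated with a toy sketch under heavy simplifying assumptions (a known quadratic ground-truth return queried only through rollouts, a critic that is linear in hand-picked features of $\theta$ and fit by least squares, and made-up learning rates; Algorithm 1 itself uses neural networks trained by SGD): perturbed policy parameters are rolled out, the critic $V(\theta)$ is regressed onto the observed returns, and the actor ascends $\nabla_\theta V(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(1)
N_THETA = 5

# Toy "environment": the true return is a concave function of theta,
# maximized at theta_star (unknown to the agent, seen only via rollouts).
theta_star = np.ones(N_THETA)
def rollout_return(theta):
    return -np.sum((theta - theta_star) ** 2) + 0.001 * rng.normal()

# Critic: a PSSVF that is linear in simple features of the parameters.
def phi(theta):
    return np.concatenate([[1.0], theta, theta ** 2])

w = np.zeros(2 * N_THETA + 1)
def grad_V(theta):
    # d/dtheta of (w0 + w1 . theta + w2 . theta^2) is w1 + 2 * w2 * theta.
    return w[1:N_THETA + 1] + 2 * w[N_THETA + 1:] * theta

theta = rng.normal(size=N_THETA)
buffer = []
for step in range(100):
    # Explore with parameter noise; observe the perturbed policy's return.
    noisy = theta + 0.1 * rng.normal(size=N_THETA)
    buffer.append((noisy, rollout_return(noisy)))
    if len(buffer) >= 20:
        # Critic step: regress V(theta) = w . phi(theta) on observed returns.
        X = np.array([phi(t) for t, _ in buffer[-50:]])
        y = np.array([r for _, r in buffer[-50:]])
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        # Actor step: follow the critic's direction of improvement.
        theta += 0.05 * grad_V(theta)
```

Because the feature map here can represent the true return exactly, the critic's gradient closely matches the true one and $\theta$ converges toward $\theta^\star$; with NN policies and the fingerprint critic, the same alternation applies, but both steps become stochastic gradient updates.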

