GENERAL POLICY EVALUATION AND IMPROVEMENT BY LEARNING TO IDENTIFY FEW BUT CRUCIAL STATES

Abstract

Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single map from policy parameters to expected return that evaluates (and thus helps to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of 'probing states' and a mapping from the actions produced in those probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in the form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing its actions in the probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in the Swimmer-v3 and Hopper-v3 environments merely by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function, trained to evaluate NN policies, is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training.

1. INTRODUCTION

Policy Evaluation and Policy Improvement are arguably the most studied problems in Reinforcement Learning. They are at the root of actor-critic methods (Konda and Tsitsiklis, 2001; Sutton, 1984; Peters and Schaal, 2008), which alternate between these two steps, iteratively estimating the performance of a policy and using this estimate to learn a better policy. In recent years, they have received considerable attention because they have proven effective in visual domains (Mnih et al., 2016; Wu et al., 2017), continuous control problems (Lillicrap et al., 2015; Haarnoja et al., 2018; Fujimoto et al., 2018), and applications such as robotics (Kober et al., 2013). Several ways to estimate value functions have been proposed, ranging from Monte Carlo approaches to Temporal Difference methods (Sutton, 1984), including the challenging off-policy setting where the value of a policy is estimated without observing its behavior (Precup et al., 2001).

A limiting feature of value functions is that they are defined for a single policy. When the policy is updated, they must track it, potentially losing useful information about old policies. As a result, value functions typically do not capture any structure over the policy parameter space. While off-policy methods learn a single value function using data from different policies, they have no specific mechanism to generalize across policies and usually suffer from large variance (Cortes et al., 2010). Parameter-Based Value Functions (PBVFs) (Faccio et al., 2021) are a promising approach to design value functions that overcome this limitation and generalize over multiple policies. A crucial problem in applying such value functions is choosing a suitable representation of the policy: flattening the policy parameters, as done in vanilla PBVFs, is difficult to scale to larger policies. Here we present an approach that connects PBVFs with the policy embedding method of Harb et al. (2020), called the "fingerprint mechanism". Using policy fingerprinting allows us to scale PBVFs to handle larger NN policies and also achieve invariance with respect to the policy architecture. Policy fingerprinting was
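To make the mechanism concrete, the following PyTorch-style sketch illustrates the idea of a fingerprint-based parameter-based value function: a small set of learnable probing states, a head that maps the actions a policy takes in those states to a predicted return, and policy improvement by gradient ascent through this prediction. This is an illustrative sketch only, not the paper's reference implementation; the names (ProbingStateValueFunction, value_loss, improve) are ours, and regressing onto Monte Carlo returns of previously executed policies is an assumption about the training target.

```python
import torch
import torch.nn as nn


class ProbingStateValueFunction(nn.Module):
    """Maps a policy to a scalar return estimate via learned probing states."""

    def __init__(self, state_dim, action_dim, num_probing_states=5, hidden=64):
        super().__init__()
        # Learnable 'probing states': the only inputs on which policies are queried.
        self.probing_states = nn.Parameter(torch.randn(num_probing_states, state_dim))
        # Maps the concatenated probing actions (the policy fingerprint) to a return.
        self.return_head = nn.Sequential(
            nn.Linear(num_probing_states * action_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, policy):
        # `policy` is any differentiable map from a batch of states to actions
        # (deep NN, linear policy, ...); only its actions in the probing states matter.
        probing_actions = policy(self.probing_states)            # (k, action_dim)
        return self.return_head(probing_actions.reshape(1, -1))  # (1, 1)


def value_loss(vf, policies, observed_returns):
    # Regress predicted returns onto returns observed for previously executed
    # policies (assumed Monte Carlo estimates stored in a buffer).
    preds = torch.cat([vf(pi) for pi in policies]).squeeze(-1)   # (N,)
    return ((preds - observed_returns) ** 2).mean()


def improve(vf, policy, lr=1e-2, steps=100):
    # Policy improvement: gradient ascent on the value function's prediction.
    # Only the actions taken in the probing states can change the predicted return.
    for p in vf.parameters():
        p.requires_grad_(False)  # keep the critic fixed during this step
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        loss = -vf(policy).sum()  # maximize predicted return
        opt.zero_grad()
        loss.backward()
        opt.step()
    for p in vf.parameters():
        p.requires_grad_(True)
```

Because the value function only observes the actions a policy produces in the probing states, never the policy's parameters directly, the same trained critic can in principle evaluate and improve policies of different architectures, which is the property behind the zero-shot transfer to linear policies mentioned in the abstract.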

