WHAT ABOUT TAKING POLICY AS INPUT OF VALUE FUNCTION: POLICY-EXTENDED VALUE FUNCTION APPROXIMATOR

Abstract

The value function lies at the heart of Reinforcement Learning (RL), defining the long-term evaluation of a policy in a given state. In this paper, we propose the Policy-extended Value Function Approximator (PeVFA), which extends the conventional value function to be a function not only of the state but also of an explicit policy representation. Such an extension enables PeVFA to preserve the values of multiple policies, in contrast to a conventional approximator with capacity for only one policy, inducing a new characteristic: value generalization among policies. Through both a theoretical and an empirical lens, we study value generalization along the policy improvement path (called local generalization), from which we derive a new form of Generalized Policy Iteration with PeVFA that improves the conventional learning process. In addition, we propose a framework to learn the representation of an RL policy, studying several approaches to learn an effective policy representation from policy network parameters and state-action pairs through contrastive learning and action prediction. In our experiments, Proximal Policy Optimization (PPO) with PeVFA significantly outperforms its vanilla counterpart on MuJoCo continuous control tasks, demonstrating the effectiveness of the value generalization offered by PeVFA and of policy representation learning.

1. INTRODUCTION

Reinforcement learning (RL) has been widely considered a promising way to learn optimal policies in many decision-making problems (Mnih et al., 2015; Lillicrap et al., 2015; Silver et al., 2016; You et al., 2018; Schreck et al., 2019; Vinyals et al., 2019; Hafner et al., 2020). At the heart of RL is the value function, which defines the long-term evaluation of a policy. With function approximation (e.g., deep neural networks), a value function approximator (VFA) is able to approximate the values of a policy over large and continuous state spaces. As commonly recognized, most RL algorithms can be described as Generalized Policy Iteration (GPI) (Sutton & Barto, 1998). As illustrated in the left of Figure 1, at each iteration the VFA is trained to approximate the true values of the current policy, with respect to which the policy is then improved. However, value approximation can never be perfect, and its quality influences the effectiveness of policy improvement, raising a demand for better value approximation (van Hasselt, 2010; Bellemare et al., 2017; Fujimoto et al., 2018). Since a conventional VFA only approximates the values (i.e., knowledge (Sutton et al., 2011)) of one policy, the knowledge learned from previously encountered policies is not explicitly preserved and utilized for future learning. For example, in GPI a conventional VFA cannot track the values of the changing policy by itself and has no notion of the direction of value generalization when approximating the values of a new policy. In this paper, we propose the Policy-extended Value Function Approximator (PeVFA), which additionally takes an explicit policy representation as input, in contrast to a conventional VFA. PeVFA is able to preserve the values of multiple policies and induces an appealing characteristic, i.e., value generalization among policies.
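To make the extension concrete, the following minimal NumPy sketch contrasts the signature of a conventional VFA, V_φ(s), with that of a PeVFA, V_θ(s, χ_π). The layer sizes and the 8-dimensional policy embedding χ_π here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP used as a stand-in for a trained network (hypothetical)."""
    return [(rng.normal(size=(m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes, sizes[1:])]

def forward(params, x):
    """Tanh hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

# Conventional VFA: values of ONE policy, V_phi(s).
vfa = mlp([4, 32, 1])
s = rng.normal(size=4)          # a 4-dim state (assumed)
v_s = forward(vfa, s)

# PeVFA: value as a function of the state AND a policy representation chi_pi,
# so a single network V_theta can hold values of many policies.
pevfa = mlp([4 + 8, 32, 1])
chi_pi = rng.normal(size=8)     # 8-dim policy embedding (assumed)
v_s_pi = forward(pevfa, np.concatenate([s, chi_pi]))
```

Because χ_π is an explicit input, the same parameters θ map different policy embeddings to different value estimates, which is what makes value generalization among policies possible.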
We study formal generalization and contraction conditions on the value approximation error of PeVFA, focusing specifically on value generalization along the policy improvement path, which we call local generalization. Based on both theoretical and empirical evidence, we propose a new form of GPI with PeVFA (the right of Figure 1), which can benefit from the closer approximation distance induced by local value generalization under certain conditions; GPI with PeVFA is thus expected to be more efficient in consecutive value approximation along the policy improvement path. Moreover, we propose a framework to learn an effective representation of an RL policy from policy network parameters or state-action pairs, through contrastive learning and an auxiliary action-prediction loss. Finally, based on Proximal Policy Optimization (PPO), we derive a practical RL algorithm, PPO-PeVFA, from the above methods. Our experimental results demonstrate the effectiveness of both the value generalization offered by PeVFA and policy representation learning. Our main contributions are summarized as follows:
• We propose PeVFA, which improves the generalization of values among policies, and provide a theoretical analysis of generalization, especially in the local generalization scenario.
• We propose a new form of GPI with PeVFA, resulting in closer value approximation along the policy improvement path, as demonstrated through experiments.
• To our knowledge, we are the first to learn a representation (low-dimensional embedding) for an RL policy from its network parameters (i.e., weights and biases).
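As a rough illustration of the third contribution, a policy representation can be obtained directly from network parameters. In the sketch below, a fixed random projection stands in for the learned encoder (the paper instead trains the encoder, e.g., via contrastive learning), and all sizes are hypothetical. It flattens a policy network's weights and biases into one vector and maps it to a low-dimensional embedding χ_π:

```python
import numpy as np

def flatten_policy_params(params):
    """Concatenate all weights and biases of a policy network into one vector."""
    return np.concatenate([p.ravel() for layer in params for p in layer])

def encode_policy(flat_params, proj):
    """Map raw parameters to a low-dimensional embedding chi_pi.
    A fixed random projection is used here purely for illustration;
    a learned encoder would replace it."""
    return np.tanh(flat_params @ proj)

rng = np.random.default_rng(0)
# A tiny 4 -> 16 -> 2 policy network (weights, biases) per layer.
policy = [(rng.normal(size=(4, 16)), np.zeros(16)),
          (rng.normal(size=(16, 2)), np.zeros(2))]

flat = flatten_policy_params(policy)          # 64 + 16 + 32 + 2 = 114 parameters
proj = rng.normal(size=(flat.size, 8)) / np.sqrt(flat.size)
chi_pi = encode_policy(flat, proj)            # 8-dim policy representation
```

The embedding χ_pi can then be fed to a PeVFA alongside the state, and updated whenever the policy's parameters change.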

2. BACKGROUND

2.1 REINFORCEMENT LEARNING

We consider a Markov Decision Process (MDP) defined as ⟨S, A, r, P, γ⟩, where S is the state space, A is the action space, r is the reward function, P is the transition function, and γ ∈ [0, 1) is the discount factor. The goal of an RL agent is to learn a policy π ∈ Π, where π(a|s) is a distribution over actions given a state, that maximizes the expected long-term discounted return. The state-value function v^π(s) is defined as the expected discounted return obtained by following the policy π from a state s: v^π(s) = E_π[∑_{t=0}^∞ γ^t r_{t+1} | s_0 = s] for all s ∈ S, where r_{t+1} = r(s_t, a_t). We use V^π = v^π(·) to denote the vector of values for all possible states. The value function is determined by the policy π and the environment model (i.e., P and r). For a conventional value function, the policy is modeled implicitly within a table or a function approximator, i.e., a mapping from only the state to a value. One can refer to Appendix E.1 for a more detailed description.

Schaul et al. (2015) introduced Universal Value Function Approximators (UVFA), which generalize values over goals in goal-conditioned RL. Similar ideas are also adopted for the low-level learning of Hierarchical RL (Nachum et al., 2018). Such extensions are also studied in more challenging RL problems, e.g., opponent modeling (He & Boyd-Graber, 2016; Grover et al., 2018; Tacchetti et al., 2019) and context-based Meta-RL (Rakelly et al., 2019; Lee et al., 2020). The general value function (GVF) in (Sutton et al., 2011) is proposed as a form of general knowledge representation through cumulants instead of rewards. In a unified view, each approach generalizes a different aspect of the conventional VFA, focusing on different components of the vector-form Bellman equation (Sutton & Barto, 1998), as discussed in Appendix E.2. Concurrent to our work, several works also study taking the policy as an input of value functions. Harb et al.
(2020) propose Policy Evaluation Networks (PVN) to approximate the objective function J(π) = E_{s_0∼ρ_0}[v^π(s_0)] of different policies π, where ρ_0 is the initial state distribution. Later
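The definition v^π(s) = E_π[∑_{t=0}^∞ γ^t r_{t+1} | s_0 = s] from Sec. 2.1 can be checked numerically. Below is a minimal Monte Carlo sketch on a hypothetical two-state chain (the toy MDP and all names are illustrative, not from the paper): state 1 loops on itself with reward 1, so v^π(1) = 1/(1−γ) = 10 for γ = 0.9, and v^π(0) = γ · v^π(1) = 9.

```python
import numpy as np

def monte_carlo_value(policy, step_fn, s0, gamma=0.9, episodes=20, horizon=200):
    """Estimate v_pi(s0) = E_pi[sum_t gamma^t r_{t+1} | s_0 = s0] via rollouts."""
    rng = np.random.default_rng(0)
    returns = []
    for _ in range(episodes):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite sum at `horizon`
            a = policy(s, rng)
            s, r = step_fn(s, a)          # r plays the role of r_{t+1}
            ret += disc * r
            disc *= gamma
        returns.append(ret)
    return float(np.mean(returns))

# Toy deterministic chain: state 0 -> state 1 (reward 0); state 1 -> state 1 (reward 1).
step_fn = lambda s, a: (1, 0.0) if s == 0 else (1, 1.0)
policy = lambda s, rng: 0   # single-action policy; rng unused in this toy case

v1 = monte_carlo_value(policy, step_fn, s0=1)  # analytic: 1 / (1 - 0.9) = 10
v0 = monte_carlo_value(policy, step_fn, s0=0)  # analytic: 0.9 * 10 = 9
```

Truncating the rollout at 200 steps leaves an error of order γ^200, which is negligible here, so the estimates match the analytic values closely.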



Figure 1: Generalized Policy Iteration (GPI) with function approximation. Left: GPI with a conventional value function approximator V^π_φ. Right: GPI with PeVFA V_θ(χ_π) (Sec. 3), where extra generalization steps exist. The subscripts of the policy π and the value function parameters φ, θ denote the iteration number. The squiggly lines represent the non-perfect approximation of true values.

