TOWARDS A UNIFIED POLICY ABSTRACTION THEORY AND REPRESENTATION LEARNING APPROACH IN MARKOV DECISION PROCESSES

Abstract

Lying at the heart of intelligent decision-making systems is the fundamental problem of how a policy is represented and optimized. The root challenge in this problem is the large scale and high complexity of policy space, which exacerbates the difficulty of policy learning, especially in real-world scenarios. Toward a desirable surrogate policy space, policy representation in a low-dimensional latent space has recently shown its potential to improve both the evaluation and optimization of policies. The key question in these studies is by what criterion we should abstract the policy space for the desired compression and generalization. However, both the theory of policy abstraction and the methodology of policy representation learning are understudied in the literature. In this work, we make the first efforts to fill this vacancy. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) between policies, for more convenient use in learning policy representations. Further, we propose a policy representation learning approach based on deep metric learning. In our empirical study, we investigate the efficacy of the proposed policy metrics and representations in characterizing policy difference and conveying policy generalization, respectively. Our experiments cover both policy optimization and evaluation problems, including trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no universally optimal abstraction for all downstream learning problems, while influence-irrelevance policy abstraction can be a generally preferred choice.

1. INTRODUCTION

How to obtain the optimal policy is the ultimate problem in many decision-making systems, such as Game Playing (Mnih et al., 2015), Robotics Manipulation (Smith et al., 2019) and Medicine Discovery (Schreck et al., 2019). A policy, the central notion in this problem, defines the agent's behavior under specific circumstances. Toward solving the problem, many works study policies from different angles, e.g., how a policy can be well represented (Ma et al., 2020; Urain et al., 2020), how to optimize a policy (Schulman et al., 2017a; Ho & Ermon, 2016) and how to analyze and understand agents' behaviors (Zheng et al., 2018; Hansen & Ostermeier, 2001). The root challenge in these studies is the large scale and high complexity of policy space, especially in real-world scenarios, which severely escalates the difficulty of policy learning. Intuitively and naturally, such issues can be significantly alleviated if we have an ideal surrogate policy space, one that is compact in scale while keeping the key features of the original policy space. In this direction, low-dimensional latent representations of policies play an important role in Reinforcement Learning (RL) (Tang et al., 2020), Opponent Modeling (Grover et al., 2018), Fast Adaptation (Raileanu et al., 2020; Sang et al., 2022), Behavioral Characterization (Kanervisto et al., 2020), etc. In these domains, a few preliminary attempts have been made at devising different policy representations. Most policy representations introduced in prior works resort to encapsulating the information of the policy distribution at states of interest (Harb et al., 2020; Pacchiano et al., 2020), e.g., learning a policy embedding by encoding the policy's state-action pairs (or trajectories) and optimizing a policy recovery objective (Grover et al., 2018; Raileanu et al., 2020).
Rather than the policy distribution, some other works resort to the information of the policy's influence on the environment, e.g., the state(-action) visitation distribution induced by the policy (Kanervisto et al., 2020; Mutti et al., 2021). Recently, Tang et al. (2020) offered several methods to learn policy representations through policy contrast or recovery, from both policy network parameters and interaction experiences. Put shortly, the key question of policy representation learning is by what criterion we should abstract the policy space for the desired compression and generalization. Unfortunately, both a unified theory of policy abstraction and a systematic methodology for policy representation are currently missing. In this paper, we make the first efforts to fill this gap in both theory and methodology. Inspired by state abstraction theory (Li et al., 2006), we first introduce a unified theory of policy abstraction. We start by proposing three types of policy abstraction: distribution-irrelevance, influence-irrelevance, and value-irrelevance. They follow different abstraction criteria, each of which concerns distinct features of a policy. Concretely, we utilize exact equivalence relations between policies and derive the corresponding policy abstractions. Further, we generalize the exact equivalence relations to policy metrics, allowing us to quantitatively measure the distance (i.e., similarity) between policies. Such policy metrics are more informative than the binary outcomes of policy equivalence and are thus more useful for policy representation learning. Moreover, toward practical policy representations for downstream learning problems, we introduce a policy representation learning approach based on deep metric learning (Kaya & Bilge, 2019). We propose an alignment loss as a unified objective function for learning with different policy metrics.
The policy representation is learned to render the abstraction criterion by minimizing the difference between the distance of policy embeddings and the quantity measured by the policy metric. In particular, we use Maximum Mean Discrepancy (Gretton et al., 2012; Nguyen-Tang et al., 2021) for efficient empirical estimation of the policy metrics, and we adopt the Layer-wise Permutation-invariant Encoder (Tang et al., 2020) for structure-aware encoding of the policy network parameters. In addition to the theoretical understanding of policy abstraction, we further investigate the empirical efficacy of different policy metrics and representations in characterizing policy difference and conveying policy generalization, respectively. We conduct experiments in both policy optimization and policy evaluation problems. For policy optimization, we adopt Trust-Region Policy Optimization (TRPO) and Diversity-Guided Evolution Strategy (DGES) as the problem settings from (Kanervisto et al., 2020), covering both gradient-based and gradient-free policy optimization. For policy evaluation, we consider Off-policy Evaluation (OPE). In particular, we establish a series of OPE settings with different configurations of training data and generalization tasks, reflecting circumstances often encountered in RL. Our experimental results indicate that, somewhat naturally, there is no universally optimal abstraction for all downstream learning problems. Additionally, it turns out that influence-irrelevance abstraction can be a preferred choice in general cases. Our main contributions are summarized as follows: 1) We focus on the general policy abstraction problem and, to our knowledge, propose a unified theory of policy abstraction along with several policy metrics for the first time. 2) We propose a unified policy representation learning approach based on deep metric learning.
3) We empirically evaluate the efficacy of our proposed policy representations in multiple fundamental problems (i.e., TRPO, DGES and OPE).
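As a rough illustration of this approach (a minimal sketch, not the authors' implementation), the alignment idea can be written down directly: a policy metric d(π_i, π_j) is estimated empirically from samples drawn under the two policies (here via an empirical MMD over state-action samples with an RBF kernel, a stand-in choice of estimator), and the encoder is trained so that the distance between policy embeddings matches the metric. All function names and shapes below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel matrix between rows of x and rows of y."""
    sq = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd(x, y, sigma=1.0):
    """Empirical MMD between two sample sets, e.g., state-action pairs
    collected under two policies; a simple biased V-statistic estimate."""
    kxx = rbf_kernel(x, x, sigma).mean()
    kyy = rbf_kernel(y, y, sigma).mean()
    kxy = rbf_kernel(x, y, sigma).mean()
    return np.sqrt(max(kxx + kyy - 2 * kxy, 0.0))

def alignment_loss(z_i, z_j, d_ij):
    """Squared mismatch between the embedding distance ||z_i - z_j||
    and the policy-metric value d_ij the embeddings should render."""
    return (np.linalg.norm(z_i - z_j) - d_ij) ** 2

# Hypothetical usage: state-action samples from two policies, plus
# their current embeddings produced by some encoder.
rng = np.random.default_rng(0)
samples_i = rng.normal(0.0, 1.0, size=(64, 4))  # (s, a) samples, policy i
samples_j = rng.normal(0.5, 1.0, size=(64, 4))  # (s, a) samples, policy j
d_ij = mmd(samples_i, samples_j)
loss = alignment_loss(np.array([0.1, 0.2]), np.array([0.3, 0.1]), d_ij)
```

In a training loop, `loss` would be averaged over sampled policy pairs and backpropagated through the encoder (e.g., a layer-wise permutation-invariant encoder over network parameters); the NumPy version above only exposes the objective itself.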

2. BACKGROUND

Reinforcement Learning We consider a Markov Decision Process (MDP) (Puterman, 2014), typically defined by a five-tuple ⟨S, A, P, R, γ⟩, with the state space S, the action space A, the transition probability P : S × A → ∆(S), the reward function R : S × A → ℝ and the discount factor γ ∈ [0, 1), where ∆(X) denotes the set of probability distributions over X. A stationary policy π : S → ∆(A) is a mapping from states to action distributions, which defines how to behave at specific states. An agent interacts with the MDP at discrete timesteps through its policy π, generating trajectories with s_0 ∼ ρ_0(·), a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t) and r_t = R(s_t, a_t), where ρ_0 is the initial state distribution. We use P^π(s′|s) = E_{a∼π(·|s)} P(s′|s, a) to denote the distribution of the next state s′ when performing policy π at state s. For a policy π, the return G = Σ_{t=0}^∞ γ^t r_t is the random variable for the sum of discounted rewards collected while following π, whose distribution is denoted by Z^π. The value function of policy π gives the expected return from state s, i.e., V^π(s) = E_π[G | s_0 = s]. The goal of an RL agent is to learn an optimal policy π* that maximizes J(π) = E_{s_0∼ρ_0(·)}[V^π(s_0)].

Metric Learning Here we recall the standard definition of metrics, which is central to our work.

Definition 1 (Metric (Royden, 1968)). Let X be a non-empty set of data elements. A metric is a real-valued function d : X × X → [0, ∞) such that for all x, y, z ∈ X: 1) d(x, y) = 0 ⟺ x = y; 2) d(x, y) = d(y, x); 3) d(x, y) ≤ d(x, z) + d(z, y). A pseudo-metric d is a metric with the first condition replaced by x = y ⟹ d(x, y) = 0. The pair ⟨X, d⟩ is called a metric space.
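To make the return and objective concrete, here is a minimal sketch (all names and the toy trajectory are hypothetical, not part of the paper): the discounted return G = Σ_t γ^t r_t of one trajectory, and a Monte Carlo estimate of J(π) = E_{s_0∼ρ_0}[V^π(s_0)] obtained by averaging returns over sampled trajectories.

```python
def discounted_return(rewards, gamma=0.99):
    """G = sum_t gamma^t * r_t for a single trajectory's reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_j(sample_trajectory, gamma=0.99, n=1000):
    """Monte Carlo estimate of J(pi) = E_{s0~rho0}[V^pi(s0)]:
    average the discounted return over n sampled trajectories, where
    sample_trajectory() rolls out the policy once and returns its rewards."""
    total = sum(discounted_return(sample_trajectory(), gamma) for _ in range(n))
    return total / n

# Hypothetical usage: a policy that collects reward 1 at each of H steps,
# so J should approach (1 - gamma^H) / (1 - gamma).
rollout = lambda: [1.0] * 10
j_hat = estimate_j(rollout, gamma=0.9, n=100)
```

For the constant-reward rollout above, the geometric-series identity gives J = (1 − 0.9¹⁰)/(1 − 0.9) ≈ 6.51, which the estimate recovers exactly since the toy rollout is deterministic.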

