TOWARDS A UNIFIED POLICY ABSTRACTION THE-ORY AND REPRESENTATION LEARNING APPROACH IN MARKOV DECISION PROCESSES

Abstract

Lying on the heart of intelligent decision-making systems, how policy is represented and optimized is a fundamental problem. The root challenge in this problem is the large scale and the high complexity of policy space, which exacerbates the difficulty of policy learning especially in real-world scenarios. Towards a desirable surrogate policy space, recently policy representation in a low-dimensional latent space has shown its potential in improving both the evaluation and optimization of policy. The key question involved in these studies is by what criterion we should abstract the policy space for desired compression and generalization. However, both the theory on policy abstraction and the methodology on policy representation learning are less studied in the literature. In this work, we make very first efforts to fill up the vacancy. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated to policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies, for more convenient use in learning policy representation. Further, we propose a policy representation learning approach based on deep metric learning. For the empirical study, we investigate the efficacy of the proposed policy metrics and representations, in characterizing policy difference and conveying policy generalization respectively. Our experiments are conducted in both policy optimization and evaluation problems, containing trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no a universally optimal abstraction for all downstream learning problems; while the influence-irrelevance policy abstraction can be a generally preferred choice.

1. INTRODUCTION

How to obtain the optimal policy is the ultimate problem in many decision-making systems, such as Game Playing (Mnih et al., 2015) , Robotics Manipulation (Smith et al., 2019) , Medicine Discovery (Schreck et al., 2019) . Policy, the central notion in the aforementioned problem, defines the agent's behavior under specific circumstances. Towards solving the problem, a lot of works carry out studies on policy with different focal points, e.g., how policy can be well represented (Ma et al., 2020; Urain et al., 2020) , how to optimize policy (Schulman et al., 2017a; Ho & Ermon, 2016) and how to analyze and understand agents' behaviors (Zheng et al., 2018; Hansen & Ostermeier, 2001) . The root challenge to the studies on policy is the large scale and the high complexity of policy space, especially in real-world scenarios. As a consequence, the difficulty of policy learning is escalated severely. Intuitively and naturally, such issues can be significantly alleviated if we have an ideal surrogate policy space, which are compact in scale while keep the key features of policy space. Related to this direction, low-dimensional latent representation of policy plays an important role in Reinforcement Learning (RL) (Tang et al., 2020) , Opponent Modeling (Grover et al., 2018) , Fast Adaptation (Raileanu et al., 2020; Sang et al., 2022) , Behavioral Characterization (Kanervisto et al., 2020) and etc. In these domains, a few preliminary attempts have been made in devising different policy representations. Most policy representations introduced in prior works resort to encapsulating the information of policy distribution under interest states (Harb et al., 2020; Pacchiano et al., 2020) , e.g., learning policy embedding by encoding policy's state-action pairs (or trajectories) and optimizing a policy recovery objective (Grover et al., 2018; Raileanu et al., 2020) . Rather than policy distribution, some other works resort to the information of policy's influence on the environment, e.g., state(-action) visitation distribution induced by the policy (Kanervisto et al., 2020; Mutti et al., 2021) . Recently, Tang et al. (2020) offers several methods to learn policy representation through policy contrast or recovery from both policy network parameters and interaction experiences. Put shortly, the key question of policy representation learning is by what criterion we should abstract the policy space for desired compression and generalization. Unfortunately, both a unified theory on policy abstraction and a systematic methodology on policy representation are currently missing. In this paper, we make first efforts to fill up the plank in both the theory and methodology. Inspired by the state abstraction theory (Li et al., 2006) , first we introduce a unified theory of policy abstraction. We start from proposing three types of policy abstraction: distribution-irrelevance, influence-irrelevance, and value-irrelevance. They follow different abstraction criteria, each of which concerns distinct features of policy. Concretely, we utilize the exact equivalence relations between policies and derive the corresponding policy abstractions. Further, we generalize the exact equivalence relations to policy metrics, allowing quantitatively measure the distance (i.e., similarity) between policies. Such policy metrics are more informative than the binary outcomes of policy equivalence and thus provide more usefulness in policy representation learning. Moreover, towards applying practical policy representation in downstream learning problems, we introduce a policy representation learning approach based on deep metric learning (Kaya & Bilge, 2019) . We propose an alignment loss for a unified objective function of learning with different policy metrics. The policy representation is learned to render the abstraction criterion through minimizing the difference between the distance of policy embeddings and the quantity measure by the policy metrics. In particular, we use Maximum Mean Discrepancy (Gretton et al., 2012; Nguyen-Tang et al., 2021) for efficient empirical estimation of the policy metrics; and we adopt Layer-wise Permutation-invariant Encoder (Tang et al., 2020) for structure-aware encoding of the parameters of policy network. In addition to the theoretical understanding of policy abstraction, we further investigate the empirical efficacy of different policy metrics and representations in characterizing policy difference and conveying policy generalization respectively. We conduct experiments in both policy optimization and policy evaluation problems. For policy optimization, we adopt Trust-Region Policy Optimization (TRPO) and Diversity-Guided Evolution Strategy (DGES) as the problem settings from (Kanervisto et al., 2020) , covering both gradient-based and gradient-free policy optimization. For policy evaluation, we consider Off-policy Evaluation (OPE). In particular, we establish a series of OPE settings with different configurations of training data and generalization tasks. These settings reflect the circumstances often encountered in RL. Our experimental results indicate that, somewhat naturally, there is no a universally optimal abstraction for all downstream learning problems. Additionally, it turns out that the influence-irrelevance abstraction can be a preferred choice in general cases. Our main contributions are summarized as follows: 1) We focus on the general policy abstraction problem and to our knowledge, we propose a unified theory of policy abstraction along with several policy metrics for the first time. 2) We propose a unified policy representation learning approach based on deep metric learning. 3) We empirically evaluate the efficacy of our proposed policy representations in multiple fundamental problems (i.e., TRPO, DGES and OPE).

2. BACKGROUND

Reinforcement Learning We consider a Markov Decision Process (MDP) (Puterman, 2014) typically defined by a five-tuple ⟨S, A, P, R, γ⟩, with the state space S, the action space A, the transition probability P : S ×A → ∆(S), the reward function R : S ×A → R and the discount factor γ ∈ [0, 1). ∆(X) denotes the probability distribution over X. A stationary policy π : S → ∆(A) is a mapping from states to action distributions, which defines how to behave under specific states. An agent interacts with the MDP at discrete timesteps by its policy π, generating trajectories with s 0 ∼ ρ 0 (•), a t ∼ π(•|s t ), s t+1 ∼ P (• | s t , a t ) and r t = R (s t , a t ), where ρ 0 is the initial state distribution. We use P π (s ′ |s) = E a∼π(•|s) P (s ′ |s, a) to denote the distribution of next state s ′ when performing policy π at state s. For a policy π, the return G t = ∞ t=0 γ t r t is the random variable for the sum of discounted rewards while following π, whose distribution is denoted by Z π . The value function of policy π defines the expected return for state s, i.e., V π (s ) = E π [G t | s 0 = s]. The goal of an RL agent is to learn an optimal policy π * that maximizes J(π) = E s0∼ρ0(•) [V π (s 0 )]. Metric Learning Here we recall the standard definition of metrics which is central to our work. Definition 1 (Metrics (Royden, 1968) ). Let X be a non-empty set of data elements and a metric is a real-valued function d: X × X → [0, ∞) such that for all x, y, z ∈ X: 1) d(x, y) = 0 ⇐⇒ x = y; 2) d(x, y) = d(y, x); 3) d(x, y) ≤ d(x, z) + d(z, y). A pseudo-metric d is a metric with the first condition replaced by x = y =⇒ d(x, y) = 0. The combination ⟨X, d⟩ is called a metric space. A metric d is often used to quantify the distance between two data elements in a general sense. In this paper, we will also use metric to stand for pseudo-metric for brevity. Typically metric learning aims to reduce the distance between similar data and increase the distance between dissimilar data. With nonlinear transformation offered by deep neural networks, Deep Metric Learning allows us to find such optimal metrics by optimizing a latent representation space of raw data.

3. POLICY ABSTRACTION THEORY

Inspired by the state abstraction theory (Li et al., 2006) , in this section, we make the first effort in proposing a unified policy abstraction theory. First, we propose the formal definition of three types for policy abstraction; then, we generalize the abstractions to three types of policy metrics. Finally, we analyze the properties of policy abstraction and compare them in several Gridworld MDPs.

3.1. POLICY ABSTRACTION

First of all, following the classic definition of an abstraction (Giunchiglia & Walsh, 1992) , we propose a general definition of policy abstraction as follows: Definition 2 (Policy Abstraction). A policy abstraction f : Π → X , is a mapping from ground policy space Π to an abstract space X . f (π) ∈ X is the abstract policy representation corresponding to ground policy π ∈ Π, and the inverse image f -1 (χ) with χ ∈ X , is the set of ground policies that correspond to χ under abstraction function f . It is apparent that there are many such abstractions since we may have many possible ways to partition the policy space. However, we are only interested in some useful ones among them that follow specific abstraction criteria to preserve the important features related to decision making. In this paper, we present three types of policy abstraction which are defined below: Definition 3. Given an MDP and a ground policy space Π, for any two policies π i , π j ∈ Π, we define three types of policy abstraction as follows: 1 . A distribution-irrelevance abstraction (f π ) is such that for all s ∈ S, a ∈ A, f π (π i ) = f π (π j ) implies that π i (a | s) = π j (a | s). 2. An influence-irrelevance abstraction (f P π ) is such that for all s, s ′ ∈ S, f P π (π i ) = f P π (π j ) implies that P πi (s ′ |s) = P πj (s ′ |s).

3.. A value-irrelevance abstraction (f

V π ) is such that for all s ∈ S, f V π (π i ) = f V π (π j ) implies that V πi (s) = V πj (s). These abstractions aggregate policies based on the corresponding equivalence relations with respective concerns on different features of policy. Intuitively, the distribution-irrelevance abstraction (f π ) preserves the action distribution of the policy; the influence-irrelevance abstraction (f P π ) preserves the state transition distribution induced by the policy, i.e., the influence caused by the policy on the environment; and value-irrelevance abstraction (f V π ) preserves the value function of the policy. In addition to the policy abstractions introduced in Definition 3, we provide some other ones in Appendix A.2. Moreover, we revisit the policy abstractions adopted in prior related works and summarize them from the angle of our policy abstraction theory in Table 4 of Appendix B.

3.2. POLICY METRICS

The policy abstractions allow us to aggregate policies according to equivalence relation. However, exact equivalence is rarely encountered in continuous policy space (e.g., the usual case with neural policies), thus useful abstraction can be seldom obtained. Moreover, the equivalence relation offers only qualitative (i.e., binary) outcomes and is incapable of measuring the similarity between policies, which is significant to policy representation learning. To this end, we generalize the policy abstractions to policy metrics which quantitatively measures the distance between two policies. Corresponding to the three types of policy abstraction, we define the following three policy metrics: Definition 4. Given an MDP, a ground policy space Π, a state distribution p(s) and a distribution (pseudo-)metric D(•, •), for any two policies π i , π j ∈ Π, we define three policy metrics as follows: 1. A distribution-irrelevance metric: d π (π i , π j ) = E s∼p(s) [D (π i (a | s), π j (a | s))]. 2. An influence-irrelevance metric:  d P π (π i , π j ) = E s∼p(s) [D (P πi (s ′ | s), P πj (s ′ | s))]. 3. A value-irrelevance metric: d V π (π i , π j ) = E s∼p(s) [D (Z πi (s), Z πj (s))]. , π 2 , ∀s, s ′ , a ∈ S 2 × A) Fineness Task Relevance f Θ Policy Parameter Equivalence (θ 1 = θ 2 ) Highest None f π Action Distribution Equivalence (π i (a | s) = π j (a | s)) High Low f P π Dynamics Influence Equivalence (P πi (s ′ |s) = P πj (s ′ |s)) Middle Middle f V π Value Function Equivalence (V πi (s) = V πj (s)) Low High f 0 Triviality (taking all policies as the same) Lowest None These metrics follow the same abstraction criteria as in Definition 3, i.e., the irrelevance regarding action distribution, influence and value, measuring the similarity of policies by the distance at respective levels. Compared to the binary outcomes offered by the equivalence relations, the metrics defined here are continuous, thus are more informative in comparing and representing policies in finer views. Specially, one may see that the equivalence relations used in Definition 3 induce corresponding discrete pseudo-metrics, e.g., d Eq π (π i , π j ) = 0 if f π (π i ) = f π (π j ) , and 1 otherwise. Notice the metrics proposed above depends on the distribution metric D and state distribution p(s). For D, typical choices can be Jeffreys Divergence (Jeffreys, 1946) and Maximum Mean Discrepancy (MMD) (Nguyen- Tang et al., 2021) . For p(s), intuitively, it should be the distribution of states we are interested in when comparing two policies. We defer the concrete choices for practical implementation of these metrics in Section 4.

3.3. PROPERTIES OF THE ABSTRACTIONS

Superficially, the three abstractions proposed preserve features that are progressively more relevant to decision making in the learning task, but essentially, what is the relationship between the three abstractions? To investigate the problem, we define the fineness of policy abstractions similar to the one for state abstractions used in (Li et al., 2006) , to prove how the three abstractions are related. Definition 5 (Abstraction Fineness). Let F Π denotes the set of abstractions on ground policy space Π. Suppose f 1 , f 2 ∈ F Π . We say f 1 is finer than f 2 , denoted f 1 ⪰ f 2 , if f ∀ π 1 , π 2 ∈ Π, f 1 (π 1 ) = f 1 (π 2 ) implies f 2 (π 1 ) = f 2 (π 2 ). If , f 1 ̸ = f 2 , then f 1 is strictly finer than f 2 , denoted f 1 ≻ f 2 . In contrast, we may also say f 2 is (strictly) coarser than f 1 , denoted f 2 ⪯ f 1 (f 2 ≺ f 1 ). It is easy to see the relation ⪰ satisfies self-reflexivity, antisymmetry and transitivity, thus it is a partial ordering. Consider the set of possible policy abstractions, while the coarsest abstraction (f 0 ) is the trivial representation where all policies are treated as the same; while the finest abstraction is the identity representation, e.g., f Θ (π θ ) = θ for a policy neural network parameterized with θ ∈ Θ. With the partial ordering ⪰, we further derive the following theory. Theorem 3.1. Under the Definition 3 and 5, if the reward function R depends only on state s ∈ S, we have (f Θ ⪰)f π ⪰ f P π ⪰ f V π (⪰ f 0 ). The proof is provided in Appendix A.1 along with more discussion on other cases of the reward function. The theorem declares how the three policy abstractions are related to each other in the sense of abstraction fineness with the two extreme cases (f Θ , f 0 ) for reference. The coarser the abstraction is, the more the original policy space is abstracted. In Table 1 , we summarize the properties of different policy abstractions, regarding abstraction criteria, abstraction fineness and task relevance. The major conclusion is that there is an inverse relation between abstraction fineness and task relevance. Except for the two extreme cases (f Θ , f 0 ) that are totally task-independent, the policy abstraction becomes more task-relevant as the abstraction criterion concerns more policy features related to the learning task. The distribution-irrelevance abstraction f π concerns the policy behavior in the learning task, defined by action distribution at interested states; meanwhile f π is coarser than f Θ since the same policy behavior can be realized by non-unique policy parameter. Taking one step closer to the task, the influence-irrelevance abstraction f P π cares about the state transition dynamics induced by policy behavior. Obviously, f P π is coarser than f π as different behaviors may induce the same transition distribution. The value-irrelevance abstraction f V π further involves the rewards of long-term dynamics, thus is the most task-relevant and coarsest among the three types of policy abstraction.

Empirical Comparison of Policy Metrics in Gridworld MDPs

To compare these policy abstractions in a quantitative view, we demonstrate how the distances of two policies measured by the corresponding policy metrics differ in several Gridworld MDPs. We use Distinct Policies, Doorway from (Kanervisto et al., 2020) and design a new task, Key Action for simple prototypes of tasks with The distance curves of the two policies measured by d π , d P π , d V π (y-axi), against the stochasticity of the environment (x-axi). d P π is able to distinguish the two policies across all the settings. different features; moreover, we increase the stochasticity of the environment for a better evaluation. In particular, E s∼p(s) D(•, •) is calculated by average the absolute differences over all states. The illustrations and results are shown in Fig. 1 while more results can be found in Appendix C. We observe that the distribution-irrelevance metric d π may fail to show the difference in dynamics and outcome between the two policies, e.g., in Doorway. This is because d π measures the difference in the action distribution itself, i.e., independent of the dynamics as well as the increasing stochasticity of the environment. This issue may be resolved by using a designated distribution that concentrates over key states. Conversely, the value-irrelevance d V π measures the difference in the outcomes of the two policies regardless their differences in action distribution and dynamics, e.g., in Distinct Policies. Another discovery is that d V π quickly degenerates and turns to be not informative as the increase of stochasticity, showing its poor robustness. By contrast, the influence-irrelevance metric d P π is a sweet intermediate point, consistently keeping the ability of distinguishing the two policies across all the environments and stochasticity configurations. In a summary, different policy abstractions and metrics may yield different outcomes for the same two policies and the optimality depends on the specific downstream learning problem concerned. Later, we evaluate these options in representative downstream RL problems in Section 5 and 6 for useful insights.

4. POLICY REPRESENTATION LEARNING APPROACH

The next question concerned in practice is: how can we learn the representation of RL policies (usually modeled by NNs) in a general way? Based on the policy metrics introduced above, we propose a policy representation learning approach by following the principle of Deep Metric Learning.

4.1. LEARNING POLICY REPRESENTATION BY EMBEDDING ALIGNMENT

The policy metrics proposed in previous section measure the quantitative relationship between policies from different perspectives of the policy abstraction criteria. For a unified objective function of learning from different policy metrics, we use the alignment loss, with which the difference between the distances of two policies in the representation space and in the policy metric space is minimized. Concretely, consider a policy representation function f ψ , and the alignment loss can be formalized as, L AL (ψ) = E π,π ′ ∈Π ∥f ψ (π) -f ψ (π ′ )∥ 2 -ηd * (π, π ′ ) 2 , where we consider d * ∈ {d π , d P π , d V π } and η is the weight for scaling. Similar forms of the alignment loss are also adopted in the studies on state representation learning (Zhang et al., 2020) . As we can see, L AL (ψ) consists of two metrics, i.e., the L 2 distance function (∥•∥ 2 ) of two inputs and the policy metric (d * (•, •)). Intuitively, minimizing the alignment loss is to align the two metrics by optimizing the policy representation function f ψ . By this means, we are able to learn different policy representation functions, which maps the ground policy π ∈ Π to the latent embedding χ π = f ψ (π) ∈ X . The embedding preserves the policy features corresponding to the abstraction criteria reflected by the specific policy metric considered. For a practical implementation, the following problems are the estimation of policy metrics d * and the realization of the training for policy representation function f ψ , which are detailed in the next two subsections respectively. In the literature of learning policy representation, representative methods follow the principle of behavior recovering (Grover et al., 2018) and policy contrast (Tang et al., 2020) . To our knowledge, none of prior works take a systematic view of policy abstraction. In Table 4 , we show that prior methods are specific instances of one of our proposed policy abstractions that differ in realization.

4.2. ESTIMATING POLICY METRICS VIA MAXIMUM MEAN DISCREPANCY

Given a tractable metric D and a state distribution p, the policy metrics (i.e., d π , d P π , d V π ) can be calculated exactly if the probability distributions (i.e., π, P π , Z π ) are available. However, this is usually infeasible in practice; instead, in more regular cases, only finite samples of policy interaction are available. Although the empirical distributions can be estimated in simple MDPs where the state-action space is finite (as in Appendix D), unfortunately, approximating the distributions and computing the metrics are non-trivial, especially with high-dimensional continuous state-action space. Therefore, we estimate the policy metrics directly from the samples, bypassing estimating the empirical distributions (i.e., π, P π , Z π ). In particular, we adopt MMD (Gretton et al., 2012; Nguyen-Tang et al., 2021) as the distribution metric, i.e., let D be D MMD . MMD measures the maximum value of the mean discrepancy of two distributions regarding all possible functions in a predefined family. Conventionally, let the class of functions h : X → R be a unit ball in a Reproducing Kernel Hilbert Space (RKHS) H associated with a continuous kernel k(•, •) on X, p, q be two distribution defined on X, x, x ′ and y, y ′ be i.i.d. samples from p and q respectively, the MMD is defined as: D MMD (p, q; H) = sup h∈H:∥h∥ H ≤1 (E x∼p [h (x)] -E y∼q [h (y)]) = ∥µ p -µ q ∥ H = E x,x ′ [k(x, x ′ )] + E y,y ′ [k(y, y ′ )] -2E x,y [k(x, y)] 1 2 , where µ p = X k(x, •)p(dx) is the mean embedding of p into H (Smola et al., 2007) . Thus, MMD can be empirically estimated with samples {x i } N i=1 ∼ p and {y i } M i=1 ∼ q: D 2 MMD ({x i }, {y i }; k) = 1 N 2 i,j k(x i , x j ) + 1 M 2 i,j k(y i , y j ) - 2 N M i,j k(x i , y j ). According to Eq. 3, we can estimate the policy metrics d π , d P π , d V π empirically from the samples {a i }, {s ′ i }, {G i } of different policies respectively, under the sampled states {s i } for the expectation E p(s) . However, it is often impractical to obtain multiple samples under the same state. Thus, we resort to estimating the surrogates, e.g., dP π (π i , π j ) = D (P πi (s, s ′ ), P πj (s, s ′ )) for d P π , where the joint distributions rather than the state-conditioned distributions are measured. We use Gaussian kernel by default, i.e., k (x, x ′ ) = exp - ||x-x ′ || 2 2 2σ 2 . Consequently, the empirical estimates of the policy metrics serve as the self-supervision in Eq. 1.

4.3. REALIZING THE TRAINING OF POLICY REPRESENTATION FUNCTION

With the empirical policy metrics provided in previous section, the training of policy representation is straightforward with a differentiable function f ψ by optimizing the alignment loss (Eq. 1). The realization of policy representation function concerns two aspects: 1) the choice of policy data (or original representation) and 2) the construction of f ψ (i.e., how policy data is encoded). For the first aspect, we focus on parameterized policy π θ (typically by a neural network) and use policy parameter θ as the policy data. One may recall that θ itself can be viewed as the finest representation obtained by policy abstraction f Θ in Table 1 . Such an original representation (i.e., θ) is high-dimensional and highly nonlinear, offering no help in the compression and generalization of policy space. In addition, we are aware that in some cases the policy parameters may be not available, and thus the interaction experiences generated by the policy can be alternative policy data, as used in (Grover et al., 2018; Tang et al., 2020) . Our policy representation learning approach is compatible with such alternatives with the need of possible slight modifications. For the second aspect, we adopt Layer-wise Permutation-invariant Encoder (LPE) (Tang et al., 2020) as the implementation choice of f ψ (see Fig. 5 ), which has demonstrated the effectiveness in encoding conventional policy networks. To be specific, for the parameter θ = {W i , b i } k i=0 of policy π, i.e., the weights and biases of k-layer MLP, 1 the weight W i ∈ R li×li+1 and bias b i ∈ R 1×li+1 (l i is the unit number of the i-layer; l 0 and l k are for the input and output layers) are concatenated (⊕) and transposed, followed by a MLP (f ψ,i ) and a mean-reduce operation (MR), resulting in a layer embedding z i ; Thereafter, the policy embedding is obtained by concatenating the embedding of each layer. Formally, z i = MR f ψ,i ([W i ⊕ b i ] ⊤ ) = 1 l i + 1 li+1 j=1 f ψ,i ([W i ⊕ b i ] ⊤ ) j,• , χ π θ = f ψ (θ) = k i=0 z i (4) Each row of [W i ⊕ b i ] ⊤ , indexing by the subscript j, •, describes a transformation of the i-layer into the next layer. All the rows are fed into f ψ,i separately and are then averaged into z i . In a consequence, the policy embedding serves as the compact representation of the policy network by summarizing the transformations made by the each layer of it. The significant difference between LPE and a straightforward MLP encoder is that, LPE provides structure-aware representation, i.e., both the intra-layer and inter-layer structures are explicitly considered. Intuitively, this alleviates the difficulty of learning representation from the policy network parameters. Other advanced encoder structures are beyond the scope of this work and we leave them as future work. Till now, we can update the parameters of LPE ψ = {ψ i } k i=0 by optimizing Eq. 1 with the policy samples from some given set of policies and the empirical policy metrics estimated accordingly. Depending on the specific choice of policy metric, the policy representation is learned to render the policy abstraction in Table 1 , starting from f Θ and going downwards to the corresponding level.

5. APPLYING POLICY ABSTRACTION TO POLICY OPTIMIZATION

Despite the theoretical understanding of the policy abstraction, we have no idea about how the derived policy metrics behave in different downstream learning problems. To shed some light on this, we evaluate the efficacy of the policy metrics proposed in Sec. 3.2 in policy optimization, including Trust-Region Policy Optimization (TRPO) and Diversity-Guided Evolutionary Strategy (DGES).

Trust-Region Policy Optimization

We adopt TRPO problem as the first test stone for our policy abstractions. Specifically, the objective of TRPO problem is to maximize the policy return while constraining the difference between old and new policies: J TRPO (θ) = E τ ∼Pπ θ [R(τ )], s.t., d * (π θ , π θ old ) ≤ σ, where σ is a threshold. For our experiments, we consider the policy metrics d * ∈ {d π , d P π , d V π }. In another word, the learning agent checks if the difference measured by the policy metrics are larger than σ for each policy update. In this experiment, the original TRPO (Schulman et al., 2015) is generalized to incorporate different alternative metrics for the trust-region constraint. Thus, we can evaluate the efficacy of the different trust regions provided by our proposed policy metrics, shedding some light on what policy features we care the most in TRPO. We adopt a Gridworld environment where the agent can move to one of N directions at each grid and only one direction yields high reward (Kanervisto et al., 2020) . The results are shown in Fig. 2(a) . We observe that all our TRPO variants (i.e., TRPO-f * ) outperform Vanilla-PO (i.e., no trust-region constraint used), demonstrating the effectiveness of our policy abstractions. Moreover, TRPO-f π outperforms the others. This is because f π follows the abstraction criterion regarding action distribution, thus pertains to the essence of TRPO. By contrast, f P π and f V π utilize coarser abstraction which does not hold the features of action distribution. In addition, we demonstrate the superiority of our policy abstractions when compared with existing related methods in Appendix E.1. Diversity-Guided Evolution Strategy Next, we adopt DGES problem as the second test stone for our policy abstractions. Formally, the objective of DGES problem is to maximize the policy return of the current policy π θ and maximize its policy difference to the ancestor policy π: J DGES (θ) = E τ ∼Pπ θ [R(τ )] + β N p=1 d * (π θ , π) , where β ≥ 0 is the weight. Similarly, we consider the policy metrics d * ∈ {d π , d P π , d V π }. Here, the choices of policy metrics realize the population diversity in different ways. We aim at exploring the diversity concerning which policy feature is the most effective in DGES. To explore this, we leverage the Point environment with deceptive rewards (Pacchiano et al., 2020) . The results are reported in Fig. 2(b ). In comparison to Vanilla-ES (i.e., β = 0), optimizing policy diversity (i.e., β > 0) based on our policy metrics (i.e., DGES-f * ) does help exploration and thus leads to better performance. In particular, DGES-f P π performs the best. Since ES optimizes policy in a gradient-free fashion, the evolution process concerns only policy return. Therefore, the distribution-irrelevance abstraction f π (i.e., the winner in the TRPO experiment) can be redundant since multiple action distributions may have the same outcome (i.e., influence and value). For the value-irrelevance abstraction f V π , it turns to be too fine to contain the features of policy behavior (i.e., action distribution and influence). Therefore, the influence-irrelevance abstraction f P π serves as a sweet point. Furthermore, we provide additional comparative evaluation in Appendix E.2.

6. APPLYING POLICY ABSTRACTION TO OFF-POLICY EVALUATION

After the investigation in policy optimization, now we move to policy evaluation. Typical OPE (Fu et al., 2021; Harb et al., 2020) focuses on using offline data to evaluate unseen policies. Likewise, we are interested in studying the value generalization performance on unseen policies of the representations learned regarding different policy abstractions. The appealing characteristic of policy representation in value generalization has been studied in (Tang et al., 2020) , where a Policy-extended Value Function Approximator (PeVFA,V(χ π )) takes as input the policy representation χ π approximates the values of multiple policies and offers implicit value generalization among the policy representation space. For policy data collection, we run PPO (Schulman et al., 2017b) in OpenAI Gym continuous control tasks: InvertedDoublePendulum-v2 (IDP-v2) and LunarLanderContinuous-v2 (LLC-v2) (Brockman et al., 2016) . By collecting the policies at intervals during the learning process, we build an offline policy set, based on which we train our policy representations and a PeVFA V(χ π ). For concrete problem settings, we establish both weak and strong generalization OPE scenarios which differs at the difficulty of evaluating the unseen policies. For the weak generalization scenario (easy), we sample training data uniformly from the whole band of the policy set. For the strong generalization scenario (hard), we separate the policy set and use the low-performance policies for the training data, with the rest taken as the unseen policies to evaluate. For both the settings, the ratio of sampling and separation is set to be 20%, 40%, or 80%. We report the results of the ratio 20% in Table 2 and leave the results of other ratios in Appendix G. For evaluation protocols, we report the evaluation (testing) error of unseen policies (T-error) and the generalization gap (G-gap), i.e., the difference between training and testing error. We denote different policy representations by their underlying policy abstraction (e.g., f π ) correspondingly. Complete details can be found in Appendix F. Weak Generalization Scenario in OPE First, we study the empirical comparison in the weak generalization scenario. Table 2 reports the results of value generalization for the policy representations learned based on corresponding policy abstractions. To be specific, the f Θ denotes directly using policy parameters θ as policy representations (i.e., no representation training). For our proposed policy abstractions f π , f P π , f V π , we learn the representations for them according to Eq. 1 based on the LPE and MMD estimation (Sec. 4.3). To further complete the comparison, we include two additional representations f RE and f EL : f RE uses a randomly initialized LPE with no further training while f EL uses the LPE trained by the end-to-end OPE loss (see Appendix F.2) respectively. Note that f EL can be viewed as a variant of f V π since it also learns from values but does not optimize the alignment loss. Besides, we also include the representation (f CL ) learned by unsupervised contrastive learning based on InfoNCE loss (van den Oord et al., 2018) , as proposed in (Tang et al., 2020) . From the Table 2 (Weak Generalization), we can observe that f π , f P π , f V π outperforms f Θ , f RE , and f EL in both IDP-v2 and LLC-v2. This demonstrates the effectiveness and superiority of our proposed representations in value function approximation and generalization. f EL is significantly better than the f Θ and f RE , indicating the advantages of LPE structure and training. We can observe that contrastive policy representation f CL performs poorly. We postulate that with less training data available at the 20% sampling ratio, the f CL with emphasis on policy instance-level comparison suffers from higher evaluation error and generalization gap. The superiority of f V π compared to f EL from the Table 2 demonstrates the effectiveness of alignment loss. This is because although both f V π , f EL learn policy representation from the information of policy value, naive end-to-end training is less effective than alignment optimization which establishes the representation space based on the policy metrics. In general, the value generalization results among our abstractions f π , f P π , f V π do not differ much. This is mainly because in the weak generalization setting, the unseen policies obey the same distribution as the training policies, thus posing less difficulty of value generalization. Strong Generalization Scenario in OPE Now we move to the study in the strong generalization scenario and similarly the results are reported in Table 2 (Strong Generalization). Compared to the weak generalization scenario, the overall T-error and G-gap are significantly higher in the strong generalization scenario. This is reasonable because there is a larger performance difference between the training and unseen policies. In other words, the unseen policies belong to out-of-distribution data. The f V π obtains the lowest evaluation error on the two environments, which indicates the valueirrelevance abstraction with higher task relevance may be best suited for the strong generalization setting. For the explanation, since the objective of OPE lies at the value function approximation and generalization, we consider that the value-irrelevance principle of f V π is consistent to the objective and thus fits naturally. With only low-performance policies for the training data at the 20% sampling ratio (hardest), the results of other abstractions including our proposed f π and f P π on the task are poor. The main reason may be that under the strong generalization setting, there is a large data-shift between the training and unseen policies. f π and f P π fails to learn a policy abstraction with generalization ability in the absence of diversity policies. Nevertheless, from the Table9, 10, as the sampling ratio increase and training policies become more diverse, the advantage of f π and f P π over other policy abstractions gradually emerge. Moreover, in the hardest case, f P π is better than f π , which shows that the influence-irrelevance policy abstraction may be a general policy abstraction option. For the other baselines, f Θ still shows few competition. For f RE , f EL , f CL , they falls behind f V π while slightly outperforms f P π and f π in Table 2 . Such slight advantages no long holds in the settings of higher sampling ratios (i.e., 40% in Table 9 and 80% in Table 10 ). Unlike weak generalization scenario, f CL is not so bad in strong generalization scenario. The main reason is that encountering hard policy evaluation tasks (Strong Generalization), other methods suffer from performance degradation and are no longer superior to contrastive learning. In contrast, contrastive learning based on policy instance-level comparison maintains a relatively good result. Other Experiments In addition to Table 2 , we provide more results under different settings of data amount in Appendix G.1, for both the weak and strong generalization scenarios. Other comprehensive studies (e.g., extrapolation behaviors, visualization) can be found in Appendix G.2,G.3.

7. CONCLUSION & LIMITATIONS

In this work, we introduce a unified policy abstraction theory, including three major types of policy abstraction, and corresponding policy metrics derived from the abstraction, as well as the analysis of their properties. We further propose a policy representation learning approach based on deep metric learning. We empirically evaluate the efficacy of different policy abstraction in both policy optimization (i.e., TRPO, DGES) and off-policy evaluation (OPE). For limitations and future work, we only provide the theory on the fineness of policy abstraction, while provide no theory on the optimality, although the optimality ought to depend on the downstream problem considered. For policy representation learning, the alignment loss and MMD metric are not the only choices; besides, other representation learning principles (Bardes et al., 2021) are potential.

A MORE CONTENT ON POLICY ABSTRACTION THEORY

A.1 PROOF OF THEOREM 3.1 Proof. We prove the partial ordering (⪰) of the Theorem 3.1 one by one in the following. ❶ f π ⪰ f P π . Given an MDP M , two policies π i , π j ∈ Π. We define P as the transition probability and use P π (s ′ |s) = E a∼π(•|s) P (s ′ |s, a) to denote the distribution of next state s ′ when performing policy π at state s, respectively. Then, we have P πi (s ′ |s) = E a∼πi(•|s) P (s ′ |s, a), P πj (s ′ |s) = E a∼πj (•|s) P (s ′ |s, a). (5) If f π (π i ) = f π (π j ) , with the Definition 3, we have ∀s ∈ S, ∀a ∈ A, π i (a|s) = π j (a|s). Combined with Eq. 5, we derive, ∀s, s ′ ∈ S, P πi (s ′ |s) = P πj (s ′ |s). (6) Recall the definition of f P π , we can obtain f π ⪰ f P π . ❷ f P π ⪰ f V π . Given an MDP M , two policies π i , π j ∈ Π, a reward function R. Start with the definition of the value function V π (•), we consider the two cases: Case 1. The reward function R depends only on state s ∈ S, we derive the value function: ∀s ∈ S, V π (s) = R (s) + s ′ P π (s ′ | s) V π (s ′ ). If f P π (π i ) = f P π (π j ), with the Definition 3, we have ∀s, s ′ ∈ S, P πi (s ′ |s) = P πj (s ′ |s). When the Eq. 7 holds, we have: ∀s ∈ S, V πi (s) = V πj (s). (8) Recall the definition of f V π , we can obtain f P π ⪰ f V π . Case 2. The reward function R depends on both state s ∈ S and action a ∈ A, we derive the value function: V π (s) = a π (a | s) R (s, a) + s ′ P s ′ | s, a V π s ′ . ( ) If f P π (π i ) = f P π (π j ), with the Definition 3, we have ∀s, s ′ ∈ S, P πi (s ′ | s) = P πj (s ′ | s). Unlike the Eq.7, π i may not be equivalent to π j regarding the abstraction criterion of value irrelevance. Therefore, the partial ordering (f P π ⪰ f V π ) is not obtained under Case 2. Nevertheless, Case 1 is fairly standard across a broad set of real world RL problems. Remark A.1 (More discussions on R(s, a) and R(s)). For Case 2 discussed above, i.e., the reward function R depends on both state s ∈ S and action a ∈ A, we can further separate it into two categories according to our knowledge on real-world decision-making problems: • In our first category, the dependence of R on state and action is due to the consequence of s, a in leading the decision system into the specific new state s ′ , i.e., R(s, a) = s ′ P (s ′ |s, a)U (s ′ ) where U (s ′ ) is the utility of s ′ (we adopt the expectation form for the convenience of discussion). In the cases that fall into this category, it can be easy to re-define the reward function by the utility function, i.e., R(s, a) = U (s) for any a ∈ A. With such a conversion, we can also obtain f P π ⪰ f V π in these cases. • Our second category covers the exclusive cases of the first category. For example, consider an environment, where two actions a 1 , a 2 lead to the same new state s ′ from state s but gain different rewards. In such cases, the reward is independent on the dynamics caused by s, a. We consider that such cases are minority in the ones of interest.

A.2 OTHER POLICY ABSTRACTIONS

In this paper, we also propose three other policy abstractions in Definition 6. Before introducing the definitions of additional policy abstraction, we make some necessary notations. We use d π,k (•) to denote the distribution of state when policy π performs k steps from initial states regarding the initial state distribution ρ 0 . The discounted state visitation distribution from initial states regarding ρ 0 is defined as  d π (s ′ ) = (1 -γ) ∞ t=0 γ t d π, (s ′ ) = (1 -γ) ∞ t=0 γ t d π,t s (s ′ ) for any s ′ ∈ S. Definition 6. Given an MDP ⟨S, A, P, R, γ⟩ and a ground policy space Π, for any two policies π i , π j ∈ Π, we define four more policy abstractions as follows: 1. An influence-irrelevance abstraction (f d π s ) is such that for all s, s ′ ∈ S, f d π s (π i ) = f d π s (π j ) implies that d πi s (s ′ ) = d πj s (s ′ ). 2. An influence-irrelevance abstraction (f d π ) is such that for all s ∈ S, f d π (π i ) = f d π (π j ) implies that d πi (s) = d πj (s). 3. A value-irrelevance abstraction (f J π ) is such that f J π (π i ) = f J π (π j ) implies that E s0∼ρ0 [V πi (s 0 )] = E s0∼ρ0 [V πj (s 0 )]. 4. A value-irrelevance abstraction (f Z π ) is such that for all s ∈ S, f Z π (π i ) = f Z π (π j ) implies that Z πi (s) = Z πj (s). Similarly, we prove how the newly proposed policy abstractions are related to other abstractions we introduce in the main body of the paper. Theorem A.2 (Partial Ordering (⪰)). Under the Definition 3, 5 and 6, we have ❶ f π ⪰ f P π ⪰ f d π s ⪰ f d π ⪰ f J π ; ❷ f π ⪰ f P π ⪰ f d π s ⪰ f V π ⪰ f J π . An illustration is provided below: 𝑓 π ⪰ 𝑓 𝑃 π ⪰ 𝑓 𝑑 𝑠 π ⪰ 𝑓 𝑑 π 𝑓 𝑉 π ⪰ 𝑓 𝐽 π ⪰ ⪰ Proof. The partial ordering (⪰) satisfies transitivity, thus, let us prove the TheoremA.2 one by one in the following. ❶f P π ⪰ f d π s . Given an MDP M , two policies π i , π j ∈ Π. If f P π (π i ) = f P π (π j ), with the Definition 3, we have ∀s, s ′ ∈ S, P πi (s ′ | s) = P πj (s ′ | s). Since the initial state distribution are the same, and π i , π j follow two identical Markov chains, we derive f P π ⪰ f d π s . ❷f d π s ⪰ f d π . Given an MDP, two policies π i , π j ∈ Π. If f d π s (π i ) = f d π s (π j ) , with the Definition 6, we have ∀s, s ′ ∈ S, d πi s (s ′ ) = d πj s (s ′ ). Furthermore, we derive, ∀s 0 ∈ ρ 0 , s ′ ∈ S, d πi s0 (s ′ ) = d πj s0 (s ′ ). (10) Thus, f d π s ⪰ f d π . ❸f d π ⪰ f J π . Given an MDP, two policies π i , π j ∈ Π. If f d π (π i ) = f d π (π j ) , with the Definition 6, we have ∀s ∈ S, d πi (s) = d πj (s). When the reward function R depends only on state s ∈ S, we have, J(π i ) = (1 -γ) -1 E s∈d π i [R(s)], J(π j ) = (1 -γ) -1 E s∈d π j [R(s)]. Thus, f d π ⪰ f J π . ❹f d π s ⪰ f V π . Given an MDP, two policies π i , π j ∈ Π. If f d π s (π i ) = f d π s (π j ), with the Definition 6, we have ∀s, s ′ ∈ S, d πi s (s ′ ) = d πj s (s ′ ). When the reward function R depends only on state s ′ ∈ S, we have, ∀s ∈ S, V πi (s) = (1 -γ) -1 E s ′ ∈d π i s [R(s ′ )], V πj (s) = (1 -γ) -1 E s ′ ∈d π j s [R(s ′ )]. Thus, f d π s ⪰ f V π . ❺ f V π ⪰ f J π . Given an MDP, two policies π i , π j ∈ Π. If f V π (π i ) = f V π (π j ) , with the Definition 3, we have ∀s ∈ S, V πi (s) = V πj (s). Furthermore, we derive, ∀s 0 ∈ ρ 0 , V πi (s 0 ) = V πj (s 0 ), E s0∼ρ0 [V πi (s 0 )] = E s0∼ρ0 [V πj (s 0 )] . (13) Thus, f V π ⪰ f J π . In particular, when the reward function R depends on both state s ∈ S and action a ∈ A, π i may not be equivalent to π j . The f d π ⪰ f J π and f d π s ⪰ f V π are not hold. Connecting Definition 3,6 and Theorem A.2, we summarize the properties of different policy abstraction in Table 3 . Table 3 : Properties of different policy abstraction, including the additional ones introduced in A.2. We use (+) and (-) to denote the higher or lower degree in the same level.

Abstraction

Abstraction Criterion (for π 1 , π 2 , ∀s, s ′ , a ∈ S 2 × A) Fineness Task Relevance f Θ Policy Parameter Equivalence (θ 1 = θ 2 ) Highest None f π Action Distribution Equivalence (π i (a | s) = π j (a | s)) High Low f P π Dynamics Influence Equivalence (P πi (s ′ |s) = P πj (s ′ |s)) Middle (+) Middle (-) f d π s Dynamics Influence Equivalence (d πi s (s ′ ) = d πj s (s ′ )) Middle Middle f d π Dynamics Influence Equivalence (d πi (s) = d πj (s)) Middle (-) Middle (+) f V π Value Function Equivalence (V πi (s) = V πj (s)) Low High f J π Value Function Equivalence (E s0∼ρ0 [V πi (s 0 )] = E s0∼ρ0 [V πj (s 0 )]) Low (-) High (+) f 0 Triviality (taking all policies as the same) Lowest None Table 4 : A taxonomy of prior policy abstractions under our policy abstraction theory. Prior Policy Representation Abstraction Criterion Related Policy Abstraction Vectorized Network Parameters (Faccio et al., 2020) Policy Parameter Equivalence f Θ Contrastive OPR/SPR (Tang et al., 2020) Policy Instance Contrast f CL Policy Recovery OPR/SPR (Tang et al., 2020) Action Distribution Prediction f π (Def. 3) End-to-End OPR/SPR (Tang et al., 2020) Value Function Prediction f V π (Def. 3) Network Fingerprint (Harb et al., 2020) Action Distribution Similarity f π (Def. 3) Behavior Embedding (State) (Pacchiano et al., 2020) Final State Similarity f d π (Def. 6) Behavior Embedding (Action) (Pacchiano et al., 2020) Action Distribution Similarity f π (Def. 3) Behavior Embedding (Reward) (Pacchiano et al., 2020) Return Similarity f J π (Def. 6) Generative Representation (Grover et al., 2018) Action Distribution Prediction f π (Def. 3) Discriminative Representation (Grover et al., 2018) Policy Instance Contrast f CL α-compression (Mutti et al., 2021) Action Distribution & Dynamics Influence Similarity f π (Def. 3) & f d π (Def. 6) Policy Supervector (Kanervisto et al., 2020) Dynamics Influence Similarity f d π (Def. 6) B TAXONOMY OF PRIOR POLICY ABSTRACTIONS UNDER OUR THEORY Closely related to our work, (Harb et al., 2020) adopts policy fingerprints as differentiable policy representation obtained by concatenating the distribution of actions of policy in a set of key states. Obviously, it's more concerned with policy distribution information. (Kanervisto et al., 2020 ) make use of Gaussian mixture models to learn policy supervectors (policy representation), which characterizes agents' behaviour by the distribution of states they visit. Pacchiano et al. (2020) define a Behavioral Embedding Map (BEM) with different implementation options and propose using Wasserstain distance in the latent space induced by BEM to measure the similarity among policies. Like us, (Tang et al., 2020 ) also employs neural network-based policy parameters as policy data, but it utilize policy distribution reconstruction and contrast learning principles to learn policy representation. The principle of contrastive learning itself is a general learning principle irrelevant to the learning task. Therefore, the contrast of instantiation makes its abstraction level low. Compared with these two works, (Faccio et al., 2020) is a simpler and more direct way of constructing policy representation, because it directly compress the policy parameters based on neural network into vectors and regards it as a form of policy representation. In fact, the policy space may be redundant due to the existence of some policies that induce similar behaviors. It provides feasibility for policy abstraction and policy space compression (Mutti et al., 2021) . For instance, Mutti et al. (2021) study policy space compression by formulating a Set Covering problem with Rényi Divergence of discounted state-action distribution of policies as a metric. In addition, in a multi-agent system, (Grover et al., 2018 ) also converts opponent modeling into an opponent policy representation learning problem and attempts to distinguish opponents by their policy representations. In Table 4 , we categorize these works under our abstractions.

C CASE ILLUSTRATION OF POLICY ABSTRACTION

To compare these policy abstractions in a quantitative view, we demonstrate how the distances of two policies (blue and greed in Fig. 3 ) measured by the corresponding policy metrics differ in several Gridworld MDPs. We borrow Distinct Policies, Doorway from (Kanervisto et al., 2020) and design a new environment, Key Action for simple prototypes of environments with different features. All three environments are 5×5 Gridworld, with the starting state and goal in the lower left and upper right grid, respectively. The discrete action space is {up, down, left, right}. We obtain the estimated value v(•) of each state by estimating 1 -[expected number of step to goal when starting from a given state]. About the Key Action, the agent obtain a positive reward only if it chooses right at the key state (i.e., marked by the red box in Fig. 3 ). Every agent rollouts 10k episodes for comparing different policy metrics. To be specific, a policy is represented by tensor π ∈ R 5×5×4 and the distribution-irrelevance metric is defined as d π (π, π) = E π i,j,k ∈π,π i,j,k ∈π [|π i,j,k -πi,j,k |]. Similarly, the state transition dynamics induced by policy is represented by tensor P π ∈ R 5×5×5 and the influence-irrelevance metric is defined as d P π (π, π) = E P π i,j,k ∈P π ,P π i,j,k ∈P π P π i,j,k -P π i,j,k . Different from the d P π samplingbased estimation, we leverage the dynamic programming to learn the value function V π ∈ R 5×5 . Thus, the value-irrelevance metric is defined as d V π (π, π) = E V π i,j ∈V π ,V π i,j ∈V π V π i,j -V π i,j . In addition, we also compare the two other policy abstractions proposed in Appendix A.2. Among them, we estimate the discounted state visitation distribution d π (s) via dividing visits to a state s by a total number of interactions (i.e., d d π (π, π) = E d π i,j ∈d π ,d π i,j ∈d π d π i,j -d π i,j ). Simply and naturally, based on the definition of f J π in Definition 6, we have 3 shows the illustrations and the results of the five policy metrics. The d d π and d J π are newly added results compared to the original paper. We observe that the d d π cannot indicate the difference in dynamics between the two policies in Key Action. Like the d V π , the d J π measures the difference in the outcomes of the two policies but shows the poor robustness as the increase of stochasticity. d J π (π, π) = |E s0∼ρ0 [V π (s 0 )] - E s0∼ρ0 V π (s 0 ) |. Fig.

D ADDITIONAL DISCUSSIONS D.1 ESTIMATING POLICY METRICS VIA JEFFREYS DIVERGENCE

In simple MDPs where the state-action space is finite, we calculate the frequency distribution (i.e., π, P π , Z π ) using sufficient samples as an estimate of the exact probability distribution (i.e., π, P π , Z π ). Then we use the Jeffreys Divergence (Jeffreys, 1946) between empirical distributions as policy metrics (i.e., d π (•, •), d P π (•, •), and d V π (•, •)), d π (π i , π j ) ≈ D KL ( π i ∥ π j ) + D KL ( π j ∥ π i ) , d P π (π i , π j ) ≈ D KL P πi ∥ P πj + D KL P πj ∥ P πi , d V π (π i , π j ) ≈ D KL Z πi ∥ Z πj + D KL Z πj ∥ Z πi .

E DETAILS ON APPLYING POLICY ABSTRACTIONS TO POLICY OPTIMIZATION

The policy optimization experiments are run on a single NVIDIA GeForce GTX 2080Ti GPU. Our codes are implemented with Python 3.7.13 and Torch 1.11.0. Table 5 : Trust-region constraint value choices of different Methods. We use '-' to denote the 'not applicable' situation.

Methods

Trust-region threshold (σ) Vanilla-PO -Max TV {0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5} Gaussian {0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 20.0} Supervector {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5} TRPO-f π {0.05, 0.1, 0.2, 0.3, 0.4, 0.5} TRPO-f P π {0.05, 0.1, 0.2, 0.3, 0.4, 0.5} TRPO-f V π {0.05, 0.1, 0.5, 1.0, 2.0, 5.0}

E.1 TRUST-REGION POLICY OPTIMIZATION

In this experiment, we adopt a N -dimensional Gridworld MDP provided by (Kanervisto et al., 2020) , where the position coordinates of the agent form the state space. At each state, the agent choose one of N actions corresponding N directions. In the experiments, we set N =5 and only one action of N action move the agent forward. The agent is rewarded with +1 reward for taking the correct action and the maximum length of each episode is 25. The agent uses a two-layer network with 16 units and tanh-activations each. For the policy optimization, we rewrite the implementation of the policy metrics in the code of (Kanervisto et al., 2020) . The learning agent checks if the difference between old and new policies measured by the policy metrics are larger than trust-region threshold σ. If the threshold is exceeded, we stop updating the policy with current samples and move on to collect new samples for the next policy update. The sample size is 4096 and the policy is updated for 100 mini-batches of 64 items over the collected samples, or until constraint prevents updates. We compare with existing related methods, i.e.,Vanilla-PO, Total Variation Divergence(Max TV), Gaussian and Supervector, and follow hyperparameters in (Kanervisto et al., 2020) . Among them, Vanilla-PO means updating policy 100 mini-batches with no trust-region constraint; the Total Variation Divergence measures the maximum amount of how much probability of taking any single action (in any state) can change and is defined as d (π i , π j ) = max s 1 2 a |π i (a | s) -π j (a | s)|; the Gaussian refers to fitting a multivariate, diagonal Gaussian on collected states from 5 trajectories for old and new policy, respectively and measures Jeffreys Divergence between old and new policy; the Supervector calculates the upper bound of KL-divergence between old and new policy supervectors, which are obtained via fitting a four-component UBM on collected states from 5 trajectories for corresponding policy. To ensure a fair comparison, we estimate the proposed policy metrics d π , d P π , d V π on collected 5 trajectories from old and new policy. All approaches repeat ten times for same environment and the search space of trust-region threshold for them are shown in Table 5 . Fig. 4 (a) reports the empirical results for different methods using the corresponding optimal thresholds which are selected based on the largest AUC (i.e., Area Under The Curve). Overall, the TRPO-f π and Max TV are better than others, mainly because they concern the action distribution at interested states. The Gaussian and Supervector are concerned with the state visitation distribution induced by the policy, which are specific instances of our proposed dynamics influence abstraction. Thus, as with f P π and f V π , they ignore the differences of policies in action distribution.

E.2 DIVERSITY-GUIDED EVOLUTION STRATEGY

In the experiments, we adopt the Point environment with deceptive rewards using MuJoCo simulator. The deception comes from a barrier, which misleads the agent to move directly forward leading to a suboptimal policy. For the Point environment, the state and action are represented by a 6 dimensional vector and 2 dimensional vector, respectively. At each timestep, the agent is penalized for its distance from a given goal, and we limit episode length to 50 steps. To reduce the number of policy parameters and thus the time cost, we use a Toeplitz policy (Choromanski et al., 2018) often used for ES algorithms. For DGES, we set the population size to 50 (i.e., N = 50), and estimate the proposed policy metrics d π , d P π , d V π on one episode from the currently policy and the ancestor policy. We compare with state-of-the art method on the Point environment for exploration: BGES (Pacchiano et al., 2020) , which uses the terminal state as a policy representation and compare two policies by Wasserstein distances. The hyperparameters associated with the BGES remain consistent with the original paper. From the Fig. 4 (b), we observe our approach DGES-f P π is comparable to the BGES (Pacchiano et al., 2020) and performs better than the DGES-f π and DGES-f V π . It illustrates the influence-irrelevance metric d P π is a sweet intermediate point, keeping the ability of distinguishing the two policies.

F DETAILS ON APPLYING POLICY ABSTRACTIONS TO OFF-POLICY EVALUATION

The off-policy evaluation experiments are run on a single NVIDIA GeForce GTX 2080Ti GPU. Our codes are implemented with Python 3.6.13 and tensorflow 2.2.0. F.1 COLLECT POLICY DATA FOR OFF-POLICY EVALUATION Benchmark Environments. We conduct our experiments on OpenAI Gym continuous control tasks and detailed description is below. • InvertedDoublePendulum-v2 (IDP-v2): Balance a pole on a pole on a cart. The agent gets a reward for every timestep that the pendulum has not fallen off the cart. • LunarLanderContinuous-v2 (LLC-v2): Navigate a lander to its landing pad. Landing pad is always at coordinates (0,0). Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Policy Data Collection. We collect policies in two stages. The first stage, we train the PPO (Schulman et al., 2017a) agent for 1M steps and store a checkpoint each 10 updates of the policy. Then, we evaluate each checkpoint by 10 rollouts. To be specific, for the ppo agent, the update frequency of critic is 5 per epoch on two environments, and the update frequency of actor is 2,5,10 per epoch on IDP-v2 and 5,10,20 per epoch on LLC-v2, respectively. The ppo agent with each update frequency is trained using 20 random seeds. The second stage, we divide 50 intervals I τ (τ = 1, • • • , 50) depending on the performance range of collected policy set I. Naturally, each interval contains a certain number of policies. Then, we randomly select K(K = 40) policies from each interval. Finally, we collect B(B = 200) trajectories of data per selected policy. Further, for each policy, we calculate the average return Ḡ0 over the 200 trajectories, and randomly sample state-action pairs {(s j , a j )} j=m j=1 , state-next state pairs {(s j , s ′ j )} j=m j=1 , state-value pairs {(s j , G j )} j=m j=1 . The pseudo-code for policy data collection is presented in Algorithm1. In the experiment, we construct a policy dataset for IDP-v2 and LLC-v2, respectively, and each policy dataset contains 2000 policies. Fig. 6 shows the histograms of the two policy datasets, where the  L AL (ψ) = E πi,πj ∈B (∥f ψ (θ πi ) -f ψ (θ πj )∥ 2 -η • d * (π i , π j )) 2 , d * ∈ {d π , d P π , d V π } 6: Calculate the evaluation loss L EL 7: L EL (β, ψ) = E πi∈B (V β (f ψ (θ πi )) -Ḡ0 ) 8: Update parameters ψ, β to minimize L AL and L EL 9: end for Output: Parameters ψ of the policy representation function f ψ ; parameters β the policy evaluation function V β x-axis is the average return of policies and the y-axis is the number of policies. As shown in the Fig. 6 , the two policy datasets basically satisfy the diversity and balance of policies, which allows us to design a series of Off-Policy Evaluation (OPE) scenarios.

F.2 OFF-POLICY EVALUATION

With regard to off-policy evaluation tasks, in this work, we adopt a policy evaluation function (Harb et al., 2020) representing the expected return of a policy. Naturally, the policy evaluation function can be trained based on sampled training policies and then generalize to unseen policies. The policy evaluation loss L EL can be formalized as, L EL (β, ψ) = E π∈D train (V β (f ψ (θ π )) -Ḡ0 ) 2 , ( ) where V β (•) parameterized by β denotes the policy evaluation function. f ψ (•) parameterized by ψ is the policy representation function which takes the policy parameters θ π as input. D train denotes the training policy dataset sampled from the collected policy dataset D. Below, we provide the implementation details and pseudo-code (Algorithm 2) for off-policy evaluation. Network Structure. The Fig. 5 illustrates the policy encoder (LPE) we used. The Table 6 shows the structure of the policy network and the policy evaluation network. As shown in Table 6 , we use a two-layer feed-forward neural network of 32 and 32 hidden units with ReLU activation (except for the output layer) for the policy network. In this paper, we learn a policy representation (pr) for a RL policy from its network parameters. Like the policy network, the policy evaluation network also uses a two-layer feed-forward neural network, where the input to the policy evaluation network is the policy representation and the output is the predicted value of the expected return for the policy. Hyperparameter. In the Table 7 , we discuss the policy representation dimension, the number of hidden units of the policy evaluation network, and the batch size for training policy evaluation network. For a fair comparison, we perform a 10-fold cross-validation weak generalization experiment with 20% sampling ratio based on the f RE and IDP-v2 to search for best hyperparameters. For all methods and environments, the batch size, the number of hidden units and the policy representation dimension are 256, 128 and 256, respectively. Additionally, we report the common hyperparamters of off-policy evaluation experiments in Table 8 .

G.1 COMPLETE EXPERIMENTAL RESULTS OF DIFFERENT GENERALIZATION SCENARIOS

In the paper, we design two OPE generalization settings, namely weak generalization and strong generalization. For each of the generalization settings, we further construct generalization scenarios with different sampling ratios. As with the Table 2 , the results of the generalization experiments on the 40% and 80% sampling ratio are presented in the Table 9 and the Table 10 , respectively. Not surprisingly, the performance of all methods improves as the amount of training data increases. In Visualization Results. Overall, the policy representations in Fig. 11 are more clearly distinguishable and all of our proposed methods show some degree of discriminability. In contrast, in the strong generalization experiments, the better performing policies are less discriminative as only 20% of the poor performing policies are used as training data. Nevertheless, our proposed value-irrelevance policy abstraction still learns a more compact and discriminative representation. The same results can also be found in the 14) . Differently, f RE has already obtained a better discriminability on that environment. We consider two possible reasons for this result, one being due to the variability of the different environments, and the other possibly being the superiority of the policy encoder network (LPE) we adopt. To verify the latter, we show the visualization of the unabstracted policy in fig. 15 . From the results, it is clear that our policy encoder network is indeed able to perform an effective compression of the policy information and obtain a compact policy representation space. 



The activation function is not considered since the structure is fixed for policies in convention RL setting. In principle, LPE can be generalized to tailor other advanced network structure.



Figure 1: Policy comparison with different policy metrics in Gridworld. Top Panel: The illustration of three Gridworld MDPs and two deterministic policies (blue and green). Bottom Panel: The distance curves of the two policies measured by d π , d P π , d V π (y-axi), against the stochasticity of the environment (x-axi). d P π is able to distinguish the two policies across all the settings.

Figure 2: Performance of different policy abstractions in: (a) Trust-Region Policy Optimization (TRPO) for a Gridworld environment; and (b) Diversity-Guided Evolution Strategy (DGES) for a Point environment. Results are the mean and half a standard deviation (shaded) over 10 and 5 trials for TRPO and DGES respectively.

Figure 3: Policy comparison with different policy metrics in Gridworld. Top Panel: The illustration of three Gridworld MDPs and two deterministic policies (blue and green). Bottom Panel: The distance curves of the two policies measured by d π , d P π , d V π , d d π , d J π (y-axi), against the stochasticity of the environment (x-axi).

Figure 4: Performance of different methods in: (a) Trust-Region Policy Optimization (TRPO); and (b) Diversity-Guided Evolution Strategy (DGES). Results are the mean and half a standard deviation (shaded) over 10 and 5 trials for TRPO and DGES respectively.

Collect Policy Data for Off-policy Evaluation Input: Policy set I = {I 1 , I 2 , • • • , I τ } collected based on PPO algorithm; the number τ of policy performance intervals; the number K of policies selected per interval; the rollouts B and policy dataset D ← ∅ 1: for τ = 1,2,• • Carlo return G 0 using policy π k 7:Collect data with π k in real environments:{(s, a, r, s ′ )} b 8: R ← R ∪ {G 0 }, T ← T ∪ {(s, a, r, s ′ )} b ← D ∪ {θ π k , R, T }, θ π krepresents the parameters of policy π k Off-policy Evaluation Input: Training policy dataset D train ; the data of the policy π consisting of policy parameters θ π , state-action pairs {(s j , a j )} j=m j=1 , state-next state pairs {(s j , s ′ j )} j=m j=1 and state-value pairs {(s j , G j )} j=m j=1 ; the expected return Ḡ0 Initialize: The policy evaluation function V β with parameters β Initialize: The policy representation function f ψ with parameters ψ 1: Calculate d π , d P π , and d V π for all (π i , π j ) ∈ D train ▷ see Section 4.2 2: for iteration t = 0,1,2,• • • do 3: Sample a mini-batch N of policy data B from D train 4: Calculate the alignment loss L AL 5:

Figure5: An illustration for Layer-wise Permutation-invariant Encoder (LPE). l i denotes the numbers of hidden units for the hidden layers. The main idea is to perform permutation-invariant transformation for inner-layer weights and biases for each layer first and then concatenate encoding of layers.

Figure 6: Histograms of policy dataset collected from the (a) InvertedDoublePendulum-v2 and (b)LunarLanderContinuous-v2 environments, where the x-axis is the average return of policies and the y-axis is the number of policies.

fπ (f) fP π (g) fV π

Figure 7: Results of policy evaluation with 20% sampling ratio (Weak Generalization, IDP-v2). The results show the true and predicted values of the train policies (red dots) and target policies (blue dots) at the minimum testing error for policy abstraction methods.

Figure 8: Results of policy evaluation with 20% sampling ratio (Strong Generalization, IDP-v2). The results show the true and predicted values of the train policies (red dots) and target policies (blue dots) at the minimum testing error for policy abstraction methods.

fπ (f) fP π (g) fV π

Figure 9: Results of policy evaluation with 20% sampling ratio (Weak Generalization, LLC-v2). The results show the true and predicted values of the train policies (red dots) and target policies (blue dots) at the minimum testing error for policy abstraction methods.

Figure 10: Results of policy evaluation with 20% sampling ratio (Strong Generalization, LLC-v2).The results the true and predicted values of the train policies (red dots) and target policies (blue dots) at the minimum testing error for policy abstraction methods.

Figure 11: T-SNE visualization results of policy representation (Weak Generalization with 20% sampling ratio, IDP-v2). Each colored point represents a policy and the label of the color bar for all results is the expected return of the policy.

Figure 12: T-SNE visualization results of policy representation (Strong Generalization with 20% sampling ratio, IDP-v2). Each colored point represents a policy and the label of the color bar for all results is the expected return of the policy.

fπ (e) fP π (f) fV π

Figure 13: T-SNE visualization results of policy representation (Weak Generalization with 20% sampling ratio, LLC-v2). Each colored point represents a policy and the label of the color bar for all results is the expected return of the policy.

Figure 14: T-SNE visualization results of policy representation (Strong Generalization with 20% sampling ratio, LLC-v2). Each colored point represents a policy and the label of the color bar for all results is the expected return of the policy.

Figure 15: visualization results of policy representation (Weak Generalization with 20% sampling ratio, LLC-v2). Each colored point represents a policy and the label of the color bar for all results is the expected return of the policy.

Properties of different policy abstraction.

Performance of different policy abstractions in Off-policy Evaluation (OPE). The minimum value for each task is highlighted. Results are the mean ± a standard deviation over 10 and 5 trials (for weak and strong respectively). The f V π has lower T-error and G-gap on both the generalization tasks.

The discounted state visitation distribution from any state s is defined as d π s

𝑾 𝒊 [𝑙 0 , 𝑙 𝟏 ] [𝑙 𝟏 , 𝑙 𝟐 ] [𝑙 𝟐 , 𝑙 𝟑 ] 𝒃 𝒊 [1, 𝑙 𝟏 ] [1, 𝑙 𝟐 ] [1, 𝑙 𝟑 ] 𝑾 𝒊 ⊕ 𝒃 𝒊 ⊤ [𝑙 𝟏 , 𝑙 𝟎 + 𝟏] [𝑙 𝟐 , 𝑙 𝟏 + 𝟏] [𝑙 𝟑 , 𝑙 𝟐 + 𝟏]𝑴𝑹 𝒇 𝝍,𝒊 𝑾 𝒊 ⊕ 𝒃 𝒊

Structure of policy network and policy evaluation network.

Performance of different policy abstractions in Off-policy Evaluation (OPE). The minimum value for each task is highlighted. Results are the mean ± a standard deviation over 10 and 5 trials (for weak and strong with 40% sampling ratio respectively).

Performance of different policy abstractions in Off-policy Evaluation (OPE). The minimum value for each task is highlighted. Results are the mean ± a standard deviation over 10 and 5 trials (for weak and strong with 80% sampling ratio respectively).

annex

the weak generalization experimental scenarios, our methods are better than or comparable to other methods. In the strong generalization experimental scenarios, the proposed method, especially f V π , is significantly more robust than the other methods.

G.2 HOW DIFFERENT POLICY ABSTRACTIONS INTRAPOLATE AND EXTRAPOLATE IN OPE?

To present the results of weak and strong generalization more intuitively, we attach scatter plots of policy evaluation results for two generalization settings with 20% sampling ratio in Fig. 7 ,8,9,10. To be specific, the horizontal axis of scatter plots represents the true value and the vertical axis is the predicted value. The evaluation results of the training policies (red dots) and target policies (blue dots) are unified in a scatter plot for each method. Experimental results are the best of 10 and 5 trials (for weak and strong respectively). Obviously, the results of the weak generalization setting basically remained near the diagonal, while the strong generalization experiments show an overall underestimation. This is due to the fact that in the strong generalization setting we only sample low-performance policies as training data and the rest (i.e., high-performance policies) are used as test policies.

G.3 VISUALIZATION OF LEARNED POLICY REPRESENTATIONS

In this work, we further analyze different policy abstractions (i.e., f Θ , f RE , f EL , f π , f P π , f V π ) by visualizing policy representations. Among them, the f Θ denotes directly compressing policy parameters as policy representations. The f RE and f EL refer to learning policy representations by random initial abstraction function and optimized abstraction function by the evaluation loss, respectively. The unsupervised contrastive learning approach (f CL ) learns policy representations based on the InforNCE (Tang et al., 2020) . f π , f P π , f V π correspond to the three policy abstractions we propose, and the policy representations are learned via the alignment loss.

