TOWARDS A UNIFIED POLICY ABSTRACTION THEORY AND REPRESENTATION LEARNING APPROACH IN MARKOV DECISION PROCESSES

Abstract

Lying at the heart of intelligent decision-making systems, how a policy is represented and optimized is a fundamental problem. The root challenge is the large scale and high complexity of the policy space, which exacerbates the difficulty of policy learning, especially in real-world scenarios. Towards a desirable surrogate policy space, policy representation in a low-dimensional latent space has recently shown its potential in improving both the evaluation and optimization of policies. The key question involved in these studies is by what criterion we should abstract the policy space for desired compression and generalization. However, both the theory of policy abstraction and the methodology of policy representation learning are understudied in the literature. In this work, we make first efforts to fill this vacancy. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) between policies, for more convenient use in learning policy representations. Further, we propose a policy representation learning approach based on deep metric learning. For the empirical study, we investigate the efficacy of the proposed policy metrics and representations in characterizing policy difference and conveying policy generalization, respectively. Our experiments are conducted on both policy optimization and evaluation problems, including trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no universally optimal abstraction for all downstream learning problems, while the influence-irrelevance policy abstraction can be a generally preferred choice.

1. INTRODUCTION

How to obtain the optimal policy is the ultimate problem in many decision-making systems, such as game playing (Mnih et al., 2015), robotics manipulation (Smith et al., 2019) and medicine discovery (Schreck et al., 2019). A policy, the central notion in this problem, defines the agent's behavior under specific circumstances. Towards solving the problem, many works study policies from different focal points, e.g., how a policy can be well represented (Ma et al., 2020; Urain et al., 2020), how to optimize a policy (Schulman et al., 2017a; Ho & Ermon, 2016) and how to analyze and understand agents' behaviors (Zheng et al., 2018; Hansen & Ostermeier, 2001). The root challenge in these studies is the large scale and high complexity of the policy space, especially in real-world scenarios, which severely escalates the difficulty of policy learning. Intuitively and naturally, such issues can be significantly alleviated if we have an ideal surrogate policy space, one that is compact in scale while preserving the key features of the original policy space. In this direction, low-dimensional latent representations of policies play an important role in Reinforcement Learning (RL) (Tang et al., 2020), Opponent Modeling (Grover et al., 2018), Fast Adaptation (Raileanu et al., 2020; Sang et al., 2022), Behavioral Characterization (Kanervisto et al., 2020), etc. In these domains, a few preliminary attempts have been made at devising different policy representations. Most policy representations introduced in prior works resort to encapsulating the information of the policy distribution under interest states (Harb et al., 2020; Pacchiano et al., 2020), e.g., learning a policy embedding by encoding the policy's state-action pairs (or trajectories) and optimizing a policy recovery objective (Grover et al., 2018; Raileanu et al., 2020). Rather than the policy distribution, some other works resort to the information of the policy's influence on the

