CONTRASTIVE BEHAVIORAL SIMILARITY EMBEDDINGS FOR GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Reinforcement learning methods trained on a few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure of reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states, as well as in future states, are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and the Distracting DM Control Suite.

1. INTRODUCTION

Figure 1: Jumping task: The agent (white block), learning from pixels, needs to jump over an obstacle (grey square). The challenge is to generalize to unseen obstacle positions and floor heights in test tasks using a small number of training tasks. We show the agent's trajectories using faded blocks. (Panels: Train Env, Train Env, Test Env.)

Current reinforcement learning (RL) approaches often learn policies that do not generalize to environments different from those the agent was trained on, even when these environments are semantically equivalent (Tachet des Combes et al., 2018; Song et al., 2019; Cobbe et al., 2019). For example, consider a jumping task where an agent, learning from pixels, needs to jump over an obstacle (Figure 1). Deep RL agents trained on a few of these tasks with different obstacle positions struggle to solve test tasks where obstacles are at previously unseen locations.

Recent solutions to circumvent poor generalization in RL are adapted from supervised learning and, as such, largely ignore the sequential aspect of RL. Most of these solutions revolve around enhancing the learning process, including data augmentation (e.g., Cobbe et al., 2019; Farebrother et al., 2018), noise injection (Igl et al., 2019), and diverse training conditions (Tobin et al., 2017); they rarely exploit properties of the sequential decision-making problem, such as similarity in actions across temporal observations.

Instead, we tackle generalization by incorporating properties of the RL problem into the representation learning process. Our approach exploits the fact that an agent, when operating in environments with similar underlying mechanics, exhibits at least short sequences of behaviors that are similar across these environments. Concretely, the agent is optimized to learn an embedding in which states are close when the agent's optimal policies in these states and future states are similar. This notion of proximity is general, and it is applicable to observations from different environments. Specifically, inspired by bisimulation metrics (Castro, 2020; Ferns et al., 2004), we propose a novel policy similarity metric (PSM).
PSM (Section 3) defines a notion of similarity between states originating from different environments by the proximity of the long-term optimal behavior from these states. PSM is reward-agnostic, making it more robust for generalization compared to approaches that rely on reward information. We prove that PSM yields an upper bound on the suboptimality of policies transferred from one environment to another (Theorem 1), which is not attainable with bisimulation.

We employ PSM for representation learning and introduce policy similarity embeddings (PSEs) for deep RL. To do so, we present a general contrastive procedure (Section 4) to learn an embedding based on any state similarity metric; PSEs are the instantiation of this procedure with PSM. PSEs are appealing for generalization as they encode task-relevant invariances by putting behaviorally equivalent states together. This is unlike prior approaches, which rely on capturing such invariances without being explicitly trained to do so, for example through value function similarities across states (e.g., Castro & Precup, 2010) or robustness to fixed transformations of the observation space (e.g., Kostrikov et al., 2020; Laskin et al., 2020a). PSEs lead to better generalization while being orthogonal to how most of the field has been tackling generalization.

We illustrate the efficacy and broad applicability of our approach on three existing benchmarks specifically designed to test generalization: (i) the jumping task from pixels (Figure 1), (ii) LQR with spurious correlations, and (iii) the Distracting DM Control Suite.

2. PRELIMINARIES

We describe an environment as a Markov decision process (MDP) (Puterman, 1994) M = (X, A, R, P, γ), with a state space X, an action space A, a reward function R, transition dynamics P, and a discount factor γ ∈ [0, 1). A policy π(· | x) maps states x ∈ X to distributions over actions. Whenever convenient, we abuse notation and write π(x) to describe the probability distribution π(· | x), treating π(x) as a vector. In RL, the goal is to find an optimal policy π* that maximizes the cumulative expected return E_{a_t ∼ π(· | x_t)}[Σ_t γ^t R(x_t, a_t)] starting from an initial state x_0.

We are interested in learning a policy that generalizes across related environments. We formalize this by considering a collection ρ of MDPs, sharing an action space A but with disjoint state spaces. We use X and Y to denote the state spaces of specific environments, and write R_X, P_X for the reward and transition functions of the MDP whose state space is X, and π*_X for its optimal policy, which we assume unique without loss of generality. For a given policy π, we further specialize these into R^π_X and P^π_X, the reward and state-to-state transition dynamics arising from following π in that MDP. We write S for the union of the state spaces of the MDPs in ρ. Concretely, different MDPs correspond to specific scenarios in a problem class (Figure 1), and S is the space of all possible configurations. Used without subscripts, R, P, and π refer to the reward and transition functions of this "union MDP" and a policy defined over S; this notation simplifies the exposition.

We measure distances between states across environments using pseudometrics¹ on S; the set of all such pseudometrics is M, and M_p is the set of metrics on probability distributions over S. In our setting, the learner has access to a collection of training MDPs {M_i}_{i=1}^N drawn from ρ.
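As a concrete illustration of the return being maximized, the following sketch estimates the discounted return of a uniform policy in a small, randomly generated tabular MDP via Monte-Carlo rollouts. The MDP, the uniform policy, and all names here are illustrative stand-ins, not objects from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny illustrative MDP with 3 states and 2 actions.
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a]: next-state distribution
R = rng.uniform(size=(n_states, n_actions))                       # R[x, a]: reward in [0, 1]

def rollout_return(policy, x0, horizon=200):
    """Monte-Carlo estimate of sum_t gamma^t R(x_t, a_t) from x0,
    where policy[x] is a distribution over actions."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy[x])
        total += discount * R[x, a]
        discount *= gamma
        x = rng.choice(n_states, p=P[x, a])
    return total

uniform = np.full((n_states, n_actions), 1.0 / n_actions)
est = np.mean([rollout_return(uniform, x0=0) for _ in range(500)])
```

Since rewards lie in [0, 1], any discounted return here is bounded by 1/(1 − γ) = 10, which gives a quick sanity check on the estimate.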
After interacting with these environments, the learner must produce a policy π over the entire state space S, which is then evaluated on unseen MDPs from ρ. Similar in spirit to the setting of transfer learning (Taylor & Stone, 2009), we evaluate the policy's zero-shot performance on ρ.

Our policy similarity metric (Section 3) builds on the concept of π-bisimulation (Castro, 2020). Under the π-bisimulation metric, the distance between two states x and y is defined in terms of the difference between the expected rewards obtained when following policy π. To achieve good generalization properties, we learn an embedding function z_θ : S → R^k that reflects the information encoded in the policy similarity metric; this yields a policy similarity embedding (PSE).
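The contrastive procedure itself is deferred to Section 4; as a rough illustration of how a metric can be embedded (not the paper's exact objective), one can turn pairwise metric distances into soft similarity targets exp(−d/β) and push softmax-normalized embedding similarities toward them with a cross-entropy loss. The function below is such a sketch; the temperature β, the target construction, and the loss form are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_metric_loss(z_x, z_y, D, beta=1.0):
    """Illustrative loss: align embedding similarities between states of two
    environments with soft targets derived from a state-similarity metric,
    where D[i, j] = d(x_i, y_j)."""
    # Soft targets: state pairs with small metric distance get high target weight.
    targets = np.exp(-D / beta)
    targets /= targets.sum(axis=1, keepdims=True)
    # Embedding similarities: negative Euclidean distances, softmax-normalized per row.
    logits = -np.linalg.norm(z_x[:, None, :] - z_y[None, :, :], axis=-1)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy between target and embedding similarity distributions.
    return float(-(targets * log_probs).sum(axis=1).mean())

n, k = 5, 3
D = np.abs(rng.normal(size=(n, n)))            # stand-in for a metric distance matrix
z_x, z_y = rng.normal(size=(n, k)), rng.normal(size=(n, k))
loss = contrastive_metric_loss(z_x, z_y, D)
```

Minimizing such a loss over θ (here the embeddings are just random arrays) would pull together embeddings of states the metric deems behaviorally close.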



¹ Pseudometrics are generalizations of metrics in which the distance between two distinct states can be zero.



The π-bisimulation metric d^π satisfies a recursive equation based on the 1-Wasserstein metric W_1 : M → M_p, where W_1(d)(A, B) is the minimal cost of transporting probability mass from A to B (two probability distributions on S) under the base metric d (Villani, 2008). The recursion is

    d^π(x, y) = |R^π(x) − R^π(y)| + γ W_1(d^π)(P^π(· | x), P^π(· | y)),   x, y ∈ S.   (1)
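Since the right-hand side of the recursion is a γ-contraction, d^π can be computed by fixed-point iteration. The sketch below does this for a small deterministic MDP (illustrative, not from the paper); with deterministic transitions, W_1 between the Dirac next-state distributions collapses to the base distance d(x′, y′), so no optimal-transport solver is needed:

```python
import numpy as np

# Deterministic MDP under a fixed policy pi: next_state[x] and reward[x].
# States 0 -> 1 -> 2 -> 0 form a cycle; state 3 is a self-loop (illustrative).
next_state = np.array([1, 2, 0, 3])
reward = np.array([0.0, 1.0, 0.0, 1.0])
gamma = 0.9

n = len(reward)
d = np.zeros((n, n))
for _ in range(1000):
    # One application of the recursion: reward difference plus discounted
    # distance between the (deterministic) successor states.
    d_new = np.abs(reward[:, None] - reward[None, :]) + gamma * d[next_state][:, next_state]
    if np.max(np.abs(d_new - d)) < 1e-10:
        d = d_new
        break
    d = d_new
```

The result is a pseudometric: symmetric, zero on the diagonal, and e.g. d(0, 1) solves d(0,1) = 1 + γ(1 + γ² d(0,1)), i.e. (1 + γ)/(1 − γ³).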


