HYPERBOLIC DEEP REINFORCEMENT LEARNING

Abstract

In deep reinforcement learning (RL), useful information about the state is inherently tied to its possible future successors. Consequently, encoding features that capture the hierarchical relationships between states into the model's latent representations is often conducive to recovering effective policies. In this work, we study a new class of deep RL algorithms that promote encoding such relationships by using hyperbolic space to model latent representations. However, we find that a naive application of existing methodology from the hyperbolic deep learning literature leads to fatal instabilities due to the non-stationarity and variance characterizing common gradient estimators in RL. Hence, we design a new general method that directly addresses such optimization challenges and enables stable end-to-end learning with deep hyperbolic representations. We empirically validate our framework by applying it to popular on-policy and off-policy RL algorithms on the Procgen and Atari 100K benchmarks, attaining near universal performance and generalization benefits. Given its natural fit, we hope this work will inspire future RL research to consider hyperbolic representations as a standard tool.

1. INTRODUCTION

Reinforcement Learning (RL) has achieved notable milestones in several game-playing and robotics applications (Mnih et al., 2013; Vinyals et al., 2019; Kalashnikov et al., 2018; OpenAI et al., 2019; Lee et al., 2021). However, these recent advances relied on large amounts of data and domain-specific practices, restricting their applicability in many important real-world contexts (Dulac-Arnold et al., 2019). We argue that these challenges are symptomatic of current deep RL models lacking a proper prior to efficiently learn generalizable features for control (Kirk et al., 2021). We propose to tackle this issue by introducing hyperbolic geometry to RL as a new inductive bias for representation learning.

The evolution of the state in a Markov decision process can be conceptualized as a tree, with the policy and dynamics determining the possible branches. Analogously, the same hierarchical evolution often applies to the most significant features required for decision-making (e.g., the presence of bricks and the location of the paddle/ball in Fig. 1). These relationships tend to hold beyond individual trajectories, making hierarchy a natural basis to encode information for RL (Flet-Berliac, 2019). Consequently, we hypothesize that deep RL models should prioritize encoding precisely such hierarchically-structured features to facilitate learning effective and generalizable policies. In contrast, we note that non-evolving features, such as the aesthetic properties of elements in the environment, are often linked with spurious correlations, hindering generalization to new states (Song et al., 2019). Similarly, human cognition also appears to learn representations of actions and elements of the environment by focusing on their underlying hierarchical relationships (Barker & Wright, 1955; Zhou et al., 2018).

Hyperbolic geometry (Beltrami, 1868; Cannon et al., 1997) provides a natural choice for efficiently encoding hierarchically-structured features. A defining property of hyperbolic space is its exponential volume growth, which enables the embedding of tree-like hierarchical data with low distortion using only a few dimensions (Sarkar, 2011). In contrast, the volume of Euclidean spaces grows only polynomially, requiring high dimensionality to precisely embed tree structures (Matoušek, 1990), potentially leading to higher complexity, more parameters, and overfitting. We analyze the properties of learned RL representations using a measure based on the δ-hyperbolicity (Gromov, 1987), which quantifies how close an arbitrary metric space is to a hyperbolic one. In line with our intuition, we show that the performance improvements of RL algorithms correlate with the increasing hyperbolicity of the discrete space spanned by their latent representations. This result validates the importance of appropriately encoding hierarchical information, suggesting that the inductive bias provided by hyperbolic representations would facilitate recovering effective solutions.

Hyperbolic geometry has recently been exploited in other areas of machine learning, showing substantial performance and efficiency benefits for learning representations of hierarchical and graph data (Nickel & Kiela, 2017; Chamberlain et al., 2017). Recent contributions further extended tools from modern deep learning to work in hyperbolic space (Ganea et al., 2018; Shimizu et al., 2020), validating their effectiveness in both supervised and unsupervised learning tasks (Khrulkov et al., 2020; Nagano et al., 2019; Mathieu et al., 2019).
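For concreteness, the δ-hyperbolicity discussed above can be estimated directly from pairwise distances between latent representations via Gromov's four-point condition. Below is a minimal NumPy sketch (the function name and diameter normalization are our own choices; fixing a single base point keeps the computation tractable and estimates δ up to a small constant factor):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def relative_delta_hyperbolicity(X, base=0):
    """Estimate the (diameter-normalized) Gromov delta-hyperbolicity
    of a point cloud X of shape (n, d); smaller values indicate a
    metric closer to a tree, i.e., more hyperbolic."""
    D = squareform(pdist(X))            # pairwise Euclidean distances
    row = D[base][:, None]              # d(w, x) as a column vector
    col = D[base][None, :]              # d(w, y) as a row vector
    G = 0.5 * (row + col - D)           # Gromov products (x|y)_w
    # Max-min matrix product: M[x, y] = max_z min(G[x, z], G[z, y]).
    # O(n^3) memory/time: subsample points for large batches.
    M = np.max(np.minimum(G[:, :, None], G[None, :, :]), axis=1)
    delta = np.max(M - G)               # worst four-point defect
    return 2.0 * delta / np.max(D)      # normalize by the diameter
```

The returned value lies in [0, 1], with 0 attained by a perfect tree metric and larger values indicating increasingly non-hyperbolic geometry.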
However, most of these hyperbolic deep learning approaches showed clear improvements on smaller-scale problems that failed to hold when scaling to higher-dimensional data and representations. Many of these shortcomings are tied to the practical challenges of optimizing hyperbolic and Euclidean parameters end-to-end (Guo et al., 2022). In RL, we show that the non-stationarity and high variance characterizing common gradient estimators exacerbate these issues, making a naive incorporation of existing hyperbolic layers yield underwhelming results.

In this work, we overcome the aforementioned challenges and effectively train deep RL algorithms with latent hyperbolic representations end-to-end. In particular, we design spectrally-regularized hyperbolic mappings (S-RYM), a simple recipe combining scaling and spectral normalization (Miyato et al., 2018) that stabilizes the learned hyperbolic representations and enables their seamless integration with deep RL. We use S-RYM to build hyperbolic versions of both on-policy (Schulman et al., 2017) and off-policy algorithms (Hessel et al., 2018), and evaluate them on the Procgen (Cobbe et al., 2020) and Atari 100K (Bellemare et al., 2013) benchmarks. We show that our framework attains near universal performance and generalization improvements over established Euclidean baselines, making even general algorithms competitive with highly-tuned SotA baselines. We hope our work will set a new standard and be the first of many incorporating hyperbolic representations with RL. To this end, we share our implementation at sites.google.com/view/hyperbolic-rl.
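To illustrate the core idea, below is a minimal PyTorch sketch of a spectrally-regularized hyperbolic output head (the class name, the fixed unit curvature, and the 1/√d rescaling factor are illustrative assumptions, not the exact published implementation):

```python
import torch
import torch.nn as nn
import torch.nn.utils as U

class HyperbolicHead(nn.Module):
    """Sketch of a spectrally-regularized hyperbolic mapping:
    Euclidean features are spectrally normalized, rescaled, and
    projected onto the Poincare ball via the exponential map at
    the origin, yielding a hyperbolic latent embedding."""
    def __init__(self, in_dim, out_dim, curvature=1.0):
        super().__init__()
        # Spectral normalization bounds the layer's Lipschitz constant,
        # limiting how quickly embeddings drift toward the ball boundary,
        # where gradients of hyperbolic operations explode.
        self.fc = U.spectral_norm(nn.Linear(in_dim, out_dim))
        self.c = curvature
        self.scale = 1.0 / out_dim ** 0.5   # assumed rescaling choice

    def expmap0(self, v):
        # Exponential map at the origin of the Poincare ball.
        norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        sqrt_c = self.c ** 0.5
        return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

    def forward(self, z):
        return self.expmap0(self.scale * self.fc(z))
```

A head like this can replace the final Euclidean projection of a policy or value network, with downstream layers operating on the hyperbolic embedding.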

2. PRELIMINARIES

In this section, we introduce the main definitions required for the remainder of the paper. We refer to App. A and Cannon et al. (1997) for further details about RL and hyperbolic space, respectively.

2.1. REINFORCEMENT LEARNING

The RL problem setting is traditionally described as a Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, p_0, r, \gamma)$. At each timestep $t$, an agent interacts with the environment, observing a state $s \in \mathcal{S}$ from the state space, executing an action $a \in \mathcal{A}$ from its action space, and receiving a reward according to its reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The transition dynamics $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ and the initial state distribution $p_0 : \mathcal{S} \to \mathbb{R}$ determine the evolution of the environment's state, while the discount factor $\gamma \in [0, 1)$ quantifies the agent's preference for earlier rewards. Agent behavior in RL can be represented by a parameterized distribution function $\pi_\theta$, whose sequential interaction with the environment yields a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$. The agent's objective is to learn a policy maximizing its expected discounted sum of rewards over trajectories: $\arg\max_\theta \mathbb{E}_{\tau \sim \pi_\theta, P}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.

We differentiate two main classes of RL algorithms with very different optimization procedures, based on their usage of the collected data. On-policy algorithms collect a new set of trajectories with the latest policy at each training iteration, discarding old data. In contrast, off-policy algorithms maintain a large replay buffer of past experiences and use it to learn useful quantities about the environment, such as world models and value functions. Two notable instances from each class are Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Rainbow DQN (Hessel et al., 2018), upon which many recent advances have been built.
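As a minimal illustration of this objective (function and variable names are our own), the discounted return of a single finite trajectory can be computed as follows:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards for one trajectory, computed
    right-to-left via the recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards (1, 0, 2) give 1 + 0.99**2 * 2 = 2.9602
assert abs(discounted_return([1.0, 0.0, 2.0]) - 2.9602) < 1e-9
```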



Figure 1: Hierarchical relationship between states in Breakout, visualized in hyperbolic space.

