SPECTRAL DECOMPOSITION REPRESENTATION FOR REINFORCEMENT LEARNING

Abstract

Representation learning often plays a critical role in avoiding the curse of dimensionality in reinforcement learning. A representative class of algorithms exploits spectral decomposition of the stochastic transition dynamics to construct representations that enjoy strong theoretical properties in idealized settings. However, current spectral methods suffer from limited applicability because they are constructed for state-only aggregation and are derived from a policy-dependent transition kernel, without considering the issue of exploration. To address these issues, we propose an alternative spectral method, Spectral Decomposition Representation (SPEDER), that extracts a state-action abstraction from the dynamics without inducing spurious dependence on the data collection policy, while also balancing the exploration-versus-exploitation trade-off during learning. A theoretical analysis establishes the sample efficiency of the proposed algorithm in both the online and offline settings. In addition, an experimental investigation demonstrates superior performance over current state-of-the-art algorithms across several RL benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) seeks to learn an optimal sequential decision-making strategy by interacting with an unknown environment, usually modeled as a Markov decision process (MDP). For MDPs with finite states and actions, RL can be performed in a sample-efficient and computationally efficient way; however, for large or infinite state spaces, both the sample and computational complexity increase dramatically. Representation learning is therefore a major tool for combating the curse of dimensionality in such spaces, contributing to several empirical successes in deep RL, where policies and value functions are represented as deep neural networks and trained end-to-end (Mnih et al., 2015; Levine et al., 2016; Silver et al., 2017; Bellemare et al., 2020). However, an inappropriate representation can introduce approximation error that grows exponentially in the horizon (Du et al., 2019b), or induce redundant solutions to the Bellman constraints with large generalization error (Xiao et al., 2021). Consequently, ensuring the quality of representation learning has become an increasingly important consideration in deep RL. In prior work, many methods have been proposed to ensure alternative properties of a learned representation, such as reconstruction (Watter et al., 2015), bi-simulation (Gelada et al., 2019; Zhang et al., 2020), and contrastive learning (Zhang et al., 2022a; Qiu et al., 2022; Nachum & Yang, 2021). Among these methods, a family of representation learning algorithms constructs features by exploiting the spectral decomposition of different transition operators, including successor features (Dayan, 1993; Machado et al., 2018), proto-value functions (Mahadevan & Maggioni, 2007; Wu et al., 2018), spectral state aggregation (Duan et al., 2019; Zhang & Wang, 2019), and Krylov bases (Petrik, 2007; Parr et al., 2008).
Although these algorithms initially appear distinct, they all essentially factorize a variant of the transition kernel. The most attractive property of such representations is that the value function can be represented linearly in the learned features, thereby reducing the complexity of subsequent planning. Moreover, spectral representations are compatible with deep neural networks (Barreto et al., 2017), which makes them readily applicable to optimal policy learning (Kulkarni et al., 2016b) in deep RL.
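To make the linear-representability property concrete, the following sketch (a hypothetical tabular illustration, not the paper's algorithm; all variable names are our own) factorizes the state-action transition matrix of a small random MDP via SVD and verifies numerically that the Q-function of a fixed policy lies exactly in the span of the reward vector together with the left singular vectors, since Q = r + γPV and P = UΣVᵀ imply Q is linear in [r, U]:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9

# Random tabular MDP: row (s, a) of P is the distribution P(. | s, a)
P = rng.random((S * A, S))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(S * A)

# Uniform policy pi(a | s) for evaluation
pi = np.full((S, A), 1.0 / A)

# Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi, where
# P_pi(s, s') = sum_a pi(a|s) P(s'|s, a) and r_pi(s) = sum_a pi(a|s) r(s, a)
P_pi = (pi.reshape(S * A, 1) * P).reshape(S, A, S).sum(axis=1)
r_pi = (pi.reshape(S * A) * r).reshape(S, A).sum(axis=1)
V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
Q = r + gamma * P @ V

# Spectral factorization of the (s, a) -> s' kernel: P = U @ diag(sig) @ Vt
U, sig, Vt = np.linalg.svd(P, full_matrices=False)

# Q = r + gamma * U @ (diag(sig) @ Vt @ V), so Q is linear in Phi = [r, U]
Phi = np.column_stack([r, U])
w, *_ = np.linalg.lstsq(Phi, Q, rcond=None)
print(np.abs(Phi @ w - Q).max())  # residual is numerically zero
```

The same argument underlies the planning benefit: once features spanning the factorized kernel are available, policy evaluation reduces to solving a linear system in the feature coefficients rather than over the full state-action space.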

