SPECTRAL DECOMPOSITION REPRESENTATION FOR REINFORCEMENT LEARNING

Abstract

Representation learning often plays a critical role in avoiding the curse of dimensionality in reinforcement learning. A representative class of algorithms exploits spectral decomposition of the stochastic transition dynamics to construct representations that enjoy strong theoretical properties in idealized settings. However, current spectral methods suffer from limited applicability because they are constructed for state-only aggregation and are derived from a policy-dependent transition kernel, without considering the issue of exploration. To address these issues, we propose an alternative spectral method, Spectral Decomposition Representation (SPEDER), that extracts a state-action abstraction from the dynamics without inducing spurious dependence on the data collection policy, while also balancing the exploration-versus-exploitation trade-off during learning. A theoretical analysis establishes the sample efficiency of the proposed algorithm in both the online and offline settings. In addition, an experimental investigation demonstrates superior performance over current state-of-the-art algorithms across several RL benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) seeks to learn an optimal sequential decision-making strategy by interacting with an unknown environment, usually modeled by a Markov decision process (MDP). For MDPs with finite states and actions, RL can be performed in a sample-efficient and computationally efficient way; however, for large or infinite state spaces both the sample and computational complexity increase dramatically. Representation learning is therefore a major tool to combat the implicit curse of dimensionality in such spaces, contributing to several empirical successes in deep RL, where policies and value functions are represented as deep neural networks and trained end-to-end (Mnih et al., 2015; Levine et al., 2016; Silver et al., 2017; Bellemare et al., 2020). However, an inappropriate representation can introduce approximation error that grows exponentially in the horizon (Du et al., 2019b), or induce redundant solutions to the Bellman constraints with large generalization error (Xiao et al., 2021). Consequently, ensuring the quality of representation learning has become an increasingly important consideration in deep RL. In prior work, many methods have been proposed to ensure alternative properties of a learned representation, such as reconstruction (Watter et al., 2015), bi-simulation (Gelada et al., 2019; Zhang et al., 2020), and contrastive learning (Zhang et al., 2022a; Qiu et al., 2022; Nachum & Yang, 2021). Among these methods, a family of representation learning algorithms has focused on constructing features by exploiting the spectral decomposition of different transition operators, including successor features (Dayan, 1993; Machado et al., 2018), proto-value functions (Mahadevan & Maggioni, 2007; Wu et al., 2018), spectral state aggregation (Duan et al., 2019; Zhang & Wang, 2019), and Krylov bases (Petrik, 2007; Parr et al., 2008).
Although these algorithms initially appear distinct, they all essentially factorize a variant of the transition kernel. The most attractive property of such representations is that the value function can be linearly represented in the learned features, thereby reducing the complexity of subsequent planning. Moreover, spectral representations are compatible with deep neural networks (Barreto et al., 2017), which makes them easily applicable to optimal policy learning (Kulkarni et al., 2016b) in deep RL.


Published as a conference paper at ICLR 2023

However, despite their elegance and desirable properties, current spectral representation algorithms exhibit several drawbacks. One drawback is that current methods generate state-only features, which are heavily influenced by the behavior policy and can fail to generalize well to alternative policies. Moreover, most existing spectral representation learning algorithms omit the intimate coupling between representation learning and exploration, and instead learn the representation from a pre-collected static dataset. This is problematic because effective exploration depends on having a good representation, while learning the representation requires comprehensively-covered experiences; failing to properly manage this interaction can lead to fundamentally sample-inefficient data collection (Xiao et al., 2022). These limitations lead to suboptimal features and limited empirical performance. In this paper, we address these important but largely ignored issues, and provide a novel spectral representation learning method that generates policy-independent features and provably manages the delicate balance between exploration and exploitation. In summary:
• We provide a spectral decomposition view of several current representation learning methods, and identify the cause of spurious dependencies in state-only spectral features (Section 2.2).
• We develop a novel model-free objective, Spectral Decomposition Representation (SPEDER), that factorizes the policy-independent transition kernel to eliminate policy-induced dependencies, while revealing the connection between model-free and model-based representation learning (Section 3).
• We provide algorithms that implement the principles of optimism and pessimism in the face of uncertainty using the SPEDER features for online and offline RL (Section 3.1), and equip behavior cloning with SPEDER for imitation learning (Section 3.2).
• We analyze the sample complexity of SPEDER in both the online and offline settings, to justify the achieved balance between exploration and exploitation (Section 4).
• We demonstrate that SPEDER outperforms state-of-the-art model-based and model-free RL algorithms on several benchmarks (Section 6).

2. PRELIMINARIES

In this section, we briefly introduce Markov Decision Processes (MDPs) with a low-rank structure, and reveal the spectral decomposition view of several representation learning algorithms, which motivates our new spectral representation learning algorithm.

2.1. LOW-RANK MARKOV DECISION PROCESSES

Markov Decision Processes (MDPs) are a standard sequential decision-making model for RL, and can be described as a tuple M = (S, A, r, P, ρ, γ), where S is the state space, A is the action space, r : S × A → [0, 1] is the reward function, P : S × A → ∆(S) is the transition operator with ∆(S) the family of distributions over S, ρ ∈ ∆(S) is the initial distribution, and γ ∈ (0, 1) is the discount factor. The goal of RL is to find a policy π : S → ∆(A) that maximizes the cumulative discounted reward V^π_{P,r} := E_{s_0∼ρ, π}[∑_{i=0}^∞ γ^i r(s_i, a_i)] by interacting with the MDP. The value function is defined as V^π_{P,r}(s) := E_π[∑_{i=0}^∞ γ^i r(s_i, a_i) | s_0 = s], and the action-value function is Q^π_{P,r}(s, a) := E_π[∑_{i=0}^∞ γ^i r(s_i, a_i) | s_0 = s, a_0 = a]. These definitions imply the following recursive relationships: V^π_{P,r}(s) = E_{a∼π(·|s)}[Q^π_{P,r}(s, a)] and Q^π_{P,r}(s, a) = r(s, a) + γ E_{s′∼P(·|s,a)}[V^π_{P,r}(s′)]. We additionally define the state visitation distribution induced by a policy π as d^π_P(s) := (1 − γ) E_{s_0∼ρ, π}[∑_{t=0}^∞ γ^t 1(s_t = s)], where 1(·) is the indicator function. When |S| and |A| are finite, there exist sample-efficient algorithms that find the optimal policy by maintaining an estimate of P or Q^π_{P,r} (Azar et al., 2017; Jin et al., 2018). However, such methods cannot be scaled up when |S| and |A| are extremely large or infinite. In such cases, function approximation is needed to exploit the structure of the MDP while avoiding explicit dependence on |S| and |A|. The low-rank MDP is one of the most prominent structures that allows for simple yet effective function approximation in MDPs, and it is based on the following spectral structural assumption on P and r:

Assumption 1 (Low-Rank MDP (Jin et al., 2020; Agarwal et al., 2020)). An MDP M is a low-rank MDP if there exists a low-rank spectral decomposition of the transition kernel P(s′|s, a), such that

P(s′|s, a) = ⟨ϕ(s, a), µ(s′)⟩,   r(s, a) = ⟨ϕ(s, a), θ_r⟩,   (1)
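As an illustrative sketch (ours, not from the paper), the discounted state visitation distribution defined above has a closed form in the tabular case, d^π = (1 − γ) ρ⊤(I − γ P_π)^{−1}, where P_π is the state-to-state kernel under π; the numpy snippet below computes it and checks that it is a valid distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 2, 0.95

# Random tabular MDP: P[s, a] is a distribution over next states.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)      # pi[s] is a distribution over actions
rho = np.full(S, 1.0 / S)                # uniform initial distribution

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum('sa,sat->st', pi, P)

# d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s), via the Neumann series
# sum_t gamma^t P_pi^t = (I - gamma * P_pi)^{-1}.
d = (1 - gamma) * rho @ np.linalg.inv(np.eye(S) - gamma * P_pi)

assert np.isclose(d.sum(), 1.0) and (d >= 0).all()
```

The (1 − γ) factor normalizes the geometric series ∑_t γ^t = 1/(1 − γ), which is why d sums to one.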

