SPECTRAL DECOMPOSITION REPRESENTATION FOR REINFORCEMENT LEARNING

Abstract

Representation learning often plays a critical role in avoiding the curse of dimensionality in reinforcement learning. A representative class of algorithms exploits spectral decomposition of the stochastic transition dynamics to construct representations that enjoy strong theoretical properties in idealized settings. However, current spectral methods suffer from limited applicability because they are constructed for state-only aggregation and are derived from a policy-dependent transition kernel, without considering the issue of exploration. To address these issues, we propose an alternative spectral method, Spectral Decomposition Representation (SPEDER), that extracts a state-action abstraction from the dynamics without inducing spurious dependence on the data collection policy, while also balancing the exploration-versus-exploitation trade-off during learning. A theoretical analysis establishes the sample efficiency of the proposed algorithm in both the online and offline settings. In addition, an experimental investigation demonstrates superior performance over current state-of-the-art algorithms across several RL benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) seeks to learn an optimal sequential decision making strategy by interacting with an unknown environment, usually modeled by a Markov decision process (MDP). For MDPs with finite states and actions, RL can be performed in a sample-efficient and computationally efficient way; however, for large or infinite state spaces both the sample and computational complexity increase dramatically. Representation learning is therefore a major tool to combat the implicit curse of dimensionality in such spaces, contributing to several empirical successes in deep RL, where policies and value functions are represented as deep neural networks and trained end-to-end (Mnih et al., 2015; Levine et al., 2016; Silver et al., 2017; Bellemare et al., 2020). However, an inappropriate representation can introduce approximation error that grows exponentially in the horizon (Du et al., 2019b), or induce redundant solutions to the Bellman constraints with large generalization error (Xiao et al., 2021). Consequently, ensuring the quality of representation learning has become an increasingly important consideration in deep RL. In prior work, many methods have been proposed to ensure alternative properties of a learned representation, such as reconstruction (Watter et al., 2015), bi-simulation (Gelada et al., 2019; Zhang et al., 2020), and contrastive learning (Zhang et al., 2022a; Qiu et al., 2022; Nachum & Yang, 2021). Among these methods, a family of representation learning algorithms has focused on constructing features by exploiting the spectral decomposition of different transition operators, including successor features (Dayan, 1993; Machado et al., 2018), proto-value functions (Mahadevan & Maggioni, 2007; Wu et al., 2018), spectral state aggregation (Duan et al., 2019; Zhang & Wang, 2019), and Krylov bases (Petrik, 2007; Parr et al., 2008).
Although these algorithms initially appear distinct, they all essentially factorize a variant of the transition kernel. The most attractive property of such representations is that the value function can be linearly represented in the learned features, thereby reducing the complexity of subsequent planning. Moreover, spectral representations are compatible with deep neural networks (Barreto et al., 2017), which makes them easily applicable to optimal policy learning (Kulkarni et al., 2016b) in deep RL. However, despite their elegance and desirable properties, current spectral representation algorithms exhibit several drawbacks. One drawback is that current methods generate state-only features, which are heavily influenced by the behavior policy and can fail to generalize well to alternative policies. Moreover, most existing spectral representation learning algorithms omit the intimate coupling between representation learning and exploration, and instead learn the representation from a pre-collected static dataset. This is problematic because effective exploration depends on having a good representation, while learning the representation requires experiences with comprehensive coverage; failing to properly manage this interaction can lead to fundamentally sample-inefficient data collection (Xiao et al., 2022). These limitations lead to suboptimal features and limited empirical performance. In this paper, we address these important but largely ignored issues, and provide a novel spectral representation learning method that generates policy-independent features that provably manage the delicate balance between exploration and exploitation. In summary:
• We provide a spectral decomposition view of several current representation learning methods, and identify the cause of spurious dependencies in state-only spectral features (Section 2.2).
• We develop a novel model-free objective, Spectral Decomposition Representation (SPEDER), that factorizes the policy-independent transition kernel to eliminate policy-induced dependencies, while revealing the connection between model-free and model-based representation learning (Section 3).
• We provide algorithms that implement the principles of optimism and pessimism in the face of uncertainty using the SPEDER features for online and offline RL (Section 3.1), and equip behavior cloning with SPEDER for imitation learning (Section 3.2).
• We analyze the sample complexity of SPEDER in both the online and offline settings, to justify the achieved balance between exploration and exploitation (Section 4).
• We demonstrate that SPEDER outperforms state-of-the-art model-based and model-free RL algorithms on several benchmarks (Section 6).

2. PRELIMINARIES

In this section, we briefly introduce Markov Decision Processes (MDPs) with a low-rank structure, and reveal the spectral decomposition view of several representation learning algorithms, which motivates our new spectral representation learning algorithm.

2.1. LOW-RANK MARKOV DECISION PROCESSES

Markov Decision Processes (MDPs) are a standard sequential decision-making model for RL, and can be described as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, \rho, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition operator with $\Delta(\mathcal{S})$ as the family of distributions over $\mathcal{S}$, $\rho \in \Delta(\mathcal{S})$ is the initial distribution and $\gamma \in (0, 1)$ is the discount factor. The goal of RL is to find a policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ that maximizes the cumulative discounted reward $V^{\pi}_{P,r} := \mathbb{E}_{s_0 \sim \rho, \pi}\left[\sum_{i=0}^{\infty} \gamma^i r(s_i, a_i) \mid s_0\right]$ by interacting with the MDP. The value function is defined as $V^{\pi}_{P,r}(s) = \mathbb{E}_{\pi}\left[\sum_{i=0}^{\infty} \gamma^i r(s_i, a_i) \mid s_0 = s\right]$, and the action-value function is $Q^{\pi}_{P,r}(s, a) = \mathbb{E}_{\pi}\left[\sum_{i=0}^{\infty} \gamma^i r(s_i, a_i) \mid s_0 = s, a_0 = a\right]$. These definitions imply the following recursive relationships:
$$V^{\pi}_{P,r}(s) = \mathbb{E}_{\pi}\left[Q^{\pi}_{P,r}(s, a)\right], \qquad Q^{\pi}_{P,r}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{P}\left[V^{\pi}_{P,r}(s')\right].$$
We additionally define the state visitation distribution induced by a policy $\pi$ as $d^{\pi}_{P}(s) = (1 - \gamma)\, \mathbb{E}_{s_0 \sim \rho, \pi}\left[\sum_{t=0}^{\infty} \gamma^t \mathbb{1}(s_t = s) \mid s_0\right]$, where $\mathbb{1}(\cdot)$ is the indicator function. When $|\mathcal{S}|$ and $|\mathcal{A}|$ are finite, there exist sample-efficient algorithms that find the optimal policy by maintaining an estimate of $P$ or $Q^{\pi}_{P,r}$ (Azar et al., 2017; Jin et al., 2018). However, such methods cannot be scaled up when $|\mathcal{S}|$ and $|\mathcal{A}|$ are extremely large or infinite. In such cases, function approximation is needed to exploit the structure of the MDP while avoiding explicit dependence on $|\mathcal{S}|$ and $|\mathcal{A}|$. The low rank MDP is one of the most prominent structures that allows for simple yet effective function approximation in MDPs, and is based on the following spectral structural assumption on $P$ and $r$:

Assumption 1 (Low Rank MDP; Jin et al., 2020; Agarwal et al., 2020).
An MDP $\mathcal{M}$ is a low rank MDP if there exists a low rank spectral decomposition of the transition kernel $P(s'|s, a)$, such that
$$P(s'|s, a) = \langle \phi(s, a), \mu(s') \rangle, \qquad r(s, a) = \langle \phi(s, a), \theta_r \rangle,$$
with two spectral maps $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and $\mu : \mathcal{S} \to \mathbb{R}^d$, and a vector $\theta_r \in \mathbb{R}^d$. The maps $\phi$ and $\mu$ also satisfy the following normalization conditions:
$$\forall (s, a),\ \|\phi(s, a)\|_2 \leqslant 1, \qquad \|\theta_r\|_2 \leqslant \sqrt{d}, \qquad \forall g : \mathcal{S} \to \mathbb{R},\ \|g\|_{L_\infty} \leqslant 1:\ \left\| \int_{\mathcal{S}} \mu(s') g(s')\, ds' \right\|_2 \leqslant \sqrt{d}. \tag{2}$$

The low rank MDP allows for a linear representation of $Q^{\pi}_{P,r}$ for an arbitrary policy $\pi$, since
$$Q^{\pi}_{P,r}(s, a) = r(s, a) + \gamma \int V^{\pi}_{P,r}(s') P(s'|s, a)\, ds' = \left\langle \phi(s, a),\ \theta_r + \gamma \int V^{\pi}_{P,r}(s') \mu(s')\, ds' \right\rangle. \tag{3}$$
Hence, we can provably perform computationally-efficient planning and sample-efficient exploration in a low-rank MDP given $\phi(s, a)$, as shown in Jin et al. (2020). However, $\phi(s, a)$ is generally unknown to the reinforcement learning algorithm, and must be learned via representation learning to leverage the structure of low rank MDPs.
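As a small numerical illustration of the linear representation of the Q-function in a low rank MDP, the sketch below builds a hypothetical tabular MDP whose transition matrix factorizes as a product of a feature matrix and a next-state matrix, evaluates a random policy exactly, and checks that the resulting Q-function lies in the span of the features. The sizes, the Dirichlet construction, and all variable names are assumptions made for this example only; they are not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, gamma = 6, 3, 2, 0.9

# Hypothetical low rank MDP: each phi(s, a) and each mu_k(.) is a probability vector,
# so P = phi @ mu is a valid transition matrix with rank at most d.
phi = rng.dirichlet(np.ones(d), size=S * A)   # (S*A, d), row index = s * A + a
mu = rng.dirichlet(np.ones(S), size=d)        # (d, S)
P = phi @ mu                                  # (S*A, S)
theta_r = rng.uniform(size=d)
r = phi @ theta_r                             # reward is linear in phi as well

# Random stochastic policy pi(a | s).
pi = rng.dirichlet(np.ones(A), size=S)        # (S, A)

# Pi maps a Q-vector to a V-vector: V(s) = sum_a pi(a|s) Q(s, a).
Pi = np.zeros((S, S * A))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]

# Exact policy evaluation: Q = r + gamma * P @ Pi @ Q.
Q = np.linalg.solve(np.eye(S * A) - gamma * P @ Pi, r)
V = Pi @ Q

# Linear weights predicted by the low rank structure: w = theta_r + gamma * sum_s' V(s') mu(s').
w = theta_r + gamma * mu @ V
max_gap = np.max(np.abs(Q - phi @ w))
```

In this toy setting `max_gap` should be at the level of floating-point error, reflecting that the Q-function of any policy is exactly linear in the features once the factorization holds.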

2.2. SPECTRAL FRAMEWORK FOR REPRESENTATION LEARNING

Representation learning based on a spectral decomposition of the transition dynamics was investigated as early as Dayan (1993), although the explicit study began with Mahadevan & Maggioni (2007), which constructed features via eigenfunctions from Laplacians of the transitions. This inspired a series of subsequent work on spectral decomposition representations, including the Krylov basis (Petrik, 2007; Parr et al., 2008), continuous Laplacian (Wu et al., 2018), and spectral state aggregation (Duan et al., 2019; Zhang & Wang, 2019). We summarize these algorithms in Table 1 to reveal their commonality from a unified perspective, which motivates the development of a new algorithm. A similar summary has also been provided in Ghosh & Bellemare (2020). These existing spectral representation methods construct features based on the spectral space of the state transition probability $P^{\pi}(s'|s) = \int P(s'|s, a) \pi(a|s)\, da$, induced by some policy $\pi$. Such a transition operator introduces inter-state dependency from the specific $\pi$, and thus injects an inductive bias into the state-only spectral representation, resulting in features that might not generalize to other policies. To make the state features generalizable, some work has resorted to incorporating a linear action model (e.g., Yao & Szepesvári, 2012; Gehring et al., 2018) where action information is stored in the linear weights. However, this work requires known state features, and it is not clear how to combine the linear action model with the spectral feature framework. Moreover, these existing spectral representation methods completely ignore the problem of exploration, which affects the composition of the dataset for representation learning and is conversely affected by the learned representation during the data collection procedure. These drawbacks have limited the performance of spectral representations in practice.

3. SPECTRAL DECOMPOSITION REPRESENTATION LEARNING

To address these issues, we provide a novel spectral representation learning method, which we call SPEctral DEcomposition Representation (SPEDER). SPEDER is compatible with stochastic gradient updates, and is therefore naturally applicable to general practical settings. We will show that SPEDER can be easily combined with the principle of optimism in the face of uncertainty to obtain sample efficient online exploration, and can also be leveraged to perform latent behavioral cloning. As discussed in Section 2, the spurious dependence in state-only spectral features arises from the state transition operator $P^{\pi}(s'|s)$, which introduces inter-state dependence induced by a specific behavior policy $\pi$. To resolve this issue, we extract the spectral feature from $P(s'|s, a)$ alone, which is invariant to the policy, thereby resulting in a more stable spectral representation. Assume we are given a set of observations $\{(s_i, a_i, s'_i)\}_{i=1}^{n}$ sampled from $\rho_0(s, a) \times P(s'|s, a)$, and want to learn spectral features $\phi(s, a) \in \mathbb{R}^d$ and $\mu(s') \in \mathbb{R}^d$, which are produced by function approximators like deep neural nets, such that:
$$P(s'|s, a) \approx \phi(s, a)^\top \mu(s'). \tag{4}$$
Such a representation allows for a simple linear parameterization of $Q$ for any policy $\pi$, making the planning step efficient, as discussed in Section 2.1. Based on (4), one can exploit density estimation techniques, e.g., maximum likelihood estimation (MLE), to estimate $\phi$ and $\mu$:
$$(\hat{\phi}, \hat{\mu}) = \arg\max_{\phi \in \Phi,\, \mu \in \Psi} \frac{1}{n} \sum_{i=1}^{n} \left[ \log \phi(s_i, a_i)^\top \mu(s'_i) - \log Z(s_i, a_i) \right], \tag{5}$$
where $Z(s, a) = \int_{\mathcal{S}} \phi(s, a)^\top \mu(s')\, ds'$. In fact, Agarwal et al. (2020) and Uehara et al. (2022) provide rigorous theoretical guarantees when a computation oracle for solving (5) is provided. However, such a computation for MLE is highly nontrivial. Meanwhile, the MLE (5) is invariant to the scale of $\phi$ and $\mu$; that is, if $(\phi, \mu)$ is a solution of (5), then $(c_1 \phi, c_2 \mu)$ is also a solution of (5) for any $c_1, c_2 > 0$.
Hence, we generally do not have $Z(s, a) = 1$ for any $(s, a)$, and we can only use $P(s'|s, a) = \left(\frac{\phi(s, a)}{Z(s, a)}\right)^\top \mu(s')$. Therefore, we need to use $\tilde{\phi}(s, a) := \frac{\phi(s, a)}{Z(s, a)}$ to linearly represent the Q-function, which incurs an extra estimation requirement for $Z(s, a)$. Recall that the pair $\phi(s, a)$ and $\mu(s')$ actually form the subspace of the transition operator $P(s'|s, a)$, so instead of MLE for the factorization of $P(s'|s, a)$, we can directly apply singular value decomposition (SVD) to the transition operator to bypass the computational difficulties in MLE. Specifically, the SVD of the transition operator can be formulated as
$$\max_{\mathbb{E}[\phi \phi^\top] = I_d} \left\| \mathbb{E}_{\rho_0}\left[ P(s'|s, a) \phi(s, a) \right] \right\|_2^2 \tag{6}$$
$$= \max_{\mathbb{E}[\phi \phi^\top] = I_d / d}\ \max_{\mu}\ 2\, \mathrm{Trace}\left( \mathbb{E}_{\rho_0}\left[ \int \mu(s') P(s'|s, a) \phi(s, a)^\top ds' \right] \right) - \frac{1}{d} \int \mu(s')^\top \mu(s')\, ds'$$
$$= \max_{\mathbb{E}[\phi \phi^\top] = I_d / d,\ \mu'}\ 2\, \mathbb{E}_{\rho_0 \times P}\left[ \phi(s, a)^\top \mu'(s') p(s') \right] - \mathbb{E}_{p}\left[ p(s') \mu'(s')^\top \mu'(s') \right] / d, \tag{7}$$
where $\|\cdot\|_2$ denotes the $L_2$ norm with respect to the Lebesgue measure in the continuous case and the counting measure in the discrete case, the second equality comes from the Fenchel duality of $\|\cdot\|_2^2$ with up-scaling of $\mu$ by $\sqrt{d}$, and the third equality comes from the reparameterization $\mu(s') = p(s') \mu'(s')$ with some parametrized probability measure $p(s')$ supported on the state space $\mathcal{S}$. As (7) can be approximated with samples, it can be solved via stochastic gradient updates, where the constraint is handled via the penalty method as in Wu et al. (2018). This algorithm starkly contrasts with existing policy-dependent methods for spectral features via explicit eigendecomposition of state transition matrices (Mahadevan & Maggioni, 2007; Machado et al., 2017; 2018).

Remark (equivalent model-based view of (7)): We emphasize that the SVD variational formulation in (6) is model-free, according to the categorization in Modi et al. (2021), without explicit modeling of $\mu$. Here, $\mu$ is introduced only for tractability.
This reveals an interesting model-based perspective on representation learning. We draw inspiration from spectral conditional density estimation (Grünewälder et al., 2012):
$$\min_{\phi, \mu}\ \mathbb{E}_{(s, a) \sim \rho_0} \left\| P(\cdot|s, a) - \phi(s, a)^\top \mu(\cdot) \right\|_2^2. \tag{8}$$
This objective (8) has a unique global minimum, $\phi(s, a)^\top \mu(s') = P(s'|s, a)$, thus it can be used as an alternative representation learning objective. However, the objective (8) is still intractable when we only have access to samples from $P$. To resolve the issue, we note that
$$L(\phi, \mu) := \mathbb{E}_{(s, a) \sim \rho_0} \left\| P(\cdot|s, a) - \phi(s, a)^\top \mu(\cdot) \right\|_2^2 = C - 2\, \mathbb{E}_{(s, a) \sim \rho_0,\, s' \sim P(s'|s, a)}\left[ \phi(s, a)^\top \mu(s') \right] + \mathbb{E}_{(s, a) \sim \rho_0}\left[ \int_{\mathcal{S}} \left( \phi(s, a)^\top \mu(s') \right)^2 ds' \right], \tag{9}$$
where $C = \mathbb{E}_{(s, a) \sim \rho_0}\left[ \int_{\mathcal{S}} P(s'|s, a)^2\, ds' \right]$ is a problem-dependent constant. For the third term, we turn to an approximation method by the reparameterization $\mu(s') = p(s') \mu'(s')$:
$$\mathbb{E}_{(s, a) \sim \rho_0}\left[ \int_{\mathcal{S}} \left( \phi(s, a)^\top \mu(s') \right)^2 ds' \right] = \mathrm{Trace}\left( \mathbb{E}_{(s, a) \sim \rho_0}\left[ \phi(s, a) \phi(s, a)^\top \right] \mathbb{E}_{p}\left[ p(s') \mu'(s') \mu'(s')^\top \right] \right).$$
Under the constraint that $\mathbb{E}_{s, a}\left[ \phi(s, a) \phi(s, a)^\top \right] = I_d / d$, we have
$$\mathrm{Trace}\left( \mathbb{E}_{(s, a) \sim \rho_0}\left[ \phi(s, a) \phi(s, a)^\top \right] \mathbb{E}_{p}\left[ p(s') \mu'(s') \mu'(s')^\top \right] \right) = \mathbb{E}_{p}\left[ p(s') \mu'(s')^\top \mu'(s') \right] / d.$$
Hence, (8) can be written equivalently as:
$$\min_{\phi, \mu'}\ -\mathbb{E}_{(s, a, s') \sim \rho_0 \times P}\left[ \phi(s, a)^\top \mu'(s') p(s') \right] + \mathbb{E}_{p(s')}\left[ p(s') \mu'(s')^\top \mu'(s') \right] / (2d) \quad \text{s.t.} \quad \mathbb{E}_{(s, a) \sim \rho_0}\left[ \phi(s, a) \phi(s, a)^\top \right] = I_d / d, \tag{10}$$
which is exactly the dual form of the SVD in (7). Such an equivalence reveals an interesting connection between model-free and model-based representation learning, obtained through duality, which indicates that the spectral representation learned via SVD is implicitly minimizing the model error in $L_2$ norm. This connection paves the way for theoretical analysis.

Algorithm 1 Online Exploration with SPEDER
1: Input: Regularizer $\lambda_n$, parameter $\alpha_n$, model class $\mathcal{F} = \{(\phi, \mu) : \phi \in \Phi, \mu \in \Psi\}$, number of iterations $N$
2: Initialize $\pi_0(\cdot \mid s)$ to be uniform; set $\mathcal{D}_0 = \emptyset$, $\mathcal{D}'_0 = \emptyset$
3: for episode $n = 1, \cdots, N$ do
4:   Collect the transition $(s, a, s', a', \tilde{s})$ where $s \sim d^{\pi_{n-1}}_{P^\star}$, $a \sim \mathcal{U}(\mathcal{A})$, $s' \sim P^\star(\cdot|s, a)$, $a' \sim \mathcal{U}(\mathcal{A})$, $\tilde{s} \sim P^\star(\cdot|s', a')$, where $\mathcal{U}(\mathcal{A})$ denotes the uniform distribution on $\mathcal{A}$
5:   $\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{(s, a, s')\}$, $\mathcal{D}'_n = \mathcal{D}'_{n-1} \cup \{(s', a', \tilde{s})\}$
6:   Learn representation $\phi_n(s, a)$ with $\mathcal{D}_n \cup \mathcal{D}'_n$ via (10)
7:   Update the empirical covariance matrix $\Sigma_n = \sum_{(s, a) \in \mathcal{D}_n} \phi_n(s, a) \phi_n(s, a)^\top + \lambda_n I$
8:   Set the exploration bonus $b_n(s, a) = \alpha_n\, \phi_n(s, a)^\top \Sigma_n^{-1} \phi_n(s, a)$
9:   Update policy $\pi_n = \arg\max_{\pi} V^{\pi}_{P_n,\, r + b_n}$
10: end for
11: Return $\pi_1, \cdots, \pi_N$
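To make the sample-based form of objective (10) concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the features have already been evaluated on a batch, that the density values of the auxiliary distribution p are available, and that the covariance constraint is relaxed to a squared-Frobenius penalty; the text only states that a penalty method is used, so this penalty form, the function name `speder_loss`, and its arguments are hypothetical.

```python
import numpy as np

def speder_loss(phi_sa, mu_next, p_next, mu_p, p_p, penalty=1.0):
    """Monte Carlo estimate of objective (10) plus a covariance penalty.

    phi_sa : (n, d) features phi(s_i, a_i) for sampled transitions
    mu_next: (n, d) features mu'(s'_i) at the observed next states
    p_next : (n,)   density values p(s'_i) at the observed next states
    mu_p   : (m, d) features mu'(s_j) at states s_j drawn from p
    p_p    : (m,)   density values p(s_j) at those states
    """
    n, d = phi_sa.shape
    # First term: -E_{rho0 x P}[ phi(s, a)^T mu'(s') p(s') ]
    fit = -np.mean(np.sum(phi_sa * mu_next, axis=1) * p_next)
    # Second term: E_p[ p(s') mu'(s')^T mu'(s') ] / (2d)
    norm = np.mean(p_p * np.sum(mu_p ** 2, axis=1)) / (2.0 * d)
    # Assumed soft version of the constraint E[phi phi^T] = I_d / d
    cov = phi_sa.T @ phi_sa / n
    ortho = np.sum((cov - np.eye(d) / d) ** 2)
    return fit + norm + penalty * ortho

# Toy check: two (s, a) pairs, two next states, deterministic transitions,
# uniform p, and features chosen so that phi^T (p * mu') equals P exactly.
phi_sa = np.array([[1.0, 0.0], [0.0, 1.0]])
mu_next = np.array([[2.0, 0.0], [0.0, 2.0]])
p_next = np.array([0.5, 0.5])
loss = speder_loss(phi_sa, mu_next, p_next, mu_next, p_next)
```

In the toy check the factorization is exact and the covariance constraint holds, so the penalty vanishes and the value equals $(L - C)/2$ with $L = 0$ and $C = 1$, in line with the expansion (9). In practice the two feature maps would be neural networks and this quantity would be minimized by stochastic gradient steps.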

3.1. ONLINE EXPLORATION AND OFFLINE POLICY OPTIMIZATION WITH SPEDER

Unlike existing spectral representation learning algorithms, where the features are learned based on a pre-collected static dataset, we can use SPEDER to perform sample efficient online exploration. In Algorithm 1, we show how to use the representation obtained from the solution to (10) to perform sample efficient online exploration under the principle of optimism in the face of uncertainty. Central to the algorithm is the newly proposed representation learning procedure (Line 6 in Algorithm 1), which learns the representation $\phi(s, a)$ and the model $P(s'|s, a) = \phi(s, a)^\top \mu(s')$ with adaptively collected exploratory data. After recovering the representation, we use the standard elliptical potential (Jin et al., 2020; Uehara et al., 2022) as the bonus (Line 8 in Algorithm 1) to enforce exploration. We then plan using the learned model $P_n$ with the reward bonus $b_n$ to obtain a new policy that is used to collect additional exploratory data. These procedures iterate, comprising Algorithm 1.

SPEDER can also be combined with the pessimism principle to perform sample efficient offline policy optimization. Unlike the online setting where we enforce exploration by adding a bonus to the reward, we now subtract the elliptical potential from the reward to avoid risky behavior. For completeness, we include the algorithm for offline policy optimization in Appendix C.

On the requirements of $P_n$. As we need to plan with the learned model $P_n$, we generally require $P_n$ to be a valid transition kernel, but the representation learning objective (10) does not explicitly enforce this. Therefore, in our implementations, we use the data from the replay buffer collected during past executions to perform planning. We can also encourage $P_n$ to be a valid probability distribution by adding the following regularization term (Ma & Collins, 2018):
$$\mathbb{E}_{(s, a)}\left[ \left( \log \int_{\mathcal{S}} \phi(s, a)^\top \mu'(s') p(s')\, ds' \right)^2 \right],$$
which can be approximated with samples from $p(s')$.
Obviously, the regularization is non-negative and achieves zero when $\int_{\mathcal{S}} \phi(s, a)^\top \mu'(s') p(s')\, ds' = 1$.

Practical Implementation. We parameterize $\phi(s, a)$ and $\mu'(s')$ as separate MLP networks, and train them by optimizing objective (10). Instead of using a linear $Q$ on top of $\phi(s, a)$, as suggested by the low-rank MDP, we parameterize the critic network as a two-layer MLP on top of the learned representation $\phi(s, a)$ to support the nonlinear exploration bonus and entropy regularization. Unlike other representation learning methods in RL, we do not backpropagate the gradient from TD-learning to the representation network $\phi(s, a)$. To train the policy, we use the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018), and alternate between policy optimization and critic training.
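The covariance update and exploration bonus in Lines 7 and 8 of Algorithm 1 can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function name `elliptical_bonus` and its arguments are hypothetical, and the optional `take_sqrt` flag is an addition of this example that switches to the square-root form of the elliptical bonus, which is common in the linear MDP literature; the default follows the quadratic form as printed in Algorithm 1.

```python
import numpy as np

def elliptical_bonus(phi_data, phi_query, lam=1.0, alpha=1.0, take_sqrt=False):
    """Elliptical exploration bonus computed from learned features.

    phi_data : (n, d) features of the (s, a) pairs collected so far
    phi_query: (m, d) features of the (s, a) pairs to score
    lam      : ridge regularizer (lambda_n in Algorithm 1)
    alpha    : bonus scale (alpha_n in Algorithm 1)
    take_sqrt: assumed option, returns alpha * sqrt(phi^T Sigma^{-1} phi)
    """
    d = phi_data.shape[1]
    # Empirical covariance: sum of outer products plus lam * I
    sigma = phi_data.T @ phi_data + lam * np.eye(d)
    # Solve Sigma x = phi for every query instead of forming the inverse
    solved = np.linalg.solve(sigma, phi_query.T)          # (d, m)
    quad = np.sum(phi_query * solved.T, axis=1)           # phi^T Sigma^{-1} phi
    return alpha * (np.sqrt(quad) if take_sqrt else quad)

# Nine visits along the first feature direction, none along the second.
phi_data = np.tile(np.array([[1.0, 0.0]]), (9, 1))
phi_query = np.array([[1.0, 0.0], [0.0, 1.0]])
bonus = elliptical_bonus(phi_data, phi_query)
```

The rarely visited direction receives the larger bonus, which is what drives optimistic exploration. For the pessimistic offline variant described above, the same quantity would be subtracted from the reward rather than added.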

3.2. SPECTRAL REPRESENTATION FOR LATENT BEHAVIORAL CLONING

We additionally expand the use of the learned spectral representation $\phi(s, a)$ as skills for downstream imitation learning, which seeks to mimic a given set of expert demonstrations. Specifically, recall the correspondence between the max-entropy policy and the Q-function, i.e.,
$$\pi_Q(a|s) := \frac{\exp(Q(s, a))}{\sum_{a \in \mathcal{A}} \exp(Q(s, a))} = \arg\max_{\pi(\cdot|s) \in \Delta(\mathcal{A})} \mathbb{E}_{\pi}\left[ Q(s, a) \right] + H(\pi), \tag{12}$$
where $H(\pi) := -\sum_{a \in \mathcal{A}} \pi(a|s) \log \pi(a|s)$ is the entropy. Therefore, given a set of linear basis functions for $Q$, $\{\phi_i\}_{i=1}^{d}$, we can construct the policy basis, or skill set, based on $\phi$ according to (12), which induces the policy family $\pi_w(a|s) \propto \exp(w^\top \phi(s, a))$. We emphasize that the policy construction from the skills is no longer linear. This inspires us to use a latent variable composition to approximate the policy construction, i.e.,
$$\pi(a|s) = \int \pi_\alpha(a|s, z)\, \pi_Z(z|s)\, dz,$$
with $z = \phi(s, a)$ to insert the learned representation. The policy decoder $\pi_\alpha : \mathcal{S} \times \mathcal{Z} \to \Delta(\mathcal{A})$ and the policy encoder $\pi_Z : \mathcal{S} \to \Delta(\mathcal{Z})$ can be composed to form the final policy. We assume access to a fixed set of expert transitions $\mathcal{D}^{\pi^*} = \{(s_t, a_t, s_{t+1}) : s_t \sim d^{\pi^*}_{P},\ a_t \sim \pi_E(s_t),\ s_{t+1} \sim P(s' \mid s_t, a_t)\}$. In practice, while expert demonstrations can be expensive to acquire, non-expert interaction data from the same environment can be easier to collect at scale, and provide additional information about the transition dynamics of the environment. We denote by $\mathcal{D}^{\mathrm{off}} = \{(s, a, s')\}$ the offline transitions from the same MDP, which are collected by a non-expert policy with suboptimal performance (e.g., an exploratory policy). We follow latent behavioral cloning (Yang et al., 2021; Yang & Nachum, 2021), where learning is separated into a pre-training phase, in which a representation $\phi : \mathcal{S} \times \mathcal{A} \to \mathcal{Z}$ and a policy decoder $\pi_\alpha : \mathcal{S} \times \mathcal{Z} \to \Delta(\mathcal{A})$ are learned on the basis of the suboptimal dataset $\mathcal{D}^{\mathrm{off}}$, and a downstream imitation phase that learns a latent policy $\pi_Z : \mathcal{S} \to \Delta(\mathcal{Z})$ using the expert dataset $\mathcal{D}^{\pi^*}$.
With SPEDER, we perform latent behavior cloning as follows:
1. Pretraining Phase: We pre-train $\phi(s, a)$ and $\mu(s')$ on $\mathcal{D}^{\mathrm{off}}$ by minimizing the objective (10). Additionally, we train a policy decoder $\pi_\alpha(a \mid s, \phi(s, a))$ that maps latent action representations to actions in the original action space, by minimizing the action decoding error: $\mathbb{E}_{(s, a) \sim d^{\mathrm{off}}_{P}}\left[ -\log \pi_\alpha(a \mid s, \phi(s, a)) \right]$.
2. Downstream Imitation Phase: We train a latent policy $\pi_Z : \mathcal{S} \to \Delta(\mathcal{Z})$ by minimizing the latent behavioral cloning error: $\mathbb{E}_{(s, a) \sim d^{\pi^*}_{P}}\left[ -\log \pi_Z(\phi(s, a) \mid s) \right]$.
At inference time, given the current state $s \in \mathcal{S}$, we sample a latent action representation $z \sim \pi_Z(s)$, then decode the action $a \sim \pi_\alpha(a \mid s, z)$.
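The policy family induced by the spectral features in this section, with action probabilities proportional to the exponential of a linear function of the features, and its max-entropy interpretation can be checked numerically for a finite action set. The sketch below is an illustration only: the feature matrix is random rather than learned, and the helper names `softmax_policy` and `maxent_objective` are hypothetical.

```python
import numpy as np

def softmax_policy(w, phi_s):
    """Policy pi_w(a|s) proportional to exp(w^T phi(s, a)).

    phi_s: (num_actions, d) matrix whose rows are phi(s, a) for a fixed state s.
    """
    logits = phi_s @ w
    logits = logits - logits.max()      # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def maxent_objective(pi, q):
    """E_pi[Q(s, a)] + H(pi), with H the Shannon entropy of pi(.|s)."""
    return float(pi @ q - np.sum(pi * np.log(pi)))

rng = np.random.default_rng(0)
num_actions, d = 5, 3
phi_s = rng.normal(size=(num_actions, d))   # stand-in for learned features
w = rng.normal(size=d)

q = phi_s @ w                               # Q(s, .) linear in the features
pi_w = softmax_policy(w, phi_s)
best_value = maxent_objective(pi_w, q)
log_partition = float(np.log(np.sum(np.exp(q))))
uniform_value = maxent_objective(np.full(num_actions, 1.0 / num_actions), q)
```

Here `best_value` coincides with the log-partition function, the known optimal value of the entropy-regularized objective, and is no smaller than the value achieved by the uniform policy. This is the sense in which each weight vector defines a skill from the same feature basis.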

4. THEORETICAL ANALYSIS

In this section, we establish generalization properties of the proposed representation learning algorithm, and provide sample complexity and error bounds when the proposed representation learning algorithm is applied to online exploration and offline policy optimization.

4.1. NON-ASYMPTOTIC GENERALIZATION BOUND

We first state a performance guarantee on the representation learned with the proposed objective.

Theorem 1. Assume the size of the candidate model class $|\mathcal{F}| < \infty$, $P \in \mathcal{F}$, and for any $\tilde{P} \in \mathcal{F}$, $\tilde{P}(s'|s, a) \leqslant C$ for all $(s, a, s')$. Given the dataset $\mathcal{D} := \{(s_i, a_i, s'_i)\}_{i=1}^{n}$ where $(s_i, a_i) \sim \rho_0$, $s'_i \sim P(\cdot|s_i, a_i)$, the estimator $\hat{P}$ obtained from the empirical surrogate of (9) satisfies the following inequality with probability at least $1 - \delta$:
$$\mathbb{E}_{(s, a) \sim \rho_0} \left\| P(\cdot|s, a) - \hat{P}(\cdot|s, a) \right\|_2^2 \leqslant \frac{C' \log(|\mathcal{F}| / \delta)}{n}, \tag{13}$$
where $C'$ is a constant that only depends on $C$, which we will omit in the following analysis.

Note that the i.i.d. data assumption can be relaxed to an assumption that the data generating process is a martingale process. This is essential for proving the sample complexity of online exploration, as the data are collected in an adaptive manner. The proofs are deferred to Appendix D.1.

4.2. SAMPLE COMPLEXITIES OF ONLINE EXPLORATION AND OFFLINE POLICY OPTIMIZATION

Next, we establish sample complexities for the online exploration and offline policy optimization problems. We still assume $P \in \mathcal{F}$. As the generalization bound (13) only guarantees the expected $L_2$ distance, we need to make the following additional assumptions on the representation and reward:

Assumption 2 (Representation Normalization). $\forall \phi \in \Phi$, we have $\int_{\mathcal{S}} \left( \int_{\mathcal{A}} \|\phi(s, a)\|_2\, da \right)^2 ds \leqslant d$.

Assumption 3 (Reward Normalization). $\int_{\mathcal{S}} \left( \int_{\mathcal{A}} r(s, a)\, da \right)^2 ds \leqslant d$, where $r$ is the reward function.

A simple example that satisfies both Assumptions 2 and 3 is a tabular MDP with features $\phi(s, a)$ forming the canonical basis in $\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. In this case, we have $d = |\mathcal{S}||\mathcal{A}|$, hence Assumption 2 naturally holds. Furthermore, since $r(s, a) \in [0, 1]$, it is also straightforward to verify that Assumption 3 holds for a tabular MDP. Such an assumption can also be satisfied for a continuous state space where the volume of the state space satisfies $\mu(\mathcal{S}) \leqslant \frac{d}{|\mathcal{A}|}$. Since we need to plan on $\hat{P}$, we also assume $\hat{P}$ is a valid transition kernel. With Assumptions 2 and 3 in hand, we are now ready to provide the sample complexities of online exploration and offline policy optimization. The proofs are deferred to Appendix D.2 and D.3.

Theorem 2 (PAC Guarantee for Online Exploration). Assume $|\mathcal{A}| < \infty$. After interacting with the environment for $N = \widetilde{\Theta}\left( \frac{d^4 |\mathcal{A}|^2}{(1 - \gamma)^6 \epsilon^2} \right)$ episodes, where $\widetilde{\Theta}$ omits log-factors, we obtain a policy $\pi$ such that $V^{\pi^*}_{P,r} - V^{\pi}_{P,r} \leqslant \epsilon$ with high probability, where $\pi^*$ is the optimal policy.

Furthermore, note that we can obtain a sample from the state visitation distribution $d^{\pi}_{P}$ by terminating with probability $1 - \gamma$ at each step. Hence, for each episode, we can terminate within $\widetilde{\Theta}(1 / (1 - \gamma))$ steps with high probability.

Theorem 3 (PAC Guarantee for Offline Policy Optimization). Let $\omega = \min_{s, a} \frac{1}{\pi_b(a|s)}$ where $\pi_b$ is the behavior policy. With probability $1 - \delta$, for all baseline policies $\pi$ including history-dependent non-Markovian policies, the learned policy $\hat{\pi}$ satisfies
$$V^{\pi}_{P,r} - V^{\hat{\pi}}_{P,r} \lesssim \frac{\omega^2 d^4 C^*_{\pi} \log(|\mathcal{F}| / \delta)}{(1 - \gamma)^6},$$
where $C^*_{\pi}$ is the relative condition number under $\phi^*$, which measures the quality of the offline data:
$$C^*_{\pi} := \sup_{x \in \mathbb{R}^d} \frac{x^\top \mathbb{E}_{(s, a) \sim d^{\pi}_{P}}\left[ \phi^*(s, a) \phi^*(s, a)^\top \right] x}{x^\top \mathbb{E}_{(s, a) \sim \rho_b}\left[ \phi^*(s, a) \phi^*(s, a)^\top \right] x}.$$
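The remark following Theorem 2, that a draw from the discounted state visitation distribution can be obtained by stopping a rollout with probability 1 - γ at each step, can be sketched in plain Python as below. The environment interface (`reset`, `step`, `policy` as callables) and the toy chain are assumptions made purely for this example.

```python
import random

def sample_visitation_state(reset, step, policy, gamma, rng):
    """Draw one state from the discounted visitation distribution of a policy.

    At every step the rollout stops with probability 1 - gamma, so the state
    at time t is returned with probability (1 - gamma) * gamma**t.
    """
    s = reset(rng)
    while rng.random() < gamma:        # continue with probability gamma
        a = policy(s, rng)
        s = step(s, a, rng)
    return s

# Hypothetical toy chain in which the state is simply the time index,
# so the returned state reveals the random stopping time directly.
rng = random.Random(0)
gamma = 0.5
samples = [
    sample_visitation_state(
        reset=lambda r: 0,
        step=lambda s, a, r: s + 1,
        policy=lambda s, r: 0,
        gamma=gamma,
        rng=rng,
    )
    for _ in range(20000)
]
frac_t0 = samples.count(0) / len(samples)   # should be close to 1 - gamma
mean_t = sum(samples) / len(samples)        # should be close to gamma / (1 - gamma)
```

The stopping time is geometric with mean γ/(1 - γ), which is why each episode ends within a number of steps of order 1/(1 - γ) up to logarithmic factors with high probability.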

5. RELATED WORK

Aside from the family of spectral decomposition representation methods reviewed in Section 2, there have been many attempts to provide representation learning algorithms for RL in different problem settings. Learning action representations, or abstractions, such as temporally-extended skills, has been a long-standing focus of hierarchical RL (Dietterich et al., 1998; Sutton et al., 1999; Kulkarni et al., 2016a; Nachum et al., 2018) for solving temporally-extended tasks. Recently, many algorithms have been proposed for online unsupervised skill discovery, which can reduce the cost of exploration and the sample complexity of online RL algorithms. A class of methods extracts temporally-extended skills by maximizing a mutual information objective (Eysenbach et al., 2018; Sharma et al., 2019; Lynch et al., 2020) or minimizing divergences (Lee et al., 2019). Unsupervised skill discovery has also been studied in offline settings, where the goal is to pre-train useful skill representations from offline trajectories, in order to accelerate learning on downstream RL tasks (Yang & Nachum, 2021). Such methods include OPAL (Ajay et al., 2020), SPiRL (Pertsch et al., 2020), and SkiLD (Pertsch et al., 2021), which exploit a latent variable model with an autoencoder for skill acquisition; and PARROT (Singh et al., 2020), which learns a behavior prior with flow-based models. Another offline representation learning algorithm, TRAIL (Yang et al., 2021), uses a contrastive implementation of the MLE for an energy-based model to learn state-action features. These algorithms achieve empirical improvements in different problem settings, such as imitation learning and policy transfer. However, as far as we know, the coupling between exploration and representation learning has not been well handled, and there is no rigorous characterization yet for these algorithms.
Another line of research focuses on theoretically guaranteed representation learning in RL, either by limiting the flexibility of the models or by ignoring the practical issue of computational cost. For example, Du et al. (2019a) and Misra et al. (2020) considered representation learning in block MDPs, where the representation can be learned via regression. However, the corresponding representation ability is exponentially weaker than that of low-rank MDPs (Agarwal et al., 2020). Ren et al. (2021) exploited representation from arbitrary dynamics models, but restricted the noise model to be Gaussian. On the other hand, Agarwal et al. (2020), Modi et al. (2021), Uehara et al. (2022), Zhang et al. (2022b), and Chen et al. (2022) provably extracted spectral features in low-rank MDPs with exploration, but these methods rely on a strong computation oracle, which is difficult to implement in practice. In contrast, SPEDER enjoys both theoretical and empirical advantages. We provide a tractable surrogate with an efficient algorithm for spectral feature learning with exploration in low-rank MDPs. We have established its sample complexity and next demonstrate its superior empirical performance.

6. EXPERIMENTS

We evaluate SPEDER on the dense-reward MuJoCo tasks (Brockman et al., 2016) and sparse-reward DeepMind Control Suite tasks (Tassa et al., 2018). In MuJoCo tasks, we compare with model-based baselines (e.g., PETS (Chua et al., 2018), ME-TRPO (Kurutach et al., 2018)) and model-free baselines (e.g., SAC (Haarnoja et al., 2018), PPO (Schulman et al., 2017)), showing strong performance compared to SoTA RL algorithms. In particular, we find that in the sparse-reward DeepMind Control tasks, the optimistic SPEDER significantly outperforms the SoTA model-free RL algorithms. We also evaluate the method on offline behavioral cloning tasks in the AntMaze environment using the D4RL benchmark (Fu et al., 2020), and show comparable results to state-of-the-art representation learning methods. Additional details about the experiment setup are described in Appendix F.

6.1. ONLINE PERFORMANCE WITH THE SPECTRAL REPRESENTATION

We evaluate the proposed algorithm on the dense-reward MuJoCo benchmark from MBBL (Wang et al., 2019). We compare SPEDER with several model-based RL baselines (PETS (Chua et al., 2018), ME-TRPO (Kurutach et al., 2018)) and SoTA model-free RL baselines (SAC (Haarnoja et al., 2018), PPO (Schulman et al., 2017)). Following the standard evaluation protocol in MBBL, we ran all the algorithms for 200K environment steps. The results are averaged across four random seeds with window size 20K. In Table 2, we show that SPEDER achieves SoTA results among all model-based RL algorithms and improves significantly on the prior baselines. We also compare the algorithm with the SoTA model-free RL method SAC. The proposed method achieves comparable or better performance in most of the tasks. Lastly, compared to two representation learning baselines (Deep SF (Kulkarni et al., 2016b) and SPEDE (Ren et al., 2021)), SPEDER also shows superior performance, which demonstrates that the proposed SPEDER is able to overcome the aforementioned drawbacks of vanilla spectral representations.

6.2. EXPLORATION IN SPARSE-REWARD DEEPMIND CONTROL SUITE

To evaluate the exploration performance of SPEDER, we additionally run experiments on the DeepMind Control Suite. We compare the proposed method with SAC (including a 2-layer, 3-layer and 5-layer MLP for the critic network), PPO, Dreamer-v2 (Hafner et al., 2020), Deep SF (Kulkarni et al., 2016b) and Proto-RL (Yarats et al., 2021). Since the original Dreamer and Proto-RL are designed for image-based control tasks, we adapt them to run the state-based tasks; details can be found in Appendix F. We run all the algorithms for 200K environment steps across four random seeds with a window size of 20K. From Table 3, we see that SPEDER achieves superior performance compared to SAC using the 2-layer critic network. Compared to SAC and PPO with deeper critic networks, SPEDER shows significant gains in tasks with sparse reward (e.g., walker-run-sparse and hopper-hop).

6.3. IMITATION LEARNING PERFORMANCE ON ANTMAZE NAVIGATION

We additionally experiment with using SPEDER features for downstream imitation learning. We consider the challenging AntMaze navigation domain (shown in Figure 3) from D4RL (Fu et al., 2020), which consists of an 8-DoF quadruped robot whose task is to navigate towards a goal position in the maze environment. We compare SPEDER to several recent state-of-the-art methods for pre-training representations from suboptimal offline data, including OPAL (Ajay et al., 2020), SPiRL (Pertsch et al., 2020), SkiLD (Pertsch et al., 2021), and TRAIL (Yang et al., 2021). For OPAL, SPiRL, and SkiLD, we use horizons of t = 1 and t = 10 for learning temporally-extended skills. For TRAIL, we report the performance of the TRAIL energy-based model (EBM) as well as the TRAIL Linear model with random Fourier features (Rahimi & Recht, 2007). Following the behavioral cloning setup in Yang et al. (2021), we use a set of 10 expert trajectories of the agent navigating from one corner of the maze to the opposite corner as the expert dataset $\mathcal{D}^{\pi^*}$. For the suboptimal dataset $\mathcal{D}^{\text{off}}$, we use the "diverse" datasets from D4RL (Fu et al., 2020), which consist of 1M samples of the agent navigating from different initial locations to different goal positions. We report the average return on AntMaze tasks in Figure 4, and observe that SPEDER achieves performance comparable to other state-of-the-art representations on downstream imitation learning. The comparison and experiment details can be found in Appendix F.

7. CONCLUSION

We have proposed a novel objective, Spectral Decomposition Representation (SPEDER), that factorizes the state-action transition kernel to obtain policy-independent spectral features. We show how to use the representations obtained with SPEDER to perform sample efficient online and offline RL, as well as imitation learning. We provide a thorough theoretical analysis of SPEDER and empirical comparisons on multiple RL benchmarks, demonstrating the effectiveness of SPEDER.

A MORE RELATED WORK

Representation learning in RL has attracted more attention in recent years. Within model-based RL (MBRL), many methods for learning representations of the reward and the dynamics have been proposed. Several recent MBRL methods learn latent state representations to be used for planning in latent space as a way to improve model-based policy optimization (Oh et al., 2017; Silver et al., 2018; Racanière et al., 2017; Hafner et al., 2019) . Beyond MBRL, there also exist many algorithms for learning useful state representations to accelerate RL. For example, recent works have introduced unsupervised auxiliary losses to significantly improve RL performance (Pathak et al., 2017; Oord et al., 2018; Laskin et al., 2020; Jaderberg et al., 2016) . Contrastive losses (Oord et al., 2018; Anand et al., 2019; Srinivas et al., 2020; Stooke et al., 2021) , which encourage similar states to be closer in embedding space, where the notion of similarity is usually defined in terms of temporal distance (Anand et al., 2019; Sermanet et al., 2018) or image-based data augmentations (Srinivas et al., 2020) , also show promising performance. Within goal-conditioned RL (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017) , various representation learning algorithms have been proposed to handle high-dimensional observation and goal spaces, such as using a variational autoencoder (Nair et al., 2018; Pong et al., 2019) , or representations that explicitly capture useful information for control, while ignoring irrelevant factors of variation in the observation (Ghosh et al., 2018; Lee et al., 2020) . Beyond these representations on the state space, there are other kinds of representations that are designed for specific tasks. For example, Touati & Ollivier (2021) proposed to deal with the reward transfer task by learning a reward-dependent feature F (s, a, r) such that the greedy policy with respect to F (s, a, r) ⊤ r is optimal under r.

B IMPLEMENTATION DETAILS

In this section, we provide more implementation details of SPEDER for online exploration.

• Representation Learning. We parameterize the representation networks $\phi_\theta(s, a)$ and $\mu(s')$, and optimize the representation in Line 6 of Algorithm 1 with the data collected in the replay buffer $\mathcal{D}$, by minimizing the following objective:

$$
\begin{aligned}
L(\phi, \mu) := {} & -\frac{1}{|\mathcal{D}|} \sum_{(s_i, a_i, s_{i+1}) \in \mathcal{D}} \phi(s_i, a_i)^\top \mu(s_{i+1})\, p(s_{i+1}) + \frac{1}{2d\,|\mathcal{D}_{\text{base}}|} \sum_{s_j \in \mathcal{D}_{\text{base}}} p(s_j)\, \mu(s_j)^\top \mu(s_j) \\
& + \frac{\lambda_{\text{ortho}}}{|\mathcal{D}|^2} \sum_{(s_i, a_i) \sim \mathcal{D}} \sum_{(s'_i, a'_i) \sim \mathcal{D}} \left[ \sum_{j,k \in [d]} \left( \phi_j(s_i, a_i)\phi_k(s_i, a_i) - \frac{\delta_{jk}}{d} \right) \left( \phi_j(s'_i, a'_i)\phi_k(s'_i, a'_i) - \frac{\delta_{jk}}{d} \right) \right] \\
& + \frac{\lambda_{\text{prob}}}{|\mathcal{D}|} \sum_{(s_i, a_i) \in \mathcal{D}} \left[ \left( \log \frac{1}{|\mathcal{D}_{\text{base}}|} \sum_{s_j \in \mathcal{D}_{\text{base}}} \phi(s_i, a_i)^\top \mu(s_j) \right)^2 \right],
\end{aligned}
$$

where $p(s)$ is a base measure on the state space, $\mathcal{D}_{\text{base}} = \{s_j\}$ with $s_j \sim p(s)$, and $\lambda_{\text{ortho}}$ and $\lambda_{\text{prob}}$ are the coefficients of the regularizers that help enforce ϕ to be orthogonal (see Wu et al. (2018) for more details) and $\phi(s, a)^\top \mu(s')$ to be a valid conditional density (see Ma & Collins (2018) for more details), respectively.

• Planning module. We implement Line 9 of Algorithm 1 with the SAC algorithm (Haarnoja et al., 2018) on top of the learned feature. Specifically,
  - We parameterize the critic network $Q_\theta$ as a two-layer MLP on top of the representation $\phi_\theta(s, a)$, whose parameters are frozen.
  - The critic network and the actor network in SAC are both updated with the samples collected in the replay buffer.

• Exploration bonus. We can optionally add the exploration bonus (Line 8 of Algorithm 1), as discussed in the main text.
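To make the four terms of the objective L(ϕ, µ) above concrete, the following is a minimal NumPy sketch that evaluates them on pre-computed network outputs. It is an illustration under stated assumptions, not the authors' implementation: the function name `speder_loss`, the argument layout, and the clipping constant `eps` (added only to keep the logarithm finite) are hypothetical, and both sums of the orthogonality term are taken over the same batch.

```python
import numpy as np


def speder_loss(phi, mu_next, p_next, mu_base, p_base,
                lam_ortho=1.0, lam_prob=1.0, eps=1e-8):
    """Evaluate the four terms of L(phi, mu) on pre-computed outputs.

    phi     : (n, d) array, phi(s_i, a_i) for transitions in the buffer D
    mu_next : (n, d) array, mu(s_{i+1}) for the same transitions
    p_next  : (n,)   array, base measure p(s_{i+1})
    mu_base : (m, d) array, mu(s_j) for s_j in D_base
    p_base  : (m,)   array, base measure p(s_j)
    """
    n, d = phi.shape
    # Term 1: -(1/|D|) sum_i phi(s_i, a_i)^T mu(s_{i+1}) p(s_{i+1})
    fit = -np.mean(np.sum(phi * mu_next, axis=1) * p_next)
    # Term 2: (1 / (2 d |D_base|)) sum_j p(s_j) mu(s_j)^T mu(s_j)
    norm = np.mean(p_base * np.sum(mu_base ** 2, axis=1)) / (2.0 * d)
    # Term 3: with both sums over the same batch, the double sum factorizes
    # into the squared Frobenius norm of (mean phi phi^T - I / d).
    centered = (phi.T @ phi) / n - np.eye(d) / d
    ortho = np.sum(centered ** 2)
    # Term 4: squared log of the Monte Carlo normalizer of phi^T mu.
    # Clipping by eps is an assumption of this sketch, not part of the objective.
    density = (phi @ mu_base.T).mean(axis=1)
    prob = np.mean(np.log(np.clip(density, eps, None)) ** 2)
    return fit + norm + lam_ortho * ortho + lam_prob * prob


# Tiny hand-checkable example with d = 1.
phi = np.array([[1.0], [1.0]])
mu_next = np.array([[2.0], [2.0]])
p_next = np.array([0.5, 0.5])
mu_base = np.array([[1.0], [1.0]])
p_base = np.array([1.0, 1.0])
loss = speder_loss(phi, mu_next, p_next, mu_base, p_base)
```

In this toy case the orthogonality and normalization regularizers vanish, so the value is determined by the first two terms alone. In practice the arrays would be the outputs of the ϕ and µ networks on a minibatch, and the loss would be differentiated with an autodiff framework.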

C ALGORITHM FOR OFFLINE POLICY OPTIMIZATION

For completeness, we include the algorithm for offline policy optimization with SPEDER here.

Algorithm 2 Offline Policy Optimization with SPEDER

In this subsection, we consider the non-asymptotic generalization bound for the $\ell_2$ minimization, which is necessary for the proofs of a series of key lemmas (Lemma 7 and Lemma 15) that are used in the PAC guarantees for online and offline reinforcement learning. For simplicity, we denote the instance space as $\mathcal{X}$ and the target space as $\mathcal{Y}$, and we want to estimate the conditional density $p(y|x) = f^*(x, y)$. Assume we are given a function class $\mathcal{F} : (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$ with $f^* \in \mathcal{F}$, as well as the data $\mathcal{D} := \{(x_i, y_i)\}_{i=1}^n$, where $x_i \sim \mathcal{D}_i(x_{1:i-1}, y_{1:i-1})$, $y_i \sim p(\cdot|x_i)$, and $\mathcal{D}_i$ is some data generating process that depends on the previous samples (a.k.a. a martingale process). We define the tangent sequence $\mathcal{D}' := \{(x'_i, y'_i)\}_{i=1}^n$, where $x'_i \sim \mathcal{D}_i(x_{1:i-1}, y_{1:i-1})$ and $y'_i \sim p(\cdot|x'_i)$. Consider the estimator obtained by the following minimization problem:

$$
\hat{f} = \arg\min_{f \in \mathcal{F}} \left\{ \sum_{i=1}^n -2 f(x_i, y_i) + \sum_{i=1}^n \left[ \sum_{y \in \mathcal{Y}} f^2(x_i, y) \right] \right\},
$$

where the summation over the counting measure of $\mathcal{Y}$ for the discrete case can be replaced by the integration over the Lebesgue measure of $\mathcal{Y}$ for the continuous case. We first prove the following decoupling inequality, motivated by Lemma 24 of Agarwal et al. (2020).

Lemma 4. Let $L(f, \mathcal{D}) = \sum_{i=1}^n \ell(f, (x_i, y_i))$, let $\mathcal{D}'$ be a tangent sequence of $\mathcal{D}$, and let $\hat{f}(\mathcal{D})$ be any estimator taking the random variable $\mathcal{D}$ as input with range $\mathcal{F}$. Then

$$
\mathbb{E}_{\mathcal{D}} \left[ \exp\left( -L(\hat{f}(\mathcal{D}), \mathcal{D}) - \log \mathbb{E}_{\mathcal{D}'} \left[ \exp\left( -L(\hat{f}(\mathcal{D}), \mathcal{D}') \right) \right] - \log |\mathcal{F}| \right) \right] \le 1.
$$

Proof. Let π be the uniform distribution over $\mathcal{F}$ and $g : \mathcal{F} \to \mathbb{R}$ be any function. Define the following probability measure over $\mathcal{F}$:

$$
\mu(f) = \frac{\exp(g(f))}{\sum_{f' \in \mathcal{F}} \exp(g(f'))}.
$$
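Then for any probability distribution $\hat{\pi}$ over $\mathcal{F}$, we have:

$$
\begin{aligned}
0 \le \mathrm{KL}(\hat{\pi} \| \mu) &= \sum_{f \in \mathcal{F}} \hat{\pi}(f) \log \frac{\hat{\pi}(f)}{\mu(f)} \\
&= \sum_{f \in \mathcal{F}} \left[ \hat{\pi}(f) \log \hat{\pi}(f) - \hat{\pi}(f) g(f) \right] + \log \sum_{f \in \mathcal{F}} \exp(g(f)) \\
&= \sum_{f \in \mathcal{F}} \left[ \hat{\pi}(f) \log \hat{\pi}(f) + \hat{\pi}(f) \log |\mathcal{F}| \right] - \sum_{f \in \mathcal{F}} \hat{\pi}(f) g(f) + \log \mathbb{E}_{f \sim \pi} \exp(g(f)) \\
&= \mathrm{KL}(\hat{\pi} \| \pi) - \sum_{f \in \mathcal{F}} \hat{\pi}(f) g(f) + \log \mathbb{E}_{f \sim \pi} \exp(g(f)) \\
&\le \log |\mathcal{F}| - \sum_{f \in \mathcal{F}} \hat{\pi}(f) g(f) + \log \mathbb{E}_{f \sim \pi} \exp(g(f)).
\end{aligned}
$$

Re-arranging, we have that

$$
\sum_{f} \hat{\pi}(f) g(f) - \log |\mathcal{F}| \le \log \mathbb{E}_{f \sim \pi} \exp(g(f)).
$$

Taking $g(f) = -L(f, \mathcal{D}) - \log \mathbb{E}_{\mathcal{D}'}[\exp(-L(f, \mathcal{D}'))]$ and $\hat{\pi}(f) = \mathbb{1}_{\hat{f}(\mathcal{D})}$, we obtain that for any $\mathcal{D}$,

$$
-L(\hat{f}(\mathcal{D}), \mathcal{D}) - \log \mathbb{E}_{\mathcal{D}'} \exp(-L(\hat{f}(\mathcal{D}), \mathcal{D}')) - \log |\mathcal{F}| \le \log \mathbb{E}_{f \sim \pi} \frac{\exp(-L(f, \mathcal{D}))}{\mathbb{E}_{\mathcal{D}'} \exp(-L(f, \mathcal{D}'))}.
$$

We exponentiate both sides and take the expectation over $\mathcal{D}$, which gives

$$
\mathbb{E}_{\mathcal{D}} \left[ \exp\left( -L(\hat{f}(\mathcal{D}), \mathcal{D}) - \log \mathbb{E}_{\mathcal{D}'} \exp(-L(\hat{f}(\mathcal{D}), \mathcal{D}')) - \log |\mathcal{F}| \right) \right] \le \mathbb{E}_{f \sim \pi} \mathbb{E}_{\mathcal{D}} \left[ \frac{\exp(-L(f, \mathcal{D}))}{\mathbb{E}_{\mathcal{D}'} \left[ \exp(-L(f, \mathcal{D}')) \mid \mathcal{D} \right]} \right].
$$

Note that, conditioned on $\mathcal{D}$, the samples in the tangent sequence $\mathcal{D}'$ are independent, which leads to

$$
\mathbb{E}_{\mathcal{D}'} \left[ \exp(-L(f, \mathcal{D}')) \mid \mathcal{D} \right] = \prod_{i=1}^n \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \left[ \exp(-\ell(f, (x_i, y_i))) \right].
$$

As a result, we can peel off terms from $n$ down to 1 and cancel out terms in the numerator. Hence, we have

$$
\mathbb{E}_{\mathcal{D}} \left[ \exp\left( -L(\hat{f}(\mathcal{D}), \mathcal{D}) - \log \mathbb{E}_{\mathcal{D}'} \exp(-L(\hat{f}(\mathcal{D}), \mathcal{D}')) - \log |\mathcal{F}| \right) \right] \le 1,
$$

which concludes the proof.

Theorem 5. Assume $|\mathcal{F}| < \infty$, $f^* \in \mathcal{F}$ and $\|f(x, y)\|_\infty \le C$, $\forall f \in \mathcal{F}$. Then with probability at least $1 - \delta$, we have

$$
\sum_{i=1}^n \mathbb{E}_{x_i \sim \mathcal{D}_i} \|f^* - \hat{f}\|_2^2 \le C' \log(|\mathcal{F}|/\delta),
$$

where $C'$ only depends on $C$.

Proof. With Chernoff's method, we have that

$$
-\log \mathbb{E}_{\mathcal{D}'} \exp(-L(\hat{f}(\mathcal{D}), \mathcal{D}')) \le L(\hat{f}(\mathcal{D}), \mathcal{D}) + \log |\mathcal{F}| + \log(1/\delta).
$$

Take

$$
l(f, (x_i, y_i)) = 2\left( f^*(x_i, y_i) - f(x_i, y_i) \right) + \sum_{y \in \mathcal{Y}} \left[ f(x_i, y)^2 - f^*(x_i, y)^2 \right],
$$

and

$$
L(f, \mathcal{D}) = \rho \left( \sum_{i=1}^n 2\left( f^*(x_i, y_i) - f(x_i, y_i) \right) + \sum_{i=1}^n \sum_{y \in \mathcal{Y}} \left[ f(x_i, y)^2 - f^*(x_i, y)^2 \right] \right),
$$

where $\rho > 0$ is a constant to be determined later. As $\hat{f}(\mathcal{D})$ is obtained by minimizing $L(f, \mathcal{D})$, and $f^* \in \mathcal{F}$, we have $L(\hat{f}(\mathcal{D}), \mathcal{D}) \le L(f^*, \mathcal{D}) \le 0$. Furthermore, as $\mathcal{D}'$ is the tangent sequence of $\mathcal{D}$, direct computation shows

$$
-\log \mathbb{E}_{\mathcal{D}'} \exp(-L(\hat{f}(\mathcal{D}), \mathcal{D}')) \le \log \frac{|\mathcal{F}|}{\delta}.
$$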
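We now relate the term $-\log \mathbb{E}_{\mathcal{D}'} \exp(-L(\hat{f}(\mathcal{D}), \mathcal{D}'))$ to our target $\sum_{i=1}^n \mathbb{E}_{x_i \sim \mathcal{D}_i} \|\hat{f}(x_i, \cdot) - f^*(x_i, \cdot)\|_2^2$ using the method introduced in Zhang (2006). Note that $\sum_{y \in \mathcal{Y}} f(x, y) = 1$; as $\|f\|_\infty \le C$, a straightforward application of Hölder's inequality gives $\sum_{y \in \mathcal{Y}} f(x, y)^2 \le C$. We then consider the second moment of $l$. We have that

$$
\begin{aligned}
& \mathbb{E}_{y_i \sim f^*(x_i, \cdot)} \left[ l(f, (x_i, y_i))^2 \right] + \mathbb{E}_{y_i \sim f(x_i, \cdot)} \left[ l(f, (x_i, y_i))^2 \right] \\
&\quad = 4 \sum_{y \in \mathcal{Y}} \left( f(x_i, y) + f^*(x_i, y) \right) \left( f(x_i, y) - f^*(x_i, y) \right)^2 - 3 \left( \sum_{y \in \mathcal{Y}} \left( f^*(x_i, y)^2 - f(x_i, y)^2 \right) \right)^2 \\
&\quad \le \sum_{y \in \mathcal{Y}} \left( f(x_i, y) - f^*(x_i, y) \right)^2 \left( 8C + 3 \sum_{y \in \mathcal{Y}} \left( f(x_i, y) + f^*(x_i, y) \right)^2 \right) \\
&\quad \le 20C \sum_{y \in \mathcal{Y}} \left( f(x_i, y) - f^*(x_i, y) \right)^2 \\
&\quad = 20C\, \mathbb{E}_{y_i \sim f^*(x_i, \cdot)} \left[ l(f, (x_i, y_i)) \right].
\end{aligned}
$$

As $\mathbb{E}_{y_i \sim f(x_i, \cdot)} [l(f, (x_i, y_i))^2] \ge 0$, we can conclude that

$$
\mathbb{E}_{y_i \sim f^*(x_i, \cdot)} \left[ l(f, (x_i, y_i))^2 \right] \le 20C\, \mathbb{E}_{y_i \sim f^*(x_i, \cdot)} \left[ l(f, (x_i, y_i)) \right].
$$

Furthermore, it is straightforward to see that $|l(f, (x_i, y_i))| \le 3C$. With the last bound in Proposition 1.2 in Zhang (2006), we have that

$$
\begin{aligned}
\log \mathbb{E}_{\mathcal{D}'} \exp(-L(\hat{f}(\mathcal{D}), \mathcal{D}')) &= \sum_{i=1}^n \log \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \left[ \exp(-\rho\, l(\hat{f}, (x_i, y_i))) \right] \\
&\le -\rho \sum_{i=1}^n \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \left[ l(\hat{f}, (x_i, y_i)) \right] + \frac{\exp(3\rho C) - 3\rho C - 1}{9C^2} \sum_{i=1}^n \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \left[ l(\hat{f}, (x_i, y_i))^2 \right] \\
&\le -\left( \rho - \frac{20\left( \exp(3\rho C) - 3\rho C - 1 \right)}{9C} \right) \sum_{i=1}^n \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \|\hat{f}(x_i, \cdot) - f^*(x_i, \cdot)\|_2^2.
\end{aligned}
$$

As $\exp(x) - x - 1 \approx 0.5 x^2$ as $x \to 0$, there exists a sufficiently small $\rho$ that only depends on $C$, such that $9\rho C > 20(\exp(3\rho C) - 3\rho C - 1)$. Hence, we know that

$$
\sum_{i=1}^n \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \|\hat{f}(x_i, \cdot) - f^*(x_i, \cdot)\|_2^2 \le \frac{9C}{9\rho C - 20\left( \exp(3\rho C) - 3\rho C - 1 \right)} \log \frac{|\mathcal{F}|}{\delta}.
$$

Comparison with the MLE guarantee. For discrete domains, as the $L_2$ norm is always bounded by the $L_1$ norm, our guarantee is weaker than the guarantee of MLE used in Agarwal et al. (2020) and Uehara et al. (2022). However, in general, $L_1$ and $L_2$ bounds do not imply each other, and hence we cannot directly compare our theoretical guarantee with the MLE guarantee. Nevertheless, our method is easier to optimize than the MLE, which makes it a preferable practical choice.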
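As an illustration of the $\ell_2$ estimator analyzed above, the following sketch evaluates the objective $\sum_i -2f(x_i, y_i) + \sum_i \sum_y f(x_i, y)^2$ for each member of a small finite class and returns the minimizer. The tabular setting, the three candidate densities, and the function names are assumptions chosen for a hand-checkable example; they do not come from the paper.

```python
import numpy as np


def l2_density_objective(f, xs, ys):
    """Objective sum_i -2 f(x_i, y_i) + sum_i sum_y f(x_i, y)^2.

    f  : (|X|, |Y|) array with f[x, y] a candidate conditional density p(y|x)
    xs : (n,) integer array of observed instances
    ys : (n,) integer array of observed targets
    """
    fit = -2.0 * f[xs, ys].sum()
    norm = (f[xs] ** 2).sum()
    return fit + norm


def l2_estimator(candidates, xs, ys):
    """Index of the candidate minimizing the objective over a finite class."""
    scores = [l2_density_objective(f, xs, ys) for f in candidates]
    return int(np.argmin(scores))


# Hypothetical finite class containing the true conditional density f_star.
f_star = np.array([[0.8, 0.2], [0.3, 0.7]])
candidates = [
    f_star,
    np.array([[0.5, 0.5], [0.5, 0.5]]),
    np.array([[0.2, 0.8], [0.7, 0.3]]),
]

# Data whose empirical frequencies match f_star exactly (10 draws per x).
xs = np.array([0] * 10 + [1] * 10)
ys = np.array([0] * 8 + [1] * 2 + [0] * 3 + [1] * 7)

best = l2_estimator(candidates, xs, ys)
```

Unlike maximum likelihood, this objective needs no logarithm of the model output, so it remains well defined even when a candidate assigns zero density to an observed pair.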

D.2 PAC BOUNDS FOR ONLINE REINFORCEMENT LEARNING

Before we start, we first state some basic properties of MDP that can be obtained from the definition of the related terms. For the state visitation distribution, a straightforward computation shows that d π P (s) = (1 -γ)ρ(s) + γE s∼d π P ,ã∼π(•|s) P (s|s, ã). Meanwhile, we have that V π P,r = 1 1 -γ E s∼d π P ,a∼π(•|s) [r(s, a)]. For now, we assume P n (•, •) is a valid probability measure, P ∈ F, and the following two inequalities hold ∀n ∈ N + with probability at least 1 -δ: E s∼ρn,a∼U (A) ∥ P n (•|s, a) -P (•|s, a)∥ 2 2 ⩽ζ n E s∼ρ ′ n ,a∼U (A) ∥ P n (•|s, a) -P (•|s, a)∥ 2 2 ⩽ζ n Proof Sketch Our proof is organized as follows: • Based on Theorem 5, we prove a one-step back inequality for the learned model (Lemma 7), which is further used to show the optimisticity, i.e., the policy planning on the learned model with the additional bonus upper bound the optimal value up to some error term (Lemma 9). • We then bound the cumulative regret of the adaptive chosen policy (Lemma 13) based on the established optimisticity, and further exploit a one-step back inequality for the true model (Lemma 10) and standard elliptical potential lemma (Lemma 20). • With standard regret to PAC conversion, we obtain the final PAC guarantee (Theorem 14). We first state the following basic property for the value function: Lemma 6 (L 2 norm of V π P,r ). For any policy π, we have that ∥V π P,r ∥ 2 ⩽ 2d 1 + dγ 2 (1 -γ) 2 ≲ d 1 -γ Proof. From the properties of low-rank MDP, we know there exists  w π , ∥w π ∥ 2 ⩽ √ d 1-γ and Q π P,r (s, a) = ϕ * (s, a) ⊤ w π h . Then we have ∥V π P,r ∥ 2 2 = S V π (s) 2 ds = S A 2 ds + 2γ 2 A Q π P,r (s, a) da 2 ds ⩽2d + 2dγ 2 (1 -γ) 2 S A ∥ϕ * (s, a)∥ 2 da 2 ds ⩽2d 1 + dγ 2 (1 -γ) 2 ≲ d 2 (1 -γ) 2 , which concludes the proof. Before we proceed to the proof, we first define the following terms. Let ρ n (s) = 1 The following lemmas will be helpful when we demonstrate the effectiveness of our bonus: Lemma 7 (One-step back inequality for the learned model). 
Assume g : S × A → R satisfies that ∥g∥ ∞ ⩽ B ∞ , A g(•, a) da 2 ⩽ B 2 , then we have that E (s,a)∼d π Pn {g(s, a)} ⩽ (1 -γ)|A|E s∼ρn,a∼U (A) {g 2 (s, a)} + γ n|A|E s∼ρ ′ n ,a∼U (A) {g 2 (s, a)} + B 2 2 nζ n + λ n B 2 ∞ d • E (s,ã)∼d π Pn ϕ n (s, ã) Σ -1 ρn ×U , ϕn . Proof. Note that E (s,a)∼d π Pn {g(s, a)} = γE (s,ã)∼d π Pn ,s∼ Pn(•|s,ã),a∼π(•|s) {g(s, a)} + (1 -γ)E s∼ρ,a∼π(•|s) {g(s, a)}. For the second term, note that d π P (s) ⩾ (1 -γ)ρ(s), hence (1 -γ)E s∼ρ,a∼π(•|s) {g(s, a)} ⩽(1 -γ) E s∼ρ,a∼π(•|s) {g 2 (s, a)} =(1 -γ) E s∼ρn,a∼U (A) ρ(s)π(a|s)|A| ρ n (s) g 2 (s, a) ⩽ (1 -γ)|A|E s∼ρn,a∼U (A) {g 2 (s, a)}. For the first term, we have that  E (s,ã)∼d π Pn ,s∼ Pn(•|s,ã),a∼π(•|s) {g(s, a)} =E (s,ã)∼d π Pn ϕ n (s, ã) ⊤ S×A µ n (s)π(a|s)g(s, + λ n B 2 ∞ d. With Jensen's inequality, we have E s∼ρn,ã∼U (A) S×A P (s|s, ã)π(a|s)g(s, a) ds da 2 ⩽E s∼ρn,ã∼U (A),s∼P (•|s,ã),a∼π(•|s) g 2 (s, a) =E s∼ρ ′ n ,a∼π(•|s) g 2 (s, a) ⩽|A|E s∼ρ ′ n ,a∼U (A) g 2 (s, a) Meanwhile, E s∼ρn,ã∼U (A) S×A ( P n (s|s, ã) -P (s|s, ã))π(a|s)g(s, a) ds da ⩽ d (1 -γ) 2 S A ∥ϕ * (s, a)∥ da + A ∥ ϕ n (s, a)∥ da 2 ds ⩽ 4d 2 (1 -γ) 2 , where the last inequality is due to Assumption 2 and the fact that (a + b) 2 ⩽ 2(a 2 + b 2 ). Invoking Lemma 7, we have that E (s,a)∼d π Pn {g(s, a)} ⩽ (1 -γ)|A|E s∼ρn,a∼U (A) {g 2 (s, a)} + γ n|A|E s∼ρ ′ n ,a∼U (A) {g 2 (s, a)} + 4d 2 (1 -γ) 2 nζ n + 4λ n d (1 -γ) 2 • E (s,ã)∼d π Pn ϕ n (s, ã) Σ -1 ρn ×U , ϕn . Note that E s∼ρn,a∼U (A) {g 2 (s, a)} =E s∼ρn,a∼U (A) S P (s ′ |s, a) -P n (s ′ |s, a) V π P,r (s ′ ) ds ′ 2 ⩽E s∼ρn,a∼U (A) P (•|s, a) -P n (•|s, a) 2 2 V π P,r 2 2 ⩽2d 1 + dγ 2 (1 -γ) 2 ζ n , where the first inequality is due to the Hölder's inequality and the last inequality is due to Lemma 6. With the selected hyperparameters and Lemma 8, we conclude the proof. To further provide the regret bound, we need the following analog of Lemma 7. Note that, here we don't require ρ ′ n . Lemma 10 (One-step back inequality for the true model). 
Assume g : S × A → R satisfies that ∥g∥ ∞ ⩽ B ∞ , then we have that We also need the following properties on the bonus and the value function when we plan on the learned model with the bonus. Lemma 11 (Norm of the Bonus). We have that ∥b n (s, a)∥ ∞ ⩽ α n √ λ n ≲ d|A| 1 -γ , A b n (•, a) da ⩽ α n √ d √ λ n ≲ d |A| 1 -γ . Proof. Note that, Σ n, ϕn ≳ λ n I, and as a result, we have Σ -1 n, ϕn op ⩽ 1 λn . Recall b n (s, a) = α n ϕ n (s, a) Σ -1 n, ϕn , we know b 2 n (s, a) = α 2 n ϕ n (s, a) Σ -1 n, ϕn ϕ n (s, a) ⩽ α 2 n ∥ ϕ n (s, a)∥ 2 2 λ n ⩽ α 2 n λ n , as well as A b n (•, a) da 2 =α 2 n A ∥ ϕ n (•, a)∥ Σ -1 n , ϕn da 2 =α 2 n S A ∥ ϕ n (•, a)∥ Σ -1 n , ϕn da 2 ds ⩽ α 2 n λ n S A ∥ ϕ n (s, a)∥ da 2 ds ⩽ α 2 n d λ n , Combined with the fact that αn √ λn = Θ √ d|A| 1-γ , we conclude the proof. Lemma 12 (L 2 norm of V π Pn,r+bn ). For any policy π, we have that V π Pn,r+bn ⩽ 3d + 3α 2 n d λ n + 3d 2 γ 2 1 + αn √ λn 2 (1 -γ) 2 ≲ d 1.5 |A| (1 -γ) 2 Proof. We have Q π Pn,r+bn (s, a) 2 da 2 ds ⩽3d + 3α 2 n d λ n + 3d 2 γ 2 1 + αn √ λn 2 (1 -γ) 2 ≲ d 3 |A| (1 -γ) 4 , which concludes the proof. Now we are ready to prove the following regret bounds and obtain the final PAC guarantee. Lemma 13 (Regret). With probability at least 1 -δ, we have that N n=1 V π * P,r -V πn P,r ≲ N d 4 |A| 2 log(N |F|/δ) (1 -γ) 6 log 1 + N d 2 log(N |F|/δ) . Proof. Standard decomposition shows V π * P,r -V πn P,r ⩽V π * Pn,r+bn + 2|A|d 1 + γ 2 d (1-γ) 2 ζ n (1 -γ) -V πn P,r ⩽V πn Pn,r+bn -V πn P,r + 2|A|d 1 + γ 2 d (1-γ) 2 ζ n (1 -γ) ⩽ 1 1 -γ E (s,a)∼d πn P b n (s, a) + γE Pn(s ′ |s,a) [V πn Pn,r+bn (s ′ )] -γE P (s ′ |s,a) [V πn Pn,r+bn (s ′ )] + 2|A|d 1 + γ 2 d (1-γ) 2 ζ n (1 -γ) . Applying Lemma 10 to E (s,a)∼d  ⩽3d   1 + α 2 n λ n + dγ 2 1 + αn √ λn 2 (1 -γ) 2    ζ n ≲ d 3 |A|ζ n (1 -γ) 4 Hence, E (s,a)∼d πn P {g(s, a)} ≲ (1 -γ)d 3 |A| 2 ζ n + d 3 |A| 2 nζ n (1 -γ) 4 + d 3 |A|nζ n (1 -γ) 4 E (s,ã)∼d πn P ∥ϕ * (s, ã)∥ Σ -1 ρn ,ϕ * . 
Finally, with Lemma 20 and notice that λ 1 ⩽ λ 2 ⩽ • • • ⩽ λ N , we have that N n=1 E (s,ã)∼d πn P ∥ϕ * (s, ã)∥ Σ -1 ρn,ϕ * ⩽ N Tr E (s,ã)∼d πn P ϕ * (s, ã) (ϕ * (s, ã)) ⊤ Σ -1 ρn,ϕ * ⩽ N d log λ N + N λ 1 Combine the previous terms and take the dominating terms out, we have that N n=1 V π * P,r -V πn P,r ≲ N d 4 |A| 2 log(N |F|/δ) (1 -γ) 6 log 1 + N d 2 log(N |F|/δ) , which concludes the proof. Theorem 14 (PAC Guarantee). After interacting with the environments for N = Θ d 4 |A| 2 (1-γ) 6 ϵ 2 episodes, we can obtain an ϵ-optimal policy with high probability. Furthermore, with high probability, for each episode, we can terminate within Θ(1/(1 -γ)) steps. Proof. It directly follows from the standard regret to PAC reduction. See Jin et al. (2018) ; Uehara et al. (2022) for the detail.
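For readers unfamiliar with the regret-to-PAC reduction, the following sketch spells out the standard argument. It assumes, for illustration only, that the regret bound of Lemma 13 is written in the form $C\sqrt{N}$ with $C^2 = \tilde{O}(d^4 |\mathcal{A}|^2 / (1-\gamma)^6)$, where $\tilde{O}$ hides logarithmic factors, and that the output policy $\bar{\pi}$ is drawn uniformly from the iterates; the symbols $C$ and $\bar{\pi}$ are introduced here and do not appear in the original text.

```latex
% Average regret from the cumulative bound
\frac{1}{N} \sum_{n=1}^{N} \left( V^{\pi^*}_{P,r} - V^{\pi_n}_{P,r} \right)
  \le \frac{C}{\sqrt{N}},
\qquad
C^2 = \tilde{O}\!\left( \frac{d^4 |\mathcal{A}|^2}{(1-\gamma)^6} \right).

% Uniform mixture over the iterates \pi_1, \dots, \pi_N
\mathbb{E}_{\bar{\pi} \sim \mathrm{Unif}\{\pi_1, \dots, \pi_N\}}
  \left[ V^{\pi^*}_{P,r} - V^{\bar{\pi}}_{P,r} \right]
  = \frac{1}{N} \sum_{n=1}^{N} \left( V^{\pi^*}_{P,r} - V^{\pi_n}_{P,r} \right)
  \le \frac{C}{\sqrt{N}}
  \le \epsilon
\quad \text{whenever} \quad
N \ge \frac{C^2}{\epsilon^2}
  = \tilde{O}\!\left( \frac{d^4 |\mathcal{A}|^2}{(1-\gamma)^6 \epsilon^2} \right).
```

Under these assumptions, the number of episodes matches the order stated in Theorem 14 up to logarithmic factors.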

D.3 PAC BOUNDS FOR OFFLINE REINFORCEMENT LEARNING

Proof Sketch Similar to the online counterpart, our proof for offline setting is organized as follows: • We show the policy obtained by planning on the learned model with additional penalty lower bound the optimal value up to some error term (Lemma 16), with the help of an analog of the one-step back inequality for the learned model in the offline setting (Lemma 15) based on Theorem 5. • We then show the PAC guarantee (Theorem 18) with an analog of the one-step back inequality for the true model in the offline setting (Lemma 17). We first prove the analog of Lemma 7 in the offline setting. Lemma 15 (One-step back inequality for the learned model in the offline setting). For the second term, we have that Let ω = max s,a {1/π b (a|s)}. Assume g : S × A → R satisfies that ∥g∥ ∞ ⩽ B ∞ , ∥ A g(•, a) da∥ 2 ⩽ B 2 , (1 -γ)E s∼ρ,a∼π(•|s) {g(s, a)} ⩽(1 -γ) E s∼ρ,a∼π(•|s) {g 2 (s, a)} =(1 -γ) E s∼ρ b ,a∼π b (•|s) ρ(s)π(a|s) ρ b (s)π(a|s) g 2 (s, a) ⩽ ω(1 -γ)|A|E s∼ρ b ,a∼π b (•|s) g 2 (s, a). For the first term, we have that Lemma 16 (Pessimism). Let ω = max s,a {π b (a|s)}, α n =Θ d √ ωζ n 1 -γ , λ =Θ(d log(|F|/δ)), then we have V π P ,r-b ⩽ V π P,r + 2ωd 1 + γ 2 d (1-γ) 2 ζ n (1 -γ) . Proof. With the simulation Lemma (i.e., Lemma 19), we have V π Pn,r-b -V π P,r = 1 1 -γ E (s,a)∼d π Pn -b(s, a) + γ E Pn(s ′ |s,a) V π P,r (s ′ ) -E P (s ′ |s,a) V π P,r (s ′ ) . Consider g(s, a) := E Pn(s ′ |s,a) V π P,r (s ′ ) -E P (s ′ |s,a) V π P,r (s ′ ) . With Hölder's inequality, ∥g∥ ∞ ⩽ 2 1-γ . Furthermore, with the derivation in the proof of Lemma 9, we know A g(•, a) da 2 ⩽ √ 1-γ . Applying Lemma 15, we have that E (s,a)∼d π P {g(s, a)} ⩽ (1 -γ)ωE (s,a)∼ρ b {g 2 (s, a)} + γ nωE (s,a)∼ρ b {g 2 (s, a)} + 2d 2 (1 -γ) 2 log(|F|/δ) + 4λ n d (1 -γ) 2 • E (s,ã)∼d π P ϕ(s, ã) Σ -1 ρ b , ϕ . With Lemma 6, we know E (s,a)∼ρ b {g 2 (s, a)} ⩽ 2d(1 + dγ 2 (1 -γ) 2 )ζ n . Then, with the selected hyperparameters and Lemma 8, we conclude the proof. 
Lemma 17 (One-step back inequality for the true model in the offline setting). Let ω = max s,a {π b (a|s)}, assume g : S × A → R satisfies ∥g∥ ∞ ⩽ B ∞ , then we have E (s,a)∼d π P {g(s, a)} ⩽ (1 -γ)ωE (s,a)∼ρ b {g 2 (s, a)} + nγωE (s,a)∼ρ b {g 2 (s, a)} + λγ 2 B 2 ∞ d • E (s,ã)∼d π P ∥ϕ * (s, ã)∥ Σ -1 ρ b ,σ * . Proof. The proof is identical to the proof of Lemma 7. We now provide the PAC guarantee for the offline setting. Theorem 18 (PAC Guarantee). With probability 1-δ, ∀ baseline policy π including history-dependent non-Markovian policies, we have that V π P,r -V π P,r ≲ ω 2 d 4 C * π log(|F|/δ) (1 -γ) 6 , where C * π is the relative conditional number under ϕ * , defined as C * π := sup x∈R d x ⊤ E (s,a)∼d π P [ϕ * (s, a)ϕ * (s, a) ⊤ ]x x ⊤ E (s,a)∼ρ b [ϕ * (s, a)ϕ * (s, a) ⊤ ]x . Proof. Standard decomposition shows For all methods, we use latent behavioral cloning as described in Section 3.2 to pre-train representations on a suboptimal dataset D off , then finetune on the expert dataset D π * for downstream imitation learning. We also compare with baseline behavioral cloning (BC) (Pomerleau, 1998) , which directly learns a policy from the expert dataset (without latent representations) by maximizing the loglikelihood objective, E (s,a)∼Pr(D π * ) [-log π(a | s)]. V π P,r -V π P,r ⩽V π P,r -V π P ,r-b + 2ωd 1 + γ 2 d (1-γ 2 ) ζ n ) (1 -γ) ⩽V π P,r -V π P ,r-b + 2ωd 1 + γ 2 d (1-γ 2 ) ζ n (1 -γ) =E (s, We report the average return on AntMaze tasks, and observe that SPEDER achieves comparable performance as other state-of-the-art representations on downstream imitation learning (Figure 4 ). We also observe that the normalized marginalization regularization equation 11 helps performance (Figure 5 ). We provide the performance curves for imitation learning in Figures 6 and 7 . For TRAIL, OPAL, SPiRL, SKiLD, and BC, we used the same hyperparameters as reported in Yang et al. (2021) . 
For all methods, we pre-trained the representations for 200K steps using the Adam optimizer with a learning rate of 3e-4 and batch size 256. For latent behavioral cloning, we train the latent policy $\pi_Z$ for 1M iterations using a learning rate of 1e-4 for BC, SPiRL, SkiLD, and OPAL, and 3e-5 for SPEDER and TRAIL (both EBM and Linear). We found that decaying the BC learning rate helped prevent overfitting for all methods. We evaluate the policy every 10K iterations by rolling it out in the environment for 10 episodes and recording the average return. The representations ϕ and the action decoder $\pi_\alpha$ were frozen during downstream behavioral cloning. All imitation learning results are reported over 4 seeds. Both the action decoder $\pi_\alpha$ and the latent policy $\pi_Z$ are parameterized as multivariate Gaussian distributions, with the mean and variance approximated using a two-layer MLP network with hidden layer size 256. For SPEDER and TRAIL, ϕ and µ are parameterized as 2-layer MLPs with hidden layer size 256 and a Swish activation function (Ramachandran et al., 2017) at the end of each hidden layer. We ran a sweep of embedding dimensions d ∈ {64, 256} and found that d = 64 worked best for TRAIL and d = 256 worked best for SPEDER. For SPEDER, we ran a sweep of coefficients for each loss term in equation 10, and summarize the coefficients used in Table 5. For TRAIL Linear, we used a Fourier dimension of 8192, which gives it a much larger feature dimension than SPEDER, yet it still performs worse. For SPiRL, SkiLD, and OPAL, we used an embedding dimension of 8, which was reported to work best (Yang et al., 2021). The trajectory encoder is parameterized as a bidirectional RNN, and the skill prior is parameterized as a Gaussian network following Ajay et al. (2020). SPiRL and SkiLD are adapted for downstream behavioral cloning by minimizing the KL divergence between the latent policy and the skill prior.



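1: Input: Regularizer $\lambda$, parameter $\alpha$, model class $\mathcal{F}$, dataset $\mathcal{D}$ sampled from the stationary distribution of the behavior policy $\pi_b$.
2: Learn representation $\hat{\phi}(s, a)$ with $\mathcal{D}$ via equation 10.
3: Set the empirical covariance matrix $\hat{\Sigma} = \sum_{(s,a) \in \mathcal{D}} \hat{\phi}(s, a) \hat{\phi}(s, a)^\top + \lambda I$.
4: Set the reward penalty: $\hat{b}(s, a) = \alpha \sqrt{\hat{\phi}(s, a)^\top \hat{\Sigma}^{-1} \hat{\phi}(s, a)}$.
5: Solve $\hat{\pi} = \arg\max_\pi V^\pi_{\hat{P}, r - \hat{b}}$.
6: Return $\hat{\pi}$

D PROOF DETAILS

D.1 NON-ASYMPTOTIC GENERALIZATION BOUND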

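With a slight abuse of notation, we also use $\rho_n(s, a) = \frac{1}{n} \sum_{i=1}^n d^{\pi_i}_{P^*}(s, a)$, and use $\rho'_n$ to denote the marginal distribution of $s'$ for the triple $(s, a, s') \sim \rho_n(s)\, U(a)\, P^*(s'|s, a)$. For notational simplicity, we denote

$$
\begin{aligned}
\Sigma_{\rho_n \times U(\mathcal{A}), \phi} &= n\, \mathbb{E}_{s \sim \rho_n, a \sim U(\mathcal{A})} \left[ \phi(s, a) \phi(s, a)^\top \right] + \lambda_n I, \\
\Sigma_{\rho_n, \phi} &= n\, \mathbb{E}_{(s, a) \sim \rho_n} \left[ \phi(s, a) \phi(s, a)^\top \right] + \lambda_n I, \\
\Sigma_{n, \phi} &= n\, \mathbb{E}_{(s, a) \sim \mathcal{D}_n} \left[ \phi(s, a) \phi(s, a)^\top \right] + \lambda_n I.
\end{aligned}
$$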
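The regularized covariance matrices above are what the exploration bonus and the offline reward penalty are built from, through the elliptical norm $\|\phi(s, a)\|_{\Sigma^{-1}}$. The following NumPy sketch computes the empirical version $\Sigma_{n,\phi}$ from a batch of features and evaluates $\alpha \sqrt{\phi^\top \Sigma^{-1} \phi}$ at query features. The function name and the toy data are hypothetical, and the square-root form of the bonus is an assumption taken from the identity $b_n^2(s, a) = \alpha_n^2\, \phi^\top \Sigma^{-1} \phi$ used in the analysis.

```python
import numpy as np


def elliptical_bonus(features, query, alpha=1.0, lam=1.0):
    """Return alpha * sqrt(q^T Sigma^{-1} q) for every row q of `query`.

    features : (n, d) array of phi(s, a) over the dataset
    query    : (m, d) array of phi(s, a) at which to evaluate the bonus
    Sigma is the regularized empirical covariance
    sum_i phi_i phi_i^T + lam * I, i.e. n * E_{D_n}[phi phi^T] + lam * I.
    """
    d = features.shape[1]
    sigma = features.T @ features + lam * np.eye(d)
    # Solve Sigma x = q instead of forming the inverse explicitly.
    solved = np.linalg.solve(sigma, query.T)      # shape (d, m)
    quad = np.sum(query * solved.T, axis=1)       # q^T Sigma^{-1} q per row
    return alpha * np.sqrt(quad)


# Toy data: nine visits along the first feature direction, none along the second.
features = np.tile(np.array([[1.0, 0.0]]), (9, 1))
query = np.array([[1.0, 0.0], [0.0, 1.0]])
bonus = elliptical_bonus(features, query)
```

In this example the well-covered direction receives a small value and the unvisited direction a large one, which is the behavior the bonus (online) and the penalty (offline) rely on.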

s)π(a|s)g(s, a) ds da Σ ρn×U , ϕn , where for the inequality we use the generalized Cauchy-Schwartz inequality. Note S×A µ n (s)π(a|s)g(s, a) ds da 2 Σ ρn ×U , ϕn =nE s∼ρn,ã∼U (A) S×A P n (s|s, ã)π(a|s)g(s, a) ds da 2 + λ n S×A µ n (s)π(a|s)g(s, a) ds da 2 ⩽2nE s∼ρn,ã∼U (A) S×A P (s|s, ã)π(a|s)g(s, a) ds da 2 + 2nE s∼ρn,ã∼U (A) S×A ( P n (s|s, ã) -P (s|s, ã))π(a|s)g(s, a) ds da 2

(s,a)∼d πn P {g(s, a)} ⩽ (1 -γ)|A|E s∼ρn,a∼U (A) {g 2 (s, a)} + nγ|A|E s∼ρn,a∼U (A) {g 2 (s, a)}+ λ n γ 2 B 2 ∞ d • E (s,ã)∼d πn P ∥ϕ * (s, ã)∥ Σ -1 ρn ,ϕ * . Proof. Note that E (s,a)∼d πn P {g(s, a)} =γE (s,ã)∼d πn P ,s∼P (•|s,ã),a∼π(•|s) {g(s, a)} + (1 -γ)E s∼ρ,a∼π(•|s) {g(s, a)}. For the second term, note that d π P (s) ⩾ (1 -γ)ρ(s), hence (1 -γ)E s∼ρ,a∼π(•|s) {g(s, a)} ⩽(1 -γ) E s∼ρ,a∼π(•|s) {g 2 (s, a)} =(1 -γ) E s∼ρn,a∼U (A) ρ(s)π(a|s)|A| ρ n (s) g 2 (s, a) ⩽ (1 -γ)|A|E s∼ρn,a∼U (A) {g 2 (s, a)}.For the first term, we have thatE (s,ã)∼d πn P ,s∼P (•|s,ã),a∼π(•|s) {g(s, a)} =E (s,ã)∼d πn P ϕ * (s, ã) ⊤ S×A µ * (s)π(a|s)g(s, a) ds da ⩽E (s,ã)∼d πn P ∥ϕ * (s, ã)∥ Σ -1 ρn ,ϕ * S×A µ * (s)π(a|s)g(s, a) ds da Σ ρn ,ϕ * , where for the inequality we use the generalized Cauchy-Schwartz inequality. Note S×A µ * (s)π(a|s)g(s, a) ds da 2 Σ ρn,ϕ * =nE (s,ã)∼ρn S×A P (s|s, ã)π(a|s)g(s, a) ds da 2 + λ n S×A µ * (s)π(a|s)g(s, a) ds da 2 ⩽nE (s,ã)∼ρn,s∼P (•|s,a),a∼π(•|s) {g 2 (s, a)} + λ n B 2 ∞ d, where in the last inequality we use Jensen's inequality. Note that E (s,ã)∼ρn,s∼P (•|s,a),a∼π(•|s) {g 2 (s, a)} ⩽ 1 γ E (s,ã)∼ρn,s∼P * (•|s,a),a∼π(•|s) {g 2 (s, a)} ⩽ |A| γ E s∼ρn,a∼U (A) {g 2 (s, a)} Substituting this back, we obtain the desired result.

) r(s, a) + b n (s, a) + γQ π Pn,r+bn (s, a) da a) + b n (s, a) + γQ π Pn,r+bn (s, a) da

then we have that E (s,a)∼d π P {g(s, a)} ⩽ (1 -γ)ωE (s,a)∼ρ b {g 2 (s, a)} + γ nωE (s,a)∼ρ b {g 2 (s, a)} + B 2 2 nζ n + λ n B 2 ∞ d • E s,ã∼d π P Note that E (s,a)∼d π P {g(s, a)} = γE (s,ã)∼d π P ,s∼ P (•|s,ã),a∼π(•|s) {g(s, a)} + (1 -γ)E s∼ρ,a∼π(•|s) {g(s, a)}.

(s,ã)∼d π P ,s∼ P (•|s,ã),a∼π(•|s) {g(s, a)} =E (s,ã)∼d π P ϕ(s, ã) ⊤ S×A µ(s)π(a|s)g(s, a) ⩽E (s,ã)∼d π P ϕ(s, ã) Σ -1 ρ b , ϕ S×A µ(s)π(a|s)g(s, a) ds da Σ ρ b , ϕ . , ã)π(a|s)g(s, a) ds da 2 + 2nE (s,ã)∼ρ b S×A P (s|s, ã) -P (s|s, ã) π(a|s)g(s, a) ds da 2 + λB 2 ∞ d. With Jensen's inequality, we have E (s,ã)∼ρ b S×A P (s|s, ã)π(a|s)g(s, a) ds da 2 ⩽E (s,ã)∼ρ b ,s∼P (•|s,a),a∼π(•|s) {g 2 (s, a)} ⩽ ω γ E (s,a)∼ρ b {g 2 (s, a)}. On the other hand, E (s,ã)∼ρ b S×A P (s|s, ã) -P (s|s, ã) π(a|s)g(s, a) ds da 2 ⩽E (s,ã)∼ρ b P (•|s, ã) -P (•|s, ã) ã)∼ρ b P (•|s, ã) -P (•|s, ã) , where the last inequality is due to Theorem 5. Substituting this back, we obtain the desired result.

a)∼d π P b(s, a) + γE P (s ′ |s,a) V π P ,r-b (s ′ ) -γE P (s ′ |s,a) V π P ,r-b (s ′ ) + 2ωd 1 + γ 2 d (1-γ 2 ) ζ n (1 -γ).With Lemma 17 and the identical method used in the proof of Lemma 13, we have thatE (s,a)∼d π P {b n (s, a)} ⩽ (1 -γ)α 2 n dω n + γα 2 n dω + γ 2 α 2 n d • E (s,ã)∼d π P ∥ϕ * (s, ã)∥ Σρ b ,ϕ * .Furthermore, define g(s, a) := E Pn(s ′ |s,a) V πn Pn,r+bn (s ′ ) -E P (s ′ |s,a) V πn Pn,r+bn (s ′ ) . With Lemma 17 and the identical method used in the proof of Lemma 13, we can obtainE (s,a)∼d π P {g(s,a)} ≲ (1 -γ)d 3 ω 2 ζ n + d 3 ω 2 nζ n (1 -γ) 4 E (s,ã)∼d π P ∥ϕ * (s, ã)∥ Σ -1 ρ b ,ϕ * .

Figure 3: AntMaze navigation domains in mazes of medium (left) and large (right) sizes.

Figure 1: Performance Curves for online DM Control Suite.

Figure 5: Ablation of SPEDER with vs. without normalized marginalization regularization equation 11.

Figure 6: After pre-training, we train latent behavioral cloning on top of the learned representations for 1M iterations. BC refers to direct behavioral cloning on the expert dataset without latent representations. The corresponding barplot of the final performance is provided in Figure 4.

Figure 7: Performance curve of downstream behavioral cloning for SPEDER with vs. without normalized marginalization regularization equation 11. The corresponding barplot of the final performance is provided in Figure 5.

A unified spectral decomposition view of existing related representations. Here, r denotes the reward function, Λ denotes some diagonal reweighting operator, and P

Table 2: Performance on various MuJoCo control tasks. All the results are averaged across 4 random seeds and a window size of 20K. Results marked with * are adopted from MBBL (Wang et al., 2019). SPEDER achieves strong performance compared with baselines.

Table 3: Performance on various DeepMind Control Suite tasks. All the results are averaged across four random seeds and a window size of 20K. Compared with SAC, our method achieves even better performance on sparse-reward tasks. Results are presented as mean ± standard deviation across different random seeds.

=Tr E s∼ρn,a∼U (A) ϕ n (s, a) ϕ n (s, a) ⊤ nE s∼ρn,a∼U (A) ϕ n (s, a) ϕ n (s, a) ⊤ + λ n IWe then consider the remaining term. With a slightly abuse of notation, define g(s, a) :=E Pn(s ′ |s,a) V πnPn,r+bn (s ′ ) -E P (s ′ |s,a) V πn Pn,r+bn (s ′ ) . With Hölder's inequality, we know that


$\phi(s_i, a_i) \phi(s_i, a_i)^\top + \lambda_n I$. Then there exist absolute constants $c_1$ and $c_2$, such that

which holds with probability at least $1 - \delta$.

Proof. See (Uehara et al., 2022, Lemma 11).

With these lemmas, we are now ready to show the optimism.

Lemma 9 (Optimism). Let

then for any policy π we have

Proof. With the simulation lemma (i.e., Lemma 19), we have that

Consider the function $g$ on $\mathcal{S} \times \mathcal{A}$ defined as follows:

With Hölder's inequality, we have that $\|g\|_\infty \le \frac{2}{1-\gamma}$. Furthermore, as

where the first inequality is due to the triangle inequality; the second inequality is from the Cauchy-Schwarz inequality; and the last inequality comes from the fact that $\|V^\pi_{P,r}\|_\infty \le \frac{1}{1-\gamma}$. Thus, we have

Finally, by the definition of $C^*$, we have that

Combining the previous terms and taking the dominating terms out, we conclude the proof.

E TECHNICAL LEMMAS

Lemma 19 (Simulation Lemma). With a slight abuse of notation, we have

Proof. Note that

The second equation can be obtained with a similar method, which concludes the proof.

Proof. By the concavity of the $\log\det(\cdot)$ function and $\frac{d \log\det(X)}{dX} = (X^\top)^{-1}$, we know

Telescoping, we can obtain the first inequality. For the second inequality, note that, with Jensen's inequality, we have

where $\sigma_i$ is the $i$-th eigenvalue of $M_n$.

F EXPERIMENT DETAILS F.1 ONLINE SETTING

We list all the hyperparameters and network architectures we use for our experiments. For the online MuJoCo and DM Control tasks, the hyperparameters can be found in Table 4. We set the bonus scaling term to 0 for MuJoCo tasks. However, this bonus is critical to success on the DM Control Suite (especially in sparse-reward environments). Note that we use exactly the same actor and critic network architectures for all the algorithms in the DM Control Suite experiment.

For evaluation in MuJoCo, in each evaluation (every 5K steps) we test our algorithm for 10 episodes. We average the results over the last 4 evaluations and 4 random seeds. For Dreamer and Proto-RL, we change their networks from CNNs to 3-layer MLPs and disable the image data augmentation (since we test on the state space). We tried to tune some of their hyperparameters (e.g., the number of exploration steps in Proto-RL) and report the best number across our runs. However, given the limited time, it is possible that we did not tune these hyperparameters enough.

We show that the SPEDER objective can learn valid transitions of the environment. We use an empty-room maze environment, where the state is the position of the agent and the action is the velocity. The transition can be expressed as s′ = s + at + ϵ, where t is a fixed time interval and ϵ ∼ N(0, I). We run SPEDER for 100K steps, and the learned transition heatmap is visualized in Figure 2. The blue region is the heatmap estimated via spectral decomposition, and S1 is the target position of the agent. The high-density region is centered around the red dot (target position S1), which means that the representation learned by our objective captures the environment transition. This shows that the spectral decomposition can learn a good transition function.
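The empty-room dynamics s′ = s + at + ϵ can be simulated in a few lines, which also makes explicit the Gaussian transition density that the learned factorization is expected to approximate. This is a sketch under stated assumptions: the time interval t = 0.1, the two-dimensional state, and the function names are illustrative choices, not values reported in the paper.

```python
import numpy as np


def empty_room_step(s, a, t=0.1, rng=None):
    """Sample s' = s + a * t + eps with eps ~ N(0, I) (t is an assumed value)."""
    rng = np.random.default_rng() if rng is None else rng
    return s + a * t + rng.standard_normal(s.shape)


def transition_density(s_next, s, a, t=0.1):
    """Gaussian density of s' given (s, a) under the dynamics above."""
    diff = s_next - (s + a * t)
    k = diff.shape[-1]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1)) / (2.0 * np.pi) ** (k / 2.0)


rng = np.random.default_rng(0)
s = np.array([0.0, 0.0])       # agent position
a = np.array([1.0, -2.0])      # agent velocity
target = s + a * 0.1           # mean of the next-state distribution

# Draw many next states from the same (s, a) and average them.
batch = np.tile(s, (20000, 1))
samples = empty_room_step(batch, a, rng=rng)
empirical_mean = samples.mean(axis=0)
peak = transition_density(target, s, a)
```

The sample mean concentrates around s + at, which is where the density peaks; a heatmap such as the one in Figure 2 would be expected to show its high-density region centered on that point.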

