LATENT VARIABLE REPRESENTATION FOR REINFORCEMENT LEARNING

Abstract

Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) seeks an optimal policy that maximizes the expected accumulated reward by interacting with an unknown environment sequentially. Most research in RL is based on the framework of Markov decision processes (MDPs) (Puterman, 2014). For MDPs with finite states and actions, there is already a clear understanding, with sample- and computationally efficient algorithms (Auer et al., 2008; Dann & Brunskill, 2015; Osband & Van Roy, 2014; Azar et al., 2017; Jin et al., 2018). However, the cost of these RL algorithms quickly becomes unacceptable for large or infinite state spaces. Therefore, function approximation or parameterization is a major tool to tackle the curse of dimensionality. Based on the parameterized component to be learned, RL algorithms can roughly be classified into two categories: model-free and model-based RL, where algorithms in the former class directly learn a value function or policy to maximize the cumulative reward, while algorithms in the latter class learn a model to mimic the environment and obtain the optimal policy by planning with the learned simulator. Model-free RL algorithms exploit an end-to-end learning paradigm for policy and value function training, and have achieved empirical success in robotics (Peng et al., 2018), video games (Mnih et al., 2013), and dialogue systems (Jiang et al., 2021), to name a few, thanks to flexible deep neural network parameterizations. The flexibility of such parameterizations, however, also comes with a cost in optimization and exploration. Specifically, it is well known that temporal-difference methods become unstable or even divergent with general nonlinear function approximation (Boyan & Moore, 1994; Tsitsiklis & Van Roy, 1996). Uncertainty quantification for general nonlinear function approximators is also underdeveloped.
Although there are several theoretically interesting model-free exploration algorithms with general nonlinear function approximators (Wang et al., 2020; Kong et al., 2021; Jiang et al., 2017), a computationally friendly exploration method for model-free RL is still missing. Model-based RL algorithms, on the other hand, exploit more information from the environment during learning, and are therefore considered more promising in terms of sample efficiency (Wang et al., 2019). Equipped with powerful deep models, model-based RL can successfully reduce approximation error, and has demonstrated strong performance in practice (Hafner et al., 2019a;b; Wu et al., 2022), along with some theoretical justifications (Osband & Van Roy, 2014; Foster et al., 2021). However, the reduction of approximation error brings new challenges in planning and exploration, which have not been treated seriously from either the empirical or the theoretical perspective. Specifically, with general nonlinear models, the planning problem itself is no longer tractable, and the problem becomes even more difficult once an exploration mechanism is introduced. While theoretical analyses typically assume a planning oracle providing an optimal policy, some approximations are necessary in practice, including Dyna-style planning (Chua et al., 2018; Luo et al., 2018), random shooting (Kurutach et al., 2018; Hafner et al., 2019a), and policy search with backpropagation through time (Deisenroth & Rasmussen, 2011; Heess et al., 2015). These may lead to sub-optimal policies, even with perfect models, wasting potential modeling power. In sum, for both model-free and model-based algorithms, there has been insufficient work considering statistical and computational tractability and efficiency in learning, planning, and exploration from a unified and coherent perspective for algorithm design.
This raises the question: Is there a way to design a provable and practical algorithm that remedies both the statistical and computational difficulties of RL? Here, by "provable" we mean that the statistical complexity of the algorithm can be rigorously characterized without explicit dependence on the number of states, but instead on the fundamental complexity of the parameterized representation space; by "practical" we mean that the learning, planning, and exploration components of the algorithm are computationally tractable and can be implemented in real-world scenarios. This work provides an affirmative answer to the question above by establishing a representation view of latent variable dynamics models through a connection to linear MDPs. Such a connection immediately provides a computationally tractable approach to planning and exploration in the linear space constructed by the flexible deep latent variable model. Such a latent variable model view also provides a variational learning method that remedies the intractability of MLE for general linear MDPs (Agarwal et al., 2020; Uehara et al., 2022). Our main contributions consist of the following:
• We establish the representation view of latent variable dynamics models in RL, which naturally induces the Latent Variable Representation (LV-Rep) for linearly representing the state-action value function, and paves the way for a practical variational method for representation learning (Section 3);
• We provide computationally efficient algorithms implementing the principle of optimism and pessimism in the face of uncertainty with the learned LV-Rep for online and offline RL (Section 3.1);
• We theoretically analyze the sample complexity of LV-Rep in both online and offline settings, which reveals the essential complexity beyond the cardinality of the latent variable (Section 4);
• We empirically demonstrate that LV-Rep outperforms state-of-the-art model-based and model-free RL algorithms on several RL benchmarks (Section 6).

2. PRELIMINARIES

In this section, we provide a brief introduction to MDPs and linear MDPs, which play important roles in the algorithm design and theoretical analysis. We also provide the required background on functional analysis in Appendix D.

2.1. MARKOV DECISION PROCESSES

We consider the infinite-horizon discounted Markov decision process (MDP) specified by the tuple $M = \langle S, A, T^*, r, \gamma, d_0\rangle$, where $S$ is the state space, $A$ is a discrete action space, $T^*: S \times A \to \Delta(S)$ is the transition, $r: S \times A \to [0,1]$ is the reward, $\gamma \in (0,1)$ is the discount factor, and $d_0 \in \Delta(S)$ is the initial state distribution. Following the standard convention (e.g. Jin et al., 2020), we assume $r(s,a)$ and $d_0$ are known to the agent. We aim to find the policy $\pi: S \to \Delta(A)$ that maximizes the discounted cumulative reward
$$V^{\pi}_{T^*,r} := \mathbb{E}_{T^*,\pi}\Big[\textstyle\sum_{i=0}^{\infty}\gamma^i r(s_i,a_i)\,\Big|\, s_0 \sim d_0\Big].$$
We define the state value function $V: S \to [0, \tfrac{1}{1-\gamma}]$ and the state-action value function $Q: S \times A \to [0, \tfrac{1}{1-\gamma}]$ following the standard notation:
$$Q^{\pi}_{T^*,r}(s,a) = \mathbb{E}_{T^*,\pi}\Big[\textstyle\sum_{i=0}^{\infty}\gamma^i r(s_i,a_i)\,\Big|\, s_0 = s, a_0 = a\Big], \qquad V^{\pi}_{T^*,r}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\big[Q^{\pi}_{T^*,r}(s,a)\big].$$
It is straightforward to see that $V^{\pi}_{T^*,r} = \mathbb{E}_{s\sim d_0}\big[V^{\pi}_{T^*,r}(s)\big]$, as well as the Bellman equation
$$Q^{\pi}_{T^*,r}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim T^*(\cdot|s,a)}\big[V^{\pi}_{T^*,r}(s')\big].$$
We also define the (normalized) discounted occupancy measure $d^{\pi}_{T^*}$ of policy $\pi$ as
$$d^{\pi}_{T^*}(s,a) = (1-\gamma)\,\mathbb{E}_{T^*,\pi}\Big[\textstyle\sum_{i=0}^{\infty}\gamma^i \mathbb{1}_{s_i=s,\,a_i=a}\,\Big|\, s_0 \sim d_0\Big].$$
By the definition of the discounted occupancy measure, we can see $V^{\pi}_{T^*,r} = \tfrac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d^{\pi}_{T^*}}[r(s,a)]$. Furthermore, by the Markov property, we obtain the recursion
$$d^{\pi}_{T^*}(s,a) = (1-\gamma)\,d_0(s)\,\pi(a|s) + \gamma\,\mathbb{E}_{(\bar{s},\bar{a})\sim d^{\pi}_{T^*}}\big[T^*(s|\bar{s},\bar{a})\,\pi(a|s)\big].$$
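The normalized occupancy measure admits a sampling view that is used again for data collection in Section 3.1: roll out the policy and stop with probability $1-\gamma$ at every step. A minimal sketch, with a toy two-state chain of our own choosing for illustration:

```python
import random

def sample_occupancy(reset, step, policy, gamma, rng):
    """Draw one state distributed as the normalized occupancy d^pi:
    start from s0 ~ d0 and stop with probability 1 - gamma at each
    step, so step t is reached with probability (1 - gamma) * gamma^t."""
    s = reset()
    while rng.random() >= 1.0 - gamma:
        s = step(s, policy(s))
    return s

# toy deterministic chain: state 0 transitions to the absorbing state 1
rng = random.Random(0)
samples = [sample_occupancy(lambda: 0, lambda s, a: 1, lambda s: 0, 0.5, rng)
           for _ in range(20000)]
freq0 = samples.count(0) / len(samples)
# analytically d^pi(0) = (1 - gamma) * sum_i gamma^i * 1{s_i = 0} = 1 - gamma
```

With $\gamma = 0.5$ the empirical frequency of state 0 concentrates around $1-\gamma = 0.5$, matching the analytic occupancy.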

2.2. LINEAR MDP

In the tabular MDP, where the state space $S$ is finite, there is a large body of work on sample- and computation-efficient RL algorithms (e.g. Azar et al., 2017; Jin et al., 2018). However, such methods can still be expensive when $|S|$ becomes large or even infinite, which is quite common in real-world applications. To address this issue, we introduce function approximation into RL algorithms to alleviate the statistical and computational bottleneck. The linear MDP (Jin et al., 2020; Agarwal et al., 2020) is a promising subclass that admits special structure for such purposes.

Definition 1 (Linear MDP (Jin et al., 2020; Agarwal et al., 2020)). An MDP is called a linear MDP if there exist $\phi^*: S \times A \to \mathcal{H}$ and $\mu^*: S \to \mathcal{H}$ for some proper Hilbert space $\mathcal{H}$, such that $T^*(s'|s,a) = \langle \phi^*(s,a), \mu^*(s')\rangle_{\mathcal{H}}$.

The complete definition of linear MDPs requires $\phi^*$ and $\mu^*$ to satisfy certain normalization conditions, which we defer to Section 4 for ease of presentation. The most significant benefit of a linear MDP is that, for any policy $\pi$, $Q^{\pi}_{T^*,r}(s,a)$ is linear with respect to $[r(s,a), \phi^*(s,a)]$, thanks to the following observation:
$$Q^{\pi}_{T^*,r}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim T^*(\cdot|s,a)}\big[V^{\pi}_{T^*,r}(s')\big] = r(s,a) + \Big\langle \phi^*(s,a), \int_S \mu^*(s')\,V^{\pi}_{T^*,r}(s')\,ds'\Big\rangle_{\mathcal{H}}. \quad (1)$$
Plenty of sample-efficient algorithms have been developed based on the linear MDP structure with known $\phi^*$ (e.g. Yang & Wang, 2020; Jin et al., 2020; Yang et al., 2020). This requirement limits their practical applicability: in most cases we do not have access to $\phi^*$ and need to perform representation learning to obtain an estimate of it. However, the learning of $\phi$ relies on efficient exploration for full-coverage data, while the design of the exploration strategy relies on an accurate estimate of $\phi$. This coupling between exploration and learning induces extra difficulty. Recently, Uehara et al.
(2022) designed UCB-style exploration for iterative finite-dimensional representation updates with theoretical guarantees. The algorithm requires a computational oracle for the maximum likelihood estimation (MLE) of the conditional density,
$$\max_{\phi,\mu} \sum_{i=1}^{n} \log\langle \phi(s_i,a_i), \mu(s'_i)\rangle_{\mathcal{H}}, \quad \text{s.t. } \forall (s,a),\ \Big\langle \phi(s,a), \int_S \mu(s')\,ds'\Big\rangle_{\mathcal{H}} = 1, \quad (2)$$
which is difficult because we generally do not have specific realizations of $(\phi,\mu)$ pairs that make the constraint hold for arbitrary $(s,a)$ pairs, and is therefore impractical for real-world applications.
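As a quick numerical sanity check of the linearity property in (1), the following sketch builds a small synthetic linear MDP (all sizes, the uniform policy, and the Dirichlet-sampled factors are our own illustrative choices) and verifies that $Q^{\pi}$ is exactly linear in $\phi^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, dZ, gamma = 5, 3, 4, 0.9

# phi(s,a) and mu chosen so T(s'|s,a) = <phi(s,a), mu(s')> is a valid kernel:
# rows of phi are distributions over Z, rows of mu are distributions over S
phi = rng.dirichlet(np.ones(dZ), size=nS * nA)      # (nS*nA, dZ)
mu = rng.dirichlet(np.ones(nS), size=dZ)            # (dZ, nS)
T = phi @ mu                                        # (nS*nA, nS), rows sum to 1

r = rng.uniform(size=nS * nA)
pi = np.full((nS, nA), 1.0 / nA)                    # uniform policy

# exact policy evaluation: Q = (I - gamma * P_pi)^{-1} r
P_pi = (T[:, :, None] * pi[None, :, :]).reshape(nS * nA, nS * nA)
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r)

# linearity (1): Q(s,a) = r(s,a) + <phi(s,a), theta>, theta = gamma * mu V
V = (pi * Q.reshape(nS, nA)).sum(axis=1)
theta = gamma * mu @ V
q_lin = r + phi @ theta
```

Here `q_lin` coincides with `Q` up to numerical precision, illustrating that knowing $\phi^*$ reduces policy evaluation to linear regression.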

3. LATENT VARIABLE MODELS AS LINEAR MDPS

In this section, we first reveal the linear representation view of transitions with a latent variable structure. This essential connection brings several benefits for learning, planning, and exploration/exploitation. More specifically, the latent variable model view provides a tractable variational learning scheme, while the linear representation view inspires computationally efficient planning and exploration/exploitation mechanisms.

We focus on a transition operator $T^*: S \times A \to \Delta(S)$ with a latent variable structure, i.e., there exist a latent space $Z$ and two conditional probability measures $p^*(z|s,a)$ and $p^*(s'|z)$, such that
$$T^*(s'|s,a) = \int_Z p^*(z|s,a)\,p^*(s'|z)\,d\mu(z),$$
where $\mu$ is the Lebesgue measure on $Z$ when $Z$ is continuous and the counting measure when $Z$ is discrete. Assuming $p^*(\cdot|s,a) \in L_2(\mu)$ and $p^*(s'|\cdot) \in L_2(\mu)$, we can write $T^*(s'|s,a) = \langle p^*(\cdot|s,a), p^*(s'|\cdot)\rangle_{L_2(\mu)}$.

Connection to Ren et al. (2022b). To provide a concrete example of LV-Rep, we consider the stochastic nonlinear control model with Gaussian noise (Ren et al., 2022b), which is widely used in most model-based RL algorithms. Such a model can be understood as a special case of LV-Rep. In Ren et al. (2022b), the transition operator is defined as
$$T^*(s'|s,a) = (2\pi\sigma^2)^{-d/2}\exp\big(-\|s'-f^*(s,a)\|^2/(2\sigma^2)\big) = \langle p^*(\cdot|s,a), p^*(s'|\cdot)\rangle_{L_2(\mu)},$$
where $p^*(z|s,a) \propto \exp\big(-\|z-f^*(s,a)\|^2/\sigma^2\big)$ and $p^*(s'|z) \propto \exp\big(-\|z-s'\|^2/\sigma^2\big)$, i.e., Gaussians with variance $\sigma^2/2$ whose convolution recovers the variance-$\sigma^2$ dynamics.

Efficient Simulation from LV-Rep. We can easily draw samples from the learned model $T(s'|s,a) = \int_Z p(z|s,a)\,p(s'|z)\,d\mu$ by first sampling $z_i \sim p(z|s,a)$ and then sampling $s'_i \sim p(s'|z_i)$, without the need to call other complicated samplers, e.g., MCMC, required for the general unnormalized transition operators in linear MDPs. Such a property is important for computationally efficient planning on the learned model.

Variational Learning of LV-Rep.
Another significant benefit of LV-Rep is that we can leverage the variational method to obtain a tractable surrogate objective of MLE, also known as the evidence lower bound (ELBO) (Kingma et al., 2019), which can be derived as follows:
$$\log T(s'|s,a) = \log \int p^*(z|s,a)\,p^*(s'|z)\,dz = \log \int p^*(z|s,a)\,p^*(s'|z)\,\frac{q(z|s,a,s')}{q(z|s,a,s')}\,dz$$
$$= \max_{q\in\Delta(Z)} \mathbb{E}_{z\sim q(\cdot|s,a,s')}\big[\log p^*(s'|z)\big] - D_{\mathrm{KL}}\big(q(z|s,a,s')\,\|\,p^*(z|s,a)\big), \quad (5)$$
where $q(z|s,a,s')$ is an auxiliary distribution. The last equality comes from Jensen's inequality: for any fixed $q$ the right-hand side is a lower bound, and equality holds when $q(z|s,a,s') = p(z|s,a,s') \propto p(z|s,a)\,p(s'|z)$. Compared with the standard MLE used in (Agarwal et al., 2020; Uehara et al., 2022), maximizing the ELBO is more computationally efficient, as it avoids computing the normalization integral. Meanwhile, if the family of variational distributions $q$ is sufficiently flexible to contain the exact posterior $p(z|s,a,s')$ for every possible $(p(z|s,a), p(s'|z))$ pair, then maximizing the ELBO is equivalent to performing MLE, i.e., they share the same solution.

Algorithm 1 Online Exploration with LV-Rep (main loop)
4: Collect the transition $(s, a, s', a', \tilde{s})$, where $s \sim d^{\pi_{n-1}}_{T^*}$, $a \sim U(A)$, $s' \sim T^*(\cdot|s,a)$, $a' \sim U(A)$, $\tilde{s} \sim T^*(\cdot|s',a')$. Set $D_n = D_{n-1} \cup \{(s,a,s')\}$, $D'_n = D'_{n-1} \cup \{(s',a',\tilde{s})\}$.
5: Learn the latent variable model $\hat{p}_n(z|s,a)$ with $D_n \cup D'_n$ by maximizing the ELBO in (5), and obtain the learned model $\hat{T}_n$.
6: Set the exploration bonus $\hat{b}_n(s,a)$ as in (7).
7: Update the policy $\pi_n = \arg\max_\pi V^{\pi}_{\hat{T}_n, r+\hat{b}_n}$.
8: end for
9: Return $\pi_1, \cdots, \pi_N$.
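To make the ELBO objective concrete, here is a minimal Monte-Carlo sketch for a one-dimensional Gaussian latent model; the Gaussian parameterization and all names are illustrative assumptions, not the paper's implementation:

```python
import math
import random

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) in one dimension."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def elbo(s_next, mu_post, var_post, mu_prior, var_prior, var_dec, n_samples=1):
    """Monte-Carlo ELBO for one transition (s, a, s'):
        E_{z~q}[log p(s'|z)] - KL( q(z|s,a,s') || p(z|s,a) ),
    with Gaussian encoder q, Gaussian prior p(z|s,a), and Gaussian
    decoder p(s'|z) = N(z, var_dec)."""
    rec = 0.0
    for _ in range(n_samples):
        z = random.gauss(mu_post, math.sqrt(var_post))   # z ~ q(.|s,a,s')
        rec += -0.5 * (math.log(2 * math.pi * var_dec)
                       + (s_next - z) ** 2 / var_dec)
    rec /= n_samples
    return rec - gaussian_kl(mu_post, var_post, mu_prior, var_prior)

random.seed(0)
value = elbo(1.0, 0.5, 0.5, 0.0, 1.0, 1.0, n_samples=1000)
```

When the encoder equals the exact posterior, the ELBO matches the marginal log-likelihood $\log T(s'|s,a)$ in expectation, illustrating the tightness claim above.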

3.1. REINFORCEMENT LEARNING WITH LV-REP

As the transition operator is linear with respect to LV-Rep, the state-action value function of an arbitrary policy can be linearly represented by LV-Rep. Once the LV-Rep is learned, we can execute planning and exploration in the linear space formed by LV-Rep. Due to the space limit, we mainly consider the online exploration setting; offline policy optimization is explained in Appendix B.

Practical Parameterization of the Q function. With the linear factorization of dynamics through latent variable models (1), we have
$$Q^{\pi}_{T^*,r}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{p^*(z|s,a)}\big[w^{\pi}(z)\big], \quad (6)$$
where $w^{\pi}(z) = \int_S p^*(s'|z)\,V^{\pi}_{T^*,r}(s')\,ds'$ can be viewed as a value function over the latent state. When the latent variable is finite-dimensional, i.e., $|Z|$ is finite, we have $w = [w(z)]_{z\in Z} \in \mathbb{R}^{|Z|}$, and the expectation $\mathbb{E}_{p^*(z|s,a)}[w^{\pi}(z)]$ can be computed exactly by enumerating over $Z$. However, when $Z$ is not a finite set, we generally cannot compute the expectation exactly, which makes representing the Q function through $p^*(z|s,a)$ hard. In particular, under our normalization condition (Assumption 2 shown later), we have $w^{\pi} \in \mathcal{H}_k$, where $\mathcal{H}_k$ is a reproducing kernel Hilbert space with kernel $k$. When $k$ admits a random feature representation (see Definition 13), we can express $w^{\pi}$ as
$$w^{\pi}(z) = \int_{\Xi} \tilde{w}^{\pi}(\xi)\,\psi(z;\xi)\,dP(\xi),$$
where the concrete $P(\xi)$ depends on the kernel $k$. Plugging this representation of $w^{\pi}(z)$ into (6), we obtain the approximate representation of $Q^{\pi}_{T^*,r}(s,a)$ as
$$Q^{\pi}_{T^*,r}(s,a) = r(s,a) + \gamma \int_Z w^{\pi}(z)\,p^*(z|s,a)\,d\mu = r(s,a) + \gamma \int_Z \int_{\Xi} \tilde{w}(\xi)\,\psi(z;\xi)\,dP(\xi)\,p^*(z|s,a)\,d\mu \approx r(s,a) + \frac{\gamma}{m}\sum_{i\in[m]} \tilde{w}(\xi_i)\,\psi(z_i;\xi_i),$$
which shows that we can approximate $Q^{\pi}_{T^*,r}(s,a)$ with a linear function on top of the random feature $\varphi(s,a) = [\psi(z_i;\xi_i)]_{i\in[m]}$, where $z_i \sim p^*(z|s,a)$ and $\xi_i \sim P(\xi)$. This can be viewed as a two-layer neural network with fixed first-layer weights $\xi_i$, activation $\psi$, and trainable second-layer weights $w = [\tilde{w}(\xi_i)]_{i=1}^m \in \mathbb{R}^m$.
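The random-feature construction above can be sketched as follows, assuming random Fourier features of an RBF kernel as the concrete choice of $\psi$ and $P(\xi)$ (one option among many; sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 128, 4            # number of random features, latent dimension

# fixed first layer: random frequencies xi_i ~ P (here, the random Fourier
# feature measure of an RBF kernel) and random phases
xi = rng.normal(size=(m, d))
phase = rng.uniform(0.0, 2 * np.pi, size=m)

def random_feature(z_samples):
    """varphi(s,a) = [psi(z_i; xi_i)]_{i in [m]} with z_i ~ p(z|s,a):
    each posterior sample z_i is paired with its own random parameter xi_i."""
    assert z_samples.shape == (m, d)
    return np.sqrt(2.0 / m) * np.cos((z_samples * xi).sum(axis=1) + phase)

def q_approx(r_sa, z_samples, w, gamma=0.99):
    """Q(s,a) ~ r(s,a) + gamma * w^T varphi(s,a): a two-layer network with
    a frozen random first layer (xi, psi = cos) and trainable weights w."""
    return r_sa + gamma * float(w @ random_feature(z_samples))
```

Only the second-layer weights `w` are trained, matching the fixed-first-layer view described above; the $1/m$ Monte-Carlo factor is absorbed into `w`.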
Planning and Exploration with LV-Rep. Following the idea of REP-UCB (Uehara et al., 2022), we introduce an additional bonus to implement the principle of optimism in the face of uncertainty. We use the standard elliptical potential for the upper confidence bound, which can be computed efficiently as
$$\hat{b}_n(s,a) = \alpha_n \sqrt{\hat{\varphi}_n(s,a)^\top \hat{\Sigma}_n^{-1} \hat{\varphi}_n(s,a)}, \quad \hat{\varphi}_n(s,a) = [\psi(z_i;\xi_i)]_{i\in[m]}, \quad \hat{\Sigma}_n = \sum_{(s,a)\in D_n} \hat{\varphi}_n(s,a)\hat{\varphi}_n(s,a)^\top + \lambda I, \quad (7)$$
where $\alpha_n$ and $\lambda$ are constants, and $D_n$ is the collected dataset. Planning can then be completed by Bellman recursion with the bonus, i.e., $Q^{\pi}(s,a) = r(s,a) + \hat{b}_n(s,a) + \gamma\,\mathbb{E}_{\hat{T}}[V^{\pi}(s')]$. We could exploit the augmented feature $[r(s,a), \hat{\varphi}(s,a), \hat{b}_n(s,a)]$ to linearly represent $Q^{\pi}$ once the bonus is introduced; however, representing the quadratic bonus linearly incurs an extra $O(m^2)$ cost in the feature dimension. Therefore, we consider a two-layer MLP on top of $\hat{\varphi}$ to parameterize $Q(s,a) = w_0\,r(s,a) + w_1^\top \hat{\varphi}(s,a) + w_2^\top \sigma\big(W_3\,\hat{\varphi}(s,a)\big)$, where $\sigma(\cdot)$ is a nonlinear activation that compensates for the effect of the nonlinear $\hat{b}_n$. We finally conduct an approximate dynamic programming style algorithm (e.g. Munos & Szepesvári, 2008) with this Q parameterization.

The Complete Algorithm. We show the complete algorithm for online exploration with LV-Rep in Algorithm 1. Our algorithm follows the standard protocol for sequential decision making. In each episode, the agent first executes the exploratory policy obtained from the last episode and collects data (Line 4). The data are then used for training the latent variable model by maximizing the ELBO defined in equation 5 (Line 5). With the newly learned $\hat{p}_n(z|s,a)$, we add the exploration bonus defined in equation 7 to the reward (Line 6), and obtain the new exploratory policy by planning on the learned model with the exploration bonus (Line 7); this policy is used in the next episode. Note that, in Line 4, we need to sample $s \sim d^{\pi_{n-1}}_{T^*}$, which can be obtained by starting from $s_0 \sim d_0$, executing $\pi_{n-1}$, stopping with probability $1-\gamma$ at each time step $t \geq 0$, and returning $s_t$.
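A minimal sketch of the elliptical-potential bonus, assuming the standard form $b(s,a) = \alpha\,\|\varphi(s,a)\|_{\Sigma^{-1}}$ with $\Sigma$ the regularized feature covariance of the dataset (function and variable names are ours):

```python
import numpy as np

def ucb_bonus(phi_sa, Phi_data, alpha, lam):
    """Elliptical-potential bonus b(s,a) = alpha * sqrt(phi^T Sigma^{-1} phi),
    where Sigma = sum_i phi_i phi_i^T + lam * I over the dataset features.
    Phi_data has one feature row per collected transition."""
    m = phi_sa.shape[0]
    Sigma = Phi_data.T @ Phi_data + lam * np.eye(m)
    return alpha * float(np.sqrt(phi_sa @ np.linalg.solve(Sigma, phi_sa)))
```

The bonus shrinks along feature directions that the dataset already covers and stays large along uncovered directions, which is exactly what drives the exploration in Algorithm 1.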
LV-Rep can also be used for offline exploitation, and we defer the corresponding algorithm to Appendix B.

4. THEORETICAL ANALYSIS

In this section, we provide the theoretical analysis of representation learning with LV-Rep. Before we start, we introduce the following two assumptions, which are widely used in the literature (e.g. Agarwal et al., 2020; Uehara et al., 2022).

Assumption 1 (Finite Candidate Class with Realizability). $|\mathcal{P}| < \infty$ and $(p^*(z|s,a), p^*(s'|z)) \in \mathcal{P}$. Meanwhile, for all $(p(z|s,a), p(s'|z)) \in \mathcal{P}$, $p(z|s,a,s') \in \mathcal{Q}$.

Remark 1. The assumption on $\mathcal{P}$ is widely used in the literature (e.g. Agarwal et al., 2020; Uehara et al., 2022), while the assumption on $\mathcal{Q}$ guarantees that the estimator obtained by maximizing the ELBO defined in equation 5 is identical to the estimator obtained by MLE. We would like to remark that the extension to other data-independent function class complexities (e.g. Rademacher complexity (Bartlett & Mendelson, 2002)) is straightforward given a refined non-asymptotic generalization bound for MLE.

Assumption 2 (Normalization Conditions). $\forall p \in \mathcal{P}$ and $(s,a) \in S \times A$, $\|p(\cdot|s,a)\|_{\mathcal{H}_k} \leq 1$. Furthermore, $\forall g: S \to \mathbb{R}$ such that $\|g\|_\infty \leq 1$, we have $\big\|\int_S p(s'|\cdot)\,g(s')\,ds'\big\|_{\mathcal{H}_k} \leq C$.

Remark 2. Our normalization conditions are substantially different from those of standard linear MDPs. Specifically, standard linear MDPs assume that the representations $\phi(s,a)$ and $\mu(s')$ are of finite dimension $d$, with $\|\phi(s,a)\|_2 \leq 1$ and, for all $\|g\|_\infty \leq 1$, $\big\|\int_S \mu(s')\,g(s')\,ds'\big\|_2 \leq \sqrt{d}$. When $|Z|$ is finite, since $\|f\|_{L_2(\mu)} \leq \|f\|_{\mathcal{H}_k}$, our normalization conditions are more general than the counterparts for standard linear MDPs, and we can use the identical normalization conditions as standard linear MDPs. However, when $|Z|$ is infinite, if we only assume $\|p(z|s,a)\|_{L_2(\mu)} \leq 1$, we cannot provide a sample complexity bound without polynomial dependence on $|\mathcal{P}|$, which can be unsatisfactory. Furthermore, we would like to note that the assumption $\int_S p(s'|z)\,g(s')\,ds' \in \mathcal{H}_k$ is mild, and is necessary for justifying the estimate produced by the approximate dynamic programming algorithm.
Theorem 1 (PAC Guarantee for Online Exploration, Informal). Assume the reproducing kernel $k$ satisfies the regularity conditions in Appendix E.1. If we properly choose the exploration bonus $\hat{b}_n(s,a)$, we can obtain an $\varepsilon$-optimal policy with probability at least $1-\delta$ after interacting with the environment for $N = \mathrm{poly}\big(C, |A|, (1-\gamma)^{-1}, \varepsilon^{-1}, \log(|\mathcal{P}|/\delta)\big)$ episodes.

Remark 3. Although $|Z|$ may not be finite, we can still obtain a sample complexity independent of $|S|$ with polynomial dependence on $C$, $|A|$, $(1-\gamma)^{-1}$, $\varepsilon^{-1}$, and $\log|\mathcal{P}|$, under the assumption that $p(\cdot|s,a) \in \mathcal{H}_k$ and some standard regularity conditions on the kernel $k$. This means that we do not really need to assume a discrete $Z$ with finite cardinality; we only need to properly control the complexity of the representation class, either by the ambient dimension $|Z|$ or by some "effective dimension" derived from the eigendecay of the kernel $k$ (see Appendix E.1 for details). The formal statement of Theorem 1 and its proof are deferred to Appendix E.2. We also provide a PAC guarantee for offline exploitation with LV-Rep in Appendix E.3.

Remark 4. Our proof strategy is based on the analysis of REP-UCB (Uehara et al., 2022). However, there are substantial differences between our analysis and theirs, as the representation we consider can be infinite-dimensional; hence the analysis of REP-UCB, which assumes a finite-dimensional feature, cannot be directly applied to our case. As mentioned, to address the infinite-dimension issue, we assume the representation $p(z|s,a) \in \mathcal{H}_k$ and prove two novel lemmas, one for the concentration of the bonus (Lemma 17) and one for the ellipsoid potential bound (Lemma 19), when the representation lies in the RKHS.
We further note that, unlike the work on kernelized bandits and kernelized MDPs (Srinivas et al., 2010; Valko et al., 2013; Yang et al., 2020), which assumes the reward and Q functions lie in some RKHS, we assume the conditional density of the latent random variable lies in the RKHS and the Q function is the $L_2(\mu)$ inner product of two functions in the RKHS. As a result, the techniques used in those works cannot be directly adapted to our setting; their regret bounds depend on the alternative notions of maximum information gain and effective dimension of the specific kernel, which are implied by the eigendecay conditions we assume in Appendix E.1 (see Yang et al. (2020) for details).

5. RELATED WORK

There are several other theoretically grounded representation learning methods under the linear MDP assumption. However, most of these works either consider more restricted models or entirely ignore the computational issues. Du et al. (2019); Misra et al. (2020) focused on representation learning in block MDPs, a special case of linear MDPs (Agarwal et al., 2020), and proposed to learn the representation via regression. However, both use policy-cover-based exploration that must maintain a large number of policies during training, which induces a significant computational bottleneck. Uehara et al. (2022) and Zhang et al. (2022b) exploit UCB on the learned representation to resolve this issue. However, their algorithms depend on computational oracles, i.e., MLE for an unnormalized conditional distribution in (2) or a max-min-max optimization solver motivated by Modi et al. (2021), respectively, which can be hard to implement in practice. A variety of recent works propose to replace the computational oracle with more tractable estimators. For example, Ren et al. (2022b) exploited a representation with the structure of Gaussian noise in nonlinear stochastic control problems with arbitrary dynamics, which restricts flexibility. Zhang et al. (2022a); Qiu et al. (2022) proposed contrastive learning approaches as an alternative. However, similar to other contrastive learning approaches, both methods require access to a negative distribution supported on the whole state space, and their performance depends heavily on the quality of the negative distribution. Ren et al. (2022a) designed a new objective based on the idea of spectral decomposition, but the solution to their objective is not necessarily a valid distribution, and the generalization bound is worse than that of MLE when the state space is finite.
Algorithmically, many representation learning methods have been developed for different purposes, such as state extraction from vision-based features (Laskin et al., 2020a;b; Kostrikov et al., 2020), bisimulation (Ferns et al., 2004; Gelada et al., 2019; Zhang et al., 2020), successor features (Dayan, 1993; Barreto et al., 2017; Kulkarni et al., 2016), spectral representations from transition operator decomposition (Mahadevan & Maggioni, 2007; Wu et al., 2018; Duan et al., 2019), contrastive learning (Oord et al., 2018; Nachum & Yang, 2021; Yang et al., 2020), and so on. However, most of these methods are designed for state-only features, ignoring the action dependency, and learn from pre-collected datasets, without taking planning and exploration into account and ignoring the coupling between representation learning and exploration. As a result, no rigorous theoretical characterization is provided for them. We would like to emphasize that the proposed LV-Rep achieves both statistical efficiency theoretically and computational tractability empirically. For more related work on model-based RL, please refer to Appendix A.

6. EXPERIMENTS

We extensively test our algorithm on MuJoCo (Todorov et al., 2012) and the DeepMind Control Suite (Tassa et al., 2018). Before presenting the experimental results, we first discuss some details toward a practical implementation of LV-Rep.

6.1. IMPLEMENTATION DETAILS

As discussed, the latent variable representation is learned by maximizing the ELBO (5). We consider two practical implementations. The first applies a continuous latent variable model, where the distributions are approximated by Gaussians with parameterized means and variances, similar to (Hafner et al., 2019b); we call this method LV-Rep-C. The second implementation uses a discrete sparse latent variable model (Hafner et al., 2019b), which we call LV-Rep-D. As described in Line 7 of Algorithm 1, we apply a planning algorithm with the learned latent representation to improve the policy. We use Soft Actor-Critic (SAC) (Haarnoja et al., 2018) as our planner, where the critic is parameterized as shown in (6). In practice, we find it beneficial to perform more updates for the latent variable model than for the critic. We also use a target network for the latent variable model to stabilize training.
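For concreteness, the critic head described in Section 3.1 can be sketched as below; using tanh as the activation $\sigma$ and all parameter names are our illustrative assumptions, not the released implementation:

```python
import numpy as np

def critic(r_sa, phi, w0, w1, w2, W3):
    """Critic head following the Q parameterization of Section 3.1:
        Q(s,a) = w0 * r(s,a) + w1^T phi + w2^T sigma(W3 phi),
    with sigma = tanh here.  The linear branch captures the LV-Rep value
    part; the small MLP branch absorbs the nonlinear bonus term."""
    return w0 * r_sa + float(w1 @ phi) + float(w2 @ np.tanh(W3 @ phi))
```

In the SAC planner this scalar head would be trained by the usual Bellman-error loss, with `phi` produced by the learned latent variable model.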

6.2. DENSE-REWARD MUJOCO BENCHMARKS

We first conduct experiments on dense-reward MuJoCo locomotion control tasks, which are commonly used test domains for both model-free and model-based RL algorithms. We compare LV-Rep with model-based algorithms, including ME-TRPO (Kurutach et al., 2018), PETS (Chua et al., 2018), and the best model-based results from (Wang et al., 2019) among 9 baselines (Luo et al., 2018; Deisenroth & Rasmussen, 2011; Heess et al., 2015; Clavera et al., 2018; Nagabandi et al., 2018; Tassa et al., 2012; Levine & Abbeel, 2014), as well as model-free algorithms, including PPO (Schulman et al., 2017), TRPO (Schulman et al., 2015), and SAC (Haarnoja et al., 2018). We compare all algorithms after running 200K environment steps. Table 1 presents all experimental results, averaged over 4 random seeds. In practice we found that LV-Rep-C provides comparable or better performance (see Figure 1 for an example), so we report its result for LV-Rep in the table; we present the best model-based RL performance for comparison. The results clearly show that LV-Rep provides significantly better or comparable performance relative to all model-based algorithms. In particular, in the most challenging domains, such as Walker and Ant, most model-based methods completely fail the task, while LV-Rep achieves state-of-the-art performance. Furthermore, LV-Rep shows dominant performance in all domains compared to two representative representation-learning-based RL methods, Deep Successor Feature (DeepSF) (Barreto et al., 2017) and SPEDE (Ren et al., 2022b). LV-Rep also achieves better performance than the strongest model-free algorithm, SAC, in the most challenging domains except Humanoid. Finally, we provide learning curves of LV-Rep-C and LV-Rep-D in comparison to SAC in Figure 1, which clearly show that, compared to the SOTA model-free baseline SAC, LV-Rep enjoys superior sample efficiency on these tasks.

6.3. SPARSE-REWARD DEEPMIND CONTROL SUITE

In this experiment we show the effectiveness of the proposed methods on sparse-reward problems. We compare LV-Rep with state-of-the-art model-free RL methods, including SAC and PPO. Since the proposed LV-Rep significantly dominates all the model-based RL algorithms on MuJoCo from Wang et al. (2019), we consider a different model-based RL method, Dreamer (Hafner et al., 2019b), and add another representation-based RL method, Proto-RL (Yarats et al., 2021), besides DeepSF (Barreto et al., 2017). We compare all algorithms after running 200K environment steps across 4 random seeds. Results are presented in Table 2. We report the result of LV-Rep-C for LV-Rep as it gives better empirical performance. We can clearly observe that LV-Rep dominates across all domains. On the relatively dense-reward problems, cheetah-run and walker-run, LV-Rep outperforms all baselines by a large margin. Remarkably, on the sparse-reward problems, hopper-hop and humanoid-run, LV-Rep provides reasonable results while the other methods do not even start learning. We also plot the learning curves of LV-Rep against all competitors in Figure 2, which shows that LV-Rep outperforms the other baselines in terms of both sample efficiency and final performance.

A MORE RELATED WORK

Our method is also closely related to model-based reinforcement learning. These methods maintain an estimate of the dynamics and reward learned from data, and extract the optimal policy via planning modules. The major differences among these methods lie in i) the model parameterization and learning objectives, and ii) the approximate algorithms used for planning. Specifically, Gaussian processes are exploited in (Deisenroth & Rasmussen, 2011). Stochastic deep dynamics models with a restrictive Gaussian noise assumption are widely used (Heess et al., 2015; Kurutach et al., 2018; Chua et al., 2018; Clavera et al., 2018; Nagabandi et al., 2018). Hafner et al. (2019a;b; 2020); Lee et al. (2020) recently exploited recurrent latent state space models, but focused on the partially observable MDP setting, which is beyond the scope of our paper. Different approximate planning algorithms, including Dyna-style planning, shooting, and policy search with backpropagation through time, have been tailored to these methods; please refer to (Wang et al., 2019) for a detailed discussion. As we discussed in Section 1, these algorithms do not balance flexibility in modeling with tractability in planning and exploration, which may lead to sub-optimal performance. The proposed LV-Rep is not only more flexible, going beyond the Gaussian noise assumption, but also leads to provable and tractable learning, planning, and exploration, thus achieving better empirical performance.

B ALGORITHMS AND THEORETICAL ANALYSIS FOR OFFLINE EXPLOITATION

In this section, we present the algorithm for offline exploitation. In the offline setting, we have access to an offline dataset, which we assume is collected from the stationary distribution, denoted ρ, of a fixed behavior policy π_b, and we are not allowed to interact with the environment to collect new data. The only difference between the offline exploitation algorithm and the online exploration algorithm is that, since we cannot gather new data from the environment, we cannot further explore state-action pairs that the offline dataset does not cover. Hence, we penalize visitation of unseen state-action pairs to avoid risky behavior.
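The switch from online optimism to offline pessimism amounts to flipping the sign of the same truncated elliptical uncertainty term. The following is a minimal numerical sketch with finite-dimensional features; the feature matrix `Phi`, the scale `alpha`, and the regularizer `lam` are illustrative stand-ins for the representation and parameters defined in Theorem 4, not the paper's implementation:

```python
import numpy as np

def pessimistic_rewards(Phi, rewards, alpha=1.0, lam=1.0):
    """Penalize rewards by an elliptical uncertainty term built from the
    offline features Phi (n x d): r_i - min(alpha * ||phi_i||_{Sigma^-1}, 2)."""
    n, d = Phi.shape
    Sigma = Phi.T @ Phi + lam * np.eye(d)          # regularized covariance
    Sigma_inv = np.linalg.inv(Sigma)
    # quadratic form phi_i^T Sigma^{-1} phi_i for every row of Phi
    quad = np.einsum("nd,de,ne->n", Phi, Sigma_inv, Phi)
    penalties = np.minimum(alpha * np.sqrt(quad), 2.0)  # truncate as in the bonus
    return rewards - penalties

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 4))
r = rng.normal(size=100)
r_pess = pessimistic_rewards(Phi, r)
```

Poorly covered directions of the feature space get a large penalty, so planning against `r_pess` steers the policy back toward the support of the behavior distribution.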

C IMPLEMENTATION DETAILS

Our algorithm is implemented in PyTorch. For DeepMind Control, we use an open-source implementation as our SAC baseline (Yarats & Kostrikov, 2020). As discussed in Section 6.1, we find it beneficial in practice to update the latent variable model more often than the critic. We introduce a parameter, feature-updates-per-step, that determines how many updates are performed on the latent variable model at each training step. For all MuJoCo and DeepMind Control experiments, we tune this parameter over {1, 3, 5, 10, 20} and report the best result. Finally, in Table 3, we list all other hyperparameters and the network architectures used in our experiments. For evaluation in MuJoCo, at each evaluation (every 5K steps) we test our algorithm for 10 episodes, and we average the results over the last 4 evaluations and 4 random seeds. For Dreamer and Proto-RL, we change their networks from CNNs to 3-layer MLPs and disable the image data augmentation (since we test on the state space). We tune some of their hyperparameters (e.g., exploration steps in Proto-RL) and report the best number across our runs.
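The feature-updates-per-step schedule can be sketched as follows. This is only an illustration of the update ratio; `model_update`, `critic_update`, `actor_update`, and the replay buffer's `sample` method are hypothetical placeholders for the actual training routines:

```python
def train_step(replay_buffer, model_update, critic_update, actor_update,
               feature_updates_per_step=5, batch_size=256):
    """One training step: several latent-model updates, one critic/actor update.
    All callables here are hypothetical placeholders."""
    # Update the latent variable model several times per environment step ...
    for _ in range(feature_updates_per_step):
        batch = replay_buffer.sample(batch_size)
        model_update(batch)      # e.g., maximize the ELBO of the latent model
    # ... but update the critic and actor only once.
    batch = replay_buffer.sample(batch_size)
    critic_update(batch)
    actor_update(batch)
```

The only design point the sketch captures is the asymmetry: the representation is refreshed `feature_updates_per_step` times for every single critic and actor update.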

D TECHNICAL BACKGROUNDS

In this section, we introduce several concepts from functional analysis that will be used repeatedly in our theoretical analysis. We start with the concept of an R-vector space.

Definition 2 (R-vector space (Steinwart & Christmann, 2008)). An R-vector space is a triple (E, +, •), where E is a non-empty set and the maps + : E × E → E and • : R × E → E satisfy the following properties:
• ∀x, y, z ∈ E, (x + y) + z = x + (y + z).
• ∀x, y ∈ E, x + y = y + x.
• ∃ an element 0 ∈ E such that ∀x ∈ E, x + 0 = x.
• ∀x ∈ E, ∃ -x ∈ E such that x + (-x) = 0.
• ∀α, β ∈ R, x ∈ E, (αβ) • x = α • (β • x).
• ∀x ∈ E, 1 • x = x.
• ∀α, β ∈ R, x ∈ E, (α + β) • x = α • x + β • x.
• ∀α ∈ R, x, y ∈ E, α • (x + y) = α • x + α • y.
The symbol •, denoting scalar multiplication, will be omitted when there is no confusion.

Definition 3 (Norm and Banach Space (Steinwart & Christmann, 2008)). Let E be an R-vector space. A map ∥•∥ : E → [0, ∞) is a norm on E if
• ∥x∥ = 0 ⇔ x = 0.
• ∀α ∈ R, x ∈ E, ∥αx∥ = |α|∥x∥.
• ∀x, y ∈ E, ∥x + y∥ ≤ ∥x∥ + ∥y∥.
If E is in addition complete with respect to ∥•∥, the pair (E, ∥•∥) is called a Banach space; for simplicity we write E for the Banach space when there is no confusion.

Definition 4 (Bounded Linear Operator (Steinwart & Christmann, 2008)). Let E and F be two Banach spaces. A map S : E → F is a bounded linear operator if
• ∀x, y ∈ E, S(x + y) = Sx + Sy.
• ∀α ∈ R, x ∈ E, S(αx) = α(Sx).
• ∃ c ∈ [0, ∞) such that ∀x ∈ E, ∥Sx∥_F ≤ c∥x∥_E.
Note that the first two properties imply S0 = 0. The set of bounded linear operators itself forms an R-vector space, and we can define the operator norm of S as ∥S∥_op := sup_{x∈B_E} ∥Sx∥_F, where B_E = {x ∈ E : ∥x∥_E ≤ 1} is the unit ball of E.

Definition 5 (Compact Operator (Steinwart & Christmann, 2008)). A bounded linear operator S : E → F is compact if the closure of SB_E is compact in F.

Definition 6 (Inner Product and Hilbert Space (Steinwart & Christmann, 2008)).
A map ⟨•, •⟩ : H × H → R on an R-vector space H is an inner product if
• ∀x, y, z ∈ H, ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩.
• ∀α ∈ R, x, y ∈ H, ⟨αx, y⟩ = α⟨x, y⟩.
• ∀x, y ∈ H, ⟨x, y⟩ = ⟨y, x⟩.
• ∀x ∈ H, ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 ⇔ x = 0.
If H is complete with respect to the norm induced by the inner product, ∥x∥_H := √⟨x, x⟩, the pair (H, ⟨•, •⟩) is called a Hilbert space. We sometimes write H for the Hilbert space and use ⟨•, •⟩_H to distinguish between different inner products. Note that the inner product satisfies the Cauchy–Schwarz inequality: ∀x, y ∈ H, |⟨x, y⟩_H| ≤ ∥x∥_H ∥y∥_H.

Definition 7 ((Self-)Adjoint Operator (Steinwart & Christmann, 2008)). Let H_1 and H_2 be two Hilbert spaces. For an operator S : H_1 → H_2, the adjoint operator S* : H_2 → H_1 is defined by ∀x ∈ H_1, y ∈ H_2, ⟨Sx, y⟩_{H_2} = ⟨x, S*y⟩_{H_1}. Furthermore, S is self-adjoint if S : H_1 → H_1 and ∀x, y ∈ H_1, ⟨Sx, y⟩_{H_1} = ⟨x, Sy⟩_{H_1}. A self-adjoint operator S is called positive semi-definite if ⟨Sx, x⟩ ≥ 0, and positive definite if ⟨Sx, x⟩ > 0.

Remark 5. The definition of the adjoint operator can be generalized to Banach spaces, but adjoint operators on Hilbert spaces are sufficient for our purposes, so we omit the definition of adjoint operators on Banach spaces.

Definition 8 (Orthonormal System and Orthonormal Basis (Steinwart & Christmann, 2008)). For a Hilbert space H, a family {e_i}_{i∈I}, e_i ∈ H, is an orthonormal system if ⟨e_i, e_i⟩ = 1 and ⟨e_i, e_j⟩ = 0 for i ≠ j. Furthermore, if the closure of the linear span of {e_i}_{i∈I} equals H, it is an orthonormal basis. Every Hilbert space H has an orthonormal basis, and if H is separable, H has a countable orthonormal basis. Furthermore, ∀x ∈ H, we have x = Σ_{i∈I} ⟨x, e_i⟩ e_i.

Theorem 2 (Spectral Theorem (Steinwart & Christmann, 2008)). Let H be a Hilbert space and T : H → H be compact and self-adjoint.
Then there exists an at most countable family {µ_i(T)}_{i∈I} converging to 0 with |µ_1(T)| ≥ |µ_2(T)| ≥ ··· > 0, and an orthonormal system {e_i}_{i∈I}, such that ∀x ∈ H, Tx = Σ_{i∈I} µ_i(T)⟨x, e_i⟩_H e_i. Here {µ_i(T)}_{i∈I} can be viewed as the eigenvalues of T, as Te_i = µ_i(T)e_i.

Definition 9 (Trace and Trace Class (Steinwart & Christmann, 2008)). For a compact and self-adjoint operator T, if Σ_{i∈I} µ_i(T) < ∞, we say T is nuclear, or of trace class, and define the nuclear norm and the trace as ∥T∥_* = Tr(T) = Σ_{i∈I} µ_i(T).

Definition 10 (Hilbert–Schmidt Operator (Steinwart & Christmann, 2008)). Let H_1, H_2 be two Hilbert spaces. An operator S : H_1 → H_2 is Hilbert–Schmidt if ∥S∥_HS := ( Σ_{i∈I} ∥Se_i∥²_{H_2} )^{1/2} < ∞, where {e_i}_{i∈I} is an arbitrary orthonormal basis of H_1. Furthermore, the set HS(H) of Hilbert–Schmidt operators on a Hilbert space H is itself a Hilbert space with the inner product ⟨T_1, T_2⟩_{HS(H)} = Σ_{i∈I} ⟨T_1 e_i, T_2 e_i⟩_H, T_1, T_2 ∈ HS(H), where {e_i}_{i∈I} is an arbitrary orthonormal basis of H.

Definition 11 (L_2(µ) space). Let (X, A, µ) be a measure space. The L_2(µ) space is the Hilbert space consisting of the square-integrable functions with respect to µ, with inner product ⟨f, g⟩_{L_2(µ)} := ∫_X fg dµ and norm ∥f∥_{L_2(µ)} := ( ∫_X f² dµ )^{1/2}. Throughout the paper, µ is specified as the Lebesgue measure for continuous X and the counting measure for discrete X. In particular, when X is discrete, we can represent f as a sequence [f(x)]_{x∈X}, and the corresponding L_2(µ) inner product and norm are identical to the ℓ_2 inner product and norm, ⟨f, g⟩_{ℓ_2} = Σ_{x∈X} f(x)g(x), ∥f∥_{ℓ_2} = ( Σ_{x∈X} f²(x) )^{1/2}, which are closely related to the inner product and norm of Euclidean space.

Definition 12 (Kernel and Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950; Paulsen & Raghupathi, 2016)).
A function k : X × X → R is a kernel on a non-empty set X if there exist a Hilbert space H and a feature map ϕ : X → H such that ∀x, x′ ∈ X, k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H. Furthermore, if for all n ≥ 1, {a_i}_{i∈[n]} ⊂ R and mutually distinct {x_i}_{i∈[n]}, we have Σ_{i∈[n]} Σ_{j∈[n]} a_i a_j k(x_i, x_j) ≥ 0, the kernel k is said to be positive semi-definite; if the inequality is strict, k is said to be positive definite. Given the kernel k, a Hilbert space H_k consisting of R-valued functions on the non-empty set X is said to be a reproducing kernel Hilbert space associated with k if the following two conditions hold:
• ∀x ∈ X, k(x, •) ∈ H_k.
• Reproducing property: ∀x ∈ X, f ∈ H_k, f(x) = ⟨f, k(x, •)⟩_{H_k}.
Here k is also called the reproducing kernel of H_k. The RKHS norm of f ∈ H_k is defined as ∥f∥_{H_k} := √⟨f, f⟩_{H_k}. Some well-known kernels include:
• Linear kernel: k(x, x′) = x^⊤x′, where x, x′ ∈ R^d;
• Polynomial kernel: k(x, x′) = (1 + x^⊤x′)^n, where x, x′ ∈ R^d, n ∈ N_+;
• Gaussian kernel: k(x, x′) = exp(-∥x - x′∥²₂ / σ²), where x, x′ ∈ R^d and σ > 0 is the scale parameter;

• Matérn kernel: k(x, x′) = (2^{1-ν} / Γ(ν)) r^ν B_ν(r), where x, x′ ∈ R^d, ν > 0 is the smoothness parameter, l > 0 is the scale parameter, r = (√(2ν)/l)∥x - x′∥, Γ(•) is the Gamma function, and B_ν(•) is the modified Bessel function of the second kind.

Theorem 3 (Mercer's Theorem (Riesz & Nagy, 2012; Steinwart & Scovel, 2012)). Let (X, A, µ) be a measure space with compact support X and strictly positive Borel measure µ, and let k be a continuous positive definite kernel defined on X × X. Then there exist an at most countable family {µ_i}_{i∈I} with µ_1 ≥ µ_2 ≥ ··· > 0 and an orthonormal basis {e_i}_{i∈I} of L_2(µ), such that ∀x, x′ ∈ X, k(x, x′) = Σ_{i∈I} µ_i e_i(x) e_i(x′), where the convergence is absolute and uniform.

Remark 6 (Spectral Characterization of RKHS). With the reproducing property, we know that Σ_{i∈I} µ_i e_i(x) e_i(•) = k(x, •) ∈ H_k. Note that we can choose I such that µ_i > 0 for all i ∈ I (in particular, I is finite under a finite spectrum). If we define the inner product ⟨ Σ_{i∈I} a_i e_i(•), Σ_{i∈I} b_i e_i(•) ⟩_{H_k} = Σ_{i∈I} a_i b_i / µ_i, then we recover the reproducing property: ⟨ Σ_{i∈I} a_i e_i(•), k(x, •) ⟩_{H_k} = ⟨ Σ_{i∈I} a_i e_i(•), Σ_{i∈I} µ_i e_i(x) e_i(•) ⟩_{H_k} = Σ_{i∈I} a_i e_i(x). With this spectral characterization, we have ∥ Σ_{i∈I} a_i e_i(•) ∥²_{H_k} = Σ_{i∈I} a_i² / µ_i ≥ ( Σ_{i∈I} a_i² ) / µ_1 = ∥ Σ_{i∈I} a_i e_i(•) ∥²_{L_2(µ)} / µ_1. Hence, ∀f ∈ H_k, f ∈ L_2(µ). Furthermore, note that k(x, x) = ⟨ Σ_{i∈I} µ_i e_i(x) e_i(•), Σ_{i∈I} µ_i e_i(x) e_i(•) ⟩_{H_k} = Σ_{i∈I} µ_i e_i²(x). Hence, Σ_{i∈I} µ_i = ∫_X Σ_{i∈I} µ_i e_i²(x) dµ = ∫_X k(x, x) dµ. The following Hilbert–Schmidt integral operator is useful in our analysis: T_k : L_2(µ) → L_2(µ), (T_k f)(x) = ∫_X k(x, x′) f(x′) dµ(x′). Obviously, T_k is self-adjoint. Using the fact that k(x, x′) = Σ_{i∈I} µ_i e_i(x) e_i(x′), we obtain T_k e_i = µ_i e_i, which means e_i is an eigenfunction of T_k with corresponding eigenvalue µ_i. With the spectral characterization of T_k, we can define the power operator T_k^τ : L_2(µ) → L_2(µ), T_k^τ f = Σ_{i∈I} µ_i^τ ⟨f, e_i⟩ e_i.
All these power operators are self-adjoint. Note that ∥f∥²_{H_k} = ⟨f, T_k^{-1} f⟩_{L_2(µ)}. Throughout the paper, we work on the L_2(µ) space, and all of the operators are defined from L_2(µ) to L_2(µ). As H_k ⊂ L_2(µ), all of these operators can also operate on elements of H_k. The power RKHS induced by the kernel k²(x, x′) := Σ_{i∈I} µ_i² e_i(x) e_i(x′) will also be helpful in our analysis; it is straightforward to see that ∥f∥²_{H_{k²}} = ⟨f, T_k^{-2} f⟩_{L_2(µ)}, which will be useful in the proofs.

Definition 13 (Kernel with Random Feature Representation). A kernel k : X × X → R is said to have a random feature representation if there exist a function ψ : X × Ξ → R and a probability measure P over Ξ such that k(x, x′) = ∫_Ξ ψ(x; ξ) ψ(x′; ξ) dP(ξ). One can show that H_k coincides with the R-valued function space { f : X → R | f(x) = ∫_Ξ f̃(ξ) ψ(x; ξ) dP(ξ), f̃ ∈ L_2(P) }, with the inner product defined as ⟨f, g⟩_{H_k} = ∫_Ξ f̃(ξ) g̃(ξ) dP(ξ). Note that k(x, •) has the representation ψ(x; •); hence, it is straightforward to show that ∀x ∈ X, k(x, •) ∈ H_k. Furthermore, we have f(x) = ∫_Ξ f̃(ξ) ψ(x; ξ) dP(ξ) = ⟨f, k(x, •)⟩_{H_k}, which shows the reproducing property. As a result, we obtain an equivalent representation of the RKHS H_k, which means every f ∈ H_k admits a random feature representation. Examples of such kernels include the Gaussian kernel and the Matérn kernel; see Rahimi & Recht (2007); Dai et al. (2014); Choromanski et al. (2018) for details.

Definition 14 (ε-net, ε-covering number, and i-th (dyadic) entropy number (Steinwart & Christmann, 2008)). Let (T, d) be a metric space. S ⊂ T is an ε-net for ε > 0 if ∀t ∈ T, ∃s ∈ S such that d(s, t) ≤ ε. Furthermore, the ε-covering number N(T, d, ε) is defined as the minimum cardinality of an ε-net for T under the metric d, and the i-th entropy number e_i(T, d) is the minimum ε for which there exists an ε-net of cardinality 2^{i-1}.
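Definition 13 can be checked numerically for the Gaussian kernel via random Fourier features (Rahimi & Recht, 2007): with ψ(x; ξ) = √2 cos(w^⊤x + b), w drawn from a suitable Gaussian and b uniform on [0, 2π], the Monte Carlo average of ψ(x; ξ)ψ(x′; ξ) approximates k(x, x′). A minimal sketch, where the bandwidth convention is chosen to match the kernel k(x, x′) = exp(-∥x - x′∥² / σ²) used above (the data and feature count are arbitrary illustrations):

```python
import numpy as np

def rff_features(X, n_features=20000, sigma=1.0, seed=0):
    """Random Fourier features z(x) with z(x).z(x') ~ exp(-||x-x'||^2 / sigma^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # E[cos(w.(x-x'))] = exp(-||x-x'||^2 / sigma^2) for w ~ N(0, (2/sigma^2) I)
    W = rng.normal(scale=np.sqrt(2.0) / sigma, size=(n_features, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
Z = rff_features(X)
K_approx = Z @ Z.T                                   # finite-dim approximation
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_exact = np.exp(-sq)                                # exact Gaussian Gram matrix
err = np.max(np.abs(K_approx - K_exact))             # O(1/sqrt(n_features))
```

The approximation error shrinks at the Monte Carlo rate in the number of features, which is what makes the random feature view of the RKHS computationally useful.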

E MAIN PROOF

E.1 TECHNICAL CONDITIONS

Assumption 3 (Regularity Conditions for the Kernel). Z is a compact metric space equipped with the Lebesgue measure µ when Z is continuous, and ∫_Z k(z, z) dµ ≤ 1.

Remark 7. Assumption 3 is mainly for ease of presentation. The assumption that Z is compact when Z is continuous can be relaxed to a general domain, but this requires considerably more involved techniques, e.g., from Steinwart & Scovel (2012). Furthermore, by Mercer's theorem (see Theorem 3 and Remark 6 for details), we know Σ_{i∈I} µ_i = ∫_Z k(z, z) dµ ≤ 1. As µ_i > 0 for all i ∈ I, we have µ_1 ≤ 1, and hence ∥f∥_{L_2(µ)} ≤ ∥f∥_{H_k} without any additional absolute constant, which keeps the eventual results clean. We can relax ∫_Z k(z, z) dµ ≤ 1 to ∫_Z k(z, z) dµ ≤ c for some constant c > 1, at the cost of additional factors of at most poly(c) in the sample complexity.

Assumption 4 (Eigendecay Conditions for the Kernel). For the reproducing kernel k, we assume µ_i, the i-th eigenvalue of the operator T_k : L_2(µ) → L_2(µ), (T_k f)(z) = ∫_Z k(z, z′) f(z′) dµ(z′), satisfies one of the following conditions:
• β-finite spectrum: µ_i = 0 for all i > β, where β is a positive integer.
• β-polynomial decay: µ_i ≤ C_0 i^{-β}, where C_0 is an absolute constant and β > 1.
• β-exponential decay: µ_i ≤ C_1 exp(-C_2 i^β), where C_1 and C_2 are absolute constants and β > 0.
For ease of presentation, we use C_poly to denote constants appearing in the analysis of β-polynomial decay that depend only on C_0 and β, and C_exp to denote constants appearing in the analysis of β-exponential decay that depend only on C_1, C_2 and β; their values may change from line to line.

Remark 8. We remark that most commonly used kernels satisfy one of these eigendecay conditions. Specifically, as discussed in Seeger et al. (2008); Yang et al. (2020), the linear kernel and the polynomial kernel satisfy the β-finite spectrum condition, the Matérn kernel satisfies β-polynomial decay, and the Gaussian kernel satisfies β-exponential decay.
Furthermore, for discrete Z, we can directly observe that the kernel has a β-finite spectrum with β ≤ |Z|.
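The eigendecay conditions in Assumption 4 can be observed empirically from the spectrum of a normalized Gram matrix, which approximates the spectrum of the integral operator T_k. The following is a rough numerical illustration (the data distribution, bandwidth, and sample size are arbitrary and not part of the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 1))

# Linear kernel: rank <= d, i.e. a beta-finite spectrum (beta = 1 here).
K_lin = (X @ X.T) / len(X)
ev_lin = np.sort(np.linalg.eigvalsh(K_lin))[::-1]

# Gaussian kernel: the eigenvalues decay (roughly) exponentially fast.
K_gauss = np.exp(-(X - X.T) ** 2) / len(X)
ev_gauss = np.sort(np.linalg.eigvalsh(K_gauss))[::-1]
```

The linear-kernel spectrum has a single nonzero eigenvalue up to roundoff, while the Gaussian-kernel spectrum drops by many orders of magnitude within the first few dozen eigenvalues, matching the finite-spectrum and exponential-decay regimes respectively.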

E.2 PROOF FOR THE ONLINE SETTING

Theorem 4 (PAC Guarantee for Online Exploration, Formal). Choose the bonus b̂_n(s, a) as

b̂_n(s, a) = min{ α_n ∥p̂_n(•|s, a)∥_{L_2(µ), Σ̂⁻¹_{n, p̂_n}}, 2 },

where Σ̂_{n, p̂_n} : L_2(µ) → L_2(µ), Σ̂_{n, p̂_n} := Σ_{(s_i, a_i)∈D_n} p̂_n(z|s_i, a_i) p̂_n(z|s_i, a_i)^⊤ + λT_k^{-1}, and ∥f∥_{L_2(µ), Σ} := √⟨f, Σf⟩_{L_2(µ)} for a self-adjoint operator Σ. Let λ for the different eigendecay conditions be:
• β-finite spectrum: λ = Θ(β log N + log(N|P|/δ));
• β-polynomial decay: λ = Θ(C_poly N^{1/(1+β)} + log(N|P|/δ));
• β-exponential decay: λ = Θ(C_exp (log N)^{1/β} + log(N|P|/δ));
and α_n = Θ( (γ/(1-γ)) √(|A| log(n|P|/δ) + λC) ). Then, after interacting with the environment for N episodes where
• N = Θ( (Cβ³|A|² log(|P|/δ)) / ((1-γ)⁴ε²) · log³( (Cβ³|A|² log(|P|/δ)) / ((1-γ)⁴ε²) ) ) for β-finite spectrum;
• N = Θ( [ (C_poly |A| √(C log(|P|/δ))) / ((1-γ)²ε) · log^{3/2}( (√C |A| log(|P|/δ)) / ((1-γ)²ε) ) ]^{2(1+β)/(β-1)} ) for β-polynomial decay;
• N = Θ( (C_exp C|A|² log(|P|/δ)) / ((1-γ)⁴ε²) · log^{(3+2β)/β}( (C|A|² log(|P|/δ)) / ((1-γ)⁴ε²) ) ) for β-exponential decay;
we can obtain an ε-optimal policy with probability at least 1 - δ.

Notation. Following the notation of Uehara et al. (2022), we define ρ_n(s) := (1/n) Σ_{i=1}^{n-1} d^{π_i}_{T*}(s), and, with a slight abuse of notation, we overload it as ρ_n(s, a) := (1/n) Σ_{i=1}^{n-1} d^{π_i}_{T*}(s, a). Furthermore, we define ρ′_n(s′) as the marginal distribution of s′ under the joint distribution (s, a, s′) ∼ ρ_n(s) × U(A) × T*(s′|s, a). We will also repeatedly use the following Cauchy–Schwarz inequality for the weighted L_2(µ) norm: for a positive definite operator T, ∀x, x′ ∈ L_2(µ), ⟨x, x′⟩_{L_2(µ)} = ⟨T^{1/2}x, T^{-1/2}x′⟩_{L_2(µ)} ≤ ∥x∥_{L_2(µ), T} ∥x′∥_{L_2(µ), T^{-1}}.

Lemma 5 (One-Step-Back Inequality for the Learned Model). Assume g : S × A → R with ∥g∥_∞ ≤ B. Then, conditioning on the event that the following MLE generalization bound holds, E_{s∼ρ_n, a∼U(A)}[ ∥T̂(s, a) - T*(s, a)∥₁ ] ≤ ζ_n, we have for all π:

E_{(s,a)∼d^π_{T̂_n}}[g(s, a)] ≤ γ E_{(s̃,ã)∼d^π_{T̂_n}}[ ∥p̂_n(•|s̃, ã)∥_{L_2(µ), Σ^{-1}_{ρ_n×U(A), p̂_n}} ] · √( n|A| E_{s∼ρ′_n, a∼U(A)}[g²(s, a)] + λB²C + nB²ζ_n ) + (1-γ) √( |A| E_{s∼ρ_n, a∼U(A)}[g²(s, a)] ).
Proof. We start from the following equality: E (s,a)∼d π Tn [g(s, a)] = γE ( s, a)∼d π Tn ,s∼ Tn(•| s, a),a∼π(•|s) [g(s, a)] + (1 -γ)E s∼d0,a∼π(•|s) [g(s, a)], ) which is obtained by the property of the stationary distribution. For the second term, with Jensen's inequality and an importance sampling step, we have that (10) For the second term, we still use the following upper bound:  (1 -γ)E s∼d0,a∼π(•|s) [g(s, a)] ≤ (1 -γ)|A|E s∼ρn,a∼U (A) [g 2 (s, a)]. H k ≤nE s∼ρn, a∼U (A) E s∼T * (•| s, a),a∼π(•|s) [g(s, a)] 2 + λB 2 C + nB 2 ζ n ≤nE s∼ρn, a∼U (A),s∼T * (•| s, a),a∼π(•|s) g 2 (s, a) + λB 2 C + nB 2 ζ n ≤n|A|E s∼ρn, a∼U (A),s∼T * (•| s, a),a∼U (A) g 2 (s, a) + λB 2 C + nB 2 ζ n =n|A|E s∼ρ ′ n ,a∼U (A) g 2 (s, a) + λB 2 C + nB 2 ζ n Substitute back, (1 -γ)E s∼d0,a∼π(•|s) [g(s, a)] ≤ (1 -γ)|A|E s∼ρn,a∼U (A) [g 2 (s, a)]. ζ n = Θ log(n|P|/δ) n (such that the MLE generalization bound holds by Lemma 16), λ for different eigendecay condition as follows: • β-finite spectrum: λ = Θ(β log N + log(N |P|/δ)) • β-polynomial decay: λ = Θ(C poly N 1/(1+β) + log(N |P|/δ)); • β-exponential decay: λ = Θ(C exp (log N ) 1/β + log(N |P|/δ)); and α n = Θ γ 1-γ |A| log(n|P|/δ) + λC , the following events hold with probability at least 1 -δ: ∀n ∈ [N ], ∀π, V π Tn,r+ bn -V π T * ,r ≥ - |A|ζ n (1 -γ) 3 . Proof. With Lemma 17 and a union bound over P, we know using the chosen λ, ∀ Tn ∈ P, with probability at least 1 -δ, ∥p n (•|s, a)∥ L2(µ), Σ-1 n, pn = Θ ∥p n (•|s, a)∥ -1 L2(µ),Σ ρn ×U (A), pn . With Lemma 18, we have that Denote f n (s, a) = TV(T * (s ′ |s, a), Tn (s ′ |s, a)) with ∥f n ∥ ∞ ≤ 2, with Hölder's inequality, we have that (1 -γ) V π E (s,a)∼d π Tn E Tn(s ′ |s,a) V π T,r (s ′ ) -E T * (s ′ |s,a) V π T,r (s ′ ) ≤ E (s,a)∼d π Tn f n (s, a) 1 -γ . 
With Lemma 5, we have that E (s,a)∼d π Tn f n (s, a) 1 -γ ≤E ( s, a)∼d π Tn ∥p n (•|s, a)∥ L2(µ),Σ -1 ρn ×U (A), pn nγ 2 |A|E s∼ρ ′ n ,a∼U (A) [f 2 n (s, a)] (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + 4nγ 2 ζ n (1 -γ) 2 + |A|E s∼ρn,a∼U (A) [f 2 n (s, a)] 1 -γ ≤E ( s, a)∼d π Tn ∥p n (•|s, a)∥ L2(µ),Σ -1 ρn ×U (A), pn nγ 2 |A|ζ n (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + 4γ 2 nζ n (1 -γ) 2 + |A|ζ n 1 -γ Note that, we set α n such that nγ 2 |A|ζ n (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + 4nγ 2 ζ n (1 -γ) 2 ≲ α n , which concludes the proof. Lemma 8 (Regret). With probability at least 1 -δ, we have that • For β-finite spectrum, we have N n=1 V π * T * ,r -V πn T * ≲ β 3/2 |A| √ CN log(N |P|/δ) (1 -γ) 2 . • For β-polynomial decay, we have N n=1 V π * T * ,r -V πn T * ≲ C poly √ C|A|N 1 2 + 1 2(1+β) log(N |P|/δ) (1 -γ) 2 . • For β-exponential decay, we have N n=1 V π * T * ,r -V πn T * ≲ C exp |A| √ CN (log N ) 3+2β 2β log(N |P|/δ) (1 -γ) 2 . Proof. With Lemma 7 and Lemma 18, we have that V π * T * ,r -V πn T * ,r ≤V π * Tn,r+bn + |A|ζ n (1 -γ) 3 -V πn T * ,r

≤V πn

Tn,r+bn + |A|ζ n (1 -γ) 3 -V πn T * ,r ≤ 1 1 -γ E (s,a)∼d πn T * b n (s, a) + γE Tn(s ′ |s,a) [V πn Tn,r+bn (s ′ )] -γE T * (s ′ |s,a) [V πn Tn,r+bn (s ′ )] + |A|ζ n (1 -γ) 3 . Applying Lemma 6 and note that b n = O(1), we have that E (s,a)∼d πn T * [b n (s, a)] ≲E (s,a)∼d πn T * min α n ∥p n (•|s, a)∥ L2(µ),Σ -1 ρn ×U (A), pn , 2 ≲E ( s, a)∼d πn T * ∥p * (•| s, a)∥ Σ -1 ρn ,p * nγ|A|α 2 n E s∼ρn,a∼U (A) ∥p n (•|s, a)∥ 2 L2(µ),Σ -1 ρn×U (A), pn + λγ 2 C + (1 -γ)|A|α 2 n E s∼ρn,a∼U (A) ∥p n (•|s, a)∥ 2 L2(µ),Σ -1 ρn×U (A), pn . Note that, nE s∼ρn,a∼U (A) ∥p n (•|s, a)∥ 2 L2(µ),Σ -1 ρn×U (A), pn =Tr nE s∼ρn,a∼U (A) pn (•|s, a)p n (•|s, a) ⊤ nE s∼ρn,a∼U (A) pn (•|s, a)p n (•|s, a) ⊤ + λT -1 k -1 =Tr nE s∼ρn,a∼U (A) T 1/2 k pn (•|s, a)p n (•|s, a) ⊤ T 1/2 k nE s∼ρn,a∼U (A) T 1/2 k pn (•|s, a)p n (•|s, a) ⊤ T 1/2 k + λI -1 ≤ log det I + n λ E s∼ρn,a∼U (A) T 1/2 k pn (•|s, a) ⊤ pn (•|s, a)T 1/2 k , where the first equality is due to the definition of Hilbert-Schmidt inner product and the expectation operator is a linear operator, the second equality is due to the fact that Tr(A(A + B) -1 ) = Tr B -1/2 AB -1/2 I + B -1/2 AB -1/2 -1 for positive semi-definite operator A and positive definite operator B, and the last inequality is due to the fact that if A has the eigensystem {µ i , e i }, then A(A + λI) -1 has the eigensystem { µi µi+λ , e i }, and x 1+x ≤ log(1 + x). Here det denotes the Fredholm determinant. Note that, if  x ∈ B H k , T 1/2 k x ∈ B H k , log det I + n λ E s∼ρn,a∼U (A) T 1/2 k pn (•|s, a) ⊤ pn (•|s, a)T 1/2 k = O (β log n) , which means E (s,a)∼d πn T * [b n (s, a)] ≲ γ|A|α 2 n β log(n) + λγ 2 C • E ( s, a)∼d πn T * ∥p * (•| s, a)∥ Σ -1 ρn,p * + (1 -γ)|A|α 2 n β log(n) n . 
• For β-polynomial decay, as λ = Θ(C poly N 1/(1+β) + log(N |P|/δ)) and n ≤ N , we have n/λ = O C poly n β 1+β and log det I + n λ E s∼ρn,a∼U (A) T 1/2 k pn (•|s, a) ⊤ pn (•|s, a)T 1/2 k = O C poly n 1 2(1+β) log(n) , This leads to E (s,a)∼d πn T * [b n (s, a)] ≲ γ|A|C poly α 2 n n 1 2(1+β) log n + λγ 2 C • E ( s, a)∼d πn T * ∥p * (•| s, a)∥ Σ -1 ρn ,p * + (1 -γ)|A|C poly n -1+ 1 2(1+β) log(n)α 2 n . • For β-exponential decay, as λ = Θ C exp (log N ) 1/β + log(N |P|/δ) , we have n/λ = O (C exp n) and log det I + n λ E s∼ρn,a∼U (A) T 1/2 k pn (•|s, a) ⊤ pn (•|s, a)T 1/2 k = O C exp (log n) 1+1/β , This leads to E (s,a)∼d πn T * [b n (s, a)] ≲ γ|A|C exp (log n) 1+1/β α 2 n + λγ 2 C • E ( s, a)∼d πn T * ∥p * (•| s, a)∥ Σ -1 ρn,p * + (1 -γ)|A|C exp (log n) 1+1/β α 2 n n . For the remaining terms, denote f n (s, a) = TV(T * (s ′ |s, a), Tn (s ′ )|s, a) with ∥f n ∥ ∞ ≤ 2. With Hölder's inequality, we have E (s,a)∼d πn T * E Tn(s ′ |s,a) V πn Tn,r+bn (s ′ ) -E T * (s ′ |s,a) V πn Tn,r+bn (s ′ ) ≲ E (s,a)∼d πn T * f n (s, a) 1 -γ . With Lemma 6, we have that E (s,a)∼d πn T * f n (s, a) 1 -γ ≤E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρn ,p * nγ|A|E s∼ρn,a∼U (A) [f 2 n (s, a)] (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + |A|E s∼ρn,a∼U (A) [f 2 n (s, a)] 1 -γ ≤E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρn ,p * • nγ|A|ζ n (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + |A|ζ n 1 -γ ≲α n E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρn ,p * + |A|ζ n 1 -γ Combine with the previous results and take the dominating terms out, we have that • For β-finite spectrum, V π * T * ,r -V πn T * ≲ 1 1 -γ γ|A|α 2 n β log n + λγ 2 C • E ( s, a)∼d πn T * ∥p * (•| s, a)∥ Σ -1 ρn ,p * + |A|α 2 n β log n (1 -γ)n + |A|ζ n (1 -γ) 3 . • For β-polynomial decay, V π * T * ,r -V πn T * ≲ 1 1 -γ γ|A|C poly α 2 n n 1 2(1+β) log n + λγ 2 C • E ( s, a)∼d πn T * ∥p * (•| s, a)∥ Σ -1 ρn ,p * + |A|C poly α 2 n n -1+ 1 2(1+β) log n 1 -γ + |A|ζ n (1 -γ) 3 . 
• For β-exponential decay, V π * T * ,r -V πn T * ≲ 1 1 -γ γ|A|C exp α 2 n (log n) 1+1/β + λγ 2 C • E ( s, a)∼d πn T * ∥p * (•| s, a)∥ Σ -1 ρn ,p * + |A|C exp α 2 n (log n) 1+1/β (1 -γ)n + |A|ζ n (1 -γ) 3 . Finally, with Cauchy-Schwartz inequality, we have N n=1 E ( s, a)∼d πn T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρn,p * ≤ N N n=1 E ( s, a)∼d πn T * p * (•| s, a), Σ -1 ρn,p * p * (•| s, a) L2(µ) . Note that E ( s, a)∼d πn T * p * (•| s, a), Σ -1 ρn,p * p * (•| s, a) L2(µ) =E ( s, a)∼d πn T * p * (•| s, a), nE (s,a)∼ρn p * (•|s, a)p * (•|s, a) ⊤ + λT -1 k -1 p * (•| s, a) L2(µ) =E ( s, a)∼d πn T * T 1/2 k p * (•| s, a), nE (s,a)∼ρn T 1/2 k p * (•|s, a)p * (•|s, a) ⊤ T 1/2 k + λI -1 T 1/2 k p * (•| s, a) L2(µ) =Tr   n λ E (s,a)∼ρn T 1/2 k p * (•|s, a)p * (•|s, a) ⊤ T 1/2 k + I -1 , E ( s, a)∼d πn T * T 1/2 k p * (•| s, a)p * (•| s, a)T 1/2 k λ   ≤ log det n λ E (s,a)∼ρn T 1/2 k p * (•|s, a)p * (•|s, a) ⊤ T 1/2 k + I -log det n -1 λ E (s,a)∼ρn-1 T 1/2 k p * (•|s, a)p * (•|s, a) ⊤ T 1/2 k + I , where in the last inequality, we use the fact that log det(X) is concave with positive definite operators X and d log det(X) dX = (X ⊤ ) -1 . Telescoping and applying Lemma 19, we have that: • For β-finite spectrum: as N/λ = O(N ), we have N n=1 E ( s, a)∼d πn T * p * (•| s, a), Σ -1 ρn,p * p * (•| s, a) L2(µ) = O(β log N ). • For β-polynomial decay: as λ = Θ(C poly N 1/(1+β) + log(N |P|/δ)), we have N/λ = O C poly N β 1+β and N n=1 E ( s, a)∼d πn T * p * (•| s, a), Σ -1 ρn,p * p * (•| s, a) L2(µ) = O C poly N 1 2(1+β) log N . • For β-exponential decay: N n=1 E ( s, a)∼d πn T * p * (•| s, a), Σ -1 ρn,p * p * (•| s, a) L2(µ) = O C exp (log N ) 1+1/β . Hence, after we substitute α n and λ back and take the dominating term out, we can conclude that: • For β-finite spectrum, we have N n=1 V π * T * ,r -V πn T * ≲ β 3/2 |A| log N CN log(N |P|/δ) (1 -γ) 2 . • For β-polynomial decay, we have N n=1 V π * T * ,r -V πn T * ≲ C poly |A|N 1 2 + 1 1+β log N C log(N |P|/δ) (1 -γ) 2 . 
• For β-exponential decay, we have Σ_{n=1}^N ( V^{π*}_{T*, r} - V^{π_n}_{T*} ) ≲ C_exp |A| √(CN log(N|P|/δ)) (log N)^{(3+2β)/(2β)} / (1-γ)².
This finishes the proof.

Theorem 9 (PAC Guarantee for the Online Setting). After interacting with the environment for N episodes where
• N = Θ( (Cβ³|A|² log(|P|/δ)) / ((1-γ)⁴ε²) · log³( (Cβ³|A|² log(|P|/δ)) / ((1-γ)⁴ε²) ) ) for β-finite spectrum;
• N = Θ( [ (C_poly |A| √(C log(|P|/δ))) / ((1-γ)²ε) · log^{3/2}( (√C |A| log(|P|/δ)) / ((1-γ)²ε) ) ]^{2(1+β)/(β-1)} ) for β-polynomial decay;
• N = Θ( (C_exp C|A|² log(|P|/δ)) / ((1-γ)⁴ε²) · log^{(3+2β)/β}( (C|A|² log(|P|/δ)) / ((1-γ)⁴ε²) ) ) for β-exponential decay;
we can obtain an ε-optimal policy with high probability.

Proof. Note that log log x = O(log x). We consider the eigendecay conditions separately.
• For β-finite spectrum, taking the output policy to be the uniform mixture over {π_i}_{i∈[N]}, we obtain a policy with sub-optimality gap O( β^{3/2} |A| log N √(C log(N|P|/δ)) / ((1-γ)² √N) ). Taking N = Θ( (Cβ³|A|² log(|P|/δ)) / ((1-γ)⁴ε²) log³( (Cβ³|A|² log(|P|/δ)) / ((1-γ)⁴ε²) ) ), the sub-optimality gap is smaller than ε, which finishes the proof for β-finite spectrum.
• For β-polynomial decay, taking the output policy to be the uniform mixture over {π_i}_{i∈[N]}, we obtain a policy with sub-optimality gap O( C_poly N^{-(β-1)/(2(1+β))} log N √(C log(N|P|/δ)) / (1-γ)² ). Taking N = Θ( [ (C_poly |A| √(C log(|P|/δ))) / ((1-γ)²ε) log^{3/2}( (√C |A| log(|P|/δ)) / ((1-γ)²ε) ) ]^{2(1+β)/(β-1)} ), the sub-optimality gap is smaller than ε, which finishes the proof for β-polynomial decay.
• For β-exponential decay, taking the output policy to be the uniform mixture over {π_i}_{i∈[N]}, we obtain a policy with sub-optimality gap O( C_exp |A| √(C log(N|P|/δ)) (log N)^{(3+2β)/(2β)} / ((1-γ)² √N) ). Taking N = Θ( (C_exp C|A|² log(|P|/δ)) / ((1-γ)⁴ε²) log^{(3+2β)/β}( (C|A|² log(|P|/δ)) / ((1-γ)⁴ε²) ) ), the sub-optimality gap is smaller than ε, which finishes the proof for β-exponential decay.
As a result, we finish the proof of the PAC guarantee.
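As a quick numerical illustration of the finite-spectrum rate in Theorem 9, one can evaluate the sub-optimality bound as a function of N and search for the smallest N that drives it below ε. All absolute constants hidden by the Θ(·) notation are set to 1 here, and the parameter values are arbitrary, so the numbers only illustrate the scaling, not a real guarantee:

```python
import math

def finite_spectrum_gap(N, beta=10, A=4, C=1.0, P=1e6, delta=0.05, gamma=0.99):
    """Sub-optimality bound beta^{3/2} |A| log(N) sqrt(C log(N|P|/delta))
    / ((1-gamma)^2 sqrt(N)), with all hidden constants set to 1."""
    return (beta**1.5 * A * math.log(N)
            * math.sqrt(C * math.log(N * P / delta))) / ((1 - gamma)**2 * math.sqrt(N))

def episodes_needed(eps, **kw):
    """Doubling search for the smallest power-of-two N with gap <= eps;
    the bound is eventually decreasing in N, so the loop terminates."""
    N = 2
    while finite_spectrum_gap(N, **kw) > eps:
        N *= 2
    return N
```

The √N denominator dominates the logarithmic numerator, which is why the bound eventually decreases and the N in Theorem 9 scales as Õ(1/ε²).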

E.3 PROOF FOR THE OFFLINE SETTING

Similar to the online exploration case, we can obtain the upper bound of the statistical error for π, which is stated in the following: Theorem 10 (PAC Guarantee for Offline Exploitation). Define ω := max s,a π -1 b (a|s), and C * π := sup x∈L2(µ) E (s,a)∼d π T * ⟨p * (•|s, a), x⟩ L2(µ) 2 E (s,a)∼ρ ⟨p * (•|s, a), x⟩ L2(µ) 2 . If the penalty and its corresponding parameters are identical to the bonus we define in Theorem 4, then with probability at least 1-δ, for any competitor policy π including non-Markovian history-dependent policy, we have • For β-finite spectrum, we have V π T * ,r -V π T * ,r ≲ ωβ 3/2 log n (1 -γ) 2 CC * π log(|P|/δ) n • For β-polynomial decay, we have V π T * ,r -V π T * ,r ≲ C poly ωn 1-β 2(1+β) log n CC * π log(|P|/δ) (1 -γ) 2 • For β-exponential decay, we have V π T * ,r -V π T * ,r ≲ C exp ω(log n) 3+2β 2β (1 -γ) 2 Proof. We denote the eigendecomposition of Λ as Λ = U ΣU where {σ i , u i } is the eigensystem of Λ. Then we have Proof. We still start from the following inequality: Substitute back, we have the desired result. E (s,a)∼d π T * ⟨p * (•|s, a), Λp * (•|s, a)⟩ L2(µ) = i∈I σ i E (s,a)∼d π T * ⟨u i , p * (•|s, a) ⊤ ⟩ 2 L2(µ) ≤C * π i∈I σ i E (s,a)∼ρ ⟨u i , p * (•|s, a) ⊤ ⟩ 2 L2(µ) =C * π E (s, E (s, Lemma 13 (One Step Back Inequality for the True Model in Offline Setting). Assume g : S ×A → R, such that ∥g∥ ∞ ≤ B. Then we have that ∀π, E (s,a)∼d π T * [g(s, a)] ≤γE ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρ, p nωγE (s,a)∼ρ [g 2 (s, a)] + λγ 2 B 2 C + (1 -γ)ωE (s,a)∼ρ [g 2 (s, a)]. Proof. The proof for this lemma is nearly identical to the previous lemma, and we omit it for simplicity. Lemma 14 (Almost Pessimism at the Initial Distribution). 
If we set ζ = Θ log(|P|/δ) n , λ for different eigendecay condition as follows: • β-finite spectrum: λ = Θ(β log n + log(|P|/δ)) • β-polynomial decay: λ = Θ(C poly n 1/(1+β) + log(|P|/δ)); • β-exponential decay: λ = Θ(C exp (log n) 1/β + log(|P|/δ)); and α = Θ γ 1-γ ω log(|P|/δ) + λγ 2 C , the following events hold with probability at least 1 -δ: ∀π, V π T ,r-b -V π T * ,r ≤ ωζ (1 -γ) 3 . Proof. Note that, with the proof of Lemma 17 and a union bound over P (but not over n), we know using the chosen λ, ∀ T ∈ P, with probability at least 1 -δ, ∥p(•|s, a)∥ L2(µ), Σ-1 n, p = Θ ∥p(•|s, a)∥ -1 L2(µ),Σ ρ, p . With Lemma 18, we have that (1 -γ) V π T ,r-b -V π T * ,r =E (s,a)∼d π T -b(s, a) + γE T (s ′ |s,a) V π T * ,r (s ′ ) -γE T * (s ′ |s,a) V π T * ,r (s ′ ) ≲E (s,a)∼d π T -min α ∥p(•|s, a)∥ L2(µ),Σ -1 ρ, p , 2 + γE T (s ′ |s,a) V π T * ,r (s ′ ) -γE T * (s ′ |s,a) V π T * ,r (s ′ ) Denote f (s, a) = TV( T (s, a), T * (s, a)), we know ∥f ∥ ∞ ≤ 2. With Hölder's inequality, we can obtain that E T (s ′ |s,a) V π T * ,r (s ′ ) -E T * (s ′ |s,a) V π T * ,r (s ′ ) ≤ E (s,a)∼d π T f (s, a) 1 -γ . With Lemma 12, we have that E (s,a)∼d π T f (s, a) 1 -γ ≤E ( s, a)∼d π T ∥p(•| s, a)∥ L2(µ),Σ -1 ρ, p nωγ 2 E (s,a)∼ρ [f 2 (s, a)] (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + 4nγ 2 ζ (1 -γ) 2 + ωE (s,a)∼ρ [f 2 (s, a)] 1 -γ ≤E ( s, a)∼d π T ∥p(•| s, a)∥ L2(µ),Σ -1 ρ, p nωγ 2 ζ n (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + 4nγ 2 ζ (1 -γ) 2 + ωζ 1 -γ . With the choice of α, we can conclude the proof. Theorem 15 (PAC Guarantee for Offline Setting). With probability at least 1 -δ, for any competitor policy π including non-Markovian history-dependent policy, we have • For β-finite spectrum, we have V π T * ,r -V π T * ,r ≲ ωβ 3/2 log n (1 -γ) 2 CC * π log(|P|/δ) n • For β-polynomial decay, we have V π T * ,r -V π T * ,r ≲ C poly ωn 1-β 2(1+β) log n CC * π log(|P|/δ) (1 -γ) 2 • For β-exponential decay, we have V π T * ,r -V π T * ,r ≲ C exp ω(log n) 3+2β 2β (1 -γ) 2 CC * π log(|P|/δ) n Proof. 
With Lemma 14 and Lemma 18, we have that V π T * ,r -V π T * ,r ≤V π T * ,r -V π T ,r-b + ωζ (1 -γ) 3 ≤V π T * ,r -V π T ,r-b + ωζ (1 -γ) 3 ≤ 1 1 -γ E (s,a)∼d π T * b(s, a) + γE T * (s ′ |s,a) V π T * ,r (s ′ ) -γE T (s ′ |s,a) V π T * ,r (s ′ ) + ωζ (1 -γ) 3 . As b = O(1), with Lemma 13, we have that E (s,a)∼d π T * [b(s, a)] ≲E min α ∥p(•|s, a)∥ L2(µ),Σ -1 ρ, p , 2 ≲E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρ,p * nωγα 2 E (s,a)∼ρ ∥p(•|s, a)∥ 2 L2(µ),Σ -1 ρ, p + λγ 2 C + (1 -γ)ωα 2 E (s,a)∼ρ ∥p(•|s, a)∥ 2 L2(µ),Σ -1 ρ, p . With the reasoning similar to the proof in Lemma 8, we have that • For β-finite spectrum, E (s,a)∼ρ ∥p(•|s, a)∥ 2 L2(µ),Σ -1 ρ, p = O (β log n) , which leads to E (s,a)∼d π T * [b(s, a)] ≲E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρ,p * ωγβα 2 log n + λγ 2 C + (1 -γ)ωβα 2 log n n . • For β-polynomial decay, + (1 -γ)ωC poly α 2 n -1+ 1 2(1+β) log n. E (s,a)∼ρ ∥p(•|s, a)∥ 2 L2(µ),Σ -1 ρ, p = O C poly n 1 2(1+β) log n , • For β-exponential decay, E (s,a)∼ρ ∥p(•|s, a)∥ 2 L2(µ),Σ -1 ρ, p = O C exp (log n) 1+1/β , which leads to E (s,a)∼d π T * [b(s, a)] ≲E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρ,p * ωγC exp α 2 (log n) 1+1/β + λγ 2 C + (1 -γ)ωα 2 C exp (log n) 1+1/β n . Furthermore, denote f (s, a) = TV( T (s, a), T * (s, a)), we have ∥f ∥ ∞ ≤ 2. With Hölder's inequality, we have E (s,a)∼d π T * E T * (s ′ |s,a) V π T * ,r (s ′ ) -E T (s ′ |s,a) V π T * ,r (s ′ ) ≤ E (s,a)∼d π T * f (s, a) 1 -γ . With Lemma 13, we have E (s,a)∼d π T * f (s, a) 1 -γ ≤ E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρ,p * nωγE (s,a)∼ρ [f 2 (s, a)] (1 -γ) 2 + 4λγ 2 C (1 -γ) 2 + ωE (s,a)∼ρ [f 2 (s, a)] 1 -γ ≤E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρ,p * • nωγζ (1 -γ 2 ) + 4λγ 2 C (1 -γ) 2 + ωζ (1 -γ) ≲α n E ( s, a)∼d π T * ∥p * (•| s, a)∥ L2(µ),Σ -1 ρ,p * + ωζ 1 -γ . 
Combining the previous results and keeping the dominating terms, we have that:
- For $\beta$-finite spectrum,
\[
V^{\pi}_{T^{*},r} - V^{\widehat{\pi}}_{T^{*},r} \lesssim \frac{1}{1-\gamma}\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\right]\sqrt{\omega\gamma\beta\alpha^{2}\log n + \lambda\gamma^{2}C} + \sqrt{\frac{\omega\beta\alpha^{2}\log n}{(1-\gamma)\,n}} + \frac{\sqrt{\omega\zeta}}{(1-\gamma)^{3}}.
\]
- For $\beta$-polynomial decay,
\[
V^{\pi}_{T^{*},r} - V^{\widehat{\pi}}_{T^{*},r} \lesssim \frac{1}{1-\gamma}\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\right]\sqrt{\omega\gamma C_{\mathrm{poly}}\alpha^{2}\, n^{\frac{1}{2(1+\beta)}}\log n + \lambda\gamma^{2}C} + \sqrt{\frac{\omega C_{\mathrm{poly}}\alpha^{2}\, n^{-1+\frac{1}{2(1+\beta)}}\log n}{1-\gamma}} + \frac{\sqrt{\omega\zeta}}{(1-\gamma)^{3}}.
\]
- For $\beta$-exponential decay,
\[
V^{\pi}_{T^{*},r} - V^{\widehat{\pi}}_{T^{*},r} \lesssim \frac{1}{1-\gamma}\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\right]\sqrt{\omega\gamma C_{\exp}\alpha^{2}(\log n)^{1+1/\beta} + \lambda\gamma^{2}C} + \sqrt{\frac{\omega C_{\exp}\alpha^{2}(\log n)^{1+1/\beta}}{(1-\gamma)\,n}} + \frac{\sqrt{\omega\zeta}}{(1-\gamma)^{3}}.
\]
We now deal with the term $\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\big[\|p^{*}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\big]$. With Lemma 11, we know
\[
\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\right] \leq \sqrt{\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|^{2}_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\right]} \leq \sqrt{C^{*}_{\pi}\,\mathbb{E}_{(\bar s,\bar a)\sim\rho}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|^{2}_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\right]}.
\]
Applying the method used in the proof of Lemma 8 once more, we have that:
- For $\beta$-finite spectrum, $\mathbb{E}_{(\bar s,\bar a)\sim\rho}\big[\|p^{*}(\cdot|\bar s,\bar a)\|^{2}_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\big] = O\!\left(\frac{\beta\log n}{n}\right)$.
- For $\beta$-polynomial decay, $\mathbb{E}_{(\bar s,\bar a)\sim\rho}\big[\|p^{*}(\cdot|\bar s,\bar a)\|^{2}_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\big] = O\!\left(\frac{C_{\mathrm{poly}}\, n^{\frac{1}{2(1+\beta)}}\log n}{n}\right)$.
- For $\beta$-exponential decay, $\mathbb{E}_{(\bar s,\bar a)\sim\rho}\big[\|p^{*}(\cdot|\bar s,\bar a)\|^{2}_{L_{2}(\mu),\Sigma^{-1}_{\rho,p^{*}}}\big] = O\!\left(\frac{C_{\exp}(\log n)^{1+1/\beta}}{n}\right)$.
Substituting $\alpha$ and $\lambda$ back, we have that:
- For $\beta$-finite spectrum,
\[
V^{\pi}_{T^{*},r} - V^{\widehat{\pi}}_{T^{*},r} \lesssim \frac{\omega\beta^{3/2}\log n}{(1-\gamma)^{2}}\sqrt{\frac{C\,C^{*}_{\pi}\log(|\mathcal{P}|/\delta)}{n}}.
\]
- For $\beta$-polynomial decay,
\[
V^{\pi}_{T^{*},r} - V^{\widehat{\pi}}_{T^{*},r} \lesssim \frac{C_{\mathrm{poly}}\,\omega\, n^{\frac{1-\beta}{2(1+\beta)}}\log n\,\sqrt{C\,C^{*}_{\pi}\log(|\mathcal{P}|/\delta)}}{(1-\gamma)^{2}}.
\]
- For $\beta$-exponential decay,
\[
V^{\pi}_{T^{*},r} - V^{\widehat{\pi}}_{T^{*},r} \lesssim \frac{C_{\exp}\,\omega\,(\log n)^{\frac{3+2\beta}{2\beta}}}{(1-\gamma)^{2}}\sqrt{\frac{C\,C^{*}_{\pi}\log(|\mathcal{P}|/\delta)}{n}}.
\]
This finishes the proof.

F AUXILIARY LEMMAS

We first state the MLE generalization bound from Agarwal et al. (2020). Note that, when Assumption 1 holds, the MLE generalization bound only depends on the complexity of $\mathcal{P}$.

Lemma 16 (MLE Generalization Bound (Agarwal et al., 2020)). For a fixed episode $n$, with probability at least $1-\delta$, we have that
\[
\mathbb{E}_{s\sim 0.5\rho_{n}+0.5\rho'_{n},\,a\sim U(\mathcal{A})}\!\left[\big\|\widehat{T}_{n}(s,a) - T^{*}(s,a)\big\|_{1}^{2}\right] \lesssim \frac{\log(|\mathcal{P}|/\delta)}{n}.
\]
With a union bound, with probability at least $1-\delta$, we have that for all $n\in\mathbb{N}_{+}$,
\[
\mathbb{E}_{s\sim 0.5\rho_{n}+0.5\rho'_{n},\,a\sim U(\mathcal{A})}\!\left[\big\|\widehat{T}_{n}(s,a) - T^{*}(s,a)\big\|_{1}^{2}\right] \lesssim \frac{\log(n|\mathcal{P}|/\delta)}{n}.
\]

Lemma 17 (Concentration of the Bonuses). Let $\mu_{i}$ be the conditional distribution of $\phi$ given the sampled $\phi_{1},\ldots,\phi_{i-1}$, and define $\Sigma_{n} : L_{2}(\mu)\to L_{2}(\mu)$ by $\Sigma_{n} := \frac{1}{n}\sum_{i\in[n]}\mathbb{E}_{\phi\sim\mu_{i}}[\phi\phi^{\top}]$. Assume $\|\phi\|_{H_{k}}\leq 1$ for any realization of $\phi$. Suppose $\lambda$ satisfies the following condition for each eigendecay regime:
- $\beta$-finite spectrum: $\lambda = \Theta(\beta\log N + \log(N/\delta))$;
- $\beta$-polynomial decay: $\lambda = \Theta(C_{\mathrm{poly}} N^{1/(1+\beta)} + \log(N/\delta))$;
- $\beta$-exponential decay: $\lambda = \Theta(C_{\exp}(\log N)^{1/\beta} + \log(N/\delta))$, where $C_{\exp}$ is a constant depending on $C_{1}$ and $C_{2}$.

Then there exist absolute constants $c_{1}$ and $c_{2}$ such that, for all $x\in H_{k}$, the following event holds with probability at least $1-\delta$: for all $n\in[N]$,
\[
c_{1}\left\langle x, \big(n\Sigma_{n} + \lambda T_{k}^{-1}\big)x\right\rangle_{L_{2}(\mu)} \leq \Big\langle x, \Big(\sum_{i\in[n]}\phi_{i}\phi_{i}^{\top} + \lambda T_{k}^{-1}\Big)x\Big\rangle_{L_{2}(\mu)} \leq c_{2}\left\langle x, \big(n\Sigma_{n} + \lambda T_{k}^{-1}\big)x\right\rangle_{L_{2}(\mu)}.
\]
On the same event, the following must hold as well: for all $n\in[N]$,
\[
\frac{1}{c_{2}}\left\langle x, \big(n\Sigma_{n} + \lambda T_{k}^{-1}\big)^{-1}x\right\rangle_{L_{2}(\mu)} \leq \Big\langle x, \Big(\sum_{i\in[n]}\phi_{i}\phi_{i}^{\top} + \lambda T_{k}^{-1}\Big)^{-1}x\Big\rangle_{L_{2}(\mu)} \leq \frac{1}{c_{1}}\left\langle x, \big(n\Sigma_{n} + \lambda T_{k}^{-1}\big)^{-1}x\right\rangle_{L_{2}(\mu)}.
\]

Proof. Note that $\|T_{k}^{-1/2}\phi\|_{L_{2}(\mu)} = \|\phi\|_{H_{k}} \leq 1$. Hence, the operator norm of the operator $\widetilde{\Sigma}_{n} := T_{k}^{-1/2}\Sigma_{n}T_{k}^{-1/2}$, which maps from $L_{2}(\mu)$ to $L_{2}(\mu)$, is upper bounded by 1. Note that for all $x\in H_{k}$, $T_{k}^{1/2}x\in H_{k}$.
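Lemma 17's spectral-equivalence claim can be sanity-checked numerically. Below is a hedged, finite-dimensional stand-in (all names are ours, not the paper's): the RKHS features are replaced by bounded vectors in $\mathbb{R}^d$, the regularizer $\lambda T_k^{-1}$ by $\lambda I$, and a feature distribution with a closed-form second moment is used so that the population operator $\Sigma$ is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 8, 5000, 5.0

# Features with ||phi|| <= 1 (mirrors the RKHS-norm bound in Lemma 17):
# uniform direction on the sphere times a uniform radius in [0, 1].
v = rng.normal(size=(n, d))
phis = v / np.linalg.norm(v, axis=1, keepdims=True) * rng.uniform(size=(n, 1))

# For this distribution the population second-moment operator is known in
# closed form: E[phi phi^T] = E[r^2]/d * I = I / (3d).
Sigma = np.eye(d) / (3 * d)

emp = phis.T @ phis + lam * np.eye(d)   # sum_i phi_i phi_i^T + lam * I
pop = n * Sigma + lam * np.eye(d)       # n * Sigma + lam * I

# The quadratic forms agree up to absolute constants, as the lemma asserts.
x = rng.normal(size=d)
ratio = (x @ emp @ x) / (x @ pop @ x)
assert 0.5 < ratio < 2.0
```

With $n$ this large the ratio is in fact close to 1; the lemma's point is that the regularized empirical operator can replace its population counterpart inside bonuses, uniformly over directions $x$.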
We now prove the following equivalent form of the claim: for all $x\in H_{k}$ and $n\geq 1$,
\[
c_{1}\left\langle x, \big(n\widetilde{\Sigma}_{n} + \lambda T_{k}^{-2}\big)x\right\rangle_{L_{2}(\mu)} \leq \Big\langle x, \Big(\sum_{i\in[n]}\widetilde{\phi}_{i}\widetilde{\phi}_{i}^{\top} + \lambda T_{k}^{-2}\Big)x\Big\rangle_{L_{2}(\mu)} \leq c_{2}\left\langle x, \big(n\widetilde{\Sigma}_{n} + \lambda T_{k}^{-2}\big)x\right\rangle_{L_{2}(\mu)}.
\]
It is sufficient to consider $x$ with $\|x\|_{H_{k}} = 1$. Note that we have
\[
\big\langle x, \widetilde{\phi}\widetilde{\phi}^{\top}x\big\rangle_{L_{2}(\mu)} = \langle x, \widetilde{\phi}\rangle^{2}_{L_{2}(\mu)} \leq \|x\|^{2}_{L_{2}(\mu)}.
\]
Denote $\widetilde{\Sigma}_{i} := \mathbb{E}_{\widetilde{\phi}\sim\widetilde{\mu}_{i}}[\widetilde{\phi}\widetilde{\phi}^{\top}]$. Since
\[
\mathrm{Var}_{\widetilde{\phi}\sim\widetilde{\mu}_{i}}\!\left(\langle x,\widetilde{\phi}\rangle^{2}_{L_{2}(\mu)}\right) \leq \|x\|^{2}_{L_{2}(\mu)}\,\mathbb{E}_{\widetilde{\phi}\sim\widetilde{\mu}_{i}}\!\left[\langle x,\widetilde{\phi}\rangle^{2}_{L_{2}(\mu)}\right] = \|x\|^{2}_{L_{2}(\mu)}\,\langle x, \widetilde{\Sigma}_{i}x\rangle_{L_{2}(\mu)},
\]
we can invoke a Bernstein-style martingale concentration inequality (Lemma 45, Zanette et al., 2021) and obtain that, with probability at least $1-\delta$,
\[
\left|\frac{1}{n}\sum_{i\in[n]}\langle x,\widetilde{\phi}_{i}\rangle^{2}_{L_{2}(\mu)} - \langle x, \widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)}\right| \leq c\left(\sqrt{\frac{\|x\|^{2}_{L_{2}(\mu)}\,\langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)}\log(2/\delta)}{n}} + \frac{\|x\|^{2}_{L_{2}(\mu)}\log(2/\delta)}{3n}\right),
\]
where $c$ is an absolute constant. We then show that, if $\lambda = \Omega(\log(1/\delta))$,
\[
c\left(\sqrt{\frac{\|x\|^{2}_{L_{2}(\mu)}\,\langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)}\log(2/\delta)}{n}} + \frac{\|x\|^{2}_{L_{2}(\mu)}\log(2/\delta)}{3n}\right) \leq C\left(\langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)} + \frac{\lambda\|x\|^{2}_{L_{2}(\mu)}}{n}\right),
\]
where $C<1$ is an absolute constant, following reasoning similar to the proof of Lemma 39 in Zanette et al. (2021):
- If $\langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)} \leq \frac{\lambda\|x\|^{2}_{L_{2}(\mu)}}{n}$: it is sufficient to show $c\sqrt{\lambda\log(2/\delta)} \leq \frac{C\lambda}{2}$ and $\frac{c\log(2/\delta)}{3}\leq\frac{C\lambda}{2}$, both of which can be achieved by $\lambda = \Omega(\log(1/\delta))$.
- If $\langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)} \geq \frac{\lambda\|x\|^{2}_{L_{2}(\mu)}}{n}$: it is sufficient to show that $\lambda \geq \frac{c}{C}\log(2/\delta)$ and $\frac{c}{C}\cdot\frac{\|x\|^{2}_{L_{2}(\mu)}\log(2/\delta)}{n} \leq \langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)}$. As $\langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)} \geq \frac{\lambda\|x\|^{2}_{L_{2}(\mu)}}{n}$, when $\lambda \geq \max\!\left(\frac{c}{C}, \frac{c^{2}}{C^{2}}\right)\log(2/\delta)$, these two conditions hold simultaneously.

Hence, for any fixed $x$ with $\|x\|_{H_{k}}=1$, we have
\[
\left|\frac{1}{n}\sum_{i\in[n]}\langle x,\widetilde{\phi}_{i}\rangle^{2}_{L_{2}(\mu)} - \langle x,\widetilde{\Sigma}_{n}x\rangle_{L_{2}(\mu)}\right| \leq C\Big\langle x, \Big(\widetilde{\Sigma}_{n} + \frac{\lambda}{n}I\Big)x\Big\rangle_{L_{2}(\mu)} \leq C\Big\langle x, \Big(\widetilde{\Sigma}_{n} + \frac{\lambda}{n}T_{k}^{-2}\Big)x\Big\rangle_{L_{2}(\mu)}.
\]
Now, assume this condition holds for an $\varepsilon$-net $B_{\varepsilon}$ of $S_{H_{k}}$, the unit sphere of the RKHS $H_{k}$ (i.e., $\{x : \|x\|_{H_{k}}=1\}$), under $\|\cdot\|_{L_{2}(\mu)}$. Then, for any $x$ with $\|x\|_{H_{k}}=1$, let $x'$ be the closest point to $x$ in $B_{\varepsilon}$ under $\|\cdot\|_{L_{2}(\mu)}$ (note that $x'\in H_{k}$ by the definition of the $\varepsilon$-net).
We have that
\[
\left|\langle x,\widetilde{\Sigma}_{n}x\rangle - \langle x',\widetilde{\Sigma}_{n}x'\rangle\right| \leq 2\varepsilon, \qquad
\left|\Big\langle x, \Big(\frac{1}{n}\sum_{i\in[n]}\widetilde{\phi}_{i}\widetilde{\phi}_{i}^{\top}\Big)x\Big\rangle - \Big\langle x', \Big(\frac{1}{n}\sum_{i\in[n]}\widetilde{\phi}_{i}\widetilde{\phi}_{i}^{\top}\Big)x'\Big\rangle\right| \leq 2\varepsilon.
\]
With a triangle inequality, for all $n$ and all $\|x\|_{H_{k}}\leq 1$, we have
\[
\left|\Big\langle x, \Big(\frac{1}{n}\sum_{i\in[n]}\widetilde{\phi}_{i}\widetilde{\phi}_{i}^{\top} + \frac{\lambda}{n}T_{k}^{-2}\Big)x\Big\rangle_{L_{2}(\mu)} - \Big\langle x, \Big(\widetilde{\Sigma}_{n} + \frac{\lambda}{n}T_{k}^{-2}\Big)x\Big\rangle_{L_{2}(\mu)}\right| \leq C\Big\langle x, \Big(\widetilde{\Sigma}_{n} + \frac{\lambda}{n}T_{k}^{-2}\Big)x\Big\rangle_{L_{2}(\mu)} + (4+2C)\varepsilon.
\]
Hence, we can choose $\varepsilon = O\!\left(\frac{\lambda}{n}\right)$ to guarantee that
\[
C\Big\langle x, \Big(\widetilde{\Sigma}_{n} + \frac{\lambda}{n}T_{k}^{-2}\Big)x\Big\rangle_{L_{2}(\mu)} + (4+2C)\varepsilon \leq C'\Big\langle x, \Big(\widetilde{\Sigma}_{n} + \frac{\lambda}{n}T_{k}^{-2}\Big)x\Big\rangle_{L_{2}(\mu)},
\]
where $C'<1$ is an absolute constant. Moreover,
\[
e_{i}(B_{H_{k}}, \|\cdot\|_{L_{2}(\mu)}) \leq e_{i}(S_{H_{k}}, \|\cdot\|_{L_{2}(\mu)}) \leq 2\,e_{i}(B_{H_{k}}, \|\cdot\|_{L_{2}(\mu)}),
\]
where $B_{H_{k}}$ is the unit ball of the RKHS $H_{k}$ (i.e., $\{x:\|x\|_{H_{k}}\leq 1\}$). With Carl's inequality (Carl & Stephani, 1990; see also Steinwart et al., 2009), for all $p>0$ and $m\in\mathbb{N}_{+}$, we have
\[
\sup_{i\in[m]} i^{1/p}\, e_{i}(\mathrm{id}: H_{k}\to L_{2}(\mu)) \leq c_{p}\sup_{i\in[m]} i^{1/p}\,\mu^{1/2}_{i}\big(T^{2}_{k} : L_{2}(\mu)\to L_{2}(\mu)\big) = c_{p}\sup_{i\in[m]} i^{1/p}\,\mu_{i}\big(T_{k}: L_{2}(\mu)\to L_{2}(\mu)\big),
\]
where $c_{p} = 128(32+16/p)^{1/p}$ denotes a constant depending only on $p$. We then consider the entropy numbers under the different eigendecay conditions:
- For $\beta$-finite spectrum, as $\sum_{i\in I}\mu_{i}\leq 1$ from Assumption 3 and $\mu_{i}(T_{k}: L_{2}(\mu)\to L_{2}(\mu)) = 0$ for all $i>\beta$, we know that for a fixed $p$, $e_{i}(\mathrm{id}:H_{k}\to L_{2}(\mu)) \leq 128\,(32+16/p)^{1/p}(\beta/i)^{1/p}$. Taking $p=\beta/i$, we know that $e_{i}(\mathrm{id}:H_{k}\to L_{2}(\mu)) \leq 128\,(32\beta+16)^{-i/\beta}$.
- For $\beta$-polynomial decay, take $p=2/\beta$ and obtain $e_{i}(\mathrm{id}:H_{k}\to L_{2}(\mu)) \leq 128\,C_{0}(32+8\beta)^{\beta/2}\, i^{-\beta}$.
- For $\beta$-exponential decay, note that, for a fixed $p$, direct computation shows the maximum of $i^{1/p}\exp(-C_{2}i^{\beta})$ is achieved when $i^{\beta} = \frac{1}{C_{2}\beta p}$; furthermore, $i^{1/p}\exp(-C_{2}i^{\beta})$ is monotonically increasing in $i$ when $i^{\beta} < \frac{1}{C_{2}\beta p}$ and monotonically decreasing when $i^{\beta} > \frac{1}{C_{2}\beta p}$. Hence, for a given $i$, we can choose $p$ such that $i^{\beta} > \frac{1}{C_{2}\beta p}$ and obtain $e_{i}(\mathrm{id}:H_{k}\to L_{2}(\mu)) \leq 128\,(32+16/p)^{1/p}\,C_{1}\exp(-C_{2}i^{\beta})$. As we can take $p\to\infty$, we have $e_{i}(\mathrm{id}:H_{k}\to L_{2}(\mu)) \leq 128\,C_{1}\exp(-C_{2}i^{\beta})$.
We now convert the entropy-number bounds for the different eigendecay conditions into the corresponding covering-number bounds.
- For $\beta$-finite spectrum, we fix $\delta\in(0,1)$ and $\varepsilon\in(0,128]$, and assume the integer $i\geq 1$ satisfies the condition $128(1+\delta)(32\beta+16)^{-(i+1)/\beta} \leq \varepsilon \leq 128(1+\delta)(32\beta+16)^{-i/\beta}$. By the definition of the entropy number and the covering number, we know
\[
\log N(B_{H_{k}},\|\cdot\|_{L_{2}(\mu)},\varepsilon) \leq \log N\!\big(B_{H_{k}},\|\cdot\|_{L_{2}(\mu)}, 128(1+\delta)(32\beta+16)^{-(i+1)/\beta}\big) \leq i\log 2 = O\big(\beta\log(1/\varepsilon)\big).
\]
- For $\beta$-polynomial decay, with Lemma 6.21 in Steinwart & Christmann (2008), we have that
\[
\log N(B_{H_{k}},\|\cdot\|_{L_{2}(\mu)},\varepsilon) \leq \log(4)\left(\frac{128\,C_{0}(32+8\beta)^{\beta/2}}{\varepsilon}\right)^{1/\beta} = O\big(C_{\mathrm{poly}}\,\varepsilon^{-1/\beta}\big).
\]
- For $\beta$-exponential decay, we fix $\delta\in(0,1)$ and $\varepsilon\in(0,128C_{1}]$, and assume the integer $i\geq 1$ satisfies the condition $128C_{1}(1+\delta)\exp(-C_{2}(i+1)^{\beta}) \leq \varepsilon \leq 128C_{1}(1+\delta)\exp(-C_{2}i^{\beta})$. By the definition of the entropy number and the covering number, we know
\[
\log N(B_{H_{k}},\|\cdot\|_{L_{2}(\mu)},\varepsilon) \leq \log N\!\big(B_{H_{k}},\|\cdot\|_{L_{2}(\mu)}, 128C_{1}(1+\delta)\exp(-C_{2}(i+1)^{\beta})\big) \leq i\log 2 \leq \log 2\left(\frac{\log\frac{128C_{1}(1+\delta)}{\varepsilon}}{C_{2}}\right)^{1/\beta} \leq \log 2\left(\frac{\log\frac{256C_{1}}{\varepsilon}}{C_{2}}\right)^{1/\beta} = O\big(C_{\exp}(\log(1/\varepsilon))^{1/\beta}\big),
\]
where $C_{\exp}$ is a constant depending on $C_{1}$ and $C_{2}$.

Note that $n\leq N$. Hence, we can choose $\varepsilon$ for each eigendecay condition and obtain the first claim as follows:
- For $\beta$-finite spectrum: choose $\varepsilon = \Theta(n^{-1})$, and obtain the first claim with $\lambda = \Theta(\beta\log N + \log(N/\delta))$ using a union bound over $B_{\varepsilon}$ and $[N]$.
- For $\beta$-polynomial decay: choose $\varepsilon = \Theta(n^{-\beta/(1+\beta)})$, and obtain the first claim with $\lambda = \Theta\big(C_{\mathrm{poly}} N^{1/(1+\beta)} + \log(N/\delta)\big)$ using a union bound over $B_{\varepsilon}$ and $[N]$.
- For $\beta$-exponential decay: choose $\varepsilon = \Theta(n^{-1})$, and obtain the first claim with $\lambda = \Theta\big(C_{\exp}(\log N)^{1/\beta} + \log(N/\delta)\big)$ using a union bound over $B_{\varepsilon}$ and $[N]$.

For the second claim, note that
\[
\left\langle x, \big(n\Sigma_{n} + \lambda T_{k}^{-1}\big)x\right\rangle = \left\langle T_{k}^{-1/2}x,\; T_{k}^{1/2}\big(n\Sigma_{n} + \lambda T_{k}^{-1}\big)T_{k}^{1/2}\, T_{k}^{-1/2}x\right\rangle,
\]
\[
\Big\langle x, \Big(\sum_{i=1}^{n}\phi_{i}\phi_{i}^{\top} + \lambda T_{k}^{-1}\Big)x\Big\rangle = \Big\langle T_{k}^{-1/2}x,\; T_{k}^{1/2}\Big(\sum_{i=1}^{n}\phi_{i}\phi_{i}^{\top} + \lambda T_{k}^{-1}\Big)T_{k}^{1/2}\, T_{k}^{-1/2}x\Big\rangle.
\]
Note that $\{T_{k}^{-1/2}x : x\in H_{k}\}$ spans $L_{2}(\mu)$. Hence, when the first claim holds, we have that for all $x'\in L_{2}(\mu)$ and all $n\in[N]$,
\[
\frac{1}{c_{2}}\left\langle x', T_{k}^{-1/2}\big(n\Sigma_{n}+\lambda T_{k}^{-1}\big)^{-1}T_{k}^{-1/2}x'\right\rangle_{L_{2}(\mu)} \leq \Big\langle x', T_{k}^{-1/2}\Big(\sum_{i\in[n]}\phi_{i}\phi_{i}^{\top}+\lambda T_{k}^{-1}\Big)^{-1}T_{k}^{-1/2}x'\Big\rangle_{L_{2}(\mu)},
\]
and
\[
\Big\langle x', T_{k}^{-1/2}\Big(\sum_{i\in[n]}\phi_{i}\phi_{i}^{\top}+\lambda T_{k}^{-1}\Big)^{-1}T_{k}^{-1/2}x'\Big\rangle_{L_{2}(\mu)} \leq \frac{1}{c_{1}}\left\langle x', T_{k}^{-1/2}\big(n\Sigma_{n}+\lambda T_{k}^{-1}\big)^{-1}T_{k}^{-1/2}x'\right\rangle_{L_{2}(\mu)}.
\]
As $T_{k}x\in L_{2}(\mu)$ for all $x\in H_{k}$, we can choose $x' = T_{k}x$, which shows that the second claim holds whenever the first claim holds.

Remark 9. Here we follow the idea of Zanette et al. (2021, Lemma 45) and present a less involved proof. However, it is also possible to use a Bernstein inequality for matrix martingales with intrinsic dimension (e.g., Minsker, 2017) to prove similar results.

Lemma 18 (Simulation Lemma). Suppose we have two MDP instances $M = (S, A, P, r, d_{0}, \gamma)$ and $M' = (S, A, P', r+b, d_{0}, \gamma)$. Then, for any policy $\pi$, we have that
\[
V^{\pi}_{P',r+b} - V^{\pi}_{P,r} = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}\!\left[b(s,a) + \gamma\big(\mathbb{E}_{P'(s'|s,a)}[V^{\pi}_{P',r+b}(s')] - \mathbb{E}_{P(s'|s,a)}[V^{\pi}_{P',r+b}(s')]\big)\right],
\]
\[
V^{\pi}_{P',r+b} - V^{\pi}_{P,r} = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\pi}_{P'}}\!\left[b(s,a) + \gamma\big(\mathbb{E}_{P'(s'|s,a)}[V^{\pi}_{P,r}(s')] - \mathbb{E}_{P(s'|s,a)}[V^{\pi}_{P,r}(s')]\big)\right].
\]
Proof. See Uehara et al. (2022, Lemma 20).

Lemma 19 (Potential Function Lemma for RKHS). If $\alpha = \Omega(1)$, then for any distribution $\nu$ supported on the unit ball of $H_{k}$, we have that:
- For $\beta$-finite spectrum: $\log\det\big(\alpha\,\mathbb{E}_{\nu}[\phi\phi^{\top}] + I\big) = O\!\left(\beta\log\big(1+\frac{\alpha}{\beta}\big)\right)$.
- For $\beta$-polynomial decay: $\log\det\big(\alpha\,\mathbb{E}_{\nu}[\phi\phi^{\top}] + I\big) = O\!\left(C_{\mathrm{poly}}\,\alpha^{1/(2\beta)}\log\alpha\right)$.
- For $\beta$-exponential decay: $\log\det\big(\alpha\,\mathbb{E}_{\nu}[\phi\phi^{\top}] + I\big) = O\!\left(C_{\exp}(\log\alpha)^{1+1/\beta}\right)$.

Here the operators act on $L_{2}(\mu)\to L_{2}(\mu)$. Meanwhile, when $\alpha = O(1)$, under any eigendecay condition, we have $\log\det\big(\alpha\,\mathbb{E}_{\nu}[\phi\phi^{\top}] + I\big) = O(1)$.

Proof. We consider the optimization problem $\sup_{\nu}\log\det\big(I + \alpha\,\mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)$ and first consider the optimality condition for $\nu$. Note that $\log\det(X)$ is concave with respect to positive definite $X$, and $\mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]$ is linear with respect to $\nu$. Direct computation shows that
\[
\frac{d\,\log\det\big(I+\alpha\,\mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)}{d\nu(\phi')} = \mathrm{Tr}\left(\alpha\big(I+\alpha\,\mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1}\phi'\phi'^{\top}\right) = \left\langle \phi', \big(\alpha^{-1}I + \mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1}\phi'\right\rangle_{L_{2}(\mu)}.
\]


Now we consider the constrained optimization problem
\[
\max_{\phi'}\;\left\langle\phi', \big(\alpha^{-1}I + \mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1}\phi'\right\rangle, \qquad \text{s.t. } \|\phi'\|_{H_{k}}\leq 1.
\]
With the method of Lagrange multipliers, we know that
\[
C\,T_{k}^{-2} - \big(\alpha^{-1}I + \mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1} \succeq 0,
\]
and for all $\phi$ in the support of $\nu$, we have
\[
\left(C\,T_{k}^{-2} - \big(\alpha^{-1}I + \mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1}\right)\phi = 0.
\]
Note that $\big\|\big(\alpha^{-1}I + \mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1}\big\|_{\mathrm{op}} \leq \alpha$. With Weyl's inequality, we know that
\[
\mu_{i}\!\left(C\,T_{k}^{-2} - \big(\alpha^{-1}I + \mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1}\right) \geq C\,\mu_{i}(T_{k})^{-2} - \alpha \geq \alpha\left(\frac{\mu_{1}^{2}(T_{k})\,\mu_{i}^{-2}(T_{k})}{\alpha\mu_{1}^{2}(T_{k})+1} - 1\right),
\]
which means the support of $\nu$ has dimension at most $i_{0}$, where $i_{0}$ is the largest integer such that $\mu_{1}^{2}(T_{k})\,\mu_{i}^{-2}(T_{k}) \leq \alpha\mu_{1}^{2}(T_{k}) + 1$. When $\alpha = O(1)$, with Assumption 3, we know $\mu_{i}(T_{k})\leq 1$ and $i_{0} = O(1)$. Combined with Jensen's inequality, this finishes the proof of the second claim. We then consider the case $\alpha = \Omega(1)$ under the different eigendecay conditions:
- $\beta$-finite spectrum: we know $i_{0}\leq\beta$. As $\|\phi\|_{L_{2}(\mu)} \leq \|\phi\|_{H_{k}} = 1$, we have



A more formal claim concerns the approximation numbers of the bounded linear operator, which, as shown in Steinwart & Christmann (2008), are identical to the eigenvalues of the operator when it is compact, self-adjoint, and positive.



we have the equivalent formulation of (3) as $T^{*}(s'|s,a) = \langle p^{*}(\cdot|s,a),\, p^{*}(s'|\cdot)\rangle_{L_{2}(\mu)}$, which clearly demonstrates the linear MDP structure of Definition 1 and immediately implies $\phi^{*}(s,a) = p^{*}(\cdot|s,a)$ and $\mu^{*}(s') = p^{*}(s'|\cdot)$. We call $p^{*}(\cdot|s,a)$ the Latent Variable Representation (LV-Rep).
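The factorization above can be made concrete with a toy discrete latent space (all sizes and names here are illustrative, not from the paper): the transition kernel is exactly the inner product of the encoder distribution $p(\cdot|s,a)$ and the decoder $p(s'|\cdot)$ under the counting measure.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, nZ = 5, 3, 4  # toy sizes (hypothetical)

# Encoder p(z|s,a): each (s,a) row is a distribution over latent z.
p_z = rng.dirichlet(np.ones(nZ), size=(nS, nA))
# Decoder p(s'|z): each z row is a distribution over the next state.
p_s = rng.dirichlet(np.ones(nS), size=nZ)

# Latent-variable transition: T(s'|s,a) = sum_z p(z|s,a) p(s'|z),
# i.e. the inner product <p(.|s,a), p(s'|.)> with counting measure mu.
T = np.einsum('saz,zt->sat', p_z, p_s)

assert np.allclose(T.sum(axis=-1), 1.0)  # T is a valid transition kernel
# phi(s,a) = p(.|s,a) is a |Z|-dimensional feature map: one-step Bellman
# backups of any function are linear in phi (the linear-MDP property).
```

The simplex constraint on $p(z|s,a)$ is what distinguishes LV-Rep from a generic low-rank factorization: both factors are genuine conditional distributions, which is what makes sampling and variational learning tractable.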

both following Gaussian distributions. The proposed LV-Rep can exploit more general distributions beyond the Gaussian for $p^{*}(\cdot|s,a)$ and $p^{*}(s'|z)$, which introduces more flexibility in transition modeling. Our definition of LV-Rep is more general than the original definition (Definition 2) in Agarwal et al. (2020), which assumes $|Z|$ is finite. As shown by Agarwal et al. (2020), block MDPs (Du et al., 2019; Misra et al., 2020) with a finite latent state space $Z$ admit a latent variable representation in which $S$ corresponds to the set of observations, $Z$ corresponds to the set of latent states, and $p^{*}(z'|s,a)$ is a composition of a deterministic $p(z|s)$ and a transition $p(z'|z,a)$. Agarwal et al. (2020) also remark that, compared with the latent variable representation, the original low-rank representation relaxes the simplex constraint on $p^{*}(z|s,a)$ and can thus be more compact, with fewer dimensions. However, the ambient dimension may not be a proper measure of representation complexity. As we will show in Section 4, even if we work with an infinite $Z$, as long as $p(z|s,a)\in H_{k}$ and $k$ satisfies standard regularity conditions, we can still perform sample-efficient learning. A proper measure of representation complexity remains an open problem for the community. The LV-Rep with $p^{*}(\cdot|s,a)$ and $p^{*}(s'|\cdot)$ naturally satisfies the distribution requirements, which brings the benefits of efficient sampling and learning.

Online Exploration with LV-Rep
1: Input: Model class P = {(p(z|s,a), p(s'|z))}, Q = {q(z|s,a,s')}, iteration number N.
2: Initialize π_0(s) = U(A), where U(A) denotes the uniform distribution over A; D_0 = ∅; D'_0 = ∅.
3: for episode n = 1, ..., N do
4:
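The exploration bonus used inside this loop is the elliptical UCB quantity built from the learned representation. Below is a hedged, finite-dimensional sketch (names and shapes are ours; the paper's bonus lives in $L_2(\mu)$ with regularizer $\lambda T_k^{-1}$, replaced here by $\lambda I$): the bonus is $\min(\alpha\,\|\phi(s,a)\|_{\Sigma^{-1}},\,2)$, and it shrinks in directions where data has accumulated.

```python
import numpy as np

def ucb_bonus(phi_sa, Phi_data, lam, alpha):
    """Elliptical bonus min(alpha * ||phi(s,a)||_{(Phi^T Phi + lam I)^{-1}}, 2),
    a finite-dimensional stand-in for the LV-Rep exploration bonus; the cap
    at 2 mirrors the clipping used in the analysis."""
    d = phi_sa.shape[0]
    Sigma = Phi_data.T @ Phi_data + lam * np.eye(d)
    w = np.linalg.solve(Sigma, phi_sa)          # Sigma^{-1} phi(s,a)
    return min(alpha * np.sqrt(phi_sa @ w), 2.0)

# Toy check: the bonus shrinks as data accumulates along phi's direction.
rng = np.random.default_rng(0)
phi = np.array([1.0, 0.0, 0.0])
small = rng.normal(size=(10, 3))                 # little coverage
large = np.vstack([small, np.tile(phi, (500, 1))])  # heavy coverage of phi
assert ucb_bonus(phi, large, 1.0, 1.0) < ucb_bonus(phi, small, 1.0, 1.0)
```

In the actual algorithm, $\phi$ is the learned encoder $\widehat p_n(\cdot|s,a)$ and the bonus is added to the reward before planning, implementing optimism in the face of uncertainty.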

Figure 1: Learning curves on MuJoCo control tasks compared to the baseline algorithms. The x-axis shows training iterations and the y-axis shows performance. All plots are averaged over 4 random seeds; the shaded area shows the standard error. We only compare to SAC, as it has the best overall performance among all baseline methods.

Figure 2: Results on the DeepMind Control Suite compared to the baseline algorithms. The x-axis shows training iterations and the y-axis shows performance. All plots are averaged over 4 random seeds; the shaded area shows the standard error.

Offline Exploitation with LV-Rep
1: Input: Model class P = {(p(z|s,a), p(s'|z))}, Q = {q(z|s,a,s')}, offline dataset D_n.
2: Learn the latent variable model p̂(z|s,a) from D_n by maximizing the ELBO defined in equation 5, and obtain the learned model T̂.
3: Set the exploitation penalty b(s,a) as in equation 7.
4: Obtain the policy π̂ = arg max_π V^π_{T̂, r−b}.
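Step 4 is planning in the learned model under the pessimistically penalized reward $r-b$. A hedged tabular sketch (our own toy stand-in for the planner; the paper's planner operates with LV-Rep features rather than tabular value iteration): with identical dynamics across actions, the penalty alone steers the greedy policy away from poorly covered actions.

```python
import numpy as np

def pessimistic_policy(T_hat, r, b, gamma=0.9, iters=500):
    """Greedy policy for the learned model T_hat under the penalized reward
    r - b (step 4 of the offline procedure), via plain value iteration.
    T_hat has shape (nS, nA, nS); r and b have shape (nS, nA)."""
    V = np.zeros(T_hat.shape[0])
    for _ in range(iters):
        Q = (r - b) + gamma * (T_hat @ V)  # Bellman backup, shape (nS, nA)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
nS, nA = 4, 2
T_hat = rng.dirichlet(np.ones(nS), size=(nS, nA))
T_hat[:, 1] = T_hat[:, 0]       # same dynamics for both actions, so only
r = rng.uniform(size=(nS, nA))  # the penalty distinguishes them
b = np.zeros((nS, nA))
b[:, 1] = 2.0                   # action 1 is poorly covered: heavy penalty
pi = pessimistic_policy(T_hat, r, b)
assert (pi == 0).all()          # pessimism avoids the penalized action
```

In the paper's setting, $b$ is the norm-based penalty of equation 7 computed from the offline covariance of the learned representation, so "poorly covered" is measured in the representation space rather than hand-set as here.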

Finally, we define the following operators in the space of $L_{2}(\mu)\to L_{2}(\mu)$:
\[
\Sigma_{\rho_{n}\times U(A),\phi} := n\,\mathbb{E}_{s\sim\rho_{n},a\sim U(A)}\big[\phi(s,a)\phi^{\top}(s,a)\big] + \lambda T_{k}^{-1}, \qquad
\Sigma_{\rho_{n},\phi} := n\,\mathbb{E}_{(s,a)\sim\rho_{n}}\big[\phi(s,a)\phi^{\top}(s,a)\big] + \lambda T_{k}^{-1}.
\]
Note that, by the spectral theorem, if $\|T_{k}^{-1/2}x'\|_{L_{2}(\mu)} < \infty$ for $x'\in L_{2}(\mu)$, we have the following Cauchy-Schwarz inequality for the weighted $L_{2}(\mu)$ inner product

Now we consider the first term. With the Cauchy-Schwarz inequality for the $L_{2}(\mu)$ inner product, we have that
\[
\gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T_{n}},\, s\sim\widehat T_{n}(\cdot|\bar s,\bar a),\, a\sim\pi(\cdot|s)}[g(s,a)]
= \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T_{n}}}\Big\langle \widehat p_{n}(\cdot|\bar s,\bar a),\ \int_{S}\sum_{a\in A}\widehat p_{n}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\rangle_{L_{2}(\mu)}
\]
\[
\leq \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T_{n}}}\!\left[\|\widehat p_{n}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho_{n}\times U(A),\widehat p_{n}}}\right]\cdot\Big\|\int_{S}\sum_{a\in A}\widehat p_{n}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|_{L_{2}(\mu),\Sigma_{\rho_{n}\times U(A),\widehat p_{n}}}.
\]
Note that
\[
\Big\|\int_{S}\sum_{a\in A}\widehat p_{n}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|^{2}_{L_{2}(\mu),\Sigma_{\rho_{n}\times U(A),\widehat p_{n}}}
= n\,\mathbb{E}_{\bar s\sim\rho_{n},\,\bar a\sim U(A)}\!\left[\big(\mathbb{E}_{s\sim\widehat T_{n}(\cdot|\bar s,\bar a),\,a\sim\pi(\cdot|s)}[g(s,a)]\big)^{2}\right] + \lambda\Big\|\int_{S}\sum_{a\in A}\widehat p_{n}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|^{2}_{H_{k}}.
\]

we obtain the desired result.

Lemma 6 (One-Step-Back Inequality for the True Model). Assume $g : S\times A\to\mathbb{R}$ such that $\|g\|_{\infty}\leq B$. Then
\[
\mathbb{E}_{(s,a)\sim d^{\pi}_{T^{*}}}[g(s,a)] \leq \mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho_{n},p^{*}}}\right]\sqrt{n\gamma|A|\,\mathbb{E}_{s\sim\rho_{n},a\sim U(A)}[g^{2}(s,a)] + \lambda\gamma^{2}B^{2}C} + \sqrt{(1-\gamma)\,|A|\,\mathbb{E}_{s\sim\rho_{n},a\sim U(A)}[g^{2}(s,a)]}.
\]
Proof. By the property of the stationary distribution, we have
\[
\mathbb{E}_{(s,a)\sim d^{\pi}_{T^{*}}}[g(s,a)] = \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}},\, s\sim T^{*}(\cdot|\bar s,\bar a),\, a\sim\pi(\cdot|s)}[g(s,a)] + (1-\gamma)\,\mathbb{E}_{s\sim d_{0},a\sim\pi(\cdot|s)}[g(s,a)].
\]

the first term, with the Cauchy-Schwarz inequality, we have
\[
\gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}},\, s\sim T^{*}(\cdot|\bar s,\bar a),\, a\sim\pi(\cdot|s)}[g(s,a)]
= \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\Big\langle p^{*}(\cdot|\bar s,\bar a),\ \int_{S}\sum_{a\in A}p^{*}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\rangle_{L_{2}(\mu)}
\leq \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{T^{*}}}\!\left[\|p^{*}(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho_{n},p^{*}}}\right]\cdot\Big\|\int_{S}\sum_{a\in A}p^{*}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|_{L_{2}(\mu),\Sigma_{\rho_{n},p^{*}}}.
\]
Note that
\[
\Big\|\int_{S}\sum_{a\in A}p^{*}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|^{2}_{L_{2}(\mu),\Sigma_{\rho_{n},p^{*}}}
= n\,\mathbb{E}_{(\bar s,\bar a)\sim\rho_{n}}\!\left[\big(\mathbb{E}_{s\sim T^{*}(\cdot|\bar s,\bar a),\,a\sim\pi(\cdot|s)}[g(s,a)]\big)^{2}\right] + \lambda\Big\|\int_{S}\sum_{a\in A}p^{*}(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|^{2}_{H_{k}}
\]
\[
\leq n\,\mathbb{E}_{(\bar s,\bar a)\sim\rho_{n},\, s\sim T^{*}(\cdot|\bar s,\bar a),\, a\sim\pi(\cdot|s)}\big[g^{2}(s,a)\big] + \lambda B^{2}C
\leq n|A|\,\mathbb{E}_{(\bar s,\bar a)\sim\rho_{n},\, s\sim T^{*}(\cdot|\bar s,\bar a),\, a\sim U(A)}\big[g^{2}(s,a)\big] + \lambda B^{2}C
= \frac{n|A|}{\gamma}\,\mathbb{E}_{s\sim\rho_{n},a\sim U(A)}\big[g^{2}(s,a)\big] + \lambda B^{2}C.
\]
Substituting back, we obtain the desired result.

Lemma 7 (Almost Optimism at the Initial Distribution). Consider an episode $n\in[N]$; if we set

\[
\mathbb{E}_{(s,a)\sim d^{\pi}_{\widehat T_{n}}}\!\left[-b_{n}(s,a) + \gamma\,\mathbb{E}_{\widehat T_{n}(s'|s,a)}V^{\pi}_{T,r}(s') - \gamma\,\mathbb{E}_{P(s'|s,a)}V^{\pi}_{T,r}(s')\right]
\gtrsim \mathbb{E}_{(s,a)\sim d^{\pi}_{\widehat T_{n}}}\!\left[-\min\!\big(\alpha_{n}\,\|\widehat p_{n}(\cdot|s,a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho_{n}\times U(A),\widehat p_{n}}},\,2\big) + \gamma\,\mathbb{E}_{\widehat T_{n}(s'|s,a)}V^{\pi}_{T^{*},r}(s') - \gamma\,\mathbb{E}_{T^{*}(s'|s,a)}V^{\pi}_{T^{*},r}(s')\right].
\]

and we know $\mathbb{E}_{s\sim\rho_{n},a\sim U(A)}\big[T_{k}^{1/2}\,\widehat p_{n}(\cdot|s,a)\,\widehat p_{n}(\cdot|s,a)^{\top}\,T_{k}^{1/2}\big]$ is in the trace class, so its Fredholm determinant is well-defined. Invoking Lemma 19, we have that
- For $\beta$-finite spectrum, as $\lambda = \Theta(\beta\log N + \log(N|\mathcal{P}|/\delta))$, we have $n/\lambda = O(n)$

We start by showing that $C^{*}_{\pi}$ can be viewed as a measure of the offline data quality, as demonstrated by the following lemma, first introduced in Chang et al. (2021):

Lemma 11 (Distribution Shift Lemma). For any positive definite operator $\Lambda : L_{2}(\mu)\to L_{2}(\mu)$, we have that
\[
\mathbb{E}_{(s,a)\sim d^{\pi}_{T^{*}}}\!\left[\langle p^{*}(\cdot|s,a),\, \Lambda p^{*}(\cdot|s,a)\rangle_{L_{2}(\mu)}\right] \leq C^{*}_{\pi}\,\mathbb{E}_{(s,a)\sim\rho}\!\left[\langle p^{*}(\cdot|s,a),\, \Lambda p^{*}(\cdot|s,a)\rangle_{L_{2}(\mu)}\right].
\]
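The inequality in Lemma 11 can be verified numerically in a finite stand-in. The sketch below (our own construction) uses the sup density ratio $\max_x d^{\pi}(x)/\rho(x)$ as the shift coefficient, which always upper-bounds the quadratic-form ratio; the paper's $C^{*}_{\pi}$ may be a tighter relative-condition-number quantity, so this is a conservative check.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, dim = 20, 6  # finite (s,a) space of size nX, feature dimension dim

p = rng.normal(size=(nX, dim))        # p*(.|x) as finite-dimensional vectors
d_pi = rng.dirichlet(np.ones(nX))     # occupancy of the comparator policy
rho = rng.dirichlet(np.ones(nX))      # offline data distribution
A = rng.normal(size=(dim, dim))
Lam = A @ A.T                         # an arbitrary positive semidefinite Lambda

quad = np.einsum('xi,ij,xj->x', p, Lam, p)  # <p(.|x), Lam p(.|x)> for each x
C = (d_pi / rho).max()                       # sup density ratio

# E_{d_pi}[<p, Lam p>] <= C * E_{rho}[<p, Lam p>], since quad >= 0.
assert d_pi @ quad <= C * (rho @ quad) + 1e-9
```

The one-line proof is the same as the lemma's: reweight each nonnegative term by the density ratio and bound the ratio by its supremum.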

$\leq C^{*}_{\pi}\,\mathbb{E}_{(s,a)\sim\rho}\big[\langle p^{*}(\cdot|s,a),\Lambda p^{*}(\cdot|s,a)\rangle_{L_{2}(\mu)}\big]$, which finishes the proof.

We also define $\Sigma_{\rho,\phi} : L_{2}(\mu)\to L_{2}(\mu)$ as
\[
\Sigma_{\rho,\phi} := n\,\mathbb{E}_{(s,a)\sim\rho}\big[\phi(s,a)\phi^{\top}(s,a)\big] + \lambda T_{k}^{-1},
\]
where $\rho$ is the stationary distribution of $\pi_{b}$.

Lemma 12 (One-Step-Back Inequality for the Learned Model in the Offline Setting). Assume $g : S\times A\to\mathbb{R}$ such that $\|g\|_{\infty}\leq B$. Then, conditioning on the generalization bound $\mathbb{E}_{(s,a)\sim\rho}\big[\|\widehat T(s,a) - T^{*}(s,a)\|_{1}^{2}\big] \leq \zeta$, we have that for all $\pi$,
\[
\mathbb{E}_{(s,a)\sim d^{\pi}_{\widehat T}}[g(s,a)] \leq \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T}}\!\left[\|\widehat p(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho,\widehat p}}\right]\sqrt{n\omega\gamma\,\mathbb{E}_{(s,a)\sim\rho}[g^{2}(s,a)] + \lambda B^{2}C + nB^{2}\zeta} + \sqrt{(1-\gamma)\,\omega\,\mathbb{E}_{(s,a)\sim\rho}[g^{2}(s,a)]}.
\]

Proof. By the property of the stationary distribution,
\[
\mathbb{E}_{(s,a)\sim d^{\pi}_{\widehat T}}[g(s,a)] = \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T},\, s\sim\widehat T(\cdot|\bar s,\bar a),\, a\sim\pi(\cdot|s)}[g(s,a)] + (1-\gamma)\,\mathbb{E}_{s\sim d_{0},a\sim\pi(\cdot|s)}[g(s,a)].
\]
For the second term, with Jensen's inequality and an importance sampling step, we have that
\[
(1-\gamma)\,\mathbb{E}_{s\sim d_{0},a\sim\pi(\cdot|s)}[g(s,a)] \leq \sqrt{(1-\gamma)\,\omega\,\mathbb{E}_{(s,a)\sim\rho}[g^{2}(s,a)]}.
\]
For the first term, with the Cauchy-Schwarz inequality for the $L_{2}(\mu)$ inner product, we have that
\[
\gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T},\, s\sim\widehat T(\cdot|\bar s,\bar a),\, a\sim\pi(\cdot|s)}[g(s,a)]
= \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T}}\Big\langle \widehat p(\cdot|\bar s,\bar a),\ \int_{S}\sum_{a\in A}\widehat p(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\rangle_{L_{2}(\mu)}
\leq \gamma\,\mathbb{E}_{(\bar s,\bar a)\sim d^{\pi}_{\widehat T}}\!\left[\|\widehat p(\cdot|\bar s,\bar a)\|_{L_{2}(\mu),\Sigma^{-1}_{\rho,\widehat p}}\right]\Big\|\int_{S}\sum_{a\in A}\widehat p(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|_{L_{2}(\mu),\Sigma_{\rho,\widehat p}}.
\]
Note that
\[
\Big\|\int_{S}\sum_{a\in A}\widehat p(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|^{2}_{L_{2}(\mu),\Sigma_{\rho,\widehat p}}
= n\,\mathbb{E}_{(\bar s,\bar a)\sim\rho}\!\left[\big(\mathbb{E}_{s\sim\widehat T(\cdot|\bar s,\bar a),\,a\sim\pi(\cdot|s)}[g(s,a)]\big)^{2}\right] + \lambda\Big\|\int_{S}\sum_{a\in A}\widehat p(s|\cdot)\,\pi(a|s)\,g(s,a)\,ds\Big\|^{2}_{H_{k}}
\]
\[
\leq n\,\mathbb{E}_{(\bar s,\bar a)\sim\rho}\!\left[\big(\mathbb{E}_{s\sim T^{*}(\cdot|\bar s,\bar a),\,a\sim\pi(\cdot|s)}[g(s,a)]\big)^{2}\right] + \lambda B^{2}C + nB^{2}\zeta
\leq n\,\mathbb{E}_{(\bar s,\bar a)\sim\rho}\,\mathbb{E}_{s\sim T^{*}(\cdot|\bar s,\bar a),\,a\sim\pi(\cdot|s)}\big[g^{2}(s,a)\big] + \lambda B^{2}C + nB^{2}\zeta
\]
\[
\leq n\omega\,\mathbb{E}_{(\bar s,\bar a)\sim\rho}\,\mathbb{E}_{s\sim T^{*}(\cdot|\bar s,\bar a),\,a\sim\pi_{b}(\cdot|s)}\big[g^{2}(s,a)\big] + \lambda B^{2}C + nB^{2}\zeta
\leq \frac{n\omega}{\gamma}\,\mathbb{E}_{(s,a)\sim\rho}\big[g^{2}(s,a)\big] + \lambda B^{2}C + nB^{2}\zeta.
\]


$L_{2}(\mu)$ are upper bounded by 1. For notational simplicity, we define $\widetilde{\phi} := T_{k}^{-1/2}\phi$, and let $\widetilde{\mu}_{i}$ denote the conditional distribution of $\widetilde{\phi}$ given the sampled $\widetilde{\phi}_{1},\ldots,\widetilde{\phi}_{i-1}$.

It is sufficient to show that $c\sqrt{\lambda\log(2/\delta)} \leq \frac{C\lambda}{2}$ and $\frac{c\log(2/\delta)}{3} \leq \frac{C\lambda}{2}$, both of which can be achieved by $\lambda = \Omega(\log(1/\delta))$.


Figure 3: Learning curves on MuJoCo control tasks compared to SAC. The x-axis shows training iterations and the y-axis shows performance. All plots are averaged over 4 random seeds; the shaded area shows the standard error.

where $\{z_{i}\}_{i\in[m]} \sim \widehat{p}_{n}(z|s,a)$ and $\{\xi_{i}\}_{i\in[m]} \sim P(\xi)$,
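The sampled latents $z_i$ and noise variables $\xi_i$ make next-state expectations cheap Monte Carlo averages. A hedged sketch of this pattern (the encoder, decoder, and value function below are all hypothetical placeholders, not the paper's trained networks):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 256  # number of Monte Carlo latent samples

# Hypothetical learned model: a Gaussian encoder p_hat(z|s,a) and a
# reparameterized decoder s' = g(z, xi) with xi ~ N(0, 1).
def sample_z(s, a, size):
    return rng.normal(loc=s + a, scale=0.5, size=size)  # z ~ p_hat(z|s,a)

def decode(z, xi):
    return 0.9 * z + 0.1 * xi                           # s' = g(z, xi)

def V(s):
    return np.tanh(s)  # any bounded value function to back up

s, a = 0.3, 0.1
z = sample_z(s, a, m)
xi = rng.normal(size=m)
estimate = V(decode(z, xi)).mean()  # ~ E_{s' ~ T_hat(.|s,a)}[V(s')]
assert np.isfinite(estimate) and -1.0 <= estimate <= 1.0
```

Because both factors of LV-Rep are genuine conditional distributions, this estimator is unbiased for the learned model's one-step backup, which is what makes planning with the representation tractable.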

Performance on various MuJoCo control tasks. All results are averaged across 4 random seeds with a window size of 10K. Results marked with * are adopted from MBBL. LV-Rep-C and LV-Rep-D use continuous and discrete latent variable models, respectively. LV-Rep achieves strong performance compared with the baselines.

Performance on various DeepMind Control Suite tasks. All results are averaged across four random seeds with a window size of 10K. Compared with SAC, our method achieves even better performance on sparse-reward tasks.

7 CONCLUSION

In this paper, we reveal a representation view of latent variable dynamics models, which induces the Latent Variable Representation (LV-Rep). Based on LV-Rep, we propose a new provable and practical algorithm for reinforcement learning that balances flexibility and statistical efficiency, with tractable learning, planning, and exploration. We provide a rigorous theoretical analysis, applicable to LV-Rep with both finite- and infinite-dimensional latent spaces, together with a comprehensive empirical study demonstrating its superior performance.

Hyperparameters used for LV-Rep in all environments in MuJoCo and the DM Control Suite.

Now we consider the covering number $N(S_{H_{k}}, \|\cdot\|_{L_{2}(\mu)}, \varepsilon)$. We start from the entropy number $e_{i}(S_{H_{k}}, \|\cdot\|_{L_{2}(\mu)})$. From (A.36) in Steinwart & Christmann (2008), we know that



https://rlrep.github.io/lvrep/ 


Note that $\big(\alpha^{-1}I + \mathbb{E}_{\phi\sim\nu}[\phi\phi^{\top}]\big)^{-1} \preceq \alpha I$; hence the inner product is well-defined. As $\nu$ is a probability measure over $B_{H_{k}}$, and if $c\geq 1$, [...]. Hence, we can focus on $\nu$ supported on $S_{H_{k}}$. Furthermore, by the optimality condition for the probability measure, we know the optimal $\nu$ should satisfy that for all $\phi'\in\mathrm{supp}(\nu)$, [...], where $C$ is some constant. We first show that $C \geq \frac{\alpha\mu_{1}(T_{k})^{2}}{\alpha\mu_{1}^{2}(T_{k})+1}$, which can be shown by considering the following constrained optimization problem, where $\nu$ ranges over the space of probability measures supported on [...]. By the optimality condition, the optimal $\nu$ should satisfy that for all $\phi\in\mathrm{supp}(\nu)$, [...], and for all $\bar\phi\in S_{H_{k}}$, we have [...], where $C'$ is an absolute constant. With the Cauchy-Schwarz inequality, we have [...], where the maximum is achieved only when [...], where $c'$ is an absolute constant ensuring $\|\bar\phi\|_{H_{k}}=1$. Hence, the optimal $\nu$ is a point measure supported on $\bar\phi$, which further leads to [...]; we know $\nu$ should be supported only on $\mu_{1}(T_{k})e_{1}$, and

G.2 ABLATION STUDY ON LATENT REPRESENTATION SIZE

In this section, we provide an ablation study on the latent representation dimension to show how this parameter affects the performance of LV-Rep. In all our experiments, the latent feature dimension is set to 256. We compare with latent feature dimensions 64 and 128 on HalfCheetah. The results are reported in Figure 4.

