UNIFIED ALGORITHMS FOR RL WITH DECISION-ESTIMATION COEFFICIENTS: NO-REGRET, PAC, AND REWARD-FREE LEARNING

Abstract

Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a governing complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL within the DEC framework. First, we propose two new DEC-type complexity measures: the Explorative DEC (EDEC) and the Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC, which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021), which require either bounding a variant of the DEC that may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy regret bounds similar to those of E2D-TA under structural conditions similar to the DEC.

Under review as a conference paper at ICLR 2023

Despite this progress, several important questions remain open within the DEC framework. First, in Foster et al. (2021), regret upper bounds for low-DEC problems are achieved by the Estimation-To-Decisions (E2D) meta-algorithm, which requires a subroutine for online model estimation given past observations.
However, their instantiations of this algorithm either (1) use a general improper estimation subroutine that works black-box for any model class, but yields a regret bound scaling with a (potentially significantly) larger variant of the DEC that does not admit known polynomial bounds, or (2) require a proper estimation subroutine, which typically calls for problem-specific designs and is unclear how to construct for general model classes. These additional bottlenecks prevent their instantiations from being a unified sample-efficient algorithm for all low-DEC problems. Second, while the DEC captures the complexity of no-regret learning, there are alternative learning goals widely studied in the RL literature, such as PAC learning (Dann et al., 2017) and reward-free learning (Jin et al., 2020a), and it is unclear whether they can be characterized using a similar framework. Finally, several other optimistic model-based algorithms, such as Optimistic Posterior Sampling (Zhang, 2022; Agarwal & Zhang, 2022a) and Optimistic Maximum Likelihood Estimation (Mete et al., 2021; Liu et al., 2022a;b), have been proposed in recent work, whereas the E2D algorithm does not explicitly use optimism in its design. It is unclear whether E2D bears any similarities or connections to the aforementioned optimistic algorithms.

In this paper, we resolve the above open questions positively by developing new complexity measures and unified algorithms for RL with Decision-Estimation Coefficients. Our contributions can be summarized as follows.

• We design E2D-TA, the first unified algorithm that achieves low regret for any problem with bounded DEC and a low-capacity model class (Section 3). E2D-TA instantiates the E2D meta-algorithm with Tempered Aggregation, a general improper online estimation subroutine that achieves stronger guarantees than the variants used in existing work.
• We establish connections between E2D-TA and two existing model-based algorithms: Optimistic Model-Based Posterior Sampling and Optimistic Maximum-Likelihood Estimation. We show that these two algorithms enjoy regret bounds similar to those of E2D-TA under structural conditions similar to the DEC (Appendix E).
• We extend the DEC framework to two new learning goals: PAC learning and reward-free learning. We define variants of the DEC, which we term the Explorative DEC (EDEC) and the Reward-Free DEC (RFDEC), and show that they give upper and lower bounds for sample-efficient learning in the two settings, respectively (Section 4).
• We instantiate our results to give sample complexity guarantees for the broad problem class of RL with low-complexity Bellman representations. Our results recover many existing guarantees and yield new ones when specialized to concrete RL problems (Section 5).

Our work is closely related to the long lines of work on sample-efficient RL (both no-regret/PAC and reward-free) and on problems/algorithms in general interactive decision making. We review these related works in Appendix A due to the space limit.

2. PRELIMINARIES

We adopt the general framework of Decision Making with Structured Observations (DMSO) (Foster et al., 2021), which captures broad classes of problems such as bandits and reinforcement learning. In DMSO, the environment is described by a model $M = (\mathbb{P}^M, R^M)$, where $\mathbb{P}^M$ specifies the distribution of the observation $o \in \mathcal{O}$, and $R^M$ specifies the conditional means of the reward vector $\mathbf{r} \in [0,1]^H$, where $H$ is the horizon length. The learner interacts with a model using a policy $\pi \in \Pi$. Upon executing $\pi$ in $M$, they observe an (observation, reward) tuple $(o, \mathbf{r}) \sim M(\pi)$ as follows:

1. INTRODUCTION

Reinforcement Learning (RL) has achieved immense success in modern artificial intelligence. As RL agents typically require an enormous number of samples to train in practice (Mnih et al., 2015; Silver et al., 2016), sample-efficiency has been an important question in RL research. This question has been studied extensively in theory, with provably sample-efficient algorithms established for many concrete RL problems, starting with tabular Markov Decision Processes (MDPs) (Brafman & Tennenholtz, 2002; Azar et al., 2017; Agrawal & Jia, 2017; Jin et al., 2018; Dann et al., 2019; Zhang et al., 2020b), and later MDPs with various types of linear structure (Yang & Wang, 2019; Jin et al., 2020b; Zanette et al., 2020b; Ayoub et al., 2020; Zhou et al., 2021; Wang et al., 2021). Towards a more unifying theory, a recent line of work seeks general structural conditions and unified algorithms that encompass as many known sample-efficient RL problems as possible. Many such structural conditions have been identified, such as Bellman rank (Jiang et al., 2017), Witness rank (Sun et al., 2019), Eluder dimension (Russo & Van Roy, 2013; Wang et al., 2020b), Bilinear Class (Du et al., 2021), and Bellman-Eluder dimension (Jin et al., 2021). The recent work of Foster et al. (2021) proposes the Decision-Estimation Coefficient (DEC) as a quantitative complexity measure that governs the statistical complexity of model-based RL with a model class. Roughly speaking, the DEC measures the optimal trade-off, achieved by any policy, between exploration (gaining information) and exploitation (being a near-optimal policy itself) when the true model could be any model within the model class. Foster et al. (2021) establish regret lower bounds for online RL in terms of the DEC, and upper bounds in terms of (a variant of) the DEC and the model class capacity, showing that the DEC is necessary and (in the above sense) sufficient for online RL with low regret.
This constitutes a significant step towards a unified understanding of sample-efficient RL.

1. The learner first observes an observation $o \sim \mathbb{P}^M(\pi)$ (also denoted as $\mathbb{P}^{M,\pi}(\cdot) \in \Delta(\mathcal{O})$).
2. Then, the learner receives a (random) reward vector $\mathbf{r} = (r_h)_{h=1}^H$, with conditional mean $R^M(o) = (R^M_h(o))_{h=1}^H := \mathbb{E}_{\mathbf{r} \sim R^M(\cdot\,|\,o)}[\mathbf{r}] \in [0,1]^H$ and independent entries conditioned on $o$.

Let $f^M(\pi) := \mathbb{E}^{M,\pi}[\sum_{h=1}^H r_h]$ denote the value (expected cumulative reward) of $\pi$ under $M$, and let $\pi_M := \arg\max_{\pi \in \Pi} f^M(\pi)$ and $f^M(\pi_M)$ denote the optimal policy and optimal value for $M$, respectively.

In this paper, we focus on RL in episodic Markov Decision Processes (MDPs) using the DMSO framework. An MDP $M = (H, \mathcal{S}, \mathcal{A}, \mathbb{P}^M, r^M)$ can be cast as a DMSO problem as follows. The observation $o = (s_1, a_1, \ldots, s_H, a_H)$ is the full state-action trajectory (so that the observation space is $\mathcal{O} = (\mathcal{S} \times \mathcal{A})^H$). Upon executing policy $\pi = \{\pi_h : \mathcal{S} \to \Delta(\mathcal{A})\}_{h \in [H]}$ in $M$, the learner observes $o = (s_1, a_1, \ldots, s_H, a_H) \sim \mathbb{P}^M(\pi)$, which sequentially samples $s_1 \sim \mathbb{P}^M_0(\cdot)$, $a_h \sim \pi_h(\cdot\,|\,s_h)$, and $s_{h+1} \sim \mathbb{P}^M_h(\cdot\,|\,s_h, a_h)$ for all $h \in [H]$. The learner then receives a reward vector $\mathbf{r} = (r_h)_{h \in [H]} \in [0,1]^H$, where $r_h = r^M_h(s_h, a_h)$ is the (possibly random) instantaneous reward for the $h$-th step, with conditional mean $\mathbb{E}^M[r_h \,|\, o] = R^M_h(o) =: R^M_h(s_h, a_h)$ depending only on $(s_h, a_h)$. We assume that $\sum_{h=1}^H R^M_h(s_h, a_h) \in [0,1]$ for all $M$ and all $o \in \mathcal{O}$.

Learning goals. We consider the online learning setting, where the learner interacts with a fixed (unknown) ground truth model $M^\star$ for $T$ episodes. Let $\pi^t \in \Pi$ denote the policy executed within the $t$-th episode. In general, $\pi^t$ may be sampled by the learner from a distribution $p^t \in \Delta(\Pi)$ before the $t$-th episode starts.
One main learning goal in this paper is to minimize the standard notion of regret, which measures the cumulative suboptimality of $\{\pi^t\}_{t \in [T]}$:
$$\mathrm{Reg}_{\mathrm{DM}} := \sum_{t=1}^T \mathbb{E}_{\pi^t \sim p^t}\big[f^{M^\star}(\pi_{M^\star}) - f^{M^\star}(\pi^t)\big].$$
To achieve low regret, this paper focuses on model-based approaches, where we are given a model class $\mathcal{M}$ and assume realizability: $M^\star \in \mathcal{M}$. Additionally, throughout the majority of the paper, we assume that the model class is finite: $|\mathcal{M}| < \infty$ (or $|\mathcal{P}| < \infty$ for the reward-free setting in Section 4.2) for simplicity of presentation; both assumptions can be relaxed using standard covering arguments (see e.g. Appendix D.2), which we do when we instantiate our results to concrete RL problems in Examples 13-15.
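As a toy illustration of the interaction protocol and the regret notion above, the following Python sketch casts a two-armed Bernoulli bandit as a degenerate DMSO model with $H = 1$. The class and helper names here are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """A two-armed bandit viewed as a (degenerate) DMSO model with H = 1:
    the policy is simply which arm to pull, the observation is that arm,
    and R^M gives the conditional mean reward of each arm."""
    def __init__(self, means):
        self.means = np.asarray(means, dtype=float)

    def play(self, arm):
        # Executing pi in M returns an (observation, reward) tuple (o, r) ~ M(pi).
        return arm, float(rng.random() < self.means[arm])

def regret(model, arms_played):
    # Reg_DM = sum_t [ f^M(pi_M) - f^M(pi^t) ], computable here only because
    # we know the true means (a learner, of course, does not).
    f_star = model.means.max()
    return float(sum(f_star - model.means[a] for a in arms_played))

M_star = BernoulliBandit([0.3, 0.7])
o, r = M_star.play(1)
total_regret = regret(M_star, [0, 1, 1])  # one suboptimal pull costs 0.7 - 0.3
```

The same interface extends to horizon $H > 1$ by letting `play` return a full trajectory as the observation and a length-$H$ reward vector.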

2.1. DEC WITH RANDOMIZED REFERENCE MODELS

The Decision-Estimation Coefficient (DEC) was proposed by Foster et al. (2021) as a key quantity characterizing the statistical complexity of sequential decision making. We consider the following definition of the DEC with randomized reference models (henceforth "DEC"):

Definition 1 (DEC with randomized reference models). The DEC of $\mathcal{M}$ with respect to distribution $\mu \in \Delta(\mathcal{M})$ (with policy class $\Pi$ and parameter $\gamma > 0$) is defined as
$$\mathrm{dec}_\gamma(\mathcal{M}, \mu) := \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi \sim p}\, \mathbb{E}_{\overline{M} \sim \mu}\big[f^M(\pi_M) - f^M(\pi) - \gamma D^2_{\mathrm{RL}}(M(\pi), \overline{M}(\pi))\big].$$
Further define $\mathrm{dec}_\gamma(\mathcal{M}) := \sup_{\mu \in \Delta(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}, \mu)$.

Above, $D^2_{\mathrm{RL}}$ is the following squared divergence function:
$$D^2_{\mathrm{RL}}(M(\pi), \overline{M}(\pi)) := D^2_{\mathrm{H}}\big(\mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi)\big) + \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[\|R^M(o) - R^{\overline{M}}(o)\|_2^2\big],$$
where $D^2_{\mathrm{H}}(P, Q) := \int (\sqrt{dP/d\mu} - \sqrt{dQ/d\mu})^2\, d\mu$ denotes the squared Hellinger distance between probability distributions $P, Q$.

Definition 1 instantiates the general definition of DECs in Foster et al. (2021, Section 4.3) with the divergence function chosen as $D^2_{\mathrm{RL}}$. The quantity $\mathrm{dec}_\gamma(\mathcal{M}, \mu)$ measures the optimal trade-off of a policy distribution $p \in \Delta(\Pi)$ between two terms: low suboptimality $f^M(\pi_M) - f^M(\pi)$, and high information gain $D^2_{\mathrm{RL}}(M(\pi), \overline{M}(\pi))$ with respect to the randomized reference model $\overline{M} \sim \mu$. The main feature of $D^2_{\mathrm{RL}}$ is that it treats the estimation of observations and rewards separately: it requires the observation distribution to be estimated accurately in Hellinger distance between the full distributions, but the reward only in the squared $L_2$ error between the conditional means. Such a treatment is particularly suitable for RL problems, where estimating mean rewards is easier than estimating full reward distributions, and is also sufficient in most scenarios.

Algorithm 1 E2D-TA: ESTIMATION-TO-DECISIONS WITH TEMPERED AGGREGATION
Input: Parameter $\gamma > 0$; learning rates $\eta_p \in (0, \frac{1}{2})$, $\eta_r > 0$.
1: Initialize $\mu^1 \leftarrow \mathrm{Unif}(\mathcal{M})$.
2: for $t = 1, \ldots, T$ do
3:   Set $p^t \leftarrow \arg\min_{p \in \Delta(\Pi)} \widehat{V}^{\mu^t}_\gamma(p)$, where $\widehat{V}^{\mu^t}_\gamma$ is defined in (2).
4:   Sample $\pi^t \sim p^t$. Execute $\pi^t$ and observe $(o^t, \mathbf{r}^t)$.
5:   Update the randomized model estimator by Tempered Aggregation:
$$\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\Big(\eta_p \log \mathbb{P}^{M,\pi^t}(o^t) - \eta_r \|\mathbf{r}^t - R^M(o^t)\|_2^2\Big). \qquad (3)$$
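For concreteness, the update (3) can be sketched in a few lines for a finite model class. The function below is our own illustrative implementation in log-space (the paper does not prescribe an implementation); the per-model log-likelihoods and squared reward errors are supplied by the caller.

```python
import numpy as np

def tempered_aggregation(mu, logliks, sq_errs, eta_p=1/3, eta_r=1/3):
    """One step of update (3): mu^{t+1}(M) is proportional to
    mu^t(M) * exp(eta_p * log P^{M,pi_t}(o_t) - eta_r * ||r_t - R^M(o_t)||_2^2),
    computed in log-space for numerical stability."""
    logw = np.log(mu) + eta_p * np.asarray(logliks) - eta_r * np.asarray(sq_errs)
    logw -= logw.max()           # shift before exponentiating to avoid underflow
    w = np.exp(logw)
    return w / w.sum()           # renormalize to a distribution over models

mu = np.full(3, 1 / 3)                 # uniform prior mu^1 over 3 models
logliks = np.log([0.5, 0.1, 0.1])      # model 0 assigns o^t the highest likelihood
sq_errs = [0.0, 1.0, 1.0]              # ...and has the smallest reward error
mu_next = tempered_aggregation(mu, logliks, sq_errs)
```

Setting `eta_p = 1` would recover an exact (untempered) posterior update on the observation term; the choice `eta_p <= 1/2` is exactly the tempering discussed in Section 3.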

3. E2D WITH TEMPERED AGGREGATION

We begin by presenting our algorithm Estimation-to-Decisions with Tempered Aggregation (E2D-TA; Algorithm 1), a unified sample-efficient algorithm for any problem with bounded DEC.

Algorithm description. In each episode $t$, Algorithm 1 maintains a randomized model estimator $\mu^t \in \Delta(\mathcal{M})$, and uses it to obtain a distribution over policies $p^t \in \Delta(\Pi)$ by minimizing the following risk function (cf. Line 3):
$$\widehat{V}^{\mu^t}_\gamma(p) := \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi \sim p}\, \mathbb{E}_{\overline{M} \sim \mu^t}\big[f^M(\pi_M) - f^M(\pi) - \gamma D^2_{\mathrm{RL}}(M(\pi), \overline{M}(\pi))\big]. \qquad (2)$$
This risk function instantiates the E2D meta-algorithm with randomized estimators (Foster et al., 2021, Algorithm 3) with divergence $D^2_{\mathrm{RL}}$. The algorithm then samples a policy $\pi^t \sim p^t$, executes $\pi^t$, and observes $(o^t, \mathbf{r}^t)$ from the environment (Line 4).

Core to our algorithm is the subroutine for updating the randomized model estimator $\mu^t$: inspired by Agarwal & Zhang (2022a), we use a Tempered Aggregation subroutine that performs an exponential weights update on $\mu^t(M)$ using a linear combination of the log-likelihood $\log \mathbb{P}^{M,\pi^t}(o^t)$ for the observation and the negative squared $L_2$ loss $-\|\mathbf{r}^t - R^M(o^t)\|_2^2$ for the reward (cf. Line 5). An important feature of this subroutine is the learning rate $\eta_p \le 1/2$, which is smaller than in e.g. Vovk's aggregating algorithm (Vovk, 1995), which uses $\eta_p = 1$. As we will see shortly, this difference is crucial and allows a stronger estimation guarantee that is suitable for our purpose. As an intuition, exponential weights with $\exp(\eta_p \log \mathbb{P}^{M,\pi^t}(o^t)) = (\mathbb{P}^{M,\pi^t}(o^t))^{\eta_p}$ and $\eta_p \le 1/2$ is equivalent to computing the tempered posterior in a Bayesian setting (Bhattacharya et al., 2019; Alquier & Ridgway, 2020), hence the name "tempered", whereas $\eta_p = 1$ computes the exact posterior (see Appendix C.2 for a derivation). We are now ready to present the main theoretical guarantee for Algorithm 1.

Theorem 2 (E2D with Tempered Aggregation).
Choosing $\eta_p = \eta_r = 1/3$, Algorithm 1 achieves the following with probability at least $1 - \delta$:
$$\mathrm{Reg}_{\mathrm{DM}} \le T \cdot \mathrm{dec}_\gamma(\mathcal{M}) + 10\gamma \cdot \log(|\mathcal{M}|/\delta).$$
By choosing the optimal $\gamma > 0$, we get $\mathrm{Reg}_{\mathrm{DM}} \lesssim \inf_{\gamma > 0}\{T\, \mathrm{dec}_\gamma(\mathcal{M}) + \gamma \log(|\mathcal{M}|/\delta)\}$, which scales as $\sqrt{dT \log(|\mathcal{M}|/\delta)}$ if the model class satisfies $\mathrm{dec}_\gamma(\mathcal{M}) \lesssim d/\gamma$ for some complexity measure $d$.

To our best knowledge, this is the first unified sample-efficient algorithm for general problems with low DEC, and it resolves a subtle but important technical challenge in Foster et al. (2021) which prevented them from obtaining such a unified algorithm. Concretely, Foster et al. (2021, Theorem 3.3 & 4.1) show that E2D with Vovk's aggregating algorithm as the estimation subroutine achieves the following regret bound with high probability:
$$T \cdot \sup_{\overline{M} \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}; \delta_{\overline{M}}) + O(\gamma \log(|\mathcal{M}|/\delta)),$$
where $\delta_{\overline{M}}$ denotes the point mass at $\overline{M}$, and $\mathrm{co}(\mathcal{M})$ denotes the set of all possible mixtures of models in $\mathcal{M}$. Unfortunately, this mixture causes $\sup_{\overline{M} \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}; \delta_{\overline{M}})$ to be potentially intractable for most RL problems: even when $\mathcal{M} = \{\text{tabular MDPs}\}$, it does not admit known bounds of the form $\mathrm{poly}(H, S, A, 1/\gamma)$. Our Theorem 2 removes this bottleneck and only scales with $\mathrm{dec}_\gamma(\mathcal{M})$, which is much milder and admits tractable bounds for most known tractable RL problems (Section 5); for example, $\mathrm{dec}_\gamma(\mathcal{M}) \lesssim H^2 SA/\gamma$ for tabular MDPs (Appendix K.3.1). See Appendix C.1 for additional details on the comparison between these two DECs.

Proof overview. The proof of Theorem 2 (deferred to Appendix D.2) builds upon the analysis of E2D meta-algorithms (Foster et al., 2021). The main new ingredient in the proof is the following online estimation guarantee for the Tempered Aggregation subroutine (proof in Appendix D.1).

Lemma 3 (Online estimation guarantee for Tempered Aggregation).
Subroutine (3) with $4\eta_p + \eta_r < 2$ achieves the following bound with probability at least $1 - \delta$:
$$\mathrm{Est}_{\mathrm{RL}} := \sum_{t=1}^T \mathbb{E}_{\pi^t \sim p^t}\, \mathbb{E}_{\widehat{M}^t \sim \mu^t}\big[D^2_{\mathrm{RL}}(M^\star(\pi^t), \widehat{M}^t(\pi^t))\big] \le C \cdot \log(|\mathcal{M}|/\delta), \qquad (4)$$
where $C$ depends only on $(\eta_p, \eta_r)$. Specifically, we can choose $\eta_p = \eta_r = 1/3$ and $C = 10$.

Bound (4) is stronger than the estimation bound for Vovk's aggregating algorithm (e.g. Foster et al. (2021, Lemma A.15), adapted to $D^2_{\mathrm{RL}}$), which only achieves
$$\sum_{t=1}^T \mathbb{E}_{\pi^t \sim p^t}\Big[D^2_{\mathrm{RL}}\Big(M^\star(\pi^t), \mathbb{E}_{\widehat{M}^t \sim \mu^t}\big[\widehat{M}^t(\pi^t)\big]\Big)\Big] \le C \cdot \log(|\mathcal{M}|/\delta), \qquad (5)$$
where $\mathbb{E}_{\widehat{M}^t \sim \mu^t}[\widehat{M}^t(\pi^t)]$ denotes the mixture of the models $\widehat{M}^t(\pi^t)$ with $\widehat{M}^t \sim \mu^t$. Note that (4) is stronger than (5), by convexity of $D^2_{\mathrm{RL}}$ in the second argument and Jensen's inequality. Therefore, while both algorithms yield randomized model estimates, the guarantee of Tempered Aggregation is stronger, which in turn allows our regret bound in Theorem 2 to scale with a smaller DEC.
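The Jensen step connecting (4) and (5) is easy to verify numerically for discrete distributions: squared Hellinger distance is convex in its second argument, so the divergence to a mixture is at most the average divergence. The distributions below are arbitrary illustrative choices.

```python
import numpy as np

def sq_hellinger(p, q):
    # D_H^2(P, Q) = sum_x (sqrt(p(x)) - sqrt(q(x)))^2 for discrete P, Q.
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p_star = np.array([0.5, 0.3, 0.2])    # plays the role of M*(pi^t)
q1 = np.array([0.6, 0.2, 0.2])        # two candidate estimates M-hat^t(pi^t)
q2 = np.array([0.2, 0.5, 0.3])
w = 0.5                               # weight mu^t places on q1

# Average divergence (the quantity bounded in (4)) vs divergence to the
# mixture (the quantity bounded in (5)):
avg_div = w * sq_hellinger(p_star, q1) + (1 - w) * sq_hellinger(p_star, q2)
mix_div = sq_hellinger(p_star, w * q1 + (1 - w) * q2)
# Jensen/convexity gives mix_div <= avg_div, so a bound of the form (4)
# implies one of the form (5), but not conversely.
```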

Connections to other optimistic algorithms

In Appendix E, we re-analyze two existing optimistic algorithms: Model-based Optimistic Posterior Sampling (MOPS) and Optimistic Maximum Likelihood Estimation (OMLE). These algorithms are similar to E2D-TA in their use of posteriors/likelihoods, and we show that they achieve regret bounds similar to those of E2D-TA under structural conditions similar to the DEC, thereby establishing a connection between these three algorithms.

4. PAC LEARNING AND REWARD-FREE LEARNING

We now extend the DEC framework to two alternative learning goals in RL beyond no-regret: PAC learning, and reward-free learning. We propose new generalized definitions of the DEC and show that they upper and lower bound the sample complexity in both settings.

4.1. PAC LEARNING VIA EXPLORATIVE DEC

In PAC learning, we only require the learner to output a near-optimal policy after the $T$ episodes are finished, without requiring the executed policies $\{\pi^t\}_{t=1}^T$ (the "exploration policies") to be high-quality. It is a standard result that any no-regret algorithm can be converted into a PAC algorithm by online-to-batch conversion (e.g. Jin et al. (2018)), so that the DEC (and the corresponding E2D-TA algorithm) gives upper bounds for PAC learning as well. However, for certain problems there may exist PAC algorithms that are better than converted no-regret algorithms, in which case the DEC would not tightly capture the complexity of PAC learning. To better capture PAC learning, we define the following Explorative DEC (EDEC):

Definition 4 (Explorative DEC). The Explorative Decision-Estimation Coefficient (EDEC) of a model class $\mathcal{M}$ with respect to $\mu \in \Delta(\mathcal{M})$ and parameter $\gamma > 0$ is defined as
$$\mathrm{edec}_\gamma(\mathcal{M}, \mu) := \inf_{p_{\mathrm{exp}},\, p_{\mathrm{out}} \in \Delta(\Pi)} \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi \sim p_{\mathrm{out}}}\big[f^M(\pi_M) - f^M(\pi)\big] - \gamma\, \mathbb{E}_{\pi \sim p_{\mathrm{exp}},\, \overline{M} \sim \mu}\big[D^2_{\mathrm{RL}}(M(\pi), \overline{M}(\pi))\big].$$
Further, define $\mathrm{edec}_\gamma(\mathcal{M}) := \sup_{\mu \in \Delta(\mathcal{M})} \mathrm{edec}_\gamma(\mathcal{M}, \mu)$.

The main difference between the EDEC and the DEC (Definition 1) is that the inf is taken over two different policy distributions $p_{\mathrm{exp}}$ and $p_{\mathrm{out}}$, where $p_{\mathrm{exp}}$ (the "exploration policy distribution") appears in the information gain term, and $p_{\mathrm{out}}$ (the "output policy distribution") appears in the suboptimality term. In comparison, the DEC restricts the policy distribution to be the same in both terms. This accurately reflects the difference between PAC learning and no-regret learning: in PAC learning, the exploration policies are not required to coincide with the final output policy.

Algorithm and theoretical guarantee. The EDEC naturally leads to the following EXPLORATIVE E2D algorithm for PAC learning. Define the risk function $\widehat{V}^{\mu^t}_{\mathrm{pac},\gamma} : \Delta(\Pi) \times \Delta(\Pi) \to \mathbb{R}$ as
$$\widehat{V}^{\mu^t}_{\mathrm{pac},\gamma}(p_{\mathrm{exp}}, p_{\mathrm{out}}) := \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi \sim p_{\mathrm{out}}}\big[f^M(\pi_M) - f^M(\pi)\big] - \gamma\, \mathbb{E}_{\pi \sim p_{\mathrm{exp}},\, \widehat{M}^t \sim \mu^t}\big[D^2_{\mathrm{RL}}(M(\pi), \widehat{M}^t(\pi))\big]. \qquad (6)$$
Our algorithm (full description in Algorithm 7) is similar to E2D-TA (Algorithm 1), except that in each iteration we find $(p^t_{\mathrm{exp}}, p^t_{\mathrm{out}})$ jointly minimizing $\widehat{V}^{\mu^t}_{\mathrm{pac},\gamma}(\cdot, \cdot)$ (Line 3), execute $\pi^t \sim p^t_{\mathrm{exp}}$ to collect data, and return $\widehat{p}_{\mathrm{out}} = \frac{1}{T}\sum_{t=1}^T p^t_{\mathrm{out}}$ as the output policy distribution after $T$ episodes.

Theorem 5 (PAC learning with EXPLORATIVE E2D). Choosing $\eta_p = \eta_r = 1/3$, Algorithm 7 achieves the following PAC guarantee with probability at least $1 - \delta$:
$$\mathrm{SubOpt} := f^{M^\star}(\pi_{M^\star}) - \mathbb{E}_{\pi \sim \widehat{p}_{\mathrm{out}}}\big[f^{M^\star}(\pi)\big] \le \mathrm{edec}_\gamma(\mathcal{M}) + \frac{10\gamma \log(|\mathcal{M}|/\delta)}{T}.$$
The proof can be found in Appendix H.1. For problems with $\mathrm{edec}_\gamma(\mathcal{M}) \lesssim \widetilde{O}(d/\gamma)$, Theorem 5 achieves $\mathrm{SubOpt} \le \widetilde{O}(\sqrt{d \log|\mathcal{M}|/T})$ (by tuning $\gamma$), which implies an $\widetilde{O}(d \log|\mathcal{M}|/\varepsilon^2)$ sample complexity for learning an $\varepsilon$ near-optimal policy. In the literature, PAC RL algorithms with exploration policies different from output policies have been designed for various problems, e.g. Jiang et al. (2017); Du et al. (2021); Liu et al. (2022a). These algorithms typically design their exploration policies manually (e.g. concatenating the output policy with a uniform policy over time step $h$) using prior knowledge about the problem. By contrast, EXPLORATIVE E2D does not require such knowledge and automatically learns the best exploration policy distribution $p_{\mathrm{exp}} \in \Delta(\Pi)$ by minimizing (6), thus substantially simplifying the algorithm design.
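To make the role of the separate exploration distribution tangible, here is a small brute-force computation on a toy instance of our own construction (not from the paper): two models, three policies, and a point-mass reference distribution, where a purely "revealing" policy is informative but earns zero reward. Splitting $p_{\mathrm{exp}}$ from $p_{\mathrm{out}}$ strictly helps.

```python
import itertools
import numpy as np

# Toy instance (illustrative assumptions): policies are [pi1, pi2, pi_reveal].
# f[M][pi]: value of pi under model M; pi1 is optimal for M1, pi2 for M2.
f = {"M1": np.array([1.0, 0.0, 0.0]), "M2": np.array([0.0, 1.0, 0.0])}
opt = {"M1": 1.0, "M2": 1.0}
# D[M][pi]: squared divergence D_RL^2(M(pi), Mbar(pi)) against the point-mass
# reference model Mbar; only the revealing policy distinguishes either model.
D = {"M1": np.array([0.0, 0.0, 1.0]), "M2": np.array([0.0, 0.0, 1.0])}

def simplex_grid(k, step=0.1):
    # All distributions over k policies on a grid of the given resolution.
    n = round(1 / step)
    for c in itertools.product(range(n + 1), repeat=k):
        if sum(c) == n:
            yield np.array(c) / n

def dec(gamma):
    # dec_gamma(M, mu): a single distribution p plays both roles.
    return min(
        max(opt[M] - p @ f[M] - gamma * (p @ D[M]) for M in f)
        for p in simplex_grid(3)
    )

def edec(gamma):
    # edec_gamma(M, mu): separate exploration (p_exp) and output (p_out).
    return min(
        max(opt[M] - p_out @ f[M] - gamma * (p_exp @ D[M]) for M in f)
        for p_exp in simplex_grid(3)
        for p_out in simplex_grid(3)
    )
```

At $\gamma = 1$ the grid search gives $\mathrm{dec} = 0$ (play the revealing policy) but $\mathrm{edec} = -1/2$ (explore with the revealing policy, output the uniform mixture of the two exploit policies), illustrating that $\mathrm{edec}_\gamma \le \mathrm{dec}_\gamma$ and that the gap can be strict.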

Lower bound

We show that a suitably localized version of the EDEC gives an information-theoretic lower bound for PAC learning. The form of this lower bound is similar to the regret lower bound in terms of the (localized) DEC (Foster et al., 2021). For any model class $\mathcal{M}$ and $M \in \mathcal{M}$, define the shorthand $\mathrm{edec}_\gamma(\mathcal{M}, M) := \mathrm{edec}_\gamma(\mathcal{M}, \delta_M)$, where $\delta_M$ denotes the point mass at $M$.

Proposition 6 (Lower bound for PAC learning; informal version of Proposition H.2). For any model class $\mathcal{M}$, $T \in \mathbb{Z}_{\ge 1}$, and any algorithm $\mathfrak{A}$, there exists an $M^\star \in \mathcal{M}$ such that
$$\mathbb{E}_{M^\star, \mathfrak{A}}[\mathrm{SubOpt}] \ge c_0 \cdot \max_{\gamma > 0} \sup_{M \in \mathcal{M}} \mathrm{edec}_\gamma\big(\mathcal{M}^\infty_{\varepsilon_\gamma}(M), M\big),$$
where $c_0 > 0$ is an absolute constant, and $\mathcal{M}^\infty_{\varepsilon_\gamma}(M)$ denotes a certain localized subset of $\mathcal{M}$ around $M$ with radius $\varepsilon_\gamma \asymp \gamma/T$ (formal definition in (45)).

The upper and lower bounds in Theorem 5 and Proposition 6 together show that the EDEC governs the complexity of PAC learning, similar to the DEC for no-regret learning (Foster et al., 2021). Proposition 6 can be used to establish PAC lower bounds for concrete problems: for example, for tabular MDPs, we show in Proposition H.3 that $\sup_{M \in \mathcal{M}} \mathrm{edec}_\gamma(\mathcal{M}^\infty_\varepsilon(M), M) \gtrsim \min\{1, HSA/\gamma\}$ as long as $\varepsilon \gtrsim HSA/\gamma$, which, when plugged into Proposition 6, recovers the known $\Omega(\sqrt{HSA/T})$ PAC lower bound for tabular MDPs with $\sum_h r_h \in [0,1]$ (Domingues et al., 2021). This implies (and is slightly stronger than) the $\Omega(\sqrt{HSAT})$ regret lower bound for the same problem implied by the DEC (Foster et al., 2021, Section 5.2.4), as no-regret learning is at least as hard as PAC learning.

Relationship between DEC and EDEC. As the definition of the EDEC takes the infimum over a larger set than that of the DEC, we directly have $\mathrm{edec}_\gamma(\mathcal{M}, \mu) \le \mathrm{dec}_\gamma(\mathcal{M}, \mu)$ for any $\mathcal{M}$ and $\mu \in \Delta(\mathcal{M})$. The following shows that the converse also holds in an approximate sense (proof in Appendix H.3).

Proposition 7 (Relationship between DEC and EDEC).
For any $(\alpha, \gamma) \in (0,1) \times \mathbb{R}_{>0}$ and $\mu \in \Delta(\mathcal{M})$, we have $\mathrm{dec}_\gamma(\mathcal{M}, \mu) \le \alpha + (1-\alpha)\, \mathrm{edec}_{\gamma\alpha/(1-\alpha)}(\mathcal{M}, \mu)$, and thus
$$\mathrm{edec}_\gamma(\mathcal{M}) \overset{(i)}{\le} \mathrm{dec}_\gamma(\mathcal{M}) \overset{(ii)}{\le} \inf_{\alpha > 0}\big\{\alpha + (1-\alpha)\, \mathrm{edec}_{\gamma\alpha/(1-\alpha)}(\mathcal{M})\big\}.$$
Bound (i) asserts that any problem with a bounded DEC enjoys the same bound on the EDEC, on which EXPLORATIVE E2D achieves sample complexity no worse than that of E2D-TA (Theorems 5 & 2). On the other hand, the converse bound (ii) is in general a lossy conversion: for a class with low EDEC, the implied DEC bound yields a slightly worse rate, similar to the standard explore-then-commit conversion from PAC to no-regret learning (cf. Appendix H.3.1 for detailed discussions). Indeed, there exist problems for which the current best sample complexity through no-regret learning and bounding the DEC is $\widetilde{O}(1/\varepsilon^3)$, whereas PAC learning through bounding the EDEC gives a tighter $\widetilde{O}(1/\varepsilon^2)$ (cf. Proposition 12 and the discussions thereafter).
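The loss in conversion (ii) can be made concrete via the standard explore-then-commit calculation (a textbook argument, not specific to this paper), assuming a PAC rate of $\varepsilon(N) \asymp \sqrt{d/N}$:

```latex
% Run the PAC algorithm for N episodes, obtaining a policy with suboptimality
% \varepsilon(N) \lesssim \sqrt{d/N}; commit to it for the remaining T - N episodes.
\mathrm{Reg}_{\mathrm{DM}}
  \;\lesssim\; \underbrace{N}_{\text{exploration}} \,+\, (T-N)\,\varepsilon(N)
  \;\lesssim\; N + T\sqrt{d/N}.
% Optimizing N \asymp d^{1/3} T^{2/3} gives
% \mathrm{Reg}_{\mathrm{DM}} \lesssim d^{1/3} T^{2/3},
% i.e. an \widetilde{O}(1/\varepsilon^3) sample complexity for no-regret learning,
% versus the \widetilde{O}(d/\varepsilon^2) episodes needed for PAC learning alone.
```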

4.2. REWARD-FREE LEARNING VIA REWARD-FREE DEC

In reward-free RL (Jin et al., 2020a), the goal is to optimally explore the environment without observing reward information, so that after the exploration phase, a near-optimal policy for any given reward can be computed from the collected trajectory data alone, without further interaction with the environment. We define the following Reward-Free DEC (RFDEC) to capture the complexity of reward-free learning. Let $\mathcal{R}$ denote a set of mean reward functions, $\mathcal{P}$ denote a set of transition dynamics, and $\mathcal{M} := \mathcal{P} \times \mathcal{R}$ denote the class of all possible models specified by $M = (P, R) \in \mathcal{P} \times \mathcal{R}$. We assume the true transition dynamics satisfy $P^\star \in \mathcal{P}$.

Definition 8 (Reward-Free DEC). The Reward-Free Decision-Estimation Coefficient (RFDEC) of model class $\mathcal{M} = \mathcal{P} \times \mathcal{R}$ with respect to $\mu \in \Delta(\mathcal{P})$ and parameter $\gamma > 0$ is defined as
$$\mathrm{rfdec}_\gamma(\mathcal{M}, \mu) := \inf_{p_{\mathrm{exp}} \in \Delta(\Pi)} \sup_{R \in \mathcal{R}} \inf_{p_{\mathrm{out}} \in \Delta(\Pi)} \sup_{P \in \mathcal{P}} \Big\{\mathbb{E}_{\pi \sim p_{\mathrm{out}}}\big[f^{P,R}(\pi_{P,R}) - f^{P,R}(\pi)\big] - \gamma\, \mathbb{E}_{\pi \sim p_{\mathrm{exp}}}\, \mathbb{E}_{\overline{P} \sim \mu}\big[D^2_{\mathrm{H}}(P(\pi), \overline{P}(\pi))\big]\Big\}.$$
Further, define $\mathrm{rfdec}_\gamma(\mathcal{M}) := \sup_{\mu \in \Delta(\mathcal{P})} \mathrm{rfdec}_\gamma(\mathcal{M}, \mu)$.

The RFDEC can be viewed as a modification of the EDEC, where we further insert a $\sup_{R \in \mathcal{R}}$ to reflect that we care about the complexity of learning any reward $R \in \mathcal{R}$, and use $D^2_{\mathrm{H}}(P(\pi), \overline{P}(\pi))$ as the divergence to reflect that we observe only the state-action trajectories $o^t$ and not the rewards.

Algorithm and theoretical guarantee. Our algorithm REWARD-FREE E2D (full description in Algorithm 8) is an adaptation of EXPLORATIVE E2D to the reward-free setting, and works in two phases. In the exploration phase, we find $p^t_{\mathrm{exp}} \in \Delta(\Pi)$ minimizing the sup-risk $\sup_{R \in \mathcal{R}} \widehat{V}^{\mu^t}_{\mathrm{rf},\gamma}(\cdot, R)$ in the $t$-th episode, where
$$\widehat{V}^{\mu^t}_{\mathrm{rf},\gamma}(p_{\mathrm{exp}}, R) := \inf_{p_{\mathrm{out}}} \sup_{P \in \mathcal{P}} \mathbb{E}_{\pi \sim p_{\mathrm{out}}}\big[f^{P,R}(\pi_{P,R}) - f^{P,R}(\pi)\big] - \gamma\, \mathbb{E}_{\pi \sim p_{\mathrm{exp}}}\, \mathbb{E}_{\widehat{P}^t \sim \mu^t}\big[D^2_{\mathrm{H}}(P(\pi), \widehat{P}^t(\pi))\big].$$
Then, in the planning phase, for any given reward $R^\star \in \mathcal{R}$, we compute $p^t_{\mathrm{out}}(R^\star)$ as the argmin of the $\inf_{p_{\mathrm{out}}}$ in $\widehat{V}^{\mu^t}_{\mathrm{rf},\gamma}(p^t_{\mathrm{exp}}, R^\star)$, and output the average policy distribution $\widehat{p}_{\mathrm{out}}(R^\star) := \frac{1}{T}\sum_{t=1}^T p^t_{\mathrm{out}}(R^\star)$.
Theorem 9 (Reward-Free E2D). Algorithm 8 achieves the following with probability at least $1 - \delta$:
$$\mathrm{SubOpt}_{\mathrm{rf}} := \sup_{R^\star \in \mathcal{R}}\Big\{f^{P^\star,R^\star}(\pi_{P^\star,R^\star}) - \mathbb{E}_{\pi \sim \widehat{p}_{\mathrm{out}}(R^\star)}\big[f^{P^\star,R^\star}(\pi)\big]\Big\} \le \mathrm{rfdec}_\gamma(\mathcal{M}) + \frac{3\gamma \log(|\mathcal{P}|/\delta)}{T}.$$
The proof can be found in Appendix I.2. For problems with $\mathrm{rfdec}_\gamma(\mathcal{M}) \lesssim \widetilde{O}(d/\gamma)$, by tuning $\gamma > 0$ Theorem 9 achieves $\mathrm{SubOpt}_{\mathrm{rf}} \le \varepsilon$ within $\widetilde{O}(d \log|\mathcal{P}|/\varepsilon^2)$ episodes of play. The only known comparably general guarantee for reward-free RL is the recently proposed RFOlive algorithm of Chen et al. (2022), which achieves sample complexity $\widetilde{O}(\mathrm{poly}(H) \cdot d^2_{\mathrm{BE}} \log(|\mathcal{F}||\mathcal{R}|)/\varepsilon^2)$ in the model-free setting. Theorem 9 can be seen as a generalization of this result to the model-based setting, with a more general form of structural condition (the RFDEC). Further, our guarantee does not depend on the statistical complexity (e.g. log-cardinality) of $\mathcal{R}$ once we assume a bounded RFDEC.

Lower bound. Similar to the EDEC for PAC learning, the RFDEC also gives the following lower bound for reward-free learning. For any $\mathcal{M} = \mathcal{P} \times \mathcal{R}$ and $P \in \mathcal{P}$, define the shorthand $\mathrm{rfdec}_\gamma(\mathcal{M}, P) := \mathrm{rfdec}_\gamma(\mathcal{M}, \delta_P)$, where $\delta_P$ denotes the point mass at $P$.

Proposition 10 (Reward-free lower bound; informal version of Proposition I.2). For any model class $\mathcal{M} = \mathcal{P} \times \mathcal{R}$, $T \in \mathbb{Z}_{\ge 1}$, and any algorithm $\mathfrak{A}$, there exists a $P^\star \in \mathcal{P}$ such that
$$\mathbb{E}_{P^\star, \mathfrak{A}}[\mathrm{SubOpt}_{\mathrm{rf}}] \ge c_0 \cdot \max_{\gamma > 0} \sup_{P \in \mathcal{P}} \mathrm{rfdec}_\gamma\big(\mathcal{M}^{\infty,\mathrm{rf}}_{\varepsilon_\gamma}(P), P\big),$$
where $c_0 > 0$ is an absolute constant, and $\mathcal{M}^{\infty,\mathrm{rf}}_{\varepsilon_\gamma}(P)$ denotes a certain localized subset of $\mathcal{M}$ around $P$ with radius $\varepsilon_\gamma \asymp \gamma/T$ (formal definition in (50)).

5. INSTANTIATION: RL WITH BELLMAN REPRESENTABILITY

In this section, we instantiate our theory to bound the three DEC variants and give unified sample-efficient algorithms for a broad class of problems: RL with low-complexity Bellman representations (Foster et al., 2021). Consequently, our algorithms recover existing and obtain new sample complexity results on a wide range of concrete RL problems.

Definition 11 (Bellman Representation). The Bellman representation of $(\mathcal{M}, \overline{M})$ is a collection of function classes $(\mathcal{G}^{\overline{M}}_h := \{g^{M;\overline{M}}_h : \mathcal{M} \to [-1,1]\})_{h \in [H]}$ such that:
(a) For all $M \in \mathcal{M}$, $\big|\mathbb{E}^{\overline{M}, \pi_M}\big[Q^{M,\pi_M}_h(s_h, a_h) - r_h - V^{M,\pi_M}_{h+1}(s_{h+1})\big]\big| \le \big|g^{M;\overline{M}}_h(M)\big|$.
(b) There exists a family of estimation policies $\{\pi^{\mathrm{est}}_{M,h}\}_{M \in \mathcal{M}, h \in [H]}$ and a constant $L \ge 1$ such that for all $M, M' \in \mathcal{M}$, $\big|g^{M';\overline{M}}_h(M)\big| \le L \cdot D_{\mathrm{RL}}\big(\overline{M}(\pi^{\mathrm{est}}_{M,h}), M'(\pi^{\mathrm{est}}_{M,h})\big)$.
We say $\mathcal{M}$ satisfies Bellman representability with Bellman representation $\mathcal{G} := (\mathcal{G}^{\overline{M}}_h)_{\overline{M} \in \mathcal{M}, h \in [H]}$ if $(\mathcal{M}, \overline{M})$ admits a Bellman representation $(\mathcal{G}^{\overline{M}}_h)_{h \in [H]}$ for all $\overline{M} \in \mathcal{M}$.

It is shown in Foster et al. (2021) that problems admitting a low-complexity Bellman representation $\mathcal{G}$ (e.g. linear or of low Eluder dimension) include tractable subclasses such as (model-based versions of) Bilinear Classes (Du et al., 2021) and problems with low Bellman-Eluder dimension (Jin et al., 2021). We show that Bellman representability with a low-complexity $\mathcal{G}$ implies bounded DEC/EDEC/RFDEC, which in turn leads to concrete rates using our E2D algorithms in Sections 3 and 4. et al. (2021, Theorem 7.1 & F.2), which uses the conversion to the DEC and results in an $\widetilde{O}(1/\varepsilon^3)$ sample complexity. The no-regret result in (1) also improves over the result of Foster et al. (2021) as a unified algorithm that does not require proper online estimation oracles, which are often problem-specific. We remark that a similar bound as (1) also holds for $\mathrm{psc}_\gamma$ (cf. Definition E.1), and for $\mathrm{mlec}_\gamma$ (cf. Definition E.4) assuming bounded Eluder dimension $e(\mathcal{G}^{\overline{M}}_h, \Delta) \le \widetilde{O}(d_e)$; thus the MOPS and OMLE algorithms achieve regret bounds similar to those of E2D-TA (cf. Appendix K.1).

Examples. Proposition 12 can be specialized to a wide range of concrete RL problems, for which we give a few illustrative examples here (problem definitions and proofs in Appendix K.3). We emphasize that, apart from feeding the different model classes $\mathcal{M}$ into the Tempered Aggregation subroutine, the rates below (for each learning goal) are obtained through a single unified algorithm without further problem-dependent designs.

Example 13. For tabular MDPs, E2D-TA achieves regret $\mathrm{Reg}_{\mathrm{DM}} \le \widetilde{O}(\sqrt{S^3 A^2 H^3 T})$, and REWARD-FREE E2D achieves reward-free guarantee $\mathrm{SubOpt}_{\mathrm{rf}} \le \widetilde{O}(\sqrt{S^3 A^2 H^3 / T})$. ♢

Both rates are worse than the optimal $\widetilde{O}(\sqrt{HSAT})$ regret bound (Azar et al., 2017) and the $\widetilde{O}(\sqrt{\mathrm{poly}(H) S^2 A / T})$ reward-free bound (Jin et al., 2020a). However, our rates are obtained through unified algorithms that are completely agnostic to the tabular structure.

Example 14. For linear mixture MDPs (Ayoub et al., 2020) Our PAC result matches the best known sample complexity achieved by e.g. the V-Type Golf algorithm of Jin et al. (2021). For reward-free learning, our linear-in-$d$ dependence improves over the current best $d^2$ dependence achieved by the RFOlive algorithm (Chen et al., 2022), and we do not require the linearity or low-complexity assumptions on the class of reward functions $\mathcal{R}$ made in existing work (Wang et al., 2020a; Chen et al., 2022). However, we remark that they handle a slightly more general setting where only the $\Phi$ class is known, due to their model-free approach. An important subclass of low-rank MDPs is Block MDPs (Du et al., 2019)

6. CONCLUSION

This paper proposes unified sample-efficient algorithms for no-regret, PAC, and reward-free reinforcement learning, by developing new complexity measures and stronger algorithms within the DEC framework. We believe our work opens up many important questions, such as developing model-free analogs of this framework, extending to other learning goals (such as multi-agent RL), and the computational efficiency of our algorithms.

function approximation (Wang et al., 2021), Block MDPs (Du et al., 2019; Misra et al., 2020), and so on. More general structural conditions and algorithms have been studied (Russo & Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020b) and later unified by frameworks such as Bilinear Classes (Du et al., 2021) and Bellman-Eluder dimension (Jin et al., 2021). Foster et al. (2021) propose the DEC as a complexity measure for interactive decision making problems, and develop the E2D meta-algorithm as a general model-based algorithm for problems within their DMSO framework (which covers bandits and RL). The DEC framework is further generalized in Foster et al. (2022) to capture adversarial decision making problems. The DEC has close connections to the modulus of continuity (Donoho & Liu, 1987; 1991a;b), the information ratio (Russo & Van Roy, 2016; 2018; Lattimore & Gyorgy, 2021), and Exploration-by-Optimization (Lattimore & Szepesvári, 2020). Our work builds on and extends the DEC framework: we propose the E2D-TA algorithm as a general and strong instantiation of the E2D meta-algorithm, and generalize the DEC to capture PAC and reward-free learning.

Other general algorithms. Posterior sampling (or Thompson sampling) is another general-purpose algorithm for interactive decision making (Thompson, 1933; Russo, 2019; Agrawal & Jia, 2017; Zanette et al., 2020a; Zhang, 2022; Agarwal & Zhang, 2022a;b).
Frequentist regret bounds for posterior sampling have been established for tabular MDPs (Agrawal & Jia, 2017; Russo, 2019) and linear MDPs (Russo, 2019; Zanette et al., 2020a). Zhang (2022) proves regret bounds for a posterior sampling algorithm in RL with general function approximation, which is then generalized in Agarwal & Zhang (2022a;b). Our Appendix E.1 discusses the connection between the MOPS algorithm of Agarwal & Zhang (2022a) and E2D-TA. The OMLE (Optimistic Maximum Likelihood Estimation) algorithm is studied in (Liu et al., 2022a;b) for Partially Observable Markov Decision Processes; however, the algorithm itself is general and can be used for any problem within the DMSO framework; we provide such a generalization and discuss the connections in Appendix E.2. Maximum-likelihood-based algorithms for RL are also studied in (Mete et al., 2021; Agarwal et al., 2020; Uehara et al., 2021).

Reward-free RL The reward-free learning framework was proposed by Jin et al. (2020a) and is well-studied in both tabular and function approximation settings (Jin et al., 2020a; Zhang et al., 2020a; Kaufmann et al., 2021; Ménard et al., 2021; Wang et al., 2020a; Zanette et al., 2020c; Agarwal et al., 2020; Liu et al., 2021; Modi et al., 2021; Zhang et al., 2021a;b; Qiu et al., 2021; Wagenmaker et al., 2022). The recent work of Chen et al. (2022) proposes a general algorithm for problems with low (reward-free versions of the) Bellman-Eluder dimension. Our Reward-Free DEC framework generalizes many of these results by offering a unified structural condition and algorithm for reward-free RL with a model class.
Other problems covered by DMSO Besides multi-armed bandits and RL, the DMSO framework of (Foster et al., 2021) (and thus all our theories as well) can handle other problems such as contextual bandits (Auer et al., 2002; Langford & Zhang, 2007; Chu et al., 2011; Beygelzimer et al., 2011; Agarwal et al., 2014; Foster & Rakhlin, 2020; Foster et al., 2020) , contextual reinforcement learning (Abbasi-Yadkori & Neu, 2014; Modi et al., 2018; Dann et al., 2019; Modi & Tewari, 2020) , online convex bandits (Kleinberg, 2004; Bubeck et al., 2015; Bubeck & Eldan, 2016; Lattimore, 2020) , and non-parametric bandits (Kleinberg, 2004; Auer et al., 2007; Kleinberg et al., 2013) . Instantiating our theories to these settings would be an interesting direction for future work.

B TECHNICAL TOOLS B.1 STRONG DUALITY

The following strong duality result for variational forms of bilinear functions is standard; it can be extracted, e.g., from the proof of Foster et al. (2021, Proposition 4.2).

Theorem B.1 (Strong duality). Suppose that $\mathcal{X}, \mathcal{Y}$ are two topological spaces, such that $\mathcal{X}$ is Hausdorff and $\mathcal{Y}$ is finite (with the discrete topology). Then for a bi-continuous function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that is uniformly bounded, it holds that
\[
\sup_{X \in \Delta(\mathcal{X})} \inf_{Y \in \Delta(\mathcal{Y})} \mathbb{E}_{x \sim X} \mathbb{E}_{y \sim Y}\big[ f(x, y) \big] = \inf_{Y \in \Delta(\mathcal{Y})} \sup_{X \in \Delta(\mathcal{X})} \mathbb{E}_{x \sim X} \mathbb{E}_{y \sim Y}\big[ f(x, y) \big].
\]
In this paper, for most applications of Theorem B.1, we take $\mathcal{X} = \mathcal{M}$ and $\mathcal{Y} = \Pi$. We will assume that $\Pi$ is finite, which is a natural assumption: for example, in tabular MDPs it suffices to consider deterministic Markov policies, of which there are only finitely many. The finiteness assumption in Theorem B.1 can also be relaxed: strong duality holds as long as both $\mathcal{X}$ and $\mathcal{Y}$ are Hausdorff and the function class $\{ f(x, \cdot) : \mathcal{Y} \to \mathbb{R} \}_{x \in \mathcal{X}}$ admits a finite $\rho$-covering for all $\rho > 0$. Such a relaxed assumption is always satisfied in our applications.
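As a quick numerical sanity check of the minimax identity (illustrative only, not part of the paper), one can verify $\sup_X \inf_Y = \inf_Y \sup_X$ on a small bilinear game over distributions. The payoff matrix below is a made-up example; since the expectation is linear in each mixed strategy, the inner optimum is always attained at a vertex, which the grid search exploits.

```python
# Payoff f(x, y) on finite X = {0, 1}, Y = {0, 1} (matching-pennies-style game).
F = [[1.0, 0.0],
     [0.0, 1.0]]

def value(p, q):
    # Expected payoff E_{x~p, y~q}[f(x, y)] for mixed strategies p, q.
    return sum(p[i] * q[j] * F[i][j] for i in range(2) for j in range(2))

grid = [k / 1000 for k in range(1001)]
# For a fixed mixed strategy of one player, the inner optimum over the other
# player's *distributions* is attained at a vertex (linearity in q resp. p).
sup_inf = max(min(value((p, 1 - p), q) for q in [(1, 0), (0, 1)]) for p in grid)
inf_sup = min(max(value(p, (q, 1 - q)) for p in [(1, 0), (0, 1)]) for q in grid)

print(sup_inf, inf_sup)  # both equal the game value 0.5
```

Both quantities coincide (at the value $1/2$ of this game), as Theorem B.1 predicts for finite $\mathcal{X}, \mathcal{Y}$.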

B.2 CONCENTRATION INEQUALITIES

We will use the following standard concentration inequality in the paper.

Lemma B.2 (Foster et al. (2021, Lemma A.4)). For any sequence of real-valued random variables $(X_t)_{t \le T}$ adapted to a filtration $(\mathcal{F}_t)_{t \le T}$, it holds that with probability at least $1 - \delta$, for all $t \le T$,
\[
\sum_{s=1}^{t} -\log \mathbb{E}\big[ \exp(-X_s) \mid \mathcal{F}_{s-1} \big] \le \sum_{s=1}^{t} X_s + \log(\delta^{-1}).
\]

B.3 PROPERTIES OF THE HELLINGER DISTANCE

Recall that for two distributions $P, Q$ that are absolutely continuous with respect to $\mu$, their squared Hellinger distance is defined as
\[
D_{\mathrm{H}}^2(P, Q) := \int \Big( \sqrt{dP/d\mu} - \sqrt{dQ/d\mu} \Big)^2 d\mu.
\]
We will use the following properties of the Hellinger distance.

Lemma B.3 (Foster et al. (2021, Lemma A.11, A.12)). For distributions $P, Q$ defined on $\mathcal{X}$ and a function $h : \mathcal{X} \to [0, R]$, we have
\[
\mathbb{E}_{X \sim P}\big[ h(X) \big] \le 3\, \mathbb{E}_{X \sim Q}\big[ h(X) \big] + 2R \cdot D_{\mathrm{H}}^2(P, Q).
\]

Lemma B.4. For any pair of random variables $(X, Y)$, it holds that
\[
\mathbb{E}_{X \sim P_X}\big[ D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big) \big] \le 2 D_{\mathrm{H}}^2\big( P_{X,Y}, Q_{X,Y} \big).
\]
Conversely, it holds that
\[
D_{\mathrm{H}}^2\big( P_{X,Y}, Q_{X,Y} \big) \le 3 D_{\mathrm{H}}^2\big( P_X, Q_X \big) + 2\, \mathbb{E}_{X \sim P_X}\big[ D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big) \big].
\]

Proof. Throughout the proof, we slightly abuse notation and write a distribution $P$ and its density $dP/d\mu$ interchangeably. By the definition of the Hellinger distance, we have
\[
\begin{aligned}
\tfrac12 D_{\mathrm{H}}^2\big( P_{X,Y}, Q_{X,Y} \big)
&= 1 - \int \sqrt{P_{X,Y}} \sqrt{Q_{X,Y}}
= 1 - \int \sqrt{P_X Q_X} \sqrt{P_{Y|X}} \sqrt{Q_{Y|X}}
\ge 1 - \int \frac{P_X + Q_X}{2} \sqrt{P_{Y|X}} \sqrt{Q_{Y|X}} \\
&= \int \frac{P_X + Q_X}{2} \Big( 1 - \sqrt{P_{Y|X}} \sqrt{Q_{Y|X}} \Big)
= \tfrac14 \mathbb{E}_{X \sim P_X}\big[ D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big) \big] + \tfrac14 \mathbb{E}_{X \sim Q_X}\big[ D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big) \big].
\end{aligned}
\]
Similarly,
\[
\tfrac12 D_{\mathrm{H}}^2\big( P_{X,Y}, Q_{X,Y} \big)
= 1 - \int \sqrt{P_X Q_X} + \int \sqrt{P_X Q_X} \Big( 1 - \sqrt{P_{Y|X} Q_{Y|X}} \Big)
\le \tfrac12 D_{\mathrm{H}}^2\big( P_X, Q_X \big) + \int \frac{P_X + Q_X}{2} \cdot \tfrac12 D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big),
\]
and hence
\[
D_{\mathrm{H}}^2\big( P_{X,Y}, Q_{X,Y} \big)
\le D_{\mathrm{H}}^2\big( P_X, Q_X \big) + \tfrac12 \mathbb{E}_{X \sim P_X}\big[ D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big) \big] + \tfrac12 \mathbb{E}_{X \sim Q_X}\big[ D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big) \big]
\le 3 D_{\mathrm{H}}^2\big( P_X, Q_X \big) + 2\, \mathbb{E}_{X \sim P_X}\big[ D_{\mathrm{H}}^2\big( P_{Y|X}, Q_{Y|X} \big) \big],
\]
where the last inequality is due to Lemma B.3 and $D_{\mathrm{H}}^2 \in [0, 2]$.

Next, recall the divergence $D_{\mathrm{RL}}^2$ defined in (1):
\[
D_{\mathrm{RL}}^2\big( M(\pi), \overline{M}(\pi) \big) = D_{\mathrm{H}}^2\big( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) \big) + \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\Big[ \big\| R^M(o) - R^{\overline{M}}(o) \big\|_2^2 \Big].
\]

Proposition B.5. Recall that $(o, r) \sim M(\pi)$ is the observation and reward vector as described in Section 2, with $o \sim \mathbb{P}^M(\pi)$ and $r \sim R^M(\cdot \mid o)$. Suppose that $r \in [0, 1]^H$ almost surely and $\| R^M(o) - R^{\overline{M}}(o) \|_2^2 \le 2$ for all $o \in \mathcal{O}$. Then it holds that
\[
D_{\mathrm{RL}}^2\big( M(\pi), \overline{M}(\pi) \big) \le 5 D_{\mathrm{H}}^2\big( M(\pi), \overline{M}(\pi) \big),
\]
where $D_{\mathrm{H}}^2( M(\pi), \overline{M}(\pi) )$ is the standard squared Hellinger distance between the joint laws $R^M \otimes \mathbb{P}^M(\pi)$ and $R^{\overline{M}} \otimes \mathbb{P}^{\overline{M}}(\pi)$.

Proof.
To prove this proposition, we need to bound $\| R^M(o) - R^{\overline{M}}(o) \|_2^2$ in terms of $D_{\mathrm{H}}^2( R^M(o), R^{\overline{M}}(o) )$. Denote by $R^M_h(o)$ the distribution of $r_h$. Then by independence, we have
\[
1 - \tfrac12 D_{\mathrm{H}}^2\big( R^M(o), R^{\overline{M}}(o) \big)
= \prod_h \Big( 1 - \tfrac12 D_{\mathrm{H}}^2\big( R^M_h(o), R^{\overline{M}}_h(o) \big) \Big)
\le \prod_h \Big( 1 - \tfrac12 D_{\mathrm{TV}}^2\big( R^M_h(o), R^{\overline{M}}_h(o) \big) \Big)
\le \prod_h \Big( 1 - \tfrac12 \big| R^M_h(o) - R^{\overline{M}}_h(o) \big|^2 \Big)
\le \exp\Big( -\tfrac12 \big\| R^M(o) - R^{\overline{M}}(o) \big\|_2^2 \Big)
\le 1 - \tfrac14 \big\| R^M(o) - R^{\overline{M}}(o) \big\|_2^2,
\]
where the last inequality uses the fact that $e^{-x} \le 1 - x/2$ for all $x \in [0, 1]$. Then by Lemma B.4,
\[
\mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\Big[ \big\| R^M(o) - R^{\overline{M}}(o) \big\|_2^2 \Big]
\le 2\, \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\Big[ D_{\mathrm{H}}^2\big( R^M(o), R^{\overline{M}}(o) \big) \Big]
\le 4 D_{\mathrm{H}}^2\big( M(\pi), \overline{M}(\pi) \big).
\]
Combining the above estimate with the fact that $D_{\mathrm{H}}^2( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) ) \le D_{\mathrm{H}}^2( M(\pi), \overline{M}(\pi) )$ (data-processing inequality) completes the proof.

The following lemma shows that, although $D_{\mathrm{RL}}^2$ is not symmetric with respect to its two arguments (due to the expectation over $o \sim \mathbb{P}^M(\pi)$ in the second term), it is almost symmetric up to a constant multiplicative factor:

Lemma B.6. For any two models $M, \overline{M}$ and any policy $\pi$, we have $D_{\mathrm{RL}}^2( \overline{M}(\pi), M(\pi) ) \le 5 D_{\mathrm{RL}}^2( M(\pi), \overline{M}(\pi) )$.

Proof. For any function $h : \mathcal{O} \to [0, 2]$, by Lemma B.3 we have
\[
\mathbb{E}_{o \sim \mathbb{P}^{\overline{M}}(\pi)}\big[ h(o) \big] \le 3\, \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[ h(o) \big] + 4 D_{\mathrm{H}}^2\big( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) \big).
\]
Therefore, taking $h$ as $h(o) := \| R^M(o) - R^{\overline{M}}(o) \|_2^2$, the bound above gives
\[
\underbrace{D_{\mathrm{H}}^2\big( \mathbb{P}^{\overline{M}}(\pi), \mathbb{P}^M(\pi) \big) + \mathbb{E}_{o \sim \mathbb{P}^{\overline{M}}(\pi)}\big[ h(o) \big]}_{D_{\mathrm{RL}}^2( \overline{M}(\pi), M(\pi) )}
\le \underbrace{5 D_{\mathrm{H}}^2\big( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) \big) + 5\, \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[ h(o) \big]}_{= 5 D_{\mathrm{RL}}^2( M(\pi), \overline{M}(\pi) )},
\]
which is the desired result.

Lemma B.7. For any two models $M, \overline{M}$ and any policy $\pi$, we have
\[
\big| f^M(\pi) - f^{\overline{M}}(\pi) \big| \le \sqrt{H + 1} \cdot D_{\mathrm{RL}}\big( M(\pi), \overline{M}(\pi) \big).
\]

Proof. We have
\[
\begin{aligned}
\big| f^M(\pi) - f^{\overline{M}}(\pi) \big|
&= \Big| \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[ R^M(o) \big] - \mathbb{E}_{o \sim \mathbb{P}^{\overline{M}}(\pi)}\big[ R^{\overline{M}}(o) \big] \Big| \\
&\le \Big| \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[ R^M(o) - R^{\overline{M}}(o) \big] \Big| + \Big| \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[ R^{\overline{M}}(o) \big] - \mathbb{E}_{o \sim \mathbb{P}^{\overline{M}}(\pi)}\big[ R^{\overline{M}}(o) \big] \Big| \\
&\overset{(i)}{\le} \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\Big[ \sqrt{H} \big\| R^M(o) - R^{\overline{M}}(o) \big\|_2 \Big] + D_{\mathrm{H}}\big( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) \big) \\
&\overset{(ii)}{\le} \sqrt{ (H + 1) \Big( \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[ \big\| R^M(o) - R^{\overline{M}}(o) \big\|_2^2 \big] + D_{\mathrm{H}}^2\big( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) \big) \Big) }
= \sqrt{H + 1} \cdot D_{\mathrm{RL}}\big( M(\pi), \overline{M}(\pi) \big).
\end{aligned}
\]
Above, (i) uses the fact that $R^{\overline{M}}(o) \in [0, 1]$ almost surely, together with the bound
\[
\Big| \mathbb{E}_{o \sim \mathbb{P}^M(\pi)}\big[ R^{\overline{M}}(o) \big] - \mathbb{E}_{o \sim \mathbb{P}^{\overline{M}}(\pi)}\big[ R^{\overline{M}}(o) \big] \Big| \le D_{\mathrm{TV}}\big( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) \big) \le D_{\mathrm{H}}\big( \mathbb{P}^M(\pi), \mathbb{P}^{\overline{M}}(\pi) \big);
\]
(ii) uses the Cauchy-Schwarz inequality $\sqrt{H} a + b \le \sqrt{(H+1)(a^2 + b^2)}$ and the fact that the squared mean is upper bounded by the second moment.
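The two inequalities of Lemma B.4 can be checked numerically on random discrete distributions (an illustrative sanity check, not part of the paper; the joint distributions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def hellinger_sq(p, q):
    # Squared Hellinger distance between two discrete distributions.
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def check_once():
    # Random joint distributions P, Q over a 4 x 5 product space (X, Y).
    P = rng.random((4, 5)); P /= P.sum()
    Q = rng.random((4, 5)); Q /= Q.sum()
    Px, Qx = P.sum(axis=1), Q.sum(axis=1)
    joint = hellinger_sq(P.ravel(), Q.ravel())
    marg = hellinger_sq(Px, Qx)
    # Conditional terms D_H^2(P_{Y|X=x}, Q_{Y|X=x}), weighted by P_X.
    cond = np.array([hellinger_sq(P[x] / Px[x], Q[x] / Qx[x]) for x in range(4)])
    e_cond = float(Px @ cond)
    assert e_cond <= 2 * joint + 1e-9              # first inequality of Lemma B.4
    assert joint <= 3 * marg + 2 * e_cond + 1e-9   # second inequality of Lemma B.4

for _ in range(100):
    check_once()
print("Lemma B.4 bounds hold on random instances")
```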

C DISCUSSIONS ABOUT DEC DEFINITIONS AND AGGREGATION ALGORITHMS C.1 DEC DEFINITIONS

Here we discuss the differences between the DEC definitions used within our E2D-TA and within the E2D algorithm of Foster et al. (2021, Section 4.1), which uses Vovk's aggregating algorithm as its subroutine (henceforth E2D-VA). Recall that the regret bound of E2D-TA scales with $\mathrm{dec}_\gamma(\mathcal{M})$ defined in Definition 1 (cf. Theorem 2). We first remark that all the following DECs considered in Foster et al. (2021) are defined in terms of the squared Hellinger distance $D_{\mathrm{H}}^2( M(\pi), \overline{M}(\pi) )$ between the full distributions of $(o, r)$ induced by models $M$ and $\overline{M}$ under $\pi$, instead of our $D_{\mathrm{RL}}^2$, which uses the squared Hellinger distance in $o$ and the squared $L_2$ loss in (the mean of) $r$. However, all these results hold for $D_{\mathrm{RL}}^2$ as well, with the DEC definitions and algorithms changed correspondingly. For clarity, we state their results in terms of $D_{\mathrm{RL}}^2$, which does not affect the essence of the comparisons.

Foster et al. (2021, Theorems 3.3 & 4.1) show that E2D-VA achieves the following regret bound with probability at least $1 - \delta$:
\[
\mathrm{Reg}_{\mathrm{DM}} \lesssim O\Big( T \cdot \sup_{\mu \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}; \delta_\mu) + \gamma \log(|\mathcal{M}|/\delta) \Big),
\]
where
\[
\sup_{\mu \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}; \delta_\mu)
= \sup_{\mu \in \mathrm{co}(\mathcal{M})} \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi \sim p}\Big[ f^M(\pi_M) - f^M(\pi) - \gamma D_{\mathrm{RL}}^2\big( M(\pi), \mathbb{E}_{\overline{M} \sim \mu}\big[ \overline{M}(\pi) \big] \big) \Big],
\]
where $\mathbb{E}_{\overline{M} \sim \mu}[ \overline{M}(\pi) ]$ denotes the mixture distribution of $\overline{M}(\pi)$ for $\overline{M} \sim \mu$, and $\mathrm{co}(\mathcal{M})$ denotes the set of all mixtures of models in $\mathcal{M}$ (which can also be identified with $\Delta(\mathcal{M})$, the set of all probability distributions over $\mathcal{M}$). Compared with $\mathrm{dec}_\gamma$, Eq. (8) differs only in the place where the expectation $\mathbb{E}_{\overline{M} \sim \mu}$ is taken. As $D_{\mathrm{RL}}^2$ is convex in its second argument (by convexity of the squared Hellinger distance and linearity of $\mathbb{E}_{\overline{M} \sim \mu} \mathbb{E}_{o \sim \mathbb{P}^{\overline{M}}(\pi)}[ \| R^M(o) - R^{\overline{M}}(o) \|_2^2 ]$ in $\mu$), Jensen's inequality gives
\[
\sup_{\mu \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}; \delta_\mu) \ge \mathrm{dec}_\gamma(\mathcal{M}).
\]
Unfortunately, bounding $\sup_{\mu \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}; \delta_\mu)$ requires handling the information gain with respect to the mixture model $\mathbb{E}_{\overline{M} \sim \mu}[ \overline{M}(\pi) ]$, which is in general much harder than bounding $\mathrm{dec}_\gamma(\mathcal{M})$, which only requires handling the expected information gain with respect to a proper model $\overline{M}(\pi)$ over $\overline{M} \sim \mu$. For general RL problems with $H > 1$, it is unclear whether $\sup_{\mu \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_\gamma(\mathcal{M}; \delta_\mu)$ admits bounds of the form $d/\gamma$, where $d$ is some complexity measure (Foster et al., 2021, Section 7.1.3). By contrast, our $\mathrm{dec}_\gamma(\mathcal{M})$ can be bounded for broad classes of problems (e.g. Section 5).

We also remark that an alternative approach considered in Foster et al. (2021, Theorem 4.1) depends on the following definition of the DEC with respect to deterministic reference models:
\[
\mathrm{dec}_\gamma(\mathcal{M}, \overline{M}) := \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi \sim p}\big[ f^M(\pi_M) - f^M(\pi) - \gamma D_{\mathrm{RL}}^2\big( M(\pi), \overline{M}(\pi) \big) \big],
\]
and the DEC of the model class $\mathcal{M}$ is simply $\overline{\mathrm{dec}}_\gamma(\mathcal{M}) := \sup_{\overline{M} \in \mathcal{M}} \mathrm{dec}_\gamma(\mathcal{M}, \overline{M})$. As $\mathcal{M}$ (viewed as the set of all point masses) is a subset of $\mathrm{co}(\mathcal{M})$, by definition we have $\overline{\mathrm{dec}}_\gamma(\mathcal{M}) \le \mathrm{dec}_\gamma(\mathcal{M})$, so $\overline{\mathrm{dec}}_\gamma(\mathcal{M})$ can be bounded as long as $\mathrm{dec}_\gamma(\mathcal{M})$ can. However, that approach requires the online model estimation subroutine to output a proper estimator $\widehat{M}^t \in \mathcal{M}$ with bounded Hellinger error, which, unlike Tempered Aggregation (which handles the improper case), requires problem-specific designs using prior knowledge about $\mathcal{M}$, and it is unclear how to construct such a subroutine for general $\mathcal{M}$.
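To make the deterministic-reference definition concrete, the quantity $\mathrm{dec}_\gamma(\mathcal{M}, \overline{M})$ can be computed by brute force on a toy problem. The sketch below (illustrative only; the model class, reference model, and the use of the per-arm Bernoulli Hellinger distance as the divergence are our own simplifying assumptions) treats two-armed Bernoulli bandits, where a model is its pair of arm means and $f^M(\pi)$ is the mean of the pulled arm:

```python
import math

# Toy class: two-armed Bernoulli bandits; a model = (mean of arm 0, mean of arm 1).
models = [(0.6, 0.4), (0.4, 0.6)]
M_bar = (0.5, 0.5)  # reference model
gamma = 10.0

def hellinger_sq(a, b):
    # Squared Hellinger distance between Bernoulli(a) and Bernoulli(b).
    return (math.sqrt(a) - math.sqrt(b)) ** 2 + (math.sqrt(1 - a) - math.sqrt(1 - b)) ** 2

def risk(p, M):
    # E_{pi ~ p}[ f^M(pi_M) - f^M(pi) - gamma * D^2(M(pi), Mbar(pi)) ]
    best = max(M)
    return sum(p[a] * (best - M[a] - gamma * hellinger_sq(M[a], M_bar[a])) for a in (0, 1))

# dec_gamma(models, M_bar): inf over p in Delta({0, 1}) (grid) of sup over models.
dec = min(max(risk((q, 1 - q), M) for M in models) for q in [k / 1000 for k in range(1001)])
print(f"dec_gamma ~= {dec:.4f}")
```

By symmetry the optimal $p$ here is uniform; larger $\gamma$ drives the value down, matching the intuition that a stronger information-gain penalty makes exploration cheaper in the DEC sense.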

C.2 AGGREGATION ALGORITHMS AS POSTERIOR COMPUTATIONS

We illustrate that Tempered Aggregation is equivalent to computing the tempered posterior (or power posterior) (Bhattacharya et al., 2019; Alquier & Ridgway, 2020) in the following vanilla Bayesian setting. Consider a model class $\mathcal{M}$ associated with a prior $\mu^1 \in \Delta(\mathcal{M})$, where each model specifies a distribution $\mathbb{P}^M(\cdot) \in \Delta(\mathcal{O})$ over observations $o \in \mathcal{O}$. Suppose we receive observations $o^1, \dots, o^t, \dots$ in a sequential fashion. In this setting, Tempered Aggregation performs the update
\[
\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\big( \eta_p \log \mathbb{P}^M(o^t) \big) = \mu^t(M) \cdot \big( \mathbb{P}^M(o^t) \big)^{\eta_p}.
\]
Therefore, for all $t \ge 1$,
\[
\mu^{t+1}(M) \propto_M \mu^1(M) \cdot \Big( \prod_{s=1}^{t} \mathbb{P}^M(o^s) \Big)^{\eta_p}.
\]
If $\eta_p = 1$ as in Vovk's aggregating algorithm (Vovk, 1995), then by Bayes' rule the above $\mu^{t+1}$ is exactly the posterior of $M$ given $o^{1:t}$. As we choose $\eta_p \le 1/2 < 1$ in Tempered Aggregation, $\mu^{t+1}$ gives the tempered posterior, a slowed-down variant of the posterior in which data likelihoods are weighted less than in the exact posterior.
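The product-form update above can be sketched in a few lines (an illustrative toy, not the paper's algorithm: three made-up Bernoulli models for a coin, uniform prior):

```python
import math

# Three candidate models for a coin's heads-probability, uniform prior.
models = [0.2, 0.5, 0.8]
obs = [1, 1, 0, 1, 1, 1, 0, 1]  # observed coin flips

def tempered_posterior(eta_p):
    # mu^{t+1}(M) proportional to mu^1(M) * (prod_s P^M(o^s))^{eta_p}
    w = [math.prod(p if o else 1 - p for o in obs) ** eta_p for p in models]
    z = sum(w)
    return [wi / z for wi in w]

bayes = tempered_posterior(1.0)      # eta_p = 1: exact Bayes posterior
tempered = tempered_posterior(0.5)   # eta_p = 1/2: tempered (flatter) posterior
print(bayes, tempered)
```

Both posteriors concentrate on the best-fitting model, but the tempered one places strictly less mass on it, reflecting the "data likelihoods are weighted less" behavior described above.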

Algorithm 2 TEMPERED AGGREGATION

Input: Learning rates $\eta_p \in (0, \frac12)$, $\eta_r > 0$.
1: Initialize $\mu^1 \leftarrow \mathrm{Unif}(\mathcal{M})$.
2: for $t = 1, \dots, T$ do
3:   Receive $(\pi^t, o^t, r^t)$.
4:   Update randomized model estimator:
\[
\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\Big( \eta_p \log \mathbb{P}^M(o^t \mid \pi^t) - \eta_r \big\| r^t - R^M(o^t) \big\|_2^2 \Big).
\]

D PROOFS FOR SECTION 3

D.1 TEMPERED AGGREGATION

In this section, we analyze the Tempered Aggregation algorithm for finite model classes. For the sake of both generality and simplicity, we state our results in the following general setup of online model estimation. Lemma 3 (restated in Corollary D.2) then follows as a direct corollary.

Setup: Online model estimation In an online model estimation problem, the learner is given a model set $\mathcal{M}$, a context space $\Pi$, an observation space $\mathcal{O}$, a family of conditional distributions $( \mathbb{P}^M(\cdot \mid \cdot) : \Pi \to \Delta(\mathcal{O}) )_{M \in \mathcal{M}}$, and a family of vector-valued mean reward functions $( R^M : \mathcal{O} \to [0, 1]^H )_{M \in \mathcal{M}}$. The environment fixes a ground-truth model $M^\star \in \mathcal{M}$; as shorthand, let $\mathbb{P}^\star := \mathbb{P}^{M^\star}$ and $R^\star := R^{M^\star}$. For simplicity (in a measure-theoretic sense) we assume that $\mathcal{O}$ is finite. At each step $t \in [T]$, the learner first determines a randomized model estimator (i.e., a distribution over models) $\mu^t \in \Delta(\mathcal{M})$. Then the environment reveals the context $\pi^t \in \Pi$ (which is in general random and may depend on $\mu^t$ and the history), generates the observation $o^t \sim \mathbb{P}^\star(\cdot \mid \pi^t)$, and finally generates the reward vector $r^t$ such that $\mathbb{E}[ r^t \mid o^t ] = R^\star(o^t)$. The information $(\pi^t, o^t, r^t)$ may then be used by the learner to obtain the updated estimator $\mu^{t+1}$. For any $M \in \mathcal{M}$, we consider the following estimation error of model $M$ with respect to the true model at step $t$:
\[
\mathrm{Err}^t_M := \mathbb{E}_t\Big[ D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi^t), \mathbb{P}^\star(\cdot \mid \pi^t) \big) + \big\| R^M(o^t) - R^\star(o^t) \big\|_2^2 \Big],
\]
where $\mathbb{E}_t$ is taken with respect to all randomness after the prediction $\mu^t$ is made; in particular it takes the expectation over $(\pi^t, o^t)$. Note that $\mathrm{Err}^t_{M^\star} = 0$ by definition.

Algorithm and theoretical guarantee

The Tempered Aggregation algorithm is presented in Algorithm 2. Here we present the case of a finite model class ($|\mathcal{M}| < \infty$); in Appendix D.3 we treat the more general case of infinite model classes using covering arguments.

Theorem D.1 (Tempered Aggregation). Suppose $|\mathcal{M}| < \infty$, the reward vector $r^t$ is $\sigma^2$-sub-Gaussian conditioned on $o^t$, and $\| R^M(o^t) - R^\star(o^t) \|_2 \le D$ almost surely for all $t \in [T]$. Then Algorithm 2 with any learning rates $\eta_p, \eta_r > 0$ such that $2\eta_p + 2\sigma^2 \eta_r < 1$ achieves the following with probability at least $1 - \delta$:
\[
\sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t}\big[ \mathrm{Err}^t_M \big] \le C \log(|\mathcal{M}|/\delta),
\]
where $C = \max\big\{ \frac{1}{\eta_p}, \frac{1}{(1 - 2\eta_p) c_1} \big\}$, $c_1 := (1 - e^{-c(1 - 2\sigma^2 c) D^2})/D^2$, and $c := \eta_r/(1 - 2\eta_p)$ are constants depending only on $(\eta_p, \eta_r, \sigma^2, D)$. Furthermore, it also achieves the following in-expectation guarantee:
\[
\mathbb{E}\Big[ \sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t}\big[ \mathrm{Err}^t_M \big] \Big] \le C \log|\mathcal{M}|.
\]
The proof of Theorem D.1 can be found in Appendix D.1.1. Lemma 3 now follows as a direct corollary of Theorem D.1, which we restate and prove below.

Corollary D.2 (Restatement of Lemma 3). The Tempered Aggregation subroutine (3) in Algorithm 1 with $4\eta_p + \eta_r < 2$ achieves the following bound with probability at least $1 - \delta$:
\[
\mathrm{Est}_{\mathrm{RL}} := \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t} \mathbb{E}_{\widehat{M}^t \sim \mu^t}\Big[ D_{\mathrm{RL}}^2\big( M^\star(\pi^t), \widehat{M}^t(\pi^t) \big) \Big] \le C \cdot \log(|\mathcal{M}|/\delta),
\]
where $C$ depends only on $(\eta_p, \eta_r)$. Specifically, we can choose $\eta_p = \eta_r = 1/3$ and $C = 10$. Furthermore, when $\eta_p = \eta \in (0, \frac12]$ and $\eta_r = 0$, (3) achieves
\[
\mathrm{Est}_{\mathrm{H}} := \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t} \mathbb{E}_{\widehat{M}^t \sim \mu^t}\Big[ D_{\mathrm{H}}^2\big( \mathbb{P}^{M^\star}(\pi^t), \mathbb{P}^{\widehat{M}^t}(\pi^t) \big) \Big] \le \frac{1}{\eta} \cdot \log(|\mathcal{M}|/\delta)
\]
with probability at least $1 - \delta$.

Proof. Note that subroutine (3) in Algorithm 1 is exactly an instantiation of the Tempered Aggregation algorithm (Algorithm 2) with context $\pi^t$ sampled from distribution $p^t$ (which depends on $\mu^t$), observation $o^t$, and reward $r^t$.
Therefore, we can apply Theorem D.1, where we further note that $\mathbb{E}_{M \sim \mu^t}[ \mathrm{Err}^t_M ]$ corresponds exactly to
\[
\mathbb{E}_{M \sim \mu^t}\big[ \mathrm{Err}^t_M \big]
= \mathbb{E}_{\widehat{M}^t \sim \mu^t} \mathbb{E}_{\pi^t \sim p^t}\Big[ D_{\mathrm{H}}^2\big( \mathbb{P}^{M^\star}(\pi^t), \mathbb{P}^{\widehat{M}^t}(\pi^t) \big) + \mathbb{E}_{o \sim \mathbb{P}^{M^\star}(\pi^t)} \big\| R^{M^\star}(o) - R^{\widehat{M}^t}(o) \big\|_2^2 \Big]
= \mathbb{E}_{\widehat{M}^t \sim \mu^t} \mathbb{E}_{\pi^t \sim p^t}\Big[ D_{\mathrm{RL}}^2\big( M^\star(\pi^t), \widehat{M}^t(\pi^t) \big) \Big].
\]
Notice that we can pick $\sigma^2 = 1/4$ and $D = \sqrt{2}$: each individual reward $r_h \in [0, 1]$ almost surely (and hence is $1/4$-sub-Gaussian by Hoeffding's lemma), and
\[
\big\| R^M(o) - R^{M'}(o) \big\|_2^2 = \sum_{h=1}^{H} \big| R^M_h(o) - R^{M'}_h(o) \big|^2 \le \sum_{h=1}^{H} \big| R^M_h(o) - R^{M'}_h(o) \big| \le \sum_{h=1}^{H} \Big( \big| R^M_h(o) \big| + \big| R^{M'}_h(o) \big| \Big) \le 2
\]
for any two models $M, M'$ and any $o \in \mathcal{O}$. Therefore, Theorem D.1 yields that, as long as $4\eta_p + \eta_r < 2$, we have with probability at least $1 - \delta$ that
\[
\mathrm{Est}_{\mathrm{RL}} := \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t} \mathbb{E}_{\widehat{M}^t \sim \mu^t}\Big[ D_{\mathrm{RL}}^2\big( M^\star(\pi^t), \widehat{M}^t(\pi^t) \big) \Big] \le C \cdot \log(|\mathcal{M}|/\delta),
\]
where $C = \max\big\{ \frac{1}{\eta_p}, \frac{1}{(1 - 2\eta_p) c_1} \big\}$, $c_1 = (1 - e^{-c(2 - c)})/2$, and $c = \eta_r/(1 - 2\eta_p)$. Choosing $\eta_p = \eta_r = 1/3$, we have $c = 1$, $c_1 = (1 - e^{-1})/2$, and $C = \max\{3, 3/c_1\} \le 10$ by numerical calculation. This is the desired result. The case $\eta_r = 0$ follows similarly.
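The numerical calculation at the end of the proof can be reproduced in a few lines (a sanity check of the constants only, not part of the proof):

```python
import math

# Constants from Theorem D.1 specialized to sigma^2 = 1/4, D^2 = 2 (the RL setting).
eta_p = eta_r = 1 / 3
sigma2, D2 = 0.25, 2.0

assert 2 * eta_p + 2 * sigma2 * eta_r < 1  # learning-rate condition of Theorem D.1

c = eta_r / (1 - 2 * eta_p)                                # = 1
c1 = (1 - math.exp(-c * (1 - 2 * sigma2 * c) * D2)) / D2   # = (1 - e^{-1}) / 2
C = max(1 / eta_p, 1 / ((1 - 2 * eta_p) * c1))             # = max{3, 3/c1}

print(round(c, 3), round(c1, 4), round(C, 3))
```

The computed $C \approx 9.49$ confirms the claimed bound $C \le 10$ in Corollary D.2.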

D.1.1 PROOF OF THEOREM D.1

For all $t \in [T]$, define the random variable
\[
\Delta^t := -\log \mathbb{E}_{M \sim \mu^t}\Big[ \exp\Big( \eta_p \log \frac{\mathbb{P}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} + \eta_r \delta^t_M \Big) \Big],
\quad \text{where } \delta^t_M := \big\| r^t - R^\star(o^t) \big\|_2^2 - \big\| r^t - R^M(o^t) \big\|_2^2. \tag{11}
\]
Recall that $\mathbb{E}_t$ is taken with respect to all randomness after the prediction $\mu^t$ is made. Then
\[
\begin{aligned}
\mathbb{E}_t\big[ \exp(-\Delta^t) \big]
&= \mathbb{E}_t\Big[ \mathbb{E}_{M \sim \mu^t}\Big[ \exp\Big( \eta_p \log \frac{\mathbb{P}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} + \eta_r \delta^t_M \Big) \Big] \Big]
= \sum_{M \in \mathcal{M}} \mu^t(M)\, \mathbb{E}_t\Big[ \exp\Big( \eta_p \log \frac{\mathbb{P}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} + \eta_r \delta^t_M \Big) \Big] \\
&\le \sum_{M \in \mathcal{M}} \mu^t(M)\, \mathbb{E}_t\Big[ 2\eta_p \exp\Big( \tfrac12 \log \frac{\mathbb{P}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} \Big) + (1 - 2\eta_p) \exp\Big( \frac{\eta_r}{1 - 2\eta_p} \delta^t_M \Big) \Big] \\
&= 2\eta_p \sum_{M \in \mathcal{M}} \mu^t(M)\, \mathbb{E}_t\Bigg[ \mathbb{E}_{o \sim \mathbb{P}^\star(\cdot \mid \pi^t)} \sqrt{\frac{\mathbb{P}^M(o \mid \pi^t)}{\mathbb{P}^\star(o \mid \pi^t)}} \Bigg] + (1 - 2\eta_p) \sum_{M \in \mathcal{M}} \mu^t(M)\, \mathbb{E}_t\Big[ \exp\Big( \frac{\eta_r}{1 - 2\eta_p} \delta^t_M \Big) \Big],
\end{aligned} \tag{12}
\]
where the inequality uses convexity of $\exp(\cdot)$ with weights $(2\eta_p, 1 - 2\eta_p)$. For the first term, by definition,
\[
\mathbb{E}_{o \sim \mathbb{P}^\star(\cdot \mid \pi^t)} \sqrt{\frac{\mathbb{P}^M(o \mid \pi^t)}{\mathbb{P}^\star(o \mid \pi^t)}} = 1 - \tfrac12 D_{\mathrm{H}}^2\big( \mathbb{P}^\star(\cdot \mid \pi^t), \mathbb{P}^M(\cdot \mid \pi^t) \big). \tag{13}
\]
To bound the second term, we abbreviate $c := \frac{\eta_r}{1 - 2\eta_p}$ and invoke the following lemma, whose proof can be found in Appendix D.1.2.

Lemma D.3. Suppose that $r \in \mathbb{R}^d$ is a $\sigma^2$-sub-Gaussian random vector, $\bar r = \mathbb{E}[r]$ is the mean of $r$, and $\widehat r \in \mathbb{R}^d$ is any fixed vector. Then the random variable $\delta := \| r - \bar r \|_2^2 - \| r - \widehat r \|_2^2$ satisfies, for any $c \in \mathbb{R}$,
\[
\mathbb{E}\big[ \exp(c \delta) \big] \le \exp\big( -c(1 - 2\sigma^2 c) \| \bar r - \widehat r \|_2^2 \big).
\]
Therefore,
\[
\mathbb{E}_t\big[ \exp(c \delta^t_M) \big]
\le \mathbb{E}_t\Big[ \exp\Big( -c(1 - 2\sigma^2 c) \big\| R^M(o^t) - R^\star(o^t) \big\|_2^2 \Big) \Big]
\le 1 - c_1\, \mathbb{E}_t\Big[ \big\| R^M(o^t) - R^\star(o^t) \big\|_2^2 \Big], \tag{14}
\]
where the second inequality is due to the fact that $e^{-c(1 - 2\sigma^2 c)x} \le 1 - c_1 x$ for all $x \in [0, D^2]$, which is ensured by our choice of $c \in [0, 1/(2\sigma^2))$ and $c_1 := (1 - e^{-D^2 c(1 - 2\sigma^2 c)})/D^2 > 0$. Therefore, flipping (12) and adding one to both sides, and plugging in (13) and (14), we get
\[
1 - \mathbb{E}_t\big[ \exp(-\Delta^t) \big]
\ge \eta_p\, \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\big[ D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi^t), \mathbb{P}^\star(\cdot \mid \pi^t) \big) \big] + (1 - 2\eta_p) c_1\, \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\Big[ \big\| R^M(o^t) - R^\star(o^t) \big\|_2^2 \Big].
\]
Thus, by martingale concentration (Lemma B.2), we have with probability at least $1 - \delta$ that
\[
\begin{aligned}
\sum_{t=1}^{T} \Delta^t + \log(1/\delta)
&\ge \sum_{t=1}^{T} -\log \mathbb{E}_t\big[ \exp(-\Delta^t) \big]
\ge \sum_{t=1}^{T} \Big( 1 - \mathbb{E}_t\big[ \exp(-\Delta^t) \big] \Big) \\
&\ge \eta_p \sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\big[ D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi^t), \mathbb{P}^\star(\cdot \mid \pi^t) \big) \big] + (1 - 2\eta_p) c_1 \sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\Big[ \big\| R^M(o^t) - R^\star(o^t) \big\|_2^2 \Big] \\
&\ge \min\big\{ \eta_p, (1 - 2\eta_p) c_1 \big\} \cdot \sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t}\big[ \mathrm{Err}^t_M \big],
\end{aligned} \tag{15}
\]
where the second inequality uses $-\log x \ge 1 - x$. It remains to upper bound $\sum_{t=1}^{T} \Delta^t$. Note that the update rule of Algorithm 2 can be written in the following Follow-The-Regularized-Leader form:
\[
\mu^t(M) = \frac{\mu^1(M) \exp\big( \sum_{s \le t-1} \eta_p \log \mathbb{P}^M(o^s \mid \pi^s) + \eta_r \delta^s_M \big)}{\sum_{M' \in \mathcal{M}} \mu^1(M') \exp\big( \sum_{s \le t-1} \eta_p \log \mathbb{P}^{M'}(o^s \mid \pi^s) + \eta_r \delta^s_{M'} \big)},
\]
where we have used that $\delta^t_M = -\| r^t - R^M(o^t) \|_2^2 + \| r^t - R^\star(o^t) \|_2^2$, in which $\| r^t - R^\star(o^t) \|_2^2$ is a constant that does not depend on $M$ for all $t \in [T]$. Therefore,
\[
\exp(-\Delta^t)
= \mathbb{E}_{M \sim \mu^t}\Big[ \exp\Big( \eta_p \log \frac{\mathbb{P}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} + \eta_r \delta^t_M \Big) \Big]
= \frac{\sum_{M \in \mathcal{M}} \mu^1(M) \exp\big( \sum_{s \le t} \eta_p \log \frac{\mathbb{P}^M(o^s \mid \pi^s)}{\mathbb{P}^\star(o^s \mid \pi^s)} + \eta_r \delta^s_M \big)}{\sum_{M \in \mathcal{M}} \mu^1(M) \exp\big( \sum_{s \le t-1} \eta_p \log \frac{\mathbb{P}^M(o^s \mid \pi^s)}{\mathbb{P}^\star(o^s \mid \pi^s)} + \eta_r \delta^s_M \big)},
\]
where the last equality again uses the fact that $-\eta_p \log \mathbb{P}^\star(o^s \mid \pi^s)$ is a constant that does not depend on $M$ for all $s \in [t]$. Taking $-\log$ of both sides and summing over $t \in [T]$, we have by telescoping that
\[
\sum_{t=1}^{T} \Delta^t = -\log \sum_{M \in \mathcal{M}} \mu^1(M) \exp\Big( \sum_{t=1}^{T} \eta_p \log \frac{\mathbb{P}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} + \eta_r \delta^t_M \Big). \tag{17}
\]
By realizability $M^\star \in \mathcal{M}$ (keeping only the term $M = M^\star$, whose exponent is zero), we have
\[
\sum_{t=1}^{T} \Delta^t \le -\log \mu^1(M^\star) = \log|\mathcal{M}|.
\]
Plugging this bound into (15) gives the desired high-probability statement.

3:   Sample $\pi^t \sim p^t$. Execute $\pi^t$ and observe $(o^t, r^t)$.
4:   Update randomized model estimator by the online estimation subroutine: $\mu^{t+1} \leftarrow \mathrm{Alg}^t_{\mathrm{Est}}\big( \{ (\pi^s, o^s, r^s) \}_{s \in [t]} \big)$.

D.2 GENERAL E2D & PROOF OF THEOREM 2

We first prove a guarantee for the following E2D meta-algorithm, which allows any (randomized) online estimation subroutine and includes Algorithm 1 as a special case by instantiating $\mathrm{Alg}_{\mathrm{Est}}$ as the Tempered Aggregation subroutine (for finite model classes), thus proving Theorem 2. The following theorem is an instantiation of Foster et al. (2021, Theorem 4.3) obtained by choosing the divergence function to be $D_{\mathrm{RL}}$; for completeness, we provide a proof in Appendix D.4. Let
\[
\mathrm{Est}_{\mathrm{RL}} := \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t} \mathbb{E}_{\widehat{M}^t \sim \mu^t}\Big[ D_{\mathrm{RL}}^2\big( \widehat{M}^t(\pi^t), M^\star(\pi^t) \big) \Big] \tag{18}
\]
denote the online estimation error of $\{\mu^t\}_{t=1}^{T}$ in the $D_{\mathrm{RL}}^2$ divergence (achieved by $\mathrm{Alg}_{\mathrm{Est}}$).

Theorem D.4 (E2D meta-algorithm (Foster et al., 2021)). Algorithm 3 achieves
\[
\mathrm{Reg}_{\mathrm{DM}} \le T \cdot \mathrm{dec}_\gamma(\mathcal{M}) + \gamma \cdot \mathrm{Est}_{\mathrm{RL}}.
\]
We are now ready to prove the main theorem (finite $\mathcal{M}$).

Proof of Theorem 2

Note that Algorithm 1 is an instantiation of Algorithm 3 with $\mathrm{Alg}_{\mathrm{Est}}$ chosen as Tempered Aggregation. By Lemma 3, choosing $\eta_p = \eta_r = 1/3$, the Tempered Aggregation subroutine achieves $\mathrm{Est}_{\mathrm{RL}} \le 10 \log(|\mathcal{M}|/\delta)$ with probability at least $1 - \delta$. On this event, Theorem D.4 gives
\[
\mathrm{Reg}_{\mathrm{DM}} \le T \cdot \mathrm{dec}_\gamma(\mathcal{M}) + \gamma \cdot \mathrm{Est}_{\mathrm{RL}} \le T \cdot \mathrm{dec}_\gamma(\mathcal{M}) + 10 \gamma \log(|\mathcal{M}|/\delta),
\]
which is the desired result.

D.3 E2D-TA WITH COVERING

In many scenarios, we have to work with an infinite model class $\mathcal{M}$ instead of a finite one. In the following, we define a covering number suitable for the divergence $D_{\mathrm{RL}}$, and provide the analysis of the Tempered Aggregation subroutine (as well as the corresponding E2D-TA algorithm) with such coverings. We consider the following definition of optimistic covering.

Definition D.5 (Optimistic covering). Given $\rho \in [0, 1]$, an optimistic $\rho$-cover of $\mathcal{M}$ is a tuple $(\widetilde{\mathbb{P}}, \mathcal{M}_0)$, where $\mathcal{M}_0$ is a finite subset of $\mathcal{M}$, and each $M_0 \in \mathcal{M}_0$ is assigned an optimistic likelihood function $\widetilde{\mathbb{P}}^{M_0}$, such that the following holds:
(1) For $M_0 \in \mathcal{M}_0$ and each $\pi$, $\widetilde{\mathbb{P}}^{M_0, \pi}(\cdot)$ specifies an un-normalized distribution over $\mathcal{O}$, and it holds that $\| \mathbb{P}^{M_0, \pi}(\cdot) - \widetilde{\mathbb{P}}^{M_0, \pi}(\cdot) \|_1 \le \rho^2$.
(2) For any $M \in \mathcal{M}$, there exists an $M_0 \in \mathcal{M}_0$ that covers $M$: for all $\pi \in \Pi$ and $o \in \mathcal{O}$, it holds that $\widetilde{\mathbb{P}}^{M_0, \pi}(o) \ge \mathbb{P}^{M, \pi}(o)$, and $\| R^M(o) - R^{M_0}(o) \|_1 \le \rho$.
The optimistic covering number $N(\mathcal{M}, \rho)$ is defined as the minimal cardinality of $\mathcal{M}_0$ such that there exists $\widetilde{\mathbb{P}}$ making $(\widetilde{\mathbb{P}}, \mathcal{M}_0)$ an optimistic $\rho$-cover of $\mathcal{M}$.

With the definition of an optimistic covering at hand, the Tempered Aggregation algorithm can be directly generalized to infinite model classes by performing the updates on an optimistic cover (Algorithm 4).

Proposition D.6 (Tempered Aggregation with covering for RL). For any model class $\mathcal{M}$ and an associated optimistic $\rho$-cover $(\widetilde{\mathbb{P}}, \mathcal{M}_0)$, the Tempered Aggregation subroutine
\[
\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\Big( \eta_p \log \widetilde{\mathbb{P}}^{M, \pi^t}(o^t) - \eta_r \big\| r^t - R^M(o^t) \big\|_2^2 \Big) \tag{19}
\]
with $\mu^1 = \mathrm{Unif}(\mathcal{M}_0)$ and $\eta_p = \eta_r = 1/3$ achieves the following bound with probability at least $1 - \delta$:
\[
\mathrm{Est}_{\mathrm{RL}} \le 10 \cdot \big[ \log|\mathcal{M}_0| + 2 T \rho + 2 \log(2/\delta) \big].
\]
Consequently, E2D-TA with this subroutine achieves $\mathrm{Reg}_{\mathrm{DM}} \le C\big( T \cdot \mathrm{dec}_\gamma(\mathcal{M}) + \gamma\, \mathrm{est}(\mathcal{M}, T) + \gamma \log(1/\delta) \big)$ with probability at least $1 - \delta$, where $C$ is a universal constant.

D.3.1 DISCUSSIONS ABOUT OPTIMISTIC COVERING

We make a few remarks regarding our definition of optimistic covering. Examples of optimistic covers for concrete model classes can be found in, e.g., Example K.13 and Proposition K.15; see also (Liu et al., 2022a, Appendix B).

A more relaxed definition We first remark that Definition D.5(2) can actually be relaxed to:
(2') For any $M \in \mathcal{M}$, there exists an $M_0 \in \mathcal{M}_0$ such that
\[
\max_{o \in \mathcal{O}} \big\| R^M(o) - R^{M_0}(o) \big\|_1 \le \rho, \qquad \mathbb{E}_{o \sim \mathbb{P}^M(\cdot \mid \pi)}\Bigg[ \frac{\mathbb{P}^M(o \mid \pi)}{\widetilde{\mathbb{P}}^{M_0}(o \mid \pi)} \Bigg] \le 1 + \rho, \quad \forall \pi \in \Pi. \quad (:)
\]
For simplicity of presentation, we state all the results in terms of Definition D.5, but the proof of Theorem D.8 can be directly adapted to (:); see Remark D.9.

Algorithm 4 TEMPERED AGGREGATION WITH COVERING
Input: Learning rates $\eta_p \in (0, \frac12)$, $\eta_r > 0$; number of steps $T$; $\rho$-optimistic cover $(\widetilde{\mathbb{P}}, \mathcal{M}_0)$.
1: Initialize $\mu^1 \leftarrow \mathrm{Unif}(\mathcal{M}_0)$.
2: for $t = 1, \dots, T$ do
3:   Receive $(\pi^t, o^t, r^t)$.
4:   Update randomized model estimator:
\[
\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\Big( \eta_p \log \widetilde{\mathbb{P}}^M(o^t \mid \pi^t) - \eta_r \big\| r^t - R^M(o^t) \big\|_2^2 \Big).
\]

Relation to covering in the TV distance Covering in the TV distance requires a uniform likelihood-ratio bound $B \ge \sup_{o \in \mathcal{O}, \pi \in \Pi, M \in \mathcal{M}} \frac{\mathbb{P}^M(o \mid \pi)}{\nu(o \mid \pi)}$, with $\nu$ being a certain base distribution. In fact, with such a $B$, we can show that $N_1(\mathcal{M}, \rho) \le N_{\mathrm{TV}}(\mathcal{M}, \rho^2/4B)$, where $N_{\mathrm{TV}}$ is the covering number in the TV sense, and $N_1$ is the optimistic covering number with respect to (:).

Relation to other notions of covering numbers Ignoring the reward component, our optimistic covering number is essentially equivalent to the bracketing number. We further remark that optimistic covering can be slightly weaker than covering in the $\chi^2$-distance sense: given a $\rho^2$-covering $\mathcal{M}_0$ in the latter sense, we can take $\widetilde{\mathbb{P}} = (1 + \rho^2)\mathbb{P}$ to obtain a $\rho$-optimistic covering in the sense of (:).
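As a concrete (and entirely illustrative) instance of Definition D.5, the sketch below builds an optimistic cover for a hypothetical class of Bernoulli observation models with means in $[0.1, 0.9]$, ignoring rewards: the cover is a mean grid, and the optimistic likelihoods are the grid models inflated by a constant factor.

```python
# Sketch: an optimistic cover (Definition D.5) for Bernoulli observation models
# with means in [0.1, 0.9] (hypothetical class; rewards ignored for simplicity).
grid = [round(0.1 + 0.01 * k, 2) for k in range(81)]   # cover M_0: means on a grid
lam = 0.06                                             # inflation factor for P-tilde

def P(p):          # model M_p: distribution over O = {0, 1}
    return [1 - p, p]

def P_tilde(q):    # optimistic (un-normalized) likelihood of cover element q
    return [(1 + lam) * (1 - q), (1 + lam) * q]

rho = 0.25
ok = True
for i in range(1001):
    p = 0.1 + 0.8 * i / 1000                  # arbitrary model in the class
    q = min(grid, key=lambda g: abs(g - p))   # nearest cover element, |p - q| <= 0.005
    # Condition (1): ||P^q - P_tilde^q||_1 <= rho^2;  Condition (2): P_tilde^q >= P^p.
    cond1 = sum(abs(a - b) for a, b in zip(P(q), P_tilde(q))) <= rho ** 2
    cond2 = all(t >= v for t, v in zip(P_tilde(q), P(p)))
    ok = ok and cond1 and cond2
print("optimistic cover conditions hold:", ok)
```

Here the inflation $\lambda = 0.06$ buys the pointwise domination in condition (2) at the price of an $L_1$ error of $\lambda \le \rho^2$ in condition (1), mirroring the $\widetilde{\mathbb{P}} = (1 + \rho^2)\mathbb{P}$ construction discussed above.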

D.4 PROOF OF THEOREM D.4

We have by definition of $\mathrm{Reg}_{\mathrm{DM}}$ that
\[
\begin{aligned}
\mathrm{Reg}_{\mathrm{DM}}
&= \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t}\Big[ f^{M^\star}(\pi_{M^\star}) - f^{M^\star}(\pi^t) \Big] \\
&= \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t}\Big[ f^{M^\star}(\pi_{M^\star}) - f^{M^\star}(\pi^t) - \gamma\, \mathbb{E}_{\widehat{M}^t \sim \mu^t}\big[ D_{\mathrm{RL}}^2\big( M^\star(\pi^t), \widehat{M}^t(\pi^t) \big) \big] \Big]
+ \gamma \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t} \mathbb{E}_{\widehat{M}^t \sim \mu^t}\big[ D_{\mathrm{RL}}^2\big( M^\star(\pi^t), \widehat{M}^t(\pi^t) \big) \big] \\
&\overset{(i)}{\le} \sum_{t=1}^{T} \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi^t \sim p^t} \mathbb{E}_{\overline{M} \sim \mu^t}\big[ f^M(\pi_M) - f^M(\pi^t) - \gamma D_{\mathrm{RL}}^2\big( M(\pi^t), \overline{M}(\pi^t) \big) \big] + \gamma \cdot \mathrm{Est}_{\mathrm{RL}} \\
&\overset{(ii)}{=} \sum_{t=1}^{T} \underbrace{\widehat{V}^{\mu^t}_\gamma(p^t)}_{= \inf_{p \in \Delta(\Pi)} \widehat{V}^{\mu^t}_\gamma(p)} + \gamma \cdot \mathrm{Est}_{\mathrm{RL}}
\overset{(iii)}{=} \sum_{t=1}^{T} \mathrm{dec}_\gamma(\mathcal{M}, \mu^t) + \gamma \cdot \mathrm{Est}_{\mathrm{RL}}
\le T \cdot \mathrm{dec}_\gamma(\mathcal{M}) + \gamma \cdot \mathrm{Est}_{\mathrm{RL}}.
\end{aligned}
\]
Above, (i) follows from the realizability assumption $M^\star \in \mathcal{M}$; (ii) follows from the definition of the risk $\widehat{V}^{\mu^t}_\gamma$ (cf. (2)) as well as the fact that $p^t$ minimizes $\widehat{V}^{\mu^t}_\gamma(\cdot)$ in Algorithm 3; (iii) follows from the definition of $\mathrm{dec}_\gamma(\mathcal{M}, \mu^t)$. This completes the proof.

D.5 PROOF OF PROPOSITION D.6

We first restate the TEMPERED AGGREGATION WITH COVERING subroutine (19) in the general setup of online model estimation as Algorithm 4.

Theorem D.8 (Tempered Aggregation over covering). For any $\mathcal{M}$ that is not necessarily finite, but otherwise under the same setting as Theorem D.1, Algorithm 4 with $2\eta_p + 2\sigma^2 \eta_r < 1$ achieves, with probability at least $1 - \delta$,
\[
\sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t}\big[ \mathrm{Err}^t_M \big] \le C \big[ \log|\mathcal{M}_0| + 2 \log(2/\delta) + 2 T \rho (\eta_r + \eta_p) \big],
\]
where $C$ is defined as in Theorem D.1.

Plugging Theorem D.8 into the RL setting, picking $(\eta_p, \eta_r)$, and performing numerical calculations directly yields Proposition D.6; the proof follows the same arguments as Corollary D.2 and is hence omitted. Similarly, when $\eta_p = \eta \in (0, \frac12)$ and $\eta_r = 0$, the proof of Theorem D.8 implies that (19) with $\mu^1 = \mathrm{Unif}(\mathcal{M}_0)$ achieves the following bound with probability at least $1 - \delta$:
\[
\sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t} \mathbb{E}_{\widehat{M}^t \sim \mu^t}\Big[ D_{\mathrm{H}}^2\big( \mathbb{P}^{M^\star}(\pi^t), \mathbb{P}^{\widehat{M}^t}(\pi^t) \big) \Big] \le \frac{1}{\eta} \cdot \big[ \log|\mathcal{M}_0| + 2 \eta T \rho + 2 \log(2/\delta) \big]. \tag{20}
\]

Proof of Theorem D.8 The proof is similar to that of Theorem D.1. Consider the random variable
\[
\Delta^t := -\log \mathbb{E}_{M \sim \mu^t}\Bigg[ \exp\Bigg( \eta_p \log \frac{\widetilde{\mathbb{P}}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} + \eta_r \delta^t_M \Bigg) \Bigg], \quad t \in [T],
\]
where $\delta^t_M$ is defined in (11). Then by (12) and (14), we have
\[
\mathbb{E}_t\big[ \exp(-\Delta^t) \big]
\le 2\eta_p\, \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\Bigg[ \sqrt{\frac{\widetilde{\mathbb{P}}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)}} \Bigg]
+ (1 - 2\eta_p) \Big( 1 - c_1\, \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\Big[ \big\| R^M(o^t) - R^\star(o^t) \big\|_2^2 \Big] \Big),
\]
where $c_1$ is the same as in Theorem D.1. To bound the first term, we notice that for all $\pi \in \Pi$,
\[
\begin{aligned}
\mathbb{E}_{o \sim \mathbb{P}^\star(\cdot \mid \pi)}\Bigg[ \sqrt{\frac{\widetilde{\mathbb{P}}^M(o \mid \pi)}{\mathbb{P}^\star(o \mid \pi)}} \Bigg]
&\le \mathbb{E}_{o \sim \mathbb{P}^\star(\cdot \mid \pi)}\Bigg[ \sqrt{\frac{\mathbb{P}^M(o \mid \pi)}{\mathbb{P}^\star(o \mid \pi)}} \Bigg]
+ \Bigg( \mathbb{E}_{o \sim \mathbb{P}^\star(\cdot \mid \pi)}\Bigg[ \frac{\big( \sqrt{\widetilde{\mathbb{P}}^M(o \mid \pi)} - \sqrt{\mathbb{P}^M(o \mid \pi)} \big)^2}{\mathbb{P}^\star(o \mid \pi)} \Bigg] \Bigg)^{1/2} \\
&\le 1 - \tfrac12 D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi), \mathbb{P}^\star(\cdot \mid \pi) \big)
+ \Bigg( \mathbb{E}_{o \sim \mathbb{P}^\star(\cdot \mid \pi)}\Bigg[ \frac{\big| \widetilde{\mathbb{P}}^M(o \mid \pi) - \mathbb{P}^M(o \mid \pi) \big|}{\mathbb{P}^\star(o \mid \pi)} \Bigg] \Bigg)^{1/2} \\
&= 1 - \tfrac12 D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi), \mathbb{P}^\star(\cdot \mid \pi) \big)
+ \big\| \widetilde{\mathbb{P}}^M(\cdot \mid \pi) - \mathbb{P}^M(\cdot \mid \pi) \big\|_1^{1/2}
\le 1 - \tfrac12 D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi), \mathbb{P}^\star(\cdot \mid \pi) \big) + \rho,
\end{aligned} \tag{22}
\]
where the last inequality is due to the fact that $\| \mathbb{P}^M(\cdot \mid \pi) - \widetilde{\mathbb{P}}^M(\cdot \mid \pi) \|_1 \le \rho^2$.

(22) directly implies that
\[
\mathbb{E}_t\Bigg[ \sqrt{\frac{\widetilde{\mathbb{P}}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)}} \Bigg] \le 1 - \tfrac12 \mathbb{E}_t\big[ D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi^t), \mathbb{P}^\star(\cdot \mid \pi^t) \big) \big] + \rho.
\]
Therefore, by Lemma B.2, with probability at least $1 - \delta/2$ it holds that
\[
\begin{aligned}
\sum_{t=1}^{T} \Delta^t + \log(2/\delta)
&\ge \sum_{t=1}^{T} -\log \mathbb{E}_t\big[ \exp(-\Delta^t) \big]
\ge \sum_{t=1}^{T} \Big( 1 - \mathbb{E}_t\big[ \exp(-\Delta^t) \big] \Big) \\
&\ge \eta_p \Bigg[ \sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\big[ D_{\mathrm{H}}^2\big( \mathbb{P}^M(\cdot \mid \pi^t), \mathbb{P}^\star(\cdot \mid \pi^t) \big) \big] - 2 T \rho \Bigg]
+ (1 - 2\eta_p) c_1 \sum_{t=1}^{T} \mathbb{E}_{M \sim \mu^t} \mathbb{E}_t\Big[ \big\| R^M(o^t) - R^\star(o^t) \big\|_2^2 \Big].
\end{aligned}
\]
In the following, we complete the proof by showing that with probability at least $1 - \delta/2$,
\[
\sum_{t=1}^{T} \Delta^t \le \log|\mathcal{M}_0| + 2 T \eta_r \rho + \log(2/\delta).
\]
By a telescoping argument same as (17), we have
\[
\sum_{t=1}^{T} \Delta^t = -\log \sum_{M \in \mathcal{M}_0} \mu^1(M) \exp\Bigg( \sum_{t=1}^{T} \eta_p \log \frac{\widetilde{\mathbb{P}}^M(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} + \eta_r \delta^t_M \Bigg).
\]
By the definition of $\mathcal{M}_0$ and the realizability $M^\star \in \mathcal{M}$, there exists an $\widetilde{M} \in \mathcal{M}_0$ such that $M^\star$ is covered by $\widetilde{M}$ (i.e., $\| R^{\widetilde{M}}(o) - R^\star(o) \|_1 \le \rho$ and $\widetilde{\mathbb{P}}^{\widetilde{M}}(\cdot \mid \pi) \ge \mathbb{P}^\star(\cdot \mid \pi)$ for all $\pi$). Then
\[
\mathbb{E}\Bigg[ \exp\Bigg( \sum_{t=1}^{T} \Delta^t \Bigg) \Bigg] \le |\mathcal{M}_0|\, \mathbb{E}\Bigg[ \exp\Bigg( -\sum_{t=1}^{T} \eta_p \log \frac{\widetilde{\mathbb{P}}^{\widetilde{M}}(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} - \eta_r \delta^t_{\widetilde{M}} \Bigg) \Bigg]. \tag{26}
\]
Now
\[
\begin{aligned}
\mathbb{E}\Bigg[ \exp\Bigg( -\sum_{t=1}^{T} \eta_p \log \frac{\widetilde{\mathbb{P}}^{\widetilde{M}}(o^t \mid \pi^t)}{\mathbb{P}^\star(o^t \mid \pi^t)} - \eta_r \delta^t_{\widetilde{M}} \Bigg) \Bigg]
&= \mathbb{E}\Bigg[ \prod_{t=1}^{T} \Bigg( \frac{\mathbb{P}^\star(o^t \mid \pi^t)}{\widetilde{\mathbb{P}}^{\widetilde{M}}(o^t \mid \pi^t)} \Bigg)^{\eta_p} \exp(-\eta_r \delta^t_{\widetilde{M}}) \Bigg]
\le \mathbb{E}\Bigg[ \prod_{t=1}^{T} \exp(-\eta_r \delta^t_{\widetilde{M}}) \Bigg] \\
&= \mathbb{E}\Bigg[ \prod_{t=1}^{T-1} \exp(-\eta_r \delta^t_{\widetilde{M}}) \cdot \mathbb{E}\big[ \exp(-\eta_r \delta^T_{\widetilde{M}}) \mid o^T \big] \Bigg]
\le \exp(2 \rho \eta_r)\, \mathbb{E}\Bigg[ \prod_{t=1}^{T-1} \exp(-\eta_r \delta^t_{\widetilde{M}}) \Bigg]
\le \cdots \le \exp(2 T \rho \eta_r),
\end{aligned}
\]
where the first inequality is due to $\widetilde{\mathbb{P}}^{\widetilde{M}} \ge \mathbb{P}^\star$, and the second inequality holds because for all $t \in [T]$,
\[
\mathbb{E}\big[ \exp(-\eta_r \delta^t_{\widetilde{M}}) \mid o^t \big] \le \exp\Big( \eta_r (1 + 2\sigma^2 \eta_r) \big\| R^{\widetilde{M}}(o^t) - R^\star(o^t) \big\|_2^2 \Big) \le \exp(2 \rho \eta_r),
\]
which is due to Lemma D.3 and $\| R^{\widetilde{M}}(o^t) - R^\star(o^t) \|_2^2 \le \| R^{\widetilde{M}}(o^t) - R^\star(o^t) \|_1 \| R^{\widetilde{M}}(o^t) - R^\star(o^t) \|_\infty \le \rho$. Applying Chernoff's bound completes the proof.

Remark D.9.
From the proof above, it is clear that Theorem D.8 also holds under the alternative definition of the covering number in (:): under that definition, we can proceed in (26) by alternately using the fact $\mathbb{E}_{o \sim \mathbb{P}^\star(\cdot \mid \pi)}\big[ \frac{\mathbb{P}^\star(o \mid \pi)}{\widetilde{\mathbb{P}}^{\widetilde{M}}(o \mid \pi)} \big] \le 1 + \rho$ and the fact $\mathbb{E}[ \exp(-\eta_r \delta^t_{\widetilde{M}}) \mid o^t ] \le \exp(2 \rho \eta_r)$.

E CONNECTIONS TO OPTIMISTIC ALGORITHMS

Motivated by the close connection between E2D-TA and posteriors/likelihoods, in this section we re-analyze two existing optimistic algorithms, Model-based Optimistic Posterior Sampling (MOPS) and Optimistic Maximum Likelihood Estimation (OMLE), in a fashion parallel to E2D-TA. We show that these two algorithms, in addition to their algorithmic similarity to E2D-TA, work under general structural conditions related to the DEC, for which we establish formal relationships.

E.1 MODEL-BASED OPTIMISTIC POSTERIOR SAMPLING (MOPS)

We consider the following version of the MOPS algorithm of Agarwal & Zhang (2022a). Similar to E2D-TA, MOPS maintains a posterior $\mu^t \in \Delta(\mathcal{M})$ over models, initialized at a suitable prior $\mu^1$. The policy in the $t$-th episode is obtained directly by posterior sampling: $\pi^t = \pi_{M^t}$ where $M^t \sim \mu^t$. After executing $\pi^t$ and observing $(o^t, r^t)$, the algorithm updates the posterior as
\[
\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\Big( \gamma^{-1} f^M(\pi_M) + \eta_p \log \mathbb{P}^{M, \pi^t}(o^t) - \eta_r \big\| r^t - R^M(o^t) \big\|_2^2 \Big).
\]
This update is similar to Tempered Aggregation (3), differing in the additional optimism term $\gamma^{-1} f^M(\pi_M)$, which favors models with higher optimal values. (The full algorithm is given in Algorithm 5.) We now state the structural condition and theoretical guarantee for the MOPS algorithm.

Definition E.1 (Posterior sampling coefficient). The posterior sampling coefficient (PSC) of a model class $\mathcal{M}$ with respect to a reference model $\overline{M} \in \mathcal{M}$ and parameter $\gamma > 0$ is defined as
\[
\mathrm{psc}_\gamma(\mathcal{M}, \overline{M}) := \sup_{\mu \in \Delta(\mathcal{M})} \mathbb{E}_{M \sim \mu} \mathbb{E}_{M' \sim \mu}\Big[ f^M(\pi_M) - f^{\overline{M}}(\pi_M) - \gamma D_{\mathrm{RL}}^2\big( \overline{M}(\pi_{M'}), M(\pi_{M'}) \big) \Big].
\]

Theorem E.2 (Regret bound for MOPS). Choosing $\eta_p = 1/6$, $\eta_r = 0.6$, and the uniform prior $\mu^1 = \mathrm{Unif}(\mathcal{M})$, Algorithm 5 achieves the following with probability at least $1 - \delta$:
\[
\mathrm{Reg}_{\mathrm{DM}} \le T \big[ \mathrm{psc}_{\gamma/6}(\mathcal{M}, M^\star) + 2/\gamma \big] + 4 \gamma \cdot \log(|\mathcal{M}|/\delta).
\]
Theorem E.2 (proof in Appendix F.1) is similar to Agarwal & Zhang (2022a, Theorem 1) and is slightly more general in its assumed structural condition, as the PSC is bounded whenever the "Hellinger decoupling coefficient" used in their theorem is bounded (Proposition F.5).

Relationship between DEC and PSC The definition of the PSC resembles that of the DEC. We show that the DEC can indeed be upper bounded by the PSC modulo a (lower-order) additive term; in other words, a low PSC implies a low DEC. The proof can be found in Appendix F.4.

Proposition E.3 (Bounding DEC by PSC). Suppose $\Pi$ is finite. Then for any $\gamma > 0$,
\[
\mathrm{dec}_\gamma(\mathcal{M}) \le \sup_{\overline{M} \in \mathcal{M}} \mathrm{psc}_{\gamma/6}(\mathcal{M}, \overline{M}) + 2(H + 1)/\gamma.
\]
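The optimistic posterior update above can be sketched as a single reweighting step (illustrative only: the finite model class, the precomputed per-model quantities, and the parameter values are all hypothetical, not from the paper):

```python
import math

def mops_update(mu, f_opt, loglik, sq_err, gamma, eta_p, eta_r):
    # One step of the optimistic posterior update of MOPS (sketch).
    #   mu:      current posterior over models (list of probabilities)
    #   f_opt:   f^M(pi_M) for each model (the optimism term)
    #   loglik:  log P^{M, pi_t}(o_t) for each model
    #   sq_err:  ||r_t - R^M(o_t)||_2^2 for each model
    logw = [math.log(mu[m]) + f_opt[m] / gamma + eta_p * loglik[m] - eta_r * sq_err[m]
            for m in range(len(mu))]
    mx = max(logw)  # log-sum-exp shift for numerical stability
    w = [math.exp(l - mx) for l in logw]
    z = sum(w)
    return [wi / z for wi in w]

# Toy usage: two hypothetical models; model 1 has a higher optimal value,
# a higher likelihood on the observed data, and a smaller reward error.
mu = mops_update(mu=[0.5, 0.5], f_opt=[0.3, 0.9], loglik=[-1.0, -0.5],
                 sq_err=[0.2, 0.05], gamma=3.0, eta_p=1 / 6, eta_r=0.6)
print(mu)
```

The posterior shifts toward model 1, as all three terms of the exponent (optimism, likelihood, reward fit) favor it.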

E.2 OPTIMISTIC MAXIMUM LIKELIHOOD ESTIMATION (OMLE)

Standard versions of the OMLE algorithm (e.g., Liu et al. (2022a)) use the log-likelihood of all observed data as the risk function. Here, to parallel E2D-TA and MOPS, we consider the following risk function involving both the log-likelihood of the observations and the negative $L_2^2$ loss of the rewards:
\[
L^t(M) := \sum_{s=1}^{t-1} \Big[ \log \mathbb P_{M,\pi^s}(o^s) - \| r^s - R_M(o^s) \|_2^2 \Big]. \tag{28}
\]
In the $t$-th iteration, the OMLE algorithm plays the greedy policy of the most optimistic model within a $\beta$-superlevel set of the above risk (full algorithm in Algorithm 6):
\[
(M^t, \pi^t) := \arg\max_{(M,\pi)\in\mathcal M\times\Pi} f_M(\pi) \quad \text{such that} \quad L^t(M) \ge \max_{M'} L^t(M') - \beta.
\]
We now state the structural condition and theoretical guarantee for the OMLE algorithm.

Definition E.4 (Maximum likelihood estimation coefficient). The maximum likelihood estimation coefficient (MLEC) of a model class $\mathcal M$ with respect to a reference model $\bar M \in \mathcal M$, parameter $\gamma > 0$, and length $K \in \mathbb Z_{\ge 1}$ is defined as
\[
\mathrm{mlec}_{\gamma,K}(\mathcal M, \bar M) := \sup_{\{M^k\}\subset\mathcal M} \frac1K \sum_{k=1}^K \Big[ f_{M^k}(\pi_{M^k}) - f_{\bar M}(\pi_{M^k}) \Big] - \frac{\gamma}{K}\Bigg[ \Bigg( \max_{k\in[K]} \sum_{t\le k-1} D^2_{\mathrm{RL}}\big( \bar M(\pi_{M^t}), M^k(\pi_{M^t}) \big) \Bigg) \vee 1 \Bigg].
\]

Algorithm 5 MOPS (Agarwal & Zhang, 2022a)
1: Input: parameters $\eta_p, \eta_r, \gamma > 0$; prior distribution $\mu^1 \in \Delta(\mathcal M)$; optimistic likelihood function $\widetilde{\mathbb P}$.
2: for $t = 1, \ldots, T$ do
3: Sample $M^t \sim \mu^t(\cdot)$ and set $\pi^t = \pi_{M^t}$.
4: Execute $\pi^t$ and observe $(o^t, r^t)$.
5: Update the posterior over models by Optimistic Posterior Sampling (OPS):
\[
\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\Big( \gamma^{-1} f_M(\pi_M) + \eta_p \log \widetilde{\mathbb P}_{M,\pi^t}(o^t) - \eta_r \| r^t - R_M(o^t) \|_2^2 \Big). \tag{30}
\]

Theorem E.5 (Regret bound for OMLE). Choosing $\beta = 3\log(|\mathcal M|/\delta) \ge 1$, with probability at least $1-\delta$, Algorithm 6 achieves
\[
\mathbf{Reg}_{\mathrm{DM}} \le \inf_{\gamma>0}\big\{ T \cdot \mathrm{mlec}_{\gamma,T}(\mathcal M, M^\star) + 12\gamma \cdot \log(|\mathcal M|/\delta) \big\}.
\]
Existing sample-efficiency guarantees for OMLE-type algorithms are only established for specific RL problems through case-by-case analyses (Mete et al., 2021; Uehara et al., 2021; Liu et al., 2022a;b). By contrast, Theorem E.5 shows that OMLE works on any problem with a bounded MLEC, thereby offering a more unified understanding. The proof of Theorem E.5 is deferred to Appendix G.2. We remark that the MLEC is also closely related to the PSC, in that a bounded MLEC (under a slightly modified definition) implies a bounded PSC (Proposition G.4 & Appendix G.3).

F PROOFS FOR SECTION E.1

F.1 ALGORITHM MOPS

Here we present a more general version of the MOPS algorithm, where we allow $\mathcal M$ to be a possibly infinite model class and require a prior $\mu^1 \in \Delta(\mathcal M)$ and an optimistic likelihood function $\widetilde{\mathbb P}$ (cf. Definition D.5) as inputs. The algorithm stated in Appendix E.1 is the special case of Algorithm 5 with $|\mathcal M| < \infty$, $\widetilde{\mathbb P} = \mathbb P$, and $\mu^1 = \mathrm{Unif}(\mathcal M)$. We state the theoretical guarantee for Algorithm 5 as follows.

Theorem F.1 (MOPS). Given a $\rho$-optimistic cover $(\widetilde{\mathbb P}, \mathcal M_0)$, Algorithm 5 with $\eta_p = 1/6$, $\eta_r = 0.6$, and $\mu^1 = \mathrm{Unif}(\mathcal M_0)$ achieves the following with probability at least $1-\delta$:
\[
\mathbf{Reg}_{\mathrm{DM}} \le T\Big[ \mathrm{psc}_{\gamma/6}(\mathcal M, M^\star) + \frac{2}{\gamma} \Big] + \gamma\big[ \log|\mathcal M_0| + 3T\rho + 2\log(2/\delta) \big].
\]
Choosing the optimal $\gamma > 0$, with probability at least $1-\delta$, a suitable implementation of Algorithm 5 achieves
\[
\mathbf{Reg}_{\mathrm{DM}} \le 12 \inf_{\gamma>0}\Big\{ T\,\mathrm{psc}_\gamma(\mathcal M, M^\star) + \frac{T}{\gamma} + \gamma\big[ \mathrm{est}(\mathcal M, T) + \log(1/\delta) \big] \Big\}.
\]
When $\mathcal M$ is finite, $(\mathbb P, \mathcal M)$ itself is clearly a $0$-optimistic covering, and hence Theorem F.1 implies Theorem E.2 directly.
It is worth noting that Agarwal & Zhang (2022a) state the guarantee of MOPS in terms of a general prior, with the regret depending on a certain "prior around the true model" quantity. The proof of Theorem F.2 can be directly adapted to work in their setting; however, we remark that obtaining an explicit upper bound on their "prior around the true model" quantity in a concrete problem likely requires constructing an explicit covering, similar to Theorem F.1.

Proof of Theorem F.1. By definition,
\[
\mathbf{Reg}_{\mathrm{DM}} = \sum_{t=1}^T \mathbb E_{M\sim\mu^t}\big[ f_{M^\star}(\pi_{M^\star}) - f_{M^\star}(\pi_M) \big]
= \sum_{t=1}^T \mathbb E_{M\sim\mu^t}\Big[ f_{M^\star}(\pi_{M^\star}) - f_M(\pi_M) + \frac{\gamma}{6}\,\mathbb E_{\pi\sim p^t}\big[ D^2_{\mathrm{RL}}(M^\star(\pi), M(\pi)) \big] \Big]
+ \sum_{t=1}^T \mathbb E_{M\sim\mu^t}\Big[ f_M(\pi_M) - f_{M^\star}(\pi_M) - \frac{\gamma}{6}\,\mathbb E_{\pi\sim p^t}\big[ D^2_{\mathrm{RL}}(M^\star(\pi), M(\pi)) \big] \Big].
\]
The first sum is bounded by Corollary F.3, and the second sum is bounded by $T\,\mathrm{psc}_{\gamma/6}(\mathcal M, M^\star)$ by the definition of the PSC (note that $\pi \sim p^t$ is distributed as $\pi_{M'}$ with $M' \sim \mu^t$ an independent copy). Combining the two,
\[
\mathbf{Reg}_{\mathrm{DM}} \le \gamma\big[ \log|\mathcal M_0| + 3T\rho + 2\log(2/\delta) \big] + \frac{2T}{\gamma} + T\,\mathrm{psc}_{\gamma/6}(\mathcal M, M^\star). \qquad \square
\]

F.2 OPTIMISTIC POSTERIOR SAMPLING

In this section, we analyze the following Optimistic Posterior Sampling update under a more general setting. The problem setting and notation are the same as in the online model estimation problem introduced in Appendix D.1. Additionally, we assume that each $M \in \mathcal M$ is assigned a scalar $V_M \in [0,1]$; in our application, $V_M$ will be the optimal value of the model $M$.

Theorem F.2 (Analysis of the posterior in OPS). Fix $\rho > 0$ and a $\rho$-optimistic covering $(\widetilde{\mathbb P}, \mathcal M_0)$ of $\mathcal M$. Under the assumptions of Theorem D.1, the update rule
\[
\mu^{t+1}(M) \propto_M \mu^t(M) \cdot \exp\Big( \gamma^{-1} V_M + \eta_p \log \widetilde{\mathbb P}_M(o^t|\pi^t) - \eta_r \| r^t - R_M(o^t) \|_2^2 \Big) \tag{31}
\]
with $2\eta_p + 4\sigma^2\eta_r < 1$ and $\mu^1 = \mathrm{Unif}(\mathcal M_0)$ achieves, with probability at least $1-\delta$,
\[
\sum_{t=1}^T \mathbb E_{M\sim\mu^t}\big[ V^\star - V_M + c_0\gamma\, \mathrm{Err}^t_M \big] \le \frac{T}{8\gamma(1-2\eta_p-4\sigma^2\eta_r)} + \gamma\log|\mathcal M_0| + \gamma\big[ T\rho(2\gamma^{-1} + 2\eta_p + \eta_r) + 2\log(2/\delta) \big],
\]
where $V^\star = V_{M^\star}$ and $c_0 = \min\{ \eta_p,\, 4\sigma^2\eta_r(1-e^{-D^2/8\sigma^2})/D^2 \}$, as long as there exists $\bar M \in \mathcal M_0$ such that $M^\star$ is covered by $\bar M$ (cf. Definition D.5) and $V_{\bar M} \ge V^\star - 2\rho$.

The proof of Theorem F.2 can be found in Appendix F.2.1. As a direct corollary of Theorem F.2, the posterior $\mu^t$ maintained in the MOPS algorithm (Algorithm 5) achieves the following guarantee.

Corollary F.3. Given a $\rho$-optimistic covering $(\widetilde{\mathbb P}, \mathcal M_0)$, subroutine (30) within Algorithm 5 with $\eta_p = 1/6$, $\eta_r = 0.6$, $\gamma \ge 1$, and uniform prior $\mu^1 = \mathrm{Unif}(\mathcal M_0)$ achieves, with probability at least $1-\delta$,
\[
\sum_{t=1}^T \mathbb E_{M\sim\mu^t}\Big[ f_{M^\star}(\pi_{M^\star}) - f_M(\pi_M) + \frac{\gamma}{6}\,\mathbb E_{\pi\sim p^t}\big[ D^2_{\mathrm{RL}}(M^\star(\pi), M(\pi)) \big] \Big] \le \frac{2T}{\gamma} + \gamma\big[ \log|\mathcal M_0| + 3T\rho + 2\log(2/\delta) \big].
\]

Proof. Note that subroutine (30) in Algorithm 5 is exactly an instantiation of (31) with context $\pi^t$ sampled from the distribution $p^t$ (which depends on $\mu^t$), observation $o^t$, reward $r^t$, and $V_M = f_M(\pi_M)$. Furthermore, $\mathbb E_{M\sim\mu^t}[\mathrm{Err}^t_M]$ corresponds to $\mathbb E_{\widehat M^t\sim\mu^t}\mathbb E_{\pi^t\sim p^t}[ D^2_{\mathrm{RL}}(M^\star(\pi^t), \widehat M^t(\pi^t)) ]$ (cf. Corollary D.2).

Therefore, in order to apply Theorem F.2, it remains to verify the following: whenever $\bar M \in \mathcal M_0$ covers the ground-truth model $M^\star$ (i.e., $\| R_{\bar M}(o) - R^\star(o) \|_1 \le \rho$ and $\widetilde{\mathbb P}_{\bar M}(\cdot|\pi) \ge \mathbb P^\star(\cdot|\pi)$ for all $\pi$), it holds that $V_{\bar M} \ge V^\star - 2\rho$. Indeed, since $V_{\bar M} = f_{\bar M}(\pi_{\bar M}) \ge f_{\bar M}(\pi_{M^\star})$, we have
\[
V^\star - V_{\bar M} \le \sup_\pi \big| f_{\bar M}(\pi) - f_{M^\star}(\pi) \big| \le \sup_\pi D_{\mathrm{TV}}\big( \mathbb P^\star(\cdot|\pi), \mathbb P_{\bar M}(\cdot|\pi) \big) + \rho \le \rho^2 + \rho \le 2\rho.
\]
Now we can apply Theorem F.2 and plug in $\sigma^2 = 1/4$, $D^2 = 2$ as in Corollary D.2. Choosing $\eta_p = 1/6$, $\eta_r = 0.6$, and $\gamma \ge 1$, we have $2\gamma^{-1} + 2\eta_p + \eta_r \le 3$, $8(1-2\eta_p-4\sigma^2\eta_r) \ge 1/2$, and $c_0 = 1/6$. This completes the proof. $\square$

F.2.1 PROOF OF THEOREM F.2

For all $t \in [T]$, define the random variable
\[
\Delta^t := -\log \mathbb E_{M\sim\mu^t}\Bigg[ \exp\Bigg( \gamma^{-1}(V_M - V^\star) + \eta_p \log\frac{\widetilde{\mathbb P}_M(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} + \eta_r \delta^t_M \Bigg) \Bigg],
\]
where $\delta^t_M$ is defined as in (11). Similar to the proof of Theorem D.1, we begin by noticing that
\[
\log \mathbb E^t\big[ \exp(-\Delta^t) \big] = \log \mathbb E_{M\sim\mu^t}\mathbb E^t\Bigg[ \exp\Bigg( \gamma^{-1}(V_M - V^\star) + \eta_p \log\frac{\widetilde{\mathbb P}_M(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} + \eta_r \delta^t_M \Bigg) \Bigg]
\le (1-2\eta_p-4\sigma^2\eta_r)\log \mathbb E_{M\sim\mu^t}\Bigg[ \exp\Bigg( \frac{V_M - V^\star}{\gamma(1-2\eta_p-4\sigma^2\eta_r)} \Bigg) \Bigg] + 2\eta_p \log \mathbb E_{M\sim\mu^t}\mathbb E^t\Bigg[ \exp\Bigg( \frac12\log\frac{\widetilde{\mathbb P}_M(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} \Bigg) \Bigg] + 4\sigma^2\eta_r \log \mathbb E_{M\sim\mu^t}\mathbb E^t\Bigg[ \exp\Bigg( \frac{\delta^t_M}{4\sigma^2} \Bigg) \Bigg], \tag{34}
\]
which is due to Jensen's inequality. For the first term, abbreviate $\eta_0 := 1-2\eta_p-4\sigma^2\eta_r$ and consider $a_M := (V^\star - V_M)/(\gamma\eta_0)$. By the boundedness of $a_M$ and Hoeffding's lemma,
\[
\mathbb E_{M\sim\mu^t}\big[ \exp(-a_M) \big] \le \exp\Bigg( \frac{\mathbb E_{M\sim\mu^t}[V_M] - V^\star}{\gamma\eta_0} \Bigg) \cdot \exp\Bigg( \frac{1}{8\gamma^2\eta_0^2} \Bigg). \tag{35}
\]
The second term can be bounded as in (22):
\[
\log \mathbb E_{M\sim\mu^t}\mathbb E^t\Bigg[ \exp\Bigg( \frac12\log\frac{\widetilde{\mathbb P}_M(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} \Bigg) \Bigg] \le \log \mathbb E_{M\sim\mu^t}\Big[ 1 - \frac12\,\mathbb E^t\big[ D^2_{\mathrm H}(\mathbb P_M(\cdot|\pi^t), \mathbb P^\star(\cdot|\pi^t)) \big] + \rho \Big] \le -\frac12\,\mathbb E_{M\sim\mu^t}\mathbb E^t\big[ D^2_{\mathrm H}(\mathbb P_M(\cdot|\pi^t), \mathbb P^\star(\cdot|\pi^t)) \big] + \rho, \tag{36}
\]
and the third term can be bounded by Lemma D.3 (similar to (14)):
\[
\log \mathbb E_{M\sim\mu^t}\mathbb E^t\Bigg[ \exp\Bigg( \frac{\delta^t_M}{4\sigma^2} \Bigg) \Bigg] \le \log \mathbb E_{M\sim\mu^t}\mathbb E^t\Bigg[ \exp\Bigg( -\frac{1}{8\sigma^2}\| R_M(o^t) - R^\star(o^t) \|_2^2 \Bigg) \Bigg]
\le \log \mathbb E_{M\sim\mu^t}\mathbb E^t\Bigg[ 1 - \frac{1-e^{-D^2/8\sigma^2}}{D^2}\| R_M(o^t) - R^\star(o^t) \|_2^2 \Bigg] \le -\frac{1-e^{-D^2/8\sigma^2}}{D^2}\,\mathbb E_{M\sim\mu^t}\mathbb E^t\big[ \| R_M(o^t) - R^\star(o^t) \|_2^2 \big]. \tag{37}
\]
Plugging (35), (36), and (37) into (34) gives
\[
-\log \mathbb E^t\big[ \exp(-\Delta^t) \big] \ge \frac{V^\star - \mathbb E_{M\sim\mu^t}[V_M]}{\gamma} - \frac{1}{8\gamma^2\eta_0} + 2\eta_p\Big[ \frac12\,\mathbb E_{M\sim\mu^t}\mathbb E^t\big[ D^2_{\mathrm H}(\mathbb P_M(\cdot|\pi^t), \mathbb P^\star(\cdot|\pi^t)) \big] - \rho \Big] + 4\sigma^2\eta_r\,\frac{1-e^{-D^2/8\sigma^2}}{D^2}\,\mathbb E_{M\sim\mu^t}\mathbb E^t\big[ \| R_M(o^t) - R^\star(o^t) \|_2^2 \big]
\ge \mathbb E_{M\sim\mu^t}\big[ \gamma^{-1}(V^\star - V_M) + c_0\,\mathrm{Err}^t_M \big] - \frac{1}{8\gamma^2\eta_0} - 2\eta_p\rho. \tag{38}
\]
On the other hand, by Lemma B.2, with probability at least $1-\delta/2$,
\[
\sum_{t=1}^T \Delta^t + \log(2/\delta) \ge \sum_{t=1}^T -\log \mathbb E^t\big[ \exp(-\Delta^t) \big]. \tag{39}
\]
It remains to bound $\sum_{t=1}^T \Delta^t$. By the update rule (31) and a telescoping argument similar to (16), we have
\[
\sum_{t=1}^T \Delta^t = -\log \mathbb E_{M\sim\mu^1}\exp\Bigg( \sum_{t=1}^T \gamma^{-1}(V_M - V^\star) + \eta_p\log\frac{\widetilde{\mathbb P}_M(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} + \eta_r\delta^t_M \Bigg).
\]
The following argument is almost the same as the one used to bound (24). Fix an $\bar M \in \mathcal M_0$ that covers $M^\star$ and satisfies $V_{\bar M} - V^\star \ge -2\rho$. We bound the moment generating function
\[
\mathbb E\Bigg[ \exp\Bigg( \sum_{t=1}^T \Delta^t \Bigg) \Bigg] = \mathbb E\Bigg[ \frac{1}{\mathbb E_{M\sim\mu^1}\exp\Big( \sum_{t=1}^T \gamma^{-1}(V_M - V^\star) + \eta_p\log\frac{\widetilde{\mathbb P}_M(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} + \eta_r\delta^t_M \Big)} \Bigg]
\le |\mathcal M_0|\,\mathbb E\Bigg[ \exp\Bigg( -\sum_{t=1}^T \gamma^{-1}(V_{\bar M} - V^\star) + \eta_p\log\frac{\widetilde{\mathbb P}_{\bar M}(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} + \eta_r\delta^t_{\bar M} \Bigg) \Bigg]
\le \exp(2T\gamma^{-1}\rho)\,|\mathcal M_0|\,\mathbb E\Bigg[ \prod_{t=1}^T \exp(-\eta_r\delta^t_{\bar M}) \Bigg] \le \exp(2T\gamma^{-1}\rho + T\rho\eta_r)\,|\mathcal M_0|,
\]
where the first inequality is because $\mu^1(\bar M) = 1/|\mathcal M_0|$, the second inequality is due to $\widetilde{\mathbb P}_{\bar M} \ge \mathbb P^\star$ and $V_{\bar M} - V^\star \ge -2\rho$, and the last inequality follows from the same argument as (26): by Lemma D.3 we have $\mathbb E[\exp(-\eta_r\delta^t_{\bar M}) \mid o^t] \le \exp(\eta_r\rho)$, and applying this inequality recursively yields the desired result. Therefore, by Markov's inequality, with probability at least $1-\delta/2$,
\[
\sum_{t=1}^T \Delta^t \le \log|\mathcal M_0| + T\rho(2\gamma^{-1} + \eta_r) + \log(2/\delta). \tag{40}
\]
Summing (38) over $t \in [T]$ and taking the union bound over (40) and (39) establishes the theorem. $\square$

F.3 BOUNDING PSC BY HELLINGER DECOUPLING COEFFICIENT

The Hellinger decoupling coefficient was introduced by Agarwal & Zhang (2022a) as a structural condition for the sample efficiency of the MOPS algorithm.

Definition F.4 (Hellinger decoupling coefficient). Given $\alpha \in (0,1)$ and $\varepsilon \ge 0$, the coefficient $\mathrm{dcp}_{h,\alpha}(\mathcal M, \bar M, \varepsilon)$ is the smallest $c_h \ge 0$ such that for all $\mu \in \Delta(\mathcal M)$,
\[
\mathbb E_{M\sim\mu}\,\mathbb E^{\bar M,\pi_M}\Big[ Q^{M,\pi_M}_h(s_h,a_h) - r_h - V^{M,\pi_M}_{h+1}(s_{h+1}) \Big]
\le \Big( c_h\,\mathbb E_{M,M'\sim\mu}\,\mathbb E^{\bar M,\pi_{M'}}\Big[ D^2_{\mathrm H}\big( \mathbb P^M_h(\cdot|s_h,a_h), \mathbb P^{\bar M}_h(\cdot|s_h,a_h) \big) + \big| R^M_h(s_h,a_h) - R^{\bar M}_h(s_h,a_h) \big|^2 \Big] \Big)^\alpha + \varepsilon.
\]
The Hellinger decoupling coefficient $\mathrm{dcp}$ is then defined as
\[
\mathrm{dcp}_\alpha(\mathcal M, \bar M, \varepsilon) := \Bigg( \frac1H\sum_{h=1}^H \mathrm{dcp}_{h,\alpha}(\mathcal M, \bar M, \varepsilon)^{\alpha/(1-\alpha)} \Bigg)^{(1-\alpha)/\alpha}.
\]
We remark that the main difference between the PSC and the Hellinger decoupling coefficient is that the latter is defined in terms of Bellman errors and Hellinger distances within each layer $h \in [H]$ separately, whereas the PSC is defined in terms of the overall value function and Hellinger distances of the entire observable $(o,r)$. The following result shows that the PSC can be upper bounded by the Hellinger decoupling coefficient, and is thus a slightly more general definition.

Proposition F.5 (Bounding PSC by the Hellinger decoupling coefficient). For any $\alpha \in (0,1)$, we have
\[
\mathrm{psc}_\gamma(\mathcal M, \bar M) \le H \inf_{\varepsilon\ge0}\Bigg( (1-\alpha)\Bigg( \frac{2\alpha H\,\mathrm{dcp}_\alpha(\mathcal M, \bar M, \varepsilon)}{\gamma} \Bigg)^{\alpha/(1-\alpha)} + \varepsilon \Bigg).
\]

Proof. Fix $\bar M \in \mathcal M$ and $\alpha \in (0,1)$. Consider
\[
\Delta_{h,s_h,a_h}(M, \bar M) := D^2_{\mathrm H}\big( \mathbb P^M_h(\cdot|s_h,a_h), \mathbb P^{\bar M}_h(\cdot|s_h,a_h) \big) + \big| R^M_h(s_h,a_h) - R^{\bar M}_h(s_h,a_h) \big|^2.
\]
By the definition of $D_{\mathrm{RL}}$ and Lemma B.4, we have for any $h \in [H]$ that
\[
\mathbb E^{\bar M,\pi_{M'}}\big[ \Delta_{h,s_h,a_h}(M, \bar M) \big] \le 2 D^2_{\mathrm{RL}}\big( \bar M(\pi_{M'}), M(\pi_{M'}) \big).
\]
Fix $\varepsilon \ge 0$ and write $c_h := \mathrm{dcp}_{h,\alpha}(\mathcal M, \bar M, \varepsilon)$. For any $\mu \in \Delta(\mathcal M)$, we have
\[
\mathbb E_{M\sim\mu}\big[ f_M(\pi_M) - f_{\bar M}(\pi_M) \big] = \sum_{h=1}^H \mathbb E_{M\sim\mu}\,\mathbb E^{\bar M,\pi_M}\Big[ Q^{M,\pi_M}_h(s_h,a_h) - r_h - V^{M,\pi_M}_{h+1}(s_{h+1}) \Big]
\le \sum_{h=1}^H \Big( c_h\,\mathbb E_{M,M'\sim\mu}\,\mathbb E^{\bar M,\pi_{M'}}\big[ \Delta_{h,s_h,a_h}(M,\bar M) \big] \Big)^\alpha + H\varepsilon
\le \sum_{h=1}^H (c_h)^\alpha\Big( 2\,\mathbb E_{M,M'\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \big] \Big)^\alpha + H\varepsilon
\le \gamma\,\mathbb E_{M,M'\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \big] + (1-\alpha)\Bigg( \frac{2\alpha H}{\gamma} \Bigg)^{\alpha/(1-\alpha)}\sum_{h=1}^H (c_h)^{\alpha/(1-\alpha)} + H\varepsilon
= \gamma\,\mathbb E_{M,M'\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \big] + (1-\alpha)H^{1/(1-\alpha)}\Bigg( \frac{2\alpha\,\mathrm{dcp}_\alpha(\mathcal M,\bar M,\varepsilon)}{\gamma} \Bigg)^{\alpha/(1-\alpha)} + H\varepsilon,
\]
where the third inequality uses the fact that for all $x, y \ge 0$ and $\alpha \in (0,1)$,
\[
x^\alpha y^\alpha \le \alpha\cdot\frac{\gamma x}{\alpha H} + (1-\alpha)\Bigg( \frac{\alpha H}{\gamma} \Bigg)^{\alpha/(1-\alpha)} y^{\alpha/(1-\alpha)} = \frac{\gamma x}{H} + (1-\alpha)\Bigg( \frac{\alpha H}{\gamma} \Bigg)^{\alpha/(1-\alpha)} y^{\alpha/(1-\alpha)},
\]
by the weighted AM-GM inequality. Rearranging, taking the supremum over $\mu \in \Delta(\mathcal M)$, and then the infimum over $\varepsilon \ge 0$ yields the claimed bound. $\square$
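For completeness, the weighted AM-GM step used in the last inequality above can be verified directly (this short derivation is added for clarity):

```latex
% Weighted AM-GM: for a, b \ge 0 and \alpha \in (0,1),
%   a^{\alpha} b^{1-\alpha} \le \alpha a + (1-\alpha) b.
% Apply it with
%   a = \frac{\gamma x}{\alpha H},
%   b = \Big(\frac{\alpha H}{\gamma}\Big)^{\alpha/(1-\alpha)} y^{\alpha/(1-\alpha)},
% so that a^{\alpha} b^{1-\alpha} = x^{\alpha} y^{\alpha}. This yields
\[
x^{\alpha} y^{\alpha}
\;\le\; \alpha \cdot \frac{\gamma x}{\alpha H}
+ (1-\alpha)\Big(\frac{\alpha H}{\gamma}\Big)^{\alpha/(1-\alpha)} y^{\alpha/(1-\alpha)}
\;=\; \frac{\gamma x}{H}
+ (1-\alpha)\Big(\frac{\alpha H}{\gamma}\Big)^{\alpha/(1-\alpha)} y^{\alpha/(1-\alpha)}.
\]
```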

F.4 PROOF OF PROPOSITION E.3

Algorithm 6 OMLE
1: Input: parameter $\beta > 0$.
2: Initialize $\mathcal M^1 \leftarrow \mathcal M$.
3: for $t = 1, \ldots, T$ do
4: Compute $(M^t, \pi^t) = \arg\max_{M\in\mathcal M^t, \pi\in\Pi} f_M(\pi)$.
5: Execute $\pi^t$ and observe $\tau^t = (o^t, r^t)$.
6: Update the confidence set with (28):
\[
\mathcal M^{t+1} := \Big\{ M \in \mathcal M : L^{t+1}(M) \ge \max_{M'\in\mathcal M} L^{t+1}(M') - \beta \Big\}.
\]

Our overall argument is to bound the DEC via strong duality and a probability-matching argument (similar to Foster et al. (2021, Section 4.2)), after which we show that the resulting quantity relates nicely to the PSC. In the following, denote $\mathrm{psc}_\gamma(\mathcal M) := \sup_{\bar M\in\mathcal M}\mathrm{psc}_\gamma(\mathcal M, \bar M)$. By definition, it suffices to bound $\mathrm{dec}_\gamma(\mathcal M, \bar\mu)$ for any fixed $\bar\mu \in \Delta(\mathcal M)$. We have
\[
\mathrm{dec}_\gamma(\mathcal M, \bar\mu) = \inf_{p\in\Delta(\Pi)}\sup_{M\in\mathcal M} \mathbb E_{\bar M\sim\bar\mu}\,\mathbb E_{\pi\sim p}\big[ f_M(\pi_M) - f_M(\pi) - \gamma D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big]
= \inf_{p\in\Delta(\Pi)}\sup_{\mu\in\Delta(\mathcal M)} \mathbb E_{M\sim\mu, \bar M\sim\bar\mu}\,\mathbb E_{\pi\sim p}\big[ f_M(\pi_M) - f_M(\pi) - \gamma D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big]
= \sup_{\mu\in\Delta(\mathcal M)}\inf_{p\in\Delta(\Pi)} \mathbb E_{M\sim\mu, \bar M\sim\bar\mu}\,\mathbb E_{\pi\sim p}\big[ f_M(\pi_M) - f_M(\pi) - \gamma D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big],
\]
where the last equality follows by strong duality (Theorem B.1). Now fix any $\mu \in \Delta(\mathcal M)$; we pick $p \in \Delta(\Pi)$ by probability matching: $\pi \sim p$ is equal in distribution to $\pi_{M'}$, where $M' \sim \mu$ is an independent copy of $M$. For this choice of $p$, the quantity inside the sup-inf above equals
\[
\mathbb E_{M\sim\mu,M'\sim\mu}\,\mathbb E_{\bar M\sim\bar\mu}\big[ f_M(\pi_M) - f_M(\pi_{M'}) - \gamma D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \big]
= \mathbb E_{M\sim\mu,M'\sim\mu}\,\mathbb E_{\bar M\sim\bar\mu}\Big[ f_M(\pi_M) - f_{\bar M}(\pi_{M'}) - \frac{5\gamma}{6} D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \Big] + \mathbb E_{M\sim\mu,M'\sim\mu}\,\mathbb E_{\bar M\sim\bar\mu}\Big[ f_{\bar M}(\pi_{M'}) - f_M(\pi_{M'}) - \frac{\gamma}{6} D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \Big]
\overset{(i)}{=} \mathbb E_{M\sim\mu,M'\sim\mu}\,\mathbb E_{\bar M\sim\bar\mu}\Big[ f_M(\pi_M) - f_{\bar M}(\pi_M) - \frac{5\gamma}{6} D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \Big] + \mathbb E_{M\sim\mu,M'\sim\mu}\,\mathbb E_{\bar M\sim\bar\mu}\Big[ f_{\bar M}(\pi_{M'}) - f_M(\pi_{M'}) - \frac{\gamma}{6} D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \Big]
\overset{(ii)}{\le} \mathbb E_{M\sim\mu,M'\sim\mu}\,\mathbb E_{\bar M\sim\bar\mu}\Big[ f_M(\pi_M) - f_{\bar M}(\pi_M) - \frac{\gamma}{6} D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \Big] + \mathbb E_{M\sim\mu,M'\sim\mu}\,\mathbb E_{\bar M\sim\bar\mu}\Big[ \sqrt{H+1}\,D_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) - \frac{\gamma}{6} D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \Big]
\overset{(iii)}{\le} \mathrm{psc}_{\gamma/6}(\mathcal M) + \frac{2(H+1)}{\gamma}.
\]
Above, the first term in $(ii)$ is at most $\mathbb E_{\bar M\sim\bar\mu}[\mathrm{psc}_{\gamma/6}(\mathcal M, \bar M)] \le \mathrm{psc}_{\gamma/6}(\mathcal M)$; $(i)$ uses the fact that $f_{\bar M}(\pi_{M'})$ is equal in distribution to $f_{\bar M}(\pi_M)$ (since $M \sim \mu$ and $M' \sim \mu$ are i.i.d.); $(ii)$ uses Lemma B.7 and the fact that $5 D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \ge D^2_{\mathrm{RL}}(M(\pi_{M'}), \bar M(\pi_{M'}))$, which is due to Lemma B.6; $(iii)$ uses the inequality $\sqrt{H+1}\,x \le \frac{\gamma}{6}x^2 + \frac{3(H+1)}{2\gamma}$ for any $x \in \mathbb R$. Finally, by the arbitrariness of $\bar\mu \in \Delta(\mathcal M)$, we have shown that $\mathrm{dec}_\gamma(\mathcal M) \le \mathrm{psc}_{\gamma/6}(\mathcal M) + 2(H+1)/\gamma$. This is the desired result. $\square$

G PROOFS FOR SECTION E.2

G.1 ALGORITHM OMLE

In this section, we present the OMLE algorithm (Algorithm 6) and state the basic guarantees of its confidence sets, as follows.

Theorem G.1 (Guarantee of MLE). Choosing $\beta \ge 3\,\mathrm{est}(\mathcal M, 2T) + 3\log(1/\delta)$, Algorithm 6 achieves the following with probability at least $1-\delta$: for all $t \in [T]$, $M^\star \in \mathcal M^t$, and it holds that
\[
\sum_{s<t} D^2_{\mathrm{RL}}(M^\star(\pi^s), M(\pi^s)) \le 2\beta + 6\,\mathrm{est}(\mathcal M, 2T) + 6\log(1/\delta) \le 4\beta, \qquad \forall M \in \mathcal M^t.
\]

Proof of Theorem G.1. The proof is mainly based on the following lemma.

Lemma G.2. Fix $\rho > 0$. With probability at least $1-\delta$, it holds that for all $t \in [T]$ and $M \in \mathcal M$,
\[
\sum_{s<t} D^2_{\mathrm{RL}}(M^\star(\pi^s), M(\pi^s)) \le 2\big( L^t(M^\star) - L^t(M) \big) + 6\log\frac{N(\mathcal M,\rho)}{\delta} + 8T\rho.
\]

Now take $\rho$ attaining $\mathrm{est}(\mathcal M, 2T)$ and apply Lemma G.2. Conditional on the success of Lemma G.2, it holds that for all $t \in [T]$ and $M \in \mathcal M$,
\[
L^t(M) - L^t(M^\star) \le 3\,\mathrm{est}(\mathcal M, 2T) + 3\log(1/\delta).
\]
Therefore, our choice of $\beta$ is enough to ensure that $M^\star \in \mathcal M^t$. Then, for $M \in \mathcal M^t$, we have
\[
L^t(M) \ge \max_{M'\in\mathcal M} L^t(M') - \beta \ge L^t(M^\star) - \beta.
\]
Applying Lemma G.2 again completes the proof. $\square$

The proof of Lemma G.2 is mostly a direct adaptation of the proofs of Theorem D.1 and Theorem D.8.

Proof of Lemma G.2. For simplicity, denote $\mathbb P^\star(o|\pi) := \mathbb P_{M^\star,\pi}(o)$ and $R^\star(o) := R_{M^\star}(o)$. Pick a $\rho$-optimistic covering $(\widetilde{\mathbb P}, \mathcal M_0)$ of $\mathcal M$ such that $|\mathcal M_0| = N(\mathcal M, \rho)$. Recall that the MLE functional is defined as
\[
L^t(M) := \sum_{s=1}^{t-1} \log \mathbb P_{M,\pi^s}(o^s) - \| r^s - R_M(o^s) \|_2^2.
\]
For $M \in \mathcal M_0$, consider
\[
\ell^t_M := \log\frac{\widetilde{\mathbb P}_M(o^t|\pi^t)}{\mathbb P^\star(o^t|\pi^t)} + \delta^t_M, \qquad \delta^t_M := \| r^t - R^\star(o^t) \|_2^2 - \| r^t - R_M(o^t) \|_2^2,
\]
where the definition of $\delta^t_M$ agrees with (11). We first show that with probability at least $1-\delta$, for all $M \in \mathcal M_0$ and all $t \in [T]$,
\[
\sum_{s<t}\Bigg( 1 - \mathbb E^s\Bigg[ \sqrt{\frac{\widetilde{\mathbb P}_M(o^s|\pi^s)}{\mathbb P^\star(o^s|\pi^s)}} \Bigg] \Bigg) + \frac12\,\mathbb E^s\big[ \| R^\star(o^s) - R_M(o^s) \|_2^2 \big] \le -\sum_{s<t}\ell^s_M + 3\log\frac{|\mathcal M_0|}{\delta}. \tag{41}
\]
This is because, by Lemma B.2, it holds with probability at least $1-\delta$ that for all $t \in [T]$ and $M \in \mathcal M_0$,
\[
\sum_{s<t} -\frac13\ell^s_M + \log(|\mathcal M_0|/\delta) \ge \sum_{s<t} -\log\mathbb E^s\Big[ \exp\Big( \frac13\ell^s_M \Big) \Big].
\]
Further,
\[
-\log\mathbb E^s\Big[ \exp\Big( \frac13\ell^s_M \Big) \Big] \ge -\frac23\log\mathbb E^s\Bigg[ \exp\Bigg( \frac12\log\frac{\widetilde{\mathbb P}_M(o^s|\pi^s)}{\mathbb P^\star(o^s|\pi^s)} \Bigg) \Bigg] - \frac13\log\mathbb E^s\big[ \exp(\delta^s_M) \big]
\ge \frac13\Bigg( 1 - \mathbb E^s\Bigg[ \sqrt{\frac{\widetilde{\mathbb P}_M(o^s|\pi^s)}{\mathbb P^\star(o^s|\pi^s)}} \Bigg] \Bigg) + \frac16\,\mathbb E^s\big[ \| R^\star(o^s) - R_M(o^s) \|_2^2 \big],
\]
where the second inequality is due to the fact that $-\log x \ge 1-x$ and Lemma D.3 (with $\sigma^2 = 1/4$). Hence (41) is proven.

Now condition on the success of (41) for all elements of $\mathcal M_0$. Fix an $M \in \mathcal M$; there is an $M_0 \in \mathcal M_0$ such that $M$ is covered by $M_0$ (i.e., $\| R_{M_0}(o) - R_M(o) \|_1 \le \rho$ and $\widetilde{\mathbb P}_{M_0}(\cdot|\pi) \ge \mathbb P_M(\cdot|\pi)$ for all $\pi$). Notice that $\sum_{o\in\mathcal O}\widetilde{\mathbb P}_{M_0}(o|\pi) \le 1 + \rho^2$, and therefore $\| \widetilde{\mathbb P}_{M_0}(\cdot|\pi) - \mathbb P_M(\cdot|\pi) \|_1 \le \rho^2$. Then the first term in (41) (with $M_0$ plugged in) can be lower bounded as
\[
1 - \mathbb E^s\Bigg[ \sqrt{\frac{\widetilde{\mathbb P}_{M_0}(o^s|\pi^s)}{\mathbb P^\star(o^s|\pi^s)}} \Bigg] \ge \frac12\,\mathbb E^s\big[ D^2_{\mathrm H}(\mathbb P_M(\cdot|\pi^s), \mathbb P^\star(\cdot|\pi^s)) \big] - \rho,
\]
by (22). For the second term, by the fact that $R \in [0,1]^H$ and $\| R_{M_0}(o) - R_M(o) \|_1 \le \rho$, we have
\[
\mathbb E^s\big[ \| R^\star(o^s) - R_{M_0}(o^s) \|_2^2 \big] \ge \mathbb E^s\big[ \| R^\star(o^s) - R_M(o^s) \|_2^2 \big] - 2\rho.
\]
Similarly, $\delta^s_{M_0} \ge \delta^s_M - 2\rho$, and hence $-\sum_{s<t}\ell^s_{M_0} \le L^t(M^\star) - L^t(M) + 2T\rho$, which completes the proof. $\square$

G.2 PROOF OF THEOREM E.5

In the following, we prove a more general result.

Theorem G.3 (Full version of Theorem E.5). Choosing $\beta \ge 3\,\mathrm{est}(\mathcal M, 2T) + 3\log(1/\delta)$ with $\beta \ge 1$, with probability at least $1-\delta$, Algorithm 6 achieves
\[
\mathbf{Reg}_{\mathrm{DM}} \le \inf_{\gamma>0}\big\{ T \cdot \mathrm{mlec}_{\gamma,T}(\mathcal M, M^\star) + 4\gamma\beta \big\}.
\]
In particular, when $\mathcal M$ is finite, we can take $\beta = 3\log(|\mathcal M|/\delta)$ (because $\mathrm{est}(\mathcal M, 2T) \le \log|\mathcal M|$), and Theorem G.3 implies Theorem E.5 directly.

Proof of Theorem G.3. Condition on the success of Theorem G.1. Then, for all $t \in [T]$, it holds that $M^\star \in \mathcal M^t$. Therefore, by the choice of $(M^t, \pi^t)$, it holds that $f_{M^t}(\pi^t) \ge f_{M^\star}(\pi_{M^\star})$. Then
\[
\mathbf{Reg}_{\mathrm{DM}} = \sum_{t=1}^T \big[ f_{M^\star}(\pi_{M^\star}) - f_{M^\star}(\pi^t) \big] \le \sum_{t=1}^T \big[ f_{M^t}(\pi^t) - f_{M^\star}(\pi^t) \big]
= T\cdot\Bigg\{ \frac1T\sum_{t=1}^T \big[ f_{M^t}(\pi^t) - f_{M^\star}(\pi^t) \big] - \frac{\gamma}{T}\Bigg[ \Bigg( \max_{t\in[T]}\sum_{s\le t-1} D^2_{\mathrm{RL}}(M^\star(\pi_{M^s}), M^t(\pi_{M^s})) \Bigg) \vee 1 \Bigg] \Bigg\} + \gamma\cdot\Bigg[ \Bigg( \max_{t\in[T]}\sum_{s\le t-1} D^2_{\mathrm{RL}}(M^\star(\pi_{M^s}), M^t(\pi_{M^s})) \Bigg) \vee 1 \Bigg]
\le T\,\mathrm{mlec}_{\gamma,T}(\mathcal M, M^\star) + 4\gamma\beta,
\]
where the term in braces is bounded by $\mathrm{mlec}_{\gamma,T}(\mathcal M, M^\star)$ and the bracketed term is bounded by $4\beta$ (Theorem G.1). Taking the infimum over $\gamma > 0$ completes the proof. $\square$
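The OMLE selection rule analyzed above — play the greedy policy of the most optimistic model whose risk is within $\beta$ of the maximum — can be sketched over a finite model class as follows (a minimal illustration with our own array conventions, not the paper's implementation):

```python
import numpy as np

def omle_step(log_liks, sq_reward_errs, opt_values, beta):
    """One OMLE iteration over a finite model class.  Inputs, indexed by
    model m: log_liks[m] = sum_{s<t} log P_{m,pi^s}(o^s),
    sq_reward_errs[m] = sum_{s<t} ||r^s - R_m(o^s)||_2^2, and
    opt_values[m] = f_m(pi_m).  Returns the index of the most optimistic
    model within the beta-superlevel set of the risk L^t."""
    risk = log_liks - sq_reward_errs                       # risk L^t(m)
    confidence_set = np.flatnonzero(risk >= risk.max() - beta)
    # Play the greedy policy of the most optimistic model in the set.
    return int(confidence_set[np.argmax(opt_values[confidence_set])])
```

Note how $\beta$ trades off fit against optimism: a small $\beta$ restricts the search to well-fitting models, while a large $\beta$ lets the globally most optimistic model be selected.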

G.3 RELATIONSHIP BETWEEN PSC AND MLEC

The MLEC resembles the PSC in that both control a certain decoupling error between a family of models and their optimal policies. The main difference is that the MLEC concerns arbitrary sequences of models, whereas the PSC concerns arbitrary distributions over models. Intuitively, the sequential nature of the MLEC makes it a stronger requirement than the PSC. Formally, we show that a low MLEC under a slightly modified definition (where the $\max_{k\in[K]}$ is replaced by an average; cf. (42)) indeed implies a low PSC. Note that the modified MLEC defined in (42) is larger than the MLEC of Definition E.4; however, in most concrete applications both can be bounded by the same upper bounds. We present Definition E.4 as the main definition of the MLEC in order to capture generic classes of problems, such as RL with low Bellman-Eluder dimension or (more generally) low-Eluder-dimension Bellman representations (Proposition K.7).

Proposition G.4 (Bounding PSC by modified MLEC). Consider the following modified MLEC:
\[
\widetilde{\mathrm{mlec}}_{\gamma,K}(\mathcal M, \bar M) := \sup_{\{M^k\}\subset\mathcal M} \frac1K\sum_{k=1}^K \Big[ f_{M^k}(\pi_{M^k}) - f_{\bar M}(\pi_{M^k}) \Big] - \frac{\gamma}{K^2}\Bigg( \sum_{k=1}^K\sum_{t\le k-1} D^2_{\mathrm{RL}}\big( \bar M(\pi_{M^t}), M^k(\pi_{M^t}) \big) \Bigg). \tag{42}
\]
Then it holds that
\[
\mathrm{psc}_\gamma(\mathcal M, \bar M) \le \inf_{K\ge1}\Bigg( \widetilde{\mathrm{mlec}}_{\gamma,K}(\mathcal M, \bar M) + (\gamma+1)\sqrt{\frac{2\log(|\Pi|\wedge|\mathcal M| + 1)}{K}} \Bigg).
\]

Proof of Proposition G.4. Fix $K \ge 1$ and $\mu \in \Delta(\mathcal M)$. We first prove the bound in terms of $|\Pi|$; the proof uses the probabilistic method to establish the desired deterministic bound. We draw $K$ i.i.d. samples $M^1, \ldots, M^K$ from $\mu$, and write $\widehat\mu := \mathrm{Unif}(\{M^1, \ldots, M^K\})$. Then, with probability at least $1-(|\Pi|+1)\delta$, the following hold simultaneously:
\[
\big( \mathbb E_{M\sim\mu} - \mathbb E_{M\sim\widehat\mu} \big)\big[ f_M(\pi_M) - f_{\bar M}(\pi_M) \big] \le \sqrt{\frac{2\log(1/\delta)}{K}},
\]
\[
\big( \mathbb E_{M\sim\widehat\mu} - \mathbb E_{M\sim\mu} \big)\big[ D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big] \le \sqrt{\frac{2\log(1/\delta)}{K}}, \qquad \forall \pi \in \Pi.
\]
Therefore, with probability at least $1-(|\Pi|+1)\delta$ (over the randomness of $\widehat\mu$), we have
\[
\mathbb E_{M\sim\mu}\mathbb E_{M'\sim\mu}\big[ f_M(\pi_M) - f_{\bar M}(\pi_M) - \gamma D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \big]
\le \mathbb E_{M\sim\widehat\mu}\mathbb E_{M'\sim\widehat\mu}\big[ f_M(\pi_M) - f_{\bar M}(\pi_M) - \gamma D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \big] + (1+\gamma)\sqrt{\frac{2\log(1/\delta)}{K}}
\le \frac1K\sum_{k=1}^K \Big[ f_{M^k}(\pi_{M^k}) - f_{\bar M}(\pi_{M^k}) \Big] - \frac{\gamma}{K^2}\Bigg( \sum_{k=1}^K\sum_{t\le k-1} D^2_{\mathrm{RL}}(\bar M(\pi_{M^t}), M^k(\pi_{M^t})) \Bigg) + (1+\gamma)\sqrt{\frac{2\log(1/\delta)}{K}}
\le \widetilde{\mathrm{mlec}}_{\gamma,K}(\mathcal M, \bar M) + (1+\gamma)\sqrt{\frac{2\log(1/\delta)}{K}},
\]
where the second inequality drops the nonnegative terms $D^2_{\mathrm{RL}}$ with $t \ge k$. In particular, for any $\delta < 1/(1+|\Pi|)$, the above holds with positive probability, and thus there exists a realization $\widehat\mu$ for which it holds. Taking $\delta \uparrow 1/(1+|\Pi|)$ on the right-hand side and the supremum over $\mu \in \Delta(\mathcal M)$ on the left-hand side, we get
\[
\mathrm{psc}_\gamma(\mathcal M, \bar M) \le \widetilde{\mathrm{mlec}}_{\gamma,K}(\mathcal M, \bar M) + (1+\gamma)\sqrt{\frac{2\log(|\Pi|+1)}{K}}.
\]
The bound in terms of $|\mathcal M|$ follows analogously, by noticing that with probability at least $1-(|\mathcal M|+1)\delta$, the following hold simultaneously:
\[
\big( \mathbb E_{M\sim\mu} - \mathbb E_{M\sim\widehat\mu} \big)\big[ f_M(\pi_M) - f_{\bar M}(\pi_M) \big] \le \sqrt{\frac{2\log(1/\delta)}{K}},
\]
\[
\big( \mathbb E_{M'\sim\widehat\mu} - \mathbb E_{M'\sim\mu} \big)\big[ D^2_{\mathrm{RL}}(\bar M(\pi_{M'}), M(\pi_{M'})) \big] \le \sqrt{\frac{2\log(1/\delta)}{K}}, \qquad \forall M \in \mathcal M,
\]
and repeating the same argument as above. $\square$

Algorithm 7 EXPLORATIVE E2D WITH TEMPERED AGGREGATION (EXPLORATIVE E2D)
Input: parameter $\gamma > 0$; learning rates $\eta_p \in (0, \frac12)$, $\eta_r > 0$.
1: Initialize $\mu^1 \leftarrow \mathrm{Unif}(\mathcal M)$.
2: for $t = 1, \ldots, T$ do
3: Set $(p^t_{\rm exp}, p^t_{\rm out}) \leftarrow \arg\min_{(p_{\rm exp}, p_{\rm out})\in\Delta(\Pi)^2} \widehat V^{\mu^t}_{{\rm pac},\gamma}(p_{\rm exp}, p_{\rm out})$, where $\widehat V^{\mu^t}_{{\rm pac},\gamma}$ is defined in (6).
4: Sample $\pi^t \sim p^t_{\rm exp}$. Execute $\pi^t$ and observe $(o^t, r^t)$.
5: Update the randomized model estimator by Tempered Aggregation:
\[
\mu^{t+1}(M) \propto_M \mu^t(M)\cdot\exp\big( \eta_p\log\mathbb P_{M,\pi^t}(o^t) - \eta_r\| r^t - R_M(o^t) \|_2^2 \big).
\]
Output: policy $\widehat p_{\rm out} := \frac1T\sum_{t=1}^T p^t_{\rm out}$.

H PROOFS FOR SECTION 4.1

We describe the full EXPLORATIVE E2D algorithm in Algorithm 7.
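As a companion to Algorithm 7, here is a minimal Python skeleton of the EXPLORATIVE E2D loop. The oracle `solve_pac_risk` (returning a minimizer of $\widehat V^{\mu^t}_{{\rm pac},\gamma}$) and the environment interface `sample_env` are hypothetical abstractions of ours, not part of the paper's specification:

```python
import numpy as np

def explorative_e2d(models, solve_pac_risk, sample_env, T, gamma,
                    eta_p=1/3, eta_r=1/3, rng=None):
    """Skeleton of the EXPLORATIVE E2D loop (a sketch under assumptions):
    - solve_pac_risk(mu, gamma): hypothetical oracle returning distributions
      (p_exp, p_out) over policies minimizing the PAC risk for posterior mu;
    - sample_env(pi): runs policy index pi once and returns, per model m,
      the log-likelihood of the observation and the squared reward error."""
    rng = rng or np.random.default_rng(0)
    mu = np.full(len(models), 1.0 / len(models))   # uniform prior over models
    p_out_avg = None
    for _ in range(T):
        p_exp, p_out = solve_pac_risk(mu, gamma)
        pi = rng.choice(len(p_exp), p=p_exp)       # sample exploration policy
        log_liks, sq_errs = sample_env(pi)
        # Tempered Aggregation update (no optimism term, unlike OPS/MOPS):
        logits = np.log(mu) + eta_p * log_liks - eta_r * sq_errs
        logits -= logits.max()
        mu = np.exp(logits)
        mu /= mu.sum()
        p_out_avg = p_out if p_out_avg is None else p_out_avg + p_out
    return p_out_avg / T                           # output policy distribution
```

The output is the average of the per-round output distributions, matching the final step of Algorithm 7.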

H.1 PROOF OF THEOREM 5

We first show that Algorithm 7 achieves
\[
f_{M^\star}(\pi_{M^\star}) - \mathbb E_{\pi\sim\widehat p_{\rm out}}\big[ f_{M^\star}(\pi) \big] \le \mathrm{edec}_\gamma(\mathcal M) + \frac{\gamma\,\mathbf{Est}_{\mathrm{RL}}}{T}, \tag{44}
\]
where $\mathbf{Est}_{\mathrm{RL}}$ denotes the following online estimation error (cf. (18)):
\[
\mathbf{Est}_{\mathrm{RL}} := \sum_{t=1}^T \mathbb E_{\pi^t\sim p^t_{\rm exp}}\mathbb E_{\widehat M^t\sim\mu^t}\big[ D^2_{\mathrm{RL}}(\widehat M^t(\pi^t), M^\star(\pi^t)) \big].
\]
We have
\[
\sum_{t=1}^T \mathbb E_{\pi\sim p^t_{\rm out}}\big[ f_{M^\star}(\pi_{M^\star}) - f_{M^\star}(\pi) \big]
= \sum_{t=1}^T \Big\{ \mathbb E_{\pi\sim p^t_{\rm out}}\big[ f_{M^\star}(\pi_{M^\star}) - f_{M^\star}(\pi) \big] - \gamma\,\mathbb E_{\pi\sim p^t_{\rm exp}}\mathbb E_{\widehat M^t\sim\mu^t}\big[ D^2_{\mathrm{RL}}(M^\star(\pi), \widehat M^t(\pi)) \big] \Big\} + \gamma\sum_{t=1}^T \mathbb E_{\pi^t\sim p^t_{\rm exp}}\mathbb E_{\widehat M^t\sim\mu^t}\big[ D^2_{\mathrm{RL}}(M^\star(\pi^t), \widehat M^t(\pi^t)) \big]
\overset{(i)}{\le} \sum_{t=1}^T \sup_{M\in\mathcal M}\Big\{ \mathbb E_{\pi\sim p^t_{\rm out}}\big[ f_M(\pi_M) - f_M(\pi) \big] - \gamma\,\mathbb E_{\pi\sim p^t_{\rm exp}}\mathbb E_{\widehat M^t\sim\mu^t}\big[ D^2_{\mathrm{RL}}(M(\pi), \widehat M^t(\pi)) \big] \Big\} + \gamma\,\mathbf{Est}_{\mathrm{RL}}
\overset{(ii)}{=} \sum_{t=1}^T \widehat V^{\mu^t}_{{\rm pac},\gamma}(p^t_{\rm exp}, p^t_{\rm out}) + \gamma\,\mathbf{Est}_{\mathrm{RL}}
\overset{(iii)}{=} \sum_{t=1}^T \mathrm{edec}_\gamma(\mathcal M, \mu^t) + \gamma\,\mathbf{Est}_{\mathrm{RL}} \le T\,\mathrm{edec}_\gamma(\mathcal M) + \gamma\,\mathbf{Est}_{\mathrm{RL}}.
\]
Above, $(i)$ follows by the realizability assumption $M^\star \in \mathcal M$; $(ii)$ follows by the definition of the risk $\widehat V^{\mu^t}_{{\rm pac},\gamma}$ (cf. (6)) together with the fact that $(p^t_{\rm exp}, p^t_{\rm out})$ minimizes $\widehat V^{\mu^t}_{{\rm pac},\gamma}(\cdot,\cdot)$ in Algorithm 7, so that $\widehat V^{\mu^t}_{{\rm pac},\gamma}(p^t_{\rm exp}, p^t_{\rm out}) = \inf_{(p_{\rm exp},p_{\rm out})\in\Delta(\Pi)^2}\widehat V^{\mu^t}_{{\rm pac},\gamma}(p_{\rm exp}, p_{\rm out})$; $(iii)$ follows by the definition of $\mathrm{edec}_\gamma(\mathcal M, \mu^t)$. Dividing both sides by $T$ proves (44), since $\widehat p_{\rm out} = \frac1T\sum_t p^t_{\rm out}$. Combining (44) with the online estimation guarantee of Tempered Aggregation (Corollary D.2) completes the proof. $\square$

Theorem 5 can be extended to the more general case of infinite model classes. Combining (44) with Proposition D.6, we can establish the following general guarantee of EXPLORATIVE E2D WITH COVERING.

Theorem H.1 (EXPLORATIVE E2D WITH COVERING). Suppose the model class $\mathcal M$ admits a $\rho$-optimistic cover $(\widetilde{\mathbb P}, \mathcal M_0)$. Then Algorithm 7, with the Tempered Aggregation subroutine (3) replaced by (19), with $\eta_p = \eta_r = 1/3$ and $\mu^1 = \mathrm{Unif}(\mathcal M_0)$, achieves the following with probability at least $1-\delta$:
\[
\mathrm{SubOpt} \le \mathrm{edec}_\gamma(\mathcal M) + \frac{10\gamma}{T}\big[ \log|\mathcal M_0| + T\rho + 2\log(2/\delta) \big].
\]
By tuning $\gamma > 0$, with probability at least $1-\delta$, Algorithm 7 achieves
\[
\mathrm{SubOpt} \le C\inf_{\gamma>0}\Big\{ \mathrm{edec}_\gamma(\mathcal M) + \frac{\gamma}{T}\big[ \mathrm{est}(\mathcal M, T) + \log(1/\delta) \big] \Big\},
\]
where $C$ is a universal constant.

H.2 PROOF OF PROPOSITION 6

We present the full version of Proposition 6 here, and then provide its proof, which is a generalization of Foster et al. (2021, Theorem 3.2).

Proposition H.2 (PAC lower bound). Consider a model class $\mathcal M$ and a fixed integer $T \ge 1$. Define
\[
V(\mathcal M) := 3\sup_{M,\bar M}\sup_{\pi,o}\frac{\mathbb P_M(o|\pi)}{\mathbb P_{\bar M}(o|\pi)}, \qquad C(T) := 2^{13}\log(2T\wedge V(\mathcal M)), \qquad \varepsilon_\gamma := \frac{\gamma}{C(T)T}.
\]
Then we can assign to each model $M \in \mathcal M$ a reward distribution (with $r \in [-2,2]^H$ almost surely) such that for any algorithm $\mathfrak A$, there exists a model $M \in \mathcal M$ for which
\[
\mathbb E_{M,\mathfrak A}[\mathrm{SubOpt}] \ge \frac16\max_{\gamma>0}\sup_{\bar M\in\mathcal M}\mathrm{edec}_\gamma\big( \mathcal M^\infty_{\varepsilon_\gamma}(\bar M), \bar M \big),
\]
where we define $g_M(\pi) := f_M(\pi_M) - f_M(\pi)$ and the localization
\[
\mathcal M^\infty_\varepsilon(\bar M) := \big\{ M \in \mathcal M : \big| g_M(\pi) - g_{\bar M}(\pi) \big| \le \varepsilon, \ \forall \pi \in \Pi \big\}.
\]

Proof. First, we specify the reward distribution for each $M \in \mathcal M$, given its mean reward function $R_M : \mathcal O \to [0,1]^H$: conditional on the observation $o$, we let $r = (r_h)$ be a random vector, with each entry $r_h$ independently sampled from
\[
\mathbb P_M\Big( r_h = -\tfrac12 \,\Big|\, o \Big) = \frac34 - \frac{R^M_h(o)}{2}, \qquad \mathbb P_M\Big( r_h = \tfrac32 \,\Big|\, o \Big) = \frac14 + \frac{R^M_h(o)}{2}.
\]
Then a simple calculation gives (writing $D^2_{\mathrm H}(R^M_h(o), R^{\bar M}_h(o))$ for the squared Hellinger distance between the two reward distributions with these means)
\[
D^2_{\mathrm H}\big( R^M_h(o), R^{\bar M}_h(o) \big) \le \frac12\big| R^M_h(o) - R^{\bar M}_h(o) \big|^2.
\]
Therefore, by the fact that the $(r_h)$ are mutually independent conditional on $o$, we have
\[
1 - \frac12 D^2_{\mathrm H}\big( R_M(o), R_{\bar M}(o) \big) = \prod_h\Big( 1 - \frac12 D^2_{\mathrm H}\big( R^M_h(o), R^{\bar M}_h(o) \big) \Big) \ge \prod_h\Big( 1 - \frac14\big| R^M_h(o) - R^{\bar M}_h(o) \big|^2 \Big) \ge \exp\Big( -\log(4/3)\big\| R_M(o) - R_{\bar M}(o) \big\|_2^2 \Big) \ge 1 - \log(4/3)\big\| R_M(o) - R_{\bar M}(o) \big\|_2^2,
\]
where the second inequality is because $1 - x/4 \ge \exp(-\log(4/3)x)$ for all $x \in [0,1]$. Therefore, by Lemma B.4,
\[
D^2_{\mathrm H}\big( M(\pi), \bar M(\pi) \big) \le 3 D^2_{\mathrm H}\big( \mathbb P_M(\pi), \mathbb P_{\bar M}(\pi) \big) + 2\,\mathbb E_{o\sim\mathbb P_{\bar M}(\pi)}\Big[ D^2_{\mathrm H}\big( R_M(o), R_{\bar M}(o) \big) \Big] \le 3 D^2_{\mathrm H}\big( \mathbb P_M(\pi), \mathbb P_{\bar M}(\pi) \big) + 3\,\mathbb E_{o\sim\mathbb P_{\bar M}(\pi)}\Big[ \big\| R_M(o) - R_{\bar M}(o) \big\|_2^2 \Big] \le 3 D^2_{\mathrm{RL}}\big( M(\pi), \bar M(\pi) \big).
\]
Next, suppose that the algorithm $\mathfrak A$ is given by rules $p = \{ p^{(t)}_{\rm exp}(\cdot|\cdot) \}_{t=1}^T \cup \{ p_{\rm out}(\cdot|\cdot) \}$, where $p^{(t)}_{\rm exp}(\cdot|\mathcal H^{(t-1)}) \in \Delta(\Pi)$ is the rule of interaction at the $t$-th step (given the history $\mathcal H^{(t-1)}$ before the $t$-th step), and $p_{\rm out}(\cdot|\mathcal H^{(T)}) \in \Delta(\Pi)$ is the rule for the output policy.
For $M \in \mathcal M$, we define
\[
p_{M,\rm exp} := \mathbb E_{M,\mathfrak A}\Bigg[ \frac1T\sum_{t=1}^T p^{(t)}_{\rm exp}\big( \cdot\,\big|\,\mathcal H^{(t-1)} \big) \Bigg] \in \Delta(\Pi), \qquad p_{M,\rm out} := \mathbb E_{M,\mathfrak A}\big[ p_{\rm out}(\cdot|\mathcal H^{(T)}) \big] \in \Delta(\Pi),
\]
where $\mathbb P^{M,\mathfrak A}$ is the probability distribution induced by $\mathfrak A$ when interacting with $M$. Notice that $\mathbb E_{M,\mathfrak A}[\mathrm{SubOpt}] = \mathbb E_{\pi\sim p_{M,\rm out}}[g_M(\pi)]$.

Abbreviate $\underline{\mathrm{edec}} := \sup_{\bar M\in\mathcal M}\mathrm{edec}_\gamma(\mathcal M^\infty_\varepsilon(\bar M), \bar M)$, and let $\bar M \in \mathcal M$ attain the supremum. Then
\[
\sup_{M\in\mathcal M^\infty_\varepsilon(\bar M)} \mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ f_M(\pi_M) - f_M(\pi) \big] - \gamma\,\mathbb E_{\pi\sim p_{\bar M,\rm exp}}\big[ D^2_{\mathrm{RL}}(M(\pi), \bar M(\pi)) \big] \ge \underline{\mathrm{edec}}.
\]
Let $M \in \mathcal M^\infty_\varepsilon(\bar M)$ attain the supremum above. Then we have
\[
\mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ g_M(\pi) \big] \ge \gamma\,\mathbb E_{\pi\sim p_{\bar M,\rm exp}}\big[ D^2_{\mathrm{RL}}(M(\pi), \bar M(\pi)) \big] + \underline{\mathrm{edec}} \ge \frac{\gamma}{3}\,\mathbb E_{\pi\sim p_{\bar M,\rm exp}}\big[ D^2_{\mathrm H}(M(\pi), \bar M(\pi)) \big] + \underline{\mathrm{edec}}.
\]
Recall from the definition of $\mathcal M^\infty_\varepsilon(\bar M)$ that $|g_M(\pi) - g_{\bar M}(\pi)| \le \varepsilon$ for all $\pi$. Hence, by Lemma B.3, it holds that
\[
\Big| \mathbb E_{\pi\sim p_{M,\rm out}}\big[ g_M(\pi) - g_{\bar M}(\pi) \big] - \mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ g_M(\pi) - g_{\bar M}(\pi) \big] \Big| \le \sqrt{ 8\varepsilon\cdot\Big( \mathbb E_{\pi\sim p_{M,\rm out}}\big[ g_M(\pi) + g_{\bar M}(\pi) \big] + \mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ g_M(\pi) + g_{\bar M}(\pi) \big] \Big)\cdot D^2_{\mathrm H}\big( \mathbb P^{M,\mathfrak A}, \mathbb P^{\bar M,\mathfrak A} \big) }
\le 4\varepsilon D^2_{\mathrm H}\big( \mathbb P^{M,\mathfrak A}, \mathbb P^{\bar M,\mathfrak A} \big) + \frac12\Big( \mathbb E_{\pi\sim p_{M,\rm out}}\big[ g_M(\pi) + g_{\bar M}(\pi) \big] + \mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ g_M(\pi) + g_{\bar M}(\pi) \big] \Big),
\]
which implies
\[
\mathbb E_{\pi\sim p_{M,\rm out}}\big[ g_M(\pi) \big] + \mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ g_{\bar M}(\pi) \big] \ge \frac13\,\mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ g_M(\pi) \big] + \frac13\,\mathbb E_{\pi\sim p_{M,\rm out}}\big[ g_{\bar M}(\pi) \big] - \frac83\varepsilon D^2_{\mathrm H}\big( \mathbb P^{M,\mathfrak A}, \mathbb P^{\bar M,\mathfrak A} \big).
\]
Furthermore, by the subadditivity of the squared Hellinger distance (Foster et al., 2021, Lemma A.13), we have
\[
D^2_{\mathrm H}\big( \mathbb P^{M,\mathfrak A}, \mathbb P^{\bar M,\mathfrak A} \big) \le C_T\sum_{t=1}^T \mathbb E_{\bar M,\mathfrak A}\Big[ D^2_{\mathrm H}\big( M(\pi^{(t)}), \bar M(\pi^{(t)}) \big) \Big] = C_T T\cdot\mathbb E_{\pi\sim p_{\bar M,\rm exp}}\big[ D^2_{\mathrm H}(M(\pi), \bar M(\pi)) \big],
\]
where $C_T := 2^8(\log(2T)\wedge\log V(\mathcal M))$ as in the proof of Foster et al. (2021, Theorem 3.2). As long as $24 C_T T\varepsilon \le \gamma$, combining the displays above yields
\[
\mathbb E_{\pi\sim p_{M,\rm out}}\big[ g_M(\pi) \big] + \mathbb E_{\pi\sim p_{\bar M,\rm out}}\big[ g_{\bar M}(\pi) \big] \ge \frac13\underline{\mathrm{edec}}.
\]
Since $\mathbb E_{M,\mathfrak A}[\mathrm{SubOpt}] = \mathbb E_{\pi\sim p_{M,\rm out}}[g_M(\pi)]$, at least one of $M$ and $\bar M$ must incur $\mathbb E[\mathrm{SubOpt}] \ge \underline{\mathrm{edec}}/6$. This completes the proof. $\square$
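The two facts about the constructed reward distribution used in the proof — that its mean equals $R^M_h(o)$, and that the squared Hellinger distance between two such distributions with means $R, R'$ is at most $\frac12|R - R'|^2$ — can be checked numerically. The snippet below is an illustrative sanity check of ours, not part of the proof:

```python
import numpy as np

def reward_dist(R):
    """Two-point reward distribution from the construction: r takes values
    in {-1/2, 3/2} with P(r = 3/2) = 1/4 + R/2, so that E[r] = R."""
    return np.array([3/4 - R/2, 1/4 + R/2])   # [P(r = -1/2), P(r = 3/2)]

def hellinger_sq(p, q):
    """Squared Hellinger distance D_H^2(p, q) = sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Since both outcome probabilities stay in $[1/4, 3/4]$ for $R \in [0,1]$, the Hellinger bound holds with room to spare over the whole range.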

H.3 PROOF OF PROPOSITION 7

Fix a $\mu \in \Delta(\mathcal M)$, and take
\[
(p_{\rm exp}, p_{\rm out}) := \arg\min_{(p_{\rm exp}, p_{\rm out})\in\Delta(\Pi)^2}\sup_{M\in\mathcal M}\Big\{ \mathbb E_{\pi\sim p_{\rm out}}\big[ f_M(\pi_M) - f_M(\pi) \big] - \gamma\,\mathbb E_{\pi\sim p_{\rm exp}}\mathbb E_{\bar M\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big] \Big\}.
\]
Then consider $p := \alpha p_{\rm exp} + (1-\alpha)p_{\rm out}$. By definition,
\[
\mathrm{dec}_\gamma(\mathcal M, \mu) \le \sup_{M\in\mathcal M}\mathbb E_{\pi\sim p}\big[ f_M(\pi_M) - f_M(\pi) \big] - \gamma\,\mathbb E_{\pi\sim p}\mathbb E_{\bar M\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big]
= \sup_{M\in\mathcal M}\Big\{ \alpha\,\mathbb E_{\pi\sim p_{\rm exp}}\big[ f_M(\pi_M) - f_M(\pi) \big] - \gamma\alpha\,\mathbb E_{\pi\sim p_{\rm exp}}\mathbb E_{\bar M\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big] + (1-\alpha)\,\mathbb E_{\pi\sim p_{\rm out}}\big[ f_M(\pi_M) - f_M(\pi) \big] - \gamma(1-\alpha)\,\mathbb E_{\pi\sim p_{\rm out}}\mathbb E_{\bar M\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big] \Big\}
\le \sup_{M\in\mathcal M}\Big\{ \alpha + (1-\alpha)\,\mathbb E_{\pi\sim p_{\rm out}}\big[ f_M(\pi_M) - f_M(\pi) \big] - \alpha\gamma\,\mathbb E_{\pi\sim p_{\rm exp}}\mathbb E_{\bar M\sim\mu}\big[ D^2_{\mathrm{RL}}(\bar M(\pi), M(\pi)) \big] \Big\}
= \alpha + (1-\alpha)\,\mathrm{edec}_{\alpha\gamma/(1-\alpha)}(\mathcal M, \mu). \qquad \square
\]
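To illustrate the cost of this conversion quantitatively, the following sketch numerically optimizes $\gamma$ under the toy scaling $\mathrm{edec}_\gamma \approx d/\gamma$ (an assumption made purely for illustration), comparing the rate obtained from the EDEC directly with the rate obtained through the implied DEC bound $\mathrm{dec}_\gamma \lesssim \sqrt{d/\gamma}$:

```python
import numpy as np

def pac_rate_via_edec(d, logM, T):
    """Suboptimality when edec_gamma ≈ d/gamma: the optimal gamma makes
    inf_gamma { d/gamma + gamma*logM/T } = 2*sqrt(d*logM/T)."""
    gammas = np.logspace(-2, 6, 4000)
    return float((d / gammas + gammas * logM / T).min())

def pac_rate_via_dec(d, logM, T):
    """Suboptimality via the implied DEC bound dec_gamma ≈ sqrt(d/gamma):
    inf_gamma { sqrt(d/gamma) + gamma*logM/T } scales as (d*logM/T)^(1/3)."""
    gammas = np.logspace(-2, 8, 4000)
    return float((np.sqrt(d / gammas) + gammas * logM / T).min())
```

For instance, with $d = \log|\mathcal M| = 10$ and $T = 10^6$, the EDEC route gives suboptimality about $0.02$, while the DEC route gives about $0.09$ — the $T^{-1/2}$ versus $T^{-1/3}$ gap discussed in the next subsection.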

H.3.1 ADDITIONAL DISCUSSIONS ON BOUNDING DEC BY EDEC

Here we argue that, for classes with low EDEC, obtaining a PAC sample complexity through the implied DEC bound is in general worse than using the EDEC bound directly. Consider any model class $\mathcal M$ with $\mathrm{edec}_\gamma(\mathcal M) \lesssim d/\gamma$, where $d$ is some dimension-like complexity measure. Using the EXPLORATIVE E2D algorithm, by Theorem 5, the suboptimality of the output policy scales as
\[
f_{M^\star}(\pi_{M^\star}) - \mathbb E_{\pi\sim\widehat p_{\rm out}}\big[ f_{M^\star}(\pi) \big] \le \mathrm{edec}_\gamma(\mathcal M) + \frac{10\gamma\log(|\mathcal M|/\delta)}{T} \lesssim \frac{d}{\gamma} + \frac{\gamma}{T}\log(|\mathcal M|/\delta) \lesssim \sqrt{\frac{d\log(|\mathcal M|/\delta)}{T}},
\]
where the last inequality follows by choosing the optimal $\gamma > 0$. This implies a PAC sample complexity of $d\log(|\mathcal M|/\delta)/\varepsilon^2$ for finding an $\varepsilon$-near-optimal policy.

By contrast, suppose we use an algorithm designed for low-DEC problems (such as E2D-TA). First bounding the DEC by the EDEC via Proposition 7, we have
\[
\mathrm{dec}_\gamma(\mathcal M) \le \inf_{\alpha>0}\big\{ \alpha + (1-\alpha)\,\mathrm{edec}_{\gamma\alpha/(1-\alpha)}(\mathcal M) \big\} \le \inf_{\alpha>0}\Big\{ \alpha + \frac{(1-\alpha)^2 d}{\alpha\gamma} \Big\} \lesssim \sqrt{\frac{d}{\gamma}}.
\]
Then, using the E2D-TA algorithm, by Theorem 2 and the online-to-batch conversion, the suboptimality of the average policy scales as
\[
\frac{\mathbf{Reg}_{\mathrm{DM}}}{T} = \frac1T\sum_{t=1}^T\Big( f_{M^\star}(\pi_{M^\star}) - \mathbb E_{\pi^t\sim p^t}\big[ f_{M^\star}(\pi^t) \big] \Big) \le \mathrm{dec}_\gamma(\mathcal M) + \frac{10\gamma\log(|\mathcal M|/\delta)}{T} \lesssim \sqrt{\frac{d}{\gamma}} + \frac{\gamma}{T}\log(|\mathcal M|/\delta) \lesssim \Bigg( \frac{d\log(|\mathcal M|/\delta)}{T} \Bigg)^{1/3},
\]
where the last inequality follows by choosing the optimal $\gamma > 0$. This implies a PAC sample complexity of $d\log(|\mathcal M|/\delta)/\varepsilon^3$ for finding an $\varepsilon$-near-optimal policy, which is a $1/\varepsilon$ factor worse than that obtained from the EDEC directly. Note that this $1/\varepsilon^3$ rate is the same as that obtained from the standard explore-then-commit conversion, which turns a PAC algorithm with $1/\varepsilon^2$ sample complexity into a no-regret algorithm. We remark that the same calculation also holds in general for problems with $\mathrm{edec}_\gamma(\mathcal M) \lesssim 1/\gamma^\beta$ (when only highlighting the dependence on $\gamma$) for some $\beta > 0$. In that case, the EDEC yields a PAC sample complexity scaling as $(1/\varepsilon)^{1+1/\beta}$, whereas the conversion through the DEC again loses a $1/\varepsilon$ factor.

We next state a lower bound on the EDEC for tabular MDPs.

Proposition H.3 (Lower bound for tabular MDPs). There exists a class $\mathcal M$ of tabular MDPs with $S$ states, $A$ actions, and horizon $H$, together with a reference model $\bar M \in \mathcal M$, such that
\[
\mathrm{edec}_\gamma\big( \mathcal M^\infty_\varepsilon(\bar M), \bar M \big) \ge c_1\min\Big\{ 1, \frac{HSA}{\gamma} \Big\}
\]
for all $\gamma > 0$ such that $\varepsilon \ge c_2 HSA/\gamma$, where $c_1, c_2$ are universal constants.
As a corollary, applying the PAC lower bound in terms of the EDEC (Proposition H.2), we have that for any algorithm $\mathfrak A$ that interacts with the environment for $T$ episodes,
\[
\sup_{M\in\mathcal M}\mathbb E_{M,\mathfrak A}[\mathrm{SubOpt}] \ge c_0\min\Big\{ 1, \sqrt{\frac{HSA}{T}} \Big\},
\]
where $c_0$ is a universal constant. Proposition H.3 thus implies a sample complexity lower bound of $\Omega(HSA/\varepsilon^2)$ for learning an $\varepsilon$-optimal policy in tabular MDPs. This simple example illustrates the power of the EDEC as a lower bound for PAC learning, analogous to the role of the DEC for no-regret learning. Moreover, notice that all models in $\mathcal M$ have the same reward function (denote it by $R$); hence, for this class $\mathcal M$, it holds that $\mathrm{edec}_\gamma(\mathcal M^\infty_\varepsilon(\bar M), \bar M) \le \mathrm{rfdec}_\gamma(\mathcal M^{\infty,\mathrm{rf}}_\varepsilon(\mathcal P), \bar P)$ for all $\bar M = (\bar P, R) \in \mathcal M$. Combining this fact with Proposition H.3 gives a lower bound on the RFDEC of $\mathcal M$, and hence we can also obtain a sample complexity lower bound for reward-free learning in tabular MDPs.

Proof. Without loss of generality, we assume that $S = 2^{n+1} + 1$ and let $S_1 = 2^n$. We also write $A' = A - 1$ and $H' = H - n \ge H/2$. Fix a $\Delta \in (0, \frac13]$; we consider the class $\mathcal M_\Delta$ of MDPs described as follows.
1. The state space is $\mathcal S = \mathcal S_{\rm tree} \sqcup \{ s_\oplus, s_\ominus \}$, where $\mathcal S_{\rm tree}$ is a binary tree of depth $n+1$ (hence $|\mathcal S_{\rm tree}| = 2^{n+1} - 1$), and $s_\oplus, s_\ominus$ are two auxiliary states. Let $s_0$ be the root of $\mathcal S_{\rm tree}$, and let $\mathcal S_1$ be the set of leaves of $\mathcal S_{\rm tree}$ (hence $|\mathcal S_1| = 2^n = S_1$).

2. Each episode has horizon $H$.

3. The reward function is fixed and known: arriving at $s_\oplus$ yields a reward of $1$, and all other states yield reward $0$.

4. For $h^\star\in\mathcal{H}' := \{n+1, \dots, H\}$, $s^\star\in\mathcal{S}'$, $a^\star\in[A']$, the transition dynamics of $M = M_{h^\star,s^\star,a^\star}$ are defined as follows:

- The initial state is always $s_0$.
- At a non-leaf node $s\in\mathcal{S}_{\rm tree}$, there are two available actions, left and right, which lead to the left and right child of $s$, respectively.
- At a leaf $s\in\mathcal{S}'$, there are $A$ actions: wait, $1, \dots, A-1$. The dynamics of $M = M_{h^\star,s^\star,a^\star}\in\mathcal{M}$ at $s$ are given by $P^M_h(s\mid s, \mathrm{wait}) = 1$, and for $a\in[A']$, $h\in[H]$,
$$
P^M_h(s_\oplus\mid s, a) = \tfrac12 + \Delta\cdot\mathbb{1}(h = h^\star, s = s^\star, a = a^\star), \qquad
P^M_h(s_\ominus\mid s, a) = \tfrac12 - \Delta\cdot\mathbb{1}(h = h^\star, s = s^\star, a = a^\star).
$$
- The state $s_\oplus$ always transits to $s_\ominus$, and $s_\ominus$ is the absorbing state (i.e., $P(s_\ominus\mid s_\oplus, \cdot) = 1$ and $P(s_\ominus\mid s_\ominus, \cdot) = 1$).

Let $\overline{M}$ be the MDP with the same transition dynamics and reward function as above, except that for all $h\in[H]$, $s\in\mathcal{S}'$, $a\in[A']$ it holds that $P^{\overline{M}}_h(s_\oplus\mid s, a) = P^{\overline{M}}_h(s_\ominus\mid s, a) = \tfrac12$. Note that $\overline{M}$ does not depend on $\Delta$. We then define
$$
\mathcal{M}_\Delta = \{\overline{M}\} \cup \big\{ M_{h,s,a} : (h,s,a)\in\mathcal{H}'\times\mathcal{S}'\times[A'] \big\}.
$$
Before lower bounding $\operatorname{edec}_\gamma(\mathcal{M}_\Delta, \overline{M})$, we make some preparations. Define
$$
\nu^\pi(h,s,a) = \mathbb{P}^{\overline{M},\pi}(s_h = s, a_h = a), \qquad \forall (h,s,a)\in\mathcal{H}'\times\mathcal{S}'\times[A'].
$$
Due to the structure of $\overline{M}$, the events $A_{h,s,a} := \{s_h = s, a_h = a\}$ are disjoint for $(h,s,a)\in\mathcal{H}'\times\mathcal{S}'\times[A']$; therefore,
$$
\sum_{h\in\mathcal{H}', s\in\mathcal{S}', a\in[A']} \nu^\pi(h,s,a) \le 1.
$$
Furthermore, for $M = M_{h,s,a}\in\mathcal{M}_\Delta$ we have
$$
D^2_{\rm H}\big( M(\pi), \overline{M}(\pi) \big) = \mathbb{P}^{\overline{M},\pi}(s_h = s, a_h = a)\, D^2_{\rm H}\big( P^M_h(\cdot\mid s,a), P^{\overline{M}}_h(\cdot\mid s,a) \big),
$$
because $M(\pi)$ and $\overline{M}(\pi)$ differ only in the conditional probability of $s_{h+1}$ given $s_h = s, a_h = a$. Therefore, due to the fact that
$$
D^2_{\rm H}\big( P^M_h(\cdot\mid s,a), P^{\overline{M}}_h(\cdot\mid s,a) \big) = D^2_{\rm H}\big( \mathrm{Bern}(\tfrac12+\Delta), \mathrm{Bern}(\tfrac12) \big) \le 3\Delta^2,
$$
we have $D^2_{\rm H}(M(\pi), \overline{M}(\pi)) \le 3\nu^\pi(h,s,a)\Delta^2$.
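The Bernoulli Hellinger bound just used is easy to verify numerically. A minimal sketch (our own check, not from the paper), assuming the convention $D^2_{\rm H}(p,q) = \sum_i(\sqrt{p_i}-\sqrt{q_i})^2$ without the $1/2$ factor:

```python
import math

# Squared Hellinger distance between two Bernoulli distributions,
# using the convention D_H^2(p, q) = sum_i (sqrt(p_i) - sqrt(q_i))^2.
def hellinger_sq_bern(p, q):
    return ((math.sqrt(p) - math.sqrt(q)) ** 2
            + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2)

# Check D_H^2(Bern(1/2 + delta), Bern(1/2)) <= 3*delta^2 on a grid
# of delta in (0, 1/3]:
gaps = [3 * dlt ** 2 - hellinger_sq_bern(0.5 + dlt, 0.5)
        for dlt in [i / 300 for i in range(1, 101)]]
```

For small $\Delta$ the distance behaves like $\Delta^2$ (under this convention), so the constant $3$ is comfortably sufficient on the whole range $\Delta\in(0,\tfrac13]$.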
Now, for $p_{\rm exp}, p_{\rm out}\in\Delta(\Pi)$ and $M = M_{h^\star,s^\star,a^\star}\in\mathcal{M}_\Delta$, we have
$$
\mathbb{E}_{\pi\sim p_{\rm out}}\big[ f^M(\pi_M) - f^M(\pi) \big] - \gamma\,\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}\big( M(\pi), \overline{M}(\pi) \big) \big] \ge \tfrac12 + \Delta - \mathbb{E}_{\pi\sim p_{\rm out}}\Big[ \tfrac12 + \Delta\,\mathbb{P}^{M,\pi}(s_{h^\star} = s^\star, a_{h^\star} = a^\star) \Big] - \gamma\,\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}\big( M(\pi), \overline{M}(\pi) \big) \big] \ge \Delta\big( 1 - \mathbb{E}_{\pi\sim p_{\rm out}}[\nu^\pi(h^\star,s^\star,a^\star)] \big) - 3\gamma\Delta^2\,\mathbb{E}_{\pi\sim p_{\rm exp}}[\nu^\pi(h^\star,s^\star,a^\star)].
$$
Accordingly, we define $\nu_{p_{\rm exp}}(h,s,a) = \mathbb{E}_{\pi\sim p_{\rm exp}}[\nu^\pi(h,s,a)]$ and $\nu_{p_{\rm out}}(h,s,a) = \mathbb{E}_{\pi\sim p_{\rm out}}[\nu^\pi(h,s,a)]$. Then, for any fixed $p_{\rm exp}, p_{\rm out}\in\Delta(\Pi)$, by the fact that
$$
\sum_{h\in\mathcal{H}', s\in\mathcal{S}', a\in[A']} \big\{ \nu_{p_{\rm out}}(h,s,a) + 3\gamma\Delta\,\nu_{p_{\rm exp}}(h,s,a) \big\} \le 1 + 3\gamma\Delta,
$$
we know that there exists $(h',s',a')\in\mathcal{H}'\times\mathcal{S}'\times[A']$ such that
$$
\nu_{p_{\rm out}}(h',s',a') + 3\gamma\Delta\,\nu_{p_{\rm exp}}(h',s',a') \le \frac{1 + 3\gamma\Delta}{H'S'A'}.
$$
We can then consider $M' = M_{h',s',a'}$, and
$$
\sup_{M\in\mathcal{M}_\Delta} \mathbb{E}_{\pi\sim p_{\rm out}}\big[ f^M(\pi_M) - f^M(\pi) \big] - \gamma\,\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}(M(\pi), \overline{M}(\pi)) \big] \ge \mathbb{E}_{\pi\sim p_{\rm out}}\big[ f^{M'}(\pi_{M'}) - f^{M'}(\pi) \big] - \gamma\,\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}(M'(\pi), \overline{M}(\pi)) \big] \ge \Delta\big( 1 - \nu_{p_{\rm out}}(h',s',a') \big) - 3\gamma\Delta^2\,\nu_{p_{\rm exp}}(h',s',a') \ge \Delta - \Delta\cdot\frac{1 + 3\gamma\Delta}{H'S'A'}.
$$
By the arbitrariness of $p_{\rm exp}, p_{\rm out}\in\Delta(\Pi)$, we derive that
$$
\operatorname{edec}_\gamma(\mathcal{M}_\Delta, \overline{M}) \ge \Delta - \Delta\cdot\frac{1 + 3\gamma\Delta}{H'S'A'}.
$$
(Recall that Algorithm 8, REWARD-FREE E2D, outputs $\widehat{p}_{\rm out}(R^\star) = \frac1T\sum_{t=1}^T p^t_{\rm out}(R^\star)$; see Appendix I.)

Also, by definition, it holds that $\mathcal{M}_{\varepsilon,\infty}(\overline{M}) = \mathcal{M}_\Delta$ and $V(\mathcal{M}_\Delta) = \frac{3}{1-\Delta} \le 6$. Therefore, given $T \ge 1$ and algorithm $\mathfrak{A}$, we set $C_0 = 2^{14}$, $\Delta = \min\big\{ \tfrac13, \sqrt{\tfrac{H'S'A'}{12 C_0 T}} \big\}$, and $\gamma = C_0 T\Delta$; then applying Proposition H.2 to $\mathcal{M}_\Delta$ gives an $M\in\mathcal{M}_\Delta$ such that
$$
\mathbb{E}^{M,\mathfrak{A}}[\mathrm{SubOpt}] \ge \frac16\Big( \Delta - \Delta\cdot\frac{1+3\gamma\Delta}{H'S'A'} \Big) \ge \frac{\Delta}{24} \ge c_0 \min\Big\{ 1, \sqrt{\frac{H'S'A'}{T}} \Big\},
$$
where the second inequality is due to the fact that $\frac{1+3\gamma\Delta}{H'S'A'} \le \frac34$ (which follows from a simple calculation), and $c_0$ is a universal constant. On the other hand, we can similarly provide a lower bound on $\operatorname{edec}_\gamma(\mathcal{M}, \overline{M})$ for the model class $\mathcal{M} = \bigcup_{\Delta>0}\mathcal{M}_\Delta$: for any given $\gamma>0$, we can take $\varepsilon = \Delta = \min\big\{ \tfrac13, \tfrac{H'S'A'}{12\gamma} \big\}$, and then
$$
\operatorname{edec}_\gamma\big( \mathcal{M}^\infty_\varepsilon(\overline{M}), \overline{M} \big) \ge \operatorname{edec}_\gamma(\mathcal{M}_\Delta, \overline{M}) \ge \Delta - \Delta\cdot\frac{1+3\gamma\Delta}{H'S'A'} \ge \frac{\Delta}{4} = \frac14 \min\Big\{ \frac13, \frac{H'S'A'}{12\gamma} \Big\}.
$$
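The "simple calculation" $(1+3\gamma\Delta)/(H'S'A') \le 3/4$ invoked in this proof can be sanity-checked numerically. The sketch below is our own (the specific $(H'S'A', T)$ combinations are arbitrary test values); it evaluates the ratio under the stated choices of $\Delta$ and $\gamma$:

```python
import math

# Evaluate (1 + 3*gamma*Delta)/(H'S'A') with C0 = 2**14,
# Delta = min(1/3, sqrt(H'S'A'/(12*C0*T))) and gamma = C0*T*Delta.
C0 = 2 ** 14

def ratio(HSA, T):
    delta = min(1 / 3, math.sqrt(HSA / (12 * C0 * T)))
    gamma = C0 * T * delta
    return (1 + 3 * gamma * delta) / HSA

# A few arbitrary (H'S'A', T) combinations, covering both branches of the min:
ratios = [ratio(HSA, T) for HSA in (4, 240, 40960) for T in (1, 10 ** 4, 10 ** 7)]
```

When $\Delta$ is not capped at $1/3$, one has $3\gamma\Delta = H'S'A'/4$, so the ratio is $1/(H'S'A') + 1/4 \le 3/4$; the capped branch is handled similarly.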
I PROOFS FOR SECTION 4.2

I.1 ALGORITHM REWARD-FREE E2D

Algorithm 8 presents a slightly more general version of the REWARD-FREE E2D algorithm in which we allow $|\mathcal{P}|$ to be possibly infinite. The algorithm described in Section 4.2 is the special case of Algorithm 8 with $|\mathcal{P}| < \infty$ and the uniform prior $\mu^1 = \mathrm{Unif}(\mathcal{P})$. More generally, just as Algorithm 7, Algorithm 8 can also be applied when $\mathcal{M}$ only admits a finite optimistic covering. Its guarantee in this setting is stated as follows.

Theorem I.1 (REWARD-FREE E2D). Given a $\rho$-optimistic cover $(\widetilde{\mathbb{P}}, \mathcal{P}_0)$ of $\mathcal{P}$, we can replace the subroutine (47) in Algorithm 8 with
$$
\mu^{t+1}(P) \propto_P \mu^t(P)\cdot\exp\big( \eta\log\widetilde{\mathbb{P}}^{\pi^t}(o^t) \big), \tag{48}
$$
and with $\eta = 1/2$, $\mu^1 = \mathrm{Unif}(\mathcal{P}_0)$, Algorithm 8 achieves the following with probability at least $1-\delta$:
$$
\mathrm{SubOpt}^{\rm rf} \le \operatorname{rfdec}_\gamma(\mathcal{M}) + \frac{2\gamma}{T}\big[ \log|\mathcal{P}_0| + T\rho + 2\log(2/\delta) \big].
$$
By tuning $\gamma>0$, with probability at least $1-\delta$, Algorithm 8 achieves
$$
\mathrm{SubOpt}^{\rm rf} \le 4\inf_{\gamma>0}\Big\{ \operatorname{rfdec}_\gamma(\mathcal{M}) + \frac{\gamma}{T}\big[ \operatorname{est}(\mathcal{P}, T) + \log(1/\delta) \big] \Big\}.
$$

I.2 PROOF OF THEOREM 9 AND THEOREM I.1

For $P\in\mathcal{P}$, $R\in\mathcal{R}$, $p_{\rm exp}, p_{\rm out}\in\Delta(\Pi)$, we consider the function
$$
V^t(p_{\rm exp}, R; p_{\rm out}, P) := \mathbb{E}_{\pi\sim p_{\rm out}}\big[ f^{P,R}(\pi_{P,R}) - f^{P,R}(\pi) \big] - \gamma\,\mathbb{E}_{\overline{P}\sim\mu^t}\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}\big( P(\pi), \overline{P}(\pi) \big) \big].
$$
Therefore, for any $R^\star\in\mathcal{R}$, using the definitions of the infima and suprema involved, we have
$$
\operatorname{rfdec}_\gamma(\mathcal{M}) \ge \operatorname{rfdec}_\gamma(\mathcal{M}, \mu^t) = \sup_{R\in\mathcal{R}} \widehat{V}_{\mu^t}(p^t_{\rm exp}, R) \ge \widehat{V}_{\mu^t}(p^t_{\rm exp}, R^\star) \ge \inf_{p_{\rm out}\in\Delta(\Pi)} \sup_{P\in\mathcal{P}} V^t(p^t_{\rm exp}, R^\star; p_{\rm out}, P) = \sup_{P\in\mathcal{P}} V^t(p^t_{\rm exp}, R^\star; p^t_{\rm out}(R^\star), P) \ge V^t(p^t_{\rm exp}, R^\star; p^t_{\rm out}(R^\star), P^\star).
$$
Therefore, for any $R^\star\in\mathcal{R}$ and the associated $M^\star = (P^\star, R^\star)$, we have
$$
\mathbb{E}_{\pi\sim p^t_{\rm out}(R^\star)}\big[ f^{M^\star}(\pi_{M^\star}) - f^{M^\star}(\pi) \big] \le \operatorname{rfdec}_\gamma(\mathcal{M}) + \gamma\,\mathbb{E}_{\overline{P}\sim\mu^t}\mathbb{E}_{\pi\sim p^t_{\rm exp}}\big[ D^2_{\rm H}\big( P^\star(\pi), \overline{P}(\pi) \big) \big].
$$
Averaging over $t\in[T]$ and taking the supremum over $R^\star\in\mathcal{R}$ on the left-hand side, we get
$$
\sup_{R^\star\in\mathcal{R}}\Big\{ f^{P^\star,R^\star}(\pi_{P^\star,R^\star}) - \mathbb{E}_{\pi\sim\widehat{p}_{\rm out}(R^\star)}\big[ f^{P^\star,R^\star}(\pi) \big] \Big\} \le \operatorname{rfdec}_\gamma(\mathcal{M}) + \gamma\cdot\frac{\mathrm{Est}_{\rm H}}{T}, \tag{52}
$$
where
$$
\mathrm{Est}_{\rm H} := \sum_{t=1}^T \mathbb{E}_{\overline{P}\sim\mu^t}\mathbb{E}_{\pi\sim p^t_{\rm exp}}\big[ D^2_{\rm H}\big( P^\star(\pi), \overline{P}(\pi) \big) \big].
$$
Note that (52) holds regardless of which subroutine ((47) or (48)) we use.
Therefore, it remains to bound $\mathrm{Est}_{\rm H}$ for (47) and (48). When $|\mathcal{P}| < \infty$, Corollary D.2 implies that subroutine (47) (which agrees with (3) with $\eta_r = 0$ and $\eta_p = \eta$) achieves, with probability at least $1-\delta$,
$$
\mathrm{Est}_{\rm H} \le \frac{1}{\eta}\log(|\mathcal{P}|/\delta).
$$
This proves Theorem 9. Similarly, when a $\rho$-optimistic covering $(\widetilde{\mathbb{P}}, \mathcal{P}_0)$ is given, Theorem D.8 implies that subroutine (48) (which agrees with (19) with $\eta_r = 0$ and $\eta_p = \eta$) achieves, with probability at least $1-\delta$,
$$
\mathrm{Est}_{\rm H} \le \frac{1}{\eta}\big[ \log|\mathcal{P}_0| + T\rho + 2\log(2/\delta) \big].
$$
This completes the proof of Theorem I.1.
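As an illustration of the estimation subroutine analyzed here, the following is a toy sketch (our own, not the paper's implementation) of the Tempered Aggregation update $\mu^{t+1}(P) \propto \mu^t(P)\exp(\eta\log P^{\pi^t}(o^t))$ on a hypothetical two-model class with Bernoulli observations, where model 0 is the data-generating model:

```python
import random

# Tempered Aggregation on a toy class: mu^{t+1}(P) ∝ mu^t(P) * P(o)^eta.
random.seed(0)
eta = 0.5
models = [0.8, 0.2]   # P(o = 1) under each candidate model; model 0 is the truth
mu = [0.5, 0.5]       # uniform prior

for _ in range(200):
    o = 1 if random.random() < models[0] else 0       # observation from the truth
    liks = [p if o == 1 else 1 - p for p in models]   # per-model likelihoods
    mu = [m * lik ** eta for m, lik in zip(mu, liks)]  # tempered posterior update
    z = sum(mu)
    mu = [m / z for m in mu]                           # renormalize
```

With $\eta < 1$ the update is a tempered (fractional-likelihood) Bayes rule; the posterior rapidly concentrates on the data-generating model, which is the kind of behavior that the Hellinger estimation bounds on $\mathrm{Est}_{\rm H}$ above quantify.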

I.3 PROOF OF PROPOSITION 10

We state and prove the formal version of Proposition 10 as follows.

Proposition I.2 (Reward-free lower bound). Consider a model class $\mathcal{M} = \mathcal{P}\times\mathcal{R}$ and a fixed integer $T \ge 1$. Define $V(\mathcal{M})$, $C(T)$ and $\varepsilon_\gamma$ as in Proposition H.2. Then for any algorithm $\mathfrak{A}$ that returns a mapping $p_{\rm out}: \mathcal{R}\to\Delta(\Pi)$ after $T$ episodes, there exists a model $P\in\mathcal{P}$ for which
$$
\mathbb{E}^{P,\mathfrak{A}}[\mathrm{SubOpt}^{\rm rf}] \ge \sup_{R\in\mathcal{R}} \mathbb{E}^{P,\mathfrak{A}}\big[ f^{P,R}(\pi_{P,R}) - \mathbb{E}_{\pi_{\rm out}\sim p_{\rm out}(R)}\big[ f^{P,R}(\pi_{\rm out}) \big] \big] \ge \frac16\cdot\max_{\gamma>0} \sup_{\overline{P}\in\mathcal{P}} \operatorname{rfdec}_\gamma\big( \mathcal{M}^{\infty,\rm rf}_{\varepsilon_\gamma}(\overline{P}), \overline{P} \big),
$$
where the localization is defined as $\mathcal{M}^{\infty,\rm rf}_\varepsilon(\overline{P}) := \mathcal{P}^{\infty,\rm rf}_\varepsilon(\overline{P})\times\mathcal{R}$, with
$$
\mathcal{P}^{\infty,\rm rf}_\varepsilon(\overline{P}) = \Big\{ P\in\mathcal{P} : \big| g^{P,R}(\pi) - g^{\overline{P},R}(\pi) \big| \le \varepsilon,\ \forall\pi\in\Pi,\ R\in\mathcal{R} \Big\},
$$
and $g^{P,R}(\pi) := f^{P,R}(\pi_{P,R}) - f^{P,R}(\pi)$ for any $(P,R,\pi)\in\mathcal{P}\times\mathcal{R}\times\Pi$.

Proof. Suppose that algorithm $\mathfrak{A}$ is given by the rules $p = \{ p^{(t)}_{\rm exp}(\cdot\mid\cdot) \}_{t=1}^T \cup \{ p_{\rm out}(\cdot\mid\cdot) \}$, where $p^{(t)}_{\rm exp}(\cdot\mid\mathcal{H}^{(t-1)})\in\Delta(\Pi)$ is the rule of interaction at the $t$-th step, with $\mathcal{H}^{(t-1)}$ the history before the $t$-th step, and $p_{\rm out}(\cdot\mid\mathcal{H}^{(T)}, \cdot): \mathcal{R}\to\Delta(\Pi)$ is the rule for outputting the policy given a reward function. Let $\mathbb{P}^{P,\mathfrak{A}}$ refer to the distribution (of $\mathcal{H}^{(T)}$) induced by the exploration phase of $\mathfrak{A}$ under transition $P$. For any $P\in\mathcal{P}$ and $R\in\mathcal{R}$, we define
$$
p_{\rm exp}(P) = \mathbb{E}^{P,\mathfrak{A}}\Big[ \frac1T\sum_{t=1}^T p^{(t)}_{\rm exp}\big( \cdot\mid\mathcal{H}^{(t-1)} \big) \Big] \in \Delta(\Pi), \qquad p_{\rm out}(P,R) = \mathbb{E}^{P,\mathfrak{A}}\big[ p_{\rm out}(\cdot\mid\mathcal{H}^{(T)}, R) \big] \in \Delta(\Pi).
$$
Notice that
$$
\mathbb{E}^{P,\mathfrak{A}}\big[ f^{P,R}(\pi_{P,R}) - \mathbb{E}_{\pi_{\rm out}\sim p_{\rm out}(R)}\big[ f^{P,R}(\pi_{\rm out}) \big] \big] = \mathbb{E}_{\pi\sim p_{\rm out}(P,R)}\big[ g^{P,R}(\pi) \big].
$$
Let us abbreviate $\operatorname{rfdec} := \sup_{\overline{P}\in\mathcal{P}} \operatorname{rfdec}_\gamma\big( \mathcal{M}^{\infty,\rm rf}_{\varepsilon_\gamma}(\overline{P}), \overline{P} \big)$. We then bound
$$
\mathbb{E}_{P\sim\mu}\mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{P,R}(\pi_{P,R}) - f^{\overline{P},R}(\pi_{P,R}) \big] + \mathbb{E}_{P\sim\mu}\mathbb{E}_{P'\sim\mu}\mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{\overline{P},R}(\pi_{P',R}) - f^{P,R}(\pi_{P',R}) \big] - \gamma\,\mathbb{E}_{P\sim\mu}\mathbb{E}_{\overline{P}\sim\overline{\mu}}\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}\big( P(\pi), \overline{P}(\pi) \big) \big]
$$
$$
\overset{\rm(iii)}{\le} \inf_{p_{\rm exp}\in\Delta(\Pi)} \sup_{R\in\mathcal{R}} \sup_{\mu\in\Delta(\mathcal{P})} 2\,\mathbb{E}_{P\sim\mu}\Big[ \sup_{\pi'\in\Pi} \Big| \mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{P,R}(\pi') - f^{\overline{P},R}(\pi') \big] \Big| \Big] - \gamma\,\mathbb{E}_{P\sim\mu}\mathbb{E}_{\overline{P}\sim\overline{\mu}}\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}\big( P(\pi), \overline{P}(\pi) \big) \big]
$$
$$
= 2\inf_{p_{\rm exp}\in\Delta(\Pi)} \sup_{R\in\mathcal{R}, P\in\mathcal{P}} \Big\{ \sup_{\pi'\in\Pi} \Big| \mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{P,R}(\pi') - f^{\overline{P},R}(\pi') \big] \Big| - \frac{\gamma}{2}\,\mathbb{E}_{\overline{P}\sim\overline{\mu}}\mathbb{E}_{\pi\sim p_{\rm exp}}\big[ D^2_{\rm H}\big( P(\pi), \overline{P}(\pi) \big) \big] \Big\},
$$
where (i) is due to strong duality (Theorem B.1); in (ii) we upper bound $\inf_{p_{\rm out}}$ by letting $p_{\rm out}\in\Delta(\Pi)$ be defined by $p_{\rm out}(\pi) = \mu(\{P : \pi_{P,R} = \pi\})$; and in (iii) we upper bound
$$
\mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{P,R}(\pi_{P,R}) - f^{\overline{P},R}(\pi_{P,R}) \big] \le \sup_{\pi'\in\Pi}\Big| \mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{P,R}(\pi') - f^{\overline{P},R}(\pi') \big] \Big|, \qquad \mathbb{E}_{P'\sim\mu}\mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{\overline{P},R}(\pi_{P',R}) - f^{P,R}(\pi_{P',R}) \big] \le \sup_{\pi'\in\Pi}\Big| \mathbb{E}_{\overline{P}\sim\overline{\mu}}\big[ f^{P,R}(\pi') - f^{\overline{P},R}(\pi') \big] \Big|.
$$
Taking the infimum over $\overline{\mu}\in\Delta(\mathcal{P})$ gives the desired result.

Theorem J.1. Given a $\rho$-optimistic cover $(\widetilde{\mathbb{P}}, \mathcal{P}_0)$ of $\mathcal{P}$, we choose $\eta = 1/2$, $\mu^1 = \mathrm{Unif}(\mathcal{P}_0)$; then Algorithm 9 achieves the following with probability at least $1-\delta$:
$$
D_{\rm TV}\big( \widehat{P}, P^\star \big) \le 2\operatorname{mdec}_\gamma(\mathcal{M}) + \frac{4\gamma}{T}\big[ \log|\mathcal{P}_0| + T\rho + 2\log(2/\delta) \big].
$$
This completes the proof of Theorem J.1.

K PROOFS FOR SECTION 5

This section provides the proofs for Section 5 along with some additional discussions, organized as follows. We begin by presenting some useful intermediate results in Appendix K.1. As examples, the decoupling coefficient can be bounded for linear function classes, and more generally for any function class with low Eluder dimension (Russo & Van Roy, 2013) or star number (Foster et al., 2020).

Example K.2. Suppose that there exists $\phi: \mathcal{X}\to\mathbb{R}^d$ such that $\mathcal{F}\subset\{ f_\theta : x\mapsto\langle\theta,\phi(x)\rangle \}_{\theta\in\mathbb{R}^d}$. Then $\operatorname{dc}(\mathcal{F},\gamma) \le d/(4\gamma)$.

Definition K.3 (Eluder dimension). The Eluder dimension $e(\mathcal{F},\Delta)$ is the maximal length $n$ of a sequence $(f_1,\pi_1),\dots,(f_n,\pi_n)\in\mathcal{F}\times\Pi$ such that there is a $\Delta'\ge\Delta$ with
$$
|f_i(\pi_i)| \ge \Delta', \qquad \sum_{j<i}|f_i(\pi_j)|^2 \le (\Delta')^2, \qquad \forall i.
$$

Definition K.4 (Star number). The star number $s(\mathcal{F},\Delta)$ is the maximal length $n$ of a sequence $(f_1,\pi_1),\dots,(f_n,\pi_n)\in\mathcal{F}\times\Pi$ such that there is a $\Delta'\ge\Delta$ with
$$
|f_i(\pi_i)| \ge \Delta', \qquad \sum_{j\ne i}|f_i(\pi_j)|^2 \le (\Delta')^2, \qquad \forall i.
$$

More generally, the decoupling coefficient can be bounded by the disagreement coefficient introduced in Foster et al. (2021, Definition 6.3). The proof of Example K.5, along with some further discussions, can be found in Appendix K.4.

Example K.5. For $\mathcal{F}\subset(\Pi\to[-1,1])$, the decoupling coefficient $\operatorname{dc}(\mathcal{F},\gamma)$ can be bounded in terms of the Eluder dimension or the star number of $\mathcal{F}$ (cf. Lemma K.20).

For any $M\in\mathcal{M}$, recall $\operatorname{comp}(\mathcal{G}^M,\gamma) := H\max_h \operatorname{dc}\big( \mathcal{G}^M_h, \gamma/(12HL^2) \big) + 6H/\gamma$, and let $\operatorname{comp}(\mathcal{G},\gamma) := \max_{M\in\mathcal{M}} \operatorname{comp}(\mathcal{G}^M,\gamma)$. Then we have for any $\gamma>0$:

(1) If $\pi^{\rm est}_{M,h} = \pi_M$ (the on-policy case), we have $\operatorname{dec}_\gamma(\mathcal{M}) \le \operatorname{comp}(\mathcal{G},\gamma)$ and $\operatorname{psc}_\gamma(\mathcal{M}, M^\star) \le \operatorname{comp}(\mathcal{G}^{M^\star},\gamma)$ for all $M^\star\in\mathcal{M}$.

(2) For $\mathcal{G}$ with general estimation policies, we have $\operatorname{edec}_\gamma(\mathcal{M}) \le \operatorname{comp}(\mathcal{G},\gamma)$.

Similarly, we show that the MLEC can also be bounded in terms of the Eluder dimension of $\mathcal{G}^{M^\star}$.

Proposition K.7. Suppose that $\mathcal{G} = (\mathcal{G}^M_h)_{M\in\mathcal{M}, h\in[H]}$ is a Bellman representation of $\mathcal{M}$ in the on-policy case ($\pi^{\rm est}_{M,h} = \pi_M$ for all $(h,M)$). Then we have, for any $\gamma>0$ and $K\in\mathbb{Z}_{\ge1}$,
$$
\operatorname{mlec}_{\gamma,K}(\mathcal{M}, M^\star) \le CH^2L^2 \inf_{\Delta>0}\Big\{ \max_{h\in[H]} \frac{e(\mathcal{G}^{M^\star}_h, \Delta)}{\gamma} + \Delta \Big\},
$$
where $C>0$ is an absolute constant.
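The linear decoupling inequality behind Example K.2, namely $\mathbb{E}_{(f,x)\sim\nu}|f(x)| \le \gamma\,\mathbb{E}_{f\sim\nu}\mathbb{E}_{x\sim\nu}|f(x)|^2 + d/(4\gamma)$ for any joint distribution $\nu$, can be checked numerically on random finite instances. A minimal sketch (ours; the sizes, seed, and $\gamma$ values are arbitrary):

```python
import numpy as np

# Random finite linear class f_theta(x) = <theta, phi(x)> and a random
# joint distribution nu over (f, x) pairs.
rng = np.random.default_rng(0)
d, n_f, n_x = 3, 5, 7
thetas = rng.normal(size=(n_f, d))
phis = rng.normal(size=(n_x, d))
F = thetas @ phis.T                       # F[i, j] = f_i(x_j)

nu = rng.random((n_f, n_x))
nu /= nu.sum()                            # joint distribution over (f, x)
nu_f, nu_x = nu.sum(axis=1), nu.sum(axis=0)

lhs = float((nu * np.abs(F)).sum())       # coupled first moment E_nu |f(x)|
second = float(nu_f @ (F ** 2) @ nu_x)    # decoupled second moment E_f E_x f(x)^2
gaps = [gamma * second + d / (4 * gamma) - lhs for gamma in (0.1, 0.5, 1.0, 5.0)]
```

Since the inequality holds for every joint $\nu$ and every $\gamma>0$ (that is exactly the content of Example K.2's proof), all gaps are nonnegative regardless of the random draw.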
Bounding RFDEC under strong Bellman representability. We define a strong Bellman representation as follows.

Definition K.8 (Strong Bellman representation). Given a pair $(\mathcal{M} = \mathcal{P}\times\mathcal{R},\ \overline{P}\in\mathcal{P})$, its strong Bellman representation is a collection of function classes $\mathcal{G}^{\overline{M}} = \{\mathcal{G}^{\overline{M}}_h\}_{h\in[H]}$ with $\mathcal{G}^{\overline{M}}_h = \{ g^{M;\overline{P}}_h : \Pi\to[-1,1] \}_{M\in\mathcal{M}}$ such that:

1. For $M = (P,R)\in\mathcal{M}$ and $\pi\in\Pi$, it holds that for $\overline{M} = (\overline{P}, R)$,
$$
\Big| \mathbb{E}^{\overline{M},\pi}\big[ Q^{M,\pi}_h(s_h,a_h) - r_h - V^{M,\pi}_{h+1}(s_{h+1}) \big] \Big| \le \big| g^{M;\overline{P}}_h(\pi) \big|.
$$

2. For $M = (P,R)\in\mathcal{M}$ and $\pi\in\Pi$, it holds that
$$
\big| g^{M;\overline{P}}_h(\pi) \big| \le L\, D_{\rm H}\big( P(\pi^{\rm est}_h), \overline{P}(\pi^{\rm est}_h) \big),
$$
for some constant $L\ge1$.

Given a strong Bellman representation $\mathcal{G}$ of $\mathcal{M}$, the RFDEC of $\mathcal{M}$ can be bounded in terms of a complexity measure of $\mathcal{G}$ as follows.

Proposition K.9 (Bounding RFDEC by the decoupling coefficient of a strong Bellman representation). Suppose $\mathcal{G}$ is a strong Bellman representation of $\mathcal{M}$. Then for $\mu\in\Delta(\mathcal{P})$,
$$
\operatorname{rfdec}_\gamma(\mathcal{M},\mu) \le 2H\cdot\max_{\overline{P}\in\mathcal{P}}\max_{h\in[H]} \operatorname{dc}\big( \mathcal{G}^{\overline{P}}_h, \gamma/(4HL^2) \big) + \frac{H}{\gamma}.
$$
The proof of Proposition K.9 can be found in Appendix K.5.4. For its applications, see Appendix K.3.4.

K.2 PROOF OF PROPOSITION 12

The claims about bounded DEC/EDEC/RFDEC in (1)-(3) follow by combining the bounds in terms of decoupling coefficients (Proposition K.6(1), Proposition K.6(2), and Proposition K.9) with the bound of the decoupling coefficient by the Eluder dimension/star number (Example K.5). The claims about the sample complexities of the E2D algorithms then follow by applying Theorem 2, Theorem 5, and Theorem 9, respectively, and optimizing over $\gamma>0$.

We say $\mathcal{G}$ is bilinear (with rank $d$) if there exist maps $X_h(\cdot;\cdot): \mathcal{M}\times\mathcal{M}\to\mathbb{R}^d$ and $W_h(\cdot;\cdot): \mathcal{M}\times\mathcal{M}\to\mathbb{R}^d$ such that
$$
g^{M';\overline{M}}_h(M) = \big\langle X_h(M;\overline{M}),\ W_h(M';\overline{M}) \big\rangle, \qquad \forall M, M', \overline{M}\in\mathcal{M}.
$$
Suppose that $\mathcal{G}$ is bilinear with rank $d$; then by Example K.2, $\operatorname{comp}(\mathcal{G},\gamma) \le \frac{2dH^2L^2 + 2H + 2}{\gamma}$. Similarly, by Proposition K.7 we have $\operatorname{mlec}_{\gamma,K}(\mathcal{M}) \le \widetilde{O}(dH^2L^2/\gamma)$.

Example K.12 (Bellman error as Bellman representation). For a model class $\mathcal{M}$, its Q-type Bellman error is defined as
$$
g^{M';\overline{M}}_h(M) := \mathbb{E}^{\overline{M},\pi_M}\big[ Q^{M',\star}_h(s_h,a_h) - r_h - V^{M',\star}_{h+1}(s_{h+1}) \big], \qquad \forall M, M', \overline{M}\in\mathcal{M}.
$$
Along with $\pi^{\rm est} = \pi$ and $L = \sqrt{2}$, it gives a Bellman representation of $\mathcal{M}$, which we term its QBE. Similarly, we can consider the V-type Bellman error
$$
g^{M';\overline{M}}_h(M) := \mathbb{E}^{\overline{M},\pi_M\circ_h\pi_{M'}}\big[ Q^{M',\star}_h(s_h,a_h) - r_h - V^{M',\star}_{h+1}(s_{h+1}) \big], \qquad \forall M, M', \overline{M}\in\mathcal{M},
$$
where $\pi_M\circ_h\pi_{M'}$ stands for the policy that executes $\pi_M$ for the first $h-1$ steps and then follows $\pi_{M'}$ from the $h$-th step. Along with $\pi^{\rm est}_h = \pi\circ_h\mathrm{Unif}(\mathcal{A})$ and $L = \sqrt{2A}$, it gives a Bellman representation of $\mathcal{M}$, which we term its VBE.

Relation to the model-based Bellman-Eluder dimension. Take the Q-type Bellman error as an example. Note that the argument of $g^{M';\overline{M}}_h(M)$ corresponds to the roll-in policy $\pi_M$. Therefore, one can check that $e(\mathcal{G}^{M^\star}_h, \Delta) = \dim_{\rm DE}(\mathcal{E}_h, \Pi_h, \Delta)$, where $\mathcal{E}_h = \{ Q^{M,\star}_h - r_h - P^\star_h V^{M,\star}_{h+1} \}_{M\in\mathcal{M}}$ is the model-induced Bellman residual function class, and $\Pi_h = \{ d^{M^\star,\pi}_h \}_{\pi\in\Pi}$ is the collection of distributions over $\mathcal{S}\times\mathcal{A}$ at step $h$ induced by policies.

Therefore, $e(\mathcal{G}^{M^\star})$ is indeed equivalent to the model-based version of the Q-type (model-induced) Bellman-Eluder dimension (Jin et al., 2021).

Definition K.16 (Occupancy rank). We say an MDP is of occupancy rank $d$ if for all $h\in[H]$, there exist maps $\phi_h: \Pi\to\mathbb{R}^d$ and $\psi_h: \mathcal{S}\to\mathbb{R}^d$ such that $\mathbb{P}^\pi(s_h = s) = \langle \psi_h(s), \phi_h(\pi)\rangle$ for all $s\in\mathcal{S}$, $\pi\in\Pi$.

Along with $\pi^{\rm est}_h = \pi\circ_h\mathrm{Unif}(\mathcal{A})$ and $L = \sqrt{2A}$, it gives a strong Bellman representation of $\mathcal{M}$, which we term its VER.

As the following proposition indicates, the two choices of strong Bellman representation above are enough for us to bound the RFDEC for tabular MDPs, linear mixture MDPs, and MDPs with low occupancy rank. The proof of Proposition K.18 is mainly based on the decoupling behavior of linear function classes (cf. Example K.2), and can be found in Appendix K.5.6.

Proposition K.18. For the model class $\mathcal{M}$ of linear mixture MDPs (with a given feature $\phi$), by its QER $\mathcal{G}$ and Proposition K.9, we have $\operatorname{rfdec}_\gamma(\mathcal{M}) \le 8dH^2/\gamma$. For the model class $\mathcal{M}$ of MDPs with occupancy rank $d$, because its VER $\mathcal{G}$ is bilinear, we have $\operatorname{rfdec}_\gamma(\mathcal{M}) \le 8dAH^2/\gamma$.

Compared with Foster et al. (2021): they define $g^{M';\overline{M}}_h(M)$ in terms of the discrepancy function $\ell^{\overline{M}}(M'; s_h, a_h, r_h, s_{h+1})$, while we only require that $g^{M';\overline{M}}_h(M)$ can be upper bounded by $D_{\rm RL}$. In general, the $g^{M';\overline{M}}_h(M)$ they define can only be upper bounded by the Hellinger distance on the full observation $(o,r)$ (which is in general larger than $D_{\rm RL}$), as the expected discrepancy function may depend on distributional information about the reward that is not captured by its mean. However, when the reward $r$ is included in the observation $o$, our definition is more general than theirs. More importantly, for the majority of known concrete examples, e.g. those in Du et al. (2021); Jin et al. (2021), the discrepancy function is linear in $r_h$, and hence its expectation can be upper bounded by $D_{\rm RL}$.

Complexity measures for Bellman representations. In Foster et al.
(2021), the complexity of a Bellman representation is measured in terms of the disagreement coefficient, which can be upper bounded by the Eluder dimension or the star number.

Definition K.19 (Disagreement coefficient). The disagreement coefficient of a function class $\mathcal{F}\subset(\Pi\to[-1,1])$ is defined as $\theta(\mathcal{F},\Delta_0,\varepsilon_0;\rho) = \sup_{\Delta\ge\Delta_0,\varepsilon\ge\varepsilon_0}(\cdots)$, where $\overline{\theta}(\mathcal{F},\Delta,\gamma^{-1}) := \sup_{\rho\in\Delta(\Pi)}\theta(\mathcal{F},\Delta,\varepsilon;\rho)$.

Lemma K.20 also gives Example K.5 directly.

K.5 PROOFS OF PROPOSITIONS

K.5.1 PROOF OF EXAMPLE K.2

Under the linearity assumption, we can consider $f\mapsto\theta_f\in\mathbb{R}^d$ such that $f(x) = \langle\theta_f,\phi(x)\rangle$ for all $x\in\mathcal{X}$. Given a $\nu\in\Delta(\mathcal{F}\times\mathcal{X})$, let us set $\Phi_\lambda := \lambda I_d + \mathbb{E}_{x\sim\nu}\big[ \phi(x)\phi(x)^\top \big]$ for $\lambda>0$. Then
$$
\mathbb{E}_{(f,x)\sim\nu}\big[ |f(x)| \big] \le \mathbb{E}_{(f,x)\sim\nu}\big[ \|\theta_f\|_{\Phi_\lambda}\,\|\phi(x)\|_{\Phi_\lambda^{-1}} \big] \le \gamma\,\mathbb{E}_{f\sim\nu}\big[ \|\theta_f\|^2_{\Phi_\lambda} \big] + \frac{1}{4\gamma}\,\mathbb{E}_{x\sim\nu}\big[ \|\phi(x)\|^2_{\Phi_\lambda^{-1}} \big].
$$
For the first term, we have
$$
\mathbb{E}_{f\sim\nu}\big[ \|\theta_f\|^2_{\Phi_\lambda} \big] = \mathbb{E}_{f\sim\nu}\big[ \theta_f^\top\big( \mathbb{E}_{x\sim\nu}[\phi(x)\phi(x)^\top] \big)\theta_f \big] + \lambda\,\mathbb{E}_{f\sim\nu}\|\theta_f\|^2 = \mathbb{E}_{f\sim\nu}\mathbb{E}_{x\sim\nu}\big[ |f(x)|^2 \big] + \lambda\,\mathbb{E}_{f\sim\nu}\|\theta_f\|^2.
$$
For the second term, we have
$$
\mathbb{E}_{x\sim\nu}\big[ \|\phi(x)\|^2_{\Phi_\lambda^{-1}} \big] = \mathbb{E}_{x\sim\nu}\big[ \operatorname{tr}\big( \Phi_\lambda^{-1/2}\phi(x)\phi(x)^\top\Phi_\lambda^{-1/2} \big) \big] = \operatorname{tr}\big( \Phi_\lambda^{-1/2}\,\mathbb{E}_{x\sim\nu}[\phi(x)\phi(x)^\top]\,\Phi_\lambda^{-1/2} \big) = \operatorname{tr}\big( \Phi_\lambda^{-1/2}\Phi_0\Phi_\lambda^{-1/2} \big) \le d.
$$
Letting $\lambda\to0^+$ and then taking $\inf_\nu$ completes the proof.

As a corollary, we have the following result.

Corollary K.21. Suppose that there exists $\phi = (\phi_i: \mathcal{X}\to\mathbb{R}^d)_{i\in\mathcal{I}}$ such that $\mathcal{F}\subset\big\{ f_\theta : x\mapsto\max_i|\langle\theta,\phi_i(x)\rangle| \big\}_{\theta\in\mathbb{R}^d}$. Then $\operatorname{dc}(\mathcal{F},\gamma) \le d/\gamma$.

Corollary K.21 can be obtained similarly to the above proof of Example K.2. However, we believe the following fact is important:

Proposition K.22. Suppose that $\mathcal{W}\subset(\mathcal{X}\times\mathcal{Y}\to\mathbb{R}_{\ge0})$. Then for the function class $\mathcal{F} := \big\{ f_w : x\mapsto\max_{y\in\mathcal{Y}} w(x,y) \big\}_{w\in\mathcal{W}}$, we have $\operatorname{dc}(\mathcal{F},\gamma) \le \operatorname{dc}(\mathcal{W},\gamma)$.

Proof of Proposition K.22. Fix a $\nu\in\Delta(\mathcal{F}\times\mathcal{X})$. Consider the map $W: \mathcal{F}\to\mathcal{W}$ such that $f(x) = \max_y W(f)(x,y)$ for all $x\in\mathcal{X}$. Further consider $Y: \mathcal{F}\times\mathcal{X}\to\mathcal{Y}$ defined as $Y(f,x) := \arg\max_{y\in\mathcal{Y}} W(f)(x,y)$ (breaking ties arbitrarily).

Then we let $\nu'\in\Delta(\mathcal{F}\times\mathcal{X}\times\mathcal{Y})$ be given by $\nu'((w,x,y)) = \nu(\{(f,x): W(f) = w,\ y = Y(f,x)\})$, and
$$
\mathbb{E}_{(f,x)\sim\nu}\big[ |f(x)| \big] = \mathbb{E}_{(w,x,y)\sim\nu'}\big[ w(x,y) \big] \le \operatorname{dc}(\mathcal{W},\gamma) + \gamma\,\mathbb{E}_{w\sim\nu'}\mathbb{E}_{(x,y)\sim\nu'}\big[ w(x,y)^2 \big] \le \operatorname{dc}(\mathcal{W},\gamma) + \gamma\,\mathbb{E}_{w\sim\nu'}\mathbb{E}_{x\sim\nu'}\big[ |f_w(x)|^2 \big] = \operatorname{dc}(\mathcal{W},\gamma) + \gamma\,\mathbb{E}_{f\sim\nu}\mathbb{E}_{x\sim\nu}\big[ |f(x)|^2 \big],
$$
where the second inequality is due to the definition $f_w(x) = \max_y w(x,y)$ (so that $w(x,y) \le f_w(x)$), and the last equality is due to the fact that the marginal of $\nu'$ on $\mathcal{X}$ agrees with the marginal of $\nu$ on $\mathcal{X}$, and for $f\in\mathcal{F}$, $\nu'(\{w : f_w = f\}) = \nu(\{(f',x): W(f') = w,\ f_w = f\}) = \nu(f)$. Taking $\inf_\nu$ completes the proof.

K.5.2 PROOF OF PROPOSITION K.6

To simplify the proof, we first introduce the PSC with estimation policies.

Definition K.23. For $M\in\mathcal{M}$, let $\pi^{\rm est}_M$ be the uniform mixture of $\{ \pi_{M,0}, \pi^{\rm est}_{M,1}, \dots, \pi^{\rm est}_{M,H} \}$, where we define $\pi_{M,0} = \pi_M$. Let
$$
\operatorname{psc}^{\rm est}_\gamma(\mathcal{M}, \overline{M}) := \sup_{\mu\in\Delta(\mathcal{M})} \mathbb{E}_{M\sim\mu}\mathbb{E}_{M'\sim\mu}\Big[ f^M(\pi_M) - f^{\overline{M}}(\pi_M) - \gamma\, D^2_{\rm RL}\big( M(\pi^{\rm est}_{M'}), \overline{M}(\pi^{\rm est}_{M'}) \big) \Big],
$$
where we understand
$$
D^2_{\rm RL}\big( M(\pi^{\rm est}_{M'}), \overline{M}(\pi^{\rm est}_{M'}) \big) = \frac{1}{H+1}\sum_{h=0}^{H} D^2_{\rm RL}\big( M(\pi^{\rm est}_{M',h}), \overline{M}(\pi^{\rm est}_{M',h}) \big).
$$
We further define $\operatorname{psc}^{\rm est}_\gamma(\mathcal{M}) = \sup_{\overline{M}\in\mathcal{M}} \operatorname{psc}^{\rm est}_\gamma(\mathcal{M}, \overline{M})$. We can generalize Proposition E.3 to $\operatorname{psc}^{\rm est}$ by the same argument, as follows.

Proposition K.24. When $\pi^{\rm est} = \pi$, it holds that $\operatorname{dec}_\gamma(\mathcal{M}, \overline{M}) \le \operatorname{psc}^{\rm est}_{\gamma/6}(\mathcal{M}) + \frac{2(H+1)}{\gamma}$. In general, we always have $\operatorname{edec}_\gamma(\mathcal{M}, \overline{M}) \le \operatorname{psc}^{\rm est}_{\gamma/6}(\mathcal{M}) + \frac{2(H+1)}{\gamma}$.

Therefore, it remains to upper bound $\operatorname{psc}^{\rm est}_\gamma$ by $\operatorname{comp}(\mathcal{G},\gamma)$.

Proposition K.25. It holds that $\operatorname{psc}^{\rm est}_\gamma(\mathcal{M}, \overline{M}) \le H\max_h \operatorname{dc}\big( \mathcal{G}^{\overline{M}}_h, \gamma/((H+1)L^2) \big) + \frac{H+1}{4\gamma}$.

Combining Proposition K.25 with Proposition K.24 completes the proof of Proposition K.6.

Proof of Proposition K.25. Fix an $\overline{M}\in\mathcal{M}$ and $\mu\in\Delta(\mathcal{M})$. By the standard performance decomposition using the simulation lemma, we have
$$
\mathbb{E}_{M\sim\mu}\big[ f^M(\pi_M) - f^{\overline{M}}(\pi_M) \big] = \mathbb{E}_{M\sim\mu}\bigg[ \mathbb{E}^M\big[ V^{M,\pi_M}_1(s_1) \big] - \mathbb{E}^{\overline{M}}\big[ V^{M,\pi_M}_1(s_1) \big] + \sum_{h=1}^H \mathbb{E}^{\overline{M},\pi_M}\big[ Q^{M,\pi_M}_h(s_h,a_h) - r_h - V^{M,\pi_M}_{h+1}(s_{h+1}) \big] \bigg] \le \mathbb{E}_{M\sim\mu}\big[ D_{\rm TV}\big( P^M_0, P^{\overline{M}}_0 \big) \big] + \sum_{h=1}^H \mathbb{E}_{M\sim\mu}\Big[ \big| g^{M;\overline{M}}_h(M) \big| \Big].
$$
Therefore, we bound
$$
\sum_{h=1}^H \mathbb{E}_{M\sim\mu}\Big[ \big| g^{M;\overline{M}}_h(M) \big| \Big] = \sum_{h=1}^H \underbrace{\Big\{ \mathbb{E}_{M\sim\mu}\Big[ \big| g^{M;\overline{M}}_h(M) \big| \Big] - \eta\,\mathbb{E}_{M,M'\sim\mu}\Big[ \big| g^{M;\overline{M}}_h(M') \big|^2 \Big] \Big\}}_{\le\,\operatorname{dc}(\mathcal{G}^{\overline{M}}_h,\,\eta)} + \eta\sum_{h=1}^H \mathbb{E}_{M,M'\sim\mu}\Big[ \big| g^{M;\overline{M}}_h(M') \big|^2 \Big] \le \sum_{h=1}^H \operatorname{dc}\big( \mathcal{G}^{\overline{M}}_h, \eta \big) + \eta L^2 \sum_{h=1}^H \mathbb{E}_{M,M'\sim\mu}\big[ D^2_{\rm RL}\big( M(\pi^{\rm est}_{M',h}), \overline{M}(\pi^{\rm est}_{M',h}) \big) \big],
$$
where the inequality is due to the definitions of the Bellman representation and the decoupling coefficient.
Furthermore, we have
$$
\mathbb{E}_{M\sim\mu}\big[ D_{\rm TV}\big( P^M_0, P^{\overline{M}}_0 \big) \big] \le \mathbb{E}_{M,M'\sim\mu}\big[ D_{\rm H}\big( M(\pi_{M'}), \overline{M}(\pi_{M'}) \big) \big] \le \mathbb{E}_{M,M'\sim\mu}\big[ D_{\rm RL}\big( M(\pi_{M'}), \overline{M}(\pi_{M'}) \big) \big] \le \frac{\gamma}{H+1}\,\mathbb{E}_{M,M'\sim\mu}\big[ D^2_{\rm RL}\big( M(\pi_{M'}), \overline{M}(\pi_{M'}) \big) \big] + \frac{H+1}{4\gamma},
$$
where the last inequality uses AM-GM. Now taking $\eta = \gamma/(L^2(H+1))$ and combining the two inequalities above completes the proof.

K.5.3 PROOF OF PROPOSITION K.7

Fix any set of models $\{M^k\}_{k\in[K]}\subset\mathcal{M}$. By the standard performance decomposition using the simulation lemma, we have
$$
\frac1K\sum_{k=1}^K \big[ f^{M^k}(\pi_{M^k}) - f^{M^\star}(\pi_{M^k}) \big] = \frac1K\sum_{k=1}^K\sum_{h=1}^H \mathbb{E}^{M^\star,\pi_{M^k}}\big[ Q^{M^k,\pi_{M^k}}_h(s_h,a_h) - r_h - V^{M^k,\pi_{M^k}}_{h+1}(s_{h+1}) \big] \le \frac1K\sum_{h=1}^H\sum_{k=1}^K \big| g^{M^k;M^\star}_h(M^k) \big|.
$$
On the other hand, by the definition of the Bellman representation,
$$
\sum_{t=1}^{k-1} \big| g^{M^k;M^\star}_h(M^t) \big|^2 \le L^2 \sum_{t=1}^{k-1} D^2_{\rm RL}\big( M^k(\pi_{M^t}), M^\star(\pi_{M^t}) \big).
$$
Therefore, defining
$$
\widetilde{\beta} := \max_{k\in[K]} \sum_{t=1}^{k-1} D^2_{\rm RL}\big( M^k(\pi_{M^t}), M^\star(\pi_{M^t}) \big),
$$
we have $\sum_{t=1}^{k-1} |g^{M^k;M^\star}_h(M^t)|^2 \le L^2\widetilde{\beta}$ for all $k\in[K]$. Our final step is an Eluder dimension argument: by the above precondition, we can apply the Eluder dimension bound in Jin et al. (2021, Lemma 41) to obtain that, for any $h\in[H]$ and $\Delta>0$,
$$
\frac1K\sum_{k=1}^K \big| g^{M^k;M^\star}_h(M^k) \big| \le \frac{O(1)}{K}\cdot\Big( \sqrt{ e(\mathcal{G}^{M^\star}_h, \Delta)\, L^2\widetilde{\beta} K } + \min\big\{ e(\mathcal{G}^{M^\star}_h,\Delta), K \big\} + K\cdot\Delta \Big),
$$
where $O(1)$ hides a universal constant. Summing over $h\in[H]$ and using AM-GM then yields, for any $\{M^k\}_{k\in[K]}$, a bound on $\frac1K\sum_{k=1}^K \big[ f^{M^k}(\pi_{M^k}) - f^{M^\star}(\pi_{M^k}) \big] - \frac{\gamma}{K}\cdot\max\{\widetilde{\beta}, \cdots\}$; by the definition of $\operatorname{mlec}_{\gamma,K}$, taking $\inf_{\Delta>0}$ completes the proof.

Then, we set $N = \lceil B/\rho \rceil$ and let $B' = N\rho$. For $\theta\in[-B',B']^d$, we define the $\rho$-neighborhood of $\theta$ as $\mathcal{B}(\theta,\rho) := \rho\lfloor\theta/\rho\rfloor + [0,\rho]^d$, and let
$$
\widetilde{\mathbb{P}}_\theta(s'\mid s,a) := \max_{\theta'\in\mathcal{B}(\theta,\rho)} \big\langle \theta', \phi(s'\mid s,a) \big\rangle.
$$
Then, if $\theta$ induces a transition dynamic $P_\theta$, we have $\widetilde{\mathbb{P}}_\theta \ge P_\theta$, and $\sum_{s'} \big| \widetilde{\mathbb{P}}_\theta(s'\mid s,a) - P_\theta(s'\mid s,a) \big|$ can be bounded accordingly. Now, for $\Theta = (\theta_h)_h\in(\mathbb{R}^d)^{H-1}$, we define
$$
\widetilde{\mathbb{P}}^\pi_\Theta(s_1, a_1, \dots, s_H, a_H) := P_1(s_1)\,\pi_1(a_1\mid s_1)\,\widetilde{\mathbb{P}}_{\theta_1}(s_2\mid s_1,a_1)\cdots\widetilde{\mathbb{P}}_{\theta_{H-1}}(s_H\mid s_{H-1},a_{H-1})\,\pi_H(a_H\mid s_H) = P_1(s_1)\cdot\prod_{h=1}^H \pi_h(a_h\mid s_h)\cdot\prod_{h=1}^{H-1}\widetilde{\mathbb{P}}_{\theta_h}(s_{h+1}\mid s_h,a_h).
$$
Suppose that $\rho \le 1/(2Hd)$; then a simple calculation shows that when $\Theta$ induces an MDP,
$$
\big\| \widetilde{\mathbb{P}}^\pi_\Theta - \mathbb{P}^\pi_\Theta \big\|_1 \le 2eHd\rho.
$$
Therefore, letting $\rho_1 = \sqrt{2eHd\rho}$, by picking a representative in each $\ell_\infty$-$\rho$-ball, we can construct a $\rho_1$-optimistic covering with $|\mathcal{M}_0| \le (2N)^{Hd} = (2\lceil B/\rho\rceil)^{Hd} = \big( 2\big\lceil 2eHdB/\rho_1^2 \big\rceil \big)^{Hd}$, which implies that $\log|\mathcal{M}_0| \le O(dH\log(dHB/\rho_1))$.
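The optimistic rounding step in the construction above can be sketched in code. This is our own illustration (`theta`, `phi`, and `rho` are arbitrary values); it checks that the boxwise maximum dominates the true inner product and that the per-coordinate gap is at most $\rho|\phi_i|$:

```python
import numpy as np

# Optimistic rounding: B(theta, rho) = rho*floor(theta/rho) + [0, rho]^d and
# P_tilde(phi) = max over theta' in B(theta, rho) of <theta', phi>.
rng = np.random.default_rng(1)
d, rho = 4, 0.01
theta = rng.uniform(-1, 1, size=d)

def p_tilde(phi):
    lo = rng_cell = rho * np.floor(theta / rho)   # lower corner of the grid cell
    corner = lo + rho * (phi > 0)                 # maximizing corner, coordinatewise
    return float(corner @ phi)

phis = rng.normal(size=(100, d))
gaps = [p_tilde(phi) - float(theta @ phi) for phi in phis]          # P_tilde - P >= 0
slacks = [rho * float(np.abs(phi).sum()) - g for g, phi in zip(gaps, phis)]
```

Since $\theta$ itself lies in its own cell $\mathcal{B}(\theta,\rho)$, the maximum is never below $\langle\theta,\phi\rangle$, and each coordinate of the maximizing corner is within $\rho$ of $\theta_i$, giving the $\ell_1$-type control used in the covering bound.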

K.5.6 PROOF OF PROPOSITION K.18

We deal with linear mixture MDPs first. Suppose that each $P\in\mathcal{P}$ is parameterized by $\theta^P = (\theta^P_h)_h$. Then the QER of $\mathcal{P}$ is given by
$$
g^{P;\overline{P}}_h(\pi) = \mathbb{E}^{\overline{P},\pi}\Big[ D_{\rm TV}\big( P_h(\cdot\mid s_h,a_h), \overline{P}_h(\cdot\mid s_h,a_h) \big) \Big] = \frac12 \max_{V:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\to[-1,1]} \bigg| \Big\langle \theta^P_h - \theta^{\overline{P}}_h,\ \mathbb{E}^{\overline{P},\pi}\Big[ \sum_{s_{h+1}} V(s_{h+1}\mid s_h,a_h)\,\phi_h(s_{h+1}\mid s_h,a_h) \Big] \Big\rangle \bigg|.
$$
Applying Proposition K.9 and Corollary K.21 completes the proof of this case. We next deal with the class $\mathcal{P}$ of dynamics of MDPs with occupancy rank at most $d$. For $P\in\mathcal{P}$, we consider $\phi^P = (\phi^P_h: \Pi\to\mathbb{R}^d)$ and $\psi^P = (\psi^P_h: \mathcal{S}\to\mathbb{R}^d)$ such that $\mathbb{P}^{P,\pi}(s_h = s) = \langle \psi^P_h(s), \phi^P_h(\pi) \rangle$. Then the VER of $\mathcal{P}$ is given by $g^{P;\overline{P}}_h(\pi) = \mathbb{E}^{\overline{P},\pi}[\cdots]$. Applying Proposition K.9 and Example K.2 completes the proof of this case.



Which outputs in general a distribution over models within the class, rather than a single model.

Note that $R_M$ (and thus $M$) only specifies the conditional mean rewards instead of the reward distributions. Foster et al. (2021) mostly use the standard Hellinger distance (in the tuple $(o,r)$) in their definition of the DEC, which depends on the full reward distributions (cf. Appendix C.1 for detailed discussion).

Here $\mathcal{F}$ denotes the value class, $\mathcal{R}$ denotes the reward class, and $d_{\rm BE}$ denotes the Bellman-Eluder dimension of a certain class of reward-free Bellman errors induced by $\mathcal{F}$.

Rescaled to total reward within $[0,1]$.

The Hausdorff space requirement on $\mathcal{X}$ is only needed to ensure that $\Delta(\mathcal{X})$ contains all finitely supported distributions on $\mathcal{X}$.

We use $\mathbb{P}^{M,\pi}(o)$ and $\mathbb{P}^M(o\mid\pi)$ interchangeably in the following.

To extend to the continuous setting, only slight modifications are needed; see e.g. Foster et al. (2021, Section 3.2.3).

In other words, $\mathbb{E}_t$ is the conditional expectation with respect to $\mathcal{F}_{t-1} = \sigma(\mu^1, \pi^1, o^1, r^1, \dots, \pi^{t-1}, o^{t-1}, r^{t-1}, \mu^t)$.

An important observation is that, along with (1), this requirement implies $D_{\rm TV}\big( \mathbb{P}^{M,\pi}(\cdot), \mathbb{P}^{M_0,\pi}(\cdot) \big) \le \rho^2$ (for a proof, see e.g. (33)). Therefore, a $\rho$-optimistic covering must be a $\rho^2$-covering in TV distance.

Our version is equivalent to Agarwal & Zhang (2022a, Algorithm 1), except that we look at the full observation and reward vector (of all layers), whereas they only look at a random layer $h_t \sim \mathrm{Unif}([H])$.

Here, to avoid confusion, we move the $\infty$ in the superscript (cf. (45)) to the subscript.



$\mathbb{E}[\exp(\lambda\delta)] \le \exp\big( -\lambda(1 - 2\sigma^2\lambda)\,\|r - \widehat{r}\|_2^2 \big)$ for any $\lambda\in\mathbb{R}$.

Foster et al. (2021, Definition 3.2). We comment on the relationship between our optimistic covering and the covering introduced in Foster et al. (2021, Definition 3.2) (which is also used in their algorithms to handle infinite model classes). First, the covering in Foster et al. (2021) needs to cover the distribution of the reward, while ours only needs to cover the mean reward function. More importantly, Foster et al. (2021, Lemma A.16) explicitly introduces a factor $\log B$, where

Algorithm 4 TEMPERED AGGREGATION WITH COVERING

pP M p¨|πq, P ‹ p¨|πqq `Eo"P ‹ p¨|πq

the implied DEC bound only yields a slightly worse $(1/\varepsilon)^{(\beta+2)/\beta}$ sample complexity.

H.4 EXAMPLE: EDEC LOWER BOUND FOR TABULAR MDPS

In this section, we follow Domingues et al. (2021); Foster et al. (2021) to construct a class of tabular MDPs whose (localized) EDEC has the desired lower bound, and hence establish an $\Omega(HSA/\varepsilon^2)$ sample complexity lower bound for PAC learning in tabular MDPs, recovering the result of Domingues et al. (2021).

Proposition H.3 (EDEC lower bound for tabular MDPs). There exists a class $\mathcal{M}$ of MDPs with $S \ge 4$ states, $A \ge 2$ actions, horizon $H \ge 2\log_2(S)$, and the same reward function, such that
$$
\sup_{\overline{M}\in\mathcal{M}} \operatorname{edec}_\gamma\big( \mathcal{M}^\infty_\varepsilon(\overline{M}), \overline{M} \big) \ge c_1 \min\Big\{ 1, \frac{HSA}{\gamma} \Big\}
$$
for all $\gamma>0$ such that $\varepsilon \ge c_2 HSA/\gamma$, where $c_1, c_2$ are two universal constants.

:" " f w : x Þ Ñ max yPY wpx, yq * wPW , we have dcpF, γq ď dcpW, γq. Combining Proposition K.22 and Example K.2 gives Corollary K.21 directly.

s 1 ˇˇr P θ ps 1 |s, aq ´Pθ ps 1 |s, aq ˇˇ" ÿ

L 2 {γ ˘, and Algorithm REWARD-FREE E2D achieves SubOpt rf ď ε within r O `dH 2 L 2 log |P| {ε 2 ˘episodes of play.

where the inequality is due to the definition of σ 2 -sub-Gaussian random vector: For v " 2λpp r´rq P R d ,

Theorem D.7 (E2D-TA with covering). Algorithm 3 with $\mathrm{Alg}_{\rm Est}$ chosen as TEMPERED AGGREGATION WITH COVERING (19) and optimally chosen $\gamma$ achieves $\mathrm{Reg}_{\rm DM} \le C\inf$

the corresponding Hellinger decoupling coefficient can be bounded by $\operatorname{psc}^{\rm est}$ (Definition K.23), by the same argument.

1: Input: parameter $\beta > 0$.
2: Initialize confidence set $\mathcal{M}^1 = \mathcal{M}$.
3: for $t = 1, \dots, T$ do

Input: Parameters η " 1{3, γ ą 0; prior distribution µ 1 P ∆pPq. 2: // Exploration phase 3: for t " 1, . . . , T do P ∆pPq by Tempered Aggregation with observations only: µ t`1 pPq 9 P µ t pPq ¨exp ´η log P π t po t q ¯.

exp , R; p out , Pq :" E π"pout " f P,R pπ P,R q ´f P,R pπq

8,rf ε γ pPq, Pq, and let P P P attain the supremum. Plug in p exp " p exp pPq and by definition of rfdec γ (which considers inf pexp ), we have Ppπq ˘‰ ě rfdec .Let R P R attain the supremum above, and plug in p out " p out pP, Rq and by definition of the above (which considers inf pout ), we have Ppπq ˘‰ , where we recall that P P,A refers to the distribution induced by the exploration phase of A. Requiring 8C T T ε ď γ gives

Input: Parameters η " 1{3, γ ą 0; prior distribution µ 1 P ∆pPq. 2: for t " 1, . . . , T do

TV ´p P, P ‹ ¯:" max TV ´p Ppπq, P ‹ pπq ¯ď 2mdec γ pMq `4γT rlog |P 0 | `T ρ `2 logp2{δqs.For P P P, π P Π, p exp P ∆pΠq, µ out P ∆pPq, we consider the function V t pp exp , µ out ; P, πq :" E P"µout " D TV `Ppπq, Ppπq ˘‰ ´γE π"pexp E Note that Theorem D.8 implies that: subroutine (48) (agrees with (19) with η r " 0 and η p " η) achieves with probability at least 1 ´δ that

Appendix K.1 collects the intermediate results for proving Proposition 12. The proof of Proposition 12 then follows by combining several statements therein (see Appendix K.2). The proofs of Examples 13-15 are provided in Appendix K.3. Appendix K.4 presents some discussion of the definition of Bellman representability (compared with Foster et al. (2021)), along with some useful results on the complexities of general function classes. Finally, unless otherwise specified, the proofs of all new results in this section are presented in Appendix K.5.

Bounding DEC/EDEC by decoupling coefficient of M We now state our main intermediate result for bounding the DEC/EDEC for any M admitting a low-complexity Bellman representation, in the sense of a bounded decoupling coefficient. Proposition K.6 (Bounding DEC/EDEC by decoupling coefficient of Bellman representation). Suppose that G " pG M h q M PM,hPrHs is a Bellman representation of M. For any M P M, we define comppG M , γq :" H max

Proposition 12 can be directly extended to the more general case of infinite model classes by using the TEMPERED AGGREGATION WITH COVERING subroutine (19) (and (48) for reward-free learning) in the corresponding E2D algorithms (see Theorem D.7, Theorem H.1 and Theorem I.1). We summarize this in the following proposition.

Proposition K.10 (Variant of Proposition 12 with covering). Suppose $\mathcal{M}$ admits a Bellman representation $\mathcal{G}$ with low complexity: $\min\{ e(\mathcal{G}^M_h,\Delta), s(\mathcal{G}^M_h,\Delta)^2 \} \le \widetilde{O}(d)$, where $\widetilde{O}(\cdot)$ hides possible $\mathrm{polylog}(1/\Delta)$ factors. Further assume that $\mathcal{M}$ and $\mathcal{P}$ admit optimistic covers with bounded covering numbers: $\log N(\mathcal{M},\rho) \le \widetilde{O}(\dim(\mathcal{M}))$ and $\log N(\mathcal{P},\rho) \le \widetilde{O}(\dim(\mathcal{P}))$ for any $\rho>0$, where $\dim(\mathcal{M}), \dim(\mathcal{P})>0$ and $\widetilde{O}(\cdot)$ hides possible $\mathrm{polylog}(1/\rho)$ factors. Then, with the subroutines changed correspondingly to the versions with covering, we have:

(1) (No-regret learning) If $\pi^{\rm est}_{M,h} = \pi_M$ for all $M\in\mathcal{M}$ (the on-policy case), then Algorithm E2D-TA achieves $\mathrm{Reg}_{\rm DM} \le \widetilde{O}\big( HL\sqrt{d\cdot\dim(\mathcal{M})\,T} \big)$.

(2) (PAC learning) For any general $\{\pi^{\rm est}_{M,h}\}_{M\in\mathcal{M}, h\in[H]}$, Algorithm EXPLORATIVE E2D achieves $\mathrm{SubOpt} \le \varepsilon$ within $\widetilde{O}\big( d\cdot\dim(\mathcal{M})H^2L^2/\varepsilon^2 \big)$ episodes of play.

(3) (Reward-free learning) If $\mathcal{G}$ is a strong Bellman representation, then Algorithm REWARD-FREE E2D achieves $\mathrm{SubOpt}^{\rm rf} \le \varepsilon$ within $\widetilde{O}\big( d\cdot\dim(\mathcal{P})H^2L^2/\varepsilon^2 \big)$ episodes of play.

such that P π ps h " sq " xψ h psq, ϕ h pπqy , @s P S, π P Π.By definition, low-rank MDP with rank d is of occupancy rank d. Furthermore, for model class M consisting of MDPs with occupancy rank d, its VBE is bilinear with rank d. Therefore, EXPLO-p¨|s h , a h q, P P h p¨|s h , a h q ¯ı.

By Foster et al. (2021, Lemma 6.1), for $\Delta, \varepsilon > 0$ and $\rho\in\Delta(\Pi)$, it holds that $\theta(\mathcal{F},\Delta,\varepsilon;\rho) \le 4\min\{ s^2(\mathcal{F},\Delta), e(\mathcal{F},\Delta) \}$. It turns out that our decoupling coefficient can be upper bounded by the disagreement coefficient: applying Foster et al. (2021, Lemma E.3), we directly obtain the following result. Lemma K.20. For a function class $\mathcal{F}\subset(\Pi\to[-1,1])$, we have $\operatorname{dc}(\mathcal{F},\gamma) \le \inf$

∆qL 2 r βK `min tepG h , ∆q, Ku `K ¨∆5where O p1q hides the universal constant. Summing over h P rHs gives

By definition of mlec γ,K , taking inf ∆ą0 completes the proof. K.5.4 PROOF OF PROPOSITION K.9 By Proposition I.3 and the strong duality, we only need to upper bound the following quantity in terms of the decoupling coefficient:rrec γ pM, µq " sup E pP,R,πq"ν E P"µ ˇˇE pP,Rq,πwhere (i) is because we use the performance decomposition lemma, (ii) is due to the definition of strong Bellman representation. Similar to the proof of Proposition K.25, we have , ηq `ηE pP,Rq"ν E π"ν E P"µWe just need to choose p e,h P ∆pΠq as p e,h pπ 1 q " νptpP, R, πq : π est h " π 1 uq (here π est Applying Proposition I.3 completes the proof.K.5.5 PROOF OF PROPOSITION K.15We construct such a covering directly, which is a generalization of the construction in Example K.13. An important observation is that, by the definition of feature map (cf. Definition K.14), it must hold that ÿ

´PP h ps h`1 |s h , a h q ´PP h ps h`1 |s h , a h q ¯V ps h`1 |s h , a h q V ps h`1 |s h , a h qϕ h ps h`1 |s h , a h q

K.3.1 PROOF OF EXAMPLE 13, REGRET BOUND

Clearly, for the tabular MDP model class $\mathcal{M}$ with $S$ states and $A$ actions, its estimation complexity is $\log N(\mathcal{M},\rho) = \widetilde{O}(S^2AH)$ (see Example K.13), and its QBE is bilinear with rank $SA$. Thus, by the definition of comp in (54) and Example K.2, we have $\operatorname{comp}(\mathcal{G},\gamma) = O(SAH^2/\gamma)$, and further, by Proposition K.6(1) and Proposition K.10, E2D-TA achieves the regret bound claimed in Example 13. We remark that the same regret bound also holds for MOPS and OMLE, by the PSC bound in Proposition K.6(1) and Proposition K.7 combined with Theorem F.1 and Theorem G.3.

In the following, we demonstrate briefly how to construct an optimistic covering of the class of tabular MDPs. Without loss of generality, we only cover the class of transition dynamics $\mathcal{P}$.

Example K.13 (Optimistic covering of tabular MDPs). Consider $\mathcal{M}$, the class of MDPs with $S$ states, $A$ actions, and $H$ steps. Fix a $\rho_1\in(0,1]$ and $\rho = \rho_1^2/(eHS)$. For $M\in\mathcal{M}$, we compute its $\rho_1$-optimistic likelihood function via (55). A direct calculation shows that $\widetilde{\mathbb{P}}^{M,\pi} \ge \mathbb{P}^{M,\pi}$ for all $\pi$, and $\| \widetilde{\mathbb{P}}^{M,\pi}(\cdot) - \mathbb{P}^{M,\pi}(\cdot) \|_1 \le \rho_1^2$. Clearly, there are at most $\lceil 1/\rho\rceil^{S^2AH}$ different optimistic likelihood functions defined by (55), and we can form $\mathcal{M}_0$ by picking a representative in $\mathcal{M}$ for each optimistic likelihood function (when possible). Then, $\log|\mathcal{M}_0| = O\big( S^2AH\log(SH/\rho_1) \big)$.

K.3.2 PROOF OF EXAMPLE 14, REGRET BOUND

We follow the commonly used definition of linear mixture MDPs (Chen et al., 2021), which is slightly more general than the one in Ayoub et al. (2020).

Definition K.14 (Linear mixture MDPs). An MDP is called a linear mixture MDP (of rank $d$) if there exist feature maps $\phi_h(\cdot\mid\cdot,\cdot): \mathcal{S}\times\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ and parameters $(\theta_h)_h\subset\mathbb{R}^d$ such that $P_h(s'\mid s,a) = \langle\theta_h, \phi_h(s'\mid s,a)\rangle$. We further assume that $\|\theta_h\|_2 \le B$ for all $h\in[H]$, and $\big\| \sum_{s'}\phi_h(s'\mid s,a)V(s') \big\|_2 \le 1$ for all $V: \mathcal{S}\to[0,1]$ and tuples $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$.

Suppose that $\mathcal{M}$ is a class of linear mixture MDPs with the given feature map $\phi$. Then the definition above directly yields that the QBE of $\mathcal{M}$ is bilinear with rank $d$. Therefore, as long as $\log N(\mathcal{M},\rho) = \widetilde{O}(dH)$, by Proposition K.10 we can obtain the claimed $\widetilde{O}\big( d\sqrt{H^3T} \big)$ regret of E2D-TA.

The following proposition provides an upper bound on $\operatorname{est}(\mathcal{M})$ via a concrete construction. For simplicity of discussion, we assume that the initial distribution is known, and we also assume the mean reward function is known in the no-regret and PAC learning settings.

Proposition K.15 (Optimistic covering for linear mixture MDPs). Given a feature map $\phi$ of dimension $d$ and constant $B$, suppose that $\mathcal{M}$ consists of linear mixture MDPs with feature map $\phi$ and parameter bounded by $B$ (without a reward component). Then for any $\rho$, there exists a $\rho$-optimistic covering $(\widetilde{\mathbb{P}}, \mathcal{M}_0)$ with $\log|\mathcal{M}_0| = \widetilde{O}(dH)$.

K.3.3 PROOF OF EXAMPLE 15, PAC BOUND

We consider the broader class of MDPs with low occupancy rank (Du et al., 2021 ):

