HORIZON-FREE REINFORCEMENT LEARNING FOR LATENT MARKOV DECISION PROCESSES

Abstract

We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\tilde{O}(\sqrt{M\Gamma SAK})$ regret bound, where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\Gamma \le S$ is the maximum transition degree of any state-action pair. The regret bound scales only logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. Key in our proof is an analysis of the total variance of alpha vectors, which is carefully bounded by a recursion-based technique. We complement our positive result with a novel $\Omega(\sqrt{MSAK})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant. Our lower bound relies on new constructions of hard instances and an argument based on the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs, and thus can be of independent interest.

¹ Their original bound is $\tilde{O}(\sqrt{MS^2AH^3K})$ under the scaling that the reward from each step is bounded by 1. We rescale the reward to be bounded by $1/H$ so that the total reward from each episode is bounded by 1, which is the setting we consider.

1. INTRODUCTION

One of the most popular models for reinforcement learning (RL) is the Markov Decision Process (MDP), in which transitions and rewards depend only on the current state and the agent's action. In standard MDPs, the agent fully observes the state, so the optimal policy also depends only on states (called a history-independent policy). There is a long line of research on MDPs, and minimax regret and sample complexity guarantees have been derived. Another popular model is the Partially Observable MDP (POMDP), in which the agent only has partial observations of the states. Even though the underlying transition is still Markovian, the sample complexity lower bound has been proven to be exponential in the state and action sizes. This is in part because optimal policies for POMDPs are history-dependent. In this paper we focus on a middle ground between MDPs and POMDPs, namely the Latent MDP (LMDP). An LMDP can be viewed as a collection of MDPs sharing the same state and action spaces, but whose transitions and rewards may vary across the collection. Each MDP has a probability of being sampled at the beginning of each episode, and the sampled MDP does not change during the episode. The agent needs to find a policy which works well on these MDPs in an average sense. Empirically, LMDPs can be used for a wide variety of applications (Yu et al., 2020; Iakovleva et al., 2020; Finn et al., 2018; Ramamoorthy et al., 2013; Doshi-Velez & Konidaris, 2016; Yao et al., 2018). In general, there exists no policy that is optimal on every single MDP simultaneously, so this task is strictly harder than solving an MDP. On the other hand, an LMDP is a special case of a POMDP because, for each MDP, the unobserved part of the state is static within each episode and the observable part is just the state of that MDP. Unfortunately, for generic LMDPs there exists an exponential sample complexity lower bound (Kwon et al., 2021), so additional assumptions are needed to make the problem tractable.
In this paper, we consider the setting where, after each episode ends, the agent receives the context indicating which MDP it played with. Such information is often available. For example, in a maze navigation task, the location of the goal state can be viewed as the context. In this setting, Kwon et al. (2021) obtained an $\tilde{O}(\sqrt{MS^2AHK})$ regret upper bound, where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the number of episodes. They did not study the regret lower bound.¹ To benchmark this result, the only available bound is the $\tilde{\Theta}(\sqrt{SAK})$ bound for standard MDPs, obtained by viewing an MDP as a special case of an LMDP. Comparing these two bounds, we find significant gaps: ① Is the dependency on $M$ in LMDPs necessary? ② The bound for MDPs is (nearly) horizon-free (no dependency on $H$); is the polynomial dependency on $H$ for LMDPs necessary? ③ The dependency on the number of states is $\sqrt{S}$ for MDPs, but the bound in Kwon et al. (2021) for LMDPs scales with $S$. In this paper, we resolve the first two questions and partially answer the third.

1.1. MAIN CONTRIBUTIONS AND TECHNICAL NOVELTIES

We obtain the following new results:

• Near-optimal regret guarantee for LMDPs. We present an algorithmic framework for LMDPs with context in hindsight. This framework can be instantiated with a plug-in solver for planning problems. We consider two types of solvers, one model-optimistic and one value-optimistic, and prove their regret bound to be $\tilde{O}(\sqrt{M\Gamma SAK})$, where $\Gamma \le S$ is the maximum transition degree of any state-action pair. Compared with the result in Kwon et al. (2021), ours only requires the total reward to be bounded, whereas they required a bounded reward at each step. Furthermore, we improve the $H$-dependence from $\sqrt{H}$ to logarithmic, making our bound (nearly) horizon-free. Lastly, our bound scales with $\sqrt{S\Gamma}$, which is strictly better than the $S$ in their bound. The main technique of our model-optimistic algorithm is to use a Bernstein-type confidence set on each entry of the transition dynamics, leading to a small Bellman error. The main difference between our value-optimistic algorithm and that of Kwon et al. (2021) is that we use a bonus depending on the variance of next-step values, derived from Bennett's inequality instead of Bernstein's inequality. This helps propagate optimism from the last step to the first step, avoiding the $H$-dependency. We analyze these two solvers in a unified way, as their Bellman errors are of the same order.

• New regret lower bound for LMDPs. We obtain a novel $\Omega(\sqrt{MSAK})$ regret lower bound for LMDPs. This lower bound shows that the dependency on $M$ is necessary for LMDPs. Notably, it also implies that the $\tilde{O}(\sqrt{M\Gamma SAK})$ upper bound is optimal up to a $\sqrt{\Gamma}$ factor. Furthermore, our lower bound holds even for $\Gamma = 2$, which shows that our upper bound is minimax optimal for the class of LMDPs with $\Gamma = O(1)$. Our proof relies on new constructions of hard instances, different from existing ones for MDPs (Domingues et al., 2021). In particular, we use a two-phase structure to construct hard instances (cf. Figure 1).
Furthermore, previous approaches for proving lower bounds for MDPs do not work for LMDPs. For example, in the MDP instance of Domingues et al. (2021), the randomness comes from the algorithm and from the last transition step before entering the good or bad state. In an LMDP, the randomness of sampling the MDP from multiple MDPs must also be considered. Such randomness not only dilutes the value function by averaging over the MDPs, but also divides the pushforward measure (see Page 3 of Domingues et al. (2021)) into $M$ parts. As a result, the $M$ terms in the KL divergence in Equation (2) of Domingues et al. (2021) and in Equation (10) cancel out, so the final lower bound does not contain $M$. To overcome this, we adopt the symmetrization technique from theoretical computer science. This technique helps generalize bounds from a single-party result to a multi-party result, which can give rise to a tighter lower bound.

2. RELATED WORK

LMDPs. As shown by Steimle et al. (2021), in the general case, optimal policies for LMDPs are history-dependent and PSPACE-hard to find. This is different from standard MDPs, where there always exists an optimal history-independent policy. However, even finding the optimal history-independent policy is NP-hard (Littman, 1994). Chades et al. (2012) provided heuristics for finding the optimal history-independent policy. Kwon et al. (2021) investigated the sample complexity and regret bounds of LMDPs. Specifically, they presented an exponential lower bound for general LMDPs without context in hindsight, and then derived algorithms with polynomial sample complexity and sub-linear regret for two special cases (context in hindsight, or δ-strongly separated MDPs). The LMDP has been studied as a type of multi-task RL (Taylor & Stone, 2009; Brunskill & Li, 2013; Liu et al., 2016; Hallak et al., 2015). It has been applied to model combinatorial optimization problems (Zhou et al., 2022). There are also related studies such as model transfer (Lazaric, 2012; Zhang & Wang, 2021) and contextual decision processes (Jiang et al., 2017). In empirical works, the LMDP has wide applications in multi-task RL (Yu et al., 2020), meta RL (Iakovleva et al., 2020; Finn et al., 2018), latent-variable MDPs (Ramamoorthy et al., 2013) and hidden-parameter MDPs (Doshi-Velez & Konidaris, 2016; Yao et al., 2018).

Regret analysis for MDPs. LMDPs are generalizations of MDPs, so previous approaches to solving MDPs can provide insights. There is a long line of work on regret analysis for MDPs (Azar et al., 2017; Dann et al., 2017; 2019; Zanette & Brunskill, 2019; Zhang et al., 2021a). In this paper, we focus on time-homogeneous, finite-horizon, undiscounted MDPs whose total reward is upper-bounded by 1. Recent work showed that in this setting the regret can be (nearly) horizon-free for tabular MDPs (Wang et al., 2020; Zhang et al., 2022; 2021a; 2020; Ren et al., 2021).
Importantly, these results indicate that RL may not be more difficult than bandits in the minimax sense. More recent work generalized the horizon-free results to other MDP problems (Zhang et al., 2021b; Kim et al., 2021; Tarbouriech et al., 2021; Zhou & Gu, 2022). However, all existing work with horizon-free guarantees only considered single-environment problems. Ours is the first horizon-free guarantee that goes beyond MDPs. Neu & Pike-Burke (2020) summarized the "optimism in the face of uncertainty" (OFU) principle in RL. They named two types of optimism: ① model-optimistic algorithms construct confidence sets around the empirical transitions and rewards, and select the policy with the highest value over the best possible models in these sets; ② value-optimistic algorithms construct upper bounds on the optimal value functions, and select the policy which maximizes this optimistic value function. Our paper follows their taxonomy and provides one algorithm for each type of optimism.

3. PROBLEM SETUP

In this section, we give a formal definition of Latent Markov Decision Processes (Latent MDPs).

Notations. For any event $E$, we use $\mathbb{1}[E]$ to denote the indicator function, i.e., $\mathbb{1}[E] = 1$ if $E$ holds and $\mathbb{1}[E] = 0$ otherwise. For any set $X$, we use $\Delta(X)$ to denote the probability simplex over $X$. For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \dots, n\}$. For any probability distribution $P$, we use $\mathrm{supp}(P) = \|P\|_0$ to denote the size of the support of $P$, i.e., $\sum_x \mathbb{1}[P(x) > 0]$. There are three ways to denote a $d$-dimensional vector (function): for any parameter $p$, $x(p) = (x_1(p), x_2(p), \dots, x_d(p))$ if the indices are natural numbers, and $x(\cdot|p) = (x(i_1|p), x(i_2|p), \dots, x(i_d|p))$ and $x(p\,\cdot) = (x(p i_1), x(p i_2), \dots, x(p i_d))$ if the indices are from a set $I = \{i_1, i_2, \dots, i_d\}$. For any number $q$, we use $x^q$ to denote the vector $(x_1^q, x_2^q, \dots, x_d^q)$. For two $d$-dimensional vectors $x$ and $y$, we use $x^\top y = \sum_i x_i y_i$ to denote the inner product. If $x$ is a probability distribution, we use $\mathbb{V}(x, y) = \sum_i x_i (y_i - x^\top y)^2 = x^\top (y^2) - (x^\top y)^2$ to denote the empirical variance. We use $\iota = 2 \ln \frac{2MSAHK}{\delta}$ as a log term, where $\delta$ is the confidence parameter.
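The two expressions for the empirical variance $\mathbb{V}(x, y)$ above are algebraically identical; the following illustrative sketch (not part of the paper's algorithms) verifies both forms and the support-size notation numerically:

```python
# Illustrative check of the notation: V(x, y) = sum_i x_i (y_i - x^T y)^2
#                                            = x^T (y^2) - (x^T y)^2,
# where x is a probability distribution and y is an arbitrary vector.

def empirical_variance(x, y):
    """V(x, y) computed directly from the definition."""
    mean = sum(xi * yi for xi, yi in zip(x, y))
    return sum(xi * (yi - mean) ** 2 for xi, yi in zip(x, y))

def empirical_variance_moment_form(x, y):
    """V(x, y) via the second-moment identity x^T (y^2) - (x^T y)^2."""
    mean = sum(xi * yi for xi, yi in zip(x, y))
    second_moment = sum(xi * yi * yi for xi, yi in zip(x, y))
    return second_moment - mean ** 2

def support_size(p):
    """supp(P) = ||P||_0: the number of strictly positive entries."""
    return sum(1 for pi in p if pi > 0)

x = [0.2, 0.5, 0.3]   # a probability distribution
y = [1.0, 0.0, 0.5]   # an arbitrary bounded vector
assert abs(empirical_variance(x, y) - empirical_variance_moment_form(x, y)) < 1e-12
```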

3.1. LATENT MARKOV DECISION PROCESS

A Latent MDP (Kwon et al., 2021) is a collection of finitely many MDPs $\mathcal{M} = \{M_1, M_2, \dots, M_M\}$, where $M = |\mathcal{M}|$. All the MDPs share the state set $\mathcal{S}$, action set $\mathcal{A}$ and horizon $H$. Each MDP $M_m = (\mathcal{S}, \mathcal{A}, H, \nu_m, P_m, R_m)$ has its own initial state distribution $\nu_m \in \Delta(\mathcal{S})$, transition model $P_m : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ and deterministic reward function $R_m : \mathcal{S} \times \mathcal{A} \to [0, 1]$. Let $w_1, w_2, \dots, w_M$ be the mixing weights of the MDPs, such that $w_m > 0$ for every $m$ and $\sum_{m=1}^M w_m = 1$. Denote $S = |\mathcal{S}|$, $A = |\mathcal{A}|$ and $\Gamma = \max_{m,s,a} \mathrm{supp}(P_m(\cdot|s, a))$. $\Gamma$ can be interpreted as the maximum degree of any transition, a quantity our regret bound depends on. Note that we always have $\Gamma \le S$. In previous work, Lattimore & Hutter (2012) assumed $\Gamma = 2$, and Fruit et al. (2020) also derived a regret bound that scales with $\Gamma$. In the worst case, the optimal policy of an LMDP is history-dependent and PSPACE-hard to find (Corollary 1 and Proposition 3 in Steimle et al. (2021)). Aside from the computational difficulty, storing a history-dependent policy requires exponentially large space, so it is generally impractical. In this paper, we seek to provide a result for any fixed policy class $\Pi$. For example, we can take $\Pi$ to be the set of all history-independent, deterministic policies to alleviate the space issue. Following previous work (Kwon et al., 2021), we assume access to oracles for planning and optimization; see Section 4 for the formal definitions. We consider an episodic, finite-horizon, undiscounted reinforcement learning problem on LMDPs. The agent interacts with the environment for $K$ episodes. At the start of every episode, one MDP $M_m \in \mathcal{M}$ is randomly chosen with probability $w_m$. Throughout the episode, the true context is hidden, and the agent can only choose actions based on the history up to the current time. However, at the end of each episode (after $H$ steps), the true context $m$ is revealed to the agent. This permits an unbiased model estimation for the LMDP.
As in Cohen et al. (2020), where the central difficulty is estimating the transition model, we focus on learning $P$ only. For simplicity, we assume that $w$, $\nu$ and $R$ are known to the agent, because they can be estimated easily. The assumption of deterministic rewards is also for simplicity; our analysis can be extended to unknown reward distributions with bounded support.

3.2. VALUE FUNCTIONS, Q-FUNCTIONS AND ALPHA VECTORS

By convention, the expected reward of executing a policy on any MDP can be defined via the value function $V$ and Q-function $Q$. Since for MDPs there is always an optimal policy which is history-independent, $V$ and $Q$ only need the current state and action as parameters. However, these notations fall short under the LMDP setting, where optimal policies may be history-dependent. The full information is encoded in the history, so here we use a more general definition called the alpha vector (following the notation of Kwon et al. (2021)). For any time $t \ge 1$, let $h_t = (s, a, r)_{1:t-1} s_t$ be the history up until time $t$. Define $\mathcal{H}_t$ as the set of histories observable at time step $t$, and $\mathcal{H} := \cup_{t=1}^H \mathcal{H}_t$ as the set of all possible histories. We define the alpha vectors $\alpha^\pi_m(h)$ for $(m, h) \in [M] \times \mathcal{H}$ as follows:
$$\alpha^\pi_m(h) := \mathbb{E}\Bigg[\sum_{t'=t}^H R_m(s_{t'}, a_{t'}) \,\Big|\, M_m, \pi, h_t = h\Bigg], \qquad \alpha^\pi_m(h, a) := \mathbb{E}\Bigg[\sum_{t'=t}^H R_m(s_{t'}, a_{t'}) \,\Big|\, M_m, \pi, (h_t, a_t) = (h, a)\Bigg].$$
The alpha vectors are exactly the value functions and Q-functions on each individual MDP. Next, we introduce the concept of the belief state to show how to do planning in an LMDP. Let $b_m(h)$ denote the belief over the $M$ MDPs corresponding to a history $h$, i.e., the probability of the true MDP being $M_m$ conditioned on observing history $h$. We have the following recursion:
$$b_m(s) = \frac{w_m \nu_m(s)}{\sum_{m'=1}^M w_{m'} \nu_{m'}(s)} \quad \text{and} \quad b_m(hars') = \frac{b_m(h) P_m(s'|s, a) \mathbb{1}[r = R_m(s, a)]}{\sum_{m'=1}^M b_{m'}(h) P_{m'}(s'|s, a) \mathbb{1}[r = R_{m'}(s, a)]}.$$
The value functions and Q-functions for an LMDP are defined via belief states and alpha vectors: $V^\pi(h) := b(h)^\top \alpha^\pi(h)$ and $Q^\pi(h, a) := b(h)^\top \alpha^\pi(h, a)$. Direct computation (see Appendix B.1) gives
$$V^\pi(h) = \sum_{a \in \mathcal{A}} \pi(a|h) \underbrace{\Bigg( b(h)^\top R(s, a) + \sum_{s' \in \mathcal{S}, r} \sum_{m'=1}^M b_{m'}(h) P_{m'}(s'|s, a) \mathbb{1}[r = R_{m'}(s, a)] V^\pi(hars') \Bigg)}_{= Q^\pi(h, a)}.$$
So planning in an LMDP can be viewed as planning over belief states.
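The belief recursion above admits a direct implementation. The sketch below is illustrative only (the dictionary-based LMDP representation and function names are our own, not the paper's):

```python
def initial_belief(w, nu, s):
    """b_m(s) is proportional to w_m * nu_m(s): belief after seeing initial state s."""
    unnorm = [w[m] * nu[m].get(s, 0.0) for m in range(len(w))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def belief_update(b, P, R, s, a, r, s_next):
    """b_m(h a r s') is proportional to b_m(h) * P_m(s'|s,a) * 1[r == R_m(s,a)]."""
    unnorm = [
        b[m] * P[m][(s, a)].get(s_next, 0.0) * (1.0 if r == R[m][(s, a)] else 0.0)
        for m in range(len(b))
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two contexts that agree on the initial state but differ on the transition:
w = [0.5, 0.5]
nu = [{0: 1.0}, {0: 1.0}]
P = [{(0, 0): {1: 1.0}},   # context 0 moves deterministically to state 1
     {(0, 0): {0: 1.0}}]   # context 1 stays at state 0
R = [{(0, 0): 0.0}, {(0, 0): 0.0}]

b0 = initial_belief(w, nu, 0)               # uniform belief [0.5, 0.5]
b1 = belief_update(b0, P, R, 0, 0, 0.0, 1)  # observing s' = 1 identifies context 0
```

Note how a single informative transition collapses the belief to a point mass; the encoding phase of the lower-bound construction in Section 5 exploits exactly this effect.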
For the optimal history-dependent policy, we can select
$$\pi(h) = \arg\max_{a \in \mathcal{A}} \Bigg( b(h)^\top R(s, a) + \sum_{s' \in \mathcal{S}, r} \sum_{m'=1}^M b_{m'}(h) P_{m'}(s'|s, a) \mathbb{1}[r = R_{m'}(s, a)] V^\pi(hars') \Bigg), \tag{1}$$
using dynamic programming in descending order of the length of $h$.

3.3. PERFORMANCE MEASURE

We use cumulative regret to measure the algorithm's performance. The optimal policy is $\pi^\star = \arg\max_{\pi \in \Pi} V^\pi$, which also does not know the context when interacting with the LMDP. Suppose the agent interacts with the environment for $K$ episodes, playing policy $\pi_k$ in episode $k$. The regret is defined as
$$\mathrm{Regret}(K) := K V^\star - \sum_{k=1}^K V^{\pi_k}.$$

4. MAIN ALGORITHMS AND RESULTS

In this section, we present two algorithms and show their minimax regret guarantees. The first uses a Bernstein confidence set on transition probabilities, which was first applied to SSP in Cohen et al. (2020) to derive a horizon-free regret. This algorithm uses a bi-level optimization oracle: in the inner layer, an oracle finds the optimal policy inside $\Pi$ under a given LMDP; in the outer layer, an oracle finds the transition model inside the confidence set which maximizes the optimal expected reward. The second adapts the Monotonic Value Propagation (MVP) algorithm (Zhang et al., 2021a) to LMDPs. This algorithm requires an oracle to solve an LMDP with a dynamic bonus: the bonuses depend on the variances of the next-step alpha vectors. Both algorithms enjoy the following regret guarantee.

Theorem 1. For both the Bernstein confidence set for LMDPs (Algorithm 1 combined with Algorithm 2) and Monotonic Value Propagation for LMDPs (Algorithm 1 combined with Algorithm 3), with probability at least $1 - \delta$, we have
$$\mathrm{Regret}(K) = O\Bigg( \sqrt{M \Gamma S A K} \ln \frac{MSAHK}{\delta} + M S^2 A \ln^2 \frac{MSAHK}{\delta} \Bigg).$$

As we have discussed, our result improves upon Kwon et al. (2021) and has only logarithmic dependency on the planning horizon $H$. We also have a lower-order term which scales with $S^2$. We note that even in the standard MDP setting, it remains a major open problem to obtain a minimax optimal regret bound with no lower-order term (Zhang et al., 2021a). Below we describe the details of our algorithms.

Algorithm framework. The two algorithms introduced in this section share a framework for estimating the model; the only difference between them is the solver used to compute the exploration policy. The framework is shown in Algorithm 1. It estimates the model (cf. Line 14 in Algorithm 1) and then selects a policy for the next round using one of the oracles (cf. Line 18 in Algorithm 1). Following Zhang et al. (2021a), we use a doubling schedule for each state-action pair in every MDP to update the estimate and the exploration policy.

Common notations. Some of the notations have been introduced in Algorithm 1, but for convenience we repeat them here. For any notation, we put the episode number $k$ in the superscript. For any observation, we put the time step $t$ in the subscript. For any model component, we put the context $m$ in the subscript. The alpha vectors and value functions for the optimistic model are denoted with an extra tilde.

Algorithm 1 (lines 6-22):
6: Choose action $a^k_t = \pi^k(s^k_t)$.
7: end for
8: Observe state $s^k_{H+1}$ and receive $m_k$ in hindsight.
9: for $t = 1, 2, \dots, H$ do
10: Set $n_{m_k}(s^k_t, a^k_t) \leftarrow n_{m_k}(s^k_t, a^k_t) + 1$ and $n_{m_k}(s^k_{t+1} | s^k_t, a^k_t) \leftarrow n_{m_k}(s^k_{t+1} | s^k_t, a^k_t) + 1$.
11: if $\exists i \in \mathbb{N},\ n_{m_k}(s^k_t, a^k_t) = 2^i$ then
12: Set TRIGGERED = TRUE.
13: Set $N_{m_k}(s^k_t, a^k_t) \leftarrow n_{m_k}(s^k_t, a^k_t)$.
14: Set $\hat{P}_{m_k}(s' | s^k_t, a^k_t) \leftarrow n_{m_k}(s' | s^k_t, a^k_t) / n_{m_k}(s^k_t, a^k_t)$ for all $s' \in \mathcal{S}$.
15: end if
16: end for
17: if TRIGGERED then
18: Set $\pi^{k+1} \leftarrow$ Solver() (by Algorithm 2 or Algorithm 3).
19: else
20: Set $\pi^{k+1} \leftarrow \pi^k$.
21: end if
22: end for
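The doubling schedule in lines 10-14 can be sketched as follows. This is an illustrative simplification (the class name and data layout are ours, and re-planning is abstracted into the returned flag):

```python
from collections import defaultdict

class DoublingModelEstimator:
    """Sketch of the doubling-epoch update in Algorithm 1: the empirical
    transition estimate for (m, s, a) is refreshed only when the visit count
    n_m(s, a) hits a power of two, so each (m, s, a) triggers at most
    log2(H*K) + 1 re-plans over the whole run."""

    def __init__(self):
        self.n = defaultdict(int)        # n_m(s, a)
        self.n_next = defaultdict(int)   # n_m(s'|s, a)
        self.P_hat = {}                  # snapshot used by the policy solver

    def observe(self, m, s, a, s_next):
        """Process one transition; return True if a re-plan is triggered."""
        self.n[(m, s, a)] += 1
        self.n_next[(m, s, a, s_next)] += 1
        count = self.n[(m, s, a)]
        if count & (count - 1) == 0:  # count is a power of two
            # Refresh the snapshot for this (m, s, a) from current counts.
            self.P_hat[(m, s, a)] = {
                sp: c / count
                for (mm, ss, aa, sp), c in self.n_next.items()
                if (mm, ss, aa) == (m, s, a)
            }
            return True
        return False
```

With this schedule, visits 1, 2, 4, 8, ... of a state-action pair trigger updates, which is what keeps the number of policy switches logarithmic.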

4.1. BERNSTEIN CONFIDENCE SET OF TRANSITIONS FOR LMDPS

We introduce a model-optimistic approach using a confidence set over transition probabilities.

Optimistic LMDP construction. The Bernstein confidence set $\mathcal{P}^{k+1}$ is constructed in Equation (2). Notice that we do not change the reward function, so the total reward of any trajectory is still upper-bounded by 1.

Policy solver. The policy solver is given in Algorithm 2. It solves a two-level optimization problem on Line 2: for the inner problem, given a transition model $\tilde{P}$ and all other known quantities $w, \nu, R$, it needs a planning oracle to find the optimal policy; for the outer problem, it needs to find the optimal transition model. For planning, we can use the method presented in Equation (1).

Algorithm 2 Solver-L-Bernstein
1: Construct $\mathcal{P}^{k+1}$ using Equation (2).
2: Find $\tilde{P}^{k+1} \leftarrow \arg\max_{\tilde{P} \in \mathcal{P}^{k+1}} \big( \max_{\pi \in \Pi} V^\pi_{\tilde{P}} \big)$.
3: Find $\pi^{k+1} \leftarrow \arg\max_{\pi \in \Pi} V^\pi_{\tilde{P}^{k+1}}$.
4: Return: $\pi^{k+1}$.
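A Bernstein-type confidence set constrains each entry of the transition model around its empirical estimate. The sketch below illustrates only the general shape of such an entrywise interval; the constants c1 and c2 are placeholders of our own, not the paper's Equation (2):

```python
import math

def bernstein_interval(p_hat, n, iota, c1=2.0, c2=14.0 / 3.0):
    """Illustrative entrywise Bernstein-type interval around an empirical
    transition probability p_hat estimated from n samples. The constants
    c1, c2 are placeholders; the key feature is the
    sqrt(variance / n) + 1/n shape, whose first term vanishes for
    near-deterministic entries (p_hat close to 0 or 1)."""
    radius = c1 * math.sqrt(p_hat * (1.0 - p_hat) * iota / n) + c2 * iota / n
    lo = max(0.0, p_hat - radius)
    hi = min(1.0, p_hat + radius)
    return lo, hi
```

For deterministic entries the variance term is zero, so the interval has width $O(\iota / n)$ rather than $O(\sqrt{\iota / n})$; this sharpness is a key ingredient of horizon-free analyses.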

4.2. MONOTONIC VALUE PROPAGATION FOR LMDP

We introduce a value-optimistic approach which computes a variance-dependent bonus. This technique was originally used to solve standard MDPs (Zhang et al., 2021a).

Optimistic LMDP construction. The optimistic model contains a bonus function, which is defined inductively using the next-step alpha vector. In episode $k$, for any policy $\pi$, assume the alpha vectors for all histories of length $t + 1$ have been calculated; then for any history $h$ of length $t$, the bonus is defined as
$$B^k_m(h, a) := \max\Bigg\{ 4 \sqrt{\frac{\mathrm{supp}\big(\hat{P}^k_m(\cdot|s, a)\big) \, \mathbb{V}\big(\hat{P}^k_m(\cdot|s, a), \tilde{\alpha}^\pi_m(har\cdot)\big) \, \iota}{N^k_m(s, a)}}, \ \frac{16 S \iota}{N^k_m(s, a)} \Bigg\}, \tag{3}$$
where $r = R_m(s, a)$. Next, the alpha vector of history $h$ is
$$\tilde{\alpha}^\pi_m(h) := \min\Big\{ R_m(s, a) + B^k_m(h, a) + \hat{P}^k_m(\cdot|s, a)^\top \tilde{\alpha}^\pi_m(har\cdot), \ 1 \Big\}, \quad \text{where } a = \pi(h). \tag{4}$$
Finally, the value function is
$$\tilde{V}^\pi := \sum_{m=1}^M \sum_{s \in \mathcal{S}} w_m \nu_m(s) \tilde{\alpha}^\pi_m(s). \tag{5}$$

Policy solver. The policy solver is given in Algorithm 3. It finds the policy maximizing the optimistic value, with a dynamic bonus function depending on the policy itself. This solver is tractable if we only care about deterministic policies in $\Pi$. This restriction is reasonable because the original LMDP always admits an optimal policy which is deterministic. Further, according to the proof of Lemma 13, we only need a policy whose optimistic value is no less than the optimal value. Thus, there always exists an exhaustive search algorithm for this solver, which enumerates each action at each history.

Algorithm 3 Solver-L-MVP
1: Use the optimistic model defined in Equation (3), Equation (4) and Equation (5).
2: Find $\pi^{k+1} \leftarrow \arg\max_{\pi \in \Pi} \tilde{V}^\pi$.
3: Return: $\pi^{k+1}$.
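For a single $(h, a)$ pair, the bonus and the clipped backup of Equations (3) and (4) can be sketched as below. This is an illustrative rendering with our own vector representation and variable names:

```python
import math

def mvp_bonus(p_hat, alpha_next, N, iota, S):
    """Sketch of the variance-dependent bonus B_m^k(h, a):
        B = max( 4 * sqrt( supp(p_hat) * V(p_hat, alpha_next) * iota / N ),
                 16 * S * iota / N ),
    where V(p, v) = p^T (v^2) - (p^T v)^2 and supp(p_hat) <= Gamma."""
    supp = sum(1 for p in p_hat if p > 0)
    mean = sum(p * v for p, v in zip(p_hat, alpha_next))
    second = sum(p * v * v for p, v in zip(p_hat, alpha_next))
    var = max(0.0, second - mean ** 2)  # guard against tiny negative rounding
    return max(4.0 * math.sqrt(supp * var * iota / N), 16.0 * S * iota / N)

def optimistic_alpha(reward, bonus, p_hat, alpha_next):
    """One step of the clipped optimistic backup in Equation (4):
    alpha(h) = min( R + B + p_hat^T alpha(har.), 1 )."""
    backup = reward + bonus + sum(p * v for p, v in zip(p_hat, alpha_next))
    return min(backup, 1.0)
```

The clipping at 1 reflects the total-reward normalization; without it, bonuses accumulated over $H$ steps would reintroduce a polynomial $H$-dependence.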

5. REGRET LOWER BOUND

In this section, we present a regret lower bound for the unconstrained policy class, i.e., when $\Pi$ contains all possible history-dependent policies. First, we note that this lower bound cannot be directly reduced to solving $M$ MDPs (with the context revealed at the beginning of each episode), because simply changing the time at which the context is revealed changes the optimal policy and its value function. At a high level, our approach is to transform the problem with context in hindsight into one where the context is essentially told beforehand, without affecting the optimal value function. To achieve this, we use a small portion of the states to encode the context at the beginning, so that the optimal policy can extract information from them and fully determine the context. After the transformation, we can view the LMDP as a set of independent MDPs, so it is natural to leverage results from MDP lower bounds. Intuitively, since the lower bound for an MDP is $\sqrt{SAK}$, and each MDP is assigned roughly $K/M$ episodes, the lower bound for the LMDP is $M \sqrt{SA \cdot \frac{K}{M}} = \sqrt{MSAK}$. To formally prove this, we adopt the symmetrization technique from the theoretical computer science community (Phillips et al., 2012; Woodruff & Zhang, 2014; Fischer et al., 2017; Vempala et al., 2020). When an algorithm interacts with an LMDP, we can focus on each MDP while viewing the interactions with the other MDPs as irrelevant: we hard-code the other MDPs into the algorithm, deriving an algorithm for a single MDP. In other words, we can insert an MDP into any of the $M$ positions, and these positions are all symmetric from the algorithm's view. Consequently, the regret can be distributed evenly over the $M$ MDPs. The main theorem is stated here, before we introduce the construction of the LMDP instances; its proof is given in Appendix B.4.

Theorem 2. Assume that $S \ge 6$, $A \ge 2$ and $M \le \lfloor S/2 \rfloor!$. For any algorithm $\pi$, there exists an LMDP $\mathcal{M}_\pi$ such that, for $K \ge \tilde{\Omega}(M^2 + MSA)$, its expected regret in $\mathcal{M}_\pi$ after $K$ episodes satisfies
$$R(\mathcal{M}_\pi, \pi, K) := \mathbb{E}\Bigg[ \sum_{k=1}^K \big( V^\star - V^k \big) \,\Big|\, \mathcal{M}_\pi, \pi \Bigg] = \Omega\big( \sqrt{MSAK} \big).$$

Several remarks are in order. ① This is the first regret lower bound for LMDPs with context in hindsight. To the best of our knowledge, the introduction of the symmetrization technique to lower bound constructions in RL is novel. ② This lower bound matches the minimax regret upper bound (Theorem 1) up to logarithmic factors, because in the hard instance construction $\Gamma = 2$. For general cases, our upper bound is optimal up to a $\sqrt{\Gamma}$ factor. ③ There is a restriction on $M$, which can be at most $\lfloor S/2 \rfloor!$; note that an exponentially large $M$ is impractical anyway.

5.1. HARD INSTANCE CONSTRUCTION

Since $M \le \lfloor S/2 \rfloor!$, we can always find an integer $d_1$ such that $d_1 \le S/2$ and $M \le d_1!$. Since $S \ge 6$ and $d_1 \le S/2$, we can always find an integer $d_2$ such that $d_2 \ge 1$ and $2^{d_2} - 1 \le S - d_1 - 2 < 2^{d_2 + 1} - 1$. We construct a two-phase structure, the phases containing $d_1$ and $d_2$ steps respectively. The hard instance uses similar components to the MDP instances in Domingues et al. (2021). We construct a collection of LMDPs $\mathcal{C} := \{\mathcal{M}_{(\ell^\star, a^\star)} : (\ell^\star, a^\star) \in [L]^M \times [A]^M\}$, where we define $L := 2^{d_2 - 1} = \Theta(S)$. For a fixed pair $(\ell^\star, a^\star) = ((\ell^\star_1, \ell^\star_2, \dots, \ell^\star_M), (a^\star_1, a^\star_2, \dots, a^\star_M))$, we construct the LMDP $\mathcal{M}_{(\ell^\star, a^\star)}$ as follows.
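The choice of $d_1$ and $d_2$ can be made concrete. The proof only needs their existence; picking the smallest valid $d_1$, as below, is our own instantiation:

```python
import math

def split_phases(S, M):
    """Sketch of the phase-size choice in the hard-instance construction:
    pick the smallest d1 with M <= d1! (it satisfies d1 <= S/2 whenever
    M <= floor(S/2)!), then the unique d2 >= 1 with
    2^d2 - 1 <= S - d1 - 2 < 2^(d2+1) - 1."""
    d1 = 1
    while math.factorial(d1) < M:
        d1 += 1
    assert d1 <= S // 2, "requires M <= floor(S/2)!"
    budget = S - d1 - 2          # states left for the guessing tree
    d2 = 1
    while 2 ** (d2 + 1) - 1 <= budget:
        d2 += 1
    assert 2 ** d2 - 1 <= budget < 2 ** (d2 + 1) - 1
    return d1, d2
```

For the example of Figure 1 ($S = 11$, $M = 2$) this gives $d_1 = 2$ and $d_2 = 3$: two encoding states, $2^3 - 1 = 7$ tree states, plus the good and terminal states, using all 11 states.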

5.1.1. THE LMDP LAYOUT

All MDPs in the LMDP share the same logical structure. Each MDP contains two phases: the encoding phase and the guessing phase. The encoding phase contains $d_1$ states, sufficient for encoding the context because $M \le d_1!$. The guessing phase contains a number-guessing game with $C := LA = \Theta(SA)$ choices. If the agent makes the correct choice, it receives an expected reward slightly larger than $\frac{1}{2}$; otherwise, it receives an expected reward of exactly $\frac{1}{2}$.

5.1.2. THE DETAILED MODEL

Now we give more details of our construction. Figure 1 shows an example of the model with $M = 2$, $S = 11$, arbitrary $A \ge 2$ and $H \ge 6$.

States. The states in the encoding phase are $e_1, \dots, e_{d_1}$. The states in the guessing phase are $s_1, \dots, s_N$, where $N = \sum_{i=0}^{d_2 - 1} 2^i = 2^{d_2} - 1$. There is a good state $g$ for the reward and a terminal state $t$. All unused states can be ignored.

Transitions. The weights are equal, i.e., $w_m = \frac{1}{M}$. We assign a unique integer $m \in [M]$ to each MDP as its context. Each integer $m$ is uniquely mapped to a permutation $\sigma(m) = (\sigma_1(m), \sigma_2(m), \dots, \sigma_{d_1}(m))$. The initial state distribution is then $\nu_m(e_{\sigma_1(m)}) = 1$. The transitions for the first $d_1$ steps are: for any $(m, a) \in [M] \times \mathcal{A}$,
$$P_m(e_{\sigma_{i+1}(m)} \mid e_{\sigma_i(m)}, a) = 1, \ \forall 1 \le i \le d_1 - 1; \qquad P_m(s_1 \mid e_{\sigma_{d_1}(m)}, a) = 1.$$
This means that in the encoding phase, whatever the agent does is irrelevant to the state sequence it observes. The guessing phase is a binary tree, which we modify from Section 3.1 of Domingues et al. (2021) (here we identify each action $a$ with an integer in $[A]$): for any $(m, a) \in [M] \times \mathcal{A}$,
$$P_m(s_{2i + (a \bmod 2)} \mid s_i, a) = 1, \quad \forall 1 \le i \le 2^{d_2 - 1} - 1.$$
For the tree leaves $\mathcal{L} = \{s_\ell : 2^{d_2 - 1} \le \ell \le 2^{d_2} - 1\}$ (notice that $|\mathcal{L}| = L$), we construct: for any $(m, \ell, a) \in [M] \times \mathcal{L} \times \mathcal{A}$,
$$P_m(t \mid s_\ell, a) = \tfrac{1}{2} - \varepsilon \mathbb{1}[\ell = \ell^\star_m, a = a^\star_m], \qquad P_m(g \mid s_\ell, a) = \tfrac{1}{2} + \varepsilon \mathbb{1}[\ell = \ell^\star_m, a = a^\star_m].$$

Figure 1: In any of the MDPs, the agent first goes through an encoding phase, where it observes a sequence of states regardless of which actions it takes. The state sequence is different for each MDP, so the agent can fully determine the context after this phase. In the guessing phase, the agent travels through a binary tree until it reaches some leaf. Exactly one of the leaves is "correct", and only performing exactly one of the actions at the correct leaf yields a higher expected reward.

Recall that we denote by $C = LA$ the effective number of choices. The agent needs to first find the correct leaf by inputting its binary representation correctly, and then choose the correct action. The good state is an intermediate state between the guessing phase and the terminal state: if the agent is at $g$ and takes any action, it enters $t$. The terminal state is self-absorbing. For any $(m, a) \in [M] \times \mathcal{A}$,
$$P_m(t \mid g, a) = 1, \qquad P_m(t \mid t, a) = 1.$$
All unmentioned probabilities are 0. Clearly, this transition model guarantees that $\mathrm{supp}(P_m(\cdot|s, a)) \le 2$ for every $(m, s, a) \in [M] \times \mathcal{S} \times \mathcal{A}$.

The rewards. The only non-zero rewards are $R_m(g, a) = 1$ for any $(m, a) \in [M] \times \mathcal{A}$. Since this state-action pair is visited at most once in any episode, this guarantees that the cumulative reward of a single episode is either 0 or 1.

6. CONCLUSION

In this paper, we present two different RL algorithms (one model-optimistic and one value-optimistic) for LMDPs with context in hindsight, both achieving $\tilde{O}(\sqrt{M\Gamma SAK})$ regret. This is the first (nearly) horizon-free regret bound for LMDPs with context in hindsight. We also provide a regret lower bound of $\Omega(\sqrt{MSAK})$ for this setting. In this lower bound $\Gamma = 2$, so the upper bound is minimax optimal for the subclass of LMDPs with constant $\Gamma$. One future direction is to obtain a minimax regret bound for LMDPs in the $\Gamma = \Theta(S)$ case. For example, can we derive a regret lower bound of $\Omega(\sqrt{MS^2AK})$? On the other hand, it may also be possible to remove the $\sqrt{\Gamma}$ in our upper bound. We believe this will require properties beyond the standard Bellman-optimality condition for MDPs.

Good events. The entire proof depends heavily on the good events defined in Definition 8, which state that the estimated transition probabilities are very close to their true values. We show in Lemma 9 that they hold with high probability.

Definition 8 (Good events). For every episode $k$, define the events $\Omega^k_1$ and $\Omega^k_2$ (confidence bounds on the estimated transitions, holding for every $(m, s, a)$). Further, define $\Omega_1 := \cap_{k=1}^K \Omega^k_1$ and $\Omega_2 := \cap_{k=1}^K \Omega^k_2$.

Lemma 9. $\Pr[\Omega_1], \Pr[\Omega_2] \ge 1 - \delta$.

Assuming the good events hold, we have the following useful property.

Lemma 10. Conditioned on $\Omega_1$, we have that for any $(m, s, a, k) \in [M] \times \mathcal{S} \times \mathcal{A} \times [K]$, and any $S$-dimensional vector $\alpha$ such that $\|\alpha\|_\infty \le \dots$

Trigger property. Let $\mathcal{K}$ be the set of indices of episodes in which no update is triggered. By the update rule, it is clear that $|\mathcal{K}^C| \le MSA(\log_2(HK) + 1) \le MSA\iota$. Let $t_0(k)$ be the first time an update is triggered in the $k$-th episode if there is an update in this episode, and $H + 1$ otherwise. Define $X_0 = \{(k, t_0(k)) : k \in \mathcal{K}^C\}$ and $X = \{(k, t) : k \in \mathcal{K}^C, t_0(k) + 1 \le t \le H\}$. We will study quantities multiplied by the trigger indicator $\mathbb{1}[(k, t) \notin X]$, which we denote with an extra check accent (e.g., $\check{\alpha}$).
We will repeatedly encounter a special type of summation, so we state it here.

Lemma 11. Let $\{w^k_t \ge 0 : (k, t) \in [K] \times [H]\}$ be a group of weights. Then
$$\sum_{k=1}^K \sum_{t=1}^H \frac{\mathbb{1}[(k, t) \notin X]}{N^k_{m_k}(s^k_t, a^k_t)} \le 3MSA\iota, \qquad \sum_{k=1}^K \sum_{t=1}^H \sqrt{\frac{w^k_t \mathbb{1}[(k, t) \notin X]}{N^k_{m_k}(s^k_t, a^k_t)}} \le \sqrt{3MSA\iota \sum_{k=1}^K \sum_{t=1}^H w^k_t \mathbb{1}[(k, t) \notin X]}.$$

B.2.1 OPTIMISM

As is standard, we need to show that both Algorithm 2 and Algorithm 3 are optimistic in their value functions. For Algorithm 2, this is straightforward: in each episode $k$, we choose the optimistic transition $\tilde{P}^k$ with the maximum possible value. Lemma 9 shows that with probability at least $1 - \delta$, $\Omega_1$ holds, hence the true transition $P$ is inside the confidence set $\mathcal{P}^k$ for all $k \in [K]$. Therefore, $\tilde{V}^k \ge V^\star$.


. Then for all $p \in \Delta([D])$, $\|v\|_\infty \le 1$ and $n, \iota > 0$:
1. $f(p, v, n, \iota)$ is non-decreasing in $v$: for all $v, v'$ with $\|v\|_\infty, \|v'\|_\infty \le 1$ and $v \le v'$, it holds that $f(p, v, n, \iota) \le f(p, v', n, \iota)$;
2. $f(p, v, n, \iota) \ge p^\top v + \frac{c_1}{2}\sqrt{\frac{\mathbb V(p,v)\,\iota}{n}} + \frac{c_2}{2}\cdot\frac{\iota}{n}$.

Due to the complex structure of LMDPs, we cannot prove the strong notion of optimism used in Zhang et al. (2021a): in an LMDP, no single policy can maximize all alpha vectors simultaneously, so the optimal alpha vectors are not unique. As with Algorithm 2, we can only show optimism at the first step, which is stated in Lemma 13.

Lemma 13 (Optimism of Algorithm 3). Conditioned on $\Omega_1$, Algorithm 3 satisfies $\tilde V^k \ge V^*$ for every episode $k \in [K]$.

Here $r = R_m(s,a)$ and
$$\beta^k_m(h,a) = 7\sqrt{\frac{\Gamma\,\mathbb V\big(P_m(\cdot|s,a),\, \tilde\alpha^k_m(har\,\cdot)\big)\,\iota}{N^k_m(s,a)}} + \frac{30 S \iota}{N^k_m(s,a)}.$$
Throughout the proof, we write $\check\beta^k_t = \beta^k_{m_k}(h^k_t, a^k_t)\,1[(k,t)\notin X]$.

Assuming optimism holds, it is more natural to bound $\tilde V^k - V^{\pi_k}$ instead of $V^* - V^{\pi_k}$, because the former compares two quantities under the same policy. With simple manipulation, we decompose the regret into $X_1$, the Monte Carlo estimation term for the optimistic value; $X_2$, the Monte Carlo estimation term for the true value; $X_3$, the model estimation error; $X_4$, the Bellman error (the main-order term); and $|\mathcal K^C|$, the correction term for $1[(k,t)\notin X]$.
RegretpKq " K ÿ k"1 ´V ‹ ´V π k ¯ď K ÿ k"1 ´r V k ´V π k " K ÿ k"1 ´r V k ´r α k m k ps k 1 q looooooooooomooooooooooon X1 `K ÿ k"1 ˜r α k m k ps k 1 q ´H ÿ t"1 q r k t ¸`K ÿ k"1 ˜H ÿ t"1 q r k t ´V π k ļooooooooooomooooooooooon X2 (i) " X 1 `X2 `K ÿ k"1 H ÿ t"1 `q α k m k ph k t q ´q r k t ´Pm k p¨|s k t , a k t q J r α k m k ph k t a k t r k t ¨q1rpk, tq R X s looooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooon ď q β k t `K ÿ k"1 H ÿ t"1 `Pm k p¨|s k t , a k t q J q α k m k ph k t a k t r k t ¨q ´q α k m k ph k t`1 q loooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooon X3 `K ÿ k"1 H ÿ t"1 P m k p¨|s k t , a k t q J r α k m k ph k t a k t r k t ¨qp1rpk, tq R X s ´1rpk, t `1q R X sq (ii) ď X 1 `X2 `X3 `K ÿ k"1 H ÿ t"1 q β k t loooomoooon X4 `ÿ k,t"t0pkq r P k m k p¨|s k t , a k t q J r α k m k ph k t a k t r k t ¨q (iii) ď X 1 `X2 `X3 `X4 `ˇK C ˇˇ, where (i) is by pk, 1q P X so r α k m k ps k 1 q " q α k m k ps k 1 q; (ii) follows by Lemma 14 and checking the difference between 1rpk, tq R X s and 1rpk, t `1q R X s; (iii) is from the fact that r α k ď 1, and the definition of t 0 pkq and K.

B.2.3 BOUNDING EACH TERM

We start with the easier terms $X_1$ and $X_2$.

Lemma 15. With probability at least $1-\delta$, we have $X_1 \le \sqrt{K\iota}$.

Lemma 16. With probability at least $1-\delta$, we have $X_2 \le \sqrt{K\iota}$.

$X_3$ is a sum of martingale differences. However, to avoid a polynomial dependency on $H$, we cannot apply Azuma's inequality, which scales as $\sqrt H$. Instead, we use a variance-dependent martingale bound, which turns $X_3$ into a lower-order term relative to $X_4$.

Lemma 17. With probability at least $1-\delta$, we have $X_3 \le 2\sqrt{2X_4\iota} + 5\iota$.

We present the bound on $X_4$ and its proof first, and then prove Theorem 1. When bounding $X_4$, we encounter another quantity $X_5$, the summation of variances. We do not bound $X_5$ explicitly; instead, we derive a relation between $X_5$ and $X_4$ (Lemma 19) and then solve an inequality in $X_4$.

Lemma 18. Conditioned on $\Omega_1$ and $\Omega_2$, with probability at least $1-\delta$, we have $X_4 \le 46\sqrt{M\Gamma SAK}\,\iota + 947\,M S^2 A \iota^2$.

Proof. From Lemma 14 and Lemma 11, we have
$$X_4 \le 13\sqrt{M\Gamma SA\iota^2 \underbrace{\sum_{k=1}^K\sum_{t=1}^H \mathbb V\big(P_{m_k}(\cdot|s^k_t,a^k_t),\, \tilde\alpha^k_{m_k}(h^k_t a^k_t r^k_t\,\cdot)\big)\,1[(k,t)\notin X]}_{X_5}} + 90\,M S^2 A \iota^2.$$
Applying Lemma 19, using $\sqrt{x+y} \le \sqrt x + \sqrt y$, and loosening the constants, we obtain
$$X_4 \le 23\sqrt{M\Gamma SAK\iota^2} + 209\,M S^2 A \iota^2 + 23\sqrt{M\Gamma SA\iota^2}\cdot\sqrt{X_4}.$$
Since $x \le a + b\sqrt x$ implies $x \le b^2 + 2a$, we finally have $X_4 \le 46\sqrt{M\Gamma SAK}\,\iota + 947\,M S^2 A \iota^2$. This completes the proof.

We use the higher-order variance expansion technique of Zhang et al. (2021a) to relate $X_5$ to $X_4$.

Lemma 19. Conditioned on $\Omega_1$ and $\Omega_2$, with probability at least $1-\delta$, we have $X_5 \le 3(K + X_4) + 83\iota$.
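The proof of Lemma 18 closes with the elementary fact that $x \le a + b\sqrt x$ implies $x \le b^2 + 2a$ for $a, b \ge 0$. A short numeric check of this fact (standard algebra, verified at the fixed point of the inequality):

```python
import math
import random

# Numeric check of the fact used to close the X4 bound:
# if x <= a + b*sqrt(x) with a, b >= 0, then x <= b**2 + 2*a.
random.seed(1)
for _ in range(1000):
    a = random.uniform(0, 100)
    b = random.uniform(0, 100)
    # the largest x satisfying x <= a + b*sqrt(x) is the fixed point of the equality
    x = ((b + math.sqrt(b * b + 4 * a)) / 2) ** 2
    assert x <= a + b * math.sqrt(x) + 1e-6   # x is feasible
    assert x <= b * b + 2 * a + 1e-6          # the claimed consequence
```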

B.2.4 PROOF OF THEOREM 1

Finally, we can prove the main theorem.

Proof. From Lemma 15, Lemma 16, Lemma 17 and the property of $\mathcal K$, we have that, with probability at least $1-3\delta$,
$$\mathrm{Regret}(K) \le 2\sqrt{K\iota} + 2\sqrt{2X_4\iota} + 5\iota + X_4 + MSA\iota.$$
Plugging in Lemma 18 and using $\sqrt{x+y} \le \sqrt x + \sqrt y$, we conclude that
$$\mathrm{Regret}(K) \le 68\sqrt{M\Gamma SAK}\,\iota + 1041\,M S^2 A \iota^2$$
holds with probability at least $1 - 6\delta$ (using Lemma 9). Rescaling $\delta \leftarrow \delta/6$ completes the proof.

B.3 PROOF OF THE LEMMAS USED IN THE MINIMAX REGRET GUARANTEE

Lemma 9. $\Pr[\Omega_1], \Pr[\Omega_2] \ge 1-\delta$.

Proof. From Lemma 5 we have that, for any fixed $(m, s, a, s'$,
$$\sum_{k=1}^K\sum_{t=1}^H \sqrt{\frac{w^k_t\,1[(k,t)\notin X]}{N^k_{m_k}(s^k_t,a^k_t)}} \stackrel{(ii)}{\le} \sqrt{\Big(\sum_{k=1}^K\sum_{t=1}^H w^k_t\,1[(k,t)\notin X]\Big)\Big(\sum_{k=1}^K\sum_{t=1}^H\frac{1[(k,t)\notin X]}{N^k_{m_k}(s^k_t,a^k_t)}\Big)} \le \sqrt{3MSA\iota\sum_{k=1}^K\sum_{t=1}^H w^k_t\,1[(k,t)\notin X]},$$
where (i) is by the property of the indicator function and (ii) is by the Cauchy-Schwarz inequality.

Lemma 13 (Optimism of Algorithm 3). Conditioned on $\Omega_1$, Algorithm 3 satisfies $\tilde V^k \ge V^*$ for any episode $k \in [K]$.

Proof. We first argue that for any policy $\pi$ and any episode $k$, we have $\tilde V^\pi_{M^k} \ge V^\pi$. Throughout the proof, the episode index $k$ is fixed and omitted from superscripts. We proceed by induction over $h$ in the order $\mathcal H_H, \mathcal H_{H-1}, \ldots, \mathcal H_1$. Recall that for any $h \in \mathcal H_{H+1}$ we define $\tilde\alpha^\pi_m(h,a) = \alpha^\pi_m(h,a) = 0$. Now suppose that for time step $t$ we already have $\tilde\alpha^\pi_m(h',a) \ge \alpha^\pi_m(h',a)$ for any $h' \in \mathcal H_{t+1}$; then $\tilde\alpha^\pi_m(h') = \tilde\alpha^\pi_m(h',\pi(h')) \ge \alpha^\pi_m(h',\pi(h')) = \alpha^\pi_m(h')$. Finally, $\pi^k = \arg\max_{\pi\in\Pi}\tilde V^\pi$, hence $\tilde V^k \ge \tilde V^{\pi^*} \ge V^*$.

Lemma 15. With probability at least $1-\delta$, we have $X_1 \le \sqrt{K\iota}$.

Proof. By definition, $\tilde V^k = \sum_{m=1}^M\sum_{s\in\mathcal S} w_m\,\nu_m(s)\,\tilde\alpha^k_m(s)$. Thus $\tilde\alpha^k_{m_k}(s^k_1)$ is a random variable with mean $\tilde V^k$, and it is measurable with respect to $\bar U^{k-1}$. Using Lemma 3 and $|\tilde\alpha^k_{m_k}(s^k_1) - \tilde V^k| \le 1$, we have $\Pr[X_1 > \sqrt{K\iota}] \le \delta$. This completes the proof.

Lemma 16. With probability at least $1-\delta$, we have $X_2 \le \sqrt{K\iota}$.

Proof. By definition, $X_2 \le \sum_{k=1}^K\big(\sum_{t=1}^H r^k_t - V^{\pi_k}\big)$. By Monte Carlo simulation, $\mathbb E[\sum_{t=1}^H r^k_t] = V^{\pi_k}$, and $\sum_{t=1}^H r^k_t$ is measurable with respect to $\bar U^{k-1}$. Using Lemma 3 and $|\sum_{t=1}^H r^k_t - V^{\pi_k}| \le 1$, we have
$$\Pr\Big[\sum_{k=1}^K\Big(\sum_{t=1}^H r^k_t - V^{\pi_k}\Big) > \sqrt{K\iota}\Big] \le \delta.$$
This completes the proof.

Lemma 17. With probability at least $1-\delta$, we have $X_3 \le 2\sqrt{2X_4\iota} + 5\iota$.

Proof.
Observe that $1[(k,t+1)\notin X] \le 1[(k,t)\notin X]$, so
$$X_3 \le \sum_{k=1}^K\sum_{t=1}^H \Big(P_{m_k}(\cdot|s^k_t,a^k_t)^\top \tilde\alpha^k_{m_k}(h^k_t a^k_t r^k_t\,\cdot) - \tilde\alpha^k_{m_k}(h^k_{t+1})\Big)\,1[(k,t)\notin X].$$
This is a martingale. Taking $c=1$ in Lemma 6, we have $\Pr[X_3 > 2\sqrt{2X_4\iota} + 5\iota] \le \delta$. This completes the proof.

Lemma 19. Conditioned on $\Omega_1$ and $\Omega_2$, with probability at least $1-\delta$, we have $X_5 \le 3(K+X_4) + 83\iota$.

Proof. For any non-negative integer $d$, define
$$F(d) := \sum_{k=1}^K\sum_{t=1}^H \Big( P_{m_k}(\cdot|s^k_t,a^k_t)^\top \big(\tilde\alpha^k_{m_k}(h^k_t a^k_t r^k_t\,\cdot)\big)^{2^d} - \big(\tilde\alpha^k_{m_k}(h^k_{t+1})\big)^{2^d} \Big)\, 1[(k,t)\notin X],$$
$$G(d) := \sum_{k=1}^K\sum_{t=1}^H \mathbb V\Big(P_{m_k}(\cdot|s^k_t,a^k_t),\, \big(\tilde\alpha^k_{m_k}(h^k_t a^k_t r^k_t\,\cdot)\big)^{2^d}\Big)\, 1[(k,t)\notin X].$$
Then $X_5 = G(0)$. Direct computation gives
$$G(d) = \sum_{k=1}^K\sum_{t=1}^H \bigg( P_{m_k}(\cdot|s^k_t,a^k_t)^\top \big(\tilde\alpha^k_{m_k}(h^k_t a^k_t r^k_t\,\cdot)\big)^{2^{d+1}} - \Big(P_{m_k}(\cdot|s^k_t,a^k_t)^\top \big(\tilde\alpha^k_{m_k}(h^k_t a^k_t r^k_t\,\cdot)\big)^{2^d}\Big)^2 \bigg)\, 1[(k,t)\notin X]$$
$$\stackrel{(i)}{\le} \sum_{k=1}^K\sum_{t=1}^H \Big( P_{m_k}(\cdot|s^k_t,a^k_t)^\top \big(\tilde\alpha^k_{m_k}(h^k_t a^k_t r^k_t\,\cdot)\big)^{2^{d+1}} - \big(\tilde\alpha^k_{m_k}(h^k_{t+1})\big)^{2^{d+1}}\Big)\, 1[(k,t)\notin X] + \underbrace{\big(\tilde\alpha^k_{m_k}(h^k_{H+1})\big)^{2^{d+1}}}_{=0} + \sum_{k=1}^K\sum_{t=1}^H \Big(\big(\tilde\alpha^k_{m_k}(h^k_t)\big)^{2^{d+1}} - \big(P_{m_k}(\cdot|s^k_t, a^k_t)^\top\cdots$$
$$\le F(d+1) + 2^{d+1}\sum_{k=1}^K\sum_{t=1}^H \big(\check r^k_t + \check\beta^k_t\big) \stackrel{(iv)}{\le} F(d+1) + 2^{d+1}(K + X_4),$$
where (i) is by convexity of the function $x \mapsto x^{2^d}$; (ii) is by $x^{2^d} - y^{2^d} \le 2^d\max\{x-y, 0\}$ for $x,y \in [0,1]$; (iii) comes from Lemma 14; (iv) is by the assumption that the total reward within an episode is bounded by 1 and the definition of $X_4$.

For a fixed $d$, $F(d)$ is a martingale. Taking $c=1$ in Lemma 6, we have
$$\Pr\Big[F(d) > 2\sqrt{2G(d)\big(\log_2(HK)+\ln(2/\delta)\big)} + 5\big(\log_2(HK)+\ln(2/\delta)\big)\Big] \le \delta.$$
Taking $\delta' = \delta/(\log_2(HK)+1)$, using $x \ge \ln(x)+1$ and finally swapping $\delta$ and $\delta'$, we have
$$\Pr\Big[F(d) > 2\sqrt{2G(d)\big(2\log_2(HK)+\ln(2/\delta)\big)} + 5\big(2\log_2(HK)+\ln(2/\delta)\big)\Big] \le \frac{\delta}{\log_2(HK)+1}.$$
Taking a union bound over $d = 1, 2, \ldots, \log_2(HK)$, we have that with probability at least $1-\delta$,
$$F(d) \le 4\sqrt{\big(F(d+1) + 2^{d+1}(K+X_4)\big)\iota} + 10\iota.$$
From Lemma 7, taking $\lambda_1 = HK$, $\lambda_2 = 4\sqrt\iota$, $\lambda_3 = K + X_4$, $\lambda_4 = 10\iota$, we have
$$F(1) \le \max\big\{(4\sqrt\iota + \sqrt{26\iota})^2,\ 8\sqrt{2(K+X_4)\iota} + 10\iota\big\} \stackrel{(i)}{\le} K + X_4 + 83\iota,$$
where (i) uses $\sqrt{xy} \le \frac{x+y}{2}$ and $\max\{x,y\} \le x+y$ for $x,y \ge 0$. Hence $X_5 = G(0) \le F(1) + 2(K+X_4) \le 3(K+X_4) + 83\iota$. This completes the proof.

Lemma 20. Conditioned on $\Omega_1$, we have that for any $(k,m,s,a,s') \in [K]\times[M]\times\mathcal S\times\mathcal A\times\mathcal S$,

Proof. We need to introduce an alternative regret measure for an MDP based on simulating an LMDP algorithm. Let $\mathcal M(m, \ell^*, a^*)$ be an MDP consisting of an encoding phase with permutation $\sigma(m)$ and a guessing phase with correct answer $(\ell^*, a^*)$. Given any LMDP algorithm $\pi$, a target position $m$ and a pair of LMDP configurations $(\ell^*, a^*)$, we can construct an MDP algorithm $\pi(m, \ell^*, a^*)$ as in Algorithm 4. This algorithm admits two stopping rules: (1) when $K$ is specified, it returns after $K$ episodes, regardless of how many times it interacts with the target MDP; (2) when $K_m$ is specified, it does not return until it has interacted with the target MDP $K_m$ times, regardless of how many episodes elapse.

Here in (i) we use $x_{-m}$ to denote the positions other than $m$ in $x$; (ii) is by setting $K' = \frac{K}{2M}$ in Equation (9). Set $\delta := \frac{\sqrt{MC}}{16\sqrt{2K}}$; then we have
$$\max_{\ell^*, a^*} R\big(\mathcal M(\ell^*, a^*), \pi, K\big) \ge \frac{1}{|\mathcal C|}\sum_{\ell^*, a^*} R\big(\mathcal M(\ell^*, a^*), \pi, K\big) > \frac{\sqrt{MCK}}{32\sqrt 2} = \Omega\big(\sqrt{MSAK}\big).$$
This holds when $K > (6+4\sqrt 2)M^2\ln\frac{2MC}{\delta}$ and $K' \ge SA$, which reduces to $K \ge \Omega\big(M^2\,\mathrm{poly}(\log(M,S,A)) + MSA\big)$. This completes the proof.



$$\Big|P_m(s'|s,a) - \hat P^k_m(s'|s,a)\Big| \le 2\sqrt{\frac{\hat P^k_m(s'|s,a)\,\iota}{N^k_m(s,a)}} + \frac{5\iota}{N^k_m(s,a)}, \qquad \forall (m,s,a,s') \in [M]\times\mathcal S\times\mathcal A\times\mathcal S.$$

). For the outer problem, we can use Extended Value Iteration as in Auer et al. (2008), Fruit et al. (2020), Filippi et al. (2010) and Cohen et al. (2020). For notational convenience, we denote the alpha vectors and value functions computed from $\tilde P^k$ and $\pi^k$ by $\tilde\alpha^k$ and $\tilde V^k$.

Figure 1: Illustration of the class of hard LMDPs used in the proof of Theorem 2. Solid arrows are deterministic transitions, while dashed arrows are probabilistic transitions, with the probabilities written beside them. In each MDP, the agent first goes through an encoding phase, where it observes a sequence of states regardless of the actions it takes. The state sequence differs across the MDPs, so the agent can fully determine which context it is in after this phase. In the guessing phase, the agent travels through a binary tree until it reaches a leaf. Exactly one of the leaves is "correct", and only performing one specific action at the correct leaf yields a higher expected reward.
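The guessing phase can be mimicked with a toy simulator. Everything below (the function name, the tree depth, and the $1/2 + \varepsilon$ Bernoulli reward gap) is an illustrative simplification of the construction, not its exact parameters:

```python
import random

# Toy version of the guessing phase in Figure 1: the agent descends a binary
# tree via its action bits; exactly one leaf-action pair is "correct" and pays
# a slightly higher expected Bernoulli reward.
def play_guessing_phase(path_bits, final_action, correct_leaf, correct_action,
                        eps=0.1, rng=random.Random(0)):
    leaf = 0
    for b in path_bits:           # deterministic descent through the tree
        leaf = 2 * leaf + b
    p = 0.5 + eps if (leaf == correct_leaf and final_action == correct_action) else 0.5
    return 1 if rng.random() < p else 0

correct_leaf, correct_action, eps = 5, 1, 0.1
rng = random.Random(0)
n = 20000
# Monte-Carlo: the correct leaf/action pair earns about eps more per episode.
good = sum(play_guessing_phase([1, 0, 1], 1, correct_leaf, correct_action, eps, rng)
           for _ in range(n)) / n
bad = sum(play_guessing_phase([0, 0, 0], 0, correct_leaf, correct_action, eps, rng)
          for _ in range(n)) / n
assert good > bad  # the ~eps gap is what drives the sqrt(SAK) lower-bound rate
```

Since there are on the order of $SA$ leaf-action pairs and the reward gap is a small $\varepsilon$, distinguishing the correct pair requires roughly $SA/\varepsilon^2$ episodes, which is the mechanism behind the $\sqrt{SAK}$ factor.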

Algorithm 3 relies on an important function introduced by Zhang et al. (2021a), which we cite here.

Lemma 12 (Adapted from Lemma 14 in Zhang et al. (2021a)). For any fixed dimension $D$ and two constants $c_1, c_2$ satisfying $c_1^2 \le c_2$, let $f : \Delta([D])\times\mathbb R^D\times\mathbb R\times\mathbb R \to \mathbb R$ with $f(p, v, n, \iota) = p^\top v + \max\{\,c\sqrt{\mathbb V(p,$

$\big|\,\cdot\,(s'|s,a) - \tilde P^k_m(s'|s,a)\big| \le 4\sqrt{P_m(s'|s,a)\,\iota/N^k_m(s,a)} + \cdots$. Using the fact that $x^2 \le ax + b$ implies $x \le a + \sqrt b$, with $a = 2\sqrt{\frac{\iota}{N^k_m(s,a)}}$ and $b = \frac{5\iota}{N^k_m(s,a)} + P_m(s'|s,a)$, and $\sqrt{x+y} \le \sqrt x + \sqrt y$, and substituting this into $\Omega_1$, we have
$$\Big|P_m(s'|s,a) - \hat P^k_m(s'|s,a)\Big| \le 2\sqrt{\frac{P_m(s'|s,a)\,\iota}{N^k_m(s,a)}} + \cdots, \qquad \Big|\hat P^k_m(s'|s,a) - \tilde P^k_m(s'|s,a)\Big| \le 2\sqrt{\frac{P_m(s'|s,a)\,\iota}{N^k_m(s,a)}} + \frac{15\iota}{N^k_m(s,a)}.$$
Therefore, the triangle inequality gives the desired result.

B.4 PROOF OF THE REGRET LOWER BOUND

Theorem 2. Assume that $S \ge 6$, $A \ge 2$ and $M \le \lfloor S\rfloor!$. For any algorithm $\pi$, there exists an LMDP $\mathcal M_\pi$ such that, for $K \ge \tilde\Omega(M^2 + MSA)$, its expected regret in $\mathcal M_\pi$ after $K$ episodes satisfies $R(\mathcal M_\pi, \pi, K) :=$

Algorithm 1 Algorithmic Framework for Solving LMDPs
1: Input: number of MDPs $M$, state space $\mathcal S$, action space $\mathcal A$, horizon $H$; policy class $\Pi$; confidence parameter $\delta$.
2: Set an arbitrary policy $\pi^1$, initialize all $n, N$ with 0 and set the constant $\iota \leftarrow 2\ln\big(2MSAHK$

$, s, a, s') \in [M]\times\mathcal S\times\mathcal A\times\mathcal S$, $\big|(\hat P^k_m - P_m)(s'|s,a)\big| \le 2\cdots$


B.2.2 REGRET DECOMPOSITION

We introduce the Bellman error here; it contributes the main-order term in the regret.

Lemma 14 (Bellman error). Both Algorithm 2 and Algorithm 3 satisfy the following Bellman error bound: conditioned on $\Omega_1$ and $\Omega_2$, for any $(m, h, a, k) \in [M]\times\mathcal H\times\mathcal A\times[K]$,

$(m, s, a, s', k) \in [M]\times\mathcal S\times\mathcal A\times\mathcal S\times[K]$ and $2 \le N^k_m(s,a) \le HK$. Since $\frac{1}{x-1} \le \frac{2}{x}$ when $x \ge 2$ (the case $N^k_m(s,a) = 1$ is trivial), applying a union bound over all possible events gives $\Pr[\cap_{k=1}^K \Omega^k_1] \ge 1 - \delta$. From Lemma 4 we have that, for any fixed $(m, s, a, s', k) \in [M]\times\mathcal S\times\mathcal A\times\mathcal S\times[K]$ and $1 \le N^k_m(s,a) \le HK$, the corresponding inequality holds. By taking a union bound over all possible events, we have $\Pr[\cap_{k=1}^K \Omega^k_2] \ge 1-\delta$.

Lemma 10. Conditioned on $\Omega_1$, we have that for any $(m,s,a,k) \in [M]\times\mathcal S\times\mathcal A\times[K]$, and any $S$-dimensional vector $\alpha$ such that $\|\alpha\|_\infty \le 1$:

Proof. We fix the episode index $k$ and omit it for simplicity. We have
$$\Big|\big(\hat P_m - P_m\big)(\cdot|s,a)^\top \alpha\Big| \stackrel{(i)}{=} \Big|\sum_{s'\in\mathcal S}\big(\hat P_m - P_m\big)(s'|s,a)\,\big(\alpha(s') - \hat P_m(\cdot|s,a)^\top\alpha\big)\Big| \le \sum_{s'\in\mathcal S}\big|\hat P_m - P_m\big|(s'|s,a)\,\big|\alpha(s') - \hat P_m(\cdot|s,a)^\top\alpha\big|,$$
where (i) comes from the facts that $P_m(\cdot|s,a)^\top\alpha$ is a constant and that $\hat P, P$ are two distributions; (ii) is by the definition of $\Omega^k_1$ (Equation (6)) and $\|\alpha\|_\infty \le 1$; (iii) is from the Cauchy-Schwarz inequality. The second part is similar.
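The concentration results from Lemmas 4 and 5 (stated in Appendix A) give variance-sensitive radii, which is why the good events are much tighter than Hoeffding-style events when variances are small. A small numeric comparison (the radii follow the usual Maurer and Pontil form; the sample distribution is an illustrative low-variance choice):

```python
import math
import random

# Empirical-Bernstein (Maurer-Pontil style) radius vs the Hoeffding radius for
# i.i.d. variables in [0, b]. With small sample variance the Bernstein radius
# is far tighter -- the effect the horizon-free analysis exploits.
def hoeffding_radius(n, b, delta):
    return b * math.sqrt(math.log(2 / delta) / (2 * n))

def empirical_bernstein_radius(var_hat, n, b, delta):
    ln = math.log(2 / delta)
    return math.sqrt(2 * var_hat * ln / n) + 7 * b * ln / (3 * (n - 1))

rng = random.Random(0)
n, b, delta = 10_000, 1.0, 0.01
xs = [0.01 * rng.random() for _ in range(n)]   # low-variance sample in [0, 0.01]
mean = sum(xs) / n
var_hat = sum((x - mean) ** 2 for x in xs) / n
assert empirical_bernstein_radius(var_hat, n, b, delta) < hoeffding_radius(n, b, delta)
```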

1 qq " α π m ph 1 q. For any h P H t , where (i) is by taking r " R m ps, aq; (ii) is by recognizing c 1 " 4c supp ´p P m p¨|s, aq ¯, c 2 " 16Sin Lemma 12, which satisfy c 2 1 ď c 2 ; (iii) and (iv) come by successively applying the first and second property in Lemma 12; (v) is an implication of Lemma 10, conditioning on Ω 1 and taking α " α π m phar¨q. The proof is completed by the fact that π

14 (Bellman error). Both Algorithm 2 and Algorithm 3 satisfy the following Bellman error bound: conditioned on $\Omega_1$ and $\Omega_2$, for any $(m, h, a, k) \in [M]\times\mathcal H\times\mathcal A\times[K]$, the bound holds. Here we decompose the Bellman error in a generic way: we write $\tilde P$ for the transition and $B$ for the bonus used in the optimistic model. For Algorithm 2, $B = 0$; for Algorithm 3, $\tilde P = \hat P$.

A TECHNICAL LEMMAS

Lemma 3 (Anytime Azuma, Theorem D.1 in Cohen et al. (2020)). Let $(X_n)_{n=1}^\infty$ be a martingale difference sequence with respect to the filtration $(\mathcal F_n)_{n=0}^\infty$ such that $|X_n| \le B$ almost surely. Then the stated bound holds with probability at least $1-\delta$.

Lemma 4 (Bennett's Inequality, Theorem 3 in Maurer & Pontil (2009)). Let $Z, Z_1, \ldots, Z_n$ be i.i.d. random variables with values in $[0, b]$ and let $\delta > 0$. Define $\mathbb V[Z] = \mathbb E[(Z - \mathbb E[Z])^2]$. Then the stated bound holds.

Lemma 5 (Theorem 4 in Maurer & Pontil (2009)). Let $Z, Z_1, \ldots, Z_n$ ($n \ge 2$) be i.i.d. random variables with values in $[0,b]$ and let $\delta > 0$. Define $\bar Z = \frac1n\sum_{i=1}^n Z_i$ and $\hat V_n = \frac1n\sum_{i=1}^n (Z_i - \bar Z)^2$. Then the stated bound holds.

Lemma 6 (Lemma 30 in Tarbouriech et al. (2021)). Let $(M_n)_{n\ge0}$ be a martingale such that $M_0 = 0$ and $|M_n - M_{n-1}| \le c$ for some $c > 0$ and any $n \ge 1$. Let $\mathrm{Var}_n = \sum_{k=1}^n \mathbb E[(M_k - M_{k-1})^2 \mid \mathcal F_{k-1}]$ for $n \ge 0$, where $\mathcal F_k = \sigma(M_1, \ldots, M_k)$. Then for any positive integer $n$ and any $\delta \in (0, 2(nc^2)^{1/\ln 2}]$, we have
$$\Pr\Big[|M_n| \ge 2\sqrt{2\,\mathrm{Var}_n\big(\log_2(nc^2)+\ln(2/\delta)\big)} + 2\sqrt{\log_2(nc^2)+\ln(2/\delta)} + 2c\big(\log_2(nc^2)+\ln(2/\delta)\big)\Big] \le \delta.$$

Lemma 7 (Lemma 11 in Zhang et al. (2021a)). Let $\lambda_1, \lambda_2, \lambda_4 \ge 0$, $\lambda_3 \ge 1$ and $i' = \log_2 \lambda_1$. Let $a_1, a_2, \ldots, a_{i'}$ be non-negative reals such that $a_i \le \lambda_1$ and $a_i \le \lambda_2\sqrt{a_{i+1} + 2^{i+1}\lambda_3} + \lambda_4$ for any $1 \le i \le i'$. Then we have $a_1 \le \max\big\{\big(\lambda_2 + \sqrt{\lambda_2^2 + \lambda_4}\big)^2,\ \lambda_2\sqrt{8\lambda_3} + \lambda_4\big\}$.
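Lemma 7 can be stress-tested numerically by building the pointwise-maximal sequence satisfying the recursion and comparing $a_1$ with the closed-form bound. The $(\lambda_1, \lambda_3)$ values below are arbitrary test choices, and $(\lambda_2, \lambda_4) = (4\sqrt\iota, 10\iota)$ with $\iota = 1$ mirror the application in the proof of Lemma 19:

```python
import math

# Build the maximal sequence a_i = min(l1, l2*sqrt(a_{i+1} + 2**(i+1)*l3) + l4)
# backwards and return a_1, to compare against the closed-form bound
# max{(l2 + sqrt(l2**2 + l4))**2, l2*sqrt(8*l3) + l4} from Lemma 7.
def recursion_a1(l1, l2, l3, l4):
    i_max = int(math.log2(l1))
    a = l1  # a_{i'+1} is only constrained by the cap l1
    for i in range(i_max, 0, -1):
        a = min(l1, l2 * math.sqrt(a + 2 ** (i + 1) * l3) + l4)
    return a

for l1, l3 in [(1024, 100), (1024, 1)]:
    l2, l4 = 4.0, 10.0
    bound = max((l2 + math.sqrt(l2 ** 2 + l4)) ** 2, l2 * math.sqrt(8 * l3) + l4)
    assert recursion_a1(l1, l2, l3, l4) <= bound
```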

B SKIPPED PROOFS B.1 OMITTED CALCULATION

Here we give the details of the omitted calculations. We have $V^\pi(h) =$

B.2 UNIFIED ANALYSES OF ALGORITHM 1, ALGORITHM 2 AND ALGORITHM 3

In this subsection, we present the proof of Theorem 1 step by step. The proofs of some intermediate lemmas are deferred to Appendix B.3.

Lemma 11. Let $\{w^k_t \ge 0 : (k,t) \in [K]\times[H]\}$ be a group of weights. Then:

Proof. From Algorithm 1 and the definition of $\mathcal K$, we have that for any $i \in \mathbb N$ and $(m, s, a) \in [M]\times\mathcal S\times\mathcal A$,

6: Use $\pi$ to interact with the $m_k$-th MDP of $\mathcal M(\ell^*, a^*)$.
7: else
8: Use $\pi$ to interact with $\mathcal M(m, \ell^*, a^*)$.
9:
11: if ($K$ is specified and $k = K$) or ($K_m$ is specified and the interaction count reaches $K_m$) then
12: Break.
13: end if
14: end for

Let $V^*$ and $V^k$ be the optimal value function and the value function of $\pi(m, \ell^*, a^*, K)$ under the MDP $\mathcal M(m, \ell^*, a^*)$. The alternative regret for the MDP (corresponding to stopping rule (1)) is: roughly, this is a regret over $K_m$ episodes, though $K_m$ is stochastic. In our hard instances, the MDPs in the LMDP can be considered separately, so $V^*_m$ is the optimal value function of the $m$-th MDP (which is equal to the value function of the optimal policy applied to it). According to Monte Carlo sampling, the last step holds because the behaviors of "focusing on the $m$-th MDP in the LMDP" and "using the simulator" are the same. Denote by $K_m$ the number of episodes spent in the $m$-th MDP, which is a random variable. According to Lemma 4, by a union bound over all possible hard instances $\mathcal M(\ell^*, a^*) \in \mathcal C$ and all indices $m \in [M]$, the following event happens with probability at least $1-\delta$: for all $\mathcal M(\ell^*, a^*) \in \mathcal C$ and $m \in [M]$.

Now consider Equations (8), (11) and (12) of Domingues et al. (2021). For any $K' \ge SA$ and any fixed encoding number $m$, the stated bound holds when we set $\varepsilon$ accordingly.

We study the case where we use $\pi(m, \ell^*, a^*)$ to solve $\mathcal M(m, \ell^*_m, a^*_m)$ with a target interaction count $K_m = \frac{K}{2M}$. The regret is $R\big(\mathcal M(m, \ell^*_m, a^*_m), \pi(m, \ell^*, a^*), \frac{K}{2M}\big)$ (this is an MDP regret). Two cases arise:
• The $\frac{K}{2M}$-th interaction with the $m$-th MDP comes before the $K$-th simulation episode. This case happens under $E$. The regret of this part is denoted $R^+$.
• Otherwise.
This case happens under $\bar E$. The regret of this part is denoted $R^-$. Since the regret of a single episode is at most 1, we have $R^- < \delta\frac{K}{2M}$.

Now we study the case where we use $\pi(m, \ell^*, a^*)$ to solve $\mathcal M(m, \ell^*_m, a^*_m)$ with a simulation episode budget $K$. The alternative MDP regret is $\tilde R\big(\mathcal M(m, \ell^*_m, a^*_m), \pi(m, \ell^*, a^*), K\big)$. Two cases arise:
• The $\frac{K}{2M}$-th interaction with the $m$-th MDP comes before the $K$-th simulation episode. This case happens under $E$. The regret of this part is denoted $\tilde R^+$. Since the regret of a single episode is at least 0, and in this case $K_m \ge \frac{K}{2M}$, we have $\tilde R^+ \ge R^+$.
• Otherwise. This case happens under $\bar E$. The regret of this part is denoted $\tilde R^- \ge 0$.

Using the connection between $R^+$ and $\tilde R^+$, we have:
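The Algorithm 4 reduction above can be sketched as an adapter that makes an LMDP algorithm behave like an MDP algorithm for one target context, simulating the other contexts internally. The interfaces below (`run_episode`, the simulator callables) are hypothetical stand-ins for the construction in the proof:

```python
import random

# Sketch of the Algorithm 4 reduction: wrap an LMDP algorithm so it acts as an
# MDP algorithm for one target context m, simulating all other contexts.
class LMDPToMDPAdapter:
    def __init__(self, lmdp_algo, target_context, simulators, num_contexts, seed=0):
        self.algo = lmdp_algo
        self.m = target_context
        self.sims = simulators          # simulators for the non-target contexts
        self.M = num_contexts
        self.rng = random.Random(seed)
        self.real_interactions = 0

    def run_episode(self, real_env):
        ctx = self.rng.randrange(self.M)  # contexts drawn uniformly, as in the proof
        if ctx == self.m:
            self.real_interactions += 1
            return self.algo.run_episode(real_env, context=ctx)
        return self.algo.run_episode(self.sims[ctx], context=ctx)

class DummyAlgo:
    def run_episode(self, env, context):
        return env()

adapter = LMDPToMDPAdapter(DummyAlgo(), target_context=0,
                           simulators={1: (lambda: 0.0), 2: (lambda: 0.0)},
                           num_contexts=3)
rewards = [adapter.run_episode(lambda: 1.0) for _ in range(3000)]
# Roughly a 1/M fraction of episodes touch the real MDP, matching K_m ~ K/(2M)
# up to the concentration argument in the proof.
assert 0.2 < adapter.real_interactions / 3000 < 0.5
```

This is exactly why the two stopping rules differ only through the random count $K_m$: the wrapper's interactions with the real MDP form a binomial fraction of the simulation episodes.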

