ON THE POWER OF PRE-TRAINING FOR GENERALIZATION IN RL: PROVABLE BENEFITS AND HARDNESS

Abstract

Generalization in Reinforcement Learning (RL) aims to train an agent on training environments so that it performs well in an unseen target environment. In this work, we first point out that RL generalization is fundamentally different from generalization in supervised learning, and that fine-tuning on the target environment is necessary for good test performance. We therefore seek to answer the following question: how much can we expect pre-training over training environments to help efficient and effective fine-tuning? On the one hand, we give a surprising result showing that asymptotically, the improvement from pre-training is at most a constant factor. On the other hand, we show that pre-training can indeed be helpful in the non-asymptotic regime by designing a policy collection-elimination (PCE) algorithm and proving a distribution-dependent regret bound that is independent of the state-action space. We hope our theoretical results can provide insight towards understanding pre-training and generalization in RL.

1. INTRODUCTION

Reinforcement learning (RL) is concerned with sequential decision-making problems in which the agent interacts with the environment aiming to maximize its cumulative reward. This framework has achieved tremendous success in various fields such as game playing (Mnih et al., 2013; Silver et al., 2017; Vinyals et al., 2019), resource management (Mao et al., 2016), recommendation systems (Shani et al., 2005; Zheng et al., 2018) and online advertising (Cai et al., 2017). However, many empirical applications of RL algorithms are restricted to the single-environment setting: the RL policy is learned and evaluated in exactly the same environment. This learning paradigm can lead to overfitting in RL (Sutton, 1995; Farebrother et al., 2018), and the resulting policy may perform poorly when deployed in an unseen (but similar) environment. The ability to generalize to test environments is important to the success of reinforcement learning algorithms, especially in real applications such as autonomous driving (Shalev-Shwartz et al., 2016; Sallab et al., 2017), robotics (Kober et al., 2013; Kormushev et al., 2013) and health care (Yu et al., 2021). In these real-world tasks, the environment can be dynamic, open-ended and always changing. We hope the agent can learn meaningful skills in the training stage and be robust to variation in the test stage. Furthermore, in applications such as robotics where a simulator can generate unlimited data efficiently and safely, we can first train the agent in randomized simulator models and then transfer it to the real environment (Rusu et al., 2017; Peng et al., 2018; Andrychowicz et al., 2020). An RL algorithm with good generalization ability can greatly reduce the demand for real-world data and improve test-time performance.
Generalization in supervised learning has been widely studied for decades (Mitchell et al., 1986; Bousquet & Elisseeff, 2002; Kawaguchi et al., 2017). For a typical supervised learning task such as classification, given a hypothesis space $\mathcal{H}$ and a loss function $\ell$, the agent aims to find a solution that is optimal in the average sense: we hope the solution is near-optimal, in expectation over the data distribution, compared with the optimal hypothesis $h^* = \arg\min_{h \in \mathcal{H}} \mathbb{E}[\ell(h(X), Y)]$. From this perspective, generalization in RL is fundamentally different. Once the agent is deployed in a test environment $M$ sampled from a distribution $\mathcal{D}$, it is expected to achieve performance comparable to the optimal policy in $M$. In other words, we hope the learned policy performs near-optimally compared with the optimal value $V^*_M$ in instance, for the sampled test environment $M$. Unfortunately, as discussed in many previous works (Malik et al., 2021; Ghosh et al., 2021), the instance-optimal solution in the target environment can be statistically intractable without additional assumptions. We formulate this intractability as a lower bound (Proposition 1), showing that it is impractical to directly obtain a near-optimal policy for the test environment $M^*$ with high probability. This motivates us to ask: in what settings is the generalization problem in RL tractable? In RL generalization, the agent is often allowed to further interact with the test environment to improve its policy. For example, many previous results in robotics have demonstrated that fine-tuning in the test environment can greatly improve test performance for sim-to-real transfer (Rusu et al., 2017; James et al., 2019; Rajeswaran et al., 2016). Therefore, one natural way to formulate generalization is to allow further interaction with the target environment during the test stage.
Specifically, suppose the agent interacts with MDP $M \sim \mathcal{D}$ in the test stage, and we measure the performance of the fine-tuning algorithm $\mathcal{A}$ by the expected regret over $K$ episodes, i.e. $\mathrm{Reg}_K(\mathcal{D}, \mathcal{A}) = \mathbb{E}_{M \sim \mathcal{D}}\big[\sum_{k=1}^{K} V^{\pi^*(M)}_M - V^{\pi_k}_M\big]$. In this setting, can the information obtained from pre-training help reduce the regret suffered during the test stage? In addition, when test-time fine-tuning is not allowed, to what extent can we expect pre-training to be helpful? As discussed above, we can no longer demand instance-optimality in this setting, but can only step back and pursue a near-optimal policy in expectation. Specifically, our goal is to be near-optimal with respect to the policy with maximum value in expectation, i.e. $\pi^*(\mathcal{D}) = \arg\max_{\pi \in \Pi} \mathbb{E}_{M \sim \mathcal{D}}[V^\pi_M]$. Here $V^\pi_M$ is the value of policy $\pi$ in MDP $M$. We seek to answer: is it possible to design a sample-efficient training algorithm that returns an $\epsilon$-optimal policy $\pi$ in expectation, i.e. $\mathbb{E}_{M \sim \mathcal{D}}[V^{\pi^*(\mathcal{D})}_M - V^\pi_M] \le \epsilon$?
Main contributions. In this paper, we theoretically study RL generalization in the above two settings. Our contributions can be summarized as follows:
• When fine-tuning is allowed, we study the benefit of pre-training for test-time performance. Since all information we can gain from training is no more than the distribution $\mathcal{D}$ itself, we start with a somewhat surprising theorem showing the limitation of this benefit: there exist hard cases where, even if the agent has exactly learned the environment distribution $\mathcal{D}$ in the training stage, it cannot improve the test-time regret beyond a universal constant factor in the asymptotic setting ($K \to \infty$). In other words, knowing the distribution $\mathcal{D}$ provides no extra information, in terms of regret, asymptotically. Our theorem is proved using the Radon transform and Lebesgue integral analysis to establish a global information limit, which we believe are novel techniques for the RL community.
• Inspired by this lower bound, we focus on the non-asymptotic setting, and study whether and by how much we can reduce the regret in this case. We propose an efficient pre-training and test-time fine-tuning algorithm called PCE (Policy Collection-Elimination). By maintaining a minimal policy set that generalizes well, it achieves a regret upper bound $\tilde{O}(\sqrt{C(\mathcal{D})K})$ in the test stage, where $C(\mathcal{D})$ is a complexity measure of the distribution $\mathcal{D}$. This bound removes the polynomial dependence on the cardinality of the state-action space by leveraging the information obtained from pre-training. We give a fine-grained analysis of the value of $C(\mathcal{D})$ and show that our bound can be significantly smaller than state-action space dependent bounds in many settings. • When the agent cannot interact with the test environment, we propose an efficient algorithm called OMERM (Optimistic Model-based Empirical Risk Minimization) to find a near-optimal policy in expectation. This algorithm is guaranteed to return an $\epsilon$-optimal policy with $O\big(\log N_\Pi^{\epsilon/(12H)}/\epsilon^2\big)$ sampled MDP tasks in the training stage, where $N_\Pi^{\epsilon/(12H)}$ is the covering number of the policy class. This rate matches the traditional generalization rate in many supervised learning results (Mohri et al., 2018; Kawaguchi et al., 2017).

2. RELATED WORKS

Generalization and Multi-task RL. Many empirical works study how to improve generalization for deep RL algorithms (Packer et al., 2018; Zhang et al., 2020; Ghosh et al., 2021). We refer readers to a recent survey (Kirk et al., 2021) for more discussion of empirical results. Our paper is more closely related to recent works on understanding RL generalization from the theoretical perspective. Wang et al. (2019) focused on a special class of reparameterizable RL problems, and derived generalization bounds based on Rademacher complexity and the PAC-Bayes bound. Malik et al. (2021) and Duan et al. (2021) also provided lower bounds showing that an instance-optimal solution is statistically difficult for RL generalization when the sampled test environment cannot be accessed. Further, they proposed efficient algorithms which are guaranteed to return a near-optimal policy for deterministic MDPs under the strong proximity condition they introduced. Our paper is also related to recent works studying multi-task learning in RL (Tirinzoni et al., 2020; Hu et al., 2021; Zhang & Wang, 2021; Lu et al., 2021), which study how to transfer the knowledge learned from previous tasks to new tasks. Their problem formulation is different from ours, since they study the multi-task setting where the MDP is selected from a given MDP set without a sampling distribution (Brunskill & Li, 2013). In addition, they typically assume that all tasks have similar transition dynamics or share common representations. Provably Efficient Exploration in RL. Recent years have witnessed many theoretical results studying provably efficient exploration in RL (Osband et al., 2013; Azar et al., 2017; Osband & Van Roy, 2017; Jin et al., 2018; 2020b; Wang et al., 2020; Zhang et al., 2021), with the minimax regret for tabular MDPs with non-stationary transitions being $\tilde{O}(\sqrt{HSAK})$.
These results indicate that polynomial dependence on the whole state-action space is unavoidable without additional assumptions. Their formulation corresponds to the single-task setting where the agent interacts with a single environment to maximize its cumulative reward, without pre-training. The regret defined in our fine-tuning setting coincides with the notion of Bayesian regret in the previous literature (Osband et al., 2013; Osband & Van Roy, 2017; O'Donoghue, 2021). The best-known Bayesian regret for tabular RL is $\tilde{O}(\sqrt{HSAK})$ when applied to our setting (O'Donoghue, 2021).

3. PRELIMINARY AND FRAMEWORK

Notations 

3.1. EPISODIC MDPS

An episodic MDP $M$ is specified by a tuple $(S, A, P_M, R_M, H)$, where $S, A$ are the state and action spaces with cardinalities $S$ and $A$ respectively, and $H$ is the number of steps in one episode. $P_{M,h}: S \times A \to \Delta(S)$ is the transition function, where $P_{M,h}(s'|s,a)$ denotes the probability of transiting to state $s'$ when action $a$ is taken in state $s$ at step $h$. $R_{M,h}: S \times A \to \Delta(\mathbb{R})$ is the reward function, where $R_{M,h}(s,a)$ is the reward distribution with non-negative mean $r_{M,h}(s,a)$ when action $a$ is taken in state $s$ at step $h$. In order to compare with traditional generalization, we make the following assumption:
Assumption 1. The total mean reward is bounded by 1, i.e. $\sum_{h=1}^{H} r_{M,h}(s_h, a_h) \le 1$ for all $M \in \Omega$ and all trajectories $(s_1, a_1, \cdots, s_H, a_H)$ with positive probability in $M$. The reward $R_{M,h}(s,a)$ is 1-subgaussian, i.e. $\mathbb{E}_{X \sim R_{M,h}(s,a)}[\exp(\lambda[X - r_{M,h}(s,a)])] \le \exp(\frac{\lambda^2}{2})$ for all $\lambda \in \mathbb{R}$.
The total reward assumption follows previous works on horizon-free RL (Ren et al., 2021; Zhang et al., 2021; Li et al., 2022); it covers the traditional setting where $r_{M,h}(s,a) \in [0,1]$ after scaling by $H$, and is more natural in environments with sparse rewards (Vecerik et al., 2017; Riedmiller et al., 2018). In addition, it allows us to compare with supervised learning bounds, where $H = 1$ and the loss is bounded in $[0, 1]$. The subgaussian assumption is common in practice and widely used in bandits (Lattimore & Szepesvári, 2020). It also covers the traditional RL setting where $R_{M,h}(s,a) \in \Delta([0,1])$, and allows us to study a wider range of MDP environments. For convenience of exposition, we assume the agent always starts from the same state $s_1$. It is straightforward to recover an initial state distribution $\mu$ from this setting by adding an initial state $s_0$ with transition $\mu$ (Du et al., 2019; Chen et al., 2021). Policy and Value Function. A policy $\pi$ is a set of $H$ functions, each mapping a state to an action distribution, i.e.
$\pi = \{\pi_h\}_{h=1}^{H}$, $\pi_h: S \to \Delta(A)$, and $\pi$ can be stochastic. We denote the set of all such policies as $\Pi$. We define $N_\Pi^\epsilon$ as the $\epsilon$-covering number of the policy space $\Pi$ w.r.t. the distance $d(\pi^1, \pi^2) = \max_{s \in S, h \in [H]} \|\pi^1_h(\cdot|s) - \pi^2_h(\cdot|s)\|_1$. Given $\pi$ and $h \in [H]$, we define the Q-function $Q^\pi_{M,h}: S \times A \to \mathbb{R}_+$ by $Q^\pi_{M,h}(s,a) = r_{M,h}(s,a) + \sum_{s' \in S} P_{M,h}(s'|s,a) V^\pi_{M,h+1}(s')$, and the V-function $V^\pi_{M,h}: S \to \mathbb{R}_+$ by $V^\pi_{M,h}(s) = \mathbb{E}_{a \sim \pi_h(\cdot|s)}[Q^\pi_{M,h}(s,a)]$ for $h \le H$, with $V^\pi_{M,H+1}(s) = 0$. We abbreviate $V^\pi_{M,1}(s_1)$ as $V^\pi_M$, which can be interpreted as the value of executing policy $\pi$ in $M$. Following the notation of previous works, we use $P_h V(s,a)$ as shorthand for $\sum_{s' \in S} P_h(s'|s,a)V(s')$ in our analysis.
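As a concrete reference for these definitions, the backward induction computing $V^\pi_{M,1}(s_1)$ can be sketched as follows. This is a minimal tabular implementation; the dense-array layout and the function name are our own, not from the paper.

```python
import numpy as np

def policy_value(P, r, pi):
    """Evaluate V^pi_{M,1}(s_1) by backward induction.

    P  : (H, S, A, S) array, P[h, s, a, s'] = P_{M,h}(s'|s, a).
    r  : (H, S, A) array of mean rewards r_{M,h}(s, a).
    pi : (H, S, A) array, pi[h, s, a] = pi_h(a|s).
    Implements Q^pi_h(s,a) = r_h(s,a) + sum_{s'} P_h(s'|s,a) V^pi_{h+1}(s')
    and V^pi_h(s) = E_{a ~ pi_h(.|s)}[Q^pi_h(s,a)], with V^pi_{H+1} = 0.
    """
    H, S, A = r.shape
    V = np.zeros(S)                                  # V^pi_{H+1} = 0
    for h in reversed(range(H)):
        Q = r[h] + np.einsum('sap,p->sa', P[h], V)   # one-step lookahead
        V = (pi[h] * Q).sum(axis=1)                  # average over pi_h(.|s)
    return V[0]                                      # s_1 taken to be state 0
```

Note that a reward tensor satisfying Assumption 1 (total mean reward at most 1 along any trajectory) keeps the returned value in $[0, 1]$.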

3.2. RL GENERALIZATION FORMULATION

We mainly study the setting where all MDP instances encountered in the training and test stages are sampled i.i.d. from a distribution $\mathcal{D}$ supported on a (possibly infinite) countable set $\Omega$. For an MDP $M \in \Omega$, we use $P(M)$ to denote the probability of sampling $M$ under $\mathcal{D}$. For an MDP set $\tilde{\Omega} \subseteq \Omega$, we similarly define $P(\tilde{\Omega}) = \sum_{M \in \tilde{\Omega}} P(M)$. We assume that $S, A, H$ are shared by all MDPs, while the transitions and rewards differ. When interacting with a sampled instance $M$, the agent does not know which instance it is, and can only identify its model through interactions. In the training (pre-training) stage, the agent can sample i.i.d. MDP instances from the unknown distribution $\mathcal{D}$. The overall goal is to perform well in the test stage using the information learned in the training stage. Define the optimal policies $\pi^*(M) = \arg\max_{\pi \in \Pi} V^\pi_M$ and $\pi^*(\mathcal{D}) = \arg\max_{\pi \in \Pi} \mathbb{E}_{M \sim \mathcal{D}}[V^\pi_M]$. We say a policy $\pi$ is $\epsilon$-optimal in expectation if $\mathbb{E}_{M \sim \mathcal{D}}[V^{\pi^*(\mathcal{D})}_M - V^\pi_M] \le \epsilon$, and $\epsilon$-optimal in instance if $\mathbb{E}_{M \sim \mathcal{D}}[V^{\pi^*(M)}_M - V^\pi_M] \le \epsilon$. Without Test-time Interaction. When interaction with the test environment is unavailable, optimality in instance can be statistically intractable, and we can only pursue optimality in expectation. We formalize this difficulty in the following proposition.
Proposition 1. There exists an MDP support $\Omega$ such that for any distribution $\mathcal{D}$ with positive p.d.f. $p(\cdot)$, there exists $\epsilon_0 > 0$ such that for any deployed policy $\pi$, $\mathbb{E}_{M^* \sim \mathcal{D}}[V^{\pi^*(M^*)}_{M^*} - V^\pi_{M^*}] \ge \epsilon_0$.
Proposition 1 is proved by constructing $\Omega$ as a set of MDPs with opposed optimal actions; the complete proof can be found in Appendix A. When $\Omega$ is discrete, there exist hard instances where the proposition holds with $\epsilon_0 \ge \frac{1}{2}$. This implies that without test-time interactions or special knowledge about the structure of $\Omega$ and $\mathcal{D}$, it is impractical to be near-optimal in instance. This intractability arises from the demand for an instance-optimal policy, which is never required in supervised learning.
With Test-time Interaction. To pursue optimality in instance, we study RL generalization with test-time interaction. When the algorithm is allowed to interact with the target MDP $M^* \sim \mathcal{D}$ for $K$ episodes in the test stage, we want to minimize the regret, defined as $\mathrm{Reg}_K(\mathcal{D}, \mathcal{A}) \triangleq \mathbb{E}_{M^* \sim \mathcal{D}}[\mathrm{Reg}_K(M^*, \mathcal{A})]$ with $\mathrm{Reg}_K(M^*, \mathcal{A}) \triangleq \sum_{k=1}^{K} [V^{\pi^*(M^*)}_{M^*} - V^{\pi_k}_{M^*}]$, where $\pi_k$ is the policy that $\mathcal{A}$ deploys in episode $k$. Here $M^*$ is unknown and remains fixed during all $K$ episodes. The choice of Bayesian regret is natural for generalization, and better reflects the practical performance of an algorithm. By the standard regret-to-PAC technique (Jin et al., 2018; Dann et al., 2017), an algorithm with $\tilde{O}(\sqrt{K})$ regret can be transformed into an algorithm that returns an $\epsilon$-optimal policy with $\tilde{O}(1/\epsilon^2)$ trajectories. Therefore, we believe regret is also a good criterion for measuring the sample efficiency of fine-tuning algorithms in the test stage.
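To make the regret definition concrete, the following is a small Monte Carlo sketch of $\mathrm{Reg}_K(\mathcal{D}, \mathcal{A})$ for $H = 1$ environments (i.e. bandits). The interface and all names are illustrative assumptions, not from the paper; the outer expectation over $M^* \sim \mathcal{D}$ is approximated by repeated trials.

```python
import random

def bayes_regret(envs, probs, alg, K, trials=2000, seed=0):
    """Monte Carlo estimate of Reg_K(D, A) = E_{M*~D}[sum_k V* - V^{pi_k}].

    envs  : list of arm-mean vectors; each is an H=1 MDP, i.e. a bandit.
    probs : the sampling distribution D over envs.
    alg   : function(K, pull) that runs K episodes, calling pull(arm) once
            per episode; it is charged the mean reward of the pulled arm.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        means = rng.choices(envs, weights=probs)[0]  # M* ~ D, fixed for all K episodes
        v_star = max(means)
        pulled = []
        def pull(arm):
            pulled.append(arm)
            # noisy episode return; Bernoulli keeps rewards subgaussian (Assumption 1)
            return 1.0 if rng.random() < means[arm] else 0.0
        alg(K, pull)
        # pseudo-regret: charge the mean value gap of each deployed policy
        total += sum(v_star - means[a] for a in pulled)
    return total / trials
```

For instance, an algorithm that ignores all feedback and always pulls arm 0 on a uniform mixture of two opposed bandits suffers regret $K/2$ in expectation, matching the intuition behind Proposition 1.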

4. RESULTS FOR THE SETTING WITH TEST-TIME INTERACTION

For single-environment tabular RL, the minimax regret is $\tilde{O}(\sqrt{SAHK})$ (Zhang et al., 2021). For generalization in RL, we mainly care about performance in the test stage, and hope the agent can reduce the test regret by leveraging the information learned in the pre-training stage. Obviously, when $\Omega$ is the set of all tabular MDPs and $\mathcal{D}$ is uniform over $\Omega$, pre-training cannot improve the test regret, since it provides no extra information for the test stage. Therefore, we seek a distribution-dependent improvement over the traditional upper bound in most benign settings.

4.1. LOWER BOUND

We start by asking how much information the pre-training stage can provide at most. A natural focus is the MDP distribution $\mathcal{D}$, which is a sufficient statistic of the possible environments the agent may encounter in the test stage. We strengthen the algorithm by directly telling it the exact distribution $\mathcal{D}$, and analyze how much this extra information can help improve the regret. Specifically, we ask: is there a multiplicative factor $C(\mathcal{D})$, small when $\mathcal{D}$ enjoys benign properties (e.g. $\mathcal{D}$ is sharp and concentrated), such that an algorithm knowing $\mathcal{D}$ can reduce the regret by a factor of $C(\mathcal{D})$ for large enough $K$? Perhaps surprisingly, our answer to this question is negative in the asymptotic case. As formalized in Theorem 1, the value of $\mathcal{D}$ is constrained by a universal factor $c_0$ asymptotically. Here $c_0 = \frac{1}{16}$ holds universally and does not depend on $\mathcal{D}$. This theorem implies that, no matter what $\mathcal{D}$ is, for sufficiently large $K$ any algorithm can reduce the total regret by at most a constant factor using the extra knowledge of $\mathcal{D}$.
Theorem 1. There exists an MDP instance set $\Omega$, a universal constant $c_0 = \frac{1}{16}$, and an algorithm $\hat{\mathcal{A}}$ that inputs only the number of episodes $K$, such that for any distribution $\mathcal{D}$ with positive p.d.f. $p \in C(\Omega)$ (which $\hat{\mathcal{A}}$ does NOT know) and any algorithm $\mathcal{A}$ that inputs both $\mathcal{D}$ and $K$:
1. $\Omega$ is not degenerate, i.e. $\lim_{K \to \infty} \mathrm{Reg}_K(\mathcal{D}, \mathcal{A}(\mathcal{D}, K)) = +\infty$.
2. Knowing the distribution is useless up to a constant, i.e. $\liminf_{K \to \infty} \frac{\mathrm{Reg}_K(\mathcal{D}, \mathcal{A}(\mathcal{D}, K))}{\mathrm{Reg}_K(\mathcal{D}, \hat{\mathcal{A}}(K))} \ge c_0$.
In Theorem 1, point (1) rules out any trivial support $\Omega$ where some $\pi^*$ is optimal for all $M \in \Omega$, in which case the distribution is of course useless, since $\hat{\mathcal{A}}$ can be optimal by simply following $\pi^*$ even without knowing $\mathcal{D}$. Note that our bound holds for any distribution $\mathcal{D}$, which indicates that even a very sharp distribution cannot provide useful information in the asymptotic regime $K \to \infty$.
We point out that the value of $c_0$ depends on the coefficients of previous upper and lower bounds, and we conjecture that it could be arbitrarily close to 1. We defer the complete proof to Appendix B and sketch the intuition here. The key observation is that the information provided by the training stage (the prior) is fixed, while the information required grows as $K$ increases. When $K = 1$, the agent clearly benefits from knowing $\mathcal{D}$: without this knowledge, all it can do is guess randomly, since it has never interacted with $M^*$ before. However, when $K$ is large, the algorithm can interact with $M^*$ many times and learn it accurately, while the prior $\mathcal{D}$ becomes relatively less informative. As a result, the benefit of knowing $\mathcal{D}$ eventually vanishes. Theorem 1 lower bounds the improvement in regret by a constant. As is well known, a regret bound can be converted into a PAC-RL bound (Jin et al., 2018; Dann et al., 2017). This implies that as $\delta, \epsilon \to 0$, in terms of finding an $\epsilon$-optimal policy relative to $\pi^*(M^*)$, pre-training cannot help reduce the sample complexity. Despite being negative, this theorem only describes the asymptotic setting $K \to \infty$, and imposes no constraint when $K$ is fixed.

4.2. NON-ASYMPTOTIC UPPER BOUND

In the previous subsection, we provided a lower bound showing that the information obtained from the training stage can be useless in the asymptotic setting $K \to \infty$. In practice, near-optimal regret in the non-asymptotic setting is also desirable in many applications. In this section, we fix the value of $K$ and design an algorithm that leverages the pre-training information to reduce the $K$-episode test regret. To avoid redundant discussion of single-MDP learning, we introduce the following oracles.
Definition 1 (Policy learning oracle). We define $O_l(M, \epsilon, \log(1/\delta))$ as the policy learning oracle, which returns a policy $\pi$ that is $\epsilon$-optimal w.r.t. MDP $M$ with probability at least $1 - \delta$, i.e. $V^*_M(s_1) - V^\pi_M(s_1) \le \epsilon$. The randomness of the policy $\pi$ is due to the randomness of both the oracle algorithm and the environment.
Definition 2 (Policy evaluation oracle). We define $O_e(M, \pi, \epsilon, \log(1/\delta))$ as the policy evaluation oracle, which returns a value $v$ that is $\epsilon$-close to the value $V^\pi_M(s_1)$ with probability at least $1 - \delta$, i.e. $|v - V^\pi_M(s_1)| \le \epsilon$. The randomness of the value $v$ is due to the randomness of both the oracle algorithm and the environment.
Both oracles can be implemented efficiently using previous algorithms for single-task MDPs. Specifically, the policy learning oracle can be implemented with algorithms such as UCBVI (Azar et al., 2017), LSVI-UCB (Jin et al., 2020b) and GOLF (Jin et al., 2021) with polynomial sample complexity, and the policy evaluation oracle can be implemented with the standard Monte Carlo method (Sutton & Barto, 2018).

4.2.1. ALGORITHM

There are two major difficulties in designing the algorithm. First, what do we want to learn during pre-training, and how do we learn it? One idea is to directly learn the whole distribution $\mathcal{D}$, which is all that we can obtain for the test stage. However, this requires $\tilde{O}(|\Omega|^2/\delta^2)$ samples for a required accuracy $\delta$, which is unacceptable when $|\Omega|$ is large or even infinite. Second, how do we design the test-stage algorithm to leverage the learned information effectively? If we cannot effectively use the information from pre-training, the regret or number of samples required in the test stage can be $\tilde{O}(\mathrm{poly}(S, A))$ in the worst case.

Algorithm 1 PCE (Policy Collection-Elimination)
Pre-training stage:
1: for phase $l = 1, 2, \cdots$ do
2:   Sample an MDP set $\tilde{\Omega}$ with $N$ MDPs $\{M_1, M_2, \cdots, M_N\}$ from distribution $\mathcal{D}$
3:   for $j = 1, \cdots, N$ do
4:     Calculate $\pi_j = O_l(M_j, \epsilon/2, \log(N/\delta))$ for the MDP $M_j$
5:   for $i, j = 1, \cdots, N$ do
6:     Calculate $v_{i,j} = O_e(M_i, \pi_j, \epsilon/2, \log(N^2/\delta))$ to evaluate the policy $\pi_j$ on the MDP $M_i$
7:   Call Subroutine 4 to find a set $\hat{\Pi}$ that covers a $(1-3\delta)$-fraction of the MDPs in $\tilde{\Omega}$
8:   if $\frac{|\hat{\Pi}| \log(2N/\delta)}{N - |\hat{\Pi}|} \le \delta$ then
9:     Output: the policy-value set $\hat{\Pi} = \{(\pi_j, v_{j,j}), \forall j \in U\}$
10:  else double $N$
Test stage:
1: Initialize $l = 1$, $k_0 = 1$, $\delta = \epsilon = 1/\sqrt{K}$
2: for episode $k = 1, \cdots, K$ do
3:   Calculate $(\pi_l, v_l) = \arg\max_{(\pi, v) \in \hat{\Pi}_l} v$
4:   Execute the policy $\pi_l$, and receive the total reward $G_k$
5:   if $\big|\frac{1}{k-k_0+1}\sum_{\tau=k_0}^{k} G_\tau - v_l\big| \ge 4\epsilon + \sqrt{\frac{2\log(4K/\delta)}{k-k_0+1}}$ then
6:     Eliminate $(\pi_l, v_l)$ from $\hat{\Pi}_l$; denote the remaining set as $\hat{\Pi}_{l+1}$
7:     Set $k_0 = k+1$ and $l = l+1$

To tackle the above difficulties, we formulate this problem as a policy candidate collection-elimination process. Our intuition is to find a minimal policy set that generalizes to most MDPs sampled from $\mathcal{D}$. In the pre-training stage, we maintain a policy set that performs well on most MDP instances. This includes policies that are near-optimal for an MDP $M$ with relatively large $P(M)$, or that work well across several different MDPs.
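The covering step in the pre-training stage can be realized by a greedy set cover. Since Subroutine 4 is not reproduced in this excerpt, the following is only a plausible sketch under the cover condition on the value estimates $v_{i,j}$; all names are hypothetical.

```python
import numpy as np

def build_cover(v, eps, target_frac):
    """Greedy sketch of the covering subroutine: pick policies until a target
    fraction of the sampled MDPs is covered.

    v[i, j] estimates the value of policy pi_j on M_i, so v[i, i] proxies the
    optimal value of M_i and v[j, j] is the value reported alongside pi_j.
    Pair (pi_j, v_jj) covers M_i iff |v_ij - v_ii| < eps and |v_ij - v_jj| < eps.
    """
    N = v.shape[0]
    covers = (np.abs(v - np.diag(v)[:, None]) < eps) & \
             (np.abs(v - np.diag(v)[None, :]) < eps)      # covers[i, j]
    chosen, covered = [], np.zeros(N, dtype=bool)
    while covered.mean() < target_frac:
        # policy covering the most not-yet-covered MDPs
        j = int((covers & ~covered[:, None]).sum(axis=0).argmax())
        if not (covers[:, j] & ~covered).any():
            break                                          # no further progress possible
        chosen.append(j)
        covered |= covers[:, j]
    return chosen, covered.mean()
```

On a sample containing a few clusters of near-identical MDPs, one policy per cluster suffices, which is exactly why the output set can be much smaller than the sample.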
In the test stage, we sequentially execute policies in this set. Once we realize that the current policy is not near-optimal for $M^*$, we eliminate it and switch to another. This reduces the regret dependence from the cardinality of the whole state-action space to the size of the policy covering set. The pseudo-code is in Algorithm 1.
Pre-training Stage. In the pre-training stage, we say a policy-value pair $(\pi, v)$ covers an MDP $M$ if $\pi$ is $O(\epsilon)$-optimal for $M$ and $v$ estimates the optimal value $V^*_M(s_1)$ with at most $O(\epsilon)$ error. For a policy-value set $\hat{\Pi}$, we say it covers the distribution $\mathcal{D}$ with probability at least $1 - O(\delta)$ if $\Pr_{M \sim \mathcal{D}}\big[\exists (\pi, v) \in \hat{\Pi}, (\pi, v) \text{ covers } M\big] \ge 1 - O(\delta)$. In each phase, we sample an MDP set $\{M_1, M_2, \cdots, M_N\}$. We call the oracle $O_l$ to compute a near-optimal policy $\pi_j$ for each MDP $M_j$, and the oracle $O_e$ to compute the value estimate $v_{i,j}$ of each policy $\pi_j$ on each MDP $M_i$. We use the following condition to indicate whether the pair $(\pi_j, v_{j,j})$ covers the MDP $M_i$: $\mathrm{Cnd}(v_{i,j}, v_{i,i}, v_{j,j}) = \mathbb{I}\left[|v_{i,j} - v_{i,i}| < \epsilon\right] \cap \mathbb{I}\left[|v_{i,j} - v_{j,j}| < \epsilon\right]$. This condition indicates that $\pi_j$ is a near-optimal policy for $M_i$, and that $v_{j,j}$ is an accurate estimate of the value $V^{\pi_j}_{M_i,1}(s_1)$. With this condition, we construct a policy-value set $\hat{\Pi}$ that covers a $(1-3\delta)$-fraction of the sampled MDPs by calling Subroutine 4. We output the policy-value set once the distribution estimation error $\frac{|\hat{\Pi}| \log(2N/\delta)}{N - |\hat{\Pi}|}$ is less than $\delta$, and otherwise double the number of sampled MDPs $N$ to increase the accuracy of the distribution estimation. After the pre-training phase, we can guarantee that the returned policy set covers $\mathcal{D}$ with probability at least $1 - O(\delta)$, i.e. $\Pr_{M \sim \mathcal{D}}\big[\exists (\pi, v) \in \hat{\Pi}, \{V^*_{M,1}(s_1) - V^\pi_{M,1}(s_1) < 2\epsilon\} \cap \{|V^\pi_{M,1}(s_1) - v| < 2\epsilon\}\big] \ge 1 - O(\delta)$.
Fine-tuning Stage. We start with the policy-value set $\hat{\Pi}$ from the pre-training stage and eliminate policy-value pairs until we reach a $(\pi, v) \in \hat{\Pi}$ that covers the test MDP $M^*$.
Specifically, we split all episodes into phases. In phase $l$, we maintain a set $\hat{\Pi}_l$ that covers the real environment $M^*$ with high probability. We select the pair $(\pi_l, v_l)$ with the most optimistic value $v_l$ in $\hat{\Pi}_l$ and execute the policy $\pi_l$ for several episodes. During execution, we also evaluate the policy $\pi_l$ on the MDP $M^*$ (i.e. estimate $V^{\pi_l}_{M^*,1}(s_1)$) by maintaining the empirical mean $\frac{1}{k-k_0+1}\sum_{\tau=k_0}^{k} G_\tau$ of the observed returns. Once we identify that $\pi_l$ is not near-optimal for $M^*$, we end the phase and eliminate $(\pi_l, v_l)$ from $\hat{\Pi}_l$.
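A minimal sketch of this elimination loop follows, with the confidence width taken as $4\epsilon + \sqrt{2\log(4K/\delta)/(k-k_0+1)}$ as in the algorithm. All interfaces here are hypothetical simplifications: the environment is queried through callbacks rather than real rollouts.

```python
import math, random

def pce_finetune(pi_set, env_value, sample_return, K, eps, delta, seed=0):
    """Sketch of the PCE fine-tuning stage: run the most optimistic surviving
    (pi, v) pair; eliminate it when the running mean of observed returns G_k
    deviates from the promised value v by more than the confidence width.

    pi_set        : list of (policy_id, promised value v) pairs from pre-training.
    env_value     : env_value(policy_id) -> true value on M* (bookkeeping only).
    sample_return : sample_return(policy_id, rng) -> one noisy episode return G_k.
    """
    rng = random.Random(seed)
    alive = sorted(pi_set, key=lambda pv: -pv[1])   # most optimistic first
    regret, n, mean = 0.0, 0, 0.0
    v_star = max(env_value(p) for p, _ in pi_set)   # best achievable within the set
    for k in range(K):
        pid, v = alive[0]
        g = sample_return(pid, rng)
        n += 1
        mean += (g - mean) / n                      # running average of returns
        regret += v_star - env_value(pid)
        width = 4 * eps + math.sqrt(2 * math.log(4 * K / delta) / n)
        if abs(mean - v) >= width and len(alive) > 1:
            alive.pop(0)                            # pi_l not near-optimal for M*
            n, mean = 0, 0.0                        # start a new phase
    return regret, alive[0][0]
```

Note how the regret of a spurious candidate stops accumulating as soon as its phase ends: the confidence width shrinks at rate $\tilde{O}(1/\sqrt{n})$, so an over-promising pair is caught after $\tilde{O}(1)$ episodes.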

4.2.2. REGRET

The efficiency of Algorithm 1 is summarized in Theorem 2, whose proof is in Appendix C.
Theorem 2. The regret of Algorithm 1 is at most $\mathrm{Reg}_K(\mathcal{D}, \text{Alg. 1}) \le O\big(\sqrt{C(\mathcal{D})K}\log^2(K) + C(\mathcal{D})\big)$, where $C(\mathcal{D}) \triangleq \min_{P(\tilde{\Omega}) \ge 1-\delta} |\tilde{\Omega}|$ is a complexity measure of $\mathcal{D}$ and $\delta = 1/\sqrt{K}$. In addition, with probability at least $1 - O(\delta \log(C(\mathcal{D})/\delta))$, the number of samples required in the pre-training stage is $O(C(\mathcal{D})/\delta^2)$.
In contrast to previous regret bounds without pre-training (Azar et al., 2017; Osband et al., 2013; Zhang et al., 2021), this distribution-dependent upper bound replaces the dependence on $S, A$ with the complexity measure $C(\mathcal{D})$. $C(\cdot)$ serves as a multiplicative factor, and can be small when $\mathcal{D}$ enjoys benign properties, as shown below. First, when the cardinality of $\Omega$ is small, i.e. $|\Omega| \ll SA$, we have $C(\mathcal{D}) \le |\Omega|$, and pre-training can greatly reduce the instance space, e.g. via representation learning or multi-task learning (Agarwal et al., 2020; Brunskill & Li, 2013). Specifically, the regret in the test stage is reduced from $\tilde{O}(\sqrt{SAHK})$ to $\tilde{O}(\sqrt{|\Omega|K})$. To the best of our knowledge, this is the first result that depends only on $|\Omega|$ for a general distribution in RL generalization. We provide a lower bound in Appendix C.4 showing that our test-stage regret is near-optimal up to logarithmic factors. When $|\Omega|$ is large or even infinite, $C(\mathcal{D})$ is still bounded and can be significantly smaller when $\mathcal{D}$ enjoys benign properties. In the worst case, the dependence on $C(\mathcal{D})$ is unavoidable, and no improvement can be expected. In practice, however, even when $|\Omega|$ is large, the probability mass is typically concentrated on a few subsets of $\Omega$ and decays quickly outside, e.g. when $\mathcal{D}$ is subgaussian or a mixture of subgaussians. Specifically, if the probability of the $i$-th MDP satisfies $p_i \le c_1 e^{-\lambda i}$ for positive constants $c_1, \lambda$, then $C(\mathcal{D}) \le O(\log \frac{1}{\delta})$, which gives the upper bound $\tilde{O}(\sqrt{K \log K})$.
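The complexity measure $C(\mathcal{D})$ is easy to compute for a given discrete distribution, since the minimal set of mass at least $1-\delta$ is obtained by taking the largest probabilities first. The snippet below (an illustration we add, not part of the algorithm) also exhibits the claimed $O(\log \frac{1}{\delta})$ behavior under exponential decay.

```python
import math

def complexity(probs, delta):
    """C(D) = min{|Omega~| : P(Omega~) >= 1 - delta}: the fewest MDPs whose
    total probability reaches 1 - delta, taking the largest masses first."""
    mass, count = 0.0, 0
    for p in sorted(probs, reverse=True):
        if mass >= 1 - delta:
            break
        mass += p
        count += 1
    return count
```

For a geometric distribution $p_i \propto e^{-i}$, the tail after the first $k$ masses is $e^{-k}$, so $C(\mathcal{D}) = \lceil \log(1/\delta) \rceil$, while the uniform distribution gives the worst case $C(\mathcal{D}) = |\Omega|$.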
It is worth mentioning that our algorithm in the pre-training stage actually finds a policy covering set rather than an MDP covering set, even though the regret in Theorem 2 depends on the cardinality of the MDP covering set. This dependence could possibly be improved to the cardinality of the policy covering set by adding further assumptions on the policy learning oracle, which we leave as an interesting problem for future research.

5. RESULTS FOR THE SETTING WITHOUT TEST-TIME INTERACTION

In this section, we study the benefit of pre-training without test-time interaction. As illustrated by Proposition 1, we can only pursue optimality in expectation, which is in line with supervised learning. The traditional Empirical Risk Minimization (ERM) algorithm is $\epsilon$-optimal with $O\big(\frac{\log |\mathcal{H}|}{\epsilon^2}\big)$ samples, and we expect the same to hold for RL generalization. However, a policy in RL needs to interact sequentially with the environment, which is not captured by a pure ERM algorithm. In addition, different MDPs in $\Omega$ can have distinct optimal actions, making it hard to determine which action is better even in expectation. To overcome these issues, we design an algorithm called OMERM (Algorithm 2). Our algorithm is designed for tabular MDPs with finite state-action spaces. Nevertheless, it can be extended to the non-tabular setting by combining our ideas with previous algorithms for efficient RL with function approximation (Jin et al., 2020a; Wang et al., 2020). In Algorithm 2, we first sample $N$ tasks from the distribution $\mathcal{D}$ as the input. The goal is to find a near-optimal policy in expectation w.r.t. the task set $\{M_1, M_2, \cdots, M_N\}$. In each episode, the algorithm estimates each MDP instance from the history trajectories and calculates an optimistic value function for each instance. Based on these, it selects the policy from $\Pi$ that maximizes the average optimistic value, i.e. $\frac{1}{N}\sum_{i=1}^{N} \hat{V}^\pi_{M_i,1}(s_1)$. This selection objective is inspired by ERM, with the difference that we require the estimate to be optimistic. It can be computed by a planning oracle when $\Pi$ is the set of all stochastic maps from $(S, H)$ to $\Delta(A)$, or by gradient descent when $\Pi$ is a parameterized model. For each sampled $M_i$, our algorithm needs to interact with it for $\mathrm{poly}(S, A, H, \frac{1}{\epsilon})$ episodes, a typical requirement for learning a model well.
Theorem 3. With probability at least $2/3$, Algorithm 2 outputs a policy $\pi$ satisfying $\mathbb{E}_{M^* \sim \mathcal{D}}[V^{\pi^*(\mathcal{D})}_{M^*} - V^\pi_{M^*}] \le \epsilon$ with $O\big(\frac{\log N_\Pi^{\epsilon/(12H)}}{\epsilon^2}\big)$ MDP instance samples during training. The number of episodes collected for each task is bounded by $O\big(\frac{H^2 S^2 A \log(SAH)}{\epsilon^2}\big)$.
We defer the proof of Theorem 3 to Appendix D. The theorem implies that Algorithm 2 needs approximately $O(\log N_\Pi^{\epsilon/(12H)}/\epsilon^2)$ samples to return an $\epsilon$-optimal policy in expectation. Recall that $\log N_\Pi^{\epsilon/(12H)}$ is the log-covering number of $\Pi$. When $\Pi$ is the set of all stochastic maps from $(S, H)$ to $\Delta(A)$, it is bounded by $\tilde{O}(HSA)$. When $\Pi$ is a parameterized model where the parameter $\theta \in \mathbb{R}^d$ has finite norm and $\pi_\theta$ satisfies a smoothness condition in $\theta$, $\log N_\Pi^{\epsilon/(12H)} \le \tilde{O}(d)$.
Algorithm 2 OMERM (Optimistic Model-based Empirical Risk Minimization)
1: Input: target accuracy $\epsilon > 0$
2: Input: $N$ MDPs $\{M_1, M_2, \cdots, M_N\}$ sampled from distribution $\mathcal{D}$, with $N = C_1 \log\big(N_\Pi^{\epsilon/(12H)}\big)/\epsilon^2$ for a constant $C_1 > 0$
3: Set $K = C_2 S^2 A H^2 \log(SAH/\epsilon)/\epsilon^2$ for a constant $C_2 > 0$
4: for episode $k = 1, 2, \cdots, K$ do
5:   for $i = 1, 2, \cdots, N$ do
6:     Denote $N_{M_i,k,h}(s, a, s')$ and $N_{M_i,k,h}(s, a)$ as the number of times the agent encounters $(s, a, s')$ and $(s, a)$ at step $h$ in $M_i$ up to episode $k-1$
7:     Estimate $\hat{P}_{M_i,k,h}(s'|s,a) = \frac{N_{M_i,k,h}(s,a,s')}{\max\{1, N_{M_i,k,h}(s,a)\}}$ for step $h \in [H]$
8:     Estimate $\hat{R}_{M_i,k,h}(s,a) = \frac{\sum_{\tau=1}^{k-1} r_{M_i,\tau,h} \mathbb{I}(s_{M_i,\tau,h}=s,\, a_{M_i,\tau,h}=a)}{\max\{1, N_{M_i,k,h}(s,a)\}}$ for step $h \in [H]$
9:     Define the UCB bonus $b_{M_i,k,h}(s,a) = \sqrt{\frac{8S \log(8SANHK)}{\max\{1, N_{M_i,k,h}(s,a)\}}}$
10:    Initialize $\hat{V}^\pi_{M_i,k,H+1}(s) = 0$ for all $s$
11:    for $h = H, H-1, \cdots, 1$ do
12:      $\hat{Q}^\pi_{M_i,k,h}(s,a) = \min\big\{1, \hat{R}_{M_i,k,h}(s,a) + b_{M_i,k,h}(s,a) + \hat{P}_{M_i,k,h}\hat{V}^\pi_{M_i,k,h+1}(s,a)\big\}$
13:      $\hat{V}^\pi_{M_i,k,h}(s) = \sum_a \pi_h(a|s) \hat{Q}^\pi_{M_i,k,h}(s,a)$
14:  Calculate the optimistic policy $\pi_k = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \hat{V}^\pi_{M_i,k,1}(s_1)$
15:  for $i = 1, 2, \cdots, N$ do
16:    Execute the policy $\pi_k$ on MDP $M_i$ for one episode, and observe the trajectory $(s_{M_i,k,h}, a_{M_i,k,h}, r_{M_i,k,h})_{h=1}^{H}$
17: Output: a policy selected uniformly at random from the policy set $\{\pi_k\}_{k=1}^{K}$.
This result matches traditional bounds in supervised learning, and implies that when we pursue optimality in expectation, generalization in RL enjoys a quantitatively similar upper bound to supervised learning.
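The optimistic evaluation and ERM-style selection steps of Algorithm 2 (the inner backward recursion and the $\arg\max$ over the average optimistic value) can be sketched as follows for a finite policy class. The data layout and function names are our own simplification.

```python
import numpy as np

def optimistic_value(P_hat, r_hat, bonus, pi):
    """Optimistic evaluation on one estimated model:
    Q = min(1, r_hat + bonus + P_hat V), V(s) = sum_a pi(a|s) Q(s,a).
    P_hat: (H, S, A, S), r_hat and bonus: (H, S, A), pi: (H, S, A)."""
    H, S, A = r_hat.shape
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = np.minimum(1.0, r_hat[h] + bonus[h]
                       + np.einsum('sap,p->sa', P_hat[h], V))
        V = (pi[h] * Q).sum(axis=1)
    return V[0]

def omerm_select(models, policies):
    """ERM-like step of OMERM: among a finite policy class, pick the policy
    maximizing the *average* optimistic value over the N estimated tasks.
    models: list of (P_hat, r_hat, bonus) triples, one per sampled MDP."""
    scores = [np.mean([optimistic_value(P, r, b, pi) for P, r, b in models])
              for pi in policies]
    return int(np.argmax(scores)), scores
```

The averaging is the crucial difference from single-task optimism: a policy that is excellent on one sampled task but poor on the others loses to a policy that is merely good everywhere, which is exactly optimality in expectation.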

6. CONCLUSION AND FUTURE WORK

This work theoretically studies how much pre-training can improve test performance under different generalization settings. We first point out that RL generalization is fundamentally different from generalization in supervised learning, and that fine-tuning on the target environment is necessary for good generalization. When the agent can interact with the test environment to update the policy, we first prove that the prior information obtained in the pre-training stage can be theoretically useless in the asymptotic setting, and then show that in the non-asymptotic setting we can reduce the test-time regret to Õ(√(C(D)K)) by designing an efficient learning algorithm. In addition, when the agent cannot interact with the test environment, we provide an efficient algorithm called OMERM which returns a near-optimal policy in expectation by interacting with O(log(N^Π_{ϵ/(12H)})/ϵ²) MDP instances. Our work seeks a comprehensive theoretical understanding of how much pre-training can help test performance, and it also provides insights for real RL generalization applications. For example, when test-time interactions are not allowed, one cannot guarantee near-optimality on every instance. Therefore, for a task where large regret is not tolerable, rather than designing a good algorithm, it is more important to find an environment that is close to the target environment and improve policies there than to train a policy on a diverse set of MDPs and hope it generalizes to the target. In addition, for tasks where we can improve policies on the fly, we can pre-train our algorithm in advance to reduce the regret suffered; this corresponds to the many applications where test-time interactions are very expensive, such as autonomous driving and robotics. There are still problems remaining open. Firstly, we mainly study the i.i.d. case where the training MDPs and test MDPs are sampled from the same distribution.
It is an interesting problem to study out-of-distribution generalization in RL under certain distribution shifts. Secondly, there is still room for improving our instance-dependent bound, possibly by leveraging ideas from recent Bayesian-optimal algorithms such as Thompson sampling (Osband et al., 2013). We hope these problems can be addressed in future research.

A OMITTED PROOF FOR PROPOSITION 1

Assume Ω is a set of M MDP instances M_1, ..., M_M, where every instance consists of a single state s_1 and has horizon H = 1. In M_i, the reward at s_1 is 1 for action a_i and 0 otherwise. The optimal policy in M_i therefore takes action a_i, and V^{π*(M_i)}_{M_i} = 1. For any distribution D, assume the probability of sampling M_i is p_i > 0. For any deployed policy π, if the probability that π takes action a_i is q_i, then E_{M*∼D} V(M*, π) = Σ_{i=1}^M p_i q_i ≤ max_{i∈[M]} p_i. Therefore, denoting ϵ_0 = 1 − max_{i∈[M]} p_i > 0, we have E_{M*∼D}[V^{π*(M*)}_{M*} − V^π_{M*}] ≥ ϵ_0. Notice that when D is the uniform distribution, ϵ_0 = (M − 1)/M, which can be arbitrarily close to 1.
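The construction above can be checked numerically. The sketch below (helper name is ours) computes ϵ_0 = 1 − max_i p_i, the expected-regret lower bound that any fixed deployed policy must suffer in this family:

```python
import numpy as np

def expected_regret_lower_bound(p):
    """For the Proposition-1 family: every MDP M_i has optimal value 1, and any
    fixed policy with action probabilities q achieves sum_i p_i * q_i <= max_i p_i,
    so the expected regret is at least eps_0 = 1 - max_i p_i."""
    p = np.asarray(p, dtype=float)
    assert np.isclose(p.sum(), 1.0) and (p > 0).all()
    return 1.0 - p.max()
```

For the uniform distribution over M = 5 instances this gives ϵ_0 = 4/5, approaching 1 as M grows.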

B OMITTED PROOF FOR THEOREM 1

To prove the theorem, we let Ω be a subset of MDPs with H = S = 1, under which the problem becomes a bandit problem, and it suffices to prove the theorem in this setting. Below we first introduce the bandit notation, and then give the complete proof.

B.1 NOTATIONS

To be consistent with the traditional K-armed bandit problem, in this section we use slightly different notation from the main paper. A bandit instance can be represented as a vector r ∈ Ω ⊂ [0, 1]^K, where r_k is the mean reward when arm k is pulled. Inherited from the previous MDP setting, we also assume that this reward is sampled from a 1-subgaussian distribution with mean r_k. In each episode t ∈ [T], based on the initial input and the history, an algorithm A chooses an arm a_t to pull and obtains a reward y_{a_t} ∼ D_{a_t}. Similarly, assume r is sampled from a distribution D supported on Ω; we want to minimize the Bayesian regret Reg_T(D, A) ≜ E_{r∼D} Reg_T(r, A), where Reg_T(r, A) ≜ E_A[Σ_{t=1}^T (r* − r_{a_t})]. Here r* = max_k r_k is the optimal mean reward. If we define S_T^k(r, A) = Σ_{t=1}^T I[a_t = k] as the random variable counting how many times A pulls arm k in T episodes, and ∆_k = r* − r_k as the suboptimality gap, we can decompose the regret as Reg_T(r, A) = Σ_{k=1}^K ∆_k E[S_T^k(r, A)]. This identity is used frequently in the subsequent proof.
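The decomposition Reg_T(r, A) = Σ_k ∆_k E[S_T^k(r, A)] can be verified numerically for any realized pull profile; the helpers below (names are ours) compare it with the directly computed regret:

```python
import numpy as np

def regret_from_gaps(r, pulls):
    """Gap-weighted form: sum_k Delta_k * S_k with Delta_k = r* - r_k."""
    r, pulls = np.asarray(r, float), np.asarray(pulls, float)
    return float(((r.max() - r) * pulls).sum())

def regret_direct(r, pulls):
    """Direct form: T * r* minus the total mean reward actually collected."""
    r, pulls = np.asarray(r, float), np.asarray(pulls, float)
    return float(pulls.sum() * r.max() - (pulls * r).sum())
```

The two forms agree for every pull profile, which is why the proof can work arm by arm.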

B.2 PROOF

We first specify the choice of support, constant, and algorithms. Without loss of generality, we set Ω = [0, 1]^K, which is quite common and general in bandit tasks. Let c_0 = 1/16, and let Â be the asymptotically optimal UCB algorithm defined in Algorithm 3. On the other hand, let Ã be uniformly optimal, i.e. Ã(D, T) = argmin_A Reg_T(D, A). To prove the theorem, we only need to show that

lim inf_{T→∞} Reg_T(D, Ã(D, T)) / Reg_T(D, Â(T)) ≥ c_0,   (2)

which is the major result (point (2)). Later, we show that lim_{T→∞} Reg_T(D, Ã(D, T)) = +∞ in Lemma 9, which proves point (1) of the theorem. When T is fixed, we abbreviate Ã(D, T) and Â(T) as Ã and Â.

Â enjoys a well-known instance-dependent regret upper bound, restated below.

Lemma 4 (Theorem 8.1 in Lattimore & Szepesvári (2020)). For all r ∈ Ω and all k ∈ [K],

E[S_T^k(r, Â)] ≤ min{T, inf_{ε∈(0,∆_k)} (1 + 5/ε² + 2(log(T log² T + 1) + √(π log(T log² T + 1)) + 1)/(∆_k − ε)²)}.

By Hölder's inequality, we immediately have ∆_k E[S_T^k(r, Â)] ≤ min{∆_k T, ∆_k + u(T) log T/∆_k} ≜ s(∆_k, T), where the coefficient function

u(T) = sup_{t≥T} (1/log t)(5^{1/3} + (2(log(t log² t + 1) + √(π log(t log² t + 1)) + 1))^{1/3})³

is non-increasing with lim_{T→∞} u(T) = 2. For simplicity, we define e(T) ≜ √(u(T) log T/T) and Ω_T^k ≜ {r ∈ Ω : ∆_k ≥ T^{-p_0}}, where p_0 ∈ (0, 1/2) is a universal constant to be specified later. Further define Λ_k^ϵ = {r ∈ Ω : r_k + ∆_k(1 + ϵ) < 1}. Using Eq. 1, it suffices to show that for all k ∈ [K],

lim inf_{T→∞} (∫_Ω p(r)∆_k E[S_T^k(r, Ã)]dr) / (∫_Ω p(r)∆_k E[S_T^k(r, Â)]dr) ≥ c_0.   (3)

Fix k ∈ [K] and a sufficiently small ϵ > 0. We decompose the left-hand side of Inq. 3 as

(∫_Ω p(r)∆_k E[S_T^k(r, Ã)]dr) / (∫_Ω p(r)∆_k E[S_T^k(r, Â)]dr) ≥ [(∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)∆_k E[S_T^k(r, Ã)]dr) / (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr)] · [(∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr) / (∫_{Ω_T^k} p(r)s(∆_k, T)dr)] · [(∫_{Ω_T^k} p(r)s(∆_k, T)dr) / (∫_Ω p(r)s(∆_k, T)dr)] · [(∫_Ω p(r)s(∆_k, T)dr) / (∫_Ω p(r)∆_k E[S_T^k(r, Â)]dr)].

Sequentially denote the four factors on the right-hand side as M_1, ..., M_4. M_4 is lower bounded by 1 based on Lemma 4.
The remaining three factors are bounded by the following three lemmas.

Lemma 5. For all k ∈ [K], lim inf_{T→∞} (∫_{Ω_T^k} p(r)s(∆_k, T)dr) / (∫_Ω p(r)s(∆_k, T)dr) ≥ 2p_0.

Lemma 5 lower bounds M_3, saying that the influence of instances in Ω \ Ω_T^k is negligible. The proof of this lemma needs some theory of the Lebesgue integral and the Radon transform, which is introduced in Section B.3.

Lemma 6. For any k ∈ [K] and ϵ ∈ (0, 1), lim_{T→∞} (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr) / (∫_{Ω_T^k} p(r)s(∆_k, T)dr) = 1.

Lemma 6 implies that M_2 → 1, allowing us to focus on the integral over the calibrated smaller set Ω_T^k ∩ Λ_k^ϵ, on which we can use information theory to control E[S_T^k(r, Ã)]. The following lemma is the main step of our proof; it uses the optimality of Ã to analyze the global structure of Reg_T(r, Ã) for r ∈ Ω_T^k ∩ Λ_k^ϵ, and lower bounds the factor M_1.

Lemma 7. For any ϵ ∈ (0, 1), we have lim inf_{T→∞} (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)∆_k E[S_T^k(r, Ã)]dr) / (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr) ≥ (1/2 − 2p_0)/(1 + ϵ)².

Combined together,

lim inf_{T→∞} (∫_Ω p(r)∆_k E[S_T^k(r, Ã)]dr) / (∫_Ω p(r)∆_k E[S_T^k(r, Â)]dr) ≥ 2p_0(1/2 − 2p_0)/(1 + ϵ)².

The proof of Inq. 2 is finished by selecting p_0 = 1/8 and letting ϵ → 0.

Algorithm 3 Â: Asymptotically Optimal UCB
1: Input: Ω = [0, 1]^K, total episodes T
2: for step t = 1, ..., K do
3:   Choose arm a_t = t, and obtain reward y_t
4:   Set R̂_t = y_t, Ŝ_t = 1
5: for step t = K + 1, ..., T do
6:   f(t) = 1 + t log²(t)
7:   Choose a_t = argmax_k (R̂_k/Ŝ_k + √(2 log f(t)/Ŝ_k)), and obtain reward y_t
8:   Set R̂_{a_t} = R̂_{a_t} + y_t, Ŝ_{a_t} = Ŝ_{a_t} + 1

B.3 PROOF OF LEMMAS

Before proving the lemmas above, we need some theory of the Lebesgue integral and the Radon transform. In R^K with K ≥ 2, when a function is Riemann integrable it is also Lebesgue integrable, and the two integrals are equal. Since we assume p(r) ∈ C(Ω), the integral always exists; below we always consider the Lebesgue integral. For a compact measurable set S, define L^p(S) as the space of all measurable functions on S with the standard p-norm. Since the p.d.f. of D is continuous on the compact set Ω and is positive, there exist L, U ∈ R_+ such that p(r) ∈ [L, U] for all r ∈ Ω. Denote T_k = {r ∈ Ω : r_k = max(r)}. Clearly, ∫_Ω f(r)dr = Σ_{k∈[K]} ∫_{T_k} f(r)dr since m(T_i ∩ T_j) = 0 for all T_i ≠ T_j, where m(•) is the Lebesgue measure on R^K. If we define P_{t,γ} = {r ∈ Ω : γ · r = t} and γ_k^i = e_i − e_k for i ≠ k, then P_{t,γ_k^i} ∩ T_i = {r ∈ T_i : ∆_k = t}. According to Radon transform theory, since T_i is compact and p(r) ∈ C(T_i), ρ_k^i(t) ≜ ∫_{P_{t,γ_k^i} ∩ T_i} p(r)dr is also continuous in t ∈ [0, 1] for all k ≠ i ∈ [K]. Here the integration is performed in the corresponding R^{K−1} space, i.e. on the plane P_{t,γ_k^i}, and when m_{K−1}(P_{t,γ_k^i} ∩ T_i) > 0, ρ_k^i(t)/m_{K−1}(P_{t,γ_k^i} ∩ T_i) ∈ [L, U]. We further define q_k(t) = Σ_{i≠k} ρ_k^i(t), which also belongs to C([0, 1]). The continuity of q_k(t) yields the following identity: for any f ∈ L¹(Ω) that depends only on ∆_k, i.e. f(r) = f(∆_k),

∫_Ω p(r)f(r)dr = Σ_{i∈[K]} ∫_{T_i} p(r)f(r_i − r_k)dr = (√2/2) Σ_{i∈[K]} ∫_{[0,1]} f(t)ρ_k^i(t)dt = (√2/2) ∫_{[0,1]} f(t)q_k(t)dt.   (6)

Here the second equality follows from Fubini's theorem, which lets us compute the integral by iterating lower-dimensional integrals in any order.
The factor √2/2 arises because the traditional Radon transform requires ∥γ∥₂ = 1, while here ∥γ_k^i∥₂ = ∥e_i − e_k∥₂ = √2.
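As a concrete illustration of Algorithm 3 (not part of the proof), its index rule can be simulated as below; the function name and the noiseless-reward simplification are ours:

```python
import math

def ucb_pulls(means, T):
    """Run Algorithm 3's index rule for T steps with deterministic rewards
    equal to the arm means: pull each arm once, then pull
    argmax_k (mean_k + sqrt(2 log f(t) / pulls_k)) with f(t) = 1 + t log^2 t."""
    K = len(means)
    sums, pulls = [0.0] * K, [0] * K
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                          # initialization: pull each arm once
        else:
            f_t = 1 + t * math.log(t) ** 2
            a = max(range(K), key=lambda k: sums[k] / pulls[k]
                    + math.sqrt(2 * math.log(f_t) / pulls[k]))
        sums[a] += means[a]                    # noiseless reward draw (sketch)
        pulls[a] += 1
    return pulls
```

In line with Lemma 4, suboptimal arms end up with only a logarithmic share of the pulls.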

B.3.1 PROOF OF LEMMA 5

Recall that s(∆, T) = min{∆T, ∆ + u(T) log T/∆}, and ∆T < ∆ + u(T) log T/∆ if and only if ∆ < e(T), where e(T) = √(u(T) log T/T). When e(T) ≤ T^{-p_0}, Eq. 6 implies that

lim inf_{T→∞} (∫_{Ω_T^k} p(r)s(∆_k, T)dr) / (∫_Ω p(r)s(∆_k, T)dr) = lim inf_{T→∞} (∫_Ω p(r)s(∆_k, T)I[∆_k ≥ T^{-p_0}]dr) / (∫_Ω p(r)s(∆_k, T)dr) = lim inf_{T→∞} (∫_{[0,1]} q_k(∆)s(∆, T)I[∆ ≥ T^{-p_0}]d∆) / (∫_{[0,1]} q_k(∆)s(∆, T)d∆).

To prove the lemma, it suffices to show that

lim sup_{T→∞} (∫_{[0,1]} q_k(∆)s(∆, T)I[∆ ≤ T^{-p_0}]d∆) / (∫_{[0,1]} q_k(∆)s(∆, T)I[∆ ≥ T^{-p_0}]d∆) ≤ (1/2 − p_0)/p_0.   (7)

Define E_T = {x ∈ [0, 1] : x ≤ e(T)} and F_T = {x ∈ [0, 1] : x ≤ T^{-p_0}}. Then

lim_{T→∞} (∫_{E_T} q_k(∆)s(∆, T)d∆) / (∫_{F_T\E_T} q_k(∆)s(∆, T)d∆) = lim_{T→∞} (∫_{E_T} q_k(∆)∆T d∆) / (∫_{F_T\E_T} q_k(∆)(∆ + u(T) log T/∆)d∆)   (8)
≤ lim_{T→∞} (1/e²(T)) (∫_{E_T} q_k(∆)∆d∆) / (∫_{F_T\E_T} q_k(∆)(1/∆)d∆).   (9)

We have shown that ρ_k^i(t)/m_{K−1}(P_{t,γ_k^i} ∩ T_i) ∈ [L, U]. With some calculation, m(t) ≜ m_{K−1}(P_{t,γ_k^i} ∩ T_i) = (1 − t)^{K−1}/(K − 1). Therefore q_k(t) ∈ [m(t)(K − 1)L, m(t)(K − 1)U], and q_k(0) = lim_{t→0+} q_k(t) ≥ L > 0. By continuity, for small enough ε_1 > 0 there exists δ_1 > 0 such that q_k(t) ∈ [(1 − ε_1)q_k(0), (1 + ε_1)q_k(0)] for all t ∈ [0, δ_1]. As a result,

lim_{T→∞} (1/e²(T)) (∫_{E_T} q_k(∆)∆d∆) / (∫_{F_T\E_T} q_k(∆)(1/∆)d∆) ≤ lim_{T→∞} ((1 + ε_1)/((1 − ε_1)e²(T))) (∫_{E_T} ∆d∆) / (∫_{F_T\E_T} (1/∆)d∆) = lim_{T→∞} ((1 + ε_1)/(2(1 − ε_1))) · 1/((1/2 − p_0) log T − (1/2) log(u(T) log T)) = 0,   (10)

using ∫_{E_T} ∆d∆ = e²(T)/2 and ∫_{F_T\E_T} (1/∆)d∆ = log T^{-p_0} − log e(T) = (1/2 − p_0) log T − (1/2) log(u(T) log T). This shows that the limit in (8) exists and equals 0. Similarly, define G_T = {x ∈ [0, 1] : x ≤ 1/log T}; then F_T, G_T ⊂ [0, δ_1] for sufficiently large T.
This implies that

lim sup_{T→∞} (∫_{[0,1]} q_k(∆)s(∆, T)I[∆ ≤ T^{-p_0}]d∆) / (∫_{[0,1]} q_k(∆)s(∆, T)I[∆ ≥ T^{-p_0}]d∆)
= lim sup_{T→∞} (∫_{F_T} q_k(∆)s(∆, T)d∆) / (∫_{[0,1]\F_T} q_k(∆)s(∆, T)d∆)
= lim sup_{T→∞} (∫_{F_T\E_T} q_k(∆)(∆ + u(T) log T/∆)d∆) / (∫_{[0,1]\F_T} q_k(∆)(∆ + u(T) log T/∆)d∆)
≤ lim sup_{T→∞} (∫_{F_T\E_T} q_k(∆)(1/∆)d∆) / (∫_{G_T\F_T} q_k(∆)(1/∆)d∆)
≤ ((1 + ε_1)/(1 − ε_1)) lim sup_{T→∞} (∫_{F_T\E_T} (1/∆)d∆) / (∫_{G_T\F_T} (1/∆)d∆)
= ((1 + ε_1)/(1 − ε_1)) lim sup_{T→∞} (−log e(T) − p_0 log T)/(p_0 log T − log log T)
= ((1 + ε_1)/(1 − ε_1)) · (1/2 − p_0)/p_0.

Here the second equality comes from Inq. 10. The proof of Inq. 7 is finished by letting ε_1 → 0.

B.3.2 PROOF OF LEMMA 6

Recall that Λ_k^ϵ is defined as {r ∈ Ω : r_k + ∆_k(1 + ϵ) < 1}. For all t ∈ [0, 1) and i ≠ k,

P_{t,γ_k^i} ∩ T_i ∩ Λ_k^ϵ = {r ∈ P_{t,γ_k^i} : r_i = max_j r_j, r ∈ Λ_k^ϵ} = {r ∈ Ω : r_i = max_j r_j, r_i − r_k = t, r_i + ϵt < 1} = {r ∈ P_{t,γ_k^i} ∩ T_i : r_i < 1 − ϵt}.

This implies that

m_{K−1}(P_{t,γ_k^i} ∩ T_i \ Λ_k^ϵ) = m_{K−1}({r ∈ P_{t,γ_k^i} ∩ T_i : r_i ∈ [1 − ϵt, 1]}) = (1 − (1 − ϵt)^{K−1})/(K − 1) if t ≤ 1/(1 + ϵ), and (1 − t)^{K−1}/(K − 1) otherwise.

Notice that Λ_k^ϵ is open, T_i \ Λ_k^ϵ is compact, and (S_1 ∩ S_2) \ S_3 = S_1 ∩ (S_2 \ S_3). Hence ρ̄_k^i(t) ≜ ∫_{P_{t,γ_k^i} ∩ (T_i\Λ_k^ϵ)} p(r)dr is continuous in t. For all i ≠ k (define O = [T^{-p_0}, 1]),

lim_{T→∞} (∫_{Ω_T^k ∩ T_i \ Λ_k^ϵ} p(r)s(∆_k, T)dr) / (∫_{Ω_T^k ∩ T_i} p(r)s(∆_k, T)dr)
= lim_{T→∞} (∫_O s(∆, T)ρ̄_k^i(∆)d∆) / (∫_O s(∆, T)ρ_k^i(∆)d∆)
= lim_{T→∞} (∫_O (1/∆)ρ̄_k^i(∆)d∆) / (∫_O (1/∆)ρ_k^i(∆)d∆)
≤ lim_{T→∞} (U/L) · (∫_{[T^{-p_0}, 1/(1+ϵ)]} (1 − (1 − ϵ∆)^{K−1})/∆ d∆ + ∫_{[1/(1+ϵ), 1]} (1 − ∆)^{K−1}/∆ d∆) / (∫_O (1 − ∆)^{K−1}/∆ d∆)
≤ lim_{T→∞} (U/L) · (∫_{[T^{-p_0}, 1/(1+ϵ)]} (K − 1)ϵ d∆ + ∫_{[1/(1+ϵ), 1]} (1/∆)d∆) / (p_0 log T − (K − 1)(1 − T^{-p_0}))
≤ lim_{T→∞} (U/L) · ((K − 1)ϵ + log(1 + ϵ)) / (p_0 log T − (K − 1)(1 − T^{-p_0})) = 0.

Here the first equality is based on Eq. 6, and the second-to-last inequality comes from Bernoulli's inequality.

B.3.3 PROOF OF LEMMA 7

To prove the lemma, we first restate an important lemma that lower bounds E[S_T^k(r, A)].

Lemma 8 (Lemma 16.3 in Lattimore & Szepesvári (2020)). Let r, r′ ∈ Ω be two instances that differ only in arm k ∈ [K], where ∆_k > 0 in r and k is uniquely optimal in r′. Then for any algorithm A and any T,

E[S_T^k(r, A)] ≥ (2/(r_k − r′_k)²)(log(min{r′_k − r_k − ∆_k, ∆_k}/4) + log T − log(Reg_T(r, A) + Reg_T(r′, A))).

For fixed ϵ and r, we set r′_k = r_k + (1 + ϵ)∆_k, so that r′ ∈ Ω for r ∈ Λ_k^ϵ. On the other hand, when T^{-p_0} > e(T), for all r ∈ Ω_T^k we have s(∆_k, T) = ∆_k + u(T) log T/∆_k. According to Lemma 8, for all r ∈ Ω_T^k ∩ Λ_k^ϵ,

∆_k E[S_T^k(r, Ã)] ≥ (2/(∆_k(1 + ϵ)²))(log(ϵ∆_k/4) + log T − log(Reg_T(r, Ã) + Reg_T(r′, Ã))).

Define I_T^k ≜ ∫_{Ω_T^k ∩ Λ_k^ϵ} (p(r)/∆_k)dr < ∞; then q_T^k(r) = p(r)/(∆_k I_T^k) is a p.d.f. on Ω_T^k ∩ Λ_k^ϵ. We already know that for any 1-subgaussian bandit instance, the regret of the UCB algorithm is bounded by 8√(KT log T) + 3Σ_{k=1}^K ∆_k ≤ 9√(KT log T). The optimality of Ã(D, T) = argmin_A Reg_T(D, A) implies that Reg_T(D, Ã) ≤ 9√(KT log T). By the concavity of log(•),

∫ q_T^k(r) log(Reg_T(r, Ã) + Reg_T(r′, Ã))dr ≤ log ∫ q_T^k(r)(Reg_T(r, Ã) + Reg_T(r′, Ã))dr
= log ∫_{Ω_T^k ∩ Λ_k^ϵ} (p(r)/∆_k)(Reg_T(r, Ã) + Reg_T(r′, Ã))dr − log ∫_{Ω_T^k ∩ Λ_k^ϵ} (p(r)/∆_k)dr
≤ p_0 log T + log ∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)(Reg_T(r, Ã) + Reg_T(r′, Ã))dr − log ∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)dr
≤ p_0 log T + log(Reg_T(D, Ã) + (U/L) Reg_T(D, Ã)) − log ∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)dr
≤ log((9(L + U)√K/L) · T^{1/2 + p_0} log T) − log ∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)dr.
Therefore, we have

lim inf_{T→∞} (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)∆_k E[S_T^k(r, Ã)]dr) / (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr)   (11)
≥ (2/(1 + ϵ)²) lim inf_{T→∞} (∫_{Ω_T^k ∩ Λ_k^ϵ} (p(r)/∆_k)(log(ϵ∆_k/4) + log T − log(Reg_T(r, Ã) + Reg_T(r′, Ã)))dr) / (∫_{Ω_T^k ∩ Λ_k^ϵ} (p(r)/∆_k)(∆_k² + u(T) log T)dr)   (12)
≥ (1/(1 + ϵ)²)(1 + lim inf_{T→∞} (1/(I_T^k log T)) ∫_{Ω_T^k ∩ Λ_k^ϵ} (p(r)/∆_k)(log(ϵ∆_k/4) − log(Reg_T(r, Ã) + Reg_T(r′, Ã)))dr)   (13)
≥ (1/(1 + ϵ)²)(1 − p_0 − lim sup_{T→∞} (1/log T) ∫_{Ω_T^k ∩ Λ_k^ϵ} q_T^k(r) log(Reg_T(r, Ã) + Reg_T(r′, Ã))dr)   (14)
≥ (1/(1 + ϵ)²)(1 − p_0 − lim sup_{T→∞} (1/log T)(log((9(L + U)√K/L) · T^{1/2 + p_0} log T) − log ∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)dr))   (15)
= (1/(1 + ϵ)²)(1/2 − 2p_0).

Here (14) holds because ∆_k ≥ T^{-p_0} for all r ∈ Ω_T^k, so the contribution of log(ϵ∆_k/4) after dividing by log T is at most p_0 in the limit.

B.3.4 PROOF OF PART 1 IN THEOREM 1

Lemma 9. lim_{T→∞} Reg_T(D, Ã(D, T)) = +∞.

Proof. In Lemma 7 we have already shown that

lim inf_{T→∞} (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)∆_k E[S_T^k(r, Ã)]dr) / (∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr) ≥ (1/2 − 2p_0)/(1 + ϵ)² > 0.

To prove the lemma, it suffices to show that lim_{T→∞} ∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr = +∞. Notice that Ω_T^k ∩ Λ_k^ϵ is non-decreasing in T; fixing any T_0, for T ≥ T_0 we have

lim_{T→∞} ∫_{Ω_T^k ∩ Λ_k^ϵ} p(r)s(∆_k, T)dr ≥ lim_{T→∞} ∫_{Ω_{T_0}^k ∩ Λ_k^ϵ} L (u(T) log T/∆_k)dr ≥ lim_{T→∞} ∫_{Ω_{T_0}^k ∩ Λ_k^ϵ} L u(T) log T dr ≥ lim_{T→∞} m(Ω_{T_0}^k ∩ Λ_k^ϵ) L u(T) log T = +∞.

C OMITTED DETAILS FOR THEOREM 2

C.1 SUBROUTINE FOR FINDING THE COVER SET

This subroutine is used to find a policy-value set Π such that Π covers a (1 − 3δ)-fraction of the MDPs in the sampled MDP set, i.e.

(1/N) Σ_{i=1}^N I[∃(π, v) ∈ Π s.t. (π, v) covers M_i] ≥ 1 − 3δ.

The subroutine is a greedy algorithm consisting of at most N steps. At the beginning, we calculate a matrix A, where A_{i,j} indicates whether (π_j, v_j) covers the MDP M_i. In each step t, we find a policy-value pair (π_{j_t}, v_{j_t}) with the maximum cover number over the uncovered MDP set T_{t−1}, and update the index sets U_t and T_t according to the selected index j_t. We output the policy-value set Π once the cover size Σ_{τ=1}^t n_τ ≥ (1 − 3δ)N.

Algorithm 4 Subroutine: Policy Cover Set
1: Input: v_{i,j} for i ∈ [N] and j ∈ [N]
2: Initialize: the policy index set U_0 = ∅, the MDP index set T_0 = [N]
3: Calculate the covering matrix A ∈ R^{N×N} where A_{i,j} = Cnd(v_{i,j}, v_{i,i}, v_{j,j})
4: for t = 1, ..., N do
5:   Calculate the policy index with maximum cover: j_t = argmax_{j ∈ [N]\U_{t−1}} Σ_{i∈T_{t−1}} A_{i,j}
6:   Set U_t = U_{t−1} ∪ {j_t}, T_t = T_{t−1} \ {i : A_{i,j_t} = 1}, and the cover size n_t = Σ_{i∈T_{t−1}} A_{i,j_t}
7:   if the cover size Σ_{τ=1}^t n_τ ≥ (1 − 3δ)N then
8:     Denote U_t as U, and break the loop
9: Output: the policy-value set Π = {(π_j, v_{j,j}) : j ∈ U}
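The greedy selection in Algorithm 4 is an instance of greedy set cover. A minimal sketch (our own data layout; `A[i, j] = 1` meaning pair j covers MDP i):

```python
import numpy as np

def greedy_policy_cover(A, delta):
    """Greedy cover as in Algorithm 4: repeatedly pick the column covering the
    most still-uncovered rows until a (1 - 3*delta) fraction of rows is covered."""
    N = A.shape[0]
    uncovered = set(range(N))
    U, covered = [], 0
    for _ in range(N):
        # pick the policy index with maximum cover over the remaining MDPs
        j_t = max((j for j in range(N) if j not in U),
                  key=lambda j: sum(A[i, j] for i in uncovered))
        newly = [i for i in uncovered if A[i, j_t]]
        U.append(j_t)
        uncovered -= set(newly)
        covered += len(newly)
        if covered >= (1 - 3 * delta) * N:
            break
    return U
```

The analysis in Lemma 12 bounds the number of greedy rounds, and hence the size of the returned index set U.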

C.2 PROOF FOR THE PRE-TRAINING STAGE

During the proof, we use Ω* to denote the MDP set satisfying the (1 − δ)-cover condition with minimum cardinality, i.e. Ω* = argmin_{P(Ω̃)≥1−δ} |Ω̃|. As defined in Algorithm 4, we use U to denote the index set of Π, which has the same cardinality as Π. We have the following lemma for the pre-training stage.

Lemma 10. The returned set Π satisfies

Pr_{M∼D}[∃(π, v) ∈ Π, {|V^π_{M,1}(s_1) − V*_{M,1}(s_1)| < 2ϵ} ∩ {|V^π_{M,1}(s_1) − v| < 2ϵ}] ≥ 1 − 6δ,

and its size is bounded by |Π| ≤ 2C(D) log(1/δ).

Proof. For each phase, we first define the high-probability events in Lemma 11, which we prove in Appendix C.2.1.

Lemma 11 (High-Probability Events). For every phase, with probability at least 1 − 4δ, the following events hold:
1. |(1/N) Σ_{i=1}^N I[M_i ∈ Ω*] − P(Ω*)| ≤ δ.
2. For all i ∈ [N], π_i is ϵ/2-optimal for M_i; for all i, j ∈ [N], |v_{i,j} − V^{π_j}_{M_i,1}(s_1)| ≤ ϵ/2.
3. For every index set U′ ⊂ [N], we have

Pr_{M∼D}[∃j ∈ U′, {|V^{π_j}_{M,1}(s_1) − V*_{M,1}(s_1)| < 2ϵ} ∩ {|V^{π_j}_{M,1}(s_1) − v_{j,j}| < 2ϵ}]   (18)
≥ (1/(N − |U′|)) Σ_{i∈[N]\U′} max_{j∈U′} A_{i,j} − 2δ − √(|U′| log(2N/δ)/(N − |U′|)).

According to Lemma 11, the three events defined there hold with probability at least 1 − 4δ. As stated in Lemma 12 (proved in Appendix C.2.2), conditioned on the first and second events, we have |U| ≤ 2C(D) log(1/δ) for each phase. Note that these lemmas also hold for the last phase, in which we return the policy-value set Π.

Lemma 12. For every phase, if the first and second events in Lemma 11 hold, then the size of the candidate set U satisfies |U| ≤ (C(D) + 1) log(1/δ).

Since the stopping condition is √(|U| log(2N/δ)/(N − |U|)) ≤ δ and we already know |U| ≤ 2C(D) log(1/δ), the stopping condition is satisfied for N ≥ 4C(D) log²(C(D)/δ)/δ². By the doubling trick, the number of total phases is bounded by log Õ(C(D)) and the sample complexity is bounded by 2N = Õ(C(D)/δ²). Therefore, by a union bound across all phases, with probability at least 1 − O(δ log(C(D)/δ)), Lemma 11 holds for all phases.

Using the third event on the returned set U, we know that

Pr_{M∼D}[∃j ∈ U, {|V^{π_j}_{M,1}(s_1) − V*_{M,1}(s_1)| < 2ϵ} ∩ {|V^{π_j}_{M,1}(s_1) − v_{j,j}| < 2ϵ}]   (21)
≥ (1/(N − |U|)) Σ_{i∈[N]\U} max_{j∈U} A_{i,j} − 2δ − √(|U| log(2N/δ)/(N − |U|))   (22)
≥ ((1 − 3δ)N − |U|)/(N − |U|) − 2δ − δ ≥ 1 − 3δN/(N − |U|) − 3δ ≥ 1 − 6δ.

Therefore, the claimed property of Π holds.

C.2.1 PROOF OF LEMMA 11

Proof. To prove this lemma, we sequentially bound the failure probability of each event.

Empirical probability of Ω*. Since M_i ∼ D, the expectation of the r.v. I[M_i ∈ Ω*] is exactly P(Ω*). By the Chernoff bound,

Pr[|(1/N) Σ_{i=1}^N I[M_i ∈ Ω*] − P(Ω*)| ≥ δ] ≤ exp{−2Nδ²} ≤ δ.

Therefore, the failure rate of the first event is bounded by δ.

Oracle error. For each i ∈ [N], the failure rate of O_l is at most δ/N; for all i, j ∈ [N], the failure rate of O_e is at most δ/N². Therefore, with probability at least 1 − 2δ, π_i is indeed ϵ/2-optimal for M_i, and |v_{i,j} − V^{π_j}_{M_i,1}(s_1)| < ϵ/2 for all i, j ∈ [N]. This implies that the failure rate of the second event is bounded by 2δ.

Covering probability of U′. We first fix an index set U′ ⊂ [N] and define U^c = [N] \ U′. For a policy-value pair (π, v) and an MDP M, we define the random variable

χ(π, v, M) ≜ Cnd(O_e(M, π, ϵ/2, log(N²/δ)), v, O_e(M, O_l(M, ϵ/2, log(N/δ)), ϵ/2, log(N²/δ))).

Notice that A_{i,j} is exactly an instance of the r.v. χ(π_j, v_j, M_i). For a fixed index set U′ ⊂ [N], each MDP M_i with index i ∈ [N]\U′ can be regarded as an i.i.d. sample from the distribution D. By the Chernoff bound, with probability at least 1 − δ/(2N)^{|U′|}, we have

(1/|U^c|) Σ_{i∈U^c} max_{j∈U′} A_{i,j} ≤ E_{M∼D}[max_{j∈U′} χ(π_j, v_j, M)] + √(|U′| log(2N/δ)/|U^c|).   (24)

On the other hand, we can use χ to control the probability that π_j is near-optimal for M. Specifically,

E_{M∼D}[max_{j∈U′} χ(π_j, v_j, M)] = Pr_{M∼D}[max_{j∈U′} χ(π_j, v_j, M) = 1] = Σ_{M∈Ω} P(M) · Pr[∃j ∈ U′, Cnd(v, v′, v_j) = 1 | M],   (25)

where π′ = O_l(M, ϵ/2, log(N/δ)), v = O_e(M, π_j, ϵ/2, log(N²/δ)), and v′ = O_e(M, π′, ϵ/2, log(N²/δ)).

Similar to the analysis of the second event of Lemma 11, with probability at least 1 − 2δ, for all j ∈ U′ the returned π′ is ϵ/2-optimal for M, and the estimated values v and v′ are ϵ/2-close to their means. Under this event,

I[∃j ∈ U′, Cnd(v, v′, v_j) = 1] ≤ I[∃j ∈ U′, {|V^{π_j}_{M,1}(s_1) − V*_{M,1}(s_1)| < 2ϵ} ∩ {|V^{π_j}_{M,1}(s_1) − v_{j,j}| < 2ϵ}].

Therefore, we can bound the right-hand side of Eqn. 25 as (where 2δ is the oracle failure probability)

E_{M∼D}[max_{j∈U′} χ(π_j, v_j, M)] ≤ 2δ + Pr_{M∼D}[∃j ∈ U′, {|V^{π_j}_{M,1}(s_1) − V*_{M,1}(s_1)| < 2ϵ} ∩ {|V^{π_j}_{M,1}(s_1) − v_{j,j}| < 2ϵ}].   (31)

Combining Eqn. 24 and Eqn. 31, by a union bound over all index sets, with probability at least

1 − Σ_{l=1}^N (δ/(2N)^l) |{U′ ⊂ [N] : |U′| = l}| ≥ 1 − Σ_{l=1}^N (δ/(2N)^l) C(N, l) ≥ 1 − δ,

for all U′ ⊂ [N] we have

Pr_{M∼D}[∃j ∈ U′, {|V^{π_j}_{M,1}(s_1) − V*_{M,1}(s_1)| < 2ϵ} ∩ {|V^{π_j}_{M,1}(s_1) − v_{j,j}| < 2ϵ}] ≥ (1/(N − |U′|)) Σ_{i∈[N]\U′} max_{j∈U′} A_{i,j} − 2δ − √(|U′| log(2N/δ)/(N − |U′|)).   (35)

C.2.2 PROOF OF LEMMA 12

Proof. For each M ∈ Ω* and round t, let N_{M,t} denote the number of indices i ∈ T_t with M_i = M, and let Ĉ ≜ Σ_{M∈Ω*} N_{M,0}. Notice that U is generated by greedily selecting the best policy under the current remaining MDP set T_{t−1}, i.e. the policy that covers the most MDPs in T_{t−1}. On the other hand, since the second event in Lemma 11 holds, for M_i = M_j we have that Cnd(v_{i,j}, v_{i,i}, v_{j,j}) and Cnd(v_{j,i}, v_{i,i}, v_{j,j}) are true, and thus A_{i,j} = A_{j,i} = 1. Therefore, for each step t and each M ∈ Ω*, if M_i = M for some i ∈ T_{t−1}, then in this round we have Σ_{i′∈T_{t−1}} A_{i′,i} ≥ N_{M,t−1}. Since we choose policy π_{j_t} instead of π_i in step t according to the greedy strategy, we have n_t = Σ_{i′∈T_{t−1}} A_{i′,j_t} ≥ Σ_{i′∈T_{t−1}} A_{i′,i} ≥ N_{M,t−1} for all M ∈ Ω*. (If M no longer appears in the remaining set, the right-hand side is 0 and the inequality still holds.)

Summing over all M ∈ Ω*, we obtain

n_t ≥ (1/C(D)) Σ_{M∈Ω*} N_{M,t−1} = (1/C(D))(Σ_{M∈Ω*} N_{M,0} − Σ_{M∈Ω*}(N_{M,0} − N_{M,t−1})) ≥ (1/C(D))(Σ_{M∈Ω*} N_{M,0} − Σ_{M∈Ω}(N_{M,0} − N_{M,t})) = (1/C(D))(Ĉ − Σ_{τ=1}^t n_τ),

where the last inequality is because N_{M,t} is monotonically non-increasing in t, and the last equality is because Σ_{τ=1}^t n_τ is the number of MDPs covered in the first t rounds. This implies (noting n_0 = 0)

Ĉ − Σ_{τ=1}^t n_τ ≤ (C(D)/(C(D) + 1))(Ĉ − Σ_{τ=1}^{t−1} n_τ) ≤ ... ≤ (C(D)/(C(D) + 1))^t Ĉ,

which gives Σ_{τ=1}^t n_τ ≥ (1 − (C(D)/(C(D) + 1))^t) Ĉ. When t ≥ (C(D) + 1) log(1/δ), we have

Σ_{τ=1}^t n_τ ≥ (1 − exp(−t/(C(D) + 1))) Ĉ ≥ (1 − δ) Ĉ ≥ (1 − δ)(1 − 2δ)N ≥ (1 − 3δ)N,

where Ĉ ≥ (1 − 2δ)N follows from the first event of Lemma 11. Hence the stopping condition of Algorithm 4 is triggered after at most (C(D) + 1) log(1/δ) rounds, and upon breaking, the size of U satisfies |U| ≤ (C(D) + 1) log(1/δ).
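The geometric recursion at the heart of this argument can be checked numerically: iterating the equality case (Ĉ − S_t) = (C(D)/(C(D)+1))(Ĉ − S_{t−1}) reproduces the closed form, and t = (C(D)+1) log(1/δ) rounds suffice for (1 − δ)-coverage. A small sketch with our own names:

```python
import math

def covered_after(t, C, C_hat):
    """Iterate the equality case of the Lemma-12 recursion:
    remaining_t = remaining_{t-1} * C / (C + 1), starting from C_hat.
    Returns the covered mass after t greedy rounds in this worst case."""
    remaining = float(C_hat)
    for _ in range(t):
        remaining *= C / (C + 1.0)
    return C_hat - remaining
```

Since (C/(C+1))^t ≤ exp(−t/(C+1)), coverage reaches (1 − δ)Ĉ within (C+1) log(1/δ) rounds.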

C.3 PROOF OF THEOREM 2

In Lemma 10, we prove that with probability at least 1 − O(δ log(C(D)/δ)), the policy-value set Π returned by the pre-training stage covers M ∼ D with probability at least 1 − 6δ, i.e.

Pr_{M∼D}[∃(π, v) ∈ Π, {|V^π_{M,1}(s_1) − V*_{M,1}(s_1)| < 2ϵ} ∩ {|V^π_{M,1}(s_1) − v| < 2ϵ}] ≥ 1 − 6δ.

Note that this event happens with high probability. If it does not happen (w.p. O(δ log(C(D)/δ))), the regret can still be upper bounded by K, which leads to an additional term of K · O(δ log(C(D)/δ)) = O(√K log(KC(D))) in the final bound. This term is negligible compared with the dominant term in the regret. In the following analysis, we only discuss the case where the statement of Lemma 10 holds. We also assume that for the test MDP M*, there exists (π*, v*) ∈ Π with

{|V^{π*}_{M*,1}(s_1) − V*_{M*,1}(s_1)| < 2ϵ} ∩ {|V^{π*}_{M*,1}(s_1) − v*| < 2ϵ},   (38)

which happens with probability at least 1 − 6δ under the event defined in Lemma 10. We use L to denote the maximum epoch counter, which satisfies L ≤ |Π|.

Lemma 13 (Optimism). With probability at least 1 − δ/2, we have v_l ≥ V*_{M*,1}(s_1) − 2ϵ for all l ∈ [L].

Proof. To prove the lemma, we need to show that the optimal policy-value pair (π*, v*) for M* is never eliminated from the set Π_l with high probability. Conditioned on the sampled M*, for a fixed episode k ∈ [K], by Azuma's inequality we have

Pr[|(1/(k − k_0 + 1)) Σ_{τ=k_0}^k G_τ − V^{π*}_{M*,1}(s_1)| ≥ √(2 log(4K/δ)/(k − k_0 + 1))] ≤ δ/(2K).

By a union bound over all k ∈ [K], we know that

Pr[∃k ∈ [K], |(1/(k − k_0 + 1)) Σ_{τ=k_0}^k G_τ − V^{π*}_{M*,1}(s_1)| ≥ √(2 log(4K/δ)/(k − k_0 + 1))] ≤ δ/2.   (40)

By Inq. 38, we know that |V^{π*}_{M*,1}(s_1) − v*| < 4ϵ. Therefore,

Pr[∃k ∈ [K], |(1/(k − k_0 + 1)) Σ_{τ=k_0}^k G_τ − v*| ≥ √(2 log(4K/δ)/(k − k_0 + 1)) + 4ϵ] ≤ δ/2.

By the elimination condition defined in line 6 of Algorithm 1 in the fine-tuning stage, (π*, v*) is never eliminated from the set Π_l with probability at least 1 − δ/2. By the definition v_l = max_{(π,v)∈Π_l} v, we have v_l ≥ v* ≥ V*_{M*,1}(s_1) − 2ϵ.
Now we are ready to prove Theorem 2.

Proof. We use τ_l to denote the starting episode of epoch l. Without loss of generality, we set τ_{L+1} = K + 1. By Lemma 13, we know that v_l ≥ V*_{M*,1}(s_1) − 2ϵ with high probability. Under this event, we can decompose the value gap as follows:

Σ_{k=1}^K (V*_{M*,1}(s_1) − V^{π_k}_{M*,1}(s_1)) ≤ Σ_{l=1}^L Σ_{τ=τ_l}^{τ_{l+1}−1} (v_l − V^{π_l}_{M*,1}(s_1)) + 2ϵK ≤ Σ_{l=1}^L Σ_{τ=τ_l}^{τ_{l+1}−1} (v_l − G_τ) + Σ_{l=1}^L Σ_{τ=τ_l}^{τ_{l+1}−1} (G_τ − V^{π_l}_{M*,1}(s_1)) + 2ϵK.

For the first term, by the elimination condition defined in line 6 of Algorithm 1 in the fine-tuning stage, we have

Σ_{τ=τ_l}^{τ_{l+1}−1} (v_l − G_τ) ≤ √(2(τ_{l+1} − τ_l) log(4K/δ)) + 4(τ_{l+1} − τ_l)ϵ + 1.

For the second term, by Azuma's inequality and a union bound over all episodes, with probability at least 1 − δ/2,

Σ_{τ=τ_l}^{τ_{l+1}−1} (G_τ − V^{π_l}_{M*,1}(s_1)) ≤ √(2(τ_{l+1} − τ_l) log(4K/δ)).

Therefore, by the Cauchy-Schwarz inequality,

Σ_{k=1}^K (V*_{M*,1}(s_1) − V^{π_k}_{M*,1}(s_1)) ≤ O(√(K|Π| log(4K/δ)) + |Π|) ≤ O(√(KC(D) log(4K/δ) log(1/δ)) + C(D)).

The last inequality is due to |Π| ≤ 2C(D) log(1/δ) by Lemma 10. Finally, taking the expectation over all possible M*, we get

Reg(K) ≤ O(√(C(D)K log(4K/δ) log(1/δ)) + C(D)).

C.4 LOWER BOUND FOR THEOREM 2

In this subsection, we provide a lower bound showing that the regret upper bound in Theorem 2 is tight up to logarithmic factors. The lower bound is stated as follows.

Theorem 14. Suppose |Ω| ≥ 2 and K ≥ 5. For any pre-training and fine-tuning algorithm Alg, there exists a distribution D over the MDP class Ω such that the regret in the fine-tuning stage is at least Reg_K(D, Alg) ≥ Ω(min{√(C(D)K), K}).

This lower bound states that no matter how many samples are collected in the pre-training stage, the regret in the fine-tuning stage is at least Ω(√(C(D)K)), which indicates that our upper bound is near-optimal up to logarithmic factors. The proof is from an information-theoretic perspective and shares a similar idea with lower bound proofs for bandits (e.g.
Theorem 15.2 in Lattimore & Szepesvári (2020) and Theorem 5.1 in Auer et al. (2002)). We first construct the hard instance and then prove the theorem. Since a bandit problem can be regarded as an MDP with horizon 1, our hard instance is constructed using a distribution over bandit instances. For a fixed algorithm Alg, we define Ω to be a set of M multi-armed bandit instances. In bandit instance ν_i ∈ Ω, there are M arms, and the reward of arm j is a Gaussian distribution with unit variance and mean reward 1/2 + ∆I{i = j}. The parameter ∆ ∈ [0, 1/2] will be chosen later. We use ν_0 to denote the bandit instance where the reward of each arm is a Gaussian distribution with unit variance and mean reward 1/2. We set D to be the uniform distribution over the MDP set Ω.
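The construction can be written down explicitly; a sketch (our own helper name) producing the mean-reward vectors of the instances ν_1, ..., ν_M:

```python
import numpy as np

def hard_instance_means(M, Delta):
    """Row i holds the arm means of instance nu_i: 1/2 everywhere except
    arm i, which has mean 1/2 + Delta (with 0 <= Delta <= 1/2)."""
    means = np.full((M, M), 0.5)
    means[np.arange(M), np.arange(M)] += Delta
    return means
```

Each instance has a unique optimal arm, and any two instances differ in a single arm, which is what drives the information-theoretic argument below.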

We use r^k = ⟨r_1, ..., r_k⟩ to denote the sequence of rewards received up through step k ∈ [K]. Note that any randomized algorithm Alg is equivalent to an a priori random choice from the set of all deterministic algorithms, so we can formally regard Alg as a fixed function mapping the reward history r^{k−1} to the action a_k in each step k. This technique is not crucial for the proof but simplifies notation; it has also been applied in Auer et al. (2002). With a slight abuse of notation, we use Reg_K(ν_i, Alg) to denote the regret of algorithm Alg on bandit instance ν_i. Therefore, we have Reg_K(D, Alg) = (1/M) Σ_{i=1}^M Reg_K(ν_i, Alg). We use P_{ν_i} and E_{ν_i} to denote the probability and the expectation conditioned on the test-stage bandit instance being ν_i, respectively. We use D_KL(P_1, P_2) to denote the KL-divergence between the probability measures P_1 and P_2, and T_i(K) to denote the number of times arm i is pulled in the K steps. We first provide the following lemma to upper bound the difference between expectations under E_{ν_i} and E_{ν_0}.

Lemma 15. Let f : R^K → [0, K] be any function defined on the reward sequence r. Then for any arm i,

E_{ν_i}[f(r)] ≤ E_{ν_0}[f(r)] + (K∆/4)√(E_{ν_0}[T_i(K)]).

Proof. We upper bound the difference E_{ν_i}[f(r)] − E_{ν_0}[f(r)] by comparing the two probability measures:

E_{ν_i}[f(r)] − E_{ν_0}[f(r)] = ∫_r f(r)dP_{ν_i}(r) − ∫_r f(r)dP_{ν_0}(r) ≤ (K/2) ∫_r |dP_{ν_i}(r) − dP_{ν_0}(r)|.

Here ∫_r |dP_{ν_i}(r) − dP_{ν_0}(r)| is (twice) the TV-distance between the probability measures P_{ν_i}(r) and P_{ν_0}(r). Note that ν_i and ν_0 only differ in the expected reward of arm i. By Pinsker's inequality and Lemma 15.1 in Lattimore & Szepesvári (2020), we have

∫_r |dP_{ν_i}(r) − dP_{ν_0}(r)| ≤ √(2 D_KL(P_{ν_0}, P_{ν_i})), with D_KL(P_{ν_0}, P_{ν_i}) = (∆²/2) E_{ν_0}[T_i(K)].

The lemma can be proved by combining the above two inequalities. Now we can prove Theorem 14.

Proof.
By definition, we have

Reg_K(D, Alg) = (1/M) Σ_{i=1}^M Reg_K(ν_i, Alg)   (43)
= ∆ (1/M) Σ_{i=1}^M (K − E_{ν_i}[T_i(K)])   (44)
= ∆K − (∆/M) Σ_{i=1}^M E_{ν_i}[T_i(K)].

We apply Lemma 15 to T_i(K), which is a function of the reward sequence r since the actions of the algorithm Alg are determined by the past rewards. We have

E_{ν_i}[T_i(K)] ≤ E_{ν_0}[T_i(K)] + (K∆/4)√(E_{ν_0}[T_i(K)]).

Summing the above inequality over all ν_i ∈ Ω, we have

Σ_{i=1}^M E_{ν_i}[T_i(K)] ≤ Σ_{i=1}^M E_{ν_0}[T_i(K)] + Σ_{i=1}^M (K∆/4)√(E_{ν_0}[T_i(K)]) ≤ K + (K∆/4)√(MK),

where the second inequality is due to the Cauchy-Schwarz inequality and the fact that Σ_{ν_i∈Ω} E_{ν_0}[T_i(K)] = K. Plugging this inequality back into Eqn. 43, we have

Reg_K(D, Alg) ≥ ∆(K − K/M − (K∆/4)√(K/M)).

Since K ≥ 5, we know that C(D) has the same order as M. If K ≤ M, we know that √(C(D)K) = Ω(K); we choose ∆ = 1/2 and obtain Reg_K(D, Alg) ≥ Ω(K). Otherwise (K > M), one can verify that choosing ∆ = min{1/2, √(M/K)} yields Reg_K(D, Alg) ≥ Ω(√(MK)) = Ω(√(C(D)K)), which completes the proof.

D OMITTED PROOF FOR THEOREM 3

We decompose the value difference using the Bellman equation and bound the per-episode estimation error of the optimistic values by concentration and the Cauchy-Schwarz inequality; summing over episodes gives

Σ_{k=1}^K Σ_{i=1}^N (V^{π*}_{M_i,1}(s_1) − V^{π_k}_{M_i,1}(s_1)) ≤ O(√(NH²S²AK log(SAHNK)) + NHS²A).

Since π is uniformly selected from the policy set {π_k}_{k=1}^K, by Markov's inequality the following inequality holds with probability at least 5/6:

Σ_{i=1}^N (V^{π*}_{M_i,1}(s_1) − V^π_{M_i,1}(s_1)) ≤ (6/K) Σ_{k=1}^K Σ_{i=1}^N (V^{π*}_{M_i,1}(s_1) − V^{π_k}_{M_i,1}(s_1)).

With our choice of K = C_2 S²AH² log(SAH/ϵ)/ϵ², we have (1/N) Σ_{i=1}^N (V^{π*}_{M_i,1}(s_1) − V^π_{M_i,1}(s_1)) ≤ ϵ/3.

D.2 HIGH PROBABILITY BOUND

To obtain a bound that holds with probability at least 1 - δ, our idea is to first execute Algorithm 2 independently O(log(1/δ)) times, obtaining a policy set with cardinality O(log(1/δ)). We then evaluate the policies in this set on the sampled MDPs and return the policy with the maximum empirical value. The procedure is described in Algorithm 5. The proof follows the proof idea of Theorem 3, with the only difference being the bound on the empirical risk. We first prove the following lemma.
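In code, this repeat-and-select-best scheme looks roughly as follows. This is a sketch, not the paper's implementation: `base_learner` stands in for one run of Algorithm 2 (assumed to succeed with probability at least 5/6) and `evaluate` stands in for the rollout-based empirical value estimate; both names are hypothetical.

```python
import math

def boost_confidence(base_learner, evaluate, delta: float):
    """Amplify a constant-confidence learner to confidence 1 - O(delta).

    Each independent run of base_learner fails with probability <= 1/6,
    so the probability that all runs fail is <= (1/6)^{n_runs}, which is
    <= delta/2 for n_runs = ceil(log(2/delta) / log(6)).
    """
    n_runs = math.ceil(math.log(2 / delta) / math.log(6))
    candidates = [base_learner() for _ in range(n_runs)]
    # Return the candidate with the maximum empirical evaluation.
    return max(candidates, key=evaluate)
```

Evaluation noise costs an extra union bound over the $N_1 \cdot N$ policy-task pairs, which is why Algorithm 5 re-estimates each candidate's value with $N_2 = O(\log(N N_1/\delta)/\epsilon^2)$ rollouts before taking the maximum.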



We call the training stage "pre-training" when interactions with the test environment are allowed. Consider the bandit case with $M$ arms, where the optimal arm is arm $i$ in $\mathcal{M}_i$ and $\mathcal{D}$ is the uniform distribution over these $M$ MDPs. In this case $C(\mathcal{D}) = M$, and the dependence on $M$ is unavoidable in this hard instance, since the agent has to independently explore and test whether each arm is the optimal one. The success probability can be further improved to $1 - \delta$ by executing Algorithm 2 $\log(1/\delta)$ times and then returning a policy with maximum average value; see Appendix D.2 for a detailed discussion.



Throughout the paper, we use $[N]$ to denote the set $\{1, \cdots, N\}$ where $N \in \mathbb{N}_+$. For an event $\mathcal{E}$, let $\mathbb{I}[\mathcal{E}]$ be the indicator function of event $\mathcal{E}$, i.e. $\mathbb{I}[\mathcal{E}] = 1$ if and only if $\mathcal{E}$ is true. For any domain $\Omega$, we use $C(\Omega)$ to denote the set of continuous functions on $\Omega$. We use $O(\cdot)$ to denote the standard big-O notation, and $\tilde{O}(\cdot)$ to denote the big-O notation with logarithmic terms omitted.

PCE (Policy Collection-Elimination)

Pre-training Stage
1: Input: episode number $K$, policy learning oracle $\mathcal{O}_l$ and policy evaluation oracle $\mathcal{O}_e$
2: Initialize: $\delta = \epsilon = 1/\sqrt{K}$, the number of sampled MDPs $N = \log(1/\delta)/\delta^2$
3:

of the sampled MDPs, i.e. $N = 2N$

Test Stage
1: Input: the policy-value set $\Pi$ from the pre-training stage, episode number $K$
2: Initialize: the set $\Pi_1 = \Pi$, the phase counter

The basic goal in the pre-training phase is to find a policy-value set Π with bounded cardinality that covers D with high probability. The pre-training stage contains several phases. In each phase, we sample N MDPs from the distribution D and obtain an

$s(\Delta, T) \triangleq \min\{\Delta T,\ \Delta + \frac{u(T)}{\Delta}\log T\}$. We can upper bound $\mathrm{Reg}_T(r, \hat{\mathcal{A}}) \le \sum_{k=1}^K s(\Delta_k, T)$.

Decomposition of Inq. 2. For $k \in [K]$, define $\Omega_T^k = \{r \in \Omega : \Delta_k \ge T^{-p_0}\}$, where $p_0 \in (0, \frac{1}{2})$

(Pre-training algorithm). With probability at least $1 - O\big(\delta \log \frac{C(\mathcal{D})}{\delta}\big)$, the pre-training stage algorithm returns within $\log \tilde{O}(C(\mathcal{D}))$ phases, with total MDP sample complexity bounded by $\tilde{O}\big(\frac{C(\mathcal{D})}{\delta^2}\big)$.

For a certain fixed phase, we define $N_{\mathcal{M},t} = \sum_{i \in \mathcal{T}_t} \mathbb{I}[\mathcal{M}_i = \mathcal{M}]$ as the population of $\mathcal{M}$ in $\mathcal{T}_t$, and $\hat{C} \triangleq \sum_{i=1}^N \mathbb{I}[\mathcal{M}_i \in \Omega^*]$. We have $\hat{C} = \sum_{\mathcal{M} \in \Omega^*} N_{\mathcal{M},0}$. Thanks to the conditional events in Lemma 11, we have $\big|\frac{\hat{C}}{N} - \mathbb{P}(\Omega^*)\big| \le \delta$ and $\frac{\hat{C}}{N} \ge 1 - 2\delta$.
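The concentration step here ($|\hat{C}/N - \mathbb{P}(\Omega^*)| \le \delta$ from $N = \log(1/\delta)/\delta^2$ samples) is a direct Hoeffding bound and is easy to check numerically. The sketch below is illustrative only; the value of $\mathbb{P}(\Omega^*)$ is a made-up stand-in.

```python
import math
import random

def empirical_fraction(p_star: float, delta: float, seed: int = 0) -> float:
    """Estimate P(Omega*) from N = ceil(log(1/delta)/delta^2) i.i.d. draws.

    With this N, Hoeffding's inequality gives
    P(|C_hat/N - p_star| >= delta) <= 2*exp(-2*N*delta^2) = 2*delta^2.
    """
    rng = random.Random(seed)
    n = math.ceil(math.log(1 / delta) / delta ** 2)
    # C_hat counts how many sampled MDPs land in Omega*.
    c_hat = sum(rng.random() < p_star for _ in range(n))
    return c_hat / n
```

With, say, $\mathbb{P}(\Omega^*) = 0.9$ and $\delta = 0.1$, this draws $N = 231$ samples and the empirical fraction lands within $\delta$ of the truth except with probability about $2\delta^2 = 0.02$.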

By Pinsker's inequality and the divergence decomposition (Lemma 15.1 in Lattimore & Szepesvári (2020)), we have
$$\int_r \big(dP_{\nu_i}(r) - dP_{\nu_0}(r)\big)_+ \le \sqrt{\tfrac{1}{2} D_{KL}(P_{\nu_i}, P_{\nu_0})} = \sqrt{\tfrac{1}{2}\sum_{k=1}^K P_{\nu_0}(a_k = i)\, D_{KL}\big(\mathcal{N}(0,1)\,\|\,\mathcal{N}(\Delta,1)\big)} = \sqrt{\frac{\mathbb{E}_{\nu_0}[T_i(K)]\,\Delta^2}{4}}.$$
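As a quick numerical sanity check (not part of the proof), the Gaussian instance of Pinsker's inequality used here, $\mathrm{TV}(\mathcal{N}(0,1), \mathcal{N}(\Delta,1)) \le \sqrt{D_{KL}/2} = \Delta/2$, can be verified in closed form: for equal-variance Gaussians the TV-distance equals $2\Phi(\Delta/2) - 1 = \mathrm{erf}(\Delta/(2\sqrt{2}))$.

```python
import math

def gaussian_tv(delta: float) -> float:
    """TV distance between N(0,1) and N(delta,1): equals 2*Phi(delta/2) - 1."""
    return math.erf(delta / (2 * math.sqrt(2)))

def gaussian_kl(delta: float) -> float:
    """KL divergence D_KL(N(0,1) || N(delta,1)) = delta^2 / 2."""
    return delta ** 2 / 2

# Pinsker's inequality: TV <= sqrt(KL / 2) = delta / 2 for every mean gap.
for d in [0.1, 0.5, 1.0, 2.0, 4.0]:
    assert gaussian_tv(d) <= math.sqrt(gaussian_kl(d) / 2)
```

The bound is tight as $\Delta \to 0$ up to the constant $\sqrt{2/\pi}$, since $\mathrm{erf}(x) \approx 2x/\sqrt{\pi}$ near zero.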

$\cdots HK)\big/\max\{1, N_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h})\}$ $+\ O\big(\sqrt{NHK\log(KH)}\big)$.

Algorithm 5: OMERM with High Probability
1: Input: target accuracy $\epsilon > 0$, high probability parameter $\delta$
2: $N = C_1 \log\big(\mathcal{N}(\Pi, \epsilon/(12H), d)/\delta\big)/\epsilon^2$ for a constant $C_1 > 0$
3: $N_1 = \log(2/\delta)/\log(6)$, $N_2 = C_2 \log(N N_1/\delta)/\epsilon^2$ for a constant $C_2 > 0$
4: Sample $N$ tasks from the distribution $\mathcal{D}$, denoted as $\{\mathcal{M}_1, \mathcal{M}_2, \cdots, \mathcal{M}_N\}$
5: for $\xi = 1, 2, \cdots, N_1$ do
6:   Execute Algorithm 2 with target accuracy $\epsilon/2$ and task set $\{\mathcal{M}_1, \mathcal{M}_2, \cdots, \mathcal{M}_N\}$, and obtain a policy $\pi_\xi$
7:   for task index $i = 1, 2, \cdots, N$ do
8:     Execute $\pi_\xi$ on task $\mathcal{M}_i$ for $N_2$ times; denote the average total reward as $\bar{V}_{\xi,i}$
9:   Calculate the average value $\bar{V}_\xi = \frac{1}{N}\sum_{i=1}^N \bar{V}_{\xi,i}$
10: Output: the policy $\pi_{\xi^*}$ with $\xi^* = \arg\max_{\xi \in [N_1]} \bar{V}_\xi$

We have the following theorem for Algorithm 5.

Theorem 20. With probability at least $1 - \delta$, Algorithm 5 can output a policy $\pi$ satisfying $\mathbb{E}_{\mathcal{M}^* \sim \mathcal{D}}\big[V^{\pi^*(\mathcal{D})}_{\mathcal{M}^*} - V^{\pi}_{\mathcal{M}^*}\big] \le \epsilon$ with $O\big(\log\big(\mathcal{N}(\Pi, \epsilon/(12H), d)/\delta\big)/\epsilon^2\big)$ MDP instance samples during training. The number of episodes collected for each task is bounded by $O\big(H^2 S^2 A \log(SAH)\log(1/\delta)/\epsilon^2\big)$.

Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pp. 4528-4531. PMLR, 2021.

Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 167-176, 2018.

For each $i \in [N]$, $k \in [K]$ and $h \in [H]$, the Bellman equation gives
$$\begin{aligned}
&\bar{V}^{\pi_k}_{M_i,k,h}(s_{M_i,k,h}) - V^{\pi_k}_{M_i,h}(s_{M_i,k,h})\\
&= \Big(\bar{V}^{\pi_k}_{M_i,k,h}(s_{M_i,k,h}) - V^{\pi_k}_{M_i,h}(s_{M_i,k,h})\Big) - \Big(\bar{Q}^{\pi_k}_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h}) - Q^{\pi_k}_{M_i,h}(s_{M_i,k,h}, a_{M_i,k,h})\Big)\\
&\quad + \bar{Q}^{\pi_k}_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h}) - Q^{\pi_k}_{M_i,h}(s_{M_i,k,h}, a_{M_i,k,h})\\
&\le \Big(\bar{V}^{\pi_k}_{M_i,k,h}(s_{M_i,k,h}) - \bar{Q}^{\pi_k}_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h})\Big) - \Big(V^{\pi_k}_{M_i,h}(s_{M_i,k,h}) - Q^{\pi_k}_{M_i,h}(s_{M_i,k,h}, a_{M_i,k,h})\Big)\\
&\quad + \Big(\bar{R}_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h}) - r_{M_i,h}(s_{M_i,k,h}, a_{M_i,k,h})\Big) + \Big(\bar{P}_{M_i,k,h} - P_{M_i,h}\Big)\bar{V}^{\pi_k}_{M_i,k,h+1}(s_{M_i,k,h}, a_{M_i,k,h})\\
&\quad + \Big(P_{M_i,h}\big(\bar{V}^{\pi_k}_{M_i,k,h+1} - V^{\pi_k}_{M_i,h+1}\big)(s_{M_i,k,h}, a_{M_i,k,h}) - \big(\bar{V}^{\pi_k}_{M_i,k,h+1}(s_{M_i,k,h+1}) - V^{\pi_k}_{M_i,h+1}(s_{M_i,k,h+1})\big)\Big)\\
&\quad + b_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h}) + \bar{V}^{\pi_k}_{M_i,k,h+1}(s_{M_i,k,h+1}) - V^{\pi_k}_{M_i,h+1}(s_{M_i,k,h+1}).
\end{aligned}$$
Unrolling this recursion over $h \in [H]$ and summing over $k \in [K]$ and $i \in [N]$, the martingale-difference terms are bounded by $O\big(\sqrt{NHK\log(KH)}\big)$ under event $\Lambda_1$ (Lemma 18). By Lemma 17, the reward and transition estimation errors are dominated by the bonus:
$$\big(\bar{R}_{M_i,k,h} - r_{M_i,h}\big)(s_{M_i,k,h}, a_{M_i,k,h}) + \big(\bar{P}_{M_i,k,h} - P_{M_i,h}\big)\bar{V}^{\pi_k}_{M_i,k,h+1}(s_{M_i,k,h}, a_{M_i,k,h}) \le b_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h}).$$
It therefore remains to bound $\sum_{k=1}^K \sum_{i=1}^N \big(\bar{V}^{\pi_k}_{M_i,k,1}(s_1) - V^{\pi_k}_{M_i,1}(s_1)\big)$ by the summation of the bonuses $b_{M_i,k,h}(s_{M_i,k,h}, a_{M_i,k,h})$. By definition, we have


Therefore, we have $\mathrm{Reg}_K(\mathcal{D}, \mathrm{Alg}) \ge \Omega\big(\min\{\sqrt{C(\mathcal{D})K}, K\}\big)$. If $K \ge M$, we choose $\Delta = \sqrt{M/K}$. Since $M \ge 2$, we can also prove that $\mathrm{Reg}_K(\mathcal{D}, \mathrm{Alg}) \ge \Omega\big(\min\{\sqrt{C(\mathcal{D})K}, K\}\big)$.
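Under the stated choices of $\Delta$, the lower-bound expression from the proof can be evaluated numerically; the sketch below (illustrative only, not part of the proof) checks that the $K \ge M$ case indeed yields the $\Omega(\sqrt{MK}) = \Omega(\sqrt{C(\mathcal{D})K})$ scaling.

```python
import math

def regret_lower_bound(K: int, M: int) -> float:
    """Evaluate Delta * (K - K/M - (K*Delta/4) * sqrt(K/M)) at Delta = sqrt(M/K),
    the choice used in the K >= M case of the proof."""
    delta = math.sqrt(M / K)
    return delta * (K - K / M - (K * delta / 4) * math.sqrt(K / M))

# Symbolically the expression equals sqrt(M*K) * (3/4 - 1/M), which is at
# least sqrt(M*K)/4 whenever M >= 2, i.e. Omega(sqrt(M*K)).
for K, M in [(100, 4), (10_000, 10), (1_000_000, 100)]:
    assert regret_lower_bound(K, M) >= math.sqrt(M * K) / 4
```

The simplification uses $K\Delta = \sqrt{MK}$ and $(K\Delta/4)\sqrt{K/M} = K/4$, so the bound reads $\sqrt{MK}\,(3/4 - 1/M)$.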

D OMITTED PROOF FOR THEOREM 3

We define $\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{M \sim \mathcal{D}}\, V^{\pi}_{M,1}(s_1)$ and $\hat{\pi}^* = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^N V^{\pi}_{M_i,1}(s_1)$. For the returned policy $\pi$, we can decompose the value gap into the following terms:
$$\mathbb{E}_{M \sim \mathcal{D}}\big[V^{\pi^*}_{M,1}(s_1) - V^{\pi}_{M,1}(s_1)\big] = \Big(\mathbb{E}_{M \sim \mathcal{D}} V^{\pi^*}_{M,1}(s_1) - \tfrac{1}{N}\textstyle\sum_{i=1}^N V^{\pi^*}_{M_i,1}(s_1)\Big) + \Big(\tfrac{1}{N}\textstyle\sum_{i=1}^N V^{\pi}_{M_i,1}(s_1) - \mathbb{E}_{M \sim \mathcal{D}} V^{\pi}_{M,1}(s_1)\Big) + \tfrac{1}{N}\textstyle\sum_{i=1}^N \Big(V^{\hat{\pi}^*}_{M_i,1}(s_1) - V^{\pi}_{M_i,1}(s_1)\Big) + \tfrac{1}{N}\textstyle\sum_{i=1}^N \Big(V^{\pi^*}_{M_i,1}(s_1) - V^{\hat{\pi}^*}_{M_i,1}(s_1)\Big). \quad (46)$$
Note that the first and the second terms are generalization gaps for a given policy, which can be upper bounded by the Chernoff bound and a union bound. The third term is the value gap during the training phase. The last term is at most 0 by the optimality of $\hat{\pi}^*$.

Upper bounds on the first and the second terms. We bound these terms following the generalization technique. Let $d(\pi_1, \pi_2)$ denote the distance between policies, and let $\Pi_{\epsilon_0}$ be an $\epsilon_0$-cover of $\Pi$ under $d$; by the definition of the covering number, $|\Pi_{\epsilon_0}| = \mathcal{N}(\Pi, \epsilon_0, d)$. By the Chernoff bound and a union bound over the policy set $\Pi_{\epsilon_0}$, with probability $1 - \delta_1$, the deviation bound in Inq. 52 holds for every $\pi \in \Pi_{\epsilon_0}$; combining it with Inq. 50 and Inq. 51 extends the bound to every $\pi \in \Pi$. We set $\epsilon_0 = \frac{\epsilon}{12H}$ and $\delta_1 = 1/6$. Since $N = C_1 \log\big(\mathcal{N}(\Pi, \epsilon/(12H), d)\big)/\epsilon^2$ and $\pi^*, \pi \in \Pi$, we know that with probability at least 5/6 the bounds in Inq. 57 and Inq. 58 hold.

Upper bound on the third term. We have the following lemma, which is proved in the following subsections.

Lemma 16. With probability at least 5/6, Algorithm 2 can return a policy $\pi$ satisfying $\frac{1}{N}\sum_{i=1}^N \big(V^{\hat{\pi}^*}_{M_i,1}(s_1) - V^{\pi}_{M_i,1}(s_1)\big) \le \epsilon/3$, where $\hat{\pi}^*$ is the empirical maximizer, i.e. $\hat{\pi}^* = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^N V^{\pi}_{M_i,1}(s_1)$.

Plugging the results of Lemma 16, Inq. 57 and Inq. 58 back into Eq. 46, we know that with probability at least 2/3, $\mathbb{E}_{M \sim \mathcal{D}}\big[V^{\pi^*}_{M,1}(s_1) - V^{\pi}_{M,1}(s_1)\big] \le \epsilon$.
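The Chernoff-plus-union-bound step for the two generalization-gap terms can be sketched as follows. This is the standard argument written under the assumption that values lie in $[0, H]$ (any normalization is absorbed into the constant $C_1$); it is a sketch, not the paper's exact derivation.

```latex
% For a fixed policy \pi, the values V^{\pi}_{M_i,1}(s_1) are i.i.d. and bounded,
% so Hoeffding's (Chernoff's) inequality gives, for any deviation t > 0,
\Pr\left[\,\Big|\frac{1}{N}\sum_{i=1}^{N} V^{\pi}_{M_i,1}(s_1)
   - \mathbb{E}_{M\sim\mathcal{D}}\, V^{\pi}_{M,1}(s_1)\Big| \ge t\,\right]
   \le 2\exp\!\left(-\frac{2Nt^2}{H^2}\right).
% A union bound over the \mathcal{N}(\Pi,\epsilon_0,d) elements of the cover
% \Pi_{\epsilon_0} keeps the total failure probability below \delta_1 provided
N \;\ge\; \frac{H^2}{2t^2}\,
   \log\!\frac{2\,\mathcal{N}(\Pi,\epsilon_0,d)}{\delta_1}.
% Passing from an arbitrary \pi \in \Pi to its nearest cover element costs an
% additional O(H\epsilon_0) in value, which the choice \epsilon_0 = \epsilon/(12H)
% makes lower order.
```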

D.1 PROOF OF LEMMA 16

We first state the following high-probability events.

Lemma 17. With probability at least $1 - \frac{1}{2HK}$, the inequalities Inq. 59 and Inq. 60 hold for any $i \in [N]$ and any $k, h, s, a$.

Proof. According to Weissman et al. (2003), the $L_1$-deviation between the true distribution and the empirical distribution over $m$ distinct events estimated from $n$ samples satisfies
$$\Pr\big[\|\hat{P} - P\|_1 \ge \epsilon\big] \le (2^m - 2)\exp\big(-n\epsilon^2/2\big).$$
In our case where $m = S$, for any fixed $i, k, h, s, a$, we have
$$\big\|\bar{P}_{M_i,k,h}(\cdot \mid s, a) - P_{M_i,h}(\cdot \mid s, a)\big\|_1 \le \sqrt{\frac{2S\log(2/\delta)}{\max\{1, N_{M_i,k,h}(s,a)\}}}$$
with probability at least $1 - \delta$. Taking a union bound over all possible $i, k, h, s, a$, we know that Inq. 61 holds for all of them with probability at least $1 - NKHSA\delta$. We reach Inq. 59 by setting $\delta = \frac{1}{4NK^2H^2SA}$. For the reward estimation, we know that the reward is 1-subgaussian by definition. By Hoeffding's inequality,
$$\big|\bar{R}_{M_i,k,h}(s, a) - r_{M_i,h}(s, a)\big| \le \sqrt{\frac{2\log(2/\delta)}{\max\{1, N_{M_i,k,h}(s,a)\}}}$$
holds with probability at least $1 - \delta$ for any fixed $i, k, h, s, a$. Inq. 60 can be proved similarly by a union bound over all possible $i, k, h, s, a$ and an analogous choice of $\delta$.

Lemma 18. The following inequalities hold with high probability.

Proof. The right-hand sides of the two inequalities can be regarded as summations of martingale differences. Therefore, the inequalities hold by applying Azuma's inequality.

We use $\Lambda_1$ to denote the intersection of the events defined in the above lemmas. Now we prove the optimism of our algorithm under event $\Lambda_1$.

Lemma 19. Under event $\Lambda_1$, we have $\bar{Q}^{\pi}_{M,k,h}(s, a) \ge Q^{\pi}_{M,h}(s, a)$ for all $s, a, h$.

Proof. We prove the lemma by induction over $h$. Suppose $\bar{Q}^{\pi}_{M,k,h+1}(s, a) \ge Q^{\pi}_{M,h+1}(s, a)$ for all $s, a$; in the resulting chain of inequalities, the second inequality is due to Lemma 17, the third is derived from the definition of $b_{M_i,k,h}(s, a)$, and the last follows from the induction condition that $\bar{Q}^{\pi}_{M,k,h+1}(s, a) \ge Q^{\pi}_{M,h+1}(s, a)$. Therefore, for step $h$ we also have $\bar{Q}^{\pi}_{M,k,h}(s, a) \ge Q^{\pi}_{M,h}(s, a)$, and the lemma follows by induction.

Now we prove Lemma 16.

Proof. (Proof of Lemma 16) By Lemma 19 and the optimality of $\pi_k$, we have $V^{\hat{\pi}^*}_{M_i,1}(s_1) \le \bar{V}^{\pi_k}_{M_i,k,1}(s_1)$.

Lemma 21. With probability at least $1 - \delta_2$, Algorithm 5 can return a policy $\pi_{\xi^*}$ satisfying a bound on the empirical value gap $\frac{1}{N}\sum_{i=1}^N \big(V^{\hat{\pi}^*}_{M_i,1}(s_1) - V^{\pi_{\xi^*}}_{M_i,1}(s_1)\big)$, where $\hat{\pi}^*$ is the empirical maximizer, i.e. $\hat{\pi}^* = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^N V^{\pi}_{M_i,1}(s_1)$.

Proof.
By Lemma 16, for each $\xi \in [N_1]$, the empirical value-gap bound (Inq. 63) holds with probability at least 5/6. In Algorithm 5, we evaluate the policies on the sampled MDPs by executing each policy $N_2$ times. By Hoeffding's inequality and a union bound over all $\xi \in [N_1]$ and $i \in [N]$, with probability at least $1 - \delta_2/2$, all the estimates $\bar{V}_{\xi,i}$ concentrate around their means. We denote this event as $\Lambda_2$.

For each $\xi \in [N_1]$, we define $\eta_\xi$ as the event that $\frac{1}{N}\sum_{i=1}^N V^{\hat{\pi}^*}_{M_i,1}(s_1) - \bar{V}_\xi \le \frac{2\epsilon}{9}$. By Inq. 63, we have $\Pr[\eta_\xi] \ge 5/6$ under event $\Lambda_2$. Note that the events $\{\eta_\xi\}_{\xi=1}^{N_1}$ are independent of each other. Therefore, with probability at least $1 - (1/6)^{N_1} = 1 - \delta_2/2$, there exists $\xi_0$ such that the event $\eta_{\xi_0}$ happens; that is, there exists a policy $\pi_{\xi_0}$ such that $\frac{1}{N}\sum_{i=1}^N V^{\hat{\pi}^*}_{M_i,1}(s_1) - \bar{V}_{\xi_0} \le \frac{2\epsilon}{9}$. By definition, $\xi^* = \arg\max_{\xi \in [N_1]} \bar{V}_\xi$. Therefore, we have $\frac{1}{N}\sum_{i=1}^N V^{\hat{\pi}^*}_{M_i,1}(s_1) - \bar{V}_{\xi^*} \le \frac{2\epsilon}{9}$. Under the event $\Lambda_2$, the claimed bound on $\frac{1}{N}\sum_{i=1}^N \big(V^{\hat{\pi}^*}_{M_i,1}(s_1) - V^{\pi_{\xi^*}}_{M_i,1}(s_1)\big)$ then holds with probability at least $1 - \delta_2/2$.

Proof. (Proof of Theorem 20) The proof follows the proof idea of Theorem 3. We bound the empirical risk by Lemma 21. With our choice of $\delta_2 = \delta/3$, the empirical-risk bound (Inq. 64) holds with probability at least $1 - \delta/2$. Following the proof of Theorem 3, we can similarly show that the generalization-gap bounds (Inq. 65 and Inq. 66) hold with probability $1 - \delta/2$. Combining the results in Inq. 65, Inq. 66 and Inq. 64, we know that with probability at least $1 - \delta$, $\mathbb{E}_{M \sim \mathcal{D}}\big[V^{\pi^*}_{M,1}(s_1) - V^{\pi}_{M,1}(s_1)\big] \le \epsilon$.

