EFFICIENT REINFORCEMENT LEARNING IN FACTORED MDPS WITH APPLICATION TO CONSTRAINED RL

Abstract

Reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs) is studied. We propose an algorithm called FMDP-BF, whose regret is exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and which improves on the previous FMDP result of Osband & Van Roy (2014b) by a factor of $\sqrt{nH|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the factored state subspace, $H$ is the planning horizon and $n$ is the number of factored transitions. We also provide a lower bound, which shows the near-optimality of our algorithm w.r.t. the number of timesteps $T$, the horizon $H$ and the factored state-action subspace cardinality. Finally, as an application, we study a new formulation of constrained RL, RL with knapsack constraints (RLwK), and provide the first sample-efficient algorithm for it based on FMDP-BF.

1. INTRODUCTION

Reinforcement learning (RL) is concerned with sequential decision-making problems in which an agent interacts with a stochastic environment and aims to maximize its cumulative reward. The environment is usually modeled as a Markov decision process (MDP) whose transition kernel and reward function are unknown to the agent. A main challenge for the agent is efficient exploration in the MDP, so as to minimize its regret or the related sample complexity of exploration. The tabular case, in which almost no prior knowledge is assumed about the MDP dynamics, has been studied extensively. The regret or sample complexity bounds typically depend polynomially on the cardinalities of the state and action spaces (e.g., Strehl et al., 2009; Jaksch et al., 2010; Azar et al., 2017; Dann et al., 2017; Jin et al., 2018; Dong et al., 2019; Zanette & Brunskill, 2019). Moreover, matching lower bounds (e.g., Jaksch et al., 2010) imply that these results cannot be improved without additional assumptions. On the other hand, many RL tasks involve large state and action spaces, for which these regret bounds are still excessively large. In many practical scenarios, one can take advantage of specific structure in the MDP to develop more efficient algorithms. For example, in robotics, the state may be high-dimensional, but each subspace of the state may evolve independently of the others and depend only on a low-dimensional subspace of the previous state. Formally, such problems can be described as factored MDPs (Boutilier et al., 2000; Kearns & Koller, 1999; Guestrin et al., 2003). Most relevant to the present work is Osband & Van Roy (2014b), who proposed a posterior sampling algorithm for episodic factored MDPs.

2. PRELIMINARIES

The agent interacts with the environment for $K$ episodes, with policy $\pi_k = \{\pi_{k,h} : \mathcal{S} \to \mathcal{A}\}_{h \in [H]}$ determined before the $k$-th episode begins.
The agent's goal is to maximize its cumulative reward $\sum_{k=1}^{K}\sum_{h=1}^{H} r_{k,h}$ over $T = KH$ steps, or equivalently, to minimize the expected regret

$\mathrm{Reg}(K) \;\triangleq\; \sum_{k=1}^{K} \big[ V^*_1(s_{k,1}) - V^{\pi_k}_1(s_{k,1}) \big],$

where $s_{k,1}$ is the initial state of episode $k$.

2.1. FACTORED MDPS

A factored MDP is an MDP whose rewards and transitions exhibit certain conditional independence structures. We start with the formal definition of factored MDPs (Boutilier et al., 2000; Osband & Van Roy, 2014b; Xu & Tewari, 2020; Lu & Van Roy, 2019). Let $\mathcal{P}(\mathcal{X}, \mathcal{Y})$ denote the set of functions that map $x \in \mathcal{X}$ to a probability distribution on $\mathcal{Y}$.

Definition 1 (Factored set). Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ be a factored set. For any subset of indices $Z \subseteq \{1, 2, \dots, d\}$, we define the scope set $\mathcal{X}[Z] := \bigotimes_{i \in Z} \mathcal{X}_i$. Further, for any $x \in \mathcal{X}$, we define the scope variable $x[Z] \in \mathcal{X}[Z]$ to be the value of the variables $x_i \in \mathcal{X}_i$ with indices $i \in Z$. If $Z$ is a singleton, we write $x[i]$ for $x[\{i\}]$.

Definition 2 (Factored reward). The reward function class $\mathcal{R} \subset \mathcal{P}(\mathcal{X}, \mathbb{R})$ is factored over $\mathcal{S} \times \mathcal{A} = \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ with scopes $Z^R_1, \dots, Z^R_m$ if, for all $R \in \mathcal{R}$ and $x \in \mathcal{X}$, the reward decomposes as the average of $m$ factored rewards, i.e., $\mathbb{E}[R(x)] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}\big[R_i(x[Z^R_i])\big]$ with each $R_i \in \mathcal{P}(\mathcal{X}[Z^R_i], \mathbb{R})$.

Analogously, the transition function class $\mathcal{P} \subset \mathcal{P}(\mathcal{X}, \mathcal{S})$ is factored over $\mathcal{X}$ and $\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n$ with scopes $Z^P_1, \dots, Z^P_n$ if $P(s' \mid x) = \prod_{j=1}^{n} P_j\big(s'[j] \mid x[Z^P_j]\big)$. The scopes $\{Z^R_i\}_{i=1}^{m}$ and $\{Z^P_j\}_{j=1}^{n}$ for the reward and transition functions are assumed to be known to the agent.

An excellent example of a factored MDP is given by Osband & Van Roy (2014b): a large production line with $d$ machines in sequence, where machine $i$ has $S_i$ possible states. Over a single time step, each machine can only be influenced by its direct neighbors. For this problem, the scopes $Z^R_i$ and $Z^P_i$ of machine $i \in \{2, \dots, d-1\}$ can be defined as $\{i-1, i, i+1\}$, and the scopes of machine $1$ and machine $d$ are $\{1, 2\}$ and $\{d-1, d\}$ respectively. Another example comes from robotics: the transition dynamics of a robot's different parts (e.g., its legs and arms) may be relatively independent, in which case a factored transition can be defined for each part separately.

For notational simplicity, we use $\mathcal{X}[i:j]$ and $\mathcal{S}[i:j]$ to denote $\mathcal{X}[\cup_{k=i}^{j} Z_k]$ and $\bigotimes_{k=i}^{j} \mathcal{S}_k$ respectively. Similarly, we write $PV(s,a)$ as shorthand for $\sum_{s' \in \mathcal{S}} P(s' \mid s,a)V(s')$. A state-action pair can be represented as $(s,a)$ or $x$.
We also use $(s,a)[Z]$ to denote the corresponding $x[Z]$ for notational convenience. We mainly focus on the regime where the total number of timesteps $T = KH$ is the dominant quantity, and assume that $T \ge |\mathcal{X}_i| \ge H$ throughout the analysis.
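To make the scope notation of Definitions 1 and 2 concrete, the following sketch (in Python, with small hypothetical 0-indexed component spaces) implements the scope variable $x[Z]$ and shows why a scope set $\mathcal{X}[Z]$ can be much smaller than $\mathcal{X}$:

```python
from itertools import product

# A toy factored set X = X_1 x X_2 x X_3; each element x is a tuple.
# The scope variable x[Z] keeps only the coordinates with indices in Z
# (Definition 1); indices here are 0-based for convenience.
def scope(x, Z):
    """Return the scope variable x[Z] as a tuple, for an index set Z."""
    return tuple(x[i] for i in sorted(Z))

X1, X2, X3 = [0, 1], [0, 1, 2], [0, 1]
X = list(product(X1, X2, X3))          # the full factored set, |X| = 12

x = (1, 2, 0)
print(scope(x, {0, 1}))                # -> (1, 2)
print(scope(x, {2}))                   # -> (0,)

# |X[Z]| for Z = {0, 1} is |X_1| * |X_2| = 6, smaller than |X| = 12;
# with many factors and small scopes the gap becomes exponential,
# which is the source of the regret savings in factored MDPs.
print(len(set(scope(x, {0, 1}) for x in X)))   # -> 6
```

The counting at the end mirrors the statement that regret bounds below depend on $|\mathcal{X}[Z_i]|$ rather than on $|\mathcal{X}|$.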

3. RELATED WORK

Exploration in Reinforcement Learning  Recent years have witnessed a tremendous amount of work on provably efficient exploration in reinforcement learning, including tabular MDPs (e.g., Dann et al., 2017; Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019), linear RL (e.g., Jiang et al., 2017; Yang & Wang, 2019; Jin et al., 2020; Zanette et al., 2020), and RL with general function approximation (e.g., Osband & Van Roy, 2014a; Ayoub et al., 2020; Wang et al., 2020). For algorithms in the tabular setting, the regret bounds inevitably depend on the cardinality of the state-action space, which may be excessively large. Based on the concept of eluder dimension (Russo & Van Roy, 2013), many recent works propose efficient algorithms for RL with general function approximation (Osband & Van Roy, 2014a; Ayoub et al., 2020; Wang et al., 2020). Since the eluder dimension of the function class for factored MDPs is at most $O\big(\sum_{i=1}^{m} |\mathcal{X}[Z^R_i]| + \sum_{j=1}^{n} |\mathcal{X}[Z^P_j]|\big)$, it is possible to apply their algorithms and regret bounds to our setting, though a direct application leads to a loose regret bound. In contrast, our analysis builds on a decomposition theorem for factored Markov chains (Theorem 1), which yields a better regret by a factor of $\sqrt{n}$; the theorem is also of independent interest, with potential use in other problems on factored MDPs. Furthermore, we formulate the RLwK problem and provide a sample-efficient algorithm based on our FMDP algorithm.

Constrained MDP and knapsack bandits  The knapsack setting with hard constraints has been studied in bandits, with both sample-efficient and computationally efficient algorithms (Badanidiyuru et al., 2013; Agrawal et al., 2016). This setting may be viewed as a special case of RLwK with $H = 1$.
In constrained RL, one line of work focuses on soft constraints, where the constraints are satisfied in expectation or with high probability (Brantley et al., 2020; Zheng & Ratliff, 2020), or a violation bound is established (Efroni et al., 2020; Ding et al., 2020). RLwK requires stronger constraints that are satisfied almost surely during the execution of the agent. A more closely related setting is that of Brantley et al. (2020), who study a sample-efficient algorithm for an episodic knapsack setting with hard constraints over all $K$ episodes. In contrast, we require the constraints to be satisfied within each episode, which we believe better describes many real-world scenarios. The setting of Singh et al. (2020) is closer to ours, since they focus on "every-time" hard constraints, although they consider the non-episodic case.

4. MAIN RESULTS

In this section, we introduce our FMDP-BF algorithm, which uses empirical variances to construct Bernstein-type confidence bounds for value estimation. Besides FMDP-BF, we also propose a simpler algorithm called FMDP-CH with a slightly worse regret, which follows an idea similar to UCBVI-CH (Azar et al., 2017). Its algorithm and analysis are more concise and easier to understand; details are deferred to Section B.

4.1. ESTIMATION ERROR DECOMPOSITION

Our algorithm follows the principle of "optimism in the face of uncertainty". Like ORLC (Dann et al., 2019) and EULER (Zanette & Brunskill, 2019), our algorithm maintains both optimistic and pessimistic estimates of state values to obtain an improved regret bound. We use $\overline{V}_{k,h}$ and $\underline{V}_{k,h}$ to denote the optimistic and pessimistic estimates of $V^*_h$, respectively. To guarantee optimism, we add a confidence bonus to the estimated value function $\overline{V}_{k,h}$ at each step, so that $\overline{V}_{k,h}(s) \ge V^*_h(s)$ holds for all $k \in [K]$, $h \in [H]$ and $s \in \mathcal{S}$. Let $\hat{R}_{k,i}$ and $\hat{P}_{k,j}$ denote the estimates of the expected factored reward $R_i$ and the factored transition probability $P_j$ before episode $k$, respectively. By the definitions of the reward $R$ and the transition $P$, we use $\hat{R}_k \triangleq \frac{1}{m}\sum_{i=1}^{m} \hat{R}_{k,i}$ and $\hat{P}_k \triangleq \prod_{j=1}^{n} \hat{P}_{k,j}$ as the estimates of $R$ and $P$. Following the standard framework, this confidence bonus needs to tightly characterize the estimation error of the one-step backup $R(s,a) + PV^*_{h+1}(s,a)$; in other words, it should compensate for the estimation errors $(\hat{R}_k - R)(s,a)$ and $(\hat{P}_k - P)V^*_{h+1}(s,a)$, respectively. For the reward error $(\hat{R}_k - R)(s_{k,h}, a_{k,h})$, since the reward is defined as the average of $m$ factored rewards, it is straightforward to decompose the estimation error of $R(s,a)$ into the average of the estimation errors of the factored rewards. We therefore construct a confidence bonus for each factored reward $R_i$ separately: letting $CB^R_{k,Z^R_i}(s,a)$ be the bonus that compensates for the error $\hat{R}_{k,i} - R_i$, we set $CB^R_k(s,a) \triangleq \frac{1}{m}\sum_{i=1}^{m} CB^R_{k,Z^R_i}(s,a)$. For the transition error $(\hat{P}_k - P)V^*_{h+1}(s_{k,h}, a_{k,h})$, the main difficulty is that $\hat{P}_k$ is the product of $n$ estimated transition factors $\hat{P}_{k,i}$.
Consequently, the estimation error $(\hat{P}_k - P)V^*_{h+1}(s_{k,h}, a_{k,h})$ involves the product of the $n$ estimation errors of the factored transitions $\hat{P}_{k,i}$, which makes the analysis considerably more difficult. Fortunately, the following lemma addresses this challenge.

Lemma 4.1 (Informal). Let the transition function class $\mathcal{P} \subset \mathcal{P}(\mathcal{X}, \mathcal{S})$ be factored over $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ and $\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n$ with scopes $Z^P_1, \dots, Z^P_n$. For a given function $V : \mathcal{S} \to \mathbb{R}$, the one-step value estimation error $|(\hat{P}_k - P)V(s,a)|$ can be decomposed as

$\big|(\hat{P}_k - P)V(s,a)\big| \;\le\; \sum_{i=1}^{n} \Big|(\hat{P}_{k,i} - P_i)\prod_{j=1, j \ne i}^{n} P_j\, V(s,a)\Big| + \beta_{k,h}(s,a).$

Here $\beta_{k,h}(s,a)$, formally defined in Lemma E.1, collects higher-order terms that do not affect the order of the regret. This lemma allows us to decompose the estimation error $(\hat{P}_k - P)V^*_{h+1}(s_{k,h}, a_{k,h})$ into an additive form, so that we can construct a confidence bonus for each factored transition $P_j$ separately. Let $CB^P_{k,Z^P_j}(s,a)$ be the confidence bonus for the estimation error $(\hat{P}_{k,j} - P_j)\prod_{t=1, t \ne j}^{n} P_t\, V(s,a)$. Then $CB^P_k(s,a) \triangleq \sum_{j=1}^{n} CB^P_{k,Z^P_j}(s,a) + \eta_{k,h}(s,a)$, where $\eta_{k,h}(s,a)$ collects higher-order factors that will be given explicitly later. Finally, we define the overall confidence bonus as the sum of the bonuses for rewards and transitions: $CB_k(s,a) = CB^R_k(s,a) + CB^P_k(s,a)$.
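The decomposition behind Lemma 4.1 can be checked numerically on a toy two-factor example. The sketch below (all probability tables and the value function are hypothetical) verifies the exact telescoping identity underlying the lemma: the one-step error of the product estimate equals the sum of the single-factor errors plus a second-order remainder, which plays the role of $\beta$:

```python
from itertools import product

# True factored transitions P1, P2 and their estimates P1h, P2h
# (small probability tables; all numbers are hypothetical).
P1  = {0: 0.7, 1: 0.3}
P2  = {0: 0.50, 1: 0.40, 2: 0.10}
P1h = {0: 0.6, 1: 0.4}
P2h = {0: 0.55, 1: 0.35, 2: 0.10}

# An arbitrary (non-additive) value function over the product next-state space.
V = {(a, b): a * b + a + 2 * b for a, b in product(P1, P2)}

def EV(Q1, Q2):
    """(Q1 x Q2) V: expectation of V under the product distribution Q1 x Q2."""
    return sum(Q1[a] * Q2[b] * V[(a, b)] for (a, b) in V)

# Left-hand side: total one-step estimation error of the product estimate.
lhs = EV(P1h, P2h) - EV(P1, P2)

# First-order terms: each estimated factor swapped in one at a time,
# with every other factor kept at its true value.
first = (EV(P1h, P2) - EV(P1, P2)) + (EV(P1, P2h) - EV(P1, P2))

# Higher-order remainder: the product of the two error vectors.
d1 = {a: P1h[a] - P1[a] for a in P1}
d2 = {b: P2h[b] - P2[b] for b in P2}
beta = sum(d1[a] * d2[b] * V[(a, b)] for (a, b) in V)

assert abs(lhs - (first + beta)) < 1e-12   # exact telescoping identity
print(round(first, 6), round(beta, 6))     # -> 0.045 -0.005
```

The remainder is a product of two estimation errors, hence second order, which is why the $\beta$ terms do not affect the order of the regret.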

4.2. VARIANCE OF FACTORED MARKOV CHAINS

After the analysis in Section 4.1, the remaining problem is how to construct each confidence bonus. To this end, we first study the variance structure of factored Markov chains. In the Markov chain setting, the reward is a mapping from $\mathcal{S}$ to $\mathbb{R}$. Let $J_{t_1:t_2}(s)$ denote the total reward the agent obtains from step $t_1$ to step $t_2$ (inclusive), given that the agent starts from state $s$ at step $t_1$. $J_{t_1:t_2}$ is a random variable depending on the randomness of the trajectory from step $t_1$ to $t_2$ and of the stochastic rewards therein. Following this definition, $J_{1:H}$ is the total reward obtained during one episode. We use $s_t$ to denote the random state that the agent encounters at step $t$. We define $\omega^2_h(s) \triangleq \mathbb{E}\big[(J_{h:H}(s_h) - V_h(s_h))^2 \mid s_h = s\big]$ to be the variance of the total gain after step $h$, given that $s_h = s$, and $\sigma^2_{R,i}(s) \triangleq \mathbb{V}[R_i(\xi) \mid \xi = s]$ to be the variance of the $i$-th factored reward, given that the current state is $s$. Given the current state $s$, we define the variance of the next-state value function w.r.t. the $i$-th factored transition as

$\sigma^2_{P,i,h}(s) \;\triangleq\; \mathbb{E}_{s_{h+1}[1:i-1]}\Big[ \mathbb{V}_{s_{h+1}[i]}\Big[ \mathbb{E}_{s_{h+1}[i+1:n]}\big[ V_{h+1}(s_{h+1}) \big] \Big] \;\Big|\; s_h = s \Big].$

That is, we take the variance w.r.t. $s_{h+1}[i] \sim P_i(\cdot \mid s[Z^P_i])$ given fixed $s_{h+1}[1:i-1]$, and then take the expectation of this variance w.r.t. $s_{h+1}[1:i-1] \sim P_{[1:i-1]}$.

Theorem 1. For any horizon $h \in [H]$, we have $\omega^2_h(s) = \sum_{s'} P(s' \mid s)\,\omega^2_{h+1}(s') + \sum_{i=1}^{n} \sigma^2_{P,i,h}(s) + \frac{1}{m^2}\sum_{i=1}^{m} \sigma^2_{R,i}(s)$.

Theorem 1 generalizes the analysis of Munos & Moore (1999), which deals with non-factored MDPs and deterministic rewards. From this Bellman equation for variances, we can upper bound the expected sum of per-step variances.

Corollary 1.1. Suppose the agent follows policy $\pi$ during an episode, and let $w_h(s,a)$ denote the probability of entering state $s$ and taking action $a$ at step $h$. Then

$\sum_{h=1}^{H} \sum_{(s,a) \in \mathcal{X}} w_h(s,a) \Big( \sum_{i=1}^{n} \sigma^2_{P,i}(V^{\pi}_{h+1}, s, a) + \frac{1}{m^2}\sum_{i=1}^{m} \sigma^2_{R,i}(s,a) \Big) \;\le\; H^2,$

where $\sigma^2_{R,i}(s,a) \triangleq \mathbb{V}[r_i(\xi, \zeta) \mid \xi = s, \zeta = a]$ is the variance of the $i$-th factored reward given the current state-action pair $(s,a)$, and $\sigma^2_{P,i}(V^{\pi}_{h+1}, s, a) = \mathbb{E}_{s_{h+1}[1:i-1]}\big[ \mathbb{V}_{s_{h+1}[i]}\big[ \mathbb{E}_{s_{h+1}[i+1:n]}[V^{\pi}_{h+1}(s_{h+1})] \big] \mid s_h = s, a_h = a \big]$ is the variance of the $i$-th factored transition given the current state-action pair. This corollary makes it possible to construct variance-based confidence bonuses for each factored reward and each factored transition separately. Please refer to Section F.2 for the detailed proofs of Theorem 1 and Corollary 1.1.
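Theorem 1 can be verified by brute force in the special case $n = m = 1$ with deterministic rewards (the setting of Munos & Moore, 1999), where it reduces to $\omega^2_h(s) = \sum_{s'} P(s' \mid s)\,\omega^2_{h+1}(s') + \mathbb{V}_{s'}[V_{h+1}(s')]$. The following sketch, on a hypothetical two-state chain, enumerates all trajectories to compute $\omega^2_h$ exactly and checks the recursion:

```python
from itertools import product

S = [0, 1]
P = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P(s'|s), policy folded in
r = {0: 1.0, 1: 0.0}                             # deterministic reward, n = m = 1
H = 3

# Backward recursion for V_h(s) = r(s) + sum_s' P(s'|s) V_{h+1}(s').
V = {H + 1: {s: 0.0 for s in S}}
for h in range(H, 0, -1):
    V[h] = {s: r[s] + sum(P[s][t] * V[h + 1][t] for t in S) for s in S}

def omega2(h, s):
    """Exact variance of the return J_{h:H} from state s, by enumerating paths."""
    m2 = 0.0
    for path in product(S, repeat=H - h):        # states at steps h+1, ..., H
        prob, gain, cur = 1.0, r[s], s
        for t in path:
            prob *= P[cur][t]
            gain += r[t]
            cur = t
        m2 += prob * (gain - V[h][s]) ** 2
    return m2

# Check the variance Bellman equation at h = 1: with deterministic rewards the
# sigma_R terms vanish and sigma_P is the next-state variance of V_{h+1}.
for s in S:
    ev = sum(P[s][t] * V[2][t] for t in S)
    var_next = sum(P[s][t] * (V[2][t] - ev) ** 2 for t in S)
    rhs = sum(P[s][t] * omega2(2, t) for t in S) + var_next
    assert abs(omega2(1, s) - rhs) < 1e-12
print("variance Bellman equation verified")
```

The factored statement of Theorem 1 replaces the single next-state variance term by the sum of the per-factor variances $\sigma^2_{P,i,h}$ plus the averaged reward variances.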

4.3. ALGORITHM

Our algorithm is formally described in Alg. 1, with a more detailed description in Section C. Denote by $N_k((s,a)[Z])$ the number of steps at which the agent encounters $(s,a)[Z]$ during the first $k$ episodes. In episode $k$, we estimate the mean of each factored reward $R_i$ and each factored transition $P_j$ with the empirical means $\hat{R}_{k,i}$ and $\hat{P}_{k,j}$ computed from the history data $\mathcal{L}$. We then construct an optimistic MDP $\hat{M}$ based on the estimated reward and transition functions. For a given $(s,a)$ pair, the reward and transition functions are defined as $\hat{R}_k(s,a) = \frac{1}{m}\sum_{i=1}^{m} \hat{R}_{k,i}((s,a)[Z^R_i])$ and $\hat{P}_k(s' \mid s,a) = \prod_{j=1}^{n} \hat{P}_{k,j}\big(s'[j] \mid (s,a)[Z^P_j]\big)$.

Algorithm 1 FMDP-BF
Input: $\delta$; history data $\mathcal{L} = \emptyset$; initialize $N((s,a)[Z_i]) = 0$ for every scope $Z_i$ and every $(s,a)[Z_i] \in \mathcal{X}[Z_i]$
1: for episode $k = 1, 2, \dots$ do
2:   Set $\overline{V}_{k,H+1}(s) = \underline{V}_{k,H+1}(s) = 0$ for all $s$
3:   Estimate the empirical means $\hat{R}_k$ and $\hat{P}_k$ from the history data $\mathcal{L}$
4:   for horizon $h = H, H-1, \dots, 1$ do
5:     for $s \in \mathcal{S}$, $a \in \mathcal{A}$ do
6:       $\overline{Q}_{k,h}(s,a) = \min\{H,\; \hat{R}_k(s,a) + CB_k(s,a) + \hat{P}_k \overline{V}_{k,h+1}(s,a)\}$
7:     end for
8:     $\pi_{k,h}(s) = \arg\max_a \overline{Q}_{k,h}(s,a)$
9:     $\overline{V}_{k,h}(s) = \max_{a \in \mathcal{A}} \overline{Q}_{k,h}(s,a)$
10:    $\underline{V}_{k,h}(s) = \max\{0,\; \hat{R}_k(s, \pi_{k,h}(s)) - CB_k(s, \pi_{k,h}(s)) + \hat{P}_k \underline{V}_{k,h+1}(s, \pi_{k,h}(s))\}$
11:   end for
12:   Take actions according to $\pi_k$ for $H$ steps in this episode
13:   Update $\mathcal{L} \leftarrow \mathcal{L} \cup \{(s_{k,h}, a_{k,h}, r_{k,h}, s_{k,h+1})\}_{h=1,\dots,H}$ and update the counters $N_k((s,a)[Z_i])$
14: end for

Following the analysis in Section 4.1, we construct the confidence bonus of each factored reward $R_i$ separately, using the empirical variance:

$CB^R_{k,Z^R_i}(s,a) = \sqrt{\frac{2\hat{\sigma}^2_{R,k,i}(s,a)\, L^R_i}{N_{k-1}((s,a)[Z^R_i])}} + \frac{8L^R_i}{3N_{k-1}((s,a)[Z^R_i])}, \quad i \in [m],$

where $L^R_i \triangleq \log\big(18mT|\mathcal{X}[Z^R_i]|/\delta\big)$ and $\hat{\sigma}^2_{R,k,i}$ is the empirical variance of the $i$-th factored reward $R_i$, i.e.,

$\hat{\sigma}^2_{R,k,i}(s,a) = \frac{1}{N_{k-1}((s,a)[Z^R_i])}\sum_{t=1}^{(k-1)H} \mathbb{1}\big[(s_t,a_t)[Z^R_i] = (s,a)[Z^R_i]\big] \cdot r^2_{t,i} \;-\; \hat{R}_{k,i}\big((s,a)[Z^R_i]\big)^2.$

We define $L^P = \log(18nTSA/\delta)$ for short. Following the idea of Lemma 4.1, we construct the confidence bonus of scope $Z^P_i$ for the transition estimation separately:

$CB^P_{k,Z^P_i}(s,a) = \sqrt{\frac{4\hat{\sigma}^2_{P,k,i}(\overline{V}_{k,h+1},s,a)\, L^P}{N_{k-1}((s,a)[Z^P_i])}} + \sqrt{\frac{2u_{k,h,i}(s,a)\, L^P}{N_{k-1}((s,a)[Z^P_i])}} + \eta_{k,h,i}(s,a),$

where $\eta_{k,h,i}(s,a)$ collects higher-order terms and is given explicitly in Section C. The quantity $\hat{\sigma}^2_{P,k,i}(\overline{V}_{k,h+1}, s, a)$ corresponds to $\sigma^2_{P,i}(V^{\pi}_{h+1}, s, a)$ in Corollary 1.1 and can be regarded as the empirical next-state variance under $\hat{P}_{k,i}$:

$\hat{\sigma}^2_{P,k,i}(\overline{V}_{k,h+1}, s, a) = \mathbb{E}_{s'[1:i-1] \sim \hat{P}_{k,[1:i-1]}(\cdot|s,a)}\Big[ \mathbb{V}_{s'[i] \sim \hat{P}_{k,i}(\cdot|(s,a)[Z^P_i])}\Big[ \mathbb{E}_{s'[i+1:n] \sim \hat{P}_{k,[i+1:n]}(\cdot|s,a)}\big[ \overline{V}_{k,h+1}(s') \big] \Big] \Big].$

To guarantee optimism, the proof requires upper bounding the estimation error via the empirical variance $\hat{\sigma}^2_{P,k,i}(V^*, s, a)$. Since we do not know $V^*$ beforehand, we use $\hat{\sigma}^2_{P,k,i}(\overline{V}_{k,h+1}, s, a)$ as a surrogate in the confidence bonus. However, we cannot guarantee that $\hat{\sigma}^2_{P,k,i}(V^*, s, a)$ is upper bounded by $\hat{\sigma}^2_{P,k,i}(\overline{V}_{k,h+1}, s, a)$.
To compensate for the error due to the difference between $V^*_{h+1}$ and $\overline{V}_{k,h+1}$, we add the term $\sqrt{\frac{2u_{k,h,i}(s,a)\, L^P}{N_{k-1}((s,a)[Z^P_i])}}$ to the confidence bonus, where $u_{k,h,i}(s,a)$ is defined as

$u_{k,h,i}(s,a) = \mathbb{E}_{s'[1:i] \sim \hat{P}_{k,[1:i]}(\cdot|s,a)}\, \mathbb{E}_{s'[i+1:n] \sim \hat{P}_{k,[i+1:n]}(\cdot|s,a)}\Big[ \big(\overline{V}_{k,h+1} - \underline{V}_{k,h+1}\big)(s')^2 \Big].$

4.4. REGRET

Theorem 2. Suppose $|\mathcal{X}[Z^R_i]| \le J_R$ and $|\mathcal{X}[Z^P_j]| \le J_P$ for all $i \in [m]$ and $j \in [n]$. Then with probability $1 - \delta$, the regret of Alg. 1 is

$\tilde{O}\Big( \sqrt{J_R T \log(mTJ_R/\delta)}\,\log T + \sqrt{nHJ_P T \log(nTSA/\delta)}\,\log T \Big).$

Note that the regret bound does not depend on the cardinalities of the full state and action spaces, but only has a square-root dependence on the cardinality of each factored subspace $\mathcal{X}[Z_i]$. By leveraging the structure of the factored MDP, we achieve a regret that is exponentially smaller than that of UCBVI (Azar et al., 2017). The best previous regret bound for episodic factored MDPs was achieved by Osband & Van Roy (2014b); transformed to our setting, it becomes $\tilde{O}\big( \sqrt{J_R T \log(mTJ_R/\delta)} + nH\sqrt{\Gamma J_P T \log(nTJ_P/\delta)} \big)$, where $\Gamma$ bounds the cardinality of the factored state subspaces $|\mathcal{S}_i|$. Our lower bound is $\Omega\big( \sqrt{|\mathcal{X}[Z^R_i]|\, T} + \frac{1}{n}\sum_{i=1}^{n} \sqrt{|\mathcal{X}[Z^P_i]|\, HT} \big)$. The lower bound of Tian et al. (2020) is $\Omega\big( \max\big\{ \max_i \sqrt{|\mathcal{X}[Z^R_i]|\, T},\; \max_j \sqrt{|\mathcal{X}[Z^P_j]|\, HT} \big\} \big)$, which is derived from a different hard-instance construction. Their lower bound is of the same order as ours, while our bound captures the dependence on all the parameters, including the number of factored transitions $n$ and factored rewards $m$. If $|\mathcal{X}[Z^R_i]| = J_R$ and $|\mathcal{X}[Z^P_i]| = J_P$, the lower bound becomes $\Omega\big( \sqrt{J_R T} + \sqrt{HJ_P T} \big)$, which matches the upper bound in Theorem 2 up to a factor of $\sqrt{n}$ and logarithmic factors.
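For illustration, the Bernstein-type reward bonus $CB^R_{k,Z^R_i}$ of Section 4.3 can be computed from running sums alone. The sketch below uses hypothetical counts, rewards and parameters ($m = 1$, $T = 10^4$, $|\mathcal{X}[Z^R_i]| = 4$, $\delta = 0.1$); it is an illustration of the formula, not the full algorithm:

```python
import math

def reward_bonus(sq_sum, mean_sum, count, L):
    """Bernstein-type bonus for one factored reward scope:
    sqrt(2 * var * L / N) + 8L / (3N), with var the empirical variance.
    sq_sum and mean_sum are running sums of r^2 and r for this scope value."""
    if count == 0:
        return 1.0                      # maximal bonus before any observation
    mean = mean_sum / count
    var = max(0.0, sq_sum / count - mean ** 2)   # empirical variance
    return math.sqrt(2 * var * L / count) + 8 * L / (3 * count)

# Hypothetical visit history: 50 visits with rewards bounded in [0, 1].
rewards = [0.2, 0.8, 0.5] * 10 + [0.4] * 20
L = math.log(18 * 1 * 10_000 * 4 / 0.1)   # L^R_i = log(18 m T |X[Z^R_i]| / delta)
b = reward_bonus(sum(r * r for r in rewards), sum(rewards), len(rewards), L)
print(round(b, 4))
```

As the visit count of a scope value grows, the first term shrinks like $1/\sqrt{N}$ scaled by the empirical standard deviation, and the second (lower-order) term shrinks like $1/N$.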

5. RL WITH KNAPSACK CONSTRAINTS

In this section, we study RL with Knapsack constraints, or RLwK, as an application of FMDP-BF.

5.1. PRELIMINARIES

We generalize bandits with knapsack constraints, or BwK (Badanidiyuru et al., 2013; Agrawal et al., 2016), to episodic MDPs. We consider a tabular episodic Markov decision process $(\mathcal{S}, \mathcal{A}, H, P, R, C)$, which augments an episodic MDP with a $d$-dimensional stochastic cost vector $C(s,a)$; we use $C_i(s,a)$ to denote the $i$-th entry of the cost vector $C(s,a)$. If the agent takes action $a$ in state $s$, it receives a reward $r$ sampled from $R(s,a)$, together with a cost vector $c$, before transitioning to the next state $s'$ with probability $P(s' \mid s,a)$. In each episode, the agent's total budget is $B$; we also use $B_i$ to denote the total budget for the $i$-th cost, and without loss of generality we assume $B_i \le B$ for all $i$. An episode terminates after $H$ steps, or as soon as the cumulative cost $\sum_h c_{h,i}$ of any dimension $i$ exceeds the budget $B_i$, whichever occurs first. The agent's goal is to maximize its cumulative reward $\sum_{k=1}^{K}\sum_{h=1}^{H} r_{k,h}$ over $K$ episodes.
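A minimal simulation of the RLwK interaction protocol can clarify the termination rule. The sketch below is hypothetical in its environment and in one detail the text leaves open: here a step whose cost would overshoot any budget ends the episode before that step's reward is collected:

```python
def run_episode(H, B, step_fn, s0):
    """Simulate one RLwK episode (sketch): stop after H steps, or as soon as
    the cumulative cost in any dimension would exceed its budget B[i]."""
    s, total_reward = s0, 0.0
    spent = [0.0] * len(B)
    for h in range(H):
        s, r, c = step_fn(s, h)          # environment step: next state, reward, cost vector
        if any(spent[i] + c[i] > B[i] for i in range(len(B))):
            break                        # hard per-episode constraint: episode ends
        spent = [spent[i] + c[i] for i in range(len(B))]
        total_reward += r
    return total_reward, spent

# Toy environment (hypothetical): one cost dimension, cost 0.5 per step.
def step(s, h):
    return s, 1.0, [0.5]

reward, spent = run_episode(H=10, B=[2.0], step_fn=step, s0=0)
print(reward, spent)   # -> 4.0 [2.0]  (budget 2.0 allows exactly 4 steps of cost 0.5)
```

Note that the budget resets at the start of every episode, which is exactly what distinguishes RLwK from the $K$-episode hard-constraint setting discussed in Section 5.2.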

5.2. COMPARISON WITH OTHER SETTINGS

While RLwK might appear similar to episodic constrained RL (Efroni et al., 2020; Brantley et al., 2020), it is fundamentally different, so those algorithms cannot be applied here. As discussed in Section 3, the episodic constrained RL setting can be roughly divided into two categories. One line of work focuses on soft constraints, where the constraints are satisfied in expectation, i.e., $\sum_{h=1}^{H} \mathbb{E}[c_{k,h}] \le B$, with the expectation taken over the randomness of the trajectories and the random samples of the costs. Another line of work focuses on hard constraints over all $K$ episodes: the total cost in $K$ episodes cannot exceed a constant vector $B$, i.e., $\sum_{k=1}^{K}\sum_{h=1}^{H} c_{k,h} \le B$, and once this is violated before some episode $K_1 < K$, the agent obtains no reward in the remaining $K - K_1$ episodes. Though both settings are interesting and useful, they do not cover many common situations in constrained RL. For example, when playing games, the game is over once the total energy or health drops to 0; after that, the player may restart the game (starting a new episode) with full initial energy. In robotics, a robot may episodically interact with the environment to learn a policy for a certain task, and the interaction in each episode ends once its energy is used up. In these two examples, we cannot consider only the expected cost or the cumulative cost across all episodes; we must account for the cumulative cost within every individual episode. Moreover, in many constrained RL applications, the agent's optimal action should depend on its remaining budget. For example, a robot should plan and take actions based on its remaining energy. Previous results do not consider this issue and use policies that map states to actions; in RLwK, by contrast, we need to define the policy as a mapping from states and remaining budgets to actions.
Section H gives further details, including two examples for illustrating the difference between these settings.

5.3. ALGORITHM

We make the following assumptions about the cost functions for simplicity; both hold if all stochastic costs are integers with a common upper bound.

Assumption 1. The budget $B_i$, as well as every possible value of the cost $C_i(s,a)$ for any state $s$ and action $a$, is an integral multiple of the unit cost $\frac{1}{m}$.

Assumption 2. The stochastic cost $C_i(s,a)$ has finite support; that is, the random variable $C_i(s,a)$ can take at most $n$ possible values.

Assumption 2 is needed because we must estimate the distribution of the cost, rather than just its mean. We discuss the necessity of these assumptions and possible methods for continuous distributions in Section H.3. From the discussion above, we need a policy that maps state and remaining budget to actions, so it is natural to augment the state with the remaining budget. The size of the augmented state space is then $S \cdot (Bm)^d$, and directly applying the UCBVI algorithm (Azar et al., 2017) leads to a regret of order $\tilde{O}\big(\sqrt{HSAT(Bm)^d}\big)$. Our key observation is that the augmented state representation is a product of subspaces, each of which evolves relatively independently; for example, the transition over the original state space $\mathcal{S}$ is independent of the remaining budget. Therefore, the constructed MDP can be formulated as a factored MDP, and its compact structure reduces the regret significantly. Applying Alg. 1 and Theorem 2 to RLwK reduces the regret to roughly $\tilde{O}\big(\sqrt{HSA(1 + dBm)T}\big)$, which is exponentially smaller. However, this regret still depends on the total budget $B$ and the discretization precision $m$, which may be very large for continuous budgets and costs. A further observation is that the cost of taking action $a$ in state $s$ depends only on the current state-action pair $(s,a)$, and not on the remaining budget $B$.
More formally, $b_{h+1} = b_h - c_h$, where $b_h$ is the remaining budget at step $h$ and $c_h$ is the cost incurred at step $h$. As a result, by estimating the distribution of the cost function, we can further reduce the regret to roughly $\tilde{O}\big(\sqrt{HdSAT}\big)$. A similar model, called the noisy offset model, has been discussed in Brunskill et al. (2009). Our algorithm, FMDP-BF for RLwK, follows the same basic idea as Alg. 1; we defer its detailed description to Section H to avoid redundancy. The regret is bounded by the following theorem.

Theorem 4. With probability at least $1 - \delta$, the regret of Alg. 4 is upper bounded by $\tilde{O}\Big(\sqrt{dHSAT\big(\log(SAT) + d\log(Bm)\big)}\Big)$.

Compared with the lower bound for non-factored tabular MDPs (Jaksch et al., 2010), this regret bound matches the lower bound w.r.t. $S$, $A$, $H$ and $T$. There may still be a gap in the dependence on the number of constraints $d$, which is often much smaller than the other quantities. It should be noted that, although we achieve a near-optimal regret for RLwK, the computational complexity is high, scaling polynomially with the maximum budget $B$ and exponentially with the number of constraints $d$. This is a consequence of the NP-hardness of the knapsack problem with multiple constraints (Martello, 1990; Kellerer et al., 2004). However, since the policy is defined on the state-budget space, whose cardinality is $S \cdot (Bm)^d$, this computational complexity seems unavoidable. How to address this issue, for example with approximation algorithms, is an interesting direction for future work.
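A back-of-the-envelope comparison of the two leading regret terms discussed above (constants and logarithmic factors dropped; all problem sizes are hypothetical) illustrates the saving from the factored formulation:

```python
import math

# Rough comparison of leading regret terms (logs and constants dropped):
# flat UCBVI on the augmented state space of size S * (B*m)^d, versus the
# factored formulation whose dependence is S * A * (1 + d*B*m).
S, A, H, T = 100, 10, 20, 10**6
B, m, d = 10, 10, 3                      # budget, cost discretization, constraints

flat     = math.sqrt(H * S * A * T * (B * m) ** d)
factored = math.sqrt(H * S * A * (1 + d * B * m) * T)
print(f"flat / factored ratio: {flat / factored:.1f}")   # -> 57.6
```

The ratio equals $\sqrt{(Bm)^d / (1 + dBm)}$, independent of $S$, $A$, $H$ and $T$, and grows exponentially with the number of constraints $d$.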

6. CONCLUSION

We propose a novel RL algorithm for solving FMDPs with a near-optimal regret guarantee; it improves the best previous regret bound by a factor of $\sqrt{nH|\mathcal{S}_i|}$. We also derive a regret lower bound for FMDPs based on the minimax lower bounds for multi-armed bandits and episodic tabular MDPs (Jaksch et al., 2010). Further, we formulate the RL with knapsack constraints (RLwK) setting and establish the connection between our FMDP results and RLwK by providing a sample-efficient algorithm based on FMDP-BF in this new setting. A few problems remain open. The regret upper and lower bounds have a gap of approximately $\sqrt{n}$, where $n$ is the number of transition factors. For RLwK, it is important to develop a computationally efficient algorithm, or to find a variant of the hard-constraint formulation. We hope to address these issues in future work.

A NOTATIONS

Before presenting the proof, we restate the definition of the following notations.

Symbol : Explanation

$s_{k,h}, a_{k,h}$ : the state and action that the agent encounters at episode $k$, step $h$
$L^P$ : $\log(18nTSA/\delta)$
$L^R_i$ : $\log(18mT|\mathcal{X}[Z^R_i]|/\delta)$
$\mathcal{X}^P_i$ : $\mathcal{X}[Z^P_i]$
$\mathcal{X}^R_i$ : $\mathcal{X}[Z^R_i]$
$N_k((s,a)[Z])$ : the number of steps at which the agent encounters $(s,a)[Z]$ during the first $k$ episodes
$PV(s,a)$ : shorthand for $\sum_{s' \in \mathcal{S}} P(s' \mid s,a)V(s')$
$\phi_{k,i}(s,a)$ : $\sqrt{\frac{4|\mathcal{S}_i|L^P}{N_{k-1}((s,a)[Z^P_i])}} + \frac{4|\mathcal{S}_i|L^P}{3N_{k-1}((s,a)[Z^P_i])}$
$\hat{\sigma}^2_{R,i}(s,a)$ : the empirical variance of the reward $R_i$
$\sigma^2_{P,i}(V,s,a)$ : the next-state variance of $PV$ for the transition $P_i$, i.e., $\mathbb{E}_{s'[1:i-1] \sim P_{[1:i-1]}(\cdot|s,a)}\big[\mathbb{V}_{s'[i] \sim P_i(\cdot|(s,a)[Z^P_i])}\big[\mathbb{E}_{s'[i+1:n] \sim P_{[i+1:n]}(\cdot|s,a)}[V(s')]\big]\big]$
$\hat{\sigma}_{P,k,i}(V,s,a)$ : the empirical next-state variance of $\hat{P}_k V$ for the transition $\hat{P}_{k,i}$, defined as above with $P$ replaced by $\hat{P}_k$
$\Omega_{k,h}$ : the optimism and pessimism event for $(k,h)$: $\overline{V}_{k,h} \ge V^*_h \ge \underline{V}_{k,h}$
$\Omega$ : the optimism and pessimism event for all $1 \le k \le K$, $1 \le h \le H$, i.e., $\cap_{k,h} \Omega_{k,h}$
$w_{k,h,Z}(s,a)$ : the probability of entering $(s,a)[Z]$ at step $h$ in episode $k$
$w_{k,Z}(s,a)$ : $\sum_{h=1}^{H} w_{k,h,Z}(s,a)$
$w_{k,h}(s,a)$ : the probability of entering $(s,a)$ at step $h$ in episode $k$, i.e., $w_{k,h,Z}(s,a)$ with $Z = \{1, 2, \dots, d\}$
$w_k(s,a)$ : $\sum_{h=1}^{H} w_{k,h}(s,a)$

B OMITTED DETAILS FOR FMDP-CH

In this section, we introduce our algorithm with a Hoeffding-type confidence bonus and present the corresponding regret bound. Our algorithm, described in Algorithm 2, is related to the UCBVI-CH algorithm (Azar et al., 2017), in the sense that Algorithm 2 reduces to UCBVI-CH for a flat MDP with $m = n = d = 1$. Let $N_k((s,a)[Z])$ denote the number of steps at which the agent encounters $(s,a)[Z]$ during the first $k$ episodes, and $N_k((s,a)[Z^P_j], s_j)$ the number of steps at which the agent transitions to a state with $s'[j] = s_j$ after encountering $(s,a)[Z^P_j]$ during the first $k$ episodes. In episode $k$, we estimate the mean of each factored reward $R_i$ and each factored transition $P_j$ with the empirical means

$\hat{R}_{k,i}\big((s,a)[Z^R_i]\big) = \frac{\sum_{t \le (k-1)H} \mathbb{1}\big[(s_t,a_t)[Z^R_i] = (s,a)[Z^R_i]\big] \cdot r_{t,i}}{N_{k-1}((s,a)[Z^R_i])}, \qquad \hat{P}_{k,j}\big(s[j] \mid (s,a)[Z^P_j]\big) = \frac{N_{k-1}\big((s,a)[Z^P_j], s[j]\big)}{N_{k-1}\big((s,a)[Z^P_j]\big)},$

where $r_{t,i}$ denotes the reward $R_i$ sampled at step $t$. We then construct an optimistic MDP $\hat{M}$ based on the estimated reward and transition functions: for a given $(s,a)$ pair, $\hat{R}_k(s,a) = \frac{1}{m}\sum_{i=1}^{m} \hat{R}_{k,i}((s,a)[Z^R_i])$ and $\hat{P}_k(s' \mid s,a) = \prod_{j=1}^{n} \hat{P}_{k,j}(s'[j] \mid (s,a)[Z^P_j])$. We define $L^R_i = \log(18mT|\mathcal{X}[Z^R_i]|/\delta)$, $L^P = \log(18nTSA/\delta)$ and $\phi_{k,i}(s,a) = \sqrt{\frac{4|\mathcal{S}_i|L^P}{N_{k-1}((s,a)[Z^P_i])}} + \frac{4|\mathcal{S}_i|L^P}{3N_{k-1}((s,a)[Z^P_i])}$.

Algorithm 2 FMDP-CH
Input: $\delta$; history data $\mathcal{L} = \emptyset$; initialize $N((s,a)[Z_i]) = 0$ for every scope $Z_i$ and every $(s,a)[Z_i] \in \mathcal{X}[Z_i]$
1: for episode $k = 1, 2, \dots$ do
2:   Set $\tilde{V}_{k,H+1}(s) = 0$ for all $s$
3:   Estimate $\hat{R}_{k,i}(s,a)$ with the empirical mean if $N_{k-1}((s,a)[Z^R_i]) > 0$, and set $\hat{R}_{k,i}(s,a) = 1$ otherwise; then compute $\hat{R}_k(s,a) = \frac{1}{m}\sum_{i=1}^{m} \hat{R}_{k,i}((s,a)[Z^R_i])$
4:   Let $\mathcal{K}^P = \big\{(s,a) \in \mathcal{S} \times \mathcal{A} : \min_{i \in [n]} N_k((s,a)[Z^P_i]) > 0\big\}$ and estimate $\hat{P}_k(\cdot \mid s,a)$ with the empirical mean for all $(s,a) \in \mathcal{K}^P$
5:   for horizon $h = H, H-1, \dots, 1$ do
6:     for $s \in \mathcal{S}$, $a \in \mathcal{A}$ do
7:       $\tilde{Q}_{k,h}(s,a) = \min\{H,\; \hat{R}_k(s,a) + CB_k(s,a) + \hat{P}_k \tilde{V}_{k,h+1}(s,a)\}$ if $(s,a) \in \mathcal{K}^P$, and $\tilde{Q}_{k,h}(s,a) = H$ otherwise
8:     end for
9:     $\tilde{V}_{k,h}(s) = \max_{a \in \mathcal{A}} \tilde{Q}_{k,h}(s,a)$
10:   end for
11:   for step $h = 1, \dots, H$ do take action $a_{k,h} = \arg\max_a \tilde{Q}_{k,h}(s_{k,h}, a)$ end for
12:   Update the history trajectory $\mathcal{L} \leftarrow \mathcal{L} \cup \{(s_{k,h}, a_{k,h}, r_{k,h}, s_{k,h+1})\}_{h=1,\dots,H}$ and update the counters $N_k((s,a)[Z_i])$
13: end for

We construct the confidence bonus of each factored reward $R_i$ and each factored transition $P_i$ separately, in the following way:

$CB^R_{k,Z^R_i}(s,a) = \sqrt{\frac{2L^R_i}{N_{k-1}((s,a)[Z^R_i])}}, \quad i \in [m] \qquad (1)$

$CB^P_{k,Z^P_i}(s,a) = \sqrt{\frac{2H^2 L^P}{N_{k-1}((s,a)[Z^P_i])}} + H\phi_{k,i}(s,a)\sum_{j=1, j \ne i}^{n} \phi_{k,j}(s,a), \quad i \in [n] \qquad (2)$

We define the overall confidence bonus as the sum of all bonuses for rewards and transitions, i.e., $CB_k(s,a) = \frac{1}{m}\sum_{i=1}^{m} CB^R_{k,Z^R_i}(s,a) + \sum_{j=1}^{n} CB^P_{k,Z^P_j}(s,a)$. We obtain the following regret upper bound for Alg. 2.

Theorem 5. With probability $1 - \delta$, the regret of Alg. 2 is upper bounded by

$\mathrm{Reg}(K) = \tilde{O}\Big( \frac{1}{m}\sum_{i=1}^{m} \sqrt{|\mathcal{X}[Z^R_i]|\,T \log(mT|\mathcal{X}[Z^R_i]|/\delta)} + \sum_{j=1}^{n} H\sqrt{|\mathcal{X}[Z^P_j]|\,T \log(nTSA/\delta)} \Big),$

where $\tilde{O}$ hides lower-order terms with respect to $T$.
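The counter-based empirical estimates $\hat{R}_{k,i}$ and $\hat{P}_{k,j}$ used by Algorithm 2 can be sketched as follows (scope values are represented by hypothetical tuples; the optimistic default of $1$ before the first visit mirrors Algorithm 2):

```python
from collections import defaultdict

# Running counters for one reward scope Z^R_i and one transition scope Z^P_j.
# Dictionary keys are scope values (s,a)[Z], represented here as tuples.
N_r   = defaultdict(int)     # N((s,a)[Z^R_i])
sum_r = defaultdict(float)   # running sum of sampled rewards r_i
N_p   = defaultdict(int)     # N((s,a)[Z^P_j])
N_ps  = defaultdict(int)     # N((s,a)[Z^P_j], s'[j])

def update(scope_ra, r_i, scope_pa, next_sj):
    """Record one observed step for both scopes."""
    N_r[scope_ra] += 1; sum_r[scope_ra] += r_i
    N_p[scope_pa] += 1; N_ps[(scope_pa, next_sj)] += 1

def R_hat(scope_ra):
    # Optimistic default of 1 before the first visit, as in Algorithm 2.
    return sum_r[scope_ra] / N_r[scope_ra] if N_r[scope_ra] > 0 else 1.0

def P_hat(scope_pa, next_sj):
    return N_ps[(scope_pa, next_sj)] / N_p[scope_pa] if N_p[scope_pa] > 0 else 0.0

update((0, 1), 0.5, (0, 1), 2)
update((0, 1), 1.0, (0, 1), 2)
update((0, 1), 0.0, (0, 1), 0)
print(R_hat((0, 1)), P_hat((0, 1), 2))   # -> 0.5 0.6666666666666666
```

Because counts are indexed by the scope value $(s,a)[Z]$ rather than by the full pair $(s,a)$, every visit to any state-action pair sharing the same scope value contributes to the same estimate, which is the statistical source of the improved regret.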

C OMITTED DETAILS IN SECTION 4

Published as a conference paper at ICLR 2021

In this section, we clarify the details omitted in Section 4. The detailed algorithm is described in Alg. 3. We denote by $N_k((s,a)[Z])$ the number of steps at which the agent encounters $(s,a)[Z]$ during the first $k$ episodes, and by $N_k((s,a)[Z^P_j], s_j)$ the number of steps at which the agent transitions to a state with $s'[j] = s_j$ after encountering $(s,a)[Z^P_j]$ during the first $k$ episodes. In episode $k$, we estimate the mean of each factored reward $R_i$ and each factored transition $P_j$ with the empirical means

$\hat{R}_{k,i}\big((s,a)[Z^R_i]\big) = \frac{\sum_{t \le (k-1)H} \mathbb{1}\big[(s_t,a_t)[Z^R_i] = (s,a)[Z^R_i]\big] \cdot r_{t,i}}{N_{k-1}((s,a)[Z^R_i])}, \qquad \hat{P}_{k,j}\big(s[j] \mid (s,a)[Z^P_j]\big) = \frac{N_{k-1}\big((s,a)[Z^P_j], s[j]\big)}{N_{k-1}\big((s,a)[Z^P_j]\big)},$

where $r_{t,i}$ denotes the reward $R_i$ sampled at step $t$. The formal definition of the confidence bonuses for Alg. 1 is:

$CB^R_{k,Z^R_i}(s,a) = \sqrt{\frac{2\hat{\sigma}^2_{R,k,i}(s,a)L^R_i}{N_{k-1}((s,a)[Z^R_i])}} + \frac{8L^R_i}{3N_{k-1}((s,a)[Z^R_i])} \qquad (3)$

$CB^P_{k,Z^P_i}(s,a) = \sqrt{\frac{4\hat{\sigma}^2_{P,k,i}(\overline{V}_{k,h+1},s,a)L^P}{N_{k-1}((s,a)[Z^P_i])}} + \sqrt{\frac{2u_{k,h,i}(s,a)L^P}{N_{k-1}((s,a)[Z^P_i])}} \qquad (4)$

$\qquad + \sqrt{\frac{16H^2L^P}{N_{k-1}((s,a)[Z^P_i])}} \sum_{j=1}^{n} \left( \Big(\frac{4|\mathcal{S}_j|L^P}{N_{k-1}((s,a)[Z^P_j])}\Big)^{1/4} + \frac{4|\mathcal{S}_j|L^P}{3N_{k-1}((s,a)[Z^P_j])} \right) + \sum_{j=1}^{n} H\phi_{k,i}(s,a)\phi_{k,j}(s,a), \qquad (5)$

where $\phi_{k,j}(s,a) = \sqrt{\frac{4|\mathcal{S}_j|L^P}{N_{k-1}((s,a)[Z^P_j])}} + \frac{4|\mathcal{S}_j|L^P}{3N_{k-1}((s,a)[Z^P_j])}$. Accordingly, $\eta_{k,h,i}(s,a)$ is defined as

$\eta_{k,h,i}(s,a) = \sqrt{\frac{16H^2L^P}{N_{k-1}((s,a)[Z^P_i])}} \sum_{j=1}^{n} \left( \Big(\frac{4|\mathcal{S}_j|L^P}{N_{k-1}((s,a)[Z^P_j])}\Big)^{1/4} + \frac{4|\mathcal{S}_j|L^P}{3N_{k-1}((s,a)[Z^P_j])} \right) + \sum_{j=1}^{n} H\phi_{k,i}(s,a)\phi_{k,j}(s,a).$

Theorem 6 (Refined statement of Theorem 2). With probability at least $1 - \delta$, the regret of Alg. 1 is upper bounded by

$\tilde{O}\Bigg( \frac{1}{m}\sum_{i=1}^{m} \sqrt{|\mathcal{X}[Z^R_i]|\,T \log(mT|\mathcal{X}[Z^R_i]|/\delta)}\,\log T + \sqrt{\sum_{i=1}^{n} H|\mathcal{X}[Z^P_i]|\,T \log(nTSA/\delta)}\,\log T \Bigg).$

For clarity, we also present a cleaner single-term regret bound under a symmetric problem setting.
Suppose $\mathcal M$ is a set of factored MDPs with $m = n$, $|S_i| = S_i$, $|X_i| = S_iA_i$ and $|Z_i^R| = |Z_j^P| = \zeta$ for $i = 1,\dots,m$ and $j = 1,\dots,n$. We write $X_i = (S_iA_i)^\zeta$ and assume that $X_i \le J$ and $S_i \le \Gamma$.

Corollary 6.1. Suppose $M^* \in \mathcal M$. With prob. $1-\delta$, the regret of FMDP-BF is upper bounded by $\tilde O\big(nH\sqrt{JT\log(nTSA/\delta)}\big)$.

The minimax regret bound for non-factored MDPs is $\tilde O\big(H\sqrt{SAT\log(SAT/\delta)}\big)$. Compared with this result, our algorithm's regret is exponentially smaller when $n$ and $\zeta$ are relatively small. Under this problem setting, the regret of Osband & Van Roy (2014b) is $\tilde O\big(nH\sqrt{nH\Gamma JT\log(nJT)}\big)$, so our result is better by a factor of $\sqrt{nH\Gamma}$.
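The empirical estimates $\hat R_{k,i}$ and $\hat P_{k,j}$ above are built from simple per-scope counters. A minimal sketch (the class and method names are our own, not from the paper):

```python
from collections import defaultdict

class ScopeCounter:
    """Visit counters for one factored scope Z_j^P: N((s,a)[Z_j]) and
    N((s,a)[Z_j], s'[j]), plus the empirical estimate of P_j(s'[j] | (s,a)[Z_j])."""
    def __init__(self):
        self.n = defaultdict(int)         # N((s,a)[Z_j])
        self.n_next = defaultdict(int)    # N((s,a)[Z_j], s'[j])

    def update(self, scope_val, next_factor):
        # record one observed transition of this factor
        self.n[scope_val] += 1
        self.n_next[(scope_val, next_factor)] += 1

    def p_hat(self, scope_val, next_factor):
        # empirical mean; None when the scope value has never been visited
        n = self.n[scope_val]
        return self.n_next[(scope_val, next_factor)] / n if n > 0 else None
```

For instance, after observing scope value `(0, 1)` transition to next-factor value `2` three times and to `3` once, `p_hat((0, 1), 2)` is `0.75`.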

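A quick numeric check of the "exponentially smaller" claim in Corollary 6.1, under hypothetical values with $\zeta = 1$ (so $J = \Gamma A_i$, while the flat state space has $|S| = \Gamma^n$ and $|A| = A_i^n$); log factors are dropped and all concrete numbers below are illustrative assumptions:

```python
import math

def factored_leading(n, H, J, T):
    # leading factor of Corollary 6.1: n * H * sqrt(J * T)
    return n * H * math.sqrt(J * T)

def tabular_leading(n, H, Gamma, A_i, T):
    # leading factor of the non-factored minimax bound: H * sqrt(|S| * |A| * T),
    # with |S| = Gamma**n and |A| = A_i**n growing exponentially in n
    return H * math.sqrt(Gamma**n * A_i**n * T)
```

With $n = 5$, $H = 10$, $\Gamma = 10$, $A_i = 2$, $T = 10^6$ (hence $J = 20$), the factored factor is about $2.2\times10^5$ versus roughly $1.8\times10^7$ for the tabular one.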
D HIGH PROBABILITY EVENTS

In this section, we discuss the high-probability events, and assume throughout the proofs that these events hold.

Algorithm 3 FMDP-BF (Detailed Description of Alg. 1)
Input: $\delta$. Set $\mathcal L = \emptyset$ and initialize $N((s,a)[Z_i]) = 0$ for every factored set $Z_i$ and every $(s,a)[Z_i]\in X[Z_i]$.
for episode $k = 1, 2, \dots$ do
  Set $\bar V_{k,H+1}(s) = \underline V_{k,H+1}(s) = 0$ for all $s$.
  Let $\mathcal K = \{(s,a)\in S\times A : N_{k-1}((s,a)[Z_i^P]) > 0 \text{ for all } i\in[n]\}$.
  Estimate $\hat R_{k,i}(s,a)$ as the empirical mean if $N_{k-1}((s,a)[Z_i^R]) > 0$, and as $1$ otherwise; set $\hat R_k(s,a) = \frac1m\sum_{i=1}^m \hat R_{k,i}((s,a)[Z_i^R])$.
  Estimate $\hat P_k(\cdot\mid s,a)$ with the empirical mean for all $(s,a)\in\mathcal K$.
  for horizon $h = H, H-1, \dots, 1$ do
    for $s\in S$, $a\in A$ do
      if $(s,a)\in\mathcal K$ then $\bar Q_{k,h}(s,a) = \min\{H,\ \hat R_k(s,a) + \mathrm{CB}_k(s,a) + \hat P_k\bar V_{k,h+1}(s,a)\}$
      else $\bar Q_{k,h}(s,a) = H$
    end for
    $\pi_{k,h}(s) = \arg\max_a \bar Q_{k,h}(s,a)$, $\quad \bar V_{k,h}(s) = \max_{a\in A}\bar Q_{k,h}(s,a)$
    $\underline V_{k,h}(s) = \max\{0,\ \hat R_k(s,\pi_{k,h}(s)) - \mathrm{CB}_k(s,\pi_{k,h}(s)) + \hat P_k\underline V_{k,h+1}(s,\pi_{k,h}(s))\}$
  end for
  for step $h = 1,\dots,H$ do
    Take action $a_{k,h} = \arg\max_a \bar Q_{k,h}(s_{k,h}, a)$.
  end for
  Update the history trajectory $\mathcal L = \mathcal L\cup\{s_{k,h},a_{k,h},r_{k,h},s_{k,h+1}\}_{h=1,\dots,H}$, and update the history counters $N_k((s,a)[Z_i])$.
end for

Lemma D.1 (High-prob. events). With prob. at least $1-2\delta/3$, the following events hold for all $k, h, s, a$:
\[
\big|\hat R_{k,i}((s,a)[Z_i^R]) - R_i((s,a)[Z_i^R])\big| \le \sqrt{\frac{2L_i^R}{N_{k-1}((s,a)[Z_i^R])}}, \quad i\in[m], \tag{7}
\]
\[
\Big|\Big(\hat P_{k,i}\prod_{j\ne i} P_{j} - \prod_{j=1}^{n} P_j\Big)V^*_h(s,a)\Big| \le \sqrt{\frac{2H^2L^P}{N_{k-1}((s,a)[Z_i^P])}}, \quad i\in[n], \tag{8}
\]
\[
\big\|(\hat P_{k,i} - P_i)(\cdot\mid (s,a)[Z_i^P])\big\|_1 \le 2\sqrt{\frac{|S_i|L^P}{N_{k-1}((s,a)[Z_i^P])}} + \frac{4|S_i|L^P}{3N_{k-1}((s,a)[Z_i^P])}, \quad i\in[n], \tag{9}
\]
\[
\big|(\hat P_{k,i} - P_i)(s'\mid (s,a)[Z_i^P])\big| \le \sqrt{\frac{2P_i(s'[i]\mid (s,a)[Z_i^P])L^P}{N_{k-1}((s,a)[Z_i^P])}} + \frac{L^P}{3N_{k-1}((s,a)[Z_i^P])}, \quad i\in[n], \tag{10}
\]
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\Big[P\big(\bar V_{k,h+1} - V^{\pi_k}_{h+1}\big)(s_{k,h},a_{k,h}) - \big(\bar V_{k,h+1} - V^{\pi_k}_{h+1}\big)(s_{k,h+1})\Big] \le 2H\sqrt{T\log(18SAT/\delta)}, \tag{11}
\]
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\Big[P\big(\bar V_{k,h+1} - V^{*}_{h+1}\big)(s_{k,h},a_{k,h}) - \big(\bar V_{k,h+1} - V^{*}_{h+1}\big)(s_{k,h+1})\Big] \le 2H\sqrt{T\log(18SAT/\delta)}. \tag{12}
\]
We denote the collection of the events above by $\Lambda_1$, and assume it holds throughout the proofs.

Proof. By Hoeffding's inequality and union bounds over all $i\in[m]$, $k\in[K]$ and $(s,a)\in X[Z_i^R]$, Inq. 7 holds with prob. $1-\delta/9$ for all $i\in[m]$, $k\in[K]$, $(s,a)\in X[Z_i^R]$. Similarly, by Hoeffding's inequality and union bounds over all $i\in[n]$, steps $t$ and $(s,a)\in X$, Inq. 8 also holds with prob. $1-\delta/9$ for all $i, s, a, k$. Inq. 9 is the high-probability bound on the $L_1$ norm of the maximum-likelihood estimate, which was proved by Weissman et al. (2003). Inq. 10 follows from Bernstein's inequality and a union bound (see Azar et al. (2017) for a similar derivation). Inqs. 11 and 12 are summations of martingale difference sequences, and follow from Azuma's inequality. Finally, taking a union bound over all these inequalities shows that $\Lambda_1$ holds with prob. at least $1-2\delta/3$.

For the proof of Thm. 2, we also need to consider the following high-prob. events, which we denote by $\Lambda_2$. During the proof of Thm. 2, we assume both $\Lambda_1$ and $\Lambda_2$ hold.

Lemma D.2. With prob. at least $1-\delta/3$, the following events hold for all $k, h, s, a$:
\[
\big|\hat R_{k,i}((s,a)[Z_i^R]) - R_i((s,a)[Z_i^R])\big| \le \sqrt{\frac{2\hat\sigma^2_{R,k,i}(s,a)L_i^R}{N_{k-1}((s,a)[Z_i^R])}} + \frac{8L_i^R}{3N_{k-1}((s,a)[Z_i^R])}, \quad i\in[m], \tag{13}
\]
\[
(\hat P_{k,i} - P_i)\prod_{j\ne i} P_j V^*_{h+1}(s,a) \le \sqrt{\frac{2\sigma^2_{P,i}(V^*_{h+1},s,a)L^P}{N_{k-1}((s,a)[Z_i^P])}} + \frac{2HL^P}{3N_{k-1}((s,a)[Z_i^P])}, \quad i\in[n], \tag{14}
\]
\[
N_k((s,a)[Z_i^P]) \ge \frac12\sum_{j<k} w_{j,Z_i^P}(s,a) - H\log(18nX_i^PH/\delta), \quad i\in[n]. \tag{15}
\]
Proof. Inq. 13 follows directly from the empirical Bernstein inequality. We now focus on Inq. 14. By Bernstein's inequality and union bounds over all $s, a, k, h$, the following inequality holds with prob. at least $1-\delta/9$:
\[
(\hat P_{k,i} - P_i)\prod_{j\ne i} P_j V^*_{h+1}(s,a) = \sum_{s'[1:i-1]\in X[1:i-1]} P(s'[1:i-1]\mid s,a)\,(\hat P_{k,i} - P_i)\prod_{j=i+1}^{n} P_j V^*_{h+1}(s,a)
\]
\[
\le \sum_{s'[1:i-1]} P(s'[1:i-1]\mid s,a)\sqrt{\frac{2\,\mathrm{Var}_{s'[i]\sim P_i(\cdot\mid (s,a)[Z_i^P])}\Big(\mathbb E_{s'[i+1:n]\sim P_{[i+1:n]}(\cdot\mid s,a)}\big[V^*_{h+1}(s')\big]\,\Big|\, s'[1:i-1]\Big)L^P}{N_{k-1}((s,a)[Z_i^P])}} + \frac{2HL^P}{3N_{k-1}((s,a)[Z_i^P])}
\]
\[
\le \sqrt{\frac{2\sigma^2_{P,i}(V^*_{h+1},s,a)L^P}{N_{k-1}((s,a)[Z_i^P])}} + \frac{2HL^P}{3N_{k-1}((s,a)[Z_i^P])}.
\]
The last inequality is due to Jensen's inequality. Writing $P(s'[1:i-1])$ as shorthand for $P(s'[1:i-1]\mid s,a)$, and letting $C_1$ denote $2\,\mathrm{Var}_{s'[i]\sim P_i(\cdot\mid(s,a)[Z_i^P])}\big(\mathbb E_{s'[i+1:n]\sim P_{[i+1:n]}(\cdot\mid s,a)}[V^*_{h+1}(s')]\mid s'[1:i-1]\big)L^P$, we have
\[
\sum_{s'[1:i-1]} P(s'[1:i-1])\sqrt{\frac{C_1}{N_{k-1}((s,a)[Z_i^P])}} \le \sqrt{\frac{\sum_{s'[1:i-1]} C_1\,P(s'[1:i-1])}{N_{k-1}((s,a)[Z_i^P])}} = \sqrt{\frac{2\sigma^2_{P,i}(V^*_{h+1},s,a)L^P}{N_{k-1}((s,a)[Z_i^P])}}.
\]
Inq. 15 follows the same proof as for the failure event $F^N$ in Section B.1 of Dann et al. (2019).

E PROOF OF THEOREM 5

E.1 ESTIMATION ERROR DECOMPOSITION

Lemma E.1. The estimation error can be decomposed in the following way:
\[
\big\|(\hat P_k - P)(\cdot\mid s,a)\big\|_1 \le \sum_{i=1}^{n}\big\|(\hat P_{k,i} - P_i)(\cdot\mid (s,a)[Z_i^P])\big\|_1, \tag{16}
\]
\[
\big|(\hat P_k - P)V(s,a)\big| \le \sum_{i=1}^{n}\Big|(\hat P_{k,i} - P_i)\Big(\prod_{j=1,j\ne i}^{n} P_j\Big)V(s,a)\Big| + \sum_{i=1}^{n}\sum_{j=1,j\ne i}^{n}\|V\|_\infty\big\|(\hat P_{k,i} - P_i)(\cdot\mid (s,a)[Z_i^P])\big\|_1\big\|(\hat P_{k,j} - P_j)(\cdot\mid (s,a)[Z_j^P])\big\|_1, \tag{17}
\]
where $V$ denotes any value function mapping from $S$ to $\mathbb R$, e.g. $V^*_{h+1}$ or $\bar V_{k,h+1} - V^*_{h+1}$.

Proof. Inq. 16 has the same form as Lemma 32 in Li (2009) and Lemma 1 in Osband & Van Roy (2014b). We mainly focus on Inq. 17, and decompose the difference as
\[
(\hat P_k - P)V(s,a) \le (\hat P_{k,n} - P_n)\prod_{i=1}^{n-1} P_i V(s,a) + P_n\Big(\prod_{i=1}^{n-1}\hat P_{k,i} - \prod_{i=1}^{n-1} P_i\Big)V(s,a) + (\hat P_{k,n} - P_n)\Big(\prod_{i=1}^{n-1}\hat P_{k,i} - \prod_{i=1}^{n-1} P_i\Big)V(s,a). \tag{18}
\]
For the last term of Inq. 18, we have
\[
(\hat P_{k,n} - P_n)\Big(\prod_{i=1}^{n-1}\hat P_{k,i} - \prod_{i=1}^{n-1} P_i\Big)V(s,a) \le \big\|(\hat P_{k,n} - P_n)(\cdot\mid (s,a)[Z_n^P])\big\|_1\sum_{i=1}^{n-1}\big\|(\hat P_{k,i} - P_i)(\cdot\mid (s,a)[Z_i^P])\big\|_1\,\|V\|_\infty,
\]
where the last inequality is due to Inq. 16. The second term of Inq. 18 can be further decomposed as
\[
P_n\Big(\prod_{i=1}^{n-1}\hat P_{k,i} - \prod_{i=1}^{n-1} P_i\Big)V(s,a) \le P_n(\hat P_{k,n-1} - P_{n-1})\prod_{i=1}^{n-2} P_i V(s,a) + P_nP_{n-1}\Big(\prod_{i=1}^{n-2}\hat P_{k,i} - \prod_{i=1}^{n-2} P_i\Big)V(s,a) + P_n(\hat P_{k,n-1} - P_{n-1})\Big(\prod_{i=1}^{n-2}\hat P_{k,i} - \prod_{i=1}^{n-2} P_i\Big)V(s,a).
\]
Recursively decomposing the second term over all $n$ factors in the same manner proves Inq. 17.

Lemma E.2. Under event $\Lambda_1$, the following inequalities hold:
\[
|\hat R_k(s,a) - R(s,a)| \le \frac1m\sum_{i=1}^{m}\sqrt{\frac{2L_i^R}{N_{k-1}((s,a)[Z_i^R])}}, \tag{20}
\]
\[
\big\|(\hat P_k - P)(\cdot\mid s,a)\big\|_1 \le \sum_{i=1}^{n}\phi_{k,i}(s,a), \quad \text{where } \phi_{k,i}(s,a) = \sqrt{\frac{4|S_i|L^P}{N_{k-1}((s,a)[Z_i^P])}} + \frac{4|S_i|L^P}{3N_{k-1}((s,a)[Z_i^P])}, \tag{21}
\]
\[
\big|(\hat P_k - P)V^*(s,a)\big| \le \sum_{i=1}^{n}\sqrt{\frac{2H^2L^P}{N_{k-1}((s,a)[Z_i^P])}} + \sum_{i=1}^{n}\sum_{j=1,j\ne i}^{n} H\phi_{k,i}(s,a)\phi_{k,j}(s,a). \tag{22}
\]
Proof. Inq. 20 follows from Lemma D.1:
\[
|\hat R_k(s,a) - R(s,a)| \le \frac1m\sum_{i=1}^{m}|\hat R_{k,i}(s,a) - R_i(s,a)| \le \frac1m\sum_{i=1}^{m}\sqrt{\frac{2L_i^R}{N_{k-1}((s,a)[Z_i^R])}}.
\]
Inq. 21 follows directly by applying Inq. 9 of Lemma D.1 to Inq. 16 of Lemma E.1. Similarly, Inq. 22 follows by applying Inqs. 8 and 9 of Lemma D.1 to Inq. 17 of Lemma E.1, together with $\|V^*\|_\infty \le H$.

E.2 OPTIMISM

Lemma E.3 (Optimism). Under event $\Lambda_1$, $\bar V_{k,h}(s) \ge V^*_h(s)$ for all $k, h, s$.

Proof. We prove the lemma by induction. For $h = H+1$ the inequality holds trivially, since $\bar V_{k,H+1}(s) = V^*_{H+1}(s) = 0$. Assuming it holds at step $h+1$,
\[
\bar V_{k,h}(s) - V^*_h(s) \ge \hat R_k(s,\pi^*_h(s)) + \mathrm{CB}_k(s,\pi^*_h(s)) + \hat P_k\bar V_{k,h+1}(s,\pi^*_h(s)) - R(s,\pi^*_h(s)) - PV^*_{h+1}(s,\pi^*_h(s))
\]
\[
= \hat R_k(s,\pi^*_h(s)) - R(s,\pi^*_h(s)) + \mathrm{CB}_k(s,\pi^*_h(s)) + \hat P_k\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s,\pi^*_h(s)) + (\hat P_k - P)V^*_{h+1}(s,\pi^*_h(s))
\]
\[
\ge \hat R_k(s,\pi^*_h(s)) - R(s,\pi^*_h(s)) + \mathrm{CB}_k(s,\pi^*_h(s)) + (\hat P_k - P)V^*_{h+1}(s,\pi^*_h(s)) \ge 0.
\]
The first inequality is due to $\bar V_{k,h}(s) \ge \bar Q_{k,h}(s,\pi^*_h(s))$. The second inequality follows from the induction hypothesis $\bar V_{k,h+1}(s') \ge V^*_{h+1}(s')$ for all $s'$. The last inequality is due to Inq. 20 and Inq. 22 in Lemma E.2.
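The product-form $L_1$ decomposition of Lemma E.1 (Inq. 16) is easy to verify numerically for $n = 2$ factors: $\|\hat P_1\otimes\hat P_2 - P_1\otimes P_2\|_1 \le \|\hat P_1 - P_1\|_1 + \|\hat P_2 - P_2\|_1$. A minimal check on random distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_dist(k):
    # a random probability vector of length k
    v = rng.random(k)
    return v / v.sum()

def l1(p, q):
    return float(np.abs(p - q).sum())

# two transition factors (p1, p2) and their "estimates" (q1, q2)
p1, p2 = rand_dist(3), rand_dist(4)
q1, q2 = rand_dist(3), rand_dist(4)

lhs = l1(np.outer(p1, p2).ravel(), np.outer(q1, q2).ravel())  # joint L1 error
rhs = l1(p1, q1) + l1(p2, q2)                                 # sum of factor errors
```

Here `lhs <= rhs` always holds, which is the $n=2$ case of Inq. 16; the general case follows by the same telescoping.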

E.3 PROOF OF THEOREM 5

Now we are ready to prove Thm. 5.

Proof (of Thm. 5). We have
\[
V^*_h(s_{k,h}) - V^{\pi_k}_h(s_{k,h}) \le \bar V_{k,h}(s_{k,h}) - V^{\pi_k}_h(s_{k,h})
\]
\[
= \hat R_k(s_{k,h},\pi_{k,h}(s_{k,h})) + \hat P_k\bar V_{k,h+1}(s_{k,h},\pi_{k,h}(s_{k,h})) + \mathrm{CB}_k(s_{k,h},\pi_{k,h}(s_{k,h})) - R(s_{k,h},\pi_{k,h}(s_{k,h})) - PV^{\pi_k}_{h+1}(s_{k,h},\pi_{k,h}(s_{k,h}))
\]
\[
= \big(\bar V_{k,h+1} - V^{\pi_k}_{h+1}\big)(s_{k,h+1}) + \hat R_k(s_{k,h},\pi_{k,h}(s_{k,h})) - R(s_{k,h},\pi_{k,h}(s_{k,h})) + \mathrm{CB}_k(s_{k,h},\pi_{k,h}(s_{k,h}))
\]
\[
+ \Big[P\big(\bar V_{k,h+1} - V^{\pi_k}_{h+1}\big)(s_{k,h},\pi_{k,h}(s_{k,h})) - \big(\bar V_{k,h+1} - V^{\pi_k}_{h+1}\big)(s_{k,h+1})\Big] + \big(\hat P_k - P\big)V^*_{h+1}(s_{k,h},\pi_{k,h}(s_{k,h})) + \big(\hat P_k - P\big)\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},\pi_{k,h}(s_{k,h})).
\]
The first inequality is due to optimism, $\bar V_{k,h}(s_{k,h}) \ge V^*_h(s_{k,h})$. The first equality is due to the Bellman equations for $V^{\pi_k}_h$ and $\bar V_{k,h}$. For notational simplicity, we define
\[
\delta^1_{k,h} = \hat R_k(s_{k,h},\pi_{k,h}(s_{k,h})) - R(s_{k,h},\pi_{k,h}(s_{k,h})),
\]
\[
\delta^2_{k,h} = P\big(\bar V_{k,h+1} - V^{\pi_k}_{h+1}\big)(s_{k,h},\pi_{k,h}(s_{k,h})) - \big(\bar V_{k,h+1} - V^{\pi_k}_{h+1}\big)(s_{k,h+1}),
\]
\[
\delta^3_{k,h} = \big(\hat P_k - P\big)V^*_{h+1}(s_{k,h},\pi_{k,h}(s_{k,h})).
\]
We first focus on the upper bound of $(\hat P_k - P)(\bar V_{k,h+1} - V^*_{h+1})(s_{k,h},\pi_{k,h}(s_{k,h}))$, which we bound following the idea of Azar et al. (2017):
\[
\big(\hat P_k - P\big)\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},a_{k,h}) \le \sum_{i=1}^{n}(\hat P_{k,i} - P_i)\prod_{j\ne i} P_j\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},a_{k,h}) + \sum_{i=1}^{n}\sum_{j\ne i} H\phi_{k,i}(s_{k,h},a_{k,h})\phi_{k,j}(s_{k,h},a_{k,h})
\]
\[
\le \sum_{i=1}^{n}\sum_{s'[i]\in S_i}\Bigg(\sqrt{\frac{2P_i(s'[i]\mid (s_{k,h},a_{k,h})[Z_i^P])L^P}{N_{k-1}((s_{k,h},a_{k,h})[Z_i^P])}} + \frac{L^P}{3N_{k-1}((s_{k,h},a_{k,h})[Z_i^P])}\Bigg)\prod_{j\ne i} P_j\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},a_{k,h}) + \sum_{i=1}^{n}\sum_{j\ne i} H\phi_{k,i}\phi_{k,j}
\]
\[
\le \sum_{i=1}^{n}\sum_{s'[i]\in S_i}\sqrt{\frac{2P_i(s'[i]\mid (s_{k,h},a_{k,h})[Z_i^P])L^P}{N_{k-1}((s_{k,h},a_{k,h})[Z_i^P])}}\prod_{j\ne i} P_j\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},a_{k,h}) + \sum_{i=1}^{n}\frac{|S_i|HL^P}{3N_{k-1}((s_{k,h},a_{k,h})[Z_i^P])} + \sum_{i=1}^{n}\sum_{j\ne i} H\phi_{k,i}\phi_{k,j}.
\]
The first inequality is due to Lemma E.1, the second is due to Lemma D.1, and the last is due to the fact that $\|\bar V_{k,h+1} - V^*_{h+1}\|_\infty \le H$.

For each $i\in[n]$, we consider separately those $s'[i]$ satisfying $N_{k-1}((s_{k,h},a_{k,h})[Z_i^P])\,P_i(s'[i]\mid (s_{k,h},a_{k,h})[Z_i^P]) \ge 2n^2H^2L^P$ and those satisfying the reverse inequality. For the former, writing $\sqrt{2pL^P/N} = p\sqrt{2L^P/(pN)} \le p/(nH)$, the first term is bounded by
\[
\sum_{i=1}^{n}\sum_{s'[i]} \frac{P_i(s'[i]\mid\cdot)}{nH}\prod_{j\ne i} P_j\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},a_{k,h}) \le \frac1H P\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},a_{k,h})
\]
\[
= \frac1H\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h+1}) + \frac1H\Big[P\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},a_{k,h}) - \big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h+1})\Big],
\]
where the second term is a martingale difference sequence, which we denote by $\delta^4_{k,h}$. For those $s'[i]$ with $N_{k-1}(\cdot)P_i(\cdot) < 2n^2H^2L^P$, the summation is bounded by $\sum_{i=1}^{n}\frac{2nH^2|S_i|L^P}{N_{k-1}((s_{k,h},a_{k,h})[Z_i^P])}$. For notational simplicity, we define
\[
\delta^5_{k,h} = \sum_{i=1}^{n}\sum_{j\ne i} H\phi_{k,i}(s_{k,h},a_{k,h})\phi_{k,j}(s_{k,h},a_{k,h}) + \sum_{i=1}^{n}\frac{2nH^2|S_i|L^P}{N_{k-1}((s_{k,h},a_{k,h})[Z_i^P])},
\]
into whose second part the term $\sum_{i}\frac{|S_i|HL^P}{3N_{k-1}}$ above is absorbed. Summing up the above analysis, we have proved that
\[
\big(\hat P_k - P\big)\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h},\pi_{k,h}(s_{k,h})) \le \frac1H\big(\bar V_{k,h+1} - V^*_{h+1}\big)(s_{k,h+1}) + \delta^4_{k,h} + \delta^5_{k,h}.
\]
Now we are ready to collect all the terms in the regret. Firstly, we recursively expand the regret over all $h\in[H]$.
\[
V^*_1(s_{k,1}) - V^{\pi_k}_1(s_{k,1}) \le \bar V_{k,1}(s_{k,1}) - V^{\pi_k}_1(s_{k,1}) \le \mathrm{CB}_k(s_{k,1},a_{k,1}) + \delta^1_{k,1} + \delta^2_{k,1} + \delta^3_{k,1} + \delta^4_{k,1} + \delta^5_{k,1} + \Big(1+\frac1H\Big)\big(\bar V_{k,2}(s_{k,2}) - V^{\pi_k}_2(s_{k,2})\big)
\]
\[
\le \cdots \le \sum_{h=1}^{H}\Big(1+\frac1H\Big)^{h-1}\big(\mathrm{CB}_k(s_{k,h},a_{k,h}) + \delta^1_{k,h} + \delta^2_{k,h} + \delta^3_{k,h} + \delta^4_{k,h} + \delta^5_{k,h}\big) \le \sum_{h=1}^{H} e\big(\mathrm{CB}_k(s_{k,h},a_{k,h}) + \delta^1_{k,h} + \delta^2_{k,h} + \delta^3_{k,h} + \delta^4_{k,h} + \delta^5_{k,h}\big).
\]
Then we sum the regret over the $K$ episodes:
\[
\mathrm{Reg}(K) \le \sum_{k=1}^{K}\big(V^*_1(s_1) - V^{\pi_k}_1(s_1)\big) \le \sum_{k=1}^{K}\sum_{h=1}^{H} e\big(\mathrm{CB}_k(s_{k,h},a_{k,h}) + \delta^1_{k,h} + \delta^2_{k,h} + \delta^3_{k,h} + \delta^4_{k,h} + \delta^5_{k,h}\big).
\]
The terms $\delta^2_{k,h}$ and $\delta^4_{k,h}$ are martingale difference sequences, whose summations are bounded by $O(H\sqrt{T\log T})$ by Lemma D.1, while $\delta^1_{k,h}$ and $\delta^3_{k,h}$ can be bounded by Lemma E.2. The summations of the different terms in $\delta^1_{k,h}$, $\delta^3_{k,h}$, $\delta^5_{k,h}$ and $\mathrm{CB}_k$ fall into the following categories. In the remainder of the proof, $C$ denotes the dependence on all parameters other than the counters $N_k((s_{k,h},a_{k,h})[Z_i])$.

For terms of the form $\frac{C}{\sqrt{N_k((s_{k,h},a_{k,h})[Z_i])}}$, we have
\[
\sum_{k}\sum_{h}\frac{C}{\sqrt{N_k((s_{k,h},a_{k,h})[Z_i])}} \le HC + \sum_{x[Z_i]\in X[Z_i]}\sum_{c=1}^{N_K(x[Z_i])}\frac{C}{\sqrt c} \le HC + \sum_{x[Z_i]\in X[Z_i]} C\sqrt{N_K(x[Z_i])} \le HC + C\sqrt{|X[Z_i]|\,T},
\]
where the last inequality is due to the Cauchy–Schwarz inequality. These terms yield the main factors in the final regret.

For terms of the form $\frac{C}{N_k((s_{k,h},a_{k,h})[Z_i])}$, we have
\[
\sum_{k}\sum_{h}\frac{C}{N_k((s_{k,h},a_{k,h})[Z_i])} \le HC + \sum_{x[Z_i]\in X[Z_i]}\sum_{c=1}^{N_K(x[Z_i])}\frac{C}{c} \le HC + \sum_{x[Z_i]\in X[Z_i]} C\ln N_K(x[Z_i]) \le HC + C|X[Z_i]|\ln T,
\]
which has only logarithmic dependence on $T$.

For terms of the form $\frac{C}{\sqrt{N_k((s_{k,h},a_{k,h})[Z_i])N_k((s_{k,h},a_{k,h})[Z_j])}}$, we define $N_k((s,a)[Z_i],(s,a)[Z_j])$ as the number of times the agent has encountered $(s,a)[Z_i]$ and $(s,a)[Z_j]$ simultaneously during the first $k$ episodes. It is not hard to see that $N_k((s,a)[Z_i]) \ge N_k((s,a)[Z_i],(s,a)[Z_j])$ and $N_k((s,a)[Z_j]) \ge N_k((s,a)[Z_i],(s,a)[Z_j])$. Therefore,
\[
\sum_{k}\sum_{h}\frac{C}{\sqrt{N_k(\cdot[Z_i])N_k(\cdot[Z_j])}} \le \sum_{k}\sum_{h}\frac{C}{N_k(\cdot[Z_i],\cdot[Z_j])} \le HC + \sum_{x[Z_i\cup Z_j]\in X[Z_i\cup Z_j]} C\ln N_K(x[Z_i\cup Z_j]) \le HC + C|X[Z_i\cup Z_j]|\ln T,
\]
which also has only logarithmic dependence on $T$. For the remaining terms of the form $\frac{C}{N_k(\cdot[Z_i])^2}$ and $\frac{C}{N_k(\cdot[Z_i])\sqrt{N_k(\cdot[Z_j])}}$, the summations have no dependence on $T$, which is negligible since $T$ is the dominant factor. Bounding these different kinds of terms with the above methods, we can finally show that
\[
\mathrm{Reg}(K) = \tilde O\Bigg(\frac1m\sum_{i=1}^{m}\sqrt{|X[Z_i^R]|\,T\log(10mT|X[Z_i^R]|/\delta)} + \sum_{j=1}^{n} H\sqrt{|X[Z_j^P]|\,T\log(10nTSA/\delta)}\Bigg),
\]
where $\tilde O$ hides the lower-order factors w.r.t. $T$.

F PROOF OF THEOREM 2

F.1 ESTIMATION ERROR DECOMPOSITION

Lemma F.1. Under events $\Lambda_1$ and $\Lambda_2$, we have
\[
\hat R_k(s,a) - R(s,a) \le \frac1m\sum_{i=1}^{m}\sqrt{\frac{2\hat\sigma^2_{R,k,i}(s,a)L_i^R}{N_{k-1}((s,a)[Z_i^R])}} + \frac1m\sum_{i=1}^{m}\frac{8L_i^R}{3N_{k-1}((s,a)[Z_i^R])},
\]
\[
\big|(\hat P_k - P)V^*_{h+1}(s,a)\big| \le \sum_{i=1}^{n}\Bigg(\sqrt{\frac{2\sigma^2_{P,i}(V^*_{h+1},s,a)L^P}{N_{k-1}((s,a)[Z_i^P])}} + \frac{2HL^P}{3N_{k-1}((s,a)[Z_i^P])}\Bigg) + \sum_{i=1}^{n}\sum_{j=1,j\ne i}^{n} H\phi_{k,i}(s,a)\phi_{k,j}(s,a).
\]
Proof. The first inequality follows directly from the definition $R(s,a) = \frac1m\sum_{i=1}^{m} R_i(s,a)$ and Lemma D.2. We now prove the second inequality. By Lemma E.1, we have
\[
\big|(\hat P_k - P)V^*_{h+1}(s,a)\big| \le \sum_{i=1}^{n}(\hat P_{k,i} - P_i)\Big(\prod_{j=1,j\ne i}^{n} P_j\Big)V^*_{h+1}(s,a) + \sum_{i=1}^{n}\sum_{j=1,j\ne i}^{n} H\big\|(\hat P_{k,i} - P_i)(\cdot\mid (s,a)[Z_i^P])\big\|_1\big\|(\hat P_{k,j} - P_j)(\cdot\mid (s,a)[Z_j^P])\big\|_1,
\]
and the claim follows by Inq. 9 in Lemma D.1 and Inq. 14 in Lemma D.2.

F.2 OMITTED PROOF IN SECTION 4.2

Proof (of Theorem 1).
\[
\omega^2_h(s) = \mathbb E\big[(J_{h:H}(s_h) - V_h(s_h))^2 \mid s_h = s\big] = \sum_{s'} P(s'\mid s)\,\mathbb E\big[(J_{h+1:H}(s_{h+1}) + r_h - V_h(s_h))^2 \mid s_h = s, s_{h+1} = s'\big]
\]
\[
= \sum_{s'} P(s'\mid s)\,\mathbb E\big[J^2_{h+1:H}(s_{h+1}) + r^2_h + V^2_h(s_h) \mid s_h = s, s_{h+1} = s'\big] + \sum_{s'} P(s'\mid s)\,\mathbb E\big[2r_h(J_{h+1:H}(s_{h+1}) - V_h(s_h)) - 2J_{h+1:H}(s_{h+1})V_h(s_h) \mid s_h = s, s_{h+1} = s'\big].
\]
Given $s_h = s$ and $s_{h+1} = s'$, the variables $r_h$, $V_h(s_h)$ and $J_{h+1:H}(s_{h+1})$ are conditionally independent; thus we have
\[
\mathbb E\big[2r_h(J_{h+1:H}(s_{h+1}) - V_h(s_h)) \mid s_h = s, s_{h+1} = s'\big] = \mathbb E\big[2R(s_h)(V_{h+1}(s_{h+1}) - V_h(s_h)) \mid s_h = s, s_{h+1} = s'\big],
\]
\[
\mathbb E\big[2J_{h+1:H}(s_{h+1})V_h(s_h) \mid s_h = s, s_{h+1} = s'\big] = \mathbb E\big[2V_{h+1}(s_{h+1})V_h(s_h) \mid s_h = s, s_{h+1} = s'\big].
\]
Therefore, using $V_h(s) = R(s) + \sum_{s'} P(s'\mid s)V_{h+1}(s')$, we have
\[
\omega^2_h(s) = \mathbb E\big[r^2_h - R^2(s_h) \mid s_h = s\big] + \sum_{s'} P(s'\mid s)V^2_{h+1}(s') - \Big(\sum_{s'} P(s'\mid s)V_{h+1}(s')\Big)^2 + \sum_{s'} P(s'\mid s)\,\mathbb E\big[J^2_{h+1:H}(s_{h+1}) - V^2_{h+1}(s_{h+1}) \mid s_{h+1} = s'\big]. \tag{23}
\]
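The telescoping variance identity (Eqn. 24) with $n = 2$ factors is exactly the law of total variance, $\mathbb V[V(s')] = \mathbb E_{s'[1]}\big[\mathbb V_{s'[2]}[V\mid s'[1]]\big] + \mathbb V_{s'[1]}\big[\mathbb E_{s'[2]}[V\mid s'[1]]\big]$, which can be checked numerically on a random product transition:

```python
import numpy as np

rng = np.random.default_rng(1)
p1 = rng.random(3); p1 /= p1.sum()        # P_1(s'[1])
p2 = rng.random(4); p2 /= p2.sum()        # P_2(s'[2])
V = rng.random((3, 4))                    # V_{h+1}(s'[1], s'[2])

P = np.outer(p1, p2)                      # product transition P_1 x P_2
total_var = (P * V**2).sum() - (P * V).sum()**2

cond_mean = V @ p2                        # E_{s'[2]}[V | s'[1]]
sigma2_2 = p1 @ ((V**2) @ p2 - cond_mean**2)        # E_1 Var_2  (second factor)
sigma2_1 = p1 @ cond_mean**2 - (p1 @ cond_mean)**2  # Var_1 E_2  (first factor)
```

Here `total_var` equals `sigma2_1 + sigma2_2` up to floating-point error, matching the claim that the two factored variance terms sum to the full variance.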

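The counter summations recurring throughout the proofs, such as $\sum_{k,h} N_k((s_{k,h},a_{k,h})[Z_i])^{-1/2} \le O(\sqrt{|X[Z_i]|\,T})$ in the proof of Theorem 5 and in Lemma F.5 below, reduce to a pigeonhole fact that holds for any visitation order; a quick simulation confirms it:

```python
import math
import random

random.seed(0)
X = list(range(5))                 # the scope values (s,a)[Z_i]
T = 1000                           # total number of steps
counts = {x: 0 for x in X}         # N_{t-1}(x) before step t
total = 0.0
for _ in range(T):
    x = random.choice(X)           # arbitrary trajectory over scope values
    if counts[x] > 0:              # the first visit contributes nothing here
        total += 1 / math.sqrt(counts[x])
    counts[x] += 1

# sum_x sum_{c=1}^{N(x)} c^{-1/2} <= sum_x 2*sqrt(N(x)) <= 2*sqrt(|X| * T)
bound = 2 * math.sqrt(len(X) * T)
```

The final inequality is Cauchy–Schwarz over the scope values, which is how the per-step bonuses turn into the $\sqrt{|X[Z_i]|\,T}$ factors of the regret bounds.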
F.3 THE "GOOD" SET CONSTRUCTION

The construction of the "good" set is similar to that in Dann et al. (2017) and Zanette & Brunskill (2019), though we modify it to handle this more complicated factored setting. The idea is to partition each factored state-action subspace at each episode into two sets: the set of state-action pairs that have been visited sufficiently often (so that we can lower bound these visits by their expectations using standard concentration inequalities), and the set of $(s,a)$ that were not visited often enough to cause high regret. That is:

Definition 4 (The Good Set). The set $L_{k,i}$ for factored transition $P_i$ is defined as
\[
L_{k,i} \stackrel{\mathrm{def}}{=} \Bigg\{x[Z_i^P] \in X[Z_i^P] : \frac14\sum_{j<k} w_{j,Z_i^P}(x) \ge H\log(18nX_i^PH/\delta) + H\Bigg\}.
\]
The following lemmas follow the same idea as Lemma 6 and Lemma 7 in Zanette & Brunskill (2019).

Lemma F.2. Under events $\Lambda_1$ and $\Lambda_2$, if $(s,a)[Z_i^P] \in L_{k,i}$, we have $N_k((s,a)[Z_i^P]) \ge \frac14\sum_{j<k} w_{j,Z_i^P}(s,a)$.

Proof. By Lemma D.2, $N_k((s,a)[Z_i^P]) \ge \frac12\sum_{j<k} w_{j,Z_i^P}(s,a) - H\log(18nX_i^PH/\delta)$. Since $(s,a)[Z_i^P] \in L_{k,i}$, we have $\frac14\sum_{j<k} w_{j,Z_i^P}(s,a) \ge H\log(18nX_i^PH/\delta) + H$. Therefore
\[
N_k((s,a)[Z_i^P]) \ge \frac12\sum_{j<k} w_{j,Z_i^P}(s,a) - H\log(18nX_i^PH/\delta) \ge \frac12\sum_{j<k} w_{j,Z_i^P}(s,a) - \frac14\sum_{j<k} w_{j,Z_i^P}(s,a) = \frac14\sum_{j<k} w_{j,Z_i^P}(s,a).
\]

Lemma F.3. It holds that
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)[Z_i^P]\in L_{k,i}}\frac{w_{k,h,Z_i^P}(s,a)}{N_k((s,a)[Z_i^P])} \le 4X_i^P\log T.
\]
Proof. For those $(s,a)[Z_i^P] \in L_{k,i}$, we have $N_k((s,a)[Z_i^P]) \ge \frac14\sum_{j<k} w_{j,Z_i^P}(s,a)$ by Lemma F.2. Therefore,
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)[Z_i^P]\in L_{k,i}}\frac{w_{k,h,Z_i^P}(s,a)}{N_k((s,a)[Z_i^P])} \le \sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)[Z_i^P]\in L_{k,i}}\frac{4w_{k,h,Z_i^P}(s,a)}{\sum_{j<k} w_{j,Z_i^P}(s,a)} \le \sum_{(s,a)[Z_i^P]\in X_i^P}\sum_{k=1}^{K}\frac{4w_{k,Z_i^P}(s,a)}{\sum_{j<k} w_{j,Z_i^P}(s,a)} \le 4X_i^P\log T.
\]

Lemma F.4. It holds that
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)[Z_i^P]\notin L_{k,i}} w_{k,h,Z_i^P}(s,a) \le 8HX_i^P\log(10nX_i^PH/\delta).
\]

Lemma F.5. For the factored set $Z_i^P$ of the transition, we have
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X}\frac{w_{k,h}(s,a)}{N_{k-1}((s,a)[Z_i^P])} \le 8X_i^P\log T, \tag{25}
\]
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X}\frac{w_{k,h}(s,a)}{\sqrt{N_{k-1}((s,a)[Z_i^P])N_{k-1}((s,a)[Z_j^P])}} \le 8X_{i,j}^P\log T, \tag{26}
\]
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X}\frac{w_{k,h}(s,a)}{\sqrt{N_{k-1}((s,a)[Z_i^P])}\,N_{k-1}((s,a)[Z_j^P])^{\frac14}} \le 8(X_{i,j}^P)^{\frac34}\,T^{\frac14}\log T, \tag{27}
\]
where $X_{i,j}^P = |X[Z_i^P\cup Z_j^P]|$. For the factored set $Z_i^R$ of the rewards, we similarly have
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X}\frac{w_{k,h}(s,a)}{N_{k-1}((s,a)[Z_i^R])} \le 8X_i^R\log T, \tag{28}
\]
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X}\frac{w_{k,h}(s,a)}{\sqrt{N_{k-1}((s,a)[Z_i^R])N_{k-1}((s,a)[Z_j^R])}} \le 8X_{i,j}^R\log T, \tag{29}
\]
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X}\frac{w_{k,h}(s,a)}{\sqrt{N_{k-1}((s,a)[Z_i^R])}\,N_{k-1}((s,a)[Z_j^R])^{\frac14}} \le 8(X_{i,j}^R)^{\frac34}\,T^{\frac14}\log T, \tag{30}
\]
where $X_{i,j}^R = |X[Z_i^R\cup Z_j^R]|$.

Proof. We only prove the inequalities for the factored sets of the transition; the inequalities for the factored sets of the rewards can be proved in the same manner. For Inq. 25, define $X_i((s,a)[Z_i^P]) = \{x\in X : x[Z_i^P] = (s,a)[Z_i^P]\}$. Then
\[
\sum_{k,h}\sum_{(s,a)\in X}\frac{w_{k,h}(s,a)}{N_{k-1}((s,a)[Z_i^P])} = \sum_{k,h}\sum_{(s,a)[Z_i^P]\in X[Z_i^P]}\frac{w_{k,h,Z_i^P}(s,a)}{N_{k-1}((s,a)[Z_i^P])} = \sum_{k,h}\sum_{(s,a)[Z_i^P]\in L_{k,i}}\frac{w_{k,h,Z_i^P}(s,a)}{N_{k-1}((s,a)[Z_i^P])} + \sum_{k,h}\sum_{(s,a)[Z_i^P]\notin L_{k,i}}\frac{w_{k,h,Z_i^P}(s,a)}{N_{k-1}((s,a)[Z_i^P])}
\]
\[
\le 4X_i^P\log T + 8HX_i^P\log(10nX_i^PH/\delta) \le 8X_i^P\log T.
\]
In the first equality, we categorize the pairs $(s,a)$ by their value $(s,a)[Z_i^P]$ and sum over all possible choices of $(s,a)[Z_i^P]$, using $\sum_{(s_1,a_1)\in X_i((s,a)[Z_i^P])}\frac{w_{k,h}(s_1,a_1)}{w_{k,h,Z_i^P}(s,a)} = 1$. The first inequality applies Lemma F.3 to the good-set summation and the Cauchy–Schwarz inequality together with Lemma F.4 to the bad-set summation; the last inequality is due to the assumption that $X_i^P \ge H\log(10nX_i^PH/\delta)$.

For Inqs. 26 and 27, define $Z_{i,j}^P = Z_i^P\cup Z_j^P$. For the factored set $Z_{i,j}^P$, we similarly have
\[
\sum_{k,h}\sum_{(s,a)[Z_{i,j}^P]\in L_{k,(i,j)}}\frac{w_{k,h,Z_{i,j}^P}(s,a)}{N_k((s,a)[Z_{i,j}^P])} \le 4X_{i,j}^P\log T, \qquad \sum_{k,h}\sum_{(s,a)[Z_{i,j}^P]\notin L_{k,(i,j)}} w_{k,h,Z_{i,j}^P}(s,a) \le 8HX_{i,j}^P\log(10nX_{i,j}^PH/\delta).
\]
By the definition of $Z_{i,j}^P$, we know that $N_{k-1}((s,a)[Z_i^P]) \ge N_{k-1}((s,a)[Z_{i,j}^P])$ and $N_{k-1}((s,a)[Z_j^P]) \ge N_{k-1}((s,a)[Z_{i,j}^P])$. Therefore,
\[
\sum_{k,h}\sum_{(s,a)}\frac{w_{k,h}(s,a)}{\sqrt{N_{k-1}((s,a)[Z_i^P])N_{k-1}((s,a)[Z_j^P])}} \le \sum_{k,h}\sum_{(s,a)}\frac{w_{k,h}(s,a)}{N_{k-1}((s,a)[Z_{i,j}^P])}, \qquad \sum_{k,h}\sum_{(s,a)}\frac{w_{k,h}(s,a)}{\sqrt{N_{k-1}((s,a)[Z_i^P])}\,N_{k-1}((s,a)[Z_j^P])^{\frac14}} \le \sum_{k,h}\sum_{(s,a)}\frac{w_{k,h}(s,a)}{N_{k-1}((s,a)[Z_{i,j}^P])^{\frac34}},
\]
and the claims follow as for Inq. 25.

We use $\mathbb V_i$ and $\mathbb V_{[i:j]}$ as shorthand for $\mathbb V_{s'[i]\sim P_i(\cdot\mid (s,a)[Z_i^P])}$ and $\mathbb V_{s'[i:j]\sim P_{[i:j]}(\cdot\mid (s,a)[Z_{[i:j]}^P])}$. For the quantities w.r.t. the empirical transition $\hat P_k$, we use $\hat{\mathbb E}_k$ and $\hat{\mathbb V}_k$ to denote the corresponding expectation and variance; for example, $\mathbb E_{s'[i]\sim\hat P_{k,i}(\cdot\mid (s,a)[Z_i^P])}$ is denoted by $\hat{\mathbb E}_{k,i}$.

Lemma F.6. Under events $\Lambda_1$ and $\Lambda_2$, we have
\[
\big|\hat\sigma^2_{P,k,i}(V,s,a) - \sigma^2_{P,i}(V,s,a)\big| \le 4H^2\sum_{j=1}^{n}\Bigg(2\sqrt{\frac{|S_j|L^P}{N_{k-1}((s,a)[Z_j^P])}} + \frac{4|S_j|L^P}{3N_{k-1}((s,a)[Z_j^P])}\Bigg),
\]
where $V$ denotes some given function mapping from $S$ to $\mathbb R$.

Proof.
\[
\big|\hat\sigma^2_{P,k,i}(V,s,a) - \sigma^2_{P,i}(V,s,a)\big| = \big|\hat{\mathbb E}_{[1:i-1]}\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s') - \mathbb E_{[1:i-1]}\mathbb V_i\mathbb E_{[i+1:n]}V(s')\big| \tag{31}
\]
\[
\le \big|\hat{\mathbb E}_{[1:i-1]}\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s') - \mathbb E_{[1:i-1]}\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s')\big| \tag{32}
\]
\[
+ \big|\mathbb E_{[1:i-1]}\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s') - \mathbb E_{[1:i-1]}\mathbb V_i\mathbb E_{[i+1:n]}V(s')\big|. \tag{33}
\]
We bound Eqns. 32 and 33 separately. For Eqn. 32, we have
\[
\big|\hat{\mathbb E}_{[1:i-1]}\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s') - \mathbb E_{[1:i-1]}\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s')\big| = \Big|\sum_{s'[1:i-1]\in S[1:i-1]}\big(\hat P_{[1:i-1]} - P_{[1:i-1]}\big)(s'[1:i-1]\mid s,a)\,\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s')\Big|
\]
\[
\le \big\|\hat P_{[1:i-1]}(\cdot\mid s,a) - P_{[1:i-1]}(\cdot\mid s,a)\big\|_1\,\big\|\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s')\big\|_\infty \le H^2\sum_{j=1}^{i-1}\big\|\hat P_j(\cdot\mid s,a) - P_j(\cdot\mid s,a)\big\|_1 \le H^2\sum_{j=1}^{i-1}\Bigg(2\sqrt{\frac{|S_j|L^P}{N_{k-1}((s,a)[Z_j^P])}} + \frac{4|S_j|L^P}{3N_{k-1}((s,a)[Z_j^P])}\Bigg),
\]
where the last inequality is due to Lemma D.1. For Eqn. 33, given fixed $s'[1:i-1]$, we have
\[
\hat{\mathbb V}_i\hat{\mathbb E}_{[i+1:n]}V(s') - \mathbb V_i\mathbb E_{[i+1:n]}V(s') \le \Big|\hat{\mathbb E}_i\big(\hat{\mathbb E}_{[i+1:n]}V(s')\big)^2 - \mathbb E_i\big(\hat{\mathbb E}_{[i+1:n]}V(s')\big)^2\Big| + \Big|\mathbb E_i\big(\hat{\mathbb E}_{[i+1:n]}V(s')\big)^2 - \mathbb E_i\big(\mathbb E_{[i+1:n]}V(s')\big)^2\Big| + 2H\Big|\mathbb E_{[i:n]}V(s') - \hat{\mathbb E}_{[i:n]}V(s')\Big|.
\]
The first inequality is due to Inq. 38, and the second inequality is due to $\mathbb VX \le \mathbb EX^2$ for any random variable $X\in\mathbb R$. To sum up, we have
\[
\sigma_{P,i}(V^*_{h+1},s,a) - 2\hat\sigma_{P,k,i}(\bar V_{k,h+1},s,a) \le u_{k,h,i}(s,a) + 8H^2\sum_{j=1}^{n}\Bigg(2\sqrt{\frac{|S_j|L^P}{N_{k-1}((s,a)[Z_j^P])}} + \frac{4|S_j|L^P}{3N_{k-1}((s,a)[Z_j^P])}\Bigg).
\]

Lemma F.8. Under events $\Lambda_1$, $\Lambda_2$ and $\Omega$, with $\mathrm{Reg}(K) = \sum_{k=1}^{K}\bar V_{k,1}(s_1) - V^{\pi_k}_1(s_1)$, we have
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X} w_{k,h}(s,a)\big(\sigma^2_{P,i}(V^*_{h+1},s,a) - \sigma^2_{P,i}(V^{\pi_k}_{h+1},s,a)\big) \le 2H^2\,\mathrm{Reg}(K),
\]
\[
\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in X} w_{k,h}(s,a)\big(\sigma^2_{P,i}(\bar V_{k,h+1},s,a) - \sigma^2_{P,i}(V^{\pi_k}_{h+1},s,a)\big) \le 2H^2\,\mathrm{Reg}(K).
\]
Proof. We only prove the first inequality in detail; replacing $V^*_{h+1}$ with $\bar V_{k,h+1}$ proves the second inequality in the same manner. We have
\[
\sigma^2_{P,i}(V^*_{h+1},s,a) - \sigma^2_{P,i}(V^{\pi_k}_{h+1},s,a) = \mathbb E_{[1:i-1]}\big(\mathbb V_i\mathbb E_{[i+1:n]}V^*_{h+1}(s') - \mathbb V_i\mathbb E_{[i+1:n]}V^{\pi_k}_{h+1}(s')\big).
\]
Given fixed $s'[1:i-1]$, we bound the difference of the variances $\mathbb V_i\mathbb E_{[i+1:n]}V^*_{h+1}(s') - \mathbb V_i\mathbb E_{[i+1:n]}V^{\pi_k}_{h+1}(s')$:
\[
\mathbb V_i\mathbb E_{[i+1:n]}V^*_{h+1}(s') - \mathbb V_i\mathbb E_{[i+1:n]}V^{\pi_k}_{h+1}(s') = \mathbb E_i\Big[\big(\mathbb E_{[i+1:n]}V^*_{h+1}(s')\big)^2 - \big(\mathbb E_{[i+1:n]}V^{\pi_k}_{h+1}(s')\big)^2\Big] - \big(\mathbb E_{[i:n]}V^*_{h+1}(s')\big)^2 + \big(\mathbb E_{[i:n]}V^{\pi_k}_{h+1}(s')\big)^2
\]
\[
\le \mathbb E_i\Big[\big(\mathbb E_{[i+1:n]}V^*_{h+1}(s')\big)^2 - \big(\mathbb E_{[i+1:n]}V^{\pi_k}_{h+1}(s')\big)^2\Big] \le 2H\,\mathbb E_i\big[\mathbb E_{[i+1:n]}V^*_{h+1}(s') - \mathbb E_{[i+1:n]}V^{\pi_k}_{h+1}(s')\big] = 2H\,\mathbb E_{[i:n]}\big[V^*_{h+1}(s') - V^{\pi_k}_{h+1}(s')\big].
\]
The first inequality is due to $V^*_{h+1}(s') \ge V^{\pi_k}_{h+1}(s')$, and the second is due to $\mathbb E_{[i+1:n]}V^*_{h+1}(s') + \mathbb E_{[i+1:n]}V^{\pi_k}_{h+1}(s') \le 2H$. Taking the expectation over all $s'[1:i-1]$,
\[
\sigma^2_{P,i}(V^*_{h+1},s,a) - \sigma^2_{P,i}(V^{\pi_k}_{h+1},s,a) \le 2H\,\mathbb E_{[1:n]}\big[V^*_{h+1}(s') - V^{\pi_k}_{h+1}(s')\big].
\]
Plugging this inequality into the former equation, we have
\[
\sum_{k,h}\sum_{(s,a)} w_{k,h}(s,a)\big(\sigma^2_{P,i}(V^*_{h+1},s,a) - \sigma^2_{P,i}(V^{\pi_k}_{h+1},s,a)\big) \le \sum_{k,h}\sum_{(s,a)} w_{k,h}(s,a)\,2H\,\mathbb E_{s'\sim P(\cdot\mid s,a)}\big[V^*_{h+1}(s') - V^{\pi_k}_{h+1}(s')\big]
\]
\[
= \sum_{k=1}^{K}\sum_{h=2}^{H}\sum_{s\in S} 2w_{k,h}(s)H\big[V^*_h(s) - V^{\pi_k}_h(s)\big] \le \sum_{k=1}^{K} 2H^2\big[V^*_1(s_1) - V^{\pi_k}_1(s_1)\big] \le \sum_{k=1}^{K} 2H^2\big(\bar V_{k,1}(s_1) - V^{\pi_k}_1(s_1)\big) = 2H^2\,\mathrm{Reg}(K).
\]
For the second inequality in the chain, note that by Lemma E.15 of Dann et al. (2017),
\[
\sum_{s} w_{k,h}(s)\big[V^*_h(s) - V^{\pi_k}_h(s)\big] = \sum_{h_1=h}^{H}\sum_{s} w_{k,h_1}(s)\big[R(s,\pi^*(s)) - R(s,\pi_k(s)) + PV^*_{h_1+1}(s,\pi^*(s)) - PV^*_{h_1+1}(s,\pi_k(s))\big],
\]
\[
V^*_1(s_1) - V^{\pi_k}_1(s_1) = \sum_{h_1=1}^{H}\sum_{s} w_{k,h_1}(s)\big[R(s,\pi^*(s)) - R(s,\pi_k(s)) + PV^*_{h_1+1}(s,\pi^*(s)) - PV^*_{h_1+1}(s,\pi_k(s))\big],
\]
which together imply $\sum_{s} w_{k,h}(s)\big[V^*_h(s) - V^{\pi_k}_h(s)\big] \le V^*_1(s_1) - V^{\pi_k}_1(s_1)$ for any $k, h$.
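The key step in Lemma F.8, namely $\mathbb V[V^*] - \mathbb V[V^{\pi_k}] \le 2H\,\mathbb E[V^* - V^{\pi_k}]$ whenever $0 \le V^{\pi_k} \le V^* \le H$ pointwise, can be sanity-checked numerically on random value vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
H = 5.0
p = rng.random(6); p /= p.sum()                  # next-state distribution
v_pi = rng.random(6) * (H - 1.0)                 # V^{pi_k}_{h+1}, values in [0, H-1]
v_star = np.minimum(H, v_pi + rng.random(6))     # V^*_{h+1} >= V^{pi_k}_{h+1}, clipped at H

def var(p, v):
    # variance of v(s') under s' ~ p
    return float(p @ v**2 - (p @ v)**2)

lhs = var(p, v_star) - var(p, v_pi)              # variance gap
rhs = 2 * H * float(p @ (v_star - v_pi))         # 2H * expected value gap
```

Here `lhs <= rhs` always holds: the second-moment gap is $\mathbb E[(V^*-V^{\pi_k})(V^*+V^{\pi_k})] \le 2H\,\mathbb E[V^*-V^{\pi_k}]$, and dropping $-(\mathbb E V^*)^2 + (\mathbb E V^{\pi_k})^2 \le 0$ only helps.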

F.5 OPTIMISM AND PESSIMISM

Lemma F.9. Suppose that Λ 1 , Λ 2 and Ω k,h+1 happen, then we have the following inequalities hold for any a ∈ A and s ∈ S: R(s, a) -R(s, a) ≤ 1 m m i=1 CB R i (s, a) Pk V * h+1 (s, a) -PV * h+1 (s.a) ≤ n i=1 CB P i (s, a) Proof. The first inequality follows directly by Lemma F.1 and the definition of CB R i (s, a). For the second inequality, by Lemma F.1, we have Pk -P V * h+1 (s, a) ≤ n i=1   2σ 2 P,i (V * h+1 , s, a)L P N k-1 ((s, a)[Z P i ])) + 2HL P 3N k-1 ((s, a)[Z P i ])   + n i=1 n j =i,j=1 Hφ k,i (s, a)φ k,j (s, a), where φ k,i (s, a) = 4|Si|L P N k-1 ((s,a)[Z P i ]) + 4|Si|L P 3N k-1 ((s,a)[Z P i ]) . The first inequality is due to V k,h (s) ≥ Q k,h (s, π * (s)). By the definition of CB P i (s, a), we have i CB P i (s, a) -Pk -P V * h+1 (s, a) ≥ n i=1   4σ 2 P,k,i (V k,h+1 , s, a)L P N k-1 ((s, a * )[Z P i ]) - 2σ 2 P,i (V * h+1 , s, a * )L P N k-1 ((s, a)[Z P i ])   + n i=1 2u k,h,i (s, a)L P N k-1 ((s, a)[Z P i ]) + n i=1 16H 2 L P N k-1 ((s, a)[Z P i ]) n j=1   4|S j |L P N k-1 ((s, a)[Z P j ]) 1 4 + 4|S j |L P 3N k-1 (s, a)[Z P j ]   We bound Equ. 47, 48, 49 and 50 separately by Lemma F.13, Lemma F.14, Lemma F.15 and Lemma F.16. Combining the results of these Lemmas, we have Reg(K) ≤ C 1 1 m m i=1 X R i T L R i log T + C 2 n i=1 HT X P i L P log T + C 3 nH 2 Reg(K) n i=1 X P i L P log T Here C 1 , C 2 , C 3 denote some constants. Solving the Reg(K) in Inq 51, we can show that Reg(K) ≤ O   1 m m i=1 X R i T L R i log T + n i=1 HT X P i L P log T   , where O hides the lower order terms w.r.t T . By the optimism principle (Lemma F.12), we have V * 1 (s k,1 , a k,1 ) ≤ V k,1 (s k,1 , a k,1 ). This leads to the final result: K k=1 (V * 1 (s k,1 , a k,1 ) -V π k 1 (s k1 , a k,1 )) ≤ O   1 m m i=1 X R i T L R i log T + n i=1 HT X P i L P log T   . F.7 BOUNDING THE MAIN TERMS Lemma F.13. 
Under event Λ 1 , Λ 2 and Ω k,h , suppose Reg(K) = K k=1 V k,1 (s 1 ) -V π k 1 (s 1 ), we have K k=1 H h=1 s,a w k,h (s, a)CB k (s, a) ≤O   1 m m i=1 T X R i L R i log T + HT n i=1 X P i L P log T + H 2 n Reg(K) n i=1 X P i L P log T   Proof. By the definition of CB k (s, a), we have k,h,s,a w k,h (s, a)CB k (s, a) (52) = k,h,s,a w k,h (s, a) 1 m m i=1 CB R k,i (s, a) + n i=1 CB P k,i (s, a) (53) = k,h,s,a w k,h (s, a)   1 m m i=1 2σ 2 R,k,i (s, a)L R i N k-1 ((s, a)[Z R i ]) + n i=1 4σ 2 P,k,i (V k,h+1 , s, a)L P N k-1 ((s, a)[Z P i ])   (54) + k,h,s,a w k,h (s, a) 1 m m i=1 8L R i 3N k-1 ((s, a)[Z R i ]) + k,h,s,a w k,h (s, a) n i=1   16H 2 L P N k-1 ((s, a)[Z P i ]) n j=1   4|S j |L P N k-1 ((s, a)[Z P j ]) 1 4 + 4|S j |L P 3N k-1 (s, a)[Z P j ]     (56) + k,h,s,a w k,h (s, a) n i=1 n j =i,j=1 36H|S i ||S j |(L P ) 2 N k-1 ((s, a)[Z P i ])N k-1 ((s, a)[Z P j ]) + k,h,s,a w k,h (s, a) n i=1 2u k,h,i (s, a)L P N k-1 ((s, a)[Z P i ]) By Lemma F.5, the upper bound of Eqn. 55, 56 and 57 is O(T 1 4 ), which doesn't contribute to the main factor in the regret. We prove the upper bound of Eqn. 54 and Eqn. 58 in detail. By Lemma F.6, we have σ2 P,k,i (V k,h+1 , s, a) -σ 2 P,i (V k,h+1 , s, a) ≤ 4H 2 n j=1 2 |S j |L P N k-1 ((s, a)[Z P j ]) + 4|S j |L P 3N k-1 ((s, a)[Z P j ]) , Then Eqn. 54 can be bounded as 1 m m i=1 2σ 2 R,k,i (s, a)L R i N k-1 ((s, a)[Z R i ]) + n i=1 4σ 2 P,k,i (V k,h+1 , s, a)L P N k-1 ((s, a)[Z P i ]) ≤ 1 m m i=1 2σ 2 R,k,i (s, a)L R i N k-1 ((s, a)[Z R i ]) + n i=1 4σ 2 P,i (V k,h+1 , s, a)L P N k-1 ((s, a)[Z P i ]) + n i=1 4|σ 2 P,i (V k,h+1 , s, a) -σ2 P,k,i (V k,h+1 , s, a)|L P N k-1 ((s, a)[Z P i ]) ≤ 1 m m i=1 2σ 2 R,k,i (s, a)L R i N k-1 ((s, a)[Z R i ]) + n i=1 4σ 2 P,i (V k,h+1 , s, a)L P N k-1 ((s, a)[Z P i ]) + 8H n i=1 L P N k-1 ((s, a)[Z P i ]) n j=1   |S j |L P N k-1 ((s, a)[Z P j ]) 1 4 + 4|S j |L P 3N k-1 ((s, a)[Z P j ])   Similar with Eqn. 61, the summation of Eqn. 61 is upper bounded by O(T 1/4 ) by Lemma F.5. 
For Eqn. 59, we have k,h s,a w k,h (s, a) 1 m m i=1 2σ 2 R,k,i (s, a)L R i N k-1 ((s, a)[Z R i ]) ≤ k,h s,a w k,h (s, a) 1 m m i=1 2L R i N k-1 ((s, a)[Z R i ]) ≤ 1 m m i=1 k,h s,a w k,h (s, a) k,h s,a 2w k,h (s, a)L R i N k-1 ((s, a)[Z R i ]) ≤ 1 m m i=1 √ T k,h s,a 2w k,h (s, a)L R i N k-1 ((s, a)[Z R i ]) The first inequality is due to σ2 R,k,i (s, a) ≤ 1. The second inequality is due to Cauchy-Schwarz inequality. By Lemma F.5, the summation can be bounded by 1 m m i=1 X R i L R i T log T . For Eqn. 60, we have k,h s,a w k,h (s, a) n i=1 4σ 2 P,i (V k,h+1 , s, a)L P N k-1 ((s, a)[Z P i ]) ≤ k,h,s,a w k,h (s, a) n i=1 σ 2 P,i (V k,h+1 , s, a) • k,h,s,a w k,h (s, a) n i=1 4L P N k-1 ((s, a)[Z P i ]) ≤ k,h,s,a w k,h (s, a) n i=1 σ 2 P,i (V k,h+1 , s, a) n i=1 4X P i L P log T = k,h,s,a w k,h (s, a) n i=1 σ 2 P,i (V π k h+1 , s, a) n i=1 4X P i L P log T + k,h,s,a w k,h (s, a) n i=1 σ 2 P,i (V k,h+1 , s, a) -σ 2 P,i (V π k h+1 , s, a) n i=1 4X P i L P log T ≤ k,h,s,a w k,h (s, a) n i=1 σ 2 P,i (V π k h+1 , s, a) n i=1 4X P i L P + 2H 2 n Reg(K) n i=1 X P i L P log T ≤ k,h,s,a w k,h (s, a) 1 m 2 m i=1 σ 2 R,i (s, a) + n i=1 σ 2 P,i (V π k h+1 , s, a) n i=1 4X P i L P log T + 2H 2 n Reg(K) n i=1 X P i L P log T ≤ √ HT • n i=1 4X P i L P log T + 2H 2 n Reg(K) n i=1 X P i L P log T The first inequality is due to Cauchy-Schwarz inequality. The second inequality is due to Lemma F.5. The third inequality is due to Lemma F.8. The forth inequality is due to σ 2 R,k,i (s, a) ≥ 0, and the last inequality is because of Corollary 1.1. For Eqn. 58, we have k,h,s,a w k,h (s, a) n i=1 2u k,h,i (s, a)L P N k-1 ((s, a)[Z P i ]) ≤ n i=1 2L P   k,h,s,a 4w k,h (s, a) N k-1 ((s, a)[Z P i ])     k,h,s,a w k,h (s, a)u k,h,i (s, a)   ≤ n i=1 64X P i L P log T   k,h,s,a w k,h (s, a)u k,h,i (s, a)   By Lemma F.20, we know that the summation k,h,s,a w k,h (s, a)u k,h,i (s, a) is of order O(T 2 ). This means that Equ. 
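The bounds on Eqn. 59 and Eqn. 60 above instantiate the same recurring pattern: split the bonus sum with the Cauchy–Schwarz inequality, then apply the pigeonhole-style bound of Lemma F.5. Schematically, for the reward term (a restatement of the step above, up to constants):

```latex
\sum_{k,h,s,a} w_{k,h}(s,a)\,
  \sqrt{\frac{L^R_i}{N_{k-1}\big((s,a)[Z^R_i]\big)}}
\;\le\;
\underbrace{\sqrt{\textstyle\sum_{k,h,s,a} w_{k,h}(s,a)}}_{\le\,\sqrt{T}}
\cdot
\sqrt{\sum_{k,h,s,a} \frac{w_{k,h}(s,a)\,L^R_i}{N_{k-1}\big((s,a)[Z^R_i]\big)}}
\;\le\;
\sqrt{T}\cdot\sqrt{X^R_i\,L^R_i \log T},
```

where the last inequality is exactly Lemma F.5.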
62 is of order O(T 1 4 ), which doesn't contribute to the main term (O( √ T )). Lemma F.14. Under event Λ 1 , Λ 2 and Ω, suppose Reg(K) = K k=1 V k,1 (s 1 ) -V π k 1 (s 1 ), we have  w k,h (s, a) 2σ 2 P,i (V * h+1 , s, a)L P N k-1 ((s, a)[Z P i ]) + k h n i=1 (s,a)∈X w k,h (s, a) n j =i,j=1 36H|S i ||S j |(L P ) 2 N k-1 ((s, a)[Z P i ])N k-1 ((s, a)[Z P j ]) By Lemma F.5, the second term has only logarithmic dependence on T , which is negligible compared with the main factor. We mainly focus on the first term.  P i ]) ≤ k,h (s,a)∈X 2w k,h (s, a) n i=1 σ 2 P,i (V * h+1 , s, a) • k,h (s,a)∈X n i=1 w k,h (s, a)L P N k-1 ((s, π(s))[Z P i ]) ≤ k,h (s,a)∈X 2w k,h (s, a) 1 m m i=1 σ 2 R,i (s, a) + n i=1 σ 2 P,i (V * h+1 , s, a) • k,h (s,a)∈X n i=1 w k,h (s, a)L P N k-1 ((s, π(s))[Z P i ]) ≤ HT + nH 2 Reg(K) 8 n i=1 X P i L P log T The first inequality is due to Cauchy-Schwarz inequality. The second inequality is due to σ 2 R,k,i (s, a) ≥ 0. For the last inequality, the first part is the summation of the variance, which can be bounded by Lemma 1.1 and Lemma F.8, while the second part can be bounded as i X P i L P log T by Lemma F.5. Lemma F.15. Under event Λ 1 , Λ 2 and Ω, suppose Reg(K) = K k=1 V k,1 (s 1 ) -V π k 1 (s 1 ), we have K k=1 H h=1 (s,a)∈X w k,h (s, a) Pk -P V k,h+1 -V * h+1 (s, a) ≤ O   H n i=1 T X P j |S j |L P n i=1 2 8X P i L P log T   Proof. By Lemma E.1, we can prove that Pk -P V k,h+1 -V * h+1 (s, a) ≤ n i=1 ( Pk,i -P i )P [1:i-1] P [i+1:n] V k,h (s k,h , a k,h ) -V * h (s k,h , a k,h ) + n i=1 n j=1 H Pk,i -P i (•|(s, a)[Z P i ]) 1 Pk,j -P j (•|(s, a)[Z P j ]) 1 ≤ n i=1   2 s [i]∈Si P i (s [i]|X [Z P i ])L P N k-1 ((s, a)[Z P i ])   P [1:i-1] P [i+1:n] V k,h (s k,h , a k,h ) -V * h (s k,h , a k,h ) + n i=1 n j =i,j=1 36H |S i ||S j |(L P ) 2 N k-1 ((s, a)[Z P i ])N k-1 ((s, a)[Z P j ]) + n i=1 |S i |L P 3N k-1 ((s, a)[Z P i ]) The second inequality is due to Lemma D.1. 
We only focus on the summation of the first term, since the summation of other terms has only logarithmic dependence on T by Lemma F.5. k,h s,a w k,h (s, a) n i=1   2 s [i]∈Si P i (s [i]|(s, a)[Z P i ])L P N k-1 ((s, a)[Z P i ])   P [1:i-1] P i+1:n V k,h (s k,h , a k,h ) -V * h (s k,h , a k,h ) = n i=1 k,h s,a w k,h (s, a)P 1:i-1   2 s [i]∈Si P i (s [i]|X [Z P i ])L P P [i+1:n] V k,h (s k,h , a k,h ) -V * h (s k,h , a k,h ) 2 N k-1 ((s, a)[Z P i ])    ≤ n i=1 k,h P 1:i-1 s,a w k,h (s, a)   2 s [i]∈Si P i (s [i]|(s, a)[Z P i ])L P P [i+1:n] V k,h (s k,h , a k,h ) -V k,h (s k,h , a k,h ) 2 N k-1 ((s, a)[Z P i ])    ≤ n i=1 k,h s,a w k,h (s, a)   2 |S i |L P E [1:i] E [i+1:n] V k,h (s ) -V k,h (s ) 2 N k-1 ((s, a)[Z P i ])    ≤ n i=1 2   k,h,s,a w k,h (s, a) N k-1 ((s, a)[Z P i ])     k,h,s,a w k,h (s, a)E [1:i] E [i+1:n] V k,h+1 (s ) -V k,h+1 (s ) 2   ≤ n i=1 2 8X P i L P log T H 2 n i=1 T X P j |S j |L P The first inequality is due to the fact that V k,h ≥ V * h ≥ V k,h . The second and the third inequality is due to Cauchy-Schwarz inequality. The last inequality is because of Lemma F.5 and Lemma F.20. Lemma F.16. Under event Λ 1 , Λ 2 and Ω, we have k,h s h ,a h w k,h (s h , a h ) Rk (s h , a h ) -R(s h , a h ) ≤ O 1 m m i=1 T X R i L R i log T Proof. By Lemma F.1, we have k,h s h ,a h w k,h (s h , a h ) Rk (s h , a h ) -R(s h , a h ) ≤ 1 m n i=1 k,h s h ,a h w k,h (s h , a h ) 2σ R,k,i (s h , a h )L R i N k-1 ((s h , a h )[Z R i ]) + 8L R i 3N k-1 ((s h , a h )[Z R i ]) ≤ 1 m m i=1 k,h s h ,a h w k,h (s h , a h ) 2L R i N k-1 ((s h , a h )[Z R i ]) + 8L R i 3N k-1 ((s h , a h )[Z R i ]) ≤ 1 m m i=1 k,h s h ,a h 2w k,h (s h , a h )L R i N k-1 ((s h , a h )[Z R i ]) + 1 m m i=1 k,h s h ,a h w k,h (s h , a h ) 8L R i 3N k-1 ((s h , a h )[Z R i ]) The second inequality is due to σR,k,i (s, a) ≤ 1. The last inequality is due to Cauchy-Schwarz inequality. By Lemma F.5, we know that the summation is of order O 1 m m i=1 T X R i L R i log T . 
Lemma F.17. Under event Λ 1 and Λ 2 , we have V k,h -V k,h (s) ≤ E traj k   H i=h   2CB k (s, π k,i (s)) + n j=1 2H 2 log(18nT |X [Z P j ]|/δ) N k-1 ((s, π k,j (s))[Z P j ])   | s h = s, π k   The expectation is over all possible trajectories in episode k given s h = s following policy π k . Proof. V k,h -V k,h (s) =2CB k (s, π k,h (s)) + Pk (V k,h+1 (s) -V k,h+1 (s)) =2CB k (s, π k,h (s)) + ( Pk -P)(V k,h+1 -V k,h+1 )(s, π k,h (s)) + P(V k,h+1 -V k,h+1 )(s, π k,h (s)) • • • =E traj k H i=h 2CB k (s, π k,i (s i )) + ( Pk -P)(V k,i+1 -V k,i+1 )(s i , π k,i (s i )) | s h = s, π k The second term can be bounded as: ( Pk -P)(V k,i+1 -V k,i+1 )(s i , π k,i (s i )) ≤| Pk -P| 1 |V k,i+1 -V k,i+1 | ∞ (s i , π k,i (s)) ≤H n i=1 | Pk,i -P i | 1 (•|s i , π k,i (s i )) ≤ n j=1 2H 2 L P N k-1 ((s, π k,j (s))[Z P j ]) Lemma F.18. Under event Λ 1 and Λ 2 , we have K k=1 H h=1 (s,a)∈X w k,h (s, a)CB 2 k (s, a) ≤ m i=1 2(m + 2)H 2 L R i X R i log T m 2 + n i=1 128n(m + n)H 2 X P i L P n j=1 |S j |L P log T, Which has only logarithmic dependence on T . Note that this bound is loose w.r.t parameters such as H, |S j |, X P i and X R i . However, it is acceptable since we regard T as the dominant parameter. This bound doesn't influence the dominant factor in the final regret. Proof. By the definition of CB k (s, a), CB R k,i (s, a) and CB P k,i (s, a), we have CB 2 k (s, a) ≤(m + n) m i=1 1 m 2 CB R k,i (s, a) 2 + n i=1 CB P k,i (s, a) 2 ≤2(m + n) m i=1 1 m 2 2H 2 L R i N k-1 ((s, a)[Z R i ]) + 64(L R i ) 2 9(N k-1 ((s, a)[Z R i ])) 2 + 4n(m + n) n i=1 4H 2 L P N k-1 ((s, a)[Z P i ]) + 2HL P N k-1 ((s, a)[Z P i ]) ) + 4n(m + n) n i=1   32H 2 L P N k-1 ((s, a)[Z P i ]) n j=1 4|S j |L P N k-1 ((s, a)[Z P j ]) + 4|S j |L P 3N k-1 ((s, a)[Z P j ])   The second inequality is due to σ2 R,i (s, a) ≤ 1, σ2 P,i (V k,h+1 , s, a) ≤ H 2 and u k,h,i (s, a) ≤ H.

Now we are ready to bound

K k=1 H h=1 (s,a)∈X w k,h (s, a)CB 2 k (s, a): K k=1 H h=1 (s,a)∈X w k,h (s, a)CB 2 k (s, a) ≤ K k=1 H h=1 (s,a)∈X w k,h (s, a) m i=1 2(m + 2)H 2 L R i m 2 N k-1 ((s, a)[Z R i ]) + n i=1 128n(m + n)H 2 L P n j=1 |S j |L P N k-1 ((s, a)[Z P i ]) ≤ m i=1 2(m + 2)H 2 L R i X R i log T m 2 + n i=1 128n(m + n)H 2 X P i L P n j=1 |S j |L P log T The last inequality is due to Lemma F.5. Lemma F.19. Under event Λ 1 and Λ 2 , we have K k=1 H h=1 (s,a)∈X w k,h (s, a)E s [1:i]∼P [1:i] (•|s,a) E s [i+1:n]∼P [i+1:n] (•|s,a) V k,h+1 -V k,h+1 (s ) 2 ≤ O(log T ), Here O hides the dependence on other parameters such as H, X P i , X R i except T . Proof. For notation simplicity, we use E i and E [i:j] as a shorthand of E s [i]∼Pi(•|(s,a)[Z P i ]) and E s [i:j]∼P [i:j] (•|s,a) . k,h,s,a w k,h (s, a)E [1:i] E [i+1:n] V k,h+1 -V k,h+1 (s ) 2 ≤ k,h,s,a w k,h (s, a)E [1:i] E [i+1:n] V k,h+1 -V k,h+1 2 (s, a) = k,h,s,a w k,h (s, a)E [1:n] V k,h+1 -V k,h+1 2 (s, a) = K k=1 H h=1 s,a w k,h+1 (s, a) V k,h+1 -V k,h+1 2 (s, a) Define U k (s, a) = 2CB k (s, a) + n j=1 2H 2 L P N k-1 ((s,a)[Z P j ]) . By Lemma F.17, we have K k=1 H h=1 s,a w k,h+1 (s, a) V k,h+1 -V k,h+1 2 (s, a) ≤ k,h,s,a w k,h+1 (s, a)   H h1=h+1 s h 1 ,a h 1 Pr(s h1 , a h1 |s h+1 = s, a h+1 = a)U k (s h1 , a h1 )   2 ≤ k,h,s,a w k,h+1 (s, a)H H h1=h+1   s h 1 ,a h 1 Pr(s h1 , a h1 |s h+1 = s, a h+1 = a)U k (s h1 , a h1 )   2 ≤ k,h,s,a w k,h+1 (s, a)H H h1=h+1 s h 1 ,a h 1 Pr(s h1 , a h1 |s h+1 = s, a h+1 = a) (U k (s h1 , a h1 )) 2 = k,h H H h1=h+1 s h 1 ,a h 1 w k,h1 (s h1 , a h1 ) (U k (s h1 , a h1 )) 2 ≤ k,h H 2 s h ,a h w k,h (s h , a h ) (U k (s h , a h )) 2 Plugging the definition of U k (s, a) into Equ. 
63, we have: K k=1 H h=1 s,a w k,h+1 (s, a) V k,h+1 -V k,h+1 2 (s, a) ≤ k,h H 2 s h ,a h w k,h (s h , a h )   2CB k (s, a) + n j=1 2H 2 L P N k-1 ((s, a)[Z P j ])   2 (65) ≤ k,h 2nH 2 s h ,a h w k,h (s h , a h )   CB 2 k (s, a) + n j=1 2H 2 L P N k-1 ((s, a)[Z P j ])   (66) ≤ k,h 2nH 2 s h ,a h w k,h (s h , a h )CB 2 k (s, a) + 16nH 4 X P i L P log T The last inequality is due to Lemma F.5. We can bound k,h 2nH 2 s h ,a h w k,h (s h , a h )CB 2 k (s, a) by Lemma F.18. Summing up over all terms, we can show that K k=1 H h=1 s,a w k,h+1 (s, a) V k,h+1 -V k,h+1 2 (s, a) is of order O(1). Lemma F.20. Under event Λ 1 and Λ 2 , for any i ∈ [n],we have K k=1 H h=1 (s,a)∈X w k,h (s, a)u k,h,i (s, a) ≤ O(H 2 n j=1 T X P j |S j |L P ), Here O hides the lower order terms w.r.t. T . Proof. For notation simplicity, we use E i and E [i:j] as a shorthand of E s [i]∼Pi(•|(s,a)[Z P i ]) and E s [i:j]∼P [i:j] (•|(s,a)[Z P i ] ) . For those expectation w.r.t the empirical transition Pk , we use Êk to denote the corresponding expectation. u k,h,i (s, a) is defined as: u k,h,i (s, a) = Ê[1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 . k,h,s,a w k,h (s, a) Ê[1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 (68) = k,h,s,a w k,h (s, a)E [1:i] E [i+1:n] V k,h+1 -V k,h+1 (s ) 2 (69) + k,h,s,a w k,h (s, a) Ê[1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 - k,h,s,a w k,h (s, a)E [1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 (70) + k,h,s,a w k,h (s, a)E [1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 - k,h,s,a w k,h (s, a)E [1:i] E [i+1:n] V k,h+1 -V k,h+1 (s ) 2 (71)

That is

We can bound Eqn. 70 and Eqn. 71 by Lemma D.1. For Eqn. 70, we have  k,h,s,a w k,h (s, a) Ê[1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 - k,h,s,a w k,h (s, a)E [1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 ≤ k,h,s,a w k,h (s, a) P[1:i] (•|s, a) -P [1:i] (•|s, a) 1 H 2 ≤ k,h,s,a w k,h (s, a) i j=1 |S j |L P N k-1 ((s, a)[Z P j ]) H 2 ≤8H 2 i j=1 T X P j |S j |L P log T The first inequality is due to Ê[i+1:n] V k,h+1 -V k,h+1 k,h,s,a w k,h (s, a)E [1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) 2 - k,h,s,a w k,h (s, a)E [1:i] E [i+1:n] V k,h+1 -V k,h+1 (s ) 2 ≤2H k,h,s,a w k,h (s, a)E [1:i] Ê[i+1:n] V k,h+1 -V k,h+1 (s ) -E [i+1:n] V k,h+1 -V k,h+1 (s ) ≤2H k,h,s,a w k,h (s, a)E [1:i] H P[i+1:n] (•|s, a) -P [i+1:n] (•|s, a) 1 ≤2H 2 k,h,s,a w k,h (s, a)E [1:i]   n j=i+1 |S j |L P N k-1 ((s, a)[Z P j ])   =2H 2 k,h,s,a w k,h (s, a) n j=i+1 |S j |L P N k-1 ((s, a)[Z P j ]) ≤16H 2 n j=i+1

G PROOF OF THEOREM 3

Proof. We consider the following two hard instances. The first instance is an extension of the hard instance in Jaksch et al. (2010). They proposed a hard instance for non-factored weakly-communicating MDPs, which indicates that the lower bound in that setting is Ω(√(DSAT)). When transformed into a hard instance for non-factored episodic MDPs, it shows a lower bound of order Ω(√(H|X [Z P i ]|T)). A similar construction has been used to prove the lower bound for factored weakly-communicating MDPs (Xu & Tewari, 2020). The second hard instance is an extension of the hard instance for stochastic multi-armed bandits. The lower bound for stochastic multi-armed bandits shows that the regret of an MAB problem with k 0 arms in T 0 steps is lower bounded by Ω(√(k 0 T 0)). Consider a factored MDP instance with d = m = n and X [Z R i ] = X [Z P i ] = X i = S i × A i , i = 1, ..., n. There are m independent reward functions, each associated with an independent deterministic transition. For reward function i, there are log 2 (|S i |) levels of states, which form a binary tree of depth log 2 (|S i |). There are 2 h-1 states in level h, and thus |S i | - 1 states in total. Only the states in level log 2 (|S i |) have non-zero rewards, and there are |S i |/2 of them. After taking an action at a state s in level log 2 (|S i |), the agent transits back to state s in level log 2 (|S i |). That is to say, the agent can enter "reward states" at least H - log 2 (|S i |) ≥ H/2 times in one episode. For each reward function i, the instance can be regarded as an MAB problem with |S i A i |/2 arms running for KH/2 steps 1 , thus the regret for reward i is 1 The instance is not exactly an MAB with |S i A i |/2 arms running for KH/2 steps, since in each episode the agent will choose a state s′, and then stay in s′ and choose different actions for H/2 steps. However, this is a mild difference and we can still follow the same proof idea of the lower bound for MAB (see e.g. 
Theorem 14.1 in Lattimore & Szepesvári (2020)). The regret for reward i is therefore Ω(√(|S i A i |/2 · KH/2)) = Ω(√(|X [Z R i ]|T)). In this construction, the total reward function can be regarded as the average of m independent reward functions of m stochastic MDPs. This indicates that the lower bound is Ω((1/m) Σ m i=1 √(|X [Z R i ]| T)). To sum up, the regret is lower bounded by the maximum of the bounds given by the two instances.

We further explain the difference with two specific examples in Fig. H.1. No matter which setting the previous work considers, the main idea of the algorithms in Efroni et al. (2020); Brantley et al. (2020) is to explore the MDP environment, and then find a near-optimal policy satisfying that the expected cumulative cost is less than a constant vector B 0 , i.e. E[Σ h∈[H] c h ] ≤ B 0 . However, in our setting, the agent has to terminate the interaction once the total costs in this episode exceed the budget B. Because of this difference, their algorithms will converge to a sub-optimal policy with unbounded regret in our setting. In the first MDP instance (Fig. H.1), the agent starts from state s 0 . After taking action a 1 , it transits to s 1 with a deterministic cost c 1 = 0.5. After taking action a 2 , it transits to s 2 . The cost of taking a 2 is 0 with prob. 0.5, and 1 with prob. 0.5. There are no rewards in state s 0 . In states s 1 and s 2 , the agent does not suffer any costs. The deterministic rewards are 0.5 and 0.8 respectively. s 3 and s 4 are termination states. The budget B 0 is 0.5. For this MDP instance, the optimal policy is to take action a 1 in s 0 , since the agent can receive total reward 0.5 by taking a 1 . If taking action a 2 , the agent will terminate at state s 2 with no reward with prob. 0.5, which leads to an expected total reward of 0.4. However, if we run the algorithms in Efroni et al. (2020); Brantley et al. (2020), they will converge to the policy that always selects action a 2 in s 0 , since the expected cumulative cost of taking a 2 is 0.5 ≤ B 0 . We further show that the policies defined in the previous literature are not expressive enough in our setting. In the second instance, the agent starts in state s 0 with one action a 0 . 
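Spelling out the bandit-reduction arithmetic above (with T = KH and constants dropped):

```latex
\mathrm{Reg}_i
= \Omega\!\left(\sqrt{\tfrac{|S_i||A_i|}{2}\cdot \tfrac{KH}{2}}\right)
= \Omega\!\left(\tfrac{1}{2}\sqrt{|X[Z^R_i]|\,T}\right)
= \Omega\!\left(\sqrt{|X[Z^R_i]|\,T}\right),
\qquad
\mathrm{Reg}
= \Omega\!\left(\frac{1}{m}\sum_{i=1}^m \sqrt{|X[Z^R_i]|\,T}\right),
```

where the averaging over i reflects that the total reward is the average of the m independent reward factors.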
By taking a 0 , the agent transits to s 1 with no rewards. The cost of taking a 0 is 0 with prob. 0.5, and 1 with prob. 0.5. In s 1 , the agent needs to decide whether to take a 1 or a 2 , with deterministic costs of 0 and 0.5 respectively. After taking a 1 , the agent transits to s 2 , in which it can obtain a reward r 3 = 0.5, while by taking a 2 , the agent transits to s 3 and obtains a reward r 4 = 1. The budget B = 0.5. In this instance, the action taken in s 1 depends on the remaining budget of the agent. That is to say, the policy is not expressive enough if it is defined as a mapping from state to action. Instead, we need to define it as a mapping from both state and remaining budget to action. However, the previous literature only considers policies on the state space, which cannot deal with this problem. It is also difficult to obtain a regret guarantee without further assumptions and modification of the basic setting. If the cost distributions are continuous and the remaining budget can take any value in R, a small perturbation on the remaining budget may totally change the policy in the following steps and the optimal value. To be more specific, suppose the agent enters a certain state with remaining budget b. There are three actions to choose from, with costs of b - ε, b and b + ε, respectively. After suffering the cost, the agent achieves rewards of 0, 0.5 and 1 respectively. Note that we can construct such hard instances with ε extremely small. For these hard instances, we need to carefully estimate the value function for any remaining budget b ∈ R and the density functions of the costs, after which we can calculate the value through Bellman backups and find the optimal policy. However, estimating the density functions requires an infinite number of samples and makes the problem intractable. In other words, the "non-smoothness" of the value and the policy w.r.t. the remaining budget makes the problem difficult for continuous cost distributions without further assumptions. 
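The claimed gap in the first instance can be checked numerically. A minimal sketch, using the probabilities and rewards stated in the text (the function names are ours, not the paper's):

```python
# Numerical check of the first hard instance (Fig. 1).

def hard_budget_value(action, budget=0.5):
    """Expected total reward under the RLwK rule: the episode terminates
    as soon as the cumulative cost exceeds the budget."""
    if action == "a1":
        # deterministic cost 0.5 <= budget, then deterministic reward 0.5 at s1
        return 0.5
    # a2: cost 0 w.p. 0.5 (reach s2, reward 0.8); cost 1 w.p. 0.5
    # (budget exceeded, episode terminates with reward 0)
    return 0.5 * 0.8 + 0.5 * 0.0

def expected_cost(action):
    """The quantity constrained in Efroni et al. (2020); Brantley et al. (2020):
    both actions have expected cumulative cost 0.5."""
    return 0.5

# An expected-cost constraint E[cost] <= B0 = 0.5 admits a2 and prefers its
# larger reward 0.8, yet under the hard per-episode budget a2 yields only 0.4.
print(hard_budget_value("a1"), hard_budget_value("a2"))  # 0.5 0.4
```

This makes the failure mode concrete: an expected-cost algorithm sees a2 as feasible and better (reward 0.8), while its true hard-budget value 0.4 is below the 0.5 achieved by a1.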
This "non-smoothness" phenomenon also happens in the classical knapsack problem. There are two possible ways to remove these assumptions and apply our algorithm to continuous cost distributions with -net technique. The first idea is to allow the slight violation of the total budget constraints, with the maximum violation threshold δ, or we assume that the value of any initial state is lipschiz w.r.t the total budget B in a small neighborhood of B (With maximum L ∞ distance δ). In that case, we can tolerate the estimation error of each cost function to be at most δ H . -net technique with = δ H still works in this case and we can estimate the cost distribution with precision δ H . This modification is somewhat reasonable since the agent's policies are always "smooth" w.r.t the total budget in a small region near B in many real applications such as games and robotics. The second idea is to consider soft constraints. That is, when the budget constraints are violated, the agent will suffer a loss that is linear w.r.t the violation of the constraints. We assume the linear coefficient is relatively large compared with other parameters. This is also a possible method to remove the non-smoothness w.r.t the total budget, which has wide applications in constrained optimization.



i ]) + η k,h,i (s, a), i ∈ [n], where σP,k,i (s, a) and u k,h,i (s, a) are defined later. η k,h,i (s, a) collects the additional bonus terms that do not affect the order of the final regret. The precise expression of η k,h,i (s, a) is deferred to Section C. The following proofs of Inq. 26 and Inq. 27 share the same idea as the proof of Inq. 25.

F.4 TECHNICAL LEMMAS ABOUT VARIANCE

In this subsection, we prove several technical lemmas about variance. For notational simplicity, we use E i and E [i:j] as shorthand for E s [i]∼Pi(•|(s,a)[Z P i ]) and E s [i:j]∼P [i:j] (•|(s,a)[Z P [i:j] ]). Similarly, we



there exist functions {R i ∈ P(X [Z i ], [0, 1])} m i=1 such that r ∼ R(x) is equal to 1/m Σ m i=1 r i with each r i ∼ R i (x[Z i ]) individually observed. We use R̄i to denote the expectation E[R i ].

Definition 3. (Factored transition) The transition function class P ⊂ P(X, S) is factored over S × A = X = X 1 × ... × X d and S = S 1 × ... × S n with scopes Z 1 , ..., Z n if and only if, for all P ∈ P, x ∈ X, s ∈ S, there exist functions {P j ∈ P(X [Z j ], S j )} n j=1 such that P(s | x) = Π n j=1 P j (s[j] | x[Z j ]).

A factored MDP is an MDP with factored rewards and transitions. A factored MDP is fully characterized by M = {X i }, where X = S × A, {Z R

[i:j] (s[i : j] | s, a) to denote Π j k=i P k (s[k] | (s, a)[Z P k ]). For every V : S → R and the right-linear operator P, we define PV (s, a) def = E s′∼P(•|s,a) V (s′).

In this subsection, we tackle this problem by deriving the variance decomposition formula for factored MDPs. To begin with, we consider Markov chains with stochastic factored transitions and stochastic factored rewards, and deduce the Bellman equation of the variance for factored Markov chains. The analysis shows how to define the empirical variance in the confidence bonus for factored MDPs and gives an upper bound on the summation of the per-step variance (Corollary 1.1).

|s, a) with empirical mean value for all (s, a) ∈ K P
for horizon h = H, H - 1, ..., 1 do
    for all (s, a) ∈ S × A do
        if (s, a) ∈ K P then
            Qk,h (s, a) = min{H, Rk (s, a) + CB k (s, a) + Pk Vk,h+1 (s, a)}
        else
            Qk,h (s, a) = H
        end if
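A minimal dense-matrix sketch of the optimistic backward induction in the listing above, assuming tabular (S, A) arrays; the array layout and names are ours, and the sketch omits the factored structure of the estimates Pk and CB k:

```python
import numpy as np

def optimistic_value_iteration(R_hat, CB, P_hat, known, H):
    """Optimistic planning step, as in the listing above.
    R_hat, CB: (S, A) empirical mean rewards and confidence bonuses.
    P_hat: (S, A, S) empirical transition. known: (S, A) bool mask for K_P."""
    S, A = R_hat.shape
    V = np.zeros(S)                                # V_{k,H+1} = 0
    policy = np.zeros((H, S), dtype=np.int64)
    for h in range(H - 1, -1, -1):                 # h = H, H-1, ..., 1
        Q = np.where(known,
                     np.minimum(H, R_hat + CB + P_hat @ V),  # min{H, R + CB + P V}
                     float(H))                               # unknown pairs: Q = H
        policy[h] = np.argmax(Q, axis=1)
        V = np.max(Q, axis=1)
    return V, policy
```

The agent then acts greedily, a k,h = arg max a Qk,h (s k,h , a), as in the listing.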



H 2 for any given s [1 : i]. The second inequality is due to Lemma D.1. The third inequality is due to Lemma F.5. For Eqn. 71, similarly we have

HSAT) in the episodic setting (Azar et al. (2017); Jin et al. (2018)). Consider a factored MDP instance with d = m = n and X [Z R i ] = X [Z P i ] = X i = S i × A i , i = 1, ..., n. This factored MDP can be decomposed into n independent non-factored MDPs. By simply setting these n non-factored MDPs to be the construction used in Jaksch et al. (2010), the regret for each MDP is Ω(√(H|X [Z P i ]|T)). The total regret is Ω(Σ n i=1 √(H|X [Z P i ]|T)). Note that in our setting, the reward R = 1/m Σ m i=1 R i is [0, 1]-bounded. Therefore, we need to normalize the reward function in the hard instance by a factor of 1/m. This leads to a final lower bound of order Ω(

Figure 1: MDP Instances, Budget B 0 = 0.5

where Γ is an upper bound on |S j |.

ACKNOWLEDGEMENTS

This work was supported by Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008), National Key R&D Program of China (2018YFB1402600), BJNSF (L172037) and Beijing Academy of Artificial Intelligence.

Published as a conference paper at ICLR 2021

The first inequality is due to the definition of variance. The last inequality is due to E [i:n] V (s′) + Ê[i:n] V (s′) ≤ 2H. All of Eqns. 35, 36 and 37 can be bounded in the same manner as Eqn. 32. That is, we first bound each term by the L 1 -distance of the transition probabilities multiplied by the L ∞ -norm of the value function, and then upper bound the L 1 -distance by Lemma D.1. This leads to the following results. This bound does not depend on the given fixed s[1 : i - 1]. By taking expectation over s[1 : i - 1] ∼ P [1:i-1] (•|s, a), and combining with Eqn. 34, we have

Lemma F.7. Under event Λ 1 , Λ 2 and Ω, we have, where u k,h,i (s, a) is defined in Section 4.3:

Proof. We can decompose the difference in the following way. Now we only need to bound σ̂2. By Lemma 2 of Azar et al. (2017), we know that for two random variables X ∈ R and Y ∈ R, we have

That is, we mainly focus on the bound of Eqn. 42; by Lemma F.7, we have

Combining with Eqn. 41, we prove the claim. Optimism is proved by induction.

Lemma F.10. (Optimism) Suppose that Λ 1 , Λ 2 and Ω k,h+1 happen; then we have

The first inequality is due to the induction condition that Ω k,h+1 happens. The last inequality is due to Lemma F.9.

Lemma F.11. (Pessimism) Suppose that Λ 1 , Λ 2 and Ω k,h+1 happen; then we have

The inequality is due to V k,h+1 (s′) ≤ V * h+1 (s′), since event Ω k,h+1 happens. By Lemma F.9, we have

Therefore, we have

Lemma F.12. (Optimism and pessimism) Under event Λ 1 and Λ 2 , Ω k,h holds for all k and h.

Proof. By Lemma F.10 and Lemma F.11, through induction over all possible k, h, we can prove the lemma.

F.6 PROOF OF THEOREM 2

Proof. We decompose the regret in the classical way (Azar et al., 2017; Zanette & Brunskill, 2019; Dann et al., 2019), that is

H.2 ALGORITHM AND REGRET

We denote by V π h (s, b) the value function at state s and horizon h following policy π when the agent's remaining budget is b. For notational simplicity, we define P S P C V (s, a) = Σ s′,c0 P(s′|s, a)P(C(s, a) = c 0 |s, a)V (s′, b - c 0 ). We use P C,i (c 0 |s, a) to denote the "transition probability" of budget i, i.e. P(C i (s, a) = c 0 |s, a). The Bellman equation of our setting is written as:

Suppose N k (s, a) denotes the number of times (s, a) has been encountered in the first k episodes. We estimate the mean value of r(s, a) and the transition matrices P S and P C in the following way:

Following the definitions in the factored MDP setting, we define the confidence bonuses for rewards and transitions respectively (for

where L = log(2dSAT) + d log(mB) collects the logarithmic factors arising from union bounds. The additional d log(mB) is because we need to take a union bound over all possible budgets b. This difference compared with factored MDPs is mainly due to the noised offset model.

CB R k (s, a) is the confidence bonus for rewards, and σR (s, a) denotes the empirical variance of the reward R(s, a), which is defined as:

is the confidence bonus for the state transition estimation PS , and {CB P k,i (s, a)} i=1,...,d are the confidence bonuses for the budget transition estimations { PC,i } i=1,...,d . σ2 P,i (V k,h+1 , s, a) is the empirical variance of the corresponding transition: (s,a) is added to compensate for the error due to the difference between V * h+1 and V k,h+1 , where u k,h,i (s, a) is defined as:

We calculate the optimistic value function and find the optimal policy π via the following value iteration in our algorithm:
    Take action a k,h = arg max a Qk,h (s k,h , a)
end for
Update history trajectory L = L ∪ {s i , a i , r i , s i+1 } i=1,2,...,t k
end for

Proof. (Theorem 4) The proof follows almost the same framework as the proof of Thm. 2. The term log(SAT) + d log(Bm) is due to a union bound over all possible (T, s, a) and budgets b. 
This difference is because of the additional union bounds over all budget b.
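A hedged sketch of the budget-augmented Bellman backup described above, under Assumptions 1 and 2 (discrete costs on an integer grid); the discretization and all names are ours, not the paper's:

```python
import numpy as np

def budget_bellman_backup(V_next, P_S, P_C, R):
    """One step of the backup V_h(s, b) = max_a Q_h(s, a, b).
    V_next: (S, B) value at step h+1 over (state, remaining-budget index).
    P_S: (S, A, S) state transition. P_C: (S, A, C) distribution over integer
    costs c = 0, ..., C-1 (in grid units). R: (S, A) mean reward.
    Returns Q_h of shape (S, A, B)."""
    S, A, C = P_C.shape
    B = V_next.shape[1]
    Q = np.zeros((S, A, B))
    for b in range(B):
        for c in range(C):
            nb = b - c                 # remaining budget after paying cost c
            if nb >= 0:
                # P(C = c | s, a) * E_{s'} V_{h+1}(s', b - c)
                Q[:, :, b] += P_C[:, :, c] * (P_S @ V_next[:, nb])
            # nb < 0: budget exceeded, the episode terminates, contribution 0
    return R[:, :, None] + Q
```

This implements the operator P S P C V (s, a) = Σ s′,c0 P(s′|s, a)P(C(s, a) = c 0 |s, a)V (s′, b - c 0 ) from the Bellman equation above, plus the mean reward; it also makes explicit why the value is indexed by (state, remaining budget) rather than state alone.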

H.3 DISCUSSIONS ABOUT ASSUMPTION 1 AND ASSUMPTION 2

Assumptions 1 and 2 limit our Algorithm 4 to problems with discrete costs. One may wonder whether it is possible to construct an ε-net for the budget B and the possible values of the costs when these assumptions do not hold. In that case, we would only need to estimate the discrete cost distributions on the ε-net, and then apply Algorithm 4 to tackle the problem. Unfortunately, we find that the ε-net construction does not work for continuous cost distributions, and it is unlikely to achieve efficient

