NEAR-OPTIMAL REGRET BOUNDS FOR MODEL-FREE RL IN NON-STATIONARY EPISODIC MDPS

Abstract

We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes (MDPs). Both the reward functions and the state transition distributions are allowed to vary over time, either gradually or abruptly, as long as their cumulative variation magnitude does not exceed certain budgets. We propose an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), for this setting, which adopts a simple restarting strategy and an extra optimism term. Our algorithm outperforms the state-of-the-art (model-based) solution in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret of $\tilde{O}(S^{1/3} A^{1/3} \Delta^{1/3} H T^{2/3})$, where $S$ and $A$ are the numbers of states and actions, respectively, $\Delta > 0$ is the variation budget, $H$ is the number of steps per episode, and $T$ is the total number of steps. We further show that our algorithm is near-optimal by establishing an information-theoretical lower bound of $\Omega(S^{1/3} A^{1/3} \Delta^{1/3} H^{2/3} T^{2/3})$, which to the best of our knowledge is the first impossibility result in non-stationary RL in general.

1. INTRODUCTION

Reinforcement learning (RL) studies the class of problems where an agent maximizes its cumulative reward through sequential interaction with an unknown but fixed environment, usually modeled by a Markov Decision Process (MDP). At each time step, the agent takes an action, receives a random reward drawn from a reward function, and then the environment transitions to a new state according to an unknown transition kernel. In classical RL problems, the transition kernel and the reward functions are assumed to be time-invariant. This stationary model, however, cannot capture the phenomenon that in many real-world decision-making problems, the environment, including both the transition dynamics and the reward functions, is inherently evolving over time. Non-stationarity exists in a wide range of applications, including online advertisement auctions (Cai et al., 2017; Lu et al., 2019), dynamic pricing (Board, 2008; Chawla et al., 2016), traffic management (Chen et al., 2020), healthcare operations (Shortreed et al., 2011), and inventory control (Agrawal & Jia, 2019). Among the many intriguing applications, we specifically emphasize two research areas that can significantly benefit from progress on non-stationary RL, yet whose connections to it have been largely overlooked in the literature. The first one is sequential transfer in RL (Tirinzoni et al., 2020) or multi-task RL (Brunskill & Li, 2013). In this setting, the agent encounters a sequence of tasks over time with different system dynamics and reward functions, and seeks to bootstrap learning by transferring knowledge from previously-solved tasks. The second one is multi-agent reinforcement learning (MARL) (Littman, 1994), where a set of agents collaborate or compete in a shared environment. In MARL, since the transition and reward functions of the agents are coupled, the environment is non-stationary from each agent's own perspective, especially when the agents learn and update their policies simultaneously.
A more detailed discussion of how non-stationary RL can benefit sequential transfer, multi-task, and multi-agent RL is given in Appendix A.

Learning in a non-stationary MDP is highly non-trivial due to the following challenges. The first is the exploration vs. exploitation challenge inherited from standard (stationary) RL. An agent needs to explore the uncertain environment efficiently while maximizing its rewards along the way. Classical solutions in stationary RL oftentimes leverage the "optimism in the face of uncertainty" principle, which adopts an upper confidence bound to guide exploration. These bounds can be either an optimistic estimate of the state transition distributions in model-based solutions (Jaksch et al., 2010), or an optimistic estimate of the Q-values in model-free ones (Jin et al., 2018; Zhang et al., 2020). An additional challenge in non-stationary RL is the trade-off between remembering and forgetting. Since the system dynamics vary from one episode to another, all the information collected from previous interactions is essentially out-of-date and biased. In fact, it has been shown that a standard RL algorithm might incur a linear regret if the non-stationarity is not handled properly (Ortner et al., 2019). On the other hand, the agent does need to maintain a sufficient amount of information from history for future decision making, and learning what to remember becomes a further challenge. In this paper, we introduce an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), to address the aforementioned challenges in non-stationary RL.

Table 1: Dynamic regret comparisons for RL in non-stationary MDPs. S and A are the numbers of states and actions, L is the number of abrupt changes, D is the maximum diameter, H is the number of steps per episode, and T is the total number of steps. Gray cells denote results from this paper.
Our algorithm utilizes an extra optimism term for exploration, in addition to the standard Hoeffding/Bernstein-based bonus in the upper confidence bound, to counteract the non-stationarity of the MDP. This additional bonus term guarantees that our optimistic Q-value is still an upper bound of the optimal Q-value even when the environment changes. To address the second challenge, we adopt a simple but effective restarting strategy that resets the memory of the agent according to a calculated schedule. Similar strategies have also been considered in non-stationary bandits (Besbes et al., 2014) and non-stationary RL in the undiscounted setting (Jaksch et al., 2010; Ortner et al., 2019). The restarting strategy ensures that our algorithm only refers to the most up-to-date experience for decision-making. A further advantage of our algorithm is that RestartQ-UCB is model-free. Compared with model-based solutions, our model-free algorithm is more time- and space-efficient, flexible to use, and more compatible with the design of modern deep RL architectures.

Related Work. Dynamic regret of non-stationary RL has mostly been studied using model-based solutions. Jaksch et al. (2010) consider the setting where the MDP is allowed to change abruptly $L$ times, and achieve a regret of $\tilde{O}(S A^{1/2} L^{1/3} D T^{2/3})$, where $D$ is the maximum diameter of the MDP. A sliding-window approach is proposed in Gajane et al. (2018) under the same setting. Ortner et al. (2019) generalize the previous setting by allowing the MDP to vary either abruptly or gradually at every step, subject to a total variation budget of $\Delta$. Cheung et al. (2020) consider the same setting and develop a sliding-window algorithm with confidence widening. The authors also introduce a Bandit-over-RL technique that adaptively tunes the algorithm without knowing the variation budget. In a setting most similar to ours, Domingues et al. (2020) investigate non-stationary RL in the episodic setting.
They propose a kernel-based approach for the case where the state-action set forms a metric space, and their results reduce to an $\tilde{O}(S A^{1/2} \Delta^{1/3} H^{4/3} T^{2/3})$ regret in the tabular case. Fei et al. (2020) also consider the episodic setting, but they assume stationary transition kernels and adversarial (subject to some smoothness assumptions) full-information rewards. The authors propose two policy optimization algorithms, which are also the only model-free solutions that we are aware of in non-stationary RL. In contrast, we allow both the transition kernel and the reward function to change over time, and deal with bandit feedback, which makes the setting in Fei et al. (2020) not directly comparable to ours. Table 1 compares our regret bounds with existing results that tackle the same setting as ours. Interested readers are referred to Padakandla (2020) for a comprehensive survey of RL in non-stationary environments. We would also like to mention another related line of research that studies online/adversarial MDPs (Yu & Mannor, 2009; Neu et al., 2010; Arora et al., 2012; Yadkori et al., 2013; Dick et al., 2014; Wang et al., 2018; Lykouris et al., 2019; Jin et al., 2019), but these works mostly allow variations only in the reward functions, and use static regret as the performance metric. Finally, RL with low switching cost (Bai et al., 2019) shares a similar spirit with our restarting strategy, since it also periodically forgets previous experiences. However, such algorithms do not explicitly address the non-stationarity of the environment, and it is non-trivial to analyze their dynamic regret in terms of the variation budget. Non-stationarity has also been considered in bandit problems. Under various non-stationary multi-armed bandit (MAB) settings, methods including decaying memory and sliding windows (Garivier & Moulines, 2011; Keskin & Zeevi, 2017), as well as restart-based strategies (Auer et al., 2002; Besbes et al., 2014; Allesiardo et al., 2017), have been proposed.
These methods largely inspired later research on non-stationary RL. A more recent line of work develops methods that do not require prior knowledge of the variation budget (Karnin & Anava, 2016; Cheung et al., 2019a) or the number of abrupt changes (Auer et al., 2019). Other related settings considered in the literature include Markovian bandits (Tekin & Liu, 2010; Ma, 2018), non-stationary contextual bandits (Luo et al., 2018; Chen et al., 2019), linear bandits (Cheung et al., 2019b; Zhao et al., 2020), continuous-armed bandits (Mao et al., 2020), and bandits with slowly changing rewards (Besbes et al., 2019).

Contributions. First, we propose RestartQ-UCB, the first model-free RL algorithm in the general setting of non-stationary MDPs, where both the transition kernels and the reward functions are allowed to vary over time. Second, we provide a dynamic regret analysis for RestartQ-UCB, and show that it outperforms even the model-based state of the art. Third, we establish the first lower bounds in non-stationary RL, which suggest that our algorithm is optimal in all parameter dependences except for an $H^{1/3}$ factor, where $H$ is the episode length. In the main text of this paper, we present and analyze a simpler version of RestartQ-UCB with a Hoeffding-style bonus term. Replacing the Hoeffding term with a Freedman-style one leads to a tighter regret bound, but the analysis is more involved. For clarity of presentation, we defer the exposition and analysis of the Freedman-based algorithm to the appendices. All missing proofs in the paper can also be found in the appendices.

2. PRELIMINARIES

Model: We consider an episodic RL setting where an agent interacts with a non-stationary MDP for $M$ episodes, with each episode containing $H$ steps. We use a pair of integers $(m, h)$ as a time index to denote the $h$-th step of the $m$-th episode. The environment can be denoted by a tuple $(\mathcal{S}, \mathcal{A}, H, P, r)$, where $\mathcal{S}$ is the finite set of states with $|\mathcal{S}| = S$, $\mathcal{A}$ is the finite set of actions with $|\mathcal{A}| = A$, $H$ is the number of steps in one episode, $P = \{P^m_h\}_{m \in [M], h \in [H]}$ is the set of transition kernels, and $r = \{r^m_h\}_{m \in [M], h \in [H]}$ is the set of mean reward functions. At time $(m, h)$, the agent in state $s^m_h$ takes an action $a^m_h$, receives a random reward with mean $r^m_h(s^m_h, a^m_h)$, and the environment transitions to the next state $s^m_{h+1} \sim P^m_h(\cdot \mid s^m_h, a^m_h)$. It is worth emphasizing that the transition kernels and the mean reward functions depend both on $m$ and $h$, and hence the environment is non-stationary over time. The episode ends when $s^m_{H+1}$ is reached. We further denote $T = MH$ as the total number of steps. A deterministic policy $\pi : [M] \times [H] \times \mathcal{S} \to \mathcal{A}$ is a mapping from the time index and state space to the action space, and we let $\pi^m_h(s)$ denote the action chosen in state $s$ at time $(m, h)$. Define $V^{m,\pi}_h : \mathcal{S} \to \mathbb{R}$ to be the value function under policy $\pi$ at time $(m, h)$, i.e., $V^{m,\pi}_h(s) \stackrel{\text{def}}{=} \mathbb{E}\left[\sum_{h'=h}^{H} r^m_{h'}(s_{h'}, \pi^m_{h'}(s_{h'})) \mid s_h = s\right]$, where $s_{h'+1} \sim P^m_{h'}(\cdot \mid s_{h'}, a_{h'})$. Accordingly, the state-action value function $Q^{m,\pi}_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined as $Q^{m,\pi}_h(s, a) \stackrel{\text{def}}{=} r^m_h(s, a) + \mathbb{E}\left[\sum_{h'=h+1}^{H} r^m_{h'}(s_{h'}, \pi^m_{h'}(s_{h'})) \mid s_h = s, a_h = a\right]$. For simplicity of notation, we let $P^m_h V_{h+1}(s, a) \stackrel{\text{def}}{=} \mathbb{E}_{s' \sim P^m_h(\cdot \mid s, a)}[V_{h+1}(s')]$. Then, the Bellman equation gives $V^{m,\pi}_h(s) = Q^{m,\pi}_h(s, \pi^m_h(s))$ and $Q^{m,\pi}_h(s, a) = (r^m_h + P^m_h V^{m,\pi}_{h+1})(s, a)$, and we also have $V^{m,\pi}_{H+1}(s) = 0, \forall s \in \mathcal{S}$ by definition. Since the state space, the action space, and the length of each episode are all finite, there always exists an optimal policy $\pi^\star$ that gives the optimal value $V^{m,\star}_h(s) \stackrel{\text{def}}{=} V^{m,\pi^\star}_h(s) = \sup_\pi V^{m,\pi}_h(s), \forall s \in \mathcal{S}, m \in [M], h \in [H]$. From the Bellman optimality equation, we have $V^{m,\star}_h(s) = \max_{a \in \mathcal{A}} Q^{m,\star}_h(s, a)$, where $Q^{m,\star}_h(s, a) \stackrel{\text{def}}{=} (r^m_h + P^m_h V^{m,\star}_{h+1})(s, a)$, and $V^{m,\star}_{H+1}(s) = 0, \forall s \in \mathcal{S}$.
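To make the backward-induction structure of these definitions concrete, the following is a minimal sketch (our own illustration, not from the paper; the function name and array conventions are hypothetical) that computes $V^{m,\star}$ and $Q^{m,\star}$ for a single episode of a tabular non-stationary MDP via the Bellman optimality equation:

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for one episode m of a non-stationary tabular MDP.

    P: array of shape (H, S, A, S), with P[h, s, a, s'] = P^m_h(s' | s, a).
    r: array of shape (H, S, A), mean rewards r^m_h(s, a).
    Returns (V, Q) with V[h, s] = V^{m,*}_h(s) and Q[h, s, a] = Q^{m,*}_h(s, a),
    using 0-indexed steps h = 0, ..., H-1.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))          # V^{m,*}_{H+1} = 0 by definition
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):    # h = H, H-1, ..., 1 (in paper indexing)
        # Q^{m,*}_h = r^m_h + P^m_h V^{m,*}_{h+1}
        Q[h] = r[h] + P[h] @ V[h + 1]
        # V^{m,*}_h(s) = max_a Q^{m,*}_h(s, a)
        V[h] = Q[h].max(axis=1)
    return V[:H], Q
```

Note that because the kernels and rewards carry the episode index $m$, this backward pass would be repeated (conceptually) with a different $(P, r)$ for every episode.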
Dynamic Regret: The agent aims to maximize the cumulative expected reward over the entire $M$ episodes by adopting some policy $\pi$. We measure the optimality of the policy $\pi$ in terms of its dynamic regret (Cheung et al., 2020; Domingues et al., 2020), which compares the agent's policy with the optimal policy of each individual episode in hindsight: $R(\pi, M) \stackrel{\text{def}}{=} \sum_{m=1}^{M} \left( V^{m,\star}_1(s^m_1) - V^{m,\pi}_1(s^m_1) \right)$, where the initial state $s^m_1$ of each episode is chosen by an adversary (and more specifically, by an oblivious adversary (Zhang et al., 2020)). Dynamic regret is a stronger measure than the standard (static) regret, which only considers the single policy that is optimal over all episodes combined. Variation: We measure the non-stationarity of the MDP in terms of its variation in the mean reward functions and transition kernels: $\Delta_r \stackrel{\text{def}}{=} \sum_{m=1}^{M-1} \sum_{h=1}^{H} \sup_{s,a} |r^m_h(s, a) - r^{m+1}_h(s, a)|$ and $\Delta_p \stackrel{\text{def}}{=} \sum_{m=1}^{M-1} \sum_{h=1}^{H} \sup_{s,a} \|P^m_h(\cdot \mid s, a) - P^{m+1}_h(\cdot \mid s, a)\|_1$, where $\|\cdot\|_1$ is the $L_1$-norm. Note that our definition of variation only imposes restrictions on the summation of non-stationarity across two different episodes, and does not put any restriction on the difference between two consecutive steps in the same episode; that is, $P^m_h(\cdot \mid s, a)$ and $P^m_{h+1}(\cdot \mid s, a)$ are allowed to be arbitrarily different. We further let $\Delta = \Delta_r + \Delta_p$, and assume $\Delta > 0$.
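When the MDP is given in tabular form, the two variation measures are straightforward to compute; the following small sketch (our own illustration, with hypothetical array conventions) evaluates both sums directly:

```python
import numpy as np

def variation_budgets(P, r):
    """Variation of a non-stationary tabular MDP.

    P: shape (M, H, S, A, S), transition kernels P^m_h(s' | s, a).
    r: shape (M, H, S, A), mean rewards r^m_h(s, a).
    Returns (Delta_r, Delta_p): the sup over (s, a) of the change between
    consecutive episodes, summed over m = 1..M-1 and h = 1..H.
    """
    # |r^m_h(s,a) - r^{m+1}_h(s,a)|: max over (s, a), then sum over (m, h)
    delta_r = np.abs(r[1:] - r[:-1]).max(axis=(2, 3)).sum()
    # L1 distance between transition rows: sum over s', max over (s, a),
    # then sum over (m, h)
    delta_p = np.abs(P[1:] - P[:-1]).sum(axis=4).max(axis=(2, 3)).sum()
    return delta_r, delta_p
```

As the text notes, only the change across episodes enters these budgets; changes across steps $h$ within the same episode are unconstrained.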

3. ALGORITHM: RESTARTQ-UCB

We present our algorithm, Restarted Q-Learning with Hoeffding Upper Confidence Bounds (RestartQ-UCB Hoeffding), in Algorithm 1. Replacing the Hoeffding-style upper confidence bound in Algorithm 1 with a Freedman-style one leads to a tighter regret bound, but for clarity of exposition, the latter version is deferred to Algorithm 2 in Appendix C. RestartQ-UCB breaks the $M$ episodes into $D$ epochs, with each epoch containing $K = \lceil \frac{M}{D} \rceil$ episodes (except for the last epoch, which possibly has fewer than $K$ episodes). The optimal value of $D$ (and hence $K$) will be specified later in our analysis. RestartQ-UCB periodically restarts a Q-learning algorithm with UCB exploration at the beginning of each epoch, thereby addressing the non-stationarity of the environment. For each $d \in [D]$, define the local variation of the rewards within epoch $d$ as $\Delta^{(d)}_r \stackrel{\text{def}}{=} \sum_{m=(d-1)K+1}^{\min\{dK, M\}-1} \sum_{h=1}^{H} \sup_{s,a} |r^m_h(s, a) - r^{m+1}_h(s, a)|$, and define $\Delta^{(d)}_p$ analogously. Since our algorithm essentially invokes the same procedure for every epoch, in the following we focus our analysis on what happens inside one epoch only (and without loss of generality, we focus on epoch 1, which contains episodes $1, 2, \ldots, K$). At the end of our analysis, we will merge the results across all epochs.
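The epoch partition is simple bookkeeping; as a sketch (our own hypothetical helper, assuming $K = \lceil M/D \rceil$ so that only the last epoch can be shorter):

```python
def epoch_episodes(M, D):
    """Partition episodes 1..M into D epochs of K = ceil(M / D) episodes each;
    the last epoch may contain fewer than K. Returns a list of ranges, where
    epoch d covers episodes (d-1)K + 1, ..., min(dK, M)."""
    K = -(-M // D)  # ceiling division
    return [range((d - 1) * K + 1, min(d * K, M) + 1) for d in range(1, D + 1)]
```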
Algorithm 1: RestartQ-UCB (Hoeffding)
 1: for epoch $d \leftarrow 1$ to $D$ do
 2:   Initialize: $V_h(s) \leftarrow H - h + 1$; $Q_h(s, a) \leftarrow H - h + 1$; $N_h(s, a) \leftarrow 0$; $\check{N}_h(s, a) \leftarrow 0$; $\check{r}_h(s, a) \leftarrow 0$; $\check{v}_h(s, a) \leftarrow 0$, for all $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$;
 3:   for episode $k \leftarrow (d-1)K + 1$ to $\min\{dK, M\}$ do
 4:     Observe $s^k_1$;
 5:     for step $h \leftarrow 1$ to $H$ do
 6:       Take action $a^k_h \leftarrow \arg\max_a Q_h(s^k_h, a)$, receive $R^k_h(s^k_h, a^k_h)$, and observe $s^k_{h+1}$;
 7:       $\check{r}_h(s^k_h, a^k_h) \leftarrow \check{r}_h(s^k_h, a^k_h) + R^k_h(s^k_h, a^k_h)$; $\check{v}_h(s^k_h, a^k_h) \leftarrow \check{v}_h(s^k_h, a^k_h) + V_{h+1}(s^k_{h+1})$;
 8:       $N_h(s^k_h, a^k_h) \leftarrow N_h(s^k_h, a^k_h) + 1$; $\check{N}_h(s^k_h, a^k_h) \leftarrow \check{N}_h(s^k_h, a^k_h) + 1$;
 9:       if $N_h(s^k_h, a^k_h) \in \mathcal{L}$ then  // reaching the end of the stage
10:         $b^k_h \leftarrow \sqrt{\frac{H^2}{\check{N}_h(s^k_h, a^k_h)} \iota} + \sqrt{\frac{1}{\check{N}_h(s^k_h, a^k_h)} \iota}$;
11:         $b_\Delta \leftarrow \Delta^{(d)}_r + H \Delta^{(d)}_p$;
12:         $Q_h(s^k_h, a^k_h) \leftarrow \min\left\{\frac{\check{r}_h(s^k_h, a^k_h)}{\check{N}_h(s^k_h, a^k_h)} + \frac{\check{v}_h(s^k_h, a^k_h)}{\check{N}_h(s^k_h, a^k_h)} + b^k_h + 2 b_\Delta,\ Q_h(s^k_h, a^k_h)\right\}$;  ($\ast$)
13:         $V_h(s^k_h) \leftarrow \max_a Q_h(s^k_h, a)$;
14:         $\check{N}_h(s^k_h, a^k_h) \leftarrow 0$; $\check{r}_h(s^k_h, a^k_h) \leftarrow 0$; $\check{v}_h(s^k_h, a^k_h) \leftarrow 0$;

For each triple $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$, we divide the visitations (within epoch 1) of the triple into multiple stages, where the lengths of the stages increase exponentially at a rate of $(1 + \frac{1}{H})$. Specifically, let $e_1 = H$ and $e_{i+1} = \lfloor (1 + \frac{1}{H}) e_i \rfloor$, $i \ge 1$, denote the lengths of the stages. Further, let the partial sums $\mathcal{L} \stackrel{\text{def}}{=} \{\sum_{i=1}^{j} e_i \mid j = 1, 2, 3, \ldots\}$ denote the set of the ending times of the stages. We remark that the stages are defined for each individual triple $(s, a, h)$, and for different triples the starting and ending times of their stages do not necessarily align in time. Recall that the time index $(k, h)$ represents the $h$-th step of the $k$-th episode. At each step $(k, h)$, we take the optimal action with respect to the optimistic $Q_h(s, a)$ value (Line 6 in Algorithm 1), which is designed as an optimistic estimate of the optimal $Q^{k,\star}_h(s, a)$ value of the corresponding episode.
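The stage schedule above can be sketched as follows (our own illustration; we take floors so that stage lengths are integers, which keeps the growth rate at roughly $1 + \frac{1}{H}$):

```python
import math

def stage_endpoints(H, n_max):
    """The set L of stage-ending visit counts for one (s, a, h) triple:
    stage lengths e_1 = H, e_{i+1} = floor((1 + 1/H) e_i), accumulated
    until the partial sum exceeds n_max visits."""
    ends, e, total = [], H, 0
    while total <= n_max:
        total += e
        ends.append(total)
        e = math.floor((1 + 1 / H) * e)
    return ends
```

For instance, with $H = 10$ the stage lengths start as $10, 11, 12, \ldots$, so the Q-value of a triple is refreshed after its 10th, 21st, 33rd, ... visits, always using only the samples of the stage that just ended.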
For each triple $(s, a, h)$, we update the optimistic $Q_h(s, a)$ value at the end of each stage, using samples only from the latest stage that is about to end (Line 12 in Algorithm 1). The optimism in $Q_h(s, a)$ comes from two bonus terms, $b^k_h$ and $b_\Delta$: $b^k_h$ is a standard Hoeffding-based optimism term that is commonly used in upper confidence bounds (Jin et al., 2018; Zhang et al., 2020), and $b_\Delta$ is the extra optimism (Cheung et al., 2020) that we need in order to account for the non-stationarity of the environment. The definition of $b_\Delta$ requires knowledge of the local variation budget in each epoch, or at least an upper bound of it. The same assumption has also been made in Ortner et al. (2019). Fortunately, in our method, we can show (in Theorem 2) that if we simply replace Equation ($\ast$) in Algorithm 1 with the following update rule:

$Q_h(s^k_h, a^k_h) \leftarrow \min\left\{\frac{\check{r}_h(s^k_h, a^k_h)}{\check{N}_h(s^k_h, a^k_h)} + \frac{\check{v}_h(s^k_h, a^k_h)}{\check{N}_h(s^k_h, a^k_h)} + b^k_h,\ Q_h(s^k_h, a^k_h)\right\}, \qquad (1)$

then we can achieve the same regret bound without the assumption on the local variation budget. Throughout, we set $\iota \stackrel{\text{def}}{=} \log \frac{2}{\delta}$, where $\delta$ is the failure probability.
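Putting the pieces together, here is a compact sketch of one epoch of the Hoeffding-style variant. It is our own illustration, not the authors' code: the `env.reset`/`env.step` simulator interface is hypothetical, and passing `b_delta = 0` corresponds to using the budget-free update rule (1) instead of ($\ast$):

```python
import math
import numpy as np

def restartq_ucb_epoch(env, S, A, H, K, iota, b_delta=0.0):
    """One epoch of RestartQ-UCB (Hoeffding), loosely following Algorithm 1.

    `env.reset()` returns the initial state and `env.step(h, s, a)` returns
    (reward, next_state); both are hypothetical interfaces. `b_delta` is the
    extra optimism Delta_r^(d) + H * Delta_p^(d) for this epoch.
    """
    V = np.zeros((H + 1, S))
    for h in range(H):
        V[h] = H - h                  # optimistic init: V_h = H - h + 1 (1-indexed)
    Q = np.tile(V[:H, :, None], (1, 1, A))
    N = np.zeros((H, S, A), dtype=int)       # total visit counts
    n_chk = np.zeros((H, S, A), dtype=int)   # visits in the current stage
    r_chk = np.zeros((H, S, A))              # stage reward accumulator
    v_chk = np.zeros((H, S, A))              # stage next-value accumulator
    # stage-ending counts: e_1 = H, e_{i+1} = floor((1 + 1/H) e_i)
    ends, e, tot = set(), H, 0
    while tot < K * H:
        tot += e
        ends.add(tot)
        e = math.floor((1 + 1 / H) * e)

    for k in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))              # greedy w.r.t. optimistic Q
            reward, s_next = env.step(h, s, a)
            r_chk[h, s, a] += reward
            v_chk[h, s, a] += V[h + 1, s_next]
            N[h, s, a] += 1
            n_chk[h, s, a] += 1
            if N[h, s, a] in ends:                   # end of a stage: update Q
                m = n_chk[h, s, a]
                b = math.sqrt(H * H * iota / m) + math.sqrt(iota / m)
                Q[h, s, a] = min(
                    r_chk[h, s, a] / m + v_chk[h, s, a] / m + b + 2 * b_delta,
                    Q[h, s, a])
                V[h, s] = Q[h, s].max()
                n_chk[h, s, a] = 0
                r_chk[h, s, a] = 0.0
                v_chk[h, s, a] = 0.0
            s = s_next
    return Q, V
```

The outer restarting loop of Algorithm 1 would simply call this routine once per epoch $d$, each time with freshly initialized statistics and with that epoch's `b_delta`.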

4. ANALYSIS

In this section, we present our main result: a dynamic regret analysis of the RestartQ-UCB algorithm. Our first result, on RestartQ-UCB with Hoeffding-style bonus terms, is summarized in the following theorem. The complete proofs of its supporting lemmas are given in Appendix B.

Theorem 1. (Hoeffding) For $T = \Omega(SA\Delta H^2)$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the dynamic regret of RestartQ-UCB with Hoeffding bonuses (Algorithm 1) is bounded by $\tilde{O}(S^{1/3} A^{1/3} \Delta^{1/3} H^{5/3} T^{2/3})$, where $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $T$ and $1/\delta$.

Recall that we focus our analysis on epoch 1, with episode indices ranging from 1 to $K$. We start with the following technical lemma, stating that for any triple $(s, a, h)$, the difference of its optimal Q-values at two different episodes $1 \le k_1 < k_2 \le K$ is bounded by the variation of this epoch.

Lemma 1. For any triple $(s, a, h)$ and any $1 \le k_1 < k_2 \le K$, it holds that $|Q^{k_1,\star}_h(s, a) - Q^{k_2,\star}_h(s, a)| \le \Delta^{(1)}_r + H \Delta^{(1)}_p$.

We now define a few notations to facilitate the analysis. Denote by $s^k_h$ and $a^k_h$, respectively, the state and action taken at step $h$ of episode $k$. Let $N^k_h(s, a)$, $\check{N}^k_h(s, a)$, $Q^k_h(s, a)$ and $V^k_h(s)$ denote, respectively, the values of $N_h(s, a)$, $\check{N}_h(s, a)$, $Q_h(s, a)$ and $V_h(s)$ at the beginning of the $k$-th episode in Algorithm 1. Further, for the triple $(s^k_h, a^k_h, h)$, let $n^k_h$ be the total number of episodes in which this triple has been visited prior to the current stage, and let $l^k_{h,i}$ denote the index of the episode in which this triple was visited the $i$-th time among the total $n^k_h$ times. Similarly, let $\check{n}^k_h$ denote the number of visits to the triple $(s^k_h, a^k_h, h)$ in the stage right before the current stage, and let $\check{l}^k_{h,i}$ be the $i$-th episode among these $\check{n}^k_h$ episodes. For simplicity, we use $l_i$ and $\check{l}_i$ to denote $l^k_{h,i}$ and $\check{l}^k_{h,i}$, and $\check{n}$ to denote $\check{n}^k_h$, when $h$ and $k$ are clear from the context.
We also use $\check{r}_h(s, a)$ and $\check{v}_h(s, a)$ to denote the values of $\check{r}_h(s^k_h, a^k_h)$ and $\check{v}_h(s^k_h, a^k_h)$ used when updating the $Q_h(s^k_h, a^k_h)$ value in Line 12 of Algorithm 1. The following lemma states that the optimistic Q-value $Q^k_h(s, a)$ is an upper bound of the optimal Q-value $Q^{k,\star}_h(s, a)$ with high probability. Note that we only need to show that the event holds with probability $1 - \mathrm{poly}(K, H)\delta$, because we can replace $\delta$ with $\delta/\mathrm{poly}(K, H)$ in the end to get the desired high-probability bound without affecting the polynomial part of the regret bound.

Lemma 2. (Hoeffding) For $\delta \in (0, 1)$, with probability at least $1 - 2KH\delta$, it holds that $Q^{k,\star}_h(s, a) \le Q^{k+1}_h(s, a) \le Q^k_h(s, a)$, $\forall (s, a, h, k) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K]$.

We now proceed to analyze the dynamic regret in one epoch; at the very end of this section, we will see how to combine the dynamic regret over all the epochs to prove Theorem 1. The following analysis is conditioned on the successful event of Lemma 2. The dynamic regret of Algorithm 1 in epoch $d = 1$ can hence be expressed as

$R^{(d)}(\pi, K) = \sum_{k=1}^{K} \left( V^{k,\star}_1(s^k_1) - V^{k,\pi}_1(s^k_1) \right) \le \sum_{k=1}^{K} \left( V^k_1(s^k_1) - V^{k,\pi}_1(s^k_1) \right). \qquad (2)$

From the update rules of the value functions in Algorithm 1, we have

$V^k_h(s^k_h) \le \mathbb{1}\left[n^k_h = 0\right] H + \frac{\check{r}_h(s^k_h, a^k_h)}{\check{N}^k_h(s^k_h, a^k_h)} + \frac{\check{v}_h(s^k_h, a^k_h)}{\check{N}^k_h(s^k_h, a^k_h)} + b^k_h + 2 b_\Delta = \mathbb{1}\left[n^k_h = 0\right] H + \frac{\check{r}_h(s^k_h, a^k_h)}{\check{N}^k_h(s^k_h, a^k_h)} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} V^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) + b^k_h + 2 b_\Delta.$

For ease of exposition, we define the following notations:

$\delta^k_h \stackrel{\text{def}}{=} V^k_h(s^k_h) - V^{k,\star}_h(s^k_h), \qquad \zeta^k_h \stackrel{\text{def}}{=} V^k_h(s^k_h) - V^{k,\pi}_h(s^k_h). \qquad (3)$

We further define $\tilde{r}^k_h(s^k_h, a^k_h) \stackrel{\text{def}}{=} \frac{\check{r}_h(s^k_h, a^k_h)}{\check{N}^k_h(s^k_h, a^k_h)} - r^k_h(s^k_h, a^k_h)$. Then by Hoeffding's inequality, it holds with high probability that

$\tilde{r}^k_h(s^k_h, a^k_h) \le \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} r^{\check{l}_i}_h(s^k_h, a^k_h) + \sqrt{\frac{\iota}{\check{n}}} - r^k_h(s^k_h, a^k_h) \le b^k_h + b_\Delta. \qquad (4)$
By the Bellman equation $V^{k,\pi}_h(s^k_h) = Q^{k,\pi}_h(s^k_h, \pi(s^k_h)) = r^k_h(s^k_h, a^k_h) + P^k_h V^{k,\pi}_{h+1}(s^k_h, a^k_h)$, we have

$\zeta^k_h \le \mathbb{1}\left[n^k_h = 0\right] H + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} V^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) + b^k_h + 2 b_\Delta + \tilde{r}^k_h(s^k_h, a^k_h) - P^k_h V^{k,\pi}_{h+1}(s^k_h, a^k_h)$
$\le \mathbb{1}\left[n^k_h = 0\right] H + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{\check{l}_i}_h V^{\check{l}_i}_{h+1}(s^k_h, a^k_h) - P^k_h V^{k,\pi}_{h+1}(s^k_h, a^k_h) + 3 b^k_h + 3 b_\Delta \qquad (5)$
$= \mathbb{1}\left[n^k_h = 0\right] H + \underbrace{\frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^{\check{l}_i}_h - P^k_h\right) V^{\check{l}_i}_{h+1}(s^k_h, a^k_h)}_{T_1} + \underbrace{\frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^k_h \left(V^{\check{l}_i}_{h+1} - V^{\check{l}_i,\star}_{h+1}\right)(s^k_h, a^k_h)}_{T_2} + \underbrace{\frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^k_h \left(V^{\check{l}_i,\star}_{h+1} - V^{k,\pi}_{h+1}\right)(s^k_h, a^k_h)}_{T_3} + 3 b^k_h + 3 b_\Delta, \qquad (6)$

where (5) is by the Azuma-Hoeffding inequality and by (4). In the following, we bound each term in (6) separately. First, by Hölder's inequality, we have

$T_1 \le \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \Delta^{(1)}_p (H - h) \le b_\Delta. \qquad (7)$

Let $e_j$ denote a standard basis vector of proper dimension that has a 1 at the $j$-th entry and 0s at the others, in the form of $(0, \ldots, 0, 1, 0, \ldots, 0)$. Recalling the definition of $\delta^k_h$ in (3), we have

$T_2 = \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \delta^{\check{l}_i}_{h+1} + \underbrace{\frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^k_h - e_{s^{\check{l}_i}_{h+1}}\right)\left(V^{\check{l}_i}_{h+1} - V^{\check{l}_i,\star}_{h+1}\right)(s^k_h, a^k_h)}_{\xi^k_{h+1}} = \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \delta^{\check{l}_i}_{h+1} + \xi^k_{h+1}. \qquad (8)$

Finally, recalling the definition of $\zeta^k_h$ in (3), we have that

$T_3 = \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(V^{\check{l}_i,\star}_{h+1}(s^k_{h+1}) - V^{k,\pi}_{h+1}(s^k_{h+1})\right) + \underbrace{\frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^k_h - e_{s^k_{h+1}}\right)\left(V^{\check{l}_i,\star}_{h+1} - V^{k,\pi}_{h+1}\right)(s^k_h, a^k_h)}_{\phi^k_{h+1}} = \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(V^{\check{l}_i,\star}_{h+1}(s^k_{h+1}) - V^{k,\star}_{h+1}(s^k_{h+1})\right) + \zeta^k_{h+1} - \delta^k_{h+1} + \phi^k_{h+1} \le b_\Delta + \zeta^k_{h+1} - \delta^k_{h+1} + \phi^k_{h+1}, \qquad (9)$

where inequality (9) is by Lemma 1. Combining (6), (7), (8), and (9) leads to

$\zeta^k_h \le \mathbb{1}\left[n^k_h = 0\right] H + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \delta^{\check{l}_i}_{h+1} + \xi^k_{h+1} + \zeta^k_{h+1} - \delta^k_{h+1} + \phi^k_{h+1} + 3 b^k_h + 5 b_\Delta. \qquad (10)$

To find an upper bound of $\sum_{k=1}^{K} \zeta^k_h$, we proceed to upper bound each term on the RHS of (10) separately. First, notice that $\sum_{k=1}^{K} \mathbb{1}\left[n^k_h = 0\right] \le SAH$, because $n^k_h = 0$ only during the first stage of a triple, and hence each fixed triple $(s, a, h)$ contributes at most $e_1 = H$ to the sum. The second term in (10) can be upper bounded by the following lemma:

Lemma 3. $\sum_{k=1}^{K} \frac{1}{\check{n}^k_h} \sum_{i=1}^{\check{n}^k_h} \delta^{\check{l}^k_{h,i}}_{h+1} \le \left(1 + \frac{1}{H}\right) \sum_{k=1}^{K} \delta^k_{h+1}$.
Combining (10) and Lemma 3, we now have

$\sum_{k=1}^{K} \zeta^k_h \le SAH^2 + \frac{1}{H} \sum_{k=1}^{K} \delta^k_{h+1} + \sum_{k=1}^{K} \left(\xi^k_{h+1} + \zeta^k_{h+1} + \phi^k_{h+1} + 3 b^k_h + 5 b_\Delta\right) \le SAH^2 + \left(1 + \frac{1}{H}\right) \sum_{k=1}^{K} \zeta^k_{h+1} + \sum_{k=1}^{K} \Lambda^k_{h+1}, \qquad (11)$

where $\Lambda^k_{h+1} \stackrel{\text{def}}{=} \xi^k_{h+1} + \phi^k_{h+1} + 3 b^k_h + 5 b_\Delta$, and in (11) we have used the fact that $\delta^k_{h+1} \le \zeta^k_{h+1}$, which in turn is due to the optimality $V^{k,\star}_h(s^k_h) \ge V^{k,\pi}_h(s^k_h)$. Notice that we have $\zeta^k_h$ on the LHS of (11) and $\zeta^k_{h+1}$ on the RHS. By iterating (11) over $h = H, H-1, \ldots, 1$, we conclude that

$\sum_{k=1}^{K} \zeta^k_1 \le O\left(SAH^3 + \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \Lambda^k_{h+1}\right). \qquad (12)$

We bound $\sum_{h=1}^{H} \sum_{k=1}^{K} (1 + \frac{1}{H})^{h-1} \Lambda^k_{h+1}$ in Proposition 1 (deferred to the appendices), which gives

$\sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \Lambda^k_{h+1} \le O\left(\sqrt{SAKH^5} + KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p\right).$

Now we are ready to prove Theorem 1.

Proof. (of Theorem 1) By (2) and (12), and by replacing $\delta$ with $\frac{\delta}{KH+2}$ in Proposition 1, we know that the dynamic regret in epoch $d = 1$ can be upper bounded with probability at least $1 - \delta$ by $R^{(d)}(\pi, K) \le O(SAH^3 + \sqrt{SAKH^5} + KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p)$. Summing over the $D$ epochs yields $R(\pi, M) \le O(DSAH^3 + D\sqrt{SAKH^5} + \sum_{d=1}^{D} KH\Delta^{(d)}_r + \sum_{d=1}^{D} KH^2\Delta^{(d)}_p)$. Recall that $\sum_{d=1}^{D} \Delta^{(d)}_r \le \Delta_r$, $\sum_{d=1}^{D} \Delta^{(d)}_p \le \Delta_p$, $\Delta = \Delta_r + \Delta_p$, and $K = \Theta(\frac{T}{DH})$. By setting $D = S^{-1/3} A^{-1/3} \Delta^{2/3} H^{-2/3} T^{1/3}$, the dynamic regret over the entire $T$ steps is bounded by $R(\pi, M) \le O(S^{1/3} A^{1/3} \Delta^{1/3} H^{5/3} T^{2/3})$, which completes the proof.

Algorithm 1 relies on the assumption that the local budgets used in $b_\Delta$ are known a priori, which hardly holds in practice. In the following theorem, we show that this assumption can be safely removed without affecting the regret bound. The only modification to the algorithm is to replace the Q-value update rule in Equation ($\ast$) of Algorithm 1 with the new update rule in Equation (1). Theorem 2.
(Hoeffding, no local budgets) For $T = \Omega(SA\Delta H^2)$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the dynamic regret of RestartQ-UCB with Hoeffding bonuses and no knowledge of the local budgets is bounded by $\tilde{O}(S^{1/3} A^{1/3} \Delta^{1/3} H^{5/3} T^{2/3})$, where $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $T$ and $1/\delta$.

To understand why this simple modification works, notice that in ($\ast$) we add exactly the same value $2 b_\Delta$ to the upper confidence bounds of all $(s, a)$ pairs in the same epoch. Subtracting the same value from all optimistic Q-values simultaneously does not change the choice of actions in future steps. The only difference is that the new "optimistic" $Q^k_h(s, a)$ values are no longer strict upper bounds of the optimal $Q^{k,\star}_h(s, a)$, but only upper bounds up to an error term of the order $b_\Delta$. This further requires a slightly different analysis of how the error term propagates over time, which is presented as a variant of Lemma 2 as follows.

Lemma 4. (Hoeffding, no local budgets) Suppose we have no knowledge of the local variation budgets and replace the update rule ($\ast$) in Algorithm 1 with Equation (1). For $\delta \in (0, 1)$, with probability at least $1 - 2KH\delta$, it holds that $Q^{k,\star}_h(s, a) - 2(H - h + 1) b_\Delta \le Q^{k+1}_h(s, a) \le Q^k_h(s, a)$, $\forall (s, a, h, k) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K]$.

Remark 1. The easy removal of the local budget assumption is non-trivial in the design of the algorithm, and does not exist in the non-stationary RL literature with restarts. In fact, it has been shown in a concurrent work (Zhou et al., 2020) that removing this assumption would lead to a much worse regret bound (cf. Corollary 2 and Corollary 3 therein). Replacing the Hoeffding-based upper confidence bound with a Freedman-style one leads to a tighter regret bound, summarized in Theorem 3 below. The proof of the theorem follows a similar procedure to that of Theorem 1, and is given in Appendix D.
It relies on a reference-advantage decomposition technique for variance reduction, as coined in Zhang et al. (2020). The intuition is to first learn a reference value function $V^{\mathrm{ref}}$ that serves as a roughly accurate estimate of the optimal value function $V^\star$. The task of learning the optimal value function $V^\star = V^{\mathrm{ref}} + (V^\star - V^{\mathrm{ref}})$ can hence be decomposed into estimating the two terms $V^{\mathrm{ref}}$ and $V^\star - V^{\mathrm{ref}}$, each of which can be estimated accurately due to the reduced variance. For ease of exposition, we proceed again with the assumption that the local variation budgets are known. The reader should bear in mind that this assumption can be easily removed using a similar technique as in Theorem 2.

Theorem 3. (Freedman) For $T$ greater than some polynomial of $S$, $A$, $\Delta$ and $H$, and for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the dynamic regret of RestartQ-UCB with Freedman bonuses (Algorithm 2) is bounded by $\tilde{O}(S^{1/3} A^{1/3} \Delta^{1/3} H T^{2/3})$, where $\tilde{O}(\cdot)$ hides poly-logarithmic factors.
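As a quick numerical sanity check of the epoch calibration used in the analysis above (our own script, with arbitrary test values), one can verify that the choice of $D$ in the proof of Theorem 1 indeed balances all the leading terms at the claimed $S^{1/3} A^{1/3} \Delta^{1/3} H^{5/3} T^{2/3}$ rate:

```python
# The proof sets D = S^(-1/3) A^(-1/3) Delta^(2/3) H^(-2/3) T^(1/3) and
# K = Theta(T / (D H)). Check that the summed per-epoch bounds then match
# O(S^(1/3) A^(1/3) Delta^(1/3) H^(5/3) T^(2/3)) (constants and logs ignored).
S, A, Delta, H, T = 5.0, 4.0, 3.0, 7.0, 1e9   # arbitrary test values

D = S ** (-1 / 3) * A ** (-1 / 3) * Delta ** (2 / 3) * H ** (-2 / 3) * T ** (1 / 3)
K = T / (D * H)
claimed = S ** (1 / 3) * A ** (1 / 3) * Delta ** (1 / 3) * H ** (5 / 3) * T ** (2 / 3)

term_est = D * (S * A * K * H ** 5) ** 0.5  # D * sqrt(S A K H^5): estimation error
term_r = K * H * Delta                      # sum_d K H Delta_r^(d)   <= K H Delta
term_p = K * H ** 2 * Delta                 # sum_d K H^2 Delta_p^(d) <= K H^2 Delta

assert abs(term_est / claimed - 1.0) < 1e-9  # matches the claimed rate exactly
assert abs(term_p / claimed - 1.0) < 1e-9    # matches the claimed rate exactly
assert term_r < claimed                      # smaller by a factor of H
```

The estimation-error term and the transition-variation term are exactly balanced by this choice of $D$, while the reward-variation term is smaller by a factor of $H$; this is also where the extra $H^{1/3}$ gap to the lower bound of Section 5 originates.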

5. LOWER BOUNDS

In this section, we provide information-theoretical lower bounds on the dynamic regret, characterizing the best achievable performance of any algorithm for solving non-stationary MDPs.

Theorem 4. For any algorithm, there exists an episodic non-stationary MDP such that the dynamic regret of the algorithm is at least $\Omega(S^{1/3} A^{1/3} \Delta^{1/3} H^{2/3} T^{2/3})$.

Proof sketch. The proof of our lower bound relies on the construction of a "hard instance" of non-stationary MDPs. The instance we construct is essentially a switching-MDP: an MDP with piecewise-constant dynamics on each segment of the horizon, whose dynamics experience an abrupt change at the beginning of each new segment. More specifically, we divide the horizon $T$ into $L$ segments¹, where each segment has $T_0 \stackrel{\text{def}}{=} \frac{T}{L}$ steps and contains $M_0 \stackrel{\text{def}}{=} \frac{M}{L}$ episodes, each episode having a length of $H$. Within each such segment, the system dynamics of the MDP do not vary, and we construct the dynamics of each segment so that the segment is a hard instance of stationary MDPs on its own; the MDP within each segment is essentially similar to the hard instances constructed for stationary RL problems (Osband & Van Roy, 2016; Jin et al., 2018). Between two consecutive segments, the dynamics of the MDP change abruptly, and we let the dynamics vary in such a way that no information learned from previous interactions with the MDP can be reused in the new segment. In this sense, the agent needs to learn a new hard stationary MDP in each segment. Finally, optimizing the value of $L$ and the variation magnitude between consecutive segments (subject to the constraint of the total variation budget) leads to our lower bound.

A useful side result of our proof is the following lower bound for non-stationary RL in the undiscounted setting, which is the same setting as studied in Cheung et al. (2020), Gajane et al. (2018), and Ortner et al. (2019). Proposition 2.
Consider a reinforcement learning problem in undiscounted non-stationary MDPs with horizon length $T$, total variation budget $\Delta$, and maximum MDP diameter $D$ (Cheung et al., 2020). For any learning algorithm, there exists a non-stationary MDP such that the dynamic regret of the algorithm is at least $\Omega(S^{1/3} A^{1/3} \Delta^{1/3} D^{2/3} T^{2/3})$.

¹The definition of segments is irrelevant to, and should not be confused with, the notion of epochs we previously defined.

A. APPLICATIONS TO SEQUENTIAL TRANSFER, MULTI-TASK, AND MULTI-AGENT RL

One area that could benefit from non-stationary RL is sequential transfer in RL (Tirinzoni et al., 2020) or multi-task RL (Brunskill & Li, 2013), which itself is conceptually related to continual RL (Kaplanis et al., 2018) and lifelong RL (Abel et al., 2018). In the setting of sequential transfer/multi-task RL, the agent encounters a sequence of tasks over time with different system dynamics, and seeks to bootstrap learning by transferring knowledge from previously-solved tasks. Typical solutions in this area (Brunskill & Li, 2013; Tirinzoni et al., 2020; Sun et al., 2020) need to assume that there are finitely many candidate tasks, and that every task is sufficiently different from the others. Only under this assumption can the agent quickly identify the current task it is operating on, by essentially comparing the system dynamics it observes with the dynamics it has memorized for each candidate task. After identifying the current task with high confidence, the agent then invokes the policy that it learned through previous interactions with this specific task. This transfer learning paradigm in turn causes another problem: it "cold-switches" between policies that are most likely very different, which might lead to unstable and inconsistent behaviors of the agent over time. Fortunately, non-stationary RL can help alleviate both the finite-task assumption and the cold-switching problem.
First, non-stationary RL algorithms do not need the candidate tasks to be sufficiently different in order to correctly identify each of them, because the algorithm itself can tolerate some variation in the task environment. There is also no need to assume the finiteness of the candidate task set anymore, and the candidate tasks can be drawn from a continuous space. Second, since we run the same non-stationary RL algorithm for the whole series of tasks, the algorithm improves its policy gradually over time, instead of cold-switching to a completely independent policy for each task. This could largely mitigate the unstable-behavior issue. Multi-agent reinforcement learning (MARL) (Littman, 1994) studies the problem where a set of agents collaborate or compete in a shared environment. In MARL, since the transition and reward functions of the agents are coupled, the environment is non-stationary from each agent's own perspective, especially when the agents learn and update their policies simultaneously. This inherent non-stationarity of MARL is precisely a setting where non-stationary RL can play a role. As advocated in Bowling & Veloso (2001) and Busoniu et al. (2008), a good MARL algorithm should be both rational and convergent: the former means that the algorithm converges to its opponent's best response if the opponent converges to a stationary policy, and the latter means that if all agents use the same algorithm, the algorithm converges to a stationary policy. As such, a non-stationary RL algorithm can be viewed as a rational MARL algorithm, thanks to its dynamic regret guarantees, although its convergence property in MARL settings is still worth further investigation. In fact, developing algorithms that are both rational and convergent in general MARL settings remains largely open. In addition, non-stationary RL algorithms also apply to the MARL setting of achieving low regret against slowly-changing opponents (see Lee et al. (2020, Sec. 5.2) for the setting), although we consider a more challenging measure of dynamic regret (as opposed to the static regret in Lee et al. (2020)). Finally, dynamic regret is also pertinent to the notion of exploitability of strategies in two-player zero-sum games (Davis et al., 2014).

B PROOFS OF THE TECHNICAL LEMMAS

B.1 PROOF OF LEMMA 1

Proof. In fact, in the following, we will prove a stronger statement:
$$Q^{k_1,*}_h(s, a) - Q^{k_2,*}_h(s, a) \le \sum_{h'=h}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h}^{H} \Delta^{(1)}_{p,h'},$$
which implies the statement of the lemma because $\sum_{h'=h}^{H} \Delta^{(1)}_{r,h'} \le \Delta^{(1)}_r$ and $\sum_{h'=h}^{H} \Delta^{(1)}_{p,h'} \le \Delta^{(1)}_p$ by definition. Our proof relies on backward induction on $h$. First, the statement holds for $h = H$, because for any $(s, a)$, by definition,
$$Q^{k_1,*}_H(s, a) - Q^{k_2,*}_H(s, a) = r^{k_1}_H(s, a) - r^{k_2}_H(s, a) \le \sum_{k=k_1}^{k_2-1} \left| r^{k+1}_H(s, a) - r^{k}_H(s, a) \right| \le \sum_{k=1}^{K-1} \left| r^{k+1}_H(s, a) - r^{k}_H(s, a) \right| \le \Delta^{(1)}_{r,H},$$
where we have used the triangle inequality. Now suppose the statement holds for $h+1$; by the Bellman optimality equation,
$$\begin{aligned} Q^{k_1,*}_h(s, a) - Q^{k_2,*}_h(s, a) &= P^{k_1}_h V^{k_1,*}_{h+1}(s, a) - P^{k_2}_h V^{k_2,*}_{h+1}(s, a) + r^{k_1}_h(s, a) - r^{k_2}_h(s, a) \\ &\le P^{k_1}_h V^{k_1,*}_{h+1}(s, a) - P^{k_2}_h V^{k_2,*}_{h+1}(s, a) + \Delta^{(1)}_{r,h} \qquad (14) \\ &= \sum_{s' \in \mathcal{S}} P^{k_1}_h(s' \mid s, a) V^{k_1,*}_{h+1}(s') - \sum_{s' \in \mathcal{S}} P^{k_2}_h(s' \mid s, a) V^{k_2,*}_{h+1}(s') + \Delta^{(1)}_{r,h} \\ &= \sum_{s' \in \mathcal{S}} \left( P^{k_1}_h(s' \mid s, a) Q^{k_1,*}_{h+1}\big(s', \pi^{k_1,*}_{h+1}(s')\big) - P^{k_2}_h(s' \mid s, a) Q^{k_2,*}_{h+1}\big(s', \pi^{k_2,*}_{h+1}(s')\big) \right) + \Delta^{(1)}_{r,h}, \qquad (15) \end{aligned}$$
where inequality (14) holds due to a similar reasoning as in (13), and in (15), $\pi^{k_1,*}$ and $\pi^{k_2,*}$ denote the optimal policies in episodes $k_1$ and $k_2$, respectively. Then, by our induction hypothesis on $h+1$, for any $s' \in \mathcal{S}$,
$$Q^{k_1,*}_{h+1}\big(s', \pi^{k_1,*}_{h+1}(s')\big) \le Q^{k_2,*}_{h+1}\big(s', \pi^{k_1,*}_{h+1}(s')\big) + \sum_{h'=h+1}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h+1}^{H} \Delta^{(1)}_{p,h'} \le Q^{k_2,*}_{h+1}\big(s', \pi^{k_2,*}_{h+1}(s')\big) + \sum_{h'=h+1}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h+1}^{H} \Delta^{(1)}_{p,h'}, \qquad (16)$$
where inequality (16) is due to the optimality of the policy $\pi^{k_2,*}$ in episode $k_2$ over $\pi^{k_1,*}$. Then,
$$\begin{aligned} Q^{k_1,*}_h(s, a) - Q^{k_2,*}_h(s, a) &\le \sum_{s' \in \mathcal{S}} \left( P^{k_1}_h(s' \mid s, a) - P^{k_2}_h(s' \mid s, a) \right) Q^{k_2,*}_{h+1}\big(s', \pi^{k_2,*}_{h+1}(s')\big) + \sum_{h'=h}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h+1}^{H} \Delta^{(1)}_{p,h'} \\ &\le \left\| P^{k_1}_h(\cdot \mid s, a) - P^{k_2}_h(\cdot \mid s, a) \right\|_1 \left\| Q^{k_2,*}_{h+1}\big(\cdot, \pi^{k_2,*}_{h+1}(\cdot)\big) \right\|_\infty + \sum_{h'=h}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h+1}^{H} \Delta^{(1)}_{p,h'} \qquad (17) \\ &\le \Delta^{(1)}_{p,h} (H - h) + \sum_{h'=h}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h+1}^{H} \Delta^{(1)}_{p,h'} \le \sum_{h'=h}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h}^{H} \Delta^{(1)}_{p,h'}, \qquad (18) \end{aligned}$$
where (17) is by Hölder's inequality, and (18) is by the definition of $\Delta^{(1)}_{p,h}$ and by the fact that the optimal Q-values satisfy $Q^{k_2,*}_{h+1}(s, a) \le H - h$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Repeating a similar argument gives $Q^{k_2,*}_h(s, a) - Q^{k_1,*}_h(s, a) \le \sum_{h'=h}^{H} \Delta^{(1)}_{r,h'} + H \sum_{h'=h}^{H} \Delta^{(1)}_{p,h'}$. This completes our proof.
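Lemma 1's stronger statement is easy to check numerically on small random instances. The sketch below (our own illustrative check, not part of the paper's analysis) draws two finite-horizon MDPs, treats them as the dynamics of episodes $k_1$ and $k_2$, computes the optimal Q-values by backward induction, and verifies the bound against the per-step variations $\Delta_{r,h}$ and $\Delta_{p,h}$:

```python
import random

random.seed(3)

S, A, H = 3, 2, 4  # toy sizes; rewards drawn in [0, 1]

def rand_rewards():
    return [[[random.random() for _ in range(A)] for _ in range(S)] for _ in range(H)]

def rand_kernel():
    P = []
    for _ in range(H):
        Ph = []
        for _ in range(S):
            row = []
            for _ in range(A):
                w = [random.random() for _ in range(S)]
                z = sum(w)
                row.append([x / z for x in w])  # normalized transition distribution
            Ph.append(row)
        P.append(Ph)
    return P

def q_star(r, P):
    # backward induction for the optimal Q-values of a finite-horizon MDP
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    V = [0.0] * S
    for h in range(H - 1, -1, -1):
        newV = [0.0] * S
        for s in range(S):
            for a in range(A):
                Q[h][s][a] = r[h][s][a] + sum(P[h][s][a][s2] * V[s2] for s2 in range(S))
            newV[s] = max(Q[h][s])
        V = newV
    return Q

r1, P1 = rand_rewards(), rand_kernel()   # "episode k1" dynamics
r2, P2 = rand_rewards(), rand_kernel()   # "episode k2" dynamics

# per-step variations: sup-norm reward gap and L1 transition gap
dr = [max(abs(r1[h][s][a] - r2[h][s][a]) for s in range(S) for a in range(A)) for h in range(H)]
dp = [max(sum(abs(P1[h][s][a][s2] - P2[h][s][a][s2]) for s2 in range(S))
          for s in range(S) for a in range(A)) for h in range(H)]

Q1, Q2 = q_star(r1, P1), q_star(r2, P2)
for h in range(H):
    bound = sum(dr[h:]) + H * sum(dp[h:])  # the bound from Lemma 1's stronger statement
    for s in range(S):
        for a in range(A):
            assert abs(Q1[h][s][a] - Q2[h][s][a]) <= bound + 1e-9
print("Lemma 1 bound holds on a random instance")
```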

B.2 PROOF OF LEMMA 2

Proof. It should be clear from the way we update $Q_h(s, a)$ that $Q^k_h(s, a)$ is monotonically decreasing in $k$. We now prove $Q^{k,*}_h(s, a) \le Q^{k+1}_h(s, a)$ for all $s, a, h, k$ by induction on $k$. First, it holds for $k = 1$ by our initialization of $Q_h(s, a)$. For $k \ge 2$, suppose $Q^{j,*}_h(s, a) \le Q^{j+1}_h(s, a) \le Q^{j}_h(s, a)$ for all $s, a, h$ and $1 \le j \le k$. For a fixed triple $(s, a, h)$, we consider the following two cases.

Case 1: $Q_h(s, a)$ is updated in episode $k$. Then, with probability at least $1 - 2\delta$,
$$\begin{aligned} Q^{k+1}_h(s, a) &= \frac{\check{r}_h(s, a)}{\check{N}^k_h(s, a)} + \frac{\check{v}_h(s, a)}{\check{N}^k_h(s, a)} + b^k_h + 2b_\Delta \\ &\ge \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} V^{\check{l}_i,*}_{h+1}\big(s^{\check{l}_i}_{h+1}\big) + \sqrt{\frac{H^2}{\check{n}} \iota} + \sqrt{\frac{\iota}{\check{n}}} + 2b_\Delta \qquad (19) \\ &\ge \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{\check{l}_i}_h V^{\check{l}_i,*}_{h+1}(s, a) + \sqrt{\frac{\iota}{\check{n}}} + 2b_\Delta \qquad (20) \\ &= \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left( Q^{\check{l}_i,*}_h(s, a) - r^{\check{l}_i}_h(s, a) \right) + \sqrt{\frac{\iota}{\check{n}}} + 2b_\Delta \qquad (21) \\ &\ge Q^{k,*}_h(s, a) + b_\Delta. \end{aligned}$$
Inequality (19) is by the induction hypothesis that $Q^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}, a) \ge Q^{\check{l}_i,*}_{h+1}(s^{\check{l}_i}_{h+1}, a)$ for all $a \in \mathcal{A}$, and hence $V^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) \ge V^{\check{l}_i,*}_{h+1}(s^{\check{l}_i}_{h+1})$. Inequality (20) follows from the Azuma-Hoeffding inequality. Step (21) is by the Bellman optimality equation, and the last inequality follows from Lemma 1, which implies $Q^{\check{l}_i,*}_h(s, a) + b_\Delta \ge Q^{k,*}_h(s, a)$. According to the monotonicity of $Q^k_h(s, a)$, we know that $Q^{k,*}_h(s, a) \le Q^{k+1}_h(s, a) \le Q^k_h(s, a)$. In fact, we have proved the stronger statement $Q^{k+1}_h(s, a) \ge Q^{k,*}_h(s, a) + b_\Delta$, which will be useful in Case 2 below.

Case 2: $Q_h(s, a)$ is not updated in episode $k$. Then there are two possibilities:

1. If $Q_h(s, a)$ has never been updated from episode 1 to episode $k$: it is easy to see that $Q^{k+1}_h(s, a) = Q^k_h(s, a) = \cdots = Q^1_h(s, a) = H - h + 1 \ge Q^{k,*}_h(s, a)$ holds.

2. If $Q_h(s, a)$ has been updated at least once from episode 1 to episode $k$: let $j$ be the index of the latest episode in which $Q_h(s, a)$ was updated. Then, from our induction hypothesis and Case 1, we know that $Q^{j+1}_h(s, a) \ge Q^{j,*}_h(s, a) + b_\Delta$. Since $Q_h(s, a)$ has not been updated from episode $j + 1$ to episode $k$, we know that $Q^{k+1}_h(s, a) = Q^k_h(s, a) = \cdots = Q^{j+1}_h(s, a) \ge Q^{j,*}_h(s, a) + b_\Delta \ge Q^{k,*}_h(s, a)$, where the last inequality holds because of Lemma 1.

A union bound over all time steps completes our proof.

B.3 PROOF OF LEMMA 3

Proof. It holds that
$$\sum_{k=1}^{K} \frac{1}{\check{n}^k_h} \sum_{i=1}^{\check{n}^k_h} \delta^{\check{l}^k_{h,i}}_{h+1} = \sum_{k=1}^{K} \sum_{j=1}^{K} \frac{1}{\check{n}^k_h} \delta^{j}_{h+1} \sum_{i=1}^{\check{n}^k_h} \mathbb{1}\left[\check{l}^k_{h,i} = j\right]. \qquad (23)$$
For a fixed episode $j$, notice that $\sum_{i=1}^{\check{n}^k_h} \mathbb{1}[\check{l}^k_{h,i} = j] \le 1$, and that $\sum_{i=1}^{\check{n}^k_h} \mathbb{1}[\check{l}^k_{h,i} = j] = 1$ happens if and only if $(s^k_h, a^k_h) = (s^j_h, a^j_h)$ and $(j, h)$ lies in the previous stage of $(k, h)$ with respect to the triple $(s^k_h, a^k_h, h)$. Let $\mathcal{K} := \{k \in [K] : \sum_{i=1}^{\check{n}^k_h} \mathbb{1}[\check{l}^k_{h,i} = j] = 1\}$; then, we know that every element $k \in \mathcal{K}$ has the same value of $\check{n}^k_h$, i.e., there exists an integer $N_j > 0$ such that $\check{n}^k_h = N_j$ for all $k \in \mathcal{K}$. Further, by our definition of the stages, we know that $|\mathcal{K}| \le (1 + \frac{1}{H}) N_j$, because the current stage is at most $(1 + \frac{1}{H})$ times longer than the previous stage. Therefore, for every $j$, we know that
$$\sum_{k=1}^{K} \frac{1}{\check{n}^k_h} \sum_{i=1}^{\check{n}^k_h} \mathbb{1}\left[\check{l}^k_{h,i} = j\right] \le 1 + \frac{1}{H}. \qquad (24)$$
Combining (23) and (24) completes the proof of $\sum_{k=1}^{K} \frac{1}{\check{n}^k_h} \sum_{i=1}^{\check{n}^k_h} \delta^{\check{l}^k_{h,i}}_{h+1} \le (1 + \frac{1}{H}) \sum_{k=1}^{K} \delta^k_{h+1}$.

B.4 PROOF OF PROPOSITION 1

In the following, we will bound each term in $\Lambda^k_{h+1}$ separately, in a series of lemmas.

Lemma 5. With probability 1, we have that $\sum_{h=1}^{H} \sum_{k=1}^{K} (1 + \frac{1}{H})^{h-1} (3b^k_h + 5b_\Delta) \le O(\sqrt{SAKH^5 \iota} + KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p)$.

Proof. First, by the definition of $b_\Delta$, it is easy to see that
$$\sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} 5b_\Delta \le \sum_{h=1}^{H} \sum_{k=1}^{K} O\left(\Delta^{(1)}_r + H\Delta^{(1)}_p\right) \le O\left(KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p\right).$$
Recall our definition that $e_1 = H$ and $e_{i+1} = (1 + \frac{1}{H}) e_i$ for $i \ge 1$. For a fixed $h \in [H]$, since $H^2 \ge 1$,
$$\sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} 3b^k_h \le \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} 12 \sqrt{\frac{H^2}{\check{N}^k_h(s^k_h, a^k_h)} \iota} = 12H\sqrt{\iota} \sum_{s,a} \sum_{j \ge 1} \left(1 + \frac{1}{H}\right)^{h-1} \sqrt{\frac{1}{e_j}} \sum_{k=1}^{K} \mathbb{1}\left[(s^k_h, a^k_h) = (s, a), \check{N}^k_h(s^k_h, a^k_h) = e_j\right] = 12H\sqrt{\iota} \sum_{s,a} \sum_{j \ge 1} \left(1 + \frac{1}{H}\right)^{h-1} w(s, a, j) \sqrt{\frac{1}{e_j}},$$
where $w(s, a, j) := \sum_{k=1}^{K} \mathbb{1}[(s^k_h, a^k_h) = (s, a), \check{N}^k_h(s^k_h, a^k_h) = e_j]$, and $w(s, a) := \sum_{j \ge 1} w(s, a, j)$. We then know that $\sum_{s,a} w(s, a) = K$. For a fixed $(s, a)$, let us now find an upper bound of $j$, denoted as $J$. Since each stage is $(1 + \frac{1}{H})$ times longer than the previous stage, we know that for $1 \le j \le J$,
$$w(s, a, j) = \sum_{k=1}^{K} \mathbb{1}\left[(s^k_h, a^k_h) = (s, a), \check{N}^k_h(s^k_h, a^k_h) = e_j\right] = \left(1 + \frac{1}{H}\right) e_j.$$
From $\sum_{j=1}^{J} w(s, a, j) = w(s, a)$, we get $e_J = H\left(1 + \frac{1}{H}\right)^{J-1} \le \frac{10 \left(1 + \frac{1}{H}\right) w(s, a)}{H}$. Therefore,
$$\sum_{j \ge 1} \left(1 + \frac{1}{H}\right)^{h-1} w(s, a, j) \sqrt{\frac{1}{e_j}} \le O\left(\sum_{j=1}^{J} \sqrt{e_j}\right) \le O\left(\sqrt{w(s, a) H}\right).$$
Finally, by the Cauchy-Schwarz inequality, we have
$$\sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} 3b^k_h = O\left(H^2 \sqrt{\iota} \sum_{s,a} \sqrt{w(s, a) H}\right) \le O\left(\sqrt{SAKH^5 \iota}\right).$$
Combining the bounds for $b^k_h$ and $b_\Delta$ completes the proof.

Lemma 6. With probability at least $1 - \delta$, it holds that $\sum_{h=1}^{H} \sum_{k=1}^{K} (1 + \frac{1}{H})^{h-1} \phi^k_{h+1} \le O(\sqrt{KH^3 \iota} + KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p)$.

Proof. We have that
$$\begin{aligned} \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \phi^k_{h+1} &= \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^{\check{l}_i,*}_{h+1} - V^{k,\pi}_{h+1}\right)(s^k_h, a^k_h) \\ &= \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^{\check{l}_i,*}_{h+1} - V^{k,*}_{h+1} + V^{k,*}_{h+1} - V^{k,\pi}_{h+1}\right)(s^k_h, a^k_h) \\ &\le \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} 2b_\Delta + \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^{k,*}_{h+1} - V^{k,\pi}_{h+1}\right)(s^k_h, a^k_h), \end{aligned}$$
where the last inequality follows from Lemma 1 and the definition of $b_\Delta$. From the proof of Lemma 5, we know that the first term can be bounded as $\sum_{h=1}^{H} \sum_{k=1}^{K} (1 + \frac{1}{H})^{h-1} 2b_\Delta \le O(KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p)$. Further, the second term is bounded by the Azuma-Hoeffding inequality as
$$\sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^{k,*}_{h+1} - V^{k,\pi}_{h+1}\right)(s^k_h, a^k_h) \le O\left(\sqrt{KH^3 \iota}\right).$$
Combining the two terms completes the proof.

Lemma 7. With probability at least $1 - (KH + 1)\delta$, it holds that $\sum_{h=1}^{H} \sum_{k=1}^{K} (1 + \frac{1}{H})^{h-1} \xi^k_{h+1} \le O(\sqrt{SAKH^3 \iota} + KH^2\Delta^{(1)}_p)$.

Proof.
We have that
$$\begin{aligned} \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \xi^k_{h+1} &= \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^k_h - e_{s^{\check{l}_i}_{h+1}}\right) \left(V^{\check{l}_i}_{h+1} - V^{\check{l}_i,*}_{h+1}\right)(s^k_h, a^k_h) \\ &= \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^k_h - P^{\check{l}_i}_h + P^{\check{l}_i}_h - e_{s^{\check{l}_i}_{h+1}}\right) \left(V^{\check{l}_i}_{h+1} - V^{\check{l}_i,*}_{h+1}\right)(s^k_h, a^k_h) \\ &\le O\left(KH^2\Delta^{(1)}_p\right) + \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^{\check{l}_i}_h - e_{s^{\check{l}_i}_{h+1}}\right) \left(V^{\check{l}_i}_{h+1} - V^{\check{l}_i,*}_{h+1}\right)(s^k_h, a^k_h), \qquad (25) \end{aligned}$$
where the last step is by the fact that $V^{\check{l}_i}_{h+1}(s) \ge V^{\check{l}_i,*}_{h+1}(s)$ from Lemma 2, and then by Hölder's inequality and the triangle inequality. The remaining proof is analogous to the proof of Lemma 15 in Zhang et al. (2020); for completeness, we reproduce it here. We have
$$\sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left(P^{\check{l}_i}_h - e_{s^{\check{l}_i}_{h+1}}\right) \left(V^{\check{l}_i}_{h+1} - V^{\check{l}_i,*}_{h+1}\right)(s^k_h, a^k_h) = \sum_{h=1}^{H} \sum_{k=1}^{K} \sum_{j=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \frac{1}{\check{n}^k_h} \sum_{i=1}^{\check{n}^k_h} \mathbb{1}\left[\check{l}^k_{h,i} = j\right] \left(P^j_h - e_{s^j_{h+1}}\right) \left(V^j_{h+1} - V^{j,*}_{h+1}\right)(s^j_h, a^j_h), \qquad (26)$$
where (26) holds because $\check{l}^k_{h,i}(s^k_h, a^k_h) = j$ if and only if $j$ is in the previous stage of $k$ and $(s^k_h, a^k_h) = (s^j_h, a^j_h)$. For simplicity of notation, we define
$$\theta^k_{h+1} := \left(1 + \frac{1}{H}\right)^{h-1} \sum_{j=1}^{K} \frac{1}{\check{n}^j_h} \sum_{i=1}^{\check{n}^j_h} \mathbb{1}\left[\check{l}^j_{h,i} = k\right].$$
Then we further have (note that we have swapped the notation of $j$ and $k$)
$$(26) = \sum_{h=1}^{H} \sum_{k=1}^{K} \theta^k_{h+1} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^k_{h+1} - V^{k,*}_{h+1}\right)(s^k_h, a^k_h).$$
For $(k, h) \in [K] \times [H]$, let $x^k_h$ denote the number of occurrences of the triple $(s^k_h, a^k_h, h)$ in the current stage. Define $\tilde{\theta}^k_{h+1} := (1 + \frac{1}{H})^{h-1} \frac{(1 + \frac{1}{H}) x^k_h}{x^k_h} \le 3$. Define $\mathcal{K} := \{(k, h) : \theta^k_{h+1} = \tilde{\theta}^k_{h+1}\}$, and $\bar{\mathcal{K}} := \{(k, h) \in [K] \times [H] : \theta^k_{h+1} \ne \tilde{\theta}^k_{h+1}\}$. Then, we have that
$$(26) = \sum_{h=1}^{H} \sum_{k=1}^{K} \tilde{\theta}^k_{h+1} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^k_{h+1} - V^{k,*}_{h+1}\right)(s^k_h, a^k_h) + \sum_{h=1}^{H} \sum_{k=1}^{K} \left(\theta^k_{h+1} - \tilde{\theta}^k_{h+1}\right) \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^k_{h+1} - V^{k,*}_{h+1}\right)(s^k_h, a^k_h).$$
Since $\tilde{\theta}^k_{h+1}$ is independent of $s^k_{h+1}$, by the Azuma-Hoeffding inequality, it holds with probability at least $1 - \delta$ that
$$\sum_{h=1}^{H} \sum_{k=1}^{K} \tilde{\theta}^k_{h+1} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^k_{h+1} - V^{k,*}_{h+1}\right)(s^k_h, a^k_h) \le O\left(\sqrt{KH^3 \iota}\right). \qquad (27)$$
It is easy to see that if $k$ is in a stage that precedes the second-to-last stage of the triple $(s^k_h, a^k_h, h)$, then $(k, h) \in \mathcal{K}$. For a triple $(s, a, h)$, define $K^{\perp}_h(s, a) := \{k \in [K] : k \text{ is in the second-to-last stage of the triple } (s, a, h) \text{ and } (s^k_h, a^k_h) = (s, a)\}$. We have that
$$\begin{aligned} \sum_{h=1}^{H} \sum_{k=1}^{K} \left(\theta^k_{h+1} - \tilde{\theta}^k_{h+1}\right) \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^k_{h+1} - V^{k,*}_{h+1}\right)(s^k_h, a^k_h) &= \sum_{s,a,h} \sum_{k : (k,h) \in \bar{\mathcal{K}}} \mathbb{1}\left[(s^k_h, a^k_h) = (s, a)\right] \left(\theta^k_{h+1} - \tilde{\theta}^k_{h+1}\right) \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^k_{h+1} - V^{k,*}_{h+1}\right)(s, a) \\ &= \sum_{s,a,h} \left(\theta_{h+1}(s, a) - \tilde{\theta}_{h+1}(s, a)\right) \sum_{k \in K^{\perp}_h(s, a)} \left(P^k_h - e_{s^k_{h+1}}\right) \left(V^k_{h+1} - V^{k,*}_{h+1}\right)(s, a), \qquad (28) \end{aligned}$$
where for a fixed triple $(s, a, h)$, we have defined $\theta_{h+1}(s, a) := \theta^k_{h+1}$ for any $k \in K^{\perp}_h(s, a)$. Note that $\theta_{h+1}(s, a)$ is well-defined, because $\theta^{k_1}_{h+1} = \theta^{k_2}_{h+1}$ for all $k_1, k_2 \in K^{\perp}_h(s, a)$. Similarly, let $\tilde{\theta}_{h+1}(s, a) := \tilde{\theta}^k_{h+1}$ for any $k \in K^{\perp}_h(s, a)$; $\tilde{\theta}_{h+1}(s, a)$ is also well-defined. By the Azuma-Hoeffding inequality and a union bound, it holds with probability at least $1 - KH\delta$ that
$$(28) \le \sum_{s,a,h} O\left(\sqrt{H^2 \left|K^{\perp}_h(s, a)\right| \iota}\right) \le \sum_{s,a,h} O\left(\sqrt{H \check{N}^{K+1}_h(s, a) \iota}\right) \qquad (30) \\ \le O\left(\sqrt{SAKH^3 \iota}\right), \qquad (29)$$
where $\check{N}^{K+1}_h(s, a)$ is defined to be the total number of visits to the triple $(s, a, h)$ over the entire $K$ episodes. Step (29) is by the Cauchy-Schwarz inequality, and step (30) holds because, by the way the stages are defined, for each triple $(s, a, h)$, the length of its last two stages is at most an $O(1/H)$ fraction of the total number of visitations, i.e., $|K^{\perp}_h(s, a)| \le O(\check{N}^{K+1}_h(s, a)/H)$. Combining (25), (27), and (30) completes the proof.
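The stage structure used throughout these proofs (stage lengths $e_1 = H$, $e_{i+1} = (1 + \frac{1}{H}) e_i$) is easy to illustrate numerically; the rounding to integers below is our own choice for the sketch. It checks the two facts used above: each stage is at most $(1 + \frac{1}{H})$ times longer than the previous one (as in Lemma 3), and the last two stages make up only an $O(1/H)$ fraction of the total visit count (as in step (30)):

```python
H = 10
e = [H]                        # e_1 = H
while sum(e) < 10_000:         # grow stages until ~10k total visits
    e.append(int((1 + 1 / H) * e[-1]))

# each stage is at most (1 + 1/H) times the length of the previous one
assert all(b <= (1 + 1 / H) * a for a, b in zip(e, e[1:]))

# the last two stages are an O(1/H) fraction of the total visit count
total = sum(e)
assert (e[-1] + e[-2]) / total < 3 / H
print(len(e), total)
```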

B.5 PROOF OF LEMMA 4

Proof. This proof follows a similar structure as the proof of Lemma 2. It should be clear from the way we update $Q_h(s, a)$ that $Q^k_h(s, a)$ is monotonically decreasing in $k$. We now prove $Q^{k,*}_h(s, a) - 2(H - h + 1) b_\Delta \le Q^{k+1}_h(s, a)$ for all $s, a, h, k$ by induction on $k$. First, it holds for $k = 1$ by our initialization of $Q_h(s, a)$. For $k \ge 2$, suppose $Q^{j,*}_h(s, a) - 2(H - h + 1) b_\Delta \le Q^{j+1}_h(s, a) \le Q^{j}_h(s, a)$ for all $s, a, h$ and $1 \le j \le k$. For a fixed triple $(s, a, h)$, we consider the following two cases.

Case 1: $Q_h(s, a)$ is updated in episode $k$. Then, with probability at least $1 - 2\delta$,
$$\begin{aligned} Q^{k+1}_h(s, a) &= \frac{\check{r}_h(s, a)}{\check{N}^k_h(s, a)} + \frac{\check{v}_h(s, a)}{\check{N}^k_h(s, a)} + b^k_h \\ &\ge \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} V^{\check{l}_i,*}_{h+1}\big(s^{\check{l}_i}_{h+1}\big) - 2(H - h) b_\Delta + \sqrt{\frac{H^2}{\check{n}} \iota} + \sqrt{\frac{\iota}{\check{n}}} \qquad (31) \\ &\ge \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{\check{l}_i}_h V^{\check{l}_i,*}_{h+1}(s, a) + \sqrt{\frac{\iota}{\check{n}}} - 2(H - h) b_\Delta \qquad (32) \\ &= \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left( Q^{\check{l}_i,*}_h(s, a) - r^{\check{l}_i}_h(s, a) \right) + \sqrt{\frac{\iota}{\check{n}}} - 2(H - h) b_\Delta \qquad (33) \\ &\ge Q^{k,*}_h(s, a) - b_\Delta - 2(H - h) b_\Delta. \end{aligned}$$
Inequality (31) is by the induction hypothesis that $Q^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}, a) \ge Q^{\check{l}_i,*}_{h+1}(s^{\check{l}_i}_{h+1}, a) - 2(H - h) b_\Delta$ for all $a \in \mathcal{A}$, and hence $V^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) \ge V^{\check{l}_i,*}_{h+1}(s^{\check{l}_i}_{h+1}) - 2(H - h) b_\Delta$. Inequality (32) follows from the Azuma-Hoeffding inequality. Step (33) is by the Bellman optimality equation, and the last inequality follows from Lemma 1, which implies $Q^{\check{l}_i,*}_h(s, a) \ge Q^{k,*}_h(s, a) - b_\Delta$. According to the monotonicity of $Q^k_h(s, a)$, we know that $Q^{k,*}_h(s, a) - 2(H - h + 1) b_\Delta \le Q^{k+1}_h(s, a) \le Q^k_h(s, a)$. In fact, we have proved the stronger statement $Q^{k+1}_h(s, a) \ge Q^{k,*}_h(s, a) - b_\Delta - 2(H - h) b_\Delta$, which will be useful in Case 2 below.

Case 2: $Q_h(s, a)$ is not updated in episode $k$. Then there are two possibilities:

1. If $Q_h(s, a)$ has never been updated from episode 1 to episode $k$: it is easy to see that $Q^{k+1}_h(s, a) = Q^k_h(s, a) = \cdots = Q^1_h(s, a) = H - h + 1 \ge Q^{k,*}_h(s, a) - 2(H - h + 1) b_\Delta$ holds.

2. If $Q_h(s, a)$ has been updated at least once from episode 1 to episode $k$: let $j$ be the index of the latest episode in which $Q_h(s, a)$ was updated. Then, from our induction hypothesis and Case 1, we know that $Q^{j+1}_h(s, a) \ge Q^{j,*}_h(s, a) - b_\Delta - 2(H - h) b_\Delta$. Since $Q_h(s, a)$ has not been updated from episode $j + 1$ to episode $k$, we know that $Q^{k+1}_h(s, a) = Q^k_h(s, a) = \cdots = Q^{j+1}_h(s, a) \ge Q^{j,*}_h(s, a) - b_\Delta - 2(H - h) b_\Delta \ge Q^{k,*}_h(s, a) - 2(H - h + 1) b_\Delta$, where the last inequality holds because of Lemma 1.

A union bound over all time steps completes our proof.

B.6 PROOF SKETCH OF THEOREM 2

Proof sketch. We only sketch the differences with respect to the proof of Theorem 1 in the main text; the reader should have no difficulty recovering the complete proof by following the same routine as in Theorem 1. Specifically, it suffices to investigate the steps that involve Lemma 2. The dynamic regret of the new algorithm in epoch $d = 1$ can now be expressed as
$$R^{(d)}(\pi, K) = \sum_{k=1}^{K} \left( V^{k,*}_1(s^k_1) - V^{k,\pi}_1(s^k_1) \right) \le \sum_{k=1}^{K} \left( V^{k}_1(s^k_1) - V^{k,\pi}_1(s^k_1) \right) + 2KH b_\Delta, \qquad (35)$$
where we have applied the results of Lemma 4 instead of Lemma 2. The reader should bear in mind that, from the new update rules of the value functions, we now have
$$V^k_h(s^k_h) \le \mathbb{1}\left[n^k_h = 0\right] H + \frac{\check{r}_h(s^k_h, a^k_h)}{\check{N}^k_h(s^k_h, a^k_h)} + \frac{\check{v}_h(s^k_h, a^k_h)}{\check{N}^k_h(s^k_h, a^k_h)} + b^k_h, \qquad (36)$$
where the right-hand side no longer has the additional bonus term $b_\Delta$. If we define $\zeta^k_h$, $\xi^k_{h+1}$, and $\phi^k_{h+1}$ in the same way as before, one can easily verify that all the derivations up to Equation (12) still hold, although the value of $\Lambda^k_{h+1}$ should be re-defined as $\Lambda^k_{h+1} := \xi^k_{h+1} + \phi^k_{h+1} + 3b^k_h + 3b_\Delta$, due to the new upper bound in (36) that is independent of $b_\Delta$. Proposition 1 also follows analogously, though some additional attention should be paid to the proof of Lemma 7, where the results of Lemma 2 were utilized. Finally, we obtain the dynamic regret upper bound in epoch $d = 1$ as follows:
$$R^{(d)}(\pi, K) \le O\left(SAH^3 + \sqrt{SAKH^5} + KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p\right) + 2KH b_\Delta,$$
where the additional term $2KH b_\Delta$ comes from (35). From our definition of $b_\Delta$, we can easily see that $2KH b_\Delta \le O(KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p)$. Therefore, we can conclude that the dynamic regret upper bound in one epoch remains of the same order, which leaves the dynamic regret over the entire horizon unchanged as well.

C ALGORITHM: RESTARTQ-UCB (FREEDMAN)

The algorithm Restarted Q-Learning with Freedman Upper Confidence Bounds (RestartQ-UCB Freedman) is presented in Algorithm 2. For ease of exposition, we use $\check{r}$, $\check{\mu}$, $\check{v}$, $\check{\sigma}$, $\mu^{\mathrm{ref}}$, $\sigma^{\mathrm{ref}}$, $\check{n}$, and $n$ to denote $\check{r}_h(s, a)$, $\check{\mu}_h(s^k_h, a^k_h)$, $\check{v}_h(s^k_h, a^k_h)$, $\check{\sigma}_h(s^k_h, a^k_h)$, $\mu^{\mathrm{ref}}_h(s^k_h, a^k_h)$, $\sigma^{\mathrm{ref}}_h(s^k_h, a^k_h)$, $\check{N}_h(s^k_h, a^k_h)$, and $N_h(s^k_h, a^k_h)$, respectively, when the values of $(s, a, h, k)$ are clear from the context. Compared with Algorithm 1, there are two major improvements in Algorithm 2. The first is to replace the Hoeffding-based bonus term $b^k_h$ with a tighter term $\bar{b}^k_h$. The latter takes into account the second-moment information of the random variables, which allows sharper tail bounds that rely on second moments (in our case, Freedman's inequality) to come into use. The second improvement is a variance reduction technique, or more specifically, the reference-advantage decomposition as coined in Zhang et al. (2020). The intuition is to first learn a reference value function $V^{\mathrm{ref}}$ that serves as a roughly accurate estimate of the optimal value function $V^*$. The goal of learning the optimal value function $V^* = V^{\mathrm{ref}} + (V^* - V^{\mathrm{ref}})$ can hence be decomposed into estimating the two terms $V^{\mathrm{ref}}$ and $V^* - V^{\mathrm{ref}}$. The reference value $V^{\mathrm{ref}}$ is a fixed term, and can be estimated accurately using a large number of samples (in Algorithm 2, we estimate $V^{\mathrm{ref}}$ only when we have $c \cdot SAH^6 \iota$ samples for a large constant $c$). The advantage term $V^* - V^{\mathrm{ref}}$ can also be estimated accurately, due to its reduced variance.
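To make the overall restart-plus-optimism template concrete, the following is a heavily simplified, self-contained sketch on a toy non-stationary MDP. The environment, constants, learning-rate schedule (in the style of Jin et al. (2018)), and Hoeffding-style bonus are all our own illustrative choices; this is not the paper's exact Algorithm 2, which additionally uses stage-based updates, Freedman-type bonuses, and the reference-advantage decomposition:

```python
import math
import random

random.seed(1)

S, A, Hh, M, D = 2, 2, 3, 600, 3     # states, actions, horizon, episodes, epochs
iota = math.log(2 * S * A * Hh * M)  # log factor for the confidence bonus

def env_step(s, a, m):
    # toy non-stationary dynamics: the rewarding action flips mid-horizon
    good = 0 if m < M // 2 else 1
    r = 1.0 if (s == 1 and a == good) else 0.0
    s2 = 1 if random.random() < 0.7 else 0
    return r, s2

total_reward = 0.0
for d in range(D):                                   # restart at each epoch
    Q = [[[float(Hh)] * A for _ in range(S)] for _ in range(Hh)]  # optimistic init
    N = [[[0] * A for _ in range(S)] for _ in range(Hh)]
    for m in range(d * (M // D), (d + 1) * (M // D)):
        s = 0
        for h in range(Hh):
            a = max(range(A), key=lambda x: Q[h][s][x])   # greedy w.r.t. optimistic Q
            r, s2 = env_step(s, a, m)
            total_reward += r
            N[h][s][a] += 1
            n = N[h][s][a]
            alpha = (Hh + 1) / (Hh + n)                   # learning rate (Jin et al., 2018 style)
            bonus = math.sqrt(Hh**2 * iota / n)           # Hoeffding-style optimism
            v_next = max(Q[h + 1][s2]) if h + 1 < Hh else 0.0
            Q[h][s][a] = min(Q[h][s][a],
                             (1 - alpha) * Q[h][s][a] + alpha * (r + v_next + bonus))
            s = s2
print(total_reward)
```

Restarting at each epoch discards stale estimates, so after the reward function flips at episode $M/2$ the learner re-explores instead of committing to the outdated action.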

D PROOF OF THEOREM 3

Similar to the proof of Theorem 1, we start with the dynamic regret in one epoch, and then extend to all epochs in the end. The proof follows the same routine as the proof of Theorem 1, and given that a rigorous analysis of the Freedman-based bonus with variance reduction is presented in Zhang et al. (2020), one should not find it difficult to extend our Hoeffding-based analysis to Algorithm 2. Therefore, rather than providing a complete proof of Theorem 3, in the following we sketch the differences and highlight the additional analysis needed that is not covered by the proof of Theorem 1 or by Zhang et al. (2020). To facilitate the analysis, first recall the notations $N^k_h$, $\check{N}^k_h$, $Q^k_h(s, a)$, $V^k_h(s)$, $n^k_h$, $l^k_{h,i}$, $\check{n}^k_h$, $\check{l}^k_{h,i}$, $l_i$, and $\check{l}_i$ that we defined in Section 4. In addition, when $(h, k)$ is clear from the context, we drop the time indices and simply use $\check{\mu}$, $\check{\sigma}$, $\mu^{\mathrm{ref}}$, $\sigma^{\mathrm{ref}}$ to denote their corresponding values in the computation of the $Q_h(s^k_h, a^k_h)$ value in Line 15 of Algorithm 2. We start with the following lemma, which is an analogue of Lemma 2 but requires a more careful treatment of the variations accumulated in $\mu^{\mathrm{ref}}$ and $\check{\mu}_h$. It states that the optimistic $Q^k_h(s, a)$ is an upper bound on the optimal $Q^{k,*}_h(s, a)$ with high probability.
Algorithm 2: RestartQ-UCB (Freedman)

for epoch $d \leftarrow 1$ to $D$ do
    Initialize: $V_h(s) \leftarrow H - h + 1$, $Q_h(s, a) \leftarrow H - h + 1$, $N_h(s, a) \leftarrow 0$, $\check{N}_h(s, a) \leftarrow 0$, $\check{r}_h(s, a) \leftarrow 0$, $\check{\mu}_h(s, a) \leftarrow 0$, $\check{v}_h(s, a) \leftarrow 0$, $\check{\sigma}_h(s, a) \leftarrow 0$, $\mu^{\mathrm{ref}}_h(s, a) \leftarrow 0$, $\sigma^{\mathrm{ref}}_h(s, a) \leftarrow 0$, $V^{\mathrm{ref}}_h(s) \leftarrow H$, for all $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$;
    for episode $k \leftarrow (d-1)K + 1$ to $\min\{dK, M\}$ do
        observe $s^k_1$;
        for step $h \leftarrow 1$ to $H$ do
            Take action $a^k_h \leftarrow \arg\max_a Q_h(s^k_h, a)$, receive $R^k_h(s^k_h, a^k_h)$, and observe $s^k_{h+1}$;
            $\check{r} \leftarrow \check{r} + R^k_h(s^k_h, a^k_h)$, $\check{v} \leftarrow \check{v} + V_{h+1}(s^k_{h+1})$;
            $\check{\mu} \leftarrow \check{\mu} + V_{h+1}(s^k_{h+1}) - V^{\mathrm{ref}}_{h+1}(s^k_{h+1})$, $\check{\sigma} \leftarrow \check{\sigma} + \big(V_{h+1}(s^k_{h+1}) - V^{\mathrm{ref}}_{h+1}(s^k_{h+1})\big)^2$;
            $\mu^{\mathrm{ref}} \leftarrow \mu^{\mathrm{ref}} + V^{\mathrm{ref}}_{h+1}(s^k_{h+1})$, $\sigma^{\mathrm{ref}} \leftarrow \sigma^{\mathrm{ref}} + \big(V^{\mathrm{ref}}_{h+1}(s^k_{h+1})\big)^2$;
            $n \leftarrow n + 1$, $\check{n} \leftarrow \check{n} + 1$;
            if $n \in \mathcal{L}$ then  // reaching the end of the stage
                $b^k_h \leftarrow \sqrt{\frac{H^2}{\check{n}} \iota} + \sqrt{\frac{\iota}{\check{n}}}$, $b_\Delta \leftarrow \Delta^{(d)}_r + H\Delta^{(d)}_p$;
                $\bar{b}^k_h \leftarrow 2\sqrt{\frac{\sigma^{\mathrm{ref}}/n - (\mu^{\mathrm{ref}}/n)^2}{n} \iota} + 2\sqrt{\frac{\check{\sigma}/\check{n} - (\check{\mu}/\check{n})^2}{\check{n}} \iota} + 5\left(\frac{H\iota}{n} + \frac{H\iota}{\check{n}} + \frac{H\iota^{3/4}}{n^{3/4}} + \frac{H\iota^{3/4}}{\check{n}^{3/4}}\right) + \sqrt{\frac{\iota}{\check{n}}}$;
                $Q_h(s^k_h, a^k_h) \leftarrow \min\left\{ \frac{\check{r}}{\check{n}} + \frac{\check{v}}{\check{n}} + b^k_h + 2b_\Delta, \; \frac{\check{r}}{\check{n}} + \frac{\mu^{\mathrm{ref}}}{n} + \frac{\check{\mu}}{\check{n}} + 2\bar{b}^k_h + 4b_\Delta, \; Q_h(s^k_h, a^k_h) \right\}$;
                $V_h(s^k_h) \leftarrow \max_a Q_h(s^k_h, a)$;
                $\check{N}_h(s^k_h, a^k_h), \check{r}_h(s^k_h, a^k_h), \check{v}_h(s^k_h, a^k_h), \check{\mu}_h(s^k_h, a^k_h), \check{\sigma}_h(s^k_h, a^k_h) \leftarrow 0$;
            if $\sum_a N_h(s^k_h, a) = \Omega(SAH^6 \iota)$ then  // learn the reference value
                $V^{\mathrm{ref}}_h(s^k_h) \leftarrow V_h(s^k_h)$;

Lemma 8 (Freedman). For $\delta \in (0, 1)$, with probability at least $1 - 2KH\delta$, it holds that $Q^{k,*}_h(s, a) \le Q^{k+1}_h(s, a) \le Q^k_h(s, a)$ for all $(s, a, h, k) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K]$.

Proof. It should be clear from the way we update $Q_h(s, a)$ that $Q^k_h(s, a)$ is monotonically decreasing in $k$. We now prove $Q^{k,*}_h(s, a) \le Q^{k+1}_h(s, a)$ for all $s, a, h, k$ by induction on $k$. First, it holds for $k = 1$ by our initialization of $Q_h(s, a)$. For $k \ge 2$, suppose $Q^{j,*}_h(s, a) \le Q^{j}_h(s, a)$ for all $s, a, h$ and $1 \le j \le k$. For a fixed triple $(s, a, h)$, we consider the following two cases.

Case 1: $Q_h(s, a)$ is updated in episode $k$.
Notice that it suffices to analyze the case where $Q_h(s, a)$ is updated using the bonus term $\bar{b}^k_h$, because the other case, with $b^k_h$, is exactly the same as in Lemma 2. With probability at least $1 - \delta$,
$$\begin{aligned} Q^{k+1}_h(s, a) &= \frac{\check{r}_h(s, a)}{\check{N}^k_h(s, a)} + \frac{\mu^{\mathrm{ref}}(s, a)}{N^k_h(s, a)} + \frac{\check{\mu}_h(s, a)}{\check{N}^k_h(s, a)} + 2\bar{b}^k_h + 4b_\Delta \\ &= \frac{\check{r}_h(s, a)}{\check{n}} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \left( V^{\mathrm{ref}, l_i}_{h+1}(s^{l_i}_{h+1}) - P^{l_i}_h V^{\mathrm{ref}, l_i}_{h+1}(s, a) \right)}_{\chi_1} + \underbrace{\frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left( V^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) - V^{\mathrm{ref}, \check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) - \left( P^{\check{l}_i}_h V^{\check{l}_i}_{h+1} - P^{\check{l}_i}_h V^{\mathrm{ref}, \check{l}_i}_{h+1} \right)(s, a) \right)}_{\chi_2} \\ &\quad + \underbrace{\frac{1}{n} \sum_{i=1}^{n} P^{l_i}_h V^{\mathrm{ref}, l_i}_{h+1}(s, a) + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left( P^{\check{l}_i}_h V^{\check{l}_i}_{h+1} - P^{\check{l}_i}_h V^{\mathrm{ref}, \check{l}_i}_{h+1} \right)(s, a)}_{\chi_3} + 2\bar{b}^k_h + 4b_\Delta. \qquad (37) \end{aligned}$$
In the following, we bound each term in (37) separately. First, we have that
$$\begin{aligned} \chi_3 + 2b_\Delta &= \left( \frac{1}{n} \sum_{i=1}^{n} \left( P^{l_i}_h V^{\mathrm{ref}, l_i}_{h+1} - P^{k}_h V^{\mathrm{ref}, l_i}_{h+1} \right)(s, a) + b_\Delta \right) \qquad (38) \\ &\quad + \left( -\frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left( P^{\check{l}_i}_h V^{\mathrm{ref}, \check{l}_i}_{h+1} - P^{k}_h V^{\mathrm{ref}, \check{l}_i}_{h+1} \right)(s, a) + b_\Delta \right) \qquad (39) \\ &\quad + \frac{1}{n} \sum_{i=1}^{n} P^{k}_h V^{\mathrm{ref}, l_i}_{h+1}(s, a) - \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{k}_h V^{\mathrm{ref}, \check{l}_i}_{h+1}(s, a) + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{\check{l}_i}_h V^{\check{l}_i}_{h+1}(s, a) \qquad (40) \\ &\ge \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{\check{l}_i}_h V^{\check{l}_i}_{h+1}(s, a), \end{aligned}$$
where $(38) \ge 0$ and $(39) \ge 0$ by Hölder's inequality and the definition of $b_\Delta$, and in (40) we have that $\frac{1}{n} \sum_{i=1}^{n} P^{k}_h V^{\mathrm{ref}, l_i}_{h+1}(s, a) - \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{k}_h V^{\mathrm{ref}, \check{l}_i}_{h+1}(s, a) \ge 0$, because $V^{\mathrm{ref}, k}_{h+1}(s)$ is non-increasing in $k$. Next, martingale concentration (Freedman's inequality) gives
$$|\chi_1| \le 2\sqrt{\frac{\nu^{\mathrm{ref}} \iota}{n}} + \frac{5H\iota^{3/4}}{n^{3/4}} + \frac{2\sqrt{\iota}}{Tn} + \frac{2H\iota}{n}, \qquad (41)$$
$$|\chi_2| \le 2\sqrt{\frac{\check{\nu} \iota}{\check{n}}} + \frac{5H\iota^{3/4}}{\check{n}^{3/4}} + \frac{2\sqrt{\iota}}{T\check{n}} + \frac{2H\iota}{\check{n}}, \qquad (42)$$
where $\nu^{\mathrm{ref}} := \frac{\sigma^{\mathrm{ref}}}{n} - \left(\frac{\mu^{\mathrm{ref}}}{n}\right)^2$ and $\check{\nu} := \frac{\check{\sigma}}{\check{n}} - \left(\frac{\check{\mu}}{\check{n}}\right)^2$. Substituting the results on $\chi_1$, $\chi_2$, and $\chi_3$ back into (37), it holds with probability at least $1 - \delta$ that
$$\begin{aligned} Q^{k+1}_h(s, a) &= \frac{\check{r}_h(s, a)}{\check{n}} + \chi_1 + \chi_2 + \chi_3 + 2\bar{b}^k_h + 4b_\Delta \\ &\ge \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{\check{l}_i}_h V^{\check{l}_i}_{h+1}(s, a) + \bar{b}^k_h + 2b_\Delta \qquad (44) \\ &\ge \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} P^{\check{l}_i}_h V^{\check{l}_i,*}_{h+1}(s, a) + \bar{b}^k_h + 2b_\Delta \qquad (45) \\ &= \frac{\check{r}_h(s, a)}{\check{n}} + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left( Q^{\check{l}_i,*}_h(s, a) - r^{\check{l}_i}_h(s, a) \right) + \bar{b}^k_h + 2b_\Delta \\ &\ge \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} Q^{\check{l}_i,*}_h(s, a) + 2b_\Delta \ge Q^{k,*}_h(s, a) + b_\Delta, \qquad (46) \end{aligned}$$
where in (44) we have used (41), (42), and the lower bound on $\chi_3$; in (45) we have used $V^{\check{l}_i}_{h+1} \ge V^{\check{l}_i,*}_{h+1}$ from the induction hypothesis; and the last inequality is by Lemma 1. According to the monotonicity of $Q^k_h(s, a)$, we can conclude from (46) that $Q^{k,*}_h(s, a) \le Q^{k+1}_h(s, a) \le Q^k_h(s, a)$. In fact, we have proved the stronger statement $Q^{k+1}_h(s, a) \ge Q^{k,*}_h(s, a) + b_\Delta$, which will be useful in Case 2 below.

Case 2: $Q_h(s, a)$ is not updated in episode $k$. Then there are two possibilities:

1. If $Q_h(s, a)$ has never been updated from episode 1 to episode $k$: it is easy to see that $Q^{k+1}_h(s, a) = Q^k_h(s, a) = \cdots = Q^1_h(s, a) = H - h + 1 \ge Q^{k,*}_h(s, a)$ holds.

2. If $Q_h(s, a)$ has been updated at least once from episode 1 to episode $k$: let $j$ be the index of the latest episode in which $Q_h(s, a)$ was updated. Then, from our induction hypothesis and Case 1, we know that $Q^{j+1}_h(s, a) \ge Q^{j,*}_h(s, a) + b_\Delta$. Since $Q_h(s, a)$ has not been updated from episode $j + 1$ to episode $k$, we know that $Q^{k+1}_h(s, a) = Q^k_h(s, a) = \cdots = Q^{j+1}_h(s, a) \ge Q^{j,*}_h(s, a) + b_\Delta \ge Q^{k,*}_h(s, a)$, where the last inequality holds because of Lemma 1.

A union bound over all time steps completes our proof.

Conditional on the successful event of Lemma 8, the dynamic regret of Algorithm 2 in epoch $d = 1$ can hence be expressed as
$$R^{(d)}(\pi, K) = \sum_{k=1}^{K} \left( V^{k,*}_1(s^k_1) - V^{k,\pi}_1(s^k_1) \right) \le \sum_{k=1}^{K} \left( V^{k}_1(s^k_1) - V^{k,\pi}_1(s^k_1) \right).$$
From the update rules of the value functions in Algorithm 2, we have
$$V^k_h(s^k_h) \le \mathbb{1}\left[n^k_h = 0\right] H + \frac{\check{r}_h(s^k_h, a^k_h)}{\check{n}} + \frac{\mu^{\mathrm{ref}, k}_h}{n} + \frac{\check{\mu}^k_h}{\check{n}} + 2\bar{b}^k_h + 4b_\Delta = \mathbb{1}\left[n^k_h = 0\right] H + \frac{\check{r}_h(s^k_h, a^k_h)}{\check{n}} + \frac{1}{n} \sum_{i=1}^{n} V^{\mathrm{ref}, l_i}_{h+1}(s^{l_i}_{h+1}) + \frac{1}{\check{n}} \sum_{i=1}^{\check{n}} \left( V^{\check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) - V^{\mathrm{ref}, \check{l}_i}_{h+1}(s^{\check{l}_i}_{h+1}) \right) + 2\bar{b}^k_h + 4b_\Delta.$$

If we again define $\zeta^k_h := V^k_h(s^k_h) - V^{k,\pi}_h(s^k_h)$, we can follow a similar routine as in the proof of Theorem 1 (details can be found in Zhang et al. (2020)) and obtain
$$\sum_{k=1}^{K} \zeta^k_1 \le O\left(SAH^3\right) + \sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \Lambda^k_{h+1},$$
where $\Lambda^k_{h+1} := \psi^k_{h+1} + \xi^k_{h+1} + \phi^k_{h+1} + 4\bar{b}^k_h + 8b_\Delta$, with the following definitions:
$$\psi^k_{h+1} := \frac{1}{n^k_h} \sum_{i=1}^{n^k_h} \left( P^k_h V^{\mathrm{ref}, l_i}_{h+1} - P^k_h V^{\mathrm{ref}, K+1}_{h+1} \right)(s^k_h, a^k_h),$$
$$\xi^k_{h+1} := \frac{1}{\check{n}^k_h} \sum_{i=1}^{\check{n}^k_h} \left( P^k_h - e_{s^{\check{l}_i}_{h+1}} \right) \left( V^{\check{l}_i}_{h+1} - V^{\check{l}_i,*}_{h+1} \right)(s^k_h, a^k_h),$$
$$\phi^k_{h+1} := \frac{1}{\check{n}^k_h} \sum_{i=1}^{\check{n}^k_h} \left( P^k_h - e_{s^k_{h+1}} \right) \left( V^{\check{l}_i,*}_{h+1} - V^{k,\pi}_{h+1} \right)(s^k_h, a^k_h).$$
An upper bound on the first four terms in $\Lambda^k_{h+1}$ is derived in the proof of Lemma 7 in Zhang et al. (2020) (there is an extra term of $\sqrt{\iota/\check{n}}$ in our definition of $\bar{b}^k_h$ compared to theirs, but it does not affect the leading term in the upper bound). By further recalling the definition of $b_\Delta$, we can obtain the following lemma.

Lemma 9 (Lemma 7 in Zhang et al. (2020)). With probability at least $1 - O(H^2 T^4 \delta)$, it holds that
$$\sum_{h=1}^{H} \sum_{k=1}^{K} \left(1 + \frac{1}{H}\right)^{h-1} \Lambda^k_{h+1} = O\left(\sqrt{SAH^2 T \iota} + H\sqrt{T\iota} \log(T) + S^2 A^{3/2} H^8 T^{1/4} \iota + KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p\right).$$
Combined with (47) and the definition of $\zeta^k_h$, we obtain the dynamic regret bound in a single epoch:
$$R^{(d)}(\pi, K) = O\left(\sqrt{SAH^2 T \iota} + H\sqrt{T\iota} \log(T) + S^2 A^{3/2} H^8 T^{1/4} \iota + KH\Delta^{(1)}_r + KH^2\Delta^{(1)}_p\right), \quad \forall d \in [D].$$
Finally, suppose $T$ is greater than a polynomial of $S$, $A$, $\Delta$, and $H$; then $\sqrt{SAH^2 T \iota}$ is the leading term of the dynamic regret in a single epoch. In this case, summing up the dynamic regret over all $D$ epochs gives an upper bound of $O(D\sqrt{SAH^2 T} + \sum_{d=1}^{D} KH\Delta^{(d)}_r + \sum_{d=1}^{D} KH^2\Delta^{(d)}_p)$. Recall that $\sum_{d=1}^{D} \Delta^{(d)}_r \le \Delta_r$, $\sum_{d=1}^{D} \Delta^{(d)}_p \le \Delta_p$, $\Delta = \Delta_r + \Delta_p$, and that $K = \Theta(\frac{T}{DH})$. By setting $D = S^{-1/3} A^{-1/3} \Delta^{2/3} T^{1/3}$, the dynamic regret over the entire $T$ steps is bounded by $R(\pi, M) \le O(S^{1/3} A^{1/3} \Delta^{1/3} H T^{2/3})$. This completes the proof of Theorem 3.
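The final balancing step can be sanity-checked mechanically. The sketch below tracks only the exponents of $S, A, \Delta, H, T$ (ignoring constants and log factors), reads the per-epoch optimism term as $\sqrt{SAH^2 (T/D)}$ summed over $D$ epochs (our reading of the per-epoch horizon), and confirms that with $D = S^{-1/3} A^{-1/3} \Delta^{2/3} T^{1/3}$ both it and the variation term $K H^2 \Delta = (T/(DH)) H^2 \Delta$ reduce to the claimed order $S^{1/3} A^{1/3} \Delta^{1/3} H T^{2/3}$:

```python
from fractions import Fraction as F

def mul(x, y):
    # multiply monomials represented as {symbol: exponent} dicts
    return {k: x.get(k, F(0)) + y.get(k, F(0)) for k in set(x) | set(y)}

def power(x, e):
    return {k: v * e for k, v in x.items()}

S = {'S': F(1)}; A = {'A': F(1)}; H = {'H': F(1)}; T = {'T': F(1)}; Dl = {'Delta': F(1)}

# D = S^{-1/3} A^{-1/3} Delta^{2/3} T^{1/3}
D = mul(mul(power(S, F(-1, 3)), power(A, F(-1, 3))),
        mul(power(Dl, F(2, 3)), power(T, F(1, 3))))

# optimism term: D * sqrt(S*A*H^2*(T/D)) = sqrt(S*A*H^2*T*D)
optimism = power(mul(mul(S, A), mul(power(H, F(2)), mul(T, D))), F(1, 2))

# variation term: (T/(D*H)) * H^2 * Delta = T*H*Delta / D
variation = mul(mul(T, mul(H, Dl)), power(D, F(-1)))

# target order: S^{1/3} A^{1/3} Delta^{1/3} H T^{2/3}
target = mul(mul(power(S, F(1, 3)), power(A, F(1, 3))),
             mul(power(Dl, F(1, 3)), mul(H, power(T, F(2, 3)))))

clean = lambda m: {k: v for k, v in m.items() if v != 0}
print(clean(optimism) == clean(target), clean(variation) == clean(target))  # True True
```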

E PROOF OF THEOREM 4

The proof of our lower bound relies on the construction of a "hard instance" of non-stationary MDPs. The instance we construct is essentially a switching-MDP: an MDP with piecewise-constant dynamics on each segment of the horizon, whose dynamics change abruptly at the beginning of each new segment. More specifically, we divide the horizon $T$ into $L$ segments, where each segment has $T_0 := T/L$ steps and contains $M_0 := M/L$ episodes, each episode having a length of $H$. Within each such segment, the system dynamics of the MDP do not vary, and we construct the dynamics of each segment so that it is a hard instance of stationary MDPs on its own. The MDP within each segment is essentially similar to the hard instances constructed for stationary RL problems (Osband & Van Roy, 2016; Jin et al., 2018). Between two consecutive segments, the dynamics of the MDP change abruptly, and we let them vary in such a way that no information learned from previous interactions with the MDP can be reused in the new segment. In this sense, the agent needs to learn a new hard stationary MDP in each segment. Finally, optimizing the value of $L$ and the variation magnitude between consecutive segments (subject to the constraint of the total variation budget) leads to our lower bound.

We start with a simplified episodic setting where the transition kernels and reward functions are held constant within each episode, i.e., $P^m_1 = \cdots = P^m_h = \cdots = P^m_H$ and $r^m_1 = \cdots = r^m_h = \cdots = r^m_H$ for all $m \in [M]$. This is a popular but less challenging episodic setting, and its stationary counterpart has been studied in Azar et al. (2017). We further require that, when the environment varies due to the non-stationarity, all steps in one episode vary simultaneously in the same way. This simplified setting is easier to analyze, and its analysis conveniently leads to a lower bound for the un-discounted setting as a side result along the way.
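The segment bookkeeping in this construction is straightforward to sanity-check numerically; the concrete numbers below are illustrative only (the actual proof chooses $L$ and the per-switch magnitude optimally against the budget):

```python
# Switching-MDP bookkeeping: T steps split into L segments, each with
# T0 = T/L steps and M0 = M/L episodes of length H; the dynamics change
# only at the L - 1 segment boundaries, which is what consumes the budget.
T, H, L = 120_000, 10, 8          # illustrative sizes
M = T // H                        # total number of episodes
T0, M0 = T // L, M // L
assert T0 * L == T and M0 * L == M and T0 == M0 * H

per_switch = 0.05                 # variation magnitude at each abrupt switch (assumed)
budget_used = (L - 1) * per_switch
Delta = 0.5                       # total variation budget (assumed)
assert budget_used <= Delta
print(T0, M0, budget_used)
```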
Later, we will show how the analysis can be naturally extended to the more general setting introduced in Section 2, using techniques that have also been utilized in Jin et al. (2018). For simplicity of notation, we temporarily drop the $h$ indices and use $P^m$ and $r^m$ to denote the transition kernel and reward function whenever there is no ambiguity. Consider a two-state MDP as depicted in Figure 1. This MDP was initially proposed in Jaksch et al. (2010) as a hard instance of stationary MDPs, and following Jin et al. (2018), we will refer to this construction as the "JAO MDP". This MDP has 2 states $\mathcal{S} = \{s_\circ, s_|\}$ and $SA$ actions $\mathcal{A} = \{1, 2, \ldots, SA\}$. The reward does not depend on actions: state $s_|$ always gives reward 1 whatever action is taken, and state $s_\circ$ always gives reward 0. Any action taken at state $s_|$ takes the agent to state $s_\circ$ with probability $\delta$, and leaves it at state $s_|$ with probability $1 - \delta$. At state $s_\circ$, for all but a single "good" action $a^\star$, the agent is taken to state $s_|$ with probability $\delta$; for the good action $a^\star$, the agent is taken to state $s_|$ with probability $\delta + \varepsilon$, for some $0 < \varepsilon < \delta$. The exact values of $\delta$ and $\varepsilon$ will be chosen later. Note that this is not an MDP with $S$ states and $A$ actions as we desire, but the extension to an MDP with $S$ states and $A$ actions is routine (Jaksch et al., 2010) and is hence omitted here. At the end of an episode, the state deterministically transitions from any state in the last copy to the $s_\circ$ state in the first copy of the chain (the corresponding arrows are not shown in the figure). Also, the $s_|$ state in the first copy is never reached and is hence redundant.
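The two-state dynamics just described are easy to simulate. The sketch below (toy values of $\delta$, $\varepsilon$, and the action count are our own choices) estimates the long-run fraction of time spent in the rewarding state $s_|$: the good action raises it from $\delta/(2\delta) = 1/2$ to $(\delta + \varepsilon)/(2\delta + \varepsilon)$, which is the small advantage the proof trades off against the cost of identifying $a^\star$ among $SA$ actions:

```python
import random

random.seed(0)

SA, delta, eps = 4, 0.25, 0.1
good = random.randrange(SA)          # the good action a* is chosen uniformly at random

def jao_step(s, a):
    # s = 0 is the zero-reward state s_o; s = 1 is the reward-1 state s_|
    if s == 1:
        return 0 if random.random() < delta else 1
    p = delta + (eps if a == good else 0.0)
    return 1 if random.random() < p else 0

def fraction_in_reward_state(action, T0=200_000):
    s, visits = 0, 0
    for _ in range(T0):
        s = jao_step(s, action)
        visits += s
    return visits / T0

frac_good = fraction_in_reward_state(good)            # ~ (delta+eps)/(2*delta+eps)
frac_bad = fraction_in_reward_state((good + 1) % SA)  # ~ 1/2
print(frac_good, frac_bad)
```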

[Figure 1: The JAO MDP with states $s_\circ$ and $s_|$. Any action moves the agent from $s_|$ to $s_\circ$ with probability $\delta$ (staying with probability $1 - \delta$); from $s_\circ$, each action moves the agent to $s_|$ with probability $\delta$, except the good action $a^\star$, which does so with probability $\varepsilon + \delta$ (staying with probability $1 - \varepsilon - \delta$).]

To apply the JAO MDP to the simplified episodic setting, we "concatenate" $H$ copies of exactly the same JAO MDP into a chain, as depicted in Figure 2, representing the $H$ steps of an episode. The initial state of this MDP is the $s_\circ$ state in the first copy of the chain, and after each episode the state is "reset" to the initial state. In the following, we first show that the constructed MDP is a hard instance of stationary MDPs, without worrying about the evolution of the system dynamics. The techniques that we will be using are essentially the same as in the proofs of the lower bounds for the multi-armed bandit problem (Auer et al., 2002) and for the reinforcement learning problem in the un-discounted setting (Jaksch et al., 2010). The good action $a^\star$ is chosen uniformly at random from the action space $\mathcal{A}$, and we use $\mathbb{E}_\star[\cdot]$ to denote the expectation with respect to the random choice of $a^\star$. We write $\mathbb{E}_a[\cdot]$ for the expectation conditioned on action $a$ being the good action $a^\star$. Finally, we use $\mathbb{E}_{\mathrm{unif}}[\cdot]$ to denote the expectation when there is no good action in the MDP, i.e., when every action in $\mathcal{A}$ takes the agent from state $s_\circ$ to $s_|$ with probability $\delta$. Define the probability notations $\mathbb{P}_\star(\cdot)$, $\mathbb{P}_a(\cdot)$, and $\mathbb{P}_{\mathrm{unif}}(\cdot)$ analogously. Consider running a reinforcement learning algorithm on the constructed MDP for $T_0$ steps, where $T_0 = M_0 H$. It has been shown in Auer et al. (2002) and Jaksch et al. (2010) that it suffices to consider deterministic policies. Therefore, we assume that the algorithm maps deterministically from a sequence of observations to an action $a_t$ at time $t$. Define the random variables $N_|$, $N_\circ$, and $N_\circ^\star$ to be the total number of visits to state $s_|$, the total number of visits to $s_\circ$, and the total number of times that $a^\star$ is taken at state $s_\circ$, respectively. Let $s_t$ denote the state observed at time $t$, and $a_t$ the action taken at time $t$.
When there is no chance of ambiguity, we sometimes also use $s_h^m$ to denote the state at step $h$ of episode $m$, which should be interpreted as the state $s_t$ observed at time $t = (m-1)H + h$. The notation $a_h^m$ is used analogously. For this proof only, define the random variable $W(T_0)$ to be the total reward of the algorithm over the horizon $T_0$, and define $G(T_0)$ to be the (static) regret with respect to the optimal policy. Since $s_\circ$ is assumed to be the initial state, a direct decomposition of $\mathbb{E}_a[N_\oplus]$ (displayed later in this section) yields $\mathbb{E}_a[N_\oplus] \le \mathbb{E}_a[N_\circ - N_\circ^*] + (1 + \frac{\varepsilon}{\delta})\mathbb{E}_a[N_\circ^*]$. Since for any algorithm the probability of staying in state $s_\circ$ under $\mathbb{P}_a(\cdot)$ is no larger than under $\mathbb{P}_{\mathrm{unif}}(\cdot)$, it follows that
$$\mathbb{E}_a[W(T_0)] \le \mathbb{E}_a[N_\oplus] \le \mathbb{E}_a[N_\circ - N_\circ^*] + \Big(1 + \frac{\varepsilon}{\delta}\Big)\mathbb{E}_a[N_\circ^*] = \mathbb{E}_a[N_\circ] + \frac{\varepsilon}{\delta}\mathbb{E}_a[N_\circ^*] \le \mathbb{E}_{\mathrm{unif}}[N_\circ] + \frac{\varepsilon}{\delta}\mathbb{E}_a[N_\circ^*] = T_0 - \mathbb{E}_{\mathrm{unif}}[N_\oplus] + \frac{\varepsilon}{\delta}\mathbb{E}_a[N_\circ^*]. \tag{48}$$
Let $\tau_\circ^m$ denote the first step at which the state transits from $s_\circ$ to $s_\oplus$ in the $m$-th episode. Then
$$\mathbb{E}_{\mathrm{unif}}[N_\oplus] = \sum_{m=1}^{M_0}\sum_{h=1}^{H} \mathbb{P}_{\mathrm{unif}}(\tau_\circ^m = h)\,\mathbb{E}_{\mathrm{unif}}[N_\oplus \mid \tau_\circ^m = h] = \sum_{m=1}^{M_0}\sum_{h=1}^{H} (1-\delta)^{h-1}\delta\,\mathbb{E}_{\mathrm{unif}}[N_\oplus \mid \tau_\circ^m = h] \ge \sum_{m=1}^{M_0}\sum_{h=1}^{H} (1-\delta)^{h-1}\delta\,\frac{H-h}{2} = \sum_{m=1}^{M_0}\Big(\frac{H}{2} - \frac{1}{2\delta} + \frac{(1-\delta)^H}{2\delta}\Big) \ge \frac{T_0}{2} - \frac{M_0}{2\delta}. \tag{49}$$
Since the algorithm is a deterministic mapping from the observation sequence to an action, the random variable $N_\circ^*$ is also a function of the observations up to time $T_0$. In addition, since the immediate reward only depends on the current state, $N_\circ^*$ can further be considered a function of just the state sequence up to time $T_0$. Therefore, the following lemma from Jaksch et al. (2010), which in turn was adapted from Lemma A.1 in Auer et al. (2002), also applies in our setting.

Lemma 10 (Lemma 13 in Jaksch et al. (2010)). For any finite constant $B$, let $f: \{s_\circ, s_\oplus\}^{T_0+1} \to [0, B]$ be any function defined on the state sequence $s \in \{s_\circ, s_\oplus\}^{T_0+1}$. Then, for any $0 < \delta \le \frac{1}{2}$, any $0 < \varepsilon \le 1 - 2\delta$, and any $a \in \mathcal{A}$, it holds that
$$\mathbb{E}_a[f(s)] \le \mathbb{E}_{\mathrm{unif}}[f(s)] + \frac{B}{2}\cdot\frac{\varepsilon}{\sqrt{\delta}}\sqrt{2\,\mathbb{E}_{\mathrm{unif}}[N_\circ^*]}.$$
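The inner sum in (49) is evaluated in closed form, and the algebra is easy to slip on; the following is a quick numerical verification of the identity $\sum_{h=1}^{H}(1-\delta)^{h-1}\delta\,\frac{H-h}{2} = \frac{H}{2} - \frac{1}{2\delta} + \frac{(1-\delta)^H}{2\delta}$. This check is ours and is not part of the proof.

```python
def inner_sum(H, delta):
    """Direct evaluation of sum_{h=1}^H (1-delta)^(h-1) * delta * (H-h)/2."""
    return sum((1 - delta) ** (h - 1) * delta * (H - h) / 2
               for h in range(1, H + 1))

def closed_form(H, delta):
    """Closed form used in (49): H/2 - 1/(2*delta) + (1-delta)^H / (2*delta)."""
    return H / 2 - 1 / (2 * delta) + (1 - delta) ** H / (2 * delta)
```

The two expressions agree exactly for all $H \ge 1$ and $0 < \delta < 1$, so the identity in (49) is an equality, not merely a bound.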
For each $a \in \mathcal{A}$, let $N_\circ^a$ denote the number of times action $a$ is taken at state $s_\circ$, so that $N_\circ^* = N_\circ^{a^*}$. Since $N_\circ^a$ is itself a function from the state sequence to $[0, T_0]$, we can apply Lemma 10 and arrive at
$$\mathbb{E}_a[N_\circ^a] \le \mathbb{E}_{\mathrm{unif}}[N_\circ^a] + \frac{T_0}{2}\cdot\frac{\varepsilon}{\sqrt{\delta}}\sqrt{2\,\mathbb{E}_{\mathrm{unif}}[N_\circ^a]}. \tag{50}$$
From (49), we have $\sum_{a=1}^{SA}\mathbb{E}_{\mathrm{unif}}[N_\circ^a] = T_0 - \mathbb{E}_{\mathrm{unif}}[N_\oplus] \le \frac{T_0}{2} + \frac{M_0}{2\delta}$. By the Cauchy-Schwarz inequality, we further have $\sum_{a=1}^{SA}\sqrt{2\,\mathbb{E}_{\mathrm{unif}}[N_\circ^a]} \le \sqrt{SA\big(T_0 + \frac{M_0}{\delta}\big)}$. Therefore, from (50), we obtain
$$\sum_{a=1}^{SA}\mathbb{E}_a[N_\circ^a] \le \frac{T_0}{2} + \frac{M_0}{2\delta} + \frac{T_0}{2}\cdot\frac{\varepsilon}{\sqrt{\delta}}\sqrt{SA\Big(T_0 + \frac{M_0}{\delta}\Big)}. \tag{51}$$
Together with (48) and (49), it holds that
$$\mathbb{E}[W(T_0)] \le \frac{1}{SA}\sum_{a=1}^{SA}\mathbb{E}_a[W(T_0)] \le \frac{T_0}{2} + \frac{M_0}{2\delta} + \frac{\varepsilon}{\delta}\cdot\frac{1}{SA}\bigg(\frac{T_0}{2} + \frac{M_0}{2\delta} + \frac{T_0}{2}\cdot\frac{\varepsilon}{\sqrt{\delta}}\sqrt{SA\Big(T_0 + \frac{M_0}{\delta}\Big)}\bigg).$$

E.1 UN-DISCOUNTED SETTING

Let us now momentarily deviate from the episodic setting and consider the un-discounted setting (with $M_0 = 1$). This is the case of the JAO MDP in Figure 1 where there is no reset. We could



Needless to say, this assumption itself to some extent contradicts the primary motivation of transfer learning: after all, we only want to transfer knowledge among tasks that are essentially similar to one another. The definition of segments is independent of, and should not be confused with, the notion of epochs we previously defined.



to be the variation of the mean reward function within epoch $d$. By definition, we have $\sum_{d=1}^{D} \Delta_r^{(d)} \le \Delta_r$. Further, for each $d \in [D]$ and $h \in [H]$, define $\Delta_{r,h}^{(d)}$ to be the variation of the mean reward at step $h$ in epoch $d$, i.e., $\Delta_{r,h}^{(d)}$

for every epoch $d \in [D]$. Suppose $T = \Omega(SA\Delta H^2)$; summing up the dynamic regret over all the $D$ epochs gives us an upper bound of $O(D$

) uses the Bellman optimality equation. Inequality (22) follows from Hoeffding's inequality, which guarantees that $\big|\frac{1}{\check{n}}\sum_{i=1}^{\check{n}} r_h^{\check{l}_i}(s, a) - \check{r}_h(s, a)\big| \le \sqrt{\iota/\check{n}}$ with high probability, and from Lemma 1

follows from the Azuma-Hoeffding inequality. Equation (33) uses the Bellman optimality equation. Inequality (34) follows from Hoeffding's inequality, which guarantees that $\big|\frac{1}{\check{n}}\sum_{i=1}^{\check{n}} r_h^{\check{l}_i}(s, a) - \check{r}_h(s, a)\big| \le \sqrt{\iota/\check{n}}$ with high probability, and by Lemma

Following a procedure similar to Lemmas 10, 12, and 13 in Zhang et al. (2020), we can further bound $|\chi_1|$ and $|\chi_2|$ as follows:

are the steps where Freedman's inequality (Freedman, 1975) comes into use; we omit these steps since they are essentially the same as the derivations in Zhang et al. (2020). We can see from (42), (43), and the definition of $b_h^k$ that $|\chi_1| + |\chi_2| \le b_h^k$

(42), (43), and the definition of $b_h^k$ in Algorithm 2. Inequality (45) follows from the induction hypothesis that $Q_{h+1}^{\check{l}_i}(s_{h+1}^{\check{l}_i}, a) \ge Q_{h+1}^{*,\check{l}_i}(s_{h+1}^{\check{l}_i}, a)$ for all $a \in \mathcal{A}$ and $1 \le \check{l}_i \le k$. The second-to-last inequality holds due to Hoeffding's inequality, which gives $\big|\frac{1}{\check{n}}\sum_{i=1}^{\check{n}} r_h^{\check{l}_i}(s, a) - \check{r}_h(s, a)\big| \le \sqrt{\iota/\check{n}} \le b_h^k$ with high probability. Finally, the last inequality follows from Lemma 1.
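The Hoeffding step invoked throughout these derivations states that the empirical mean of $\check{n}$ bounded rewards deviates from its expectation by more than $\sqrt{\iota/\check{n}}$ only with small probability. The snippet below is our own empirical illustration, not the paper's machinery; the Bernoulli reward model and the choice $\iota = 2$ are assumptions made only for the demo.

```python
import math
import random

def hoeffding_violation_rate(p, n, iota, trials, seed=0):
    """Fraction of trials in which the empirical mean of n i.i.d.
    Bernoulli(p) rewards in [0, 1] deviates from p by more than
    sqrt(iota / n).  Hoeffding's inequality bounds this probability
    by 2 * exp(-2 * iota)."""
    rng = random.Random(seed)
    bound = math.sqrt(iota / n)
    violations = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        violations += abs(mean - p) > bound
    return violations / trials
```

With $\iota = 2$, the theoretical bound $2e^{-2\iota} \approx 0.037$ comfortably dominates the observed violation frequency.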

Figure 1: The "JAO MDP" constructed in Jaksch et al. (2010). Dashed lines denote transitions related to the good action $a^*$.

Figure 2: A chain of $H$ copies of the JAO MDP, correlated in time. At the end of an episode, the state deterministically transitions from any state in the last copy to the $s_\circ$ state in the first copy of the chain; the corresponding arrows are not shown in the figure. Also, the $s_\oplus$ state in the first copy is never reached and is hence redundant.

$$\mathbb{E}_a[N_\oplus] = \sum_{m=1}^{M_0}\sum_{h=2}^{H} \mathbb{P}_a(s_h^m = s_\oplus) = \sum_{m=1}^{M_0}\sum_{h=2}^{H} \Big[\mathbb{P}_a(s_{h-1}^m = s_\circ)\,\mathbb{P}_a(s_h^m = s_\oplus \mid s_{h-1}^m = s_\circ) + \mathbb{P}_a(s_{h-1}^m = s_\oplus)\,\mathbb{P}_a(s_h^m = s_\oplus \mid s_{h-1}^m = s_\oplus)\Big] = \sum_{m=1}^{M_0}\sum_{h=2}^{H} \Big[\delta\,\mathbb{P}_a(s_{h-1}^m = s_\circ, a_{h-1}^m \ne a^*) + (\delta + \varepsilon)\,\mathbb{P}_a(s_{h-1}^m = s_\circ, a_{h-1}^m = a^*) + (1-\delta)\,\mathbb{P}_a(s_{h-1}^m = s_\oplus)\Big] \le \delta\,\mathbb{E}_a[N_\circ - N_\circ^*] + (\delta+\varepsilon)\,\mathbb{E}_a[N_\circ^*] + (1-\delta)\,\mathbb{E}_a[N_\oplus],$$
and rearranging the last inequality gives us $\mathbb{E}_a[N_\oplus] \le \mathbb{E}_a[N_\circ - N_\circ^*] + (1 + \frac{\varepsilon}{\delta})\mathbb{E}_a[N_\circ^*]$.

is the set of transition kernels, and $r = \{r_h^m\}_{m\in[M], h\in[H]}$ is the set of mean reward functions. Specifically, when the agent takes action $a_h^m$

the proposition below. Its proof relies on a series of lemmas in Appendix B that upper bound each term in $\Lambda_{h+1}^k$ separately.

Proposition 1. With probability at least $1 - (KH + 2)\delta$, it holds that


calculate the stationary distribution and find that the optimal average reward of the JAO MDP is $\frac{\delta+\varepsilon}{2\delta+\varepsilon}$. It is also easy to calculate that the diameter of the JAO MDP is $D = \frac{1}{\delta}$. Therefore, the expected (static) regret with respect to the randomness of $a^*$ can be lower bounded by combining the above bound on $\mathbb{E}[W(T_0)]$ with the optimal average reward. By assuming $T_0 \ge DSA$ (which in turn implies $D \le \sqrt{\frac{T_0 D}{SA}}$) and setting $\varepsilon = c\sqrt{\frac{SA}{T_0 D}}$ for $c = \frac{3}{40}$, we further lower bound this quantity; it is easy to verify that our choice of $\delta$ and $\varepsilon$ satisfies the assumption that $0 < \varepsilon < \delta$. So far, we have recovered the (static) regret lower bound of $\Omega(\sqrt{SAT_0 D})$ in the un-discounted setting, which was originally proved in Jaksch et al. (2010).

Based on this result, let us now incorporate the non-stationarity of the MDP and derive a lower bound for the dynamic regret $\mathcal{R}(T)$. Recall that we are constructing the non-stationary environment as a switching-MDP. For each segment of length $T_0$, the environment is held constant, and the regret lower bound for each segment is $\Omega(\sqrt{SAT_0 D})$. At the beginning of each new segment, we sample a new action $a^*$ uniformly at random from the action space $\mathcal{A}$ to be the good action for the new segment. In this case, the learning algorithm cannot use the information it learned during its previous interactions with the environment, even if it knows the switching structure of the environment. Therefore, the algorithm needs to learn a new (static) MDP in each segment, which leads to a dynamic regret lower bound of $\Omega(L\sqrt{SAT_0 D}) = \Omega(\sqrt{SATLD})$, where we recall that $L$ is the number of segments. Every time the good action $a^*$ varies, it causes a variation of magnitude $2\varepsilon$ in the transition kernel. The constraint of the overall variation budget hence requires that $2\varepsilon L = \frac{3}{20}\sqrt{\frac{SA}{T_0 D}}\,L \le \Delta$. Finally, by assigning the largest possible value to $L$ subject to the variation budget, we obtain a dynamic regret lower bound of $\Omega(S^{\frac{1}{3}}A^{\frac{1}{3}}\Delta^{\frac{1}{3}}D^{\frac{2}{3}}T^{\frac{2}{3}})$. This completes the proof of Proposition 2.
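The optimal average reward $\frac{\delta+\varepsilon}{2\delta+\varepsilon}$ used above is the stationary probability of the reward-1 state under the policy that always plays $a^*$. The following power-iteration check on the two-state chain is our own sanity check, not part of the proof.

```python
def optimal_average_reward(delta, eps, iters=20000):
    """Stationary probability of s_plus for the two-state chain in which
    s_o -> s_plus happens with probability delta + eps (playing a*) and
    s_plus -> s_o happens with probability delta, via power iteration."""
    p_o, p_plus = 1.0, 0.0  # start from the initial state s_o
    for _ in range(iters):
        p_o, p_plus = (p_o * (1 - (delta + eps)) + p_plus * delta,
                       p_o * (delta + eps) + p_plus * (1 - delta))
    return p_plus
```

The iterate converges geometrically to the balance-equation solution $\pi(s_\oplus) = \frac{\delta+\varepsilon}{2\delta+\varepsilon}$, matching the closed form claimed in the text.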

E.2 EPISODIC SETTINGS

Now let us go back to our simplified episodic setting, as depicted in Figure 2. One major difference from the previous un-discounted setting is that we might not have time to mix between $s_\circ$ and $s_\oplus$ within $H$ steps. (Note that we only need to reach the stationary distribution over the $(s_\circ, s_\oplus)$ pair at each step $h$, rather than the stationary distribution over the entire MDP; the latter is in fact never possible, because the entire MDP is not aperiodic.) It can be shown that the optimal policy on this MDP has a mixing time of $\Theta(\frac{1}{\delta})$ (Jin et al., 2018), and hence we can choose $\delta$ to be slightly larger than $\Theta(\frac{1}{H})$ to guarantee sufficient time to mix. All the analysis up to inequality (51) carries over to the episodic setting, and essentially we can set $\delta$ to be $\Theta(\frac{1}{H})$ to get a (static) regret lower bound of $\Omega(\sqrt{SAT_0 H})$ in each segment. Another difference from the previous setting lies in the usage of the variation budget. Since we require that all the steps in the same episode vary simultaneously, it now takes a variation budget of $2\varepsilon H$ each time we switch to a new action $a^*$ at the beginning of a new segment. Therefore, the overall variation budget now puts a constraint of $2\varepsilon H L \le O(\Delta)$ on the magnitude of each switch. Again, by choosing $\varepsilon = \Theta(\sqrt{\frac{SA}{T_0 H}})$ and optimizing over possible values of $L$ subject to the budget constraint, we obtain a dynamic regret lower bound of $\Omega(S^{\frac{1}{3}}A^{\frac{1}{3}}\Delta^{\frac{1}{3}}H^{\frac{1}{3}}T^{\frac{2}{3}})$ in the simplified episodic setting. Finally, we consider the standard episodic setting as introduced in Section 2. In this setting, we essentially concatenate $H$ distinct JAO MDPs, each with an independent good action $a^*$, into a chain like Figure 2. The transition kernels of these JAO MDPs are also allowed to vary asynchronously at each step $h$, although our construction of the lower bound does not make use of this property. As argued similarly in Jin et al. (2018), the number of observations for each specific JAO MDP is only $T_0/H$, instead of $T_0$.
Therefore, we can assign a slightly larger value to $\varepsilon$, and the learning algorithm would still not be able to identify the good actions given the fewer observations. Setting $\delta = \Theta(\frac{1}{H})$ and $\varepsilon = \Theta(\sqrt{\frac{SA}{T_0}})$ leads to a (static) regret lower bound of $\Omega(H\sqrt{SAT_0})$ in the stationary RL problem. Again, the transition kernels of all the $H$ JAO MDPs vary simultaneously at the beginning of each new segment. By optimizing $L$ subject to the overall budget constraint $2\varepsilon H L \le O(\Delta)$, we obtain a dynamic regret lower bound of $\Omega(S^{\frac{1}{3}}A^{\frac{1}{3}}\Delta^{\frac{1}{3}}H^{\frac{2}{3}}T^{\frac{2}{3}})$ in the episodic setting. This completes our proof of Theorem 4.
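The final optimization over $L$ is pure algebra and can be sanity-checked numerically: with per-segment regret $H\sqrt{SAT_0}$, segment length $T_0 = T/L$, and the budget $2\varepsilon H L \le O(\Delta)$ with $\varepsilon = \Theta(\sqrt{SA/T_0}) = \Theta(\sqrt{SAL/T})$ forcing $L = \Theta\big((\Delta^2 T / (SAH^2))^{1/3}\big)$, the total regret $L \cdot H\sqrt{SAT/L}$ matches the $S^{1/3}A^{1/3}\Delta^{1/3}H^{2/3}T^{2/3}$ rate once absolute constants are dropped. The arithmetic check below is our own, with all constants set to 1.

```python
def total_regret(S, A, H, T, L):
    """L segments of length T/L, each contributing H * sqrt(S*A*(T/L))."""
    return L * H * (S * A * T / L) ** 0.5

def budget_L(S, A, H, T, Delta):
    """Largest L allowed by 2*eps*H*L <= Delta with eps = sqrt(S*A*L/T),
    ignoring absolute constants: L = (Delta^2 * T / (S*A*H^2))^(1/3)."""
    return (Delta ** 2 * T / (S * A * H ** 2)) ** (1 / 3)

def claimed_bound(S, A, H, T, Delta):
    """The claimed S^(1/3) A^(1/3) Delta^(1/3) H^(2/3) T^(2/3) rate."""
    return (S * A * Delta) ** (1 / 3) * H ** (2 / 3) * T ** (2 / 3)
```

Plugging the budget-maximal $L$ into the total regret reproduces the claimed rate exactly (not merely asymptotically), which confirms the exponent bookkeeping in the theorem.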

