NEAR-OPTIMAL ADVERSARIAL REINFORCEMENT LEARNING WITH SWITCHING COSTS

Abstract

Switching costs, which capture the costs of changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient that is strictly positive and independent of the time horizon) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process; practical scenarios where the loss distribution could be non-stationary or even adversarial are thus not considered. While adversarial RL better models this type of practical scenario, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower bound showing that the regret of any algorithm must be larger than $\Omega((HSA)^{1/3}T^{2/3})$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as in adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}(H^{1/3})$ when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of both algorithms.

1. INTRODUCTION

Reinforcement learning (RL) has recently emerged as a compelling paradigm for modeling machine learning applications with sequential decision making. In such a problem, an online learner interacts with the environment sequentially over Markov decision processes (MDPs), and aims to find a policy that achieves a low accumulated loss (or a high accumulated reward). Various algorithms have been developed for RL problems and have been shown theoretically to achieve polynomial sample efficiency, e.g., in Zimin & Neu (2013); Azar et al. (2017); Jin et al. (2018); Agarwal et al. (2019); Bai et al. (2019); Jin et al. (2020a;b); Cai et al. (2020); Gao et al. (2021); Lykouris et al. (2021); Qiao et al. (2022).

In addition to the metric of losses, switching costs, which capture the costs of changing policies during the execution of RL algorithms, are also attracting increasing attention. This is motivated by many practical scenarios where the online learner cannot change its policy for free. For example, in recommendation systems, each change of the recommendation involves processing a huge amount of data and additional computational costs (Theocharous et al., 2015). Similarly, in healthcare, each change of the medical treatment requires substantial human effort and time-consuming tests and trials (Yu et al., 2021). Switching costs also need to be considered in many other areas, e.g., robotics applications (Kober et al., 2013), education software (Bennane, 2013), computer networking (Xu et al., 2018), and database optimization (Krishnan et al., 2018).

Switching costs have been studied in various problems (please see Sec. 2 for some examples). Among these studies, a relevant line of research is on bandit learning (Geulen et al., 2010; Dekel et al., 2014; Arora et al., 2019; Shi et al., 2022). Recently, switching costs have received considerable attention in more general RL settings (Bai et al., 2019; Gao et al., 2021; Wang et al., 2021; Qiao et al., 2022). However, these studies have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process. Thus, practical scenarios where the loss distribution could be non-stationary or even adversarial are not characterized or considered. While adversarial RL better models the non-stationary or adversarial changes of the loss distribution, to the best of our knowledge, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs?

Intuitively, in adversarial RL, since policy switches would be needed much more often to adapt to the time-varying environment, it is much more difficult to achieve a low regret (including both the standard loss regret and the switching costs; please see (6)). Indeed, without a special design to reduce switching, existing algorithms for adversarial RL with $T$ episodes, such as those in Zimin & Neu (2013); Jin et al. (2020a); Lee et al. (2020) and Lykouris et al. (2021), could yield poor performance with a number of policy switches that is linear in $T$. Thus, the goal of this paper is to make the first effort in this open direction. Our first aim is to develop provably efficient algorithms that enjoy low regrets in adversarial RL with switching costs. This requires a careful reduction of switching under non-stationary or adversarial loss distributions. It turns out that previous approaches to reducing switching in static RL (e.g., those in Bai et al. (2019) and Qiao et al. (2022)) are not applicable here.
Specifically, the high-level idea in static RL is to switch frequently at the beginning, and more and more slowly in later episodes. Such a method performs well in static RL, mainly because after learning enough information about the losses at the beginning (by switching frequently), the learner can estimate the assumed fixed loss distribution accurately enough with high probability in later episodes. Thus, even though the learner switches more and more slowly, a low regret is still achievable with high probability. In contrast, when the loss distribution can change arbitrarily, this method does not work, mainly because what the learner learned in the past may not be useful for the future. For example, when the loss distribution is adversarial, a state-action pair with small losses in the past may incur large losses in the future. Thus, new ideas are required for addressing switching costs in adversarial RL. Our second aim is to understand fundamentally whether the new challenge of switching costs in adversarial RL significantly increases the regret. This requires a converse result, i.e., a lower bound on the regret, that holds for any RL algorithm. Further, we aim to understand fundamentally whether the adversarial nature of RL indeed requires many more policy switches to achieve a low loss regret.

Our contributions: In this paper, we achieve the aforementioned goals and make the following three main contributions. (We use $\Omega$, $\Theta$ and $\tilde{O}$ to hide constants and logarithmic terms.)

First, we provide a lower bound (in Theorem 1) showing that, for adversarial RL with switching costs, the regret of any algorithm must be larger than $\Omega((HSA)^{1/3}T^{2/3})$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs is no longer achievable. Further, we characterize precisely the new trade-off (in Theorem 2) between the standard loss regret and the switching costs due to the adversarial nature of RL.

Second, we develop the first-known near-optimal algorithms for adversarial RL with switching costs. As discussed above, the idea for reducing switching in static RL does not work well here. To handle losses that can change arbitrarily, our design is inspired by the approach in Shi et al. (2022) for bandit learning, but with two novel ideas. (a) We delay each switch by a fixed (but tunable) number of episodes, which ensures that a switch occurs only once every $\tilde{O}(T^{1/3})$ episodes. (b) The idea in (a) results in consistently long intervals without switching. Since the bias in estimating losses from such a long interval tends to increase the regret, it is important to construct an unbiased estimate of the losses in each interval. To achieve this, the idea in bandit learning is to treat all time-slots in each interval as one time-slot, which necessarily requires a single chosen action in each interval. Such an approach is not applicable to our more general MDP setting, since there is no guarantee of visiting a single state-action pair due to state transitions. To resolve this issue, our novel idea is to decompose each interval, and then combine the losses of each state-action pair only from the episodes in which that state-action pair is visited.
Interestingly, although this combination is random and the loss is adversarial, the resulting loss estimate is (almost) unbiased. Third, we establish the regret bounds for our new algorithms. For the case with a known transition function, we show that our algorithm achieves an $\tilde{O}((HSA)^{1/3}T^{2/3})$ regret, which matches our lower bound. For the case with an unknown transition function, we show that, with probability $1-\delta$, our algorithm achieves an $\tilde{O}\big(H^{2/3}(SA)^{1/3}T^{2/3}(\ln\frac{TSA}{\delta})^{1/2}\big)$ regret, which matches our lower bound in the dependency on $T$, $S$ and $A$, up to a small factor of $\tilde{O}(H^{1/3})$. Therefore, the regrets of our new algorithms are near-optimal. Moreover, because of our novel ideas discussed above for estimating losses and delaying switching in a setting with state transitions, our proofs of the regret bounds involve several new analytical ideas. For example, in Lemma 1 and Lemma 2, we show that our new way of estimating losses is (almost) unbiased, so that its effect on the regret is controllable. Moreover, to capture the effects of the delayed switching, our new analytical idea is to first bound the regret across intervals between adjacent switching events, and then relate the regret inside the episodes of each interval to this bound (please see Step-2 of the proofs in Appendix D and Appendix G).

2. RELATED WORK

Switching costs: Switching costs have received considerable attention in various online problems. For example, online convex optimization with switching costs has been studied in Lin et al. (2012); Chen et al. (2016); Goel et al. (2019); Shi et al. (2021a;b), etc. Convex body chasing with switching costs has been studied in Friedman & Linial (1993); Sellke (2020); Bubeck et al. (2021), etc. Switching costs have also been studied in metrical task systems (Borodin & El-Yaniv, 2005), online set covering (Buchbinder et al., 2014), the k-server problem (Lin et al., 2020), and online control (Goel & Wierman, 2019; Li et al., 2020; Lin et al., 2021), etc. Moreover, switching costs have been studied in adversarial bandit learning, e.g., in Geulen et al. (2010); Dekel et al. (2014); Arora et al. (2019); Shi et al. (2022). Our work can be viewed as a non-trivial generalization of these studies on bandit learning to adversarial MDPs, where state transitions and multiple layers in each episode require new developments in both the algorithm design and the regret analysis. To the best of our knowledge, no study in the literature has addressed the challenge due to switching costs in adversarial RL, which is the focus of this paper.

3. PROBLEM FORMULATION

We consider adversarial reinforcement learning (RL) with switching costs in episodic Markov decision processes (MDPs). Suppose there are $T$ episodes, each of which consists of $H$ layers. We use $\mathcal{S}_h$ to denote the state space of layer $h$. For ease of elaboration, as in previous work (e.g., Zimin & Neu (2013); Jin et al. (2020a) and Lee et al. (2020)), we assume that the $H$ layers are non-intersecting, i.e., $\mathcal{S}_{h'}\cap\mathcal{S}_{h''}=\emptyset$ for any $h'\neq h''$; $\mathcal{S}_0=\{s_0\}$ is a singleton; and each episode ends at the state in $\mathcal{S}_H=\{s_H\}$. Thus, the entire state space is $\mathcal{S}=\cup_{h=0}^{H}\mathcal{S}_h$ with size $S=\sum_{h=0}^{H}S_h$, where $S_h$ denotes the size of $\mathcal{S}_h$. Moreover, we use $\mathcal{A}$ to denote the action space with size $A$. Then, the MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, \{l_t\}_{t=1}^{T}, H)$, where $P$ is the transition function, with $P_h: \mathcal{S}_{h+1}\times\mathcal{S}_h\times\mathcal{A}\to[0,1]$ denoting the transition probability measure at layer $h$, and $l_t: \mathcal{S}\times\mathcal{A}\to[0,1]$ represents the loss function for episode $t$.

The online learner interacts with the Markov environment episode by episode as follows. At the beginning of each episode $t=1,\dots,T$, the online learner starts from state $s_0$ and follows an algorithm that (possibly randomly) chooses a deterministic policy $\pi_t:\mathcal{S}\to\mathcal{A}$. Next, at each layer $h=0,\dots,H-1$, after observing the current state $s_{t,h}$, the learner chooses an action $a_{t,h}=\pi_t(s_{t,h})$. Then, the learner incurs a loss $l_t(s_{t,h},a_{t,h})$. Finally, the next state $s_{t,h+1}\in\mathcal{S}_{h+1}$ is drawn according to the transition probability $P(\cdot|s_{t,h},a_{t,h})$. (For simplicity, we drop the index $h$ of $P_h$ in this paper when it is clear from the context.) These steps repeat until the learner arrives at the last state $s_H$. At the end of episode $t$, only the losses of the state-action pairs visited in the episode are observed by the learner, whereas the losses of non-visited state-action pairs remain unknown. As in (Zimin & Neu, 2013; Jin et al., 2020a; Lee et al., 2020; Cai et al., 2020), this is called "bandit feedback", which is more practical than full-information feedback (Rosenberg & Mansour, 2019a), where the losses of all state-action pairs (whether visited or not) are assumed to be known for free.

Adversarial losses: Different from static RL, which assumes that the loss distribution is fixed for all episodes, in the adversarial setting considered here we do not need any assumption on the underlying loss distribution. That is, the loss function $l_t$ could change arbitrarily across episodes.

Switching costs: As mentioned in the introduction, addressing switching costs in adversarial RL remains an open problem. The switching cost refers to the cost of changing the policy $\pi_t$. It is equal to $\beta\cdot\mathbb{1}_{\{\pi_{t+1}\neq\pi_t\}}$, where $\beta>0$ is the switching-cost coefficient and is independent of $T$, and $\mathbb{1}_E$ is an indicator function (i.e., $\mathbb{1}_E=1$ if the event $E$ occurs, and $\mathbb{1}_E=0$ otherwise). Therefore, the total cost of executing an RL algorithm $\pi$ over $T$ episodes is given by

$$\mathrm{Cost}^{\pi}(1{:}T) \triangleq \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{h=0}^{H-1} l_t\big(s^{\pi}_{t,h}, a^{\pi}_{t,h}\big) + \sum_{t=1}^{T-1}\beta\cdot\mathbb{1}_{\{\pi_{t+1}\neq\pi_t\}}\,\Big|\,\pi, P\Big], \quad (1)$$

where the expectation is taken with respect to the randomness of the state-action pairs $(s^{\pi}_{t,h}, a^{\pi}_{t,h})$ visited by $\pi$ and the possible randomness of changing the policy $\pi_t$. Next, we introduce a concept called the "occupancy measure" (Zimin & Neu, 2013; Jin et al., 2020a). Specifically, the occupancy measure $q^{\pi,P}_t(s,a) = \Pr[s^{\pi}_{t,h}=s, a^{\pi}_{t,h}=a \,|\, \pi, P] \geq 0$ is the probability of visiting the state-action pair $(s,a)$ under the algorithm $\pi$ at layer $h$ of episode $t$ with transition function $P$.
In addition (with slight abuse of notation), the occupancy measure $q^{\pi,P}_t(s',s,a) = \Pr[s^{\pi}_{t,h+1}=s', s^{\pi}_{t,h}=s, a^{\pi}_{t,h}=a \,|\, \pi, P] \geq 0$ is the probability of visiting the state-action triple $(s',s,a)$ under the algorithm $\pi$ at layers $h$ and $h+1$ of episode $t$ with transition function $P$. To be feasible, the occupancy measures need to satisfy the following conditions at layer $h$ of episode $t$. First, according to probability theory, they need to satisfy

$$q^{\pi,P}_t(s,a) = \sum_{s'\in\mathcal{S}_{h+1}} q^{\pi,P}_t(s',s,a), \text{ for all } (s,a)\in\mathcal{S}_h\times\mathcal{A}, \quad\text{and}\quad \sum_{s\in\mathcal{S}_h}\sum_{a\in\mathcal{A}} q^{\pi,P}_t(s,a) = 1. \quad (2)$$

Second, since the probability of transferring to a state $s$ from the previous layer $h-1$ must be equal to the probability of transferring from this state $s$ to the next layer $h+1$, we have

$$\sum_{s'\in\mathcal{S}_{h-1}}\sum_{a\in\mathcal{A}} q^{\pi,P}_t(s,s',a) = \sum_{s'\in\mathcal{S}_{h+1}}\sum_{a\in\mathcal{A}} q^{\pi,P}_t(s',s,a), \text{ for all } s\in\mathcal{S}_h. \quad (3)$$

Third, the occupancy measure should generate the true transition function $P$, i.e.,

$$\frac{q^{\pi,P}_t(s',s,a)}{\sum_{s''\in\mathcal{S}_{h+1}} q^{\pi,P}_t(s'',s,a)} = P_h(s'|s,a), \text{ for all } (s',s,a)\in\mathcal{S}_{h+1}\times\mathcal{S}_h\times\mathcal{A}. \quad (4)$$

We use $\mathcal{C}(P)$ to denote the set of all occupancy measures that satisfy conditions (2)-(4). Moreover, at the beginning of episode $t$, the algorithm $\pi$ associated with the occupancy measure $q^{\pi,P}_t$ chooses a deterministic policy $\pi_t$ by assigning an action $a\in\mathcal{A}$ to each state $s\in\mathcal{S}$ according to the probability

$$\Pr[a|s] = \frac{q^{\pi,P}_t(s,a)}{\sum_{b\in\mathcal{A}} q^{\pi,P}_t(s,b)}. \quad (5)$$

Then, it is not hard to show that the expected total loss, i.e., the first term in (1), can be expressed as $\mathrm{loss}^{\pi}(1{:}T) \triangleq \mathbb{E}\big[\sum_{t=1}^{T}\langle q^{\pi,P}_t, l_t\rangle \,\big|\, \pi, P\big]$. Finally, the regret of an RL algorithm $\pi$ is defined as the sum of the loss regret $R^{\pi}_{\mathrm{loss}}(T)$ and the switching costs:

$$R^{\pi}(T) \triangleq \underbrace{\max_{q\in\mathcal{C}(P)}\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\pi,P}_t - q,\, l_t\big\rangle\,\Big|\,\pi, P\Big]}_{\text{loss regret: } R^{\pi}_{\mathrm{loss}}(T)} + \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T-1}\beta\cdot\mathbb{1}_{\{\pi_{t+1}\neq\pi_t\}}\,\Big|\,\pi, P\Big]}_{\text{switching costs}}. \quad (6)$$

Our goal in this paper is to design RL algorithms that achieve as low a regret as possible against any possible sequence of loss functions $\{l_t\}_{t=1}^{T}$ and state transition function $P$.
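To make (5) concrete, here is a minimal Python sketch (ours, not from the paper; the array layout is a hypothetical choice) of how a deterministic policy is sampled from an occupancy measure:

```python
import numpy as np

def policy_from_occupancy(q, rng):
    """Sample a deterministic policy from an occupancy measure, as in (5).

    q:   array of shape (S, A); q[s, a] is the probability of visiting (s, a)
         at the layer containing state s.
    rng: a numpy random Generator.
    Returns pi: array of shape (S,) with pi[s] = the chosen action at state s.
    """
    S, A = q.shape
    pi = np.zeros(S, dtype=int)
    for s in range(S):
        mass = q[s].sum()
        if mass > 0:
            p = q[s] / mass              # Pr[a | s] in (5)
        else:                            # state never visited under q: any action works
            p = np.full(A, 1.0 / A)
        pi[s] = rng.choice(A, p=p)       # draw one action per state
    return pi

# Usage: pi = policy_from_occupancy(q, np.random.default_rng(0))
```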

4. A LOWER BOUND

In this section, we develop a lower bound on the regret for adversarial RL with switching costs. Such a lower bound quantifies how difficult it is to control the regret with switching costs under adversarial RL. Theorem 1 below provides this lower bound; the proof is given in Appendix A. (In Sec. 5 and Sec. 6, we will provide two near-optimal RL algorithms that achieve this lower bound.)

Theorem 1. For adversarial RL with switching costs and $T \geq \max\{6H^2SA, \beta\}$, the regret of any RL algorithm $\pi$ can be lower-bounded as follows:

$$R^{\pi}(T) \geq \Omega\big(\beta^{1/3}(HSA)^{1/3}T^{2/3}\big). \quad (7)$$

Theorem 1 shows that in adversarial RL with switching costs, the dependency on $T$ of the best achievable regret is at least $\Omega(T^{2/3})$. Thus, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (in Bai et al. (2019); Qiao et al. (2022), etc.) as well as in adversarial RL without switching costs (in Zimin & Neu (2013); Jin et al. (2018), etc.) is no longer achievable. This demonstrates the fundamental challenge of switching costs in adversarial RL, and new challenges are expected to arise when developing provably efficient algorithms. Further, in Theorem 2 below, we characterize precisely the new trade-off between the loss regret and switching costs defined in (6). The proof is provided in Appendix B. Intuitively, by switching more, the online RL algorithm can adapt more flexibly to the new information learned, and thus achieves a lower loss regret. Conversely, if fewer switches are allowed, the algorithm is less flexible in adapting to the new information learned, which incurs a larger loss regret.

Theorem 2. For adversarial RL with switching costs, with the switching costs equal to $O(\beta\cdot N_{\mathrm{swi}})$, the loss regret can be lower-bounded by $\Omega\big(\sqrt{\frac{HSA}{N_{\mathrm{swi}}}}\cdot T\big)$. Alternatively, to achieve a loss regret equal to $\tilde{O}\big(\sqrt{\frac{HSA}{N_{\mathrm{swi}}}}\cdot T\big)$, the switching costs incurred must be larger than $\Omega(\beta\cdot N_{\mathrm{swi}})$.

Theorem 2 provides an interesting and necessary trade-off between the loss regret and switching costs. We further elaborate on this result in three cases. First, in order to achieve a loss regret $\tilde{O}(H\sqrt{SAT})$, Theorem 2 shows that the number of switches $N_{\mathrm{swi}}$ (and thus the switching costs incurred) must be linear in $T$, i.e., essentially switching at almost every episode. This is consistent with the regret achieved in adversarial RL without switching costs, i.e., allowing a linear-in-$T$ number of switches for free. But our result further implies that, without linear-in-$T$ switches of the policy, it is impossible to achieve an $\tilde{O}(\sqrt{T})$ loss regret. Second, Theorem 2 shows that, if only a constant or $O(\ln\ln T)$ number of switches is allowed, the loss regret must be linear in $T$. In contrast, in static RL, an $\tilde{O}(\sqrt{T})$ loss regret is achieved with only $O(\ln\ln T)$ switches (Qiao et al., 2022). This indicates that the adversarial nature of RL necessarily requires significantly more policy switches to achieve a low loss regret. Third, Theorem 2 suggests that the loss regret and switching costs can be balanced at the order of $\tilde{O}(T^{2/3})$. That is, to achieve an $\tilde{O}(T^{2/3})$ loss regret, the switching costs incurred have to be $\Omega(T^{2/3})$. This is consistent with Theorem 1, where the regret (including both the loss regret and switching costs) is lower-bounded by $\Omega(T^{2/3})$.

Algorithm 1 Switching rEduced EpisoDic relative entropy policy Search (SEEDS)
Parameters: $\eta = \Theta\big(\beta^{-1/3}H^{2/3}(SA)^{-1/3}T^{-2/3}\big)$ and $\tau = \Theta\big(\beta^{2/3}(HSA)^{-1/3}T^{1/3}\big)$.
Initialization: $\Pr[a|s] = \frac{1}{A}$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Choose $\pi^{\mathrm{SEEDS}}[1]$ according to (5).
for $u = 1 : \lceil T/\tau\rceil$ do
    for $t = (u-1)\tau+1 : \min\{u\tau, T\}$ do
        Step 1: Execute the updated policy $\pi^{\mathrm{SEEDS}}[u]$.
    end for
    At the end of super-episode $u$,
    Step 2: Estimate the losses $\hat{l}^{\mathrm{SEEDS}}[u](s,a)$ for all $(s,a)$ according to (8).
    Step 3: Update the occupancy measure $\tilde{q}^{\mathrm{SEEDS},P}[u+1]$ according to (10), and choose $\pi^{\mathrm{SEEDS}}[u+1]$ according to (5).
end for
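As a reading aid, the control flow of Algorithm 1 can be sketched in Python as follows; `run_episode`, `estimate_losses`, and `omd_update` are hypothetical stand-ins for the environment interaction, estimator (8), and update (10), and `policy_from_occupancy` is the sketch given in Sec. 3:

```python
import numpy as np

def seeds(T, tau, q_init, run_episode, estimate_losses, omd_update, rng):
    """Skeleton of Algorithm 1 (SEEDS): the policy is switched at most
    once per super-episode of length tau."""
    q = q_init                                  # occupancy measure, shape (S, A)
    U = int(np.ceil(T / tau))                   # number of super-episodes
    for u in range(U):
        pi = policy_from_occupancy(q, rng)      # policy chosen via (5)
        episodes = range(u * tau, min((u + 1) * tau, T))
        trajs = [run_episode(pi) for _ in episodes]   # Step 1: execute pi
        l_hat = estimate_losses(trajs, q)       # Step 2: estimator (8)
        q = omd_update(q, l_hat)                # Step 3: mirror-descent update (10)
    return q
```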

5. THE CASE WHEN THE TRANSITION FUNCTION IS KNOWN

In this section, we study the case when the transition function is known; we will further explore the more challenging case when the transition function is unknown in Sec. 6. We propose a novel algorithm (please see Algorithm 1) with a regret that matches the lower bound in (7). Our algorithm is called Switching rEduced EpisoDic relative entropy policy Search (SEEDS). SEEDS is inspired by the episodic method in bandit learning (Shi et al., 2022). In bandit learning, the idea is to divide the time horizon into $\Theta(T^{2/3})$ episodes, and pull a single Exp3 arm in each episode. By doing so, the total switching cost is trivially $O(T^{2/3})$. Meanwhile, the loss regret in an episode is $\Theta(\eta\cdot(T^{1/3})^2)$, which is proportional to the loss variance in an episode. The final $O(T^{2/3})$ regret is then achieved by summing all these costs and tuning the parameter $\eta = \Theta(T^{-2/3})$ (this balancing arithmetic is spelled out right after the next paragraph). However, in the adversarial MDP setting that we consider, there is a key difference due to random state-action visitations, which causes several new challenges, as we discuss in the rest of this section.

Super-episodes: SEEDS divides the $T$ episodes into super-episodes, each consisting of $\tau$ consecutive episodes (the last one may be shorter). In each super-episode $u$, SEEDS executes the same policy $\pi^{\mathrm{SEEDS}}[u]$ (Step-1 in Algorithm 1) that was updated at the end of the last super-episode $u-1$, where $\tilde{q}^{\mathrm{SEEDS},P}[u]$ is the updated occupancy measure (that we will introduce soon) of SEEDS for super-episode $u$. Thus, SEEDS switches the policy at most once in each super-episode.
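For concreteness, the balancing arithmetic behind the episodic bandit idea can be spelled out as follows (the $\frac{\ln A}{\eta}$ term is the standard mirror-descent diameter term from the analysis in Shi et al. (2022), stated here as our reading of that analysis; constants are suppressed):

$$\underbrace{\Theta(T^{2/3})}_{\#\text{episodes}}\cdot\underbrace{\Theta\big(\eta\,(T^{1/3})^2\big)}_{\text{loss regret per episode}} + \underbrace{\frac{\ln A}{\eta}}_{\text{diameter term}} + \underbrace{O(T^{2/3})}_{\text{switching}} = \Theta\big(\eta\,T^{4/3}\big) + \frac{\ln A}{\eta} + O(T^{2/3}) = \tilde{O}\big(T^{2/3}\big) \quad \text{for } \eta = \Theta(T^{-2/3}).$$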

A novel idea for estimating the losses: At the end of super-episode $u$, SEEDS estimates the losses $l[u](s,a)$ of all state-action pairs in super-episode $u$. Here, it is instructive to see why the episodic importance-weighted estimator from adversarial bandit learning (i.e., without state transitions) does not apply to our problem. Due to state transitions in our more general MDP setting, we are not guaranteed to visit a single state-action pair for the whole super-episode. A naive but intuitive solution is to pretend that each state-action pair visited in super-episode $u$ was the single one visited, and let the estimated loss of each state-action pair $(s,a)$ be

$$\hat{l}[u](s,a) = \frac{\bar{l}[u](s,a)}{1-\big(1-\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\big)^{\tau}}\,\mathbb{1}_{\{(s,a)\text{ was visited in super-episode } u\}},$$

where the numerator $\bar{l}[u](s,a) = \sum_{t=(u-1)\tau+1}^{u\tau} l_t(s,a)/\tau$ is the average loss of $(s,a)$. If the loss $l_t$ were the same for all episodes $t$ in super-episode $u$, then according to the analysis in bandit learning and the inequality $1-(1-x)^{\tau}\geq x$ for all $0\leq x\leq 1$, this idea would have worked. However, the problem is that, inside super-episode $u$, the loss function $l_t$ of each episode $t$ could change arbitrarily. Thus, the estimated loss $\hat{l}[u](s,a)$ above is actually unknown and ill-defined. To resolve this difficulty due to randomly-visited state-action pairs and arbitrarily-changing loss functions, SEEDS estimates the loss as follows (Step-2 in Algorithm 1):

$$\hat{l}^{\mathrm{SEEDS}}[u](s,a) = \frac{\sum_{j=1}^{J[u]} l_{t_j(s,a)}(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)}\,\mathbb{1}_{\{(s,a)\text{ was visited in episodes } t_1(s,a),\dots,t_{J[u]}(s,a)\text{ of super-episode } u\}}, \quad (8)$$

where $J[u]$ is the number of episodes in which the state-action pair $(s,a)$ was visited in super-episode $u$. In other words, in super-episode $u$, this state-action pair $(s,a)$ was not visited in any other episode $t\in\{(u-1)\tau+1,\dots,u\tau\}\setminus\{t_1(s,a),\dots,t_{J[u]}(s,a)\}$. Thus, SEEDS estimates the losses based on the observable true losses in super-episode $u$. In this way, SEEDS elegantly resolves the aforementioned difficulty due to random state transitions and adversarial losses. Our novel idea in (8) may be of independent interest for other problems with state transitions and non-stationary or adversarial losses. Indeed, in Sec. 6, we apply this idea to the case when the transition function is unknown. In Lemma 1 below, we show that (8) is an unbiased estimate of the true loss in super-episode $u$. This is an important property that we will exploit in our regret analysis. The proof of Lemma 1 is provided in Appendix C. We use $\mathcal{F}[u]$ to denote the $\sigma$-algebra generated by the observations of SEEDS before super-episode $u$.

Lemma 1. The conditional expectation of the estimated loss designed in (8) is equal to

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = l[u](s,a), \text{ for all } (s,a), \quad (9)$$

where the expectation is taken with respect to the randomness of the episodes $t_1(s,a),\dots,t_{J[u]}(s,a)$ in which the state-action pair $(s,a)$ was visited, and $l[u](s,a) = \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_t(s,a)$ is the true loss of $(s,a)$ in super-episode $u$.
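A minimal Python sketch of the estimator in (8) (ours; the data layout `visited_losses` is a hypothetical choice):

```python
import numpy as np

def estimate_losses_seeds(visited_losses, q):
    """Estimator (8): for each (s, a), sum the losses observed in the episodes
    of the current super-episode in which (s, a) was visited, and divide by
    the (known-transition) visitation probability q[s, a].

    visited_losses: dict mapping (s, a) -> list of observed losses
                    l_{t_1}(s,a), ..., l_{t_J}(s,a) in this super-episode.
    q:              array of shape (S, A), the occupancy measure used
                    throughout this super-episode.
    """
    l_hat = np.zeros(q.shape)
    for (s, a), losses in visited_losses.items():
        # The indicator in (8): the estimate is non-zero only for visited pairs.
        l_hat[s, a] = sum(losses) / q[s, a]
    return l_hat
```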
Updating the occupancy measure: Finally, following online mirror descent (Rakhlin et al., 2009; Zimin & Neu, 2013), SEEDS updates the occupancy measure $\tilde{q}^{\mathrm{SEEDS},P}[u+1](s,a)$ for all state-action pairs $(s,a)\in\mathcal{S}\times\mathcal{A}$ as follows (Step-3 in Algorithm 1):

$$\tilde{q}^{\mathrm{SEEDS},P}[u+1] = \arg\min_{q\in\mathcal{C}(P)}\; \eta\,\big\langle q, \hat{l}^{\mathrm{SEEDS}}[u]\big\rangle + D_{\mathrm{KL}}\big(q\,\big\|\,\tilde{q}^{\mathrm{SEEDS},P}[u]\big), \quad (10)$$

where $D_{\mathrm{KL}}(q\|q') \triangleq \sum_{s\in\mathcal{S},a\in\mathcal{A}} q(s,a)\ln\frac{q(s,a)}{q'(s,a)} - \sum_{s\in\mathcal{S},a\in\mathcal{A}}[q(s,a)-q'(s,a)]$ is the unnormalized relative entropy between two occupancy measures $q$ and $q'$ on the space $\mathcal{S}\times\mathcal{A}$. Recall that $\mathcal{C}(P)$ is defined by (2)-(4). Note that the term $\langle q, \hat{l}^{\mathrm{SEEDS}}[u]\rangle$ represents the expected loss in super-episode $u$ with respect to the newly-estimated loss function $\hat{l}^{\mathrm{SEEDS}}[u]$; it captures how SEEDS adapts to and explores the newly-estimated loss function. In addition, the term $D_{\mathrm{KL}}(q\|\tilde{q}^{\mathrm{SEEDS},P}[u])$ serves as a regularizer that keeps the updated occupancy measure in (10) close to $\tilde{q}^{\mathrm{SEEDS},P}[u]$; it captures how SEEDS exploits the loss functions estimated before super-episode $u$. As a result, by tuning the parameter $\eta$ in (10), the updated occupancy measure strikes a balance between exploration and exploitation. We characterize the regret of SEEDS in Theorem 3 below.

Theorem 3. Consider adversarial RL with switching costs introduced in Sec. 3. When the transition function $P$ is known, the regret of SEEDS is upper-bounded as follows:

$$R^{\mathrm{SEEDS}}(T) \leq \tilde{O}\big(\beta^{1/3}(HSA)^{1/3}T^{2/3}\big). \quad (11)$$

Theorem 3 shows that the regret of SEEDS matches the lower bound in (7) in terms of the dependency on all the parameters $T$, $S$, $A$, $H$ and $\beta$. Thus, the regret of SEEDS is order-wise optimal. To the best of our knowledge, this is the first regret result for adversarial RL with switching costs. To prove Theorem 3, the main difficulty lies in capturing the effects of the arbitrarily-changing losses and the multiple random visitations of each state-action pair in a super-episode. To overcome this difficulty, our new idea is to first upper-bound the loss regret based on the correlated loss feedback in a super-episode, and then relate these upper bounds across all super-episodes to the final regret. The first step relies on the proof of Lemma 1, and the second step relies on another lemma in Appendix D.1 that transfers the original regret formulation to a form based on the losses from the entire super-episode. Please see Appendix D for details and the proof of Theorem 3. Further, in Theorem 4 below, we show that SEEDS attains a trade-off between the loss regret and switching costs that matches the trade-off in Theorem 2. The proof of Theorem 4 follows from the loss-regret bound of SEEDS proved in Appendix D and the trivial switching-cost bound $\beta\cdot\lceil T/\tau\rceil$. Please see the end of Appendix D for details.

Theorem 4. With a total switching cost of $O(\beta\cdot N^{\mathrm{SEEDS}})$, where $N^{\mathrm{SEEDS}} \triangleq \lceil T/\tau\rceil$ is the number of super-episodes, the loss regret of SEEDS is upper-bounded by $\tilde{O}\big(\sqrt{\frac{HSA}{N^{\mathrm{SEEDS}}}}\cdot T\big)$.

Algorithm 2 SEEDS-Unknown Transition (SEEDS-UT)
Parameters: $\eta = \Theta\big(\beta^{-1/3}H^{1/3}(SA)^{-1/3}T^{-2/3}\big)$, $\tau = \Theta\big(\beta^{2/3}H^{-2/3}(SA)^{-1/3}T^{1/3}\big)$, $\gamma = \Theta\big(\beta^{1/3}H^{2/3}(SA)^{-2/3}T^{-1/2}\big)$, and $0<\delta<1$.
Initialization: $\tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[1](s',s,a) = \frac{1}{S_{h+1}S_hA}$ and $M[1](s',s,a) = N[1](s,a) = 0$, for all $(s',s,a)\in\mathcal{S}_{h+1}\times\mathcal{S}_h\times\mathcal{A}$ and all $h$. $\mathcal{P}[1]$ contains all possible transition functions. Choose $\pi^{\mathrm{SEEDS\text{-}UT}}[1] = \pi_{\tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[1]}$ according to (2) and (5).
for $u = 1 : \lceil T/\tau\rceil$ do
    for $t = (u-1)\tau+1 : \min\{u\tau, T\}$ do
        Step 1: Execute the updated policy $\pi^{\mathrm{SEEDS\text{-}UT}}[u] = \pi_{\tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u]}$.
    end for
    At the end of super-episode $u$,
    Step 2: Estimate the losses $\hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a)$ for all $(s,a)$ according to (12).
    Step 3: Estimate the transition-function set $\mathcal{P}[u+1]$ according to (14).
    Step 4: Update the occupancy measure $\tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u+1]$ according to (10) with the constraint $q\in\mathcal{C}(\mathcal{P}[u+1])$, and choose $\pi^{\mathrm{SEEDS\text{-}UT}}[u+1]$.
end for
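For intuition about the mirror-descent step, the unconstrained minimizer of (10) has the closed form $\bar{q}[u+1](s,a) = \tilde{q}[u](s,a)\,e^{-\eta\hat{l}[u](s,a)}$ (see (20) in Appendix D), so one update can be sketched as follows; `project_to_C` is a hypothetical stand-in for the KL projection back onto the occupancy-measure set $\mathcal{C}(P)$ (e.g., via the dual form in Remark 1 of Appendix D):

```python
import numpy as np

def omd_update(q, l_hat, eta, project_to_C):
    """One step of the entropic mirror-descent update (10).

    The unconstrained minimizer is the multiplicative update (20); the
    KL projection onto the occupancy-measure set C(P) (conditions (2)-(4))
    is delegated to `project_to_C`, a hypothetical problem-specific solver.
    """
    q_bar = q * np.exp(-eta * l_hat)   # unconstrained closed form (20)
    return project_to_C(q_bar)         # KL projection back onto C(P)
```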

6. THE CASE WHEN THE TRANSITION FUNCTION IS UNKNOWN

In this section, we study the more challenging case when the transition function is unknown. We propose a novel algorithm (please see Algorithm 2) with a regret that matches the lower bound in (7) in terms of the dependency on all parameters, except for a small factor of $\tilde{O}(H^{1/3})$. Specifically, to address the new difficulty due to the unknown transition function $P$, we advance SEEDS into SEEDS-UT (where UT stands for "unknown transition") with three new components, as we explain below.

1. Since the transition function $P$ is unknown, updating the occupancy measure $q(s,a)$ (as in SEEDS) is not good enough. Instead, SEEDS-UT updates the occupancy measure $q(s',s,a)$ to take state transitions into consideration.

2. Since the transition function $P$ is unknown, the updated occupancy measure could differ from the true one. To resolve this issue, we generalize the method in Neu (2015), with a difference in handling the random sequence of state-action pairs visited in each super-episode. Specifically, SEEDS-UT estimates the loss for each super-episode $u$ as follows (Step-2 in Algorithm 2):

$$\hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a) = \frac{\sum_{j=1}^{J[u]} l_{t_j(s,a)}(s,a)}{Q^{\gamma}[u](s,a)}\,\mathbb{1}_{\{(s,a)\text{ was visited in episodes } t_1(s,a),\dots,t_{J[u]}(s,a)\text{ of super-episode } u\}}, \quad (12)$$

where $Q^{\gamma}[u](s,a) \triangleq \max_{q\in\mathcal{C}(\mathcal{P}[u])} q(s,a) + \gamma$ is the sum of the largest probability of visiting $(s,a)$ among all occupancy measures in $\mathcal{C}(\mathcal{P}[u])$ and a tunable parameter $\gamma>0$, and $\mathcal{P}[u]$ is a transition-function set that we will introduce soon. Note that (12) is another application of our idea in (8) for estimating losses in a problem with state transitions and adversarial losses. In Lemma 2 below, we show that the gap between the expectation of the estimated loss and the true loss is controlled by the parameter $\gamma$. The proof of Lemma 2 is provided in Appendix F. We use $\mathcal{F}[u]$ to denote the $\sigma$-algebra generated by the observations of SEEDS-UT before super-episode $u$.

Lemma 2. The conditional expectation of the estimated loss designed in (12) is equal to

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \frac{q^{\mathrm{SEEDS\text{-}UT},P}[u](s,a)}{\max_{q\in\mathcal{C}(\mathcal{P}[u])} q(s,a)+\gamma}\cdot l[u](s,a), \text{ for all } (s,a), \quad (13)$$

where the expectation is taken with respect to the randomness of the episodes $t_1(s,a),\dots,t_{J[u]}(s,a)$ in which $(s,a)$ was visited, $q^{\mathrm{SEEDS\text{-}UT},P}[u](s,a)$ is the true occupancy measure of SEEDS-UT conditioned on $\mathcal{F}[u]$, and $l[u](s,a) = \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_t(s,a)$ is the true loss of $(s,a)$ in super-episode $u$.

Lemma 2 shows that, as long as $\mathcal{P}[u]$ is sufficiently good for estimating the true transition function $P$ (we show how to construct such a $\mathcal{P}[u]$ below), by carefully tuning $\gamma$, the bias caused by $\max_{q\in\mathcal{C}(\mathcal{P}[u])} q(s,a)+\gamma$ (i.e., $Q^{\gamma}[u](s,a)$) is sufficiently small, so that the estimated loss is still sufficiently accurate.

3. Since the transition function $P$ is unknown, the constraint in (10) is no longer known. To resolve this issue, we generalize the method in Jin et al. (2020a), with a difference in handling the samples from the whole super-episode. Specifically, at the end of each super-episode, SEEDS-UT collects the samples from the whole super-episode to update the empirical transition probability $\bar{P}[u+1](s'|s,a) = \frac{M[u+1](s',s,a)}{\max\{N[u+1](s,a),\,1\}}$, where $M[u+1](s',s,a)$ and $N[u+1](s,a)$ count the visits to $(s',s,a)$ and $(s,a)$, respectively, before super-episode $u+1$; the confidence set $\mathcal{P}[u+1]$ of transition functions is then constructed around $\bar{P}[u+1]$ according to (14) (Step-3 in Algorithm 2). The occupancy measure $\tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u+1](s',s,a)$ is updated according to (10), but subject to the different constraint $q\in\mathcal{C}(\mathcal{P}[u+1])$ (Step-4 in Algorithm 2). We characterize the regret of SEEDS-UT in Theorem 5 below.

Theorem 5. Consider adversarial RL with switching costs introduced in Sec. 3.
When the transition function $P$ is unknown, with probability $1-\delta$, the regret of SEEDS-UT is upper-bounded as follows:

$$R^{\mathrm{SEEDS\text{-}UT}}(T) \leq \tilde{O}\Big(\beta^{1/3}H^{2/3}(SA)^{1/3}T^{2/3}\big(\ln\tfrac{TSA}{\delta}\big)^{1/2}\Big).$$

Theorem 5 shows that the regret of SEEDS-UT matches the lower bound in (7) in terms of the dependency on $T$, $S$, $A$ and $\beta$, except for a small factor of $\tilde{O}(H^{1/3})$. That is, the regret of SEEDS-UT is near-optimal. To the best of our knowledge, this is the first regret result for adversarial RL with switching costs when the transition function is unknown. To prove Theorem 5, the main difficulty is that, due to the delayed switching and the unknown transition function, the losses of SEEDS-UT in the episodes of any super-episode are correlated, and the true occupancy measure is unknown. As a result, the existing analytical ideas in adversarial RL without switching costs and in adversarial bandit learning with switching costs do not work here. To overcome these new difficulties, our analysis involves several new ideas: e.g., we construct a series in (35) to handle the multiple random visitations of each state-action pair, and we establish a super-episodic version of concentration in Step-2-iii of Appendix G by relating the second-order moment of our estimated loss to the true loss and the length $\tau$ of a super-episode. Please see Appendix G for the detailed proof of Theorem 5.
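The only change relative to (8) is the denominator; here is a minimal sketch of estimator (12), where `upper_occupancy` is a hypothetical routine returning $\max_{q\in\mathcal{C}(\mathcal{P}[u])} q(s,a)$:

```python
import numpy as np

def estimate_losses_seeds_ut(visited_losses, upper_occupancy, gamma, shape):
    """Estimator (12): like (8), but the true visitation probability is
    unknown, so divide by Q_gamma = max_{q in C(P[u])} q(s, a) + gamma.
    The additive gamma keeps the estimator's magnitude controlled, at the
    price of the small bias quantified in Lemma 2."""
    l_hat = np.zeros(shape)
    for (s, a), losses in visited_losses.items():
        Q_gamma = upper_occupancy(s, a) + gamma   # optimistic denominator
        l_hat[s, a] = sum(losses) / Q_gamma
    return l_hat
```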

7. CONCLUSION AND FUTURE WORK

In this paper, we make the first effort towards addressing the challenge of switching costs in adversarial RL. First, we provide a lower bound showing that the best achieved regret in static RL with switching costs (as well as in adversarial RL without switching costs) is no longer achievable. In addition, we characterize precisely the new trade-off between the loss regret and switching costs, which shows that the adversarial nature of RL necessarily requires more switches to achieve a low loss regret. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}(H^{1/3})$ when the transition function is unknown. Several future directions are worth pursuing. First, it is important to study adversarial RL with switching costs in linear and more general MDP settings. Another interesting direction is to extend our study to the dynamic regret, which allows the optimal policy to change over time.

A PROOF OF THEOREM 1

Note that the bandit setting is a special case (when $S = H = 1$) of our MDP setting. Thus, the lower bound for the adversarial bandit setting in Shi et al. (2022) serves as a lower bound in our MDP setting. However, directly using this bandit lower bound is not good enough for the MDP case studied in this paper. To get the lower bound in Theorem 1, the most challenging and interesting part is to design the lower-bound instance. Notice that a lower-bound transition is constructed for stochastic MDPs in Qiao et al. (2022), which shows that the MDP setting is at least as difficult as multi-armed bandits with $\Omega(HSA)$ arms, so that a similar lower bound can be obtained from the bandit lower bound. In this section, we construct a new lower-bound instance. Specifically, we divide the state space $\mathcal{S}$ and construct special state transitions, such that episodic reinforcement learning is reduced to $\Theta(S/H)$ chains of bandit learning. Notice that the lower-bound analysis in Shi et al. (2022) implies that, with the loss function $l_t$ upper-bounded by $H$, $A$ arms and $T$ time-slots, the regret of any bandit-learning algorithm with switching costs is at least $\Omega\big(\beta^{1/3}A^{1/3}(HT)^{2/3}\big)$ when $T \geq \max\{6H^2A, \beta\}$. Hence, the total regret from all $\Theta(S/H)$ chains of bandit learning is at least

$$\Omega\Big(\beta^{1/3}A^{1/3}\big(\tfrac{HT}{S/H}\big)^{2/3}\Big)\cdot\Theta(S/H) = \Omega\big(\beta^{1/3}(HSA)^{1/3}T^{2/3}\big).$$

Please see our detailed proof below.

Proof. Lower-bound instance: We consider a special instance where $S-2$ is divisible by $H-1$. First, we assign the states in the state space $\mathcal{S}$ to each layer as follows. The first layer contains a single state, i.e., $\mathcal{S}_0 = \{s_0\}$. All episodes end with the state in $\mathcal{S}_H = \{s_H\}$. Moreover, the remaining $S-2$ states are assigned evenly to the layers $h\in[1,H-1]$. That is, each layer $h\in[1,H-1]$ contains $\frac{S-2}{H-1}$ states. Following the sequence of the states at each layer, we call the index $i$ of the $i$-th state its "order". The order $i$ of the states at layer $h$ is the same in every episode, e.g., the first state at layer $h$ is always the first state at layer $h$ for all episodes, and the second state at layer $h$ is always the second state at layer $h$ for all episodes. Moreover, all actions are available at each state $s\in\mathcal{S}$.
Finally, based on this construction of the states and actions, we run the lower-bound instance for adversarial bandit learning with switching costs in Shi et al. (2022) independently as a subroutine through all $i$-th states, for all $i = 1,\dots,\frac{S-2}{H-1}$. That is, for each layer $h = 1,\dots,H-1$, $P_h(s_i|s_i,a) = 1$ for all $a$, and $P_h(s_j|s_i,a) = 0$ for all $j\neq i$ and all $a$.

Lower-bound analysis: The lower-bound analysis in Shi et al. (2022) implies that, with the loss function $l_t$ upper-bounded by $H$, $A$ arms and $T$ time-slots, the regret (including both the loss regret and switching costs) of any bandit-learning algorithm with switching costs is at least $\Omega\big(\beta^{1/3}A^{1/3}(HT)^{2/3}\big)$. Notice that, based on our lower-bound instance constructed above, there are $\frac{S-2}{H-1}$ chains of bandit learning. Hence, the total regret of any RL algorithm $\pi$ from all these $\frac{S-2}{H-1}$ chains of bandit learning can be lower-bounded as follows:

$$R^{\pi}(T) \geq \Omega\bigg(\beta^{1/3}A^{1/3}\Big(\frac{HT}{(S-2)/(H-1)}\Big)^{2/3}\bigg)\cdot\frac{S-2}{H-1} = \Omega\big(\beta^{1/3}(HSA)^{1/3}T^{2/3}\big).$$

B PROOF OF THEOREM 2

The proof of Theorem 2 follows the lower bound proved in Appendix A, but considers the loss regret and switching costs separately.

Proof. To prove Theorem 2, we use the lower-bound instance constructed above for proving Theorem 1 in Appendix A. First, the lower-bound analysis in Shi et al. (2022) implies that, for adversarial bandit learning with the loss function $l_t$ upper-bounded by $H$, $A$ arms and $T$ time-slots, when the total switching cost is equal to $O(\beta\cdot N_{\mathrm{swi}})$, the loss regret can be lower-bounded by $\Omega\big(\sqrt{\frac{A}{N_{\mathrm{swi}}}}\cdot HT\big)$. Notice that there are $\frac{S-2}{H-1}$ chains of bandit learning in the lower-bound instance constructed in Appendix A. Thus, with a total switching cost equal to $O(\beta\cdot N_{\mathrm{swi}}) \triangleq O\big(\beta\cdot\sum_{i=1}^{\frac{S-2}{H-1}} N^{i}_{\mathrm{swi}}\big)$, the loss regret of any RL algorithm $\pi$ against this lower-bound instance can be lower-bounded as follows:

$$R^{\pi}_{\mathrm{loss}}(T) \geq \sum_{i=1}^{\frac{S-2}{H-1}} \Omega\bigg(\sqrt{\frac{A}{N^{i}_{\mathrm{swi}}}}\cdot\frac{HT}{(S-2)/(H-1)}\bigg) = \Omega\bigg(\sqrt{\frac{HSA}{N_{\mathrm{swi}}}}\cdot T\bigg),$$

where the equality is because $\sum_{i=1}^{\frac{S-2}{H-1}} \frac{1}{\sqrt{N^{i}_{\mathrm{swi}}}} \geq \frac{1}{\sqrt{N_{\mathrm{swi}}}}\big(\frac{S-2}{H-1}\big)^{3/2}$. Finally, the second half of Theorem 2 is trivially true, since it is the contrapositive of the first half proved above.
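For completeness, the exponent bookkeeping behind the two combining steps above (our verification, with $m \triangleq \frac{S-2}{H-1} = \Theta(S/H)$ denoting the number of chains):

$$m\cdot\Omega\Big(\beta^{1/3}A^{1/3}\big(\tfrac{HT}{m}\big)^{2/3}\Big) = \Omega\big(\beta^{1/3}A^{1/3}H^{2/3}T^{2/3}\,m^{1/3}\big) = \Omega\big(\beta^{1/3}(HSA)^{1/3}T^{2/3}\big),$$

$$\frac{HT}{m}\cdot\sqrt{A}\cdot\frac{m^{3/2}}{\sqrt{N_{\mathrm{swi}}}} = HT\sqrt{\frac{A\,m}{N_{\mathrm{swi}}}} = \Omega\bigg(\sqrt{\frac{HSA}{N_{\mathrm{swi}}}}\cdot T\bigg),$$

where the last step substitutes $m = \Theta(S/H)$ so that $H^2\cdot\frac{AS}{H} = HSA$.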

C PROOF OF LEMMA 1

Proof. First, since the expectation is taken with respect to the randomness of the episodes $t_1(s,a),\dots,t_{J[u]}(s,a)$ in which the state-action pair $(s,a)$ was visited, the left-hand side of (9) is equal to

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\subseteq[(u-1)\tau+1,\,u\tau]} \hat{l}^{\mathrm{SEEDS}}[u](s,a)\cdot\Pr\big[\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\,\big|\,\mathcal{F}[u]\big].$$

Next, according to the definition of the estimated loss that we design in (8), we have

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\subseteq[(u-1)\tau+1,\,u\tau]} \frac{\sum_{j=1}^{J[u]} l_{t_j(s,a)}(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)}\cdot\mathbb{1}_{\{(s,a)\text{ was visited in episodes } t_1(s,a),\dots,t_{J[u]}(s,a)\text{ of super-episode } u\}}\cdot\Pr\big[\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\,\big|\,\mathcal{F}[u]\big].$$

In the following, we prove that

$$\sum_{\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\subseteq[(u-1)\tau+1,\,u\tau]} \frac{\sum_{j=1}^{J[u]} l_{t_j(s,a)}(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)}\cdot\mathbb{1}_{\{\cdot\}}\cdot\Pr\big[\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\,\big|\,\mathcal{F}[u]\big] = \sum_{t=(u-1)\tau+1}^{u\tau} \tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\cdot\frac{l_t(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)}. \quad (17)$$

That is, under our new design of the estimated loss in (8), summing over all possible sets of the random episodes in which the state-action pair was visited (the outer sum on the left-hand side) is equivalent to summing over all deterministic episodes from the beginning to the end of a super-episode (the sum on the right-hand side). This is because, relying on the indicator function on the left-hand side, the sum of the total observed loss in a super-episode over all possible sets $\{t_1(s,a),\dots,t_{J[u]}(s,a)\}$ is equivalent to the sum, over each episode $t$ of the super-episode, of the true loss $l_t(s,a)$ weighted by the probability that episode $t$ is observed. Therefore, we have

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{t=(u-1)\tau+1}^{u\tau}\;\sum_{\{t_1(s,a),\dots,t_{J[u]}(s,a)\}:\, t\in\{t_1(s,a),\dots,t_{J[u]}(s,a)\}} \frac{l_t(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)}\cdot\Pr\big[\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\,\big|\,\mathcal{F}[u]\big]. \quad (18)$$

In addition, since the transition function $P$ is known, conditioned on $\mathcal{F}[u]$, the probability of visiting each state-action pair $(s,a)$ in an episode $t$ of super-episode $u$ is equal to the occupancy measure $\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)$. Finally, by combining (17) and (18), we have

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{t=(u-1)\tau+1}^{u\tau} \tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\cdot\frac{l_t(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)} = \sum_{t=(u-1)\tau+1}^{u\tau} l_t(s,a) = l[u](s,a).$$
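Lemma 1 is also easy to sanity-check numerically. The following self-contained toy simulation (ours, not part of the paper) fixes an adversarial loss sequence for a single state-action pair, draws independent Bernoulli visitations with probability $\tilde{q}(s,a)$ over a super-episode of length $\tau$, and compares the average of estimator (8) with the true super-episode loss $l[u](s,a)$:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, q_sa = 8, 0.3                          # super-episode length, visit probability
losses = rng.uniform(size=tau)              # arbitrary (adversarial) losses l_t(s,a)
true_loss = losses.sum()                    # l[u](s,a) in Lemma 1

trials = 100_000
est = np.zeros(trials)
for i in range(trials):
    visited = rng.random(tau) < q_sa        # episodes where (s,a) is visited
    est[i] = losses[visited].sum() / q_sa   # estimator (8)

print(f"true l[u]: {true_loss:.4f}, mean estimate: {est.mean():.4f}")
# The two numbers agree up to Monte Carlo noise, as Lemma 1 predicts:
# each l_t(s,a) is observed w.p. q_sa and inflated by 1/q_sa.
```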

D PROOF OF THEOREM 3 AND THEOREM 4

Since the total switching cost of SEEDS is trivially upper-bounded by $\beta\cdot\lceil T/\tau\rceil$, to prove Theorem 3 we focus on upper-bounding the loss regret of SEEDS, i.e.,

$$R^{\mathrm{SEEDS}}_{\mathrm{loss}}(T) = \max_{q\in\mathcal{C}(P)}\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q,\, l_t\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big] \triangleq \mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*},\, l_t\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big].$$

To upper-bound the loss regret, the main difficulty lies in capturing the effects of the arbitrarily-changing losses and the multiple random visitations of each state-action pair in a super-episode. To overcome this difficulty, our proof of Theorem 3 first upper-bounds the loss regret based on the correlated loss feedback in a super-episode (which relies on our new design of the estimated loss in (8) and Lemma 1), and then relates these upper bounds across all super-episodes to the final regret (which relies on Lemma 3 below, which transfers the original regret formulation to a form based on the losses from the entire super-episode). Specifically, for each super-episode, we first relate the true occupancy measure $q^{\mathrm{SEEDS},P}_t$ to the unconstrained solution $\bar{q}^{\mathrm{SEEDS},P}[u+1]$ of (10). Then, we relate $\bar{q}^{\mathrm{SEEDS},P}[u+1]$ to the optimal offline occupancy measure $q^{\pi^*}$. The gaps between them are upper-bounded mainly by using Lemma 1. Finally, by combining all the loss gaps (according to Lemma 3 and a super-episodic version of online mirror descent) with the switching-cost upper bound $\beta\lceil T/\tau\rceil$, and tuning the parameters $\eta$ and $\tau$ as in Algorithm 1, we get the regret of SEEDS in Theorem 3 and the trade-off in Theorem 4. Please see the detailed proofs of Theorem 3 and Theorem 4 in the next two subsections.

D.1 PROOF OF THEOREM 3

Proof. Step-1 (Bounding the switching costs): Since SEEDS switches at most once in each super-episode, the total switching cost of SEEDS is upper-bounded by $\beta\cdot\lceil T/\tau\rceil$. In the following, we focus on upper-bounding the loss regret $R^{\mathrm{SEEDS}}_{\mathrm{loss}}(T)$.

Step-2 (Bounding the loss regret): First, since SEEDS applies the same occupancy measure to all episodes $t$ of the same super-episode $u$ and the transition function $P$ is known, conditioned on the history before super-episode $u$, the true occupancy measures of these episodes are the same. Then, according to Lemma 3 below, we can transfer the original regret formulation to a form based on the losses from the entire super-episode.

Lemma 3. The loss regret $R^{\mathrm{SEEDS}}_{\mathrm{loss}}(T)$ of SEEDS satisfies

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big] = \mathbb{E}\Big[\sum_{u=1}^{U}\big\langle q^{\mathrm{SEEDS},P}[u] - q^{\pi^*}, l[u]\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big]. \quad (19)$$

Note that the occupancy measure and loss on the left-hand side of (19) are for each episode $t$, while those on the right-hand side are for each super-episode $u$. Please see Appendix E for the proof of Lemma 3. Next, we use $\bar{q}^{\mathrm{SEEDS},P}[u+1]$ to denote the unconstrained solution of (10), i.e.,

$$\bar{q}^{\mathrm{SEEDS},P}[u+1] \triangleq \arg\min_{q}\;\eta\,\big\langle q, \hat{l}^{\mathrm{SEEDS}}[u]\big\rangle + D_{\mathrm{KL}}\big(q\,\big\|\,\tilde{q}^{\mathrm{SEEDS},P}[u]\big),$$

while $\tilde{q}^{\mathrm{SEEDS},P}[u+1]$ is the constrained solution of (10), where the constraint is $q\in\mathcal{C}(P)$. It is not hard to get that

$$\bar{q}^{\mathrm{SEEDS},P}[u+1](s,a) = \tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\cdot e^{-\eta\,\hat{l}^{\mathrm{SEEDS}}[u](s,a)}. \quad (20)$$

To get (20), consider the function $f(q) = \eta\langle q, \hat{l}^{\mathrm{SEEDS}}[u]\rangle + D_{\mathrm{KL}}(q\|\tilde{q}^{\mathrm{SEEDS},P}[u])$. According to the definition of $D_{\mathrm{KL}}(q\|q')$ right after (10), the derivative of $f(q)$ is

$$\frac{\partial f(q)}{\partial q(s,a)} = \eta\cdot\hat{l}^{\mathrm{SEEDS}}[u](s,a) + \ln\frac{q(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)}.$$

By setting the derivative to $0$ and rearranging the terms, we get (20).

Remark 1. Similarly to the above steps that use standard convex optimization to get (20), we can get the final (constrained) solution of (10) as follows:

$$\tilde{q}^{\mathrm{SEEDS},P}[u+1](s,a) = \tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\,\frac{e^{\delta(s,a|v[u],\hat{l}[u])}}{z[u](v[u],h(s))},$$

where $z[u](v,h) = \sum_{s\in\mathcal{S}_h,a\in\mathcal{A}} \tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\,e^{\delta(s,a|v,\hat{l}[u])}$, $\delta(s,a|v,l) = -\eta\,l(s,a) - \sum_{s'\in\mathcal{S}} v(s')P(s'|s,a) + v(s)$, and $v[u] = \arg\min_v \sum_{h=0}^{H}\ln z[u](v,h)$. This is consistent with the expression provided in Proposition 1 of Zimin & Neu (2013).

Then, because of Lemma 3 and the fact that the calculated occupancy measure $\tilde{q}^{\mathrm{SEEDS},P}[u]$ is equal to the true occupancy measure $q^{\mathrm{SEEDS},P}[u]$, we have

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big] = \mathbb{E}\Big[\sum_{u=1}^{U}\big\langle \tilde{q}^{\mathrm{SEEDS},P}[u] - q^{\pi^*}, l[u]\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big].$$

According to the linearity of expectation, we can decompose the loss regret into two terms that are easier to bound:

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,P\Big] = \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle \tilde{q}^{\mathrm{SEEDS},P}[u] - \bar{q}^{\mathrm{SEEDS},P}[u+1],\, l[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big] + \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle \bar{q}^{\mathrm{SEEDS},P}[u+1] - q^{\pi^*},\, l[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big], \quad (21)$$

where the first equality uses $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]]$ and the linearity of expectation, and we drop the conditioning on SEEDS since it is clear from the context. Below, we upper-bound the two terms on the right-hand side of (21) one by one.

Step-2-i (Bounding the first term): Since $e^x \geq 1+x$, from (20) we have

$$\tilde{q}^{\mathrm{SEEDS},P}[u](s,a) - \bar{q}^{\mathrm{SEEDS},P}[u+1](s,a) \leq \eta\,\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\cdot\hat{l}^{\mathrm{SEEDS}}[u](s,a).$$

Thus, the first term on the right-hand side of (21) can be upper-bounded as follows:

$$\sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle \tilde{q}^{\mathrm{SEEDS},P}[u] - \bar{q}^{\mathrm{SEEDS},P}[u+1],\, l[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big] \leq \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\Big[\sum_{s\in\mathcal{S},a\in\mathcal{A}} \eta\,\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\cdot\hat{l}^{\mathrm{SEEDS}}[u](s,a)\cdot l[u](s,a)\,\Big|\,\mathcal{F}[u], P\Big]\Big].$$

Then, according to the definition of the estimated loss in (8), $\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)\cdot\hat{l}^{\mathrm{SEEDS}}[u](s,a)$ equals $\sum_{j=1}^{J[u]} l_{t_j(s,a)}(s,a)$ times the indicator in (8), and hence

$$\sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle \tilde{q}^{\mathrm{SEEDS},P}[u] - \bar{q}^{\mathrm{SEEDS},P}[u+1],\, l[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big] \leq \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\Big[\sum_{s\in\mathcal{S},a\in\mathcal{A}} \eta\, l[u](s,a)^2\,\Big|\,\mathcal{F}[u], P\Big]\Big] \leq \eta\, SA\Big\lceil\frac{T}{\tau}\Big\rceil\tau^2, \quad (22)$$

where the first inequality is because $\sum_{j=1}^{J[u]} l_{t_j(s,a)}(s,a) \leq \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_t(s,a) = l[u](s,a)$, and the last inequality is because $l[u](s,a)\leq\tau$ and $U = \lceil T/\tau\rceil$.

Step-2-ii (Bounding the second term): According to online mirror descent (Rakhlin et al., 2009; Zimin & Neu, 2013), the unconstrained solution $\bar{q}^{\mathrm{SEEDS},P}[u+1]$ of (10) satisfies the first-order optimality condition

$$\Big\langle q - \bar{q}^{\mathrm{SEEDS},P}[u+1],\; \eta\,\hat{l}^{\mathrm{SEEDS}}[u] + \frac{\partial D_{\mathrm{KL}}(q\|\tilde{q}^{\mathrm{SEEDS},P}[u])}{\partial q}\Big|_{q=\bar{q}^{\mathrm{SEEDS},P}[u+1]}\Big\rangle \geq 0, \quad\text{for all } q.$$

Since $\frac{\partial D_{\mathrm{KL}}(q\|\tilde{q}^{\mathrm{SEEDS},P}[u])}{\partial q(s,a)} = \ln\frac{q(s,a)}{\tilde{q}^{\mathrm{SEEDS},P}[u](s,a)}$, by rearranging the terms, we have

$$\big\langle \bar{q}^{\mathrm{SEEDS},P}[u+1] - q,\; \eta\,\hat{l}^{\mathrm{SEEDS}}[u]\big\rangle \leq \Big\langle q - \bar{q}^{\mathrm{SEEDS},P}[u+1],\; \ln\frac{\bar{q}^{\mathrm{SEEDS},P}[u+1]}{\tilde{q}^{\mathrm{SEEDS},P}[u]}\Big\rangle, \quad\text{for all } q.$$

By adding and subtracting terms on the right-hand side (the standard three-point identity of the unnormalized relative entropy), we have

$$\big\langle \bar{q}^{\mathrm{SEEDS},P}[u+1] - q,\; \eta\,\hat{l}^{\mathrm{SEEDS}}[u]\big\rangle \leq D_{\mathrm{KL}}\big(q\,\big\|\,\tilde{q}^{\mathrm{SEEDS},P}[u]\big) - D_{\mathrm{KL}}\big(\bar{q}^{\mathrm{SEEDS},P}[u+1]\,\big\|\,\tilde{q}^{\mathrm{SEEDS},P}[u]\big) - D_{\mathrm{KL}}\big(q\,\big\|\,\bar{q}^{\mathrm{SEEDS},P}[u+1]\big), \quad\text{for all } q.$$

Then, together with Lemma 1, we have

$$\sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle \bar{q}^{\mathrm{SEEDS},P}[u+1] - q^{\pi^*},\, l[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big] = \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle \bar{q}^{\mathrm{SEEDS},P}[u+1] - q^{\pi^*},\, \hat{l}^{\mathrm{SEEDS}}[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big] \leq \frac{1}{\eta}\sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\Big[D_{\mathrm{KL}}\big(q^{\pi^*}\big\|\tilde{q}^{\mathrm{SEEDS},P}[u]\big) - D_{\mathrm{KL}}\big(q^{\pi^*}\big\|\tilde{q}^{\mathrm{SEEDS},P}[u+1]\big)\,\Big|\,\mathcal{F}[u], P\Big]\Big].$$

Since the intermediate terms cancel and the relative entropy is always non-negative, the second term on the right-hand side of (21) can be upper-bounded as follows:

$$\sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle \bar{q}^{\mathrm{SEEDS},P}[u+1] - q^{\pi^*},\, l[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big] \leq \frac{D_{\mathrm{KL}}\big(q^{\pi^*}\big\|\tilde{q}^{\mathrm{SEEDS},P}[1]\big)}{\eta} \leq \frac{H}{\eta}\ln\frac{SA}{H}. \quad (23)$$

Step-3 (Final step): Finally, by combining (22), (23) and the switching-cost upper bound $\beta\cdot\lceil T/\tau\rceil$, and tuning the parameters $\eta$ and $\tau$ as in Algorithm 1, we have that the regret of SEEDS is upper-bounded by $\tilde{O}\big(\beta^{1/3}(HSA)^{1/3}T^{2/3}\big)$.
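Spelling out Step-3 (our verification, with the logarithmic factor in (23) suppressed):

$$R^{\mathrm{SEEDS}}(T) \;\lesssim\; \underbrace{\eta\,SAT\tau}_{(22)} + \underbrace{\frac{H}{\eta}}_{(23)} + \underbrace{\frac{\beta T}{\tau}}_{\text{switching}} \;=\; \sqrt{HSAT\tau} + \frac{\beta T}{\tau} \;=\; \Theta\big(\beta^{1/3}(HSA)^{1/3}T^{2/3}\big),$$

where the first equality takes $\eta = \sqrt{H/(SAT\tau)}$ and the second takes $\tau = \beta^{2/3}(HSA)^{-1/3}T^{1/3}$; substituting this $\tau$ back gives $\eta = \Theta\big(\beta^{-1/3}H^{2/3}(SA)^{-1/3}T^{-2/3}\big)$, exactly the parameter choices stated in Algorithm 1.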

D.2 PROOF OF THEOREM 4

By considering the loss-regret bound proved above and the switching-cost bound separately, we can prove Theorem 4.

Proof. According to (22) and (23) above, with the total switching cost equal to $O\big(\beta\cdot\frac{T}{\tau}\big) = O(\beta\cdot N^{\mathrm{SEEDS}})$, the loss regret of SEEDS is upper-bounded as follows:

$$R^{\mathrm{SEEDS}}_{\mathrm{loss}}(T) \leq \tilde{O}\Big(\eta\,SAT\tau + \frac{H}{\eta}\Big) = \tilde{O}\big(\sqrt{HSAT\tau}\big) = \tilde{O}\bigg(\sqrt{\frac{HSA}{N^{\mathrm{SEEDS}}}}\cdot T\bigg),$$

where the first equality is by tuning $\eta = \sqrt{\frac{H}{SAT\tau}}$, and the last equality is because $N^{\mathrm{SEEDS}} \triangleq \frac{T}{\tau}$.

E PROOF OF LEMMA 3

For the convenience of the reader, we restate Lemma 3 below.

Lemma 3. The loss regret $R^{\mathrm{SEEDS}}_{\mathrm{loss}}(T)$ of SEEDS satisfies

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big] = \mathbb{E}\Big[\sum_{u=1}^{U}\big\langle q^{\mathrm{SEEDS},P}[u] - q^{\pi^*}, l[u]\big\rangle\,\Big|\,\mathrm{SEEDS}, P\Big]. \quad (25)$$

Proof. We drop the conditioning on SEEDS since it is clear from the context. First, according to the linearity of expectation, the left-hand side of (25) is equal to

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,P\Big] = \sum_{u=1}^{U}\mathbb{E}\Big[\sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,P\Big].$$

Then, since conditioned on the history before super-episode $u$, the true occupancy measures of all episodes $t$ of the same super-episode $u$ are the same, we have

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,P\Big] = \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\Big[\sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}}\big\langle q^{\mathrm{SEEDS},P}[u] - q^{\pi^*}, l_t\big\rangle\,\Big|\,\mathcal{F}[u], P\Big]\Big].$$

Finally, since the true loss of super-episode $u$ is $l[u](s,a) = \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_t(s,a)$, we have

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,P\Big] = \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\big[\big\langle q^{\mathrm{SEEDS},P}[u] - q^{\pi^*}, l[u]\big\rangle\,\big|\,\mathcal{F}[u], P\big]\Big] = \mathbb{E}\Big[\sum_{u=1}^{U}\big\langle q^{\mathrm{SEEDS},P}[u] - q^{\pi^*}, l[u]\big\rangle\,\Big|\,P\Big].$$

F PROOF OF LEMMA 2

The proof is similar to the proof of Lemma 1 in Appendix C.

Proof. First, since the expectation is taken with respect to the randomness of the episodes $t_1(s,a),\dots,t_{J[u]}(s,a)$ in which the state-action pair $(s,a)$ was visited, the left-hand side of (13) is equal to

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\subseteq[(u-1)\tau+1,\,u\tau]} \hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a)\cdot\Pr\big[\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\,\big|\,\mathcal{F}[u]\big].$$

Next, according to the definition of the estimated loss that we design in (12), we have

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\subseteq[(u-1)\tau+1,\,u\tau]} \frac{\sum_{j=1}^{J[u]} l_{t_j(s,a)}(s,a)}{Q^{\gamma}[u](s,a)}\cdot\mathbb{1}_{\{(s,a)\text{ was visited in episodes } t_1(s,a),\dots,t_{J[u]}(s,a)\text{ of super-episode } u\}}\cdot\Pr\big[\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\,\big|\,\mathcal{F}[u]\big].$$

Then, relying on the above indicator function, the sum of the total observed loss in a super-episode over all possible sets $\{t_1(s,a),\dots,t_{J[u]}(s,a)\}$ is equivalent to the sum of the true loss of each episode of the super-episode, weighted by the probability that the episode is observed. Therefore, we have

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{t=(u-1)\tau+1}^{u\tau}\;\sum_{\{t_1(s,a),\dots,t_{J[u]}(s,a)\}:\, t\in\{t_1(s,a),\dots,t_{J[u]}(s,a)\}} \frac{l_t(s,a)}{Q^{\gamma}[u](s,a)}\cdot\Pr\big[\{t_1(s,a),\dots,t_{J[u]}(s,a)\}\,\big|\,\mathcal{F}[u]\big].$$

Finally, since conditioned on $\mathcal{F}[u]$, the probability of visiting each state-action pair $(s,a)$ in an episode $t$ of super-episode $u$ is equal to the occupancy measure $q^{\mathrm{SEEDS\text{-}UT},P}[u](s,a)$, we have

$$\mathbb{E}\big[\hat{l}^{\mathrm{SEEDS\text{-}UT}}[u](s,a)\,\big|\,\mathcal{F}[u]\big] = \sum_{t=(u-1)\tau+1}^{u\tau} \frac{q^{\mathrm{SEEDS\text{-}UT},P}[u](s,a)\cdot l_t(s,a)}{Q^{\gamma}[u](s,a)} = \frac{q^{\mathrm{SEEDS\text{-}UT},P}[u](s,a)}{Q^{\gamma}[u](s,a)}\sum_{t=(u-1)\tau+1}^{u\tau} l_t(s,a) = \frac{q^{\mathrm{SEEDS\text{-}UT},P}[u](s,a)}{Q^{\gamma}[u](s,a)}\,l[u](s,a).$$

G PROOF OF THEOREM 5

Since the total switching cost of SEEDS-UT is trivially upper-bounded by $\beta\cdot\lceil T/\tau\rceil$, to prove Theorem 5 we focus on upper-bounding the loss regret of SEEDS-UT, i.e.,

$$R^{\mathrm{SEEDS\text{-}UT}}_{\mathrm{loss}}(T) = \max_{q\in\mathcal{C}(P)}\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS\text{-}UT},P}_t - q,\, l_t\big\rangle\,\Big|\,\mathrm{SEEDS\text{-}UT}, P\Big] \triangleq \mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS\text{-}UT},P}_t - q^{\pi^*},\, l_t\big\rangle\,\Big|\,\mathrm{SEEDS\text{-}UT}, P\Big].$$

To upper-bound the loss regret, the main difficulties are that, due to the delayed switching and the unknown transition function, the losses of SEEDS-UT in the episodes of any super-episode are correlated, and the true occupancy measure is unknown. As a result, the existing analytical ideas in adversarial RL without switching costs (e.g., in Jin et al. (2020a)) and in adversarial bandit learning with switching costs (e.g., in Shi et al. (2022)) do not apply here. To overcome these new difficulties, our proof of Theorem 5 involves several key new components. For example, since SEEDS-UT collects samples from a whole super-episode to estimate the transition-function set $\mathcal{P}$, each state-action pair could be visited multiple times and such visitations are random. As a result, the proof in Jin et al. (2020a), which requires each state-action pair to be visited at most once, does not apply directly here. To resolve this difficulty, we construct a special series based on the collected samples to achieve an analyzable intermediate step for our proof of the final regret. Moreover, due to our new design of the estimated loss in (12), the concentration lemma in Jin et al. (2020a) for a loss estimate based on the samples from only one episode does not apply. To resolve this difficulty, we establish a super-episodic version of concentration in our proof by bounding the second-order moment of the estimated loss. Specifically, for each super-episode, we decompose the loss regret $\langle q^{\mathrm{SEEDS\text{-}UT},P}[u] - q^{\pi^*}, l[u]\rangle$ into four parts that are easier to upper-bound:

$$\big\langle q^{\mathrm{SEEDS\text{-}UT},P}[u] - q^{\pi^*},\, l[u]\big\rangle = \big\langle q^{\mathrm{SEEDS\text{-}UT},P}[u] - \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u],\, l[u]\big\rangle + \big\langle \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u],\, l[u] - \hat{l}^{\mathrm{SEEDS\text{-}UT}}[u]\big\rangle + \big\langle \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u] - q^{\pi^*},\, \hat{l}^{\mathrm{SEEDS\text{-}UT}}[u]\big\rangle + \big\langle q^{\pi^*},\, \hat{l}^{\mathrm{SEEDS\text{-}UT}}[u] - l[u]\big\rangle.$$

The first term on the right-hand side is mainly the difference between the true occupancy measure and the updated occupancy measure. Intuitively, by the Bernstein inequality (Maurer & Pontil, 2009) and standard stochastic RL analysis, SEEDS-UT estimates the true transition function $P$ very well by using the transition-function set $\mathcal{P}$ in (14). Thus, based on the relation between the occupancy measure and the transition function in (4), SEEDS-UT estimates the true occupancy measure very well. Hence, the first term is controllable. The second and fourth terms on the right-hand side depend on the difference between the estimated loss and the true loss. According to Lemma 2, this gap is controlled by tuning the parameter $\gamma$. The third term is similar to the loss regret in the case when the transition function is known, and can thus be upper-bounded similarly to our proof of Theorem 3 in Appendix D. Finally, by combining all these gaps with the switching-cost upper bound $\beta\cdot\lceil T/\tau\rceil$, and tuning the parameters $\eta$, $\tau$ and $\gamma$ as in Algorithm 2, we get the regret of SEEDS-UT in Theorem 5. Please see the detailed proof below.

Proof.

Step-1 (Bounding the switching costs): Since SEEDS-UT switches at most once in each super-episode, the total switching cost of SEEDS-UT is upper-bounded by $\beta\cdot\lceil T/\tau\rceil$. In the following, we focus on upper-bounding the loss regret $R^{\mathrm{SEEDS\text{-}UT}}_{\mathrm{loss}}(T)$.

Step-2 (Bounding the loss regret): We first show Lemma 4 below, which is critical for Lemma 3 to remain true in this case with an unknown transition function.

Lemma 4. For any two episodes $t_1$ and $t_2$, if the updated occupancy measures are the same, i.e., $\tilde{q}_{t_1}(s',s,a) = \tilde{q}_{t_2}(s',s,a)$ for any $(s',s,a)$, then the true occupancy measures are the same, i.e., $q_{t_1}(s,a) = q_{t_2}(s,a) = q[u](s,a)$ for any $(s,a)$, where $q[u](s,a)$ is the true occupancy measure for super-episode $u$.

The proof of Lemma 4 follows from the conditions in (2)-(5). Since SEEDS-UT applies the same occupancy measure $\tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u]$ to all episodes $t$ of the same super-episode $u$, according to Lemma 4, the true occupancy measures $q^{\mathrm{SEEDS\text{-}UT},P}_t$ of these episodes $t$ are the same. Thus, similarly to the case with a known transition function, we get an unknown-transition version of Lemma 3 here:

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS\text{-}UT},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,P\Big] = \mathbb{E}\Big[\sum_{u=1}^{U}\big\langle q^{\mathrm{SEEDS\text{-}UT},P}[u] - q^{\pi^*}, l[u]\big\rangle\,\Big|\,P\Big].$$

We drop the conditioning on SEEDS-UT in the expectation here and in the following when it is clear from the context. According to the linearity of expectation, we can decompose the loss regret into four terms that are easier to bound, i.e.,

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle q^{\mathrm{SEEDS\text{-}UT},P}_t - q^{\pi^*}, l_t\big\rangle\,\Big|\,P\Big] = \sum_{u=1}^{U}\mathbb{E}_{\mathcal{F}[u]}\Big[\mathbb{E}\Big[\big\langle q^{\mathrm{SEEDS\text{-}UT},P}[u] - \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u], l[u]\big\rangle + \big\langle \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u], l[u] - \hat{l}^{\mathrm{SEEDS\text{-}UT}}[u]\big\rangle + \big\langle \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u] - q^{\pi^*}, \hat{l}^{\mathrm{SEEDS\text{-}UT}}[u]\big\rangle + \big\langle q^{\pi^*}, \hat{l}^{\mathrm{SEEDS\text{-}UT}}[u] - l[u]\big\rangle\,\Big|\,\mathcal{F}[u], P\Big]\Big]. \quad (26)$$

Below, we upper-bound the four terms on the right-hand side of (26) one by one.

Step-2-i (Bounding the first term): Since $l_t(s,a)\leq 1$ for all state-action pairs $(s,a)$, we have $l[u](s,a)\leq\tau$ for all $(s,a)$. Thus,

$$\big\langle q^{\mathrm{SEEDS\text{-}UT},P}[u] - \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u],\, l[u]\big\rangle \leq \tau\cdot\sum_{s\in\mathcal{S},a\in\mathcal{A}}\Big|q^{\mathrm{SEEDS\text{-}UT},P}[u](s,a) - \tilde{q}^{\mathrm{SEEDS\text{-}UT},P}[u](s,a)\Big|.$$

The difference between the true occupancy measure and the updated occupancy measure on the right-hand side depends on how good the transition-function set $\mathcal{P}$ in (14) is, and can be further upper-bounded by using the Bernstein inequality (Maurer & Pontil, 2009). Below, we focus on bounding this difference. We use $\pi(a|s)$ to denote the probability of choosing action $a$ at state $s$. First, according to the relation between the occupancy measure and the transition function in (4), we have that for any state-action pair $(s_h,a_h)\in\mathcal{S}_h\times\mathcal{A}$ visited at layer $h$,

$$q^{\pi,P}(s_h,a_h) = \pi(a_h|s_h)\sum_{(s_i\in\mathcal{S}_i,\,a_i\in\mathcal{A})_{i=0}^{h-1}}\;\prod_{j=0}^{h-1}\big[\pi(a_j|s_j)\,P(s_{j+1}|s_j,a_j)\big], \quad (27)$$

where, for simplicity, we drop the index $t$ for the states $s$ and actions $a$.
Thus, the difference between the updated occupancy measure and the true occupancy measure can be upper-bounded as follows. First,
$$\hat q_{[u]}(s_h, a_h) - q_{[u]}(s_h, a_h) = \hat\pi_{[u]}(a_h|s_h) \sum_{(s_i \in S_i, a_i \in A)_{i=0}^{h-1}} \prod_{j=0}^{h-1} \hat\pi_{[u]}(a_j|s_j) \Big[ \prod_{j=0}^{h-1} \hat P_{[u]}(s_{j+1}|s_j, a_j) - \prod_{j=0}^{h-1} P(s_{j+1}|s_j, a_j) \Big]. \quad (27)$$
For the terms in the bracket $[\cdot]$, by adding and subtracting the telescoping products $\sum_{k=1}^{h-1} \prod_{j=0}^{k-1} P(s_{j+1}|s_j,a_j) \prod_{j=k}^{h-1} \hat P_{[u]}(s_{j+1}|s_j,a_j)$, we have
$$\prod_{j=0}^{h-1} \hat P_{[u]}(s_{j+1}|s_j,a_j) - \prod_{j=0}^{h-1} P(s_{j+1}|s_j,a_j) = \sum_{k=0}^{h-1} \big[ \hat P_{[u]}(s_{k+1}|s_k,a_k) - P(s_{k+1}|s_k,a_k) \big] \prod_{j=0}^{k-1} P(s_{j+1}|s_j,a_j) \prod_{j=k+1}^{h-1} \hat P_{[u]}(s_{j+1}|s_j,a_j) \le \sum_{k=0}^{h-1} \varepsilon_{[u]}(s_{k+1}|s_k,a_k) \prod_{j=0}^{k-1} P(s_{j+1}|s_j,a_j) \prod_{j=k+1}^{h-1} \hat P_{[u]}(s_{j+1}|s_j,a_j), \quad (28)$$
where
$$\varepsilon_{[u]}(s_{k+1}|s_k,a_k) = O\Bigg( \sqrt{\frac{P(s_{k+1}|s_k,a_k) \ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_k,a_k),\,1\}}} + \frac{\ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_k,a_k),\,1\}} \Bigg) \quad (29)$$
captures how well SEEDS-UT estimates the true transition function, and the inequality in (28) is due to the empirical Bernstein inequality (Maurer & Pontil, 2009) and Lemma 8 in Jin et al. (2020a). Applying (27) and (28) to SEEDS-UT, we have
$$\hat q_{[u]}(s_h, a_h) - q_{[u]}(s_h, a_h) \le \sum_{k=0}^{h-1} \sum_{(s_i \in S_i, a_i \in A)_{i=0}^{h-1}} \varepsilon_{[u]}(s_{k+1}|s_k,a_k) \cdot \Big[ \hat\pi_{[u]}(a_k|s_k) \prod_{j=0}^{k-1} \hat\pi_{[u]}(a_j|s_j)\, P(s_{j+1}|s_j,a_j) \Big] \cdot \Big[ \hat\pi_{[u]}(a_h|s_h) \prod_{j=k+1}^{h-1} \hat\pi_{[u]}(a_j|s_j)\, \hat P_{[u]}(s_{j+1}|s_j,a_j) \Big] = \sum_{k=0}^{h-1} \sum_{(s_{k+1},s_k,a_k) \in S_{k+1} \times S_k \times A} \varepsilon_{[u]}(s_{k+1}|s_k,a_k)\, q_{[u]}(s_k,a_k)\, \hat q_{[u]}(s_h,a_h|s_{k+1}). \quad (30)$$
Similarly, we can show that
$$\hat q_{[u]}(s_h,a_h|s_{k+1}) - q_{[u]}(s_h,a_h|s_{k+1}) = \sum_{j=k+1}^{h-1} \sum_{(s_{j+1},s_j,a_j) \in S_{j+1} \times S_j \times A} \varepsilon_{[u]}(s_{j+1}|s_j,a_j)\, q_{[u]}(s_j,a_j|s_{k+1})\, \hat q_{[u]}(s_h,a_h|s_{j+1}) \le \hat\pi_{[u]}(a_h|s_h) \sum_{j=k+1}^{h-1} \sum_{(s_{j+1},s_j,a_j) \in S_{j+1} \times S_j \times A} \varepsilon_{[u]}(s_{j+1}|s_j,a_j)\, q_{[u]}(s_j,a_j|s_{k+1}). \quad (31)$$
Combining (30) and (31), we have
$$\sum_{u=1}^{U} \sum_{h=0}^{H-1} \sum_{(s_h,a_h) \in S_h \times A} \big[ \hat q_{[u]}(s_h,a_h) - q_{[u]}(s_h,a_h) \big] \le \sum_{u=1}^{U} \sum_{h=0}^{H-1} \sum_{(s_h,a_h)} \sum_{k=0}^{h-1} \sum_{(s_{k+1},s_k,a_k)} \varepsilon_{[u]}(s_{k+1}|s_k,a_k)\, q_{[u]}(s_k,a_k)\, q_{[u]}(s_h,a_h|s_{k+1}) + \sum_{u=1}^{U} \sum_{h=0}^{H-1} \sum_{(s_h,a_h)} \sum_{k=0}^{h-1} \sum_{(s_{k+1},s_k,a_k)} \varepsilon_{[u]}(s_{k+1}|s_k,a_k)\, q_{[u]}(s_k,a_k) \Big[ \hat\pi_{[u]}(a_h|s_h) \sum_{j=k+1}^{h-1} \sum_{(s_{j+1},s_j,a_j)} \varepsilon_{[u]}(s_{j+1}|s_j,a_j)\, q_{[u]}(s_j,a_j|s_{k+1}) \Big]. \quad (32)$$
Since $\sum_{h=0}^{H-1} \sum_{(s_h,a_h) \in S_h \times A} q_{[u]}(s_h,a_h|s_{k+1}) = 1$ and $\sum_{h=0}^{H-1} \sum_{(s_h,a_h) \in S_h \times A} \hat\pi_{[u]}(a_h|s_h) \le S$, from (32), we have
$$\sum_{u=1}^{U} \sum_{h=0}^{H-1} \sum_{(s_h,a_h) \in S_h \times A} \big[ \hat q_{[u]}(s_h,a_h) - q_{[u]}(s_h,a_h) \big] \le \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{(s_{k+1},s_k,a_k) \in S_{k+1} \times S_k \times A} \varepsilon_{[u]}(s_{k+1}|s_k,a_k)\, q_{[u]}(s_k,a_k) + S \cdot \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{j=k+1}^{H-1} \sum_{(s_{k+1},s_k,a_k)} \sum_{(s_{j+1},s_j,a_j)} \varepsilon_{[u]}(s_{k+1}|s_k,a_k)\, q_{[u]}(s_k,a_k)\, \varepsilon_{[u]}(s_{j+1}|s_j,a_j)\, q_{[u]}(s_j,a_j|s_{k+1}). \quad (33)$$
Let us focus on bounding the two terms on the right-hand side of (33) one-by-one.
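As a quick sanity check of the telescoping identity in (28) above (a worked special case we add for clarity), for $h = 2$ it reads
$$\hat P_{[u]}(s_1|s_0,a_0)\, \hat P_{[u]}(s_2|s_1,a_1) - P(s_1|s_0,a_0)\, P(s_2|s_1,a_1) = \big[ \hat P_{[u]}(s_1|s_0,a_0) - P(s_1|s_0,a_0) \big] \hat P_{[u]}(s_2|s_1,a_1) + P(s_1|s_0,a_0) \big[ \hat P_{[u]}(s_2|s_1,a_1) - P(s_2|s_1,a_1) \big],$$
which is exactly $\sum_{k=0}^{1} \big[\hat P_{[u]} - P\big](s_{k+1}|s_k,a_k) \prod_{j<k} P(s_{j+1}|s_j,a_j) \prod_{j>k} \hat P_{[u]}(s_{j+1}|s_j,a_j)$.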
For the first term, we have
$$\sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{(s_{k+1},s_k,a_k) \in S_{k+1} \times S_k \times A} \varepsilon_{[u]}(s_{k+1}|s_k,a_k)\, q_{[u]}(s_k,a_k) = O\Bigg( \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{(s_{k+1},s_k,a_k)} \Bigg[ q_{[u]}(s_k,a_k) \sqrt{\frac{P(s_{k+1}|s_k,a_k) \ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_k,a_k),1\}}} + \frac{q_{[u]}(s_k,a_k) \ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_k,a_k),1\}} \Bigg] \Bigg) \le O\Bigg( \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{(s_k,a_k) \in S_k \times A} \Bigg[ q_{[u]}(s_k,a_k) \sqrt{\frac{S_{k+1} \ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_k,a_k),1\}}} + \frac{q_{[u]}(s_k,a_k) \ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_k,a_k),1\}} \Bigg] \Bigg),$$
where the equality is according to the definition of $\varepsilon_{[u]}(s_{k+1}|s_k,a_k)$ in (29), and the inequality is according to the Cauchy-Schwarz inequality.

Note that the difficulty in further bounding the above terms is that each state-action pair could be visited multiple times in a super-episode $u$, while the counter $N_{[u]}(s_k,a_k)$ is updated only at super-episode boundaries. To this end, we construct a sequence of per-episode counters to obtain an analyzable intermediate step. Specifically, let $N_t(s_k,a_k)$ denote the number of times the state-action pair $(s_k,a_k)$ has been visited before episode $t$. Since $N_t(s_k,a_k)$ is non-decreasing in $t$, i.e., $N_{(u-1)\tau+1}(s_k,a_k) \le N_{(u-1)\tau+2}(s_k,a_k) \le \cdots \le N_{u\tau}(s_k,a_k) = N_{[u]}(s_k,a_k)$, we have
$$\frac{q_{[u]}(s_k,a_k)}{\max\{N_{[u]}(s_k,a_k),1\}} = \frac{q_{[u]}(s_k,a_k)}{\max\{N_{u\tau}(s_k,a_k),1\}} \le \cdots \le \frac{q_{[u]}(s_k,a_k)}{\max\{N_{(u-1)\tau+1}(s_k,a_k),1\}}. \quad (34)$$
In particular, since each term $q_{[u]}(s_k,a_k)/\max\{N_t(s_k,a_k),1\}$ with $t \in \{(u-1)\tau+1,\ldots,u\tau\}$ is no smaller than the left-hand side of (34), the left-hand side is upper-bounded by the average $\frac{1}{\tau} \sum_{t=(u-1)\tau+1}^{u\tau} q_{[u]}(s_k,a_k)/\max\{N_t(s_k,a_k),1\}$. Comparing our bound above with this intermediate step based on the per-episode counters, we have
$$\sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{(s_{k+1},s_k,a_k)} \varepsilon_{[u]}(s_{k+1}|s_k,a_k)\, q_{[u]}(s_k,a_k) \le O\Bigg( \frac{1}{\tau} \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{(s_k,a_k)} \sum_{t=(u-1)\tau+1}^{u\tau} \Bigg[ q_{[u]}(s_k,a_k) \sqrt{\frac{S_{k+1} \ln\frac{TSA}{\delta}}{\max\{N_t(s_k,a_k),1\}}} + \frac{q_{[u]}(s_k,a_k) \ln\frac{TSA}{\delta}}{\max\{N_t(s_k,a_k),1\}} \Bigg] \Bigg) \le O\Bigg( \frac{1}{\tau} \sum_{k=0}^{H-1} \sqrt{S_k S_{k+1} A T \ln\frac{TSA}{\delta}} \Bigg) \le O\Bigg( \frac{1}{\tau}\, HS \sqrt{AT \ln\frac{TSA}{\delta}} \Bigg).$$
Let us now consider the second term on the right-hand side of (33), which can be upper-bounded similarly to the first. First, according to the definition of $\varepsilon_{[u]}(s_{k+1}|s_k,a_k)$ in (29), this second term is upper-bounded by
$$S \cdot O\Bigg( \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{j=k+1}^{H-1} \sum_{(s_{k+1},s_k,a_k)} \sum_{(s_{j+1},s_j,a_j)} \sqrt{\frac{P(s_{k+1}|s_k,a_k) \ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_k,a_k),1\}}}\, q_{[u]}(s_k,a_k) \cdot \sqrt{\frac{P(s_{j+1}|s_j,a_j) \ln\frac{TSA}{\delta}}{\max\{N_{[u]}(s_j,a_j),1\}}}\, q_{[u]}(s_j,a_j|s_{k+1}) + \ln\frac{TSA}{\delta} \cdot \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{j=k+1}^{H-1} \sum_{(s_{k+1},s_k,a_k)} \sum_{(s_{j+1},s_j,a_j)} \Bigg[ \frac{q_{[u]}(s_k,a_k)}{\max\{N_{[u]}(s_k,a_k),1\}} + \frac{q_{[u]}(s_j,a_j)}{\max\{N_{[u]}(s_j,a_j),1\}} \Bigg] \Bigg).$$
Next, according to the Cauchy-Schwarz inequality, the terms inside the big-$O$ notation can be upper-bounded by
$$\ln\frac{TSA}{\delta} \cdot \Bigg[ \sum_{k=0}^{H-1} \sum_{j=k+1}^{H-1} \sqrt{ \sum_{u=1}^{U} \sum_{(s_{k+1},s_k,a_k),\,(s_{j+1},s_j,a_j)} \frac{q_{[u]}(s_k,a_k)\, P(s_{k+1}|s_k,a_k)\, q_{[u]}(s_j,a_j|s_{k+1})}{\max\{N_{[u]}(s_k,a_k),1\}} } \cdot \sqrt{ \sum_{u=1}^{U} \sum_{(s_{k+1},s_k,a_k),\,(s_{j+1},s_j,a_j)} \frac{q_{[u]}(s_k,a_k)\, P(s_{j+1}|s_j,a_j)\, q_{[u]}(s_j,a_j|s_{k+1})}{\max\{N_{[u]}(s_j,a_j),1\}} } + \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{j=k+1}^{H-1} \sum_{(s_{k+1},s_k,a_k),\,(s_{j+1},s_j,a_j)} \Bigg( \frac{q_{[u]}(s_k,a_k)}{\max\{N_{[u]}(s_k,a_k),1\}} + \frac{q_{[u]}(s_j,a_j)}{\max\{N_{[u]}(s_j,a_j),1\}} \Bigg) \Bigg].$$
Then, according to (34), the terms under the $\sqrt{\cdot}$ operator can be upper-bounded by
$$\sqrt{ \frac{1}{\tau} \sum_{u=1}^{U} \sum_{(s_{k+1},s_k,a_k),\,(s_{j+1},s_j,a_j)} \sum_{t=(u-1)\tau+1}^{u\tau} \frac{q_{[u]}(s_k,a_k)\, P(s_{k+1}|s_k,a_k)\, q_{[u]}(s_j,a_j|s_{k+1})}{\max\{N_t(s_k,a_k),1\}} } \cdot \sqrt{ \frac{1}{\tau} \sum_{u=1}^{U} \sum_{(s_{k+1},s_k,a_k),\,(s_{j+1},s_j,a_j)} \sum_{t=(u-1)\tau+1}^{u\tau} \frac{q_{[u]}(s_k,a_k)\, P(s_{j+1}|s_j,a_j)\, q_{[u]}(s_j,a_j|s_{k+1})}{\max\{N_t(s_j,a_j),1\}} },$$
and the second term in the bracket $[\cdot]$ can be upper-bounded by the analogous per-episode average
$$\frac{1}{\tau} \sum_{u=1}^{U} \sum_{k=0}^{H-1} \sum_{j=k+1}^{H-1} \sum_{(s_{k+1},s_k,a_k),\,(s_{j+1},s_j,a_j)} \sum_{t=(u-1)\tau+1}^{u\tau} \Bigg( \frac{q_{[u]}(s_k,a_k)}{\max\{N_t(s_k,a_k),1\}} + \frac{q_{[u]}(s_j,a_j)}{\max\{N_t(s_j,a_j),1\}} \Bigg).$$
Combining the above steps and according to Lemma 10 in Jin et al. (2020a), the second term on the right-hand side of (33) can be upper-bounded by $O\big( \frac{1}{\tau} H^2 S^2 A \ln\frac{TSA}{\delta} \big)$. Therefore, with probability $1-\delta$, the first term on the right-hand side of (26) can be upper-bounded by
$$O\Big( HS \sqrt{AT \ln\tfrac{TSA}{\delta}} + H^2 S^2 A \ln\tfrac{TSA}{\delta} \Big). \quad (36)$$

Step-2-ii (Bounding the second term): The second term on the right-hand side of (26) can be further decomposed into two terms as follows:
$$\sum_{u=1}^{U} \mathbb{E}_{F_{[u]}}\Big[ \mathbb{E}\big[ \big\langle \hat q_{[u]}, l_{[u]} - \hat l_{[u]} \big\rangle \,\big|\, F_{[u]}, P \big] \Big] = \sum_{u=1}^{U} \mathbb{E}_{F_{[u]}}\Big[ \mathbb{E}\big[ \big\langle \hat q_{[u]}, l_{[u]} - \mathbb{E}[\hat l_{[u]} \,|\, F_{[u]}] \big\rangle \,\big|\, F_{[u]}, P \big] \Big] + \sum_{u=1}^{U} \mathbb{E}_{F_{[u]}}\Big[ \mathbb{E}\big[ \big\langle \hat q_{[u]}, \mathbb{E}[\hat l_{[u]} \,|\, F_{[u]}] - \hat l_{[u]} \big\rangle \,\big|\, F_{[u]}, P \big] \Big]. \quad (37)$$
Let us consider the two terms on the right-hand side of (37). First, according to Lemma 2, and since $l_{[u]}(s,a) \le \tau$ and $Q^{\gamma}_{[u]}(s,a) \ge \hat q_{[u]}(s,a)$, we have
$$\sum_{u=1}^{U} \mathbb{E}_{F_{[u]}}\Big[ \mathbb{E}\big[ \big\langle \hat q_{[u]}, l_{[u]} - \mathbb{E}[\hat l_{[u]} \,|\, F_{[u]}] \big\rangle \,\big|\, F_{[u]}, P \big] \Big] \le \tau \sum_{u=1}^{U} \mathbb{E}_{F_{[u]}}\Bigg[ \mathbb{E}\Bigg[ \sum_{s \in S, a \in A} \Big( Q^{\gamma}_{[u]}(s,a) - q_{[u]}(s,a) + \gamma \Big) \,\Bigg|\, F_{[u]}, P \Bigg] \Bigg],$$
where the term $Q^{\gamma}_{[u]}(s,a) - q_{[u]}(s,a)$ on the right-hand side represents how well SEEDS-UT estimates the true occupancy measure using the transition-function set, and the term $\gamma$ shows that this part of the gap is controlled by the parameter $\gamma$. Then, according to the bound for the first term on the right-hand side of (26), the first term on the right-hand side of (37) is upper-bounded by $O\big( HS\sqrt{AT \ln\frac{TSA}{\delta}} + \gamma TSA \big)$. Second, according to Azuma's inequality, with probability $1-\delta$, the second term on the right-hand side of (37) is upper-bounded by $O\big( \tau H \sqrt{\frac{T}{\tau} \ln\frac{1}{\delta}} \big) \le O\big( H \sqrt{T\tau \ln\frac{1}{\delta}} \big)$. Therefore, with probability $1-\delta$, the second term on the right-hand side of (26) can be upper-bounded by
$$O\Bigg( HS \sqrt{AT \ln\frac{TSA}{\delta}} + \gamma TSA + H \sqrt{T\tau \ln\frac{1}{\delta}} \Bigg). \quad (39)$$

Step-2-iii (Bounding the third term): Following our proof for the case when the transition function is known, it is not hard to upper-bound $\sum_{u=1}^{U} \big\langle \hat q_{[u]} - q^{\pi^*}, \hat l_{[u]} \big\rangle$ similarly to the loss regret in the proof of Theorem 3 in Appendix D. In the following, we show how to get this; the key new step is a super-episodic version of the loss concentration (provided further below), since $\hat l_{[u]}$ is calculated based on the samples from a whole super-episode. By rearranging the terms, we obtain the upper bound (40) on the third term on the right-hand side of (26).

Step-2-iv (Bounding the fourth term): First, it is not hard to show that, with probability $1-\delta$,
$$\sum_{u=1}^{U} \big\langle q^{\pi^*}, \hat l_{[u]} - l_{[u]} \big\rangle \le \frac{H}{2\gamma} \ln\frac{H}{\delta}. \quad (42)$$

Step-3 (Final step): Finally, by combining (36), (39), (40), (42) and the switching-cost upper bound $\beta \cdot \lceil T/\tau \rceil$, and tuning the parameters $\eta$, $\tau$ and $\gamma$ as in Algorithm 2, we have that the regret of SEEDS-UT is upper-bounded by $O\big( \beta^{1/3} H^{2/3} (SA)^{1/3} T^{2/3} (\ln\frac{TSA}{\delta})^{1/2} \big)$ with probability $1-\delta$.
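As a back-of-the-envelope check of Step-3 (ignoring logarithmic factors, and assuming the dominant $\tau$-dependent terms are the switching cost $\beta \lceil T/\tau \rceil$ and loss-regret terms of order $H\sqrt{SA\,\tau T}$; the exact tuning is in Algorithm 2), balancing the two gives
$$\frac{\beta T}{\tau} \asymp H \sqrt{SA\,\tau T} \;\Longrightarrow\; \tau \asymp \frac{\beta^{2/3} T^{1/3}}{H^{2/3} (SA)^{1/3}} \;\Longrightarrow\; \frac{\beta T}{\tau} \asymp \beta^{1/3} H^{2/3} (SA)^{1/3} T^{2/3},$$
which is consistent with the $O\big( \beta^{1/3} H^{2/3} (SA)^{1/3} T^{2/3} \big)$ dependence stated above.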



Static MDPs: There have been recent studies on static RL with switching costs. Specifically, for tabular MDPs, Bai et al. (2019) and Zhang et al. (2020) proposed RL algorithms that attain an $\tilde O\big( \sqrt{H^{\alpha} SAT \cdot \ln\frac{TSA}{\delta}} \big)$ regret with probability $1-\delta$, while incurring $O(H^{\alpha} SA \ln T)$ switching costs, where $\alpha = 3$ and $\alpha = 2$, respectively. Recently, Qiao et al. (2022) obtained a similar $\tilde O(\sqrt{T})$ regret with probability $1-\delta$, while incurring only $O(HSA \ln\ln T)$ switching costs. Moreover, for linear MDPs (with a $d$-dimensional feature space), Gao et al. (2021) and Wang et al. (2021) obtained an $\tilde O\big( \sqrt{d^3 H^3 T} \cdot (\ln\frac{dT}{\delta})^{1/2} \big)$ regret with probability $1-\delta$, while incurring $O(dH \ln T)$ switching costs.

Adversarial MDPs: Adversarial RL better models scenarios where the loss distributions and/or the transition functions of MDPs could change over time. Specifically, for tabular MDPs with a known transition function, Zimin & Neu (2013) proposed an RL algorithm that attains an $\tilde O(\sqrt{HSAT})$ regret. For the case with an unknown transition function, Jin et al. (2020a) and Lee et al. (2020) obtained an $\tilde O\big( HS\sqrt{AT \ln\frac{TSA}{\delta}} \big)$ regret with probability $1-\delta$. These studies assume that the state spaces of different layers in an episode are non-overlapping. Moreover, Rosenberg & Mansour (2019a) studied the case with full-information feedback. Adversarial linear MDPs have also been studied recently, e.g., in Cai et al. (2020) and Luo et al. (2021). In addition, Yu & Mannor (2009), Cheung et al. (2019) and Lykouris et al. (2021) studied the case where both the loss distribution and the transition function may change arbitrarily. More studies on various adversarial RL settings have been done by Rosenberg & Mansour (2019b); Lee et al. (2021); Zhao et al. (2021); Jin et al. (2021); He et al. (2022), etc.

At the end of super-episode $u$ (Step 2 of the algorithm), SEEDS estimates the losses $\hat l^{\text{SEEDS}}_{[u]}$.

Super-episode-based policy search: SEEDS divides the $T$ episodes into $U = \lceil T/\tau \rceil$ super-episodes, where $\tau \in \mathbb{Z}_{++}$ is a tunable parameter. Each super-episode contains $\tau$ consecutive episodes. For all episodes in each super-episode $u = 1, \ldots, U$, SEEDS uses the same policy $\pi^{\hat q^{\text{SEEDS},P}_{[u]}}$; see the sketch below.
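The super-episode schedule itself is easy to state in code. The following minimal sketch (the helper names `update_policy` and `run_episode` are our hypothetical stand-ins for the occupancy-measure update and the episode roll-out of SEEDS) shows that the policy may change only at super-episode boundaries, so the number of policy switches is at most $\lceil T/\tau \rceil$:

    import math

    def run_with_super_episodes(T, tau, update_policy, run_episode):
        # Policies are assumed to be comparable objects (e.g., tuples).
        policy, switches = None, 0
        for t in range(1, T + 1):
            if (t - 1) % tau == 0:                     # first episode of a super-episode
                new_policy = update_policy(t)          # stand-in for the SEEDS update
                switches += int(new_policy != policy)  # at most one switch per super-episode
                policy = new_policy
            run_episode(policy)                        # same policy within the super-episode
        assert switches <= math.ceil(T / tau)
        return switches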

Update the occupancy measure $\hat q^{\text{SEEDS-UT}}_{[u+1]}(s', s, a)$ according to (10), but subject to a different constraint $q \in C(\mathcal{P}_{[u+1]})$. Update the policy $\pi^{\hat q^{\text{SEEDS-UT},\mathcal{P}}_{[u+1]}}$ according to (2) and (5). end for

Theorem 4. Let $N^{\text{SEEDS}} \triangleq \lceil T/\tau \rceil$. Then, with switching costs equal to $O(\beta \cdot N^{\text{SEEDS}})$, SEEDS can achieve a loss regret upper-bounded by $\tilde O\big( \sqrt{HSA/N^{\text{SEEDS}}} \cdot T \big)$.
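As a rough consistency check (a sketch calculation, not part of the formal statement), balancing the two terms in Theorem 4 over $N^{\text{SEEDS}}$ gives
$$\beta N^{\text{SEEDS}} \asymp \sqrt{\frac{HSA}{N^{\text{SEEDS}}}} \cdot T \;\Longrightarrow\; N^{\text{SEEDS}} \asymp \beta^{-2/3} (HSA)^{1/3} T^{2/3} \;\Longrightarrow\; \text{total regret} \asymp \beta^{1/3} (HSA)^{1/3} T^{2/3},$$
which matches the $\Omega\big( (HSA)^{1/3} T^{2/3} \big)$ lower bound up to the $\beta^{1/3}$ factor.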

SEEDS-UT computes the empirical transition function $\hat P_{[u+1]}(s'|s,a) = M_{[u+1]}(s',s,a) / \max\{N_{[u+1]}(s,a), 1\}$, where $M_{[u+1]}(s',s,a)$ and $N_{[u+1]}(s,a)$ denote the number of times $(s',s,a)$ and $(s,a)$ have been visited before super-episode $u+1$, respectively. Then, based on the empirical Bernstein bound (Maurer & Pontil, 2009), SEEDS-UT constructs a transition-function set $\mathcal{P}$ as follows (Step-3 in Algorithm 2):
$$\mathcal{P}_{[u+1]} = \Big\{ \tilde P_{[u+1]} : \big| \tilde P_{[u+1]}(s'|s,a) - \hat P_{[u+1]}(s'|s,a) \big| \le \epsilon_{[u+1]}(s',s,a), \text{ for all } (s',s,a) \Big\}, \quad (14)$$
where $\epsilon_{[u+1]}(s',s,a) = 2\sqrt{\frac{\hat P_{[u+1]}(s'|s,a) \ln\frac{TSA}{\delta}}{\max\{N_{[u+1]}(s,a)-1,\,1\}}} + \frac{14 \ln\frac{TSA}{\delta}}{3\max\{N_{[u+1]}(s,a)-1,\,1\}}$, and $\delta \in (0,1)$ is the confidence parameter. Finally, the occupancy measure $\hat q^{\text{SEEDS-UT},\mathcal{P}}_{[u+1]}$ is updated subject to $q \in C(\mathcal{P}_{[u+1]})$.
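A minimal sketch of this construction follows (the array names `M`, `N` for the visit counters and the shapes `(S, A, S)` and `(S, A)` are our illustrative assumptions):

    import numpy as np

    def bernstein_confidence(M, N, T, S, A, delta):
        # Empirical transition estimate \hat{P}_{[u+1]}(s'|s,a) and the radius in (14).
        P_hat = M / np.maximum(N[:, :, None], 1)
        log_term = np.log(T * S * A / delta)
        denom = np.maximum(N[:, :, None] - 1, 1)   # max{N_{[u+1]}(s,a) - 1, 1}
        eps = 2 * np.sqrt(P_hat * log_term / denom) + 14 * log_term / (3 * denom)
        # The set P_{[u+1]} consists of all P_tilde with |P_tilde - P_hat| <= eps.
        return P_hat, eps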

At the end of super-episode $u$, the conditional expectation $\mathbb{E}\big[ \hat l^{\text{SEEDS}}_{[u]}(s,a) \,\big|\, F_{[u]} \big]$ sums, over the episodes $t = (u-1)\tau+1, \ldots, u\tau$ and over all visit-time sets $\{t_1(s,a), \ldots, t_{J_{[u]}}(s,a)\}$ that contain $t$, the loss $l_t(s,a)$ weighted by $\Pr\big[ \{t_1(s,a), \ldots, t_{J_{[u]}}(s,a)\} \,\big|\, F_{[u]} \big]$; since SEEDS plays the same occupancy measure throughout super-episode $u$, the probability that $(s,a)$ is visited in any single episode of the super-episode equals $\hat q^{\text{SEEDS},P}_{[u]}(s,a)$.



Let us focus on the first term on the right-hand side. Note that, different from Jin et al. (2020a), the loss estimate $\hat l^{\text{SEEDS-UT}}_{[u]}(s,a)$ above is calculated based on the samples from a whole super-episode; thus, each state-action pair could be visited multiple times. To this end, we provide a super-episodic version of the loss concentration as follows.

$$\hat l^{\text{SEEDS-UT}}_{[u]}(s,a) = \sum_{t_j(s,a)} \frac{l_{t_j(s,a)}(s,a)}{Q^{\gamma}_{[u]}(s,a)} \mathbf{1}\{(s,a) \text{ was visited in episodes } t_1(s,a), \ldots, t_{J_{[u]}}(s,a) \text{ of super-episode } u\} = \sum_{t=(u-1)\tau+1}^{u\tau} \frac{l_t(s,a)\, \mathbf{1}\{(s,a) \text{ was visited in episode } t \text{ of super-episode } u\}}{Q^{\gamma}_{[u]}(s,a)}.$$
Let us define $\tilde l_t(s,a) \triangleq l_t(s,a)\, \mathbf{1}\{(s,a) \text{ was visited in episode } t \text{ of super-episode } u\} / Q^{\gamma}_{[u]}(s,a)$. By combining all episodes in the same super-episode $u$ together, we have $\sum_{t=(u-1)\tau+1}^{u\tau} \tilde l_t(s,a) = \hat l^{\text{SEEDS-UT}}_{[u]}(s,a)$, which estimates $l_{[u]}(s,a)$; see the sketch below.
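A minimal sketch of this super-episodic estimator (the names `loss`, `visited` and `Q_gamma` are our illustrative stand-ins for the per-episode losses, the visit indicators, and the upper occupancy bound $Q^{\gamma}_{[u]}$):

    import numpy as np

    def super_episode_loss_estimate(loss, visited, Q_gamma):
        # loss[t], visited[t]: arrays of shape (S, A) for the tau episodes of
        # super-episode u; Q_gamma: array of shape (S, A).
        # Returns l_hat[s, a] = sum_t loss[t][s, a] * 1{visited} / Q_gamma[s, a].
        l_hat = np.zeros_like(Q_gamma)
        for l_t, v_t in zip(loss, visited):
            l_hat += l_t * v_t / Q_gamma   # only visited (s, a) pairs contribute
        return l_hat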

Therefore, with probability $1-\delta$, the third term on the right-hand side of (26) can be upper-bounded as in (40).

ACKNOWLEDGMENTS

The work of M. Shi has been partly supported by NSF grant NSF AI Institute (AI-EDGE) CNS-2112471. The work of Y. Liang has been partly supported by NSF grants NSF AI-EDGE CNS-2112471 and RINGS-2148253. The work of N. Shroff has been partly supported by NSF grants NSF AI-EDGE CNS-2112471, CNS-2106933, 2007231, CNS-1955535, and CNS-1901057, and in part by Army Research Office under Grant W911NF-21-1-0244.

