SIMULTANEOUSLY LEARNING STOCHASTIC AND ADVERSARIAL MARKOV DECISION PROCESS WITH LINEAR FUNCTION APPROXIMATION

Abstract

Reinforcement learning (RL) has been widely used in practice. To deal with the numerous states and actions in real applications, function approximation methods have been widely employed to improve learning efficiency, among which linear function approximation has attracted great interest both theoretically and empirically. Previous works on the linear Markov Decision Process (MDP) mainly study two settings: the stochastic setting, where the reward is generated stochastically, and the adversarial setting, where the reward can be chosen arbitrarily by an adversary. All these works treat the two environments separately. However, the learning agent often has no idea of how rewards are generated, and a wrong guess of the reward type can severely disrupt the performance of those specially designed algorithms. A natural question is thus whether an algorithm can be derived that learns efficiently in both environments without knowing the reward type. In this paper, we consider such a best-of-both-worlds problem for linear MDP with known transition. We propose an algorithm and prove that it simultaneously achieves O(poly log K) regret in the stochastic setting and Õ(√K) regret in the adversarial setting, where K is the number of episodes. To the best of our knowledge, this is the first such result for linear MDP.

1. INTRODUCTION

Reinforcement learning (RL) studies the problem where a learning agent interacts with the environment over time and aims to maximize its cumulative reward over a given horizon. It has a wide range of real applications, including robotics (Kober et al., 2013) and games (Mnih et al., 2013; Silver et al., 2016). The environment dynamics are usually modeled by a Markov Decision Process (MDP) with a fixed transition function. We consider the general episodic MDP setting, where the interaction lasts for several episodes and the length of each episode is fixed (Jin et al., 2018; 2020b; Luo et al., 2021; Yang et al., 2021). In each episode, the agent repeatedly observes its current state and decides which action to take. After making the decision, it receives an instant reward and the environment then transfers to the next state. The cumulative reward in an episode is called the value, and the objective of the agent is equivalent to minimizing the regret, defined as the cumulative difference between the optimal value and its received values over episodes. Many previous works focus on the tabular MDP setting, where the state and action spaces are finite and the values can be represented by a table (Azar et al., 2017; Jin et al., 2018; Chen et al., 2021; Luo et al., 2021). Most of them study the stochastic setting with stationary rewards, in which the reward of a state-action pair is generated from a fixed distribution (Azar et al., 2017; Jin et al., 2018; Simchowitz & Jamieson, 2019; Yang et al., 2021). Since the reward may change over time in applications, some works consider the adversarial MDP, where the reward can be generated arbitrarily across episodes (Yu et al., 2009; Rosenberg & Mansour, 2019; Jin et al., 2020a; Chen et al., 2021; Luo et al., 2021). All of these works aim to learn the value function table to find the optimal policy, and their computational complexity depends heavily on the sizes of the state and action spaces.
However, in real applications such as the game of Go, there are numerous states and the value function table is huge, which poses a great computational challenge for traditional tabular algorithms. To cope with the curse of dimensionality, a rich line of works employs function approximation methods, such as linear functions and deep neural networks, to approximate the value functions or the policies and improve learning efficiency. These methods have also achieved great success in practical scenarios such as the Atari and Go games (Mnih et al., 2013; Silver et al., 2016). Despite their great empirical performance, they also bring a series of challenges to theoretical analysis. To build a better theoretical understanding of these approximation methods, many works start from deriving regret guarantees for linear function classes. The linear MDP is a popular model which assumes that both the transition and the reward at a state-action pair are linear in a corresponding d-dimensional feature (Jin et al., 2020b; He et al., 2021; Hu et al., 2022). Here there are also mainly two reward types. For the stochastic setting, Jin et al. (2020b) provide the first efficient algorithm, named Least-Squares Value Iteration with UCB (LSVI-UCB), and show that its regret over K episodes can be upper bounded by O(√K). Seeking a tighter result with respect to the specific problem structure, He et al. (2021) provide a new analysis for LSVI-UCB and show that it achieves an O(poly log K) instance-dependent regret upper bound. The adversarial setting is much harder than the stochastic one, since the reward can change arbitrarily while the agent only observes the rewards on the experienced trajectory. For this more challenging case, a regret upper bound of order O(√K) has only been obtained in the known-transition case by Neu & Olkhovskaya (2021). All these works treat the two environment types separately.
However, the learning agent usually has no idea of how the reward is generated, and once the guessed reward type is wrong, an algorithm specially designed for one setting may suffer a great loss. Deriving an algorithm that can adapt to different environment types is thus a natural solution. This direction has attracted great research interest in the simpler bandit (Bubeck & Slivkins, 2012; Zimmert et al., 2019; Lee et al., 2021; Kong et al., 2022) and tabular MDP settings (Jin & Luo, 2020; Jin et al., 2021b), but remains open in linear MDP. In this paper, we answer the question of deriving best-of-both-worlds (BoBW) guarantees for linear MDP. Due to the challenge of learning in the adversarial setting, we also consider the known-transition case. We propose an algorithm that continuously detects the real environment type and adjusts its strategy. We show that our algorithm can simultaneously achieve O(poly log K) regret in the stochastic setting and Õ(√K) regret in the adversarial setting. To the best of our knowledge, these are the first BoBW results for linear MDP. It is also worth noting that our BoBW algorithm relies on an algorithm achieving a high-probability guarantee in the adversarial setting, which previous works fail to provide; we propose the first analysis yielding a high-probability regret bound for adversarial linear MDP.

2. RELATED WORK

Linear MDP. Recently, deriving theoretically guaranteed algorithms for RL with linear function approximation has attracted great interest, and the linear MDP is one of the most popular models. Jin et al. (2020b) develop LSVI-UCB, the first algorithm for this setting that is efficient in both sample and computational complexity. They show that the algorithm achieves O(√(d³H³K)) regret, where d is the feature dimension and H is the length of each episode. This result was recently improved to the optimal order O(dH√K) by Hu et al. (2022) via a tighter concentration analysis. Apart from UCB, TS-type algorithms have also been proposed for this setting (Zanette et al., 2020a). None of these results considers the specific problem structure. In the stochastic setting, an instance-dependent regret bound is more attractive, as it reflects the tighter performance of an algorithm on a specific problem instance. This type of regret has been widely studied under the tabular MDP setting (Simchowitz & Jamieson, 2019; Yang et al., 2021). He et al. (2021) are the first to provide this type of regret bound for linear MDP. Using a different proof framework, they show that the LSVI-UCB algorithm achieves O(d³H⁵ log K / ∆), where ∆ is the minimum value gap in the episodic MDP. All these works consider the stochastic setting with stationary rewards. Neu & Olkhovskaya (2021) make the first attempt to analyze the more challenging adversarial environment. They consider a simpler setting with known transition and provide an O(√(dHK)) regret upper bound. For the unknown-transition case, Luo et al. (2021) provide an O(d^{2/3} H² K^{2/3}) upper bound with the help of a simulator and an O(d² H⁴ K^{14/15}) guarantee for the general case. Above all, even in the purely adversarial setting, O(√K) regret has only been derived for the known-transition case.
We also study the known-transition setting and aim to provide Õ(√K) regret in the adversarial setting while simultaneously achieving O(poly log K) regret if the environment is truly stochastic.

Best-of-both-worlds. The question of reaching best-of-both-worlds results was first posed by Bubeck & Slivkins (2012) for the bandit setting, a special case of episodic MDP with H = 1. Their proposed algorithm assumes the setting is stochastic and continuously detects whether the assumption is satisfied. Such a detection-based method is shown to achieve O(poly log K) regret in the stochastic setting and O(√K) in the adversarial setting, and was later improved by Auer & Chiang (2016). Similar detection-based techniques have also been adopted in the more general linear bandit (Lee et al., 2021) and graph feedback (Kong et al., 2022) settings to achieve BoBW guarantees. Another line of works uses Follow-the-Regularized-Leader (FTRL) to adapt to different environment types (Zimmert & Seldin, 2019; 2021). This type of algorithm is shown to be tighter than Bubeck & Slivkins (2012); Auer & Chiang (2016) in the bandit setting, and has also attracted much interest in more complex problems such as combinatorial bandits (Zimmert et al., 2019; Chen et al., 2022). The first BoBW result in the MDP setting is provided by Jin & Luo (2020) in the tabular case. Due to the difficulty of the problem, they first study the known-transition setting. Their approach to achieving BoBW is an FTRL algorithm with a newly designed regularizer, a result later improved by Jin et al. (2021b) and also generalized to the unknown-transition case. To the best of our knowledge, we are the first to consider the BoBW problem in linear MDP. We also start from the known-transition setting, and our algorithm is detection-based.

RL with general function approximation

The linear mixture MDP is another popular RL model with linear function approximation. It assumes the transition function can be approximated by a weighted average over several transition kernels. In the stochastic setting, both instance-independent (Ayoub et al., 2020; Zhou et al., 2021) and instance-dependent regret bounds (He et al., 2021) have been derived. In the adversarial setting, only the full-information case has been studied, where the agent has access to the rewards of all state-action pairs (Cai et al., 2020; He et al., 2022). Apart from linear function approximation, there is also a rich line of works considering general function classes, such as settings with low Bellman rank (Jiang et al., 2017; Zanette et al., 2020b), low eluder dimension (Wang et al., 2020; Kong et al., 2021), and low Bellman-eluder dimension (Jin et al., 2021a).

3. SETTING

We consider the episodic MDP setting where the agent interacts with the environment for K episodes with known transition. The episodic MDP can be represented by M(S, A, H, {r_k}_{k=1}^K, P), where S is the state space, A is the action space, H is the length of each episode, r_k = {r_{k,h}}_{h=1}^H is the reward function and P = {P_h}_{h=1}^H is the known transition probability function. Specifically, at each episode k and step h ∈ [H], r_{k,h}(s, a) ∈ [0, 1] and P_h(· | s, a) ∈ [0, 1]^{|S|} are the reward and transition probability at state s ∈ S when taking action a ∈ A, respectively. We focus on stationary policies. Denote π = {π_h}_{h=1}^H as a policy mapping from the state space to an action distribution, where π_h : S → ∆_A. For each episode k ∈ [K], the agent starts from the first state s_{k,1} := s_1 and determines the policy π_k. Then at each step h ∈ [H] of episode k, it first observes the current state s_{k,h} and then performs the action a_{k,h} ∼ π_{k,h}(· | s_{k,h}). The agent receives a random reward y_{k,h} := r_{k,h}(s_{k,h}, a_{k,h}) + ε_{k,h}, where ε_{k,h} is an independent zero-mean noise. The environment then transfers to the next state s_{k,h+1} based on the transition function P_h(· | s_{k,h}, a_{k,h}). The episode ends when the last state s_{k,H+1} is reached. We focus on linear MDP with known transition, where the reward functions are linear in a given feature mapping (Jin et al., 2020b; He et al., 2021). The formal definition is as follows.

Assumption 1. (Linear MDP with known transition) M(S, A, H, {r_k}_{k=1}^K, P) is a linear MDP with a known feature mapping φ : S × A → R^d such that for each episode k and step h ∈ [H], there exists an unknown vector θ_{k,h} where for each (s, a) ∈ S × A, r_{k,h}(s, a) = ⟨φ(s, a), θ_{k,h}⟩.

In the stochastic setting, the reward function {r_k}_{k=1}^K, or equivalently the reward parameter {θ_k}_{k=1}^K, is fixed over different episodes k ∈ [K].
In the adversarial setting, the reward parameter {θ_k}_{k=1}^K can be chosen arbitrarily by an adversary (possibly depending on previous policies). We evaluate the performance of a policy π by its value functions. Specifically, for any episode k and step h, denote the Q-value function Q^π_{k,h}(s, a) as the expected reward obtained by the agent starting from (s_{k,h}, a_{k,h}) = (s, a) and following policy π, formally defined as

Q^π_{k,h}(s, a) = E[ \sum_{h'=h}^H y_{k,h'} | π, s_{k,h} = s, a_{k,h} = a ] .

Similarly, the value function V^π_{k,h}(s) of any state s is defined as

V^π_{k,h}(s) = E[ \sum_{h'=h}^H y_{k,h'} | π, s_{k,h} = s ] .

In the following, we slightly abuse notation by using V^π_k := V^π_{k,1}(s_1) to represent the value of policy π at the starting state s_1 in episode k. Define φ_{π,h} = E[ φ(s_h, a_h) | π ] as the expected feature vector that the policy π visits at step h. It is worth noting that this recovers the state-action visitation probability in the tabular setting. According to the definition, the value function of policy π at episode k can be represented as V^π_k = \sum_{h=1}^H ⟨φ_{π,h}, θ_{k,h}⟩. In this paper, we consider optimizing both the stochastic and adversarial environments within a finite policy set Π. Given a policy set Π, the learning agent determines the policy π_k ∈ Π in each episode k ∈ [K]. Let π* ∈ arg max_{π∈Π} \sum_{k=1}^K V^π_k be an optimal policy in Π that maximizes the cumulative value over K episodes, which is assumed to be unique in the stochastic setting, similar to previous works in tabular MDP (Jin & Luo, 2020; Jin et al., 2021b) and bandit settings (Lee et al., 2021; Zimmert & Seldin, 2019; 2021). Denote the cumulative regret compared with π* ∈ Π over K episodes as

Reg(K; Π) = \sum_{k=1}^K ( V^{π*}_k − V^{π_k}_k ) .    (1)
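The identity V^π_k = Σ_h ⟨φ_{π,h}, θ_{k,h}⟩ is the backbone of the later reduction to linear optimization. A minimal numerical sketch (the features and reward parameters below are toy values of our own, not from the paper):

```python
import numpy as np

# Value of a policy as inner products of expected visitation features with the
# reward parameters: V^pi_k = sum_h <phi_{pi,h}, theta_{k,h}>.
d, H = 3, 2
phi_pi = np.array([[1.0, 0.0, 0.0],    # expected visitation feature at step h = 1
                   [0.0, 0.5, 0.5]])   # expected visitation feature at step h = 2
theta = np.array([[0.2, 0.1, 0.0],     # reward parameter at step h = 1
                  [0.0, 0.4, 0.4]])    # reward parameter at step h = 2
V_pi = sum(phi_pi[h] @ theta[h] for h in range(H))
# step 1 contributes 0.2, step 2 contributes 0.5*0.4 + 0.5*0.4 = 0.4, so V_pi = 0.6
```

Because the unknowns θ_{k,h} enter only through these inner products, estimating a policy's value reduces to a linear estimation problem over the fixed vectors φ_{π,h}.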

4. ALGORITHM

In this section, we propose a detection-based algorithm to optimize both stochastic and adversarial environments for linear MDP with a given policy set Π. Our algorithm is mainly inspired by the detection technique of Lee et al. (2021) for BoBW in linear bandits. At a high level, the algorithm first assumes the environment is adversarial and continuously detects whether it could be a stochastic one. Its design relies on a new linear MDP algorithm that returns well-concentrated estimators for the values of policies and also achieves sub-linear regret in the adversarial setting with high probability. Previous works on adversarial linear MDP fail to provide a high-probability guarantee, so no existing algorithm satisfies this property. In Appendix D, we propose a variant of Geometric Hedge (Algorithm 4), initially designed for the simple bandit case (Bartlett et al., 2008), and provide a new analysis for it in the linear MDP setting. We show that this algorithm satisfies the following properties and can be used to derive the BoBW results. It is also worth noting that this algorithm is the first to achieve a high-probability regret guarantee for adversarial linear MDP.

Theorem 1. Given a policy set Π, the regret of our proposed Algorithm 4 in the adversarial setting can be upper bounded by

Reg(K; Π) ≤ O( \sqrt{dH³ K log(|Π|/δ)} )    (2)

with probability at least 1 − δ. Further, at each episode k, Algorithm 4 returns a value estimator \hat V^π_k for each policy π ∈ Π. Choosing constants L_0 = 4dH log(|Π|/δ), C_1 ≥ 2^{15} dH³ log(K|Π|/δ) and C_2 ≥ 20, it holds for any k_0 ≥ L_0 and policy π ∈ Π that

\sum_{k=1}^{k_0} ( V^π_k − V^{π_k}_k ) ≤ \sqrt{C_1 k_0} − C_2 \sum_{k=1}^{k_0} ( \hat V^π_k − V^π_k )

with probability larger than 1 − δ.

Our main BoBW algorithm is a phased algorithm, presented in Algorithm 1. It takes Algorithm 4, satisfying Theorem 1 with parameters L_0 and C_1, as input. The first epoch has length L_0, and the lengths of the following epochs grow exponentially as in Line 4.
During each epoch, it executes Algorithm 2, which we refer to as the BoBW main body (Line 3).

Algorithm 1 BoBW for linear MDP
1: Input: Algorithm 4 with parameters C_1 and L_0; set L := L_0. Maximum duration K.
2: while number of episodes k ≤ K do
3:   Execute Algorithm 2 (BoBW main body) with parameter L and receive output k_0
4:   Set L = 2 k_0
5: end while

Algorithm 2 (BoBW main body) takes Algorithm 4 with parameters C_1 and L as input. Here L can be regarded as the minimum number of episodes that Algorithm 4 needs to run to collect enough observations. The algorithm first assumes the environment is adversarial and executes Algorithm 4 for at least L episodes (which we refer to as the first phase). As shown in Theorem 1, Algorithm 4 guarantees that when running for more than L episodes, the regret compared with any policy π and the distance between its estimated value and real value are both controlled. Based on these concentration properties, if a policy π̂ shows consistently better performance than all other policies (Line 5), we have reason to believe that the environment is truly stochastic. Being aware of this, as shown in Line 6, Algorithm 2 transfers to the stochastic phase (Lines 9-18, which we refer to as the second phase) with the estimated values returned by Algorithm 4. Since the values estimated by Algorithm 4 approximate the real values of policies well, exploration in the stochastic setting can be guided by the estimated value gaps to obtain a problem-dependent regret bound. The objective is to identify the optimal policy while maximizing the collected rewards, which is implemented via an optimization problem (Algorithm 3). Taking the estimated value gaps \hat∆ as input, Algorithm 3 returns a probability distribution p* over the policy set Π. Intuitively, p* maximizes the expected values of policies while ensuring that the uncertainty of each policy remains smaller than its sub-optimality gap.
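To make the role of the optimization problem concrete, here is a small sketch that checks its feasibility constraints for a candidate distribution p. The function `op_feasible`, the data layout, and the numerical tolerances are our own illustrative choices, not part of the paper's algorithm:

```python
import numpy as np

def op_feasible(p, feats, gaps, k, beta, d, H):
    """Check the OP constraints for a distribution p over policies: for every
    policy pi, sum_h ||phi_{pi,h}||^2_{Sigma_h(p)^{-1}} <= k*gap_pi^2/beta + 4*d*H,
    where Sigma_h(p) = sum_pi p(pi) * phi_{pi,h} phi_{pi,h}^T.
    feats[pi] is an (H, d) array of expected visitation features phi_{pi,h}."""
    Sigmas = [sum(p[pi] * np.outer(feats[pi][h], feats[pi][h]) for pi in feats)
              for h in range(H)]
    invs = [np.linalg.inv(S + 1e-9 * np.eye(d)) for S in Sigmas]  # tiny ridge for stability
    for pi in feats:
        uncertainty = sum(float(feats[pi][h] @ invs[h] @ feats[pi][h]) for h in range(H))
        if uncertainty > k * gaps[pi] ** 2 / beta + 4 * d * H:
            return False
    return abs(sum(p.values()) - 1.0) < 1e-8
```

For instance, with H = 1, d = 2, two policies with orthogonal unit features, and the uniform distribution, each policy's uncertainty is 2, well below the 4dH = 8 slack, so the check passes.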
In Appendix A, we show that when \hat∆ is estimated accurately, selecting policies based on p* reaches a problem-dependent regret upper bound. Back in the main body of Algorithm 2, after computing the distribution p_k based on the current estimate \hat∆ (Line 10), the algorithm also mixes this policy distribution with a one-hot vector e_{π̂} to ensure that the estimated optimal policy π̂ is played often enough, so that the variance of its subsequent estimators stays low (Line 11). The algorithm then samples a policy π_k according to this mixed distribution and executes it in this episode. Based on the received rewards y_{k,h} at each step h and the total reward Y_k = \sum_{h=1}^H y_{k,h}, the value estimates of the policies are then updated. For technical reasons, we use different estimators for the estimated optimal policy π̂ and for the other policies π. Specifically, we use the importance-weighted estimator to approximate the value of π̂; the mixed policy distribution \tilde p is used precisely to ensure the low variance of this estimator. For the other policies, we use standard least-squares estimators (Line 13). Based on these newly estimated values, the algorithm updates the estimates of the sub-optimality gaps as in Line 14. To get a tighter analysis, when computing the value gap of π̂, we use its estimator from Algorithm 4 in the first k_0 episodes and the Catoni estimator for the recent k − k_0 episodes. Formally, the Catoni estimator Catoni_α({X_1, X_2, ..., X_n}) is defined as the unique root of f(z) = \sum_{i=1}^n Φ(α(X_i − z)), where Φ(y) = log(1 + y + y²/2) if y ≥ 0 and Φ(y) = −log(1 − y + y²/2) otherwise.

Algorithm 2 BoBW main body
1: Input: Algorithm 4 with parameters C_1 and L.
2: Set f_K = log K
3: for each episode k = 1, 2, ... do
4:   Execute and update Algorithm 4; receive value estimators \hat V^π_k for each π ∈ Π
5:   if k ≥ L and there exists a policy π̂ ∈ Π such that
       \sum_{s=1}^k Y_s ≤ \sum_{s=1}^k \hat V^{π̂}_s + 5 \sqrt{f_K C_1 k} ,
       \sum_{s=1}^k Y_s ≥ \sum_{s=1}^k \hat V^π_s + 25 \sqrt{f_K C_1 k} , ∀π ≠ π̂
     then
6:     k_0 = k, \hat∆_π = (1/k_0) \sum_{s=1}^{k_0} ( \hat V^{π̂}_s − \hat V^π_s ), \hat∆ = ( \hat∆_π )_{π∈Π}; break
7:   end if
8: end for
9: for episode k = k_0 + 1, k_0 + 2, ... do
10:  Compute p_k = OP(k, \hat∆)
11:  Compute \tilde p_k(π) = (1/2) e_{π̂}(π) + (1/2) p_k(π), where e_{π̂} is the one-hot vector with 1 only at policy π̂
12:  Sample π_k ∼ \tilde p_k and execute π_k
13:  Receive reward Y_k = \sum_{h=1}^H y_{k,h} and calculate \hat V^π_k for each π ∈ Π as follows:
       ∀π ≠ π̂ : \hat V^π_k = \sum_{h=1}^H φ^⊤_{π,h} Σ^{-1}_{k,h} φ_{π_k,h} y_{k,h} , where Σ_{k,h} = \sum_π \tilde p_k(π) φ_{π,h} φ^⊤_{π,h} ;
       \hat V^{π̂}_k = ( Y_k / \tilde p_k(π̂) ) 1{π_k = π̂}
14:  For each π ≠ π̂, compute \hat∆^k_π as
       \hat∆^k_π = (1/k) [ \sum_{s=1}^{k_0} \hat V^{π̂}_s + (k − k_0) Rob_{k,π̂} − \sum_{s=1}^k \hat V^π_s ] ,
     with Rob_{k,π̂} = Catoni_{α^k_{π̂}}( { \hat V^{π̂}_s }_{s=k_0+1}^k )    (7)
15:  if ∃π ≠ π̂ : \hat∆^k_π ∉ [ 0.39 \hat∆_π, 1.81 \hat∆_π ]    (8)
       or \sum_{s=k_0+1}^k ( \hat V^{π̂}_s − Y_s ) ≥ 20 \sqrt{f_K C_1 k_0}    (9)
     then
16:    Return k_0
17:  end if
18: end for

It is worth noting that an adversarial environment may be disguised as a stochastic one and fool the algorithm into entering the stochastic phase. Thus the agent still needs to stay vigilant about possible changes of the environment; the detection conditions (Line 15) are set for this purpose. Specifically, if the estimated sub-optimality gap of some policy changes significantly compared with the original estimate from Algorithm 4 in the adversarial phase, or if the regret compared with π̂ becomes large, the algorithm concludes that the environment may not be stochastic, terminates the current epoch, and enters the next epoch with parameter k_0 (Line 16).

Algorithm 3 computes OP(k, \hat∆), which selects a distribution p minimizing \sum_{π∈Π} p_π \hat∆_π

s.t. \sum_{h=1}^H ∥φ_{π,h}∥²_{Σ^{-1}_h(p)} ≤ k \hat∆²_π / β_k + 4dH for all π ∈ Π, and \sum_{π∈Π} p_π = 1 ,

where Σ_h(p) = \sum_π p_π φ_{π,h} φ^⊤_{π,h}.
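The Catoni estimator used above has no closed form, but since f(z) is continuous and strictly decreasing in z, its root is easy to locate numerically. A minimal sketch (the bisection bracket and iteration count are our own choices):

```python
import math

def catoni(xs, alpha):
    """Catoni robust mean: the unique root z of f(z) = sum_i Phi(alpha*(x_i - z)),
    where Phi(y) = log(1 + y + y^2/2) for y >= 0 and -log(1 - y + y^2/2) otherwise.
    Found by bisection, since f is continuous and strictly decreasing in z."""
    def phi(y):
        return math.log(1.0 + y + y * y / 2.0) if y >= 0 else -math.log(1.0 - y + y * y / 2.0)

    def f(z):
        return sum(phi(alpha * (x - z)) for x in xs)

    lo, hi = min(xs) - 1.0, max(xs) + 1.0  # f(lo) > 0 > f(hi) for alpha > 0
    for _ in range(100):  # bisect to high precision
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Since Φ is an odd function, the root coincides with the sample mean for symmetric data; for heavy-tailed data it behaves like a robust mean, which is exactly why Line 14 applies it to the high-variance importance-weighted estimates.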

5. THEORETICAL ANALYSIS

In this section, we provide the theoretical guarantees and the analysis of Algorithm 1 in both stochastic and adversarial settings. We start with the stochastic setting. Since the value function remains the same across episodes, we simplify notation and use V^π to represent the real value of policy π ∈ Π. Before presenting the main results, we first introduce the sub-optimality gaps that will be used.

Definition 1. For each policy π ∈ Π, define ∆_π = V^{π*} − V^π as the sub-optimality gap of π compared with the optimal policy π* ∈ arg max_{π∈Π} V^π. Further, let ∆_min = min_{π:∆_π>0} ∆_π be the minimum positive value gap.

Theorem 2 provides a regret upper bound for Algorithm 1 in the stochastic setting.

Theorem 2. (Regret bound in the stochastic setting) With probability at least 1 − δ, Algorithm 1 guarantees that

Reg(K; Π) ≤ O( dH² log(K) log(|Π|K/δ) / ∆_min ) .

If the environment is adversarial, the regret of Algorithm 1 can be upper bounded as in Theorem 3.

Theorem 3. (Regret bound in the adversarial setting) With probability at least 1 − δ, Algorithm 1 guarantees that

Reg(K; Π) ≤ O( \sqrt{dH³ K log(K) log(|Π|K/δ)} ) .

Due to the space limit, the full proofs of these two theorems are deferred to Appendices B and C; we provide proof sketches in the following sections.

Technical challenges and novelty. There are mainly two types of algorithms for the BoBW problem: switch-based methods that actively detect the environment type, e.g., Bubeck & Slivkins (2012), and FTRL-based methods that adapt to different environments, e.g., Zimmert & Seldin (2019). The approach in Bubeck & Slivkins (2012) first assumes the setting to be stochastic and detects whether any policy's value has changed. Such an approach in our setting brings an O(√|Π|) dependence in the regret for the adversarial setting, which is not ideal since the policy set can be large.
The success of FTRL for BoBW mainly relies on a self-bounding inequality that bounds the regret by the probabilities with which policies are chosen, but such a technique is challenging to apply under linear structure. As discussed by Lee et al. (2021), even in the single-state linear bandit setting, connecting FTRL with OP is hard. Our approach relies on a new observation: the value of each policy can be written as the inner product between the expected state-action visitation feature φ_π and the unknown reward parameter θ. From this view, we are able to reduce the problem to linear optimization, and existing techniques for linear optimization can be used. To the best of our knowledge, we are the first to introduce this type of linear optimization reduction for the regret-minimization problem in MDP, and the reduction may be of independent interest.

Relationship between our ∆_π and gap_min in He et al. (2021). To compare our ∆_π with gap_min in He et al. (2021), we assume the optimal policy π* ∈ Π is the global optimal policy. Recall that gap_min is defined as min{ gap_h(s, a) : gap_h(s, a) > 0 }, where gap_h(s, a) = V*_h(s) − Q*_h(s, a). In general, ∆_π can be decomposed as

∆_π = V*_1(s_1) − V^π_1(s_1) = \sum_{h=1}^H \sum_{s,a} p^π_h(s, a) gap_h(s, a) ,

where the equality follows He et al. (2021, Eq. (B.2)) and p^π_h(s, a) is the visitation probability of the state-action pair (s, a) at step h when following π. This shows that ∆_π ≥ p_π gap_min in the worst case, where p_π is the minimum non-zero visitation probability of policy π at some state-action pair with positive gap. There are also cases where our defined ∆_π is larger than the gap in He et al. (2021). When both the policy and the transition are deterministic (Ortner, 2010; Tranos & Proutiere, 2021; Dann et al., 2021; Tirinzoni et al., 2022), we have ∆_π = \sum_{h=1}^H gap_h(s_h, a_h) ≥ gap_min.
In the stochastic-transition case, if every sub-optimal policy happens not to select an optimal action in arg max_a Q*_1(s_1, a), then

∆_π = V*_1(s_1) − V^π_1(s_1) = V*_1(s_1) − Q^π_1(s_1, a′) ≥ V*_1(s_1) − Q*_1(s_1, a′) = gap_1(s_1, a′) ≥ gap_min ,

where a′ is the action selected by π at the first step and the last inequality is due to gap_1(s_1, a′) > 0. In the above two cases, our sub-optimality gap is larger than the previously defined gap, so our dependence on the gap is better. To the best of our knowledge, Algorithm 1 is the first that can simultaneously achieve O(poly log K) regret in the stochastic setting and Õ(√K) regret in the adversarial setting for the linear MDP problem. It is also worth noting that previous works on the separate adversarial setting only provide upper bounds on the expected regret (Neu & Olkhovskaya, 2021); we are the first to provide a high-probability guarantee.

5.1. REGRET ANALYSIS IN THE STOCHASTIC SETTING

In the stochastic setting, we consider the regret in the two phases of each epoch separately. We first show that the first phase (which we refer to as the adversarial phase, Lines 3-8 in Algorithm 2) terminates after k_0 episodes, where k_0 ∈ [ 64 f_K C_1 / ∆²_min, 900 f_K C_1 / ∆²_min ] with high probability, and that the optimal policy π* ∈ Π is identified. Lemma 1 summarizes the formal claims.

Lemma 1. In the stochastic setting, in each epoch, the following four claims hold.

1. With probability at least 1 − 4δ, k_0 ≤ max{ 900 f_K C_1 / ∆²_min, L }.
2. With probability at least 1 − δ, π̂ = π*.
3. With probability at least 1 − 2δ, k_0 ≥ 64 f_K C_1 / ∆²_min.
4. With probability at least 1 − 3δ, \hat∆_π ∈ [ 0.7 ∆_π, 1.3 ∆_π ] for all π ≠ π*.

The detailed proof of Lemma 1 is deferred to Appendix B. We next give a proof sketch of Theorem 2 based on Lemma 1. According to claim 1, k_0 = O( f_K C_1 / ∆²_min ), so we can bound the regret in the first phase, using the guarantee of Algorithm 4 in Theorem 1, by \sqrt{C_1 k_0} = O( C_1 \sqrt{log K} / ∆_min ). After k_0 episodes, the algorithm transfers to the second phase, which we refer to as the stochastic phase (Lines 9-18 in Algorithm 2). If the environment is truly stochastic, the values of all policies remain stationary and the detection condition (Line 15 in Algorithm 2) is never satisfied; thus the stochastic phase does not end (proved in Lemma 8 in Appendix B). The regret suffered in this phase can be analyzed using the properties of OP. According to claim 4 in Lemma 1, the estimated sub-optimality gap is close to the real one: \hat∆_π ∈ [ ∆_π / \sqrt{3}, \sqrt{3} ∆_π ]. Thus playing policies based on the solution of OP with \hat∆ attains the real instance optimality. Specifically, we first bridge the gap between the expected regret and the realized regret using the Freedman inequality (Lemma 12):

\sum_{s=k_0+1}^k ∆_{π_s} ≤ 2 \sum_{s=k_0+1}^k \sum_π p_s(π) ∆_π + 2H log(1/δ) .

Recall that p_s is computed by OP under \hat∆, which is close to the real sub-optimality gap ∆. The regret incurred in phase 2 in episodes beyond a problem-dependent constant M = O( dHβ_K / ∆²_min ), which is the dominating part of the expected regret, can be bounded using Lemma 7 with r = 3:

\sum_{s=M}^k \sum_π p_s(π) ∆_π ≤ \sum_{s=M}^k 72 dH β_s / ( ∆_min s ) = O( dH β_k log(k) / ∆_min ) .

Altogether, with probability at least 1 − δ, the regret can be upper bounded as

Reg(K; Π) = O( dH β_K log(K) / ∆_min ) = O( dH² log(K) log(|Π|K/δ) / ∆_min ) .
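The logarithmic growth of the dominating phase-2 term can be sanity-checked numerically. The function below simply sums the per-episode bound 72dHβ_s/(∆_min s) with β_s held constant, an illustrative simplification of our own, since β_s actually grows slowly with s:

```python
import math

def phase2_regret_bound(M, k, d, H, beta, delta_min):
    """Sum of the per-episode OP bound 72*d*H*beta/(delta_min*s) over s = M..k.
    The harmonic sum makes the total grow only logarithmically in k."""
    return sum(72 * d * H * beta / (delta_min * s) for s in range(M, k + 1))
```

Doubling k adds only an additive O(dHβ/∆_min) amount, matching the O(dHβ_k log(k)/∆_min) claim in the sketch above.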

5.2. REGRET ANALYSIS IN THE ADVERSARIAL SETTING

In the adversarial setting, the regret in the first phase is guaranteed by the property of Algorithm 4 in Theorem 1, so here we mainly analyze the second phase. We first show that in the second phase of each epoch, the identified policy π̂ is actually the optimal policy in Π.

Lemma 2. For any episode k in the second phase, the policy π̂ has the largest accumulated value in Π during episodes 1 to k. That is, π̂ ∈ arg max_{π∈Π} \sum_{κ=1}^k V^π_κ.

Since π̂ is the optimal policy in Π, the regret can be written as the sum of the deviations between the value of π̂ and the value of the selected policy π_s:

\sum_{s=k_0+1}^k ( V^{π̂}_s − V^{π_s}_s ) = \sum_{s=k_0+1}^k [ ( \hat V^{π̂}_s − Y_s ) + ( Y_s 1{π_s = π̂} + V^{π̂}_s 1{π_s ≠ π̂} − \hat V^{π̂}_s ) + ( V^{π̂}_s 1{π_s = π̂} − Y_s 1{π_s = π̂} ) + ( Y_s − V^{π_s}_s ) ] .

According to the detection condition (Equation 9) in Algorithm 2, the first term can be upper bounded by O( \sqrt{f_K C_1 k_0} ). For the second and last terms, the Freedman inequality (Lemma 12) also provides an O( \sqrt{C_1 k_0} ) upper bound. Altogether, the regret in a single epoch can be upper bounded by O( \sqrt{f_K C_1 k_0} ). According to the choice of the minimum duration L in each epoch, the length k_0 of the first phase in an epoch is at most half of that in the next epoch. Thus the final regret can be bounded as

Reg(K; Π) = O( \sqrt{C_1 K log(K)} ) = O( \sqrt{dH² K β_K log(K)} ) .
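The doubling of L across epochs implies only O(log K) epochs, which is how the per-epoch O(√(f_K C_1 k_0)) bounds add up to Õ(√K). A minimal sketch of the schedule, treating each epoch as ending exactly at its minimum length (an optimistic simplification of our own):

```python
def epoch_lengths(L0, K):
    """Epoch schedule induced by Algorithm 1 when each epoch ends right at its
    minimum length: the next minimum doubles (L = 2 * k0), so the number of
    epochs before K episodes elapse is O(log K)."""
    lengths, L, total = [], L0, 0
    while total < K:
        lengths.append(L)
        total += L
        L *= 2
    return lengths
```

For example, with L0 = 1 and K = 10^6 there are only 20 epochs, so summing per-epoch √k_0 terms costs at most an extra logarithmic factor.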

6. CONCLUSION

In this paper, we propose the first BoBW algorithm for linear MDP that simultaneously achieves O(poly log K) regret in the stochastic setting and Õ(√K) regret in the adversarial setting. Our approach relies on the new observation that the value function of a policy can be written as the sum of inner products between the expected state-action visitation features φ_{π,h} and the unknown reward parameters θ_h over steps h ∈ [H]; the problem can thus be regarded as an online linear optimization problem. Apart from these BoBW results, we also propose a new analysis that attains a high-probability regret guarantee for adversarial linear MDP, which is also the first such result in the literature. An important future direction is to remove the assumption of a unique optimal policy in the stochastic setting. This assumption also appears in previous BoBW works for tabular MDP (Jin & Luo, 2020; Jin et al., 2021b) and linear bandits (Lee et al., 2021); it limits the generality of the results but is challenging to remove due to the hardness of the BoBW objective and the complex structure of MDP. Extending the current results to the unknown-transition setting is also promising; this will likely require new techniques, since the current approach depends heavily on the known state-action visitation features. Deriving an FTRL-type algorithm for this objective is another interesting future direction; it remains open even in the simpler linear bandit setting, which has no transitions between states.

A ANALYSIS OF OP

The analysis of OP mainly follows the corresponding optimization problem for linear bandits (Lee et al., 2021), with additional care for the special structure of linear MDP. In this section, we provide some useful lemmas for OP in linear MDP.

Lemma 3. Consider the constrained optimization problem
$$\min_{p\in\Delta_\Pi}\ \sum_{\pi\in\Pi}p_\pi\hat\Delta_\pi-\frac{2}{\xi}\sum_{h=1}^{H}\ln\det\big(\Sigma_h(p)\big)\,,\qquad\text{where }\Sigma_h(p)=\sum_{\pi}p_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\,.$$
The optimal choice $p=p^*$ yields
$$\sum_{\pi\in\Pi}p^*_\pi\hat\Delta_\pi\le\frac{2dH}{\xi}\,,\qquad\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(p^*)}\le\frac{\xi\hat\Delta_\pi}{2}+dH\,,\quad\forall\pi\in\Pi\,.$$

Proof. Here we relax the constraint that $p$ be a valid distribution on $\Pi$ to $\sum_\pi p_\pi\le1$: since there must be a policy $\pi^*$ with $\hat\Delta_{\pi^*}=0$, we can always add the remaining probability mass to $\pi^*$ to recover a distribution. Applying the KKT conditions and setting the derivative of the Lagrangian with respect to each $p_\pi$ to zero at $p^*$, we get
$$0=\hat\Delta_\pi-\frac{2}{\xi}\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(p^*)}-\lambda_\pi+\lambda\,,$$
where $\lambda_\pi$ and $\lambda$ are the Lagrange multipliers for the constraints $p_\pi\ge0$ and $\sum_\pi p_\pi\le1$, respectively; thus $\lambda\ge0$ and $\sum_\pi\lambda_\pi p^*_\pi=0$. Multiplying the above equation by $p^*_\pi$ and summing over $\pi\in\Pi$, we get
$$0=\sum_{\pi\in\Pi}p^*_\pi\hat\Delta_\pi-\frac{2}{\xi}\sum_{h=1}^{H}\sum_{\pi\in\Pi}p^*_\pi\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(p^*)}+\lambda\,.$$
Notice that
$$\sum_{\pi\in\Pi}p^*_\pi\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(p^*)}=\sum_{\pi\in\Pi}\mathrm{Tr}\big(p^*_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\Sigma_h^{-1}(p^*)\big)=\mathrm{Tr}\Big(\Big(\sum_{\pi\in\Pi}p^*_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)\Sigma_h^{-1}(p^*)\Big)=d\,.$$
Plugging in this result, we get
$$0=\sum_{\pi\in\Pi}p^*_\pi\hat\Delta_\pi-\frac{2dH}{\xi}+\lambda\ \ge\ \sum_{\pi\in\Pi}p^*_\pi\hat\Delta_\pi-\frac{2dH}{\xi}\,,$$
so that $\sum_{\pi\in\Pi}p^*_\pi\hat\Delta_\pi\le\frac{2dH}{\xi}$, and also $\lambda\le\frac{2dH}{\xi}$ since $\sum_\pi p^*_\pi\hat\Delta_\pi\ge0$. Above all,
$$\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(p^*)}=\frac{\xi}{2}\big(\hat\Delta_\pi-\lambda_\pi+\lambda\big)\le\frac{\xi}{2}\big(\hat\Delta_\pi+\lambda\big)\le\frac{\xi\hat\Delta_\pi}{2}+dH\,.$$

Lemma 4. Suppose that for any $h\in\{1,2,\dots,H\}$, $\{\phi_{\pi,h}\mid\pi\in\Pi\}$ spans $\mathbb{R}^d$. Denote by $p_\Pi$ the uniform distribution on $\Pi$ and let $\kappa\in(0,\frac12)$.
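The trace identity at the heart of Lemma 3's proof, $\sum_\pi p_\pi\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(p)}=d$, is easy to check numerically. Below is a minimal sketch in pure Python with made-up two-dimensional features and an arbitrary distribution $p$ (all names here are illustrative, not from the paper); the identity holds for any $p$ whose weighted features span $\mathbb{R}^d$.

```python
# Checks the trace identity used in Lemma 3's proof:
#   sum_pi p_pi * ||phi_pi||^2_{Sigma(p)^{-1}} = d,
# where Sigma(p) = sum_pi p_pi phi_pi phi_pi^T.  Features and p are made up.

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def inv2(A):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(phi, M):
    # phi^T M phi
    return sum(phi[i] * M[i][j] * phi[j] for i in range(2) for j in range(2))

phis = [[1.0, 0.0], [0.3, 1.2], [-0.7, 0.5]]  # hypothetical features, d = 2
p = [0.5, 0.3, 0.2]                            # an arbitrary distribution

Sigma = [[0.0, 0.0], [0.0, 0.0]]
for w, phi in zip(p, phis):
    O = outer(phi, phi)
    Sigma = [[Sigma[i][j] + w * O[i][j] for j in range(2)] for i in range(2)]
Sigma_inv = inv2(Sigma)

# weighted leverage = Tr(Sigma^{-1} Sigma) = Tr(I_d) = d
total = sum(w * quad(phi, Sigma_inv) for w, phi in zip(p, phis))
print(round(total, 8))  # prints 2.0, the dimension d
```

The identity is just $\sum_\pi p_\pi\phi_\pi^\top\Sigma^{-1}\phi_\pi=\mathrm{Tr}\big(\Sigma^{-1}\sum_\pi p_\pi\phi_\pi\phi_\pi^\top\big)=\mathrm{Tr}(I_d)$, independent of the particular $p$.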
For any $G\subseteq\Pi$, there exists a distribution $q\in\mathcal{P}_G$ such that $\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(q^{G,\kappa})}\le2dH$ for all $\pi\in G$, where $q^{G,\kappa}=\kappa p_\Pi+(1-\kappa)q$ and $\Sigma_h(q^{G,\kappa})=\sum_{\pi\in\Pi}q^{G,\kappa}_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}$.

Proof. Denote $\mathcal{P}_{G,\kappa}=\{\kappa p_\Pi+(1-\kappa)q\mid q\in\mathcal{P}_G\}$. Then
$$\min_{q\in\mathcal{P}_{G,\kappa}}\max_{\pi\in G}\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(q)}=\min_{q\in\mathcal{P}_{G,\kappa}}\max_{p\in\mathcal{P}_G}\sum_{h=1}^{H}\mathrm{Tr}\Big(\Big(\sum_{\pi\in G}p_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)\Big(\sum_{\pi\in\Pi}q_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)^{-1}\Big)$$
$$=\max_{p\in\mathcal{P}_G}\min_{q\in\mathcal{P}_{G,\kappa}}\sum_{h=1}^{H}\mathrm{Tr}\Big(\Big(\sum_{\pi\in G}p_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)\Big(\sum_{\pi\in\Pi}q_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)^{-1}\Big)\qquad(14)$$
$$\le\max_{p\in\mathcal{P}_G}\sum_{h=1}^{H}\mathrm{Tr}\Big(\Big(\sum_{\pi\in G}p_\pi\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)\Big(\sum_{\pi\in\Pi}\Big(\frac{\kappa}{|\Pi|}+(1-\kappa)p_\pi\Big)\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)^{-1}\Big)$$
$$\le2\max_{p\in\mathcal{P}_G}\sum_{h=1}^{H}\mathrm{Tr}\Big(\Big(\sum_{\pi\in\Pi}\Big(\frac{\kappa}{|\Pi|}+(1-\kappa)p_\pi\Big)\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)\Big(\sum_{\pi\in\Pi}\Big(\frac{\kappa}{|\Pi|}+(1-\kappa)p_\pi\Big)\phi_{\pi,h}\phi_{\pi,h}^{\top}\Big)^{-1}\Big)=2dH\,,$$
where the exchange of min and max is due to Sion's minimax theorem, as equation 14 is linear in $p$ and convex in $q$, and the last inequality is due to $1-\kappa\ge\frac12$.

Lemma 5. Let $p$ be the solution of OP$(k,\hat\Delta)$. Then we have
$$\sum_{\pi\in\Pi}p_\pi\hat\Delta_\pi\le O\Big(\frac{dH\beta_k}{\sqrt k}\Big)\,.$$

Proof. We transform $p^*$ in Lemma 3 into a solution satisfying the constraints of OP. Choose $\xi=\frac{\sqrt k}{\beta_k}$ in Lemma 3, let $G=\{\pi\in\Pi:\hat\Delta_\pi\le\frac{1}{\sqrt k}\}$, and construct the distribution $q=\frac12p^*+\frac12q^{G,\kappa}$, where $q^{G,\kappa}$ is the distribution of Lemma 4 with $\kappa=\frac{1}{\sqrt k}$. Then for $\pi\in G$, $\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(q)}\le4dH$; and for $\pi\notin G$,
$$\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(q)}\le2\Big(\frac{\xi\hat\Delta_\pi}{2}+dH\Big)\le\frac{\sqrt k\,\hat\Delta_\pi}{\beta_k}+2dH\le\frac{k\hat\Delta^2_\pi}{\beta_k}+4dH\,.$$
So the distribution $q$ satisfies the constraints of OP. For the optimal solution $p$ of OP, we have
$$\sum_{\pi\in\Pi}p_\pi\hat\Delta_\pi\le\sum_{\pi\in\Pi}q_\pi\hat\Delta_\pi=\sum_{\pi\in\Pi}\Big(\frac12p^*_\pi+\frac12q^{G,\kappa}_\pi\Big)\hat\Delta_\pi\le\frac{dH\beta_k}{\sqrt k}+\frac{H}{2\sqrt k}+\frac{1}{2\sqrt k}=O\Big(\frac{dH\beta_k}{\sqrt k}\Big)\,.$$

Lemma 6. Given $\{\hat\Delta_\pi\}_{\pi\in\Pi}$, suppose there exists a unique $\hat\pi$ such that $\hat\Delta_{\hat\pi}=0$, and let $\hat\Delta_{\min}=\min_{\hat\Delta_\pi>0}\hat\Delta_\pi$. Let $p$ be the solution of OP$(k,\hat\Delta)$. When $k\ge\frac{16dH\beta_k}{\hat\Delta^2_{\min}}$, we have
$$\sum_{\pi\in\Pi}p_\pi\hat\Delta_\pi\le\frac{24dH\beta_k}{\hat\Delta_{\min}k}\,.$$

Proof. Let $G_i=\{\pi\in\Pi:2^{i-1}\hat\Delta^2_{\min}\le\hat\Delta^2_\pi\le2^{i}\hat\Delta^2_{\min}\}$ and let $n$ be the largest index such that $G_i$ is not empty. Define $z_i=\frac{dH\beta_k}{2^{i-2}\hat\Delta^2_{\min}k}$ and $\kappa=\frac{1}{n2^n}$.
Define the distribution $\tilde p$ as follows: for $\pi\neq\hat\pi$, $\tilde p_\pi=\sum_{i\ge1}z_iq^{G_i,\kappa}_\pi$; and $\tilde p_{\hat\pi}=1-\sum_{\pi\neq\hat\pi}\tilde p_\pi$. We first show that it is a valid distribution over $\Pi$:
$$\tilde p_{\hat\pi}=1-\sum_{\pi\neq\hat\pi}\sum_{i\ge1}z_iq^{G_i,\kappa}_\pi\ge1-\sum_{i\ge1}z_i-\sum_{i\ge1}\sum_{\pi\in G_i}\sum_{j\neq i}z_jq^{G_j,\kappa}_\pi=1-\sum_{i\ge1}z_i-\sum_{i\ge1}\sum_{\pi\in G_i}\sum_{j\neq i}\frac{z_j}{n2^n|\Pi|}\ge1-2\sum_{i\ge1}z_i\ge\frac12\,,$$
where the last inequality is due to $k\ge\frac{16dH\beta_k}{\hat\Delta^2_{\min}}$. For $\pi\neq\hat\pi$ with $\pi\in G_i$,
$$\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(\tilde p)}\le\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\big(\sum_{\pi'}z_iq^{G_i,\kappa}_{\pi'}\phi_{\pi',h}\phi_{\pi',h}^{\top}\big)^{-1}}\le\frac{2dH}{z_i}\le\frac{k\hat\Delta^2_\pi}{\beta_k}+4dH\,.$$
For $\hat\pi$, since $\tilde p_{\hat\pi}\ge\frac12$ implies $\Sigma_h(\tilde p)\succeq\frac12\phi_{\hat\pi,h}\phi_{\hat\pi,h}^{\top}$,
$$\sum_{h=1}^{H}\|\phi_{\hat\pi,h}\|^2_{\Sigma_h^{-1}(\tilde p)}=\sum_{h=1}^{H}\big\|\Sigma_h^{-1}(\tilde p)\phi_{\hat\pi,h}\big\|^2_{\Sigma_h(\tilde p)}\ge\sum_{h=1}^{H}\big\|\Sigma_h^{-1}(\tilde p)\phi_{\hat\pi,h}\big\|^2_{\frac12\phi_{\hat\pi,h}\phi_{\hat\pi,h}^{\top}}=\frac12\sum_{h=1}^{H}\|\phi_{\hat\pi,h}\|^4_{\Sigma_h^{-1}(\tilde p)}\ge\frac{1}{2H}\Big(\sum_{h=1}^{H}\|\phi_{\hat\pi,h}\|^2_{\Sigma_h^{-1}(\tilde p)}\Big)^2\,,$$
where the last inequality is the Cauchy–Schwarz inequality. It follows that
$$\sum_{h=1}^{H}\|\phi_{\hat\pi,h}\|^2_{\Sigma_h^{-1}(\tilde p)}\le2H\,.$$
Thus $\tilde p$ satisfies the constraints of OP. Now we bound the objective value of OP. By the optimality of $p$,
$$\sum_{\pi\in\Pi}p_\pi\hat\Delta_\pi\le\sum_{\pi\in\Pi}\tilde p_\pi\hat\Delta_\pi=\sum_{i\ge1}\sum_{\pi\in G_i}\Big(z_iq^{G_i,\kappa}_\pi+\sum_{j\neq i}\frac{z_j}{n2^n|\Pi|}\Big)2^{\frac i2}\hat\Delta_{\min}\le\sum_{i\ge1}\sum_{\pi\in G_i}\sum_{j\neq i}\frac{dH\beta_k}{n2^{n+j-\frac i2-2}|\Pi|\hat\Delta_{\min}k}+\sum_{i\ge1}\frac{dH\beta_k}{2^{\frac i2-2}\hat\Delta_{\min}k}\le2\sum_{i\ge1}\frac{dH\beta_k}{2^{\frac i2-2}\hat\Delta_{\min}k}\le\frac{24dH\beta_k}{\hat\Delta_{\min}k}\,.$$

Lemma 7. Suppose that $\hat\Delta_\pi\in\big[\frac{1}{\sqrt r}\Delta_\pi,\sqrt r\Delta_\pi\big]$ for all $\pi$. Then the solution $p$ of OP$(k,\hat\Delta)$ for $k\ge\frac{16rdH\beta_k}{\Delta^2_{\min}}$ yields
$$\sum_{\pi\in\Pi}p_\pi\Delta_\pi\le\frac{24rdH\beta_k}{\Delta_{\min}k}\,.$$

Proof. By the condition on $\hat\Delta_\pi$, we have $k\ge\frac{16rdH\beta_k}{\Delta^2_{\min}}\ge\frac{16dH\beta_k}{\hat\Delta^2_{\min}}$, and $\hat\Delta_{\pi^*}=\Delta_{\pi^*}=0$. Thus
$$\sum_{\pi\in\Pi}p_\pi\Delta_\pi\le\sqrt r\sum_{\pi\in\Pi}p_\pi\hat\Delta_\pi\le\sqrt r\cdot\frac{24dH\beta_k}{\hat\Delta_{\min}k}\le\frac{24rdH\beta_k}{\Delta_{\min}k}\,,$$
where the second inequality is due to Lemma 6.

B ANALYSIS IN THE STOCHASTIC SETTING

Proof of Lemma 1. First, we show the following property: for any $k$ in phase 1,
$$C_2\,\mathrm{DEV}_{k,\pi}\le\sqrt{C_1k}+\sum_{s=1}^{k}\big(V^{\pi_s}_s-V^{\pi}_s\big)\le\sqrt{C_1k}+k\Delta_\pi\,,$$
where we denote $\mathrm{DEV}_{k,\pi}=\sum_{s=1}^{k}\big(\hat V^{\pi}_s-V^{\pi}_s\big)$.

Claim 1's proof: Let $k=\max\big\{\frac{900f_KC_1}{\Delta^2_{\min}},L\big\}$ and assume that phase 1 has not finished at episode $k$. Set $\hat\pi=\pi^*$; we show that the termination conditions hold with high probability at episode $k$. By the Azuma–Hoeffding inequality, since $Y_s-V^{\pi_s}_s$ is a martingale difference sequence with $|Y_s-V^{\pi_s}_s|\le H$,
$$\sum_{s=1}^{k}Y_s\le\sum_{s=1}^{k}V^{\pi_s}_s+\sqrt{C_1k}\le\sum_{s=1}^{k}V^{\pi^*}_s+\sqrt{C_1k}\le\sum_{s=1}^{k}\hat V^{\pi^*}_s+2\sqrt{C_1k}\le\sum_{s=1}^{k}\hat V^{\pi^*}_s+3\sqrt{f_KC_1k}\,,$$
so equation 3 is satisfied. For all $\pi\neq\pi^*$, we have
$$\sum_{s=1}^{k}\big(\hat V^{\pi}_s-Y_s\big)=\sum_{s=1}^{k}\Big[\big(\hat V^{\pi}_s-V^{\pi}_s\big)+\big(V^{\pi}_s-V^{\pi^*}_s\big)+\big(V^{\pi^*}_s-V^{\pi_s}_s\big)+\big(V^{\pi_s}_s-Y_s\big)\Big]\le\mathrm{DEV}_{k,\pi}-k\Delta_\pi+\sqrt{C_1k}-C_2\,\mathrm{DEV}_{k,\pi^*}+\sqrt{C_1k}\le2\sqrt{f_KC_1k}+\frac{1}{C_2}\big(\sqrt{C_1k}+k\Delta_\pi\big)-k\Delta_\pi\le-0.95k\Delta_\pi+2.1\sqrt{f_KC_1k}\,.$$
Since $k\ge\frac{900f_KC_1}{\Delta^2_\pi}$, we have $k\Delta_\pi\ge30\sqrt{f_KC_1k}$ for all $\pi\neq\pi^*$, so $-0.95k\Delta_\pi+2.1\sqrt{f_KC_1k}\le-25\sqrt{f_KC_1k}$. Thus
$$\sum_{s=1}^{k}Y_s\ge\sum_{s=1}^{k}\hat V^{\pi}_s+25\sqrt{f_KC_1k}\,,$$
so equation 4 is satisfied.

Claim 2's proof: Using equation 3 and equation 4, we get
$$\sum_{s=1}^{k_0}\big(\hat V^{\hat\pi}_s-\hat V^{\pi}_s\big)\ge20\sqrt{f_KC_1k_0}\,,\quad\forall\pi\neq\hat\pi\,.\qquad(16)$$
However, with probability at least $1-\delta$, for any $\pi\neq\pi^*$,
$$\sum_{s=1}^{k_0}\big(\hat V^{\pi}_s-\hat V^{\pi^*}_s\big)=\sum_{s=1}^{k_0}\Big[\big(\hat V^{\pi}_s-V^{\pi}_s\big)+\big(V^{\pi}_s-V^{\pi^*}_s\big)+\big(V^{\pi^*}_s-\hat V^{\pi^*}_s\big)\Big]\le\mathrm{DEV}_{k_0,\pi}+\mathrm{DEV}_{k_0,\pi^*}-k_0\Delta_\pi\le\frac{1}{C_2}\big(\sqrt{C_1k_0}+k_0\Delta_\pi\big)+\frac{1}{C_2}\sqrt{C_1k_0}-k_0\Delta_\pi\le5\sqrt{f_KC_1k_0}\,.$$
So we must have $\hat\pi=\pi^*$.

Claim 3's proof: Suppose that $k_0\le\frac{64f_KC_1}{\Delta^2_{\min}}$. Let $\pi$ be the policy with the minimal gap, that is, $\Delta_\pi=\Delta_{\min}$. Then
$$\sum_{s=1}^{k_0}\big(\hat V^{\pi^*}_s-\hat V^{\pi}_s\big)\le k_0\Delta_{\min}+\mathrm{DEV}_{k_0,\pi}+\mathrm{DEV}_{k_0,\pi^*}\le k_0\Delta_{\min}+\frac{1}{C_2}\big(2\sqrt{C_1k_0}+k_0\Delta_{\min}\big)\le2k_0\Delta_{\min}+2\sqrt{f_KC_1k_0}\le16\sqrt{f_KC_1k_0}+2\sqrt{f_KC_1k_0}=18\sqrt{f_KC_1k_0}\,,$$
which contradicts equation 16. So $k_0\ge\frac{64f_KC_1}{\Delta^2_{\min}}$.

Claim 4's proof:
$$\big|k_0\hat\Delta_\pi-k_0\Delta_\pi\big|\le\mathrm{DEV}_{k_0,\pi}+\mathrm{DEV}_{k_0,\pi^*}\le\frac{1}{C_2}\big(2\sqrt{C_1k_0}+k_0\Delta_\pi\big)\le\sqrt{f_KC_1k_0}+\frac{1}{C_2}k_0\Delta_\pi\le0.3k_0\Delta_\pi\,,$$
where the last inequality is due to $k_0\ge\frac{64f_KC_1}{\Delta^2_{\min}}$.
So we have $\hat\Delta_\pi\in[0.7\Delta_\pi,1.3\Delta_\pi]$ for all $\pi\neq\pi^*$.

Lemma 8. With probability at least $1-\delta$, phase 2 never ends.

Proof. For equation 9, we decompose
$$\sum_{s=k_0+1}^{k}\big(\hat V^{\hat\pi}_s-Y_s\big)=\sum_{s=k_0+1}^{k}\Big[\big(V^{\hat\pi}_s-V^{\pi_s}_s\big)+\big(V^{\pi_s}_s-Y_s\big)\Big(1-\frac{\mathbb{1}\{\pi_s=\hat\pi\}}{\hat p_s(\hat\pi)}\Big)+\Big(\frac{V^{\hat\pi}_s}{\hat p_s(\hat\pi)}\mathbb{1}\{\pi_s=\hat\pi\}-V^{\hat\pi}_s\Big)\Big]\,.$$
First we deal with the second term, which is a martingale difference sequence. Its variance is bounded as
$$\mathbb{E}\Big[\big(V^{\pi_s}_s-Y_s\big)^2\Big(1-\frac{\mathbb{1}\{\pi_s=\hat\pi\}}{\hat p_s(\hat\pi)}\Big)^2\Big]=\hat p_s(\hat\pi)\Big(1-\frac{1}{\hat p_s(\hat\pi)}\Big)^2\mathbb{E}\big[(V^{\pi_s}_s-Y_s)^2\big]+\big(1-\hat p_s(\hat\pi)\big)\mathbb{E}\big[(V^{\pi_s}_s-Y_s)^2\big]\le2H^2\big(1-\hat p_s(\hat\pi)\big)\,.$$
The third term is also a martingale difference sequence, whose variance can be bounded as
$$\mathbb{E}\Big[\Big(\frac{V^{\hat\pi}_s}{\hat p_s(\hat\pi)}\mathbb{1}\{\pi_s=\hat\pi\}-V^{\hat\pi}_s\Big)^2\Big]\le2H^2\big(1-\hat p_s(\hat\pi)\big)\,.$$
Thus, using Freedman's inequality for the second and third terms, we get
$$\sum_{s=k_0+1}^{k}\Big[\big(V^{\pi_s}_s-Y_s\big)\Big(1-\frac{\mathbb{1}\{\pi_s=\hat\pi\}}{\hat p_s(\hat\pi)}\Big)+\frac{V^{\hat\pi}_s}{\hat p_s(\hat\pi)}\mathbb{1}\{\pi_s=\hat\pi\}-V^{\hat\pi}_s\Big]\le8H\sqrt{\sum_{s=k_0+1}^{k}\big(1-\hat p_s(\hat\pi)\big)\log\frac K\delta}+6H\log\frac K\delta\,.$$
For the first term, we bound its variance against its conditional expectation $\sum_{\pi\neq\hat\pi}\hat p_s(\pi)\big(V^{\hat\pi}_s-V^{\pi}_s\big)$ as follows:
$$\mathbb{E}\Big[\Big(V^{\hat\pi}_s-V^{\pi_s}_s-\sum_{\pi\neq\hat\pi}\hat p_s(\pi)\big(V^{\hat\pi}_s-V^{\pi}_s\big)\Big)^2\Big]\le\hat p_s(\hat\pi)H^2\big(1-\hat p_s(\hat\pi)\big)^2+\big(1-\hat p_s(\hat\pi)\big)H^2\big(2-\hat p_s(\hat\pi)\big)^2\le4H^2\big(1-\hat p_s(\hat\pi)\big)\,.$$
Using Freedman's inequality for this martingale difference sequence,
$$\sum_{s=k_0+1}^{k}\big(V^{\hat\pi}_s-V^{\pi_s}_s\big)\le\sum_{s=k_0+1}^{k}\sum_{\pi\neq\hat\pi}\hat p_s(\pi)\big(V^{\hat\pi}_s-V^{\pi}_s\big)+4H\sqrt{\sum_{s=k_0+1}^{k}\big(1-\hat p_s(\hat\pi)\big)\log\frac K\delta}+2H\log\frac K\delta=\sum_{s=k_0+1}^{k}\sum_{\pi\neq\hat\pi}\hat p_s(\pi)\big(\Delta_\pi-\Delta_{\hat\pi}\big)+4H\sqrt{\sum_{s=k_0+1}^{k}\big(1-\hat p_s(\hat\pi)\big)\log\frac K\delta}+2H\log\frac K\delta\,,$$
where $\Delta_{\hat\pi}=0$ conditioned on the event $\hat\pi=\pi^*$. Using Lemma 7 and equation 31, we have
$$\sum_{\pi\neq\hat\pi}\hat p_s(\pi)\Delta_\pi=\frac12\sum_{\pi\neq\hat\pi}p_s(\pi)\Delta_\pi\le\frac{36dH\beta_s}{\Delta_{\min}s}\,,\qquad1-\hat p_s(\hat\pi)\le\frac{12dH\beta_s}{\hat\Delta^2_{\min}s}\,.$$
Finally, we have
$$\sum_{s=k_0+1}^{k}\big(\hat V^{\hat\pi}_s-Y_s\big)\le\frac{36dH\beta_k\log k}{\Delta_{\min}}+12H\sqrt{\sum_{s=k_0+1}^{k}\frac{12dH\beta_s}{\hat\Delta^2_{\min}s}\log\frac K\delta}+8H\log\frac K\delta\le\frac{52dH\beta_K\log k}{\hat\Delta_{\min}}+\frac{72\sqrt{dH\log k}\,\beta_K}{\hat\Delta_{\min}}\qquad(20)$$
$$\le20\,dH\beta_K\log(k)\sqrt{\frac{k_0}{f_KC_1}}\qquad(21)$$
$$\le20\sqrt{f_KC_1k_0}\,,$$
where inequality 21 is due to Claim 3, i.e., $k_0\ge\frac{64f_KC_1}{\Delta^2_{\min}}\ge\frac{37f_KC_1}{\hat\Delta^2_{\min}}$, inequality 20 is due to $\hat\Delta_\pi\in[0.7\Delta_\pi,1.3\Delta_\pi]$ (Claim 4), and the last inequality uses $f_KC_1\ge dH^2\beta_K\log(k)$. So equation 9 is never satisfied.

According to equation 28 and equation 30, we have
$$\Big|\sum_{s=1}^{k}\hat V^{\pi}_s-\sum_{s=1}^{k_0}\hat V^{\pi}_s-(k-k_0)\mathrm{Rob}_{k,\pi}\Big|\le\frac{1.4k\hat\Delta_\pi}{10}\,,\qquad\Big|\sum_{s=1}^{k}\big(\hat V^{\hat\pi}_s-V^{\hat\pi}_s\big)\Big|\le2\sqrt{f_KC_1k}\le0.1k\hat\Delta_\pi\,,$$
where the last inequality is due to equation 27. So
$$\big|k\hat\Delta_{k,\pi}-k\Delta_\pi\big|\le\Big|\sum_{s=1}^{k}\hat V^{\pi}_s-\sum_{s=1}^{k_0}\hat V^{\pi}_s-(k-k_0)\mathrm{Rob}_{k,\pi}\Big|+\Big|\sum_{s=1}^{k}\big(\hat V^{\hat\pi}_s-V^{\hat\pi}_s\big)\Big|\le0.24k\hat\Delta_\pi\,.$$
Thus
$$\hat\Delta_{k,\pi}\le\Delta_\pi+0.24\hat\Delta_\pi\le\frac{1}{0.7}\hat\Delta_\pi+0.24\hat\Delta_\pi\le1.81\hat\Delta_\pi\,,\qquad\hat\Delta_{k,\pi}\ge\Delta_\pi-0.24\hat\Delta_\pi\ge\frac{1}{1.3}\hat\Delta_\pi-0.24\hat\Delta_\pi\ge0.39\hat\Delta_\pi\,.$$
So equation 8 is never satisfied.

Proof of Theorem 2. Now we bound the deviation between the actual regret and the pseudo-regret in phase 2. Using Freedman's inequality on the martingale difference sequence $\Delta_{\pi_s}-\sum_\pi\hat p_s(\pi)\Delta_\pi$:
$$\sum_{s=k_0+1}^{k}\Delta_{\pi_s}\le\sum_{s=k_0+1}^{k}\sum_\pi\hat p_s(\pi)\Delta_\pi+2\sqrt{\log\frac1\delta\sum_{s=k_0+1}^{k}\mathbb{E}\big[\Delta^2_{\pi_s}\big]}+H\log\frac1\delta\le\sum_{s=k_0+1}^{k}\sum_\pi\hat p_s(\pi)\Delta_\pi+2\sqrt{H\log\frac1\delta\sum_{s=k_0+1}^{k}\sum_\pi\hat p_s(\pi)\Delta_\pi}+H\log\frac1\delta\le2\sum_{s=k_0+1}^{k}\sum_\pi\hat p_s(\pi)\Delta_\pi+2H\log\frac1\delta\,.$$
Denote by $M$ the first episode that satisfies $M\ge\frac{48dH\beta_M}{\Delta^2_{\min}}$. For $k\ge M$, we have
$$\sum_{s=M}^{k}\sum_\pi\hat p_s(\pi)\Delta_\pi=\sum_{s=M}^{k}\frac12\sum_\pi p_s(\pi)\Delta_\pi\le\sum_{s=M}^{k}\frac{36dH\beta_s}{\Delta_{\min}s}\le O\Big(\frac{dH\beta_k\log k}{\Delta_{\min}}\Big)\,,$$
where the inequality is due to Lemma 7 with $r=3$. For $s<M$, according to Lemma 5 we have
$$\sum_{s=k_0+1}^{M}\sum_\pi\hat p_s(\pi)\Delta_\pi=\sum_{s=k_0+1}^{M}\frac12\sum_\pi p_s(\pi)\Delta_\pi\le\sum_{s=k_0+1}^{M}O\Big(\frac{dH\beta_s}{\sqrt s}\Big)\le O\big(dH\beta_M\sqrt M\big)\,.$$
During phase 1, we have
$$\sum_{s=1}^{k_0}\big(V^{\pi^*}_s-V^{\pi_s}_s\big)\le\sqrt{C_1\cdot\frac{900f_KC_1}{\Delta^2_{\min}}}\le O\Big(\frac{C_1\log K}{\Delta_{\min}}\Big)\,.$$
Since we condition on the event that phase 2 never ends, we conclude
$$\sum_{s=1}^{K}\Delta_{\pi_s}\le O\Big(\frac{C_1\log K}{\Delta_{\min}}\Big)+O\Big(\frac{dH\beta_K\log K}{\Delta_{\min}}\Big)+O\big(dH\beta_M\sqrt M\big)\le O\Big(\frac{dH^2\log K\log\frac{|\Pi|K}{\delta}}{\Delta_{\min}}\Big)\,.$$

C ANALYSIS IN THE ADVERSARIAL SETTING

Proof of Lemma 2. First, we bound the deviation of the estimation in phase 1. For any episode $k$ in phase 1, Theorem 1 gives $\sum_{s=1}^{k}\big(V^{\pi}_s-V^{\pi_s}_s\big)\le\sqrt{C_1K}-C_2\,\mathrm{DEV}_{k,\pi}$, and rearranging yields
$$\mathrm{DEV}_{k,\pi}=\sum_{s=1}^{k}\big(\hat V^{\pi}_s-V^{\pi}_s\big)\le\frac{1}{C_2-1}\Big(\sqrt{C_1K}+\sum_{s=1}^{k}\big(V^{\pi_s}_s-\hat V^{\pi}_s\big)\Big)\,.$$
At time $k_0$:
$$\mathrm{DEV}_{k_0,\pi}\le\frac{1}{C_2-1}\Big(\sqrt{C_1K}+\sum_{s=1}^{k_0}\big(V^{\pi_s}_s-\hat V^{\pi}_s\big)\Big)\le\frac{1}{C_2-1}\Big(2\sqrt{C_1K}+\sum_{s=1}^{k_0}\big(Y_s-\hat V^{\pi}_s\big)\Big)\le\frac{1}{C_2-1}\Big(2\sqrt{C_1K}+5\sqrt{f_KC_1k_0}+\sum_{s=1}^{k_0}\big(\hat V^{\hat\pi}_s-\hat V^{\pi}_s\big)\Big)\le\frac{1}{C_2-1}\Big(7\sqrt{f_KC_1k_0}+k_0\hat\Delta_\pi\Big)\,.\qquad(24)$$
Next, we bound the deviation of $(k-k_0)\mathrm{Rob}_{k,\pi}$ for $\pi\neq\hat\pi$. The variance of $\hat V^{\pi}_\kappa$ is bounded as follows:
$$\mathbb{E}\Big[\big(\hat V^{\pi}_\kappa\big)^2\Big]=\mathbb{E}\Big[\Big(\sum_{h=1}^{H}\hat r^{\pi}_{\kappa,h}\Big)^2\Big]\le H\sum_{h=1}^{H}\mathbb{E}\Big[\big(\hat r^{\pi}_{\kappa,h}\big)^2\Big]\le H\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{\kappa,h}}\le2H\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma^{-1}_{\kappa,h}}\le2H\Big(\frac{\kappa\hat\Delta^2_\pi}{\beta_\kappa}+4dH\Big)\,.$$
Using the properties of the Catoni estimator (Lemma 13), we have
$$\Big|\sum_{\kappa=k_0+1}^{k}\hat V^{\pi}_\kappa-(k-k_0)\mathrm{Rob}_{k,\pi}\Big|\le\alpha^{\pi}_k\sum_{\kappa=k_0+1}^{k}\Big(\mathbb{E}\big[\big(\hat V^{\pi}_\kappa\big)^2\big]+\Big(V^{\pi}_\kappa-\frac{1}{k-k_0}\sum_{\kappa'=k_0+1}^{k}V^{\pi}_{\kappa'}\Big)^2\Big)+\frac{2\log\frac{k^2|\Pi|}{\delta}}{\alpha^{\pi}_k}\le\alpha^{\pi}_k\sum_{\kappa=k_0+1}^{k}H^2\Big(\frac{\kappa\hat\Delta^2_\pi}{\beta_\kappa}+9dH\Big)+\frac{4\log\frac{k|\Pi|}{\delta}}{\alpha^{\pi}_k}\qquad(25)$$
$$\le4\sqrt{H\log\frac{k|\Pi|}{\delta}\sum_{\kappa=k_0+1}^{k}\Big(\frac{2\kappa\hat\Delta^2_\pi}{\beta_\kappa}+9dH\Big)}\,,\qquad(26)$$
where equation 26 is by our choice of $\alpha^{\pi}_k$. Since equation 3 and equation 4 provide that
$$\hat\Delta_\pi=\frac{1}{k_0}\sum_{s=1}^{k_0}\big(\hat V^{\hat\pi}_s-\hat V^{\pi}_s\big)\ge20\sqrt{\frac{f_KC_1}{k_0}}=20\sqrt{\frac{f_K\beta_KdH^2}{k_0}}\,,$$
we have $9dH\le\frac{k_0\hat\Delta^2_\pi}{\beta_K}\le\frac{2\kappa\hat\Delta^2_\pi}{\beta_\kappa}$. Thus
$$\Big|\sum_{\kappa=k_0+1}^{k}\hat V^{\pi}_\kappa-(k-k_0)\mathrm{Rob}_{k,\pi}\Big|\le4\sqrt{H\log\frac{k|\Pi|}{\delta}\sum_{\kappa=k_0+1}^{k}\frac{4\kappa\hat\Delta^2_\pi}{\beta_\kappa}}\le4\sqrt{H\log\frac{k|\Pi|}{\delta}\cdot\frac{4k}{\beta_k}\sum_{\kappa=k_0+1}^{k}\hat\Delta^2_\pi}\le\frac{1}{16}k\hat\Delta_\pi\,.$$
Combining the terms, we have
$$\Big|\sum_{s=1}^{k}\hat V^{\pi}_s-\sum_{s=1}^{k_0}\hat V^{\pi}_s-(k-k_0)\mathrm{Rob}_{k,\pi}\Big|\le\frac{1.4k\hat\Delta_\pi}{10}\,.$$
Finally, we bound the deviation of $\hat V^{\hat\pi}$. In the first $k_0$ episodes, since $\hat\Delta_{\hat\pi}=0$, equation 24 gives
$$\Big|\sum_{s=1}^{k_0}\big(\hat V^{\hat\pi}_s-V^{\hat\pi}_s\big)\Big|\le\sqrt{f_KC_1k_0}\,.$$
In phase 2, $\hat V^{\hat\pi}_k$ is an unbiased estimator of the true value function $V^{\hat\pi}_k$, and $\mathbb{E}\big[\big(\hat V^{\hat\pi}_k\big)^2\big]=\mathbb{E}\Big[\frac{Y^2_k}{\hat p^2_k(\hat\pi)}\mathbb{1}\{\pi_k=\hat\pi\}\Big]\le\frac{H^2}{\hat p_k(\hat\pi)}\le2H^2$, so by Freedman's inequality,
$$\Big|\sum_{s=k_0+1}^{k}\big(\hat V^{\hat\pi}_s-V^{\hat\pi}_s\big)\Big|\le2\sqrt{2H^2k\log\frac{k|\Pi|}{\delta}}+2H\log\frac{k|\Pi|}{\delta}\le\sqrt{C_1k}\,.$$
Combining the two terms, we have
$$\Big|\sum_{s=1}^{k}\big(\hat V^{\hat\pi}_s-V^{\hat\pi}_s\big)\Big|\le2\sqrt{f_KC_1k}\,.\qquad(27)$$
In sum,
$$\sum_{s=1}^{k}\big(\hat V^{\hat\pi}_s-\hat V^{\pi}_s\big)\ge\sum_{s=1}^{k-1}\big(\hat V^{\hat\pi}_s-\hat V^{\pi}_s\big)-2H\ge\sum_{s=1}^{k_0}\big(\hat V^{\hat\pi}_s-\hat V^{\pi}_s\big)+\Big(\sum_{s=k_0+1}^{k-1}\hat V^{\pi}_s-(k-k_0-1)\mathrm{Rob}_{k-1,\pi}\Big)-2\sqrt{f_KC_1(k-1)}-\frac{1.4(k-1)\hat\Delta_\pi}{10}-2H\ge(k-1)\hat\Delta_{k-1,\pi}-\frac{3(k-1)\hat\Delta_\pi}{10}\ge0\,,$$
where the second-to-last inequality is due to equation 27, and the last one is due to equation 8.

Proof of Theorem 3. Finally, we prove the regret bound in the adversarial setting. The regret in phase 2 can be decomposed as follows:
$$\sum_{s=k_0+1}^{k}\big(V^{\hat\pi}_s-V^{\pi_s}_s\big)=\sum_{s=k_0+1}^{k}\Big[\big(V^{\hat\pi}_s-Y_s\big)+\big(Y_s\mathbb{1}\{\pi_s=\hat\pi\}+V^{\hat\pi}_s\mathbb{1}\{\pi_s\neq\hat\pi\}-V^{\hat\pi}_s\big)+\big(V^{\hat\pi}_s\mathbb{1}\{\pi_s=\hat\pi\}-Y_s\mathbb{1}\{\pi_s=\hat\pi\}+(Y_s-V^{\pi_s}_s)\big)\Big]\,.$$
The first term is bounded by equation 9:
$$\sum_{s=k_0+1}^{k}\big(V^{\hat\pi}_s-Y_s\big)\le O\big(\sqrt{f_KC_1k_0}\big)\,.$$
The second term is a martingale difference sequence since
$$\mathbb{E}\big[Y_s\mathbb{1}\{\pi_s=\hat\pi\}+V^{\hat\pi}_s\mathbb{1}\{\pi_s\neq\hat\pi\}-V^{\hat\pi}_s\big]=\hat p_s(\hat\pi)V^{\hat\pi}_s+\big(1-\hat p_s(\hat\pi)\big)V^{\hat\pi}_s-V^{\hat\pi}_s=0\,.$$
Its variance is bounded as
$$\mathbb{E}\Big[\big(Y_s\mathbb{1}\{\pi_s=\hat\pi\}+V^{\hat\pi}_s\mathbb{1}\{\pi_s\neq\hat\pi\}-V^{\hat\pi}_s\big)^2\Big]\le\hat p_s(\hat\pi)\big(1-\hat p_s(\hat\pi)\big)^2H^2+\big(1-\hat p_s(\hat\pi)\big)H^2\le2H^2\big(1-\hat p_s(\hat\pi)\big)\,,$$
where the last inequality is due to $\hat p_s(\hat\pi)\ge\frac12$. The third term is also a martingale difference sequence, whose variance is bounded as
$$\mathbb{E}\Big[\big(V^{\hat\pi}_s\mathbb{1}\{\pi_s=\hat\pi\}-Y_s\mathbb{1}\{\pi_s=\hat\pi\}+(Y_s-V^{\pi_s}_s)\big)^2\Big]\le\big(1-\hat p_s(\hat\pi)\big)\mathbb{E}\big[(Y_s-V^{\pi_s}_s)^2\big]\le4H^2\big(1-\hat p_s(\hat\pi)\big)\,.$$
Also, we have
$$1-\hat p_s(\hat\pi)=\frac12\sum_{\pi\neq\hat\pi}p_s(\pi)\le\frac12\sum_{\pi\neq\hat\pi}p_s(\pi)\frac{\hat\Delta_\pi}{\hat\Delta_{\min}}\le\frac{12dH\beta_s}{\hat\Delta^2_{\min}s}\,.$$
Thus, using Freedman's inequality on the last two terms, we get
$$\sum_{s=k_0+1}^{k}\big(V^{\hat\pi}_s-V^{\pi_s}_s\big)\le O\Big(\sqrt{f_KC_1k_0}+\sqrt{\log\frac k\delta\,H^2\sum_{s=1}^{k}\frac{dH\beta_s}{\hat\Delta^2_{\min}s}}+H\log\frac k\delta\Big)\le O\Big(\sqrt{f_KC_1k_0}+\sqrt{\log\frac k\delta\cdot\frac{dH^3\log(k)\beta_k\,k_0}{f_KC_1}}+H\log\frac k\delta\Big)\le O\big(\sqrt{f_KC_1k_0}\big)\,.$$
Combining with the regret in phase 1, the regret in one epoch is bounded as
$$\sum_{s=1}^{k}\big(V^{\hat\pi}_s-V^{\pi_s}_s\big)\le O\big(\sqrt{f_KC_1k_0}\big)\,.$$
Since the duration $k_0$ of phase 1 in each epoch is at least twice the length of phase 1 in the previous epoch, the sum of $\sqrt{k_0}$ over all epochs is bounded by a constant factor times the square root of the phase-1 duration of the last epoch, which is at most $\sqrt K$. Summing over all epochs, we have
$$\mathrm{Reg}(K;\Pi)=O\big(\sqrt{C_1K}\log K\big)=O\Big(\sqrt{dH^3K\log\tfrac{|\Pi|K}{\delta}}\log K\Big)\,.$$

Algorithm 4 Geometric Hedge for Linear Adversarial MDP Policies
1: Input: policy set $\Pi$, $\gamma=\min\Big\{\frac12,\sqrt{\frac{dH\log\frac{|\Pi|}{\delta}}{K}}\Big\}$, $\eta=\frac{\gamma}{4dH^2}$
2: Initialize: $\forall\pi\in\Pi$, $w_1(\pi)=1$, $W_1:=|\Pi|$. For each $h$ from $1$ to $H$, compute the G-optimal design $g_h(\pi)$ on the set of feature visitations $\{\phi_{\pi,h},\pi\in\Pi\}$. Denote $g(\pi)=\frac1H\sum_{h=1}^{H}g_h(\pi)$
3: for each episode $k=1$ to $K$ do
4: $\forall\pi\in\Pi$, $p_k(\pi)=(1-\gamma)\frac{w_k(\pi)}{W_k}+\gamma g(\pi)$
5: Select $\pi_k\in\Pi$ according to the probability $p_k(\pi)$ and collect rewards $y_{k,h}$
6: Calculate the reward estimators: $\hat\theta_{k,h}=\bar\Sigma^{-1}_{k,h}\phi_{\pi_k,h}y_{k,h}$, $\hat r^{\pi}_{k,h}=\phi^{\top}_{\pi,h}\hat\theta_{k,h}$, and $\hat V^{\pi}_k=\sum_{h=1}^{H}\hat r^{\pi}_{k,h}$, where $\bar\Sigma_{k,h}=\sum_{\pi}p_k(\pi)\phi_{\pi,h}\phi^{\top}_{\pi,h}$
7: Compute the optimistic estimate of the value function: $\tilde V^{\pi}_k=\sum_{h=1}^{H}\Big(\hat r^{\pi}_{k,h}+2\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\sqrt{\frac{H\log\frac1\delta}{dK}}\Big)$
8: Transform the estimates into losses: $l_{k,h}(s_h,a_h)=1-r_{k,h}(s_h,a_h)$, $l^{\pi}_{k,h}=1-r^{\pi}_{k,h}$, $L^{\pi}_k=\sum_{h=1}^{H}l^{\pi}_{k,h}=H-V^{\pi}_k$, $\hat L^{\pi}_k=\sum_{h=1}^{H}\big(1-\hat r^{\pi}_{k,h}\big)=H-\hat V^{\pi}_k$, and $\tilde L^{\pi}_k=H-\tilde V^{\pi}_k$
9: Update using the loss estimators: $\forall\pi\in\Pi$, $w_{k+1}(\pi)=w_k(\pi)\exp\big(-\eta\tilde L^{\pi}_k\big)$, $W_{k+1}=\sum_{\pi\in\Pi}w_{k+1}(\pi)$
10: end for

Lemma 9 states that, with $\mathrm{DEV}_{K,\pi}=\sum_{k=1}^{K}\big(\hat V^{\pi}_k-V^{\pi}_k\big)=\sum_{k=1}^{K}\big(L^{\pi}_k-\hat L^{\pi}_k\big)$, with probability at least $1-\delta$,
$$\mathrm{DEV}_{K,\pi}\le\frac{1}{C_2}\sum_{k=1}^{K}\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{k,h}}\sqrt{\frac{H\log\frac1\delta}{dK}}+C_2\sqrt{dKH\log\tfrac1\delta}+\frac{2dH^2}{\gamma}+H\log\tfrac1\delta\,.$$

Proof. First, we show that $\hat V^{\pi}_k$ is an unbiased estimator of $V^{\pi}_k$:
$$\mathbb{E}\big[\hat V^{\pi}_k\big]=\sum_{h=1}^{H}\mathbb{E}\big[\hat r^{\pi}_{k,h}\big]=\sum_{h=1}^{H}\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\mathbb{E}\big[\phi_{\pi_k,h}y_{k,h}\big]\,.$$
Using the tower rule of expectation,
$$\mathbb{E}\big[\phi_{\pi_k,h}y_{k,h}\big]=\mathbb{E}\big[\phi_{\pi_k,h}r^{\pi_k}_{k,h}\big]=\mathbb{E}\big[\phi_{\pi_k,h}\phi^{\top}_{\pi_k,h}\theta_{k,h}\big]=\bar\Sigma_{k,h}\theta_{k,h}\,,$$
so $\mathbb{E}\big[\hat V^{\pi}_k\big]=\sum_{h=1}^{H}\phi^{\top}_{\pi,h}\theta_{k,h}=V^{\pi}_k$.
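The weight update in Algorithm 4 is a standard exponential-weights (Hedge) step mixed with an exploration distribution. The following is a minimal sketch of lines 4 and 9; the exploration distribution `g` and the per-episode loss estimates are made-up placeholders (in the algorithm, `g` comes from the G-optimal design and the losses from the optimistic estimators of lines 6 to 8).

```python
import math

# Sketch of one Geometric-Hedge step: mix the normalized weights with an
# exploration distribution g (line 4), then exponentially reweight by the
# estimated losses (line 9).  Policy names, g, and losses are illustrative.

def hedge_step(weights, g, losses, eta, gamma):
    W = sum(weights.values())
    # sampling distribution p_k (line 4)
    p = {pi: (1 - gamma) * w / W + gamma * g[pi] for pi, w in weights.items()}
    # multiplicative-weights update (line 9)
    new_w = {pi: weights[pi] * math.exp(-eta * losses[pi]) for pi in weights}
    return p, new_w

policies = ["pi1", "pi2", "pi3"]
weights = {pi: 1.0 for pi in policies}
g = {pi: 1 / 3 for pi in policies}        # placeholder exploration distribution
losses = {"pi1": 0.2, "pi2": 0.9, "pi3": 0.5}  # placeholder loss estimates
p, weights = hedge_step(weights, g, losses, eta=0.1, gamma=0.1)
```

After the step, `p` is a valid distribution and the policy with the smaller estimated loss (`pi1`) carries more weight than the one with the larger loss (`pi2`), which is the mechanism the potential-function analysis in the proof of Theorem 1 exploits.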
Denote $\sigma^2=\sum_{k=1}^{K}\mathrm{Var}\big[\hat V^{\pi}_k\big]$. Then
$$\sigma^2\le\sum_{k=1}^{K}\mathbb{E}\Big[\Big(\sum_{h=1}^{H}\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi_k,h}y_{k,h}\Big)^2\Big]\le\sum_{k=1}^{K}H\sum_{h=1}^{H}\mathbb{E}\Big[\big(\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi_k,h}y_{k,h}\big)^2\Big]\le H\sum_{k=1}^{K}\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{k,h}}\,.$$
Also, due to the properties of the G-optimal design, we have
$$\|\phi_{\pi,h}\|^2_{\big(\sum_{\pi'}g_h(\pi')\phi_{\pi',h}\phi^{\top}_{\pi',h}\big)^{-1}}\le d\,,\qquad\bar\Sigma_{k,h}\succeq\frac{\gamma}{H}\sum_{\pi'}g_h(\pi')\phi_{\pi',h}\phi^{\top}_{\pi',h}\,,$$
so $\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{k,h}}\le\frac{dH}{\gamma}$ for all $\pi\in\Pi$, and hence $\hat V^{\pi}_k\le\frac{dH^2}{\gamma}$. Using Freedman's inequality, the sum of the martingale difference sequence $\hat V^{\pi}_k-V^{\pi}_k$ is bounded as
$$\sum_{k=1}^{K}\big(\hat V^{\pi}_k-V^{\pi}_k\big)\le2\sqrt{H\sum_{k=1}^{K}\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{k,h}}\log\frac1\delta}+\frac{dH^2}{\gamma}+H\log\frac1\delta\le\frac{1}{C_2}\sum_{k=1}^{K}\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{k,h}}\sqrt{\frac{H\log\frac1\delta}{dK}}+C_2\sqrt{dKH\log\tfrac1\delta}+\frac{2dH^2}{\gamma}+H\log\tfrac1\delta\,,$$
where the last inequality is due to the AM–GM inequality.

Lemma 10.
$$\Big|\sum_{k=1}^{K}L^{\pi_k}_k-\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\hat L^{\pi}_k\Big|=\Big|\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\hat V^{\pi}_k-\sum_{k=1}^{K}V^{\pi_k}_k\Big|\le H\sqrt{2(d+1)K\log\tfrac1\delta}+\frac43\Big(H+\frac{dH^2}{\gamma}\Big)\log\tfrac1\delta\,.$$

Proof.
$$\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\hat V^{\pi}_k-\sum_{k=1}^{K}V^{\pi_k}_k=\sum_{k=1}^{K}\sum_{h=1}^{H}\Big(\sum_{\pi}p_k(\pi)\hat r^{\pi}_{k,h}-r^{\pi_k}_{k,h}\Big)\,.\qquad(33)$$
Using Lemma 6 in Bartlett et al. (2008), for each $h$,
$$\Big|\sum_{k=1}^{K}r^{\pi_k}_{k,h}-\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\hat r^{\pi}_{k,h}\Big|\le\sqrt{2(d+1)K\log\tfrac1\delta}+\frac43\Big(\frac{dH}{\gamma}+1\Big)\log\tfrac1\delta\,,$$
since $\hat\theta^{\top}_{k,h}\phi_{\pi,h}\le\frac{dH}{\gamma}$. Plugging this into equation 33 finishes the proof.

Lemma 11. With probability at least $1-\delta$, we have
$$\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\tilde L^{\pi}_k\big)^2\le2(d+1)KH^2+\frac{2dH^3}{\gamma}\sqrt{2K\log\tfrac1\delta}+\frac{8dH^3\log\frac1\delta}{\gamma}\,.$$

Proof. Since $\hat L^{\pi}_k=H-\hat V^{\pi}_k$, we have $\big(\hat L^{\pi}_k\big)^2\le\big(\hat V^{\pi}_k\big)^2+H^2$, so
$$\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\hat L^{\pi}_k\big)^2\le\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\hat V^{\pi}_k\big)^2+KH^2\,.$$
Using the Cauchy–Schwarz inequality, $\big(\hat V^{\pi}_k\big)^2\le H\sum_{h=1}^{H}\big(\hat r^{\pi}_{k,h}\big)^2$. So
$$\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\hat r^{\pi}_{k,h}\big)^2=\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\hat\theta^{\top}_{k,h}\phi_{\pi,h}\phi^{\top}_{\pi,h}\hat\theta_{k,h}=\sum_{k=1}^{K}\hat\theta^{\top}_{k,h}\bar\Sigma_{k,h}\hat\theta_{k,h}\le\sum_{k=1}^{K}\phi^{\top}_{\pi_k,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi_k,h}\,.$$
Notice that, using the definition of $\bar\Sigma_{k,h}$ and the properties of the G-optimal design,
$$\mathbb{E}\big[\phi^{\top}_{\pi_k,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi_k,h}\big]=d\,,\qquad\phi^{\top}_{\pi_k,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi_k,h}\le\frac{dH}{\gamma}\,.$$
Applying the Hoeffding bound, we get
$$\sum_{k=1}^{K}\phi^{\top}_{\pi_k,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi_k,h}\le dK+\frac{dH}{\gamma}\sqrt{2K\log\tfrac1\delta}\,.$$
Thus
$$\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\hat V^{\pi}_k\big)^2\le H\sum_{h=1}^{H}\Big(dK+\frac{dH}{\gamma}\sqrt{2K\log\tfrac1\delta}\Big)\le dKH^2+\frac{dH^3}{\gamma}\sqrt{2K\log\tfrac1\delta}\,.$$
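The G-optimal design property used above (every feature has leverage at most $d$ under the design distribution) is the Kiefer–Wolfowitz equivalence theorem. Below is a minimal pure-Python sketch that computes such a design with the classical Fedorov–Wynn exchange algorithm for a single step $h$ in dimension $d=2$; the features are made up and the algorithm itself is a standard substitute for whatever design routine an implementation of Algorithm 4 would actually use.

```python
# Fedorov-Wynn exchange algorithm for a G-optimal design on made-up 2-d
# features: repeatedly find the feature with the largest leverage
# phi^T M(p)^{-1} phi and shift mass toward it.  At the optimum the
# worst-case leverage equals d (Kiefer-Wolfowitz).

def inv2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def info_matrix(p, phis):
    M = [[0.0, 0.0], [0.0, 0.0]]
    for w, phi in zip(p, phis):
        for i in range(2):
            for j in range(2):
                M[i][j] += w * phi[i] * phi[j]
    return M

def quad(phi, M):
    return sum(phi[i] * M[i][j] * phi[j] for i in range(2) for j in range(2))

def g_optimal(phis, iters=500):
    d = 2
    p = [1.0 / len(phis)] * len(phis)
    for _ in range(iters):
        Minv = inv2(info_matrix(p, phis))
        scores = [quad(phi, Minv) for phi in phis]
        k = max(range(len(phis)), key=lambda i: scores[i])
        g = scores[k]
        if g <= d:           # already optimal
            break
        step = (g / d - 1) / (g - 1)  # classical Fedorov-Wynn step size
        p = [(1 - step) * w for w in p]
        p[k] += step
    return p

phis = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]  # hypothetical feature visitations
p = g_optimal(phis)
Minv = inv2(info_matrix(p, phis))
worst = max(quad(phi, Minv) for phi in phis)  # approaches d = 2
```

For this instance the optimal design puts weight $1/2$ on each of the two axis-aligned features, and the worst-case leverage converges to $d=2$, matching the bound $\|\phi_{\pi,h}\|^2\le d$ used in the proof.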
And
$$\big(\tilde L^{\pi}_k\big)^2=\Big(\hat L^{\pi}_k-\sum_{h=1}^{H}2\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\sqrt{\frac{H\log\frac1\delta}{dK}}\Big)^2\le2\big(\hat L^{\pi}_k\big)^2+2\Big(\sum_{h=1}^{H}2\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\sqrt{\frac{H\log\frac1\delta}{dK}}\Big)^2\le2\big(\hat L^{\pi}_k\big)^2+2H\sum_{h=1}^{H}4\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{k,h}}\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\frac{H\log\frac1\delta}{dK}\le2\big(\hat V^{\pi}_k\big)^2+2H^2+\frac{8dH}{\gamma}\sum_{h=1}^{H}\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\cdot\frac{H\log\frac1\delta}{dK}\,.$$
So we conclude that
$$\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\tilde L^{\pi}_k\big)^2\le2\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\hat L^{\pi}_k\big)^2+\frac{8H^2\log\frac1\delta}{\gamma K}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{\pi}p_k(\pi)\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\le2(d+1)KH^2+\frac{2dH^3}{\gamma}\sqrt{2K\log\tfrac1\delta}+\frac{8dH^3\log\frac1\delta}{\gamma}\,.$$

Proof of Theorem 1. Now we analyze the potential function. Using classical techniques, we get the counterpart of equation (2) in Bartlett et al. (2008):
$$\log\frac{W_{K+1}}{W_1}=\sum_{k=1}^{K}\log\frac{W_{k+1}}{W_k}=\sum_{k=1}^{K}\log\sum_{\pi}\frac{w_k(\pi)}{W_k}\exp\big(-\eta\tilde L^{\pi}_k\big)\le\sum_{k=1}^{K}\log\sum_{\pi}\frac{p_k(\pi)-\gamma g(\pi)}{1-\gamma}\Big(1-\eta\tilde L^{\pi}_k+\eta^2\big(\tilde L^{\pi}_k\big)^2\Big)\qquad(34)$$
$$\le\sum_{k=1}^{K}\sum_{\pi}\frac{p_k(\pi)-\gamma g(\pi)}{1-\gamma}\Big(-\eta\tilde L^{\pi}_k+\eta^2\big(\tilde L^{\pi}_k\big)^2\Big)\le\frac{\eta}{1-\gamma}\Big(-\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\tilde L^{\pi}_k+\gamma\sum_{k=1}^{K}\sum_{\pi}g(\pi)\tilde L^{\pi}_k+\eta\sum_{k=1}^{K}\sum_{\pi}p_k(\pi)\big(\tilde L^{\pi}_k\big)^2\Big)\,,\qquad(35)$$
where inequality 34 uses $e^{-x}\le1-x+x^2$, valid under the constraint $\eta|\tilde L^{\pi}_k|\le1$. For the middle term, using Lemma 9 with $C_2=\frac12$,
$$\sum_{k=1}^{K}\tilde L^{\pi}_k\le\sum_{k=1}^{K}L^{\pi}_k+\mathrm{DEV}_{K,\pi}-\sum_{k=1}^{K}\sum_{h=1}^{H}2\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\sqrt{\frac{H\log\frac1\delta}{dK}}\le\sum_{k=1}^{K}L^{\pi}_k+\frac12\sqrt{dKH\log\tfrac1\delta}+\frac{2dH^2}{\gamma}+H\log\tfrac1\delta\le KH+\frac12\sqrt{dKH\log\tfrac1\delta}+\frac{2dH^2}{\gamma}+H\log\tfrac1\delta\,.$$
Thus, combining Lemmas 10 and 11 with equation 35, equation 35 becomes
$$\log\frac{W_{K+1}}{W_1}\le\frac{\eta}{1-\gamma}\Big(-\sum_{k=1}^{K}L^{\pi_k}_k+O\Big(H\sqrt{dK\log\tfrac1\delta}+\gamma KH+\frac{dH^2}{\gamma}\log\tfrac1\delta+\eta dKH^2+\frac{\eta dH^3}{\gamma}\sqrt{K\log\tfrac1\delta}\Big)\Big)\,.\qquad(36)$$
And we also have, for all $\pi\in\Pi$,
$$\log\frac{W_{K+1}}{W_1}\ge\eta\Big(-\sum_{k=1}^{K}\tilde L^{\pi}_k\Big)-\log|\Pi|\ge\eta\Big(-\sum_{k=1}^{K}L^{\pi}_k-\mathrm{DEV}_{K,\pi}+\sum_{k=1}^{K}\sum_{h=1}^{H}2\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\sqrt{\frac{H\log\frac1\delta}{dK}}\Big)-\log|\Pi|\,,\qquad(37)$$
where the last inequality is due to Lemma 9. Combining equation 36 and equation 37, we have for all $\pi\in\Pi$:
$$\sum_{k=1}^{K}\big(L^{\pi_k}_k-L^{\pi}_k\big)\le\mathrm{DEV}_{K,\pi}-\sum_{k=1}^{K}\sum_{h=1}^{H}2\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\sqrt{\frac{H\log\frac1\delta}{dK}}+O\Big(\sqrt{dH^3K\log\tfrac1\delta}+\frac{dH^2}{\gamma}\log\tfrac{|\Pi|}{\delta}+\gamma KH\Big)\,.\qquad(38)$$
Moreover, since $|\tilde L^{\pi}_k|\le\frac{4dH^2}{\gamma}$, the choice $\eta=\frac{\gamma}{4dH^2}$ ensures $\eta|\tilde L^{\pi}_k|\le1$. Plugging in our choice of $\eta$, equation 38 then becomes
$$\sum_{k=1}^{K}\big(L^{\pi_k}_k-L^{\pi}_k\big)\le\Big(\frac{1}{C_2}-2\Big)\sum_{k=1}^{K}\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\bar\Sigma^{-1}_{k,h}}\sqrt{\frac{H\log\frac1\delta}{dK}}+O\Big(\sqrt{dH^3K\log\tfrac1\delta}+\frac{dH^2}{\gamma}\log\tfrac{|\Pi|}{\delta}+\gamma KH\Big)\,.$$
Choosing $C_2=\frac12$, we have
$$\sum_{k=1}^{K}\big(L^{\pi_k}_k-L^{\pi}_k\big)\le O\Big(\sqrt{dH^3K\log\tfrac1\delta}+\frac{dH^2}{\gamma}\log\tfrac{|\Pi|}{\delta}+\gamma KH\Big)\,,\qquad(39)$$
and by our choice of $\gamma=\min\Big\{\frac12,\sqrt{\frac{dH\log\frac{|\Pi|}{\delta}}{K}}\Big\}$,
$$\sum_{k=1}^{K}\big(V^{\pi}_k-V^{\pi_k}_k\big)=\sum_{k=1}^{K}\big(L^{\pi_k}_k-L^{\pi}_k\big)\le-\sum_{k=1}^{K}\sum_{h=1}^{H}2\phi^{\top}_{\pi,h}\bar\Sigma^{-1}_{k,h}\phi_{\pi,h}\sqrt{\frac{H\log\frac1\delta}{dK}}+O\Big(\sqrt{dH^3K\log\tfrac1\delta}+\frac{dH^2}{\gamma}\log\tfrac{|\Pi|}{\delta}+\gamma KH\Big)\le-C_2\,\mathrm{DEV}_{K,\pi}+O\Big(\sqrt{dH^3K\log\tfrac{|\Pi|K}{\delta}}\Big)\,.$$
Applying a union bound over all possible $k_0\in\{1,2,\dots,K\}$, we conclude that with probability at least $1-\delta$,
$$\sum_{k=1}^{k_0}\big(V^{\pi}_k-V^{\pi_k}_k\big)\le\sqrt{C_1k_0}-C_2\,\mathrm{DEV}_{k_0,\pi}\,,$$
with the constant $C_1=O\big(dH^3\log\tfrac{|\Pi|K}{\delta}\big)\ge dH^2\beta_K$, proving Theorem 1.

E CONCENTRATION INEQUALITIES

Lemma 12 (Freedman's inequality (Freedman, 1975)). Let $\mathcal{F}_0\subset\mathcal{F}_1\subset\cdots\subset\mathcal{F}_T$ be a filtration and let $X_1,X_2,\dots,X_T$ be random variables such that $X_t$ is $\mathcal{F}_t$-measurable, $\mathbb{E}[X_t\mid\mathcal{F}_{t-1}]=0$, $|X_t|\le b$ almost surely, and $\sum_{t=1}^{T}\mathbb{E}[X^2_t\mid\mathcal{F}_{t-1}]\le V$ for some fixed $V>0$ and $b>0$. Then, for any $\delta\in(0,1)$, we have with probability at least $1-\delta$,
$$\sum_{t=1}^{T}X_t\le2\sqrt{V\log\tfrac1\delta}+b\log\tfrac1\delta\,.$$

Lemma 13 (Concentration inequality for Catoni estimators (Wei & Luo, 2018; Lee et al., 2021)). Let $\mathcal{F}_0\subset\mathcal{F}_1\subset\cdots\subset\mathcal{F}_n$ be a filtration and let $X_1,X_2,\dots,X_n$ be random variables such that $X_i$ is $\mathcal{F}_i$-measurable, $\mathbb{E}[X_i\mid\mathcal{F}_{i-1}]=\mu_i$ for some fixed $\mu_i$, and $\sum_{i=1}^{n}\mathbb{E}\big[(X_i-\mu_i)^2\mid\mathcal{F}_{i-1}\big]\le V$ for some fixed $V$. Denote $\mu=\frac1n\sum_{i=1}^{n}\mu_i$ and let $\hat\mu_{n,\alpha}$ be Catoni's robust mean estimator of $X_1,X_2,\dots,X_n$ with a fixed parameter $\alpha$; that is, $\hat\mu_{n,\alpha}$ is the unique root of the function
$$f(z)=\sum_{i=1}^{n}\Phi\big(\alpha(X_i-z)\big)\,,$$
where $\Phi(y)=\log\big(1+y+y^2/2\big)$ if $y\ge0$ and $\Phi(y)=-\log\big(1-y+y^2/2\big)$ otherwise. Then for any $\delta\in(0,1)$, as long as $n$ is large enough that $n\ge\alpha^2\big(V+\cdots$
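Catoni's estimator in Lemma 13 is defined implicitly as the root of $f(z)=\sum_i\Phi(\alpha(X_i-z))$. Since $\Phi$ is increasing, $f$ is strictly decreasing in $z$, so the root can be found by bisection. Below is a minimal sketch with made-up data and an arbitrary $\alpha$ (not the optimal choice from the lemma); it only illustrates the estimator's definition and its robustness to a single outlier.

```python
import math

# Catoni's robust mean estimator: the unique root z of
#   f(z) = sum_i Phi(alpha * (X_i - z)),
# with Phi(y) = log(1 + y + y^2/2) for y >= 0 and -log(1 - y + y^2/2) otherwise.

def phi(y):
    if y >= 0:
        return math.log(1 + y + y * y / 2)
    return -math.log(1 - y + y * y / 2)

def catoni(xs, alpha, lo=-1e3, hi=1e3, iters=100):
    def f(z):
        return sum(phi(alpha * (x - z)) for x in xs)
    # f is strictly decreasing in z, so bisection finds the unique root
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

data = [1.0, 1.2, 0.8, 1.1, 0.9, 50.0]  # made-up inliers plus one gross outlier
est = catoni(data, alpha=2.0)
```

On symmetric data the estimator coincides with the empirical mean (by the antisymmetry $\Phi(y)+\Phi(-y)=0$), while on the contaminated sample above it stays far below the outlier-inflated empirical mean, which is exactly the robustness the phase-2 analysis relies on.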



The deterministic starting state is assumed only for expositional convenience. Our algorithm and analysis directly handle a random starting state drawn from a fixed distribution.



Optimization Problem OP$(k,\hat\Delta)$
1: Define $\beta_k=2^{15}H\log\frac{|\Pi|k}{\delta}$
2: Return the minimizer $p^*$ of the following constrained optimization problem:
$$\min_{p}\ \sum_{\pi\in\Pi}p_\pi\hat\Delta_\pi\qquad(10)$$
subject to $\sum_{h=1}^{H}\|\phi_{\pi,h}\|^2_{\Sigma_h^{-1}(p)}\le\frac{k\hat\Delta^2_\pi}{\beta_k}+4dH$ for all $\pi\in\Pi$.

D HIGH PROBABILITY GUARANTEE FOR ADVERSARIAL LINEAR MDP

First, we bound the deviation between the estimated value and the true value of policy $\pi$.

Lemma 9. Denote $\mathrm{DEV}_{K,\pi}=\sum_{k=1}^{K}\big(\hat V^{\pi}_k-V^{\pi}_k\big)$.

and equation 40, we prove the regret bound. Moreover, when $K\ge L_0$, choosing $C_2\ge20$ and recalling the definition of $\mathrm{DEV}_{k,\pi}$ in Lemma 9, $\sum_{k=1}^{K}\cdots$

$\sum_{i=1}^{n}(\mu_i-\mu)^2\big)+2\log\tfrac1\delta$, we have with probability at least $1-2\delta$,
$$\big|\hat\mu_{n,\alpha}-\mu\big|\le\alpha\Big(V+\sum_{i=1}^{n}(\mu_i-\mu)^2\Big)\cdots$$


Choosing $\alpha$ optimally, we have: $\cdots$ In particular, if $\cdots$

