LEARNING ADVERSARIAL LINEAR MIXTURE MARKOV DECISION PROCESSES WITH BANDIT FEEDBACK AND UNKNOWN TRANSITION

Abstract

We study reinforcement learning (RL) with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, the unknown transition probability function is a linear mixture model (Ayoub et al., 2020; Zhou et al., 2021; He et al., 2022) with a given feature mapping, and the learner only observes the losses of the experienced state-action pairs instead of the whole loss function. We propose an efficient algorithm, LSUOB-REPS, which attains an O(dS^2 √K + √(HSAK)) regret guarantee with high probability, where d is the ambient dimension of the feature mapping, S is the size of the state space, A is the size of the action space, H is the episode length, and K is the number of episodes. Furthermore, we prove a lower bound of order Ω(dH √K + √(HSAK)) for this setting. To the best of our knowledge, we make the first step toward establishing a provably efficient algorithm with a sublinear regret guarantee in this challenging setting, thereby solving the open problem of He et al. (2022).

1. INTRODUCTION

Reinforcement learning (RL) has achieved significant empirical success in games, control, robotics, and beyond. One of the most notable RL models is the Markov decision process (MDP) (Feinberg, 1996). For tabular MDPs with finite state and action spaces, nearly minimax optimal sample complexity is achieved in discounted MDPs with a generative model (Azar et al., 2013). Without access to a generative model, nearly minimax optimal sample complexity is established in tabular MDPs with finite horizon (Azar et al., 2017) and with infinite horizon (He et al., 2021b; Tossou et al., 2019). However, in real applications of RL, the state and action spaces are possibly very large or even infinite. In this case, tabular MDPs are known to suffer from the curse of dimensionality. To overcome this issue, recent works study MDPs under the assumption of function approximation, reparameterizing the values of state-action pairs by embedding them in some low-dimensional space via a given feature mapping. In particular, linear function approximation has gained extensive research attention. Amongst these works, linear mixture MDPs (Ayoub et al., 2020) and linear MDPs (Jin et al., 2020b) are two of the most popular MDP models with linear function approximation. Recent works have attained the minimax optimal regret guarantee O(dH √(KH)) in both linear mixture MDPs (Zhou et al., 2021) and linear MDPs (Hu et al., 2022) with stochastic losses. Though significant advances have emerged in learning tabular MDPs and MDPs with linear function approximation under stochastic loss functions, in real applications the loss functions may not be fixed or sampled from some underlying distribution. To cope with this challenge, Even-Dar et al. (2009) and Yu et al. (2009) make the first step to study learning adversarial MDPs, where the loss functions are chosen adversarially and may change arbitrarily between episodes.
Most works in this line of research focus on learning adversarial tabular MDPs (Neu et al., 2010a;b; 2012; Arora et al., 2012; Zimin & Neu, 2013; Dekel & Hazan, 2013; Dick et al., 2014; Rosenberg & Mansour, 2019a;b; Jin & Luo, 2020; Jin et al., 2020a; Shani et al., 2020; Chen et al., 2021; Ghasemi et al., 2021; Rosenberg & Mansour, 2021; Jin et al., 2021b; Dai et al., 2022; Chen et al., 2022a). In contrast, most recent advances regarding learning adversarial MDPs with linear function approximation require stringent assumptions, and we are still far from understanding this problem well. Specifically, Cai et al. (2020) and He et al. (2022) study learning episodic adversarial linear mixture MDPs with unknown transition but under full-information feedback, while Neu & Olkhovskaya (2021) study learning episodic adversarial linear MDPs under bandit feedback but with known transition. In the more challenging setting with both unknown transition and bandit feedback, Luo et al. (2021b) make the first step to establish a sublinear regret guarantee O(K^{6/7}) in adversarial linear MDPs under the assumption that there exists an exploratory policy, and Luo et al. (2021a) (an improved version of Luo et al. (2021b)) obtain a regret guarantee O(K^{14/15}) in the same setting without access to an exploratory policy. Therefore, a natural question remains open:

Does there exist a provably efficient algorithm with O(√K) regret guarantee for RL with linear function approximation under unknown transition, adversarial losses and bandit feedback?

Table 1: Comparisons of regret bounds with the most related works studying adversarial tabular and linear mixture MDPs with unknown transitions. K is the number of episodes, d is the ambient dimension of the feature mapping, S is the size of the state space, A is the size of the action space, and H is the episode length.

| Algorithm | Model | Feedback | Regret |
| Shifted Bandit UC-O-REPS (Rosenberg & Mansour, 2019a) | Tabular MDPs | Bandit | O(H^{3/2} S A^{1/4} K^{3/4}) |
| UOB-REPS (Jin et al., 2020a) | Tabular MDPs | Bandit | O(HS √(AK)) |
| OPPO (Cai et al., 2020) | Linear mixture MDPs | Full information | O(dH^2 √K) |
| POWERS (He et al., 2022) | Linear mixture MDPs | Full information | O(dH^{3/2} √K) |
| LSUOB-REPS (Ours) | Linear mixture MDPs | Bandit | O(dS^2 √K + √(HSAK)) |
| Lower bound (Ours) | Linear mixture MDPs | Bandit | Ω(dH √K + √(HSAK)) |

In this paper, we give an affirmative answer to this question in the setting of linear mixture MDPs and hence solve the open problem of He et al. (2022). Specifically, we propose an algorithm termed LSUOB-REPS for adversarial linear mixture MDPs with unknown transition and bandit feedback. To remove the need for the full-information feedback of the loss function required by policy-optimization-based methods (Cai et al., 2020; He et al., 2022), LSUOB-REPS extends the general ideas of occupancy-measure-based methods for adversarial tabular MDPs with unknown transition (Jin et al., 2020a; Rosenberg & Mansour, 2019a;b; Jin et al., 2021b). Specifically, inspired by the UC-O-REPS algorithm (Rosenberg & Mansour, 2019b;a), LSUOB-REPS maintains a confidence set of the unknown transition and runs online mirror descent (OMD) over the space of occupancy measures induced by all the statistically plausible transitions within the confidence set to handle the unknown transition. The key difference is that we need to build a least-squares estimate of the transition parameter and its corresponding confidence set to leverage the transition structure of linear mixture MDPs.

Previous works studying linear mixture MDPs (Ayoub et al., 2020; Cai et al., 2020; He et al., 2021a; Zhou et al., 2021; He et al., 2022; Wu et al., 2022; Chen et al., 2022b; Min et al., 2022) use the state values as the regression targets to learn the transition parameter. This method is critical to constructing the optimistic estimate of the state-action values and attaining the final regret guarantee. In this way, however, it is difficult to control the estimation error between the occupancy measure computed by OMD and the one that the learner actually induces. To cope with this issue, we use the transition information of the next states as the regression targets to learn the transition parameter. In particular, we pick a certain next state, which we call the imaginary next state, and use its transition information as the regression target (see Section 4.1 for details). In this manner, we are able to control the occupancy measure error efficiently. Besides, since the true transition is unknown, the true occupancy measure taken by the learner is also unknown, and it is infeasible to construct an unbiased loss estimator using the standard importance weighting method. To this end, we use the upper occupancy bound (Jin et al., 2020a) together with a hyperparameter conducting implicit exploration (Neu, 2015) to construct an optimistically biased loss estimator. Finally, we prove an O(dS^2 √K + √(HSAK)) high-probability regret guarantee for LSUOB-REPS, where S is the size of the state space, A is the size of the action space, H is the episode length, d is the dimension of the feature mapping, and K is the number of episodes. Further, we also prove a lower bound of order Ω(dH √K + √(HSAK)), which matches the upper bound in d, K and A up to logarithmic factors (please see Table 1 for the comparisons between our results and previous ones).
Though the upper bound does not match the lower bound in S, we establish the first provably efficient algorithm with an O(√K) regret guarantee for learning adversarial linear mixture MDPs under unknown transition and bandit feedback.

2. RELATED WORK

RL with Linear Function Approximation To permit efficient learning in RL with large state-action spaces, recent works have focused on RL algorithms with linear function approximation. In general, these works can be categorized into three lines. The first line uses the low Bellman-rank assumption (Jiang et al., 2017; Dann et al., 2018; Sun et al., 2019; Du et al., 2019; Jin et al., 2021a), which assumes the Bellman error matrix has a low-rank factorization. Besides, Du et al. (2021) consider a similar but more general assumption called bounded bilinear rank. The second line considers the linear MDP assumption (Yang & Wang, 2019; Jin et al., 2020b; Du et al., 2020; Zanette et al., 2020a; Wang et al., 2020; 2021; He et al., 2021a; Hu et al., 2022), where both the transition probability function and the loss function can be parameterized as linear functions of given state-action feature mappings. In particular, Jin et al. (2020b) propose the first statistically and computationally efficient algorithm with an O(H^2 √(d^3 K)) regret guarantee. Hu et al. (2022) further improve this result by using a weighted ridge regression and a Bernstein-type exploration bonus and obtain the minimax optimal regret bound O(dH √(KH)). Zanette et al. (2020b) consider a weaker assumption called low inherent Bellman error, where the Bellman backup is linear in the underlying parameter up to some misspecification errors. The last line of works considers the linear mixture MDP assumption (Ayoub et al., 2020; Zhang et al., 2021; Zhou et al., 2021; He et al., 2021a; Zhou & Gu, 2022; Wu et al., 2022; Min et al., 2022), in which the transition probability function is linear in some underlying parameter and a given feature mapping over state-action-next-state triples. Amongst these works, Zhou et al. (2021) obtain the minimax optimal regret bound O(dH √(KH)) in the inhomogeneous episodic linear mixture MDP setting. In this work, we also focus on linear mixture MDPs.
RL with Adversarial Losses Learning tabular RL with adversarial losses has been well-studied (Neu et al., 2010a;b; 2012; Arora et al., 2012; Zimin & Neu, 2013; Dekel & Hazan, 2013; Dick et al., 2014; Rosenberg & Mansour, 2019a;b; Jin & Luo, 2020; Jin et al., 2020a; Shani et al., 2020; Chen et al., 2021; Ghasemi et al., 2021; Rosenberg & Mansour, 2021; Jin et al., 2021b; Dai et al., 2022; Chen et al., 2022a). Generally, these results fall into two categories. The first category studies adversarial RL using occupancy-measure-based methods. In particular, with known transition, Zimin & Neu (2013) propose the O-REPS algorithm, which achieves the optimal O(√(HSAK)) regret bound under bandit feedback. Besides, we remark that the existing tightest lower bound is Ω(H √(SAK)) for the unknown transition and full-information feedback setting (Jin et al., 2018). The second category for learning adversarial RL is the policy-optimization-based method (Neu et al., 2010a; Shani et al., 2020; Luo et al., 2021b; Chen et al., 2022a), which aims to directly optimize the policies. In this line of research, with known transition and bandit feedback, Neu et al. (2010b) propose the OMDP-BF algorithm and achieve a regret of order O(K^{2/3}). Recently, Shani et al. (2020) further consider the unknown transition setting and achieve an O(K^{2/3}) regret with bandit feedback. Recent advances have also emerged in learning adversarial RL with linear function approximation (Cai et al., 2020; He et al., 2022; Neu & Olkhovskaya, 2021; Luo et al., 2021a;b). Most of these works study this problem using policy-optimization-based methods (Cai et al., 2020; Luo et al., 2021a;b; He et al., 2022).

3. PRELIMINARIES

In this section, we present the preliminaries of episodic linear mixture MDPs under adversarial losses.

Inhomogeneous, episodic adversarial MDPs

An inhomogeneous, episodic adversarial MDP is denoted by a tuple M = (S, A, H, {P_h}_{h=0}^{H-1}, {ℓ_k}_{k=1}^{K}), where S is the finite state space with cardinality |S| = S, A is the finite action space with cardinality |A| = A, H is the length of each episode, P_h : S × A × S → [0, 1] is the transition probability function with P_h(s'|s, a) being the probability of transferring to state s' when taking action a at state s in stage h, and ℓ_k : S × A → [0, 1] is the loss function for episode k chosen by the adversary. Without loss of generality, we assume that the MDP has a layered structure, satisfying the following conditions:

• The state space S consists of H + 1 disjoint layers S_0, ..., S_H satisfying S = ∪_{h=0}^{H} S_h and S_i ∩ S_j = ∅ for i ≠ j.
• S_0 and S_H are singletons, i.e., S_0 = {s_0} and S_H = {s_H}.
• Transitions can only occur between consecutive layers. Formally, let h(s) denote the index of the layer to which state s belongs; then P_{h(s)}(s'|s, a) = 0 for all s' ∉ S_{h(s)+1} and all a ∈ A.

These assumptions are standard in previous works (Zimin & Neu, 2013; Rosenberg & Mansour, 2019b;a; Jin et al., 2020a; Jin & Luo, 2020; Jin et al., 2021b; Neu & Olkhovskaya, 2021). They are not necessary for our analysis but simplify the notation. However, we remark that our layered structure assumption is slightly more general than that in previous works, which assume homogeneous transition functions (i.e., P_0 = P_1 = ... = P_{H-1}) and hence require P_h(s'|s, a) = 0 for all s' ∉ S_{h(s)+1}, all a ∈ A, and all h = 0, ..., H-1. Besides, in our formulation, due to the layered structure, P_h(·|s, a) never affects the transitions in the MDP if h ≠ h(s). Hence, with a slight abuse of notation, we define P := {P_h}_{h=0}^{H-1} and write P(·|s, a) = P_{h(s)}(·|s, a). The interaction protocol between the learner and the environment is given as follows.
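The layered-structure conditions above are easy to check programmatically. The following sketch (our own illustration; the dense-array representation and the helper name `is_layered` are assumptions, not part of the paper) validates that each P_h only moves probability from layer h to layer h + 1:

```python
import numpy as np

def is_layered(P, layer_of, H):
    """Check the layered-MDP conditions for a list of transition tensors.

    P[h][s, a, s2] = P_h(s2 | s, a); layer_of[s] is the layer index h(s).
    Probability may only flow from layer h(s) to layer h(s) + 1.
    """
    S, A = P[0].shape[0], P[0].shape[1]
    for s in range(S):
        h = layer_of[s]
        if h == H:                               # terminal layer: no outgoing edges
            continue
        for a in range(A):
            row = P[h][s, a]
            if not np.isclose(row.sum(), 1.0):   # rows must be distributions
                return False
            for s2 in range(S):
                if layer_of[s2] != h + 1 and row[s2] > 0:
                    return False                 # skipping or staying in a layer
    return True
```

Note that only the rows P_{h(s)}(·|s, a) are inspected, matching the remark that P_h(·|s, a) is irrelevant when h ≠ h(s).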
Ahead of time, the environment decides an MDP, and the learner only knows the state space S, the layer structure, and the action space A. The interaction proceeds in K episodes. At the beginning of episode k, the adversary chooses a loss function ℓ_k, possibly based on the history before episode k. Meanwhile, the learner chooses a stochastic policy π_k : S × A → [0, 1] with π_k(a|s) being the probability of taking action a at state s. Starting from the initial state s_{k,0} = s_0, the learner repeatedly selects an action a_{k,h} sampled from π_k(·|s_{k,h}), suffers loss ℓ_k(s_{k,h}, a_{k,h}), and transits to the next state s_{k,h+1} drawn from P(·|s_{k,h}, a_{k,h}) for h = 0, ..., H-1, until reaching the terminating state s_{k,H} = s_H. At the end of episode k, the learner only observes bandit feedback, i.e., the losses of the visited state-action pairs: {ℓ_k(s_{k,h}, a_{k,h})}_{h=0}^{H-1}. For any (s, a) ∈ S × A, the state-action value Q_{k,h}(s, a) and state value V_{k,h}(s) are defined as Q_{k,h}(s, a) = E[Σ_{j=h}^{H-1} ℓ_k(s_{k,j}, a_{k,j}) | π, P, (s_{k,h}, a_{k,h}) = (s, a)] and V_{k,h}(s) = E_{a∼π(·|s)}[Q_{k,h}(s, a)]. We denote the expected loss of a policy π in episode k by ℓ_k(π) = E[Σ_{h=0}^{H-1} ℓ_k(s_{k,h}, a_{k,h}) | P, π], where the trajectory {(s_{k,h}, a_{k,h})}_{h=0}^{H-1} is generated by executing policy π under transition function P. The goal of the learner is to minimize the regret with respect to the optimal policy π*, defined as R(K) = Σ_{k=1}^{K} ℓ_k(π_k) − Σ_{k=1}^{K} ℓ_k(π*), where π* ∈ argmin_{π∈Π} Σ_{k=1}^{K} ℓ_k(π) and Π is the set of all stochastic policies.

Linear mixture MDPs We consider a special class of MDPs called linear mixture MDPs (Ayoub et al., 2020; Cai et al., 2020; Zhou et al., 2021; He et al., 2022), where the transition probability function is linear in a known feature mapping ϕ : S × A × S → R^d. The formal definition of linear mixture MDPs is given as follows.

Definition 1.
M = (S, A, H, {P_h}_{h=0}^{H-1}, {ℓ_k}_{k=1}^{K}) is called an inhomogeneous, episodic B-bounded linear mixture MDP if there exist vectors θ*_h ∈ R^d with ‖θ*_h‖_2 ≤ B such that P_h(s'|s, a) = ⟨ϕ(s'|s, a), θ*_h⟩ and ‖ϕ(s'|s, a)‖_2 ≤ 1 for all (s, a, s') ∈ S × A × S and all h = 0, 1, ..., H-1.

We note that our regularity assumption on the feature mapping ϕ differs from that in previous works, which assume ‖ϕ_G(s, a)‖_2 ≤ 1 for any function G : S → [0, 1], where ϕ_G(s, a) = Σ_{s'} ϕ(s'|s, a)G(s'). One can see that our assumption is slightly more general than theirs.

Notation For a vector x and a matrix A, we use x(i) to denote the i-th coordinate of x and A(i, :) to denote the i-th row of A. Let o_{i,j} = (s_{i,j}, a_{i,j}, ℓ_i(s_{i,j}, a_{i,j})) be the observation of the learner at episode i and stage j. We denote by F_{k,h} the σ-algebra generated by {o_{1,0}, ..., o_{1,H-1}, o_{2,0}, ..., o_{k,0}, ..., o_{k,h}}. For simplicity, we abbreviate E[·|F_{k,h}] as E_{k,h}[·]. The notation O(·) in this work hides all logarithmic factors.
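As a sanity check of Definition 1, a linear mixture transition can be realized by mixing d base transition kernels. The sketch below is our own toy construction (the mixing-of-kernels view, the 1/√d rescaling, and all variable names are assumptions for illustration): ϕ(s'|s, a) stacks the d base-kernel probabilities, θ* holds the mixture weights, and P(s'|s, a) = ⟨ϕ(s'|s, a), θ*⟩ is then a valid transition function.

```python
import numpy as np

rng = np.random.default_rng(0)
d, S, A = 3, 4, 2

# d base transition kernels, each a valid distribution over next states.
base = rng.random((d, S, A, S))
base /= base.sum(axis=3, keepdims=True)

# Rescale by 1/sqrt(d) so that ||phi(s'|s,a)||_2 <= 1 per triple, as in
# Definition 1; the weights absorb the inverse factor so P is unchanged.
phi = np.transpose(base, (1, 2, 3, 0)) / np.sqrt(d)   # shape (S, A, S, d)
theta_star = np.sqrt(d) * np.array([0.5, 0.3, 0.2])   # mixture weights

P = phi @ theta_star                                  # P[s, a, s2] = <phi, theta*>
assert np.allclose(P.sum(axis=2), 1.0)                # rows are distributions
assert np.all(np.linalg.norm(phi, axis=3) <= 1.0 + 1e-12)
```

Here B = ‖θ*‖_2 = √d · ‖(0.5, 0.3, 0.2)‖_2, illustrating why the per-triple norm bound and the parameter bound trade off against each other.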

3.1. OCCUPANCY MEASURES

To solve MDPs with online learning techniques, we use the concept of occupancy measures (Altman, 1998). Specifically, for a policy π and a transition probability function P, the occupancy measure q^{P,π} : S × A → [0, 1] induced by P and π is defined as q^{P,π}(s, a) = Pr[(s_h, a_h) = (s, a) | P, π], where h = h(s) is the index of the layer of state s. Hence q^{P,π}(s, a) is the probability of visiting the state-action pair (s, a) under policy π and transition P. In what follows, we drop the dependence of an occupancy measure on P and π when it is clear from the context. By its definition, a valid occupancy measure q satisfies the following two conditions. First, since exactly one state in each layer is visited in an episode of a layered MDP, Σ_{(s,a)∈S_h×A} q(s, a) = 1 for all h = 0, ..., H-1. Second, for all h = 1, ..., H-1 and all s ∈ S_h, Σ_{(s',a')∈S_{h-1}×A} q(s', a') P(s|s', a') = Σ_{a∈A} q(s, a). With a slight abuse of notation, we write q(s) = Σ_{a∈A} q(s, a). For a given occupancy measure q, one can obtain its induced policy by π_q(a|s) = q(s, a)/q(s). Fixing a transition function P of interest, we denote by ∆(P) the set of all valid occupancy measures induced by P and some policy π. Then the regret can be rewritten as R(K) = Σ_{k=1}^{K} ⟨q^{P,π_k} − q*, ℓ_k⟩, where q* = q^{P,π*} ∈ ∆(P) is the optimal occupancy measure induced by π*.
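In a layered MDP the occupancy measure can be computed by a forward recursion over layers, which also makes the two validity conditions concrete. The sketch below is illustrative (dense arrays, a single homogeneous transition tensor for simplicity, and the function name are our assumptions):

```python
import numpy as np

def occupancy_measure(P, pi, layers):
    """Forward recursion: q(s, a) = Pr[(s_{h(s)}, a_{h(s)}) = (s, a) | P, pi].

    P[s, a, s2] = P(s2 | s, a) (layered), pi[s, a] = pi(a | s),
    layers = list of state-index lists, layers[0] = [s_0].
    """
    S, A = pi.shape
    q = np.zeros((S, A))
    q_state = np.zeros(S)
    q_state[layers[0][0]] = 1.0                  # start in s_0 with probability 1
    for h in range(len(layers) - 1):
        for s in layers[h]:
            q[s] = q_state[s] * pi[s]            # q(s, a) = q(s) * pi(a | s)
        for s2 in layers[h + 1]:                 # flow condition into layer h + 1
            q_state[s2] = sum(q[s, a] * P[s, a, s2]
                              for s in layers[h] for a in range(A))
    return q
```

By construction each layer's entries sum to one (first condition) and the inflow to each state equals its outflow (second condition); the induced policy is recovered as `q[s] / q[s].sum()`.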

4. ALGORITHM

In this section, we introduce the proposed LSUOB-REPS algorithm, detailed in Algorithm 1. In general, LSUOB-REPS maintains an ellipsoid confidence set of the unknown transition parameter (Section 4.1). Meanwhile, it constructs an optimistically biased loss estimator and runs OMD over the space of occupancy measures induced by the confidence set to update the occupancy measure (Section 4.2).

Algorithm 1 Least Squares Upper Occupancy Bound Relative Entropy Policy Search (LSUOB-REPS)
1: Input: state space S, action space A, episode number K, learning rate η, exploration parameter γ, regression regularization parameter λ, and confidence parameter δ.
2: Initialization: initialize the confidence set P_1 as the set of all transition functions. For all h = 0, ..., H-1 and all (s, a) ∈ S_h × A, initialize M_{0,h} = λI, b_{0,h} = 0, occupancy measure q_1(s, a) = 1/(|S_h| · A), and policy π_1 = π_{q_1}.
3: for k = 1, 2, ..., K do
4:   for h = 0, 1, ..., H-1 do
5:     Take action a_{k,h} ∼ π_k(·|s_{k,h}).
6:     Set the imaginary next state s'_{k,h+1} ∈ argmax_{s∈S_{h+1}} ‖ϕ(s|s_{k,h}, a_{k,h})‖_{M_{k-1,h}^{-1}}.
7:     Observe the true next state s_{k,h+1} ∼ P_h(·|s_{k,h}, a_{k,h}) and loss ℓ_k(s_{k,h}, a_{k,h}).
8:     M_{k,h} = M_{k-1,h} + ϕ(s'_{k,h+1}|s_{k,h}, a_{k,h}) ϕ(s'_{k,h+1}|s_{k,h}, a_{k,h})^⊤.
9:     b_{k,h} = b_{k-1,h} + ϕ(s'_{k,h+1}|s_{k,h}, a_{k,h}) δ_{s_{k,h+1}}(s'_{k,h+1}).
10:    θ_{k,h} = M_{k,h}^{-1} b_{k,h}.
11:  end for
12:  Compute upper occupancy bounds: u_k(s_{k,h}, a_{k,h}) = COMP-UOB(π_k, s_{k,h}, a_{k,h}, P_k) for all h.
13:  Construct loss estimators for all (s, a): ℓ̂_k(s, a) = ℓ_k(s, a) · I_k{s, a} / (u_k(s, a) + γ).
14:  Update the transition confidence set P_{k+1} based on Eq. (3).
15:  Compute the occupancy measure: q_{k+1} = argmin_{q∈∆(P_{k+1})} η ⟨q, ℓ̂_k⟩ + D_F(q, q_k).
16:  Update the policy π_{k+1} = π_{q_{k+1}}.
17: end for

4.1. CONFIDENCE SETS

One of the main difficulties in learning MDPs comes from the unknown transition P. A natural way to deal with this is to construct an estimator of P together with a corresponding confidence set. Let ϕ_V(s, a) = Σ_{s'} ϕ(s'|s, a)V(s'). Observing that P_h(·|s, a)^⊤ V_{k,h+1} = Σ_{s'∈S} V_{k,h+1}(s')⟨ϕ(s'|s, a), θ*_h⟩ = ⟨ϕ_{V_{k,h+1}}(s, a), θ*_h⟩, existing works studying linear mixture MDPs learn θ*_h using ϕ_{V_{k,h+1}}(s_{k,h}, a_{k,h}) as the feature and V_{k,h+1}(s_{k,h+1}) as the regression target (Ayoub et al., 2020; Cai et al., 2020). In particular, they construct the estimator θ_{k,h} of θ*_h as

θ_{k,h} = argmin_{θ∈R^d} Σ_{i=1}^{k} (⟨ϕ_{V_{i,h+1}}(s_{i,h}, a_{i,h}), θ⟩ − V_{i,h+1}(s_{i,h+1}))^2 + λ‖θ‖_2^2.

Zhou et al. (2021) and He et al. (2022) also use a similar method but further incorporate the estimated variance information to obtain a sharper confidence set. This method is termed value-targeted regression (VTR) (Ayoub et al., 2020; Cai et al., 2020; Zhou et al., 2021; He et al., 2022), and it is critical to constructing the optimistic estimator Q_{k+1,h}(·, ·) of the optimal action-value function Q*(·, ·) and obtaining the final regret guarantee. However, though VTR is popular in previous works studying linear mixture MDPs (Ayoub et al., 2020; Cai et al., 2020; He et al., 2021a; Zhou et al., 2021; He et al., 2022; Wu et al., 2022; Chen et al., 2022b; Min et al., 2022), including the information of the state-value function V_{i,h}(·) in the regression makes it hard to control the estimation error of the occupancy measure coming from the unknown transition P. To overcome this challenge, we take a different route, in which θ*_h is learned directly from the vanilla transition information. Specifically, let Φ_{s,a} ∈ R^{d×S} with Φ_{s,a}(:, s') = ϕ(s'|s, a), and let δ_s ∈ {0, 1}^S be the Dirac measure at s (i.e., a one-hot vector with the one at entry s).
To learn θ*_h from the transition information, one may consider using Φ_{s_{k,h},a_{k,h}} as the feature and δ_{s_{k,h+1}} as the regression target. Specifically, θ_{k,h} could be taken as the solution of the following regularized linear regression problem:

θ_{k,h} = argmin_{θ∈R^d} Σ_{i=1}^{k} ‖Φ_{s_{i,h},a_{i,h}}^⊤ θ − δ_{s_{i,h+1}}‖_2^2 + λ‖θ‖_2^2.

However, one obstacle remains. Let η_{i,h} = P_h(·|s_{i,h}, a_{i,h}) − δ_{s_{i,h+1}} be the noise at episode i and stage h. Then it is clear that η_{i,h} ∈ [-1, 1]^S, E_{i,h}[η_{i,h}] = 0, and Σ_{s∈S} η_{i,h}(s) = 0. Therefore, conditioned on F_{i,h}, the noise η_{i,h}(s) at each state s is 1-subgaussian, but the coordinates are not independent. Consequently, one cannot directly establish an ellipsoid confidence set for θ_{k,h} using the self-normalized concentration inequality for vector-valued martingales (Abbasi-Yadkori et al., 2011). To address this issue, we propose to use the transition information of only one state s'_{i,h+1} in the next layer, which we call the imaginary next state. Note that the imaginary next state s'_{i,h+1} need not be the true next state s_{i,h+1} experienced by the learner. More specifically, we construct the estimator θ_{k,h} of θ*_h by solving

θ_{k,h} = argmin_{θ∈R^d} Σ_{i=1}^{k} (⟨ϕ(s'_{i,h+1}|s_{i,h}, a_{i,h}), θ⟩ − δ_{s_{i,h+1}}(s'_{i,h+1}))^2 + λ‖θ‖_2^2.

The closed-form solution of the above display is θ_{k,h} = M_{k,h}^{-1} b_{k,h}, where M_{k,h} = Σ_{i=1}^{k} ϕ(s'_{i,h+1}|s_{i,h}, a_{i,h}) ϕ(s'_{i,h+1}|s_{i,h}, a_{i,h})^⊤ + λI is the regularized feature covariance matrix at episode k and stage h, and b_{k,h} = Σ_{i=1}^{k} ϕ(s'_{i,h+1}|s_{i,h}, a_{i,h}) δ_{s_{i,h+1}}(s'_{i,h+1}). The choice of s'_{k,h+1} may be determined by the learner based on the information of previous steps up to observing (s_{k,h}, a_{k,h}).
In particular, we choose s'_{k,h+1} ∈ argmax_{s∈S_{h+1}} ‖ϕ(s|s_{k,h}, a_{k,h})‖_{M_{k-1,h}^{-1}}; the intuition is that the learner estimates the uncertainty of the most uncertain state and thereby controls the uncertainties of all the states in the next layer. Based on the above construction of θ_{k,h}, its ellipsoid confidence set is guaranteed by the following lemma.

Lemma 1. Let δ ∈ (0, 1). Then for any k ∈ N, and simultaneously for all h = 0, ..., H-1, with probability at least 1 − δ, it holds that θ*_h ∈ C_{k,h}, where C_{k,h} = {θ ∈ R^d : ‖θ − θ_{k-1,h}‖_{M_{k-1,h}} ≤ β_{k,h}} with β_{k,h} = B√λ + √(2 ln(H/δ) + ln(det(M_{k-1,h})/λ^d)).

Note that the above lemma immediately implies that with probability at least 1 − δ, P ∈ P_k, where P_k = {P_{k,h}}_{h=0}^{H-1} and

P_{k,h} = {P̃_h : ∃θ ∈ C_{k,h} s.t. ∀(s, a, s') ∈ S_h × A × S_{h+1}, P̃_h(s'|s, a) = θ^⊤ ϕ(s'|s, a)}.   (3)
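The inner-loop updates (steps 6-10 of Algorithm 1) can be sketched as a rank-one online ridge regression. This is an illustrative toy version (dense arrays, `np.linalg.inv` instead of an incremental inverse, and the function name are our assumptions):

```python
import numpy as np

def step_update(M, b, Phi_next, true_next_idx):
    """One inner-loop step: pick the imaginary next state, then update M, b, theta.

    Phi_next[i] holds phi(s_i | s_{k,h}, a_{k,h}) for each candidate state s_i
    in the next layer; true_next_idx is the index of the observed next state.
    """
    M_inv = np.linalg.inv(M)
    # Imaginary next state: the feature with the largest M^{-1}-norm, i.e. the
    # direction the data collected so far explains worst.
    scores = np.einsum('id,de,ie->i', Phi_next, M_inv, Phi_next)
    j = int(np.argmax(scores))
    phi = Phi_next[j]
    target = 1.0 if j == true_next_idx else 0.0   # delta_{s_{k,h+1}}(s'_{k,h+1})
    M = M + np.outer(phi, phi)                    # rank-one covariance update
    b = b + target * phi
    theta = np.linalg.solve(M, b)                 # theta_{k,h} = M_{k,h}^{-1} b_{k,h}
    return M, b, theta, j
```

Starting from M = λI and b = 0, the incremental estimate coincides with the batch solution of the regularized regression displayed above.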

4.2. LOSS ESTIMATORS AND ONLINE MIRROR DESCENT

Loss Estimators When learning MDPs with known transition P, existing works construct a conditionally unbiased estimator ℓ̂_k(s, a) = ℓ_k(s, a) · I_k{s, a} / q_k(s, a) of the true loss function ℓ_k (Zimin & Neu, 2013; Jin & Luo, 2020), where I_k{s, a} = 1 if (s, a) is visited in episode k and I_k{s, a} = 0 otherwise. To further obtain a high-probability bound, Ghasemi et al. (2021) extend the idea of implicit exploration in multi-armed bandits (Neu, 2015) and propose an optimistically biased loss estimator ℓ̂_k(s, a) = ℓ_k(s, a) · I_k{s, a} / (q_k(s, a) + γ) with implicit exploration parameter γ > 0. When the transition P is unknown, the true occupancy measure q_k taken by the learner is also unknown, and the above loss estimators are no longer applicable. To tackle this problem, we use the loss estimator ℓ̂_k(s, a) = ℓ_k(s, a) · I_k{s, a} / (u_k(s, a) + γ), where u_k(s, a) = max_{P̂∈P_k} q^{P̂,π_k}(s, a) is the upper occupancy bound, first proposed by Jin et al. (2020a). This loss estimator is also optimistically biased since u_k(s, a) ≥ q_k(s, a) whenever P ∈ P_k, which holds with high probability. Note that u_k can be computed efficiently by the COMP-UOB procedure of Jin et al. (2020a).

Online Mirror Descent To compute the updated occupancy measure in each episode, our algorithm follows the standard OMD framework. Since ∆(P) is unknown, following previous works (Rosenberg & Mansour, 2019b;a; Jin et al., 2020a), LSUOB-REPS runs OMD over the space of occupancy measures ∆(P_{k+1}) induced by the transition confidence set P_{k+1}.
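The implicit-exploration estimator is a one-liner, and its optimistic bias is easy to see: conditioned on the past, E[ℓ̂_k(s, a)] = ℓ_k(s, a) · q_k(s, a)/(u_k(s, a) + γ) ≤ ℓ_k(s, a) whenever u_k ≥ q_k. A minimal sketch (the function name and scalar interface are our own):

```python
import numpy as np

def ix_loss_estimator(loss, u, visited, gamma):
    """Implicit-exploration estimator: loss * I{visited} / (u + gamma).

    With u >= q (the true visit probability) and gamma > 0, the conditional
    expectation is loss * q / (u + gamma) <= loss, i.e. the estimator is
    optimistically biased: it systematically underestimates losses.
    """
    return np.where(visited, loss / (u + gamma), 0.0)
```

For instance, with ℓ = 0.8, u = 0.35, γ = 0.05 and true visit probability q = 0.3, the estimator takes value 0.8/0.4 = 2 when the pair is visited, so its expectation is 0.3 · 2 = 0.6 ≤ 0.8.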
Specifically, at the end of episode k, LSUOB-REPS updates the occupancy measure by solving q_{k+1} = argmin_{q∈∆(P_{k+1})} η ⟨q, ℓ̂_k⟩ + D_F(q, q_k), where ℓ̂_k is the biased loss estimator introduced above, η > 0 is the learning rate to be tuned later, D_F(q, q') = Σ_{s,a} q(s, a) ln(q(s, a)/q'(s, a)) − Σ_{s,a} (q(s, a) − q'(s, a)) is the unnormalized KL-divergence, and the potential function F(q) = Σ_{s,a} q(s, a) ln q(s, a) − Σ_{s,a} q(s, a) is the unnormalized negative entropy. Besides, we note that this update can be solved efficiently following the standard two-step procedure of OMD (Lattimore & Szepesvári, 2020). The concrete discussions are postponed to Appendix E. More comparisons between our method and previous methods are detailed in Appendix A.
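The two-step procedure can be sketched as follows: the unconstrained minimizer of η⟨q, ℓ̂_k⟩ + D_F(q, q_k) under the unnormalized negative entropy is the multiplicative update q_k · exp(−η ℓ̂_k), followed by a Bregman projection onto ∆(P_{k+1}). This sketch is a simplification we make for illustration: we replace the true projection (which also enforces the flow constraints and the confidence set, as discussed in Appendix E) by a per-layer renormalization, and all names are our own.

```python
import numpy as np

def omd_unconstrained(q, loss_hat, eta):
    """First OMD step with unnormalized negative entropy F:
    argmin_q' eta * <q', loss_hat> + D_F(q', q) = q * exp(-eta * loss_hat)."""
    return q * np.exp(-eta * loss_hat)

def renormalize_layers(q_tilde, layers):
    """Simplified stand-in for the Bregman projection onto Delta(P_{k+1}):
    only the per-layer normalization is restored here."""
    q = q_tilde.copy()
    for idx in layers:              # idx: the state rows of one layer
        q[idx] /= q[idx].sum()
    return q
```

After the multiplicative step, actions with larger estimated losses lose probability mass exponentially fast, which is the familiar exponential-weights behavior of entropic OMD.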

5. ANALYSIS

In this section, we present the regret upper bound of our algorithm LSUOB-REPS, and a regret lower bound for learning adversarial linear mixture MDPs with unknown transition and bandit feedback.

5.1. REGRET UPPER BOUND

The regret upper bound of our algorithm LSUOB-REPS is guaranteed by the following theorem. Recall that d is the dimension of the feature mapping, H is the episode length, K is the number of episodes, and S and A are the sizes of the state and action spaces, respectively.

Theorem 1. For any adversarial linear mixture MDP M = (S, A, H, {P_h}_{h=0}^{H-1}, {ℓ_k}_{k=1}^{K}) satisfying Definition 1, by setting the learning rate η and the implicit exploration parameter γ as η = γ = √(H ln(HSA/δ)/(KSA)), with probability at least 1 − 5δ, the regret of LSUOB-REPS is upper bounded by

R(K) = O(dS^2 √K ln^2(K/δ) + √(HSAK ln(HSA/δ)) + H ln(H/δ)).

Proof sketch. Let q_k = q^{P,π_k} and let q̂_k denote the occupancy measure computed by OMD (so that π_k = π_{q̂_k}). Following Jin et al. (2020a), we decompose the regret as

R(K) = Σ_{k=1}^{K} ⟨q̂_k − q*, ℓ̂_k⟩ [REG] + Σ_{k=1}^{K} ⟨q_k − q̂_k, ℓ_k⟩ [ERROR] + Σ_{k=1}^{K} ⟨q̂_k, ℓ_k − ℓ̂_k⟩ [BIAS1] + Σ_{k=1}^{K} ⟨q*, ℓ̂_k − ℓ_k⟩ [BIAS2].

We bound each term as follows (see Appendix C.2 and Appendix C.3 for details). First, the REG term is the regret of the corresponding online optimization problem, which is directly controlled by OMD and can be bounded by O(√(HSAK ln(HSA/δ)) + H ln(H/δ)). Further, the BIAS2 term measures the overestimation of the true losses by the constructed loss estimators, which can be bounded by O(√(HSAK ln(SA/δ))) via the concentration of the implicit exploration loss estimator (Lemma 1, Neu (2015); Lemma 11, Jin et al. (2020a)). Finally, the ERROR and BIAS1 terms are closely related to the estimation error of the occupancy measure and can be bounded by O(dS^2 √K ln^2(K/δ)) and O(dS^2 √K ln^2(K/δ) + √(HSAK ln(HSA/δ))), respectively. Applying a union bound over the above bounds finishes the proof.

Remark 1. Compared with the regret bound of He et al. (2022) for the full-information feedback setting, our bound introduces the dependence on S and A and is worse than theirs since S ≥ H by the layered structure of MDPs.
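For completeness, one can verify term by term that the four quantities in the decomposition telescope back to the regret, where q̂_k denotes the occupancy measure computed by OMD (with π_k = π_{q̂_k}), q_k = q^{P,π_k}, and ℓ̂_k is the loss estimator:

```latex
\begin{aligned}
&\underbrace{\sum_{k=1}^{K}\langle \hat q_k - q^*, \hat\ell_k\rangle}_{\textsc{Reg}}
+\underbrace{\sum_{k=1}^{K}\langle q_k - \hat q_k, \ell_k\rangle}_{\textsc{Error}}
+\underbrace{\sum_{k=1}^{K}\langle \hat q_k, \ell_k - \hat\ell_k\rangle}_{\textsc{Bias}_1}
+\underbrace{\sum_{k=1}^{K}\langle q^*, \hat\ell_k - \ell_k\rangle}_{\textsc{Bias}_2}\\
&\qquad=\sum_{k=1}^{K}\Big(\langle q_k,\ell_k\rangle-\langle q^*,\ell_k\rangle\Big)
=\sum_{k=1}^{K}\ell_k(\pi_k)-\sum_{k=1}^{K}\ell_k(\pi^*)=R(K),
\end{aligned}
```

since the cross terms ⟨q̂_k, ℓ̂_k⟩, ⟨q̂_k, ℓ_k⟩, and ⟨q*, ℓ̂_k⟩ all cancel in pairs.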
However, as we shall see in Section 5.2, incorporating the dependence on S and A into the regret bound is inevitable; this is the cost of moving from full-information feedback to the more challenging bandit feedback. Besides, when dS ≤ H√A, the regret bound of LSUOB-REPS improves over the O(HS √(AK)) bound of Jin et al. (2020a).

5.1.1. BOUNDING THE OCCUPANCY MEASURE DIFFERENCE

To bound the ERROR and BIAS1 terms, it is critical to control (a) the estimation error between q̂_k and q_k, and (b) the estimation error between u_k and q_k, both of which can be bounded by the following key technical lemma. We defer its proof to Appendix C.1.

Lemma 2 (Occupancy measure difference for linear mixture MDPs). For any collection of transition functions {P^s_k}_{s∈S} such that P^s_k ∈ P_k for all s, let q^s_k = q^{P^s_k, π_k}. If λ ≥ δ, with probability at least 1 − 2δ, it holds that Σ_{k=1}^{K} Σ_{(s,a)∈S×A} |q^s_k(s, a) − q_k(s, a)| = O(dS^2 √K ln^2(K/δ)).

Remark 2. Compared with the occupancy measure difference O(HS √(AK)) for tabular MDPs in Lemma 4 of Jin et al. (2020a), our bound O(dS^2 √K) does not depend on A, though it is worse by a factor of S. The main difficulty in simultaneously eliminating the dependence of the occupancy measure difference on S and A is that, although the transition P of a linear mixture MDP admits a linear structure, its occupancy measure still has a complicated recursive form: q_k(s, a) = π_k(a|s) ⟨θ*_{h(s)-1}, Σ_{(s',a')∈S_{h(s)-1}×A} q_k(s', a') ϕ(s|s', a')⟩. We leave the investigation of whether the dependence on S can also be eliminated as future work. Besides, we note that our bound for the occupancy measure difference is not a straightforward extension of its tabular version in Jin et al. (2020a). Specifically, let q^s_k(s, a|s_m) be the probability of visiting (s, a) conditioned on the event that s_m is visited in layer m. Jin et al. (2020a) decompose q^s_k(s, a|s_m) as (q^s_k(s, a|s_m) − q_k(s, a|s_m)) + q_k(s, a|s_m), where the term (q^s_k(s, a|s_m) − q_k(s, a|s_m)) only contributes an O(H^2 S^2 A ln(KSA/δ)) term. However, in the linear function approximation setting, this term would become a leading term of order O(H^2 dS^2 (d + S)K). Hence we do not follow the decomposition of Jin et al. (2020a) and instead bound q^s_k(s, a|s_m) directly.

5.2. REGRET LOWER BOUND

In this subsection, we provide a regret lower bound for learning adversarial linear mixture MDPs with bandit feedback and unknown transition: for any algorithm, there exists an adversarial linear mixture MDP satisfying Definition 1 on which the algorithm suffers regret of order Ω(dH √K + √(HSAK)).

Proof sketch. At a high level, we construct an MDP instance that simultaneously makes the learner suffer regret from the unknown transition and from the adversarial losses with bandit feedback. Specifically, we divide an episode into two phases, where the first phase includes the first H/2 + 1 layers and the second phase includes the last H/2 + 1 layers (layer H/2 belongs to both phases). In the first phase, due to the unknown linear mixture transition functions, learning can be translated into simultaneously learning H/4 d-dimensional stochastic linear bandit problems, yielding a lower bound of order Ω(dH √K). In the second phase, due to the adversarial losses with bandit feedback, learning can be regarded as a combinatorial multi-armed bandit (CMAB) problem with semi-bandit feedback, whose lower bound is Ω(√(HSAK)). The proof is concluded by combining the bounds of the two phases. Please see Appendix D for the formal proof.

Remark 3. The regret upper bound in Theorem 1 matches the lower bound in d, K, and A up to logarithmic factors but loses a factor of S^2/H. The dependence of the regret lower bound on S and A is inevitable since only bandit feedback about the adversarial losses is revealed to the learner and the loss function is unstructured.

6. CONCLUSIONS

In this work, we consider learning adversarial linear mixture MDPs with unknown transition and bandit feedback. We propose the first provably efficient algorithm in this setting, LSUOB-REPS, and prove that with high probability its regret is upper bounded by $O(dS^2\sqrt{K} + \sqrt{HSAK})$, which only loses an extra $S^2/H$ factor compared with our proposed lower bound. To achieve this result, we propose a novel occupancy measure difference lemma for linear mixture MDPs by leveraging the transition information of the imaginary next state as the regression target, which might be of independent interest. One natural open problem is how to close the gap between the existing upper and lower bounds. Besides, our lower bound suggests that it is not possible to eliminate the dependence of the regret bound on $S$ and $A$ without any structural assumptions on the loss function. Generalizing the definition of linear mixture MDPs by further incorporating a structural assumption on the loss function (e.g., that the loss function is linear in another unknown parameter) to eliminate the dependence on $S$ and $A$ also seems like an interesting future direction. We leave these extensions as future work.



(2020); He et al. (2022) study learning episodic adversarial linear mixture MDPs with unknown transition but under full-information feedback, and Neu & Olkhovskaya (2021) study learning episodic adversarial linear MDPs under bandit feedback but with known transition. In the more challenging setting with both unknown transition and bandit feedback, Luo et al. (2021b) make the first step to establish a sublinear regret guarantee $O(K^{6/7})$ in adversarial linear MDPs under the assumption that there exists an exploratory policy, and Luo et al. (2021a) (an improved version of Luo et al. (2021b)) obtain a regret guarantee of $O(K^{14/15})$ in the same setting but without access to an exploratory policy. Therefore, a natural question remains open: Does there exist a provably efficient algorithm with an $O(\sqrt{K})$ regret guarantee for RL with linear function approximation under unknown transition, adversarial losses, and bandit feedback?

propose the O-REPS algorithm, which achieves (near) optimal regret $O(H\sqrt{K})$ with full-information feedback and $O(\sqrt{HSAK})$ with bandit feedback, respectively. With unknown transition and full-information feedback, Rosenberg & Mansour (2019b) propose the UC-O-REPS algorithm and achieve an $O(HS\sqrt{AK})$ regret guarantee. When the transition is unknown and only bandit feedback is available, Rosenberg & Mansour (2019a) propose the bounded-bandit UC-O-REPS algorithm and achieve an $O(HS\sqrt{AK}/\alpha)$ regret bound under the assumption that all states are reachable with probability $\alpha$. Without this assumption, Rosenberg & Mansour (2019a) only achieve an $O(H^{3/2}SA^{1/4}K^{3/4})$ regret bound. Under the same setting but without the strong assumption of Rosenberg & Mansour (2019a), Jin et al. (2020a) develop the UOB-REPS algorithm, which uses a tight confidence set for the transition function and a new biased loss estimator, and achieves $O(HS\sqrt{AK})$ regret.
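The "biased loss estimator" mentioned above can be illustrated with a minimal sketch in the spirit of the implicit-exploration estimators used by UOB-REPS; the function name and the toy parameters below are our own illustrative choices, not the paper's exact construction.

```python
import numpy as np

def biased_loss_estimator(loss_observed, visited, u, gamma):
    """hat_ell(s,a) = ell(s,a) * 1{(s,a) visited} / (u(s,a) + gamma).

    Dividing by an upper occupancy bound u plus a bias parameter gamma
    (instead of the unknown true visitation probability) keeps the
    estimator computable and biased in a controlled, optimistic direction.
    """
    return loss_observed * visited / (u + gamma)

# Toy check for a single (s, a): when u equals the true visitation
# probability q and gamma = 0, the estimator is unbiased: E[hat_ell] = ell.
rng = np.random.default_rng(1)
q, ell, gamma = 0.5, 0.8, 0.0
samples = [biased_loss_estimator(ell, rng.random() < q, q, gamma)
           for _ in range(200000)]
est = float(np.mean(samples))
assert abs(est - ell) < 0.02
```

With $\gamma > 0$ or $u > q$, the estimator underestimates the loss in expectation, which is exactly the controlled optimism that the high-probability analysis exploits.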

Ignoring logarithmic factors, LSUOB-REPS attains an $O(dS^2\sqrt{K} + \sqrt{HSAK})$ regret guarantee when $K \ge H$. Compared with the regret bound $O(dH^{3/2}\sqrt{K})$ of He et al. (2022) in the full-information setting, our bound additionally depends on $S$ and $A$, which, by our lower bound, is inevitable under bandit feedback.

Theorem 2. Suppose $A(H/2-1) \ge S-2-3H/4$, $(S-2-3H/4)A \ge 2(H/2-1)$, $S \ge 4+3H/2$, $2K \ge d$, $B \ge d/\sqrt{48K}$, and $H \ge 8$. Further assume that $H/4$ and $\frac{S-2-3H/4}{H/2-1}$ are integers. Then for any algorithm, there exists an inhomogeneous, episodic, $B$-bounded adversarial linear mixture MDP $M = (\mathcal{S}, \mathcal{A}, H, \{P_h\}_{h=0}^{H-1}, \{\ell_k\}_{k=1}^{K})$ satisfying Definition 1, such that the expected regret on this MDP is lower bounded by $\Omega(dH\sqrt{K} + \sqrt{HSAK})$.

establish the POMD algorithm and attain an $O(\sqrt{S^2AH^4}\,K^{2/3})$ regret bound in the unknown transition and bandit feedback setting, which is further improved to $O(H^2S\sqrt{AK} + H^4)$ by Luo et al. (2021b) in the same setting.

Remarkably, He et al. (2022) achieve the (near) optimal $O(dH^{3/2}\sqrt{K})$ regret bound in adversarial linear mixture MDPs in the unknown transition but full-information feedback setting. With bandit feedback but known transition, Neu & Olkhovskaya (2021) obtain an $O(\sqrt{dHK})$ regret guarantee in linear MDPs by using an occupancy-measure-based algorithm called Q-REPS. Luo et al. (2021a) make the first step to establish a sublinear regret guarantee $O(d^2H^4K^{14/15})$ in adversarial linear MDPs with unknown transition and bandit feedback.

ACKNOWLEDGMENTS

The corresponding author Shuai Li is supported by National Natural Science Foundation of China No. 62006151 and Shanghai Sailing Program. Baoxiang Wang is partially supported by National Natural Science Foundation of China (62106213, 72150002) and Shenzhen Science and Technology Program (RCBS20210609104356063, JCYJ20210324120011032).

