ON THE INTERPLAY BETWEEN MISSPECIFICATION AND SUB-OPTIMALITY GAP: FROM LINEAR CONTEXTUAL BANDITS TO LINEAR MDPS

Abstract

We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class up to a bounded misspecification level ζ > 0. We propose an algorithm based on a novel data selection scheme, which only selects the contextual vectors with large uncertainty for online regression. We show that, when the misspecification level ζ is dominated by O(∆/√d), with ∆ being the minimal sub-optimality gap and d being the dimension of the contextual vectors, our algorithm enjoys the same gap-dependent regret bound O(d²/∆) as in the well-specified setting, up to logarithmic factors. Together with a lower bound adapted from Du et al. (2019) and Lattimore et al. (2020), our result suggests an interplay between the misspecification level and the sub-optimality gap: (1) the linear contextual bandit model is efficiently learnable when ζ ≤ O(∆/√d); and (2) it is not efficiently learnable when ζ ≥ Ω(∆/√d). We also extend our algorithm to reinforcement learning with linear Markov decision processes (linear MDPs) and obtain a parallel gap-dependent regret result. Experiments on both synthetic and real-world datasets corroborate our theoretical findings.

1. INTRODUCTION

Linear contextual bandits (Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013) have been extensively studied when the reward function can be represented as a linear function of the contextual vectors. However, such a well-specified linear model assumption sometimes does not hold in practice. This motivates the study of misspecified linear models, where we only assume that the reward function can be approximated by a linear function up to some worst-case error ζ, called the misspecification level. Existing algorithms for misspecified linear contextual bandits (Lattimore et al., 2020; Foster et al., 2020) can only achieve an O(d√K + ζK√d log K) regret bound, where K is the total number of rounds and d is the dimension of the contextual vectors. Such a bound suggests that the performance of these algorithms degenerates to be linear in K when K is sufficiently large. The reason for this degeneration is that existing algorithms, such as OFUL (Abbasi-Yadkori et al., 2011) and linear Thompson sampling (Agrawal & Goyal, 2013), utilize all the collected data without selection, which makes them vulnerable to "outliers" caused by the misspecified model. Meanwhile, the aforementioned results do not consider the sub-optimality gap in expected reward between the best arm and the second best arm. Intuitively, if the sub-optimality gap is smaller than the misspecification level, there is no hope of obtaining a sublinear regret. Therefore, it is sensible to take the sub-optimality gap into account in the misspecified setting and pursue a gap-dependent regret bound.

The same misspecification issue also appears in reinforcement learning with linear function approximation, when a linear function cannot exactly represent the transition kernel or value function of the underlying MDP. In this case, Du et al.
(2019) provided a negative result showing that if the misspecification level is larger than a certain threshold, any RL algorithm will suffer from an exponentially large sample complexity. This result was later revisited in the stochastic linear bandit setting by Lattimore et al. (2020), which shows that a large misspecification error makes the bandit model not efficiently learnable. However, these results cannot well explain the tremendous success of deep reinforcement learning on various tasks (Mnih et al., 2013; Schulman et al., 2015; 2017), where deep neural networks are used as function approximators with misspecification error.

In this paper, we aim to understand the role of model misspecification in linear contextual bandits through the lens of the sub-optimality gap. By proposing a new algorithm with data selection, we achieve a constant regret bound for this problem. We also extend our algorithm to linear Markov decision processes (Jin et al., 2020) and obtain a regret bound of a similar flavor. Our contributions are highlighted as follows:

• We propose a new algorithm called DS-OFUL (Data Selection OFUL), which only learns from the data with large uncertainty. We prove an O(d²∆⁻¹) gap-dependent regret bound when the misspecification level is small (i.e., ζ = O(∆/√d)) and the minimal sub-optimality gap ∆ is known. Our regret bound improves upon the gap-dependent regret in the well-specified setting (Abbasi-Yadkori et al., 2011) by a logarithmic factor. To the best of our knowledge, this is the first constant gap-dependent regret bound for misspecified linear contextual bandits, even assuming a known minimal sub-optimality gap.

• We also prove a gap-dependent lower bound following the lower bound proof technique in Du et al. (2019) and Lattimore et al. (2020).
This, together with the upper bound, suggests an interplay between the misspecification level and the sub-optimality gap: the linear contextual bandit is efficiently learnable if ζ ≤ O(∆/√d), while it is not efficiently learnable if ζ ≥ Ω(∆/√d).

• We extend the same idea to the misspecified linear MDP and propose an algorithm called DS-LSVI (Data Selection LSVI). DS-LSVI enjoys a logarithmic gap-dependent regret bound O(H⁵d³∆⁻¹ log(K)), which suggests a similar interplay between the misspecification level and the sub-optimality gap in episodic MDPs.

• Finally, we conduct experiments on linear contextual bandits with both synthetic and real datasets and demonstrate the superior performance of the DS-OFUL algorithm, which corroborates our theoretical results.

Notation. Scalars and constants are denoted by lower and upper case letters, respectively. Vectors are denoted by lower case bold face letters x, and matrices by upper case bold face letters A. We denote by [k] the set {1, 2, ..., k} for a positive integer k. For two non-negative sequences {a_n}, {b_n}, a_n = O(b_n) means that there exists a positive constant C such that a_n ≤ C b_n, and we use Õ(·) to hide logarithmic factors in O(·) other than the number of rounds T or the number of episodes K; a_n = Ω(b_n) means that there exists a positive constant C such that a_n ≥ C b_n, and we use Ω̃(·) to hide logarithmic factors. For a vector x ∈ R^d and a positive semi-definite matrix A ∈ R^{d×d}, we define ∥x∥²_A = x^⊤Ax. For any set C, we use |C| to denote its cardinality.

2. RELATED WORK

In this section, we review related work on misspecified linear bandits and misspecified reinforcement learning. We defer additional related work on function approximation in bandits and RL to Appendix A.

Misspecified Linear Bandits. Ghosh et al. (2017) is probably the first work considering misspecified linear bandits; it shows that the OFUL algorithm (Abbasi-Yadkori et al., 2011) cannot achieve a sublinear regret in the presence of misspecification. They therefore proposed a new algorithm with a hypothesis testing module for linearity to determine whether to use OFUL (Abbasi-Yadkori et al., 2011) or the multi-armed UCB algorithm. Their algorithm enjoys the same performance guarantee as OFUL in the well-specified setting and can avoid linear regret under certain misspecified settings. Lattimore et al. (2020) proved an exponential (in d) lower bound on the number of arm pulls needed to find a ∆-optimal arm, even with knowledge of ζ. Note that although the exponential sample complexity lower bound for best arm identification can be translated into a regret lower bound in linear contextual bandits, the algorithms for best-arm identification and the corresponding upper bounds cannot be easily extended to linear contextual bandits. Besides these works on misspecification, He et al. (2022) studies linear contextual bandits with adversarial corruptions, which can be considered a setting similar to misspecification. They assume the summation of the approximation error over all K rounds is bounded by the corruption level C, and they proposed an algorithm achieving an O(d√K + dC) regret bound. However, their result for adversarially corrupted bandits cannot be directly translated to the misspecified setting, since letting C = Kζ leads to a linear O(d√K + dKζ) regret. Beyond this line of work, Camilleri et al. (2021) also studies the robustness of kernel bandit algorithms to misspecification.

Misspecified Reinforcement Learning. Du et al.
(2019) showed that having a good representation is insufficient for efficient reinforcement learning unless the approximation error (i.e., the misspecification level) of the representation is small enough. In particular, Du et al. (2019) suggested that an Ω(√(H/d)) misspecification leads to an Ω(2^H) sample complexity for RL to find the optimal policy, even with a generative model. On the other hand, a series of works (Jin et al., 2020; Zanette et al., 2020b; Foster & Rakhlin, 2020) provided O(√T + ζT)-type regret bounds for RL in various settings, ignoring the dependence on the dimension d of the feature representation and the planning horizon H. This suggests that the performance of RL degenerates as the total number of interactions T with the environment increases. Also, these results do not consider the minimal sub-optimality gap of the action-value function.

3. PRELIMINARIES OF LINEAR CONTEXTUAL BANDITS

We consider a linear contextual bandit problem. In round k ∈ [K], the agent receives a decision set D_k ⊂ R^d, selects an arm x_k ∈ D_k, and then observes the reward r_k = r(x_k) + ε_k, where r(·) : R^d → [0, 1] is a deterministic expected reward function and ε_k is a zero-mean R-sub-Gaussian random noise, i.e., E[exp(λε_k) | x_{1:k}, ε_{1:k-1}] ≤ exp(λ²R²/2) for all k ∈ [K] and λ ∈ R. In this work, we assume that every contextual vector x ∈ D_k satisfies ∥x∥₂ ≤ L and that the reward function r(·) : R^d → [0, 1] can be approximated by a linear function, r(x) = x^⊤θ* + η(x), where η(·) : R^d → [−ζ, ζ] is an unknown misspecification error function. We further assume ∥θ*∥₂ ≤ B and, for simplicity, B, L ≥ 1. We denote the optimal reward at round k by r*_k = max_{x∈D_k} r(x) and the optimal arm by x*_k = argmax_{x∈D_k} r(x). Our goal is to minimize the regret defined by Regret(K) := Σ_{k=1}^K (r*_k − r(x_k)).

In this paper, we focus on the minimal sub-optimality gap condition.

Definition 3.1 (Minimal sub-optimality gap). For each x ∈ D_k, the sub-optimality gap ∆_k(x) is defined by ∆_k(x) := r*_k − r(x), and the minimal sub-optimality gap ∆ is defined by ∆ := min_{k∈[K], x∈D_k} {∆_k(x) : ∆_k(x) > 0}. We further assume that this minimal sub-optimality gap is strictly positive, i.e., ∆ > 0.
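To make Definition 3.1 concrete, the minimal sub-optimality gap can be computed directly from the (expected) rewards of the arms in each decision set. The following is a minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def min_suboptimality_gap(rewards_per_round):
    """Delta = min over rounds k and arms x of (r*_k - r(x)),
    restricted to strictly positive gaps (Definition 3.1)."""
    gaps = []
    for rewards in rewards_per_round:
        rewards = np.asarray(rewards, dtype=float)
        g = rewards.max() - rewards          # gap of every arm in this round
        gaps.extend(g[g > 0])                # keep only positive gaps
    return min(gaps) if gaps else 0.0
```

For example, with two rounds of expected rewards [1.0, 0.7, 0.2] and [0.9, 0.9, 0.5], the positive gaps are {0.3, 0.8, 0.4}, so ∆ = 0.3.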

4. PROPOSED ALGORITHM

In this section, we propose our algorithm, DS-OFUL, in Algorithm 1. The algorithm runs for K rounds. In each round, it first estimates the underlying parameter θ* by solving the following ridge regression problem in Line 3:

θ_k = argmin_θ Σ_{i∈C_{k-1}} (r_i − x_i^⊤θ)² + λ∥θ∥₂²,

where C_{k-1} is the index set of the contextual vectors selected for regression, initialized as an empty set. After receiving the decision set D_k, the algorithm selects an arm via the optimistic estimate powered by the Upper Confidence Bound (UCB) bonus in Line 4. In Line 5, the algorithm adds the index of the current round to C_k if the UCB bonus of the chosen arm x_k, denoted by ∥x_k∥_{U_k^{-1}}, is greater than the threshold Γ. Intuitively, since the UCB bonus reflects the uncertainty of the model about a given arm x, Line 5 discards the data that brings little uncertainty (∥x∥_{U_k^{-1}}) to the model. Finally, we denote the total number of data points selected in Line 5 by |C_K|. We specify the choices of the parameters Γ, β, and λ in the next section.

Algorithm 1 Data Selection OFUL (DS-OFUL)
Input: threshold Γ, radius β, and regularizer λ
1: Initialize C_0 = ∅, U_0 = λI, θ_0 = 0
2: for k = 1, . . . , K do
3:   Set U_k = λI + Σ_{i∈C_{k-1}} x_i x_i^⊤, θ_k = U_k^{-1} Σ_{i∈C_{k-1}} r_i x_i
4:   Receive decision set D_k, select x_k = argmax_{x∈D_k} x^⊤θ_k + β∥x∥_{U_k^{-1}}, receive reward r_k
5:   if ∥x_k∥_{U_k^{-1}} ≥ Γ then C_k = C_{k-1} ∪ {k} else C_k = C_{k-1}
6: end for
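Algorithm 1 can be implemented in a few lines of NumPy. The following is a minimal sketch, not the authors' code: the function signature, the reward-function interface, and the matrix-inversion strategy (re-inverting U only when a new point is selected) are our own choices.

```python
import numpy as np

def ds_oful(decision_sets, reward_fn, K, d, Gamma, beta, lam, rng):
    """Sketch of DS-OFUL (Algorithm 1).
    decision_sets: list of K arrays of shape (num_arms, d).
    reward_fn(x, rng) returns a noisy reward for arm x."""
    U = lam * np.eye(d)            # U_k = lam*I + sum_{i in C_{k-1}} x_i x_i^T
    b = np.zeros(d)                # sum_{i in C_{k-1}} r_i x_i
    U_inv = np.linalg.inv(U)
    selected = []                  # index set C_k
    chosen = []
    for k in range(K):
        theta = U_inv @ b          # ridge-regression estimate (Line 3)
        D = np.asarray(decision_sets[k])
        # UCB score: x^T theta + beta * ||x||_{U^{-1}}  (Line 4)
        bonus = np.sqrt(np.einsum('ij,jl,il->i', D, U_inv, D))
        x = D[np.argmax(D @ theta + beta * bonus)]
        r = reward_fn(x, rng)
        # data-selection rule (Line 5): keep only high-uncertainty points
        if np.sqrt(x @ U_inv @ x) >= Gamma:
            selected.append(k)
            U += np.outer(x, x)
            b += r * x
            U_inv = np.linalg.inv(U)
        chosen.append(x)
    return np.array(chosen), selected
```

Note that U and b are only updated on selected rounds, so the per-round cost of non-selected rounds is a single matrix-vector product plus the bonus computation.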

5. REGRET ANALYSIS

In this section, we provide the regret upper bound of Algorithm 1 and a regret lower bound for learning misspecified linear contextual bandits.

Theorem 5.1 (Upper Bound). For any 0 < δ < 1, let λ = B⁻² and Γ = ∆/(2√d ι₁), where ι₁ = (24 + 18R) log((72 + 54R)LB√d ∆⁻¹) + 8R² log(1/δ). Set β = 1 + 4√(d ι₂) + R√(2d ι₃), where ι₂ = log(3LBΓ⁻¹) and ι₃ = log((1 + 16L²B²Γ⁻²ι₂)/δ). If the misspecification level satisfies 2√d ζ ι₁ ≤ ∆, then with probability at least 1 − δ, the cumulative regret of Algorithm 1 is bounded by

Regret(K) ≤ 32β√(2d³ι₂ log(1 + 16dΓ⁻²ι₂)) ι₁/∆.

Remark 5.2. Since β = O(√d), Theorem 5.1 suggests an O(d²∆⁻¹) constant regret bound, independent of the total number of rounds K, when ζ ≤ O(∆/√d). In other words, if the misspecification level is reasonably small, the logarithmic regret O(d²∆⁻¹ log(K)) of Abbasi-Yadkori et al. (2011) is improved to a constant regret. Note that our constant regret bound relies on knowledge of the minimal sub-optimality gap ∆, while the OFUL algorithm in Abbasi-Yadkori et al. (2011) does not need such prior knowledge.

Remark 5.3. Our high-probability constant regret bound does not violate the lower bound proved in Hao et al. (2020), which says that a certain diversity condition on the contexts is necessary to achieve an expected constant regret bound (Papini et al., 2021); in contrast, we only provide a high-probability constant regret bound. When extending this high-probability bound to an expected regret bound, we have E[Regret(K)] ≤ O(d²∆⁻¹ log(1/δ))(1 − δ) + δK, which depends on K. To obtain a sub-linear expected regret, we can set δ = 1/K, which yields a logarithmic regret O(d²∆⁻¹ log(K)) and does not violate the lower bound in Hao et al. (2020). Furthermore, following a similar idea to Lattimore et al.
(2020), we can prove a gap-dependent lower bound for misspecified stochastic linear bandits. Note that the stochastic linear bandit can be seen as a special case of linear contextual bandits with a fixed decision set D_k = D across all rounds k ∈ [K]. A similar result and proof can be found in Du et al. (2019) for episodic reinforcement learning.

Theorem 5.4 (Lower Bound). Given the dimension d and the number of arms |D|, for any ∆ ≤ 1 and ζ ≥ 3∆√(8 log(|D|)/(d − 1)), there exists a set of stochastic linear bandit problems Θ with minimal sub-optimality gap ∆ and misspecification level ζ such that, for any algorithm with a sublinear expected regret bound for all θ ∈ Θ, i.e., E[Regret_θ(K)] ≤ CK^α with C > 0 and 0 ≤ α < 1:

• When K ≤ O(|D|), the expected regret is lower bounded by E_{θ∼Unif(Θ)}[Regret_θ(K)] ≥ K∆.

• When K ≥ Ω(|D|), the expected regret is lower bounded by sup_{θ∈Θ} E[Regret_θ(K)] ≥ Ω(|D| log(K)∆⁻¹).

Remark 5.5. Theorem 5.4 covers two regimes under the case ζ ≥ Ω(∆/√d). In the first regime, K ≤ O(|D|), where the decision set is large (e.g., |D| = d^100), any algorithm suffers a linear regret Ω(∆K), which suggests that this regime is not efficiently learnable. In the second regime, K ≥ Ω(|D|), Theorem 5.4 gives an Ω(|D|∆⁻¹ log(K)) regret lower bound, which is matched by the multi-armed bandit algorithm with an upper bound of O(|D|∆⁻¹ log(K)) (Lattimore & Szepesvári, 2020). Therefore, in this easier regime, linear function approximation cannot provide any performance improvement, and one can simply adopt a multi-armed bandit algorithm to learn the bandit model.

Remark 5.6. Theorems 5.1 and 5.4 provide a holistic picture of the role of misspecification in linear contextual bandits. Here we focus on the more difficult regime K ≤ |D|.
In the regime K ≤ |D|, when ζ ≤ O(∆/√d), Theorem 5.1 shows that the bandit problem is efficiently learnable, and our algorithm DS-OFUL achieves a constant regret, which improves upon the logarithmic regret bound in the well-specified setting (Abbasi-Yadkori et al., 2011). On the other hand, when ζ ≥ Ω(∆/√d), Theorem 5.4 provides a linear regret lower bound, suggesting that the bandit model cannot be efficiently learned.

6. PROOF SKETCH OF THEOREM 5.1

In this section, we give an overview of the main technical difficulty and of our proof technique for Theorem 5.1. The detailed proof is deferred to Appendix C.

First, we aim to control the number of rounds in the index set C_K. Since we only select the data with ∥x_k∥_{U_k^{-1}} ≥ Γ for ridge regression, we can lower bound the summation of the selected UCB terms as Σ_{k∈C_K} ∥x_k∥_{U_k^{-1}} ≥ |C_K|Γ. On the other hand, since U_k = λI + Σ_{i∈C_{k-1}} x_i x_i^⊤, we can upper bound the summation of the UCB terms by the elliptical potential lemma from Abbasi-Yadkori et al. (2011) as Σ_{k∈C_K} ∥x_k∥_{U_k^{-1}} ≤ O(√(d|C_K|)). Combining the upper and lower bounds yields Γ|C_K| ≤ O(√(d|C_K|)), which implies |C_K| ≤ O(dΓ⁻²), a quantity independent of the total number of rounds K.
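This counting argument is easy to check empirically: feed random unit vectors through the selection rule and count how many ever pass the threshold. The following is a sketch under our own toy parameters (the helper name is ours); by the elliptical potential argument, the count stays bounded no matter how many candidates arrive.

```python
import numpy as np

def max_selected(d, Gamma, L=1.0, lam=1.0, trials=2000, seed=0):
    """Count how many vectors with ||x||_{U^{-1}} >= Gamma are ever added
    to the Gram matrix U, over `trials` random candidates of norm L."""
    rng = np.random.default_rng(seed)
    U = lam * np.eye(d)
    count = 0
    for _ in range(trials):
        x = rng.normal(size=d)
        x *= L / np.linalg.norm(x)                   # enforce ||x||_2 = L
        if np.sqrt(x @ np.linalg.solve(U, x)) >= Gamma:
            U += np.outer(x, x)                      # selected: update U
            count += 1
    return count
```

With, say, d = 8 and Γ = 0.3, the selected count saturates at a few hundred, far below the 2000 candidates and below the O(dΓ⁻²)-type bound of Lemma C.1, while Γ = 0 would select everything.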

Second, we control the fluctuation of the regression with misspecification error by |x^⊤(θ_k − θ*)| ≤ O(R√d + ζ√|C_K|)∥x∥_{U_k^{-1}}. Compared with the original bound |x^⊤(θ_k − θ*)| ≤ O(R√d + ζ√(dK))∥x∥_{U_k^{-1}} in Jin et al. (2020), our confidence radius O(R√d + ζ√|C_K|) does not grow with the total number of rounds K. In fact, directly using the result in Jin et al. (2020) and following the proof outline in Abbasi-Yadkori et al. (2011) leads to the regret bound

Regret(K) ≤ O((R√d + ζ√(dK)) Σ_{k=1}^K ∥x_k∥_{U_k^{-1}}) ≤ O(Rd√K + ζdK),

which is linear in K. In contrast, with the data selection rule in our work, the regression set C_K is finite, and we avoid the linear regret incurred by feeding all data into the regression. In addition, our result is a factor of √d tighter than Jin et al. (2020), by using the result provided in Zanette et al. (2020c).

Based on these two key observations, we avoid the linear regret bound by partitioning the K rounds into two sets. The first set contains all non-selected rounds, i.e., [K] \ C_K. For these rounds, the uncertainty satisfies ∥x_k∥_{U_k^{-1}} < Γ, and we can show that when ζ ≤ O(∆/√d), the instantaneous regret of a non-selected round is bounded by r*_k − r(x_k) ≤ 2ζ + 2O(R√d + ζ√(d|C_K|))Γ < ∆, which implies that the arms chosen in non-selected rounds are optimal and incur no regret. For the rounds in the finite set C_K, we follow the gap-dependent regret bound of Abbasi-Yadkori et al. (2011) to show that Σ_{k∈C_K} (r*_k − r(x_k)) ≤ O(d²∆⁻¹ log(|C_K|)). As a result, partitioning [K] into the two subsets [K] \ C_K and C_K, we obtain the claimed cumulative regret bound:

Regret(K) = Σ_{k∈[K]\C_K} Reg(k) + Σ_{k∈C_K} Reg(k) ≤ 0 + O(d² log(|C_K|)/∆),

where Reg(k) denotes the instantaneous regret in round k (i.e., r*_k − r(x_k)).

7.1. PRELIMINARIES OF LINEAR MDPS

We consider the episodic Markov decision process (MDP). Each episodic MDP is defined by a tuple M(S, A, H, {r_h}_{h=1}^H, {P_h}_{h=1}^H), where S is the state space, A is the action space, H is the length of each episode, r_h : S × A → [0, 1] is the reward function at stage h, and P_h is the transition kernel, with P_h(s′|s, a) denoting the probability of transitioning from state s to s′ under action a at stage h. At the beginning of each episode, the agent determines a policy π := {π_h}_{h=1}^H, where π_h : S → A. Then, from stage h = 1 to h = H, the agent repeatedly receives the state s_h, takes the action a_h = π_h(s_h), receives the reward r_h(s_h, a_h), and transitions to the next state s_{h+1}.

For any policy π, the value function and the Q-function at stage h are defined by V^π_h(s) = E[Σ_{h′=h}^H r_{h′}(s_{h′}, π_{h′}(s_{h′})) | s_h = s] and Q^π_h(s, a) = r_h(s, a) + E[V^π_{h+1}(s_{h+1}) | s_h = s, a_h = a]. Since r_h(s, a) ∈ [0, 1], for every policy π and all h ∈ [H], s ∈ S, a ∈ A, the value function and the Q-function are bounded by 0 ≤ V^π_h(s) ≤ H and 0 ≤ Q^π_h(s, a) ≤ H. We further define the optimal value function and the optimal Q-function as V*_h(s) = max_π V^π_h(s) and Q*_h(s, a) = max_π Q^π_h(s, a). For simplicity, we denote [P_h V](s, a) = E_{s′∼P_h(·|s,a)}[V(s′)], so the Bellman equation and the Bellman optimality equation read

Q^π_h(s, a) = r_h(s, a) + [P_h V^π_{h+1}](s, a),  Q*_h(s, a) = r_h(s, a) + [P_h V*_{h+1}](s, a).  (7.1)

We consider the ζ-approximate linear MDP setting (Jin et al., 2020).

Definition 7.1 (ζ-approximate linear MDP, Jin et al. 2020). Given a known feature map ϕ : S × A → R^d, an MDP is a ζ-approximate linear MDP if for each h ∈ [H] there exist d unknown (signed) measures µ_h = (µ_h^(1), ..., µ_h^(d)) over S and an unknown vector θ*_h ∈ R^d such that for any (s, a) ∈ S × A,

∥P_h(·|s, a) − ⟨ϕ(s, a), µ_h(·)⟩∥_TV ≤ ζ,  |r_h(s, a) − ⟨ϕ(s, a), θ*_h⟩| ≤ ζ.

W.l.o.g., we assume ∥ϕ(s, a)∥₂ ≤ 1 for all (s, a) ∈ S × A and max{∥µ_h(S)∥₂, ∥θ*_h∥₂} ≤ √d for all h ∈ [H]. Under Definition 7.1, it is easy to show that the Q-function under any policy π is close to a linear function of the feature map ϕ.
Lemma 7.2 (Lemmas C.1 and C.2, Jin et al. 2020). For a ζ-approximate linear MDP and any policy π, there exists a corresponding {w^π_h}_h such that for any (s, a, h) ∈ S × A × [H]: |Q^π_h(s, a) − ⟨ϕ(s, a), w^π_h⟩| ≤ 2Hζ and ∥w^π_h∥₂ ≤ 2H√d.

We are concerned with minimizing the cumulative regret defined by Regret(K) = Σ_{k=1}^K (V*_1(s^k_1) − V^{π_k}_1(s^k_1)), where π_k is the policy used in the k-th episode. Similar to linear contextual bandits, we introduce the minimal sub-optimality gap ∆, originally defined in He et al. (2021a).

Definition 7.3 (Minimal sub-optimality gap, He et al. 2021a). For each (s, a, h) ∈ S × A × [H], the sub-optimality gap ∆_h(s, a) is defined as ∆_h(s, a) := V*_h(s) − Q*_h(s, a), and the minimal sub-optimality gap is defined as ∆ = min_{h,s,a} {∆_h(s, a) : ∆_h(s, a) > 0}. We further assume that this minimal sub-optimality gap is strictly positive, i.e., ∆ > 0.
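For a small tabular MDP, the Bellman optimality equation (7.1) can be solved exactly by backward induction, which also makes the quantities in Definition 7.3 computable. A self-contained sketch (the toy indexing convention, 0-based stages, is ours):

```python
import numpy as np

def optimal_values(P, R, H):
    """Finite-horizon Bellman optimality backup, cf. (7.1).
    P: (H, S, A, S) transition probabilities, R: (H, S, A) rewards.
    Returns Q*[h] of shape (S, A) and V*[h] of shape (S,) for h = 0..H-1."""
    _, S, A, _ = P.shape
    V = np.zeros((H + 1, S))               # boundary condition V*_{H+1} = 0
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q[h] = R[h] + P[h] @ V[h + 1]      # Q*_h = r_h + [P_h V*_{h+1}]
        V[h] = Q[h].max(axis=1)            # V*_h(s) = max_a Q*_h(s, a)
    return Q, V[:H]
```

Given Q* and V*, the sub-optimality gaps are simply V[h][:, None] - Q[h], and ∆ is the smallest strictly positive entry over all stages.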

7.2. PROPOSED ALGORITHM

We propose our algorithm, DS-LSVI, for misspecified linear MDPs in Algorithm 2. It applies the idea of DS-OFUL to LSVI-UCB (Jin et al., 2020). For simplicity, we write ϕ^k_h = ϕ(s^k_h, a^k_h) and r^k_h = r_h(s^k_h, a^k_h) when there is no confusion. The algorithm runs for K episodes. In the k-th episode, the algorithm estimates the optimal Q-function by a linear function, as justified by Lemma 7.2. In detail, at each stage h, after acquiring the estimated value function V^k_{h+1}(·) at stage h + 1, the algorithm solves the following ridge regression problem in Line 4:

w^k_h = argmin_w ∥w∥₂² + Σ_{i∈C_{k-1}} (⟨ϕ^i_h, w⟩ − r^i_h − V^k_{h+1}(s^i_{h+1}))²,

where C_{k-1} contains the indices of the episodes selected for regression, and r^i_h + V^k_{h+1}(s^i_{h+1}) is the estimated Q-value given by the Bellman optimality equation (7.1). The algorithm then executes the greedy policy with respect to the estimated Q-function and observes the full episode. In Line 12, the algorithm adds episode k to the regression index set C_k if the data of the k-th episode provides enough uncertainty (i.e., ∥ϕ^k_h∥_{(U^k_h)^{-1}} ≥ Γ) at any stage h ∈ [H]. The intuition behind this selection is the same as that of Line 5 in Algorithm 1 for linear bandits: when an episode provides a data sample with large uncertainty at any stage, we add it to the regression; otherwise, the episode is ignored.

Algorithm 2 Data Selection LSVI (DS-LSVI)
Input: threshold Γ, radius β
1: Initialize C_0 = ∅, U^0_h = I, w^0_h = 0 for all h ∈ [H], V^k_{H+1}(s) = Q^k_{H+1}(s, a) = 0 for all (s, a)
2: for episode k = 1, . . . , K do
3:   for stage h = H, . . . , 1 do
4:     U^k_h = I + Σ_{i∈C_{k-1}} ϕ^i_h (ϕ^i_h)^⊤, w^k_h = (U^k_h)^{-1} Σ_{i∈C_{k-1}} ϕ^i_h (r^i_h + V^k_{h+1}(s^i_{h+1}))
5:     Q^k_h(·, ·) = ⟨ϕ(·, ·), w^k_h⟩ + β∥ϕ(·, ·)∥_{(U^k_h)^{-1}}
6:     V^k_h(·) = min{max_a Q^k_h(·, a), H}, π^k_h(·) = argmax_a Q^k_h(·, a)
7:   end for
8-11: [Execute π^k for one episode: receive s^k_1; for h = 1, . . . , H, take a^k_h = π^k_h(s^k_h), receive r^k_h and s^k_{h+1}.]
12:  C_k = C_{k-1} ∪ {k} if ∃h ∈ [H] : ∥ϕ^k_h∥_{(U^k_h)^{-1}} ≥ Γ, else C_k = C_{k-1}
13: end for

7.3. REGRET ANALYSIS

We provide the regret upper bound of Algorithm 2 for the ζ-approximate linear MDP. The proof is deferred to Appendix F.

Theorem 7.4 (Upper Bound). Let Γ = Θ(∆d⁻¹H⁻²) and β = O(Hd). If ζ = O(∆d^{-0.5}H^{-2.5}), then with probability at least 1 − δ, the cumulative regret of Algorithm 2 is bounded by Regret(K) ≤ O(H⁵d³∆⁻¹ log(K)).

Remark 7.5. Theorem 7.4 states that if the misspecification level ζ is upper bounded by O(∆d^{-0.5}H^{-2.5}), we can achieve the same logarithmic regret bound O(H⁵d³∆⁻¹ log(K)) as in the well-specified setting (He et al., 2021a). This indicates that a reasonably small misspecification will not deteriorate the performance. Our result improves the original regret bound O(√(d³H⁴K) + ζdH²K) for the misspecified linear MDP provided in Jin et al. (2020). A positive result of Du et al. (2019) under small misspecification relies on a generative model that can take multiple actions a for the same state s at the same time. In contrast, Theorem 7.4 suggests that an O(∆d^{-0.5}H^{-2.5}) misspecification level leads to a logarithmic regret without access to a generative model.

8. EXPERIMENTS

To verify the performance improvement brought by data selection with the UCB bonus in Algorithm 1, we conduct experiments for bandit tasks on both synthetic and real-world datasets, described in detail below. (When Γ = 0, our algorithm degrades to OFUL. In the figures and tables, LSW denotes the algorithm using Eq. (6) in Lattimore et al. (2020), and RLB denotes the Robust Linear Bandit algorithm of Ghosh et al. (2017).) We also carry out experiments for linear MDPs on a synthetic dataset in Appendix B.4.
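The episode-level selection rule of DS-LSVI (Line 12 of Algorithm 2) is the only place where the algorithm differs structurally from LSVI-UCB. A minimal sketch of that check, with our own function and argument names:

```python
import numpy as np

def select_episode(phi_traj, U_list, Gamma):
    """Line 12 of DS-LSVI: keep an episode for regression iff some stage h
    has bonus ||phi(s_h, a_h)||_{U_h^{-1}} >= Gamma.
    phi_traj: (H, d) array of the episode's features;
    U_list: list of H Gram matrices of shape (d, d)."""
    for phi, U in zip(phi_traj, U_list):
        if np.sqrt(phi @ np.linalg.solve(U, phi)) >= Gamma:
            return True      # one high-uncertainty stage suffices
    return False
```

Because the whole episode is kept or dropped at once, the regression index set C_k is shared across all stages h, matching Line 4 of Algorithm 2.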

8.1. SYNTHETIC DATASET

The synthetic dataset is composed as follows: we set d = 16 and generate a parameter θ* ∼ N(0, I_d) and contextual vectors {x_i}_{i=1}^N ∼ N(0, I_d) with N = 100. The generated parameter and vectors are then normalized so that ∥θ*∥₂ = ∥x_i∥₂ = 1. The reward function is r_i = ⟨θ*, x_i⟩ + η_i, where η_i ∼ Unif{−ζ, ζ}. The contextual vectors and reward function are fixed after generation. The random noise ε_t on the received rewards is sampled from the standard normal distribution. We set the misspecification level ζ = 0.02 and verified that the sub-optimality gap over the N contextual vectors is ∆ ≈ 0.18. We perform a grid search over β ∈ {1, 3, 10} and λ ∈ {1, 3, 10} and report the cumulative regret of Algorithm 1 with threshold Γ ∈ {0, 0.02, 0.05, 0.08, 0.2} over 32 independent trials with K = 2000 total rounds. When Γ = 0, our algorithm degrades to the standard OFUL algorithm (Abbasi-Yadkori et al., 2011), which uses the data from all rounds in the regression. The results are shown in Figure 1(b), and the average cumulative regret at the last round, with its variance over the 32 trials, is reported in Table 1. We can see that by setting Γ ≈ ∆/√d ≈ 0.18/√16 ≈ 0.05, Algorithm 1 achieves less cumulative regret than OFUL (Γ = 0). With a proper choice of Γ, the algorithm also converges to zero instantaneous regret faster than OFUL. It is also evident that a slightly larger Γ = 0.08 does not affect the performance, but a too-large Γ = 0.20 ≥ ∆ causes the algorithm to fail to learn the contextual vectors and induces a linear regret. Moreover, a larger Γ significantly speeds up the algorithm by reducing the number of regressions performed. We also compare with the algorithm (LSW) in Equation (6) of Lattimore et al. (2020) and the RLB algorithm of Ghosh et al. (2017) in Figure 1(b) and Table 1. For Lattimore et al.
(2020), the estimated reward is computed as r̂(x) = x^⊤θ_k + β∥x∥_{U_k^{-1}} + ζ Σ_{s=1}^k |x^⊤ U_k^{-1} x_s|. However, since the term Σ_{s=1}^k |x^⊤ U_k^{-1} x_s| is hard to update incrementally in k, this algorithm is less efficient than OFUL (Abbasi-Yadkori et al., 2011) and than our algorithm. For the RLB algorithm of Ghosh et al. (2017), we run the hypothesis test for k = 10 rounds and then decide whether to use OFUL or multi-armed UCB. The results show that both LSW and RLB achieve worse regret than OFUL, since ζ is relatively small in our setting.
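The synthetic instance described in this subsection can be generated with a short script like the following; it is a sketch (function name, seed handling, and the gap computation are ours), not the authors' code:

```python
import numpy as np

def make_synthetic(d=16, N=100, zeta=0.02, seed=0):
    """Section 8.1 instance: normalized theta* and N normalized arms,
    rewards r_i = <theta*, x_i> + eta_i with eta_i ~ Unif{-zeta, +zeta}."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)                       # ||theta*||_2 = 1
    X = rng.normal(size=(N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # ||x_i||_2 = 1
    eta = rng.choice([-zeta, zeta], size=N)              # misspecification
    r = X @ theta + eta                                  # fixed reward function
    gaps = r.max() - r
    Delta = gaps[gaps > 0].min()                         # minimal sub-opt gap
    return theta, X, r, Delta
```

Note that ∆ here is a property of the sampled instance, so its value (≈ 0.18 in the paper's run) varies with the random seed.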

8.2. REAL-WORLD DATASET

To demonstrate that the proposed algorithm can be easily applied to modern machine learning tasks, we carried out experiments on the Asirra dataset (Elson et al., 2007). The task of the agent is to distinguish images of cats from images of dogs. At each round k, the agent receives the feature vector ϕ_{1,k} ∈ R^512 of a cat image and another feature vector ϕ_{2,k} ∈ R^512 of a dog image. Both feature vectors are generated by a ResNet-18 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009), and we normalize ∥ϕ_{1,k}∥₂ = ∥ϕ_{2,k}∥₂ = 1. The agent is required to select the cat from these two vectors; it receives reward r_t = 1 if it selects the correct feature vector and r_t = 0 otherwise. It is trivial that the sub-optimality gap of this task is ∆ = 1. To better demonstrate the influence of misspecification on the performance of the algorithm, we only keep the data with |ϕ_i^⊤θ* − r_i| ≤ ζ, where r_i = 1 if the image is a cat and r_i = 0 otherwise, and θ* is a parameter pretrained on the whole dataset by linear regression, θ* = argmin_θ Σ_{i=1}^N (ϕ_i^⊤θ − r_i)², which is unknown to the agent. For hyper-parameter tuning, we perform a grid search over β ∈ {1, 0.3, 0.1} and λ ∈ {1, 3, 10} and repeat the experiments 8 times over 1M rounds for each parameter configuration. The results are shown in Figure 1.

The promising results suggest a few interesting directions for future research. For example, it remains unclear whether we can dispense with prior knowledge of the minimal sub-optimality gap and achieve similar regret guarantees. It would also be interesting to incorporate Lipschitz continuity or smoothness properties of the reward function to derive fine-grained results.

Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In ICML, 2020c.

Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping.
In International Conference on Machine Learning. PMLR, 2021.

A ADDITIONAL RELATED WORK

Linear Contextual Bandits. There is a large body of literature on linear contextual bandits, e.g., Auer (2002).

RL with Linear Function Approximation. To tackle RL tasks with large state spaces, a line of work on RL with linear function approximation has emerged in the past years. For example, the linear MDP (Yang & Wang, 2019; Jin et al., 2020) is probably one of the most widely studied models, in which both the transition kernel and the reward function are linear functions of a known feature mapping. Typical algorithms in this setting include LSVI-UCB (Jin et al., 2020) and randomized LSVI (Zanette et al., 2020a), both of which achieve a sublinear regret. Besides, linear mixture/kernel MDPs (Modi et al., 2020; Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2021) have emerged as a popular model for model-based RL with linear function approximation, in which the transition kernel is a mixture of feature mappings defined on the triplet of state, action, and next state. In this setting, nearly minimax optimal regret has been attained for both finite-horizon episodic MDPs and infinite-horizon discounted MDPs (Zhou et al., 2021). The aforementioned works focus on problem-independent regret bounds, while He et al. (2021a) provided gap-dependent regret bounds for both linear MDPs (i.e., O(d³H⁵∆⁻¹)) and linear mixture MDPs (i.e., O(d²H⁵∆⁻¹)).

B.1 EXPERIMENT CONFIGURATION

The experiment on the synthetic dataset is conducted on Google Colab with a 2-core Intel® Xeon® CPU @ 2.20GHz. The experiment on the real-world Asirra dataset (Elson et al., 2007)

To demonstrate how our algorithm deals with different levels of misspecification, we preprocess the data before feeding it to the agent. As described in Section 8.2, the amounts of data remaining at each expected misspecification level ζ are shown in Table 2. It can be verified that even at the smallest misspecification level, more than 10% of the data is selected.
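The preprocessing step described here and in Section 8.2 — fitting θ* on the full dataset by least squares and keeping only examples whose linear-model error is at most ζ — can be sketched as follows (the function name is ours):

```python
import numpy as np

def fit_and_filter(Phi, r, zeta):
    """Fit theta* = argmin_theta sum_i (phi_i^T theta - r_i)^2 on the whole
    dataset, then keep only examples with |phi_i^T theta* - r_i| <= zeta."""
    theta, *_ = np.linalg.lstsq(Phi, r, rcond=None)
    keep = np.abs(Phi @ theta - r) <= zeta
    return theta, keep
```

Larger ζ retains more (and more nonlinear) examples, which is how the different misspecification levels in Table 2 are realized.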

B.3 ADDITIONAL RESULT ON THE ASIRRA DATASET

As a sensitivity analysis, we vary the misspecification level in the preprocessing step for the Asirra dataset. The result is shown in Figure 2. It suggests that when the misspecification is small enough, setting Γ = ∆/√d delivers a reasonable result, which is aligned with the parameter setting in our theorem. Meanwhile, we found that when ζ = 0.5, which is strictly larger than the threshold ∆/√d, the algorithm cannot achieve performance similar to that of ζ < 0.1, regardless of the setting of the parameter Γ. This also verifies the theoretical understanding of how a large misspecification level harms the performance of the algorithm.

According to our theory, we choose β ∈ {1, 3, 10} and report the cumulative regret with different choices of Γ. The results are shown in Table 3 and Figure 3. Three key observations can be drawn from Table 3. First, choosing Γ = 0.01 achieves the best performance (lowest cumulative regret). Second, choosing 0.005 ≤ Γ ≤ 0.01 also leads to a comparable constant regret according to Table 3, but a smaller Γ may not lead to zero instantaneous regret within 200K rounds. Third, setting Γ > ∆ leads to a linear regret. In addition, our algorithm reduces to the LSVI-UCB algorithm (Jin et al., 2020) when Γ = 0. For misspecified linear MDPs, the algorithm studied in Theorem 3.2 of Jin et al. (2020) is still LSVI-UCB with a different choice of confidence radius β; in our experiments β is tuned as a hyper-parameter. The experiment results therefore suggest that our algorithm outperforms LSVI-UCB in the misspecified setting.

C DETAILED PROOF OF THEOREM 5.1

In this section, we provide the detailed proof of Theorem 5.1. First, we present a technical lemma bounding the total number of data points used in the online linear regression in Algorithm 1.

Lemma C.1. Given 0 < Γ ≤ 1, set λ = B^{-2}. For any k ∈ [K], |C_k| ≤ 16dΓ^{-2} log(3LBΓ^{-1}).
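The data selection scheme behind Lemma C.1 can be simulated numerically. The sketch below is our own minimal implementation (with L = B = 1, so λ = B^{-2} = 1): a context is added to the regression set only when its UCB bonus ∥x∥_{U^{-1}} is at least Γ, and the number of selected points stays within the bound of Lemma C.1 and far below K.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, Gamma, lam = 5, 5000, 0.2, 1.0  # lam = B^{-2} with B = 1

U = lam * np.eye(d)
selected = 0
for _ in range(K):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                      # ||x||_2 = L = 1
    bonus = np.sqrt(x @ np.linalg.solve(U, x))  # UCB bonus ||x||_{U^{-1}}
    if bonus >= Gamma:                          # select only high-uncertainty contexts
        U += np.outer(x, x)                     # update the design matrix
        selected += 1

lemma_c1_bound = 16 * d * Gamma**-2 * np.log(3 / Gamma)  # Lemma C.1 with L = B = 1
```

Once the uncertainty in every direction drops below Γ, no further data are selected, which is exactly why the regression set stays finite.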
Lemma C.1 suggests that at most O(dΓ^{-2}) contextual vectors have a UCB bonus greater than Γ. A similar result is provided in He et al. (2021b), suggesting an O(Γ^{-2}) Uniform-PAC sample complexity. Lemma C.1 also shows that the number of data points added to the regression set C is finite; thus the regression procedure is not critically affected by the noise and the misspecification error. For a linear regression over at most |C_k| data points, the next lemma is crucial in controlling the fluctuations caused by the misspecification error.

Lemma C.2. Let λ = B^{-2}. For all δ > 0, with probability at least 1 − δ, for all x ∈ R^d and k ∈ [K], the estimation error is bounded by |x^⊤(θ_k − θ^*)| ≤ (1 + R√(2dι) + ζ√|C_k|)∥x∥_{U_k^{-1}}, where ι = log((d + |C_k|L^2B^2)/(dδ)) and |C_k| is the total number of data points used in the regression at round k.

Lemma C.2 provides a decomposition similar to that of well-specified linear contextual bandit algorithms such as OFUL (Abbasi-Yadkori et al., 2011). However, comparing the confidence radius here, O(R√d + ζ√|C_{k−1}|), with the conventional radius O(R√d) in OFUL, one can see that the misspecification term affects the radius at an order of √|C_K|. If we directly used all data for the regression, the confidence radius would be of order O(√K) and would therefore lead to an O(K√log K) regret bound (see Lemma 11 in Abbasi-Yadkori et al. (2011)). This makes the regret bound trivial, since it exceeds the trivial upper bound O(K) as K grows. In our algorithm, however, the confidence radius depends only on |C_K|, which is finite by Lemma C.1. As a result, our regret bound does not grow with K as OFUL's does, and the algorithm enjoys a more stable prediction.

Corollary C.3. Suppose 2√d ζ ι_1 ≤ ∆; let λ = B^{-2} and 0 < Γ ≤ 1.
Let β = 1 + 2∆Γ^{-1}√ι_2/ι_1 + R√(2dι_3), where ι_2 = log(3LBΓ^{-1}) and ι_3 = log((1 + 16L^2B^2Γ^{-2}ι_2)/δ). Then with probability at least 1 − δ, for all x ∈ R^d, the estimation error at round k ∈ [K] is bounded by |x^⊤(θ_k − θ^*)| ≤ β∥x∥_{U_k^{-1}}.

Proof. By Lemma C.1, replacing |C_k| in Lemma C.2 with its upper bound yields |x^⊤(θ_k − θ^*)| ≤ (1 + 4√d ζΓ^{-1}√ι_2 + R√(2dι_3))∥x∥_{U_k^{-1}} ≤ β∥x∥_{U_k^{-1}}, where the second inequality is due to the condition 2√d ζ ≤ ∆/ι_1.

Next we introduce an auxiliary lemma controlling the instantaneous regret by the UCB bonus and the misspecification level.

Lemma C.4. Suppose Corollary C.3 holds. For all k ∈ [K], the instantaneous regret at round k is bounded by ∆_k(x_k) = r^*_k − r(x_k) ≤ 2ζ + 2β∥x_k∥_{U_k^{-1}}.

The next auxiliary lemma from He et al. (2021a) bounds the summation of a subset of the self-normalized vectors.

Lemma C.5 (Lemma 6.6, He et al. 2021a). For any subset G = {c_1, ..., c_i} ⊆ C_K, we have Σ_{k∈G} ∥x_k∥^2_{U_k^{-1}} ≤ 2d log(1 + |G|L^2/λ).

The next technical algebraic lemma is used to control the dominating terms.

Lemma C.6. Let ι_1 = (24 + 18R) log((72 + 54R)LB√d ∆^{-1}) + 8R√(2 log(1/δ)), Γ = ∆/(2√d ι_1), ι_2 = log(3LBΓ^{-1}), and ι_3 = log((1 + 16L^2B^2Γ^{-2}ι_2)/δ). Then ι_1 ≥ 2 + 4√ι_2 + R√(2ι_3).

Equipped with these lemmas, we can start the proof of Theorem 5.1.

Proof of Theorem 5.1. First, it is worth mentioning that by setting Γ = ∆/(2√d ι_1), the confidence radius β becomes 1 + 4√d √ι_2 + R√(2dι_3). Our proof proceeds on the event that Corollary C.3 holds, which happens with probability at least 1 − δ. We decompose the index set [K] into two subsets: the non-selected set [K] \ C_K and the selected set C_K, and bound the cumulative regret within the two sets separately.

First, for the non-selected data k ∉ C_K, i.e., ∥x_k∥_{U_k^{-1}} < Γ, combining Lemma C.4 with Corollary C.3 yields

r^*_k − r(x_k) < 2ζ + 2βΓ = 2ζ + ∆/(√d ι_1) + √(2ι_3) R∆/ι_1 + 4∆√ι_2/ι_1,   (C.1)

where ι_1, ι_2, ι_3 are the same as in Theorem 5.1, and the second equation follows from Γ = ∆/(2√d ι_1). When the misspecification condition 2√d ζ ≤ ∆/ι_1 holds, (C.1) implies that

r^*_k − r(x_k) < 2∆/(√d ι_1) + 4∆√ι_2/ι_1 + √(2ι_3) R∆/ι_1.   (C.2)

Lemma C.6 shows that when ι_1 = (24 + 18R) log((72 + 54R)LB√d ∆^{-1}) + 8R√(2 log(1/δ)), we have ι_1 ≥ 2 + 4√ι_2 + R√(2ι_3), so (C.2) yields that the instantaneous regret satisfies r^*_k − r(x_k) < ∆ at round k. By Definition 3.1, the instantaneous regret is then zero for all k ∉ C_K; that is, the non-selected data incur zero instantaneous regret.

Next, Lemma C.4 implies that the instantaneous regret over k ∈ C_K is bounded by

Σ_{k∈C_K} (r^*_k − r(x_k)) ≤ Σ_{k∈C_K} (2β∥x_k∥_{U_k^{-1}} + 2ζ)
 ≤ 2β √(|C_K| Σ_{k∈C_K} ∥x_k∥^2_{U_k^{-1}}) + 2|C_K|ζ
 ≤ 8βΓ^{-1}√(dι_2) √(2d log(1 + 16dΓ^{-2}ι_2)) + 32ζdΓ^{-2}ι_2
 ≤ 16β√(2d^3ι_2 log(1 + 16dΓ^{-2}ι_2)) ι_1/∆ + 64√(d^3) ι_1ι_2/∆
 ≤ 32β√(2d^3ι_2 log(1 + 16dΓ^{-2}ι_2)) ι_1/∆,   (C.3)

where the second inequality follows from the Cauchy–Schwarz inequality, the third from Lemma C.5 together with Lemma C.1, the fourth uses Γ = ∆/(2√d ι_1) and ζ ≤ ∆/(2√d ι_1), and the last holds because the second term in the fourth line is dominated by the first. To wrap up, the cumulative regret can be decomposed as

Regret(K) = Σ_{k∉C_K} (r^*_k − r(x_k)) + Σ_{k∈C_K} (r^*_k − r(x_k)) ≤ 0 + 32β√(2d^3ι_2 log(1 + 16dΓ^{-2}ι_2)) ι_1/∆,

where the first term is zero because r^*_k − r(x_k) = 0 for all k ∉ C_K, and the regret over k ∈ C_K is bounded by (C.3).

D PROOF OF TECHNICAL LEMMAS IN APPENDIX C D.1 PROOF OF LEMMA C.1

To prove this lemma, we introduce the well-known elliptical potential lemma from Abbasi-Yadkori et al. (2011).

Lemma D.1 (Lemma 11, Abbasi-Yadkori et al. 2011). Let {ϕ_i}_{i=1}^I be a sequence in R^d with ∥ϕ_i∥_2 ≤ L, and define U_i = λI + Σ_{j=1}^i ϕ_jϕ_j^⊤. Then Σ_{i=1}^I min{1, ∥ϕ_i∥^2_{U_{i-1}^{-1}}} ≤ 2d log((λd + IL^2)/(λd)).

The following auxiliary lemma and its corollary (Lemma D.2 and Lemma D.3) are also useful.

Proof of Lemma D.3. Let y = 1 + ax, i.e., x = (y − 1)/a. Then x ≥ 4 log(2a) + a^{-1} is equivalent to y ≥ 4a log(2a) + 2. By Lemma D.2, this implies y ≥ a log(y) + 1, which is exactly x ≥ log(1 + ax).

Equipped with these technical lemmas, we can start our proof.

Proof of Lemma C.1. Since the cardinality of the set C_k is monotonically increasing in k, we fix k = K and only bound |C_K|. Every selected data point k ∈ C_K satisfies ∥x_k∥_{U_k^{-1}} ≥ Γ. Therefore, when Γ ≤ 1, the summation of the UCB bonuses over k ∈ C_K is lower bounded by

Σ_{k∈C_K} min{1, ∥x_k∥^2_{U_k^{-1}}} ≥ |C_K| min{1, Γ^2} = |C_K|Γ^2.   (D.1)

On the other hand, Lemma D.1 implies

Σ_{k∈C_K} min{1, ∥x_k∥^2_{U_k^{-1}}} ≤ 2d log((λd + |C_K|L^2)/(λd)).   (D.2)

Combining (D.1) and (D.2), the total number of selected samples satisfies Γ^2|C_K| ≤ 2d log((λd + |C_K|L^2)/(λd)), which can be re-organized as

Γ^2|C_K|/(2d) ≤ log(1 + (2L^2/(Γ^2λ)) · Γ^2|C_K|/(2d)).   (D.3)

Let λ = B^{-2}. Since 2L^2B^2 ≥ 2 ≥ Γ^2, Lemma D.3 implies that if Γ^2|C_K|/(2d) > 4 log(4L^2B^2/Γ^2) + 1 ≥ 4 log(4L^2B^2/Γ^2) + Γ^2/(2L^2B^2), then (D.3) cannot hold. Thus a necessary condition for (D.3) is

Γ^2|C_K|/(2d) ≤ 4 log(4L^2B^2/Γ^2) + 1 = 8 log(2LB/Γ) + log(e) = 8 log(2LBe^{1/8}/Γ) < 8 log(3LB/Γ).

By basic calculus we get the claimed bound on |C_K|, completing the proof of Lemma C.1.
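The elliptical potential lemma (Lemma D.1) can be sanity-checked numerically. The following sketch is our own instance with λ = L = 1: it accumulates the truncated bonuses, evaluated against U_{i-1} before each update, and compares the total with the stated logarithmic bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, I, lam, L = 4, 3000, 1.0, 1.0

U = lam * np.eye(d)                  # U_0 = lam * I
total = 0.0
for _ in range(I):
    phi = rng.normal(size=d)
    phi *= L / np.linalg.norm(phi)   # ||phi_i||_2 = L
    # min(1, ||phi_i||^2_{U_{i-1}^{-1}}), computed before the rank-one update
    total += min(1.0, phi @ np.linalg.solve(U, phi))
    U += np.outer(phi, phi)

bound = 2 * d * np.log((lam * d + I * L**2) / (lam * d))
```

The bound grows only logarithmically in I, which is what makes the selection count in Lemma C.1 finite.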

D.2 PROOF OF LEMMA C.2

The proof follows the standard technique for linear bandits. We first introduce the self-normalized bound for vector-valued martingales from Abbasi-Yadkori et al. (2011).

Lemma D.4 (Theorem 1, Abbasi-Yadkori et al. 2011). Let {F_t}_{t=0}^∞ be a filtration. Let {ε_t}_{t=1}^∞ be a real-valued stochastic process such that ε_t is F_t-measurable and conditionally R-sub-Gaussian for some R ≥ 0. Let {ϕ_t}_{t=1}^∞ be an R^d-valued stochastic process such that ϕ_t is F_{t-1}-measurable and ∥ϕ_t∥_2 ≤ L for all t. For any t ≥ 0, define U_t = λI + Σ_{k=1}^t ϕ_kϕ_k^⊤. Then for any δ > 0, with probability at least 1 − δ, for all t ≥ 0,

∥Σ_{k=1}^t ϕ_kε_k∥^2_{U_t^{-1}} ≤ 2R^2 log(det(U_t)^{1/2} det(U_0)^{-1/2}/δ).

Lemma D.5 (Lemma 8, Zanette et al. 2020c). Let {a_i}_{i=1}^n be any sequence of vectors in R^d and {b_i}_{i=1}^n be any sequence of scalars such that |b_i| ≤ ϵ. For any λ > 0,

∥Σ_{i=1}^n a_ib_i∥^2_{[Σ_{i=1}^n a_ia_i^⊤ + λI]^{-1}} ≤ nϵ^2.

The next lemma bounds the perturbation caused by the misspecification.

Lemma D.6. Let {η_k}_k be any sequence of scalars such that |η_k| ≤ ζ for all k ∈ [K]. For any index subset C ⊆ [K], define U = λI + Σ_{k∈C} x_kx_k^⊤. Then for any x ∈ R^d, we have |x^⊤U^{-1} Σ_{k∈C} x_kη_k| ≤ ζ√|C| ∥x∥_{U^{-1}}.

Proof. By the Cauchy–Schwarz inequality,

|x^⊤U^{-1} Σ_{k∈C} x_kη_k| ≤ ∥x∥_{U^{-1}} ∥Σ_{k∈C} x_kη_k∥_{U^{-1}} ≤ ζ√|C| ∥x∥_{U^{-1}},

where the second inequality is due to Lemma D.5.

The next lemma provides the determinant–trace inequality.

Lemma D.7. Suppose the sequence {x_k}_{k=1}^K ⊂ R^d satisfies ∥x_k∥_2 ≤ L for all k ∈ [K]. For any index subset C ⊆ [K], define U = λI + Σ_{k∈C} x_kx_k^⊤ for some λ > 0. Then det(U) ≤ (λ + |C|L^2/d)^d.

Proof. The proof is almost the same as that of Lemma 10 in Abbasi-Yadkori et al. (2011), replacing the index set [K] with the subset C; we refer the readers to Abbasi-Yadkori et al. (2011) for details.

Equipped with these lemmas, we can start our proof.

Proof of Lemma C.2. For any k ∈ [K], consider the data samples k′ ∈ C_{k−1} used for the regression at round k.
Following the update rules of U_k and θ_k yields

U_k(θ_k − θ^*) = U_kU_k^{-1} Σ_{k′∈C_{k−1}} x_{k′}r_{k′} − (λI + Σ_{k′∈C_{k−1}} x_{k′}x_{k′}^⊤)θ^*
 = Σ_{k′∈C_{k−1}} x_{k′}r_{k′} − λθ^* − Σ_{k′∈C_{k−1}} x_{k′}x_{k′}^⊤θ^*
 = −λθ^* + Σ_{k′∈C_{k−1}} x_{k′}(r_{k′} − x_{k′}^⊤θ^*)
 = −λθ^* + Σ_{k′∈C_{k−1}} x_{k′}ε_{k′} + Σ_{k′∈C_{k−1}} x_{k′}η_{k′},

where the first equation is due to U_k = λI + Σ_{k′∈C_{k−1}} x_{k′}x_{k′}^⊤ and θ_k = U_k^{-1} Σ_{k′∈C_{k−1}} x_{k′}r_{k′}, and the last equation follows from the fact that r_{k′} is generated as r_{k′} = r(x_{k′}) + ε_{k′} = x_{k′}^⊤θ^* + η(x_{k′}) + ε_{k′}, where we denote by η_{k′} = η(x_{k′}) the model misspecification error and by ε_{k′} the random noise. Therefore, for any contextual vector x ∈ R^d, we have

|x^⊤(θ_k − θ^*)| = |x^⊤U_k^{-1}U_k(θ_k − θ^*)| ≤ λ|x^⊤U_k^{-1}θ^*| [=: q_1] + |x^⊤U_k^{-1} Σ_{k′∈C_{k−1}} x_{k′}ε_{k′}| [=: q_2] + |x^⊤U_k^{-1} Σ_{k′∈C_{k−1}} x_{k′}η_{k′}| [=: q_3],

where the inequality is due to the triangle inequality. Lemma D.6 yields q_3 ≤ ζ√|C_{k−1}| ∥x∥_{U_k^{-1}}. From the fact that |x^⊤Ay| ≤ ∥x∥_A∥y∥_A, the term q_1 is bounded by

q_1 ≤ λ∥x∥_{U_k^{-1}}∥θ^*∥_{U_k^{-1}} ≤ λ^{1/2}B∥x∥_{U_k^{-1}},   (D.4)

where the last inequality is due to the fact that U_k^{-1} ⪯ λ^{-1}I. The term q_2 is bounded as

q_2 ≤ ∥x∥_{U_k^{-1}} ∥Σ_{k′∈C_{k−1}} x_{k′}ε_{k′}∥_{U_k^{-1}} = ∥x∥_{U_k^{-1}} ∥Σ_{k′=1}^K 1[k′ ∈ C_{k−1}] x_{k′}ε_{k′}∥_{U_k^{-1}} [=: ∥x∥_{U_k^{-1}} I_1],   (D.5)

where the equality uses indicators to replace the summation over the subset C_{k−1}. Denoting y_{k′} = 1[k′ ∈ C_{k−1}] x_{k′}, and noticing that ∥y_{k′}∥_2 ≤ ∥x_{k′}∥_2 ≤ L and U_k = λI + Σ_{k′∈C_{k−1}} x_{k′}x_{k′}^⊤ = λI + Σ_{k′=1}^K y_{k′}y_{k′}^⊤, Lemma D.4 further bounds I_1 by

I_1 ≤ √(2R^2 log(det(U_k)^{1/2} det(U_0)^{-1/2}/δ)) ≤ R√(2 log(det(U_k)/(det(U_0)δ))) = R√(2 log(det(U_k)/(λ^d δ))),   (D.6)

where the second inequality follows from the fact that det(U_k) ≥ det(U_0) = λ^d.
Notice that U_k = λI + Σ_{k′∈C_{k−1}} x_{k′}x_{k′}^⊤, so Lemma D.7 gives det(U_k) ≤ (λ + |C_{k−1}|L^2/d)^d. Plugging this into (D.6), I_1 can be finally bounded by

I_1 ≤ R√(2 log((λ + |C_{k−1}|L^2/d)^d/(λ^d δ))) ≤ R√(2d log((dλ + |C_{k−1}|L^2)/(dλδ))).

Plugging the bound on I_1 into (D.5), combining with (D.4) and Lemma D.6, and replacing |C_{k−1}| with its upper bound |C_K|, we have with probability at least 1 − δ, for all k ∈ [K] and x ∈ R^d,

|x^⊤(θ_k − θ^*)| ≤ (R√(2d log((dλ + |C_K|L^2)/(dλδ))) + λ^{1/2}B + ζ√|C_K|) ∥x∥_{U_k^{-1}}.

Letting λ = B^{-2}, we get the claimed result.
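Lemma D.6, which controls the misspecification perturbation used above, can also be verified numerically. The following is our own minimal instance (uniform perturbations with magnitude at most ζ; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, lam, zeta = 6, 400, 1.0, 0.05

Xs = rng.normal(size=(m, d))
Xs /= np.linalg.norm(Xs, axis=1, keepdims=True)
eta = rng.uniform(-zeta, zeta, size=m)            # |eta_k| <= zeta

U = lam * np.eye(d) + Xs.T @ Xs                   # U = lam*I + sum_k x_k x_k^T
Uinv = np.linalg.inv(U)
s = Xs.T @ eta                                    # sum_k x_k eta_k

x = rng.normal(size=d)
lhs = abs(x @ Uinv @ s)                           # |x^T U^{-1} sum_k x_k eta_k|
rhs = zeta * np.sqrt(m) * np.sqrt(x @ Uinv @ x)   # zeta * sqrt(|C|) * ||x||_{U^{-1}}
```

The bound grows only as √|C|, which is why keeping the regression set small keeps the misspecification term in the confidence radius under control.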

D.3 PROOF OF LEMMA C.4

Proof. Suppose Corollary C.3 holds. According to the definition of the expected reward function r(x), for all k ∈ [K] we have

r^*_k − r(x_k) = η(x^*_k) − η(x_k) + (x^*_k)^⊤θ^* − x_k^⊤θ^*
 ≤ 2ζ + (x^*_k)^⊤θ^* − x_k^⊤θ^*
 = 2ζ + (x^*_k)^⊤θ_k + (x^*_k)^⊤(θ^* − θ_k) − x_k^⊤θ_k + x_k^⊤(θ^* − θ_k)
 ≤ 2ζ + (x^*_k)^⊤θ_k + β∥x^*_k∥_{U_k^{-1}} − x_k^⊤θ_k + β∥x_k∥_{U_k^{-1}}
 ≤ 2ζ + x_k^⊤θ_k + β∥x_k∥_{U_k^{-1}} − x_k^⊤θ_k + β∥x_k∥_{U_k^{-1}}
 ≤ 2ζ + 2β∥x_k∥_{U_k^{-1}}.

Proof of Lemma C.6. First, note that √(2ι_3) = √(2 log(1 + 16L^2B^2Γ^{-2}ι_2) + 2 log(1/δ)). Using the fact that √(a + b) ≤ √a + √b, it can be further bounded by √(2ι_3) ≤ √(2 log(1 + 16L^2B^2Γ^{-2}ι_2)) + √(2 log(1/δ)). Assuming L ≥ 1, B ≥ 1, and Γ = ∆/(2√d ι_1) ≤ 1 yields LBΓ^{-1} ≥ 1; then by basic calculus one can verify that 2 + 4√ι_2 ≤ 6 log(3LBΓ^{-1}) and √(2 log(1 + 16L^2B^2Γ^{-2}ι_2)) ≤ 3 log(3LBΓ^{-1}). Therefore,

2 + 4√ι_2 + R√(2ι_3) ≤ (6 + 3R) log(3LBΓ^{-1}) + √(2 log(1/δ)) R = (6 + 3R) log(6LB√d ∆^{-1} ι_1) + √(2 log(1/δ)) R,

where the last equality uses Γ = ∆/(2√d ι_1). Lemma D.2 suggests that a sufficient condition for

(6LB√d ∆^{-1}) ι_1 ≥ (6LB√d ∆^{-1})(6 + 3R) log(6LB√d ∆^{-1} ι_1) [=: a log(·)] + (6LB√d ∆^{-1}) √(2 log(1/δ)) R [=: b]   (D.7)

is

(6LB√d ∆^{-1}) ι_1 ≥ 4(6LB√d ∆^{-1})(6 + 3R) log(2(6LB√d ∆^{-1})(6 + 3R)) + 2(6LB√d ∆^{-1}) √(2 log(1/δ)) R,

which shows that setting ι_1 = (24 + 18R) log((72 + 54R)LB√d ∆^{-1}) + 8R√(2 log(1/δ)) implies ι_1 ≥ 2 + 4√ι_2 + R√(2ι_3).

E PROOF OF THEOREM 5.4

To begin with, we introduce a lemma (Lemma E.1) providing a sparse vector set in R^d. Next, we present the Bretagnolle–Huber inequality, which provides the lower bound needed to distinguish two bandit instances.

Lemma E.2 (Bretagnolle–Huber inequality). Let P and Q be probability measures on the same measurable space (Ω, F), and let A ∈ F be an arbitrary event. Then P(A) + Q(A^c) ≥ (1/2) exp(−KL(P, Q)).
For a stochastic linear bandit problem with finitely many arms, denote by T_i(k) the number of rounds the algorithm visits the i-th arm over k rounds in total. We have the following KL-divergence decomposition lemma.

Lemma E.3 (Lemma 15.1, Lattimore & Szepesvári 2020). Let ν = (P_1, ..., P_n) be the reward distributions associated with one n-armed bandit and let ν′ = (P′_1, ..., P′_n) be those of another n-armed bandit. Fix some algorithm π and let P_ν = P_{νπ}, P_{ν′} = P_{ν′π} be the probability measures on the canonical bandit model induced by the k-round interconnection of π and ν (respectively, π and ν′). Then KL(P_ν, P_{ν′}) = Σ_{i=1}^n E_ν[T_i(k)] KL(P_i, P′_i).

Proof of Theorem 5.4. The proof adapts the idea of Lattimore et al. (2020). Given the dimension d and the number of arms |D|, setting ε = √(8 log(|D|)/(d − 1)), Lemma E.1 provides a contextual vector set D such that ∥x∥_2 = 1 for all x ∈ D and |⟨x, y⟩| ≤ √(8 log(|D|)/(d − 1)) for all x, y ∈ D with x ≠ y. For simplicity, we index the decision set as x_1, ..., x_{|D|}. Given the minimal sub-optimality gap ∆, we construct the parameter set Θ as follows:

Θ = {θ_{(i,j)} = ∆x_i + 2∆x_j : x_i, x_j ∈ D, i ≠ j} ∪ {θ_i = ∆x_i : x_i ∈ D}.

Θ contains two kinds of parameters: θ_{(i,j)}, a mixture of two different contexts x_i, x_j with strengths ∆ and 2∆, and θ_i, which only contains the feature of a single context x_i. One can further verify that |Θ| = |D|^2 and ∥θ∥_2 ≤ √5 ∆ for all θ ∈ Θ. For a parameter θ, the reward is sampled from a Gaussian distribution N(r_θ(x), 1), where the expected reward function is defined as

r_{θ_{(i,j)}}(x) = 2∆ if x = x_j; ∆ if x = x_i; 0 otherwise,    r_{θ_i}(x) = ∆ if x = x_i; 0 otherwise.

One can verify that the minimal sub-optimality gap of all these bandit problems is ∆. For a given parameter θ and input x, utilizing the sparsity of the set D (i.e.,
|x^⊤y| ≤ ε if x ≠ y), we can bound the misspecification level as

|r_{θ_{(i,j)}}(x) − θ_{(i,j)}^⊤x| = |2∆ − 2∆x_j^⊤x − ∆x_i^⊤x| ≤ ∆ε if x = x_j; = |∆ − 2∆x_j^⊤x − ∆x_i^⊤x| ≤ 2∆ε if x = x_i; = |0 − 2∆x_j^⊤x − ∆x_i^⊤x| ≤ 3∆ε otherwise,

|r_{θ_i}(x) − θ_i^⊤x| = |∆ − ∆x_i^⊤x| = 0 if x = x_i; = |0 − ∆x_i^⊤x| ≤ ∆ε otherwise.

Therefore the misspecification level is bounded by ζ = 3∆ε. The constructed bandit family is hard for any algorithm to learn, since an algorithm gains no information before it encounters a non-zero expected reward, even ignoring the reward noise. We follow the same argument as in Lattimore & Szepesvári (2020). If the algorithm chooses arm i at the first round, there are |D| parameters (i.e., θ_i and θ_{(i,·)}) yielding a non-zero expected reward. If at the second round it chooses a different arm j, there are additional parameters (i.e., θ_j and θ_{(j,k)} with k ≠ i) yielding a non-zero expected reward. Therefore the average number of rounds with zero expected reward is at least

|D|^{-2} Σ_{i=1}^{|D|} (i − 1)(|D| − i + 1) = |D|^{-2} Σ_{i=0}^{|D|−1} i(|D| − i) = |D|^{-2} (|D| Σ_{i=0}^{|D|−1} i − Σ_{i=0}^{|D|−1} i^2) = ((|D| − 1)/2)(1 − (2|D| − 1)/(3|D|)) ≥ (|D| − 1)/6,

where the third equality uses Σ_{i=1}^n i = n(n + 1)/2 and Σ_{i=1}^n i^2 = n(n + 1)(2n + 1)/6, and the last inequality uses (2|D| − 1)/(3|D|) ≤ 2/3. Therefore, even without random noise, any algorithm is expected to receive min{K, (|D| − 1)/6} uninformative data points with zero expected reward, and thus incurs at least ∆ min{K, (|D| − 1)/6} regret, since the sub-optimality gap is ∆.

Next, we consider the effect of the random noise. For any algorithm running on this parameter set Θ, pick two parameters θ_i and θ_{(i,j)} with j ≠ i. Define the events A = {T_j(k) ≥ k/2} and A^c = {T_j(k) < k/2}.
By Lemma E.2 and Lemma E.3,

P_{θ_i}(T_j(k) ≥ k/2) + P_{θ_{(i,j)}}(T_j(k) < k/2) ≥ (1/2) exp(−KL(P_{θ_i}, P_{θ_{(i,j)}})) = (1/2) exp(−Σ_{n∈D} E_{θ_i}[T_n(k)] KL(P_{θ_i,n}, P_{θ_{(i,j)},n})).   (E.1)

Notice that the minimal sub-optimality gap is ∆ and the j-th arm is a sub-optimal arm for parameter θ_i. Therefore, once T_j(k) ≥ k/2, the algorithm suffers at least ∆k/2 regret under θ_i. Likewise, since the j-th arm is the optimal arm for the bandit θ_{(i,j)}, if T_j(k) < k/2, the algorithm suffers at least ∆k/2 regret under θ_{(i,j)}. Denoting by R_θ(k) the expected cumulative regret over k rounds, this means

R_{θ_i}(k) ≥ (∆k/2) P_{θ_i}(T_j(k) ≥ k/2),    R_{θ_{(i,j)}}(k) ≥ (∆k/2) P_{θ_{(i,j)}}(T_j(k) < k/2).   (E.2)

On the other hand, the bandits with parameters θ_i and θ_{(i,j)} only differ in the j-th arm. Since standard Gaussian noise is adopted, KL(P_{θ_i,n}, P_{θ_{(i,j)},n}) = (∆^2/2) 1[n = j]. Combining this with (E.2), (E.1) implies

R_{θ_i}(k) + R_{θ_{(i,j)}}(k) ≥ (∆k/2) exp(−(∆^2/2) E_{θ_i}[T_j(k)]),

which yields

E_{θ_i}[T_j(k)] ≥ (log(∆k) − log 2 − log(R_{θ_i}(k) + R_{θ_{(i,j)}}(k)))/(∆^2/2).   (E.3)

For any algorithm seeking a sublinear expected regret bound R_θ(k) ≤ Ck^α with C > 0 and 0 ≤ α < 1 for all θ ∈ Θ, (E.3) becomes

E_{θ_i}[T_j(k)] ≥ (log(∆k) − log 2 − log(2Ck^α))/(∆^2/2) = (log(∆k) − log(4C) − α log k)/(∆^2/2).   (E.4)

Since the regret under θ_i can be decomposed as

R_{θ_i}(k) = ∆ Σ_{n=1, n≠i}^{|D|} E_{θ_i}[T_n(k)],   (E.5)

combining (E.5) with (E.4) yields

R_{θ_i}(k) ≥ (2(|D| − 1)/∆) max{log(∆k) − log(4C) − α log k, 0},

where the max operator is valid since trivially R_θ(k) ≥ 0.

F PROOF OF THEOREM 7.4

In this section we provide the proof of Theorem 7.4, starting with the technical lemmas. The first lemma parallels Lemma C.1, setting λ = L = 1 as in Definition 7.1 and taking a union bound over the H possible stages.

Lemma F.1. Given 0 < Γ ≤ 1, for any k ∈ [K], |C_k| ≤ 8HdΓ^{-2} log(6HΓ^{-2}).

The next lemma extends Lemma C.2 and Lemma C.4 to the H > 1 setting and parallels Lemma C.5 in Jin et al.
(2020), by replacing the total number of rounds K with the number of selected data points |C_K| used in the regression.

Lemma F.2 (Lemma C.5, Jin et al. 2020). Let β = c_β H(d√ι_2 + ζΓ^{-1}√(8Hdι_1)), where ι_1 = log(6HΓ^{-2}), ι_2 = log((16H^2d^2Γ^{-2}ι_1)/δ), and c_β is an absolute constant. With probability at least 1 − δ, for any fixed policy π and all (s, a, h, k) ∈ S × A × [H] × [K],

ϕ(s, a)^⊤ w_h^k − Q_h^π(s, a) = P_h(V_{h+1}^k − V_{h+1}^π)(s, a) + ρ_h^k(s, a),

where |ρ_h^k(s, a)| ≤ β∥ϕ(s, a)∥_{(U_h^k)^{-1}} + 4Hζ.

Given Lemma F.2, we can pin down the rate of the confidence radius β under suitable conditions on the misspecification level and the parameter setting.

Corollary F.3. Let the misspecification level be bounded by ζ = O(∆d^{-0.5}H^{-2.5}) and choose the threshold parameter Γ = Θ(∆d^{-1}H^{-2}). Then the β in Lemma F.2 can be bounded by β ≤ β̄ = O(Hd).

Given this corollary, we show that the sub-optimality gap is controlled by three parts: the summation of the UCB bonuses, the misspecification level, and the noise induced by the probability transition kernel.

Lemma F.4. Suppose Corollary F.3 and Lemma F.2 hold. For any subset K ⊆ [K] and h ∈ [H], we have

Σ_{k∈K} ∆_h(s_h^k, a_h^k) ≤ Σ_{k∈K} (V_h^*(s_h^k) − V_h^{π_k}(s_h^k)) ≤ 2β Σ_{k∈K} Σ_{h′=h}^H ∥ϕ_{h′}^k∥_{(U_{h′}^k)^{-1}} + 4H^2|K|ζ + Σ_{k∈K} Σ_{h′=h+1}^H ε_{h′}^k,

where ε_h^k := [P_h(V_{h+1}^k − V_{h+1}^{π_k})](s_h^k, a_h^k) − (V_{h+1}^k(s_{h+1}^k) − V_{h+1}^{π_k}(s_{h+1}^k)) is a zero-mean random variable conditioned on all randomness before episode k.

The next lemma shows that, with high probability, the cumulative regret is bounded by the summation of the sub-optimality gaps over all stages.

Lemma F.5 (Lemma 6.1, He et al. 2021a). For each MDP M(S, A, H, {r_h}, {P_h}), with probability at least 1 − δ, the cumulative regret over K episodes is bounded by

Regret(K) ≤ 2 Σ_{k=1}^K Σ_{h=1}^H ∆_h(s_h^k, a_h^k) + 16H^3 √(3 log((log(HK) + 1)/δ)) + 2.

Equipped with these lemmas, we can start our proof.

Proof of Theorem 7.4.
For simplicity, denote the non-selected episode set by C̄_K := [K] \ C_K. We first condition on the event that Lemma F.2 holds. We fix a stage h and define the following sequences to apply the peeling technique, which is also used in He et al. (2021a). For each integer 0 ≤ l ≤ log(H/∆)/log 2, let k_0^l = 0 and

k_i^l = min{k ∈ [K] : k > k_{i−1}^l, 2^l ∆ ≤ ∆_h(s_h^k, a_h^k) < 2^{l+1} ∆}.

Since ∆_h(s, a) ≤ H, there are at most 1 + log(H/∆)/log 2 levels. Denote by K_l = {k_1^l, ..., k_{|K_l|}^l} the index set of the sequence {k_i^l}_i, and fix one level l in the following. On the one hand, since ∆_h(s_h^k, a_h^k) is lower bounded on K_l, we have

Σ_{k∈K_l} ∆_h(s_h^k, a_h^k) ≥ 2^l ∆ |K_l|.   (F.1)

On the other hand, Lemma F.4 yields

Σ_{k∈K_l} ∆_h(s_h^k, a_h^k) ≤ 2β Σ_{k∈K_l} Σ_{h′=h}^H ∥ϕ_{h′}^k∥_{(U_{h′}^k)^{-1}} [=: 2β I_1^l] + 4H^2|K_l|ζ + Σ_{k∈K_l} Σ_{h′=h+1}^H ε_{h′}^k [=: I_2^l].   (F.2)

By the Azuma–Hoeffding inequality, with probability at least 1 − δ/H, I_2^l is bounded by

I_2^l ≤ √(2|K_l|H^3 log(HK/δ)).   (F.3)

Since [K] = C_K ∪ C̄_K, the term I_1^l can be decomposed as

I_1^l = Σ_{k∈K_l∩C̄_K} Σ_{h′=h}^H ∥ϕ_{h′}^k∥_{(U_{h′}^k)^{-1}} + Σ_{k∈K_l∩C_K} Σ_{h′=h}^H ∥ϕ_{h′}^k∥_{(U_{h′}^k)^{-1}}
 ≤ HΓ|K_l ∩ C̄_K| + H √(|K_l ∩ C_K| Σ_{k∈K_l∩C_K} ∥ϕ_{h′}^k∥^2_{(U_{h′}^k)^{-1}})
 ≤ HΓ|K_l ∩ C̄_K| + H √(|K_l ∩ C_K| · 2d log(1 + |K_l ∩ C_K|))
 ≤ HΓ|K_l| + H √(2d|K_l| log(1 + K)),   (F.4)

where the first term in the first inequality uses ∥ϕ_h^k∥_{(U_h^k)^{-1}} ≤ Γ for all h when k ∈ C̄_K and the second term follows from the Cauchy–Schwarz inequality; the second inequality uses Lemma C.5; and the third holds because both |K_l ∩ C̄_K| and |K_l ∩ C_K| are at most |K_l| as well as K. Plugging (F.3) and (F.4) into (F.2) and combining (F.2) with (F.1) yields

2^l ∆|K_l| ≤ (2βHΓ + 4H^2ζ)|K_l| [=: I_3 |K_l|] + βH√(8|K_l|d log(1 + K)) + √(2|K_l|H^3 log(HK/δ)).   (F.5)

Recall the parameter setting β = O(Hd) and Γ = O(∆d^{-1}H^{-2}); as long as ζ = O(∆d^{-0.5}H^{-2.5}), it is guaranteed that I_3 ≤ ∆/2 ≤ 2^l ∆/2.
The calculation of the logarithmic terms follows the proof of Theorem 5.1; we omit it for simplicity. Plugging this into (F.5), when the condition on ζ is satisfied we have

2^l ∆|K_l|/2 ≤ βH√(8|K_l|d log(1 + K)) + √(2|K_l|H^3 log(HK/δ)),   (F.6)

which implies |K_l| ≤ 4^{-l}∆^{-2} I_4 with probability at least 1 − δ/H, where I_4 := (2βH√(8d log(1 + K)) + 2√(2H^3 log(HK/δ)))^2. Since ∆_h(s_h^k, a_h^k) < 2^{l+1}∆ for all k ∈ K_l, the cumulative sub-optimality gap over the set K_l is bounded by

2^l ∆|K_l| ≤ Σ_{k∈K_l} ∆_h(s_h^k, a_h^k) ≤ 2^{l+1}∆ · 4^{-l}∆^{-2} I_4 ≤ 2 · 2^{-l} ∆^{-1} I_4.   (F.7)

Replacing δ with δ/(1 + log(H∆^{-1})/log 2), a union bound gives that with probability at least 1 − δ/H, (F.7) holds for all integers 0 ≤ l ≤ log(H∆^{-1})/log 2.

The next lemma provides the recursive error bound for the estimated value function V_h^k(·).

Lemma G.2 (Lemma C.6, Jin et al. 2020). Suppose Lemma F.2 holds. For all (h, k) ∈ [H] × [K],

V_h^k(s_h^k) − V_h^{π_k}(s_h^k) ≤ V_{h+1}^k(s_{h+1}^k) − V_{h+1}^{π_k}(s_{h+1}^k) + ε_h^k + 2β∥ϕ(s_h^k, a_h^k)∥_{(U_h^k)^{-1}},

where ε_h^k is the zero-mean noise defined in Lemma F.4 and V_{H+1}^k(·) = 0.

Proof of Lemma F.4. Recall that ε_h^k denotes [P_h(V_{h+1}^k − V_{h+1}^{π_k})](s_h^k, a_h^k) − (V_{h+1}^k(s_{h+1}^k) − V_{h+1}^{π_k}(s_{h+1}^k)); it is easy to verify that ε_h^k is a zero-mean random variable conditioned on all randomness before the k-th episode. By Lemma G.1, for all (s, a, h, k) ∈ S × A × [H] × [K], we have Q_h^k(s, a) ≥ Q_h^*(s, a) − 4H^2ζ. Also, by the definition of the minimal sub-optimality gap in Definition 7.3,

∆_h(s_h^k, a_h^k) = V_h^*(s_h^k) − Q_h^*(s_h^k, a_h^k) = Q_h^*(s_h^k, π_h^*(s_h^k)) − Q_h^*(s_h^k, a_h^k) ≤ Q_h^k(s_h^k, π_h^*(s_h^k)) + 4H^2ζ − Q_h^*(s_h^k, a_h^k).

Since Algorithm 2 takes the greedy policy a_h^k = argmax_a Q_h^k(s_h^k, a), the sub-optimality gap is bounded by

∆_h(s_h^k, a_h^k) ≤ Q_h^k(s_h^k, a_h^k) − Q_h^*(s_h^k, a_h^k) + 4H^2ζ ≤ V_h^k(s_h^k) − V_h^{π_k}(s_h^k) + 4H^2ζ,   (G.5)

where the last inequality uses V_h^{π_k}(s_h^k) ≤ V_h^*(s_h^k).
Telescoping Lemma G.2 over the stages yields

V_h^k(s_h^k) − V_h^{π_k}(s_h^k) ≤ Σ_{h′=h+1}^H ε_{h′}^k + 2β Σ_{h′=h}^H ∥ϕ(s_{h′}^k, a_{h′}^k)∥_{(U_{h′}^k)^{-1}}.   (G.6)

Plugging (G.6) back into (G.5), we obtain

∆_h(s_h^k, a_h^k) ≤ 4H^2ζ + Σ_{h′=h+1}^H ε_{h′}^k + 2β Σ_{h′=h}^H ∥ϕ(s_{h′}^k, a_{h′}^k)∥_{(U_{h′}^k)^{-1}}.

Since this holds for all k ∈ [K], taking a summation over k ∈ K ⊆ [K] gives the claimed result.
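The peeling technique used in the proof of Theorem 7.4 partitions the rounds by the magnitude of their sub-optimality gaps into geometrically growing levels [2^l ∆, 2^{l+1} ∆). A minimal sketch with synthetic gaps (the values of ∆, H, and the gap distribution are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
Delta, H = 0.05, 1.0
gaps = rng.uniform(Delta, H, size=1000)  # per-round sub-optimality gaps in [Delta, H)

L = int(np.log(H / Delta) / np.log(2))   # highest peeling level index
# Level l collects rounds whose gap lies in [2^l * Delta, 2^(l+1) * Delta).
level_sizes = [int(((2**l * Delta <= gaps) & (gaps < 2**(l + 1) * Delta)).sum())
               for l in range(L + 1)]
```

The levels form a disjoint cover of [∆, 2^{L+1}∆), so every round with gap at least ∆ falls into exactly one level; the proof then bounds |K_l| separately at each level and sums the geometric series.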



We use the notation O(·) to hide logarithmic factors other than the number of rounds. When we say constant regret, we ignore the log(1/δ) factor in the regret, as we choose δ to be a constant.



Du et al. (2020) considered agnostic Q-learning with misspecified linear function approximation. They proposed an algorithm with access to a generative model and showed that if ζ ≤ O(∆/√d), one can find the optimal policy using O(d) trajectories. Together with the lower bound provided in Du et al. (2019), this suggests that ζ = O(∆/√d) is a sufficient and necessary condition for achieving polynomial sample complexity given access to the generative model.

Remark 7.6. Compared with the linear contextual bandit results, our result in the linear MDP setting suggests the same relationship among ζ, ∆, and the dimension d, ignoring the horizon factor H. Du et al. (2019) showed that when ζ ≥ Ω(∆/√d), any reinforcement learning algorithm suffers a sample complexity exponential in H. In addition, Du et al. (2020) provided an algorithm for agnostic Q-learning that takes O(d) trajectories to find the optimal policy when ζ < ∆/√d.

(a) Cumulative regret comparison of DS-OFUL (with different choices of Γ), Lattimore et al. (2020), and RLB over 2000 rounds; results are averaged over 32 replicates. (b) Cumulative regret of DS-OFUL on the Asirra dataset over 1M rounds with different Γ under misspecification level ζ = 0.01; results are averaged over 8 runs with standard errors shown as shaded areas.

Figure 1: Experiment results on (a): synthetic dataset, and (b): real-world dataset.

As shown in Figure 1(a), when ζ = 0.01, though the OFUL algorithm (setting Γ = 0) has a better performance at the very beginning, setting Γ = 0.05 ≈ ∆/√d = 1/√512 eventually improves the performance of the algorithm. As a sensitivity analysis, we also set ζ ∈ {0.5, 0.1, 0.05} to test the impact of misspecification on the performance under different choices of Γ. More experiment configurations and results are deferred to Appendix B.

9 CONCLUSION AND FUTURE WORK

We study misspecified linear contextual bandits from a gap-dependent perspective. We propose an algorithm and show that if the misspecification level ζ ≤ O(∆/√d), the proposed algorithm achieves the same gap-dependent regret bound as in the well-specified case. Along with Lattimore et al. (2020) and Du et al. (2019), this provides a complete picture of the interplay between misspecification and sub-optimality gap, in which ∆/√d plays an important role in the phase transition of ζ that decides whether the bandit model can be efficiently learned. The algorithm and analysis have been extended to linear Markov decision processes and verified via experiments as well.

Auer (2002), Chu et al. (2011), and Agrawal & Goyal (2013) studied linear contextual bandits when the number of arms is finite. Abbasi-Yadkori et al. (2011) proposed an algorithm called OFUL to deal with infinite arm sets. All these works come with an O(√K) problem-independent regret bound, and an O(d^2 ∆^{-1} log K) gap-dependent regret bound is also given by Abbasi-Yadkori et al. (2011).

Figure 2: The performance of DS-OFUL under different misspecification levels ζ. Results are averaged over 8 runs with standard errors shown as shaded areas.

(a) Cumulative regret of DS-LSVI over 8 replicates with different choices of Γ.

(b) Cumulative regret of DS-LSVI with different choices of Γ after 200K rounds. Each experiment with the same Γ is repeated 8 times, and each point represents an individual experiment.

Figure 3: The performance of DS-LSVI with different choices of Γ. (a): averaged cumulative regret w.r.t number of rounds. (b): cumulative regret after 200K rounds with different choices of Γ.

When the misspecification level is well bounded by ζ = O(∆/√d), Corollary C.3 is a direct consequence of Lemma C.2, obtained by replacing the term |C_K| with its upper bound provided in Lemma C.1.

Lemma D.2 (Lemma A.2, Shalev-Shwartz & Ben-David 2014). Let a ≥ 1 and b > 0. Then x ≥ 4a log(2a) + 2b implies x ≥ a log(x) + b. Lemma D.2 easily implies the following lemma. Lemma D.3. Let a ≥ 1. Then x ≥ 4 log(2a) + a^{-1} implies x ≥ log(1 + ax).
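Lemma D.2 is a purely algebraic fact and can be spot-checked numerically at the lemma's threshold; a quick sketch over a few arbitrarily chosen (a, b) pairs:

```python
import math

def lemma_d2_holds(a: float, b: float) -> bool:
    """Check x >= a*log(x) + b at the lemma's threshold x = 4a*log(2a) + 2b."""
    x = 4 * a * math.log(2 * a) + 2 * b
    return x >= a * math.log(x) + b

checks = [lemma_d2_holds(a, b)
          for a in (1.0, 2.5, 10.0, 100.0)
          for b in (0.1, 1.0, 50.0)]
```

Since x − a log(x) − b is increasing for x ≥ a, checking the threshold point suffices for all larger x.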

where the second inequality utilizes the fact that |η(x)| ≤ ζ for all x ∈ D_k, the inequality on the fourth line follows from Corollary C.3, and the inequality on the fifth line is due to the fact that x_k = argmax_{x∈D_k} (x^⊤θ_k + β∥x∥_{U_k^{-1}}), which is executed in Line 4 of Algorithm 1.

D.4 PROOF OF LEMMA C.6

Lemma E.1 (Lemma 3.1, Lattimore et al. 2020). For any ε > 0 and d ≤ |D| with d ≥ ⌈8 log(|D|) ε^{-2}⌉, there exists a vector set D ⊂ R^d such that ∥x∥_2 = 1 for all x ∈ D and |⟨x, y⟩| ≤ ε for all x, y ∈ D with x ≠ y.
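A concrete instance of such a sparse set can be obtained from random unit vectors, which are nearly orthogonal in high dimension. This random construction is our own illustration (not the one used by Lattimore et al. (2020)), and the dimensions below are chosen so the coherence condition holds with overwhelming probability:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 64, 50
eps = np.sqrt(8 * np.log(n) / d)   # eps such that d >= 8*log(n)/eps^2 holds with equality

D = rng.normal(size=(n, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)    # n unit vectors in R^d

coherence = np.abs(D @ D.T) - np.eye(n)          # off-diagonal |<x, y>|
max_coherence = float(coherence.max())
```

For random unit vectors, a typical pairwise inner product is of order √(log(n)/d), comfortably below the ε = √(8 log(n)/d) threshold of the lemma.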

∆⁻¹I₄ = 4∆⁻¹I₄. (F.8)

By the union bound, with probability at least 1 − δ, (F.8) holds for all h ∈ [H]. Thus, with probability at least 1 − 3δ by a union bound over the probability events in Lemma F.2, Lemma F.5 and (F.9), and since β = O(Hd + √(H³d)ζΓ⁻¹) = O(Hd) by Corollary F.3, replacing δ with δ/3 yields that the regret is bounded by

Regret(K) ≤ O(H⁵d³∆⁻¹ log(K))

with probability at least 1 − δ.

G PROOF OF LEMMAS IN APPENDIX F

G.1 PROOF OF LEMMA F.1

The proof technique is similar to that of Lemma C.1, equipped with Lemma D.1 and Lemma D.3.

Proof of Lemma F.1. Each selected data sample satisfies ∥ϕ_h^k∥_{(U_h^k)⁻¹} ≥ Γ for some h ∈ [H]; therefore the summation over the selected data is lower bounded by Γ²|C_k| and upper bounded by

2dH log(1 + |C_k|/d), (G.2)

where λ = L = 1. Combining (G.1) with (G.2) yields

Γ²|C_k| ≤ 2dH log(1 + |C_k|/d), (G.3)

and following the same argument as in the proof of Lemma C.1, a necessary condition for (G.3) is

|C_k| ≤ 8HdΓ⁻² log(6HΓ⁻²). (G.4)

G.2 PROOF OF LEMMA F.2

Proof of Lemma F.2. The proof of this lemma follows the proof of Lemma C.5 in Jin et al.
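The step from (G.3) to (G.4) can be sanity-checked numerically (parameter values below are illustrative choices of ours): the largest |C_k| satisfying (G.3) never exceeds the bound claimed in (G.4).

```python
import math

def g4_bound(d, H, gamma):
    # claimed necessary condition (G.4): |C_k| <= 8 H d Γ^{-2} log(6 H Γ^{-2})
    return 8 * H * d / gamma ** 2 * math.log(6 * H / gamma ** 2)

def largest_c_satisfying_g3(d, H, gamma, cap=10 ** 8):
    # largest integer C with Γ^2 C <= 2 d H log(1 + C/d), i.e. satisfying (G.3).
    # The feasible set is an interval [0, C*] because the right-hand side is
    # concave in C and equals the left-hand side at C = 0, so binary search works.
    lo, hi = 0, cap
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if gamma ** 2 * mid <= 2 * d * H * math.log(1 + mid / d):
            lo = mid
        else:
            hi = mid - 1
    return lo

for d, H, gamma in [(10, 5, 0.1), (5, 1, 0.2), (20, 3, 0.05)]:
    assert largest_c_satisfying_g3(d, H, gamma) <= g4_bound(d, H, gamma)
```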

q_4, where P(·|s, a) is the well-specified transition kernel defined by P(·|s, a) = ⟨µ(·), ϕ(s, a)⟩. From the proof in Jin et al. (2020), we have that for any (s, a) ∈ S × A,

|⟨ϕ(s, a), q_1⟩| ≤ √λ ∥w_h^π∥₂ ∥ϕ(s, a)∥_{(U_h^k)⁻¹} ≤ √λ B ∥ϕ(s, a)∥_{(U_h^k)⁻¹},
|⟨ϕ(s, a), q_2⟩| ≤ O(dH) ∥ϕ(s, a)∥_{(U_h^k)⁻¹},
⟨ϕ(s, a), q_3⟩ = [P_h(V_{h+1}^k − V_{h+1}^π)](s, a) + p_2, where |p_2| ≤ 2H√(dλ) ∥ϕ(s, a)∥_{(U_h^k)⁻¹}.

For the fourth term, by Lemma D.6, which improves Lemma C.4 in Jin et al. (2020) by a factor of √d, we have

|⟨ϕ(s, a), q_4⟩| ≤ 2Hζ√|C_{k−1}| ∥ϕ(s, a)∥_{(U_h^k)⁻¹}.

Finally, since ⟨ϕ(s, a), w_h^k − w_h^π⟩ = ⟨ϕ(s, a), q_1 + q_2 + q_3 + q_4⟩, we have that

|⟨ϕ(s, a), w_h^k⟩ − Q_h^π(s, a) − [P_h(V_{h+1}^k − V_{h+1}^π)](s, a)| ≤ O(Hd + Hζ√|C_{k−1}|) ∥ϕ(s, a)∥_{(U_h^k)⁻¹}.

Plugging the bound on |C_{k−1}| from Lemma F.1 back in gives the claimed result.

G.3 PROOF OF LEMMA F.4

The proof of Lemma F.4 follows the same technique from Jin et al. (2020), and we wrap it here for completeness. The first lemma shows that the estimate of the Q function remains optimistic regardless of the misspecification.

Lemma G.1 (Lemma C.5, Jin et al. 2020). Under the parameter setting in Theorem 7.4 and the event of Lemma F.2, we have Q_h^k(s, a) ≥ Q_h^*(s, a) − 4H(H + 1 − h)ζ for all (s, a, h, k) ∈ S × A × [H] × [K].

Ω(2^d) sample complexity, where D is the decision set. When the outcome is deterministic and contains no noise, they provided an algorithm with O(d) sample complexity that identifies a ∆-optimal arm when ζ ≤ ∆/√d. Lattimore et al. (2020) also mentioned that if ζ√d ≤ ∆, there exists a best-arm identification algorithm that only uses O

Averaged cumulative regret and elapsed time (E.T.) of DS-OFUL over 32 runs.

The number of remaining data samples after data processing with expected misspecification level

Averaged cumulative and instantaneous regret of DS-LSVI over 8 replicates for 200K rounds with different choices of Γ (see the footnote for the specific choices of Γ). "Last 1K rounds cumulative regret" means the total regret incurred in the very last 1K rounds, in order to verify the final performance of the algorithm. "Total 200K rounds cumulative regret" means the total regret incurred during the entire 200K rounds.

2^l ∆ |K_l| ≤ βH√(32|K_l|d log(1 + K)) + √(8|K_l|H³ log(·)). Since √a + √b ≤ √(2a + 2b), this implies 4^l ∆² |K_l| ≤ 64β²H²d log(1 + K) + 16H³ log(·)

