THE POWER OF FEEL-GOOD THOMPSON SAMPLING: A UNIFIED FRAMEWORK FOR LINEAR BANDITS

Abstract

Linear contextual bandit is one of the most popular models in online decision-making with bandit feedback. Prior work has studied different variants of this model, e.g., misspecified, non-stationary, and multi-task/lifelong linear contextual bandits. However, there is no single framework that unifies the algorithm design and analysis for these variants. In this paper, we propose a unified framework for linear contextual bandits based on feel-good Thompson sampling (Zhang, 2021). The algorithm derived from our framework achieves nearly minimax optimal regret in various settings and resolves the respective open problem in each setting. Specifically, let $d$ be the dimension of the context and $T$ be the length of the horizon. Our algorithm achieves an $\tilde O(d\sqrt{ST})$ regret bound for non-stationary linear bandits with at most $S$ switches, $\tilde O(d^{5/6} T^{2/3} P^{1/3})$ regret for non-stationary linear bandits with bounded path length $P$, and $\tilde O(d\sqrt{kT} + \sqrt{dkMT})$ regret for (generalized) lifelong linear bandits over $M$ tasks that share an unknown representation of dimension $k$. We believe our framework will shed light on the design and analysis of other linear contextual bandit variants.

1. INTRODUCTION

Linear contextual bandit is one of the most popular models in online decision-making with a large, possibly infinite, action space. This bandit model has been widely studied in the past decade. One of the most successful approaches is based on the upper confidence bound (Auer, 2002). For example, LinUCB (Li et al., 2010) (or OFUL (Abbasi-Yadkori et al., 2011)) follows the optimism-in-the-face-of-uncertainty principle and chooses the best action within an elliptical confidence ball. The algorithm has been proved to be nearly minimax optimal by using the elliptical potential lemma to track the bonus term. With some modifications, this algorithm has been generalized to various settings, e.g., non-stationary linear bandits (Chen et al., 2019) and multi-task linear bandits (Hu et al., 2021), to mention a few. The analyses for these generalizations require a correspondingly modified elliptical potential lemma, which is, however, hard to derive in general. (One may refer to the technical note (Faury et al., 2021), which discusses the faults in the elliptical potential lemma for non-stationary linear bandits.)

Another common approach for online decision-making is exponentially weighted sampling. By sampling from a distribution over actions based on their historical rewards, it gives rise to near-optimal policy-based algorithms for various settings, such as the hedge algorithm (Littlestone & Warmuth, 1994) for prediction with expert advice. For contextual bandits, EXP4 (Auer et al., 2002) enjoys a regret bound of $E[\mathrm{Regret}(T)] \le O(\sqrt{KT \log |\mathcal H|})$, where $K$ is the number of actions, $T$ is the length of the horizon, and $\mathcal H$ is the feasible policy set. Note that contextual policy-based algorithms usually allow the policy to take the round index as context. This gives a natural way to deal with non-stationary environments.
For instance, one can solve non-stationary expert problems using meta-experts that follow different experts in different rounds (Herbster & Warmuth, 2004). With this idea, one can obtain regret bounds for a variety of bandit models by counting the number of policies $|\mathcal H|$, which is easy to do in general. This motivates us to find a policy-based algorithm for linear contextual bandits. We note that EXP4 is not suitable for our purpose since its regret has a polynomial dependence on the number of actions, which can be unbounded in linear contextual bandits. This is because EXP4 is designed for general reward functions and does not leverage the linear structure of linear bandits. Given this observation, we raise the following question:

Can we design an EXP4-type algorithm for linear contextual bandits?

In this paper, we answer the above question affirmatively. In detail, we propose Feel-Good Thompson Sampling over Linear Policies (FGTS.LP), which is a policy-based algorithm for linear contextual bandits. Our algorithm can be regarded as a policy-based adaptation of feel-good Thompson sampling (Zhang, 2021) to linear bandits, although this adaptation is nontrivial. Our algorithm enjoys a regret bound that depends logarithmically on the number of policies and polynomially on the dimension of the contexts. To be specific, we prove the following regret bound for FGTS.LP:

Theorem 1.1 (Regret Bound of FGTS.LP (informal)). Let $d$ be the dimension of the context, $T$ be the length of the horizon, and $\mathcal H$ be the set of all feasible policy hypotheses. The regret of FGTS.LP is bounded by $E[\mathrm{Regret}(T)] \le \tilde O(\sqrt{dT \log N(\mathcal H, \epsilon)} + T\sqrt{d}\,\epsilon)$, where $\epsilon$ is a hyperparameter and $N(\mathcal H, \epsilon)$ is the covering number of the policy set, which contains an $\epsilon$-optimal policy.

The above theorem provides a general interface to analyze the performance of FGTS.LP in different settings. Following the idea of including the round index, FGTS.LP can deal with various linear contextual bandits.
The results are highlighted as follows:

Theorem 1.2 (Regret Bound over Variants of Linear Bandits (informal)). With specific modification for each setting, the regret of FGTS.LP is bounded as
• $E[\mathrm{Regret}(T)] \le \tilde O(d\sqrt{T} + T\sqrt{d}\,\zeta)$ for $\zeta$-misspecified linear contextual bandits.
• $E[\mathrm{Regret}(T)] \le \tilde O(d\sqrt{ST})$ for non-stationary linear contextual bandits with at most $S$ switches.
• $E[\mathrm{Regret}(T)] \le \tilde O(d^{5/6} T^{2/3} P^{1/3})$ for non-stationary linear contextual bandits with path length bounded by $P$.
• $E[\mathrm{Regret}(T)] \le \tilde O(d\sqrt{kT} + \sqrt{dkMT})$ for (generalized) lifelong linear contextual bandits over $M$ tasks that share an unknown representation of dimension $k$.

We note that the above results are all near-optimal: they match or improve the state of the art in the corresponding settings. To sum up, our contributions are:
• We propose a unified framework for designing and analyzing various linear contextual bandit models. Our framework is easy to interpret and enjoys near-optimal regret bounds in different settings.
• We propose the first nearly minimax optimal algorithm for non-stationary linear contextual bandits with a bounded number of switches.
• We propose a new algorithm for non-stationary linear contextual bandits with bounded path length. It is the first algorithm that achieves nearly minimax regret.
• We propose the first near-optimal algorithm for (generalized) lifelong linear contextual bandits. Its regret matches the state of the art for multi-task linear contextual bandits (Hu et al., 2021), which is a special case of our model.

Notation. We use lower and upper case bold face letters to denote vectors and matrices respectively. We use $[k]$ to denote the set $\{1, 2, \dots, k\}$. We denote the Euclidean norm of a vector $x \in \mathbb R^d$ by $\|x\|_2$. For a matrix $A = [a_1, \dots, a_k] \in \mathbb R^{d \times k}$, we define $\|A\|_{2,\infty} = \max_{1 \le i \le k} \|a_i\|_2$. For two non-negative sequences $\{a_n\}, \{b_n\}$, we write $a_n \le O(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \le C b_n$ for all $n \ge 1$, and $a_n \le \tilde O(b_n)$ if there exists an absolute constant $k$ such that $a_n \le O(b_n \log^k b_n)$; we write $a_n \ge \Omega(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \ge C b_n$ for all $n \ge 1$, and $a_n \ge \tilde\Omega(b_n)$ if there exists an absolute constant $k$ such that $a_n \ge \Omega(b_n \log^{-k} b_n)$; we write $a_n = \Theta(b_n)$ if there exist absolute constants $0 < C_1 \le C_2$ such that $C_1 b_n \le a_n \le C_2 b_n$ for all $n \ge 1$. For any set $C$, we use $|C|$ to denote its cardinality. We use $\log$ to denote $\log_e$ for short.

2. RELATED WORK

Misspecified Linear Bandits. Misspecified linear bandits were first studied by Ghosh et al. (2017), who proposed an algorithm that achieves sub-linear regret when the misspecification $\zeta$ is small. Lattimore & Szepesvari (2020) proposed an algorithm with an $\tilde O(d\sqrt{T} + T\sqrt{d}\,\zeta)$ regret, but it requires the contexts to be stationary. The regret is nearly minimax optimal, as they proved a matching lower bound in the same paper. Later, Zanette et al. (2020) proposed a LinUCB-like algorithm with the same regret that removes the requirement of stationary contexts. More recent works (Foster et al., 2020a; Krishnamurthy et al., 2021; Takemura et al., 2021) studied the problem when the misspecification level is unknown.

Non-Stationary Linear Bandits with Bounded Switches. Non-stationary online decision-making models with bounded switches have long been studied in the literature. Cesa-Bianchi et al. (1993) studied the problem in the full-information experts setting and proposed an algorithm with an expected regret bound of $\tilde O(\sqrt{ST})$, where $S$ is the number of switches. Later, a high-probability bound was obtained by a UCB-type algorithm (Auer, 2002). In the bandit feedback setting, EXP3.S (Auer et al., 2001) achieves a regret of $\tilde O(\sqrt{KST})$, where $K$ is the number of actions. Borrowing the idea from EXP4, the algorithm has been extended to the contextual setting with an $\tilde O(\sqrt{KST \log |\Pi|})$ regret bound (Luo et al., 2018), where $\Pi$ is the finite policy class. More recently, Luo et al. (2022) studied the non-stationary linear bandit setting where the action set is a unit ball and proposed a bandit-over-bandit approach that achieves a regret of $\tilde O(\sqrt{dST})$. There are no existing algorithms for non-stationary linear contextual bandits with bounded switches.

Non-Stationary Linear Bandits with Bounded Path Length. In recent years, non-stationary linear bandits with bounded path length have received increasing attention.
Various algorithms have been developed (Cheung et al., 2019; Chen et al., 2019; Zhao et al., 2020) in this setting. The key idea behind these algorithms is to progressively forget past data, and the proofs are based on the same kind of elliptical potential lemma. However, as discussed by Faury et al. (2021), there is a fault in the proof of the original paper (Cheung et al., 2019), and the corrected analysis can only get a regret bound of $\tilde O(d^{3/4} T^{3/4} P^{1/4})$. Meanwhile, the lower bound of this problem is $\Omega(T^{2/3} P^{1/3})$ (Faury et al., 2021). The algorithm derived from our framework closes the gap on $T$.

(Generalized) Lifelong Linear Bandits with Shared Representation. The multi-task linear bandit, in which the agent plays over a collection of tasks simultaneously, was first studied by Yang et al. (2020). They proposed an explore-then-commit algorithm that leverages the multi-task structure but requires a unit-ball action space. Their follow-up work (Yang et al., 2022) studied the lifelong setting, again with a unit-ball action space. For multi-task linear contextual bandits, Hu et al. (2021) achieved an $\tilde O(d\sqrt{kT} + \sqrt{dkMT})$ regret using a LinUCB-type approach. Recent work (Qin et al., 2022) studied non-stationary lifelong linear bandits under a task diversity assumption. On the lower bound side, Yang et al. (2020) proved that any algorithm for multi-task linear bandits suffers at least an $\Omega(d\sqrt{kT} + k\sqrt{MT})$ regret.

3.1. PROBLEM SETUP

We first introduce a framework that covers variants of linear contextual bandits. Let $d$ be the dimension of the context and $T$ be the length of the horizon. Denote by $\mathcal A \subseteq \{a \in \mathbb R^d : \|a\|_2 \le 1\}$ the action space and by $\mathcal F \subseteq \mathbb R^{\mathcal A}$ the reward function space. The contextual bandit can be described as a repeated game between an agent (the bandit algorithm) and the environment (the adversary). In each round $t = 1, \dots, T$, the environment first picks a hidden reward function $f^{(t)} \in \mathcal F$ and draws an action set $\mathcal A^{(t)} \subseteq \mathcal A$. After observing the action set $\mathcal A^{(t)}$, the agent selects an action $a^{(t)} \in \mathcal A^{(t)}$ and receives a stochastic reward $r^{(t)} = f^{(t)}(a^{(t)}) + \xi^{(t)}$, where $f^{(t)}(a^{(t)})$ is the expected value of the observed reward $r^{(t)}$ and $\xi^{(t)}$ is a zero-mean random noise satisfying $E[\xi^{(t)} \mid \Omega^{(t-1)}, f^{(t)}, \mathcal A^{(t)}, a^{(t)}] = 0$, where $\Omega^{(t)} = \{(f^{(\tau)}, \mathcal A^{(\tau)}, a^{(\tau)}, r^{(\tau)})\}_{\tau=1}^{t}$ is the history in the first $t$ rounds. The learning objective of the agent is to maximize the expected cumulative reward, or equivalently, to minimize the pseudo-regret

$\mathrm{Regret}(T) := \sum_{t=1}^{T} \big[ f^{(t)}(a_*^{(t)}) - f^{(t)}(a^{(t)}) \big]$, (1)

where $a_*^{(t)} = \arg\max_{a \in \mathcal A^{(t)}} f^{(t)}(a)$ is the optimal action in round $t$. We assume the reward functions can be approximated by linear functions.

Assumption 3.1. There is a mapping $\theta : \mathcal F \to \{x \in \mathbb R^d : \|x\|_2 \le 1\}$ and a universal constant $\zeta \in [0, 1]$ such that $\sup_{(f,a) \in \mathcal F \times \mathcal A} |\langle a, \theta(f) \rangle - f(a)| \le \zeta$. Moreover, we assume $|f(a)| \le 1$ for all $(f, a) \in \mathcal F \times \mathcal A$.

We call $\zeta$ the misspecification level of the model. For the special case $\zeta = 0$, the model reduces to standard linear bandits. We further assume the observed reward is universally bounded.

Assumption 3.2. The observed reward always satisfies $|r^{(t)}| \le 1$ for all $t \in [T]$.

This assumption ensures the additive noise $\xi^{(t)}$ is bounded. Note that this assumption is not essential to the algorithm: our analysis can be naturally generalized to unbounded sub-Gaussian noise with constant variance, as shown by Zhang (2021).
We use bounded noise to avoid introducing variance as an extra parameter. The above framework is compatible with variants of linear contextual bandits. For example, one can formulate specific linear contextual bandits by restricting the environment to select reward sequences (f (1) , • • • , f (T ) ) from the corresponding structural set. We will present some concrete instances in the sequel.
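The interaction protocol and the pseudo-regret above can be made concrete with a small sketch. The helper names (`dot`, `linear_reward`, `pseudo_regret`) and the two-round instance are illustrative assumptions, not part of the paper; the instance uses exactly linear rewards, i.e. $\zeta = 0$.

```python
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]
RewardFn = Callable[[Vector], float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product <u, v>."""
    return sum(x * y for x, y in zip(u, v))


def linear_reward(theta: Vector) -> RewardFn:
    """Hypothetical helper: an exactly linear reward f(a) = <a, theta>."""
    return lambda a: dot(a, theta)


def pseudo_regret(reward_fns: Sequence[RewardFn],
                  action_sets: Sequence[Sequence[Vector]],
                  chosen: Sequence[Vector]) -> float:
    """Sum over rounds of f_t(best action in A_t) - f_t(chosen action)."""
    total = 0.0
    for f, actions, a in zip(reward_fns, action_sets, chosen):
        best = max(f(b) for b in actions)
        total += best - f(a)
    return total


# Two rounds; the hidden parameter changes between them.
actions = [(1.0, 0.0), (0.0, 1.0)]
fs = [linear_reward((1.0, 0.0)), linear_reward((0.0, 1.0))]
regret = pseudo_regret(fs, [actions, actions], [(1.0, 0.0), (1.0, 0.0)])
```

Here the agent plays the first basis vector in both rounds, which is optimal in round one and suboptimal in round two, so only the second round contributes to the regret.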

3.2.1. STATIONARY MISSPECIFIED LINEAR BANDITS

For stationary linear contextual bandits, the reward function is fixed before the agent acts. Thus, the environment is restricted to select reward function sequences that have the same linear approximation across rounds.

Assumption 3.3. There exists a vector $\theta_0 \in \mathbb R^d$ such that $\theta(f^{(t)}) = \theta_0$ for all $t \in [T]$.

The reward function is allowed to be misspecified from linear functions, as in Assumption 3.1. It is easy to verify that our framework reduces to linear contextual bandits with misspecification (Foster et al., 2020a) under Assumptions 3.1-3.3.

3.2.2. NON-STATIONARY LINEAR BANDITS WITH BOUNDED SWITCHES

For non-stationary linear bandits, we first consider the case where the reward function may change dramatically a finite number of times (Auer et al., 2001; Luo et al., 2022). The agent is not told when and how the reward function switches, and the environment can schedule the changes in an adversarial way. We describe this setting using the following assumption:

Assumption 3.4. There exists a constant $S$ such that $\sum_{t=2}^{T} \mathbb 1[\theta(f^{(t)}) \ne \theta(f^{(t-1)})] \le S$.

We call $S$ the number of switches. We refer to the problem formulated under Assumptions 3.1, 3.2 and 3.4 as non-stationary linear bandits with bounded switches. We note that non-stationary linear bandits with bounded switches is a harder problem than that with bounded path length, since there is a black-box reduction from non-stationary linear bandits with bounded path length to non-stationary misspecified linear bandits with bounded switches. Recently, Luo et al. (2022) studied this setting, but for non-contextual linear bandits with $\mathcal A^{(t)} = \{a \in \mathbb R^d : \|a\|_2 \le 1\}$. They proposed an algorithm with $T^{1/2}$ regret. However, there is no existing algorithm for non-stationary contextual linear bandits with $T^{1/2}$ regret.
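As a small illustration of the switch count in Assumption 3.4, the sketch below counts the rounds at which the linear parameter differs from the previous round. The function name `count_switches` and the example sequence are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def count_switches(thetas: Sequence[Vector]) -> int:
    """Number of rounds t >= 2 with theta_t != theta_{t-1} (exact equality)."""
    return sum(1 for prev, cur in zip(thetas, thetas[1:]) if prev != cur)


# A parameter sequence over T = 5 rounds that changes twice.
thetas = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
num_switches = count_switches(thetas)
```

Any value of $S$ at least as large as `num_switches` makes this sequence feasible under the assumption.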

3.2.3. NON-STATIONARY LINEAR BANDITS WITH BOUNDED PATH LENGTH

We also consider another kind of non-stationary linear bandits, where the reward function can drift slowly over time (Cheung et al., 2019). The agent does not know the evolution dynamics, while the environment can choose the reward function adversarially in each round. We characterize this setting by the following assumption.

Assumption 3.5. There exists a constant $P$ such that $\sum_{t=2}^{T} \|\theta(f^{(t)}) - \theta(f^{(t-1)})\|_2 \le P$.

We call $P$ the path length. One can verify that our framework under Assumptions 3.1, 3.2 and 3.5 reduces to non-stationary linear bandits with bounded path length (Faury et al., 2021). The existing regret upper bound is of order $T^{3/4}$, while the lower bound is of order $T^{2/3}$, so there is still a gap between the regret upper and lower bounds.
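The path length in Assumption 3.5 is simply the total Euclidean movement of the parameter sequence. A minimal sketch, with the function name `path_length` and the example drift chosen here for illustration:

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def path_length(thetas: Sequence[Vector]) -> float:
    """Sum over t >= 2 of ||theta_t - theta_{t-1}||_2."""
    return sum(math.dist(prev, cur) for prev, cur in zip(thetas, thetas[1:]))


# The parameter moves once by a vector of norm 0.5 and then stays put.
drifting = [(0.0, 0.0), (0.3, 0.4), (0.3, 0.4)]
total_drift = path_length(drifting)
```

Unlike the switch count, this quantity stays small when the parameter changes in every round but only by tiny amounts.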

3.2.4. (GENERALIZED) LIFELONG LINEAR BANDITS WITH SHARED REPRESENTATION

We consider the setting where the agent has to solve a collection of correlated tasks sequentially. Let $M$ be the number of tasks. At the beginning of the process, the environment first chooses $M$ tasks $f_1, \dots, f_M$. In each round $t$, the environment draws a task index $m^{(t)} \in [M]$ and assigns the current reward function as $f^{(t)} = f_{m^{(t)}}$. Besides the action set $\mathcal A^{(t)}$, the agent is provided with the task index $m^{(t)}$ and selects the action $a^{(t)}$. It is worth noting that our lifelong learning setting is a generalized version of the lifelong learning setting studied in Yang et al. (2022). The original setting of lifelong learning, in which the same task only appears in one interval, corresponds to the case where the rounds sharing the same index $m^{(t)}$ form a single interval. We consider this generalized setting since it also captures multi-task learning (Hu et al., 2021), where the tasks appear periodically. We make the following assumption.

Assumption 3.6. There exists a hidden orthogonal matrix $B \in \mathbb R^{d \times k}$ and a set of hidden vectors $\{w_i\}_{i=1}^{M}$ with $w_i \in \mathbb R^k$ such that $\theta(f^{(t)}) = B w_{m^{(t)}}$ holds for all $t \in [T]$, where $\{m^{(t)}\}_{t=1}^{T}$ is some sequence with $m^{(t)} \in [M]$.

In Assumption 3.6, $B$ is called the linear feature extractor of the model. The assumption implies that $[\theta(f^{(1)}), \dots, \theta(f^{(T)})]$ is a low-rank matrix that can be decomposed over the basis $B$. This is a common assumption for representation learning over linear bandits in the literature (see, e.g., Hu et al., 2021; Yang et al., 2022). The generalized lifelong linear bandits is defined under our framework with Assumptions 3.1, 3.2 and 3.6. Recently, Yang et al. (2022) studied the lifelong learning setting for non-contextual linear bandits where $\mathcal A^{(t)} = \{a \in \mathbb R^d : \|a\|_2 \le 1\}$ and proposed an algorithm with $T^{1/2}$ regret. However, there is no existing algorithm for lifelong contextual linear bandits with $T^{1/2}$ regret.
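To see what the shared representation in Assumption 3.6 buys, the sketch below builds task parameters $\theta_m = B w_m$ from a small feature extractor and compares the number of free parameters with and without sharing. The function name `task_parameters` and the concrete numbers ($d = 3$, $k = 2$, $M = 3$) are illustrative assumptions.

```python
from typing import List, Sequence


def task_parameters(B: Sequence[Sequence[float]],
                    W: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return theta_m = B w_m for each task; B has d rows and k columns."""
    return [[sum(row[j] * w[j] for j in range(len(w))) for row in B] for w in W]


# d = 3, k = 2: the two columns of B are orthonormal.
B = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
# M = 3 task-specific vectors in R^k.
W = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]
thetas = task_parameters(B, W)

d, k, M = len(B), len(B[0]), len(W)
shared_params = d * k + k * M      # parameters of B plus all w_m
independent_params = d * M         # one unconstrained theta per task
```

In this toy instance the two counts are close because $M$ and $d$ are tiny; the gap opens up when $k$ is much smaller than both $d$ and $M$.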

4. FEEL-GOOD THOMPSON SAMPLING OVER LINEAR POLICIES

In this section, we present our main algorithm FGTS.LP (Feel-Good Thompson Sampling over Linear Policies) for solving linear contextual bandits under the framework described in Section 3. We first introduce some important concepts in the algorithm design.

4.1. LINEAR POLICY CLASS AND ITS COVERING

The regret defined in (1) depends on the time-varying action set $\mathcal A^{(t)}$, which is inconvenient to work with directly. To handle this, we introduce an essential concept in our algorithm. Instead of selecting an action $a^{(t)}$, we let the bandit algorithm select a sequence of policies $\pi^{(t)}$, which decides how the agent takes actions based on the observed context over time. Formally, a sequence of policies $\pi : [T] \times \mathcal A \to \mathbb R$ is a time-variant hypothesis for the expected reward of choosing any action in any round, i.e., the expected reward of action $a \in \mathcal A$ in round $t$ is $\pi(t, a)$. Thus, the sequence suggests playing action $a^{(t)} = \arg\max_{a \in \mathcal A^{(t)}} \pi(t, a)$ when the action set $\mathcal A^{(t)}$ is realized in round $t$. Since the reward functions in our setting can be approximated by linear functions, we focus on linear policies:

Definition 4.1 (Linear Policy). A linear policy $\pi(\Theta)$ is a sequence of models parameterized by the matrix $\Theta = [\theta_1, \dots, \theta_T] \in \mathbb R^{d \times T}$ such that $\pi(\Theta)(t, a) = \langle a, \theta_t \rangle$.

Let $\mathcal H$ be the set of all feasible linear policies. Under certain assumptions within our framework, the linear policy set is restricted to a corresponding domain. For example, under Assumption 3.3, the linear policy set $\mathcal H$ is parameterized by matrices $\Theta$ in which all columns are the same. Note that $\mathcal H$ is a set over a continuous domain and is thus of infinite size. To deal with the infinite number of linear policies, we introduce the $\epsilon$-net and the covering number.

Definition 4.2 ($\epsilon$-Net of Linear Policies). $\mathcal H_\epsilon$ is said to be an $\epsilon$-net of $\mathcal H$ if any linear policy $\Theta = [\theta_1, \dots, \theta_T] \in \mathcal H$ has an $\epsilon$-approximation $\tilde\Theta = [\tilde\theta_1, \dots, \tilde\theta_T] \in \mathcal H_\epsilon$ with $\|\theta_t - \tilde\theta_t\|_2 \le \epsilon$ for all $t \in [T]$. Moreover, we define the covering number of $\mathcal H$ (at scale $\epsilon$) to be the size of the smallest $\mathcal H_\epsilon$, i.e., $N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty}) := \inf |\mathcal H_\epsilon|$.

One can regard the metric entropy $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty})$ as the effective dimension of the policy set.
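The two definitions above can be checked mechanically on small finite examples. The sketch below represents a linear policy as a list of $T$ column vectors and implements the greedy action rule, the $\|\cdot\|_{2,\infty}$ distance, and a brute-force $\epsilon$-net check. All function names are illustrative, and the brute-force check is only meant for tiny finite sets, not for the continuous class $\mathcal H$.

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, ...]
Policy = Sequence[Vector]  # columns theta_1, ..., theta_T


def greedy_action(theta_t: Vector, action_set: Sequence[Vector]) -> Vector:
    """Action maximizing the predicted reward <a, theta_t> over the action set."""
    return max(action_set, key=lambda a: sum(x * y for x, y in zip(a, theta_t)))


def dist_2_inf(p: Policy, q: Policy) -> float:
    """max_t ||theta_t - theta~_t||_2, the distance used in Definition 4.2."""
    return max(math.dist(c, ct) for c, ct in zip(p, q))


def covers(net: Sequence[Policy], policies: Sequence[Policy], eps: float) -> bool:
    """True if every listed policy has an eps-approximation in the net."""
    return all(any(dist_2_inf(p, q) <= eps for q in net) for p in policies)


policy = [(1.0, 0.0), (0.0, 1.0)]    # T = 2 rounds, d = 2
approx = [(1.0, 0.0), (0.0, 0.5)]    # differs only in round 2
gap = dist_2_inf(policy, approx)
choice = greedy_action(policy[1], [(1.0, 0.0), (0.0, 1.0)])
```

Note that the distance takes a maximum over rounds rather than a sum, so a net at scale $\epsilon$ controls the per-round prediction error uniformly in $t$.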

4.2. THE ALGORITHM

Now we are ready to present our main algorithm, which is an adaptation of FGTS (Zhang, 2021). We present the pseudo-code in Algorithm 1.

Algorithm 1 FGTS.LP
input: linear policy set $\mathcal H$; inverse temperature $\beta$; exploration parameter $\lambda$; covering radius $\epsilon$.
construct an $\epsilon$-net $\mathcal H_\epsilon$ of $\mathcal H$ such that $\log |\mathcal H_\epsilon| = O(\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty}))$
for round $t = 1, \dots, T$ do
    sample $\Theta^{(t)} = [\theta_1^{(t)}, \dots, \theta_T^{(t)}] \sim P^{(t)}(\Theta) \propto \exp\big(-\beta \sum_{\tau=1}^{t-1} L^{(\tau)}(\Theta)\big)$ over $\mathcal H_\epsilon$, with $L^{(\tau)}$ defined in (3)
    take action $a^{(t)} \leftarrow \arg\max_{a \in \mathcal A^{(t)}} \langle a, \theta_t^{(t)} \rangle$
end for

At the beginning, the algorithm constructs an $\epsilon$-net $\mathcal H_\epsilon$ over all feasible linear policies $\mathcal H$. In each round $t$, the algorithm first samples a policy $\Theta^{(t)}$ over $\mathcal H_\epsilon$ according to the distribution

$P^{(t)}(\Theta) := \dfrac{\exp\big(-\beta \sum_{\tau=1}^{t-1} L^{(\tau)}(\Theta)\big)}{\sum_{\Theta' \in \mathcal H_\epsilon} \exp\big(-\beta \sum_{\tau=1}^{t-1} L^{(\tau)}(\Theta')\big)}$, (2)

where $\beta > 0$ is the inverse temperature and $L^{(t)} : \mathbb R^{d \times T} \to \mathbb R$ is a loss function defined as

$L^{(t)}(\Theta) = (r^{(t)} - \langle a^{(t)}, \theta_t \rangle)^2 - \lambda \max_{a \in \mathcal A^{(t)}} \langle a, \theta_t \rangle$. (3)

Here $\lambda > 0$ is a tuning parameter that controls the exploration. The algorithm then chooses the action $a^{(t)}$ that maximizes the expected reward according to the chosen policy $\Theta^{(t)}$ among the action set $\mathcal A^{(t)}$.

We note that the loss function in (3) has two terms. The first term is the Thompson sampling term. It penalizes the estimation error of the reward. Recall that $r^{(\tau)}$ is a random variable with mean $f^{(\tau)}(a^{(\tau)}) \approx \langle a^{(\tau)}, \theta(f^{(\tau)}) \rangle$, so this term forces the algorithm to select policies close to the true parameter $\theta(f^{(\tau)})$. This encourages the algorithm to exploit. The second term is the so-called feel-good exploration term (Zhang, 2021). It favors policies that predict large rewards on the historical action sets, and can be regarded as a bonus that encourages the algorithm to explore.

Note that Algorithm 1 needs to know $T$ and $\mathcal H$ beforehand. By restarting the algorithm periodically with the doubling trick, one can run the algorithm without knowing $T$ in advance.
However, it is unclear whether the algorithm can be made parameter-free with respect to $\mathcal H$, since model selection for exponentially weighted sampling algorithms remains an open question (Foster et al., 2020b). Furthermore, the algorithm samples over an $\epsilon$-net of $\mathcal H$, which is not necessarily computationally efficient in practice. So it is better to regard Algorithm 1 as an oracle-efficient algorithm, given a sampling oracle over $\mathcal H_\epsilon$. We present a detailed discussion of the practical implementation of this algorithm in Appendix B.
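To make the sampling step concrete, here is a minimal sketch of Algorithm 1 that enumerates a small finite net explicitly. It is not the implementation discussed in Appendix B: the function names, the `rounds` interface (a list of pairs of an action set and a reward callback), and the values $\beta = 1$, $\lambda = 0.1$ are assumptions for illustration, and explicit enumeration is only feasible when the net is tiny.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def fgts_loss(theta_t: Vector, action_set: Sequence[Vector],
              action: Vector, reward: float, lam: float) -> float:
    """Squared prediction error minus lam times the feel-good term."""
    fit = (reward - dot(action, theta_t)) ** 2
    feel_good = max(dot(a, theta_t) for a in action_set)
    return fit - lam * feel_good


def sampling_weights(cum_losses: Sequence[float], beta: float) -> List[float]:
    """P(Theta) proportional to exp(-beta * cumulative loss), shifted for stability."""
    base = min(cum_losses)
    weights = [math.exp(-beta * (loss - base)) for loss in cum_losses]
    total = sum(weights)
    return [w / total for w in weights]


def fgts_lp(net: Sequence[Sequence[Vector]],
            rounds: Sequence[Tuple[Sequence[Vector], Callable[[Vector], float]]],
            beta: float, lam: float, rng: random.Random):
    """Run the sampling loop over a finite net; returns actions and final weights."""
    cum = [0.0] * len(net)
    chosen = []
    for t, (action_set, draw_reward) in enumerate(rounds):
        probs = sampling_weights(cum, beta)
        idx = rng.choices(range(len(net)), weights=probs)[0]
        theta_t = net[idx][t]                      # column t of the sampled policy
        action = max(action_set, key=lambda a: dot(a, theta_t))
        reward = draw_reward(action)
        for i, policy in enumerate(net):           # update every policy's loss
            cum[i] += fgts_loss(policy[t], action_set, action, reward, lam)
        chosen.append(action)
    return chosen, sampling_weights(cum, beta)


# Toy stationary instance: two candidate policies, noiseless linear rewards.
T = 50
acts = [(1.0, 0.0), (0.0, 1.0)]
net = [[(1.0, 0.0)] * T, [(0.0, 1.0)] * T]
rounds = [(acts, lambda a: dot(a, (1.0, 0.0)))] * T
chosen, final_probs = fgts_lp(net, rounds, beta=1.0, lam=0.1, rng=random.Random(0))
```

In this toy instance the second policy mispredicts the observed reward by 1 in every round regardless of which action is played, so its cumulative loss grows linearly and the sampling distribution concentrates on the first policy.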

4.3. COMPARISON WITH EXISTING ALGORITHMS

Comparison with FGTS. One may see that FGTS.LP has the same structure as the original FGTS (Zhang, 2021). Both algorithms use exponentially weighted sampling over the same loss function. The main difference between the two algorithms is that the original FGTS samples over models (or equivalently, reward functions) rather than policies. Note that sampling over models is equivalent to choosing stationary policies, which always select the same action given the same context. Thus, the original FGTS cannot deal with non-stationary environments in which the reward functions change over time. In contrast, sampling over policies enables us to inspect the misspecification explicitly and thus analyze it meticulously. As a result, FGTS.LP can achieve nearly minimax regret bounds for misspecified linear bandits and non-stationary linear bandits.

Comparison with EXP4. Another contextual bandit algorithm that uses exponentially weighted sampling is EXP4 (Auer et al., 2002). Using exponentially weighted sampling, one can corral a large set of policies and achieve low regret as long as the reward for each policy is revealed. However, in the bandit setting the agent cannot observe the payoff of policies that are not chosen and has to estimate it. EXP4 uses an unbiased estimator to estimate the reward. However, the estimator has a large variance, which is proportional to the number of actions. This causes a polynomial dependence on the size of the action set in the regret bound. In comparison, FGTS.LP uses the expected reward reported by the policies but penalizes the policies that cannot estimate the payoff of the chosen actions. Although this estimator may be biased in estimating the actual reward, it assigns a large value to policies with a high return. So the regret analysis for exponentially weighted sampling (Littlestone & Warmuth, 1994) can still be applied to FGTS.LP.
Since the variance of this estimator is much smaller and independent of the number of actions, FGTS.LP is able to get a regret bound that is independent of the size of the action set.

Comparison with LinUCB. In addition to Thompson sampling and posterior sampling, another common approach for linear contextual bandits is LinUCB (Li et al., 2010) (or OFUL (Abbasi-Yadkori et al., 2011)). The algorithm implements the optimism-in-the-face-of-uncertainty principle, which selects the policy with the largest optimistic reward among all possible policies that agree with the rewards observed in history. Note that FGTS.LP tends to sample policies with large estimated reward on the historical data and small estimation error. Therefore, FGTS.LP can also be regarded as an implementation of the same principle. The difference is that, rather than focusing on one action, FGTS.LP uses exponentially weighted sampling to select a policy. From the game theory perspective, the variants of linear contextual bandits can be regarded as a two-player game. It is known that a mixed strategy profile usually generates better rewards than a pure strategy profile against adversarial opponents (Fudenberg & Tirole, 1991). Thus, FGTS.LP will have a stronger ability to solve variants of linear contextual bandits.

5. MAIN RESULTS

In this section, we present the theoretical results of our algorithm. The following theorem provides a regret guarantee for Algorithm 1 when applied to general linear bandits.

Theorem 5.1. Suppose $\epsilon \in [0, 1]$, and set $\lambda = \Theta\big(\sqrt{\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty})/(dT) + (\epsilon + \zeta)^2/d}\big)$ and $\beta = \Theta(1)$. Under Assumptions 3.1 and 3.2, for any policy class $\mathcal H$, the expected regret of Algorithm 1 is bounded by

$E[\mathrm{Regret}(T)] \le O\big(\sqrt{dT \log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty})} + T\sqrt{d}\,(\epsilon + \zeta)\big)$, (4)

where the expectation is taken over all randomness of the learning algorithm and the data noise.

There are two terms on the RHS of (4), which we explain separately. The first term depends on the metric entropy $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty})$, which describes the complexity of the policy set $\mathcal H$. The second term depends on the scale of the $\epsilon$-net $\mathcal H_\epsilon$ as well as the misspecification level $\zeta$. There is a trade-off between the first term and the second term driven by the scale of the $\epsilon$-net: for large $\epsilon$, the first term is small and the second term is large; for small $\epsilon$, the first term is large and the second term is small. With Theorem 5.1, one only needs to calculate the metric entropy $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty})$ and select a proper $\epsilon$ to minimize (4), i.e., to ensure $\sqrt{\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty})} \sim \sqrt{T}(\epsilon + \zeta)$. Since the metric entropy $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty})$ is the effective dimension of the policy class, it is easy to calculate for different policy classes and to derive the corresponding regret bound. As a result, Algorithm 1 can be viewed as a unified framework for linear contextual bandits.
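The trade-off described above can be explored numerically. The sketch below evaluates the two terms of the bound with all constants dropped, reading the first term as $\sqrt{dT \log N}$, which is the form consistent with Corollary 5.1. The function name, the choice $\log N \approx d \log(1/\epsilon)$ for the stationary class, and the concrete values of $d$ and $T$ are assumptions for illustration.

```python
import math
from typing import List, Tuple


def regret_bound_terms(d: int, T: int, log_cover: float,
                       eps: float, zeta: float = 0.0) -> Tuple[float, float]:
    """Return (complexity term, approximation term) with constants dropped."""
    complexity = math.sqrt(d * T * log_cover)
    approximation = T * math.sqrt(d) * (eps + zeta)
    return complexity, approximation


# Stationary class: metric entropy of order d * log(1/eps) (constants dropped).
d, T = 10, 10_000
sweep: List[Tuple[float, float, float]] = []
for eps in (1e-1, 1e-2, 1e-3, 1.0 / T):
    c, a = regret_bound_terms(d, T, d * math.log(1.0 / eps), eps)
    sweep.append((eps, c, a))
```

Shrinking $\epsilon$ increases the complexity term only logarithmically in this class while the approximation term falls linearly, which is why a very small radius such as $\epsilon = T^{-1}$ is affordable here.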

5.1. IMPLICATIONS TO LINEAR BANDIT VARIANTS

In this subsection, we show the implications of Theorem 5.1 on the specific examples of contextual linear bandit variants. For the ease of comparison between our results and prior results, we summarize the results in Table 1 .

5.1.1. MISSPECIFIED LINEAR BANDITS

We start with the simple setting where the approximating linear function is stationary. In this setting, the $\epsilon$-net of the linear policy class reduces to the $\epsilon$-net of a single reward function. Since the embedding is of dimension $d$, the covering number is $O(\epsilon^{-d})$:

Lemma 5.1 (Covering Number for Misspecified Linear Bandits). Under Assumptions 3.1 and 3.3, the metric entropy of the linear policy class satisfies $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty}) \le O(d \log \epsilon^{-1})$.

By choosing $\epsilon = T^{-1}$, the metric entropy can be bounded as $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty}) \le O(d \log T)$. With Theorem 5.1, we directly get the following regret bound:

Corollary 5.1 (Upper Bound for Misspecified Linear Bandits). Under Assumptions 3.1, 3.2 and 3.3, the expected regret of FGTS.LP is bounded by $E[\mathrm{Regret}(T)] \le O\big(d\sqrt{T \log T} + T\sqrt{d}\,\zeta\big)$.

The following lower bound shows that the regret upper bound in Corollary 5.1 is tight up to logarithmic factors:

Proposition 5.1 (Lower Bound for Misspecified Linear Bandits). Under Assumptions 3.1, 3.2 and 3.3, for any algorithm, there exists a bandit instance for which $E[\mathrm{Regret}(T)] \ge \Omega\big(d\sqrt{T} + T\sqrt{d/\log T}\,\zeta\big)$.

| Setting | Existing Algorithm | FGTS.LP (This Paper) | Lower Bounds |
|---|---|---|---|
| Misspecified LCB | $\tilde O(d\sqrt{T} + T\sqrt{d}\,\zeta)$ (Zanette et al., 2020) | $\tilde O(d\sqrt{T} + T\sqrt{d}\,\zeta)$ (Corollary 5.1) | $\Omega(d\sqrt{T} + T\sqrt{d}\,\zeta)$ (Lattimore & Szepesvari, 2020) |
| Non-stationary LCB (Bounded Switches) | - | $\tilde O(d\sqrt{ST})$ (Corollary 5.2) | $\Omega(d\sqrt{ST})$ (Lemma F.5) |
| Non-stationary LCB (Bounded Length) | $\tilde O(d^{3/4} T^{3/4} P^{1/4})$ (Faury et al., 2021) | $\tilde O(d^{5/6} T^{2/3} P^{1/3})$ (Corollary 5.3) | $\Omega(d^{2/3} T^{2/3} P^{1/3})$ (Cheung et al., 2019) |
| Multi-task LCB | $\tilde O(d\sqrt{kT} + \sqrt{dkMT})$ (Hu et al., 2021) | $\tilde O(d\sqrt{kT} + \sqrt{dkMT})$ (Corollary 5.4) | $\Omega(d\sqrt{kT} + k\sqrt{MT})$ (Yang et al., 2020) |
| Lifelong LCB | - | $\tilde O(d\sqrt{kT} + \sqrt{dkMT})$ (Corollary 5.4) | $\Omega(d\sqrt{kT} + k\sqrt{MT})$ (Yang et al., 2020) |

Table 1: Summary of the results on variants of linear contextual bandits. $d$ is the dimension of the context, $T$ is the length of the horizon, and $\zeta$ is the misspecification level for misspecified bandits. In addition, $S$ is the number of switches and $P$ is the path length for non-stationary bandits. Furthermore, $M$ denotes the number of tasks and $k$ denotes the representation dimension for multi-task and lifelong bandits.

5.1.2. NON-STATIONARY LINEAR BANDITS WITH BOUNDED SWITCHES

In this setting, we can construct the $\epsilon$-net by enumerating the positions of the switches and covering the reward function set after each switch. The set of reward functions after each switch has a covering number of $O(\epsilon^{-d})$, so the set of reward functions with $S$ switches has a covering number of $O(\epsilon^{-dS})$. Moreover, the number of different positions of the switches within $T$ rounds can be bounded by $O(T^S)$. With the above facts, we can bound the size of the $\epsilon$-net:

Lemma 5.2 (Covering Number for Non-Stationary Linear Bandits with Bounded Switches). Under Assumptions 3.1 and 3.4, the metric entropy of the linear policy set satisfies $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty}) \le O(dS \log \epsilon^{-1} + S \log T)$.

Using Theorem 5.1 with $\epsilon = T^{-1}$, we immediately obtain the following regret bound:

Corollary 5.2 (Upper Bound for Non-Stationary Linear Bandits with Bounded Switches). Under Assumptions 3.1, 3.2 and 3.4, the expected regret of FGTS.LP is bounded by $E[\mathrm{Regret}(T)] \le O\big(d\sqrt{ST \log T} + T\sqrt{d}\,\zeta\big)$.

The following lower bound shows that FGTS.LP is nearly minimax optimal for non-stationary linear bandits with bounded switches:

Proposition 5.2 (Lower Bound for Non-Stationary Linear Bandits with Bounded Switches). Under Assumptions 3.1, 3.2 and 3.4, for any algorithm, there exists a bandit instance for which $E[\mathrm{Regret}(T)] \ge \Omega\big(d\sqrt{ST} + T\sqrt{d/\log T}\,\zeta\big)$.
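The counting step behind the $S \log T$ term can be checked directly. A switch pattern picks at most $S$ of the $T - 1$ boundaries between consecutive rounds, so the number of patterns is a partial sum of binomial coefficients, which never exceeds $T^S$. The function name and the example values below are illustrative assumptions.

```python
import math


def num_switch_patterns(T: int, S: int) -> int:
    """Ways to place at most S switches on the T - 1 boundaries between rounds."""
    return sum(math.comb(T - 1, s) for s in range(S + 1))


T, S = 10, 2
patterns = num_switch_patterns(T, S)          # C(9,0) + C(9,1) + C(9,2)
log_patterns = math.log(patterns)
log_upper = S * math.log(T)                   # the S log T term in the lemma
```

Multiplying the number of patterns by the per-segment covering numbers and taking logarithms gives the two terms of the metric entropy in the lemma.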

5.1.3. NON-STATIONARY LINEAR BANDITS WITH BOUNDED PATH LENGTH

Now we move to the more challenging setting where the parameters slowly drift over time. We reduce parameter drift to parameter switches. More specifically, a parameter switch is considered to happen only if the path length of the parameter drift since the last switch exceeds $\epsilon/2$. Since the total path length is bounded by $P$, there are no more than $2P\epsilon^{-1}$ switches. As a result, an $\epsilon/2$-net of the reward functions for non-stationary linear bandits with no more than $2P\epsilon^{-1}$ switches is an $\epsilon$-net of the reward functions for non-stationary linear bandits with path length no more than $P$. This immediately gives the following result on its covering number:

Lemma 5.3 (Covering Number for Non-Stationary Linear Bandits with Bounded Path Length). Under Assumptions 3.1 and 3.5, the metric entropy of linear policies satisfies $\log N(\mathcal H, \epsilon, \|\cdot\|_{2,\infty}) \le O(d \log \epsilon^{-1} + dP\epsilon^{-1} \log \epsilon^{-1} + P\epsilon^{-1} \log T)$.

Using Theorem 5.1 with $\epsilon = \max\{T^{-1/3} d^{1/3} P^{1/3}, T^{-1}\}$, we get the following regret bound:

Corollary 5.3 (Upper Bound for Non-Stationary Linear Bandits with Bounded Path Length). Under Assumptions 3.1, 3.2 and 3.5, the expected regret of FGTS.LP is bounded by $E[\mathrm{Regret}(T)] \le \tilde O\big(d^{5/6} T^{2/3} P^{1/3} + d\sqrt{T} + T\sqrt{d}\,\zeta\big)$.

The following lower bound shows that the regret upper bound in Corollary 5.3 is nearly optimal:

Proposition 5.3 (Lower Bound for Non-Stationary Linear Bandits with Bounded Path Length). Under Assumptions 3.1, 3.2 and 3.5, for any algorithm, there exists a bandit instance for which $E[\mathrm{Regret}(T)] \ge \Omega\big(d^{2/3} T^{2/3} P^{1/3} + d\sqrt{T} + T\sqrt{d/\log T}\,\zeta\big)$.
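The drift-to-switch reduction described above can be sketched as a simple greedy segmentation: hold the parameter fixed at an anchor and start a new segment once the path length accumulated since the anchor exceeds $\epsilon/2$. The function name and the example drift are illustrative assumptions; this is one way to realize the reduction, not necessarily the construction used in the proof.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def piecewise_constant_approx(thetas: Sequence[Vector],
                              eps: float) -> Tuple[List[Vector], int]:
    """Approximate a drifting sequence by a switching one within eps/2 per round."""
    approx = [thetas[0]]
    anchor = thetas[0]
    drift = 0.0          # path length accumulated since the current anchor
    switches = 0
    for prev, cur in zip(thetas, thetas[1:]):
        drift += math.dist(prev, cur)
        if drift > eps / 2:
            anchor, drift = cur, 0.0
            switches += 1
        approx.append(anchor)
    return approx, switches


# A parameter drifting along the first axis in steps of 0.1; path length P = 1.
drifting = [(0.1 * i, 0.0) for i in range(11)]
eps, P = 0.5, 1.0
approx, switches = piecewise_constant_approx(drifting, eps)
worst_gap = max(math.dist(a, b) for a, b in zip(drifting, approx))
```

Each switch uses up more than $\epsilon/2$ of the path length, which is why the number of switches stays below $2P\epsilon^{-1}$, and within a segment the parameter never moves further than $\epsilon/2$ from its anchor.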

5.1.4. (GENERALIZED) LIFELONG LINEAR BANDITS WITH SHARED REPRESENTATION

In this setting, every feasible linear policy can be described by a low-rank matrix that is the product of two matrices $B \in \mathbb{R}^{d\times k}$ and $[w_i]_{i=1}^{M} \in \mathbb{R}^{k\times M}$. We can construct the $\epsilon$-net of linear policies using the product of two $\epsilon/2$-nets over $B$ and $[w_i]_{i=1}^{M}$. Also, the dimension of linear policies is $dk + kM$. This implies the following result on the covering number:

Lemma 5.4 (Covering Number for Lifelong Bandits with Shared Representation). Under Assumptions 3.1 and 3.6, the metric entropy of linear policies satisfies $\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le O(dk\log\epsilon^{-1} + kM\log\epsilon^{-1})$.

Using Theorem 5.1 with $\epsilon = T^{-1}$, we obtain the following regret bound:

Corollary 5.4 (Upper Bound for Lifelong Linear Bandits with Shared Representation). Under Assumptions 3.1, 3.2 and 3.6, the expected regret of FGTS.LP is bounded by $\mathbb{E}[\mathrm{Regret}(T)] \le O\big(d\sqrt{kT} + \sqrt{dkMT} + T\sqrt{d}\,\zeta\big)$.

If each task appears uniformly for $T/M$ rounds, an algorithm that solves each task independently suffers a regret of $\Theta(d\sqrt{MT})$ using a nearly minimax optimal algorithm such as LinUCB (Li et al., 2010). In the case that $k \ll M$ and $k \ll d$, our algorithm saves a factor of $M/k$ or $M/d$, which shows that it utilizes the underlying representation. Moreover, this regret bound matches that of the near-optimal algorithm for multi-task linear bandits (Hu et al., 2021), which is a special case of our setting. The following lower bound shows that our algorithm is near-optimal:

Proposition 5.4 (Lower Bound for Lifelong Linear Bandits with Shared Representation). Under Assumptions 3.1, 3.2 and 3.6, for any algorithm, there exists a bandit instance for which $\mathbb{E}[\mathrm{Regret}(T)] \ge \Omega\big(d\sqrt{kT} + k\sqrt{MT} + T\sqrt{d/\log T}\,\zeta\big)$.
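To make the comparison with independent per-task learning concrete, the following sketch evaluates the leading-order terms of the two bounds with $\zeta = 0$, ignoring constants and logarithmic factors. The function names are hypothetical, and the values $d = 40$, $k = 2$, $M = 20$, $T = 3000$ are borrowed from the simulation setup in Appendix B.2 purely for illustration.

```python
import math


def independent_bound(d: int, M: int, T: int) -> float:
    """Leading-order regret d * sqrt(M * T) of solving each task separately."""
    return d * math.sqrt(M * T)


def shared_bound(d: int, k: int, M: int, T: int) -> float:
    """Leading-order regret d * sqrt(k * T) + sqrt(d * k * M * T) with zeta = 0."""
    return d * math.sqrt(k * T) + math.sqrt(d * k * M * T)


d, k, M, T = 40, 2, 20, 3000
ratio = independent_bound(d, M, T) / shared_bound(d, k, M, T)
print(f"independent / shared = {ratio:.2f}")
```

The ratio grows as $k$ shrinks relative to $M$ and $d$, which is the regime where sharing the representation pays off.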

6. CONCLUSION AND FUTURE WORK

In this paper, we proposed a unified framework for linear contextual bandits based on feel-good Thompson sampling, which covers different variants of linear contextual bandits. At the core of our algorithm is an adaptation of feel-good Thompson sampling from the reward function class to the policy class, which enables the algorithm to deal with time-varying environments. We showed that our algorithm achieves near-optimal regret bounds for these variants of linear bandits, which resolve several open problems in the respective settings. We note that there is still a gap between the regret upper and lower bounds in terms of the dependence on the dimension $d$ for non-stationary linear bandits with bounded path length. We leave closing this gap to future work.

A ADDITIONAL RELATED WORK

Thompson Sampling. Thompson sampling is a classical algorithm that can be adapted to contextual bandits (Thompson, 1933). Russo & Roy (2014) proposed the first general theoretical guarantee for Thompson sampling. By utilizing the connection between Thompson sampling and optimistic policies, they showed that Thompson sampling achieves a Bayesian regret bound of $O(d\sqrt{T})$ for linear bandits, where the underlying parameters are randomly drawn from some public distribution. For frequentist regret, Agrawal & Goyal (2013) showed that a modified Thompson sampling in which the variance of the posterior distribution is inflated by $O(d)$ achieves a regret bound of $O(d^{3/2}\sqrt{T})$ on linear bandits. As shown by Hamidi & Bayati (2020), inflation is necessary for Thompson sampling to achieve sub-linear regret; thus, one cannot improve the regret bounds for the original Thompson sampling. In comparison, feel-good Thompson sampling (Zhang, 2021) introduced an optimistic bonus on the loss function, making it possible to achieve a nearly minimax regret on linear contextual bandits.

Exponentially Weighted Sampling. A classical technique for online learning is exponentially weighted sampling. This approach allows an algorithm to corral a collection of policies with small regret. By constructing unbiased estimators for unobserved actions, Auer et al. (2002) adapted the idea to contextual bandits and proposed EXP4. Its regret suffers a polynomial dependence on the number of actions. For linear bandits with stationary contexts, the dependence can be reduced to logarithmic using an EXP2-type algorithm (Bubeck et al., 2012). The algorithm can also be regarded as EXP3 with a modified estimator, which can be implemented efficiently for specific action sets using stochastic mirror descent.

UCB-Based Algorithms. The concept of the upper confidence bound was first proposed by Lai & Robbins (1985) to solve multi-armed bandits. The algorithm can be adapted to linear contextual bandits with an infinite number of actions by analyzing the dynamics of the confidence set using an elliptical potential lemma (Dani et al., 2008; Li et al., 2010; Abbasi-Yadkori et al., 2011). However, it is hard to obtain the potential lemma for other function spaces, making it difficult to generalize the approach to a broader function class. A recent line of work showed that contextual bandits can be reduced to regression oracles (Foster et al., 2018; Foster & Rakhlin, 2020). This observation pointed out a way for UCB-based algorithms to work on general function classes. However, these algorithms suffer from a polynomial dependence on the number of actions and therefore cannot handle infinite action sets.

B IMPLEMENTATION OF THE PROPOSED ALGORITHM

In this section, we discuss the practical implementation of our main algorithm FGTS.LP. In particular, we present an efficient algorithm to sample approximately from the distribution defined in (2). Consider the distribution $\tilde{P}^{(t)}$ whose probability density function over $\Theta$ satisfies
$$\tilde{P}^{(t)}(\Theta) \propto \exp\Big(-\beta\sum_{\tau=1}^{t-1} L^{(\tau)}(\Theta)\Big)\cdot P_0(\Theta), \tag{5}$$
where $L^{(t)}$ is the loss function defined in (3) and $P_0$ is some prior distribution to be determined. One can see that $\tilde{P}^{(t)}$ is identical to $P^{(t)}$ if one chooses the uniform distribution supported on the discrete set $\mathcal{H}_\epsilon$ as the prior. For an efficient implementation, we instead use a continuous distribution that covers $\mathcal{H}$. More details can be found in the following sections.

Inspired by Xu et al. (2022), we use Langevin Monte Carlo (LMC) (Roberts & Tweedie, 1996; Bakry et al., 2013) to sample from (5). Specifically, on round $t$, we run the following subroutine for $K_t$ steps. For iteration $s = 1, \dots, K_t$, we set
$$\Theta_{t,s} = \Theta_{t,s-1} - \eta_{t,s}\sum_{\tau=1}^{t-1}\nabla L^{(\tau)}(\Theta_{t,s-1}) + \beta^{-1}\eta_{t,s}\nabla\log P_0(\Theta_{t,s-1}) + \sqrt{2\beta^{-1}\eta_{t,s}}\,\epsilon_{t,s}, \tag{6}$$
where $\epsilon_{t,s}$ is a standard Gaussian random matrix and $\eta_{t,s}$ is the step size. This update rule can be regarded as the Euler-Maruyama discretization of the Langevin dynamics
$$d\Theta(s) = -\sum_{\tau=1}^{t-1}\nabla L^{(\tau)}(\Theta(s))\,ds + \beta^{-1}\nabla\log P_0(\Theta(s))\,ds + \sqrt{2\beta^{-1}}\,dB(s), \tag{7}$$
where $s > 0$ is the continuous time index and $B(s) \in \mathbb{R}^{d\times T}$ is a Brownian motion. It has been shown that, under mild conditions, the above Langevin dynamics converges to the stationary distribution (5), so it is reasonable to sample from (5) using LMC. We list the corresponding pseudo-code in Algorithm 2.

Algorithm 2 LMC-FGTS.LP, General Framework
input: prior distribution $P_0$; inverse temperature $\beta$; exploration parameter $\lambda$; number of iterations $\{K_t\}_{t\ge 1}$; step sizes $\{\eta_{t,s}\}_{t,s\ge 1}$.
sample matrix $\Theta^{(0)} \in \mathbb{R}^{d\times T}$ from $\mathcal{H}$ according to $P_0$
for round $t = 1, \dots, T$ do
    $\Theta_{t,0} \leftarrow \Theta^{(t-1)}$
    for iteration $s = 1, \dots, K_t$ do
        sample matrix $\epsilon_{t,s} \sim N(0, I_{d\times T})$
        $\Theta_{t,s} \leftarrow \Theta_{t,s-1} - \eta_{t,s}\sum_{\tau=1}^{t-1}\nabla L^{(\tau)}(\Theta_{t,s-1}) + \beta^{-1}\eta_{t,s}\nabla\log P_0(\Theta_{t,s-1}) + \sqrt{2\beta^{-1}\eta_{t,s}}\,\epsilon_{t,s}$
    end for
    $\Theta^{(t)} \leftarrow \Theta_{t,K_t}$
    take action $a^{(t)} \leftarrow \arg\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^{(t)}_t\rangle$, where $\theta^{(t)}_t$ is the $t$-th column of $\Theta^{(t)}$
end for

It is important to mention that the above algorithm is a general framework for linear contextual bandits, and it can be further refined for specific settings. For example, when the linear policy set $\mathcal{H}$ only contains matrices $\Theta$ whose columns are all identical, it is sufficient to record only one column of $\Theta$. In the following sections, we discuss how to implement the algorithm for specific linear contextual bandits and present the corresponding simulation results.
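The inner loop of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter is a flat vector rather than a matrix, the step size is constant rather than the schedule $\eta_{t,s}$, and the function name `lmc_sample`, the toy quadratic loss and the flat prior are all hypothetical choices made so that the stationary distribution is known in closed form.

```python
import math
import random


def lmc_sample(grad_loss_sum, grad_log_prior, theta0, beta, step, n_steps, rng):
    """Run n_steps of the Langevin update and return the final iterate."""
    theta = list(theta0)
    noise_scale = math.sqrt(2.0 * step / beta)  # sqrt(2 * beta^{-1} * eta)
    for _ in range(n_steps):
        g = grad_loss_sum(theta)   # gradient of the summed losses
        p = grad_log_prior(theta)  # gradient of log P_0
        theta = [
            th - step * gi + (step / beta) * pi + noise_scale * rng.gauss(0.0, 1.0)
            for th, gi, pi in zip(theta, g, p)
        ]
    return theta


# Toy target (assumed): loss ||theta - mu||^2 with a flat prior, so the
# stationary law is Gaussian around mu with variance 1 / (2 * beta).
mu = [1.0, -2.0]
sample = lmc_sample(
    grad_loss_sum=lambda th: [2.0 * (t - m) for t, m in zip(th, mu)],
    grad_log_prior=lambda th: [0.0 for _ in th],
    theta0=[0.0, 0.0],
    beta=100.0,
    step=0.01,
    n_steps=2000,
    rng=random.Random(0),
)
print(sample)
```

With a large inverse temperature $\beta$ the chain concentrates near the loss minimizer, while smaller $\beta$ yields more diffuse samples and hence more exploration.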

B.1 MISSPECIFIED LINEAR CONTEXTUAL BANDITS

We first discuss the implementation for misspecified linear contextual bandits, i.e., the variant covered by Corollary 5.1. In this case, we parameterize any matrix $\Theta = [\theta, \dots, \theta]$ in $\mathcal{H}$ by its column vector $\theta$. We choose the Gaussian distribution $N(0, I_d/d)$ as the prior $P_0$. One can check that the expected squared norm of a random vector sampled from $P_0$ is 1, which matches the hypothesis set given by Assumptions 3.1 and 3.3. In the algorithm, we slightly abuse notation and use $L^{(t)}$ to denote the loss function $\mathbb{R}^d \to \mathbb{R}$,
$$L^{(t)}(\theta) = (r^{(t)} - \langle a^{(t)}, \theta\rangle)^2 - \lambda\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta\rangle, \tag{8}$$
which is the counterpart of (3) for single-round policies. We list the pseudo-code in Algorithm 3.

Algorithm 3
sample vector $\theta^{(0)} \sim N(0, I_d/d)$
for round $t = 1, \dots, T$ do
    $\theta_{t,0} \leftarrow \theta^{(t-1)}$
    for iteration $s = 1, \dots, K_t$ do
        sample vector $\epsilon_{t,s} \sim N(0, I_d)$
        $\theta_{t,s} \leftarrow \theta_{t,s-1} - \eta_{t,s}\sum_{\tau=1}^{t-1}\nabla L^{(\tau)}(\theta_{t,s-1}) - d\beta^{-1}\eta_{t,s}\theta_{t,s-1} + \sqrt{2\beta^{-1}\eta_{t,s}}\,\epsilon_{t,s}$
    end for
    $\theta^{(t)} \leftarrow \theta_{t,K_t}$
    take action $a^{(t)} \leftarrow \arg\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^{(t)}\rangle$
end for

Simulation Results. We generate the synthetic data in the following way. The horizon length is set to $T = 3000$ and the feature dimension is set to $d = 100$. We first generate $\theta_0$ randomly from the unit sphere $S^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$. On each round $t$, $A = 20$ actions $\{a^{(t)}_i\}_{i=1,\dots,A}$ are independently drawn from the same unit sphere. The reward of action $a^{(t)}_i$ is set to $f^{(t)}(a^{(t)}_i) = \langle a^{(t)}_i, \theta_0\rangle + \mathrm{Unif}(-\zeta, \zeta)$, where $\mathrm{Unif}(l, u)$ is the uniform distribution on $[l, u]$, for misspecification level $\zeta \in \{0, 0.05, 0.1\}$. The observed reward is then given by $r^{(t)} = f^{(t)}(a^{(t)}) + \xi^{(t)}$, where $\xi^{(t)}$ is Gaussian noise with standard deviation 0.2. We run Algorithm 3 with exploration parameter $\lambda = \sqrt{1/T + \zeta^2/d}$ according to our theoretical results given by Theorem 5.1 and Lemma 5.1. We set the temperature parameter $\beta^{-1}$ to 0.01 following Xu et al. (2022), which is 1/4 of the variance of the reward noise.
On each round $t$, the algorithm performs $K_t = 200$ iterations, and the step size of iteration $s$ is $\eta_{t,s} = 1/(ds)$. To speed up the algorithm, we do not recompute the exact maximizing action in (8) at every iteration. Instead, we record the maximizing action and recompute it only in the first 14 steps and once every 14 steps afterwards. We compare our algorithm to LinUCB with adaptation to misspecification (Zanette et al., 2020) and Linear Thompson Sampling (Abeille & Lazaric, 2017). The result is shown in Figure 1. One can see that our algorithm has performance comparable to that of these existing algorithms.
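The single-round loss $(r - \langle a, \theta\rangle)^2 - \lambda\max_{a'}\langle a', \theta\rangle$ used in this section and a subgradient of it can be sketched as follows. This is a hedged illustration: the function names and the optional `cached_best` argument are assumptions, the latter mimicking the idea of reusing a previously computed maximizing action instead of recomputing it at every LMC step.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def feel_good_loss(theta, action, reward, action_set, lam):
    """Squared error minus lam times the best predicted reward."""
    best_value = max(dot(a, theta) for a in action_set)
    return (reward - dot(action, theta)) ** 2 - lam * best_value


def feel_good_grad(theta, action, reward, action_set, lam, cached_best=None):
    """Subgradient of the loss; cached_best skips the arg max if supplied."""
    if cached_best is None:
        cached_best = max(action_set, key=lambda a: dot(a, theta))
    residual = reward - dot(action, theta)
    return [-2.0 * residual * a_i - lam * b_i
            for a_i, b_i in zip(action, cached_best)]


# Toy values (assumed): two orthogonal unit actions in dimension 2.
theta = [0.5, -0.2]
actions = [[1.0, 0.0], [0.0, 1.0]]
loss = feel_good_loss(theta, actions[0], 1.0, actions, lam=0.1)
grad = feel_good_grad(theta, actions[0], 1.0, actions, lam=0.1)
print(loss, grad)
```

Passing a stale `cached_best` trades a small bias in the feel-good term for avoiding a pass over the action set at every step.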

B.2 LIFELONG LINEAR CONTEXTUAL BANDITS

This section discusses the practical implementation for lifelong linear bandits, i.e., the variant covered by Corollary 5.4. In this case, the linear policy set $\mathcal{H}$ only contains matrices $\Theta$ that can be decomposed using two matrices $B \in \mathbb{R}^{d\times k}$ and $W = \{w_i\}_{i=1}^{M} \in \mathbb{R}^{M\times k}$. Thus, we do not need to maintain the whole sequence of models $\Theta$; it is sufficient to keep those two matrices. Moreover, rather than taking LMC iterations on $\Theta$, we compute the gradients with respect to $B$ and $W$ and update them directly. The prior distribution is also defined on those two matrices. We take $N(0, kI_{d\times k}/d)$ and $N(0, I_{k\times M}/k)$ as the prior distributions for $B$ and $W$, respectively. For initialization, we use a different distribution, which ensures that the length of $\theta$ is about 1, to improve the overall performance. On every iteration of round $t$, the algorithm computes the linear policies $\theta_{t,s-1,i}$ for all historical tasks and uses them to compute the gradients. The loss function is defined in (8). We list the pseudo-code in Algorithm 4.

Algorithm 4 LMC-FGTS.LP for (Generalized) Lifelong Linear Contextual Bandits
input: task index sequence $\{m^{(t)}\}_{t=1}^{T}$; inverse temperature $\beta$; exploration parameter $\lambda$; number of iterations $\{K_t\}_{t\ge 1}$; step sizes $\{\eta_{t,s}\}_{t,s\ge 1}$.
sample matrix $B^{(0)}$ from orthogonal random matrices in $\mathbb{R}^{d\times k}$
sample matrix $W^{(0)} \sim N(0, I_{k\times M}/k)$
for round $t = 1, \dots, T$ do
    $(B_{t,0}, W_{t,0}) \leftarrow (B^{(t-1)}, W^{(t-1)})$
    for iteration $s = 1, \dots, K_t$ do
        sample matrices $\epsilon^{B}_{t,s} \sim N(0, I_{d\times k})$ and $\epsilon^{W}_{t,s} \sim N(0, I_{k\times M})$
        compute $\theta_{t,s-1,i} \leftarrow B_{t,s-1}w_{t,s-1,i}$ for $i \in [M]$, where $w_{t,s-1,i}$ is the $i$-th row of $W_{t,s-1}$
        $B_{t,s} \leftarrow B_{t,s-1} - \eta_{t,s}\sum_{\tau=1}^{t-1}\nabla_B L^{(\tau)}(\theta_{t,s-1,m^{(\tau)}}) - dk^{-1}\beta^{-1}\eta_{t,s}B_{t,s-1} + \sqrt{2\beta^{-1}\eta_{t,s}}\,\epsilon^{B}_{t,s}$
        $W_{t,s} \leftarrow W_{t,s-1} - \eta_{t,s}\sum_{\tau=1}^{t-1}\nabla_W L^{(\tau)}(\theta_{t,s-1,m^{(\tau)}}) - k\beta^{-1}\eta_{t,s}W_{t,s-1} + \sqrt{2\beta^{-1}\eta_{t,s}}\,\epsilon^{W}_{t,s}$
    end for
    $(B^{(t)}, W^{(t)}) \leftarrow (B_{t,K_t}, W_{t,K_t})$
    compute $\theta^{(t)} \leftarrow B^{(t)}w^{(t)}_{m^{(t)}}$, where $w^{(t)}_{m^{(t)}}$ is the $m^{(t)}$-th row of $W^{(t)}$
    take action $a^{(t)} \leftarrow \arg\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^{(t)}\rangle$
end for

Simulation Results. We generate the synthetic data in the following way. The horizon length is set to $T = 3000$ and the feature dimension is set to $d = 40$. There are $M = 20$ tasks and each of them appears in exactly $T/M = 100$ consecutive rounds, i.e., the task on round $t$ is given by $m^{(t)} = \lceil t/100\rceil$. The dimension of the hidden subspace is chosen from $k \in \{2, 3, 4\}$. The hidden linear feature extractor $B$ is a random orthogonal matrix in $\mathbb{R}^{d\times k}$ and the task-specific vectors $\{w_i\}_{i=1}^{M}$ are independent random vectors from the unit sphere $S^{k-1} = \{x \in \mathbb{R}^k : \|x\|_2 = 1\}$. On each round $t$, $A = 20$ actions $\{a^{(t)}_i\}_{i=1,\dots,A}$ are randomly drawn from the unit sphere $S^{d-1}$. We fix the misspecification level to be $\zeta = 0$. The observed reward of action $a^{(t)}_i$ is given by $r^{(t)} = \langle a^{(t)}_i, Bw_{m^{(t)}}\rangle + \xi^{(t)}$, where $\xi^{(t)}$ is an independent Gaussian random variable with standard deviation 0.2. We run Algorithm 4 with exploration parameter $\lambda = \sqrt{(dk + Mk)/(dT)}$ and temperature parameter $\beta^{-1} = 0.01$. On each round $t$, the algorithm performs $K_t = 100$ iterations and the step size of iteration $s$ is $\eta_{t,s} = 1/(dks)$. Similar to misspecified linear contextual bandits, we record the maximizing action and recompute it only in the first 10 steps and once every 10 steps afterwards to speed up the algorithm. Since there is no existing algorithm for lifelong linear contextual bandits, we compare the performance with running LinUCB on each task independently. The result is shown in Figure 2. One can see that our algorithm leverages the underlying representation with a good constant factor.

C PROOF SKETCH OF THEOREM 5.1

This section presents a proof sketch of our main result, Theorem 5.1. The proof is similar to the proof of Theorem 2 in Zhang (2021). The main differences are the use of the covering number of the policy set and our decoupling lemma under misspecification.

Let $\Theta^* = [\theta^*_1, \dots, \theta^*_T]$ be the optimal policy in the $\epsilon$-net $\mathcal{H}_\epsilon$. For any linear policy $\Theta = [\theta_1, \dots, \theta_T]$, denote $\Delta_t(\Theta, a^{(t)}) := \langle a^{(t)}, \theta_t - \theta^*_t\rangle$ as the Bellman error and $\Delta^*_t(\Theta) := \max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle - \max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^*_t\rangle$ as the feel-good error. Since $\langle a^{(t)}, \theta_t\rangle = \max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle$ by the definition of $a^{(t)}$, we can decompose the expected regret on round $t$ as
$$\mathbb{E}_{a^{(t)}}\big[f^{(t)}(a^{(t)}_*) - f^{(t)}(a^{(t)})\big] \le \mathbb{E}_{a^{(t)},\Theta^{(t)}}\big[\Delta_t(\Theta^{(t)}, a^{(t)}) - \Delta^*_t(\Theta^{(t)})\big] + 2(\epsilon + \zeta). \tag{9}$$
Besides, we construct a potential function following the classical analysis of exponentially weighted sampling,
$$\Phi^{(t)} = \frac{1}{\beta\lambda}\log\sum_{\Theta\in\mathcal{H}_\epsilon}\exp\Big(\beta\sum_{\tau=1}^{t}\big(L^{(\tau)}(\Theta^*) - L^{(\tau)}(\Theta)\big)\Big).$$
The boundary values of the potential function satisfy
$$\Phi^{(0)} - \Phi^{(T)} \le \frac{\log|\mathcal{H}_\epsilon|}{\beta\lambda}. \tag{10}$$
The increment of the potential function on round $t$ satisfies
$$\mathbb{E}_{a^{(t)},r^{(t)}}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] \le \frac{1}{\beta\lambda}\mathbb{E}_{a^{(t)}}\log\mathbb{E}_{r^{(t)}}\mathbb{E}_{\Theta}\exp\Big(\beta\big(L^{(t)}(\Theta^*) - L^{(t)}(\Theta)\big)\Big). \tag{11}$$
Recall that $L^{(t)}(\Theta) = (r^{(t)} - \langle a^{(t)}, \theta_t\rangle)^2 - \lambda\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle$, so the randomness of $r^{(t)}$ only contributes an additive noise $-2\beta\xi^{(t)}\Delta_t(\Theta, a^{(t)})$ to the term $\beta\big(L^{(t)}(\Theta^*) - L^{(t)}(\Theta)\big)$ in the exponent.
Since $\xi^{(t)}$ is bounded and zero-mean, the noise can be controlled by selecting a proper $\beta$ according to Hoeffding's lemma. With classical inequalities for the logarithm and the exponential, we can show that the variation of the potential function satisfies
$$\mathbb{E}_{a^{(t)},r^{(t)}}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] \le -\frac{1}{\lambda}\mathbb{E}_{a^{(t)},\Theta}\big[\Delta_t(\Theta, a^{(t)})^2 + C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big] + \mathbb{E}_{\Theta}\Delta^*_t(\Theta) + O\Big(\beta\lambda + \frac{\beta(\epsilon + \zeta)^2}{\lambda}\Big), \tag{12}$$
where $C_t(a^{(t)}) = f^{(t)}(a^{(t)}) - \langle a^{(t)}, \theta^*_t\rangle$ is the misspecification of $\Theta^*$ with respect to the ground truth $f^{(t)}$.

One can see that the first two terms on the RHS of (12) are very similar to the first two terms on the RHS of (9), with one critical difference: the chosen policy $\Theta^{(t)}$ in (9) is replaced by a random policy $\Theta$ in (12). Although $\Theta^{(t)}$ and $\Theta$ have the same marginal distribution, $\Theta^{(t)}$ is correlated with $a^{(t)}$ while $\Theta$ is independent of $a^{(t)}$. As a result, $\mathbb{E}_{\Theta^{(t)}}\Delta^*_t(\Theta^{(t)}) = \mathbb{E}_{\Theta}\Delta^*_t(\Theta)$, but $\mathbb{E}_{a^{(t)},\Theta^{(t)}}[\Delta_t(\Theta^{(t)}, a^{(t)})]$ is not necessarily equal to $\mathbb{E}_{a^{(t)},\Theta}[\Delta_t(\Theta, a^{(t)})]$. Therefore, we cannot get a regret bound by combining (9), (10) and (12) directly. This is where the decoupling lemma helps. Following Zhang (2021), one can get the following lemma for linear models:
$$\mathbb{E}_{a^{(t)},\Theta^{(t)}}\Delta_t(\Theta^{(t)}, a^{(t)}) \le \frac{1}{\lambda}\mathbb{E}_{a^{(t)},\Theta}\Delta_t(\Theta, a^{(t)})^2 + O(\lambda d). \tag{13}$$
For our purposes, in order to achieve the optimal dependence on the covering radius and the misspecification, we derive a decoupling lemma that accounts for the corruption. In particular, Lemma D.2 implies
$$\mathbb{E}_{a^{(t)},\Theta^{(t)}}\Delta_t(\Theta^{(t)}, a^{(t)}) \le \frac{1}{\lambda}\mathbb{E}_{a^{(t)},\Theta}\big[\Delta_t(\Theta, a^{(t)})^2 + C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big] + O\Big(\lambda d + (\epsilon + \zeta) + \frac{(\epsilon + \zeta)^2}{\lambda}\Big). \tag{14}$$
By combining (9), (10), (12) and (14), with $\beta = \Theta(1)$, we conclude that
$$\mathbb{E}[\mathrm{Regret}(T)] = \sum_{t=1}^{T}\mathbb{E}_{a^{(t)}}\big[f^{(t)}(a^{(t)}_*) - f^{(t)}(a^{(t)})\big] \le O\Big(\lambda dT + (\epsilon + \zeta)T + \frac{\log|\mathcal{H}_\epsilon| + (\epsilon + \zeta)^2T}{\lambda}\Big).$$
One can obtain the desired statement using $\lambda = \Theta\big(\sqrt{\log|\mathcal{H}_\epsilon|/(dT) + (\epsilon + \zeta)^2/d}\big)$. We note that our analysis uses a refined way to deal with the covering radius and the misspecification.
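One can only get a regret bound of $O\big(\sqrt{dT\log|\mathcal{H}_\epsilon|} + T\sqrt{d(\epsilon + \zeta)}\big)$ if one uses (13) and follows the proof in Zhang (2021) faithfully, which has a worse dependence on $(\epsilon + \zeta)$.

D FULL PROOF OF THEOREM 5.1

In this section, we provide a complete proof of Theorem 5.1. The following decoupling lemma is a critical structural lemma for our analysis, which is a generalization of Lemma 2 in Zhang (2021).

Lemma D.1 (Decoupling Lemma). Let $P$ be a joint distribution over two $\mathbb{R}^d$ spaces, i.e., $P \in \Delta(\mathbb{R}^d \times \mathbb{R}^d)$. For any constant $\lambda > 0$, we have
$$\mathbb{E}_{(\theta,\phi)\sim P}\langle\theta, \phi\rangle \le d\lambda + \frac{0.25}{\lambda}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\langle\theta, \phi'\rangle^2,$$
where $(\theta, \phi)$ on the LHS is a sample from $P$, while $(\theta, \phi)$ and $(\theta', \phi')$ on the RHS are two independent samples from $P$.

Proof. Let $\Sigma = \mathbb{E}_{(\theta,\phi)\sim P}\,\theta\theta^\top$ be a matrix in $\mathbb{R}^{d\times d}$. Denote $\xi_1, \dots, \xi_d$ as a set of orthonormal eigenvectors of $\Sigma$. Let $s_i = \mathbb{E}_{(\theta,\phi)\sim P}\langle\theta, \xi_i\rangle^2$. Then,
$$\mathbb{E}_{(\theta,\phi)\sim P}\langle\theta, \phi\rangle = \mathbb{E}_{(\theta,\phi)\sim P}\sum_{i=1}^{d}\langle\theta, \xi_i\rangle\langle\xi_i, \phi\rangle = \sum_{i=1}^{d}\mathbb{E}_{(\theta,\phi)\sim P}\Big[\sqrt{\tfrac{2\lambda}{s_i}}\langle\theta, \xi_i\rangle\cdot\sqrt{\tfrac{0.5s_i}{\lambda}}\langle\xi_i, \phi\rangle\Big] \le \sum_{i=1}^{d}\mathbb{E}_{(\theta,\phi)\sim P}\Big[\frac{\lambda}{s_i}\langle\theta, \xi_i\rangle^2 + \frac{0.25s_i}{\lambda}\langle\xi_i, \phi\rangle^2\Big], \tag{15}$$
where the first equality follows from $\theta = \sum_{i=1}^{d}\langle\theta, \xi_i\rangle\xi_i$ and the inequality follows from $ab \le 0.5a^2 + 0.5b^2$. For the first term, we have
$$\sum_{i=1}^{d}\mathbb{E}_{(\theta,\phi)\sim P}\frac{\lambda}{s_i}\langle\theta, \xi_i\rangle^2 = \sum_{i=1}^{d}\lambda = \lambda d. \tag{16}$$
For the second term, it holds that
$$\frac{0.25}{\lambda}\sum_{i=1}^{d}\mathbb{E}_{(\theta,\phi)\sim P}\,s_i\langle\xi_i, \phi\rangle^2 = \frac{0.25}{\lambda}\sum_{i=1}^{d}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\langle\theta, \xi_i\rangle^2\langle\xi_i, \phi'\rangle^2 = \frac{0.25}{\lambda}\sum_{i=1}^{d}\sum_{j=1}^{d}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\langle\theta, \xi_i\rangle\langle\xi_i, \phi'\rangle\langle\theta, \xi_j\rangle\langle\xi_j, \phi'\rangle$$
$$= \frac{0.25}{\lambda}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\Big(\sum_{i=1}^{d}\langle\theta, \xi_i\rangle\langle\xi_i, \phi'\rangle\Big)^2 = \frac{0.25}{\lambda}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\langle\theta, \phi'\rangle^2, \tag{17}$$
where the first equality follows from the definition of $s_i$, the second equality holds since $\mathbb{E}_{(\theta,\phi)\sim P}\langle\theta, \xi_i\rangle\langle\theta, \xi_j\rangle = \xi_i^\top\Sigma\xi_j = 0$ for all $i \ne j$, and the final equality follows from the definition of $\xi_i$. By plugging (16) and (17) into (15), we conclude that
$$\mathbb{E}_{(\theta,\phi)\sim P}\langle\theta, \phi\rangle \le \lambda d + \frac{0.25}{\lambda}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\langle\theta, \phi'\rangle^2.$$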
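Lemma D.1 can be checked numerically on an empirical distribution, for which both expectations are finite sums. The sketch below is an illustration under assumptions: the helper name `decoupling_gap`, the Gaussian samples, and the choice $\phi = \theta$ (a maximally correlated pair) are all hypothetical. The expectation over two independent samples is computed exactly by summing over all ordered pairs of the empirical distribution.

```python
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def decoupling_gap(pairs, lam):
    """RHS minus LHS of Lemma D.1 for the empirical distribution on `pairs`."""
    n, d = len(pairs), len(pairs[0][0])
    lhs = sum(dot(theta, phi) for theta, phi in pairs) / n
    # Independent samples: theta from one draw, phi from another draw.
    cross = sum(dot(theta, phi) ** 2
                for theta, _ in pairs for _, phi in pairs) / n ** 2
    return d * lam + 0.25 / lam * cross - lhs


rng = random.Random(0)
thetas = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
pairs = [(theta, theta) for theta in thetas]  # maximally correlated: phi = theta
gaps = [decoupling_gap(pairs, lam) for lam in (0.1, 0.5, 1.0, 2.0)]
print(gaps)
```

The gap is non-negative for every $\lambda > 0$, and it is smallest near the value of $\lambda$ that balances the two terms on the RHS.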
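Using this lemma, we can bound the expectation of the inner product of two correlated random variables by the expectation of the squared inner product of two independent random variables with the corresponding marginal distributions. To provide a nearly minimax bound with optimal dependence on the misspecification, we further propose a decoupling lemma that takes misspecification into account:

Lemma D.2 (Decoupling Lemma with Misspecification). Let $P$ be a joint distribution over two $\mathbb{R}^d$ spaces, i.e., $P \in \Delta(\mathbb{R}^d \times \mathbb{R}^d)$. Let $C(\cdot): \mathbb{R}^d \to \mathbb{R}$ be a function such that $|C(\theta)| \le \zeta$ for all $\theta \in \mathbb{R}^d$. For any constant $\lambda > 0$, we have
$$\mathbb{E}_{(\theta,\phi)\sim P}\langle\theta, \phi\rangle \le \lambda(d + 1) + \frac{4\zeta^2}{\lambda} + 4\zeta + \frac{0.25}{\lambda}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\big[\langle\theta, \phi'\rangle^2 + 8C(\theta)\cdot\langle\theta, \phi'\rangle\big],$$
where $(\theta, \phi)$ on the LHS is a sample from $P$, while $(\theta, \phi)$ and $(\theta', \phi')$ on the RHS are two independent samples from $P$.

Proof. Let $P_+$ be an auxiliary distribution over $\mathbb{R}^{d+1} \times \mathbb{R}^{d+1}$ in which each element is $(\theta_+, \phi_+) = ([\theta^\top, 4C(\theta)]^\top, [\phi^\top, 1]^\top)$ where $(\theta, \phi) \sim P$. For two independent samples $(\theta_+, \phi_+)$ and $(\theta'_+, \phi'_+)$, we have
$$\langle\theta_+, \phi_+\rangle = \langle\theta, \phi\rangle + 4C(\theta), \tag{18}$$
$$\langle\theta_+, \phi'_+\rangle^2 = \langle\theta, \phi'\rangle^2 + 8C(\theta)\cdot\langle\theta, \phi'\rangle + 16C(\theta)^2. \tag{19}$$
Applying Lemma D.1 to the distribution $P_+$, we have
$$\mathbb{E}_{(\theta_+,\phi_+)\sim P_+}\langle\theta_+, \phi_+\rangle \le \lambda(d + 1) + \frac{0.25}{\lambda}\mathbb{E}_{(\theta_+,\phi_+)\sim P_+,\,(\theta'_+,\phi'_+)\sim P_+}\langle\theta_+, \phi'_+\rangle^2. \tag{20}$$
Plugging (18) and (19) into (20), we have
$$\mathbb{E}_{(\theta,\phi)\sim P}\big[\langle\theta, \phi\rangle + 4C(\theta)\big] \le \lambda(d + 1) + \frac{0.25}{\lambda}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\big[\langle\theta, \phi'\rangle^2 + 8C(\theta)\cdot\langle\theta, \phi'\rangle + 16C(\theta)^2\big].$$
Since $|C(\theta)| \le \zeta$ always holds, we can further conclude that
$$\mathbb{E}_{(\theta,\phi)\sim P}\langle\theta, \phi\rangle \le \lambda(d + 1) + \frac{4\zeta^2}{\lambda} + 4\zeta + \frac{0.25}{\lambda}\mathbb{E}_{(\theta,\phi)\sim P,\,(\theta',\phi')\sim P}\big[\langle\theta, \phi'\rangle^2 + 8C(\theta)\cdot\langle\theta, \phi'\rangle\big].$$

Let $\Omega^{(t)}_- = \Omega^{(t-1)} \cup \{\mathcal{A}^{(t)}, f^{(t)}\}$ be the history before the agent chooses its action on round $t$. Let $\Theta^* = [\theta^*_1, \dots, \theta^*_T]$ be the optimal policy within the $\epsilon$-net $\mathcal{H}_\epsilon$. For any linear policy $\Theta = [\theta_1, \dots, \theta_T]$, denote
$$\Delta_t(\Theta, a^{(t)}) := \langle a^{(t)}, \theta_t - \theta^*_t\rangle, \qquad \Delta^*_t(\Theta) := \max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle - \max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^*_t\rangle.$$
We refer to $\Delta_t$ as the Bellman error and to $\Delta^*_t$ as the feel-good error. The next lemma shows that we can decompose the expected regret using these two notions:

Lemma D.3.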
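Under Assumption 3.1, the expected regret on round $t$ satisfies
$$\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\big[f^{(t)}(a^{(t)}_*) - f^{(t)}(a^{(t)})\big] \le \mathbb{E}_{a^{(t)},\Theta^{(t)}\mid\Omega^{(t)}_-}\big[\Delta_t(\Theta^{(t)}, a^{(t)}) - \Delta^*_t(\Theta^{(t)})\big] + 2(\epsilon + \zeta).$$

Proof. For any action $a \in \mathcal{A}$, we have
$$\langle a, \theta^*_t\rangle \le f^{(t)}(a) + |f^{(t)}(a) - \langle a, \theta(f^{(t)})\rangle| + |\langle a, \theta(f^{(t)})\rangle - \langle a, \theta^*_t\rangle| \le f^{(t)}(a) + \epsilon + \zeta,$$
where the first inequality follows from the triangle inequality and the second inequality follows from Assumption 3.1 and the definition of the $\epsilon$-net with $|\langle a, \theta(f^{(t)})\rangle - \langle a, \theta^*_t\rangle| \le \|a\|_2\cdot\|\theta(f^{(t)}) - \theta^*_t\|_2 \le \epsilon$. Similarly, we have $\langle a, \theta^*_t\rangle \ge f^{(t)}(a) - (\epsilon + \zeta)$. Thus,
$$\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\big[f^{(t)}(a^{(t)}_*) - f^{(t)}(a^{(t)})\big] \le \mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\big[\langle a^{(t)}_*, \theta^*_t\rangle - \langle a^{(t)}, \theta^*_t\rangle\big] + 2(\epsilon + \zeta). \tag{21}$$
Moreover, we have
$$\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\big[\langle a^{(t)}_*, \theta^*_t\rangle - \langle a^{(t)}, \theta^*_t\rangle\big] \le \mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\Big[\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^*_t\rangle - \langle a^{(t)}, \theta^*_t\rangle\Big]$$
$$= \mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\Big[\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^*_t\rangle - \max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle + \langle a^{(t)}, \theta_t\rangle - \langle a^{(t)}, \theta^*_t\rangle\Big] = \mathbb{E}_{a^{(t)},\Theta^{(t)}\mid\Omega^{(t)}_-}\big[\Delta_t(\Theta^{(t)}, a^{(t)}) - \Delta^*_t(\Theta^{(t)})\big], \tag{22}$$
where the first equality follows from the fact that the action $a^{(t)}$ maximizes $\langle a, \theta_t\rangle$ and thus $\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle = \langle a^{(t)}, \theta_t\rangle$. By combining (21) and (22), we obtain the desired result.

The following lemma shows a connection between $\Delta_t$ and a potential function.

Lemma D.4. Define the potential function
$$\Phi^{(t)} := \frac{1}{\beta\lambda}\log\sum_{\Theta\in\mathcal{H}_\epsilon}\exp\Big(\beta\sum_{\tau=1}^{t}\big(L^{(\tau)}(\Theta^*) - L^{(\tau)}(\Theta)\big)\Big).$$
Let $C_t(a^{(t)}) := f^{(t)}(a^{(t)}) - \langle a^{(t)}, \theta^*_t\rangle$ be the misspecification of $\Theta^*$ on action $a^{(t)}$ with respect to the ground truth $f^{(t)}$. For any $\epsilon \in [0, 1]$, $\beta \in (0, 0.01]$ and $\lambda \in [0, 1]$, under Assumptions 3.1 and 3.2, the expected increment of the potential function on any round $t$ satisfies
$$\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] \le -\frac{0.25}{\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\mathbb{E}_{\Theta\sim P^{(t)}}\big[\Delta_t(\Theta, a^{(t)})^2 + 8C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big] + \mathbb{E}_{\Theta\sim P^{(t)}}\Delta^*_t(\Theta) + 4\beta\lambda + \frac{8\beta}{\lambda}(\epsilon + \zeta)^2.$$

Proof.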
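The expected increment of the potential function in round $t$, conditional on the reward function and context in round $t$, satisfies
$$\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] = \frac{1}{\beta\lambda}\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\log\frac{\sum_{\Theta\in\mathcal{H}_\epsilon}\exp\big(\beta\sum_{\tau=1}^{t}(L^{(\tau)}(\Theta^*) - L^{(\tau)}(\Theta))\big)}{\sum_{\Theta\in\mathcal{H}_\epsilon}\exp\big(\beta\sum_{\tau=1}^{t-1}(L^{(\tau)}(\Theta^*) - L^{(\tau)}(\Theta))\big)}$$
$$= \frac{1}{\beta\lambda}\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\log\frac{\sum_{\Theta\in\mathcal{H}_\epsilon}\exp\big(-\beta\sum_{\tau=1}^{t-1}L^{(\tau)}(\Theta)\big)\cdot\exp\big(\beta(L^{(t)}(\Theta^*) - L^{(t)}(\Theta))\big)}{\sum_{\Theta\in\mathcal{H}_\epsilon}\exp\big(-\beta\sum_{\tau=1}^{t-1}L^{(\tau)}(\Theta)\big)}$$
$$= \frac{1}{\beta\lambda}\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\exp\big(\beta(L^{(t)}(\Theta^*) - L^{(t)}(\Theta))\big) \le \frac{1}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{r^{(t)}\mid\Omega^{(t)}_-,a^{(t)}}\mathbb{E}_{\Theta\sim P^{(t)}}\exp\big(\beta(L^{(t)}(\Theta^*) - L^{(t)}(\Theta))\big), \tag{23}$$
where the last equality follows from the construction of $P^{(t)}$ and the last inequality follows from Jensen's inequality. Moreover, we can decompose the loss function for any policy $\Theta$ according to
$$L^{(t)}(\Theta) = (r^{(t)} - \langle a^{(t)}, \theta_t\rangle)^2 - \lambda\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle = (f^{(t)}(a^{(t)}) + \xi^{(t)} - \langle a^{(t)}, \theta_t\rangle)^2 - \lambda\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle = (C_t(a^{(t)}) + \xi^{(t)} - \Delta_t(\Theta, a^{(t)}))^2 - \lambda\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta_t\rangle, \tag{24}$$
where the first equality follows from the definition $r^{(t)} = f^{(t)}(a^{(t)}) + \xi^{(t)}$ and the second equality follows from the definition of $C_t$. For the optimal policy $\Theta^*$ in the $\epsilon$-net, one can see that
$$L^{(t)}(\Theta^*) = (C_t(a^{(t)}) + \xi^{(t)})^2 - \lambda\max_{a\in\mathcal{A}^{(t)}}\langle a, \theta^*_t\rangle. \tag{25}$$
Combining (24) and (25), the terms in the exponent can be computed as
$$\beta\big(L^{(t)}(\Theta^*) - L^{(t)}(\Theta)\big) = -2\beta C_t(a^{(t)})\Delta_t(\Theta, a^{(t)}) - 2\beta\xi^{(t)}\Delta_t(\Theta, a^{(t)}) - \beta\Delta_t(\Theta, a^{(t)})^2 + \beta\lambda\Delta^*_t(\Theta). \tag{26}$$
Since $\xi^{(t)}$ is a zero-mean random variable with range 2, according to Hoeffding's lemma, we have
$$\mathbb{E}_{r^{(t)}\mid\Omega^{(t)}_-,a^{(t)}}\exp\big(-2\beta\xi^{(t)}\Delta_t(\Theta, a^{(t)})\big) \le \exp\big(2\beta^2\Delta_t(\Theta, a^{(t)})^2\big). \tag{27}$$
When $\beta \in (0, 0.01]$, we have $2\beta^2 - \beta \le -0.5\beta$. Plugging (26) and (27) into (23) gives
$$\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] \le \frac{1}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\exp\big(-2\beta C_t(a^{(t)})\Delta_t(\Theta, a^{(t)}) - 0.5\beta\Delta_t(\Theta, a^{(t)})^2 + \beta\lambda\Delta^*_t(\Theta)\big). \tag{28}$$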
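By the Cauchy-Schwarz inequality, we can decompose the RHS as
$$\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] \le \underbrace{\frac{0.5}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\exp\big(-4\beta C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big)}_{I_1} + \underbrace{\frac{0.25}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\exp\big(-2\beta\Delta_t(\Theta, a^{(t)})^2\big)}_{I_2} + \underbrace{\frac{0.25}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\exp\big(4\beta\lambda\Delta^*_t(\Theta)\big)}_{I_3}.$$
For the first term,
$$I_1 \le \frac{0.5}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\big[1 - 4\beta C_t(a^{(t)})\Delta_t(\Theta, a^{(t)}) + 16\beta^2(\epsilon + \zeta)^2\big] \le -\frac{0.25}{\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\mathbb{E}_{\Theta\sim P^{(t)}}\big[8C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big] + \frac{8\beta}{\lambda}(\epsilon + \zeta)^2, \tag{29}$$
where the first inequality follows from $e^x \le 1 + x + t^2$ for $|x| \le t \le 1$ with $|C_t(a^{(t)})| \le \epsilon + \zeta \le 2$, $|\Delta_t(\Theta, a^{(t)})| \le 2$ and $\beta \in (0, 0.01]$, and the second inequality follows from $\log(1 + x) \le x$ for $x > -1$. For the second term,
$$I_2 \le \frac{0.25}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\big[1 - \beta\Delta_t(\Theta, a^{(t)})^2\big] \le -\frac{0.25}{\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\mathbb{E}_{\Theta\sim P^{(t)}}\Delta_t(\Theta, a^{(t)})^2, \tag{30}$$
where the first inequality follows from $e^{-x} \le 1 - 0.5x$ for $x \in [0, 1]$ with $|\Delta_t(\Theta, a^{(t)})| \le 2$ and $\beta \in (0, 0.01]$, and the second inequality follows from $\log(1 - x) \le -x$ for all $x < 1$. For the third term,
$$I_3 \le \frac{0.25}{\beta\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\log\mathbb{E}_{\Theta\sim P^{(t)}}\big[1 + 4\beta\lambda\Delta^*_t(\Theta) + 16\beta^2\lambda^2\big] \le \mathbb{E}_{\Theta\sim P^{(t)}}\Delta^*_t(\Theta) + 4\beta\lambda, \tag{31}$$
where the first inequality follows from $e^x \le 1 + x + t^2$ for $|x| \le t \le 1$ with $|\Delta^*_t(\Theta)| \le 2$, $\beta \in (0, 0.01]$ and $\lambda \in [0, 1]$, and the second inequality follows from $\log(1 + x) \le x$ for $x > -1$. Combining (29), (30) and (31) with (28), we have
$$\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] \le -\frac{0.25}{\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\mathbb{E}_{\Theta\sim P^{(t)}}\big[\Delta_t(\Theta, a^{(t)})^2 + 8C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big] + \mathbb{E}_{\Theta\sim P^{(t)}}\Delta^*_t(\Theta) + 4\beta\lambda + \frac{8\beta}{\lambda}(\epsilon + \zeta)^2.$$

The following lemma upper bounds the total change of the potential:

Lemma D.5. Under the settings of Lemma D.4, we have $\Phi^{(0)} - \Phi^{(T)} \le \frac{1}{\beta\lambda}\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty})$.

Proof. The statement can be proved by combining
$$\Phi^{(0)} = \frac{1}{\beta\lambda}\log\sum_{\Theta\in\mathcal{H}_\epsilon}\exp(0) = \frac{1}{\beta\lambda}\log|\mathcal{H}_\epsilon|, \qquad \Phi^{(T)} \ge \frac{1}{\beta\lambda}\mathbb{E}_{\Omega^{(T)}}\log\max_{\Theta\in\mathcal{H}_\epsilon}\exp\Big(\beta\sum_{\tau=1}^{T}\big(-L^{(\tau)}(\Theta) + L^{(\tau)}(\Theta^*)\big)\Big) \ge 0,$$
where the second inequality follows from $\Theta^* \in \mathcal{H}_\epsilon$.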
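Note that the chosen policy $\Theta^{(t)}$ in Lemma D.3 is correlated with the selected action $a^{(t)}$, while the random policy $\Theta$ in Lemma D.4 is not. The following lemma presents a connection between these two notions.

Lemma D.6. Let $C_t: \mathbb{R}^d \to \mathbb{R}$ be a function such that $|C_t(a)| \le \epsilon + \zeta$ for any $a \in \mathcal{A}$. Then,
$$\mathbb{E}_{\Theta^{(t)}\mid\Omega^{(t)}_-}\Delta^*_t(\Theta^{(t)}) = \mathbb{E}_{\Theta\sim P^{(t)}}\Delta^*_t(\Theta),$$
and also
$$\mathbb{E}_{a^{(t)},\Theta^{(t)}\mid\Omega^{(t)}_-}\Delta_t(\Theta^{(t)}, a^{(t)}) \le \frac{0.25}{\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\mathbb{E}_{\Theta\sim P^{(t)}}\big[\Delta_t(\Theta, a^{(t)})^2 + 8C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big] + \lambda(d + 1) + \frac{4}{\lambda}(\epsilon + \zeta)^2 + 4(\epsilon + \zeta).$$

Proof. According to the construction of $P^{(t)}$, the conditional distribution of $\Theta^{(t)} \mid \Omega^{(t)}_-$ is identical to the distribution $\Theta \sim P^{(t)}$. This directly gives $\mathbb{E}_{\Theta^{(t)}\mid\Omega^{(t)}_-}\Delta^*_t(\Theta^{(t)}) = \mathbb{E}_{\Theta\sim P^{(t)}}\Delta^*_t(\Theta)$. Moreover, applying Lemma D.2 to the joint distribution of $(a^{(t)}, \theta^{(t)}_t - \theta^*_t)$ conditional on $\Omega^{(t)}_-$, we have
$$\mathbb{E}_{a^{(t)},\theta^{(t)}_t\mid\Omega^{(t)}_-}\langle a^{(t)}, \theta^{(t)}_t - \theta^*_t\rangle \le \frac{0.25}{\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\mathbb{E}_{\theta_t\sim P^{(t)}}\big[\langle a^{(t)}, \theta_t - \theta^*_t\rangle^2 + 8C_t(a^{(t)})\langle a^{(t)}, \theta_t - \theta^*_t\rangle\big] + \lambda(d + 1) + \frac{4}{\lambda}(\epsilon + \zeta)^2 + 4(\epsilon + \zeta).$$
The desired statement can be obtained using the definition of $\Delta_t$.

Now, we are ready to prove Theorem 5.1:

Proof of Theorem 5.1. We choose the parameters according to
$$\lambda := \sqrt{\frac{\log|\mathcal{H}_\epsilon| + (\epsilon + \zeta)^2T}{dT}} = O\big(\sqrt{1/d}\big), \qquad \beta := 0.01. \tag{32}$$
Denote
$$X_t := \frac{0.25}{\lambda}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\mathbb{E}_{\Theta\sim P^{(t)}}\big[\Delta_t(\Theta, a^{(t)})^2 + 8C_t(a^{(t)})\Delta_t(\Theta, a^{(t)})\big] - \mathbb{E}_{\Theta\sim P^{(t)}}\Delta^*_t(\Theta).$$
According to Lemma D.6, we have
$$\mathbb{E}_{a^{(t)},\Theta^{(t)}\mid\Omega^{(t)}_-}\big[\Delta_t(\Theta^{(t)}, a^{(t)}) - \Delta^*_t(\Theta^{(t)})\big] \le X_t + \lambda(d + 1) + \frac{4}{\lambda}(\epsilon + \zeta)^2 + 4(\epsilon + \zeta). \tag{33}$$
According to Lemma D.4, we have
$$\mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\big[\Phi^{(t)} - \Phi^{(t-1)}\big] \le -X_t + 4\beta\lambda + \frac{8\beta}{\lambda}(\epsilon + \zeta)^2. \tag{34}$$
Combining (33) and (34) gives
$$\mathbb{E}_{a^{(t)},\Theta^{(t)}\mid\Omega^{(t)}_-}\big[\Delta_t(\Theta^{(t)}, a^{(t)}) - \Delta^*_t(\Theta^{(t)})\big] \le \mathbb{E}_{a^{(t)},r^{(t)}\mid\Omega^{(t)}_-}\big[\Phi^{(t-1)} - \Phi^{(t)}\big] + \lambda(d + 1 + 4\beta) + \frac{4 + 8\beta}{\lambda}(\epsilon + \zeta)^2 + 4(\epsilon + \zeta).$$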
Moreover, we have
$$\mathbb{E}[\mathrm{Regret}(T)] = \sum_{t=1}^{T}\mathbb{E}_{a^{(t)}\mid\Omega^{(t)}_-}\big[f^{(t)}(a^{(t)}_*) - f^{(t)}(a^{(t)})\big] \le \sum_{t=1}^{T}\mathbb{E}_{a^{(t)},\Theta^{(t)}}\ \ldots$$

E PROOF OF COVERING NUMBER FROM SECTION 5.1

In this section, we prove covering number bounds for some specific linear bandit settings. We first introduce a classic result on the covering number of the unit ball.

Lemma E.1 (Lemma 5.2 in Vershynin (2012), Restated). The unit Euclidean ball $B^d = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ equipped with the Euclidean metric satisfies, for every $\epsilon > 0$, that $N(B^d, \epsilon, \|\cdot\|_2) \le (1 + 2\epsilon^{-1})^d$.

With this lemma, we can directly prove the covering number of stationary linear bandits.

Proof. We will prove this lemma by constructing an $\epsilon$-net for $\mathcal{H}$. Let …

One can naturally extend the proof to non-stationary linear bandits with bounded switches:

Lemma 5.2 (Covering Number for Non-Stationary Linear Bandits with Bounded Switches). Under Assumptions 3.1 and 3.4, the metric entropy of linear policies satisfies $\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le O(dS\log\epsilon^{-1} + S\log T)$.

Proof. We will prove this lemma by constructing an $\epsilon$-net for $\mathcal{H}$. Let $B^d_\epsilon$ be the minimum $\epsilon$-net of the unit Euclidean ball $B^d$ in $L_2$ distance. We build a set $\mathcal{H}_\epsilon$ by mapping any sequence $(t_1, \dots, t_S)$ with $1 < t_1 < \dots < t_S \le T$ and $S + 1$ vectors $\tilde\theta_0, \dots, \tilde\theta_S \in B^d_\epsilon$ to the matrix $[\tilde\theta_0, \dots, \tilde\theta_0, \tilde\theta_1, \dots, \tilde\theta_1, \dots, \tilde\theta_S]$, where the first $\tilde\theta_i$ occurs at index $t_i$. Here $t_i$ describes the round of the $i$-th switch and $\tilde\theta_i$ characterizes the parameter after the switch. The size of $\mathcal{H}_\epsilon$ can be bounded by
$$|\mathcal{H}_\epsilon| \le T^S\cdot|B^d_\epsilon|^{S+1} \le T^S\cdot(1 + 2\epsilon^{-1})^{d(S+1)},$$
where the first term is the number of possible $(t_i)$ and the second term is the number of possible $(\tilde\theta_i)$. We then show that $\mathcal{H}_\epsilon$ is an $\epsilon$-net of $\mathcal{H}$ under Assumption 3.4. Every linear policy in $\mathcal{H}$ can also be written in the form $[\theta_0, \dots, \theta_0, \theta_1, \dots, \theta_1, \dots, \theta_S]$, which can be covered by some $[\tilde\theta_0, \dots, \tilde\theta_0, \tilde\theta_1, \dots, \tilde\theta_1, \dots, \tilde\theta_S]$ with the same switch locations and corresponding parameters. According to the definition of $B^d_\epsilon$, $\|\tilde\theta_i - \theta_i\| \le \epsilon$ for every $i$, and thus $\mathcal{H}_\epsilon$ is an $\epsilon$-net of $\mathcal{H}$. As a result, we can bound the metric entropy by
$$\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le \log|\mathcal{H}_\epsilon| = O(dS\log\epsilon^{-1} + S\log T).$$
For non-stationary linear bandits with bounded path length, we will show that the $\epsilon$-net for some bounded-switch instances can be transformed into a desired covering:

Lemma 5.3 (Covering Number for Non-Stationary Linear Bandits with Bounded Path Length). Under Assumptions 3.1 and 3.5, the metric entropy of linear policies satisfies $\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le O(d\log\epsilon^{-1} + dP\epsilon^{-1}\log\epsilon^{-1} + P\epsilon^{-1}\log T)$.

Proof. Let $\tilde{\mathcal{H}}_{\epsilon/2}$ be the $\epsilon/2$-net of non-stationary bandits with no more than $2P\epsilon^{-1}$ switches (see Assumption 3.4). We will show that $\tilde{\mathcal{H}}_{\epsilon/2}$ is an $\epsilon$-net of non-stationary bandits with path length no more than $P$. For some linear policy $\Theta = [\theta_1, \theta_2, \dots, \theta_T] \in \mathcal{H}$ such that $\sum_{t=1}^{T}\|\theta_t - \theta_{t-1}\| \le P$, consider a sequence $1 = t_0 < t_1 < \dots < t_M \le T$ where $t_i$ is the minimal index such that $\|\theta_{t_i} - \theta_{t_{i-1}}\| > \epsilon/2$ for all $i \ge 1$. Denote $\bar\Theta = [\theta_1, \dots, \theta_1, \theta_{t_1}, \dots, \theta_{t_1}, \dots, \theta_{t_M}]$ as the linear policy that quantizes the parameter drift, in which the first $\theta_{t_i}$ occurs at index $t_i$. Since the number of switches of $\bar\Theta$ is bounded by $M \le P/(\epsilon/2) = 2P\epsilon^{-1}$, the $\epsilon/2$-net $\tilde{\mathcal{H}}_{\epsilon/2}$ contains an $\epsilon/2$-approximation of $\bar\Theta$. Denote this policy as $\tilde\Theta$. Since the deviation of each single-round policy within the same slice is no more than $\epsilon/2$, we have $\|\bar\Theta - \Theta\|_{2,\infty} \le \epsilon/2$. According to the definition of the $\epsilon/2$-net, we also have $\|\tilde\Theta - \bar\Theta\|_{2,\infty} \le \epsilon/2$. By the triangle inequality, we have $\|\tilde\Theta - \Theta\|_{2,\infty} \le \epsilon$. This shows that $\Theta$ always has an $\epsilon$-approximation within $\tilde{\mathcal{H}}_{\epsilon/2}$, and thus $\tilde{\mathcal{H}}_{\epsilon/2}$ is indeed an $\epsilon$-net of $\mathcal{H}$. As a result, we can bound the metric entropy using Lemma 5.2:
$$\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le \log|\tilde{\mathcal{H}}_{\epsilon/2}| \le O(d\log\epsilon^{-1} + dP\epsilon^{-1}\log\epsilon^{-1} + P\epsilon^{-1}\log T),$$
where the extra $d\log\epsilon^{-1}$ term follows from the special case $S = 2P\epsilon^{-1} < 1$.

Now we consider lifelong linear bandits. The covering number bound is essentially the covering number of low-rank matrices.

Lemma 5.4 (Covering Number for Lifelong Bandits with Shared Representation).
Under Assumptions 3.1 and 3.6, the metric entropy of linear policies satisfies

$$\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le O(dk \log \epsilon^{-1} + kM \log \epsilon^{-1}).$$

The lower bounds in Section 5.1 can be obtained by combining the above lemmas:
• Combining Lemma F.1 and Lemma F.4 gives Proposition 5.1.
• Combining Lemma F.5 and Lemma F.4 gives Proposition 5.2.
• Combining Lemma F.2, Lemma F.1, and Lemma F.4 gives Proposition 5.3.
• Combining Lemma F.3 and Lemma F.4 gives Proposition 5.4.
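The quantization step used in the proof of Lemma 5.3 (replacing a slowly drifting parameter sequence by a piecewise-constant one) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and the random-walk test sequence are hypothetical, and the norm constraint on the parameters is ignored because it plays no role in this step.

```python
import math
import random


def quantize_path(thetas, eps):
    """Keep the current anchor until the parameter moves more than eps/2
    away from it, then switch the anchor to the current parameter."""
    anchor = thetas[0]
    quantized, switches = [], 0
    for theta in thetas:
        if math.dist(theta, anchor) > eps / 2:
            anchor = theta
            switches += 1
        quantized.append(anchor)
    return quantized, switches


def path_length(thetas):
    """Sum of Euclidean distances between consecutive parameters."""
    return sum(math.dist(a, b) for a, b in zip(thetas, thetas[1:]))


def random_walk(T, d, step, seed=0):
    """Hypothetical drifting parameter sequence used only for illustration."""
    rng = random.Random(seed)
    theta = [0.0] * d
    out = [tuple(theta)]
    for _ in range(T - 1):
        theta = [x + rng.uniform(-step, step) for x in theta]
        out.append(tuple(theta))
    return out


if __name__ == "__main__":
    thetas = random_walk(T=500, d=3, step=0.01)
    eps = 0.2
    quantized, switches = quantize_path(thetas, eps)
    worst = max(math.dist(a, b) for a, b in zip(thetas, quantized))
    print("switches:", switches)
    print("bound 2P/eps:", 2 * path_length(thetas) / eps)
    print("max per-round deviation:", worst, "(at most eps/2 =", eps / 2, ")")
```

Each switch requires the sequence to travel more than ϵ/2, so the number of switches never exceeds 2P/ϵ, while every round stays within ϵ/2 of its anchor.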



improved the regret of the algorithm and generalized it to the lifelong setting. The algorithm has a regret bound of O(d √ kT + k √ M T ). Hu et al. (2021) studied the multi-task setting in linear contextual bandits and proposed a computationally inefficient algorithm with a regret of O(d

proposed the first general theoretical guarantee for Thompson sampling. By utilizing the connection between Thompson sampling and optimistic policies, they showed Thompson sampling achieves a Bayesian regret bound of O(d √ T ) for linear bandits, where the underlying parameters are randomly drawn from some public distribution. For frequentist regret, Agrawal & Goyal (2013) showed a modified Thompson sampling in which the variance of the posterior distribution is inflated by O(d) achieves a regret bound of O(d 3 2

Figure 1: The performance of Algorithm 3 compared with other linear contextual bandit algorithms. Results are averaged over 5 runs with standard errors shown as shaded areas.

Figure 2: The performance of Algorithm 4 compared with running LinUCB on each task. Results are averaged over 5 runs with standard errors shown as shaded areas.

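Decoupling Lemma with Misspecification). Let $P$ be a joint distribution over two copies of $\mathbb{R}^d$, i.e., $P \in \Delta(\mathbb{R}^d \times \mathbb{R}^d)$. Let $C(\cdot): \mathbb{R}^d \to \mathbb{R}$ be a function such that $|C(\theta)| \le \zeta$ for all $\theta \in \mathbb{R}^d$.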

Lemma E.1 (Lemma 5.2 in Vershynin (2012), Restated). The unit Euclidean ball $B^d = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ equipped with the Euclidean metric satisfies, for every $\epsilon > 0$, that $N(B^d, \epsilon, \|\cdot\|_2) \le (1 + 2\epsilon^{-1})^d$.
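As a rough numerical illustration of Lemma E.1, the sketch below greedily builds an ϵ-separated set from points sampled in the unit ball and compares its size with (1 + 2/ϵ)^d. The helper names and the sampling scheme are assumptions made for this example. The greedy set covers only the sampled points rather than the whole ball, but its size obeys the same volumetric bound because its centers are pairwise more than ϵ apart.

```python
import math
import random


def sample_unit_ball(n, d, seed=0):
    """Rejection-sample n points from the unit Euclidean ball in R^d."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        x = tuple(rng.uniform(-1.0, 1.0) for _ in range(d))
        if math.hypot(*x) <= 1.0:
            points.append(x)
    return points


def greedy_net(points, eps):
    """Add a point as a new center only if it is farther than eps from all
    existing centers. Every input point ends up within eps of some center."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


if __name__ == "__main__":
    d, eps = 2, 0.5
    points = sample_unit_ball(2000, d)
    centers = greedy_net(points, eps)
    print("net size:", len(centers))
    print("bound (1 + 2/eps)^d:", (1 + 2 / eps) ** d)
```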

Lemma 5.1 (Covering Number for Misspecified Linear Bandits). Under Assumptions 3.1 and 3.3, the metric entropy of linear policies satisfies $\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le O(d \log \epsilon^{-1})$.

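$B^d_\epsilon$ be a minimum $\epsilon$-net of the unit Euclidean ball $B^d$ in $L_2$ distance. Let $\mathcal{H}_\epsilon$ be the set that contains the matrix $[\hat\theta, \dots, \hat\theta]$ for every $\hat\theta \in B^d_\epsilon$. According to Lemma E.1, we have $|\mathcal{H}_\epsilon| = |B^d_\epsilon| \le (1 + 2\epsilon^{-1})^d$. We then show that $\mathcal{H}_\epsilon$ is an $\epsilon$-net of $\mathcal{H}$ under Assumption 3.3. For every linear policy $[\theta, \dots, \theta]$, there exists $\hat\theta \in B^d_\epsilon$ such that $\|\hat\theta - \theta\|_2 \le \epsilon$, by the definition of $B^d_\epsilon$. This shows that $[\hat\theta, \dots, \hat\theta] \in \mathcal{H}_\epsilon$ is an $\epsilon$-approximation of $[\theta, \dots, \theta]$, which implies that $\mathcal{H}_\epsilon$ is indeed an $\epsilon$-net of $\mathcal{H}$. As a result, we can bound the metric entropy by $\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le \log |\mathcal{H}_\epsilon| = O(d \log \epsilon^{-1})$.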

where the first inequality follows from Lemma D.3, the second inequality follows from (35), and the last inequality follows from Lemma D.5. With the parameter defined in (32), we conclude that


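Proof. We will prove this lemma by constructing an $\epsilon$-net for $\mathcal{H}$. This is done by constructing an $\epsilon/2$-net for all possible feature extractors $B$ and an $\epsilon/2$-net for all possible task-specific vectors $\{w_i\}_{i=1}^{M}$. Specifically, we build a set $\mathcal{B}_{\epsilon/2}$ such that for every $B$ there exists $\hat B \in \mathcal{B}_{\epsilon/2}$ with $\|\hat B - B\|_2 \le \epsilon/2$, and a set $\mathcal{W}_{\epsilon/2}$ such that for every $\{w_i\}_{i=1}^{M}$ there exists $\{\hat w_i\}_{i=1}^{M} \in \mathcal{W}_{\epsilon/2}$ with $\|\hat w_i - w_i\|_2 \le \epsilon/2$ for all $i$. We generate the $\epsilon$-net by mapping every pair of $\hat B$ and $\{\hat w_i\}_{i=1}^{M}$ to the matrix $[\hat\theta^{(1)}, \dots, \hat\theta^{(T)}]$, where $\hat\theta^{(t)} = \hat B \hat w_{m(t)}$.

In any round with $\theta^{(t)} = B w_{m(t)}$, the vector $\hat\theta^{(t)} = \hat B \hat w_{m(t)}$ with the corresponding $\hat B$ and $\hat w_{m(t)}$ in the $\epsilon/2$-nets satisfies $\|\hat\theta^{(t)} - \theta^{(t)}\|_2 \le \epsilon$, which follows from the triangle inequality and the fact that $B$ is an orthogonal matrix. This shows that any task can be $\epsilon$-approximated and that the generated set is indeed an $\epsilon$-net. As a result, with Lemma E.1, we conclude that

$$\log N(\mathcal{H}, \epsilon, \|\cdot\|_{2,\infty}) \le O(dk \log \epsilon^{-1} + kM \log \epsilon^{-1}).$$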
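One way to spell out the per-round approximation step in the proof of Lemma 5.4 is sketched below. It is a reconstruction under assumptions that the text here does not state explicitly: $\|\cdot\|_2$ on matrices is read as the operator norm, the net vectors satisfy $\|\hat w_{m(t)}\|_2 \le 1$, and $B$ has orthonormal columns so that $\|Bx\|_2 = \|x\|_2$.

```latex
\begin{align*}
\|\hat\theta^{(t)} - \theta^{(t)}\|_2
  &= \|\hat B \hat w_{m(t)} - B w_{m(t)}\|_2 \\
  &\le \|(\hat B - B)\,\hat w_{m(t)}\|_2 + \|B(\hat w_{m(t)} - w_{m(t)})\|_2
     && \text{(triangle inequality)} \\
  &\le \|\hat B - B\|_2 \,\|\hat w_{m(t)}\|_2 + \|\hat w_{m(t)} - w_{m(t)}\|_2
     && \text{($B$ orthogonal)} \\
  &\le \epsilon/2 + \epsilon/2 = \epsilon .
\end{align*}
```

The split adds and subtracts $B \hat w_{m(t)}$ so that orthogonality is only needed for the true $B$, not for its approximation $\hat B$.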

F LOWER BOUNDS IN SECTION 5.1

In this section, we give regret lower bounds for some specific linear bandit settings. We first present some existing results from the literature. It is easy to check that the corresponding hard instances satisfy our formulation.

Lemma F.4 (Theorem F.1 in Lattimore & Szepesvari (2020), Restated). Under Assumptions 3.1 and 3.2, for any algorithm, there exists a bandit instance for which

Remark F.1. Lattimore & Szepesvari (2020) proved a lower bound of order $\Omega(\min(T, K) \sqrt{d / \log(K)}\, \zeta)$, where $K$ is the number of actions. Here we choose $K = T$ for simplicity.

We further prove a lower bound for non-stationary linear bandits with bounded switches.

Lemma F.5. Under Assumptions 3.1, 3.2 and 3.4, for any algorithm, there exists a bandit instance for which $\mathbb{E}[\mathrm{Regret}(T)] \ge \Omega(d\sqrt{ST})$.

Proof. Without loss of generality, assume $T$ is divisible by $S$. Consider a bandit instance where the switches occur at rounds $T/S, 2T/S, \dots$. Note that the reward functions after different switches are independent. Thus, according to Lemma F.1, any algorithm suffers $\Omega(d\sqrt{T/S})$ regret in any interval between two switches. Since there are $S$ such intervals, for any algorithm, there exists a bandit instance for which $\mathbb{E}[\mathrm{Regret}(T)] \ge \Omega(d\sqrt{ST})$.
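The arithmetic behind the last step of the proof of Lemma F.5 can be written out as follows. This is a sketch that takes the per-interval bound $\Omega(d\sqrt{T/S})$ from Lemma F.1 as given and indexes the $S$ intervals of length $T/S$ by $j$ (an index introduced here for illustration).

```latex
\begin{align*}
\mathbb{E}[\mathrm{Regret}(T)]
  \;\ge\; \sum_{j=1}^{S} \Omega\!\big(d\sqrt{T/S}\big)
  \;=\; \Omega\!\big(S \cdot d\sqrt{T/S}\big)
  \;=\; \Omega\!\big(d\sqrt{ST}\big).
\end{align*}
```

For instance, with $T = 10^4$ and $S = 4$, each interval has $2500$ rounds and contributes order $d \cdot 50$, so the total is of order $d \cdot 200 = d\sqrt{4 \cdot 10^4}$.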

