VARIANCE-AWARE SPARSE LINEAR BANDITS

Abstract

It is well known that for sparse linear bandits, when ignoring the dependency on sparsity, which is much smaller than the ambient dimension, the worst-case minimax regret is $\widetilde{\Theta}(\sqrt{dT})$, where $d$ is the ambient dimension and $T$ is the number of rounds. On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $\widetilde{O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{O}\big(\sqrt{d \sum_{t=1}^{T} \sigma_t^2} + 1\big)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regime (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a "black-box" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.

1. INTRODUCTION

This paper studies the sparse linear stochastic bandit problem, which is a special case of linear stochastic bandits. In linear bandits (Dani et al., 2008), the agent faces a sequential decision-making problem lasting for $T$ rounds. In the $t$-th round, the agent chooses an action $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$, where $\mathcal{X}$ is an action set, and receives a noisy reward $r_t = \langle \theta^*, x_t \rangle + \eta_t$, where $\theta^* \in \mathcal{X}$ is the (hidden) parameter of the game and $\eta_t$ is random zero-mean noise. The goal of the agent is to minimize her regret $R_T$, that is, the difference between her cumulative reward $\sum_{t=1}^{T} \langle \theta^*, x_t \rangle$ and $\max_{x \in \mathcal{X}} \sum_{t=1}^{T} \langle \theta^*, x \rangle$ (see Eq. (1) for a definition). Dani et al. (2008) proved that the minimax optimal regret for linear bandits is $\Theta(d\sqrt{T})$ when the noises are independent Gaussian random variables with mean 0 and variance 1 and both $\theta^*$ and the actions $x_t$ lie in the unit sphere in $\mathbb{R}^d$.¹

In real-world applications such as recommendation systems, only a few features may be relevant despite a large candidate feature space. In other words, the high-dimensional linear regime may actually admit a low-dimensional structure. As a result, if we still use the linear bandit model, we will always suffer $\Omega(d\sqrt{T})$ regret no matter how many features are useful. Motivated by this, the sparse linear stochastic bandit problem was introduced (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). This problem has an additional constraint that the hidden parameter $\theta^*$ is sparse, i.e., $\|\theta^*\|_0 \le s$ for some $s \ll d$. However, the agent has no prior knowledge about $s$, and thus the interaction protocol is exactly the same as that of linear bandits. The minimax optimal regret for sparse linear bandits is $\Theta(\sqrt{sdT})$ (Abbasi-Yadkori et al., 2012; Antos & Szepesvári, 2009).² This bound bypasses the $\Omega(d\sqrt{T})$ lower bound for linear bandits as we always have $s = \|\theta^*\|_0 \le d$ and the agent does not have access to $s$ either (though a few previous works assumed a known $s$).

However, both the $O(d\sqrt{T})$ and the $O(\sqrt{sdT})$ bounds are worst-case regret bounds and are sometimes too pessimistic, especially when $d$ is large. On the other hand, many problems with delicate structures permit a regret bound much smaller than the worst-case bound. The structure this paper focuses on is the magnitude of the noise. Consider the following motivating example.

Motivating Example (Deterministic Sparse Linear Bandits). Consider the case where the action set is the unit sphere $\mathcal{X} = \mathbb{S}^{d-1}$ and there is no noise, i.e., the feedback is $r_t = \langle \theta^*, x_t \rangle$ for each round $t \in [T]$. In this case, one can identify all non-zero entries of $\theta^*$ in $O(s \log d)$ steps with high probability via a divide-and-conquer algorithm, and thus obtain a dimension-free regret $O(s)$ (see Appendix C for more details).³

However, this divide-and-conquer algorithm is specific to deterministic sparse linear bandit problems and does not work for noisy models. Hence, we study the following natural question:

Can we design an algorithm whose regret adapts to the noise level such that the regret interpolates the $\sqrt{dT}$-type bound in the worst case and the dimension-free bound in the deterministic case?

Before introducing our results, we would like to mention that there are recent works that studied noise-adaptivity in linear bandits (Zhou et al., 2021; Zhang et al., 2021; Kim et al., 2021). They gave variance-aware regret bounds of the form $O\big(\mathrm{poly}(d)\sqrt{\sum_{t=1}^{T} \sigma_t^2} + \mathrm{poly}(d)\big)$, where $\sigma_t^2$ is the (conditional) variance of the noise $\eta_t$. This bound reduces to the standard $O(\mathrm{poly}(d)\sqrt{T})$ bound in the worst case when $\sigma_t = \Omega(1)$, and to a constant-type regret $O(\mathrm{poly}(d))$ that is independent of $T$ in the deterministic case when $\sigma_t \equiv 0$. However, compared with the linear bandits setting, the variance-aware bound for sparse linear bandits is more significant because it reduces to a dimension-free bound in the noiseless setting. Despite this, to our knowledge, no variance-aware regret bounds exist for sparse linear bandits.
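To make the motivating example concrete, here is a minimal sketch of one possible divide-and-conquer support search in the noiseless setting. It is not the algorithm of Appendix C (which is not reproduced here): the function name `find_support`, the `query` callback, the Gaussian block weights, and the tolerance `tol` are all illustrative assumptions. A block of coordinates is probed with a random unit vector supported on that block; a zero response means (almost surely) that the block contains no non-zero entry of $\theta^*$, and otherwise the block is split in half.

```python
import numpy as np

def find_support(query, d, rng, tol=1e-12):
    """Locate the non-zero coordinates of a hidden vector using noiseless
    inner-product queries on the unit sphere (illustrative sketch only).

    query(x) is assumed to return <theta_star, x> exactly for a unit vector x.
    """
    support = []
    stack = [np.arange(d)]                 # blocks of coordinates still to probe
    while stack:
        block = stack.pop()
        # Random direction supported on the block, normalized to the unit sphere.
        w = rng.standard_normal(block.size)
        x = np.zeros(d)
        x[block] = w / np.linalg.norm(w)
        if abs(query(x)) <= tol:
            continue                       # no non-zero entry in this block (a.s.)
        if block.size == 1:
            support.append(int(block[0]))  # isolated a non-zero coordinate
        else:
            mid = block.size // 2          # otherwise bisect and recurse
            stack.append(block[:mid])
            stack.append(block[mid:])
    return sorted(support)
```

Each non-zero coordinate lies on a root-to-leaf path of length about $\log_2 d$, so the number of queries in this sketch is on the order of $s \log d$, in line with the step count quoted above.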

1.1. OUR CONTRIBUTIONS

This paper gives the first set of variance-aware regret bounds for sparse linear bandits. We design a general framework, VASLB, to reduce variance-aware sparse linear bandits to variance-aware linear bandits with little overhead in regret. For ease of presentation, we define the following notation to characterize the variance-awareness of a sparse linear bandit algorithm:

Definition 1. A variance-aware sparse linear bandit algorithm $\mathcal{F}$ is $(f(s,d), g(s,d))$-variance-aware if, for any given failure probability $\delta > 0$, with probability $1 - \delta$, $\mathcal{F}$ ensures

$$R_T^{\mathcal{F}} \le O\Big( f(s,d) \sqrt{\sum_{t=1}^{T} \sigma_t^2}\; \mathrm{polylog}\tfrac{1}{\delta} + g(s,d)\, \mathrm{polylog}\tfrac{1}{\delta} \Big),$$

where $R_T^{\mathcal{F}}$ is the regret of $\mathcal{F}$ in $T$ rounds, $d$ is the ambient dimension and $s$ is the maximum number of non-zero coordinates. Specifically, for linear bandits, $f, g$ are functions only of $d$.

Hence, an $(f, g)$-variance-aware algorithm will achieve $O(f(s,d)\sqrt{T}\, \mathrm{polylog}\tfrac{1}{\delta})$ worst-case regret and $O(g(s,d)\, \mathrm{polylog}\tfrac{1}{\delta})$ deterministic-case regret. Ideally, we would like $g(s,d)$ to be independent of $d$, making the bound dimension-free in deterministic cases, as with the divide-and-conquer approach.

In this paper, we provide a general framework that can convert any linear bandit algorithm $\mathcal{F}$ to a corresponding sparse linear bandit algorithm $\mathcal{G}$ in a black-box manner. Moreover, it is variance-awareness-preserving, in the sense that if $\mathcal{F}$ enjoys the variance-aware property, so does $\mathcal{G}$. Generally speaking, if the plug-in linear bandit algorithm $\mathcal{F}$ is $(f(d), g(d))$-variance-aware, then our framework directly gives an $(s(f(s) + \sqrt{d}), s(g(s) + 1))$-variance-aware algorithm $\mathcal{G}$ for sparse linear bandits.

Besides presenting our framework, we also illustrate its usefulness by plugging in two existing variance-aware linear bandit algorithms, where the first one works in unknown-variance cases but is computationally inefficient. In contrast, the second one is efficient but requires the variance $\sigma_t^2$ to be delivered together with the feedback $r_t$. Their regret guarantees are stated as follows.

1. The first variance-aware linear bandit algorithm we plug in is VOFUL, which was proposed by Zhang et al. (2021) and improved by Kim et al. (2021). This algorithm is computationally inefficient but deals with unknown variances. Using VOFUL, our framework generates an $(s^{2.5} + s\sqrt{d}, s^3)$-variance-aware algorithm for sparse linear bandits. Compared to the $\Omega(\sqrt{sdT})$ regret lower bound for sparse linear bandits (Lattimore & Szepesvári, 2020, §24.3), our worst-case regret bound is near-optimal up to a factor $\sqrt{s}$. Moreover, our bound is independent of $d$ and $T$ in the deterministic case, nearly matching the bound of the divide-and-conquer algorithm dedicated to the deterministic setting up to $\mathrm{poly}(s)$ factors.

2. The second algorithm we plug in is Weighted OFUL (Zhou et al., 2021), which requires known variances but is computationally efficient. We obtain an $(s^2 + s\sqrt{d}, s^{1.5}\sqrt{T})$-variance-aware efficient algorithm. In the deterministic case, this algorithm can only achieve a $\sqrt{T}$-type regret bound (albeit still independent of $d$). We note that this is not due to our framework but due to Weighted OFUL, which itself cannot give a constant regret bound in the deterministic setting.

Moreover, we would like to remark that our deterministic regret can be further improved if a better variance-aware linear bandit algorithm is deployed: the current ones either have $O(d^2)$ (Kim et al., 2021) or $O(\sqrt{dT})$ (Zhou et al., 2021) regret in the deterministic case, which are both sub-optimal compared with the $\Omega(d)$ lower bound.
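As a worked instance of the reduction, the sketch below plugs in the $(f, g)$ pairs that can be read off from the guarantees quoted for the two plug-in algorithms: $f(d) = d^{1.5}$, $g(d) = d^2$ for VOFUL2 (Proposition 4) and $f(d) = d$, $g(d) = \sqrt{dT}$ for Weighted OFUL (Proposition 6). Letting $g$ depend on $T$ in the second case is a slight abuse of Definition 1, and logarithmic factors are ignored throughout.

```latex
% VOFUL2 as the plug-in: f(d) = d^{1.5}, g(d) = d^{2}
\begin{align*}
s\bigl(f(s) + \sqrt{d}\bigr) &= s\bigl(s^{1.5} + \sqrt{d}\bigr) = s^{2.5} + s\sqrt{d}, \\
s\bigl(g(s) + 1\bigr)        &= s\bigl(s^{2} + 1\bigr) = O(s^{3}).
\end{align*}

% Weighted OFUL as the plug-in: f(d) = d, g(d) = \sqrt{dT}
\begin{align*}
s\bigl(f(s) + \sqrt{d}\bigr) &= s\bigl(s + \sqrt{d}\bigr) = s^{2} + s\sqrt{d}, \\
s\bigl(g(s) + 1\bigr)        &= s\bigl(\sqrt{sT} + 1\bigr) = s^{1.5}\sqrt{T} + s.
\end{align*}
```

The leading terms agree with the $(s^{2.5} + s\sqrt{d}, s^3)$ and $(s^2 + s\sqrt{d}, s^{1.5}\sqrt{T})$ pairs stated in the two items above.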

1.2. RELATED WORK

Linear Bandits. This problem was first introduced by Dani et al. (2008), where an algorithm with regret $O(d\sqrt{T}(\log T)^{3/2})$ and a near-matching regret lower bound $\Omega(d\sqrt{T})$ were given. After that, an improved upper bound $O(d\sqrt{T}\log T)$ (Abbasi-Yadkori et al., 2011) together with an improved lower bound $\Omega(d\sqrt{T \log T})$ (Li et al., 2019) were derived. An extension of it, namely linear contextual bandits, where the action set allowed for each step can vary with time (Chu et al., 2011; Kannan et al., 2018; Li et al., 2019; 2021), is receiving more and more attention. The best-arm identification problem, where the goal of the agent is to approximate $\theta^*$ with as few samples as possible (Soare et al., 2014; Degenne et al., 2019; Jedra & Proutiere, 2020; Alieva et al., 2021), is also of great interest.

Sparse Linear Bandits. Abbasi-Yadkori et al. (2011) and Carpentier & Munos (2012) concurrently considered the sparse linear bandit problem, where the former work assumed a noise model of $r_t = \langle x_t, \theta^* \rangle + \eta_t$ such that $\eta_t$ is $R$-sub-Gaussian and achieved $O(R\sqrt{sdT})$ regret, while the latter considered the noise model $r_t = \langle x_t + \eta_t, \theta^* \rangle$ such that $\|\eta_t\|_2 \le \sigma$ and $\|\theta^*\|_2 \le \theta$, achieving $O((\sigma + \theta)^2 s\sqrt{T})$ regret. Lattimore et al. (2015) assumed a hypercube action set (i.e., $\mathcal{X} = [-1, 1]^d$) and a ground truth with $\|\theta^*\|_1 \le 1$, yielding $O(s\sqrt{T})$ regret. Antos & Szepesvári (2009) proved an $\Omega(\sqrt{dT})$ lower bound when $s = 1$ with the unit sphere as $\mathcal{X}$. Some recent works considered data-poor regimes where $d \gg T$ (Hao et al., 2020; 2021a;b; Wang et al., 2020), which is beyond the scope of this paper. Another work worth mentioning is the recent work by Dong et al. (2021), which studies bandits or MDPs with deterministic rewards. Their result implies an $O(T^{15/16} s^{1/16})$ bound for deterministic sparse linear bandits, which is independent of $d$. They also provided an ad-hoc divide-and-conquer algorithm, which achieves $O(s \log d)$ regret only for deterministic cases.

Variance-Aware Online Learning. For tabular MDPs, the variance information is widely used in both discounted settings (Lattimore & Hutter, 2012) and episodic settings (Azar et al., 2017; Jin et al., 2018), where Zanette & Brunskill (2019) used variance information to derive problem-dependent regret bounds for tabular MDPs. For bandits, Audibert et al. (2009) made use of variance information in multi-armed bandits, giving an algorithm outperforming existing ones when the variances for suboptimal arms are relatively small. For bandits with high-dimensional structures, Faury et al. (2020) studied variance adaptation for logistic bandits, Zhou et al. (2021) considered linear bandits and linear mixture MDPs where the variance information is revealed to the agent, [...] (2021). The recent work by Hou et al. (2022) considered variance-constrained best arm identification, where the feedback noise only depends on the action by the agent (whereas ours can depend on time, which is more general than theirs). Another recent work (Zhao et al., 2022) studied variance-aware regret bounds for bandits with general function approximation in the known-variance case.

Algorithm | Setting | Worst-case regret (a) | Deterministic-case regret (b) | Efficient | Variances
[...] | [...] | $O(d^{1.5}\sqrt{T})$ | $O(d^2)$ | ✗ | Unknown
VASLB (This work) | Sparse LinBandit | $O(s^2\sqrt{T} + s\sqrt{dT})$ | $O(s^{1.5}\sqrt{T})$ | ✓ | Known
VASLB (This work) | Sparse LinBandit | $O(s^{2.5}\sqrt{T} + s\sqrt{dT})$ | $O(s^3)$ | ✗ | Unknown
Lower Bound (Antos & Szepesvári, 2009) | Sparse LinBandit | $\Omega(\sqrt{dT})$ | N/A | N/A | N/A

(a) "Worst-case" means the variances $\sigma_t^2$ are all 1. Here, $d$ is the ambient dimension, $T$ is the number of rounds, and $s$ is the sparsity parameter (only applicable to sparse linear bandits).
(b) "Deterministic-case" means the variances $\sigma_t^2$ are all 0. Only applicable to variance-aware algorithms.

Stochastic Contextual Sparse Linear Bandits. In this setting, the action set for each round $t$ is sampled i.i.d. (called the "context"). It is known that $O(\sqrt{sT})$ regret is achievable in this setting (Kim & Paik, 2019; Ren & Zhou, 2020; Oh et al., 2021; Ariu et al., 2022). However, in our setting where both the action set $\mathcal{X}$ and the ground truth $\theta^*$ are fixed, a polynomial dependency on $d$ is in general unavoidable because it is impossible to learn more than one parameter per arm (Bastani & Bayati, 2020), agreeing with the $\Omega(\sqrt{dT})$ lower bound when $s = 1$ (Antos & Szepesvári, 2009; Abbasi-Yadkori et al., 2012).

2. PROBLEM SETUP

Notations. We use $[N]$ to denote the set $\{1, 2, \ldots, N\}$ where $N \in \mathbb{N}$. For a vector $x \in \mathbb{R}^d$, we use $\|x\|_p$ to denote its $L_p$-norm, namely $\|x\|_p \triangleq (\sum_{i=1}^{d} x_i^p)^{1/p}$. We use $\mathbb{S}^{d-1}$ to denote the $(d-1)$-dimensional unit sphere, i.e., $\mathbb{S}^{d-1} \triangleq \{x \in \mathbb{R}^d \mid \|x\|_2 = 1\}$. We use $\widetilde{O}(\cdot)$ and $\widetilde{\Theta}(\cdot)$ to hide all logarithmic factors in $T$, $s$, $d$ and $\log\frac{1}{\delta}$ (see Footnote 1). For a random event $\mathcal{E}$, we denote its indicator by $\mathbb{1}[\mathcal{E}]$.

We assume the action space and the ground-truth space are both the $(d-1)$-dimensional unit sphere, denoted by $\mathcal{X} \triangleq \mathbb{S}^{d-1}$. Denote the ground truth by $\theta^* \in \mathcal{X}$. There will be $T \ge 1$ rounds for the agent to make decisions sequentially. At the beginning of round $t \in [T]$, the agent has to choose an action $x_t \in \mathcal{X}$. At the end of round $t$, the agent receives a noisy feedback $r_t = \langle x_t, \theta^* \rangle + \eta_t$, $\forall t \in [T]$, where $\eta_t$ is an independent zero-mean Gaussian random variable. Denote by $\sigma_t^2 = \mathrm{Var}(\eta_t)$ the variance of $\eta_t$. For a fair comparison with non-variance-aware algorithms, we assume that $\sigma_t^2 \le 1$. The agent then receives a (deterministic and unrevealed) reward of magnitude $\langle x_t, \theta^* \rangle$ for this round. The agent is allowed to make the decision $x_t$ based on all historical actions $x_1, \ldots, x_{t-1}$, all historical feedback $r_1, \ldots, r_{t-1}$, and any amount of private randomness. The agent's goal is to minimize the regret, defined as follows.

Definition 2 (Regret). The following random variable is the regret of a linear bandit algorithm:

$$R_T = \max_{x \in \mathcal{X}} \sum_{t=1}^{T} \langle x, \theta^* \rangle - \sum_{t=1}^{T} \langle x_t, \theta^* \rangle = \sum_{t=1}^{T} \langle \theta^* - x_t, \theta^* \rangle, \qquad (1)$$

where the second equality is due to our assumption that $\mathcal{X} = \mathbb{S}^{d-1}$.

3: if $S \ne \emptyset$ then ▷ Have some coordinates to "commit".

4: Initialize a new linear bandit instance $\mathcal{F}$ on coordinates $S$. ▷ "Commit" phase.
5: Execute $\mathcal{F}$ for $n_\Delta^a \ge 1$ steps & maintain pessimistic estimation $\overline{R}^{\mathcal{F}}_{n_\Delta^a}$, until $\frac{1}{n_\Delta^a} \overline{R}^{\mathcal{F}}_{n_\Delta^a} < \Delta^2$.
6: Suppose that $\mathcal{F}$ plays $x_1, x_2, \ldots, x_{n_\Delta^a}$. Set $\hat\theta = \frac{1}{n_\Delta^a} \sum_{i=1}^{n_\Delta^a} x_i$ as the estimate for $\{\theta_i^*\}_{i \in S}$.
7: if $\sum_{i \in S} \hat\theta_i^2 \le 1 - \Delta^2$ then ▷ Still have undiscovered coordinates with $\theta_i^* > \Delta^2$
8: Let $R \leftarrow \sqrt{1 - \sum_{i \in S} \hat\theta_i^2}$, $K = d - |S|$. ▷ "Explore" phase.
9: Perform $n_\Delta^b \ge 1$ calls to RANDOMPROJECTION($K, R, S, \hat\theta$) in Algorithm 2, until

$$2\sqrt{2 \sum_{k=1}^{n_\Delta^b} (r_{k,i} - \bar{r}_i)^2 \ln\tfrac{4}{\delta}} < n_\Delta^b \cdot \frac{\Delta}{4}, \quad \forall 1 \le i \le K, \qquad (2)$$

where $r_k$ is the $k$-th return vector of RANDOMPROJECTION and $\bar{r} \triangleq \frac{1}{n_\Delta^b} \sum_{k=1}^{n_\Delta^b} r_k$.
10: for $i = 1, 2, \ldots, K$ do
11: if $|\bar{r}_i| > \Delta$ where $\bar{r} = \frac{1}{n_\Delta^b} \sum_{k=1}^{n_\Delta^b} r_k$ then add the $i$-th element that is not in $S$ to $S$.

Algorithm 2 The RANDOMPROJECTION Subroutine
1: function RANDOMPROJECTION($K, R, S, \hat\theta$)
2: Generate $K$ i.i.d. samples $y_1, y_2, \ldots, y_K$, each with equal probability being $\pm \frac{R}{\sqrt{K}}$.
3: Play $x \in \mathcal{X}$ constructed as $x_i = \hat\theta_i$ if $i \in S$, and $x_i = y_j$ if $i$ is the $j$-th element that is not in $S$.
4: return $\frac{K}{R^2} \big( (r - \sum_{i \in S} \hat\theta_i^2)\, y \big)$ where $r = \langle x, \theta^* \rangle + \eta$ is the (noisy) feedback.

For the sparse linear bandit problem, we have an additional restriction that $\|\theta^*\|_0 \le s$, i.e., at most $s$ coordinates of $\theta^*$ are non-zero. However, as mentioned in the introduction, the agent does not know anything about $s$: she only knows that she is facing a (probably sparse) linear environment.
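The RANDOMPROJECTION subroutine can be sketched in a few lines of NumPy. This is an illustrative simulation rather than the authors' implementation: the arguments `theta_star`, `sigma` and `rng` are assumptions added so that the environment's feedback can be generated inside the function, `theta_hat` is assumed to be a length-$d$ vector that is zero outside $S$, and $R$ is assumed to equal $\sqrt{1 - \sum_{i \in S} \hat\theta_i^2}$ so that the played action has unit norm.

```python
import numpy as np

def random_projection(K, R, S, theta_hat, theta_star, sigma, rng):
    """One call of a RANDOMPROJECTION-style step (illustrative sketch).

    K, R, S, theta_hat follow Algorithm 2; theta_star, sigma and rng are
    simulation-only arguments standing in for the unknown environment.
    Returns the played action x and the length-K return vector.
    """
    d = theta_star.shape[0]
    idx_S = np.array(sorted(S), dtype=int)           # committed coordinates
    rest = np.setdiff1d(np.arange(d), idx_S)         # the K coordinates outside S
    # Each y_j is +R/sqrt(K) or -R/sqrt(K) with equal probability.
    y = rng.choice([-1.0, 1.0], size=K) * R / np.sqrt(K)
    x = np.zeros(d)
    x[idx_S] = theta_hat[idx_S]                      # reserve mass on S
    x[rest] = y                                      # random signs elsewhere
    r = float(x @ theta_star) + sigma * rng.standard_normal()  # noisy feedback
    offset = float(np.sum(theta_hat[idx_S] ** 2))    # subtracted from the feedback
    return x, (K / R**2) * (r - offset) * y
```

In the noiseless case with an exact estimate on $S$, the coordinate of the return vector that corresponds to a lone undiscovered entry of $\theta^*$ equals that entry exactly, because $\frac{K}{R^2} y_j^2 = 1$.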

3. FRAMEWORK AND ANALYSIS

Our framework VASLB is presented in Algorithm 1. We explain its design in Section 3.1 and sketch its analysis in Section 3.2. Then we give two applications using VOFUL2 (Kim et al., 2021) and Weighted OFUL (Zhou et al., 2021) as $\mathcal{F}$, whose analyses are sketched in Sections 4.1 and 4.2.

3.1. MAIN DIFFICULTIES AND TECHNICAL OVERVIEW

At a high level, our framework follows the spirit of the classic "explore-then-commit" approach (which is directly adopted by Carpentier & Munos (2012)), where the agent first identifies those "huge" entries of $\theta^*$ and then performs a linear bandit algorithm on them. However, it is hard to incorporate variances into this vanilla idea to make it variance-aware: the desired regret depends on variances and is thus unknown to the agent. Thus it is difficult to determine a "gap threshold" $\Delta$ (that is, the agent stops exploring and starts to "commit" after identifying all $\theta_i^* \ge \Delta$) within a few rounds. For example, in the deterministic case, the agent must identify all non-zero entries to make the regret independent of $T$; on the other hand, in the worst case where $\sigma_t \equiv 1$, the agent only needs to identify all entries with magnitude at least $T^{-1/4}$ to yield $\sqrt{T}$-style regret bounds. At the same time, the actual setting might be a mixture of them (e.g., $\sigma_t \equiv 0$ for $t \le t_0$ and $\sigma_t \equiv 1$ for $t > t_0$ where $t_0 \in [T]$). As a result, such an idea cannot always succeed in determining the correct threshold $\Delta$ and getting the desired regret.

In our proposed framework, we tackle this issue by performing "explore-then-commit" multiple times. We reduce the uncertainty gently and alternate between "explore" and "commit" modes. We decrease a "gap threshold" $\Delta$ in a halving manner and, at the same time, maintain a set $S$ of coordinates that we believe to have a magnitude larger than $\Delta$. For each $\Delta$, we "explore" (estimating $\theta_i^*$ and adding those greater than $\Delta$ into $S$) and "commit" (performing linear bandit algorithms on coordinates in $S$).

However, as we "explore" again after "committing", we face a unique challenge: Suppose that some entry $i \in [d]$ is identified to be at least $2\Delta$ by previous "explore" phases. During the next "explore" phase, we cannot directly do pure exploration over the remaining unidentified coordinates; otherwise, coordinate $i$ will incur $4\Delta^2$ regret for each round. Fortunately, we can get an estimate $\hat\theta_i$ of $\theta_i^*$ during the previous "commit" phase thanks to the regret-to-sample-complexity conversion (Eq. (3)). Guarded with this estimate, we can reserve $\hat\theta_i$ mass for arm $i$ and subtract $\hat\theta_i^2$ from the feedback in subsequent "explore" phases. More precisely, we do the following.

1. In the "commit" phase where we apply the black-box $\mathcal{F}$, we estimate $\{\theta_i^*\}_{i \in S}$ by the regret-to-sample-complexity conversion: Suppose $\mathcal{F}$ plays $x_1, x_2, \ldots, x_n$ and achieves regret $R_n^{\mathcal{F}}$, then

$$\langle \theta^* - \hat\theta, \theta^* \rangle \le \frac{R_n^{\mathcal{F}}}{n}, \quad \text{where } \hat\theta \triangleq \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (3)$$

Hence, if we take $\{\hat\theta_i\}_{i \in S}$ as an estimate of $\{\theta_i^*\}_{i \in S}$, the estimation error shrinks as $R_n^{\mathcal{F}}$ is sublinear and the LHS of Eq. (3) is non-negative. Moreover, as we can show that $\hat\theta$ is not far from $\mathcal{X}$ (Lemma 18), we can safely use $\{\hat\theta_i\}_{i \in S}$ to estimate $\{\theta_i^*\}_{i \in S}$ in subsequent phases. More importantly, if we were granted access to $R_n^{\mathcal{F}}$, we would know how close the estimate is and could proceed to the next stage once it becomes satisfactory. But it is unrevealed. Fortunately, we know the regret guarantee of $\mathcal{F}$, namely $\overline{R}_n^{\mathcal{F}}$, which can serve as a pessimistic estimate of $R_n^{\mathcal{F}}$. Hence, terminating when $\frac{1}{n} \overline{R}_n^{\mathcal{F}} < \Delta^2$ ensures that $\langle \theta^* - \hat\theta, \theta^* \rangle < \Delta^2$ holds with high probability.

2. In the "explore" phase, as mentioned before, we can keep the regret incurred by the coordinates identified in $S$ small by putting mass $\hat\theta_i$ for each $i \in S$. For the remaining ones, we use random projection, an idea borrowed from the compressed sensing literature (Blumensath & Davies, 2009; Carpentier & Munos, 2012), to find those with large magnitudes and add them to $S$. One may notice that putting mass $\hat\theta_i$ for all $i \in S$ will induce bias in our estimation, as $\sum_{i \in S} \hat\theta_i^2 \ne \sum_{i \in S} \hat\theta_i \theta_i^*$. However, as $\hat\theta_i$ is close to $\theta_i^*$, this bias will be bounded by $O(\Delta^2)$ and become dominated by $\frac{\Delta}{4}$ as $\Delta$ decreases. Hence, if we omit this bias, we can overestimate the estimation error using standard concentration inequalities like Empirical Bernstein (Maurer & Pontil, 2009; Zhang et al., 2021). Once it becomes small enough, we alternate to the "commit" phase again.

Therefore, with high probability, we can ensure all coordinates not in $S$ have magnitudes no more than $O(\Delta)$ and all coordinates in $S$ will together contribute regret bounded by $O(\Delta^2)$. Hence, the regret in each step is (roughly) bounded by $O(s\Delta^2)$. Upper bounding the number of steps needed for each stage and exploiting the regret guarantees of the chosen $\mathcal{F}$ then gives well-bounded regret.
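The regret-to-sample-complexity conversion in item 1 can be checked numerically. The sketch below is only an illustration under the paper's assumption that actions and $\theta^*$ lie on the unit sphere; the function name `average_play_gap` and the random plays used to exercise it are hypothetical. It computes both sides of the conversion for the realized regret $\sum_{i} \langle \theta^* - x_i, \theta^* \rangle$, for which the two sides coincide by linearity of the inner product.

```python
import numpy as np

def average_play_gap(plays, theta_star):
    """Compare <theta* - theta_hat, theta*> with R_n / n (illustrative sketch).

    plays: array of shape (n, d) holding the actions x_1, ..., x_n.
    theta_star: ground-truth vector, assumed to lie on the unit sphere.
    """
    plays = np.asarray(plays, dtype=float)
    theta_hat = plays.mean(axis=0)                      # averaged action
    gap = float((theta_star - theta_hat) @ theta_star)  # left-hand side of the conversion
    # Realized regret: sum over rounds of <theta* - x_i, theta*>.
    regret = float(np.sum((theta_star - plays) @ theta_star))
    return gap, regret / len(plays)
```

Since $\hat\theta$ is an average of unit vectors, $\|\hat\theta\|_2 \le 1$ and the gap is non-negative, which is why a small average regret forces $\hat\theta$ to align with $\theta^*$.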

3.2. ANALYSIS OF THE FRAMEWORK

Notations. For each $\Delta$, let $\mathcal{T}_\Delta$ be the set of rounds associated with $\Delta$. By our algorithm, each $\mathcal{T}_\Delta$ should be an interval. Moreover, $\{\mathcal{T}_\Delta\}_\Delta$ forms a partition of $[T]$. Define $\mathcal{T}_\Delta^a$ as all the rounds in the "commit" phase when the gap threshold is $\Delta$ (where $\mathcal{F}$ is executed), and $\mathcal{T}_\Delta^b$ as the "explore" phase (i.e., those executing RANDOMPROJECTION). Let $\widetilde{\mathcal{T}}_\Delta^a$ and $\widetilde{\mathcal{T}}_\Delta^b$ be the steps at which the agent decided not to proceed in $\mathcal{T}_\Delta^a$ and $\mathcal{T}_\Delta^b$, respectively, which are formally defined as $\widetilde{\mathcal{T}}_\Delta^i = \{t \in \mathcal{T}_\Delta^i \mid t \ne \max_{t' \in \mathcal{T}_\Delta^i} t'\}$, $i = a, b$. Define the final value of $\Delta$ as $\Delta_f$. Denote $n_\Delta^a = |\mathcal{T}_\Delta^a|$ and $n_\Delta^b = |\mathcal{T}_\Delta^b|$ (both are stopping times). We have $\sum_{\Delta = 2^{-2}, \ldots, \Delta_f} (n_\Delta^a + n_\Delta^b) = T$. We can then decompose $R_T$ into $R_T^a$ and $R_T^b$:

$$R_T^a = \sum_{\Delta = 2^{-2}, \ldots, \Delta_f} \sum_{t \in \mathcal{T}_\Delta^a} \langle \theta^* - x_t, \theta^* \rangle, \qquad R_T^b = \sum_{\Delta = 2^{-2}, \ldots, \Delta_f} \sum_{t \in \mathcal{T}_\Delta^b} \langle \theta^* - x_t, \theta^* \rangle,$$

where $R_T^a$ may depend on the choice of $\mathcal{F}$ and $R_T^b$ only depends on the framework (Algorithm 1) itself. We now show that, as long as the regret estimation $\overline{R}_n^{\mathcal{F}}$ is indeed an overestimation of $R_n^{\mathcal{F}}$ with high probability, we can get a good upper bound on $R_T^b$, which is formally stated as Theorem 3. The full proof of Theorem 3 will be presented in Appendix F and is only sketched here.

Theorem 3. Suppose that for any execution of $\mathcal{F}$ that lasts for $n$ steps, $\overline{R}_n^{\mathcal{F}} \ge R_n^{\mathcal{F}}$ holds with probability $1 - \delta$, i.e., $\overline{R}_n^{\mathcal{F}}$ is pessimistic. Then the total regret incurred by the second phase satisfies

$$R_T^b = O\Big( s\sqrt{d}\sqrt{\sum_{t=1}^{T} \sigma_t^2 \log\tfrac{1}{\delta}} + s \log\tfrac{1}{\delta} \Big) \quad \text{with probability } 1 - \delta.$$

Remark. This theorem indicates that our framework itself will only induce an $(s\sqrt{d}, s)$-variance-awareness to the resulting algorithm. As noticed by Abbasi-Yadkori et al. (2011), when $\sigma_t \equiv 1$, $\Omega(\sqrt{sdT})$ regret is unavoidable, which means that it is only sub-optimal by a factor no more than $\sqrt{s}$. Moreover, for deterministic cases, the $O(s)$ regret also matches the aforementioned divide-and-conquer algorithm, which is specially designed and can only work for deterministic cases.

Proof Sketch of Theorem 3. We define two good events that hold with high probability for a given gap threshold $\Delta$: $\mathcal{G}_\Delta$ and $\mathcal{H}_\Delta$. Informally, $\mathcal{G}_\Delta$ means $\sum_{i \in S} \theta_i^* (\theta_i^* - \hat\theta_i) < \Delta^2$ (i.e., $\hat\theta$ is close to $\theta^*$ after "commit") and $\mathcal{H}_\Delta$ stands for $|\theta_i^*| \ge \Omega(\Delta)$ if and only if $i \in S$ (i.e., we "explore" correctly). See Eq. (10) in the appendix for formal definitions. For $\mathcal{G}_\Delta$, from Eq. (3), we know that it happens as long as $\overline{R}_n^{\mathcal{F}} \ge R_n^{\mathcal{F}}$. It remains to argue that $\Pr\{\mathcal{H}_\Delta \mid \mathcal{G}_\Delta, \mathcal{H}_{2\Delta}\} \ge 1 - s\delta$. By Algorithm 2, the $i$-th coordinate of each $r_k$ ($1 \le k \le n_\Delta^b$) is an independent sample of

$$\frac{K}{R^2} (y_i)^2 \theta_i^* + \sum_{j \in S} \hat\theta_j (\theta_j^* - \hat\theta_j) + \sum_{j \notin S, j \ne i} \frac{K}{R^2} y_i y_j \theta_j^* + \frac{K}{R^2} y_i \eta_n,$$

where $\frac{\sqrt{K}}{R} y_i$ is an independent Rademacher random variable. After conditioning on $\mathcal{G}_\Delta$ and $\mathcal{H}_{2\Delta}$, $\sum_{i \in S} \hat\theta_i^2$ and $\sum_{i \in S} \hat\theta_i \theta_i^*$ will be close. Therefore, the first term is exactly $\theta_i^*$ (the magnitude we want to estimate), the second term is a small bias bounded by $O(\Delta^2)$, and the last two terms are zero-mean noises, which are bounded by $\frac{\Delta}{4}$ according to the Empirical Bernstein Inequality (Theorem 10) and our choice of $n_\Delta^b$ (Eq. (2)). Hence, $\Pr\{\mathcal{H}_\Delta \mid \mathcal{G}_\Delta, \mathcal{H}_{2\Delta}\} \ge 1 - s\delta$.

Let us focus on an arm $i^*$ never identified into $S$ in Algorithm 1. By definition of $n_\Delta^b$ (Eq. (2)),

$$(n_\Delta^b - 1) \frac{\Delta}{4} < 2\sqrt{2 \sum_{t \in \widetilde{\mathcal{T}}_\Delta^b} (r_{t,i^*} - \bar{r}_{i^*})^2 \ln\tfrac{4}{\delta}} \le 2\sqrt{2 \sum_{t \in \widetilde{\mathcal{T}}_\Delta^b} (r_{t,i^*} - \mathbb{E}[r_{t,i^*}])^2 \ln\tfrac{4}{\delta}},$$

where the second inequality is due to properties of sample variances. By $\mathcal{G}_\Delta$, those coordinates in $S$ will incur regret of $\sum_{i \in S} (\theta_i^* - x_{t,i}) \theta_i^* = \sum_{i \in S} (\theta_i^* - \hat\theta_i) \theta_i^* < \Delta^2$ for all $t \in \mathcal{T}_\Delta^b$. Moreover, by $\mathcal{H}_{2\Delta}$, each arm outside $S$ will roughly incur $n_\Delta^b (\theta_i^*)^2 = O(n_\Delta^b \Delta^2)$ regret, as the $y_i$'s are independent and zero-mean. As there are at most $s$ non-zero coordinates, the total regret for $\mathcal{T}_\Delta^b$ will be roughly bounded by $O(n_\Delta^b \cdot s\Delta^2)$ (there exists another term due to the randomized $y_i$'s, which is dominated and omitted here; see Lemma 21 for more details). Hence, the total regret is bounded by

$$R_T^b \lesssim \sum_\Delta O(s n_\Delta^b \Delta^2) = s \cdot O\Big( \sum_\Delta \Delta \sqrt{\sum_{t \in \mathcal{T}_\Delta^b} (r_{t,i^*} - \mathbb{E}[r_{t,i^*}])^2 \ln\tfrac{4}{\delta}} \Big) + O(s).$$

To avoid undesired $\mathrm{poly}(T)$ factors, we cannot directly apply the Cauchy-Schwarz inequality to the sum of square roots (as there are many $\Delta$'s). Instead, again by definition of $n_\Delta^b$ (Eq. (2)), we observe the following lower bound on $n_\Delta^b$, which holds for all $\Delta$'s except for $\Delta_f$:

$$n_\Delta^b \ge \Omega\Big( \frac{1}{\Delta} \sqrt{\sum_{t \in \mathcal{T}_\Delta^b} (r_{t,i^*} - \mathbb{E}[r_{t,i^*}])^2 \ln\tfrac{1}{\delta}} \Big).$$

As $\sum_\Delta n_\Delta^b \le T$, some arithmetic gives (intuitively, by thresholding, we manage to "move" the summation over $\Delta$ into the square root, at the cost of an extra logarithmic factor; see Eq. (15) in the appendix for more details)

$$\sum_{\Delta \ne \Delta_f} \Delta \sqrt{\sum_{t \in \mathcal{T}_\Delta^b} (r_{t,i^*} - \mathbb{E}[r_{t,i^*}])^2} = \widetilde{O}\Big( \sqrt{\sum_{\Delta \ne \Delta_f} \Delta^2 \sum_{t \in \mathcal{T}_\Delta^b} (r_{t,i^*} - \mathbb{E}[r_{t,i^*}])^2} \Big).$$

For a given $\Delta$ and any $1 \le k \le n_\Delta^b$, the expectation of $(r_{k,i^*} - \mathbb{E}[r_{k,i^*}])^2$ is bounded by $1 + \frac{K}{R^2} \sigma_k^2$ (Eq. (19) in the appendix), which is no more than $1 + \frac{4d}{\Delta^2} \sigma_k^2$. By concentration properties of the sample variances (Theorem 14 in the appendix), the empirical $(r_{k,i^*} - \mathbb{E}[r_{k,i^*}])^2$ should also be close to $(1 + \frac{4d}{\Delta^2} \sigma_k^2)$; hence, one can write (omitting all $\log\frac{1}{\delta}$ terms)

$$R_T^b = O\Big( \sum_\Delta s n_\Delta^b \Delta^2 \Big) = O\Big( s\sqrt{\sum_\Delta n_\Delta^b \Delta^2} + s\sqrt{d \sum_{t=1}^{T} \sigma_t^2} \Big).$$

As $\sum_\Delta n_\Delta^b \Delta^2$ appears on both sides, we can apply the "self-bounding" property (Efroni et al., 2020, Lemma 38) to conclude $R_T^b = O(\sum_\Delta s n_\Delta^b \Delta^2) = O\big( s\sqrt{d \sum_{t=1}^{T} \sigma_t^2} + s \big)$, as claimed.
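The stopping rule of Eq. (2), on which the proof sketch above relies, can be written as a small predicate. This is a minimal sketch that assumes the rule reads $2\sqrt{2 \sum_k (r_{k,i} - \bar{r}_i)^2 \ln\frac{4}{\delta}} < n_\Delta^b \cdot \frac{\Delta}{4}$ for every coordinate $i$; the function name `explore_should_stop` and its argument names are hypothetical.

```python
import math
import numpy as np

def explore_should_stop(returns, gap, fail_prob):
    """Check an empirical-Bernstein-style stopping rule (illustrative sketch).

    returns: array of shape (n, K), one RANDOMPROJECTION return vector per row.
    gap: the current gap threshold Delta.
    fail_prob: the failure probability delta.
    """
    returns = np.asarray(returns, dtype=float)
    n = returns.shape[0]
    centered = returns - returns.mean(axis=0)     # r_{k,i} - r_bar_i
    sq_dev = np.sum(centered ** 2, axis=0)        # per-coordinate sum of squares
    lhs = 2.0 * np.sqrt(2.0 * sq_dev * math.log(4.0 / fail_prob))
    # Stop only when the confidence width is below n * Delta / 4 for every coordinate.
    return bool(np.all(lhs < n * gap / 4.0))
```

With noiseless, identical return vectors the left-hand side is zero and the rule fires immediately, whereas noisy returns keep the "explore" phase running until the empirical spread is small relative to $n \Delta$.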

4. APPLICATIONS OF THE PROPOSED FRAMEWORK

After showing Theorem 3, it only remains to bound $R_T^a$, which depends on the choice of the plug-in algorithm $\mathcal{F}$. In this section, we give two specific choices of $\mathcal{F}$: VOFUL2 (Kim et al., 2021) and Weighted OFUL (Zhou et al., 2021). The former algorithm does not require the information of the $\sigma_t$'s (i.e., it works in unknown-variance cases), albeit being computationally inefficient. In contrast, the latter is computationally efficient but requires $\sigma_t^2$ to be revealed with the feedback $r_t$ at round $t$.

4.1. COMPUTATIONALLY INEFFICIENT ALGORITHM FOR UNKNOWN VARIANCES

We first use the VOFUL2 algorithm from Kim et al. (2021) as the plug-in algorithm $\mathcal{F}$, which has the following regret guarantee. Note that this is slightly stronger than the original bound: we derive a strengthened "self-bounding" version of it (the first inequality), which is critical to our analysis.

Proposition 4 (Kim et al. (2021, Variant of Theorem 2)). VOFUL2 executed for $n$ rounds on $d$ dimensions guarantees that, with probability at least $1 - \delta$, there exists a constant $C = O(1)$ such that

$$R_n^{\mathcal{F}} \le C \Big( d^{1.5} \sqrt{\sum_{k=1}^{n} \eta_k^2 \ln\tfrac{1}{\delta}} + d^2 \ln\tfrac{4}{\delta} \Big) = O\Big( d^{1.5} \sqrt{\sum_{k=1}^{n} \sigma_k^2 \log\tfrac{1}{\delta}} + d^2 \log\tfrac{4}{\delta} \Big),$$

where $n$ is a stopping time that is finite a.s. and $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ are the variances of the independent Gaussians $\eta_1, \eta_2, \ldots, \eta_n$.

We now construct the regret over-estimation $\overline{R}_n^{\mathcal{F}}$. Due to unknown variances, this is not straightforward. Our remedy is to use ridge linear regression $\hat\beta \triangleq \mathrm{argmin}_{\beta \in \mathbb{R}^d} \sum_{k=1}^{n} (r_k - \langle x_k, \beta \rangle)^2 + \lambda \|\beta\|_2^2$ for samples $\{(x_k, r_k)\}_{k=1}^{n}$, which ensures that the empirical variance estimation $\sum_{k=1}^{n} (r_k - \langle x_k, \hat\beta \rangle)^2$ differs from the true sample variance $\sum_{k=1}^{n} \eta_k^2 = \sum_{k=1}^{n} (r_k - \langle x_k, \beta^* \rangle)^2$ by no more than $O(s \log\frac{1}{\delta})$ (see Appendix E for a formal version). Accordingly, from Proposition 4, we can see that

$$R_n^{\mathcal{F}} \le \overline{R}_n^{\mathcal{F}} \triangleq C \Big( s^{1.5} \sqrt{\sum_{k=1}^{n} (r_k - \langle x_k, \hat\beta \rangle)^2 \ln\tfrac{1}{\delta}} + s^2 \sqrt{2 \ln\tfrac{n}{s\delta^2} \ln\tfrac{1}{\delta}} + s^{1.5} \sqrt{2} \ln\tfrac{1}{\delta} + s^2 \ln\tfrac{1}{\delta} \Big). \qquad (5)$$

Moreover, one can observe that the total sample variance $\sum_{k=1}^{n} \eta_k^2$ is bounded by (a constant multiple of) the total variance $\sum_{k=1}^{n} \sigma_k^2$ (which is formally stated as Theorem 13 in the appendix). Therefore, with Eq. (5) as our pessimistic regret estimation $\overline{R}_n^{\mathcal{F}}$, we have the following regret guarantee.

Theorem 5 (Regret of Algorithm 1 with VOFUL2). Algorithm 1 with VOFUL2 as $\mathcal{F}$ and $\overline{R}_n^{\mathcal{F}}$ defined in Eq. (5) ensures that

$$R_T = O\Big( (s^{2.5} + s\sqrt{d}) \sqrt{\sum_{t=1}^{T} \sigma_t^2 \log\tfrac{1}{\delta}} + s^3 \log\tfrac{1}{\delta} \Big) \quad \text{with probability } 1 - \delta.$$

Due to space limitations, we defer the full proof to Appendix G.1 and only sketch it here.

Proof Sketch of Theorem 5. To bound $R_T^a$, we consider the regret from the coordinates in and outside $S$ separately. For the former, the total regret in a single phase with gap threshold $\Delta$ is simply controlled by $O\big( s^{1.5} \sqrt{\sum_{t \in \mathcal{T}_\Delta^a} \eta_t^2 \log\frac{1}{\delta}} + s^2 \log\frac{1}{\delta} \big)$ (thanks to Proposition 4). For the latter, each non-zero coordinate outside $S$ can incur at most $O(\Delta^2)$ regret for each $t \in \mathcal{T}_\Delta^a$. By definition of $n_\Delta^a$ (Line 5), we have

$$n_\Delta^a = |\mathcal{T}_\Delta^a| = O\Big( \frac{s^{1.5}}{\Delta^2} \sqrt{\sum_{t \in \widetilde{\mathcal{T}}_\Delta^a} \eta_t^2 \ln\tfrac{1}{\delta}} + \frac{s^2}{\Delta^2} \ln\tfrac{1}{\delta} \Big),$$

just like in the proof of Theorem 3. As the regret from the second part is bounded by $O(s\Delta^2 \cdot n_\Delta^a)$, these two parts together sum to

$$R_T^a \le \sum_\Delta O\Big( s^{2.5} \sqrt{\sum_{t \in \mathcal{T}_\Delta^a} \eta_t^2 \log\tfrac{1}{\delta}} + s^3 \log\tfrac{1}{\delta} + s\Delta^2 \Big).$$

As in Theorem 3, we notice that $n_\Delta^a = \Omega\big( \frac{s^{1.5}}{\Delta^2} \sqrt{\sum_{t \in \widetilde{\mathcal{T}}_\Delta^a} \eta_t^2 \ln\frac{1}{\delta}} + \frac{s^2}{\Delta^2} \ln\frac{1}{\delta} \big)$ for all $\Delta \ne \Delta_f$, again by definition of $n_\Delta^a$. This will move the summation over $\Delta$ into the square root. Moreover, by the fact that $\eta_t^2 = O(\sigma_t^2 \log\frac{1}{\delta})$ (Theorem 14 in the appendix), we have

$$R_T^a = O\Big( s^{2.5} \sqrt{\sum_{t=1}^{T} \sigma_t^2 \log\tfrac{1}{\delta}} + s^3 \log\tfrac{1}{\delta} \Big).$$

Combining this with the bound on $R_T^b$ provided by Theorem 3 concludes the proof.
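The ridge-regression step used to build the over-estimate in Section 4.1 can be illustrated as follows. This is a minimal sketch assuming the standard squared-norm ridge penalty; the function name `ridge_residual_energy` and the default `lam=1.0` are illustrative choices, not values taken from the paper. It returns the ridge estimate together with the residual sum of squares, which is the quantity standing in for the unobserved $\sum_k \eta_k^2$.

```python
import numpy as np

def ridge_residual_energy(X, r, lam=1.0):
    """Ridge fit and its residual sum of squares (illustrative sketch).

    X: array of shape (n, d) with the played actions as rows.
    r: array of shape (n,) with the observed feedback.
    lam: ridge regularization weight (hypothetical default).
    """
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + lam * I) beta = X^T r.
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)
    residual_energy = float(np.sum((r - X @ beta_hat) ** 2))
    return beta_hat, residual_energy
```

By optimality of the ridge solution, the residual energy never exceeds $\sum_k \eta_k^2 + \lambda \|\beta^*\|_2^2$, which gives one direction of the comparison; the other direction is the part that needs the concentration argument referenced in the text.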

4.2. COMPUTATIONALLY EFFICIENT ALGORITHM FOR KNOWN VARIANCES

In this section, we consider a computationally efficient algorithm, Weighted OFUL (Zhou et al., 2021), which itself requires $\sigma_t^2$ to be presented at the end of round $t$. Their algorithm guarantees:

Proposition 6 (Zhou et al. (2021, Corollary 4.3)). With probability at least $1 - \delta$, Weighted OFUL executed for $n$ steps on $d$ dimensions guarantees

$$R_n^{\mathcal{F}} \le C \Big( \sqrt{dn \log\tfrac{1}{\delta}} + d \sqrt{\sum_{k=1}^{n} \sigma_k^2 \log\tfrac{1}{\delta}} \Big),$$

where $C = O(1)$, $n$ is a stopping time that is finite a.s., and $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ are the variances of $\eta_1, \eta_2, \ldots, \eta_n$.

Taking $\mathcal{F}$ as Weighted OFUL, we have the following regret guarantee for sparse linear bandits:

Theorem 7 (Regret of Algorithm 1 with Weighted OFUL). Algorithm 1 with Weighted OFUL as $\mathcal{F}$ and $\overline{R}_n^{\mathcal{F}}$ defined as $C \big( \sqrt{sn \ln\frac{1}{\delta}} + s \sqrt{\sum_{k=1}^{n} \sigma_k^2 \ln\frac{1}{\delta}} \big)$ guarantees

$$R_T = O\Big( (s^2 + s\sqrt{d}) \sqrt{\sum_{t=1}^{T} \sigma_t^2 \log\tfrac{1}{\delta}} + s^{1.5} \sqrt{T} \log\tfrac{1}{\delta} \Big) \quad \text{with probability } 1 - \delta.$$

The proof is similar to that of Theorem 5, i.e., bounding $n_\Delta^a$ by Line 5 of Algorithm 1 and then using summation techniques to move the summation over $\Delta$ into the square root. The only difference is that we need to bound $O(\sum_\Delta \Delta^{-2})$, which seems to be as large as $T$ if we follow the analysis of Theorem 5. However, as we included an additive term $\sqrt{sn \ln\frac{1}{\delta}}$ in the regret over-estimation $\overline{R}_n^{\mathcal{F}}$, we have $n_\Delta^a \ge \Delta^{-2} \sqrt{s n_\Delta^a \ln\frac{1}{\delta}}$, which means $n_\Delta^a = \Omega(s\Delta^{-4})$. From $\sum_\Delta n_\Delta^a \le T$, we can consequently bound $\sum_\Delta \Delta^{-2}$ by $O(\sqrt{T/s})$. The remaining part is just an analog of Theorem 5. Therefore, the proof is omitted in the main text and postponed to Appendix H.
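The step from the stopping rule to the bound on $\sum_\Delta \Delta^{-2}$ can be spelled out as below. This is a sketch rather than the appendix proof: constants are dropped, the $\ln\frac{1}{\delta} \ge 1$ factor is discarded in the second line, $\Delta_{\min}$ (a symbol introduced here) denotes the smallest threshold other than $\Delta_f$, and the geometric sum uses the fact that the framework halves $\Delta$ between stages.

```latex
\begin{align*}
n_\Delta^a \ge \Delta^{-2}\sqrt{s\, n_\Delta^a \ln\tfrac{1}{\delta}}
  &\;\Longrightarrow\; \sqrt{n_\Delta^a} \ge \Delta^{-2}\sqrt{s \ln\tfrac{1}{\delta}}
   \;\Longrightarrow\; n_\Delta^a \ge s\,\Delta^{-4}\ln\tfrac{1}{\delta}, \\
T \ge \sum_{\Delta} n_\Delta^a \ge n_{\Delta_{\min}}^a \ge s\,\Delta_{\min}^{-4}
  &\;\Longrightarrow\; \Delta_{\min}^{-2} \le \sqrt{T/s}, \\
\sum_{\Delta \ne \Delta_f} \Delta^{-2}
  = \sum_{j \ge 0} \bigl(2^{j}\Delta_{\min}\bigr)^{-2}
  &\le \Delta_{\min}^{-2} \sum_{j \ge 0} 4^{-j}
   = \tfrac{4}{3}\,\Delta_{\min}^{-2}
   \le \tfrac{4}{3}\sqrt{T/s}.
\end{align*}
```

The final threshold satisfies $\Delta_f^{-2} = 4\Delta_{\min}^{-2}$, so including it only changes the constant. Multiplying by the $s^2$ factor that accompanies $\Delta^{-2}$ in the analysis recovers the $s^{1.5}\sqrt{T}$ term of Theorem 7.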

5. CONCLUSION

We considered the sparse linear bandit problem with heteroscedastic noises and provided a general framework to reduce any variance-aware linear bandit algorithm $\mathcal{F}$ to an algorithm $\mathcal{G}$ for sparse linear bandits that is also variance-aware. We first applied the computationally inefficient algorithm VOFUL from Zhang et al. (2021) and Kim et al. (2021). The resulting algorithm works for the unknown-variance case and gets $O\big( (s^{2.5} + s\sqrt{d}) \sqrt{\sum_{t=1}^{T} \sigma_t^2 \log\frac{1}{\delta}} + s^3 \log\frac{1}{\delta} \big)$ regret, which, when regarding the sparsity factor $s \ll d$ as a constant, not only is worst-case optimal but also enjoys constant regret in deterministic cases. We also applied the efficient algorithm Weighted OFUL by Zhou et al. (2021) that requires known variances; we got $O\big( (s^2 + s\sqrt{d}) \sqrt{\sum_{t=1}^{T} \sigma_t^2 \log\frac{1}{\delta}} + (\sqrt{sT} + s) \log\frac{1}{\delta} \big)$ regret,

A MORE ON RELATED WORKS

In this section, we briefly compare our results with several related works on sparse linear bandits in terms of regret guarantees, noise assumptions, and query models.

• The regret of Abbasi-Yadkori et al. (2012) is $O(Rs\sqrt{dT})$ when assuming conditionally $R$-sub-Gaussian noises (i.e., $\eta_t \mid \mathcal F_{t-1} \sim \mathrm{subG}(R^2)$, which is formally defined in Appendix D.2). At the same time, they allow arbitrarily varying action sets $D_1, D_2, \ldots, D_T \subseteq \mathbb B^d$ (though they in fact allow arbitrary decision sets $D_1, D_2, \ldots, D_T \subseteq \mathbb R^d$, their regret bound scales with $\max_{x\in D_t}\|x\|_2$, so we assume $D_t \subseteq \mathbb B^d$ without loss of generality). This model is less restrictive than ours, as we only allow $D_1 = D_2 = \cdots = D_T = \mathbb B^d$ (as explained in Footnote 3 in the main text). When the noises are Gaussian with variance 1 and the ground truth $\theta^*$ is one-hot (i.e., $s = 1$), their regret bound reduces to $O(\sqrt{dT})$, which matches the $\Omega(\sqrt{dT})$ bound in Antos & Szepesvári (2009) when the action sets are allowed to be the entire unit ball (which means the agent is more powerful than that of Abbasi-Yadkori et al. (2012)).

• The regret of Carpentier & Munos (2012) is $O((\|\theta\|_2 + \|\sigma\|_2)s\sqrt{T})$, assuming a unit-ball action set, a ground truth with $\|\theta^*\|_2 \le \|\theta\|_2$, and noises with $\|\eta_t\|_2 \le \|\sigma\|_2$, where $\eta_t$ will be defined later. This bound seems to bypass the $\Omega(\sqrt{dT})$ lower bound when $s = 1$. However, this is due to a different noise model: they assumed that the noise is component-wise, i.e., $r_t = \langle\theta^* + \eta_t, x_t\rangle$ where $\eta_t \in \mathbb R^d$. In contrast, our model assumes $r_t = \langle\theta^*, x_t\rangle + \eta_t$ where $\eta_t \in \mathbb R$. Therefore, the $\max_t\|\eta_t\|_2$ dependency can be of order $O(\sqrt{d})$ to ensure a noise model similar to ours.

• The regret of Lattimore et al. (2015, Appendix G) is also of order $O(s\sqrt{T})$, assuming $[-1, 1]$-bounded noises, a hypercube action set $\mathcal X = [-1, 1]^d$, and a ground truth with $\|\theta^*\|_1 \le 1$. We now explain why this does not violate the $\Omega(\sqrt{sdT})$ regret lower bound either. Consider an extreme query $(1, 1, \ldots, 1) \in \mathcal X$, which is valid in their query model (in fact, their algorithm is a random projection procedure with some carefully designed regularity conditions, so queries of this type appear all the time). However, in our query model where the action set is the unit ball $\mathbb B^d$, we have to scale it by $\frac{1}{\sqrt{d}}$. As the noise is never scaled, this amplifies the noises by $\sqrt{d}$, so we need $\mathrm{poly}(d)$ times more queries to get the same confidence set, making the regret bound have a polynomial dependency on $d$. Moreover, their ground truth $\theta^*$ needs to satisfy $\|\theta^*\|_1 \le 1$. However, an $s$-sparse ground truth $\theta^*$ with 2-norm 1 can have 1-norm as large as $\sqrt{s}$. Therefore, another $\sqrt{s}$ factor should also be multiplied for a fair comparison with our algorithm.

In conclusion, the second and the third works assumed different noise or query models that amplify the signal-to-noise ratio and thus avoid a polynomial dependency on $d$, compared to the regret bounds of Abbasi-Yadkori et al. (2012) and ours. However, we have to admit that Abbasi-Yadkori et al. (2011) allows a drifting action set, whereas ours only allows a unit-sphere action set, just like Carpentier & Munos (2012). The reason is discussed in Footnote 3 in the main text.

B FUTURE DIRECTIONS

First of all, there is still a gap in the worst-case regret in terms of $s$, as the lower bound for sparse linear bandits is $\Omega(\sqrt{sdT})$ instead of our $O(s\sqrt{dT})$ when $\sigma_t \equiv 1$. Closing this gap in $s$ is an interesting direction for future work. Our current algorithm, unfortunately, is incapable of an $O(\sqrt{sdT})$-style worst-case regret guarantee. Suppose that $T = ds^2$, $\theta_i^* = s^{-1/2}$ for $i = 1, 2, \ldots, s$ (so $\|\theta^*\|_2 \le 1$), and $\sigma_t \equiv 1$. Then we have $n_\Delta^b \approx \Delta^{-1}\sqrt{n_\Delta^b(1 + d\Delta^{-2})}$, which gives $n_\Delta^b \approx d\Delta^{-4}$. Hence, the total regret will be
\[ \sum_{i=1}^s \sum_{\Delta \ge \theta_i^*} n_\Delta^b (\theta_i^*)^2 \approx d\sum_{i=1}^s (\theta_i^*)^{-2} = ds^2 = O(s\sqrt{dT}). \]
Thus, algorithmic improvements are needed to obtain a better dependency on $s$. We leave this for future research.

Moreover, the current work relies on the random projection procedure (Carpentier & Munos, 2012), which only works when the action set is the unit sphere. Such an assumption is unrealistic in practice. We wonder whether there is an alternative that only requires a looser condition.

Lastly, deriving a variance-aware lower bound rather than a minimax one is also important, as it can better illustrate the inherent hardness of the problem under different noise levels. We remark that extending the proofs of current minimax lower bounds (see, e.g., Antos & Szepesvári (2009)) to variance-aware ones is not straightforward.

C DIVIDE-AND-CONQUER ALGORITHM FOR DETERMINISTIC SETTINGS

In this section, we discuss how to solve the deterministic sparse linear bandit problem in $\widetilde{O}(s)$ steps using a divide-and-conquer algorithm, as we briefly mentioned in the main text.

We mainly adopt the idea mentioned by Dong et al. (2021, Footnote 6). For each divide-and-conquer subroutine working on several coordinates $i_1, i_2, \ldots, i_k \in [d]$, we query half of them (e.g., $i_1, i_2, \ldots, i_{k/2}$ when assuming $2 \mid k$) with $2/k$ mass on each coordinate. This reveals whether there is a non-zero coordinate among them. If the feedback is non-zero, we conclude that there exists a non-zero coordinate in this half. Hence, we dive into this half and conquer this sub-problem (i.e., divide-and-conquer). Otherwise, we simply discard this half and consider the other half.

However, this vanilla algorithm proposed by Dong et al. (2021) fails to consider the possibility that two coordinates cancel each other (e.g., two coordinates with magnitudes $\pm 1/2$ will make the feedback equal to zero). Fortunately, this problem can be resolved by randomly putting magnitude $\pm 2/k$ on each coordinate, which is similar to the idea illustrated in Algorithm 2. As the environment is deterministic, each step will give the correct feedback with probability 1. Therefore, a constant number of trials is enough to tell whether there exists a non-zero coordinate.

Finally, we analyze the number of interactions needed for this approach. As it is guaranteed that each divide-and-conquer subroutine works on a set of coordinates where at least one of them is non-zero, we can bound the number of interactions by
\[ O\Big(\sum_{i:\,\theta_i^* \ne 0} \#\{\text{subroutines containing } i\}\Big). \]
As each step divides the coordinates in half, there can be at most $\log_2 d$ subroutines containing $i$ for each individual $i$. Therefore, the number of interactions is $O(s\log d)$. After that, we are sure to have found all coordinates with non-zero magnitudes. Querying each of them once then reveals its actual magnitude. Therefore, we can recover $\theta^*$ in $O(s\log d + s) = O(s\log d)$ rounds and will not suffer any regret after that. So the regret of this algorithm is indeed $O(s\log d) = \widetilde{O}(s)$, which is (nearly) independent of $d$ and $T$.
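The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`make_oracle`, `has_nonzero`, `find_support`, `recover`) are hypothetical, actions are represented as sparse dictionaries, and random signs are used as the cancellation-avoiding randomization. With random signs, a single trial detects an existing non-zero coordinate with probability at least 1/2, so the sketch repeats each test `trials` times. Unlike the single-coordinate description, the sketch also tests the second half whenever the first half contains a non-zero coordinate, since both halves may be non-empty when $s > 1$.

```python
import random

def make_oracle(theta):
    """Noiseless reward r = <theta, x>; x is a sparse dict {index: value}."""
    def query(x):
        return sum(theta[i] * v for i, v in x.items())
    return query

def has_nonzero(query, coords, rng, trials=20):
    """Test whether some coordinate in `coords` is non-zero.

    Each coordinate receives magnitude 1/len(coords) with a random sign,
    so that two coordinates cannot cancel in every trial.
    """
    mag = 1.0 / len(coords)
    for _ in range(trials):
        x = {i: rng.choice((-mag, mag)) for i in coords}
        if abs(query(x)) > 1e-12:
            return True
    return False

def find_support(query, coords, rng):
    """Return the non-zero coordinates; assumes `coords` contains at least one."""
    if len(coords) == 1:
        return list(coords)
    half = len(coords) // 2
    left, right = coords[:half], coords[half:]
    if not has_nonzero(query, left, rng):
        # The parent set contains a non-zero, so it must lie in the right half.
        return find_support(query, right, rng)
    found = find_support(query, left, rng)
    if has_nonzero(query, right, rng):
        found += find_support(query, right, rng)
    return found

def recover(query, d, seed=0):
    """Recover theta: locate the support, then query each found coordinate once."""
    rng = random.Random(seed)
    coords = list(range(d))
    theta_hat = [0.0] * d
    if has_nonzero(query, coords, rng):
        for i in find_support(query, coords, rng):
            theta_hat[i] = query({i: 1.0})
    return theta_hat
```

For example, `recover(make_oracle(theta), len(theta))` returns `theta` itself for a sparse `theta`, even when two of its entries have opposite signs and equal magnitude.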

D CONCENTRATION INEQUALITIES

D.1 SAMPLE MEAN UPPER BOUND

We shall make use of the following self-normalizing result.

Proposition 8 (Fan et al. (2015, Remark 2.9)). Suppose that $\{\xi_i\}_{i=1}^n$ are independent and symmetric. Then for all $x > 0$,
\[ \Pr\left\{\max_{1\le k\le n}\frac{\sum_{i=1}^k \xi_i}{\sqrt{\sum_{i=1}^n \xi_i^2}} \ge x\right\} \le \exp\left(-\frac{x^2}{2}\right). \]

Corollary 9. Let $X_1, X_2, \ldots, X_n$ be a sequence of independent and symmetric random variables with means $\mu_1, \mu_2, \ldots, \mu_n$, where $n$ is a stopping time that is finite a.s. Then for any $\delta > 0$, with probability $1-\delta$, we have
\[ \left|\sum_{i=1}^n (X_i - \mu_i)\right| \le \sqrt{2\sum_{i=1}^n (X_i - \mu_i)^2 \ln\frac{2}{\delta}}. \]

Proof. This immediately follows by picking $x = \sqrt{2\ln\frac{2}{\delta}}$ and then applying Fatou's lemma.

Therefore, we can present our Empirical Bernstein Inequality for conditionally symmetric stochastic processes with a common mean, as follows.

Theorem 10 (Empirical Bernstein Inequality). For a sequence of independent and symmetric random variables $X_1, X_2, \ldots, X_n$ that share a common mean (i.e., $\mathbb E[X_i] = \mu$ for some $\mu$ and all $i$), where $n$ is a stopping time that is finite a.s., we have
\[ \Pr\left\{\left|\sum_{i=1}^n (X_i - \mu)\right| \le 2\sqrt{\sum_{i=1}^n (X_i - \bar X)^2 \ln\frac{4}{\delta}}\right\} \ge 1-\delta, \quad \forall\delta\in(0,1), \]
where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ is the sample mean.

Proof. By direct calculation, we have
\[ \sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - 2n\bar X^2 + n\bar X^2 = \sum_{i=1}^n X_i^2 - n\bar X^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\bar X - \mu)^2. \]
Applying Corollary 9 to $\{X_i\}_{i=1}^n$ gives
\[ \Pr\left\{\left|\sum_{i=1}^n (X_i - \mu)\right| \ge \sqrt{2\sum_{i=1}^n (X_i - \mu)^2 \ln\frac{4}{\delta}}\right\} \le \frac{\delta}{2}. \]
Therefore, with probability $1-\frac{\delta}{2}$, we have
\[ \sum_{i=1}^n (X_i - \bar X)^2 \le \sum_{i=1}^n (X_i - \mu)^2 \le \sum_{i=1}^n (X_i - \bar X)^2 + \frac{4}{n}\sum_{i=1}^n (X_i - \mu)^2 \ln\frac{4}{\delta}. \tag{6} \]
Hence, with probability $1-\frac{\delta}{2}$ (and provided that $n \ge 8\ln\frac{4}{\delta}$, so that the coefficient $\frac{4}{n}\ln\frac{4}{\delta}$ is at most $\frac{1}{2}$), we have
\[ \sum_{i=1}^n (X_i - \mu)^2 \le 2\sum_{i=1}^n (X_i - \bar X)^2. \tag{7} \]
By the union bound, with probability $1-\delta$, we thus have
\[ \left|\sum_{i=1}^n (X_i - \mu)\right| \le 2\sqrt{\sum_{i=1}^n (X_i - \bar X)^2 \ln\frac{4}{\delta}}, \]
as claimed.
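The algebraic identity at the start of the proof of Theorem 10, $\sum_i (X_i - \bar X)^2 = \sum_i (X_i - \mu)^2 - n(\bar X - \mu)^2$, is easy to check numerically. The following is a minimal sketch on synthetic Gaussian data; the function name `centered_sums` and the chosen mean, scale, and sample size are illustrative assumptions, not part of the paper.

```python
import random

def centered_sums(xs, mu):
    """Return (sum (x - xbar)^2, sum (x - mu)^2, n * (xbar - mu)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    around_sample_mean = sum((x - xbar) ** 2 for x in xs)
    around_true_mean = sum((x - mu) ** 2 for x in xs)
    return around_sample_mean, around_true_mean, n * (xbar - mu) ** 2

rng = random.Random(0)
mu = 1.5
xs = [rng.gauss(mu, 2.0) for _ in range(1000)]
a, b, c = centered_sums(xs, mu)
# Identity: a == b - c up to floating-point error; in particular a <= b.
print(abs(a - (b - c)) <= 1e-9 * b, a <= b)
```

In particular, the sample variance computed around $\bar X$ never exceeds the one computed around the true mean $\mu$, which is the deterministic half of the sandwich used in the proof.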

D.2 SAMPLE VARIANCE UPPER BOUND

Recall that a random variable $X$ is sub-Gaussian with variance proxy $\sigma^2$ if and only if $\mathbb E[\exp(\lambda(X - \mathbb E[X]))] \le \exp(\frac{1}{2}\lambda^2\sigma^2)$ for all $\lambda > 0$. We denote such a random variable by $X \sim \mathrm{subG}(\sigma^2)$. We first state the following generalized Freedman's inequality for sub-Gaussian random variables.

Proposition 11 (Fan et al. (2015, Theorem 2.6)). Suppose that $\{\xi_i\}_{i=1}^n$ is a sequence of zero-mean random variables, i.e., $\mathbb E[\xi_i] = 0$. Suppose that $\mathbb E[\exp(\lambda\xi_i)] \le \exp(f(\lambda)V_i)$ for some deterministic function $f(\lambda)$ and some fixed $\{V_i\}_{i=1}^n$, for all $\lambda \in (0, \infty)$. Then, for all $x, v > 0$ and $\lambda > 0$, we have
\[ \Pr\left\{\exists\, 1\le k\le n : \sum_{i=1}^k \xi_i \ge x \,\wedge\, \sum_{i=1}^k V_i \le v^2\right\} \le \exp\left(-\lambda x + f(\lambda)v^2\right). \]

To derive a bound relating $(X_i - \bar X)^2$ and $\sigma_i^2$, we need to characterize the concentration of the square of a sub-Gaussian random variable, which is a "sub-exponential" random variable.

Proposition 12 (Honorio & Jaakkola (2014, Appendix B)). For a sub-Gaussian random variable $X$ with variance proxy $\sigma^2$ and mean $\mu$, we have
\[ \mathbb E\left[\exp\left(\lambda(X^2 - \mathbb E[X^2])\right)\right] \le \exp\left(16\lambda^2\sigma^4\right), \quad \forall |\lambda| \le \frac{1}{4\sigma^2}. \]

Theorem 13. For a sequence of sub-Gaussian random variables $\{X_i\}_{i=1}^n$ such that $\mathbb E[X_i] = \mu_i$, $X_i \sim \mathrm{subG}(\sigma_i^2)$, and $n$ is a stopping time that is finite a.s.,
\[ \Pr\left\{\left|\sum_{i=1}^n \left((X_i - \mu_i)^2 - \mathbb E[(X_i - \mu_i)^2]\right)\right| > 4\sqrt{2}\sum_{i=1}^n \sigma_i^2 \ln\frac{2}{\delta}\right\} \le \delta, \quad \forall\delta\in(0,1). \]

Proof. We first consider a fixed (non-stopping-time) $n$. Apply Proposition 11 to the sequence $\{(X_i - \mu_i)^2 - \mathbb E[(X_i - \mu_i)^2]\}$ with $V_i = \sigma_i^4$, $f(\lambda) = 16\lambda^2$ for $\lambda < \frac{1}{4\sigma_{\max}^2}$ and $f(\lambda) = \infty$ otherwise, where $\sigma_{\max}$ is defined as $\max\{\sigma_1, \sigma_2, \ldots, \sigma_n\}$. Then for all $x, v > 0$ and $\lambda \in (0, \frac{1}{\sigma_{\max}^2})$, we have
\[ \Pr\left\{\sum_{i=1}^n \left((X_i - \mu_i)^2 - \mathbb E[(X_i - \mu_i)^2]\right) > x \,\wedge\, \sum_{i=1}^n \sigma_i^4 \le v^2\right\} \le \exp\left(-\lambda x + 16\lambda^2 v^2\right). \]
Picking v 2 = n i=1 σ 4 i ln 2 δ and x = 4 √ 2v ln 2 δ gives Pr    n i=1 (X i -µ i ) 2 -E[(X i -µ i ) 2 ] > 4 2 n i=1 σ 4 i ln 2 δ    ≤ exp - x 2 32v 2 = exp - 32v 2 ln 2 2 δ 32v 2 = δ 2 , where λ is set to x 32v 2 = 1 4 √ 2v < 1 σ 2 max (as v 2 > n i=1 σ 4 i ≥ σ 4 max ). A union bound by applying Proposition 11 to the sequence {E[(X i -µ i ) 2 ] -(X i -µ i ) 2 } with the same parameters and noticing the fact that n i=1 σ 4 i ≤ n i=1 σ 2 i then shows that our conclusion hold for any fixed n. By Fatou's lemma, we conclude that it also holds for a stopping time n that is finite a.s. Theorem 14 (Variance Concentration). Let {X i } n i=1 be a sequence of random variables with a common mean µ such that X i ∼ subG(σ 2 i ), (X i -µ) is symmetric, and n is a stopping time finite a.s. Then, ∀δ ∈ (0, 1), with probability 1 -δ, we have the following three inequalities: n i=1 (X i -X) 2 ≤ n i=1 (X i -µ) 2 n i=1 (X i -X) 2 ≥ 1 2 n i=1 (X i -µ) 2 n i=1 (X i -µ) 2 ≤ 8 n i=1 σ 2 i ln 2 δ . where X = 1 n n i=1 X i is the sample mean. Proof. The first two inequalities follow from Eqs. ( 6) and ( 7). The last one follows from Theorem 13 together with the fact that E[(X i -µ i ) 2 ] ≤ σ 2 i by definition of sub-Gaussian random variables.
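As a sanity check of the third inequality of Theorem 14, $\sum_{i}(X_i - \mu)^2 \le 8\sum_{i}\sigma_i^2\ln\frac{2}{\delta}$, one can estimate by Monte Carlo how often it fails for heteroscedastic Gaussian samples. The sketch below is an illustration only: it uses a fixed $n$ rather than a stopping time, Gaussian noise as one instance of sub-Gaussian noise, and a hypothetical helper name `violation_rate` with arbitrarily chosen variances.

```python
import math
import random

def violation_rate(sigmas, delta, runs=2000, seed=0):
    """Fraction of runs where sum (X_i - mu)^2 exceeds 8 * sum sigma_i^2 * ln(2/delta)."""
    rng = random.Random(seed)
    bound = 8.0 * sum(s * s for s in sigmas) * math.log(2.0 / delta)
    bad = 0
    for _ in range(runs):
        # X_i - mu ~ N(0, sigma_i^2); the common mean mu cancels out.
        total = sum(rng.gauss(0.0, s) ** 2 for s in sigmas)
        if total > bound:
            bad += 1
    return bad / runs

sigmas = [0.1, 1.0, 0.5] * 10   # illustrative heteroscedastic noise levels
rate = violation_rate(sigmas, delta=0.1)
```

The empirical violation rate should stay below $\delta$; for Gaussian noise the bound is in fact quite loose, since the left-hand side has expectation $\sum_i \sigma_i^2$.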

E RIDGE LINEAR REGRESSION

Lemma 15. Suppose that we are given n samples y i = ⟨x i , β * ⟩ + ϵ i , i = 1, 2, . . . , n, where β * ∈ B d and {x i } n i=1 , {ϵ i } n i=1 are stochastic processes adapted to the filtration {F i } n i=0 such that ϵ i is conditionally σ-Gaussian, i.e., ϵ i | F i-1 ∼ subG(σ 2 ). Define the following quantity as the estimate for β * : β = argmin β n i=1 (y i -x T i β) 2 + λ∥β∥ 2 . Then with probability 1 -δ, the following inequality holds: n i=1 (y i -x T i β * ) 2 - n i=1 (y i -x T i β) 2 ≤ 2dσ 2 ln n λdδ 2 + 1 λ + λ. Proof. Denote y = (y 1 , y 2 , . . . , y n ) T , X = (X T 1 , X T 2 , . . . , X T n ) and ϵ = (ϵ 1 , ϵ 2 , . . . , ϵ n ) T . Denote Var * = n i=1 (y i -x T i β * ) 2 and Var = n i=1 (y i -x T i β) 2 . We have the following representation of β, which is by direct calculation (check, e.g., (Kirschner & Krause, 2018) ) β = (X T X + λI) -1 X T y. Furthermore, by Abbasi-Yadkori et al. (2011, Proof of Theorem 2), we have β -β * = (X T X + λI) -1 X T ϵ -λ(X T X + λI) -1 β * . Therefore, we can write Var * -Var = (y -Xβ * ) T (y -Xβ * ) -(y -X β) T (y -X β) = (β * ) T X T Xβ * -(β * ) T X T y -y T Xβ * -β T X T X β + β T X T y + y T X β = β T (X T y -X T X β) -(β * ) T X T (y -Xβ * ) + y T X( β -β * ). By Eq. ( 8), we have X T y = (X T X + λI) β. So the first term is just λ β T β. As y = Xβ * + ϵ, the second term is just -(β * ) T X T ϵ. By Eq. ( 9), the last term becomes y T X(X T X + λI) -1 X T ϵ -λy T X(X T X + λI) -1 β * . For the sake of simplicity, we define ⟨a, b⟩ M = a T (X T X + λI) -1 b (note that ⟨a, b⟩ M = ⟨b, a⟩ M as X T X + λI is symmetric) and denote the induced norm by ∥•∥ M . Therefore, we have Var * -Var = λ β T β -(β * ) T X T ϵ + ⟨X T y, X T ϵ⟩ M -λ⟨X T y, β * ⟩ M . Again by Eq. ( 8), we have ⟨X T y, X T ϵ⟩ M = β T X T ϵ. 
Therefore, the second and third term together give -(β * ) T X T ϵ + ⟨X T y, X T ϵ⟩ M = ( β -β * ) T X T ϵ = (X T X + λI) -1 X T ϵ -λ(X T X + λI) -1 β * T X T ϵ = ∥X T ϵ∥ 2 M -λ⟨β * , X T ϵ⟩ = ∥X T ϵ∥ 2 M -λ(β * ) T ( β -β * ) + λ 2 ⟨β * , β * ⟩ M , where the last step is yielded from using Eq. ( 9) reversely. Then note that ⟨X T y, β * ⟩ = β * β * , so taking expectation on both sides gives Var * -Var = λ β T β + ∥X T ϵ∥ 2 M -λ(β * ) T ( β -β * ) + λ 2 ⟨β * , β * ⟩ M -λ β T β * = ∥X T ϵ∥ 2 M + λ∥ β -β * ∥ 2 2 + λ 2 ∥β * ∥ 2 M . By Cauchy-Schwartz inequality, we have ∥ β -β * ∥ 2 2 ≤ ∥X T ϵ∥ 2 M ∥(X T X + λI) -1/2 ∥ 2 + ∥β * ∥ 2 M ∥(X T X + λI) -1/2 ∥ 2 , where the matrix norms can further be bounded by 1/λ min (X T X + λI) ≤ 1 λ . Similarly, we can conclude that ∥β * ∥ 2 M ≤ 1/λ min (X T X + λI)∥β * ∥ 2 2 ≤ 1/λ. Consequently, we have Var * -Var = 2∥X T ϵ∥ 2 M + 1 λ + λ. As proved by Abbasi-Yadkori et al. (2011, Theorem 1), with probability 1 -δ, we have ∥X T ϵ∥ 2 M ≤ 2σ 2 ln 1 δ det(X T X + λI) det(λI) = σ 2 ln 1 δ 2 (λ + n/s) s λ ≤ dσ 2 ln 1 δ 2 1 + n λd , where σ 2 is the (maximum) variance proxy and the second last step is due to the Determinant-Trace Inequality (Abbasi-Yadkori et al., 2011, Lemma 10). Hence, with probability 1 -δ, we have Var * -Var ≤ 2dσ 2 ln n λdδ 2 + 1 λ + λ, as claimed. F OMITTED PROOF IN SECTION 3.2 (ANALYSIS OF FRAMEWORK)

F.1 PROOF OF MAIN THEOREM

In this section, we prove Theorem 3, which is restated as follows for ease of reading.

Theorem 16 (Restatement of Theorem 3). Suppose that for any execution of $\mathcal F$ that lasts for $n$ steps, $\overline{R}^{\mathcal F}_n \ge R^{\mathcal F}_n$ holds with probability $1-\delta$, i.e., $\overline{R}^{\mathcal F}_n$ is pessimistic. Then the total regret incurred by the second phase satisfies
\[ R_T^b = O\left(s\sqrt{d\sum_{t=1}^T \sigma_t^2\log\frac{1}{\delta}} + s\log\frac{1}{\delta}\right) \]
with probability $1-\delta$.

Proof. As mentioned in the main body, we define the following good events for each $\Delta = 2^{-2}, 2^{-3}, \ldots, \Delta_f$, where $\Delta_f$ is the final value of $\Delta$:
\[ \mathcal G_\Delta : \sum_{i\in S}\theta_i^*(\theta_i^* - \hat\theta_i) < \Delta^2 \text{ after } \mathcal T_\Delta^a, \qquad \mathcal H_\Delta : \big(|\theta_i^*| > 3\Delta \Rightarrow i \in S\big) \wedge \big(i \in S \Rightarrow |\theta_i^*| > \tfrac{1}{2}\Delta\big) \text{ after } \mathcal T_\Delta^b. \tag{10} \]
By definition, $\mathcal H_{1/2}$ indeed holds, as all $|\theta_i^*| < \frac{3}{2}$ and $S$ is initially empty. We can then use induction to prove that all good events hold with high probability. Here, we list several technical lemmas that we informally referred to in the main body, whose proofs are left to subsequent sections.

The first one is about the regret-to-sample-complexity conversion.

Lemma 17 (Regret-to-sample-complexity Conversion). If for any execution of $\mathcal F$ that lasts for $n$ steps, we have $\overline{R}^{\mathcal F}_n \ge R^{\mathcal F}_n$ with probability $1-\delta$, then we have $\Pr\{\mathcal G_\Delta\} \ge 1-\delta$.

The second one bounds the 'bias' term, i.e., the second term of Eq. (4) for random projection.

Lemma 18 (Bias Term of the Random Projection). Conditioning on $\mathcal G_\Delta$ and $\mathcal H_{2\Delta}$, we have $|\sum_{i\in S}(\hat\theta_i^2 - (\theta_i^*)^2)| < 3\Delta^2$, which further gives $|\sum_{i\in S}\hat\theta_i(\theta_i^* - \hat\theta_i)| < 4\Delta^2$.

Furthermore, we can bound the estimation error of the random projection process.

Lemma 19 (Concentration of the Random Projection). For any given $i \in [K]$, we have
\[ \Pr\left\{|\bar r_i - \theta_i^*| > 3\Delta^2 + \frac{\Delta}{4} \,\Big|\, \mathcal G_\Delta, \mathcal H_{2\Delta}\right\} \le \delta. \]

As long as the estimation errors are small, we can ensure that, with high probability, the good event for $\Delta$, namely $\mathcal H_\Delta$, also holds.

Lemma 20 (Identification of Non-zero Coordinates). $\Pr\{\mathcal H_\Delta \mid \mathcal G_\Delta, \mathcal H_{2\Delta}\} \ge 1 - d\delta$.
Therefore, by combining all lemmas above, we can ensure that all good events, namely {G ∆ ∧ H ∆ } ∆=2 -2 ,...,∆ f , hold simultaneously with probability 1 -dT δ. At last, we bound the total regret incurred in Phase B for ∆. Lemma 21 (Single-Phase Regret Bound). Conditioning on G ∆ and H 2∆ , we have t∈T b ∆ ⟨θ * -x t , θ * ⟩ ≤ 36sn b ∆ ∆ 2 + 6s∆ R √ K 2n b ∆ ln 1 δ , with probability 1 -sδ. Then we follow the analysis sketched in the main body. We assume, without loss of generality, that s < d. Then, conditioning on all G ∆ and H ∆ , some coordinate i * must never be included into S as H ∆ holds for all ∆. Therefore, by property of the sample variances (Theorem 14), for such i * and for each phase, with probability 1 -δ, we have N n=1 (r n,i * -r i * ) 2 ≤ N n=1 (r n,i * -E[r n,i * ]) 2 . Together with Eq. (2), 2 2 t∈ T b ∆ (r t,i * -E[r t,i * ]) 2 ln 4 δ > (n b ∆ -1) ∆ 4 . ( ) By using Lemma 21, the total regret from Phase B is bounded by R b T ≤ ∆=2 0 ,2 -2 ,...,∆ f 36sn b ∆ ∆ 2 + 6s∆ R √ K 2n b ∆ ln 1 δ By writing n b ∆ as (n b ∆ -1) + 1, for any given ∆, the total regret of Phase b is bounded by R b T ≤ ∆ t∈T b ∆ ⟨θ * -x t , θ * ⟩ ≤ ∆ 36s∆ 2 •    4 ∆ 2 2 t∈ T b ∆ (r t,i * -E[r t,i * ]) 2 ln 4 δ + 1    + ∆ 6s∆ R √ K 2 ln 1 δ 4∆ • 2 2 t∈ T b ∆ (r t,i * -E[r t,i * ]) 2 ln 4 δ + 1 ≤ ∆   288 √ 2s ∆ 2 t∈T b ∆ (r t,i * -E[r t,i * ]) 2 ln 4 δ + 36s∆ 2    Part (a) + (12) ∆   24s ln 1 δ ∆ 4 ∆ 2 t∈T b ∆ (r t,i * -E[r t,i * ]) 2 ln 4 δ + 12s∆ ln 1 δ    Part (b) . ( ) As mentioned in the main text, we make use of the following lower bound of n b ∆ which again follows from Eq. ( 2) and holds for all ∆'s except ∆ f : n b ∆ ≥ 4 ∆ • 2 2 t∈T b ∆ (r t,i * -r i * ) 2 ln 4 δ ≥ 8 ∆ 2 t∈T b ∆ (r t,i * -E[r t,i * ]) 2 ln 4 δ , where the last step is due to Theorem 14. For simplicity, define S ∆ = ∆ 2 t∈T b ∆ (r n,i -E[r n,i ]) 2 and therefore n b ∆ ≥ C ∆ 2 √ S ∆ where C = 8 2 ln 4 δ , a constant if we regard δ as a constant. 
Therefore, ∆̸ =∆ f C ∆ 2 S ∆ ≤ ∆̸ =∆ f n b ∆ ≤ T (14) and our goal is to upper bound ∆ √ S ∆ . Define a threshold X = T / C ∆̸ =∆ f S ∆ and denote ∆ X = 2 -⌈log 4 X⌉ so that ∆ 2 X ≤ 1 X . We will have ∆̸ =∆ f S ∆ = ∆=2 -2 ,...,∆ X S ∆ + ∆=∆ X /2,...,2∆ f S ∆ (a) ≤ log 4 X ∆=2 -2 ,...,∆ X S ∆ + ∆ X >∆>∆ f ∆ 2 X ∆ 2 X S ∆ (b) ≤ log 4 X 1 4 ≥∆≥∆ X S ∆ + 1 X ∆ X >∆>∆ f √ S ∆ ∆ 2 X (c) ≤ log 4 X 1 4 ≥∆≥∆ X S ∆ + 1 X T C = ( log 4 X + 1) ∆̸ =∆ f S ∆ , where (a) applied Cauchy-Schwartz to the first summation, (b) applied the fact that ∆ 2 X ≤ 1 X and (c) used Eq. ( 14). So Part (a) of the regret, namely Eq. ( 12), can be bounded by ∆   288 √ 2s ∆ 2 t∈T b ∆ (r t,i -E[r t,i ]) 2 ln 4 δ + 36s∆ 2    (a) ≤ 288 √ 2s ∆ S ∆ ln 4 δ + 72s (b) ≤ 288 √ 2s ln 4 δ   log 4 (4T ) ∆̸ =∆ f S ∆ + S ∆ f   + 72s (c) ≤ 576s ∆=2 -2 ,...,∆ f ∆ 2 t∈T b ∆ (r t,i * -E[r t,i * ]) 2 ln 4 δ log 4 (4T ) + 72s (d) ≤ 576s ∆=2 -2 ,...,∆ f 8 t∈T b ∆ (∆ 2 + dσ 2 t ) ln 4 δ log 4 (4T ) + 72s (16) ≤ 1152 √ 2s √ d T t=1 σ 2 t ln 4 δ log 4 (4T ) + 1152 √ 2s ∆ n b ∆ ∆ 2 ln 1 δ log 4 (4T ) + 72s. where (a) used the fact that ∆ 2 ≤ 2 as ∆ = 2 -i , (b) used Eq. ( 15), (c) again used Cauchy-Schwartz inequality and (d) used Theorem 14(3), the variance concentration result, Eq. ( 19), the magnitude of the variance, and the facts that R ≥ ∆, K ≤ d. Therefore, we can conclude that O(s) • ∆ n b ∆ ∆ 2 ≤ O   s √ d T t=1 σ 2 t log 4 δ + s   + O s n b ∆ ∆ 2 ln 1 δ . Notice that the left-handed-side and the right-handed-side has a common term (the RHS one is inside the square-root sign). Hence, by the self-bounding property Lemma 37, we can conclude that (note that we divided s on both sides) ∆ n b ∆ ∆ 2 ≤ O   √ d T t=1 σ 2 t log 4 δ + 1   + O ln 1 δ , ( ) which means that Part (a) (Eq. ( 12)), or equivalently O(s ∆ n b ∆ ∆ 2 ), is bounded by O   s √ d T t=1 σ 2 t log 4 δ + s log 4 δ   . Now consider Part (b) of the regret, namely Eq. ( 13). The second term in each summand will sum up to O(s log 1 δ ). 
For the first term, using the notation S ∆ , we want to bound 24s ln 1 δ 3 /4 ∆ ∆ 4 S ∆ ≤ 24s ln 1 δ ∆ ∆ 4 ∆ S ∆ ≤ 48s ln 1 δ 4 ∆ S ∆ . Conditioning on the same events as in Eq. ( 16), we will have ∆ S ∆ ≤ O ∆ n b ∆ ∆ 2 + T t=1 dσ 2 t ≤ O   T t=1 dσ 2 t + T t=1 dσ 2 t log 1 δ + log 1 δ   ≤ O T t=1 dσ 2 t log 1 δ , where the second step comes from Eq. ( 17). Therefore, 24s ln 1 δ 3 /4 ∆ ∆ 4 S ∆ ≤ O     s √ d T t=1 σ 2 t log 1 δ log 1 δ 3 /4     = O   s 4 d T t=1 σ 2 t log 1 δ   , which gives (by combining the bound of Eq. ( 12) and Eq. ( 13) together): R b T ≤ O   s √ d T t=1 σ 2 t log 4 δ + s 4 d T t=1 σ 2 t log 1 δ + s log 4 δ   . Further notice that, if d T t=1 σ 2 t ≥ 1, then the square-root is larger than the 4th-root. When d T t=1 σ 2 t ≤ 1, then either root is bounded by s. Hence, in either case, the 4th-root will be hidden by other factors. Henceforth, we indeed have the following conclusion with probability 1 -(s + 3T )δ: R b T ≤ O   s d T t=1 σ 2 t log 1 δ + s log 1 δ   . By setting the actual δ as (s + 3T )δ, we will still have the same regret bound as O(log c δ ) = O(log 1 δ + log c) = O(log 1 δ ) as all logarithmic factors will be hidden by O.

F.2 REGRET-TO-SAMPLE-COMPLEXITY CONVERSION

Proof of Lemma 17. By Fatou's lemma and the fact that n b ∆ is finite a.s. (as it is truncated according to T ), the probability that R F n is a pessimistic estimation for all n = 1, 2, . . . , n b ∆ is bounded by 1 -δ. Conditioning on this, by definition, we will have n b ∆ n=1 ⟨θ * i , θ * i -x n ⟩ ≤ R F n b ∆ ≤ R F n b ∆ , which means our stopping criterion (Line 5) will ensure n b ∆ n=1 ⟨θ * i , θ * i -θ⟩ ≤ R F n b ∆ n b ∆ < ∆ 2 and thus G ∆ is ensured.

F.3 RANDOM PROJECTION

Proof of Lemma 18. By $\mathcal H_{2\Delta}$, we have $|\theta_i^*| > \Delta$ for all $i \in S$, which gives $|\theta_i^* - \hat\theta_i| < \Delta$ by $\mathcal G_\Delta$. So we can bound $|\sum_{i\in S}(\hat\theta_i^2 - (\theta_i^*)^2)|$ by
\[ 2\Big|\sum_{i\in S}\theta_i^*(\theta_i^* - \hat\theta_i)\Big| + \Big|\sum_{i\in S}(\theta_i^* - \hat\theta_i)^2\Big| < 2\Delta^2 + \Delta^2 = 3\Delta^2. \]
The second claim is then straightforward, as $\hat\theta_i(\theta_i^* - \hat\theta_i) = -\theta_i^*(\theta_i^* - \hat\theta_i) - (\hat\theta_i^2 - (\theta_i^*)^2)$.

Proof of Lemma 19. In the random projection procedure, let $r_i^{(n)}$ be the random variable (which is the $i$-th coordinate of $r_n$ in the algorithm) defined as
\[ r_i^{(n)} = \frac{K}{R^2}y_i^{(n)}\Big(\langle x_n, \theta^*\rangle + \eta_n - \sum_{j\in S}\hat\theta_j^2\Big) = \frac{K}{R^2}\big(y_i^{(n)}\big)^2\theta_i^* + \sum_{j\in S}\hat\theta_j(\theta_j^* - \hat\theta_j) + \sum_{j\notin S, j\ne i}\frac{K}{R^2}y_i^{(n)}y_j^{(n)}\theta_j^* + \frac{K}{R^2}y_i^{(n)}\eta_n. \tag{18} \]
Then $\bar r_i$ is just the sample mean estimator of $\{r_i^{(n)}\}_{n=1}^{n_\Delta^b}$. Firstly, observe that $y_i^2$ is always $\frac{R^2}{K}$, so the first term of Eq. (18) is just $\theta_i^*$, which is exactly the magnitude we want to estimate. Moreover, by Lemma 18, the second term is a small (deterministic) bias added to $\theta_i^*$ and is bounded by $3\Delta^2$. For the third term, as each $y_j$ is independent, $\frac{K}{R^2}y_iy_j$ is an i.i.d. Rademacher random variable, denoted by $z_j^{(n)}$. Hence, the last two terms sum to
\[ \sum_{j\notin S, j\ne i} z_j^{(n)}\theta_j^* + \frac{K}{R^2}y_i^{(n)}\eta_n. \]
By definition of Rademacher random variables, they are all zero-mean and symmetric. Hence, they can be viewed as noises with variances at most
\[ \mathbb E\Big[\Big(\sum_{j\notin S, j\ne i} z_j^{(n)}\theta_j^* + \frac{K}{R^2}y_i^{(n)}\eta_n\Big)^2\Big] \overset{(a)}{=} \mathbb E\Big[\Big(\sum_{j\notin S} z_j^{(n)}\theta_j^*\Big)^2 + \Big(\frac{K}{R^2}y_i^{(n)}\Big)^2\eta_n^2\Big] \le 1 + \frac{K}{R^2}\sigma_n^2, \tag{19} \]
where (a) is due to the mutual independence between $z_j^{(n)}$ and $y_i^{(n)}$. Moreover, as each $y_i^{(n)}$ is also redrawn for every $n$ and $i$, we know that all $\{z_j^{(n)}\}_{j\notin S, 1\le n\le n_\Delta^b}$ and $\{y_i^{(n)}\}_{1\le n\le n_\Delta^b}$ are mutually independent. Because the first two terms of Eq. (18) are not random, all $\{r_i^{(n)}\}_{1\le n\le n_\Delta^b}$ are independent, symmetric, and sub-Gaussian (as $\theta_j^*$ is bounded and $\eta_n$ is Gaussian) random variables.

Therefore, as $n_\Delta^b$ is indeed a stopping time that is finite a.s., we can apply the Empirical Bernstein Inequality (Theorem 10) to $\{r_i^{(n)}\}_n$, where $\mathrm{Var}(r_i^{(n)})$ is characterized by Eq. (19), giving
\[ \Pr\Big\{\Big|\sum_{n=1}^{n_\Delta^b}\Big(\sum_{j\notin S, j\ne i} z_j^{(n)}\theta_j^* + \frac{K}{R^2}y_i^{(n)}\eta_n\Big)\Big| \ge 2\sqrt{2\sum_{n=1}^{n_\Delta^b}(r_{n,i} - \bar r_i)^2\ln\frac{4}{\delta}}\Big\} \le \delta. \]
In other words, our choice of $n_\Delta^b$ in Eq. (2) ensures that the average noise is bounded by $\frac{\Delta}{4}$. By Lemma 18, we conclude that $\Pr\{|\bar r_i - \theta_i^*| > 3\Delta^2 + \frac{\Delta}{4} \mid \mathcal G_\Delta, \mathcal H_{2\Delta}\} \le \delta$.

Proof of Lemma 20. If we skipped due to $\sum_{i\in S}\hat\theta_i^2 > 1 - \Delta^2$, which is Line 7 of Algorithm 1, then by Lemma 18, we have $\sum_{i\in S}(\theta_i^*)^2 > 1 - 5\Delta^2$ conditioning on $\mathcal G_\Delta \wedge \mathcal H_{2\Delta}$, and thus all remaining coordinates are smaller than $3\Delta$. Moreover, by $\mathcal H_{2\Delta}$, all discovered coordinates have magnitude at least $\frac{2\Delta}{2} > \frac{\Delta}{2}$. Hence $\mathcal H_\Delta$ automatically holds in this case.

Otherwise, suppose that the conclusion of Lemma 19 holds for all $i \in [K]$ (which happens with probability $1 - K\delta$ conditioning on $\mathcal G_\Delta$ and $\mathcal H_{2\Delta}$). As we only pick those coordinates with $|\bar r_i| > \Delta$, all coordinates with magnitude at least $\Delta + \frac{\Delta}{4} + 3\Delta^2 < 3\Delta$ will be picked as $\Delta \le \frac{1}{4}$, so the first condition of $\mathcal H_\Delta$ indeed holds. Moreover, all picked coordinates have magnitudes at least $\Delta - (\frac{\Delta}{4} + 3\Delta^2)$. But when $\Delta = \frac{1}{4}$, there are no coordinates in $S$ as it is initially empty. After that, we have $\Delta^2 \le \frac{\Delta}{8}$. Hence, all coordinates with magnitude $\frac{\Delta}{2}$ will surely be identified, which means the second condition of $\mathcal H_\Delta$ also holds. We then have $\Pr\{\mathcal H_\Delta \mid \mathcal G_\Delta, \mathcal H_{2\Delta}\} \ge 1 - K\delta$ by putting these two cases together.
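The unbiasedness of the random projection estimator in the proof of Lemma 19 can be illustrated by a small simulation. The sketch below covers only the simplest case and rests on assumptions that are not taken from Algorithm 1 itself: the discovered set $S$ is empty (so the bias term vanishes), $R = 1$, $K = d$, and the noise is homoscedastic Gaussian. The function name `random_projection_estimate` is hypothetical.

```python
import random

def random_projection_estimate(theta, sigma, n, seed=0):
    """Average n single-round estimates r_i = (K / R^2) * y_i * (<y, theta> + eta).

    Assumes S is empty, R = 1 and K = d, so every coordinate of the action y
    is +-1/sqrt(d) with an independent random sign.
    """
    rng = random.Random(seed)
    d = len(theta)
    scale = d ** -0.5
    totals = [0.0] * d
    for _ in range(n):
        y = [rng.choice((-scale, scale)) for _ in range(d)]
        reward = sum(t * v for t, v in zip(theta, y)) + rng.gauss(0.0, sigma)
        for i in range(d):
            # y_i^2 = 1/d, so d * y_i * y_i * theta_i = theta_i; cross terms are zero-mean.
            totals[i] += d * y[i] * reward
    return [s / n for s in totals]

theta_star = [0.6, 0.0, 0.0, -0.8, 0.0, 0.0, 0.0, 0.0]
estimate = random_projection_estimate(theta_star, sigma=0.1, n=20000)
```

Each coordinate of `estimate` concentrates around the corresponding entry of `theta_star`, with per-round variance of order $1 + d\sigma^2$, matching the scaling of Eq. (19) for $K = d$ and $R = 1$.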

F.4 SINGLE-PHASE REGRET

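Proof of Lemma 21. Conditioning on $\mathcal G_\Delta$, as we are playing $x_{t,i} = \hat\theta_i$ for all $i \in S$ and $t \in \mathcal T_\Delta^b$, we have
\[ \sum_{t\in\mathcal T_\Delta^b}\sum_{i\in S}(\theta_i^* - x_{t,i})\theta_i^* = \sum_{t\in\mathcal T_\Delta^b}\sum_{i\in S}(\theta_i^* - \hat\theta_i)\theta_i^* < n_\Delta^b \cdot \Delta^2. \]
Now consider a single coordinate $i$ not in $S$. For each $t \in \mathcal T_\Delta^b$, we equiprobably play $\pm\frac{R}{\sqrt K}$ for this coordinate. Hence, the total regret becomes
\[ \sum_{t\in\mathcal T_\Delta^b}\Big(\theta_i^* \pm \frac{R}{\sqrt K}\Big)\theta_i^* = n_\Delta^b\cdot(\theta_i^*)^2 + \theta_i^*\sum_{t\in\mathcal T_\Delta^b}\Big(\pm\frac{R}{\sqrt K}\Big). \]
By $\mathcal H_{2\Delta}$, the first term is bounded by $36 n_\Delta^b\Delta^2$. By the Chernoff bound, the absolute value of the summation in the second term is bounded by $\frac{R}{\sqrt K}\sqrt{2n_\Delta^b\ln\frac{1}{\delta}}$ with probability $1-\delta$. As there are at most $s$ non-zero coordinates by sparsity, from a union bound, we can conclude that
\[ \sum_{t\in\mathcal T_\Delta^b}\langle\theta^* - x_t, \theta^*\rangle \le 36sn_\Delta^b\Delta^2 + 6s\Delta\frac{R}{\sqrt K}\sqrt{2n_\Delta^b\ln\frac{1}{\delta}} \]
with probability $1 - s\delta$.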
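The concentration step in the proof of Lemma 21 bounds a sum of $n$ random signs (each scaled by $R/\sqrt{K}$) by $\frac{R}{\sqrt K}\sqrt{2n\ln\frac{1}{\delta}}$. A quick simulation of the unscaled sign sum illustrates how rarely this threshold is exceeded. This is a sketch with an assumed helper name `exceed_rate` and illustrative parameters; Hoeffding's inequality bounds the failure probability by $\delta$ for each side, hence by $2\delta$ for the absolute value, which is the quantity checked below.

```python
import math
import random

def exceed_rate(n, delta, runs=2000, seed=0):
    """Empirical frequency of |sum of n random signs| > sqrt(2 n ln(1/delta))."""
    rng = random.Random(seed)
    bound = math.sqrt(2.0 * n * math.log(1.0 / delta))
    bad = 0
    for _ in range(runs):
        total = sum(rng.choice((-1, 1)) for _ in range(n))
        if abs(total) > bound:
            bad += 1
    return bad / runs

rate = exceed_rate(n=200, delta=0.05)
```

Multiplying both the sum and the threshold by $R/\sqrt{K}$ leaves the event unchanged, so the same frequency applies to the scaled sum appearing in the proof.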

G OMITTED PROOF IN SECTION 4.1 (ANALYSIS OF VOFUL2)

G.1 PROOF OF MAIN THEOREM

In this section, we prove Theorem 5, which is restated as Theorem 22. We first assume that Proposition 4 is indeed correct; its discussion is left to Appendix G.2.

Theorem 22 (Regret of Algorithm 1 with VOFUL2 in Unknown-Variance Case). Consider Algorithm 1 with $\mathcal F$ as VOFUL2 (Kim et al., 2021) and $\overline{R}^{\mathcal F}_n$ as
\[ \overline{R}^{\mathcal F}_n = C\left(s^{1.5}\sqrt{\sum_{k=1}^n (r_k - \langle x_k, \hat\beta\rangle)^2\ln\frac{1}{\delta}} + s^2\sqrt{2\ln\frac{n}{s\delta^2}\ln\frac{1}{\delta}} + s^{1.5}\sqrt{2\ln\frac{1}{\delta}} + s^2\ln\frac{1}{\delta}\right), \]
where $C = O(1)$ is the constant hidden in the $O$ notation of Proposition 4, $x_1, x_2, \ldots, x_n$ are the actions made by the agent, $r_1, r_2, \ldots, r_n$ are the corresponding (noisy) feedback, and $\hat\beta$ is defined as
\[ \hat\beta = \operatorname*{argmin}_{\beta\in\mathbb R^s}\sum_{k=1}^n (r_k - \langle x_k, \beta\rangle)^2 + \|\beta\|_2. \]
The algorithm ensures the following regret bound with probability $1-\delta$:
\[ R_T = O\left((s^{2.5} + s\sqrt d)\sqrt{\sum_{t=1}^T \sigma_t^2\log\frac{1}{\delta}} + s^3\log\frac{1}{\delta}\right). \]

Proof. By Proposition 4, the total regret incurred by algorithm $\mathcal F$ satisfies
\[ R^{\mathcal F}_n \le C\left(s^{1.5}\sqrt{\sum_{k=1}^n \eta_k^2\ln\frac{1}{\delta}} + s^2\ln\frac{1}{\delta}\right), \]
where $C = O(1)$ is a constant (with some logarithmic factors) and $\eta_k \sim \mathcal N(0, \sigma_k^2)$ is the noise for the $k$-th round executing $\mathcal F$. We now consider our regret estimator $\overline{R}^{\mathcal F}_n$. We show that, with high probability, it is a pessimistic estimation.

Lemma 23. For any given $\Delta$, with probability $1-\delta$, $\overline{R}^{\mathcal F}_n \ge R^{\mathcal F}_n$.

Therefore, for the "explore" phase, we can make use of Theorem 16, which only requires $\overline{R}^{\mathcal F}_n$ to be an over-estimate of $R^{\mathcal F}_n$ w.h.p., giving
\[ R_T^b \le O\left(s\sqrt{d\sum_{t=1}^T \sigma_t^2\log\frac{1}{\delta}} + s\log\frac{1}{\delta}\right). \]
So we only need to bound the regret for the "commit" phase, namely $R_T^a$. As mentioned in the main text, we consider the regret contributed from inside and outside $S$ separately. Formally, we write $R_T^a$ as
\[ R_T^a = \sum_{\Delta = 2^{-2},\ldots,\Delta_f}\sum_{t\in\mathcal T_\Delta^a}\left(\sum_{i\in S}\theta_i^*(\theta_i^* - x_{t,i}) + \sum_{i\notin S}(\theta_i^*)^2\right), \]
where the equality holds because we do not put any mass on those $i \notin S$ during the "commit" phase. We still assume that the 'good events' $\mathcal G_\Delta, \mathcal H_\Delta$ hold for all $\Delta$ (defined in Eq. (10)).
By H 2∆ , for a given ∆, i̸ ∈S (θ * i ) 2 ≤ 36s∆ 2 . Moreover, by Lemma 23, we will have R a T ≤ ∆=2 -2 ,...,∆ f R F n a ∆ + 36s ∆=2 -2 ,...,∆ f n a ∆ ∆ 2 . By the terminating criterion 1 n R F n < ∆ 2 (Line 5 of Algorithm 1), for all ∆, we will have n a ∆ -1 ≤ C ∆ 2 R F n a ∆ , which means n a ∆ ≤ C ∆ 2   s 1.5 t∈ T a ∆ (r t -⟨x t , β⟩) 2 ln 1 δ + s 2 2 ln n a ∆ sδ 2 ln 1 δ + s 1.5 2 ln 1 δ + s 2 ln 1 δ    + 1. Therefore, plugging back into the expression of R a T gives (the term related with R F n a ∆ is dominated by the second term, as intuitively explained in the main text): R a T ≤ ∆=2 -2 ,...,∆ f O   s 2.5 t∈ T a ∆ (r t -⟨x t , β⟩) 2 ln 1 δ + s 3 ln 1 δ + ln 1 δ + s∆ 2    . The last term is simply bounded by O(s) after summing up over ∆. Let us focus on the first term, where we need to upper bound the magnitude of R F n . Applying Lemma 15 shows that the following with probability 1 -δ: n k=1 r k -⟨x k , β⟩ 2 ≤ n k=1 η 2 k + 2s ln n sδ 2 + 2. Therefore, we can write the regret (conditioning on this good event holds for all ∆, which is with probability at least 1 -dT δ according to Theorem 16) as R a T ≤ ∆=2 -2 ,...,∆ f O   s 2.5 t∈T a ∆ η 2 t log 1 δ + s 3 log 1 δ + log 1 δ + s∆ 2    . Again by the technique we used for Appendix F, we will need the following lower bound for all n a ∆ except for the last one, which follows from the fact that R F n ≥ R F n (Lemma 23): n a ∆ ≥ Cs 1.5 ∆ 2 t∈ T a ∆ η 2 t ln 1 δ . By the summation technique that we used in Appendix F, more preciously, by the derivation of Eq. ( 15), we will have the following derivation: ∆̸ =∆ f Cs 1.5 t∈T a ∆ η 2 t ln 1 δ ≤ log 4 X 1 4 ≥∆≥∆ X t∈T a ∆ C 2 s 3 η 2 t ln 1 δ + 1 X ∆ X >∆>∆ f Cs 1.5 ∆ 2 t∈T a ∆ η 2 t ln 1 δ where X is defined as X = T ∆̸ =∆ f t∈T a ∆ C 2 s 3 η 2 t ln 1 δ and ∆ X = 2 -⌈log 4 X⌉ , which means ∆ 2 X ≤ 1 X . 
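Hence, we have (the second summation is bounded by $\frac{T}{X}$ as $\sum_\Delta n_\Delta^a \le T$)
\[ \sum_{\Delta\ne\Delta_f} Cs^{1.5}\sqrt{\sum_{t\in\mathcal T_\Delta^a}\eta_t^2\ln\frac{1}{\delta}} \le O\left(\sqrt{\sum_{\Delta\ne\Delta_f}\sum_{t\in\mathcal T_\Delta^a} C^2s^3\eta_t^2\log\frac{1}{\delta}}\right) = O\left(s^{1.5}\sqrt{\sum_{t=1}^T \eta_t^2\log\frac{1}{\delta}}\right). \]
In other words, the first term of Eq. (21) is bounded by
\[ s^{2.5}\sum_{\Delta = 2^{-2},\ldots,\Delta_f}\sqrt{\sum_{t\in\mathcal T_\Delta^a}\eta_t^2\ln\frac{1}{\delta}} \le O\left(s^{2.5}\sqrt{\sum_{t=1}^T \eta_t^2\log\frac{1}{\delta}}\right). \]
So we are done with the first and the last terms of Eq. (21). Now consider the second term, which is equivalent to bounding $\sum_\Delta 1$. We use the following property guaranteed by (again) Line 5 of Algorithm 1:
\[ \sum_{\Delta\ne\Delta_f}\Delta^{-2}Cs^2\ln\frac{4}{\delta} \le \sum_{\Delta\ne\Delta_f} n_\Delta^a \le T, \]
which means $\sum_\Delta 1 \le \log_4(T/s^2) = O(1)$. Combining them together gives
\[ R_T^a \le O\left(s^{2.5}\sqrt{\sum_{t=1}^T \eta_t^2\log\frac{1}{\delta}} + s^3\log\frac{1}{\delta} + \log\frac{1}{\delta}\right). \]
Finally, with probability $1-\delta$,
\[ \sum_{i=1}^n (r_i - \langle x_i, \beta^*\rangle)^2 = \sum_{k=1}^n \eta_k^2 \overset{(a)}{\le} \sum_{k=1}^n \mathbb E[\eta_k^2 \mid \mathcal F_{k-1}] + 4\sqrt 2\sum_{k=1}^n \sigma_k^2\ln\frac{2}{\delta} \overset{(b)}{\le} 8\sum_{k=1}^n \sigma_k^2\ln\frac{2}{\delta}, \]
where (a) is due to Theorem 13 and (b) is due to $\mathbb E[\eta_k^2 \mid \mathcal F_{k-1}] \le \sigma_k^2$. Hence,
\[ R_T^a \le O\left(s^{2.5}\sqrt{\sum_{t=1}^T \sigma_t^2\log\frac{1}{\delta}} + s^3\log\frac{1}{\delta} + \log\frac{1}{\delta}\right). \]
Combining this with the regret bound for $R_T^b$ (Theorem 16) gives
\[ R_T \le O\left((s^{2.5} + s\sqrt d)\sqrt{\sum_{t=1}^T \sigma_t^2\log\frac{1}{\delta}} + s^3\log\frac{1}{\delta} + \log\frac{1}{\delta} + s\log\frac{1}{\delta}\right) = O\left((s^{2.5} + s\sqrt d)\sqrt{\sum_{t=1}^T \sigma_t^2\log\frac{1}{\delta}} + s^3\log\frac{1}{\delta}\right), \]
as claimed.

Algorithm 3 VOFUL2 Algorithm (Kim et al., 2021)
for $t = 1, 2, \ldots, T$ do
    Compute the action for the $t$-th round as $x_t = \operatorname*{argmax}_{x\in\mathcal X}\max_{\theta\in\bigcap_{s=1}^{t-1}\Theta_s}\langle x, \theta\rangle$, where $\Theta_s$ is defined in Eq. (24).
    Observe reward $r_t = \langle x_t, \theta^*\rangle + \eta_t$.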

G.2 REGRET OVER-ESTIMATION

In this section, we first discuss why the strengthened Proposition 4 holds. After that, we argue that our pessimistic estimation $\overline{\mathcal R}^{\mathcal F}_n$ is indeed an over-estimation of $\mathcal R^{\mathcal F}_n$. We state the VOFUL2 algorithm in Algorithm 3 and also restate Proposition 4 as Proposition 24 for ease of presentation.

Proposition 24. VOFUL2 on $d$ dimensions guarantees, with probability at least $1-\delta$,
\[ \mathcal R^{\mathcal F}_T = \widetilde O\left(d^{1.5}\sqrt{\sum_{t=1}^T\eta_t^2\log\frac1\delta} + d^2\log\frac1\delta\right) = \widetilde O\left(d^{1.5}\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta} + d^2\log\frac1\delta\right), \]
where $T$ is a stopping time finite a.s. and $\sigma_1^2,\sigma_2^2,\ldots,\sigma_T^2$ are the variances of $\eta_1,\eta_2,\ldots,\eta_T$.

Proof. We follow the proof of Kim et al. (2021) and highlight the steps that differ. We first argue that an analog of their Empirical Bernstein Inequality (Zhang et al., 2021, Theorem 4) still holds.

Lemma 25 (Analog of Theorem 4 from Zhang et al. (2021)). Let $\{X_i\}_{i=1}^n$ be a sequence of zero-mean Gaussian random variables such that $n$ is a stopping time finite a.s. Then for all $n\ge8$ and any $\delta\in(0,1)$,
\[ \Pr\left[\sum_{i=1}^n X_i \le 8\sqrt{\sum_{i=1}^n X_i^2\ln\frac4\delta}\right] \ge 1-\delta . \]

Proof. This is a direct corollary of Theorem 10.

Thanks to Theorem 14, the following analog of Lemma 17 from Zhang et al. (2021) also holds:

Lemma 26 (Analog of Lemma 17 from Zhang et al. (2021)). Let $\{X_i\}_{i=1}^n$ be a sequence of zero-mean Gaussian random variables such that $n$ is a stopping time finite a.s. Then for all $\delta\in(0,1)$,
\[ \Pr\left[\sum_{i=1}^n X_i^2 \ge 8\sum_{i=1}^n\sigma_i^2\ln\frac2\delta\right] \le \delta, \]
where $\sigma_i^2$ is the variance of $X_i$.

Then consider their Elliptical Potential Counting Lemma (Kim et al., 2021, Lemma 5), which holds as long as $\|x_t\|_2\le1$. In our setting, we indeed have this property as $x_t\in\mathbb B^d$ (only the noises can be unbounded, not the actions). Hence, this lemma still holds.

Proposition 27 (Kim et al. (2021, Lemma 5)). Let $x_1,x_2,\ldots,x_k\in\mathbb R^d$ be such that $\|x_i\|_2\le1$ for all $i\in[k]$. Let $V_k = \lambda I + \sum_{i=1}^k x_i x_i^\top$.
Let $J = \{i\in[k] \mid \|x_i\|^2_{V_{i-1}^{-1}} \ge q\}$, then
\[ |J| \le \frac{2d}{\ln(1+q)}\ln\left(1 + \frac{2/e}{\ln(1+q)}\cdot\frac1\lambda\right). \]

Then consider the confidence set construction:
\[ \Theta_t = \bigcap_{\ell=1}^L\left\{\theta\in\mathbb B^d \;\middle|\; \sum_{s=1}^t (x_s^\top\mu)_\ell\,\epsilon_s(\theta) \le \sqrt{\sum_{s=1}^t (x_s^\top\mu)_\ell^2\,\epsilon_s^2(\theta)\,\iota},\ \forall\mu\in\mathbb B^d\right\}, \]
where $\epsilon_s(\theta) = r_s - x_s^\top\theta$, $\epsilon_s^2(\theta) = (\epsilon_s(\theta))^2$, $\iota = 128\ln((12K2^L)^{d+2}/\delta) = \widetilde O(d\log\frac1\delta)$, $L = \max\{1,\lfloor\log_2(1+\frac Td)\rfloor\}$ and $(x)_\ell = \min\{|x|,2^{-\ell}\}\frac{x}{|x|}$. By using Lemma 25 together with their original $\epsilon$-net coverage argument, we have $\theta^*\in\Theta_t$ for all $t\in[T]$ with high probability:

Lemma 28 (Analog of Lemma 1 from Kim et al. (2021)). The good event $\mathcal E_1 = \{\forall t\in[T],\ \theta^*\in\Theta_t\}$ happens with probability $1-\delta$.

Similar to Kim et al. (2021), we define $\theta_t$ to be the maximizer of Eq. (22) in the $t$-th round and define $\mu_t = \theta_t - \theta^*$. We also consider the following good event $\mathcal E_2$:
\[ \forall t\in[T],\quad \sum_{s=1}^t\epsilon_s^2(\theta^*) \le 8\sum_{s=1}^t\sigma_s^2\ln\frac{8T}\delta, \]
which happens with probability $1-\delta$ due to Lemma 26. Define
\[ W_{\ell,t-1}(\mu) = 2^{-\ell}\lambda I + \sum_{s=1}^{t-1}\left(1\wedge\frac{2^{-\ell}}{|x_s^\top\mu|}\right)x_s x_s^\top . \]
Abbreviating $W_{\ell,t-1}(\mu_t)$ as $W_{\ell,t-1}$, we have the following lemma, which is slightly different from the original one and whose proof is presented later.

Lemma 29 (Analog of Lemma 4 from Kim et al. (2021)). Conditioning on $\mathcal E_1$ and setting $\lambda=1$, we have
1. $\|\mu_t\|^2_{W_{\ell,t-1}} \le C_1 2^{-\ell}(\sqrt{A_{t-1}\iota}+\iota)$ for some absolute constant $C_1$, where $A_t \triangleq \sum_{s=1}^t\eta_s^2$.
2. For all $s\le t$, we have $\|\mu_t\|^2_{W_{\ell,s-1}} \le C_1 2^{-\ell}(\sqrt{A_{s-1}\iota}+\iota)$.
3. There exists an absolute constant $C_2$ such that $x_t^\top\mu_t \le C_2\|x_t\|^2_{W_{\ell,t-1}^{-1}}(\sqrt{A_{t-1}\iota}+\iota)$.

Therefore, as we have the same $\mathcal E_1$, the same Lemma 5 and a similar Lemma 4 (which uses $\eta_s$ instead of $\sigma_s$), we can conclude that
\[ \mathcal R^{\mathcal F}_T \le C\left(\sqrt{\sum_{s=1}^{T-1}\eta_s^2\,\iota}+\iota\right)d\ln^2\left(1 + C\left(\sqrt{\sum_{s=1}^{T-1}\eta_s^2\,\iota}+\iota\right)\left(1+\frac Td\right)^2\right) = \widetilde O\left(d^{1.5}\sqrt{\sum_{t=1}^T\eta_t^2\log\frac1\delta} + d^2\log\frac1\delta\right), \]
as in Kim et al. (2021, Theorem 2) (recall that $\iota = \widetilde O(d\log\frac1\delta)$).
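Conditioning on $\mathcal E_2$, we then further have
\[ \mathcal R^{\mathcal F}_T \le \widetilde O\left(d^{1.5}\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta} + d^2\log\frac1\delta\right), \]
as claimed.

Proof of Lemma 29. By definition and abbreviating $(\cdot)_\ell$ as $\overline{(\cdot)}$, we have
\begin{align*}
\|\mu_t\|^2_{W_{\ell,t-1}-2^{-\ell}\lambda I}
&= \sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}\,(x_s^\top\mu_t)
 = \sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}\,(x_s^\top\theta_t - r_s + r_s - x_s^\top\theta^*) \\
&= \sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}\,(-\epsilon_s(\theta_t)+\epsilon_s(\theta^*)) \\
&\overset{(\mathcal E_1)}{\le} \sqrt{\sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}^2\epsilon_s^2(\theta_t)\iota} + \sqrt{\sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}^2\epsilon_s^2(\theta^*)\iota} \\
&\overset{(a)}{\le} \sqrt{\sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}^2\,2(x_s^\top\mu_t)^2\iota} + 2\sqrt{\sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}^2\,2\epsilon_s^2(\theta^*)\iota} \\
&\overset{(b)}{\le} \sqrt{\sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}^2\,2(x_s^\top\mu_t)^2\iota} + 2^{-\ell}\sqrt{16\sum_{s=1}^{t-1}\eta_s^2\iota} \\
&\overset{(c)}{\le} \sqrt{2^{-\ell}\cdot4\sum_{s=1}^{t-1}\overline{(x_s^\top\mu_t)}\,(x_s^\top\mu_t)\iota} + 2^{-\ell}\sqrt{16\sum_{s=1}^{t-1}\eta_s^2\iota} \\
&= \sqrt{2^{-\ell}\cdot4\,\|\mu_t\|^2_{W_{\ell,t-1}-2^{-\ell}\lambda I}\,\iota} + 2^{-\ell}\sqrt{16\sum_{s=1}^{t-1}\eta_s^2\iota},
\end{align*}
where (a) used $\epsilon_s^2(\theta_t) = (r_s - x_s^\top\theta_t)^2 = (x_s^\top(\theta^*-\theta_t)+\epsilon_s(\theta^*))^2 \le 2(x_s^\top\mu_t)^2 + 2\epsilon_s^2(\theta^*)$, (b) used $\epsilon_s^2(\theta^*) = \eta_s^2$ together with $|\overline{(x_s^\top\mu_t)}| \le 2^{-\ell}$, and (c) used $|\overline{(x_s^\top\mu_t)}| \le 2^{-\ell}$ and $|x_s^\top\mu_t| \le 2$. By the self-bounding property (Lemma 37), we have
\[ \|\mu_t\|^2_{W_{\ell,t-1}-2^{-\ell}\lambda I} \le 2^{-\ell}\left(16\sqrt{\sum_{s=1}^{t-1}\eta_s^2\iota} + 8\iota\right), \]
which, as $\|\mu_t\|_2\le2$, means
\[ \|\mu_t\|^2_{W_{\ell,t-1}} \le 2^{-\ell}\left(4\lambda + 16\sqrt{\sum_{s=1}^{t-1}\eta_s^2\iota} + 8\iota\right). \]
Setting $\lambda=1$ gives the first conclusion. Based on this, the second and third conclusions follow directly as in Kim et al. (2021).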
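The first identity in the proof of Lemma 29 rests on the fact that the weight $1\wedge 2^{-\ell}/|z|$ in $W_{\ell,t-1}$ turns $z^2$ into $(z)_\ell\, z$. The following is a minimal scalar sketch of that fact, not part of VOFUL2 itself; the function names `clip_level` and `weight` are hypothetical, and the value assigned at $z=0$ is a convention assumed here.

```python
def clip_level(z: float, level: int) -> float:
    """Clipping (z)_l = min(|z|, 2^{-l}) * z / |z|, with (0)_l = 0 by convention."""
    if z == 0.0:
        return 0.0
    cap = 2.0 ** (-level)
    sign = 1.0 if z > 0 else -1.0
    return min(abs(z), cap) * sign


def weight(z: float, level: int) -> float:
    """Weight 1 ∧ 2^{-l}/|z| used in W_{l,t-1}; set to 1 at z = 0 (assumed convention)."""
    if z == 0.0:
        return 1.0
    return min(1.0, 2.0 ** (-level) / abs(z))


def weighted_square(z: float, level: int) -> float:
    """Contribution of one sample with z = x^T mu to mu^T (W - 2^{-l} lambda I) mu."""
    return weight(z, level) * z * z


def clipped_product(z: float, level: int) -> float:
    """The same contribution written as (z)_l * z, as in the proof."""
    return clip_level(z, level) * z
```

Summing `weighted_square` over samples gives the quadratic form on the left of the chain, while summing `clipped_product` gives the first expression on the right; the two agree term by term.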

G.3 EXTENSION TO UNKNOWN VARIANCE CASES

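Based on Proposition 4, we then show that our regret estimation $\overline{\mathcal R}^{\mathcal F}_n$ (Eq. (20)) is indeed pessimistic (i.e., Lemma 23).

Proof of Lemma 23. From Lemma 15 with $\lambda=1$, with probability $1-\delta$, we have (recall the assumption that $\sigma_t^2\le1$ for all $t\in[T]$)
\[ \sum_{k=1}^n (r_k - \langle x_k,\beta^*\rangle)^2 = \sum_{k=1}^n\eta_k^2 \le \sum_{k=1}^n (r_k - \langle x_k,\hat\beta\rangle)^2 + 2s\ln\frac{n}{s\delta^2} + 2 . \]
Therefore we have
\begin{align*}
\mathcal R^{\mathcal F}_n
&\le C\left(s^{1.5}\sqrt{\sum_{k=1}^n\eta_k^2\ln\frac1\delta} + s^2\ln\frac1\delta\right) \\
&\le C\left(s^{1.5}\sqrt{\left(\sum_{k=1}^n (r_k-\langle x_k,\hat\beta\rangle)^2 + 2s\ln\frac{n}{s\delta^2} + 2\right)\ln\frac1\delta} + s^2\ln\frac1\delta\right) \\
&\le C\left(s^{1.5}\sqrt{\sum_{k=1}^n (r_k-\langle x_k,\hat\beta\rangle)^2\ln\frac1\delta} + s^2\sqrt{2\ln\frac{n}{s\delta^2}\ln\frac1\delta} + s^{1.5}\sqrt{2\ln\frac1\delta} + s^2\ln\frac1\delta\right) = \overline{\mathcal R}^{\mathcal F}_n .
\end{align*}
In other words, our $\overline{\mathcal R}^{\mathcal F}_n$ is an over-estimation of $\mathcal R^{\mathcal F}_n$ with probability $1-\delta$.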
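The last inequality in the proof of Lemma 23 only uses $\sqrt{a+b+c}\le\sqrt a+\sqrt b+\sqrt c$. A minimal numeric sketch of the two sides is given below, assuming the form of the estimator written above; the function names and the default `C = 1.0` are illustrative assumptions, and `residual_sq_sum` stands for the sum of squared ridge-regression residuals.

```python
import math


def intermediate_bound(residual_sq_sum: float, n: int, s: int, delta: float, C: float = 1.0) -> float:
    """Second line of the chain: the residual sum is inflated by 2 s ln(n/(s delta^2)) + 2."""
    log_inv_delta = math.log(1.0 / delta)
    inflated = residual_sq_sum + 2.0 * s * math.log(n / (s * delta ** 2)) + 2.0
    return C * (s ** 1.5 * math.sqrt(inflated * log_inv_delta) + s ** 2 * log_inv_delta)


def regret_overestimate(residual_sq_sum: float, n: int, s: int, delta: float, C: float = 1.0) -> float:
    """Pessimistic estimator: the square root is split into three separate terms."""
    log_inv_delta = math.log(1.0 / delta)
    log_n_term = math.log(n / (s * delta ** 2))
    return C * (
        s ** 1.5 * math.sqrt(residual_sq_sum * log_inv_delta)
        + s ** 2 * math.sqrt(2.0 * log_n_term * log_inv_delta)
        + s ** 1.5 * math.sqrt(2.0 * log_inv_delta)
        + s ** 2 * log_inv_delta
    )
```

Both functions assume $0<\delta<1$ and $n\ge s$, so that every logarithm is non-negative. The estimator depends only on observable residuals, which is what allows it to be evaluated without knowing the variances.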

H OMITTED PROOF IN SECTION 4.2 (ANALYSIS OF WEIGHTED OFUL)

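H.1 PROOF OF MAIN THEOREM

Similar to the VOFUL2 algorithm, we still assume that Proposition 6 holds and defer the discussion to the next section. We restate the regret guarantee of Weighted OFUL (Zhou et al., 2021), namely Theorem 7, for ease of reading, as follows:

Theorem 30 (Regret of Algorithm 1 with Weighted OFUL in Known-Variance Case). Consider Algorithm 1 with $\mathcal F$ as Weighted OFUL (Zhou et al., 2021) and $\overline{\mathcal R}^{\mathcal F}_n$ as
\[ \overline{\mathcal R}^{\mathcal F}_n \triangleq C\left(\sqrt{sn\ln\frac1\delta} + s\sqrt{\sum_{k=1}^n\sigma_k^2\ln\frac1\delta}\right). \]
The algorithm ensures the following regret bound with probability $1-\delta$:
\[ \mathcal R_T = \widetilde O\left((s^2+s\sqrt d)\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta} + s^{1.5}\sqrt T\log\frac1\delta\right). \]

Proof. Firstly, from Proposition 6, the condition for applying Theorem 16 holds. Therefore, we have
\[ \mathcal R_T^b \le \widetilde O\left(s\sqrt d\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta} + s\log\frac1\delta\right). \]
For $\mathcal R_T^a$, similar to Appendix G.1, we decompose it into two parts: the regret from $S$ and the regret from outside of $S$. The former is bounded by Proposition 6, as
\[ \text{Regret from } S \text{ with gap threshold } \Delta \le C\left(\sqrt{s n_\Delta^a\ln\frac1\delta} + s\sqrt{\sum_{t\in\mathcal T_\Delta^a}\sigma_t^2\ln\frac1\delta}\right). \]
For the regret from outside $S$, we bound it by $O(s n_\Delta^a\Delta^2)$, so we only need to bound $n_\Delta^a$. From Line 5 of Algorithm 1, we have
\[ n_\Delta^a - 1 \le \frac{C}{\Delta^2}\left(\sqrt{s(n_\Delta^a-1)\ln\frac1\delta} + s\sqrt{\sum_{t\in\widetilde{\mathcal T}_\Delta^a}\sigma_t^2\ln\frac1\delta}\right). \]
By the "self-bounding" property that $x\le a+b\sqrt x$ implies $x\le O(a+b^2)$ (Lemma 37), we have
\[ n_\Delta^a - 1 \le O\left(\frac{1}{\Delta^4}s\ln\frac1\delta + \frac{s}{\Delta^2}\sqrt{\sum_{t\in\widetilde{\mathcal T}_\Delta^a}\sigma_t^2\ln\frac1\delta}\right). \]
Therefore, we can conclude that (the regret from $S$ is dominated)
\[ \mathcal R_T^a \le \sum_{\Delta=2^{-2},\ldots,\Delta_f} \widetilde O\left(\frac{s^2}{\Delta^2}\log\frac1\delta + s^2\sqrt{\sum_{t\in\widetilde{\mathcal T}_\Delta^a}\sigma_t^2\log\frac1\delta} + s\Delta^2\right), \]
where the last term is simply bounded by $O(s)$. Again from Line 5 of Algorithm 1, we have the following property for all $\Delta\ne\Delta_f$:
\[ n_\Delta^a > \frac{C}{\Delta^2}\left(\sqrt{s n_\Delta^a\ln\frac1\delta} + s\sqrt{\sum_{t\in\mathcal T_\Delta^a}\sigma_t^2\ln\frac1\delta}\right). \tag{25} \]
We first bound the second term, which basically follows the summation technique (Eq. (15)) that we used in Appendices F and G.1:
\[ \sum_{\Delta\ne\Delta_f} Cs\sqrt{\sum_{t\in\mathcal T_\Delta^a}\sigma_t^2\ln\frac1\delta} \le \sqrt{\log_4 X}\,\sqrt{\sum_{\frac14\ge\Delta\ge\Delta_X}\ \sum_{t\in\mathcal T_\Delta^a} C^2s^2\sigma_t^2\ln\frac1\delta} + \frac1X\sum_{\Delta_X>\Delta>\Delta_f}\frac{Cs}{\Delta^2}\sqrt{\sum_{t\in\mathcal T_\Delta^a}\sigma_t^2\ln\frac1\delta}, \]
where $X$ is defined as
\[ X = \frac{T}{\sqrt{\sum_{\Delta\ne\Delta_f}\sum_{t\in\mathcal T_\Delta^a} C^2s^2\sigma_t^2\ln\frac1\delta}} \]
and $\Delta_X = 2^{-\lceil\log_4 X\rceil}$, which means $\Delta_X^2\le\frac1X$. Hence, we have (the second summation is bounded by $\frac TX$, as $\sum_\Delta n_\Delta^a\le T$)
\[ \sum_{\Delta\ne\Delta_f} Cs\sqrt{\sum_{t\in\mathcal T_\Delta^a}\sigma_t^2\ln\frac1\delta} \le \widetilde O\left(\sqrt{\sum_{\Delta\ne\Delta_f}\sum_{t\in\mathcal T_\Delta^a} C^2s^2\sigma_t^2\log\frac1\delta}\right) = \widetilde O\left(s\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta}\right). \]
Hence, for the second term, we have
\[ O\left(s^2\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta\log T}\right) = \widetilde O\left(s^2\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta}\right). \]
At last, we consider the first term. From the same lower bound on $n_\Delta^a$ (Eq. (25)), we have
\[ n_\Delta^a > \frac{C}{\Delta^2}\sqrt{s n_\Delta^a\ln\frac1\delta} \implies n_\Delta^a > \frac{C^2}{\Delta^4}s\ln\frac1\delta . \]
By the fact that $\sum_{\Delta\ne\Delta_f} n_\Delta^a\le T$, we have
\[ T \ge C^2 s\ln\frac1\delta\sum_{\Delta\ne\Delta_f}\Delta^{-4} = \Theta(1)\cdot C^2 s\ln\frac1\delta\,(2\Delta_f)^{-4} . \]
Hence,
\[ O\left(s^2\log\frac1\delta\cdot\sum_{\Delta=2^{-2},\ldots,\Delta_f}\Delta^{-2}\right) = O\left(s^2\log\frac1\delta\,\Delta_f^{-2}\right) \le O\left(s^2\log\frac1\delta\sqrt{\frac{T}{s\ln\frac1\delta}}\right) = O\left(s^{1.5}\sqrt{T\log\frac1\delta}\right). \]
Combining all of the above gives
\begin{align*}
\mathcal R_T = \mathcal R_T^a + \mathcal R_T^b
&\le \widetilde O\left(s^{1.5}\sqrt{T\log\frac1\delta} + s^2\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta} + s\right) + \widetilde O\left(s\sqrt d\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta} + s\log\frac1\delta\right) \\
&\le \widetilde O\left((s^2+s\sqrt d)\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta} + s^{1.5}\sqrt T\log\frac1\delta\right),
\end{align*}
as claimed.

Algorithm 4 Weighted OFUL Algorithm (Zhou et al., 2021)
1: Initialize $A_0\leftarrow\lambda I$, $c_0\leftarrow0$, $\hat\theta_0\leftarrow A_0^{-1}c_0$, $\hat\beta_0 = 0$ and $\Theta_0\leftarrow\{\theta \mid \|\theta-\hat\theta_0\|_{A_0}\le\hat\beta_0+\sqrt\lambda B\}$.
2: for $t = 1, 2, \ldots, T$ do
3:   Compute the action for the $t$-th round as $x_t = \operatorname{argmax}_{x\in\mathcal X}\max_{\theta\in\Theta_{t-1}}\langle x,\theta\rangle$.
4:   Observe reward $r_t = \langle x_t,\theta^*\rangle+\eta_t$ and variance information $\sigma_t^2$, set $\bar\sigma_t = \max\{1/\sqrt d,\sigma_t\}$, and set the confidence radius $\hat\beta_t$ as $\hat\beta_t = 8\sqrt{d\ln\left(1+\frac{t}{d\lambda\bar\sigma_{\min,t}^2}\right)\ln\frac{4t^2}\delta}$, where $\bar\sigma_{\min,t}\triangleq\min_{s=1}^t\bar\sigma_s$.
5:   Calculate $A_t\leftarrow A_{t-1}+x_tx_t^\top/\bar\sigma_t^2$, $c_t\leftarrow c_{t-1}+r_tx_t/\bar\sigma_t^2$, $\hat\theta_t\leftarrow A_t^{-1}c_t$ and $\Theta_t\leftarrow\{\theta\mid\|\theta-\hat\theta_t\|_{A_t}\le\hat\beta_t+\sqrt\lambda B\}$.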
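The step that replaces the sum of $\Delta^{-2}$ over the gap thresholds by $O(\Delta_f^{-2})$ is a geometric-series argument: the thresholds halve each phase, so the last term dominates. A small sketch, assuming the thresholds are exactly $2^{-2}, 2^{-3}, \ldots, 2^{-K}$ with $\Delta_f = 2^{-K}$ (the helper names are hypothetical):

```python
def gap_thresholds(final_exponent: int) -> list:
    """Thresholds Delta = 2^{-2}, 2^{-3}, ..., 2^{-final_exponent} (halved every phase)."""
    return [2.0 ** (-k) for k in range(2, final_exponent + 1)]


def inverse_square_sum(final_exponent: int) -> float:
    """Sum of Delta^{-2} over all thresholds, i.e. sum_{k=2}^{K} 4^k."""
    return sum(delta ** -2 for delta in gap_thresholds(final_exponent))


def final_inverse_square(final_exponent: int) -> float:
    """Delta_f^{-2} = 4^K for the last threshold Delta_f = 2^{-K}."""
    return 4.0 ** final_exponent
```

Since $\sum_{k=2}^K 4^k = (4^{K+1}-16)/3$, the sum lies between $\Delta_f^{-2}$ and $\tfrac43\Delta_f^{-2}$; the same reasoning with $\Delta^{-4}$ gives the $\Theta(1)$ factor in the bound on $T$.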

H.2 REGRET OVER-ESTIMATION

We again briefly argue that Proposition 6 holds under our noise model. We present their algorithm in Algorithm 4.

Proposition 31. With probability at least $1-\delta$, Weighted OFUL executed for $T$ steps on $d$ dimensions guarantees
\[ \mathcal R^{\mathcal F}_T \le C\left(\sqrt{dT\log\frac1\delta} + d\sqrt{\sum_{t=1}^T\sigma_t^2\log\frac1\delta}\right), \]
where $C = O(1)$, $T$ is a stopping time finite a.s., and $\sigma_1^2,\sigma_2^2,\ldots,\sigma_T^2$ are the variances of $\eta_1,\eta_2,\ldots,\eta_T$.

Proof Sketch. We mainly follow the original proof by Zhou et al. (2021) and highlight the differences. We first highlight that their Bernstein inequality for vector-valued martingales also holds under our assumptions, as:
\[ \Pr\left\{\left\|\sum_{s=1}^t x_s\eta_s\right\|_{Z_t^{-1}}\le\beta_t,\ \|\theta_t-\theta^*\|_{Z_t}\le\beta_t+\sqrt\lambda\|\theta^*\|_2,\ \forall t\in[n]\right\}\ge1-\delta, \]
where $\beta_t = 8\sigma\sqrt{d\ln\left(1+\frac{tL^2}{d\lambda}\right)\ln\frac{8t^2}\delta}$. The proof, which is presented later, mainly follows the idea of their proof of Theorem 4.1, except that we use Proposition 11. With this theorem, we can conclude that the confidence construction is indeed valid by applying Lemma 32 to the sequence $\{\eta_t/\bar\sigma_t\}_{t\in[T]}$, which gives
\[ \|\hat\theta_t-\theta^*\|_{A_t}\le\hat\beta_t+\sqrt\lambda\|\theta^*\|_2\le\hat\beta_t+\sqrt\lambda,\quad\forall t\in[T], \]
with probability $1-\delta$, where $\hat\beta_t = 8\sqrt{d\ln\left(1+\frac{t}{d\lambda\bar\sigma_{\min,t}^2}\right)\ln\frac{4t^2}\delta}$, as defined in Eq. (27). Therefore, we can conclude their (B.19) from exactly the same argument, namely
\[ \mathcal R^{\mathcal F}_T\le2\sum_{t=1}^T\min\left\{1,\ \bar\sigma_t(\hat\beta_{t-1}+\sqrt\lambda)\|x_t/\bar\sigma_t\|_{A_{t-1}^{-1}}\right\}. \]
Similar to their proof, define $\mathcal I_1 = \{t\in[T]\mid\|x_t/\bar\sigma_t\|_{A_{t-1}^{-1}}\ge1\}$ and $\mathcal I_2 = [T]\setminus\mathcal I_1$, then we have (where $\bar\sigma_{\min}$ is the abbreviation of $\bar\sigma_{\min,T} = \min_{s=1}^T\bar\sigma_s$)
\[ |\mathcal I_1| = \sum_{t\in\mathcal I_1}\min\{1,\|x_t/\bar\sigma_t\|^2_{A_{t-1}^{-1}}\}\le\sum_{t=1}^T\min\{1,\|x_t/\bar\sigma_t\|^2_{A_{t-1}^{-1}}\}\le2d\ln\left(1+\frac{T}{d\lambda\bar\sigma_{\min}^2}\right), \]
where the last step uses Proposition 34 and the fact that $\|x_t/\bar\sigma_t\|_2\le\bar\sigma_{\min}^{-1}$. Therefore, we have the same Eq. (B.21) as theirs, which gives

Proof of Lemma 32.
Their original proof mainly uses the following two auxiliary results: the first one is the well-known Freedman inequality (Freedman, 1975), which is originally for bounded martingale difference sequences, while the second one is Lemma 11 from Abbasi-Yadkori et al. (2011). For the former, from its variant for sub-Gaussian random variables (Proposition 11), we have:

Corollary 33. Suppose that $\{\xi_i\}_{i=1}^n$ is a sequence of zero-mean random variables where $\xi_i\sim\mathrm{subG}(\sigma_i^2)$ for some sequence $\{\sigma_i\}_{i=1}^n$, and write $V_i = \sigma_i^2$. Let $n$ be a stopping time finite a.s. Then for all $x,v>0$,
\[ \Pr\left[\exists\,1\le k\le n:\ \sum_{i=1}^k\xi_i\ge x\ \wedge\ \sum_{i=1}^kV_i\le v^2\right]\le\exp\left(-\frac{x^2}{2v^2}\right). \]
Moreover, for any $\delta\in(0,1)$, with probability $1-\delta$, we have
\[ \sum_{i=1}^n\xi_i\le2\sqrt{\sum_{i=1}^n\sigma_i^2\ln\frac2\delta}. \]

Proof. The first conclusion follows by applying Proposition 11 with $f(\lambda) = \frac12\lambda^2$, $V_i = \sigma_i^2$ and the optimal choice $\lambda = \frac{x}{v^2}$. The second conclusion then follows by taking $v^2 = \sum_{i=1}^n\sigma_i^2$ and $x = 2v\sqrt{\ln\frac2\delta}$.

For their second auxiliary lemma (Abbasi-Yadkori et al., 2011, Lemma 11), one can see that the original lemma indeed holds for sub-Gaussian random variables. Therefore, we still have the following lemma: and let $\mathcal E_t$ be the event that $\|d_s\|_{Z_{s-1}^{-1}}\le\beta_s$ for all $s\le t$. They proved the following lemma, which still applies to our case:

Lemma 35 (Analog of Lemma B.3 from Zhou et al. (2021)). With probability $1-\frac\delta2$, with the definitions of $x_t$ and $\eta_t$ in Lemma 32, the following inequality holds for all $t\ge1$:
\[ \sum_{s=1}^t\frac{2\eta_sx_s^\top Z_{s-1}^{-1}d_{s-1}}{1+w_s^2}\mathbb 1[\mathcal E_{s-1}]\le\frac34\beta_t^2 . \]

Proof. We only need to verify that we can apply our Freedman's inequality to $\ell_s\triangleq\frac{2\eta_sx_s^\top Z_{s-1}^{-1}d_{s-1}}{1+w_s^2}\mathbb 1[\mathcal E_{s-1}]$. It is obvious that $\mathbb E[\ell_s\mid\mathcal F_{s-1}] = 0$. Moreover, from the following inequality (which is their Eq. (B.3))
\[ |\ell_s|\le|\eta_s|\cdot\frac{2\|x_s\|_{Z_{s-1}^{-1}}}{1+w_s^2}\|d_{s-1}\|_{Z_{s-1}^{-1}}\mathbb 1[\mathcal E_{s-1}]\le|\eta_s|\cdot\frac{2w_s}{1+w_s^2}\beta_{s-1}\le|\eta_s|\cdot\min\{1,2w_s\}\beta_{s-1}, \]
and the fact that $\eta_s\mid\mathcal F_{s-1}\sim\mathrm{subG}(\sigma^2)$, we have $\ell_s\mid\mathcal F_{s-1}\sim\mathrm{subG}((\sigma\beta_{s-1}\min\{1,2w_s\})^2)$.
Denote the sub-Gaussian parameter as $\tilde\sigma_s$ for simplicity. We have Taking a union bound over $t$ and making use of the fact that $\sum_{t=1}^\infty t^{-2}<2$ completes the proof.

We also have the following lemma:

Lemma 36 (Analog of Lemma B.4 from Zhou et al. (2021)). Under the same conditions as the previous lemma, with probability $1-\frac\delta2$, we have the following for all $t\ge1$ simultaneously: where the second step used the fact that Then, as we did in the proof of Theorem 13, we apply Proposition 11 to the martingale difference sequence $\{\ell_s\}_{s=1}^t$ with $V_s = \sigma^4(\min\{1,w_s^2\})^2$, $f(\lambda) = 16\lambda^2$ for $\lambda<\frac{1}{4\sigma^2}$ and $f(\lambda) = \infty$ otherwise. Then for all $x,v>0$ and $\lambda\in(0,$ Taking a union bound over all $t$ and again making use of the fact that $\sum_{t=1}^\infty t^{-2}<2$ gives our conclusion.

Therefore, as long as their Lemmas B.3 and B.4 still hold, we can reach exactly the same conclusion from their derivation. One may refer to their proof for the details.

I AUXILIARY LEMMAS

Lemma 37 (Self Bounding Inequality, Efroni et al. (2020, Lemma 38)). Let $0\le x\le a+b\sqrt x$ where $a,b,x\ge0$, then we have $x\le4a+2b^2$.

Proof. As $x-b\sqrt x-a\le0$, solving the quadratic inequality in $\sqrt x$ gives
\[ \sqrt x\le\frac b2+\sqrt{\frac{b^2}4+a}\le\frac b2+\sqrt{\frac{b^2}4}+\sqrt a = b+\sqrt a \]
from the fact that $\sqrt{a+b}\le\sqrt a+\sqrt b$. As $\sqrt x\ge0$, we have $x\le(b+\sqrt a)^2\le2b^2+2a\le2b^2+4a$ due to the relation $(a+b)^2\le2a^2+2b^2$.
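As a quick sanity check of Lemma 37, one can compute the largest $x$ compatible with $x\le a+b\sqrt x$ in closed form and compare it against $4a+2b^2$. This is only an illustrative sketch; the function names are hypothetical.

```python
import math


def largest_feasible_x(a: float, b: float) -> float:
    """Largest x >= 0 with x <= a + b*sqrt(x): square of the positive root of y^2 - b*y - a = 0."""
    y = (b + math.sqrt(b * b + 4.0 * a)) / 2.0
    return y * y


def self_bounding_limit(a: float, b: float) -> float:
    """Right-hand side of Lemma 37."""
    return 4.0 * a + 2.0 * b * b


def lemma_holds(a: float, b: float) -> bool:
    """Check the lemma at the extremal x; every smaller feasible x then satisfies it too."""
    return largest_feasible_x(a, b) <= self_bounding_limit(a, b) + 1e-12
```

Because the constraint set is an interval $[0, x_{\max}]$, checking the extremal point suffices; the slack in the constants reflects the fact that $x_{\max}\le 2a+b^2$ already holds.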



Throughout the paper, we use the notations $\widetilde O(\cdot)$ and $\widetilde\Theta(\cdot)$ to hide $\log T$, $\log d$, $\log s$ (where $s$ is the sparsity parameter, which will be introduced later) and $\log\log\frac1\delta$ factors (where $\delta$ is the failure probability).

Carpentier & Munos (2012) and Lattimore et al. (2015) obtained an $O(s\sqrt T)$ regret bound under different models. The former assumed a component-wise noise model, while the latter assumed a ground truth with $\|\theta^*\|_1\le1$ as well as an action space with $\|x_t\|_\infty\le1$. See Appendix A for more discussion on this.

We also remark that some assumptions on the action set are needed. For example, if every action can only query one coordinate (each action corresponds to one vector of the standard basis), then an $\Omega(d)$ regret lower bound is unavoidable. Hence, in this paper, we only consider the benign case in which the action set is the unit sphere.



still independent of d in deterministic cases. See Appendix B for several future directions.

Supplementary Materials

A More on Related Works
B Future Directions
C Divide-and-Conquer Algorithm for Deterministic Settings
D Concentration Inequalities
  D.1 Sample Mean Upper Bound
  D.2 Sample Variance Upper Bound
E Ridge Linear Regression
F Omitted Proof in Section 3.2 (Analysis of Framework)
  F.1 Proof of Main Theorem
  F.2 Regret-to-Sample-Complexity Conversion
  F.3 Random Projection
  F.4 Single-Phase Regret
G Omitted Proof in Section 4.1 (Analysis of VOFUL2)
  G.1 Proof of Main Theorem
  G.2 Regret Over-Estimation
  G.3 Extension to Unknown Variance Cases
H Omitted Proof in Section 4.2 (Analysis of Weighted OFUL)
  H.1 Proof of Main Theorem
  H.2 Regret Over-Estimation

claimed, while the second step uses $\bar\sigma_t^2 = \max\{\frac1d,\sigma_t^2\}\le\frac1d+\sigma_t^2$.

$w_s^2\}$ as we used in the proof of Lemma 35. Again by Proposition 34, we can conclude that $\sum_{s=1}^t\sigma^4(\min\{1,w_s^2\})^2\le2\sigma^2d\ln\left(1+\frac{tL^2}{d\lambda}\right)$.

An overview of the proposed algorithms/results and comparisons with related works.

With a different feedback model; see Appendix A for more comparison.
d With a different action set and a different assumption on $\theta^*$; see Appendix A for more comparison.
e This bound holds even if $s = 1$ and the action set is fixed to be the unit sphere.

Algorithm 1 Variance-Aware Sparse Linear Bandits (VASLB) Framework Input: Number of dimensions d, linear bandit algorithm F and its regret estimator R F

Lemma 32 (Analog of Theorem 4.1 from Zhou et al. (2021)). Let $\{x_i\}_{i=1}^n$ be a sequence of $d$-dimensional random vectors such that $\|x_t\|_2\le L$. Let $\{\eta_i\}_{i=1}^n$ be a sequence of independent, symmetric and $\{\sigma_i^2\}_{i=1}^n$-sub-Gaussian random variables. Let $r_t = \langle\theta^*,x_t\rangle+\eta_t$ for all $t\in[n]$. Set Z

with probability 1 -δ, where β t = 8 d ln(1 +

Proposition 34 (Abbasi-Yadkori et al. (2011, Lemma 11)). Let $\{x_t\}_{t=1}^T$ be a sequence in $\mathbb R^d$ and define $V_t = \lambda I+\sum_{s=1}^tx_sx_s^\top$ for some $\lambda>0$. If $\|x_t\|_2\le L$ for all $t\in[T]$, then

where the first inequality is due to the non-decreasing property of {β s } and the last one is due to Proposition 34. Therefore, from our Freedman's inequality (Corollary 33), we can conclude that with probability 1 -δ/(4t 2 ),

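As $\eta_s\mid\mathcal F_{s-1}\sim\mathrm{subG}(\sigma^2)$, $\eta_s^2\mid\mathcal F_{s-1}$ is a sub-exponential random variable (Proposition 12) such that $\mathbb E\left[\exp(\lambda(\eta_s^2-\mathbb E[\eta_s^2]))\mid\mathcal F_{s-1}\right]\le\exp(16\lambda^2\sigma^4)$, $\forall|\lambda|\le$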

1 σ 2 ), we have

ACKNOWLEDGMENTS

We gratefully acknowledge Csaba Szepesvári for sharing the manuscript (Antos & Szepesvári, 2009) with us, which shows the $\Omega(\sqrt{dT})$ lower bound for sparse linear bandits when $s = 1$ and the action sets are unit balls. The authors thank Xiequan Fan from Tianjin University for the constructive discussion about the application of Propositions 8 and 11. Finally, we thank the anonymous reviewers for their detailed reviews, from which we benefited greatly. This work was supported in part by NSF CCF 2212261, NSF IIS 2143493, NSF DMS-2134106, NSF CCF 2019844 and NSF IIS 2110170.

