BANDIT LEARNING WITH GENERAL FUNCTION CLASSES: HETEROSCEDASTIC NOISE AND VARIANCE-DEPENDENT REGRET BOUNDS

Abstract

We consider learning a stochastic bandit model, where the reward function belongs to a general class of uniformly bounded functions, and the additive noise can be heteroscedastic. Our model captures contextual linear bandits and generalized linear bandits as special cases. While previous works (Kirschner & Krause, 2018; Zhou et al., 2021) based on weighted ridge regression can deal with linear bandits with heteroscedastic noise, they are not directly applicable to our general model due to the curse of nonlinearity. In order to tackle this problem, we propose a multi-level learning framework for the general bandit model. The core idea of our framework is to partition the observed data into different levels according to the variance of their respective reward and perform online learning at each level collaboratively. Under our framework, we first design an algorithm that constructs the variance-aware confidence set based on empirical risk minimization and prove a variance-dependent regret bound. For generalized linear bandits, we further propose an algorithm based on follow-the-regularized-leader (FTRL) subroutine and online-to-confidence-set conversion, which can achieve a tighter variance-dependent regret under certain conditions.

1. INTRODUCTION

Over the past decade, stochastic bandit algorithms have found a wide variety of applications in online advertising, website optimization, recommendation systems and many other tasks (Li et al., 2010; McInerney et al., 2018). In the model of stochastic bandits, at each round, an agent selects an action and observes a noisy evaluation of the reward function for the chosen action, aiming to maximize the sum of the received rewards. A general reward function governs the reward of each action from the eligible action set. A common assumption used in stochastic bandit problems is that the observation noise is conditionally independent and satisfies a uniform tail bound. In real-world applications, however, the variance of the observation noise is likely to depend on the evaluation point (chosen action) (Kirschner & Krause, 2018). Moreover, due to the dynamic environment in practice, the variance of each action may also differ from round to round. This motivates the study of bandit problems with heteroscedastic noise. For example, Kirschner & Krause (2018) introduced the heteroscedastic noise setting, where the noise distribution is allowed to depend on the evaluation point. They proposed weighted least squares to estimate the unknown reward function more accurately in the setting where the underlying reward function is linear or lies in a separable Hilbert space (Section 5, Kirschner & Krause 2018). In this paper, we consider a general setting, where the unknown reward function belongs to a known general function class $\mathcal{F}$ with bounded eluder dimension (Russo & Van Roy, 2013). This captures multi-armed bandits, linear contextual bandits (Abbasi-Yadkori et al., 2011) and generalized linear bandits (Filippi et al., 2010) simultaneously. Since weighted least squares depends heavily on the linearity of the function class, we propose a multi-level learning framework for our general setting.
The underlying idea of the framework is to partition the observed data into different levels according to the variance of the noise. The agent then estimates the reward function at each level independently and exploits all the levels when selecting an action at each round. While previous work by Kirschner & Krause (2018) considered sub-Gaussian noise with nonuniform variance proxies, we only assume nonuniform variances of the noise (Zhou et al., 2021; Zhang et al., 2021), which brings a new challenge: exploiting the variance information of the noise to obtain tighter variance-aware confidence sets. Under our multi-level learning framework, we first design an algorithm based on empirical risk minimization and the Optimism-in-the-Face-of-Uncertainty (OFU) principle, and prove a variance-dependent regret bound. For a special class of bandits, namely generalized linear bandits with heteroscedastic noise, we further propose an algorithm using follow-the-regularized-leader (FTRL) as an online regression subroutine and adopting the technique of online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012; Jun et al., 2017). This algorithm achieves a provably tighter regret bound when the range of the reward function is relatively wide compared to the magnitude of the noise. Our main contributions are summarized as follows:
• We develop a new framework called multi-level regression, which can be applied to heteroscedastic bandits even when the reward function class does not lie in a separable Hilbert space.
• Under our framework, we design tighter variance-aware upper confidence bounds for bandits with general reward functions, and propose a bandit learning algorithm based on empirical risk minimization. We show that our algorithm enjoys variance-dependent regret upper bounds, which can be regarded as a strict extension of previous algorithms that obtain variance-dependent regret bounds for simpler bandit models (Zhou et al., 2021; Zhang et al., 2021).
• For generalized linear bandits (Filippi et al., 2010; Jun et al., 2017), which are a special case of our model class, we further propose an algorithm based on online-to-confidence-set conversion. We first prove a variance-dependent regret bound for follow-the-regularized-leader (FTRL) on the online regression problem derived from the generalized linear function class, and then convert the online learning regret bound into a bandit learning confidence set. We show that our algorithm can achieve a tighter regret bound for generalized linear bandits.
• As a by-product, our regret bound for FTRL improves the state-of-the-art regret result $O(d^2 R^2)$ for stochastic online linear regression (Ouhamma et al., 2021) to $O(d \sigma_{\max}^2)$ (omitting the terms without dependence on $d$), where $d$ is the dimension of contexts, $R$ is the upper bound of the sub-Gaussian norm of the noise at each step, and $\sigma_{\max}$ is the upper bound of the variances of the noise.

[Table 1 (comparison of algorithms and regret bounds) omitted.] Refer to Section 3 for the definitions of $\dim_E$, $\sigma_t$, $J$ and $R$, and to Section 6 for the definitions of $\kappa$, $K$, $A$, $B$. We write a general function class with eluder dimension $\dim_E$ as 'General' and the generalized linear function class as 'G-Lin' for short. Oracle efficiency refers to computational efficiency given a regression oracle (i.e., empirical risk minimization) for the involved function class and an optimization oracle which maximizes the reward function $f(x)$ for a fixed $x$ over some constraint set of $f$.

Notation. We use lower case letters to denote scalars, and lower and upper case bold face letters to denote vectors and matrices, respectively. We denote by $[n]$ the set $\{1, \ldots, n\}$. For a vector $x \in \mathbb{R}^d$ and a positive semi-definite matrix $\Sigma \in \mathbb{R}^{d \times d}$, we denote by $\|x\|_2$ the vector's Euclidean norm and define $\|x\|_\Sigma = \sqrt{x^\top \Sigma x}$. For two positive sequences $\{a_n\}$ and $\{b_n\}$ with $n = 1, 2, \ldots$, we write $a_n = O(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \le C b_n$ holds for all $n \ge 1$, and write $a_n = \Omega(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \ge C b_n$ holds for all $n \ge 1$. Let $N(\mathcal{F}, \alpha, \|\cdot\|_\infty)$ denote the $\alpha$-covering number of $\mathcal{F}$ in the sup-norm $\|\cdot\|_\infty$. If there is no ambiguity, we may write $N(\mathcal{F}, \alpha, \|\cdot\|_\infty)$ as $N_\alpha$ for short. We use $\tilde O(\cdot)$ to further hide polylogarithmic factors other than log-covering numbers.
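The $\|\cdot\|_\Sigma$ norm defined above appears throughout the confidence-set constructions; as a quick illustration, it can be computed directly (a minimal NumPy sketch; the function name is our own):

```python
import numpy as np

def sigma_norm(x: np.ndarray, Sigma: np.ndarray) -> float:
    """Compute ||x||_Sigma = sqrt(x^T Sigma x) for a positive semi-definite Sigma."""
    return float(np.sqrt(x @ Sigma @ x))
```

With $\Sigma = I$, this reduces to the usual Euclidean norm.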

2. RELATED WORK

Learning with heteroscedastic noise. Heteroscedastic noise has been studied in many different settings, such as active learning (Antos et al., 2010), regression (Aitken, 1936; Goldberg et al., 1997; Chaudhuri et al., 2017; Kersting et al., 2007), principal component analysis (Hong et al., 2016; 2018) and Bayesian optimization (Assael et al., 2014). However, only a few works have considered heteroscedastic noise in bandit settings. Cowan et al. (2015) considered a variant of multi-armed bandits where the noise at each round is a Gaussian random variable with unknown variance. Kirschner & Krause (2018) were the first to formally introduce the concept of stochastic bandits with heteroscedastic noise. In their model, the variance of the noise at each round $t$ is a function of the evaluation point $x_t$, $\rho_t = \rho(x_t)$, and they further assume that the noise is $\rho_t$-sub-Gaussian; $\rho_t$ can either be observed at time $t$ or be estimated from the observations. Zhou et al. (2021) considered linear bandits with heteroscedastic noise and generalized the setting of Kirschner & Krause (2018) in the sense that they no longer assume the noise to be $\rho_t$-sub-Gaussian, but only require the variance of the noise to be upper bounded by $\rho_t^2$, where the variances are arbitrarily decided by the environment and are not necessarily a function of the evaluation point. In the same setting as Zhou et al. (2021), Zhang et al. (2021) further considered a strictly harder setting where the noise has unknown variance. They proposed an algorithm which can deal with unknown variance through a computationally inefficient clipping technique. Our work adopts the noise setting proposed by Zhou et al. (2021) and further generalizes it to bandits with general function classes. We leave extending our results to the harder setting of Zhang et al. (2021) as future work. Bandits with known function classes.
Moving beyond multi-armed bandits, there have been significant theoretical advances on stochastic bandits with function approximation. Among them, there is a huge body of literature on linear bandit problems, where the reward function is assumed to be a linear function of the feature vectors attached to the actions (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Li et al., 2019; 2021b). Generalizing the restrictive linear rewards, there has also been a flurry of studies on generalized linear bandit problems (Filippi et al., 2010; Jun et al., 2017; Li et al., 2017; Kveton et al., 2020). As for stochastic bandits with general function classes, the seminal work by Russo & Van Roy (2013) introduced the notion of eluder dimension to measure the complexity of the function class and provided a general UCB-like algorithm that works for any given class of reward functions with bounded eluder dimension. They further proved a regret upper bound of order $\tilde O(\sqrt{\dim_E \log N \cdot T})$ for their proposed algorithm, where $\dim_E$ is the eluder dimension and $\log N$ stands for the log-covering number of the function class. Linear bandits and generalized linear bandit problems can be seen as special cases of their proposed general model. Online-to-confidence-set conversion. Abbasi-Yadkori et al. (2012) may be the first to introduce the technique that takes an online learning subroutine and turns its output into a confidence set at each round. While Abbasi-Yadkori et al. (2012) applied this technique to linear bandits, Jun et al. (2017) generalized the approach to Generalized Linear Online-to-confidence-set Conversion (GLOC) and applied it to generalized linear bandits. Online regression for linear functions.
Online linear regression has long been studied in the setting where the response variables (or labels) are bounded and chosen by an adversary (Bartlett et al., 2015; Cesa-Bianchi et al., 1996; Kivinen & Warmuth, 1997; Littlestone et al., 1991; Malek & Bartlett, 2018), to mention a few. A recent work (Ouhamma et al., 2021) considers the stochastic setting where the response variables are unbounded and are revealed by the environment with additional random noise on the true labels. Ouhamma et al. (2021) discussed the limitations of online learning algorithms in the adversarial setting and further advocated the need for complementary analyses of existing algorithms in the stochastic unbounded setting.

3.1. PRELIMINARIES

General function class. Following Russo & Van Roy (2013), we introduce the notions of $\epsilon$-dependence and eluder dimension, which are used to measure the complexity of a general function class $\mathcal{F}$.
Definition 3.1 ($\epsilon$-dependence, Russo & Van Roy 2013). An action $a \in \mathcal{A}$ is $\epsilon$-dependent on actions $\{a_1, a_2, \ldots, a_n\} \subseteq \mathcal{A}$ with respect to $\mathcal{F}$ if any pair of functions $f, \tilde f \in \mathcal{F}$ satisfying $\sum_{i=1}^n (f(a_i) - \tilde f(a_i))^2 \le \epsilon^2$ also satisfies $|f(a) - \tilde f(a)| \le \epsilon$. Further, $a$ is $\epsilon$-independent of $\{a_1, \ldots, a_n\}$ with respect to $\mathcal{F}$ if $a$ is not $\epsilon$-dependent on $\{a_1, \ldots, a_n\}$.
Definition 3.2 (eluder dimension, Russo & Van Roy 2013). The $\epsilon$-eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is the length $d$ of the longest sequence of elements in $\mathcal{A}$ such that, for some $\epsilon' \ge \epsilon$, every element is $\epsilon'$-independent of its predecessors.
In this work we focus on general function classes $\mathcal{F}$ with bounded eluder dimension, and we consider the generalized linear function class as a special case. We would like to point out that the family of function classes with small eluder dimension is strictly larger than linear and generalized linear function classes (Li et al., 2021a), while neural networks with ReLU activation do not have a small eluder dimension (their eluder dimension has an exponential dependence on the input dimension) (Dong et al., 2021). We leave it as future work to consider even more general function classes.
Definition 3.3 (width). Let $w_{\mathcal{F}}(a) = \sup_{f, \tilde f \in \mathcal{F}} (f(a) - \tilde f(a))$. For an action set $\tilde{\mathcal{A}} \subseteq \mathcal{A}$, we use $w_{\mathcal{F}}(\tilde{\mathcal{A}})$ to denote $\sup_{a \in \tilde{\mathcal{A}}} w_{\mathcal{F}}(a)$.
To deal with infinite or continuous action sets, we make the assumption that the reward function class is known in advance by the agent. Notice that in the finite multi-armed bandit case, we can choose $\mathcal{F}$ to be the set that includes all eligible functions.
Bandit models. We consider a heteroscedastic variant of the classic stochastic bandit problem with general function classes. At each round $t \in [T]$ ($T \in \mathbb{N}$), the agent observes a decision set $\mathcal{D}_t \subseteq \mathcal{A}$, which is chosen by the environment.
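For a finite function class, the width in Definition 3.3 can be evaluated directly. The sketch below is illustrative only (names are our own) and covers just the finite case:

```python
def width(F, a):
    """Definition 3.3: w_F(a) = sup_{f, f' in F} (f(a) - f'(a)), for a finite class F."""
    values = [f(a) for f in F]
    return max(values) - min(values)

def width_of_set(F, actions):
    """w_F(A~) = sup_{a in A~} w_F(a), for a finite action set A~."""
    return max(width(F, a) for a in actions)
```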
The agent then selects an action $a_t \in \mathcal{D}_t$ and observes a reward $r_t$ together with a corresponding variance upper bound $\sigma_t^2$. We assume that $r_t = f^*(a_t) + \epsilon_t$, where $f^*: \mathcal{A} \to \mathbb{R}$ is an underlying real-valued reward function which is unknown to the learner and $\epsilon_t$ is a random noise. We make the following assumption on $\epsilon_t$.
Assumption 3.5. For each $t$, $\epsilon_t$ satisfies that $\epsilon_t \mid a_{1:t}, \epsilon_{1:t-1}$ is an $R$-sub-Gaussian random variable ($R > \sigma_t$) and
$\mathbb{E}[\epsilon_t \mid a_{1:t}, \epsilon_{1:t-1}] = 0, \quad \mathbb{E}[\epsilon_t^2 \mid a_{1:t}, \epsilon_{1:t-1}] \le \sigma_t^2 := \sigma_t^2(a_{1:t}, r_{1:t-1}),$
where $\sigma_t$ can be either a constant or a random variable depending on $a_{1:t}$ and $r_{1:t-1}$.
Remark 3.6. $\sigma_t$ can be seen either as information given by the environment, or as an estimator of the noise variance at the $t$-th round based on all past observations, as discussed in the information directed sampling bandit (Kirschner & Krause, 2018). For instance, consider a bandit problem with a two-point action distribution, where the variance of an action can be estimated through the estimation of the mean of the action. A similar estimation procedure has been studied in Lattimore et al. (2015). The details are in Appendix C. We can further consider the MDP setting. With a confidence set $\mathcal{P}$ that includes the true transition dynamics, the conditional variance of a value function $V$ at state $s$ and action $a$ can be estimated by $\sup_{p \in \mathcal{P}} \big[\sum_{s' \in \mathcal{S}} p(s' \mid s, a) V^2(s') - \big(\sum_{s' \in \mathcal{S}} p(s' \mid s, a) V(s')\big)^2\big]$. So the variances of value functions can be efficiently estimated in the MDP setting.
For simplicity, let $J = \sum_{t=1}^T \sigma_t^2$. This assumption on $\epsilon_t$ is a slightly generalized version of that in Zhou et al. (2021), in the sense that the noise is not necessarily bounded by $R$. The goal of the agent is to minimize the following cumulative regret:
$\mathrm{Regret}(T) := \sum_{t=1}^T [f^*(a_t^*) - f^*(a_t)],$
where the optimal action $a_t^*$ at round $t \in [T]$ is defined as $a_t^* := \mathrm{argmax}_{a \in \mathcal{D}_t} f^*(a)$.
Algorithm 1 ML$^2$ with OFU principle
1: Input: $T$, $\mathcal{A}$, $\mathcal{F}$, $R$, $\sigma > 0$.
2: Initialize: Set $L \leftarrow \lceil \log_2 R/\sigma \rceil$ and $C_{1,l} \leftarrow \mathcal{F}$, $\Psi_{1,l} \leftarrow \emptyset$ for all $l \in [L]$.
3: for $t = 1, \ldots, T$ do
4:   Observe $\mathcal{D}_t$.
5:   Choose action $a_t = \mathrm{argmax}_{a \in \mathcal{D}_t} \min_{l \in [L]} \max_{f \in C_{t,l}} f(a)$.
6:   Observe stochastic reward $r_t$ and $\sigma_t^2$.
7:   Find $l_t$ such that $2^{l_t+1}\sigma \ge \max(\sigma, \sigma_t) \ge 2^{l_t}\sigma$.
8:   Update $\Psi_{t+1,l_t} \leftarrow \Psi_{t,l_t} \cup \{t\}$ and $\Psi_{t+1,l} \leftarrow \Psi_{t,l}$ for all $l \in [L] \setminus \{l_t\}$.
9:   Update $C_{t+1,l}$ according to $\Psi_{t+1,l}$ through a regression subroutine (e.g., Algorithm 2).
10: end for
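The level assignment (line 7) and the OFU action selection over all levels (line 5) of Algorithm 1 can be sketched as follows. This is a minimal Python illustration under our own naming; representing each confidence set as a finite list of candidate functions is an assumption made only for this sketch:

```python
import math

def assign_level(sigma_t: float, sigma_min: float) -> int:
    """Line 7 of Algorithm 1: return l_t such that
    2**(l_t+1) * sigma_min >= max(sigma_min, sigma_t) >= 2**l_t * sigma_min."""
    return int(math.floor(math.log2(max(sigma_min, sigma_t) / sigma_min)))

def select_action(decision_set, confidence_sets):
    """Line 5 of Algorithm 1: a_t = argmax_a min_l max_{f in C_{t,l}} f(a).
    Each confidence set is a finite list of candidate reward functions here."""
    def score(a):
        return min(max(f(a) for f in level) for level in confidence_sets)
    return max(decision_set, key=score)
```

Note the min-over-levels in `score`: the agent is optimistic within each level but defers to the most pessimistic level, which is what makes all $L$ confidence sets bind simultaneously.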

3.2. MULTI-LEVEL LEARNING FRAMEWORK

Existing approach. To tackle the heteroscedastic bandit problem in the case where $\mathcal{F}$ is the linear function class (i.e., $f(a) = \langle \theta^*, a \rangle$ for some $\theta^* \in \mathbb{R}^d$), a weighted linear regression framework (Kirschner & Krause, 2018; Zhou et al., 2021) has been proposed. Generally speaking, at each round $t \in [T]$, weighted linear regression constructs a confidence set $C_t$ based on empirical risk minimization (ERM) over all previously observed actions $a_s$ and rewards $r_s$ as follows:
$\hat\theta_t \leftarrow \mathrm{argmin}_{\theta \in \mathbb{R}^d} \lambda \|\theta\|_2^2 + \sum_{s \in [t]} w_s (\langle \theta, a_s \rangle - r_s)^2, \quad C_t \leftarrow \Big\{\theta \in \mathbb{R}^d : \sum_{s=1}^t w_s \big(\langle \theta, a_s \rangle - \langle \hat\theta_t, a_s \rangle\big)^2 \le \beta_t \Big\},$
where $w_s$ is the weight, and $\beta_t$, $\lambda$ are parameters to be specified. $w_s$ is selected on the order of the inverse of the variance $\sigma_s^2$ at round $s$, so that the variance of the rescaled reward $\sqrt{w_s}\, r_s$ is upper bounded by 1. Therefore, after the weighting step, one can treat the heteroscedastic bandit problem as a homoscedastic bandit problem and apply existing theoretical results to it. To deal with the general function case, a direct attempt is to replace the $\langle \theta, a \rangle$ appearing in the above construction rules with $f(a)$. However, such an approach requires that $\mathcal{F}$ be closed under linear mappings, which does not hold for a general function class $\mathcal{F}$. Multi-level learning framework (ML$^2$). To deal with the nonlinearity issue, we propose a novel framework ML$^2$ in Algorithm 1. At the core of our design is the idea of partitioning the observed data into several levels and 'packing' data with similar variance upper bounds into the same level, as shown in lines 7-8 of Algorithm 1. Note that we use a small real number $\sigma$ to ensure that the number of levels is bounded. Specifically, for any two data points belonging to the same level with variances larger than $\sigma$, the variance of one is at most twice that of the other. Next, in line 9, our framework calls a subroutine to estimate $f^*$ according to the data points in $\Psi_{t+1,l}$.
Since the variances of the data within the same level are nearly the same, we can run algorithms designed for the homoscedastic bandit problem on the data of each level. In particular, we use the empirical risk minimization (ERM) algorithm described in Algorithm 2 in Sections 4 and 5. In Section 6, we show the power of using Algorithm 4 as the regression subroutine. Then, in line 5, the agent makes use of the $L$ confidence sets simultaneously to select an action based on the optimism-in-the-face-of-uncertainty (OFU) principle over all $L$ levels. More specifically, the algorithm chooses the action optimistically according to each confidence set, but is inclined to select the confidence set with the most pessimistic evaluation. In the following sections, we consider several different settings to show the power of ML$^2$.
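For comparison with the existing weighted-regression approach above, the weighted ridge estimate has a simple closed form. The NumPy sketch below uses the natural choice $w_s = 1/\sigma_s^2$; the function name and that exact weighting are illustrative assumptions, not the precise prescriptions of the cited works:

```python
import numpy as np

def weighted_ridge(actions, rewards, variances, lam=1.0):
    """Weighted ridge estimate:
    argmin_theta  lam * ||theta||^2 + sum_s w_s * (<theta, a_s> - r_s)^2,
    with w_s = 1 / sigma_s^2 so each rescaled reward sqrt(w_s)*r_s has variance <= 1."""
    A = np.asarray(actions, dtype=float)          # t x d action matrix
    r = np.asarray(rewards, dtype=float)          # t rewards
    w = 1.0 / np.asarray(variances, dtype=float)  # t inverse-variance weights
    d = A.shape[1]
    G = lam * np.eye(d) + (A * w[:, None]).T @ A  # weighted Gram matrix
    return np.linalg.solve(G, (A * w[:, None]).T @ r)
```

With noiseless linear rewards and a tiny $\lambda$, the estimate recovers $\theta^*$ regardless of the weights; the weights only matter for how noise is averaged out.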

4. WARMUP: NOISE WITH ADDITIONAL SUB-GAUSSIAN ASSUMPTION

We first consider a simplified variant of our problem, where each noise is also sub-Gaussian.
Assumption 4.1 (Sub-Gaussianity of noise). $\epsilon_t$ is conditionally $\sigma_t$-sub-Gaussian given $a_{1:t}, \epsilon_{1:t-1}$.
Such a sub-Gaussian assumption on the noise has been considered by Kirschner & Krause (2018). Next we show the regret upper bound for ML$^2$ with ERM. For simplicity, in the following results, let $\dim_E$ denote $\dim_E(\mathcal{F}, 1/T^2)$.
Theorem 4.2 (Gap-independent regret bound for bandits with heteroscedastic sub-Gaussian noise). Suppose Assumptions 3.4 and 4.1 hold and $|f^*(a)| \le C$ for all $a \in \mathcal{A}$. For all $t \in [T]$, $l \in [L]$ and $\delta \in (0,1)$, $\alpha > 0$, $\sigma > 0$, if we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set
Algorithm 2 Empirical risk minimization (ERM) for partitioned data
1: Input: Level $l$, time $t$ and set of data points $\Psi_{t+1,l}$.
2: Compute $\hat f_{t+1,l} \leftarrow \mathrm{argmin}_{f \in \mathcal{F}} \sum_{s \in \Psi_{t+1,l}} (f(a_s) - r_s)^2$.
3: Return $C_{t+1,l} \leftarrow \big\{ f \in \mathcal{F} : \sum_{s \in \Psi_{t+1,l}} \big(f(a_s) - \hat f_{t+1,l}(a_s)\big)^2 \le \beta_{t+1,l} \big\}$
$\beta_{t,l}$ as the square root of
$8 (2^{l+1} \sigma)^2 \log(2 N_\alpha L / \delta) + 4 t \alpha \big( C + (2^{l+1}\sigma) \sqrt{\log(4t(t+1)L/\delta)} \big),$
where $N_\alpha = N(\mathcal{F}, \alpha, \|\cdot\|_\infty)$ and $L = \lceil \log_2 R/\sigma \rceil$ (recall the definition of $L$ in Algorithm 1), then with probability at least $1 - \delta$, the regret for the first $T$ rounds is bounded as follows:
$\mathrm{Regret}(T) \le L + 2C \dim_E L + 8 \sqrt{2 L \dim_E (J + \sigma^2 T) \log(2 N_\alpha L/\delta)} + 4 \sqrt{L \dim_E \alpha} \big( C + 2R \sqrt{\log(4T(T+1)L/\delta)} \big) \sqrt{T}.$
Corollary 4.3. Let the same conditions as in Theorem 4.2 hold. Set $\alpha = T^{-2}$ and $\sigma = \dim_E^{-1} \big(\log(2 N_\alpha L/\delta) \sqrt{T}\big)^{-1}$. Then with probability at least $1 - \delta$, when $T$ is large enough, the regret for the first $T$ rounds is bounded as $\mathrm{Regret}(T) = \tilde O\big(\sqrt{\dim_E \log N(\mathcal{F}, T^{-2}, \|\cdot\|_\infty) \cdot J}\big)$.
Remark 4.4. Our result is strictly tighter than the $\tilde O\big(R \sqrt{\dim_E \log N(\mathcal{F}, T^{-2}, \|\cdot\|_\infty) \cdot T}\big)$ regret achieved by Russo & Van Roy (2013), since $J = \sum_{t=1}^T \sigma_t^2 \le R^2 T$. In the worst case, when $\sigma_1 = \cdots = \sigma_T = R$, our result degrades to theirs. Our improvement in regret is due to the utilization of variance information.
When the variance information is provided or can be estimated, our algorithm achieves better regret for bandits with the general function classes studied in this paper, while existing algorithms cannot.
Remark 4.5. When restricted to linear contextual bandits of dimension $d$, since $\log N(\mathcal{F}, T^{-2}, \|\cdot\|_\infty) = \tilde O(d)$ and $\dim_E = \tilde O(d)$ (Russo & Van Roy, 2013), our result can be written as $\tilde O(d\sqrt{J})$, which matches the result of using weighted linear ridge regression for heteroscedastic linear bandits under our assumptions on the noise (Kirschner & Krause, 2018; Zhou et al., 2021).
We also provide a gap-dependent regret bound for the general function class setting in Appendix D, generalizing the previous gap-dependent regret bound for linear bandits (Abbasi-Yadkori et al., 2011).
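For a finite function class, the ERM subroutine of Algorithm 2 can be sketched directly. This is illustrative only: realistic function classes require a regression oracle rather than the enumeration used here, and the variable names are our own:

```python
def erm_confidence_set(F, data, beta_sq):
    """Algorithm 2 for a finite class F: fit by ERM on the level's data, then keep
    every f whose squared-error distance to the empirical minimizer is <= beta_sq."""
    def sq_loss(f):
        return sum((f(a) - r) ** 2 for a, r in data)
    f_hat = min(F, key=sq_loss)  # line 2: empirical risk minimizer
    def dist(f):
        return sum((f(a) - f_hat(a)) ** 2 for a, _ in data)
    return [f for f in F if dist(f) <= beta_sq]  # line 3: confidence set
```

Shrinking `beta_sq` (the squared confidence radius $\beta_{t+1,l}$) tightens the returned set around the ERM fit.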

5. GENERAL RESULTS FOR BANDITS WITH HETEROSCEDASTIC NOISE

In this section, we consider the original setting introduced in Section 3, without Assumption 4.1. In the following subsections, we show that Algorithm 2 still works with a refined value of $\beta$.

5.1. VARIANCE-AWARE CONFIDENCE SET

In this general setting, directly applying the confidence set used in previous work (Russo & Van Roy, 2013) gives no improvement, since such confidence sets $C_{t,l}$ do not exploit the variance information. We show in the following theorem that our newly designed $C_{t,l}$, with a new confidence radius $\beta$, still ensures that the confidence set is large enough to contain $f^*$ with high probability, while exploiting the variance information at the same time.
Theorem 5.1 (Variance-dependent confidence sets). Suppose that $|f^*(a)| \le C$ for all $a \in \mathcal{A}$. For any $\alpha > 0$ and $\delta \in (0, 1/2)$, if we set $\beta_{t,l}$ as the square root of
$12 C \alpha t + 4 \alpha \bar R t + 8/3 \cdot C \bar R \log(2 N_\alpha t^2/\delta) + 16 (2^{l+1} \sigma)^2 \log(2 N_\alpha t^2/\delta),$
where $\bar R = R \sqrt{2 \log(4 t^2/\delta)}$ and $N_\alpha = N(\mathcal{F}, \alpha, \|\cdot\|_\infty)$, then $f^* \in C_{t,l}$ with probability at least $1 - 2\delta$ for any fixed $t, l$.
Remark 5.2. With a small $\alpha$, we have $\beta_{t,l}^2 = \tilde O\big(2^{2l} \sigma^2 \log N_\alpha + C \bar R \log N_\alpha\big)$. Compared with the corresponding previous result $\tilde O(R^2 \log N_\alpha)$ (Russo & Van Roy, 2013; Ayoub et al., 2020), our confidence set is tighter when $C$ is relatively small compared to $R$.

5.2. REGRET UPPER BOUNDS FOR ML 2 WITH ERM

We derive our general results with the variance-aware confidence sets described in the last subsection. In this part, we write $\dim_E(\mathcal{F}, T^{-1})$ as $\dim_E$ for short.
Theorem 5.3 (Gap-independent regret bound for bandits with heteroscedastic noise). Suppose Assumption 3.4 holds and $|f^*(a)| \le 1$ for all $a \in \mathcal{A}$. For all $t \in [T]$, $l \in [L]$ and $\delta \in (0,1)$, $\alpha > 0$, $\sigma > 0$, if we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set $\beta_{t,l}$ as the square root of
$12 \alpha t + 4 \alpha \bar R t + 8/3 \cdot \bar R \log(2 N_\alpha t^2 L/\delta) + 16 (2^{l+1} \sigma)^2 \log(2 N_\alpha t^2 L/\delta),$
where $L = \lceil \log_2 R/\sigma \rceil$, $N_\alpha = N(\mathcal{F}, \alpha, \|\cdot\|_\infty)$ and $\bar R = R \sqrt{2 \log(4 t^2 L/\delta)}$ (with a slight abuse of notation), then with probability at least $1 - 2\delta$, the regret for the first $T$ rounds is bounded as follows:
$\mathrm{Regret}(T) \le \sqrt{L} \big( 2 \sqrt{\dim_E T} + 1 \big) + 4 \sqrt{L \dim_E (\log T + 1) \alpha (3 + \bar R) T} + 2 \sqrt{8/3 \cdot L \dim_E (\log T + 1) \bar R \log(2 N_\alpha T^2 L/\delta) T} + 16 \sqrt{L \dim_E (\log T + 1) \log(2 N_\alpha T^2 L/\delta) (J + T \sigma^2)}.$
Corollary 5.4. Assume $R = \Omega(1)$. Let the same conditions as in Theorem 5.3 hold. Set $\alpha = T^{-2}$, $\sigma = 1$. Then with probability at least $1 - \delta$, when $T$ is large enough, the regret for the first $T$ rounds is bounded as $\mathrm{Regret}(T) = \tilde O\big(\sqrt{\dim_E \log N_\alpha \cdot J} + \sqrt{\bar R \dim_E \log N_\alpha \cdot T}\big)$.
Remark 5.5. Compared with the result shown in Corollary 4.3, the additional term of order $\tilde O\big(\sqrt{\dim_E \log N_\alpha \cdot \bar R T}\big)$ is due to the larger confidence set caused by the absence of Assumption 4.1.
Remark 5.6. When restricted to heteroscedastic linear contextual bandits of dimension $d$, our regret bound can be written as $\tilde O(d\sqrt{J} + \sqrt{\bar R}\, d \sqrt{T})$. With a slightly more restrictive assumption on the noise, Zhou et al. (2021) achieved a result of order $\tilde O(d\sqrt{J} + R\sqrt{dT})$. Our result is appealing when the sub-Gaussian parameter $R$ of the noise is much larger than 1 (or, equivalently, than the range of the reward function). When $R$ is small, our result becomes sub-optimal due to the nature of our variance-aware confidence set. We also provide a gap-dependent regret bound in Appendix D.

6. TIGHTER BOUNDS FOR GENERALIZED LINEAR BANDITS

Our general result shown in Theorem 5.3 has an additional term of order $\tilde O\big(\sqrt{\dim_E \log N_\alpha \cdot \bar R T}\big)$, which makes the result sub-optimal when $R$ is close to the range of the reward function. In this section, we consider a special case: generalized linear bandits with heteroscedastic noise. We show how to get rid of this term in the regret upper bound and achieve a better result when $R$ is relatively small or close to the bound of the reward function.

6.1. GENERALIZED LINEAR BANDITS

Following Filippi et al. (2010); Jun et al. (2017), we consider the generalized linear function class defined as follows.
Assumption 6.1 (Generalized linear function class). The action set $\mathcal{A}$ and $\Theta$ in Assumption 3.4 are subsets of $\mathbb{R}^d$. There exists a known link function $h$ such that $\forall a \in \mathcal{A}$ and $f_\theta \in \mathcal{F}$, $f_\theta(a) = h(\theta^\top a)$. Let $f^* = f_{\theta^*}$. Assume that $\|\theta^*\|_2 \le B$ and $\sup_{a \in \mathcal{A}} \|a\|_2 \le A$.
To make the problem tractable, we need the following assumption on $h$.
Assumption 6.2 (Assumption 1, Jun et al. 2017). $h$ is $K$-Lipschitz on $[-A \cdot B, A \cdot B]$ and continuously differentiable on $(-A \cdot B, A \cdot B)$. Furthermore, $\inf_{z \in (-A \cdot B, A \cdot B)} h'(z) = \kappa$ for some $\kappa > 0$.
Next we present the follow-the-regularized-leader (FTRL) framework (Shalev-Shwartz & Singer, 2007; Xiao, 2010; Hazan, 2019) in Algorithm 3, which is the key component of our final algorithm. Note that when dealing with the bandit setting, we maintain an independent process executing Algorithm 3 for each variance level, instead of feeding all the data points into a single FTRL online learner. Here we number the data points with $t = 1, 2, \ldots$ for simplicity, with a slight abuse of notation. Algorithm 3 uses a loss function $\ell$ and a regularizer $\phi$.
Algorithm 3 Follow The Regularized Leader (FTRL)
1: Input: $\mathcal{F}$, $R$.
2: for $t \ge 1$ do
3:   Output $\theta_t \leftarrow \mathrm{argmin}_{\theta \in \mathbb{R}^d} \phi(\theta) + \sum_{s=1}^{t-1} \ell(\theta^\top a_s, r_s)$.
4:   Observe $a_t$, $r_t$.
5: end for
The regularizer is $\phi(\theta) = c \cdot \|\theta\|_2^2$, where $c$ is a constant which will be specified later in Theorem 6.5. To align with Assumption 6.1, we select the loss function as follows, following Jun et al. (2017):
Assumption 6.3 (Loss function, Jun et al. 2017). The loss function in Algorithm 3 is selected as $\ell(z, r) = -rz + m(z)$, $\ell_t(\theta) = \ell(\theta^\top a_t, r_t)$, where $m(z)$ satisfies $m'(z) = h(z)$.
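As an illustration of Algorithm 3 with the loss of Assumption 6.3, the sketch below solves the FTRL objective for the logistic link $h(z) = 1/(1+e^{-z})$, for which $m(z) = \log(1+e^z)$, by gradient descent on the convex regularized objective. The solver choice, step size, and all names are our own assumptions, not part of the algorithm's specification:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ftrl_glm(actions, rewards, c=1.0, steps=2000, lr=0.1):
    """One FTRL iterate for a GLM with logistic link h(z) = sigmoid(z):
    theta_t = argmin_theta  c*||theta||^2 + sum_s [ -r_s * <theta, a_s> + m(<theta, a_s>) ],
    where m(z) = log(1 + e^z) so that m'(z) = h(z).
    Solved here by plain gradient descent; the objective is strongly convex."""
    A = np.asarray(actions, dtype=float)
    r = np.asarray(rewards, dtype=float)
    theta = np.zeros(A.shape[1])
    for _ in range(steps):
        # gradient of the regularized loss: 2c*theta + sum_s (h(<theta,a_s>) - r_s) a_s
        grad = 2 * c * theta + A.T @ (sigmoid(A @ theta) - r)
        theta -= lr * grad
    return theta
```

In the bandit algorithm, one such learner is maintained per variance level, each seeing only the data indexed by its own $\Psi_{t,l}$.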

6.2. VARIANCE-DEPENDENT REGRET FOR FTRL

Before proposing our final algorithm for generalized linear bandits, we first present a variance-dependent complexity result for FTRL, since it is already nontrivial and reveals some interesting properties of our setting. We define a notion of regret of online regression, denoted $\mathrm{reg}_t$, as follows. The concept of regret of online regression was introduced in previous work (Abbasi-Yadkori et al., 2012; Jun et al., 2017); it is used to characterize the complexity for FTRL to learn the generalized linear function.
Definition 6.4. Let $\mathrm{reg}_t = \sum_{s=1}^t \ell(a_s^\top \theta_s, r_s) - \sum_{s=1}^t \ell(a_s^\top \theta^*, r_s)$.
Our definition of the regret of online regression is slightly different from that in prior works (Abbasi-Yadkori et al., 2012; Jun et al., 2017). Here $\theta^*$ is chosen to be the true parameter of the bandit model, while $\theta^*$ is often chosen as $\arg\inf_{\theta \in \Theta} \sum_s \ell_s(\theta)$ in Abbasi-Yadkori et al. (2012); Jun et al. (2017). From the perspective of online learning, algorithms and the corresponding analyses are usually introduced for either the realizable setting, where there exists an underlying $\theta^*$ that incurs zero loss, or the adversarial setting, where the bounded label $r_s$ in each round $s$ can be arbitrarily chosen by the adversary. As a result, the previous approaches of Abbasi-Yadkori et al. (2012); Jun et al. (2017) do not exploit the 'stochastic' property of the labels. Provided that the labels are sequentially generated with additional stochastic noise, our definition is more reasonable and natural. A recent work focusing on stochastic online linear regression also discussed the limitations of the adversarial setting (Section 2.2, Ouhamma et al. 2021). Next we present a bound for $\mathrm{reg}_t$ which exploits the variance information.
Theorem 6.5 (Regret of FTRL). Set $\phi(\theta) = 2A^2K^2\|\theta\|_2^2/\kappa$ and assume that all the data points fed into the algorithm have noise variance bounded by $\sigma_{\max}^2$. Then with probability at least $1 - 3\delta$, $\forall t \ge 1$, the regret of Algorithm 3 for the first $t$ rounds is bounded as follows:
$\mathrm{reg}_t \le \frac{8A^2K^2B^2}{\kappa} + \frac{9}{2\kappa} R^2 \log^2(4t^2/\delta) + \frac{3\sigma_{\max}^2}{\kappa} d \log\Big(1 + \frac{tA\kappa^2}{4dK^2}\Big).$
Remark 6.6. Jun et al. (2017) analyzed the online learning regret for the same function class and loss function as ours. Their result yields a $\mathrm{reg}_t$ of order $\tilde O\big(\frac{K^2A^2B^2 + R^2}{\kappa} d\big)$. Our result improves upon theirs in two respects. First, $R$ is strictly larger than $\sigma_{\max}$, since an $R$-sub-Gaussian random variable necessarily has variance at most $R^2$. Second, when we consider cases where the bound on the reward function (i.e., $KAB$) is extremely large compared to $R$, their result becomes $\tilde O(K^2A^2B^2 d/\kappa)$, which has an additional linear dependence on $d$.
Remark 6.7. Consider the special case where $\kappa = K = 1$. Our result degrades to $\tilde O(A^2B^2 + R^2 + \sigma_{\max}^2 d)$, which is essentially a regret upper bound for stochastic online linear regression with the square loss. Recently, Ouhamma et al. (2021) studied this stochastic setting and managed to get rid of the $O(A^2B^2 d)$ term appearing in the classic results for online linear regression in the adversarial setting. Ouhamma et al. (2021) derived a high probability regret bound of $\tilde O(R^2 d^2)$ after omitting the $o(\log^2 T)$ terms (Theorem 3.3, Ouhamma et al. 2021). Unlike their result, ours does not suffer from the quadratic dependence on $d$, and it depends on $\sigma_{\max}^2 d$ rather than $R^2 d^2$. Therefore, our result is better than that of Ouhamma et al. (2021) when $d$ is large. We also note that the discussion in Section 3.3 of Ouhamma et al. (2021) yields an improved expected regret of order $O(R^2 d \log^2 T + R^2 d^2 \log T \log\log T)$, but the first term dominates the regret only in the asymptotic sense, i.e., only when $T$ is very large ($T \ge (\log T)^d$).
Algorithm 4 GLOC with multi-level FTRL learners
1: Initialize: $V_{0,l} \leftarrow \lambda I$ for all $l \in [L]$.
2: while input $t$, $l_t$, $a_t$, $r_t$ do
3:   Set $\theta_{t,l_t}$ following Algorithm 3, where $\theta_{t,l_t} \leftarrow \mathrm{argmin}_{\theta \in \mathbb{R}^d} \phi(\theta) + \sum_{s \in \Psi_{t+1,l_t}} \ell(\theta^\top a_s, r_s)$.
4:   Find $t' = \max \Psi_{t,l_t}$.
5:   Update $V_{t,l_t} \leftarrow V_{t',l_t} + a_t a_t^\top$ and $z_{t,l_t} \leftarrow a_t^\top \theta_{t,l_t}$.
6:   Compute $\hat\theta_{t,l_t} \leftarrow V_{t,l_t}^{-1} \sum_{s \in \Psi_{t+1,l_t}} z_{s,l_t} \cdot a_s$.
7:   Define $\bar C_{t,l_t} \leftarrow \{\theta \in \mathbb{R}^d : \|\theta - \hat\theta_{t,l_t}\|_{V_{t,l_t}}^2 \le \beta_{t,l_t}\}$.
8:   Define $\bar C_{t,l} \leftarrow \bar C_{t-1,l}$ for all $l \in [L] \setminus \{l_t\}$.
9:   Return $C_{t,l} \leftarrow \{f_\theta \in \mathcal{F} \mid \theta \in \bar C_{t,l}\}$ for all $l \in [L]$.
10: end while
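Lines 5-6 of Algorithm 4 ridge-regress the online learner's predictions $z_s = a_s^\top \theta_s$ onto the chosen actions to obtain the confidence-set center. A minimal NumPy sketch for a single variance level, in batch form (names are our own):

```python
import numpy as np

def gloc_center(actions, z, lam=1.0):
    """Compute V = lam*I + sum_s a_s a_s^T and theta_hat = V^{-1} sum_s z_s * a_s,
    as in lines 5-6 of Algorithm 4 (one variance level, batch rather than recursive form)."""
    A = np.asarray(actions, dtype=float)
    z = np.asarray(z, dtype=float)
    V = lam * np.eye(A.shape[1]) + A.T @ A
    theta_hat = np.linalg.solve(V, A.T @ z)
    return theta_hat, V
```

If the predictions were generated by a single fixed parameter (i.e., $z = A\theta$) and $\lambda$ is tiny, $\hat\theta$ recovers $\theta$; in general it aggregates the per-round FTRL iterates into one center for the ellipsoidal confidence set.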

6.3. REGRET BOUND OF ALGORITHM 1 WITH GLOC

With our new technical tool from the last subsection in hand, we now present our final algorithm for the generalized linear bandit setting, Algorithm 4. Roughly speaking, Algorithm 4 is a multi-level version of the generalized linear online-to-confidence-set conversion (GLOC) algorithm proposed by Jun et al. (2017), equipped with FTRL. As shown in Algorithm 4, we maintain L FTRL online learners in parallel. Under our framework, each learner only receives data whose noise variances are of similar magnitude. As a result, we can invoke the variance-dependent result of Theorem 6.5 to derive a tighter regret bound for generalized linear bandits with heteroscedastic noise.

Theorem 6.8 (Regret bound for generalized linear bandits, informal). Suppose that Assumptions 6.1 and 6.2 hold for the known reward function class F. If we apply Algorithm 4 as a subroutine of Algorithm 1 (in line 9) and set β_{t,l} to

1 + 32A^2K^2B^2/κ^2 + (26/κ^2) R^2 log^2(4t^2L/δ) + (12·2^{2(l+1)} σ^2/κ^2) d log(1 + tAκ^2/(4dK^2)) + λB^2

for all t ∈ [T], l ∈ [L], where L = log_2(R/σ) and σ = R/√d, then with probability 1 − 4δ, the regret of Algorithm 1 for the first T rounds is bounded as follows:

Regret(T) = O((K/κ) d√J + (K/κ)(K·AB + R)√(dT)).

Remark 6.9. In the worst case, i.e., σ_1 = ··· = σ_T = R, our result degrades to O(KRd√T/κ + K^2AB√(dT)/κ),

7. CONCLUSION AND FUTURE WORK

In this work, we study the heteroscedastic stochastic bandit problem for a general reward function class. We propose a multi-level regression framework, ML^2, to deal with heteroscedastic noise. Under three different settings with additional assumptions on the noise and the function class, we analyze the performance of ML^2 and prove corresponding variance-dependent regret bounds, which strictly improve upon previous algorithms for the homoscedastic bandit setting. We leave the study of the optimal regret bound of heteroscedastic stochastic bandits for a general reward function class to future work.

A EXPERIMENTS

In this section, we conduct experiments on the proposed ML^2+FTRL algorithm for generalized linear bandits and compare it with GLOC proposed by Jun et al. (2017). For each trial, the dimension of θ* is set to d = 20, θ* is sampled uniformly at random, and the noise is ε_t = r_t − E[r_t]. With this construction, it is not hard to see that ε_t satisfies Assumption 3.5 with sub-Gaussian parameter R. We plot the cumulative regret of ML^2 and GLOC in Figure 1. By adopting the multi-level, variance-aware scheme, our ML^2 algorithm outperforms the previous GLOC algorithm by a large margin.

B PROOF SKETCH OF THEOREM 5.3

Proof Sketch. We prove the result by showing the validity of the confidence sets and bounding the sum of single-step regret incurred at each variance level.

Step 1: Construction of confidence sets. We first show that C_{t,l} is large enough to contain f* with high probability. Recall the definition of C_{t,l}:

C_{t,l} ← { f ∈ F : Σ_{s∈Ψ_{t,l}} (f(a_s) − f̂_{t,l}(a_s))^2 ≤ β_{t,l} }.   (B.1)

Thus, it suffices to show that f* satisfies the inequality in (B.1). According to our ERM subroutine, we have

Σ_{s∈Ψ_{t,l}} (f̂_{t,l}(a_s) − f*(a_s))^2 + 2 Σ_{s∈Ψ_{t,l}} ε_s [f*(a_s) − f̂_{t,l}(a_s)] = Σ_{s∈Ψ_{t,l}} (f̂_{t,l}(a_s) − r_s)^2 − Σ_{s∈Ψ_{t,l}} (f*(a_s) − r_s)^2 ≤ 0,

where the inequality holds since f̂_{t,l} is the minimizer of the cumulative squared loss Σ_{s∈Ψ_{t,l}} (f(a_s) − r_s)^2 at level l. Therefore, to bound Σ_{s∈Ψ_{t+1,l}} (f̂_{t+1,l}(a_s) − f*(a_s))^2, it suffices to bound the absolute value of Σ_{s∈Ψ_{t,l}} ε_s [f*(a_s) − f̂_{t,l}(a_s)]. Since f̂ suffers from a measurability issue, we bound this term via concentration of the self-normalized process together with an α-cover discretization argument, which finally shows that f* satisfies (B.1).

Step 2: Regret decomposition. With the confidence sets for the different variance levels constructed in Step 1, we decompose the final regret as

Regret(T) ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} (f*(a*_t) − f*(a_t)) ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} [max_{f∈C_{t,l}} f(a_t) − f*(a_t)] ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} w_{C_{t,l}}(a_t),   (B.2)

where the second inequality holds due to the optimism principle for arm selection, and the last one holds due to the definition of the width w_{C_{t,l}}(a) := max_{f∈C_{t,l}} f(a) − min_{f∈C_{t,l}} f(a) and the fact that f* ∈ C_{t,l} from Step 1.
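The algebraic identity behind Step 1 — expanding the squared losses of an arbitrary estimate against those of f* — can be checked numerically. The sketch below represents the function values simply as arrays of predictions and verifies that the left- and right-hand sides agree on random data (the helper name is ours):

```python
import random

def erm_identity_gap(f_hat, f_star, rewards):
    """Return LHS - RHS of
    sum (f_hat - f*)^2 + 2 sum eps*(f* - f_hat)
        = sum (f_hat - r)^2 - sum (f* - r)^2,
    where eps_s = r_s - f*(a_s). Should be ~0 for any inputs."""
    lhs = sum((fh - fs) ** 2 for fh, fs in zip(f_hat, f_star))
    lhs += 2 * sum((r - fs) * (fs - fh)
                   for fh, fs, r in zip(f_hat, f_star, rewards))
    rhs = sum((fh - r) ** 2 for fh, r in zip(f_hat, rewards))
    rhs -= sum((fs - r) ** 2 for fs, r in zip(f_star, rewards))
    return lhs - rhs

random.seed(0)
n = 50
f_star = [random.uniform(-1, 1) for _ in range(n)]
f_hat = [random.uniform(-1, 1) for _ in range(n)]
rewards = [fs + random.gauss(0, 0.5) for fs in f_star]
gap = erm_identity_gap(f_hat, f_star, rewards)
```

Since the identity holds for any f, plugging in the ERM minimizer (whose right-hand side is nonpositive) yields the displayed inequality.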

C EXAMPLE: VARIANCE-DEPENDENT REGRET BOUNDS FOR BERNOULLI BANDITS

In this section, we consider the following specific Bernoulli-type multi-armed bandit with general function approximation, where the observed reward r_t incurred by action a_t follows the two-point distribution

r_t = f*(a_t) + 1/f*(a_t) with probability f*(a_t);  r_t = f*(a_t) − 1/(1 − f*(a_t)) otherwise,  with ξ ≤ f*(a_t) ≤ 1/2.   (C.1)

It is easy to see that the noise ε_t satisfies R = 1/f*(a_t) ≤ 1/ξ and σ_t^2 = 1/f*(a_t) + 1/(1 − f*(a_t)). As discussed in Remark 3.6, we can use estimators of the variance in Algorithm 1 instead of the true variance at each round. The following corollary illustrates how to apply Algorithm 1 when the variance information is not accessible but can be estimated from past observations.

Corollary C.1. Let the same conditions as in Theorem 5.3 hold. Set α = T^{-2} and σ = 1. At each round, set σ̄_t^2(a_t) to min{1, σ̃^2(a_t) + 2/ξ^2 · w_{C_{t,l_t}}(a_t)}, where σ̃^2(a_t) := 1/f̂_{t,l_t}(a_t) + 1/(1 − f̂_{t,l_t}(a_t)). Then with probability at least 1 − δ, when T is large enough, the regret of Algorithm 1 for the first T rounds is bounded as

Regret(T) = O( √(dim_E log N_α · Σ_{t=1}^T Var[r_t]) + ξ^{-3} dim_E log N_α + ξ^{-1/2} √(dim_E log N_α · T) ).

Proof. We suppose the event described in Theorem 5.1 holds and the conditions of Theorem 5.3 hold without the variance being given at each round. At each round t, we can upper bound the variance of ε_t := r_t − f*(a_t) using C_{t,l} and f̂_{t,l} returned by Algorithm 2. Specifically, we estimate Var[r_t] = σ_t^2 = 1/f*(a_t) + 1/(1 − f*(a_t)) by σ̃_t^2 := 1/f̂_{t,l_t}(a_t) + 1/(1 − f̂_{t,l_t}(a_t)) and bound the gap between σ̃_t^2 and Var[r_t]:

|σ̃_t^2(a_t) − Var[r_t]| ≤ |1/f̂_{t,l_t}(a_t) − 1/f*(a_t)| + |1/(1 − f̂_{t,l_t}(a_t)) − 1/(1 − f*(a_t))| ≤ 2/ξ^2 · |f̂_{t,l_t}(a_t) − f*(a_t)| ≤ 2/ξ^2 · w_{C_{t,l_t}}(a_t),   (C.2)

where the first inequality follows from the triangle inequality, the second from ξ ≤ f*(a_t) ≤ 1/2, and the last holds according to Theorem 5.1.
Therefore, it is valid to replace σ_t^2 by min{1, σ̃_t^2 + 2/ξ^2 · w_{C_{t,l_t}}(a_t)} in Algorithm 1. Similar to the proof of Theorem 5.3, we have

J = Σ_{t=1}^T min{1, σ̃_t^2 + 2/ξ^2 · w_{C_{t,l_t}}(a_t)} ≤ Σ_{t=1}^T min{1, Var[r_t] + 4/ξ^2 · w_{C_{t,l_t}}(a_t)} ≤ Σ_{t=1}^T Var[r_t] + 4/ξ^2 · Σ_{l∈[L]} Σ_{s∈Ψ_{T+1,l}} w_{C_{s,l}}(a_s) ≤ Σ_{t=1}^T Var[r_t] + 4/ξ^2 · L(1/T + d + 4β_{T,L}√(dT)),

where the first equality follows from the definition of J, the first inequality follows from (C.2), and the last inequality holds due to Lemma E.2. Hence J is of order O(Σ_{t=1}^T Var[r_t] + ξ^{-3} · √(d log N_α · T)). According to Theorem 5.3, the regret of Algorithm 1 for this Bernoulli bandit problem is then of order

O( √(dim_E log N_α · Σ_{t=1}^T Var[r_t]) + ξ^{-3/2} dim_E^{3/4} (log N_α)^{3/4} T^{1/4} + ξ^{-1/2} √(dim_E log N_α · T) ) ≤ O( √(dim_E log N_α · Σ_{t=1}^T Var[r_t]) + ξ^{-3} dim_E log N_α + ξ^{-1/2} √(dim_E log N_α · T) ).

This completes the proof.
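The two-point reward in (C.1) can be checked directly: its mean should equal f*(a_t) and its variance should equal 1/f*(a_t) + 1/(1 − f*(a_t)), with noise magnitude at most 1/f*(a_t). The sketch below computes these moments exactly (no sampling); the helper name is ours.

```python
def two_point_moments(f):
    """Exact mean and variance of the reward in (C.1):
    r = f + 1/f with probability f, and r = f - 1/(1-f) otherwise."""
    hi, lo = f + 1.0 / f, f - 1.0 / (1.0 - f)
    mean = f * hi + (1.0 - f) * lo
    var = f * (hi - mean) ** 2 + (1.0 - f) * (lo - mean) ** 2
    worst_noise = max(abs(hi - f), abs(lo - f))
    return mean, var, worst_noise

mean, var, worst_noise = two_point_moments(0.25)
```

For f = 0.25 this gives mean 0.25, variance 1/0.25 + 1/0.75 = 16/3, and worst-case noise 1/0.25 = 4, matching the claims following (C.1).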

D GAP-DEPENDENT REGRET BOUNDS

In this section, we provide gap-dependent regret bounds for Algorithm 1 with various subroutines. Denote by Δ_t the smallest gap between the reward of an optimal action and the reward of a suboptimal action:

Δ_t := min_{a∈D_t, a∉D*_t} [f*(a*_t) − f*(a)],   (D.1)

where D*_t := argmax_{a∈D_t} f*(a). Let Δ be the smallest gap over all rounds: Δ := min_{t∈[T]} Δ_t.

Theorem D.1 (Gap-dependent regret bound for bandits with heteroscedastic sub-Gaussian noise). Suppose Assumptions 3.4 and 4.1 hold and |f*(a)| ≤ C for all a ∈ A. Let σ_max = max_{t∈[T]} σ_t and suppose σ_max > σ. If we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set β_{t,l} to the same value as in Theorem 4.2, then with probability at least 1 − δ, the regret of Algorithm 1 for the first T rounds is bounded as follows:

Regret(T) ≤ (L/Δ)(4 dim_E C^2 + 1/T) + 16 (LTαC/Δ) dim_E (log T + 1) + 128 (L/Δ) σ_max^2 log(2N_α L/δ) dim_E (log T + 1) + 32 (L/Δ) Tα σ_max √(2 log(8T^2 L/δ)) dim_E (log T + 1).

Corollary D.2. Let the same conditions as in Theorem D.1 hold and set α = T^{-2}. Then with probability at least 1 − δ, when T is large enough, the regret for the first T rounds is bounded as follows:

Regret(T) = O( (σ_max^2/Δ) dim_E log(N(F, T^{-2}, ‖·‖_∞)) ).

Remark D.3. Corollary D.2 immediately implies an O(R^2 dim_E log(N(F, T^{-2}, ‖·‖_∞))/Δ) gap-dependent regret bound via the fact σ_max = O(R), which provides a novel instance-dependent bound for the original problem considered by Russo & Van Roy (2013). To our knowledge, this is the first result of its kind for the general bandit model.

Remark D.4. When restricted to linear contextual bandits of dimension d, our result reduces to O(σ_max^2 d^2/Δ), which matches the previous result derived in Abbasi-Yadkori et al. (2011).

Theorem D.5 (Gap-dependent regret bound for bandits with heteroscedastic noise, informal). Suppose Assumption 3.4 holds and |f*(a)| ≤ 1 for all a ∈ A. Let σ_max = max_{t∈[T]} σ_t (suppose σ_max > σ) and d = dim_E(F, 1/T).
If we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set β_{t,l} to the same value as in Theorem 5.3, then with probability at least 1 − 2δ, the regret of Algorithm 1 for the first T rounds is bounded as follows:

Regret(T) = O( (σ_max^2/Δ) d log N_α + (R/Δ) d log N_α ),

provided that T is large enough and α = T^{-2}.

Remark D.6. Similar to Remark 5.5, compared with the regret in Corollary D.2, the regret in Theorem D.5 has an additional term that depends on R.
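Definition (D.1) is easy to instantiate for a finite decision set: per round, take the best value in D_t and the smallest positive gap to it, then minimize over rounds. A minimal sketch (function names are ours):

```python
def round_gap(f_star, actions):
    """Delta_t from (D.1): smallest gap between the optimal value and any
    suboptimal action's value within one round's decision set D_t."""
    values = [f_star(a) for a in actions]
    best = max(values)
    subopt = [best - v for v in values if v < best]
    # If every action is optimal, the per-round gap is vacuous (infinite).
    return min(subopt) if subopt else float("inf")

def min_gap(f_star, decision_sets):
    """Delta := min over rounds t of Delta_t."""
    return min(round_gap(f_star, ds) for ds in decision_sets)

f = lambda a: {"a": 1.0, "b": 0.7, "c": 0.4}[a]
delta = min_gap(f, [["a", "b"], ["a", "b", "c"], ["b", "c"]])
```

Here every round's smallest gap is 0.3, so Δ = 0.3, and the bounds above scale with 1/Δ.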

E PROOFS FROM SECTION 4

Lemma E.1 (Proposition 3, Russo & Van Roy 2013). If (β_t ≥ 0 : t ∈ N) is a nondecreasing sequence and F_t := {f ∈ F : Σ_{s=1}^{t−1} (f̂_t(a_s) − f(a_s))^2 ≤ β_t^2}, then

Σ_{t=1}^T 1(w_{F_t}(a_t) > ε) ≤ (4β_T^2/ε^2 + 1) dim_E(F, ε)

for all T ∈ N and ε > 0.

Lemma E.2 (Lemma 2, Russo & Van Roy 2013). If (β_t ≥ 0 : t ∈ N) is a nondecreasing sequence and F_t := {f ∈ F : Σ_{s=1}^{t−1} (f̂_t(a_s) − f(a_s))^2 ≤ β_t^2}, then

Σ_{t=1}^T w_{F_t}(a_t) ≤ 1/T + w_F(A) · dim_E(F, T^{-2}) + 4β_T √(dim_E(F, T^{-2}) · T)

for all T ∈ N.

Instead of using the previous approach that bounds the sum of widths, we take another approach that bounds the sum of squared widths, which later yields a novel gap-dependent result.

Lemma E.3 (Bounding the sum of squared widths). If (β_t ≥ 0 : t ∈ N) is a nondecreasing sequence and F_t := {f ∈ F : Σ_{s=1}^{t−1} (f̂_t(a_s) − f(a_s))^2 ≤ β_t^2}, then

Σ_{t=1}^T w^2_{F_t}(a_t) ≤ dim_E(F, 1/√T) w_F^2(A) + 1 + 4β_T^2 dim_E(F, 1/√T)(log T + 1)

for all T ∈ N.

Proof. Following a similar approach as Russo & Van Roy (2013), we reorder the sequence {w_{F_t}(a_t)}_{t∈[T]} into {w_t}_{t∈[T]} such that w_1 ≥ w_2 ≥ ··· ≥ w_T. Let T' = max{t ∈ [T] : w_t ≥ 1/√T}. By Lemma E.1,

t ≤ (4β_T^2/(w_t − δ)^2 + 1) dim_E(F, w_t − δ)   (E.1)

for any δ ∈ (0, w_t). Taking δ → 0, we have

w_t^2 ≤ 4β_T^2 dim_E(F, w_t) / (t − dim_E(F, w_t)).   (E.2)

Hence, writing d* := dim_E(F, 1/√T),

Σ_{t=1}^T w^2_{F_t}(a_t) = Σ_{t=1}^T w_t^2 ≤ d* w_F^2(A) + 1 + Σ_{t=d*+1}^{T'} w_t^2 ≤ d* w_F^2(A) + 1 + Σ_{t=d*+1}^{T'} 4β_T^2 d*/(t − d*) ≤ d* w_F^2(A) + 1 + 4β_T^2 d* (log T + 1),   (E.3)

where the first inequality holds since Σ_{t=T'+1}^T w_t^2 ≤ T · (1/√T)^2 = 1 under the definition of T', the second follows from (E.2), and the third is derived by comparing the harmonic sum with an integral.

With Assumption 4.1, we can directly apply the previous result on confidence sets by replacing the sub-Gaussianity parameter η with 2^{l+1} σ.
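The last step of the proof bounds the harmonic tail Σ_{k=1}^{T−d*} 1/k by log T + 1 via an integral comparison. That elementary inequality is easy to sanity-check numerically:

```python
import math

def harmonic_tail(d, T):
    """sum_{t=d+1}^{T} 1/(t-d) = sum_{k=1}^{T-d} 1/k (the harmonic number H_{T-d})."""
    return sum(1.0 / k for k in range(1, T - d + 1))

# Integral comparison used in the proof: H_n <= log(n) + 1 <= log(T) + 1 for n <= T.
checks = all(harmonic_tail(d, T) <= math.log(T) + 1.0
             for d in (0, 3, 10) for T in (20, 100, 1000))
```

The bound is loose but dimension-free, which is all the proof needs.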
Previous work by Russo & Van Roy (2013) achieved a confidence set of radius O(η^2 log N_α + αt(C + η)). Ayoub et al. (2020) later provided a result of the same order with smaller constants.

Lemma E.4 (Theorem 5, Ayoub et al. 2020). Suppose that |f*(a)| ≤ C for all a ∈ A. For any α > 0, if we set

β_{t,l} = ( 8(2^{l+1}σ)^2 log(2N_α L/δ) + 4tα(C + (2^{l+1}σ)√(2 log(4t(t+1)L/δ))) )^{1/2},   (E.4)

then with probability at least 1 − δ, for all t ≥ 1 and l ∈ [L], f* ∈ C_{t,l}.

Theorem E.5 (Restatement of Theorem 4.2). Suppose Assumptions 3.4 and 4.1 hold and |f*(a)| ≤ C for all a ∈ A. For all t ∈ [T], l ∈ [L] and δ ∈ (0, 1), α > 0, σ > 0, if we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set β_{t,l} to the square root of

8(2^{l+1}σ)^2 log(2N_α L/δ) + 4tα(C + (2^{l+1}σ)√(2 log(4t(t+1)L/δ))),

where N_α = N(F, α, ‖·‖_∞) and L = log_2(R/σ) (recall the definition of L in Algorithm 1), then with probability at least 1 − δ, the regret for the first T rounds is bounded as follows:

Regret(T) ≤ L + 2C dim_E L + 8√(2L dim_E (J + σ^2 T) log(2N_α L/δ)) + 4T√(L dim_E α (C + 2R√(2 log(4T(T+1)L/δ)))).

Proof. For simplicity, let d = dim_E(F, 1/T^2). With probability at least 1 − δ, we have

Regret(T) = Σ_{t=1}^T (f*(a*_t) − f*(a_t)) = Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} (f*(a*_t) − f*(a_t)) ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} [max_{f∈C_{t,l}} f(a_t) − f*(a_t)] ≤ Σ_{l∈[L]} [1 + 2C·d + 4β_{T,l}√(d|Ψ_{T+1,l}|)] ≤ L + 2CdL + 2√(L Σ_{l∈[L]} β^2_{T,l} d|Ψ_{T+1,l}|) =: L + 2CdL + 2·I_0,   (E.5)–(E.6)

where the first equality holds by the definition in (3.1), the second equality holds since Ψ_{T+1,•} forms a partition of [T], the first inequality holds due to Lemma E.4, the second inequality follows from Lemma E.2, and the third inequality is obtained by applying the Cauchy–Schwarz inequality.
Then we continue to bound I_0:

I_0 = √(Ld Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} β^2_{T,l})
 ≤ √(Ld)·[ √(Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} 8(2^{l+1}σ)^2 log(2N_α L/δ)) + 2T√(α(C + 2R√(2 log(4T(T+1)L/δ)))) ]
 ≤ √(Ld)·[ √(Σ_{t=1}^T 32(σ_t^2 + σ^2) log(2N_α L/δ)) + 2T√(α(C + 2R√(2 log(4T(T+1)L/δ)))) ]
 ≤ √(32Ld(J + σ^2 T) log(2N_α L/δ)) + 2T√(Ldα(C + 2R√(2 log(4T(T+1)L/δ)))),   (E.7)

where the first inequality follows from the definition of β_{T,l} together with √(a+b) ≤ √a + √b and 2^{l+1}σ ≤ 2R, the second inequality holds since for all l ∈ [L] and t ∈ Ψ_{T+1,l} we have (2^{l+1}σ)^2 = 4(2^l σ)^2 ≤ 4 max{σ, σ_t}^2 ≤ 4(σ^2 + σ_t^2), and the last step uses the definition of J. Substituting (E.7) into (E.6), we obtain

Regret(T) ≤ L + 2CdL + 8√(2Ld(J + σ^2 T) log(2N_α L/δ)) + 4T√(Ldα(C + 2R√(2 log(4T(T+1)L/δ)))),

which completes the proof.

Theorem E.6 (Restatement of Theorem D.1). Suppose Assumptions 3.4 and 4.1 hold and |f*(a)| ≤ C for all a ∈ A. Let σ_max = max_{t∈[T]} σ_t. If we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set β_{t,l} to the same value as in Theorem 4.2, then with probability at least 1 − δ, the regret of Algorithm 1 for the first T rounds is bounded as follows:

Regret(T) ≤ (L/Δ)(4 dim_E C^2 + 1/T) + 16 (LTαC/Δ) dim_E (log T + 1) + 128 (L/Δ) σ_max^2 log(2N_α L/δ) dim_E (log T + 1) + 32 (L/Δ) Tα σ_max √(2 log(8T^2 L/δ)) dim_E (log T + 1).

Proof. For simplicity, let d = dim_E(F, 1/T^2). Suppose the event described in Lemma E.4 holds.
With probability at least 1 − δ,

Regret(T) = Σ_{t=1}^T (f*(a*_t) − f*(a_t)) = Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} (f*(a*_t) − f*(a_t))
 ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} (f*(a*_t) − f*(a_t))^2 / Δ
 ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} [max_{f∈C_{t,l}} f(a_t) − f*(a_t)]^2 / Δ
 ≤ (1/Δ) Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} w^2_{C_{t,l}}(D_t)
 ≤ (1/Δ) Σ_{l∈[L]} [4dC^2 + 1/T + 4β^2_{T,l} d(log T + 1)]
 ≤ (L/Δ)(4dC^2 + 1/T) + 4(L/Δ) d(log T + 1)[32σ_max^2 log(2N_α L/δ) + 4Tα(C + 2σ_max√(2 log(4T(T+1)L/δ)))],

where the first equality follows from the definition in (3.1), the second equality holds since Ψ_{T+1,l} (l ∈ [L]) form a partition of [T], the first inequality holds due to the definition of Δ in Subsection 3.1, the second inequality follows from Lemma E.4, the fourth inequality holds due to Lemma E.3, and the last inequality is derived by substituting the value of β_{T,l}.

F PROOFS FROM SECTION 5

Lemma F.1 (Freedman 1975). Let M, v > 0 be fixed constants. Let {x_i}_{i=1}^n be a stochastic process and {G_i}_i a filtration such that for all i ∈ [n], x_i is G_i-measurable, while almost surely E[x_i | G_{i−1}] = 0, |x_i| ≤ M, and Σ_{i=1}^n E[x_i^2 | G_i] ≤ v. Then, for any δ > 0, with probability at least 1 − δ, for all t ∈ [n],

Σ_{i=1}^t x_i ≤ √(2v log(2t^2/δ)) + (2/3)·M log(2t^2/δ).

Lemma F.2. Suppose a, b ≥ 0. If x^2 ≤ a + b·x, then x^2 ≤ 2b^2 + 2a.

Proof. Solving for the roots of the quadratic q(x) := x^2 − b·x − a, we obtain max{x_1, x_2} = (b + √(b^2 + 4a))/2. Hence x ≤ (b + √(b^2 + 4a))/2 whenever q(x) ≤ 0. We then further have

x^2 ≤ (1/4)(b + √(b^2 + 4a))^2 ≤ (1/4)·2(b^2 + b^2 + 4a) ≤ 2b^2 + 2a.   (F.1)

Theorem F.3 (Restatement of Theorem 5.1). Suppose that |f*(a)| ≤ C for all a ∈ A. For any α > 0 and δ ∈ (0, 1/2), if we set β_{t,l} to the square root of

12Cαt + 4αR̄t + (8/3)CR̄ log(2N_α t^2/δ) + 16(2^{l+1}σ)^2 log(2N_α t^2/δ),

where R̄ := R√(2 log(4t^2/δ)), then f* ∈ C_{t,l} for all t with probability at least 1 − 2δ for any fixed l. Proof.
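Lemma F.2's self-bounding step (if x^2 ≤ a + bx with a, b ≥ 0, then x^2 ≤ 2b^2 + 2a) is used repeatedly below to turn implicit quadratic inequalities into explicit bounds. A quick randomized sanity check (helper name is ours):

```python
import random

def f2_conclusion_holds(a, b, x):
    """If x^2 <= a + b*x (a, b >= 0), Lemma F.2 gives x^2 <= 2*b^2 + 2*a."""
    if x * x <= a + b * x:                  # hypothesis
        return x * x <= 2 * b * b + 2 * a   # conclusion
    return True                             # hypothesis fails: nothing to check

random.seed(1)
ok = all(f2_conclusion_holds(random.uniform(0, 5),
                             random.uniform(0, 5),
                             random.uniform(-10, 10))
         for _ in range(10000))
```

The lemma is what converts the implicit bound on Σ(g(a_s) − f*(a_s))^2 in (F.8)–(F.11) into the explicit confidence radius of Theorem F.3.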
By direct calculation, for all f ∈ F we have

I(f) := Σ_{s∈Ψ_{t,l}} (f(a_s) − f*(a_s))^2 + 2 Σ_{s∈Ψ_{t,l}} ε_s [f*(a_s) − f(a_s)] = Σ_{s∈Ψ_{t,l}} (r_s − f(a_s))^2 − Σ_{s∈Ψ_{t,l}} (r_s − f*(a_s))^2.   (F.2)

By the sub-Gaussianity of ε_t we have

P(∃t ≥ 1: max_{1≤s≤t} |ε_s| ≥ R√(2 log(4t^2/δ))) ≤ Σ_{s≥1} P(|ε_s| ≥ R√(2 log(4s^2/δ))) ≤ Σ_{s≥1} δ/(2s^2) ≤ δ.   (F.3)

For simplicity, let E_subG := {∀t ≥ 1, max_{1≤s≤t} |ε_s| ≤ R√(2 log(4t^2/δ))}. Let G(α) ⊂ F be an α-cover of F in ‖·‖_∞. From the definition of f̂_{t,l}, we have

Σ_{s∈Ψ_{t,l}} (f̂_{t,l}(a_s) − f*(a_s))^2 + 2 Σ_{s∈Ψ_{t,l}} ε_s [f*(a_s) − f̂_{t,l}(a_s)] = I(f̂_{t,l}) ≤ 0.   (F.4)

Let g = argmin_{g∈G(α)} ‖f̂_{t,l} − g‖_∞. We then bound the gap I(g) − I(f̂_{t,l}) under the event E_subG:

I(g) − I(f̂_{t,l}) = Σ_{s∈Ψ_{t,l}} [(g(a_s) − f*(a_s))^2 − (f̂_{t,l}(a_s) − f*(a_s))^2] + 2 Σ_{s∈Ψ_{t,l}} ε_s [f̂_{t,l}(a_s) − g(a_s)]
 ≤ Σ_{s∈Ψ_{t,l}} (g(a_s) − f̂_{t,l}(a_s))(g(a_s) + f̂_{t,l}(a_s) − 2f*(a_s)) + 2 Σ_{s∈Ψ_{t,l}} αR√(2 log(4t^2/δ))
 ≤ 4Cαt + 2αR√(2 log(4t^2/δ))·t.   (F.5)

Fix an f ∈ F. Applying Freedman's inequality (Lemma F.1), with probability at least 1 − δ we have

Σ_{s∈Ψ_{t,l}} ε_s·1(E_subG)·[f*(a_s) − f(a_s)] ≥ −(2/3)·R√(2 log(4t^2/δ))·C log(2t^2/δ) − √(2(2^{l+1}σ)^2 Σ_{s∈Ψ_{t,l}} (f(a_s) − f*(a_s))^2 log(2t^2/δ))   (F.6)

for all t ≥ 1. Taking a union bound over all f ∈ G(α) and E_subG, we further obtain

Σ_{s∈Ψ_{t,l}} ε_s [f*(a_s) − f(a_s)] ≥ −(2/3)·R√(2 log(4t^2/δ))·C log(2N_α t^2/δ) − √(2(2^{l+1}σ)^2 log(2N_α t^2/δ) Σ_{s∈Ψ_{t,l}} (f(a_s) − f*(a_s))^2)   (F.7)

for all f ∈ G(α), with probability at least 1 − 2δ. Substituting (F.7) into the definition of I(f), we have for g, with probability at least 1 − 2δ,

4Cαt + 2αR√(2 log(4t^2/δ))·t ≥ I(g)   (F.8)
 ≥ −(4/3)·R√(2 log(4t^2/δ))·C log(2N_α t^2/δ)   (F.9)
 − √(8(2^{l+1}σ)^2 log(2N_α t^2/δ) Σ_{s∈Ψ_{t,l}} (g(a_s) − f*(a_s))^2)   (F.10)
 + Σ_{s∈Ψ_{t,l}} (g(a_s) − f*(a_s))^2,   (F.11)

where the first inequality is obtained by substituting (F.5) and (F.4) into I(g) ≤ [I(g) − I(f̂_{t,l})] + I(f̂_{t,l}), and the second inequality follows from the definition of I(f) and (F.7).
Using Lemma F.2, we deduce that

Σ_{s∈Ψ_{t,l}} (g(a_s) − f*(a_s))^2 ≤ 8Cαt + 4αR̄t + (8/3)CR̄ log(2N_α t^2/δ) + 16(2^{l+1}σ)^2 log(2N_α t^2/δ).

Then we can complete the proof by bounding the gap between Σ_{s∈Ψ_{t,l}} (g(a_s) − f*(a_s))^2 and Σ_{s∈Ψ_{t,l}} (f̂_{t,l}(a_s) − f*(a_s))^2:

Σ_{s∈Ψ_{t,l}} (f̂_{t,l}(a_s) − f*(a_s))^2 ≤ Σ_{s∈Ψ_{t,l}} (g(a_s) − f*(a_s))^2 + |Σ_{s∈Ψ_{t,l}} (g(a_s) − f*(a_s))^2 − Σ_{s∈Ψ_{t,l}} (f̂_{t,l}(a_s) − f*(a_s))^2| ≤ 12Cαt + 4αR̄t + (8/3)CR̄ log(2N_α t^2/δ) + 16(2^{l+1}σ)^2 log(2N_α t^2/δ),

since ‖f̂_{t,l} − g‖_∞ ≤ α bounds the gap term by 4Cαt.

Theorem F.4 (Restatement of Theorem 5.3). Suppose Assumption 3.4 holds and |f*(a)| ≤ 1 for all a ∈ A. For all t ∈ [T], l ∈ [L] and δ ∈ (0, 1), α > 0, σ > 0, if we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set β_{t,l} to the square root of

12αt + 4αR̄t + (8/3)R̄ log(2N_α t^2 L/δ) + 16(2^{l+1}σ)^2 log(2N_α t^2 L/δ),

where N_α = N(F, α, ‖·‖_∞) and R̄ = R√(2 log(4t^2 L/δ)) (with a slight abuse of notation), then with probability at least 1 − 2δ, the regret for the first T rounds is bounded as follows:

Regret(T) ≤ 4T√(L dim_E (log T + 1) α(3 + R̄)) + 2√((8/3)L dim_E (log T + 1) R̄ log(2N_α T^2 L/δ)·T) + 16√(L dim_E (log T + 1) log(2N_α T^2 L/δ)(J + Tσ^2)) + √L(2√(dim_E T) + 1).

Proof. For simplicity, let d = dim_E(F, 1/T). Based on Theorem 5.1, for any fixed l we have f* ∈ C_{t,l} with probability 1 − 2δ/L. Applying a union bound over all l ∈ [L], we have f* ∈ C_{t,l} for all t, l with probability at least 1 − 2δ.
Then we further obtain, with probability at least 1 − 2δ,

Regret(T) = Σ_{t=1}^T (f*(a*_t) − f*(a_t)) = Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} (f*(a*_t) − f*(a_t))
 ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} [max_{f∈C_{t,l}} f(a_t) − f*(a_t)]
 ≤ Σ_{l∈[L]} √(|Ψ_{T+1,l}| Σ_{t∈Ψ_{T+1,l}} w^2_{C_{t,l}}(D_t))
 ≤ √L · √(Σ_{l∈[L]} |Ψ_{T+1,l}| Σ_{t∈Ψ_{T+1,l}} w^2_{C_{t,l}}(D_t))
 ≤ √L · √(Σ_{l∈[L]} |Ψ_{T+1,l}| (4d + 1/T + 4β^2_{T,l} d(log T + 1))),

where the first equality follows from the definition in (3.1), the second equality holds since Ψ_{T+1,l} (l ∈ [L]) form a partition of [T], the second and third inequalities follow from the Cauchy–Schwarz inequality and the definition in (3.3), and the last inequality follows from Lemma E.3. Using the definition of β_{t,l}, we further calculate

Regret(T) ≤ √L(2√(dT) + 1) + √(L Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} 64(2^{l+1}σ)^2 log(2N_α T^2 L/δ) d(log T + 1)) + 2√(Ld(log T + 1)T·(12αT + 4αR̄T + (8/3)R̄ log(2N_α T^2 L/δ)))
 ≤ √L(2√(dT) + 1) + 4T√(Ld(log T + 1)α(3 + R̄)) + 2√((8/3)Ld(log T + 1)R̄ log(2N_α T^2 L/δ)·T) + √(Ld(log T + 1) log(2N_α T^2 L/δ) Σ_{t=1}^T 256(σ_t^2 + σ^2))
 ≤ √L(2√(dT) + 1) + 4T√(Ld(log T + 1)α(3 + R̄)) + 2√((8/3)Ld(log T + 1)R̄ log(2N_α T^2 L/δ)·T) + 16√(Ld(log T + 1) log(2N_α T^2 L/δ)(J + Tσ^2)),

where the first inequality holds by the definition of β_{t,l} and the fact that √(a + b) ≤ √a + √b for a, b > 0, the second inequality follows from the definition of l_t (which guarantees 2^l σ ≤ max{σ, σ_t} for t ∈ Ψ_{T+1,l}), and the third follows from the definition of J.

Theorem F.5 (Formal version of Theorem D.5). Suppose Assumption 3.4 holds and |f*(a)| ≤ 1 for all a ∈ A. Let σ_max = max_{t∈[T]} σ_t and d = dim_E(F, 1/T). If we apply Algorithm 2 as a subroutine of Algorithm 1 (in line 9) and set β_{t,l} to the same value as in Theorem 5.3, then with probability at least 1 − 2δ, the regret of Algorithm 1 for the first T rounds is bounded as follows:

Regret(T) ≤ (L/Δ)(4d + 1/T) + 4(LαT/Δ) d(log T + 1)(12 + 4R̄) + (32/3)(L/Δ) R̄ d(log T + 1) log(2N_α T^2 L/δ) + 256 (L/Δ) σ_max^2 d(log T + 1) log(2N_α T^2 L/δ).

Proof. For simplicity, let d = dim_E(F, 1/T).
Basically following the approach of Theorem D.1, with probability 1 − 2δ we have

Regret(T) = Σ_{t=1}^T (f*(a*_t) − f*(a_t)) = Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} (f*(a*_t) − f*(a_t))
 ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} (f*(a*_t) − f*(a_t))^2 / Δ
 ≤ Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} [max_{f∈C_{t,l}} f(a_t) − f*(a_t)]^2 / Δ
 ≤ (1/Δ) Σ_{l∈[L]} Σ_{t∈Ψ_{T+1,l}} w^2_{C_{t,l}}(D_t)
 ≤ (1/Δ) Σ_{l∈[L]} [4d + 1/T + 4β^2_{T,l} d(log T + 1)]
 ≤ (L/Δ)(4d + 1/T) + 4(L/Δ) d(log T + 1)[12αT + 4αR̄T + (8/3)R̄ log(2N_α T^2 L/δ) + 64σ_max^2 log(2N_α T^2 L/δ)]
 = (L/Δ)(4d + 1/T) + 4(LαT/Δ) d(log T + 1)(12 + 4R̄) + (32/3)(L/Δ) R̄ d(log T + 1) log(2N_α T^2 L/δ) + 256(L/Δ) σ_max^2 d(log T + 1) log(2N_α T^2 L/δ),

where the first equality follows from the definition in (3.1), the second equality holds since Ψ_{T+1,l} (l ∈ [L]) form a partition of [T], the first inequality follows from the definition of Δ in (D.1), the third inequality follows from the definition in (3.3), the fourth inequality holds by Lemma E.3, and the fifth inequality follows from substituting the value of β_{T,l}.

G PROOFS FROM SECTION 6

Lemma G.1. Let reg_t := Σ_{s=1}^t ℓ(a_s^T θ_s, r_s) − Σ_{s=1}^t ℓ(a_s^T θ*, r_s). Following Algorithm 3, with probability at least 1 − δ,

Σ_{s=1}^t (a_s^T(θ_s − θ*))^2 ≤ (4/κ) reg_t + (8R^2/κ^2) log(4t^2/δ)   (G.1)

for all t ≥ 1. We denote the corresponding event by E_1.

Proof. We have

reg_t = Σ_{s=1}^t ℓ(a_s^T θ_s, r_s) − Σ_{s=1}^t ℓ(a_s^T θ*, r_s) = Σ_{s=1}^t [ℓ'(a_s^T θ*, r_s)·a_s^T(θ_s − θ*) + (ℓ''(ξ_s, r_s)/2)(a_s^T θ_s − a_s^T θ*)^2] ≥ −Σ_{s=1}^t ε_s·a_s^T(θ_s − θ*) + (κ/2) Σ_{s=1}^t (a_s^T θ_s − a_s^T θ*)^2,

where the first equality follows from Definition 6.4, the second equality holds by a Taylor expansion with ξ_s a point between a_s^T θ_s and a_s^T θ*, and the inequality follows from Assumptions 6.2 and 6.3. Rearranging, we obtain

Σ_{s=1}^t (a_s^T(θ_s − θ*))^2 ≤ (2/κ) Σ_{s=1}^t ε_s·a_s^T(θ_s − θ*) + (2/κ) reg_t ≤ (2/κ)√(2R^2 Σ_{s=1}^t (a_s^T(θ_s − θ*))^2 log(2/δ)) + (2/κ) reg_t

with probability 1 − δ, where the second inequality follows from the sub-Gaussianity of ε_s. Applying Lemma F.2, we obtain that with probability at least 1 − δ,

Σ_{s=1}^t (a_s^T(θ_s − θ*))^2 ≤ (4/κ) reg_t + (8/κ^2) R^2 log(2/δ).

Applying a union bound over all t ≥ 1, we have with probability at least 1 − δ,

Σ_{s=1}^t (a_s^T(θ_s − θ*))^2 ≤ (4/κ) reg_t + (8/κ^2) R^2 log(4t^2/δ)   (G.2)

for all t ≥ 1.

Lemma G.2 (Lemma 11, Abbasi-Yadkori et al. 2011). For any λ > 0 and any sequence {x_t}_{t=1}^T ⊂ R^d, define Z_t = λI + Σ_{i=1}^t x_i x_i^T. Then, provided that ‖x_t‖_2 ≤ M for all t ∈ [T], we have

Σ_{t=1}^T min{1, ‖x_t‖^2_{Z_{t−1}^{-1}}} ≤ 2d log((dλ + TM^2)/(dλ)).

Lemma G.3. For any λ > 0 and any sequence {x_t}_{t=1}^T ⊂ R^d, define Z_t = λI + Σ_{i=1}^t x_i x_i^T. Then, provided that ‖x_t‖_2 ≤ M for all t ∈ [T], we have

Σ_{t=1}^T ‖x_t‖^2_{Z_t^{-1}} ≤ 2d log((dλ + TM^2)/(dλ)).

Proof.
Applying the matrix inversion (Sherman–Morrison) lemma,

Σ_{t=1}^T ‖x_t‖^2_{Z_t^{-1}} = Σ_{t=1}^T x_t^T Z_t^{-1} x_t = Σ_{t=1}^T x_t^T (Z_{t−1}^{-1} − Z_{t−1}^{-1} x_t x_t^T Z_{t−1}^{-1} / (1 + x_t^T Z_{t−1}^{-1} x_t)) x_t = Σ_{t=1}^T ‖x_t‖^2_{Z_{t−1}^{-1}} / (1 + ‖x_t‖^2_{Z_{t−1}^{-1}}) ≤ Σ_{t=1}^T min{1, ‖x_t‖^2_{Z_{t−1}^{-1}}} ≤ 2d log((dλ + TM^2)/(dλ)),

where the second equality follows from the matrix inversion lemma and the second inequality holds by Lemma G.2.

Lemma G.4. Let Σ_t := (4A^2K^2/κ)I + κ Σ_{i=1}^t a_i a_i^T and σ_max := max_{t≥1} σ_t. Suppose R_t is an upper bound on max_{1≤s≤t} |ε_s|. Then with probability at least 1 − δ, it holds simultaneously for all t ≥ 1 that

Σ_{s=1}^t (ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{Σ_s^{-1}} ≤ (σ_max R_t/K)·√(d log(1 + tAκ^2/(4dK^2)) log(2t^2/δ)) + (2/(3κ))(σ_max^2 + R_t^2) log(2t^2/δ).

Proof. To bound the sum of the variances of the terms, we calculate

Σ_{s=1}^t Var[(ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{Σ_s^{-1}}] ≤ Σ_{s=1}^t E[ε_s^4] ‖a_s‖^4_{Σ_s^{-1}} ≤ Σ_{s=1}^t (κ/(4K^2)) E[ε_s^2] R_t^2 ‖a_s‖^2_{Σ_s^{-1}} ≤ (σ_max^2 R_t^2/(2K^2)) d log(1 + tAκ^2/(4dK^2)),

where the second inequality follows from the definition of Σ_t and the third inequality holds by Lemma G.2. Also note that

max_{1≤s≤t} (ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{Σ_s^{-1}} ≤ (σ_max^2 + R_t^2)/κ,

since Σ_t ⪰ (4A^2K^2/κ)I + κ a_s a_s^T. Then applying Freedman's inequality gives, for arbitrary t ≥ 1,

Σ_{s=1}^t (ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{Σ_s^{-1}} ≤ (σ_max R_t/K)·√(d log(1 + tAκ^2/(4dK^2)) log(1/δ)) + (2/3)·(1/κ)(σ_max^2 + R_t^2) log(1/δ)

with probability at least 1 − δ. Applying a union bound over all t ≥ 1, since Σ_{t≥1} δ/(2t^2) ≤ δ, we have with probability at least 1 − δ,

Σ_{s=1}^t (ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{Σ_s^{-1}} ≤ (σ_max R_t/K)·√(d log(1 + tAκ^2/(4dK^2)) log(2t^2/δ)) + (2/(3κ))(σ_max^2 + R_t^2) log(2t^2/δ).   (G.3)

Theorem G.5 (Restatement of Theorem 6.5). If we set φ(θ) = (2A^2K^2/κ)‖θ‖_2^2 and assume that all the data points fed into the algorithm have noise variance bounded by σ_max^2, then with probability at least 1 − 3δ, for all t ≥ 1, the regret of Algorithm 3 for the first t rounds is bounded as follows:

reg_t ≤ 8A^2K^2B^2/κ + (9/(2κ)) R^2 log^2(4t^2/δ) + (3σ_max^2/κ) d log(1 + tAκ^2/(4dK^2)).

Proof.
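The proof of Lemma G.3 rests on the rank-one update identity x^T Z_t^{-1} x = x^T Z_{t−1}^{-1} x / (1 + x^T Z_{t−1}^{-1} x) from the Sherman–Morrison formula. A small pure-Python check in d = 2 (the matrix helpers are our own, written out by hand to keep the sketch dependency-free):

```python
import random

def mat_vec(M, v):
    return [M[0][0]*v[0] + M[0][1]*v[1], M[1][0]*v[0] + M[1][1]*v[1]]

def quad(M, v):           # v^T M v
    w = mat_vec(M, v)
    return v[0]*w[0] + v[1]*w[1]

def inv2(M):              # inverse of a 2x2 matrix
    det = M[0][0]*M[1][1] - M[0][1]*M[1][0]
    return [[ M[1][1]/det, -M[0][1]/det],
            [-M[1][0]/det,  M[0][0]/det]]

def add_outer(M, v):      # M + v v^T
    return [[M[i][j] + v[i]*v[j] for j in range(2)] for i in range(2)]

random.seed(2)
Z = [[1.0, 0.0], [0.0, 1.0]]   # Z_0 = lambda * I with lambda = 1
ok = True
for _ in range(200):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    before = quad(inv2(Z), x)   # ||x||^2 w.r.t. Z_{t-1}^{-1}
    Z = add_outer(Z, x)
    after = quad(inv2(Z), x)    # ||x||^2 w.r.t. Z_t^{-1}
    ok = ok and abs(after - before / (1.0 + before)) < 1e-9
```

Since b/(1 + b) ≤ min{1, b}, summing this identity over t reduces Lemma G.3 to the elliptical potential bound of Lemma G.2.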
For simplicity, let L_t(θ) = Σ_{s=1}^{t−1} ℓ(θ^T a_s, r_s) + φ(θ) and loss_t(θ) = L_t(θ) − φ(θ). Suppose that the event E_1, the event E_subG := {∀t ≥ 1, max_{1≤s≤t} |ε_s| ≤ R√(2 log(4t^2/δ))}, and the event described in Lemma G.4 (denoted by E_2) hold simultaneously throughout the following proof. By the sub-Gaussianity of ε_t we have

P(∃t ≥ 1: max_{1≤s≤t} |ε_s| ≥ R√(2 log(4t^2/δ))) ≤ Σ_{s≥1} P(|ε_s| ≥ R√(2 log(4s^2/δ))) ≤ Σ_{s≥1} δ/(2s^2) ≤ δ.   (G.4)

Hence P(E_subG) ≥ 1 − δ. Applying Lemma G.1 and Lemma G.4, we have P(E_subG ∩ E_1 ∩ E_2) ≥ 1 − 3δ by a union bound. From the update rule of Algorithm 3, we calculate

Σ_{s=1}^t ℓ(a_s^T θ_s, r_s) − Σ_{s=1}^t ℓ(a_s^T θ*, r_s) = Σ_{s=1}^t [loss_s(θ_s) − loss_{s+1}(θ_{s+1}) + ℓ(a_s^T θ_s, r_s)] + loss_{t+1}(θ_{t+1}) − loss_{t+1}(θ*) = Σ_{s=1}^t [L_{s+1}(θ_s) − L_{s+1}(θ_{s+1})] − φ(θ_1) + φ(θ*) + L_{t+1}(θ_{t+1}) − L_{t+1}(θ*) ≤ 2 max_{θ∈Θ} |φ(θ)| + Σ_{s=1}^t [L_{s+1}(θ_s) − L_{s+1}(θ_{s+1})] =: 2 max_{θ∈Θ} |φ(θ)| + I_1,   (G.5)

where the first equality follows from the definition of loss_t, the second holds by the definition of L_t, and the inequality holds since L_{t+1}(θ_{t+1}) = min_{θ∈R^d} L_{t+1}(θ). Then we continue to bound I_1:

I_1 = Σ_{s=1}^t [L_{s+1}(θ_s) − L_{s+1}(θ_{s+1})]
 = Σ_{s=1}^t [−⟨∂L_{s+1}/∂θ(θ_s), θ_{s+1} − θ_s⟩ − (θ_{s+1} − θ_s)^T H_{s+1}(θ̃_s)(θ_{s+1} − θ_s)]
 = Σ_{s=1}^t [−⟨(h(a_s^T θ_s) − r_s) a_s, θ_{s+1} − θ_s⟩ − (θ_{s+1} − θ_s)^T H_{s+1}(θ̃_s)(θ_{s+1} − θ_s)]
 ≤ (1/4) Σ_{s=1}^t (h(a_s^T θ_s) − r_s)^2 ‖a_s‖^2_{H_{s+1}^{-1}(θ̃_s)}
 ≤ (1/2) Σ_{s=1}^t (h(a_s^T θ_s) − h(a_s^T θ*))^2 ‖a_s‖^2_{H_{s+1}^{-1}(θ̃_s)} + (1/2) Σ_{s=1}^t ε_s^2 ‖a_s‖^2_{H_{s+1}^{-1}(θ̃_s)}
 ≤ (K^2/2) Σ_{s=1}^t (a_s^T θ_s − a_s^T θ*)^2 ‖a_s‖^2_{H_{s+1}^{-1}(θ̃_s)} + (1/2) Σ_{s=1}^t ε_s^2 ‖a_s‖^2_{H_{s+1}^{-1}(θ̃_s)} =: (K^2/2) I_2 + (1/2) I_3,   (G.6)

where the second equality holds by a Taylor expansion (H denotes the Hessian matrix and θ̃_s ∈ R^d lies between θ_s and θ_{s+1}), the third equality follows from ∂L_s/∂θ(θ_s) = 0 and ∂L_{s+1}/∂θ(θ_s) = ∂L_s/∂θ(θ_s) + ∂ℓ(θ^T a_s, r_s)/∂θ(θ_s), the first inequality is obtained by maximizing the quadratic in θ_{s+1} − θ_s, and the second inequality follows from (a + b)^2 ≤ 2a^2 + 2b^2.
From the definition of H, we calculate

H_{s+1}(θ) = (∂/∂θ)[∇_θ L_{s+1}]   (G.7)
 = (∂/∂θ)[Σ_{i=1}^s (h(a_i^T θ) − r_i)·a_i + ∇_θ φ]   (G.8)
 ⪰ (4A^2K^2/κ)I + κ Σ_{i=1}^s a_i a_i^T = Σ_s.   (G.9)

From Lemma G.1 and ‖a_s‖^2_{Σ_s^{-1}} ≤ κ/(4K^2),

I_2 ≤ [(4/κ) reg_t + (8R^2/κ^2) log(4t^2/δ)]·κ/(4K^2) ≤ (1/K^2) reg_t + (2R^2/(κK^2)) log(4t^2/δ).   (G.10)

We bound I_3 by decomposing it into its expected value and a zero-mean term:

I_3 ≤ Σ_{s=1}^t σ_s^2 ‖a_s‖^2_{H_{s+1}^{-1}(θ̃_s)} + Σ_{s=1}^t (ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{H_{s+1}^{-1}(θ̃_s)} ≤ Σ_{s=1}^t σ_s^2 ‖a_s‖^2_{Σ_s^{-1}} + Σ_{s=1}^t (ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{Σ_s^{-1}} ≤ (2σ_max^2/κ) d log(1 + tAκ^2/(4dK^2)) + I_4,   (G.11)

where I_4 := Σ_{s=1}^t (ε_s^2 − E[ε_s^2]) ‖a_s‖^2_{Σ_s^{-1}}, the second inequality follows from (G.9), and the third holds by Lemma G.3. Substituting (G.11) and (G.10) into (G.6), we have

I_1 ≤ (1/2) reg_t + (R^2/κ) log(4t^2/δ) + (σ_max^2/κ) d log(1 + tAκ^2/(4dK^2)) + (1/2) I_4.   (G.12)

Applying Lemma G.4 to bound I_4, we calculate

I_4 ≤ (σ_max R_t/K)·√(d log(1 + tAκ^2/(4dK^2)) log(2t^2/δ)) + (2/(3κ))(σ_max^2 + R_t^2) log(2t^2/δ)
 ≤ (σ_max R/K)·√(2d log(1 + tAκ^2/(4dK^2)))·log(4t^2/δ) + (2/3)(σ_max^2 + 2R^2 log(4t^2/δ))·(1/κ) log(4t^2/δ)
 ≤ (σ_max R/K)·√(2d log(1 + tAκ^2/(4dK^2)))·log(4t^2/δ) + (2R^2/κ) log^2(4t^2/δ),   (G.13)

where the second inequality holds under the event E_subG (which gives R_t ≤ R√(2 log(4t^2/δ))) and the third follows from σ_max ≤ R. Substituting (G.13) into (G.12), we have

I_1 ≤ (1/2) reg_t + (R^2/κ) log(4t^2/δ) + (σ_max^2/κ) d log(1 + tAκ^2/(4dK^2)) + (σ_max R/(2K))·√(2d log(1 + tAκ^2/(4dK^2)))·log(4t^2/δ) + (R^2/κ) log^2(4t^2/δ)
 ≤ (1/2) reg_t + (2R^2/κ) log^2(4t^2/δ) + (σ_max^2/κ) d log(1 + tAκ^2/(4dK^2)) + (1/(2κ)) σ_max^2 d log(1 + tAκ^2/(4dK^2)) + (1/(4κ)) R^2 log^2(4t^2/δ),

from which, combined with (G.5), we can complete the proof by the arbitrariness of t.

In the following lemma, we formally introduce the conversion from online learning regret to a confidence set in our setting. For simplicity of analysis, we omit the level subscript and suppose all the data is fed into the same level.

Lemma G.6. Proof. With Lemma G.1, we can prove this lemma by following nearly the same proof as for Theorem 1 in Jun et al. (2017).
(We can set β_t in their proof to 1 + (4/κ) reg_t + (8R^2/κ^2) log(4t^2/δ), according to Lemma G.1.)

Lemma G.7. For all t, with z_t and X_t defined as in Lemma G.8, we have ‖z_t‖_2^2 − θ̄_t^T X_t^T z_t ≥ 0.

Proof. After ridge regression, θ̄_t = V_t^{-1} X_t^T z_t, where V_t := λI + X_t^T X_t. Then we have

‖z_t‖_2^2 − θ̄_t^T X_t^T z_t = ‖z_t‖_2^2 − (V_t^{-1} X_t^T z_t)^T X_t^T z_t = ‖z_t‖_2^2 − z_t^T X_t V_t^{-1} X_t^T z_t.   (G.18)

We consider the block matrix

[[λI + X_t^T X_t, X_t^T], [X_t, I]] ⪰ [[X_t^T X_t, X_t^T], [X_t, I]] = [X_t, I]^T [X_t, I] ⪰ 0.   (G.19)

From the Schur complement theorem, I − X_t(λI + X_t^T X_t)^{-1} X_t^T ⪰ 0, i.e., X_t V_t^{-1} X_t^T ⪯ I.   (G.20)

Then we can complete the proof by substituting (G.20) into (G.18).

Lemma G.8 (Variance-dependent confidence set for generalized linear bandits). Suppose that Assumptions 6.1, 6.2 and 6.3 hold. For any δ ∈ (0, 1/4), if we set β_{t,l} := 1 + 32A^2K^2B^2/κ^2 + (26/κ^2) R^2 log^2(4t^2L/δ) + (12·2^{2(l+1)} σ^2/κ^2) d log(1 + tAκ^2/(4dK^2)) + λB^2 for all t ∈ [T], l ∈ [L], with σ = R/√d, then with probability 1 − 4δ, the regret of Algorithm 1 for the first T rounds is bounded as follows:

Regret(T) = O((K/κ) d√J + (K/κ)(K·AB + R)√(dT)).

Proof. Here the first equality holds by the definition in (3.1), the first inequality follows from Assumption 6.2, and the second inequality holds by Lemma G.8.
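Lemma G.7's claim — that the ridge residual ‖z‖^2 − θ̄^T X^T z is nonnegative because X V^{-1} X^T ⪯ I — can be verified numerically for small instances. The sketch below works in d = 2 with a hand-rolled 2×2 inverse (the helper name and λ value are ours):

```python
import random

def ridge_residual_nonneg(rows, z, lam):
    """Check ||z||^2 - theta_bar^T X^T z >= 0 for theta_bar = V^{-1} X^T z,
    V = lam*I + X^T X, in dimension d = 2."""
    # V = lam*I + sum_i a_i a_i^T
    V = [[lam, 0.0], [0.0, lam]]
    for a in rows:
        for i in range(2):
            for j in range(2):
                V[i][j] += a[i] * a[j]
    # b = X^T z
    b = [sum(a[i] * zi for a, zi in zip(rows, z)) for i in range(2)]
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    theta = [(V[1][1] * b[0] - V[0][1] * b[1]) / det,
             (V[0][0] * b[1] - V[1][0] * b[0]) / det]   # theta = V^{-1} b
    return sum(zi * zi for zi in z) - (theta[0] * b[0] + theta[1] * b[1]) >= -1e-9

random.seed(3)
ok = all(ridge_residual_nonneg(
             [[random.gauss(0, 1) for _ in range(2)] for _ in range(5)],
             [random.gauss(0, 1) for _ in range(5)],
             0.5)
         for _ in range(100))
```

The quantity θ̄^T X^T z equals z^T X V^{-1} X^T z, so the check is exactly the PSD comparison X V^{-1} X^T ⪯ I established via the Schur complement in (G.19)–(G.20).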

Regret(T ) =

For an arbitrary l \in [L], let I_1(l) := \{t \in \Psi_{T+1,l} : \|x_t\|_{V_{t-1,l}^{-1}} \le 1\} and I_2(l) := \Psi_{T+1,l} \setminus I_1(l). Here the second inequality holds due to the Cauchy-Schwarz inequality, the fourth inequality follows from the definition of \beta_{t,l}, the sixth inequality holds since l_t in Algorithm 1 satisfies 2^{l_t}\bar\sigma \le \max\{\sigma_t, \bar\sigma\}, and the last inequality follows from the definition of J.
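The split of the rounds at a level into I_1(l) (small elliptical norm) and I_2(l) (large elliptical norm) can be mimicked directly in code. The sketch below is illustrative only: the actions, dimension, and regularization \lambda are synthetic stand-ins, and the covariance update follows the usual rank-one recursion.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, lam = 30, 4, 1.0
xs = rng.normal(size=(T, d))

V = lam * np.eye(d)   # V_{0,l} = lambda * I
I1, I2 = [], []       # rounds with small / large elliptical norm
for t, x in enumerate(xs):
    norm = np.sqrt(x @ np.linalg.solve(V, x))   # ||x_t||_{V_{t-1,l}^{-1}}
    (I1 if norm <= 1.0 else I2).append(t)
    V += np.outer(x, x)                         # V_{t,l} = V_{t-1,l} + x_t x_t^T

# I_1(l) and I_2(l) partition all rounds assigned to this level
assert len(I1) + len(I2) == T
```

The elliptical-potential argument used later guarantees that only a few rounds can land in I_2(l), since each such round forces a constant multiplicative growth of det(V).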



MULTI-LEVEL LEARNING FRAMEWORK

We introduce the multi-level learning framework in this section.
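The core of the framework is to partition the observed data into levels according to the variance of the corresponding rewards and to run an online learner at each level. The bucketing rule sketched below (the names `variance_level`, `sigma_bar`, and `L`) is a hypothetical dyadic scheme written only to illustrate this idea; it is not claimed to be the paper's exact assignment rule.

```python
import math

def variance_level(sigma_t: float, sigma_bar: float, L: int) -> int:
    """Map an observed noise level sigma_t to a dyadic level in {1, ..., L}.

    Hypothetical rule for illustration: level l collects rounds whose
    clipped standard deviation lies in (sigma_bar * 2^(l-1), sigma_bar * 2^l].
    """
    ratio = max(sigma_t, sigma_bar) / sigma_bar   # clip small variances to sigma_bar
    return min(L, max(1, math.ceil(math.log2(ratio))))

# Route a stream of observations to per-level learners.
levels: dict[int, list[int]] = {}
for t, sigma_t in enumerate([0.1, 0.5, 1.3, 4.0, 0.2]):
    levels.setdefault(variance_level(sigma_t, sigma_bar=0.5, L=8), []).append(t)
# levels now maps each level to the rounds it is responsible for:
# {1: [0, 1, 4], 2: [2], 3: [3]} for this stream
```

With such a partition, the data inside one level has (up to a factor of 2) homogeneous noise, so a level-wise confidence set can be built without the weighted regression that linearity would otherwise require.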



Assumption 3.4 (Known Reward Function Class). The unknown reward function f* belongs to an accessible function class F = {f_θ : A → R | θ ∈ Θ}.

which still improves upon the \tilde{O}\big(K(KAB + R)d\sqrt{T}/\kappa\big) result provided by Jun et al. (2017).

Remark 6.10. Applying the regret bound in Corollary 5.4 to generalized linear bandits, we obtain a regret bound of \tilde{O}\big(K(d\sqrt{J} + d\sqrt{RT})/\kappa\big) for the case where K \cdot A \cdot B = 1 and R = \Omega(1). Our bound here improves upon the general result when R = o(d).

Remark 6.11. When restricted to heteroscedastic linear bandits by setting \kappa = K = 1, our result becomes \tilde{O}\big((R + A \cdot B)\sqrt{dT} + d\sqrt{J}\big), which is the same as the \tilde{O}\big(R\sqrt{dT} + d\sqrt{J}\big) regret in Zhou et al. (2021) when R = \Omega(AB).

Figure 1: Cumulative regret comparison between ML^2 and GLOC on synthetic data.

Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S. Du. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture MDP. arXiv preprint arXiv:2101.12745, 2021.

Dongruo Zhou, Quanquan Gu, and Csaba Szepesvári. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pp. 4532-4576. PMLR, 2021.

A summary of our regret results and previous results under different settings.

l from Step 1.

Step 3: Bounding the sum of widths. From Step 2, the last step to bound the regret is to bound the summation of the widths w_{C_{t,l}}(a_t). According to the definition of the eluder dimension, we can further bound w_{C_{t,l}}(a
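The width of a confidence set C at an action a is the range of predictions the set still permits, w_C(a) = sup_{f, f' ∈ C} (f(a) - f'(a)), as in the usual eluder-dimension analysis. The `width` helper below is a hypothetical toy sketch over a finite candidate set, written only to make the quantity concrete:

```python
def width(confidence_set, a):
    """Width of a finite confidence set C at action a:
    w_C(a) = max_{f in C} f(a) - min_{f' in C} f'(a)."""
    values = [f(a) for f in confidence_set]
    return max(values) - min(values)

# Toy confidence set of candidate linear reward functions.
C = [lambda a: 1.0 * a, lambda a: 0.8 * a, lambda a: 1.1 * a]
w = width(C, a=2.0)   # predictions 2.0, 1.6, 2.2 -> width 0.6
```

Summing such widths over the played actions, and charging each large-width round against the eluder dimension of F, is exactly the mechanism behind Step 3.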

Suppose we feed the loss functions \{\ell_s(\theta)\}_{s=1}^{t} into a single online learner B. Assume that B has an online-learning (OL) regret bound \mathrm{reg}_t for all t \ge 1. Define X_t as the design matrix with rows a_1, \ldots, a_t, and z_t = [z_1, \ldots, z_t]^\top. Then, with probability at least 1 - 4\delta, for all t \ge 1,

\|\theta^* - \theta_t\|_{V_t}^2 \le \cdots \log(4t^2/\delta) + \lambda B^2 - \big(\|z_t\|_2^2 - \theta_t^\top X_t^\top z_t\big).   (G.17)

\log(4t^2 L/\delta) + \lambda B^2, making use of Lemma G.8 and Lemma G.7.

Theorem G.9. Suppose that Assumptions 6.1 and 6.2 hold for the known reward function class F. If we apply Algorithm 4 as a subroutine of Algorithm 1 (in line 9) and set \beta_{t,l} to

(AB + \beta_{T,l})\sqrt{2|\Psi_{T+1,l}|\, d \log\Big(\frac{d\lambda + TA^2}{d\lambda}\Big)}
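The logarithmic factor in the display above comes from the standard elliptical-potential argument: for actions with \|a_s\| \le A and V_0 = \lambda I, one has \sum_{s=1}^{T} \min\big(1, \|a_s\|_{V_{s-1}^{-1}}^2\big) \le 2d \log\big((d\lambda + TA^2)/(d\lambda)\big). A quick numeric check of this inequality, with arbitrary synthetic actions and illustrative choices of \lambda and A:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, lam, A_max = 200, 4, 1.0, 1.0
a = rng.uniform(-1, 1, (T, d))
a /= np.maximum(1.0, np.linalg.norm(a, axis=1, keepdims=True) / A_max)

V = lam * np.eye(d)
total = 0.0
for x in a:
    total += min(1.0, x @ np.linalg.solve(V, x))   # min(1, ||a_s||^2_{V_{s-1}^{-1}})
    V += np.outer(x, x)                            # V_s = V_{s-1} + a_s a_s^T

# elliptical potential bound: sum <= 2 d log((d*lam + T*A^2) / (d*lam))
bound = 2 * d * np.log((d * lam + T * A_max**2) / (d * lam))
assert total <= bound + 1e-9
```

The bound holds deterministically for any action sequence respecting the norm constraint, which is why it can be applied level by level to the index sets \Psi_{T+1,l}.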

