BANDIT LEARNING WITH GENERAL FUNCTION CLASSES: HETEROSCEDASTIC NOISE AND VARIANCE-DEPENDENT REGRET BOUNDS

Abstract

We consider learning a stochastic bandit model, where the reward function belongs to a general class of uniformly bounded functions and the additive noise can be heteroscedastic. Our model captures contextual linear bandits and generalized linear bandits as special cases. While previous works (Kirschner & Krause, 2018; Zhou et al., 2021) based on weighted ridge regression can deal with linear bandits with heteroscedastic noise, they are not directly applicable to our general model due to the curse of nonlinearity. To tackle this problem, we propose a multi-level learning framework for the general bandit model. The core idea of our framework is to partition the observed data into different levels according to the variance of their respective rewards and to perform online learning at each level collaboratively. Under our framework, we first design an algorithm that constructs a variance-aware confidence set based on empirical risk minimization and prove a variance-dependent regret bound. For generalized linear bandits, we further propose an algorithm based on a follow-the-regularized-leader (FTRL) subroutine and online-to-confidence-set conversion, which achieves a tighter variance-dependent regret bound under certain conditions.

1. INTRODUCTION

Over the past decade, stochastic bandit algorithms have found a wide variety of applications in online advertising, website optimization, recommendation systems and many other tasks (Li et al., 2010; McInerney et al., 2018). In the stochastic bandit model, at each round an agent selects an action and observes a noisy evaluation of the reward function at the chosen action, aiming to maximize the sum of the received rewards. A general reward function governs the reward of each action from the eligible action set. A common assumption in stochastic bandit problems is that the observation noise is conditionally independent and satisfies a uniform tail bound. In real-world applications, however, the variance of the observation noise is likely to depend on the evaluation point (chosen action) (Kirschner & Krause, 2018). Moreover, due to the dynamic environment in practice, the variance of each action may also differ from round to round. This motivates the study of bandit problems with heteroscedastic noise. For example, Kirschner & Krause (2018) introduced the heteroscedastic noise setting, where the noise distribution is allowed to depend on the evaluation point. They proposed weighted least squares to estimate the unknown reward function more accurately in the setting where the underlying reward function is linear or lies in a separable Hilbert space (Section 5, Kirschner & Krause 2018). In this paper, we consider a general setting, where the unknown reward function belongs to a known general function class $\mathcal{F}$ with bounded eluder dimension (Russo & Van Roy, 2013). This captures multi-armed bandits, linear contextual bandits (Abbasi-Yadkori et al., 2011) and generalized linear bandits (Filippi et al., 2010) simultaneously. Since weighted least squares depends heavily on the linearity of the function class, we propose a multi-level learning framework for our general setting.
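As a point of reference, the weighted least squares estimator discussed above can be sketched in a few lines. This is a minimal illustration assuming the per-round noise variances $\sigma_t^2$ are known to the learner; the function name, interface, and regularization choice are ours, not taken from the cited works:

```python
import numpy as np

def weighted_ridge_estimate(X, y, variances, lam=1.0):
    """Weighted ridge regression (in the spirit of Kirschner & Krause, 2018;
    Zhou et al., 2021): down-weight high-variance observations.

    Minimizes  sum_t (y_t - <x_t, theta>)^2 / sigma_t^2 + lam * ||theta||_2^2,
    whose closed form is solved below. X is (n, d), y is (n,), and
    `variances` holds the per-round noise variances sigma_t^2 (assumed known).
    """
    w = 1.0 / np.asarray(variances)                       # weight 1 / sigma_t^2
    A = (X * w[:, None]).T @ X + lam * np.eye(X.shape[1]) # weighted Gram matrix
    b = (X * w[:, None]).T @ y
    return np.linalg.solve(A, b)
```

Because each sample is scaled by $1/\sigma_t^2$, low-noise rounds dominate the estimate. This closed form is exactly what is unavailable for the nonlinear function classes considered in this paper, which is what motivates the multi-level framework.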
The underlying idea of the framework is to partition the observed data into levels according to the variance of the noise. The agent estimates the reward function at each level independently and exploits all levels when selecting an action at each round. While previous work by Kirschner & Krause (2018) considered sub-Gaussian noise with nonuniform variance proxies, we only assume nonuniform noise variances (Zhou et al., 2021; Zhang et al., 2021), which brings a new challenge: exploiting the variance information of the noise to obtain tighter variance-aware confidence sets. Under our multi-level learning framework, we first design an algorithm based on empirical risk minimization and the Optimism-in-the-Face-of-Uncertainty (OFU) principle, and prove a variance-dependent regret bound. For a special class of bandits, namely generalized linear bandits with heteroscedastic noise, we further propose an algorithm that uses follow-the-regularized-leader (FTRL) as an online regression subroutine and adopts the technique of online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012; Jun et al., 2017). This algorithm achieves a provably tighter regret bound when the range of the reward function is wide relative to the magnitude of the noise. Our main contributions are summarized as follows:
• We develop a new framework called multi-level regression, which can be applied to heteroscedastic bandits even when the reward function class does not lie in a separable Hilbert space.
• Under our framework, we design tighter variance-aware upper confidence bounds for bandits with general reward functions, and propose a bandit learning algorithm based on empirical risk minimization. We show that our algorithm enjoys variance-dependent regret upper bounds that strictly extend previous variance-dependent results on simpler bandit models (Zhou et al., 2021; Zhang et al., 2021).
• For generalized linear bandits (Filippi et al., 2010; Jun et al., 2017), a special case of our model class, we further propose an algorithm based on online-to-confidence-set conversion. We first prove a variance-dependent regret bound for follow-the-regularized-leader (FTRL) on the online regression problem derived from the generalized linear function class, and then convert the online learning regret bound into a bandit learning confidence set. We show that our algorithm achieves a tighter regret bound for generalized linear bandits.
• As a by-product, our regret bound for FTRL improves the state-of-the-art result $O(d^2 R^2)$ for stochastic online linear regression (Ouhamma et al., 2021) to $O(d\sigma_{\max}^2)$ (omitting terms without dependence on $d$), where $d$ is the dimension of contexts, $R$ is the upper bound on the sub-Gaussian norm of the noise at each step, and $\sigma_{\max}$ is the upper bound on the noise variances.

Table: our algorithm for generalized linear bandits achieves regret $O\big(\frac{K}{\kappa} d\sqrt{J} + \frac{K}{\kappa}(KAB + R)\sqrt{dT}\big)$ and is computationally efficient. Refer to Section 3 for the definitions of $\dim_E$, $\sigma_t$, $J$ and $R$, and to Section 6 for the definitions of $\kappa$, $K$, $A$, $B$. We write a general function class with eluder dimension $\dim_E$ as 'General' and the generalized linear function class as 'G-Lin' for short. Oracle efficiency refers to computational efficiency given a regression oracle (i.e., empirical risk minimization) for the involved function class and an optimization oracle that maximizes the reward function $f(x)$ for a fixed $x$ over some constraint set of $f$.

Notation. We use lower case letters to denote scalars, and lower and upper case bold face letters to denote vectors and matrices, respectively. We denote by $[n]$ the set $\{1, \ldots, n\}$. For a vector $\mathbf{x} \in \mathbb{R}^d$ and a positive semi-definite matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$, we denote by $\|\mathbf{x}\|_2$ the vector's Euclidean norm and define $\|\mathbf{x}\|_{\boldsymbol{\Sigma}} = \sqrt{\mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x}}$. For two positive sequences $\{a_n\}$ and $\{b_n\}$ with $n = 1, 2, \ldots$
, we write $a_n = O(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \le C b_n$ holds for all $n \ge 1$, and write $a_n = \Omega(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \ge C b_n$ holds for all $n \ge 1$. Let $\mathcal{N}(\mathcal{F}, \alpha, \|\cdot\|_\infty)$ denote the $\alpha$-covering number of $\mathcal{F}$ in the sup-norm $\|\cdot\|_\infty$.
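To make the partition step of the multi-level framework concrete, the sketch below buckets samples by noise variance. The dyadic bucketing rule, names, and interface are our illustrative assumptions rather than the paper's exact construction:

```python
import math
from collections import defaultdict

def variance_level(sigma_sq, sigma_max_sq=1.0):
    """Map a noise variance to a dyadic level: level l collects samples with
    sigma_t^2 in (sigma_max^2 * 2^{-(l+1)}, sigma_max^2 * 2^{-l}].
    Variances at or above sigma_max^2 are clipped into level 0."""
    if sigma_sq <= 0:
        raise ValueError("variance must be positive")
    return max(0, math.floor(-math.log2(sigma_sq / sigma_max_sq)))

def partition_by_variance(data):
    """data: iterable of (x_t, y_t, sigma_t_sq) triples.

    Returns a dict mapping level -> list of (x_t, y_t). The agent would then
    fit a separate regression (e.g. ERM over the class F) within each level
    and combine the per-level confidence sets when selecting an action.
    """
    levels = defaultdict(list)
    for x, y, s2 in data:
        levels[variance_level(s2)].append((x, y))
    return dict(levels)
```

Within a level, variances differ by at most a factor of two, so unweighted estimation inside each bucket loses little compared to exact variance weighting while avoiding any reliance on linearity.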



Table: A summary of our regret results and previous results under different settings.

