BANDIT LEARNING WITH GENERAL FUNCTION CLASSES: HETEROSCEDASTIC NOISE AND VARIANCE-DEPENDENT REGRET BOUNDS

Abstract

We consider learning a stochastic bandit model, where the reward function belongs to a general class of uniformly bounded functions, and the additive noise can be heteroscedastic. Our model captures contextual linear bandits and generalized linear bandits as special cases. While previous works (Kirschner & Krause, 2018; Zhou et al., 2021) based on weighted ridge regression can deal with linear bandits with heteroscedastic noise, they are not directly applicable to our general model due to the curse of nonlinearity. To tackle this problem, we propose a multi-level learning framework for the general bandit model. The core idea of our framework is to partition the observed data into different levels according to the variance of their respective rewards and to perform online learning at each level collaboratively. Under our framework, we first design an algorithm that constructs a variance-aware confidence set based on empirical risk minimization and prove a variance-dependent regret bound. For generalized linear bandits, we further propose an algorithm based on a follow-the-regularized-leader (FTRL) subroutine and online-to-confidence-set conversion, which achieves a tighter variance-dependent regret bound under certain conditions.

1. INTRODUCTION

Over the past decade, stochastic bandit algorithms have found a wide variety of applications in online advertising, website optimization, recommendation systems, and many other tasks (Li et al., 2010; McInerney et al., 2018). In the stochastic bandit model, at each round, an agent selects an action and observes a noisy evaluation of the reward function at the chosen action, aiming to maximize the sum of the received rewards. A general reward function governs the reward of each action from the eligible action set. A common assumption in stochastic bandit problems is that the observation noise is conditionally independent and satisfies a uniform tail bound. In real-world applications, however, the variance of the observation noise is likely to depend on the evaluation point (chosen action) (Kirschner & Krause, 2018). Moreover, due to the dynamic environment in reality, the variance of each action may also differ across rounds. This motivates the study of bandit problems with heteroscedastic noise. For example, Kirschner & Krause (2018) introduced the heteroscedastic noise setting, where the noise distribution is allowed to depend on the evaluation point. They proposed weighted least squares to estimate the unknown reward function more accurately in the setting where the underlying reward function is linear or lies in a separable Hilbert space (Section 5, Kirschner & Krause 2018). In this paper, we consider a general setting, where the unknown reward function belongs to a known general function class F with bounded eluder dimension (Russo & Van Roy, 2013). This captures multi-armed bandits, linear contextual bandits (Abbasi-Yadkori et al., 2011), and generalized linear bandits (Filippi et al., 2010) simultaneously. Since weighted least squares relies heavily on the linearity of the function class, we propose a multi-level learning framework for our general setting.
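To make the weighted least squares idea concrete, the following sketch fits a linear reward under heteroscedastic noise by weighting each observation with its inverse variance, so low-noise rounds dominate the estimate. This is only an illustration of the principle behind Kirschner & Krause (2018) and Zhou et al. (2021), not a reproduction of their exact algorithms; the function name and the regularization parameter are our own choices.

```python
import numpy as np

def weighted_ridge(X, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression: sample i is weighted by 1/sigma2[i],
    so observations with smaller noise variance contribute more. lam is an
    (illustrative) ridge regularization parameter."""
    w = 1.0 / np.asarray(sigma2)                        # inverse-variance weights
    A = (X * w[:, None]).T @ X + lam * np.eye(X.shape[1])
    b = (X * w[:, None]).T @ y
    return np.linalg.solve(A, b)

# toy example: linear reward with per-round noise variances
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])
X = rng.normal(size=(200, 2))
sigma2 = rng.uniform(0.01, 1.0, size=200)               # heteroscedastic variances
y = X @ theta + rng.normal(scale=np.sqrt(sigma2))
theta_hat = weighted_ridge(X, y, sigma2)
```

With a nonlinear function class F, no such closed-form reweighted estimator exists, which is precisely the obstacle the multi-level framework is designed to circumvent.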
The underlying idea of the framework is to partition the observed data into levels according to the variance of the noise. The agent estimates the reward function at each level independently and then exploits all levels jointly when selecting an action at each round.
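The partitioning step above can be sketched as follows: each observation is routed to a level determined by a dyadic bucketing of its noise variance, and a separate dataset (on which a per-level estimator would be fit) accumulates at each level. The dyadic rule and the variance floor are illustrative assumptions, not necessarily the exact partition used in the paper.

```python
import math
from collections import defaultdict

def variance_level(sigma2, sigma2_min=1e-4):
    """Map a noise variance to a dyadic level: level l collects rounds whose
    variance lies in (2^{-(l+1)}, 2^{-l}]. sigma2_min is an illustrative
    floor that caps the number of levels."""
    s = max(sigma2, sigma2_min)
    return max(0, math.floor(-math.log2(s)))

# partition a stream of (action, reward, variance) triples into levels;
# a separate estimator per level can then be combined at selection time
levels = defaultdict(list)
stream = [(0, 1.2, 0.5), (1, 0.7, 0.05), (0, 1.1, 0.26), (2, 0.3, 0.01)]
for action, reward, s2 in stream:
    levels[variance_level(s2)].append((action, reward))
```

Here the rounds with variances 0.5 and 0.26 share level 1, while the low-noise rounds land in deeper levels, so their estimators enjoy tighter per-level confidence sets.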

