VARIANCE-AWARE SPARSE LINEAR BANDITS

Abstract

It is well known that for sparse linear bandits, when ignoring the dependency on the sparsity (which is much smaller than the ambient dimension), the worst-case minimax regret is $\Theta(\sqrt{dT})$, where $d$ is the ambient dimension and $T$ is the number of rounds. On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $O(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $O\big(\sqrt{d \sum_{t=1}^T \sigma_t^2} + 1\big)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates between the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regime (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm into a variance-aware algorithm for sparse linear bandits in a "black-box" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.

1. INTRODUCTION

This paper studies the sparse linear stochastic bandit problem, which is a special case of linear stochastic bandits. In linear bandits (Dani et al., 2008), the agent faces a sequential decision-making problem lasting for $T$ rounds. In the $t$-th round, the agent chooses an action $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$, where $\mathcal{X}$ is the action set, and receives a noisy reward $r_t = \langle \theta^*, x_t \rangle + \eta_t$, where $\theta^* \in \mathbb{R}^d$ is the (hidden) parameter of the game and $\eta_t$ is zero-mean random noise. The goal of the agent is to minimize her regret $R_T$, that is, the difference between the best cumulative reward $\max_{x \in \mathcal{X}} \sum_{t=1}^T \langle \theta^*, x \rangle$ and her cumulative reward $\sum_{t=1}^T \langle \theta^*, x_t \rangle$ (see Eq. (1) for a formal definition). Dani et al. (2008) proved that the minimax optimal regret for linear bandits is $\Theta(d\sqrt{T})$ when the noises are independent Gaussian random variables with mean 0 and variance 1 and both $\theta^*$ and the actions $x_t$ lie in the unit sphere in $\mathbb{R}^d$.foot_0

In real-world applications such as recommendation systems, only a few features may be relevant despite a large candidate feature space. In other words, the high-dimensional linear model may in fact admit a low-dimensional structure. As a result, if we still use the linear bandit model, we will always suffer $\Omega(d\sqrt{T})$ regret no matter how few features are actually useful. Motivated by this, the sparse linear stochastic bandit problem was introduced (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). This problem has the additional constraint that the hidden parameter $\theta^*$ is sparse, i.e., $\|\theta^*\|_0 \le s$ for some $s \ll d$. However, the agent has no prior knowledge of $s$, and thus the interaction protocol is exactly the same as that of linear bandits. The minimax optimal regret for sparse linear bandits is $\Theta(\sqrt{sdT})$ (Abbasi-Yadkori et al., 2012; Antos & Szepesvári, 2009).foot_1 This bound bypasses the $\Omega(d\sqrt{T})$ lower bound for linear bandits since we always have $s = \|\theta^*\|_0 \le d$, and it holds even though the agent does not have access to $s$ (though a few previous works assumed a known $s$).
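The interaction protocol above can be sketched in a few lines. The following is a minimal simulation of the linear bandit game and its regret, not an algorithm from this paper; the uniform-random policy at the end is a placeholder assumption used only to exercise the loop, and all function names are our own.

```python
import numpy as np

def run_linear_bandit(theta_star, choose_action, T, sigma=1.0, seed=0):
    """Simulate the linear bandit protocol and return the regret R_T.

    `choose_action(history)` is any policy mapping past (x_t, r_t)
    pairs to the next unit-norm action; `sigma` is the noise level."""
    rng = np.random.default_rng(seed)
    best_reward = np.linalg.norm(theta_star)  # optimum over the unit sphere
    history, cum_reward = [], 0.0
    for t in range(T):
        x = choose_action(history)
        r = theta_star @ x + sigma * rng.standard_normal()  # noisy feedback
        history.append((x, r))
        cum_reward += theta_star @ x  # regret is measured on expected rewards
    return T * best_reward - cum_reward

def random_policy_factory(d, seed=1):
    """Placeholder policy: uniform-random unit vectors (not low-regret)."""
    rng = np.random.default_rng(seed)
    def policy(history):
        x = rng.standard_normal(d)
        return x / np.linalg.norm(x)
    return policy
```

Any concrete bandit algorithm would replace `random_policy_factory` with a policy that actually uses `history` to estimate $\theta^*$.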

However, both the $O(d\sqrt{T})$ and the $O(\sqrt{sdT})$ bounds are worst-case regret bounds and are sometimes too pessimistic, especially when $d$ is large. On the other hand, many problems with delicate structures permit a regret bound much smaller than the worst-case bound. The structure this paper focuses on is the magnitude of the noise. Consider the following motivating example.

Motivating Example (Deterministic Sparse Linear Bandits). Consider the case where the action set is the unit sphere $\mathcal{X} = \mathbb{S}^{d-1}$ and there is no noise, i.e., the feedback is $r_t = \langle \theta^*, x_t \rangle$ for each round $t \in [T]$. In this case, one can identify all non-zero coordinates of $\theta^*$ in $O(s \log d)$ steps with high probability via a divide-and-conquer algorithm, and thus achieve a dimension-free regret of $O(s)$ (see Appendix C for more details).foot_2 However, this divide-and-conquer algorithm is specific to deterministic sparse linear bandit problems and does not work for noisy models. Hence, we study the following natural question:

Can we design an algorithm whose regret adapts to the noise level, so that the regret interpolates between the $\sqrt{dT}$-type bound in the worst case and the dimension-free bound in the deterministic case?

Before introducing our results, we would like to mention that there are recent works studying noise-adaptivity in linear bandits (Zhou et al., 2021; Zhang et al., 2021; Kim et al., 2021). Their variance-aware bounds reduce to a $\sqrt{T}$-type bound in the worst case when $\sigma_t = \Omega(1)$, and to a constant-type regret $O(\mathrm{poly}(d))$ that is independent of $T$ in the deterministic case. However, compared with the linear bandit setting, a variance-aware bound for sparse linear bandits is more significant because it reduces to a dimension-free bound in the noiseless setting. Despite this, to our knowledge, no variance-aware regret bounds exist for sparse linear bandits.
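In the noiseless setting, the divide-and-conquer idea can be sketched as follows: test whether a block of coordinates intersects the support by querying a random unit vector supported on that block (with probability 1 the reward is non-zero if and only if the block contains a non-zero coordinate), then split the block in half. Each non-zero coordinate is touched by $O(\log d)$ blocks, giving $O(s \log d)$ queries overall. This is an illustrative sketch under our own assumptions (an exact zero-reward test; hypothetical function names), not the algorithm of Appendix C.

```python
import numpy as np

def find_support(query, d, tol=1e-9, seed=0):
    """Recover the support of a sparse theta* from noiseless rewards
    r = <theta*, x>, using O(s log d) queries via divide and conquer.

    `query(x)` returns <theta*, x> for a unit vector x."""
    rng = np.random.default_rng(seed)
    support, stack = [], [list(range(d))]
    while stack:
        block = stack.pop()
        x = np.zeros(d)
        x[block] = rng.standard_normal(len(block))  # random direction on the block
        x /= np.linalg.norm(x)
        if abs(query(x)) <= tol:  # zero reward => block is (a.s.) all-zero
            continue
        if len(block) == 1:
            support.append(block[0])
        else:
            mid = len(block) // 2
            stack.append(block[:mid])
            stack.append(block[mid:])
    return sorted(support)
```

Once the support is found, the remaining coordinates of $\theta^*$ can be read off with one query per support element, which is why the total regret stays $O(s)$ in this regime.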

1.1. OUR CONTRIBUTIONS

This paper gives the first set of variance-aware regret bounds for sparse linear bandits. We design a general framework, VASLB, that reduces variance-aware sparse linear bandits to variance-aware linear bandits with little overhead in regret. For ease of presentation, we define the following notion to characterize the variance-awareness of a sparse linear bandit algorithm.

Definition 1. A sparse linear bandit algorithm $\mathcal{F}$ is $(f(s,d), g(s,d))$-variance-aware if, for any given failure probability $\delta > 0$, with probability $1 - \delta$, $\mathcal{F}$ ensures
$$R_T^{\mathcal{F}} \le O\Big( f(s,d) \sqrt{\textstyle\sum_{t=1}^T \sigma_t^2} \, \mathrm{polylog}\tfrac{1}{\delta} + g(s,d) \, \mathrm{polylog}\tfrac{1}{\delta} \Big),$$
where $R_T^{\mathcal{F}}$ is the regret of $\mathcal{F}$ in $T$ rounds, $d$ is the ambient dimension, and $s$ is the maximum number of non-zero coordinates. For (non-sparse) linear bandits, $f$ and $g$ are functions of $d$ only.

Hence, an $(f, g)$-variance-aware algorithm achieves $O(f(s,d)\sqrt{T} \, \mathrm{polylog}\frac{1}{\delta})$ worst-case regret and $O(g(s,d) \, \mathrm{polylog}\frac{1}{\delta})$ deterministic-case regret. Ideally, we would like $g(s,d)$ to be independent of $d$, making the bound dimension-free in the deterministic case, as the divide-and-conquer approach achieves.

In this paper, we provide a general framework that can convert any linear bandit algorithm $\mathcal{F}$ into a corresponding sparse linear bandit algorithm $\mathcal{G}$ in a black-box manner. Moreover, the conversion preserves variance-awareness, in the sense that if $\mathcal{F}$ enjoys the variance-aware property, so does $\mathcal{G}$. Generally
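As a sanity check on how the bound in Definition 1 interpolates between the two regimes, one can evaluate the variance term numerically. The concrete values of $f$ and $g$ below are hypothetical placeholders, not the bounds proved in this paper, and polylog factors are dropped.

```python
import math

def variance_aware_bound(f, g, sigmas):
    """Evaluate f * sqrt(sum_t sigma_t^2) + g, the shape of the
    regret bound in Definition 1 (polylog factors dropped)."""
    return f * math.sqrt(sum(s * s for s in sigmas)) + g

T, f, g = 10_000, 3.0, 5.0  # hypothetical f(s, d), g(s, d) values
worst = variance_aware_bound(f, g, [1.0] * T)   # sigma_t == 1: f*sqrt(T) + g = 305.0
benign = variance_aware_bound(f, g, [0.0] * T)  # sigma_t == 0: only g = 5.0 survives
```

With constant unit variance the $f\sqrt{T}$ term dominates, while in the noiseless case the bound collapses to $g$ alone, matching the two extremes discussed above.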



foot_0: Throughout the paper, we use the notations O(•) and Θ(•) to hide $\log T$, $\log d$, $\log s$ (where $s$ is the sparsity parameter, introduced later) and $\log\log\frac{1}{\delta}$ factors (where $\delta$ is the failure probability).

foot_1: Carpentier & Munos (2012) and Lattimore et al. (2015) obtained an $O(s\sqrt{T})$ regret bound under different models. The former assumed a component-wise noise model, while the latter assumed a ground truth with $\|\theta^*\|_1 \le 1$ as well as an action space with $\|x_t\|_\infty \le 1$. See Appendix A for more discussion.

foot_2: We also remark that some assumption on the action set is needed. For example, if every action can only query one coordinate (i.e., each action corresponds to one vector of the standard basis), then an $\Omega(d)$ regret lower bound is unavoidable. Hence, in this paper, we only consider the benign case where the action set is the unit sphere.

