THE POWER OF FEEL-GOOD THOMPSON SAMPLING: A UNIFIED FRAMEWORK FOR LINEAR BANDITS

Abstract

Linear contextual bandit is one of the most popular models in online decision-making with bandit feedback. Prior work has studied different variants of this model, e.g., misspecified, non-stationary, and multi-task/lifelong linear contextual bandits. However, there is no single framework that unifies the algorithm design and analysis for these variants. In this paper, we propose a unified framework for linear contextual bandits based on feel-good Thompson sampling (Zhang, 2021). The algorithm derived from our framework achieves nearly minimax optimal regret in various settings and resolves the respective open problem in each setting. Specifically, letting $d$ be the dimension of the context and $T$ be the length of the horizon, our algorithm achieves an $O(d\sqrt{ST})$ regret bound for non-stationary linear bandits with at most $S$ switches, $O(d^{5/6} T^{2/3} P^{1/3})$ regret for non-stationary linear bandits with path length bounded by $P$, and $O(d\sqrt{kT} + \sqrt{dkMT})$ regret for (generalized) lifelong linear bandits over $M$ tasks that share an unknown representation of dimension $k$. We believe our framework will shed light on the design and analysis of other linear contextual bandit variants.

1. INTRODUCTION

Linear contextual bandit is one of the most popular models in online decision-making with a large, possibly infinite, action space, and has been widely studied in the past decade. One of the most successful approaches is based on the upper confidence bound (Auer, 2002). For example, LinUCB (Li et al., 2010) (or OFUL (Abbasi-Yadkori et al., 2011)) follows the optimism-in-the-face-of-uncertainty principle and chooses the best action within an elliptical confidence ball. The algorithm has been proved to be nearly minimax optimal by using the elliptical potential lemma to track the bonus term. With some modifications, this algorithm has been generalized to various settings, e.g., non-stationary linear bandits (Chen et al., 2019) and multi-task linear bandits (Hu et al., 2021), to mention a few. The analyses for these generalizations require correspondingly modified elliptical potential lemmas, which are, however, hard to derive in general. (One may refer to the technical note by Faury et al. (2021), which discusses the faults in the elliptical potential lemma for non-stationary linear bandits.)

Another common approach for online decision-making is exponentially weighted sampling. By sampling from a distribution over actions based on their historical rewards, it gives rise to near-optimal policy-based algorithms for various settings, such as the Hedge algorithm (Littlestone & Warmuth, 1994) for prediction with expert advice. For contextual bandits, EXP4 (Auer et al., 2002) enjoys a regret bound of $\mathbb{E}[\mathrm{Regret}(T)] \le O(\sqrt{KT \log |\mathcal{H}|})$, where $K$ is the number of actions, $T$ is the length of the horizon, and $\mathcal{H}$ is the feasible policy set. Note that contextual policy-based algorithms usually allow the policy to take the round index as part of the context. This gives a natural way to deal with non-stationary environments.
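As a concrete illustration of the exponential-weights idea behind Hedge (and, via importance-weighted loss estimates, EXP4), the following minimal sketch maintains a distribution over $K$ experts; the function name and the learning rate `eta` are illustrative choices, not part of the cited algorithms' specifications.

```python
import numpy as np

def hedge(losses, eta=0.5):
    """Exponential-weights (Hedge) update over K experts.

    losses: (T, K) array of per-round expert losses in [0, 1].
    Returns the (T, K) sequence of sampling distributions used each round.
    """
    T, K = losses.shape
    log_w = np.zeros(K)           # log-weights, start uniform
    weights = []
    for t in range(T):
        w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
        w /= w.sum()              # play expert i with probability w[i]
        weights.append(w)
        log_w -= eta * losses[t]  # exponentially downweight experts that lost
    return np.array(weights)
```

Under bandit feedback, EXP4 runs the same update on unbiased importance-weighted estimates of the unobserved losses rather than the full loss vector.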
For instance, one can solve non-stationary expert problems using meta-experts that follow different experts in different rounds (Herbster & Warmuth, 2004). With this idea, one can obtain regret bounds for a variety of bandit models by counting the number of policies $|\mathcal{H}|$, which is easy to do in general. This motivates us to find a policy-based algorithm for linear contextual bandits. We note that EXP4 is not suitable for our purpose, since its regret suffers a polynomial dependence on the number of actions, which can be unbounded in linear contextual bandits. This is because EXP4 is designed for general reward functions and does not leverage the linear structure of linear bandits. Given this observation, we raise the following question: Can we design an EXP4-type algorithm for linear contextual bandits?

In this paper, we answer the above question affirmatively. In detail, we propose Feel-Good Thompson Sampling over Linear Policies (FGTS.LP), a policy-based algorithm for linear contextual bandits. Our algorithm can be regarded as a policy-based adaptation of feel-good Thompson sampling (Zhang, 2021) to linear bandits, although this adaptation is nontrivial. Our algorithm enjoys a regret bound that depends logarithmically on the number of policies and polynomially on the dimension of the contexts. To be specific, we prove the following regret bound for FGTS.LP:

Theorem 1.1 (Regret Bound of FGTS.LP (informal)). Let $d$ be the dimension of the context, $T$ be the length of the horizon, and $\mathcal{H}$ be the set of all feasible policy hypotheses. The regret of FGTS.LP is bounded by $\mathbb{E}[\mathrm{Regret}(T)] \le O(\sqrt{dT \log N(\mathcal{H}, \epsilon)} + T\sqrt{d}\epsilon)$, where $\epsilon$ is a hyperparameter and $N(\mathcal{H}, \epsilon)$ is the covering number of a policy set that contains an $\epsilon$-optimal policy.

The above theorem provides a general interface for analyzing the performance of FGTS.LP in different settings. Following the idea of including the round index, FGTS.LP can deal with various linear contextual bandit variants.
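To make the feel-good idea concrete, here is a toy sketch of sampling from a feel-good posterior (Zhang, 2021) over a finite candidate parameter set. The discretization, function name, and hyperparameters `eta` and `lam` are illustrative assumptions; this is not the exact FGTS.LP procedure analyzed in this paper.

```python
import numpy as np

def fgts_sample(thetas, contexts, actions, rewards, rng, eta=1.0, lam=0.1):
    """Sample a parameter from a feel-good posterior over a finite
    candidate set `thetas` of shape (n, d), assuming linear rewards
    r ~ <theta, x_a>.

    contexts: list of (K, d) arrays of per-round action features.
    actions, rewards: chosen action indices and observed rewards so far.
    """
    log_post = np.zeros(len(thetas))
    for x, a, r in zip(contexts, actions, rewards):
        pred = x @ thetas.T                  # (K, n) predicted rewards
        log_post -= eta * (pred[a] - r) ** 2 # least-squares log-likelihood
        log_post += lam * pred.max(axis=0)   # "feel-good" optimism term
    p = np.exp(log_post - log_post.max())    # normalize in a stable way
    p /= p.sum()
    return thetas[rng.choice(len(thetas), p=p)]
```

The `lam * pred.max(axis=0)` term is the "feel-good" bonus: it tilts the posterior toward parameters that promise high reward, which is what replaces the explicit optimism bonus of UCB-style methods.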
The results are highlighted as follows:

Theorem 1.2 (Regret Bounds over Variants of Linear Bandits (informal)). With a specific modification for each setting, the regret of FGTS.LP is bounded as

• $\mathbb{E}[\mathrm{Regret}(T)] \le O(d\sqrt{T} + T\sqrt{d}\zeta)$ for $\zeta$-misspecified linear contextual bandits.
• $\mathbb{E}[\mathrm{Regret}(T)] \le O(d\sqrt{ST})$ for non-stationary linear contextual bandits with at most $S$ switches.
• $\mathbb{E}[\mathrm{Regret}(T)] \le O(d^{5/6} T^{2/3} P^{1/3})$ for non-stationary linear contextual bandits with path length bounded by $P$.
• $\mathbb{E}[\mathrm{Regret}(T)] \le O(d\sqrt{kT} + \sqrt{dkMT})$ for (generalized) lifelong linear contextual bandits over $M$ tasks that share an unknown representation of dimension $k$.

We note that the above results are all near-optimal and match or improve the state of the art in the corresponding settings. To sum up, our contributions are:

• We propose a unified framework for designing and analyzing various linear contextual bandit models. Our framework is easy to interpret and enjoys near-optimal regret bounds in different settings.
• We propose the first nearly minimax optimal algorithm for non-stationary linear contextual bandits with a bounded number of switches.
• We propose a new algorithm for non-stationary linear contextual bandits with bounded path length. It is the first algorithm that achieves nearly minimax optimal regret in this setting.
• We propose the first near-optimal algorithm for (generalized) lifelong linear contextual bandits. Its regret matches the state of the art for multi-task linear contextual bandits (Hu et al., 2021), which is a special case of our model.

Notation. We use lower- and upper-case boldface letters to denote vectors and matrices, respectively. We use $[k]$ to denote the set $\{1, 2, \cdots, k\}$. We denote the Euclidean norm of a vector $\mathbf{x} \in \mathbb{R}^d$ by $\|\mathbf{x}\|_2$. For a matrix $\mathbf{A} = [\mathbf{a}_1, \cdots, \mathbf{a}_k] \in \mathbb{R}^{d \times k}$, we define $\|\mathbf{A}\|_{2,\infty} = \max_{1 \le i \le k} \|\mathbf{a}_i\|_2$.



For two non-negative sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \le O(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \le C b_n$ for all $n \ge 1$, and $a_n \le \tilde{O}(b_n)$ if there exists an absolute constant $k$ such that $a_n \le O(b_n \log^k b_n)$; we write $a_n \ge \Omega(b_n)$ if there exists an absolute constant $C > 0$ such that $a_n \ge C b_n$ for all $n \ge 1$, and $a_n \ge \tilde{\Omega}(b_n)$ if there exists an absolute constant $k$ such that $a_n \ge \Omega(b_n \log^{-k} b_n)$; we write $a_n = \Theta(b_n)$ if there exist absolute constants $0 < C_1 \le C_2$ such that $C_1 b_n \le a_n \le C_2 b_n$ for all $n \ge 1$. For any set $\mathcal{C}$, we use $|\mathcal{C}|$ to denote its cardinality. We use $\log$ to denote $\log_e$ for short.

2. RELATED WORK

Misspecified Linear Bandits. Misspecified linear bandits were first studied by Ghosh et al. (2017), who proposed an algorithm that achieves sub-linear regret when the misspecification $\zeta$ is small. Lattimore & Szepesvari (2020) proposed an algorithm with an $O(d\sqrt{T} + T\sqrt{d}\zeta)$ regret

