THE POWER OF FEEL-GOOD THOMPSON SAMPLING: A UNIFIED FRAMEWORK FOR LINEAR BANDITS

Abstract

Linear contextual bandits are one of the most popular models in online decision-making with bandit feedback. Prior work has studied different variants of this model, e.g., misspecified, non-stationary, and multi-task/lifelong linear contextual bandits. However, there is no single framework that unifies the algorithm design and analysis for these variants. In this paper, we propose a unified framework for linear contextual bandits based on Feel-Good Thompson Sampling (Zhang, 2021). The algorithm derived from our framework achieves nearly minimax optimal regret in various settings and resolves the respective open problem in each setting. Specifically, let $d$ be the dimension of the context and $T$ the length of the horizon; our algorithm achieves an $O(d\sqrt{ST})$ regret bound for non-stationary linear bandits with at most $S$ switches, an $O(d^{5/6} T^{2/3} P^{1/3})$ regret bound for non-stationary linear bandits with bounded path length $P$, and an $O(d\sqrt{kT} + \sqrt{dkMT})$ regret bound for (generalized) lifelong linear bandits over $M$ tasks that share an unknown representation of dimension $k$. We believe our framework will shed light on the design and analysis of other linear contextual bandit variants.

1. INTRODUCTION

Linear contextual bandits are among the most popular models for online decision-making with a large, possibly infinite, action space, and have been widely studied over the past decade. One of the most successful approaches is based on the upper confidence bound (Auer, 2002). For example, LinUCB (Li et al., 2010) (or OFUL (Abbasi-Yadkori et al., 2011)) follows the optimism-in-the-face-of-uncertainty principle and chooses the best action within an elliptical confidence ball. The algorithm has been proved nearly minimax optimal by using the elliptical potential lemma to bound the cumulative bonus term. With suitable modifications, this algorithm has been generalized to various settings, e.g., non-stationary linear bandits (Chen et al., 2019) and multi-task linear bandits (Hu et al., 2021), to mention a few. The analyses for these generalizations require a correspondingly modified elliptical potential lemma, which is, however, hard to derive in general. (One may refer to the technical note (Faury et al., 2021), which discusses the flaws in the elliptical potential lemma for non-stationary linear bandits.)

Another common approach for online decision-making is exponentially weighted sampling. By sampling from a distribution over actions based on their historical rewards, it gives rise to near-optimal policy-based algorithms for various settings, such as the Hedge algorithm (Littlestone & Warmuth, 1994) for prediction with expert advice. For contextual bandits, EXP4 (Auer et al., 2002) enjoys a regret bound of $\mathbb{E}[\mathrm{Regret}(T)] \le O(\sqrt{KT \log |H|})$, where $K$ is the number of actions, $T$ is the length of the horizon, and $H$ is the feasible policy set. Note that contextual policy-based algorithms usually allow the policy to take the round index as context, which gives a natural way to deal with non-stationary environments.
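To make the first approach concrete, the following is a minimal sketch (not the algorithm proposed in this paper) of the LinUCB/OFUL selection rule described above: maintain the ridge-regression estimate of the reward parameter and pick the action maximizing the estimated reward plus an elliptical bonus. The names and the confidence-width parameter `beta` are illustrative assumptions, not notation from this paper.

```python
import numpy as np

def linucb_select(actions, A, b, beta):
    """Choose the optimistic action from a set of feature vectors.

    actions: iterable of d-dimensional feature vectors
    A:       d x d regularized Gram matrix, sum_s x_s x_s^T + lam * I
    b:       d-dimensional vector, sum_s r_s x_s
    beta:    confidence-width parameter (assumed given by the analysis)
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # ridge-regression estimate of the parameter

    def ucb(x):
        # estimated reward + elliptical confidence bonus
        return x @ theta_hat + beta * np.sqrt(x @ A_inv @ x)

    return max(actions, key=ucb)

def linucb_update(A, b, x, r):
    """Rank-one update after playing action x and observing reward r."""
    return A + np.outer(x, x), b + r * x
```

The elliptical potential lemma mentioned above is what controls the sum of the bonus terms $\beta \sqrt{x^\top A^{-1} x}$ over $T$ rounds in the regret analysis.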
For instance, one can solve non-stationary expert problems using meta-experts that follow different base experts in different rounds (Herbster & Warmuth, 2004). With this idea, one can obtain regret bounds for a variety of bandit models simply by counting the number of policies $|H|$, which is easy to do in general. This motivates us to find a policy-based algorithm for linear contextual bandits. We note that EXP4 is not suitable for our purpose, since its regret suffers a polynomial dependence on the number of actions, which can be unbounded in linear contextual bandits. This is because EXP4 is designed for general reward functions and does not leverage the linear structure of linear bandits. Given this observation, we raise the following question:
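The exponential-weighting idea underlying Hedge and EXP4 can be sketched in a few lines: the probability assigned to each policy (or expert) decays exponentially in its cumulative loss. This is a generic illustration of the weighting scheme, not the framework developed in this paper, and `eta` is an assumed learning-rate parameter.

```python
import numpy as np

def hedge_weights(cum_losses, eta):
    """Exponential-weights distribution over experts.

    cum_losses: array of cumulative losses, one entry per expert
    eta:        learning rate (assumed tuned by the analysis)
    """
    # Subtract the minimum loss before exponentiating for numerical stability;
    # this shift cancels in the normalization.
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    return w / w.sum()
```

Experts with small cumulative loss receive almost all of the probability mass, which is what yields the $O(\sqrt{KT \log |H|})$-type bounds mentioned above; the dependence on the number of actions $K$ is exactly what makes EXP4 unsuitable for infinite action sets.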

