NEURAL THOMPSON SAMPLING

Abstract

Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of Õ(T^{1/2}), which matches the regret of other contextual bandit algorithms in terms of the total number of rounds T. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.

1. INTRODUCTION

The stochastic multi-armed bandit (Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020) has been extensively studied as an important model for optimizing the trade-off between exploration and exploitation in sequential decision making. Among its many variants, the contextual bandit is widely used in real-world applications such as recommendation (Li et al., 2010), advertising (Graepel et al., 2010), robotic control (Mahler et al., 2016), and healthcare (Greenewald et al., 2017). In each round of a contextual bandit, the agent observes a feature vector (the "context") for each of the K arms, pulls one of them, and in return receives a scalar reward. The goal is to maximize the cumulative reward, or minimize the regret (to be defined later), over a total of T rounds. To do so, the agent must find a trade-off between exploration and exploitation. One of the most effective and widely used techniques is Thompson Sampling, or TS (Thompson, 1933). The basic idea is to compute the posterior probability of each arm being optimal for the present context, and to sample an arm from this distribution. TS is often easy to implement and has found great success in practice (Chapelle & Li, 2011; Graepel et al., 2010; Kawale et al., 2015; Russo et al., 2017).

Recently, a series of works has applied TS or its variants to exploration in contextual bandits with neural network models (Blundell et al., 2015; Kveton et al., 2020; Lu & Van Roy, 2017; Riquelme et al., 2018). Riquelme et al. (2018) proposed NeuralLinear, which maintains a neural network and chooses the best arm in each round according to a Bayesian linear regression on top of the last network layer. Kveton et al. (2020) proposed DeepFPL, which trains a neural network on perturbed training data and chooses the best arm in each round based on the network output.
Similar approaches have also been used in more general reinforcement learning problems (e.g., Azizzadenesheli et al., 2018; Fortunato et al., 2018; Lipton et al., 2018; Osband et al., 2016a). Despite the reported empirical success, strong regret guarantees for TS remain limited to relatively simple models, under fairly restrictive assumptions on the reward function. Examples are linear functions (Abeille & Lazaric, 2017; Agrawal & Goyal, 2013; Kocák et al., 2014; Russo & Van Roy, 2014), generalized linear functions (Kveton et al., 2020; Russo & Van Roy, 2014), and functions with small RKHS norm induced by a properly selected kernel (Chowdhury & Gopalan, 2017).

In this paper, we provide, to the best of our knowledge, the first near-optimal regret bound for neural network-based Thompson Sampling. Our contributions are threefold. First, we propose a new algorithm, Neural Thompson Sampling (NeuralTS), to incorporate TS exploration with neural networks. It differs from NeuralLinear (Riquelme et al., 2018) by considering weight uncertainty in all layers, and from other neural network-based TS implementations (Blundell et al., 2015; Kveton et al., 2020) by sampling the estimated reward from the posterior (as opposed to sampling parameters). Second, we give a regret analysis for the algorithm and obtain an Õ(d̃√T) regret, where d̃ is the effective dimension and T is the number of rounds. This result is comparable to previous bounds when specialized to the simpler, linear setting, where the effective dimension coincides with the feature dimension (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017). Finally, we corroborate the analysis with an empirical evaluation of the algorithm on several benchmarks. Experiments show that NeuralTS yields competitive performance in comparison with state-of-the-art baselines, suggesting its practical value in addition to strong theoretical guarantees.
Notation: Scalars and constants are denoted by lower- and upper-case letters, respectively. Vectors are denoted by lower-case bold letters x, and matrices by upper-case bold letters A. We denote by [k] the set {1, 2, ..., k} for a positive integer k. For two non-negative sequences {a_n} and {b_n}, a_n = O(b_n) means that there exists a positive constant C such that a_n ≤ C b_n, and we use Õ(·) to hide the logarithmic factors in O(·). We denote by ‖·‖_2 the Euclidean norm of a vector and the spectral norm of a matrix, and by ‖·‖_F the Frobenius norm of a matrix.

2. PROBLEM SETTING AND PROPOSED ALGORITHM

In this work, we consider contextual K-armed bandits, where the total number of rounds T is known. At round t ∈ [T], the agent observes K contextual vectors {x_{t,k} ∈ R^d | k ∈ [K]}, selects an arm a_t, and receives a reward r_{t,a_t}. Our goal is to minimize the following pseudo regret:

    R_T = E[ Σ_{t=1}^T (r_{t,a_t^*} - r_{t,a_t}) ],    (2.1)

where a_t^* is the optimal arm at round t, i.e., the arm with the maximum expected reward: a_t^* = argmax_{a ∈ [K]} E[r_{t,a}].

To estimate the unknown reward given a contextual vector x, we use a fully connected neural network f(x; θ) of depth L ≥ 2, defined recursively by

    f_1 = W_1 x,    f_l = W_l ReLU(f_{l-1}), 2 ≤ l ≤ L,    f(x; θ) = √m f_L,    (2.2)

where ReLU(x) := max{x, 0}, m is the width of the neural network, W_1 ∈ R^{m×d}, W_l ∈ R^{m×m} for 2 ≤ l < L, W_L ∈ R^{1×m}, and θ = (vec(W_1); ...; vec(W_L)) ∈ R^p is the collection of network parameters, with p = dm + m^2(L-2) + m. We write g(x; θ) = ∇_θ f(x; θ) for the gradient of f(x; θ) with respect to θ.

Our Neural Thompson Sampling is given in Algorithm 1. It maintains a Gaussian distribution for each arm's reward. When selecting an arm, it samples the reward of each arm from the reward's posterior distribution and pulls the greedy arm (lines 4-8). Once the reward is observed, it updates the posterior (lines 9 and 10). The mean of the posterior distribution is the output of the neural network, whose parameter is the solution to the following minimization problem:

    min_θ L(θ) = Σ_{i=1}^t [f(x_{i,a_i}; θ) - r_{i,a_i}]^2 / 2 + mλ ‖θ - θ_0‖_2^2 / 2.    (2.3)

We can see that (2.3) is an ℓ2-regularized squared-loss minimization problem, where the regularization term is centered at the randomly initialized network parameter θ_0. We apply gradient descent to solve (2.3), with step size η and total number of iterations J.
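To make the selection rule concrete, the following NumPy sketch implements one round of posterior sampling for a shallow (L = 2) instance of the network in (2.2), with posterior variance built from the gradient features g(x; θ) scaled by an exploration parameter nu. The function names, the variable U_inv, and the Sherman-Morrison form of the design-matrix update are our own implementation choices, so this should be read as an illustrative sketch of Algorithm 1 rather than a verbatim transcription.

```python
import numpy as np

def init_params(d, m, rng):
    # Shallow (L = 2) instance of the network in (2.2):
    # f(x; theta) = sqrt(m) * W2 @ ReLU(W1 @ x)
    W1 = rng.standard_normal((m, d)) / np.sqrt(m)
    W2 = rng.standard_normal((1, m)) / np.sqrt(m)
    return [W1, W2]

def forward_and_grad(x, params):
    """Return f(x; theta) and the flattened gradient g(x; theta)."""
    W1, W2 = params
    m = W1.shape[0]
    h = W1 @ x                          # pre-activation f_1
    act = np.maximum(h, 0.0)            # ReLU
    f = np.sqrt(m) * (W2 @ act).item()  # scalar network output
    gW2 = np.sqrt(m) * act                                           # df/dW2
    gW1 = np.sqrt(m) * (W2.ravel() * (h > 0))[:, None] * x[None, :]  # df/dW1
    return f, np.concatenate([gW1.ravel(), gW2.ravel()])

def neural_ts_step(contexts, params, U_inv, lam, nu, rng):
    """One round: sample each arm's reward from a Gaussian whose mean is the
    network output and whose variance uses the gradient features, then pull
    the arm with the largest sample."""
    m = params[0].shape[0]
    sampled, grads = [], []
    for x in contexts:
        f, g = forward_and_grad(x, params)
        sigma2 = lam * g @ U_inv @ g / m          # posterior variance of the reward
        sampled.append(rng.normal(f, nu * np.sqrt(max(sigma2, 0.0))))
        grads.append(g)
    arm = int(np.argmax(sampled))                 # greedy w.r.t. the samples
    g = grads[arm]
    # Rank-one (Sherman-Morrison) update of U^{-1} for U <- U + g g^T / m
    Ug = U_inv @ g
    U_inv = U_inv - np.outer(Ug, Ug) / (m + g @ Ug)
    return arm, U_inv
```

After the arm is pulled and the reward observed, the parameter θ would be refit by gradient descent on (2.3) before the next round.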

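The fitting step in (2.3) is ordinary gradient descent on a squared loss with an ℓ2 regularizer centered at θ_0. A minimal generic sketch, in which the network evaluation and its gradient are supplied as a callable (an assumption made here for brevity; in NeuralTS this callable would be the network in (2.2)):

```python
import numpy as np

def gd_on_regularized_loss(theta0, xs, rs, f_and_grad, lam, m, eta, J):
    """Gradient descent on the objective in (2.3):
        L(theta) = sum_i (f(x_i; theta) - r_i)^2 / 2
                   + m * lam * ||theta - theta0||_2^2 / 2
    f_and_grad(x, theta) must return (f(x; theta), grad_theta f(x; theta))."""
    theta = theta0.copy()
    for _ in range(J):                        # J gradient steps of size eta
        grad = m * lam * (theta - theta0)     # gradient of the regularizer
        for x, r in zip(xs, rs):
            f, g = f_and_grad(x, theta)
            grad += (f - r) * g               # gradient of the squared-loss term
        theta = theta - eta * grad
    return theta
```

Because the regularizer is centered at θ_0 rather than at zero, the solution stays close to the random initialization, which is what the theory in the paper exploits.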
