NEURAL THOMPSON SAMPLING

Abstract

Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of O(T^{1/2}), which matches the regret of other contextual bandit algorithms in terms of the total number of rounds T. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.
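The posterior described above admits a compact implementation sketch. The following is a minimal PyTorch-style illustration, assuming a small MLP reward model; the helper sample_posterior_scores, the exploration scale nu, and the matrix U_inv (the inverse of a regularized design matrix in tangent-feature space) are illustrative names and hyperparameters, not the exact formulation analyzed in the paper.

import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP used as the reward-mean approximator f(x; theta)."""
    def __init__(self, dim, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1)
        )

    def forward(self, x):
        return self.net(x)

def sample_posterior_scores(model, contexts, U_inv, nu=1.0):
    """Draw one posterior reward sample per arm.

    mean: network output f(x; theta)
    var:  nu^2 * g(x)^T U^{-1} g(x), where g(x) = grad_theta f(x; theta) are
          the neural tangent features and U accumulates g(x) g(x)^T over past
          rounds plus a regularization term (illustrative bookkeeping).
    """
    scores = []
    for x in contexts:                       # one context vector per arm
        model.zero_grad()
        out = model(x).squeeze()
        out.backward()                       # gradients give the tangent features
        g = torch.cat([p.grad.flatten() for p in model.parameters()])
        mean = out.item()
        var = (nu ** 2) * float(g @ U_inv @ g)
        std = max(var, 0.0) ** 0.5
        scores.append(mean + std * torch.randn(()).item())
    return scores

# Example: before any data is observed, U = lam * I, so U_inv = (1 / lam) * torch.eye(P),
# where P = sum(p.numel() for p in model.parameters()). The arm with the largest
# sampled score is then pulled.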

1. INTRODUCTION

The stochastic multi-armed bandit (Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020) has been extensively studied as an important model for optimizing the trade-off between exploration and exploitation in sequential decision making. Among its many variants, the contextual bandit is widely used in real-world applications such as recommendation (Li et al., 2010), advertising (Graepel et al., 2010), robotic control (Mahler et al., 2016), and healthcare (Greenewald et al., 2017). In each round of a contextual bandit, the agent observes a feature vector (the "context") for each of the K arms, pulls one of them, and in return receives a scalar reward. The goal is to maximize the cumulative reward, or minimize the regret (to be defined later), over a total of T rounds. To do so, the agent must find a trade-off between exploration and exploitation. One of the most effective and widely used techniques is Thompson Sampling, or TS (Thompson, 1933). The basic idea is to compute the posterior distribution of each arm being optimal for the present context, and to sample an arm from this distribution. TS is often easy to implement and has found great success in practice (Chapelle & Li, 2011; Graepel et al., 2010; Kawale et al., 2015; Russo et al., 2017).

Recently, a line of work has applied TS or its variants to exploration in contextual bandits with neural network models (Blundell et al., 2015; Kveton et al., 2020; Lu & Van Roy, 2017; Riquelme et al., 2018). Riquelme et al. (2018) proposed NeuralLinear, which maintains a neural network and chooses the best arm in each round according to a Bayesian linear regression on top of the last network layer. Kveton et al. (2020) proposed DeepFPL, which trains a neural network on perturbed training data and chooses the best arm in each round based on the network's output. Similar approaches have also been used in more general reinforcement learning problems (e.g., Azizzadenesheli et al., 2018; Fortunato et al., 2018; Lipton et al., 2018; Osband et al., 2016a).

Despite the reported empirical success, strong regret guarantees for TS remain limited to relatively simple models, under fairly restrictive assumptions on the reward function. Examples are linear
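As a concrete illustration of the TS idea sketched above, the following is a minimal sketch of Thompson Sampling in a linear contextual bandit with a Gaussian posterior over the weight vector; the environment interface (get_contexts, pull) and the hyperparameters v and lam are hypothetical placeholders, not part of the paper.

import numpy as np

def linear_thompson_sampling(env, dim, T, v=1.0, lam=1.0):
    """One run of TS with a Bayesian linear-regression reward model."""
    B = lam * np.eye(dim)       # posterior precision matrix
    b = np.zeros(dim)           # accumulated reward-weighted contexts
    total_reward = 0.0
    for _ in range(T):
        contexts = env.get_contexts()                   # (K, dim): one row per arm
        mu = np.linalg.solve(B, b)                      # posterior mean of weights
        cov = v ** 2 * np.linalg.inv(B)                 # posterior covariance
        theta = np.random.multivariate_normal(mu, cov)  # posterior sample
        arm = int(np.argmax(contexts @ theta))          # act greedily w.r.t. the sample
        reward = env.pull(arm)                          # observe scalar reward
        x = contexts[arm]
        B += np.outer(x, x)                             # rank-one posterior update
        b += reward * x
        total_reward += reward
    return total_reward

Sampling theta from the posterior, rather than always using the posterior mean, is what drives exploration: arms whose rewards are still uncertain occasionally look best under the sampled weights and therefore get pulled.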

