NEURAL THOMPSON SAMPLING

Abstract

Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $\tilde{O}(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of the total round number $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.

1. INTRODUCTION

The stochastic multi-armed bandit (Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020) has been extensively studied as an important model for optimizing the trade-off between exploration and exploitation in sequential decision making. Among its many variants, the contextual bandit is widely used in real-world applications such as recommendation (Li et al., 2010), advertising (Graepel et al., 2010), robotic control (Mahler et al., 2016), and healthcare (Greenewald et al., 2017). In each round of a contextual bandit, the agent observes a feature vector (the "context") for each of the $K$ arms, pulls one of them, and in return receives a scalar reward. The goal is to maximize the cumulative reward, or minimize the regret (to be defined later), over a total of $T$ rounds. To do so, the agent must find a trade-off between exploration and exploitation. One of the most effective and widely used techniques is Thompson Sampling, or TS (Thompson, 1933). The basic idea is to compute the posterior distribution of each arm being optimal for the present context, and to sample an arm from this distribution. TS is often easy to implement and has found great success in practice (Chapelle & Li, 2011; Graepel et al., 2010; Kawale et al., 2015; Russo et al., 2017). Recently, a series of works has applied TS or its variants to exploration in contextual bandits with neural network models (Blundell et al., 2015; Kveton et al., 2020; Lu & Van Roy, 2017; Riquelme et al., 2018). Riquelme et al. (2018) proposed NeuralLinear, which maintains a neural network and chooses the best arm in each round according to a Bayesian linear regression on top of the last network layer. Kveton et al. (2020) proposed DeepFPL, which trains a neural network on perturbed training data and chooses the best arm in each round based on the network output.
Similar approaches have also been used in more general reinforcement learning problems (e.g., Azizzadenesheli et al., 2018; Fortunato et al., 2018; Lipton et al., 2018; Osband et al., 2016a). Despite the reported empirical success, strong regret guarantees for TS remain limited to relatively simple models, under fairly restrictive assumptions on the reward function. Examples are linear functions (Abeille & Lazaric, 2017; Agrawal & Goyal, 2013; Kocák et al., 2014; Russo & Van Roy, 2014), generalized linear functions (Kveton et al., 2020; Russo & Van Roy, 2014), and functions with small RKHS norm induced by a properly selected kernel (Chowdhury & Gopalan, 2017). In this paper, we provide, to the best of our knowledge, the first near-optimal regret bound for neural network-based Thompson Sampling. Our contributions are threefold. First, we propose a new algorithm, Neural Thompson Sampling (NeuralTS), which incorporates TS exploration with neural networks. It differs from NeuralLinear (Riquelme et al., 2018) by considering weight uncertainty in all layers, and from other neural network-based TS implementations (Blundell et al., 2015; Kveton et al., 2020) by sampling the estimated reward from the posterior (as opposed to sampling parameters). Second, we give a regret analysis of the algorithm and obtain an $\tilde{O}(\tilde{d}\sqrt{T})$ regret, where $\tilde{d}$ is the effective dimension and $T$ is the number of rounds. This result is comparable to previous bounds when specialized to the simpler, linear setting where the effective dimension coincides with the feature dimension (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017). Finally, we corroborate the analysis with an empirical evaluation of the algorithm on several benchmarks. Experiments show that NeuralTS yields competitive performance in comparison with state-of-the-art baselines, suggesting its practical value in addition to its strong theoretical guarantees.
Notation: Scalars are denoted by lower case letters and constants by upper case letters. Vectors are denoted by lower case bold face letters $x$, and matrices by upper case bold face letters $A$. We denote by $[k]$ the set $\{1, 2, \cdots, k\}$ for a positive integer $k$. For two non-negative sequences $\{a_n\}, \{b_n\}$, $a_n = O(b_n)$ means that there exists a positive constant $C$ such that $a_n \le C b_n$, and we use $\tilde{O}(\cdot)$ to hide the logarithmic factors in $O(\cdot)$. We denote by $\|\cdot\|_2$ the Euclidean norm of a vector and the spectral norm of a matrix, and by $\|\cdot\|_F$ the Frobenius norm of a matrix.

2. PROBLEM SETTING AND PROPOSED ALGORITHM

In this work, we consider contextual $K$-armed bandits, where the total number of rounds $T$ is known. At round $t \in [T]$, the agent observes $K$ contextual vectors $\{x_{t,k} \in \mathbb{R}^d \mid k \in [K]\}$, selects an arm $a_t$, and receives a reward $r_{t,a_t}$. Our goal is to minimize the following pseudo regret:
$$R_T = \mathbb{E}\Big[\sum_{t=1}^T \big(r_{t,a_t^*} - r_{t,a_t}\big)\Big], \qquad (2.1)$$
where $a_t^*$ is the optimal arm at round $t$, i.e., the arm with the maximum expected reward: $a_t^* = \operatorname{argmax}_{a \in [K]} \mathbb{E}[r_{t,a}]$.

To estimate the unknown reward given a contextual vector $x$, we use a fully connected neural network $f(x; \theta)$ of depth $L \ge 2$, defined recursively by
$$f_1 = W_1 x, \qquad f_l = W_l \,\mathrm{ReLU}(f_{l-1}), \; 2 \le l \le L, \qquad f(x; \theta) = \sqrt{m}\, f_L,$$
where $\mathrm{ReLU}(x) := \max\{x, 0\}$, $m$ is the width of the network, $W_1 \in \mathbb{R}^{m \times d}$, $W_l \in \mathbb{R}^{m \times m}$ for $2 \le l < L$, $W_L \in \mathbb{R}^{1 \times m}$, and $\theta = (\mathrm{vec}(W_1); \cdots; \mathrm{vec}(W_L)) \in \mathbb{R}^p$ is the collection of network parameters, with $p = dm + m^2(L-2) + m$. We write $g(x; \theta) = \nabla_\theta f(x; \theta)$ for the gradient of $f(x; \theta)$ w.r.t. $\theta$.

Our Neural Thompson Sampling is given in Algorithm 1. It maintains a Gaussian distribution for each arm's reward. When selecting an arm, it samples the reward of each arm from the reward's posterior distribution, and then pulls the greedy arm (lines 4-8). Once the reward is observed, it updates the posterior (lines 9 and 10). The mean of the posterior distribution is set to the output of the neural network, whose parameter is the solution to the following minimization problem:
$$\min_\theta \mathcal{L}(\theta) = \sum_{i=1}^t \big[f(x_{i,a_i}; \theta) - r_{i,a_i}\big]^2 / 2 + m\lambda \|\theta - \theta_0\|_2^2 / 2. \qquad (2.3)$$
We can see that (2.3) is an $\ell_2$-regularized squared-loss minimization problem, where the regularization term is centered at the randomly initialized network parameter $\theta_0$. We apply gradient descent to solve (2.3), with step size $\eta$ and total number of iterations $J$.
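To make the construction concrete, here is a minimal NumPy sketch (not the paper's implementation) of the network $f(x;\theta)$ under the symmetric initialization of Algorithm 1, together with the regularized objective (2.3). The function names are ours, and $d$ and $m$ are assumed even.

```python
import numpy as np

def init_params(d, m, L, rng):
    """Symmetric initialization in the spirit of Algorithm 1 (d, m even):
    W_l = [[W, 0], [0, W]] with entries of W ~ N(0, 4/m), and
    W_L = (w, -w) with entries of w ~ N(0, 2/m), so that f(x; theta_0) = 0
    whenever the two halves of x coincide."""
    def block(n_out, n_in):
        W = rng.normal(0.0, np.sqrt(4.0 / m), size=(n_out // 2, n_in // 2))
        Z = np.zeros_like(W)
        return np.block([[W, Z], [Z, W]])
    Ws = [block(m, d)] + [block(m, m) for _ in range(L - 2)]
    w = rng.normal(0.0, np.sqrt(2.0 / m), size=(1, m // 2))
    Ws.append(np.concatenate([w, -w], axis=1))
    return Ws

def forward(Ws, x):
    """f(x; theta) = sqrt(m) * W_L ReLU(W_{L-1} ... ReLU(W_1 x))."""
    m = Ws[0].shape[0]
    h = Ws[0] @ x
    for W in Ws[1:]:
        h = W @ np.maximum(h, 0.0)
    return np.sqrt(m) * h[0]   # h has shape (1,) after the last layer

def loss(Ws, Ws0, X, r, lam):
    """The ell_2-regularized squared loss (2.3), centered at theta_0."""
    m = Ws[0].shape[0]
    sq = sum((forward(Ws, x) - ri) ** 2 for x, ri in zip(X, r)) / 2.0
    reg = sum(np.sum((W - W0) ** 2) for W, W0 in zip(Ws, Ws0))
    return sq + m * lam * reg / 2.0
```

Note that for a context of the form $x = (v; v)$ (as in Assumption 3.4), the two network halves cancel in the last layer, so $f(x; \theta_0) = 0$ at initialization.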

Algorithm 1 Neural Thompson Sampling (NeuralTS)

Input: Number of rounds $T$, exploration variance $\nu$, network width $m$, regularization parameter $\lambda$.
1: Set $U_0 = \lambda I$
2: Initialize $\theta_0 = (\mathrm{vec}(W_1); \cdots; \mathrm{vec}(W_L)) \in \mathbb{R}^p$, where for each $1 \le l \le L-1$, $W_l = (W, 0; 0, W)$ with each entry of $W$ generated independently from $N(0, 4/m)$, and $W_L = (w^\top, -w^\top)$ with each entry of $w$ generated independently from $N(0, 2/m)$
3: for $t = 1, \cdots, T$ do
4:   for $k = 1, \cdots, K$ do
5:     $\sigma_{t,k}^2 = \lambda\, g^\top(x_{t,k}; \theta_{t-1})\, U_{t-1}^{-1}\, g(x_{t,k}; \theta_{t-1}) / m$
6:     Sample estimated reward $\tilde{r}_{t,k} \sim N(f(x_{t,k}; \theta_{t-1}), \nu^2 \sigma_{t,k}^2)$
7:   end for
8:   Pull arm $a_t = \operatorname{argmax}_a \tilde{r}_{t,a}$ and receive reward $r_{t,a_t}$
9:   Set $\theta_t$ to be the output of gradient descent for solving (2.3)
10:  $U_t = U_{t-1} + g(x_{t,a_t}; \theta_t)\, g^\top(x_{t,a_t}; \theta_t) / m$
11: end for

A few observations about our algorithm are in order. First, compared to typical implementations of Thompson Sampling with neural networks, NeuralTS samples from the posterior distribution of the scalar reward, instead of the network parameters. It is therefore simpler and more efficient, as the number of parameters in practice can be large. Second, the algorithm maintains posterior distributions related to the parameters of all layers of the network, as opposed to the last layer only (Riquelme et al., 2018). This difference is crucial in our regret analysis: it allows us to build a connection between Algorithm 1 and recent work in deep learning theory (Allen-Zhu et al., 2018; Cao & Gu, 2019), in order to obtain the theoretical guarantees shown in the next section. Third, unlike linear or kernelized TS (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), whose posteriors can be computed in closed form, NeuralTS solves a non-convex optimization problem (2.3) by gradient descent. This difference requires additional techniques in the regret analysis.
Moreover, stochastic gradient descent can be used to solve the optimization problem with a similar theoretical guarantee (Allen-Zhu et al., 2018; Du et al., 2018; Zou et al., 2019) . For simplicity of exposition, we will focus on the exact gradient descent approach.
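To illustrate the flow of lines 4-10, the following sketch instantiates one round of Algorithm 1 with a linear model $f(x;\theta)=\theta^\top x$ (so that $g(x;\theta)=x$), for which NeuralTS reduces to linear Thompson Sampling; it also uses the diagonal approximation of $U$ described in Section 5. All function names are ours.

```python
import numpy as np

def neural_ts_round(theta, U_diag, contexts, nu, lam, rng, m=1.0):
    """One round of Algorithm 1 (lines 4-8 and 10), illustrated with the
    linear model f(x; theta) = theta @ x, whose gradient is g(x; theta) = x.
    U is kept as a diagonal vector, as in the approximation of Section 5."""
    samples = []
    for x in contexts:
        g = x                                       # gradient for the linear model
        sigma2 = lam * np.sum(g * g / U_diag) / m   # line 5 (diagonal U)
        samples.append(rng.normal(theta @ x, nu * np.sqrt(sigma2)))  # line 6
    a = int(np.argmax(samples))                     # line 8
    g_a = contexts[a]
    return a, U_diag + g_a * g_a / m                # line 10
```

After the reward is observed, line 9 would refit $\theta$; in this linear special case that is the closed-form ridge-regression solution of (2.3), whereas NeuralTS runs $J$ steps of gradient descent on the network parameters.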

3. REGRET ANALYSIS

In this section, we provide a regret analysis of NeuralTS. We assume that there exists an unknown reward function $h$ such that for any $1 \le t \le T$ and $1 \le k \le K$, $r_{t,k} = h(x_{t,k}) + \xi_{t,k}$ with $|h(x_{t,k})| \le 1$, where $\{\xi_{t,k}\}$ forms an $R$-sub-Gaussian martingale difference sequence with constant $R > 0$, i.e., $\mathbb{E}[\exp(\lambda \xi_{t,k}) \mid \xi_{1:t-1,k}, x_{1:t,k}] \le \exp(\lambda^2 R^2/2)$ for all $\lambda \in \mathbb{R}$. Such an assumption on the noise sequence is widely adopted in the contextual bandit literature (Agrawal & Goyal, 2013; Bubeck & Cesa-Bianchi, 2012; Chowdhury & Gopalan, 2017; Chu et al., 2011; Lattimore & Szepesvári, 2020; Valko et al., 2013).

Next, we provide the necessary background on neural tangent kernel (NTK) theory (Jacot et al., 2018), which plays a crucial role in our analysis. In the analysis, we denote by $\{x^i\}_{i=1}^{TK}$ the set of observed contexts over all arms and all rounds, $\{x_{t,k}\}_{1 \le t \le T,\, 1 \le k \le K}$, where $i = K(t-1) + k$.

Definition 3.1 (Jacot et al. (2018)). Define
$$\tilde{H}^{(1)}_{i,j} = \Sigma^{(1)}_{i,j} = \langle x^i, x^j \rangle, \qquad A^{(l)}_{i,j} = \begin{pmatrix} \Sigma^{(l)}_{i,i} & \Sigma^{(l)}_{i,j} \\ \Sigma^{(l)}_{i,j} & \Sigma^{(l)}_{j,j} \end{pmatrix},$$
$$\Sigma^{(l+1)}_{i,j} = 2\,\mathbb{E}_{(u,v) \sim N(0, A^{(l)}_{i,j})}\big[\max\{u, 0\}\max\{v, 0\}\big],$$
$$\tilde{H}^{(l+1)}_{i,j} = 2\,\tilde{H}^{(l)}_{i,j}\,\mathbb{E}_{(u,v) \sim N(0, A^{(l)}_{i,j})}\big[\mathbb{1}(u \ge 0)\,\mathbb{1}(v \ge 0)\big] + \Sigma^{(l+1)}_{i,j}.$$
Then $H = (\tilde{H}^{(L)} + \Sigma^{(L)})/2$ is called the neural tangent kernel matrix on the context set.

The NTK technique builds a connection between deep neural networks and kernel methods. It enables us to adapt complexity measures for kernel methods to describe the complexity of the neural network, as given by the following definition.

Definition 3.2. The effective dimension $\tilde{d}$ of matrix $H$ with regularization parameter $\lambda$ is defined as
$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)}.$$

Remark 3.3. The effective dimension is a metric describing the actual underlying dimension of the set of observed contexts, and has been used by Valko et al. (2013) for the analysis of kernel UCB. Our definition here is adapted from Yang & Wang (2019), which also considers UCB-based exploration.
Compared with the maximum information gain $\gamma_t$ used in Chowdhury & Gopalan (2017), one can verify from their Lemma 3 that $\gamma_t \ge \log\det(I + H/\lambda)/2$. Therefore, $\gamma_t$ and $\tilde{d}$ are of the same order up to a factor of $1/(2\log(1 + TK/\lambda))$. Furthermore, $\tilde{d}$ can be upper bounded if all contexts $x^i$ lie nearly on some low-dimensional subspace of the RKHS spanned by the NTK (Appendix D).

We make the following regularity assumption on the contexts and the corresponding NTK matrix $H$.

Assumption 3.4. Let $H$ be defined as in Definition 3.1. There exists $\lambda_0 > 0$ such that $H \succeq \lambda_0 I$. In addition, for any $t \in [T]$, $k \in [K]$, $\|x_{t,k}\|_2 = 1$ and $[x_{t,k}]_j = [x_{t,k}]_{j+d/2}$.

The assumption that the NTK matrix is positive definite has been considered in prior work on the NTK (Arora et al., 2019; Du et al., 2018). The assumption on the context $x_{t,k}$ ensures that the initial output of the neural network is $f(x; \theta_0) = 0$ under the random initialization suggested in Algorithm 1. The condition on $x$ is easy to satisfy, since for any context $x$ one can always construct a new context $x'$ as $x' = [x^\top/(\sqrt{2}\|x\|_2),\; x^\top/(\sqrt{2}\|x\|_2)]^\top$. We are now ready to present the main result of the paper.

Theorem 3.5. Under Assumption 3.4, set the parameters in Algorithm 1 as $\lambda = 1 + 1/T$ and
$$\nu = B + R\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2 + 2\log(1/\delta)},$$
where $B = \max\big\{1/(22e\sqrt{\pi}),\; \sqrt{2 h^\top H^{-1} h}\big\}$ with $h = (h(x^1), \ldots, h(x^{TK}))^\top$, and $R$ is the sub-Gaussian parameter. In line 9 of Algorithm 1, set $\eta = C_1(m\lambda + mLT)^{-1}$ and $J = (1 + LT/\lambda)\big(C_2 + \log(T^3 L \lambda^{-1}\log(1/\delta))\big)/C_1$ for some positive constants $C_1, C_2$. If the network width $m$ satisfies
$$m \ge \mathrm{poly}\big(\lambda, T, K, L, \log(1/\delta), \lambda_0^{-1}\big),$$
then, with probability at least $1 - \delta$, the regret of Algorithm 1 is bounded as
$$R_T \le C_3(1 + c_T)\nu\sqrt{2\lambda L\big(\tilde{d}\log(1 + TK) + 1\big)T} + \big(4 + C_4(1 + c_T)\nu L\big)\sqrt{2\log(3/\delta)T} + 5,$$
where $C_3, C_4$ are positive absolute constants and $c_T = \sqrt{4\log T + 2\log K}$.

Remark 3.6. The definition of $B$ in Theorem 3.5 is inspired by the RKHS norm of the reward function defined in Chowdhury & Gopalan (2017).
It can be verified that when the reward function $h$ belongs to the function space induced by the NTK, i.e., $\|h\|_H < \infty$, we have $\sqrt{h^\top H^{-1} h} \le \|h\|_H$ according to Zhou et al. (2019), which suggests that $B \le \max\{1/(22e\sqrt{\pi}),\; \sqrt{2}\|h\|_H\}$.

Remark 3.7. Theorem 3.5 implies that the regret of NeuralTS is of order $\tilde{O}(\tilde{d}\,T^{1/2})$. This result matches the state-of-the-art regret bounds in Chowdhury & Gopalan (2017); Agrawal & Goyal (2013); Zhou et al. (2019); Kveton et al. (2020).

Remark 3.8. In Theorem 3.5, the requirement on $m$ is specified in Condition 4.1 and the proof of Theorem 3.5, and is a high-degree polynomial in the time horizon $T$, the number of layers $L$, and the number of actions $K$. In our experiments, however, a reasonably small $m$ (e.g., $m = 100$) already yields good performance of NeuralTS; see Appendix A.1 for details. This discrepancy between theory and practice stems from limitations of current NTK theory (Du et al., 2018; Allen-Zhu et al., 2018; Zou et al., 2019). Closing the gap is an avenue for future work.

Remark 3.9. Theorem 3.5 suggests that we need to know $T$ before running the algorithm in order to set $m$. When $T$ is unknown, we can use the standard doubling trick (see, e.g., Cesa-Bianchi & Lugosi (2006)) to set $m$ adaptively. In detail, we decompose the time interval $(0, +\infty)$ into a union of non-overlapping intervals $[2^s, 2^{s+1})$. When $2^s \le t < 2^{s+1}$, we restart NeuralTS with the input $T = 2^{s+1}$. It can be verified that a similar $\tilde{O}(\tilde{d}\sqrt{T})$ regret still holds.
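The quantities in Definitions 3.1 and 3.2 can be computed in closed form: the two Gaussian expectations in the NTK recursion admit the standard arc-cosine kernel identities $2\,\mathbb{E}[\mathrm{ReLU}(u)\mathrm{ReLU}(v)] = \sqrt{\Sigma_{ii}\Sigma_{jj}}\,\big(\sqrt{1-\rho^2} + \rho(\pi - \arccos\rho)\big)/\pi$ and $2\,\mathbb{E}[\mathbb{1}(u \ge 0)\mathbb{1}(v \ge 0)] = (\pi - \arccos\rho)/\pi$, where $\rho$ is the correlation under $A^{(l)}_{i,j}$. These identities are standard but not spelled out in the paper; a NumPy sketch using them:

```python
import numpy as np

def ntk_matrix(X, L):
    """NTK matrix H of Definition 3.1 for contexts X (rows, unit norm),
    computed via the closed-form arc-cosine kernel identities."""
    S = X @ X.T                 # Sigma^{(1)} = H~^{(1)} = <x_i, x_j>
    H = S.copy()
    for _ in range(L - 1):
        diag = np.sqrt(np.outer(np.diag(S), np.diag(S)))
        rho = np.clip(S / diag, -1.0, 1.0)
        # Sigma^{(l+1)} = 2 E[relu(u) relu(v)]:
        S_next = diag * (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / np.pi
        # H~^{(l+1)} = H~^{(l)} * 2 E[1(u>=0) 1(v>=0)] + Sigma^{(l+1)}:
        H = H * (np.pi - np.arccos(rho)) / np.pi + S_next
        S = S_next
    return (H + S) / 2.0        # H = (H~^{(L)} + Sigma^{(L)}) / 2

def effective_dimension(H, T, K, lam):
    """Effective dimension d~ of Definition 3.2."""
    _, logdet = np.linalg.slogdet(np.eye(H.shape[0]) + H / lam)
    return logdet / np.log(1.0 + T * K / lam)
```

For example, two orthogonal unit-norm contexts with $L = 2$ give $H_{ii} = 3/2$ and $H_{ij} = 1/\pi$.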

4. PROOF OF THE MAIN THEOREM

This section sketches the proof of Theorem 3.5, with supporting lemmas and technical details provided in Appendix B. While the proof roadmap is similar to previous work on Thompson Sampling (e.g., Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017; Kocák et al., 2014; Kveton et al., 2020), our proof needs to carefully track the error of the neural network in approximating the reward function. To control this approximation error, the following condition on the network width is required in several technical lemmas.

Condition 4.1. The network width $m$ satisfies
$$m \ge C \max\big\{\sqrt{\lambda}\, L^{-3/2}[\log(TKL^2/\delta)]^{3/2},\; T^6 K^6 L^6 \log(TKL/\delta)\max\{\lambda_0^{-4}, 1\}\big\},$$
$$m[\log m]^{-3} \ge C T L^{12}\lambda^{-1} + C T^7 \lambda^{-8} L^{18}(\lambda + LT)^6 + C L^{21} T^7 \lambda^{-7}\big(1 + \sqrt{T/\lambda}\big)^6,$$
where $C$ is a positive absolute constant.

For any $t$, we define the event $E^\sigma_t$ as
$$E^\sigma_t = \big\{\omega \in \mathcal{F}_{t+1} : \forall k \in [K],\; |\tilde{r}_{t,k} - f(x_{t,k}; \theta_{t-1})| \le c_t \nu \sigma_{t,k}\big\}, \qquad (4.1)$$
where $c_t = \sqrt{4\log t + 2\log K}$. Under event $E^\sigma_t$, the difference between the sampled reward $\tilde{r}_{t,k}$ and the estimated mean reward $f(x_{t,k}; \theta_{t-1})$ is controlled by the reward's posterior standard deviation. We also define the event $E^\mu_t$ as
$$E^\mu_t = \big\{\omega \in \mathcal{F}_t : \forall k \in [K],\; |f(x_{t,k}; \theta_{t-1}) - h(x_{t,k})| \le \nu\sigma_{t,k} + \epsilon(m)\big\}, \qquad (4.2)$$
where $\epsilon(m)$ is defined as
$$\epsilon(m) = \epsilon_p(m) + C_{\epsilon,1}(1 - \eta m\lambda)^J \sqrt{TL/\lambda},$$
$$\epsilon_p(m) = C_{\epsilon,2} T^{2/3} m^{-1/6}\lambda^{-2/3} L^3 \sqrt{\log m} + C_{\epsilon,3} m^{-1/6}\sqrt{\log m}\, L^4 T^{5/3}\lambda^{-5/3}\big(1 + \sqrt{T/\lambda}\big)$$
$$\qquad + C_{\epsilon,4}\big(B + R\sqrt{\log\det(I + H/\lambda) + 2 + 2\log(1/\delta)}\big)\sqrt{\log m}\, T^{7/6} m^{-1/6}\lambda^{-2/3} L^{9/2}, \qquad (4.3)$$
and $\{C_{\epsilon,i}\}_{i=1}^4$ are positive absolute constants. Under event $E^\mu_t$, the estimated mean reward $f(x_{t,k}; \theta_{t-1})$ given by the neural network is close to the true expected reward $h(x_{t,k})$. Note that the additional term $\epsilon(m)$ is the approximation error of the neural network for approximating the true reward function. This is a key difference of our proof from previous regret analyses of Thompson Sampling (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), where there is no approximation error.
The following two lemmas show that both events $E^\sigma_t$ and $E^\mu_t$ hold with high probability.

Lemma 4.2. For any $t \in [T]$, $\Pr(E^\sigma_t \mid \mathcal{F}_t) \ge 1 - t^{-2}$.

Lemma 4.3. Suppose the network width $m$ satisfies Condition 4.1 and set $\eta = C(m\lambda + mLT)^{-1}$, where $C$ is a positive absolute constant. Then $\Pr(\forall t \in [T],\; E^\mu_t) \ge 1 - \delta$.

The next lemma gives a lower bound on the probability that the sampled reward $\tilde{r}_{t,k}$ exceeds the true reward, up to the approximation error $\epsilon(m)$.

Lemma 4.4. For any $t \in [T]$, $k \in [K]$, we have $\Pr\big(\tilde{r}_{t,k} + \epsilon(m) > h(x_{t,k}) \mid \mathcal{F}_t, E^\mu_t\big) \ge (4e\sqrt{\pi})^{-1}$.

Following Agrawal & Goyal (2013), at any time $t$ we divide the arms into two groups, saturated and unsaturated, based on whether the standard deviation of the estimate for an arm is smaller than that for the optimal arm. Note that the optimal arm belongs to the unsaturated group. More specifically, we define the set of saturated arms $S_t$ as
$$S_t = \big\{k \in [K] : h(x_{t,a^*_t}) - h(x_{t,k}) \ge (1 + c_t)\nu\sigma_{t,k} + 2\epsilon(m)\big\}. \qquad (4.4)$$
Note that we take the approximation error $\epsilon(m)$ into consideration when defining saturated arms, which differs from the Thompson Sampling literature (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017). It is then easy to show that the immediate regret of playing an unsaturated arm can be bounded by the standard deviation plus the approximation error $\epsilon(m)$. The following lemma shows that the probability of pulling a saturated arm in Algorithm 1 is small.

Lemma 4.5. Let $a_t$ be the arm pulled at round $t \in [T]$. Then
$$\Pr\big(a_t \notin S_t \mid \mathcal{F}_t, E^\mu_t\big) \ge \frac{1}{4e\sqrt{\pi}} - \frac{1}{t^2}.$$

The next lemma bounds the expectation of the regret at each round, conditioned on $E^\mu_t$.

Lemma 4.6. Suppose the network width $m$ satisfies Condition 4.1.
Set $\eta = C_1(m\lambda + mLT)^{-1}$. Then with probability at least $1 - \delta$, for all $t \in [T]$,
$$\mathbb{E}\big[h(x_{t,a^*_t}) - h(x_{t,a_t}) \mid \mathcal{F}_t, E^\mu_t\big] \le C_2(1 + c_t)\nu\sqrt{L}\,\mathbb{E}\big[\min\{\sigma_{t,a_t}, 1\} \mid \mathcal{F}_t, E^\mu_t\big] + 4\epsilon(m) + 2t^{-2},$$
where $C_1, C_2$ are positive absolute constants.

Based on Lemma 4.6, we define
$$\bar{\Delta}_t := \big(h(x_{t,a^*_t}) - h(x_{t,a_t})\big)\mathbb{1}(E^\mu_t), \qquad X_t := \bar{\Delta}_t - \big[C_\Delta(1 + c_t)\nu\sqrt{L}\min\{\sigma_{t,a_t}, 1\} + 4\epsilon(m) + 2t^{-2}\big], \qquad Y_t = \sum_{i=1}^t X_i, \qquad (4.5)$$
where $C_\Delta$ is the constant $C_2$ in Lemma 4.6. By Lemma 4.6, we can verify that with probability at least $1 - \delta$, $\{Y_t\}$ forms a supermartingale, since $\mathbb{E}(Y_t - Y_{t-1}) = \mathbb{E} X_t \le 0$. By the Azuma-Hoeffding inequality (Hoeffding, 1963), we can prove the following lemma.

Lemma 4.7. Suppose the network width $m$ satisfies Condition 4.1 and set $\eta = C_1(m\lambda + mLT)^{-1}$. Then, with probability at least $1 - \delta$,
$$\sum_{t=1}^T \bar{\Delta}_t \le 4T\epsilon(m) + \pi^2/3 + C_2(1 + c_T)\nu\sqrt{L}\sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} + \big(4 + C_3(1 + c_T)\nu L + 4\epsilon(m)\big)\sqrt{2\log(1/\delta)T},$$
where $C_1, C_2, C_3$ are positive absolute constants.

The last lemma is used to control $\sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\}$ in Lemma 4.7.

Lemma 4.8. Suppose the network width $m$ satisfies Condition 4.1 and set $\eta = C_1(m\lambda + mLT)^{-1}$. Then, with probability at least $1 - \delta$,
$$\sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} \le \sqrt{2\lambda T\big(\tilde{d}\log(1 + TK) + 1\big)} + C_2 T^{13/6}\sqrt{\log m}\, m^{-1/6}\lambda^{-2/3} L^{9/2},$$
where $C_1, C_2$ are positive absolute constants.

With all the above lemmas in hand, we are ready to prove Theorem 3.5.

Proof of Theorem 3.5. By Lemma 4.3, $E^\mu_t$ holds for all $t \in [T]$ with probability at least $1 - \delta$.
Therefore, with probability at least $1 - \delta$, we have
$$R_T = \sum_{t=1}^T \big(h(x_{t,a^*_t}) - h(x_{t,a_t})\big)\mathbb{1}(E^\mu_t)$$
$$\le 4T\epsilon(m) + \frac{\pi^2}{3} + \bar{C}_1(1 + c_T)\nu\sqrt{L}\sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} + \big(4 + \bar{C}_2(1 + c_T)\nu L + 4\epsilon(m)\big)\sqrt{2\log(1/\delta)T}$$
$$\le \bar{C}_1(1 + c_T)\nu\sqrt{L}\Big[\sqrt{2\lambda T\big(\tilde{d}\log(1 + TK) + 1\big)} + \bar{C}_3 T^{13/6}\sqrt{\log m}\, m^{-1/6}\lambda^{-2/3} L^{9/2}\Big] + \frac{\pi^2}{3} + 4T\epsilon(m) + 4\epsilon(m)\sqrt{2\log(1/\delta)T} + \big(4 + \bar{C}_2(1 + c_T)\nu L\big)\sqrt{2\log(1/\delta)T}$$
$$= \bar{C}_1(1 + c_T)\nu\sqrt{2\lambda L\big(\tilde{d}\log(1 + TK) + 1\big)T} + \bar{C}_1\bar{C}_3(1 + c_T)\nu\sqrt{L}\, T^{13/6}\sqrt{\log m}\, m^{-1/6}\lambda^{-2/3} L^{9/2} + \frac{\pi^2}{3} + \epsilon_p(m)\big(4T + \sqrt{2\log(1/\delta)T}\big) + \big(4 + \bar{C}_2(1 + c_T)\nu L\big)\sqrt{2\log(1/\delta)T} + C_{\epsilon,1}(1 - \eta m\lambda)^J\sqrt{TL/\lambda}\big(4T + \sqrt{2\log(1/\delta)T}\big),$$
where $\bar{C}_1, \bar{C}_2, \bar{C}_3$ are positive absolute constants; the first inequality is due to Lemma 4.7, the second inequality is due to Lemma 4.8, and the last equality follows from (4.3). Setting $\eta = \bar{C}_4(m\lambda + mLT)^{-1}$ and $J = (1 + LT/\lambda)\big(\log(24 C_{\epsilon,1}) + \log(T^3 L\lambda^{-1}\log(1/\delta))\big)/\bar{C}_4$, we have
$$C_{\epsilon,1}(1 - \eta m\lambda)^J\sqrt{TL/\lambda}\big(4T + \sqrt{2\log(1/\delta)T}\big) \le \frac{1}{3}.$$
Then, choosing $m$ such that
$$\bar{C}_1\bar{C}_3(1 + c_T)\nu T^{13/6}\sqrt{\log m}\, m^{-1/6}\lambda^{-2/3} L^5 \le \frac{1}{3}, \qquad \epsilon_p(m)\big(4T + \sqrt{2\log(1/\delta)T}\big) \le \frac{1}{3},$$
$R_T$ can be further bounded by
$$R_T \le \bar{C}_1(1 + c_T)\nu\sqrt{2\lambda L\big(\tilde{d}\log(1 + TK) + 1\big)T} + \big(4 + \bar{C}_2(1 + c_T)\nu L\big)\sqrt{2\log(1/\delta)T} + 5.$$
Taking a union bound over Lemmas 4.3, 4.7 and 4.8, the above inequality holds with probability at least $1 - 3\delta$. Replacing $\delta$ with $\delta/3$ and rearranging terms completes the proof.
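The constant in Lemma 4.4 comes from Gaussian anti-concentration: under $E^\mu_t$, $h(x_{t,k}) - f(x_{t,k}; \theta_{t-1}) - \epsilon(m) \le \nu\sigma_{t,k}$, so the event $\tilde{r}_{t,k} + \epsilon(m) > h(x_{t,k})$ contains the event that a standard normal variable exceeds $1$. A quick numeric sanity check (ours, not part of the proof) that $\Pr(Z > 1) \ge (4e\sqrt{\pi})^{-1}$:

```python
import math

def std_normal_tail(a):
    """P(Z > a) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

# Lemma 4.4's lower bound is the anti-concentration constant (4 e sqrt(pi))^{-1}.
p = std_normal_tail(1.0)                           # about 0.159
bound = 1.0 / (4.0 * math.e * math.sqrt(math.pi))  # about 0.052
assert p >= bound
```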

5. EXPERIMENTS

This section gives an empirical evaluation of our algorithm on several public benchmark data sets, including adult, covertype, magic telescope, mushroom and shuttle, all from UCI (Dua & Graff, 2017), as well as MNIST (LeCun et al., 2010). The algorithm is compared to several typical baselines: linear and kernelized Thompson Sampling (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), linear and kernelized UCB (Chu et al., 2011; Valko et al., 2013), BootstrapNN (Osband et al., 2016b; Riquelme et al., 2018), and ε-greedy for neural networks. BootstrapNN trains multiple neural networks with subsampled data and at each step pulls the greedy action based on a randomly selected network; it has been proposed as a way to approximate Thompson Sampling (Osband & Van Roy, 2015; Osband et al., 2016b).

5.1. EXPERIMENT SETUP

To transform these classification problems into multi-armed bandits, we adapt the disjoint model (Li et al., 2010) to build a context feature vector for each arm: given an input feature $x \in \mathbb{R}^d$ of a $k$-class classification problem, we build context feature vectors of dimension $kd$ as
$$x^{(1)} = (x; \mathbf{0}; \cdots; \mathbf{0}), \quad x^{(2)} = (\mathbf{0}; x; \cdots; \mathbf{0}), \quad \cdots, \quad x^{(k)} = (\mathbf{0}; \mathbf{0}; \cdots; x).$$
The algorithm then generates a set of predicted rewards following Algorithm 1 and pulls the greedy arm. For these classification problems, if the algorithm selects the correct class by pulling the corresponding arm, it receives a reward of 1, and otherwise 0. The cumulative regret over the time horizon $T$ is thus the total number of mistakes made by the algorithm. All experiments are repeated 8 times with reshuffled data. We set the time horizon to 10,000 for all data sets, except for mushroom, which contains only 8,124 instances. To speed up training for NeuralUCB and Neural Thompson Sampling, we use the inverse of the diagonal elements of $U$ as an approximation of $U^{-1}$. Also, since calculating the kernel matrix is expensive, we stop training at $t = 1000$ and keep evaluating the performance for the remaining rounds, similar to previous work (Riquelme et al., 2018; Zhou et al., 2019). Due to the space limit, we defer the results on adult, covertype and magic telescope, as well as further experiment details, to Appendix A. In this section, we only show the results on mushroom, shuttle and MNIST.
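The disjoint-model construction above can be sketched as follows (function names ours):

```python
import numpy as np

def disjoint_contexts(x, k):
    """Build the k arm contexts of the disjoint model: arm a's context
    places the feature x in the a-th block of a (k*d)-dimensional vector."""
    d = x.shape[0]
    contexts = np.zeros((k, k * d))
    for a in range(k):
        contexts[a, a * d:(a + 1) * d] = x
    return contexts

def bandit_reward(pulled_arm, true_label):
    """Reward 1 for pulling the arm of the correct class, else 0."""
    return 1.0 if pulled_arm == true_label else 0.0
```

With this mapping, cumulative regret equals the number of misclassified rounds, as described above.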

5.2. EXPERIMENT I: PERFORMANCE OF NEURAL THOMPSON SAMPLING

The results for Neural Thompson Sampling and the benchmark algorithms are shown in Figure 1. A few observations are in order. First, Neural Thompson Sampling is among the best performers on the 6 data sets, and is significantly better than all other baselines on 2 of them. Second, the function class used by an algorithm matters: algorithms with linear representations tend to perform worse, due to the nonlinearity of the rewards in the data. Third, Thompson Sampling is competitive with, and sometimes better than, other exploration strategies using the same function class, in particular when neural networks are used.

5.3. EXPERIMENT II: ROBUSTNESS TO REWARD DELAY

This experiment is inspired by practical scenarios where reward signals are delayed due to various constraints, as described by Chapelle & Li (2011). We study how robust the two most competitive methods from Experiment I, NeuralUCB and Neural Thompson Sampling, are when rewards are delayed. More specifically, the reward after taking an action is not revealed immediately, but arrives in batches, at which point the algorithms update their models. The experiment setup is otherwise identical to Experiment I. We vary the batch size (i.e., the amount of reward delay), and Figure 2 shows the corresponding total regret. Clearly, we recover the result of Experiment I when the delay is 0. Consistent with previous findings (Chapelle & Li, 2011), NeuralTS degrades much more gracefully than NeuralUCB as the reward delay increases. The benefit may be explained by the randomized nature of the algorithm's exploration, which keeps actions diverse between model updates. We therefore expect wide applicability of NeuralTS in practical applications.
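The batched-feedback protocol of this experiment can be skeletonized as below; `select_arm` and `update` stand in for the bandit algorithm's action selection and posterior update, and all names are ours.

```python
def run_with_delay(select_arm, update, rounds, delay):
    """Delayed-feedback loop: actions are chosen with the current (stale)
    model throughout a batch; observations are revealed and the model is
    updated only every `delay` rounds (delay = 0 means update every round)."""
    buffer = []
    for t in range(rounds):
        arm, reward = select_arm(t)
        buffer.append((t, arm, reward))
        if (t + 1) % max(delay, 1) == 0:
            update(buffer)   # batch posterior update
            buffer = []
    if buffer:               # flush any trailing partial batch
        update(buffer)
```

Within a batch, a UCB method repeats the same (greedy-optimistic) choices, while Thompson Sampling's posterior sampling continues to diversify actions, which is one way to understand its robustness to delay.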

6. RELATED WORK

Thompson Sampling was proposed as an exploration heuristic almost nine decades ago (Thompson, 1933), and has received significant interest in the last decade. Previous work related to the present paper is discussed in the introduction and not repeated here. The upper confidence bound, or UCB (Agrawal, 1995; Auer et al., 2002; Lai & Robbins, 1985), is a widely used alternative to Thompson Sampling for exploration. This strategy has been shown to achieve near-optimal regret in a range of settings, such as linear bandits (Abbasi-Yadkori et al., 2011; Auer, 2002; Chu et al., 2011), generalized linear bandits (Filippi et al., 2010; Jun et al., 2017; Li et al., 2017), and kernelized contextual bandits (Valko et al., 2013). Neural networks are increasingly used in contextual bandits. In addition to the works mentioned earlier (Blundell et al., 2015; Kveton et al., 2020; Lu & Van Roy, 2017; Riquelme et al., 2018), Zahavy & Mannor (2019) used a deep neural network to provide a feature mapping and explored only at the last layer, and Schwenk & Bengio (2000) proposed an algorithm that boosts the estimates of multiple deep neural networks. While these methods all show promise empirically, no regret guarantees are known for them. Recently, Foster & Rakhlin (2020) proposed a special regression oracle and randomized exploration for contextual bandits with a general function class (including neural networks), along with a theoretical analysis. Zhou et al. (2019) proposed a neural UCB algorithm with a near-optimal regret bound based on UCB exploration, while this paper focuses on Thompson Sampling.

7. CONCLUSIONS

In this paper, we adapt Thompson Sampling to neural networks. Building on recent advances in deep learning theory, we show that the proposed algorithm, NeuralTS, enjoys an $\tilde{O}(\tilde{d}\,T^{1/2})$ regret bound. We also show that the algorithm works well empirically on benchmark problems, in comparison with multiple strong baselines. These promising results suggest a few interesting directions for future research. First, our analysis requires NeuralTS to perform multiple gradient descent steps to train the neural network in each round. It would be interesting to analyze the case where NeuralTS performs only one gradient descent step per round, and in particular the trade-off between optimization precision and regret minimization. Second, when the number of arms is finite, $\tilde{O}(\sqrt{dT})$ regret has been established for parametric bandits with linear and generalized linear reward functions. It is an open problem how to adapt NeuralTS to achieve the same rate. Third, Allen-Zhu & Li (2019) suggested that neural networks may behave differently from a neural tangent kernel in some parameter regimes. It would be interesting to investigate whether similar results hold for neural contextual bandit algorithms such as NeuralTS.

A FURTHER DETAILS OF THE EXPERIMENTS IN SECTION 5

A.1 PARAMETER TUNING

In the experiments, we shuffle all data sets randomly and normalize the features so that their $\ell_2$-norm is unity. One-hidden-layer neural networks with 100 neurons are used. Note that we do not choose $m$ as suggested by theory; this disconnect has its root in current deep learning theory based on the neural tangent kernel, and is not specific to this work. During posterior updating, gradient descent is run for 100 iterations with learning rate 0.001. For BootstrapNN, we use 10 identical networks, and to train each network, the data point at each round is included for training with probability 0.8 ($p = 10$, $q = 0.8$ in the original paper (Schwenk & Bengio, 2000)). For ε-greedy, we tune ε with a grid search over $\{0.01, 0.05, 0.1\}$. For the $(\lambda, \nu)$ used in linear and kernel UCB / Thompson Sampling, we set $\lambda = 1$ following previous works (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), and do a grid search over $\nu \in \{1, 0.1, 0.01\}$ to select the parameter with the best performance. For the neural UCB / Thompson Sampling methods, we use a grid search over $\lambda \in \{1, 10^{-1}, 10^{-2}, 10^{-3}\}$ and $\nu \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. All experiments are repeated 20 times, and the average and standard error are reported.

A.2 DETAILED RESULTS

Table 1 summarizes the total regret measured at the last round on the different data sets, with mean and standard deviation computed over 20 independent runs; boldface indicates the top performance over the 8 experiments. Table 2 shows the number of times the algorithm in each row significantly outperforms, ties with, or significantly underperforms each other algorithm, according to a t-test at the 90% significance level. Figure 3 shows the performance of Neural Thompson Sampling compared with the other baseline methods. Figure 4 compares Neural Thompson Sampling and NeuralUCB in the delayed-reward setting.

A.3 RUN TIME ANALYSIS

We compare the run time of the four algorithms based on neural networks: BootstrapNN, ε-greedy for neural networks, NeuralUCB, and NeuralTS. The comparison is shown in Figure 5. We can see that NeuralTS and NeuralUCB are about 2 to 3 times slower than ε-greedy, due to the extra computation of the neural network gradient for each input context. BootstrapNN is often more than 5 times slower than ε-greedy, because it has to train several neural networks at each round.

B PROOF OF LEMMAS IN SECTION 4

Under Condition 4.1, we can show that the following inequalities hold:
$$\sqrt{1/\lambda} \ge C_{m,1}\, m^{-1} L^{-3/2}[\log(TKL^2/\delta)]^{3/2},$$
$$\sqrt{T/\lambda} \le C_{m,2}\min\big\{m^{1/2} L^{-6}[\log m]^{-3/2},\; m^{7/8}\big((\lambda\eta)^2 L^{-6} T^{-1}(\log m)^{-1}\big)^{3/8}\big\},$$
$$m^{1/6} \ge C_{m,3}\sqrt{\log m}\, L^{7/2} T^{7/6}\lambda^{-7/6}\big(1 + \sqrt{T/\lambda}\big),$$
$$m \ge C_{m,4}\, T^6 K^6 L^6 \log(TKL/\delta)\max\{\lambda_0^{-4}, 1\},$$
where $C_{m,1}, \ldots, C_{m,4}$ are positive absolute constants.

B.1 PROOF OF LEMMA 4.2

The following concentration bound for Gaussian distributions will be useful in our proof.

Lemma B.1 (Hoffman et al. (2013)). Consider a normally distributed random variable $X \sim N(\mu, \sigma^2)$ and $\beta \ge 0$. The probability that $X$ lies within a radius $\beta\sigma$ of its mean satisfies
$$\Pr\big(|X - \mu| \le \beta\sigma\big) \ge 1 - \exp(-\beta^2/2).$$

Proof of Lemma 4.2. Since, given the filtration $\mathcal{F}_t$, the estimated reward $\tilde{r}_{t,k}$ is sampled from $N(f(x_{t,k}; \theta_{t-1}), \nu^2\sigma_{t,k}^2)$, Lemma B.1 implies that, conditioned on $\mathcal{F}_t$ and for given $t, k$,
$$\Pr\big(|\tilde{r}_{t,k} - f(x_{t,k}; \theta_{t-1})| \le c_t\nu\sigma_{t,k} \mid \mathcal{F}_t\big) \ge 1 - \exp(-c_t^2/2).$$
Taking a union bound over the $K$ arms, we have for any $t$ that
$$\Pr\big(\forall k,\; |\tilde{r}_{t,k} - f(x_{t,k}; \theta_{t-1})| \le c_t\nu\sigma_{t,k} \mid \mathcal{F}_t\big) \ge 1 - K\exp(-c_t^2/2).$$
Finally, choosing $c_t = \sqrt{4\log t + 2\log K}$ as in (4.1), we obtain
$$\Pr\big(E^\sigma_t \mid \mathcal{F}_t\big) = \Pr\big(\forall k,\; |\tilde{r}_{t,k} - f(x_{t,k}; \theta_{t-1})| \le c_t\nu\sigma_{t,k} \mid \mathcal{F}_t\big) \ge 1 - \frac{1}{t^2}.$$

B.2 PROOF OF LEMMA 4.3

Before going into the proof, some notation for the linearized and kernelized models is needed.

Definition B.2. Define $\bar{U}_t = \lambda I + \sum_{i=1}^t g(x_{i,a_i}; \theta_0)\, g^\top(x_{i,a_i}; \theta_0)/m$ and, based on $\bar{U}_t$, further define $\bar{\sigma}^2_{t,k} = \lambda\, g^\top(x_{t,k}; \theta_0)\, \bar{U}^{-1}_{t-1}\, g(x_{t,k}; \theta_0)/m$. Furthermore, for convenience define
$$J_t = \big(g(x_{1,a_1}; \theta_t), \cdots, g(x_{t,a_t}; \theta_t)\big), \qquad \bar{J}_t = \big(g(x_{1,a_1}; \theta_0), \cdots, g(x_{t,a_t}; \theta_0)\big),$$
$$h_t = \big(h(x_{1,a_1}), \cdots, h(x_{t,a_t})\big)^\top, \qquad r_t = (r_1, \cdots, r_t)^\top, \qquad \epsilon_t = \big(h(x_{1,a_1}) - r_1, \cdots, h(x_{t,a_t}) - r_t\big)^\top,$$
where $\epsilon_t$ is the reward noise. One can verify that $U_t = \lambda I + J_t J_t^\top/m$ and $\bar{U}_t = \lambda I + \bar{J}_t\bar{J}_t^\top/m$. We further define $\bar{K}_t = \bar{J}_t^\top\bar{J}_t/m$.
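The Gaussian concentration bound of Lemma B.1 can be checked numerically against the exact two-sided probability, which equals $\mathrm{erf}(\beta/\sqrt{2})$ (a sanity check of ours, not part of the proof):

```python
import math

def gaussian_concentration(beta):
    """Exact P(|X - mu| <= beta * sigma) for X ~ N(mu, sigma^2)."""
    return math.erf(beta / math.sqrt(2.0))

# Lemma B.1: P(|X - mu| <= beta * sigma) >= 1 - exp(-beta^2 / 2).
for beta in [0.0, 0.5, 1.0, 2.0, 4.0]:
    assert gaussian_concentration(beta) >= 1.0 - math.exp(-beta**2 / 2.0)
```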
The first lemma shows that the target function is well approximated by the linearized neural network when the network width $m$ is large enough.

Lemma B.3 (Lemma 5.1, Zhou et al. (2019)). There exists a constant $C > 0$ such that for any $\delta \in (0,1)$, if $m \ge C T^4 K^4 L^6 \log(T^2 K^2 L/\delta)/\lambda_0^4$, then with probability at least $1-\delta$ over the random initialization of $\theta_0$, there exists a $\theta^* \in \mathbb{R}^p$ such that
$$h(x^i) = \langle g(x^i; \theta_0), \theta^* - \theta_0 \rangle, \qquad \sqrt{m}\,\|\theta^* - \theta_0\|_2 \le \sqrt{2\, h^\top H^{-1} h} \le B, \tag{B.1}$$
for all $i \in [TK]$, where $B$ is defined in Theorem 3.5.

From Lemma B.3, it is easy to show that under this initialization $\theta_0$ we have $h_t = \bar J_t^\top (\theta^* - \theta_0)$.

The next lemma bounds the difference between the $\bar\sigma_{t,k}$ of the linearized model and the $\sigma_{t,k}$ actually used in the algorithm. Its proof, together with those of the other technical lemmas, will be given in the next section.

Lemma B.4. Suppose the network width $m$ satisfies Condition 4.1, and set $\eta = C_1(m\lambda + mLT)^{-1}$. Then with probability at least $1-\delta$,
$$|\sigma_{t,k} - \bar\sigma_{t,k}| \le C_2 \sqrt{\log m}\, t^{7/6} m^{-1/6} \lambda^{-2/3} L^{9/2},$$
where $C_1, C_2$ are positive constants.

We next bound the difference between the outputs of the neural network and of the linearized model.

Lemma B.5. Suppose the network width $m$ satisfies Condition 4.1, and set $\eta = C_1(m\lambda + mLT)^{-1}$. Then with probability at least $1-\delta$ over the random initialization of $\theta_0$, we have
$$\Big| f(x_{t,k}; \theta_{t-1}) - \big\langle g(x_{t,k}; \theta_0), \bar U_{t-1}^{-1} \bar J_{t-1} r_{t-1}/m \big\rangle \Big| \le C_2 t^{2/3} m^{-1/6} \lambda^{-2/3} L^3 \sqrt{\log m} + C_3 (1 - \eta m\lambda)^J \sqrt{tL/\lambda} + C_4 m^{-1/6} \sqrt{\log m}\, L^4 t^{5/3} \lambda^{-5/3} \big(1 + \sqrt{t/\lambda}\big),$$
where $\{C_i\}_{i=1}^4$ are positive constants.

The next lemma, due to Chowdhury & Gopalan (2017), controls the quadratic form generated by an $R$-sub-Gaussian noise vector.

Lemma B.6 (Theorem 1, Chowdhury & Gopalan (2017)). Let $\{\epsilon_t\}_{t=1}^\infty$ be a real-valued stochastic process such that for some $R \ge 0$ and for all $t \ge 1$, $\epsilon_t$ is $\mathcal{F}_t$-measurable and $R$-sub-Gaussian conditioned on $\mathcal{F}_t$. Recall $K_t$ defined in Definition B.2.
For any $0 < \delta < 1$ and a given $\eta > 0$, with probability at least $1-\delta$, the following holds for all $t$:
$$\epsilon_{1:t}^\top \big((K_t + \eta I)^{-1} + I\big)^{-1} \epsilon_{1:t} \le R^2 \log\det\big((1+\eta)I + K_t\big) + 2R^2 \log(1/\delta).$$

Finally, the following lemma shows that the linearized kernel and the neural tangent kernel are close.

Lemma B.7. There exists a positive constant $C$ such that the following holds for all $t \in [T]$: if the network width satisfies $m \ge C T^6 L^6 K^6 \log(TKL/\delta)$, then with probability at least $1-\delta$,
$$\log\det(I + \lambda^{-1} K_t) \le \log\det(I + \lambda^{-1} H) + 1.$$

We are now ready to prove Lemma 4.3.

Proof of Lemma 4.3. Since $m$ satisfies Condition 4.1, with the stated choice of $\eta$ the conditions required by Lemmas B.3–B.7 are satisfied. Taking a union bound, with probability at least $1-5\delta$ the bounds provided by these lemmas hold simultaneously. For any $t \in [T]$, we first bound the difference between the target function and the linear function $\langle g(x_{t,k}; \theta_0), \bar U_{t-1}^{-1} \bar J_{t-1} r_{t-1}/m \rangle$:
$$\Big| h(x_{t,k}) - \big\langle g(x_{t,k}; \theta_0), \bar U_{t-1}^{-1} \bar J_{t-1} r_{t-1}/m \big\rangle \Big| \le \Big| h(x_{t,k}) - \big\langle g(x_{t,k}; \theta_0), \bar U_{t-1}^{-1} \bar J_{t-1} h_{t-1}/m \big\rangle \Big| + \Big| \big\langle g(x_{t,k}; \theta_0), \bar U_{t-1}^{-1} \bar J_{t-1} \epsilon_{t-1}/m \big\rangle \Big|$$
$$= \Big| \big\langle g(x_{t,k}; \theta_0), \theta^* - \theta_0 - \bar U_{t-1}^{-1} \bar J_{t-1} \bar J_{t-1}^\top (\theta^* - \theta_0)/m \big\rangle \Big| + \big| g(x_{t,k}; \theta_0)^\top \bar U_{t-1}^{-1} \bar J_{t-1} \epsilon_{t-1}/m \big|$$
$$= \Big| \big\langle g(x_{t,k}; \theta_0), \big(I - \bar U_{t-1}^{-1} (\bar U_{t-1} - \lambda I)\big)(\theta^* - \theta_0) \big\rangle \Big| + \big| g(x_{t,k}; \theta_0)^\top \bar U_{t-1}^{-1} \bar J_{t-1} \epsilon_{t-1}/m \big|$$
$$= \lambda \big| g(x_{t,k}; \theta_0)^\top \bar U_{t-1}^{-1} (\theta^* - \theta_0) \big| + \big| g(x_{t,k}; \theta_0)^\top \bar U_{t-1}^{-1} \bar J_{t-1} \epsilon_{t-1}/m \big|$$
$$\le \lambda \sqrt{g(x_{t,k}; \theta_0)^\top \bar U_{t-1}^{-1} g(x_{t,k}; \theta_0)} \sqrt{(\theta^* - \theta_0)^\top \bar U_{t-1}^{-1} (\theta^* - \theta_0)} + \sqrt{g(x_{t,k}; \theta_0)^\top \bar U_{t-1}^{-1} g(x_{t,k}; \theta_0)} \sqrt{\epsilon_{t-1}^\top \bar J_{t-1}^\top \bar U_{t-1}^{-1} \bar J_{t-1} \epsilon_{t-1}}\,/m$$
$$\le \sqrt{m}\,\|\theta^* - \theta_0\|_2\, \bar\sigma_{t,k} + \bar\sigma_{t,k}\, \lambda^{-1/2} \sqrt{\epsilon_{t-1}^\top \bar J_{t-1}^\top \bar U_{t-1}^{-1} \bar J_{t-1} \epsilon_{t-1}/m}, \tag{B.2}$$
where the first inequality uses the triangle inequality and the fact that $r_{t-1} = h_{t-1} + \epsilon_{t-1}$; the first equality follows from Lemma B.3 and the second equality uses the fact that $\bar J_{t-1} \bar J_{t-1}^\top = m(\bar U_{t-1} - \lambda I)$, which can be verified from Definition B.2; the second inequality follows from the fact that $|\alpha^\top A \beta| \le \sqrt{\alpha^\top A \alpha}\sqrt{\beta^\top A \beta}$ for positive semidefinite $A$.
The last inequality follows since $\bar U_{t-1}^{-1} \preceq \lambda^{-1} I$ and from the definition of $\bar\sigma_{t,k}$ in Definition B.2. Furthermore, since
$$\bar J_{t-1}^\top \bar U_{t-1}^{-1} \bar J_{t-1}/m = \bar J_{t-1}^\top \big(\lambda I + \bar J_{t-1}\bar J_{t-1}^\top/m\big)^{-1} \bar J_{t-1}/m$$
$$= \bar J_{t-1}^\top \Big(\lambda^{-1} I - \lambda^{-2} \bar J_{t-1}\big(I + \lambda^{-1} \bar J_{t-1}^\top \bar J_{t-1}/m\big)^{-1} \bar J_{t-1}^\top/m\Big) \bar J_{t-1}/m$$
$$= \lambda^{-1} \bar J_{t-1}^\top \bar J_{t-1}/m - \lambda^{-1} \bar J_{t-1}^\top \bar J_{t-1}\big(\lambda I + \bar J_{t-1}^\top \bar J_{t-1}/m\big)^{-1} \bar J_{t-1}^\top \bar J_{t-1}/m^2$$
$$= \lambda^{-1} K_{t-1}\big(I - (\lambda I + K_{t-1})^{-1} K_{t-1}\big) = K_{t-1}(\lambda I + K_{t-1})^{-1},$$
where the second equality is the Sherman–Morrison–Woodbury formula, and the last equality uses Definition B.2 and the identity $(\lambda I + K_{t-1})^{-1} K_{t-1} = I - \lambda(\lambda I + K_{t-1})^{-1}$, which can be verified by multiplying both sides by $\lambda I + K_{t-1}$, we have that
$$\epsilon_{t-1}^\top \bar J_{t-1}^\top \bar U_{t-1}^{-1} \bar J_{t-1} \epsilon_{t-1}/m = \epsilon_{t-1}^\top K_{t-1}(\lambda I + K_{t-1})^{-1} \epsilon_{t-1} \le \epsilon_{t-1}^\top \big(K_{t-1} + (\lambda-1)I\big)(\lambda I + K_{t-1})^{-1} \epsilon_{t-1} = \epsilon_{t-1}^\top \big(I + (K_{t-1} + (\lambda-1)I)^{-1}\big)^{-1} \epsilon_{t-1}, \tag{B.3}$$
where the inequality holds because $\lambda = 1 + 1/T \ge 1$ as set in Theorem 3.5. Based on (B.2) and (B.3), using the bound on $\|\theta^* - \theta_0\|_2$ provided in Lemma B.3, the bound given in Lemma B.6 (applied with $\eta = \lambda - 1$), and $\lambda \ge 1$, we have
$$\Big| h(x_{t,k}) - \big\langle g(x_{t,k}; \theta_0), \bar U_{t-1}^{-1} \bar J_{t-1} r_{t-1}/m \big\rangle \Big| \le \Big( B + R\sqrt{\log\det(\lambda I + K_{t-1}) + 2\log(1/\delta)} \Big)\, \bar\sigma_{t,k},$$
since it is easy to see that
$$\log\det(\lambda I + K_{t-1}) = \log\det(I + \lambda^{-1} K_{t-1}) + (t-1)\log\lambda \le \log\det(I + \lambda^{-1} K_{t-1}) + t(\lambda-1) \le \log\det(I + \lambda^{-1} H) + 2,$$
where the equality pulls $\lambda$ outside the $\log\det$, the first inequality is due to $\log\lambda \le \lambda - 1$, and the second inequality follows from Lemma B.7 and the fact that $\lambda = 1 + 1/T$ (as set in Theorem 3.5). Thus we have
$$\Big| h(x_{t,k}) - \big\langle g(x_{t,k}; \theta_0), \bar U_{t-1}^{-1} \bar J_{t-1} r_{t-1}/m \big\rangle \Big| \le \nu\, \bar\sigma_{t,k}, \qquad \text{where we set } \nu = B + R\sqrt{\log\det(I + H/\lambda) + 2 + 2\log(1/\delta)}.$$
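The matrix identity $\bar J^\top (\lambda I + \bar J \bar J^\top/m)^{-1} \bar J/m = K(\lambda I + K)^{-1}$ with $K = \bar J^\top \bar J/m$, used in the derivation of (B.3) above, can be checked numerically. The sketch below is illustrative (random matrices, arbitrary dimensions), not the paper's code:

```python
# Numeric check of the push-through identity used in (B.3).
import numpy as np

rng = np.random.default_rng(1)
p, t, m, lam = 30, 8, 30, 1.25              # p = #params, t = #rounds
Jbar = rng.normal(size=(p, t))              # columns play the role of g(x_{i,a_i}; theta_0)

Ubar = lam * np.eye(p) + Jbar @ Jbar.T / m  # p x p design matrix
K = Jbar.T @ Jbar / m                       # t x t kernel matrix

lhs = Jbar.T @ np.linalg.solve(Ubar, Jbar) / m
rhs = K @ np.linalg.inv(lam * np.eye(t) + K)
assert np.allclose(lhs, rhs)
print("identity verified, max deviation:", np.abs(lhs - rhs).max())
```

The identity lets the analysis pass from the $p \times p$ (parameter-space) matrix $\bar U$ to the much smaller $t \times t$ kernel matrix $K$, which is why the noise term can then be controlled by Lemma B.6.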
Then, combining this bound with Lemma B.5, we conclude that there exist positive constants $\bar C_1, \bar C_2, \bar C_3$ such that
$$|f(x_{t,k}; \theta_{t-1}) - h(x_{t,k})| \le \nu \bar\sigma_{t,k} + \bar C_1 t^{2/3} m^{-1/6} \lambda^{-2/3} L^3 \sqrt{\log m} + \bar C_2 (1 - \eta m\lambda)^J \sqrt{tL/\lambda} + \bar C_3 m^{-1/6} \sqrt{\log m}\, L^4 t^{5/3} \lambda^{-5/3}\big(1 + \sqrt{t/\lambda}\big)$$
$$\le \nu \sigma_{t,k} + \bar C_1 t^{2/3} m^{-1/6} \lambda^{-2/3} L^3 \sqrt{\log m} + \bar C_2 (1 - \eta m\lambda)^J \sqrt{tL/\lambda} + \bar C_3 m^{-1/6} \sqrt{\log m}\, L^4 t^{5/3} \lambda^{-5/3}\big(1 + \sqrt{t/\lambda}\big) + \Big( B + R\sqrt{\log\det(I + H/\lambda) + 2 + 2\log(1/\delta)} \Big)(\bar\sigma_{t,k} - \sigma_{t,k}).$$
Finally, using the bound on $|\sigma_{t,k} - \bar\sigma_{t,k}|$ provided in Lemma B.4, we conclude that
$$|f(x_{t,k}; \theta_{t-1}) - h(x_{t,k})| \le \nu \sigma_{t,k} + \epsilon(m),$$
where $\epsilon(m)$ is defined by collecting all of the additional terms and taking $t = T$:
$$\epsilon(m) = \bar C_1 T^{2/3} m^{-1/6} \lambda^{-2/3} L^3 \sqrt{\log m} + \bar C_2 (1 - \eta m\lambda)^J \sqrt{TL/\lambda} + \bar C_3 m^{-1/6} \sqrt{\log m}\, L^4 T^{5/3} \lambda^{-5/3}\big(1 + \sqrt{T/\lambda}\big) + \bar C_4 \Big( B + R\sqrt{\log\det(I + H/\lambda) + 2 + 2\log(1/\delta)} \Big) \sqrt{\log m}\, T^{7/6} m^{-1/6} \lambda^{-2/3} L^{9/2},$$
which is exactly the form defined in (4.3). Rescaling $\delta$ to $\delta/5$ (required by the union bound discussed at the beginning of the proof) yields the result stated in Lemma 4.3.

B.3 PROOF OF LEMMA 4.4

Our proof requires an anti-concentration bound for the Gaussian distribution, stated below.

Lemma B.8 (Gaussian anti-concentration). For a Gaussian random variable $X$ with mean $\mu$ and standard deviation $\sigma$, and for any $\beta > 0$,
$$\Pr\Big( \frac{X - \mu}{\sigma} > \beta \Big) \ge \frac{\exp(-\beta^2)}{4\sqrt{\pi}\beta}.$$

Proof of Lemma 4.4. Since $\tilde r_{t,k} \sim \mathcal{N}(f(x_{t,k}; \theta_{t-1}), \nu^2 \sigma_{t,k}^2)$ conditioned on $\mathcal{F}_t$, we have
$$\Pr\big( \tilde r_{t,k} + \epsilon(m) > h(x_{t,k}) \,\big|\, \mathcal{F}_t, E^\mu_t \big) = \Pr\Big( \frac{\tilde r_{t,k} - f(x_{t,k}; \theta_{t-1}) + \epsilon(m)}{\nu\sigma_{t,k}} > \frac{h(x_{t,k}) - f(x_{t,k}; \theta_{t-1})}{\nu\sigma_{t,k}} \,\Big|\, \mathcal{F}_t, E^\mu_t \Big)$$
$$\ge \Pr\Big( \frac{\tilde r_{t,k} - f(x_{t,k}; \theta_{t-1}) + \epsilon(m)}{\nu\sigma_{t,k}} > \frac{|h(x_{t,k}) - f(x_{t,k}; \theta_{t-1})|}{\nu\sigma_{t,k}} \,\Big|\, \mathcal{F}_t, E^\mu_t \Big)$$
$$= \Pr\Big( \frac{\tilde r_{t,k} - f(x_{t,k}; \theta_{t-1})}{\nu\sigma_{t,k}} > \frac{|h(x_{t,k}) - f(x_{t,k}; \theta_{t-1})| - \epsilon(m)}{\nu\sigma_{t,k}} \,\Big|\, \mathcal{F}_t, E^\mu_t \Big)$$
$$\ge \Pr\Big( \frac{\tilde r_{t,k} - f(x_{t,k}; \theta_{t-1})}{\nu\sigma_{t,k}} > 1 \,\Big|\, \mathcal{F}_t, E^\mu_t \Big) \ge \frac{1}{4e\sqrt{\pi}},$$
where the first inequality is due to $|x| \ge x$, the second inequality follows from the event $E^\mu_t$, i.e., $\forall k \in [K]$, $|f(x_{t,k}; \theta_{t-1}) - h(x_{t,k})| \le \nu\sigma_{t,k} + \epsilon(m)$, and the last step applies Lemma B.8 with $\beta = 1$.

B.4 PROOF OF LEMMA 4.5

Proof of Lemma 4.5.
Consider the following two events at round $t$:
$$A = \{\forall k \in S_t,\ \tilde r_{t,k} < \tilde r_{t,a^*_t} \mid \mathcal{F}_t, E^\mu_t\}, \qquad B = \{a_t \notin S_t \mid \mathcal{F}_t, E^\mu_t\}.$$
Clearly, $A$ implies $B$, since $a_t = \mathrm{argmax}_k \tilde r_{t,k}$. Therefore,
$$\Pr\big(a_t \notin S_t \,\big|\, \mathcal{F}_t, E^\mu_t\big) \ge \Pr\big(\forall k \in S_t,\ \tilde r_{t,k} < \tilde r_{t,a^*_t} \,\big|\, \mathcal{F}_t, E^\mu_t\big).$$
Suppose $E^\sigma_t$ also holds; then it is easy to show that for all $k \in [K]$,
$$|h(x_{t,k}) - \tilde r_{t,k}| \le |h(x_{t,k}) - f(x_{t,k}; \theta_{t-1})| + |f(x_{t,k}; \theta_{t-1}) - \tilde r_{t,k}| \le \epsilon(m) + (1 + c_t)\nu\sigma_{t,k}. \tag{B.4}$$
Hence, for all $k \in S_t$, we have that
$$h(x_{t,a^*_t}) - \tilde r_{t,k} \ge h(x_{t,a^*_t}) - h(x_{t,k}) - |h(x_{t,k}) - \tilde r_{t,k}| \ge \epsilon(m),$$
where we used the definition of saturated arms in Definition 4.4 and those of $E^\mu_t$ and $E^\sigma_t$ in (4.1). Consider the event $C = \{h(x_{t,a^*_t}) - \epsilon(m) < \tilde r_{t,a^*_t}\}$; on $C$, the display above implies $\tilde r_{t,a^*_t} > \tilde r_{t,k}$ for all $k \in S_t$, so $A$ holds.

Continuing the proof of Lemma 4.6, the expected gap is bounded through the chain
$$\le (1 + c_t)\nu\big(2\sigma_{t,\bar k_t} + \mathbb{E}[\sigma_{t,a_t} \mid \mathcal{F}_t, E^\mu_t]\big) + 4\epsilon(m) + \frac{2}{t^2} \le (1 + c_t)\nu\Big( \frac{2\,\mathbb{E}[\sigma_{t,a_t} \mid \mathcal{F}_t, E^\mu_t]}{\frac{1}{4e\sqrt{\pi}} - \frac{1}{t^2}} + \mathbb{E}[\sigma_{t,a_t} \mid \mathcal{F}_t, E^\mu_t] \Big) + 4\epsilon(m) + \frac{2}{t^2} \le 44e\sqrt{\pi}(1 + c_t)\nu\, \mathbb{E}[\sigma_{t,a_t} \mid \mathcal{F}_t, E^\mu_t] + 4\epsilon(m) + 2t^{-2},$$
where the inequality on the second line uses the bound provided in (B.8), the trivial bound on $h(x_{t,a^*_t}) - h(x_{t,a_t})$ for the second term, and Lemma 4.2; the inequality on the third line uses the bound on $\sigma_{t,\bar k_t}$ provided in (B.6); and the inequality on the fourth line follows from $1 \le 4e\sqrt{\pi}$ and $\big(\tfrac{1}{4e\sqrt{\pi}} - \tfrac{1}{t^2}\big)^{-1} \le 20e\sqrt{\pi}$, which holds trivially since the left-hand side is negative when $t \le 4$ and at $t = 5$ it attains its maximum $\approx 84.11 < 96.36 \approx 20e\sqrt{\pi}$.

By Lemma B.10 (Azuma–Hoeffding), $Y_t - Y_0 \le \sqrt{2\log(1/\delta)\sum_{i=1}^t B_i^2}$.

Proof of Lemma 4.7. From Lemma B.9, there exists a positive constant $C_1$ such that $X_t$ defined in (4.5) is bounded, with probability $1-\delta$, by
$$|X_t| \le |\bar\Delta_t| + C_1(1 + c_t)\nu\sqrt{L}\min\{\sigma_{t,a_t}, 1\} + 4\epsilon(m) + 2t^{-2} \le 2 + 2t^{-2} + C_1 C_2 (1 + c_t)\nu L + 4\epsilon(m) \le 4 + C_1 C_2 (1 + c_t)\nu L + 4\epsilon(m),$$
where the first inequality uses the fact that $|a - b| \le |a| + |b|$; the second inequality follows from Lemma B.9 and the fact that $|h| \le 1$, where $C_2$ is the positive constant from Lemma B.9; and the third inequality uses the fact that $t^{-2} \le 1$.
Noting that $c_t \le c_T$, Lemma 4.6 implies that with probability at least $1-\delta$, $Y_t$ is a super-martingale. From Lemma B.10, we have
$$Y_T - Y_0 \le \big(4 + C_1 C_2 (1 + c_T)\nu L + 4\epsilon(m)\big)\sqrt{2\log(1/\delta)\, T}. \tag{B.9}$$
Considering the definition of $Y_T$ in (4.5), (B.9) is equivalent to
$$\sum_{i=1}^T \bar\Delta_i \le 4T\epsilon(m) + 2\sum_{t=1}^T t^{-2} + C_1(1 + c_T)\nu\sqrt{L}\sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} + \big(4 + C_1 C_2 (1 + c_T)\nu L + 4\epsilon(m)\big)\sqrt{2\log(1/\delta)\, T},$$
and we then use $\sum_{t=1}^\infty t^{-2} = \pi^2/6$. The remaining summation is bounded as in Lemma 4.8:
$$\sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} \le \sqrt{T \sum_{t=1}^T \min\{\bar\sigma^2_{t,a_t}, 1\}} + C_1 T^{13/6}\sqrt{\log m}\, m^{-1/6}\lambda^{-2/3}L^{9/2},$$
where the first term follows from the Cauchy–Schwarz inequality and the second term from Lemma B.4. From Definition B.2, we have
$$\sum_{t=1}^T \min\{\bar\sigma^2_{t,a_t}, 1\} \le \lambda \sum_{t=1}^T \min\big\{g(x_{t,a_t}; \theta_0)^\top \bar U_{t-1}^{-1} g(x_{t,a_t}; \theta_0)/m,\ 1\big\} \le 2\lambda \log\det\Big( I + \lambda^{-1}\sum_{t=1}^T g(x_{t,a_t}; \theta_0) g(x_{t,a_t}; \theta_0)^\top/m \Big) = 2\lambda\log\det\big(I + \lambda^{-1}\bar J_T \bar J_T^\top/m\big) = 2\lambda\log\det\big(I + \lambda^{-1}\bar J_T^\top \bar J_T/m\big) = 2\lambda\log\det\big(I + \lambda^{-1} K_T\big),$$
where the first inequality moves the parameter $\lambda \ge 1$ outside the min operator and uses the definition of $\bar\sigma_{t,k}$ in Definition B.2, the second inequality utilizes Lemma B.11, the first equality uses the definition of $\bar J_t$ in Definition B.2, the second equality follows from $\det(I + AA^\top) = \det(I + A^\top A)$, and the last equality uses the definition of $K_t$ in Definition B.2. From Lemma B.7, we have $\log\det(I + \lambda^{-1}K_T) \le \log\det(I + \lambda^{-1}H) + 1$ under the conditions on $m$ and $\eta$ presented in Theorem 3.5. Taking a union bound, we have with probability $1-2\delta$ that
$$\sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} \le \sqrt{2\lambda T\big(\tilde d \log(1+TK) + 1\big)} + C_1 T^{13/6}\sqrt{\log m}\, m^{-1/6}\lambda^{-2/3}L^{9/2},$$
where we use the definition of $\tilde d$ in Definition 3.2. Replacing $\delta$ with $\delta/2$ completes the proof.
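The elliptical potential bound of Lemma B.11 (Abbasi-Yadkori et al., 2011) invoked above can be checked numerically. The sketch below uses random feature vectors and is purely illustrative:

```python
# Numeric check of Lemma B.11: for lam >= 1 and V_t = lam*I + sum_{i<=t} v_i v_i^T,
#   sum_t min{v_t^T V_{t-1}^{-1} v_t, 1} <= 2 log det(I + lam^{-1} sum_t v_t v_t^T).
import numpy as np

rng = np.random.default_rng(2)
d, T, lam = 5, 200, 1.0
V = lam * np.eye(d)                         # V_0
G = np.zeros((d, d))                        # running sum of v_i v_i^T
lhs = 0.0
for _ in range(T):
    v = rng.normal(size=d)
    lhs += min(v @ np.linalg.solve(V, v), 1.0)
    V += np.outer(v, v)
    G += np.outer(v, v)
rhs = 2.0 * np.linalg.slogdet(np.eye(d) + G / lam)[1]
assert lhs <= rhs
print(f"potential sum {lhs:.2f} <= 2 log det bound {rhs:.2f}")
```

This is the step that converts the running sum of posterior standard deviations $\sigma_{t,a_t}$ into a $\log\det$ quantity, which Lemma B.7 then relates to the effective dimension of the NTK matrix $H$.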

C PROOF OF AUXILIARY LEMMAS IN APPENDIX B

In this section, we prove the auxiliary lemmas used in Appendix B, starting with the following NTK lemmas. The first controls the difference between the parameter learned via gradient descent and the theoretically optimal solution of the linearized network.

Lemma C.1 (Lemma B.2, Zhou et al. (2019)). There exist constants $\{C_i\}_{i=1}^5 > 0$ such that for any $\delta > 0$, if $\eta$ and $m$ satisfy, for all $t \in [T]$,
$$2\sqrt{t/\lambda} \ge C_1 m^{-1} L^{-3/2}[\log(TKL^2/\delta)]^{3/2},$$
$$2\sqrt{t/\lambda} \le C_2 \min\Big\{ m^{1/2}L^{-6}[\log m]^{-3/2},\ m^{7/8}\big((\lambda\eta)^2 L^{-6} t^{-1}(\log m)^{-1}\big)^{3/8} \Big\},$$
$$\eta \le C_3 (m\lambda + tmL)^{-1}, \qquad m^{1/6} \ge C_4 \sqrt{\log m}\, L^{7/2} t^{7/6}\lambda^{-7/6}\big(1 + \sqrt{t/\lambda}\big),$$
then with probability at least $1-\delta$ over the random initialization of $\theta_0$, for any $t \in [T]$ we have $\|\theta_{t-1} - \theta_0\|_2 \le 2\sqrt{t/(m\lambda)}$ and
$$\big\| \theta_{t-1} - \theta_0 - \bar U_{t-1}^{-1}\bar J_{t-1} r_{t-1}/m \big\|_2 \le (1 - \eta m\lambda)^J \sqrt{t/(m\lambda)} + C_5 m^{-2/3}\sqrt{\log m}\, L^{7/2} t^{5/3}\lambda^{-5/3}\big(1 + \sqrt{t/\lambda}\big).$$

The next lemma controls the difference between the function value of the neural network and that of the linearized model.

Lemma C.2 (Lemma 4.1, Cao & Gu (2019)). There exist constants $\{C_i\}_{i=1}^3 > 0$ such that for any $\delta > 0$, if $\tau$ satisfies $C_1 m^{-3/2}L^{-3/2}[\log(TKL^2/\delta)]^{3/2} \le \tau \le C_2 L^{-6}[\log m]^{-3/2}$, then with probability at least $1-\delta$ over the random initialization of $\theta_0$, for all $\theta, \tilde\theta$ satisfying $\|\theta - \theta_0\|_2 \le \tau$, $\|\tilde\theta - \theta_0\|_2 \le \tau$ and all $j \in [TK]$, we have
$$\big| f(x^j; \theta) - f(x^j; \tilde\theta) - \langle g(x^j; \tilde\theta), \theta - \tilde\theta\rangle \big| \le C_3 \tau^{4/3} L^3 \sqrt{m\log m}.$$

To continue, the next lemma controls the difference between the gradient at $\theta$ and the gradient at the initialization.

Lemma C.3 (Theorem 5, Allen-Zhu et al. (2018)). There exist constants $\{C_i\}_{i=1}^3 > 0$ such that for any $\delta \in (0,1)$, if $\tau$ satisfies $C_1 m^{-3/2}L^{-3/2}[\log(TKL^2/\delta)]^{3/2} \le \tau \le C_2 L^{-6}[\log m]^{-3/2}$, then with probability at least $1-\delta$ over the random initialization of $\theta_0$, for all $\theta$ with $\|\theta - \theta_0\|_2 \le \tau$ and all $j \in [TK]$, we have
$$\|g(x^j; \theta) - g(x^j; \theta_0)\|_2 \le C_3 \sqrt{\log m}\, \tau^{1/3} L^3 \|g(x^j; \theta_0)\|_2.$$
We also need the next lemma to control the gradient norm of the neural network with the help of the NTK.

Lemma C.4 (Lemma B.3, Cao & Gu (2019)). There exist constants $\{C_i\}_{i=1}^3 > 0$ such that for any $\delta > 0$, if $\tau$ satisfies $C_1 m^{-3/2}L^{-3/2}[\log(TKL^2/\delta)]^{3/2} \le \tau \le C_2 L^{-6}[\log m]^{-3/2}$, then with probability at least $1-\delta$ over the random initialization of $\theta_0$, for any $\|\theta - \theta_0\|_2 \le \tau$ and $j \in [TK]$, we have $\|g(x^j; \theta)\|_F \le C_3\sqrt{mL}$.

Finally, as its name suggests, the last lemma bounds the distance between the kernel induced by the linearized model and the NTK when the network is wide enough.

The following computation is used in the proof of Lemma B.4:
$$\nabla_a \psi = \frac{Ca}{\sqrt{a^\top C a}}, \qquad \|\nabla_a \psi\|_2^2 = \frac{a^\top C^2 a}{a^\top C a} = \frac{b^\top D^2 b}{b^\top D b} = \frac{\sum_{i} b_i^2 \tilde\lambda_i^2}{\sum_{i} b_i^2 \tilde\lambda_i} \le \frac{1}{\lambda},$$
where the last inequality follows from $C \preceq \lambda^{-1} I$, which implies that every eigenvalue $\tilde\lambda_i \le 1/\lambda$; hence $\|\nabla_a \psi\|_2 \le 1/\sqrt{\lambda}$. For the same reason,
$$\|g(x; \theta) - g(x; \theta_0)\|_2/\sqrt{m} \le \bar C_2 \sqrt{\log m}\,\tau^{1/3}L^3\|g(x; \theta_0)\|_2/\sqrt{m} \le \bar C_3\sqrt{\log m}\, t^{1/6}m^{-1/6}\lambda^{-1/6}L^{7/2}.$$

C.4 PROOF OF LEMMA B.9

Proof of Lemma B.9. Set $\tau$ in Lemma C.4 to $2\sqrt{t/(m\lambda)}$. Then the network width $m$ and learning rate $\eta$ satisfy all of the conditions needed by Lemmas C.1 to C.5. Hence, there exists $C_1$ such that $\|g(x;\theta)\|_2 \le \|g(x;\theta)\|_F \le C_1\sqrt{mL}$ for all $x$. Since it is easy to verify that $U_{t-1}^{-1} \preceq \lambda^{-1} I$, we have for all $t \in [T]$, $k \in [K]$ that
$$\sigma^2_{t,k} = \lambda\, g(x_{t,k};\theta_{t-1})^\top U_{t-1}^{-1} g(x_{t,k};\theta_{t-1})/m \le \|g(x_{t,k};\theta_{t-1})\|_2^2/m \le C_1^2 L.$$
Therefore, $\sigma_{t,k} \le C_1\sqrt{L}$, with probability $1-2\delta$ by a union bound (Lemmas C.1 and C.4). Replacing $\delta$ with $\delta/2$ completes the proof.

D AN UPPER BOUND OF EFFECTIVE DIMENSION d

We now provide an example showing that when all contexts $x^i$ concentrate on a $d'$-dimensional nonlinear subspace of the RKHS spanned by the NTK, the effective dimension $\tilde d$ is bounded by $d' + 1$. We consider the case $\lambda = 1$, $L = 2$. Suppose there exists a constant $d'$ such that for any $i > d'$, $0 < \lambda_i(H) \le 1/(TK)$, where $\lambda_i(H)$ denotes the $i$-th largest eigenvalue of $H$. Then the effective dimension $\tilde d$ can be bounded as
$$\tilde d = \frac{\log\det(I + H)}{\log(1 + TK)} = \frac{\sum_i \log\big(1 + \lambda_i(H)\big)}{\log(1 + TK)} \le \underbrace{\frac{\sum_{i \le d'} \log\big(1 + \lambda_i(H)\big)}{\log(1 + TK)}}_{I_1} + \underbrace{\sum_{i > d'} \log\big(1 + \lambda_i(H)\big)}_{I_2}.$$
For $I_1$ and $I_2$ we have
$$I_1 \le \sum_{i=1}^{d'} \log\big(1 + \|H\|_2\big) = \Theta(d'), \qquad I_2 \le \sum_{i > d'} \lambda_i(H) \le TK \cdot \frac{1}{TK} = 1.$$
Therefore, the effective dimension satisfies $\tilde d \le d' + 1$. To show how the requirement can be satisfied, we first give a characterization of the RKHS spanned by the NTK. By Bietti & Mairal (2019); Cao et al. (2019), each entry of $H$ has the following form:
$$H_{i,s} = \sum_{k=0}^{\infty} \mu_k \sum_{j=1}^{N(d,k)} Y_{k,j}(x^i)\, Y_{k,j}(x^s),$$
where $Y_{k,j}$ for $j = 1, \ldots, N(d,k)$ are linearly independent spherical harmonics of degree $k$ in $d$ variables, $d$ is the input dimension, $N(d,k) = \frac{2k+d-2}{k}\binom{k+d-3}{d-2}$, and $\mu_k = \Theta\big(\max\{k^{-d}, (d-1)^{-k+1}\}\big)$. In this case, the feature mapping $(\sqrt{\mu_k}\, Y_{k,j}(x))_{k,j}$ maps any context $x$ from $\mathbb{R}^d$ to an RKHS $\mathcal{R}$ corresponding to $H$. Let $y_i \in \mathcal{R}$ denote the image of $x^i$. Then, if there exists a $d'$-dimensional subspace $\mathcal{R}'$ such that $\|y_i - z_i\|$ is sufficiently small for all $i$, where $z_i$ is the projection of $y_i$ onto $\mathcal{R}'$, the requirement on $\lambda_i(H)$ holds.
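The effective-dimension bound above can be illustrated numerically. The sketch below (not from the paper; eigenvalues are synthetic) plants $d'$ large eigenvalues and a tail below $1/(TK)$, then checks $\tilde d \le d' + 1$:

```python
# Illustrative check of the bound d_tilde <= d' + 1 under the tail condition
# lambda_i(H) <= 1/(TK) for i > d', with lam = 1.
import numpy as np

T, K, d_prime, n = 1000, 10, 4, 200        # n = total number of eigenvalues
eig = np.concatenate([
    np.array([3.0, 1.5, 0.8, 0.5]),        # d' "large" eigenvalues (synthetic)
    np.full(n - d_prime, 0.5 / (T * K)),   # tail strictly below 1/(TK)
])
# log det(I + H) = sum_i log(1 + lambda_i(H)) for a PSD matrix H.
d_tilde = np.log1p(eig).sum() / np.log1p(T * K)
assert d_tilde <= d_prime + 1
print(f"effective dimension {d_tilde:.3f} <= d' + 1 = {d_prime + 1}")
```

The point of the example is that $\tilde d$ is governed by the number of non-negligible NTK eigenvalues, not by the ambient number of contexts $TK$.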



Figure 1: Comparison of Neural Thompson Sampling and baselines on UCI datasets and MNIST dataset. The total regret measures cumulative classification errors made by an algorithm. Results are averaged over 8 runs with standard errors shown as shaded areas.

Figure 2: Comparison of Neural Thompson Sampling and Neural UCB on UCI datasets and the MNIST dataset under different scales of delay. The total regret measures cumulative classification errors made by an algorithm. Results are averaged over 8 runs with standard errors shown as error bars.

Figure 3: Comparison of Neural Thompson Sampling and baselines on UCI datasets and MNIST dataset. The total regret measures cumulative classification errors made by an algorithm. Results are averaged over multiple runs with standard errors shown as shaded areas.

Figure 4: Comparison of Neural Thompson Sampling and Neural UCB on UCI datasets and the MNIST dataset under different scales of delay. The total regret measures cumulative classification errors made by an algorithm. Results are averaged over multiple runs with standard errors shown as error bars.

Figure 5: Comparison of the running time of Neural TS, Neural UCB and ε-greedy for neural networks on UCI datasets and the MNIST dataset.

Lemma C.5 (Lemma B.1, Zhou et al. (2019)). Let $K$ be the $TK \times TK$ Gram matrix with entries $K_{i,j} = \langle g(x^i; \theta_0), g(x^j; \theta_0)\rangle/m$, and recall the definition of $H$ in Definition 3.1. Then there exists a constant $C_1$ such that for any $\epsilon > 0$, if $m \ge C_1 L^6 \log(TKL/\delta)\,\epsilon^{-4}$, then with probability at least $1-\delta$, $\|K - H\|_F \le TK\epsilon$.

Equipped with these lemmas, we can proceed with our proofs.

C.1 PROOF OF LEMMA B.4

Proof of Lemma B.4. First, set $\tau = 2\sqrt{t/(m\lambda)}$; then the conditions on the network width $m$ and the learning rate $\eta$ satisfy all of the conditions needed from Lemma C.1 to Lemma C.5.

Taking the trace of both sides and utilizing $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ and $\mathrm{tr}(\alpha^\top \beta) = \mathrm{tr}(\alpha\beta^\top)$, we have that
$$2\,\mathrm{tr}(\psi\,\partial\psi) = \mathrm{tr}\Big( 2(\partial a)^\top \Big(\lambda I + \sum_i a_i a_i^\top\Big)^{-1} a \Big).$$
Write $C = \big(\lambda I + \sum_i a_i a_i^\top\big)^{-1}$ for simplicity and decompose $C = Q^\top D Q$, $b = Qa$, where $D = \mathrm{diag}(\tilde\lambda_1, \ldots, \tilde\lambda_p)$ contains the eigenvalues of $C$; then we have that

Total regret at the final step, with standard deviations attached.

Performance in terms of total regret compared with other methods on all datasets. The tuple (w/t/l) indicates the number of times the algorithm in that row wins, ties, or loses compared with the other 7 algorithms, under a t-test at the 90% significance level.

$\ldots a^*_t \mid \mathcal{F}_t, E^\mu_t\}$. To prove Lemma 4.6, we will need an upper bound on $\sigma_{t,k}$.

Lemma B.9. For any $t \in [T]$, $k \in [K]$, and $\delta \in (0,1)$, if the network width $m$ satisfies Condition 4.1, then with probability at least $1-\delta$ we have $\sigma_{t,k} \le C_1\sqrt{L}$ for a positive constant $C_1$.

Proof of Lemma 4.6. Recall that given $\mathcal{F}_t$ and $E^\mu_t$, the only randomness comes from sampling $\tilde r_{t,k}$ for $k \in [K]$. Let $\bar k_t$ be the unsaturated arm with the smallest $\sigma_{t,\cdot}$. Then the first inequality ignores the case $a_t \in S_t$, and the second inequality follows from Lemma 4.5 and the definition of $\bar k_t$ above. We have
$$h(x_{t,a^*_t}) - h(x_{t,a_t}) = h(x_{t,a^*_t}) - h(x_{t,\bar k_t}) + h(x_{t,\bar k_t}) - h(x_{t,a_t}) \le (1 + c_t)\nu\sigma_{t,\bar k_t} + 2\epsilon(m) + h(x_{t,\bar k_t}) - \tilde r_{t,\bar k_t} - h(x_{t,a_t}) + \tilde r_{t,a_t} + \tilde r_{t,\bar k_t} - \tilde r_{t,a_t}$$

$$\le \min\big\{44e\sqrt{\pi}(1 + c_t)\nu\, \mathbb{E}[\sigma_{t,a_t} \mid \mathcal{F}_t, E^\mu_t],\ 2\big\} + 4\epsilon(m) + 2t^{-2},$$
and since $1 + c_t \ge 1$ and $\nu = B + R\sqrt{\log\det(I + H/\lambda) + 2 + 2\log(1/\delta)} \ge B$, recalling that $22e\sqrt{\pi}B \ge 1$, it is easy to verify that the corresponding inequality also holds, where we use the fact that there exists a constant $C_1$ such that $\sigma_{t,a_t}$ is bounded by $C_1\sqrt{L}$ (Lemma B.9).

Lemma B.10 (Azuma–Hoeffding inequality for super-martingales). If a super-martingale $Y_t$ adapted to the filtration $\mathcal{F}_t$ satisfies $|Y_t - Y_{t-1}| \le B_t$, then for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have
$$Y_t - Y_0 \le \sqrt{2\log(1/\delta)\sum_{i=1}^t B_i^2}.$$

and merging the constant $C_1$ with $44e\sqrt{\pi}$, and taking a union bound over the probability bounds of Lemmas 4.6, B.10 and B.9, the inequality above holds with probability at least $1-3\delta$. Rescaling $\delta$ to $\delta/3$ and merging the product $C_1 C_2$ into a new positive constant leads to the desired result.

Lemma B.11 (Lemma 11, Abbasi-Yadkori et al. (2011)). Let $\{v_t\}_{t=1}^\infty$ be a sequence in $\mathbb{R}^d$ and define $V_t = \lambda I + \sum_{i=1}^t v_i v_i^\top$. If $\lambda \ge 1$, then
$$\sum_{t=1}^T \min\big\{v_t^\top V_{t-1}^{-1} v_t,\ 1\big\} \le 2\log\det\Big(I + \lambda^{-1}\sum_{i=1}^T v_i v_i^\top\Big).$$

Proof of Lemma 4.8. First, recall $\bar\sigma_{t,k}$ defined in Definition B.2 and the bound on $\bar\sigma_{t,k} - \sigma_{t,k}$ provided in Lemma B.4. We have that there exists a positive constant $C_1$ such that

Published as a conference paper at ICLR 2021

Thus, from Lemma C.1 we have $\|\theta_{t-1} - \theta_0\|_2 \le \tau$, and from Lemma C.4 there exists a positive constant $\bar C_1$ such that $\|g(x; \theta_{t-1})\|_2 \le \bar C_1\sqrt{mL}$. We then obtain that the function $\psi$ is defined on the domain $\|a\|_2 \le \bar C_1\sqrt{L}$, $\|a_i\|_2 \le \bar C_1\sqrt{L}$, and by taking the derivative of $\psi^2$, we have that $2\psi\,\partial\psi = (\partial a)^\top$

$\|\nabla_{a_i} \psi\|_2 = \ldots$

ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for their helpful comments. WZ, DZ and QG are partially supported by National Science Foundation CAREER Award 1906169 and IIS-1904183. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the views of any funding agency.


Thus, we obtain that there exists a constant $C_1$ such that
$$|\sigma_{t,k} - \bar\sigma_{t,k}| \le C_1\sqrt{\log m}\, t^{7/6} m^{-1/6}\lambda^{-2/3}L^{9/2},$$
where we take $C_1 = \max\{\bar C_3, \bar C_3\bar C_1^2\}$ and use $L \ge 1$ to merge the first term into the second. This inequality relies on Lemmas C.1, C.3 and C.4, and thus holds with probability at least $1-3\delta$. Replacing $\delta$ with $\delta/3$ completes the proof.

C.2 PROOF OF LEMMA B.5

Proof of Lemma B.5. Setting $\tau = 2\sqrt{t/(m\lambda)}$, the conditions on the network width $m$ and the learning rate $\eta$ satisfy all of the conditions needed by Lemmas C.1 to C.5. From Lemma C.1 we have $\|\theta_{t-1} - \theta_0\|_2 \le \tau$. Then, by Lemma C.2, there exists a constant $C_1$ such that …, and by the gradient-norm bound given in Lemma C.4, there exist positive constants $\bar C_1, \bar C_2$ such that …, where … $+\, C_3 m^{-1/6}\sqrt{\log m}\, L^4 t^{5/3}\lambda^{-5/3}\big(1 + \sqrt{t/\lambda}\big)$, which holds with probability $1-3\delta$ by a union bound (Lemmas C.1, C.2 and C.4). Replacing $\delta$ with $\delta/3$ completes the proof.

C.3 PROOF OF LEMMA B.7

Proof of Lemma B.7. From the definition of $K_t$, we have that … where the first inequality holds because the double summation on the second line contains more (nonnegative) terms than the summation on the first line; the second inequality utilizes the definition of $K$ in Lemma C.5 and of $H$ in Definition 3.1; the third inequality follows from the concavity of the $\log\det(\cdot)$ function; the fourth inequality follows from the fact that $\langle A, B\rangle \le \|A\|_F \|B\|_F$; the fifth inequality uses the fact that $\|A\|_F \le \sqrt{TK}\,\|A\|_2$ for $A \in \mathbb{R}^{TK\times TK}$ and $\lambda \ge 0$; and the sixth inequality applies Lemma C.5 with $\epsilon = (TK)^{-3/2}$ and $m \ge C_1 L^6 T^6 K^6 \log(TKL/\delta)$, which concludes the proof.
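The matrix-perturbation fact behind Lemma B.7 — a small Frobenius-norm perturbation of the kernel moves $\log\det$ only slightly, via concavity of $\log\det$ and $\langle A, B\rangle \le \|A\|_F\|B\|_F$ — can be checked numerically. The sketch below uses synthetic matrices and is illustrative only:

```python
# Illustrative check: if ||K - H||_F = eps, then by concavity of log det,
#   log det(I + K/lam) <= log det(I + H/lam)
#                         + ||(I + H/lam)^{-1}||_F * eps / lam.
import numpy as np

rng = np.random.default_rng(3)
n, lam, eps = 40, 1.0, 1e-3
A = rng.normal(size=(n, n))
H = A @ A.T / n                             # a PSD "NTK" Gram matrix (synthetic)
E = rng.normal(size=(n, n))
E = (E + E.T) / 2
E *= eps / np.linalg.norm(E)                # symmetric perturbation, ||E||_F = eps
K = H + E

def logdet(M):
    return np.linalg.slogdet(M)[1]

lhs = logdet(np.eye(n) + K / lam)
bound = (logdet(np.eye(n) + H / lam)
         + np.linalg.norm(np.linalg.inv(np.eye(n) + H / lam)) * eps / lam)
assert lhs <= bound
print(f"log det perturbation within bound: {lhs:.6f} <= {bound:.6f}")
```

Setting $\epsilon = (TK)^{-3/2}$ as in the proof makes the perturbation term $O(1)$, which is where the "$+1$" slack in Lemma B.7 comes from.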

