FEDERATED NEURAL BANDITS

Abstract

Recent works on neural contextual bandits have achieved compelling performances due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, thus giving rise to the setting of federated contextual bandits. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. So, this paper introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: UCB a allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while UCB b uses an NN with aggregated parameters for reward prediction in a similar way to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis is amenable to an interesting interpretation, which emphasizes UCB a initially for accelerated exploration and relies more on UCB b later after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.

1. INTRODUCTION

The stochastic multi-armed bandit is a prominent method for sequential decision-making problems due to its principled ability to handle the exploration-exploitation trade-off (Auer, 2002; Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020) . In particular, the stochastic contextual bandit problem has received enormous attention due to its widespread real-world applications such as recommender systems (Li et al., 2010a) , advertising (Li et al., 2010b) , and healthcare (Greenewald et al., 2017) . In each iteration of a stochastic contextual bandit problem, an agent receives a context (i.e., a d-dimensional feature vector) for each of the K arms, selects one of the K contexts/arms, and observes the corresponding reward. The goal of the agent is to sequentially pull the arms in order to maximize the cumulative reward (or equivalently, minimize the cumulative regret) in T iterations. To minimize the cumulative regret, linear contextual bandit algorithms assume that the rewards can be modeled as a linear function of the input contexts (Dani et al., 2008) and select the arms via classic methods such as upper confidence bound (UCB) (Auer, 2002) or Thompson sampling (TS) (Thompson, 1933) , consequently yielding the Linear UCB (Abbasi-Yadkori et al., 2011) and Linear TS (Agrawal & Goyal, 2013) algorithms. The potentially restrictive assumption of a linear model was later relaxed by kernelized contextual bandit algorithms (Chowdhury & Gopalan, 2017; Valko et al., 2013) , which assume that the reward function belongs to a reproducing kernel Hilbert space (RKHS) and hence model the reward function using kernel ridge regression or Gaussian process (GP) regression. However, this assumption may still be restrictive (Zhou et al., 2020 ) and the kernelized model may fall short when the reward function is very complex and difficult to model. 
To this end, neural networks (NNs), which excel at modeling complex real-world functions, have been adopted to model the reward function in contextual bandits, thereby leading to neural contextual bandit algorithms such as Neural UCB (Zhou et al., 2020) and Neural TS (Zhang et al., 2021) . Due to their ability to use the highly expressive NNs for better reward prediction (i.e., exploitation), Neural UCB and Neural TS have been shown to outperform both linear and kernelized contextual bandit algorithms in practice. Moreover, the cumulative regrets of Neural UCB and Neural TS have been analyzed by leveraging the theory of the neural tangent kernel (NTK) (Jacot et al., 2018) , hence making these algorithms both provably efficient and practically effective. We give a comprehensive review of the related works on neural bandits in App. A. The contextual bandit algorithms discussed above are only applicable to problems with a single agent. However, many modern applications of contextual bandits involve multiple agents who (a) collaborate with each other for better performances and yet (b) are unwilling to share their raw observations (i.e., the contexts and rewards). For example, companies may collaborate to improve their contextual bandits-based recommendation algorithms without sharing their sensitive user data (Huang et al., 2021b) , while hospitals deploying contextual bandits for personalized treatment may collaborate to improve their treatment strategies without sharing their sensitive patient information (Dai et al., 2020) . These applications naturally fall under the setting of federated learning (FL) (Kairouz et al., 2019; Li et al., 2021) which facilitates collaborative learning of supervised learning models (e.g., NNs) without sharing the raw data. In this regard, a number of federated contextual bandit algorithms have been developed to allow bandit agents to collaborate in the federated setting (Shi & Shen, 2021) . 
We present a thorough discussion of the related works on federated contextual bandits in App. A. Notably, Wang et al. (2020) have adopted the Linear UCB policy and developed a mechanism to allow every agent to additionally use the observations from the other agents to accelerate exploration, while only requiring the agents to exchange some sufficient statistics instead of their raw observations. However, these previous works have only relied on either linear (Dubey & Pentland, 2020; Huang et al., 2021b) or kernelized (Dai et al., 2020; 2021) methods which, as discussed above, may lack the expressive power to model complex real-world reward functions (Zhou et al., 2020) . Therefore, this naturally brings up the need to use NNs for better exploitation (i.e., reward prediction) in federated contextual bandits, thereby motivating the need for a federated neural contextual bandit algorithm. To develop a federated neural contextual bandit algorithm, an important technical challenge is how to leverage the federated setting to simultaneously (a) accelerate exploration by allowing every agent to additionally use the observations from the other agents without requiring the exchange of raw observations (in a similar way to that of Wang et al. (2020) ), and (b) improve exploitation by further enhancing the quality of the NN for reward prediction through the federated setting (i.e., without requiring centralized training using the observations from all agents). In this work, we provide a theoretically grounded solution to tackle this challenge by deploying a weighted combination of two upper confidence bounds (UCBs). The first UCB, denoted by UCB a , incorporates the neural tangent features (i.e., the random features embedding of NTK) into the Linear UCB-based mechanism adopted by Wang et al. (2020) , which achieves the first goal of accelerating exploration. 
The second UCB, denoted by $\mathrm{UCB}^b$, adopts an aggregated NN whose parameters are the average of the parameters of the NNs trained by all agents using their local observations for better reward prediction (i.e., better exploitation in the second goal). Hence, $\mathrm{UCB}^b$ improves the quality of the NN for reward prediction in a similar way to the most classic FL method of federated averaging (FedAvg) for supervised learning (McMahan et al., 2017). Notably, our choice of the weight between the two UCBs, which naturally arises during our theoretical analysis, has an interesting practical interpretation (Sec. 3.3): More weight is given to $\mathrm{UCB}^a$ in earlier iterations, which allows us to use the observations from the other agents to accelerate the exploration in the early stage; more weight is assigned to $\mathrm{UCB}^b$ only in later iterations after every agent has collected enough local observations to train its NN for accurate reward prediction (i.e., reliable exploitation). Of note, our novel design of the weight (Sec. 3.3) is crucial for our theoretical analysis and may be of broader interest for future works on related topics.

This paper introduces the first federated neural contextual bandit algorithm, which we call federated neural-UCB (FN-UCB) (Sec. 3). We derive an upper bound on its total cumulative regret from all $N$ agents: $R_T = O(d\sqrt{TN} + d_{\max} N\sqrt{T})$, where $d$ is the effective dimension of the contexts from all $N$ agents and $d_{\max}$ represents the maximum among the $N$ individual effective dimensions of the contexts from the $N$ agents (Sec. 2). The communication complexity (i.e., total number of communication rounds in $T$ iterations) of FN-UCB can be upper-bounded by $C_T = O(d\sqrt{N})$. Finally, we use both synthetic and real-world contextual bandit experiments to explore the interesting insights about our FN-UCB and demonstrate its effective practical performance (Sec. 5).

2. BACKGROUND AND PROBLEM SETTING

Let $[k]$ denote the set $\{1, 2, \ldots, k\}$ for a positive integer $k$, $\mathbf{0}_k$ represent a $k$-dimensional vector of 0's, and $\mathbf{0}_{k \times k}$ denote an all-zero matrix with dimension $k \times k$. Our setting involves $N$ agents with the same reward function $h$ defined on a domain $\mathcal{X} \subset \mathbb{R}^d$. We consider centralized and synchronous communication: The communication is coordinated by a central server and every agent exchanges information with the central server during a communication round. In each iteration $t \in [T]$, every agent $i \in [N]$ receives a set $\mathcal{X}_{t,i} \triangleq \{x^k_{t,i}\}_{k \in [K]}$ of $K$ context vectors and selects one of them $x_{t,i} \in \mathcal{X}_{t,i}$ to be queried to observe a noisy output $y_{t,i} \triangleq h(x_{t,i}) + \epsilon$ where $\epsilon$ is an $R$-sub-Gaussian noise. We will analyze the total cumulative regret from all $N$ agents in $T$ iterations: $R_T \triangleq \sum_{i=1}^{N} \sum_{t=1}^{T} r_{t,i}$ where $r_{t,i} \triangleq h(x^*_{t,i}) - h(x_{t,i})$ and $x^*_{t,i} \triangleq \arg\max_{x \in \mathcal{X}_{t,i}} h(x)$.

Let $f(x; \theta)$ denote the output of a fully connected NN for input $x$ with parameters $\theta$ (of dimension $p_0$) and $g(x; \theta)$ denote the corresponding (column) gradient vector. We focus on NNs with ReLU activations, and use $L \geq 2$ and $m$ to denote their depth and width (of every layer), respectively. We follow the initialization technique from Zhang et al. (2021); Zhou et al. (2020) to initialize the NN parameters $\theta_0 \sim \mathrm{init}(\cdot)$. Of note, as a common ground for collaboration, we let all $N$ agents share the same initial parameters $\theta_0$ when training their NNs and computing their neural tangent features: $g(x; \theta_0)/\sqrt{m}$ (i.e., the random features embedding of NTK (Zhang et al., 2021)). Also, let $\mathbf{H}$ denote the $(TKN) \times (TKN)$-dimensional NTK matrix on the set of all received $TKN$ contexts (Zhang et al., 2021; Zhou et al., 2020). Similarly, let $\mathbf{H}_i$ denote the $(TK) \times (TK)$-dimensional NTK matrix on the set of $TK$ contexts received by agent $i$. We defer the details on the definitions of $\mathbf{H}$ and the $\mathbf{H}_i$'s, the NN $f(x; \theta)$, and the initialization scheme $\theta_0 \sim \mathrm{init}(\cdot)$ to App. B due to limited space.
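The regret definition above can be made concrete with a minimal sketch. It assumes the reward function is available as a Python callable, which is only true in simulation; the names `instantaneous_regret`, `total_cumulative_regret`, and the `history` layout are hypothetical and not part of the paper.

```python
from typing import Callable, List, Sequence, Tuple

Context = Sequence[float]
# One round for one agent: (the K contexts received, the context that was queried).
Round = Tuple[List[Context], Context]


def instantaneous_regret(
    h: Callable[[Context], float], contexts: List[Context], chosen: Context
) -> float:
    """r_{t,i} = h(x*_{t,i}) - h(x_{t,i}), where x* maximizes h over the received set."""
    best_value = max(h(x) for x in contexts)
    return best_value - h(chosen)


def total_cumulative_regret(
    h: Callable[[Context], float], history: List[List[Round]]
) -> float:
    """R_T: sum of r_{t,i} over all agents i and all iterations t.

    history[i][t] holds the round of agent i at iteration t (hypothetical layout).
    """
    return sum(
        instantaneous_regret(h, contexts, chosen)
        for agent_rounds in history
        for contexts, chosen in agent_rounds
    )
```

The regret is measured against the best arm in each received context set, so an agent that always queries the maximizer accumulates zero regret regardless of the observation noise.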
Next, let $\mathbf{h} \triangleq [h(x^k_{t,i})]_{t \in [T], i \in [N], k \in [K]}$ denote the $(TKN)$-dimensional column vector of reward function values at all received contexts and $B$ be an absolute constant s.t. $\sqrt{2 \mathbf{h}^\top \mathbf{H}^{-1} \mathbf{h}} \leq B$. This is related to the commonly adopted assumption in kernelized bandits that $h$ lies in the RKHS $\mathcal{H}$ induced by the NTK (Chowdhury & Gopalan, 2017; Srinivas et al., 2010) (or, equivalently, that the RKHS norm $\|h\|_{\mathcal{H}}$ of $h$ is upper-bounded by a constant) because $\sqrt{\mathbf{h}^\top \mathbf{H}^{-1} \mathbf{h}} \leq \|h\|_{\mathcal{H}}$ (Zhou et al., 2020). Following the works of Zhang et al. (2021); Zhou et al. (2020), we define the effective dimension of $\mathbf{H}$ as $d \triangleq \frac{\log\det(I + \mathbf{H}/\lambda)}{\log(1 + TKN/\lambda)}$ with regularization parameter $\lambda > 0$. Similarly, we define the effective dimension for agent $i$ as $d_i \triangleq \frac{\log\det(I + \mathbf{H}_i/\lambda)}{\log(1 + TK/\lambda)}$ and also define $d_{\max} \triangleq \max_{i \in [N]} d_i$. Note that the effective dimension is related to the maximum information gain $\gamma$, which is a commonly adopted notion in kernelized bandits (Zhang et al., 2021): $d \leq 2\gamma_{TKN} / \log(1 + TKN/\lambda)$ and $d_i \leq 2\gamma_{TK} / \log(1 + TK/\lambda), \forall i \in [N]$.

Consistent with the works on neural contextual bandits (Zhang et al., 2021; Zhou et al., 2020), our only assumption on the reward function $h$ is its boundedness: $|h(x)| \leq 1, \forall x \in \mathcal{X}$. We also make the following assumptions for our theoretical analysis, all of which are mild and easily achievable, as discussed in (Zhang et al., 2021; Zhou et al., 2020):

Assumption 1. There exists $\lambda_0 > 0$ s.t. $\mathbf{H} \succeq \lambda_0 I$ and $\mathbf{H}_i \succeq \lambda_0 I, \forall i \in [N]$. Also, all contexts satisfy $\|x\|_2 = 1$ and $[x]_j = [x]_{j+d/2}$, $\forall x \in \mathcal{X}_{t,i}, \forall t \in [T], i \in [N]$.
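To illustrate the definition of the effective dimension, the following sketch evaluates $\log\det(I + \mathbf{H}/\lambda) / \log(1 + n/\lambda)$ for an arbitrary symmetric positive semi-definite matrix standing in for the NTK matrix, where $n$ is the number of contexts (i.e., $TKN$ or $TK$). The helper names and the pure-Python Cholesky routine are assumptions made for the example; computing the actual NTK matrix is not covered here.

```python
import math
from typing import List

Matrix = List[List[float]]


def log_det_spd(a: Matrix) -> float:
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    # det(A) = prod(diag(chol))^2, hence the factor of 2.
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def effective_dimension(gram: Matrix, lam: float) -> float:
    """log det(I + gram / lam) / log(1 + n / lam), with n the number of contexts."""
    n = len(gram)
    shifted = [
        [gram[i][j] / lam + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return log_det_spd(shifted) / math.log(1.0 + n / lam)
```

For instance, a rank-one all-ones Gram matrix on three contexts with $\lambda = 1$ gives an effective dimension of 1, whereas the identity matrix of the same size gives 1.5, which matches the intuition that the quantity tracks how many directions the contexts actually span.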

3. FEDERATED NEURAL-UPPER CONFIDENCE BOUND (FN-UCB)

Our FN-UCB algorithm is described in Algo. 1 (agents' part) and Algo. 2 (central server's part).

3.1. OVERVIEW OF FN-UCB ALGORITHM

Before the beginning of the algorithm, we sample the initial parameters $\theta_0$ and share them with all agents (Sec. 2). In each iteration $t \in [T]$, every agent $i \in [N]$ receives a set $\mathcal{X}_{t,i} = \{x^k_{t,i}\}_{k \in [K]}$ of $K$ contexts (line 3 of Algo. 1) and then uses a weighted combination of $\mathrm{UCB}^a_{t,i}$ and $\mathrm{UCB}^b_{t,i}$ to select a context $x_{t,i} \in \mathcal{X}_{t,i}$ to be queried (lines 4-7 of Algo. 1). Next, each agent $i$ observes a noisy output $y_{t,i}$ (line 8 of Algo. 1) and then updates its local information (lines 9-10 of Algo. 1). After that, every agent checks if it has collected enough information since the last communication round (i.e., checks the criterion in line 11 of Algo. 1); if so, it sends a synchronization signal to the central server (line 12 of Algo. 1), which then tells all agents to start a communication round (line 2 of Algo. 2). During a communication round, every agent $i$ uses its current history of local observations to train an NN (line 14 of Algo. 1) and sends its updated local information to the central server (line 16 of Algo. 1); the central server then aggregates this information from all agents (lines 4-5 of Algo. 2) and broadcasts the aggregated information back to all agents (line 6 of Algo. 2) to start the next iteration.

Algorithm 1 FN-UCB (Agent $i$)
1: inputs: $\lambda = 1 + 2/T$, $\theta_0 \sim \mathrm{init}(\cdot)$, $W_{\text{sync}} = \mathbf{0}_{p_0 \times p_0}$, $W_{\text{new},i} = \mathbf{0}_{p_0 \times p_0}$, $B_{\text{sync}} = \mathbf{0}_{p_0}$, $B_{\text{new},i} = \mathbf{0}_{p_0}$, $\alpha = 0$, $V^{\text{local}}_{t,i} = \lambda I$, $V^{-1}_{\text{sync,NN}} = \lambda^{-1} I$, $V_{\text{last}} = \lambda I$, $t_{\text{last}} = 0$, $\theta_{\text{sync,NN}} = \theta_0$
2: for $t = 1, 2, \ldots, T$ do
3:     Receive a set $\mathcal{X}_{t,i} = \{x^k_{t,i}\}_{k \in [K]}$ of $K$ contexts
4:     Compute $V_{t,i} = \lambda I + W_{\text{sync}} + W_{\text{new},i}$, $\theta_{t,i} = V^{-1}_{t,i}(B_{\text{sync}} + B_{\text{new},i})$
5:     Compute $\mathrm{UCB}^a_{t,i}(x) \triangleq \langle g(x; \theta_0)/\sqrt{m}, \theta_{t,i} \rangle + \nu_{TKN} \sqrt{\lambda} \, \|g(x; \theta_0)/\sqrt{m}\|_{V^{-1}_{t,i}}$
6:     If $\alpha \neq 0$, compute $\mathrm{UCB}^b_{t,i}(x) \triangleq f(x; \theta_{\text{sync,NN}}) + \nu_{TK} \sqrt{\lambda} \, N^{-1} \sum_{j=1}^{N} \|g(x; \theta_0)/\sqrt{m}\|_{(V^{\text{local}}_j)^{-1}}$
7:     Select $x_{t,i} \triangleq \arg\max_{x \in \mathcal{X}_{t,i}} (1 - \alpha) \mathrm{UCB}^a_{t,i}(x) + \alpha \, \mathrm{UCB}^b_{t,i}(x)$
8:     Query $x_{t,i}$ to observe $y_{t,i}$
9:     Update $W_{\text{new},i} \leftarrow W_{\text{new},i} + g(x_{t,i}; \theta_0) g(x_{t,i}; \theta_0)^\top / m$, $B_{\text{new},i} \leftarrow B_{\text{new},i} + y_{t,i} \, g(x_{t,i}; \theta_0)/\sqrt{m}$
10:    Update $V^{\text{local}}_{t,i} = V^{\text{local}}_{t-1,i} + g(x_{t,i}; \theta_0) g(x_{t,i}; \theta_0)^\top / m$
11:    if $(t - t_{\text{last}}) \log \frac{\det(\lambda I + W_{\text{sync}} + W_{\text{new},i})}{\det(V_{\text{last}})} > D$ then
12:        Send a synchronization signal to the central server to start a communication round
13:    if a communication round is started then
14:        Train an NN with gradient descent using all agent $i$'s local observations $\mathcal{D}_{t,i} = \{(x_{\tau,i}, y_{\tau,i})\}_{\tau \in [t]}$
15:        Compute $\alpha_{t,i} = \sigma^{\text{local}}_{t,i,\min} / \sigma^{\text{local}}_{t,i,\max}$ (Sec. 3.3)
16:        Send $\{W_{\text{new},i}, B_{\text{new},i}, \theta^i_t, \alpha_{t,i}, (V^{\text{local}}_{t,i})^{-1}\}$ to the central server
17:        Receive $\{W_{\text{sync}}, B_{\text{sync}}, \theta_{\text{sync,NN}}, \alpha, \{(V^{\text{local}}_i)^{-1}\}_{i \in [N]}\}$ from the central server
18:        Set $V_{\text{last}} = W_{\text{sync}} + \lambda I$, $t_{\text{last}} = t$, $W_{\text{new},i} = \mathbf{0}_{p_0 \times p_0}$, $B_{\text{new},i} = \mathbf{0}_{p_0}$

Algorithm 2 Central Server
1: if a synchronization signal is received from any agent then
2:     Send a signal to all agents to start a communication round
3:     Receive $\{W_{\text{new},i}, B_{\text{new},i}, \theta^i_t, \alpha_{t,i}, (V^{\text{local}}_{t,i})^{-1}\}_{i \in [N]}$
4:     Compute $\theta_{\text{sync,NN}} = N^{-1} \sum_{i=1}^{N} \theta^i_t$, $\alpha = \min_{i \in [N]} \alpha_{t,i}$; let $(V^{\text{local}}_i)^{-1} = (V^{\text{local}}_{t,i})^{-1}, \forall i \in [N]$
5:     Update $W_{\text{sync}} \leftarrow W_{\text{sync}} + \sum_{i=1}^{N} W_{\text{new},i}$, $B_{\text{sync}} \leftarrow B_{\text{sync}} + \sum_{i=1}^{N} B_{\text{new},i}$
6:     Broadcast $\{W_{\text{sync}}, B_{\text{sync}}, \theta_{\text{sync,NN}}, \alpha, \{(V^{\text{local}}_i)^{-1}\}_{i \in [N]}\}$ to all agents

We refer to those iterations between two communication rounds as an epoch. So, our FN-UCB algorithm consists of a number of epochs which are separated by communication rounds. Note that every agent $i$ only needs to train an NN in every communication round, i.e., only after the change in the log determinant of the covariance matrix of any agent exceeds a threshold $D$ (line 11 of Algo. 1). This has the additional benefit of reducing the computational cost due to the training of NNs. Interestingly, this is in a similar spirit to the adaptive batch size scheme in Gu et al. (2021), which only retrains the NN in Neural UCB after the change in the determinant of the covariance matrix exceeds a threshold and is shown to only slightly degrade the performance of Neural UCB.
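The synchronization criterion in line 11 of Algo. 1 can be sketched as follows. The sketch assumes the matrices are stored as diagonals only (the approximation mentioned in Sec. 3.4), so the log-determinant reduces to a sum of logs; the function and argument names are hypothetical.

```python
import math
from typing import Sequence


def log_det_diag(diag: Sequence[float]) -> float:
    """Log-determinant of a diagonal matrix with strictly positive entries."""
    return sum(math.log(v) for v in diag)


def should_sync(
    t: int,
    t_last: int,
    lam: float,
    w_sync_diag: Sequence[float],
    w_new_diag: Sequence[float],
    v_last_diag: Sequence[float],
    threshold_d: float,
) -> bool:
    """Check (t - t_last) * log(det(lam*I + W_sync + W_new) / det(V_last)) > D."""
    current = [lam + ws + wn for ws, wn in zip(w_sync_diag, w_new_diag)]
    info_gain = log_det_diag(current) - log_det_diag(v_last_diag)
    return (t - t_last) * info_gain > threshold_d
```

Because the criterion multiplies the elapsed iterations by the gain in log-determinant, a smaller threshold $D$ triggers communication more often; with a sufficiently small $D$, a round is started after every iteration.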

3.2. THE TWO UPPER CONFIDENCE BOUNDS (UCBS)

Firstly, $\mathrm{UCB}^a_{t,i}$ can be interpreted as the Linear UCB policy (Abbasi-Yadkori et al., 2011) using the neural tangent features $g(x; \theta_0)/\sqrt{m}$ as the input features. In iteration $t$, let $p$ denote the index of the current epoch. Then, computing $\mathrm{UCB}^a_{t,i}$ (line 5 of Algo. 1) makes use of two types of information. The first type of information, which uses the observations from all $N$ agents before epoch $p$, is used for computing $\mathrm{UCB}^a_{t,i}$ via $W_{\text{sync}}$ and $B_{\text{sync}}$ (line 4 of Algo. 1). Specifically, as can be seen from line 5 of Algo. 2, $W_{\text{sync}}$ and $B_{\text{sync}}$ are computed by the central server by summing up the $W_{\text{new},i}$'s and $B_{\text{new},i}$'s from all agents (i.e., by aggregating the information from all agents) where $W_{\text{new},i}$ and $B_{\text{new},i}$ are computed using the local observations of agent $i$ (line 9 of Algo. 1). The second type of information used by $\mathrm{UCB}^a_{t,i}$ (via $W_{\text{new},i}$ and $B_{\text{new},i}$ utilized in line 4 of Algo. 1) exploits the newly collected local observations of agent $i$ in epoch $p$. As a result, $\mathrm{UCB}^a_{t,i}$ allows us to utilize the observations from all agents via the federated setting for accelerated exploration without requiring the agents to share their raw observations. $\mathrm{UCB}^a_{t,i}$ is computed with the defined parameter $\nu_{TKN} \triangleq B + R[2(\log(3/\delta) + 1) + d \log(1 + TKN/\lambda)]^{1/2}$ where $\delta \in (0, 1)$.

Secondly, $\mathrm{UCB}^b_{t,i}$ leverages the federated setting to improve the quality of the NN for reward prediction (to achieve better exploitation) in a similar way to FedAvg, i.e., by averaging the parameters of the NNs trained by all agents using their local observations (McMahan et al., 2017). Specifically, when a communication round is started, every agent $i \in [N]$ uses its local observations $\mathcal{D}_{t,i} \triangleq \{(x_{\tau,i}, y_{\tau,i})\}_{\tau \in [t]}$ to train an NN (line 14 of Algo. 1). It uses the initial parameters $\theta_0$ (i.e., shared among all agents (Sec. 2)) and trains the NN using gradient descent with learning rate $\eta$ for $J$ training iterations (see Theorem 1 for the choices of $\eta$ and $J$) to minimize the following loss function:

$\mathcal{L}_{t,i}(\theta) \triangleq 0.5 \sum_{\tau=1}^{t} (f(x_{\tau,i}; \theta) - y_{\tau,i})^2 + 0.5 m \lambda \|\theta - \theta_0\|_2^2$. (1)

The resulting NN parameters $\theta^i_t$'s from all $N$ agents are sent to the central server (line 16 of Algo. 1), which averages them (line 4 of Algo. 2) and broadcasts the aggregated $\theta_{\text{sync,NN}} \triangleq N^{-1} \sum_{i=1}^{N} \theta^i_t$ back to all agents to be used in the next epoch. In addition, to compute the second term of $\mathrm{UCB}^b_{t,i}$, every agent needs to compute the matrix $V^{\text{local}}_{t,i}$ using its local inputs (line 10 of Algo. 1) and send its inverse to the central server (line 16 of Algo. 1) during a communication round; after that, the central server broadcasts these matrices $\{(V^{\text{local}}_i)^{-1}\}_{i \in [N]}$ received from each agent back to all agents to be used in the second term of $\mathrm{UCB}^b_{t,i}$ (line 6 of Algo. 1). Refer to Sec. 4.2 for a detailed explanation on the validity of $\mathrm{UCB}^b_{t,i}$ as a high-probability upper bound on $h$ (up to additive error terms). $\mathrm{UCB}^b_{t,i}$ is computed with the defined parameter $\nu_{TK} \triangleq B + R[2(\log(3N/\delta) + 1) + d_{\max} \log(1 + TK/\lambda)]^{1/2}$.
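As a rough illustration of the two ingredients described in this section, the sketch below computes $\mathrm{UCB}^a_{t,i}$ for a single context and averages the locally trained NN parameters in the FedAvg style. It assumes that the neural tangent feature vector `phi` (standing for $g(x; \theta_0)/\sqrt{m}$) is precomputed and that all matrices are kept in diagonal form (Sec. 3.4); all function and argument names are hypothetical.

```python
import math
from typing import List, Sequence


def ucb_a_diag(
    phi: Sequence[float],
    lam: float,
    w_sync: Sequence[float],
    w_new: Sequence[float],
    b_sync: Sequence[float],
    b_new: Sequence[float],
    nu: float,
) -> float:
    """UCB^a(x) = <phi, theta> + nu * sqrt(lam) * ||phi||_{V^{-1}} with diagonal V."""
    # V = lam*I + W_sync + W_new (diagonal entries only, an approximation).
    v = [lam + ws + wn for ws, wn in zip(w_sync, w_new)]
    # theta = V^{-1} (B_sync + B_new).
    theta = [(bs + bn) / vj for bs, bn, vj in zip(b_sync, b_new, v)]
    mean = sum(p * th for p, th in zip(phi, theta))
    width = math.sqrt(sum(p * p / vj for p, vj in zip(phi, v)))
    return mean + nu * math.sqrt(lam) * width


def average_parameters(local_params: Sequence[Sequence[float]]) -> List[float]:
    """theta_sync,NN: element-wise mean of the agents' flattened NN parameters."""
    n_agents = len(local_params)
    return [sum(column) / n_agents for column in zip(*local_params)]
```

Only the aggregated statistics (`w_sync`, `b_sync`) and the parameter vectors cross agent boundaries here, which reflects how the raw contexts and rewards stay local.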

3.3. WEIGHT BETWEEN THE TWO UCBS

Our choice of the weight $\alpha$ between the two UCBs, which naturally arises during our theoretical analysis (Sec. 4), has an interesting interpretation in terms of the relative strengths of the two UCBs and the exploration-exploitation trade-off. Specifically, let $\sigma^{\text{local}}_{t,i}(x) \triangleq \sqrt{\lambda} \, \|g(x; \theta_0)/\sqrt{m}\|_{(V^{\text{local}}_{t,i})^{-1}}$, $\sigma^{\text{local}}_{t,i,\min} \triangleq \min_{x \in \mathcal{X}} \sigma^{\text{local}}_{t,i}(x)$, and $\sigma^{\text{local}}_{t,i,\max} \triangleq \max_{x \in \mathcal{X}} \sigma^{\text{local}}_{t,i}(x)$; then $\alpha_{t,i} \triangleq \sigma^{\text{local}}_{t,i,\min} / \sigma^{\text{local}}_{t,i,\max}$ (line 15 of Algo. 1) and $\alpha \triangleq \min_{i \in [N]} \alpha_{t,i}$ (line 4 of Algo. 2). In other words, $\alpha_{t,i}$ is the ratio between the smallest and largest uncertainty across the entire domain for agent $i$, and $\alpha$ is the smallest such ratio $\alpha_{t,i}$ among all agents. Therefore, $\alpha$ is expected to be generally increasing with the number of iterations/epochs: $\sigma^{\text{local}}_{t,i,\min}$ is already small after the first few iterations since the uncertainty at the queried contexts is very small; on the other hand, $\sigma^{\text{local}}_{t,i,\max}$ is expected to be very large in early iterations and become smaller in later iterations only after a large number of contexts has been queried to sufficiently reduce the overall uncertainty in the entire domain. This implies that we give more weight to $\mathrm{UCB}^a_{t,i}$ in earlier iterations and assign more weight to $\mathrm{UCB}^b_{t,i}$ in later iterations. This turns out to have an intriguing practical interpretation: Relying more on $\mathrm{UCB}^a_{t,i}$ in earlier iterations is reasonable because $\mathrm{UCB}^a_{t,i}$ is able to utilize the observations from all agents to accelerate exploration in the early stage (Sec. 3.2); it is also sensible to give more emphasis to $\mathrm{UCB}^b_{t,i}$ only in later iterations because the NN trained by every agent is only able to accurately model the reward function (for reliable exploitation) after it has collected enough observations to train its NN. In our practical implementation (Sec. 5), we will use the analysis here as an inspiration to design an increasing sequence of $\alpha$.
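The weight described in this section, i.e., the ratio between an agent's smallest and largest local uncertainty, followed by the minimum over agents, can be sketched as below. The sketch approximates the domain by a finite list of candidate feature vectors and stores $(V^{\text{local}}_{t,i})^{-1}$ as a diagonal; both are simplifying assumptions, and all names are hypothetical.

```python
import math
from typing import Sequence


def local_uncertainty(
    phi: Sequence[float], v_local_inv_diag: Sequence[float], lam: float
) -> float:
    """sigma(x) = sqrt(lam) * ||phi||_{V_local^{-1}} for a diagonal inverse matrix."""
    return math.sqrt(lam * sum(p * p * vi for p, vi in zip(phi, v_local_inv_diag)))


def agent_weight(
    candidates: Sequence[Sequence[float]],
    v_local_inv_diag: Sequence[float],
    lam: float,
) -> float:
    """alpha_{t,i}: smallest over largest uncertainty on a finite candidate set."""
    sigmas = [local_uncertainty(phi, v_local_inv_diag, lam) for phi in candidates]
    return min(sigmas) / max(sigmas)


def global_weight(agent_weights: Sequence[float]) -> float:
    """alpha: the smallest ratio among all agents (computed by the central server)."""
    return min(agent_weights)


def select_arm(
    ucb_a: Sequence[float], ucb_b: Sequence[float], alpha: float
) -> int:
    """Index of the arm maximizing (1 - alpha) * UCB^a + alpha * UCB^b."""
    scores = [(1.0 - alpha) * a + alpha * b for a, b in zip(ucb_a, ucb_b)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `alpha = 0` the selection follows $\mathrm{UCB}^a$ alone, and as `alpha` approaches 1 it follows $\mathrm{UCB}^b$, which mirrors the shift from accelerated exploration to NN-based exploitation.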

3.4. COMMUNICATION COST

To achieve a better communication efficiency, we propose here a variant of our main FN-UCB algorithm called FN-UCB (Less Comm.) which differs from FN-UCB (Algos. 1 and 2) in two aspects. Firstly, the central server averages the matrices $\{(V^{\text{local}}_{t,i})^{-1}\}_{i \in [N]}$ received from all agents to produce a single matrix $V^{-1}_{\text{sync,NN}} = N^{-1} \sum_{i=1}^{N} (V^{\text{local}}_{t,i})^{-1}$ and hence only broadcasts the single matrix $V^{-1}_{\text{sync,NN}}$ instead of all $N$ received matrices $\{(V^{\text{local}}_{t,i})^{-1}\}_{i \in [N]}$ to all agents (see line 6 of Algo. 2). Secondly, the $\mathrm{UCB}^b_{t,i}$ of every agent $i$ (line 6 of Algo. 1) is modified to use the matrix $V^{-1}_{\text{sync,NN}}$: $\mathrm{UCB}^b_{t,i}(x) \triangleq f(x; \theta_{\text{sync,NN}}) + \nu_{TK} \sqrt{\lambda} \, \|g(x; \theta_0)/\sqrt{m}\|_{V^{-1}_{\text{sync,NN}}}$.

To further reduce the communication cost of both FN-UCB and FN-UCB (Less Comm.), especially when the NN is large (i.e., its total number $p_0$ of parameters is large), we can follow the practice of previous works (Zhang et al., 2021; Zhou et al., 2020) to diagonalize the $p_0 \times p_0$ matrices, i.e., by only keeping the diagonal elements of the matrices. Specifically, we can diagonalize $W_{\text{new},i}$ (line 9 of Algo. 1) and $V^{\text{local}}_{t,i}$ (line 10 of Algo. 1), and let the central server aggregate only the diagonal elements of the corresponding matrices to obtain $W_{\text{sync}}$ and $V^{-1}_{\text{sync,NN}}$. This reduces both the communication and computational costs.

As a result, during a communication round, the parameters that an agent sends to the central server include $\{W_{\text{new},i}, B_{\text{new},i}, \theta^i_t, \alpha_{t,i}, (V^{\text{local}}_{t,i})^{-1}\}$ (line 16 of Algo. 1), which constitute $p_0 + p_0 + p_0 + 1 + p_0 = O(p_0)$ parameters and are the same for FN-UCB and FN-UCB (Less Comm.). The parameters that the central server broadcasts to the agents include $\{W_{\text{sync}}, B_{\text{sync}}, \theta_{\text{sync,NN}}, \alpha, \{(V^{\text{local}}_{t,i})^{-1}\}_{i \in [N]}\}$ for FN-UCB (line 6 of Algo. 2), which amount to $p_0 + p_0 + p_0 + 1 + N p_0 = O(N p_0)$ parameters. Meanwhile, FN-UCB (Less Comm.) only needs to broadcast $O(p_0)$ parameters because the $N$ matrices $\{(V^{\text{local}}_{t,i})^{-1}\}_{i \in [N]}$ are now replaced by a single matrix $V^{-1}_{\text{sync,NN}}$. Therefore, the total number of exchanged parameters by FN-UCB (Less Comm.) is $O(p_0)$, which is comparable to the number of exchanged parameters in standard FL for supervised learning (e.g., FedAvg) where the parameters (or gradients) of the NN are exchanged (McMahan et al., 2017). We will also analyze the total number of required communication rounds by FN-UCB, as well as by FN-UCB (Less Comm.), in Sec. 4.1.

As we will discuss in Sec. 4.1, the variant FN-UCB (Less Comm.) has a looser regret upper bound than our main FN-UCB algorithm (Algos. 1 and 2). However, in practice, FN-UCB (Less Comm.) is recommended over FN-UCB because it achieves a very similar empirical performance to FN-UCB (which we have verified in Sec. 5.1) and yet incurs less communication cost.
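The parameter counts in this section can be tallied with a small sketch. It assumes that all $p_0 \times p_0$ matrices are diagonalized so that each contributes $p_0$ numbers; the exact count for the FN-UCB (Less Comm.) broadcast is inferred from replacing the $N$ matrices with a single one, and the function names are hypothetical.

```python
def upload_size(p0: int) -> int:
    """Numbers sent by one agent: W_new, B_new, theta, alpha, (V_local)^{-1}."""
    return p0 + p0 + p0 + 1 + p0


def broadcast_size(p0: int, n_agents: int, less_comm: bool) -> int:
    """Numbers broadcast by the server in one communication round.

    FN-UCB sends all N inverse local matrices; FN-UCB (Less Comm.) sends
    only their average (one matrix), which is an inferred count.
    """
    shared = p0 + p0 + p0 + 1  # W_sync, B_sync, theta_sync_NN, alpha
    inverse_matrices = p0 if less_comm else n_agents * p0
    return shared + inverse_matrices
```

For example, with $p_0 = 100$ and $N = 5$, the broadcast shrinks from 801 to 401 numbers per round, while the upload stays at 401 in both variants.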

4.1. THEORETICAL RESULTS

Regret Upper Bound. For simplicity, we analyze the regret of a simpler version of our algorithm where we only choose the weight $\alpha$ using the method described in Sec. 3.3 in the first iteration after every communication round (i.e., first iteration of every epoch) and set $\alpha = 0$ in all other iterations. Note that when communication occurs after each iteration (i.e., when $D$ is sufficiently small), this version coincides with our original FN-UCB described in Algos. 1 and 2 (Sec. 3). The regret upper bound of FN-UCB is given by the following result (proof in Appendix C):

Theorem 1. Let $\delta \in (0, 1)$, $\lambda = 1 + 2/T$, and $D = O(T/(N d))$. Suppose that the NN width $m$ grows polynomially: $m \geq \mathrm{poly}(\lambda, T, K, N, L, \log(1/\delta), 1/\lambda_0)$. For the gradient descent training (line 14 of Algo. 1), let $\eta = C_4 (m\lambda + mTL)^{-1}$ for some constant $C_4 > 0$ and $J = O(TL/(\lambda C_4))$. Then, with probability of at least $1 - \delta$, $R_T = O(d\sqrt{TN} + d_{\max} N\sqrt{T})$.

Refer to Appendix C.1 for the detailed conditions on the NN width $m$ as well as the learning rate $\eta$ and number $J$ of iterations for the gradient descent training (line 14 of Algo. 1). Intuitively, the effective dimension $d$ measures the actual underlying dimension of the set of all $TKN$ contexts for all agents (Zhang et al., 2021), and $d_{\max} \triangleq \max_{i \in [N]} d_i$ is the maximum among the underlying dimensions of the set of $TK$ contexts for each of the $N$ agents. Zhang et al. (2021) showed that if all contexts lie in a $d'$-dimensional subspace of the RKHS induced by the NTK, then the effective dimension of these contexts can be upper-bounded by the constant $d'$.

The first term $d\sqrt{TN}$ in the regret upper bound (Theorem 1) arises due to $\mathrm{UCB}^a_{t,i}$ and reflects the benefit of the federated setting. In particular, this term matches the regret upper bound of standard Neural UCB (Zhou et al., 2020) running for $TN$ iterations and so, the average regret $d\sqrt{T/N}$ across all agents decreases with a larger number $N$ of agents. The second term $d_{\max} N\sqrt{T}$ results from $\mathrm{UCB}^b_{t,i}$, which involves two major components of our algorithm: the use of NNs for reward prediction and the aggregation of the NN parameters. Although not reflecting the benefit of a larger $N$ in the regret bound, both components are important to our algorithm. Firstly, the use of NNs for reward prediction is a crucial component in neural contextual bandits in order to exploit the strong representation power of NNs. This is similar in spirit to the works on neural contextual bandits (Zhang et al., 2021; Zhou et al., 2020) in which the use of NNs for reward prediction does not improve the regret upper bound (compared with using the linear prediction given by the first term of $\mathrm{UCB}^a_{t,i}$) and yet significantly improves the practical performance. Secondly, the aggregation of the NN parameters is also important for the performance of our FN-UCB since it allows us to exploit the federated setting in a similar way to FL for supervised learning, which has been repeatedly shown to improve the performance (Kairouz et al., 2019). We have also empirically verified (Sec. 5.1) that both components (i.e., the use of NNs for reward prediction and the aggregation of NN parameters) are important to the practical performance of our algorithm. The work of Huang et al. (2021a) has leveraged the NTK to analyze the convergence of FedAvg for supervised learning (McMahan et al., 2017), which also averages the NN parameters in a similar way to our algorithm. Note that their convergence results also do not improve with a larger number $N$ of agents but in fact become worse with a larger $N$.

Of note, in the single-agent setting where $N = 1$, we have that $d = d_{\max}$ (Sec. 2). Therefore, our regret upper bound from Theorem 1 reduces to $R_T = O(d\sqrt{T})$, which, interestingly, matches the regret upper bounds of standard neural bandit algorithms including Neural UCB (Zhou et al., 2020) and Neural TS (Zhang et al., 2021). We also prove (App. C.7) that FN-UCB (Less Comm.), which is a variant of our FN-UCB with a better communication efficiency (Sec. 3.4), enjoys a regret upper bound of $R_T = O(d\sqrt{TN} + d_{\max} N\sqrt{TN})$, whose second term is worse than that of FN-UCB (Theorem 1) by a factor of $\sqrt{N}$. In addition, we have also analyzed our general algorithm which does not set $\alpha = 0$ in any iteration (results and analysis in Appendix F), which requires an additional assumption and only introduces an additional multiplicative constant to the regret bound.

Communication Complexity. The following result (proof in App. D) gives a theoretical guarantee on the communication complexity of FN-UCB, including its variant FN-UCB (Less Comm.):

Theorem 2. With the same parameters as Theorem 1, if the NN width $m$ satisfies $m \geq \mathrm{poly}(T, K, N, L, \log(1/\delta))$, then with probability of at least $1 - \delta$, the total number of communication rounds for FN-UCB satisfies $C_T = O(d\sqrt{N})$.

The specific condition on $m$ required by Theorem 2 corresponds to condition 1 listed in App. C.1 (see App. D for details), which is a subset of the conditions required by Theorem 1. Following the same discussion on the effective dimension $d$ presented above, if all contexts lie in a $d'$-dimensional subspace of the RKHS induced by the NTK, then $d$ can be upper-bounded by the constant $d'$, consequently leading to a communication complexity of $C_T = O(\sqrt{N})$.
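The arithmetic behind the statements about the average regret, the single-agent case, and the FN-UCB (Less Comm.) variant can be spelled out as follows. This is only a restatement of the bounds quoted above under the same notation, not an additional result.

```latex
\begin{align*}
\text{Average regret per agent:}\quad
  \frac{R_T}{N} &= O\!\left(\frac{d\sqrt{TN} + d_{\max} N \sqrt{T}}{N}\right)
                 = O\!\left(d\sqrt{T/N} + d_{\max}\sqrt{T}\right), \\
\text{Single agent } (N = 1,\ d = d_{\max}):\quad
  R_T &= O\!\left(d\sqrt{T} + d_{\max}\sqrt{T}\right) = O\!\left(d\sqrt{T}\right), \\
\text{FN-UCB (Less Comm.), second term:}\quad
  d_{\max} N \sqrt{TN} &= \sqrt{N}\cdot d_{\max} N \sqrt{T}.
\end{align*}
```

The first line makes explicit that only the term arising from $\mathrm{UCB}^a_{t,i}$ shrinks with $N$ on a per-agent basis, while the term arising from $\mathrm{UCB}^b_{t,i}$ stays at $d_{\max}\sqrt{T}$.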

4.2. PROOF SKETCH

We give a brief sketch of our regret analysis for Theorem 1 (detailed proof in Appendix C). To begin with, we need to prove that both $\mathrm{UCB}^a_{t,i}$ and $\mathrm{UCB}^b_{t,i}$ are valid high-probability upper bounds on the reward function $h$ (App. C.3) given that the conditions on $m$, $\eta$, and $J$ in App. C.1 are satisfied. Since $\mathrm{UCB}^a_{t,i}$ can be viewed as Linear UCB using the neural tangent features $g(x; \theta_0)/\sqrt{m}$ as the input features (Sec. 3), its validity as a high-probability upper bound on $h$ can be proven following similar steps to those of standard linear and kernelized bandits (Chowdhury & Gopalan, 2017) (see Lemma 3 in App. C.3).

Next, to prove that $\mathrm{UCB}^b_{t,i}$ is also a high-probability upper bound on $h$ (up to additive error terms), let $\theta^{\text{local}}_{t,i} \triangleq (V^{\text{local}}_{t,i})^{-1} \big(\sum_{\tau=1}^{t} y_{\tau,i} \, g(x_{\tau,i}; \theta_0)/\sqrt{m}\big)$, which is defined in the same way as $\theta_{t,i}$ (line 4 of Algo. 1) except that $\theta^{\text{local}}_{t,i}$ only uses the local observations of agent $i$. Firstly, we show that $f(x; \theta_{\text{sync,NN}})$ (i.e., the NN prediction using the aggregated parameters) is close to $N^{-1} \sum_{i=1}^{N} \langle g(x; \theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i} \rangle$, which is the linear prediction using $\theta^{\text{local}}_{t,i}$ averaged over all agents. This is achieved by showing that the linear approximation of the NN at $\theta_0$ is close to both terms. Secondly, we show that the absolute difference between the linear prediction $\langle g(x; \theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i} \rangle$ of agent $i$ and the reward function $h(x)$ can be upper-bounded by $\nu_{TK} \sqrt{\lambda} \, \|g(x; \theta_0)/\sqrt{m}\|_{(V^{\text{local}}_{t,i})^{-1}}$. This can be done following similar steps to the proof for $\mathrm{UCB}^a_{t,i}$ mentioned above. Thirdly, using the averaged linear prediction $N^{-1} \sum_{i=1}^{N} \langle g(x; \theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i} \rangle$ as an intermediate term, the difference between $f(x; \theta_{\text{sync,NN}})$ and $h(x)$ can be upper-bounded. This implies the validity of $\mathrm{UCB}^b_{t,i}$ as a high-probability upper bound on $h$ (up to additive error terms which are small given the conditions on $m$, $\eta$, and $J$ presented in App. C.1), as formalized by Lemma 4 in App. C.3.

Published as a conference paper at ICLR 2023

Next, following similar steps to the analysis in Wang et al. (2020), we separate all epochs into "good" epochs (intuitively, those epochs during which the amount of newly collected information from all agents is not too large) and "bad" epochs (details in App. C.2), and then separately upper-bound the regrets incurred in these two types of epochs. For good epochs (App. C.4), we are able to derive a tight upper bound on the regret $r_{t,i} = h(x^*_{t,i}) - h(x_{t,i})$ in each iteration $t$ by making use of the fact that the change of information in a good epoch is bounded, and consequently obtain a tight upper bound on the total regrets in all good epochs. For bad epochs (App. C.5), we make use of the result from App. C.2 which guarantees that the total number of bad epochs can be upper-bounded. As a result, with an appropriate choice of $D = O(T/(N d))$, the growth rate of the total regret incurred in bad epochs is smaller than that in good epochs. Lastly, the final regret upper bound follows from adding up the total regrets from good and bad epochs (App. C.6).
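The three steps used to validate $\mathrm{UCB}^b_{t,i}$ combine through the triangle inequality, as sketched below. Here $\varepsilon$ is a placeholder introduced only for this sketch to denote the approximation error between the aggregated NN prediction and the averaged linear prediction; the precise error terms are given in the appendix of the paper and are not reproduced here.

```latex
\begin{align*}
\bigl| f(x;\theta_{\text{sync,NN}}) - h(x) \bigr|
&\le \Bigl| f(x;\theta_{\text{sync,NN}})
      - \tfrac{1}{N}\sum_{i=1}^{N}
        \bigl\langle g(x;\theta_0)/\sqrt{m},\, \theta^{\text{local}}_{t,i} \bigr\rangle \Bigr|
   + \tfrac{1}{N}\sum_{i=1}^{N}
     \Bigl| \bigl\langle g(x;\theta_0)/\sqrt{m},\, \theta^{\text{local}}_{t,i} \bigr\rangle - h(x) \Bigr| \\
&\le \varepsilon
   + \nu_{TK}\sqrt{\lambda}\,\tfrac{1}{N}\sum_{i=1}^{N}
     \bigl\| g(x;\theta_0)/\sqrt{m} \bigr\|_{(V^{\text{local}}_{t,i})^{-1}} .
\end{align*}
```

The second summand on the right-hand side is exactly the exploration bonus in the definition of $\mathrm{UCB}^b_{t,i}$ (line 6 of Algo. 1), which is why adding it to $f(x; \theta_{\text{sync,NN}})$ yields an upper bound on $h(x)$ up to the additive error.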

5. EXPERIMENTS

All figures in this section plot the average cumulative regret across all N agents up to an iteration, which allows us to inspect the benefit that the federated setting brings to an agent (on average). In all presented results, unless specified otherwise (by specifying a value of D), a communication round happens after each iteration. All curves stand for the mean and standard error from 3 independent runs. Some experimental details and results are deferred to App. E due to space limitation.

5.1. SYNTHETIC EXPERIMENTS

We first use synthetic experiments to illustrate some interesting insights about our algorithm. Similar to Zhou et al. (2020), we adopt the synthetic functions $h(x) = \cos(3\langle a, x\rangle)$ and $h(x) = 10(\langle a, x\rangle)^2$, which are referred to as the cosine and square functions, respectively. We add Gaussian observation noise with a standard deviation of 0.01. The parameter $a$ is a 10-dimensional vector randomly sampled from the unit hypersphere. In each iteration, every agent receives K = 4 contexts (arms) which are randomly sampled from the unit hypersphere. For fair comparisons, for all methods (including our FN-UCB, Neural UCB, and Neural TS), we use the same set of parameters $\lambda = \nu_{TKN} = \nu_{TK} = 0.1$ and use an NN with 1 hidden layer and a width of m = 20. As suggested by our theoretical analysis (Sec. 3.3), we select a sequence of α that increases linearly (to 1) over the first 700 iterations, and let α = 1 afterwards.

To begin with, we compare our main FN-UCB algorithm and its variant FN-UCB (Less Comm.) (Sec. 3.4). The results (Figs. 3a and 3b in App. E) show that their empirical performances are very similar. So, for practical deployment, we recommend the use of FN-UCB (Less Comm.) as it is more communication-efficient and achieves a similar performance. Accordingly, we will use the variant FN-UCB (Less Comm.) in all our subsequent experiments and refer to it as FN-UCB for simplicity. Fig. 1 presents the results. Figs. 1a and 1b show that our FN-UCB with N = 1 agent performs comparably with Neural UCB and Neural TS, and that the federation of a larger number N of agents consistently improves the performance of our FN-UCB. Note that the federation of N = 2 agents can already provide significant improvements over non-federated algorithms. Fig. 1c illustrates the importance of the different components of our FN-UCB.
The red curve is obtained by removing UCB b t,i (i.e., letting α = 0) and the green curve corresponds to removing UCB a t,i . The red curve shows that relying solely on UCB a t,i leads to significantly larger regrets in the long run due to its inability to utilize NNs to model the reward functions. On the other hand, the green curve incurs larger regrets than the red curve initially; however, after more observations are collected (i.e., after the NNs are trained with enough data to accurately model the reward function), it quickly learns to achieve much smaller regrets. These results provide empirical justifications for our discussion on the weight between the two UCBs (Sec. 3.3): It is reasonable to use an increasing sequence of α such that more weight is given to UCB a t,i initially and then to UCB b t,i later. The yellow curve is obtained by removing the step of aggregating (i.e., averaging) the NN parameters (in line 4 of Algo. 2), i.e., when calculating UCB b t,i (line 6 of Algo. 1), we use θ i t to replace θ sync,NN . The results show that the aggregation of the NN parameters significantly improves the performance of FN-UCB (i.e., the blue curve has much smaller regrets than the yellow one) and is hence an indispensable part of our FN-UCB. Lastly, Fig. 1d shows that more frequent communications (i.e., smaller values of D which make it easier to initiate a communication round; see line 11 of Algo. 1) lead to smaller regrets.
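The synthetic setup described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions: the helper names (`sample_unit_vector`, `alpha_schedule`, and so on) are hypothetical, and the schedule is assumed to start from α = 0 at iteration 0, which the text does not specify.

```python
import math
import random

def sample_unit_vector(dim, rng):
    """Sample uniformly from the unit hypersphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def inner(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def cosine_reward(a, x):
    """h(x) = cos(3 <a, x>)."""
    return math.cos(3.0 * inner(a, x))

def square_reward(a, x):
    """h(x) = 10 (<a, x>)^2."""
    return 10.0 * inner(a, x) ** 2

def noisy(value, rng, noise_std=0.01):
    """Add Gaussian observation noise with standard deviation 0.01."""
    return value + rng.gauss(0.0, noise_std)

def alpha_schedule(t, ramp=700):
    """Weight on UCB^b: assumed to grow linearly from 0 to 1 over `ramp` iterations."""
    return min(1.0, t / ramp)

rng = random.Random(0)
a = sample_unit_vector(10, rng)                              # 10-dimensional parameter
contexts = [sample_unit_vector(10, rng) for _ in range(4)]   # K = 4 arms
rewards = [noisy(cosine_reward(a, x), rng) for x in contexts]
```

With this schedule, early iterations weight $\mathrm{UCB}^a_{t,i}$ more heavily and later iterations rely on $\mathrm{UCB}^b_{t,i}$, matching the discussion of the red and green curves.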

5.2. REAL-WORLD EXPERIMENTS

We adopt the shuttle and magic telescope datasets from the UCI machine learning repository (Dua & Graff, 2017) and construct the experiments following a widely used protocol in previous works (Li et al., 2010a; Zhang et al., 2021; Zhou et al., 2020). A K-class classification problem can be converted into a K-armed contextual bandit problem. In each iteration, an input $x$ is randomly drawn from the dataset and is then used to construct K context feature vectors $x_1 = [x; 0_d; \ldots; 0_d]$, $x_2 = [0_d; x; \ldots; 0_d]$, $\ldots$, $x_K = [0_d; \ldots; 0_d; x]$, which correspond to the K classes. The reward is 1 if the arm with the correct class is pulled, and 0 otherwise. For fair comparisons, we use the same set of parameters $\lambda = 10$, $\nu_{TKN} = 0.1$, and $\nu_{TK} = 0.01$ for all methods. Figs. 2a and 2b present the results for the two datasets (1 hidden layer, m = 20) and show that our FN-UCB with N = 2 agents consistently outperforms standard Neural UCB and Neural TS, and that its performance also improves with the federation of more agents. Fig. 2c shows the results for shuttle when diagonal approximation (Sec. 3.4) is applied to the NNs (1 hidden layer, m = 50); the corresponding results are consistent with those in Fig. 2a. Moreover, the regrets in Fig. 2c are in general smaller than those in Fig. 2a. This may suggest that in practice, a wider NN with diagonal approximation may be preferable to a narrower NN without diagonal approximation, since it not only improves the performance but also reduces the computational and communication costs (Sec. 3.4). Fig. 2d plots the regrets for shuttle (with diagonal approximation) for different values of D and shows that more frequent communications lead to better performance, which is consistent with Fig. 1d. For completeness, we also compare the neural contextual bandit algorithms with linear and kernelized contextual bandit algorithms (for the experiments in both Secs. 5.1 and 5.2); the results (Fig. 4, App. E) show that the linear and kernelized algorithms are outperformed by the neural ones.
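The conversion from a K-class classification problem to a K-armed contextual bandit described above can be sketched as follows. The function names `make_contexts` and `reward` are hypothetical and only illustrate the block construction of the contexts and the 0/1 reward; they are not taken from the authors' code.

```python
def make_contexts(x, num_classes):
    """Build K context vectors of length K*d; the k-th one holds x in its k-th block."""
    d = len(x)
    contexts = []
    for k in range(num_classes):
        ctx = [0.0] * (num_classes * d)
        ctx[k * d:(k + 1) * d] = x   # place x in block k, zeros elsewhere
        contexts.append(ctx)
    return contexts

def reward(pulled_arm, true_class):
    """Reward is 1 if the arm matching the correct class is pulled, and 0 otherwise."""
    return 1.0 if pulled_arm == true_class else 0.0

# Example: a 2-dimensional input and K = 3 classes.
example_contexts = make_contexts([1.0, 2.0], 3)
```

Each context has dimension $Kd$, so the input dimension seen by the NN grows linearly with the number of classes.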

6. CONCLUSION

This paper describes the first federated neural contextual bandit algorithm called FN-UCB. We use a weighted combination of two UCBs and the choice of this weight required by our theoretical analysis has an interesting interpretation emphasizing accelerated exploration initially and accurate prediction of the aggregated NN later. We derive upper bounds on the regret and communication complexity of FN-UCB, and verify its effectiveness using empirical experiments. Our algorithm is not equipped with privacy guarantees, which may be a potential limitation and will be tackled in future work.

REPRODUCIBILITY STATEMENT

We have included the necessary details to ensure the reproducibility of our theoretical and empirical results. For our theoretical results, we have stated all our assumptions in Sec. 2, added a proof sketch in Sec. 4.2, and included the complete proofs in App. C and App. D. Our detailed experimental settings have been described in Sec. 5.1, Sec. 5.2, and App. E. Our code has been submitted as supplementary material.

A RELATED WORKS

Federated Bandits. Federated learning (FL) has received enormous attention in recent years (Kairouz et al., 2019; Li et al., 2021; 2014; McMahan et al., 2017). A number of recent works have extended the classic K-armed bandits (i.e., the arms are not associated with feature vectors) to the federated setting. Li & Song (2022) and Li et al. (2020) focused on incorporating privacy guarantees into federated K-armed bandits in both centralized and decentralized settings. Shi & Shen (2021) proposed a setting where the goal is to minimize the regret of a global bandit whose reward of an arm is the average of the rewards of the corresponding arm from all agents, which was later extended by adding personalization such that every agent aims to maximize a weighted combination between the global and local rewards (Shi et al., 2021). Subsequent works on federated K-armed bandits have focused on other important aspects such as decentralized communication via the gossip algorithm (Zhu et al., 2021b), the security aspect via cryptographic techniques (Ciucanu et al., 2022), uncoordinated exploration (Yan et al., 2022), and robustness against Byzantine attacks (Demirel et al., 2022). Regarding federated linear contextual bandits, Wang et al. (2020) proposed a distributed linear contextual bandit algorithm which allows every agent to use the observations from the other agents by only exchanging the sufficient statistics to calculate the Linear UCB policy. Subsequently, Dubey & Pentland (2020) extended the method from Wang et al. (2020) to consider differential privacy and decentralized communication, Huang et al. (2021b) considered a setting where every agent is associated with a unique context vector, Li & Wang (2022a) focused on asynchronous communication, and Jadbabaie et al. (2022) considered the robustness against Byzantine attacks. Federated kernelized/GP bandits (also named federated Bayesian optimization) have been explored by Dai et al. (2020; 2021), which focused on the practical problem of hyperparameter tuning in the federated setting. The recent works of Li et al. (2022) and Li & Wang (2022b) have, respectively, focused on deriving communication-efficient algorithms for federated kernelized and generalized linear bandits. In addition to federated bandits, other similar sequential decision-making problems have also been extended to the federated setting, such as federated reinforcement learning (Fan et al., 2021; Zhuo et al., 2019) and federated hyperparameter tuning (Holly et al., 2021; Khodak et al., 2021; Zhou et al., 2021).

Neural Bandits. Since the pioneering works of Zhou et al. (2020) and Zhang et al. (2021) which, respectively, introduced Neural UCB and Neural TS, a number of recent works have focused on different aspects of neural contextual bandits. Xu et al. (2020) reduced the computational cost of Neural UCB by using an NN as a feature extractor and applying Linear UCB only to the last layer of the learned NN, Kassraie & Krause (2022) analyzed the maximum information gain of the NTK and hence derived no-regret algorithms, Gu et al. (2021) focused on the batch setting in which the policy is only updated at a small number of time steps, Nabati et al. (2021)

B MORE BACKGROUND

In this section, we give more details on some of the technical background mentioned in Sec. 2. The details in this section all follow the works of Zhang et al. (2021) ; Zhou et al. (2020) , and we present them here for completeness.

Definition of the NN $f(x;\theta)$. Let $W_1 \in \mathbb{R}^{m\times d}$, $W_l \in \mathbb{R}^{m\times m}$, $\forall l = 2,\ldots,L-1$, and $W_L \in \mathbb{R}^{m\times 1}$. Then the NN $f(x;\theta)$ is defined as

$$f_1 = W_1 x, \qquad f_l = W_l\,\mathrm{ReLU}(f_{l-1}),\ \forall l = 2,\ldots,L, \qquad f(x;\theta) = \sqrt{m}\, f_L,$$

in which $\mathrm{ReLU}(z) = \max(z, 0)$ denotes the rectified linear unit (ReLU) activation function and is applied to each element of $f_{l-1}$. With this definition of the NN, $\theta$ denotes the collection of all parameters of the NN: $\theta = (\mathrm{vec}(W_1), \ldots, \mathrm{vec}(W_L)) \in \mathbb{R}^{p_0}$.

Details of the Initialization Scheme $\theta_0 \sim \mathrm{init}(\cdot)$. To obtain the initial parameters $\theta_0$, for each $l = 1,\ldots,L-1$, let

$$W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix},$$

where each entry of $W$ is independently sampled from $\mathcal{N}(0, 4/m)$, and let $W_L = (w^\top, -w^\top)$ where each entry of $w$ is independently sampled from $\mathcal{N}(0, 2/m)$. This initialization scheme is the same as that used by the works of Zhang et al. (2021) and Zhou et al. (2020).

Definitions of the NTK Matrices $H$ and $H_i$'s. To simplify the exposition here, we use $\{x_j\}_{j=1,\ldots,TKN}$ to denote the set of all contexts from all iterations, all arms and all agents: $\{x^k_{t,i}\}_{t\in[T],k\in[K],i\in[N]}$. We can then define

$$\widetilde{H}^{(1)}_{i,j} = \Sigma^{(1)}_{i,j} = \langle x_i, x_j\rangle, \qquad A^{(l)}_{i,j} = \begin{pmatrix} \Sigma^{(l)}_{i,i} & \Sigma^{(l)}_{i,j} \\ \Sigma^{(l)}_{i,j} & \Sigma^{(l)}_{j,j} \end{pmatrix},$$

$$\Sigma^{(l+1)}_{i,j} = 2\,\mathbb{E}_{(u,v)\sim\mathcal{N}(0, A^{(l)}_{i,j})}\big[\max(u,0)\max(v,0)\big],$$

$$\widetilde{H}^{(l+1)}_{i,j} = 2\,\widetilde{H}^{(l)}_{i,j}\,\mathbb{E}_{(u,v)\sim\mathcal{N}(0, A^{(l)}_{i,j})}\big[\mathbb{1}(u>0)\mathbb{1}(v>0)\big] + \Sigma^{(l+1)}_{i,j}.$$

With these definitions, the NTK matrix is defined as $H = (\widetilde{H}^{(L)} + \Sigma^{(L)})/2$. Similarly, $H_i$ can be obtained in the same way by only using all contexts from agent $i$ in the definitions above, i.e., now we use $\{x_j\}_{j=1,\ldots,TK}$ to denote $\{x^k_{t,i}\}_{t\in[T],k\in[K]}$ and plug these $TK$ contexts into the definitions above to obtain $H_i$.
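To make the NTK recursion above concrete, the following standard-library Python sketch evaluates it for a small set of inputs. It is only an illustration, not the authors' implementation: the function name `ntk_matrix` is hypothetical, and the two Gaussian expectations are replaced by their standard closed forms for a zero-mean bivariate Gaussian with variances $a, b$ and correlation $\rho$, writing $\varphi = \arccos\rho$: $\mathbb{E}[\mathbb{1}(u>0)\mathbb{1}(v>0)] = (\pi-\varphi)/(2\pi)$ and $\mathbb{E}[\max(u,0)\max(v,0)] = \sqrt{ab}\,(\sin\varphi + (\pi-\varphi)\cos\varphi)/(2\pi)$. The sketch assumes all inputs are nonzero.

```python
import math

def ntk_matrix(xs, L):
    """Compute H = (H_tilde^(L) + Sigma^(L)) / 2 for a list of nonzero input vectors."""
    n = len(xs)
    # Level 1: H_tilde^(1) = Sigma^(1) = Gram matrix of the inputs.
    sigma = [[sum(a * b for a, b in zip(xs[i], xs[j])) for j in range(n)]
             for i in range(n)]
    h = [row[:] for row in sigma]
    for _ in range(L - 1):
        new_sigma = [[0.0] * n for _ in range(n)]
        new_h = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                scale = math.sqrt(sigma[i][i] * sigma[j][j])
                rho = max(-1.0, min(1.0, sigma[i][j] / scale))
                phi = math.acos(rho)
                # Closed-form Gaussian expectations (see the lead-in).
                e_relu = scale * (math.sin(phi) + (math.pi - phi) * math.cos(phi)) / (2 * math.pi)
                e_ind = (math.pi - phi) / (2 * math.pi)
                new_sigma[i][j] = 2.0 * e_relu
                new_h[i][j] = 2.0 * h[i][j] * e_ind + new_sigma[i][j]
        sigma, h = new_sigma, new_h
    return [[(h[i][j] + sigma[i][j]) / 2.0 for j in range(n)] for i in range(n)]

# Two orthogonal unit-norm inputs.
H2 = ntk_matrix([[1.0, 0.0], [0.0, 1.0]], L=2)
H3 = ntk_matrix([[1.0, 0.0], [0.0, 1.0]], L=3)
```

For unit-norm inputs the diagonal of $\Sigma^{(l)}$ stays at 1 and the diagonal of $\widetilde{H}^{(l)}$ equals $l$, so the diagonal entries of $H$ are $(L+1)/2$.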

C PROOF OF REGRET UPPER BOUND (THEOREM 1)

We use p to index different epochs and denote by P the total number of epochs. We use t p to denote the first iteration of epoch p, and use E p to represent the length (i.e., number of iterations) of epoch p. Throughout our theoretical analysis, we will denote different error probabilities as δ 1 , . . . , δ 6 , which we will combine via a union bound at the end of the proof to ensure that our final regret upper bound holds with probability of at least 1 -δ.

C.1 CONDITIONS ON THE WIDTH m OF THE NEURAL NETWORKS

We list here the detailed conditions on the width $m$ of the NN that are needed by our theoretical analysis. These include two types of conditions: some of them (conditions 1-4) are required for our regret upper bound to hold (i.e., they are used during the proof to derive the regret upper bound), whereas the others (conditions 5-6) are used after the final regret upper bound is derived to ensure that the final regret upper bound is small (see App. C.6). When presenting our detailed proofs starting from the next subsection, we will refer to each of these conditions whenever they are used by the corresponding lemmas. Different lemmas may use different leading constants in their required condition (i.e., lower bound) on $m$, but here we use the same constant $C > 0$ for all lower bounds for simplicity, which can be considered as simply taking the maximum among all these different leading constants for different lemmas.

1. $m \ge C T^6 K^6 N^6 L^6 \log(TKNL/\delta)$,
2. $m \ge C T^4 K^4 N^4 L^6 \log(T^2K^2N^2L/\delta)/\lambda_0^4$,
3. $m \ge C \sqrt{\lambda}\, L^{-3/2} [\log(TKNL^2/\delta)]^{3/2}$,
4. $m(\log m)^{-3} \ge C T L^{12}\lambda^{-1} + C T^7\lambda^{-8}L^{18}(\lambda + LT)^6 + C L^{21}T^7\lambda^{-7}(1+\sqrt{T/\lambda})^6$,
5. $m(\log m)^{-3} \ge C T^{10} N^6 \lambda^{-4} L^{18}$,
6. $m(\log m)^{-3} \ge C T^{16} N^6 L^{24}\lambda^{-10}(1+\sqrt{T/\lambda})^6$.

Some of the conditions above can be combined, but we leave them as separate conditions to make it easier to refer to the corresponding place in the proof where a particular condition is needed. Furthermore, to achieve a small upper bound on the cumulative regret, we also need to place some conditions on the learning rate $\eta$ and the number of iterations $J$ for the gradient descent training (line 14 of Algo. 1). Specifically, we need to choose the learning rate as

$$\eta = C_4 (m\lambda + mTL)^{-1}, \tag{2}$$

in which $C_4 > 0$ is an absolute constant such that $C_4 \le 1 + TL$, and choose

$$J = \frac{1}{C_4}\left(1 + \frac{TL}{\lambda}\right)\log\left(\frac{1}{3C_2 N}\sqrt{\frac{\lambda}{T^3 L}}\right) = \widetilde{O}\big(TL/(\lambda C_4)\big). \tag{3}$$

C.2 DEFINITION OF GOOD AND BAD EPOCHS

Denote the matrix $V_{\text{last}}$ (see line 18 of Algo. 1) after epoch $p$ as $V_p$. As a result, the matrix $V_P$ is calculated using all selected inputs from all agents: $V_P = \sum_{t=1}^{T}\sum_{i=1}^{N} g(x_{t,i};\theta_0)g(x_{t,i};\theta_0)^\top/m + \lambda I$. Define $V_0 \triangleq \lambda I$. Imagine that we have a hypothetical agent which chooses all $T \times N$ queries $\{x_{t,i}\}_{t\in[T],i\in[N]}$ sequentially in a round-robin fashion (i.e., the hypothetical agent chooses $x_{1,1}, x_{1,2}, \ldots, x_{2,1}, x_{2,2}, \ldots, x_{T,N}$), and denote the corresponding hypothetical covariance matrix as

$$\widetilde{V}_{t,i} = \sum_{\tau=1}^{t-1}\sum_{j=1}^{N} g(x_{\tau,j};\theta_0)g(x_{\tau,j};\theta_0)^\top/m + \sum_{j=1}^{i} g(x_{t,j};\theta_0)g(x_{t,j};\theta_0)^\top/m + \lambda I.$$

We represent the indices of this hypothetical agent by $t' \in [TN]$ to distinguish it from our original multi-agent setting. Define $J_{TN} \triangleq [g(x_{t'};\theta_0)]_{t'\in[TN]}$, which is a $p_0 \times (TN)$ matrix, and define $K_{TN} \triangleq J_{TN}^\top J_{TN}/m$, which is a $(TN)\times(TN)$ matrix. According to these definitions, we have that

Lemma 1 (Lemma B.7 of Zhang et al. (2021)). Let $\delta_1 \in (0,1)$. If $m \ge C T^6 N^6 K^6 L^6 \log(TNKL/\delta_1)$, we have with probability of at least $1-\delta_1$ that

$$\log\det(I + \lambda^{-1}K_{t'}) \le \log\det(I + \lambda^{-1}H) + 1, \quad \forall t' \in [TN].$$

The condition on $m$ given in Lemma 1 corresponds to condition 1 listed in App. C.1, except that $\delta_1$ is used here instead of $\delta$ in App. C.1. Lemma 1 allows us to derive the following equation, which we will use (at the end of this section) to justify that the total number of "bad" epochs is not too large:

$$\begin{aligned}
\sum_{p=0}^{P-1}\log\frac{\det V_{p+1}}{\det V_p} &= \log\frac{\det V_P}{\det V_0} \overset{(a)}{=} \log\frac{\det\big(J_{TN}J_{TN}^\top/m + \lambda I\big)}{\det V_0} = \log\frac{\det\big(\lambda(\lambda^{-1}J_{TN}J_{TN}^\top/m + I)\big)}{\det V_0} \\
&\overset{(b)}{=} \log\frac{\lambda^{p_0}\det\big(\lambda^{-1}J_{TN}J_{TN}^\top/m + I\big)}{\lambda^{p_0}} = \log\det\big(\lambda^{-1}J_{TN}J_{TN}^\top/m + I\big) \\
&\overset{(c)}{=} \log\det\big(\lambda^{-1}J_{TN}^\top J_{TN}/m + I\big) = \log\det\big(\lambda^{-1}K_{TN} + I\big) \\
&\overset{(d)}{\le} \log\det\big(\lambda^{-1}H + I\big) + 1 \overset{(e)}{=} \widetilde{d}\log(1 + TKN/\lambda) + 1 \triangleq R'.
\end{aligned} \tag{4}$$

Step (a) is because $V_P = J_{TN}J_{TN}^\top/m + \lambda I$ according to our definition of $J_{TN}$ above. Step (b) follows from our definition of $V_0 = \lambda I$ above, as well as some standard properties of the matrix determinant. Step (c) follows because $\det(AA^\top + I) = \det(A^\top A + I)$. Step (d) has made use of Lemma 1 above, which suggests that equation 4 holds with probability of at least $1-\delta_1$. Step (e) follows from the definition of $\widetilde{d} \triangleq \frac{\log\det(I + H/\lambda)}{\log(1 + TKN/\lambda)}$ (Sec. 2). In the last step, we have defined $R' \triangleq \widetilde{d}\log(1 + TKN/\lambda) + 1$. We further define $R \triangleq \lceil R' \rceil$, in which $\lceil\cdot\rceil$ denotes the ceiling operator.

Now we define all epochs $p$ which satisfy the following condition as "good epochs":

$$1 \le \frac{\det V_p}{\det V_{p-1}} \le e,$$

and define all other epochs as "bad epochs". The first inequality trivially holds for all epochs according to the way in which the matrices are constructed. It is easy to verify that the second inequality is violated for at most $R$ epochs (with probability of at least $1-\delta_1$). This is because if the second inequality were violated for more than $R$ epochs (i.e., if $\log\frac{\det V_p}{\det V_{p-1}} > 1$ for more than $R$ epochs), then $\sum_{p=0}^{P-1}\log\frac{\det V_{p+1}}{\det V_p} > R$, which violates equation 4. This suggests that there are no more than $R$ bad epochs (with probability of at least $1-\delta_1$). From here onwards, we will denote the set of good epochs by $\mathcal{E}^{\text{good}}$ and the set of bad epochs by $\mathcal{E}^{\text{bad}}$.
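The counting argument behind the bound on the number of bad epochs can be illustrated with a short sketch. It takes a hypothetical sequence of values $\log\det V_p$ (the numbers below are made up for illustration), labels each epoch as good or bad using the threshold $\log(\det V_p/\det V_{p-1}) \le 1$, and checks that the number of bad epochs is smaller than the telescoped sum $\log(\det V_P/\det V_0)$. The function name `classify_epochs` is an assumption for this example.

```python
def classify_epochs(log_dets):
    """Split epochs p = 1..P into good and bad, given log det V_0, ..., log det V_P."""
    good, bad = [], []
    for p in range(1, len(log_dets)):
        log_ratio = log_dets[p] - log_dets[p - 1]   # log(det V_p / det V_{p-1}) >= 0
        if log_ratio <= 1.0:                         # equivalent to det ratio <= e
            good.append(p)
        else:
            bad.append(p)
    return good, bad

# Hypothetical, non-decreasing values of log det V_p for P = 5 epochs.
log_dets = [0.0, 0.4, 2.0, 2.3, 4.1, 4.2]
good, bad = classify_epochs(log_dets)
total_log_ratio = log_dets[-1] - log_dets[0]        # telescoped sum in equation 4
```

Every bad epoch contributes more than 1 to the telescoped sum and every good epoch contributes a non-negative amount, so the number of bad epochs cannot exceed any upper bound on that sum.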

C.3 VALIDITY OF THE UPPER CONFIDENCE BOUND

In this section, we prove that the upper confidence bound used in our algorithm, $(1-\alpha_t)\mathrm{UCB}^a_{t,i}(x) + \alpha_t\mathrm{UCB}^b_{t,i}(x)$ (used in line 7 of Algo. 1), is a valid high-probability upper bound on the reward function $h$. We will achieve this by separately proving that $\mathrm{UCB}^a_{t,i}$ and $\mathrm{UCB}^b_{t,i}$ are valid high-probability upper bounds on $h$ in the next two sections. Note that for both UCBs, unlike Neural UCB (Zhou et al., 2020) and Neural TS (Zhang et al., 2021), which use $\theta_t$ (the parameters of the trained NNs) to calculate the exploration term (the second terms of $\mathrm{UCB}^a_{t,i}$ and $\mathrm{UCB}^b_{t,i}$), we instead use $\theta_0$. This is consistent with Kassraie & Krause (2022), who have shown that the use of $\theta_0$ gives accurate uncertainty estimation.
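As a rough illustration of how a score of this form could be evaluated, the following standard-library Python sketch computes a Linear-UCB-style quantity $\langle\phi,\theta\rangle + \nu\sqrt{\lambda}\,\|\phi\|_{V^{-1}}$ on fixed feature vectors $\phi$ (standing in for $g(x;\theta_0)/\sqrt{m}$) and then mixes two sets of scores with the weight $\alpha$. The helper names (`solve`, `ucb_a`, `select_arm`), the toy inputs, and the use of precomputed values for the second UCB are assumptions made for this example; it is not the procedure of Algo. 1.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * pivot for a, pivot in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ucb_a(phi, feats, ys, lam, nu):
    """<phi, theta> + nu * sqrt(lam) * ||phi||_{V^{-1}}, with V = sum f f^T + lam I."""
    d = len(phi)
    V = [[lam * (i == j) + sum(f[i] * f[j] for f in feats) for j in range(d)]
         for i in range(d)]
    b = [sum(y * f[i] for f, y in zip(feats, ys)) for i in range(d)]
    theta = solve(V, b)                       # ridge-regression estimate
    width = nu * math.sqrt(lam * dot(phi, solve(V, phi)))
    return dot(phi, theta) + width

def select_arm(scores_a, scores_b, alpha):
    """Pick the arm maximizing (1 - alpha) * UCB^a + alpha * UCB^b."""
    mixed = [(1 - alpha) * a + alpha * b for a, b in zip(scores_a, scores_b)]
    return max(range(len(mixed)), key=mixed.__getitem__)
```

With no observations, $V = \lambda I$ and the score reduces to $\nu\|\phi\|_2$, which gives a quick sanity check of the exploration term.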

C.3.1 VALIDITY OF UCB a t,i AS A HIGH-PROBABILITY UPPER BOUND ON h:

To begin with, we will need the following lemma from Zhang et al. (2021).

Lemma 2 (Lemma B.3 of Zhang et al. (2021)). Let $\delta_2 \in (0,1)$. There exists a constant $C > 0$ such that if $m \ge C T^4K^4N^4L^6\log(T^2K^2N^2L/\delta_2)/\lambda_0^4$, then with probability of at least $1-\delta_2$ over random initializations of $\theta_0$, there exists a $\theta^* \in \mathbb{R}^{p_0}$ such that

$$h(x) = \langle g(x;\theta_0), \theta^* - \theta_0\rangle, \qquad \sqrt{m}\,\|\theta^* - \theta_0\|_2 \le \sqrt{2\mathbf{h}^\top H^{-1}\mathbf{h}} \le B, \qquad \forall x \in \mathcal{X}_{t,i},\ t\in[T],\ i\in[N].$$

The condition on $m$ required by Lemma 2 corresponds to condition 2 listed in App. C.1, except that $\delta_2$ is used here instead of $\delta$ as in App. C.1. The following lemma formally guarantees the validity of $\mathrm{UCB}^a_{t,i}$ as a high-probability upper bound on $h$.

Lemma 3. Let $\delta_3 \in (0,1)$ and $\nu_{TKN} = B + R\sqrt{2(\log(1/\delta_3) + 1) + \widetilde{d}\log(1 + TKN/\lambda)}$. We have with probability of at least $1-\delta_1-\delta_2-\delta_3$ for all $t\in[T]$, $i\in[N]$, that

$$|h(x) - \langle g(x;\theta_0)/\sqrt{m}, \theta_{t,i}\rangle| \le \nu_{TKN}\sqrt{\lambda}\,\big\|g(x;\theta_0)/\sqrt{m}\big\|_{V_{t,i}^{-1}}, \quad \forall x\in\mathcal{X}_{t,i}.$$

Proof. Lemma 3 can be proved by following similar steps as the proof of Lemma 4.3 in the work of Zhang et al. (2021). Specifically, the proof of Lemma 3 requires Lemmas B.3, B.6 and B.7 of Zhang et al. (2021).

$\nu_{TK} = B + R\sqrt{2(\log(N/\delta_4) + 1) + \widetilde{d}_{\max}\log(1 + TK/\lambda)}$. Suppose the width $m$ of the NN satisfies $m \ge C\sqrt{\lambda}\,L^{-3/2}[\log(TKNL^2/\delta_5)]^{3/2}$ for some constant $C > 0$, as well as condition 4 in Appendix C.1. Suppose the learning rate $\eta$ and number of iterations $J$ of the gradient descent training satisfy the conditions in equation 2 and equation 3 (App. C.1), respectively. We have with probability of at least $1-\delta_4-\delta_5$ for all $t\in\mathcal{T}_{-1}$, $i\in[N]$, that

$$|h(x) - f(x;\theta_{\text{sync,NN}})| \le \nu_{TK}\sqrt{\lambda}\,\frac{1}{N}\sum_{i=1}^{N}\big\|g(x;\theta_0)/\sqrt{m}\big\|_{(V_i^{\text{local}})^{-1}} + \varepsilon_{\text{linear}}(m,T), \quad \forall x\in\mathcal{X}_{t,i}.$$

Proof. Note that the condition on $m$ listed in the lemma, $m \ge C\sqrt{\lambda}\,L^{-3/2}[\log(TKNL^2/\delta_5)]^{3/2}$, corresponds to condition 3 listed in App. C.1 except that $\delta_5$ is used here instead of $\delta$. Therefore, the validity of Lemma 4 requires conditions 3 and 4 on $m$ (App. C.1) to be satisfied.
For ease of exposition, we separate our proof into three steps.

Step 1: NN Output $f(x;\theta_{\text{sync,NN}})$ Is Close to (Averaged) Linear Prediction

Based on Lemma C.2 of Zhang et al. (2021), if the conditions on $m$ listed in Lemma 4 are satisfied, then for any $\theta$ such that $\|\theta - \theta_0\|_2 \le 2\sqrt{t/(m\lambda)}$, there exists a constant $C_1 > 0$ such that we have with probability of at least $1-\delta_5$ over random initializations $\theta_0$ that

$$|f(x;\theta) - \langle g(x;\theta_0), \theta - \theta_0\rangle| \le C_1 t^{2/3}m^{-1/6}\lambda^{-2/3}L^3\sqrt{\log m} \le C_1 T^{2/3}m^{-1/6}\lambda^{-2/3}L^3\sqrt{\log m} \triangleq \varepsilon_{\text{linear},1}(m,T), \tag{7}$$

which holds $\forall x\in\mathcal{X}_{t,i}$, $t\in[T]$, $i\in[N]$. Also note that according to Lemma C.1 of Zhang et al. (2021), if conditions 3 and 4 on $m$ listed in App. C.1, as well as the condition on $\eta$ (equation 2), are satisfied, then we have with probability of at least $1-\delta_5$ over random initializations $\theta_0$ that $\|\theta^i_t - \theta_0\|_2 \le 2\sqrt{t/(m\lambda)}$, $\forall i\in[N]$. An immediate implication is that the aggregated NN parameters $\theta_{\text{sync,NN}} = \frac{1}{N}\sum_{i=1}^{N}\theta^i_t$ also satisfy

$$\|\theta_{\text{sync,NN}} - \theta_0\|_2 = \Big\|\frac{1}{N}\sum_{i=1}^{N}\theta^i_t - \theta_0\Big\|_2 \le \frac{1}{N}\sum_{i=1}^{N}\|\theta^i_t - \theta_0\|_2 \le 2\sqrt{t/(m\lambda)}.$$

This implies that equation 7 holds for $\theta_{\text{sync,NN}}$ with probability of at least $1-2\delta_5$:

$$|f(x;\theta_{\text{sync,NN}}) - \langle g(x;\theta_0), \theta_{\text{sync,NN}} - \theta_0\rangle| \le \varepsilon_{\text{linear},1}(m,T). \tag{8}$$

Next, note that $\theta^i_t$ is obtained by training only using agent $i$'s local observations (line 14 of Algo. 1). Define $\theta^{\text{local}}_{t,i} = (V^{\text{local}}_{t,i})^{-1}\big(\sum_{\tau=1}^{t} y_{\tau,i}\,g(x_{\tau,i};\theta_0)/\sqrt{m}\big)$. Note that $\theta^{\text{local}}_{t,i}$ is calculated in the same way as $\theta_{t,i}$ (line 4 of Algo. 1), except that its calculation only involves agent $i$'s local observations. Next, making use of Lemmas C.1 and C.4 of Zhang et al. (2021), we can show that there exist constants $C_2 > 0$ and $C_3 > 0$ such that we have $\forall x\in\mathcal{X}_{t,i}$, $t\in[T]$, $i\in[N]$ that

$$\begin{aligned}
|\langle g(x;\theta_0), \theta^i_t - \theta_0\rangle - \langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle| &\le C_2(1-\eta m\lambda)^J\sqrt{tL/\lambda} + C_3 m^{-1/6}\sqrt{\log m}\,L^4 t^{5/3}\lambda^{-5/3}(1+\sqrt{t/\lambda}) \\
&\le C_2(1-\eta m\lambda)^J\sqrt{TL/\lambda} + C_3 m^{-1/6}\sqrt{\log m}\,L^4 T^{5/3}\lambda^{-5/3}(1+\sqrt{T/\lambda}) \\
&\triangleq \varepsilon_{\eta,J} + \varepsilon_{\text{linear},2}(m,T).
\end{aligned} \tag{9}$$

We refer to $\langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle$ as the linear prediction because it is the prediction of the linear model with the neural tangent features $g(x;\theta_0)/\sqrt{m}$ as the input features, conditioned on the local observations of agent $i$. Note that similar to equation 8, which also relies on Lemma C.1 of Zhang et al. (2021), equation 9 also requires conditions 3 and 4 on $m$, as well as the condition on $\eta$, in App. C.1 to be satisfied. Equation 9 holds with probability of at least $1-2\delta_5$, where the error probabilities come from the use of Lemmas C.1 and C.4 of Zhang et al. (2021).

Next, we can bound the difference between $f(x;\theta_{\text{sync,NN}})$ (i.e., the prediction of the NN with the aggregated parameters) and the averaged linear predictions of all agents calculated using their local observations:

$$\begin{aligned}
\Big|f(x;\theta_{\text{sync,NN}}) - \frac{1}{N}\sum_{i=1}^{N}\langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle\Big|
&\le \Big|f(x;\theta_{\text{sync,NN}}) - \frac{1}{N}\sum_{i=1}^{N}\langle g(x;\theta_0), \theta^i_t - \theta_0\rangle\Big| \\
&\quad + \Big|\frac{1}{N}\sum_{i=1}^{N}\langle g(x;\theta_0), \theta^i_t - \theta_0\rangle - \frac{1}{N}\sum_{i=1}^{N}\langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle\Big| \\
&\le |f(x;\theta_{\text{sync,NN}}) - \langle g(x;\theta_0), \theta_{\text{sync,NN}} - \theta_0\rangle| \\
&\quad + \frac{1}{N}\sum_{i=1}^{N}|\langle g(x;\theta_0), \theta^i_t - \theta_0\rangle - \langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle| \\
&\le \varepsilon_{\text{linear},1}(m,T) + \frac{1}{N}\sum_{i=1}^{N}\big(\varepsilon_{\eta,J} + \varepsilon_{\text{linear},2}(m,T)\big) \\
&\le \varepsilon_{\text{linear},1}(m,T) + \varepsilon_{\eta,J} + \varepsilon_{\text{linear},2}(m,T) \triangleq \varepsilon_{\text{linear}}(m,T).
\end{aligned} \tag{10}$$

In the second inequality, we plugged in the definition of $\theta_{\text{sync,NN}} = \frac{1}{N}\sum_{i=1}^{N}\theta^i_t$. In the third inequality, we have made use of equation 8 and equation 9. Equation 10 holds with probability of at least $1-4\delta_5$, where the error probabilities come from equation 8 ($2\delta_5$) and equation 9 ($2\delta_5$), respectively. Now we replace $\delta_5$ by $\delta_5/4$, which ensures that equation 10 holds with probability of at least $1-\delta_5$. This will only introduce a factor of 4 within the log of condition 3 on $m$ (App. C.1), which is ignored since it can be absorbed by the constant $C$.
Step 2: Linear Prediction Is Close to the Reward Function $h(x)$

In the proof in this section, we will also need a "local" variant of the confidence bound of Lemma 3, i.e., the confidence bound of Lemma 3 calculated only using the local observations of an agent $i$:

Lemma 5 (Zhang et al. (2021)). We have with probability of at least $1-\delta_4$ for all $t\in\mathcal{T}_{-1}\subset[T]$, $i\in[N]$, that

$$|h(x) - \langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle| \le \nu_{TK}\sqrt{\lambda}\,\big\|g(x;\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_{t,i})^{-1}}, \quad \forall x\in\mathcal{X}_{t,i}.$$

Proof. Similar to the proof of Lemma 3 (App. C.3.1), the proof of Lemma 5 also requires Lemmas B.3, B.6 and B.7 from Zhang et al. (2021), which, in this case, can be directly applied to our setting (except that we need an additional union bound over all $N$ agents). The implication of the additional union bound on the error probabilities is taken care of by the additional term of $N$ within the log in the expression of $\nu_{TK}$ (Lemma 4), and in conditions 1 and 2 on $m$ (see App. C.1, and also Lemmas 2 and 1). Letting the total error probability of these three lemmas (Lemmas B.3, B.6 and B.7 of Zhang et al. (2021)) be $\delta_4$, we can ensure that Lemma 5 holds with probability of at least $1-\delta_4$. For simplicity, we let the error probability for Lemma B.6 be $\delta_4$, which leads to the cleaner expression of $\nu_{TK}$ in Lemma 4. This means that the error probabilities for Lemmas B.3 and B.7 are very small, which can be accounted for by simply increasing the value of the absolute constant $C$ in conditions 1 and 2 on $m$ (App. C.1) and hence does not affect our main theoretical analysis.
Step 3: Combining Results from Step 1 and Step 2

Next, we are ready to prove the validity of $\mathrm{UCB}^b_{t,i}$ by using the averaged linear prediction $\frac{1}{N}\sum_{i=1}^{N}\langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle$ as an intermediate term:

$$\begin{aligned}
|f(x;\theta_{\text{sync,NN}}) - h(x)| &= \Big|f(x;\theta_{\text{sync,NN}}) - \frac{1}{N}\sum_{i=1}^{N}\langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle + \frac{1}{N}\sum_{i=1}^{N}\langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle - h(x)\Big| \\
&\le \frac{1}{N}\sum_{i=1}^{N}|\langle g(x;\theta_0)/\sqrt{m}, \theta^{\text{local}}_{t,i}\rangle - h(x)| + \varepsilon_{\text{linear}}(m,T) \\
&\le \frac{1}{N}\sum_{i=1}^{N}\nu_{TK}\sqrt{\lambda}\,\big\|g(x;\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_{t,i})^{-1}} + \varepsilon_{\text{linear}}(m,T) \\
&= \frac{1}{N}\sum_{i=1}^{N}\nu_{TK}\sqrt{\lambda}\,\big\|g(x;\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_{i})^{-1}} + \varepsilon_{\text{linear}}(m,T).
\end{aligned} \tag{11}$$

The second step has made use of equation 10, and the third step follows from Lemma 5. In the last step, we have made the substitution $(V^{\text{local}}_i)^{-1} = (V^{\text{local}}_{t,i})^{-1}$. This is because in the proof of Lemma 4 here, we only consider the iterations $t\in\mathcal{T}_{-1} \triangleq \{t_p - 1\}_{p=2,\ldots,P-1}$, i.e., the last iteration of every epoch. As a result, this ensures that $(V^{\text{local}}_i)^{-1} = (V^{\text{local}}_{t,i})^{-1}$, because every time the central server obtains $(V^{\text{local}}_i)^{-1}$ through $(V^{\text{local}}_i)^{-1} = (V^{\text{local}}_{t,i})^{-1}$, $\forall i\in[N]$ (line 4 of Algo. 2), the current iteration $t$ is the last iteration of the previous epoch. As a result, equation 11 holds with probability of at least $1-\delta_4-\delta_5$, in which the error probabilities come from equation 10 ($\delta_5$) and Lemma 5 ($\delta_4$). In other words, Lemma 4 (i.e., the validity of $\mathrm{UCB}^b_{t,i}$) holds with probability of at least $1-\delta_4-\delta_5$.

C.4 REGRET UPPER BOUND FOR GOOD EPOCHS

In this section, we derive an upper bound on the total regrets incurred in all good epochs E good (defined in App. C.2).

C.4.1 AUXILIARY INEQUALITY

We first derive an auxiliary result which will be used in the proofs later. For agent $i$ and iteration $t$ in a good epoch $p\in\mathcal{E}^{\text{good}}$, we have that

$$\begin{aligned}
\sqrt{\lambda}\,\big\|g(x;\theta_0)/\sqrt{m}\big\|_{V_{t,i}^{-1}} &= \sqrt{\lambda\, g(x;\theta_0)^\top V_{t,i}^{-1} g(x;\theta_0)/m} \\
&\le \sqrt{\lambda\, g(x;\theta_0)^\top \widetilde{V}_{t,i}^{-1} g(x;\theta_0)/m \cdot \frac{\det\widetilde{V}_{t,i}}{\det V_{t,i}}} \\
&\le \sqrt{\lambda\, g(x;\theta_0)^\top \widetilde{V}_{t,i}^{-1} g(x;\theta_0)/m \cdot \frac{\det V_p}{\det V_{p-1}}} \\
&\le \sqrt{e\lambda\, g(x;\theta_0)^\top \widetilde{V}_{t,i}^{-1} g(x;\theta_0)/m} = \sqrt{e\lambda}\,\big\|g(x;\theta_0)/\sqrt{m}\big\|_{\widetilde{V}_{t,i}^{-1}}.
\end{aligned} \tag{12}$$

Recall that $V_{t,i}$ (line 4 of Algo. 1) is used by agent $i$ in iteration $t$ to select $x_{t,i}$ (via $\mathrm{UCB}^a_{t,i}$), and that the matrix $\widetilde{V}_{t,i}$ is defined for the hypothetical agent which sequentially chooses all $TN$ queries $\{x_{t,i}\}_{t\in[T],i\in[N]}$ in a round-robin fashion (App. C.2). The first inequality in equation 12 above follows from Lemma 12 of Abbasi-Yadkori et al. (2011). The second inequality is because $V_p$ contains more information than $\widetilde{V}_{t,i}$ (since $V_p$ is calculated using all the inputs selected after epoch $p$), and $V_{p-1}$ contains less information than $V_{t,i}$ (because compared with $V_{p-1}$, $V_{t,i}$ additionally contains the local inputs selected by agent $i$ in the current epoch $p$). In the last inequality, we have made use of the definition of good epochs, i.e., $(\det V_p)/(\det V_{p-1}) \le e$ (App. C.2).

C.4.2 UPPER BOUND ON THE INSTANTANEOUS REGRET r t,i

Here we assume that both $\mathrm{UCB}^a_{t,i}$ and $\mathrm{UCB}^b_{t,i}$ hold (hence we ignore the error probabilities here), which we have proved in App. C.3. We now derive an upper bound on the instantaneous regret $r_{t,i} = h(x^*_{t,i}) - h(x_{t,i})$ for agent $i$ and iteration $t$ in a good epoch $p\in\mathcal{E}^{\text{good}}$:

$$\begin{aligned}
r_{t,i} &= h(x^*_{t,i}) - h(x_{t,i}) = \alpha h(x^*_{t,i}) + (1-\alpha)h(x^*_{t,i}) - h(x_{t,i}) \\
&\le \alpha\,\mathrm{UCB}^b_{t,i}(x^*_{t,i}) + \alpha\varepsilon_{\text{linear}}(m,T) + (1-\alpha)\mathrm{UCB}^a_{t,i}(x^*_{t,i}) - h(x_{t,i}) \\
&\le \alpha\,\mathrm{UCB}^b_{t,i}(x_{t,i}) + (1-\alpha)\mathrm{UCB}^a_{t,i}(x_{t,i}) + \alpha\varepsilon_{\text{linear}}(m,T) - h(x_{t,i}) \\
&= \alpha\big(\mathrm{UCB}^b_{t,i}(x_{t,i}) - h(x_{t,i})\big) + (1-\alpha)\big(\mathrm{UCB}^a_{t,i}(x_{t,i}) - h(x_{t,i})\big) + \alpha\varepsilon_{\text{linear}}(m,T) \\
&\le \alpha\Big(2\nu_{TK}\frac{1}{N}\sum_{j=1}^{N}\sqrt{\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_j)^{-1}} + \varepsilon_{\text{linear}}(m,T)\Big) + (1-\alpha)\,2\nu_{TKN}\sqrt{\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{V_{t,i}^{-1}} + \alpha\varepsilon_{\text{linear}}(m,T) \\
&\le \alpha\Big(2\nu_{TK}\frac{1}{N}\sum_{j=1}^{N}\sqrt{\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_j)^{-1}} + \varepsilon_{\text{linear}}(m,T)\Big) + (1-\alpha)\,2\nu_{TKN}\sqrt{e\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{\widetilde{V}_{t,i}^{-1}} + \alpha\varepsilon_{\text{linear}}(m,T) \\
&= \alpha\,2\nu_{TK}\frac{1}{N}\sum_{j=1}^{N}\sqrt{\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_j)^{-1}} + (1-\alpha)\,2\nu_{TKN}\sqrt{e\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{\widetilde{V}_{t,i}^{-1}} + 2\alpha\varepsilon_{\text{linear}}(m,T) \\
&\triangleq (1-\alpha)\,2\nu_{TKN}\sqrt{e}\,\widetilde{\sigma}_{t,i}(x_{t,i}) + \alpha\,2\nu_{TK}\frac{1}{N}\sum_{j=1}^{N}\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t,i}) + 2\alpha\varepsilon_{\text{linear}}(m,T).
\end{aligned} \tag{13}$$

The first inequality makes use of Lemma 3 (i.e., the validity of $\mathrm{UCB}^a_{t,i}$) and Lemma 4 (i.e., the validity of $\mathrm{UCB}^b_{t,i}$). The second inequality follows from the way in which $x_{t,i}$ is selected (line 7 of Algo. 1): $x_{t,i} = \arg\max_{x\in\mathcal{X}_{t,i}} (1-\alpha)\mathrm{UCB}^a_{t,i}(x) + \alpha\mathrm{UCB}^b_{t,i}(x)$. The third inequality again makes use of Lemma 3 and Lemma 4, as well as the expressions of $\mathrm{UCB}^a_{t,i}$ and $\mathrm{UCB}^b_{t,i}$. In the fourth inequality, we have made use of the auxiliary inequality of equation 12 derived in the last section. Recall that we have discussed that $(V^{\text{local}}_i)^{-1} = (V^{\text{local}}_{t,i})^{-1}$ for all $t = t_p - 1$ at the end of App. C.3.2. Therefore, in the last step, we have defined

$$\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t,i}) \triangleq \sqrt{\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_{t_p-1,j})^{-1}} = \sqrt{\lambda}\,\big\|g(x_{t,i};\theta_0)/\sqrt{m}\big\|_{(V^{\text{local}}_{j})^{-1}},$$

which represents the GP posterior standard deviation (using the kernel $k(x,x') = g(x;\theta_0)^\top g(x';\theta_0)/m$) conditioned on all of agent $j$'s local observations before iteration $t_p$. Note that $\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t,i})$ is the same as the one defined in Sec. 3.3 of the main text where we explain the weight between the two UCBs. Similarly, we have also defined $\widetilde{\sigma}_{t,i}(x_{t,i}) \triangleq \sqrt{\lambda}\,\|g(x_{t,i};\theta_0)/\sqrt{m}\|_{\widetilde{V}_{t,i}^{-1}}$, which represents the GP posterior standard deviation conditioned on the observations of the hypothetical agent before $x_{t,i}$ is selected (App. C.2). Next, we will separately derive upper bounds on the summation (across all good epochs and all agents) of the first and second terms of the upper bound from equation 13.

C.4.3 UPPER BOUND ON THE SUM OF THE FIRST TERM OF EQUATION 13

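Here, similar to Kassraie & Krause (2022), we denote by $\kappa_0$ an upper bound on the value of the NTK function for any input: $\langle g(x;\theta_0)/\sqrt{m}, g(x;\theta_0)/\sqrt{m}\rangle \le \kappa_0$, $\forall x\in\mathcal{X}_{t,i}$, $t\in[T]$, $i\in[N]$. As a result, we can use it to show that both $\widetilde{\sigma}_{t,i}(x)$ and $\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x)$ can be upper-bounded: $\widetilde{\sigma}_{t,i}(x) \le \sqrt{\kappa_0}$ and $\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x) \le \sqrt{\kappa_0}$. To show this, following the notations of Appendix C.2, we denote $\widetilde{V}_{t,i} = J_{t,i}J_{t,i}^\top + \lambda I$, where $J_{t,i} = \big[[g(x_{\tau,j};\theta_0)]_{\tau\in[t-1],j\in[N]}, [g(x_{t,j};\theta_0)]_{j\in[i]}\big]$ is a $p_0\times[(t-1)N + i]$ matrix. Then we have

$$\begin{aligned}
\widetilde{\sigma}^2_{t,i}(x) &= \lambda\,\big\|g(x;\theta_0)/\sqrt{m}\big\|^2_{\widetilde{V}_{t,i}^{-1}} = \lambda\, g(x;\theta_0)^\top\big(J_{t,i}J_{t,i}^\top + \lambda I\big)^{-1} g(x;\theta_0)/m \\
&= \lambda\, g(x;\theta_0)^\top\Big[\frac{1}{\lambda}I - \frac{1}{\lambda}J_{t,i}\Big(I + J_{t,i}^\top\frac{1}{\lambda}J_{t,i}\Big)^{-1}J_{t,i}^\top\frac{1}{\lambda}\Big] g(x;\theta_0)/m \\
&= g(x;\theta_0)^\top g(x;\theta_0)/m - \big(g(x;\theta_0)^\top/\sqrt{m}\big)J_{t,i}\big(\lambda I + J_{t,i}^\top J_{t,i}\big)^{-1}J_{t,i}^\top\big(g(x;\theta_0)/\sqrt{m}\big) \\
&= g(x;\theta_0)^\top g(x;\theta_0)/m - \big\|\big(g(x;\theta_0)^\top/\sqrt{m}\big)J_{t,i}\big\|^2_{(\lambda I + J_{t,i}^\top J_{t,i})^{-1}} \\
&\le g(x;\theta_0)^\top g(x;\theta_0)/m \le \kappa_0,
\end{aligned}$$

where we used the matrix inversion lemma in the third equality. Using similar derivations also allows us to show that $(\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x))^2 \le \kappa_0$. Therefore, we have that $\widetilde{\sigma}_{t,i}(x) \le \sqrt{\kappa_0}$ and $\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x) \le \sqrt{\kappa_0}$.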
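The matrix inversion lemma step above can be checked numerically on a toy example. The sketch below, in standard-library Python, evaluates the posterior variance once in its "primal" form $\lambda\phi^\top(JJ^\top + \lambda I)^{-1}\phi$ and once in its "dual" form $\phi^\top\phi - (\phi^\top J)(\lambda I + J^\top J)^{-1}(J^\top\phi)$, where $\phi$ stands in for $g(x;\theta_0)/\sqrt{m}$ and the columns of $J$ are previously selected feature vectors. The function names and the toy numbers are assumptions for illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * pivot for a, pivot in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def sigma_sq_primal(phi, cols, lam):
    """lam * phi^T (J J^T + lam I)^{-1} phi, where `cols` are the columns of J."""
    d = len(phi)
    V = [[lam * (i == j) + sum(c[i] * c[j] for c in cols) for j in range(d)]
         for i in range(d)]
    return lam * dot(phi, solve(V, phi))

def sigma_sq_dual(phi, cols, lam):
    """phi^T phi - k^T (lam I + J^T J)^{-1} k, with k = J^T phi."""
    n = len(cols)
    G = [[lam * (i == j) + dot(cols[i], cols[j]) for j in range(n)]
         for i in range(n)]
    k = [dot(c, phi) for c in cols]
    return dot(phi, phi) - dot(k, solve(G, k))

phi = [1.0, 2.0, 0.5]
cols = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0]]
primal = sigma_sq_primal(phi, cols, 0.7)
dual = sigma_sq_dual(phi, cols, 0.7)
```

Because the subtracted quadratic form in the dual expression is non-negative, both values are at most $\phi^\top\phi$, which is the quantity bounded by $\kappa_0$ in the text.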
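Denoting the set of iterations from all good epochs as $\mathcal{T}^{\text{good}}$, we can derive an upper bound on the first term of equation 13, summed across all agents $i \in [N]$ and all iterations in good epochs $\mathcal{T}^{\text{good}}$:

$$
\begin{aligned}
\sum_{i=1}^{N}\sum_{t\in\mathcal{T}^{\text{good}}} (1-\alpha)\,2\nu_{TKN}\sqrt{e}\,\widetilde{\sigma}_{t,i}(x_{t,i})
&\overset{(a)}{\le} 2\sqrt{e}\,\nu_{TKN}\sum_{i=1}^{N}\sum_{t=1}^{T}\widetilde{\sigma}_{t,i}(x_{t,i}) \\
&\overset{(b)}{=} 2\sqrt{e}\,\nu_{TKN}\sum_{i=1}^{N}\sum_{t=1}^{T}\min\{\widetilde{\sigma}_{t,i}(x_{t,i}),\sqrt{\kappa_0}\} \\
&\overset{(c)}{\le} 2\sqrt{e}\,\nu_{TKN}\sum_{i=1}^{N}\sum_{t=1}^{T}\min\{\sqrt{\kappa_0}\,\widetilde{\sigma}_{t,i}(x_{t,i}),\sqrt{\kappa_0}\} \\
&\le 2\sqrt{e}\,\nu_{TKN}\sqrt{\kappa_0}\sum_{i=1}^{N}\sum_{t=1}^{T}\min\{\widetilde{\sigma}_{t,i}(x_{t,i}),1\} \\
&\overset{(d)}{\le} 2\sqrt{e}\,\nu_{TKN}\sqrt{\kappa_0}\sqrt{TN\sum_{i=1}^{N}\sum_{t=1}^{T}\min\{\widetilde{\sigma}^2_{t,i}(x_{t,i}),1\}} \\
&\overset{(e)}{\le} 2\sqrt{e}\,\nu_{TKN}\sqrt{\kappa_0}\sqrt{TN\,[2\lambda\log\det(\lambda^{-1}K_{TN}+I)]} \\
&\overset{(f)}{\le} 2\sqrt{2e}\,\nu_{TKN}\sqrt{\kappa_0}\sqrt{TN\lambda\,[\log\det(\lambda^{-1}H+I)+1]} \\
&= 2\sqrt{2e}\,\nu_{TKN}\sqrt{\kappa_0}\sqrt{TN\lambda\,[\widetilde{d}\log(1+TNK/\lambda)+1]}.
\end{aligned}\tag{15}
$$

Step (a) follows from $1-\alpha \le 1$ (since $\alpha \in [0,1]$), $\forall t \ge 1$, and from summing across all iterations $[T]$ instead of only those iterations $\mathcal{T}^{\text{good}}$ in good epochs. Step (b) follows because $\widetilde{\sigma}_{t,i}(x) \le \sqrt{\kappa_0}$ as discussed above. In step (c), we have assumed that $\kappa_0 \ge 1$; however, if $\kappa_0 < 1$, the proof still goes through since we can directly upper-bound $\min\{\widetilde{\sigma}_{t,i}(x_{t,i}),\sqrt{\kappa_0}\}$ by $\min\{\widetilde{\sigma}_{t,i}(x_{t,i}),1\}$, after which the only modification we need to make to the equation above is to remove the dependency on the multiplicative term of $\sqrt{\kappa_0}$. Step (d) results from the Cauchy-Schwarz inequality. Step (e) can be derived following the proof of Lemma 4.8 of Zhang et al. (2021) (in Appendix B.7 of Zhang et al. (2021)). Step (f) follows from Lemma 1 and hence holds with probability of at least $1-\delta_1$. The last equality simply plugs in the definition of the effective dimension $\widetilde{d}$ (Sec. 2).

C.4.4 UPPER BOUND ON THE SUM OF THE SECOND TERM OF EQUATION 13

In this subsection, we derive an upper bound on the sum of the second term in equation 13 across all good epochs and all agents. For the proof here, we need a "local" version of Lemma 1, i.e., a version of Lemma 1 which only makes use of the contexts of an agent $i$. Define $K_{t,i}$ as the local counterpart to $K_{t'}$ (from Lemma 1), i.e., $K_{t,i}$ is the $t \times t$ matrix calculated using only agent $i$'s local contexts up to (and including) iteration $t$. Specifically, define $J_{t,i} \triangleq [g(x_{\tau,i};\theta_0)]_{\tau\in[t]}$, which is a $p_0 \times t$ matrix; then $K_{t,i}$ is defined as $K_{t,i} \triangleq J_{t,i}^\top J_{t,i}/m$, which is a $t \times t$ matrix.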
Also recall that in the main text, we have defined $H_i$ as the local counterpart of $H$ for agent $i$ (Sec. 2). The next lemma gives our desired local version of Lemma 1.

Lemma 6 (Lemma B.7 of Zhang et al. (2021)). If $m \ge C T^6 K^6 L^6 \log(TNKL/\delta_6)$, we have with probability of at least $1-\delta_6$ that $\log\det(I + \lambda^{-1}K_{t,i}) \le \log\det(I + \lambda^{-1}H_i) + 1$, for all $t \in [T], i \in [N]$.

We needed to take a union bound over all $N$ agents, which explains the factor of $N$ within the log in the lower bound on $m$ given in Lemma 6. Note that the lower bound on $m$ required by Lemma 6 is smaller than that of Lemma 1 (by a factor of $N^6$); therefore, the condition on $m$ in Lemma 6 is ignored in the conditions listed in App. C.1. Of note, throughout the entire epoch $p$, $\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t,i})$ is calculated conditioned on all the local observations of agent $j$ before iteration $t_p$. Denote by $\mathcal{T}^{(p)}$ the iteration indices in epoch $p$: $\mathcal{T}^{(p)} = \{t_p, \ldots, t_p + E_p - 1\}$. In the proof in this section, as we have discussed in the first paragraph of Sec. 4.1, we analyze a simpler variant of our algorithm where we only set $\alpha > 0$ in the first iteration after a communication round, i.e., $\alpha > 0, \forall t \in \{t_p\}_{p\in[P]}$ and $\alpha = 0, \forall t \in [T] \setminus \{t_p\}_{p\in[P]}$. Now we are ready to derive an upper bound on the second term in equation 13, summed over all agents and all good epochs:

$$
\begin{aligned}
\sum_{i=1}^{N}\sum_{p\in\mathcal{E}^{\text{good}}}\sum_{t\in\mathcal{T}^{(p)}} \alpha\,2\nu_{TK}\frac{1}{N}\sum_{j=1}^{N}\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t,i})
&\overset{(a)}{\le} 2\nu_{TK}\frac{1}{N}\sum_{i=1}^{N}\sum_{p\in[P]}\sum_{t\in\mathcal{T}^{(p)}}\alpha\sum_{j=1}^{N}\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t,i}) \\
&\overset{(b)}{\le} 2\nu_{TK}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{p\in[P]}\alpha\,\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t_p,i}) \\
&\overset{(c)}{\le} 2\nu_{TK}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{p\in[P]}\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t_p,j}) \\
&\overset{(d)}{\le} 2\nu_{TK}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T}\widetilde{\sigma}^{\text{local}}_{t-1,j}(x_{t,j}).
\end{aligned}\tag{16}
$$

The inequality in step (a) results from summing across all epochs $p \in [P]$ instead of only good epochs $p \in \mathcal{E}^{\text{good}}$. Step (b) follows since $\alpha = 0, \forall t \in [T] \setminus \{t_p\}_{p\in[P]}$ as we discussed above; therefore, for every epoch $p$, we only need to keep the first term $t = t_p$ in the summation over $t \in \mathcal{T}^{(p)}$.
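To understand step (c), recall that in the main text (Sec. 3.3), we have defined $\widetilde{\sigma}^{\text{local}}_{t,i,\min} \triangleq \min_{x\in\mathcal{X}}\widetilde{\sigma}^{\text{local}}_{t,i}(x)$ and $\widetilde{\sigma}^{\text{local}}_{t,i,\max} \triangleq \max_{x\in\mathcal{X}}\widetilde{\sigma}^{\text{local}}_{t,i}(x)$, $\forall i \in [N]$. Next, note that our algorithm selects $\alpha$ by $\alpha = \min_{i\in[N]}\alpha_{t,i}$ (line 4 of Algo. 2), where $\alpha_{t,i} = \widetilde{\sigma}^{\text{local}}_{t,i,\min}/\widetilde{\sigma}^{\text{local}}_{t,i,\max}$ (line 15 of Algo. 1) and $t = t_p - 1$ since $\alpha_{t,i}$ is calculated only in the last iteration of every epoch. As a result, we have that

$$
\alpha = \min_{i\in[N]}\alpha_{t_p-1,i} = \min_{i\in[N]}\frac{\widetilde{\sigma}^{\text{local}}_{t_p-1,i,\min}}{\widetilde{\sigma}^{\text{local}}_{t_p-1,i,\max}} \le \frac{\widetilde{\sigma}^{\text{local}}_{t_p-1,j,\min}}{\widetilde{\sigma}^{\text{local}}_{t_p-1,j,\max}} \le \frac{\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t_p,j})}{\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t_p,i})},
$$

which tells us that $\alpha\,\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t_p,i}) \le \widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x_{t_p,j})$ and hence leads to step (c). Step (d) results from summing across all iterations $[T]$ instead of only the first iteration of every epoch. Next, we can derive an upper bound on the inner summation over $t = 1, \ldots, T$ from equation 16:

$$
\begin{aligned}
\sum_{t=1}^{T}\widetilde{\sigma}^{\text{local}}_{t-1,j}(x_{t,j})
&\overset{(a)}{\le} \sqrt{\kappa_0}\sum_{t=1}^{T}\min\{\widetilde{\sigma}^{\text{local}}_{t-1,j}(x_{t,j}),1\} \\
&\overset{(b)}{\le} \sqrt{\kappa_0}\sqrt{T\sum_{t=1}^{T}\min\{(\widetilde{\sigma}^{\text{local}}_{t-1,j}(x_{t,j}))^2,1\}} \\
&\overset{(c)}{\le} \sqrt{\kappa_0}\sqrt{T\,[2\lambda\log\det(\lambda^{-1}K_{T,j}+I)]} \\
&\overset{(d)}{\le} \sqrt{2}\sqrt{\kappa_0}\sqrt{T\lambda\,[\log\det(\lambda^{-1}H_j+I)+1]} \\
&= \sqrt{2}\sqrt{\kappa_0}\sqrt{T\lambda\,[\widetilde{d}_j\log(1+TK/\lambda)+1]}.
\end{aligned}\tag{17}
$$

Step (a) is obtained in the same way as steps (b) and (c) in equation 15 (App. C.4.3), i.e., we have made use of $\widetilde{\sigma}^{\text{local}}_{t_p-1,j}(x) \le \sqrt{\kappa_0}$ and assumed that $\kappa_0 \ge 1$. Again note that if $\kappa_0 < 1$, then the proof still goes through since $\widetilde{\sigma}^{\text{local}}_{t-1,j}(x_{t,j}) \le \min\{\widetilde{\sigma}^{\text{local}}_{t-1,j}(x_{t,j}),\sqrt{\kappa_0}\} \le \min\{\widetilde{\sigma}^{\text{local}}_{t-1,j}(x_{t,j}),1\}$, after which the only modification we need to make to the equation above is to remove the dependency on the multiplicative term of $\sqrt{\kappa_0}$. Step (b) makes use of the Cauchy-Schwarz inequality. Step (c), similar to step (e) of equation 15, is derived following the proof of Lemma 4.8 of Zhang et al. (2021) (in Appendix B.7 of Zhang et al. (2021)). Step (d) follows from Lemma 6 and hence holds with probability of at least $1-\delta_6$. In the last equality, we have simply plugged in the definition of $\widetilde{d}_j$ (Sec. 2).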
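The choice of the weight $\alpha$ and the inequality behind step (c) can be illustrated with a short sketch. It rests on several assumptions: random matrices stand in for the agents' local matrices $V^{\text{local}}$, a finite candidate set `G` replaces the minimum and maximum over the whole domain, and the helper names `local_sigma` and `federated_alpha` are hypothetical (they are not taken from the authors' code).

```python
import numpy as np

def local_sigma(G, V_local, lam, m):
    """Local posterior std sqrt(lam) * ||g/sqrt(m)|| in the V_local^{-1} norm.

    G has one row g(x; theta_0) per candidate context.
    """
    solved = np.linalg.solve(V_local, G.T)       # shape (p0, num_contexts)
    quad = np.einsum("ij,ji->i", G, solved)      # g^T V^{-1} g for every row
    return np.sqrt(lam * quad / m)

def federated_alpha(G, V_locals, lam, m):
    """alpha = min over agents of (smallest local std) / (largest local std)."""
    ratios = []
    for V in V_locals:
        s = local_sigma(G, V, lam, m)
        ratios.append(s.min() / s.max())
    return min(ratios)

rng = np.random.default_rng(1)
p0, m, lam, num_agents = 8, 20, 1.0, 3           # toy sizes (assumptions)
G = rng.normal(size=(6, p0))                     # stand-in candidate features
V_locals = []
for _ in range(num_agents):
    J = rng.normal(size=(p0, 5))
    V_locals.append(J @ J.T / m + lam * np.eye(p0))

alpha = federated_alpha(G, V_locals, lam, m)
print(alpha)
```

Because $\alpha$ is at most every agent's min-to-max ratio, multiplying any agent's local standard deviation at one context by $\alpha$ can never exceed that agent's local standard deviation at any other context, which is exactly what step (c) uses.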
Now we can plug equation 17 into equation 16 to obtain N i=1 p∈E good t∈T (p) α2ν T K 1 N N j=1 σ local tp-1,j (x t,i ) ≤ 2ν T K 1 N N i=1 N j=1 √ 2 √ κ 0 T λ d j log(1 + T K/λ)) + 1 = 2 √ 2ν T K √ κ 0 N j=1 T λ d j log(1 + T K/λ)) + 1 . C.4.5 PUTTING THINGS TOGETHER Finally, recall that our derived upper bound on r t,i in equation 13 contains three terms (the third term is simply an error term), and now we can make use of our derived upper bound on the first term (App. C.4.3) and the second term (App. C.4.4), summed over all agents and all good epochs, to obtain an upper bound on the total regrets incurred in all good epochs: R good T = N i=1 t∈T good r t,i ≤ 2 √ eν T KN √ κ 0 T N λ d log(1 + T N K/λ) + 1 + 2 √ 2ν T K √ κ 0 N j=1 T λ d j log(1 + T K/λ)) + 1 + T N ε linear (m, T ) = O d T N d + d max N T d max + T N ε linear (m, T ) = O d √ T N + d max N √ T + T N ε linear (m, T ) . In the second last equality, we have used ν T KN = O( d) and ν T K = O( d max ).

C.5 REGRET UPPER BOUND FOR BAD EPOCHS

In this section, we derive an upper bound on the total regrets from all bad epochs. To begin with, we firstly derive an upper bound on the total regrets of any bad epoch p denoted as R [p] : R [p] = N i=1 tp+Ep-1 t=tp r t,i ≤ N i=1 2 + 2 + tp+Ep-2 t=tp+1 r t,i ≤ N i=1 4 + tp+Ep-2 t=tp+1 UCB a t,i (x * t,i ) -h(x t,i ) (c) ≤ N i=1 4 + tp+Ep-2 t=tp+1 UCB a t,i (x t,i ) -h(x t,i ) (d) ≤ N i=1 4 + tp+Ep-2 t=tp+1 2ν T KN √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i ≤ N i=1 4 + 2ν T KN √ κ 0 tp+Ep-2 t=tp min{ √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i , 1} ≤ N i=1 4 + 2ν T KN κ 0 λ tp+Ep-2 t=tp min{ g(x t,i ; θ 0 )/ √ m V -1 t,i , 1} Step (a) follows from simply upper-bounding the regrets of the first and last iteration within this epoch by 2. Step (b) makes use of the validity of UCB a t,i (Lemma 3). Step (c) follows because α = 0, ∀t ∈ [T ] \ {t p } p∈[P ] (i.e., we set α = 0 except for the first iteration of all epochs), which implies that after the first iteration of an epoch, x t,i is selected by only maximizing UCB a t,i (line 7 of Algo. 1). Step (d) again uses Lemma 3, as well as the expression of UCB a t,i . Step (e) is obtained in the same way as steps (b) and (c) in equation 15 (App. C.4.3). Specifically, since ⟨g(x; θ 0 ), g(x; θ 0 )⟩ ≤ κ 0 (App. C.4.3), therefore, √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i ≤ √ κ 0 , which can be proved by following the same steps as equation 14. As a result, if we assume that κ ≥ 1, then √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i = min{ √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i , √ κ 0 } ≤ √ κ 0 min{ √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i , 1}; in the other case where κ 0 < 1, then √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i = min{ √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i , √ κ 0 } ≤ min{ √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i , 1}. Here we have assumed κ 0 ≥ 1 for simplicity, since when κ 0 < 1, the equation above still holds except that we can remove the dependency on √ κ 0 . Step (f ) follows because λ = 1 + 2/T > 1. Next, we derive an upper bound on the inner summation in equation 20. 
tp+Ep-2 t=tp min{ g(x t,i ; θ 0 )/ √ m V -1 t,i , 1} ≤ (E p -1) tp+Ep-2 t=tp min{ g(x t,i ; θ 0 )/ √ m 2 V -1 t,i , 1} ≤ (E p -1)2 log detV tp+Ep-2,i detV tp,i ≤ 2((t p + E p -2) -t last ) log detV tp+Ep-2,i detV last (d) ≤ √ 2D. Step (a) follows from the Cauchy-Schwarz inequality. Step (b) makes use of Lemma 11 of Abbasi-Yadkori et al. (2011) . In step (c), we used the notations of t last = t p -1, V tp,i = V last (this is because in the first iteration t p of an epoch, W new,i = 0 p0×p0 and hence V tp,i = V last = W sync + λI), and V tp+Ep-2,i = V tp+Ep-2,i + g(x t,i ; θ 0 )g(x t,i ; θ 0 ) ⊤ /m, and also used detV tp+Ep-2,i ≤ detV tp+Ep-2,i . To understand step (d), note that the term in step (c): ((t p + E p -2) -t last ) log detV tp +Ep -2,i detVlast is exactly the criterion we use to check whether to start a communication round in iteration t = t p +E p -2 (line 11 of Algo. 1). Since t = t p +E p -2 is not the last iteration in this epoch (i.e., we did not start a communication round after checking this criterion in iteration t = t p + E p -2), therefore, this criterion is not satisfied in iteration t = t p + E p -2, i.e., ((t p + E p -2) -t last ) log detV tp +Ep-2,i detVlast ≤ D, which explains step (d). Next, we can plug equation 21 into equation 20 to obtain: R [p] = N i=1 tp+Ep-1 t=tp r t,i ≤ N i=1 4 + 2ν T KN κ 0 λ √ 2D = 4 + 2ν T KN 2κ 0 λD N, which gives an upper bound on the total regret from any bad epoch. Now recall that as we have discussed in App. C.2, there are no more than R bad epochs (with probability of at least 1 -δ 1 ). Therefore, the total regret of all bad epochs can be upper-bounded by: ) (line 1 of Algo. 1), we can further express the above upper bound on the total regrets from all bad epochs as: R bad T ≤ R 4 + 2ν T KN 2κ 0 λD N ≤ d log(1 + T KN/λ) + 1 4 + 2ν T KN 2κ 0 λD N = O d d √ DN = O ( d) 3/2 √ DN . R bad T = O T N d ( d) 3/2 N = O d √ T N . 
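The communication criterion used in step (d) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the toy matrices are not from the paper's experiments, and the optional `diagonal` flag mimics the diagonal approximation mentioned in App. E by keeping only diagonal entries when computing the determinants.

```python
import numpy as np

def logdet(V, diagonal=False):
    """Log-determinant of a positive definite matrix (optionally diagonalized)."""
    if diagonal:
        return float(np.sum(np.log(np.diag(V))))
    _, value = np.linalg.slogdet(V)
    return float(value)

def should_communicate(V_now, V_last, t, t_last, D, diagonal=False):
    """True when (t - t_last) * log(det V_now / det V_last) exceeds D."""
    gain = logdet(V_now, diagonal) - logdet(V_last, diagonal)
    return (t - t_last) * gain > D

lam, p0 = 1.0, 4
V_last = lam * np.eye(p0)                 # matrix right after a communication round
g = np.array([3.0, 0.0, 0.0, 0.0])        # stand-in feature vector
V_small = V_last.copy()                   # nothing new collected yet
V_large = V_last + np.outer(g, g)         # one informative rank-one update

no_update = should_communicate(V_small, V_last, t=5, t_last=4, D=0.5)
big_update = should_communicate(V_large, V_last, t=6, t_last=4, D=0.5)
print(no_update, big_update)
```

An iteration that is not the last one of its epoch is, by construction, one where this check returned `False`, which is the fact that bounds the summation by $\sqrt{2D}$ in the derivation above.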
C.6 FINAL REGRET UPPER BOUND Here we derive an upper bound on the total cumulative regret by adding up the regrets resulting from all good epochs (App. C.4) and all bad epochs (App. C.5): R T = R good T + R bad T = O d √ T N + d max N √ T + T N ε linear (m, T ) + d √ T N = O d √ T N + d max N √ T + T N ε linear (m, T ) . ( ) This regret upper bound holds with probability of at least 1 -δ 1 -δ 2 -δ 3 -δ 4 -δ 5 -δ 6 . We let δ 3 = δ 4 = δ/3, which leads to the expressions of ν T KN and ν T K given in the main paper (Sec. 3). We let δ 1 = δ 2 = δ 5 = δ 6 = δ/12, and this will only introduce an additional factor of log 12 in the first three conditions on m in App. C.1 which can be absorbed by the constant C. Next, the last term from the upper bound in equation 25 can be further written as: T N ε linear (m, T ) = T N ε linear,1 (m, T ) + ε linear,2 (m, T ) + ε η,J = T N C 1 T 2/3 m -1/6 λ -2/3 L 3 log m + T N C 3 m -1/6 log mL 4 T 5/3 λ -5/3 (1 + T /λ) + T N C 2 (1 -ηmλ) J T L/λ. It can be easily verified that as long as m(log m) -3 ≥ 3 6 C 6 1 T 10 N 6 λ -4 L 18 and m(log m) -3 ≥ 3 6 C 6 3 T 16 N 6 L 24 λ -10 (1 + T /λ) 6 (which are ensured by conditions 5 and 6 on m in App. C.1), then the first and second terms in equation 26 can both be upper-bounded by 1/3. Moreover, if the conditions on η and J presented in App. C.1 are satisfied, i.e., if we choose the learning rate as η = C 4 (mλ + mT L) -1 in which C 4 > 0 is an absolute constant such that C 4 ≤ 1 + T L, and choose J = 1 C4 1 + T L λ log 1 3C2N λ T 3 L = O T L/(λC 4 ) , then the third term in equation 26 can also be upper-bounded by 1/3. As a result, the last term from the upper bound in equation 25 can be upper-bounded by 1, and hence the regret upper bound becomes: R T = O d √ T N + d max N √ T . Worst-Case Regret Upper Bound in Terms of the Maximum Information Gain γ. 
Next, we perform some further analysis of the final regret upper bound derived above, which allows us to inspect the order of growth of our regret upper bound in the worst-case scenario (i.e., without assuming that the effective dimensions are upper-bounded by constants). Recall from Sec. 2 that $\widetilde{d} \le 2\gamma_{TKN}/\log(1+TKN/\lambda)$, $\widetilde{d}_i \le 2\gamma_{TK}/\log(1+TK/\lambda), \forall i \in [N]$, and $\widetilde{d}_{\max} = \max_{i\in[N]}\widetilde{d}_i$. As a result, in our derivations in equation 19 and equation 23, we can replace $\widetilde{d}\log(1+TKN/\lambda)$ by $2\gamma_{TKN}$ and replace $\widetilde{d}_j\log(1+TK/\lambda)$ by $2\gamma_{TK}$, after which the regret upper bound becomes $R_T = \widetilde{O}\big(\gamma_{TKN}\sqrt{TN} + \gamma_{TK}N\sqrt{T}\big)$. The growth rate of the maximum information gain of the NTK has been characterized by previous works: $\gamma_T = \widetilde{O}(T^{\frac{d-1}{d}})$ (Kassraie & Krause, 2022; Vakili et al., 2021). This implies that our regret upper bound can be further expressed as

$$
R_T = \widetilde{O}\Big(K^{\frac{d-1}{d}}(TN)^{\frac{3d-2}{2d}} + K^{\frac{d-1}{d}}T^{\frac{3d-2}{2d}}N\Big) = \widetilde{O}\Big(K^{\frac{d-1}{d}}T^{\frac{3d-2}{2d}}N\Big).
$$

C.7 REGRET UPPER BOUND FOR FN-UCB (LESS COMM.)

Here we explain how the proof above can be modified to derive a regret upper bound for FN-UCB (Less Comm.). To begin with, note that in terms of the regret analysis, the only difference between FN-UCB (Less Comm.) and FN-UCB is that $\text{UCB}^b_{t,i}$ of every agent $i$ is now modified to be $\text{UCB}^b_{t,i}(x) = f(x;\theta_{\text{sync,NN}}) + \nu_{TK}\sqrt{\lambda}\,\|g(x;\theta_0)/\sqrt{m}\|_{V^{-1}_{\text{sync,NN}}}$, in which the matrix $V^{-1}_{\text{sync,NN}}$ is obtained by $V^{-1}_{\text{sync,NN}} = \frac{1}{N}\sum_{i=1}^{N}(V^{\text{local}}_{t,i})^{-1}$. Note that every time the matrix $V^{-1}_{\text{sync,NN}}$ is calculated, we have that $t = t_p - 1$. Firstly, we prove that the modified $\text{UCB}^b_{t,i}$ is also a valid high-probability upper bound on the reward function $h$. To achieve this, all we need to do is to add a few steps to equation 11 in Step 3 of the proof of.
Specifically, we can further analyze equation 11 by: |f (x;θ sync,NN ) -h(x)| ≤ 1 N N i=1 ν T K √ λ g(x; θ 0 )/ √ m (V local i ) -1 + ε linear (m, T ) = ν T K 1 N N i=1 λg(x; θ 0 ) ⊤ (V local i ) -1 g(x; θ 0 )/m + ε linear (m, T ) ≤ ν T K 1 N N i=1 λg(x; θ 0 ) ⊤ (V local i ) -1 g(x; θ 0 )/m + ε linear (m, T ) = ν T K λg(x; θ 0 ) ⊤   1 N N i=1 (V local i ) -1   g(x; θ 0 )/m + ε linear (m, T ) = ν T K λg(x; θ 0 ) ⊤ V -1 sync,NN g(x; θ 0 )/m + ε linear (m, T ) = ν T K √ λ g(x; θ 0 )/ √ m V -1 sync,NN + ε linear (m, T ). The first inequality directly follows from equation 11, and the second inequality results from the concavity of the square root function. In the second last equality, we have plugged in the definition of V -1 sync,NN = 1 N N i=1 (V local t,i ) -1 . As a result, Lemma 4 which guarantees the validity of UCB b t,i can be modified to be: |h(x) -f (x; θ sync,NN )| ≤ ν T K √ λ g(x; θ 0 )/ √ m V -1 sync,NN + ε linear (m, T ), ∀x ∈ X t,i . Secondly, we will need the following auxiliary inequality for agent i and iteration t in a good epoch p ∈ E good : √ λ g(x t,i ; θ 0 )/ √ m V -1 sync,NN = λg(x t,i ; θ 0 ) ⊤ V -1 sync,NN g(x t,i ; θ 0 )/m = λg(x t,i ; θ 0 ) ⊤ 1 N N j=1 (V local j ) -1 g(x t,i ; θ 0 )/m = 1 N N j=1 λg(x t,i ; θ 0 ) ⊤ (V local j ) -1 g(x t,i ; θ 0 )/m ≤ 1 √ N N j=1 λg(x t,i ; θ 0 ) ⊤ (V local j ) -1 g(x t,i ; θ 0 )/m ≤ 1 √ N N j=1 √ λ g(x t,i ; θ 0 )/ √ m (V local j ) -1 . The first inequality is because √ a + b ≤ √ a + √ b. Thirdly, we need to modify the proof of the regret upper bound for good epochs (App. C.4). 
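The two square-root inequalities used above (concavity of the square root, and $\sqrt{a+b} \le \sqrt{a}+\sqrt{b}$) sandwich the confidence width computed with the averaged inverse matrix $V^{-1}_{\text{sync,NN}}$ between two aggregates of the agents' local widths. A minimal numerical sketch, assuming random matrices as stand-ins for the agents' local matrices and hypothetical helper names:

```python
import numpy as np

def local_widths(g, V_locals, lam, m):
    """sqrt(lam) * ||g/sqrt(m)|| in each agent's (V_local)^{-1} norm."""
    return np.array([np.sqrt(lam * g @ np.linalg.solve(V, g) / m) for V in V_locals])

def sync_width(g, V_locals, lam, m):
    """Same width, but with V_sync_inv = average of the local inverse matrices."""
    V_sync_inv = sum(np.linalg.inv(V) for V in V_locals) / len(V_locals)
    return np.sqrt(lam * g @ V_sync_inv @ g / m)

rng = np.random.default_rng(3)
p0, m, lam, num_agents = 8, 20, 1.0, 5           # toy sizes (assumptions)
V_locals = []
for _ in range(num_agents):
    J = rng.normal(size=(p0, 7))
    V_locals.append(J @ J.T / m + lam * np.eye(p0))
g = rng.normal(size=p0)                           # stand-in for g(x; theta_0)

widths = local_widths(g, V_locals, lam, m)
sync = sync_width(g, V_locals, lam, m)
lower = widths.mean()                             # (1/N) * sum of local widths
upper = widths.sum() / np.sqrt(num_agents)        # (1/sqrt(N)) * sum of local widths
print(lower, sync, upper)
```

The lower side is what keeps the modified $\text{UCB}^b_{t,i}$ valid, and the upper side is the source of the change from $1/N$ to $1/\sqrt{N}$ in the regret analysis of FN-UCB (Less Comm.).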
Specifically, we can derive an upper bound on the instantaneous regret r t,i = h(x * t,i ) -h(x t,i ) for agent i and iteration t in a good epoch p ∈ E good (in a similar way to equation 13): r t,i = h(x * t,i ) -h(x t,i ) ≤ α 2ν T K √ λ g(x t,i ; θ 0 )/ √ m V -1 sync,NN + ε linear (m, T ) + (1 -α) 2ν T KN √ λ g(x t,i ; θ 0 )/ √ m V -1 t,i + αε linear (m, T ) ≤ α 2ν T K 1 √ N N j=1 √ λ g(x t,i ; θ 0 )/ √ m (V local j ) -1 + ε linear (m, T ) + (1 -α) 2ν T KN √ eλ g(x t,i ; θ 0 )/ √ m V -1 t,i + αε linear (m, T ) = α2ν T K 1 √ N N j=1 √ λ g(x t,i ; θ 0 )/ √ m (V local j ) -1 + (1 -α)2ν T KN √ eλ g(x t,i ; θ 0 )/ √ m V -1 t,i + 2αε linear (m, T ) ≜ (1 -α)2ν T KN √ e σ t,i (x t,i ) + α2ν T K 1 √ N N j=1 σ local tp-1,j (x t,i ) + 2αε linear (m, T ). In the first inequality, we have made use of equation 30 which ensures the validity of the modified UCB b t,i as a high probability upper bound on h. The second inequality follows from equation 31. In the last equality, we have defined σ local tp-1,j (x t,i ) in the same way as equation 13. The steps regarding the term involving (1 -α) are the same as those from equation 13. As a result, the only change we have made to instantaneous regret upper bound from equation 13 is that in the second term, we have replaced 1 N by 1 √ N . Further propagating this change through the proof for the regret upper bound for all good epochs (App. C.4.4 and App. C.4.5), we have that: R good T = N i=1 t∈T good r t,i = O d √ T N + d max N 3/2 √ T + T N ε linear (m, T ) . Lastly, also note that the regret upper bound for the bad epochs (i.e., the proof in App. C.5) remains unchanged. Therefore, the final regret upper bound for FN-UCB (Less Comm.) is  R T = R good T + R bad T = O d √ T N + d max N 3/2 √ T + T N ε linear (m, T ) + d √ T N = O d √ T N + d max N 3/2 √ T + T N ε linear (m, T ) = O d √ T N + d max N 3/2 √ T . P -1 p=0 log detV p+1 detV p ≤ R ′ ≤ R, with probability of at least 1 -δ 1 ≥ 1 -δ. 
Therefore, there can be at most ⌈ R D/ζ ⌉ = ⌈ Rζ D ⌉ such epochs whose length is smaller than ζ. As a result, the total number of epochs can be upper-bounded by:  ⌈T /ζ⌉ + ⌈ Rζ D ⌉ = O T R D . √ N ) = O γ T KN √ N = O (T KN ) d-1 d √ N = O(T d-1 d K d-1 d N 3d-2 ), which is still sub-linear in T even in the worst case. The proof here, and hence Theorem 2, makes use of Lemma 1. Therefore, we only need condition 1 on m listed in App. C.1 to hold, and do not require any condition on η and J.

E MORE EXPERIMENTAL DETAILS

Our code can be found at: https://github.com/daizhongxiang/Federated-Neural-Bandits. Some of the experimental details (e.g., the number of layers and the width m of the NN used in every experiment) are already described in the main text (Sec. 5). Following the works of Zhang et al. (2021) and Zhou et al. (2020), when training the NN (line 14 of Algo. 1) for agent i, we use the NN parameters resulting from the last gradient descent training of agent i (instead of θ 0 ) as the initial parameters, in order to accelerate the training procedure. Every time we train an NN, we use stochastic gradient descent to train the NN for 30 iterations with a learning rate of 0.01. To save computational cost, we stop training the NNs after 2000 iterations, i.e., after 2000 iterations, all NN parameters are no longer updated. Also to reduce the computational cost, when checking the criterion in line 11 of Algo. 1, we diagonalize (i.e., only keep the diagonal elements of) the two matrices for which we need to calculate the determinant. Our experiments are run on a server with 96 CPUs, an NVIDIA A100 GPU with 40GB of memory, and 256GB of RAM, running the Ubuntu system. The shuttle dataset is publicly available at https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle) and contains no personally identifiable information or offensive content. It includes 58000 instances, has an input dimension of d = 9 and contains K = 7 classes/arms. As a result, according to the way in which the contexts are constructed (Sec. 5.2), every context feature vector has a dimension of 9 × 7 = 63. The magic telescope dataset is publicly available at https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope and contains no personally identifiable information or offensive content. The dataset contains 19020 instances, has an input dimension of d = 10 and K = 2 classes/arms. As a result, every context feature vector has a dimension of 10 × 2 = 20.
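The context dimensions quoted above (9 × 7 = 63 and 10 × 2 = 20) are consistent with one common way of turning a classification input into K arm contexts, in which the input is copied into the k-th block of an otherwise all-zero vector. The sketch below assumes this construction; the details of Sec. 5.2 are not reproduced in this appendix, and the function name `build_arm_contexts` is hypothetical.

```python
import numpy as np

def build_arm_contexts(x, num_arms):
    """Return a (num_arms, num_arms * d) array; row k holds x in its k-th block."""
    d = x.shape[0]
    contexts = np.zeros((num_arms, num_arms * d))
    for k in range(num_arms):
        contexts[k, k * d:(k + 1) * d] = x
    return contexts

shuttle_input = np.arange(1.0, 10.0)    # a made-up 9-dimensional input
magic_input = np.arange(1.0, 11.0)      # a made-up 10-dimensional input

shuttle_contexts = build_arm_contexts(shuttle_input, num_arms=7)
magic_contexts = build_arm_contexts(magic_input, num_arms=2)
print(shuttle_contexts.shape, magic_contexts.shape)
```

Under this assumed construction, pulling arm k corresponds to predicting class k, so each of the K contexts differs only in where the input is placed.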
When comparing with Linear UCB, Linear TS, Kernelized UCB and Kernelized TS, we follow the work of Zhang et al. (2021) to set λ = 1 and perform a grid search within ν ∈ {1, 0.1, 0.01}. The results showing comparisons with these algorithms, for both the synthetic experiments (Sec. 5.1) and real-world experiments (Sec. 5.2), are presented in Fig. 4. The figures show that both linear and kernelized contextual bandit algorithms are outperformed by neural contextual bandit algorithms, which is consistent with the observations from Zhang et al. (2021) and Zhou et al. (2020). We have additionally evaluated the empirical impact of the technique of diagonalization of the matrices (Sec. 3.4), using the shuttle dataset and a fixed width of m = 20 for the NN. The results (Fig. 5) show that for the same width of the NN, the technique of diagonalization indeed results in worse performances. However, also note that diagonalization allows us to afford a larger value of m in a computationally feasible way, which can lead to better performances than using a smaller m without diagonalization. This is corroborated by our empirical results in Fig. 2a and 2c, because the regrets in Fig. 2c (m = 50, with diagonalization) are in general smaller than the regrets in Fig. 2a (m = 20, without diagonalization), and the computational cost of Fig. 2c (244.9 seconds) is smaller than that of Fig. 2a (361.8 seconds). Furthermore, using m = 50 without diagonalization would incur a significantly larger computational cost (3134.3 seconds). These results demonstrate the practical usefulness of diagonalization. We have also visualized the empirical scaling of the final average cumulative regret (after 5000 iterations) in terms of the number N of agents, using the cosine and square experiments. The results (Fig. 6) demonstrate that the average cumulative regret (averaged across all N agents) is indeed decreasing as the number N of agents increases.

F EXTENDED ANALYSIS FOR THE GENERAL ALGORITHM

Recall that it has been mentioned at the beginning of Sec. 4.1 that our main regret analysis (Theorem 1) has focused on a simpler version of our FN-UCB algorithm, in which we only choose the value of α using the method described in Sec. 3.3 in the first iteration of every epoch and set α = 0 in the other iterations. Here, we show how our regret analysis can be extended to derive a regret upper bound for the general FN-UCB algorithm, in which we choose α using the method described in Sec. 3.3 in every iteration, i.e., we do not set α = 0 in any iteration. To achieve this, we need an additional assumption of an upper bound on the amount of new information collected by every agent i in every epoch p. Specifically, we assume that

$$
\frac{\det V^{\text{local}}_{t_p+E_p-2,i}}{\det V^{\text{local}}_{t_p-1,i}} \le D, \quad \forall i \in [N],\ p \in [P] \tag{37}
$$

for a constant $D \ge 1$. This can in fact be viewed as an additional property of the sequence of contexts for each agent. Intuitively, if the contexts for each agent are received in such an order that similar contexts also arrive in similar iterations, then the constant D is likely to be small. This can be seen as a "stationarity" property of the sequence of contexts, which is reasonable in many practical scenarios. For example, in a healthcare application, the patients arriving within the same time period are likely to have similar characteristics due to factors such as the local transmission of a seasonal flu. In addition, another scenario where D is likely to be small is when every agent has some previously observed offline contexts before running our algorithm. If these offline contexts have a good coverage of the space of contexts, then conditioned on these offline contexts, the newly collected information by every agent in every epoch is highly likely to be small. With this additional assumption, the most important step in the proof that we need to modify is the proof in Appendix C.4.4, in which we proved an upper bound on the sum of the second term of equation 13.
To begin with, ∀t = t p , . . . , t p + E p -1, we have that σ local tp-1,j (x t,i ) = √ λ g(x t,i ; θ 0 )/ √ m (V local tp -1,j ) -1 = λg(x t,i ; θ 0 ) ⊤ (V local tp-1,j ) -1 g(x t,i ; θ 0 )/m (b) ≤ λg(x t,i ; θ 0 ) ⊤ (V local t-1,j ) -1 g(x t,i ; θ 0 )/m detV local t-1,j detV local tp-1,j ≤ λg(x t,i ; θ 0 ) ⊤ (V local t-1,j ) -1 g(x t,i ; θ 0 )/m detV local tp+Ep-1-1,j detV local tp-1,j ≤ λg(x t,i ; θ 0 ) ⊤ (V local t-1,j ) -1 g(x t,i ; θ 0 )/mD (e) = D σ local t-1,j (x t,i ). (38) Step (39) Step (a) follows from the same reasoning as step (c) of equation 16, step (b) has made use of equation 38, and all other steps follow the same corresponding steps of equation 16. As a result, by comparing the modified equation 39 with the original equation 16, the only modification to the result in equation 16 is the additional multiplicative term of √ D. Therefore, after propagating this modification to all the analysis in Appendix C.4.4, we have that a multiplicative term of √ D will also be introduced into equation 18. Subsequently, the upper bound on the total regrets from all good epochs (i.e., equation 19) will be correspondingly modified to be:  Step (a) follows from Lemma 3 (i.e., the validity of UCB a t,i ) and Lemma 4 (i.e., the validity of UCB b t,i ). Step (b) results from the way in which x t,i is selected (line 7 of Algo. 1): x t,i = arg max x∈Xt,i (1 -α)UCB a t,i (x) + αUCB b t,i (x). For step (c), the term A is obtained by upperbounding the regrets of the first and last iteration within this epoch by 2 and using the fact that α ≤ 1. As a result, we can follow the same steps of analysis in App. C.4.4 (after making the modification using equation 39; note that the analysis in App. C.4.4 is applicable to both good and bad epochs) to show that after summing over all bad epochs, the term B can be upper-bounded by O √ D d max N √ T + T N ε linear (m, T ) . 
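The determinant-ratio step of equation 38 (the step attributed to Lemma 12 of Abbasi-Yadkori et al. (2011)) can be sanity-checked numerically: a standard deviation computed with an older, less informed matrix exceeds the one computed with a newer matrix by at most the square root of the ratio of their determinants. The sketch below assumes random features as stand-ins for the gradient features; the names and sizes are illustrative only.

```python
import numpy as np

def width(g, V, lam, m):
    """sqrt(lam) * ||g/sqrt(m)|| in the V^{-1} norm."""
    return np.sqrt(lam * g @ np.linalg.solve(V, g) / m)

rng = np.random.default_rng(2)
p0, m, lam = 10, 20, 1.0                          # toy sizes (assumptions)
J_old = rng.normal(size=(p0, 6))                  # observations before the epoch
J_new = rng.normal(size=(p0, 4))                  # observations added during the epoch
V_old = lam * np.eye(p0) + J_old @ J_old.T / m
V_new = V_old + J_new @ J_new.T / m
g = rng.normal(size=p0)                           # stand-in for g(x; theta_0)

det_ratio = np.exp(np.linalg.slogdet(V_new)[1] - np.linalg.slogdet(V_old)[1])
stale = width(g, V_old, lam, m)                   # conditioned on older data
fresh = width(g, V_new, lam, m)                   # conditioned on newer data
print(fresh, stale, np.sqrt(det_ratio) * fresh)
```

When the determinant ratio is bounded by $D$ as in the additional assumption, the stale width is at most $\sqrt{D}$ times the fresh one, which is where the extra multiplicative $\sqrt{D}$ in the general regret bound comes from.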
Next, combining the upper bounds on both A and B (after summing across all bad epochs), we have that for the general algorithm, the total regrets from all bad epochs can be upper-bounded by $R^{\text{bad}}_T = \widetilde{O}\big(\widetilde{d}\sqrt{TN} + \sqrt{D}\,\widetilde{d}_{\max}N\sqrt{T} + TN\varepsilon_{\text{linear}}(m,T)\big)$, which is in fact the same as the upper bound on the total regrets from all good epochs which we have derived in equation 40. Finally, following the same analysis in App. C.6, we can show that the final regret upper bound for the general algorithm, in which we do not set α = 0 in any iteration, is

$$
R_T = \widetilde{O}\big(\widetilde{d}\sqrt{TN} + \sqrt{D}\,\widetilde{d}_{\max}N\sqrt{T}\big). \tag{44}
$$

Note that compared to our regret upper bound from Theorem 1, the regret upper bound for the general algorithm (i.e., when we choose the value of α using the method from Sec. 3.3 in every iteration) only includes an additional multiplicative term of $\sqrt{D}$ in the second term. Of note, when communication indeed occurs after each iteration (i.e., $E_p = 1$ for every epoch p), we have that $D = 1$ because $\det V^{\text{local}}_{t_p+E_p-2,i}/\det V^{\text{local}}_{t_p-1,i} = 1$ (equation 37). In this case, the version of our algorithm analyzed in Theorem 1 becomes the same as our general algorithm (Sec. 4.1), and interestingly, the regret upper bound of our general algorithm (equation 44) also becomes the same as Theorem 1 because $D = 1$.



The $\widetilde{O}$ notation ignores all logarithmic factors.
The first (last) epoch is between a communication round and the beginning (end) of the FN-UCB algorithm.
Formally, $\widetilde{\sigma}^{\text{local}}_{t,i}(x)$ is the Gaussian process posterior standard deviation at $x$ conditioned on the local observations of agent $i$ until iteration $t$, computed using the kernel $k(x,x') = g(x;\theta_0)^\top g(x';\theta_0)/m$.
Since diagonalization increases the scale of the first term in $\text{UCB}^a_{t,i}$, we use a heuristic to rescale the values of this term for all contexts such that the min and max values (among all contexts) are 0 and 1, respectively, after rescaling.



Figure 1: Cumulative regret with varying number of agents for the (a) cosine function and (b) square function. (c) Illustration of the importance of different components of our FN-UCB algorithm (cosine function). (d) Performances with different values of D (cosine function). The average number of rounds of communications are 348.0, 380.0, 456.7 for D = 5, 4, 2.5, respectively.

Figure 2: Results (m = 20) for (a) shuttle and (b) magic telescope. (c) Results for shuttle with diagonal approximation (m = 50). (d) Results for shuttle (with diagonal approximation) with different values of D. The average number of communication rounds are 3850.7, 4442.7, 4906.3 for D = 0.05, 0.03, 0.01, respectively.

aimed to reduce the memory requirement of Neural UCB, Lisicki et al. (2021) performed an empirical investigation of neural bandit algorithms to verify their practical effectiveness, Ban et al. (2022) adopted a separate NN for exploration in neural contextual bandits, Ban & He (2021) applied the convolutional NTK, Jia et al. (2021) used perturbed rewards to train the NN to remove the need for explicit exploration, Nguyen-Tang et al. (2022) incorporated offline policy learning into neural contextual bandits, Zhu et al. (2021a) studied pure exploration in kernel and neural bandits, Kassraie et al. (2022) applied graph NNs in neural bandits to handle graph-structured data, Salgia et al. (2022) extended neural bandits beyond the ReLU activation to consider smoother activation functions, and Dai et al. (2022) introduced a scalable batch Neural TS algorithm through sample-then-optimize optimization.

1 and C.4 of Zhang et al. (2021), we can follow similar steps as equation C.3 of Zhang et al. (2021) (in Appendix C.2 of Zhang et al. (

Lemma 4.8 of Zhang et al. (2021) (in Appendix B.7 of Zhang et al. (2021)). Step (f ) follows from Lemma 1 and hence holds with probability of at least 1 -δ 1 . The last equality simply plugs in the definition of the effective dimension d (Sec. 2). C.4.4 UPPER BOUND ON THE SUM OF THE SECOND TERM OF EQUATION 13

√ κ 0 . Step (b) makes use of the Cauchy-Schwarz inequality. Step (c), similar to step (e) of equation 15, is derived following the proof of Lemma 4.8 of Zhang et al. (2021) (in Appendix B.7 of Zhang et al. (2021)).

In the second last equality, we have used ν T KN = O( d). By choosing D = O( T N d

Recall that R = O( d) (App. C.2). Therefore, with probability of at least 1 -δ 1 ≥ 1 -δ, the total number of epochs can be upper-bounded by O( T d D ).Since we have chosenD = O( T N d) (line 1 of Algo. 1), therefore, the total number of epochs can be upper-bounded by O( Now we can further make use of the relationship between d and γ T KN : d ≤ 2γ T KN / log(1 + T KN/λ), which allows us to show that the worst-case communication complexity is upper-bounded by: O( d

Figure 5: Comparison between the performances without (yellow) and with (red) diagonalization, using m = 20 with the shuttle dataset. The results show that using an NN with the same width m = 20, diagonalization indeed deteriorates the performances.

Figure 6: The scaling of the final average cumulative regret after 5000 iterations (averaged across all N agents) in terms of the number N of agents, using the cosine and square experiments. The results correspond to Fig. 1a and Fig. 1b, respectively.

(a) has made use of the definition of σ local tp-1,j (x t,i ) (see the paragraph below equation 13), step (b) results from Lemma 12 of Abbasi-Yadkori et al. (2011), step (c) follows because V local tp+Ep-1-1,jcontains more information than V local t-1,j ∀t = t p , . . . , t p + E p -1, step (d) follows from equation 37, and step (e) has again made use of the definition of σ local tp-1,j (x t,i ). Published as a conference paper at ICLR 2023 Using equation 38, we can modify the proof in equation 16 (Appendix C.4.4): j (x t,j ).

D d max N √ T + T N ε linear (m, T ). (40)

Next, we also need to modify the proof of the upper bound on the total regret from all bad epochs (Appendix C.5). Following the roadmap of Appendix C.5, we start by upper-bounding the total regret from a particular bad epoch p:

Next, we can separately analyze the terms A and B in equation 41. Firstly, note that the term A is the same as step (c) of equation 20; therefore, we can follow the same steps of analysis as in App. C.5 (i.e., equations 20 to 24) to show that, after summing the term A across all bad epochs, we get an upper bound of $O(\tilde{d}\sqrt{TN})$. Secondly, for the term B, we can follow steps similar to the analysis in equation 13 to show that every term inside the square bracket of the term B is upper-bounded by the last two terms in equation 13. That is,

$$\alpha\big(\mathrm{UCB}^{b}_{t,i}(x_{t,i}) - h(x_{t,i})\big) + \alpha\varepsilon_{\text{linear}}(m, T) \leq 2\alpha\nu_{TK}\,[\ldots]_{j}(x_{t,i}) + 2\alpha\varepsilon_{\text{linear}}(m, T). \quad (42)$$

of Chowdhury & Gopalan (2017), can be directly applied in our setting and introduces an error probability of $\delta_3$ (which appears in the expression of $\nu_{TKN}$). As a result, Lemma 3 holds with probability of at least $1 - \delta_1 - \delta_2 - \delta_3$, in which the error probabilities come from Lemma 1 ($\delta_1$), Lemma 2 ($\delta_2$) and the application of Lemma B.6 of Zhang et al. (2021) ($\delta_3$).

is updated (i.e., the last iteration in every epoch) as $\mathcal{T}_{-1} \triangleq \{t_p - 1\}_{p=2,\ldots,P-1}$, which immediately implies that $\mathcal{T}_{-1} \subset [T]$ and hence $|\mathcal{T}_{-1}| \leq T$.

Lemma 4. Let $\delta_4, \delta_5 \in (0, 1)$, and ν

The required lower bounds on m by the local variants of Lemmas B.3 and B.7 (required in the proof here) are smaller than those given in Lemmas 2 and 1 and hence do not need to appear in the conditions in Appendix C.1. By letting the sum of the three error probabilities (resulting from the applications of Lemmas B.3, B.6 and B.7 of Zhang et al. (

In this section, we derive an upper bound on the communication complexity (i.e., the total number of communication rounds) of our FN-UCB algorithm (including its variant FN-UCB (Less Comm.)). Define $\zeta \triangleq \sqrt{DT/R}$. Since the lengths of all epochs sum to at most T, there can be at most $\lceil T/\zeta \rceil$ epochs whose length is larger than $\zeta$. Next, we derive an upper bound on the number of epochs whose length is smaller than $\zeta$.
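The first counting step above can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: it takes the threshold ζ as a plain input, draws random partitions of T iterations into epochs, and verifies that the number of epochs longer than ζ never exceeds ⌈T/ζ⌉. The helper names (`count_long_epochs`, `random_epoch_lengths`, `check_long_epoch_bound`) and the numerical values are hypothetical and chosen only for illustration.

```python
import math
import random


def count_long_epochs(epoch_lengths, zeta):
    """Count the epochs whose length is strictly larger than zeta."""
    return sum(1 for length in epoch_lengths if length > zeta)


def random_epoch_lengths(T, rng):
    """Split T iterations into consecutive epochs of random positive length.

    This is an illustrative random partition, not the epoch schedule
    produced by the communication criterion of the algorithm.
    """
    lengths = []
    remaining = T
    while remaining > 0:
        length = rng.randint(1, remaining)
        lengths.append(length)
        remaining -= length
    return lengths


def check_long_epoch_bound(T, zeta, trials=200, seed=0):
    """Return True if every sampled partition has at most ceil(T / zeta)
    epochs longer than zeta."""
    rng = random.Random(seed)
    bound = math.ceil(T / zeta)
    for _ in range(trials):
        lengths = random_epoch_lengths(T, rng)
        if count_long_epochs(lengths, zeta) > bound:
            return False
    return True


# Hypothetical values, for illustration only.
print(check_long_epoch_bound(T=5000, zeta=111.8))
```

The check passes for any positive ζ, because k epochs that are each longer than ζ already account for more than kζ of the T iterations. The part of the argument that depends on the communication criterion is the bound on the number of short epochs, which this sketch does not cover.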

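$$[\ldots]\ \alpha\,\mathrm{UCB}^{b}_{t,i}(x_{t,i}) + \alpha\varepsilon_{\text{linear}}(m, T) + (1-\alpha)\,\mathrm{UCB}^{a}_{t,i}(x_{t,i}) - h(x_{t,i})$$
$$= \alpha\big(\mathrm{UCB}^{b}_{t,i}(x_{t,i}) - h(x_{t,i})\big) + (1-\alpha)\big(\mathrm{UCB}^{a}_{t,i}(x_{t,i}) - h(x_{t,i})\big) + \alpha\varepsilon_{\text{linear}}(m, T)$$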

ACKNOWLEDGMENTS

This research/project is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Industry Alignment Fund - Pre Positioning (IAF-PP) (Award A19E4a0101).

