LEARNING KERNELIZED CONTEXTUAL BANDITS IN A DISTRIBUTED AND ASYNCHRONOUS ENVIRONMENT

Abstract

Despite recent advances in communication-efficient distributed bandit learning, most existing solutions are restricted to parametric models, e.g., linear bandits and generalized linear bandits (GLB). In comparison, kernel bandits, which search for non-parametric functions in a reproducing kernel Hilbert space (RKHS), offer higher modeling capacity. However, the only existing work on distributed kernel bandits adopts a synchronous communication protocol, which greatly limits its practical use (e.g., every synchronization step requires all clients to participate and wait for data exchange). In this paper, to improve robustness against the delays and unavailability of clients that are common in practice, we propose the first asynchronous solution for distributed kernel bandit learning, based on approximated kernel regression. A set of effective treatments is developed to ensure approximation quality and communication efficiency. Rigorous theoretical analysis of the regret and communication cost is provided, and extensive empirical evaluations demonstrate the effectiveness of our solution.

1. INTRODUCTION

There are many application scenarios where an environment repeatedly provides a learner with a set of candidate actions to choose from, possibly together with some side information, a.k.a. context (Li et al., 2010a;b; Durand et al., 2018); the learner, whose goal is to maximize cumulative reward over time, can only observe the reward corresponding to the chosen action. This is often modeled as a bandit learning problem (Abbasi-Yadkori et al., 2011; Krause & Ong, 2011), which exemplifies the well-known exploitation-exploration dilemma (Auer, 2002). Various modeling assumptions have been made about the relation between the context of each action and its expected reward. Compared with parametric bandits, such as linear and generalized linear bandits (Abbasi-Yadkori et al., 2011; Filippi et al., 2010), kernel/Gaussian process bandits (Valko et al., 2013; Srinivas et al., 2009) offer greater flexibility, as they search for non-parametric functions lying in an RKHS. They have thus become a powerful tool for optimizing black-box functions based on noisy observations in various applications, such as recommender systems (Vanchinathan et al., 2014), mobile health (Tewari & Murphy, 2017), environment monitoring (Srinivas et al., 2009), automatic machine learning (Li et al., 2017), cyber-physical systems (Lizotte et al., 2007; Li et al., 2016), etc. Motivated by the rapid growth in affordability and availability of hardware resources, e.g., computer clusters and IoT devices, there is increasing interest in distributing such learning tasks, which gives rise to the recent research efforts in distributed bandits (Wang et al., 2019; Huang et al., 2021; Li & Wang, 2022a;b; Li et al., 2022; He et al., 2022), where N clients collaboratively maximize the overall cumulative reward over time T.
As communication bandwidth is the key bottleneck in many distributed applications (Huang et al., 2013), these studies emphasize communication efficiency, i.e., incurring sub-linear communication cost with respect to time T while attaining near-optimal regret. However, most of these works are restricted to simple parametric models, like linear bandits (Wang et al., 2019; Huang et al., 2021; Li & Wang, 2022a; He et al., 2022) or GLB (Li & Wang, 2022b). The only exception is Li et al. (2022), who proposed the first algorithm for distributed kernel bandits with sub-linear communication cost. They achieved this via a Nyström embedding function (Nyström, 1930) shared among all clients, such that the clients only need to transfer the embedded statistics for joint kernelized estimation. Nevertheless, in their algorithm, the update of the Nyström embedding function, as well as the communication of the embedded statistics, relies on a synchronization round that requires the participation of all clients. As is widely recognized in distributed optimization (Low et al., 2012; Xie et al., 2019; Lian et al., 2018; Chen et al., 2020; Lim et al., 2020) and distributed bandit learning (Li & Wang, 2022a; He et al., 2022), this design is vulnerable to stragglers (i.e., slower clients) in the system: the update procedure of Li et al. (2022) is paused until the slowest client responds. Due to device heterogeneity and network unreliability, this situation is unfortunately common, especially at the scale of hundreds of devices/clients. Thus, asynchronous communication is preferred, as the server can readily perform a model update whenever communication from a client is received, which is more robust against stragglers. The main bottleneck in addressing this limitation of Li et al. (2022) lies in computing the Nyström approximation under asynchronous communication.
Specifically, during each synchronization step, their algorithm first samples a small set of representative data points (i.e., the dictionary) from all clients, and then lets each client project its local data onto the subspace spanned by this dictionary and share statistics about the projected data with others. However, new challenges arise in both algorithmic design and theoretical analysis when extending their solution to asynchronous communication: a 'fresh' re-sample from the data of all clients is no longer possible, and each client holds a different copy of the dictionary due to asynchronous communication with the server, so their local data will be projected onto different subspaces, which causes difficulty in joint kernel estimation. In this paper, we address these challenges and propose the first asynchronous algorithm for distributed kernelized contextual bandits. Compared with prior works in distributed bandits, our algorithm simultaneously enjoys the modeling capacity of non-parametric models and improved robustness against delays and unavailability of clients, making it suitable for a wider range of applications. To ensure the approximation quality and compactness of the constructed dictionary under asynchronous communication, we design an incremental update procedure tailored to our problem setting with a variant of Ridge leverage score (RLS) sampling. Compared with the sampling procedures in prior works (Li et al., 2022; Calandriello et al., 2020), this requires specialized treatment in the analysis, since the quality of the current dictionary now relies on all previous asynchronous communications. Moreover, to enable joint kernel estimation, we perform transformations on the server side to convert statistics from different clients into a common subspace, which to the best of our knowledge is also new in the bandit literature. We rigorously prove that the proposed algorithm incurs an Õ(N^2 γ_T^3) communication cost, matching that of Li et al. (2022), where γ_T is the maximum information gain, while still attaining the optimal O(√(T γ_T)) regret.

2. RELATED WORKS

There have been increasing research efforts in distributed bandit learning in recent years, where multiple agents collaborate in pure exploration (Hillel et al., 2013; Tao et al., 2019; Du et al., 2021) or regret minimization (Wang et al., 2019; Li & Wang, 2022a;b). These works mainly differ in the relations among the learning problems solved by the agents (i.e., homogeneous vs. heterogeneous) and the type of communication network (i.e., peer-to-peer (P2P) vs. star-shaped). However, most of these works assume linear reward functions, with clients communicating by transferring the O(d^2) sufficient statistics; for example, Korda et al. (2016) consider linear bandits over a P2P network, and Wang et al. (2019) propose event-triggered communication protocols to obtain sub-linear communication cost over time for distributed linear bandits with a time-varying arm set. In particular, Li & Wang (2022a) first considered the asynchronous communication setting for distributed bandit learning. Though their proposed algorithm avoids global synchronization (Wang et al., 2019), it still involves downloads to inactive clients. He et al. (2022) further improved the algorithm design and analysis, such that only the active client in each round needs to participate in communication. In comparison, distributed kernelized contextual bandits remain under-explored. Prior work in this direction assumes a local communication setting (Dubey et al., 2020), where each agent immediately shares a new raw data point with its neighbors after each interaction, so the communication cost is still linear over time. A recent work by Li et al. (2022) addresses this issue by letting clients communicate via statistics computed using a shared Nyström embedding function (Calandriello et al., 2019; 2020). However, though their proposed algorithm attains sub-linear communication cost over time, it relies on a global synchronization operation, similar to that of Wang et al. (2019), to update the embedding function and share the embedded statistics. In comparison, our proposed method effectively addresses this issue using a novel asynchronous update procedure for the embedding function, making asynchronous kernel bandit learning possible.

3.1. PROBLEM FORMULATION

We consider a learning system consisting of (1) N clients that directly interact with the environment by taking actions and receiving the corresponding rewards, and (2) a central server that coordinates the communication among the clients to facilitate their learning. At each time step t ∈ [T], an arbitrary client i_t ∈ [N] becomes active: it observes the candidate arm set A_t, selects an arm x_t ∈ A_t, and receives the corresponding reward y_t = f(x_t) + η_t ∈ R. Note that A_t is time-varying and assumed to be chosen by an oblivious adversary, f denotes the unknown reward function shared by all clients, and η_t denotes the noise. Moreover, under the asynchronous communication scheme considered in this paper, only the active client i_t is allowed to communicate with the server, e.g., to send or receive updates, after its interaction at time step t.

Kernelized Reward Function. Following Valko et al. (2013), we assume the unknown reward function f lies in an RKHS, denoted as H, such that the reward can be equivalently written as y_t = θ_⋆^⊤ ϕ(x_t) + η_t, where θ_⋆ ∈ H is an unknown parameter vector and ϕ : R^d → H is a known feature map associated with H. We assume that η_t is zero-mean R-sub-Gaussian conditioned on σ({i_s, x_s, η_s}_{s∈[t−1]}, i_t, x_t), i.e., the σ-algebra generated by the previous clients, their pulled arms, the corresponding noises, and the current client and arm. In addition, there exists a positive definite kernel k(·,·) associated with H, and we assume for all x ∈ A := ∪_{t∈[T]} A_t that ∥x∥_k ≤ L and ∥f∥_k ≤ S for some L, S > 0.
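For concreteness, the interaction protocol above can be sketched as a minimal NumPy simulation. The specific reward function, noise level, and arm-set distribution below are illustrative assumptions, not part of the formal setup, and the arm-selection policy is a placeholder for the algorithm developed later:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T = 5, 4, 50
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

def f(x):
    # illustrative stand-in for the unknown shared reward function f in the RKHS
    return float(np.cos(3 * x @ theta_star))

history = []
for t in range(T):
    i_t = int(rng.integers(N))                  # an arbitrary client becomes active
    A_t = rng.normal(size=(20, d))              # time-varying candidate arm set
    A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)
    x_t = A_t[int(rng.integers(len(A_t)))]      # placeholder policy (random pull)
    y_t = f(x_t) + 0.1 * float(rng.normal())    # y_t = f(x_t) + eta_t
    history.append((i_t, x_t, y_t))
```

Only the active client i_t is allowed to exchange messages with the server after its interaction; the rest of the paper concerns what to exchange and when.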

Regret and Communication Cost

The learning system's goal is to minimize the cumulative (pseudo) regret for all N clients, i.e., R_T = Σ_{t=1}^T r_t, where r_t = max_{x∈A_t} ϕ(x)^⊤ θ_⋆ − ϕ(x_t)^⊤ θ_⋆. Meanwhile, the system also needs to keep the communication cost C_T low, which is measured by the total number of scalars transferred across the system up to time T.
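For intuition, the regret bookkeeping can be sketched with a linear feature map ϕ(x) = x (an illustrative simplification; the paper's setting allows a general RKHS feature map) and a uniformly random policy:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 5, 200
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

regret = 0.0
for t in range(T):
    A_t = rng.normal(size=(10, d))
    A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)
    rewards = A_t @ theta_star                 # phi(x)^T theta_star with phi = identity
    chosen = int(rng.integers(10))             # a (deliberately bad) random policy
    regret += rewards.max() - rewards[chosen]  # instantaneous regret r_t >= 0
```

A good bandit algorithm drives the per-step regret to zero, making R_T grow sub-linearly in T; the random policy above accumulates regret linearly.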

3.2. KERNEL RIDGE REGRESSION & NYSTRÖM APPROXIMATION

Throughout the paper, we use D ⊆ [T] to denote a set of time steps and |D| its size. The design matrix and reward vector constructed from the data collected at these time steps, i.e., {x_s, y_s}_{s∈D}, are denoted as X_D = [x_s]^⊤_{s∈D} ∈ R^{|D|×d} and y_D = [y_s]^⊤_{s∈D} ∈ R^{|D|}. Applying the feature map ϕ(·) to each row of X_D, we obtain Φ_D ∈ R^{|D|×p}, where p denotes the dimension of H and is possibly infinite.

Kernel Ridge Regression. Since the reward function f is linear in H, one can construct the Ridge regression estimator for θ_⋆ as θ̂ = (Φ_D^⊤ Φ_D + λI)^{−1} Φ_D^⊤ y_D, where λ > 0 is the regularization parameter. This gives the following estimated mean reward and standard deviation in the primal form for any arm x ∈ A:

μ(x) = ϕ(x)^⊤ (Φ_D^⊤ Φ_D + λI)^{−1} Φ_D^⊤ y_D,  σ(x) = sqrt( ϕ(x)^⊤ (Φ_D^⊤ Φ_D + λI)^{−1} ϕ(x) ).

Note that directly working with the possibly infinite-dimensional θ̂ ∈ R^p is impractical. Instead, using the kernel trick (Valko et al., 2013; Li et al., 2022), we can obtain an equivalent dual form that only involves entries of the kernel matrix:

μ(x) = K_D(x)^⊤ (K_{D,D} + λI)^{−1} y_D,  σ(x) = λ^{−1/2} sqrt( k(x, x) − K_D(x)^⊤ (K_{D,D} + λI)^{−1} K_D(x) ),  (equation 1)

where K_D(x) = Φ_D ϕ(x) = [k(x_s, x)]^⊤_{s∈D} ∈ R^{|D|} and K_{D,D} = Φ_D Φ_D^⊤ = [k(x_s, x_{s′})]_{s,s′∈D} ∈ R^{|D|×|D|}.

Nyström Approximation. Though equation 1 avoids directly working in H, it requires computing the inverse of K_{D,D}, which is expensive in terms of both computation cost (Calandriello et al., 2019), i.e., O(T^3) as |D| = O(T), and communication cost (Li et al., 2022), i.e., O(T) as {(x_s, y_s)}_{s∈D} needs to be transferred across the clients. Therefore, the Nyström method is used to approximate equation 1, so that clients can share embedded statistics, which improves communication efficiency. As in Calandriello et al. (2020); Li et al. (2022), we project the original dataset D onto the subspace defined by a small representative subset S ⊆ D, i.e., the dictionary, with the orthogonal projection matrix defined as P_S = Φ_S^⊤ (Φ_S Φ_S^⊤)^{−1} Φ_S = Φ_S^⊤ K_{S,S}^{−1} Φ_S ∈ R^{p×p}. Taking the eigen-decomposition K_{S,S} = UΛU^⊤ ∈ R^{|S|×|S|}, we can rewrite the orthogonal projection as P_S = Φ_S^⊤ U Λ^{−1/2} Λ^{−1/2} U^⊤ Φ_S, and define the Nyström embedding function as z(x; S) = P_S^{1/2} ϕ(x) = Λ^{−1/2} U^⊤ Φ_S ϕ(x) = K_{S,S}^{−1/2} K_S(x), which maps the data point x from R^d to R^{|S|}. Therefore, we can approximate the Ridge regression estimator on dataset D as θ̃ = (P_S Φ_D^⊤ Φ_D P_S + λI)^{−1} P_S Φ_D^⊤ y_D, and approximate equation 1 as

μ̃(x) = z(x; S)^⊤ (Z_{D;S}^⊤ Z_{D;S} + λI)^{−1} Z_{D;S}^⊤ y_D,  σ̃(x) = λ^{−1/2} sqrt( k(x, x) − z(x; S)^⊤ Z_{D;S}^⊤ Z_{D;S} (Z_{D;S}^⊤ Z_{D;S} + λI)^{−1} z(x; S) ),  (equation 2)

where Z_{D;S} ∈ R^{|D|×|S|} is obtained by applying z(·; S) to each row of X_D, i.e., Z_{D;S} = Φ_D P_S^{1/2}. Note that the computation of μ̃(x) and σ̃(x) only requires the embedded statistics, i.e., the matrix Z_{D;S}^⊤ Z_{D;S} ∈ R^{|S|×|S|} and the vector Z_{D;S}^⊤ y_D ∈ R^{|S|}, which makes joint kernelized estimation among N clients much more efficient in communication compared with equation 1.
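To make the two estimators concrete, the following sketch computes the exact dual-form estimator of equation 1 and its Nyström approximation of equation 2 with a Gaussian kernel. The kernel choice, bandwidth, and data are illustrative assumptions; in the special case S = D the projection is lossless, so the two estimators coincide:

```python
import numpy as np

def gauss_k(A, B, gamma=1.0):
    # Gaussian kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def exact_dual(X_D, y_D, x, lam=1.0):
    # equation 1: mu(x), sigma(x) via the full |D| x |D| kernel matrix
    K_DD = gauss_k(X_D, X_D)
    K_Dx = gauss_k(X_D, x[None, :])[:, 0]
    sol = np.linalg.solve(K_DD + lam * np.eye(len(X_D)),
                          np.column_stack([y_D, K_Dx]))
    mu = K_Dx @ sol[:, 0]
    var = gauss_k(x[None, :], x[None, :])[0, 0] - K_Dx @ sol[:, 1]
    return mu, np.sqrt(max(var, 0.0) / lam)

def inv_sqrt(M, eps=1e-12):
    # symmetric inverse square root via eigen-decomposition, K^{-1/2}
    w, U = np.linalg.eigh(M)
    return U @ np.diag(np.clip(w, eps, None) ** -0.5) @ U.T

def nystrom_approx(X_D, y_D, X_S, x, lam=1.0):
    # equation 2: z(x; S) = K_{S,S}^{-1/2} K_S(x); only |S|-dim statistics needed
    K_SS_inv_half = inv_sqrt(gauss_k(X_S, X_S))
    Z = gauss_k(X_D, X_S) @ K_SS_inv_half                  # Z_{D;S}, |D| x |S|
    z_x = K_SS_inv_half @ gauss_k(X_S, x[None, :])[:, 0]   # z(x; S)
    A = Z.T @ Z + lam * np.eye(Z.shape[1])                 # embedded statistic A~
    b = Z.T @ y_D                                          # embedded statistic b~
    mu = z_x @ np.linalg.solve(A, b)
    var = gauss_k(x[None, :], x[None, :])[0, 0] \
          - z_x @ (Z.T @ Z) @ np.linalg.solve(A, z_x)
    return mu, np.sqrt(max(var, 0.0) / lam)
```

With |S| much smaller than |D|, only the |S|×|S| matrix and |S|-dimensional vector need to be communicated, instead of the raw data.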

4. METHODOLOGY

In this section, we propose and analyze the first asynchronous algorithm for the distributed kernelized contextual bandit problem, which addresses the challenges mentioned in Section 1. We name the resulting algorithm Async-KernelUCB, and its description is given in Algorithm 1.

4.1. ALGORITHM

We denote the embedded statistics used in the computation of equation 2 by Ã(D; S) := Z_{D;S}^⊤ Z_{D;S} and b̃(D; S) := Z_{D;S}^⊤ y_D, to explicitly emphasize that they are computed by projecting the data points of dataset D onto the subspace spanned by dictionary S. We denote the sequence of time steps corresponding to the interactions between client i and the environment up to time t as N_t(i) = {1 ≤ s ≤ t : i_s = i} for t ∈ [T]. Throughout the paper, we reserve k as the index for communication, and use t_k ∈ [T] to denote the time step when the k-th communication happens. Moreover, as each client has a different copy of the embedding function and embedded statistics due to asynchronous communication, we use k(i) to denote the index of client i's latest communication with the server, up to the k-th one: if client i triggers the k-th communication, then k(i) = k.

Arm Selection. At each round t ∈ [T], client i_t ∈ [N] selects arm x_t from the candidate set A_t by maximizing the following upper confidence bound (line 5):

x_t = arg max_{x∈A_t} μ̃_{k(i_t)}(x) + α σ̃_{k(i_t)}(x),  (equation 3)

where μ̃_{k(i_t)}(x) and σ̃_{k(i_t)}(x) are the approximated mean and standard deviation of arm x's reward, computed using the statistics Ã(D_{k(i_t)}; S_{k(i_t)}) and b̃(D_{k(i_t)}; S_{k(i_t)}) that client i_t received from the server during the k(i_t)-th communication. A proper choice of α is given in Lemma 4.4.

Algorithm 1 Asynchronous KernelUCB (Async-KernelUCB)
1: Input: α, q, communication threshold D > 0, regularization parameter λ > 0, δ ∈ (0, 1), and kernel function k(·,·)
2: Initialize approximated mean and variance μ̃_0(x) = 0, σ̃_0(x) = λ^{−1/2} sqrt(k(x, x)), dataset D_0 = ∅, dictionary S_0 = ∅, index of communication k = 0, and N_0(i) = ∅ for each client i ∈ [N]
3: for t = 1, 2, ..., T do
4: Client i_t ∈ [N] becomes active, and observes arm set A_t
5: [Client i_t] Choose arm x_t ∈ A_t according to equation 3, and observe reward y_t
6: // Set N_t(i_t) = N_{t−1}(i_t) ∪ {t}, and N_t(i) = N_{t−1}(i) for i ≠ i_t
7: if Σ_{s∈N_t(i_t)\N_{t_{k(i_t)}}(i_t)} σ̃²_{k(i_t)}(x_s) > D then // Denote ∆D_k = N_t(i_t) \ N_{t_{k(i_t)}}(i_t), and set k = k + 1
8: [Server → Client i_t] Send {x_s, y_s}_{s∈S_{k−1}}, Ã(D_{k−1}; S_{k−1}), b̃(D_{k−1}; S_{k−1}) to client i_t
9: [Client i_t] Select ∆S_k ⊆ ∆D_k via RLS sampling with probability q σ̃²_{k−1}(·) // Set S_k = S_{k−1} ∪ ∆S_k
10: [Client i_t] Compute Ã(∆D_k; S_k), b̃(∆D_k; S_k)
11: [Client i_t → Server] Send {x_s, y_s}_{s∈∆S_k}, Ã(∆D_k; S_k) and b̃(∆D_k; S_k) to server // Set D_k = D_{k−1} ∪ ∆D_k
12: [Server] Compute Ã(D_k; S_k), b̃(D_k; S_k) according to equation 5
13: [Server → Client i_t] Send Ã(D_k; S_k), b̃(D_k; S_k) to client i_t
14: [Client i_t] Update μ̃_k(·) and σ̃_k(·) using Ã(D_k; S_k), b̃(D_k; S_k) according to equation 2
15: end if
16: end for

Event-triggered Asynchronous Communication. After the interaction at time step t, μ̃_{k(i_t)}(·) and σ̃_{k(i_t)}(·) of the active client i_t will only be updated if the following event is true (line 7):

Σ_{s∈N_t(i_t)\N_{t_{k(i_t)}}(i_t)} σ̃²_{k(i_t)}(x_s) > D,  (equation 4)

where D > 0 denotes the communication threshold. This measures whether a sufficient amount of new information has been collected by client i_t since its latest (the k(i_t)-th) communication with the server. If true, communication between client i_t and the server is triggered (lines 8-14), where the update procedure described in the following paragraphs is performed. This procedure is also illustrated in Figure 1.
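The client-side logic of lines 7 and 9 can be sketched as follows. The variance proxy `sigma2` below is a hypothetical stand-in for the stale approximate variance σ̃²_{k(i)}(·); in the actual algorithm it comes from equation 2:

```python
import numpy as np

def triggered(new_xs, sigma2, D):
    # line 7 (equation 4): communicate iff the accumulated approximate variance
    # of the points collected since the last sync exceeds the threshold D
    return sum(sigma2(x) for x in new_xs) > D

def rls_sample(delta_D, sigma2, q, rng):
    # line 9: keep each new point with probability p = min(q * sigma2(x), 1),
    # a variant of Ridge leverage score sampling
    return [x for x in delta_D if rng.random() < min(q * sigma2(x), 1.0)]

rng = np.random.default_rng(0)
sigma2 = lambda x: float(np.exp(-x @ x))    # toy stand-in for sigma~^2_{k(i)}(.)
delta_D = [rng.normal(size=3) for _ in range(30)]

if triggered(delta_D, sigma2, D=0.5):
    delta_S = rls_sample(delta_D, sigma2, q=2.0, rng=rng)  # incremental dictionary update
```

Points where the current model is already confident (small σ̃²) are rarely added to the dictionary, keeping it compact.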
Dictionary and Embedded Statistics Update. During the k-th communication, the server first sends its latest dictionary S_{k−1}, as well as its latest embedded statistics Ã(D_{k−1}; S_{k−1}) and b̃(D_{k−1}; S_{k−1}), to client i_t (line 8), which is illustrated as the blue lines in Figure 1. Then client i_t selects a subset ∆S_k from the data it has collected since its latest communication (line 9), i.e., ∆D_k, which will be used to incrementally update dictionary S_{k−1}. This is done by sampling q_{k,s} ∼ Bernoulli(p̃_{k,s}) for each data point with time index s ∈ ∆D_k, where p̃_{k,s} := q σ̃²_{k−1}(x_s). This can be considered a variant of Ridge leverage score (RLS) sampling (Calandriello et al., 2020; Li et al., 2022). It is worth noting that the only purpose of sending Ã(D_{k−1}; S_{k−1}) and b̃(D_{k−1}; S_{k−1}) is to enable RLS sampling with the latest σ̃²_{k−1}(·). Otherwise, client i_t, whose latest communication with the server may have happened a long time ago, would include unnecessary data points in ∆S_k due to its unawareness of the server's current status. We will demonstrate in the proof of Lemma 4.3 that this design is necessary to obtain a compact dictionary under asynchronous communication. With the dictionary updated, client i_t computes the embeddings of its new local data, i.e., Ã(∆D_k; S_k) and b̃(∆D_k; S_k), and sends them, together with ∆S_k, to the server (the yellow lines in Figure 1). As shown in Figure 1, the server stores: 1) the last received embedded statistics from each client i ∈ [N], i.e., Ã(N_{t_{k(i)}}(i); S_{k(i)}) ∈ R^{|S_{k(i)}|×|S_{k(i)}|} and b̃(N_{t_{k(i)}}(i); S_{k(i)}) ∈ R^{|S_{k(i)}|}; and 2) the corresponding dictionary S_{k(i)}. As mentioned earlier, due to asynchronous communication, the statistics from different clients are based on different dictionaries, which means they have different dimensions and thus cannot be directly aggregated as in Li et al. (2022). We propose to transform the statistics from each client i ∈ [N] using the latest dictionary S_k.
This is based on the fact that Z_{N_{t_{k(i)}}(i);S_k} = Φ_{N_{t_{k(i)}}(i)} P_{S_k}^{1/2} = Φ_{N_{t_{k(i)}}(i)} P_{S_{k(i)}}^{1/2} P_{S_{k(i)}}^{−1/2} P_{S_k}^{1/2} = Z_{N_{t_{k(i)}}(i);S_{k(i)}} T_{k(i),k}, where the linear transformation T_{k(i),k} := P_{S_{k(i)}}^{−1/2} P_{S_k}^{1/2} = Λ_{S_{k(i)}}^{−1/2} U_{S_{k(i)}}^⊤ Φ_{S_{k(i)}} Φ_{S_k}^⊤ U_{S_k} Λ_{S_k}^{−1/2} serves this purpose. Hence, we have Ã(N_{t_{k(i)}}(i); S_k) = T_{k(i),k}^⊤ Ã(N_{t_{k(i)}}(i); S_{k(i)}) T_{k(i),k} and b̃(N_{t_{k(i)}}(i); S_k) = T_{k(i),k}^⊤ b̃(N_{t_{k(i)}}(i); S_{k(i)}), which makes the statistics received from all clients have the same dimension |S_k|. Then we compute Ã(D_k; S_k) = Σ_{i=1}^N Ã(N_{t_{k(i)}}(i); S_k) and b̃(D_k; S_k) = Σ_{i=1}^N b̃(N_{t_{k(i)}}(i); S_k) (line 12, equation 5), and send them to client i_t to update its UCB (lines 13-14), which is illustrated as the green line in Figure 1.

(Figure 1: Illustration of the asynchronous update of dictionary and embedded statistics.)
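The transformation can be sketched numerically. In the special case where a client's local data lies in the span of its old dictionary and the new dictionary S_k extends S_{k(i)}, lifting the old embeddings with T_{k(i),k} reproduces the embeddings under the new dictionary exactly; the Gaussian kernel and the toy data below are illustrative assumptions:

```python
import numpy as np

def gk(A, B, gamma=1.0):
    # Gaussian kernel matrix
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def inv_sqrt(M, eps=1e-10):
    w, U = np.linalg.eigh(M)
    return U @ np.diag(np.clip(w, eps, None) ** -0.5) @ U.T

def embed(X, X_S):
    # rows are z(x; S) = K_{S,S}^{-1/2} K_S(x), i.e., Z_{D;S}
    return gk(X, X_S) @ inv_sqrt(gk(X_S, X_S))

def transform(X_S_old, X_S_new):
    # T_{k(i),k} in kernel form: K_{old,old}^{-1/2} K_{old,new} K_{new,new}^{-1/2}
    return inv_sqrt(gk(X_S_old, X_S_old)) @ gk(X_S_old, X_S_new) \
           @ inv_sqrt(gk(X_S_new, X_S_new))

rng = np.random.default_rng(0)
X_S_old = rng.normal(size=(4, 2))                         # client's stale dictionary
X_S_new = np.vstack([X_S_old, rng.normal(size=(2, 2))])   # server's dictionary S_k
X_D = X_S_old.copy()            # local data lying in the span of the old dictionary
y = rng.normal(size=len(X_D))

T_mat = transform(X_S_old, X_S_new)
Z_direct = embed(X_D, X_S_new)            # re-embed from scratch
Z_lifted = embed(X_D, X_S_old) @ T_mat    # lift stale embeddings instead

# server-side statistic update: A_new = T^T A_old T (and b_new = T^T b_old)
A_old = embed(X_D, X_S_old).T @ embed(X_D, X_S_old)
A_new = T_mat.T @ A_old @ T_mat
```

In general the lift is only an approximation, with error controlled by the dictionary's ϵ-accuracy analyzed in Section 4.2, but it lets the server aggregate statistics of matching dimension |S_k| without touching any raw data.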

4.2. ANALYSIS OF DICTIONARY ACCURACY AND SIZE

As mentioned earlier, the key to low regret and low communication cost is a dictionary S_k that accurately approximates the dataset D_k while having a compact size |S_k|. In this section, we show that this is achieved by our update procedure in Section 4.1. First, we need some additional notation. We denote the total number of times up to time T that communication is triggered, i.e., the number of times equation 4 is true, as B, where B ∈ [0, T]. Following Calandriello et al. (2020); Li et al. (2022), the approximation quality is formally defined via ϵ-accuracy: if the event

(1 − ϵ)(Φ_{D_k}^⊤ Φ_{D_k} + λI) ⪯ Φ_{D_k}^⊤ S̄_k^⊤ S̄_k Φ_{D_k} + λI ⪯ (1 + ϵ)(Φ_{D_k}^⊤ Φ_{D_k} + λI)

is true, then we say the dictionary S_k is ϵ-accurate w.r.t. dataset D_k, for some ϵ ∈ (0, 1), where S̄_k ∈ R^{|D_k|×|D_k|} denotes a diagonal matrix whose s-th diagonal entry equals q_{k,s}/sqrt(p̃_{k,s}), with q_{k,s} = 1 if s ∈ S_k and q_{k,s} = 0 otherwise. Based on this notion, we prove Lemma 4.1 below.

Lemma 4.1 (Dictionary Accuracy and Size). With q = 4 ln(2√2 T/δ) β(1 + ϵ/3)/ϵ², where β := (1 + ϵ)/(1 − ϵ), and λ ≤ k(x, x), ∀x ∈ A, we have with probability at least 1 − δ that dictionary S_k is ϵ-accurate w.r.t. dataset D_k, and its size satisfies |S_k| ≤ 12β(1 + βD) q γ_T, ∀k, where δ ∈ (0, 1).

This shows that our incremental update procedure under asynchronous communication still matches the results of prior works that perform synchronous re-sampling over the whole dataset for the dictionary update (Li et al., 2022; Calandriello et al., 2020). We provide a proof sketch for Lemma 4.1 below to highlight our technical novelty, and give the detailed proof in the appendix.

Proof Sketch of Lemma 4.1. Define the unfavorable event H_k = A_k ∪ E_k, where A_k is the event that the dictionary S_k is not ϵ-accurate w.r.t. D_k, and E_k is the event that the size of the dictionary satisfies |S_k| > 12β(1 + βD) q γ_T.
Therefore, the probability of ∪_{k=0}^B H_k can be decomposed as P(∪_{k=0}^B H_k) = P(∪_{k=0}^B A_k) + P((∪_{k=0}^B E_k) ∩ (∪_{k=0}^B A_k)^C).

Bounding the first term: In Calandriello et al. (2019; 2020); Li et al. (2022), the first term is further decomposed as P(∪_{k=0}^B A_k) ≤ Σ_{k=1}^B P(A_k ∩ A_{k−1}^C), because dictionary S_k is constructed by a fresh re-sampling over D_k using the latest approximated variance σ̃²_{k−1}(·), and thus they only need to guarantee that σ̃²_{k−1}(·) is a good approximation of σ²_{k−1}(·). In our case, S_k is incrementally updated in each communication, i.e., S_k = ∪_{k′=1}^k ∆S_{k′}, where each ∆S_{k′} is sampled using σ̃²_{k′−1}(·). The accuracy of S_k thus depends on the accuracy of every earlier S_{k′}, i.e., on the event ∩_{k′=1}^{k−1} A_{k′}^C. Therefore, we decompose P(∪_{k=0}^B A_k) = 1 − P(∩_{k=0}^B A_k^C) = 1 − Π_{k=1}^B [1 − P(A_k | ∩_{k′=0}^{k−1} A_{k′}^C)] ≤ Σ_{k=1}^B P(A_k | ∩_{k′=0}^{k−1} A_{k′}^C) using Bayes' theorem and the Weierstrass product inequality, and bound each conditional probability separately, which leads to Lemma 4.2.

Lemma 4.2 (Bounding Σ_{k=1}^B P(A_k | ∩_{k′=0}^{k−1} A_{k′}^C)). By setting q = 4 ln(2√2 T/δ) β(1 + ϵ/3)/ϵ², we have Σ_{k=1}^B P(A_k | ∩_{k′=0}^{k−1} A_{k′}^C) ≤ δ/2, for δ ∈ (0, 1).

Bounding the second term: The second term can be decomposed as P((∪_{k=0}^B E_k) ∩ (∪_{k=0}^B A_k)^C) ≤ Σ_{k=0}^B P(E_k ∩ (∩_{k=0}^B A_k^C)). Note that the size of the dictionary is |S_k| = Σ_{s∈D_k} q_{k,s} by the definition of q_{k,s}, and its analysis relies on upper bounding Σ_{s∈D_k} p̃_{k,s} (Calandriello et al., 2020). Again, due to asynchronous communication, for a data point s that was added during the k′-th communication, i.e., s ∈ ∆D_{k′}, we have q_{k,s} = q_{k′,s} and p̃_{k,s} = p̃_{k′,s}, and thus Σ_{s∈D_k} p̃_{k,s} = Σ_{k′=1}^k Σ_{s∈∆D_{k′}} p̃_{k′,s}, which leads to Lemma 4.3.

Lemma 4.3 (Bounding Σ_{k=0}^B P(E_k ∩ (∩_{k=0}^B A_k^C))). By setting q = 4 ln(2√2 T/δ) β(1 + ϵ/3)/ϵ², and λ ≤ k(x, x), ∀x ∈ A, we have Σ_{k=0}^B P(E_k ∩ (∩_{k=0}^B A_k^C)) ≤ δ/2, for δ ∈ (0, 1).

Putting everything together, we have P(∪_{k=0}^B H_k) ≤ δ, for δ ∈ (0, 1), which finishes the proof.
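For a finite-dimensional feature map, the ϵ-accuracy condition above can be verified directly with an eigenvalue test. This is only a sketch for intuition: `Phi` plays the role of Φ_{D_k}, and `keep`/`probs` encode the sampling indicators q_{k,s} and probabilities p̃_{k,s}:

```python
import numpy as np

def is_eps_accurate(Phi, keep, probs, lam, eps):
    # Checks (1-eps)(Phi^T Phi + lam I) <= Phi^T Sbar^T Sbar Phi + lam I
    #                                   <= (1+eps)(Phi^T Phi + lam I),
    # where Sbar is diagonal with s-th entry keep[s] / sqrt(probs[s]).
    w = keep / probs                                   # diagonal of Sbar^T Sbar
    V = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    V_hat = Phi.T @ (w[:, None] * Phi) + lam * np.eye(Phi.shape[1])
    lower_ok = np.linalg.eigvalsh(V_hat - (1 - eps) * V).min() >= -1e-9
    upper_ok = np.linalg.eigvalsh((1 + eps) * V - V_hat).min() >= -1e-9
    return lower_ok and upper_ok

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
keep_all = np.ones(50)   # keeping every point with probability 1 is trivially accurate
```

Keeping all points (`keep_all`) gives exact equality, while keeping none fails the lower inequality whenever the data carries non-trivial information.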

4.3. ANALYSIS OF REGRET AND COMMUNICATION COST

Lemma 4.1 guarantees a compact and accurate dictionary for the Nyström approximation throughout the learning process. Based on it, we establish upper bounds on the cumulative regret and communication cost of Async-KernelUCB. First, motivated by the confidence ellipsoid for asynchronous linear bandits (He et al., 2022), we construct the following confidence ellipsoid for our approximated estimator for kernel bandits defined in Section 3.2 (proof provided in the appendix).

Lemma 4.4 (Confidence ellipsoid for the approximated estimator). Under the same condition as Lemma 4.1, with probability at least 1 − 2δ, for δ ∈ (0, 1), we have for all k that

∥θ̃_k − θ_⋆∥_{Ṽ_k} ≤ (1/√(1 − ϵ) + 1)√λ S + 2R sqrt( 1 + NDβ + N sqrt(2Dβ ln(1/δ)) + γ_T ) := α,

where Ṽ_k := P_{S_k} Φ_{D_k}^⊤ Φ_{D_k} P_{S_k} + λI, and γ_T := max_{D⊂A:|D|=T} (1/2) log det(K_{D,D}/(Dβλ) + I) is the maximum information gain after T interactions (Chowdhury & Gopalan, 2017; Li et al., 2022).

Then, based on Lemma 4.4, we establish Theorem 4.5 below (proof provided in the appendix).

Theorem 4.5. Under the same condition as Lemma 4.1, with probability at least 1 − 2δ we have

R_T ≤ 4N γ_T LS + 4√2 [ (1/√(1 − ϵ) + 1)√λ S + 2R sqrt( 1 + NDβ + N sqrt(2Dβ ln(1/δ)) + γ_T ) ] · sqrt( T β[1 + Nβ(L²/λ + D)] γ_T ),

and C_T ≤ 2γ_T (N + 4β/D) [ 3(|S_B|² + |S_B|) + d|S_B| ], where the dictionary size satisfies |S_B| ≤ 12β(1 + βD) q γ_T by Lemma 4.1. By setting D = 1/N², we have R_T = O( N γ_T LS + √T (S √γ_T + γ_T) ) and C_T = Õ(N² γ_T³).

5. EXPERIMENTS

To validate Async-KernelUCB's effectiveness in reducing communication cost, we performed extensive empirical evaluations on both synthetic and real-world datasets, and report the results (averaged over 10 runs) in Figure 2. The baselines included in our comparisons are: 1) OneKernelUCB (Valko et al., 2013), which learns a single kernel bandit model over all clients' aggregated data, where aggregation happens immediately after each new data point becomes available; 2) NKernelUCB, which learns a separate kernel bandit model for each client with no communication; 3) FedGLBUCB (Li & Wang, 2022b), a synchronous distributed GLB algorithm; 4) DisLinUCB (Wang et al., 2019), a synchronous distributed linear bandit algorithm; 5) FedLinUCB (He et al., 2022), an asynchronous distributed linear bandit algorithm; and 6) Approx-DisKernelUCB (Li et al., 2022), a synchronous distributed kernel bandit algorithm. For all kernel bandit algorithms, we used the Gaussian kernel k(x, y) = exp(−γ∥x − y∥²), with a grid search over γ ∈ {0.1, 1, 4}; for FedGLBUCB, we used the Sigmoid function µ(z) = (1 + exp(−z))^{−1} as the link function. For all algorithms, instead of using the theoretically derived exploration coefficient α, we followed the convention of Li et al. (2010a); Zhou et al. (2020) and grid-searched α over {0.1, 1, 4}.

Synthetic dataset

We simulated the distributed bandit setting in Section 3.1 with d = 20, T = 10^4, and N = 10^2. At each time step t ∈ [T], client i_t ∈ [N] selects an arm from the candidate set A_t (with |A_t| = 20), which is uniformly sampled from an ℓ2 unit ball. The reward is then generated using one of the following reward functions: 1) f_1(x) = cos(3 x^⊤ θ_⋆), and 2) f_2(x) = (x^⊤ θ_⋆)³ − 3(x^⊤ θ_⋆)² − (x^⊤ θ_⋆) + 3, where the parameter θ_⋆ is uniformly sampled from an ℓ2 unit ball and then fixed.
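The two synthetic reward functions can be sketched as follows (a small-scale illustration; for simplicity the arms and θ_⋆ below are drawn on the unit sphere rather than uniformly from the unit ball, which is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 20
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)   # fixed unknown parameter, unit norm

def f1(x):
    # reward function 1: cos(3 x^T theta_star)
    return float(np.cos(3 * x @ theta_star))

def f2(x):
    # reward function 2: cubic polynomial of z = x^T theta_star
    z = x @ theta_star
    return float(z ** 3 - 3 * z ** 2 - z + 3)

# a candidate arm set of 20 unit vectors
A_t = rng.normal(size=(20, d))
A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)
rewards = [f1(x) for x in A_t]
```

Both functions are smooth but clearly non-linear in x, which is what gives the kernelized methods their advantage over the linear baselines.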

UCI Datasets

We also performed experiments using MagicTelescope and Mushroom from the UCI Machine Learning Repository (Dua & Graff, 2017), which are converted to bandit problems following Filippi et al. (2010). Specifically, we partitioned each dataset into 20 clusters using k-means, used the centroid of each cluster as the context for an arm, and used the averaged response as the mean reward (the response is binarized by setting one class to 1 and all others to 0). We then simulated the distributed bandit setting in Section 3.1 with |A_t| = 20, T = 10^4, and N = 10^2.

MovieLens and Yelp Datasets

The Yelp dataset, released by the Yelp dataset challenge, consists of 4.7 million rating entries for 157 thousand restaurants by 1.18 million users. MovieLens consists of 25 million ratings between 160 thousand users and 60 thousand movies (Harper & Konstan, 2015). Following the pre-processing steps in Ban et al. (2021), we built the rating matrix by choosing the top 2,000 users and top 10,000 restaurants/movies, and used singular-value decomposition to extract a 10-dimensional feature vector for each user and each restaurant/movie. We treated ratings greater than 2 as positive, and simulated the distributed bandit setting in Section 3.1 with T = 10^4 and N = 10^2. The candidate set A_t (with |A_t| = 20) is constructed by sampling one arm with positive reward and nineteen arms with negative reward from the arm pool, and the concatenation of the user and restaurant/movie feature vectors is used as the context vector for the arm (thus d = 20).
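The classification-to-bandit conversion used for the UCI datasets can be sketched as follows. The synthetic features and labels below are a toy stand-in for the real MagicTelescope/Mushroom data, and the minimal Lloyd's-algorithm loop is a stand-in for a library k-means implementation:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # minimal Lloyd's algorithm (illustrative stand-in for a library k-means)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(0)
    return C, assign

# toy stand-in for a UCI dataset: features X and a binarized response y
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = (X[:, 0] > 0).astype(float)

C, assign = kmeans(X, k=20)
# arms: cluster centroids as contexts; mean reward: averaged binarized response
mean_reward = np.array([y[assign == j].mean() if np.any(assign == j) else 0.0
                        for j in range(20)])
```

Each of the 20 centroids then serves as an arm's context vector, and pulling that arm yields a stochastic binary reward with the cluster's mean response.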

5.2. EXPERIMENT RESULTS

OneKernelUCB and NKernelUCB correspond to the two extreme cases where the clients either communicate at every time step to learn a shared model, or learn their own models independently with no communication. As shown in Figure 2, the kernelized algorithms attain lower regret than their linear counterparts, while requiring relatively low communication cost for joint kernel estimation. It is also worth noting that, despite having the same Õ(N² γ_T³) theoretical scaling in communication cost, Async-KernelUCB incurs a much smaller communication cost empirically, while achieving comparable or even better regret than Approx-DisKernelUCB.

6. CONCLUSIONS

In this paper, we proposed the first asynchronous algorithm for distributed kernel bandits, which relaxes the limitation of prior work that requires an impractical global synchronization to update the Nyström embedding function and share the embedded statistics across all clients. To ensure the approximation quality and compactness of the constructed dictionary under asynchronous communication, we designed an incremental update procedure tailored to this communication scheme, as well as a transformation operation on the server side that enables joint kernel estimation using statistics with different embeddings. With the improved robustness against delays and unavailability of clients brought by asynchronous communication, we showed that, to attain near-optimal regret, the proposed algorithm still only incurs an Õ(N² γ_T³) communication cost, matching that of the prior work. The lower bound analysis for the communication cost of distributed contextual bandits remains an open problem and is an important future direction. To the best of our knowledge, the only applicable lower bound states that, in order to achieve smaller regret than the trivial O(√(NT)) result, i.e., running N instances of an optimal bandit algorithm with no communication, Ω(N) communications are necessary (He et al., 2022). In comparison, it is more interesting to know the communication lower bound for attaining the optimal O(√T) regret. Moreover, motivated by the differentially private (DP) version of DisLinUCB by Dubey & Pentland (2020), i.e., applying randomized mechanisms to the shared sufficient statistics, another interesting direction is a DP version of our Async-KernelUCB, in which case the main focus is a privacy-preserving construction of the shared embedding function.

A TECHNICAL LEMMAS

Lemma A.1 (Lemma 11 of Abbasi-Yadkori et al. (2011)). Let $\{x_t\}_{t=1}^{\infty}$ be a sequence in $\mathbb{R}^d$, $V \in \mathbb{R}^{d \times d}$ a positive definite matrix, and define $V_t = V + \sum_{s=1}^{t} x_s x_s^\top$. Then we have that $\ln\frac{\det(V_n)}{\det(V)} \leq \sum_{t=1}^{n} \|x_t\|^2_{V_{t-1}^{-1}}$. Moreover, if $\|x_t\|_2 \leq L$ for all $t$, and $\lambda_{\min}(V) \geq \max(1, L^2)$, then $\sum_{t=1}^{n} \|x_t\|^2_{V_{t-1}^{-1}} \leq 2\ln\frac{\det(V_n)}{\det(V)}$.

Lemma A.2 (Lemma 12 of Abbasi-Yadkori et al. (2011)). Let $A$, $B$ and $C$ be positive semi-definite matrices such that $A = B + C$. Then we have that
$$\sup_{x \neq 0} \frac{x^\top A x}{x^\top B x} \leq \frac{\det(A)}{\det(B)}.$$

Lemma A.3 (Lemma A.2 of Li et al. (2022)). Define positive definite matrices $A = \lambda I + \Phi_1^\top \Phi_1 + \Phi_2^\top \Phi_2$ and $B = \lambda I + \Phi_1^\top \Phi_1$, where $\Phi_1^\top \Phi_1, \Phi_2^\top \Phi_2 \in \mathbb{R}^{p \times p}$ and $p$ is possibly infinite. Then we have that
$$\sup_{\phi \neq 0} \frac{\phi^\top A \phi}{\phi^\top B \phi} \leq \frac{\det(I + \lambda^{-1} K_A)}{\det(I + \lambda^{-1} K_B)},$$
where $K_A = [\Phi_1; \Phi_2][\Phi_1; \Phi_2]^\top$ and $K_B = \Phi_1 \Phi_1^\top$.

Lemma A.4 (Eq. (26) and Eq. (27) of Zenati et al. (2022)). Let $\{\phi_t\}_{t=1}^{\infty}$ be a sequence in $\mathbb{R}^p$, $V \in \mathbb{R}^{p \times p}$ a positive definite matrix, where $p$ is possibly infinite, and define $V_t = V + \sum_{s=1}^{t} \phi_s \phi_s^\top$. Then we have $\sum_{t=1}^{n} \min\big(\|\phi_t\|^2_{V_{t-1}^{-1}}, 1\big) \leq 2 \ln\det(I + \lambda^{-1} K_{V_t})$, where $K_{V_t}$ is the kernel matrix corresponding to $V_t$ as defined in Lemma A.3.

Lemma A.5 (Lemma 4 of Calandriello et al. (2020)). For $t > t'$, we have for any $x \in \mathbb{R}^d$
$$\sigma_t^2(x) \leq \sigma_{t'}^2(x) \leq \Big(1 + \sum_{s=t'+1}^{t} \sigma_{t'}^2(x_s)\Big)\,\sigma_t^2(x).$$

Lemma A.6 (Lemma 6 of Calandriello et al. (2019)). If $S_k$ is $\epsilon$-accurate w.r.t. $D_k$, then
$$\frac{1-\epsilon}{1+\epsilon}\,\sigma^2(x) \leq \min(\sigma_k^2(x), 1) \leq \frac{1+\epsilon}{1-\epsilon}\,\sigma^2(x)$$
for all $x \in \mathbb{R}^d$.

Lemma A.7 (Proposition 7 of Calandriello et al. (2019)). Let $G_1, \ldots, G_n$ be a sequence of independent self-adjoint random operators such that $\mathbb{E}[G_i] = 0$ and $\|G_i\| \leq R$. Then for any $\epsilon \geq 0$, we have
$$\mathbb{P}\Big(\Big\|\sum_{i=1}^{t} G_i\Big\| \geq \epsilon\Big) \leq 4t \exp\Big(-\frac{\epsilon^2/2}{\|\sum_{i=1}^{t} \mathbb{E}[G_i^2]\| + R\epsilon/3}\Big).$$

Lemma A.8 (Proposition 8 of Calandriello et al. (2019)). Let $\{q_s\}_{s=1}^{t}$ be independent Bernoulli random variables, each with success probability $p_s$. Then we have
$$\mathbb{P}\Big(\sum_{s=1}^{t} q_s \geq 3 \sum_{s=1}^{t} p_s\Big) \leq \exp\Big(-2 \sum_{s=1}^{t} p_s\Big).$$
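Lemma A.2 can be sanity-checked numerically: the supremum of the quadratic-form ratio equals the largest generalized eigenvalue of the pencil $(A, B)$, which is bounded by the determinant ratio because all generalized eigenvalues are at least 1 when $A \succeq B$. A small self-contained check (illustrative only, random matrices with a fixed seed):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 4
C = rng.standard_normal((d, d)); C = C @ C.T              # PSD perturbation
B = rng.standard_normal((d, d)); B = B @ B.T + np.eye(d)  # positive definite
A = B + C

# sup_{x != 0} (x^T A x)/(x^T B x) = lambda_max(B^{-1/2} A B^{-1/2}),
# computed here via the Cholesky factor of B.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
sup_ratio = np.linalg.eigvalsh(Linv @ A @ Linv.T).max()
det_ratio = np.linalg.det(A) / np.linalg.det(B)
```

Since every generalized eigenvalue is at least 1 and their product equals the determinant ratio, the maximum is bounded by the product, which is exactly the lemma's claim.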

B PROOF OF LEMMAS IN SECTION 4.2

Let us define the unfavorable event $H_k = A_k \cup E_k$, where $A_k$ is the event that the dictionary $S_k$ is not $\epsilon$-accurate w.r.t. $D_k$, and $E_k$ is the event that the size of the dictionary $|S_k|$ is large, i.e., $|S_k| > 12\beta(1+\beta D)q\gamma_T$. Therefore, we want to bound the probability of $\cup_{k=0}^{B} H_k$, which can be decomposed as
$$\mathbb{P}\big(\cup_{k=0}^{B} H_k\big) = \mathbb{P}\big(\cup_{k=0}^{B} (A_k \cup E_k)\big) = \mathbb{P}\big((\cup_{k=0}^{B} A_k) \cup (\cup_{k=0}^{B} E_k)\big) = \mathbb{P}\big(\cup_{k=0}^{B} A_k\big) + \mathbb{P}\big(\cup_{k=0}^{B} E_k\big) - \mathbb{P}\big((\cup_{k=0}^{B} A_k) \cap (\cup_{k=0}^{B} E_k)\big) = \mathbb{P}\big(\cup_{k=0}^{B} A_k\big) + \mathbb{P}\big((\cup_{k=0}^{B} E_k) \cap (\cup_{k=0}^{B} A_k)^C\big).$$
Note that, as in Calandriello et al. (2017), we bound the second term as
$$\mathbb{P}\big((\cup_{k=0}^{B} E_k) \cap (\cup_{k=0}^{B} A_k)^C\big) = \mathbb{P}\big((\cup_{k=0}^{B} E_k) \cap (\cap_{k=0}^{B} A_k^C)\big) = \mathbb{P}\big(\cup_{k=0}^{B} [E_k \cap (\cap_{k=0}^{B} A_k^C)]\big) \leq \sum_{k=0}^{B} \mathbb{P}\big(E_k \cap (\cap_{k=0}^{B} A_k^C)\big).$$
For the first term $\mathbb{P}(\cup_{k=0}^{B} A_k)$, we need a decomposition different from prior works (Calandriello et al., 2017; 2019), since our dictionary is incrementally updated with a batch of samples at each communication round (line 9 in Algorithm 1). Specifically, when bounding the probability of having an inaccurate dictionary at the $k$-th communication, i.e., event $A_k$, we need to condition on the event that the dictionaries at all previous communications are $\epsilon$-accurate, i.e., event $\cap_{k'=0}^{k-1} A_{k'}^C$. Hence, we decompose
$$\mathbb{P}\big(\cup_{k=0}^{B} A_k\big) = 1 - \mathbb{P}\big(\cap_{k=0}^{B} A_k^C\big) = 1 - \mathbb{P}(A_0^C)\prod_{k=1}^{B}\mathbb{P}\big(A_k^C \mid \cap_{k'=0}^{k-1} A_{k'}^C\big) = 1 - \prod_{k=1}^{B}\Big[1 - \mathbb{P}\big(A_k \mid \cap_{k'=0}^{k-1} A_{k'}^C\big)\Big] \leq \sum_{k=1}^{B} \mathbb{P}\big(A_k \mid \cap_{k'=0}^{k-1} A_{k'}^C\big),$$
where the second equality is due to the chain rule of probability, the third equality is because $D_0 = \emptyset$ is trivially well-approximated by $S_0 = \emptyset$, and thus $\mathbb{P}(A_0^C) = 1$, and the inequality is due to the Weierstrass product inequality. Putting everything together, we have
$$\mathbb{P}\big(\cup_{k=0}^{B} H_k\big) \leq \sum_{k=1}^{B} \mathbb{P}\big(A_k \mid \cap_{k'=0}^{k-1} A_{k'}^C\big) + \sum_{k=1}^{B} \mathbb{P}\big(E_k \cap (\cap_{k=0}^{B} A_k^C)\big).$$
Then we can upper bound these two terms using Lemma 4.2 and Lemma 4.3 given in Section 4.2, which leads to $\mathbb{P}(\cup_{k=0}^{B} H_k) \leq \delta$ for $\delta \in (0, 1)$, and thus finishes the proof of Lemma 4.1.

Proof of Lemma 4.2: bounding $\sum_{k=1}^{B} \mathbb{P}\big(A_k \mid \cap_{k'=0}^{k-1} A_{k'}^C\big)$.
As in Calandriello et al. (2019), we can rewrite the event $A_k$, based on the definition of $\epsilon$-accuracy given in equation 6, as $A_k = \big\{\big\|\sum_{s \in D_k} G_{k,s}\big\| > \epsilon\big\}$, where $G_{k,s} = \big(\frac{q_{k,s}}{\tilde{p}_{k,s}} - 1\big)\psi_{k,s}\psi_{k,s}^\top$ and $\psi_{k,s} = (\Phi_{D_k}^\top \Phi_{D_k} + \lambda I)^{-1/2}\phi(x_s)$. Then let us define $\mathcal{F}_k := \{q_{k,s}, \eta_s\}_{s \in D_k}$ for $k \in [B]$, which contains all the randomness in the construction of $S_k$ up to the $k$-th communication. With conditioning, we have
$$\mathbb{P}\big(A_k \mid \cap_{k'=0}^{k-1} A_{k'}^C\big) = \mathbb{E}_{\mathcal{F}_{k-1}: \cap_{k'=0}^{k-1} A_{k'}^C}\bigg[\mathbb{P}_{\mathcal{F}_k \setminus \mathcal{F}_{k-1}}\Big(\Big\|\sum_{s \in D_k}\big(\tfrac{q_{k,s}}{\tilde{p}_{k,s}} - 1\big)\psi_{k,s}\psi_{k,s}^\top\Big\| > \epsilon \,\Big|\, \mathcal{F}_{k-1}\Big)\bigg],$$
which holds because, when conditioned on the event $\cap_{k'=0}^{k-1} A_{k'}^C$, the outcomes associated with the complement of this event have zero probability, and thus we can restrict the outer expectation to the outcomes where the event $\cap_{k'=0}^{k-1} A_{k'}^C$ holds. Note that due to our incremental update procedure, for a data point with time index $s$ that was added into $D_k$ during the $k'$-th communication (sent to the server in the form of embedded statistics), i.e., $s \in \Delta D_{k'}$ for $k' = 1, \ldots, k$, we have $q_{k,s} = q_{k',s}$ and $\tilde{p}_{k,s} = \tilde{p}_{k',s}$. When conditioned on $\mathcal{F}_{k-1}$, the $q_{k,s}$ for all $s \in D_k$ are independent Bernoulli random variables with means $\tilde{p}_{k,s}$, because they correlate only via the approximated variance function(s) that were used for arm selection and RLS sampling up to the $k$-th communication, which are deterministic conditioned on $\mathcal{F}_{k-1}$, and thus both $\tilde{p}_{k,s}$ and $\psi_{k,s}$ are deterministic as well. Therefore, we can bound $\mathbb{P}_{\mathcal{F}_k \setminus \mathcal{F}_{k-1}}\big(\|\sum_{s \in D_k} G_{k,s}\| > \epsilon \mid \mathcal{F}_{k-1}\big)$ using Lemma A.7.
First, we need to show that each term in the summation has zero mean and bounded norm, i.e., E F k \F k-1 [G k,s |F k-1 ] = 0 and ∥G k,s ∥ ≤ R for some constant R: E F k \F k-1 ( q k,s pk,s -1)ψ k,s ψ ⊤ k,s |F k-1 = ( E F k \F k-1 q k,s |F k-1 pk,s -1)ψ k,s ψ ⊤ k,s = 0, and ∥G k,s ∥ = ∥( q k,s pk,s -1)ψ k,s ψ ⊤ k,s ∥ ≤ ( q k,s pk,s -1)∥ψ k,s ψ ⊤ k,s ∥ ≤ σ 2 k (x s ) pk,s , where the last inequality is because q k,s ≤ 1 and ∥ψ k,s ψ ⊤ k,s ∥ = ψ ⊤ k,s ψ k,s = σ 2 k (x s ). As mentioned earlier, for s ∈ ∆D k ′ , k ′ = 1, . . . , k, we have pk,s = pk ′ ,s = qσ 2 k ′ -1 (x s ), i.e., during the k ′ - th communication, client c k ′ first receives server's latest statistics to compute σ2 k ′ -1 (•) for RLS sampling. Conditioned on ∩ k k ′ =0 A C k ′ and by Lemma A.6, we have σ2 k ′ -1 (x s ) ≥ σ 2 k ′ -1 (x s )/β, where β := (1 + ϵ)/(1 -ϵ). Hence, ∥G k,s ∥ ≤ σ 2 k (x s ) pk,s = σ 2 k (x s ) qσ 2 k ′ -1 (x s ) ≤ β q σ 2 k (x s ) σ 2 k ′ -1 (x s ) ≤ β q := R. where the last inequality is because the variance is non-increasing over time. Then by Lemma A.7, P F k \F k-1 ∥ s∈D k G k,s ∥ > ϵ|F k-1 ≤ 4|D k | exp(- ϵ 2 /2 ∥ s∈D k E F k \F k-1 [G 2 k,s |F k-1 ]∥ + Rϵ/3 ) Now we need to further upper bound the term ∥ s∈D k E F k \F k-1 [G 2 k,s |F k-1 ]∥. First, note that E F k \F k-1 [G 2 k,s |F k-1 ] = E F k \F k-1 ( q k,s pk,s -1) 2 ψ k,s ψ ⊤ k,s ψ k,s ψ ⊤ k,s |F k-1 = E F k \F k-1 ( q k,s pk,s -1) 2 |F k-1 ψ k,s ψ ⊤ k,s ψ k,s ψ ⊤ k,s , and E F k \F k-1 [( q k,s pk,s -1) 2 |F k-1 ] = E F k \F k-1 [( q k,s pk,s ) 2 |F k-1 ] -2E F k \F k-1 [ q k,s pk,s |F k-1 ] + 1 = E F k \F k-1 [ q k,s p2 k,s |F k-1 ] -1 = 1 pk,s -1 ≤ 1 pk,s . 
Substituting this to the RHS, we have E F k \F k-1 [G 2 k,s |F k-1 ] ⪯ 1 pk,s ψ k,s ψ ⊤ k,s ψ k,s ψ ⊤ k,s ⪯ 1 pk,s ∥ψ k,s ψ ⊤ k,s ∥ψ k,s ψ ⊤ k,s ⪯ Rψ k,s ψ ⊤ k,s , and thus, ∥ s∈D k E F k \F k-1 [G 2 k,s |F k-1 ]∥ ≤ R∥ s∈D k ψ k,s ψ ⊤ k,s ∥ = R∥ s∈D k (Φ ⊤ D k Φ D k + λI) -1/2 ϕ s ϕ ⊤ s (Φ ⊤ D k Φ D k + λI) -1/2 ∥ = R∥(Φ ⊤ D k Φ D k + λI) -1/2 Φ ⊤ D k Φ D k (Φ ⊤ D k Φ D k + λI) -1/2 ∥ ≤ R, where the first equality is by definition of ψ k,s . Putting everything together, we have P F k \F k-1 ∥ s∈D k G k,s ∥ > ϵ|F k-1 ≤ 4|D k | exp(- ϵ 2 /2 1 + ϵ/3 • q β ), and thus P(A k | ∩ k-1 k ′ =0 A C k ′ ) ≤ 4|D k | exp(-ϵ 2 /2 1+ϵ/3 • q β ). Summing over B terms, we have B k=0 P A k | ∩ k-1 k ′ =0 A C k ′ ≤ 4 exp(- ϵ 2 /2 1 + ϵ/3 • q β ) B k=1 |D k | ≤ 4T 2 exp(- ϵ 2 /2 1 + ϵ/3 • q β ) In order to make sure B k=0 P A k | ∩ k-1 k ′ =0 A C k ′ ≤ δ 2 , we need to set q = 4β 1+ϵ/3 ϵ 2 ln( 2 √ 2T δ ). Proof of Lemma 4.3: bounding B k=0 P E k ∩ (∩ B k=0 A C k ) . First, note that P(E 0 ∩ (∩ B k=0 A C k )) = 0, because S 0 = ∅, and by definition of q k,s for s ∈ D k , the size of dictionary |S k | = s∈D k q k,s . We formally define unfavorable event E k as E k = s∈D k q k,s > 12β(1 + βD)qγ T , where β = (1 + ϵ)/(1 -ϵ). Similar to Calandriello et al. (2017; 2019) , we will use a stochastic dominance argument to upper bound the probability of event E k . First, we use conditioning again to rewrite P E k ∩ (∩ B k=1 A C k ) as P(E k ∩ (∩ B k=1 A C k )) = P(E k | ∩ B k=1 A C k )P(∩ B k=1 A C k ) ≤ P(E k | ∩ B k=1 A C k ) = P s∈D k q k,s ≥ 12β(1 + βD)qγ T | ∩ B k=1 A C k = E F k-1 :∩ B k=1 A C k P F k \F k-1 s∈D k q k,s ≥ 12β(1 + βD)qγ T | F k-1 . As discussed earlier, when conditioned on F k-1 , q k,s for s ∈ D k becomes independent Bernoulli random variable, with mean pk,s . In addition, as a result of our incremental dictionary update (line 9 in Algorithm 1), the partition in D k that were added during the k ′ -th communication for k ′ ∈ 1, . . . 
, k, which is denoted by ∆D k ′ , is sampled using qσ 2 k ′ -1 (x s ) for s ∈ ∆D k ′ . Hence, E F k \F k-1 s∈D k q k,s |F k-1 = s∈D k pk,s = k k ′ =1 s∈∆D k ′ pk ′ ,s = q k k ′ =1 s∈∆D k ′ σ2 k ′ -1 (x s ) ≤ β q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1 (x s ) = β q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) • σ 2 k ′ -1 (x s ) σ 2 k ′ -1,s-1 (x s ) ≤ β q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) • [1 + s ′ ∈∆D k ′ :s ′ ≤s-1 σ 2 k ′ -1 (x s ′ )] ≤ β q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) • [1 + s ′ ∈∆D k ′ :s ′ ≤s-1 σ 2 k ′ (c k ′ ) (x s ′ )] ≤ β q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) • [1 + β s ′ ∈∆D k ′ :s ′ ≤s-1 σ2 k ′ (c k ′ ) (x s ′ )] ≤ β(1 + βD)q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) where the imaginary variance function σ 2 k ′ -1,s-1 (•) is constructed using dataset ∪ k ′ -1 k=1 ∆D k ∪ {s ′ ∈ ∆D k ′ : s ′ ≤ s -1} (not computed in the actual algorithm); the first and forth inequality is due to Lemma A.6 as we conditioned on ∩ B k=0 A C k ; the second is due to Lemma A.5; the third is because k ′ (c k ′ ) ≤ k ′ -1 and the variance is non-increasing over time; and the fifth is due to our event-trigger design in equation 4, i.e., s∈∆D k ′ :s≤t k ′ -1 σ2 k ′ (c k ′ ) (x s ) < D. Now for each term in the summation on the RHS of the inequality above, we introduce an independent Bernoulli random variable qk,s ∼ B β(1 + βD)qσ 2 k ′ -1,s-1 (x s ) . Since qk,s stochastically dominates q k,s , i.e., E q k,s | F k-1 = pk,s ≤ β(1 + βD)qσ 2 k ′ -1,s-1 (x s ) = E qk,s , we have P s∈D k q k,s > 12β(1 + βD)qγ T | F k-1 ≤ P s∈D k qk,s > 12β(1 + βD)qγ T . Then we can further upper bound the RHS P s∈D k qk,s > 12β(1 + βD)qγ T ≤ P s∈D k qk,s > 3β(1 + βD)q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) ≤ exp -2β(1 + βD)q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) where the first inequality is because k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) ≤ 4γ T , and the second inequality is due to Lemma A.8. 
By substituting q = 4β 1+ϵ/3 ϵ 2 ln( 2 √ 2T δ ) and under the condition that k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) ≥ 1, we have exp -2β(1 + βD)q k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) ≤ exp -ln(8T 2 /δ) . To ensure k k ′ =1 s∈∆D k ′ σ 2 k ′ -1,s-1 (x s ) ≥ 1, we can set λ ≤ k(x, x), ∀x ∈ A. Finally, by summing over B terms, we have B k=0 P E k ∩ (∩ B k=0 A C k ) ≤ T exp -ln(8T 2 /δ) ≤ T • δ 8T 2 < δ 2 where the last inequality is because T ≥ 1.
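The Bernoulli tail bound of Lemma A.8, which underlies the stochastic-dominance step above, can be sanity-checked by simulation. This is an illustrative sketch only (fixed seed, so the run is deterministic); the empirical exceedance frequency should sit far below the analytic bound:

```python
import numpy as np

rng = np.random.default_rng(0)
t, p, trials = 100, 0.1, 2000
mean_sum = t * p                    # sum of the p_s
threshold = 3 * mean_sum            # Lemma A.8 threshold: 3 * sum p_s
bound = np.exp(-2 * mean_sum)       # analytic tail bound exp(-2 * sum p_s)

# empirical frequency of the event { sum_s q_s >= 3 * sum_s p_s }
samples = rng.binomial(1, p, size=(trials, t)).sum(axis=1)
freq = float(np.mean(samples >= threshold))
```

With sum of probabilities equal to 10, exceeding 30 successes is a more than six-standard-deviation event, so the empirical frequency is essentially zero, consistent with the bound.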

C PROOF OF LEMMA 4.4 IN SECTION 4.3

Recall from Section 3.2 that the approximated kernel Ridge regression estimator for θ ⋆ is defined as θk = Ṽ-1 k P S k Φ ⊤ D k y D k where Ṽk := P S k Φ ⊤ D k Φ D k P S k + λI. Then we can decompose ∥ θk -θ ⋆ ∥ 2 Ṽk = ( θk -θ ⋆ ) ⊤ Ṽk ( θk -θ ⋆ ) =( θk -θ ⋆ ) ⊤ Ṽk ( Ṽ-1 k P S k Φ ⊤ D k y D k -θ ⋆ ) =( θk -θ ⋆ ) ⊤ Ṽk [ Ṽ-1 k P S k Φ ⊤ D k (Φ D k θ ⋆ + η D k ) -θ ⋆ ] = ( θk -θ ⋆ ) ⊤ Ṽk ( Ṽ-1 k P S k Φ ⊤ D k Φ D k θ ⋆ -θ ⋆ ) A1 + ( θk -θ ⋆ ) ⊤ P S k Φ ⊤ D k η D k A2 Since Ṽk ( Ṽ-1 k P S k Φ ⊤ D k Φ D k θ ⋆ -θ ⋆ ) = P S k Φ ⊤ D k Φ D k θ ⋆ -P S k Φ ⊤ D k Φ D k P S k θ ⋆ -λθ ⋆ = P S k Φ ⊤ D k Φ D k (I -P S k )θ ⋆ -λθ ⋆ , we have A 1 =( θk -θ ⋆ ) ⊤ P S k Φ ⊤ D k Φ D k (I -P S k )θ ⋆ -λ( θk -θ ⋆ ) ⊤ θ ⋆ =( θk -θ ⋆ ) ⊤ Ṽ1/2 k Ṽ-1/2 k P S k Φ ⊤ D k Φ D k (I -P S k )θ ⋆ -λ( θk -θ ⋆ ) ⊤ Ṽ1/2 k Ṽ-1/2 k θ ⋆ ≤∥ θk -θ ⋆ ∥ Ṽk ∥ Ṽ-1/2 k P S k Φ ⊤ D k Φ D k (I -P S k )θ ⋆ ∥ + λ∥θ ⋆ ∥ Ṽ-1 k ≤∥ θk -θ ⋆ ∥ Ṽk ∥ Ṽ-1/2 k P S k Φ ⊤ D k ∥∥Φ D k (I -P S k )∥∥θ ⋆ ∥ + √ λ∥θ ⋆ ∥ ≤∥ θk -θ ⋆ ∥ Ṽk ∥Φ D k (I -P S k )∥ + √ λ ∥θ ⋆ ∥ where the first inequality is due to Cauchy Schwartz, and the last inequality is because ∥ Ṽ-1/2 k P S k Φ ⊤ D k ∥ = Φ D k P S k (P S k Φ ⊤ D k Φ D k P S k + λI) -1 P S k Φ ⊤ D k ≤ 1. Then by definition of the spectral norm ∥•∥, and the properties of the orthogonal projection matrix P S k , we have ∥Φ D k (I -P S k )∥ = λ max Φ D k (I -P S k ) 2 Φ ⊤ D k = λ max Φ D k (I -P S k )Φ ⊤ D k . Moreover, due to Lemma 4.1, S k is ϵ-accurate w.r.t. D k , for all k, so we have I -P S k ⪯ λ 1-ϵ (Φ ⊤ D k Φ D k + λI) -1 by the property of ϵ-accuracy (Proposition 10 of Calandriello et al. (2019) ). Substituting this to RHS of the equality above, we have ∥Φ D k (I -P S k )∥ ≤ λ 1 -ϵ λ max Φ D k (Φ ⊤ D k Φ D k + λI) -1 Φ ⊤ D k ≤ λ 1 -ϵ . Therefore, A 1 ≤ ∥ θk -θ ⋆ ∥ Ṽk 1 1-ϵ + 1 √ λ∥θ ⋆ ∥. 
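The operator-norm step used above, that the norm of the whitened projected feature matrix is at most 1, follows from the ordering $P_{S_k}\Phi_{D_k}^\top\Phi_{D_k}P_{S_k} \preceq \tilde{V}_k$. A small numerical check with a random orthogonal projection standing in for $P_{S_k}$ and linear features standing in for $\phi$ (illustrative names and dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 20, 6, 1.0
Phi = rng.standard_normal((n, d))

# random rank-3 orthogonal projection P (stand-in for P_{S_k})
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
P = Q @ Q.T

Vt = P @ (Phi.T @ Phi) @ P + lam * np.eye(d)   # \tilde V_k
w, U = np.linalg.eigh(Vt)
Vt_inv_half = U @ np.diag(w ** -0.5) @ U.T     # \tilde V_k^{-1/2}
op_norm = np.linalg.norm(Vt_inv_half @ P @ Phi.T, 2)
```

Since $P \Phi^\top \Phi P = \tilde{V} - \lambda I \preceq \tilde{V}$, whitening by $\tilde{V}^{-1/2}$ keeps the spectral norm at or below 1.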
Similarly, by applying the Cauchy–Schwarz inequality to term $A_2$, we have
$$A_2 = (\tilde{\theta}_k - \theta_\star)^\top \tilde{V}_k^{1/2}\tilde{V}_k^{-1/2}P_{S_k}\Phi_{D_k}^\top \eta_{D_k} \leq \|\tilde{\theta}_k - \theta_\star\|_{\tilde{V}_k}\,\|\tilde{V}_k^{-1/2}P_{S_k}V_k^{1/2}\|\,\|V_k^{-1/2}\Phi_{D_k}^\top \eta_{D_k}\|,$$
where $V_k := \Phi_{D_k}^\top\Phi_{D_k} + \lambda I$. Note that $P_{S_k}V_k P_{S_k} = P_{S_k}(\Phi_{D_k}^\top\Phi_{D_k} + \lambda I)P_{S_k} = \tilde{V}_k + \lambda(P_{S_k} - I)$ and $P_{S_k} \preceq I$, so we have
$$\|\tilde{V}_k^{-1/2}P_{S_k}V_k^{1/2}\|^2 = \|\tilde{V}_k^{-1/2}P_{S_k}V_k^{1/2}V_k^{1/2}P_{S_k}\tilde{V}_k^{-1/2}\| \leq \|\tilde{V}_k^{-1/2}(\tilde{V}_k + \lambda(P_{S_k} - I))\tilde{V}_k^{-1/2}\| = \|I + \lambda \tilde{V}_k^{-1/2}(P_{S_k} - I)\tilde{V}_k^{-1/2}\| \leq 1 + \lambda\|\tilde{V}_k^{-1}\|\|P_{S_k} - I\| \leq 1 + \lambda\cdot\lambda^{-1}\cdot 1 = 2,$$
and thus $A_2 \leq \sqrt{2}\,\|\tilde{\theta}_k - \theta_\star\|_{\tilde{V}_k}\,\|V_k^{-1/2}\Phi_{D_k}^\top \eta_{D_k}\|$. As mentioned by He et al. (2022), the standard self-normalized bound for vector-valued martingales cannot be directly applied to bound the term $\|V_k^{-1/2}\Phi_{D_k}^\top \eta_{D_k}\|$, since $D_k$ is constructed from the data that each client has uploaded so far during the event-triggered communications. Therefore, in the following paragraphs, we bound this term by extending their results to the kernel bandit problem considered in our paper. We first need to establish the following lemma. Lemma C.1. Let us denote $V_k(i) = \sum_{s \in N_{t_{k(i)}}(i)} \phi(x_s)\phi(x_s)^\top$, such that $V_k = \lambda I + \sum_{i=1}^N V_k(i)$, and then denote the covariance matrix for client $i$'s data that has not been uploaded to the server by time step $t_k$ as $\Delta V_k(i) = \sum_{s \in N_{t_k}(i)\setminus N_{t_{k(i)}}(i)} \phi(x_s)\phi(x_s)^\top$ for $i \in [N]$. Then we have $V_k \succeq \frac{1}{\beta D}\Delta V_k(i)$, and for all $x \in \mathbb{R}^d$,
$$\frac{\phi(x)^\top V_k^{-1}\phi(x)}{\phi(x)^\top(\Phi_{[t_k]}^\top\Phi_{[t_k]} + \lambda I)^{-1}\phi(x)} \leq 1 + N\beta D.$$
Bounding $\|V_k^{-1/2}\Phi_{D_k}^\top \eta_{D_k}\|$: Recall that $D_k$ contains the data points that the $N$ clients have uploaded up to the $k$-th communication, i.e., $D_k = \cup_{i=1}^N N_{t_{k(i)}}(i)$, where $t_{k(i)}$ denotes the time step of client $i$'s last communication with the server.
Therefore, we have the following decomposition V -1/2 k Φ ⊤ D k η D k = N i=1 V -1/2 k Φ ⊤ Nt k(i) (i) η Nt k(i) (i) = N i=1 V -1/2 k Φ ⊤ Nt k(i) (i) η Nt k(i) (i) + Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) - N i=1 V -1/2 k Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) = V -1/2 k Φ ⊤ [t k ] η [t k ] - N i=1 V -1/2 k Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) . Then using triangle inequality, we have ∥V -1/2 k Φ ⊤ D k η D k ∥ ≤ ∥V -1/2 k Φ ⊤ [t k ] η [t k ] ∥ + N i=1 ∥V -1/2 k Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) ∥. We can bound ∥V -1/2 k Φ ⊤ [t k ] η [t k ] ∥ as ∥V -1/2 k Φ ⊤ [t k ] η [t k ] ∥ = ∥Φ ⊤ [t k ] η [t k ] ∥ V -1 k ≤ ∥Φ ⊤ [t k ] η [t k ] ∥ (Φ ⊤ [t k ] Φ [t k ] +λI) -1 1 + N Dβ ≤ 1 + N DβR 2 ln(1/δ) + ln(det(K [T ],[T ] /λ + I)), with probability at least 1 -δ, where the first inequality is due to Lemma C.1, and the second inequality is due to the standard self-normalized bound for kernelized contextual bandit, e.g., Lemma B.3. of Li et al. (2022) . Then we can bound ∥V -1/2 k Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) ∥ as ∥V -1/2 k Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) ∥ ≤ 2Dβ∥ DβλI + Φ ⊤ Nt k (i)\Nt k(i) (i) Φ Nt k (i)\Nt k(i) (i) -1/2 Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) ∥ = 2Dβ∥Φ ⊤ Nt k (i)\Nt k(i) (i) η Nt k (i)\Nt k(i) (i) ∥ DβλI+Φ ⊤ N t k (i)\N t k(i) (i) Φ N t k (i)\N t k(i) (i) -1 ≤ 2DβR 2 ln(1/δ) + ln(det(K [T ],[T ] /(Dβλ) + I)) where the first inequality is because V k = λI + Φ ⊤ D k Φ D k ⪰ 1 Dβ Φ ⊤ Nt k (i)\Nt k(i) (i) Φ Nt k (i)\Nt k(i) (i) due to equation 8 in Lemma C.1, so V k = λI + Φ ⊤ D k Φ D k ⪰ 1 2Dβ (DβλI + Φ ⊤ Nt k (i)\Nt k(i) (i) Φ Nt k (i)\Nt k(i) (i) ) , and the second inequality is again obtained using the standard self-normalized bound. Putting everything together, we have ∥ θk -θ ⋆ ∥ Ṽk ≤ ( 1/(1 -ϵ) + 1) √ λ∥θ ⋆ ∥ + 2 1 + N Dβ + N 2Dβ R ln(1/δ) + γ T , where γ T := max D⊂A:|D|=T  V k ⪰ 1 βD ∆V k (i) for all i ∈ [N ]. For client c k , V k ⪰ 0 = 1 βD ∆V k (c k ). 
For client $i \neq c_k$, we have
$$\frac{\phi(x)^\top V_{k(i)}^{-1}\phi(x)}{\phi(x)^\top\big(V_{k(i)} + \Delta V_k(i)\big)^{-1}\phi(x)} \leq 1 + \sum_{s \in N_{t_k}(i)\setminus N_{t_{k(i)}}(i)} \phi(x_s)^\top V_{k(i)}^{-1}\phi(x_s) = 1 + \sum_{s \in N_{t_k}(i)\setminus N_{t_{k(i)}}(i)} \sigma^2_{k(i)}(x_s) \leq 1 + \beta \sum_{s \in N_{t_k}(i)\setminus N_{t_{k(i)}}(i)} \tilde{\sigma}^2_{k(i)}(x_s) \leq 1 + \beta D,$$
where the first inequality is due to Lemma A.5, the second is due to the property of $\epsilon$-accuracy in Lemma A.6, and the third is due to our event-trigger design in equation 4. This implies $V_{k(i)}^{-1} \preceq (1 + \beta D)\big(V_{k(i)} + \Delta V_k(i)\big)^{-1}$. Then due to Lemma A.9, we have $(1 + \beta D)V_{k(i)} \succeq V_{k(i)} + \Delta V_k(i)$, and thus $V_{k(i)} \succeq \frac{1}{\beta D}\Delta V_k(i)$. In addition, since $k(i) < k$ for all $i \neq c_k$, we have $V_k \succeq V_{k(i)} \succeq \frac{1}{\beta D}\Delta V_k(i)$. By averaging equation 8 over all $N$ clients, we have $V_k \succeq \frac{1}{N\beta D}\sum_{i=1}^N \Delta V_k(i)$, and thus we have $\Phi_{[t_k]}^\top\Phi_{[t_k]} + \lambda I = V_k + \sum_{i=1}^N \Delta V_k(i) \preceq (1 + N\beta D)V_k$. Using Lemma A.9 again finishes the proof.

D PROOF OF THEOREM 4.5 IN SECTION 4.3

D.1 COMMUNICATION COST

Recall from Section 4.1 that $D_k$ is the set of time indices for the data points that are used to construct the embedded statistics on the server at the $k$-th communication round, for $k = 1, \ldots, B$. We denote the corresponding (exact) covariance matrix as $V_k = \lambda I + \Phi_{D_k}^\top\Phi_{D_k} \in \mathbb{R}^{p \times p}$, with $V_0 = \lambda I$, and the kernel matrix as $K_{D_k,D_k} = \Phi_{D_k}\Phi_{D_k}^\top \in \mathbb{R}^{|D_k| \times |D_k|}$. Similar to He et al. (2022), by defining $k_p = \min\{k \in [B] \mid \det(I + \lambda^{-1}K_{D_k,D_k}) \geq 2^p\}$, we have $\log\big(\det(I + \lambda^{-1}K_{D_{k_{p+1}},D_{k_{p+1}}})/\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}})\big) \geq 1$ for each $p \geq 0$. We call the sequence of time steps in-between $t_{k_p}$ and $t_{k_{p+1}}$ an epoch, and denote the total number of epochs as $P$. Note that since
$$\log\frac{\det(I + \lambda^{-1}K_{D_{k_1},D_{k_1}})}{\det(I)} + \log\frac{\det(I + \lambda^{-1}K_{D_{k_2},D_{k_2}})}{\det(I + \lambda^{-1}K_{D_{k_1},D_{k_1}})} + \cdots + \log\frac{\det(I + \lambda^{-1}K_{D_{k_P},D_{k_P}})}{\det(I + \lambda^{-1}K_{D_{k_{P-1}},D_{k_{P-1}}})} \leq \log\det(I + \lambda^{-1}K_{[T],[T]}) \leq 2\gamma_T,$$
there can be at most $2\gamma_T$ terms, i.e., $P \leq 2\gamma_T$. Since there are at most $2\gamma_T$ epochs, and by Lemma D.1 each epoch contains at most $N + 4\beta/D$ communications, the total number of communications satisfies $B \leq 2\gamma_T(N + 4\beta/D)$. Moreover, by Lemma 4.1, we know that during each communication, the size of the data being communicated is $O(\log^2(T)\gamma_T^2)$. Hence, with $D = \frac{1}{N^2}$, $C_T = O(N^2\gamma_T^3\log^2(T))$. Proof of Lemma D.1. Consider the epoch $[t_{k_p}, t_{k_{p+1}} - 1]$ for some $p = 0, 1, \ldots, P$. We denote the total number of communications in this epoch as $Q_p$, and the total number of communications in this epoch that are triggered by client $i$ as $Q_{p,i}$ for $i \in [N]$, i.e., $Q_p = \sum_{i=1}^N Q_{p,i}$. Let us denote the indices associated with the communications triggered by some client $i$ as $\kappa_1, \kappa_2, \ldots, \kappa_{Q_{p,i}} \in [k_p, k_{p+1} - 1]$. Then for each $j = 2, 3, \ldots$
, $Q_{p,i}$, i.e., excluding client $i$'s first communication in this epoch, due to our event-trigger design in equation 4, we have
$$\beta\sum_{s \in \Delta D_{\kappa_j}} \sigma^2_{k_p}(x_s) \geq \beta\sum_{s \in \Delta D_{\kappa_j}} \sigma^2_{\kappa_{j-1}}(x_s) \geq \sum_{s \in \Delta D_{\kappa_j}} \tilde{\sigma}^2_{\kappa_{j-1}}(x_s) > D,$$
where the first inequality is because, by definition of $\kappa_{j-1}$, we have $\kappa_{j-1} \geq k_p$, so $\sigma^2_{\kappa_{j-1}}(x) \leq \sigma^2_{k_p}(x)$ for all $x$, and the second inequality is due to Lemma A.6. Therefore, we have $\sum_{s \in \Delta D_{\kappa_j}} \sigma^2_{k_p}(x_s) \geq D/\beta$. Since $\sigma^2_{k_p}(x) = \|\phi(x)\|^2_{V_{k_p}^{-1}}$, we have
$$D/\beta \leq \sum_{s \in \Delta D_{\kappa_j}} \|\phi(x_s)\|^2_{V_{k_p}^{-1}} \leq 4\log\frac{\det(I + \lambda^{-1}K_{D_{k_p}\cup\Delta D_{\kappa_j}, D_{k_p}\cup\Delta D_{\kappa_j}})}{\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}})} \leq -4 + 4\,\frac{\det(I + \lambda^{-1}K_{D_{k_p}\cup\Delta D_{\kappa_j}, D_{k_p}\cup\Delta D_{\kappa_j}})}{\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}})},$$
where the second inequality is by definition of an epoch, i.e., $\det(I + \lambda^{-1}K_{D_{k_{p+1}-1},D_{k_{p+1}-1}})/\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}}) \leq 2$, combined with Lemma A.4, and the third is because $\log(x) \leq x - 1$ for $x > 0$. Hence, we have
$$\frac{\det(I + \lambda^{-1}K_{D_{k_p}\cup\Delta D_{\kappa_j}, D_{k_p}\cup\Delta D_{\kappa_j}})}{\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}})} \geq 1 + \frac{D}{4\beta},$$
and thus $\det(I + \lambda^{-1}K_{D_{k_p}\cup\Delta D_{\kappa_j}, D_{k_p}\cup\Delta D_{\kappa_j}}) - \det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}}) \geq \frac{D}{4\beta}\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}})$. Moreover, letting $\kappa'_1, \ldots, \kappa'_{Q_p}$ denote the indices of all communications in this epoch, if client $c_{\kappa'_j}$ has already communicated with the server earlier in this epoch, we have
$$\det(I + \lambda^{-1}K_{D_{\kappa'_j},D_{\kappa'_j}}) - \det(I + \lambda^{-1}K_{D_{\kappa'_{j-1}},D_{\kappa'_{j-1}}}) = \det(I + \lambda^{-1}K_{D_{\kappa'_{j-1}}\cup\Delta D_{\kappa'_j}, D_{\kappa'_{j-1}}\cup\Delta D_{\kappa'_j}}) - \det(I + \lambda^{-1}K_{D_{\kappa'_{j-1}},D_{\kappa'_{j-1}}}) \geq \det(I + \lambda^{-1}K_{D_{k_p}\cup\Delta D_{\kappa'_j}, D_{k_p}\cup\Delta D_{\kappa'_j}}) - \det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}}) \geq \frac{D}{4\beta}\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}}),$$
where the first inequality is obtained via the matrix determinant lemma and Lemma A.10, and the second is due to the inequality we derived above.
Summing over all communications in this epoch, we have
$$\det(I + \lambda^{-1}K_{D_{k_{p+1}-1},D_{k_{p+1}-1}}) - \det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}}) = \sum_{j=1}^{Q_p}\Big[\det(I + \lambda^{-1}K_{D_{\kappa'_j},D_{\kappa'_j}}) - \det(I + \lambda^{-1}K_{D_{\kappa'_{j-1}},D_{\kappa'_{j-1}}})\Big] \geq \sum_{i=1}^N (Q_{p,i} - 1)\,\frac{D}{4\beta}\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}}),$$
and since $\det(I + \lambda^{-1}K_{D_{k_{p+1}-1},D_{k_{p+1}-1}})/\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}}) \leq 2$ by our definition of an epoch, we have
$$1 + \frac{D}{4\beta}\sum_{i=1}^N (Q_{p,i} - 1) \leq \frac{\det(I + \lambda^{-1}K_{D_{k_{p+1}-1},D_{k_{p+1}-1}})}{\det(I + \lambda^{-1}K_{D_{k_p},D_{k_p}})} \leq 2,$$
so $Q_p = \sum_{i=1}^N Q_{p,i} \leq N + \frac{4\beta}{D}$, which finishes the proof.

D.2 CUMULATIVE REGRET

To facilitate the regret analysis of Async-KernelUCB, we need to introduce some additional notations. For the $k$-th communication, triggered by client $c_k$ at time step $t_k$, let $P_k$ denote the set of time steps in-between (but not including) the $k$-th communication and client $c_k$'s next communication at which $c_k$ is active, so that $\Delta D_{\tilde{k}(c_k)} = N_{\tilde{t}_{k(c_k)}}(c_k) \setminus N_{t_k}(c_k) = P_k \cup \{\tilde{t}_{k(c_k)}\}$. We also define $P_0$ as the union over the sets of time steps before the first communication of each client $i \in [N]$. Therefore, we have $\cup_{k=0}^B P_k \cup \{t_k\}_{k\in[B]} = [T]$. Since in Algorithm 1 the approximated mean and variance of each client only get updated when it triggers communication, and then remain fixed until after its next communication, we have that all the interactions in $P_k \cup \{\tilde{t}_{k(c_k)}\}$ are based on the same $\{\tilde{\mu}_k(\cdot), \tilde{\sigma}_k(\cdot)\}$, for $k = 0, 1, \ldots, B$. In addition, an important observation is that, based on our event-trigger in equation 4, we have $\sum_{s \in P_k} \tilde{\sigma}_k^2(x_s) \leq D$ and $\sum_{s \in P_k} \tilde{\sigma}_k^2(x_s) + \tilde{\sigma}_k^2(x_{\tilde{t}_{k(c_k)}}) > D$. Now we are ready to upper bound the cumulative regret. Consider some time step $t \in P_k \cup \{\tilde{t}_{k(c_k)}\}$. Due to our arm selection rule (line 5 of Algorithm 1), we have $x_t = \arg\max_{x \in A_t} \tilde{\mu}_k(x) + \alpha\tilde{\sigma}_k(x)$. Combining this with Lemma 4.4, with probability at least $1 - \delta$, we have $f(x_t^\star) \leq \tilde{\mu}_k(x_t^\star) + \alpha\tilde{\sigma}_k(x_t^\star) \leq \tilde{\mu}_k(x_t) + \alpha\tilde{\sigma}_k(x_t)$ and $f(x_t) \geq \tilde{\mu}_k(x_t) - \alpha\tilde{\sigma}_k(x_t)$, where $x_t^\star := \arg\max_{x \in A_t} f(x) = \arg\max_{x \in A_t} \phi(x)^\top\theta_\star$ is the optimal arm at time step $t$, and thus $r_t = f(x_t^\star) - f(x_t) \leq 2\alpha\tilde{\sigma}_k(x_t)$. The cumulative regret $R_T$ can be rewritten as
$$R_T = \sum_{k=0}^B \sum_{s \in P_k} r_s + \sum_{k=1}^B r_{t_k} \leq \sum_{k=0}^B \sum_{s \in P_k} \min\big(2LS,\, 2\alpha\tilde{\sigma}_k(x_s)\big) + \sum_{k=1}^B \min\big\{2LS,\, 2\alpha\tilde{\sigma}_{k(c_k)}(x_{t_k})\big\}.$$
Bounding first term: To bound the first term, we introduce an imaginary variance function $\sigma^2_{k,s-1}(\cdot)$ (not computed in the actual algorithm) for $s \in P_k$ and $k = 0, 1, \ldots, B$, which is constructed using the dataset $\cup_{k'=0}^{k-1} P_{k'} \cup \{s' \in P_k : s' \leq s-1\}$. In the following paragraph, we will bound the first term by showing that $\sum_{k=0}^B \sum_{s \in P_k} \tilde{\sigma}_k^2(x_s)$ is not too much larger than $\sum_{k=0}^B \sum_{s \in P_k} \sigma^2_{k,s-1}(x_s)$. This requires us to bound the ratio $\frac{\sigma_k^2(x_s)}{\sigma^2_{k,s-1}(x_s)}$ for $s \in P_k$ and $k = 0, 1, \ldots, B$.
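The UCB arm-selection rule invoked above, picking the arm maximizing the approximated mean plus an exploration bonus, can be sketched in a few lines. This is a hedged illustration with a hypothetical name (`ucb_select`) and plain linear features standing in for the Nyström-embedded kernel statistics the actual algorithm maintains:

```python
import numpy as np

def ucb_select(arms, Phi, y, lam=1.0, alpha=1.0):
    """Return the index of argmax_x mu(x) + alpha*sigma(x) under ridge
    regression with linear features (a stand-in for the embedded kernel
    statistics used by the actual algorithm)."""
    d = arms.shape[1]
    V = lam * np.eye(d) + Phi.T @ Phi
    theta = np.linalg.solve(V, Phi.T @ y)          # ridge estimate
    mu = arms @ theta                              # predicted rewards
    # per-arm posterior std: sqrt(phi^T V^{-1} phi)
    sigma = np.sqrt(np.einsum('ij,ij->i', arms, np.linalg.solve(V, arms.T).T))
    return int(np.argmax(mu + alpha * sigma))
```

A larger exploration weight alpha steers the selection toward arms in directions the data has not yet covered.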
Recall that σ 2 k (•) is constructed using data points that N clients have uploaded to the server up to the k- th communication, i.e., D k = ∪ N i=1 N t k(i) (t k ), which is a subset of D k ∪ ∪ N i=1 ∆Dk (i) = D k ∪ ∪ N i=1 P k(i) ∪ {tk (i) } . However, as shown in equation 9, the event-trigger cannot be directly used to upper bound the summation of approximated variances in P k(i) ∪ {tk (i) }, but can be used to upper bound that in P k(i) , which is why we construct the imaginary variance function without using data points with time indices {t k } k∈ [B] . Specifically, using the notations we just introduced, we can rewrite the variance as σ 2 k (x) = ϕ(x) ⊤ Φ ⊤ D k ΦD k + λI -1 ϕ(x) σ 2 k,s-1 (x) = ϕ(x) ⊤ Φ ⊤ D k \{t k ′ } k ′ ∈[k] Φ D k \{t k ′ } k ′ ∈[k] + λI + i̸ =c k Φ ⊤ P k(i) ΦP k(i) + Φ ⊤ {s ′ ∈P k :s ′ ≤s-1} Φ {s ′ ∈P k :s ′ ≤s-1} -1 ϕ(x) ≥ ϕ(x) ⊤ Φ ⊤ D k ΦD k + λI + i̸ =c k Φ ⊤ P k(i) ΦP k(i) + Φ ⊤ {s ′ ∈P k :s ′ ≤s-1} Φ {s ′ ∈P k :s ′ ≤s-1} -1 ϕ(x) The following lemma provides a upper bound for this ratio. with probability at least 1 -2δ, where the first inequality is due to Cauchy-Schwarz, and second is due to the property of ϵ-accuracy in Lemma A.6, the third is due to Lemma D.2, the forth is by definition of maximum information gain γ T , and the last is by substituting α defined in Lemma 4.4. Bounding second term: For the second term B k=1 min{2LS, 2ασ k(c k ) (x t k )}, we should note that σk(c k ) (•) is the approximated variance function that client c k received during its last communication with the server, instead of σ k-1 (•) as in our proof of Lemma 4.3 when bounding the size of dictionary. Ideally, we want to relate each σ k(c k ) (•) to σ k (•) and then apply the elliptical potential argument, but as we do not make any assumption on how frequent client arrives, it is possible that for clients who show up infrequently, these two functions are very different. 
However, by using the epoch argument as in the proof for communication cost, we can show that this undesirable situation only occurs at most 2γ T times. Specifically, recall that V k = λI + Φ ⊤ D k Φ D k , with V 0 = λI, and kernel matrix as K D k ,D k = Φ D k Φ ⊤ D k ∈ R |D k |×|D k | . We define k p = min{k ∈ [B] | det(I + λ -1 K D k ,D k ) ≥ 2 p )}, such that log det(I + λ -1 K D k p+1 ,D k p+1 )/ det(I + λ -1 K D kp ,D kp ) ≥ 1 for each p ≥ 0. We call the sequence of time steps in-between t kp and t kp+1 an epoch, and denote the total number of epochs as P . As shown in the proof for communication cost, we have P ≤ 2γ T . Consider the epoch [t kp , t kp+1 -1] for some p = 0, 1, . . . , P . We denote the total number of communications in this epoch that are triggered by client i as Q p,i for i ∈ [N ], and the indices associated with these communications triggered by client i as κ 1 , κ 2 , . . . , κ Qp,i ∈ [k p , k p+1 -1]. As mentioned above, the approximated variance used during arm selection at t κ1 , i.e, σ 2 κ1(cκ 1 ) (•) could be from a very long time ago. Therefore, we simply bound its regret by 2LS, and in total, there can be at most 2γ T N such terms for all N clients, leading to a upper bound of 4N γ T LS. Now we only need to be concerned about the communications at j = 2, 3, . . . , Q p,i , and show that σ 2 κj (cκ j ) (x) is close to σ 2 κj (x) for all x. Specifically, we have σ 2 κj (cκ j ) (x) = σ 2 κj-1 (x) = σ 2 κj (x) σ 2 κj-1 (x) σ 2 κj (x) ≤ 2σ 2 κj (x), where the first equality is because by definition κ j (c κj ) = κ j-1 , the first inequality is because σ 2 κj-1 (x)/σ 2 κj (x) ≤ det(I + λ -1 K D k p+1 -1 ,D k p+1 -1 )/ det(I + λ -1 K D kp ,D kp ) ≤ 2 due to Lemma A.3, Lemma A.9 and the definition of epoch. 
Therefore, further applying Cauchy-Schwarz and the ϵ-accuracy property in Lemma A.6, the second term can be bounded by B k=1 min{2LS, 2ασ k(c k ) (x t k )} ≤ 4N γ T LS + 2α 2Bβ B k=1 σ 2 k (x t k ) ≤ 4N γ T LS + 2α 2Bβ B k=1 σ 2 k-1,t k -1 (x t k ) < 4N γ T LS + 2α 2Bβ B k=1 s∈∆D k σ 2 k-1,s-1 (x s ) ≤ 4N γ T LS + 4α 2T βγ T where the imaginary variance function σ 2 k-1,s-1 (•) is constructed using dataset ∪ k-1 k ′ =1 ∆D k ′ ∪ {s ′ ∈ ∆D k : s ′ ≤ s -1}, the second inequality is because variance is non-increasing over time, the third is because variances are positive, and the last is due to definition of maximum information gain γ T and that B ≤ T . Putting upper bounds for the first and second term together, we have R T ≤ 4N γ T LS + 4 √ 2 (1/ √ 1 -ϵ + 1) √ λS + 2R √ 1 + N Dβ + N √ 2Dβ ln(1/δ) + γ T T β(1 + N βD)γ T . Proof of Lemma D.2. We denote V k = λI + Φ ⊤ D k Φ D k , ∆V k,s-1 (i) = Φ ⊤ P k(i) Φ P k(i) for i ̸ = c k and ∆V k,s-1 (c k ) = Φ ⊤ {s ′ ∈P k :s ′ ≤s-1} Φ {s ′ ∈P k :s ′ ≤s-1} . In the following, we first show that V k ⪰ 1 βD ∆V k,s-1 (i) for all i ∈ [N ]. Note that for any client i ̸ = c k , we have x ⊤ V -1 k(i) x x ⊤ V k(i) + ∆V k,s-1 (i) -1 x ≤ 1 + s∈P k(i) x ⊤ s V -1 k(i) x s = 1 + s∈P k(i) σ 2 k(i) (x s ) ≤ 1 + β s∈P k(i) σ2 k(i) (x s ) ≤ 1 + βD, where the first inequality is due to Lemma A.5, the second inequality is due to Lemma 4.1 and Lemma A.6, and the last inequality is due to equation 9. This implies V -1 k(i) ⪯ (1 + βD) V k(i) + ∆V k,s-1 (i) -1 . Then due to Lemma A.9, we have (1 + βD)V k(i) ⪰ V k(i) + ∆V k,s-1 (i), and thus V k(i) ⪰ 1 βD ∆V k,s-1 (i). Moreover, since k(i) < k, ∀i ̸ = c k , we have V k ⪰ V k(i) ⪰ 1 βD ∆V k,s-1 (i). Similarly for client c k , we have x ⊤ V -1 k x x ⊤ V k + ∆V k,s-1 (c k ) -1 x ≤ 1 + s ′ ∈P k :s ′ ≤s-1 σ 2 k (x s ′ ) ≤ 1 + βD. Again, this implies V k ⪰ 1 βD ∆V k,s-1 (c k ), which finishes the proof of equation 10. 
By averaging equation 10 over all $N$ clients, we have $V_k \succeq \frac{1}{N\beta D}\sum_{i=1}^N \Delta V_{k,s-1}(i)$, and thus $V_k + \sum_{i=1}^N \Delta V_{k,s-1}(i) \preceq (1 + N\beta D)V_k$. Using Lemma A.9 again, we have $(1 + N\beta D)\big(V_k + \sum_{i=1}^N \Delta V_{k,s-1}(i)\big)^{-1} \succeq V_k^{-1}$. Therefore, we have
$$\frac{\sigma_k^2(x)}{\sigma_{k,s-1}^2(x)} \leq \frac{\phi(x)^\top V_k^{-1}\phi(x)}{\phi(x)^\top\big(V_k + \sum_{i=1}^N \Delta V_{k,s-1}(i)\big)^{-1}\phi(x)} \leq 1 + N\beta D.$$
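The averaging step above — each per-client ordering $V \succeq \frac{1}{c}\Delta V_i$ implying $V + \sum_i \Delta V_i \preceq (1 + Nc)V$ — can be checked numerically. In this illustrative sketch, the constant $c$ plays the role of $\beta D$ and the matrices are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(11)
d, N, c = 4, 3, 0.5
V = 2.0 * np.eye(d)

def delta_psd():
    # random PSD matrix rescaled so that (1/c) * M <= V, i.e. V >= (1/c) M
    M = rng.standard_normal((d, d)); M = M @ M.T
    return M * (1.9 * c / np.linalg.eigvalsh(M).max())

deltas = [delta_psd() for _ in range(N)]
# (1 + N*c) * V - (V + sum of deltas) should be positive semi-definite
gap = (1 + N * c) * V - (V + sum(deltas))
min_eig = float(np.linalg.eigvalsh(gap).min())
```

Each rescaled perturbation satisfies the per-client ordering by construction, so their sum is dominated by $NcV$ and the gap matrix has no negative eigenvalues (up to numerical tolerance).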



Throughout this paper, we will often use the set of indices D (or S) to refer to the actual dataset {x_s, y_s}_{s∈D} (or dictionary {x_s, y_s}_{s∈S}) for simplicity. As discussed in Li et al. (2022), γ_T is problem-dependent, reflecting how fast the kernel's eigenvalues decay. For kernels with exponentially decaying eigenvalues, i.e., λ_m = O(exp(-m^{β_e})) for β_e > 0, we have γ_T = O(log^{1+1/β_e}(T)); this includes the Gaussian kernel that is widely used for GPs and SVMs. For kernels with polynomially decaying eigenvalues, i.e., λ_m = O(m^{-β_p}) for β_p > 1, we have γ_T = O(T^{1/β_p} log^{1-1/β_p}(T)).



Figure 2: Experiment results on synthetic and real-world datasets.

Lemma A.9 (Corollary 7.7.4.(a) of Horn & Johnson (2012)). Let $A$, $B$ be positive definite matrices such that $A \succeq B$. Then we have $A^{-1} \preceq B^{-1}$. Lemma A.10 (Lemma 2.2 of Tie et al. (2011)). For any positive semi-definite matrices $A$, $B$ and $C$, it holds that $\det(A + B + C) + \det(A) \geq \det(A + B) + \det(A + C)$.
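Both lemmas admit quick numerical sanity checks (illustrative only; random positive (semi-)definite matrices with a fixed seed):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 5

def rand_psd():
    M = rng.standard_normal((d, d))
    return M @ M.T

B = rand_psd() + np.eye(d)   # positive definite
A = B + rand_psd()           # A >= B in the Loewner order

# Lemma A.9: A >= B implies B^{-1} - A^{-1} is PSD.
inv_gap = np.linalg.inv(B) - np.linalg.inv(A)
min_eig = float(np.linalg.eigvalsh(inv_gap).min())

# Lemma A.10: det(A+B+C) + det(A) >= det(A+B) + det(A+C).
C1, C2 = rand_psd(), rand_psd()
lhs = np.linalg.det(A + C1 + C2) + np.linalg.det(A)
rhs = np.linalg.det(A + C1) + np.linalg.det(A + C2)
```

Matrix inversion is operator antitone on positive definite matrices, which is what the first check exercises; the second is the supermodularity-type determinant inequality used in the epoch-counting argument.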

Consider the $k$-th communication for $k \in [B]$. We denote the client who triggers the $k$-th communication as $c_k \in [N]$, and the time step when the $k$-th communication happens as $t_k \in [T]$. In addition, recall that we denote the sequence of time steps in-between client $c_k$'s last communication (whose index is denoted as $k(c_k) \in [0, k-1]$) and the current (the $k$-th) communication at which client $c_k$ is active as $\Delta D_k := N_{t_k}(c_k) \setminus N_{t_{k(c_k)}}(c_k) = \{t_{k(c_k)} < s \leq t_k : i_s = c_k\}$.

where $\gamma_T := \max_{D \subset A : |D| = T} \log\det(K_{D,D}/(D\beta\lambda) + I)$. Proof of Lemma C.1. Note that by definition, $\Delta V_k(c_k) = 0$, where $c_k \in [N]$ is the index of the client who triggers the $k$-th communication. In the following, we first show that $V_k \succeq \frac{1}{\beta D}\Delta V_k(i)$ for all $i \in [N]$.

for all $j = 2, 3, \ldots, Q_{p,i}$ and all clients $i \in [N]$. Denote the indices associated with the communications of all clients in this epoch as $\kappa'_1, \kappa'_2, \ldots, \kappa'_{Q_p} \in [k_p, k_{p+1} - 1]$. Then for each $j \in [Q_p]$, if client $c_{\kappa'_j}$ has already communicated with the server earlier in this epoch, the determinant difference satisfies the bound displayed above.

First, let us denote the client who triggers the $k$-th communication as $c_k \in [N]$, the index of its next communication as $\tilde{k}(c_k)$, and the time step when the $\tilde{k}(c_k)$-th communication happens as $\tilde{t}_{k(c_k)}$ (with $\tilde{t}_{k(c_k)} = T$ if $k$ is client $c_k$'s final communication with the server). Then we denote the set of time steps in-between (but not including) the current (the $k$-th) communication and client $c_k$'s next communication at which client $c_k$ is active as $P_k := \{t_k < s < \tilde{t}_{k(c_k)} : i_s = c_k\}$, and thus by definition $\Delta D_{\tilde{k}(c_k)} = N_{\tilde{t}_{k(c_k)}}(c_k) \setminus N_{t_k}(c_k) = P_k \cup \{\tilde{t}_{k(c_k)}\}$.

Lemma D.2 (Bounding $\sigma_k^2(x_s)/\sigma_{k,s-1}^2(x_s)$). Under the same condition as Lemma 4.1, with communication threshold $D$, we have for all $k, s$ that $\sigma_k^2(x_s)/\sigma_{k,s-1}^2(x_s) \leq 1 + N\beta D$. With Lemma D.2, we can bound the first term as
$$\sum_{k=0}^B \sum_{s \in P_k} \min\big(2LS,\, 2\alpha\tilde{\sigma}_k(x_s)\big) \leq 2\alpha\sqrt{T\sum_{k=0}^B\sum_{s \in P_k}\tilde{\sigma}_k^2(x_s)} \leq 2\alpha\sqrt{T\beta(1 + N\beta D)\sum_{k=0}^B\sum_{s \in P_k}\sigma^2_{k,s-1}(x_s)} \leq 4\alpha\sqrt{T\beta(1 + N\beta D)\gamma_T} \leq 4\Big[(1/\sqrt{1-\epsilon} + 1)\sqrt{\lambda}S + \sqrt{2}\big(\sqrt{1 + ND\beta} + N\sqrt{2D\beta}\big)R\sqrt{\ln(1/\delta) + \gamma_T}\Big]\sqrt{T\beta(1 + N\beta D)\gamma_T}.$$

The clients cannot directly communicate with each other, but only with the central server, i.e., a star-shaped communication network. At each time step t ∈ [T ], an arbitrary client i t ∈ [N ] becomes active and chooses an arm x t from a candidate set A t ⊆ R d , and then receives the corresponding reward feedback y

s. Compared with Li et al. (2022); Calandriello et al. (2020), which re-sample all $s \in D_k$ using $\tilde{p}_{k,s} = q\tilde{\sigma}^2_{k-1}(x_s)$, we use a different approximated variance function for each $\Delta S_{k'}$. Nevertheless, with our design in Section 4.1, i.e., $\tilde{p}_{k',s} = q\tilde{\sigma}^2_{k'-1}(x_s)$, we show in Lemma 4.3 that we can still ensure $|S_k| = O(\gamma_T)$, as long as a proper threshold $D$ is chosen to avoid any $\Delta D_{k'}$ being too large.

Having divided the time horizon $[T]$ into $P$ epochs using $\{t_{k_p}\}_{p \in [P]}$, we prove the following lemma that upper bounds the total number of times communication is triggered in each epoch. Lemma D.1 (Number of communications per epoch). For each epoch, i.e., the sequence of time steps in-between $t_{k_p}$ and $t_{k_{p+1}}$, the number of communications is upper bounded by $N + 4\beta/D$.

ACKNOWLEDGEMENT

Hongning Wang acknowledges the support by NSF grants IIS-2213700, IIS-2128019 and IIS-1838615. Mengdi Wang acknowledges the support by NSF grants DMS-1953686, IIS-2107304, CMMI-1653435, ONR grant 1006977, and C3.AI.

