COLLABORATIVE PURE EXPLORATION IN KERNEL BANDIT

Abstract

In this paper, we propose a novel Collaborative Pure Exploration in Kernel Bandit model (CoPE-KB), where multiple agents collaborate to complete different but related tasks with limited communication. Our model generalizes the prior CoPE formulation from the single-task, classic MAB setting to multiple tasks and general reward structures. We propose a novel communication scheme with an efficient kernelized estimator, and design algorithms CoKernelFC and CoKernelFB for CoPE-KB with fixed-confidence and fixed-budget objectives, respectively. Sample and communication complexities are provided to demonstrate the efficiency of our algorithms. Our theoretical results explicitly quantify how task similarities influence the learning speedup, and depend only on the effective dimension of the feature space. Our novel techniques, such as the efficient kernelized estimator and the decomposition of task similarities and arm features, which overcome the communication difficulty in high-dimensional feature spaces and reveal the impact of task similarities on sample complexity, can be of independent interest.

1. INTRODUCTION

Pure exploration (Even-Dar et al., 2006; Kalyanakrishnan et al., 2012; Kaufmann et al., 2016) is a fundamental online learning problem in multi-armed bandits (Thompson, 1933; Lai & Robbins, 1985; Auer et al., 2002), where an agent chooses options (often called arms) and observes random feedback with the objective of identifying the best arm. This formulation has found many important applications, such as web content optimization (Agarwal et al., 2009) and online advertising (Tang et al., 2013). However, traditional pure exploration (Even-Dar et al., 2006; Kalyanakrishnan et al., 2012; Kaufmann et al., 2016) only considers single-agent decision making, and cannot be applied to the prevailing distributed systems in the real world, which often face heavy computation loads and require multiple parallel devices to process tasks, e.g., distributed web servers (Zhuo et al., 2003) and data centers (Liu et al., 2011). To handle such distributed applications, prior works (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020) have developed the Collaborative Pure Exploration (CoPE) model, where multiple agents communicate and cooperate to identify the best arm with a learning speedup. Yet, existing results focus only on the classic multi-armed bandit (MAB) setting with a single task, i.e., all agents solve a common task and the rewards of arms are individual values (rather than generated by a reward function). However, in many distributed applications such as multi-task neural architecture search (Gao et al., 2020), different devices can face different but related tasks, and the dependency of rewards on option features is similar across tasks. Therefore, it is important to develop a more general CoPE model that allows heterogeneous tasks and structured reward dependency, and to further understand theoretically how task correlation impacts learning. Motivated by these facts, we propose a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) model.
Specifically, in CoPE-KB, each agent is given a set of arms, and the expected reward of each arm is generated by a task-dependent reward function in a high and possibly infinite dimensional Reproducing Kernel Hilbert Space (RKHS) (Wahba, 1990; Schölkopf et al., 2002). Each agent sequentially chooses arms to sample and observes random outcomes in order to identify the best arm. Agents can broadcast and receive messages to and from others in communication rounds, so that they can collaborate and exploit the task similarity to expedite their learning processes. Our CoPE-KB model is a novel generalization of the prior CoPE problem (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020), which not only extends prior models from the single-task setting to multiple tasks, but also goes beyond the classic MAB setting and allows general (linear or nonlinear) reward structures. CoPE-KB is most suitable for applications involving multiple tasks and complicated reward structures. For example, in multi-task neural architecture search (Gao et al., 2020), one wants to search for the best architectures for different but related tasks on multiple devices, e.g., the object detection (Ghiasi et al., 2019) and object tracking (Yan et al., 2021) tasks in computer vision, which often use similar neural architectures. Instead of individually evaluating each possible architecture, one prefers to directly learn the relationship (reward function) between the accuracy achieved and the features of the architectures used (e.g., the type of neural network), and to exploit the similarity of reward functions among tasks to accelerate the search. Our CoPE-KB generalization faces a unique challenge in communication. Specifically, in prior CoPE works with the classic MAB setting (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020), agents only need to learn scalar rewards, which are easy to transmit.
However, under the kernel model, agents need to estimate a high or even infinite dimensional reward parameter, which is inefficient to transmit directly. Also, if one naively adapts existing reward estimators for kernel bandits (Srinivas et al., 2010; Camilleri et al., 2021) to learn this high-dimensional reward parameter, one incurs an expensive communication cost that depends on the number of samples N^(r), since the reward estimators there require all raw sample outcomes to be transmitted. To tackle this challenge, we develop an efficient kernelized estimator, which only needs the average outcomes on the nV arms and reduces the required transmitted messages from O(N^(r)) to O(nV). Here V is the number of agents, and n is the number of arms for each agent. The number of samples N^(r) depends on the inverse of the minimum reward gap, and is often far larger than the number of arms nV. Under the CoPE-KB model, we study two popular objectives, i.e., Fixed-Confidence (FC), where we aim to minimize the number of samples used under a given confidence, and Fixed-Budget (FB), where the goal is to minimize the error probability under a given sample budget. We design two algorithms, CoKernelFC and CoKernelFB, which adopt an efficient kernelized estimator to simplify the required data transmission and enjoy an O(nV) communication cost, instead of the O(N^(r)) cost incurred by adaptations of existing kernel bandit algorithms (Srinivas et al., 2010; Camilleri et al., 2021). We provide sampling and communication guarantees, and also interpret them via standard kernel measures, e.g., the maximum information gain and the effective dimension. Our results rigorously quantify the influence of task similarities on learning acceleration, and hold for both finite and infinite dimensional feature spaces.
The contributions of this paper are summarized as follows:
• We formulate a Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) model, which generalizes the prior single-task CoPE formulation to allow multiple tasks and general reward structures, and consider two objectives, i.e., fixed-confidence (FC) and fixed-budget (FB).
• For the FC objective, we propose the algorithm CoKernelFC, which adopts an efficient kernelized estimator to simplify the required data transmission and enjoys only an O(nV) communication cost. We derive a sample complexity of Õ((ρ*(ξ)/V) log δ^{-1}) and O(log ∆_min^{-1}) communication rounds. Here ξ is the regularization parameter, and ρ*(ξ) is the problem hardness (see Section 4.3).
• For the FB objective, we design a novel algorithm CoKernelFB with error probability Õ(exp(−TV/ρ*(ξ)) · n²V) and O(log ω(ξ, X̃)) communication rounds. Here T is the sample budget, X̃ is the set of task-arm feature pairs, and ω(ξ, X̃) is the principle dimension of the data projections of X̃ in the RKHS (see Section 5.1).
• Our algorithms offer an efficient communication scheme for information exchange in high-dimensional feature spaces. Our results explicitly quantify how task similarities impact learning acceleration, and depend only on the effective dimension of the feature space.
Due to the space limit, we defer all proofs to the appendix.

2. RELATED WORK

Below we review the most related works, and defer a complete literature review to Appendix B. Collaborative Pure Exploration (CoPE). Hillel et al. (2013); Tao et al. (2019) initiate the CoPE literature with the single-task, classic MAB setting, where all agents solve a common classic best arm identification problem (without reward structures). Karpov et al. (2020) further extend the formulation in (Hillel et al., 2013; Tao et al., 2019) to best-m-arm identification. Kernel Bandits. Existing kernel bandit works (e.g., Srinivas et al., 2010; Camilleri et al., 2021) consider either regret minimization or the single-agent formulation, and cannot be applied to resolve our challenges in round-speedup analysis and communication.

3. COLLABORATIVE PURE EXPLORATION IN KERNEL BANDIT (COPE-KB)

In this section, we define the Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem. Agents and Rewards. There are V agents indexed by [V] := {1, . . . , V}, who collaborate to solve different but possibly related instances (tasks) of the Pure Exploration in Kernel Bandit (PE-KB) problem. Each agent v ∈ [V] is given a set of n arms X_v = {x_{v,1}, . . . , x_{v,n}} ⊆ R^{d_X}, where x_{v,i} (i ∈ [n]) describes the arm feature, and d_X is the dimension of the arm feature vectors. The expected reward of each arm x ∈ X_v is f_v(x), where f_v : X_v → R is an unknown reward function. Let X := ∪_{v∈[V]} X_v. At each timestep t, each agent v pulls an arm x_v^t ∈ X_v and observes a random reward y_v^t = f_v(x_v^t) + η_v^t. Here η_v^t is a zero-mean and 1-sub-Gaussian noise, independent across different t and v. We assume that the best arms x_{v,*} := argmax_{x∈X_v} f_v(x) are unique for all v ∈ [V], which is a common assumption in the pure exploration literature (Even-Dar et al., 2006; Audibert et al., 2010; Kaufmann et al., 2016). Multi-Task Kernel Composition. We assume that the functions f_v are parametric functionals of a global function F : Z × X → R, which satisfies that, for each agent v ∈ [V], there exists a task feature vector z_v ∈ Z such that

f_v(x) = F(z_v, x), ∀x ∈ X_v.    (1)

Here Z and X denote the task feature space and arm feature space, respectively. Note that Eq. (1) allows tasks to differ across agents (by having different task features), whereas prior CoPE works (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020) restrict the tasks (X_v and f_v) to be the same for all agents v ∈ [V]. We denote a task-arm feature pair (i.e., an overall input of F) by x̃ := (z_v, x), and denote the space of task-arm feature pairs (i.e., the overall input space of F) by X̃ := Z × X.
As is standard in kernel bandits (Krause & Ong, 2011; Deshmukh et al., 2017; Dubey et al., 2020), we assume that F has a bounded norm in a high (possibly infinite) dimensional Reproducing Kernel Hilbert Space (RKHS) H specified by the kernel k : X̃ × X̃ → R, i.e., there exist a feature mapping φ : X̃ → H and an unknown parameter θ* ∈ H such that

F(x̃) = φ(x̃)^T θ*, ∀x̃ ∈ X̃,  and  k(x̃, x̃') = φ(x̃)^T φ(x̃'), ∀x̃, x̃' ∈ X̃.

Here k(·, ·) is a product composite kernel, which satisfies that for any z, z' ∈ Z and x, x' ∈ X, k((z, x), (z', x')) = k_Z(z, z') · k_X(x, x'), where k_Z : Z × Z → R and k_X : X × X → R are the task kernel and arm kernel, respectively. Let K_Z := [k_Z(z_v, z_{v'})]_{v,v'∈[V]} denote the kernel matrix of task features. rank(K_Z) characterizes how similar the tasks among agents are. For example, if agents solve a common task, i.e., the arm set X_v and reward function f_v are the same for all agents, then the task feature z_v is the same for all v ∈ [V], and rank(K_Z) = 1; if all tasks are totally different, rank(K_Z) = V. To better illustrate the model, we provide a 2-agent example in Figure 1. There are Items 1, 2, 3 with expected rewards 3 + √2, 4 + 2√2, 5 + 3√2, respectively. Agent 1 is given Items 1, 2, denoted by X_1 = {x_{1,1}, x_{1,2}}, and Agent 2 is given Items 2, 3, denoted by X_2 = {x_{2,1}, x_{2,2}}, where both x_{1,2} and x_{2,1} refer to Item 2. The task features satisfy z_1 = z_2 = 1, which means that the expected rewards of all items are generated by a common function. The reward function F is nonlinear with respect to x, but can be represented in a linear form in a higher-dimensional feature space, i.e., F(x̃) = F(z, x) = φ(x̃)^T θ* with x̃ := (z, x) for any z ∈ Z and x ∈ X. Here φ(x̃) is the feature embedding, and θ* is the reward parameter. The computation of the kernel function k(x̃, x̃') only involves the low-dimensional input data x̃ and x̃', rather than the higher-dimensional feature embedding φ(·). F, φ, θ* and k are specified in Figure 1.
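To make the product composite kernel concrete, the following sketch (not from the paper; the RBF kernel choices and all names are illustrative) builds k((z, x), (z', x')) = k_Z(z, z') · k_X(x, x') and checks how rank(K_Z) reflects task similarity:

```python
import numpy as np

def k_Z(z, zp):
    # hypothetical task kernel: RBF on task features
    return np.exp(-0.5 * np.sum((z - zp) ** 2))

def k_X(x, xp):
    # hypothetical arm kernel: RBF on arm features
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

def k(zx, zxp):
    # product composite kernel k((z,x),(z',x')) = k_Z(z,z') * k_X(x,x')
    (z, x), (zp, xp) = zx, zxp
    return k_Z(z, zp) * k_X(x, xp)

# rank(K_Z) measures task similarity: identical task features give rank 1,
# well-separated task features give full rank V
z_same = [np.zeros(3)] * 3                       # 3 agents, one common task
z_diff = [np.eye(3)[v] * 5.0 for v in range(3)]  # 3 well-separated tasks

for zs in (z_same, z_diff):
    K_Z = np.array([[k_Z(z, zp) for zp in zs] for z in zs])
    print(np.linalg.matrix_rank(K_Z))  # 1 for z_same, 3 for z_diff
```

With a common task feature the matrix K_Z is all-ones (rank 1); with well-separated features it is close to the identity (rank V), matching the two extremes discussed above.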
The two agents can share the learned information on θ* to accelerate learning. Communication. Following the popular communication protocol in the CoPE literature (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020), we allow the V agents to exchange information via communication rounds, in which each agent can broadcast messages to and receive messages from the others. Following existing CoPE works (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020), we restrict the length of each message to O(n) bits for practicality, where n is the number of arms for each agent, and we treat the number of bits for representing a real number as a constant. We want agents to cooperate and complete all tasks using as few communication rounds as possible. Objectives. We consider two objectives for the CoPE-KB problem, one with Fixed-Confidence (FC) and the other with Fixed-Budget (FB). In the FC setting, given a confidence parameter δ ∈ (0, 1), the agents aim to identify x_{v,*} for all v ∈ [V] with probability at least 1 − δ, and to minimize the average number of samples used per agent. In the FB setting, the agents are given an overall sample budget TV (T average samples per agent), and aim to use at most TV samples to identify x_{v,*} for all v ∈ [V] while minimizing the error probability. In both the FC and FB settings, the agents are required to minimize the number of communication rounds and to keep the length of each message within O(n) bits, as in the CoPE literature (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020).
Let λ*_r and ρ*_r be the optimal solution and the optimal value of the min-max optimization in Line 4 below.

Algorithm 1 Collaborative Multi-agent Algorithm CoKernelFC: for Agent v ∈ [V]
1: Input: δ, X̃, k(·, ·) : X̃ × X̃ → R, ε, ξ.
2: Initialize B^(1)_v ← X̃_v for all v ∈ [V].
3: for r = 1, 2, . . . do
4: (λ*_r, ρ*_r) ← solution of min_{λ∈△(X̃)} max_{x̃_i,x̃_j∈B^(r)_v, v∈[V]} ||φ(x̃_i) − φ(x̃_j)||²_{(ξI + Σ_{x̃∈X̃} λ_x̃ φ(x̃)φ(x̃)^T)^{-1}} // compute the optimal sample allocation
5: N^(r) ← max{32 (2^r)² (1+ε)² ρ*_r log(2n²V/δ_r), τ(ξ, λ*_r, ε)}, where τ(ξ, λ*_r, ε) is the number of samples needed by ROUND
6: (s_1, . . . , s_{N^(r)}) ← ROUND(ξ, λ*_r, N^(r), ε)
7: Extract the sub-sequence s^(r)_v from (s_1, . . . , s_{N^(r)}) which only contains the arms in X̃_v
8: Sample the arms in s^(r)_v and observe random rewards y^(r)_v
9: Let N^(r)_{v,i} and ȳ^(r)_{v,i} be the number of samples and the average sample outcome on arm x̃_{v,i}
10: Broadcast {(N^(r)_{v,i}, ȳ^(r)_{v,i})}_{i∈[n]}, and receive {(N^(r)_{v',i}, ȳ^(r)_{v',i})}_{i∈[n]} from all other agents v' ≠ v
11: k_r(x̃) ← [√N^(r)_1 k(x̃, x̃_1), . . . , √N^(r)_{nV} k(x̃, x̃_{nV})]^T for any x̃ ∈ X̃; K^(r) ← [√(N^(r)_i N^(r)_j) k(x̃_i, x̃_j)]_{i,j∈[nV]}; ȳ^(r) ← [√N^(r)_1 ȳ^(r)_1, . . . , √N^(r)_{nV} ȳ^(r)_{nV}]^T
12: for all v ∈ [V] do
13: ∆̂_r(x̃, x̃') ← (k_r(x̃) − k_r(x̃'))^T (K^(r) + N^(r)ξI)^{-1} ȳ^(r), ∀x̃, x̃' ∈ B^(r)_v // use a kernelized estimator (described in Section 4.2) to estimate reward gaps
14: B^(r+1)_v ← B^(r)_v \ {x̃ ∈ B^(r)_v | ∃x̃' ∈ B^(r)_v : ∆̂_r(x̃', x̃) ≥ 2^{-r}} // discard empirically suboptimal arms
15: end for
16: if |B^(r+1)_v| = 1 for all v ∈ [V], return B^(r+1)_1, . . . , B^(r+1)_V
17: end for

Different from prior CoPE works (Tao et al., 2019; Karpov et al., 2020), which minimize the maximum number of samples used by any individual agent, we aim to minimize the average (total) number of samples used. Our objective is motivated by the fact that in many applications obtaining a sample is expensive, e.g., in clinical trials (Weninger et al., 2019), and thus it is important to minimize the average (total) number of samples required.
For example, consider that a medical institution wants to conduct multiple clinical trials to identify the best treatments for different age groups of patients (different tasks), and share the obtained data to accelerate the development. Since conducting a trial can consume significant medical resources and funds (e.g., organ transplant surgeries and convalescent plasma treatments for COVID-19), the institution wants to minimize the total number of trials required. Our CoPE-KB model is most suitable for such scenarios. To sum up, in CoPE-KB, we let agents collaborate to simultaneously complete multiple related best arm identification tasks using few communication rounds. In particular, when all agents solve the same task and X is the canonical basis, our CoPE-KB reduces to existing CoPE with classic MAB setting (Hillel et al., 2013; Tao et al., 2019) .

4. FIXED-CONFIDENCE COPE-KB

We start with the fixed-confidence setting. We present algorithm CoKernelFC equipped with an efficient kernelized estimator, and provide theoretical guarantees in sampling and communication.

4.1. ALGORITHM CoKernelFC

CoKernelFC (Algorithm 1) is an elimination-based multi-agent algorithm. The procedure for each agent v is as follows. Agent v maintains candidate arm sets B^(r)_v for all v ∈ [V]. In each round r, agent v solves a global min-max optimization (Line 4) to find the optimal sample allocation λ*_r ∈ △(X̃), which achieves the minimum estimation error. Here △(X̃) denotes the collection of all distributions on X̃, and ρ*_r is the factor of the optimal estimation error. In practice, the high-dimensional feature embedding φ(x̃) is only implicitly maintained, and this optimization can be efficiently solved by kernelized gradient descent (Camilleri et al., 2021) (see Appendix C.2). After solving this optimization, agent v uses ρ*_r to compute the number of samples N^(r) needed to ensure that the estimation error of the reward gaps is smaller than 2^{-(r+1)} (Line 5). Next, we call a rounding procedure ROUND(ξ, λ*_r, N^(r), ε) (Allen-Zhu et al., 2021; Camilleri et al., 2021), which rounds a weighted sample allocation λ*_r ∈ △(X̃) into a discrete sample sequence (s_1, . . . , s_{N^(r)}) ∈ X̃^{N^(r)}, and ensures a rounding error within ε (Line 6). This rounding procedure requires the number of samples N^(r) ≥ τ(ξ, λ*_r, ε) = O(d(ξ, λ*_r)/ε²) (Line 5). Here τ(ξ, λ*_r, ε) is the number of samples needed by ROUND, and d(ξ, λ*_r) is the number of eigenvalues of the matrix Σ_{x̃∈X̃} λ*_r(x̃) φ(x̃)φ(x̃)^T that are greater than ξ; it stands for the effective dimension of the feature space (see Appendix C.1 for details of this rounding procedure). Given the sample sequence (s_1, . . . , s_{N^(r)}), agent v extracts the sub-sequence s^(r)_v which only contains the arms in her arm set X̃_v. Then, she samples the arms in s^(r)_v and observes the outcomes y^(r)_v (Line 8). We use x̃_i, ȳ^(r)_i and N^(r)_i with a single subscript to denote the i-th arm in X̃, the average sample outcome on this arm, and the number of samples allocated to this arm, respectively.
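To illustrate the role of the rounding step (Line 6), here is a naive stand-in that converts a weighted allocation λ into integer per-arm sample counts by largest-remainder rounding. This is only a hypothetical sketch: the actual ROUND procedure of Allen-Zhu et al. (2021) is more sophisticated and additionally guarantees the rounding error stays within ε.

```python
import numpy as np

def simple_round(lam, N):
    """Largest-remainder rounding of a weight vector lam (summing to 1) into
    integer sample counts summing to N.  A naive stand-in for ROUND, without
    its approximation guarantee."""
    raw = lam * N
    counts = np.floor(raw).astype(int)
    remainder = N - counts.sum()
    # give the leftover samples to the arms with the largest fractional parts
    order = np.argsort(-(raw - counts))
    counts[order[:remainder]] += 1
    return counts

lam = np.array([0.5, 0.3, 0.2])
counts = simple_round(lam, 7)
print(counts, counts.sum())  # counts: [4 2 1], total 7
```

The discrete counts can then be expanded into a sample sequence (s_1, . . . , s_N) from which each agent extracts the arms in her own arm set.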

4.2. KERNELIZED ESTIMATOR, COMMUNICATION AND COMPUTATION

Now we introduce the kernelized estimator (Line 13), which boosts the communication and computation efficiency of CoKernelFC. First note that our CoPE-KB generalization faces a unique challenge in communication, i.e., how to let agents efficiently share their learned information on the high-dimensional reward parameter θ*. Naively adapting existing federated linear bandit or kernel bandit algorithms (Dubey & Pentland, 2020; Huang et al., 2021; Dubey et al., 2020), which transmit the whole estimated reward parameter or all raw sample outcomes, incurs an O(dim(H)) or O(N^(r)) communication cost, respectively. Here, the number of samples N^(r) = Õ(d_eff/∆²_min), where d_eff is the effective dimension of the feature space and ∆_min is the minimum reward gap. Thus, N^(r) is far larger than the number of arms nV when ∆_min is small. Kernelized Estimator. To handle the communication challenge, we develop a novel kernelized estimator (Eq. (3)) that significantly simplifies the required transmitted data. Specifically, we make a key observation: since the sample sequence (s_1, . . . , s_{N^(r)}) is constituted by the arms x̃_1, . . . , x̃_{nV}, one can merge repetitive computations for the same arms. Then, for all i ∈ [nV], we merge the repetitive feature embeddings φ(x̃_i) for the same arm x̃_i, and condense the N^(r) raw sample outcomes into the average outcome ȳ^(r)_i on each arm x̃_i. As a result, we express θ̂_r in a simplified kernelized form, and use it to estimate the reward gap ∆̂_r(x̃, x̃') between any two arms x̃, x̃':

θ̂_r := Φ_r^T (N^(r)ξI + K^(r))^{-1} ȳ^(r),
∆̂_r(x̃, x̃') := (φ(x̃) − φ(x̃'))^T θ̂_r = (k_r(x̃) − k_r(x̃'))^T (N^(r)ξI + K^(r))^{-1} ȳ^(r).    (3)

Here Φ_r := [√N^(r)_1 φ(x̃_1)^T; . . . ; √N^(r)_{nV} φ(x̃_{nV})^T], K^(r) := [√(N^(r)_i N^(r)_j) k(x̃_i, x̃_j)]_{i,j∈[nV]}, k_r(x̃) := Φ_r φ(x̃) = [√N^(r)_1 k(x̃, x̃_1), . . . , √N^(r)_{nV} k(x̃, x̃_{nV})]^T and ȳ^(r) := [√N^(r)_1 ȳ^(r)_1, . . . , √N^(r)_{nV} ȳ^(r)_{nV}]^T stand for the feature embeddings, kernel matrix, correlations and average outcomes of the nV arms, respectively, which merge repetitive information on the same arms. We refer interested readers to Appendices C.2 and C.3 for a detailed derivation of our kernelized estimator and a comparison with existing estimators (Dubey et al., 2020; Camilleri et al., 2021). Communication. Thanks to the kernelized estimator, we only need to transmit the nV average outcomes ȳ^(r)_1, . . . , ȳ^(r)_{nV} (Line 10), instead of the whole θ̂_r or all N^(r) raw outcomes as in existing federated linear bandit or kernel bandit algorithms (Dubey & Pentland, 2020; Dubey et al., 2020). This significantly reduces the number of transmitted bits from O(dim(H)) or O(N^(r)) to only O(nV), and satisfies the O(n)-bit per message requirement. Computation. In CoKernelFC, φ(x̃) and θ̂_r are maintained implicitly, and all steps (e.g., Lines 4, 13) can be implemented efficiently by only querying the kernel function k(·, ·) (see Appendix C.2 for implementation details). Thus, the computation complexity of reward estimation is only Poly(nV), instead of Poly(dim(H)) as in prior kernel bandit algorithms (Zhou et al., 2020; Zhu et al., 2021).
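The kernelized gap estimator above can be sketched in a few lines: given only the per-arm sample counts N_i and average outcomes ȳ_i (the quantities broadcast in Line 10), it estimates a reward gap through kernel evaluations alone, never forming φ explicitly. The function name and the linear-kernel sanity check are illustrative, not from the paper.

```python
import numpy as np

def gap_estimate(k, arms, N_counts, y_bar, x, xp, xi):
    """Kernelized reward-gap estimator in the spirit of Eq. (3): uses only
    per-arm sample counts N_i and average outcomes y_bar_i, not raw samples."""
    N = N_counts.sum()
    s = np.sqrt(N_counts.astype(float))
    # K_ij = sqrt(N_i N_j) k(x_i, x_j)
    K = s[:, None] * s[None, :] * np.array([[k(a, b) for b in arms] for a in arms])
    y = s * y_bar                                         # ybar_i scaled by sqrt(N_i)
    kr = lambda u: s * np.array([k(u, a) for a in arms])  # k_r(u)
    w = np.linalg.solve(K + N * xi * np.eye(len(arms)), y)
    return (kr(x) - kr(xp)) @ w

# sanity check with a linear kernel: theta* = e1, arms e1 and e2, noiseless averages
lin = lambda a, b: float(a @ b)
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
est = gap_estimate(lin, arms, np.array([1000, 1000]), np.array([1.0, 0.0]),
                   arms[0], arms[1], xi=1e-6)
print(round(est, 3))  # ≈ 1.0, the true gap F(e1) - F(e2)
```

Note that only the 2n numbers (counts and averages) per agent enter the computation, which is exactly why the per-message length stays O(n).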

4.3. THEORETICAL PERFORMANCE OF CoKernelFC

To formally state our results, we define the speedup and hardness as in the literature (Hillel et al., 2013; Tao et al., 2019; Fiez et al., 2019). For a CoPE-KB instance I, let T_{A_M,I} denote the average number of samples used per agent by a multi-agent algorithm A_M to identify all best arms. Let T_{A_S,I} denote the average number of samples used per agent when replicating V copies of a single-agent algorithm A_S to complete all tasks (without communication). Then, the speedup of A_M on instance I is defined as β_{A_M,I} = inf_{A_S} T_{A_S,I} / T_{A_M,I}. It holds that 1 ≤ β_{A_M,I} ≤ V. When all tasks are the same, β_{A_M,I} can approach V. When all tasks are totally different, communication brings no benefit and β_{A_M,I} = 1. This speedup can be similarly defined for error probability results (see Section 5.2), by taking T_{A_M,I} and T_{A_S,I} as the smallest numbers of samples needed to meet the confidence constraint. The hardness for CoPE-KB is defined as

ρ*(ξ) = min_{λ∈△(X̃)} max_{x̃∈X̃_v\{x̃_{v,*}}, v∈[V]} ||φ(x̃_{v,*}) − φ(x̃)||²_{A(ξ,λ)^{-1}} / (F(x̃_{v,*}) − F(x̃))²,

where ξ ≥ 0 is a regularization parameter, and A(ξ, λ) := ξI + Σ_{x̃∈X̃} λ_x̃ φ(x̃)φ(x̃)^T. This definition of ρ*(ξ) is adapted from prior linear bandit work (Fiez et al., 2019). Here F(x̃_{v,*}) − F(x̃) is the reward gap between the best arm x̃_{v,*} and a suboptimal arm x̃, and ||φ(x̃_{v,*}) − φ(x̃)||²_{A(ξ,λ)^{-1}} is a dimension-related factor of the estimation error. Intuitively, ρ*(ξ) indicates how many samples it takes to make the estimation error smaller than the reward gap under regularization parameter ξ. Let ∆_min := min_{x̃∈X̃_v\{x̃_{v,*}}, v∈[V]} (F(x̃_{v,*}) − F(x̃)) denote the minimum reward gap between the best arm and the suboptimal arms among all tasks. Let S denote the average number of samples used by each agent, i.e., the per-agent sample complexity. Below we present the performance of CoKernelFC. Theorem 1 (Fixed-Confidence Upper Bound).
Suppose that ξ satisfies √ξ · max_{x̃_i,x̃_j∈X̃_v, v∈[V]} ||φ(x̃_i) − φ(x̃_j)||_{A(ξ,λ*_1)^{-1}} ≤ ∆_min / (32(1+ε)||θ*||). With probability at least 1 − δ, CoKernelFC returns the best arms x̃_{v,*} for all v ∈ [V], with per-agent sample complexity

S = O( (ρ*(ξ)/V) · log(∆_min^{-1}) · (log(nV/δ) + log log(∆_min^{-1})) + (d(ξ, λ*_1)/V) · log(∆_min^{-1}) )

and O(log ∆_min^{-1}) communication rounds. Remark 1. The condition on the regularization parameter ξ implies that CoKernelFC needs a small regularization parameter, so that the bias due to regularization is smaller than ∆_min/2. Such conditions are similarly needed in prior kernel bandit work (Camilleri et al., 2021), and can be dropped in the extended PAC setting (which allows a gap between the identified best arm and the true best arm). The d(ξ, λ*_1) log(∆_min^{-1})/V term is a cost of using the rounding procedure ROUND. It is a second-order term when the reward gaps ∆_{v,i} := F(x̃_{v,*}) − F(x̃_{v,i}) < 1 for all x̃_{v,i} ∈ X̃_v \ {x̃_{v,*}} and v ∈ [V], which is the common case in pure exploration (Fiez et al., 2019; Zhu et al., 2021). When all tasks are the same, Theorem 1 achieves a V-speedup, since replicating V copies of single-agent algorithms (Camilleri et al., 2021; Zhu et al., 2021) to complete all tasks without communication costs Õ(ρ*(ξ)) samples per agent. When tasks are totally different, there is no speedup in Theorem 1, since each copy of a single-agent algorithm solves a task in its own sub-dimension and costs Õ(ρ*(ξ)/V) samples (ρ*(ξ) stands for the effective dimension of all tasks). This result matches the restriction 1 ≤ β_{A_M,I} ≤ V on the speedup. Interpretation. We now interpret Theorem 1 via standard measures in kernel bandits and a decomposition with respect to task similarities and arm features. Below we first introduce the definitions of the maximum information gain and the effective dimension, which are adapted from kernel bandits with regret minimization (Srinivas et al., 2010; Valko et al., 2013) to the pure exploration setting.
We define the maximum information gain as Υ := max_{λ∈△(X̃)} log det(I + ξ^{-1} K_λ), where K_λ := [√(λ_i λ_j) k(x̃_i, x̃_j)]_{i,j∈[nV]} denotes the kernel matrix under sample allocation λ. Υ stands for the maximum information gain obtainable from the samples generated according to a sample allocation. Let λ* := argmax_{λ∈△(X̃)} log det(I + ξ^{-1} K_λ) denote the sample allocation which achieves the maximum information gain, and let α_1 ≥ · · · ≥ α_{nV} denote the eigenvalues of K_{λ*}. Then, we define the effective dimension as d_eff := min{j ∈ [nV] : jξ log(nV) ≥ Σ_{i=j+1}^{nV} α_i}. d_eff stands for the number of principal directions along which the data projections in the RKHS spread.
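The effective dimension d_eff can be computed directly from the eigenvalues of K_{λ*} by following the definition above; the helper below is an illustrative implementation, not code from the paper:

```python
import numpy as np

def effective_dimension(K_lam, xi, n_arms):
    """d_eff := min{ j in [n_arms] : j * xi * log(n_arms) >= sum_{i>j} alpha_i },
    where alpha_1 >= ... >= alpha_{n_arms} are the eigenvalues of K_lam."""
    alpha = np.sort(np.linalg.eigvalsh(K_lam))[::-1]  # eigenvalues, decreasing
    for j in range(1, n_arms + 1):
        if j * xi * np.log(n_arms) >= alpha[j:].sum():
            return j
    return n_arms

# toy kernel matrix with one dominant eigendirection
K = np.diag([10.0, 0.1, 0.01, 0.01])
print(effective_dimension(K, xi=0.1, n_arms=4))  # → 1
```

Intuitively, a fast-decaying eigenspectrum (one dominant direction in the example) yields a small d_eff, and hence a small sample complexity via Corollary 1(b).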

Recall that K_Z := [k_Z(z_v, z_{v'})]_{v,v'∈[V]} denotes the kernel matrix of task similarities. Let K_{X,λ*} := [√(λ*_i λ*_j) k_X(x_i, x_j)]_{i,j∈[nV]} denote the kernel matrix of arm features under sample allocation λ*.

Corollary 1. The per-agent sample complexity S of algorithm CoKernelFC can be bounded by (a) S = Õ(Υ/(∆²_min V)), (b) S = Õ(d_eff/(∆²_min V)), (c) S = Õ(rank(K_Z) · rank(K_{X,λ*})/(∆²_min V)), where Õ(·) omits the rounding cost term d(ξ, λ*_1) log(∆_min^{-1})/V and logarithmic factors.

Remark 2. Corollaries 1(a),(b) show that our sample complexity can be bounded by the maximum information gain, and depends only on the effective dimension of the kernel representation. Corollary 1(c) reveals that the more similar the tasks are (i.e., the smaller rank(K_Z) is), the fewer samples agents need. For example, when all tasks are the same, i.e., rank(K_Z) = 1, each agent only needs a 1/V fraction of the samples required by single-agent algorithms (Camilleri et al., 2021; Zhu et al., 2021) (which need Õ(rank(K_{X,λ*})/∆²_min) samples). Conversely, when all tasks are totally different, i.e., rank(K_Z) = V, no advantage can be obtained from communication, since the information from other agents is useless for solving local tasks. Our experimental results also reflect this relationship between task similarities and speedup, and match our theoretical bounds (see Section 6). We note that these theoretical results hold for both finite and infinite dimensional RKHS.
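The rank decomposition in Corollary 1(c) rests on the fact that, under a product kernel, the full kernel matrix on task-arm pairs inherits a product structure, so its rank is at most rank(K_Z) · rank(K_X). A small numerical check (linear kernels and one arm set shared by all agents, purely illustrative; the paper's setting also allows distinct arm sets):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 3, 4
Z = rng.normal(size=(V, 2))   # task features
X = rng.normal(size=(n, 2))   # arm features, shared across agents here

# linear kernels for illustration; the rank bound holds for any product kernel
K_Z = Z @ Z.T
K_X = X @ X.T

# full kernel matrix on all (task, arm) pairs under the product kernel:
# k((z_v, x_i), (z_v', x_j)) = k_Z(z_v, z_v') * k_X(x_i, x_j)
K = np.kron(K_Z, K_X)

r = np.linalg.matrix_rank
assert r(K) <= r(K_Z) * r(K_X)
print(r(K_Z), r(K_X), r(K))
```

With 2-dimensional task and arm features the ranks here are 2, 2 and 4: lowering rank(K_Z) (more similar tasks) directly shrinks the rank of the joint kernel matrix, mirroring the sample complexity decomposition above.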

5. FIXED-BUDGET COPE-KB

For the fixed-budget objective, we propose the algorithm CoKernelFB and provide error probability guarantees. Due to the space limit, we defer the pseudo-code and detailed description to Appendix C.4.

5.1. ALGORITHM CoKernelFB

CoKernelFB pre-determines the number of rounds and the number of samples per round according to the principle dimension of the arms, and successively halves the set of candidate arms based on this principle dimension. CoKernelFB also adopts the efficient kernelized estimator (Section 4.2) to estimate the rewards of arms, so that agents only need to transmit average outcomes rather than all raw sample outcomes or the whole estimated reward parameter. Thus, CoKernelFB only requires an O(nV) communication cost, instead of O(N^(r)) or O(dim(H)) as in adaptations of the prior single-agent fixed-budget algorithm (Katz-Samuels et al., 2020).

5.2. THEORETICAL PERFORMANCE OF CoKernelFB

Define the principle dimension of X̃ as ω(ξ, X̃) := min_{λ∈△(X̃)} max_{x̃_i,x̃_j∈X̃} ||φ(x̃_i) − φ(x̃_j)||²_{A(ξ,λ)^{-1}}, where A(ξ, λ) := ξI + Σ_{x̃∈X̃} λ_x̃ φ(x̃)φ(x̃)^T. ω(ξ, X̃) represents the principle dimension of the data projections of X̃ in the RKHS. Now we provide the error probability guarantee for CoKernelFB. Theorem 2 (Fixed-Budget Upper Bound). CoKernelFB returns the best arms x̃_{v,*} for all v ∈ [V] with error probability

O( exp( −TV / (ρ*(ξ) · log(ω(ξ, X̃))) ) · n²V log(ω(ξ, X̃)) )

and O(log(ω(ξ, X̃))) communication rounds. To guarantee an error probability δ, CoKernelFB only requires a Õ((ρ*(ξ)/V) log δ^{-1}) sample budget, which attains the full speedup when all tasks are the same. Theorem 2 can be decomposed into components related to task similarities and arm features as in Corollary 1 (see Appendix E.2).

6. EXPERIMENTS

In this section, we provide the experimental results. Here we set V = 5, n = 6, δ = 0.005, H = R^d, d ∈ {4, 8, 20}, θ* = [0.1, 0.1 + ∆_min, . . . , 0.1 + (d − 1)∆_min]^T, ∆_min ∈ [0.1, 0.8] and rank(K_Z) ∈ [1, V]. We run 50 independent simulations and plot the average sample complexity with 95% confidence intervals (see Appendix A for a complete setup description and more results). We compare our algorithm CoKernelFC with five baselines, i.e., CoKernel-IndAlloc, IndRAGE, IndRAGE/V, IndALBA and IndPolyALBA. CoKernel-IndAlloc is an ablation variant of CoKernelFC, where agents use locally optimal sample allocations. IndRAGE, IndALBA and IndPolyALBA are adaptations of the existing single-agent algorithms RAGE (Fiez et al., 2019), ALBA (Tao et al., 2018) and PolyALBA (Du et al., 2021), respectively. IndRAGE/V is a V-speedup baseline, which divides the sample complexity of the best single-agent adaptation IndRAGE by V. Figures 2(a), 2(b) show that CoKernelFC achieves the best sample complexity, which demonstrates the effectiveness of our sample allocation and cooperation schemes. Moreover, the empirical results show that the more similar the tasks are, the higher the learning speedup agents attain, which matches our theoretical analysis. Specifically, in the rank(K_Z) = 1 case (Figure 2(a)), i.e., tasks are the same, CoKernelFC matches the V-speedup baseline IndRAGE/V. In the rank(K_Z) ∈ (1, V) case (Figure 2(b)), i.e., tasks are similar, the sample complexity of CoKernelFC lies between IndRAGE/V and IndRAGE, which indicates that CoKernelFC achieves a speedup lower than V. In the rank(K_Z) = V case (Figure 2(c)), i.e., tasks are totally different, CoKernelFC performs similarly to IndRAGE, since information sharing brings no advantage in this case.

7. CONCLUSION

In this paper, we propose a collaborative pure exploration in kernel bandit (CoPE-KB) model with fixed-confidence and fixed-budget objectives. CoPE-KB generalizes prior CoPE formulation from the single-task and classic MAB setting to allow multiple tasks and general reward structures. We propose novel algorithms with an efficient kernelized estimator and a novel communication scheme. Sample and communication complexities are provided to corroborate the efficiency of our algorithms. Our results explicitly quantify the influences of task similarities on learning speedup, and only depend on the effective dimension of feature space.

APPENDIX A MORE EXPERIMENTS

In this section, we give a complete description of the experimental setup, and present the results for the FB setting.

Experimental Setup. Our experiments are run on an Intel Xeon E5-2660 v3 CPU at 2.60GHz. We set V = 5, n = 6, δ = 0.005 and H = R^d, where d is a dimension parameter specified below. We consider three different cases of task similarities, i.e., rank(K_Z) = 1 (tasks are the same), rank(K_Z) ∈ (1, V) (tasks are similar), and rank(K_Z) = V (tasks are totally different), to show how task similarities impact learning performance in practice. For the rank(K_Z) = 1 case, we set d = 4. For any v ∈ [V], {φ(x)}_{x∈X_v} is the set of all C(4,2) = 6 vectors in R^4 that have two entries equal to 0 and two entries equal to 1. For the rank(K_Z) ∈ (1, V) case, we set d = 8. For any v ∈ {1, 2} and v' ∈ {3, 4, 5}, {φ(x)}_{x∈X_v} and {φ(x)}_{x∈X_{v'}} are the sets of all C(4,2) = 6 such vectors embedded in the first and second R^4 subspaces of the whole space R^8, respectively. For the rank(K_Z) = V case, we set d = 20. For any v ∈ [V], {φ(x)}_{x∈X_v} is the set of all C(4,2) = 6 such vectors embedded in the v-th R^4 subspace of the whole space R^20. For all cases, we set θ* = [0.1, 0.1 + Δ_min, ..., 0.1 + (d−1)Δ_min]^T ∈ R^d, where Δ_min is a reward gap parameter tuned in the experiments. In the FC setting, we vary the reward gap Δ_min ∈ [0.1, 0.8] to generate different instances, and run 50 independent simulations to plot the average sample complexity with 95% confidence intervals. In the FB setting, we vary the sample budget T ∈ [7000, 300000] to obtain different instances, and perform 100 independent runs to report the average error probability across runs.

Results for the Fixed-Budget Setting. In the FB setting (Figure 3), we compare our algorithm CoKernelFB with three baselines: CoKernelFB-IndAlloc, IndPeaceFB (Katz-Samuels et al., 2020) and IndUniformFB.
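The three instance families above can be generated mechanically: take the C(4,2) = 6 binary base vectors and embed them into the appropriate subspace per task. The following is a small NumPy sketch of that construction (not code from the paper; the helper names `binary_vectors` and `build_instance` are ours):

```python
from itertools import combinations
import numpy as np

def binary_vectors(d=4):
    """All vectors in R^4 with exactly two entries 1 and two entries 0: C(4,2) = 6 arms."""
    vecs = []
    for idx in combinations(range(d), 2):
        v = np.zeros(d)
        v[list(idx)] = 1.0
        vecs.append(v)
    return np.array(vecs)                        # shape (6, 4)

def build_instance(case, V=5):
    """Per-task feature sets for the three task-similarity cases of the experiments."""
    base = binary_vectors()
    if case == "same":                           # rank(K_Z) = 1, d = 4
        return [base.copy() for _ in range(V)]
    if case == "similar":                        # rank(K_Z) in (1, V), d = 8
        feats = []
        for v in range(V):
            block = 0 if v < 2 else 1            # tasks {1,2} share one R^4 subspace, {3,4,5} the other
            F = np.zeros((6, 8))
            F[:, 4 * block:4 * (block + 1)] = base
            feats.append(F)
        return feats
    if case == "different":                      # rank(K_Z) = V, d = 20
        feats = []
        for v in range(V):
            F = np.zeros((6, 20))                # task v lives in the v-th R^4 subspace
            F[:, 4 * v:4 * (v + 1)] = base
            feats.append(F)
        return feats
    raise ValueError(case)
```

Stacking all tasks' features and checking the rank recovers d = 4, 8 and 20 for the three cases, mirroring the increasing dissimilarity of the tasks.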
CoKernelFB-IndAlloc is an ablation variant of CoKernelFB, where agents use locally optimal sample allocations instead of a globally optimal sample allocation. IndPeaceFB and IndUniformFB run V copies of the existing single-agent algorithm PeaceFB (Katz-Samuels et al., 2020) and of the uniform sampling strategy, respectively, to independently complete the V tasks. Figures 3(a) and 3(b) show that our CoKernelFB enjoys a lower error probability than the baselines. Moreover, the experimental results also reflect the influence of task similarities on learning speedup, and match our theoretical bounds. Specifically, from Figure 3(a) to Figure 3(c), the error probability of CoKernelFB gets closer and closer to that of the single-agent adaptation IndPeaceFB, which shows that the learning speedup of CoKernelFB shrinks as the task similarity decreases.

B RELATED WORK

In this section, we present a complete review of related works.

Collaborative Pure Exploration (CoPE). The collaborative pure exploration literature is initiated by (Hillel et al., 2013), where all agents solve a common classic best arm identification problem. Hillel et al. (2013) design fixed-confidence algorithms based on majority vote and provide upper bound analysis. Tao et al. (2019) further develop a fixed-budget algorithm and complete the analysis of communication round-speedup lower bounds. Karpov et al. (2020) extend the best arm identification formulation of (Hillel et al., 2013; Tao et al., 2019) to best m arm identification, and show complexity separations between these two formulations. Our CoPE-KB generalizes prior CoPE works (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020) from the single-task and classic MAB setting to allow multiple tasks and general reward structures, and faces unique challenges in communication and computation due to the high (possibly infinite) dimensional feature space.

Kernel Bandits. For kernel bandits with the regret minimization objective, Srinivas et al. (2010) and Valko et al. (2013) design algorithms from Bayesian and frequentist perspectives, respectively. Chowdhury & Gopalan (2017) improve the regret bound in (Srinivas et al., 2010). The communication costs of existing distributed kernel bandit algorithms (e.g., Dubey et al., 2020; Li et al., 2022) depend on the total number of timesteps T in the regret minimization game, while our communication costs depend only on the number of arms in the pure exploration setting. For kernel bandits with the pure exploration objective, there are only a few related works (Scarlett et al., 2017; Vakili et al., 2021; Camilleri et al., 2021; Zhu et al., 2021), and all of them study the single-agent formulation.

C.1 ROUNDING PROCEDURE

Given N ≥ τ(ξ, λ, ε) = O(d(ξ, λ)/ε²) samples, ROUND(ξ, λ, N, ε) returns a discrete sample allocation (s_1, ..., s_N) ∈ X^N such that

$$\max_{x_i, x_j \in \mathcal{X}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{(N\xi I + \sum_{i=1}^{N} \phi(s_i)\phi(s_i)^\top)^{-1}} \le 2(1+\varepsilon) \max_{x_i, x_j \in \mathcal{X}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{(N\xi I + \sum_{x \in \mathcal{X}} N\lambda_x \phi(x)\phi(x)^\top)^{-1}}. \quad (6)$$

This rounding procedure requires the number of samples N to satisfy N ≥ τ(ξ, λ, ε) = O(d(ξ, λ)/ε²). Here d(ξ, λ) is the number of eigenvalues of the matrix Σ_{x∈X} λ_x φ(x)φ(x)^T that are greater than ξ; d(ξ, λ) stands for the effective dimension of the feature space spanned by the data projections of X under regularization parameter ξ.
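The quantity d(ξ, λ) is straightforward to evaluate once explicit (finite-dimensional) features are available. A minimal NumPy sketch, with the helper name `d_xi_lambda` being ours:

```python
import numpy as np

def d_xi_lambda(Phi, lam, xi):
    """d(xi, lambda): number of eigenvalues of sum_x lambda_x phi(x) phi(x)^T
    that are strictly greater than the regularization parameter xi.
    Phi: (num_arms, d) matrix of features; lam: allocation weights over arms."""
    A = Phi.T @ (lam[:, None] * Phi)      # sum_x lambda_x phi(x) phi(x)^T
    return int((np.linalg.eigvalsh(A) > xi).sum())
```

For instance, with orthonormal features and weights (0.5, 0.3, 0.2), only the eigenvalues 0.5 and 0.3 exceed ξ = 0.25, so the effective dimension is 2.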

C.2 KERNELIZED COMPUTATION IN ALGORITHM CoKernelFC

Kernelized Estimator. Below we present a derivation of our kernelized estimator (Eq. (3)), which plays an important role in boosting the communication and computation efficiency. Let θ̂_r denote the minimizer of the regularized least squares loss

$$L(\theta) = N^{(r)} \xi \|\theta\|^2 + \sum_{j=1}^{N^{(r)}} \big(y_j - \phi(s_j)^\top \theta\big)^2.$$

Setting the derivative of L(θ) to zero, we have

$$N^{(r)} \xi\, \hat{\theta}_r + \sum_{j=1}^{N^{(r)}} \phi(s_j)\phi(s_j)^\top \hat{\theta}_r = \sum_{j=1}^{N^{(r)}} \phi(s_j)\, y_j.$$

Merging repetitive computations for the same arms, we obtain

$$N^{(r)} \xi\, \hat{\theta}_r + \sum_{i=1}^{nV} N^{(r)}_i \phi(x_i)\phi(x_i)^\top \hat{\theta}_r = \sum_{i=1}^{nV} N^{(r)}_i \phi(x_i)\, \bar{y}^{(r)}_i, \quad (7)$$

where N^{(r)}_i is the number of samples and ȳ^{(r)}_i is the average observation on arm x_i, for any i ∈ [nV]. Let Φ_r = [√(N^{(r)}_1) φ(x_1)^T; ...; √(N^{(r)}_{nV}) φ(x_{nV})^T], K^{(r)} = Φ_r Φ_r^T = [√(N^{(r)}_i N^{(r)}_j) k(x_i, x_j)]_{i,j∈[nV]} and ȳ^{(r)} = [√(N^{(r)}_1) ȳ^{(r)}_1, ..., √(N^{(r)}_{nV}) ȳ^{(r)}_{nV}]^T. Then, we can write Eq. (7) as

$$\big(N^{(r)} \xi I + \Phi_r^\top \Phi_r\big)\, \hat{\theta}_r = \Phi_r^\top \bar{y}^{(r)}.$$

Since N^{(r)}ξI + Φ_r^TΦ_r ≻ 0 and N^{(r)}ξI + Φ_rΦ_r^T ≻ 0,

$$\hat{\theta}_r = \big(N^{(r)} \xi I + \Phi_r^\top \Phi_r\big)^{-1} \Phi_r^\top \bar{y}^{(r)} = \Phi_r^\top \big(N^{(r)} \xi I + \Phi_r \Phi_r^\top\big)^{-1} \bar{y}^{(r)} = \Phi_r^\top \big(N^{(r)} \xi I + K^{(r)}\big)^{-1} \bar{y}^{(r)}.$$

Let k_r(x) = Φ_r φ(x) = [√(N^{(r)}_1) k(x, x_1), ..., √(N^{(r)}_{nV}) k(x, x_{nV})]^T for any x ∈ X. Then, for any arms x_i, x_j ∈ X, we obtain the efficient kernelized estimators of the expected reward F(x_i) and the expected reward gap F(x_i) − F(x_j) as

$$\hat{F}_r(x_i) = \phi(x_i)^\top \hat{\theta}_r = k_r(x_i)^\top \big(N^{(r)} \xi I + K^{(r)}\big)^{-1} \bar{y}^{(r)}, \qquad \hat{\Delta}_r(x_i, x_j) = \big(k_r(x_i) - k_r(x_j)\big)^\top \big(N^{(r)} \xi I + K^{(r)}\big)^{-1} \bar{y}^{(r)}.$$

Kernelized Optimization Solver. Following (Camilleri et al., 2021), we use a kernelized optimization solver for the min-max optimization in Line 4 of Algorithm 1. The optimization problem is

$$\min_{\lambda \in \triangle_{\mathcal{X}}} \max_{x_i, x_j \in B^{(r)}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{A(\xi,\lambda)^{-1}}, \quad (8)$$

where A(ξ, λ) := ξI + Σ_{x∈X} λ_x φ(x)φ(x)^T for any ξ ≥ 0 and λ ∈ △_X, and △_X denotes the probability simplex over X.
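The key identity in the derivation is the push-through equality (N^{(r)}ξI + Φ_r^TΦ_r)^{-1}Φ_r^T = Φ_r^T(N^{(r)}ξI + Φ_rΦ_r^T)^{-1}, which lets the estimator be evaluated from the nV×nV kernel matrix instead of in H. A small numerical check of this equivalence (a sketch with synthetic features and a linear kernel, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, xi = 5, 8, 0.1
phi = rng.normal(size=(n_arms, d))             # features phi(x_1), ..., phi(x_nV)
counts = rng.integers(1, 10, size=n_arms)      # per-arm sample counts N_i^(r)
ybar = rng.normal(size=n_arms)                 # per-arm average outcomes
N = counts.sum()                               # total samples N^(r)

Phi_r = np.sqrt(counts)[:, None] * phi         # rows sqrt(N_i) phi(x_i)
K = Phi_r @ Phi_r.T                            # K^(r) (here the linear kernel)
ybar_w = np.sqrt(counts) * ybar                # entries sqrt(N_i) ybar_i

# Primal ridge estimate: theta = (N xi I + Phi_r^T Phi_r)^{-1} Phi_r^T ybar_w
theta = np.linalg.solve(N * xi * np.eye(d) + Phi_r.T @ Phi_r, Phi_r.T @ ybar_w)

# Kernelized estimate: F_hat(x) = k_r(x)^T (N xi I + K^(r))^{-1} ybar_w
x = rng.normal(size=d)                         # feature of a query arm
k_r = Phi_r @ x                                # k_r(x), computable from k(., .) alone
f_kernel = k_r @ np.linalg.solve(N * xi * np.eye(n_arms) + K, ybar_w)
f_primal = x @ theta                           # the two estimates coincide
```

The kernelized route never forms a dim(H)-sized object, which is exactly what makes the estimator communicable at cost O(nV).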

Define function

$$h(\lambda) = \max_{x_i, x_j \in B^{(r)}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{A(\xi,\lambda)^{-1}},$$

and define x_i^*(λ), x_j^*(λ) as the optimal solution attaining h(λ). Then, the gradient of h(λ) with respect to λ is

$$[\nabla_\lambda h(\lambda)]_x = -\Big( \big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top A(\xi, \lambda)^{-1} \phi(x) \Big)^2, \quad \forall x \in \mathcal{X}. \quad (9)$$

Next, we show how to efficiently compute the gradient [∇_λ h(λ)]_x with the kernel function k(·,·). Since (ξI + Φ_λ^TΦ_λ)φ(x) = ξφ(x) + Φ_λ^T k_λ(x) for any x ∈ X, we have

$$\phi(x) = \xi \big(\xi I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) + \big(\xi I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \Phi_\lambda^\top k_\lambda(x) = \xi \big(\xi I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) + \Phi_\lambda^\top \big(\xi I + K_\lambda\big)^{-1} k_\lambda(x).$$

Multiplying both sides by (φ(x_i^*(λ)) − φ(x_j^*(λ)))^T, we have

$$\big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top \phi(x) = \xi \big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top \big(\xi I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) + \big(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\big)^\top \big(\xi I + K_\lambda\big)^{-1} k_\lambda(x).$$

Then,

$$\big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top \big(\xi I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) = \xi^{-1}\big( k(x_i^*(\lambda), x) - k(x_j^*(\lambda), x) \big) - \xi^{-1}\big(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\big)^\top \big(\xi I + K_\lambda\big)^{-1} k_\lambda(x). \quad (10)$$

Therefore, we can compute the gradient ∇_λ h(λ) (Eq. (9)) using the equivalent kernelized expression Eq. (10), and the optimization (Eq. (8)) can then be efficiently solved by projected gradient descent.
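The equivalence between the feature-space gradient term and its kernelized expression (Eq. (10)) is an instance of the Woodbury identity (ξI + Φ^TΦ)^{-1} = ξ^{-1}(I − Φ^T(ξI + ΦΦ^T)^{-1}Φ). A numerical sanity check of this, again with synthetic finite-dimensional features and a linear kernel as a stand-in (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, xi = 6, 10, 0.3
Phi = rng.normal(size=(m, d))                  # phi(x) for the m arms
lam = rng.dirichlet(np.ones(m))                # allocation lambda on the simplex
Phi_l = np.sqrt(lam)[:, None] * Phi            # Phi_lambda: rows sqrt(lambda_x) phi(x)
K_l = Phi_l @ Phi_l.T                          # K_lambda

diff = Phi[0] - Phi[1]                         # phi(x_i*) - phi(x_j*)
x = Phi[2]                                     # phi(x), the gradient coordinate

# Feature-space quantity appearing inside [grad h(lambda)]_x
direct = diff @ np.linalg.solve(xi * np.eye(d) + Phi_l.T @ Phi_l, x)

# Equivalent kernelized expression: only k(., .) evaluations are needed
k_l = lambda z: Phi_l @ z                      # k_lambda(z) for a feature vector z
kernelized = (Phi[0] @ x - Phi[1] @ x
              - (k_l(Phi[0]) - k_l(Phi[1]))
              @ np.linalg.solve(xi * np.eye(m) + K_l, k_l(x))) / xi

grad_x = -direct ** 2                          # [grad h(lambda)]_x from Eq. (9)
```

The gradient coordinate is always non-positive, reflecting that putting more weight on any arm can only shrink the worst-case confidence width.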

C.3 COMPARISON OF ESTIMATORS

In the following, we compare our kernelized estimator (Eq. (3)) with those in prior kernel bandit works (Zhou et al., 2020; Zhu et al., 2021; Dubey et al., 2020; Camilleri et al., 2021) and with adaptations of federated linear bandits (Dubey & Pentland, 2020; Huang et al., 2021; Li & Wang, 2022). First, prior kernel bandit works (Zhou et al., 2020; Zhu et al., 2021) and adaptations of federated linear bandits (Dubey & Pentland, 2020; Huang et al., 2021; Li & Wang, 2022) explicitly maintain the regularized least squares estimator of the reward parameter θ*,

$$\hat{\theta}_r = \Big( N^{(r)} \xi I + \sum_{j=1}^{N^{(r)}} \phi(s_j)\phi(s_j)^\top \Big)^{-1} \sum_{j=1}^{N^{(r)}} \phi(s_j)\, y_j.$$

Since both θ̂_r and φ(s_j) lie in the high-dimensional feature space H, this estimator incurs O(dim(H)) computation and communication costs. Second, (Dubey et al., 2020) and Algorithm 6 in (Camilleri et al., 2021) use a redundant kernelized form of θ̂_r,

$$\hat{\theta}_r = \big(\Psi^\top \Psi + N^{(r)} \xi I\big)^{-1} \Psi^\top Y,$$

where Ψ := [φ(s_1)^T; ...; φ(s_{N^{(r)}})^T] ∈ R^{N^{(r)} × dim(H)}, Y := [y_1, ..., y_{N^{(r)}}]^T ∈ R^{N^{(r)}}, and y_1, ..., y_{N^{(r)}} are the observed raw outcomes of the sample sequence (s_1, ..., s_{N^{(r)}}). This estimator needs all N^{(r)} raw sample outcomes y_1, ..., y_{N^{(r)}} as inputs, rather than only the nV average outcomes ȳ^{(r)}_1, ..., ȳ^{(r)}_{nV} as in our estimator (Eq. (2)), and thus incurs an O(N^{(r)}) communication cost. Here the number of samples N^{(r)} is often far larger than the number of arms nV.
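The reason merging loses nothing is that the ridge estimator depends on the raw data only through the per-arm counts and averages (sufficient statistics). A short NumPy sketch (our own illustration, linear features) verifying that the raw-sample estimator equals the merged one while requiring far more numbers on the wire:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_arms, xi = 4, 5, 0.2
phi = rng.normal(size=(n_arms, d))             # arm features
counts = np.array([3, 1, 4, 2, 5])             # N_i samples per arm
N = counts.sum()

# Raw design: one row per sample, one raw outcome per sample (O(N) to communicate)
Psi = np.repeat(phi, counts, axis=0)           # shape (N, d)
Y = rng.normal(size=N)

# Merged statistics: per-arm counts and average outcomes only (O(n_arms) to communicate)
starts = np.cumsum(counts) - counts
ybar = np.array([Y[s:s + c].mean() for s, c in zip(starts, counts)])

theta_raw = np.linalg.solve(N * xi * np.eye(d) + Psi.T @ Psi, Psi.T @ Y)
theta_merged = np.linalg.solve(N * xi * np.eye(d) + phi.T @ (counts[:, None] * phi),
                               phi.T @ (counts * ybar))
```

Both routes produce the same estimate; only the communication payload differs (N = 15 raw outcomes versus 2 × 5 merged numbers in this toy instance).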
Finally, Algorithm 4 in (Camilleri et al., 2021) adopts a robust inverse propensity score estimator. Let λ*_r and ρ*_r be the optimal solution and optimal value of the min-max optimization min_λ max_{x_i,x_j ∈ B^{(r)}_v, v∈[V]} ‖φ(x_i) − φ(x_j)‖²_{A(ξ,λ)^{-1}}, where A(ξ, λ) := ξI + Σ_{x∈X} λ_x φ(x)φ(x)^T for any ξ ≥ 0 and λ ∈ △_X. The robust estimator is

$$\hat{\theta}_r := \operatorname*{argmin}_{\theta} \max_{x_i, x_{i'} \in B^{(r)}_v, v \in [V]} \frac{\big| (\phi(x_i) - \phi(x_{i'}))^\top \theta - W_{i,i'} \big|}{\|\phi(x_i) - \phi(x_{i'})\|_{A(\xi, \lambda_r^*)^{-1}}}, \quad W_{i,i'} := \hat{\mu}\Big( \Big( (\phi(x_i) - \phi(x_{i'}))^\top A(\xi, \lambda_r^*)^{-1} \phi(s_j)\, y_j \Big)_{j=1}^{N^{(r)}} \Big),$$

where μ̂(·) is the median-of-means estimator or Catoni's estimator (Lugosi & Mendelson, 2019). This estimator also requires all N^{(r)} raw sample outcomes y_1, ..., y_{N^{(r)}}, and causes an O(N^{(r)}) communication cost.

Algorithm 2 Collaborative Multi-agent Algorithm CoKernelFB: for Agent v ∈ [V]
1: Input: regularization parameter ξ ≤ 1/(16(1+ε)²(ρ*(ξ))²‖θ*‖²), per-agent budget T ≥ (max{ρ*(ξ), d(ξ, λ*_1)}/V)·log(ω(ξ, X)), arm set X which satisfies ω(ξ, {x_{v,*}, x}) ≥ 1 for all x ∈ X_v \ {x_{v,*}} and v ∈ [V], kernel function k(·,·): X × X → R, rounding procedure ROUND, rounding approximation parameter ε = 1/10
2: Initialization: R ← ⌈log₂(ω(ξ, X))⌉, N ← ⌊TV/R⌋, B^{(1)}_v ← X_v for all v ∈ [V]
4: λ*_r ← argmin_{λ∈△_X} max_{x_i,x_j ∈ B^{(r)}_{v'}, v'∈[V]} ‖φ(x_i) − φ(x_j)‖²_{(ξI + Σ_{x∈X} λ_x φ(x)φ(x)^T)^{-1}} // compute the optimal sample allocation
5: (s_1, ..., s_N) ← ROUND(ξ, λ*_r, N, ε)
6: Extract the sub-sequence s^{(r)}_v of (s_1, ..., s_N) that only contains arms in X_v
7: Sample s^{(r)}_v and observe random rewards y^{(r)}_v
8: Broadcast {(N^{(r)}_{v,i}, ȳ^{(r)}_{v,i})}_{i∈[n]} and receive {(N^{(r)}_{v',i}, ȳ^{(r)}_{v',i})}_{i∈[n]} from all other agents v' ∈ [V] \ {v}
9: for all v' ∈ [V] do
10: F̂_r(x) ← k_r(x)^T (K^{(r)} + N^{(r)}ξI)^{-1} ȳ^{(r)} for all x ∈ B^{(r)}_{v'} // estimate the rewards of alive arms
12: Let i_{r+1} be the largest index such that ω(ξ, {x_{(1)}, ..., x_{(i_{r+1})}}) ≤ ω(ξ, B^{(r)}_{v'})/2
13: B^{(r+1)}_{v'} ← {x_{(1)}, ..., x_{(i_{r+1})}} // cut the alive arm sets B^{(r)}_1, ..., B^{(r)}_V in the dimension sense

C.4 PSEUDO-CODE AND DESCRIPTION OF ALGORITHM CoKernelFB

In this section, we present the pseudo-code of CoKernelFB (Algorithm 2), and give a detailed algorithm description. The procedure of CoKernelFB is as follows. During initialization (Line 2), we pre-determine the number of rounds R and the number of samples N for each round according to the data dimension ω(ξ, X), which is formally defined as

$$\omega(\xi, S) := \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{x_i, x_j \in S} \|\phi(x_i) - \phi(x_j)\|^2_{A(\xi,\lambda)^{-1}}, \quad \forall S \subseteq \mathcal{X},$$

where A(ξ, λ) := ξI + Σ_{x∈X} λ_x φ(x)φ(x)^T. ω(ξ, S) represents the principal dimension of the data projections of S in the RKHS under regularization parameter ξ. Each agent v maintains the alive arm sets B_{v'} for all agents v' ∈ [V], and calculates a globally optimal sample allocation λ*_r (Line 4). Then, she generates a sample sequence (s_1, ..., s_N) according to λ*_r, selects the sub-sequence s^{(r)}_v that only contains her available arms to perform sampling (Lines 5-7), and communicates only the number of samples N^{(r)}_{v,i} and the average observed reward ȳ^{(r)}_{v,i} for each arm x_{v,i} with other agents (Line 8). With the shared information, she estimates the rewards of alive arms and only keeps the best half of them in the dimension sense (Lines 10-13). Regarding the conditions on the input parameters (Line 1), the condition on the regularization parameter ξ implies that CoKernelFB needs a small regularization parameter so that the bias due to regularization is small. Such conditions are similarly needed in prior kernel bandit work (Camilleri et al., 2021), and can be dropped in the extended PAC setting (which allows a gap between the identified best arm and the true best arm). The condition on T ensures that the given sample budget is larger than the number of samples required by the rounding procedure ROUND. In addition, the condition on ω(·,·) guarantees that the number of rounds is bounded by O(log(ω(ξ, X))) rather than by an uncontrollable variable. Such conditions are also needed by the prior fixed-budget pure exploration algorithm of (Katz-Samuels et al., 2020).

Communication and Computation. CoKernelFB also adopts the kernelized estimator in Section 4.2 to estimate the rewards of alive arms (Line 10). Specifically, using Eq. (2), we can estimate the reward of arm x by

$$\hat{F}_r(x) = \phi(x)^\top \hat{\theta}_r = k_r(x)^\top \big(N^{(r)} \xi I + K^{(r)}\big)^{-1} \bar{y}^{(r)}, \quad (12)$$

where ȳ^{(r)} := [√(N^{(r)}_1) ȳ^{(r)}_1, ..., √(N^{(r)}_{nV}) ȳ^{(r)}_{nV}]^T, K^{(r)} := [√(N^{(r)}_i N^{(r)}_j) k(x_i, x_j)]_{i,j∈[nV]} and k_r(x) := [√(N^{(r)}_1) k(x, x_1), ..., √(N^{(r)}_{nV}) k(x, x_{nV})]^T are the weighted average outcomes, kernel matrix and correlations of the nV arms, respectively, which merge repetitive information on the same arms. Using the kernelized estimator (Eq. (12)), CoKernelFB only requires an O(nV) communication cost, instead of the O(N^{(r)}) or O(dim(H)) costs of adaptations of existing single-agent algorithms (Katz-Samuels et al., 2020).
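The quantity ω(ξ, S) is itself a min-max program of the same form as Eq. (8), so it can be approximated with the projected-gradient scheme discussed in Appendix C.2. Below is a toy sketch (our own, with explicit low-dimensional features, a max over all arm pairs for brevity, and the helper names `omega` and `project_simplex` being ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0)

def omega(Phi, xi, iters=300, lr=0.05):
    """Approximate omega(xi, S) = min_lambda max_{i,j} |phi_i - phi_j|^2_{A(xi,lambda)^-1}
    by projected (sub)gradient descent over the simplex of allocations."""
    m, d = Phi.shape
    lam = np.full(m, 1.0 / m)
    for _ in range(iters):
        Ainv = np.linalg.inv(xi * np.eye(d) + Phi.T @ (lam[:, None] * Phi))
        G = Phi @ Ainv @ Phi.T
        sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G   # pairwise widths
        i, j = np.unravel_index(np.argmax(sq), sq.shape)          # worst-case pair
        grad = -(Phi @ (Ainv @ (Phi[i] - Phi[j]))) ** 2           # [grad h(lambda)]_x
        lam = project_simplex(lam - lr * grad)
    Ainv = np.linalg.inv(xi * np.eye(d) + Phi.T @ (lam[:, None] * Phi))
    G = Phi @ Ainv @ Phi.T
    return (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G).max()
```

On a two-arm instance with orthonormal features and ξ = 1, the optimum is attained at the uniform allocation with value 1/(1+1/2) + 1/(1+1/2) = 4/3, which the sketch recovers.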

D PROOFS FOR THE FIXED-CONFIDENCE SETTING D.1 PROOF OF THEOREM 1

Our proof of Theorem 1 adapts the analytical procedure of (Fiez et al., 2019; Katz-Samuels et al., 2020) to the multi-agent setting. Let r* := ⌈log₂(2/Δ_min)⌉ + 1. Intuitively, r* is an upper bound on the number of rounds used by algorithm CoKernelFC. For any λ ∈ △_X, let Φ_λ := [√λ_1 φ(x_1)^T; ...; √λ_{nV} φ(x_{nV})^T], where λ_i denotes the weight allocated to arm x_i for any i ∈ [nV]. For any ξ ≥ 0 and λ ∈ △_X, A(ξ, λ) := ξI + Σ_{x∈X} λ_x φ(x)φ(x)^T = ξI + Φ_λ^TΦ_λ. The regularization parameter ξ in algorithm CoKernelFC satisfies

$$\sqrt{\xi}\max_{x_i, x_j \in \mathcal{X}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{A(\xi, \lambda_1^*)^{-1}} \le \frac{\Delta_{\min}}{32(1+\varepsilon)\|\theta^*\|}.$$

Since max_{x_i,x_j ∈ B^{(r)}_v, v∈[V]} ‖φ(x_i) − φ(x_j)‖_{A(ξ,λ*_r)^{-1}} is non-increasing in r (from Line 4 of algorithm CoKernelFC), we have

$$2(1+\varepsilon)\sqrt{\xi}\max_{x_i, x_j \in B^{(r)}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{A(\xi, \lambda_r^*)^{-1}}\|\theta^*\| \le \frac{\Delta_{\min}}{16} \le \frac{1}{2^{r+1}}$$

for any 1 ≤ r ≤ r*. In order to prove Theorem 1, we first introduce several important lemmas.

Lemma 1 (Concentration). Define the event

$$\mathcal{G} = \Big\{ \big| \hat{F}_r(x_i) - \hat{F}_r(x_j) - (F(x_i) - F(x_j)) \big| < 2(1+\varepsilon)\|\phi(x_i) - \phi(x_j)\|_{A(\xi,\lambda_r^*)^{-1}}\Big( \sqrt{\tfrac{2\log(2n^2 V/\delta_r)}{N^{(r)}}} + \sqrt{\xi}\,\|\theta^*\|_2 \Big) \le 2^{-r}, \ \forall x_i, x_j \in B^{(r)}_v, \forall v \in [V], \forall 1 \le r \le r^* \Big\}.$$

Then Pr[G] ≥ 1 − δ.

Proof of Lemma 1. Let γ_r := N^{(r)}ξ. Recall that θ̂_r := (γ_r I + Φ_r^TΦ_r)^{-1}Φ_r^T ȳ^{(r)}, Φ_r := [√(N^{(r)}_1)φ(x_1)^T; ...; √(N^{(r)}_{nV})φ(x_{nV})^T] and ȳ^{(r)} := [√(N^{(r)}_1)ȳ^{(r)}_1, ..., √(N^{(r)}_{nV})ȳ^{(r)}_{nV}]^T. Let η̄^{(r)} := [√(N^{(r)}_1)η̄^{(r)}_1, ..., √(N^{(r)}_{nV})η̄^{(r)}_{nV}]^T, where η̄^{(r)}_i := ȳ^{(r)}_i − φ(x_i)^Tθ* denotes the average noise on arm x_i. Then, for any fixed round 1 ≤ r ≤ r*, x_i, x_j ∈ B^{(r)}_v and v ∈ [V], we have

$$\begin{aligned} \hat F_r(x_i) - \hat F_r(x_j) - (F(x_i) - F(x_j)) &= (\phi(x_i)-\phi(x_j))^\top(\hat\theta_r - \theta^*) \\ &= (\phi(x_i)-\phi(x_j))^\top\big( (\gamma_r I + \Phi_r^\top\Phi_r)^{-1}\Phi_r^\top(\Phi_r\theta^* + \bar\eta^{(r)}) - \theta^*\big) \\ &= \underbrace{(\phi(x_i)-\phi(x_j))^\top(\gamma_r I + \Phi_r^\top\Phi_r)^{-1}\Phi_r^\top\bar\eta^{(r)}}_{\text{Term 1}} - \gamma_r(\phi(x_i)-\phi(x_j))^\top(\gamma_r I + \Phi_r^\top\Phi_r)^{-1}\theta^*. \quad (13)\end{aligned}$$

In Eq. (13), the expectation of Term 1 is zero, and the variance of Term 1 is bounded by

$$(\phi(x_i)-\phi(x_j))^\top(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}\Phi_r^\top\Phi_r(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}(\phi(x_i)-\phi(x_j)) \le \|\phi(x_i)-\phi(x_j)\|^2_{(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}}.$$

Using the Hoeffding inequality, with probability at least 1 − δ_r/(n²V),

$$\big| (\phi(x_i)-\phi(x_j))^\top(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}\Phi_r^\top\bar\eta^{(r)} \big| < \|\phi(x_i)-\phi(x_j)\|_{(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}}\sqrt{2\log\tfrac{2n^2V}{\delta_r}}. \quad (14)$$

Thus, taking absolute values on both sides of Eq. (13) and using Eq. (14) and the Cauchy-Schwarz inequality, for any fixed round 1 ≤ r ≤ r*, x_i, x_j ∈ B^{(r)}_v and v ∈ [V], with probability at least 1 − δ_r/(n²V),

$$\begin{aligned} \big|\hat F_r(x_i) - \hat F_r(x_j) - (F(x_i)-F(x_j))\big| &< \|\phi(x_i)-\phi(x_j)\|_{(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}}\sqrt{2\log\tfrac{2n^2V}{\delta_r}} + \gamma_r\|\phi(x_i)-\phi(x_j)\|_{(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}}\|\theta^*\|_{(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}} \\ &\le \|\phi(x_i)-\phi(x_j)\|_{(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}}\sqrt{2\log\tfrac{2n^2V}{\delta_r}} + \sqrt{\gamma_r}\,\|\phi(x_i)-\phi(x_j)\|_{(\gamma_r I+\Phi_r^\top\Phi_r)^{-1}}\|\theta^*\|_2 \\ &\overset{(a)}{\le} 2(1+\varepsilon)\frac{\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}}{\sqrt{N^{(r)}}}\sqrt{2\log\tfrac{2n^2V}{\delta_r}} + \sqrt{\xi N^{(r)}}\cdot 2(1+\varepsilon)\frac{\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}}{\sqrt{N^{(r)}}}\|\theta^*\|_2 \\ &\le 2(1+\varepsilon)\max_{x_i,x_j\in B_v^{(r)},v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}\Big(\sqrt{\tfrac{2\log(2n^2V/\delta_r)}{N^{(r)}}} + \sqrt{\xi}\,\|\theta^*\|_2\Big), \end{aligned}$$

where (a) follows from the guarantee of the rounding procedure (Eq. (6)) and γ_r := N^{(r)}ξ. According to the condition on ξ, it holds that

$$2(1+\varepsilon)\sqrt{\xi}\max_{x_i,x_j\in B_v^{(r)},v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}\|\theta^*\|_2 \le \frac{1}{2^{r+1}}.$$

Thus, with probability at least 1 − δ_r/(n²V),

$$\big|\hat F_r(x_i)-\hat F_r(x_j)-(F(x_i)-F(x_j))\big| < \sqrt{\frac{8(1+\varepsilon)^2\rho_r^*\log(2n^2V/\delta_r)}{N^{(r)}}} + \frac{1}{2^{r+1}} \le \frac{1}{2^{r+1}}+\frac{1}{2^{r+1}} = \frac{1}{2^r}.$$

By a union bound over arms x_i, x_j, agents v and rounds r, we have Pr[G] ≥ 1 − δ.

For any 1 < r ≤ r* and v ∈ [V], let S^{(r)}_v = {x ∈ X_v : F(x_{v,*}) − F(x) ≤ 4·2^{−r}}. Suppose, for contradiction, that there exists some v ∈ [V] such that x_{v,*} ∉ B^{(r+1)}_v. According to the elimination rule of algorithm CoKernelFC, there exists some x' ∈ B^{(r)}_v such that F̂_r(x') − F̂_r(x_{v,*}) ≥ 2^{−r}. Using Lemma 1, we have F(x') − F(x_{v,*}) > F̂_r(x') − F̂_r(x_{v,*}) − 2^{−r} ≥ 0, which contradicts the optimality of x_{v,*} and completes the proof of the first statement. Now, we prove the second statement, also by induction. To begin, we prove that for any v ∈ [V], B^{(2)}_v ⊆ S^{(2)}_v.
Suppose that there exists some v ∈ [V] such that B^{(2)}_v ⊄ S^{(2)}_v. Then, there exists some x' ∈ B^{(2)}_v such that F(x_{v,*}) − F(x') > 4·2^{−2} = 1. Using Lemma 1, we have that at round r = 1, F̂_r(x_{v,*}) − F̂_r(x') ≥ F(x_{v,*}) − F(x') − 2^{−1} > 1 − 2^{−1} = 2^{−1}, which implies that x' should have been eliminated in round r = 1 and gives a contradiction.

Let ε = 1/10. Then, we have

$$\begin{aligned}
\sum_{r=1}^{r^*}N^{(r)} &\le \sum_{r=1}^{r^*}\Big(32\cdot 4^{r}(1+\varepsilon)^2\rho_r^*\log\tfrac{2n^2V}{\delta_r}+1+\tau(\xi,\lambda_r^*,\varepsilon)\Big)\\
&\le \sum_{r=2}^{r^*}512(1+\varepsilon)^2\,\frac{\min_{\lambda}\max_{x_i,x_j\in B_v^{(r)},v\in[V]}\|\phi(x_i)-\phi(x_j)\|^2_{A(\xi,\lambda)^{-1}}}{(4\cdot 2^{-r})^2}\log\tfrac{4Vn^2(r^*)^2}{\delta}+N^{(1)}+2\tau(\xi,\lambda_1^*,\varepsilon)\,r^*\\
&\le \sum_{r=2}^{r^*}2048(1+\varepsilon)^2\,\frac{\min_{\lambda}\max_{x\in B_v^{(r)}\setminus\{x_{v,*}\},v\in[V]}\|\phi(x_{v,*})-\phi(x)\|^2_{A(\xi,\lambda)^{-1}}}{(4\cdot 2^{-r})^2}\log\tfrac{4Vn^2(r^*)^2}{\delta}+N^{(1)}+2\tau(\xi,\lambda_1^*,\varepsilon)\,r^*\\
&\le \sum_{r=2}^{r^*}2048(1+\varepsilon)^2\min_{\lambda}\max_{x\in B_v^{(r)}\setminus\{x_{v,*}\},v\in[V]}\frac{\|\phi(x_{v,*})-\phi(x)\|^2_{A(\xi,\lambda)^{-1}}}{(F(x_{v,*})-F(x))^2}\log\tfrac{4Vn^2(r^*)^2}{\delta}+N^{(1)}+2\tau(\xi,\lambda_1^*,\varepsilon)\,r^*\\
&\le \sum_{r=2}^{r^*}2048(1+\varepsilon)^2\min_{\lambda}\max_{x\in\mathcal X_v\setminus\{x_{v,*}\},v\in[V]}\frac{\|\phi(x_{v,*})-\phi(x)\|^2_{A(\xi,\lambda)^{-1}}}{(F(x_{v,*})-F(x))^2}\log\tfrac{4Vn^2(r^*)^2}{\delta}+N^{(1)}+2\tau(\xi,\lambda_1^*,\varepsilon)\,r^*\\
&\le r^*\cdot 2048(1+\varepsilon)^2\rho^*(\xi)\log\tfrac{4Vn^2(r^*)^2}{\delta}+N^{(1)}+2\tau(\xi,\lambda_1^*,\varepsilon)\,r^*\\
&= O\Big(\rho^*(\xi)\log\Delta_{\min}^{-1}\big(\log\tfrac{Vn}{\delta}+\log\log\Delta_{\min}^{-1}\big)+d(\xi,\lambda_1^*)\log\Delta_{\min}^{-1}\Big),
\end{aligned}$$

where the third inequality uses the triangle inequality, and the fourth uses B^{(r)}_v ⊆ S^{(r)}_v (Lemma 2), i.e., F(x_{v,*}) − F(x) ≤ 4·2^{−r} for any x ∈ B^{(r)}_v. Thus, the per-agent sample complexity is bounded by

$$O\Big(\frac{\rho^*(\xi)}{V}\log\Delta_{\min}^{-1}\big(\log\tfrac{Vn}{\delta}+\log\log\Delta_{\min}^{-1}\big)+\frac{d(\xi,\lambda_1^*)}{V}\log\Delta_{\min}^{-1}\Big).$$

Since algorithm CoKernelFC runs at most r* := ⌈log₂(2/Δ_min)⌉ + 1 rounds, the number of communication rounds is bounded by O(log Δ_min^{−1}).

D.2 INTERPRETATION OF THEOREM 1

Proof of Corollary 1. Recall that for any λ ∈ △_X, Φ_λ := [√λ_1 φ(x_1)^T; ...; √λ_{nV} φ(x_{nV})^T] and K_λ := [√(λ_i λ_j) k(x_i, x_j)]_{i,j∈[nV]} = Φ_λΦ_λ^T. Let λ* := argmax_{λ∈△_X} log det(I + ξ^{-1}K_λ) = argmax_{λ∈△_X} log det(I + ξ^{-1}Φ_λ^TΦ_λ). For any ξ ≥ 0 and λ ∈ △_X, A(ξ, λ) := ξI + Σ_{x∈X} λ_x φ(x)φ(x)^T = ξI + Φ_λ^TΦ_λ. Then, we have

$$\begin{aligned} \rho^*(\xi) &= \min_{\lambda\in\triangle_{\mathcal X}}\max_{x \in \mathcal X_v\setminus\{x_{v,*}\}, v\in[V]} \frac{\|\phi(x_{v,*})-\phi(x)\|^2_{A(\xi,\lambda)^{-1}}}{(F(x_{v,*})-F(x))^2} \le \frac{1}{\Delta_{\min}^2}\min_{\lambda\in\triangle_{\mathcal X}}\max_{x\in\mathcal X_v, v\in[V]} \|\phi(x_{v,*})-\phi(x)\|^2_{A(\xi,\lambda)^{-1}} \\ &\le \frac{4}{\Delta_{\min}^2}\min_{\lambda\in\triangle_{\mathcal X}}\max_{x\in\mathcal X}\|\phi(x)\|^2_{A(\xi,\lambda)^{-1}} \le \frac{4}{\Delta_{\min}^2}\max_{x\in\mathcal X}\|\phi(x)\|^2_{A(\xi,\lambda^*)^{-1}} \overset{(b)}{=} \frac{4}{\Delta_{\min}^2}\sum_{x\in\mathcal X}\lambda^*_x\|\phi(x)\|^2_{A(\xi,\lambda^*)^{-1}}, \end{aligned}$$

where the second inequality uses the triangle inequality and (b) is due to Lemma 3. Since λ*_x‖φ(x)‖²_{A(ξ,λ*)^{-1}} ≤ 1 for any x ∈ X,

$$\sum_{x\in\mathcal X}\lambda^*_x\|\phi(x)\|^2_{A(\xi,\lambda^*)^{-1}} \le 2\sum_{x\in\mathcal X}\log\big(1+\lambda^*_x\|\phi(x)\|^2_{A(\xi,\lambda^*)^{-1}}\big) \overset{(c)}{\le} 2\log\frac{\det\big(\xi I + \sum_{x\in\mathcal X}\lambda^*_x\phi(x)\phi(x)^\top\big)}{\det(\xi I)} = 2\log\det\big(I+\xi^{-1}\Phi_{\lambda^*}^\top\Phi_{\lambda^*}\big) = 2\log\det\big(I+\xi^{-1}K_{\lambda^*}\big),$$

where (c) comes from Lemma 4. Thus, we have

$$\rho^*(\xi) \le \frac{8}{\Delta_{\min}^2}\log\det\big(I+\xi^{-1}K_{\lambda^*}\big). \quad (15)$$

In the following, we interpret the term log det(I + ξ^{-1}K_{λ*}) via the maximum information gain and the effective dimension, and then decompose it into components from task similarities and arm features.

Maximum Information Gain. Recall that the maximum information gain is defined as Υ = max_{λ∈△_X} log det(I + ξ^{-1}K_λ) = log det(I + ξ^{-1}K_{λ*}). Then, using Eq. (15), we have ρ*(ξ) ≤ (8/Δ²_min)·Υ. (16) Combining the bound in Theorem 1 and Eq. (16), the per-agent sample complexity is bounded by

$$S = O\Big(\frac{\rho^*(\xi)}{V}\log\Delta_{\min}^{-1}\big(\log\tfrac{Vn}{\delta}+\log\log\Delta_{\min}^{-1}\big)+\frac{d(\xi,\lambda_1^*)}{V}\log\Delta_{\min}^{-1}\Big) = O\Big(\frac{\Upsilon}{\Delta_{\min}^2 V}\log\Delta_{\min}^{-1}\big(\log\tfrac{Vn}{\delta}+\log\log\Delta_{\min}^{-1}\big)+\frac{d(\xi,\lambda_1^*)}{V}\log\Delta_{\min}^{-1}\Big).$$

Effective Dimension. Recall that α_1 ≥ ... ≥ α_{nV} denote the eigenvalues of K_{λ*} in decreasing order, and the effective dimension is defined as d_eff = min{j : jξ log(nV) ≥ Σ_{i=j+1}^{nV} α_i}. Then it holds that d_eff ξ log(nV) ≥ Σ_{i>d_eff} α_i. Let ε' = d_eff ξ log(nV) − Σ_{i>d_eff} α_i, so that ε' ≤ d_eff ξ log(nV), Σ_{i≤d_eff} α_i = Trace(K_{λ*}) − d_eff ξ log(nV) + ε' and Σ_{i>d_eff} α_i = d_eff ξ log(nV) − ε'. By the AM-GM inequality,

$$\begin{aligned}
\log\det\big(I+\xi^{-1}K_{\lambda^*}\big) &= \log\prod_{i=1}^{nV}\big(1+\xi^{-1}\alpha_i\big) \\
&\le \log\Big[\Big(1+\xi^{-1}\tfrac{\operatorname{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\xi\log(nV)+\varepsilon'}{d_{\mathrm{eff}}}\Big)^{d_{\mathrm{eff}}}\Big(1+\xi^{-1}\tfrac{d_{\mathrm{eff}}\xi\log(nV)-\varepsilon'}{nV-d_{\mathrm{eff}}}\Big)^{nV-d_{\mathrm{eff}}}\Big] \\
&\le d_{\mathrm{eff}}\log\Big(1+\xi^{-1}\tfrac{\operatorname{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\xi\log(nV)+\varepsilon'}{d_{\mathrm{eff}}}\Big)+(nV-d_{\mathrm{eff}})\log\Big(1+\tfrac{d_{\mathrm{eff}}\log(nV)}{nV-d_{\mathrm{eff}}}\Big)\\
&\overset{(d)}{\le} d_{\mathrm{eff}}\log\Big(1+\xi^{-1}\tfrac{\operatorname{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\xi\log(nV)+\varepsilon'}{d_{\mathrm{eff}}}\Big)+nV\log\Big(1+\tfrac{d_{\mathrm{eff}}\log(nV+d_{\mathrm{eff}})}{nV}\Big)\\
&\le d_{\mathrm{eff}}\log\Big(1+\tfrac{\operatorname{Trace}(K_{\lambda^*})}{\xi d_{\mathrm{eff}}}\Big)+d_{\mathrm{eff}}\log(nV+d_{\mathrm{eff}}) \le d_{\mathrm{eff}}\log\Big(2nV\Big(1+\tfrac{\operatorname{Trace}(K_{\lambda^*})}{\xi d_{\mathrm{eff}}}\Big)\Big),
\end{aligned}$$

where inequality (d) is due to the fact that (1 + d_eff log(x + d_eff)/x)^x is monotonically increasing with respect to x ≥ 1. Then, using Eq. (15), we have

$$\rho^*(\xi) \le \frac{8}{\Delta_{\min}^2}\, d_{\mathrm{eff}}\log\Big(2nV\Big(1+\frac{\operatorname{Trace}(K_{\lambda^*})}{\xi d_{\mathrm{eff}}}\Big)\Big). \quad (17)$$

Combining the bound in Theorem 1 and Eq. (17), the per-agent sample complexity is bounded by

$$S = O\Big(\frac{d_{\mathrm{eff}}}{\Delta_{\min}^2 V}\log\Big(nV\Big(1+\tfrac{\operatorname{Trace}(K_{\lambda^*})}{\xi d_{\mathrm{eff}}}\Big)\Big)\log\Delta_{\min}^{-1}\big(\log\tfrac{Vn}{\delta}+\log\log\Delta_{\min}^{-1}\big)+\frac{d(\xi,\lambda_1^*)}{V}\log\Delta_{\min}^{-1}\Big).$$

Decomposition. Recall that K_{λ*} = [√(λ*_i λ*_j) k(x_i, x_j)]_{i,j∈[nV]}, K_Z = [k_Z(z_v, z_{v'})]_{v,v'∈[V]} and K_{X,λ*} = [√(λ*_i λ*_j) k_X(x_i, x_j)]_{i,j∈[nV]}. Let K̃_Z = [k_Z(z_{v_i}, z_{v_j})]_{i,j∈[nV]}, where for any i ∈ [nV], v_i denotes the index of the task of the i-th arm x_i in the arm set X and z_{v_i} denotes its task feature. It holds that rank(K̃_Z) = rank(K_Z). Since K_{λ*} is a Hadamard product of K̃_Z and K_{X,λ*}, we have rank(K_{λ*}) ≤ rank(K̃_Z)·rank(K_{X,λ*}) = rank(K_Z)·rank(K_{X,λ*}). By the AM-GM inequality,

$$\log\det\big(I+\xi^{-1}K_{\lambda^*}\big) = \log\prod_{i=1}^{\operatorname{rank}(K_{\lambda^*})}\big(1+\xi^{-1}\alpha_i\big) \le \operatorname{rank}(K_{\lambda^*})\log\frac{\sum_{i=1}^{\operatorname{rank}(K_{\lambda^*})}(1+\xi^{-1}\alpha_i)}{\operatorname{rank}(K_{\lambda^*})} \le \operatorname{rank}(K_Z)\operatorname{rank}(K_{\mathcal X,\lambda^*})\log\frac{\operatorname{Trace}\big(I+\xi^{-1}K_{\lambda^*}\big)}{\operatorname{rank}(K_{\lambda^*})}.$$

Then, using Eq. (15), we have

$$\rho^*(\xi) \le \frac{8}{\Delta_{\min}^2}\,\operatorname{rank}(K_Z)\operatorname{rank}(K_{\mathcal X,\lambda^*})\log\frac{\operatorname{Trace}\big(I+\xi^{-1}K_{\lambda^*}\big)}{\operatorname{rank}(K_{\lambda^*})}. \quad (18)$$

Combining the bound in Theorem 1 and Eq. (18), the per-agent sample complexity is bounded by

$$O\Big(\frac{\operatorname{rank}(K_Z)\operatorname{rank}(K_{\mathcal X,\lambda^*})}{\Delta_{\min}^2 V}\log\frac{\operatorname{Trace}(I+\xi^{-1}K_{\lambda^*})}{\operatorname{rank}(K_{\lambda^*})}\log\Delta_{\min}^{-1}\big(\log\tfrac{Vn}{\delta}+\log\log\Delta_{\min}^{-1}\big)+\frac{d(\xi,\lambda_1^*)}{V}\log\Delta_{\min}^{-1}\Big).$$

Hence, we complete the proof of Corollary 1.

E PROOFS FOR THE FIXED-BUDGET SETTING

E.1 PROOF OF THEOREM 2

Proof of Theorem 2. Our proof of Theorem 2 adapts the error probability analysis in (Katz-Samuels et al., 2020) to the multi-agent setting.
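All three quantities driving these bounds are computable from the nV×nV kernel matrix alone. A small NumPy sketch (our own illustration with synthetic features; a linear kernel stands in for k(·,·)) checking (i) that log det(I + ξ^{-1}K_λ) equals its feature-space counterpart (Sylvester's determinant identity), (ii) the effective-dimension recipe, and (iii) the Hadamard rank bound:

```python
import numpy as np

rng = np.random.default_rng(3)
nV, d, xi = 12, 4, 0.5

Phi = rng.normal(size=(nV, d))                 # synthetic features phi(x_i)
lam = rng.dirichlet(np.ones(nV))               # an allocation lambda
Phi_l = np.sqrt(lam)[:, None] * Phi            # rows sqrt(lambda_i) phi(x_i)
K = Phi_l @ Phi_l.T                            # nV x nV kernel matrix K_lambda

# (i) log det(I + K/xi) can be evaluated from K, never touching dim(H)
logdet_kernel = np.linalg.slogdet(np.eye(nV) + K / xi)[1]
logdet_feature = np.linalg.slogdet(np.eye(d) + (Phi_l.T @ Phi_l) / xi)[1]

# (ii) effective dimension: smallest j with j * xi * log(nV) >= sum_{i>j} alpha_i
alpha = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues, decreasing
d_eff = next(j for j in range(nV + 1) if j * xi * np.log(nV) >= alpha[j:].sum())

# (iii) Hadamard rank bound: rank(A * B) <= rank(A) * rank(B)
U = rng.normal(size=(nV, 2))
W = rng.normal(size=(nV, 3))
A = U @ U.T                                    # rank-2 stand-in for the task kernel block
B = W @ W.T                                    # rank-3 stand-in for K_{X,lambda}
hadamard_rank = np.linalg.matrix_rank(A * B)   # '*' is the entrywise product
```

In particular, (iii) is the mechanism by which task similarity (low rank(K_Z)) shrinks the sample complexity: the entrywise product cannot exceed rank 2 × 3 = 6 here, however large nV is.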
Since the number of samples used over all agents in each round is N = ⌊TV/R⌋, the total number of samples used by algorithm CoKernelFB is at most TV, and the total number of samples used per agent is at most T. Now we prove the error probability upper bound. Recall that for any ξ ≥ 0 and λ ∈ △_X, A(ξ, λ) := ξI + Σ_{x∈X} λ_x φ(x)φ(x)^T = ξI + Φ_λ^TΦ_λ, where Φ_λ = [√λ_1 φ(x_1)^T; ...; √λ_{nV} φ(x_{nV})^T]. The regularization parameter ξ in algorithm CoKernelFB satisfies ξ ≤ 1/(16(1+ε)²(ρ*(ξ))²‖θ*‖²). For any r ∈ [R], x_i, x_j ∈ B^{(r)}_v and v ∈ [V], define the reward gap

$$\Delta_{r,x_i,x_j} = \inf\Big\{\Delta > 0 : \frac{\|\phi(x_i)-\phi(x_j)\|^2_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}}{\Delta^2} \le 8\rho^*(\xi)\Big\},$$

and the event

$$\mathcal{J}_{r,x_i,x_j} = \Big\{ \big|\hat F_r(x_i)-\hat F_r(x_j)-(F(x_i)-F(x_j))\big| < \Delta_{r,x_i,x_j} \Big\}.$$

In the following, we prove Pr[¬J_{r,x_i,x_j}] ≤ 2exp(−N/(32(1+ε)ρ*(ξ))). Similar to the analysis of Eq. (13) in the proof of Lemma 1, for any r ∈ [R], x_i, x_j ∈ B^{(r)}_v and v ∈ [V],

$$\hat F_r(x_i)-\hat F_r(x_j)-(F(x_i)-F(x_j)) = \underbrace{(\phi(x_i)-\phi(x_j))^\top\big(N\xi I+\Phi_r^\top\Phi_r\big)^{-1}\Phi_r^\top\bar\eta^{(r)}}_{\text{Term 1}} - \underbrace{\xi N(\phi(x_i)-\phi(x_j))^\top\big(N\xi I+\Phi_r^\top\Phi_r\big)^{-1}\theta^*}_{\text{Term 2}}. \quad (19)$$

Here, the expectation of Term 1 is zero, and the variance of Term 1 is bounded by

$$\|\phi(x_i)-\phi(x_j)\|^2_{(N\xi I+\Phi_r^\top\Phi_r)^{-1}} \overset{(a)}{\le} \frac{2(1+\varepsilon)\|\phi(x_i)-\phi(x_j)\|^2_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}}{N},$$

where (a) comes from the guarantee of the rounding procedure (Eq. (6)). Using the Hoeffding inequality, we have

$$\Pr\Big[\big|(\phi(x_i)-\phi(x_j))^\top\big(N\xi I+\Phi_r^\top\Phi_r\big)^{-1}\Phi_r^\top\bar\eta^{(r)}\big| \ge \tfrac{1}{2}\Delta_{r,x_i,x_j}\Big] \le 2\exp\Big(-\frac{N\,\Delta^2_{r,x_i,x_j}}{4(1+\varepsilon)\|\phi(x_i)-\phi(x_j)\|^2_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}}\Big) \le 2\exp\Big(-\frac{N}{32(1+\varepsilon)\rho^*(\xi)}\Big). \quad (20)$$

Thus, with probability at least 1 − 2exp(−N/(32(1+ε)ρ*(ξ))), |Term 1| < Δ_{r,x_i,x_j}/2. Since ξ satisfies ξ ≤ 1/(16(1+ε)²(ρ*(ξ))²‖θ*‖²), we have 4(1+ε)√ξ ρ*(ξ)‖θ*‖ ≤ 1. Then, for any r ∈ [R], x_i, x_j ∈ B^{(r)}_v and v ∈ [V],

$$4(1+\varepsilon)\sqrt{\xi}\cdot\frac{\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}}{\Delta_{r,x_i,x_j}}\cdot\|\theta^*\| \le 1, \quad \text{and thus} \quad 2(1+\varepsilon)\sqrt{\xi}\,\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}\|\theta^*\| \le \tfrac{1}{2}\Delta_{r,x_i,x_j}.$$

Then, we can bound the bias term (Term 2 in Eq. (19)) as

$$\begin{aligned}
\xi N\big|(\phi(x_i)-\phi(x_j))^\top\big(N\xi I+\Phi_r^\top\Phi_r\big)^{-1}\theta^*\big| &\le \xi N\,\|\phi(x_i)-\phi(x_j)\|_{(N\xi I+\Phi_r^\top\Phi_r)^{-1}}\|\theta^*\|_{(N\xi I+\Phi_r^\top\Phi_r)^{-1}} \le \sqrt{\xi N}\,\|\phi(x_i)-\phi(x_j)\|_{(N\xi I+\Phi_r^\top\Phi_r)^{-1}}\|\theta^*\|_2 \\
&\le \sqrt{\xi N}\cdot 2(1+\varepsilon)\frac{\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}}{\sqrt{N}}\|\theta^*\|_2 = 2(1+\varepsilon)\sqrt{\xi}\,\|\phi(x_i)-\phi(x_j)\|_{(\xi I+\Phi_{\lambda_r^*}^\top\Phi_{\lambda_r^*})^{-1}}\|\theta^*\|_2 \le \tfrac{1}{2}\Delta_{r,x_i,x_j}. \quad (21)
\end{aligned}$$

Plugging Eqs. (20) and (21) into Eq. (19), we have that with probability at least 1 − 2exp(−N/(32(1+ε)ρ*(ξ))), |F̂_r(x_i) − F̂_r(x_j) − (F(x_i) − F(x_j))| < Δ_{r,x_i,x_j}, which completes the proof of Pr[¬J_{r,x_i,x_j}] ≤ 2exp(−N/(32(1+ε)ρ*(ξ))). Let J denote the intersection of the events J_{r,x_i,x_j} over all x_i, x_j ∈ B^{(r)}_v, v ∈ [V] and r ∈ [R]; letting ε = 1/10, a union bound gives Pr[¬J] ≤ 2n²V log(ω(ξ, X))·exp(−N/(32(1+ε)ρ*(ξ))) = O(n²V log(ω(ξ, X)) exp(−TV/(ρ*(ξ)log(ω(ξ, X))))). Since ω(ξ, {x_{v,*}, x}) ≥ 1 for any x ∈ X_v \ {x_{v,*}} and v ∈ [V], and R = ⌈log₂(ω(ξ, X))⌉ ≥ ⌈log₂(ω(ξ, X_v))⌉ for any v ∈ [V], we have that conditioning on J, algorithm CoKernelFB returns the correct answers x_{v,*} for all v ∈ [V] using at most R rounds. Therefore, we complete the proof of the error probability guarantee. For communication rounds, since algorithm CoKernelFB has at most R := ⌈log₂(ω(ξ, X))⌉ rounds, the number of communication rounds is bounded by O(log(ω(ξ, X))).

E.2 INTERPRETATION OF THEOREM 2

In the following, we interpret the error probability bound for algorithm CoKernelFB via the maximum information gain and the effective dimension, and decompose it into two parts with respect to task similarities and arm features. Recalling Eq. (17), we have

$$\rho^*(\xi) \le \frac{8}{\Delta_{\min}^2}\, d_{\mathrm{eff}}\log\Big(2nV\Big(1+\frac{\operatorname{Trace}(K_{\lambda^*})}{\xi d_{\mathrm{eff}}}\Big)\Big).$$

Combining the bound in Theorem 2 with the above inequality, the error probability is bounded by

$$\mathrm{Err} = O\Big(n^2 V \log(\omega(\xi,\mathcal{X}))\exp\Big(-\frac{TV}{\rho^*(\xi)\log(\omega(\xi,\mathcal{X}))}\Big)\Big) = O\Big(n^2V\log(\omega(\xi,\mathcal{X}))\exp\Big(-\frac{TV\Delta_{\min}^2}{d_{\mathrm{eff}}\log\big(nV\big(1+\frac{\operatorname{Trace}(K_{\lambda^*})}{\xi d_{\mathrm{eff}}}\big)\big)\log(\omega(\xi,\mathcal{X}))}\Big)\Big).$$

Decomposition. Recall that K_{λ*} = [√(λ*_i λ*_j) k(x_i, x_j)]_{i,j∈[nV]}, K_Z = [k_Z(z_v, z_{v'})]_{v,v'∈[V]} and K_{X,λ*} = [√(λ*_i λ*_j) k_X(x_i, x_j)]_{i,j∈[nV]}. Let K̃_Z = [k_Z(z_{v_i}, z_{v_j})]_{i,j∈[nV]}, where for any i ∈ [nV], v_i denotes the index of the task of the i-th arm x_i in the arm set X and z_{v_i} denotes its task feature. It holds that rank(K̃_Z) = rank(K_Z). Since K_{λ*} is a Hadamard product of K̃_Z and K_{X,λ*}, we have rank(K_{λ*}) ≤ rank(K̃_Z)·rank(K_{X,λ*}) = rank(K_Z)·rank(K_{X,λ*}). Recalling Eq. (18), we have

$$\rho^*(\xi) \le \frac{8}{\Delta_{\min}^2}\,\operatorname{rank}(K_Z)\operatorname{rank}(K_{\mathcal X,\lambda^*})\log\frac{\operatorname{Trace}\big(I+\xi^{-1}K_{\lambda^*}\big)}{\operatorname{rank}(K_{\lambda^*})}.$$

Combining the bound in Theorem 2 with the above inequality, the error probability is bounded by

$$\mathrm{Err} = O\Big(n^2V\log(\omega(\xi,\mathcal{X}))\exp\Big(-\frac{TV\Delta_{\min}^2}{\operatorname{rank}(K_Z)\operatorname{rank}(K_{\mathcal X,\lambda^*})\log\frac{\operatorname{Trace}(I+\xi^{-1}K_{\lambda^*})}{\operatorname{rank}(K_{\lambda^*})}\log(\omega(\xi,\mathcal{X}))}\Big)\Big).$$

Therefore, we complete the proof of Corollary 2.

F TECHNICAL TOOLS

In this section, we provide some useful technical tools.

Lemma 3 (Lemma 15 in (Camilleri et al., 2021)). For λ* = argmax_{λ∈△_X} log det(I + ξ^{-1}Σ_{x'∈X} λ_{x'} φ(x')φ(x')^T), we have

$$\max_{x\in\mathcal{X}}\|\phi(x)\|^2_{(\xi I+\sum_{x'\in\mathcal{X}}\lambda^*_{x'}\phi(x')\phi(x')^\top)^{-1}} = \sum_{x\in\mathcal{X}}\lambda^*_x\|\phi(x)\|^2_{(\xi I+\sum_{x'\in\mathcal{X}}\lambda^*_{x'}\phi(x')\phi(x')^\top)^{-1}}.$$

Lemma 4. For λ* = argmax_{λ∈△_X} log det(I + ξ^{-1}Σ_{x'∈X} λ_{x'} φ(x')φ(x')^T), we have

$$\sum_{x\in\mathcal{X}}\log\Big(1+\lambda^*_x\|\phi(x)\|^2_{(\xi I+\sum_{x'\in\mathcal{X}}\lambda^*_{x'}\phi(x')\phi(x')^\top)^{-1}}\Big) \le \log\frac{\det\big(\xi I+\sum_{x\in\mathcal{X}}\lambda^*_x\phi(x)\phi(x)^\top\big)}{\det(\xi I)}.$$

Proof of Lemma 4. This proof is inspired by Lemma 11 in (Abbasi-Yadkori et al., 2011), and uses a similar analytical procedure to that of Lemma 16 in (Camilleri et al., 2021). However, different from the analysis of Lemma 16 in (Camilleri et al., 2021), we do not include the number of samples N^{(r)} in the det(·) operator in this proof. For any j ∈ [nV], let M_j = det(ξI + Σ_{i∈[j]} λ*_i φ(x_i)φ(x_i)^T).
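The inequality in Lemma 4 is an elliptical-potential argument: det(ξI + Σ_{i≤j} λ_i φ_i φ_i^T) grows by a factor (1 + λ_j φ_j^T A_{j-1}^{-1} φ_j) at each step, and the final matrix dominates every partial one. In fact the inequality holds for any allocation λ, not only the log-det maximizer; a quick numerical check of this at a random allocation (our own sketch, synthetic features):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, xi = 4, 9, 0.2
Phi = rng.normal(size=(m, d))                  # phi(x) for m arms
lam = rng.dirichlet(np.ones(m))                # an arbitrary allocation lambda
A = xi * np.eye(d) + Phi.T @ (lam[:, None] * Phi)
Ainv = np.linalg.inv(A)

# Left side: sum_x log(1 + lambda_x ||phi(x)||^2_{A^{-1}})
lhs = sum(np.log1p(lam[i] * Phi[i] @ Ainv @ Phi[i]) for i in range(m))

# Right side: log( det(A) / det(xi I) )
rhs = np.linalg.slogdet(A)[1] - d * np.log(xi)
```

Since A dominates ξI, the right side is non-negative, and the left side never exceeds it.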



For any h, h' ∈ H, we denote their inner product by h^T h' := ⟨h, h'⟩_H and denote the norm of h by ‖h‖ := √(h^T h). For any h ∈ H and matrix M, we denote ‖h‖_{M^{-1}} := √(h^T M^{-1} h).



Here k_Z : Z × Z → R is the task feature kernel, which measures the similarity of the functions f_v, and k_X : X × X → R is the arm feature kernel, which depicts the feature structure of arms. The computation of the kernel function k(·,·) operates only on low-dimensional input data. By expressing all calculations via k(·,·), we avoid maintaining the high-dimensional φ(·) and attain computational efficiency. In addition, the composite structure of k(·,·) allows us to model the similarity among tasks.

Figure 1: An illustrative example of CoPE-KB.

sampling, she communicates only the number of samples $N^{(r)}_{v,i}$ and the average sample outcome $\bar{y}^{(r)}_{v,i}$ for each arm $x_{v,i}$ to the other agents (Line 10). After receiving the overall sample information, she uses a kernelized estimator (discussed shortly) to estimate the reward gap $\hat{\Delta}_r(x, x')$ between any two arms $x, x' \in B^{(r)}_v$ for all $v \in [V]$, and discards sub-optimal arms (Lines 13, 14). In Line 11, for any $i \in [nV]$, we use $x_i$, $\bar{y}^{(r)}$

Figure 3: Experimental results for CoPE-KB in the FB setting.

r ← 1 // pre-determine the number of rounds and samples
3: while r ≤ R and ∃v ∈ [V] with |B^{(r)}_v| > 1 do
4:

estimate the rewards of alive arms
11: Sort all $x \in B^{(r)}_v$ by $\hat{F}_r(x)$ in decreasing order, and denote the sorted sequence by $x^{(1)}, \ldots, x^{(|B^{(r)}_v|)}$

$\ldots, s^{(r)}_N)$ according to $\lambda^*_r$, and selects the sub-sequence $s^{(r)}_v$ that contains only her available arms to perform sampling (Lines 5, 6). Similar to CoKernelFC, she communicates only the number of samples $N^{(r)}_{v,i}$ and the average observed reward $\bar{y}^{(r)}_{v,i}$ for each arm $x_{v,i}$

xi for any i ∈ [nV ].

$= \{x \in \mathcal{X}_v : F(x_{v,*}) - F(x) \le 4 \cdot 2^{-r}\}$.

Lemma 2. Assume that event $\mathcal{G}$ occurs. Then, for any round $1 < r \le r^* + 1$ and agent $v \in [V]$, we have that $x_{v,*} \in B^{(r)}_v$.

We prove the first statement by induction. To begin, for any $v \in [V]$, $x_{v,*} \in B^{(1)}_v$ trivially holds. Suppose that $x_{v,*} \in B^{(r)}_v$ ($1 \le r \le r^*$) holds for any $v \in [V]$, and there exists some

which contradicts the definition of $x_{v,*}$. Thus, we have that for any $v \in [V]$, $x_{v,*} \in B^{(r+1)}_v$.

$32(1+\varepsilon)\rho^*(\xi)$, for any $r \in [R]$, $x_i, x_j \in B^{(r)}_v$ and $v \in [V]$,
$$\big| \hat{F}_r(x_i) - \hat{F}_r(x_j) - (F(x_i) - F(x_j)) \big| \le \big| (\phi(x_i) - \phi(x_j))^\top (N\xi I + \Phi_r^\top \Phi_r)^{-1} \Phi_r^\top \eta^{(r)}_v \big| + \xi N \big| (\phi(x_i) - \phi(x_j))^\top (N\xi I + \Phi_r^\top \Phi_r)^{-1} \theta^* \big| < \Delta_{r,x_i,x_j},$$
which completes the proof of $\Pr[\neg \mathcal{J}_{r,x_i,x_j}] \le 2\exp\big( -\frac{N}{32(1+\varepsilon)\rho^*(\xi)} \big)$.

Define event
$$\mathcal{J} = \Big\{ \big| \hat{F}_r(x_i) - \hat{F}_r(x_j) - (F(x_i) - F(x_j)) \big| < \Delta_{r,x_i,x_j},\ \forall x_i, x_j \in B^{(r)}_v,\ \forall v \in [V],\ \forall r \in [R] \Big\}.$$
Let $\varepsilon = \frac{1}{10}$. By a union bound over $x_i, x_j \in B^{(r)}_v$, $v \in [V]$ and $r \in [R]$, we have
$$\Pr[\neg\mathcal{J}] \le 2n^2 V \log(\omega(\xi,\mathcal{X})) \cdot \exp\Big( -\frac{N}{32(1+\varepsilon)\rho^*(\xi)} \Big) = O\Big( n^2 V \log(\omega(\xi,\mathcal{X})) \cdot \exp\Big( -\frac{T}{V\rho^*(\xi)\log(\omega(\xi,\mathcal{X}))} \Big) \Big).$$

In order to prove Theorem 2, it suffices to show that, conditioned on event $\mathcal{J}$, algorithm CoKernelFB returns the correct answers $x_{v,*}$ for all $v \in [V]$. Suppose, for contradiction, that there exist some $r \in [R]$ and some $v \in [V]$ such that $x_{v,*}$ is eliminated in round $r$. Define $\tilde{B}^{(r)}_v = \{x \in B^{(r)}_v : \hat{F}_r(x) > \hat{F}_r(x_{v,*})\}$, which denotes the set of arms in $B^{(r)}_v$ that are ranked before $x_{v,*}$ according to the estimated rewards. According to the elimination rule (Line 12 in Algorithm 2), we can bound $\|\phi(x_{v,*}) - \phi(x)\|^2_{(\xi I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}$ for arms $x \in \tilde{B}^{(r)}_v$, where step (a) follows from Eq. (22), step (b) comes from the definition of $\omega(\cdot,\cdot)$, and step (c) is due to the definition of $x_0$. By the definition $\Delta_{r,x_{v,*},x_0} = \inf \ldots$, we have $\Delta_{r,x_{v,*},x_0} \le \Delta_{x_0}$. Conditioned on $\mathcal{J}$, we have $\big| \hat{F}_r(x_{v,*}) - \hat{F}_r(x_0) - (F(x_{v,*}) - F(x_0)) \big| < \Delta_{r,x_{v,*},x_0} \le \Delta_{x_0}$. Then, we have
$$\hat{F}_r(x_{v,*}) - \hat{F}_r(x_0) > (F(x_{v,*}) - F(x_0)) - \Delta_{x_0} = 0,$$
which contradicts the definition of $x_0$, which satisfies $\hat{F}_r(x_0) > \hat{F}_r(x_{v,*})$. Thus, for any round $r \in [R]$ and task $v \in [V]$, $x_{v,*}$ is never eliminated.

$$\det(M_{nV}) = \det\Big( \xi I + \sum_{i \in [nV-1]} \lambda^*_i \phi(x_i)\phi(x_i)^\top + \lambda^*_{nV} \phi(x_{nV})\phi(x_{nV})^\top \Big) = \det(M_{nV-1}) \det\big( I + \lambda^*_{nV} M_{nV-1}^{-1} \phi(x_{nV})\phi(x_{nV})^\top \big) = \det(M_{nV-1}) \big( 1 + \lambda^*_{nV} \|\phi(x_{nV})\|^2_{M_{nV-1}^{-1}} \big)$$
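The determinant manipulation here is an instance of the matrix determinant lemma, $\det(M + \lambda\, \phi\phi^\top) = \det(M)\,(1 + \lambda\, \phi^\top M^{-1}\phi)$. A quick numerical check with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
d, xi, lam = 4, 0.5, 0.7
phi = rng.standard_normal(d)

# M plays the role of M_{nV-1} = xi*I + sum_i lambda*_i phi(x_i)phi(x_i)^T.
B = rng.standard_normal((d, d))
M = xi * np.eye(d) + B @ B.T   # positive definite

# Matrix determinant lemma: det(M + lam*phi*phi^T) = det(M)*(1 + lam*||phi||^2_{M^-1}).
lhs = np.linalg.det(M + lam * np.outer(phi, phi))
rhs = np.linalg.det(M) * (1.0 + lam * phi @ np.linalg.solve(M, phi))
assert np.isclose(lhs, rhs)
```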

$(\xi))^2 \|\theta^*\|^2$. Algorithm CoKernelFB uses at most $T$ samples per agent, and returns the best arms $x_{v,*}$ for all $v \in [V]$ with error probability

by building a new self-normalized concentration inequality for kernel bandits. Scarlett et al. (2017) develop regret lower bounds for the squared-exponential and Matérn kernels, and Li & Scarlett (2022) establish a near-optimal regret upper bound. Krause & Ong (2011) and Deshmukh et al. (2017) study multi-task kernel bandits, which consider a composite kernel composed of two sub-kernels with respect to tasks and items. Dubey et al. (2020) investigate multi-agent kernel bandits with a local communication protocol, where agents can immediately share their observed data with their neighbors in a network. Li et al. (2022) study distributed kernel bandits with a client-server communication protocol, where agents communicate only with a central server. The communication costs in (Dubey

down the alive arm set to half dimension

ACKNOWLEDGEMENTS

The work of Yihan Du and Longbo Huang was supported by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403, the Tsinghua University Initiative Scientific Research Program, and the Tsinghua Precision Medicine Foundation under Grant 10001020109. Yuko Kuroki was supported by Microsoft Research Asia and JST ACT-X Grant JPMJAX200E.

Suppose that B

$(1 < r \le r^*)$ holds for any $v \in [V]$, and there exists some $v \in [V]$. Then, there exists some $x \in B$. Using Lemma 1, we have that at round $r$, $x$ should have been eliminated in round $r$, which gives a contradiction. Thus, we complete the proof of Lemma 2.

Now we prove Theorem 1.

Proof of Theorem 1. Assume that event $\mathcal{G}$ occurs. We first prove the correctness. Recall that $r^* := \log_2(\frac{2}{\Delta_{\min}}) + 1$. According to Lemma 2, for any $v \in [V]$ and $x \in B$

In algorithm CoKernelFC, the computation of $\lambda^*_r$, $\rho^*_r$ and $N^{(r)}$ is the same for all agents, and each agent $v$ generates only the partial samples that belong to her arm set $\mathcal{X}_v$ out of the total $N^{(r)}$ samples. Thus, to bound the overall sample complexity, it suffices to bound $\sum_{r=1}^{r^*} N^{(r)}$, and then we can obtain the per-agent sample complexity by dividing by $V$.

Corollary 2. The error probability of algorithm CoKernelFB, denoted by $\mathrm{Err}$, can also be bounded as follows: (a) where $\Upsilon$ is the maximum information gain; (b) where $d_{\mathrm{eff}}$ is the effective dimension.

Corollaries 2(a) and 2(b) bound the error probability by the maximum information gain and the effective dimension, respectively, which capture essential structures of tasks and arm features and depend only on the effective dimension of the feature space. Corollary 2(c) reveals how task similarities impact the error probability. Specifically, when all tasks are the same, i.e., $\mathrm{rank}(K_Z) = 1$, the error probability enjoys an exponential decay factor of $V$ compared to the single-agent algorithm (Katz-Samuels et al., 2020), and achieves the maximum $V$-speedup. Conversely, when tasks are totally different, i.e., $\mathrm{rank}(K_Z) = V$, the error probability is similar to the single-agent result (Katz-Samuels et al., 2020), since in this case information sharing brings no benefit.

Proof of Corollary 2. Recall that $\lambda^* := \mathrm{argmax}_{\lambda \in \triangle_{\mathcal{X}}} \log\det( I + \xi^{-1} K_\lambda )$. Recalling Eq. (15), we have

Maximum Information Gain.
Recall that the maximum information gain over all sample allocations $\lambda \in \triangle_{\mathcal{X}}$ is defined as

Recalling Eq. (16), we have

Combining the bound in Theorem 2 and the above equation, the error probability is bounded by

Effective Dimension. Recall that $\alpha_1 \ge \cdots \ge \alpha_{nV}$ denote the eigenvalues of $K_{\lambda^*}$ in decreasing order. The effective dimension of $K_{\lambda^*}$ is defined as
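The exact display defining the effective dimension did not survive extraction here. Under one common form used in the kernel-bandit literature, $d_{\mathrm{eff}} = \min\{ j : j\,\xi \log(nV) \ge \sum_{i > j} \alpha_i \}$, it can be computed directly from the eigenvalues of $K_{\lambda^*}$; the sketch below assumes that form (an assumption on our part, not necessarily the paper's definition) and shows that a Gram matrix with fast eigenvalue decay has a small effective dimension.

```python
import numpy as np

rng = np.random.default_rng(4)

def effective_dimension(K, xi, nV):
    # Assumed definition (one common form, not necessarily the paper's):
    # d_eff = min { j : j * xi * log(nV) >= sum_{i > j} alpha_i },
    # where alpha_1 >= ... >= alpha_nV are the eigenvalues of K.
    alpha = np.sort(np.linalg.eigvalsh(K))[::-1]
    tail = np.cumsum(alpha[::-1])[::-1]        # tail[j] = sum_{i >= j} alpha_i
    for j in range(1, len(alpha) + 1):
        remaining = tail[j] if j < len(alpha) else 0.0
        if j * xi * np.log(nV) >= remaining:
            return j
    return len(alpha)

# A Gram matrix with geometric eigenvalue decay: only a few directions matter.
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))
alpha = 2.0 ** -np.arange(20)
K = Q @ np.diag(alpha) @ Q.T
d = effective_dimension(K, xi=0.1, nV=20)
assert 1 <= d <= 5   # fast decay => small effective dimension
```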

