COLLABORATIVE PURE EXPLORATION IN KERNEL BANDIT

Abstract

In this paper, we propose a novel Collaborative Pure Exploration in Kernel Bandit model (CoPE-KB), where multiple agents collaborate to complete different but related tasks with limited communication. Our model generalizes the prior CoPE formulation, which is restricted to a single task in the classic MAB setting, to allow multiple tasks and general reward structures. We propose a novel communication scheme with an efficient kernelized estimator, and design algorithms CoKernelFC and CoKernelFB for CoPE-KB under the fixed-confidence and fixed-budget objectives, respectively. We provide sample and communication complexities to demonstrate the efficiency of our algorithms. Our theoretical results explicitly quantify how task similarities influence the learning speedup, and depend only on the effective dimension of the feature space. Our novel techniques, including an efficient kernelized estimator and a decomposition of task similarities and arm features, overcome the communication difficulty in high-dimensional feature spaces, reveal the impact of task similarities on sample complexity, and can be of independent interest.

1. INTRODUCTION

Pure exploration (Even-Dar et al., 2006; Kalyanakrishnan et al., 2012; Kaufmann et al., 2016) is a fundamental online learning problem in multi-armed bandits (Thompson, 1933; Lai & Robbins, 1985; Auer et al., 2002), where an agent chooses options (often called arms) and observes random feedback with the objective of identifying the best arm. This formulation has found many important applications, such as web content optimization (Agarwal et al., 2009) and online advertising (Tang et al., 2013). However, traditional pure exploration (Even-Dar et al., 2006; Kalyanakrishnan et al., 2012; Kaufmann et al., 2016) only considers single-agent decision making, and cannot be applied to prevailing distributed systems in the real world, which often face heavy computation loads and require multiple parallel devices to process tasks, e.g., distributed web servers (Zhuo et al., 2003) and data centers (Liu et al., 2011). To handle such distributed applications, prior works (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020) have developed the Collaborative Pure Exploration (CoPE) model, where multiple agents communicate and cooperate to identify the best arm with a learning speedup. Yet, existing results focus only on the classic multi-armed bandit (MAB) setting with a single task, i.e., all agents solve a common task and the rewards of arms are individual values (rather than generated by a reward function). However, in many distributed applications such as multi-task neural architecture search (Gao et al., 2020), different devices can face different but related tasks, and the rewards depend on option features in similar ways across tasks. Therefore, it is important to develop a more general CoPE model that allows heterogeneous tasks and structured reward dependency, and to theoretically understand how task correlation impacts learning. Motivated by these observations, we propose a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) model.
Specifically, in CoPE-KB, each agent is given a set of arms, and the expected reward of each arm is generated by a task-dependent reward function in a high-dimensional and possibly infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) (Wahba, 1990; Schölkopf et al., 2002). Each agent sequentially chooses arms to sample and observes random outcomes in order to identify the best arm. Agents can broadcast and receive messages to and from others in communication rounds, so that they can collaborate and exploit the task similarity to expedite their learning processes. Our CoPE-KB model is a novel generalization of the prior CoPE problem (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020): it not only extends prior models from the single-task setting to multiple tasks, but also goes beyond the classic MAB setting and allows general (linear or nonlinear) reward structures. CoPE-KB is most suitable for applications involving multiple tasks and complicated reward structures. For example, in multi-task neural architecture search (Gao et al., 2020), one wants to search for the best architectures for different but related tasks on multiple devices, e.g., the object detection (Ghiasi et al., 2019) and object tracking (Yan et al., 2021) tasks in computer vision, which often use similar neural architectures. Instead of individually evaluating each possible architecture, one prefers to directly learn the relationship (reward function) between the accuracy achieved and the features of the architectures used (e.g., the type of neural network), and to exploit the similarity of reward functions among tasks to accelerate the search. Our CoPE-KB generalization faces a unique challenge in communication. Specifically, in prior CoPE works with the classic MAB setting (Hillel et al., 2013; Tao et al., 2019; Karpov et al., 2020), agents only need to learn scalar rewards, which are easy to transmit.
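To make the model concrete, the following toy sketch (illustrative only; the kernel centers, coefficients, and perturbation scale are hypothetical choices, not from the paper) builds V related tasks whose reward functions live in a shared RKHS, differing only by small coefficient perturbations, and lets each agent identify its own best arm from expected rewards:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # RBF kernel between two feature vectors
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
n, V = 6, 3                      # n arms per agent, V agents (one task each)
arms = [rng.normal(size=(n, 2)) for _ in range(V)]   # each agent's arm features

# Task-dependent reward functions in a shared RKHS:
# f_v(x) = sum_j c_v[j] * k(z_j, x), with nearby coefficients across tasks,
# so the tasks are different but related.
z = rng.normal(size=(4, 2))                          # shared kernel centers
base = rng.normal(size=4)
coeffs = [base + 0.1 * rng.normal(size=4) for _ in range(V)]

def expected_reward(v, x):
    return sum(c * rbf(zj, x) for c, zj in zip(coeffs[v], z))

# Each agent's goal: identify its own best arm (here from noiseless means;
# in the actual model only noisy pulls are observed).
best_arms = [int(np.argmax([expected_reward(v, x) for x in arms[v]]))
             for v in range(V)]
```

Because the tasks share kernel centers and nearly identical coefficients, samples collected by one agent are informative about the others' reward functions, which is exactly the structure CoPE-KB exploits.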
However, under the kernel model, agents need to estimate a high-dimensional or even infinite-dimensional reward parameter, which is inefficient to transmit directly. Also, if one naively adapts existing reward estimators for kernel bandits (Srinivas et al., 2010; Camilleri et al., 2021) to learn this high-dimensional reward parameter, one suffers an expensive communication cost dependent on the number of samples N^(r), since those reward estimators require all raw sample outcomes to be transmitted. To tackle this challenge, we develop an efficient kernelized estimator, which only needs average outcomes on nV arms and reduces the required transmitted messages from O(N^(r)) to O(nV). Here V is the number of agents, and n is the number of arms for each agent. The number of samples N^(r) depends on the inverse of the minimum reward gap, and is often far larger than the number of arms nV. Under the CoPE-KB model, we study two popular objectives, i.e., Fixed-Confidence (FC), where we aim to minimize the number of samples used under a given confidence, and Fixed-Budget (FB), where the goal is to minimize the error probability under a given sample budget. We design two algorithms, CoKernelFC and CoKernelFB, which adopt an efficient kernelized estimator to simplify the required data transmission and enjoy an O(nV) communication cost, instead of the O(N^(r)) cost of adaptations of existing kernel bandit algorithms (Srinivas et al., 2010; Camilleri et al., 2021). We provide sampling and communication guarantees, and also interpret them via standard kernel measures, e.g., the maximum information gain and the effective dimension. Our results rigorously quantify the influence of task similarities on learning acceleration, and hold for both finite- and infinite-dimensional feature spaces.
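The reason per-arm averages suffice can be seen from the least-squares objective: repeated pulls of the same arm enter a kernel ridge regression only through their count and their mean, so the averaged statistics are sufficient. The sketch below (a simplified illustration, not the paper's actual estimator; the RBF kernel and regularization value are assumed for concreteness) checks that a weighted kernel ridge estimate built from O(nV)-sized statistics matches the estimate built from all O(N^(r)) raw samples:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances -> RBF Gram matrix
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def estimate_from_averages(X_arms, counts, y_bar, x_query, xi=0.1):
    """Estimator using only per-arm counts and average outcomes.
    Weighted kernel ridge solution: alpha = (K + xi * W^{-1})^{-1} y_bar."""
    K = rbf_kernel(X_arms, X_arms)
    alpha = np.linalg.solve(K + xi * np.diag(1.0 / counts), y_bar)
    return rbf_kernel(x_query[None, :], X_arms) @ alpha

def estimate_from_raw(X_samples, y, x_query, xi=0.1):
    """Standard kernel ridge estimate using every raw sample outcome."""
    K = rbf_kernel(X_samples, X_samples)
    alpha = np.linalg.solve(K + xi * np.eye(len(y)), y)
    return rbf_kernel(x_query[None, :], X_samples) @ alpha

rng = np.random.default_rng(0)
X_arms = rng.normal(size=(5, 3))          # n = 5 arms with 3-dim features
counts = np.array([4, 2, 7, 3, 5])        # noisy pulls drawn per arm
X_samples = np.repeat(X_arms, counts, axis=0)
y = rng.normal(size=len(X_samples))       # raw outcomes of all 21 pulls
starts = np.cumsum(counts) - counts
y_bar = np.array([y[s:s + c].mean() for s, c in zip(starts, counts)])

x_query = rng.normal(size=3)
est_avg = estimate_from_averages(X_arms, counts, y_bar, x_query)
est_raw = estimate_from_raw(X_samples, y, x_query)
assert np.allclose(est_avg, est_raw)      # identical predictions
```

Both estimators minimize the same regularized objective, since sum_t (y_t - f(x_i))^2 over an arm's pulls equals n_i (y_bar_i - f(x_i))^2 plus a constant; hence transmitting (counts, averages) loses nothing for estimation while shrinking the message size to O(nV).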
The contributions of this paper are summarized as follows:
• We formulate the Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) model, which generalizes the prior single-task CoPE formulation to allow multiple tasks and general reward structures, and consider two objectives, i.e., fixed-confidence (FC) and fixed-budget (FB).
• For the FC objective, we propose algorithm CoKernelFC, which adopts an efficient kernelized estimator to simplify the required data transmission and enjoys only an O(nV) communication cost. We derive a sample complexity of Õ((ρ*(ξ)/V) log δ^{-1}) and O(log ∆_min^{-1}) communication rounds. Here ξ is the regularization parameter, and ρ*(ξ) is the problem hardness (see Section 4.3).
• For the FB objective, we design a novel algorithm CoKernelFB with error probability Õ(n²V exp(-TV/ρ*(ξ))) and O(log(ω(ξ, X))) communication rounds. Here T is the sample budget.

