LEARNING KERNELIZED CONTEXTUAL BANDITS IN A DISTRIBUTED AND ASYNCHRONOUS ENVIRONMENT

Abstract

Despite the recent advances in communication-efficient distributed bandit learning, most existing solutions are restricted to parametric models, e.g., linear bandits and generalized linear bandits (GLB). In comparison, kernel bandits, which search for non-parametric functions in a reproducing kernel Hilbert space (RKHS), offer higher modeling capacity. But the only existing work on distributed kernel bandits adopts a synchronous communication protocol, which greatly limits its practical use (e.g., every synchronization step requires all clients to participate and wait for data exchange). In this paper, to improve robustness against the delays and client unavailability that are common in practice, we propose the first asynchronous solution for distributed kernel bandit learning, based on approximated kernel regression. A set of effective treatments is developed to ensure approximation quality and communication efficiency. Rigorous theoretical analysis of the regret and communication cost is provided, and extensive empirical evaluations demonstrate the effectiveness of our solution.

1. INTRODUCTION

There are many application scenarios where an environment repeatedly provides a learner with a set of candidate actions to choose from, and possibly some side information (a.k.a. context) (Li et al., 2010a;b; Durand et al., 2018); and the learner, whose goal is to maximize cumulative reward over time, can only observe the reward corresponding to the chosen action. This is often modeled as a bandit learning problem (Abbasi-Yadkori et al., 2011; Krause & Ong, 2011), which exemplifies the well-known exploitation-exploration dilemma (Auer, 2002). Various modeling assumptions have been made about the relation between the context for each action and its expected reward. Compared with parametric bandits, such as linear and generalized linear bandits (Abbasi-Yadkori et al., 2011; Filippi et al., 2010), kernel/Gaussian process bandits (Valko et al., 2013; Srinivas et al., 2009) offer greater flexibility, as they find non-parametric functions lying in an RKHS. They have thus become a powerful tool for optimizing black-box functions based on noisy observations in various applications, such as recommender systems (Vanchinathan et al., 2014), mobile health (Tewari & Murphy, 2017), environment monitoring (Srinivas et al., 2009), automatic machine learning (Li et al., 2017), cyber-physical systems (Lizotte et al., 2007; Li et al., 2016), etc. Motivated by the rapid growth in affordability and availability of hardware resources, e.g., computer clusters or IoT devices, there is increasing interest in distributing the learning tasks, which gives rise to the recent research efforts in distributed bandits (Wang et al., 2019; Huang et al., 2021; Li & Wang, 2022a;b; Li et al., 2022; He et al., 2022), where N clients collaboratively maximize the overall cumulative rewards over time T.
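The interaction protocol described above can be sketched as a minimal single-learner kernelized UCB loop: at each round the learner receives candidate contexts, picks the one maximizing a posterior mean plus an exploration bonus, and observes only that arm's reward. This is an illustrative sketch only, not the paper's algorithm; the reward function, RBF kernel, and parameters below are assumptions chosen for the example.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # RBF kernel matrix between rows of X and rows of Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x[:, 0])   # unknown reward function (hypothetical)
lam, beta = 1.0, 1.0                # ridge and exploration parameters (illustrative)

X_hist, r_hist = np.empty((0, 1)), np.empty(0)
for t in range(50):
    arms = rng.uniform(-1, 1, size=(10, 1))   # candidate contexts this round
    if len(X_hist) == 0:
        a = 0                                  # no data yet: pick arbitrarily
    else:
        K_inv = np.linalg.inv(rbf(X_hist, X_hist) + lam * np.eye(len(X_hist)))
        k_star = rbf(arms, X_hist)
        mu = k_star @ K_inv @ r_hist                          # posterior mean
        var = 1.0 - np.sum((k_star @ K_inv) * k_star, axis=1) # posterior variance
        a = int(np.argmax(mu + beta * np.sqrt(np.maximum(var, 0))))  # UCB rule
    x = arms[a:a + 1]
    r = f(x) + 0.1 * rng.normal(size=1)   # only the chosen arm's reward is observed
    X_hist = np.vstack([X_hist, x])
    r_hist = np.concatenate([r_hist, r])
```

The exploration bonus `beta * sqrt(var)` is what resolves the exploitation-exploration dilemma: arms in poorly explored regions of the context space retain high posterior variance and are eventually tried.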
As communication bandwidth is the key bottleneck in many distributed applications (Huang et al., 2013), these studies emphasize communication efficiency, i.e., incurring sub-linear communication cost with respect to time T while attaining near-optimal regret. However, most of these works are restricted to simple parametric models, like linear bandits (Wang et al., 2019; Huang et al., 2021; Li & Wang, 2022a; He et al., 2022) or GLB (Li & Wang, 2022b). The only exception is Li et al. (2022), who proposed the first algorithm for distributed kernel bandits with sub-linear communication cost. They achieved this via a Nyström embedding function (Nyström, 1930) shared among all the clients, such that the clients only need to transfer the embedded statistics for joint kernelized estimation. Nevertheless, in their algorithm, the update of
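The idea behind such a shared Nyström embedding can be sketched as follows: a common set of anchor points induces a finite-dimensional feature map whose inner products approximate the kernel, so each client can compress its raw observations into fixed-size sufficient statistics before communicating. This is a generic sketch of Nyström approximation under assumed choices (RBF kernel, random anchors), not the specific construction of Li et al. (2022).

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # RBF kernel matrix between rows of X and rows of Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embedding(Z, gamma=1.0, reg=1e-8):
    """Feature map phi from m anchor points Z with phi(x)^T phi(y) ~ k(x, y)."""
    K_mm = rbf(Z, Z, gamma) + reg * np.eye(len(Z))
    w, V = np.linalg.eigh(K_mm)                       # K_mm is symmetric PD
    K_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T  # K_mm^{-1/2}
    return lambda X: rbf(X, Z, gamma) @ K_inv_sqrt    # (n, m) embedded contexts

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))        # anchor points shared by all clients
phi = nystrom_embedding(Z)

# A client compresses its local history into fixed-size embedded statistics;
# only A_local and b_local (not the raw data) need to be communicated.
X_local = rng.normal(size=(100, 3))
r_local = rng.normal(size=100)
F = phi(X_local)                    # (100, 20)
A_local = F.T @ F                   # (20, 20) covariance statistic
b_local = F.T @ r_local             # (20,)    reward statistic
```

The communication cost per exchange is thus O(m^2) in the embedding dimension m, independent of how many raw observations each client has accumulated.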

