LEARNING KERNELIZED CONTEXTUAL BANDITS IN A DISTRIBUTED AND ASYNCHRONOUS ENVIRONMENT

Abstract

Despite recent advances in communication-efficient distributed bandit learning, most existing solutions are restricted to parametric models, e.g., linear bandits and generalized linear bandits (GLB). In comparison, kernel bandits, which search for non-parametric functions in a reproducing kernel Hilbert space (RKHS), offer higher modeling capacity. However, the only existing work on distributed kernel bandits adopts a synchronous communication protocol, which greatly limits its practical use (e.g., every synchronization step requires all clients to participate and wait for data exchange). In this paper, to improve robustness against the delays and unavailability of clients that are common in practice, we propose the first asynchronous solution for distributed kernel bandit learning, based on approximated kernel regression. A set of effective treatments is developed to ensure approximation quality and communication efficiency. Rigorous theoretical analysis of the regret and communication cost is provided, and extensive empirical evaluations demonstrate the effectiveness of our solution.

1. INTRODUCTION

There are many application scenarios where an environment repeatedly provides a learner with a set of candidate actions to choose from, possibly along with some side information (a.k.a. context) (Li et al., 2010a;b; Durand et al., 2018); the learner, whose goal is to maximize cumulative reward over time, can only observe the reward of the chosen action. This is often modeled as a bandit learning problem (Abbasi-Yadkori et al., 2011; Krause & Ong, 2011), which exemplifies the well-known exploitation-exploration dilemma (Auer, 2002). Various modeling assumptions have been made about the relation between the context of each action and its expected reward. Compared with parametric bandits, such as linear and generalized linear bandits (Abbasi-Yadkori et al., 2011; Filippi et al., 2010), kernel/Gaussian process bandits (Valko et al., 2013; Srinivas et al., 2009) offer greater flexibility, as they search for non-parametric functions lying in an RKHS. They have thus become a powerful tool for optimizing black-box functions from noisy observations in various applications, such as recommender systems (Vanchinathan et al., 2014), mobile health (Tewari & Murphy, 2017), environment monitoring (Srinivas et al., 2009), automated machine learning (Li et al., 2017), and cyber-physical systems (Lizotte et al., 2007; Li et al., 2016).

Motivated by the rapid growth in affordability and availability of hardware resources, e.g., computer clusters or IoT devices, there is increasing interest in distributing such learning tasks, which gives rise to recent research efforts in distributed bandits (Wang et al., 2019; Huang et al., 2021; Li & Wang, 2022a;b; Li et al., 2022; He et al., 2022), where N clients collaboratively maximize the overall cumulative reward over a time horizon T.
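To make the kernel bandit setting above concrete, the following is a minimal single-client sketch (not the algorithm proposed in this paper): the learner fits a kernel ridge regression / Gaussian process estimate of the reward function and selects the arm with the highest upper confidence bound. The RBF kernel, regularization `lam`, and exploration weight `beta` are illustrative choices.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    # Squared-exponential kernel matrix between two sets of contexts.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def kernel_ucb_scores(X_hist, y_hist, X_arms, lam=1.0, beta=2.0):
    """Posterior mean + beta * std for each candidate arm (kernel ridge / GP)."""
    K = rbf(X_hist, X_hist) + lam * np.eye(len(X_hist))
    K_inv = np.linalg.inv(K)
    k_star = rbf(X_arms, X_hist)                      # (n_arms, n_hist)
    mean = k_star @ K_inv @ y_hist
    # Predictive variance: k(x,x) - k_S(x)^T (K + lam I)^{-1} k_S(x).
    var = rbf(X_arms, X_arms).diagonal() - np.einsum(
        'ij,jk,ik->i', k_star, K_inv, k_star)
    return mean + beta * np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(20, 3))
y_hist = np.sin(X_hist.sum(axis=1)) + 0.1 * rng.normal(size=20)
X_arms = rng.normal(size=(5, 3))
chosen = int(np.argmax(kernel_ucb_scores(X_hist, y_hist, X_arms)))
```

The variance term shrinks for arms similar to previously observed contexts, so the UCB rule automatically trades off exploitation (high mean) and exploration (high uncertainty).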
As communication bandwidth is the key bottleneck in many distributed applications (Huang et al., 2013), these studies emphasize communication efficiency, i.e., incurring sub-linear communication cost with respect to time T while attaining near-optimal regret. However, most of these works are restricted to simple parametric models, like linear bandits (Wang et al., 2019; Huang et al., 2021; Li & Wang, 2022a; He et al., 2022) or GLB (Li & Wang, 2022b). The only exception is Li et al. (2022), who proposed the first algorithm for distributed kernel bandits with sub-linear communication cost. They achieved this via a Nyström embedding function (Nyström, 1930) shared among all the clients, such that the clients only need to transfer the embedded statistics for joint kernelized estimation. Nevertheless, in their algorithm, the update of the Nyström embedding function, as well as the communication of the embedded statistics, relies on a synchronization round that requires participation of all the clients. As is widely recognized in distributed optimization (Low et al., 2012; Xie et al., 2019; Lian et al., 2018; Chen et al., 2020; Lim et al., 2020) and distributed bandit learning (Li & Wang, 2022a; He et al., 2022), this design is vulnerable to stragglers (i.e., slower clients) in the system: the update procedure of Li et al. (2022) is paused until the slowest client responds. Due to device heterogeneity and network unreliability, this situation is unfortunately common, especially at the scale of hundreds of devices/clients. Asynchronous communication is therefore preferred, as the server can perform a model update as soon as communication from a client is received, which is more robust against stragglers. The main bottleneck in addressing this limitation of Li et al. (2022) lies in computing the Nyström approximation under asynchronous communication.
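The Nyström-embedding idea behind this line of work can be illustrated as follows: given a shared dictionary S of representative points, each client maps its raw contexts into |S|-dimensional embeddings and only communicates the resulting O(|S|²)-size statistics, rather than raw data. This is a schematic sketch under simplifying assumptions, not the exact algorithm of Li et al. (2022); the pseudo-inverse square root and the toy dictionary choice here are illustrative.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def nystrom_embed(X, S):
    """Map contexts X into the subspace spanned by dictionary S:
    z(x) = K_SS^{-1/2} k_S(x), an |S|-dimensional feature vector."""
    K_SS = rbf(S, S)
    U, s, _ = np.linalg.svd(K_SS)   # symmetric PSD: SVD = eigendecomposition
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12))) @ U.T
    return rbf(X, S) @ inv_sqrt                      # (n, |S|)

def embedded_stats(X, y, S, lam=1.0):
    """The compact statistics a client ships to the server."""
    Z = nystrom_embed(X, S)
    A = Z.T @ Z + lam * np.eye(S.shape[0])           # |S| x |S| matrix
    b = Z.T @ y                                      # |S| vector
    return A, b

rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
S = X[:8]                            # toy dictionary: first 8 local points
A, b = embedded_stats(X, y, S)
theta = np.linalg.solve(A, b)        # ridge estimate in the embedded space
```

The key property is that the payload size depends on |S| rather than the number of interactions, which is what makes sub-linear communication possible.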
Specifically, during each synchronization step, their algorithm first samples a small set of representative data points (i.e., the dictionary) from all clients, and then lets each client project its local data onto the subspace spanned by this dictionary and share statistics about the projected data with others. However, new challenges arise in both algorithmic design and theoretical analysis when extending their solution to asynchronous communication: a 'fresh' re-sample from the data of all clients is no longer possible, and each client holds a different copy of the dictionary due to asynchronous communication with the server, so their local data is projected onto different subspaces, which complicates joint kernel estimation. In this paper, we address these challenges and propose the first asynchronous algorithm for distributed kernelized contextual bandits. Compared with prior works in distributed bandits, our algorithm simultaneously enjoys the modeling capacity of non-parametric models and improved robustness against delays and unavailability of clients, making it suitable for a wider range of applications. To ensure the approximation quality and compactness of the constructed dictionary under asynchronous communication, we design an incremental update procedure tailored to our problem setting, with a variant of Ridge leverage score (RLS) sampling. Compared with the sampling procedures in prior works (Li et al., 2022; Calandriello et al., 2020), this requires specialized treatment in the analysis, since the quality of the current dictionary now depends on all previous asynchronous communications. Moreover, to enable joint kernel estimation, we perform transformations on the server side to convert statistics from different clients into a common subspace, which to the best of our knowledge is also new in the bandit literature. We rigorously prove that the proposed algorithm incurs an Õ(N²γ_T³) communication cost, matching that of Li et al. (2022), where γ_T is the maximum information gain, while still attaining the optimal O(√(Tγ_T)) regret.

2. RELATED WORKS

There have been increasing research efforts in distributed bandit learning in recent years, where multiple agents collaborate in pure exploration (Hillel et al., 2013; Tao et al., 2019; Du et al., 2021) or regret minimization (Wang et al., 2019; Li & Wang, 2022a;b). These works mainly differ in the relations among the learning problems solved by the agents (i.e., homogeneous vs. heterogeneous) and the type of communication network (i.e., peer-to-peer (P2P) vs. star-shaped). However, most of these works assume linear reward functions, and the clients communicate by transferring the O(d²) sufficient statistics. For example, Korda et al. (2016) considered a P2P communication network and assumed that the clients form clusters, i.e., each cluster is associated with a unique bandit problem. Huang et al. (2021) considered a star-shaped communication network as in our paper, but their proposed phase-based elimination algorithm only works in the fixed arm set setting. The closest works to ours are Wang et al. (2019); Dubey & Pentland (2020); Li & Wang (2022a); He et al. (2022), which propose event-triggered communication protocols to obtain sub-linear communication cost over time for distributed linear bandits with a time-varying arm set. In particular, Li & Wang (2022a) first considered the asynchronous communication setting for distributed bandit learning. Though their proposed algorithm avoids global synchronization (Wang et al., 2019), it still involves downloads to inactive clients. He et al. (2022) further improved the algorithm design and analysis, such that only the active client in each round needs to participate in communication. In comparison, distributed kernelized contextual bandits remain under-explored. Prior work in this direction assumes a local communication setting (Dubey et al., 2020), where each agent immediately shares the new raw data point with its neighbors after each interaction, so the communication cost is still linear over time. A recent work by Li et al. (2022) addresses this issue via a Nyström embedding function shared among all clients, which brings the communication cost down to sub-linear in T.
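For reference, the O(d²) sufficient statistics mentioned above for distributed linear bandits are just the Gram matrix and the reward-weighted context sum; a client's entire interaction history can be merged on the server by simple addition. A minimal sketch (the dimension, histories, and ridge parameter are illustrative):

```python
import numpy as np

d = 4
rng = np.random.default_rng(3)

def local_stats(X, y):
    """Sufficient statistics of ridge regression: O(d^2) floats,
    regardless of how many rounds the client has played."""
    return X.T @ X, X.T @ y          # (d, d) Gram matrix, (d,) vector

# Two clients with different-length interaction histories.
V1, b1 = local_stats(rng.normal(size=(30, d)), rng.normal(size=30))
V2, b2 = local_stats(rng.normal(size=(50, d)), rng.normal(size=50))

# Server merges by addition, then solves for the joint ridge estimate.
lam = 1.0
theta = np.linalg.solve(V1 + V2 + lam * np.eye(d), b1 + b2)
```

For kernel bandits no such fixed-size finite-dimensional statistics exist in general, which is exactly why Nyström-style embeddings are needed to keep the communicated payload compact.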

