ARTIFICIAL REPLAY: A META-ALGORITHM FOR HARNESSING HISTORICAL DATA IN BANDITS

Abstract

While standard bandit algorithms sometimes incur high regret, their performance can be greatly improved by "warm starting" with historical data. Unfortunately, how best to incorporate historical data is unclear: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues, particularly in continuous action spaces. We address these two challenges by proposing ARTIFICIAL REPLAY, a meta-algorithm for incorporating historical data into any base bandit algorithm. ARTIFICIAL REPLAY uses only a subset of the historical data, as needed, to reduce computation and storage. We guarantee that our method achieves regret identical to that of a full warm-start approach, while potentially using only a fraction of the historical data, for the broad class of base algorithms that satisfy independence of irrelevant data (IIData), a novel property that we introduce. We complement these theoretical results with a case study of K-armed and continuous combinatorial bandit algorithms, including on a green security domain using real poaching data, to show the practical benefits of ARTIFICIAL REPLAY in achieving optimal regret alongside low computational and storage costs. Across these experiments, ARTIFICIAL REPLAY performs well in all settings we consider, even for base algorithms that do not satisfy IIData.

1. INTRODUCTION

Multi-armed bandits and their variants are robust models for many real-world problems. The resulting algorithms have been applied to wireless networks (Zuo & Joe-Wong, 2021), COVID testing regulations (Bastani et al., 2021), and conservation efforts to protect wildlife from poaching (Xu et al., 2021). Typical bandit algorithms assume no prior knowledge of the expected reward of each action, simply taking actions online to address the exploration-exploitation trade-off. However, many real-world applications offer access to historical data. For example, in the wildlife conservation setting, we may have access to years of historical patrol records that should be incorporated to learn poaching risk before deploying any bandit algorithm. There is no consensus on how to optimally incorporate this historical data into online learning algorithms. The naive approach uses the full historical dataset to initialize reward estimates (Shivaswamy & Joachims, 2012), possibly incurring unnecessary and onerous computation and storage costs. These costs are particularly salient in continuous action settings with adaptive discretization, where the number of discretized regions is a direct function of the number of historical samples. If excessive data was collected on poor-performing actions, this spurious, imbalanced coverage would lead us to unnecessarily process and store an extremely large number of fine discretizations in low-performing areas of the action space, even when a significantly coarser discretization would suffice to show that the region is not worth exploring. These two key challenges highlight that the value of information of the historical dataset may not be a direct function of its size. Real-world decision makers echo this sentiment: Martin et al. (2017) note that for conservation decisions, more information does not always translate into better actions; time is the resource which matters most.
A natural question one can ask is: Is there an efficient way (in terms of space, computational, and sample complexity) to use historical data to achieve regret-optimal performance? For example, many real-world applications of bandit algorithms, such as online recommender systems, may contain historical datasets with millions of data points. Processing these millions of points would require an exceptional amount of upfront computation and storage, especially if many of those historical points are no longer relevant; many samples may encode out-of-date data such as old movies or discontinued products. To this end, we propose ARTIFICIAL REPLAY, a meta-algorithm that modifies any base bandit algorithm to harness historical data. ARTIFICIAL REPLAY reduces computation and storage costs by using historical data only on an as-needed basis. The key intuition is that if we could choose which samples to include in the historical dataset, a natural approach would be to use a regret-optimal bandit algorithm to guide the sampling. ARTIFICIAL REPLAY builds on this intuition by using historical data as a replay buffer to artificially simulate online actions. Every time the base algorithm picks an action, we first check the historical data for any unused samples from the chosen action. If an unused sample exists, we update the reward estimates and continue without advancing to the next timestep. Otherwise, we sample from the environment, update the estimates using the observation, and continue to the next timestep. While this idea is easiest to understand in the context of the standard K-armed bandit problem, we discuss later how this framework naturally extends to other structural and information models, including continuous action spaces with semi-bandit feedback.
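The replay loop described above can be sketched in a few lines for the K-armed setting. The snippet below is a minimal illustration, not the paper's exact pseudocode: it pairs the meta-algorithm with a plain UCB1 base policy (which need not satisfy the IIData property discussed later), and the two-arm demo environment, reward means, and helper names are all assumptions for illustration.

```python
import math
import random
from collections import deque

class UCB1:
    """Plain UCB1 estimates, used here as an illustrative base algorithm."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self):
        # Play each arm once, then pick the arm with the highest UCB index.
        total = sum(self.counts)
        for a, c in enumerate(self.counts):
            if c == 0:
                return a
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a]
                   + math.sqrt(2 * math.log(total) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def artificial_replay(base, history, pull, horizon):
    """Meta-algorithm: consult unused historical samples before acting online.

    history maps arm -> deque of unused historical rewards; pull(arm) draws a
    fresh online reward. Returns the number of historical samples consumed.
    """
    used = 0
    for _ in range(horizon):
        while True:
            arm = base.select()
            if history.get(arm):
                # Replay an unused historical sample; time does NOT advance.
                base.update(arm, history[arm].popleft())
                used += 1
            else:
                # Genuine online interaction; advance to the next timestep.
                base.update(arm, pull(arm))
                break
    return used

# Demo: 50 historical samples on the worse arm, none on the better one.
random.seed(0)
means = [0.2, 0.8]
history = {0: deque(1.0 if random.random() < means[0] else 0.0
                    for _ in range(50)),
           1: deque()}
pull = lambda a: 1.0 if random.random() < means[a] else 0.0
used = artificial_replay(UCB1(2), history, pull, horizon=100)
```

Because the base algorithm quickly concentrates on the better arm, only part of the 50-sample history on the worse arm is ever touched; the remainder is never processed or stored in the estimates, which is the source of the computational savings.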
Although ARTIFICIAL REPLAY seems to be a natural heuristic to minimize use of historical data, it is not clear how to analyze its regret, specifically how much it loses compared to "full warm-start" (i.e., where the base algorithm is initialized with the full dataset). Surprisingly, however, we prove that under a widely applicable condition, the regret of ARTIFICIAL REPLAY (as a random variable) is distributionally identical to that of a full warm-start approach, while also guaranteeing significantly better time and storage complexity. Specifically, we show a sample-path coupling¹ between our ARTIFICIAL REPLAY approach and the full warm-start approach with the same base algorithm, as long as the base algorithm satisfies a novel independence of irrelevant data (IIData) assumption. While our goal is not to show regret improvements, this result highlights that ARTIFICIAL REPLAY is a simple approach for incorporating historical data with regret identical to full warm start (the approach used in practice) at significantly smaller computational overhead. Finally, we show the practical benefits of our method by instantiating ARTIFICIAL REPLAY for several broad classes of bandits and evaluating on real-world data. To highlight the breadth of algorithms that satisfy the IIData property, we provide examples of regret-optimal IIData policies for K-armed and continuous combinatorial bandits. We use these examples to prove that ARTIFICIAL REPLAY can lead to arbitrarily better storage and computational complexity requirements. We close with a case study of combinatorial bandit algorithms for continuous resource allocation in the context of green security domains, using a novel adaptive discretization technique.
Across the experiments, using real-world poaching data, we observe concrete gains in storage and runtime from the ARTIFICIAL REPLAY framework over a range of base algorithms, including algorithms that do not satisfy IIData, such as Thompson sampling and Information-Directed Sampling (IDS).

Multi-Armed Bandit Algorithms. The design and analysis of bandit algorithms have been considered under a wide range of models. These algorithms were first studied in the K-armed bandit model by Lai & Robbins (1985), where the decision maker has access to a finite set of K possible actions at each timestep. Numerous follow-up works have considered similar approaches when designing algorithms in continuous action spaces (Kleinberg et al., 2019) and with combinatorial constraints (Chen et al., 2013; Xu et al., 2021; Zuo & Joe-Wong, 2021). Our work provides a framework for modifying existing algorithms to harness historical data. Moreover, we propose a novel algorithm that incorporates adaptive discretization into combinatorial multi-armed bandits for continuous resource allocation, extending the discrete model of Zuo & Joe-Wong (2021).



¹ That is, we construct the regret process for both algorithms simultaneously on a joint probability space such that each individual process has the correct marginal, but the two processes are equal along each sample path.



Multi-armed bandit problems have a long history in the online learning literature. We highlight the most closely related works below; for more extensive references, please see our detailed discussion in Appendix B and Bubeck et al. (2012); Slivkins (2019); Lattimore & Szepesvári (2020).

