ARTIFICIAL REPLAY: A META-ALGORITHM FOR HARNESSING HISTORICAL DATA IN BANDITS

Abstract

While standard bandit algorithms sometimes incur high regret, their performance can be greatly improved by "warm starting" with historical data. Unfortunately, how best to incorporate historical data is unclear: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues, particularly in continuous action spaces. We address these two challenges by proposing ARTIFICIAL REPLAY, a meta-algorithm for incorporating historical data into any base bandit algorithm. ARTIFICIAL REPLAY uses only a subset of the historical data, as needed, to reduce computation and storage. We prove that our method achieves regret equal to that of a full warm-start approach, while potentially using only a fraction of the historical data, for a broad class of base algorithms that satisfy independence of irrelevant data (IIData), a novel property that we introduce. We complement these theoretical results with a case study of K-armed and continuous combinatorial bandit algorithms, including on a green security domain using real poaching data, to show the practical benefits of ARTIFICIAL REPLAY in achieving optimal regret alongside low computational and storage costs. Across these experiments, we show that ARTIFICIAL REPLAY performs well in all settings we consider, even for base algorithms that do not satisfy IIData.

1. INTRODUCTION

Multi-armed bandits and their variants are robust models for many real-world problems. The resulting algorithms have been applied to wireless networks (Zuo & Joe-Wong, 2021), COVID testing regulations (Bastani et al., 2021), and conservation efforts to protect wildlife from poaching (Xu et al., 2021). Typical bandit algorithms assume no prior knowledge of the expected reward of each action, simply taking actions online to navigate the exploration-exploitation trade-off. However, many real-world applications offer access to historical data. For example, in the wildlife conservation setting, we may have access to years of historical patrol records that should be incorporated to learn poaching risk before deploying any bandit algorithm. There is no consensus on how best to incorporate this historical data into online learning algorithms. The naive approach uses the full historical dataset to initialize reward estimates (Shivaswamy & Joachims, 2012), possibly incurring unnecessary and onerous computation and storage costs. These costs are particularly salient in continuous action settings with adaptive discretization, where the number of discretized regions grows directly with the number of historical samples. If excessive data were collected on poorly performing actions, this spurious, imbalanced data would force us to process and store an extremely large number of fine discretizations in low-performing areas of the action space, even when a significantly coarser discretization would suffice to tell us that a region is not worth exploring. These two key challenges highlight that the value of information in the historical dataset may not be a direct function of its size. Real-world decision makers echo this sentiment: Martin et al. (2017) note that for conservation decisions, more information does not always translate into better actions; time is the resource that matters most.
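To make the contrast concrete, the sketch below compares naive full warm-starting of reward estimates against a lazy, replay-style alternative that consumes a historical sample only when the base algorithm actually selects the corresponding arm. This is a minimal K-armed illustration under our own assumptions, not the paper's implementation; the function names and the specific replay rule are hypothetical.

```python
def naive_warm_start(num_arms, history):
    """Naive approach: fold the ENTIRE historical dataset into the
    per-arm counts and running means up front, regardless of whether
    the data will ever be relevant to the arms the algorithm explores."""
    counts = [0] * num_arms
    means = [0.0] * num_arms
    for arm, reward in history:  # processes every sample, no matter how spurious
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def lazy_replay_step(choose_arm, counts, means, unused_history, pull):
    """Replay-style alternative (hypothetical rule): run the base
    algorithm's arm selection; if the chosen arm still has unused
    historical samples, consume one of those instead of acting online.
    Only the history of arms the algorithm actually selects is touched."""
    arm = choose_arm(counts, means)
    if unused_history[arm]:
        reward = unused_history[arm].pop()  # replay a stored sample
        real_action = False
    else:
        reward = pull(arm)  # no history left for this arm: act for real
        real_action = True
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
    return arm, reward, real_action
```

Under this sketch, historical samples for arms the base algorithm never selects are never processed or stored in the estimates, which is the kind of saving the imbalanced-coverage discussion above points to.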
A natural question to ask is: Is there an efficient way (in terms of space, computation, and sample complexity) to use historical data to achieve regret-optimal performance? For example, many real-world applications of bandit algorithms, such as online recommender systems, may con-

