ARTIFICIAL REPLAY: A META-ALGORITHM FOR HARNESSING HISTORICAL DATA IN BANDITS

Abstract

While standard bandit algorithms sometimes incur high regret, their performance can be greatly improved by "warm starting" with historical data. Unfortunately, how best to incorporate historical data is unclear: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues, particularly in continuous action spaces. We address these two challenges by proposing ARTIFICIAL REPLAY, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. ARTIFICIAL REPLAY uses only a subset of the historical data, on an as-needed basis, to reduce computation and storage. We guarantee that our method achieves the same regret as a full warm-start approach, while potentially using only a fraction of the historical data, for a broad class of base algorithms that satisfy independence of irrelevant data (IIData), a novel property that we introduce. We complement these theoretical results with a case study of K-armed and continuous combinatorial bandit algorithms, including on a green security domain using real poaching data, showing the practical benefits of ARTIFICIAL REPLAY in achieving optimal regret alongside low computational and storage costs. Across these experiments, ARTIFICIAL REPLAY performs well in every setting we consider, even for base algorithms that do not satisfy IIData.

1. INTRODUCTION

Multi-armed bandits and their variants are robust models for many real-world problems. The resulting algorithms have been applied to wireless networks (Zuo & Joe-Wong, 2021), COVID testing regulations (Bastani et al., 2021), and conservation efforts to protect wildlife from poaching (Xu et al., 2021). Typical bandit algorithms assume no prior knowledge of the expected rewards of each action, simply taking actions online to address the exploration-exploitation trade-off. However, many real-world applications offer access to historical data. For example, in the wildlife conservation setting, we may have access to years of historical patrol records that should be incorporated to learn poaching risk before deploying any bandit algorithm. There is no consensus on how to optimally incorporate this historical data into online learning algorithms. The naive approach uses the full historical dataset to initialize reward estimates (Shivaswamy & Joachims, 2012), possibly incurring unnecessary and onerous computation and storage costs. These costs are particularly salient in continuous action settings with adaptive discretization, where the number of discretized regions is a direct function of the number of historical samples. If excessive data was collected on poor-performing actions, this spurious data with imbalanced coverage would lead us to unnecessarily process and store an extremely large number of fine discretizations in low-performing areas of the action space, even when a significantly coarser discretization would suffice to tell us the region is not worth exploring. These two key challenges highlight that the value of information of the historical dataset may not be a direct function of its size. Real-world decision makers echo this sentiment: Martin et al. (2017) note that for conservation decisions, more information does not always translate into better actions; time is the resource that matters most.
A natural question one can ask is: Is there an efficient way (in terms of space, computational, and sample complexity) to use historical data to achieve regret-optimal performance? For example, many real-world applications of bandit algorithms, such as online recommender systems, may contain historical datasets with millions of data points. Processing these millions of points would require an exceptional amount of upfront computation and storage cost, especially if many of those historical points are no longer relevant; many samples may encode out-of-date data such as old movies or discontinued products. To this end, we propose ARTIFICIAL REPLAY, a meta-algorithm that modifies any base bandit algorithm to harness historical data. ARTIFICIAL REPLAY reduces computation and storage costs by using historical data only on an as-needed basis. The key intuition is that if we could choose which samples to include in the historical dataset, a natural approach would be to use a regret-optimal bandit algorithm to guide the sampling. ARTIFICIAL REPLAY builds on this intuition by using the historical data as a replay buffer to artificially simulate online actions. Every time the base algorithm picks an action, we first check the historical data for any unused samples from the chosen action. If an unused sample exists, we update the reward estimates and continue without advancing to the next timestep. Otherwise, we sample from the environment, update the estimates using the observation, and advance to the next timestep. While this idea is easiest to understand in the context of the standard K-armed bandit problem, we discuss later how this framework naturally extends to other structures and information models, including continuous action spaces with semi-bandit feedback.
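The replay-buffer loop described above can be sketched in a few lines for the K-armed case. This is a minimal sketch, not the paper's implementation: the `base_propose`/`base_update` callback interface and all names are our own assumptions.

```python
def artificial_replay(base_propose, base_update, history, pull, T):
    """One possible sketch of the ARTIFICIAL REPLAY loop for K-armed
    bandits (interface names are ours, not the paper's).

    base_propose() -> proposed arm; base_update(arm, reward) updates the
    base algorithm's estimates; history maps arm -> list of unused
    historical rewards; pull(arm) draws a fresh online reward.
    """
    online_actions = []
    for _ in range(T):
        while True:
            arm = base_propose()
            if history.get(arm):
                # An unused historical sample exists: replay it and stay
                # at the same timestep (no online action is taken).
                base_update(arm, history[arm].pop(0))
            else:
                # No historical data left for this arm: act online.
                base_update(arm, pull(arm))
                online_actions.append(arm)
                break
    return online_actions
```

Note that historical samples are consumed only when the base algorithm proposes the corresponding arm, so spurious data on poor arms may never be touched.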
Although ARTIFICIAL REPLAY seems to be a natural heuristic to minimize use of historical data, it is not clear how to analyze its regret, specifically how much it loses compared to "full warm-start" (i.e., where the base algorithm is initialized with the full dataset). Surprisingly, however, we prove that under a widely applicable condition, the regret of ARTIFICIAL REPLAY (as a random variable) is distributionally identical to that of a full warm-start approach, while also guaranteeing significantly better time and storage complexity. Specifically, we show a sample-path coupling¹ between our ARTIFICIAL REPLAY approach and the full warm-start approach with the same base algorithm, as long as the base algorithm satisfies a novel independence of irrelevant data (IIData) assumption. While our goal is not to show regret improvements, this result highlights how ARTIFICIAL REPLAY is a simple approach for incorporating historical data with regret identical to full warm start (the approach commonly done in practice) at significantly smaller computational overhead. Finally, we show the practical benefits of our method by instantiating ARTIFICIAL REPLAY for several broad classes of bandits and evaluating on real-world data. To highlight the breadth of algorithms that satisfy the IIData property, we provide examples of regret-optimal IIData policies for K-armed and continuous combinatorial bandits. We use these examples to prove that ARTIFICIAL REPLAY can have arbitrarily better storage and computational complexity requirements. We close with a case study of combinatorial bandit algorithms for continuous resource allocation in the context of green security domains, using a novel adaptive discretization technique.
Across the experiments, using real-world poaching data, we observe concrete storage and runtime gains from the ARTIFICIAL REPLAY framework over a range of base algorithms, including algorithms that do not satisfy IIData such as Thompson sampling and Information Directed Sampling (IDS).

1.1. RELATED WORK

Multi-armed bandit problems have a long history in the online learning literature. We highlight the most closely related works below; for more extensive references please see our detailed discussion in Appendix B and Bubeck et al. (2012); Slivkins (2019); Lattimore & Szepesvári (2020).

Multi-Armed Bandit Algorithms. The design and analysis of bandit algorithms have been considered under a wide range of models. These algorithms were first studied in the K-armed bandit model in Lai & Robbins (1985), where the decision maker has access to a finite set of K possible actions at each timestep. Numerous follow-up works have considered similar approaches when designing algorithms in continuous action spaces (Kleinberg et al., 2019) and with combinatorial constraints (Chen et al., 2013; Xu et al., 2021; Zuo & Joe-Wong, 2021). Our work provides a framework to modify existing algorithms to harness historical data. Moreover, we also propose a novel algorithm incorporating adaptive discretization for combinatorial multi-armed bandits for continuous resource allocation, extending the discrete model of Zuo & Joe-Wong (2021).

Incorporating Historical Data. Several papers have started to investigate how to incorporate historical data into bandit algorithms, starting with Shivaswamy & Joachims (2012), who consider a K-armed bandit model where each arm has a dataset of historical pulls. The authors develop a "warm start" UCB algorithm that initializes the confidence bound of each arm based on the full historical data, prior to learning. Bouneffouf et al. (2019) extended similar techniques to models with pre-clustered arms. These techniques were later extended to Bayesian and frequentist linear contextual bandits, where the linear feature vector is initialized based on standard regression over the historical data (Oetomo et al., 2021; Wang et al., 2017).
Our work provides a contrasting approach to harnessing historical data in algorithm design: our meta-algorithm can be applied to any standard bandit framework and uses the historical data only as needed, leading to improved computation and storage gains.

2. PRELIMINARIES

We now define the general bandit model and specify the finite-armed and online combinatorial allocation settings that we study in our experiments. See Appendix C for details.

2.1. GENERAL STOCHASTIC BANDIT MODEL

We consider a stochastic bandit problem with a fixed action set $\mathcal{A}$. Let $\mathcal{R} : \mathcal{A} \to \Delta([0, 1])$ be a collection of independent and unknown reward distributions over $\mathcal{A}$. Our goal is to pick an action $a \in \mathcal{A}$ to maximize $\mathbb{E}[\mathcal{R}(a)]$, the expected reward, which we denote $\mu(a)$. The optimal reward is $\text{OPT} = \max_{a \in \mathcal{A}} \mu(a)$. For now, we do not impose any additional structure on $\mathcal{A}$, which could potentially be discrete, continuous, or encode combinatorial constraints.

Historical Data. We assume that the algorithm designer has access to a historical dataset $\mathcal{H}^{\text{hist}} = \{(a^H_j, R^H_j)\}_{j \in [H]}$ containing $H$ historical points with actions $\{a^H_j\}_{j \in [H]}$ and rewards $R^H_j$ sampled according to $\mathcal{R}(a^H_j)$. We do not make any assumptions on how the historical actions $a^H_j$ are chosen and view them as deterministic and fixed upfront. Our goal is to efficiently incorporate this historical data to improve the performance of a bandit algorithm.

Online Structure. Since the mean reward function $\mu(a)$ is initially unknown, we consider settings where the algorithm interacts with the environment sequentially over $T$ timesteps. At timestep $t \in [T]$, the decision maker picks an action $A_t \in \mathcal{A}$ according to their policy $\pi$. The environment then reveals a reward $R_t$ sampled from the distribution $\mathcal{R}(A_t)$. The optimal reward OPT would be achieved using a policy with full knowledge of the true distribution. We thus define regret as $\text{REGRET}(T, \pi, \mathcal{H}^{\text{hist}}) = T \cdot \text{OPT} - \sum_{t=1}^{T} \mu(A_t)$, where the dependence on $\mathcal{H}^{\text{hist}}$ highlights that $A_t$ can additionally depend on the historical dataset.
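The regret definition above is straightforward to compute for a realized action sequence; a minimal sketch (the function name is ours):

```python
def regret(mu, opt, actions):
    """Cumulative regret REGRET(T, pi, H_hist) = T * OPT - sum_t mu(A_t)
    for a realized sequence of actions A_1, ..., A_T.

    mu: dict mapping action -> true mean reward; opt: max of mu.
    """
    return len(actions) * opt - sum(mu[a] for a in actions)
```

For instance, with means $\mu = \{0.2, 0.8\}$, playing the suboptimal arm once contributes $0.6$ to the regret.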

2.2. FINITE, CONTINUOUS, AND COMBINATORIAL ACTION SPACES

Finite-Armed Bandit. The finite-armed bandit model can be viewed in this framework by considering $K$ discrete actions $\mathcal{A} = [K] = \{1, \ldots, K\}$.

Combinatorial Multi-Armed Bandit for Continuous Resource Allocation (CMAB-CRA). A central planner has access to a metric space $\mathcal{S}$ of resources with metric $d_{\mathcal{S}}$. They are tasked with splitting a total divisible budget $B$ across $N$ different resources within $\mathcal{S}$. An action consists of choosing $N$ resources, i.e., $N$ points in $\mathcal{S}$, and allocating the budget among the chosen subset. The feasible space of allocations is $\mathcal{B} = [0, 1]$ and the feasible action space is: $\mathcal{A} = \{(\mathbf{p}, \boldsymbol{\beta}) \in \mathcal{S}^N \times \mathcal{B}^N \mid \sum_{i=1}^{N} \beta^{(i)} \le B,\; d_{\mathcal{S}}(p^{(i)}, p^{(j)}) \ge \epsilon \;\forall i \ne j\}$. The chosen action must satisfy the budgetary constraint (i.e., $\sum_i \beta^{(i)} \le B$), and the resources must be distinct (i.e., $\epsilon$-away from each other according to $d_{\mathcal{S}}$ for some $\epsilon > 0$) to ensure the "same" resource is not chosen at multiple allocations. We additionally assume that $\mathcal{R}$ decomposes independently over the (resource, allocation) pairs, in that $\mu(a) = \sum_{i=1}^{N} \mu(p^{(i)}, \beta^{(i)})$. Lastly, we assume semi-bandit feedback: upon selecting an action, the decision maker observes $(p^{(i)}_t, \beta^{(i)}_t, R^{(i)}_t)_{i \in [N]}$ for each resource and allocation pair, sampled according to $\mathcal{R}(p^{(i)}_t, \beta^{(i)}_t)$. Zuo & Joe-Wong (2021) proposed a discrete model of this problem as a generalization of the works in Dagan & Koby (2018); Lattimore et al. (2014; 2015) specialize it to consider scheduling a finite set of resources to maximize the expected number of jobs finished.

Extension to Green Security. The CMAB-CRA model can be used to specify green security domains from Xu et al. (2021) by letting the space $\mathcal{S}$ represent a protected area and letting $\mathcal{B}$ represent the discrete set of patrol resources to allocate, such as the number of ranger hours per week, with the total budget $B$ being 40 hours.
This formulation generalizes to a more realistic continuous-space model of the landscape, instead of the artificial fixed discretization considered in prior work, consisting of $1 \times 1$ sq. km regions of the park. This also highlights the practical necessity that the chosen resources (here, the patrol locations) are $\epsilon$-far from each other to ensure sufficient spread. In Section 5, we show that enabling patrol planning at a continuous level can help park rangers more precisely identify poaching hotspots.
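The two feasibility constraints above (budget and pairwise $\epsilon$-separation) can be checked directly. A minimal sketch, assuming $\mathcal{S} = [0,1]^2$ with the Euclidean metric (the function name and representation are our own):

```python
import itertools
import math

def is_feasible(action, budget, eps):
    """Check CMAB-CRA feasibility: allocations must sum to at most the
    budget, and the chosen resource locations must be pairwise at least
    eps apart (Euclidean metric on [0, 1]^2 assumed here).

    action: list of ((x, y), beta) resource/allocation pairs.
    """
    points = [p for p, _ in action]
    if sum(beta for _, beta in action) > budget:
        return False
    return all(math.dist(p, q) >= eps
               for p, q in itertools.combinations(points, 2))
```

Any metric space would work by swapping `math.dist` for the appropriate $d_{\mathcal{S}}$.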

3. ARTIFICIAL REPLAY FOR HARNESSING HISTORICAL DATA

We propose ARTIFICIAL REPLAY, a meta-algorithm that can be integrated with any base algorithm to incorporate historical data. We later prove that for any base algorithm satisfying independence of irrelevant data (IIData), a novel property we introduce, ARTIFICIAL REPLAY has identical regret to an approach which uses the full historical data upfront, showing that our approach reduces computation costs without trading off performance. Additionally, in Appendix E we discuss empirical improvements of ARTIFICIAL REPLAY applied to Thompson Sampling and Information Directed Sampling, two algorithms which do not satisfy IIData.

Algorithm Formulation. Any algorithm for online stochastic bandits can be thought of as a function mapping arbitrary ordered histories (i.e., collections of observed $(a, R)$ pairs) to a distribution over actions in $\mathcal{A}$. More specifically, let $\Pi : \mathcal{D} \to \Delta(\mathcal{A})$ be an arbitrary base algorithm, where $\mathcal{D}$ denotes the collection of possible histories (i.e., $\mathcal{D} = \bigcup_{i \ge 0} (\mathcal{A} \times \mathbb{R}_+)^i$). The policy obtained by a base algorithm $\Pi$ without incorporating historical data simply takes the action sampled according to the policy $\pi^{\text{IGNORANT}(\Pi)}_t = \Pi(\mathcal{H}_t)$, where $\mathcal{H}_t$ is the data observed by timestep $t$. In comparison, consider an algorithm $\pi^{\text{FULL START}(\Pi)}_t$ which follows the same policy but uses the full historical data upfront, and so takes the action sampled according to $\Pi(\mathcal{H}^{\text{hist}} \cup \mathcal{H}_t)$.

3.1. ARTIFICIAL REPLAY

The ARTIFICIAL REPLAY meta-algorithm incorporates the historical data $\mathcal{H}^{\text{hist}}$ into an arbitrary base algorithm $\Pi$, resulting in a policy we denote by $\pi^{\text{ARTIFICIAL REPLAY}(\Pi)}$. See Algorithm 1 for the pseudocode. We let $\mathcal{H}^{\text{on}}_t$ be the set of historical datapoints used by the start of time $t$. Initially, $\mathcal{H}^{\text{on}}_1 = \emptyset$. For an arbitrary timestep $t$, the ARTIFICIAL REPLAY approach works as follows: Let $\tilde{A}_t \sim \Pi(\mathcal{H}^{\text{on}}_t \cup \mathcal{H}_t)$ be the proposed action at the start of time $t$. Since we are focused on simulating the algorithm with historical data, we break into cases based on whether or not the current set of unused historical datapoints (i.e., $\mathcal{H}^{\text{hist}} \setminus \mathcal{H}^{\text{on}}_t$) contains any additional information about $\tilde{A}_t$.

• No historical data available: If $\tilde{A}_t$ is not contained in $\mathcal{H}^{\text{hist}} \setminus \mathcal{H}^{\text{on}}_t$, then the selected action is $A_t = \tilde{A}_t$, and we advance to timestep $t + 1$. We additionally set $\mathcal{H}^{\text{on}}_{t+1} = \mathcal{H}^{\text{on}}_t$.

• Historical data available: If $\tilde{A}_t$ is contained in $\mathcal{H}^{\text{hist}} \setminus \mathcal{H}^{\text{on}}_t$, add that data point to $\mathcal{H}^{\text{on}}_t$ and repeat by picking another proposed action. We remain at time $t$.

Strikingly, our framework imposes minimal computational and storage overhead on top of existing algorithms, simply requiring a data structure to verify whether $\tilde{A} \in \mathcal{H}^{\text{hist}} \setminus \mathcal{H}^{\text{on}}_t$, which can be done with hashing techniques. The runtime and storage complexity of ARTIFICIAL REPLAY is thus no worse than that of FULL START. We also note that most practical bandit applications incorporate historical data obtained from database systems (e.g., content recommendation systems, or the wildlife poaching model discussed above). This historical data will be stored regardless of the algorithm being employed, and so the key consideration is computational requirements rather than storage. Additionally, our approach extends naturally to the following models:

Continuous Spaces. The ARTIFICIAL REPLAY framework can be applied in continuous action spaces with discretization-based algorithms.
For example, suppose that $\Pi$ wants to select an action $a \in \mathcal{A}$, but the historical data has a sample from $a + \epsilon$, a slightly perturbed point. Discretization-based algorithms avoid such precision issues since they map the continuous space to a collection of regions which together cover the action set, and run algorithms or subroutines over the discretization. Checking for historical data then simply checks for data within the bounds of the chosen discretized action.

Semi-Bandit Feedback. ARTIFICIAL REPLAY also naturally extends to combinatorial action sets with semi-bandit feedback where actions are decomposable, that is, they can be written as $a = (a_1, \ldots, a_N)$ with independent rewards. Suppose that $\Pi$ wants to select an action $a = (a_1, a_2, \ldots, a_N)$ but the historical data has a sample from $(a'_1, a_2, \ldots, a'_N)$. Even if the combinatorial action $a$ does not appear in its entirety in the historical data, as long as there exists some subcomponent $a_i$ (sometimes referred to as a "subarm" in combinatorial bandits) in the historical data (e.g., $a_2$), we can add that subcomponent to $\mathcal{H}^{\text{on}}_t$ to update the base algorithm.
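The semi-bandit extension can be sketched as follows. This is a simplified sketch with names of our own choosing: it consumes one unused historical sample per matching subarm and reports which subarms still require online sampling.

```python
def replay_subarms(proposed, history, update):
    """For a proposed combinatorial action (a_1, ..., a_N), replay one
    unused historical sample for every subarm present in the history,
    updating the base algorithm; return the subarms that must still be
    sampled online.

    history: dict mapping subarm -> list of unused historical rewards.
    """
    needs_online = []
    for sub in proposed:
        samples = history.get(sub)
        if samples:
            update(sub, samples.pop(0))   # consume one historical sample
        else:
            needs_online.append(sub)
    return needs_online
```

In the full meta-algorithm the base algorithm would then re-propose an action with its updated estimates before any online sampling occurs.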

3.2. INDEPENDENCE OF IRRELEVANT DATA AND REGRET COUPLING

It is not immediately clear how to analyze the regret of ARTIFICIAL REPLAY. To enable regret analysis, we introduce a new property for bandit algorithms, independence of irrelevant data, which essentially requires that when an algorithm is about to take an action, providing additional data about other actions (i.e., those not selected by the algorithm) will not influence the algorithm's decision.

Definition 3.1 (Independence of irrelevant data). A deterministic base algorithm $\Pi$ satisfies the independence of irrelevant data (IIData) property if whenever $A = \Pi(\mathcal{H})$, then $\Pi(\mathcal{H}) = \Pi(\mathcal{H} \cup \mathcal{H}')$ for any $\mathcal{H}'$ containing data only from actions $a'$ other than $A$ (that is, $a' \ne A$).

IIData is a natural robustness property for an algorithm to satisfy, highlighting that the algorithm evaluates actions independently when making decisions. IIData is conceptually analogous to the independence of irrelevant alternatives (IIA) axiom in computational social choice, a desideratum used to evaluate voting rules (Arrow, 1951). In Theorem 3.2 we show that for any base algorithm satisfying IIData, the regret of $\pi^{\text{FULL START}(\Pi)}$ and $\pi^{\text{ARTIFICIAL REPLAY}(\Pi)}$ will be equal.

Theorem 3.2. Suppose that algorithm $\Pi$ satisfies the independence of irrelevant data property. Then for any problem instance, horizon $T$, and historical dataset $\mathcal{H}^{\text{hist}}$ we have: $\pi^{\text{ARTIFICIAL REPLAY}(\Pi)}_t \stackrel{d}{=} \pi^{\text{FULL START}(\Pi)}_t$ and $\text{REGRET}(T, \pi^{\text{ARTIFICIAL REPLAY}(\Pi)}, \mathcal{H}^{\text{hist}}) \stackrel{d}{=} \text{REGRET}(T, \pi^{\text{FULL START}(\Pi)}, \mathcal{H}^{\text{hist}})$.

Algorithm 2 Monotone UCB (MONUCB)
1: Initialize $n_1(a) = 0$, $\mu_1(a) = 1$, and $\text{UCB}_1(a) = 1$ for each $a \in [K]$
2: for $t = 1, 2, \ldots$ do
3: Let $A_t = \arg\max_{a \in [K]} \text{UCB}_t(a)$
4: Receive reward $R_t$ sampled from $\mathcal{R}(A_t)$
5: Update $n_{t+1}(A_t) = n_t(A_t) + 1$ and $n_{t+1}(a) = n_t(a)$ for $a \ne A_t$
6: Update $\mu_{t+1}(A_t) = (n_t(A_t)\mu_t(A_t) + R_t)/n_{t+1}(A_t)$ and $\mu_{t+1}(a) = \mu_t(a)$ for $a \ne A_t$
7: Update $\text{UCB}_{t+1}(a) = \text{UCB}_t(a)$ for $a \ne A_t$ and $\text{UCB}_{t+1}(A_t) = \min\{\text{UCB}_t(A_t), \mu_{t+1}(A_t) + \sqrt{2\log(T)/n_{t+1}(A_t)}\}$

This theorem shows that ARTIFICIAL REPLAY achieves identical regret guarantees to FULL START while using data more efficiently. In the subsequent section, we show three example regret-optimal algorithms which satisfy this property, even in the complex CMAB-CRA setting. The algorithms we modify are all UCB-based. In fact, it is easy to modify most UCB-based algorithms to satisfy IIData by simply imposing monotonicity of the confidence bound estimates for each action's reward. This is easily implementable and preserves all regret guarantees. While for brevity we only discuss IIData algorithms in the K-armed and CMAB-CRA settings, it is easy to see how to modify other UCB-based algorithms (e.g., LinUCB for linear bandits (Abbasi-Yadkori et al., 2011)) to satisfy IIData. The existing bandit literature has focused narrowly on finding regret-optimal algorithms. We propose IIData as another desirable property, one that implies ease and robustness in optimally and efficiently incorporating historical data.
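Monotone UCB (Algorithm 2) is short enough to sketch as runnable code. This is our own illustrative rendering (the class interface is not from the paper); the monotone step is the `min` on the last line.

```python
import math

class MonotoneUCB:
    """Monotone UCB (MonUCB): UCB1 with the additional step that upper
    confidence bounds never increase, which yields the IIData property."""

    def __init__(self, K, T):
        self.T = T
        self.n = [0] * K          # pull counts n_t(a)
        self.mu = [1.0] * K       # empirical means, initialized to 1
        self.ucb = [1.0] * K      # monotone UCB estimates

    def propose(self):
        # arg max of UCB, breaking ties deterministically by lowest index
        return max(range(len(self.ucb)), key=lambda a: (self.ucb[a], -a))

    def update(self, a, reward):
        self.n[a] += 1
        self.mu[a] += (reward - self.mu[a]) / self.n[a]
        bonus = math.sqrt(2 * math.log(self.T) / self.n[a])
        # monotonicity: take the min with the previous UCB estimate
        self.ucb[a] = min(self.ucb[a], self.mu[a] + bonus)
```

Because an update can only lower an arm's UCB and leaves all other arms untouched, extra data about unchosen arms can never overturn the current arg max, which is exactly the IIData property.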

4. IIDATA ALGORITHMS

In this section, we provide IIData algorithms with optimal regret guarantees for two settings: the K-armed and CMAB-CRA models. We show that IIData is easy to guarantee for UCB algorithms, requiring only a minor modification to existing algorithms while not impacting confidence bound guarantees. We defer algorithm details to Appendix D and proofs to Appendix F. We show in Appendix E that in practice, ARTIFICIAL REPLAY still performs nearly optimally even with algorithms that do not satisfy IIData.

4.1. K-ARMED BANDITS

The first algorithm we propose, Monotone UCB (MONUCB), is derived from the UCB1 algorithm introduced in Auer et al. (2002). At every timestep $t$, the algorithm tracks the following: (i) $\mu_t(a)$, the estimated mean reward of action $a \in [K]$; (ii) $n_t(a)$, the number of times action $a$ has been selected by the algorithm prior to timestep $t$; and (iii) $\text{UCB}_t(a)$, an upper confidence bound estimate for the reward of action $a$. At every timestep $t$, the algorithm picks the action $A_t$ which maximizes $\text{UCB}_t(a)$ (breaking ties deterministically). After observing $R_t$, we increment $n_{t+1}(A_t) = n_t(A_t) + 1$, update $\mu_{t+1}(A_t)$, and set: $\text{UCB}_{t+1}(A_t) = \min\{\text{UCB}_t(A_t),\ \mu_{t+1}(A_t) + \sqrt{2\log(T)/n_{t+1}(A_t)}\}$. (5) The only modification of Monotone UCB relative to standard UCB is the additional step forcing the UCB estimates to be monotone decreasing over $t$. This modification has no effect on the regret guarantees: under the "good event" analysis, if $\text{UCB}_t(a) \ge \mu(a)$ with high probability, then the condition still holds at time $t + 1$, even after observing a new data point. In the following theorem, we show that MONUCB satisfies IIData and is regret-optimal, achieving the same instance-dependent regret bound as the standard UCB algorithm.

Theorem 4.1. Monotone UCB satisfies the IIData property and, for $\Delta(a) = \max_{a'} \mu(a') - \mu(a)$, has $\text{REGRET}(T, \pi^{\text{IGNORANT}(\text{MONUCB})}, \mathcal{H}^{\text{hist}}) = O(\sum_a \log(T)/\Delta(a))$.

This guarantee allows us to use Theorem 3.2 to establish that ARTIFICIAL REPLAY and FULL START have identical regret with MONUCB as a base algorithm. In the next theorem, we show that ARTIFICIAL REPLAY is robust to spurious data, where the historical data has excess samples $a^H_j$ coming from poor-performing actions. Spurious data imposes computational challenges, since the FULL START approach will pre-process the full historical dataset regardless of the observed rewards or the inherent value of the historical data.
In contrast, ARTIFICIAL REPLAY will only use the amount of data useful for learning.

Theorem 4.2. For every $H \in \mathbb{N}$ there exists a historical dataset $\mathcal{H}^{\text{hist}}$ with $|\mathcal{H}^{\text{hist}}| = H$ where the runtime of $\pi^{\text{FULL START}(\text{MONUCB})}$ is $\Omega(H + T)$ whereas the runtime of $\pi^{\text{ARTIFICIAL REPLAY}(\text{MONUCB})}$ is $O(T + \min\{\sqrt{T}, \log(T)/\min_a \Delta(a)^2\})$.

This highlights that the computational overhead of ARTIFICIAL REPLAY can be arbitrarily better than that of FULL START. For storage requirements, the FULL START algorithm requires $O(K)$ storage for maintaining estimates for each arm. In contrast, a naive implementation of ARTIFICIAL REPLAY requires $O(K + H)$ storage since the entire historical dataset needs to be stored; however, hashing techniques can address the extra $H$ factor. In Section 4.2 we will see an example where IIData additionally has strong storage benefits over FULL START. Lastly, to complement the computational improvements of ARTIFICIAL REPLAY applied to MONUCB, we can also show an improvement in regret. This analysis crucially uses the regret coupling, since FULL START(MONUCB) is much easier to reason about than its ARTIFICIAL REPLAY counterpart.

Theorem 4.3. Let $H_a$ be the number of datapoints in $\mathcal{H}^{\text{hist}}$ for each action $a \in [K]$. Then the regret of Monotone UCB with historical dataset $\mathcal{H}^{\text{hist}}$ is: $\text{REGRET}(T, \pi^{\text{ARTIFICIAL REPLAY}(\text{MONUCB})}, \mathcal{H}^{\text{hist}}) \le O\big(\sum_{a \in [K] : \Delta(a) \ne 0} \max\{0,\ \log(T)/\Delta(a) - H_a \Delta(a)\}\big)$.

Theorem 4.2 together with Theorem 4.3 highlights the advantage of using ARTIFICIAL REPLAY over FULL START: improved computational complexity while maintaining an equally improved regret guarantee. The bound reduces to the standard UCB guarantee when $\mathcal{H}^{\text{hist}} = \emptyset$. Moreover, it highlights the impact historical data can have on the regret: if $H_a \ge \log(T)/\Delta(a)^2$ for each $a$, then the regret of the algorithm will be constant, not scaling with $T$. We note that there are no existing regret lower bounds for incorporating historical data in bandit algorithms.
Our main goal is not to improve regret guarantees (although Theorem 4.3 highlights the advantage of historical data), but instead to highlight a simple, intuitive, and implementable approach, ARTIFICIAL REPLAY, which matches the performance of FULL START while requiring less computation. We close with an example of a K-armed bandit algorithm which does not satisfy the IIData assumption. Thompson Sampling (Russo et al., 2018), which samples arms according to the posterior probability that they are optimal, does not satisfy IIData: data from actions other than the one chosen will adjust the posterior distribution, and hence will adjust the selection probabilities as well. While we do not obtain a regret coupling, in Fig. 8 (appendix) we show that there are still empirical gains for using ARTIFICIAL REPLAY over FULL START across a variety of base algorithms.

4.2. CMAB-CRA

Incorporating historical data optimally and efficiently is difficult in continuous action settings. Two natural approaches are to (i) discretize the action space $\mathcal{A}$ based on the data using nearest-neighbor estimates, or (ii) learn a regression of the mean reward using available data. Consider a setting where excessive data is collected from poor-performing actions. Discretization-based algorithms will unnecessarily process and store a large number of discretizations in low-performing regions of the space. Regression-based methods will use compute resources to learn an accurate predictor of the mean reward in irrelevant regions. The key issues are that the computational and storage costs grow with the size of the historical dataset, and that the estimation and discretization are done independent of the quality of the reward. In contrast, we present two discretization-based algorithms that satisfy IIData with strong performance guarantees. In particular, we detail fixed and adaptive discretization (ADAMONUCB in Algorithm 3) algorithms that use the historical dataset only to update estimates of the reward.

5. EXPERIMENTS

Domains. We conduct experiments on the two bandit models described in Section 2.2: finite K-armed bandits and CMAB-CRA, using both fixed and adaptive discretization. For the continuous combinatorial setting, we provide two stylized domains: a piecewise-linear and a quadratic reward function. To emphasize the practical benefit of ARTIFICIAL REPLAY, we evaluate on a real-world resource allocation setting for biodiversity conservation. We study real ranger patrol data from Murchison Falls National Park, shared as part of a collaboration with the Uganda Wildlife Authority and the Wildlife Conservation Society. We use historical patrol observations to build the history $\mathcal{H}^{\text{hist}}$; we analyze these historical observations in detail in Appendix E to show that this dataset exhibits both spurious data and imbalanced coverage as discussed in Section 4.

Baselines.
We compare ARTIFICIAL REPLAY against the IGNORANT and FULL START approaches for each setting. In the K-armed model, we use MONUCB as the base algorithm. In CMAB-CRA we use fixed and adaptive discretization as well as REGRESSOR, a neural network learner that is a regression-based analogue of FULL START. REGRESSOR is initially trained on the entire historical dataset, then iteratively retrained after every 128 new samples are collected. We also compute for each setting the performance of an OPTIMAL action based on the true rewards and a RANDOM baseline that acts randomly while satisfying the budget constraint.

Results. The results in Fig. 1 empirically validate our theoretical result from Theorem 3.2: the performance of ARTIFICIAL REPLAY is identical to that of FULL START, and reduces regret considerably compared to the naive IGNORANT approach. We evaluate the regret (relative to OPTIMAL) of each approach across time $t \in [T]$. Concretely, we consider the three domains of piecewise-linear reward, quadratic reward, and green security, with continuous space $\mathcal{S} = [0, 1]^2$, $N = 5$ possible action components, a budget $B = 2$, and 3 levels of effort. We include $H = 300$ historical data points. See Fig. 9 (appendix) for regret and analysis of historical data use on the K-armed bandit. Not only does ARTIFICIAL REPLAY achieve equal performance, but its computational benefits over FULL START are clear even on practical problem sizes. As we increase the amount of historical data over $H \in \{10, 100, 1{,}000, 10{,}000\}$ in Fig. 2, the proportion of irrelevant data increases. Our method achieves equal performance, overcoming the previously unresolved challenge of spurious data, while FULL START suffers from arbitrarily worse storage complexity (Theorem 4.2). With 10,000 historical samples and a time horizon of 1,000, we see that 58.2% of historical samples are irrelevant to producing the most effective policy.
When faced with imbalanced data coverage, the benefits of ARTIFICIAL REPLAY become clear, most notably in the continuous action setting with adaptive discretization. In Fig. 3, as we increase the number of historical samples on bad regions (bottom 20th percentile of reward), the additional data require finer discretization, leading to arbitrarily worse storage and computational complexity for FULL START with equal regret. In Fig. 3(c), we see that with 10% of data on bad arms, ARTIFICIAL REPLAY requires only 446 regions compared to the 688 used by FULL START; as we get more spurious data and that fraction increases to 90%, ARTIFICIAL REPLAY requires only 356 regions while FULL START still stores 614 regions.

6. CONCLUSION

We present ARTIFICIAL REPLAY, a meta-algorithm that modifies any base bandit algorithm to efficiently harness historical data. We show that under a widely applicable IIData condition, the regret of ARTIFICIAL REPLAY (as a random variable) is distributionally identical to that of a full warm-start approach, while also guaranteeing significantly better time complexity. We additionally give examples of regret-optimal IIData algorithms in the K-armed and CMAB-CRA settings. Our experimental results highlight the advantage of using ARTIFICIAL REPLAY over FULL START via a variety of base algorithms, applied to K-armed and continuous combinatorial bandit models, including for algorithms such as Thompson sampling and Information Directed Sampling (IDS) that do not exhibit IIData. Directions for future work include (i) finding IIData algorithms in other bandit domains such as linear contextual bandits, (ii) incorporating the ARTIFICIAL REPLAY approach into reinforcement learning, and (iii) providing theoretical bounds showing that ARTIFICIAL REPLAY has optimal data usage when incorporating historical data.



That is, we construct the regret process for both algorithms simultaneously on a joint probability space such that each individual process has the correct marginal, but both processes are equal over each sample path.
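A minimal sketch of one such construction (our notation, assuming the standard per-action reward-tape coupling; the paper's formal proof may differ in details): draw each action's reward sequence in advance and let both algorithms read from the same tapes.

```latex
% Per-action reward tapes, drawn i.i.d. in advance: the j-th online play
% of action a (by either algorithm) observes R_a^{(j)}.
\{ R_a^{(j)} \}_{j \ge 1} \overset{\text{i.i.d.}}{\sim} \Re(a)
  \quad \text{for each action } a.
% Under IIData, both algorithms select the same online action sequence
% on this joint space, and hence
\mathrm{Regret}_T(\textsc{Artificial Replay})
  = \mathrm{Regret}_T(\textsc{Full Start}) \quad \text{pathwise.}
```

Since each marginal process sees i.i.d. rewards with the correct distribution, this coupling is valid, and pathwise equality implies the distributional identity claimed in Theorem 3.2.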



Figure 1: (CMAB-CRA) Cumulative regret (y-axis; lower is better) over time t ∈ [T]. ARTIFICIAL REPLAY performs as well as FULL START across all domain settings, including both fixed discretization (top row) and adaptive discretization (bottom row). REGRESSOR performs quite poorly.

Figure 2: (K-Armed) Increasing the number of historical samples H leads FULL START to use unnecessary data, particularly as H gets very large. ARTIFICIAL REPLAY achieves equal performance in terms of regret (plot a) while using less than half the historical data (plot b). In plot c we see that with H = 1,000 historical samples, ARTIFICIAL REPLAY uses (on average) 117 historical samples before taking its first online action. The number of historical samples used increases at a decreasing rate, using only 396 of 1,000 total samples by the horizon T. Results are shown on the K-armed bandit setting with K = 10 and horizon T = 1,000.

Algorithm Π:
1: Initialize the set of used historical data points H_1^on = ∅ and the set of online data H_1 = ∅
2: for t = 1, 2, . . . do
3:     Execute action A_t and observe reward R_t ∼ ℜ(A_t)


of the reward. Due to space, we describe the algorithms only at a high level and defer details to Appendix D. Our algorithms are Upper Confidence Bound (UCB) style: the selection rule maximizes Eq. (1) over the combinatorial action set (Eq. (3)) through a discretization of S. For each allocation β ∈ B, the algorithm maintains a collection P_t of regions of S which covers S. In the fixed discretization variant, P_t is fixed at the start of learning; in the adaptive discretization variant, it is refined over the course of learning based on observed data. At every timestep t and for every region R ∈ P_t, the algorithm tracks: (i) µ_t(R, β), the estimated mean reward of region R at allocation β; (ii) n_t(R, β), the number of times R has been selected at allocation β prior to timestep t; and (iii) UCB_t(R, β), an upper confidence bound estimate. At a high level, our algorithm performs three steps in each iteration t:

1. Action selection: Greedily select at most N regions in P_t to maximize UCB_t(R, β) subject to the budget constraints (see Eq. (10) in the appendix). Note that we must additionally ensure that each region is selected at only a single allocation value.
2. Update parameters: For each selected region, increment n_t(R, β) by one, update µ_t(R, β) based on the observed data, and set UCB_{t+1}(R, β) = min{UCB_t(R, β), µ_t(R, β) + b(n_t(R, β))} for an appropriate bonus term b(·). This enforces monotonicity in the UCB estimates, similar to MONUCB, and is required for the IIData property.
3. Re-partition: This step differentiates the adaptive discretization algorithm from fixed discretization, which maintains the same partition across all timesteps. We split a region when the confidence in its estimate (i.e., b(n_t(R, β))) is smaller than the diameter of the region.
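The re-partition rule in step 3 can be made concrete. The following is a minimal sketch in 1-D (our own simplification: the bonus constant is an assumption, and in the actual S = [0, 1]² action space a split would produce quadrants rather than halves):

```python
import math

def bonus(n, T=1000):
    """Illustrative confidence bonus b(n); the exact constant used in
    the paper's algorithm is not reproduced here."""
    return math.sqrt(2.0 * math.log(T) / max(n, 1))

def should_split(diameter, n_samples):
    """Split once statistical uncertainty b(n) drops below the region's
    diameter, i.e. once discretization error dominates estimation error."""
    return bonus(n_samples) < diameter

def split(region):
    """Halve a 1-D region (lo, hi). Heavily sampled (typically
    high-reward) regions are thus refined while poor regions stay coarse."""
    lo, hi = region
    mid = (lo + hi) / 2.0
    return [(lo, mid), (mid, hi)]
```

Because only frequently selected regions accumulate enough samples for b(n) to shrink below their diameter, refinement concentrates where the learner keeps acting, which is why FULL START's discretization (fed the whole history) grows with the historical dataset while ARTIFICIAL REPLAY's does not.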
This condition may seem independent of the quality of a region, but since it is incorporated into a learning algorithm, the number of samples in a region is correlated with its reward. In Fig. 4 (appendix) we highlight how the adaptive discretization algorithm hones in on regions with large reward without knowing the reward function before learning. These algorithms modify existing approaches applied to CMAB-CRA in the bandit and reinforcement learning literature, which have been shown to be regret-optimal (Xu et al., 2021; Sinclair et al., 2021). We additionally note that these approaches are IIData.

Theorem 4.4. The fixed and adaptive discretization algorithms, when using a "greedy" solution to solve Eq. (1), have property IIData.

Here we require the algorithm to use the standard "greedy approximation" to Eq. (1), which is a knapsack problem in the CMAB-CRA set-up (Williamson & Shmoys, 2011). This introduces additional approximation-ratio limitations in general. However, under additional assumptions on the mean reward function µ(p, β), the greedy solution is provably optimal. For example, optimality of the greedy approximation holds when µ(p, β) is piecewise linear and monotone, or more broadly when µ(a) is submodular. See Appendix D for more discussion.

Finally, we comment that the FULL START implementation of these adaptive discretization algorithms has storage and computational costs proportional to the size of the historical dataset (since the algorithms ensure that the discretization scales with the number of samples). In contrast, ARTIFICIAL REPLAY uses only a fraction of the historical dataset and so again has improved computation and storage complexity. This is validated in the experimental results in Appendix E.
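As a concrete illustration of such a greedy approximation, the sketch below uses the standard density-greedy knapsack heuristic; the data structures and names are ours, not the paper's, and the exact selection rule in the paper may differ.

```python
def greedy_select(ucb, cost, budget, max_regions):
    """Greedy knapsack-style selection sketch: repeatedly take the
    (region, allocation) pair with the highest UCB-per-cost density that
    fits the remaining budget, using each region at most once.

    `ucb`  maps (region, allocation) -> UCB estimate.
    `cost` maps an allocation level -> its (positive) budget cost.
    """
    chosen, remaining = {}, budget
    # Rank candidate pairs by UCB density (stable sort breaks ties by order).
    ranked = sorted(ucb, key=lambda k: ucb[k] / cost[k[1]], reverse=True)
    for region, alloc in ranked:
        if len(chosen) == max_regions:
            break
        if region not in chosen and cost[alloc] <= remaining:
            chosen[region] = alloc          # each region used at most once
            remaining -= cost[alloc]
    return chosen
```

For example, with regions A, B, C, allocation costs {1: 1.0, 2: 2.0}, and budget 2.0, the pair (A, 1) is taken first on density, (A, 2) is skipped because region A is already used, and (B, 1) fills the remaining budget.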

5. EXPERIMENTS

We demonstrate the benefits of ARTIFICIAL REPLAY by showing that our meta-algorithm achieves identical performance to FULL START while offering significant practical advantages in reduced runtime and storage. We consider two classes of bandit domains: K-armed and CMAB-CRA. As part of our evaluation on combinatorial bandits, we introduce a new model for green security games with continuous actions by adaptively discretizing the landscape of Murchison Falls National Park, a large protected area in Uganda. All of the code to reproduce the experiments is available at https://github.com/lily-x/artificial-replay. Results are averaged over 60 iterations with random seeds, with standard error plotted; experiment details and additional results are available in Appendix E.

