REPLICABLE BANDITS

Abstract

In this paper, we introduce the notion of replicable policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called replicable if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do replicable policies exist, but they also achieve almost the same optimal (non-replicable) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandits setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the replicability parameter is also optimal. Similarly, for stochastic linear bandits (with finitely and infinitely many arms) we develop replicable policies that achieve the best-known problem-independent regret bounds with an optimal dependency on the replicability parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same sequence of arms in two different executions.

1. INTRODUCTION

In order for scientific findings to be valid and reliable, the experimental process must be repeatable and must provide coherent results and conclusions across these repetitions. In fact, lack of reproducibility has been a major issue in many scientific areas; a 2016 survey that appeared in Nature (Baker, 2016a) revealed that more than 70% of researchers failed in their attempt to reproduce another researcher's experiments. What is even more concerning is that over 50% of them failed to reproduce their own findings. Similar concerns have been raised by the machine learning community, e.g., the ICLR 2019 Reproducibility Challenge (Pineau et al., 2019) and the NeurIPS 2019 Reproducibility Program (Pineau et al., 2021), due to the exponential increase in the number of publications and the need to ensure the reliability of the findings.

The aforementioned empirical evidence has recently led to theoretical studies and rigorous definitions of replicability. In particular, the works of Impagliazzo et al. (2022) and Ahn et al. (2022) considered replicability as an algorithmic property through the lens of (offline) learning and convex optimization, respectively. In a similar vein, in the current work, we introduce the notion of replicability in the context of interactive learning and decision making. In particular, we study replicable policy design for the fundamental setting of stochastic bandits.

A multi-armed bandit (MAB) is a one-player game that is played over T rounds where there is a set of different arms/actions A of size |A| = K (in the more general case of linear bandits, we can consider even an infinite number of arms). In each round t = 1, 2, ..., T, the player pulls an arm a_t ∈ A and receives a corresponding reward r_t. In the stochastic setting, the rewards of each arm are sampled in each round independently from some fixed, but unknown, distribution supported on [0, 1].
Crucially, each arm has a potentially different reward distribution, but the distribution of each arm is fixed over time. A bandit algorithm A at every round t takes as input the sequence of arm-reward pairs that it has seen so far, i.e., (a_1, r_1), ..., (a_{t-1}, r_{t-1}), then uses (potentially) some internal randomness ξ to pull an arm a_t ∈ A and, finally, observes the associated reward r_t ∼ D_{a_t}. We propose the following natural notion of a replicable bandit algorithm, which is inspired by the definition of Impagliazzo et al. (2022). Intuitively, a bandit algorithm is replicable if two distinct executions of the algorithm, with internal randomness fixed between both runs, but with independent reward realizations, give the exact same sequence of played arms, with high probability. More formally, we have the following definition.

Definition 1 (Replicable Bandit Algorithm). Let ρ ∈ [0, 1]. We call a bandit algorithm A ρ-replicable in the stochastic setting if, for any distribution D_{a_j} over [0, 1] of the rewards of the j-th arm a_j ∈ A, and for any two executions of A in which the internal randomness ξ is shared across the executions, it holds that

Pr_{ξ, r^(1), r^(2)}[ (a^(1)_1, ..., a^(1)_T) = (a^(2)_1, ..., a^(2)_T) ] ≥ 1 − ρ,

where a^(i)_t = A(a^(i)_1, r^(i)_1, ..., a^(i)_{t-1}, r^(i)_{t-1}; ξ) is the t-th action taken by the algorithm A in execution i ∈ {1, 2}.

The reason why we allow for some fixed internal randomness is that the algorithm designer has control over it; e.g., they can use the same seed for their (pseudo)random generator between two executions. Clearly, it is not challenging to naively design a replicable bandit algorithm. For instance, an algorithm that always pulls the same arm, or an algorithm that plays the arms in a particular random sequence determined by the shared random seed ξ, is replicable. The caveat is that the performance of these algorithms in terms of expected regret will be quite poor.
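To make Definition 1 concrete, the following sketch (our own illustration; the helper names and the policies are not from the paper) runs a policy twice with shared internal randomness ξ but independent reward realizations and compares the arm sequences. The round-robin policy ignores rewards entirely and is therefore trivially replicable, matching the discussion above.

```python
import numpy as np

def run_policy(policy, means, T, reward_seed, shared_seed):
    """Run a bandit policy for T rounds. Internal randomness xi comes from
    shared_seed (fixed across executions); rewards come from reward_seed
    (independent across executions)."""
    xi = np.random.default_rng(shared_seed)        # shared internal randomness
    rewards = np.random.default_rng(reward_seed)   # independent reward realizations
    history = []
    for _ in range(T):
        arm = policy(history, xi)
        r = rewards.binomial(1, means[arm])        # Bernoulli reward in [0, 1]
        history.append((arm, r))
    return [a for a, _ in history]

def round_robin(history, xi):
    # Trivially replicable: depends only on the round index, never on rewards.
    return len(history) % 2

# Two executions: same xi, independent rewards -> identical arm sequences.
seq1 = run_policy(round_robin, [0.4, 0.6], 20, reward_seed=1, shared_seed=7)
seq2 = run_policy(round_robin, [0.4, 0.6], 20, reward_seed=2, shared_seed=7)
assert seq1 == seq2
```

Of course, as noted above, such a trivially replicable policy has poor regret; the challenge is to retain replicability while exploring adaptively.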
In this work, we aim to design bandit algorithms that are replicable and enjoy small expected regret. In the stochastic setting, the (expected) regret after T rounds is defined as

E[R_T] = T · max_{a∈A} µ_a − E[ Σ_{t=1}^T µ_{a_t} ],

where µ_a = E_{r∼D_a}[r] is the mean reward of arm a ∈ A. In a similar manner, we can define the regret in the more general setting of linear bandits (see Section 5). Hence, the overarching question in this work is the following: Is it possible to design replicable bandit algorithms with small expected regret? At first glance, one might think that this is not possible, since replicability seems to contradict the exploratory behavior that a bandit algorithm should possess. However, our main results answer this question in the affirmative and are summarized in Table 1.

Setting                                           | Algorithm   | Regret                   | Theorem
Stochastic MAB                                    | Algorithm 1 | O(K² log³(T) H_Δ / ρ²)   | Theorem 3
Stochastic MAB                                    | Algorithm 2 | O(K² log(T) H_Δ / ρ²)    | Theorem 4
Stochastic Linear Bandits                         | Algorithm 3 | O(K² √(dT) / ρ²)         | Theorem 6
Stochastic Linear Bandits (infinite action space) | Algorithm 4 | O(poly(d) √T / ρ²)       | Theorem 10

Table 1: Our results for replicable stochastic multi-armed and linear bandits. In the expected regret column, O(•) subsumes logarithmic factors. H_Δ is equal to Σ_{j:Δ_j>0} 1/Δ_j, where Δ_j is the difference between the mean of action j and the optimal action, K is the number of arms, and d is the ambient dimension in the linear bandit setting.

1.1. RELATED WORK

Reproducibility/Replicability. In this work, we introduce the notion of replicability in the context of interactive learning and, in particular, in the fundamental setting of stochastic bandits. Closest to our work, the notion of a replicable algorithm in the context of learning was proposed by Impagliazzo et al. (2022), where it is shown how any statistical query algorithm can be made replicable with a moderate increase in its sample complexity. Using this result, they provide replicable algorithms for finding approximate heavy-hitters, medians, and for learning half-spaces. Reproducibility has also been considered in the context of optimization by Ahn et al. (2022). We mention that in Ahn et al. (2022) the notion of a replicable algorithm differs from our work and that of Impagliazzo et al. (2022), in the sense that the outputs of two different executions of the algorithm do not need to be exactly the same. From a more application-oriented perspective, Shamir & Lin (2022)

2. STOCHASTIC BANDITS AND REPLICABILITY

In this section, we first highlight the main challenges in guaranteeing replicability and then discuss how the results of Impagliazzo et al. (2022) can be applied in our setting.

2.1. WARM-UP I: NAIVE REPLICABILITY AND CHALLENGES

Let us consider the stochastic two-arm setting (K = 2) and a bandit algorithm A with two independent executions, A_1 and A_2. The algorithm A_i plays the sequence 1, 2, 1, 2, ... until some, potentially random, round T_i ∈ N, after which one of the two arms is eliminated and, from that point on, the algorithm picks the winning arm j_i ∈ {1, 2}. The algorithm A is ρ-replicable if and only if T_1 = T_2 and j_1 = j_2 with probability 1 − ρ. Assume that |µ_1 − µ_2| = Δ, where µ_i is the mean of the distribution of the i-th arm. If we assume that Δ is known, then we can run the algorithm for T_1 = T_2 = (C/Δ²) log(1/ρ) rounds, for some universal constant C > 0, and obtain that, with probability 1 − ρ, it holds that µ̂^(j)_1 ≈ µ_1 and µ̂^(j)_2 ≈ µ_2 for j ∈ {1, 2}, where µ̂^(j)_i is the estimate of arm i's mean during execution j. Hence, knowing Δ implies that the stopping criterion of the algorithm A is deterministic and that, with high probability, the winning arm will be detected at time T_1 = T_2. This makes the algorithm ρ-replicable.

Observe that when K = 2, the only obstacle to replicability is that the algorithm should decide at the same time to select the winning arm, and the selection must be the same in the two execution threads. In the presence of multiple arms, there is the additional constraint that the above conditions must be satisfied across, potentially, multiple arm eliminations. Hence, the two questions arising from the above discussion are (i) how to modify the above approach when Δ is unknown and (ii) how to deal with K > 2 arms. A potential solution to the second question (on handling K > 2 arms) is the Explore-Then-Commit (ETC) strategy. Consider the stochastic K-arm bandit setting. For any ρ ∈ (0, 1), the ETC algorithm with known Δ = min_i Δ_i and horizon T that uses m = (4/Δ²) log(1/ρ) deterministic exploration phases before commitment is ρ-replicable. The intuition is exactly the same as in the K = 2 case.
The caveats of this approach are that it assumes that Δ is known and that the obtained regret is quite unsatisfying. In particular, it achieves regret bounded by m Σ_{i∈[K]} Δ_i + ρ · (T − mK) Σ_{i∈[K]} Δ_i. Next, we discuss how to improve the regret bound without knowing the gaps Δ_i. Before designing new algorithms, we inspect the guarantees that can be obtained by combining ideas from previous results in the bandit literature with the recent work on replicable learning of Impagliazzo et al. (2022).
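As a concrete illustration of this warm-up (our own sketch, with hypothetical helper names; not the paper's pseudocode), the following Explore-Then-Commit routine uses the deterministic exploration length m = (4/Δ²) log(1/ρ), so two executions always stop exploring at the same round and can disagree only on the committed arm, which happens with probability at most ρ:

```python
import numpy as np
from math import ceil, log

def replicable_etc(means, T, gap, rho, reward_seed):
    """Explore-Then-Commit with known minimal gap. The exploration length m is
    deterministic, so the commitment time is identical across executions; only
    the committed arm can differ, with probability at most rho."""
    rng = np.random.default_rng(reward_seed)
    K = len(means)
    m = ceil(4 / gap**2 * log(1 / rho))    # deterministic exploration budget per arm
    sums = np.zeros(K)
    arms = []
    for t in range(m * K):
        a = t % K                          # fixed round-robin exploration order
        sums[a] += rng.binomial(1, means[a])
        arms.append(a)
    winner = int(np.argmax(sums))          # same in both runs w.p. >= 1 - rho
    arms += [winner] * (T - m * K)
    return arms, m

arms1, m = replicable_etc([0.2, 0.7], 2000, gap=0.5, rho=0.01, reward_seed=1)
arms2, _ = replicable_etc([0.2, 0.7], 2000, gap=0.5, rho=0.01, reward_seed=2)
```

The exploration prefix is identical by construction; only the committed suffix depends on the reward realizations, which is exactly the replicability failure mode bounded by ρ.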

2.2. WARM-UP II: BANDIT ALGORITHMS AND REPLICABLE MEAN ESTIMATION

First, we remark that we work in the stochastic setting, where the reward distributions of the arms are subgaussian. Thus, the problem of estimating their means is an instance of a statistical query for which we can use the algorithm of Impagliazzo et al. (2022) to get a replicable mean estimator for the reward distributions of the arms.

Proposition 2 (Replicable Mean Estimation (Impagliazzo et al., 2022)). Let τ, δ, ρ ∈ [0, 1]. There exists a ρ-replicable algorithm ReprMeanEstimation that draws Ω(log(1/δ)/(τ²(ρ − δ)²)) samples from a distribution with mean µ and computes an estimate µ̂ that satisfies |µ̂ − µ| ≤ τ with probability at least 1 − δ.

Notice that we are working in the regime where δ ≪ ρ, so the sample complexity is Ω(log(1/δ)/(τ²ρ²)). The straightforward approach is to use an optimal multi-armed bandit algorithm for the stochastic setting, such as UCB or arm elimination (Even-Dar et al., 2006), combined with the replicable mean estimator. However, it is not hard to see that this approach does not give meaningful results: if we want to achieve replicability ρ, we need to call the replicable mean estimation routine with parameter ρ/(KT), due to the union bound that we need to take. This means that we need to pull every arm at least K²T² times, so the regret guarantee becomes vacuous. This gives us the first key insight for tackling the problem: we need to reduce the number of calls to the mean estimator. Hence, we draw inspiration from the line of work on stochastic batched bandits (Gao et al., 2019; Esfandiari et al., 2021) to derive replicable bandit algorithms.
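The rounding idea that underlies such replicable estimators can be sketched as follows (a simplified illustration of the shared-randomness rounding trick of Impagliazzo et al. (2022); this is not their full algorithm and carries no formal sample-complexity guarantee):

```python
import numpy as np

def repr_mean_estimation(samples, tau, shared_seed):
    """Sketch of replicable mean estimation: snap the empirical mean to a random
    grid of cell width 2*tau whose offset is drawn from the shared internal
    randomness. Two executions whose empirical means land in the same grid cell
    return the exact same estimate, and the rounding error is at most tau."""
    xi = np.random.default_rng(shared_seed)
    offset = xi.uniform(0, 2 * tau)          # shared across executions
    mu_hat = float(np.mean(samples))
    return offset + 2 * tau * round((mu_hat - offset) / (2 * tau))

est = repr_mean_estimation([0.1, 0.4, 0.35, 0.2, 0.45], tau=0.05, shared_seed=3)
```

Because the grid offset is random, two independent executions whose empirical means are within o(τ) of each other fall in the same cell with high probability, which is the source of replicability.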

3. REPLICABLE MEAN ESTIMATION FOR BATCHED BANDITS

As a first step, we show how one can combine the existing replicable algorithms of Impagliazzo et al. (2022) with the batched bandits approach of Esfandiari et al. (2021) to get some preliminary non-trivial results. We build an algorithm for the K-arm setting, where the gaps Δ_j are unknown to the learner. Let δ be the confidence parameter of the arm elimination algorithm and ρ the replicability guarantee we want to achieve. Our approach is the following: we deterministically split the time interval into sub-intervals of increasing length. We treat each sub-interval as a batch of samples, where we pull each active arm the same number of times and use the replicable mean estimation algorithm to estimate the true mean empirically. At the end of each batch, we decide whether to eliminate some arm j using the standard UCB estimate. Crucially, if we condition on the event that all the calls to the replicable mean estimator return the same number, then the algorithm we propose is replicable.

Algorithm 1 Replicable Algorithm for Stochastic Multi-Armed Bandits (Theorem 3)
1: Input: time horizon T, number of arms K, replicability ρ
2: Initialization: B ← log(T), q ← T^(1/B), c_0 ← 0, A ← [K], r ← T, µ̂_a ← 0 for all a ∈ A
3: for i = 1 to B − 1 do
4:   if ⌊q^i⌋ · |A| > r then
5:     break
6:   c_i ← c_{i−1} + ⌊q^i⌋
7:   Pull every arm a ∈ A for ⌊q^i⌋ times
8:   for a ∈ A do
9:     µ̂_a ← ReprMeanEstimation(δ = 1/(2KTB), τ = √(log(2KTB)/c_i), ρ′ = ρ/(KB))  ▷ Proposition 2
10:  r ← r − |A| · ⌊q^i⌋
11:  for a ∈ A do
12:    if µ̂_a < max_{a′∈A} µ̂_{a′} − 2τ then
13:      Remove a from A
14: In the last batch play the arm from A with the smallest index

Theorem 3. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 1) for the stochastic bandit problem with K arms and gaps (Δ_j)_{j∈[K]} whose expected regret is

E[R_T] ≤ C · (K² log²(T)/ρ²) · Σ_{j:Δ_j>0} (Δ_j + log(KT log(T))/Δ_j),

where C > 0 is an absolute numerical constant, and its running time is polynomial in K, T and 1/ρ.
This is because, due to a union bound over the batches and the arms, we need to call the replicable mean estimator with parameter ρ/(K log(T)). In the next section, we show how to get rid of the extra log²(T) factor by designing a new algorithm.
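For intuition, the geometric batch schedule used throughout this approach (B = log(T) batches, batch i of length roughly q^i with q = T^(1/B)) can be sketched as follows; the helper name is ours:

```python
from math import log

def batch_schedule(T):
    """Geometric batch lengths ~ q^i with q = T^(1/B) and B = log(T) batches;
    any remaining rounds are absorbed by the final commit phase."""
    B = max(2, int(log(T)))
    q = T ** (1.0 / B)
    return [int(q ** i) for i in range(1, B)]

sizes = batch_schedule(10**4)   # lengths grow geometrically, summing to at most T
```

With only B = log(T) elimination decisions, the union bound for replicability is taken over K·B events instead of K·T, which is what makes the mean-estimation blow-up affordable.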

4. IMPROVED ALGORITHMS FOR REPLICABLE STOCHASTIC BANDITS

While the previous result provides a non-trivial regret bound, it is not optimal with respect to the time horizon T. In this section, we show how to improve it by designing a new algorithm, presented in Algorithm 2, which satisfies the guarantees of Theorem 4 and, essentially, decreases the dependence on the time horizon T from log³(T) to log(T). Our main result for replicable stochastic multi-armed bandits with K arms follows.

Theorem 4. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 2) for the stochastic bandit problem with K arms and gaps (Δ_j)_{j∈[K]} whose expected regret is

E[R_T] ≤ C · (K²/ρ²) · Σ_{j:Δ_j>0} (Δ_j + log(KT log(T))/Δ_j),

where C > 0 is an absolute numerical constant, and its running time is polynomial in K, T and 1/ρ.

Note that, compared to the non-replicable setting, we incur an extra factor of K²/ρ² in the regret. The proof can be found in Appendix B. Let us now describe how Algorithm 2 works. We decompose the time horizon into B = log(T) batches. Without the replicability constraint, one could draw q^i samples in batch i from each arm and estimate the mean reward. With the replicability constraint, we have to boost this: in each batch i, we pull each active arm O(βq^i) times, where q = T^(1/B) and β = O(K²/ρ²) is the replicability blow-up.
Using these samples, we compute the empirical mean µ̂^(i)_a of every active arm a.

Algorithm 2 Replicable Algorithm for Stochastic Multi-Armed Bandits (Theorem 4)
1: Input: time horizon T, number of arms K, replicability ρ
2: Initialization: B ← log(T), q ← T^(1/B), c̃_0 ← 0, A_0 ← [K], r ← T, µ̂_a ← 0 for all a ∈ A_0
3: β ← ⌊max{K²/ρ², 2304}⌋
4: for i = 1 to B − 1 do
5:   if β⌊q^i⌋ · |A_i| > r then
6:     break
7:   A_i ← A_{i−1}
8:   for a ∈ A_i do
9:     Pull arm a for β⌊q^i⌋ times
10:    Compute the empirical mean µ̂^(i)_a
11:  c̃_i ← c̃_{i−1} + ⌊q^i⌋
12:  c_i ← βc̃_i
13:  Ũ_i ← √(2 ln(2KTB)/c̃_i)
14:  U_i ← √(2 ln(2KTB)/c_i)
15:  Û_i ← Uni[Ũ_i/2, Ũ_i]
16:  r ← r − β · |A_i| · ⌊q^i⌋
17:  for a ∈ A_i do
18:    if µ̂^(i)_a + U_i < max_{a′∈A_i} µ̂^(i)_{a′} − Û_i then
19:      Remove a from A_i
20: In the last batch play the arm from A_{B−1} with the smallest index

Note that U_i in Algorithm 2 corresponds to the size of the actual confidence interval of the estimation, and Ũ_i corresponds to the confidence interval of an algorithm that does not use the β-blow-up in the number of samples. The novelty of our approach comes from the choice of the interval around the mean of the maximum arm: we pick a threshold Û_i uniformly at random from an interval of size Ũ_i/2 below the maximum mean. Then, the algorithm checks whether µ̂^(i)_a + U_i < max_{a′∈A_i} µ̂^(i)_{a′} − Û_i, where the maximum runs over the active arms a′ in batch i, and eliminates arms accordingly. To prove the result, we show that there are three regions that some arm j can be in relative to the confidence interval of the best arm in batch i (cf. Appendix B). If it lies in two of these regions, then the decision of whether to keep it or discard it is the same in both executions of the algorithm. However, if it is in the third region, the decision could differ between parallel executions, and since it relies on some external and unknown randomness, it is not clear how to reason about it.
To overcome this issue, we use the random threshold to argue about the probability that the decision between two executions differs. The crucial observation that allows us to get rid of the extra log²(T) factor is that there are correlations between consecutive batches: we prove that if some arm j lies in this "bad" region in some batch i, then it will be outside this region after a constant number of batches.
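The batched elimination scheme with the β blow-up and the shared random threshold can be sketched as follows. This is a simplified, self-contained illustration in the spirit of Algorithm 2, not a faithful implementation: the Bernoulli rewards, the small horizon, and the commit rule are illustrative choices of ours.

```python
import numpy as np
from math import ceil, log, sqrt

def replicable_batched_mab(means, T, rho, shared_seed, reward_seed):
    """Sketch of the batched elimination scheme: pull each active arm beta*q^i
    times per batch and eliminate using a random threshold U_hat drawn from the
    shared internal randomness (cf. Algorithm 2; constants are illustrative)."""
    K = len(means)
    xi = np.random.default_rng(shared_seed)       # shared internal randomness
    rewards = np.random.default_rng(reward_seed)  # independent across executions
    B = max(2, int(log(T)))
    q = T ** (1.0 / B)
    beta = ceil(max(K**2 / rho**2, 2304))         # replicability blow-up
    active = list(range(K))
    sums, cnts = np.zeros(K), np.zeros(K)
    c_tilde, spent, pulls = 0, 0, []
    for i in range(1, B):
        n_i = int(q ** i)
        if beta * n_i * len(active) > T - spent:
            break
        c_tilde += n_i
        for a in active:
            rs = rewards.binomial(1, means[a], size=beta * n_i)
            sums[a] += rs.sum()
            cnts[a] += rs.size
            pulls += [a] * (beta * n_i)
        spent += beta * n_i * len(active)
        U_tilde = sqrt(2 * log(2 * K * T * B) / c_tilde)     # no blow-up
        U = sqrt(2 * log(2 * K * T * B) / (beta * c_tilde))  # actual interval
        U_hat = xi.uniform(U_tilde / 2, U_tilde)             # shared random threshold
        mu = sums / np.maximum(cnts, 1)
        best = max(mu[a] for a in active)
        active = [a for a in active if mu[a] + U >= best - U_hat]
    pulls += [active[0]] * (T - spent)            # commit in the last batch
    return pulls

pulls1 = replicable_batched_mab([0.2, 0.8], 10_000, rho=0.5, shared_seed=7, reward_seed=1)
pulls2 = replicable_batched_mab([0.2, 0.8], 10_000, rho=0.5, shared_seed=7, reward_seed=2)
assert pulls1 == pulls2   # identical arm sequences under independent rewards
```

At this toy horizon the confidence widths are loose, so both executions keep all arms and commit identically; the point of the demonstration is that the two arm sequences coincide exactly even though the reward realizations differ.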

5. REPLICABLE STOCHASTIC LINEAR BANDITS

We now investigate replicability in the more general setting of stochastic linear bandits. In this setting, each arm is a vector a ∈ R^d belonging to some action set A ⊆ R^d, and there is a parameter θ⋆ ∈ R^d unknown to the player. In round t, the player chooses some action a_t ∈ A and receives a reward r_t = ⟨θ⋆, a_t⟩ + η_t, where η_t is a zero-mean 1-subgaussian random variable independent of any other source of randomness; that is, E[η_t] = 0 and E[exp(λη_t)] ≤ exp(λ²/2) for any λ ∈ R. For normalization purposes, it is standard to assume that ∥θ⋆∥_2 ≤ 1 and sup_{a∈A} ∥a∥_2 ≤ 1. In the linear setting, the expected regret after T pulls a_1, ..., a_T can be written as

E[R_T] = T · sup_{a∈A} ⟨θ⋆, a⟩ − E[ Σ_{t=1}^T ⟨θ⋆, a_t⟩ ].

In Section 5.1 we provide results for the finite action space case, i.e., when |A| = K. Next, in Section 5.2, we study replicable linear bandit algorithms for infinite action spaces. In the following, we work in the regime where T ≫ d. We underline that our approach leverages connections of stochastic linear bandits with G-optimal experiment design, core set constructions, and least-squares estimators. Roughly speaking, the goal of G-optimal design is to find a (small) subset of arms A′, which is called the core set, and define a distribution π over them with the following property: for any ε > 0, δ > 0, pulling only these arms for an appropriate number of times and computing the least-squares estimate θ̂ guarantees that sup_{a∈A} ⟨a, θ⋆ − θ̂⟩ ≤ ε with probability 1 − δ. For an extensive discussion, we refer to Chapters 21 and 22 of Lattimore & Szepesvári (2020).
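To fix notation, here is a minimal simulation of this reward model and of the regret expression above (our own sketch; the choice of θ⋆ and the arms is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 1000
theta_star = np.array([0.6, 0.3, 0.1])
theta_star /= np.linalg.norm(theta_star)      # enforce ||theta*||_2 <= 1
arms = [np.eye(d)[i] for i in range(d)]       # unit-norm actions, ||a||_2 <= 1

def pull(a):
    """Reward r_t = <theta*, a_t> + eta_t with standard normal (1-subgaussian) noise."""
    return float(theta_star @ a) + rng.normal()

# Expected regret of the fixed policy that always plays arms[1]:
best = max(float(theta_star @ a) for a in arms)
regret = T * best - T * float(theta_star @ arms[1])
```

A fixed suboptimal policy accumulates regret linear in T, which is why the algorithms below must estimate θ⋆ while staying replicable.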

5.1. FINITE ACTION SET

We first introduce a lemma that allows us to reduce the size of the action set that our algorithm has to search over.

Lemma 5 (See Chapters 21 and 22 in Lattimore & Szepesvári (2020)). For any finite action set A that spans R^d and any δ, ε > 0, there exists an algorithm that, in time polynomial in d, computes a multi-set of Θ(d log(1/δ)/ε² + d log log d) actions (possibly with repetitions) such that (i) they span R^d and (ii) if we perform these actions in a batched stochastic d-dimensional linear bandit setting with true parameter θ⋆ ∈ R^d and let θ̂ be the least-squares estimate for θ⋆, then, for any a ∈ A, with probability at least 1 − δ, we have ⟨a, θ⋆ − θ̂⟩ ≤ ε.

Essentially, the multi-set in Lemma 5 is obtained using an approximate G-optimal design algorithm. Thus, it is crucial to check whether this can be done in a replicable manner. Recall that the above set of distinct actions is called the core set and is the solution of an (approximate) G-optimal design problem. To be more specific, consider a distribution π : A → [0, 1] and define V(π) = Σ_{a∈A} π(a) a a^⊤ ∈ R^{d×d} and g(π) = sup_{a∈A} ∥a∥²_{V(π)^{-1}}. The distribution π is called a design, and the goal of G-optimal design is to find a design that minimizes g. Since the number of actions is finite, this reduces to an optimization problem that can be solved efficiently using standard optimization methods (e.g., the Frank-Wolfe method). Since the initialization is the same, the algorithm that finds the optimal (or an approximately optimal) design is replicable under the assumption that the gradients and the projections do not have numerical errors. This perspective is orthogonal to the work of Ahn et al. (2022), which defines reproducibility from a different viewpoint.
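A minimal Frank-Wolfe sketch for approximate G-optimal design over a finite action set follows (our own illustration; the step size is the classical Fedorov-Wynn line search, and this is not the paper's exact routine). Being deterministic given its input, it returns the same design in both executions, as discussed above:

```python
import numpy as np

def g_optimal_design(A, iters=300):
    """Frank-Wolfe for approximate G-optimal design over the rows of A (K x d).
    Deterministic given A, hence trivially replicable across executions."""
    K, d = A.shape
    pi = np.full(K, 1.0 / K)                       # uniform initialization
    for _ in range(iters):
        V = A.T @ (A * pi[:, None])                # V(pi) = sum_a pi(a) a a^T
        g = np.einsum('kd,de,ke->k', A, np.linalg.inv(V), A)  # ||a||^2_{V(pi)^-1}
        k = int(np.argmax(g))
        if g[k] <= d + 1e-9:                       # Kiefer-Wolfowitz optimality
            break
        gamma = (g[k] / d - 1) / (g[k] - 1)        # exact line-search step
        pi *= (1 - gamma)
        pi[k] += gamma
    return pi

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.5, -0.5], [0.9, 0.1]])
pi = g_optimal_design(A)
```

By the Kiefer-Wolfowitz theorem, the optimal design attains g(π) = d; the sketch stops once the maximum leverage is within numerical tolerance of that bound.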
Algorithm 3 Replicable Algorithm for Stochastic Linear Bandits (Theorem 6)
1: Input: number of arms K, time horizon T, replicability ρ
2: Initialization: B ← log(T), q ← (T/c)^(1/B), A ← [K], r ← T
3: β ← ⌊max{K²/ρ², 2304}⌋
4: for i = 1 to B − 1 do
5:   ε_i ← √(d log(KT²)/(βq^i))
6:   ε̃_i ← √(d log(KT²)/q^i)
7:   n_i ← 10 d log(KT²)/ε_i²
8:   a_1, ..., a_{n_i} ← multi-set given by Lemma 5 with parameters δ = 1/(KT²) and ε = ε_i
9:   if n_i > r then
10:    break
11:  Pull the arms a_1, ..., a_{n_i}
12:  Compute the least-squares estimate θ̂_i
13:  ε̂_i ← Uni[ε̃_i/2, ε̃_i]
14:  r ← r − n_i
15:  for a ∈ A do
16:    if ⟨a, θ̂_i⟩ + ε_i < max_{a′∈A} ⟨a′, θ̂_i⟩ − ε̂_i then
17:      Remove a from A
18: In the last batch play argmax_{a∈A} ⟨a, θ̂_{B−1}⟩

In our batched bandit algorithm (Algorithm 3), the multi-set of arms a_1, ..., a_{n_i} computed in each batch is obtained via a deterministic algorithm with runtime poly(K, d), where |A| = K. Hence, the multi-set will be the same in two different executions of the algorithm. On the other hand, the least-squares estimate will not be, since it depends on the stochastic rewards. We apply the techniques that we developed in the replicable stochastic MAB setting in order to design our algorithm. Our main result for replicable d-dimensional stochastic linear bandits with K arms follows. For the proof, we refer to Appendix C.

Theorem 6. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm for the stochastic d-dimensional linear bandit problem with K arms whose expected regret is

E[R_T] ≤ C · (K²/ρ²) · √(dT log(KT)),

where C > 0 is an absolute numerical constant, and its running time is polynomial in d, K, T and 1/ρ.

Note that the best known non-replicable algorithm achieves an upper bound of O(√(dT log(K))) and, hence, our algorithm incurs a replicability overhead of order K²/ρ². The intuition behind the proof is similar to the multi-armed bandit setting in Section 4.

5.2. INFINITE ACTION SET

Let us proceed to the setting where the action set A is infinite. Unfortunately, even when d = 1, we cannot directly get an algorithm with satisfactory regret guarantees by discretizing the space and using Algorithm 3. The approach of Esfandiari et al. (2021) is to discretize the action space and use a 1/T-net to cover it, i.e., a set A′ ⊆ A such that for all a ∈ A there exists some a′ ∈ A′ with ∥a − a′∥_2 ≤ 1/T. It is known that there exists such a net of size at most (3T)^d (Vershynin, 2018, Corollary 4.2.13). Then, they apply the algorithm for the finite-arm setting, increasing their regret guarantee by a factor of √d. However, our replicable algorithm for this setting contains an additional factor of K² in the regret bound. Thus, even when d = 1, our regret guarantee is greater than T, so the bound is vacuous. One way to fix this issue and get a sublinear regret guarantee is to use a smaller net. We use a T^{−1/(4d+2)}-net, which has size at most (3T)^{d/(4d+2)}, and this yields an expected regret of order O(T^{(4d+1)/(4d+2)} √(d log(T))/ρ²). For further details, we refer to Appendix D.

Even though the regret guarantee we managed to get using the smaller net of Appendix D is sublinear in T, it is not a satisfactory bound. The next step is to provide an algorithm for the infinite-action setting using a replicable LSE subroutine combined with the batching approach of Esfandiari et al. (2021). We will make use of the next lemma.

Lemma 7 (Section 21.2, Note 3 of Lattimore & Szepesvári (2020)). There exists a deterministic algorithm that, given an action space A ⊆ R^d, computes a 2-approximate G-optimal design π with a core set of size O(d log log(d)).

We additionally prove the next useful lemma, which, essentially, states that we can assume without loss of generality that every arm in the support of π has mass at least Ω(1/(d log(d))). We refer to Appendix F.1 for the proof.

Lemma 8 (Effective Support).
Let π be the distribution that corresponds to the 2-approximate G-optimal design of Lemma 7 with input A. Assume that π(a) ≤ c/(d log(d)) for some arm a in the core set, where c > 0 is some absolute numerical constant. Then, we can construct a distribution π̃ such that, for any arm a in the core set, π̃(a) ≥ C/(d log(d)), where C > 0 is an absolute constant, and it holds that sup_{a′∈A} ∥a′∥²_{V(π̃)^{-1}} ≤ 4d.

The upcoming lemma gives a replicable algorithm for the least-squares estimator and, essentially, builds upon Lemma 7 and Lemma 8. Its proof can be found in Appendix F.2.

Lemma 9 (Replicable LSE). Let ρ, ε ∈ (0, 1] and 0 < δ ≤ min{ρ, 1/d}.

Algorithm 4 Replicable Algorithm for Stochastic Linear Bandits with an Infinite Action Set (Theorem 10)
1: Input: time horizon T, replicability ρ
2: A′ ← 1/T-net of A
3: Initialization: r ← T, B ← log(T), q ← (T/c)^(1/B)
4: for i = 1 to B − 1 do
5:   ▷ q^i denotes the number of pulls of all arms before the replicability blow-up
6:   ε_i ← c · √(d log(T)/q^i)
7:   ▷ The blow-up is M_i = q^i · d³ log(d)
8:   if ⌈M_i⌉ > r then
9:     break
10:  C_i ← core set of the 2-approximate G-optimal design of Lemma 7 for A′ (adjusted as in Lemma 8)
11:  Pull each arm a_j ∈ C_i for N_i = ⌈M_i⌉/|C_i| times, observing rewards r^(j)_1, ..., r^(j)_{N_i}
12:  S_i ← {(a_j, r^(j)_t) : t ∈ [N_i], j ∈ [|C_i|]}
13:  θ̂_i ← ReplicableLSE(S_i, ρ′ = ρ/(dB), δ = 1/(2|A′|T²), τ = min{ε_i, 1})
14:  r ← r − ⌈M_i⌉
15:  for a ∈ A′ do
16:    if ⟨a, θ̂_i⟩ < max_{a′∈A′} ⟨a′, θ̂_i⟩ − 2ε_i then
17:      Remove a from A′
18: In the last batch play argmax_{a∈A′} ⟨a, θ̂_{B−1}⟩
19:
20: procedure ReplicableLSE(S, ρ, δ, τ)
21: for a ∈ C do
22:   v(a) ← ReplicableSQ(ϕ : x ∈ R → x, S, ρ, δ, τ)  ▷ Impagliazzo et al. (2022)
23: return (Σ_{j∈[|S|]} a_j a_j^⊤)^{-1} · (Σ_{a∈C} a · n_a · v(a))

The main result for the infinite-action case, obtained by Algorithm 4, follows. Its proof can be found in Appendix E.

Theorem 10. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (Algorithm 4) for the stochastic d-dimensional linear bandit problem with an infinite action set whose expected regret is

E[R_T] ≤ C · (d⁴ log(d) log² log(d) log log log(d)/ρ²) · √T log^{3/2}(T),

where C > 0 is an absolute numerical constant, and its running time is polynomial in T^d and 1/ρ. Our algorithm for the infinite-arm linear bandit case enjoys an expected regret of order O(poly(d)√T).
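The data flow of the ReplicableLSE subroutine (lines 20-23) can be sketched as follows. We substitute the same random-rounding trick from Section 2.2 for the ReplicableSQ call of Impagliazzo et al. (2022), so this illustrates the structure of the estimator, not their exact statistical-query routine:

```python
import numpy as np

def replicable_lse(core_arms, counts, rewards_by_arm, tau, shared_seed):
    """Sketch of ReplicableLSE: replicably estimate the mean reward v(a) of each
    core-set arm (random rounding stands in for the ReplicableSQ routine), then
    return (sum_j a_j a_j^T)^{-1} (sum_a n_a v(a) a)."""
    xi = np.random.default_rng(shared_seed)
    offset = xi.uniform(0, 2 * tau)                       # shared randomness
    v = [offset + 2 * tau * round((float(np.mean(rs)) - offset) / (2 * tau))
         for rs in rewards_by_arm]
    G = sum(n * np.outer(a, a) for a, n in zip(core_arms, counts))
    b = sum(n * va * a for a, va, n in zip(core_arms, v, counts))
    return np.linalg.solve(G, b)

# Noiseless illustration: core set {e_1, e_2}, theta* = (0.3, 0.2).
core = np.array([[1.0, 0.0], [0.0, 1.0]])
theta_hat = replicable_lse(core, [10, 10],
                           [np.full(10, 0.3), np.full(10, 0.2)],
                           tau=0.01, shared_seed=5)
```

Because each per-arm estimate is snapped to the shared random grid, two executions agree on every v(a) with high probability, and then the deterministic linear solve yields the exact same θ̂ in both runs.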
We underline that the dependence of the regret on the time horizon is (almost) optimal, and we incur an extra d³ factor in the regret guarantee compared to the non-replicable algorithm of Esfandiari et al. (2021). We now comment on the time complexity of our algorithm. Remark 11. The current implementation of our algorithm requires time exponential in d. However, for a general convex set A, given access to a separation oracle for it and an oracle that computes an (approximate) G-optimal design, we can execute it in polynomial time and with polynomially many calls to the oracles. Notably, when A is a polytope, such oracles exist. We underline that computational complexity issues also arise in the traditional setting of linear bandits with an infinite number of arms, and the computational overhead added by the replicability requirement is minimal. For further details, we refer to Appendix G.

6. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we have provided a formal notion of reproducibility/replicability for stochastic bandits, and we have developed algorithms for the multi-armed bandit and the linear bandit settings that satisfy this notion and incur only a small regret overhead compared to their non-replicable counterparts. We hope and believe that our paper will inspire future work on replicable algorithms for more complicated interactive learning settings, such as reinforcement learning. We also provide an experimental evaluation in Appendix H.

A THE PROOF OF THEOREM 3

Theorem. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 1) for the stochastic bandit problem with K arms and gaps (Δ_j)_{j∈[K]} whose expected regret is

E[R_T] ≤ C · (K² log²(T)/ρ²) · Σ_{j:Δ_j>0} (Δ_j + log(2KT log(T))/Δ_j),

where C > 0 is an absolute numerical constant, and its running time is polynomial in K, T and 1/ρ.

Proof. First, we claim that the algorithm is ρ-replicable: since the elimination decisions are taken in the same iterates and are based solely on the mean estimations, the replicability of the algorithm of Proposition 2 implies the replicability of the whole algorithm. In particular,

Pr[(a_1, ..., a_T) ≠ (a′_1, ..., a′_T)] = Pr[∃i ∈ [B], ∃j ∈ [K] : µ̂^(i)_j was not replicable] ≤ ρ.

During each batch i, we draw ⌊q^i⌋ fresh samples for every active arm, for a total of c_i samples, and use the replicable mean estimation algorithm to estimate its mean. For an active arm, at the end of some batch i ∈ [B], we say that its estimation is "correct" if the estimate of its mean is within √(log(2KTB)/c_i) of the true mean. Using Proposition 2, the estimation of any active arm at the end of any batch (except possibly the last batch) is correct with probability at least 1 − 1/(2KTB) and so, by the union bound, the probability that the estimation is incorrect for some arm at the end of some batch is bounded by 1/T. We remark that when δ < ρ, the sample complexity of Proposition 2 reduces to O(log(1/δ)/(τ²ρ²)). Let E denote the event that our estimates are correct. The total expected regret can be bounded as E[R_T] ≤ T · 1/T + E[R_T | E]. It suffices to bound the second term of the RHS, and hence we can assume that each gap is correctly estimated within an additive factor of √(log(2KTB)/c_i) after batch i. First, due to the elimination condition, we get that the best arm is never eliminated. Next, we have that E[R_T | E] = Σ_{j:Δ_j>0} Δ_j E[T_j | E], where T_j is the total number of pulls of arm j.
Fix a sub-optimal arm j and assume that i + 1 was the last batch in which it was active. Since this arm is not eliminated at the end of batch i, and the estimations are correct, we have that Δ_j ≤ √(log(2KTB)/c_i), and so c_i ≤ log(2KTB)/Δ_j². Hence, the number of pulls needed to get the desired bound due to Proposition 2 is (since we need to pull an arm c_i/ρ_1² times in order to get an estimate at distance √(log(1/δ)/c_i) with probability 1 − δ in a ρ_1-replicable manner when δ < ρ_1)

T_j ≤ c_{i+1}/ρ_1² = (q/ρ_1²)(1 + c_i) ≤ (q/ρ_1²) · (1 + log(2KTB)/Δ_j²).

This implies that the total regret is bounded by

E[R_T] ≤ 1 + (q/ρ_1²) · Σ_{j:Δ_j>0} (Δ_j + log(2KTB)/Δ_j).

We finally set q = T^(1/B) and B = log(T). Moreover, we have that ρ_1 = ρ/(KB). These yield

E[R_T] ≤ (K² log²(T)/ρ²) · Σ_{j:Δ_j>0} (Δ_j + log(2KT log(T))/Δ_j).

This completes the proof.

and since δ ≤ ρ it suffices to pick β = 768K²B²/ρ². We now focus on the regret of our algorithm. Let us set δ = 1/(KTB). Fix a sub-optimal arm j and assume that batch i + 1 was the last batch in which it was active. We obtain that the total number of pulls of this arm is

T_j ≤ c_{i+1} ≤ βq(1 + c̃_i) ≤ βq(1 + 8 log(1/δ)/Δ_j²).

From the replicability analysis, it suffices to take β of order K² log²(T)/ρ², and so

E[R_T] ≤ T · 1/T + E[R_T | E] = 1 + Σ_{j:Δ_j>0} Δ_j E[T_j | E] ≤ C · (K² log²(T)/ρ²) · Σ_{j:Δ_j>0} (Δ_j + log(KT log(T))/Δ_j),

for some absolute constant C > 0. Notice that the above analysis, which uses a naive union bound, does not yield the desired regret bound. We next provide a tighter analysis of the same algorithm that achieves the regret bound of Theorem 4.

Improved Analysis (The Proof of Theorem 4). In the previous analysis, we used a union bound over all arms and all batches in order to control the probability of the bad event. However, we can obtain an improved regret bound as follows. Fix a sub-optimal arm j ∈ [K] and let t be the first batch in which it appears in the bad event.
We claim that after a constant number of additional batches, this arm will be eliminated. This will shave the O(log²(T)) factor from the regret bound. Essentially, as indicated in the previous proof, the bad event corresponds to the case where the randomness of the cut-off threshold U can influence the decision of whether the algorithm eliminates an arm or not. The intuition is that, given that the two intervals intersected at batch t, the probability that they intersect again is quite small, since the interval of the optimal mean is moving upwards, the interval of the sub-optimal mean is concentrating, and the two estimates have moved by at most a constant multiple of the interval's length. In what follows, let Ū_i = U_i/√β denote the confidence radius corresponding to the inflated sample size. Since the bad event occurs at batch t, we know that μ̂_j^{(t)} ∈ [μ̂_{t⋆}^{(t)} − U_t − 5Ū_t, μ̂_{t⋆}^{(t)} − U_t/2 + 3Ū_t]. In the above, μ̂_{t⋆}^{(t)} is the estimate of the optimal mean at batch t, whose index is denoted by t⋆. Now assume that the bad event for arm j also occurs at batch t + k. Then, we have that μ̂_j^{(t+k)} ∈ [μ̂_{(t+k)⋆}^{(t+k)} − U_{t+k} − 5Ū_{t+k}, μ̂_{(t+k)⋆}^{(t+k)} − U_{t+k}/2 + 3Ū_{t+k}]. First, notice that since the concentration inequality under event E holds for batches t and t + k, we have that μ̂_j^{(t+k)} ≤ μ̂_j^{(t)} + Ū_t + Ū_{t+k}. Thus, combining it with the above inequalities gives us μ̂_{(t+k)⋆}^{(t+k)} − U_{t+k} − 5Ū_{t+k} ≤ μ̂_j^{(t+k)} ≤ μ̂_j^{(t)} + Ū_t + Ū_{t+k} ≤ μ̂_{t⋆}^{(t)} − U_t/2 + 4Ū_t + Ū_{t+k}. We now compare μ̂_{t⋆}^{(t)} and μ̂_{(t+k)⋆}^{(t+k)}. Let o denote the optimal arm. We have that μ̂_{(t+k)⋆}^{(t+k)} ≥ μ̂_o^{(t+k)} ≥ μ_o − Ū_{t+k} ≥ μ_{t⋆} − Ū_{t+k} ≥ μ̂_{t⋆}^{(t)} − Ū_t − Ū_{t+k}. This gives us that μ̂_{t⋆}^{(t)} − U_{t+k} − 6Ū_{t+k} − Ū_t ≤ μ̂_{(t+k)⋆}^{(t+k)} − U_{t+k} − 5Ū_{t+k}. Thus, we have established that μ̂_{t⋆}^{(t)} − U_{t+k} − 6Ū_{t+k} − Ū_t ≤ μ̂_{t⋆}^{(t)} − U_t/2 + 4Ū_t + Ū_{t+k}, which implies U_{t+k} ≥ U_t/2 − 7Ū_{t+k} − 5Ū_t ≥ U_t/2 − 12Ū_t. Since β ≥ 2304, we get that 12Ū_t ≤ U_t/4. Thus, we get that U_{t+k} ≥ U_t/4.
Notice that U_{t+k}/U_t = √(c_t/c_{t+k}), so it immediately follows that c_t/c_{t+k} ≥ 1/16, i.e., (q^{t+1} − 1)/(q^{t+k+1} − 1) ≥ 1/16. Hence 16(1 − 1/q^{t+1}) ≥ q^k − 1/q^{t+1}, which gives q^k ≤ 16 + 1/q^{t+1} ≤ 17, so k log q ≤ log 17 and thus k ≤ 5, when we pick B = log(T) batches. Thus, for every arm the bad event can happen at most 6 times. By taking a union bound over the K arms, we see that the probability that our algorithm is not replicable is at most O(K√(1/β)), so picking β = Θ(K²/ρ²) suffices to get the result.
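The batched elimination scheme analyzed above can be sketched as follows. This is a schematic, not the paper's Algorithm 1: the `pull` interface, the Hoeffding radius, and the cutoff drawn uniformly from [U/2, U] are illustrative stand-ins for Proposition 2 and the randomized threshold, and the constants are not tuned.

```python
import math
import numpy as np

def replicable_batched_elimination(pull, K, T, shared_seed):
    """Schematic batched elimination with a shared randomized cutoff.

    `pull(j, n)` returns n rewards in [0, 1] from arm j (fresh randomness
    in every execution); the cutoff uses shared randomness only, so two
    executions whose mean estimates agree act identically.
    """
    shared = np.random.default_rng(shared_seed)   # shared across executions
    B = max(1, int(math.log(T)))                  # number of batches
    q = T ** (1.0 / B)                            # geometric batch growth
    active = list(range(K))
    sums = np.zeros(K)
    counts = np.zeros(K)
    t = 0
    for i in range(1, B + 1):
        n_i = max(1, int(q ** i))
        for j in active:
            if t >= T:
                return active
            take = min(n_i, T - t)
            sums[j] += float(np.sum(pull(j, take)))
            counts[j] += take
            t += take
        mu = sums / np.maximum(counts, 1.0)
        # Hoeffding-style radius for the arms that survived every batch
        U = math.sqrt(2.0 * math.log(2.0 * K * T * B) / counts[active[0]])
        cut = shared.uniform(0.5 * U, U)          # shared randomized cutoff
        best = max(mu[j] for j in active)
        active = [j for j in active if mu[j] >= best - 2.0 * cut]
    return active
```

Because the data-dependent quantities enter the elimination rule only through the mean estimates, while the cutoff comes from shared randomness, two executions diverge only when an estimate lands close to the (randomly placed) threshold, which is exactly the bad event bounded in the proof.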

C THE PROOF OF THEOREM 6

Theorem. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 3) for the stochastic d-dimensional linear bandit problem with K arms whose expected regret is E[R_T] ≤ C · (K²/ρ²) √(dT log(KT)), for some absolute numerical constant C > 0, and its running time is polynomial in d, K, T and 1/ρ. Proof. Let c, C be the numerical constants hidden in Lemma 5, i.e., the size of the multi-set is in the interval [cd log(1/δ)/ε², Cd log(1/δ)/ε²]. We know that the size of each batch satisfies n_i ∈ [cq^i, Cq^i] (see Lemma 5), so by the end of the (B − 1)-th batch we will have fewer than n_B pulls left. Hence, the number of batches is at most B. We first define the event E that the estimates of all arms after the end of each batch are accurate, i.e., for every active arm a at the beginning of the i-th batch, at the end of the batch we have that |⟨a, θ̂_i − θ⋆⟩| ≤ ε̄_i. Since δ = 1/(KT²) and there are at most T batches and K active arms in each batch, a simple union bound shows that E happens with probability at least 1 − 1/T. We condition on the event E throughout the rest of the proof. We now argue about the regret bound of our algorithm. We first show that any optimal arm a* will not get eliminated. Indeed, consider any sub-optimal arm a ∈ [K] and any batch i ∈ [B]. Under the event E we have that ⟨a, θ̂_i⟩ − ⟨a*, θ̂_i⟩ ≤ (⟨a, θ*⟩ + ε̄_i) − (⟨a*, θ*⟩ − ε̄_i) < 2ε̄_i < ε_i + ε̄_i. Next, we need to bound the number of times we pull some fixed sub-optimal arm a ∈ [K]. We let ∆ = ⟨a* − a, θ*⟩ denote the gap and we let i be the smallest integer such that ε_i < ∆/4. We claim that this arm will get eliminated by the end of batch i. Indeed, ⟨a*, θ̂_i⟩ − ⟨a, θ̂_i⟩ ≥ (⟨a*, θ*⟩ − ε̄_i) − (⟨a, θ*⟩ + ε̄_i) = ∆ − 2ε̄_i > 4ε_i − 2ε̄_i > ε_i + ε̄_i. This shows that during any batch i, all the active arms have gap at most 4ε_{i−1}.
Thus, the regret of the algorithm conditioned on the event E is at most Σ_{i=1}^B 4 n_i ε_{i−1} ≤ 4βC Σ_{i=1}^B q^i √(d log(KT²)/q^{i−1}) ≤ 6βCq √(d log(KT)) Σ_{i=0}^{B−1} q^{i/2} ≤ O(βq^{B/2+1} √(d log(KT))) = O((K²/ρ²) q^{B/2+1} √(d log(KT))) = O((K²/ρ²) q √(dT log(KT))). Thus, the overall regret is bounded by δ · T + (1 − δ) · O((K²/ρ²) q √(dT log(KT))) = O((K²/ρ²) q √(dT log(KT))). We now argue about the replicability of our algorithm. The analysis follows in a similar fashion as in Theorem 4. Let θ̂_i, θ̂′_i be the LSEs after the i-th batch under two different executions of the algorithm. We condition on the event E′ for the other execution as well. Assume that the set of active arms is the same under both executions at the beginning of batch i. Notice that since the multi-set that is guaranteed by Lemma 5 is computed by a deterministic algorithm, both executions will pull the same arms in batch i. Consider a sub-optimal arm a and let a_{i*} = arg max_{a∈A} ⟨θ̂_i, a⟩ and a′_{i*} = arg max_{a∈A} ⟨θ̂′_i, a⟩. Under the event E ∩ E′ we have that |⟨a, θ̂_i − θ̂′_i⟩| ≤ 2ε̄_i, |⟨a_{i*}, θ̂_i − θ̂′_i⟩| ≤ 2ε̄_i, and |⟨a′_{i*}, θ̂′_i⟩ − ⟨a_{i*}, θ̂_i⟩| ≤ 2ε̄_i. Notice that, since the randomness of the cutoff is shared, if ⟨a, θ̂_i⟩ + ε_i ≥ ⟨a_{i*}, θ̂_i⟩ − ε_i + 4ε̄_i, then the arm a will not be eliminated after the i-th batch in some other execution of the algorithm as well. Similarly, if ⟨a, θ̂_i⟩ + ε_i < ⟨a_{i*}, θ̂_i⟩ − ε_i − 4ε̄_i, then the arm a will get eliminated after the i-th batch in some other execution of the algorithm as well. In particular, this means that if ⟨a, θ̂_i⟩ − 2ε̄_i > ⟨a_{i*}, θ̂_i⟩ + ε̄_i − ε_i/2 then the arm a will not get eliminated in some other execution of the algorithm, and if ⟨a, θ̂_i⟩ + 5ε̄_i < ⟨a_{i*}, θ̂_i⟩ − ε_i then the arm a will also get eliminated in some other execution of the algorithm, with probability 1 under the event E ∩ E′.
Thus, it suffices to bound the probability that the decision about arm a will be different between the two executions when we are in neither of these cases. Then, the worst-case bound due to the mass of the uniform probability measure is 16√(d log(1/δ)/c̄_i) / √(d log(1/δ)/c_i). This implies that the probability mass of the bad event is at most 16√(c_i/c̄_i) = 16√(1/β). A naive union bound would require us to pick β = Θ(K² log²(T)/ρ²). We next show how to avoid the log²(T) factor. Fix a sub-optimal arm a ∈ [K] and let t be the first batch in which it appears in the bad event. Since the bad event occurs at batch t, we know that ⟨a, θ̂_t⟩ ∈ [⟨a_{t*}, θ̂_t⟩ − ε_t − 5ε̄_t, ⟨a_{t*}, θ̂_t⟩ − ε_t/2 + 3ε̄_t]. In the above, a_{t*} is the optimal arm at batch t w.r.t. the LSE. Now assume that the bad event for arm a also occurs at batch t + k. Then, we have that ⟨a, θ̂_{t+k}⟩ ∈ [⟨a_{(t+k)*}, θ̂_{t+k}⟩ − ε_{t+k} − 5ε̄_{t+k}, ⟨a_{(t+k)*}, θ̂_{t+k}⟩ − ε_{t+k}/2 + 3ε̄_{t+k}]. First, notice that since the concentration inequality under event E holds for batches t and t + k, we have that ⟨a, θ̂_{t+k}⟩ ≤ ⟨a, θ̂_t⟩ + ε̄_t + ε̄_{t+k}. Thus, combining it with the above inequalities gives us ⟨a_{(t+k)*}, θ̂_{t+k}⟩ − ε_{t+k} − 5ε̄_{t+k} ≤ ⟨a, θ̂_{t+k}⟩ ≤ ⟨a, θ̂_t⟩ + ε̄_t + ε̄_{t+k} ≤ ⟨a_{t*}, θ̂_t⟩ − ε_t/2 + 4ε̄_t + ε̄_{t+k}. We now compare ⟨a_{t*}, θ̂_t⟩ and ⟨a_{(t+k)*}, θ̂_{t+k}⟩. Let a* denote the optimal arm. We have that ⟨a_{(t+k)*}, θ̂_{t+k}⟩ ≥ ⟨a*, θ̂_{t+k}⟩ ≥ ⟨a*, θ*⟩ − ε̄_{t+k} ≥ ⟨a_{t*}, θ*⟩ − ε̄_{t+k} ≥ ⟨a_{t*}, θ̂_t⟩ − ε̄_{t+k} − ε̄_t. This gives us that ⟨a_{t*}, θ̂_t⟩ − ε_{t+k} − 6ε̄_{t+k} − ε̄_t ≤ ⟨a_{(t+k)*}, θ̂_{t+k}⟩ − ε_{t+k} − 5ε̄_{t+k}. Thus, we have established that ⟨a_{t*}, θ̂_t⟩ − ε_{t+k} − 6ε̄_{t+k} − ε̄_t ≤ ⟨a_{t*}, θ̂_t⟩ − ε_t/2 + 4ε̄_t + ε̄_{t+k}, which implies ε_{t+k} ≥ ε_t/2 − 7ε̄_{t+k} − 5ε̄_t ≥ ε_t/2 − 12ε̄_t. Since β ≥ 2304, we get that 12ε̄_t ≤ ε_t/4. Thus, we get that ε_{t+k} ≥ ε_t/4. Notice that ε_{t+k}/ε_t = √(q^t/q^{t+k}), so it immediately follows that q^t/q^{t+k} ≥ 1/16, i.e., q^k ≤ 16, hence k log q ≤ log 16 and k ≤ 4, when we pick B = log(T) batches.
Thus, for every arm the bad event can happen at most 5 times. By taking a union bound over the K arms, we see that the probability that our algorithm is not replicable is at most O(K√(1/β)), so picking β = Θ(K²/ρ²) suffices to get the result.

D NAIVE APPLICATION OF ALGORITHM 3 WITH INFINITE ACTION SPACE

We use a 1/T^{1/(4d+2)}-net that has size at most (3T)^{d/(4d+2)}. Let A′ be the new set of arms. We then run Algorithm 3 using A′. This gives us the following result, which is proved right after. Corollary 12. Let T ∈ N, ρ ∈ (0, 1]. There is a ρ-replicable algorithm for the stochastic d-dimensional linear bandit problem with infinitely many arms whose expected regret is at most E[R_T] ≤ C · (T^{(4d+1)/(4d+2)}/ρ²) √(d log(T)), where C > 0 is an absolute numerical constant. Proof. Since K ≤ (3T)^{d/(4d+2)}, we have that T sup_{a∈A′} ⟨a, θ*⟩ − E[Σ_{t=1}^T ⟨a_t, θ*⟩] ≤ O( ((3T)^{2d/(4d+2)}/ρ²) √(dT log(T (3T)^{d/(4d+2)})) ) = O( (T^{(4d+1)/(4d+2)}/ρ²) √(d log(T)) ). Comparing to the best arm in A, we have that T sup_{a∈A} ⟨a, θ*⟩ − E[Σ_{t=1}^T ⟨a_t, θ*⟩] = (T sup_{a∈A} ⟨a, θ*⟩ − T sup_{a∈A′} ⟨a, θ*⟩) + (T sup_{a∈A′} ⟨a, θ*⟩ − E[Σ_{t=1}^T ⟨a_t, θ*⟩]). Our choice of the 1/T^{1/(4d+2)}-net implies that for every a ∈ A there exists some a′ ∈ A′ such that ||a − a′||_2 ≤ 1/T^{1/(4d+2)}. Thus, sup_{a∈A} ⟨a, θ*⟩ − sup_{a′∈A′} ⟨a′, θ*⟩ ≤ ||a − a′||_2 ||θ*||_2 ≤ 1/T^{1/(4d+2)}. Thus, the total regret is at most T · 1/T^{1/(4d+2)} + O( (T^{(4d+1)/(4d+2)}/ρ²) √(d log(T)) ) = O( (T^{(4d+1)/(4d+2)}/ρ²) √(d log(T)) ).
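For intuition on the net-based reduction, the following sketch builds a crude eps-net of the unit Euclidean ball from a cubic grid; its size scales like (1/eps)^d, in line with the covering-number bound used above. The construction and the helper name are ours, for illustration only — in the proof the net is only assumed to exist.

```python
import numpy as np
from itertools import product

def grid_net(d, eps):
    """Crude eps-net of the unit Euclidean ball built from a cubic grid.

    Grid spacing 2*eps/sqrt(d) gives covering radius eps; we keep every
    grid point of norm at most 1 + eps so that the nearest grid point of
    any ball point is retained.  The size grows like (1/eps)^d.
    """
    h = 2.0 * eps / np.sqrt(d)
    ticks = np.arange(-1.0, 1.0 + h, h)
    return np.array([p for p in product(ticks, repeat=d)
                     if np.linalg.norm(p) <= 1.0 + eps])
```

Running Algorithm 3 over such a finite set, with eps = 1/T^{1/(4d+2)}, trades a T·eps discretization term against the finite-arm regret, which is how the exponent (4d+1)/(4d+2) arises.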

E THE PROOF OF THEOREM 10

Theorem. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 4) for the stochastic d-dimensional linear bandit problem with infinite action set whose expected regret is E[R_T] ≤ C · (d⁴ log(d) log² log(d) log log log(d)/ρ²) √T log^{3/2}(T), for some absolute numerical constant C > 0, and its running time is polynomial in T^d and 1/ρ.

/(ε̄_i² ρ′²) times each one of the arms of the i-th core set C_i, as indicated by Lemma 9, and computes the LSE θ̂_i in a replicable way using the algorithm of Lemma 9. Let E be the event that over all batches the estimations are correct. We pick δ = 1/(2|A′|T²) so that this good event holds with probability at least 1 − 1/T. Our goal is to control the expected regret, which can be written as E[R_T] = T sup_{a∈A} ⟨a, θ⋆⟩ − E[Σ_{t=1}^T ⟨a_t, θ⋆⟩]. We have that T sup_{a∈A} ⟨a, θ⋆⟩ − T sup_{a′∈A′} ⟨a′, θ⋆⟩ ≤ 1.

Note that we apply this transformation at most O(d log log d) times. Let π̃ be the distribution we end up with. We see that ||a′||²_{V(π̃)^{-1}} ≤ (1/(1 − x))^{cd log log d} ||a′||²_{V(π)^{-1}} ≤ 2(1/(1 − x))^{cd log log d} d. Notice that there is a constant c′ such that when x = c′/(d log d) we have that

Proof. The proof works when we can treat Ω(⌈d log(1/δ)π(a)/ε²⌉) as Ω(d log(1/δ)π(a)/ε²), i.e., as long as π(a) = Ω(ε²/(d log(1/δ))). In the regime we are in, this point is handled thanks to Lemma 8. Combining the following proof with Lemma 8, we can obtain the desired result. We underline that we work in the fixed design setting: the arms a_i are deterministically chosen, independently of the rewards r_i. Assume that the core set of Lemma 7 is the set C. Fix the multi-set S = {(a_i, r_i) : i ∈ [M]}, where each arm a lies in the core set and is pulled n_a = Θ(π(a)d log(d) log(|C|/δ)/ε²) times. Hence, we have that M = Σ_{a∈C} n_a = Θ(d log(d) log(|C|/δ)/ε²). Let also V = Σ_{i∈[M]} a_i a_i^⊤.
The least-squares estimator can be written as θ̂_LSE^{(ε)} = V^{-1} Σ_{i∈[M]} a_i r_i = V^{-1} Σ_{a∈C} a Σ_{i∈[n_a]} r_i(a), where each a lies in the core set (deterministically) and r_i(a) is the i-th reward generated independently by the linear regression process ⟨θ⋆, a⟩ + ξ, where ξ is a fresh zero-mean sub-gaussian random variable. Our goal is to replicably estimate the value Σ_{i∈[n_a]} r_i(a) for any a. This is sufficient since two independent executions of the algorithm share the set C and n_a for any a. Note that the above sum is a random variable. In the following, we condition on the high-probability event that the average reward of the arm a is ε-close to the expected one, i.e., to the value ⟨θ⋆, a⟩. This happens with probability at least 1 − δ/(2|C|), given Ω(π(a)d log(d) log(|C|/δ)/ε²) samples from arm a ∈ C. In order to guarantee replicability, we will apply a result from Impagliazzo et al. (2022). Since we will union bound over all arms in the core set and |C| = O(d log log(d)) (via Lemma 7), we will make use of a (ρ/|C|)-replicable algorithm that gives an estimate v(a) ∈ R such that |⟨θ⋆, a⟩ − v(a)| ≤ τ, with probability at least 1 − δ/(2|C|). For δ < ρ, the algorithm uses S_a = Ω(d² log(d/δ) log² log(d) log log log(d)/(ρ² τ²)) many samples from the linear regression with fixed arm a ∈ C. Since we have conditioned on the randomness of r_i(a) for any i, we get |(1/n_a) Σ_{i∈[n_a]} r_i(a) − v(a)| ≤ |(1/n_a) Σ_{i∈[n_a]} r_i(a) − ⟨θ*, a⟩| + |⟨θ*, a⟩ − v(a)| ≤ ε + τ, with probability at least 1 − δ/(2|C|). Hence, by repeating this approach for all arms in the core set, we set θ̂_SQ = V^{-1} Σ_{a∈C} a n_a v(a). Let us condition on the randomness of the estimate θ̂_LSE^{(ε)}. We have that sup_{a′∈A} |⟨a′, θ̂_SQ − θ⋆⟩| ≤ sup_{a′∈A} |⟨a′, θ̂_SQ − θ̂_LSE^{(ε)}⟩| + sup_{a′∈A} |⟨a′, θ̂_LSE^{(ε)} − θ⋆⟩|. Note that the second term is at most ε with probability at least 1 − δ via Lemma 5. Our next goal is to tune the accuracy τ ∈ (0, 1) so that the first term yields another ε error.
For the first term, we have that sup_{a′∈A} |⟨a′, θ̂_SQ − θ̂_LSE^{(ε)}⟩| ≤ sup_{a′∈A} ⟨a′, V^{-1} Σ_{a∈C} a n_a (ε + τ)⟩. Note that V = (Cd log(d) log(|C|/δ)/ε²) Σ_{a∈C} π(a)aa^⊤, and so V^{-1} = (ε²/(Cd log(d) log(|C|/δ))) V(π)^{-1}, for some absolute constant C > 0. This implies that sup_{a′∈A} |⟨a′, θ̂_SQ − θ̂_LSE^{(ε)}⟩| ≤ (ε + τ) sup_{a′∈A} ⟨a′, (ε²/(Cd log(d) log(|C|/δ))) V(π)^{-1} Σ_{a∈C} (Cd log(d) log(|C|/δ)π(a)/ε²) a⟩. Hence, we get that sup_{a′∈A} |⟨a′, θ̂_SQ − θ̂_LSE^{(ε)}⟩| ≤ (ε + τ) sup_{a′∈A} ⟨a′, V(π)^{-1} Σ_{a∈C} π(a)a⟩. Consider a fixed arm a′ ∈ A. Then, ⟨a′, V(π)^{-1} Σ_{a∈C} π(a)a⟩ ≤ Σ_{a∈C} π(a) |⟨a′, V(π)^{-1} a⟩| ≤ Σ_{a∈C} π(a) (1 + ⟨a′, V(π)^{-1} a⟩²) = 1 + Σ_{a∈C} π(a) ⟨a′, V(π)^{-1} a⟩² = 1 + ||a′||²_{V(π)^{-1}} ≤ 4d + 1, where the last inequality follows from the fact that π is a 4-approximation of the G-optimal design. Hence, in total, by picking τ = ε, we get that sup_{a′∈A} |⟨a′, θ̂_SQ − θ⋆⟩| ≤ 11dε. Thus, for any ε > 0, the total number of pulls of each arm is Ω(d⁴ log(d/δ) log² log(d) log log log(d)/(ρ² ε²)), in order to get sup_{a′∈A} |⟨a′, θ̂_SQ − θ⋆⟩| ≤ ε.
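The estimator assembled in the proof — replace each arm's reward sum by n_a times a scalar estimate v(a) and solve the fixed-design normal equations — can be sketched as follows. The helper `coreset_lse` and the plain `np.mean` plug-in in the usage below are illustrative; in the actual construction the replicable estimator of Impagliazzo et al. (2022) plays the role of `estimate_mean`.

```python
import numpy as np

def coreset_lse(arms, pulls, estimate_mean):
    """Fixed-design least squares over a core set.

    arms: (m, d) array of core-set arms; pulls[i]: rewards observed for
    arms[i]; estimate_mean: scalar mean estimator standing in for the
    replicable estimate v(a).  Each reward sum is replaced by n_a * v(a),
    mirroring the construction of theta_SQ above.
    """
    m, d = arms.shape
    V = np.zeros((d, d))
    rhs = np.zeros(d)
    for a, r in zip(arms, pulls):
        n = len(r)
        V += n * np.outer(a, a)            # V = sum_a n_a a a^T
        rhs += a * (n * estimate_mean(r))  # n_a * v(a) replaces the reward sum
    return np.linalg.solve(V, rhs)
```

Since two executions share the core set C and the counts n_a, the only data-dependent inputs are the scalars v(a), which is why making those scalars replicable makes the whole estimator replicable.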

G COMPUTATIONAL PERFORMANCE OF ALGORITHM 4

In this appendix, we discuss the barriers towards computational efficiency regarding Algorithm 4. The reasons why Algorithm 4 is computationally inefficient are the following: (a) we have to compute the arm in the set of active arms that has maximum correlation with the estimate θ̂_i, (b) we have to eliminate arms based on this value, and (c) we have to run at each batch the Frank-Wolfe algorithm (or some other optimization method needed for Lemma 5) in order to obtain an approximate G-optimal design. As a minimal assumption in what follows, we focus on the case where the action set A is convex and we have access to a separation oracle for it. Note that executing both (a) and (b) naively requires time exponential in d. However, on the one hand, arm elimination (issue (b)) reduces to finding the intersection of the current active set with a halfspace H whose normal vector is θ̂_i and whose threshold is, roughly speaking, the maximum correlation. This maximum correlation can also be computed efficiently. On the other hand, finding an arm with (almost) maximum correlation (issue (a)) relates to the problem of finding a point that maximizes a linear objective under the constraint that the point lies in the intersection of the active arm set with some linear constraints. Thus, we can use the ellipsoid algorithm to implement this step. The above discussion deals with issues (a) and (b) and, essentially, states that even with infinitely many actions, one could implement these steps efficiently. We now focus on issue (c). The Frank-Wolfe method first requires a proper initialization. As mentioned in Lattimore & Szepesvári (2020), if the starting point is chosen to be the uniform distribution over A′, then the number of iterations before getting a 2-approximate optimal design is roughly O(d). The issue is that since A′ is exponential in d, it is not clear how to work with such an initialization efficiently. Notably, there is a different initialization (see, e.g., Fedorov (2020)).
There are two issues: first, one requires an oracle to provide this good initialization. Second, each iteration of the Frank-Wolfe method (with current design guess π) requires computing a point in the current active set with maximum V(π)^{-1}-norm. As noted in Todd (2016), a good initialization for finding a G-optimal design, i.e., a minimum volume enclosing ellipsoid (MVEE), should be sufficiently sparse (compared to the number of active arms) and assign positive mass to arms that correspond to extreme points, i.e., points that are close to the boundary of the MVEE. The work of Kumar & Yildirim (2005) provides an initial core set that depends only on d and not on the number of points. The algorithm works as follows: it runs for d iterations and, in each iteration, it adds 2 arms to the core set. Initially, we set the core set C_0 = ∅ and let Ψ = {0}. In each iteration i ∈ [d], the algorithm draws a random direction v_i in the orthogonal complement of Ψ (this step is replicable thanks to the shared randomness) and computes the vectors in the active arms' set with the maximum and the minimum correlation with v_i, say a_i^+ and a_i^−. It then extends C_0 ← C_0 ∪ {a_i^+, a_i^−} and sets Ψ ← span(Ψ ∪ {a_i^+ − a_i^−}). Hence, the runtime of this algorithm corresponds to the runtime of the tasks max_{a∈A′} ⟨a, v_i⟩ and min_{a∈A′} ⟨a, v_i⟩. One can efficiently approximate these values using the ellipsoid algorithm and hence efficiently initialize the Frank-Wolfe algorithm as in Todd (2016) (e.g., set the weights uniformly to 1/(2d)). Our second challenge deals with finding a point in the active arm set with maximum V(π)^{-1}-norm for some current guess π. Even if the current active set is a polytope, finding an exact norm maximizer is NP-hard (Freund & Orlin, 1985; Mangasarian & Shiau, 1986). Hence, one should focus on efficient approximation algorithms. We note that even a poly(d)-approximate maximizer is sufficient to get O(poly(d) √T) regret.
Such an algorithm for polytopes, which achieves a 1/d²-approximation, is provided in Ye (1992); Vavasis (1993). As a general note, if we assume that we have access to an oracle O that computes a 2-approximate G-optimal design in time T_O, then our Algorithm 4 runs in time polynomial in T_O.
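Putting the discussion together, here is a sketch of the Frank-Wolfe iteration with a Kumar & Yildirim-style random-direction initialization. Both routines are simplified stand-ins (dense numpy over an explicitly enumerated arm matrix, with no ellipsoid-method subroutines), meant only to show the shape of the computation; the function names are ours.

```python
import numpy as np

def ky_init(arms, rng):
    """Kumar & Yildirim-style initial design: for up to d random directions
    (each drawn in the orthogonal complement of the previous difference
    vectors), keep the two extreme arms; weight the core set uniformly."""
    n, d = arms.shape
    core, basis = set(), np.zeros((0, d))
    for _ in range(d):
        v = rng.standard_normal(d)
        v = v - basis.T @ (basis @ v)          # project out previous directions
        nv = np.linalg.norm(v)
        if nv < 1e-12:
            break
        v = v / nv
        hi, lo = int(np.argmax(arms @ v)), int(np.argmin(arms @ v))
        core.update((hi, lo))
        w = arms[hi] - arms[lo]
        w = w - basis.T @ (basis @ w)          # Gram-Schmidt extension
        nw = np.linalg.norm(w)
        if nw > 1e-12:
            basis = np.vstack([basis, w / nw])
    pi = np.zeros(n)
    pi[list(core)] = 1.0 / len(core)
    return pi

def frank_wolfe_design(arms, iters=300, rng=None):
    """Frank-Wolfe for an approximate G-optimal design: move mass toward
    the arm of largest V(pi)^{-1}-norm (at the optimum this norm equals d),
    using the standard closed-form step size."""
    n, d = arms.shape
    pi = ky_init(arms, rng or np.random.default_rng(0))
    for _ in range(iters):
        V = arms.T @ (pi[:, None] * arms)
        norms = np.einsum('ij,jk,ik->i', arms, np.linalg.inv(V), arms)
        k = int(np.argmax(norms))
        g = norms[k]
        step = (g / d - 1.0) / (g - 1.0)       # closed-form line search
        pi = (1.0 - step) * pi
        pi[k] += step
    return pi
```

Note that the direction draws in `ky_init` would come from the shared randomness, which is what keeps the initialization replicable across executions.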

H EXPERIMENTAL EVALUATION

In this section, we provide an experimental evaluation of our proposed algorithms in the setting of multi-armed stochastic bandits. In particular, we compare the performance of our algorithm with the standard UCB approach. We first describe the experimental setup. Experimental Setup. We consider a multi-armed bandit setting with K = 6 arms that have Bernoulli rewards with biases (0.44, 0.47, 0.5, 0.53, 0.56, 0.59); we run the algorithms for T = 45000 iterations and we execute both algorithms 20 different times. For our algorithm, we set the replicability parameter to ρ = 0.3. We are interested in comparing the replicability and the regret of the two algorithms. We underline that our theoretical bound in this setting shows that the regret of our algorithm could be, roughly, at most 360 times worse than the regret of UCB. Observations. We first observe that our proposed bandit algorithm is indeed replicable, i.e., it satisfies our proposed reproducibility/replicability definition. In particular, for the 10 pairs of consecutive executions we consider, we observe that our algorithm pulls the exact same sequence of arms in all of them. On the other hand, the standard UCB algorithm is far from being replicable. To be more precise, between consecutive executions we see that the algorithm makes a different choice in between 26208 and 28002 of the 45000 rounds, which is more than half of the rounds! However, the regret of our approach is quite competitive compared to the standard UCB algorithm. In particular, the regret of the standard UCB algorithm has the expected polylog(T) shape, while the regret of our replicable variant incurs a multiplicative overhead that is, roughly, 2.5 times worse than UCB. This is depicted in Figure 1 (Cumulative Regret: UCB vs. Replicable Batched Algorithm), where we plot the average cumulative regret (across the 20 executions) of the two algorithms.
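The non-replicability of UCB reported above is easy to observe in simulation; the following sketch runs vanilla UCB1 twice on the same Bernoulli instance with independent reward draws and counts the rounds on which the two executions disagree (a smaller horizon than in our experiments, for speed).

```python
import numpy as np

def run_ucb(means, T, seed):
    """Vanilla UCB1 on Bernoulli arms; returns the sequence of pulled arms."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    pulls = []
    for t in range(T):
        if t < K:
            a = t                              # initialization: pull each arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            a = int(np.argmax(ucb))
        r = rng.binomial(1, means[a])          # fresh reward realization
        counts[a] += 1
        sums[a] += r
        pulls.append(a)
    return pulls

means = [0.44, 0.47, 0.5, 0.53, 0.56, 0.59]
p1 = run_ucb(means, 5000, seed=1)              # two independent executions
p2 = run_ucb(means, 5000, seed=2)
diff = sum(a != b for a, b in zip(p1, p2))     # rounds where the pulls differ
```

With the small gaps above, a substantial fraction of the rounds differ between the two executions, mirroring the 26208 to 28002 out of 45000 differing rounds observed for UCB in our experiments.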



We can handle the case of 0 < δ ≤ d by paying an extra log d factor in the sample complexity. Recall that π(a) ≥ c/(d log(d)), for some constant c > 0, so the previous expression is Ω(log(|C|/δ)/ε²). In fact, even finding a constant-factor approximation, for some appropriate constant, is NP-hard (Bellare & Rogaway, 1993).



) study irreproducibility in recommendation systems and propose the use of smooth activations (instead of ReLUs) to improve recommendation reproducibility. In general, the reproducibility crisis is reported in various scientific disciplines: Ioannidis (2005); McNutt (2014); Baker (2016b); Goodman et al. (2016); Lucic et al. (2018); Henderson et al. (2018). For more details, we refer to the report of the NeurIPS 2019 Reproducibility Program Pineau et al. (2021) and the ICLR 2019 Reproducibility Challenge Pineau et al. (2019). Bandit Algorithms. Stochastic multi-armed bandits in the general setting without structure have been studied extensively: Slivkins (2019); Lattimore & Szepesvári (2020); Bubeck et al. (2012b); Auer et al. (2002); Cesa-Bianchi & Fischer (1998); Kaufmann et al. (2012a); Audibert et al. (2010); Agrawal & Goyal (2012); Kaufmann et al. (2012b). In this setting, the optimal achievable regret is O(log(T) Σ_{i:∆_i>0} ∆_i^{-1}); this is achieved, e.g., by the upper confidence bound (UCB) algorithm of Auer et al. (2002). The setting of d-dimensional linear stochastic bandits is also well-explored (Dani et al. (2008); Abbasi-Yadkori et al. (2011)) under the well-specified linear reward model, achieving (near) optimal problem-independent regret of O(d√(T log(T))) (Lattimore & Szepesvári (2020)). Note that the best-known lower bound is Ω(d√T) (Dani et al. (2008)) and that the number of arms can, in principle, be unbounded. For a finite number of arms K, the best known upper bound is O(√(dT log(K))) (Bubeck et al. (2012a)). Our work focuses on the design of replicable bandit algorithms and we hence consider only stochastic environments. In general, there is also extensive work on adversarial bandits, and we refer the interested reader to Lattimore & Szepesvári (2020). Batched Bandits. While sequential bandit problems have been studied for almost a century, there is much interest in the batched setting too.
In many settings, such as medical trials, one has to take many actions in parallel and observe their rewards later. The works of Auer & Ortner (2010) and Cesa-Bianchi et al. (2013) provided sequential bandit algorithms that can easily work in the batched setting, while the works of Gao et al. (2019) and Esfandiari et al. (2021) focus exclusively on the batched setting. Our work on replicable bandits builds upon some of the techniques from these two lines of work.

Appendix A states that, by combining the tools from Impagliazzo et al. (2022) and Esfandiari et al. (2021), we can design a replicable bandit algorithm with (instance-dependent) expected regret O(K² log³(T)/ρ²). Notice that the regret guarantee has an extra K² log²(T)/ρ² factor compared to its non-replicable counterpart in Esfandiari et al. (2021) (Theorem 5.1

every arm a_1, . . . , a_{n_i} and receive rewards r_1, . . . , r_{n_i} 12: Compute the LSE θ̂_i ← (Σ_{j=1}^{n_i} a_j a_j^⊤)^{-1} Σ_{j=1}^{n_i} a_j r_j 13:

First, the algorithm is ρ-replicable since in each batch we use a replicable LSE sub-routine with parameter ρ′ = ρ/B. This implies that Pr[(a_1, ..., a_T) ≠ (a′_1, ..., a′_T)] = Pr[∃ i ∈ [B] : θ̂_i was not replicable] ≤ ρ. Let us fix a batch iteration i ∈ [B − 1]. Set C_i to be the core set computed by Lemma 7. The algorithm first pulls n_i = Cd⁴ log(d/δ) log² log(d) log log log(d)

that the mass of every arm is at least x(1 − x)^{|C|} ≥ x − |C|x² = c′/(d log(d)) − c″ d log log d/(d² log²(d)) ≥ c/(d log(d)), for some absolute numerical constant c > 0. This concludes the claim.

F.2 THE PROOF OF LEMMA 9

(2013); Lattimore et al. (2020) with support O(d) for which the method runs in O(d log log(d)) rounds (see Note 3 at Section 21.2 of Lattimore & Szepesvári (2020) and Lattimore et al. (

Algorithm 1 Mean-Estimation Based Replicable Algorithm for Stochastic MAB (Theorem 3) 1: Input: time horizon T , number of arms K, replicability ρ 2:

SQ that satisfies sup a∈A |⟨a, θ SQ -θ ⋆ ⟩| ≤ ε , with probability at least 1 -δ.Algorithm 4 Replicable LSE Algorithm for Stochastic Infinite Action Set (Theorem 10)

log 2 log(d) log log log(d) log 2 (T )/ρ 2 8: a 1 , . . . , a |Ci| ← core set C i of the design given by Lemma 7 with parameter A ′ Pull every arm a j for N i = ⌈M i ⌉/|C i | rounds and receive rewards r

7. ACKNOWLEDGEMENTS

Alkis Kalavasis was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the "First Call for H.F.R.I. Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment grant", project BALSAM, HFRIFM17-1424. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N00014-19-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Andreas Krause was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program grant agreement no. 815943 and the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545. Grigoris Velegkas was supported by NSF (IIS-1845032), an Onassis Foundation PhD Fellowship and a Bodossaki Foundation PhD Fellowship.

B THE PROOF OF THEOREM 4

Theorem. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 2) for the stochastic bandit problem with K arms and gaps (∆_j)_{j∈[K]} whose expected regret is E[R_T] ≤ C · (K² log²(T)/ρ²) Σ_{j:∆_j>0} (∆_j + log(KT log(T))/∆_j), for some absolute numerical constant C > 0, and its running time is polynomial in K, T and 1/ρ. To give some intuition, we begin with a non-tight analysis which, however, provides the main ideas behind the actual proof. Non-Tight Analysis. Assume that the environment has K arms with unknown means μ_i and let T be the number of rounds. Let B be the total number of batches and β > 1. We set q = T^{1/B}. In each batch i ∈ [B], we pull each arm β⌊q^i⌋ times. Hence, after the i-th batch, we will have drawn c̄_i = Σ_{1≤j≤i} β⌊q^j⌋ independent and identically distributed samples from each arm. Let us also set c_i = Σ_{1≤j≤i} ⌊q^j⌋.

Let us fix i ∈ [B]

Using Hoeffding's bound for subgaussian concentration, the length of the confidence interval for arm j ∈ [K] that guarantees success probability 1 − δ (in the sense that the empirical estimate μ̂_j will be close to the true μ_j) is equal to U_i = √(2 log(1/δ)/c_i) when the estimator uses c_i samples; let also Ū_i = √(2 log(1/δ)/c̄_i). Assume that the active arms at batch iteration i lie in the set A_i. Consider the estimates {μ̂_j^{(i)}}_{j∈A_i}, where μ̂_j^{(i)} is the empirical mean of arm j using c̄_i samples. We will eliminate an arm j at the end of batch iteration i if its empirical mean falls short of the best empirical mean by more than the (randomized) cutoff. For the remainder of the proof, we condition on the event E that for every arm j ∈ [K] and every batch i ∈ [B] the true mean is within Ū_i of the empirical one. We first argue about the replicability of our algorithm. Consider a fixed batch i (i.e., the end of the i-th batch) and a fixed arm j. Let i⋆ be the empirically optimal arm after the i-th batch. Let μ̂′_j, μ̂′_{i⋆} be the empirical estimates of arms j and i⋆ after the i-th batch under some other execution of the algorithm. We condition on the event E′ for the other execution as well. Notice that |μ̂_j^{(i)} − μ̂′_j| ≤ 2Ū_i. Since the randomness of the cutoff is shared, if μ̂_j^{(i)} + U_i ≥ μ̂_{i⋆}^{(i)} − U_i + 4Ū_i, then the arm j will not be eliminated after the i-th batch in some other execution of the algorithm as well. Similarly, if μ̂_j^{(i)} + U_i < μ̂_{i⋆}^{(i)} − U_i − 4Ū_i, then the arm j will get eliminated after the i-th batch in some other execution of the algorithm as well. In particular, this means that if μ̂_j^{(i)} − 2Ū_i > μ̂_{i⋆}^{(i)} + Ū_i − U_i/2 then the arm j will not get eliminated in some other execution of the algorithm, and if μ̂_j^{(i)} + 5Ū_i < μ̂_{i⋆}^{(i)} − U_i then the arm j will also get eliminated in some other execution of the algorithm, with probability 1 under the event E ∩ E′. We call the above two cases good since they preserve replicability. Thus, it suffices to bound the probability that the decision about arm j will be different between the two executions when we are in neither of these cases. Then, the worst-case bound due to the mass of the uniform probability measure is 16Ū_i/U_i. This implies that the probability mass of the bad event is at most 16√(c_i/c̄_i) = 16√(1/β).
A union bound over all arms and batches yields that the probability that two distinct executions differ in at least one pull is at most 16KB√(1/β).

Published as a conference paper at ICLR 2023

since A′ is a deterministic 1/T-net of A. Also, let us set the expected regret of the bounded-action sub-problem accordingly. We can now employ the analysis of the finite-arm case. During batch i, any active arm has gap at most 4ε_{i−1}, so the instantaneous regret in any round is not more than 4ε_{i−1}. The expected regret conditional on the good event E is upper bounded by the sum over batches of M_i ε_{i−1}, where M_i is the total number of pulls in batch i (using the replicability blow-up) and ε_{i−1} is the error one would achieve by drawing q^i samples (ignoring the blow-up). Then, for some absolute constant C > 0, this sum can be bounded in terms of a quantity S; we pick B = log(T) and get that, if q = T^{1/B}, then S = Θ(√T). We remark that this choice of q is valid since Σ_{i=1}^B q^i = (q^{B+1} − q)/(q − 1) = Θ(q^B) − 1 ≥ Tρ²/(d³ log(d) log² log(d) log log log(d)). Hence, conditional on E, we obtain the desired bound. Note that when E does not hold, we can bound the expected regret by (1/T) · T = 1. This implies that the overall regret is at most 1 + E[R_T | E], and so it satisfies the desired bound and the proof is complete.

F DEFERRED LEMMATA

F.1 THE PROOF OF LEMMA 8

Proof. Consider the distribution π that is a 2-approximation to the optimal G-design and has support of size |C| = O(d log log d). Let C′ be the set of arms in the support such that π(a) ≤ c/(d log d). We consider π̃ = (1 − x)π + x·a, where a ∈ C′ and x will be specified later. Consider now the matrix V(π̃). Using the Sherman-Morrison formula, we can control its inverse. Consider any arm a′. Then,

