REPLICABLE BANDITS

Abstract

In this paper, we introduce the notion of replicable policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called replicable if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do replicable policies exist, but also they achieve almost the same optimal (non-replicable) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandits setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the replicability parameter is also optimal. Similarly, for stochastic linear bandits (with finitely and infinitely many arms) we develop replicable policies that achieve the best-known problem-independent regret bounds with an optimal dependency on the replicability parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different executions.

1. INTRODUCTION

In order for scientific findings to be valid and reliable, the experimental process must be repeatable and must provide coherent results and conclusions across these repetitions. In fact, lack of reproducibility has been a major issue in many scientific areas; a 2016 survey that appeared in Nature (Baker, 2016a) revealed that more than 70% of researchers failed in their attempt to reproduce another researcher's experiments. What is even more concerning is that over 50% of them failed to reproduce their own findings. Similar concerns have been raised by the machine learning community, e.g., the ICLR 2019 Reproducibility Challenge (Pineau et al., 2019) and the NeurIPS 2019 Reproducibility Program (Pineau et al., 2021), due to the exponential increase in the number of publications and concerns about the reliability of the reported findings. The aforementioned empirical evidence has recently led to theoretical studies and rigorous definitions of replicability. In particular, the works of Impagliazzo et al. (2022) and Ahn et al. (2022) considered replicability as an algorithmic property through the lens of (offline) learning and convex optimization, respectively. In a similar vein, in the current work, we introduce the notion of replicability in the context of interactive learning and decision making. In particular, we study replicable policy design for the fundamental setting of stochastic bandits.

A multi-armed bandit (MAB) is a one-player game that is played over T rounds where there is a set of different arms/actions A of size |A| = K (in the more general case of linear bandits, we can even consider an infinite number of arms). In each round t = 1, 2, ..., T, the player pulls an arm a_t ∈ A and receives a corresponding reward r_t. In the stochastic setting, the rewards of each arm are sampled in each round, independently, from some fixed but unknown distribution supported on [0, 1].
Crucially, each arm has a potentially different reward distribution, but the distribution of each arm is fixed over time. A bandit algorithm A at every round t takes as input the sequence of arm-reward pairs that it has seen so far, i.e., (a_1, r_1), ..., (a_{t-1}, r_{t-1}), then uses (potentially) some internal randomness ξ to pull an arm a_t ∈ A and, finally, observes the associated reward r_t ∼ D_{a_t}. We propose the following natural notion of a replicable bandit algorithm, which is inspired by the definition of Impagliazzo et al. (2022). Intuitively, a bandit algorithm is replicable if two distinct executions of the algorithm, with internal randomness fixed between both runs, but with independent reward realizations, give the exact same sequence of played arms, with high probability. More formally, we have the following definition.

Definition 1 (Replicable Bandit Algorithm). Let ρ ∈ [0, 1]. We call a bandit algorithm A ρ-replicable in the stochastic setting if, for any distribution D_{a_j} over [0, 1] of the rewards of the j-th arm a_j ∈ A, and for any two executions of A where the internal randomness ξ is shared across the executions, it holds that

  Pr_{ξ, r^{(1)}, r^{(2)}} [ (a^{(1)}_1, ..., a^{(1)}_T) = (a^{(2)}_1, ..., a^{(2)}_T) ] ≥ 1 − ρ.

Here, a^{(i)}_t = A(a^{(i)}_1, r^{(i)}_1, ..., a^{(i)}_{t−1}, r^{(i)}_{t−1}; ξ) is the t-th action taken by the algorithm A in execution i ∈ {1, 2}.

The reason why we allow for some fixed internal randomness is that the algorithm designer has control over it; e.g., they can use the same seed for their (pseudo)random generator between two executions. Clearly, naively designing a replicable bandit algorithm is not particularly challenging. For instance, an algorithm that always pulls the same arm, or an algorithm that plays the arms in a particular random sequence determined by the shared random seed ξ, is replicable. The caveat is that the performance of these algorithms in terms of expected regret will be quite poor.
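As a toy illustration of Definition 1 (not part of the paper), the following Python sketch simulates the naive replicable policy mentioned above: the internal randomness ξ is a pre-drawn arm sequence shared across two executions, so the executions pull identical arms even though the Bernoulli reward realizations are independent. The arm means and function names are hypothetical placeholders.

```python
import numpy as np

K, T = 3, 100
means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli arm means

# Shared internal randomness xi: a pre-drawn arm sequence, fixed across runs.
xi = np.random.default_rng(0).integers(0, K, size=T)

def naive_policy(history, xi):
    # Ignores rewards: the next arm depends only on the shared randomness xi,
    # so the policy is trivially replicable (but suffers linear regret,
    # since it never exploits what it has learned).
    return int(xi[len(history)])

def run(policy, reward_rng):
    """One execution: independent reward realizations, shared xi."""
    history, arms = [], []
    for t in range(T):
        a = policy(history, xi)
        r = reward_rng.binomial(1, means[a])  # fresh Bernoulli reward draw
        history.append((a, r))
        arms.append(a)
    return arms

# Two executions with independent rewards but the same internal randomness xi
arms1 = run(naive_policy, np.random.default_rng(1))
arms2 = run(naive_policy, np.random.default_rng(2))
assert arms1 == arms2  # identical arm sequences, as Definition 1 requires
```

The interesting algorithms in the paper, of course, must remain replicable while also adapting to the observed rewards.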
In this work, we aim to design bandit algorithms which are replicable and enjoy small expected regret. In the stochastic setting, the (expected) regret after T rounds is defined as

  E[R_T] = T · max_{a ∈ A} µ_a − E[ Σ_{t=1}^T µ_{a_t} ],

where µ_a = E_{r ∼ D_a}[r] is the mean reward for arm a ∈ A. In a similar manner, we can define the regret in the more general setting of linear bandits (see Section 5). Hence, the overarching question in this work is the following: Is it possible to design replicable bandit algorithms with small expected regret? At first glance, one might think that this is not possible, since it looks like replicability contradicts the exploratory behavior that a bandit algorithm should possess. However, our main results answer this question in the affirmative and are summarized in Table 1.

Table 1: Our results for replicable stochastic multi-armed and linear bandits. In the expected regret column, O(·) subsumes logarithmic factors. H_∆ is equal to Σ_{j: ∆_j > 0} 1/∆_j, where ∆_j is the gap between the mean of the optimal action and that of action j, K is the number of arms, and d is the ambient dimension in the linear bandit setting.
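To make the regret definition concrete, here is a small sketch (not from the paper; the means are illustrative placeholders) that computes the pseudo-regret Σ_t (max_a µ_a − µ_{a_t}) of a given sequence of pulled arms.

```python
def pseudo_regret(arms, means):
    # E[R_T] = T * max_a mu_a - E[sum_t mu_{a_t}]; for a fixed arm
    # sequence this is the sum of per-round suboptimality gaps.
    best = max(means)
    return sum(best - means[a] for a in arms)

means = [0.2, 0.5, 0.8]                              # hypothetical arm means
assert pseudo_regret([2, 2, 2], means) == 0.0        # always the best arm
assert abs(pseudo_regret([0, 1, 2], means) - 0.9) < 1e-9  # 0.6 + 0.3 + 0.0
```

A policy that always plays one suboptimal arm incurs regret linear in T under this definition, which is why the naive replicable policies above are unsatisfying.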

Setting                    | Algorithm   | Regret                   | Theorem
Stochastic MAB             | Algorithm 1 | O(K² log³(T) H_∆ / ρ²)   | Theorem 3
Stochastic MAB             | Algorithm 2 | O(K² log(T) H_∆ / ρ²)    | Theorem 4
Stochastic Linear Bandits  | Algorithm 3 | O(K² √(dT) / ρ²)         | —
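The problem-dependent quantity H_∆ appearing in the regret bounds above can be computed directly from the arm means. A minimal sketch, using illustrative values rather than anything from the paper:

```python
def H_delta(means):
    # H_Delta = sum over suboptimal arms j of 1 / Delta_j,
    # where Delta_j = mu* - mu_j is the suboptimality gap of arm j.
    mu_star = max(means)
    return sum(1.0 / (mu_star - mu) for mu in means if mu < mu_star)

# Gaps are 0.6 and 0.3, so H_Delta = 1/0.6 + 1/0.3 = 5.0
assert abs(H_delta([0.2, 0.5, 0.8]) - 5.0) < 1e-9
```

Instances with small gaps have large H_∆, which is the usual price of distinguishing near-optimal arms in problem-dependent bounds.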

