REPLICABLE BANDITS

Abstract

In this paper, we introduce the notion of replicable policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called replicable if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do replicable policies exist, but they also achieve almost the same optimal (non-replicable) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandit setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the replicability parameter is also optimal. Similarly, for stochastic linear bandits (with finitely or infinitely many arms) we develop replicable policies that achieve the best-known problem-independent regret bounds with an optimal dependency on the replicability parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms across two different executions.
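As a notational sketch (the symbols below are ours, chosen to match the $\rho$-replicability convention of Impagliazzo et al. (2022), and may differ in detail from the formal definition in the body of the paper): writing $\rho \in (0,1)$ for the replicability parameter and $(a_1, \dots, a_T)$, $(a'_1, \dots, a'_T)$ for the arm sequences pulled in two independent executions, the requirement reads
\[
  \Pr\bigl[(a_1, \dots, a_T) = (a'_1, \dots, a'_T)\bigr] \ge 1 - \rho,
\]
where the probability is over the independent reward realizations of the two executions (and over any internal randomness of the policy, which may be shared between the two runs, as in Impagliazzo et al. (2022)).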

1. INTRODUCTION

In order for scientific findings to be valid and reliable, the experimental process must be repeatable and must provide consistent results and conclusions across repetitions. In fact, lack of reproducibility has been a major issue in many scientific areas; a 2016 survey that appeared in Nature (Baker, 2016a) revealed that more than 70% of researchers failed in their attempt to reproduce another researcher's experiments. What is even more concerning is that over 50% of them failed to reproduce their own findings. Similar concerns have been raised by the machine learning community, e.g., the ICLR 2019 Reproducibility Challenge (Pineau et al., 2019) and the NeurIPS 2019 Reproducibility Program (Pineau et al., 2021), due to the exponential increase in the number of publications and concerns about the reliability of the reported findings.

The aforementioned empirical evidence has recently led to theoretical studies and rigorous definitions of replicability. In particular, the works of Impagliazzo et al. (2022) and Ahn et al. (2022) considered replicability as an algorithmic property through the lens of (offline) learning and convex optimization, respectively. In a similar vein, in the current work, we introduce the notion of replicability in the context of interactive learning and decision making. In particular, we study replicable policy design for the fundamental setting of stochastic bandits.

A multi-armed bandit (MAB) is a one-player game that is played over T rounds, where there is a set of arms/actions A of size |A| = K (in the more general case of linear bandits, we can consider even an infinite number of arms). In each round t = 1, 2, . . . , T, the player pulls an arm a_t ∈ A and receives a corresponding reward r_t. In the stochastic setting, the rewards of each arm are i.i.d. draws from a fixed but unknown distribution.
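To make the interaction protocol and the replicability desideratum concrete, the following is a minimal simulation sketch in Python; the round-robin policy, the Gaussian reward model, and all names below are illustrative assumptions rather than constructions from this paper.

    import numpy as np

    def run_policy(policy, means, T, reward_rng, policy_seed):
        # Play T rounds of the stochastic MAB protocol: in round t the
        # player pulls an arm a_t and observes a noisy reward r_t.
        policy_rng = np.random.default_rng(policy_seed)  # internal randomness
        history, sequence = [], []
        for t in range(T):
            a_t = policy(t, history, policy_rng)
            r_t = means[a_t] + reward_rng.normal()  # stochastic reward (Gaussian here by assumption)
            history.append((a_t, r_t))
            sequence.append(a_t)
        return sequence

    def round_robin(t, history, rng):
        # A trivially replicable (but far from regret-optimal) baseline:
        # its arm choices ignore the random rewards entirely.
        return t % 3

    means = np.array([0.3, 0.5, 0.7])  # unknown to the player in the actual problem
    # Two executions with independent reward realizations; a replicable
    # policy may share its internal random seed across executions.
    seq_1 = run_policy(round_robin, means, T=10, reward_rng=np.random.default_rng(1), policy_seed=42)
    seq_2 = run_policy(round_robin, means, T=10, reward_rng=np.random.default_rng(2), policy_seed=42)
    print(seq_1 == seq_2)  # replicability asks this to hold with probability at least 1 - rho

An adaptive policy such as UCB would generally fail this check, since its arm choices depend on the random reward realizations; the point of this work is that carefully designed adaptive policies can pass it while paying almost no extra regret.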

