"I PICK YOU CHOOSE": JOINT HUMAN-ALGORITHM DECISION MAKING IN MULTI-ARMED BANDITS

Abstract

Online learning in multi-armed bandits has been a rich area of research for decades, resulting in numerous "no-regret" algorithms that efficiently learn the arm with highest expected reward. However, in many settings the final decision of which arm to pull is not under the control of the algorithm itself. For example, a driving app typically suggests a subset of routes (arms) to the driver, who ultimately makes the final choice about which to select. Typically, the human also wishes to learn the optimal arm based on historical reward information, but decides which arm to pull based on a potentially different objective function, such as being more or less myopic about exploiting near-term rewards. In this paper, we show when this joint human-algorithm system can achieve good performance. Specifically, we explore multiple possible frameworks for human objectives and give theoretical bounds on regret. Finally, we include experimental results exploring how regret varies with the human decision-maker's objective, as well as with the number of arms presented.

1. INTRODUCTION

Consider the following motivating example:

Alice has recently moved to a new town and does not know the area yet. She uses a navigation app while driving, which narrows down the thousands of potential routes to a few options for her to choose from. She and the app only get to see the actual driving time of the final route she picks. Because of varying traffic and weather delays, the actual driving times of the routes are unpredictable. Both the navigation app and Alice wish to minimize her average travel time. However, they might have different short-term objectives. For example, Alice might be myopic and prefer choosing a route that has performed well in the past, rather than exploring a new one. Alternatively, Alice may be adventurous and actively seek out new routes in the hope that they might be quicker than the ones she has previously explored. The navigation app uses a generic algorithm that does not know Alice's specific objective function. Under what circumstances can Alice and her navigation app achieve their shared goal of quickly finding the quickest route?

If Alice's navigation app were able to tell Alice exactly which route she must take, then this problem would reduce to that of multi-armed bandits (MAB), a celebrated online-learning paradigm. However, in the driving-directions setting, it is unrealistic to assume that the algorithm can force Alice to take a particular route. In human-algorithm collaboration more generally, the algorithm can often provide assistance, but the human makes the final decision. This is the case in other settings as well: a diner trying to find the best restaurant, a doctor trying to find the best treatment, or a teacher trying to find the best pedagogical method. This framework requires a shift in thinking: rather than focusing on optimizing the performance of the algorithm alone, the goal is to build an algorithm that maximizes the performance of the human-algorithm system.
For multi-armed bandits, the standard objective is to minimize expected regret, the amount of reward that is missed by not selecting the optimal arm. In human-algorithm multi-armed bandits, some aspects (such as the behavior of the human) are entirely out of our control, and so the system cannot be completely optimized. Instead, the goal of this paper is descriptive: to characterize settings where sublinear regret is possible, and settings where linear regret is unavoidable. In Section 2, we discuss how our setting and results relate to previous literature on MAB and on human-algorithm collaboration. In Section 3, we formalize the model that we analyze, including multiple different models of human behavior. Section 4 contains theoretical results, such as bounds on expected regret. Specifically, we show that, so long as the human is not completely myopic (has some weak preference for exploring arms that have not been frequently pulled), sublinear regret is achievable. If the human is completely myopic, then the regret unavoidably includes a linear dependence on time. Section 5 enriches these theoretical results with experimental simulations. These results show that if the human is more myopic than the algorithm, overall regret decreases as more arms are shown to the human. On the other hand, if the human is less myopic, the opposite is true, and regret increases as more arms are shown to the human. Finally, in Section 6 we briefly discuss implications of our work and potential future directions.

2. RELATED WORK

MULTI-ARMED BANDITS

The area of multi-armed bandits is wide enough to admit multiple textbooks Slivkins et al. (2019); Lattimore & Szepesvári (2020). In this section, we will highlight some of the most closely related papers. Yue et al. (2012) proposed "dueling bandits", where multiple arms are presented simultaneously and the feedback is a noisy binary signal as to which has higher reward. Since then, there have been numerous extensions Sui et al. (2018; 2017b); Komiyama et al. (2015), such as those that allow more than two arms to be presented Saha & Gopalan (2018); Agarwal et al. (2020); Sui et al. (2017a). Shivaswamy & Joachims (2015) studies a related problem where the task of the algorithm is to rank a set of items. The human then improves the ranking according to their true utility function, but with some bounded degree of improvement reflecting limits on human rationality. One major difference between dueling bandits and our framework is that we assume feedback is given by a human who is learning about the rewards of the arms themselves, whereas dueling bandits typically assume that the preference probabilities between the arms are fixed. Additionally, dueling bandits typically involve boolean feedback, whereas we allow real-valued access to the rewards. There has also been a series of works looking more specifically at human-algorithm collaboration in bandit settings. Gao et al. (2021) learns from batched historical human data to develop an algorithm that assigns each task at test time to either itself or a human. Chan et al. (2019) studies a setting similar to ours in that the human is simultaneously learning which option is best for them. However, their framework allows the algorithm to overrule the human, which makes sense in many settings, but not all, such as our motivating example of driving directions. Bordt & Von Luxburg (2022) formalizes the problem as a two-player setting where both the human and algorithm take actions that affect the reward both experience.
Additionally, some work has used the framework of the human as the final decision-maker and studied how to disclose information so as to incentivize them to take the "right" action. Immorlica et al. (2018) studies how to match the best regret in a setting where myopic humans pull the final arm. Hu et al. (2022) studies a related problem with combinatorial bandits, where the goal is to select a subset of the total arms to pull. Bastani et al. (2022) investigates a more applied setting where each human is a potential customer who will become disengaged and leave if they are suggested products (arms) that are a sufficiently poor fit. Kannan et al. (2017) looks at a similar model of sellers considering sequential clients, specifically investigating questions of fairness. In general, these works differ from ours in that they assume a new human arrives at each time step, and so the algorithm is able to selectively disclose information to them. In our setting, the human may be the same between time steps, and we typically assume that they have access to the same information as the algorithm.

HUMAN-ALGORITHM COLLABORATION

Studying human-algorithm collaboration is a rapidly growing, highly interdisciplinary area of research. In general, most work focuses on offline learning settings, which differ from our MAB analysis. Some veins of research are more ethnographic, studying how people use algorithmic input in their decision-making Lebovitz et al. (2021; 2020); Beede et al. (2020); Yang et al. (2018); Okolo et al. (2021). Other avenues work on developing ML tools designed to work with humans, such as in medical settings Raghu et al. (2018) or child welfare phone screenings Chouldechova et al. (2018). Finally, and most closely related to this paper, some works develop theoretical models to analyze human-algorithm systems, such as Rastogi et al. (2022); Cowgill & Stevenson (2020); Bansal et al. (2021a); Steyvers et al. (2022); Madras et al. (2018). Bansal et al. (2021b) proposes the notion of complementarity, which is achieved when a human-algorithm system together has performance that is strictly better than either the human or the algorithm could achieve alone. Straitouri et al. (2022) studies conformal prediction, where the algorithm narrows down the list of possible item labels to a subset, from which the human picks. This formulation is structurally similar to ours, but considers an offline labeling task, rather than online MAB.

3. MODEL

3.1. SETTING

We assume that there are N arms, with rewards drawn i.i.d. from distributions $X_i \sim D_i$, for $i \in [N]$. WLOG, we order the arms in descending order of expected reward: arm 0 has the highest expected reward $\mu_0$, and we set $\Delta_i$ to be the difference between the expected reward of arm 0 and that of arm $i$. There are two actors, the human (H) and the algorithm (A). Each of them has access to the same historical information but uses it in different ways. At each time step $t \in [T]$, the algorithm selects a subset of $k \in [1, N]$ arms to present to the human. Among those $k$ presented arms, the human selects a single final arm $I_t$ to be pulled. Both the human and the algorithm then observe the reward $X_{I_t,t} \sim D_{I_t}$. Note that for $k = 1$ this reduces to the algorithm selecting the final arm (because the human can only select from those that are presented), while for $k = N$ this reduces to the human making an unconstrained selection. Throughout, our goal will be to minimize expected regret, the amount of reward the human-algorithm system misses out on by pulling sub-optimal arms:

$$T \cdot \mu_0 - \sum_{t=0}^{T} \mathbb{E}[\mu_{I_t}] = \sum_{t=0}^{T} \mathbb{E}[\Delta_{I_t}]$$
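As a concrete illustration, the interaction protocol above can be sketched in a few lines of simulation code. This is our own illustrative sketch: the arm means, horizon, and placeholder selection rules below are hypothetical choices, not part of the model.

```python
import random

# Arms are indexed in descending order of expected reward, so arm 0 is optimal.
N, k, T = 5, 3, 1000
means = [0.5, 0.45, 0.45, 0.45, 0.45]  # mu_0 > mu_i for i > 0

random.seed(0)
regret = 0.0
for t in range(T):
    # The algorithm presents a subset of k arms (placeholder rule: uniform).
    presented = random.sample(range(N), k)
    # The human picks a single arm I_t from the presented subset (placeholder rule).
    chosen = random.choice(presented)
    # Both actors observe only the realized reward of the pulled arm (bandit feedback).
    reward = random.gauss(means[chosen], 0.1)
    # Expected regret accumulates the gap Delta_{I_t} = mu_0 - mu_{I_t}.
    regret += means[0] - means[chosen]
```

With real (learning) selection rules in place of the placeholders, `regret` is exactly the quantity the bounds in Section 4 control.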

3.2. ALGORITHM AND HUMAN BEHAVIOR

Next, we describe our assumptions about how the algorithm and the human behave. One standard selection approach we will incorporate is the UCB algorithm Auer et al. (2002), which at time $t$ selects whichever arm maximizes

$$\hat{\mu}_{i,t} + \alpha_a \cdot \sqrt{\frac{\ln(t)}{n_{i,t}}}$$

for empirical mean $\hat{\mu}_{i,t} = \frac{1}{n_{i,t}} \sum_{s=1}^{t} \mathbb{1}[I_s = i] \cdot X_{i,s}$, where $n_{i,t}$ is the number of times arm $i$ has been pulled up to time $t$. Because the UCB algorithm is a standard algorithm for multi-armed bandit settings, we will assume the algorithm A uses some variant of it. However, in this paper, we will explore scenarios where the human H uses multiple different selection rules. For example, we say that H is $(\alpha_h, \delta)$-myopic if it selects uniformly at random among the $k$ presented arms with probability $\delta$, and otherwise selects whichever presented arm maximizes the UCB index with coefficient $\alpha_h$:

$$I_t = \begin{cases} x \sim \mathrm{Unif}[k] & \text{if } r \sim \mathrm{Unif}[0,1] \le \delta \\ \operatorname{argmax}_i\ \hat{\mu}_{i,t} + \alpha_h \cdot \sqrt{\frac{\ln(t)}{n_{i,t}}} & \text{otherwise} \end{cases}$$

In addition to these objectives, we will assume that the human and the algorithm both prefer to pull each arm at least once before pulling any arm a second time.
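The UCB index and the $(\alpha_h, \delta)$-myopic selection rule can be sketched as follows. Function and variable names are our own; this is an illustrative sketch of the definitions above, not the paper's implementation.

```python
import math
import random

def ucb_index(mu_hat, n_pulls, t, alpha):
    """UCB index for one arm: empirical mean plus exploration bonus."""
    if n_pulls == 0:
        return float("inf")  # both actors pull every arm at least once first
    return mu_hat + alpha * math.sqrt(math.log(t) / n_pulls)

def human_choice(presented, mu_hat, n_pulls, t, alpha_h, delta):
    """(alpha_h, delta)-myopic human: uniform among the k presented arms with
    probability delta, otherwise argmax of the UCB index with coefficient alpha_h."""
    if random.random() < delta:
        return random.choice(presented)
    return max(presented, key=lambda i: ucb_index(mu_hat[i], n_pulls[i], t, alpha_h))
```

Setting `alpha_h = 0` and `delta = 0` recovers the fully myopic (greedy) human, who simply maximizes the empirical mean among the presented arms.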

4. THEORETICAL ANALYSIS

In this section, we provide theoretical regret bounds for our human-algorithm setting. First, we show in Lemma 1 that any human with $\delta > 0$ must incur linear regret (proof in the Appendix). Additionally, this lower bound is increasing in $k$, the number of arms shown to the human. This suggests that, if the human selects randomly with some nonzero probability, it may be optimal to show them as few arms as possible. Following this result, we will assume $\delta = 0$ throughout the rest of the paper.

Lemma 1. Any human that selects uniformly at random with probability $\delta > 0$ among $k \ge 2$ arms incurs regret $\Omega(T)$ that is increasing in $k$.

Next, we will work to bound total regret when $\delta = 0$. We will find it useful to use Lemma 2, which gives a high-probability bound on the relative ordering of the UCB values of two arms (proof in the Appendix).

Lemma 2. Consider arm $i$ with $n_{i,t} \ge \frac{\alpha^2 \ln(t)}{\epsilon^2 \Delta_i^2}$, for $\epsilon \in (0, 0.5]$, $\alpha \ge 1$, and any UCB algorithm selecting according to $UCB_{i,t} = \hat{\mu}_{i,t} + \alpha \sqrt{\ln(t)/n_{i,t}}$. Then $UCB_{i,t} < UCB_{0,t}$ with probability at least $1 - \frac{4}{t^2}$.

First, we consider the case where $\alpha_h > 0$, so the human is not completely myopic in its choice of arm. Theorem 1 gives a regret bound for this scenario. One key area of focus in human-algorithm collaboration in general is how the performance of the joint system compares to that of the human and algorithm separately Bansal et al. (2021b; a). Overall performance could be better than either the human or algorithm alone (complementary performance, as defined in Bansal et al. (2021b)), or could be worse, or could be somewhere in between. For Theorem 1, the regret bound involves a $\max(\alpha_h^2, \alpha_a^2)$ term. This effectively means that regret for the human-algorithm system is guaranteed to be at least as good as that of the worse of the two components (human or algorithm).

Theorem 1. Consider the case with $\alpha_h, \alpha_a > 1$. Then, the expected regret is bounded by:

$$\sqrt{N \cdot T \cdot \ln(T)} \cdot \left(1 + 4 \cdot \max(\alpha_h^2, \alpha_a^2)\right) + 16 \cdot N$$

Proof. First, we divide the arms into two groups:

1. Group 1 contains arms with $\Delta_i < \sqrt{\frac{N}{T} \cdot \ln(T)}$.
2.
Group 2 contains arms with $\Delta_i \ge \sqrt{\frac{N}{T} \cdot \ln(T)}$.

To bound total regret, we can bound the regret from group 1 as:

$$\sum_{i \in G_1} n_{i,T} \cdot \Delta_i \le \sqrt{\frac{N}{T} \cdot \ln(T)} \cdot \sum_{i \in G_1} n_{i,T} \le \sqrt{\frac{N}{T} \cdot \ln(T)} \cdot T = \sqrt{N \cdot T \cdot \ln(T)}$$

The remainder of the proof will bound the regret from arms in group 2. First, we work to bound the number of times that arm $i$, for $i \ne 0$, can be pulled in total ($n_{i,T}$). We use $I_t$ to indicate the arm pulled at time $t$. Our goal is to bound the expected value of $n_{i,T}$. We use

$$n^*_{i,t} = 4 \cdot \max(\alpha_h^2, \alpha_a^2) \cdot \frac{\ln(t)}{\Delta_i^2}$$

to denote a minimum threshold of samples we need from arm $i$ in order to obtain certain high-probability guarantees.

$$\mathbb{E}[n_{i,T}] = \mathbb{E}\left[\sum_{t=0}^{T} \mathbb{1}(I_{t+1} = i)\right] = 1 + \mathbb{E}\left[\sum_{t=N}^{T} \mathbb{1}(I_{t+1} = i)\right]$$
$$= 1 + \mathbb{E}\left[\sum_{t=N}^{T} \mathbb{1}\left(I_{t+1} = i,\ n_{i,t} < n^*_{i,t}\right)\right] + \mathbb{E}\left[\sum_{t=N}^{T} \mathbb{1}\left(I_{t+1} = i,\ n_{i,t} \ge n^*_{i,t}\right)\right]$$
$$\le 1 + n^*_{i,T} + \mathbb{E}\left[\sum_{t=N}^{T} \mathbb{1}\left(I_{t+1} = i,\ n_{i,t} \ge n^*_{i,t}\right)\right]$$

In the second step, we used the fact that each arm must be pulled at least once. In the third step, we split on whether arm $i$ has already been pulled $n^*_{i,t}$ times or not. In the fourth step, we used the fact that arm $i$ can only be pulled $n^*_{i,T}$ times before $n_{i,t} < n^*_{i,t}$ is no longer satisfied. Next, we bound the final term: the number of times arm $i$ can be selected, given that it has already been pulled at least $n^*_{i,t}$ times. Here, we will find it useful to rewrite the expectation as a sum of probabilities:

$$\mathbb{E}\left[\sum_{t=N}^{T} \mathbb{1}\left(I_{t+1} = i,\ n_{i,t} \ge n^*_{i,t}\right)\right] = \sum_{t=N}^{T} \mathbb{P}\left(I_{t+1} = i,\ n_{i,t} \ge n^*_{i,t}\right)$$

Recall that the algorithm uses $UCB^a_{i,t} = \hat{\mu}_{i,t} + \alpha_a \sqrt{\ln(t)/n_{i,t}}$, while the human uses $UCB^h_{i,t} = \hat{\mu}_{i,t} + \alpha_h \sqrt{\ln(t)/n_{i,t}}$. We have set $n^*_{i,t} = 4 \cdot \max(\alpha_h^2, \alpha_a^2) \cdot \ln(t)/\Delta_i^2$ so that, by Lemma 2, we have enough samples that with probability at least $1 - \frac{4}{t^2}$: 1) $UCB^a_{i,t} < UCB^a_{0,t}$ and 2) $UCB^h_{i,t} < UCB^h_{0,t}$. Statement 1) means that if arm $i$ is presented to the human, with high probability arm 0 will be as well.
Statement 2) means that if both arms are presented to the human, with high probability the human will pick arm 0. Only if one of these statements fails can arm $i$ be picked, so we can bound this portion of the count by the probability that either condition fails:

$$\sum_{t=N}^{T} \mathbb{P}\left(I_{t+1} = i,\ n_{i,t} \ge n^*_{i,t}\right) \le \sum_{t=N}^{T} 2 \cdot \frac{4}{t^2} \le 8 \cdot \sum_{t=1}^{\infty} \frac{1}{t^2} \le 16$$

We can then bound the regret from this term (for arms in group 2):

$$\sum_{i \in G_2} n_{i,T} \cdot \Delta_i \le \sum_{i \in G_2} \left(n^*_{i,T} + 16\right) \cdot \Delta_i = \sum_{i \in G_2} \left(4 \cdot \max(\alpha_h^2, \alpha_a^2) \cdot \frac{\ln(T)}{\Delta_i} + 16 \cdot \Delta_i\right)$$
$$\le \sum_{i \in G_2} \left(4 \cdot \max(\alpha_h^2, \alpha_a^2) \cdot \ln(T) \cdot \sqrt{\frac{T}{N \cdot \ln(T)}} + 16\right) \qquad \left(\Delta_i \ge \sqrt{\tfrac{N}{T} \ln(T)},\ \Delta_i \le 1\right)$$
$$\le 4 \cdot \max(\alpha_h^2, \alpha_a^2) \cdot \sqrt{N \cdot T \cdot \ln(T)} + 16 \cdot N$$

Adding in the regret from the arms of group 1, we get a total regret bound of:

$$\sqrt{N \cdot T \cdot \ln(T)} + 4 \cdot \max(\alpha_h^2, \alpha_a^2) \cdot \sqrt{N \cdot T \cdot \ln(T)} + 16 \cdot N = \sqrt{N \cdot T \cdot \ln(T)} \cdot \left(1 + 4 \cdot \max(\alpha_h^2, \alpha_a^2)\right) + 16 \cdot N$$

Next, we consider the case where $\alpha_h = 0$, so the human greedily pulls whichever presented arm has the highest empirical reward. Theorem 2 gives a regret bound for this scenario. As compared with Theorem 1, note that this bound includes a linear dependence on $T$ that increases with $k$, the number of arms shown to the human. In Lemma 3, we show that linear regret is unavoidable if the human picks purely myopically.

Theorem 2. Consider any human using myopic selection ($\alpha_h = 0$), any $\alpha_a \ge 1$, and any $\epsilon \in (0, 0.5]$. Then, for $p_{i,\epsilon} = \mathbb{P}_{X_0 \sim D_0}[X_0 \le \mu_0 - (1-\epsilon) \cdot \Delta_i]$, the expected regret is bounded by:

$$\sqrt{N \cdot T \cdot \ln(T)} \cdot \left(1 + \frac{\alpha_a^2}{\epsilon^2}\right) + 8 \cdot N + \sum_{i=1}^{k} p_{i,\epsilon} \cdot T$$

Proof. The first part of this proof is identical to that of Theorem 1. First, we divide the arms into two groups (group 1 with low-regret arms, group 2 with high-regret arms). For arms from group 1, we have shown that we can bound:

$$\sum_{i \in G_1} n_{i,T} \cdot \Delta_i \le \sqrt{N \cdot T \cdot \ln(T)}$$

For arms from group 2, we work to bound the number of times that arm $i$, for $i \ne 0$, can be pulled in total ($n_{i,T}$).
We use

$$n^*_{i,t} = \frac{\alpha_a^2 \cdot \ln(t)}{\epsilon^2 \cdot \Delta_i^2}$$

to denote a minimum threshold of samples we need from arm $i$ in order to obtain certain high-probability guarantees.

$$\mathbb{E}[n_{i,T}] = \mathbb{E}\left[\sum_{t=0}^{T} \mathbb{1}(I_{t+1} = i)\right] \le n^*_{i,T} + \mathbb{E}\left[\sum_{t=N}^{T} \mathbb{1}\left(I_{t+1} = i,\ n_{i,t} \ge n^*_{i,t}\right)\right] = n^*_{i,T} + \sum_{t=N}^{T} \mathbb{P}\left(I_{t+1} = i,\ n_{i,t} \ge n^*_{i,t}\right)$$

This proof differs from that of Theorem 1 because the human selects myopically, greedily picking whichever presented arm has the highest empirical mean $\hat{\mu}_i$. Let $X_0 \sim D_0$ denote the reward from the first pull of arm 0, and recall $p_{i,\epsilon} = \mathbb{P}_{X_0 \sim D_0}[X_0 \le \mu_0 - (1-\epsilon) \cdot \Delta_i]$. We know with high probability that $\hat{\mu}_{i,t} \le \mu_i + \epsilon \cdot \Delta_i$, but if $X_0 \le \mu_0 - (1-\epsilon) \cdot \Delta_i = \mu_i + \epsilon \cdot \Delta_i$, then it is possible that $X_0 \le \hat{\mu}_{i,t}$. If this occurs, arm $i$ will always be preferred over arm 0, so arm 0 would never be pulled again after the first time.

With probability $1 - p_{i,\epsilon}$, $X_0 > \mu_0 - (1-\epsilon) \cdot \Delta_i = \mu_i + \epsilon \cdot \Delta_i$. If this occurs, then with probability at least $1 - \frac{2}{t^2}$, the following chain of inequalities holds:

$$\hat{\mu}_{i,t} \le \mu_i + \sqrt{\frac{\ln(t)}{n_{i,t}}} \qquad \text{(Chernoff bound, with probability } 1 - \tfrac{2}{t^2}\text{)}$$
$$\le \mu_i + \sqrt{\frac{\ln(t) \cdot \Delta_i^2 \cdot \epsilon^2}{\alpha_a^2 \cdot \ln(t)}} \le \mu_i + \epsilon \cdot \Delta_i \qquad \left(n_{i,t} \ge n^*_{i,t},\ \alpha_a \ge 1\right)$$
$$= \mu_0 - (1-\epsilon) \cdot \Delta_i < X_0 \qquad \text{(by assumption)}$$

Additionally, because $n^*_{i,T} \ge \alpha_a^2 \cdot \ln(t)/\Delta_i^2$, we know that with high probability the algorithm will rank arm 0 above arm $i$; this fails to occur with probability no more than $\frac{2}{t^2}$. Overall, the number of times arm $i$ could be pulled is upper bounded by:

$$\sum_{t=N}^{T} \left(p_{i,\epsilon} + (1 - p_{i,\epsilon}) \cdot \frac{4}{t^2}\right) \le p_{i,\epsilon} \cdot T + \sum_{t=N}^{T} \frac{4}{t^2} \le p_{i,\epsilon} \cdot T + 8 \cdot (1 - p_{i,\epsilon})$$

We can then bound the regret from this term (for arms in group 2):

$$\sum_{i \in G_2} n_{i,T} \cdot \Delta_i \le \sum_{i \in G_2} \left(\frac{\alpha_a^2 \cdot \ln(T)}{\epsilon^2 \cdot \Delta_i^2} + p_{i,\epsilon} \cdot T + 8 \cdot (1 - p_{i,\epsilon})\right) \cdot \Delta_i$$
$$= \sum_{i \in G_2} \left(\frac{\alpha_a^2 \cdot \ln(T)}{\epsilon^2 \cdot \Delta_i} + p_{i,\epsilon} \cdot T \cdot \Delta_i + 8 \cdot (1 - p_{i,\epsilon}) \cdot \Delta_i\right)$$
$$\le \sum_{i \in G_2} \left(\frac{\alpha_a^2 \cdot \ln(T)}{\epsilon^2} \cdot \sqrt{\frac{T}{N \cdot \ln(T)}} + p_{i,\epsilon} \cdot T + 8\right) \le \frac{\alpha_a^2}{\epsilon^2} \cdot \sqrt{N \cdot T \cdot \ln(T)} + 8 \cdot N + \sum_{i \in G_2} p_{i,\epsilon} \cdot T$$

For the last component, we can improve the linear dependence slightly. We know that only $k$ arms are shown to the human at each step. Therefore, even if $X_0 < \mu_i + \epsilon \cdot \Delta_i$ for more than $k$ arms, the regret will not increase by more than the total probability of the $k$ largest such probabilities. Because $p_{i,\epsilon}$ is defined relative to distribution $D_0$, these are the arms $i$ with smallest $\Delta_i$. The upper bound becomes:

$$\frac{\alpha_a^2}{\epsilon^2} \cdot \sqrt{N \cdot T \cdot \ln(T)} + 8 \cdot N + \sum_{i=1}^{k} p_{i,\epsilon} \cdot T$$

Combined with the regret from the group 1 arms, this gives us total regret:

$$\sqrt{N \cdot T \cdot \ln(T)} \cdot \left(1 + \frac{\alpha_a^2}{\epsilon^2}\right) + 8 \cdot N + \sum_{i=1}^{k} p_{i,\epsilon} \cdot T$$

Lemma 3. If $p_{i,\epsilon} > 0$ for any $i$ and $k \ge 2$, regret is $\Omega(T)$.

Proof. We will show that it is possible to construct a case with linear regret. Suppose that each distribution $D_j$ for $j \ne 0$ is deterministic: $X_j = \mu_j$ with probability 1. Suppose that $k \ge 2$, so the algorithm must always present at least one arm besides the optimal arm. Consider any arm $i \ne 0$.
Then, with probability $p_{i,\epsilon}$, after one sample, $\hat{\mu}_0 = X_0 \le \mu_0 - (1-\epsilon) \cdot \Delta_i$. If this occurs, then a myopic human will always select arm $i$ rather than arm 0. Because arm $i$ is deterministic, $\hat{\mu}_i = \mu_i$ will never update, but will remain greater than $\hat{\mu}_0$, which means arm 0 will never be pulled again. This leads to regret of at least $p_{i,\epsilon} \cdot \Delta_i \cdot T$, which is linear in $T$.

Collectively, these results give us upper bounds on expected regret for the human-algorithm MAB setting. Additionally, they identify settings where linear regret is unavoidable.
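The lock-in behavior behind Lemma 3 can be illustrated with a small numerical sketch. This is our own toy construction mirroring the lemma's deterministic-arm instance; the particular numbers are hypothetical.

```python
# Arm 0 is optimal in expectation but stochastic; arm 1 is deterministic.
# If the single initial sample of arm 0 lands below mu_1, a fully myopic
# human (alpha_h = 0) prefers arm 1 at every subsequent step, so arm 0 is
# never pulled again and regret grows linearly at rate Delta_1.
mu_0, mu_1, T = 0.5, 0.45, 1000
x0 = 0.40                    # an unlucky first draw from D_0, below mu_1
mu_hat = {0: x0, 1: mu_1}    # empirical means after the forced initial pulls

pulls_of_arm_0 = 1
for t in range(2, T):
    chosen = max(mu_hat, key=mu_hat.get)  # greedy: highest empirical mean
    if chosen == 0:
        pulls_of_arm_0 += 1
    # Arm 1 is deterministic, so mu_hat[1] never changes; arm 0 is never
    # chosen, so mu_hat[0] never gets the chance to recover.

linear_regret = (T - 2) * (mu_0 - mu_1)  # regret accrued after the lock-in
```

Once the unlucky draw occurs, no amount of additional time corrects the human's preference, which is exactly why the expected regret carries a $p_{i,\epsilon} \cdot T$ term.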

5. EXPERIMENTAL RESULTS

In this section, we further explore this setting through simulations. Specifically, our goal is to demonstrate how expected regret varies with different features, such as the number of arms that are presented to the human, or the distribution of rewards for each arm. In each simulation, we have N = 5 arms and vary k ∈ [1, N]. There is a single best arm 0 with highest expected reward drawn from N(μ_0, σ), while all other arms have identical reward distributions N(μ_i, σ) (unless otherwise noted). We use rewards drawn from normal distributions because varying σ allows us to change $p_{i,\epsilon}$ while keeping the expected means μ_0, μ_i the same. We fix the algorithm's exploration coefficient α_a = 1 throughout, but vary the human's exploration coefficient α_h.
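A minimal version of this simulation setup might look as follows. This is an illustrative sketch under the stated assumptions: the function name, default parameters, and the choice to have the algorithm present the top-k arms by its own UCB index are our own, not a specification from the paper.

```python
import math
import random

def run_sim(k, alpha_h, alpha_a, T=2000, N=5, mu0=0.5, mu_i=0.45, sigma=0.1, seed=0):
    """One run of the joint system: the algorithm presents its top-k arms by
    UCB index, the human picks one of them by its own UCB index."""
    rng = random.Random(seed)
    means = [mu0] + [mu_i] * (N - 1)   # arm 0 is the single best arm
    mu_hat = [0.0] * N                 # empirical means
    n = [0] * N                        # pull counts

    def index(i, t, alpha):
        if n[i] == 0:
            return float("inf")        # unpulled arms are tried first
        return mu_hat[i] + alpha * math.sqrt(math.log(t) / n[i])

    regret = 0.0
    for t in range(1, T + 1):
        # Algorithm presents the k arms with the highest alpha_a-UCB indices.
        presented = sorted(range(N), key=lambda i: index(i, t, alpha_a))[-k:]
        # Human picks the presented arm with the highest alpha_h-UCB index.
        chosen = max(presented, key=lambda i: index(i, t, alpha_h))
        reward = rng.gauss(means[chosen], sigma)
        n[chosen] += 1
        mu_hat[chosen] += (reward - mu_hat[chosen]) / n[chosen]
        regret += means[0] - means[chosen]
    return regret
```

Sweeping `k` and `alpha_h` with this kind of harness (and averaging over many seeds) reproduces the qualitative comparisons discussed below; setting `alpha_h = 0` gives the fully myopic human.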

5.1. VARYING NUMBER OF ARMS PRESENTED

In Figure 1, we explore the effect of varying the human's exploration coefficient α_h over values in {0, 0.5, 1, 2}. Note that for α_h = 0 (Figure 1a), linear regret dominates and is increasing in k, the number of arms presented to the human, as would be expected from Theorem 2. As intuition, consider the fact that for larger k the human has a higher probability of being presented with an arm i for which the first sample $X_0 < \mu_0 - (1-\epsilon) \cdot \Delta_i$. Therefore, showing more arms increases the linear component of regret. For α_h > 0, regret is sublinear in T. For Figure 1b with α_a > α_h > 0, regret is decreasing in k: showing more arms to the human decreases regret. Naturally, for Figure 1c with α_h = α_a, regret is constant in k, because the human and the algorithm are using the same metric to decide which arm to pick. Finally, for Figure 1d with α_h > α_a, regret is again increasing in k. As intuition, it may be useful to recall that the standard UCB algorithm Auer et al. (2002) has regret that is increasing in α, for α > 0. If α_h < α_a, then for larger k the human is doing "more" of the selection, because it has more arms to choose from. Because of this, the overall regret will be dominated by the human's, with a coefficient of α_h. Conversely, if α_h > α_a, then regret will be higher for larger k, again because the human is "responsible" for choosing among a larger set of items. These experimental results indicate that it could be possible to find a version of Theorem 1 that also depends on k. However, this attempt might be complicated by the non-linear relationship between regret and k. In Figure 1b, for example, there is a large jump in regret from k = 5 to k = 4, but a much smaller one between k = 4 and k = 3. Interestingly, these experimental results also seem to indicate that complementarity (strict improvement through human-algorithm collaboration) might be impossible in our setting.
For every simulation shown, either the human alone (k = N ) or the algorithm alone (k = 1) performs optimally.

5.2. VARYING REWARD DISTRIBUTION

Finally, we explore the impact of varying the reward distribution. Figure 2 shows another example with α_h = 0, but where regret is sublinear, in contrast with Figure 1a, which has α_h = 0 and linear regret. In Figure 2, the gap Δ_i is larger, so $p_{i,\epsilon}$ is extremely small. This means that the linear dependence on T is very small, and regret is dominated by the sublinear term. This example shows (empirically) that it is possible to achieve low regret under certain distributional assumptions on the arms.

6. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we have explored human-algorithm collaboration in a multi-armed bandit scenario. We proved theoretical bounds on regret, and demonstrated empirical patterns in regret when varying numbers of arms are presented. There are multiple possible avenues for future work. For example, in our work we have assumed that the algorithm uses UCB. While this is a standard MAB approach that achieves sublinear regret with a human collaborator in many cases, it may not be optimal. It would be interesting to explore whether another algorithm could achieve superior performance for a range of models of human behavior. In particular, so far our results seem to imply that complementarity is impossible in the current setting. It would be useful to know whether this limitation is inherent to the setting, or whether regret could be improved to achieve complementarity. Relatedly, Section 5 shows intriguing experimental relationships between average regret, k, and α_a, α_h that could be useful to explore theoretically. Finally, it would be interesting to explore cases where the human and the algorithm have access to different historical reward data (reflecting cases where the human, for example, may have lived in the city for a while, and so has prior information about the best routes).



Figure 1: Plots show regret (lines show average regret; shaded regions show max and min regret over 100 simulations, each with 10,000 time steps). Each simulation has 5 arms total, where the single best arm has reward drawn from N(μ = 0.5, σ = 0.1) and the other arms have rewards drawn from N(μ = 0.45, σ = 0.1).

Figure 2: Simulation identical to Figure 1a, except that, while the best arm still has reward drawn from N(μ = 0.5, σ = 0.1), the other arms now have rewards drawn from N(μ = 0.1, σ = 0.1). The larger gap in rewards means that $p_{i,\epsilon}$ is smaller, so the linear dependence on T does not dominate overall regret.


A SUPPLEMENTARY PROOFS

Lemma 1. Any human that selects uniformly at random with probability $\delta > 0$ among $k \ge 2$ arms incurs regret $\Omega(T)$ that is increasing in $k$.

Proof. Suppose that $k$ arms are presented and the human selects randomly with probability $\delta$. Each of the presented arms has at least a $\frac{\delta}{k}$ probability of being selected. This chance could be higher, because with probability $1 - \delta$ the human selects according to some other strategy, but $\frac{\delta}{k}$ is a lower bound. We can similarly lower bound regret by assuming that the algorithm (in each round) is optimal and presents the top $k$ arms to the human. Then, expected regret in a given round is lower bounded by:

$$\frac{\delta}{k} \cdot \sum_{i=0}^{k-1} \Delta_i$$

which gives an overall lower bound on regret of:

$$T \cdot \frac{\delta}{k} \cdot \sum_{i=0}^{k-1} \Delta_i$$

For $\delta > 0$, $k \ge 2$, this gives regret that is linear in $T$ and increasing in $k$.

Lemma 2. Consider arm $i$ with $n_{i,t} \ge \frac{\alpha^2 \ln(t)}{\epsilon^2 \Delta_i^2}$, for $\epsilon \in (0, 0.5]$, $\alpha \ge 1$, and any UCB algorithm selecting according to $UCB_{i,t} = \hat{\mu}_{i,t} + \alpha \sqrt{\ln(t)/n_{i,t}}$. Then $UCB_{i,t} < UCB_{0,t}$ with probability at least $1 - \frac{4}{t^2}$.

Proof. Overall, this inequality holds with probability at least $1 - \frac{4}{t^2}$.

