"I PICK YOU CHOOSE": JOINT HUMAN-ALGORITHM DECISION MAKING IN MULTI-ARMED BANDITS

Abstract

Online learning in multi-armed bandits has been a rich area of research for decades, resulting in numerous "no-regret" algorithms that efficiently learn the arm with the highest expected reward. However, in many settings the final decision of which arm to pull is not under the control of the algorithm itself. For example, a driving app typically suggests a subset of routes (arms) to the driver, who ultimately makes the final choice about which to select. Typically, the human also wishes to learn the optimal arm based on historical reward information, but decides which arm to pull based on a potentially different objective function, such as being more or less myopic about exploiting near-term rewards. In this paper, we show when this joint human-algorithm system can achieve good performance. Specifically, we explore multiple possible frameworks for human objectives and give theoretical bounds on regret. Finally, we include experimental results exploring how regret varies with the human decision-maker's objective, as well as with the number of arms presented.

1. INTRODUCTION

Consider the following motivating example:

Alice has recently moved to a new town and does not know the area yet. She uses a navigation app while driving, which narrows down the thousands of potential routes to a few options for her to choose from. She and the app only get to see the actual driving time of the final route she picks. Because of varying traffic and weather delays, the actual driving time of each route is unpredictable. Both the navigation app and Alice wish to minimize her average travel time. However, they might have different short-term objectives. For example, Alice might be myopic and prefer choosing a route that has performed well in the past, rather than exploring a new one. Alternatively, Alice may be adventurous and actively seek out new routes in the hope that they might be quicker than ones she has previously explored. The navigation app uses a generic algorithm that does not know Alice's specific objective function. Under what circumstances can Alice and her navigation app achieve their shared goal of quickly finding the quickest route?

If Alice's navigation app were able to tell Alice exactly which route she must take, then this problem would reduce to that of multi-armed bandits (MAB), a celebrated online-learning paradigm. However, in the driving-directions setting, it is unrealistic to assume that the algorithm can force Alice to take a particular route. In human-algorithm collaboration more generally, the algorithm can often provide assistance, but the human makes the final decision. This is the case in other settings as well: a diner trying to find the best restaurant, a doctor trying to find the best treatment, or a teacher trying to find the best pedagogical method. This framework requires a shift in thinking: rather than focusing on optimizing the performance of the algorithm alone, the goal is to build an algorithm that maximizes the performance of the joint human-algorithm system.
For multi-armed bandits, the standard objective is to minimize expected regret, the amount of reward that is missed by not selecting the optimal arm. In human-algorithm multi-armed bandits, some aspects (such as the behavior of the human) are entirely out of our control, and so the system cannot always match the guarantees of a standard bandit algorithm.

In Section 2, we discuss how our setting and results relate to previous literature in MAB and in human-algorithm collaboration. In Section 3, we formalize the model that we analyze, including multiple different models of human behavior. Section 4 contains theoretical results, such as bounds on expected regret. Specifically, we show that, so long as the human is not completely myopic (i.e., has some weak preference for exploring arms that have not been frequently pulled), sublinear regret is achievable. If the human is myopic, then it is unavoidable that the regret includes a linear dependence on time. Section 5 enriches these theoretical results with experimental simulations. These results show that if the human is more myopic than the algorithm, overall regret decreases as more arms are shown to the human. On the other hand, if the human is less myopic, the opposite is true, and regret increases as more arms are shown. Finally, in Section 6 we briefly discuss implications of our work and potential future directions.
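As a concrete illustration, the following Python sketch simulates one simple instantiation of the joint system: the algorithm shortlists a few arms by UCB1 scores, and the human then picks from the shortlist according to their own objective, either myopically (highest empirical mean) or adventurously (fewest pulls so far). These two behavioral rules and the shortlist size are illustrative assumptions, not the exact models analyzed in this paper; regret is measured as the standard pseudo-regret, the sum over rounds of the gap between the best arm's mean and the chosen arm's mean.

```python
import math
import random

def simulate(horizon=5000, k_shown=3, human="myopic", seed=0):
    """Simulate a joint human-algorithm bandit interaction.

    Each round, the algorithm shortlists `k_shown` arms by UCB1 score;
    the human picks one arm from the shortlist according to a simple
    objective ("myopic" = highest empirical mean, "adventurous" =
    fewest pulls so far). Returns cumulative pseudo-regret.
    """
    rng = random.Random(seed)
    means = [0.9, 0.8, 0.7, 0.6, 0.5]   # true Bernoulli arm means (toy instance)
    n = len(means)
    pulls = [0] * n
    total = [0.0] * n
    regret = 0.0

    for t in range(1, horizon + 1):
        # Algorithm: rank arms by UCB1 score and show the top k_shown.
        def ucb(i):
            if pulls[i] == 0:
                return float("inf")     # force initial exploration
            return total[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
        shown = sorted(range(n), key=ucb, reverse=True)[:k_shown]

        # Human: choose from the shortlist by their own objective.
        if human == "myopic":
            arm = max(shown, key=lambda i: total[i] / pulls[i] if pulls[i] else 0.0)
        else:  # "adventurous": prefer the least-explored shortlisted arm
            arm = min(shown, key=lambda i: pulls[i])

        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        total[arm] += reward
        regret += max(means) - means[arm]   # pseudo-regret increment
    return regret

# Compare how cumulative regret depends on the human's objective.
for style in ("myopic", "adventurous"):
    print(style, round(simulate(human=style), 1))
```

Varying `k_shown` in this sketch mirrors the experiments described above: the shortlist size controls how much the human's objective, rather than the algorithm's, drives exploration.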

2. RELATED WORK

MULTI-ARMED BANDITS

The area of multi-armed bandits is wide enough to admit multiple textbooks Slivkins et al. (2019); Lattimore & Szepesvári (2020). In this section, we highlight some of the most closely related papers. Yue et al. (2012) proposed "dueling bandits", where multiple arms are presented simultaneously and the feedback is a noisy binary signal as to which has the higher reward. Since then, there have been numerous extensions Sui et al. (2018; 2017b); Komiyama et al. (2015), such as those that allow more than 2 arms to be presented Saha & Gopalan (2018); Agarwal et al. (2020); Sui et al. (2017a). Shivaswamy & Joachims (2015) studies a related problem where the task of the algorithm is to rank a set of items. The human then improves the ranking according to their true utility function, but with some bounded degree of improvement reflecting limits on human rationality. One major difference between dueling bandits and our framework is that we assume feedback is given by a human who is learning about the rewards of the arms themselves, whereas dueling bandits typically assumes that preferences between the arms are fixed. Additionally, dueling bandits typically involves boolean feedback, whereas we allow real-valued access to the rewards.

There has also been a series of works looking more specifically at human-algorithm collaboration in bandit settings. Gao et al. (2021) learns from batched historical human data to develop an algorithm that assigns each task at test time to either itself or a human. Chan et al. (2019) studies a setting similar to ours in that the human is simultaneously learning which option is best for them. However, their framework allows the algorithm to overrule the human, which makes sense in many settings, but not all, such as our motivating example of driving directions. Bordt & Von Luxburg (2022) formalizes the problem as a two-player setting where both the human and the algorithm take actions that affect the reward both experience.
Additionally, some work has used the framework of the human as the final decision-maker and studied how to disclose information so as to incentivize them to take the "right" action. Immorlica et al. (2018) studies how to match the best regret in a setting where myopic humans pull the final arm. Hu et al. (2022) studies a related problem with combinatorial bandits, where the goal is to select a subset of the total arms to pull. Bastani et al. (2022) investigates a more applied setting where each human is a potential customer who will become disengaged and leave if they are suggested products (arms) that are a sufficiently poor fit. Kannan et al. (2017) looks at a similar model of sellers considering sequential clients, specifically investigating questions of fairness. In general, these works differ from ours in that they assume a new human arrives at each time step, and so the algorithm is able to selectively disclose information to them. In our setting, the human may be the same between time steps, and we typically assume that they have access to the same information as the algorithm.

HUMAN-ALGORITHM COLLABORATION

Studying human-algorithm collaboration is a rapidly growing, highly interdisciplinary area of research. In general, most work focuses on offline learning settings, which differs from our online MAB setting.




