BAYESIAN OPTIMAL EXPERIMENTAL DESIGN FOR THE SURVEY BANDIT SETTING

Anonymous

Abstract

The contextual bandit is a classic problem in sequential decision making under uncertainty that finds broad application in precision medicine, personalized education, and drug discovery. Here, a decision maker repeatedly receives a context, takes an action, and then observes an associated outcome, with the goal of choosing actions that minimize regret. However, in many settings the context is not given, and the decision maker must instead collect information to infer a context before proceeding. For example, when a doctor has no prior information about a patient, they might ask a sequence of questions before recommending a medical treatment. In this paper, we develop methods for this setting, which we refer to as the survey bandit: the decision maker is not given access to the context but can ask a finite sequence of questions to gain information about it before taking an action and observing an outcome. Using insights from Bayesian optimal experimental design (BOED) and decision-theoretic information theory, we view the interaction with each user as a BOED task, where the goal is to ask the sequence of questions that elicits the most information about the optimal action for that user. Our procedure is agnostic to the choice of probabilistic model, and we demonstrate its usefulness for several common classes of distributions. Our algorithm achieves significantly better performance than existing baseline methods on both synthetic and real data while remaining statistically efficient, interpretable, and computationally friendly.
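To make the interaction protocol concrete, the following is a minimal sketch of the survey bandit loop described above: for each user, the decision maker asks a budgeted sequence of questions, recommends an action, and observes an outcome. All names here (`RandomSurveyAgent`, the question/answer interface, `run_survey_bandit`) are illustrative assumptions, not the paper's actual implementation; in particular, the placeholder agent asks questions at random, whereas the proposed method would select each question to maximize information about the optimal action.

```python
import random

class RandomSurveyAgent:
    """Placeholder policy: asks unasked questions at random, then acts at random.

    The paper's BOED approach would replace select_question with an
    information-gain criterion and update with a posterior update.
    """

    def __init__(self, n_questions, n_actions, budget):
        self.n_questions = n_questions
        self.n_actions = n_actions
        self.budget = budget  # maximum number of questions per user

    def select_question(self, answers_so_far):
        # Pick any question that has not yet been asked of this user.
        unasked = [q for q in range((self.n_questions)) if q not in answers_so_far]
        return random.choice(unasked)

    def select_action(self, answers_so_far):
        return random.randrange(self.n_actions)

    def update(self, answers, action, outcome):
        pass  # a Bayesian agent would update its beliefs here


def run_survey_bandit(agent, users, answer_fn, outcome_fn):
    """Interact with a sequence of users: survey, then act, then observe."""
    outcomes = []
    for user in users:
        answers = {}
        for _ in range(agent.budget):
            q = agent.select_question(answers)
            answers[q] = answer_fn(user, q)   # user answers question q
        a = agent.select_action(answers)      # recommend an action
        r = outcome_fn(user, a)               # observe the associated outcome
        agent.update(answers, a, r)
        outcomes.append(r)
    return outcomes
```

A simulated run would supply `answer_fn` and `outcome_fn` that encode a hypothetical population of users; the sketch only fixes the order of operations (questions before action, action before outcome), which is the defining feature of the setting.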

1. INTRODUCTION

In many sequential decision making applications, a decision maker faces a sequence of users, for each of which they must choose an action and then observe an outcome. Each user has a context vector (i.e., a set of features), which in many cases is not known a priori to the decision maker¹. The context is needed to choose an action that yields a good outcome, but acquiring it can be expensive or time-consuming. We refer to this setting as the survey bandit, which has previously been studied by Krishnamurthy & Athey (2020). One example of this setting arises in personalized medicine: a physician faces a sequence of patients and can ask each a few questions before recommending a final treatment (Yao et al., 2021; Tomkins et al., 2021). Another example can be found in education: during office hours, a professor faces a sequence of students and can ask each a few questions before recommending an exercise or a reading. Last but not least, the survey bandit setting also finds application in drug and material discovery: during virtual screening, a chemist faces a large set of molecular structures and can perform a finite set of tests on each candidate (e.g., a DFT calculation or molecular docking) before deciding whether or not it should go on to the next phase of the study (Kitchen et al., 2004; Bengio et al., 2021; Svensson et al., 2022). Users' context features are usually not independent, so good decision making can be achieved even when only a small part of the context is observed. For example, in a series of questions related to political leaning, if the decision maker observes that a user prefers to watch Fox News, they may not need to ask whether the user identifies as conservative. Suppose the decision maker can sequentially ask a few questions before recommending a treatment: what questions should they ask? One way to tackle this problem is to view querying



¹ This setting also includes cases where the decision maker has partial information, such as a prior belief about the users' contexts, before asking questions.

