BAYESIAN OPTIMAL EXPERIMENTAL DESIGN FOR THE SURVEY BANDIT SETTING

Anonymous

Abstract

The contextual bandit is a classic problem in sequential decision making under uncertainty with broad applications in precision medicine, personalized education, and drug discovery. In this problem, a decision maker repeatedly receives a context, takes an action, and then observes an associated outcome, with the goal of choosing actions that achieve minimal regret. However, in many settings the context is not given, and the decision maker must instead collect information to infer a context before proceeding. For example, when a doctor has no prior information about a patient, they might ask a sequence of questions before recommending a medical treatment. In this paper, we develop methods for this setting, which we refer to as the survey bandit, where the decision maker is not given access to the context but can ask a finite sequence of questions to gain information about the context before taking an action and observing an outcome. Using insights from Bayesian optimal experimental design (BOED) and decision-theoretic information theory, we view the interaction with each user as a BOED task, where the goal is to ask a sequence of questions that elicit the most information about the optimal action for that user. Our procedure is agnostic to the choice of probabilistic model, and we demonstrate its usefulness for several common classes of distributions. Our algorithm achieves significantly better performance on both synthetic and real data than existing baseline methods, while remaining statistically efficient, interpretable, and computationally lightweight.

1. INTRODUCTION

In many sequential decision making applications, a decision maker faces a sequence of users, for each of whom they must choose an action and then observe an outcome. Each user has a context vector (i.e. a set of features), which in many cases is not known a priori to the decision maker.¹ The context is needed to choose an action that yields a good outcome, but acquiring it can be expensive or time-consuming. We refer to this setting as the survey bandit, which has previously been studied by Krishnamurthy & Athey (2020). One example arises in personalized medicine: a physician faces a sequence of patients and can ask each a few questions before recommending a treatment (Yao et al., 2021; Tomkins et al., 2021). Another example is found in education: during office hours, a professor faces a sequence of students and can ask each a few questions before recommending an exercise or a reading. Last but not least, the survey bandit setting also finds application in drug and materials discovery: during virtual screening, a chemist faces a large set of molecular structures and can perform a finite set of tests on each candidate (e.g. a DFT calculation or molecular docking) before deciding whether or not it should advance to the next phase of the study (Kitchen et al., 2004; Bengio et al., 2021; Svensson et al., 2022). Users' context features are usually not independent, so good decision making can be achieved even when only a small part of the context is observed. For example, in a series of questions related to political leaning, if the decision maker observes that a user prefers to watch Fox News, they may not need to ask whether the user identifies as a conservative. Suppose the decision maker can sequentially ask a few questions before recommending a treatment: what questions should they ask?
One way to tackle this problem is to view querying answers from the user as a feature selection problem, where only a small subset of features is useful for predicting the outcome (Bastani & Bayati, 2020). Taking this view, Krishnamurthy & Athey (2020) treat querying answers as feature selection via ridge regression. Under a linear payoff assumption, their RidgeUCB algorithm further assumes knowledge of a threshold β_min such that features with a ridge regression coefficient below this threshold have no impact on the outcome and hence can be ignored. Although this assumption is intuitive, RidgeUCB can be brittle when it is violated, and in practice it is unclear how to set β_min without knowing the strength of the relationship between contexts and outcomes. Taking a similar perspective, Bouneffouf et al. (2017) view the question phase of the survey bandit setting as a feature selection problem and propose the Contextual Bandit with Restricted Context algorithm as a solution. The feature selection view of the survey bandit can additionally introduce a challenging combinatorial search problem (i.e. choosing an optimal subset of the features). This paper takes an alternative point of view. We exploit the ability to sequentially query features and receive a signal from the user in the survey phase to adaptively ask the most informative question, following insights from Bayesian optimal experimental design (BOED) (Chaloner & Verdinelli, 1995; Ryan et al., 2016) and decision-theoretic information theory (DeGroot, 1962; Rao, 1984; Neiswanger et al., 2022). This approach treats the question phase for each user as a BOED problem, where the goal is to ask the most informative question, while the treatment phase can be formulated as a contextual bandit problem.
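To make the feature-selection view concrete, the following is a minimal sketch of thresholding ridge regression coefficients at a cutoff β_min. This is an illustration of the general idea, not the RidgeUCB algorithm itself; the data, regularization strength, and threshold value are all assumptions chosen for the example.

```python
import numpy as np

def ridge_select(X, y, lam=1.0, beta_min=0.5):
    """Fit ridge regression in closed form and keep only the features whose
    coefficient magnitude exceeds the threshold beta_min."""
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
    beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return beta, np.flatnonzero(np.abs(beta) >= beta_min)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([2.0, 0.0, -1.5, 0.0, 0.0])  # only features 0 and 2 matter
y = X @ true_beta + 0.1 * rng.normal(size=200)
beta, selected = ridge_select(X, y)
print(selected)  # features 0 and 2 survive the threshold
```

The brittleness discussed above corresponds to the choice of `beta_min`: set it too high and genuinely predictive features are discarded, too low and spurious ones are kept.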
Instead of eliminating features believed to be unimportant, our approach models the dependencies between features by leveraging probabilistic modeling and approximate inference in order to carry out a sequence of decision making tasks. To the best of our knowledge, this hybrid approach combining BOED and contextual bandits has not yet been explored for the survey bandit setting. In full, our method takes advantage of a recently developed decision-theoretic BOED approach, which allows us to identify the question that, in expectation, elicits the most information about the best action for a given user. We conduct experiments on synthetic and real datasets in the survey bandit setting and show strong performance relative to a number of baselines. Our method has close connections with a variety of algorithms for decision making under uncertainty, such as Bayesian optimization, active learning, and contextual bandits. All implementations will be made publicly available.
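As a toy illustration of modeling dependencies between features (a sketch under strong simplifying assumptions, not our full method), consider a joint Gaussian belief over two correlated answers, mirroring the Fox News example from the introduction: observing one answer both shifts and narrows the belief about the other, so the second question may not need to be asked.

```python
import numpy as np

# Hypothetical joint Gaussian belief over two standardized answers:
# "watches Fox News" (x1) and "identifies as conservative" (x2),
# assumed strongly correlated.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

def condition_gaussian(mu, Sigma, idx_obs, x_obs):
    """Gaussian posterior over the unobserved coordinates given observations."""
    idx_un = [i for i in range(len(mu)) if i not in idx_obs]
    S_oo = Sigma[np.ix_(idx_obs, idx_obs)]
    S_uo = Sigma[np.ix_(idx_un, idx_obs)]
    S_uu = Sigma[np.ix_(idx_un, idx_un)]
    K = S_uo @ np.linalg.inv(S_oo)          # regression coefficients
    mu_post = mu[idx_un] + K @ (x_obs - mu[idx_obs])
    Sigma_post = S_uu - K @ S_uo.T
    return mu_post, Sigma_post

# Observing x1 = 1 shifts the belief about x2 from 0.0 to 0.9 and shrinks
# its variance from 1.0 to 0.19.
mu_post, Sigma_post = condition_gaussian(mu, Sigma, [0], np.array([1.0]))
print(mu_post, Sigma_post)
```

The same conditioning operation is what allows a probabilistic model to exploit answer correlations rather than discard features outright.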

2. DECISION-THEORETIC ENTROPY SEARCH FOR THE SURVEY BANDIT

[Figure 1: Each row of the matrix is a user (1, ..., U); the columns are the Q questions, with answers y_a, and the T treatments, with outcomes y_o. Shaded and unshaded cells are observed and unobserved answers/outcomes, respectively. Here, Q_allow = 4 is shown.]

We start by establishing some notation for the survey bandit setting. Facing a sequence of U total users, a decision maker can ask each user a fixed number of questions, observe their answers, recommend a treatment, and observe a corresponding outcome. We assume there are a total of Q questions and T treatments that the decision maker can choose from. For each user, the number of questions that the decision maker can ask, denoted Q_allow, is assumed to be given. Associated with each user is a vector y ∼ p(y), y ∈ R^(Q+T), which can be partitioned into an answer vector y_a = [y_a,1, ..., y_a,Q] and an outcome vector y_o = [y_o,1, ..., y_o,T]. We can also partition y into an observed vector y_obs and an unobserved vector y_unobs. The decision maker's goal is to ask each user informative questions that reveal the treatment with the highest outcome, in expectation. A graphical illustration of the problem setting is given in Figure 1.
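As a sketch of this notation (with toy sizes Q = 4 and T = 3 chosen purely for illustration), the partitions of y can be expressed with simple index masks:

```python
import numpy as np

Q, T = 4, 3                        # toy numbers of questions and treatments
y = np.arange(Q + T, dtype=float)  # one user's full vector y in R^(Q+T)
y_a, y_o = y[:Q], y[Q:]            # answer and outcome partitions

# After asking Q_allow = 2 questions (questions 1 and 3) and observing
# treatment 2's outcome, a boolean mask splits y into observed and
# unobserved parts.
observed = np.zeros(Q + T, dtype=bool)
observed[[0, 2]] = True        # answers y_a,1 and y_a,3
observed[Q + 1] = True         # outcome y_o,2
y_obs, y_unobs = y[observed], y[~observed]
print(y_obs)  # [0. 2. 5.]
```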

2.1. DECISION-THEORETIC ENTROPY SEARCH

In this subsection, we assume that the joint distribution p(y_a, y_o) is given; in the following section we will discuss how to estimate this distribution from observed data. Given this joint density, the decision maker should ask the question that elicits the most information - i.e. the greatest reduction in posterior uncertainty - about the best treatment, on average. One notion of mutual information between an answer and the best treatment can be built from the H_{r,A}-entropy, a decision-theoretic notion of uncertainty (DeGroot, 1962; Rao, 1984; Neiswanger et al., 2022). Suppose we have a Bayesian model for a parameter ϕ ∈ Φ. The H_{r,A}-entropy is parameterized by an action set A and a reward function r, and for a prior distribution p(ϕ) it is defined as

H_{r,A}[p(ϕ)] = - sup_{a ∈ A} E_{p(ϕ)}[r(a, ϕ)],

i.e. the negative of the highest expected reward attainable under p(ϕ). A belief that concentrates on the best action has low H_{r,A}-entropy, so the informativeness of a question can be measured by the expected reduction in this entropy induced by its answer.
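The H_{r,A}-entropy of a belief p(ϕ) - the negative of the best expected reward over actions in A - can be estimated by simple Monte Carlo. The sketch below is illustrative rather than our full implementation: as assumptions for the example, ϕ is taken to be the index of the best treatment and r is a 0-1 reward. It shows that a belief concentrated on one treatment has lower entropy than a uniform one.

```python
import numpy as np

def h_entropy(phi_samples, reward_fn, actions):
    """Monte Carlo estimate of the H_{r,A}-entropy of a belief p(phi):
    the negative of the best expected reward attainable over actions in A."""
    expected_rewards = [
        np.mean([reward_fn(a, phi) for phi in phi_samples]) for a in actions
    ]
    return -max(expected_rewards)

rng = np.random.default_rng(0)
actions = [0, 1, 2]
# 0-1 reward: 1 if the chosen treatment equals the (unknown) best one, phi.
reward = lambda a, phi: 1.0 if a == phi else 0.0

# A peaked belief (treatment 0 is probably best) vs. a uniform belief.
peaked = h_entropy(rng.choice(3, size=1000, p=[0.8, 0.1, 0.1]), reward, actions)
uniform = h_entropy(rng.choice(3, size=1000, p=[1/3, 1/3, 1/3]), reward, actions)
print(peaked, uniform)  # the peaked belief has lower (more negative) entropy
```

Estimating the expected drop in this quantity for each candidate question is the basis of the question-selection rule.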
¹ This setting also includes cases where the decision maker has partial information, such as a prior belief about the users' contexts, before asking questions.

