COMBINATORIAL PURE EXPLORATION OF CAUSAL BANDITS

Abstract

The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we choose a subset of variables to intervene on (or perform no intervention) and observe the random outcomes of all random variables, with the goal of using as few rounds as possible to output an intervention whose expected outcome on the reward variable Y is best (or almost best) with probability at least 1 - δ, where δ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms for two types of causal models: the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first designed specifically for this setting and achieves polynomial sample complexity, whereas all existing algorithms for general graphs either have sample complexity exponential in the graph size or rely on unreasonable assumptions. For general graphs, our algorithm provides a significant improvement in sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve this improvement through a novel integration of prior causal bandit algorithms and prior adaptive pure exploration algorithms: the former utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter have the reverse issue.

1. INTRODUCTION

Stochastic multi-armed bandits (MAB) is a classical framework in sequential decision making (Robbins, 1952). In each round, a learner selects one arm based on the reward feedback from previous rounds and receives a random reward of the selected arm sampled from an unknown distribution, with the goal of accumulating as much reward as possible. Pure exploration is an important variant of the multi-armed bandit problem, where the goal is not to accumulate reward but to identify the best arm through possibly adaptive exploration of arms. Causal bandits, first introduced by Lattimore et al. (2016), integrate causal inference (Pearl, 2009) with multi-armed bandits. In causal bandits, we have a causal graph structure G = (X ∪ {Y} ∪ U, E), where X ∪ {Y} are observable causal variables with Y being a special reward variable, U are unobserved hidden variables, and E is the set of causal edges between pairs of variables. For simplicity, we consider binary variables in this paper. The arms are interventions on variables S ⊆ X together with the null intervention (natural observation), i.e., the action set is A ⊆ {a = do(S = s) | S ⊆ X, s ∈ {0, 1}^{|S|}} with do() ∈ A, where do(S = s) is the standard notation for intervening on the causal graph by setting S to s (Pearl, 2009), and do() denotes the null intervention. The reward of an action a is the random outcome of Y, and thus the expected reward is E[Y | a = do(S = s)]. In each round, one action in A is played, and the random outcomes of all variables in X ∪ {Y} are observed. Given the causal graph G but without knowing the distributions among nodes, the task of combinatorial pure exploration (CPE) of causal bandits is to select actions in each round and observe the feedback from all observable random variables, so that in the end the learner can identify the best or nearly best action. Causal bandits are useful in many real scenarios.
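To make the action set and feedback model concrete, the following is a minimal simulation sketch of a toy two-variable parallel-graph instance with no hidden variables. The mechanism probabilities and the `pull` helper are illustrative inventions, not part of the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: a parallel causal graph with two binary
# parents X1, X2 of the reward variable Y and no hidden variables.
P_X = np.array([0.5, 0.2])               # marginal P(X_i = 1) under do()
P_Y_given_X = {                          # P(Y = 1 | X1 = x1, X2 = x2)
    (0, 0): 0.1, (0, 1): 0.6,
    (1, 0): 0.4, (1, 1): 0.9,
}

def pull(action, n=1):
    """Play an arm n times; each round reveals all of (X1, X2, Y).

    `action` is None for the null intervention do(), or a dict such as
    {1: 1} for do(X2 = 1); intervened variables override their mechanisms.
    """
    out = []
    for _ in range(n):
        x = (rng.random(2) < P_X).astype(int)
        if action:
            for i, v in action.items():
                x[i] = v                 # the intervention overrides the mechanism
        y = int(rng.random() < P_Y_given_X[(x[0], x[1])])
        out.append((x[0], x[1], y))
    return out

# E[Y | do(X2 = 1)] = sum over x1 of P(x1) * P(Y = 1 | x1, X2 = 1)
mu = (1 - P_X[0]) * P_Y_given_X[(0, 1)] + P_X[0] * P_Y_given_X[(1, 1)]
est = np.mean([y for _, _, y in pull({1: 1}, n=20000)])
```

Note that every round, interventional or not, reveals the outcomes of all variables, which is the feedback structure that the algorithms in this paper exploit.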
In drug testing, physicians want to adjust the dosage of particular drugs to treat a patient. In policy design, policy-makers select different actions to reduce the spread of disease. Existing studies on CPE of causal bandits either require the knowledge of P(Pa(Y) | a) for every action a or consider only causal graphs without hidden variables, and the algorithms proposed are not fully adaptive to reward gaps (Lattimore et al., 2016; Yabe et al., 2018). In this paper, we study fully adaptive pure exploration algorithms and analyze their gap-dependent sample complexity bounds in the fixed-confidence setting. More specifically, given a confidence bound δ ∈ (0, 1) and an error bound ε, we aim at designing adaptive algorithms that output an action such that, with probability at least 1 - δ, the expected reward difference between the output and the optimal action is at most ε. The algorithms should be fully adaptive in the following two senses. First, they should adapt to the reward gaps between suboptimal and optimal actions, as in existing adaptive pure exploration bandit algorithms, so that actions with larger gaps are explored less. Second, they should adapt to the observational data from causal bandit feedback, as in existing causal bandit algorithms, so that actions with enough observations need no further interventional rounds for exploration. We integrate both types of adaptivity into one algorithmic framework, and through the interaction between the two aspects, we achieve better adaptivity than either of them alone. First, we introduce a term named the gap-dependent observation threshold, a nontrivial gap-dependent extension of a similar term in Lattimore et al. (2016). Then we provide two algorithms, one for the binary generalized linear model (BGLM) and one for the general model with hidden variables.
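The second kind of adaptivity can be illustrated on a confounder-free parallel graph, where E[Y | do(X_i = x)] = E[Y | X_i = x], so observational rounds alone estimate the means of frequently occurring arms. The data-generating numbers and the `obs_estimate` helper below are hypothetical, and the hard-coded threshold stands in for the gap-dependent observation threshold analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observational data: n_obs rounds of do() on a parallel graph
# with two binary parents of Y and no hidden confounders.
n_obs = 5000
X = (rng.random((n_obs, 2)) < np.array([0.5, 0.05])).astype(int)
Y = (rng.random(n_obs) < 0.2 + 0.5 * X[:, 0] * X[:, 1]).astype(int)

def obs_estimate(i, x):
    """Mean of Y over observational rounds where X_i = x, plus the count.

    With no confounding, this mean estimates E[Y | do(X_i = x)]; a pure
    exploration algorithm can skip interventional rounds for this arm
    once the count exceeds its observation threshold.
    """
    mask = X[:, i] == x
    return float(Y[mask].mean()), int(mask.sum())

mu_hat, count = obs_estimate(1, 1)   # X2 = 1 occurs in only ~5% of rounds,
                                     # so do(X2 = 1) still needs interventions,
                                     # while do(X2 = 0) is already well estimated
```

This is exactly where the two forms of adaptivity interact: arms with large gaps or plentiful observations both exit exploration early.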
The sample complexity of both algorithms contains the gap-dependent observation threshold that we introduce, which yields a significant improvement over prior work. In particular, our algorithm for BGLM achieves sample complexity polynomial in the graph size, whereas all prior algorithms for general graphs have exponential sample complexity; and our algorithm for general graphs nearly matches a lower bound that we prove in this paper. To the best of our knowledge, ours is the first work to consider a CPE algorithm specifically designed for BGLM, and the first to consider CPE on graphs with hidden variables, while all prior studies either assume no hidden variables or assume knowledge of the distribution P(Pa(Y) | a) for the parents Pa(Y) of the reward variable under every action a, which is not a reasonable assumption. To summarize, our contribution is to propose the first set of CPE algorithms on causal graphs with hidden variables that are fully adaptive to both the reward gaps and the observational causal data. The algorithm for BGLM is the first to achieve a gap-dependent sample complexity polynomial in the graph size, while the algorithm for general graphs improves significantly on sample complexity and nearly matches a lower bound. Due to space constraints, further materials, including experimental results, an algorithm for the fixed-budget setting, and all proofs, are deferred to the appendix.

2. RELATED WORK

Pure exploration of multi-armed bandits has been extensively studied in the fixed-confidence and fixed-budget settings (Audibert et al., 2010; Kalyanakrishnan et al., 2012; Jamieson et al., 2013; Jamieson & Nowak, 2014). PAC pure exploration is a generalized setting that aims to find an ε-optimal arm instead of the exactly optimal arm (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004). In this paper, we utilize the adaptive LUCB algorithm of Kalyanakrishnan et al. (2012). CPE has also been studied for multi-armed bandits and linear bandits (Karnin et al., 2013b; Chen et al., 2014; Du et al., 2021), but the feedback model in these studies either provides feedback at the base-arm level or provides full or partial bandit feedback, all of which differ from the causal bandit feedback.

Causal bandits were proposed by Lattimore et al. (2016), who discuss the simple regret for parallel graphs and for general graphs with known probability distributions P(Pa(Y) | a). Nair et al. (2021) extend algorithms on parallel graphs to graphs without back-door paths, and Maiti et al. (2021) extend the results to general graphs. All of them either regard P(Pa(Y) | a) as prior knowledge or consider only atomic interventions. The study by Yabe et al. (2018) is the only one considering general graphs with a combinatorial action set, but their algorithm cannot work on causal graphs with hidden variables. All the above pure exploration studies consider the simple-regret criterion, which is not gap-dependent. Cumulative regret is considered in (Lu et al., 2020; Nair et al., 2021; Maiti et al., 2021). To the best of our knowledge, Sen et al. (2017) is the only work discussing a gap-dependent bound for pure exploration of causal bandits in the fixed-budget setting, but it only considers soft interventions (changing the conditional distribution P(X | Pa(X))) on a single node, which differs from causal bandits as defined in Lattimore et al. (2016).
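The LUCB approach of Kalyanakrishnan et al. (2012) referenced above can be sketched on plain Bernoulli arms as follows. The arm means, confidence-radius constants, and variable names are illustrative only; the actual analyses use carefully tuned radii, and our algorithms replace direct arm pulls with causal-bandit feedback.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Illustrative instance: three Bernoulli arms with these (unknown) means.
means = [0.7, 0.5, 0.4]
delta = 0.05

counts = np.ones(len(means))             # one initial pull per arm
sums = np.array([float(rng.random() < m) for m in means])

def radius(n, t):
    # A common anytime confidence radius; the constants vary across analyses.
    return math.sqrt(math.log(4 * len(means) * t * t / delta) / (2 * n))

t = len(means)
while True:
    t += 1
    mu = sums / counts
    best = int(np.argmax(mu))            # empirically best arm
    ucb = mu + np.array([radius(n, t) for n in counts])
    ucb[best] = -np.inf
    chal = int(np.argmax(ucb))           # challenger: highest UCB among the rest
    lcb_best = mu[best] - radius(counts[best], t)
    if lcb_best >= ucb[chal]:            # confidence intervals separate: stop
        break
    for i in (best, chal):               # otherwise pull both contenders
        counts[i] += 1
        sums[i] += float(rng.random() < means[i])

# `best` is the identified arm, correct with probability at least 1 - delta
```

The key property for our purposes is that the number of pulls of each suboptimal arm scales with the inverse square of its gap, which is the gap-adaptivity our framework combines with observational feedback.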

