COMBINATORIAL PURE EXPLORATION OF CAUSAL BANDITS

Abstract

The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we choose a subset of variables to intervene on, or perform no intervention, and observe the random outcomes of all random variables. The goal is that, using as few rounds as possible, we can output an intervention that gives the best (or almost best) expected outcome on the reward variable Y with probability at least 1 − δ, where δ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms on two types of causal models: the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first designed specifically for this setting and achieves polynomial sample complexity, whereas all existing algorithms for general graphs either have sample complexity exponential in the graph size or rely on unreasonable assumptions. For general graphs, our algorithm provides a significant improvement in sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve this improvement by a novel integration of prior causal bandit algorithms with prior adaptive pure exploration algorithms: the former utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter have the opposite issue.

1. INTRODUCTION

Stochastic multi-armed bandits (MAB) is a classical framework in sequential decision making (Robbins, 1952). In each round, a learner selects one arm based on the reward feedback from previous rounds and receives a random reward of the selected arm, sampled from an unknown distribution, with the goal of accumulating as much reward as possible. Pure exploration is an important variant of the multi-armed bandit problem, where the goal is not to accumulate reward but to identify the best arm through possibly adaptive exploration of arms. Causal bandits, first introduced by Lattimore et al. (2016), integrate causal inference (Pearl, 2009) with multi-armed bandits. In causal bandits, we have a causal graph structure G = (X ∪ {Y} ∪ U, E), where X ∪ {Y} are observable causal variables with Y being a special reward variable, U is the set of unobserved hidden variables, and E is the set of causal edges between pairs of variables. For simplicity, we consider binary variables in this paper. The arms are interventions on subsets of variables S ⊆ X, together with the null intervention (natural observation), i.e. the action set is

A = {do(S = s) | S ⊆ X, s ∈ {0, 1}^{|S|}} ∪ {do()},

where do(S = s) is the standard notation for intervening on the causal graph by setting S to s (Pearl, 2009), and do() denotes the null intervention. In each round, one action in A is played, and the random outcomes of all variables in X ∪ {Y} are observed. Given the causal graph G but without knowing the distributions among the nodes, the task of combinatorial pure exploration (CPE) of causal bandits is to identify, using as few rounds as possible, the action in A with the best (or nearly best) expected reward. The reward of an action a = do(S = s) is the random outcome of Y, and thus its expected reward is E[Y | do(S = s)].
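To make this setup concrete, the following minimal Python sketch simulates a toy binary causal graph, enumerates the action set A (all interventions do(S = s) for S ⊆ X plus the null intervention do()), and estimates E[Y | do(S = s)] for each action by Monte-Carlo sampling. The graph structure and all conditional probabilities here are arbitrary choices for illustration, not taken from the paper, and real pure exploration algorithms would allocate samples adaptively rather than uniformly.

```python
import itertools
import random

random.seed(0)  # for reproducibility of this illustrative run

def sample(do=None):
    """Sample all variables of a toy chain graph X1 -> X2 -> Y once.

    `do` maps intervened variable names to fixed values; intervened
    variables ignore their parents, per the do() semantics.
    """
    do = do or {}
    x1 = do.get("X1", int(random.random() < 0.5))
    # X2 depends on its parent X1 unless it is intervened on.
    x2 = do.get("X2", int(random.random() < (0.8 if x1 else 0.2)))
    # Y is the reward variable, depending on X2 (Y is never intervened on).
    y = int(random.random() < (0.9 if x2 else 0.1))
    return {"X1": x1, "X2": x2, "Y": y}

def estimate_reward(do, rounds=2000):
    """Monte-Carlo estimate of E[Y | do(S = s)]."""
    return sum(sample(do)["Y"] for _ in range(rounds)) / rounds

# Build the action set A: every do(S = s) with S a subset of {X1, X2},
# including the empty subset, which is the null intervention do().
variables = ["X1", "X2"]
actions = [dict(zip(subset, values))
           for r in range(len(variables) + 1)
           for subset in itertools.combinations(variables, r)
           for values in itertools.product([0, 1], repeat=r)]

# Naive (non-adaptive) best-intervention identification: sample each
# action uniformly and return the one with the highest estimated reward.
best = max(actions, key=estimate_reward)
```

In this toy graph, every action that sets X2 = 1 yields E[Y] = 0.9, while the best action not fixing X2 achieves only 0.74, so the uniform-sampling baseline reliably identifies an optimal intervention; the point of the gap-dependent adaptive algorithms discussed in this paper is to reach the same conclusion with far fewer samples.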

