COMBINATORIAL PURE EXPLORATION OF CAUSAL BANDITS

Abstract

The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we choose a subset of variables to intervene on or perform no intervention, and observe the random outcomes of all random variables, with the goal that using as few rounds as possible, we can output an intervention that gives the best (or almost best) expected outcome on the reward variable Y with probability at least 1 − δ, where δ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms on two types of causal models: the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first to be designed specifically for this setting and achieves polynomial sample complexity, whereas all existing algorithms for general graphs either have sample complexity exponential in the graph size or rely on unreasonable assumptions. For general graphs, our algorithm provides a significant improvement in sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve this improvement by a novel integration of prior causal bandit algorithms and prior adaptive pure exploration algorithms: the former utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter have the reverse issue.

1. INTRODUCTION

Stochastic multi-armed bandits (MAB) is a classical framework in sequential decision making (Robbins, 1952). In each round, a learner selects one arm based on the reward feedback from the previous rounds and receives a random reward of the selected arm sampled from an unknown distribution, with the goal of accumulating as much reward as possible. Pure exploration is an important variant of the multi-armed bandit problem, where the goal is not to accumulate reward but to identify the best arm through possibly adaptive explorations of arms. Causal bandits, first introduced by Lattimore et al. (2016), integrate causal inference (Pearl, 2009) with multi-armed bandits. In causal bandits, we have a causal graph structure G = (X ∪ {Y} ∪ U, E), where X ∪ {Y} are observable causal variables with Y being a special reward variable, U are unobserved hidden variables, and E is the set of causal edges between pairs of variables. For simplicity, we consider binary variables in this paper. The arms are the interventions on variables S ⊆ X together with the null intervention (natural observation); i.e., the action set is A ⊆ {a = do(S = s) | S ⊆ X, s ∈ {0, 1}^{|S|}} with do() ∈ A, where do(S = s) is the standard notation for intervening on the causal graph by setting S to s (Pearl, 2009), and do() denotes the null intervention. The reward of an action a is the random outcome of Y, and thus the expected reward is E[Y | a = do(S = s)]. In each round, one action in A is played, and the random outcomes of all variables in X ∪ {Y} are observed. Given the causal graph G but without knowing the distributions among its nodes, the task of combinatorial pure exploration (CPE) of causal bandits is to select actions in each round and observe the feedback from all observable random variables, so that in the end the learner can identify the best or a nearly best action. Causal bandits are useful in many real scenarios.
In drug testing, physicians want to adjust the dosages of particular drugs to treat a patient. In policy design, policy-makers select different actions to reduce the spread of a disease. Existing studies on CPE of causal bandits either require the knowledge of P(Pa(Y) | a) for every action a or only consider causal graphs without hidden variables, and the proposed algorithms are not fully adaptive to reward gaps (Lattimore et al., 2016; Yabe et al., 2018). In this paper, we study fully adaptive pure exploration algorithms and analyze their gap-dependent sample complexity bounds in the fixed-confidence setting. More specifically, given a confidence bound δ ∈ (0, 1) and an error bound ε, we aim at designing adaptive algorithms that output an action such that with probability at least 1 − δ, the expected reward difference between the output and the optimal action is at most ε. The algorithms should be fully adaptive in the following two senses. First, they should adapt to the reward gaps between suboptimal and optimal actions, as in existing adaptive pure exploration bandit algorithms, so that actions with larger gaps are explored less. Second, they should adapt to the observational data from causal bandit feedback, so that actions with enough observations need no further interventional rounds of exploration, as in existing causal bandit algorithms. We integrate both types of adaptivity into one algorithmic framework, and through the interaction between the two aspects, we achieve better adaptivity than either of them alone. First, we introduce a term named the gap-dependent observation threshold, a nontrivial gap-dependent extension of a similar term in Lattimore et al. (2016). Then we provide two algorithms, one for the binary generalized linear model (BGLM) and one for the general model with hidden variables.
The sample complexity of both algorithms contains the gap-dependent observation threshold that we introduce, which shows a significant improvement compared to prior work. In particular, our algorithm for BGLM achieves a sample complexity polynomial in the graph size, while all prior algorithms for general graphs have exponential sample complexity; and our algorithm for general graphs matches a lower bound we prove in the paper. To the best of our knowledge, this paper is the first to consider a CPE algorithm specifically designed for BGLM, and the first to consider CPE on graphs with hidden variables, while all prior studies either assume no hidden variables or assume knowing the distribution P(Pa(Y) | a) for the parents Pa(Y) of the reward variable and every action a, which is not a reasonable assumption. To summarize, our contribution is to propose the first set of CPE algorithms on causal graphs with hidden variables that are fully adaptive to both the reward gaps and the observational causal data. The algorithm for BGLM is the first to achieve a gap-dependent sample complexity polynomial in the graph size, while the algorithm for general graphs improves significantly on sample complexity and matches a lower bound. Due to the space constraint, further materials, including experimental results, an algorithm for the fixed-budget setting, and all proofs, are deferred to the appendix. Related Work. Causal bandits were proposed by Lattimore et al. (2016), who discuss the simple regret for parallel graphs and general graphs with known probability distributions P(Pa(Y) | a). Nair et al. (2021) extend algorithms on parallel graphs to graphs without back-door paths, and Maiti et al. (2021) extend the results to general graphs. All of them either regard P(Pa(Y) | a) as prior knowledge or consider only atomic interventions. The study by Yabe et al.
(2018) is the only one considering general graphs with a combinatorial action set, but their algorithm cannot work on causal graphs with hidden variables. All the above pure exploration studies consider a simple regret criterion that is not gap-dependent. Cumulative regret is considered in (Lu et al., 2020; Nair et al., 2021; Maiti et al., 2021). To the best of our knowledge, Sen et al. (2017) is the only work discussing a gap-dependent bound for pure exploration of causal bandits in the fixed-budget setting, but it only considers soft interventions (changing the conditional distribution P(X | Pa(X))) on a single node, which differs from causal bandits as defined in Lattimore et al. (2016). Pure exploration of multi-armed bandits has been extensively studied in the fixed-confidence and fixed-budget settings (Audibert et al., 2010; Kalyanakrishnan et al., 2012; Jamieson et al., 2013; Jamieson & Nowak, 2014). PAC pure exploration is a generalized setting aiming to find an ε-optimal arm instead of the exactly optimal arm (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004). In this paper, we utilize the adaptive LUCB algorithm of Kalyanakrishnan et al. (2012). CPE has also been studied for multi-armed bandits, linear bandits, etc. (Karnin et al., 2013b; Chen et al., 2014; Du et al., 2021), but the feedback models in these studies either provide feedback at the base-arm level or provide full or partial bandit feedback, all of which differ from causal bandit feedback. The binary generalized linear model (BGLM) is studied in (Li et al., 2017; Feng & Chen, 2022) for cumulative-regret MAB problems. We borrow the maximum likelihood estimation method and its analysis from (Li et al., 2017; Feng & Chen, 2022) for our BGLM part, but its integration with our adaptive sampling algorithm for the pure exploration setting is new.

2. MODEL AND PRELIMINARIES

Causal Models. Following Pearl (2009), a causal graph G = (X ∪ {Y} ∪ U, E) is a directed acyclic graph (DAG) over a set of observed random variables X ∪ {Y} and a set of hidden random variables U, where X = {X_1, ..., X_n} and U = {U_1, ..., U_k} are two sets of variables and Y is the special reward variable without outgoing edges. In this paper, for simplicity, we only consider X_i's and Y that are binary random variables with support {0, 1}. For any node V in G, we denote the set of its parents in G as Pa(V), and the set of values of Pa(V) as pa(V). The causal influence is represented by P(V | Pa(V)), modeling the fact that the probability distribution of a node V's value is determined by the values of its parents. Henceforth, when we refer to a causal graph, we mean both its graph structure (X ∪ {Y} ∪ U, E) and its causal inference distributions P(V | Pa(V)) for all V ∈ X ∪ {Y} ∪ U. A parallel graph G = (X ∪ {Y}, E) is a special class of causal graphs with X = {X_1, ..., X_n}, U = ∅, and E = {X_1 → Y, X_2 → Y, ..., X_n → Y}. An intervention do(S = s) in the causal graph G means that we set the values of a set of nodes S ⊆ X to s, while all other nodes still follow the P(V | Pa(V)) distributions. An atomic intervention is one with |S| = 1. When S = ∅, do(S = s) is the null intervention, denoted do(), which means we do not set any node to any value and just observe all nodes' values. In this paper, we also study a parameterized model with no hidden variables: the binary generalized linear model (BGLM). Specifically, in BGLM, we have U = ∅ and P(X = 1 | Pa(X) = pa(X)) = f_X(θ_X · pa(X)) + e_X, where f_X is a strictly increasing function, θ_X ∈ R^{|Pa(X)|} is the unknown parameter vector for X, and e_X is a zero-mean bounded noise variable that guarantees the resulting probability to be within [0, 1]. To represent the intrinsic randomness of a node X not caused by its parents, we introduce a global variable X_1 ≡ 1 that is a parent of all nodes.

Combinatorial Pure Exploration of Causal Bandits. Combinatorial pure exploration (CPE) of causal bandits is the following online learning task. The causal graph structure is known, but the distributions P(V | Pa(V)) are unknown. The action (arm) space A is a subset of possible interventions on combinatorial sets of variables, plus the null intervention; that is, A ⊆ {do(S = s) | S ⊆ X, s ∈ {0, 1}^{|S|}} and do() ∈ A. For an action a = do(S = s), define µ_a = E[Y | do(S = s)] to be its expected reward, and let µ* = max_{a ∈ A} µ_a. In each round t, the learning agent plays one action a ∈ A and observes the sampled values X_t = (X_{t,1}, X_{t,2}, ..., X_{t,n}) and Y_t of all observed variables. The goal of the agent is to interact with the causal model for as few rounds as possible to find an action with the maximum expected reward µ*. More precisely, we mainly focus on PAC pure exploration with a gap-dependent bound in the fixed-confidence setting. In this setting, we are given a confidence parameter δ ∈ (0, 1) and an error parameter ε ∈ [0, 1), and we want to adaptively play actions over rounds based on past observations, terminate at a certain round, and output an action a_o such that µ* − µ_{a_o} ≤ ε with probability at least 1 − δ. The metric for this setting is sample complexity, i.e., the number of rounds needed to output a proper action a_o. Note that when ε = 0, the PAC setting reduces to the classical pure exploration setting. We also consider the fixed-budget setting in the appendix, where given an exploration round budget T and an error parameter ε ∈ [0, 1), the agent adaptively plays actions and outputs an action a_o at the end of round T, so that the error probability Pr{µ_{a_o} < µ* − ε} is as small as possible.
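To make the BGLM reward concrete, the following sketch computes µ_a = E[Y | do(S = s)] exactly on a toy three-node BGLM by enumerating node values in topological order. This is a minimal illustration, not the paper's algorithm: the graph, the logistic link f_X, and the weights are made-up assumptions, and the noise e_X is taken to be zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy BGLM: X1 is the always-1 global node, X2 has parent {X1},
# Y has parents {X1, X2}; all links are logistic and e_X = 0.
parents = {"X2": ["X1"], "Y": ["X1", "X2"]}
theta = {"X2": [0.5], "Y": [-1.0, 2.0]}

def p_one(node, vals):
    """P(node = 1 | parent values in `vals`) under the logistic link."""
    return sigmoid(sum(w * vals[p] for w, p in zip(theta[node], parents[node])))

def reward(intervention):
    """Exact E[Y | do(intervention)] by enumerating the single free node X2
    in topological order (X1 -> X2 -> Y)."""
    mu = 0.0
    for x2 in (0, 1):
        if "X2" in intervention and intervention["X2"] != x2:
            continue  # do() clamps X2, so the other branch has probability 0
        vals = {"X1": 1, "X2": x2}
        prob_x2 = 1.0 if "X2" in intervention else (
            p_one("X2", vals) if x2 == 1 else 1 - p_one("X2", vals))
        mu += prob_x2 * p_one("Y", vals)
    return mu
```

For instance, reward({"X2": 1}) equals sigmoid(−1 + 2) ≈ 0.731, while reward({}) averages over X2's natural distribution; larger graphs replace the single enumeration by one pass per node in topological order.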
We study gap-dependent bounds, meaning that the performance measure is related to the reward gap between the optimal and suboptimal actions, as defined below. Let a* be one of the optimal arms. For each arm a, we define the gap of a as ∆_a = µ_{a*} − max_{a ∈ A \ {a*}} µ_a if a = a*, and ∆_a = µ_{a*} − µ_a if a ≠ a*. We further sort the gaps ∆_a for all arms and assume ∆_(1) ≤ ∆_(2) ≤ ··· ≤ ∆_(|A|), where ∆_(1) is also denoted as ∆_min.
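As a quick sanity check on the gap definition, the sketch below computes ∆_a for a toy reward table (the action names and reward values are made up for illustration):

```python
def gaps(mu):
    """Gaps as defined above: the optimal arm's gap is its lead over the
    runner-up; every suboptimal arm's gap is its deficit to the optimum."""
    a_star = max(mu, key=mu.get)
    runner_up = max(v for a, v in mu.items() if a != a_star)
    return {a: (mu[a_star] - runner_up) if a == a_star else (mu[a_star] - v)
            for a, v in mu.items()}
```

For rewards {do(): 0.5, do(X1=1): 0.9, do(X2=1): 0.7}, the gaps are 0.4, 0.2, and 0.2 respectively, so ∆_min = 0.2.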

3. GAP-DEPENDENT OBSERVATION THRESHOLD

In this section, we introduce the key concept of the gap-dependent observation threshold, which is instrumental to the fixed-confidence algorithms in the next two sections. Intuitively, it describes for any action a whether we can derive its reward from pure observations of the causal model. We assume that the X_i's are binary random variables. First, we describe terms q_a ∈ [0, 1] for each action a ∈ A, which can have different definitions in different settings. Intuitively, q_a represents how easily action a can be estimated by observation. For example, in Lattimore et al. (2016), for the parallel graph with action set A = {do(X_i = x) | 1 ≤ i ≤ n, x ∈ {0, 1}} ∪ {do()} and action a = do(X_i = x), q_a = P(X_i = x) represents the probability for action do(X_i = x) to be observed, since in the parallel graph we have P(Y | X_i = x) = P(Y | do(X_i = x)). Thus, when q_a = P(X_i = x) is larger, it is easier to estimate P(Y | do(X_i = x)) by observation. We will instantiate the q_a's for BGLM and general graphs in later sections. For a = do(), we always set q_a = 1. Then, for the set {q_a | a ∈ A}, we define the observation threshold as follows: Definition 1 (Observation threshold, Lattimore et al. (2016)). For a given causal graph G and its associated {q_a | a ∈ A}, the observation threshold m is defined as: m = min{τ ∈ [|A|] : |{a ∈ A | q_a < 1/τ}| ≤ τ}. The observation threshold can be equivalently defined as follows: when we sort {q_a | a ∈ A} as q_(1) ≤ q_(2) ≤ ··· ≤ q_(|A|), m = min{τ : q_(τ+1) ≥ 1/τ}. Note that m ≤ |A| always holds since q_{do()} = 1. In some cases, m ≪ |A|. For example, in the parallel graph, when P(X_i = 1) = P(X_i = 0) = 1/2 for all i ∈ [n], we have q_{do(X_i=1)} = q_{do(X_i=0)} = 1/2 and q_{do()} = 1, so m = 2 ≪ 2n + 1 = |A|. Intuitively, when we collect passive observation data without intervention, arms corresponding to q_(j) with j ≤ m are under-observed, while arms corresponding to q_(j) with j > m are sufficiently observed and can be estimated accurately.
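Definition 1 is easy to compute directly. The sketch below does so and reproduces the parallel-graph example above (q_a = 1/2 for the 2n atomic interventions and q_do() = 1); the value n = 5 is an arbitrary choice for illustration:

```python
def observation_threshold(qs):
    """Definition 1: m = min{ tau in [|A|] : #{a : q_a < 1/tau} <= tau }."""
    for tau in range(1, len(qs) + 1):
        if sum(1 for q in qs if q < 1.0 / tau) <= tau:
            return tau
    return len(qs)

# Parallel graph with n = 5: 2n atomic interventions with q = 1/2, plus do().
n = 5
qs = [0.5] * (2 * n) + [1.0]
```

Here observation_threshold(qs) returns 2, matching m = 2 ≪ 2n + 1 = |A|.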
Thus, for convenience, we name m the observation threshold (the term is not given a name in Lattimore et al. (2016)). In this paper, we improve the definition of m to make it gap-dependent, which leads to a better adaptive pure exploration algorithm and sample complexity bound. Before introducing the definition, we first define the term H_r. Sort the arm set so that q_{a_1} · max{∆_{a_1}, ε/2}² ≤ q_{a_2} · max{∆_{a_2}, ε/2}² ≤ ··· ≤ q_{a_{|A|}} · max{∆_{a_{|A|}}, ε/2}²; then H_r is defined by H_r = Σ_{i=1}^r 1/max{∆_{a_i}, ε/2}². (3) Definition 2 (Gap-dependent observation threshold). For a given causal graph G and its associated q_a's and ∆_a's, the gap-dependent observation threshold m_{ε,∆} is defined as: m_{ε,∆} = min{τ ∈ [|A|] : |{a ∈ A : q_a · max{∆_a, ε/2}² < 1/H_τ}| ≤ τ}. The gap-dependent observation threshold can be regarded as a generalization of the observation threshold. Intuitively, when considering the gaps, q_a · max{∆_a, ε/2}² represents how easily action a can be distinguished from the optimal arm. To show the relationship between m_{ε,∆} and m, we provide the following lemma, whose proof is in the supplementary material. Lemma 1. m_{ε,∆} ≤ 2m. Lemma 1 shows that m_{ε,∆} = O(m). In many real scenarios, m_{ε,∆} can be much smaller than m. Consider some integer n with 4 < n < |A|, ε < 1/n, q_a = 1/n for a ∈ A \ {do()}, and q_{do()} = 1. Then m = n. Now suppose ∆_{a_1} = ∆_{a_2} = 1/n, while all other arms a have ∆_a = 1/2. Then H_r ≥ n² for all r ≥ 1, and for a ≠ a_1, a_2, we have q_a · max{∆_a, ε/2}² ≥ 1/(4n) > 1/H_r, which implies that m_{ε,∆} = 2. This lemma will be used to show that our result improves on the previous causal bandit algorithm of Lattimore et al. (2016).
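The gap-dependent threshold can be computed the same way. This sketch implements Eq. (3) and Definition 2 and reproduces the example above (taking n = 8 as an arbitrary concrete choice: q_a = 1/n for atomic arms, two arms with gap 1/n, the rest with gap 1/2), where it returns m_{ε,∆} = 2 while m = n:

```python
def gap_dependent_threshold(q, delta, eps):
    """Definition 2: sort arms by q_a * max{Delta_a, eps/2}^2, let
    H_tau = sum over the first tau sorted arms of 1/max{Delta_a, eps/2}^2
    (Eq. (3)), and return the smallest tau such that at most tau arms
    satisfy q_a * max{Delta_a, eps/2}^2 < 1/H_tau."""
    key = {a: q[a] * max(delta[a], eps / 2) ** 2 for a in q}
    arms = sorted(q, key=key.get)
    H = 0.0
    for tau, a in enumerate(arms, start=1):
        H += 1.0 / max(delta[a], eps / 2) ** 2
        if sum(1 for b in arms if key[b] < 1.0 / H) <= tau:
            return tau
    return len(arms)

# Example from the text with n = 8 and eps < 1/n: arms a1, a2 have gap 1/n,
# all remaining arms have gap 1/2, and do() has q = 1.
n = 8
q = {f"a{i}": 1.0 / n for i in range(1, 2 * n + 1)}
q["do()"] = 1.0
delta = {a: 0.5 for a in q}
delta["a1"] = delta["a2"] = 1.0 / n
```

Here gap_dependent_threshold(q, delta, eps=0.1) returns 2: after the two small-gap arms, 1/H_τ is already below every remaining arm's q_a · max{∆_a, ε/2}².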

4. COMBINATORIAL PURE EXPLORATION FOR BGLM

In this section, we discuss pure exploration for BGLM, a general class of causal graphs with a linear number of parameters, as defined in Section 2. Throughout this section, we assume U = ∅. Let θ* = (θ*_X)_{X ∈ X ∪ {Y}} be the vector of all weights. Since X_1 is a global variable, we only need to consider the action set A ⊆ {do(S = s) | S ⊆ X \ {X_1}, s ∈ {0, 1}^{|S|}}. Following Li et al. (2017) and Feng & Chen (2022), we make three assumptions: Assumption 1. For any X ∈ X ∪ {Y}, f_X is twice differentiable, and its first and second order derivatives are upper bounded by constants M^(1) and M^(2). Assumption 2. κ := inf_{X ∈ X ∪ {Y}, v ∈ [0,1]^{|Pa(X)|}, ||θ − θ*_X|| ≤ 1} ḟ_X(v · θ) > 0 is a positive constant. Assumption 3. There exists a constant η > 0 such that for any X ∈ X ∪ {Y} and X′ ∈ Pa(X), for any v ∈ {0, 1}^{|Pa(X)| − 2} and x ∈ {0, 1}, we have Pr[X′ = x | Pa(X) \ {X′, X_1} = v] ≥ η. Assumptions 1 and 2 are classical assumptions for generalized linear models (Li et al., 2017). Assumption 3 ensures that each parent of X has some freedom to take values 0 and 1 with non-zero probability, even when the values of all other parents of X are fixed; it was originally given in Feng & Chen (2022) with additional justification. Henceforth, we use σ(θ, a) to denote the reward µ_a under parameter θ. Our main algorithm, Causal Combinatorial Pure Exploration-BGLM (CCPE-BGLM), is given in Algorithm 1. The algorithm follows the LUCB framework (Kalyanakrishnan et al., 2012) but has several innovations. In each round t, we play three actions, so each round corresponds to three rounds in the general CPE model. In each round t, we maintain μ̂^t_{O,a} and μ̂^t_{I,a} as the estimates of µ_a from the observational data and the interventional data, respectively. For each estimate, we maintain its confidence interval, [L^t_{O,a}, U^t_{O,a}] and [L^t_{I,a}, U^t_{I,a}] respectively.
At the beginning of round t, similar to LUCB, we find two candidate actions: the one with the highest empirical mean so far, a^{t−1}_h, and the one with the highest UCB among the rest, a^{t−1}_l. If the LCB of a^{t−1}_h is higher than the UCB of a^{t−1}_l up to an ε error, then the algorithm can stop and return a^{t−1}_h as the best action. However, the second stopping condition in line 5 is new, and it guarantees that the observational estimates μ̂^t_{O,a} are based on enough samples. If the stopping condition is not met, we perform three steps. The first step is the novel observation step compared to LUCB. In this step, we perform the null intervention do(), collect observational data, use the maximum-likelihood estimation adapted from Li et al. (2017) and Feng & Chen (2022) to obtain the parameter estimate θ̂_t, and then use θ̂_t to compute the observational estimate μ̂^t_{O,a} = σ(θ̂_t, a) for every action a, where σ(θ̂_t, a) is the reward of action a under parameter θ̂_t. This can be computed efficiently by following the topological order of the nodes in G with parameter θ̂_t. From μ̂^t_{O,a}, we obtain the confidence interval [L^t_{O,a}, U^t_{O,a}] using the bonus term defined later in Eq. (8). In the second step, we play the two candidate actions a^{t−1}_h and a^{t−1}_l and update their interventional estimates and confidence intervals, as in LUCB. In the third step, we merge the two estimates and set the final estimate μ̂^t_a to be the midpoint of the intersection of the two confidence intervals. While the second step follows LUCB, the first and third steps are new, and they are crucial for utilizing the observational data to obtain quick estimates for many actions at once. Utilizing observational data has been explored in past causal bandit studies, but those works separate exploration into an observation stage and an intervention stage (Lattimore et al., 2016; Nair et al., 2021), so their algorithms are not adaptive and cannot provide gap-dependent sample complexity bounds.
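The third (merge) step is simple; a minimal sketch of how the two intervals combine into the final estimate μ̂^t_a:

```python
def merge_intervals(obs, itv):
    """Intersect the observational interval `obs` = (L_O, U_O) with the
    interventional interval `itv` = (L_I, U_I); the final estimate is the
    midpoint of the intersection, as in the merge step of CCPE-BGLM."""
    lo = max(obs[0], itv[0])
    hi = min(obs[1], itv[1])
    return lo, hi, (lo + hi) / 2.0
```

For example, merge_intervals((0.2, 0.8), (0.4, 0.9)) gives the interval (0.4, 0.8) with estimate 0.6: whichever source is tighter on each side wins, so plentiful observational samples can shrink an arm's interval without that arm ever being played.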
Our algorithmic innovation is that we interleave the observation step and the intervention step naturally within the adaptive LUCB framework, so that we achieve an adaptive balance between observation and intervention, achieving the best of both worlds. To get the confidence bound for BGLM, we use the following lemma from Feng & Chen (2022):

Lemma 2. For an action a = do(S = s) and any two weight vectors θ and θ′, we have |σ(θ, a) − σ(θ′, a)| ≤ E_e[ Σ_{X ∈ N_{S,Y}} |V^⊤_X (θ_X − θ′_X)| M^(1) ], where N_{S,Y} is the set of all nodes that lie on a possible path from X_1 to Y excluding S, V_X is the value vector of a sample of the parents of X according to parameter θ, M^(1) is defined in Assumption 1, and the expectation is taken over the randomness of the noise terms e = (e_X)_{X ∈ X ∪ {Y}} of the causal model under parameter θ.

Algorithm 1 CCPE-BGLM(G, A, ε, δ, M^(1), M^(2), κ, η, c)
1: Input: causal graph G, action set A, parameters ε, δ, M^(1), M^(2), κ, η, c in Assumptions 1, 2, 3 and in Lemma 4 in the supplementary material.
2: Initialize M_{0,X} = I for every node X; N_a = 0, μ̂^0_a = 0, L^0_a = −∞, U^0_a = ∞ for all arms a ∈ A.
3: for t = 1, 2, ... do
4:   a^{t−1}_h = argmax_{a ∈ A} μ̂^{t−1}_a, a^{t−1}_l = argmax_{a ∈ A \ {a^{t−1}_h}} U^{t−1}_a.
5:   if U^{t−1}_{a^{t−1}_l} ≤ L^{t−1}_{a^{t−1}_h} + ε and t ≥ max{ (cD/η²) log(nt²/δ), (1024 (M^(2))² (4D² − 3) D / (κ⁴ η)) (D² + log(3nt²/δ)) } then
6:     return a^{t−1}_h.
7:   end if
8:   /* Step 1: conduct a passive observation and estimate from the observational data */
9:   Perform action do() and observe X_t and Y_t. For a = do(), N_a = N_a + 1.
10:  θ̂_t = BGLM-estimate((X_1, Y_1), ..., (X_t, Y_t)).
11:  For each a = do(S = s) ∈ A, calculate μ̂_{O,a} = σ(θ̂_t, a) and [L^t_{O,a}, U^t_{O,a}] = [μ̂_{O,a} − β^a_O(t), μ̂_{O,a} + β^a_O(t)]. /* β^a_O(t) is defined in Eq. (8) */
12:  /* Step 2: play the two candidate actions */
13:  Play a^{t−1}_l and a^{t−1}_h, observing Y^(l)_t and Y^(h)_t.
14:  N_{a^{t−1}_l} = N_{a^{t−1}_l} + 1, N_{a^{t−1}_h} = N_{a^{t−1}_h} + 1.
15:  For a ∈ {a^{t−1}_l, a^{t−1}_h, do()}, update the empirical mean
16:  μ̂_{I,a} = (1/N_a) Σ_{j=1}^t ( I{a = a^{j−1}_l} Y^(l)_j + I{a = a^{j−1}_h} Y^(h)_j + I{a = do()} Y_j ) and [L^t_{I,a}, U^t_{I,a}] = [μ̂_{I,a} − β_I(N_a), μ̂_{I,a} + β_I(N_a)]. /* β_I(t) is defined in Eq. (8) */
17:  /* Step 3: merge the observational estimate and the interventional estimate */
18:  For a ∈ A, calculate [L^t_a, U^t_a] = [L^t_{O,a}, U^t_{O,a}] ∩ [L^t_{I,a}, U^t_{I,a}] and μ̂^t_a = (L^t_a + U^t_a)/2.
19: end for

The key idea in the design and analysis of the algorithm is to divide the actions into two sets: the easy actions and the hard actions. Intuitively, the easy actions are the ones that can be easily estimated from observational data, while the hard actions require directly playing them to get accurate estimates. The quantity q_a introduced in Section 3 indicates how easy action a is, and it determines the gap-dependent observation threshold m_{ε,∆} (Definition 2), which essentially gives the number of hard actions. In fact, the set of actions in Eq. (4) with τ = m_{ε,∆} is the set of hard actions, and the rest are easy actions. We need to define q_a to represent the hardness of estimation for each a.

Algorithm 2 BGLM-estimate
1: Input: data pairs ((X_1, Y_1), (X_2, Y_2), ..., (X_t, Y_t))
2: Construct (V_{t,X}, X_t) for each X, where V_{t,X} is the value of the parents of X at round t and X_t is the value of X at round t.
3: for X ∈ X ∪ {Y} do
4:   M_{t,X} = M_{t−1,X} + V_{t,X} V^⊤_{t,X}; calculate θ̂_{t,X} by solving Σ_{i=1}^t (X_i − f_X(V^⊤_{i,X} θ̂_{t,X})) V_{i,X} = 0.
5: end for
6: return θ̂_t.

For CCPE-BGLM, we define q^(L)_a as follows. Let D = max_{X ∈ X ∪ {Y}} |Pa(X)|. For a node set S ⊆ X, let ℓ_S = |N_{S,Y}|. Then for a = do(S = s), we define q^(L)_a = 1/(ℓ_S² D³). (7) Intuitively, based on Lemma 2 and ℓ_S = |N_{S,Y}|, a large ℓ_S means that the right-hand side of Inequality (6) can be large, and thus it is difficult to estimate µ_a accurately. Hence the term q^(L)_a represents how easy action a is to estimate. Note that q^(L)_a depends only on the graph structure and the set S. We can define m^(L) and m^(L)_{ε,∆} with respect to the q^(L)_a's by Definitions 1 and 2. We use two confidence radius terms, one for the estimate from the observational data and one for the estimate from the interventional data: β^a_O(t) = α_O (M^(1) D^{1.5}/(κ√η)) √( (1/(q^(L)_a t)) log(3nt²/δ) ), β_I(t) = α_I √( (1/t) log(|A| log(2t)/δ) ). (8) Parameters α_O and α_I are exploration parameters for our algorithm.
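Since q^(L)_a depends only on the graph, it can be computed by a reachability check. The sketch below takes N_{S,Y} to be the nodes lying on some directed path from X_1 to Y, minus S (our reading of "excluding S"), on a hypothetical three-node DAG; the graph and D are made-up illustration values:

```python
def nodes_on_paths(adj, src, dst):
    """Nodes on some directed path src -> dst: reachable from src AND reaching dst."""
    def reachable(start, edges):
        seen, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(edges.get(u, []))
        return seen
    rev = {}
    for u, vs in adj.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    return reachable(src, adj) & reachable(dst, rev)

def q_L(adj, S, D):
    """q_a^(L) = 1 / (ell_S^2 * D^3) from Eq. (7), with ell_S = |N_{S,Y}|."""
    ell = len(nodes_on_paths(adj, "X1", "Y") - set(S))
    return 1.0 / (ell ** 2 * D ** 3)

# Toy DAG: X1 -> X2 -> Y and X1 -> Y; D = max in-degree = 2 (Y's parents).
adj = {"X1": ["X2", "Y"], "X2": ["Y"]}
```

For example, q_L(adj, {"X2"}, 2) = 1/(2² · 2³) = 1/32, and the empty intervention gives ℓ_S = 3 and hence 1/72.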
For the theoretical guarantee, we choose α_O = 6√2 and α_I = 2, but more aggressive α_O and α_I could be used in experiments (e.g., Mason et al. (2020); Kaufmann et al. (2016); Jamieson et al. (2013)). The sample complexity of CCPE-BGLM is summarized in the following theorem. Theorem 1. With probability 1 − δ, CCPE-BGLM(G, A, ε, δ/2) returns an ε-optimal arm with sample complexity T = O( H_{m^(L)_{ε,∆}} log(|A| H_{m^(L)_{ε,∆}} / δ) ), where m^(L)_{ε,∆} and H_{m^(L)_{ε,∆}} are defined in Definition 2 and Eq. (3) in terms of the q^(L)_a's for a ∈ A \ {do()} defined in Eq. (7). If we treat the problem as a naive |A|-armed bandit, the sample complexity of LUCB1 is O(H) = O( Σ_{a ∈ A} 1/max{∆_a, ε/2}² ), which may contain an exponential number of terms. Since q^(L)_a ≥ 1/n⁵, it is easy to show that m^(L)_{ε,∆} ≤ 2n⁵, so H_{m^(L)_{ε,∆}} contains only a polynomial number of terms. Other causal bandit algorithms also suffer an exponential term unless they rely on a strong and unreasonable assumption, as described in the related work. We achieve an exponential speedup by (a) designing an algorithm specifically for the BGLM model, and (b) interleaving observation and intervention to make the algorithm fully adaptive. The idea of the analysis is as follows. First, for the m_{ε,∆} hard actions, we rely on adaptive LUCB to identify the best, and its sample complexity according to LUCB is O( H_{m^(L)_{ε,∆}} log(|A| H_{m^(L)_{ε,∆}} / δ) ). Next, for the easy actions, we rely on the observational data to provide accurate estimates. According to Eq. (4), every easy action a satisfies q_a · max{∆_a, ε/2}² ≥ 1/H_{m_{ε,∆}}. Using this property together with Lemma 2, we show that the sample complexity for estimating easy-action rewards is also O( H_{m^(L)_{ε,∆}} log(|A| H_{m^(L)_{ε,∆}} / δ) ).

5.1. CPE ALGORITHM FOR GENERAL GRAPHS

In this section, we apply a similar idea to the general graph setting, which further allows the existence of hidden variables. The first issue is how to estimate the causal effect (the do effect) E[Y | do(S = s)] in general causal graphs from observational data. The general concept of identifiability (Pearl, 2009) is difficult to use for sample complexity analysis. Here we instead use the concept of an admissible sequence (Pearl, 2009) to achieve this estimation. Definition 3 (Admissible sequence). An admissible sequence for a general graph G with respect to Y and S = {X_1, ..., X_k} ⊆ X is a sequence of sets of variables Z_1, ..., Z_k ⊆ X such that (1) Z_i consists of nondescendants of {X_i, X_{i+1}, ..., X_k}, and (2) (Y ⊥⊥ X_i | X_1, ..., X_{i−1}, Z_1, ..., Z_i) holds in the graph obtained from G by removing the out-edges of X_i and the in-edges of X_{i+1}, ..., X_k. Then, for S = {X_1, ..., X_k} and s = {x_1, ..., x_k}, we can calculate E[Y | do(S = s)] by E[Y | do(S = s)] = Σ_z P(Y = 1 | S = s, Z_i = z_i, i ≤ k) · P(Z_1 = z_1) ··· P(Z_k = z_k | Z_i = z_i, X_i = x_i, i ≤ k − 1), (10) where z is the value of ∪_{i=1}^k Z_i and z_i is the projection of z onto Z_i. For a = do(S = s) with |S| = k, we use {Z_{a,i}}_{i=1}^k to denote the admissible sequence with respect to Y and S, and let Z_a = ∪_{i=1}^k Z_{a,i}, Z_a = |Z_a|, and Z = max_a Z_a. In this paper, we simplify Z_{a,i} to Z_i when there is no ambiguity. For any P ⊆ X, denote P_t = X_t |_P as the projection of X_t onto P.

Published as a conference paper at ICLR 2023

Algorithm 3 CCPE-General(G, A, ε, δ)
1: Input: causal graph G, action set A, parameters ε, δ, and an admissible sequence {Z_{a,i}} for each action a ∈ A.
2: Initialize t = 1; T_a = 0, T_{a,z} = 0, N_a = 0, μ̂_a = 0 for all arms a ∈ A and z ∈ {0, 1}^{|X|}.
3: for t = 1, 2, ... do
4:   a^{t−1}_h = argmax_{a ∈ A} μ̂^{t−1}_a, a^{t−1}_l = argmax_{a ∈ A \ {a^{t−1}_h}} U^{t−1}_a.
5:   if U^{t−1}_{a^{t−1}_l} ≤ L^{t−1}_{a^{t−1}_h} + ε then
6:     return a^{t−1}_h.
7:   end if
8:   /* Step 1: conduct a passive observation and estimate from the observational data */
9:   Perform the do() operation and observe X_t and Y_t. For a = do(), N_a = N_a + 1.
10:  for a = do(S = s) ∈ A \ {do()} with an admissible sequence, S = {X_1, ..., X_k}, s = {x_1, ..., x_k} do
11:    Estimate μ̂_{O,a} using Eq. (14) and set [L^t_{O,a}, U^t_{O,a}] = [μ̂_{O,a} − β^a_O(T_a, t), μ̂_{O,a} + β^a_O(T_a, t)]. /* β^a_O is defined in Eq. (16), T_{a,z} is defined in Eq. (11), and T_a = min_z T_{a,z} */
12:  end for
13:  /* Step 2: play the two candidate actions */
14:  Play a^{t−1}_l and a^{t−1}_h, observing Y^(l)_t and Y^(h)_t.
15:  N_{a^{t−1}_l} = N_{a^{t−1}_l} + 1, N_{a^{t−1}_h} = N_{a^{t−1}_h} + 1.
16:  For a ∈ {a^{t−1}_l, a^{t−1}_h, do()}, update the empirical mean μ̂_{I,a} and [L^t_{I,a}, U^t_{I,a}] as in Algorithm 1.
17:  /* Step 3: merge the observational estimate and the interventional estimate */
18:  For a ∈ A, calculate [L^t_a, U^t_a] = [L^t_{O,a}, U^t_{O,a}] ∩ [L^t_{I,a}, U^t_{I,a}] and μ̂^t_a = (L^t_a + U^t_a)/2.
19: end for

We define r_{a,z}(t) as the empirical mean of P(Y = 1 | S = s, Z_a = z), and p_{a,z,l}(t) = (1/n_{a,z,l}(t)) Σ_{j=1}^t I{(Z_l)_j = z_l, (Z_i)_j = z_i, (X_i)_j = x_i, i ≤ l − 1} as the empirical estimate of P(Z_l = z_l | Z_i = z_i, X_i = x_i, i ≤ l − 1), where n_{a,z,l}(t) counts the rounds satisfying the conditioning event. We also denote T_a = min_z T_{a,z}. Using Eq. (10), we estimate each term of its right-hand side for every z ∈ {0, 1}^{Z_a} to obtain an estimate of E[Y | a]: μ̂_{O,a} = Σ_z r_{a,z}(t) Π_{l=1}^k p_{a,z,l}(t). (14) For general graphs, there is no efficient algorithm to determine the existence of an admissible sequence and extract it when it exists, but we can rely on several methods to find admissible sequences in special cases. First, we can find an adjustment set, a special case of an admissible sequence: for a causal graph G, Z is an adjustment set for variable Y and set S if and only if P(Y = 1 | do(S = s)) = Σ_z P(Y = 1 | S = s, Z = z) P(Z = z). There is an efficient algorithm for deciding the existence of a minimal adjustment set with respect to any set S and Y and finding it (van der Zander et al., 2019). Second, for general graphs without hidden variables, an admissible sequence can easily be found by Z_j = Pa(X_j) \ (Z_1 ∪ ··· ∪ Z_{j−1} ∪ {X_1, ..., X_{j−1}}) (Theorem 4 in the Appendix). Finally, when the causal graph satisfies certain properties, there exist algorithms to decide and construct admissible sequences (Dawid & Didelez, 2010).
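To see why the adjustment computation matters, the following sketch (with made-up probabilities, a single treatment X and a single adjustment set Z, i.e., the k = 1 special case of Eq. (10)) contrasts the naive conditional E[Y | X = x], which is biased by the confounder Z, with the back-door adjustment Σ_z P(Y = 1 | X = x, Z = z) P(Z = z), which equals the true E[Y | do(X = x)]:

```python
# Hypothetical model: Z ~ Bern(0.4), X | Z ~ Bern(0.2 + 0.6 Z),
# Y | X, Z ~ Bern(0.1 + 0.5 X + 0.3 Z). Z confounds X and Y.
P_Z = {0: 0.6, 1: 0.4}
p_x = lambda z: 0.2 + 0.6 * z                # P(X=1 | Z=z)
p_y = lambda x, z: 0.1 + 0.5 * x + 0.3 * z   # P(Y=1 | X=x, Z=z)

def adjusted(x):
    """Back-door adjustment: sum_z P(Y=1 | X=x, Z=z) P(Z=z).
    Equals the true E[Y | do(X=x)], since do(X=x) cuts Z -> X but keeps P(Z)."""
    return sum(P_Z[z] * p_y(x, z) for z in (0, 1))

def naive(x):
    """Observational conditional E[Y | X=x]: conditioning on X tilts the
    distribution of Z, so this is biased upward for x = 1."""
    num = sum(P_Z[z] * (p_x(z) if x else 1 - p_x(z)) * p_y(x, z) for z in (0, 1))
    den = sum(P_Z[z] * (p_x(z) if x else 1 - p_x(z)) for z in (0, 1))
    return num / den
```

Here adjusted(1) = 0.6 · 0.6 + 0.4 · 0.9 = 0.72, while naive(1) = 0.36/0.44 ≈ 0.818 overestimates the interventional effect, because units with X = 1 disproportionately have Z = 1.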
Algorithm 3 provides the pseudocode of our algorithm CCPE-General, which has the same framework as Algorithm 1. The main difference is in the first step of updating observational estimates, where we rely on the do-calculus formula Eq. (10). For an action a = do(S = s) without an admissible sequence, define q^(G)_a = 0, meaning that it is hard to estimate through observation. Otherwise, define q^(G)_a = min_z q_{a,z}, where q_{a,z} = P(S = s, Z_a = z) for all z ∈ {0, 1}^{Z_a}. (15) Similar to CCPE-BGLM, for a = do(S = s) with |S| = k, we use the observational and interventional confidence radii β^a_O(n, t) = α_O √( (1/n) log(20 k |A| Z_a I_a log(2t)/δ) ), β_I(n, t) = α_I √( (1/n) log(|A| log(2t)/δ) ), (16) where α_O and α_I are exploration parameters and I_a = 2^{Z_a}. For the theoretical guarantee, we choose α_O = 8 and α_I = 2. Our sample complexity result is given below. Theorem 2. With probability 1 − δ, CCPE-General(G, A, ε, δ/5) returns an ε-optimal arm with sample complexity T = O( H_{m^(G)_{ε,∆}} log(|A| H_{m^(G)_{ε,∆}} / δ) ), (17) where m^(G)_{ε,∆} and H_{m^(G)_{ε,∆}} are defined in Definition 2 and Eq. (3) in terms of the q^(G)_a's defined in Eq. (15). Compared to LUCB1, since m^(G)_{ε,∆} ≤ |A|, our algorithm is always as good as LUCB1, and it is easy to construct cases where our algorithm performs significantly better. Compared to other causal bandit algorithms, our algorithm also performs significantly better, especially when m^(G)_{ε,∆} ≪ m^(G) or the gaps ∆_a are large relative to ε. Some causal graphs with candidate action sets and valid admissible sequences are provided in Appendix A, with more discussion in the Appendix.
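The observational estimate μ̂_{O,a} used in Algorithm 3 is a plug-in version of Eq. (10). For the single-set (k = 1) case it can be sketched directly from raw (x, z, y) observational tuples as follows; the sample data is made up for illustration:

```python
def mu_hat_O(samples, x):
    """Plug-in back-door estimate of E[Y | do(X=x)] from observational
    (x, z, y) tuples: sum over z of P_hat(Y=1 | X=x, Z=z) * P_hat(Z=z).
    Strata never observed with X=x are skipped (their r_hat is undefined,
    which is why the algorithm tracks per-stratum counts like T_{a,z})."""
    n = len(samples)
    est = 0.0
    for z in sorted({s[1] for s in samples}):
        stratum = [s for s in samples if s[0] == x and s[1] == z]
        if not stratum:
            continue
        r_hat = sum(s[2] for s in stratum) / len(stratum)  # P_hat(Y=1 | X=x, Z=z)
        p_hat = sum(1 for s in samples if s[1] == z) / n   # P_hat(Z=z)
        est += p_hat * r_hat
    return est

# Made-up observational data: tuples (x, z, y).
data = [(1, 0, 1), (1, 0, 0), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
```

On this data, mu_hat_O(data, 1) = P̂(Z=0) · 0.5 + P̂(Z=1) · 1.0 = 0.6 · 0.5 + 0.4 · 1.0 = 0.7, obtained from null-intervention rounds alone without ever playing do(X = 1).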

5.2. LOWER BOUND FOR THE GENERAL GRAPH CASE

To show that our CCPE-General algorithm is nearly minimax optimal, we provide the following lower bound, which is based on parallel graphs. We consider the following class of parallel bandit instances ξ with causal graph G = ({X_1, ..., X_n, Y}, E): the action set is A = {do(X_i = x) | x ∈ {0, 1}, 1 ≤ i ≤ n} ∪ {do()}. In this case, q^(G)_a reduces to q^(G)_{do(X_i=x)} = P(X_i = x) and q_{do()} = 1. Sort the action set so that q^(G)_{a_1} · max{∆_{a_1}, ε/2}² ≤ q^(G)_{a_2} · max{∆_{a_2}, ε/2}² ≤ ··· ≤ q^(G)_{a_{2n+1}} · max{∆_{a_{2n+1}}, ε/2}². Let p_min = min_{x ∈ {0,1}^n} P(Y = 1 | X = x) and p_max = max_{x ∈ {0,1}^n} P(Y = 1 | X = x), and require p_max + 2∆_{2n+1} + 2ε ≤ 0.9 and p_min + ∆_min ≥ 0.1. Theorem 3. For the parallel bandit instance class ξ defined above, any (ε, δ)-PAC algorithm has expected sample complexity at least Ω( ( H_{m^(G)_{ε,∆} − 1} − 1/min_{i < m^(G)_{ε,∆}} max{∆_{a_i}, ε/2}² − 1/max{∆_{do()}, ε/2}² ) log(1/δ) ). Theorem 3 is the first gap-dependent lower bound for causal bandits, and it requires a brand-new construction and technique. Compared to the upper bound in Theorem 2, the main factor H_{m^(G)_{ε,∆}} is the same, except that the lower bound subtracts several additive terms. The first term H_{m_{ε,∆} − 1} is almost equal to the H_{m_{ε,∆}} appearing in Eq. (17), except that it omits the last and smallest additive term of H_{m_{ε,∆}}. The second term eliminates one term with minimal ∆_{a_i}, which is common in multi-armed bandits (Lattimore, 2018; Karnin et al., 2013a). The last term arises because the reward of do() must lie between µ_{do(X_i=0)} and µ_{do(X_i=1)}, so do() cannot be the optimal arm.

6. FUTURE WORK

There are many interesting directions worth exploring in the future. First, improving the computational complexity of CPE of causal bandits is an important direction. Second, one can consider developing efficient pure exploration algorithms for causal graphs with partially unknown graph structures. Lastly, identifying the best intervention may be connected with Markov decision processes, and studying their interactions is also an interesting direction.




Finally, the interleaving of observations and interventions keeps the sample complexity in the same order.



