MODEL-BASED CAUSAL BAYESIAN OPTIMIZATION

Abstract

How should we intervene on an unknown structural causal model to maximize a downstream variable of interest? This optimization of the output of a system of interconnected variables, also known as causal Bayesian optimization (CBO), has important applications in medicine, ecology, and manufacturing. Standard Bayesian optimization algorithms fail to effectively leverage the underlying causal structure. Existing CBO approaches assume noiseless measurements and do not come with guarantees. We propose model-based causal Bayesian optimization (MCBO), an algorithm that learns a full system model instead of only modeling intervention-reward pairs. MCBO propagates epistemic uncertainty about the causal mechanisms through the graph and trades off exploration and exploitation via the optimism principle. We bound its cumulative regret, and obtain the first non-asymptotic bounds for CBO. Unlike in standard Bayesian optimization, our acquisition function cannot be evaluated in closed form, so we show how the reparameterization trick can be used to apply gradient-based optimizers. Empirically we find that MCBO compares favorably with existing state-of-the-art approaches.

1. INTRODUCTION

Many applications, such as drug and material discovery, robotics, agriculture, and automated machine learning, require optimizing an unknown function that is expensive to evaluate. Bayesian optimization (BO) is an efficient framework for the sequential optimization of such objectives (Močkus, 1975). The key idea in BO is to quantify uncertainty in the unknown function via a probabilistic model, and then use this model to navigate a trade-off between selecting inputs where the function output is favourable (exploitation) and selecting inputs that reduce uncertainty about the function (exploration). While most standard BO methods focus on a black-box setup (Figure 1a), in practice we often have additional structure on the unknown function that can be used to improve sample efficiency. In this paper, we exploit structural knowledge in the form of a causal graph specified by a directed acyclic graph (DAG). In particular, we assume that actions can be modeled as interventions on a structural causal model (SCM) (Pearl, 2009) that contains the reward (function output) as a variable (Figure 1b). While we assume the graph structure to be known, we consider the functional relations in the SCM as unknown. All variables in the SCM are observed along with the reward after each action. This causal BO setting has important potential applications, such as optimizing medical and ecological interventions (Aglietti et al., 2020b). For illustrative purposes, consider the example of an agronomist trying to find the optimal Nitrogen fertilizer schedule for maximizing crop yield, described in Figure 1: the concentration of Nitrogen in the soil at one timestep causes its concentration at later timesteps. To exploit the causal graph structure for optimization, we propose model-based causal Bayesian optimization (MCBO). MCBO explicitly models the full SCM along with the uncertainty about all of its components.
This allows our algorithm to select interventions based on an optimistic strategy similar to that used by the upper confidence bound algorithm (Srinivas et al., 2010). We show that this strategy leads to the first CBO algorithm with a cumulative regret guarantee. For a practical algorithm, maximizing the upper confidence bound in our setting is computationally more difficult, because uncertainty in all system components must be propagated through the entire estimated SCM to the reward variable. We show that an application of the reparameterization trick allows MCBO to be practically implemented with common gradient-based optimizers. Empirically, MCBO achieves competitive performance on existing CBO benchmarks and a related setting called function network BO (Astudillo & Frazier, 2021b).

[Figure 1: (a) Bayesian optimization; (b) causal Bayesian optimization. The DAG corresponds to our stylised agronomy example, where we aim to maximize crop yield $Y$. CBO takes this DAG as input for designing actions. $X_0$ is an unmodifiable starting property of the soil, and $X_1, \dots, X_3$ are the measured amounts of Nitrogen in the soil at different timesteps. Each observation is modelled with its own Gaussian process. $a_1, \dots, a_3$ are possible interventions involving adding Nitrogen fertilizer to the soil.]

Contributions

• We introduce MCBO, a model-based algorithm for causal Bayesian optimization that can be applied with very generic classes of interventions.
• Using MCBO, we prove the first sublinear cumulative regret bound for CBO. We show how the bound scales depending on the graph structure, and demonstrate that CBO can lead to a potentially exponential improvement in cumulative regret, with respect to the number of actions, compared to standard BO.
• By an application of the reparameterization trick, we show how our algorithm can be efficiently implemented with popular gradient-based optimizers.
• We evaluate MCBO on existing CBO benchmarks and the related setting of function network BO. Our results show that MCBO performs favorably compared to methods designed specifically for these tasks.

2. BACKGROUND AND PROBLEM STATEMENT

We consider the problem of an agent interacting with an SCM for $T$ rounds in order to maximize the value of a particular target variable. We start by introducing SCMs and the kinds of interventions an agent can perform on an SCM. In the following, we denote with $[m]$ the set of integers $\{0, \dots, m\}$.

Structural Causal Models An SCM is described by a tuple $\langle \mathcal{G}, Y, X, F, \Omega \rangle$ of the following elements: $\mathcal{G}$ is a known DAG; $Y$ is the reward variable of interest; $X = \{X_i\}_{i=0}^{m-1}$ is a set of observed random variables; the set $F = \{f_i\}_{i=0}^{m}$ defines the functional relations between these variables; and $\Omega = \{\Omega_i\}_{i=0}^{m}$ is a set of independent noise variables with zero mean and known distribution. We use the notation $Y$ and $X_m$ interchangeably and assume the elements of $X$ are topologically ordered, i.e., $X_0$ is a root and $X_m$ is a leaf. We use the notation $\mathrm{pa}_i \subset \{0, \dots, m\}$ for the indices of the parents of the $i$th node, and $Z_i = \{X_j\}_{j \in \mathrm{pa}_i}$ for the parents of the $i$th node. We sometimes use $X_i$ to refer to both the $i$th node and the $i$th random variable. Each $X_i$ is generated according to the function $f_i : \mathcal{Z}_i \to \mathcal{X}_i$, taking the parent nodes $Z_i$ of $X_i$ as input: $x_i = f_i(z_i) + \omega_i$, where lowercase denotes a realization of the corresponding random variable. The reward is a scalar $x_m \in \mathbb{R}$. An observation $X_i$ is defined over a compact set $x_i \in \mathcal{X}_i \subset \mathbb{R}^d$, and its parents are defined over $\mathcal{Z}_i = \prod_{j \in \mathrm{pa}_i} \mathcal{X}_j$ for $i \in [m-1]$.

Interventions At each interaction round, the agent performs an intervention on the SCM. In this work, we consider two types of intervention models. We first consider a soft intervention model (Eberhardt & Scheines, 2007) where interventions are parameterized by controllable action variables. Let $\mathcal{A}_i \subset \mathbb{R}^q$ denote the compact space of an action $a_i$ and $\mathcal{A}$ be the space of all actions $a = \{a_i\}_{i=0}^{m}$. We represent the actions as additional nodes in $\mathcal{G}$ (see Fig. 1): $a_i$ is a parent of only $X_i$, and hence an additional input to $f_i$.
Since $f_i$ is unknown, in our soft intervention model the agent does not know a priori the functional effect of $a_i$ on $X_i$. A simple example of a soft intervention is a shift intervention $x_i = f_i(z_i, a_i) + \omega_i = g_i(z_i) + a_i + \omega_i$ for some function $g_i$. A shift intervention might occur in our example of adding Nitrogen fertilizer to soil and then measuring the total soil Nitrogen concentration. While our theoretical results focus on data obtained via soft interventions, our experiments also consider two other data sources. First, we consider a hard intervention model: hard interventions (often referred to as do-interventions) set the targeted variable to a specific distribution independently of the variable's parents. For example, a doctor sets the dosage of a patient's medication, which fixes the dosage to a specific value (Aglietti et al., 2020b). Second, a special case under both intervention models is the collection of observational data, which is when no intervention is performed on the system. In the soft intervention model, not intervening on node $i$ is equivalent to setting $a_i = 0$; an example would be not applying any Nitrogen fertilizer to the soil. In practice, the agent may have access to some previous observational data before its first interaction with the system. In the following, we introduce the problem setup under the soft intervention model and then adapt it to the hard intervention model.
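To make the interaction model concrete, the following minimal sketch performs ancestral sampling of a topologically ordered SCM under shift interventions. The chain graph, mechanisms, and noise scale are illustrative assumptions, not part of the paper's benchmarks; in MCBO the mechanisms $f_i$ are unknown and must be learned, whereas here they are fixed so the system can be simulated.

```python
import numpy as np

def sample_scm(parents, mechanisms, actions, noise_scale=0.1, rng=None):
    """Ancestral sampling of a topologically ordered SCM under shift interventions.

    parents[i]    -- indices of the parents of node i (all earlier in the order)
    mechanisms[i] -- function g_i taking the realized parent values as a vector
    actions[i]    -- shift a_i added to node i (0 means "no intervention")
    """
    rng = np.random.default_rng(rng)
    x = np.zeros(len(parents))
    for i in range(len(parents)):            # iterate in topological order
        z = x[parents[i]]                    # realized parent values z_i
        # shift intervention: x_i = g_i(z_i) + a_i + omega_i, bounded noise
        x[i] = mechanisms[i](z) + actions[i] + noise_scale * rng.uniform(-1.0, 1.0)
    return x

# Toy chain X0 -> X1 -> X2 (= Y), loosely mimicking the Nitrogen example
parents = [[], [0], [1]]
mechanisms = [lambda z: 0.5, lambda z: 0.8 * z[0], lambda z: z[0] ** 2]
y = sample_scm(parents, mechanisms, actions=[0.0, 1.0, 0.0], noise_scale=0.0)[-1]
```

Passing `actions=[0.0, 0.0, 0.0]` corresponds to collecting an observational sample.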

Constraints on interventions

In many applications, we may not be able to intervene on all nodes simultaneously. For example, a farmer may only have the capacity to apply fertilizer at two out of three possible time windows in Fig. 1. This results in an action space with cardinality constraints, written as

$\mathcal{A} = \big\{ a = \{a_i\}_{i=0}^{m} : \textstyle\sum_{i=0}^{m} \mathbb{1}[a_i \neq 0] \le c \big\}, \quad c \ge 1. \quad (1)$

Problem statement We consider the problem of an agent sequentially interacting with an SCM, with known DAG $\mathcal{G}$ and a fixed but unknown set of functions $F = \{f_i\}_{i=0}^{m}$ with $f_i : \mathcal{Z}_i \times \mathcal{A}_i \to \mathcal{X}_i$. At round $t$ we select actions $a_{:,t} = \{a_{i,t}\}_{i=0}^{m}$ and obtain observations $x_{:,t} = \{x_{i,t}\}_{i=0}^{m}$, where we add an additional subscript to denote the round of interaction. The action $a_{i,t}$ and the observation $x_{i,t}$ are related by

$x_{i,t} = f_i(z_{i,t}, a_{i,t}) + \omega_{i,t}, \quad \forall i \in [m]. \quad (2)$

If $i$ corresponds to a root node, the parent vector $z_{i,t}$ denotes an empty vector, and the output of $f_i$ only depends on the action $a_{i,t}$. Since we cannot intervene on the target variable $X_m$, we fix $a_m = 0$. The reward is given by

$y_t = f_m(z_{m,t}, a_{m,t}) + \omega_{m,t}, \quad (3)$

which implicitly depends on the whole intervention vector $a_{:,t}$. We define the action that maximizes the expected reward by

$a^* = \arg\max_{a \in \mathcal{A}} \mathbb{E}[y \mid a], \quad (4)$

where, unless otherwise stated, expectations are taken over the noise $\omega$.

Performance metric Our agent's goal is to design a sequence of interventions $\{a_{:,t}\}_{t=0}^{T}$ that achieves a high average expected reward. We hence study the cumulative regret (Lattimore & Szepesvári, 2020) over a time horizon $T$:

$R_T = \sum_{t=1}^{T} \big( \mathbb{E}[y \mid a^*] - \mathbb{E}[y \mid a_{:,t}] \big).$

A sublinear growth rate of $R_T$ with $T$ implies vanishing average regret: $R_T / T \to 0$ as $T \to \infty$. As an alternative to cumulative regret, one can also study the simple regret $\mathbb{E}[y \mid a^*] - \mathbb{E}[y \mid a_T]$. The most appropriate metric depends on the application. In the Nitrogen fertilizer example, cumulative regret might be preferable because we care about obtaining high crop yields across all years, not just in one final year.
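As a small illustration of these definitions, the helpers below check membership of an action vector in the constrained action space of Eq. (1) and accumulate the regret $R_T$ from a sequence of expected rewards; the concrete numbers are hypothetical.

```python
import numpy as np

def feasible(a, c):
    """Cardinality constraint of Eq. (1): at most c nonzero interventions."""
    return int(np.count_nonzero(a)) <= c

def cumulative_regret(expected_rewards, optimal_reward):
    """R_T = sum_t (E[y | a*] - E[y | a_t]); sublinear growth means R_T / T -> 0."""
    return float(np.sum(optimal_reward - np.asarray(expected_rewards)))

# An action intervening on 2 of 4 nodes satisfies the constraint with c = 2,
# while intervening on 3 nodes does not.
assert feasible(np.array([0.0, 1.2, 0.0, 0.7]), c=2)
assert not feasible(np.array([0.1, 1.2, 0.0, 0.7]), c=2)
```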
Index notation Let $x_{i,t} = [x_{i,t,1}, \dots, x_{i,t,d}]^\top$ denote a vector where $x_{i,t,l}$ indicates indexing the component $l \in [d]$ of the $t$th timepoint of the observations at node $i \in [m]$. For functions with vector output, e.g., $f_i : \mathcal{Z}_i \to \mathcal{X}_i$, we sometimes use notation with an additional input to $f_i$ that indicates the output dimension: $f_i(z, a) = [f_i(z, a, 1), \dots, f_i(z, a, d)]^\top$.

Regularity assumptions

We consider standard smoothness assumptions for the unknown functions $f_i : \mathcal{S} \to \mathcal{X}_i$ defined over a compact domain $\mathcal{S}$ (Srinivas et al., 2010). In particular, for each node $i \in [m]$, we assume that $f_i(\cdot)$ belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{k_i}$, a space of smooth functions defined on $\mathcal{S} = \mathcal{Z}_i \times \mathcal{A}_i$. This means that $f_i \in \mathcal{H}_{k_i}$ is induced by a kernel function $k_i : \mathcal{S}' \times \mathcal{S}' \to \mathbb{R}$, where $\mathcal{S}' = \mathcal{S} \times [d]$. We also assume that $k_i(s, s') \le 1$ for every $s, s' \in \mathcal{S}'$. Moreover, the RKHS norm of $f_i(\cdot)$ is assumed to be bounded, $\|f_i\|_{k_i} \le B_i$, for some fixed constant $B_i > 0$. Finally, to ensure the compactness of the domains $\mathcal{Z}_i$, we assume that the noise $\omega$ is bounded, i.e., $\omega_i \in [-1, 1]^d$.

Problem statement under hard interventions Under a hard intervention model, instead of selecting an action $a$ as in Eq. (4), the agent must select both a set of intervention targets $I \in \mathcal{I} \subset \mathcal{P}([m-1])$ and their values $a_I \in \mathcal{A}_I$. For hard interventions we can rewrite Eq. (2) as

$x_i = \begin{cases} a_i & \text{if } i \in I \\ f_i(z_i) + \omega_i & \text{otherwise} \end{cases} \quad \forall i \in [m],$

where $f_i$ is unknown and satisfies the same regularity assumptions. Further constraints similar to Eq. (1) can be placed on either the intervention targets $I$ or the action values $a_I$. Finally, observational data corresponds to the empty intervention set $I = \emptyset$.

2.1. RELATED WORK

Optimal decision-making in SCMs has been the subject of several recent works, for example, in the bandit setting (Lattimore et al., 2016; Bareinboim et al., 2015). Aglietti et al. (2020b) introduced the causal Bayesian optimization setting that we study, proposing an expected-improvement-based acquisition for hard interventions. The function class studied in this paper is similar to that of deep Gaussian processes (GPs) (Damianou & Lawrence, 2013) in that MCBO models $Y$ as a composition of GPs given $a$. Deep GPs, however, do not make use of intermediate system variables and do not compose GPs according to a causal graph structure. Our use of the reparameterization trick to practically implement an upper confidence bound acquisition function (Srinivas et al., 2010) in MCBO is inspired by Curi et al. (2020), who apply ideas from BO to design an algorithm for sample-efficient reinforcement learning.

3. ALGORITHM

In this section, we propose the MCBO algorithm, describing the probabilistic model and acquisition function used. We first introduce MCBO under the soft intervention setup and then describe how to adapt it to hard interventions.

Statistical model

We use Gaussian processes (GPs) for learning the RKHS functions $f_0, \dots, f_m$ from observations. Our regularity assumptions permit the construction of confidence bounds using GP models with priors associated with the RKHS kernels. We refer to Rasmussen (2003) for more background on the relation between GPs and RKHS functions. For all $i \in [m]$, let $\mu_{i,0}$ and $\sigma^2_{i,0}$ denote the prior mean and variance functions for each $f_i$, respectively. Since $\omega_i$ is bounded, it is also subgaussian, and we denote its variance by $\rho_i^2$. The corresponding posterior GP mean and variance, denoted by $\mu_{i,t}$ and $\sigma^2_{i,t}$ respectively, are computed based on the previous evaluations $D_t = \{z_{:,1:t}, a_{:,1:t}, x_{:,1:t}\}$. In particular, for each function $f_i(\cdot, \cdot, l)$ defined by the given kernel $k_i$ and output component $l$:

$\mu_{i,t}(z_i, a_i, l) = k_{i,t}(z_i, a_i, l)^\top (K_t + \rho_i^2 I)^{-1} \mathrm{vec}(x_{i,1:t}), \quad (7)$

$\sigma^2_{i,t}(z_i, a_i, l) = k_i((z_i, a_i, l); (z_i, a_i, l)) - k_{i,t}(z_i, a_i, l)^\top (K_t + \rho_i^2 I)^{-1} k_{i,t}(z_i, a_i, l), \quad (8)$

where $I$ denotes the identity matrix, $\mathrm{vec}(x_{i,1:t}) = [x_{i,1,1}, x_{i,1,2}, \dots, x_{i,t,d}]^\top$, and for $(t_1, l), (t_2, l') \in [(1,1), (1,2), \dots, (t,d)]$:

$[K_t]_{(t_1,l),(t_2,l')} = k_i((z_{i,t_1,l}, a_{i,t_1,l}, l); (z_{i,t_2,l'}, a_{i,t_2,l'}, l')),$

$k_{i,t}(z_i, a_i, l)^\top = [k_i((z_{i,1,1}, a_{i,1,1}, 1); (z_i, a_i, l)), \dots, k_i((z_{i,t,d}, a_{i,t,d}, d); (z_i, a_i, l))].$

We write $\mu_{i,t} = [\mu_{i,t}(\cdot, 1), \dots, \mu_{i,t}(\cdot, d)]^\top$ and similarly for $\sigma_{i,t}$. We give more background on the posterior updates of vector-valued GPs in Appendix A.1. At time $t$, the known set $\mathcal{M}_t$ of statistically plausible functions $\tilde{F} = \{\tilde{f}_i\}_{i=0}^{m}$ (functions that lie inside the confidence interval given by the posterior of each GP) is defined as:

$\mathcal{M}_t = \Big\{ \tilde{F} = \{\tilde{f}_i\}_{i=0}^{m} \ \text{s.t.} \ \forall i: \ \tilde{f}_i \in \mathcal{H}_{k_i}, \ \|\tilde{f}_i\|_{k_i} \le B_i, \ \text{and} \ \big|\tilde{f}_i(z_i, a_i) - \mu_{i,t-1}(z_i, a_i)\big| \le \beta_{i,t}\, \sigma_{i,t-1}(z_i, a_i), \ \forall z_i \in \mathcal{Z}_i, a_i \in \mathcal{A}_i \Big\}. \quad (9)$
Here, $\beta_{i,t}$ is a parameter that ensures the validity of the confidence bounds. Examples of concentration inequalities under similar regularity assumptions, along with explicit forms for $\beta_{i,t}$, can be found in Chowdhury & Gopalan (2019) and Srinivas et al. (2010). In the following, we set $\beta_{i,t} = \beta_t$ for all $i$ such that the confidence bounds in Eq. (9) remain valid.
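For a single scalar-output node, the posterior updates of Eqs. (7)-(8) and the confidence band of Eq. (9) reduce to standard GP regression. The sketch below implements that special case with a squared-exponential kernel; the kernel choice, length scale, noise level, and $\beta$ are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel matrix between row-stacked inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(S_train, x_train, S_query, rho=0.1):
    """Posterior mean/std of a scalar GP as in Eqs. (7)-(8), zero prior mean."""
    K = rbf(S_train, S_train) + rho ** 2 * np.eye(len(S_train))
    K_inv = np.linalg.inv(K)
    k_q = rbf(S_train, S_query)                        # t x q cross-covariances
    mu = k_q.T @ K_inv @ x_train
    var = 1.0 - np.einsum('tq,ts,sq->q', k_q, K_inv, k_q)   # since k(s, s) = 1
    return mu, np.sqrt(np.clip(var, 0.0, None))

S_train = np.array([[0.0], [1.0]])
x_train = np.array([0.0, 1.0])
S_query = np.array([[0.0], [2.0]])                     # one seen, one far input
mu, sigma = gp_posterior(S_train, x_train, S_query)
beta = 2.0
ucb, lcb = mu + beta * sigma, mu - beta * sigma        # confidence band, Eq. (9)
```

At a visited input the posterior standard deviation shrinks toward the noise level, while far from the data it stays near the prior value of 1, which is exactly the uncertainty the optimism principle exploits.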

Algorithm 1 Model-based Causal BO (MCBO)

Require: Parameters $\{\beta_t\}_{t \ge 1}$, $\mathcal{G}$, $\Omega$, kernel functions $k_i$ and prior means $\mu_{i,0} = 0$ $\forall i \in [m]$
1: for $t = 1, 2, \dots$ do
2:   Construct confidence bounds as in Eq. (9)
3:   Select $a_t \in \arg\max_{a \in \mathcal{A}} \max_{\eta(\cdot)} \mathbb{E}[y \mid \{\tilde{f}\}, a]$ as in Eq. (12)
4:   Observe samples $\{z_{i,t}, x_{i,t}\}_{i=0}^{m}$
5:   Use $D_t$ to update the posterior $\{\mu_{i,t}(\cdot), \sigma^2_{i,t}(\cdot)\}_{i=0}^{m}$ as in Eqs. (7) and (8)
6: end for

Algorithm 2 Model-based Causal BO with Hard Interventions (MCBO)

Require: Parameters $\{\beta_t\}_{t \ge 1}$, $\mathcal{G}$, $\Omega$, kernel functions $k_i$ and prior means $\mu_{i,0} = 0$ $\forall i \in [m]$
1: for $t = 1, 2, \dots$ do
2:   Construct confidence bounds as in Eq. (9)
3:   Select $I, a_I \in \arg\max_{I, a_I} \max_{\eta} \mathbb{E}[y \mid \{\tilde{f}\}, \mathrm{do}(X_I = a_I)]$
4:   Observe samples $\{z_{i,t}, x_{i,t}\}_{i=0}^{m}$
5:   Use $D_t$ to update the posterior $\{\mu_{i,t}(\cdot), \sigma^2_{i,t}(\cdot)\}_{i=0}^{m}$ as in Eqs. (7) and (8)
6: end for

Acquisition function At each round $t$, interventions are selected by maximizing an acquisition function. Ours is based on the upper confidence bound acquisition function (Srinivas et al., 2010). That is, we optimistically pick the interventions that yield the highest expected return among all system models that are still plausible given past observations:

$a_{:,t} = \arg\max_{a \in \mathcal{A}} \max_{\tilde{F} \in \mathcal{M}_t} \mathbb{E}_\omega\big[y \mid \tilde{F}, a\big]. \quad (10)$

Note that Eq. (10) is not amenable to commonly used optimization techniques, due to the maximization over a set of functions with bounded RKHS norm. Therefore, following Curi et al. (2020), we use the reparameterization trick to write any $\tilde{f}_i \in \tilde{F} \in \mathcal{M}_t$ using a function $\eta_i : \mathcal{Z}_i \times \mathcal{A}_i \to [-1, 1]^{d_i}$ as:

$\tilde{f}_{i,t}(\tilde{z}_i, \tilde{a}_i) = \mu_{i,t-1}(\tilde{z}_i, \tilde{a}_i) + \beta_t\, \sigma_{i,t-1}(\tilde{z}_i, \tilde{a}_i) \circ \eta_i(\tilde{z}_i, \tilde{a}_i), \quad (11)$

where $\tilde{x}_i = \tilde{f}_i(\tilde{z}_i, \tilde{a}_i) + \tilde{\omega}_i$ denotes observations from simulating actions in one of the plausible models, not necessarily the true model, and $\circ$ denotes the elementwise multiplication of vectors. This reparametrization allows for rewriting our acquisition function in terms of $\eta : \mathcal{Z} \times \mathcal{A} \to [-1, 1]^{|\mathcal{X}|}$:

$a_{:,t} = \arg\max_{a \in \mathcal{A}} \max_{\eta(\cdot)} \mathbb{E}_\omega\big[y \mid \tilde{F}, a\big], \quad \text{s.t.} \quad \tilde{F} = \{\tilde{f}_{i,t}\} \ \text{as in Eq. (11)}. \quad (12)$

Intuitively, the variables $\eta$ allow for choosing optimistic but plausible models given the confidence bounds. In practice, the function $\eta$ can be parameterized by, for example, a neural network, and then standard optimization techniques are applied. For the theory, we assume access to an oracle providing the global optimum of Eq. (12). In practice, such an oracle may be computationally infeasible due to the non-convexity of Eq. (12); we discuss heuristics for approximating this oracle in Appendix A.3. Algorithm 1 summarizes our model-based causal BO approach. We note that for the special case of an SCM following the DAG of Fig. 1(a), our algorithm and the associated guarantees reduce to standard BO (Srinivas et al., 2010).
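To illustrate Eqs. (10)-(12) at a minimal scale, the sketch below propagates the $\eta$-reparameterized optimistic model through a two-node chain and jointly maximizes over the action and $\eta$. The posterior means and standard deviations are fixed toy functions (assumptions for illustration only), and a grid search stands in for the gradient-based optimization used in practice.

```python
import numpy as np

# Toy two-node chain: x = f0(a), y = f1(x). Only GP posteriors (mu_i, sigma_i)
# are available; eta_i in [-1, 1] selects one plausible model per node.
mu0 = lambda a: np.sin(a)    # illustrative posterior mean of f0
s0 = lambda a: 0.2           # illustrative (constant) posterior std of f0
mu1 = lambda x: x - x ** 2   # illustrative posterior mean of f1
s1 = lambda x: 0.1           # illustrative posterior std of f1
beta = 1.0

def optimistic_reward(a, eta0, eta1):
    """Propagate the reparameterized model f~_i = mu_i + beta*sigma_i*eta_i (Eq. 11)."""
    x = mu0(a) + beta * s0(a) * eta0
    return float(mu1(x) + beta * s1(x) * eta1)

# Eq. (12): joint maximization over the action a and the eta variables.
actions = np.linspace(-2.0, 2.0, 81)
etas = np.linspace(-1.0, 1.0, 21)
best_val, best_a = max(
    (optimistic_reward(a, e0, e1), float(a))
    for a in actions for e0 in etas for e1 in etas
)
```

With gradient-based optimizers, $\eta$ would instead be a neural network whose output is squashed (e.g. through tanh) to stay in $[-1, 1]$, and the composed objective is ascended end-to-end.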

Hard interventions MCBO also naturally generalizes to hard interventions (Algorithm 2).

In our experiments, we perform the combinatorial optimization over the set of nodes I by enumeration because |I| is not large for the instances we consider. We apply the notion of a minimal intervention set from Lee & Bareinboim (2019) to prune sets of intervention targets that contain redundant interventions, resulting in a smaller set to optimize over.

4. THEORETICAL ANALYSIS

This section describes the convergence guarantees for MCBO under a soft intervention model, giving the first sublinear cumulative regret bounds for causal BO. We start by introducing the additional technical assumptions required for the analysis.

Assumption 1 (Functional relations). All $f_i \in F$ are $L_f$-Lipschitz continuous.

Assumption 2 (Continuity). $\forall i, t$, the functions $\mu_{i,t}, \sigma_{i,t}$ are $L_\mu$-, $L_\sigma$-Lipschitz continuous, respectively.

Assumption 3 (Calibrated model). All statistical models are calibrated w.r.t. $F$, so that $\forall i, t$ there exists a sequence $\beta_t \in \mathbb{R}_{>0}$ such that, with probability at least $1 - \delta$, for all $z_i, a_i \in \mathcal{Z}_i \times \mathcal{A}_i$ we have, element-wise, $|f_i(z_i, a_i) - \mu_{i,t-1}(z_i, a_i)| \le \beta_t \sigma_{i,t-1}(z_i, a_i)$.

Assumption 1 follows directly from the regularity assumptions of Section 2. Assumption 2 holds if the RKHS of each $f_i$ has a Lipschitz continuous kernel (see Curi et al. (2020), Appendix G). Assumption 2 restricts the convergence guarantees to soft interventions that affect their target variable in a smooth way, meaning that our analysis does not directly apply to the hard intervention model. We nevertheless experimentally demonstrate the effectiveness of MCBO in non-smooth settings, such as CBO with hard interventions. Assumption 3 holds when the $i$th GP prior uses the same kernel as the RKHS of $f_i$ and $\beta_t$ is sufficiently large to ensure the confidence bounds in Eq. (9) hold.

In the DAG $\mathcal{G}$ over nodes $\{X_i\}_{i=0}^{m}$, let $N$ denote the maximum distance from a root node to $X_m$: $N = \max_i \mathrm{dist}(X_i, X_m)$, where $\mathrm{dist}(\cdot, \cdot)$ is measured as the number of edges in the longest path from a node $X_i$ to the reward node $X_m$. Let $K$ denote the maximum number of parents of any variable in $\mathcal{G}$: $K = \max_i |\mathrm{pa}_i|$. The following theorem bounds the performance of MCBO in terms of cumulative regret.

Theorem 1. Consider the optimization problem in Eq. (4) with an SCM satisfying Assumptions 1-3, where $\mathcal{G}$ is known but $F$ is unknown.
Then for all $T \ge 1$, with probability at least $1 - \delta$, the cumulative regret of Algorithm 1 is bounded by

$R_T \le \mathcal{O}\big( L_f^N L_\sigma^N \beta_T^N K^N m \sqrt{T \gamma_T} \big).$

Here, $\gamma_T = \max_i \gamma_{i,T}$, where the node-specific $\gamma_{i,T}$ denotes the maximum information gain at time $T$ commonly used in standard regret guarantees for Bayesian optimization (Srinivas et al., 2010). This maximum information gain is known to be sublinear in $T$ for many common kernels, such as linear and squared exponential kernels, resulting in overall sublinear regret for MCBO. We refer to Appendix A.2.3 for the proof.

Theorem 1 demonstrates that the use of the graph structure in MCBO results in a potentially exponential improvement in how the cumulative regret scales with the number of actions $m$. Standard Bayesian optimization as in Fig. 1(a), which makes no use of the graph structure, incurs cumulative regret exponential in $m$ (Srinivas et al., 2010) when using a squared exponential kernel. When all $X_i$ in MCBO are modeled with squared exponential kernels that model output components independently, we have $\gamma_T = \mathcal{O}\big(d(Kd + q)(\log T)^{Kd+q+1}\big)$, resulting in a cumulative regret that is exponential in $K$ and exponential in $N$. However, note that $m \ge K + N$. For several common graphs, the exponential scaling in $N$ and $K$ can be significantly more favorable than exponential scaling in $m$. Consider the case of $\mathcal{G}$ having the binary-tree-like structure in appendix Fig. 3, where $N = \log(m)$ and $K = 2$. In such settings, the cumulative regret of MCBO has only polynomial dependence on $m$. We further discuss the bound in Theorem 1 for specific kernels in Appendix A.2.3 and the dependence of $\beta_T$ on $T$ in Appendix A.2.4.

5. EXPERIMENTS

In this section, we empirically evaluate MCBO on six problems taken from previous CBO and function network papers (Aglietti et al., 2020b; Astudillo & Frazier, 2021b). The DAGs corresponding to each task are given in Fig. 3.

Baselines We compare our approach with three baselines: (i) Expected Improvement for Function Networks (EIFN) (Astudillo & Frazier, 2021b); (ii) Causal Bayesian Optimization (EICBO) (Aglietti et al., 2020b); and (iii) standard upper confidence bound (GP-UCB) (Srinivas et al., 2010), which models the objective given interventions with a single GP (see Fig. 1a). To enable a fair comparison, we only apply EICBO in the hard intervention setting, since it was designed specifically for a hard intervention model.

Experimental setup

We report the average reward as a function of the number of system interventions performed. The average reward at time $T$ is defined by $\frac{1}{T} \sum_{t=0}^{T} \mathbb{E}_\omega[Y \mid a_t]$ and is inversely related to cumulative regret, in that a high average expected reward is equivalent to a low cumulative regret. This matches the performance metric studied in our analysis. Average expected reward and cumulative regret are natural metrics for many real applications, like crop management, in which we want consistently high yield, and the healthcare-inspired setting we study in these experiments, where we seek good treatment outcomes for more than a single patient. In the appendix, we show experiments measuring the best expected reward of any action previously chosen, which is closer to an inverse of the simple regret. We report mean performance over 20 random seeds, with error bars showing $\pm \sigma / \sqrt{20}$, where $\sigma$ is the standard deviation across the repeats. All figures that are referenced but not in the main paper can be found in the appendix.

For the guarantees in Theorem 1 to hold, $\{\beta_t\}_{t=0}^{T}$ must be chosen so that the model is calibrated at all time steps as in Eq. (9). In practice, we select a single $\beta$ such that $\beta_t = \beta, \forall t$. Choosing $\beta$ too pessimistically will result in high regret, as demonstrated by the dependence of the guarantee on $\beta^N$. For GP-UCB and MCBO, $\beta$ is chosen by cross-validation across tasks, as described in the appendix.

Toy Experiment First, we evaluate on the synthetic ToyGraph setting from Aglietti et al. (2020b). ToyGraph is a hard intervention CBO task where $\mathcal{I} = \{\emptyset, \{0\}, \{1\}\}$. All methods start with 10 observational samples and samples from 2 random interventions on each $I \in \mathcal{I}$. When EICBO obtains interventional data, it obtains noiseless samples because it is not designed for the noisy setting; by noiseless samples, we mean that for action $a$, EICBO observes $\mathbb{E}_\omega[Y \mid a]$. Other methods obtain single samples from the distribution $Y, X \mid a$.
In ToyGraph, $I = \{1\}$ is the optimal target, but to efficiently learn the optimal $a_{\{1\}}$, the agent must generalize from both observational data and interventional data on the other target $I = \{0\}$. EICBO models $Y$ given $a_I$ with a separate GP for every $I$ and is consequently only able to make use of observational data and interventional data with the same target. Since MCBO learns a full system model, it incorporates all observations into the learned model, even when interventions do not match. Figure 2(a) shows that the average reward of MCBO (Algorithm 2) is favorable compared to the baselines. Both baselines are built on expected improvement, which will continue to explore even after high-reward solutions are found; this explains the non-monotonic average reward of EICBO.

Healthcare Experiment PSAGraph is inspired by the DAG from a real healthcare setting (Ferro et al., 2015) and is also benchmarked by Aglietti et al. (2020b). The agent intervenes by prescribing statins and/or aspirin while specifying the dosage, with the aim of controlling prostate-specific antigen (PSA) levels. Here $\mathcal{I} = \{\emptyset, \{2\}, \{3\}, \{2, 3\}\}$ and all interventions are hard interventions. Initial sample sizes are the same as for ToyGraph. Figure 2(b) again shows EICBO having a non-monotonic average reward, while MCBO performs strongly.

Noiseless Function Networks In addition to the hard intervention setting, we evaluate MCBO and the baselines on four tasks from Astudillo & Frazier (2021b). All systems have up to six nodes and varying graph structures. In function networks, actions can affect multiple system variables, and system variables can be children of multiple actions. Function networks are deterministic, so $\omega = 0$. MCBO (Algorithm 1) can be applied directly in this setting, and the guarantees also transfer easily.
As in Astudillo & Frazier (2021b), there are no constraints (besides a bounded domain) on actions, and the agent is initialized with $2A + 1$ samples from random actions, where $A$ is the number of action nodes. EIFN performs better on Dropwave, while MCBO is substantially better than EIFN on the Ackley and Rosenbrock tasks. Overall, there are not sufficiently many tasks established in the literature to say conclusively which properties might make a task favor EIFN over MCBO. Understanding this would be interesting future work and likely relates to the wider conversation in BO comparing expected improvement and UCB algorithms (Merrill et al., 2021). We find that the naive GP-UCB approach, which does not use the graph structure, generally performs poorly, especially on problems with larger graphs like Alpine2. On Ackley, EIFN does not achieve a monotonically improving average reward, which is not unexpected given that it is based on expected improvement.

Noisy Function Networks We modify three of the function network settings to include additive zero-mean Gaussian noise at every system variable, making $\omega$ non-zero. EIFN is designed for deterministic function networks and has no convergence guarantees in this setting. Results in terms of average reward (Figure 2d, e, f) and best reward (Figure 6g, h, i) are comparable to the noiseless case, with MCBO and EIFN both performing well compared to GP-UCB.

6. CONCLUSION

This paper introduces MCBO, a principled model-based approach to solving Bayesian optimization problems over structural causal models. Our approach explicitly models all variables in the system and propagates epistemic uncertainty through the model to select interventions based on the optimism principle. This allows MCBO to solve global optimization tasks in systems that have known causal structure with improved sample efficiency compared to prior works. We prove the first non-asymptotic convergence guarantees for an algorithm solving the causal Bayesian optimization problem and demonstrate that its theoretical advantages are reflected in strong empirical performance. Future work might consider how to apply the method to large graphs, where the sets of all possible discrete intervention targets cannot be efficiently enumerated.

A.1 BACKGROUND ON GPS

Here we give more background on the vector-valued GP models we use. We drop the node index $i$ and consider modelling a single function with scalar output $f : \mathcal{Z} \times \mathcal{A} \to \mathbb{R}$ in RKHS $\mathcal{H}_k$ with kernel $k : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, where $\mathcal{S} = \mathcal{Z} \times \mathcal{A}$. We assume measurement noise with variance $\rho^2$. Assuming prior mean and variance $\mu_0, \sigma_0$ and dataset $D_t = \{z_{1:t}, a_{1:t}, x_{1:t}\}$ where $x_t \in \mathbb{R}$, we get a posterior GP mean and variance given by the updates

$\mu_t(z, a) = k_t(z, a)^\top (K_t + \rho^2 I)^{-1} x_{1:t},$

$\sigma_t^2(z, a) = k((z, a); (z, a)) - k_t(z, a)^\top (K_t + \rho^2 I)^{-1} k_t(z, a),$

$(K_t)_{t_1, t_2} = k((z_{t_1}, a_{t_1}); (z_{t_2}, a_{t_2})),$

$k_t(z, a) = [k((z_1, a_1); (z, a)), \dots, k((z_t, a_t); (z, a))]^\top,$

where $I$ is the identity matrix. This follows the standard GP update given in Rasmussen (2003).

In this work, we model functions with vector-valued outputs $f : \mathcal{Z} \times \mathcal{A} \to \mathcal{X} \subset \mathbb{R}^d$ (we continue to drop the $i$ index). For this we follow Chowdhury & Gopalan (2019) and use a scalar-output GP that takes the component of the output vector under consideration as an additional input. That is, we model $f(\cdot, l)$ (the $l$th output component of the function under study), where $f = [f(\cdot, 1), \dots, f(\cdot, d)]$. We again assume that $f$ is from an RKHS $\mathcal{H}_k$, but with kernel $k : \mathcal{S}' \times \mathcal{S}' \to \mathbb{R}$, where $\mathcal{S}' = \mathcal{Z} \times \mathcal{A} \times [d]$. Assuming prior mean and variance $\mu_0, \sigma_0$ and dataset $D_t = \{z_{1:t}, a_{1:t}, x_{1:t}\}$ where $x_t \in \mathbb{R}^d$, we get a posterior GP mean and variance given by the updates

$\mu_t(z, a, l) = k_t(z, a, l)^\top (K_t + \rho^2 I)^{-1} \mathrm{vec}(x_{1:t}),$

$\sigma_t^2(z, a, l) = k((z, a, l); (z, a, l)) - k_t(z, a, l)^\top (K_t + \rho^2 I)^{-1} k_t(z, a, l),$

where $\mathrm{vec}(x_{1:t}) = [x_{1,1}, x_{1,2}, \dots, x_{t,d}]^\top$ and for $(t_1, l), (t_2, l') \in [(1,1), (1,2), \dots, (t,d)]$:

$[K_t]_{(t_1,l),(t_2,l')} = k((z_{t_1,l}, a_{t_1,l}, l); (z_{t_2,l'}, a_{t_2,l'}, l')),$

$k_t(z, a, l)^\top = [k((z_{1,1}, a_{1,1}, 1); (z, a, l)), \dots, k((z_{t,d}, a_{t,d}, d); (z, a, l))].$

This follows the posterior update of Chowdhury & Gopalan (2019).
The key idea is to use a single scalar-output GP with kernel k for modeling all output components, but introduce the component index as part of the input space. This requires introducing the notation vec(x 1:t ) and ordering all observations, of all time points and all components, in a single index, since the GP update for component l considers observations of all other components. Under this GP model the components of f need not be independent if kernel k is designed to model dependency between the components. In our work, we use one of these vector-valued GP models for each individual random variable in our causal model, leading to the reintroduction of the additional i index in Eqs. ( 7) and (8).
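A minimal sketch of this construction: a kernel over augmented inputs $(s, l)$ that multiplies an RBF kernel over $s$ with a kernel over the output index $l$. Taking the index kernel to be $\delta(l, l')$ yields independent output components; this factorized choice is our illustrative assumption, since any positive semi-definite kernel over $[d]$ could instead model dependent components.

```python
import numpy as np

def joint_kernel(S1, L1, S2, L2, ls=1.0):
    """Kernel over augmented inputs (s, l): RBF over s times delta(l, l'),
    i.e. output components are modelled independently."""
    d2 = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2) * (L1[:, None] == L2[None, :])

# t = 2 observations of a d = 2 output, in the vec-ordering
# (1,1), (1,2), (2,1), (2,2): each input is repeated once per component.
S = np.array([[0.0], [0.0], [1.0], [1.0]])
L = np.array([0, 1, 0, 1])
K_t = joint_kernel(S, L, S, L)   # 4 x 4 kernel matrix over all (t, l) pairs
```

Replacing the delta factor with a full matrix over the indices (as in intrinsic coregionalization models) would let the components share information.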

A.2 PROOFS FOR THE THEORETICAL ANALYSIS

Our analysis closely follows Curi et al. (2020), particularly the proofs in their Appendix D, where similar guarantees are proved for a model-based reinforcement learning problem. In contrast to Curi et al. (2020), which models RL transition dynamics with a single GP for all timesteps, MCBO uses independent GPs to model the functional relations $\{f_1, \ldots, f_m\}$ and uses the causal graph $G$ to determine the input and output variables of these functions. This section is organized as follows. In Appendix A.2.1, we discuss a notion of model complexity $\Gamma_T$ similar to the one introduced in the RL setting by Curi et al. (2020). We then bound the cumulative regret in terms of $\Gamma_T$ in Appendix A.2.2. Finally, in Appendix A.2.3, we prove Theorem 1 by connecting our notion of model complexity with the maximum information gain of a GP model. The norm notation $\|\cdot\|$ refers to the $\ell_2$-norm unless otherwise indicated. We let $\{a_{i,t} \in \mathcal{A}_i, z_{i,t} \in \mathcal{Z}_i\}_{i, t > 0}$ denote the set of actions chosen by MCBO and the realizations of the parents of node $i$, respectively.

A.2.1 MODEL COMPLEXITY

The number of samples needed to learn a low-regret action is related to the number of samples needed to learn all GP models in our SCM. This is analogous to the classic BO setting (Srinivas et al., 2010). We quantify the model complexity of our entire model class $\mathcal{M}_T$ as
$$\Gamma_T = \max_{(z, a) \in A \subset \{\mathcal{Z} \times \mathcal{A}\}^T} \sum_{t=1}^{T} \sum_{i=0}^{m} \|\sigma_{i, t-1}(z_{i,t}, a_{i,t})\|_2^2. \quad (17)$$
This measure of model complexity closely relates to the maximum information gain $\gamma_T$ used for proving regret guarantees in BO (Srinivas et al., 2010). For a single GP model, $\gamma_T$ is the maximum information gain about the unknown $f$ that can be obtained from noisy evaluations of $f$ at fixed inputs (see Eq. (39)). Later we show that $\Gamma_T$ can be bounded by a sum of the information gains of all $m$ GPs. It is worth noting that Eq. (17) may be a loose notion of model complexity because it assumes we can independently choose every $z_i$, whereas for many graphs there is overlap between $z_i$ and $z_j$ for $i \ne j$ (two nodes can share a parent).

A.2.2 ANALYSIS IN TERMS OF GENERAL MODEL COMPLEXITY Γ T

In this section, we prove a theorem similar to Theorem 1, but in terms of the model complexity defined in Eq. (17). Note that this version of the theorem does not require that $\mu$ and $\sigma$ come from a GP model with independent outputs, only a model such that Assumptions 1-3 are satisfied. In later sections, when using a GP model, we bound $\Gamma_T$ in terms of the maximum information gain of the $m$ GPs to obtain Theorem 1. We use $\Sigma_{i,t}(\cdot)$ to denote the diagonal matrix whose diagonal is given by $\sigma_{i,t}(\cdot)$; as a result, $\sigma_{i,t}(\cdot) = \operatorname{diag}(\Sigma_{i,t}(\cdot))$.

Theorem 2. Consider the optimization problem in Eq. (4) with an SCM satisfying Assumptions 1-3, where $G$ is known but $F$ is unknown. Then, for all $T \ge 1$, with probability at least $1 - \delta$, the regret of Algorithm 1 is bounded by
$$R_T \le \mathcal{O}\!\left(L_f^N L_\sigma^N \beta_T^N K^N \sqrt{T m \Gamma_T}\right).$$

We first sketch the proof steps. In Lemma 1 we show that, with high probability, there exists some set of functions $\eta$ that allows the reparameterized plausible SCM model in Eq. (11) to match the true SCM. Recall the mechanism of the ground-truth SCM in Eq. (2),
$$x_{i,t} = f_i(z_{i,t}, a_{i,t}) + \omega_{i,t}, \quad \forall i \in \{0, \ldots, m\},$$
and the mechanism of the optimistic SCM model using the reparameterization of Eq. (11),
$$\tilde{x}_{i,t} = \tilde{f}_i(\tilde{z}_{i,t}, \tilde{a}_{i,t}) + \tilde{\omega}_{i,t} \quad (18)$$
$$= \mu_{i,t-1}(\tilde{z}_{i,t}, \tilde{a}_{i,t}) + \beta_t \Sigma_{i,t-1}(\tilde{z}_{i,t}, \tilde{a}_{i,t})\, \eta_i(\tilde{z}_{i,t}, \tilde{a}_{i,t}) + \tilde{\omega}_{i,t}, \quad \forall i \in \{0, \ldots, m\}. \quad (19)$$
In Lemma 2, Lemma 3 and Corollary 1 we bound the instantaneous regret (the regret at a specific timepoint $t$) by bounding the difference in SCM output, for the same action input, under the true SCM versus the optimistic reparameterized SCM. Then, in Lemma 5, we use the intermediate result of Lemma 4 to show that our bound on instantaneous regret implies a bound on cumulative regret. In Lemma 1, for convenience, we drop the explicit dependence of all quantities on $t$.

Lemma 1. Assume some fixed set of actions $a$ is chosen at any timepoint $t$.
Under Assumption 3, for any $x$ generated by the true SCM in Eq. (2), with probability at least $1 - \delta$ there exists a set of functions $\eta = \{\eta_i\}_{i=0}^m$, where $\eta_i : \mathcal{Z}_i \times \mathcal{A}_i \to [-1, 1]^d$, such that $\tilde{x} = x$ if $\tilde{\omega}_i = \omega_i$ for all $i$.

Proof. Since $\omega = \tilde{\omega}$, we only need to prove that there exists some $\eta$ such that, for all $i$, $f_i(z_i, a_i) = \mu_i(\tilde{z}_i, \tilde{a}_i) + \beta \Sigma_i(\tilde{z}_i, \tilde{a}_i)\eta_i$. By Assumption 3, with probability $1 - \delta$, for all $i = 0, \ldots, m$ we have the elementwise bound $|f_i(z_i, a_i) - \mu_i(z_i, a_i)| \le \beta\, \sigma_i(z_i, a_i)$. Thus for each $z_i, a_i$ there exists a vector $\eta_i$ with values in $[-1, 1]^d$ such that $f_i(z_i, a_i) = \mu_i(z_i, a_i) + \beta \Sigma_i(z_i, a_i)\eta_i$. Note that this is not quite what we need, because the RHS contains $z_i$ and not $\tilde{z}_i$. We now use an inductive argument on $i$ that constructs each $\eta_i$ sequentially from $i = 0$ to $m$.

Base case: we must prove that for $i = 0$ we have $f_0(z_0, a_0) = \mu_0(\tilde{z}_0, \tilde{a}_0) + \beta \Sigma_0(\tilde{z}_0, \tilde{a}_0)\eta_0$. We know $z_0 = \tilde{z}_0$ (both are the empty vector) and $a = \tilde{a}$ by the assumption of a fixed action. Then there exists some vector $\eta_0$ such that $f_0(z_0, a_0) = \mu_0(z_0, a_0) + \beta \Sigma_0(z_0, a_0)\eta_0 = \mu_0(\tilde{z}_0, \tilde{a}_0) + \beta \Sigma_0(\tilde{z}_0, \tilde{a}_0)\eta_0$. Let $\eta_0(\cdot)$ be the function that outputs the vector $\eta_0$ given input $z_0, a_0$, and the base case is proven.

Now assume the inductive hypothesis: for all $j < i$ we have $f_j(z_j, a_j) = \mu_j(\tilde{z}_j, \tilde{a}_j) + \beta \Sigma_j(\tilde{z}_j, \tilde{a}_j)\eta_j$. We want to show that this implies $f_i(z_i, a_i) = \mu_i(\tilde{z}_i, \tilde{a}_i) + \beta \Sigma_i(\tilde{z}_i, \tilde{a}_i)\eta_i$. We know $a_i = \tilde{a}_i$ by the assumption of a fixed action, and $z_i = \tilde{z}_i$ because $\tilde{z}_i = [\tilde{x}_{pa_i[1]}, \ldots, \tilde{x}_{pa_i[|pa_i|]}]^\top$ and we selected each $\eta_j$ such that $\tilde{x}_j = x_j$. Then there exists some vector $\eta_i$ such that $f_i(z_i, a_i) = \mu_i(z_i, a_i) + \beta \Sigma_i(z_i, a_i)\eta_i = \mu_i(\tilde{z}_i, \tilde{a}_i) + \beta \Sigma_i(\tilde{z}_i, \tilde{a}_i)\eta_i$. Let $\eta_i(\cdot)$ output the vector $\eta_i$ given input $z_i, a_i$, and the inductive step is proven.

Lemma 2.
Under Assumption 3, with probability at least $1 - \delta$, for all $t \ge 0$ the instantaneous regret $r_t$ is bounded by
$$r_t = \mathbb{E}[y \mid \mathcal{F}, a^*] - \mathbb{E}[y \mid \mathcal{F}, a_{:,t}] \le \mathbb{E}\big[y \mid \tilde{\mathcal{F}}_t, a_{:,t}\big] - \mathbb{E}[y \mid \mathcal{F}, a_{:,t}].$$
Proof. The result follows directly from $\mathbb{E}[y \mid \mathcal{F}, a^*] \le \mathbb{E}\big[y \mid \tilde{\mathcal{F}}_t, a_{:,t}\big]$. This holds by the definition of $\tilde{\mathcal{F}}_t, a_{:,t}$ as the argmax of Eq. (12) and because, with probability at least $1 - \delta$, we have $\mathcal{F} \in \mathcal{M}_T$.

We now show how the observations under the true and optimistic dynamics differ for a fixed noise sequence $\omega = \tilde{\omega}$ and a fixed action $a_{:,t}$ at any time $t$.

Lemma 3. Under Assumptions 1-3, let $\tilde{L}_{f,t} = 1 + L_f + 2\beta_t L_\sigma$. Then, for all iterations $t > 0$, any functions $\eta_i : \mathbb{R}^{p_i} \times \mathbb{R}^{q_i} \to [-1, 1]^{d_i}$ and any sequence of $\omega_i$ with $\tilde{\omega}_i = \omega_i$ (for all $i$), we have
$$\|x_{m,t} - \tilde{x}_{m,t}\| \le 2\beta_t K^N \tilde{L}_{f,t}^N \sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|. \quad (22)$$
Proof. We prove the claim by induction on $i$.

Base case. Consider the base case $i = 0$. Because the nodes are topologically ordered, we have $pa_0 = \emptyset$, so the realization of node 0 depends only on the chosen action. Formally, we assume $z_0 = \emptyset$, $x_0 = f_0(z_0, a_0) + \omega_0$ and $\tilde{x}_0 = \tilde{f}_0(\tilde{z}_0, \tilde{a}_0) + \tilde{\omega}_0$. Since $\omega_0 = \tilde{\omega}_0$,
$$\|x_{0,t} - \tilde{x}_{0,t}\| = \|f_0(z_{0,t}, a_{0,t}) + \omega_{0,t} - \mu_{0,t-1}(z_{0,t}, a_{0,t}) - \beta_t \Sigma_{0,t-1}(z_{0,t}, a_{0,t})\eta_0(z_{0,t}, a_{0,t}) - \tilde{\omega}_0\|$$
$$\le \|f_0(z_{0,t}, a_{0,t}) - \mu_{0,t-1}(z_{0,t}, a_{0,t})\| + \|\beta_t \Sigma_{0,t-1}(z_{0,t}, a_{0,t})\eta_0(z_{0,t}, a_{0,t})\| \le 2\beta_t \|\sigma_{0,t-1}(z_{0,t}, a_{0,t})\|.$$
In the following, we omit the dependence on the action $a$, e.g., writing $f_i(z_{i,t})$ instead of $f_i(z_{i,t}, a_{i,t})$, since we assume the actions to be the same for the processes generating $x_{i,t}$ and $\tilde{x}_{i,t}$.

Induction step. Now assume that
$$\|x_{i-1,t} - \tilde{x}_{i-1,t}\| \le 2\beta_t K^{N_{i-1}} \tilde{L}_{f,t}^{N_{i-1}} \sum_{j=0}^{i-1} \|\sigma_{j,t-1}(z_{j,t})\|.$$
We prove a similar result for the $i$th node:
$$\|x_{i,t} - \tilde{x}_{i,t}\| \overset{1}{=} \|f_i(z_{i,t}) + \omega_{i,t} - \mu_{i,t-1}(\tilde{z}_{i,t}) - \beta_t \Sigma_{i,t-1}(\tilde{z}_{i,t})\eta_i(\tilde{z}_{i,t}) - \tilde{\omega}_{i,t}\|$$
$$\overset{2}{=} \|f_i(z_{i,t}) - \mu_{i,t-1}(\tilde{z}_{i,t}) - \beta_t \Sigma_{i,t-1}(\tilde{z}_{i,t})\eta_i(\tilde{z}_{i,t}) + f_i(\tilde{z}_{i,t}) - f_i(\tilde{z}_{i,t})\|$$
$$\overset{3}{=} \|f_i(\tilde{z}_{i,t}) - \mu_{i,t-1}(\tilde{z}_{i,t}) - \beta_t \Sigma_{i,t-1}(\tilde{z}_{i,t})\eta_i(\tilde{z}_{i,t}) + f_i(z_{i,t}) - f_i(\tilde{z}_{i,t})\|$$
$$\overset{4}{\le} \|f_i(\tilde{z}_{i,t}) - \mu_{i,t-1}(\tilde{z}_{i,t})\| + \|\beta_t \Sigma_{i,t-1}(\tilde{z}_{i,t})\eta_i(\tilde{z}_{i,t})\| + \|f_i(z_{i,t}) - f_i(\tilde{z}_{i,t})\|$$
$$\overset{5}{\le} \beta_t \|\sigma_{i,t-1}(\tilde{z}_{i,t})\| + \beta_t \|\sigma_{i,t-1}(\tilde{z}_{i,t})\| + L_f \|z_{i,t} - \tilde{z}_{i,t}\|$$
$$\overset{6}{=} 2\beta_t \|\sigma_{i,t-1}(\tilde{z}_{i,t})\| + L_f \|z_{i,t} - \tilde{z}_{i,t}\|$$
$$\overset{7}{=} 2\beta_t \|\sigma_{i,t-1}(\tilde{z}_{i,t}) + \sigma_{i,t-1}(z_{i,t}) - \sigma_{i,t-1}(z_{i,t})\| + L_f \|z_{i,t} - \tilde{z}_{i,t}\|$$
$$\overset{8}{\le} 2\beta_t \big(\|\sigma_{i,t-1}(z_{i,t})\| + L_\sigma \|z_{i,t} - \tilde{z}_{i,t}\|\big) + L_f \|z_{i,t} - \tilde{z}_{i,t}\|$$
$$\overset{9}{\le} 2\beta_t \|\sigma_{i,t-1}(z_{i,t})\| + (1 + L_f + 2\beta_t L_\sigma)\, \|z_{i,t} - \tilde{z}_{i,t}\|$$
$$\overset{10}{\le} 2\beta_t \|\sigma_{i,t-1}(z_{i,t})\| + (1 + L_f + 2\beta_t L_\sigma) \sum_{j \in pa_i} \|x_{j,t} - \tilde{x}_{j,t}\|$$
$$\overset{11}{\le} 2\beta_t \|\sigma_{i,t-1}(z_{i,t})\| + \underbrace{(1 + L_f + 2\beta_t L_\sigma)}_{=:\, \tilde{L}_{f,t}} \sum_{j \in pa_i} 2\beta_t K^{N_j} \tilde{L}_{f,t}^{N_j} \sum_{h=0}^{j} \|\sigma_{h,t-1}(z_{h,t})\|$$
$$\overset{12}{\le} 2\beta_t K^{N_i} \tilde{L}_{f,t}^{N_i} \sum_{j=0}^{i} \|\sigma_{j,t-1}(z_{j,t})\|,$$
where 1 follows the dynamics in Eqs. (2) and (11). In 2, we use that the noise terms are equal and add and subtract the same term. In 3 and 4, we reorder terms and apply the triangle inequality. In 5 and 6, we rely on the calibrated uncertainty and Lipschitz dynamics, collect terms, and use the diagonality of the matrix $\Sigma_{i,t-1}(\cdot)$. In 7 and 8, we add and subtract the same term and use the Lipschitz continuity of $\sigma_{i,t-1}$. In 9, we add 1 to ensure that we can later upper bound this term by its exponential. 10 applies the triangle inequality. 11 follows from the inductive hypothesis, and 12 holds because at least one parent $j$ has depth $N_j = N_i - 1$.

Now we relate this bound on the observations to a bound on $y_t$ when selecting actions according to MCBO under both the optimistic and the true dynamics.

Corollary 1.
Under the assumptions of Lemma 3, for any sequence of $\eta_i \in [-1, 1]^{d_i}$, $\theta \in D$, and $t \ge 1$, we have
$$\mathbb{E}\big[y_t \mid \tilde{\mathcal{F}}_t, a_{:,t}\big] - \mathbb{E}[y_t \mid \mathcal{F}, a_{:,t}] \le 2\beta_t K^N \tilde{L}_{f,t}^N\, \mathbb{E}_{\omega = \tilde{\omega}}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|\right].$$
Proof. This follows from Lemma 3: $y$ is simply the final observation, so
$$\mathbb{E}\big[y_t \mid \tilde{\mathcal{F}}_t, a_{:,t}\big] - \mathbb{E}[y_t \mid \mathcal{F}, a_{:,t}] \le \mathbb{E}\big[\|x_{m,t} - \tilde{x}_{m,t}\| \mid a_{:,t}\big] \le 2\beta_t K^N \tilde{L}_{f,t}^N\, \mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|\right].$$

Lemma 4. Under Assumption 3, let $L_{Y,t} = 2\beta_t \tilde{L}_{f,t}^N$. Then, with probability at least $1 - \delta$, it holds for all $t \ge 0$ that
$$r_t^2 \le L_{Y,t}^2 K^{2N} m\, \mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|_2^2\right]. \quad (25)$$
Proof.
$$r_t \le \mathbb{E}\big[y_t \mid \tilde{\mathcal{F}}_t, a_t\big] - \mathbb{E}[y_t \mid \mathcal{F}, a_t] \le 2\beta_t K^N \tilde{L}_{f,t}^N\, \mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|\right] \le L_{Y,t} K^N\, \mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|\right].$$
Squaring,
$$r_t^2 \le L_{Y,t}^2 K^{2N} \left(\mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|\right]\right)^2 \le L_{Y,t}^2 K^{2N}\, \mathbb{E}\left[\Big(\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|\Big)^2\right] \le L_{Y,t}^2 K^{2N} m\, \mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|_2^2\right],$$
where the last two inequalities are due to Jensen's inequality.

Now we bound the cumulative regret $R_T$.

Lemma 5. Under Assumptions 1-3, with probability at least $1 - \delta$, it holds for all $t \ge 0$ that
$$R_T^2 \le T L_{Y,T}^2 K^{2N} m \sum_{t=1}^T \mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|_2^2\right]. \quad (32)$$
Proof.
$$R_T^2 = \Big(\sum_{t=1}^T r_t\Big)^2 \overset{1}{\le} T \sum_{t=1}^T r_t^2 \overset{2}{\le} T L_{Y,T}^2 K^{2N} m \sum_{t=1}^T \mathbb{E}\left[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|_2^2\right],$$
where 1 is due to Jensen's inequality and 2 follows from Lemma 4. Similar to the equivalent lemma in Curi et al. (2020), this bound depends on the data observed up to iteration $t$, making it hard to interpret in general. To this end, we further provide a worst-case bound in terms of the model complexity $\Gamma_T$.

Lemma 6. Under Assumptions 1-3, with probability at least $1 - \delta$, it holds for all $t \ge 0$ that
$$R_T^2 \le T L_{Y,T}^2 K^{2N} m \Gamma_T. \quad (34)$$
Proof. Substituting Eq. (17), we have $\sum_{t=1}^T \mathbb{E}\big[\sum_{i=0}^m \|\sigma_{i,t-1}(z_{i,t}, a_{i,t})\|_2^2\big] \le \Gamma_T$ and the result follows. Taking square roots and substituting for $L_{Y,T}$ in terms of $\beta_T$, $L_f$ and $L_\sigma$ in Eq. (34) concludes the proof of Theorem 2.
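The step $R_T^2 = (\sum_t r_t)^2 \le T \sum_t r_t^2$ used above is an instance of the Cauchy-Schwarz/Jensen inequality. A quick numerical sanity check with arbitrary made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(0.0, 2.0, size=100)  # stand-ins for instantaneous regrets r_t

# Jensen / Cauchy-Schwarz: (sum_t r_t)^2 <= T * sum_t r_t^2
lhs = r.sum() ** 2
rhs = len(r) * (r ** 2).sum()
```

The inequality holds for any real sequence, with equality exactly when all $r_t$ are equal.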

A.2.3 PROOF OF THEOREM 1

We can bound $\Gamma_T$ in Theorem 2 to obtain a bound that depends on the specific GP model used for each $f_i$, and we can show that for many commonly used kernels MCBO achieves sublinear (in $T$) regret.

Theorem 1. Consider the optimization problem in Eq. (4) with an SCM satisfying Assumptions 1-3, where $G$ is known but $F$ is unknown. Then, for all $T \ge 1$, with probability at least $1 - \delta$, the cumulative regret of Algorithm 1 is bounded by
$$R_T \le \mathcal{O}\!\left(L_f^N L_\sigma^N \beta_T^N K^N m \sqrt{T \gamma_T}\right).$$

Proof. Step 1: preliminaries relating the mutual information $I_T(x_{i,1:T}; f_{i,1:T})$ and the maximum information gain $\gamma_T$. In the following, we consider the information gain for node $i$, i.e., for $x_{i,1:T} \in \mathbb{R}^{d \times T}$ and $f_{i,1:T} = [f_i(z_{i,1}, a_{i,1}), \ldots, f_i(z_{i,T}, a_{i,T})]^\top$ evaluated at points $A_i = \{z_{i,1:T}, a_{i,1:T}\}$, $A_i \in \mathcal{Z}_i \times \mathcal{A}_i$. In this step, for simplicity, we omit the index $i$ when clear from context. The mutual information $I(x_{1:T}; f_{1:T})$ is defined as
$$I(x_{1:T}; f_{1:T}) = H(x_{1:T}) - H(x_{1:T} \mid f_{1:T}),$$
where $H(\cdot)$ denotes entropy. For fitting the GPs, the following models are used:
$$x_t \mid f_t \sim \mathcal{N}(f_t(z_t, a_t), \rho^2 I), \qquad x_t \mid x_{1:t-1}, z_{1:t}, a_{1:t} \sim \mathcal{N}(\mu_{t-1}(z_t, a_t), \rho^2 I + \Sigma_{t-1}(z_t, a_t)),$$
where $I \in \mathbb{R}^{d \times d}$, $\mu_{t-1}(z_t, a_t) = [\mu_{t-1}(z_t, a_t, 1), \ldots, \mu_{t-1}(z_t, a_t, d)]^\top$, and $\sigma_{t-1}(z_t, a_t) = \operatorname{diag}(\Sigma_{t-1}(z_t, a_t)) = [\sigma_{t-1}(z_t, a_t, 1), \ldots, \sigma_{t-1}(z_t, a_t, d)]^\top$. Our setup assumes that the components $\{x_{t,l}\}_{l=1}^d$ are independent of each other given $z_t, a_t$. Therefore, from Srinivas et al. (2010) we know that the mutual information for the $i$th GP model is
$$I(x_{1:T}; f_{1:T}) = H(x_{1:T}) - H(x_{1:T} \mid f_{1:T}) = \frac{1}{2} \sum_{t=1}^T \sum_{l=1}^d \ln\left(1 + \frac{\sigma_{t-1}^2(z_t, a_t, l)}{\rho^2}\right), \quad (36)$$
because per component
$$I(x_{i,1:T,l}; f_{i,1:T,l}) = \frac{1}{2} \sum_{t=1}^T \ln\left(1 + \frac{\sigma_{t-1}^2(z_t, a_t, l)}{\rho^2}\right).$$
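The equivalence between the entropy-difference form of the mutual information and the sum of per-round log terms can be checked numerically for a toy scalar GP. The RBF kernel, inputs, and noise level below are made up for illustration:

```python
import numpy as np

def kern(a, b):
    """A simple RBF kernel (illustrative choice)."""
    return np.exp(-0.5 * (a - b) ** 2)

rho = 0.3
X = np.array([0.0, 0.4, 1.1])
K = kern(X[:, None], X[None, :])

# Closed form: I(x_{1:T}; f_{1:T}) = 1/2 ln det(I + rho^{-2} K).
mi_det = 0.5 * np.log(np.linalg.det(np.eye(len(X)) + K / rho ** 2))

# Telescoping form: 1/2 sum_t ln(1 + sigma_{t-1}^2(x_t) / rho^2), where
# sigma_{t-1}^2 is the GP posterior variance given the first t-1 points.
mi_sum = 0.0
for t in range(len(X)):
    if t == 0:
        var = kern(X[0], X[0])
    else:
        Kp = K[:t, :t]
        kv = K[:t, t]
        var = K[t, t] - kv @ np.linalg.solve(Kp + rho ** 2 * np.eye(t), kv)
    mi_sum += 0.5 * np.log(1 + var / rho ** 2)
```

The two quantities agree to numerical precision, which is the identity used to define the per-round decomposition above.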
Then, by definition of the maximum information gain $\gamma_{i,T,l}$ per node $i$ in the graph,
$$\gamma_{i,T,l} := \max_{A_i \subset \{\mathcal{Z}_i \times \mathcal{A}_i\}^T} I(x_{i,1:T,l}; f_{i,1:T,l}) = \max_{A_i \subset \{\mathcal{Z}_i \times \mathcal{A}_i\}^T} \frac{1}{2} \sum_{t=1}^T \ln\left(1 + \frac{\sigma_{i,t-1}^2(z_t, a_t, l)}{\rho^2}\right).$$
Accordingly, we can write the maximum information gain between $x_{i,1:T}$ and $f_{i,1:T}$ as
$$\gamma_{i,T} := \max_{A_i \subset \{\mathcal{Z}_i \times \mathcal{A}_i\}^T} I(x_{i,1:T}; f_{i,1:T}) = \max_{A_i \subset \{\mathcal{Z}_i \times \mathcal{A}_i\}^T} \frac{1}{2} \sum_{t=1}^T \sum_{l=1}^d \ln\left(1 + \frac{\sigma_{i,t-1}^2(z_t, a_t, l)}{\rho^2}\right) \quad (39)$$
$$\le \sum_{l=1}^d \max_{A_i \subset \{\mathcal{Z}_i \times \mathcal{A}_i\}^T} \frac{1}{2} \sum_{t=1}^T \ln\left(1 + \frac{\sigma_{i,t-1}^2(z_t, a_t, l)}{\rho^2}\right) = \sum_{l=1}^d \gamma_{i,T,l}.$$

Note: bounds for $\gamma_{i,T,l}$ and $\gamma_{i,T}$. Upper bounds on $\gamma_{i,T,l}$ are provided in Srinivas et al. (2010) for widely used kernels and scale sublinearly in $T$. We use $p_i = d\,|pa(i)|$ to denote the size of the $z$-input to a GP, and recall that $q$ is the length of each action vector, i.e., $\mathcal{A}_i \subset \mathbb{R}^q$. For the linear kernel, $\gamma_{i,T,l} = \mathcal{O}((p_i + q) \log T)$, and for the squared exponential kernel, $\gamma_{i,T,l} = \mathcal{O}\big((p_i + q)(\log T)^{p_i + q + 1}\big)$. We can use Eq. (39) to bound $\gamma_{i,T}$ too: if all GPs use independent linear kernels for each output component, $\gamma_{i,T} = \mathcal{O}(d_i (p_i + q) \log T)$, and if all GPs use independent squared exponential kernels for each output component, $\gamma_{i,T} = \mathcal{O}\big(d (p_i + q)(\log T)^{p_i + q + 1}\big)$.

Step 2: a bound for the model complexity $\Gamma_T$. Here we bound the model complexity of Eq. (17) as $\Gamma_T \le \sum_{i=0}^m \frac{2}{\ln(1 + \rho_i^{-2})} \gamma_{i,T}$. For readability, in the following we denote $\max_A(\cdot) := \max_{A = \cup_i A_i :\ \forall i,\ A_i \subset \{\mathcal{Z}_i \times \mathcal{A}_i\}^T}(\cdot)$. Then
$$\Gamma_T = \max_{\{\mathcal{Z} \times \mathcal{A}\}^T} \sum_{t=1}^T \sum_{i=0}^m \|\sigma_{i,t-1}(z_i, a_i)\|_2^2 \overset{1}{\le} \max_A \sum_{t=1}^T \sum_{i=0}^m \|\sigma_{i,t-1}(z_i, a_i)\|_2^2 \overset{2}{\le} \sum_{i=0}^m \max_A \sum_{t=1}^T \|\sigma_{i,t-1}(z_i, a_i)\|_2^2$$
$$\overset{3}{\le} \sum_{i=0}^m \max_{A_i} \sum_{t=1}^T \|\sigma_{i,t-1}(z_i, a_i)\|_2^2 \overset{4}{=} \sum_{i=0}^m \max_{A_i} \sum_{t=1}^T \sum_{l=1}^{d_i} \sigma_{i,t-1}^2(z_i, a_i, l)$$
$$\overset{5}{\le} \sum_{i=0}^m \frac{2}{\ln(1 + \rho_i^{-2})} \max_{A_i} \frac{1}{2} \sum_{t=1}^T \sum_{l=1}^{d_i} \ln\left(1 + \frac{\sigma_{i,t-1}^2(z_i, a_i, l)}{\rho_i^2}\right) \overset{6}{=} \sum_{i=0}^m \frac{2}{\ln(1 + \rho_i^{-2})} \gamma_{i,T} \overset{7}{=} \mathcal{O}(m \gamma_T),$$
where 1 bounds $A$ and $Z$ with a box, and 2 bounds the max of a sum by the sum of maxes.
3 is due to the assumption that $x_{i,t}$ is independent of $A_j$, $j \ne i$, conditioned on $A_i$. 4 expands the squared $\ell_2$-norm into a sum of squared components. 5 uses the fact that for any $s^2 \in [0, \rho^{-2}]$ we can bound $s^2 \le \frac{\rho^{-2}}{\ln(1 + \rho^{-2})} \ln(1 + s^2)$ (Srinivas et al., 2010). This also holds for the function $s^2(\cdot) := \rho_i^{-2} \sigma_{i,t-1}^2(\cdot, l)$, since $\rho_i^{-2} \sigma_{i,t-1}^2(\cdot, l) \le \rho_i^{-2} k(\cdot, \cdot) \le \rho_i^{-2}$ because $k_i(\cdot, \cdot) \le 1$ for all $i$ (bounded kernel assumption). 6 is due to Eqs. (36) and (39). Finally, in 7 we define $\gamma_T := \max_i \gamma_{i,T}$, the largest of the maximum information gains over graph nodes. Plugging this upper bound on $\Gamma_T$ into Theorem 2 completes the proof.

Note: sublinearity in $T$ of the maximum information gain $\gamma_T$. Upper bounds on $\gamma_T$ often scale sublinearly in $T$. This follows from $\gamma_{i,T}$ scaling sublinearly in $T$ for many popularly used kernels (see the previous note). In particular, a linear kernel gives $\gamma_T = \mathcal{O}(d(Kd + q) \log T)$ and a squared exponential kernel gives $\gamma_T = \mathcal{O}\big(d(Kd + q)(\log T)^{Kd + q + 1}\big)$ (assuming independent output components), since $\max_i p_i = Kd$.

A.2.4 DEPENDENCE OF $\beta_T$ ON $T$ FOR PARTICULAR KERNELS

Note that for Assumption 3 to hold under our RKHS assumptions, $\beta_T$ might depend on $T$. For a single observed variable corresponding to node $i$ at time $t$, Assumption 3 requires (Chowdhury & Gopalan (2019), Eq. 9) that $\beta_t = \tilde{\mathcal{O}}\big(B_i + \rho_i \sqrt{d\, \gamma_{i,t}}\big)$. For Assumption 3 to hold at time $t$ for all $i$, we can apply a union bound and see that it suffices to take $\beta_T = \tilde{\mathcal{O}}\big(B + \rho \sqrt{d\, \gamma_t}\big)$, where $B = \max_i B_i$ and $\rho = \max_i \rho_i$. In the regret bound of Theorem 1, $\beta_T$ appears raised to the power $N$. For the $\gamma_T$ corresponding to the linear and squared exponential kernels this still leads to sublinear regret regardless of $N$, because $\gamma_T$ is only logarithmic in $T$. However, for a Matérn kernel, where the best known bound is $\gamma_T = \mathcal{O}\big(p(pT)^c \log(pT)\big)$ with $0 < c < 1$, the cumulative regret bound will not be sublinear if $N$ and $c$ are sufficiently large.
A similar phenomenon with the Matérn kernel appears in the guarantees of Curi et al. (2020), which use GP models in model-based reinforcement learning.
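The scalar inequality used in step 5 of the $\Gamma_T$ bound above is easy to verify numerically: on $[0, \rho^{-2}]$ the concave function $\ln(1 + s^2)$ lies above its chord through the origin, giving $s^2 \le \frac{\rho^{-2}}{\ln(1 + \rho^{-2})}\ln(1 + s^2)$. A quick check with an arbitrary noise level:

```python
import numpy as np

rho = 0.5
c = rho ** -2 / np.log(1 + rho ** -2)      # chord constant rho^{-2} / ln(1 + rho^{-2})
s2 = np.linspace(0.0, rho ** -2, 1001)     # s^2 ranging over [0, rho^{-2}]
bound = c * np.log(1 + s2)                 # the upper bound from step 5
max_violation = float(np.max(s2 - bound))  # <= 0 up to floating-point rounding
```

Equality holds at both endpoints $s^2 = 0$ and $s^2 = \rho^{-2}$, which is exactly the chord construction.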

A.3 MAXIMIZING THE ACQUISITION FUNCTION

Our theoretical results assume access to an oracle that can maximize Eq. (10). Here we discuss how we approximate this oracle in practice. In noiseless settings, instead of parameterizing each $\eta_i$ as a neural network, we can parameterize it as a constant: with no noise, the inputs to $\eta_i$ ($z_i$ and $a_i$) are fixed given $a$. This keeps the space of parameters to optimize over small, meaning we can use an optimization procedure identical to that used in EIFN by Astudillo & Frazier (2021b), which is also an out-of-the-box optimizer in the BoTorch package (Balandat et al., 2020). For noisy settings, where each $\eta_i : \mathcal{A}_i \times \mathcal{Z}_i \to \mathbb{R}$ is a neural network, we use our own optimizer. For each initialization of the $\eta$ parameters, we perform stochastic gradient descent to jointly optimize the $\eta_i$ parameters and $a$. We can do this because Eq. (12) is differentiable with respect to both $a$ and the parameters of each $\eta_i$. After running stochastic gradient descent from many random initializations we obtain many candidate solutions, and we select the candidate with the highest acquisition function value. We use a large number of random initializations because the acquisition function may be highly non-convex. Other approaches, such as those considered in Curi et al. (2020) for model-based reinforcement learning, could also be adapted to optimize our acquisition function. When parameterizing each $\eta_i$ with a neural network, we always use a two-layer feed-forward network with a ReLU non-linearity. To map the output into $[-1, 1]$, we pass the output of the network through an element-wise sigmoid. In all noisy environments we estimate the expectation in the acquisition function (Eq. (12)) using a Monte Carlo estimate with 32 samples for each gradient step. For noisy Dropwave we use 128 samples, because this environment is particularly noisy compared to the other noisy environments.
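As an illustrative sketch of this procedure (not the paper's implementation), the snippet below runs projected gradient ascent with random restarts on the reparameterized objective for the noiseless case, where each $\eta_i$ is a constant in $[-1, 1]$. The surrogate means and standard deviations for a two-node chain are made up, and finite-difference gradients stand in for automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0

# Made-up posterior mean / std surrogates for the chain x0 = f0(a), y = f1(x0);
# these stand in for the GP posteriors used by the real algorithm.
mu = [lambda a: np.sin(3.0 * a), lambda x: -((x - 0.5) ** 2)]
sigma = [lambda a: 0.3, lambda x: 0.1]

def optimistic_y(theta):
    # Reparameterized rollout: each node output is mean + beta * std * eta_i.
    a, eta0, eta1 = theta
    x0 = mu[0](a) + beta * sigma[0](a) * eta0
    return mu[1](x0) + beta * sigma[1](x0) * eta1

def maximize_acquisition(restarts=50, steps=600, lr=0.05, h=1e-4):
    best_val, best_a = -np.inf, None
    for _ in range(restarts):
        theta = rng.uniform(-1.0, 1.0, size=3)  # [a, eta0, eta1]
        for _ in range(steps):
            # Central finite-difference gradient of the optimistic value.
            grad = np.zeros(3)
            for j in range(3):
                e = np.zeros(3)
                e[j] = h
                grad[j] = (optimistic_y(theta + e) - optimistic_y(theta - e)) / (2 * h)
            theta = np.clip(theta + lr * grad, -1.0, 1.0)  # projected ascent
        val = optimistic_y(theta)
        if val > best_val:
            best_val, best_a = val, theta[0]
    return best_a, best_val
```

On this toy chain the optimistic optimum is $y = 0.1$ (drive $x_0$ to $0.5$ and set $\eta_1 = 1$), and gradient ascent with enough restarts typically recovers it; the many restarts guard against the non-convexity noted above.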

A.4 EXPERIMENTAL SETUP

Here we give additional notes on the details of our experimental setup. The cross-validation for selecting β is performed over β ∈ {0.05, 0.5, 5}. For a given performance metric and task (take, for example, average reward on Dropwave), we select the β that on average performs best across all other tasks for the same performance metric, called β*. We then report the results in terms of average reward for running the algorithm (GP-UCB or MCBO) on Dropwave with β = β*. For MCBO and EIFN, we use identical values for all shared hyperparameters (e.g., GP kernels) and use the original hyperparameters from Astudillo & Frazier (2021b). CBO methods do not have hyperparameters comparable with these methods because CBO methods do not model individual system observations (see Section 2.1). We run the original EICBO source code effectively unmodified (Aglietti et al., 2020b). For CBO tasks, if the agent selects no intervention (observational data) at a given timestep, we give the agent, for any algorithm, a single observational sample. This differs from Aglietti et al. (2020b), where 20 observational samples are given. There is no noisy version of Ackley because adding noise can make the environment unstable by producing very large or very small rewards. For all environments and all X_i, the noisy environment adds unit-Gaussian noise, except Dropwave, where we scale this noise by 0.1 to make the environment more stable.

A.5 FURTHER EXPERIMENTAL ANALYSIS


Average expected reward

In Figure 4 we show the function network plots not included in the main paper. In Figure 5 we show the average reward for all β considered in the cross-validation.

Best expected reward Here we also report the best expected reward as a function of the number of rounds T. The best expected reward at time T is defined by $\max_{a \in \{a_t\}_{t=0}^T} \mathbb{E}_\omega[Y \mid a]$. This is similar to, but not directly inversely related to, simple regret. Simple-regret algorithms require an inference procedure to select a final action after T rounds of exploration, and this procedure could be computationally expensive in the CBO setting. For example, a reasonable choice would be to report a final action that maximizes a lower confidence bound of the objective in Eq. (4); however, a closed-form expression for this does not exist for CBO. Best expected reward instead assumes access to an oracle that computes the highest expected reward of any action the algorithm has previously chosen. Our plotting of the best expected reward is consistent with previous work on CBO (Aglietti et al., 2020b). When performing the same cross-validation procedure as in the average expected reward case, we find that the performance of MCBO varies drastically across tasks. This is shown in Figure 6. Selecting β is a known difficulty with UCB-based methods (Merrill et al., 2021). Figure 7 shows the performance in terms of best reward for all 3 choices of β. Besides evaluating in terms of average reward instead of best reward, compared to Aglietti et al. (2020b) we also used smaller initial sample sizes (random samples the agent obtains before t = 0) of observational data. This was to make the setting more challenging, since we found that with extra observational data ToyGraph and PSAGraph could be solved with very few samples. When reproducing the experiments of Fig.
2 (a, b) with additional starting observational data, we found the same qualitative results (non-monotonicity) for the average reward case. Note that this result is not inconsistent with the experimental results of Aglietti et al. (2020b), since they only evaluate in terms of best reward.
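The best-expected-reward curve defined above is simply a running maximum over the expected rewards of the actions chosen so far. With hypothetical per-round values it can be computed as:

```python
import numpy as np

# Hypothetical values of E_w[Y | a_t] for the actions chosen in rounds 0..4.
expected_rewards = np.array([0.2, 0.5, 0.3, 0.9, 0.7])

# Best expected reward at time T: max over all actions chosen up to round T.
best_so_far = np.maximum.accumulate(expected_rewards)
# best_so_far == [0.2, 0.5, 0.5, 0.9, 0.9]
```

Unlike average reward, this curve is monotone by construction, which is why the two metrics can disagree qualitatively in the plots.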



For vector-valued functions coming from an RKHS, we consider a scalar-valued function where the output index is part of the function input, as described in Curi et al. (2020), Appendix F.

This is known as the bounded variance property, and it holds for many common kernels.

This assumption can be relaxed to $\tilde{\omega}_i$ being sub-Gaussian using techniques similar to Curi et al. (2020), Appendix I. Though sub-Gaussian noise includes distributions with unbounded support, Curi et al. (2020) provide high-probability bounds on the domain of $Z_i$.

https://github.com/ssethz/mcbo



Figure 1: A visual comparison between the modelling assumptions of BO vs CBO. Circular nodes represent observed variables, squares represent action inputs, and Y is the reward. Algorithms select a before observing X and Y. (a) In standard BO, the DAG has the structure shown regardless of the problem structure. (b) The DAG corresponding to our stylised agronomy example, where we aim to maximize crop yield Y. CBO takes this DAG as input for designing actions. X_0 is an unmodifiable starting property of the soil, and X_1, ..., X_3 are the measured amounts of Nitrogen in the soil at different timesteps. Each observation is modelled with its own Gaussian process. a_1, ..., a_3 are possible interventions involving adding Nitrogen fertilizer to the soil.

Figure 2 (c) and Figure 4 (a, b, c) show that MCBO achieves competitive average reward on all tasks. EIFN is better on Dropwave. Meanwhile, MCBO is substantially better than EIFN on the Ackley and Rosenbrock tasks. Overall, there are not sufficiently many tasks established in the literature to say conclusively which properties make a task favor EIFN over MCBO. However, this would be interesting to understand in future work and likely relates to the wider conversation in BO comparing expected improvement and UCB algorithms (Merrill et al., 2021). We find that the naive GP-UCB approach, which does not use the graph structure, generally performs poorly, especially on problems with larger graphs like Alpine2. On Ackley, EIFN does not achieve monotonically improving average reward, which is not unexpected given that it is based upon expected improvement.

Figure 3: The DAGs corresponding to each task in the experiments, except binary tree, which is used for illustrative purposes. Circles are observation or reward variables. Squares are actions in the function network setting. In hard-intervention CBO, nodes with a dashed border can be intervened upon. All nodes represent scalar random variables.

Figure 7: An ablation across β where we plot best reward for all tasks.

Figure 8: Runtimes of experiments for the a) CBO setting, b) noiseless function network setting, and c) noisy function network setting.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

We thank Lars Lorch, Parnian Kassraie and Ilnura Usmanova for their feedback and anonymous reviewers for their helpful comments. This research was supported by the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, and by the European Research Council (ERC) under the European Union's Horizon grant 815943.


When selecting the best β among the 3 tried, we find that MCBO is very strong on all tasks (with the exception of best reward on ToyGraph, where it underperforms the baselines). The optimal β can vary by an order of magnitude or more across tasks in our case. This is likely because our settings have very different functional relationships and graph structures. As discussed in Section 4, this can be understood from our cumulative regret guarantee in Theorem 1. The plots of average and best scores for other values of β (Figs. 5 and 7) suggest that β can be made larger for MCBO to trade off exploration vs. exploitation, depending on whether simple or cumulative regret is more of a priority. Meanwhile, expected improvement methods are focused almost entirely on exploration.

Runtimes In Figure 8 we show the total runtime of each method for all 100 rounds. EIFN and MCBO run on equivalent hardware, which has 4 times more RAM than the hardware used for GP-UCB and EICBO. MCBO generally has a much longer runtime than EIFN in noisy settings, where the η_i are parameterized by neural networks. Generally in BO, because the unknown function is often assumed expensive to evaluate, we are less concerned with the time required to optimize the acquisition function and more concerned with notions of statistical efficiency such as regret. Our implementation for noisy settings could likely be sped up with more parallelism to make it more comparable to EIFN runtimes. In noiseless settings, where the optimization methods used are roughly equivalent between EIFN and MCBO, the two methods have comparable runtimes.

Non-monotonic regret of EICBO In Fig. 2 (a, b) we found that EICBO had non-monotonic average reward, which can likely be explained by its use of an expected improvement acquisition function. We verified that the non-monotonic average reward was due to the algorithm and not due to differences in our setup compared to Aglietti et al. (2020b).

