BO-MUSE: A HUMAN EXPERT AND AI TEAMING FRAMEWORK FOR ACCELERATED EXPERIMENTAL DESIGN

Abstract

In this paper we introduce BO-Muse, a new approach to human-AI teaming for the optimisation of expensive blackbox functions. Inspired by the intrinsic difficulty of extracting expert knowledge and distilling it back into AI models and by observations of human behaviour in real-world experimental design, our algorithm lets the human expert take the lead in the experimental process. The human expert can use their domain expertise to its full potential, while the AI plays the role of a muse, injecting novelty and searching for areas of weakness to break the human out of over-exploitation induced by cognitive entrenchment. With mild assumptions, we show that our algorithm converges sub-linearly, at a rate faster than the AI or human alone. We validate our algorithm using synthetic data and with human experts performing real-world experiments.

1. INTRODUCTION

Bayesian Optimisation (BO) (Shahriari et al., 2015) is a popular sample-efficient optimisation technique for problems where the objective is expensive to evaluate. It has been applied successfully in diverse areas (Greenhill et al., 2020), including material discovery (Li et al., 2017), alloy design (Barnett et al., 2020) and molecular design (Gómez-Bombarelli et al., 2018). However, standard BO typically operates tabula rasa, building its model of the objective from minimal priors that do not include domain-specific detail. While some progress has been made in incorporating domain-specific knowledge to accelerate BO (Li et al., 2018; Hvarfner et al., 2022) or in transferring learnings from previous experiments (Shilton et al., 2017), there remains a significant corpus of knowledge and expertise that could potentially accelerate BO even further but which remains largely untapped due to the inherent complexities involved in knowledge extraction and exploitation. In particular, experts tend to organise their knowledge in complex schema containing concepts, attributes and relationships (Rousseau, 2001), making the elicitation of relevant expert knowledge, both quantitative and qualitative, a difficult task.

Experimental design underpins the discovery of new materials, processes and products. However, experiments are costly, the target function is unknown and the search space is unclear. To be sample-efficient, as few experiments as possible must be performed. Traditionally, experimental design is guided by (human) experts who use their domain expertise and intuition to formulate an experimental design, test it, and iterate based on observations. Living beings from fungi (Watkinson et al., 2005) to ants (Pratt & Sumpter, 2006) and humans (Daw et al., 2006; Cohen et al., 2007) face a dilemma when they make these decisions: exploit the information they have, or explore to gather new information.
How humans balance this dilemma has been studied in Daw et al. (2006): examining human choices in an n-armed bandit problem, they showed that humans were highly skewed towards exploitation. Moreover, when the task requires specialised experts, cognitive entrenchment is heightened and the balance between expertise and flexibility swings further towards remaining in known paradigms.

[Figure 1: BO-Muse workflow diagram; recoverable labels: "BO-Muse Suggestion", "Measure Result".]

To break out of this, dynamic environments of engagement are needed to force experts to incorporate new points of view (Dane, 2010). For such lateral thinking to catalyse creativity, Beaney (2005) has further confirmed that random stimuli are crucial. For example, random stimuli have been used to boost creativity in the context of games (Yannakakis et al., 2014). In Sentient Sketchbook, a machine creates sketches that the human can refine, and sketches are readily created through machine learning models trained on ample data. Other approaches use machine representations to learn models of human knowledge, narrowing down options for the human to consider. Recently, Vasylenko et al. (2021) constructed a variational auto-encoder from underlying patterns of chemistry based on structure/composition, after which a human-generated hypothesis guides possible solutions. An entirely different approach refines a target function by allowing machine learning to discover relations between mathematical objects, and guides humans to make new conjectures (Davies et al., 2021). Note, however, that large datasets are still required to formulate representations of mathematical objects, which is antithetical to sample-efficiency: in experimental design we typically have a budget on the number of experiments, data from past designs is lean, and formulating hypotheses is difficult in this lean data space.

The use of BO for experimental design overcomes the problems of over-exploitation and cognitive entrenchment and provides mathematically rigorous guarantees of convergence to the optimal design. However, as noted previously, this often means that domain-specific knowledge and expertise is lost.
In this paper, motivated by our observations, rather than attempting to enrich AI models using expert knowledge to accelerate BO, we propose the BO-Muse algorithm that lets the human expert take the lead in experimental design with the aid of an AI "muse" whose job is to augment the expert's intuition through AI suggestions. Thus the AI's role is to provide dynamism to break an expert's cognitive entrenchment and go beyond the state-of-the-art in new problems, while the expert's role is to harness their vast knowledge and extensive experience to produce state-of-the-art designs. Combining these roles in a formal framework is the main contribution of this paper. BO-Muse is a formal framework that inserts BO into the expert's workflow (see Figure 1 ), allowing adjustment of the AI exploit/explore strategy in response to the human expert suggestions. This process results in a batch of suggestions from the human expert and the AI at each iteration. This batch of designs is experimentally evaluated and shared with the human and the AI. The AI model is updated and the process iterated until the target is reached. We analyse the sample-efficiency of BO-Muse and provide a sub-linear regret bound. We validate BO-Muse using optimisation benchmarks and teaming with experts to perform complex real-world tasks. 
Our contributions are:
• Design of a framework (BO-Muse) for a human expert and an AI to work in concert to accelerate experimental design, taking advantage of the human's deeper insight into the problem and the AI's advantage in using rigorous models to complement the expert to achieve sample-efficiency;
• Design of an algorithm that compensates for the human tendency to be overly exploitative by appropriately boosting the AI exploration;
• A sub-linear regret bound for BO-Muse that demonstrates the accelerated convergence due to human-AI teaming in the optimisation process; and
• Experimental validation, both on optimisation benchmark functions and with human experts performing complex real-world tasks.

2. BACKGROUND

2.1. HUMAN MACHINE PARTNERSHIPS

Mixed initiative creative interfaces propose a tight coupling of human and machine to foster creativity. Thus far, however, research has been largely restricted to game design (Deterding et al., 2017), where the authors identified open challenges including "what kinds of human-AI co-creativity can we envision across and beyond creative practice?". Our work is the first example of the use of such a paradigm to accelerate experimental design. Also of importance, though beyond the scope of this study, are the design of interfaces for such systems (Rezwana & Maher, 2022) and how the differing ways in which human and machine express confidence affect performance (Steyvers et al., 2022).

2.2. BAYESIAN OPTIMISATION

Bayesian Optimisation (BO) (Brochu et al., 2010) is an optimisation method for solving problems of the form $x^\star = \operatorname{argmax}_{x \in \mathbb{X}} f^\star(x)$ in the fewest possible iterations when $f^\star$ is an expensive blackbox function. Bayesian optimisation models $f^\star$ as a draw from a Gaussian Process $\mathrm{GP}(0, K)$ with prior covariance (kernel) $K$ (Rasmussen, 2006) and, at each iteration $t$, recommends the next function evaluation point $x_t$ by optimising a (cheap) acquisition function $a_t : \mathbb{X} \to \mathbb{R}$ based on the posterior mean and variance given a dataset of observations $\mathbb{D}_{t-1} = \{(x_i, y_i) : i \in \mathbb{N}_{t-1}\}$:

$$\mu_{t-1}(x) = \mathbf{y}_{t-1}^{\mathrm{T}} \left(\mathbf{K}_{t-1} + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{k}_{t-1}(x)$$
$$\sigma^2_{t-1}(x) = K_{t-1}(x, x) - \mathbf{k}_{t-1}^{\mathrm{T}}(x) \left(\mathbf{K}_{t-1} + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{k}_{t-1}(x)$$

where $\mathbf{y}_{t-1} = [y_i]_{i \in \mathbb{N}_{t-1}}$ is the set of observed outputs, $\mathbf{K}_{t-1} = [K(x_i, x_j)]_{i,j \in \mathbb{N}_{t-1}}$ and $\mathbf{k}_{t-1}(x) = [K(x_i, x)]_{i \in \mathbb{N}_{t-1}}$. Experiments evaluate $y_t = f^\star(x_t) + \nu_t$, where $\nu_t$ is noise, the GP model is updated to include the new observation, and the process repeats either until a convergence criterion is met or a fixed budget of evaluations is exhausted. Common acquisition functions include Expected Improvement (EI) (Jones et al., 1998) and GP-UCB (Srinivas et al., 2012). GP-UCB uses:

$$a_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$$

Here $\beta_t$ is a variable controlling the trade-off between exploitation of known maxima (if $\beta_t$ is small) and exploration (if $\beta_t$ is large). The optimisation performance depends on $\beta_t$. To achieve sub-linear convergence Chowdhury & Gopalan (2017) recommend $\beta_t = \chi_t$, where:

$$\chi_t = \left( \frac{\Sigma}{\sqrt{\sigma}} \sqrt{2 \ln\frac{1}{\delta} + 1 + \gamma_t} + \|f^\star\|_{\mathcal{H}_K} \right)^2 \qquad (2)$$

and $\gamma_t$ is the maximum information gain (the maximum mutual information between $f^\star$ and any $t$ observations (Srinivas et al., 2012)). Typically it is assumed that $f^\star \sim \mathrm{GP}(0, K)$ or $f^\star \in \mathcal{H}_K$, where $\mathcal{H}_K$ is the reproducing kernel Hilbert space (RKHS) of the kernel $K$ (Srinivas et al., 2012). Otherwise the problem is mis-specified.
One approach to mis-specified BO is enlarged confidence GP-UCB (EC-GP-UCB) (Bogunovic & Krause, 2021). In EC-GP-UCB, the function closest to the objective $f^\star$ is denoted $f^\# = \operatorname{argmin}_{f \in \mathcal{H}_K} \|f^\star - f\|_\infty$, and the acquisition function takes the form:

$$a_t(x) = \mu_{t-1}(x) + \left( \beta_t^{1/2} + \frac{\epsilon \sqrt{t}}{\sigma} \right) \sigma_{t-1}(x)$$

where $\epsilon = \|f^\# - f^\star\|_\infty$ is the mis-specification gap. The additional term in the acquisition function is required to ensure sub-linear convergence in this case.
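As an illustration of the posterior and GP-UCB formulas in this section, the following is a minimal NumPy sketch. The squared-exponential kernel, its lengthscale, the noise variance, the value of beta and the toy 1-D objective are all assumptions chosen for the example; they are not taken from the paper.

```python
import numpy as np

def se_kernel(a, b, lengthscale=0.3):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_obs, y_obs, x_query, noise_var=1e-2):
    """Posterior mean and variance at x_query (formulas of Section 2.2)."""
    gram = se_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k = se_kernel(x_obs, x_query)              # column j is k_{t-1}(x_j)
    mean = k.T @ np.linalg.solve(gram, y_obs)
    # K(x, x) = 1 for the SE kernel, so the prior variance is 1 everywhere.
    var = 1.0 - np.sum(k * np.linalg.solve(gram, k), axis=0)
    return mean, np.clip(var, 0.0, None)

def gp_ucb(mean, var, beta):
    """GP-UCB acquisition: mean + beta^{1/2} * posterior standard deviation."""
    return mean + np.sqrt(beta) * np.sqrt(var)

# Toy usage: three observations of a hypothetical 1-D objective.
x_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = -(x_obs[:, 0] - 0.3) ** 2
grid = np.linspace(0.0, 1.0, 11)[:, None]
mean, var = gp_posterior(x_obs, y_obs, grid)
x_next = grid[np.argmax(gp_ucb(mean, var, beta=4.0))]
```

A larger `beta` moves `x_next` towards regions of high posterior variance, and a smaller one towards the current posterior maximum, which is the trade-off described above.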

3. FRAMEWORK

We consider the following optimisation problem: $x^\star = \operatorname{argmax}_{x \in \mathbb{X}} f^\star(x)$, where $f^\star : \mathbb{X} \to \mathbb{R}$ is expensive and evaluation is noisy, with results $y = f^\star(x) + \nu$, where $\nu$ is $\Sigma$-sub-Gaussian noise. We optimise $f^\star$ as a series of experimental batches, indexed by $s = 1, 2, \ldots$. For each batch a human expert suggests $\hat{x}_s$ for evaluation and an AI suggests $\breve{x}_s$ (hatted variables relate to the human expert, breved variables to the AI). Both experiments are carried out, the human expert and the AI update their models based on the results, and the process is repeated for a total of $S$ batches. We do not know precisely how the human expert will model $f^\star$, but the AI maintains a GP model with kernel (prior covariance) $\breve{K}_s$ and we assume the human expert does the same with some unknown kernel $\hat{K}_s$ (we discuss this in detail in section 3.1), where $\hat{K}_s$ and $\breve{K}_s$ may be updated after each batch. The posterior means and variances given dataset $\mathbb{D}_s = \{(\hat{x}_i, \hat{y}_i), (\breve{x}_i, \breve{y}_i) : i \le s\}$ are:

$$\hat{\mu}_s(x) = \mathbf{y}_s^{\mathrm{T}} (\hat{\mathbf{K}}_s + \sigma^2 \mathbf{I})^{-1} \hat{\mathbf{k}}_s(x), \qquad \hat{\sigma}^2_s(x) = \hat{K}_s(x, x) - \hat{\mathbf{k}}_s^{\mathrm{T}}(x) (\hat{\mathbf{K}}_s + \sigma^2 \mathbf{I})^{-1} \hat{\mathbf{k}}_s(x)$$
$$\breve{\mu}_s(x) = \mathbf{y}_s^{\mathrm{T}} (\breve{\mathbf{K}}_s + \sigma^2 \mathbf{I})^{-1} \breve{\mathbf{k}}_s(x), \qquad \breve{\sigma}^2_s(x) = \breve{K}_s(x, x) - \breve{\mathbf{k}}_s^{\mathrm{T}}(x) (\breve{\mathbf{K}}_s + \sigma^2 \mathbf{I})^{-1} \breve{\mathbf{k}}_s(x)$$

for the human and AI, respectively. We assume $f^\star \in \breve{\mathcal{H}}_s = \mathcal{H}_{\breve{K}_s}$ lies in the RKHS of the AI's kernel $\breve{K}_s$. We do not make this assumption for the human expert, making the problem mis-specified from their perspective. Hence, borrowing from Bogunovic & Krause (2021), we assume that for batch $s$ the human by default attempts to maximise the closest function to $f^\star$ in $\hat{\mathcal{H}}_s$:

$$\hat{f}^\#_s = \operatorname{argmin}_{f \in \hat{\mathcal{H}}_s : \|f\|_{\hat{\mathcal{H}}_s} \le \hat{B}} \|f - f^\star\|_\infty$$

(here $\hat{\mathcal{H}}_s = \mathcal{H}_{\hat{K}_s}$). We further assume that the gap between $f^\star$ and $\hat{f}^\#_s$ is bounded as $\|\hat{f}^\#_s - f^\star\|_\infty \le \hat{\varepsilon}_s$, and that the human is able to learn (in effect, update their kernel) to "close the gap", so $\lim_{s \to \infty} \hat{\varepsilon}_s = 0$. The AI generates recommendations using GP-UCB.
Motivated by Borji & Itti (2013), we assume the human in effect generates recommendations using EC-GP-UCB (Bogunovic & Krause, 2021), while the AI uses GP-UCB:

$$\hat{x}_s = \operatorname{argmax}_{x \in \mathbb{X}} \hat{a}_s(x) = \hat{\mu}_{s-1}(x) + \left( \hat{\beta}_s^{1/2} + \frac{\hat{\varepsilon}_s}{\sigma} \sqrt{2s} \right) \hat{\sigma}_{s-1}(x) \qquad (3)$$
$$\breve{x}_s = \operatorname{argmax}_{x \in \mathbb{X}} \breve{a}_s(x) = \breve{\mu}_{s-1}(x) + \breve{\beta}_s^{1/2} \breve{\sigma}_{s-1}(x)$$

The form of $\hat{a}_s$ captures the broad behaviour of human recommendation selection, balancing exploitation of known good regions, which corresponds to $\hat{\beta}_s \to 0$ (e.g. "based on my model, this experiment should yield good results"), and exploration of areas of uncertainty, which corresponds to $\hat{\beta}_s \to \infty$ (e.g. "explore here, I'm curious"). We do not presume to know the precise trade-off $\hat{\beta}_s$ used by the human but, based on (Borji & Itti, 2013; Dane, 2010), it appears probable that the human will pursue a conservative policy, so $\hat{\beta}_s$ may be small. We therefore use the AI trade-off $\breve{\beta}_s$, which we control, to compensate for the potential conservative tendencies of the expert. It is convenient to specify the exploration/exploitation trade-off parameters $\hat{\beta}_s$ and $\breve{\beta}_s$ relative to (2) (Chowdhury & Gopalan, 2017; Bogunovic & Krause, 2021). Without loss of generality we require that $\hat{\beta}_s \in (\hat{\zeta}_{s\downarrow} \hat{\chi}_s, \hat{\zeta}_{s\uparrow} \hat{\chi}_s)$ and $\breve{\beta}_s = \breve{\zeta}_s \breve{\chi}_s$, where $0 \le \hat{\zeta}_{s\downarrow} \le 1 \le \hat{\zeta}_{s\uparrow} \le \infty$, $\breve{\zeta}_s \ge 1$ and, following (2):

$$\hat{\chi}_s = \left( \frac{\Sigma}{\sqrt{\sigma}} \sqrt{2 \ln(1/\delta) + 1 + \hat{\gamma}_s} + \|\hat{f}^\#_s\|_{\hat{\mathcal{H}}_s} \right)^2, \qquad \breve{\chi}_s = \left( \frac{\Sigma}{\sqrt{\sigma}} \sqrt{2 \ln(1/\delta) + 1 + \breve{\gamma}_s} + \|f^\star\|_{\breve{\mathcal{H}}_s} \right)^2 \qquad (5)$$

where $\hat{\gamma}_s$ and $\breve{\gamma}_s$ are, respectively, the max information gain for the human expert and the AI. We show in section 3.3 that this suffices to ensure sub-linear convergence for any human expert selections, and moreover that the convergence rate can be improved beyond what can be achieved by standard GP-UCB if the human operates as described here.

3.1. THE HUMAN MODEL

We posit that the human expert will maintain, explicitly or implicitly, an evolving weight-space model of $f^\star$ (we do not presume to know the details of this, only that it exists, explicitly or otherwise):

$$\hat{f}_s(x) = \hat{g}_s(\hat{p}_s(x))$$

where $\hat{g}_s$ is in some sense "simple" and $\hat{p}_s : \mathbb{R}^n \to \mathbb{R}^{\hat{m}_s}$ represents the human's understanding of important features. This fits into the above scheme if we let the form of $\hat{g}_s$ dictate the kernel $\hat{K}_s$. For example, if we know that $\hat{g}_s$ is linear (so $\hat{f}_s(x) = \hat{w}_s^{\mathrm{T}} \hat{p}_s(x)$ in weight-space) then we can say that the human is, in effect, using a GP model with a linear-derived kernel:

$$\hat{K}_s(x, x') = \hat{p}_s^{\mathrm{T}}(x)\, \hat{p}_s(x')$$

Similarly, if the human is using a model $\hat{g}_s$ captured by a $d$th-order approximation model $\hat{f}_s(x) = \hat{w}_s^{\mathrm{T}} [\hat{p}_s^{\otimes q}(x)]_{q = 0, 1, \ldots, d}$ then we can assume a GP with a polynomial-derived kernel:

$$\hat{K}_s(x, x') = \left( 1 + \hat{p}_s^{\mathrm{T}}(x)\, \hat{p}_s(x') \right)^d$$

As discussed previously, we do not assume that the human expert's evolving model suffices to completely replicate $f^\star$, particularly in the earlier stages of the algorithm, but we do assume that the human expert is capable of learning from observations of $f^\star$ to refine their model, closing the gap $\hat{\varepsilon}_{2s}$ between their model and $f^\star$ so that $\lim_{s \to \infty} \hat{\varepsilon}_{2s} = 0$.

With regard to max information gain, we can reasonably assume that the human expert (the expertise is important) starts with a better understanding of $f^\star$ than the AI: the underlying physics of the system, the behaviour one might expect in similar experiments, etc. So the human expert may begin with an incomplete but informative set of features that relate to their knowledge of the system or of similar systems, or an understanding of the covariance structure of design space, so:
• The prior variance of the human expert's GP will vary between a zero-knowledge base level in regions that are a mystery to the expert, and much lower in regions where the expert has a good understanding from past experience, understanding of underlying physics etc.
• The prior covariance of the expert's GP (the expert's kernel) will have a structure informed by the expert's understanding and knowledge, for example, of the "region A will behave like region B because they have feature/attribute C in common" type, so an experiment in region A will reduce the human expert's posterior variance in both regions.

By comparison, the AI will typically start with a generic kernel prior like an SE kernel, Matern kernel or similar. Such priors are "flat" over the design space, with no areas of lower prior variance and no "region A will behave like B" behaviour, so an experiment will only reduce the AI's local variance.foot_0 Thus as the algorithm progresses the human expert's variance will both start from a lower prior and decrease more quickly than the AI's. We know that max information gain is bounded as the sum of the logs of the pre-experiment posterior variance (Srinivas et al., 2012, Lemma 5.3), so we may reasonably assume that the expert's max information gain will be lower than the AI's. Finally, we argue that, as the human expert's kernel is built on relatively few, highly informative features, the expert's RKHS norm $\|\hat{f}^\#_s\|_{\hat{K}_s} = \|\hat{w}\|_2$ will be less than the AI's norm $\|f^\star\|_{\breve{K}_s} = \|\breve{w}\|_2$.
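To make the feature-derived kernels of this section concrete, here is a small NumPy sketch. The feature map is an assumption for illustration: it reuses the Matyas-2D high-level features listed in Table 1, and the function names are hypothetical.

```python
import numpy as np

def features(x):
    """Illustrative feature map p(x) for 2-D designs: (x1^2, x2^2, x1*x2)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return np.column_stack([x[:, 0] ** 2, x[:, 1] ** 2, x[:, 0] * x[:, 1]])

def linear_feature_kernel(a, b):
    """Linear-derived kernel: K(x, x') = p(x)^T p(x')."""
    return features(a) @ features(b).T

def polynomial_feature_kernel(a, b, degree=2):
    """Polynomial-derived kernel: K(x, x') = (1 + p(x)^T p(x'))^d."""
    return (1.0 + features(a) @ features(b).T) ** degree

# Worked values: p([1, 2]) = (1, 4, 2), p([3, 4]) = (9, 16, 12), dot = 97.
k_lin = linear_feature_kernel([[1.0, 2.0]], [[3.0, 4.0]])[0, 0]
k_poly = polynomial_feature_kernel([[1.0, 2.0]], [[3.0, 4.0]])[0, 0]
```

Because both kernels act only through `features`, two designs with similar feature values are strongly correlated even if they are far apart in the raw design space, which is the "region A will behave like region B" effect described above.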

3.2. THE BO-MUSE ALGORITHM

The BO-Muse algorithm is shown in algorithm 1, where $f^\star$ is optimised using a sequence of batches $s = 1, 2, \ldots, S$, each containing one human and one AI recommendation, respectively $\hat{x}_s$ and $\breve{x}_s$. The AI recommendation maximises the GP-UCB acquisition function on the AI's GP posterior, with the exploration/exploitation trade-off $\breve{\beta}_s$ given (see section 3.4). The human recommendation is assumed to be implicitly selected to maximise the EC-GP-UCB acquisition function (3), where the exploitation/exploration trade-off $\hat{\beta}_s$ is unknown but assumed to lie in the range $\hat{\beta}_s \in (\hat{\zeta}_{s\downarrow} \hat{\chi}_s, \hat{\zeta}_{s\uparrow} \hat{\chi}_s)$, and the gap $\hat{\varepsilon}_s$ is unknown but assumed to converge to 0 at the rate specified in theorem 1. We use two approximations in our definition of $\breve{\beta}_s$. Following standard practice, we approximate the max information gain $\breve{\gamma}_s$ as $\sum_{(x, \ldots) \in \mathbb{D}_{s-1}} \ln(1 + \sigma^{-2} \breve{\sigma}^2_i(x))$; and we approximate the RKHS norm bound $\breve{B} = \|f^\star\|_{\breve{\mathcal{H}}_s}$ as $\|\breve{\mu}_s\|_{\breve{\mathcal{H}}_s}$, which is updated using a max to ensure it is non-decreasing with $s$ (this will tend to over-estimate the norm, but this should not affect the algorithm's rate of convergence). For simplicity we assume that $\sigma = \Sigma$ (the model parameter matches the noise).

Algorithm 1 BO-Muse

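Input: Initial observations $\mathbb{D}_0 = \{(x_1, y_1), (x_2, y_2), \ldots\}$, AI prior $\mathrm{GP}(0, \breve{K})$.
Let $\breve{B} = 1$.
for $s \in 1, 2, \ldots, S$ do
    Set $\breve{\beta}_s = 7 \left( \sqrt{\sigma} \sqrt{2 \ln(1/\delta) + 1 + \sum_{(x, \ldots) \in \mathbb{D}_{s-1}} \ln(1 + \sigma^{-2} \breve{\sigma}^2_i(x))} + \breve{B} \right)^2$ as per (5), (8).
    AI recommends $\breve{x}_s = \operatorname{argmax}_{x \in \mathbb{X}} \breve{a}_s(x) = \breve{\mu}_{s-1}(x) + \breve{\beta}_s^{1/2} \breve{\sigma}_{s-1}(x)$.
    Human recommends $\hat{x}_s$.
    Run experiments to obtain $\hat{y}_s = f^\star(\hat{x}_s) + \hat{\nu}_s$ and $\breve{y}_s = f^\star(\breve{x}_s) + \breve{\nu}_s$.
    Set $\mathbb{D}_s = \mathbb{D}_{s-1} \cup \{(\hat{x}_s, \hat{y}_s), (\breve{x}_s, \breve{y}_s)\}$ and update the AI GP posterior.
    Set $\breve{B} = \max\{\breve{B},\ \mathbf{y}_s^{\mathrm{T}} (\breve{\mathbf{K}}_s + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_s\}$.
end for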
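The loop in Algorithm 1 can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the SE kernel with a fixed lengthscale, the finite candidate grid, the noiseless toy objective, and the `exploitative_human` stand-in (which simply perturbs the best design seen so far) are all hypothetical, and the information-gain sum is accumulated only over points chosen inside the loop.

```python
import numpy as np

def se_kernel(a, b, ls=0.3):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / ls**2)

def posterior(X, y, Xq, sigma2):
    gram = se_kernel(X, X) + sigma2 * np.eye(len(X))
    k = se_kernel(X, Xq)
    mean = k.T @ np.linalg.solve(gram, y)
    var = 1.0 - np.sum(k * np.linalg.solve(gram, k), axis=0)
    return mean, np.clip(var, 1e-12, None)

def bo_muse(f, human, X0, y0, candidates, S=5, sigma=0.1, delta=0.1, zeta=7.0):
    X, y = np.array(X0, dtype=float), np.array(y0, dtype=float)
    sigma2, B, info = sigma**2, 1.0, 0.0
    for _ in range(S):
        mean, var = posterior(X, y, candidates, sigma2)
        # AI trade-off: zeta * (sqrt(sigma) * sqrt(2 ln(1/delta) + 1 + info) + B)^2
        beta = zeta * (np.sqrt(sigma) * np.sqrt(2 * np.log(1 / delta) + 1 + info) + B) ** 2
        x_ai = candidates[np.argmax(mean + np.sqrt(beta) * np.sqrt(var))]
        x_hu = human(X, y)                      # the human picks freely
        batch = np.vstack([x_hu, x_ai])
        # Approximate information gain from the pre-experiment variances.
        _, v = posterior(X, y, batch, sigma2)
        info += float(np.sum(np.log1p(v / sigma2)))
        y_batch = np.array([f(x) for x in batch])
        X, y = np.vstack([X, batch]), np.concatenate([y, y_batch])
        # Non-decreasing norm estimate, as in the last step of Algorithm 1.
        gram = se_kernel(X, X) + sigma2 * np.eye(len(X))
        B = max(B, float(y @ np.linalg.solve(gram, y)))
    return X, y

rng = np.random.default_rng(0)

def exploitative_human(X, y):
    """Hypothetical stand-in for an expert: small step around the best design."""
    best = X[np.argmax(y)]
    return np.clip(best + 0.05 * rng.standard_normal(best.shape), 0.0, 1.0)

f = lambda x: -float((x[0] - 0.7) ** 2)       # toy objective, maximum at 0.7
X0 = np.array([[0.1], [0.9]])
y0 = np.array([f(x) for x in X0])
grid = np.linspace(0.0, 1.0, 101)[:, None]
X, y = bo_muse(f, exploitative_human, X0, y0, grid, S=5)
```

In a real study `human` would be replaced by an interface that asks the expert for the next design; the AI side does not need to know how that choice was made.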

3.3. CONVERGENCE AND REGRET BOUNDS

We now discuss the convergence properties of BO-Muse. Our goal here is twofold: first, we show that BO-Muse converges even in the worst case where the human operates arbitrarily; and second, assuming the human behaves according to our assumptions, we analyse how convergence is accelerated. Our approach is based on regret analysis (Srinivas et al., 2012; Chowdhury & Gopalan, 2017; Bogunovic & Krause, 2021). As experiments are batched we are concerned with instantaneous regret per batch, not per experiment. The instantaneous regret for batch $s$ is:

$$r_s = \min\{ f^\star(x^\star) - f^\star(\hat{x}_s),\ f^\star(x^\star) - f^\star(\breve{x}_s) \}$$

and the cumulative regret up to and including batch $S$ is $R_S = \sum_{s=1}^{S} r_s$. If $R_S$ grows sub-linearly then the minimum instantaneous regret will converge to 0 as $S \to \infty$.

Consider first the worst-case scenario, i.e. an arbitrary, non-expert human. In this case the BO-Muse algorithm 1 effectively involves the AI using a GP-UCB acquisition function (the constant scaling factor on $\breve{\beta}_s$ does not change the convergence properties of GP-UCB) to design a sequence of experiments $\breve{x}_s$, where each experiment costs twice as much as usual to evaluate and yields two observations, $(\breve{x}_s, \breve{y}_s = f^\star(\breve{x}_s) + \breve{\nu}_s)$ and $(\hat{x}_s, \hat{y}_s = f^\star(\hat{x}_s) + \hat{\nu}_s)$, where $\hat{x}_s$ is arbitrary. Additional observations can only improve the accuracy of the posterior, so using standard methods (e.g. Chowdhury & Gopalan (2017); Bogunovic & Krause (2021); Srinivas et al. (2012)), we see that

$$R_S = O\left( \breve{B} \sqrt{\breve{\gamma}_{2S}\, 2S} + \sqrt{(\ln(1/\delta) + \breve{\gamma}_{2S})\, \breve{\gamma}_{2S}\, 2S} \right),$$

which is sub-linear for well-behaved kernels. Thus we see that the worst-case convergence of BO-Muse is the same, in the big-O sense, as that of GP-UCB. However, this is pessimistic: in reality we assume a human is generating experiments $\hat{x}_s$ using an EC-GP-UCB style trade-off between exploitation and exploration, and moreover that the human is an expert with an implicit or explicit evolving model of the system that is superior to the AI's generic prior.
For this case we have the following result:

Theorem 1. Fix $\delta > 0$, $\theta_\downarrow > 0$, $\theta_s \in [\theta_\downarrow, \infty]$, $S \in \mathbb{N}_+$. Assume $\hat{\zeta}_{s\downarrow} \in (0, 1]$, $\breve{\zeta}_s = \breve{\zeta} \ge 1$ so:

$$M_{\theta_s}\left( \exp\left( \hat{\chi}_s^{1/2} (1 - \hat{\zeta}_{s\downarrow}^{1/2})\, \sigma_{s-1\uparrow} \right),\ \exp\left( \breve{\chi}_s^{1/2} (1 - \breve{\zeta}^{1/2})\, \sigma_{s-1\downarrow} \right) \right) \le 1 \qquad (6)$$

for all $s \le S$, where $\sigma_{s-1\uparrow} = \max(\hat{\sigma}_{s-1}(x^\star), \breve{\sigma}_{s-1}(x^\star))$, $\sigma_{s-1\downarrow} = \min(\hat{\sigma}_{s-1}(x^\star), \breve{\sigma}_{s-1}(x^\star))$, $M_\theta(a, b) = (\frac{1}{2}(a^\theta + b^\theta))^{1/\theta}$ is the generalised mean, and $\hat{\chi}_s$, $\breve{\chi}_s$ are as per (5). If $\hat{\varepsilon}_{2s} = O(s^{-1})$foot_1 then:

$$\frac{1}{S} R_S \le \ln M_{-\theta_\downarrow}\left( \exp\left( \hat{C} \sqrt{\frac{\hat{\chi}_S \hat{\gamma}_S}{S}} + O\left(\frac{1}{S}\right) \right),\ \exp\left( \breve{C} \left( \tfrac{1}{2} (1 + \breve{\zeta}^{1/2}) \right)^2 \sqrt{\frac{\breve{\chi}_S \breve{\gamma}_S}{S}} \right) \right) \qquad (7)$$

with probability $\ge 1 - \delta$, where $\hat{C}, \breve{C} > 0$ are constants.

We present the proof of this theorem in the appendix. As discussed previously, $\hat{\gamma}_S$ and $\breve{\gamma}_S$ are the max information gains for the human and AI, respectively; $\hat{\zeta}_{s\downarrow} \in (0, 1]$ defines the degree to which the human is allowed to over-exploit; and $\breve{\zeta}$ defines the degree to which the AI must over-explore to compensate. The parameters $\theta_s$ constrain the range of $\breve{\zeta}$ for which the theorem is applicable through the condition (6), and the "mix" between human and AI in the regret bound (7). Assuming the max information gain for the human model has better convergence properties than the AI model, which we consider reasonable based on previous discussion, and noting that the power mean $M_\theta(a, b)$ will be dominated by $\max(a, b)$ as $\theta \to \infty$ and by $\min(a, b)$ as $\theta \to -\infty$, we see that:
• When $\theta_s$ is large the AI trade-off parameter $\breve{\zeta}$ must be large to satisfy (6), corresponding to a very explorative AI, and the regret bound (7) will be asymptotically superior, being dominated by the human's max information gain $\hat{\gamma}_S$ for large $S$ which, as discussed in section 3.1, may be reasonably expected to be smaller than the AI's max information gain due to the human expert's superior understanding of $f^\star$.
• When $\theta_s$ is small the AI is less constrained, so the theorem is applicable for more exploitative AIs, but the regret bound (7) will be more strongly influenced by the AI max information gain and hence asymptotically inferior.
Note that, while larger $\theta_s$ will lead to tighter regret bounds in the asymptotic regime, this bound will only apply for very explorative AIs, and moreover the factor $(\frac{1}{2}(1 + \breve{\zeta}^{1/2}))^2$ will also be large, so the regret bound may only be superior for very large $S$. The "ideal" trade-off is unclear, but the important observation from this theorem is that the human's expertise, and subsequent superior maximum information gain, will improve the regret bound and thus convergence.
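The batch regret used in this section is straightforward to compute when the true optimum is known, as it is for synthetic benchmarks. The sketch below is illustrative only; the function names and the example values are assumptions.

```python
def batch_regret(f_opt, f_human, f_ai):
    """Instantaneous regret of a batch: the better of the two experiments counts."""
    return min(f_opt - f_human, f_opt - f_ai)

def cumulative_regret(f_opt, human_values, ai_values):
    """Running sum R_S of batch regrets over batches s = 1..S."""
    total, curve = 0.0, []
    for f_human, f_ai in zip(human_values, ai_values):
        total += batch_regret(f_opt, f_human, f_ai)
        curve.append(total)
    return curve

# Hypothetical noiseless objective values for three batches, optimum f* = 1.0.
curve = cumulative_regret(1.0, [0.5, 0.9, 0.95], [0.2, 0.6, 1.0])
```

Here the per-batch regrets are 0.5, 0.1 and 0, so the cumulative curve flattens once either participant reaches the optimum; sub-linear growth of this curve is what the bounds above guarantee.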

3.4. TUNING PARAMETER SELECTION

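We now consider the selection of the tuning parameter $\breve{\beta}_s$ to ensure faster convergence than standard BO alone. The conditions that $\breve{\beta}_s$ must meet are specified by (6). Taking the maximally pessimistic view of human under-exploration (i.e. $\hat{\zeta}_{s\downarrow} = 0$), we require that

$$\exp\left( \theta_s \sqrt{\breve{\chi}_s}\, (1 - \breve{\zeta}_s^{1/2})\, \sigma_{s-1\downarrow} \right) \le 2 - \exp\left( \theta_s \sqrt{\hat{\chi}_s}\, \sigma_{s-1\uparrow} \right)$$

or, equivalently,

$$\breve{\zeta}_s \ge \left( 1 + \left( \frac{\hat{\chi}_s}{\breve{\chi}_s} \right)^{1/2} \frac{\sigma_{s-1\uparrow}}{\sigma_{s-1\downarrow}} \frac{1}{\phi_s} \ln \frac{1}{2 - e^{\phi_s}} \right)^2,$$

where $\phi_s = \theta_s \sqrt{\hat{\chi}_s}\, \sigma_{s-1\uparrow} \in (0, \ln 2)$ and, noting that the right-hand side of the inequality is strictly increasing in $\phi_s$, $\phi_s = 0$ corresponds to $\breve{\zeta}_s = (1 + (\hat{\chi}_s / \breve{\chi}_s)^{1/2}\, \sigma_{s-1\uparrow} / \sigma_{s-1\downarrow})^2$ and $\phi_s = \ln 2$ to $\breve{\zeta}_s = \infty$.

As discussed in section 3.1, it is reasonable to assume that the max information gains satisfy $\breve{\gamma}_s \ge \hat{\gamma}_s$, and that $\|\hat{f}^\#_s\|_{\hat{K}_s} \le \|f^\star\|_{\breve{K}_s}$. With these assumptions $\sqrt{\hat{\chi}_s} \le \sqrt{\breve{\chi}_s}$. Furthermore, we prove in the Appendix that $\lim_{s \to \infty} \sigma_{s-1\uparrow} / \sigma_{s-1\downarrow} = 1$, so we approximate the bound on $\breve{\zeta}_s$ as $\breve{\zeta}_s \ge (1 + \frac{1}{\phi_s} \ln \frac{1}{2 - e^{\phi_s}})^2$. Recalling $\phi_s \in (\ln 1, \ln 2)$ we finally, somewhat arbitrarily, select $\phi_s = \ln 3/2$ (the middle of the range in the log domain), which leads to the following heuristic used in the BO-Muse algorithm:

$$\breve{\zeta}_s \ge \left( 1 + \frac{\ln 2}{\ln 3/2} \right)^2 \approx 7 \qquad (8)$$

Table 1: Synthetic benchmark functions. Analytical forms are provided in the second column and the last column depicts the high level features used by a simulated human expert.

| Functions | $f^\star(x)$ | High Level Features |
|---|---|---|
| Matyas-2D | $0.26 (x_1^2 + x_2^2) - 0.48\, x_1 x_2$ | $x'_1 = x_1^2$, $x'_2 = x_2^2$, $x'_3 = x_1 x_2$ |
| Ackley-4D | $-a e^{-b \sqrt{\frac{1}{d} \lVert x \rVert_2^2}} - e^{\frac{1}{d} \sum_i \cos(c x_i)} + a + e^1$ | $x'_1 = \cos(x_1)$, $x'_2 = \cos(x_2)$, $x'_3 = \cos(x_3)$, $x'_4 = \cos(x_4)$, $x'_5 = \lVert x \rVert_2$ |
| Levy-6D | $\sin^2(\pi w_1) + s + (w_d - 1)^2 [1 + \sin^2(2 \pi w_d)]$, where $s = \sum_{i}^{d-1} (w_i - 1)^2 [1 + 10 \sin^2(\pi w_i + 1)]$ and $w_i = 1 + \frac{x_i - 1}{4}\ \forall i \in \mathbb{N}_6$ | $x'_1 = (\sin x_1)^2$, $x'_{j+1} = x_j^2 (\sin x_j)^2\ \forall j \in \mathbb{N}_6$ |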
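The arithmetic behind heuristic (8) in Section 3.4 can be checked numerically. The helper below is a sketch: its name and the two ratio arguments (standing for the ratio of the human's to the AI's chi, and of the larger to the smaller posterior standard deviation at the optimum) are illustrative, with both ratios defaulting to 1 as in the approximation that leads to (8).

```python
import math

def zeta_lower_bound(phi, chi_ratio=1.0, sigma_ratio=1.0):
    """Lower bound on the AI over-exploration factor for phi in (0, ln 2)."""
    if not 0.0 < phi < math.log(2.0):
        raise ValueError("phi must lie strictly between 0 and ln 2")
    inner = math.log(1.0 / (2.0 - math.exp(phi))) / phi
    return (1.0 + math.sqrt(chi_ratio) * sigma_ratio * inner) ** 2

# Midpoint of the admissible range in the log domain, phi = ln(3/2).
zeta = zeta_lower_bound(math.log(1.5))   # about 7.34, rounded to 7 in (8)
```

The bound grows without limit as `phi` approaches ln 2 and tends to 4 as `phi` approaches 0 (with both ratios equal to 1), so the choice of `phi` directly sets how explorative the AI must be.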

4. EXPERIMENTS

We validate the performance of our proposed BO-Muse algorithm on the optimisation of synthetic benchmark functions and on real-world tasks involving human experts. In all our experiments we have used the Squared Exponential (SE) kernel at all levels, with associated hyper-parameters tuned using maximum-likelihood estimation. We measure the sample-efficiency of the BO-Muse framework and other standard baselines in terms of the simple regret $r_t$: $r_t = f^\star(x^\star) - \max_{x_t \in \mathbb{D}_t} f^\star(x_t)$, where $f^\star(x^\star)$ is the true global optimum and $\max_{x_t \in \mathbb{D}_t} f^\star(x_t)$ is the best solution observed in $t$ iterations. Experiments were run on an Intel Xeon CPU @ 3.60GHz workstation with 16 GB of RAM.
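The simple regret defined above depends only on the best value observed so far, so it can be tracked with a running maximum. The following sketch is illustrative; the function name and the example values are assumptions.

```python
def simple_regret(f_opt, observed_values):
    """Simple regret after each evaluation: f_opt minus the best value so far."""
    best, curve = float("-inf"), []
    for value in observed_values:
        best = max(best, value)
        curve.append(f_opt - best)
    return curve

# Hypothetical sequence of true objective values with global optimum 0.
curve = simple_regret(0.0, [-3.0, -1.0, -2.0, -0.5])
```

By construction the curve is non-increasing, which is why it is a convenient summary when comparing methods under a fixed evaluation budget.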

4.1. EXPERIMENTS WITH OPTIMISATION BENCHMARK FUNCTIONS

We have evaluated BO-Muse on synthetic test functions covering a range of dimensions, as detailed in Table 1. We compare the sample-efficiency of BO-Muse with (i) Generic BO: a standard GP-UCB based BO algorithm with the exploration-exploitation trade-off factor ($\beta$) set as per Srinivas et al. (2012); (ii) Simulated Human: a simulated human with access to higher level properties (refer to Section 3.1 and the high level features in Table 1) that may help to model the optimisation function more accurately; and (iii) Simulated Human + PE: a simulated human teamed with an AI agent using a pure exploration strategy (that is, an AI policy with $\breve{\zeta}_s = \infty$). To simulate a human expert (with high exploitation), we use a standard BO algorithm with a small exploration factor, maximising $\hat{a}_s(x) = \hat{\mu}_s(x) + 0.001\, \hat{\sigma}_s(x)$. Furthermore, we have allocated the same function evaluation budget to all the competing methods. Figures 2a and 2b show the simple regret computed for the Ackley-4D and Levy-6D functions averaged over 10 randomly initialised runs: BO-Muse consistently outperforms all the baselines for both functions. We believe that the poor performance of Simulated Human + PE is due to the pure exploration strategy used. Additionally, we have conducted an ablation study by varying the degree of exploitation-exploration ($\hat{\beta}_s$) to simulate experts ranging from over-exploitative to over-explorative. Figure 2c shows the result of the ablation study for the Matyas-2D function, and we refer to the appendix for additional results. To further understand the behaviour of the BO-Muse framework with other human acquisition functions, we have also considered an additional function optimisation experiment that uses the Expected Improvement acquisition function for human experts. The experimental details and results of the ablation study are provided in the Appendix.
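For reference, the three benchmark functions of Table 1 can be written in plain Python as below. This is a sketch under assumptions: the Ackley constants a = 20, b = 0.2, c = 2π are the commonly used defaults and are not stated in the text, the functions are given in their usual minimisation form (how they are sign-flipped for the argmax formulation is not specified here), and `matyas_features` is a hypothetical helper for the high-level features of the simulated human.

```python
import math

def matyas(x):
    x1, x2 = x
    return 0.26 * (x1**2 + x2**2) - 0.48 * x1 * x2

def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    d = len(x)
    term1 = -a * math.exp(-b * math.sqrt(sum(v * v for v in x) / d))
    term2 = -math.exp(sum(math.cos(c * v) for v in x) / d)
    return term1 + term2 + a + math.e

def levy(x):
    w = [1.0 + (v - 1.0) / 4.0 for v in x]
    head = math.sin(math.pi * w[0]) ** 2
    body = sum((wi - 1.0) ** 2 * (1.0 + 10.0 * math.sin(math.pi * wi + 1.0) ** 2)
               for wi in w[:-1])
    tail = (w[-1] - 1.0) ** 2 * (1.0 + math.sin(2.0 * math.pi * w[-1]) ** 2)
    return head + body + tail

def matyas_features(x):
    """High-level features for Matyas-2D from Table 1: (x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return (x1**2, x2**2, x1 * x2)
```

Each function evaluates to (numerically) zero at its known minimiser: the origin for Matyas and Ackley, and the all-ones vector for Levy. Note that Matyas is exactly linear in `matyas_features`, which is what gives the simulated human its modelling advantage.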

4.2. REAL-WORLD EXPERIMENTS

We now evaluate the performance of our proposed framework in complex real-world tasks.

4.2.1. CLASSIFICATION TASKS -SUPPORT VECTOR MACHINES AND RANDOM FORESTS

Experimental Set-up. The expert task is to choose hyper-parameters for Support Vector Machine (SVM) and Random Forest (RF) classifiers operating on the real-world Biodeg dataset from the UCI repository (Dua & Graff, 2017). We divide the dataset into random 80/20 train/test splits. We set up two human expert teams. Each member of Team 1 works in partnership with BO-Muse, whilst members of Team 2 work individually without BO-Muse. We recruited 8 participantsfoot_2 consisting of 4 postdocs and 4 postgraduate students. 2 postdocs and 2 students are allocated to each team randomly so that each team has 4 participants, with roughly similar expertise. Each participant is given the same budget: 3 random initial designs + 30 further iterations. At the end of each iteration, the test classification error for the suggested hyper-parameter set is computed. We measure the overall performance using simple regret, which is the minimum test classification error observed so far. The individual results from each team are averaged to compare (Team 1) BO-Muse + Human vs (Team 2) Human alone (baseline). Additionally, we report the performance of the Generic BO (AI alone) method to demonstrate the efficacy of our approach. Further, we have set the same seed initialisation and allocated the same evaluation budget for all the algorithms. As we provide each participant with the same set of random initial observations, we do not compute descriptive statistics for the Generic BO baseline. We note that BO-Muse is designed to work with experts at various levels, as we assume an imperfect expert model via a mis-specified GP.

Interfacing with Experts. The human experts perform two hyper-parameter tuning experiments to minimise the test classification error of SVM and RF. The expert is provided with a simple graphical interface (Figure 3a) which shows accumulated observations of classifier performance as a function of hyper-parameters, with the best result so far shown in dark blue.
Interfaces for both teams are similar, except that teams working with BO-Muse also see the previous AI-generated observations, indicated by ▼. Experts suggest the next hyper-parameter set by clicking at a point of their choice inside the plot. We refer to Appendix A.9 for more details.

Experiment 1 - SVM Classification. In this experiment we have considered C-SVM classifiers with a Radial Basis Function (RBF) kernel. We have used the LibSVM (Chang & Lin, 2011) implementation of SVM, with hyper-parameters kernel scale γ and cost parameter C. The SVM hyper-parameters γ and C are tuned in the exponent space of [-3, 3].

Experiment 2 - Random Forest Classification. In the classification tasks with random forests we let the experts and BO-Muse tune two hyper-parameters, the maximum depth of the decision tree and the number of samples per split, in the ranges (0, 100] and (1, 50], respectively.

The results of our classification experiments are depicted in Figure 3b and Figure 3c. In both experiments, the BO-Muse + Human team outperforms the experts working on their own.

4.2.2. SUPPLEMENTARY SPACE SHIELD DESIGN EXPERIMENT

Our third experiment evaluates BO-Muse in a real applied engineering experiment. We consider the design of a shield for protecting spacecraft against the impact of a space debris particle by partnering with a world-leading impact expert. For this problem, i.e. impact of a cubic steel debris particle, there exists no state-of-the-art solution. We include this experiment in Appendix A.10 because of the lack of familiarity of CS with this problem and the detail required to give sufficient background. In the experiment (see Table 6 in Appendix A.10) we observe the human expert initially exploring solutions based on the state-of-the-art for more typical debris impact problems (which are normally simplified to spherical aluminium projectiles). The expert is observed to rapidly exploit their initial 3 designs to identify two feasible shielding solutions within the first four batch iterations (ID 4-11, marked in blue). The expert performs further exploitation of these successful designs (Result=0) over the next four batch iterations (ID 12-19, marked in brown) in an attempt to reduce the weight, but is unsuccessful. To this point the expert does not appear to have been influenced at all by the BO suggestions. However, in the next four batch iterations (ID 20-27, marked in green) we can observe the expert taking inspiration from the previous BO-Muse suggestions. One such exploitation results in a successful solution (ID 27), further exploitation of which provides the best solution identified by the experiment (ID 29). This solution is highly irregular for spacecraft debris shields, utilising a polymer outer layer in contact with a metallic backing to disrupt the debris particle. Such a design is not reminiscent of any established flight hardware, see e.g. Christiansen et al. (2009). Thus, BO-Muse is demonstrated to have performed its role as hypothesised, inspiring the human expert with novel designs that are subsequently subject to exploitation by the human expert.
The experimental set-up and results are discussed in detail in Appendix A.10.

5. CONCLUSION

We have presented a new approach to human expert/AI teaming for experimental optimisation. Our algorithm lets the human expert take the lead in the experimental process thus allowing them to fully use their domain expertise, while the AI plays the role of a muse, injecting novelty and searching for regions the human may have overlooked to break the human out of over-exploitation induced by cognitive entrenchment. We show that our algorithm converges sub-linearly and faster than either the AI or human expert alone.

A APPENDIX

A.1 PROOFS OF REGRET BOUNDS

In this appendix we consider a generalised version of the framework. For clarity (we favour brevity in the paper body, but clarity is essential here) we also require some additional notation. As in the main paper, our goal is to solve:

$$x^\star = \operatorname*{argmax}_{x \in \mathbb{X}} f^\star(x)$$

where $f^\star$ is only measurable via an expensive and noisy process:

$$y = f^\star(x) + \nu$$

where $\nu$ is $\Sigma$-sub-Gaussian noise. We assume a series of experimental batches $s = 1, 2, \ldots$, where batch $s$ contains $\hat{p}$ human-generated experiments and $\breve{p}$ AI-generated experiments, giving a total of $p = \hat{p} + \breve{p}$ experiments. We use a hat with an index $i \in \hat{\mathbb{M}}$ to indicate a property relating to human $i$, and a breve with an index $j \in \breve{\mathbb{M}}$ to indicate a property relating to AI $j$, where:

$$\hat{\mathbb{M}} = \mathbb{N}_{\hat{p}} \ \text{(set of humans)}, \qquad \breve{\mathbb{M}} = \mathbb{N}_{\breve{p}} \ \text{(set of AIs)}$$

and, consistent with the index sets below, $\mathbb{N}_n = \{0, 1, \ldots, n-1\}$. We assume the batches are run sequentially and, wlog, that the experiments within each batch are nominally ordered, so the set of all experiments may be indexed with $t$. We use a bar to differentiate between a property indexed by experiment number $t$, which is unbarred (e.g. $c_t$), and a property indexed by batch number $s$, which is barred (e.g. $\bar{d}_s$). For simplicity we define:

$$\begin{aligned}
s_t &= \lceil t/p \rceil && \text{(batch in which experiment } t \text{ occurs)} \\
\bar{t}_s &= p(s-1) + 1 && \text{(first experiment in batch } s\text{)} \\
\underline{t} &= p(s_t-1) + 1 && \text{(first experiment in batch } s_t\text{)} \\
\hat{\bar{t}}_s &= p(s-1) + 1 && \text{(first human experiment in batch } s\text{)} \\
\hat{\underline{t}} &= p(s_t-1) + 1 && \text{(first human experiment in batch } s_t\text{)} \\
\breve{\bar{t}}_s &= p(s-1) + 1 + \hat{p} && \text{(first AI experiment in batch } s\text{)} \\
\breve{\underline{t}} &= p(s_t-1) + 1 + \hat{p} && \text{(first AI experiment in batch } s_t\text{)} \\
\bar{\mathbb{T}}_s &= \mathbb{N}_p + p(s-1) + 1 && \text{(set of experiments in batch } s\text{)} \\
\bar{\mathbb{T}}_{\le s} &= \mathbb{N}_{sp} + 1 && \text{(set of experiments up to and including batch } s\text{)}
\end{aligned}$$

We assume that humans and AIs each maintain a GP model that is updated after each batch.
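The batch indexing above is easy to get wrong by one, so a small sketch may help. It assumes experiments are numbered from 1, that the human experiments come first within each batch, and that `p` and `p_human` (hypothetical names for the batch size and the number of human experiments per batch) are fixed across batches.

```python
def batch_of(t: int, p: int) -> int:
    """Batch s_t = ceil(t / p) containing experiment t (t is 1-indexed)."""
    return (t + p - 1) // p


def first_in_batch(s: int, p: int) -> int:
    """Index of the first experiment in batch s, i.e. p * (s - 1) + 1."""
    return p * (s - 1) + 1


def first_ai_in_batch(s: int, p: int, p_human: int) -> int:
    """Index of the first AI experiment in batch s (humans go first)."""
    return first_in_batch(s, p) + p_human


def batch_experiments(s: int, p: int) -> list[int]:
    """All experiment indices in batch s."""
    start = first_in_batch(s, p)
    return list(range(start, start + p))


def owner(t: int, p: int, p_human: int) -> tuple[str, int]:
    """Return ("human", i) or ("ai", j) for experiment t, with 0-based i, j."""
    offset = t - first_in_batch(batch_of(t, p), p)
    if offset < p_human:
        return ("human", offset)
    return ("ai", offset - p_human)
```

For example, with one human and two AIs per batch (`p = 3`, `p_human = 1`), experiment 5 belongs to batch 2 and is the first AI suggestion of that batch.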
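After batch $s$, the posterior means and variances of these GP models are, respectively:

$$\begin{aligned}
\hat{\mu}_{i,s}(x) &= \hat{\boldsymbol{\alpha}}_{i,s}^{T} \hat{\mathbf{k}}_{i,s}(x), & \hat{\sigma}^2_{i,s}(x) &= \hat{K}_{i,s}(x,x) - \hat{\mathbf{k}}_{i,s}^{T}(x) \left( \hat{\mathbf{K}}_{i,s} + \sigma^2 \mathbf{I} \right)^{-1} \hat{\mathbf{k}}_{i,s}(x) \\
\breve{\mu}_{j,s}(x) &= \breve{\boldsymbol{\alpha}}_{j,s}^{T} \breve{\mathbf{k}}_{j,s}(x), & \breve{\sigma}^2_{j,s}(x) &= \breve{K}_{j,s}(x,x) - \breve{\mathbf{k}}_{j,s}^{T}(x) \left( \breve{\mathbf{K}}_{j,s} + \sigma^2 \mathbf{I} \right)^{-1} \breve{\mathbf{k}}_{j,s}(x)
\end{aligned}$$

where:

$$\begin{aligned}
\hat{\boldsymbol{\alpha}}_{i,s} &= \left( \hat{\mathbf{K}}_{i,s} + \sigma^2 \mathbf{I} \right)^{-1} \bar{\mathbf{y}}_s, & \breve{\boldsymbol{\alpha}}_{j,s} &= \left( \breve{\mathbf{K}}_{j,s} + \sigma^2 \mathbf{I} \right)^{-1} \bar{\mathbf{y}}_s \\
\hat{\mathbf{K}}_{i,s} &= \left[ \hat{K}_{i,s}(x_t, x_{t'}) \right]_{t,t' \in \bar{\mathbb{T}}_{\le s}}, & \breve{\mathbf{K}}_{j,s} &= \left[ \breve{K}_{j,s}(x_t, x_{t'}) \right]_{t,t' \in \bar{\mathbb{T}}_{\le s}} \\
\hat{\mathbf{k}}_{i,s}(x) &= \left[ \hat{K}_{i,s}(x_t, x) \right]_{t \in \bar{\mathbb{T}}_{\le s}}, & \breve{\mathbf{k}}_{j,s}(x) &= \left[ \breve{K}_{j,s}(x_t, x) \right]_{t \in \bar{\mathbb{T}}_{\le s}} \\
\bar{\mathbf{y}}_s &= \left[ f^\star(x_t) + \nu_t \right]_{t \in \bar{\mathbb{T}}_{\le s}}, & \bar{\mathbf{f}}_s &= \left[ f^\star(x_t) \right]_{t \in \bar{\mathbb{T}}_{\le s}}
\end{aligned}$$

We also occasionally use:

$$\mathbf{y}_t = \left[ f^\star(x_{t'}) + \nu_{t'} \right]_{t' \le t}, \qquad \mathbf{f}_t = \left[ f^\star(x_{t'}) \right]_{t' \le t}$$

As this is a batch algorithm we are concerned with the instantaneous regret for the batches, not the individual experiments therein. The instantaneous regret for batch $s$ is:

$$\bar{r}_s = \min_{t \in \bar{\mathbb{T}}_s} r_t, \qquad r_t = f^\star(x^\star) - f^\star(x_t)$$

where $r_t$ is the instantaneous regret for experiment $t$. The cumulative regret up to and including batch $S$ is:

$$\bar{R}_S = \sum_{s \in \mathbb{N}_S + 1} \bar{r}_s$$

We do not assume $f^\star$ is drawn from any of the GP models for humans or AIs, so the problem is misspecified. Borrowing from Bogunovic & Krause (2021), we therefore assume humans and AIs attempt to maximise the "closest" (best-in-class) function to $f^\star$ in the respective hypothesis spaces. Using the shorthand $\hat{\mathcal{H}}_{i,s} = \mathcal{H}_{\hat{K}_{i,s}}$, $\breve{\mathcal{H}}_{j,s} = \mathcal{H}_{\breve{K}_{j,s}}$ for the RKHSs, the hypothesis spaces are:

$$\hat{\mathbb{H}}_{i,s} = \left\{ f \in \hat{\mathcal{H}}_{i,s} : \|f\|_{\hat{\mathcal{H}}_{i,s}} \le \hat{B}_i \right\}, \qquad \breve{\mathbb{H}}_{j,s} = \left\{ f \in \breve{\mathcal{H}}_{j,s} : \|f\|_{\breve{\mathcal{H}}_{j,s}} \le \breve{B}_j \right\}$$

where the closest-to-optimal approximations of $f^\star$ in the hypothesis spaces are:

$$\hat{f}^{\#}_{i,s} = \operatorname*{argmin}_{f \in \hat{\mathbb{H}}_{i,s}} \|f - f^\star\|_\infty, \qquad \breve{f}^{\#}_{j,s} = \operatorname*{argmin}_{f \in \breve{\mathbb{H}}_{j,s}} \|f - f^\star\|_\infty$$

As is usual in practice, kernels may be updated when the GP models are updated, typically using max-log-likelihood for AIs or something more radical for the humans, which modifies the corresponding RKHSs.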
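The posterior mean and variance expressions at the start of this passage are the standard GP regression formulas. The following is a minimal NumPy sketch of them, assuming a squared-exponential kernel and the hypothetical names `se_kernel`, `gp_posterior` and `noise_var` (standing in for $\sigma^2$); it is an illustration, not the implementation used in the experiments.

```python
import numpy as np


def se_kernel(A: np.ndarray, B: np.ndarray, lengthscale: float = 1.0) -> np.ndarray:
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)


def gp_posterior(X, y, Xq, noise_var=1e-2, kernel=se_kernel):
    """Posterior mean and variance at query points Xq given observations (X, y).

    mean(x) = alpha^T k(x),  alpha = (K + noise_var I)^{-1} y
    var(x)  = K(x, x) - k(x)^T (K + noise_var I)^{-1} k(x)
    """
    K = kernel(X, X)
    k = kernel(X, Xq)                      # shape (n_obs, n_query)
    A = K + noise_var * np.eye(len(X))
    alpha = np.linalg.solve(A, y)
    mean = k.T @ alpha
    v = np.linalg.solve(A, k)
    var = np.diag(kernel(Xq, Xq)) - np.sum(k * v, axis=0)
    return mean, np.maximum(var, 0.0)      # clip tiny negative round-off
```

Note that the variance depends only on the inputs `X` and `Xq`, not on `y`, which is the property used later when comparing the real and nominal models.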
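The difference between the best-in-class approximations and $f^\star$ for batch $s$ is assumed bounded as:

$$\left\| \hat{f}^{\#}_{i,s} - f^\star \right\|_\infty \le \hat{\varepsilon}_{i,s}, \qquad \left\| \breve{f}^{\#}_{j,s} - f^\star \right\|_\infty \le \breve{\varepsilon}_{j,s}$$

After each batch $s$, we have the (nominal) GP models built on (nominal) observations of $\hat{f}^{\#}_{i,s}$ and $\breve{f}^{\#}_{j,s}$, which have the same variance as the (real) models (the posterior variance is independent of $y$) but different posterior means:

$$\hat{\mu}^{\#}_{i,s}(x) = \hat{\boldsymbol{\alpha}}^{\#T}_{i,s} \hat{\mathbf{k}}_{i,s}(x), \qquad \breve{\mu}^{\#}_{j,s}(x) = \breve{\boldsymbol{\alpha}}^{\#T}_{j,s} \breve{\mathbf{k}}_{j,s}(x)$$

We assume that test points are generated, either nominally (for humans) or directly (for AIs), EC-GP-UCB style (Bogunovic & Krause, 2021), from a sequence of interleaved $\beta_t$ sequences by the relevant human $i = t - \hat{\underline{t}}$ if $i \in \hat{\mathbb{M}}$ or AI $j = t - \breve{\underline{t}}$ if $j \in \breve{\mathbb{M}}$ in batch $s_t$, so:

$$\begin{aligned}
\beta_t &= \begin{cases} \hat{\beta}_{i,t} & \text{if } i = t - \hat{\underline{t}} \in \hat{\mathbb{M}} \\ \breve{\beta}_{j,t} & \text{if } j = t - \breve{\underline{t}} \in \breve{\mathbb{M}} \end{cases} &
\epsilon_t &= \begin{cases} \hat{\varepsilon}_{i,t} & \text{if } i = t - \hat{\underline{t}} \in \hat{\mathbb{M}} \\ \breve{\varepsilon}_{j,t} & \text{if } j = t - \breve{\underline{t}} \in \breve{\mathbb{M}} \end{cases} \\
\mu_{t-1}(x) &= \begin{cases} \hat{\mu}_{i,s_t-1}(x) & \text{if } i = t - \hat{\underline{t}} \in \hat{\mathbb{M}} \\ \breve{\mu}_{j,s_t-1}(x) & \text{if } j = t - \breve{\underline{t}} \in \breve{\mathbb{M}} \end{cases} &
\sigma_{t-1}(x) &= \begin{cases} \hat{\sigma}_{i,s_t-1}(x) & \text{if } i = t - \hat{\underline{t}} \in \hat{\mathbb{M}} \\ \breve{\sigma}_{j,s_t-1}(x) & \text{if } j = t - \breve{\underline{t}} \in \breve{\mathbb{M}} \end{cases}
\end{aligned}$$

using the acquisition function:

$$\alpha_t(x) = \mu_{t-1}(x) + \left( \sqrt{\beta_t} + \frac{\epsilon_t}{\sigma} \sqrt{t} \right) \sigma_{t-1}(x)$$

Generally we cannot control the human's exploitation/exploration trade-off sequence $\hat{\beta}_{i,t}$, but we assume humans are conservative, so the sequence may be assumed small. We use the AI trade-off sequence $\breve{\beta}_{j,t}$, which we do control, to compensate for the conservative tendencies of the humans involved. We do, however, assume that the humans include at least some exploration in their decisions, on the understanding that their knowledge is not in fact perfect, so $\hat{\beta}_{i,t} \ge 0$ (note that a human who over-estimates their abilities may fail to meet this requirement, so care is required to avoid this).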
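To make the role of the trade-off parameter concrete, the acquisition function as displayed above can be evaluated over a finite candidate set. This is a sketch only: the names `ec_ucb_acquisition` and `next_point` are hypothetical, the candidate set is assumed finite, and `sigma` stands for the noise parameter $\sigma$ in the formula.

```python
import numpy as np


def ec_ucb_acquisition(mean, sd, beta_t, eps_t, t, sigma):
    """alpha_t(x) = mu(x) + (sqrt(beta_t) + (eps_t / sigma) * sqrt(t)) * sd(x)."""
    mean = np.asarray(mean, dtype=float)
    sd = np.asarray(sd, dtype=float)
    return mean + (np.sqrt(beta_t) + eps_t * np.sqrt(t) / sigma) * sd


def next_point(candidates, mean, sd, beta_t, eps_t, t, sigma):
    """Pick the candidate that maximises the acquisition function."""
    scores = ec_ucb_acquisition(mean, sd, beta_t, eps_t, t, sigma)
    return candidates[int(np.argmax(scores))]
```

With a small `beta_t` (a conservative, human-like setting) the choice is driven by the posterior mean; with a large `beta_t` (an AI-like setting) the same model favours the most uncertain candidate.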
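For reasons which will become apparent, we assume that:

$$\hat{\beta}_{i,\bar{t}_s} \in \left[ \hat{\zeta}_{i,s\downarrow} \hat{\chi}_{i,s},\ \hat{\zeta}_{i,s\uparrow} \hat{\chi}_{i,s} \right], \qquad \breve{\beta}_{j,\bar{t}_s} \in \left[ \breve{\zeta}_{j,s\downarrow} \breve{\chi}_{j,s},\ \breve{\zeta}_{j,s\uparrow} \breve{\chi}_{j,s} \right]$$

where $\hat{\zeta}_{i,s\downarrow} \le 1$, which we will see makes the humans over-exploitative, and $\breve{\zeta}_{j,s\downarrow} \ge 1$, which we will see makes the AIs over-explorative, and:

$$\hat{\chi}_{i,s} = \left( \frac{\Sigma}{\sqrt{\sigma}} \sqrt{2 \ln \frac{1}{\delta} + 1 + \hat{\gamma}_{i,s}} + \left\| \hat{f}^{\#}_{i,s} \right\|_{\hat{\mathcal{H}}_{i,s}} \right)^2, \qquad \breve{\chi}_{j,s} = \left( \frac{\Sigma}{\sqrt{\sigma}} \sqrt{2 \ln \frac{1}{\delta} + 1 + \breve{\gamma}_{j,s}} + \left\| \breve{f}^{\#}_{j,s} \right\|_{\breve{\mathcal{H}}_{j,s}} \right)^2$$

where $\hat{\gamma}_{i,s}$ and $\breve{\gamma}_{j,s}$ are, respectively, the max-information-gain terms for humans $i \in \hat{\mathbb{M}}$ and AIs $j \in \breve{\mathbb{M}}$, as will be described shortly.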

A.2 NOTES ON MAXIMUM INFORMATION GAIN

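We are assuming $p$ GPs, each of which will have a different information gain that is a function of its kernel. All have the same dataset and are updated at the batch boundary. Thus, after $S$ batches, if we consider human $i \in \hat{\mathbb{M}}$:

$$\begin{aligned}
\hat{I}_i(\mathbf{y}_{Sp} : \mathbf{f}_{Sp}) &= H(\mathbf{y}_{Sp}) - \tfrac{1}{2} \ln \left| 2\pi e \sigma^2 \mathbf{I}_{Sp} \right| \\
&= H(\mathbf{y}_{Sp-1}) + H(y_{Sp} \mid \mathbf{y}_{(S-1)p}) - \tfrac{1}{2} \ln \left| 2\pi e \sigma^2 \mathbf{I}_{Sp} \right| \\
&= H(\mathbf{y}_{Sp-2}) + H(y_{Sp} \mid \mathbf{y}_{(S-1)p}) + H(y_{Sp-1} \mid \mathbf{y}_{(S-1)p}) - \tfrac{1}{2} \ln \left| 2\pi e \sigma^2 \mathbf{I}_{Sp} \right| \\
&= \ldots \\
&= H(\mathbf{y}_{(S-1)p}) + \sum_{t \in \bar{\mathbb{T}}_S} H(y_t \mid \mathbf{y}_{t-1}) - \tfrac{1}{2} \ln \left| 2\pi e \sigma^2 \mathbf{I}_{Sp} \right| \\
&= H(\mathbf{y}_{(S-2)p}) + \sum_{t \in \bar{\mathbb{T}}_S} H(y_t \mid \mathbf{y}_{t-1}) + \sum_{t \in \bar{\mathbb{T}}_{S-1}} H(y_t \mid \mathbf{y}_{t-1}) - \tfrac{1}{2} \ln \left| 2\pi e \sigma^2 \mathbf{I}_{Sp} \right| \\
&= \ldots \\
&= \sum_{t \in \bar{\mathbb{T}}_{\le S}} H(y_t \mid \mathbf{y}_{t-1}) - \tfrac{1}{2} \ln \left| 2\pi e \sigma^2 \mathbf{I}_{Sp} \right| \\
&= \tfrac{1}{2} \sum_{t \in \bar{\mathbb{T}}_{\le S}} \ln \left( 2\pi e \sigma^2 \left( 1 + \sigma^{-2} \hat{\sigma}^2_{i,t-1}(x_t) \right) \right) - \tfrac{1}{2} \ln \left| 2\pi e \sigma^2 \mathbf{I}_{Sp} \right| \\
&= \tfrac{1}{2} \sum_{t \in \bar{\mathbb{T}}_{\le S}} \ln \left( 1 + \sigma^{-2} \hat{\sigma}^2_{i,t-1}(x_t) \right)
\end{aligned}$$

where we have used that $x_1, \ldots, x_{sp}$ are deterministic conditioned on $\mathbf{y}_{(s-1)p}$, and that the variances do not depend on $y$. The derivation for AI $j$ is essentially identical. In summary, therefore, the information gain is:

$$\begin{aligned}
\hat{I}_i(\mathbf{y}_{Sp} : \mathbf{f}_{Sp}) &= \tfrac{1}{2} \sum_{t \in \bar{\mathbb{T}}_{\le S}} \ln \left( 1 + \sigma^{-2} \hat{\sigma}^2_{i,t-1}(x_t) \right) \le \hat{\gamma}_{i,S} \\
\breve{I}_j(\mathbf{y}_{Sp} : \mathbf{f}_{Sp}) &= \tfrac{1}{2} \sum_{t \in \bar{\mathbb{T}}_{\le S}} \ln \left( 1 + \sigma^{-2} \breve{\sigma}^2_{j,t-1}(x_t) \right) \le \breve{\gamma}_{j,S}
\end{aligned}$$

where $\hat{\gamma}_{i,S}$ and $\breve{\gamma}_{j,S}$ are, respectively, the maximum information gains for human $i$ and AI $j$ over $S$ batches.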
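The final sum-of-logs form can be checked numerically against the familiar log-determinant expression $\tfrac{1}{2} \ln |\mathbf{I} + \sigma^{-2} \mathbf{K}|$. The sketch below assumes the fully sequential case, in which the posterior variance is updated after every experiment, together with a squared-exponential kernel; the function names and `noise_var` (for $\sigma^2$) are hypothetical.

```python
import numpy as np


def se_kernel(A: np.ndarray, B: np.ndarray, lengthscale: float = 1.0) -> np.ndarray:
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)


def information_gain_sequential(X, noise_var, kernel=se_kernel):
    """0.5 * sum_t ln(1 + var_{t-1}(x_t) / noise_var), updating after each point."""
    total = 0.0
    for t in range(len(X)):
        xt = X[t:t + 1]
        var = kernel(xt, xt)[0, 0]         # prior variance at x_t
        if t > 0:
            # Condition on the t points observed so far.
            K = kernel(X[:t], X[:t]) + noise_var * np.eye(t)
            k = kernel(X[:t], xt)[:, 0]
            var -= k @ np.linalg.solve(K, k)
        total += 0.5 * np.log1p(var / noise_var)
    return total


def information_gain_logdet(X, noise_var, kernel=se_kernel):
    """0.5 * ln det(I + K / noise_var) for the same set of points."""
    K = kernel(X, X)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_var)
    return 0.5 * logdet
```

Both functions depend only on the query locations, never on the observed values, mirroring the remark that the variances do not depend on $y$.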

A.3 NOTES ON THE HUMAN MODEL

We posit that every human $i \in \hat{\mathbb{M}}$ has an evolving model of the system:

$$\hat{f}_{i,s}(x) = \hat{g}_{i,s}\left( \hat{\mathbf{p}}_{i,s}(x) \right)$$

where $\hat{g}_{i,s}$ is in some sense "simple" and $\hat{\mathbf{p}}_{i,s} : \mathbb{R}^n \to \mathbb{R}^{\hat{m}_{i,s}}$. This fits into the above scheme if we let the form of $\hat{g}_{i,s}$ dictate the kernel $\hat{K}_{i,s}$. For example, if we know that $\hat{g}_{i,s}$ is linear, i.e., the human is known to be using some heuristic model of the form:

$$\hat{f}_{i,s}(x) = \hat{\mathbf{w}}_{i,s}^{T} \hat{\mathbf{p}}_{i,s}(x)$$

then we can use a GP with a linear-derived kernel:

$$\hat{K}_{i,s}(x, x') = \hat{\mathbf{p}}_{i,s}^{T}(x)\, \hat{\mathbf{p}}_{i,s}(x')$$

Similarly, if the human is using a model $\hat{g}_{i,s}$ that can be captured by a $d$th-order polynomial model:

$$\hat{f}_{i,s}(x) = \hat{\mathbf{w}}_{i,s}^{T} \left[ \binom{d}{q} \hat{\mathbf{p}}_{i,s}^{\otimes q}(x) \right]_{q \in \mathbb{N}_{d+1}}$$

then we can use a GP with a polynomial-derived kernel:

$$\hat{K}_{i,s}(x, x') = \left( 1 + \hat{\mathbf{p}}_{i,s}^{T}(x)\, \hat{\mathbf{p}}_{i,s}(x') \right)^{d}$$

Alternatively, if $\hat{g}_{i,s}$ is more vague (i.e., the researcher knows that the factors $\hat{\mathbf{p}}_{i,s}$ are important but not the exact form of the relationship) then we might use a GP assuming a distance-based model:

$$\hat{K}_{i,s}(x, x') = \exp\left( -\frac{1}{2l} \left\| \hat{\mathbf{p}}_{i,s}(x) - \hat{\mathbf{p}}_{i,s}(x') \right\|_2^2 \right)$$

or some similarly generic model that captures the worst-case behaviour of the human along, hopefully, with some insight into their thought processes. With regard to maximum information gain, because the human models are evolving, $\hat{m}_{i,s}$ will change with $s$, so it is convenient to define $\hat{m}_{i,S\uparrow} = \max_{s \in \mathbb{N}_S + 1} \hat{m}_{i,s}$ to capture the worst-case feature-space dimensionality over $S$ batches. We can then bound the maximum information gain for the human as the worst case of these models over all $S$ batches. So, for example, depending on the specifics of $\hat{g}_{i,s}$, we have (Srinivas et al., 2012; Scetbon & Harchaoui, 2021):

$$\begin{aligned}
\text{Linear:} \quad & \hat{\gamma}_{i,S} = O\left( \ln(Sp) \right) \\
\text{Polynomial:} \quad & \hat{\gamma}_{i,S} = O\left( \ln(Sp) \right) \\
\text{Squared-Exponential:} \quad & \hat{\gamma}_{i,S} = O\left( \ln^{\hat{m}_{i,S\uparrow}+1}(Sp) \right)
\end{aligned}$$

In general we assume that the human maximum information gain is asymptotically better behaved (grows more slowly) than that of the machine models.
This makes intuitive sense if the human is applying a linear or polynomial heuristic model $\hat{g}_{i,s}$, which is captured by a linear or polynomial kernel, while the machines use more general GP models with SE or Matérn type kernels (as is common practice). In this case the human has a better-behaved maximum information gain at the cost of a potentially non-zero gap $\hat{\varepsilon}_{i,s}$ (presumably trending to 0 as the human gains improved insight into the problem and evolves their model to match it better), while the machine has a worse-behaved maximum information gain but zero gap (assuming a universal kernel such as an SE kernel).
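The three feature-derived kernels described in this section can be sketched directly in NumPy. As an illustrative assumption, the feature map below uses the Matyas-2D high-level features that the paper lists for its simulated expert ($x_1^2$, $x_2^2$, $x_1 x_2$); the function names are hypothetical.

```python
import numpy as np


def matyas_features(X: np.ndarray) -> np.ndarray:
    """Illustrative feature map p(x) = (x1^2, x2^2, x1*x2) applied to the rows of X."""
    return np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])


def linear_kernel(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """K(x, x') = p(x)^T p(x') for a linear heuristic model."""
    return P @ Q.T


def polynomial_kernel(P: np.ndarray, Q: np.ndarray, d: int) -> np.ndarray:
    """K(x, x') = (1 + p(x)^T p(x'))^d for a d-th order polynomial model."""
    return (1.0 + P @ Q.T) ** d


def se_feature_kernel(P: np.ndarray, Q: np.ndarray, l: float = 1.0) -> np.ndarray:
    """K(x, x') = exp(-||p(x) - p(x')||^2 / (2 l)) when only the features are known."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * l))
```

The linear and polynomial choices give finite-dimensional feature spaces, which is why their information gain grows only logarithmically, whereas the squared-exponential choice is more flexible at the cost of a faster-growing information gain.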

A.4 MATHEMATICAL PRELIMINARIES

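We use $p$-norms extensively, where:

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}} \ \text{where } x \in \mathbb{R}^n, \qquad \|f\|_p = \left( \int_x |f(x)|^p \, dx \right)^{\frac{1}{p}} \ \text{where } f : \mathbb{R}^n \to \mathbb{R}$$

with the extensions:

$$\begin{aligned}
\|x\|_\infty &= \max_i |x_i|, & \|x\|_{-\infty} &= \min_i |x_i| \\
\|f\|_\infty &= \inf \{ C \ge 0 : |f(x)| \le C \text{ for almost all } x \in \mathbb{R}^n \}, & \|f\|_{-\infty} &= \sup \{ C \ge 0 : |f(x)| \ge C \text{ for almost all } x \in \mathbb{R}^n \}
\end{aligned}$$

This is only a norm for $p \in [1, \infty]$ (though we may occasionally refer to it as such in a loose sense), but is well defined for $p \in [-\infty, 0) \cup (0, \infty]$. It is not difficult to see that:

$$\begin{aligned}
\|x\|_p &\ge \|x\|_{p'} && \forall p, p' \in (0, \infty],\ p \le p' \\
\|x\|_p &\ge \|x\|_{p'} && \forall p, p' \in [-\infty, 0),\ p \le p' \\
\|x\|_p &\ge \|x\|_{p'} && \forall p \in (0, \infty],\ p' \in [-\infty, 0)
\end{aligned}$$

(the final inequality follows from the first two, noting that $\min_i |x_i| \le \max_i |x_i|$, and similarly for functions); and furthermore:

$$\|x\|_p \le n^{\frac{1}{p} - \frac{1}{p'}} \|x\|_{p'} \qquad \forall\, 1 \le p \le p' \le \infty$$

when $x \in \mathbb{R}^n$. We also use the following inequalities ($\odot$ is the Hadamard product). The generalised Hölder inequality:

$$\begin{aligned}
\|x \odot x' \odot \ldots \odot x''''\|_r &\le \|x\|_p \|x'\|_{p'} \ldots \|x''''\|_{p''''} \\
\|f \odot f' \odot \ldots \odot f''''\|_r &\le \|f\|_p \|f'\|_{p'} \ldots \|f''''\|_{p''''}
\end{aligned}$$

and the reverse Hölder inequality:

$$\begin{aligned}
\|x \odot x'\|_1 &\ge \|x\|_{\frac{1}{q}} \|x'\|_{-\frac{1}{q-1}} \\
\|f \odot f'\|_1 &\ge \|f\|_{\frac{1}{q}} \|f'\|_{-\frac{1}{q-1}}
\end{aligned}$$

where $\frac{1}{p} + \frac{1}{p'} + \ldots + \frac{1}{p''''} = \frac{1}{r}$ and $r, p, p', \ldots, p'''' \in (0, \infty]$, with the convention $\frac{1}{\infty} = 0$; and $q \in (1, \infty)$. We also use the generalised (power) mean (Pečarić, 1991; Pyun, 1974), defined as:

$$M_\theta(\{a_1, a_2, \ldots, a_n\}) = \begin{cases}
\min_i a_i & \text{if } \theta = -\infty \\
\left( \frac{1}{n} \sum_i a_i^\theta \right)^{\frac{1}{\theta}} & \text{if } \theta \in (-\infty, 0) \cup (0, \infty) \\
\left( \prod_i a_i \right)^{\frac{1}{n}} & \text{if } \theta = 0 \\
\max_i a_i & \text{if } \theta = \infty
\end{cases}$$

The following result is central to our proof:

Lemma 2. Let $a, b \in \mathbb{R}^n_+$, $z \in [-\infty, \infty]$, $q, q' \in [1, \infty]$, $\frac{1}{q} + \frac{1}{q'} = 1$, $\frac{1}{\infty} = 0$. Then we have the following bounds on $M_{-\infty}(a \odot b)$:

$$\begin{aligned}
M_{-\infty}(a \odot b) &\le M_{-z}(a)\, M_{z}(b) \\
M_{-\infty}(a \odot b) &\le M_{|zq|}(a)\, M_{|zq'|}(b) \\
M_{-\infty}(a \odot b) &\le M_{-|z|}(a)\, M_{\frac{q}{q-1}|z|}(b)
\end{aligned}$$

Proof. Let us first suppose that $z \in (-\infty, 0) \cup (0, \infty)$. Then, using the generalised mean inequality:

$$M_{-\infty}(a \odot b) \le M_z(a \odot b) = n^{-\frac{1}{z}} \|a \odot b\|_z = n^{-\frac{1}{z}} \left\| a^{\odot z} \odot b^{\odot z} \right\|_1^{\frac{1}{z}}$$

If $z > 0$ then, using Hölder's inequality:

$$\begin{aligned}
M_{-\infty}(a \odot b) &\le n^{-\frac{1}{|z|}} \left\| a^{\odot |z|} \odot b^{\odot |z|} \right\|_1^{\frac{1}{|z|}} \\
&\le n^{-\frac{1}{|z|}} \left\| a^{\odot |z|} \right\|_q^{\frac{1}{|z|}} \left\| b^{\odot |z|} \right\|_{q'}^{\frac{1}{|z|}} \\
&= n^{-\frac{1}{|z|}} \left\| a^{\odot |zq|} \right\|_1^{\frac{1}{|zq|}} \left\| b^{\odot |zq'|} \right\|_1^{\frac{1}{|zq'|}} \\
&= n^{-\frac{1}{|z|}} n^{\frac{1}{|zq|}} n^{\frac{1}{|zq'|}} \left\| \tfrac{1}{n} a^{\odot |zq|} \right\|_1^{\frac{1}{|zq|}} \left\| \tfrac{1}{n} b^{\odot |zq'|} \right\|_1^{\frac{1}{|zq'|}} \\
&= n^{-\frac{1}{|zq|}} \|a\|_{|zq|}\; n^{-\frac{1}{|zq'|}} \|b\|_{|zq'|} \\
&= M_{|zq|}(a)\, M_{|zq'|}(b)
\end{aligned}$$

If $z < 0$ then, using the reverse Hölder inequality (noting the negative exponent, which reverses the direction of the inequality):

$$\begin{aligned}
M_{-\infty}(a \odot b) &\le n^{\frac{1}{|z|}} \left\| a^{\odot -|z|} \odot b^{\odot -|z|} \right\|_1^{-\frac{1}{|z|}} \\
&\le n^{\frac{1}{|z|}} \left\| a^{\odot -|z|} \right\|_{\frac{1}{q}}^{-\frac{1}{|z|}} \left\| b^{\odot -|z|} \right\|_{-\frac{1}{q-1}}^{-\frac{1}{|z|}} \\
&= n^{\frac{1}{|z|}} \left\| a^{\odot -\frac{|z|}{q}} \right\|_1^{-\frac{q}{|z|}} \left\| b^{\odot \frac{|z|}{q-1}} \right\|_1^{\frac{q-1}{|z|}} \\
&= n^{\frac{1}{|z|}} n^{-\frac{q}{|z|}} n^{\frac{q-1}{|z|}} \left\| \tfrac{1}{n} a^{\odot -\frac{|z|}{q}} \right\|_1^{-\frac{q}{|z|}} \left\| \tfrac{1}{n} b^{\odot \frac{|z|}{q-1}} \right\|_1^{\frac{q-1}{|z|}} \\
&= n^{\frac{q}{|z|}} \|a\|_{-\frac{|z|}{q}}\; n^{-\frac{q-1}{|z|}} \|b\|_{\frac{|z|}{q-1}} \\
&= M_{-\frac{|z|}{q}}(a)\, M_{\frac{|z|}{q-1}}(b)
\end{aligned}$$

It is instructive to let $r = |z|/q$. Then the most recent bound becomes:

$$M_{-\infty}(a \odot b) \le M_{-r}(a)\, M_{\frac{q}{q-1} r}(b)$$

which in the limit $q \to \infty$ simplifies to:

$$M_{-\infty}(a \odot b) \le M_{-r}(a)\, M_{r}(b)$$

Finally, for $z = 0$, using the definitions:

$$M_{-\infty}(a \odot b) \le M_0(a \odot b) = M_0(a)\, M_0(b)$$

completing the proof.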
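The bounds of Lemma 2 are easy to sanity-check numerically. The sketch below uses the standard definition of the power mean for strictly positive inputs; `power_mean` and `lemma2_bounds_hold` are hypothetical helper names, and the check is an illustration on sampled vectors, not a substitute for the proof.

```python
import numpy as np


def power_mean(a, theta: float) -> float:
    """Generalised (power) mean M_theta of a strictly positive vector a."""
    a = np.asarray(a, dtype=float)
    if theta == -np.inf:
        return float(a.min())
    if theta == np.inf:
        return float(a.max())
    if theta == 0:
        return float(np.exp(np.mean(np.log(a))))   # geometric mean
    return float(np.mean(a ** theta) ** (1.0 / theta))


def lemma2_bounds_hold(a, b, z: float, q: float = 2.0, tol: float = 1e-9) -> bool:
    """Check the three bounds on M_{-inf}(a * b) for given z != 0 and q > 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    q_conj = q / (q - 1.0)                          # Hoelder conjugate of q
    lhs = power_mean(a * b, -np.inf)
    bounds = [
        power_mean(a, -z) * power_mean(b, z),
        power_mean(a, abs(z * q)) * power_mean(b, abs(z * q_conj)),
        power_mean(a, -abs(z)) * power_mean(b, q_conj * abs(z)),
    ]
    return all(lhs <= bound + tol for bound in bounds)
```

The first bound can be tight: for `a = [1, 4]`, `b = [4, 1]` and `z = 1`, the minimum product is 4 and the harmonic mean of `a` times the arithmetic mean of `b` is also 4 (up to floating-point round-off).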

A.5 BOGUNOVIC'S LEMMA

We have the following from Bogunovic & Krause (2021): µ # t-1 (x) -µ t-1 (x) ≤ ϵt σ √ tσ t-1 (x) This follows from (Bogunovic & 2021 , Lemma 2) using that all models satisfy this result, and that the model is posterior on the observations up to and including the previous batch. It follows from this and the definitions that, for all x ∈ X: Lemma 3. Let δ ∈ (0, 1). Assume noise variables are Σ-sub-Gaussian. Let: μi,s-1 (x) -f ⋆ (x) ≤ μ# i,s-1 (x) - f # i,s-1 (x) + εi,t s σ √ t s σi,s-1 (x) + εi,t s μj,s-1 (x) -f ⋆ (x) ≤ μ# j,s-1 (x) - f # j,s-1 (x) + εj,t s σ √ t s σj,s-1 (x) + εj,t s χi,s = Σ √ σ 2 ln 1 δ + 1 + γi,s + f # i,s Ĥi,s 2 χj,s = Σ √ σ 2 ln 1 δ + 1 + γj,s + f # j,s Hj,s 2 Then, for all i ∈ M and j ∈ M: Pr ∀s, ∀x ∈ X, μi,s-1 (x) -f ⋆ (x) ≤ . . . χi,s + εi,t s σ √ t s σi,s-1 (x) + εi,t s ≥ 1 -δ Pr ∀s, ∀x ∈ X, μj,s-1 (x) -f ⋆ (x) ≤ . . . χj,s + εj,t s σ √ t s σj,s-1 (x) + εj,t s ≥ 1 -δ Proof. We start with (Chowdhury & Gopalan, 2017, Theorem 2). This states that, in our setting, with probability ≥ 1 -δ, simultaneously for all s ≥ 1 and x ∈ D: μ# i,s-1 (x) - f # i,s-1 (x) ≤ σ-1 i,s-1 (x) χi,s μ# j,s-1 (x) - f # j,s-1 (x) ≤ σ-1 j,s-1 (x) χj,s for χi,s , χj,s as specified by ( 11). Next, recall (10): μi,s-1 (x) -f ⋆ (x) ≤ μ# i,s-1 (x) - f # i,s-1 (x) + εi,s σ √ t s σi,s-1 (x) + εi,s μj,s-1 (x) -f ⋆ (x) ≤ μ# j,s-1 (x) - f # j,s-1 (x) + εj,s σ √ t s σj,s-1 (x) + εj,s and the result follows. Remark: Alternatively, following (Fiedler et al., 2021 , Theorem 1) we can use: χi,s ≥ Σ √ σ 2 ln 1 δ + ln det Ki,s-1 + σ2 I + f # i,s Ĥi,s 2 χj,s ≥ Σ √ σ 2 ln 1 δ + ln det Kj,s-1 + σ2 I + f # j,s Hj,s Using the uncertainty bound above, we obtain our first bound on instantaneous experiment-wise regret based on (Srinivas et al., 2012 where: χi,s = Σ √ σ 2 ln 1 δ + 1 + γi,s + f # i,s Ĥi,s 2 χj,s = Σ √ σ 2 ln 1 δ + 1 + γj,s + f # j,s Hj,s Then, simultaneously for all s ≥ 1, the instantaneous regret is bounded as: Proof. 
By the pretext and Lemma 3: r ts +i ≤ ri,s↑ + 1 -ζi, f ⋆ (x ⋆ ) ≤ μi,s-1 (x ⋆ ) + χi,s + εi,t s σ √ t s σi,s-1 (x ⋆ ) + εi,t s with probability ≥ 1 -δ. By definition x t maximises the acquisition function, so: μi,s-1 (x ⋆ ) ≤ μi,s-1 x ts +i + βi,t s + εi,t s σ √ t s σi,s-1 x ts +i - βi,t s + εi,t s σ √ t s σi,s-1 (x ⋆ ) and hence: f ⋆ (x ⋆ ) -f ⋆ x ts +i ≤ μi,}s-1 x ts +i + βi,t s + εi,t s σ √ t s σi,s-1 x ts +i + εi,t s . . . + χi,s -βi,t s σi,s-1 (x ⋆ ) -f ⋆ x ts +i It follows that the instantaneous regret is bounded by: r ts +i ≤ μi,s-1 x ts +i + βi,t s + εi,t s σ √ t s σi,s-1 x ts +i + εi,t s . . . + χi,s -βi,t s σi,s-1 (x ⋆ ) -f ⋆ x ts +i So, using our assumptions on β we find that: r ts +i ≤ μi,s-1 x ts +i + ζi,s↑ χi,s + εi,t s σ √ t s σi,s-1 x ts +i + εi,t s . . . + 1 -ζi,s↓ χi,s σi,s-1 (x ⋆ ) -f ⋆ x ts +i Once again using our pretext and Lemma 3 we see that: f ⋆ x ts +i ≥ μi,s-1 x ts +i - χi,s + εi,t s σ √ t s σi,s-1 x ts +i -εi,t s Hence: r ts +i ≤ 2 1+ ζi,s↑ 2 χi,s + εi,t s σ √ t s σi,s-1 x ts +i . . . + 1 -ζi,s↓ χi,s σi,s-1 (x ⋆ ) + 2ε i,t s and the desired result follows from the definitions. The proof for AI regret follows by an analogous argument. To extend this to usable batchwise instantaneous regret bound we need to deal with the variance terms σi,s-1 (x ⋆ ), σi,s-1 (x ⋆ ) in the above theorem. To do this, in the following theorem we use the generalised (power) mean, and in particular lemma 2, to split the batchwise risk bound into a constraint term containing the free variances and a risk bound that depends only on the various parameters of the problem: Lemma 5. Fix δ > 0 and θs ∈ [0, ∞].foot_4 Assume noise variables are Σ-sub-Gaussian, and that: βi,t s ∈ ζi,s↓ χi,s , ζi,s↑ χi,s βj,t s = ζj,s↓ χj,s , ζj,s↑ χj,s where ζi,s↓ ≤ 1, ζj,s↓ ≥ 1, and: χi,s = Σ √ σ 2 ln 1 δ + 1 + γi,s + f # i,s Ĥi,s 2 χj,s = Σ √ σ 2 ln 1 δ + 1 + γj,s + f # j,s Hj,s 2 If: Mθ s       exp χi,s 1 -ζi,s↓ + σs-1↑ : i ∈ M . . . 
exp χj,s 1 -ζj,s↓ - σs-1↓ : j ∈ M       ≤ 1 Then, simultaneously for all s ≥ 1, the batchwise instantaneous regret is bounded as: rs ≤ ln M -θs (exp (r s↑ )) with probability ≥ 1 -δ, where: rs↑ =       2 1+ ζi,s↑ 2 χi,s + εi,t s σ √ t s σi,s-1 x ts +i + 2ε i,t s i∈ M 2 1+ ζj,s↑ 2 χj,s + εj,t s σ √ t s σj,s-1 x ts +j + 2ε j,t s i∈ M       and: σs-1↑ = max max i∈ M σi,s-1 (x ⋆ ) , max j∈ M σj,s-1 (x ⋆ ) σs-1↓ = min min i∈ M σi,s-1 (x ⋆ ) , min j∈ M σj,s-1 (x ⋆ ) and exp is applied elementwise. Proof. It is convenient to re-frame the batch-wise instantaneous regret in log-space, and extract the max variance upper bound. Recalling that ζi,s↓ ≤ 1, ζj,s↓ ≥ 1: rs ≤ min min i∈ M          2 1+ ζj,s↑ 2 χi,s + εi,t s σ √ t s σi,s-1 x ts +i + . . . 1 -ζi,s↓ χi,s σi,s-1 (x ⋆ ) + 2ε i,t s          , min j∈ M          2 1+ ζj,s↑ 2 χj,s + εj,t s σ √ t s σj,s-1 x ts +j + . . . 1 -ζj,s↓ χj,s σj,s-1 (x ⋆ ) + 2ε j,t s          ≤ ln min min i∈ M            exp 2 1+ ζi,s↑ 2 χi,s + εi,t s σ √ t s σi,s-1 x ts +i + 2ε i,t s . . . exp χi,s 1 -ζi,s↓ + σs-1↑            , min j∈ M            exp 2 1+ ζj,s↑ 2 χj,s + εj,t s σ √ t s σj,s-1 x ts +j + 2ε j,t s . . . exp χj,s 1 -ζj,s↓ - σs-1↓            which may be re-written: rs ≤ ln (M -∞ (a ⊙ b)) where: a = â ȃ ∈ R p + , b = b b ∈ R p + â = exp 2 1+ ζi,s↑ 2 χi,s + εi,t s σ √ t s σi,s-1 x ts +i + 2ε i,t s i∈ M b = exp χi,s 1 -ζi,s↓ + σs-1↑ i∈ M ȃ = exp 2 1+ ζj,s↑ 2 χj,s + εj,t s σ √ t s σj,s-1 x ts +j + 2ε j,t s j∈ M b = exp χj,s 1 -ζj,s↓ - σs-1↓ j∈ M So, by lemma 2, we have that: rs ≤ ln M -θs (a) Mθ s (b) = ln M -θs (a) + ln Mθ s (b) and by assumption Mθ s (b) ≤ 1, so: rs ≤ ln M -θs (a) and the result follows by the definition. The next step is to convert this bound into a bound on the total regret. To obtain such a bound we need some additional assumptions regarding "gap" parameters ϵ and the explorative/exploitative nature of the humans and AIs. 
We do this with the following theorem: Theorem 6. Fix δ > 0 and θs ∈ [0, ∞].foot_5 Assume noise variables are Σ-sub-Gaussian, and that: βi,t s ∈ ζi,s↓ χi,s , ζi,s↑ χi,s βj,t s = ζj,s↓ χj,s , ζj,s↑ χj,s where ζi,s↓ ≤ 1, ζj,s↓ ≥ 1, and: χi,s = Σ √ σ 2 ln 1 δ + 1 + γi,s + f # i,s Ĥi,s 2 χj,s = Σ √ σ 2 ln 1 δ + 1 + γj,s + f # j,s Hj,s and: Mθ s       exp χi,s 1 -ζi,s↓ + σs-1↑ : i ∈ M . . . exp χj,s 1 -ζj,s↓ - σs-1↓ : j ∈ M       ≤ 1 for all s, where: σs-1↑ = max max i∈ M σi,s-1 (x ⋆ ) , max j∈ M σj,s-1 (x ⋆ ) σs-1↓ = min min i∈ M σi,s-1 (x ⋆ ) , min j∈ M σj,s-1 (x ⋆ ) If εi,t s , εj,t s = O(s -q ) for some q ∈ ( 1 2 , ∞) then, with probability ≥ 1 -δ: Proof. Using the assumptions and Lemma 5, we have that, with probability ≥ 1 -δ, simultaneously for all s ≥ 1: rs ≤ ln (M -θs (exp (r s↑ ))) ≤ ln M -θs (exp (r s↑ )) where rs↑ ≥ 0. Using the definition of M -θs : 1 S RS ≤ ln M -θ↓ exp 8 σ2 χi,S + O (1) 2 Ĉi,2 1 S γi,S + O (1) i ∈ M . . .    8 σ2 s∈N S +1 rs ≤ s∈N S +1 ln M -θ↓ (exp (r s↑ )) = ln s∈N S +1 M -θ↓ (exp (r s↑ )) = ln s∈N S +1 1 p ∥exp (r s↑ )∥ -θ↓ -1 θ↓ = ln s∈N S +1 p S θ↓ (exp (r s↑ )) ⊙-θ↓ -1 θ↓ 1 = ln       s∈N S +1 p S θ↓   (exp (r s↑ )) ⊙-θ↓ S ⊙S 1 S 1   -S θ↓       = ln s∈N S +1 p S θ↓ (exp (r s↑ )) ⊙-θ↓ S -S θ↓ S Using the generalised Hölder inequality and the definition of M -θ↓ : s∈N S +1 rs ≤ ln s∈N S +1 p S θ↓ (exp (r s↑ )) ⊙-θ↓ S -S θ↓ S ≤ ln     s∈N S +1 p S θ↓ (exp (r s↑ )) ⊙-θ↓ S -S θ↓ 1     = ln            1 p s∈N S +1 exp (r s↑ ) ⊙-θ↓ S 1    -S θ↓         = ln M -θ↓ S s∈N S +1 exp (r s↑ ) and so, again recalling that rs↑ ≥ 0: s∈N S +1 rs ≤ ln M -θ↓ S s∈N S +1 exp (r s↑ ) = ln M -θ↓ S exp s∈N S +1 rs↑ Next, using that χ is increasing with s, and noting our restricted range in ζ, note that: r2 i,s↑ ≤ 4 χi,S + εi,t s σ √ t s σi,s-1 x ts +i + εi,t s 2 ≤ 8 max χi,S + εi,t s σ √ t s 2 σ2 i,s-1 x ts +i , ε2 i,t s ȓ2 j,s↑ ≤ 4 1+ ζj∼ 2 χj,S + εj,t s σ √ t s σj,s-1 x ts +j + εj,t s 2 ≤ 8 max 
   1+ ζj∼ 2 χj,S + εj,t s σ √ t s 2 σ2 j,s-1 x ts +j , ε2 j,t s    and so: s∈N S +1 r2 i,s↑ ≤ 8 s∈N S +1 max χi,S + εi,t s σ √ t s 2 σ2 i,s-1 x ts +i , ε2 i,t s ≤ 8 s∈N S +1 χi,S + εi,t s σ √ t s 2 σ2 i,s-1 x ts +i + 8 s∈N S +1 ε2 i,t s s∈N S +1 ȓ2 j,s↑ ≤ 8 s∈N S +1 max    1+ ζj∼ 2 χj,S + εj,t s σ √ t s 2 σ2 j,s-1 x ts +j , ε2 j,t s    ≤ 8 s∈N S +1 1+ ζj∼ 2 χj,S + εj,t s σ √ t s 2 σ2 j,s-1 x ts +j + 8 s∈N S +1 ε2 j,t s Recalling our assumption εi,t s , εj,t s = O(s -q ) for some q > 1 2 we find that, as s∈N 1 s 2 qi = ζ(2q i ) < ∞ (ζ here is the Reimann zeta function): s∈N S +1 r2 i,s↑ ≤ 8 s∈N S +1 χi,S + O (1) 2 σ2 i,s-1 x ts +i + O (1) s∈N S +1 ȓ2 j,s↑ ≤ 8 s∈N S +1 1+ ζj∼ 2 χj,S + O (1) 2 σ2 j,s-1 x ts +j + O (1) Now, by the standard procedure: s∈N S +1 σ-2 σ2 i,s-1 x t s +i ≤ Ĉi,2 s∈N S +1 ln 1 + σ-2 σ2 i,s-1 x t s +i ≤ Ĉi,2 t∈T ≤S ln 1 + σ-2 σ2 i,st-1 (x t ) ≤ Ĉi,2 γi,S s∈N S +1 σ-2 σ2 j,s-1 x t s +j ≤ Cj,2 s∈N S +1 ln 1 + σ-2 σ2 j,s-1 x t s +j ≤ Cj,2 t∈T ≤S ln 1 + σ-2 σ2 j,st-1 (x t ) ≤ Cj,2 γj,S where: Ĉi,2 = max s∈N S +1 σ-2 Ki,s↑ ln 1+ σ-2 Ki,s↑ Cj,2 = max s∈N S +1 σ-2 Kj,s↑ ln 1+ σ-2 Kj,s↑ and so: s∈N S +1 r2 i,s↑ ≤ 8 σ2 χi,S + O (1) 2 Ĉi,2 γi,S + O (1) s∈N S +1 ȓ2 j,s↑ ≤ 8 σ2 1+ ζj∼ 2 χj,S + O (1) 2 Cj,2 γj,S + O (1) Recalling that ∥ • ∥ 1 ≤ √ S∥ • ∥ 2 (in S dimensions): s∈N S +1 ri,s↑ ≤ √ S 8 σ2 χi,S + O (1) 2 Ĉi,2 γi,S + O (1) s∈N S +1 ȓj,s↑ ≤ √ S 18 σ2 1+ ζj∼ 2 χj,S + O (1) 2 Cj,2 γj,S + O (1) and so: s∈N S +1 ri,s↑ + s∈N S +1 ȓj,s↑ . . . ≤ ln M -θ↓ S exp √ S 8 σ2 O (1) χi,S 2 Ĉi,2 γi,S + O (1) i ∈ M . . .    √ S 8 σ2 O (1) + 1+ ζj∼ 2 χj,S 2 Cj,2 γj,S + O (1) j ∈ M   Finally, noting that, for a > 0: 1 S ln M -θ↓ S (exp (a)) = ln M 1 S -θ↓ S (exp (a)) = ln 1 p i (e ai ) -θ↓ S -1 θ↓ = ln 1 p i e 1 S ai -θ↓ -1 θ↓ = ln M -θ↓ exp 1 S a we obtain: 1 S s∈N S +1 ri,s↑ + s∈N S +1 ȓj,s↑ . . . ≤ ln M -θ↓ exp 1 √ S 8 σ2 O (1) + χi,S 2 Ĉi,2 γi,S + O (1) i ∈ M . . .    1 √ S 8 σ2 O (1) + 1+ ζj∼ 2 χj,S 2 Cj,2 γj,S + O (1) j ∈ M   as required. 
Finally, we consider the bounds on the posterior variance at x ⋆ . We have the following result: Lemma 7. Fix j ′ ∈ M. We have the bounds for all i ∈ M, j ∈ M: σ⋆ i,s↓ ≤ σi,s (x ⋆ ) ≤ σ⋆ i,s↑ σ⋆ j,s↓ ≤ σj,s (x ⋆ ) ≤ σ⋆ j,s↑ where: σ⋆2 i,s↓ = σ2 +2ps Di,s,↓ -ps 1 K⋆⋆ i,s D2 i,s,↓ σ2 +ps K⋆⋆ i,s K⋆⋆ i,s σ⋆2 j,s↓ = σ2 +2ps Dj,s,↓ -ps 1 K⋆⋆ j,s σ2 +2 t∈T ≤s Di,s,t- and:  1 K⋆⋆ i,s t∈T ≤s D2 i,s,t K⋆⋆ i,s = Ki,s (x ⋆ , x ⋆ ) , K⋆⋆ j,s = Kj,s (x ⋆ , x ⋆ ) Ki,s,t,⋆ = Ki,s (x ⋆ , x t ) , Kj,s,t,⋆ = Kj,s (x ⋆ , x t ) Ki,s,t,t = Ki,s (x t , x t ) , Kj,

Di,s,t

Proof. Using the definition of GP posterior variance and K bounds we can minimise the posterior variance within the constraints given as: σ2 i,s (x ⋆ ) ≥ K⋆⋆ i,s - K2 i,s,↑,⋆ 1 T ps Ki,s↑ 1 ps 1 T ps + σ2 I ps -1 1 ps = K⋆⋆ i,s 1 - K2 i,s,↑,⋆ K⋆⋆ i,s ps σ2 +ps Ki,s↑ = K⋆⋆ i,s 1 - ps K2 i,s,↑,⋆ σ2 K⋆⋆ i,s +ps Ki,s↑ K⋆⋆ i,s = K⋆⋆ i,s σ2 K⋆⋆ i,s +ps Ki,s↑ K⋆⋆ i,s -ps K2 i,s,↑,⋆ σ2 K⋆⋆ i,s +ps Ki,s↑ K⋆⋆ i,s = K⋆⋆ i,s σ2 K⋆⋆ i,s +ps Ki,s↑ K⋆⋆ i,s -ps K⋆⋆2 i,s +2ps K⋆⋆ i,s D i,s,↓ -psD 2 i,s,↓ σ2 K⋆⋆ i,s +ps Ki,s↑ K⋆⋆ i,s ≥ K⋆⋆ i,s σ2 +2psD i,s,↓ -ps 1 K⋆⋆ i,s D 2 i,s,↓ σ2 +ps Ki,s↑ and likewise for the AI posterior variances. For the upper bound, we may pessimise the bound by first using the maximum eigenvalue: σ2 i,s (x ⋆ ) ≤ K⋆⋆ i,s - t∈T ≤s K2 i,s,t,⋆ λmax Ki,s + σ2 and by Gershgorin's circle theorem λ max ( Ki,s ) ≤ t∈T ≤s Ki,s,ts,ts , so:  σ2 i,s (x ⋆ ) ≤ K⋆⋆ i,s - t∈T ≤s K2 i, K⋆⋆2 i,s K⋆⋆ i,s ≤ σ2 K⋆⋆ i,s +2 K⋆⋆ i,s t∈T ≤s Di,s,t-t∈T ≤s D2 i,s,t σ2 K⋆⋆ i,s +ps K⋆⋆2 i,s K⋆⋆ i,s = σ2 +2 t∈T ≤s Di,s,t- 1 K⋆⋆ i,s t∈T ≤s D2 i,s,t and likewise for AI variances. In this theorem the sequence Di,s,t = Ki,s (x ⋆ , x ⋆ )-| Ki,s (x ⋆ , x t )| ≥ 0 is a proxy for convergence, being minimised for x t = x ⋆ and increasing as x t becomes (a-posterior) less correlated with x ⋆ . If we consider the case considered in the paper -namely 1 human and 1 AI with zero gap and a trade-off sequence meeting the conditions of GP-UCB -then we know that the AI, operating alone, suffices to ensure convergence; and moreover adding additional observations (the human recommendations) will not prevent this. From here, it is not difficult to see that the upper and lower bounds in the above theorem converge not just to 0 but to one another, and if we further assume that K⋆⋆ i,s = K⋆⋆ j,s then the ratio of any upper bound on the posterior variance to the lower bound on any posterior variance will convergen to 1.

A.7 ADDITIONAL SYNTHETIC EXPERIMENTS

In addition to the experimental results mentioned in the main paper, we provide the optimisation performance of our proposed BO-Muse framework on the following synthetic benchmark functions: (a) Matyas-2D, and (b) Rastrigin-5D. The details of the additional synthetic optimisation benchmark functions are mentioned in Table 2 . We maintain the same experimental settings mentioned in the main paper for the additional experiments conducted. For a given d dimensional problem, we use d + 1 initial observations and optimise for 10 × d iterations i.e., the budget allocated for our synthetic experiments is set to 10 × d function evaluations. The simple regret plots obtained for Matyas-2D function and Rastrigin-5D function after 10 runs with random initialisations are shown in Figure 4a and Figure 4b , respectively.

A.8.1 VARYING DEGREES OF EXPLOITATION-EXPLORATION

We study the sensitivity of our proposed BO-Muse framework to the exploitation-exploration parameter ($\hat{\beta}_s$) and compare the resulting optimisation performance. We vary the exploitation-exploration parameter from $\hat{\beta}_s = 1$ to $\hat{\beta}_s = 1000$ to cover the whole spectrum from over-exploitative experts ($\hat{\beta}_s = 1$) to over-explorative experts ($\hat{\beta}_s = 1000$). We have tuned the Squared Exponential (SE) kernel hyper-parameters of the inherent GP surrogate models using maximum-likelihood estimation. The empirical results obtained for various synthetic functions are depicted in Figure 5. It is evident from the empirical results that BO-Muse teamed with an expert following more of an exploitation strategy ($\hat{\beta}_s = 1$) converges better than its counterpart teamed with a purely explorative expert ($\hat{\beta}_s = 1000$).

Table 2: Synthetic optimisation benchmark functions. Analytical forms are provided in the second column and the last column lists the features used by a simulated human expert.

| Function | $f^\star(x)$ | High-level features |
| --- | --- | --- |
| Matyas-2D | $0.26(x_1^2 + x_2^2) - 0.48\, x_1 x_2$ | $x'_1 = x_1^2$, $x'_2 = x_2^2$, $x'_3 = x_1 x_2$ |
| Rastrigin-5D | $10d + \sum_{i=1}^{d} \left[ x_i^2 - 10 \cos(2\pi x_i) \right]$ | $x'_i = x_i^2 \ \forall i \in \mathbb{N}_5$, $x'_{j+5} = \cos x_j \ \forall j \in \mathbb{N}_5$ |
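For reference, the two additional benchmark functions and the high-level features listed in Table 2 can be written out as follows. This is a sketch under the assumption that inputs are plain NumPy vectors; the function names and the `budget` helper (which encodes the $d + 1$ initial observations and $10 \times d$ evaluations stated in A.7) are hypothetical.

```python
import numpy as np


def matyas(x) -> float:
    """Matyas-2D: 0.26 (x1^2 + x2^2) - 0.48 x1 x2."""
    x1, x2 = np.asarray(x, dtype=float)
    return float(0.26 * (x1**2 + x2**2) - 0.48 * x1 * x2)


def matyas_features(x) -> np.ndarray:
    """High-level features for Matyas-2D: x1^2, x2^2, x1*x2."""
    x1, x2 = np.asarray(x, dtype=float)
    return np.array([x1**2, x2**2, x1 * x2])


def rastrigin(x) -> float:
    """Rastrigin: 10 d + sum_i [x_i^2 - 10 cos(2 pi x_i)]."""
    x = np.asarray(x, dtype=float)
    return float(10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x)))


def rastrigin_features(x) -> np.ndarray:
    """High-level features for Rastrigin: x_i^2 followed by cos(x_j)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x**2, np.cos(x)])


def budget(d: int) -> tuple[int, int]:
    """(number of initial observations, number of further evaluations) for dimension d."""
    return d + 1, 10 * d
```

Matyas-2D is exactly linear in its three features (weights 0.26, 0.26 and -0.48), so a simulated expert with a linear heuristic over those features has zero model gap on that function.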

A.8.2 BO-MUSE WITH EXPECTED IMPROVEMENT ACQUISITION FUNCTION
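Expected Improvement, the acquisition rule that this subsection substitutes for GP-UCB on the human side, has a standard closed form under a Gaussian posterior. A minimal sketch for a maximisation problem follows, assuming scalar inputs and the hypothetical names `mu` (posterior mean), `sd` (posterior standard deviation) and `best` (best value observed so far); it uses only the Python standard library.

```python
import math


def normal_pdf(z: float) -> float:
    """Probability density function of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sd: float, best: float) -> float:
    """EI = (mu - best) * Phi(Z) + sd * phi(Z), with Z = (mu - best) / sd; 0 if sd == 0."""
    if sd <= 0.0:
        return 0.0
    z = (mu - best) / sd
    return (mu - best) * normal_cdf(z) + sd * normal_pdf(z)
```

Unlike GP-UCB, this rule has no explicit trade-off parameter, so the balance between exploitation and exploration is fixed by the posterior itself.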

We have conducted an additional experiment to study the behaviour of our BO-Muse framework with a different acquisition function strategy for the human experts. The Expected Improvement (EI) acquisition function (Wilson et al., 2018) guides the search for optima by taking into account the expected improvement over the current best solution. If $f^\star(x^+)$ is the best value observed, then the next query point is obtained by maximising the EI acquisition function $\hat{a}^{EI}_s(x)$, given by:

$$\hat{a}^{EI}_s(x) = \begin{cases} \left( \hat{\mu}_s(x) - f^\star(x^+) \right) \Phi(Z) + \hat{\sigma}_s(x)\, \phi(Z) & \text{if } \hat{\sigma}_s(x) > 0 \\ 0 & \text{if } \hat{\sigma}_s(x) = 0 \end{cases} \qquad Z = \frac{\hat{\mu}_s(x) - f^\star(x^+)}{\hat{\sigma}_s(x)}$$

where $\Phi(Z)$ and $\phi(Z)$ represent the Cumulative Distribution Function (CDF) and the Probability Density Function (PDF) of the standard normal distribution, respectively. In this experiment, the GP-UCB acquisition function used by the human expert in the BO-Muse framework is replaced with the Expected Improvement acquisition function. We compare this new baseline (BO-Muse + Human (EI)) with BO-Muse + Human (GP-UCB) and all the other competing baselines. The empirical results obtained for the experiment with the EI acquisition function are depicted in Figure 6. As expected, BO-Muse with the EI acquisition function still outperforms the standard baselines considered. However, BO-Muse with the GP-UCB acquisition function performs better than its counterpart with the EI acquisition function.

A.9 ADDITIONAL DETAILS OF CLASSIFICATION EXPERIMENTS

We have considered two real-world classification tasks using Support Vector Machines (SVMs) and Random Forests (RFs). This experiment involves hyperparameter tuning of SVMs and RFs operating on the Biodeg dataset to classify biodegradable and non-biodegradable materials. We used the publicly available Biodeg dataset from the UCI data repository (Dua & Graff, 2017). The Biodeg dataset consists of 1056 instances with 41 features. We randomly split the dataset into 80/20 train/test splits.
Each time a hyperparameter set (design) is chosen, the model needs to be retrained and evaluated on a held-out set. The goal is to reach the hyperparameter set that leads to a classification model with the minimum test classification error. We have created two groups (arms) with 4 members (2 students and 2 postdoctoral researchers) randomly allocated to each group. Each expert in the first group teams up with the AI as per BO-Muse, while each expert in the second group (baseline) tunes the classifier completely on their own. For each of the classification tasks, the two groups use the same tuning budget (3 random initial designs + 30 further iterations). This real-world task is suitable for our case: (1) it is easier to find multiple human experts for this task, as AI graduate students and post-doctoral researchers have a good understanding of classification (SVM and RF) models and understand how their hyperparameters generally influence the model fitting; (2) this task is familiar to the ICLR and machine learning community. The simple graphical interface used by each participant is shown in Figure 7a.

A.10 SPACECRAFT SHIELDING DESIGN EXPERIMENT

Our third experiment is to team with an expert to design spacecraft shields that protect against impact by orbital debris particles.

A.10.1 EXPERIMENTAL PROBLEM

Here we consider the design of a two- or three-wall shield for protection against a cubic steel projectile impacting face-on, normal to the surface of the target plates, at an impact velocity of 7.0 km/s. No state-of-the-art solution exists for such an impact threat; however, for protection against spherical aluminium projectiles in this velocity domain the state-of-the-art solution would be a "stuffed Whipple shield" after Christiansen et al. (1995), consisting of an outer aluminium plate, inner layers of aramid and ceramic fabrics, and a rear wall (pressure hull) of aluminium. US, Japanese, and European modules on the ISS all utilise stuffed Whipple shield designs (Christiansen et al., 2009). The design space is shown schematically in Figure 8. Design variables include: (1) plate material: AA6061-T651 ("AL"), 4340 steel ("ST"), Kevlar/epoxy ("KE"), and ultra-high molecular weight polyethylene ("PE"); (2) plate thickness: 0.1 cm to 1.0 cm in 0.1 cm increments; (3) plate spacing, S: 0.0 cm to 10 cm in 1.0 cm increments; and (4) number of plates: 2 or 3 (i.e., the 'outer bumper' plate may or may not be used). Only metal plates (i.e., "AL" or "ST") may be used for the 3rd plate. The full factorial design space includes 577,365 options.

A.10.2 BACKGROUND

Spacecraft are subject to impact by natural micrometeoroid and man-made orbital debris particles, collectively referred to as space debris, during their orbital lifetime. The impact of such particles (typically at velocities above 10 km/s) is a significant risk to the safe operation of spacecraft and the fulfilment of mission objectives. Indeed, for manned spacecraft such as the International Space [...] Spacecraft debris shields are typically designed using a combination of semi-analytical equations, numerical simulations, and experimental testing. Simulations are performed in either explicit finite element solvers, e.g., ANSYS LS-DYNA, or shock physics solvers, e.g., CTH from Sandia National Laboratory.
Modelling hypervelocity impact in those simulation codes requires substantial expertise to accurately capture projectile and target kinematics together with material response. Furthermore, such simulations can be computationally expensive, requiring hundreds of CPU hours depending on the geometric discretisation of the model. Experimentation is typically performed on laboratory accelerators known as two-stage light gas guns. The number of such facilities that can perform experiments with millimetre- and centimetre-sized projectiles up to impact velocities of 7+ km/s is very limited (estimated to be < 20 globally). Such experiments are also expensive, costing on the order of thousands of dollars, with a low throughput of approximately 1 experiment per day. In the design of space debris shielding, in order to minimise the number of experiments and simulations required, space debris is typically simplified to spherical aluminium particles. In reality, of course, the debris environment consists of a range of materials, both metallic and non-metallic, the properties of which influence their impact lethality. Similarly, for robotic and manned spacecraft the majority of impact risk is represented by millimetre-sized objects, most of which are fragmentation debris generated by the catastrophic breakup of a satellite or rocket body and are thus highly irregular in shape; see Rivero et al. (2016). Until recently the engineering environment models used to predict mission risk from space debris impact have also simplified the debris population as spherical aluminium objects, so there was little incentive to introduce the added complexity of projectile shape and material effects in shielding design or characterisation studies.
However, recent improvements in orbital debris environment engineering models, e.g., ORDEM 3.0 (Krisko, 2014) , and planned improvements to debris population source models, e.g., via DebriSat (Rivero et al., 2016) , aim to address some of these deficiencies. Shield design and characterisation, therefore, must also begin to account for projectile shape and material effects.

A.10.3 DETAILS OF EXPERIMENT

The spacecraft shielding design experiment utilises synthetic data generated via numerical simulation. This section provides additional information on the simulation setup and evaluation.

Three initial designs are evaluated by the human expert, the details of which are provided in Table 6 together with the simulation results. Based on these results we assume that target designs with areal weights significantly less than 5 kg/m² are likely infeasible, while designs significantly heavier (> 15 kg/m²) are not of interest. Within this weight range the design space includes 577,365 potential options. We perform the optimisation in iterative batches of size 2, with one suggestion from the BO and one from the human expert.
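To make the design constraints and the areal-weight window concrete, the following is a minimal Python sketch of how a candidate shield could be checked against the variables listed in A.10.1 and the 5 to 15 kg/m² range. It is illustrative only: the function names, the hard thresholds, the convention that the last plate in the list is the rear wall (Plate 3), and in particular the nominal material densities are assumptions made for this example and are not taken from the paper or its material tables.

```python
from typing import List, Tuple

Plate = Tuple[str, float]  # (material code, thickness in cm)

# ASSUMPTION: nominal densities in g/cm^3, chosen for illustration only;
# they are not the values used in the paper's simulations.
DENSITY_G_CM3 = {"AL": 2.70, "ST": 7.85, "KE": 1.38, "PE": 0.97}
METALS = {"AL", "ST"}


def _on_grid(value: float, step: float, k_min: int, k_max: int) -> bool:
    """True if value equals k * step for some integer k in [k_min, k_max]."""
    k = round(value / step)
    return k_min <= k <= k_max and abs(value - k * step) < 1e-9


def is_valid_design(plates: List[Plate], spacings_cm: List[float]) -> bool:
    """Check a design against the variable ranges described in A.10.1.

    ASSUMPTION: plates are ordered front to back, so plates[-1] is the rear wall.
    """
    if len(plates) not in (2, 3) or len(spacings_cm) != len(plates) - 1:
        return False
    if any(material not in DENSITY_G_CM3 for material, _ in plates):
        return False
    if plates[-1][0] not in METALS:  # rear wall must be "AL" or "ST"
        return False
    if not all(_on_grid(t, 0.1, 1, 10) for _, t in plates):  # 0.1 to 1.0 cm
        return False
    return all(_on_grid(s, 1.0, 0, 10) for s in spacings_cm)  # 0 to 10 cm


def areal_weight_kg_m2(plates: List[Plate]) -> float:
    """Sum of density * thickness over plates; 1 g/cm^2 equals 10 kg/m^2."""
    return sum(10.0 * DENSITY_G_CM3[m] * t for m, t in plates)


def in_weight_range(plates: List[Plate], lo: float = 5.0, hi: float = 15.0) -> bool:
    """ASSUMPTION: hard thresholds stand in for the soft 'significantly' bounds."""
    return lo <= areal_weight_kg_m2(plates) <= hi
```

Under these assumed densities, a three-wall design of 0.1 cm AL, 0.2 cm KE and 0.2 cm AL has an areal weight of about 10.86 kg/m² and so falls inside the window. Because the densities and thresholds are placeholders, a count obtained by enumerating designs with this sketch should not be expected to reproduce the figure quoted above.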

A.10.4 DIFFERENCE TO A CLASSICAL CS SETTING

We work with one expert, and this differs from usual CS settings, where multiple experts perform the same task, because:

1. Level of expertise required is high, and access is difficult: The design of shields for protecting against space debris impact at hypervelocity is a highly specialised discipline typically limited to national space agencies or their primary contractors. For instance, shielding onboard the ISS was predominantly developed by NASA and Boeing for US modules, ROSCOSMOS and RKK Energia for Russian modules, and JAXA for the Japanese module. Shielding on the European Columbus module was designed primarily by Alenia Aerospazio under contract to the European Space Agency, but borrowed heavily from the NASA designs (see e.g., Destefanis et al. (1999)). Therefore, recruiting multiple experts is a formidable task.

2. The cost of experiments is high: Due to the cost and access limitations on experimental facilities, we utilise numerical simulations for this design study. Such simulations are difficult to design and validate: our expert (with 20 years' experience on such codes) required about 120 hours to build the simulation models. In addition, the simulations can be computationally expensive, requiring on the order of 100-200 CPU hours per simulation for a moderate CPU.

3. The design problem is hard: There exists no state-of-the-art solution for the defined problem. The nearest analogue is a shield designed for a spherical aluminium projectile, for which the state-of-the-art is a stuffed Whipple shield. This shielding configuration has been used to define our optimisation variables. Existing semi-analytical penetration laws, such as those in Christiansen et al. (1995), are not valid for application with non-spherical or non-aluminium projectiles.

4. The expert cannot repeat the same design task, as they learn during the first experiment.
All these factors mean that for a real experiment we can only show how BO-Muse helps a single expert on a new problem for which there is no state-of-the-art solution.

A.10.5 RESULTS

Results of the experiment are provided in Table 6. We can observe that the human expert initially explored designs similar to the stuffed Whipple concept (i.e., metallic outer bumper, KE or PE inner bumper, and metallic rear wall, with the inner bumper located roughly at the mid-point between the two metallic plates). By design ID 9 the expert had identified a stuffed Whipple design that was successful at defeating the projectile, with an areal weight of 14.3 kg/m². Between design IDs 9 and 19 we can observe the expert exploiting this successful design to identify a lower weight solution, without success. Up to this point the expert does not seem to have been influenced by the BO suggestions. By design ID 21, however, we can observe the expert beginning to exploit BO suggestions, with design ID 21 a modification of IDs 5 and 14, design ID 23 a modification of ID 18, and so on. Design ID 27, an expert suggestion that is an exploitation of the BO-suggested ID 22, was successful at defeating the projectile, albeit at a higher weight than the previously identified solutions at ID 9 and 11 (16.9 kg/m²). The final expert design, ID 29, is a further exploitation of ID 27 intended to reduce weight and is found to be the best solution identified during the experiment, with an areal weight of 13.8 kg/m². The BO-Muse design ID 22 and subsequent human exploitations (design IDs 27 and 29) are, according to our human expert, highly unusual configurations for spacecraft debris shielding that they would not have otherwise considered if not for the BO suggestion.
The conventional design methodology based on typical shields used in flight hardware suggests that the outer bumper should have a density and a shock impedance comparable to that of the projectile and be sized such that the shock rarefaction and tensile release wave superimpose towards the back of the projectile, maximising fragmentation and radial dispersion. An internal fabric layer, such as that used in stuffed Whipple shields (see e.g., Christiansen et al. (1995)), is then intended to catch and decelerate projectile fragments prior to impact upon the shield rear wall. This general design principle has been established through ongoing investigation since the Apollo program and matured for the International Space Station with significant, proven success. Design ID 22 and the subsequent designs (IDs 27 and 29) are a substantial deviation from these established principles, and at present it is unclear why they have been successful; further investigation is needed. Evaluating this experiment: the role of BO-Muse was to inject novelty into the design process, from which the human expert could take inspiration and perform exploitation. We consider this to have been successfully demonstrated in a real applied engineering design experiment.

A.11 DISCUSSION OF LIMITATIONS

In this section we present a brief discussion of the limitations of our work. With regard to the human expert, we have assumed that "cognitive entrenchment" behaviour occurs, which is backed by recent studies (Dane, 2010; Daw et al., 2006). This may not hold strictly in all cases, which may cause the algorithm's sample efficiency to be lower than expected, as BO may over-compensate for expected expert over-exploitation that does not eventuate. Similarly, we assume that the expert is able to improve their model as the number of observations available to them increases.
However, a less skilled expert may fail to do this, and the algorithm's sample efficiency may subsequently suffer as, after a point, expert suggestions may cease to be useful.



We do not consider the possibility of transfer learning of this structure, as this requires the expert to distill their knowledge in an amenable form which, as noted previously, is highly non-trivial.

We need $\phi_s > 0$ to ensure $\theta_s > 0$, and note that $h(\phi) = \frac{1}{\phi} \ln\left(\frac{1}{2 - e^{\phi}}\right)$ is increasing and $h(\ln 2) = \infty$.

Necessary ethics approval obtained.

This result may be well known, but we have been unable to find it in the literature.

The proof is true for θs ∈ [-∞, ∞], but the negative θs case is not of interest here.



Figure 1: BO-Muse workflow.

Figure 2: Simple regret versus iterations for (a) Ackley-4D and (b) Levy-6D functions. An ablation study with varying values of the exploitation-exploration parameter (βs) is shown in (c).

Figure 3: (a) Interface used by participants showing performance vs hyper-parameters: best performances are indicated in blue (darker is higher). If working with BO-Muse, machine points from the previous iteration (#) are shown with ▼. Simple regret vs iterations for the hyper-parameter tuning of classifiers, comparing Team 1 (BO-Muse + Human) (red) vs Team 2 (Human only) (black) vs AI alone (blue). We report the simple regret mean (along with its standard deviation) for (b) Support Vector Machines (SVM) and (c) Random Forests (RF) classifiers.

THE REGRET BOUND

We begin with the following uncertainty bound, which is largely based on the bound (Chowdhury & Gopalan, 2017, Theorem 2) and analogous to (Srinivas et al., 2012, Lemma 5.1) and (Bogunovic & Krause, 2021, Lemma 1):

s∈N S +1 θs ζj∼ = max s∈N S +1 ζj,s

Figure 4: Simple regret vs iterations for synthetic functions: (a) Matyas-2D (b) Rastrigin-5D.

Figure 5: Ablation study with varying values of the exploration-exploitation parameter (βs) obtained for (a) Ackley-4D, (b) Levy-6D and (c) Rastrigin-5D functions.

Figure 7: Graphical interface used by (a) Team 1 and (b) Team 2 for both the classification experiments using SVM and RF classifiers.

Figure 8: Schematic of the space debris shield design problem. Our objective is a design solution that will prevent perforation of the spacecraft hull (Plate 3, rear wall) by modifying the plate materials, plate thicknesses, and plate spacing (S).

Figure 9: Series of frames from an LS-DYNA simulation showing the impact of a steel cubic particle (yellow) at 7 km/s against a shield consisting of three 1.0 cm thick plates of AA6061-T651 (red: outer bumper, blue: inner bumper, green: rear wall) with a 3.0 mm thick AA6061-T651 witness plate (brown) positioned 10 cm behind the rearmost surface of the shield. The projectile is shown to fragment into a dispersed cloud of projectile and shield fragments which radially disperse before impacting upon the rear wall. The rear wall is perforated and fragments are observed to crater the witness plate, providing a non-zero depth of penetration measurement.


Simulations are performed in the explicit structural mechanics solver LS-DYNA from ANSYS (Hallquist, 2006). Simulations are performed in 3D using a smoothed particle hydrodynamics (SPH) discretisation scheme, which enables projectile fragmentation to be modelled without the arbitrary numerical erosion that would otherwise be required for a mesh-based Lagrangian scheme. SPH elements of 0.05 mm diameter are used to discretise all simulated parts. The metallic materials, AA6061-T651 ("AL") and 4340 steel ("ST"), utilise a Gruneisen equation of state (EoS) (Gruneisen, 1959), a Johnson-Cook viscoplasticity model (Johnson & Cook, 1983) and a Johnson-Cook fracture model (Johnson & Cook, 1985), the constants for which are given in Tables 3 and 4. The aramid composite ("KE") is modelled as a continuum using the elastic-plastic orthotropic strength with failure model from LS-DYNA (MAT 059) and a linear EoS, the constants for which are given in Table 5. The ultra-high molecular weight polyethylene ("PE"), specifically Dyneema HB26, is modelled using the orthotropic non-linear model and material constants from (Nguyen et al., 2016).

In Figure 9 a series of frames from a representative LS-DYNA simulation is provided, depicting the impact of the cubic steel projectile against a three-wall shield design.

Simulations are performed on AMD EPYC servers with 64 CPU cores (2.25 GHz) and 1 TB of RAM. All simulations are performed in parallel on 4 CPU cores and require 20-40 CPU hours to run, depending on the complexity (in this case thickness) of the target plates. The simulation models included a 3.0 cm thick aluminium alloy witness plate located 10.0 cm from the rear surface of the target rear wall (Plate 3 in Figure 8) to measure residual penetration in the event that the shield was perforated.
Results were recorded as a binary pass/fail related to non-perforation or perforation of the target rear wall, respectively, together with a continuous Depth of Penetration (DoP) measurement into the witness plate. Our objective is to design a protective shield that can defeat the projectile threat for minimal weight.

Finally, the human expert may not behave precisely like a GP-UCB model would suggest, again resulting in lower efficiency. We note, however, that even when the human is unable to perform as expected, BO-Muse will still have sublinear convergence. In such a worst-case scenario every second experiment will, in effect, be wasted. However, the data from these "wasted" experiments will still provide additional observations of f⋆ for the machine's GP model, which can only improve the model's accuracy. The behaviour of the algorithm in this case can therefore be analysed in the "machine running GP-UCB plus improved prior due to additional data" regime, that is, a GP-UCB algorithm with AI-generated suggestions, an exploration parameter β that is increased by a constant multiplicative factor, and a stream of additional (harmless or even potentially informative) human-guided experimental observations, which suffices to ensure sublinear convergence (Srinivas et al., 2012). We also note that the aforesaid worst-case scenario is highly unlikely on the assumption that the human expert is knowledgeable in the relevant field.
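The worst-case regime described above can be illustrated with a small NumPy sketch: a machine that picks points by GP-UCB with an inflated exploration parameter, paired with a "human" whose suggestions carry no useful information but still add observations to the GP. This is a toy illustration under stated assumptions rather than the BO-Muse implementation: the one-dimensional candidate grid, the RBF kernel and its lengthscale, the fixed `beta` and `inflation` values, and the uniformly random stand-in for the unhelpful human are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    """Zero-mean GP posterior mean and standard deviation at x_cand."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf_kernel(x_obs, x_cand)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def ucb_pick(x_obs, y_obs, x_cand, beta):
    """Return the candidate maximising mean + sqrt(beta) * std."""
    mean, std = gp_posterior(x_obs, y_obs, x_cand)
    return x_cand[np.argmax(mean + np.sqrt(beta) * std)]


def run_worst_case(f, n_rounds=10, beta=4.0, inflation=2.0, seed=0):
    """Batches of size 2: one GP-UCB point and one uninformed 'human' point.

    ASSUMPTION: beta is held fixed and multiplied by a constant inflation
    factor; the human is replaced by uniform random choice on the grid.
    """
    rng = np.random.default_rng(seed)
    x_cand = np.linspace(0.0, 1.0, 201)
    x_obs = rng.choice(x_cand, size=2, replace=False)
    y_obs = f(x_obs)
    for _ in range(n_rounds):
        x_machine = ucb_pick(x_obs, y_obs, x_cand, inflation * beta)
        x_human = rng.choice(x_cand)  # stand-in for an unhelpful expert
        x_new = np.array([x_machine, x_human])
        # Both observations are added to the data the machine's GP conditions on.
        x_obs = np.concatenate([x_obs, x_new])
        y_obs = np.concatenate([y_obs, f(x_new)])
    return x_obs, y_obs
```

For example, `run_worst_case(lambda x: -(x - 0.3) ** 2)` returns all evaluated points and their values. The design point to notice is that the human's observations are never discarded: even when they do not help directly, they enter the GP that drives the machine's next UCB choice.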

