WHEN TO MAKE AND BREAK COMMITMENTS?

Abstract

In many scenarios, decision-makers must commit to long-term actions until their resolution before receiving the payoff of said actions, and usually, staying committed to such actions incurs continual costs. For instance, in healthcare, a newlydiscovered treatment cannot be marketed to patients until a clinical trial is conducted, which both requires time and is also costly. Of course in such scenarios, not all commitments eventually pay off. For instance, a clinical trial might end up failing to show efficacy. Given the time pressure created by the continual cost of keeping a commitment, we aim to answer: When should a decision-maker break a commitment that is likely to fail-either to make an alternative commitment or to make no further commitments at all? First, we formulate this question as a new type of optimal stopping/switching problem called the optimal commitment problem (OCP). Then, we theoretically analyze OCP, and based on the insight we gain, propose a practical algorithm for solving it. Finally, we empirically evaluate the performance of our algorithm in running clinical trials with subpopulation selection.

1. INTRODUCTION

In many real-world settings, decision-makers must commit to long-term actions and wait until their resolution before receiving the payoff of said actions. Meanwhile, staying committed to such actions incurs continual costs. For instance, in portfolio management, it might take time for an asset to develop additional value after an initial investment, and keeping capital tied up in an asset comes with an opportunity cost for the investor (Markowitz, 1959; Merton, 1969; Karatzas and Wang, 2020) . In an energy network, turning power stations on and off is not an immediate action, hence a sudden increase in energy demand can only be met with a delay after putting more stations into operation, and keeping stations operational obviously consumes resources (Rafique and Jianhua, 2018; Olofsson et al., 2022) . In healthcare, a newly-discovered treatment can only be marketed to patients once a successful clinical trial that targets the said treatment is conducted, which both requires time and is also costly (Kaitin, 2010; Umscheid et al., 2011) . Of course, not all commitments eventually pay off: An asset might end up losing value despite investments, energy demands might shift faster than a network can react to, and a clinical trial might fail to show efficacy for the targeted treatment. Given the time pressure created by the continual cost of keeping a commitment, our goal in this paper is to answer the question: When should a decision-maker break a commitment-thereby avoiding future costs but also forfeiting any potential returns-either to make an alternative commitment instead or to make no further commitments at all? Solving this problem optimally requires a careful balance between exploration and exploitation: The earlier a commitment that is bound to fail is broken, the more resources would be saved (cf. exploitation); but the longer one is kept, the more information is revealed regarding whether the commitment is actually failing or might still succeed (cf. exploration)-and in certain cases, also regarding the prospects of similar commitments one could make instead. Related problems are mostly studied within the context of adaptive experimentation and sequential hypothesis testing (see Section 5). As such, we focus on adaptive experimentation as our main application as well. More specifically, we consider the problem of selecting the target population of an adaptive experiment. Suppose an experimenter, who is interested in proving the efficacy of a new treatment, starts running an initial experiment that targets a certain population of patients. Incidentally, the treatment being tested is effective only for a relatively narrow subpopulation of patients but not for the wider population as a whole. Hence, an experiment targeting the overall population, but not the subpopulation specifically, will most probably fail to prove efficacy and prevent the deployment of the treatment for the patients who would have actually benefited from it, not to mention waste time and resources (Moineddin et al., 2008; Lipkovich et al., 2017; Chiu et al., 2018) . Of course, the experimenter has no knowledge of this in advance but the initial experiment they have set up would slowly reveal more information regarding the effects of the treatment and the fact that the ongoing experiment is bound to fail. In that case, we want to be able to determine at what point the experimenter has enough information to justify breaking their commitment to the initial experiment that targets too wide of a population to be successful, in favor of making a new commitment to a follow-up experiment that focuses on a narrower subpopulation instead? Contributions Our contributions are threefold: First, we formulate the problem of making and breaking commitments in a timely manner as a new type of optimal stopping/switching problem called the optimal commitment problem (OCP) (Section 2). The defining feature of OCP is that rewards are received only when a known time point is reached but costs are incurred continually, requiring commitment to actions but with incentive to abandon those commitments. As we will show later, OCP cannot be easily solved via conventional reinforcement learning techniques due to its non-convex nature. Second, we theoretically analyze a simplified case of OCP to identify the characteristics of the optimal solution (Section 3), and based on the insights we gain, propose a practical algorithm for the more general case (Section 4). Third, we empirically evaluate the performance of our algorithm in running experiments with subpopulation selection (Section 6). Before we move on, it should be emphasized that, although we predominantly consider adaptive experimentation as our main application, our contributions remain generally applicable to portfolio management, energy systems, and any other decision-making scenarios that require commitments to long-term actions.

2. OPTIMAL COMMITMENT PROBLEM

We first introduce the problem of optimal commitment from the perspective of running experiments. As far as our formulation is concerned, experiments are conducted to confirm the efficacy of an intervention by observing the outcome of the said intervention for subjects belonging to a particular population. However, this experiment-focused perspective does not limit the applicability of OCP; we stress its generality later at the end of the section. We provide a glossary of terms and notation in Appendix K. Populations Let X be a discrete set of atomic-populations such that every subject is only the member of exactly one atomic-population x ∈ X . Denote with η x ∈ [0, 1] the probability of a subject being from atomic-population x (such that x∈X η x = 1), and with Ω x the distribution of outcomes for atomic-population x such that the mean outcome θ x = E y∼Ωx [y] is the effect of some intervention for atomic-population x. Now, wider populations can be constructed by combining various atomic-populations. Let any X ⊆ X represent the population of subjects who belong to either one of the atomic-populations {x ∈ X}. Then, the probability of a subject being from population X can be written as η X = x∈X η x , the probability of a subject being from atomic-population x conditioned on the fact that they are from population X can be written as η x|X = η x /η X , and the average effect for population X can be written as θX = x∈X η x|X θ x . Experiments An experiment is largely characterized by the population it targets, its sample horizon, and its success criterion. During an experiment that targets population X, at each time step t ∈ {1, 2, . . .} that the experiment continues, first a subject from some atomic-population x t within the targeted population X arrives with probability η xt|X , and then the outcome y t ∼ Ω xt for that subject is observed. This process generates an online dataset D t = {x t , y t } t t =1 . The experiment terminates when a pre-specified sample/time horizon τ is reached. Once terminated, the experiment is declared a success if ρ(D τ ) = 1, where ρ : (X × R) τ → {0, 1} is the success criterion, and declared a failure otherwise. Formally, the tuple ψ = (X, τ, ρ) constitutes an experiment design. Meta-experimenter Suppose a meta-experimenter is given a set of viable experiment designs Ψ and is tasked with running at least one successful experiment. Each experiment ψ ∈ Ψ has an associated cost C ψ ∈ R + , which the experiment incurs per time step that it continues, and an associated reward R ψ ∈ R + , which the experiment provides only if it eventually succeeds. The metaexperimenter aims to maximize utility-that is the difference between any eventual reward received and the total costs incurred by running experiments. They first pick an initial experiment ψ 1 ∈ Ψ and start conducting it, which generates an online dataset D 1 t as described earlier. Now at each time step t, they need to decide whether they should stay committed to their initial decision and wait until ψ 1 terminates, or stop ψ 1 early in favor of starting a new experiment ψ 2 . They might decide on the latter to avoid unnecessary costs if D 1 t already indicates ψ 1 is unlikely to succeed. If at some point a secondary experiment ψ 2 is started, now the meta-experiment has a similar decision to make regarding whether to stop ψ 2 early in favor of starting a new experiment ψ 3 ∈ Ψ. This process continues until either an experiment finally succeeds or the meta-experimenter decides not to conduct any further experiments; let the random variable n ∈ {1, 2, . . .} be such that ψ n is the last experiment. We denote with ψ i = (X i , τ i , ρ i ) the i-th experiment conducted by the meta-experimenter, and with T i the number of time steps for which the i-th experiment is conducted either until it was stopped by the meta-experimenter or the time horizon τ i was reached. Denote with π(t, ψ i , Di t ) the decision-making policy of the meta-experimenter, where t is the current time step of the latest experiment ψ i and Di t = (∪ i-1 j=1 D j T j ) ∪ D i t is an aggregate dataset. We write (i) π(t, ψ i , Di t ) = ψ i if the meta-experiment decides to keep conducting the current experiment ψ i , (ii) π(t, ψ i , Di t ) = ψ = ψ i if the meta-experimenter decides to stop experiment ψ i and start experiment ψ instead, and (iii) π(t, ψ i , Di t ) = ∅ if the meta-experimenter decides not to conduct any further experiments. Objective Once all experimentation is concluded, the meta-experimenter achieves the total utility G = R ψ n • 1{T n = τ n } • ρ n (D n τ n ) - n i=1 C ψ i • T i . (1) Then, the optimal commitment problem is to find the optimal policy π * = argmax π E π [G] that maximizes the expected utility given Ψ, {η x }. {R ψ , C ψ } without knowing mean outcomes {θ x } or outcome distributions {Ω x }. It is called the optimal commitment problem because each experiment ψ = (X, τ, ρ) only provides a reward if the meta-experimenter commits to incurring its costs for at least τ time steps, and the meta-experimenter needs to decide which experiment in Ψ is the better commitment-or if there is any experiment worth committing to at all-adaptively. General applicability of OCP Although we have described OCP from the perspective of (meta-)experiment design, it can potentially be useful in modeling many other problems as we have stressed during the introduction (see Table 1 ). For instance, in portfolio management, atomicpopulations can be regarded as various assets one can invest in, then a population would correspond to a portfolio of assets. Similar to experiments, when these portfolios require a time commitment (cf. τ ) before they provide their payoff (cf. R ψ ) and incur an opportunity cost (cf. C ψ ) in the mean time, the decisionmaking problem of managing when and which portfolio to invest in constitutes an instance of the optimal commitment problem. Another good examples is energy management, where power stations and the networks they form are akin to atomic-populations and populations. Since power stations cannot be turned on and off immediately, putting one in operation requires a certain amount of commitment.

3. WARM-UP: WHEN TO BREAK A SINGLE COMMITMENT?

In this section, to gather insights, we commence by analyzing a simplified instance of OCP. Later, in Section 4, using these insights, we construct a practical algorithm for solving a more general case of OCP. As the simplified instance, we only consider one atomic-population such that X = {X 0 } and one experiment design that targets this atomic-population such that Ψ = {Ψ 0 = (X 0 , τ, ρ)}. Moreover, we assume that the outcomes are distributed normally with unit variance such that Ω . = Ω X0 . = N (θ . = θ X0 , 1) and the success criterion is a simple Z-test to see whether θ > 0 such that ρ(D τ ) . = ρ(µ τ ) . = 1{µ τ > α/ √ τ } , where µ t = (x t ,y t )∈Dt y t /|D t | is the empirical mean outcome given dataset D t , and α determines the significance threshold for the test. Since there is just one viable experiment in this setting, the only decision that needs to be made at each time step is whether to keep conducting experiment ψ 1 = Ψ 0 or to stop all experimentation. For this decision to be interesting, we will also assume that C . = C Ψ0 > 0-so that never stopping is not necessarily optimal-and R . = R Ψ0 > τ C-so that always stopping is not necessarily optimal either. Value and Q-functions Since t and µ t are sufficient statistics to estimate the success probability of the experiment, it is also sufficient to only consider policies of the form π(t, µ). For a given policy π, V π (t, µ) = E[R • 1{T π t > τ } • ρ(µ τ ) -C • (min{T π t , τ } -t) | µ t = µ] (2) Q π (t, µ) = E[R • 1{T π t+1 > τ } • ρ(µ τ ) -C • (min{T π t+1 , τ } -t) | µ t = µ] are the value function, and the Q-function of conducting the experiment for at least one more time step respectively, where T π t = min{t ≥ t : π(t , µ t ) = ∅} is the first time step at or after time t that policy π decides to stop; let V * = V π * and Q * = Q π * be the optimal value and Q-functions. Note that the Q-factor of stopping all experimentation is always equal to zero for all policies. Hence, the optimal policy must be such that π Once we identify the value and Q-functions, a naive attempt at finding the optimal policy would be to compute V * and Q * via dynamic programming as they would satisfy the following Bellman optimality conditions: * (t, µ) = Ψ 0 if Q * (t, µ) > 0 and π * (t, µ) = ∅ otherwise. -1.5 -1 -0.5 0 0.5 1 1.5 2 0 3 6 9 Test Statistic (µ) Optimal Value (V * ) V * (t = 1, µ) V * (t = 2, µ) V * (t = 3, µ) Q * (t, µ) = -C + E[V * (t + 1, µ t+1 )|µ t = µ] (4) V * (t, µ) = max{0, Q * (t, µ)} (5) and V * (τ, µ) = R • ρ(µ). However, a major complication in applying dynamic programming methods to compute V * and Q * is that they are continuous functions in µ. In the literature of partially-observable Markov decision processes (POMDPs), which OCP happens to be an instance of (see Appendix A), the standard approach of addressing this complication would be to leverage the convexity of V * and Q * , and approximate them with functions of the form f (µ) = max i a i µ + b i (Spaan, 2012) . However, this standard approach is not applicable in OCP because, in general, neither V * (t, µ) nor -V * (t, µ) is a convex function with respect to µ (see Figure 1 ): Proposition 1 (Non-convexity). There exist a problem instance (C, R, τ, α) and t ∈ {1, . . . , τ -1} such that ∃µ, µ ∈ R, p ∈ [0, 1] : V * (t, pµ + (1 -p)µ ) < pV * (t, µ) + (1 -p)V * (t, µ ) and ∃µ, µ ∈ R, p ∈ [0, 1] : -V * (t, pµ + (1 -p)µ ) < -pV * (t, µ) -(1 -p)V * (t, µ ). 1 Properties of the optimal policy Although identifying π * exactly by computing V * and Q * is challenging, we can still identify some properties that π * should have, which can then help us design a heuristic policy that we expect to perform well, albeit not optimally. First of all, the optimal policy π * should be a "thresholding-type" policy-that is the meta-experimenter should keep conducting the experiment as long as µ t stays above a time-dependent threshold µ * t and should stop all experimentation the moment µ t drops below that threshold (see the top panel of Figure 2 Intuitively, a higher test statistic µ t means that the experiment is only more likely to succeed, hence if it is optimal to continue conducting the experiment when µ t = µ, then it should also be optimal to continue when µ t = µ > µ (likewise, lower µ t means success is even less likely hence π * (t, µ) = ∅ implies π * (t, µ ) = ∅ for µ < µ). Moreover, the optimal policy π * must be "optimistic" that the experiment will succeed when making decisions. Consider a greedy policy π greedy that continues as long as the expected utility of committing fully to conducting the experiment until it terminates at t = τ is positive-that is π greedy = Ψ 0 if and only if V π (0) (t, µ) > 0 where π (0) is the policy that always waits until the experiment terminates such that π (0) (t, µ) = Ψ 0 for all t, µ; π greedy is said to be greedy because the decision to continue is made assuming a full commitment to the experiment without considering the possibility to stop at a future time step. Then, whenever such greedy reasoning suggests continuing, the meta-experimenter should indeed continue. However, whenever the same reasoning suggests stopping, the metaexperimenter should be optimistic that the experiment will succeed and occasionally make the decision to continue instead-that is π * should be biased towards continuing (see the threshold gap in Figure 2 ): Proposition 3 (Optimism). First, π greedy is also of thresholding type and there exists {µ greedy t ∈ R} τ -foot_0 t=1 such that π greedy (t, µ) = Ψ 0 if and only if µ > µ greedy t . Moreover, for all t ∈ {1, . . . , τ -1}, µ * t ≤ µ greedy t ⇐⇒ {µ : π * (t, µ) = Ψ 0 } ⊇ {µ : π greedy (t, µ) = Ψ 0 } (7) Intuitively, the optimism of π * accounts for the information gained from observing more samples when the experiment is continued. Remember that π greedy estimates the reward to be received if the experiment is conducted until termination, and it stops whenever its estimate is negative. But, the estimate of π greedy has some uncertainty associated with it. Whenever it is uncertain enough that the reward to be received is actually negative; incurring the cost of continuing for one more time step, gaining new information, and forming a more certain estimate can lead to a more accurate decision and a higher overall utility. Finally, the optimism of π * has a strictly decreasing upper bound; denoting with F (x) = (1/ √ 2π) x -∞ e -(1/2)x 2 the c.d.f. of the standard normal distribution: Proposition 4 (Decreasing optimism). For all t ∈ {1, . . . , τ -1}, |µ * t -µ greedy t | ≤ 1 /t -1 /τ × F -1 ((τ -t) C /R) -F -1 ( C /R) Intuitively, as the experiment continues, the information gained from one individual sample decreases relative to the total information accumulated, hence the optimism of π * that accounts for the that information gain also decreases (see the bottom panel of Figure 2 ). Consider one extreme: When t = τ -1, there is no more information to be gained before the experiment terminates at t = τ , hence π * should make the same decisions as π greedy . Indeed, Proposition 4 implies that µ * τ -1 = µ greedy τ -1 . 4 A PRACTICAL ALGORITHM: BAYES-OCP Summarizing our discussion in the previous section, we suspect the optimal policy to be (i) of thresholding type (cf. Proposition 2), (ii) optimistic (cf. Proposition 3), and (iii) increasingly more greedy (cf. Proposition 4). These findings are not a complete surprise as optimism-in-the-face-of-uncertainty is a well-known principle in solving online decision-making problems (Auer et al., 2002; Bubeck et al., 2012) . Our earlier analysis shows rigorously that this principle holds for at least a special case of OCP and strengths our intuition that it should be applicable for more general cases of OCP as well. Keeping properties (i-iii) in mind, we now propose a practical algorithm for solving OCP in a more general setting than the one we analyzed earlier. Let |X | ≥ 1 and Ψ = {(X, τ, ρ) : X ∈ 2 X \ ∅} include all experiment designs that target a unique subpopulation within X for a given time horizon τ and success criterion ρ; let C X . = C (X,τ,ρ) and R X . = R (X,τ,ρ) . We assume that the conditional power of performing a hypothesis test at time τ according to ρ-that is the probability of the test being successful conditioned on mean outcomes {θ x }-can be computed for interim datasets-that is P(X, D t ; {θ x }) = E x t ∼{η x|X } x∈X ,y t ∼N (θx t ,1) [ρ(D t ∪ (∪ τ t =t+1 {x t , y t }))] can be evaluated efficiently. Then, based on this conditional power function, we define Algorithm 1 Bayes-OCP 1: Initialize µx and σ 2 x for all x ∈ X 2: X ← X , t ← 0, D0 ← ∅ 3: Start experiment ψ = (X , τ, ρ) 4: loop: 5: t ← t + 1; Dt ← Dt-1 ∪ {xt, yt} 6: 1/σ 2 x t ← 1/σ 2 x t + 1 7: µx t ← µx t + (yt -µx t )σ 2 x t (i) Identify a candidate subpopulation X to replace X: 8: X ← ∅ 9: while X \ X ⊃ ∅: 10: x * ← argmax x∈X\X E θx∼N (µx,σ 2 x ) [G (0) (X ∪ {x}; {θx})] 11: if E θx∼N (µx,σ 2 x ) [G (0) (X ∪ {x * }; {θx})] > E θx∼N (µx,σ 2 x ) [G (0) (X ; {θx})]: 12: X ← X ∪ {x * } 13: else: break (ii) Decide whether to actually replace X with X : 14: if P θx∼N (µx,σ 2 x ) {G (0) (X ; {θx}) > G(X, Dt; {θx})} > β: 15: X ← X , t ← 0, D0 ← ∅ 16: Start a new experiment ψ = (X, τ, ρ) G(X, D t ; {θ x }) = R X • P(X, D t ) -C X • (τ -t) as the expected utility of fully committing to an experiment and waiting until it terminates when the experiment targets population X, is currently at time step t, and has collected dataset D t so far. Denote with G (0) (X; {θ x }) = G(X, ∅; {θ x }) the same expected utility but for an experiment that is yet to start, and with G (0) (∅; {θ x }) = 0 the utility of stopping all experimentation. Our algorithm is called Bayes-OCP and is given in Algorithm 1. It maintains a posterior distribution N (µ x , σ 2 x ) for each mean outcome θ x assuming that, given mean θ x , outcomes are distributed normally with unit variance-that is Ω x = N (θ x , 1). These posteriors are only used in deciding which experiment to run next and not in determining whether the experiment was a success or not. Hence, even when the assumption of outcomes being normally distributed is violated, the integrity of the experiments would not be effected; only the performance of Bayes-OCP in managing various experiments would degrade (see Appendix C for related experiments). Making use of the posteriors it maintains, Bayes-OCP performs two steps at each iteration: (i) First, a subpopulation X ⊂ X within the currently targeted population X is identified as a potential candidate to target next; due to the combinatorial size of Ψ, it would not be practical to consider every subpopulation individually as a candidate for large |X |. The ideal candidate would be the subpopulation with the largest expected utility: X = argmax X ⊂X E θx∼N (µx,σ 2 x ) [G (0) (X ; {θ x })]. But again due to the combinatorial size of the search space, Bayes-OCP employs a greedy algorithm instead and forms candidate subpopulations by combining, one by one, the atomic-subpopulations that increase the expected utility the most, until the expected utility no longer improves. Note that it is common to use greedy algorithms to solve combinatorial optimization problems (Lawler, 1976; Papadimitriou and Steiglitz, 1982) . (ii) Then, it is decided whether the current experiment targeting population X should be stopped in favor of targeting candidate X identified earlier instead. A greedy strategy would have done so whenever E θx∼N (µx,σ 2 x ) [G (0) (X ; {θ x })] > E θx∼N (µx,σ 2 x ) [G(X, D t ; {θ x })]. But from our earlier analysis, we have learned that the optimal strategy is optimistic (cf. Proposition 3). As such, Bayes-OCP checks whether it is overwhelmingly likely that the alternative experiment has higher expected utility-that is whether P θx∼N (µx,σ 2 x ) {G (0) (X ; {θ x }) > G(X, D t ; {θ x })} > β, where β ∈ (1/2, 1) controls the decision-making threshold. When β is large, we are more optimistic that the current experiment will succeed and require stronger evidence that the alternative experiment has higher expected utility. Note that, as the posteriors N (µ x , σ 2 x ) get narrower, the optimism of this rule naturally decreases, which should be the case for the optimal strategy (cf. Proposition 4). As one extreme, the two switching rules become equivalent when {σ 2 x → 0}.

5. RELATED WORK

Optimal stopping Optimal commitment is essentially a new type of optimal stopping/switching problem. In typical optimal stopping problems (OSPs), the reward an agent can receive evolves based on a stochastic process and the goal of the agent is to determine the optimal time step to stop when the reward to be received is in some sense maximized (Shiryaev, 2007) . Optimal commitment is unique in that a positive reward can only be received by not stopping until a pre-specified time horizon τ . In optimal commitment, there is still a stochastic process (namely, samples y t ) that gradually reveals more information regarding what that positive reward will be at the end, however, the reward-or rather the cost-of stopping earlier is independent of this stochastic process (and is equal to -tC). Sequential hypothesis testing Among other OSPs, optimal commitment is most closely related to sequential hypothesis testing (SHT), where an agent makes sequential observations regarding a given hypothesis and eventually needs to decide whether to reject the said (alternate) hypothesis or reject some null hypothesis (Wald and Wolfowitz, 1948; Yu et al., 2009; Drugowitsch et al., 2012; Shenoy and Angela, 2012; Zhang and Angela, 2013; Drugowitsch et al., 2014; Khalvati and Rao, 2015; Schönbrodt et al., 2017; Fauß et al., 2020) . Rejecting the correct hypothesis provides a positive reward whereas waiting for more observations, while informative, is also costly as in OCP. It is well known that the optimal policy in the classic setting of SHT is a thresholding-type policy with fixed thresholds that do not vary over time: The null hypothesis is rejected if some test statistic gets above a threshold (and the alternate hypothesis is rejected if the same statistic gets below a different threshold). Optimal commitment can be thought of as a SHT problem with the crucial difference that the metaexperimenter has only the option of discarding the alternate hypothesis (i.e. breaking a commitment), and once some time horizon is reached (i.e. when a commitment is kept), either the null hypothesis or the alternate hypothesis is automatically rejected according to some external success criterion ρ, regardless of what the meta-experimenters's decision might have been otherwise. As we have shown in Proposition 2, the optimal policy still remains a thresholding-type policy, but since there is now a deadline to discard the alternate hypothesis early, the thresholds become time-varying; in particular, they become less and less optimistic as the said deadline approaches (cf. Proposition 4). Frazier and Angela (2007); Dayanik and Angela (2013) ; Alaa and van der Schaar (2016) consider SHT under stochastic deadlines, but different from optimal commitment, they still allow agents to reject both hypotheses at any time. In these works, the agent must make the rejection decision before the deadline is reached to be able to receive a positive reward, whereas in our case, the agent must wait until the deadline to see whether the null hypothesis will be rejected or not. Naghshvar and Adaptive experimentation We introduced optimal commitment predominantly as a tool for population selection during an experiment. In clinical trials, dominant approach to population selection is adaptive enrichment (Mehta et al., 2009; Magnusson and Turnbull, 2013; Simon and Simon, 2013; Wang and Hung, 2013; Simon and Simon, 2018; Ondra et al., 2019; Thall, 2021) and adaptive signature designs (Freidlin and Simon, 2005; Freidlin et al., 2010; Mi, 2017; Zhang et al., 2017; Bhattacharyya and Rai, 2019) . These designs are capable of adapting the target population of a trial as the trial continues, but unlike optimal commitment, they can only do so at fixed analysis points and not just at any time step. While adaptive signature designs can select arbitrary populations, adaptive enrichment designs are also limited by the number of pre-specified populations they can select between, which is typically only two: the overall population and an alternative subpopulation. Optimal commitment is also related to clinical trial designs with futility stopping, where an experimenter might terminate a trial early once it becomes apparent that the said trial is highly unlikely to succeed (van der Tweel and van Noord, 2003; Lachin, 2005; He et al., 2012; Jitlal et al., 2012; Kimani et al., 2013; Chang et al., 2020) . However, this does not consider the possibility of switching to a new trial that targets a different population. As we will see during our experiments, switching to an alternative experiment might prove preferable even before an ongoing experiment can be deemed futile. In such cases, optimal commitment can make more timely decisions. Table 2 summarizes the experiment designs related to optimal commitment. Finally, it is worth mentioning that there are several methods for managing clinical trials at a portfolio level-that is determining which clinical trial is to be conducted next (Rogers et al., 2002; Colvin and Maravelias, 2008; Graham et al., 2020) . Trial management in this vain is orthogonal to optimal commitment: They are concerned with the success of multiple new treatments and make decisions on a trial-by-trial basis whereas we only ever consider a single intervention and make decisions regarding the target population on a sample-by-sample basis while experiments still continue. See Appendix H for extended related work.

6. EXPERIMENTS

We want to investigate how Bayes-OCP behaves in environments that differ in terms of ground-truth outcomes, for instance, what happens in environments where the original experiment is quite likely to succeed versus what happens in ones where switching to an alternative experiment is needed. To this end, we simulate experiments where mean outcomes are varied but other aspects of an experiment are fixed: In our environments, there are two atomic-populations, X = {X A , X B }. Both atomicpopulations have equal propensities η X A = η X B = 1/2 and the meta-experimenter has the same positively-biased prior for the mean outcome associated with each atomic-population: θ X A , θ X B ∼ N (0.1, 0.1). Experiment designs targeting one or both of these atomic-populations all have the same time horizon τ = 600 and success criterion ρ (D τ ) = 1{ (xt,yt)∈Dτ y t /|D τ | > α/ √ τ } , where α = F -1 (95%). So, experiments are powered to detect a positive mean outcome of 0.1 with probability ∼ 80. Rewards are given by R X = 1000η 0.1 X -the wider the target population is, the more people a successful intervention can be marketed to-and costs are given by C X = 1/η 0.1 X -the narrower the target population is, the harder it becomes to find subjects eligible to participate. Benchmarks We consider the metaexperiment designs summarized in Table 2 as benchmarks (see Appendix A.1 for an RL-based benchmark). Conventional RCT always targets the overall population and never stops early-that is it always conducts the experiment ψ = ({X A , X B }, τ, ρ) until its completion. Adaptive Enrichment performs an intermediary analysis at t = τ /2 = 300 and greedily selects the experiment with the highest expected utility from Ψ = {(X, τ, ρ)} X⊆{X A ,X B } . Futility Stopping is implemented via Bayes-OCP by initializing the set of all experiments as a singleton Ψ = {Ψ 0 = ({X A , X B }, τ, ρ)}. Intuitively, futility stopping only decides whether or not to stop the initial experiment that targets the overall population early. Bayes-OCP is initialized with β = 0.80 (see Appendix E for a sensitivity analysis). We also consider an ablation of Bayes-OCP where decisions are made greedily instead of optimistically (Greedy Bayes-OCP). As a baseline of maximum achievable performance, we consider an oracle (Oracle RCT) that always runs the RCT with the optimum target (or does not run any RCT at all if that happens to be optimal). Environments A meta-experimenter's performance is specific to the environment instance. In particular, it depends on the ground-truth outcome distributions {Ω x } for different populations. For example, an algorithm that always immediately stops the experiment would perform best when the mean outcome is negative. Hence, to faithfully evaluate the benchmarks, we need to focus on the average performance across different environments. To this end, we randomly generated 1000 environments (repeated five times to obtain error bars) with true mean outcomes θ X A , θ X B sampled independently from N (0.1, 0.1). Given these means, outcome distributions are set to be Gaussian with unit variance such that Ω x = N (θ x , 1). Depending on the true mean outcome, these environments can be categorized into three groups: (i) green instances where the initial experiment targeting the overall population has the highest utility, (ii) amber instances where an alternative experiment that targets a subpopulation has the highest utility, and (iii) red instances where no experiment has positive utility hence running no experiments is the optimal decision. Different benchmarks favor different instances (see the top row of Table 3 ): Conventional RCTs do not allow for any adaptation hence they favor green instances where the target population of the initial experiment does not need to be adapted. Adaptive Enrichment allows for adaptation but only at a certain time point, which is often too late to stop unsuccessful experiments (as in red instances). However, an adaptive enrichment design at least makes it possible to eventually target a subpopulation, even though it might be too late to do so at the pre-specified decision point, hence it partially accommodates amber instances. Futility Stopping decides between either continuing with the initial experiment or stopping all experimentation completely (targeting a subpopulation is not an option) hence it favors either green or red instances (but not amber instances). Greedy Bayes-OCP is pessimistic (or rather not optimistic enough) towards any ongoing experiment succeeding, hence it favors red instances where no experiment is likely to succeed. Similar to adaptive enrichment, Greedy Bayes-OCP at least allows subpopulations to be targeted hence it too partially accommodates amber instances.

Main results

Performance of a meta-experimenter is primarily measured by Bayesian utility which is the expected utility averaged over randomly sampled environment instances (Utility). Remember that maximizing utility was our main objective, and as such, Bayes-OCP has the highest expected utility when averaged over all environment instances, see Bayes-OCP (incorrectly) stops due to initial noise in a green instance while Bayes-OCP does not stop since it is more optimistic (as the optimal policy should, cf. Proposition 3). OCP (i) can make timely decisions-unlike Adaptive Enrichment-and (ii) is optimistic hence it does not stop likely-to-succeed experiments prematurely-unlike Greedy Bayes-OCP. More specifically, (i) Timeliness of Bayes-OCP: Bayes-OCP has an advantage in amber and red instances over adaptive enrichment and futility stopping. Consider the example in Figure 3 : While Bayes-OCP stops in a timely manner, adaptive enrichment can only stop at a fixed decision point and experiments with futility stopping only stop when the ongoing experiment is failing not as soon as a better alternative emerges. This underlines the exploitative aspect of Bayes-OCP-making and breaking commitments to maximize utility. (ii) Optimism of Bayes-OCP: While a design that favors early stopping is obviously desirable in amber and red environments, how much it is favored should be moderated to also succeed in green environments. Consider the example in Figure 4 : Greedy Bayes-OCP prematurely stops the initial experiment in a green environment while Bayes-OCP does not. Theoretically, we know that the optimal policy should be optimistic towards the ongoing experiment succeeding and be hesitant to stop to a certain extend. This underlines the exploratory aspect of Bayes-OCP-keeping a seemingly failing commitment still has value as it reveals more information regarding whether the commitment is actually failing. In Table 3 , in addition to Utility, we also report the familywise error rate (FWER)-that is the frequency of runs where at least one experiment (denote it with ψ i ) is declared successful (i.e. ρ i (D i τ ) = 1) despite the mean outcome being negative for the targeted population (i.e. θX i < 0)-the average number of times the target population has been switched (Switches), the probability of success which is defined as achieving positive utility (Success), the average time until a successful outcome (Timeto-Success, T-to-S), and the average time until until an unsuccessful outcome where all experimentation is stopped with negative utility (Time-to-Failure, T-to-F), see Appendix G for details. Importantly, Bayes-OCP does not compromise the error control of experiments, on the contrary, it even achieves a smaller FWER than conventional RCTs. This is because aggregate data is only ever used to select experiments, otherwise no two experiments consult each other's data when evaluating a success criterion so that the potential confoundedness that could have been caused by the adaptiveness of Bayes-OCP is avoided when declaring an experiment as successful (see Appendix B for a discussion on error control).

Supplementary results

We also provide supplementary results: Appendix A.1 evaluates RL-based benchmarks, Appendix B.1 investigates error control, Appendix C considers environments with non-Gaussian outcomes, Appendix D considers environments with more than two atomic-populations, and Appendix E analyzes the sensitivity of Bayes-OCP's performance to its hyper-parameter β.

7. CONCLUSION

Two aspects of OCP require further discussion: (i) How can it be approached from the perspective of reinforcement learning? While OCP technically describes a special class of POMDPs, we have not found this to be constructive in finding a solution (see Appendix A). (ii) What are the implications of using Bayes-OCP in terms of error control? It has no impact on individual error rates and can be adapted to control FWER (see Appendix B). See Appendix F for a discussion on future work.

ETHICS STATEMENT

As the main application of optimal commitment, we have focused on adaptive experimentation, particularly experiments that are run as part of clinical development. Clinical trials have a huge impact on the wellbeing of patients and this high-stakes nature of clinical trials naturally raises some ethical concerns; we discuss two major ones in this section. However before we start our discussion, it should be emphasized that clinical trials is not the only application domain of optimal commitment. As we have highlighted at the end of Section 2, our contributions are generally applicable to decisionmaking problems such as portfolio and energy management. Moreover, not all adaptive experiments are clinical and have the same high stakes as a clinical trial. For instance, A/B testing is common in online advertisement to determine what recommendation policies lead to more user engagement (Gui et al., 2015; Xu et al., 2015; Kohavi and Longbotham, 2017) . Therefore, the ethical concerns we discuss here does not universally concern all possible applications of optimal commitment. The first concern is how the designed error rate of an individual experiment is affected when multiple such experiments are managed together using Bayes-OCP in an adaptive manner, in particular, whether any error rate is inflated by the use of Bayes-OCP or not. We discuss error control in Appendix B with supplementary experiments. But briefly, Bayes-OCP has essentially no impact on the error rate of experiments on an individual level, and when controlling their family-wise error rate is also a concern, it can easily be adapted to accommodate this additional constraint as well. The second concern is that an adaptive approach to population selection might lead to overly conservative experiments that unnecessarily limit the use of an effective treatment. As we have mentioned in the introduction to motivate the need for optimal commitment, when the treatment is effective only for a subpopulation (cf. amber instances in our experiments), population selection is absolutely necessary, otherwise the treatment is most likely to be found ineffective and discarded after an experiment that targets the overall patient population as a whole, which would deny the treatment for the subpopulation that would have benefited from it. On the flip side of this, when the treatment happens to be effective for everyone (cf. green instances in our experiment), population selection might lead to conducting a restrictive experiment that only targets a small subpopulation, which this time, would deny the treatment for the rest of the patient population. This is essentially the reason behind the performance drop between Bayes-OCP and conventional RCTs in green instances (see Table 3 ). There is a trade-off between the performance in amber instances and green instances; and Bayes-OCP achieves a better balance between the two compared with a conventional RCT as evidenced by its superior performance when averaged over all environment instances (again see Table 3 ); although it causes a drop in performance for green instances, it more than makes up for that drop in amber instances. This balance is partly controlled by how optimistic Bayes-OCP is, which is in turn dictated by its hyper-parameter β-larger β leads to more optimistic decisions towards ongoing experiments, which favors green instances more then amber instances. We analyze the sensitivity of Bayes-OCP's performance to β in Appendix E; and for all configurations that we have evaluated, Bayes-OCP always performs significantly better than a conventional RCT.

REPRODUCIBILITY STATEMENT

All our experiments are based on synthetic simulations, hence our results can easily be reproduced by following the specifications in Section 6 without needing access to any private dataset. In order to aid reproducibility, we have rigorously described all our benchmarks in algorithmic form, similar to Algorithm 1, in Appendix J. Moreover, the source code necessary to reproduce our main results in Table 3 is made publicly available at https://github.com/alihanhyk/optcommit and https://github.com/vanderschaarlab/optcommit.

A A REINFORCEMENT LEARNING PERSPECTIVE

Optimal commitment can be viewed as a partially-observed reinforcement learning problem. Let the tuple (S, A, Z, T , O, R) denote a partially-observable Markov decision process (POMDP), where S is the (unobserved) state space, A is the action space, Z is the observation space, T ∈ ∆(S) S×S describes the transition dynamics, O ∈ Z S describes the observation dynamics, and R ∈ R S describes the reward dynamics. Then, OCPs as defined in Section 2 can also be expressed as a special class of POMDPs: Letting Y = R denote the outcome space for clarity, D = ∪ t=0 (X × Y) t be the space of all possible datasets D t , and O be the space of all possible outcome distributions Ω, • S . = {∅} ∪ (Ψ × D × O X ), where states s = (ψ, D t , {Ω x } x∈X ) consist of the ongoing experiment ψ ∈ Ψ, the dataset D t ∈ D collected by the ongoing experiment so far, and the true outcome distributions {Ω x ∈ O}, • A . = {∅} ∪ Ψ, • Z . = {∅} ∪ (X × Y), • T (s = ∅, a) . = ∅ and T (s = (ψ = (X, τ, ρ), D t , {Ω x }), a) . =              ∅ if a = ∅ s = (ψ, D t+1 = D t ∪ {x t+1 , y t+1 }, {Ω x }) s.t. x t+1 ∼ {η x|X }, y t+1 ∼ Ω xt+1 if a = ψ s = (ψ , D 1 = {x 1 , y 1 }, {Ω x }) s.t. x 1 ∼ {η x|X }, y 1 ∼ Ω x1 if a = ψ = (X , τ , ρ ) = ψ , • O(s = ∅) . = ∅ and O(s = (ψ, D t+1 = D t ∪ {x t+1 , y t+1 }, {Ω x })) . = (x t+1 , y t+1 ) , • R(s = ∅) . = 0 and R(s = (ψ = (X, τ, ρ), D t+1 , {Ω x })) . = -C ψ + R ψ • 1{t + 1 = τ } • ρ(D t+1 ) . Since ongoing experiments ψ are completely dictated by actions, and datasets D t collected by the ongoing experiments consist solely of observations (x t , y t ), the only unobserved component of the states in this POMDP is the true outcome distributions {Ω x } x∈X . Hence, the optimal policy should have the form π(ψ, D t , b) where b ∈ ∆(O X ) denotes beliefs over {Ω x }-that is posterior distributions over the true outcome distributions. For instance, when Ω x = N (θ x , 1) as we have been assuming in Sections 3 and 4, posteriors over mean outcomes {θ x } x∈X , which are given by parameters {µ x , σ 2 x } such that θ x | Di t ∼ N (µ x , σ x ), constitute as beliefs. Now although an OCP can be expressed as a POMDP, doing so is not particularly helpful in finding a solution. As we have already discussed in Section 3, the standard approach to solving a POMDP would be to use dynamic programming and compute the optimal value function V * and the optimal Q-function Q * iteratively according to Bellman optimality conditions Q * (b, a) = E s∼b,s ∼T (s,a),z =O(s ),b |{b,z } [R(s ) + V * (b )] V * (b) = max a∈A Q * (b, a) , where b |{b, z } denotes the updated belief b after having belief b and making a new observation z . When the state space S is discrete-or equivalently in our case, when the space of outcome distributions Ω ∈ O is discrete-V * and Q * happen to be convex functions, which makes it possible to perform these iterations efficiently by approximating V * and Q * using functions of the form f (b) = max{a i b + a i } (Spaan, 2012) . However, even in the simplest of cases where S is continuous-or equivalently, the space of outcome distributions Ω ∈ O is continuous, for instance when Ω x = N (θ x , 1)-the convexity of V * and Q * no longer generally holds. In fact, we show in Proposition 1 that neither V * nor -V * is convex with respect to beliefs b ≡ {t, µ} for at least one instance of the simplified OCP that we have analyzed in Section 3. Having said all that, one naive way to still compute V * and Q * iteratively according to Bellman optimality conditions is to discretize the belief space. We call this benchmark Discretized RL and we use it to perform futility stopping-that is when |Ψ| = 1, deciding whether to stop the only viable experiment design early or not. Otherwise, the dimensionality of the belief state explodes combinatorially with respect to |Ψ|. We consider the same setting that we have considered during our experiments in Section 6 and compare the performance of Futility Stopping with Discretized RL with that of Futility Stopping with Bayes-OCP. When implementing discretized RL, instead of keeping track of the entire dataset D t , we only keep track of the sufficient statistic µ t = (x t ,y t ) y t /|D t |, restrict the domain of µ t to interval [-0.3, 0.3], and discretize this interval into 100 equally-spaced bins. In addition to discretized RL, we also consider the approach proposed by Ni et al. (2022) for solving complex classes of POMDPs, which the optimal commitment problem is one of. Briefly, we employ deep Q-learning (as such, we call this benchmark Deep Q-learning) to train a neural network as an approximation of the Q-function Q * (b, a) using the POMDP we formalized earlier as a simulator. As the network architecture, we consider a multi-layer perceptron with two hidden layers of size 100 and with tanh activations. Results are given in Table 4 ; futility stopping with Bayes-OCP performs better than futility stopping with discretized RL as well as futility stopping with deep Q-learning. In addition to the bad performance of discretized RL, it is also not feasible to scale it to use cases beyond futility stopping. When |Ψ| > 1, we would need to keep separate track of each µ x . Moreover, we would also need to start keeping track of the scale parameters {σ x } since it would now be possible to distribute samples among multiple atomic-populations in multiple ways by targeting different populations with different experiments (we no longer would be able to treat the target population of the only viable experiment design as the only atomic-population there is). Noting that σ x 's already take discrete values with at least τ -many possible values, merely increasing the number of viable experiments |Ψ| from one to two causes the dimensionality of the belief space to jump from 100 to ∼ (100 × 600) 2 = 36 × 10 8 . Deep Q-learning performs even worse as it ignores all structure present in the optimal commitment problem, and instead, views the POMDP that describes it as a black-box simulator. 

B DISCUSSION ON ERROR CONTROL

Bayes-OCP is a method for managing experiments-that is deciding what experiment to conduct and when-as opposed to a hypothesis testing strategy in and of itself. Implication of this in terms of error control is that the type 1 error of any individual experiment run by Bayes-OCP can always be controlled by choosing an appropriate experimental design, in particular, by specifying an appropriate success criterion ρ. This individual-level error control built into the design of each experiment is not compromised by Bayes-OCP; no aggregate data from multiple experiments is ever fed into the success criterion of one alone (see Section 2, experiment ψ i is successful if ρ i (D i t ) = 1 not if ρ i ( Di t ) = 1); and any assumptions made by Bayes-OCP regarding outcomes in Section 4, whether accurate or inaccurate, have no effect on the results produced by an external success criterion. While Bayes-OCP does not compromise the individual error control of experiments, neither does it control their collective family-wise error rate (FWER)-that is the probability of at least one experiment among all that are conducted making a false discovery. Bayes-OCP views the problem of managing experiments purely as a utility maximization problem with no additional constraints. Within the scope of our discussion, the purpose of measuring FWER as a metric is to check empirically whether the individual error rates are inflated or not (note that FWER is a stricter notion of error than individual error rate). In practice, depending on how closely related the managed experiments are, controlling FWER might not necessarily be a concern. Let us highlight this: Any algorithm that manages experiments for long enough is bound to make at least one false discovery. Each year more than a thousand clinical trials are launched (that eventually post results) and more than half of these trials succeed (Takebe et al., 2018; Cli) . If the type 1 error rate of all these trials were %5, we would expect at least 25 false discoveries in a year, which is more than one hence it would have put FWER of all real-world trials at almost 100% when measured in a year-by-year basis. Of course, this is not problematic since not all clinical trials are related to each other closely enough to be considered a family.

B.1 EXPERIMENTS WITH FAMILY-WISE ERROR CONTROL

When controlling FWER is of concern, Bayes-OCP can easily be adapted to satisfy this additional constraint by first limiting the number of total experiments that can be conducted-that is putting an upper bound on n-and then using well-established methods for family-wise error control such as Bonferroni correction or alpha spending functions (Demets and Lan, 1994) to adjust the success criteria of the viable experiments in Ψ. We run additional experiments to evaluate the performance of Bayes-OCP with Bonferroni correction. We consider the same setting that we have considered during our experiments in Section 6 except for one difference: We limit the number of experiments that can be conducted by each algorithm as at most two, and we specify α = F -1 (0.975) for algorithms that can potentially run more than one experiment-namely, adaptive enrichment and (Greedy) Bayes-OCP-while we still specify α = F -1 (0.95) for algorithms that always run exactly one experiment-namely, RCT and futility stopping. These specifications ensure that FWER of all algorithms are bounded by %5. Results are given in Table 5 ; Bayes-OCP still performs the best when explicit control of FWER is required.

C EXPERIMENTS WITH MISSPECIFIED OUTCOME DISTRIBUTIONS

We consider the same setting that we have considered during our experiments in Section 6. Except now, the ground-truth outcome distributions are such that, when y ∼ Ω x , y = 1 with probability (θ x +1)/2 and y = -1 otherwise. In order to ensure that θ x ∈ [-1, 1], we also sample ground-truth mean outcomes so that θ x = 2p -1 where p is distributed according to Beta distribution with α = 979/200 and β = 801/200 (note that the mean and variance of θ x remains the same as in our original experiments). Despite the fact that outcomes are now distributed in a non-Gaussian way, we leave the implementation of Bayes-OCP unchanged, which still assumes that outcomes distributions are Gaussian. So, there is now a mismatch between the structure of outcome distributions specified as part of Bayes-OCP and the ground-truth outcome distributions. Results are given in Table 6 ; Bayes-OCP still does not inflate FWER despite the misspecified outcome distributions.

D EXPERIMENTS WITH MORE ATOMIC-POPULATIONS

We repeat our main experiments with more than two atomic-populations, specifically we set |X | = 10. As before, all atomic-populations have equal propensities such that η x = 1/10, ∀x ∈ X , and the meta-experimenter has the same positively-biased prior for the mean outcome associated with each atomic population: θ x ∼ N (0.1, 0.1), ∀x ∈ X . We randomly generated 100 environment (repeated five times to obtain error bars), and the results are given in Table 7 . We observe that Bayes-OCP still performs the best. These results confirm that a greedy approximation is suitable in identifying candidate experiments when the number of atomic-populations is large.

E SENSITIVITY ANALYSIS

Bayes-OCP has one hyper-parameter: β, which controls how optimistic the switching rule given in line 14 of Algorithm 1 is, from β = 1/2 meaning decisions are made greedily to β = 1 meaning decisions are so extremely optimistic that the original experiment will never be abandoned (as there will always be a chance that it succeeds). As with all online algorithms, tuning β is challenging since no a priori data would be available to perform cross validation. However, a nice feature of Bayes-OCP is that β is rather interpretable, it is the evidence required against the ongoing experiment: An alternative experiment is preferred over the ongoing experiment only if it is believed to be the better experiment with at least β-confidence. We evaluate the sensitivity of Bayes-OCP's performance to hyper-parameter β in Figure 5 ; Bayes-OCP performs better than an RCT for all configurations and better than adaptive enrichment for most configurations.

F FUTURE WORK

Extending the scope of Bayes-OCP One limitation of Bayes-OCP is that it only adapts the target population X ⊆ X of experiments but not the sample horizon τ or the success criterion ρ. We have chosen to focus on the selection of a target population since we believe the target population of an experiment to be the most critical design dimension to adjust adaptively. As we have already highlighted in our introduction, experiments with inflexible target populations can be problematic when responses to the treatment of interest are highly heterogeneous. That being said, the high-level strategy of our proposed algorithm should still be applicable to adapting design dimensions other than the target population, namely τ and ρ. At a high level, Bayes-OCP first identifies a candidate experiment and then compares the identified experiment to the ongoing experiment in a n optimistic manner. Regardless of the given set of viable experiment design Ψ, one could still follow the same strategy; the only complication would be to adapt how candidate experiments are identified depending on what design dimension varies across experiment designs in Ψ. For instance, when experiment designs varied in terms of X, a combinatorial search was required to identify good candidate experiments, for which we proposed a greedy strategy. When experiment designs vary in terms of ρ, a simple search over all possible ρ would suffice for identifying candidate experiment. The case where experiment designs vary in terms of τ is more complex; optimal τ for an experiment would be dependent on unknown effects θ x ; selecting a good candidate experiment would involve estimating the optimal τ given posteriors over θ x . This would be an interesting problem to explore as a future research direction. Performance guarantees While our theoretical results motivate the general use of an optimistic decision rule, they do not provide any guarantees about the performance of the specific rule we propose as part of Bayes-OCP. Another future research direction would be to prove an upper bound on the sub-optimality gap of Bayes-OCP.

G FURTHER DISCUSSION ON MAIN RESULTS

Table 3 report six metrics: Utility, FWER, Switches, Success, T-to-S, and T-to-F. We have already discussed the implications of Utility and FWER in Section 6. Here, we highlight other interesting phenomena regarding the remaining metrics. First, we see that Greedy Bayes-OCP switches experiments much more frequently compared with Bayes-OCP. This is because Greedy Bayes-OCP requires less evidence against the ongoing experiment when comparing it against an alternative experiment, whereas, Bayes-OCP favors the ongoing experiments more. Second, we see that a higher success probability does not necessarily also imply a higher utility. For instance, compare RCT with futility stopping, futility stopping is able to achieve higher utility than RCT by terminating risky experiments early and saving costs. However, this of course also means that futility stopping sees fewer experiments to completion hence leads to a lower success probability. Finally, we see that succeeding or failing early does not necessarily imply a higher utility either. Our best algorithm Bayes-OCP succeeds the latest on average as well as fails the latest compared with other benchmarks favoring red instances. This highlights the importance being conservative when making decisions, being optimistic, and favoring the status quo more than a potential adaptation.

H FURTHER DISCUSSION ON RELATED WORK

Multi-armed bandits The optimal commitment problem is similar to a multi-armed bandit (MAB) problem (Auer et al., 2002; Bubeck et al., 2012) in some aspects: Like arms in a MAB problem, each experiment design ψ has a random utility given by R ψ • ρ(D τ ) -τ C ψ , where D τ is the source of randomness, and the distribution of this utility is unknown. Also similar to a MAB problem, the overall goal is to sequentially select experiment designs (cf. arms) that yield the maximum cumulative utility. The main difference between the two problems is that, in a MAB problem, selecting an arm immediately reveals a sample from its random utility, while in optimal commitment, running an experiment ψ just for one time step only incurs a cost of C ψ ; observing a full sample of its random utility requires the experiment to be run until its completion for τ consecutive time steps, without selecting any other experiment design in the meantime. One can naively apply a MAB algorithm by viewing each viable experiment design as a unique arm, and by running experiments/arms selected by the algorithm until their completion to observe full samples from their unknown utility distributions. However, this obviously side steps the main question we want to answer in optimal commitment: When can we abandon a commitment-in this case, the decision to run an experiment/arm selection until its completion-before fully observing its outcome? Looking at optimal commitment from a MAB perspective reveals that there are two explore-exploit dilemmas present in optimal commitment: One is with respect to which experiment to select next, and the other is with respect to when to preemptively stop the current experiment (i.e. breaking a commitment). MAB algorithms address the former dilemma but not the latter.

Task replication in parallel computing

There is work (Ghare and Leutenegger, 2005; Wand et al., 2014; Wang et al., 2019) that focuses on the problem of when to kill existing tasks and relaunch them in parallel computing, which is related to optimal stopping/switching. However there, the focus is on reasoning about when a stochastic event (i.e. successful completion of a computational task) will occur without any extra information other than the fact that the event of interest has not occurred yet. In contrast, in our setting, the decision-maker needs to process a streaming set of samples to reason about the random outcome of an event that is scheduled to happen at a deterministic time point (here, the event is an experiment reaching its conclusion). This means that our problem has a completely different information structure when compared with the problem of task replication. More formally, we observe samples y t that are informative of whether ρ(D τ ) = 1 when τ is a fixed variable. In contrast, the problem of task replication would correspond to the setting where τ is a random variable with a known distribution and ρ = 1 always holds (hence no need to observe any samples y t ). Among optimal stopping/switching problems, the structure of our problem is more closely related to sequential hypothesis testing, which we have already covered in Section 5.

I PROOFS OF PROPOSITIONS I.1 PROOF OF PROPOSITION 1

We start by relating the optimal value function V * to the optimal Q-function Q * . Letting T * t = T π * t , V * (t, µ) = E[R • 1{T * t > τ } • ρ(µ τ ) -C • (min{T * t , τ } -t)|µ t = µ] = E[1{π * (t, µ t ) = ∅}(R • 1{T * t > τ } • ρ(µ τ ) -C • (min{T * t , τ } -t)) + 1{π * (t, µ t ) = Ψ 0 }(R • 1{T * t > τ } • ρ(µ τ ) -C • (min{T * t , τ } -t))|µ t = µ] = E[1{π * (t, µ t ) = ∅} • 0 + 1{π * (t, µ t ) = Ψ 0 }(R • 1{T * t+1 > τ } • ρ(µ τ ) -C • (min{T * t+1 , τ } -t))|µ t = µ] (11) = 1{π * (t, µ) = Ψ 0 } • E[R • 1{T * t+1 > τ } • ρ(µ τ ) -C • (min{T * t+1 , τ } -t)|µ t = µ] (12) = 1{Q * (t, µ) > 0} • Q * (t, µ) (13) = max{0, Q * (t, µ)} , where ( 11) holds since π * (t, µ t ) = ∅ =⇒ T * t = t and π * (t, µ t ) = Ψ 0 =⇒ T * t ≥ t + 1 =⇒ T * t = min{t ≥ t : π * (t , µ t ) = ∅} = min{t ≥ t + 1 : π * (t , µ t ) = ∅} = T * t+1 , (12) holds since µ τ ⊥ ⊥ 1{π * (t, µ t ) = Ψ 0 } and T * t+1 ⊥ ⊥ 1{π * (t, µ t ) = Ψ 0 } when conditioned on µ t = µ, and holds since π * (t, µ) = Ψ 0 ⇐⇒ Q * (t, µ) > 0. Intuitively, the maximum possible value at a given time is achieved either by stopping immediately or by conducting the experiment for at least one more time step and then following the optimal policy thereafter. Next, we observe that P{µ t+1 ≤ µ |µ t = µ} = P{µ t+1 ≤ µ |θ, µ t = µ}dP{θ|µ t = µ} = F µ - θ + tµ t + 1 ; 1 (t + 1) 2 f (θ -µ; 1 /t)dθ = 1{µ t+1 ≤ µ }f µ t+1 - θ + tµ t + 1 ; 1 (t + 1) 2 f (θ -µ; 1 /t)dµ t+1 dθ = 1{µ t+1 ≤ µ }f µ t+1 - y + (t + 1)µ t + 1 ; 1 (t + 1) 2 f (y; 1 /t)dµ t+1 dy = 1 x + y + (t + 1)µ t + 1 ≤ µ f (x; 1 /(t+1) 2 )f (y; 1 /t)dxdy = P X∼N (0, 1 /(t+1) 2 ) Y ∼N (0, 1 /t) X + Y t + 1 ≤ µ -µ = P X+Y /(t+1)∼N (0, 1 /t-1 /t+1) X + Y t + 1 ≤ µ -µ = F (µ -µ; 1 /t -1 /t+1) , where f (x; σ 2 ) = (1/ √ 2πσ 2 )e -(1/2)x 2 /σ 2 and F (x; σ 2 ) = (1/ √ 2πσ 2 ) x -∞ e -(1/2)x 2 /σ 2 dx are the p.d.f. and the c.d.f. of the Gaussian distribution with mean zero and variance σ 2 respectively. Hence dP{µ t+1 = µ |µ t = µ} = f (µ -µ; 1 /t -1 /t+1)dµ . Then, using the relationship between V * and Q * and the observation regarding P{µ t+1 ≤ µ |µ t = µ}, we drive the following Bellman optimality condition: Q * (t, µ) = E[R • 1{T * t+1 > τ } • ρ(µ τ ) -C • (min{T * t+1 , τ } -t)|µ t = µ] = -C + E[R • 1{T * t+1 > τ } • ρ(µ τ ) -C • (min{T * t+1 , τ } -t -1)|µ t = µ] = -C + E[R • 1{T * t+1 > τ } • ρ(µ τ ) -C • (min{T * t+1 , τ } -t -1)|µ t+1 = µ ] × dP(µ t+1 = µ |µ t = µ) = -C + V * (t + 1, µ )dP(µ t+1 = µ |µ t = µ) = -C + V * (t + 1, µ )f (µ -µ; 1 /t -1 /t+1)dµ = -C + V * (t + 1, µ + z)f (z; 1 /t -1 /t+1)dz (16) = -C + max{0, Q * (t + 1, µ + z)}f (z; 1 /t -1 /t+1)dz . For the problem setting where C = 1, R = 2, α = 0, and τ = 2, we have V * (1, µ) = max{0, -1 + V * (2, µ + z)f (z; 1 /2)dz} = max{0, -1 + 2 1{µ + z > 0}f (z; 1 /2)dz} = max 0, -1 + 2 ∞ -µ f (z; 1 /2)dz = max{0, -1 + 2F (µ; 1 /2)} = 0 if µ < 0 -1 + 2F (µ; 1 /2) if µ ≥ 0 . Notice that, for µ > 0, d 2 dµ 2 V * (1, µ) = d 2 dµ 2 -1 + 2F (µ; 1 /2) = d dµ 2f (µ; 1 /2) = -(4/π)µe -µ 2 < 0 hence V * (1, µ ) is concave at least on interval µ ∈ (0, ∞) and is not a convex function. Moreover, -V * (1, µ) cannot be a convex function-or equivalently V * (1, µ) cannot be a purely concave function-either: For an arbitrary µ ∈ (0, ∞), V * (1, µ) > 0 and V * (1, -µ) = 0 hence (1/2)V * (1, µ) + (1/2)V * (1, -µ) > 0 but V * (1, (1/2)µ + (1/2)(-µ)) = V * (1, 0) = 0.

I.2 PROOF OF PROPOSITION 2

We will prove the proposition by showing that (i) Q * (t, µ) is non-decreasing in µ-that is µ < µ =⇒ Q * (t, µ) ≤ Q * (t, µ ), (ii) lim t→∞ Q * (t, µ) = -(τ -t)C + R > 0, and (iii) lim t→-∞ Q * (t, µ) = -C < 0 for all t ∈ {1, . . . , τ -1} via mathematical induction. Notice that these three facts-together with the fact that Q * (t, µ) is a continuous function in µ for t ∈ {1, . . . , τ -1}-would imply the existence of a unique µ * t such that Q * (t, µ * t ) = 0, Q * (t, µ) > 0 ⇐⇒ µ > µ * t , and Q * (t, µ) ≤ 0 ⇐⇒ µ ≤ µ * t , which in turn would imply that π * (t, µ) = Ψ 0 if µ > µ * t ⇐⇒ Q * (t, µ) > 0 ∅ if µ ≤ µ * t ⇐⇒ Q * (t µ) ≤ 0 , meaning the optimal policy π * is indeed of "thresholding -type" as the proposition states. First, we observe the following base cases for t = τ -1: (i) Q * (τ -1, µ) is non-decreasing in µ. When µ < µ , Q * (τ -1, µ) = -C + V * (τ, µ + z)f (z; 1 /(τ-1) -1 /τ)dz (18) = -C + R 1{µ + z > α/ √ τ }f (z; 1 /(τ-1) -1 /τ)dz ≤ -C + R 1{µ + z > α/ √ τ }f (z; 1 /(τ-1) -1 /τ)dz (19) = Q * (τ -1, µ ) , where ( 18) is due to (16), and (19)  holds since µ+z > α/ √ τ =⇒ µ +z > µ+z > α/ √ τ . (ii) lim µ→∞ Q * (τ -1, µ) = -C + R > 0 since lim µ→∞ Q * (τ -1, µ) = lim µ→∞ -C + R 1{µ + z > α/ √ τ }f (z; 1 /(τ-1) -1 /τ)dz = lim µ→∞ -C + R ∞ α/ √ τ -µ f (z; 1 /(τ-1) -1 /τ)dz = -C + R f (z; 1 /(τ-1) -1 /τ)dz = -C + R . (iii) lim µ→-∞ Q * (τ -1, µ) = -C < 0 since lim µ→-∞ Q * (τ -1, µ) = lim µ→-∞ -C + R 1{µ + z > α/ √ τ }f (z; 1 /(τ-1) -1 /τ)dz = lim µ→-∞ -C + R ∞ α/ √ τ -µ f (z; 1 /(τ-1) -1 /τ)dz = lim µ→-∞ -C + R 1 - α/ √ τ -µ -∞ f (z; 1 /(τ-1) -1 /τ)dz = -C + R 1 -f (z; 1 /(τ-1) -1 /τ)dz = -C . Then, we show that the following inductive cases hold for t ∈ {τ -1, . . . , 2}: (i) Given that Q * (t, µ) is non-decreasing in µ, Q * (t -1, µ) is also non-decreasing in µ. Similar to the base case, when µ < µ , Q * (t -1, µ) = -C + max{0, Q * (t, µ + z)}f (z, 1 /(t-1) -1 /t)dz ≤ -C + max{0, Q * (t, µ + z)}f (z, 1 /(t-1) -1 /t)dz = Q * (t -1, µ ) , where ( 20) is due to (17). (ii) Given lim µ→∞ Q * (t, µ) = -(τ -t)C + R-and also given that Q * (t, µ) is non-decreasing in µ-we have lim µ→∞ Q * (t -1, µ) = -(τ -t + 1)C + R > 0, which can be shown using the sandwich theorem: Q * (t -1, µ) = -C + max{0, Q * (t, µ + z)}f (z, 1 /(t-1) -1 /t)dz ≤ -C + max{0, lim µ →∞ Q * (t, µ )}f (z, 1 /(t-1) -1 /t)dz ≤ -C + (-(τ -t)C + R) f (z, 1 /(t-1) -1 /t)dz = -(τ -t -1)C + R . Q * (t -1, µ) = -C + max{0, Q * (t, µ + z)}f (z, 1 /(t-1) -1 /t)dz ≥ -C + ∞ -|µ| 1/2 max{0, Q * (t, µ + z)}f (z, 1 /(t-1) -1 /t)dz ≥ -C + ∞ -|µ| 1/2 max{0, Q * (t, µ -|µ| 1/2 }f (z, 1 /(t-1) -1 /t)dz ≥ -C + Q * (t, µ -|µ| 1/2 ) ∞ -|µ| 1/2 f (z, 1 /(t-1) -1 /t)dz , Finally, observing that (iii) Given lim µ→-∞ Q * (t, µ) = -C < 0-and also given that Q * (t, µ) is non-decreasing in µ and lim µ→∞ Q * (t, µ) > 0 so that µ * t exists-we have lim µ→-∞ Q * (t -1, µ) = -C < 0, which again can be shown using the sandwich theorem: Q * (t -1, µ) = -C + max{0, Q * (t, µ + z)}f (z, 1 /(t-1) -1 /t)dz = -C + R • F µ 1 /(τ-1) -1 /τ , where F (x) . = F (x; 1) is the c.d.f. of the standard Gaussian distribution. Moreover, assuming (30) is true for t, it is also true for t -1: V (0) (t -1, µ) = -C + V (0) (t, µ + z)f (z; 1 /t-1 -1 /t)dz = -(τ -t + 1)C + R F ((µ + z)/ 1 /t -1 /τ; 1)f (z; 1 /(t-1) -1 /t)dz = -(τ -t + 1)C + R (µ+z)/ √ 1 /t-1 /τ -∞ f (z ; 1)f (z; 1 /(t-1) -1 /t)dz dz = -(τ -t + 1)C + R µ+z -∞ f (z ; 1 /t -1 /τ)f (z; 1 /(t-1) -1 /t)dz dz = -(τ -t + 1)C + R 1{z ≤ µ + z}f (z ; 1 /t -1 /τ)f (z; 1 /(t-1) -1 /t)dz dz = -(τ -t + 1)C + R • P Z∼N (0, 1 /(t-1)-1 /t) Z ∼N (0, 1 /t-1 /τ) {Z ≤ µ + Z} = -(τ -t + 1)C + R • P Z -Z √ 1 /(t-1)-1 /τ ∼N (0,1) Z -Z 1 /(t-1) -1 /τ ≤ µ 1 /(t-1) -1 /τ = -(τ -t + 1)C + R • F µ 1 /(t-1) -1 /τ . Therefore, (30) indeed holds for all t ∈ {1, . . . , τ -1}. Next, we observe that V (0) (t, µ) has a root at µ = F -1 ((τ -t)C/R) 1 /t -1 /τ provided that (τ -t)C/R ∈ (0, 1), which is the case for all t ∈ {1, . . . , τ -1} since τ C < R. Moreover, V (0) (t, µ) is a strictly increasing function in µ. Hence, there exists a unique µ greedy t for all t ∈ {1, . . . , τ -1} such that V (0) (t, µ greedy t ) > 0 and V (0) (t, µ) > 0 ⇐⇒ µ > µ greedy t . In other words, π greedy is also a thresholding-type policy as the proposition states. Finally, we have V (0) (t, µ * t ) = Q (0) (t, µ * t ) ≤ Q * (t, µ * t ) = 0 hence µ * t ≤ µ greedy t . This is because, by definition, Q * (t, µ) ≥ Q π (t, µ) for all t, µ for any given policy π, including π (0) .

I.4 PROOF OF PROPOSITION 4

As in the proof of Proposition 3, we take α = 0 for notational brevity. Once again, this is without any loss of generality as, by simply shifting each value function and Q-function by α/ √ τ with respect to µ, all of the following arguments would still hold. Remember that the formula we derived for V (0) (t, µ) in (30) holds when α = 0. We start by deriving two bounds on the optimal Q-function Q * (t, µ): (i) a lower bound and (ii) an upper bound. For the lower bound, it is sufficient to observe that V (0) (t, µ) = Q (0) (t, µ) ≤ Q * (t, µ) , which holds since, by definition, Q * (t, µ) ≥ Q π (t, µ) for all t, µ for any given policy π. For the upper bound, we use mathematical induction to show that Q * (t, µ) ≤ (τ -t-1)C +V (0) (t, µ). First, for the base case of τ -1, Q * (τ -1, µ) = -C + V * (τ, µ + z)f (z; 1 /t -1 /t+1)dz = -C + 1{µ + z > α}f (z; 1 /t -1 /t+1)dz = -C + V (0) (τ, µ + z)f (z; 1 /t -1 /t+1)dz = V (0) (τ -1, µ) , where ( 31) is due to (16), and ( 32) is due to (29). Then, for the inductive case, assuming Q * (t, µ) ≤ (τ -t -1)C + V (0) (t, µ), Q * (t -1, µ) = -C + max{0, Q * (t, µ + z)}f (z, 1 /(t-1) -1 /t)dz (33) ≤ Q * (t, µ + z)f (z, 1 /(t-1) -1 /t)dz (34) ≤ (τ -t -1)C + V (0) (t, µ + z)f (z, 1 /(t-1) -1 /t)dz = (τ -t)C + V (0) (t -1, µ + z) , where ( 33) is due to (17), and (34) holds since -C ≤ Q * (t, µ) implies that max{0, Q * (t, µ)} ≤ max{C + Q * (t, µ), Q * (t, µ)} ≤ C + Q * (t, µ). Define µ + t and µ - t such that V (0) (t, µ + t ) = 0 ⇐⇒ µ + t = F -1 (τ -t) C R 1 t - 1 τ (τ -t -1)C + V (0) (t, µ - t ) = 0 ⇐⇒ µ - t = F -1 C R 1 t - 1 τ , which we are able to write in closed form using the formula we derived for V (0) (t, µ) in ( 30) during the proof of Proposition 3. By definition, µ greedy t = µ + t . Moreover, (i) V (0) (t, µ * t ) ≤ Q * (t, µ * t ) = 0 = V (0) (t, µ + t ) due to our lower bound, hence µ * t ≤ µ + t (remember that V (0) (t, µ) was a strictly increasing function in µ), and (ii) (τ -t -1)C + V (0) (t, µ - t ) = 0 = Q * (t, µ * t ) ≤ (τ -t -1)C + V (0) (t, µ * t ) due to our upper bound, hence V (0) (t, µ - t ) ≤ V (0) (t, µ * t ) meaning µ - t ≤ µ * t . Putting together these facts, and also the fact that µ * t ≤ µ greedy t , we obtain |µ * t -µ greedy t | ≤ µ + t -µ - t as the proposition states.

J BENCHMARKING ALGORITHMS

Algorithm 2 Adaptive Enrichment, Futility Stopping with Bayes-OCP, Greedy Bayes-OCP 1: Initialize µx and σ 2 x for all x ∈ X 2: X ← X , t ← 0, D0 ← ∅ 3: Start experiment ψ = (X , τ, ρ) 4: loop: 5: t ← t + 1 6: Observe xt, yt 7: Dt ← Dt-1 ∪ {xt, yt} 8: 1/σ 



Proofs of all propositions are given in Appendix I.



Figure 1: Optimal value function V * (t, µ) for C = 1, R = 10, τ = 4, and α = 0. It can clearly be seen that neither V * nor -V * is convex in µ (cf. Proposition 1).

Figure4: Optimism of Bayes-OCP. Greedy Bayes-OCP (incorrectly) stops due to initial noise in a green instance while Bayes-OCP does not stop since it is more optimistic (as the optimal policy should, cf. Proposition 3).

Figure 5: Utility achieved by Bayes-OCP for various values of hyper-parameter β.

(z, 1 /(t-1) -1 /t)dz = -C + lim µ →∞ Q * (t, µ ) f (z, 1 /(t-1) -1 /t)dz = -(τ -t -1)C + R ,together with bounds (21) and (22), we obtain lim µ→∞ Q * (t -1, µ) = -(τ -t + 1)C + R.

Equivalent concepts across different domains. OCP can model scenarios other than adaptive experimentation.

):

Comparison of related experiment designs. Optimal commitment is the only design that aims to decide both when an alternative population should be targeted-as opposed to switching the target population only at a fixed decision point-as well as which population to target among many potential candidates-as opposed to a simple binary decision of "overall population vs. sub-population" or "go vs. no-go". Jarrett and van der Schaar (2020) consider active versions of SHT where the agent is able to choose what type of observations to make. Our case is "passive" in the sense that the meta-experimenter cannot influence what kind of samples they are going to receive from the currently running experiment. Finally, optimal commitment, and SHT in general, can be thought of as more structured instances of partially-observed reinforcement learning (RL). As we have discussed earlier, the standard technique here relies on convex reward structures whereas the optimal value function in our case is not convex in general (cf. Proposition 1, see Appendix A for a detailed discussion).

Performance comparison in various environment instances. Bayes-OCP has the highest expected utility-and a smaller FWER then conventional RCTs-when averaged over all environment instances. This is because Bayes-OCP is a balanced design whose structure does not favor certain environment instances over others. As an example, compare it with conventional RCTs: RCTs do not have an adaptive structure hence they favor green environments where it is not necessary to adapt the target population of the initial experiment. * Instances favored/addressed partially

Unlike other benchmarks, Bayes-OCP strikes a good balance in prioritizing all environment instances at the same time. This is because Bayes-

Performance comparison between Futility Stopping with RL-based algorithms and with Bayes-OCP.

Performance comparison of algorithms with family-wise error control.

Performance comparison when the ground-truth outcome distributions are not Gaussian.

Performance comparison when the number of atomic-populations is 10.

2 x t ← 1/σ 2 x t + 1 9: µx t ← µx t + (yt -µx t )σ 2 ← argmax x∈X\X E θx∼N (µx,σ 2 x ) [G (0) (X ∪ {x}; {θx})] 13: if E θx∼N (µx,σ 2 x ) [G (0) (X ∪ {x * }; {θx})] > E θx∼N (µx,σ 2 x ) [G (0) (X ; {θx})]:Conditional power function P(X, Dt; {θx}) The probability of a hypothesis test being successful conditioned on mean outcomes {θx} Expected utility function G(X, Dt; {θx}) The expected utility of fully committing to an experiment and waiting until it terminates when the experiment targets population X, is currently at time step t, and collected dataset DtPosteriors N (µx, σ 2 x )Posterior distributions over mean outcomes {θx} maintained by Bayes-OCP such that θx| D ∼ N (µx, σ 2 x )

ACKNOWLEDGMENTS

We would like to thank the reviewers and the members of the van der Schaar lab, for their valuable input, comments, and suggestions. This work was supported by the US Office of Naval Research (ONR) and the National Science Foundation (NSF, grant number 1722516).

≥ -C .

(23)where ( 24) holds since Q * (t, µ + z) > 0 if and only if z > µ * t -µ and max{0, Q * (t, µ + z)} = 0 otherwise, and (25) holds sincetogether with bounds ( 23) and ( 26), we obtainWhen put together, the base cases and the inductive cases above imply that conditions (i-iii) hold for all t ∈ {1, . . . , τ -1}-hence µ * t exists for all t ∈ {1, . . . , τ -1}-which concludes our proof.

I.3 PROOF OF PROPOSITION 3

First, we prove the existence of µ greedy t for all t ∈ {0, . . . , τ -1} by driving an analytical formula forwhere ( 27) holds since π (0) (t, µ) = Ψ 0 for all t and µ hence it is always the case that T (0) t = ∞, and ( 28) is due to (15).In the remainder of our proofs, we take α = 0 for notational brevity. This is without any loss of generality as, by simply shifting each value function and Q-function by α/ √ τ with respect to µ, all of the following arguments would still hold. For α = 0, we show thatfor all t ∈ {1, . . . , τ -1} via mathematical induction. Note that (30) is true for t = τ -1: 

