A REINFORCEMENT LEARNING FRAMEWORK FOR TIME-DEPENDENT CAUSAL EFFECTS EVALUATION IN A/B TESTING
Anonymous authors
Paper under double-blind review

Abstract

A/B testing, or online experimentation, is a standard business strategy for comparing a new product with an old one in the pharmaceutical, technological, and traditional industries. The aim of this paper is to introduce a reinforcement learning framework for carrying out A/B testing in two-sided marketplace platforms, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both synthetic data and a real-world data example obtained from a technological company to illustrate its advantages over current practice.

1. INTRODUCTION

A/B testing, or online experimentation, is a business strategy used to compare a new product with an old one in the pharmaceutical, technological (e.g., Google, Amazon, or Facebook), and traditional industries. Most works in the literature focus on the setting in which observations are independent across time (see e.g., Johari et al., 2015; 2017, and the references therein). In many applications, however, the treatment at a given time can impact future outcomes. For instance, in a ride-sharing company (e.g., Uber), an order dispatching strategy not only affects its immediate income, but also impacts the spatial distribution of drivers in the future, thus affecting its future income. In medicine, it usually takes time for drugs to distribute to the site of action. The independence assumption is thus violated.

The focus of this paper is to test the difference in long-term treatment effects between two products in online experiments. There are three major challenges. (i) The first lies in modelling the temporal dependence between treatments and outcomes. (ii) Running each experiment takes considerable time, and the company wishes to terminate the experiment as early as possible in order to save both time and budget. (iii) Treatments should be allocated so as to maximize the cumulative outcomes and to detect the alternative more efficiently; the testing procedure shall therefore allow the treatment to be adaptively assigned.

We summarize our contributions as follows. First, we introduce a reinforcement learning (RL, see e.g., Sutton & Barto, 2018, for an overview) framework for A/B testing. In addition to the treatment-outcome pairs, it is assumed that there is a set of time-varying state confounding variables. We model the state-treatment-outcome triplet using a Markov decision process (MDP, see e.g., Puterman, 1994) to characterize the association between treatments and outcomes across time.
Specifically, at each time point, the decision maker selects a treatment based on the observed state. The system responds by giving the decision maker a corresponding outcome and moving into a new state in the next time step. In this way, past treatments have an indirect influence on future rewards through their effect on future state variables. In addition, the long-term treatment effects can be characterized by the value functions (see Section 3.1 for details) that measure the discounted cumulative gain from a given initial state. Under this framework, it suffices to evaluate the difference between two value functions to compare different treatments. This addresses the challenge mentioned in (i).

Second, we propose a novel sequential testing procedure for detecting the difference between two value functions. To the best of our knowledge, this is the first work on developing valid sequential tests in the RL framework. Our proposed test integrates temporal-difference learning (see e.g., Precup et al., 2001; Sutton et al., 2008), the α-spending approach (Lan & DeMets, 1983) and the bootstrap (Efron & Tibshirani, 1994) to allow for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs, including the Markov design, the alternating-time-interval design and the adaptive design (see Section 4.4). This addresses the challenges in (ii) and (iii).

Third, we systematically investigate the asymptotic properties of our testing procedure. We show that our test not only maintains the nominal type I error rate, but also has non-negligible power against local alternatives. To our knowledge, these results have not been established in RL. Finally, we introduce a potential outcome framework for MDPs and state all the conditions needed to guarantee that the value functions are estimable from the observed data.

2. RELATED WORK

There is a vast literature on RL in which various algorithms have been proposed for an agent to learn an optimal policy while interacting with an environment. Our work is closely related to the literature on off-policy evaluation, whose objective is to estimate the value of a new policy based on data collected under a different policy. Popular methods include Thomas et al. (2015); Jiang & Li (2016); Thomas & Brunskill (2016); Liu et al. (2018); Farajtabar et al. (2018); Kallus & Uehara (2019). These methods require the treatment assignment probability (propensity score) to be bounded away from 0 and 1. As such, they are inapplicable to the alternating-time-interval design, which is the treatment allocation strategy in our real data application.

Our work is also related to temporal-difference learning with function approximation. Convergence guarantees for the value function estimators have been derived by Sutton et al. (2008) under the setting of independent noise and by Bhandari et al. (2018) for Markovian noise. However, uncertainty quantification for the resulting value function estimators has been less studied. Such results are critical for carrying out A/B testing. Luckett et al. (2019) outlined a procedure for estimating the value under a given policy. Shi et al. (2020b) developed a confidence interval for the value. However, these methods do not allow for sequential monitoring or online updating.

In addition to the literature on RL, our work is related to a line of research on evaluating time-varying causal effects (see e.g., Robins, 1986; Boruvka et al., 2018; Ning et al., 2019; Rambachan & Shephard, 2019; Viviano & Bradic, 2019; Bojinov & Shephard, 2020). However, none of the above cited works uses an RL framework to characterize treatment effects. In particular, Bojinov & Shephard (2020) proposed importance sampling (IS) based methods to test the null hypothesis of no (average) temporal causal effects in time series experiments.
Their causal estimand is different from ours, since they focused on p-lag treatment effects, whereas we consider the long-term effects characterized by the value function. Moreover, their method requires the propensity score to be bounded away from 0 and 1, and is thus not valid for our applications.

Furthermore, our work is related to the literature on sequential analysis (see e.g., Jennison & Turnbull, 1999, and the references therein), in particular, the α-spending function approach that allocates the total allowable type I error rate across interim stages according to an error-spending function. Most test statistics in classical sequential analysis have the canonical joint distribution (see Equation (3.1), Jennison & Turnbull, 1999) and their associated stopping boundary can be recursively updated via numerical integration. In our setup, however, the test statistics no longer have the canonical joint distribution when an adaptive design is used, due to the existence of carryover effects in time. We discuss this in detail in Appendix C. To resolve this issue, we propose a scalable bootstrap-assisted procedure to determine the stopping boundary (see Section 4.3).

Recently, there is a growing literature on bringing classical sequential analysis to A/B testing. In particular, Johari et al. (2015) proposed an always-valid test based on the classical mixture sequential probability ratio test (mSPRT). Kharitonov et al. (2015) proposed modified versions of the O'Brien & Fleming and MaxSPRT sequential tests. Deng et al. (2016) studied A/B testing under a Bayesian framework. Abhishek & Mannor (2017) developed a bootstrap mSPRT. These tests cannot detect carryover effects in time, leading to low statistical power in our setup. See the toy examples in Section 4.1 for a detailed illustration. In addition, we note that there is a line of research on bandits/RL with causal graphs (see e.g., Lee & Bareinboim, 2018; 2019).
We remark that the problems considered and the solutions developed in this article differ from these works. Specifically, these works apply causal inference methods to deal with unmeasured confounders in bandit/RL settings, whereas we apply the RL framework to evaluate time-dependent causal effects. Finally, we relax several key conditions used in Ertefaie (2014) and Luckett et al. (2019), which presented a potential outcome framework for MDPs (see Section 3.1 for details). Specifically, Ertefaie (2014) and Luckett et al. (2019) imposed the Markov conditions on the observed data rather than the potential outcomes, while assuming that the outcome at time $t$ is a deterministic function of the state variables at times $t$ and $t+1$ and the treatment at time $t$.

3. PROBLEM FORMULATION

3.1. A POTENTIAL OUTCOME FRAMEWORK FOR MDP

For simplicity, we assume that there are only two treatments (actions, products), coded as 0 and 1, respectively. For any $t\ge 0$, let $\bar{a}_t=(a_0,a_1,\cdots,a_t)^\top\in\{0,1\}^{t+1}$ denote a treatment history vector up to time $t$. Let $\mathbb{S}$ denote the support of the state variables and $S_0$ denote the initial state variable. We assume $\mathbb{S}$ is a compact subset of $\mathbb{R}^d$. For any $(\bar{a}_{t-1},\bar{a}_t)$, let $S_t^*(\bar{a}_{t-1})$ and $Y_t^*(\bar{a}_t)$ be the counterfactual state and counterfactual outcome, respectively, that would occur at time $t$ had the agent followed the treatment history $\bar{a}_t$. The set of potential outcomes up to time $t$ is given by $W_t^*(\bar{a}_t)=\{S_0, Y_0^*(a_0), S_1^*(a_0),\cdots,S_t^*(\bar{a}_{t-1}), Y_t^*(\bar{a}_t)\}$. Let $W^*=\cup_{t\ge 0,\,\bar{a}_t\in\{0,1\}^{t+1}} W_t^*(\bar{a}_t)$ be the set of all potential outcomes.

A deterministic policy $\pi$ is a function that maps the space of state variables to the set of available actions. For any such $\pi$, let $\bar{\pi}_t$ denote the treatment history up to time $t$ assigned according to $\pi$. We use $S_t^*(\bar{\pi}_{t-1})$ and $Y_t^*(\bar{\pi}_t)$ to denote the associated potential state and outcome that would occur at time $t$ had the agent followed $\pi$. The goodness of a policy $\pi$ is measured by its value function, $V(\pi; s)=\sum_{t\ge 0}\gamma^t \mathbb{E}\{Y_t^*(\bar{\pi}_t)\,|\,S_0=s\}$, where $0<\gamma<1$ is a discount factor that reflects the trade-off between immediate and future outcomes. Note that our definition of the value function is slightly different from those in the existing literature (see Sutton & Barto, 2018, for example). Specifically, $V(\pi; s)$ is defined through potential outcomes rather than the observed data. Similarly, we define the Q-function by $Q(\pi; a, s)=\sum_{t\ge 0}\gamma^t \mathbb{E}\{Y_t^*(\bar{\pi}_t(a))\,|\,S_0=s\}$, where $\{\bar{\pi}_t(a)\}_{t\ge 0}$ denotes the treatment history in which the initial action equals $a$ and all other actions are assigned according to $\pi$. In our setup, we focus on two nondynamic policies that assign the same treatment at each time point.
We use their value functions (denoted by $V(1;\cdot)$ and $V(0;\cdot)$) to measure their long-term treatment effects. Meanwhile, our proposed method is equally applicable to the dynamic policy scenario; see Section B.1 for details. To quantitatively compare the two policies, we introduce the Average Treatment Effect (ATE) based on their value functions, which relates RL to causal inference.

Definition. For a given reference distribution function $G$, the ATE is defined as the integrated difference between the two value functions, i.e., $\mathrm{ATE}=\int_s \{V(1;s)-V(0;s)\}G(ds)$.

The focus of this paper is to test the hypotheses $H_0: \tau_0=\mathrm{ATE}\le 0$ versus $H_1: \tau_0=\mathrm{ATE}>0$. When $H_0$ holds, the new product is no better than the old one.
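To make the value-function comparison concrete, the following minimal sketch estimates the ATE by Monte Carlo in a toy scalar environment in which the treatment shifts the next state and the outcome equals the state (the dynamics, the helper names `simulate_value` and `monte_carlo_ate`, and the parameter values $\delta=0.2$, $\gamma=0.6$ are all illustrative assumptions, not the paper's setup):

```python
import random

def simulate_value(action, s0, gamma=0.6, horizon=50, delta=0.2, rng=None):
    """Discounted return of the nondynamic policy that always assigns `action`,
    starting from S_0 = s0, in a toy AR(1)-style environment."""
    rng = rng or random
    s, value = s0, 0.0
    for t in range(horizon):
        # toy dynamics: the treatment shifts the next state; the outcome is the state
        s = 0.5 * s + delta * action + 0.5 * rng.gauss(0.0, 1.0)
        value += gamma ** (t + 1) * s
    return value

def monte_carlo_ate(n=5000, seed=0):
    """ATE = integral of {V(1;s) - V(0;s)} over the initial-state distribution G,
    approximated by averaging over n initial states drawn from G = N(0, 0.25)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        s0 = 0.5 * rng.gauss(0.0, 1.0)
        total += simulate_value(1, s0, rng=rng) - simulate_value(0, s0, rng=rng)
    return total / n

ate = monte_carlo_ate()
```

Under these toy dynamics the difference is positive whenever $\delta>0$, which is exactly the alternative $H_1$ the test targets.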

3.2. IDENTIFIABILITY OF ATE

One of the most important questions in causal inference is the identifiability of causal effects. In this section, we present sufficient conditions that guarantee the identifiability of the value function. In practice, with the exception of $S_0$, the set $W^*$ cannot be observed; instead, at time $t$, we observe the state-action-outcome triplet $(S_t, A_t, Y_t)$. For any $t\ge 0$, let $\bar{A}_t=(A_0, A_1,\cdots,A_t)^\top$ denote the observed treatment history. We first introduce two conditions that are commonly assumed in multi-stage decision making problems (see e.g., Murphy, 2003; Zhang et al., 2013; Kennedy, 2019).

(CA) Consistency assumption: $S_{t+1}=S_{t+1}^*(\bar{A}_t)$ and $Y_t=Y_t^*(\bar{A}_t)$ for all $t\ge 0$, almost surely.

(SRA) Sequential randomization assumption: $A_t$ is independent of $W^*$ given $S_t$ and $\{S_j, A_j, Y_j\}_{0\le j<t}$.

The SRA implies that there are no unmeasured confounders, and it automatically holds in online experiments, in which the treatment assignment mechanism is pre-specified. In SRA, we allow $A_t$ to depend on the observed data history $S_t$, $\{S_j, A_j, Y_j\}_{0\le j<t}$; thus, the treatments can be adaptively chosen. We next introduce two conditions that are unique to the RL setting.

(MA) Markov assumption: there exists a Markov transition kernel $\mathcal{P}$ such that for any $t\ge 0$, $\bar{a}_t\in\{0,1\}^{t+1}$ and $\mathcal{S}\subseteq\mathbb{R}^d$, we have $\Pr\{S_{t+1}^*(\bar{a}_t)\in\mathcal{S}\,|\,W_t^*(\bar{a}_t)\}=\mathcal{P}(\mathcal{S}; a_t, S_t^*(\bar{a}_{t-1}))$.

(CMIA) Conditional mean independence assumption: there exists a function $r$ such that for any $t\ge 0$ and $\bar{a}_t\in\{0,1\}^{t+1}$, we have $\mathbb{E}\{Y_t^*(\bar{a}_t)\,|\,S_t^*(\bar{a}_{t-1}), W_{t-1}^*(\bar{a}_{t-1})\}=r(a_t, S_t^*(\bar{a}_{t-1}))$.

These two conditions are central to the empirical validity of RL. Specifically, under these two conditions, there exists an optimal policy $\pi^*$ such that $V(\pi^*; s)\ge V(\pi; s)$ for any $\pi$ and $s$. We observe that Ertefaie (2014) and Luckett et al. (2019) imposed the Markov conditions on the observed data rather than the potential outcomes.
When CA and SRA hold, these assumptions are equivalent. When SRA is violated, their Markov assumptions could fail, as the treatment then depends on unobserved confounders and the observed data process is no longer Markovian. CMIA requires past treatments to affect $Y_t^*(\bar{a}_t)$ only through their impact on $S_t^*(\bar{a}_{t-1})$. In other words, the state variables shall be chosen to include those that serve as important mediators between past treatments and current outcomes. Under MA, CMIA is automatically satisfied when $Y_t^*(\bar{a}_t)$ is a deterministic function of $(S_{t+1}^*(\bar{a}_t), a_t, S_t^*(\bar{a}_{t-1}))$ that measures the system's status at time $t+1$. The latter condition is commonly imposed in the reinforcement learning literature.

To conclude this section, we derive a version of the Bellman equation for the Q-function under the potential outcome framework. Specifically, for $a'\in\{0,1\}$, let $Q(a';\cdot,\cdot)$ denote the Q-function where treatment $a'$ is repeatedly assigned after the initial decision.

Lemma 1. Under MA, CMIA, CA and SRA, for any $t\ge 0$, $a'\in\{0,1\}$ and any function $\phi:\mathbb{S}\times\{0,1\}\to\mathbb{R}$, we have $\mathbb{E}[\{Q(a'; A_t, S_t)-Y_t-\gamma Q(a'; a', S_{t+1})\}\phi(S_t, A_t)]=0$.

Sketch of Proof: Under MA, CMIA, CA and SRA, the Q-function defined under the potential outcome framework coincides with that defined on the observed data. Lemma 1 thus follows from the classical Bellman equation (see Equation (4.6) in Sutton & Barto, 2018).

Lemma 1 implies that the Q-function is estimable from the observed data. Specifically, an estimating equation can be constructed based on Lemma 1, and the Q-function can be learned by solving this estimating equation. Note that $V(a; s)=Q(a; a, s)$ and $\tau_0$ is completely determined by the value function $V$. As a result, $\tau_0$ is estimable from the observed data as well. We remark that the positivity assumption is not needed in Lemma 1. Our procedure can thus handle the case where treatments are deterministically assigned, i.e., the behavior policy $b$ is deterministic.
This is because MA and CMIA assume that the system dynamics are invariant across time. To elaborate, note that the discounted value function is completely determined by the transition kernel $\mathcal{P}$ and the reward function $r$. These quantities can be consistently estimated under certain conditions (see C1-C3 in Appendix E), regardless of whether $b$ is deterministic or not.

4. TESTING PROCEDURE

We first introduce two toy examples to illustrate the limitations of existing A/B testing methods. We next present our method and prove its consistency under a variety of different treatment designs.

4.1. TOY EXAMPLES

Existing A/B testing methods can detect short-term treatment effects, but fail to identify long-term effects. To elaborate, we introduce two examples below.

Example 1. $S_t=0.5\varepsilon_t$, $Y_t=S_t+\delta A_t$ for any $t\ge 1$, and $S_0=0.5\varepsilon_0$.

Example 2. $S_t=0.5S_{t-1}+\delta A_{t-1}+0.5\varepsilon_t$, $Y_t=S_t$ for any $t\ge 1$, and $S_0=0.5\varepsilon_0$.

In both examples, the random errors $\{\varepsilon_t\}_{t\ge 0}$ follow independent standard normal distributions and the parameter $\delta$ describes the degree of treatment effects. Suppose $\delta>0$; then $H_1$ holds. In Example 1, the observations are independent and there are no carryover effects at all. In this case, both the existing A/B tests and the proposed test are able to discriminate $H_1$ from $H_0$. In Example 2, however, treatments have delayed effects on the outcomes. Specifically, $Y_t$ does not depend on $A_t$, but is affected by $A_{t-1}$ through $S_t$. Existing tests will fail to detect $H_1$, as the short-term conditional treatment effect is zero.

4.2. OUR TEST STATISTIC

Let $\mathcal{Q}=\{\Psi^\top(s)\beta_{a',a} : \beta_{a',a}\in\mathbb{R}^q\}$ be a large linear approximation space for $Q(a'; a, s)$, where $\Psi(\cdot)$ is a vector containing $q$ basis functions on $\mathbb{S}$. The dimension $q$ is allowed to depend on the number of samples $T$ to alleviate the effects of model misspecification. Let us suppose $Q\in\mathcal{Q}$ for a moment. By Lemma 1, there exists some $\beta^*=(\beta_{0,0}^{*\top},\beta_{0,1}^{*\top},\beta_{1,0}^{*\top},\beta_{1,1}^{*\top})^\top$ such that
$$\mathbb{E}[\{\Psi^\top(S_t)\beta^*_{a',a}-Y_t-\gamma\Psi^\top(S_{t+1})\beta^*_{a',a'}\}\Psi(S_t)\mathbb{I}(A_t=a)]=0,\qquad \forall a, a'\in\{0,1\},$$
where $\mathbb{I}(\cdot)$ denotes the indicator function. Let $\xi(s,a)=\{\Psi^\top(s)\mathbb{I}(a=0),\, \Psi^\top(s)\mathbb{I}(a=1)\}^\top$. The above equations can be rewritten as $\mathbb{E}(\Sigma_t\beta^*)=\mathbb{E}\,\eta_t$, where $\Sigma_t$ is the block diagonal matrix with diagonal blocks $\xi(S_t,A_t)\{\xi(S_t,A_t)-\gamma\xi(S_{t+1},0)\}^\top$ and $\xi(S_t,A_t)\{\xi(S_t,A_t)-\gamma\xi(S_{t+1},1)\}^\top$, and $\eta_t=\{\xi^\top(S_t,A_t)Y_t,\, \xi^\top(S_t,A_t)Y_t\}^\top$. Let $\widehat{\Sigma}(t)=t^{-1}\sum_{j<t}\Sigma_j$ and $\widehat{\eta}(t)=t^{-1}\sum_{j<t}\eta_j$. It follows that $\mathbb{E}\{\widehat{\Sigma}(t)\beta^*\}=\mathbb{E}\{\widehat{\eta}(t)\}$. This motivates us to estimate $\beta^*$ by $\widehat{\beta}(t)=\{\widehat{\beta}^\top_{0,0}(t),\widehat{\beta}^\top_{0,1}(t),\widehat{\beta}^\top_{1,0}(t),\widehat{\beta}^\top_{1,1}(t)\}^\top=\widehat{\Sigma}^{-1}(t)\widehat{\eta}(t)$.
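As a sanity check on the toy examples above, a quick simulation of Example 2 under i.i.d. Bernoulli(0.5) assignment (the helper names `simulate_example2` and `lagged_mean_diff` are ours, purely for illustration) shows why short-term comparisons fail there: the lag-0 mean difference is essentially zero, while the lag-1 difference recovers $\delta$.

```python
import random

def simulate_example2(n=200_000, delta=0.2, seed=1):
    """Example 2: S_t = 0.5*S_{t-1} + delta*A_{t-1} + 0.5*eps_t, Y_t = S_t,
    with i.i.d. Bernoulli(0.5) treatment assignment."""
    rng = random.Random(seed)
    s, a_prev = 0.5 * rng.gauss(0.0, 1.0), 0
    A, Y = [], []
    for _ in range(n):
        a = rng.randint(0, 1)
        s = 0.5 * s + delta * a_prev + 0.5 * rng.gauss(0.0, 1.0)
        A.append(a)
        Y.append(s)          # Y_t = S_t depends on A_{t-1}, not A_t
        a_prev = a
    return A, Y

def lagged_mean_diff(A, Y, lag):
    """Empirical E[Y_t | A_{t-lag} = 1] - E[Y_t | A_{t-lag} = 0]."""
    paired = list(zip(A[:len(A) - lag] if lag else A, Y[lag:]))
    y1 = [y for a, y in paired if a == 1]
    y0 = [y for a, y in paired if a == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

A, Y = simulate_example2()
lag0 = lagged_mean_diff(A, Y, 0)   # short-term effect: close to zero
lag1 = lagged_mean_diff(A, Y, 1)   # delayed effect: close to delta
```

A test built only on the lag-0 contrast, such as the two-sample t-test, therefore has no power here, whereas the value functions accumulate the delayed effect.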
The ATE can thus be estimated by the plug-in estimator $\widehat{\tau}(t)=\int_s \Psi^\top(s)\{\widehat{\beta}_{1,1}(t)-\widehat{\beta}_{0,0}(t)\}G(ds)$.

Second, we use $\widehat{\tau}(t)$ to construct our test statistic at time $t$. We will show that $\sqrt{t}\{\widehat{\tau}(t)-\tau_0\}$ is asymptotically normal. Its variance can be consistently estimated, as $t$ grows to infinity, by $\widehat{\sigma}^2(t)=U^\top\widehat{\Sigma}^{-1}(t)\widehat{\Omega}(t)\{\widehat{\Sigma}^{-1}(t)\}^\top U$, where $U=\{-\int_{s\in\mathbb{S}}\Psi^\top(s)G(ds),\, 0_q^\top,\, 0_q^\top,\, \int_{s\in\mathbb{S}}\Psi^\top(s)G(ds)\}^\top$, $0_q$ denotes a zero vector of length $q$, and $\widehat{\Omega}(t)$ is a consistent estimator of the covariance of $\eta_t$ based on the data observed by time $t$ (see Equation 3 for the explicit form). This yields our test statistic $\sqrt{t}\,\widehat{\tau}(t)/\widehat{\sigma}(t)$ at time $t$.

Third, we integrate the α-spending approach with the bootstrap to implement our test sequentially (see Section 4.3). The idea is to generate bootstrap samples that mimic the distribution of our test statistics, in order to specify the stopping boundary at each interim stage. Suppose that the interim analyses are conducted at time points $T_1<\cdots<T_K=T$. For each $1\le k<K$, we assume $T_k/T\to c_k$ for some constants $0<c_1<\cdots<c_{K-1}<1$.
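The estimation step above, solving the empirical estimating equation $\widehat{\Sigma}(t)\widehat{\beta}(t)=\widehat{\eta}(t)$ one future action $a'$ at a time and plugging into $\widehat{\tau}(t)$, can be sketched as follows. This is a minimal illustration with a linear basis ($q=2$) and Example-2-style scalar dynamics; the function names, data-generating process, and sample sizes are our own assumptions, not the paper's implementation.

```python
import numpy as np

GAMMA = 0.6

def psi(s):
    """Basis vector Psi(s); here a simple linear basis (q = 2) for scalar states."""
    return np.array([1.0, s])

def xi(s, a):
    """xi(s, a) = (Psi(s)^T I(a=0), Psi(s)^T I(a=1))^T, a vector of length 2q."""
    out = np.zeros(4)
    out[2 * a: 2 * a + 2] = psi(s)
    return out

def estimate_tau(S, A, Y, S_next, init_states, gamma=GAMMA):
    """Solve the estimating equation block for each future action a', then form
    the plug-in estimator tau_hat = integral of Psi^T (beta_{1,1} - beta_{0,0}) dG."""
    beta = {}
    for a_fut in (0, 1):                              # a' in the paper's notation
        Sigma = np.zeros((4, 4))
        eta = np.zeros(4)
        for s, a, y, s1 in zip(S, A, Y, S_next):
            x = xi(s, a)
            Sigma += np.outer(x, x - gamma * xi(s1, a_fut))
            eta += x * y
        beta[a_fut] = np.linalg.solve(Sigma, eta)     # (beta_{a',0}; beta_{a',1})
    psi_bar = np.mean([psi(s) for s in init_states], axis=0)  # integral w.r.t. G
    return psi_bar @ (beta[1][2:] - beta[0][:2])

def gen_data(delta, T=20000, seed=0):
    """Toy data: reward equals the current state; the treatment shifts the next state."""
    rng = np.random.default_rng(seed)
    s = 0.5 * rng.standard_normal()
    S, A, Y, S_next = [], [], [], []
    for _ in range(T):
        a = int(rng.integers(2))                      # Bernoulli(0.5) Markov design
        s1 = 0.5 * s + delta * a + 0.5 * rng.standard_normal()
        S.append(s); A.append(a); Y.append(s); S_next.append(s1)
        s = s1
    init = 0.5 * rng.standard_normal(500)             # draws from G
    return S, A, Y, S_next, init

tau_null = estimate_tau(*gen_data(delta=0.0))   # no treatment effect
tau_alt = estimate_tau(*gen_data(delta=0.2))    # positive long-term effect
```

With these toy dynamics the Q-function is exactly linear in the state, so the linear basis is well specified and the plug-in estimator is consistent: $\widehat{\tau}$ is near zero when $\delta=0$ and clearly positive when $\delta>0$.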

4.3. SEQUENTIAL MONITORING AND ONLINE UPDATING

Let $\{Z_1,\cdots,Z_K\}$ denote the sequence of our test statistics, where $Z_k=\sqrt{T_k}\,\widehat{\tau}(T_k)/\widehat{\sigma}(T_k)$. To sequentially monitor our test, we need to specify the stopping boundary $\{b_k\}_{1\le k\le K}$ such that the experiment is terminated and $H_0$ is rejected when $Z_k>b_k$ for some $k$.

First, we use the α-spending function approach to guarantee the validity of our test. It requires specifying a monotonically increasing function $\alpha(\cdot)$ that satisfies $\alpha(0)=0$ and $\alpha(T)=\alpha$. Some popular choices of the α-spending function include $\alpha_1(t)=2-2\Phi\{\Phi^{-1}(1-\alpha/2)\sqrt{T/t}\}$ and $\alpha_2(t)=\alpha(t/T)^\theta$ for $\theta>0$, where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function. Adopting the α-spending approach, we require the $b_k$'s to satisfy
$$\Pr(\cup_{j=1}^k \{Z_j>b_j\})=\alpha(T_k)+o(1),\qquad \forall 1\le k\le K. \qquad (2)$$
As commented in the introduction, the numerical integration method is not applicable for determining the stopping boundary here. Our method is built upon the wild bootstrap (Wu, 1986). The idea is to generate bootstrap samples that have asymptotically the same joint distribution as the test statistics. However, directly applying existing wild bootstrap algorithms is time consuming; see Section C for details. To facilitate the computation, we present a scalable bootstrap algorithm to determine $\{b_k\}_k$.

Let $\{e_k\}_k$ be a sequence of i.i.d. $N(0, I_{4q})$ random vectors, where $I_J$ stands for the $J\times J$ identity matrix for any $J$. Let $\widehat{\Omega}(T_0)$ be a $4q\times 4q$ zero matrix. At the $k$-th stage, we compute
$$Z_k^*=\frac{U^\top\widehat{\Sigma}^{-1}(T_k)}{\sqrt{T_k}\,\widehat{\sigma}(T_k)}\sum_{j=1}^k \{T_j\widehat{\Omega}(T_j)-T_{j-1}\widehat{\Omega}(T_{j-1})\}^{1/2} e_j.$$
A key observation is that, conditional on the observed dataset, the covariance of $Z_{k_1}^*$ and $Z_{k_2}^*$ equals that of $Z_{k_1}$ and $Z_{k_2}$; see Theorem 3 for details. In addition, the limiting distributions of $\{Z_k\}_k$ and $\{Z_k^*\}_k$ are multivariate normal. As such, the joint distribution of $\{Z_k\}_k$ can be well approximated by that of $\{Z_k^*\}_k$ conditional on the data. This forms the basis of our bootstrap algorithm.
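The two spending functions $\alpha_1$ and $\alpha_2$ above can be written down directly with the standard library; the sketch below (parameter names are ours) also confirms that each spends the full level $\alpha$ at $t=T$:

```python
from statistics import NormalDist

def alpha_1(t, T, alpha=0.05):
    """O'Brien-Fleming-type spending: 2 - 2*Phi(Phi^{-1}(1 - alpha/2) * sqrt(T/t))."""
    nd = NormalDist()
    return 2.0 - 2.0 * nd.cdf(nd.inv_cdf(1.0 - alpha / 2.0) * (T / t) ** 0.5)

def alpha_2(t, T, alpha=0.05, theta=3.0):
    """Power-family spending: alpha * (t/T)^theta."""
    return alpha * (t / T) ** theta

# both are increasing in t and equal alpha at t = T
vals_1 = [alpha_1(t, 600) for t in (300, 450, 600)]
vals_2 = [alpha_2(t, 600) for t in (300, 450, 600)]
```

The O'Brien-Fleming-type function $\alpha_1$ spends very little error early (conservative early stopping), whereas $\alpha_2$ with $\theta=3$ spends it more gradually.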
By the requirement (2) on $\{b_k\}_k$, we obtain $\Pr(\max_{1\le j<k}(Z_j-b_j)\le 0,\ Z_k>b_k)=\alpha(T_k)-\alpha(T_{k-1})+o(1)$. To implement our test, we recursively calculate the threshold $\widehat{b}_k$ as the solution of
$$\Pr{}^*\big(\max_{1\le j<k}(Z_j^*-\widehat{b}_j)\le 0,\ Z_k^*>\widehat{b}_k\big)=\alpha(T_k)-\alpha(T_{k-1}),$$
where $\Pr^*$ denotes the conditional probability given the data, and we reject $H_0$ when $Z_k>\widehat{b}_k$ for some $k$. In practice, the left-hand side can be approximated via Monte Carlo simulations.
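The recursive Monte Carlo calibration above amounts to taking, at each stage, a quantile of the bootstrap statistics among the paths that have not yet crossed an earlier boundary. A generic sketch (function name ours; i.i.d. standard normal draws stand in for the actual bootstrap statistics $Z_k^*$) under a linear spending schedule:

```python
import numpy as np

def bootstrap_boundaries(Z_star, spend):
    """Given B bootstrap paths Z_star (shape B x K) mimicking the null joint
    distribution of the test statistics, and cumulative spending values
    spend[k] = alpha(T_{k+1}), return thresholds b_k such that a fraction
    spend[k] - spend[k-1] of all paths first crosses at stage k."""
    B, K = Z_star.shape
    alive = np.ones(B, dtype=bool)        # paths that have not yet crossed
    bounds, prev = [], 0.0
    for k in range(K):
        n_cross = int(round((spend[k] - prev) * B))
        zs = np.sort(Z_star[alive, k])
        # pick b_k so that exactly n_cross surviving paths exceed it
        b_k = zs[len(zs) - n_cross - 1] if n_cross > 0 else np.inf
        alive &= ~(Z_star[:, k] > b_k)
        bounds.append(b_k)
        prev = spend[k]
    return np.array(bounds)

rng = np.random.default_rng(0)
B, K, alpha = 100_000, 5, 0.05
Z_star = rng.standard_normal((B, K))      # stand-in for the bootstrap draws
spend = [alpha * (k + 1) / K for k in range(K)]
bounds = bootstrap_boundaries(Z_star, spend)
ever_crossed = (Z_star > bounds).any(axis=1)
```

By construction, the overall crossing probability over the calibration sample equals $\alpha$, which is exactly requirement (2) evaluated under the bootstrap distribution.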

4.4. CONSISTENCY UNDER DIFFERENT TREATMENT DESIGNS

We consider three treatment allocation designs that can be handled by our procedure:

D1. Markov design: $\Pr(A_t=1\,|\,S_t, \{S_j,A_j,Y_j\}_{0\le j<t})=b^{(0)}(S_t)$ for some function $b^{(0)}(\cdot)$ uniformly bounded away from 0 and 1.

D2. Alternating-time-interval design:

$A_{2j}=0$, $A_{2j+1}=1$ for all $j\ge 0$.

D3. Adaptive design (e.g., ε-greedy): for $T_k\le t<T_{k+1}$ and some $k\ge 0$, $\Pr(A_t=1\,|\,S_t,\{S_j,A_j,Y_j\}_{0\le j<t})=b^{(k)}(S_t)$ for some $b^{(k)}(\cdot)$ that depends on $\{S_j,A_j,Y_j\}_{0\le j<T_k}$.

Here, D2 is a deterministic design and is widely used in industry (see our real data example). D1 and D3 are random designs. D1 is commonly assumed in the literature on off-policy evaluation (see e.g., Jiang & Li, 2016). D3 is widely employed in the contextual bandit setting to balance the trade-off between exploration and exploitation. These three settings cover a variety of scenarios in practice.

Theorem 1 (Type-I error). Suppose $\alpha(\cdot)$ is continuous, C1-C3 (see Appendix E) hold and $q=o(\sqrt{T}/\log T)$. Then $\Pr(\cup_{j=1}^k\{Z_j>\widehat{b}_j\})\le\alpha(T_k)+o(1)$, for all $1\le k\le K$ under $H_0$.

Sketch of Proof: We consider only the case where $\tau_0=0$; the general case is proven in Section F.3. As discussed in Section 4.3, the conditional distribution of $\{Z_k^*\}_k$ given the data is asymptotically equivalent to the distribution of $\{Z_k\}_k$. Since $\{\widehat{b}_k\}_k$ is a continuous function of $\{Z_k^*\}_k$, it follows from the continuous mapping theorem that $\{\widehat{b}_k\}_k$ are consistent. The proof is hence completed.

Theorem 1 implies that the type-I error rate of the proposed test is well controlled. When $\mathrm{ATE}=0$, the equality in Theorem 1 holds, and the rejection probability achieves the nominal level under $H_0$.

Theorem 2 (Power). Under the conditions of Theorem 1, if $\tau_0\gg T^{-1/2}$, then $\Pr(Z_1>\widehat{b}_1)\to 1$. If $\tau_0=T^{-1/2}h$ for some $h>0$, then $\lim_{T\to\infty}[\Pr(\cup_{j=1}^k\{Z_j>\widehat{b}_j\})-\alpha(T_k)]>0$.

Sketch of Proof: Under $H_1$, similar to Theorem 1, we have $\Pr(\cup_{j=1}^k\{Z_j-\sqrt{T_j}\,\tau_0/\widehat{\sigma}(T_j)>\widehat{b}_j\})=\alpha(T_k)+o(1)$. The assertions follow since $Z_j$ is stochastically larger than $Z_j-\sqrt{T_j}\,\tau_0/\widehat{\sigma}(T_j)$ for all $j$.

The second assertion in Theorem 2 implies that our test has non-negligible power against local alternatives converging to $H_0$ at the $T^{-1/2}$ rate.
When the signal decays to zero faster than this rate, our test is not able to detect $H_1$. When the signal decays at a slower rate, the power of our test approaches 1. Combining Theorems 1 and 2 yields the consistency of our test. Finally, it is worth mentioning that our test can be updated online as batches of observations arrive at the end of each interim stage. We summarize our procedure in Algorithm 1 (see Appendix A). Its time complexity is dominated by $O(Bq^3+Tq^2)$.

5.1. SIMULATIONS

We generate data according to the model
$$S_{1,t}=(2A_{t-1}-1)S_{1,t-1}/2+S_{2,t-1}/4+\delta A_{t-1}+\varepsilon_{1,t},$$
$$S_{2,t}=(2A_{t-1}-1)S_{2,t-1}/2+S_{1,t-1}/4+\delta A_{t-1}+\varepsilon_{2,t},$$
$$Y_t=1+(S_{1,t}+S_{2,t})/2+\varepsilon_{3,t},$$
where the random errors $\{\varepsilon_{j,t}\}_{j=1,2,\,0\le t\le T}$ are i.i.d. $N(0, 0.5^2)$ and $\{\varepsilon_{3,t}\}_{0\le t\le T}$ are i.i.d. $N(0, 0.3^2)$. The initial states $S_{1,0}$ and $S_{2,0}$ are independent $N(0, 0.5^2)$ as well. Let $S_t=(S_{1,t}, S_{2,t})^\top$ denote the state at time $t$. Under this model, treatments have delayed effects on the outcomes, as in Example 2. The parameter $\delta$ characterizes the degree of such carryover effects. When $\delta=0$, $\tau_0=0$ and $H_0$ holds. When $\delta>0$, $H_1$ holds. Moreover, $\tau_0$ increases as $\delta$ increases.

We set $K=5$ and $(T_1, T_2, T_3, T_4, T_5)=(300, 375, 450, 525, 600)$. The discount factor $\gamma$ is set to 0.6 and $G$ is chosen as the initial state distribution. We consider three behavior policies, according to the designs D1-D3, respectively. For the behavior policy in D1, we set $b^{(0)}(s)=0.5$ for any $s\in\mathbb{S}$. For the behavior policy in D3, we use an ε-greedy policy and set $b^{(k)}(s)=\epsilon/2+(1-\epsilon)\,\mathbb{I}(\Psi^\top(s)\{\widehat{\beta}_{1,1}(T_k)-\widehat{\beta}_{0,0}(T_k)\}>0)$, with $\epsilon=0.1$, for any $k\ge 1$ and $s\in\mathbb{S}$. For each design, we further consider five choices of $\delta$: 0, 0.05, 0.1, 0.15 and 0.2. The significance level $\alpha$ is set to 0.05 in all cases. To implement our test, we choose two α-spending functions, corresponding to $\alpha_1(\cdot)$ and $\alpha_2(\cdot)$ given in equation 1. The hyperparameter $\theta$ in $\alpha_2(\cdot)$ is set to 3. The number of bootstrap samples is set to 1000.
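The two-state simulation model and the three behavior policies above can be sketched as follows. This is an illustrative generator (function and parameter names are ours): for D3, any callable mapping a state to the currently estimated better action can be passed in place of the fitted $\widehat{\beta}$-based rule.

```python
import numpy as np

def simulate(design, T=600, delta=0.2, eps=0.1, seed=0, policy=None):
    """Generate (S, A, Y) from the two-state simulation model under design
    D1 (Bernoulli(0.5)), D2 (alternating time intervals), or D3 (eps-greedy,
    where `policy(s)` returns the currently estimated better action)."""
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, 0.5, size=2)        # initial state ~ N(0, 0.5^2)
    a_prev = 0
    S, A, Y = [], [], []
    for t in range(T):
        # delayed (carryover) effect: dynamics depend on the PREVIOUS action
        s1 = (2*a_prev - 1)*s[0]/2 + s[1]/4 + delta*a_prev + rng.normal(0, 0.5)
        s2 = (2*a_prev - 1)*s[1]/2 + s[0]/4 + delta*a_prev + rng.normal(0, 0.5)
        s = np.array([s1, s2])
        y = 1.0 + (s1 + s2)/2 + rng.normal(0, 0.3)
        if design == "D1":
            a = int(rng.integers(2))        # Bernoulli(0.5)
        elif design == "D2":
            a = t % 2                       # A_{2j} = 0, A_{2j+1} = 1
        else:                               # D3: eps-greedy around `policy`
            greedy = policy(s) if policy else 0
            a = greedy if rng.random() > eps else int(rng.integers(2))
        S.append(s); A.append(a); Y.append(y)
        a_prev = a
    return np.array(S), np.array(A), np.array(Y)

S2, A2, Y2 = simulate("D2")
S1, A1, Y1 = simulate("D1", seed=1)
```

Note that the ε-greedy branch assigns the greedy action with probability $1-\epsilon$ and a uniformly random action with probability $\epsilon$, which gives $\Pr(A_t=1)=\epsilon/2+(1-\epsilon)$ whenever the greedy action is 1, matching $b^{(k)}$ above.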
In addition, we consider the polynomial basis function $\Psi(s)=\Psi(s_1,s_2)=(1, s_1, s_1^2,\cdots,s_1^J, s_2, s_2^2,\cdots,s_2^J)^\top$, with $J=4$. We also tried other values of $J$, setting $J$ to 3 and 5. Results are reported in Figure 6 (see Appendix G). It can be seen that the resulting test is not sensitive to the choice of $J$. All experiments were run on a MacBook Pro with a dual-core 2.7 GHz processor; implementing a single test takes one second.

Figures 1(a) and 5(a) (see Appendix G) depict the empirical rejection probabilities of our test statistics at different interim stages under $H_0$ and $H_1$ with different combinations of $\delta$, $\alpha(\cdot)$ and the designs. These rejection probabilities are aggregated over 500 simulations. We also plot $\alpha_1(\cdot)$ and $\alpha_2(\cdot)$ under $H_0$. Based on the results, it can be seen that under $H_0$, the type-I error rate of our test is well-controlled and close to the nominal level at each interim stage. Under $H_1$, the power of our test increases as $\delta$ increases, showing the consistency of our testing procedure.

To further evaluate our method, we compare it with the classical two-sample t-test and the sequential test developed by Kharitonov et al. (2015). For each $T_k$, we apply the two-sample t-test to the outcomes observed under the two treatments up to time $T_k$.

We next explain why several other methods mentioned in the introduction cannot be used for comparison. First, many causal effects evaluation methods do not consider early termination and are consequently unsuitable for our numerical studies. Second, standard temporal-difference learning methods do not provide the asymptotic distribution of the resulting value estimators; such results are critical for carrying out A/B testing. Finally, many methods rely on inverse propensity-score weighting; these methods are not valid for the alternating-time-interval design.

5.2. REAL DATA APPLICATION

We apply the proposed test to a real dataset from a ride-sharing platform. Order dispatching is one of the most critical problems for online ride-hailing platforms in adapting their operation and management strategies to the dynamics of demand and supply. The purpose of this study is to compare the performance of a newly developed strategy with a standard control strategy used on the platform. The new strategy is expected to reduce passengers' answer time and increase drivers' income. For a given order, the new strategy will dispatch it to a nearby driver who has not yet finished, but is about to finish, their previous ride request. In comparison, the standard control assigns orders only to drivers who have completed their ride requests.

The experiment was conducted in a given city from December 3rd to December 16th. Dispatch strategies are executed based on alternating half-hourly time intervals. We also apply our test to data from an A/A experiment (which compares the baseline strategy against itself), conducted from November 12th to November 25th. We expect that our test will not reject $H_0$ when applied to the data from the A/A experiment, since the two strategies used are essentially the same. Both experiments last for two weeks. Thirty minutes is defined as one time unit. We set $T_k=48(k+6)$ for $k=1,\ldots,8$. That is, the first interim analysis is performed at the end of the first week, followed by seven more at the end of each day during the second week. We choose the overall drivers' income in each time unit as the response. Three time-varying variables are used to construct the state. The first two correspond to the number of requests (demand) and drivers' online time (supply) during each 30-minute time interval. These factors are known to have a large impact on drivers' income. The last one is the supply and demand equilibrium metric.
This variable characterizes the degree to which supply meets demand and serves as an important mediator between past treatments and future outcomes. To implement our test, we set γ = 0.6, B = 1000 and use a fourth-degree polynomial basis for Ψ(·), as in the simulations. We use α_1(·) as the spending function for the interim analyses and set α = 0.05. The test statistic and its corresponding rejection boundary at each interim stage are plotted in Figure 2. It can be seen that our test is able to conclude, at the end of the 12th day, that the new order dispatch strategy significantly increases drivers' income and meets more order requests. In addition, based on the dataset from the A/B experiment, we found that the new strategy reduces the answer time of orders by 2%, leading to an almost 2% increase in drivers' income. When applied to the data from the A/A experiment, we fail to reject H0, as expected. For comparison, we also apply the two-sample t-test to the data collected from the A/B experiment. The corresponding p-value is 0.18. This result is consistent with our findings. Specifically, the treatment effect at a given time affects the distribution of drivers in the future, inducing interference in time. As shown in the toy example (see Section 4.1), the t-test cannot detect such carryover effects, leading to low power. Our procedure, according to Theorem 2, has enough power to discriminate H1 from H0.

For k = 1 to K:
  Step 1 (online update of the ATE). For t = T_{k-1} to T_k - 1:
    Σ_a ← (1 - t^{-1}) Σ_a + t^{-1} ξ(S_t, A_t){ξ(S_t, A_t) - γξ(S_{t+1}, a)}^⊤ for a = 0, 1;
    η ← (1 - t^{-1}) η + t^{-1} ξ(S_t, A_t) Y_t.
  Set (β_{a,0}^⊤, β_{a,1}^⊤)^⊤ = Σ_a^{-1} η for a ∈ {0, 1} and τ = U^⊤ β.
  Step 2 (online update of the variance estimator). Initialize Ω* to a zero matrix.
  For t = T_{k-1} to T_k - 1:
    ε_{t,a} = Y_t + γΨ^⊤(S_{t+1}) β_{a,a} - Ψ^⊤(S_t) β_{a,A_t} for a = 0, 1;
    Ω* ← Ω* + (ξ^⊤(S_t, A_t) ε_{t,0}, ξ^⊤(S_t, A_t) ε_{t,1})^⊤ (ξ^⊤(S_t, A_t) ε_{t,0}, ξ^⊤(S_t, A_t) ε_{t,1}).
  Set Σ to the block diagonal matrix with Σ_0 and Σ_1 along its diagonal;
  Set Ω = T_k^{-1}(T_{k-1} Ω + Ω*) and the variance estimator σ² = U^⊤ Σ^{-1} Ω {Σ^{-1}}^⊤ U.
  Step 3 (bootstrap test statistic). For b = 1 to B:
    Generate e_k^{(b)} ∼ N(0, I_{4q}); S_b ← S_b + Ω*^{1/2} e_k^{(b)}; Z*_b = T_k^{-1/2} σ^{-1} U^⊤ Σ^{-1} S_b.
  Set z to the upper {α(T_k) - |I^c|/B}/(1 - |I^c|/B)-th percentile of {Z*_b}_{b∈I};
  Update I as I ← {b ∈ I : Z*_b ≤ z}.
  Step 4 (reject or not?). Reject the null if √T_k σ^{-1} τ > z.

Algorithm 1: The testing procedure
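The inner loop of Step 1 of Algorithm 1 above can be sketched in a few lines; this is a minimal illustration of the two running-average updates, with `step1_update` and its argument names being our own (not from the paper):

```python
import numpy as np

def step1_update(Sigma_a, eta, xi_t, xi_next, y_t, t, gamma):
    """One inner iteration of Step 1 (illustrative names):
    Sigma_a <- (1 - 1/t) Sigma_a + (1/t) xi_t (xi_t - gamma * xi_next)^T
    eta     <- (1 - 1/t) eta     + (1/t) xi_t y_t
    where xi_t = xi(S_t, A_t) and xi_next = xi(S_{t+1}, a)."""
    w = 1.0 / t
    Sigma_a = (1.0 - w) * Sigma_a + w * np.outer(xi_t, xi_t - gamma * xi_next)
    eta = (1.0 - w) * eta + w * xi_t * y_t
    return Sigma_a, eta
```

After the loop, the coefficient estimate would be obtained by solving Sigma_a @ beta = eta, as in Step 1's closing line.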

A MORE ON THE ALGORITHM

A pseudo-algorithm summarizing our procedure is given in Algorithm 1. We next introduce some notation. The matrix Ω(t) is defined by

Ω(t) = t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤ ε_{j,0}, ξ_j^⊤ ε_{j,1})^⊤ (ξ_j^⊤ ε_{j,0}, ξ_j^⊤ ε_{j,1}),

where ε_{j,0} and ε_{j,1} are the temporal-difference errors defined in Algorithm 1.
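The running-average form of Ω(t), updated block by block as in Step 2 of Algorithm 1 (Ω ← (T_{k-1} Ω + Ω*) / T_k), can be sketched as follows; `omega_running_average` and `score_blocks` are illustrative names, with each block holding the stacked score vectors (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}) for one interim stage:

```python
import numpy as np

def omega_running_average(score_blocks):
    # Omega(t) = (1/t) sum_{j<t} g_j g_j^T, computed incrementally:
    # Omega <- (t_prev * Omega + Omega_star) / t_new, where Omega_star is the
    # sum of outer products over the newest block of scores g_j.
    d = score_blocks[0].shape[1]
    Omega, t = np.zeros((d, d)), 0
    for block in score_blocks:        # block: (n_k, d) array of scores
        Omega_star = block.T @ block  # sum of outer products in this block
        t_new = t + block.shape[0]
        Omega = (t * Omega + Omega_star) / t_new
        t = t_new
    return Omega
```

Processing the data in two blocks gives exactly the same matrix as a one-shot average over all time points, which is what makes the online update exact rather than approximate.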

B EXTENSIONS B.1 EXTENSIONS TO DYNAMIC POLICIES

In this paper, we focus on comparing the long-term treatment effects between two static (non-dynamic) policies. The proposed method can be easily extended to handle dynamic policies as well. Specifically, consider two time-homogeneous policies π_1 and π_2, where each π_j(s) gives the treatment assignment probability Pr(A_t = 1 | S_t = s). Note that the integrated value difference τ_0 can be represented as

∫_s {V(π_1; s) - V(π_2; s)} G(ds) = ∫_s [{Q(π_1; 1, s) - Q(π_1; 0, s)} π_1(s) - {Q(π_2; 1, s) - Q(π_2; 0, s)} π_2(s) + Q(π_1; 0, s) - Q(π_2; 0, s)] G(ds).

The Q-estimators can be similarly computed via temporal-difference learning. More specifically, for a given policy π, the Q-estimator given the data {(S_j, A_j, Y_j)}_{j<t} is

Q_t(π; a, s) = Ψ^⊤(s) Σ_π^{-1}(t) { t^{-1} Σ_{j<t} ξ(S_j, A_j) Y_j },

where Σ_π(t) = t^{-1} Σ_{j<t} Σ_j and Σ_j denotes the 2×2 block matrix

Σ_j = [ Ψ(S_j)(1 - A_j){Ψ(S_j) - γΨ(S_{j+1})(1 - π(S_{j+1}))}^⊤,  -γΨ(S_j)(1 - A_j)Ψ^⊤(S_{j+1})π(S_{j+1});
        -γΨ(S_j)A_jΨ^⊤(S_{j+1})π(S_{j+1}),  Ψ(S_j)A_j{Ψ(S_j) - γΨ(S_{j+1})(1 - π(S_{j+1}))}^⊤ ].

We can plug the Q-estimator into equation 4 to estimate τ_0. The corresponding variance estimator and the resulting test statistic can be similarly derived. A bootstrap procedure can be similarly developed as in Section 4.3 for sequential testing. We omit the details for brevity.
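Given estimated Q-functions and the two policies, the integrated value difference above can be approximated by Monte Carlo over a sample from the reference distribution G. A minimal sketch with illustrative names (`Q1`, `Q2`, `pi1`, `pi2` stand in for the assumed estimators; `states` is a sample from G):

```python
import numpy as np

def integrated_value_diff(Q1, Q2, pi1, pi2, states):
    # Monte Carlo version of the display above:
    # {Q1(1,s) - Q1(0,s)} pi1(s) - {Q2(1,s) - Q2(0,s)} pi2(s)
    #   + Q1(0,s) - Q2(0,s), averaged over states drawn from G.
    vals = [
        (Q1(1, s) - Q1(0, s)) * pi1(s)
        - (Q2(1, s) - Q2(0, s)) * pi2(s)
        + Q1(0, s) - Q2(0, s)
        for s in states
    ]
    return float(np.mean(vals))
```

With degenerate policies π_1 ≡ 1 and π_2 ≡ 0 and a common Q-function, the expression collapses to the average of Q(1, s) - Q(0, s), recovering the static two-policy comparison as a special case.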

B.2 EXTENSIONS TO OTHER NONPARAMETRIC ESTIMATORS

In addition to temporal-difference learning, other existing OPE methods could potentially be coupled with the proposed bootstrap procedure for online sequential testing. We use the double reinforcement learning method (DRL, Kallus & Uehara, 2019) as an example. First, we remark that DRL requires the system to be ergodic and uses an inverse propensity-score weighted method to construct the value estimator. As such, it might not be applicable to the alternating-time-interval design or the adaptive design. Second, in the Markov design, it can be coupled with our bootstrap procedure for sequential testing. We compare such a procedure with our proposed method using the simulation setting in Section 5.1 and report the rejection probabilities in Figure 3. It can be seen that the DRL-based test has inflated type-I error under H0 and is less powerful than our procedure under H1. We next outline the procedure. Specifically, at the kth interim stage, we compute the test statistic

Z_k = (√T_k σ_k)^{-1} { √T_{k-1} σ_{k-1} Z_{k-1} + Σ_{t=T_{k-1}}^{T_k - 1} ψ_t },

where

ψ_t = ∫_s {Q_k(1; 1, s) - Q_k(0; 0, s)} G(ds) + γ {A_t / b(S_t)} ω_k(1; S_t){R_t + Q_k(1; 1, S_{t+1}) - Q_k(1; A_t, S_t)} - γ {(1 - A_t) / (1 - b(S_t))} ω_k(0; S_t){R_t + Q_k(0; 0, S_{t+1}) - Q_k(0; A_t, S_t)},

Q_k and ω_k denote the estimated Q- and marginal density ratio functions based on the data collected at the kth stage, and

σ_k² = {T_{k-1} σ_{k-1}² + Σ_{t=T_{k-1}}^{T_k - 1} ψ_t²} / T_k.

The bootstrapped sample can be constructed as

Z*_k = √(T_{k-1}/T_k) σ_k^{-1} σ_{k-1} Z*_{k-1} + √{(T_k - T_{k-1})/T_k} N(0, 1).

Then, similar to Algorithm 1, we can decide whether to reject H0 based on the test statistic Z_k and the bootstrap samples. This algorithm can be implemented online provided that Q and ω can be computed online.
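The recursive updates of Z_k and σ_k² above can be sketched as follows; `drl_interim_update` and its argument names are illustrative, with `psi_block` holding the ψ_t values for the current stage:

```python
import math

def drl_interim_update(Z_prev, sigma_prev, T_prev, psi_block):
    """One interim stage of the DRL-based recursion (sketch):
    sigma_k^2 = (T_{k-1} sigma_{k-1}^2 + sum psi_t^2) / T_k
    Z_k = (sqrt(T_{k-1}) sigma_{k-1} Z_{k-1} + sum psi_t) / (sqrt(T_k) sigma_k)
    """
    T_k = T_prev + len(psi_block)
    s1 = sum(psi_block)                     # sum of psi_t over the new block
    s2 = sum(p * p for p in psi_block)      # sum of psi_t^2 over the new block
    sigma_k = math.sqrt((T_prev * sigma_prev ** 2 + s2) / T_k)
    Z_k = (math.sqrt(T_prev) * sigma_prev * Z_prev + s1) / (math.sqrt(T_k) * sigma_k)
    return Z_k, sigma_k
```

Chaining two stages reproduces the one-shot statistic computed from all ψ_t at once, which is what permits the online implementation mentioned in the text.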

C MORE ON THE WILD BOOTSTRAP ALGORITHM

We first provide an example to show that our test statistics do not have the canonical joint distribution. This motivates our wild bootstrap algorithm. We then present some details of the bootstrap algorithm. Let {Z_k}_k be the sequence of test statistics computed at the interim stages. These test statistics are said to have the canonical joint distribution with information levels {I_k}_k for the parameter θ if: (i) (Z_1, Z_2, ..., Z_K) is asymptotically normal; (ii) E Z_k = θ√I_k + o(1); (iii) Cov(Z_{k1}, Z_{k2}) = √(I_{k1}/I_{k2}) + o(1) for k1 ≤ k2. See also Equation (3.1) in Jennison & Turnbull (1999). Unlike settings where observations are independent across time, (iii) is likely to be violated in our setup when an adaptive design is used. This is due to the existence of carryover effects in time. Specifically, when treatments are adaptively assigned, the behavior policies at different stages are likely to vary. Due to the carryover effects in time, the state vectors at different stages have different distribution functions. According to Part 3 of the proof of Theorem 3, we have for any k1 ≤ k2 that

Cov(Z_{k1}, Z_{k2}) = √(T_{k1}/T_{k2}) × η_{k1,k2} + o(1), where η_{k1,k2} = U^⊤ Σ^{-1}(T_{k1}) Ω(T_{k1}) {Σ^{-1}(T_{k2})}^⊤ U.

The matrices Σ(t) and Ω(t) depend on the distributions of the state vectors and are likely to differ across stages. Consequently, the second factor η_{k1,k2} on the right-hand side depends on both k1 and k2. As such, (iii) is violated. The idea of our bootstrap algorithm is to generate bootstrap samples {Z_MB(t)}_t that have asymptotically the same joint distribution as {√t σ^{-1}(t)(τ(t) - τ_0)}_t. Specifically, let {ζ_t}_{t≥0} be a sequence of i.i.d. random variables independent of the observed data. Define

β_MB(t) = Σ^{-1}(t) { t^{-1} Σ_{j<t} ζ_j (ξ_j^⊤ ε_{j,0}, ξ_j^⊤ ε_{j,1})^⊤ },

where ε_{t,a} is the temporal-difference error defined in Algorithm 1. Based on β_MB(t), one can define the bootstrap sample Z_MB(t) = √t σ^{-1}(t) U^⊤ β_MB(t). 
We remark that although the wild bootstrap method was developed under i.i.d. settings, it remains valid under our setup. This is because, under CMIA, β(t) - β* forms a martingale sequence with respect to the filtration generated by {(S_j, A_j, Y_j) : j < t}. This guarantees that the covariance matrices of β_MB(t) and β(t) are asymptotically equivalent. As such, the bootstrap approximation is valid. However, calculating β_MB(T_k) requires O(T_k) operations. The time complexity of the resulting bootstrap algorithm is O(BT_k) up to the kth interim stage, where B is the total number of bootstrap samples. This can be time-consuming when {T_k - T_{k-1}}_{k=1}^K are large.
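A single multiplier-bootstrap draw of β_MB(t) can be sketched as follows, assuming for simplicity that the inverse matrix Σ^{-1}(t) is passed in precomputed as `Sigma_inv` (all names here are illustrative, not from the paper):

```python
import numpy as np

def wild_bootstrap_beta(Sigma_inv, xi, eps0, eps1, rng):
    """One multiplier-bootstrap draw (sketch): each stacked score
    (xi_j eps_{j,0}, xi_j eps_{j,1}) is reweighted by an i.i.d. N(0, 1)
    multiplier zeta_j and re-averaged. Shapes: xi (t, q); Sigma_inv (2q, 2q)."""
    t = xi.shape[0]
    zeta = rng.standard_normal(t)                     # wild multipliers
    scores = np.hstack([xi * (zeta * eps0)[:, None],  # (t, 2q) stacked scores
                        xi * (zeta * eps1)[:, None]])
    return Sigma_inv @ scores.mean(axis=0)
```

The O(T_k) cost per draw noted above is visible here: every draw touches all t score vectors, which is what motivates the cheaper recursive bootstrap used in the main text.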

D MORE ON THE DESIGNS

In D3, we require b^(k) to be strictly bounded between 0 and 1. Suppose an ε-greedy policy is used, i.e., b^(k)(s) = ε/2 + (1 - ε)π^(k)(s), where π^(k) denotes some estimated optimal policy. It follows that ε/2 ≤ b^(k)(s) ≤ 1 - ε/2 for any s, so the requirement is automatically satisfied. For any behaviour policy b in D1-D3, define S*_t(b̄_{t-1}) and Y*_t(b̄_t) as the potential outcomes at time t, where b̄_t denotes the action history assigned according to b. When b is a random policy as in D1 or D3, the definitions of these potential outcomes are more involved than those under a deterministic policy (see Luckett et al., 2019). When b is a stationary policy, it follows from MA that {S*_{t+1}(b̄_t)}_{t≥-1} forms a time-homogeneous Markov chain. When b follows the alternating-time-interval design, both {S*_{2t}(b̄_{2t-1})}_{t≥0} and {S*_{2t+1}(b̄_{2t})}_{t≥0} form time-homogeneous Markov chains. We next show that (Z_1, ..., Z_K) is asymptotically multivariate normal and provide a consistent covariance estimator.

Theorem 3 (Limiting distributions) Assume C1-C3 hold. Assume all immediate rewards are uniformly bounded, the density function of S_0 is uniformly bounded on S, and q satisfies q = o(√T / log T). Then, under either D1, D2 or D3, we have:
• {Z_k}_{1≤k≤K} are jointly asymptotically normal;
• their asymptotic means are non-positive under H0;
• their covariance matrix can be consistently estimated by some Ξ whose (k1, k2)-th element, for k1 ≤ k2, equals Ξ_{k1,k2} = √(T_{k1}/T_{k2}) U^⊤ Σ^{-1}(T_{k1}) Ω(T_{k1}) {Σ^{-1}(T_{k2})}^⊤ U / {σ(T_{k1}) σ(T_{k2})}.
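The ε-greedy bound required by D3 at the start of this section is easy to verify numerically; a minimal sketch (`eps_greedy_prob` is an illustrative name):

```python
def eps_greedy_prob(pi_s, eps):
    # Behaviour policy b(s) = eps/2 + (1 - eps) * pi(s). For any pi(s) in
    # [0, 1], this lies in [eps/2, 1 - eps/2], i.e., it is strictly bounded
    # away from 0 and 1, as D3 requires.
    return eps / 2 + (1 - eps) * pi_s
```

The bound holds at the extremes pi(s) = 0 and pi(s) = 1, hence for every intermediate value by linearity.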

E TECHNICAL CONDITIONS

To simplify the presentation, we assume all state variables are continuous. The immediate reward and the density function of S 0 are bounded.

E.1 CONDITION C1

C1 Suppose (i) holds. Assume (ii) holds under D1, (iii) holds under D2, and (ii) and (iv) hold under D3.
(i) The transition kernel P is absolutely continuous and satisfies P(ds′; a, s) = p(s′; a, s)ds′ for some transition density function p. In addition, assume p is uniformly bounded away from 0 and ∞.
(ii) The Markov chain {S*_t(b̄^(0)_{t-1})}_{t≥0} formed under the behaviour policy b^(0) is geometrically ergodic, i.e., there exist some function M on S, some constant 0 ≤ ρ < 1 and some probability distribution Π such that ∫_{s∈S} M(s)Π(ds) < +∞ and

||Pr(S*_t(b̄^(0)_{t-1}) ∈ · | S_0 = s) - Π(·)||_TV ≤ M(s)ρ^t, ∀t ≥ 0, s ∈ S,

where ||·||_TV denotes the total variation norm.
(iii) The Markov chains {S*_{2t}(b̄_{2t-1})}_{t≥0} and {S*_{2t+1}(b̄_{2t})}_{t≥0} are geometrically ergodic.
(iv) For any k = 1, ..., K - 1, the following events occur with probability tending to 1: the Markov chain {S*_t(b̄^(k)_{t-1})}_{t≥0} is geometrically ergodic; sup_{s∈S} |b^(k)(s) - b*(s)| →_P 0 for some b*(·); and the stationary distribution of {S*_t(b̄^(k)_{t-1})}_{t≥0} converges to some Π* in total variation.

Remark: By C1(ii), Π is the stationary distribution of {S*_t(b̄^(0)_{t-1})}_{t≥0}. It follows that, for any S′ ⊆ S,

Π(S′) = Σ_{a∈{0,1}} ∫_{s∈S} P(S′; a, s)[a{1 - b^(0)(s)} + (1 - a)b^(0)(s)]Π(ds).

By C1(i), we obtain

Π(S′) = ∫_{s′∈S′} { Σ_{a∈{0,1}} ∫_{s∈S} [a{1 - b^(0)(s)} + (1 - a)b^(0)(s)] p(s′; a, s)Π(ds) } ds′,

which implies that the inner integral, denoted µ(s′), is the density function of Π. Since p is uniformly bounded away from 0 and ∞, so is µ. Under C1(iv), for any k ∈ {1, ..., K - 1}, there exist some M^(k)(·), Π^(k)(·) and ρ^(k) that satisfy ∫_{s∈S} M^(k)(s)Π^(k)(ds) < +∞ and

||Pr(S*_t(b̄^(k)_{t-1}) ∈ · | S_0 = s) - Π^(k)(·)||_TV ≤ M^(k)(s){ρ^(k)}^t, ∀t ≥ 0, s ∈ S,

with probability tending to 1. Since b^(k) is a function of the observed data history, so are M^(k)(·), Π^(k)(·) and ρ^(k). Suppose an ε-greedy policy is used, i.e., 
b^(k)(s) = ε/2 + (1 - ε)π^(k)(s), where π^(k) denotes some estimated optimal policy. Then the condition sup_{s∈S} |b^(k)(s) - b*(s)| →_P 0 requires π^(k) to converge. The total variation distance between the one-step transition kernel under b^(k) and that under b* can be bounded by

sup_s |Pr(S*_1(b^(k)) ∈ S′ | S_0 = s) - Pr(S*_1(b*) ∈ S′ | S_0 = s)| ≤ sup_s |b^(k)(s) - b*(s)|.

E.2 CONDITION C2

C2 (i) Assume sup_{a′,a∈{0,1}, s∈S} |Q(a′; a, s) - Ψ^⊤(s)β*_{a′,a}| = o(T^{-1/2}). (ii) Assume there exists some constant c* ≥ 1 such that (c*)^{-1} ≤ λ_min(∫_{s∈S} Ψ(s)Ψ^⊤(s)ds) ≤ λ_max(∫_{s∈S} Ψ(s)Ψ^⊤(s)ds) ≤ c*, and sup_s ||Ψ(s)||_2 = O(√q). (iii) Assume lim inf_q ||∫_{s∈S} Ψ(s)G(ds)||_2 > 0.

Remark: For any a, a′ ∈ {0, 1}, suppose Q(a′; a, s) is p-smooth as a function of s (see e.g. Stone, 1982, for the definition of p-smoothness). When tensor-product B-spline or wavelet basis functions (see Section 6 of Chen & Christensen, 2015, for an overview of these bases) are used for Ψ(·), we have sup_{a′,a∈{0,1}, s∈S} |Q(a′; a, s) - Ψ^⊤(s)β*_{a′,a}| = o(q^{-p/d}) under certain mild conditions; see Section 2.2 of Huang (1998) for details. It follows that Condition C2(i) automatically holds when the number of basis functions q grows at least at the order of T^{d/(2p)}. Condition C2(ii) is satisfied when a tensor-product B-spline or wavelet basis is used. For the B-spline basis, the assertion in equation 8 follows from the arguments used in the proof of Theorem 3.3 of Burman & Chen (1989). For the wavelet basis, it follows from the arguments used in the proof of Theorem 5.1 of Chen & Christensen (2015). For both bases, the number of nonzero elements of Ψ(·) is bounded by some constant, and each basis function is uniformly bounded by O(√q). The condition sup_s ||Ψ(s)||_2 = O(√q) thus holds. Condition C2(iii) automatically holds for the tensor-product B-spline basis. Notice that 1^⊤Ψ(s) = q^{1/2} for any s ∈ S, where 1 denotes a vector of ones. It follows from the Cauchy-Schwarz inequality that

√q ||∫_{s∈S} Ψ(s)G(ds)||_2 ≥ |∫_{s∈S} 1^⊤Ψ(s)G(ds)| = √q,

so C2(iii) is satisfied.

E.3 CONDITION C3

Let ε*(a′, a) = Y*_0(a) + γQ(a′; a′, S*_1(a)) - Q(a′; a, S_0).

C3 Assume inf_q inf_{a′,a∈{0,1}, s∈S} Var{ε*(a′, a) | S_0 = s} > 0 and sup_q sup_{a∈{0,1}, s∈S} ρ_ε(a, s) < 1, where

ρ_ε(a, s) = E{ε*(0, a)ε*(1, a) | S_0 = s} / √[Var{ε*(0, a) | S_0 = s} Var{ε*(1, a) | S_0 = s}].

Here, ρ_ε corresponds to the partial correlation of ε*(0, a) and ε*(1, a) given S_0.

F TECHNICAL PROOFS F.1 PROOF OF LEMMA 1

To prove Lemma 1, we first state the following lemma.

Lemma 2 Under MA and CMIA, Q(a′; a, s) = r(a, s) + γ∫_{s′} Q(a′; a′, s′)P(ds′; a, s) for any (s, a).

Proof of Lemma 2: For any a, a′ ∈ {0, 1}, define the potential outcomes Y*_t(a′, a) and S*_t(a′, a) as the reward and state variables that would occur at time t had the agent assigned Treatment a at the initial time point and Treatment a′ afterwards. Let P_t^{a′}(S′, a, s) = Pr{S*_t(a′, a) ∈ S′ | S_0 = s} for any S′ ⊆ S, a, a′ ∈ {0, 1}, s ∈ S and t ≥ 0. We break the proof into two parts. In Part 1, we show that Lemma 2 holds when the following is satisfied:

Pr{S*_{t+1}(a′, a) ∈ S′ | S*_1(a) = s, S_0} = P_t^{a′}(S′, a′, s).   (9)

In Part 2, we show that equation 9 holds. Equation 9, together with the definition of the Q-function and CMIA, yields

Q(a′; a, s) = r(a, s) + γ[Σ_{t≥0} γ^t E{Y*_{t+1}(a′, a) | S_0 = s}] = r(a, s) + γE{Q(a′; a′, S*_1(a)) | S_0 = s}.   (12)

Under MA, we have E{Q(a′; a′, S*_1(a)) | S_0 = s} = ∫_{s′∈S} Q(a′; a′, s′)P(ds′; a, s). Combining this with equation 12 yields the desired result.

Part 2: We use induction to prove equation 9. When t = 0, it trivially holds. Suppose equation 9 holds for t = k; we show it holds for t = k + 1. Under MA, we have

Pr{S*_{k+2}(a′, a) ∈ S′ | S*_1(a) = s, S_0} = E[Pr{S*_{k+2}(a′, a) ∈ S′ | S*_{k+1}(a′, a), S*_1(a) = s, S_0} | S*_1(a) = s, S_0],

and applying the induction hypothesis yields the assertion. The proof is hence completed.

Proof of Lemma 1: By CA, it is equivalent to show

E[{Q(a′; A_t, S*_t(Ā_{t-1})) - Y*_t(Ā_t) - γQ(a′; a′, S*_{t+1}(Ā_t))}ϕ(A_t, S*_t(Ā_{t-1}))] = 0.

Let S_0 denote the support of S_0. For any s_0 ∈ S_0, it suffices to show

E[{Q(a′; A_t, S*_t(Ā_{t-1})) - Y*_t(Ā_t) - γQ(a′; a′, S*_{t+1}(Ā_t))}ϕ(A_t, S*_t(Ā_{t-1})) | S_0 = s_0] = 0.

This is equivalent to showing

E[{Q(a′; A_t, S*_t(Ā_{t-1})) - Y*_t(Ā_t) - γQ(a′; a′, S*_{t+1}(Ā_t))}ϕ(A_t, S*_t(Ā_{t-1}))I(A_0 = a_0) | S_0 = s_0] = 0,

for any s_0 ∈ S_0, a_0 ∈ {0, 1}. Let A_0(s_0) = {a ∈ {0, 1} : Pr(A_0 = a | S_0 = s_0) > 0}. 
It suffices to show that for any s_0 ∈ S_0, a_0 ∈ A_0(s_0),

E[{Q(a′; A_t, S*_t(Ā_{t-1})) - Y*_t(Ā_t) - γQ(a′; a′, S*_{t+1}(Ā_t))}ϕ(A_t, S*_t(Ā_{t-1}))I(A_0 = a_0) | S_0 = s_0] = 0,

or equivalently,

E[{Q(a′; A_t, S*_t(Ā_{t-1})) - Y*_t(Ā_t) - γQ(a′; a′, S*_{t+1}(Ā_t))}ϕ(A_t, S*_t(Ā_{t-1})) | S_0 = s_0, A_0 = a_0] = 0.   (13)

Let s̄_j = (s_0, s_1, ..., s_j), ȳ_j = (y_0, y_1, ..., y_j), S̄_j = (S_0, S_1, ..., S_j) and Ȳ_j = (Y_0, Y_1, ..., Y_j). We can recursively define the sets Y_j(s̄_j, ā_j, ȳ_{j-1}), S_{j+1}(s̄_j, ā_j, ȳ_j), A_{j+1}(s̄_{j+1}, ā_j, ȳ_j) to be the supports of Y_j, S_{j+1}, A_{j+1} conditional on (S̄_j = s̄_j, Ā_j = ā_j, Ȳ_{j-1} = ȳ_{j-1}), (S̄_j = s̄_j, Ā_j = ā_j, Ȳ_j = ȳ_j) and (S̄_{j+1} = s̄_{j+1}, Ā_j = ā_j, Ȳ_j = ȳ_j), respectively, for j ≥ 0. Similar to equation 13, it suffices to show

E[{Q(a′; A_t, S*_t(Ā_{t-1})) - Y*_t(Ā_t) - γQ(a′; a′, S*_{t+1}(Ā_t))}ϕ(A_t, S*_t(Ā_{t-1})) | S̄_t = s̄_t, Ā_t = ā_t, Ȳ_{t-1} = ȳ_{t-1}] = 0,

for any s_0 ∈ S_0, a_0 ∈ A_0(s_0), y_0 ∈ Y_0(s_0, a_0), ..., s_t ∈ S_t(s̄_{t-1}, ā_{t-1}, ȳ_{t-1}), a_t ∈ A_t(s̄_t, ā_{t-1}, ȳ_{t-1}). This is equivalent to showing

E[Q(a′; a_t, S*_t(ā_{t-1})) - Y*_t(ā_t) - γQ(a′; a′, S*_{t+1}(ā_t)) | S̄_t = s̄_t, Ā_t = ā_t, Ȳ_{t-1} = ȳ_{t-1}] = 0.   (14)

By construction, we have Pr(A_t = a_t | S̄_t = s̄_t, Ȳ_{t-1} = ȳ_{t-1}, Ā_{t-1} = ā_{t-1}) > 0. Under SRA, the left-hand side (LHS) of equation 14 equals

E[Q(a′; a_t, S*_t(ā_{t-1})) - Y*_t(ā_t) - γQ(a′; a′, S*_{t+1}(ā_t)) | S̄_t = s̄_t, Ā_{t-1} = ā_{t-1}, Ȳ_{t-1} = ȳ_{t-1}].   (15)

Notice that the conditioning event is the same as {S*_t(ā_{t-1}) = s_t, Y*_{t-1}(ā_{t-1}) = y_{t-1}, S̄_{t-1} = s̄_{t-1}, Ā_{t-1} = ā_{t-1}, Ȳ_{t-2} = ȳ_{t-2}}. Under SRA, equation 15 equals

E[Q(a′; a_t, S*_t(ā_{t-1})) - Y*_t(ā_t) - γQ(a′; a′, S*_{t+1}(ā_t)) | S*_t(ā_{t-1}) = s_t, Y*_{t-1}(ā_{t-1}) = y_{t-1}, S̄_{t-1} = s̄_{t-1}, Ā_{t-2} = ā_{t-2}, Ȳ_{t-2} = ȳ_{t-2}]. 
By recursively applying SRA, we can show that the left-hand side of equation 14 equals

E[Q(a′; a_t, S*_t(ā_{t-1})) - Y*_t(ā_t) - γQ(a′; a′, S*_{t+1}(ā_t)) | {S*_j(ā_{j-1}) = s_j}_{1≤j≤t}, {Y*_j(ā_j) = y_j}_{1≤j≤t-1}].

This is equal to zero by MA, CMIA and Lemma 2. The proof is hence completed.

F.2 PROOF OF THEOREM 3

F.2.1 PROOF UNDER D1

We begin by providing an outline of the proof, which is divided into three steps. In the first step, we show that for any T_1 ≤ t ≤ T_K, the estimator β(t) satisfies the linear representation

β(t) - β* = Σ^{-1}(t) { t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤ } + o_p(t^{-1/2}),   (16)

where the leading term is denoted ζ_1(t), Σ(t) = E Σ̂(t), and ε_{j,a} = Y_j + γQ(a; a, S_{j+1}) - Q(a; A_j, S_j) for a = 0, 1. Based on this representation, in the second step we show the asymptotic normality of τ(t); specifically, √t{τ(t) - τ_0}/σ(t) →_d N(0, 1). In the last step, we prove Theorem 3.

Part 1: By definition, we have

β(t) - β* = Σ̂^{-1}(t){ t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤Y_j, ξ_j^⊤Y_j)^⊤ - Σ̂(t)β* } = Σ̂^{-1}(t){ t^{-1} Σ_{j=0}^{t-1} [(ξ_j^⊤Y_j, ξ_j^⊤Y_j)^⊤ - Σ_j β*] } = Σ̂^{-1}(t){ t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤{Y_j - Ψ^⊤(S_j)β*_{0,A_j} + γΨ^⊤(S_{j+1})β*_{0,0}}, ξ_j^⊤{Y_j - Ψ^⊤(S_j)β*_{1,A_j} + γΨ^⊤(S_{j+1})β*_{1,1}})^⊤ },

where the last equality is due to the definition of Σ_j. Let r_{a,j} = Ψ^⊤(S_j)β*_{a,A_j} - γΨ^⊤(S_{j+1})β*_{a,a} - Q(a; A_j, S_j) + γQ(a; a, S_{j+1}). It follows that

β(t) - β* = Σ̂^{-1}(t){ t^{-1} Σ_j (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤ } - Σ̂^{-1}(t){ t^{-1} Σ_j (ξ_j^⊤r_{0,j}, ξ_j^⊤r_{1,j})^⊤ },

and hence

β(t) - β* = ζ_1(t) + {Σ̂^{-1}(t) - Σ^{-1}(t)}{ t^{-1} Σ_j (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤ } - Σ̂^{-1}(t){ t^{-1} Σ_j (ξ_j^⊤r_{0,j}, ξ_j^⊤r_{1,j})^⊤ },

where the second and third terms on the right-hand side are denoted ζ_2(t) and ζ_3(t), respectively. We first consider ζ_3(t). Its norm can be upper bounded by

||Σ̂^{-1}(t)||_2 || t^{-1} Σ_j (ξ_j^⊤r_{0,j}, ξ_j^⊤r_{1,j})^⊤ ||_2 ≤ ||Σ̂^{-1}(t)||_2 max_{a∈{0,1}} sup_{||a̅||_2=1} { t^{-1} Σ_j (a̅^⊤ξ_j)² r²_{a,j} }^{1/2} ≤ max_{a,j} |r_{a,j}| ||Σ̂^{-1}(t)||_2 λ_max^{1/2}( t^{-1} Σ_j ξ_jξ_j^⊤ ),

where the second inequality follows from the Cauchy-Schwarz inequality. Under Condition C2(i), we have max_{a,j} |r_{a,j}| = o(t^{-1/2}) for any j ≤ t ≤ T_K. Suppose for now we have shown

||Σ̂^{-1}(t)||_2 = O_p(1) and λ_max( t^{-1} Σ_j ξ_jξ_j^⊤ ) = O_p(1).   (17)

It follows that ||ζ_3(t)||_2 = o_p(t^{-1/2}).   (18)
To bound ζ_2(t), notice that for any a ∈ {0, 1},

E|| t^{-1} Σ_{j=0}^{t-1} ξ_j ε_{j,a} ||²_2 = t^{-2} Σ_{j=0}^{t-1} E(ξ_j^⊤ξ_j ε²_{j,a}) + t^{-2} Σ_{j1≠j2} E(ξ_{j1}^⊤ξ_{j2} ε_{j1,a} ε_{j2,a}).

Similar to Lemma 1, we can show that for any ϕ(·) that is a function of Ā_t, S̄_t, Ȳ_{t-1},

E[{Q(a′; A_t, S_t) - Y_t - γQ(a′; a′, S_{t+1})}ϕ(S̄_t, Ā_t, Ȳ_{t-1})] = 0.   (19)

This implies that E(ξ_{j1}^⊤ξ_{j2}ε_{j1,a}ε_{j2,a}) = 0 for any j1 ≠ j2. It follows that

E|| t^{-1} Σ_j ξ_j ε_{j,a} ||²_2 = t^{-2} Σ_j E(ξ_j^⊤ξ_j ε²_{j,a}) ≤ q λ_max( t^{-2} Σ_j E ξ_jξ_j^⊤ ε²_{j,a} ).

Since all immediate rewards are uniformly bounded, so is the Q-function. As a result, the |ε_{j,a}|'s are uniformly bounded. Suppose for now we have shown

λ_max( t^{-1} Σ_{j=0}^{t-1} E ξ_jξ_j^⊤ ) = O(1).   (20)

It follows that E|| t^{-1} Σ_j ξ_j ε_{j,a} ||²_2 = O(q/t) and hence || t^{-1} Σ_j (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤ ||_2 = O_p(t^{-1/2}√q). Suppose

||Σ̂^{-1}(t) - Σ^{-1}(t)||_2 = o_p(q^{-1/2}).   (21)

It follows that

||ζ_2(t)||_2 ≤ ||Σ̂^{-1}(t) - Σ^{-1}(t)||_2 || t^{-1} Σ_j (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤ ||_2 = o_p(t^{-1/2}).

This together with equation 18 yields equation 16. It remains to show that equation 17, equation 20 and equation 21 hold. We summarize these results in Lemma 3.

Lemma 3 Under the given conditions, equation 17, equation 20 and equation 21 hold.

Part 2: By definition, we have τ(t) - τ_0 = U^⊤{β(t) - β*} + U^⊤β* - τ_0. Define

Ω(t) = E{ t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}) }.

The asymptotic variance of √t{τ(t) - τ_0} is given by σ²(t) = U^⊤Σ^{-1}(t)Ω(t){Σ^{-1}(t)}^⊤U. We begin by providing a lower bound for σ²(t). Notice that

σ²(t) ≥ λ_min{Ω(t)} ||{Σ^{-1}(t)}^⊤U||²_2 ≥ λ_min{Ω(t)} λ_min[Σ^{-1}(t){Σ^{-1}(t)}^⊤] ||U||²_2.   (22)

Under C2(iii), we have lim inf_q ||U||²_2 > 0. In addition, notice that Σ^{-1}(t){Σ^{-1}(t)}^⊤ is positive semi-definite. It follows that λ_min[Σ^{-1}(t){Σ^{-1}(t)}^⊤] = 1/λ_max[Σ(t){Σ(t)}^⊤]. Using arguments similar to those showing ||Σ*^(0)_{2,2}(0)||_2 = O(1) in the proof of Lemma 3, we can show sup_{t≥1} ||Σ(t)||_2 = O(1) and hence sup_{t≥1} λ_max[Σ(t){Σ(t)}^⊤] = O(1).   (23)
This further yields inf_{t≥1} λ_min[Σ^{-1}(t){Σ^{-1}(t)}^⊤] > 0. Suppose Ω(t) satisfies

lim inf_t λ_min{Ω(t)} > 0.   (24)

It follows that σ²(t) is bounded away from zero for sufficiently large t. Under Condition C2(i), we have U^⊤β* - τ_0 = o(T_k^{-1/2}) = o(t^{-1/2}). It follows that

√t{τ(t) - τ_0}/σ(t) = √t U^⊤{β(t) - β*}/σ(t) + √t(U^⊤β* - τ_0)/σ(t) = √t U^⊤{β(t) - β*}/σ(t) + o(1).

Moreover, it follows from equation 22, equation 23 and equation 24 that σ(t)/||U||_2 is uniformly bounded away from zero for sufficiently large t. Combining this with equation 16 yields

√t{τ(t) - τ_0}/σ(t) = √t U^⊤ζ_1(t)/σ(t) + √t U^⊤R_t/σ(t),

where the remainder term satisfies ||R_t||_2 = o_p(t^{-1/2}). The second term on the right-hand side (RHS) of the above expression is bounded from above by √t ||R_t||_2 ||U||_2/σ(t) = o_p(1), and hence

√t{τ(t) - τ_0}/σ(t) = √t U^⊤ζ_1(t)/σ(t) + o_p(1).   (25)

Similar to the proof of Lemma 1, we can show that for any j ≥ 0, a ∈ {0, 1}, E(ξ_j ε_{j,a} | {S_i, A_i, Y_i}_{i<j}) = 0. By the definition of ζ_1(t), √t U^⊤ζ_1(t)/σ(t) forms a martingale with respect to the filtration σ({S_j, A_j, Y_j}_{j<t}), i.e., the σ-algebra generated by {S_j, A_j, Y_j}_{j<t}. By the martingale central limit theorem, we can show √t U^⊤ζ_1(t)/σ(t) →_d N(0, 1) (see Lemma 4 for details). To complete the proof of Part 2, we need to show that equation 24 holds and that σ̂(t)/σ(t) →_P 1. The latter assertion can be proven using arguments from Step 3 of the proof of Theorem 1 in Shi et al. (2020a). We establish the asymptotic normality of √t U^⊤ζ_1(t)/σ(t) and equation 24 in the following lemma.

Lemma 4 Under the given conditions, equation 24 holds and √t U^⊤ζ_1(t)/σ(t) →_d N(0, 1).

Part 3: The results in Part 2 yield that √T_k{τ(T_k) - τ_0}/σ(T_k) →_d N(0, 1) for each 1 ≤ k ≤ K. 
In addition, for any K-dimensional vector a = (a 1 , • • • , a K ) , it follows from equation 25 that K k=1 a k √ T k { τ (T k ) -τ 0 } σ(T k ) = K k=1 a k √ T k U ζ 1 (T k ) σ(T k ) + o p (1). The leading term on the RHS can be rewritten as a weighted sum of {ξ j ε j,0 , ξ j ε j,1 } 0≤j<t . Similar to the proof in Part 2, we can show it forms a martingale with respect to the filtration σ({S j , A j , Y j } j<t ). We now derive its asymptotic normality for any a, using the martingale central limit theorem for triangular arrays. By Corollary 2 of McLeish (1974) , we need to verify the following two conditions: (a) max 0≤j<t | K k=1 a k T -1/2 k U Σ -1 (T k )(ξ j ε j,0 , ξ j ε j,1 ) {σ(T k )} -1 I(j < T k )| P → 0; (b) T -1 j=0 | K k=1 a k T -1/2 k U Σ -1 (T k )(ξ j ε j,0 , ξ j ε j,1 ) {σ(T k )} -1 I(j < T k )| 2 converges to some constant in probability. Since K is fixed, to verify (a), it suffices to show max 1≤j<t,1≤k≤K T -1/2 k |U Σ -1 (T k )(ξ j ε j,0 , ξ j ε j,1 ) {σ(T k )} -1 | P → 0. In Lemma 3, we have shown Σ -1 (t) = O(1). In Part 1 and Part 2 of the proof, we have shown |ε j,a |'s are uniformly bounded and that σ(t)/ U 2 is bounded away from zero. Therefore, it suffices to show T -1/2 1 max 0≤j<t ξ j 2 P → 0. Under Condition C2(ii), we have sup s Ψ(s) 2 = O(q 1/2 ) and hence max 0≤j<t ξ j 2 = O(q 1/2 ). The assertion thus follows by noting that T 1 = c 1 T and q = o(T ).

Using similar arguments as in Step 3 of the proof of Shi et al. (2020a), we can show that

|| t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}) - Ω(t) ||_2 →_P 0, as t → ∞.

This, together with the facts that ||Σ^{-1}(t)||_2 = O(1) and that σ(t)/||U||_2 is bounded away from zero, implies that

| [a_{k1}a_{k2}/{√(T_{k1}T_{k2}) σ²(T_{k1}∧T_{k2})}] Σ_{j=0}^{T_{k1}∧T_{k2}-1} U^⊤Σ^{-1}(T_{k1})(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}){Σ^{-1}(T_{k2})}^⊤U - [a_{k1}a_{k2}(T_{k1}∧T_{k2})/{√(T_{k1}T_{k2}) σ²(T_{k1}∧T_{k2})}] U^⊤Σ^{-1}(T_{k1})Ω(T_{k1}∧T_{k2}){Σ^{-1}(T_{k2})}^⊤U | ≤ [|a_{k1}a_{k2}|/σ²(T_{k1}∧T_{k2})] ||U||²_2 max_k ||Σ^{-1}(T_k)||²_2 || (T_{k1}∧T_{k2})^{-1} Σ_{j=0}^{T_{k1}∧T_{k2}-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}) - Ω(T_{k1}∧T_{k2}) ||_2 →_P 0,

where a ∧ b = min(a, b). It follows that

Σ_{j=0}^{T-1} | Σ_{k=1}^K a_k T_k^{-1/2} U^⊤Σ^{-1}(T_k)(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤{σ(T_k)}^{-1} I(j < T_k) |² - Σ_{k1≠k2} [a_{k1}a_{k2}(T_{k1}∧T_{k2})/{√(T_{k1}T_{k2}) σ²(T_{k1}∧T_{k2})}] U^⊤Σ^{-1}(T_{k1})Ω(T_{k1}∧T_{k2}){Σ^{-1}(T_{k2})}^⊤U = o_p(1).   (27)

In the proofs of Lemmas 3 and 4, we show that ||Σ^{-1}(t) - (Σ*^(0))^{-1}||_2 = O(t^{-1/2}) and ||Ω(t) - Ω*^(0)||_2 = O(t^{-1/2}) for some matrices Σ*^(0) and Ω*^(0) that are invariant to t. The definitions of these two matrices can be found in Sections F.5 and F.6. Under C2(ii) and the condition that q = o(√T/log T), we can show ||U||_2 = O(q^{1/2}) and hence σ²(t) →_P (σ*^(0))², where (σ*^(0))² = U^⊤(Σ*^(0))^{-1}Ω*^(0){(Σ*^(0))^{-1}}^⊤U. Similar to equation 27, we have

Σ_{k1≠k2} [a_{k1}a_{k2}(T_{k1}∧T_{k2})/{√(T_{k1}T_{k2}) σ²(T_{k1}∧T_{k2})}] U^⊤Σ^{-1}(T_{k1})Ω(T_{k1}∧T_{k2}){Σ^{-1}(T_{k2})}^⊤U →_P Σ_{k1≠k2} a_{k1}a_{k2}(c_{k1}∧c_{k2})/√(c_{k1}c_{k2}),   (28)

where the c_k's are defined in Section 4.4. This together with equation 27 yields that

Σ_{j=0}^{T-1} | Σ_{k=1}^K a_k T_k^{-1/2} U^⊤Σ^{-1}(T_k)(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤{σ(T_k)}^{-1} I(j < T_k) |² →_P Σ_{k1,k2} a_{k1}a_{k2}(c_{k1}∧c_{k2})/√(c_{k1}c_{k2}).

Conditions (a) and (b) are thus verified. 
By Lemma 4, we can show K k=1 a k √ T k { τ (T k ) -τ 0 } σ(T k ) = K k=1 a k √ T k { τ (T k ) -τ 0 } σ(T k ) + o p (1), for any (a 1 , • • • , a K ). This yields the joint asymptotic normality of our test statistics. By equation 28, its covariance matrix is given by Ξ 0 whose (k 1 , k 2 )-th entry is equal to (c k1 c k2 ) -1/2 c k1 ∧ c k2 . Using similar arguments in proving equation 27, equation 28 and Step 3 of the proof in Theorem 1, Shi et al. (2020a) , we can show Ξ is a consistent estimator for Ξ 0 . This completes the proof of Theorem 3 under D1.

F.2.2 PROOF UNDER D2

The proof is very similar to that under D1. Suppose we can show that equation 17, equation 20, equation 21 and equation 24 hold. Then, similar to the proof under D1, we have √t{τ(t) - τ_0}/σ(t) = √t U^⊤ζ_1(t)/σ(t) + o_p(1). The following lemma shows that these assertions hold under D2 as well.

Lemma 5 Under the given conditions, equation 17, equation 20, equation 21 and equation 24 hold.

It follows that for any K-dimensional vector a = (a_1, ..., a_K)^⊤,

Σ_{k=1}^K a_k √T_k{τ(T_k) - τ_0}/σ(T_k) = Σ_{k=1}^K a_k √T_k U^⊤ζ_1(T_k)/σ(T_k) + o_p(1).

In the proof of Lemma 5, we show that ||{Σ(t)}^{-1}||_2 = O(1) and ||Σ(t) - Σ*||_2 = O(t^{-1/2}) for some time-invariant matrix Σ* that satisfies ||(Σ*)^{-1}||_2 = O(1). It follows that

||{Σ(t)}^{-1} - (Σ*)^{-1}||_2 = ||{Σ(t)}^{-1}(Σ(t) - Σ*)(Σ*)^{-1}||_2 ≤ ||{Σ(t)}^{-1}||_2 ||Σ(t) - Σ*||_2 ||(Σ*)^{-1}||_2 = O(t^{-1/2}).

Similarly, we can show ||Ω(t) - Ω*||_2 = O(t^{-1/2}) for some matrix Ω*. In addition, using arguments similar to the proof of Lemma 5, we can show that equation 26 holds under D2 as well. The joint asymptotic normality of our test statistics then follows using arguments from Part 3 of the proof under D1. Similarly, we can show Ξ is consistent. This completes the proof under D2.

F.2.3 PROOF UNDER D3

The proof under D1 indicates that equation 17, equation 20, equation 21 and equation 24 hold with t = T 1 . It follows that √ T 1 { τ (T 1 ) -τ 0 } σ(T 1 ) = √ T 1 U ζ 1 (T 1 ) σ(T 1 ) + o p (1). ( ) The rest of the proof is divided into two parts. In the first part, we show for k = 2, • • • , K, √ T k { τ (T k ) -τ 0 } σ * (T k ) = √ T k U ζ * 1 (T k ) σ * (T k ) + o p (1), for some ζ * 1 (T k ) and σ * (T k ) defined below. In the second part, we show the assertion in Theorem 3 holds under D3. Part 1: For any 1 ≤ k ≤ K, consider the matrices Σ (k) = 1 T k -T k-1 T k -1 j=T k-1 E[Σ j |{(S t , A t , Y t )} 0≤t<T k-1 ] and Σ (k) = 1 T k -T k-1 T k -1 j=T k-1 Σ j . We show in Lemma 6 below that for k = 2, • • • , K, Σ (k) -Σ (k) 2 = o p (q -1/2 ), and {Σ (k) } -1 2 = O p (1). ( ) where Σ (k) = T -1 k k i=1 (T i -T i-1 )Σ (i) . Lemma 6 Under the given conditions, we have equation 31 and equation 32 hold. Notice that (T i -T i-1 )/T k → (c i -c i-1 )/c k and {c -1 k k i=1 (c i -c i-1 )Σ (i) } -1 2 = O p (1) where c 0 = 0. It follows from equation 31 that c -1 k k i=1 (c i -c i-1 )( Σ (i) -Σ (i) ) 2 = o p (q -1/2 ) and hence 1 T k k i=1 (T i -T i-1 )( Σ (i) -Σ (i) ) 2 = o p (q -1/2 ), ∀k = 2, • • • , K. Similar to the proof under D1, we can show 1 T k k i=1 (T i -T i-1 ) Σ (i) -1 2 = O p (1) and 1 T k k i=1 (T i -T i-1 ) Σ (i) -1 -{Σ (k) } -1 2 = o p (q -1/2 ), for k = 2, • • • , K. Thus, equation 21 and the first assertion in equation 17 hold with t = T 2 , T 3 , • • • , T K under D3. In addition, similar to Lemma 6, we can show λ max   1 T k -T k-1 T k -1 j=T k-1 E[ξ j ξ j |{(S t , A t , Y t )} 0≤t<T k-1 ]   = O p (1), and 1 T k -T k-1 T k -1 j=T k-1 ξ j ξ j -E[ξ j ξ j |{(S t , A t , Y t )} 0≤t<T k-1 ] 2 = o p (q -1/2 ), for k = 2, • • • , K. This yields (T k -T k-1 ) -1 T k -1 j=T k-1 ξ j ξ j 2 = O p (1) . As a result, the second assertion in equation 17 holds with t = T 2 , T 3 , • • • , T K . 
Moreover, using arguments similar to those showing ||t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤||_2 = O_p(t^{-1/2}√q) under D1, we have by equation 33 that ||(T_k - T_{k-1})^{-1} Σ_{j=T_{k-1}}^{T_k-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤||_2 = O_p{(T_k - T_{k-1})^{-1/2}√q} for k = 1, ..., K. Under the given conditions on {T_k}_k, we obtain ||t^{-1} Σ_{j=0}^{t-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤||_2 = O_p(t^{-1/2}√q) for t = T_2, T_3, ..., T_K. Based on these results, and using arguments similar to Part 1 of the proof under D1, we can show

√T_k{β(T_k) - β*} = √T_k ζ*_1(T_k) + o_p(1), ∀k ∈ {2, ..., K},

where ζ*_1(T_k) = T_k^{-1} Σ_{j=0}^{T_k-1} (Σ̄^(k))^{-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤. For 1 ≤ k ≤ K, define

Ω^(k) = (T_k - T_{k-1})^{-1} Σ_{j=T_{k-1}}^{T_k-1} E[(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}) | {(S_t, A_t, Y_t)}_{0≤t<T_{k-1}}],

and Ω̄^(k) = T_k^{-1} Σ_{i=1}^k (T_i - T_{i-1})Ω^(i). For any 2 ≤ k ≤ K, we have λ_min(Ω̄^(k)) ≥ λ_min(T_k^{-1}T_1Ω^(1)). Since T_k^{-1}T_1 → c_k^{-1}c_1 > 0 and λ_min(Ω^(1)) = λ_min(Ω(T_1)) is bounded away from zero, λ_min(Ω̄^(k)) is bounded away from zero for k = 2, ..., K as well. Define {σ*(T_k)}² = U^⊤(Σ̄^(k))^{-1}Ω̄^(k){(Σ̄^(k))^{-1}}^⊤U. It can be shown that σ*(T_k)/||U||_2 is bounded away from zero for k = 2, ..., K. Using arguments similar to Part 2 of the proof under D1, we can show equation 30 holds. This completes the proof of Part 1.

Part 2: Let σ*(T_1) = σ(T_1) and ζ*_1(T_1) = ζ_1(T_1). By equation 29 and equation 30, we have for any K-dimensional vector a = (a_1, ..., a_K)^⊤ that

Σ_{k=1}^K a_k √T_k{τ(T_k) - τ_0}/σ*(T_k) = Σ_{k=1}^K a_k √T_k U^⊤ζ*_1(T_k)/σ*(T_k) + o_p(1).   (35)

In the following, we show that the leading term on the RHS of equation 35 is asymptotically normal. Similar to the proof under D1, it suffices to verify the following conditions:
(a) max_{0≤j<T} |Σ_{k=1}^K a_k T_k^{-1/2} U^⊤(Σ̄^(k))^{-1}(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤{σ*(T_k)}^{-1} I(j < T_k)| →_P 0;
(b) Σ_{j=0}^{T-1} |Σ_{k=1}^K a_k T_k^{-1/2} U^⊤(Σ̄^(k))^{-1}(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤{σ*(T_k)}^{-1} I(j < T_k)|² converges to some constant in probability. 
Condition (a) can be proven in a similar manner as in Part 3 of the proof under D1. Notice that Σ̄^(k), Ω̄^(k) and σ*(T_k) are random and depend on the observed data history. In the proof of Lemma 6, we show that, for k = 2, ..., K, ||(Σ̄^(k))^{-1} - (Σ**)^{-1}||_2 = O_p(T^{-1/2}) for some deterministic matrix Σ**. Similarly, we can show ||Ω̄^(k) - Ω**||_2 = O_p(T^{-1/2}) and |{σ*(T_k)}² - (σ**)²| = O_p(T^{-1/2}) for some Ω**, σ** and all k ∈ {2, ..., K}. Moreover, using arguments similar to the proof of Lemma 6, we can show that

|| (T_k - T_{k-1})^{-1} Σ_{j=T_{k-1}}^{T_k-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}) - Ω^(k) ||_2 = o_p(q^{-1/2}), ∀k = 2, ..., K.

This further implies that

|| T_k^{-1} Σ_{j=0}^{T_k-1} (ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1})^⊤(ξ_j^⊤ε_{j,0}, ξ_j^⊤ε_{j,1}) - Ω** ||_2 = o_p(q^{-1/2}), ∀k = 2, ..., K.

Based on these results, and using arguments similar to Part 3 of the proof of Lemma 3, we obtain (b). The joint asymptotic normality of √T_1{τ(T_1) - τ_0}/σ*(T_1), ..., √T_K{τ(T_K) - τ_0}/σ*(T_K) thus follows. The consistency of Ξ can be similarly proven. We omit the details for brevity.

F.3 PROOF OF THEOREM 1

As discussed in Section 4.3, $(Z_1^*,Z_2^*,\dots,Z_K^*)^\top$ is jointly normal with mean zero and covariance matrix $\widehat{\Xi}$, conditional on the observed data. By Theorem 3, we have $\widehat{\Xi}\xrightarrow{P}\Xi_0$, where $\Xi_0$ is the asymptotic covariance matrix of $(Z_1,Z_2,\dots,Z_K)^\top$. Let $\alpha^*(t)=\alpha(tT)$ for any $0\le t\le 1$; we have $\alpha(T_k)\to\alpha^*(c_k)$ for any $1\le k\le K$. Notice that $\{\widehat{b}_k\}_{1\le k\le K}$ is a continuous function of $\widehat{\Xi}$ and $\{\alpha(T_k)\}_{1\le k\le K}$. It follows that $\widehat{b}_k\xrightarrow{P}b_{k,0}$ for $1\le k\le K$, where $\{b_{k,0}\}_{1\le k\le K}$ are recursively defined by
$$\Pr\Big\{\max_{1\le j<k}(Z_{j,0}-b_{j,0})\le 0,\ Z_{k,0}>b_{k,0}\Big\}=\alpha^*(c_k)-\alpha^*(c_{k-1}),$$
where $(Z_{1,0},Z_{2,0},\dots,Z_{K,0})^\top$ is normal with mean zero and covariance matrix $\Xi_0$. Theorem 3 implies that $(Z_1-\sqrt{T_1}\tau_0/\widehat{\sigma}(T_1),Z_2-\sqrt{T_2}\tau_0/\widehat{\sigma}(T_2),\dots,Z_K-\sqrt{T_K}\tau_0/\widehat{\sigma}(T_K))^\top\xrightarrow{d}(Z_{1,0},Z_{2,0},\dots,Z_{K,0})^\top$. It follows that
$$\Pr\Big(\bigcup_{j=1}^{k}\{Z_j>\widehat{b}_j\}\Big)\le\Pr\Big(\bigcup_{j=1}^{k}\{Z_j-\sqrt{T_j}\tau_0/\widehat{\sigma}(T_j)>\widehat{b}_j\}\Big)\to\Pr\Big(\bigcup_{j=1}^{k}\{Z_{j,0}>b_{j,0}\}\Big)=\alpha^*(c_k).\qquad(36)$$
The proof is hence completed by noting that $\alpha(T_k)\to\alpha^*(c_k)$. When $\tau_0=0$, we have $EZ_k=o(1)$; the rejection probability thus converges to the nominal level.
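The recursive boundary equations above lend themselves to a simple Monte Carlo solver. The sketch below is our own illustration (the function name and the use of numpy are assumptions, not the paper's algorithm): given a covariance matrix and the cumulative spending values $\alpha^*(c_1),\dots,\alpha^*(c_K)$, it solves for the boundaries stage by stage over simulated paths.

```python
import numpy as np

def sequential_boundaries(Xi, alpha_targets, n_mc=200_000, seed=0):
    """Monte Carlo solver for the recursion
    Pr{max_{j<k}(Z_j - b_j) <= 0, Z_k > b_k} = alpha*(c_k) - alpha*(c_{k-1}),
    with (Z_1,...,Z_K) ~ N(0, Xi) and alpha_targets = (alpha*(c_1),...,alpha*(c_K))."""
    rng = np.random.default_rng(seed)
    K = Xi.shape[0]
    Z = rng.multivariate_normal(np.zeros(K), Xi, size=n_mc)
    alive = np.ones(n_mc, dtype=bool)          # paths with no rejection so far
    b, prev = np.empty(K), 0.0
    for k in range(K):
        spend = alpha_targets[k] - prev        # type-I error spent at stage k
        zk = Z[alive, k]
        # pick b_k so that (# alive paths with Z_k > b_k) / n_mc == spend
        b[k] = np.quantile(zk, 1.0 - spend * n_mc / zk.size)
        idx = np.flatnonzero(alive)
        alive[idx[Z[idx, k] > b[k]]] = False   # these paths reject at stage k
        prev = alpha_targets[k]
    return b
```

For $K=1$ and $\alpha^*(c_1)=0.05$ this recovers the usual one-sided normal critical value, approximately 1.645.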

F.4 PROOF OF THEOREM 2

Suppose $\tau_0=T^{-1/2}h$ for some $h>0$. Based on the proof of Theorem 3, we can show $\widehat{\sigma}(T_k)\xrightarrow{P}\sigma_k^*$ for some $\sigma_k^*>0$. It follows from equation 36 that
$$\Pr\Big(\bigcup_{j=1}^{k}\{Z_j>\widehat{b}_j\}\Big)=\Pr\Big(\bigcup_{j=1}^{k}\{Z_j-\sqrt{T_j}\tau_0/\widehat{\sigma}(T_j)>\widehat{b}_j-h/\widehat{\sigma}(T_j)\}\Big)\to\Pr\Big(\bigcup_{j=1}^{k}\{Z_{j,0}>b_{j,0}-h/\sigma_j^*\}\Big)>\alpha^*(c_k).$$
The second assertion in Theorem 2 thus holds by noting that $\alpha(T_k)\to\alpha^*(c_k)$. Letting $h\to\infty$, we obtain
$$\Pr\Big(\bigcup_{j=1}^{k}\{Z_j>\widehat{b}_j\}\Big)=\Pr\Big(\bigcup_{j=1}^{k}\{Z_{j,0}>b_{j,0}-h/\sigma_j^*\}\Big)+o(1)\to 1.$$
The proof is hence completed.
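The power behavior above can be seen concretely in a toy Monte Carlo. The following is our own illustration, assuming independent standard-normal $Z_k$ rather than the actual correlated test statistics: under a mean shift of $h/\sigma^*_j$, the rejection probability exceeds the nominal level and tends to 1 as the shift grows.

```python
import numpy as np

def rejection_prob(b, mean_shift, n_mc=100_000, seed=1):
    """Estimate Pr{ union_k (Z_k + shift_k > b_k) } for independent
    standard-normal Z_k (an illustrative special case of Theorem 2)."""
    rng = np.random.default_rng(seed)
    K = len(b)
    Z = rng.standard_normal((n_mc, K)) + np.asarray(mean_shift)
    return float(np.mean((Z > np.asarray(b)).any(axis=1)))
```

With boundaries roughly (1.96, 1.949) the null rejection probability is about 0.05; a unit shift raises it well above the nominal level, and large shifts drive it toward 1, mirroring Theorem 2.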

F.5 PROOF OF LEMMA 3

Under the given conditions in C1(i), C1(ii) and C2(ii), equation 20 and the second assertion in equation 17 can be proven using similar arguments in the proofs of Lemmas E.2 and E.3 of Shi et al. (2020a). We omit the proof for brevity. It remains to show that equation 21 and the first assertion in equation 17 hold. Recall that $\mu$ is the density function of the stationary distribution $\Pi$ (see the remark below Condition C1). In addition, $\mu$ is uniformly bounded away from 0 and $\infty$ under C1(i). For $a'\in\{0,1\}$, define the matrix
$$\Sigma^{(0)*}(a')=\int_{s,s'\in\mathcal{S}}\sum_{a\in\{0,1\}}\xi(s,a)\{\xi(s,a)-\gamma\xi(s',a')\}^\top\mu(s)\{(1-a)b^{(0)}(s)+a(1-b^{(0)}(s))\}p(s';a,s)\,ds\,ds'.$$
Define $\Sigma^{(0)*}=\mathrm{diag}[\Sigma^{(0)*}(0),\Sigma^{(0)*}(1)]$. The matrix $\Sigma^{(0)*}$ is the population limit of $\widehat{\Sigma}(t)$ under D1. To prove the first assertion in equation 17, we first show $\|(\Sigma^{(0)*})^{-1}\|_2=O(1)$ (equation 37 below). By definition, this is equivalent to showing $\|\{\Sigma^{(0)*}(a)\}^{-1}\|_2=O(1)$ for $a\in\{0,1\}$. The matrix $\Sigma^{(0)*}(0)$ can be written as
$$\Sigma^{(0)*}(0)=\begin{pmatrix}\Sigma^{(0)*}_{1,1}(0)&0\\ \Sigma^{(0)*}_{2,1}(0)&\Sigma^{(0)*}_{2,2}(0)\end{pmatrix},$$
where
$$\Sigma^{(0)*}_{1,1}(0)=\int_{s,s'\in\mathcal{S}}\Psi(s)\{\Psi(s)-\gamma\Psi(s')\}^\top b^{(0)}(s)\mu(s)p(s';0,s)\,ds\,ds',$$
$$\Sigma^{(0)*}_{2,1}(0)=-\gamma\int_{s,s'\in\mathcal{S}}\Psi(s)\Psi^\top(s')(1-b^{(0)}(s))\mu(s)p(s';0,s)\,ds\,ds',$$
$$\Sigma^{(0)*}_{2,2}(0)=\int_{s\in\mathcal{S}}\Psi(s)\Psi^\top(s)\mu(s)(1-b^{(0)}(s))\,ds.$$
It follows that
$$\{\Sigma^{(0)*}(0)\}^{-1}=\begin{pmatrix}\{\Sigma^{(0)*}_{1,1}(0)\}^{-1}&0\\ -\{\Sigma^{(0)*}_{2,2}(0)\}^{-1}\Sigma^{(0)*}_{2,1}(0)\{\Sigma^{(0)*}_{1,1}(0)\}^{-1}&\{\Sigma^{(0)*}_{2,2}(0)\}^{-1}\end{pmatrix},$$
and hence, taking suprema over unit vectors $a_1,a_2$ in the definition of the operator norm,
$$\|\{\Sigma^{(0)*}(0)\}^{-1}\|_2\le\|\{\Sigma^{(0)*}_{1,1}(0)\}^{-1}\|_2+\|\{\Sigma^{(0)*}_{2,2}(0)\}^{-1}\|_2+\|\{\Sigma^{(0)*}_{2,2}(0)\}^{-1}\|_2\,\|\Sigma^{(0)*}_{2,1}(0)\|_2\,\|\{\Sigma^{(0)*}_{1,1}(0)\}^{-1}\|_2.$$
Thus, to prove $\|\{\Sigma^{(0)*}(0)\}^{-1}\|_2=O(1)$, it suffices to show
$$\|\{\Sigma^{(0)*}_{1,1}(0)\}^{-1}\|_2=O(1),\qquad(38)$$
$$\|\{\Sigma^{(0)*}_{2,2}(0)\}^{-1}\|_2=O(1),\qquad(39)$$
$$\|\Sigma^{(0)*}_{2,1}(0)\|_2=O(1).\qquad(40)$$
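The inverse formula for the lower block-triangular matrix used above can be checked numerically. The following sketch is our own illustration (numpy, with random well-conditioned blocks standing in for $\Sigma^{(0)*}_{1,1}(0)$, $\Sigma^{(0)*}_{2,2}(0)$ and $\Sigma^{(0)*}_{2,1}(0)$):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 4
# Random SPD diagonal blocks A, B and an arbitrary off-diagonal block C,
# mimicking Sigma_{1,1}(0), Sigma_{2,2}(0) and Sigma_{2,1}(0).
A = rng.standard_normal((q, q)); A = A @ A.T + np.eye(q)
B = rng.standard_normal((q, q)); B = B @ B.T + np.eye(q)
C = rng.standard_normal((q, q))
M = np.block([[A, np.zeros((q, q))], [C, B]])
Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)
# Claimed inverse of [[A, 0], [C, B]]: [[A^-1, 0], [-B^-1 C A^-1, B^-1]]
M_inv = np.block([[Ai, np.zeros((q, q))], [-Bi @ C @ Ai, Bi]])
assert np.allclose(M_inv, np.linalg.inv(M))
```

The triangular structure is what lets the operator-norm bound decompose into the three block bounds (38)-(40).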
Under D1, $b^{(0)}$ is strictly positive. Since $\mu$ is strictly positive, it suffices to show
$$a^\top\int_{s,s'\in\mathcal{S}}\Psi(s)\{\Psi(s)-\gamma\Psi(s')\}^\top\,ds\,ds'\,a\ge c_2\|a\|_2^2,\quad\forall a,$$
for some $c_2>0$. Since the density function $\mu$ is uniformly bounded, we have
$$\|\Sigma^{(0)*}_{2,1}(0)\|_2\le O(1)\sup_{\|a_1\|_2=1,\|a_2\|_2=1}\int_{s,s'\in\mathcal{S}}|a_1^\top\Psi(s)||a_2^\top\Psi(s')|\,ds\,ds',$$
where $O(1)$ denotes a universal constant. By the Cauchy-Schwarz inequality, we obtain
$$\|\Sigma^{(0)*}_{2,1}(0)\|_2\le O(1)\lambda(\mathcal{S})\sup_{\|a\|_2=1}\int_{s\in\mathcal{S}}|a^\top\Psi(s)|^2\,ds\le O(1)\lambda(\mathcal{S})\lambda_{\max}\Big(\int_{s\in\mathcal{S}}\Psi(s)\Psi^\top(s)\,ds\Big).$$
In view of C2(ii), we obtain equation 40. To summarize, we have shown $\|\{\Sigma^{(0)*}(0)\}^{-1}\|_2=O(1)$. Similarly, we can prove $\|\{\Sigma^{(0)*}(1)\}^{-1}\|_2=O(1)$. Assertion equation 37 thus holds. Similar to Lemma E.5 of Shi et al. (2020a), we can show $\|\widehat{\Sigma}(t)-\Sigma^{(0)*}\|_2=O(t^{-1/2})$. Using similar arguments in Part 1 of the proof of Lemma E.2 of Shi et al. (2020a), this yields $\|\widehat{\Sigma}^{-1}(t)-(\Sigma^{(0)*})^{-1}\|_2=o(t^{-1/2})$ and $\|\widehat{\Sigma}^{-1}(t)\|_2=O(1)$. Under the given conditions, equation 21 and the first assertion in equation 17 now follow from the arguments used in Parts 2 and 3 of the proof of Lemma E.2 of Shi et al. (2020a).

F.6 PROOF OF LEMMA 4

The asymptotic normality of $\sqrt{t}\{\widehat{\tau}(t)-\tau_0\}/\widehat{\sigma}(t)$ can be proven using similar arguments in Part 3 of the proof of Theorem 3. In the following, we focus on equation 24. Define the matrix
$$\Omega^{(0)*}=\int_{s\in\mathcal{S}}\sum_{a\in\{0,1\}}E\Big[\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}^\top\Big|S_0=s,A_0=a\Big]\mu(s)\{ab^{(0)}(s)+(1-a)(1-b^{(0)}(s))\}\,ds.$$
Similar to Lemma E.5 of Shi et al. (2020a), we can show $\|\Omega^{(0)*}-\Omega(t)\|_2=O(t^{-1/2})$. Thus, it suffices to show $\inf_q\lambda_{\min}(\Omega^{(0)*})>0$. Let $\bar{\rho}_\varepsilon=\sup_q\sup_{a\in\{0,1\},s\in\mathcal{S}}\rho_\varepsilon(a,s)$. Under C3, we have $\bar{\rho}_\varepsilon<1$ and $\inf_q\inf_{a',a,s}E[\{\varepsilon^*(a',a)\}^2|S_0=s]>0$. Under CA and SRA, it follows that
$$(a_1^\top,a_2^\top)E\Big[\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}^\top\Big|S_0=s,A_0=a\Big]\begin{pmatrix}a_1\\ a_2\end{pmatrix}\ge c_3[\{a_1^\top\xi(s,a)\}^2+\{a_2^\top\xi(s,a)\}^2]$$
for some constant $c_3>0$. Therefore,
$$\lambda_{\min}(\Omega^{(0)*})=\inf_{\|a_1\|_2^2+\|a_2\|_2^2=1}(a_1^\top,a_2^\top)\Omega^{(0)*}\begin{pmatrix}a_1\\ a_2\end{pmatrix}\ge c_3\inf_{\|a_1\|_2^2+\|a_2\|_2^2=1}\int_{s\in\mathcal{S}}\sum_{a\in\{0,1\}}[\{a_1^\top\xi(s,a)\}^2+\{a_2^\top\xi(s,a)\}^2]\mu(s)\{ab^{(0)}(s)+(1-a)(1-b^{(0)}(s))\}\,ds.$$
The strict positivity of $\mu(\cdot)$ and the condition that $b^{(0)}(\cdot)$ is uniformly bounded away from 0 and 1 yield
$$\lambda_{\min}(\Omega^{(0)*})\ge c_4\inf_{\|a_1\|_2^2+\|a_2\|_2^2=1}\int_{s\in\mathcal{S}}\sum_{a\in\{0,1\}}[\{a_1^\top\xi(s,a)\}^2+\{a_2^\top\xi(s,a)\}^2]\,ds\qquad(42)$$
for some constant $c_4>0$. With some calculation, we can show the RHS of equation 42 is equal to $c_4\lambda_{\min}(\int_{s\in\mathcal{S}}\Psi(s)\Psi^\top(s)\,ds)$. By Condition C2(ii), it is strictly positive. This yields $\inf_q\lambda_{\min}(\Omega^{(0)*})>0$. Thus, we obtain equation 24.

F.7 PROOF OF LEMMA 5

We begin by proving $\|\widehat{\Sigma}^{-1}(t)\|_2=O(1)$ under D2. For any matrices $M_1$ and $M_2$, denote by $\mathrm{diag}[M_1,M_2]$ the block diagonal matrix with diagonal blocks $M_1$ and $M_2$. By MA and Condition C1(ii), the two Markov chains $\{S_{2t-1}\}_{t\ge1}$ and $\{S_{2t}\}_{t\ge0}$ are geometrically ergodic. Let $\mu_1$ and $\mu_2$ denote the density functions of their respective stationary distributions. Under C1(i), we can similarly show that they are uniformly bounded away from 0 and $\infty$.
Define
$$\Sigma^*_1=\int_{s,s'\in\mathcal{S}}\mathrm{diag}\big[\xi(s,1)\{\xi(s,1)-\gamma\xi(s',0)\}^\top,\ \xi(s,1)\{\xi(s,1)-\gamma\xi(s',1)\}^\top\big]\,\mu_1(s)p(s';1,s)\,ds\,ds',$$
$$\Sigma^*_2=\int_{s,s'\in\mathcal{S}}\mathrm{diag}\big[\xi(s,0)\{\xi(s,0)-\gamma\xi(s',0)\}^\top,\ \xi(s,0)\{\xi(s,0)-\gamma\xi(s',1)\}^\top\big]\,\mu_2(s)p(s';0,s)\,ds\,ds'.$$
The matrix $(\Sigma^*_1+\Sigma^*_2)/2$ corresponds to the population limit of $\widehat{\Sigma}(t)$. Similar to Lemma E.5 of Shi et al. (2020a), we can show $\|\Sigma^*_1-(2t)^{-1}\sum_{j=0}^{t}E\Sigma_{2j+1}\|_2=o(t^{-1/2})$ and $\|\Sigma^*_2-(2t)^{-1}\sum_{j=0}^{t}E\Sigma_{2j}\|_2=o(t^{-1/2})$. This further yields $\|t^{-1}\sum_{j=0}^{t-1}\Sigma_{2j}-\Sigma^*_2\|_2=O_p(t^{-1/2}\log t)$ and $\|t^{-1}\sum_{j=0}^{t-1}\Sigma_{2j+1}-\Sigma^*_1\|_2=O_p(t^{-1/2}\log t)$, and hence $\|\widehat{\Sigma}(t)-(\Sigma^*_1+\Sigma^*_2)/2\|_2=O_p(t^{-1/2}\log t)$. Combining these results together with equation 43, we can show equation 21 and the first assertion in equation 17 hold. Equation 20 and the second assertion in equation 17 can be proven in a similar manner. Finally, using similar arguments in the proof of Lemma 4, we can show equation 24 holds. We omit the details to save space.
Conditional on $\{(S_j,A_j,Y_j)\}_{1\le j<T_{k-1}}$, the matrix $\Sigma^{(k)*}(a)$ is deterministic. Let $\Sigma^{(k)}=\mathrm{diag}[\Sigma^{(k)*}(0),\Sigma^{(k)*}(1)]$. Similar to the proof of Lemma 3, we can show $\|\widehat{\Sigma}^{(k)}-\Sigma^{(k)}\|_2=o(1)$, conditional on $\{(S_j,A_j,Y_j)\}_{1\le j<T_{k-1}}$, with probability tending to 1. This implies that for any sufficiently small $\epsilon>0$, $\Pr(\|\widehat{\Sigma}^{(k)}-\Sigma^{(k)}\|_2>\epsilon\,|\,\{(S_j,A_j,Y_j)\}_{1\le j<T_{k-1}})\xrightarrow{P}0$. The above conditional probability is bounded between 0 and 1. Using the bounded convergence theorem, we have $\Pr(\|\widehat{\Sigma}^{(k)}-\Sigma^{(k)}\|_2>\epsilon)=o(1)$,



Figure 1: Empirical rejection probabilities of our test and the two-sample t-test with $\alpha(\cdot)=\alpha_1(\cdot)$. Settings correspond to the alternating-time-interval, adaptive and Markov designs, from top plots to bottom plots.

5 NUMERICAL STUDIES

5.1 SYNTHETIC DATA

Simulated data of states and rewards were generated as follows,

Figure 2: Our test statistic (the orange line) and the rejection boundary (the black line) in the A/A (left plot) and A/B (right plot) experiments.

data $\{A_t,Y_t\}_{0\le t\le T_k}$ and plot the corresponding empirical rejection probabilities in Figures 1(b) and 5(b) (Appendix G). Results for Kharitonov et al. (2015)'s test are reported in Figure 4 (Appendix G). Both competing methods fail to detect any carryover effects and have no power.

Figure 3: Empirical rejection probabilities of our test and the DRL-based test.

Under CMIA, we have
$$E\{Y^*_t(a',a)|S_0=s\}=E[E\{Y^*_t(a',a)|S^*_t(a',a),S_0=s\}|S_0=s]=E\{r(\pi(S^*_t(a',a)),S^*_t(a',a))|S_0=s\}.\qquad(11)$$
Similar to equation 10, we can show
$$E\{Y^*_{t+1}(a',a)|S_0=s\}=E\{r(\pi(S^*_{t+1}(a',a)),S^*_{t+1}(a',a))|S_0=s\}=E[E\{r(\pi(S^*_{t+1}(a',a)),S^*_{t+1}(a',a))|S^*_1(a),S_0=s\}|S_0=s],$$
and hence
$$\sum_{t\ge0}\gamma^tE\{Y^*_{t+1}(a',a)|S_0=s\}=E\Big[\sum_{t\ge0}\gamma^tE\{r(\pi(S^*_{t+1}(a',a)),S^*_{t+1}(a',a))\,|\,S^*_1(a),S_0=s\}\Big|S_0=s\Big].$$
By equation 9, the conditional distribution of $S^*_{t+1}(a',a)$ given $S^*_1(a)=s$ and $S_0$ is the same as the conditional distribution of $S^*_t(a',a)$ given $S_0=s$. It follows from equation 11 that
$$\sum_{t\ge0}\gamma^tE\{Y^*_{t+1}(a',a)|S_0=s\}=E\{Q(a';a,S^*_1(a))|S_0=s\}.$$

$$E[\Pr\{S^*_{k+2}(a',a)\in\mathcal{S}\,|\,S^*_{k+1}(a',a),S^*_1(a)=s,S_0\}\,|\,S^*_1(a)=s,S_0]=E[\mathcal{P}(\mathcal{S};a',S^*_{k+1}(a',a))\,|\,S^*_1(a)=s,S_0].$$
Since we have shown equation 9 holds for $t=k$, it follows that
$$\Pr\{S^*_{k+2}(a',a)\in\mathcal{S}\,|\,S^*_1(a)=s,S_0\}=\int_{s'\in\mathcal{S}}\mathcal{P}(\mathcal{S};a',s')\,\mathcal{P}^k_{a'}(ds';a,s).$$
Similarly,
$$\Pr\{S^*_{k+1}(a',a')\in\mathcal{S}\,|\,S_0=s\}=\int_{s'\in\mathcal{S}}\mathcal{P}(\mathcal{S};a',s')\,\mathcal{P}^k_{a'}(ds';a',s).$$

We first consider equation 38. Using similar arguments in Part 1 of the proof of Lemma E.2 of Shi et al. (2020a), it suffices to show $a^\top\Sigma^{(0)*}_{1,1}(0)a\ge c_1\|a\|_2^2$ for all $a$ and some $c_1>0$.

Under CA and SRA, we have
$$E\Big[\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}^\top\Big|S_0=s,A_0=a\Big]=E\Big[\begin{pmatrix}\xi_0(a)\varepsilon^*(0,a)\\ \xi_0(a)\varepsilon^*(1,a)\end{pmatrix}\begin{pmatrix}\xi_0(a)\varepsilon^*(0,a)\\ \xi_0(a)\varepsilon^*(1,a)\end{pmatrix}^\top\Big|S_0=s,A_0=a\Big]=E\Big[\begin{pmatrix}\xi_0(a)\varepsilon^*(0,a)\\ \xi_0(a)\varepsilon^*(1,a)\end{pmatrix}\begin{pmatrix}\xi_0(a)\varepsilon^*(0,a)\\ \xi_0(a)\varepsilon^*(1,a)\end{pmatrix}^\top\Big|S_0=s\Big].$$
For any $2q$-dimensional vectors $a_1,a_2$ that satisfy $\|a_1\|_2^2+\|a_2\|_2^2=1$, we have
$$(a_1^\top,a_2^\top)E\Big[\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}\begin{pmatrix}\xi_0\varepsilon_{0,0}\\ \xi_0\varepsilon_{0,1}\end{pmatrix}^\top\Big|S_0=s,A_0=a\Big]\begin{pmatrix}a_1\\ a_2\end{pmatrix}$$
$$=\{a_1^\top\xi(s,a)\}^2E[\{\varepsilon^*(0,a)\}^2|S_0=s]+\{a_2^\top\xi(s,a)\}^2E[\{\varepsilon^*(1,a)\}^2|S_0=s]+2\{a_1^\top\xi(s,a)\}\{a_2^\top\xi(s,a)\}E[\varepsilon^*(0,a)\varepsilon^*(1,a)|S_0=s]$$
$$\ge\{a_1^\top\xi(s,a)\}^2E[\{\varepsilon^*(0,a)\}^2|S_0=s]+\{a_2^\top\xi(s,a)\}^2E[\{\varepsilon^*(1,a)\}^2|S_0=s]-2\bar{\rho}_\varepsilon|a_1^\top\xi(s,a)||a_2^\top\xi(s,a)|\sqrt{E[\{\varepsilon^*(0,a)\}^2|S_0=s]\,E[\{\varepsilon^*(1,a)\}^2|S_0=s]}$$
$$=(1-\bar{\rho}_\varepsilon)\{a_1^\top\xi(s,a)\}^2E[\{\varepsilon^*(0,a)\}^2|S_0=s]+(1-\bar{\rho}_\varepsilon)\{a_2^\top\xi(s,a)\}^2E[\{\varepsilon^*(1,a)\}^2|S_0=s]+\bar{\rho}_\varepsilon\Big(|a_1^\top\xi(s,a)|\sqrt{E[\{\varepsilon^*(0,a)\}^2|S_0=s]}-|a_2^\top\xi(s,a)|\sqrt{E[\{\varepsilon^*(1,a)\}^2|S_0=s]}\Big)^2$$
$$\ge(1-\bar{\rho}_\varepsilon)\{a_1^\top\xi(s,a)\}^2E[\{\varepsilon^*(0,a)\}^2|S_0=s]+(1-\bar{\rho}_\varepsilon)\{a_2^\top\xi(s,a)\}^2E[\{\varepsilon^*(1,a)\}^2|S_0=s],$$
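The key algebraic step above, absorbing the cross term via the correlation bound, rests on an exact identity: $x^2E_1+y^2E_2-2\rho xy\sqrt{E_1E_2}=(1-\rho)(x^2E_1+y^2E_2)+\rho(x\sqrt{E_1}-y\sqrt{E_2})^2$. A quick numerical sanity check (our own illustration in numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = np.abs(rng.standard_normal(2))       # play the roles of |a1'xi|, |a2'xi|
    E1, E2 = rng.uniform(0.1, 5.0, size=2)      # conditional second moments
    rho = rng.uniform(0.0, 1.0)                 # correlation bound rho_eps
    lhs = x**2 * E1 + y**2 * E2 - 2 * rho * x * y * np.sqrt(E1 * E2)
    rhs = (1 - rho) * (x**2 * E1 + y**2 * E2) \
        + rho * (x * np.sqrt(E1) - y * np.sqrt(E2))**2
    assert np.isclose(lhs, rhs)                 # exact algebraic identity
    # the square term is nonnegative, giving the final lower bound
    assert lhs >= (1 - rho) * (x**2 * E1 + y**2 * E2) - 1e-12
```

Since the squared term is nonnegative and $\bar{\rho}_\varepsilon<1$ under C3, the quadratic form is bounded below by a positive multiple of $\{a_1^\top\xi(s,a)\}^2+\{a_2^\top\xi(s,a)\}^2$.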

Similar to the proof of Lemma 3, in order to show equation 43, it suffices to show
$$\|(\Sigma^*_1+\Sigma^*_2)^{-1}\|_2=O(1).\qquad(44)$$
Notice that $\Sigma^*_1+\Sigma^*_2=\mathrm{diag}[\Sigma^*(0),\Sigma^*(1)]$, where
$$\Sigma^*(a)=\int_{s,s'\in\mathcal{S}}\big[\xi(s,0)\{\xi(s,0)-\gamma\xi(s',a)\}^\top\mu_2(s)p(s';0,s)+\xi(s,1)\{\xi(s,1)-\gamma\xi(s',a)\}^\top\mu_1(s)p(s';1,s)\big]\,ds\,ds'.$$
The matrix $\Sigma^*(0)$ can be further decomposed into blocks of the form $\int_{s,s'\in\mathcal{S}}\Psi(s)\{\Psi(s)-\gamma\Psi(s')\}^\top\mu_2(s)p(s';0,s)\,ds\,ds'$, and arguing as in the proof of Lemma 3 we can show $\|\Sigma^*_{2,1}(0)\|_2=O(1)$. It follows that $\|\{\Sigma^*(0)\}^{-1}\|_2=O(1)$. Similarly, we can show $\|\{\Sigma^*(1)\}^{-1}\|_2=O(1)$. This proves equation 44. Thus, we obtain equation 43. Using similar arguments in Part 2 of the proof of Lemma E.2 of Shi et al. (2020a), we can show

F.8 PROOF OF LEMMA 6

Under C1(iv), equation 6 holds. Similar to equation 7, we can show that $\Pi^{(k)}$ has a probability density function $\mu^{(k)}$ given by
$$\mu^{(k)}(s')=\sum_{a\in\{0,1\}}\int_{s\in\mathcal{S}}[a\{1-b^{(k)}(s)\}+(1-a)b^{(k)}(s)]p(s';a,s)\,\Pi^{(k)}(ds).\qquad(45)$$
For $a'\in\{0,1\}$, define
$$\Sigma^{(k)*}(a')=\int_{s,s'\in\mathcal{S}}\sum_{a\in\{0,1\}}\xi(s,a)\{\xi(s,a)-\gamma\xi(s',a')\}^\top\mu^{(k)}(s)[a\{1-b^{(k)}(s)\}+(1-a)b^{(k)}(s)]p(s';a,s)\,ds\,ds'.$$

and hence $\|\Sigma^{(k)*}-\Sigma^{(k)}\|_2=o_p(1)$. Notice that $\sup_s|b^{(k)}(s)-b^*(s)|\xrightarrow{P}0$ and $\|\Pi^{(k)}-\Pi^*\|_{TV}\xrightarrow{P}0$. Define
$$\mu^*(s')=\sum_{a\in\{0,1\}}\int_{s\in\mathcal{S}}[a\{1-b^*(s)\}+(1-a)b^*(s)]p(s';a,s)\,\Pi^*(ds).$$

Figure 4: Empirical rejection probabilities of the modified version of the O'Brien & Fleming sequential test developed by Kharitonov et al. (2015). The left panels depict the empirical type-I error and the right panels depict the empirical power. Settings correspond to the alternating-time-interval, adaptive and Markov designs, from top plots to bottom plots.

Figure 5: Empirical rejection probabilities of our test and the two-sample t-test with α(•) = α 2 (•). Settings correspond to the alternating-time-interval, adaptive and Markov design, from top plots to bottom plots.

(a) The proposed test under H1 and H0 (from left plots to right plots), J = 3, α(•) = α1(•).
(b) The proposed test under H1 and H0 (from left plots to right plots), J = 3, α(•) = α2(•).

(c) The proposed test under H1 and H0 (from left plots to right plots), J = 5, α(•) = α1(•).
(d) The proposed test under H1 and H0 (from left plots to right plots), J = 5, α(•) = α2(•).

Figure 6: Empirical rejection probabilities of our test. Settings correspond to the alternating-time-interval, adaptive and Markov designs, from top plots to bottom plots.

Powers of t-test, DML-based test and the proposed test under Examples 1 and 2, with T = 500,

Input: number of basis functions q, number of bootstrap samples B, an α spending function α(•).
Initialize: I = {1, ..., B}. Set Ω, Σ0, Σ1 to zero matrices, and η, S1, ..., SB to zero vectors. Compute U (see Section 4.2) using either Monte Carlo methods or numerical integration.
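The initialization step computes $U$ by either Monte Carlo or numerical integration. The definition of $U$ from Section 4.2 is not reproduced in this excerpt, so the sketch below is a hedged stand-in: it integrates a hypothetical polynomial basis $\Psi$ against a uniform reference distribution both ways and checks that the two estimates agree. The basis, the reference distribution, and all names are our own assumptions.

```python
import numpy as np

# Hypothetical stand-in for the integral defining U: here
# U = \int_0^1 Psi(s) ds with a polynomial basis Psi(s) = (1, s, s^2, s^3).
def Psi(s, q=4):
    return np.stack([s**j for j in range(q)], axis=-1)

# Numerical integration (midpoint rule on a fine grid)
grid = np.linspace(0.0, 1.0, 10_001)
mids = 0.5 * (grid[:-1] + grid[1:])
U_quad = Psi(mids).sum(axis=0) * (grid[1] - grid[0])

# Monte Carlo: average Psi over draws from the reference distribution
rng = np.random.default_rng(0)
U_mc = Psi(rng.uniform(size=500_000)).mean(axis=0)

assert np.allclose(U_quad, U_mc, atol=5e-3)  # the two estimates agree
```

Either route is viable; quadrature is cheap and accurate in low-dimensional state spaces, while Monte Carlo scales better when the state distribution is only available through samples.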

This is directly implied by Condition C2(ii). The proof of equation 38 is hence completed. Similarly, we can prove equation 39. In addition, notice that


Combining this together with equation 46, we obtain the desired result. Similar to the proof of Lemma 3, we can show the $\mu^{(k)}$'s are uniformly bounded away from 0 and $\infty$. It follows that $\mu^*$ is uniformly bounded away from 0 and $\infty$. Using similar arguments in Lemma 3, we can show

G ADDITIONAL FIGURES

We present some additional figures reporting the simulation results in this section. Figure 4 depicts the empirical rejection probabilities of the modified version of the O'Brien & Fleming sequential test developed by Kharitonov et al. (2015). It can be seen that such a test has no power at all. In addition, we remark that Kharitonov et al. (2015)'s test requires equal sample sizes $T_1=T_k-T_{k-1}$ for $k=2,\dots,K$ and is not directly applicable to our setting with unequal sample sizes. To apply such a test, we modify the decision times and set $(T_1,T_2,T_3,T_4,T_5)=(120,240,360,480,600)$. Figure 5 depicts the empirical rejection probabilities of our test and the two-sample t-test with the error spending function given by $\alpha_2$. Figure 6 reports the empirical rejection probabilities of our test with different combinations of the number of basis functions and the error spending function.

