AN ONLINE SEQUENTIAL TEST FOR QUALITATIVE TREATMENT EFFECTS

Anonymous

Abstract

Tech companies (e.g., Google or Facebook) often use randomized online experiments and/or A/B testing, primarily based on average treatment effects, to compare a new product with an old one. However, it is also critically important to detect qualitative treatment effects, whereby the new product significantly outperforms the existing one only under some specific circumstances. The aim of this paper is to develop a powerful testing procedure to efficiently detect such qualitative treatment effects. We propose a scalable online updating algorithm to implement our test procedure. It features three novelties: adaptive randomization, sequential monitoring, and online updating with guaranteed type-I error control. We also thoroughly examine the theoretical properties of our testing procedure, including the limiting distribution of the test statistics and the justification of an efficient bootstrap method. Extensive empirical studies are conducted to examine the finite sample performance of our test procedure.

1. INTRODUCTION

Tech companies use randomized online experiments, or A/B testing, to compare a new product with a well-established one. Most works in the literature focus on the average treatment effect (ATE) between the new and existing products (see Kharitonov et al., 2015; Johari et al., 2015; 2017; Yang et al., 2017; Ju et al., 2019, and the references therein). In addition to the ATE, we are sometimes interested in locating, as early as possible, the subgroup (if it exists) on which the new product performs significantly better than the existing one. Consider a ride-hailing company (e.g., Uber). Suppose some passengers are in a recession state (at high risk of stopping using the company's app) and the company comes up with a certain strategy to intervene in the recession process. We would like to know whether there are subgroups that are sensitive to the strategy, and to pinpoint these subgroups if they exist. This motivates us to consider the null hypothesis that the treatment effect is nonpositive for all passengers. Such a null hypothesis is closely related to the notion of qualitative treatment effects (QTE) in medical studies (Gail & Simon, 1985; Roth & Simon, 2018; Shi et al., 2020a), and to conditional moment inequalities in economics (see, for example, Andrews & Shi, 2013; 2014; Chernozhukov et al., 2013; Armstrong & Chan, 2016; Chang et al., 2015; Hsu, 2017). However, these tests are computed offline and might not be suitable for online settings. Moreover, those papers assume that observations are independent. In online experiments, one may wish to adaptively allocate the treatment based on the observed data stream in order to maximize the cumulative reward or to detect the alternative more efficiently. The independence assumption is thus violated. In addition, an online experiment should be terminated as early as possible in order to save time and budget. Sequential testing for qualitative treatment effects has been less explored.
In the literature, there is a line of research on estimation and inference of heterogeneous treatment effects (HTE) (Athey & Imbens, 2016; Taddy et al., 2016; Wager & Athey, 2018; Yu et al., 2020). In particular, Yu et al. (2020) proposed an online test for HTE. We remark that HTE and QTE are related yet fundamentally different hypotheses. There are cases where HTE exists whereas QTE does not. See Figure 1 for an illustration. Consequently, applying their test will fail in our setting. (Figure 1 caption: Y denotes the associated reward. In the ride-hailing example, X is a feature vector describing the characteristics of a passenger, A is a binary strategy indicator, and Y is the passenger's number of rides in the following two weeks. In the left panel, the treatment effect does not depend on X; neither HTE nor QTE exists in this case. In the middle panel, HTE exists, but the treatment effect is always negative, so QTE does not exist. In the right panel, both QTE and HTE exist.) The contributions of this paper are summarized as follows. First, we propose a new testing procedure for treatment comparison based on the notion of QTE. When the null hypothesis is not rejected, the new product is no better than the control for any realization of the covariates, and thus it is not useful at all. Otherwise, the company could implement different products according to the auxiliary covariates observed, so as to maximize the average reward obtained. We remark that there are plenty of cases where the treatment effects are always nonpositive (see Section 5 of Chang et al., 2015; Shi et al., 2020a). A by-product of our test is that it yields a decision rule to implement personalization when the null is rejected (see Section 3.1 for details). Although we primarily focus on QTE in this paper, our procedure can be easily extended to testing the ATE as well (see Appendix D for details). Second, we propose a scalable online updating algorithm to implement our test.
To allow for sequential monitoring, our procedure leverages ideas from the α spending function approach (Lan & DeMets, 1983), originally designed for sequential analysis in clinical trials (see Jennison & Turnbull, 1999, for an overview). Classical sequential tests focus on the ATE. The test statistic at each interim stage is asymptotically normal and the stopping boundary can be recursively updated via numerical integration. However, the limiting distribution of the proposed test statistic does not have a tractable analytical form, making the numerical integration method difficult to apply. To resolve this issue, we propose a scalable bootstrap-assisted procedure to determine the stopping boundary. Third, we adopt a theoretical framework that allows the maximum number of interim analyses K to diverge as the number of observations increases, since tech companies might analyze the results every few minutes (or hours) to determine whether to stop the experiment or continue collecting more data. This is fundamentally different from classical sequential analysis, where K is fixed. Moreover, the derivation of the asymptotic properties of the proposed test is further complicated by the adaptive randomization procedure, which makes observations dependent on each other. Despite these technical challenges, we establish a nonasymptotic upper bound on the type-I error rate by explicitly characterizing the conditions needed on the randomization procedure, K, and the number of samples observed at the initial decision point to ensure the validity of our test.

2. BACKGROUND AND PROBLEM FORMULATION

We adopt the potential outcomes framework (Rubin, 2005) to formulate our problem. Suppose that we have two products, the control and the treatment. The observed data at time point t consist of a sequence of triples {(X_i, A_i, Y_i)}_{i=1}^{N(t)}, where N(•) is a counting process independent of the data stream {(X_i, A_i, Y_i)}_{i=1}^{+∞}, A_i is a binary random variable indicating the product executed for the i-th experiment, X_i ∈ R^p denotes the associated covariates, and Y_i stands for the associated reward (the larger the better, by convention). We allow A_i to depend on X_i and the past observations {(X_j, A_j, Y_j)}_{j<i}, so that the randomization procedure can be adaptively changed. In addition, define Y*_i(0) and Y*_i(1) to be the potential outcomes that would have been observed if the corresponding product were executed for the i-th experiment. Suppose that {(X_i, Y*_i(0), Y*_i(1))}_{i=1}^{+∞} are independent and identically distributed copies of (X, Y*(0), Y*(1)). Let X denote the support of X and Q_0(x, a) = E{Y*(a) | X = x} for a = 0, 1. We focus on testing the hypotheses
H_0: Q_0(x, 1) ≤ Q_0(x, 0), ∀x ∈ X versus H_1: Q_0(x, 1) > Q_0(x, 0), ∃x ∈ X.
Notice that when there are no covariates, i.e., X = ∅, the hypotheses reduce to H_0: τ_0 ≤ 0 versus H_1: τ_0 > 0, where τ_0 corresponds to the ATE, i.e., τ_0 = E{Y*(1) - Y*(0)}. In general, we require X to be a compact set. We consider a large linear approximation space Q for the conditional mean function Q_0. Specifically, let Q = {Q(x, a; β_0, β_1) = ϕ'(x)β_a : β_0, β_1 ∈ R^q} be the approximation space, where ϕ(x) is a q-dimensional vector composed of basis functions on X. The dimension q is allowed to diverge with the number of observations in order to alleviate the effects of model misspecification. The use of a linear approximation space simplifies the computation of our testing procedure.
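To make the linear approximation space Q concrete, here is a minimal sketch; the basis choice (truncated-power cubic splines on [0, 1] with two internal knots) and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Illustrative internal knots for a univariate cubic spline basis on [0, 1].
knots = np.array([1 / 3, 2 / 3])

def phi(x):
    """Basis vector phi(x): polynomial terms plus truncated cubic terms."""
    x = np.atleast_1d(x)
    powers = np.stack([np.ones_like(x), x, x**2, x**3], axis=-1)
    splines = np.clip(x[:, None] - knots[None, :], 0.0, None) ** 3
    return np.concatenate([powers, splines], axis=-1)   # q = 6 here

def Q(x, a, beta0, beta1):
    # Q(x, a; beta_0, beta_1) = phi(x)' beta_a
    return phi(x) @ (beta1 if a == 1 else beta0)

q = phi(0.5).shape[-1]
beta0, beta1 = np.zeros(q), 0.1 * np.ones(q)
print(Q(np.array([0.2, 0.8]), 1, beta0, beta1))
```

Any basis could be swapped in for `phi`; the testing procedure only requires evaluating ϕ'(x)β_a on the support of X.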
When Q_0 is well approximated, it suffices to test
H_0: ϕ'(x)(β*_1 - β*_0) ≤ 0, ∀x ∈ X versus H_1: ϕ'(x)(β*_1 - β*_0) > 0, ∃x ∈ X.
For clarity, here we assume Q_0(x, a) = Q(x, a; β*_0, β*_1) for some β*_0 and β*_1. In Appendix B, we allow the approximation error inf_{β_0, β_1 ∈ R^q} sup_{x ∈ X, a ∈ {0,1}} |Q_0(x, a) - Q(x, a; β_0, β_1)| to be nonzero. Let F_j denote the sub-dataset {(X_i, A_i, Y_i)}_{1 ≤ i ≤ j} for j ≥ 1 and F_0 = ∅. Throughout this paper, we assume that the following two assumptions hold. (A1) Y_i = A_i Y*_i(1) + (1 - A_i) Y*_i(0) for all i ≥ 1. (A2) A_i is independent of Y*_i(0), Y*_i(1) and {(X_k, Y*_k(0), Y*_k(1))}_{k>i}, given X_i and F_{i-1}, for any i. Assumption (A1) is referred to as the stable unit treatment value assumption (Rubin, 1974). Assumption (A2) is the sequential randomization assumption (Zhang et al., 2013) and is automatically satisfied in a randomized study where the treatments are generated independently of the observed data; (A2) essentially assumes there are no unmeasured confounders. These assumptions guarantee that both regression coefficient vectors (defined through the potential outcomes) are estimable from the observed dataset, as shown in the following lemma.

Lemma 1 Let I(•) denote the indicator function. Under (A1)-(A2), we have E[I(A_i = a){Y_i - ϕ'(X_i)β*_a}] = 0, ∀a ∈ {0, 1}, i ≥ 1.

3. ONLINE SEQUENTIAL TESTING FOR QTE

3.1. TEST STATISTICS AND THEIR LIMITING DISTRIBUTION

We first present our test statistic for testing H_0. In view of Lemma 1, at each time point t we estimate β*_a, for a ∈ {0, 1}, by the ordinary least squares estimator
β̂_a(t) = Σ̂_a^{-1}(t) { N^{-1}(t) Σ_{i=1}^{N(t)} I(A_i = a) ϕ(X_i) Y_i }, where Σ̂_a(t) = N^{-1}(t) Σ_{i=1}^{N(t)} I(A_i = a) ϕ(X_i) ϕ'(X_i).
A generalized inverse is used when Σ̂_a(t) is not invertible. Consider the test statistic
S(t) = sup_{x ∈ X} ϕ'(x){β̂_1(t) - β̂_0(t)}.
Under H_0, we expect S(t) to be small; a large S(t) can be interpreted as evidence against H_0. As such, we reject H_0 for large values of S(t). We remark that when H_0 is rejected, we can apply the decision rule d(x) = arg max_{a ∈ {0,1}} ϕ'(x) β̂_a(t) for personalized recommendation. To determine the rejection region, we next discuss the limiting distribution of S(t). Under H_0,
S(t) ≤ sup_{x ∈ X} ϕ'(x){β̂_1(t) - β*_1 - β̂_0(t) + β*_0} + sup_{x ∈ X} ϕ'(x)(β*_1 - β*_0) ≤ sup_{x ∈ X} ϕ'(x){β̂_1(t) - β*_1 - β̂_0(t) + β*_0}, (2)
where both inequalities hold with equality when β*_0 = β*_1. Suppose there exists some function π*(•, •) defined on {0, 1} × X that satisfies
E_X | n^{-1} Σ_{i=1}^n π_{i-1}(a, X) - π*(a, X) | →_P 0, ∀a ∈ {0, 1}, as n → ∞,
where π_{i-1}(a, x) = Pr(A_i = a | X_i = x, F_{i-1}) and the expectation E_X is taken with respect to X. This condition implies that the treatment assignment mechanism cannot be arbitrary (see the discussion below Theorem 1 for details). Then we will show that
B(t) ≡ √N(t) {β̂_1(t) - β*_1 - β̂_0(t) + β*_0} →_d N(0, Σ_{a ∈ {0,1}} Σ_a^{-1} Φ_a Σ_a^{-1}), as N(t) → ∞, (3)
where Σ_a = E π*(a, X) ϕ(X) ϕ'(X), Φ_a = E π*(a, X) σ²(a, X) ϕ(X) ϕ'(X), and σ²(a, x) = E[{Y*(a) - ϕ'(x)β*_a}² | X = x] for any x ∈ X. According to equation 3, the right-hand side (RHS) of equation 2 converges in distribution to the maximum of Gaussian random variables. This observation forms the basis of our test. We next discuss the sequential implementation of our test. Assume that the interim analyses are conducted at time points t_1, t_2, ...
, t_K ∈ (0, T] such that 0 < t_1 < t_2 < ... < t_K = T. We allow K to grow with the number of observations. In the most extreme case, one may set t_k = inf_t {N(t) ≥ N(t_{k-1}) + 1}, ∀k ≥ 2; that is, we make a decision regarding the null hypothesis upon the arrival of each observation. In addition, we assume that t_1 is large enough that there are sufficiently many samples N(t_1) to guarantee the validity of the normal approximation for B(t_1). We remark that in typical tech companies such as Amazon, Facebook, etc., massive data are collected even within a short time interval; the large sample approximation is valid in these applications. To guarantee that our test controls the type-I error, we reject H_0 and terminate the experiment at t_k if √N(t_k) S(t_k) ≥ z_k for some k = 1, ..., K, with suitably chosen z_1, ..., z_K > 0 that satisfy
Pr[ max_{k ∈ {1,...,K}} {√N(t_k) S(t_k) - z_k} > 0 ] ≤ α + o(1)
for a given significance level α > 0 under H_0. In view of equation 2, it suffices to find {z_k}_k that satisfy
Pr[ max_{k ∈ {1,...,K}} {sup_{x ∈ X} ϕ'(x) B(t_k) - z_k} > 0 ] ≤ α + o(1), (4)
where the stochastic process B(•) is defined in equation 3. To determine {z_k}_k, we need to derive the asymptotic distribution of the left-hand side (LHS) of equation 4. To this end, define a mean-zero Gaussian process G(t) with covariance function
Cov(G(t), G(t')) = N^{1/2}(t) N^{-1/2}(t') Σ_{a ∈ {0,1}} Σ_a^{-1} Φ_a Σ_a^{-1}, ∀0 < t ≤ t'.
In the following, we show that the LHS of equation 4 can be uniformly approximated by G(•), for any {z_k}_{k=1,...,K}. To establish our theoretical results, we need some regularity conditions on ϕ(•). To save space, we summarize these assumptions in (A3) and defer them to Appendix B.

Theorem 1 Assume (A1)-(A3) hold. For a = 0, 1, assume inf_{x ∈ X} π*(a, x) > 0 and that |Y*(a)| is bounded almost surely.
Assume there exists some 0 < α_0 ≤ 1 such that, for any sequence {j_n}_n satisfying j_n^{α_0} / log^{α_0} j_n ≫ q², the following event occurs with probability at least 1 - O(j_n^{-α_0}):
sup_{a ∈ {0,1}, x ∈ X} | Σ_{i=1}^k E{π_{i-1}(a, x) - π*(a, x)} | ≤ O(1) q k^{1-α_0} log^{α_0} k, ∀k ≥ j_n, (5)
where O(1) denotes some positive constant. Assume N^{α_0}(t_1) / log^{α_0} N(t_1) ≫ q² and N(t_1) ≫ log N(T), almost surely. Then, conditional on the counting process N(•), there exists some constant c > 0 such that
sup_{z_1,...,z_K} | Pr[ max_{k ∈ {1,...,K}} {sup_{x ∈ X} ϕ'(x) B(t_k) - z_k} > 0 ] - Pr[ max_{k ∈ {1,...,K}} {sup_{x ∈ X} ϕ'(x) G(t_k) - z_k} > 0 ] | ≤ c [ q^{3/4} N^{-1/8}(t_1) log^{15/8}{K N(t_1)} + q N^{-α_0/3}(t_1) log^{(5+α_0)/3}{K N(t_1)} ].
Theorem 1 implies that the approximation error depends on the number of observations obtained up to the first decision point, N(t_1); the maximum number of interim analyses, K; the total number of basis functions, q; and α_0, which characterizes the convergence rate of the averaged treatment assignment mechanism n^{-1} Σ_{i=1}^n π_{i-1}. Clearly, the error decays to zero when the following hold with probability tending to 1:
q = O(N^{α*}(t_1)) for some 0 ≤ α* < min(1/6, α_0/3), (6)
log(K) ≪ min{ N^{1/15 - 2α*/5}(t_1), N^{(α_0 - 3α*)/(5 + α_0)}(t_1) }. (7)
In Appendix C, we show that α_0 = 1/2 when an ε-greedy strategy is used for randomization to balance the trade-off between exploration and exploitation. In this case, equation 6 requires q to grow at a slower rate than N^{1/6}(t_1). This condition is automatically satisfied when q is bounded. Condition 7 is satisfied when K grows polynomially fast with respect to N(t_1). In addition to ε-greedy, other adaptive allocation procedures (e.g., upper confidence bound or Thompson sampling) could be applied as well. As discussed in the introduction, the derivation of Theorem 1 is nontrivial. One way to obtain the magnitude of the approximation error is to apply the strong approximation theorem for multidimensional martingales (see Morrow & Philipp, 1982; Zhang, 2004).
However, the rate of approximation typically depends on the dimension and decays quickly as the dimension increases. To derive Theorem 1, we instead view {ϕ'(x) B(t_k)}_{x ∈ X, k ∈ {1,...,K}} as a high-dimensional martingale and adopt the Gaussian approximation techniques recently developed by Belloni & Oliveira (2018). In view of equation 2, an application of Theorem 1 yields the following result.

Theorem 2 Assume that the conditions of Theorem 1 hold and that equation 6 and equation 7 hold with probability tending to 1. Then for any z_1, ..., z_K that satisfy
Pr[ max_{k ∈ {1,...,K}} {sup_{x ∈ X} ϕ'(x) G(t_k) - z_k} > 0 ] = α + o(1), (8)
as N(t_1) diverges to infinity, we have, under H_0,
Pr[ max_{k ∈ {1,...,K}} {√N(t_k) S(t_k) - z_k} > 0 ] ≤ α + o(1).
The above inequality holds with equality when β*_0 = β*_1. Theorem 2 suggests that the type-I error rate of the proposed test can be well controlled. It remains to find critical values {z_k}_{1 ≤ k ≤ K} that satisfy equation 8. In the next section, we propose a bootstrap-assisted procedure to determine these critical values.
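Before turning to the bootstrap, the per-arm OLS estimators and the statistic S(t) from Section 3.1 can be sketched as follows; the toy basis, the grid approximation to the supremum, and the data-generating model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # Toy basis on X = [0, 1]: intercept + linear term (illustrative).
    return np.stack([np.ones_like(x), x], axis=-1)

# Simulated data stream up to one interim time t.
n = 20_000
X = rng.uniform(size=n)
A = rng.binomial(1, 0.5, size=n)          # completely randomized design
Y = 0.2 * X * A + rng.normal(size=n)      # treatment effect 0.2 * x

def beta_hat(a):
    # beta_hat_a(t) = Sigma_hat_a(t)^{-1} * N(t)^{-1} sum_i I(A_i=a) phi(X_i) Y_i
    P = phi(X) * (A == a)[:, None]
    Sigma = P.T @ phi(X) / n
    gamma = P.T @ Y / n
    return np.linalg.pinv(Sigma) @ gamma  # pinv mirrors the generalized inverse

# S(t) = sup_{x in X} phi(x)'(beta_1_hat - beta_0_hat), taken over a grid.
diff = beta_hat(1) - beta_hat(0)
grid = np.linspace(0.0, 1.0, 201)
S = float(np.max(phi(grid) @ diff))
print(S)
```

Since the treatment effect 0.2x is positive for x > 0, S concentrates near 0.2 here; in practice the supremum over a compact X is approximated on a fine grid exactly as above.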

3.2. BOOTSTRAP STOPPING BOUNDARY

We first outline a method based on the wild bootstrap (Wu, 1986) to approximate the limiting distribution of {S(t_k)}_k. We then discuss its limitations and present our proposal: a scalable bootstrap algorithm to determine the stopping boundary. The idea is to generate bootstrap samples {β̂^MB_a(t_k)}_{a,k} that have asymptotically the same joint distribution as {β̂_a(t_k) - β*_a}_{a,k}. Then the joint distribution of {S(t_k)}_k can be well approximated by the conditional distribution of {S̃^MB(t_k)}_k given the data, where S̃^MB(t) = sup_{x ∈ X} ϕ'(x){β̂^MB_1(t) - β̂^MB_0(t)} for any t. Specifically, let {ξ_i}_{i=1}^{+∞} be a sequence of i.i.d. standard normal random variables independent of {(X_i, A_i, Y_i)}_{i=1}^{+∞}. For a ∈ {0, 1}, define
β̂^MB_a(t) = Σ̂_a^{-1}(t) { N^{-1}(t) Σ_{i=1}^{N(t)} I(A_i = a) ϕ(X_i) {Y_i - ϕ'(X_i) β̂_a(t)} ξ_i }.
Computing {S̃^MB(t_j)}_{j ≤ k} requires O(B N(t_k)) operations up to the k-th interim stage, where B is the total number of bootstrap samples. This can be time consuming when {N(t_k) - N(t_{k-1})}_{k=1}^K are large. To facilitate the computation, we observe that in the calculation of β̂^MB_a, a random noise is generated upon the arrival of each observation. This is unnecessary, as we aim to approximate the distribution of β̂_a(•) only at finitely many time points. We next present our proposal. Let {e_{j,a}}_{j=1,...,K, a=0,1} be a sequence of i.i.d. N(0, I_q) random vectors independent of the observed data, where I_q denotes the q × q identity matrix. At the k-th interim stage, we compute S̃^{MB*}(t_k) = sup_{x ∈ X} ϕ'(x){β̂^{MB*}_1(t_k) - β̂^{MB*}_0(t_k)}, where β̂^{MB*}_a(t_k) equals
N^{-1}(t_k) Σ_{j=1}^k [ Σ_{i=N(t_{j-1})+1}^{N(t_j)} Σ̂_a^{-1}(t_j) I(A_i = a) ϕ(X_i) ϕ'(X_i) {Y_i - ϕ'(X_i) β̂_a(t_j)}² Σ̂_a^{-1}(t_j) ]^{1/2} e_{j,a}.
For any k_1 ≤ k_2, the conditional covariance of √N(t_{k_1}) {β̂^{MB*}_1(t_{k_1}) - β̂^{MB*}_0(t_{k_1})} and √N(t_{k_2}) {β̂^{MB*}_1(t_{k_2}) - β̂^{MB*}_0(t_{k_2})} equals
{N(t_{k_1}) N(t_{k_2})}^{-1/2} Σ_{a=0}^1 Σ_{j=1}^{k_1} Σ_{i=N(t_{j-1})+1}^{N(t_j)} Σ̂_a^{-1}(t_j) I(A_i = a) ϕ(X_i) ϕ'(X_i) {Y_i - ϕ'(X_i) β̂_a(t_j)}² Σ̂_a^{-1}(t_j).
Under the conditions given in Theorem 1, this converges to
{N(t_{k_1}) / N(t_{k_2})}^{1/2} Σ_{a=0}^1 Σ_a^{-1} Φ_a Σ_a^{-1} = Cov(G(t_{k_1}), G(t_{k_2})).
This means that {√N(t_k) (β̂^{MB*}_1(t_k) - β̂^{MB*}_0(t_k))}_k and {G(t_k)}_k have the same asymptotic distribution. Consequently, {√N(t_k) S̃^{MB*}(t_k)}_{k=1}^K can be used to approximate the joint distribution of {sup_{x ∈ X} ϕ'(x) G(t_k)}_{k=1}^K. To choose {z_k}_k satisfying equation 8, we adopt the α-spending approach, which allocates the total allowable type-I error across the interim stages according to an error-spending function; this guarantees that our test controls the type-I error. We begin by specifying an α spending function α(t) that is nondecreasing and satisfies α(0) = 0 and α(T) = α. Popular choices of α(•) include
α_1(t) = α log{1 + (e - 1) t / T},
α_2(t) = 2 - 2Φ{Φ^{-1}(1 - α/2) √T / √t},
α_3(t) = α (t / T)^θ, for θ > 0,
α_4(t) = α {1 - exp(-γ t / T)} / {1 - exp(-γ)}, for γ ≠ 0,
where Φ(•) denotes the cumulative distribution function of a standard normal variable and Φ^{-1}(•) is its quantile function. Based on α(•), we iteratively calculate ẑ_k, k = 1, ..., K, as the solution of
Pr*[ max_{j ∈ {1,...,k-1}} {√N(t_j) S̃^{MB*}(t_j) - ẑ_j} ≤ 0, √N(t_k) S̃^{MB*}(t_k) > ẑ_k ] = α(t_k) - α(t_{k-1}), (10)
and reject H_0 when √N(t_k) S(t_k) > ẑ_k holds for some k. The validity of the bootstrap test is summarized in Theorems 3 and 4 below.

Theorem 3 Assume the conditions in Theorem 1 hold. Assume q = O(N^{α*}(t_1)) for some 0 < α* < 1/3, almost surely.
Then, conditional on the counting process N(•), we have
sup_{z_1,...,z_K} | Pr*[ max_{k ∈ {1,...,K}} {√N(t_k) S̃^{MB*}(t_k) - z_k} > 0 ] - Pr[ max_{k ∈ {1,...,K}} {sup_{x ∈ X} ϕ'(x) G(t_k) - z_k} > 0 ] | ≤ c [ q^{1/2} N^{-1/6}(t_1) log^{11/6}{K N(t_1)} + q N^{-α_0/3}(t_1) log^{(5+α_0)/3}{K N(t_1)} ]
for some constant c > 0, with probability at least 1 - O(N^{-α_0}(t_1)), where Pr*(•) denotes the probability measure conditional on the data stream {(X_i, A_i, Y_i)}_{i=1}^{+∞}.

Theorem 4 Assume the conditions in Theorem 3 hold. Then, conditional on N(•), the critical values {ẑ_k}_k satisfy
| Pr[ max_{k ∈ {1,...,K}} {sup_{x ∈ X} ϕ'(x) G(t_k) - ẑ_k} > 0 ] - α | ≤ c [ q^{1/2} N^{-1/6}(t_1) log^{11/6}{K N(t_1)} + q N^{-α_0/3}(t_1) log^{(5+α_0)/3}{K N(t_1)} ], (11)
for some constant c > 0.
When the RHS of equation 11 is o_p(1), it follows from Theorems 2 and 4 that our test is valid. The conditional probability in equation 10 can be approximated by the empirical distribution of the bootstrap samples. Finally, we remark that our test can be updated online as batches of observations arrive at the end of each interim stage. A pseudocode summarizing our procedure is given in Algorithm 1 in the appendix. The space complexity of the proposed algorithm is O(B), where B is the number of bootstrap samples. The time complexity up to the k-th interim stage is O(Bk + N(t_k)). Suppose N(t_j) - N(t_{j-1}) = n for all 1 ≤ j ≤ K; then Bk + N(t_k) = (B + n)k ≪ Bnk = B N(t_k) for large n and B. Hence, our procedure is much faster than the standard wild bootstrap.
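As a toy illustration of the α-spending boundary calculation in equation 10, the sketch below feeds a stand-in array of stagewise statistics (a Gaussian random walk, not the actual multiplier-bootstrap statistics) through the iterative percentile rule used in Algorithm 1; the stand-in process and all constants are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, T, K, B = 0.05, 1.0, 5, 20_000
t = np.linspace(T / K, T, K)

def alpha1(t):
    # alpha_1 spending function: alpha * log(1 + (e - 1) t / T).
    return alpha * np.log1p((np.e - 1) * t / T)

# Stand-in for sqrt(N(t_k)) * S_tilde^{MB*}(t_k): a K x B array of
# correlated statistics, faked here by a normalized Gaussian random walk.
W = np.cumsum(rng.normal(size=(K, B)), axis=0) / np.sqrt(np.arange(1, K + 1))[:, None]

# Iteratively choose z_k so that the fraction of bootstrap paths first
# crossing at stage k matches alpha(t_k) - alpha(t_{k-1}), as in equation 10.
alive = np.ones(B, dtype=bool)            # paths that have not yet crossed
z = np.empty(K)
spend = np.diff(np.concatenate([[0.0], alpha1(t)]))
for k in range(K):
    # Upper quantile among surviving paths, rescaled by the surviving fraction
    # (mirroring the percentile rule in Step 3 of Algorithm 1).
    level = spend[k] / alive.mean()
    z[k] = np.quantile(W[k, alive], 1.0 - level)
    alive &= W[k] <= z[k]

print(z, 1.0 - alive.mean())              # total crossing rate is close to alpha
```

The same loop applies unchanged when `W` holds the genuine multiplier-bootstrap statistics; only the generation of `W` differs.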

4.1. SIMULATION STUDIES

In this section, we conduct Monte Carlo simulations to examine the finite sample properties of the proposed test. We generated the potential outcomes as Y*_i(a) = 1 + (X_{i1} - X_{i2})/2 + a τ(X_i) + ε_i, where the ε_i's are i.i.d. N(0, 0.5²). The covariates X_i = (X_{i1}, X_{i2}, X_{i3})' were generated as follows. We first generated X*_i = (X*_{i1}, X*_{i2}, X*_{i3})' from a multivariate normal distribution with zero mean and covariance matrix {0.5^{|i-j|}}_{i,j}. Then we set X_{ij} = X*_{ij} I(|X*_{ij}| ≤ 2) + 2 sgn(X*_{ij}) I(|X*_{ij}| > 2). We consider two randomization designs. In the first design, the treatment assignment is nondynamic and completely random; specifically, we set π_i(a, x) = 0.5 for any a, x and i. In the second design, we use an ε-greedy strategy to generate the treatment, with ε_0 = 0.3. In addition, we set N(t_1) = 2000 and N(t_j) - N(t_{j-1}) = 2n for 2 ≤ j ≤ K and some n > 0. We consider two combinations of (n, K): (n, K) = (200, 5) and (20, 50). We set the significance level α = 0.05 and choose B = 10000. We set τ(X_i) = φ_δ{(X_{i1} + X_{i2})/√2} X²_{i3} for some function φ_δ parameterized by some δ ≥ 0. We consider two scenarios for φ_δ. Specifically, we set φ_δ(x) = δx²/3 in Scenario 1 and φ_δ(x) = δ cos(πx) in Scenario 2. For each scenario, we further consider six cases by setting δ = 0, 0.1, 0.15, 0.2, 0.25 and 0.3. When δ = 0, H_0 holds; otherwise, H_1 holds. For all settings, we construct the basis function ϕ(•) using additive cubic splines. For each univariate spline, we set the number of internal knots to 4; these knots are equally spaced in [-2, 2]. We denote our test by BAT, short for bootstrap-assisted test. We run our experiments on a single computer instance with 40 Intel(R) Xeon(R) 2.20GHz CPUs. It takes 1-2 seconds on average to compute each test.
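The data-generating mechanism above can be reproduced as follows; this is a sketch, and the sample size and δ value chosen here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

def covariates(n):
    # X* ~ N(0, Sigma) with Sigma_ij = 0.5^{|i-j|}, then truncated to [-2, 2]
    # (clipping implements X_ij = X*_ij I(|X*_ij|<=2) + 2 sgn(X*_ij) I(|X*_ij|>2)).
    cov = 0.5 ** np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
    Xs = rng.multivariate_normal(np.zeros(3), cov, size=n)
    return np.clip(Xs, -2.0, 2.0)

def tau(X, delta, scenario=1):
    # tau(X) = phi_delta((X1 + X2) / sqrt(2)) * X3^2, for the two scenarios.
    u = (X[:, 0] + X[:, 1]) / np.sqrt(2.0)
    p = delta * u ** 2 / 3.0 if scenario == 1 else delta * np.cos(np.pi * u)
    return p * X[:, 2] ** 2

def potential_outcome(X, a, delta, scenario=1):
    # Y*(a) = 1 + (X1 - X2)/2 + a * tau(X) + eps, eps ~ N(0, 0.5^2).
    eps = rng.normal(scale=0.5, size=len(X))
    return 1.0 + (X[:, 0] - X[:, 1]) / 2.0 + a * tau(X, delta, scenario) + eps

X = covariates(10_000)
Y1 = potential_outcome(X, 1, delta=0.2)
print(X.min(), X.max(), Y1.mean())
```

Setting `delta=0` recovers the null H_0 (τ ≡ 0), matching the simulation design in the text.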
In Table 1 (see Appendix G), we report the rejection probabilities and average stopping times (defined as the average number of samples consumed when the experiment is terminated) of the proposed test, aggregated over 400 simulations, when α_1(•) is chosen as the spending function. In Figure 2, we plot the rejection probabilities of our tests and the average stopping times of the experiments. It can be seen that the type-I error rates are close to the nominal level in all cases. The power of our test increases as δ increases, demonstrating its consistency. In addition, when δ > 0, our experiments are stopped early in all cases. To further evaluate our method, we compare it with a test based on the law of iterated logarithm (denoted by LIL). LIL determines the decision boundary based on an always-valid finite-sample error bound (see Appendix F for details about the competing method). It can be seen from Figure 2 that our method has much larger power than the LIL approach.

4.2. REAL DATA ANALYSIS

In this section, we apply the proposed method to a Yahoo! Today Module user click log dataset, which contains 45,811,883 user visits to the Today Module during the first ten days in May 2009. For the i-th visit, the dataset contains an ID of the article recommended to the user, a binary response variable Y_i indicating whether the user clicked on the article or not, and a five-dimensional feature vector summarizing information about the user. Due to privacy concerns, feature definitions and article names are not included in the data. Each feature vector sums to 1; therefore, we took the first three and the fifth elements to form the covariates X_i. For illustration, we only consider the subset of data that contains visits on May 1st where the recommended article ID is either 109510 or 109520; these two articles were recommended most often on that day. This gives us a total of 405,888 visits. On the reduced dataset, define A_i = 1 if the recommended article is 109510 and A_i = 0 otherwise. We first conduct A/A experiments (which compare these two articles against themselves) to examine the validity of our test. The A/A experiments are conducted as follows: whenever 2000 more users become available, we randomly assign 1000 users to arm A and the other 1000 users to arm B. We expect our test not to reject H_0 in the A/A experiments, since the articles being recommended are the same. We then conduct an A/B experiment to test the QTE of these two articles. The test statistics and their corresponding critical values are plotted in Figure 3. On average, it takes several seconds to implement our test. It can be seen that our test is able to reject H_0 after obtaining the first one-third of the observations in the A/B experiment. In the A/A experiments, we fail to reject H_0, as expected.

5. DISCUSSION

In this paper, we propose a new testing procedure for evaluating the performance of technology products in tech companies, based on the notion of qualitative treatment effects. Currently, we only focus on comparing two products. It would be practically useful to develop a multiple testing procedure under settings with multiple treatment options. These topics warrant further investigation.

Algorithm 1: Bootstrap-assisted sequential test for QTE
Input: number of bootstrap samples B, an α spending function α(•).
Initialize: n = 0, Σ̂_0 = Σ̂_1 = O_{p+1}, γ̂_0 = γ̂_1 = 0_{p+1}, β̂_{0,b} = β̂_{1,b} = 0_{p+1} for b = 1, ..., B, and a set I = {1, ..., B}.
For k = 1 to K do
  Initialize: m = 0 and Φ̂_0 = Φ̂_1 = O_{p+1}.
  Step 1: Online update of β̂_a.
  For i = N(t_{k-1}) + 1 to N(t_k) do
    n = n + 1 and m = m + 1;
    Σ̂_a = (1 - n^{-1}) Σ̂_a + n^{-1} ϕ(X_i) ϕ'(X_i) I(A_i = a), a = 0, 1;
    γ̂_a = (1 - n^{-1}) γ̂_a + n^{-1} ϕ(X_i) Y_i I(A_i = a), a = 0, 1;
  Compute β̂_a = Σ̂_a^{-1} γ̂_a for a ∈ {0, 1} and S = sup_{x ∈ X} ϕ'(x)(β̂_1 - β̂_0);
  Step 2: Bootstrap.
  For i = N(t_{k-1}) + 1 to N(t_k) do
    Φ̂_a = Φ̂_a + Σ̂_a^{-1} ϕ(X_i) ϕ'(X_i) {Y_i - ϕ'(X_i) β̂_a}² Σ̂_a^{-1} I(A_i = a), a = 0, 1;
  For b = 1 to B do
    Generate two independent N(0, I_{p+1}) Gaussian vectors e_0, e_1;
    β̂_{a,b} = (1 - m n^{-1}) β̂_{a,b} + n^{-1} Φ̂_a^{1/2} e_a, a = 0, 1;
    Compute S̃_b = sup_{x ∈ X} ϕ'(x)(β̂_{1,b} - β̂_{0,b});
  Step 3: Reject or not.
  Set ẑ to be the upper {α(t_k) - |I^c|/B} / (1 - |I^c|/B) percentile of {S̃_b}_{b ∈ I};
  Update I as I ← {b ∈ I : S̃_b ≤ ẑ};
  If S > ẑ: reject H_0 and terminate the experiment.
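The online update of Σ̂_a and γ̂_a in Step 1 of Algorithm 1 can be sketched as below; the class name, the toy basis, and the simulated data are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

class OnlineOLS:
    """Running averages of Sigma_hat_a and gamma_hat_a (Step 1 of Algorithm 1).

    Each arrival updates the averages in O(q^2) time, so beta_hat_a can be
    recomputed at any interim stage without revisiting past observations.
    """

    def __init__(self, q):
        self.n = 0
        self.Sigma = {a: np.zeros((q, q)) for a in (0, 1)}
        self.gamma = {a: np.zeros(q) for a in (0, 1)}

    def update(self, phi_x, a, y):
        self.n += 1
        w = 1.0 / self.n
        for arm in (0, 1):
            ind = 1.0 if arm == a else 0.0   # I(A_i = arm)
            self.Sigma[arm] = (1 - w) * self.Sigma[arm] + w * ind * np.outer(phi_x, phi_x)
            self.gamma[arm] = (1 - w) * self.gamma[arm] + w * ind * y * phi_x

    def beta(self, a):
        # Moore-Penrose inverse, matching the generalized inverse in the text.
        return np.linalg.pinv(self.Sigma[a]) @ self.gamma[a]

# Sanity check on simulated data: y = 1 + 0.5 x + 0.3 a + noise.
rng = np.random.default_rng(4)
phi = lambda x: np.array([1.0, x])
model = OnlineOLS(q=2)
for _ in range(5000):
    x, a = rng.uniform(), int(rng.integers(2))
    y = 1.0 + 0.5 * x + 0.3 * a + rng.normal(scale=0.1)
    model.update(phi(x), a, y)
print(model.beta(0), model.beta(1))
```

Note that both arms' averages are shrunk by (1 - n^{-1}) at every arrival, exactly as in the recursion of Algorithm 1, so they remain averages over all n observations.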

A NOTATIONS

We introduce some general notation used in the appendix. For any matrix Mat, we use ||Mat||_p to denote the matrix norm induced by the corresponding ℓ_p vector norm, for 1 ≤ p ≤ +∞. For two nonnegative sequences {s_{1,n}}_n and {s_{2,n}}_n, we use the notation s_{1,n} ≲ s_{2,n} to represent s_{1,n} ≤ c s_{2,n} for some universal constant c > 0, whose value is allowed to change from place to place; we write s_{1,n} ≫ s_{2,n} if s_{2,n}/s_{1,n} → 0. When a matrix Mat is degenerate, Mat^{-1} denotes the Moore-Penrose inverse of Mat. For any vector ψ, we use ψ^{(i)} to denote its i-th element. In Algorithm 1, we use O_{p+1} to denote a (p+1) × (p+1) zero matrix and 0_{p+1} to denote a (p+1)-dimensional zero vector.

B MORE ON THE BASIS FUNCTION

B.1 CONDITION (A3)

(A3). Assume λ_min[E ϕ(X) ϕ'(X)] ≳ 1, λ_max[E ϕ(X) ϕ'(X)] ≲ 1, sup_x ||ϕ(x)||_1 = O(q^{1/2}), and lim inf_q inf_{x ∈ X} ||ϕ(x)||_2 > 0. In addition, assume
sup_{x, y ∈ X, x ≠ y} ||ϕ(x) - ϕ(y)||_2 / ||x - y||_2 ≲ q^{1/2}. (12)
When a tensor-product B-spline basis is used (see Section 6 of Chen & Christensen, 2015, for a brief overview of tensor-product B-splines), (A3) is automatically satisfied. Specifically, λ_min[E ϕ(X) ϕ'(X)] ≳ 1 and λ_max[E ϕ(X) ϕ'(X)] ≲ 1 follow from Theorem 3.3 of Burman & Chen (1989); sup_x ||ϕ(x)||_1 = O(q^{1/2}) follows by noting that the absolute value of each element of ϕ(x) is bounded by some universal constant; lim inf_q inf_{x ∈ X} ||ϕ(x)||_2 > 0 follows from the arguments used in the proof of Lemma E.4 of Shi et al. (2020b). The last condition, equation 12, holds by noting that each function in the vector ϕ(•) is Lipschitz continuous when a tensor-product B-spline is used.
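The eigenvalue part of (A3) can be checked numerically for a given basis. The sketch below uses a rescaled histogram (piecewise-constant) basis on [0, 1] as a crude stand-in for a tensor-product B-spline, and it verifies only the eigenvalue condition (this basis is not Lipschitz, so it does not satisfy equation 12); all choices here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

def phi(x):
    # Indicator ("histogram") basis on q = 4 equal bins of [0, 1), rescaled
    # by sqrt(q) so E[phi(X) phi(X)'] has eigenvalues bounded away from 0
    # and infinity when X is uniform.
    q = 4
    bins = np.minimum((x * q).astype(int), q - 1)
    out = np.zeros((len(x), q))
    out[np.arange(len(x)), bins] = np.sqrt(q)
    return out

X = rng.uniform(size=200_000)
P = phi(X)
M = P.T @ P / len(X)              # Monte Carlo estimate of E[phi(X) phi(X)']
eig = np.linalg.eigvalsh(M)
print(eig.min(), eig.max())       # both near 1 for uniform X
```

For uniform X each bin has probability 1/q, so the diagonal entries of E[ϕ(X)ϕ'(X)] are q · (1/q) = 1 and both eigenvalue bounds hold with constants close to 1.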

B.2 ON THE APPROXIMATION ERROR

The proposed test remains valid as long as the approximation error satisfies
inf_{β_0, β_1 ∈ R^q} sup_{x ∈ X, a ∈ {0,1}} |Q_0(x, a) - Q(x, a; β_0, β_1)| = o({N(T)}^{-1/2}), (13)
with probability tending to 1. In the following, we introduce a sufficient condition for equation 13. Suppose the Q-function Q_0(•, a) is p-smooth for a = 0, 1. Then, when a tensor-product B-spline basis is used,
inf_{β_0, β_1} sup_{x, a} |Q_0(x, a) - Q(x, a; β_0, β_1)| = O(q^{-p/d}).
See Section 2.2 of Huang (1998) for detailed discussions on the approximation power of these basis functions. Condition equation 13 is thus automatically satisfied when q ≫ {N(T)}^{d/(2p)}, with probability tending to 1.

C ADAPTIVE RANDOMIZATION

In practice, the company might want to allocate more traffic to the better treatment based on the observed data stream. The ε-greedy strategy is commonly used to balance the trade-off between exploration and exploitation. For a given 0 < ε_0 < 1, consider the following randomization procedure: for some integer N_0 > 0 and any j ≥ N_0 and x ∈ X, set
π_{j-1}(1, x) = (1 - ε_0) I{ϕ'(x)(β̂_{1,j-1} - β̂_{0,j-1}) > 0} + ε_0 I{ϕ'(x)(β̂_{1,j-1} - β̂_{0,j-1}) ≤ 0},
and π_{j-1}(0, x) = 1 - π_{j-1}(1, x), where β̂_{a,j} = Σ̂_{a,j}^{-1} { j^{-1} Σ_{i=1}^j I(A_i = a) ϕ(X_i) Y_i } and Σ̂_{a,j} = j^{-1} Σ_{i=1}^j I(A_i = a) ϕ(X_i) ϕ'(X_i). It is immediate to see that Σ̂_a(t) = Σ̂_{a,N(t)} and β̂_a(t) = β̂_{a,N(t)}. Define
π*(1, x) = (1 - ε_0) I{ϕ'(x)(β*_1 - β*_0) > 0} + ε_0 I{ϕ'(x)(β*_1 - β*_0) ≤ 0},
and π*(0, x) = 1 - π*(1, x), for any x ∈ X.

Lemma 2 Assume (A1)-(A3) hold. Assume inf_{x ∈ X} π*(a, x) > 0 and |Y*(a)| is bounded almost surely, for a ∈ {0, 1}. Assume Pr(|ϕ'(X)(β*_1 - β*_0)| ≤ ε) ≤ L_0 ε for some constant L_0 > 0 and any ε > 0. Then for any {j_n}_n that satisfies √j_n / √log j_n ≫ q², the following event occurs with probability at least 1 - O(j_n^{-1}):
Σ_{a ∈ {0,1}} | Σ_{i=1}^k E_{F_{i-1}} {π_{i-1}(a, X) - π*(a, X)} | ≲ q √(k log k), ∀k ≥ j_n.
Lemma 2 implies that Condition equation 5 in Theorem 1 automatically holds with α_0 = 1/2 when the ε-greedy strategy is used. When ϕ'(X)(β*_1 - β*_0) is a continuous random variable, the assumption Pr(|ϕ'(X)(β*_1 - β*_0)| ≤ ε) ≤ L_0 ε for any ε > 0 is satisfied if ϕ'(X)(β*_1 - β*_0) has a bounded probability density function. When ϕ'(X)(β*_1 - β*_0) is discrete, this assumption is satisfied if inf_{x ∈ X} |ϕ'(x)(β*_1 - β*_0)| > 0.
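A minimal sketch of the ε-greedy assignment rule follows; the exact probability split (greedy arm with probability 1 - ε_0, the other arm with probability ε_0) is one reading of the rule above and should be treated as an assumption, as are all names:

```python
import numpy as np

rng = np.random.default_rng(5)

def epsilon_greedy_prob(phi_x, beta1, beta0, eps0=0.3):
    """Assignment probabilities pi(a, x): the arm with the larger estimated
    conditional mean gets probability 1 - eps0, the other gets eps0."""
    greedy = 1 if phi_x @ (beta1 - beta0) > 0 else 0
    p1 = (1 - eps0) if greedy == 1 else eps0
    return {1: p1, 0: 1.0 - p1}

def draw_action(phi_x, beta1, beta0, eps0=0.3):
    return int(rng.uniform() < epsilon_greedy_prob(phi_x, beta1, beta0, eps0)[1])

# Toy example: arm 1 looks better whenever x > 0.
phi = lambda x: np.array([1.0, x])
beta1, beta0 = np.array([0.0, 1.0]), np.array([0.0, 0.0])
acts = [draw_action(phi(0.8), beta1, beta0) for _ in range(10_000)]
print(np.mean(acts))   # close to 1 - eps0 = 0.7
```

Since π(1, x) + π(0, x) = 1 and both probabilities are bounded below by min(ε_0, 1 - ε_0), the condition inf_x π*(a, x) > 0 in Lemma 2 holds by construction.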

D TESTING THE AVERAGE TREATMENT EFFECTS

D.1 THE ALGORITHM

We focus on testing the following hypothesis, H 0 : EY * i (1) ≤ EY * i (0) versus H 1 : EY * i (1) > EY * i (0). Under (A1) and (A2), it suffices to test H 0 : EQ(X i , 1) ≤ EQ(X i , 0) versus H 1 : EQ(X i , 1) > EQ(X i , 0). We similarly use basis approximations to model the Q-function. Our proposal is summarized in the following algorithm; simulation studies evaluating it are reported in Section D.2.

Input: the number of bootstrap samples B and an α spending function α(•).
Initialize: n = 0, Σ 0 = Σ 1 = O p+1 , γ 0 = γ 1 = 0 p+1 , β 0,b = β 1,b = 0 p+1 for b = 1, . . . , B, φ̄ = 0 p+1 and the index set I = {1, . . . , B}.
For k = 1 to K do:
    Initialize: m = 0, φ̂ = 0 and Φ 0 = Φ 1 = O p+1 .
    For i = N (t k-1 ) + 1 to N (t k ) do:
        n = n + 1, m = m + 1 and φ̄ = n -1 (n - 1) φ̄ + n -1 ϕ(X i );
        Σ a = (1 - n -1 ) Σ a + n -1 ϕ(X i )ϕ (X i )I(A i = a), a = 0, 1;
        γ a = (1 - n -1 ) γ a + n -1 ϕ(X i )Y i I(A i = a), a = 0, 1;
    Compute β a = Σ -1 a γ a for a ∈ {0, 1} and S = φ̄ ( β 1 - β 0 );
    For i = N (t k-1 ) + 1 to N (t k ) do:
        φ̂ = φ̂ + [{ϕ(X i ) - φ̄} ( β 1 - β 0 )] 2 ;
        Φ a = Φ a + Σ -1 a ϕ(X i )ϕ (X i ){Y i - ϕ (X i ) β a } 2 Σ -1 a I(A i = a), a = 0, 1;
    For b = 1, . . . , B do:
        Generate two independent N (0, I p+1 ) Gaussian vectors e 0 , e 1 and an N (0, 1) random variable e 2 ;
        β a,b = (1 - mn -1 ) β a,b + n -1 Φ 1/2 a e a + n -1 φ̂ 1/2 e 2 , a = 0, 1;
        Compute S b = φ̄ ( β 1,b - β 0,b );
    Set z to be the upper {α(t k ) - |I c |/B}/(1 - |I c |/B)-th percentile of { S b } b∈I ;
    Update I as I ← {b ∈ I : S b ≤ z};
    If S > z: reject H 0 and terminate the experiment.
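The inner loop maintains the running basis mean together with Σ a and γ a at O(p 2 ) cost per observation, without storing the raw data. A minimal sketch of that updating step (our own function name; a small ridge term is our addition for numerical stability, and the bootstrap step is omitted):

```python
import numpy as np

def online_ols_updates(basis, actions, rewards):
    """One pass of the streaming updates used in the algorithm: the running
    basis mean phi_bar, and Sigma_a, gamma_a for each arm a, so that
    beta_a = Sigma_a^{-1} gamma_a after n observations."""
    n = 0
    p = basis.shape[1]
    phi_bar = np.zeros(p)
    Sigma = [np.zeros((p, p)), np.zeros((p, p))]
    gamma = [np.zeros(p), np.zeros(p)]
    for phi_i, a_i, y_i in zip(basis, actions, rewards):
        n += 1
        phi_bar = (1 - 1 / n) * phi_bar + phi_i / n
        for a in (0, 1):
            ind = 1.0 if a_i == a else 0.0
            Sigma[a] = (1 - 1 / n) * Sigma[a] + ind * np.outer(phi_i, phi_i) / n
            gamma[a] = (1 - 1 / n) * gamma[a] + ind * phi_i * y_i / n
    # small ridge term (our addition) so the sketch never hits a singular matrix
    beta = [np.linalg.solve(Sigma[a] + 1e-8 * np.eye(p), gamma[a]) for a in (0, 1)]
    S_hat = phi_bar @ (beta[1] - beta[0])
    return phi_bar, Sigma, gamma, beta, S_hat
```

After processing a block of data, S_hat plays the role of the test statistic S = φ̄ ( β 1 - β 0 ); by construction the recursions reproduce the corresponding batch averages.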

D.2 NUMERICAL STUDIES

In this section, we compare our procedure with the always valid test for testing the ATE (Johari et al., 2015). We generate the potential outcomes from the same model as before, except that the ε i 's are i.i.d. N (0, 1). We set N (T 1 ) = 1000 and N (T j ) - N (T j-1 ) = 2n for 2 ≤ j ≤ K and some n > 0, and consider two combinations (n, K) = (100, 5) and (10, 50). For all settings, we use a linear function to approximate Q. Table 2 (see Appendix G) and Figure 4 report the rejection probabilities and average stopping times of the proposed test aggregated over 400 simulations, with α 1 (•) chosen as the spending function. Our method is more powerful than the always valid test when the effect size is small, and comparable when the effect size is large. The always valid test fails in the adaptive randomization settings: its type-I error rates are around 50% under the null hypothesis.

E PROOFS

Note that we require X to be a compact set. To simplify the proof, we assume X = [0, 1] d . Set F 0 = ∅.

E.1 PROOF OF LEMMA 1

We state the following lemma before proving Lemma 1.

Lemma 3 For any j ≥ 1, (X j , Y * j (0), Y * j (1)) ⊥ ⊥ F j-1 .

For any a ∈ {0, 1}, i ≥ 1, notice that EI(A i = a){Y i - ϕ (X i )β a } = EI(A i = a){Y * i (a) - ϕ (X i )β a } = EE Xi,Fi-1 [I(A i = a){Y * i (a) - ϕ (X i )β a }], where the first equation is due to Assumption (A1) and E Xi,Fi-1 denotes the conditional expectation given F i-1 and X i . By Assumption (A2), we have E Xi,Fi-1 [I(A i = a){Y * i (a) - ϕ (X i )β a }] = {E Xi,Fi-1 I(A i = a)}[E Xi,Fi-1 {Y * i (a) - ϕ (X i )β a }]. The second term on the RHS equals zero due to Lemma 3 and our model assumption E{Y * i (a)|X i } = ϕ (X i )β a . The proof is hence completed.
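The moment condition E[I(A i = a)ϕ(X i ){Y i - ϕ (X i )β a }] = 0 established above can be checked numerically in a simplified i.i.d. setting (our own toy design with a covariate-dependent, but not history-dependent, propensity; all names and numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta0 = np.array([0.5, -0.3])
beta1 = np.array([1.0, 0.8])

X = rng.random(n)
phi = np.column_stack([np.ones(n), X])      # linear basis phi(x) = (1, x)'
pi1 = 0.2 + 0.6 * X                         # covariate-dependent propensity
A = rng.binomial(1, pi1)
# outcomes follow the working model E{Y*(a) | X} = phi(X)'beta_a
Y = np.where(A == 1, phi @ beta1, phi @ beta0) + rng.normal(size=n)

# sample versions of E[ I(A = a) phi(X) {Y - phi(X)'beta_a} ], a = 0, 1
m1 = (phi * ((A == 1) * (Y - phi @ beta1))[:, None]).mean(axis=0)
m0 = (phi * ((A == 0) * (Y - phi @ beta0))[:, None]).mean(axis=0)
```

Both sample moments are within Monte Carlo error of zero even though the treatment probability depends on X, which is exactly the factorization used in the proof.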

E.2 PROOF OF THEOREM 1

Let n(•) be the realization of the counting process N (•). We will show the assertion in Theorem 1 holds for any such realizations that satisfy n(t 1 ) < n(t 2 ) < • • • < n(t K ). The case where some of the n(t k )'s are the same can be similarly discussed. For any j ≥ 1, define σ(F j ) to be the σ-algebra generated by F j . For a ∈ {0, 1}, define Σ a,j = 1 j j i=1 I(A i = a)ϕ(X i )ϕ (X i ) and β a,j = Σ -1 a,j 1 j j i=1 I(A i = a)ϕ(X i )Y i . It is immediate to see that Σ a (t) = Σ a,n(t) and β a (t) = β a,n(t) . Define δ n = qn -α0 log α0 n. We state the following lemmas before proving Theorem 1. Lemma 4 There exists some constant 0 < ε 0 < 1 such that λ min [Eϕ(X)ϕ (X)] ≥ ε 0 , λ max [Eϕ(X)ϕ (X)] ≤ ε -1 0 , sup x ϕ(x) 2 ≤ sup x ϕ(x) 1 ≤ ε -1 0 √ q, min a∈{0,1} λ min [Σ a ] ≥ ε 0 , max a∈{0,1} β a 2 ≤ ε -1 0 , max a∈{0,1} |Y * (a)| ≤ ε -1 0 and sup x max a∈{0,1} |ϕ (x)β a | ≤ ε -1 0 . Lemma 5 Assume the conditions in Theorem 1 hold. Then for any sequence {j n } n that satisfies j α0 n / log α0 (j n ) ≳ q 2 , we have with probability at least 1 - O(j -α0 n ) that for any a ∈ {0, 1} and any k ≥ j n , Σ a,k - Σ a 2 ≲ qδ k + qk -1 log k, (14) Σ -1 a,k - Σ -1 a 2 ≲ qδ k + qk -1 log k. (15) Lemma 6 Assume the conditions in Theorem 1 hold. Then for any sequence {j n } n that satisfies j n / log(j n ) ≳ q, we have with probability at least 1 - O(j -1 n ) that for any a ∈ {0, 1} and any k ≥ j n , k i=1 ϕ(X i )I(A i = a){Y i - ϕ (X i )β a } 2 ≲ q 1/2 k 1/2 log 1/2 k. For a ∈ {0, 1}, β a,k - β a = Σ -1 a,k { 1 k k i=1 I(A i = a)ϕ(X i ){Y i - ϕ (X i )β a }}, and hence β a,k - β a - Σ -1 a 1 k k i=1 I(A i = a)ϕ(X i ){Y i - ϕ (X i )β a } 2 ≤ Σ -1 a,k - Σ -1 a 2 1 k k i=1 I(A i = a)ϕ(X i ){Y i - ϕ (X i )β a } 2 ≲ (qδ k + qk -1 log k)q 1/2 k -1/2 log 1/2 k, ∀k ≥ j n , (16) with probability at least 1 - O(j -α0 n ), by Lemma 5 and Lemma 6. Define B * (t) = n -1/2 (t) n(t) i=1 [Σ -1 1 ϕ(X i )A i {Y i - ϕ (X i )β 1 } - Σ -1 0 ϕ(X i )(1 - A i ){Y i - ϕ (X i )β 0 }].
It follows that B * (t k ) -B(t k ) 2 {q 3/2 δ n(t k ) + q n -1 (t k ) log n(t k )}n -1/2 (t k ) log 1/2 n(t k ), ∀k ≥ 1, with probability at least 1 -O(n -α0 (t 1 )), and hence sup x∈X ϕ (x)B * (t k ) -sup x∈X ϕ (x)B(t k ) 2 ≤ c{q 2 δ n(t k ) + q 3/2 n -1 (t k ) log n(t k )} n -1 (t k ) log n(t k ), ∀k ≥ 1, with probability at least 1 -O(n -α0 (t 1 )), for some constant c > 0, by equation 41. Under the given conditions on q and n(t 1 ), we have q n -1 (t k ) log n(t k ) = o(1), ∀k ≥ 1, and hence sup x∈X ϕ (x)B * (t k ) -sup x∈X ϕ (x)B(t k ) 2 ≤ c{qδ n(t k ) + qn -1 (t k ) log n(t k )}, ∀k ≥ 1, Thus, for any given z 1 , z 2 , . . . , z K , we obtain Pr max k∈{1,...,K} sup x∈X ϕ (x)B * (t k ) -z k,-≤ 0 -O(n -α0 (t 1 )) ≤ Pr max k∈{1,...,K} sup x∈X ϕ (x)B(t k ) -z k ≤ 0 (18) ≤ Pr max k∈{1,...,K} sup x∈X ϕ (x)B * (t k ) -z k,+ ≤ 0 + O(n -α0 (t 1 )), where z k,-= z k -c{qδ n(t k ) + qn -1 (t k ) log n(t k )}, z k,+ = z k + c{qδ n(t k ) + qn -1 (t k ) log n(t k )}. For any i ≥ 1, 1 ≤ k ≤ K, define a q-dimensional vector ξ i,k = 1 n(t k ) [Σ -1 1 ϕ(X i )A i {Y i -ϕ (X i )β 1 } -Σ -1 0 ϕ(X i )(1 -A i ){Y i -ϕ (X i )β 0 }]I(i ≤ n(t k )), or equivalently, ξ i,k = 1 n(t k ) [Σ -1 1 ϕ(X i )A i {Y * i (1) -ϕ (X i )β 1 } -Σ -1 0 ϕ(X i )(1 -A i ){Y * i (0) -ϕ (X i )β 0 }]I(i ≤ n(t k )), by Condition (A1). Let ξ i = (ξ i,1 , ξ i,2 , • • • , ξ i,K ) and M j = j i=1 ξ i . The sequence {M i } i≥1 forms a multivariate martingale with respect to the filtration {σ(F i ) : i ≥ 1}, since E(ξ i,k |F i ) = [{E(ξ i,k |A i , X i , F i )}|F i ] = 0, by (A2). Let n(t 0 ) = 0. For any i such that n(t k-1 ) < i ≤ n(t k ) for some 1 ≤ k ≤ K, we have ξ i ∞ ≤ 1 n(t k ) { Σ -1 1 ϕ(X i ){Y * i (1) -ϕ (X i )β 1 } 2 + Σ -1 0 ϕ(X i ){Y * i (0) -ϕ (X i )β 0 } 2 } ≤ 4 √ qn -1/2 (t k ) -3 0 , where the second inequality is due to Lemma 4. Therefore, E ξ i 3 ∞ q 3/2 n 3/2 (t k ) . 
It follows that n(t K ) i=1 E ξ i 3 ∞ = K k=1 n(t k ) i=n(t k-1 )+1 E ξ i 3 ∞ q 3/2 K k=1 n(t k ) -n(t k-1 ) n 3/2 (t k ) (19) ≤ q 3/2 n(t 1 ) + q 3/2 K k=2 n(t k ) -n(t k-1 ) n 3/2 (t k ) ≤ q 3/2 n -1/2 (t 1 ) + q 3/2 +∞ n(t1) x -3/2 dx = 3q 3/2 n -1/2 (t 1 ). Define a sequence of independent Gaussian vectors {η i } i≥1 that satisfy η i ∼ N (0, E(ξ i ξ i |F i-1 )) for any i ≥ 1. Then the distribution of η i is the same as I(i ≤ n(t 1 )) n(t 1 ) Z , I(i ≤ n(t 2 )) n(t 2 ) Z , • • • , I(i ≤ n(t K )) n(t K ) Z , where Z is a p-dimensional mean-zero Gaussian vector with covariance matrix Cov[ a∈{0,1} Σ -1 a ϕ(X i )I(A i = a){Y * i (a) -ϕ (X i )β a }|F i-1 ] (20) = a∈{0,1} Σ -1 a E[ϕ(X i )ϕ (X i )I(A i = a){Y * i (a) -ϕ (X i )β a } 2 |F i-1 ]Σ -1 a = a∈{0,1} Σ -1 a E{ϕ(X i )ϕ (X i )I(A i = a)σ 2 (a, X i )|F i-1 }Σ -1 a = a∈{0,1} Σ -1 a E{ϕ(X i )ϕ (X i )π i-1 (a, X i )σ 2 (a, X i )|F i-1 }Σ -1 a ≡ a∈{0,1} Σ -1 a E Fi-1 π i-1 (a, X)σ 2 (a, X)ϕ(X)ϕ (X)Σ -1 a , where the second equality follows from (A2) and Lemma 3, the third equality is due to the definition of π i-1 and the last equality follows from Lemma 3. Similar to equation 19, we can show that n(t K ) i=1 E η i 3 ∞ q 3/2 n -1/2 (t 1 ). Using similar arguments in equation 20, we can show that for any 1 ≤ k 1 ≤ k 2 ≤ K, n(t K ) i=1 E{ξ i,k1 ξ i,k2 |F i-1 } = 1 n(t k1 )n(t k2 ) n(t k 1 ) i=1 a∈{0,1} Σ -1 a E Fi-1 π i-1 (a, X)σ 2 (a, X)ϕ(X)ϕ (X)Σ -1 a . Let V (k 1 , k 2 ) = 1 n(t k1 )n(t k2 ) n(t k 1 ) i=1 a∈{0,1} Σ -1 a E Fi-1 π * (a, X)σ 2 (a, X)ϕ(X)ϕ (X)Σ -1 a = 1 n(t k1 )n(t k2 ) n(t k 1 ) i=1 a∈{0,1} Σ -1 a Φ a Σ -1 a = n(t k1 ) n(t k2 ) a∈{0,1} Σ -1 a Φ a Σ -1 a . Consider an arbitrary sequence of R p+1 vectors {b k } 1≤k≤K . Under the given conditions, we have b k1   n(t K ) i=1 E(ξ i,k1 ξ 1,k2 |F i-1 ) -V (k 1 , k 2 )   b k2 1 n(t k1 ) a∈{0,1} n(t k 1 ) i=1 E Fi-1 {π i-1 (a, X) -π * (a, X)}σ 2 (a, X)ϕ(X)ϕ (X) 2 b k1 2 b k2 2 . Define a matrix V as V =     V (1, 1) V (1, 2) . . . 
V (1, K) V (2, 1) V (2, 2) . . . V (2, K) . . . . . . . . . V (K, 1) V (K, 2) . . . V (K, K)     . It follows that n(t K ) i=1 E(ξ i ξ i |F i-1 ) -V 2 sup a∈{0,1} j≥n(t1) 1 j j i=1 E Fi-1 {π i-1 (a, X) -π * (a, X)}σ 2 (a, X)ϕ(X)ϕ (X) 2 . Using similar arguments in proving equation 14, we can show the RHS of the above equation is upper bounded by -2 0 q sup a∈{0,1} x∈X,j≥n(t1) 1 j j i=1 {π i-1 (a, x) -π * (a, x)} , and hence by -2 0 qδ n(t1) , with probability at least 1 -O(n -α0 (t 1 )). Therefore, we have λ min   V + δ n(t1) I Kp×Kp - n(t K ) i=1 E(ξ i ξ i |F i-1 )   ≥ 0, with probability at least 1 -O(n -α0 (t 1 )), where I Kp×Kp denotes a Kp × Kp identity matrix. Moreover, notice that sup a∈{0,1} x∈X,j≥n(t1) 1 j j i=1 {π i-1 (a, x) -π * (a, x)} is bounded between 0 and 1. For any a ∈ {0, 1} and any z > 0, we have E sup a∈{0,1} x∈X,j≥n(t1) 1 j j i=1 {π i-1 (a, x) -π * (a, x)} ≤ E sup a∈{0,1} x∈X,j≥n(t1) 1 j j i=1 {π i-1 (a, x) -π * (a, x)} I    sup a∈{0,1} x∈X,j≥n(t1) 1 j j i=1 {π i-1 (a, x) -π * (a, x)} ≤ z    + Pr    sup a∈{0,1} x∈X,j≥n(t1) 1 j j i=1 {π i-1 (a, x) -π * (a, x)} > z    . Under the given conditions, we have E sup a∈{0,1} x∈X,j≥n(t1) 1 j j i=1 {π i-1 (a, x) -π * (a, x)} δ n(t1) + O(n -α0 (t 1 )). Therefore, we obtain E n(t K ) i=1 E(ξ i ξ i |F i-1 ) -V 2 qn -α0 (t 1 ) + qδ n(t1) , or E n(t K ) i=1 E(ξ i ξ i |F i-1 ) -V 2 qδ n(t1) , since n -α0 (t 1 ) δ n(t1) . Combining equation 19 with equation 21, equation 23 and equation 24, an application of Theorem 2.1 in Belloni & Oliveira (2018)  yields that |Eψ(M n(t K ) ) -Eψ(N (0, V ))| (25) c 0 (ψ)n -α0 (t 1 ) + c 2 (ψ)qδ n(t1) + c 3 (ψ)q 3/2 n -1/2 (t 1 ), for any thrice differential function ψ(•), and c 0 (ψ) = sup z,z ∈R pK |ψ(z) -ψ(z )| and c i = sup z∈R pK j1,••• ,ji |∂ j1 ∂ j2 • • • ∂ ji ψ(z)|, i = 2, 3, where ∂ j g(z) denotes the partial derivative ∂g(z)/∂z (j) for any function g(•) and z (j) stands for the j-th element of z. 
Let X 0 be an ε-net of X that satisfies the following: for any x ∈ X, there exists some x 0 ∈ X 0 such that x - x 0 2 ≤ ε. Set ε = √ d/n 4 (t 1 ). Since X = [0, 1] d , there exists some X 0 with |X 0 | ≤ n 4d (t 1 ), where |X 0 | denotes the number of elements in X 0 . Under Condition (A3), we have sup x∈X inf x0∈X0 ϕ(x) - ϕ(x 0 ) 2 ≲ √ q/n 4 (t 1 ). It follows that sup ν 2 =1 | sup x∈X ϕ (x)ν - sup x∈X0 ϕ (x)ν| ≲ √ q/n 4 (t 1 ). (27) Using similar arguments in showing equation 17, we can show the following event occurs with probability at least 1 - O(n -1 (t 1 )): B * (t k ) 2 ≲ q 1/2 log 1/2 n(t k ), ∀k ≥ 1. This together with equation 27 yields max k∈{1,...,K} sup x∈X ϕ (x)B * (t k ) - max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) ≤ max k∈{1,...,K} sup x∈X ϕ (x)B * (t k ) - sup x∈X0 ϕ (x)B * (t k ) ≲ q log 1/2 n(t K )/n 4 (t 1 ), with probability at least 1 - O(n -1 (t 1 )). Under the given conditions, we have n(t 1 ) ≳ max(q, log n(t K )). It follows that there exists some constant c * > 0 such that max k∈{1,...,K} sup x∈X ϕ (x)B * (t k ) - max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) ≤ c * n -2 (t 1 ), with probability at least 1 - O(n -1 (t 1 )). Define z * k,- = z k - c{qδ n(t k ) + qn -1 (t k ) log n(t k )} - c * n -2 (t 1 ) and z * k,+ = z k + c{qδ n(t k ) + qn -1 (t k ) log n(t k )} + c * n -2 (t 1 ). By the construction of X 0 , there exist vectors d 1 , . . . , d L ∈ R qK with L ≤ n 4d (t 1 )K, max j d j 1 ≤ ε -1 0 q 1/2 and a function k(•) that maps {1, . . . , L} into {1, . . . , K} such that max k∈{1,...,K} {sup x∈X0 ϕ (x)B * (t k ) - ν k } = max 1≤j≤L {d j M n(t K ) - ν k(j) }, for any {ν k } K k=1 . For any η > 0, m ∈ R qK , consider the function φ η,{ν k } k : R qK → R, defined as φ η,{ν k } k (m) = η -1 log{ L j=1 exp[η{d j m - ν k(j) }]}. It has the following property: max 1≤j≤L {d j m - ν k(j) } ≤ φ η,{ν k } k (m) ≤ max 1≤j≤L {d j m - ν k(j) } + η -1 log L ≤ max 1≤j≤L {d j m - ν k(j) } + η -1 {log K + 4d log n(t 1 )} = max 1≤j≤L [d j m - {ν k(j) - η -1 log K - η -1 4d log n(t 1 )}].

It follows that

Pr( max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) - z * k,+ ≤ 0) ≤ Pr( φ η,{z * * k,+ } k (M n(t K ) ) ≤ 0), and Pr( max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) - z * k,- ≤ 0) = Pr( max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) - (z * k,- - 3δ) ≤ 3δ) ≥ Pr( φ η,{z * * k,-} k (M n(t K ) ) ≤ 3δ), (32) where z * * k,+ = z * k,+ + η -1 {log K + 4d log n(t 1 )} and z * * k,- = z * k,- - 3δ. The value of δ will be specified later. In addition, with some calculations, we have ∂ j φ η,{ν k } k (m) = L i=1 d (j) i exp[η{d i m - ν k(i) }] / L i=1 exp[η{d i m - ν k(i) }], ∂ j1 ∂ j2 φ η,{ν k } k (m) = η L i=1 d (j1) i d (j2) i exp[η{d i m - ν k(i) }] / L i=1 exp[η{d i m - ν k(i) }] - η l=1,2 { L i=1 d (j l ) i exp[η{d i m - ν k(i) }] / L i=1 exp[η{d i m - ν k(i) }] }, ∂ j1 ∂ j2 ∂ j3 φ η,{ν k } k (m) = η 2 L i=1 d (j1) i d (j2) i d (j3) i exp[η{d i m - ν k(i) }] / L i=1 exp[η{d i m - ν k(i) }] - 3η 2 { L i=1 d (j1) i d (j2) i exp[η{d i m - ν k(i) }] / L i=1 exp[η{d i m - ν k(i) }] } × { L i=1 d (j3) i exp[η{d i m - ν k(i) }] / L i=1 exp[η{d i m - ν k(i) }] } + 2η 2 l=1,2,3 { L i=1 d (j l ) i exp[η{d i m - ν k(i) }] / L i=1 exp[η{d i m - ν k(i) }] }. Since max j d j 1 ≤ ε -1 0 q 1/2 , we obtain that j |∂ j φ η,{ν k } k (m)| ≤ ε -1 0 q 1/2 , j1,j2 |∂ j1 ∂ j2 φ η,{ν k } k (m)| ≤ 2ηε -2 0 q, (33) j1,j2,j3 |∂ j1 ∂ j2 ∂ j3 φ η,{ν k } k (m)| ≤ 6η 2 ε -3 0 q 3/2 . By Lemma 5.1 of Chernozhukov et al. (2016), for any δ > 0, there exists some function g δ (•) : R → R with g′ δ ∞ ≤ δ -1 , g″ δ ∞ ≤ K 0 δ -2 , g‴ δ ∞ ≤ K 0 δ -3 for some constant K 0 > 0 such that I(z 0 ≤ 0) ≤ g δ (z 0 ) ≤ I(z 0 ≤ 3δ), ∀z 0 ∈ R. It follows that I(φ η,{ν k } k (m) ≤ 0) ≤ g δ • φ η,{ν k } k (m) ≤ I(φ η,{ν k } k (m) ≤ 3δ) for any m ∈ R qK , and hence Pr( max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) - z * k,+ ≤ 0) ≤ Eg δ • φ η,{z * * k,+ } k (M n(t K ) ), (34) Pr( max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) - z * k,- ≤ 0) ≥ Eg δ • φ η,{z * * k,-} k (M n(t K ) ). (35) Consider the function g δ • φ η,{ν k } k . Apparently, we have sup δ,η,{ν k } k c 0 (g δ • φ η,{ν k } k ) ≤ 1. (36)
By equation 33, we can show that sup δ,η,{ν k } k c 2 (g δ • φ η,{ν k } k ) ≲ δ -2 q + δ -1 ηq and sup δ,η,{ν k } k c 3 (g δ • φ η,{ν k } k ) ≲ δ -3 q 3/2 + δ -2 ηq 3/2 + δ -1 η 2 q 3/2 . (37) Setting δ = η -1 {log K + 4d log n(t 1 )}, we obtain sup η,{ν k } k c i (g δ • φ η,{ν k } k ) ≲ q i/2 η i {log i K + log i n(t 1 )}, i = 2, 3. Combining equation 37 together with equation 25 and equation 36 yields sup δ,η,{ν k } k |Eg δ • φ η,{ν k } k (M n(t K ) ) - Eg δ • φ η,{ν k } k (N (0, V ))| ≲ n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ).

This together with equation 34 and equation 35 yields

Pr max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) -z * k,+ ≤ 0 -Eg δ • φ η,{z * * k,+ } k (N (0, V )) (38) n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ), Eg δ • φ η,{z * * k,-} k (N (0, V )) -Pr max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) -z * k,- ≤ 0 (39) n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ).

Similar to equation 31-equation 35, we can show

Eg δ • φ η,{z * * k,+ } k (N (0, V )) ≤ Pr φ η,{z * * k,+ } k (N (0, V )) ≤ 3δ ≤ Pr max 1≤j≤L {d j N (0, V ) -z * * k(j),+ } ≤ 3δ = Pr max 1≤j≤L {d j N (0, V ) -z * * * k(j),+ } ≤ 0 , Eg δ • φ η,{z * * k,-} k (N (0, V )) ≥ Pr φ η,{z * * k,-} k (N (0, V )) ≤ 0 ≥ Pr max 1≤j≤L {d j N (0, V ) -z * * * k(j),-} ≤ 0 , where z * * * k,+ = z * k,+ + η -1 {log K + 4d log n(t 1 )} + 3δ and z * * * k,-= z * k,--η -1 {log K + 4d log n(t 1 )} -3δ, for each k. Notice that for any {ν k } k , we have Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -ν k ≤ 0 = Pr max 1≤j≤L {d j N (0, V ) -ν k(j) } ≤ 0 .

This together with equation 38 and equation 39 yields

Pr max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) -z * k,+ ≤ 0 -Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -z * * * k,+ ≤ 0 n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ), Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -z * * * k,- ≤ 0 -Pr max k∈{1,...,K} sup x∈X0 ϕ (x)B * (t k ) -z * k,- ≤ 0 n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ). In view of equation 29, we have shown that Pr max k∈{1,...,K} sup x∈X ϕ (x)B(t k ) -z k ≤ 0 -Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -z * * * k,+ ≤ 0 n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ), Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -z * * * k,- ≤ 0 -Pr max k∈{1,...,K} sup x∈X ϕ (x)B(t k ) -z k ≤ 0 n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ), The covariance matrix Cov(G(t k )) is given by a∈{0,1} Σ -1 a Φ a Σ -1 a and is nonsingular by Lemma 4. In addition, we have ϕ(x) 2 ≥ c, ∀x ∈ X 0 , by Condition A3. Thus, there exists some constant c * > 0 such that c * ≤ ϕ (x)   a∈{0,1} Σ -1 a Φ a Σ -1 a   1/2 ϕ(x), ∀x ∈ X 0 . By Theorem 1 of Chernozhukov et al. (2017) , we obtain that Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -z * * * k,+ ≤ 0 -Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -z * * * k,- ≤ 0 η -1 {log K + log n(t 1 )} 3/2 + qδ n(t1) {log K + log n(t 1 )} 1/2 + qn -1 (t 1 ) log n(t 1 ){log K + log n(t 1 )} 1/2 .

Thus, we obtain

Pr max k∈{1,...,K} sup x∈X ϕ (x)B(t k ) -z k ≤ 0 -Pr max k∈{1,...,K} sup x∈X ϕ (x)G(t k ) -z k ≤ 0 n -1/2 (t 1 )q 3 η 3 {log 3 K + log 3 n(t 1 )} + q 2 η 2 {log 2 K + log 2 n(t 1 )}δ n(t1) + n -α0 (t 1 ) + η -1 {log K + log n(t 1 )} 3/2 + qδ n(t1) {log K + log n(t 1 )} 1/2 + qn -1 (t 1 ) log n(t 1 ){log K + log n(t 1 )} 1/2 . Setting η = min(q -3/4 n 1/8 (t 1 ) log -3/8 {Kn(t 1 )}, q -1 n -α0/3 (t 1 ) log -α0/3-1/6 {Kn(t 1 )}) yields the desired results. The proof is hence completed.
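The function φ η,{ν k } k used throughout this proof is the standard log-sum-exp smooth maximum, satisfying max j v j ≤ φ η (v) ≤ max j v j + η -1 log L. A quick numerical check of this sandwich bound (a standalone sketch, not code from the paper):

```python
import numpy as np

def smooth_max(vals, eta):
    """Log-sum-exp smooth maximum of a vector:
    max(vals) <= smooth_max(vals, eta) <= max(vals) + log(len(vals)) / eta."""
    vals = np.asarray(vals, dtype=float)
    m = vals.max()                 # shift by the max for numerical stability
    return m + np.log(np.exp(eta * (vals - m)).sum()) / eta

vals = [0.3, -1.2, 0.9, 0.1]
approx = {eta: smooth_max(vals, eta) for eta in (1.0, 10.0, 100.0)}
```

As η grows, the approximation error η -1 log L vanishes, which is why the proof trades the hard maximum for this thrice-differentiable surrogate and then tunes η.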

E.3 PROOF OF LEMMA 3

The assertion trivially holds for j = 1. We prove it holds for any j ≥ 2, by induction. By (A2), we have (X j , Y * j (0), Y * j (1)) ⊥ ⊥ A 1 |X 1 . Since (X j , Y * j (0), Y * j (1)) ⊥ ⊥ (X 1 , Y * 1 (0), Y * 1 (1)), this further implies (X j , Y * j (0), Y * j (1)) ⊥ ⊥ A 1 and hence (X j , Y * j (0), Y * j (1)) ⊥ ⊥ (X 1 , A 1 , Y * 1 (0), Y * 1 (1)). By (A1), Y 1 is completely determined by A 1 , Y * 1 (0) and Y * 1 (1). Therefore, we obtain (X j , Y * j (0), Y * j (1)) ⊥ ⊥ F 1 . Suppose we have shown (X j , Y * j (0), Y * j (1)) ⊥ ⊥ F k for some k < j - 1. To prove (X j , Y * j (0), Y * j (1)) ⊥ ⊥ F k+1 , it suffices to show (X j , Y * j (0), Y * j (1)) ⊥ ⊥ (X k+1 , A k+1 , Y k+1 ). By (A1), Y k+1 is determined by A k+1 , Y * k+1 (0) and Y * k+1 (1). Since (X j , Y * j (0), Y * j (1)) ⊥ ⊥ (X k+1 , Y * k+1 (0), Y * k+1 (1)), it suffices to show (X j , Y * j (0), Y * j (1)) ⊥ ⊥ A k+1 . This is implied by (X j , Y * j (0), Y * j (1)) ⊥ ⊥ A k+1 |(X k+1 , F k ) and (X j , Y * j (0), Y * j (1)) ⊥ ⊥ (X k+1 , F k ). The proof is hence completed.

E.4 PROOF OF LEMMA 4

The assertions ε 0 ≤ λ min [Eϕ(X)ϕ (X)] ≤ λ max [Eϕ(X)ϕ (X)] ≤ ε -1 0 and sup x ϕ(x) 1 ≤ ε -1 0 √ q, (40) for some 0 < ε 0 < 1, are directly implied by the conditions that λ min [Eϕ(X)ϕ (X)] ≳ 1, λ max [Eϕ(X)ϕ (X)] ≲ 1 and sup x ϕ(x) 1 ≲ √ q. Since ϕ(x) 2 ≤ ϕ(x) 1 , we obtain sup x ϕ(x) 2 ≤ sup x ϕ(x) 1 ≤ ε -1 0 √ q. Under the condition inf a,x π * (a, x) > 0, we can similarly show that λ min [Σ a ] ≥ ε 0 for some ε 0 > 0. Since |Y * (0)| and |Y * (1)| are bounded, there exists some constant 0 < ε 0 < 1 that satisfies max a∈{0,1} |Y * (a)| ≤ ε -1 0 . Notice that ϕ (x)β a = E{Y * (a)|X = x}. Boundedness of |Y * (a)| implies that the conditional mean E{Y * (a)|X} is a bounded random variable as well. As a result, we obtain sup x∈X max a∈{0,1} |ϕ (x)β a | ≤ ε -1 0 . Notice that β a = Σ -1 a Eϕ(X)Y * (a). Since λ min [Σ a ] is bounded away from 0, it suffices to show Eϕ(X)Y * (a) 2 = O(1), or equivalently, sup ν∈R p , ν 2 =1 |Eν ϕ(X)Y * (a)| = O(1). By the Cauchy-Schwarz inequality, it suffices to show sup ν∈R p , ν 2 =1 E|Y * (a)| 2 E|ν ϕ(X)| 2 = O(1). Since |Y * (a)| = O(1) almost surely, we have by the condition λ max [Eϕ(X)ϕ (X)] = O(1) that sup ν∈R p , ν 2 =1 E|ν ϕ(X)| 2 = sup ν∈R p , ν 2 =1 ν Eϕ(X)ϕ (X)ν ≤ λ max [Eϕ(X)ϕ (X)] = O(1). The proof is hence completed.

E.5 PROOF OF LEMMA 5

E.5.1 PROOF OF EQUATION 14

Notice that j Σ 1,j - Σ 1 2 ≤ j i=1 {A i ϕ(X i )ϕ (X i ) - E Fi-1 π i-1 (1, X)ϕ(X)ϕ (X)} 2 (42) + j E Fi-1 ϕ(X)ϕ (X){ 1 j j i=1 π i-1 (1, X) - π * (1, X)} 2 . By Lemma 4, we have E Fi-1 ϕ(X)ϕ (X){ 1 j j i=1 π i-1 (1, X) - π * (1, X)} 2 ≤ ε -2 0 qE Fi-1 | 1 j j i=1 π i-1 (1, X) - π * (1, X)| ≤ ε -2 0 q 2 j -α0 log α0 j, ∀j ≥ j n , with probability at least 1 - O(j -α0 n ). Consider the first term on the RHS of equation 42. For any i ≥ 1, define M i = ϕ(X i )ϕ (X i ){A i - π i-1 (1, X i )}.
Notice that {M i } i≥1 forms a martingale difference sequence with respect to the filtration {σ(F i-1 ) : i ≥ 2}, since E[ϕ(X i )ϕ (X i ){A i - π i-1 (1, X i )}|F i-1 ] (43) = E Fi-1 [E Fi-1 ,Xi (ϕ(X i )ϕ (X i ){A i - π i-1 (1, X i )})] = 0, where E Fi-1 ,Xi denotes the conditional expectation given X i and F i-1 . Here, the first equality is due to that X i ⊥ ⊥ F i-1 , implied by Lemma 3. Under the given conditions on the basis function ϕ(•), using similar arguments in proving Equation (C.15) of Shi et al. (2020b), we can show that the following event occurs with probability at least 1 - O(j -2 ): j i=1 M i 2 ≲ q 1/2 j 1/2 log 1/2 j. (44) Notice that k≥j k -2 ≤ j -2 + k>j {k(k - 1)} -1 = j -2 + j -1 . Thus, the following occurs with probability at least 1 - O(j -1 n ): j i=1 {A i ϕ(X i )ϕ (X i ) - E Fi-1 π i-1 (1, X)ϕ(X)ϕ (X)} 2 ≲ q 1/2 j 1/2 log 1/2 j, ∀j ≥ j n . It follows that Σ 1,k - Σ 1 2 ≲ qδ k + qk -1 log k, ∀k ≥ j n , with probability at least 1 - O(j -α0 n ). Similarly, we can show Σ 0,k - Σ 0 2 ≲ qδ k + qk -1 log k, ∀k ≥ j n , with probability at least 1 - O(j -α0 n ). The proof is hence completed.
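Equation 44 rests on the fact that sums of conditionally centered increments grow at roughly the √(k log k) rate even under history-dependent allocation. A toy numerical illustration (our own adaptive propensity rule, not the paper's ε-greedy design):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
partial_sum = 0.0      # martingale: sum of A_i - pi_{i-1}
n_treated = 0
max_ratio = 0.0
for i in range(1, n + 1):
    # toy history-dependent propensity: lean towards the arm picked more often,
    # clipped away from 0 and 1 (positivity, as assumed in the paper)
    p = 0.5 if i == 1 else min(max(0.1, n_treated / (i - 1)), 0.9)
    a = rng.binomial(1, p)
    n_treated += a
    partial_sum += a - p        # increment has mean zero given the past
    if i >= 100:
        max_ratio = max(max_ratio, abs(partial_sum) / np.sqrt(i * np.log(i)))
```

Despite the feedback between past assignments and the current propensity, max_ratio stays bounded by a small constant, in line with the Azuma-Hoeffding-type bound invoked in the proof.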

E.5.2 PROOF OF EQUATION 15

When j n satisfies j α0 n / log α0 (j n ) ≳ q 2 , it follows from equation 14 and equation 40 that λ min [ Σ a,k ] ≥ λ min [Σ a ] - Σ a,k - Σ a 2 ≥ 2 -1 ε 0 , ∀k ≥ j n , (45) with probability at least 1 - O(j -α0 n ). Combining equation 40 with equation 45 and equation 14, we obtain Σ -1 a,k - Σ -1 a 2 = Σ -1 a,k ( Σ a,k - Σ a )Σ -1 a 2 ≤ λ -1 min [Σ a ]λ -1 min [ Σ a,k ] Σ a,k - Σ a 2 ≲ qδ k + qk -1 log k, ∀k ≥ j n , with probability at least 1 - O(j -α0 n ). The proof is hence completed.

E.6 PROOF OF LEMMA 6

For any l ∈ {1, . . . , q} and i ≥ 1, define M i (l) = ϕ (l) (X i )A i {Y i - ϕ (X i )β 1 }. Here, ϕ (l) (X i ) corresponds to the l-th element of ϕ(X i ). Similar to equation 43, we can show {M i (l)} i≥1 forms a martingale difference sequence with respect to the filtration {σ(F i-1 ) : i ≥ 1}. By Lemma 4, we have for any l, E{ϕ (l) (X i )} 2 ≤ λ max [Eϕ(X i )ϕ (X i )] ≤ ε -1 0 . (46) Notice that E{M 2 i (l)|F i-1 } = E[{ϕ (l) (X i )} 2 A i {Y * i (1) - ϕ (X i )β 1 } 2 |F i-1 ] ≤ E[{ϕ (l) (X i )} 2 {Y * i (1) - ϕ (X i )β 1 } 2 |F i-1 ] = Eσ 2 (1, X i ){ϕ (l) (X i )} 2 ≤ 4ε -2 0 E{ϕ (l) (X i )} 2 ≤ 4ε -3 0 , where the first equality is due to (A1), the first inequality is due to that A i is bounded between 0 and 1, the second equality follows from Lemma 3, the second inequality follows from Lemma 4, and the last inequality is due to equation 46. It follows that k i=1 E{M 2 i (l)|F i-1 } ≤ 4ε -3 0 k. Similarly, by (A1) and Lemma 4, we have k i=1 M 2 i (l) ≤ 4ε -2 0 k i=1 {ϕ (l) (X i )} 2 . (47) Similar to equation 44, we can show with probability at least 1 - O(j -1 ) that k i=1 [M 2 i (l) - E{M 2 i (l)|F i-1 }] ≲ q 1/2 k 1/2 log 1/2 k, ∀k ≥ j. Thus, for any sequence j n that satisfies j n / log(j n ) ≳ q, we have by equation 47 that k i=1 M 2 i (l) + k i=1 E{M 2 i (l)|F i-1 } ≤ ck, ∀k ≥ j n , for some constant c > 0, with probability at least 1 - O(j -1 n ). It follows that Pr( ∩ k≥jn {| k i=1 M i (l)| ≤ 2(ck log k) 1/2 }) ≥ Pr( ∩ k≥jn { k i=1 [M 2 i (l) + E{M 2 i (l)|F i-1 }] ≤ ck}) - Pr(( ∪ k≥jn {| k i=1 M i (l)| > 2(ck log k) 1/2 }) ∩ ( ∩ k≥jn { k i=1 [M 2 i (l) + E{M 2 i (l)|F i-1 }] ≤ ck})) ≥ 1 - Pr(( ∪ k≥jn {| k i=1 M i (l)| > 2(ck log k) 1/2 }) ∩ ( ∩ k≥jn { k i=1 [M 2 i (l) + E{M 2 i (l)|F i-1 }] ≤ ck})) - O(j -1 n ).
By Bonferroni's inequality and Theorem 2.1 of Bercu & Touati (2008) , we have Pr   k≥jn {| k i=1 M i (l)| ≤ 2 ck log k}   ≥ 1 -O(j -1 n ) - k≥jn Pr   {| k i=1 M i (l)| > 2 ck log k}    k ≥jn { k i=1 [M 2 i (l) + {M 2 i (l)|F i-1 }] ≤ ck }      ≥ 1 -O(j -1 n ) - k≥jn Pr {| k i=1 M i (l)| > 2 ck log k} k i=1 [M 2 i (l) + {M 2 i (l)|F i-1 }] ≤ ck ≥ 1 -O(j -1 n ) -2 k≥jn exp - 4ck log k 2ck = 1 -O(j -1 n ) - k≥jn 2k -2 . ( ) The last term on the RHS of equation 49 is 1 -O(j -1 n ). To summarize, we have shown that the following event occurs with probability at least 1 -O(j -1 n ), k≥jn | k i=1 M i (l)| ≤ 2 ck log k . By Bonferroni's inequality, we have k≥jn k i=1 ϕ(X i )A i {Y i -ϕ (X i )β 1 } 2 ≤ 2 cqk log k , with probability at least 1 -O(j -1/2 n ). Similarly, we can show k≥jn k i=1 ϕ(X i )(1 -A i ){Y i -ϕ (X i )β 0 } 2 ≤ c qk log k , for some constant c > 0, with probability at least 1 -O(j -1 n ). The proof is hence completed.

E.7 PROOF OF THEOREM 3

We state the following lemmas before presenting the proof. Lemma 7 Assume the conditions in Theorem 3 hold. Then for any sequence {j n } n that satisfies j α0 n / log α0 j n q 2 , we have with probability at least 1 -O(j -α0 n ) that β a,k -β a 2 q 1/2 k -1/2 log k, ∀a ∈ {0, 1}, ∀k ≥ j n . Lemma 8 Assume the conditions in Theorem 3 hold. Then for any sequence {j n } n that satisfies j α0 n / log α0 j n q 2 , we have with probability at least 1 -O(j -α0 n ) that 1 k k i=1 I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a } 2 -Φ a 2 qδ k + q 1/2 k -1/2 log k, ∀a ∈ {0, 1}, k ≥ j n . Similar to the proof of Theorem 1, we will show the assertion in Theorem 3 holds for any n(•) that correspond to the realizations of N (•) that satisfy n(t 1 ) < n(t 2 ) < • • • < n(t K ). For any 1 ≤ k 1 ≤ k 2 ≤ K, define V (k 1 , k 2 ) = n(t k1 )n(t k2 )Cov β MB * 1 (t k1 ) -β MB * 0 (t k1 ), β MB * 1 (t k2 ) -β MB * 0 (t k2 )|{(X i , A i , Y i )} +∞ i=1 = 1 n(t k1 )n(t k2 ) 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a (t j )I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i ) β a (t j )} 2 Σ -1 a (t j ), V =      V (1, 1) V (1, 2) . . . V (1, K) V (2, 1) V (2, 2) . . . V (2, K) . . . . . . . . . V (K, 1) V (K, 2) . . . V (K, K)      . We aim to bound the entrywise ∞ norm of V -V where V is defined in equation 22. It suffices to bound max 1≤k1≤k2≤K sup b1,b2∈R p+1 , b1 2= b2 2=1 |b T 1 { V (k 1 , k 2 ) -V (k 1 , k 2 )}b 2 | = max 1≤k1≤k2≤K V (k 1 , k 2 ) -V (k 1 , k 2 ) 2 . For any k 1 , k 2 , we decompose V (k 1 , k 2 ) -V (k 1 , k 2 ) as V (k 1 , k 2 ) -V (k 1 , k 2 ) = V (k 1 , k 2 ) -V * (k 1 , k 2 ) + V * (k 1 , k 2 ) -V * * (k 1 , k 2 ) + V * * (k 1 , k 2 ) -V (k 1 , k 2 ), where V * (k 1 , k 2 ) = 1 n(t k1 )n(t k2 ) 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i ) β a (t j )} 2 Σ -1 a , V * * (k 1 , k 2 ) = 1 n(t k1 )n(t k2 ) 1 a=0 n(t k 1 ) j=1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a } 2 Σ -1 a . 
By Lemma 4 and Lemma 8, we obtain that max 1≤k1≤k2≤K V * * (k 1 , k 2 ) -V (k 1 , k 2 ) 2 ≤ max 1≤k1≤K 1 a=0 1 n(t k1 ) n(t k 1 ) j=1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a } 2 Σ -1 a -Σ -1 a Φ a Σ -1 a 2 ≤ max 1≤k1≤K 1 2 0 1 a=0 1 n(t k1 ) n(t k 1 ) j=1 I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a } 2 -Φ a 2 qδ n(t1) + q 1/2 n -1/2 (t 1 ) log n(t 1 ), with probability at least 1 -O(n -α0 (t 1 )). Notice that n(t k1 )n(t k2 ) V * (k 1 , k 2 ) = 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i ) β a (t j )} 2 Σ -1 a = 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a + ϕ (X i )β a -ϕ (X i ) β a (t j )} 2 Σ -1 a = 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){ϕ (X i )β a -ϕ (X i ) β a (t j )} 2 Σ -1 a +2 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a }ϕ (X i ){β a -β a (t j )}Σ -1 a + n(t k1 )n(t k2 ) V * * (k 1 , k 2 ). It follows that max 1≤k1≤k2≤K V * (k 1 , k 2 ) -V * * (k 1 , k 2 ) 2 ≤ max 1≤k1≤K 1 n(t k1 ) 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){ϕ (X i )β a -ϕ (X i ) β a (t j )} 2 Σ -1 a 2 + max 1≤k1≤K 2 n(t k1 ) 1 a=0 k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a }ϕ (X i )(β a -β a (t j ))Σ -1 a 2 . By Lemma 4, we obtain that max 1≤k1≤k2≤K V * (k 1 , k 2 ) -V * * (k 1 , k 2 ) 2 (51) max 1≤k1≤K a∈{0,1} 1 n(t k1 ) k1 j=1 n(tj ) i=n(tj-1)+1 I(A i = a)ϕ(X i )ϕ (X i ){ϕ (X i )β a -ϕ (X i ) β a (t j )} 2 Ψ 1,a,k 1 2 + max 1≤k1≤K a∈{0,1} 2 n(t k1 ) k1 j=1 n(tj ) i=n(tj-1)+1 I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a }ϕ (X i ){β a -β a (t j )} Ψ 2,a,k 1 2 . By Lemmas 4 and 7, we have with probability at least 1 -O(n -1 (t 1 )) that 1 n(t k1 ) Ψ 1,a,k1 2 q 2 n -1 (t 1 ) log{n(t 1 )} 1 n(t k1 ) n(t k 1 ) i=1 I(A i = a)ϕ(X i )ϕ (X i ) 2 , ( ) ∀1 ≤ k 1 ≤ K, a ∈ {0, 1}. 
Similar to Lemma 5, we can show there exists some constant c * > 0 that 1 n(t k1 ) n(t k 1 ) i=1 [I(A i = a)ϕ(X i )ϕ (X i ) -E Fi-1 {I(A i = a)ϕ(X i )ϕ (X i )}] 2 (53) ≤ c * {qδ n(t k 1 ) + q 1/2 n -1/2 (t k1 ) log n(t k1 )}, ∀1 ≤ k 1 ≤ K, a ∈ {0, 1}, with probability at least 1 -O(n -1 (t 1 )). By Lemma 4, we can show with probability at least 1 -O(n -1 (t 1 )) that max 1≤k1≤K 1 n(t k1 ) n(t k 1 ) i=1 E Fi-1 {I(A i = a)ϕ(X i )ϕ (X i )} 2 = O(1). This together with equation 52 and equation 53 yields n -1 (t k1 ) Ψ 1,a,k1 2 q 2 n -1 (t 1 ) log{n(t 1 )}, ∀1 ≤ k 1 ≤ K, a ∈ {0, 1}, with probability at least 1 -O(n -1 (t 1 )). Moreover, using similar arguments in proving Equation (C.15) of Shi et al. (2020b) , we can show that for any 1 ≤ k 1 ≤ K, the following event occurs with probability at least 1 -O(n -2 (t k1 )), 1 n(t k1 ) n(t k 1 ) i=1 I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a }ϕ (l) (X i ) 2 q 1/2 n -1/2 (t k1 ) log n(t k1 ), ∀1 ≤ l ≤ q. Since K k1=1 n -2 (t k1 ) ≤ n -1 (t 1 ), we obtain with probability at least 1 -O(n -1 (t 1 )) that 1 n(t k1 ) n(t k 1 ) i=1 I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i )β a }ϕ (l) (X i ) 2 q 1/2 n -1/2 (t k1 ) log n(t k1 ), ∀1 ≤ l ≤ q, 1 ≤ k 1 ≤ K. In addition, it follows from Lemma 7 that n -1 (t k1 ) Ψ 2,a,k1 2 q 3/2 n -1 (t 1 ) log{n(t 1 )}, ∀1 ≤ k 1 ≤ K, a ∈ {0, 1}. This together with equation 54 yields that max 1≤k1≤k2≤K V * (k 1 , k 2 ) -V * * (k 1 , k 2 ) 2 q 2 n -1 (t 1 ) log n(t 1 ), with probability at least 1 -O(n -1 (t 1 )). Under the given conditions, we have max 1≤k1≤k2≤K V * (k 1 , k 2 ) -V * * (k 1 , k 2 ) 2 q 1/2 n -1/2 (t 1 ) log 1/2 n(t 1 ), with probability at least 1 -O(n -1 (t 1 )). 
Moreover, with some calculations, we can show that max 1≤k1≤k2≤K V (k 1 , k 2 ) -V * (k 1 , k 2 ) 2 ≤ 1 a=0 max j≥1 Σ -1 a -Σ -1 a (t j ) 2 × max 1≤k1≤K 2 n(t k1 ) k1 j=1 n(tj ) i=n(tj-1)+1 I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i ) β a (t j )} 2 Σ -1 a 2 + 1 a=0 max 1≤k1≤K 1 n(t k1 ) k1 j=1 n(tj ) i=n(tj-1)+1 Σ -1 a I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i ) β a (t j )} 2 Σ -1 a 2 × max j≥1 Σ -1 a -Σ -1 a (t j ) 2 2 . In view of Lemma 4 and Lemma 5, we have with probability at least 1 -O(n -α0 (t 1 )) that  max 1≤k1≤k2≤K V (k 1 , k 2 ) -V * (k 1 , k 2 ) 2 ≤ O(1)(qδ n(t1) + qn -1 (t 1 ) log n(t 1 )) × max 1≤k1≤K 1 n(t k1 ) k1 j=1 n(tj ) i=n(tj-1)+1 I(A i = a)ϕ(X i )ϕ (X i ){Y i -ϕ (X i ) β a (t j )} 2 Ψ 3,a, V (k 1 , k 2 ) -V * (k 1 , k 2 ) 2 qδ n(t1) + qn -1 (t 1 ) log n(t 1 ), with probability at least 1-O(n -α0 (t 1 )). Combining this together with equation 50 and equation 55, we obtain with probability at least 1 -O(n -α0 (t 1 )) that max 1≤k1≤k2≤K V (k 1 , k 2 ) -V (k 1 , k 2 ) 2 qδ n(t1) + qn -1 (t 1 ) log n(t 1 ). Consider the function g δ • φ η,{ν k } k defined in the proof of Theorem 1. We fix δ = η -1 {log K + 4d log n(t 1 )}. 
Based on Lemma A2 in Belloni & Oliveira (2018) , we have with probability at least  1 -O(n -α0 (t 1 )) that sup {ν k } k E * g δ • φ η,{ν k } k (N (0, V )) -Eg δ • φ η,{ν k } k (N (0, V )) qη 2 {log 2 K + log 2 n(t 1 )} qδ n(t1) + qn -1 ( n(t k ) S MB * -ν k ≤ 0 ≤ E * g δ • φ η,{ν k,+ } k (N (0, V )) ≤ Eg δ • φ η,{ν k,+ } k (N (0, V )) + O(1)qη 2 {log 2 K + log 2 n(t 1 )} qδ n(t1) + qn -1 (t 1 ) log n(t 1 ) ≤ Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -ν * k,+ ≤ 0 + O(1)qη 2 {log 2 K + log 2 n(t 1 )} qδ n(t1) + qn -1 (t 1 ) log n(t 1 ) , Pr * max k∈{1,...,K} n(t k ) S MB * -ν k ≤ 0 ≥ E * g δ • φ η,{ν k,-} k (N (0, V )) ≥ Eg δ • φ η,{ν k,-} k (N (0, V )) -O(1)qη 2 {log 2 K + log 2 n(t 1 )} qδ n(t1) + qn -1 (t 1 ) log n(t 1 ) ≥ Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -ν * k,- ≤ 0 -O(1)qη 2 {log 2 K + log 2 n(t 1 )} qδ n(t1) + qn -1 (t 1 ) log n(t 1 ) , where O(1) denotes some positive constant, and  ν k,+ = ν k + η -1 {4d log n(t 1 ) + log K} + c * n -2 (t 1 ), ν * k,+ = ν k,+ + 3η -1 {4d log n(t 1 ) + log K}, ν k,-= ν k -3η -1 {4d log n(t 1 ) + log K} -c * n -2 (t 1 ), ν * k,-= ν k,--η -1 {4d ϕ (x)G(t k ) -ν * k,+ ≤ 0 -Pr max k∈{1,...,K} sup x∈X0 ϕ (x)G(t k ) -ν * k,- ≤ 0 η -1 {log 3/2 n(t 1 ) + log 3/2 K} + c * n -2 (t 1 ){log 1/2 n(t 1 ) + log 1/2 K}. It follows that sup {ν k } k Pr * max k∈{1,...,K} n(t k ) S MB * -ν k ≤ 0 -Pr max k∈{1,...,K} sup x∈X ϕ (x)G(t k ) -ν k ≤ 0 qη 2 {log 2 K + log 2 n(t 1 )} qδ n(t1) + qn -1 (t 1 ) log n(t 1 ) +η -1 {log 3/2 n(t 1 ) + log 3/2 K} + c * n -2 (t 1 ){log 1/2 n(t 1 ) + log 1/2 K}, with probability at least 1 -O(n -α0 (t 1 )). Set η = min[q -1 n α0/3 (t 1 ) log -(1+2α0)/6 {Kn(t 1 )}, q -1/2 n 1/6 (t 1 ) log -1/3 {Kn(t 1 )}], we obtain the desired result. E.8 PROOF OF LEMMA 7 Combining Lemma 6 with Lemma 4 yields that Σ -1 a 1 k k i=1 I(A i = a)ϕ(X i ){Y i -ϕ (X i )β a } 2 q 1/2 k -1/2 log k, ∀k ≥ j n , a ∈ {0, 1}, with probability at least 1 -O(j -1 n ). 
Combining this together with equation 16 yields that
$$\big\|\widehat\beta_{a,k}-\beta_a\big\|_2 \lesssim q^{1/2}k^{-1/2}\log k,\qquad \forall k\ge j_n,\ a\in\{0,1\},$$
with probability at least $1-O(j_n^{-1})$. The proof is hence completed.

E.9 PROOF OF LEMMA 8

Notice that
$$\Big\|\frac1k\sum_{i=1}^k I(A_i=a)\phi(X_i)\phi^\top(X_i)\{Y_i-\phi^\top(X_i)\beta_a\}^2-\Phi_a\Big\|_2 \le \Big\|\frac1k\sum_{i=1}^k I(A_i=a)\phi(X_i)\phi^\top(X_i)\big[\{Y_i-\phi^\top(X_i)\beta_a\}^2-\sigma^2(a,X_i)\big]\Big\|_2 + \Big\|\frac1k\sum_{i=1}^k I(A_i=a)\phi(X_i)\phi^\top(X_i)\sigma^2(a,X_i)-\Phi_a\Big\|_2. \tag{56}$$
Similar to the proof of Lemma 5, we can show that the second term on the RHS of equation 56 is of the order $O(q\delta_k+qk^{-1}\log k)$, for any $a\in\{0,1\}$ and any $k\ge j_n$, with probability at least $1-O(j_n^{-\alpha_0})$. As for the first term, notice that each element of the matrix
$$\frac1k\sum_{i=1}^k I(A_i=a)\phi(X_i)\phi^\top(X_i)\big[\{Y_i-\phi^\top(X_i)\beta_a\}^2-\sigma^2(a,X_i)\big]$$
corresponds to a martingale with respect to the filtration $\{\sigma(\mathcal F_{i-1}):i\ge 1\}$, under (A1) and (A2). Using similar arguments to those in the proof of Equation (C.15) of Shi et al. (2020b), we can show that
$$\Big\|\frac1k\sum_{i=1}^k I(A_i=a)\phi(X_i)\phi^\top(X_i)\big[\{Y_i-\phi^\top(X_i)\beta_a\}^2-\sigma^2(a,X_i)\big]\Big\|_2 \lesssim q^{1/2}k^{-1/2}\log k,\qquad \forall a\in\{0,1\},\ k\ge j_n,$$
with probability at least $1-O(j_n^{-1})$. The proof is hence completed.
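The martingale concentration used for the first term can be sanity-checked numerically. Below is a toy sketch, not the paper's setting: a martingale-difference average with a bounded, predictable weight, whose root-$k$-scaled error stays $O(1)$, consistent with the $k^{-1/2}$ rate up to logarithmic factors.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 5000, 200
errs = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(n)                                # innovations
    w = np.tanh(np.cumsum(eps) / np.sqrt(np.arange(1, n + 1)))  # bounded function of the past
    w = np.concatenate(([0.0], w[:-1]))                         # shift: w_i is F_{i-1}-measurable
    errs[r] = abs(np.mean(w * eps))                             # martingale average
scaled = np.sqrt(n) * errs.mean()                               # O(1) if the rate is n^{-1/2}
```

Because each summand $w_i\varepsilon_i$ is a bounded-variance martingale difference, the scaled error concentrates around a constant rather than growing with $n$, mirroring the role of the filtration argument in the proof.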

E.10 PROOF OF LEMMA 2

We begin by providing an upper bound for $\max_{a\in\{0,1\}}\|\widehat\beta_{a,k}-\beta_a\|_2$. With some calculations, we have
$$\max_{a\in\{0,1\}}\big\|\widehat\beta_{a,k}-\beta_a\big\|_2=\max_{a\in\{0,1\}}\Big\|\frac1k\widehat\Sigma_{a,k}^{-1}\sum_{i=1}^k\phi(X_i)I(A_i=a)\{Y_i-\phi^\top(X_i)\beta_a\}\Big\|_2 \le \max_{a\in\{0,1\}}\big\|\widehat\Sigma_{a,k}^{-1}\big\|_2\,\max_{a\in\{0,1\}}\Big\|\frac1k\sum_{i=1}^k\phi(X_i)I(A_i=a)\{Y_i-\phi^\top(X_i)\beta_a\}\Big\|_2.$$
By Lemma 6, we obtain with probability at least $1-O(j_n^{-1})$ that
$$\max_{a\in\{0,1\}}\Big\|\frac1k\sum_{i=1}^k\phi(X_i)I(A_i=a)\{Y_i-\phi^\top(X_i)\beta_a\}\Big\|_2 \lesssim q^{1/2}k^{-1/2}\log k,\qquad \forall k\ge j_n. \tag{58}$$
Similarly, we can show with probability at least $1-O(j_n^{-1})$ that
$$\max_{a\in\{0,1\}}\Big\|\frac1k\sum_{i=1}^k\phi(X_i)I(A_i=a)\{Y_i-\phi^\top(X_i)\beta_a\}\Big\|_2 \lesssim q^{1/2}k^{-1/2}\log j_n,\qquad \forall 1\le k<j_n. \tag{59}$$
Similar to equation 42, we have
$$\min_{a\in\{0,1\}}\lambda_{\min}\big[\widehat\Sigma_{a,k}\big] \ge \min_{a\in\{0,1\}}\lambda_{\min}\Big(E_{\mathcal F_{i-1}}\Big[\phi(X)\phi^\top(X)\frac1k\sum_{i=1}^k\pi_{i-1}(a,X)\Big]\Big) - \max_{a\in\{0,1\}}\Big\|\frac1k\sum_{i=1}^k\big\{I(A_i=a)\phi(X_i)\phi^\top(X_i)-E_{\mathcal F_{i-1}}\pi_{i-1}(a,X)\phi(X)\phi^\top(X)\big\}\Big\|_2.$$
Using similar arguments to those in the proof of equation 44, we can show that
$$\max_{a\in\{0,1\}}\Big\|\sum_{i=1}^k\big\{I(A_i=a)\phi(X_i)\phi^\top(X_i)-E_{\mathcal F_{i-1}}\pi_{i-1}(a,X)\phi(X)\phi^\top(X)\big\}\Big\|_2 \lesssim \sqrt{qk\log k},\qquad \forall k\ge j_n, \tag{60}$$
with probability at least $1-O(j_n^{-1})$. Similarly, we can show
$$\max_{a\in\{0,1\}}\Big\|\sum_{i=1}^k\big\{I(A_i=a)\phi(X_i)\phi^\top(X_i)-E_{\mathcal F_{i-1}}\pi_{i-1}(a,X)\phi(X)\phi^\top(X)\big\}\Big\|_2 \lesssim \sqrt{qk\log j_n},\qquad \forall 1\le k<j_n, \tag{61}$$
with probability at least $1-O(j_n^{-1})$. Without loss of generality, assume $\varepsilon_0\le 1/2$. Notice that $\pi_{i-1}(a,x)\ge\varepsilon_0$ for any $a\in\{0,1\}$, $x\in\mathcal X$ and $i\ge N_0$. This together with Lemma 4 implies that
$$\inf_{a\in\{0,1\},\,n\ge j_n}\lambda_{\min}\Big(E_{\mathcal F_{i-1}}\Big[\phi(X)\phi^\top(X)\frac1n\sum_{i=1}^n\pi_{i-1}(a,X)\Big]\Big) \ge \frac{n-N_0}{n}\,\varepsilon_0 \ge \frac{j_n-N_0}{j_n}\,\varepsilon_0.$$
Combining this together with equation 60 and equation 61 yields
$$\min_{a\in\{0,1\}}\lambda_{\min}\big[\widehat\Sigma_{a,k}\big]\ge \frac{\varepsilon_0}{2},\qquad \forall k\ge L^* q\log j_n,$$
for some constant $L^*\ge 1$, with probability at least $1-O(j_n^{-1})$. This together with equation 58 and equation 59 yields that
$$\max_{a\in\{0,1\}}\big\|\widehat\beta_{a,k}-\beta_a\big\|_2 \lesssim q^{1/2}k^{-1/2}\log\max(k,j_n),\qquad \forall k\ge L^* q\log j_n,$$
with probability at least $1-O(j_n^{-1})$.
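As a numerical illustration of this convergence under adaptive allocation, a minimal simulation sketch: running least squares per arm with an $\varepsilon$-greedy treatment rule. The feature map, coefficients, exploration rate, and regularization below are all illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
eps_explore, q, n = 0.1, 3, 4000
beta1 = np.array([0.5, 1.0, -0.5])
beta0 = np.zeros(q)

# running least squares per arm, with epsilon-greedy treatment allocation
XtX = [1e-6 * np.eye(q), 1e-6 * np.eye(q)]   # tiny ridge guards the first few solves
Xty = [np.zeros(q), np.zeros(q)]
for i in range(n):
    x = rng.standard_normal(q)
    b1 = np.linalg.solve(XtX[1], Xty[1])
    b0 = np.linalg.solve(XtX[0], Xty[0])
    greedy = int(x @ (b1 - b0) > 0)          # plug-in greedy arm
    a = greedy if rng.random() > eps_explore else int(rng.integers(2))
    y = x @ (beta1 if a == 1 else beta0) + rng.standard_normal()
    XtX[a] += np.outer(x, x)
    Xty[a] += y * x

err = max(np.linalg.norm(np.linalg.solve(XtX[1], Xty[1]) - beta1),
          np.linalg.norm(np.linalg.solve(XtX[0], Xty[0]) - beta0))
```

Consistency here is driven by the exploration floor on the allocation probabilities, mirroring the role of $\varepsilon_0$ in lower-bounding $\lambda_{\min}[\widehat\Sigma_{a,k}]$ above.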
By Condition (A3), we have
$$\big|\phi^\top(X)(\widehat\beta_{1,k}-\widehat\beta_{0,k}-\beta_1+\beta_0)\big| \le Lqk^{-1/2}\log^{1/2}\max(k,j_n),\qquad \forall k\ge L^* q\log j_n, \tag{62}$$
for some constant $L>0$, with probability at least $1-O(j_n^{-1})$. For any $z_1,z_2\in\mathbb R$, we have $I(z_1>0)\ne I(z_2>0)$ only when $|z_1-z_2|\ge |z_2|$. Hence, under the event defined in equation 62, the event $I\{\phi^\top(X)(\widehat\beta_{1,k}-\widehat\beta_{0,k})>0\}\ne I\{\phi^\top(X)(\beta_1-\beta_0)>0\}$ occurs only when
$$\big|\phi^\top(X)(\beta_1-\beta_0)\big| \le \big|\phi^\top(X)(\widehat\beta_{1,k}-\widehat\beta_{0,k}-\beta_1+\beta_0)\big| \le Lqk^{-1/2}\log^{1/2}\max(k,j_n),$$
for any $k\ge j_n$. Under the given conditions, we have
$$\Pr\Big(\big|\phi^\top(X)(\beta_1-\beta_0)\big|\le Lqk^{-1/2}\log^{1/2}\max(k,j_n)\Big) \le LL_0\, qk^{-1/2}\log^{1/2}\max(k,j_n). \tag{63}$$
Notice that when $I\{\phi^\top(X)(\widehat\beta_{1,k}-\widehat\beta_{0,k})>0\}=I\{\phi^\top(X)(\beta_1-\beta_0)>0\}$, we have $\widehat\pi_k(a,X)=\pi^*(a,X)$. Thus, we obtain $\widehat\pi_k(a,X)=\pi^*(a,X)$ if $|\phi^\top(X)(\beta_1-\beta_0)|>Lqk^{-1/2}\log^{1/2}\max(k,j_n)$, for any $k\ge L^*\sqrt q\log j_n$. Set $k_0=L^*\sqrt q\log j_n$. By equation 62 and equation 63, we have with probability at least $1-O(j_n^{-1})$ that
$$\sum_{i=k_0}^{k}\Pr\Big(\big|\phi^\top(X)(\beta_1-\beta_0)\big|\le Lqi^{-1/2}\log^{1/2} i\Big) \lesssim qk^{1/2}\log^{1/2} k,\qquad \forall k\ge j_n.$$
The proof is hence completed.

F COMPARISON WITH THE BASELINE

Consider our test statistic $\widehat S(t)$. Under $H_0$, it can be bounded from above by
$$\sup_{x\in\mathcal X}\phi^\top(x)\,\frac{1}{N(t)}\sum_{i=1}^{N(t)}\Big[I(A_i=1)\widehat\Sigma_1^{-1}(t)\phi(X_i)\{Y_i-\phi^\top(X_i)\beta_1^*\}-I(A_i=0)\widehat\Sigma_0^{-1}(t)\phi(X_i)\{Y_i-\phi^\top(X_i)\beta_0^*\}\Big].$$
The sum above is asymptotically equivalent to
$$\frac{1}{N(t)}\sum_{i=1}^{N(t)}\Big[I(A_i=1)\Sigma_1^{-1}\phi(X_i)\{Y_i-\phi^\top(X_i)\beta_1^*\}-I(A_i=0)\Sigma_0^{-1}\phi(X_i)\{Y_i-\phi^\top(X_i)\beta_0^*\}\Big].$$
By the law of the iterated logarithm, each dimension of the above expression can be upper bounded by $N^{-1/2}(t)\sqrt{2\sigma^2\log\log N(t)}$, where $\sigma^2$ can be consistently estimated by
$$\frac{1}{N(t)}\sum_{i=1}^{N(t)}\Big\|I(A_i=1)\widehat\Sigma_1^{-1}(t)\phi(X_i)\{Y_i-\phi^\top(X_i)\widehat\beta_1(t)\}-I(A_i=0)\widehat\Sigma_0^{-1}(t)\phi(X_i)\{Y_i-\phi^\top(X_i)\widehat\beta_0(t)\}\Big\|_2^2.$$
As such, the finite error bound is given by
$$\sup_{x\in\mathcal X}\|\phi(x)\|_2\sqrt{\frac{2\log\log N(t)}{N(t)}\times\frac{1}{N(t)}\sum_{i=1}^{N(t)}\Big\|I(A_i=1)\widehat\Sigma_1^{-1}(t)\phi(X_i)\{Y_i-\phi^\top(X_i)\widehat\beta_1(t)\}-I(A_i=0)\widehat\Sigma_0^{-1}(t)\phi(X_i)\{Y_i-\phi^\top(X_i)\widehat\beta_0(t)\}\Big\|_2^2}.$$

G ADDITIONAL TABLES AND FIGURES



https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=49



Figure 1: Plots demonstrating QTE. X denotes the observed covariates, A denotes the received treatment, and Y denotes the outcome.

Figure 2: Rejection probabilities and average stopping times of the proposed test when α_1(·) is chosen as the spending function. From left to right: Scenario 1 with random design, Scenario 1 with ε-greedy design, Scenario 2 with random design, and Scenario 2 with ε-greedy design.

Figure 3: Critical values and test statistics.

Figure 4: Rejection probabilities and average stopping times of the proposed test when α_1(·) is chosen as the spending function.

$$\sum_i E_{\mathcal F_{i-1}}\big|\pi_{i-1}(a,X)-\pi^*(a,X)\big| \le 2L^*\sqrt q\log j_n + \sum_i E_{\mathcal F_{i-1}}\big|\pi_{i-1}(a,X)-\pi^*(a,X)\big|\, I\big\{|\phi^\top(X)(\beta_1-\beta_0)|>Lqi^{-1/2}\log^{1/2} i\big\} + \sum_i E_{\mathcal F_{i-1}}\big|\pi_{i-1}(a,X)-\pi^*(a,X)\big|\, I\big\{|\phi^\top(X)(\beta_1-\beta_0)|\le Lqi^{-1/2}\log^{1/2} i\big\} \le 2L^*\sqrt q\log j_n + \cdots$$

Figure 5: Alpha spending functions when θ = 0.5, γ = 1.0.
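The curves in Figure 5 are alpha-spending functions. As a sketch of two standard families in which parameters named θ and γ play the roles shown (the exact form of the paper's α_1 may differ; this is illustrative), together with the per-analysis increments spent across K interim analyses:

```python
import math

ALPHA = 0.05  # overall type-I error budget (illustrative)

def power_spending(t, alpha=ALPHA, theta=0.5):
    """Power family: spend alpha * t**theta of the budget by information fraction t."""
    return alpha * t ** theta

def hsd_spending(t, alpha=ALPHA, gamma=1.0):
    """Hwang-Shih-DeCani family: alpha * (1 - exp(-gamma*t)) / (1 - exp(-gamma))."""
    return alpha * (1.0 - math.exp(-gamma * t)) / (1.0 - math.exp(-gamma))

# alpha spent at each of K equally spaced interim analyses (telescoping increments)
K = 5
increments = [power_spending((k + 1) / K) - power_spending(k / K) for k in range(K)]
```

Both families satisfy α(0) = 0 and α(1) = α, so the per-analysis increments always sum to the overall type-I budget, which is what guarantees the sequential procedure never spends more than α in total.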

terminate the experiment.

Algorithm 1: Pseudocode summarizing the online bootstrap testing procedure.
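As a control-flow sketch only of such a sequential procedure: a heavily simplified one-dimensional loop in which each interim analysis compares a normalized mean to a multiplier-bootstrap critical value and stops early on rejection. This is not Algorithm 1 itself (which bootstraps the max over all interim stages jointly under the chosen spending function); every name, constant, and the naive per-stage spending below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, K, n_stage, B = 0.05, 5, 200, 1000
data = np.empty(0)
rejected, stop_stage = False, K

for k in range(1, K + 1):
    # observe a new batch from the data stream (true mean 0.3, i.e. the alternative holds)
    data = np.append(data, 0.3 + rng.standard_normal(n_stage))
    S = np.sqrt(len(data)) * data.mean()          # normalized test statistic

    # multiplier-bootstrap critical value at the alpha spent so far (naive linear spending)
    centered = data - data.mean()
    boot = np.array([rng.standard_normal(len(data)) @ centered / np.sqrt(len(data))
                     for _ in range(B)])
    crit = np.quantile(boot, 1.0 - alpha * k / K)

    if S > crit:
        rejected, stop_stage = True, k
        break                                     # terminate the experiment early
```

The early-stopping branch is the point of the design: under a strong signal the loop typically exits at the first interim analysis, saving the remaining stages' samples.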





QTE: rejection probabilities (multiplied by 100) and average stopping times under Scenarios 1 and 2 when α_1(·) is chosen as the spending function. Standard errors are reported in parentheses.

ATE: rejection probabilities (multiplied by 100) and average stopping times under Scenarios 1 and 2 when α_1(·) is chosen as the spending function. Standard errors are reported in parentheses.

