ONLINE TESTING OF SUBGROUP TREATMENT EFFECTS BASED ON VALUE DIFFERENCE

Anonymous authors
Paper under double-blind review

Abstract

Online A/B testing plays a critical role in the high-tech industry in guiding product development and accelerating innovation. It performs a null hypothesis statistical test to determine which variant is better. However, a typical A/B test presents two problems: (i) a fixed-horizon framework inflates the false positive rate under continuous monitoring; (ii) the homogeneous effects assumption fails to identify a subgroup with a beneficial treatment effect. In this paper, we propose a sequential test for subgroup treatment effects based on value difference, named SUBTLE, to address these two problems simultaneously. SUBTLE allows experimenters to "peek" at the results during the experiment without harming the statistical guarantees. It assumes heterogeneous treatment effects and aims to test whether some subgroup of the population will benefit from the investigative treatment. If the test indicates that such a subgroup exists, the subgroup is identified using a readily available estimated optimal treatment rule. We examine the empirical performance of our proposed test on both simulations and a real data set. The results show that SUBTLE has high detection power with controlled type I error at any time, is more robust to noise covariates, and can achieve early stopping compared with the corresponding fixed-horizon test.

1. INTRODUCTION

Online A/B testing, a form of randomized controlled experiment, is widely used in the high-tech industry to assess the value of ideas in a scientific manner (Kohavi et al., 2009). It randomly exposes users to one of two variants: control (A), the currently used version, or treatment (B), a new version being evaluated, and collects the metric of interest, such as conversion rate or revenue. Then, a null hypothesis statistical test is performed to evaluate whether there is a statistically significant difference between the two variants on the metric of interest. This design helps to control for external variation and thus to establish causality between the variants and the outcome. However, current A/B testing has limitations in terms of its framework and model assumptions.

First, most A/B tests employ a fixed-horizon framework, whose validity requires that the sample size be fixed and determined before the experiment starts. However, experimenters, driven by fast-paced product evolution, often "peek" at the experiment and hope to find significance as quickly as possible to avoid large (i) time cost: an A/B test may take a prohibitively long time to collect the predetermined number of samples; and (ii) opportunity cost: the users who have been assigned to a suboptimal variant will be stuck in a bad experience for a long time (Ju et al., 2019). Continuously monitoring the experiment and concluding it prematurely biases the outcome toward significant results and leads to very high false positive probabilities, well in excess of the nominal significance level α (Goodson, 2014; Simmons et al., 2011). Another limitation of A/B tests is that they assume homogeneous treatment effects among the population and mainly focus on testing the average treatment effect. However, it is common that treatment effects vary across sub-populations.
Testing the subgroup treatment effects will help decision makers distinguish the sub-population that may benefit from a particular treatment from those who may not, and thereby guide companies' marketing strategies in promoting new products.

The first problem can be addressed by applying the sequential testing framework. Sequential testing, in contrast to the classic fixed-horizon test, is a statistical testing procedure that continuously checks for significance at every new sample and stops the test as soon as a significant result is detected, while controlling the type I error at any time. It generally gives a significant decrease in the required sample size compared to the fixed-horizon test with the same type I and type II error control, and is thus able to end an experiment much earlier. This field was introduced by Wald (1945), who proposed the sequential probability ratio test (SPRT) for simple hypotheses using the likelihood ratio as the test statistic, and it was then extended to composite hypotheses in much subsequent work (Schwarz, 1962; Armitage et al., 1969; Cox, 1963; Robbins, 1970; Lai, 1988). A thorough review is given in Lai (2001). However, the advantage of sequential testing in online A/B testing was not recognized until recently, when Johari et al. (2015) brought the mSPRT, a variant of the SPRT, to A/B tests.

The second problem shows a demand for a test on subgroup treatment effects. Although sequential testing is rapidly developing in online A/B testing, little work focuses on subgroup treatment effect testing. Yu et al. (2020) proposed a sequential score test (SST) based on score statistics under a generalized linear model, which aims to test whether there is a difference between the treatment and control groups among any subjects. However, this test is based on a restrictive parametric assumption on the treatment-covariate interaction and cannot be used to test subgroup treatment effects.
In this paper, we consider a flexible model and propose a sequential test for SUBgroup Treatment effects based on vaLuE difference (SUBTLE), which aims to test whether some group of the population would benefit from the investigative treatment. Our method does not require specifying any parametric form of the covariate-specific treatment effects. If the null hypothesis is rejected, a beneficial subgroup can be easily obtained from the estimated optimal treatment rule.

The remainder of this paper is structured as follows. In Section 2, we review the ideas of the mSPRT and SST and discuss how they are related to our test. In Section 3, we introduce our proposed method SUBTLE and provide the theoretical guarantee for its validity. We conduct simulations in Section 4 and real data experiments in Section 5 to demonstrate the validity, detection power, robustness and efficiency of our proposed test. Finally, in Section 6, we conclude the paper and present future directions.

2.1. MIXTURE SEQUENTIAL PROBABILITY RATIO TEST

The mixture sequential probability ratio test (mSPRT) (Robbins, 1970) supposes that the independent and identically distributed (i.i.d.) random variables $Y_1, Y_2, \cdots$ have a probability density function $f_\theta(x)$ indexed by the parameter $\theta$, and aims to test

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \neq \theta_0. \tag{1}$$

Its test statistic $\Lambda_n^\pi$ at sample size $n$ is a mixture of likelihood ratios:

$$\Lambda_n^\pi = \int_\Theta \prod_{i=1}^n \frac{f_\theta(Y_i)}{f_{\theta_0}(Y_i)} \, \pi(\theta) \, d\theta, \tag{2}$$

with a mixture density $\pi(\cdot)$ over the parameter space $\Theta$. The mSPRT stops sampling at the stage

$$N = \inf\{n \geq 1 : \Lambda_n^\pi \geq 1/\alpha\} \tag{3}$$

and rejects the null hypothesis $H_0$ in favor of $H_1$. If no such time exists, it continues sampling indefinitely and accepts $H_0$. Since the likelihood ratio under $H_0$ is a nonnegative martingale with initial value equal to 1, and so is the mixture of such likelihood ratios $\Lambda_n^\pi$, the type I error of the mSPRT can be proved to be always controlled at $\alpha$ by an application of Markov's inequality and the optional stopping theorem:

$$P_{H_0}(\Lambda_n^\pi \geq \alpha^{-1}) \leq \frac{E_{H_0}[\Lambda_n^\pi]}{\alpha^{-1}} = \frac{E_{H_0}[\Lambda_0^\pi]}{\alpha^{-1}} = \alpha.$$

Moreover, the mSPRT is a test of power one (Robbins & Siegmund, 1974), which means that any small deviation from $\theta_0$ can be detected if one waits long enough. It has also been shown that the mSPRT is almost optimal for data from an exponential family of distributions, with respect to the expected stopping time (Pollak, 1978).

The mSPRT was brought to A/B testing by Johari et al. (2015; 2017), who assume that the observations in the control ($A = 0$) and treatment ($A = 1$) groups arrive in pairs $(Y_i^{(0)}, Y_i^{(1)})$, $i = 1, 2, \cdots$. They restricted their data model to the two most common cases in practice, the normal distribution and the Bernoulli distribution, with $\mu_A$ and $\mu_B$ denoting the means of the control and treatment groups, respectively. They test the hypothesis

$$H_0: \theta := \mu_B - \mu_A = 0 \quad \text{vs.} \quad H_1: \theta \neq 0,$$

by directly applying the mSPRT to the distribution of the differences $Z_i = Y_i^{(1)} - Y_i^{(0)}$ (normal), or to the joint distribution of the data pairs $(Y_i^{(0)}, Y_i^{(1)})$ (Bernoulli), $i = 1, 2, \cdots$. After making some approximations to the likelihood ratio and choosing a normal mixture density $\pi(\theta) \sim N(0, \tau^2)$, the test statistic $\Lambda_n^\pi$ has a closed form for both normal and Bernoulli observations.

However, the mSPRT does not work well for testing heterogeneous treatment effects, due to the complexity of the likelihood induced by individual covariates. Specifically, a conjugate prior $\pi(\cdot)$ for the likelihood ratio may no longer exist, so computing the test statistic is challenging. The unknown baseline covariate effect also increases the difficulty of constructing and approximating the likelihood ratios (Yu et al., 2020).
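The stopping rule (3) can be sketched for the simplest case. The block below is a minimal illustration, not the implementation of Johari et al.: it assumes i.i.d. normal observations with known variance `sigma2` and a normal mixture density centered at `theta0` with variance `tau2`, for which the mixture integral evaluates to $\Lambda_n^\pi = \sqrt{\sigma^2/(\sigma^2 + n\tau^2)} \exp\{n^2\tau^2(\bar{Y}_n - \theta_0)^2 / (2\sigma^2(\sigma^2 + n\tau^2))\}$. The function name `msprt_normal` and its parameters are hypothetical names chosen for this sketch.

```python
import math


def msprt_normal(ys, theta0=0.0, sigma2=1.0, tau2=1.0, alpha=0.05):
    """Run a normal-mixture mSPRT over the stream ys.

    Assumes Y_i ~ N(theta, sigma2) with known sigma2 and the mixture
    density N(theta0, tau2). Returns (stopping index or None, last
    log-statistic). Working in log space avoids overflow in exp().
    """
    threshold = math.log(1.0 / alpha)
    total = 0.0
    log_lam = 0.0
    for n, y in enumerate(ys, start=1):
        total += y
        dev = total / n - theta0          # running mean minus null value
        denom = sigma2 + n * tau2
        log_lam = 0.5 * math.log(sigma2 / denom) + (
            n * n * tau2 * dev * dev) / (2.0 * sigma2 * denom)
        if log_lam >= threshold:          # Lambda_n >= 1/alpha: reject H0
            return n, log_lam
    return None, log_lam                  # threshold never reached


# Usage: a constant stream at 1.0 is rejected quickly, zeros never are.
stop_alt, _ = msprt_normal([1.0] * 200)
stop_null, _ = msprt_normal([0.0] * 200)
```

Because the statistic is evaluated after every observation, the experimenter can stop at the first crossing of $1/\alpha$ without inflating the type I error, which is exactly what a fixed-horizon test does not allow.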

2.2. SEQUENTIAL SCORE TEST

The sequential score test (SST) (Yu et al., 2020) assumes a generalized linear model with a link function $g(\cdot)$ for the outcome $Y$,

$$g(E[Y \mid A, X]) = \mu^T X + (\theta^T X) A, \tag{5}$$

where $A$ and $X$ denote the binary treatment indicator and the user covariate vector, and tests the multidimensional treatment-covariate interaction effect

$$H_0: \theta = 0 \quad \text{vs.} \quad H_1: \theta \neq 0, \tag{6}$$

while accounting for the linear baseline covariate effect $\mu^T X$. For the test statistic $\Lambda_n^\pi$, instead of using a mixture of likelihood ratios as in the mSPRT, the SST employs a mixture of asymptotic probability ratios of a score statistic. Since the probability ratio has the same martingale structure as the likelihood ratio, the type I error can still be controlled with the same decision rule as the mSPRT (3). The asymptotic normality of the score statistic also guarantees a closed form of $\Lambda_n^\pi$ with a multivariate normal mixture density $\pi(\cdot)$.

However, the parametric model (5) can only be used to test whether there are linear covariate-treatment interaction effects, and it may fail to detect the existence of a subgroup with enhanced treatment effects. In addition, the subgroup estimated from the index $\theta^T X_i$ may be biased if the assumed linear model (5) is misspecified. Therefore, in this paper we propose a subgroup treatment effect test that is able to test for the existence of a beneficial subgroup and does not require specifying the form of the treatment effects.

3. SUBGROUP TREATMENT EFFECTS TEST BASED ON VALUE DIFFERENCE

3.1. PROBLEM SETUP

Suppose we have i.i.d. data $O_i = \{Y_i, A_i, X_i\}$, $i = 1, 2, \cdots$, where $Y_i$, $A_i$, $X_i$ respectively denote the observed outcome, the binary treatment indicator, and the $p$-dimensional user covariate vector. Here, we consider a flexible generalized linear model:

$$g(E[Y_i \mid A_i, X_i]) = \mu(X_i) + \theta(X_i) A_i, \tag{7}$$

where the baseline covariate effect $\mu(\cdot)$ and the treatment-covariate interaction effect $\theta(\cdot)$ are completely unspecified functions, and $g(\cdot)$ is a prespecified link function. For example, we use the identity link $g(\mu) = \mu$ for a normal response and the logit link $g(\mu) = \log\frac{\mu}{1-\mu}$ for binary data.

Assuming $Y$ is coded such that larger values indicate a better outcome, we consider the following test of subgroup treatment effects:

$$H_0: \forall x \in \mathcal{X},\ \theta(x) \leq 0 \quad \text{vs.} \quad H_1: \exists\, \mathcal{X}_0 \subset \mathcal{X} \text{ such that } \theta(x) > 0 \text{ for all } x \in \mathcal{X}_0, \tag{8}$$

where $\mathcal{X}_0$ is the beneficial subgroup with $P(X \in \mathcal{X}_0) > 0$. Note that the above subgroup test is very different from the covariate-treatment interaction test considered in (6) and is much more challenging in several respects. First, both $\mu(\cdot)$ and $\theta(\cdot)$ are nonparametric and need to be estimated. Second, the considered hypotheses are moment inequalities, which are nonstandard. Third, it allows the nonregular setting, i.e. $P\{\theta(X) = 0\} > 0$, which makes the associated inference difficult.

Here, we propose a test based on the value difference between the optimal treatment rule and a fixed treatment rule. Let $V(d) = E_{(Y^*(a), X)}\{Y^*(d(X))\}$ denote the value function of a treatment decision rule, where $Y^*(d(X))$ is the potential outcome if treatment were allocated according to the fixed treatment decision rule $d(X)$, which maps the information in $X$ to a treatment in $\{0, 1\}$. Consider the value difference $\Delta = V(d^{opt}) - V(0)$ between the optimal treatment rule $d^{opt} = 1\{\theta(X) > 0\}$ and the treatment rule $d = 0$ that assigns control to everyone, where $1\{\cdot\}$ is an indicator function.
If the null hypothesis is true, no one would benefit from the treatment and the optimal treatment rule assigns everyone to control, so the value difference is zero. However, if the alternative hypothesis is true, some people would have higher outcomes when assigned to treatment, and thus the value difference is positive. In this way the testing hypotheses (8) can be equivalently transformed into the following pair:

$$H_0: \Delta = 0 \quad \text{vs.} \quad H_1: \Delta > 0.$$

We make the following standard causal inference assumptions: (i) consistency, which states that the observed outcome is equal to the potential outcome under the actual treatment received, i.e. $Y = Y^*(1) I(A = 1) + Y^*(0) I(A = 0)$; (ii) no unmeasured confounders, i.e. $Y^*(a) \perp A \mid X$, which means the potential outcome is independent of treatment given the covariates; (iii) positivity, i.e. $P(A = a \mid X = x) > 0$ for $a = 0, 1$ and all $x \in \mathcal{X}$ such that $P(X = x) > 0$. Under these assumptions, it can be shown that $V(d) = E_X\{E[Y \mid A = d(X), X]\}$.
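The equivalence between the two pairs of hypotheses can be written out in one step. The derivation below is a sketch that assumes the three causal assumptions above and a strictly increasing inverse link $g^{-1}$, which holds for the identity and logit links; it uses only the model (7) and the identity $V(d) = E_X\{E[Y \mid A = d(X), X]\}$.

```latex
\begin{align*}
\Delta &= V(d^{opt}) - V(0) \\
       &= E_X\left[ g^{-1}\big(\mu(X) + \theta(X)\,1\{\theta(X) > 0\}\big)
                  - g^{-1}\big(\mu(X)\big) \right] \\
       &= E_X\left[ \Big\{ g^{-1}\big(\mu(X) + \theta(X)\big)
                  - g^{-1}\big(\mu(X)\big) \Big\}\, 1\{\theta(X) > 0\} \right]
       \;\geq\; 0 .
\end{align*}
```

The integrand is strictly positive exactly on the set $\{\theta(X) > 0\}$, so $\Delta = 0$ when $P\{\theta(X) > 0\} = 0$ and $\Delta > 0$ otherwise. This is also why $\Delta$ can never be negative, a fact used later when choosing the mixture density.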

3.2. ALGORITHM AND IMPLEMENTATION

We take the augmented inverse probability weighted (AIPW) estimator (Robins et al., 1994; Zhang et al., 2012) for the value function of a given treatment rule $d$:

$$\hat{V}_{AIPW}(d) = \frac{1}{n} \sum_{i=1}^n \left[ \frac{Y_i \cdot 1\{A_i = d(X_i)\}}{p_{A_i}(X_i)} - \left( \frac{1\{A_i = d(X_i)\}}{p_{A_i}(X_i)} - 1 \right) \cdot E[Y_i \mid A_i = d(X_i), X_i] \right],$$

where $p_A(X) = A \cdot p(X) + (1 - A) \cdot (1 - p(X))$ and $p(X) = P(A = 1 \mid X)$ is the propensity score. This estimator is unbiased, i.e., $E_{(Y,A,X)}[\hat{V}_{AIPW}(d)] = V(d)$. Moreover, the most important property of the AIPW estimator is double robustness: the estimator remains consistent if either the estimator of $E[Y \mid A = d, X]$ or the estimator of the propensity score $p(X)$ is consistent, which gives much flexibility. The value difference $\Delta$ is then unbiasedly estimated by

$$D(O_i; \mu, \theta, p) := \left[ \frac{1\{A_i = 1(\theta(X_i) > 0)\}}{p_{A_i}(X_i)} Y_i - \left( \frac{1\{A_i = 1(\theta(X_i) > 0)\}}{p_{A_i}(X_i)} - 1 \right) g^{-1}\big(\mu(X_i) + \theta(X_i) 1(\theta(X_i) > 0)\big) \right] - \left[ \frac{1(A_i = 0)}{1 - p(X_i)} Y_i - \left( \frac{1(A_i = 0)}{1 - p(X_i)} - 1 \right) g^{-1}\big(\mu(X_i)\big) \right],$$

where $g^{-1}(\cdot)$ is the inverse of the link function. That is, $E_{(Y,A,X)}[D(O_i; \mu, \theta, p)] = \Delta$.

Since $\mu(\cdot)$, $\theta(\cdot)$ and $p(\cdot)$ are usually unknown, we let data come in batches and estimate them based on previous batches of data. Algorithm 1 shows our complete testing procedure. In step (ii) of Algorithm 1, we estimate $\mu(\cdot)$ and $\theta(\cdot)$ by building one random forest on the control observations and one on the treatment observations in previous batches. The propensity score $p(\cdot)$ is estimated by the proportion of treatment observations ($A = 1$) in previous batches. In step (iv) we estimate $\sigma_k$ with $\hat{\sigma}_k = \sqrt{s_k^2 / m}$, where $s_k^2$ is the sample variance of $D(O_i; \hat{\mu}_{k-1}, \hat{\theta}_{k-1}, \hat{p}_{k-1})$, $\forall O_i \in \bar{C}_{k-1}$. Note that $R_k$ (14) is a multiple of an asymptotically unbiased estimator for $\Delta$, which is defined as

$$\hat{\Delta}_k := \frac{\sum_{j=1}^k \hat{\sigma}_j^{-1} \bar{D}_j(C_j; \bar{C}_{j-1})}{\sum_{j=1}^k \hat{\sigma}_j^{-1}}. \tag{11}$$

Algorithm 1: Subgroup treatment effects sequential test based on value difference

1. Initialize $k = 0$, $\Lambda_k^\pi = 0$. Choose a significance level $0 < \alpha < 1$, a batch size $m$, an initial batch size $l$, and a failure time $M$.
2. Sample $l$ observations to form the initial batch $C_0$.
while True do
    (i) $k = k + 1$;
    (ii) Let $\bar{C}_{k-1} = \cup_{j=0}^{k-1} C_j$. Estimate $\mu(\cdot)$, $\theta(\cdot)$ and $p(\cdot)$ based on the data in $\bar{C}_{k-1}$ to get $\hat{\mu}_{k-1}$, $\hat{\theta}_{k-1}$ and $\hat{p}_{k-1}$;
    (iii) Sample another $m$ observations to form batch $C_k$. For each $O_i \in C_k$, calculate $D(O_i; \hat{\mu}_{k-1}, \hat{\theta}_{k-1}, \hat{p}_{k-1})$. Let
          $$\bar{D}_k(C_k; \bar{C}_{k-1}) = \frac{1}{m} \sum_{O_i \in C_k} D(O_i; \hat{\mu}_{k-1}, \hat{\theta}_{k-1}, \hat{p}_{k-1});$$
    (iv) Estimate the conditional standard deviation $\sigma_k = sd\big(\bar{D}_k(C_k; \bar{C}_{k-1}) \mid \bar{C}_{k-1}\big)$ based on the data in $\bar{C}_{k-1}$ and denote it by $\hat{\sigma}_k$;
    (v) Calculate
          $$R_k = \frac{1}{\sqrt{k}} \sum_{j=1}^k \hat{\sigma}_j^{-1} \bar{D}_j(C_j; \bar{C}_{j-1}) \tag{14}$$
        and
          $$\Lambda_k^\pi = \int \frac{\psi_{\left(\frac{1}{\sqrt{k}} \left(\sum_{j=1}^k \hat{\sigma}_j^{-1}\right) \Delta,\ 1\right)}(R_k)}{\psi_{(0, 1)}(R_k)} \, \pi(\Delta) \, d\Delta, \tag{15}$$
        where $\psi_{(\mu, \sigma^2)}(\cdot)$ denotes the probability density function of a normal distribution with mean $\mu$ and variance $\sigma^2$;
    if $\Lambda_k^\pi > 1/\alpha$ or $k \times m + l > M$ then break;
end
if $\Lambda_k^\pi > 1/\alpha$ then
    Reject $H_0$. Estimate $\theta(\cdot)$ using all the data up to now and identify the subgroup $1\{\hat{\theta}(X) > 0\}$;
else
    Accept $H_0$;
end

In Section 3.3 we will show that $R_k$ has an asymptotic normal distribution with the same variance but different means under the null and local alternatives, so that our test statistic $\Lambda_k^\pi$ (15) is a mixture of asymptotic probability ratios of $R_k$. Since the value difference is always non-negative, we choose a truncated normal

$$\pi(\Delta) = \frac{2}{\sqrt{2\pi\tau^2}} \exp\left\{ -\frac{\Delta^2}{2\tau^2} \right\} \cdot 1(\Delta > 0)$$

as the mixture density, where $\tau^2$ is estimated from historical data. The simulation result in Appendix A.2.1 shows considerable robustness to the choice of $\tau^2$. Our test statistic now has a closed form:

$$\Lambda_k^\pi = 2 \left[ \frac{k}{k + \left(\tau \sum_{j=1}^k \hat{\sigma}_j^{-1}\right)^2} \right]^{1/2} \times \exp\left\{ \frac{\left(\tau \sum_{j=1}^k \hat{\sigma}_j^{-1} \cdot R_k\right)^2}{2 \left[ \left(\tau \sum_{j=1}^k \hat{\sigma}_j^{-1}\right)^2 + k \right]} \right\} \times [1 - F(0)],$$

where $F(\cdot)$ is the cumulative distribution function of a normal distribution with mean $\frac{\tau^2 \sqrt{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \cdot R_k}{\tau^2 \left(\sum_{j=1}^k \hat{\sigma}_j^{-1}\right)^2 + k}$ and variance $\frac{k\tau^2}{\tau^2 \left(\sum_{j=1}^k \hat{\sigma}_j^{-1}\right)^2 + k}$.
If the null hypothesis is rejected, we can employ random forests to estimate $\theta(\cdot)$ based on all the data collected up to the time the experiment ends. The estimated optimal treatment rule $1\{\hat{\theta}(x) > 0\}$ then naturally gives the estimated beneficial subgroup $\hat{\mathcal{X}}_0 = \{x : \hat{\theta}(x) > 0\}$.
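The two computational pieces of this section, the per-observation AIPW contrast $D$ and the closed-form statistic $\Lambda_k^\pi$, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the nuisance predictions (`mu0_hat` for $g^{-1}(\hat\mu(X))$, `mu_opt_hat` for $g^{-1}(\hat\mu(X) + \hat\theta(X) d)$, and the rule value `d_opt`) are assumed to be supplied by whatever learner is used, and all function and argument names are hypothetical. The closed form is the Gaussian integral of the ratio in (15) against the truncated normal mixture density.

```python
import math


def normal_cdf(x, mean, var):
    """CDF of N(mean, var) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def aipw_contrast(y, a, p, mu0_hat, mu_opt_hat, d_opt):
    """AIPW contrast D for one observation (hypothetical helper).

    y: outcome; a: treatment received (0/1); p: P(A=1|X);
    mu0_hat: predicted mean outcome under control;
    mu_opt_hat: predicted mean outcome under the rule's treatment d_opt;
    d_opt: value of the estimated rule 1{theta_hat(X) > 0}.
    """
    p_a = p if a == 1 else 1.0 - p
    w_opt = (1.0 if a == d_opt else 0.0) / p_a      # weight for the optimal rule
    w_ctl = (1.0 if a == 0 else 0.0) / (1.0 - p)    # weight for all-control rule
    v_opt = w_opt * y - (w_opt - 1.0) * mu_opt_hat
    v_ctl = w_ctl * y - (w_ctl - 1.0) * mu0_hat
    return v_opt - v_ctl


def subtle_statistic(batch_means, batch_sds, tau2=1.0):
    """Return (R_k, Lambda_k) from batch means D_bar_j and estimates sigma_hat_j."""
    k = len(batch_means)
    s = sum(1.0 / sd for sd in batch_sds)           # sum of sigma_hat_j^{-1}
    r_k = sum(d / sd for d, sd in zip(batch_means, batch_sds)) / math.sqrt(k)
    ts2 = tau2 * s * s                              # (tau * sum)^2
    scale = 2.0 * math.sqrt(k / (k + ts2))
    expo = tau2 * (s * r_k) ** 2 / (2.0 * (ts2 + k))
    mean = tau2 * math.sqrt(k) * s * r_k / (ts2 + k)
    var = k * tau2 / (ts2 + k)
    lam = scale * math.exp(expo) * (1.0 - normal_cdf(0.0, mean, var))
    return r_k, lam


# Usage: three batches with positive mean contrasts.
r_k, lam = subtle_statistic([0.2, 0.3, 0.1], [0.5, 0.5, 0.5], tau2=1.0)
reject = lam > 1.0 / 0.05
```

When the estimated rule assigns control (`d_opt = 0`) and `mu_opt_hat` equals `mu0_hat`, the contrast is exactly zero, which mirrors the fact that $\Delta = 0$ under the null.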

3.3. VALIDITY

In this section, we show that our proposed test SUBTLE is able to control the type I error at any time, that is, $P_{H_0}(\Lambda_k^\pi > 1/\alpha) < \alpha$ for any $k \in \mathbb{N}$. As discussed in Section 2.1, if we can show that the ratio term in $\Lambda_k^\pi$ (15) has a martingale structure under $H_0$, it follows easily that the type I error is always controlled at $\alpha$. Theorem 3.1 gives the respective asymptotic distributions of $R_k$ under the null and the local alternative, which demonstrates that the test statistic $\Lambda_k^\pi$ is a mixture of asymptotic probability ratios weighted by $\pi(\cdot)$. Proposition 1 shows that this asymptotic probability ratio is a martingale when the sample size is large enough. Combining these two results with the argument in Section 2.1, we can conclude that the type I error of SUBTLE is always controlled at $\alpha$.

We assume the following conditions hold:

• (C1) $k$ diverges to infinity as the sample size $n$ diverges to infinity.
• (C2) Lindeberg-like condition: $\frac{1}{k} \sum_{j=1}^k E\left[ \left( \frac{\bar{D}_j(C_j; \bar{C}_{j-1})}{\hat{\sigma}_j} \right)^2 \cdot 1\left\{ \frac{|\bar{D}_j(C_j; \bar{C}_{j-1})|}{\hat{\sigma}_j} > \epsilon\sqrt{k} \right\} \,\Big|\, \bar{C}_{j-1} \right] = o_p(1)$ for all $\epsilon > 0$.
• (C3) $\frac{1}{k} \sum_{j=1}^k \frac{\sigma_j^2}{\hat{\sigma}_j^2} \xrightarrow{p} 1$.
• (C4) $\frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \left\{ E[\bar{D}_j(C_j; \bar{C}_{j-1}) \mid \bar{C}_{j-1}] - E[\bar{D}_j(C_j; \hat{d}^{opt}_{j-1}, \mu, \theta, p) \mid \bar{C}_{j-1}] \right\} = o_p(k^{-1/2})$.
• (C5) $\frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \left\{ E[\bar{D}_j(C_j; \hat{d}^{opt}_{j-1}, \mu, \theta, p) \mid \bar{C}_{j-1}] - \Delta \right\} = o_p(k^{-1/2})$.

Theorem 3.1 For $\hat{\Delta}_k$ defined in (11), under conditions (C1)-(C5),

$$\frac{1}{\sqrt{k}} \left( \sum_{j=1}^k \hat{\sigma}_j^{-1} \right) (\hat{\Delta}_k - \Delta) \xrightarrow{d} N(0, 1) \quad \text{as } k \to \infty,$$

where $\xrightarrow{d}$ represents convergence in distribution. In particular, as $k \to \infty$, $R_k \xrightarrow{d} N(0, 1)$ under the null hypothesis $\Delta = 0$, while $R_k - \frac{1}{\sqrt{k}} \sum_{j=1}^k \hat{\sigma}_j^{-1} \Delta \xrightarrow{d} N(0, 1)$ under the local alternative $\Delta = \frac{\delta}{\sqrt{k}}$, where $\delta > 0$ is fixed.

Proposition 1 Let

$$\lambda_k = \frac{\psi_{\left(\frac{1}{\sqrt{k}} \left(\sum_{j=1}^k \hat{\sigma}_j^{-1}\right) \Delta,\ 1\right)}(R_k)}{\psi_{(0, 1)}(R_k)},$$

and let $\mathcal{F}_k$ denote a filtration that contains all the historical information in the first $(k + 1)$ batches $\bar{C}_k$. Then under the null hypothesis $H_0: \Delta = 0$, $E[\lambda_{k+1} \mid \mathcal{F}_k]$ is approximately equal to $\lambda_k \cdot \exp\{o_p(1)\}$.

The proofs of the above results are given in Appendix A.1.

4. SIMULATED EXPERIMENTS

In this section, we evaluate the test SUBTLE on three metrics: type I error, power and sample size. We first compare SUBTLE with SST in terms of type I error and power under five models in Section 4.1. Then in Section 4.2, we present the impact of noise covariates on their powers. Finally in Section 4.3, we compare the stopping time of SUBTLE to the required sample size of a fixed-horizon value difference test. The significance level $\alpha = 0.05$, initial batch size $l = 300$, failure time $M = 2300$ and variance of the mixture distribution $\tau^2 = 1$ are fixed for all simulation settings.

4.1. TYPE I ERROR & POWER

We consider five data generation models of the form (7) with logistic link $g(\cdot)$. Data are generated in batches with batch size $m = 20$ and are randomly assigned to two groups with fixed propensity score $p(X) = 0.5$. Each experiment is repeated 1000 times to estimate the type I error and power. For the first four models, we choose

• Five covariates: $X_1 \overset{iid}{\sim} Ber(0.5)$, $X_2 \overset{iid}{\sim} Unif[-1, 1]$, $X_3, X_4, X_5 \overset{iid}{\sim} N(0, 1)$
• Two baseline effects: $\mu_1(X) = -2 - X_1 + X_3^2$, $\mu_2(X) = -1.3 + X_1 + 0.5 X_2 - X_3^2$
• Two treatment-covariate interaction effects: $\theta_1(X) = c \cdot 1\{X_1 + 2X_3 > 0\}$, $\theta_2(X) = c \cdot 1\{X_2 > 0 \text{ or } X_5 < -0.5\}$.

Table 1: The first four models

Model   Input covariates              µ(X)       θ(X)
I       X_1, X_3                      µ_1(X)     θ_1(X)
II      X_1, X_2, X_3, X_4, X_5       µ_2(X)     θ_2(X)
III     X_1, X_2, X_3, X_4, X_5       µ_1(X)     θ_2(X)
IV      X_1, X_2, X_3, X_4, X_5       µ_2(X)     θ_1(X)

Table 1 displays which covariates, $\mu(X)$ and $\theta(X)$ are employed in each model. For Model V, we consider the following high-dimensional setting:

$$X_r \overset{iid}{\sim} N(0.2r - 0.6,\ 1), \quad r = 1, 2, 3, 4, 5$$
$$X_r \overset{iid}{\sim} N(0.2r - 1.6,\ 2), \quad r = 6, 7, 8, 9, 10$$
$$X_r \overset{iid}{\sim} Unif[-0.5r + 5,\ 0.5r - 5], \quad r = 11, 12, 13$$
$$X_{14} \overset{iid}{\sim} Unif[-0.5,\ 1.5], \qquad X_{15} \overset{iid}{\sim} Unif[-1.5,\ 0.5]$$
$$X_r \overset{iid}{\sim} Ber(0.2r - 3.1), \quad r = 16, 17, 18, 19, 20$$
$$\mu(X) = -0.8 + X_{18} + 0.5 X_{12} - X_3^2$$
$$\theta(X) = c \cdot 1\{(X_{14} > -0.1)\ \&\ (X_{20} = 1)\},$$

where $c$ varies among $\{-1, 0, 0.6, 0.8, 1\}$, indicating the intensity of the value difference. When $c = -1$ or $0$, the null hypothesis is true and the type I error is estimated, while when $c = 0.6, 0.8, 1$, the alternative is true and the power is estimated.

Table 2 shows that SUBTLE is able to control the type I error and achieve competitive detection power, especially under the high-dimensional setting (Model V); however, SST could not control the type I error, especially when $c = -1$.
This can be explained by two things: (i) the linearity of model (5) is violated; (ii) SST tests whether there is a difference between the treatment and control groups among any subjects, rather than the existence of a beneficial subgroup. Specifically, SST tests whether the least false parameter $\theta^*$, to which the MLE of $\theta$ converges under model misspecification, is zero or not. We also perform experiments with batch size $m = 40$, and the results (shown in Appendix A.2.1) do not differ much.
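The data-generating process for Model I can be sketched as below. This is a minimal illustration assuming the specification in Table 1 (inputs $X_1$ and $X_3$, baseline $\mu_1$, interaction $\theta_1$, logit link, propensity 0.5); the function name `generate_model1_batch` and the tuple layout are hypothetical choices for this sketch, and the random number generator is passed in so that runs are reproducible.

```python
import math
import random


def generate_model1_batch(m, c, rng):
    """Draw one batch of m observations (y, a, (x1, x3)) from Model I.

    x1 ~ Ber(0.5), x3 ~ N(0, 1), a ~ Ber(0.5),
    logit P(y = 1) = mu_1(x) + theta_1(x) * a, with
    mu_1(x) = -2 - x1 + x3^2 and theta_1(x) = c * 1{x1 + 2*x3 > 0}.
    """
    batch = []
    for _ in range(m):
        x1 = 1 if rng.random() < 0.5 else 0
        x3 = rng.gauss(0.0, 1.0)
        a = 1 if rng.random() < 0.5 else 0      # fixed propensity 0.5
        mu = -2.0 - x1 + x3 ** 2
        theta = c if (x1 + 2.0 * x3 > 0) else 0.0
        prob = 1.0 / (1.0 + math.exp(-(mu + theta * a)))
        y = 1 if rng.random() < prob else 0
        batch.append((y, a, (x1, x3)))
    return batch


# Usage: one batch of size 20 under the alternative c = 0.8.
rng = random.Random(0)
batch = generate_model1_batch(20, 0.8, rng)
```

Setting `c` to -1 or 0 gives data under the null, since no subject then has a positive treatment effect.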

4.2. NOISE COVARIATES

It is common in practice that a large number of covariates are incorporated in the experiment whereas the actual outcome depends on only a few of them. Some covariates do not have any effect on the response, like $X_4$ in Models II, III and IV, and we call them noise covariates. In the following simulation, we explore the impact of noise covariates on the detection power. We choose Model I with $c = 0.8$ as the base model, and at each step add three noise covariates, drawn respectively from the normal $N(0, 1)$, uniform $Unif[-1, 1]$, and Bernoulli $Ber(0.5)$ distributions. The batch size is set to $m = 40$ for computational efficiency. Figure 1 shows that the power of SST decreases continuously as the number of noise covariates increases, while the power of SUBTLE is more robust to the noise covariates.

A key feature of a sequential test is that its expected sample size is smaller than that of a fixed-horizon test. For comparison, we consider a fixed-horizon version of SUBTLE, which leverages Theorem 3.1 and rejects the null hypothesis $H_0: \Delta = 0$ when $R_k > Z_\alpha$ for some predetermined $k$, where $Z_\alpha$ denotes the $(1 - \alpha)$ quantile of the standard normal distribution. We assume $\sigma^{-1} = \lim_{k \to \infty} \frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1}$; then the required number of batches $k$ can be calculated as

$$k = \frac{\sigma^2 (Z_\alpha + Z_{1-\text{power}})^2}{\Delta^2},$$

and thus the required sample size is $n = k \cdot m + l$. The true value difference $\Delta$ can be directly estimated from data generated under the true model and the two treatment rules, while $\sigma^2$ is estimated by the sample variance of $\hat{\Delta}_k$ (11) times $k$ for some fixed large $k$. Here, we choose Model V with $c = 1$ and batch size $m = 20$. The stopping sample sizes of our sequential SUBTLE over 1000 replicates are shown in Figure 2, and the dashed vertical line indicates the required sample size for the fixed-horizon SUBTLE with the same power 0.997 (seen from Table 2) under the same setting.
We find that most of the time our sequential SUBTLE reaches a decision earlier than the fixed-horizon version, but occasionally it can take longer. The distribution of the stopping time for sequential SUBTLE is right-skewed, which is in line with the findings in Johari et al. (2015) and Ju et al. (2019).
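The fixed-horizon sample size calculation can be sketched in a few lines. This is an illustrative sketch only: it reads $Z_q$ as the $(1 - q)$ quantile of the standard normal, so that $Z_{1-\text{power}}$ is the `power` quantile, and it rounds the number of batches up to an integer, which is an assumption of this sketch. The function name and arguments are hypothetical, and the inputs in the usage line are made-up numbers, not values from the paper.

```python
import math
from statistics import NormalDist


def fixed_horizon_sample_size(delta, sigma2, alpha, power, m, l):
    """Return (k, n): required batches and total sample size n = k*m + l."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)   # (1 - alpha) quantile
    z_power = std_normal.inv_cdf(power)         # Z_{1 - power} in the text
    k = math.ceil(sigma2 * (z_alpha + z_power) ** 2 / delta ** 2)
    return k, k * m + l


# Usage with illustrative inputs (not taken from the paper's experiments).
k_req, n_req = fixed_horizon_sample_size(
    delta=1.0, sigma2=1.0, alpha=0.05, power=0.8, m=20, l=300)
```

The formula makes the trade-off explicit: halving the value difference quadruples the required number of batches, which is why a sequential rule that can stop early is attractive when the true effect is larger than the planning value.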

5. REAL DATA EXPERIMENTS

We use a real Yahoo data set, which contains user click events on articles over 10 days, to examine the performance of SUBTLE. Each event has a timestamp, a unique article id (variant), a binary click indicator (response), and four independent user features (covariates). We choose the two articles with the highest click-through rates (id=109520 and id=109510) as control and treatment, respectively. We set the significance level $\alpha = 0.05$, the initial batch size and batch size $l = m = 200$, and the failure time $M = 50000$.

To demonstrate the false positive control of our method, we conduct an A/A test and a permutation test. For the A/A test, we only use data on article 109510 and randomly generate a fake treatment indicator. Our method accepts the null hypothesis. For the permutation test, we use the combined data from articles 109510 and 109520, and permute their responses 1000 times while leaving the treatment indicator and covariates unchanged. The estimated false positive rate is below the significance level.

Then we test whether there is any subgroup of users who would have a higher click rate on article 109510. In this experiment, SUBTLE rejects the null hypothesis with sample size $n = 12400$. We identify the beneficial subgroup $1\{\hat{\theta}(X) > 0\}$ by estimating $\theta(X)$ with a random forest on the first 12400 observations. To get a structured optimal treatment rule, we then build a classification tree on the same 12400 samples with the random forest estimator $1\{\hat{\theta}(X) > 0\}$ as the true labels. The resulting decision tree (shown in Appendix A.2.2) suggests that the users in the subgroup defined by $\{X_3 < 0.7094 \text{ or } (X_3 \geq 0.7094,\ X_1 \geq 0.0318 \text{ and } X_4 < 0.0003)\}$ benefit from treatment. We then use the 50000 samples after the first 12400 samples as a test data set, and compute the difference in click-through rates between articles 109510 and 109520 on the test data (overall treatment effect), and the same difference within the subgroup of the test data (subgroup treatment effect).
We found that the subgroup treatment effect, 0.009, is larger than the overall treatment effect, 0.006, which shows that the identified subgroup has a larger treatment effect than the overall population. We further compute the inverse probability weighted (IPW) estimator $\frac{1}{n} \sum_{i=1}^n \frac{1(A_i = d(X_i)) \cdot Y_i}{p_{A_i}(X_i)}$ on the test data for the values of two treatment rules: $d_1(X) = 0$, which assigns everyone to control, and the optimal treatment rule $d_2(X) = 1\{\hat{\theta}(X) > 0\}$ estimated by the random forest. Their IPW estimates are 0.043 and 0.049, respectively, which suggests that the estimated optimal treatment rule is better than the fixed rule that assigns all users to the control group. This implies that there exists a subgroup of the population that does benefit from article 109510.
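The IPW value estimate used above can be sketched as follows. This is a minimal illustration of the formula, not the code used for the Yahoo analysis: the function name `ipw_value`, the callable arguments `rule` and `propensity`, and the toy data in the usage example are all hypothetical.

```python
def ipw_value(ys, treatments, xs, rule, propensity):
    """IPW estimate of the value of a treatment rule.

    rule(x) returns the treatment (0/1) the rule assigns to covariates x;
    propensity(x) returns P(A = 1 | X = x). Only observations whose
    received treatment agrees with the rule contribute, each reweighted
    by the probability of having received that treatment.
    """
    total = 0.0
    for y, a, x in zip(ys, treatments, xs):
        if a == rule(x):
            p1 = propensity(x)
            p_a = p1 if a == 1 else 1.0 - p1
            total += y / p_a
    return total / len(ys)


# Usage on a toy data set with a single covariate and propensity 0.5.
ys = [1, 0, 1, 1]
treatments = [1, 0, 0, 1]
xs = [0.5, -0.5, 0.2, -0.1]
v_control = ipw_value(ys, treatments, xs, lambda x: 0, lambda x: 0.5)
v_rule = ipw_value(ys, treatments, xs, lambda x: 1 if x > 0 else 0, lambda x: 0.5)
```

Comparing the two returned values on held-out data is the comparison reported in the text: a larger value for the estimated rule than for the all-control rule supports the existence of a beneficial subgroup.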

6. CONCLUSION

In this paper, we propose SUBTLE, which is able to sequentially test whether some subgroup of the population will benefit from the investigative treatment. If the null hypothesis is rejected, a beneficial subgroup can be easily identified from the estimated optimal treatment rule. The validity of the test is supported by both theoretical and simulation results. The experiments also show that SUBTLE has high detection power, especially under the high-dimensional setting, is robust to noise covariates, and allows quick inference most of the time compared with the fixed-horizon test.

As with the mSPRT and SST, the rejection condition of SUBTLE may never be reached in some cases, especially when the true effect size is negligible. Thus, a failure time is needed to terminate the test externally and accept the null hypothesis if it is ever reached. How to choose a failure time that trades off waiting time against power needs to be studied in the future. Another future direction is the application of our test under adaptive allocation, where users have higher probabilities of being assigned to a beneficial variant based on previous observations. However, validity may no longer be guaranteed under adaptive allocation, and more theoretical investigation is needed.

A APPENDIX

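A.1 PROOFS

A.1.1 PROOF OF THEOREM 3.1

Among the conditions for Theorem 3.1, (C1) holds by nature. We suppose that (C2) and (C3) hold. (C4) and (C5) depend on the convergence rate of the estimators of $\mu$, $\theta$, $p$. Wager & Athey (2018) showed that under certain constraints on the subsampling rate, random forest predictions converge at the rate $n^{s-1/2}$, where $s$ is chosen to satisfy some conditions. We assume that under this rate, (C4) and (C5) also hold.

Let $\mathcal{F}_j$, $0 \leq j \leq k$, denote a filtration generated by the observations in the first $(j + 1)$ batches $\bar{C}_j = \cup_{r=0}^j C_r$, and let $\bar{D}_j(C_j; \hat{d}^{opt}_{j-1}, \mu, \theta, p)$ denote an AIPW estimator for $\Delta$ in which only the optimal decision rule is estimated from previous batches:

$$\bar{D}_j(C_j; \hat{d}^{opt}_{j-1}, \mu, \theta, p) := \frac{1}{m} \sum_{O_i \in C_j} \left\{ \left[ \frac{1\{A_i = 1(\hat{\theta}_{j-1}(X_i) > 0)\}}{p_{A_i}(X_i)} Y_i - \left( \frac{1\{A_i = 1(\hat{\theta}_{j-1}(X_i) > 0)\}}{p_{A_i}(X_i)} - 1 \right) g^{-1}\big(\mu(X_i) + \theta(X_i) 1(\hat{\theta}_{j-1}(X_i) > 0)\big) \right] - \left[ \frac{1(A_i = 0)}{1 - p(X_i)} Y_i - \left( \frac{1(A_i = 0)}{1 - p(X_i)} - 1 \right) g^{-1}\big(\mu(X_i)\big) \right] \right\}.$$

Then

\begin{align}
& \frac{1}{k} \left( \sum_{j=1}^k \hat{\sigma}_j^{-1} \right) (\hat{\Delta}_k - \Delta) \tag{18} \\
={}& \frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \left\{ \bar{D}_j(C_j; \bar{C}_{j-1}) - \Delta \right\} \tag{19} \\
={}& \frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \left\{ \left( \bar{D}_j(C_j; \bar{C}_{j-1}) - E[\bar{D}_j(C_j; \hat{d}^{opt}_{j-1}, \mu, \theta, p) \mid \bar{C}_{j-1}] \right) + \left( E[\bar{D}_j(C_j; \hat{d}^{opt}_{j-1}, \mu, \theta, p) \mid \bar{C}_{j-1}] - \Delta \right) \right\} \tag{20} \\
={}& \frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \left\{ \bar{D}_j(C_j; \bar{C}_{j-1}) - E[\bar{D}_j(C_j; \hat{d}^{opt}_{j-1}, \mu, \theta, p) \mid \bar{C}_{j-1}] \right\} + o_p(k^{-1/2}) \tag{21} \\
={}& \frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \left\{ \bar{D}_j(C_j; \bar{C}_{j-1}) - E[\bar{D}_j(C_j; \bar{C}_{j-1}) \mid \mathcal{F}_{j-1}] \right\} + o_p(k^{-1/2}). \tag{22}
\end{align}

Above, (21) follows by condition (C5) and (22) follows by condition (C4). For $j = 1, 2, \cdots, k$, let

$$M_{k,j} = \frac{1}{\sqrt{k}} \cdot \frac{\bar{D}_j(C_j; \bar{C}_{j-1}) - E[\bar{D}_j(C_j; \bar{C}_{j-1}) \mid \mathcal{F}_{j-1}]}{\hat{\sigma}_j}.$$

For each $k$, $M_{k,j}$, $1 \leq j \leq k$, is a martingale difference sequence with respect to the filtration $\mathcal{F}_j$. In particular, for all $j \geq 1$, $E[M_{k,j} \mid \mathcal{F}_{j-1}] = 0$ and

$$\sum_{j=1}^k E[M_{k,j}^2 \mid \mathcal{F}_{j-1}] = \frac{1}{k} \sum_{j=1}^k \frac{\sigma_j^2}{\hat{\sigma}_j^2} \xrightarrow{p} 1 \quad \text{as } k \to \infty$$

by (C3). The conditional Lindeberg condition holds by (C2), so the martingale central limit theorem for triangular arrays gives

$$\sum_{j=1}^k M_{k,j} \xrightarrow{d} N(0, 1).$$

Plugging this back into (22), we get

$$\frac{1}{\sqrt{k}} \left( \sum_{j=1}^k \hat{\sigma}_j^{-1} \right) (\hat{\Delta}_k - \Delta) \xrightarrow{d} N(0, 1).$$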
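A.1.2 PROOF OF PROPOSITION 1

We first simplify the formula of $\lambda_k$:

\begin{align}
\lambda_k &= \frac{\psi_{\left(\frac{1}{\sqrt{k}} \left(\sum_{j=1}^k \hat{\sigma}_j^{-1}\right) \Delta,\ 1\right)}(R_k)}{\psi_{(0, 1)}(R_k)} \tag{26} \\
&= \exp\left\{ \frac{1}{\sqrt{k}} \sum_{j=1}^k \hat{\sigma}_j^{-1} \cdot \Delta \cdot R_k - \frac{1}{2k} \Big( \sum_{j=1}^k \hat{\sigma}_j^{-1} \Big)^2 \cdot \Delta^2 \right\} \tag{27} \\
&= \exp\left\{ \frac{1}{k} \sum_{j=1}^k \hat{\sigma}_j^{-1} \cdot \Delta \cdot \sum_{j=1}^k (\hat{\sigma}_j^{-1} \bar{D}_j) - \frac{1}{2k} \Big( \sum_{j=1}^k \hat{\sigma}_j^{-1} \Big)^2 \cdot \Delta^2 \right\}, \tag{28}
\end{align}

where we write $\bar{D}_j$ for $\bar{D}_j(C_j; \bar{C}_{j-1})$ for simplicity. Let

$$\hat{\Delta}_k := \frac{\sum_{j=1}^k \hat{\sigma}_j^{-1} \bar{D}_j}{\sum_{j=1}^k \hat{\sigma}_j^{-1}}, \tag{29}$$

and recall that Theorem 3.1 gives

$$\frac{1}{\sqrt{k}} \left( \sum_{j=1}^k \hat{\sigma}_j^{-1} \right) (\hat{\Delta}_k - \Delta) \xrightarrow{d} N(0, 1), \tag{30}$$

where $\hat{\sigma}_j$ is estimated from the first $j$ batches $\bar{C}_{j-1}$, $j = 1, 2, \cdots, k$. Since the true value difference $\Delta$ is not very large in practice, we assume the local alternative $\Delta = O_p(k^{-1/2})$ here, as in Theorem 3.1. Then,

\begin{align}
& E_{H_0}[\lambda_{k+1} \mid \mathcal{F}_k] \tag{31} \\
={}& E_{H_0}\left[ \exp\left\{ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \cdot \Delta \cdot \sum_{j=1}^{k+1} (\hat{\sigma}_j^{-1} \bar{D}_j) - \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 \cdot \Delta^2 \right\} \,\Big|\, \mathcal{F}_k \right] \tag{32} \\
\overset{\text{Delta Method}}{\approx}{}& \exp\left\{ E_{H_0}\left[ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \cdot \Delta \cdot \sum_{j=1}^{k+1} (\hat{\sigma}_j^{-1} \bar{D}_j) - \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 \cdot \Delta^2 \,\Big|\, \mathcal{F}_k \right] \right\} \tag{33} \\
={}& \exp\left\{ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \cdot \Delta \cdot \Big[ \sum_{j=1}^{k} (\hat{\sigma}_j^{-1} \bar{D}_j) + \hat{\sigma}_{k+1}^{-1} \cdot \underbrace{E_{H_0}[\bar{D}_{k+1} \mid \mathcal{F}_k]}_{0} \Big] - \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 \cdot \Delta^2 \right\} \tag{34} \\
={}& \exp\left\{ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \cdot \Delta \cdot \sum_{j=1}^{k} (\hat{\sigma}_j^{-1} \bar{D}_j) - \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 \cdot \Delta^2 \right\} \tag{35} \\
={}& \exp\Bigg\{ \frac{1}{k} \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \cdot \Delta \cdot \sum_{j=1}^{k} (\hat{\sigma}_j^{-1} \bar{D}_j) + \Big[ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} - \frac{1}{k} \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big] \cdot \Delta \cdot \sum_{j=1}^{k} (\hat{\sigma}_j^{-1} \bar{D}_j) \notag \\
& \qquad - \frac{1}{2k} \Big( \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big)^2 \cdot \Delta^2 - \Big[ \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 - \frac{1}{2k} \Big( \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big)^2 \Big] \Delta^2 \Bigg\} \tag{36} \\
={}& \lambda_k \cdot \exp\left\{ \Big[ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} - \frac{1}{k} \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big] \cdot \Delta \cdot \sum_{j=1}^{k} (\hat{\sigma}_j^{-1} \bar{D}_j) - \Big[ \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 - \frac{1}{2k} \Big( \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big)^2 \Big] \Delta^2 \right\} \tag{37} \\
\overset{(29)}{=}{}& \lambda_k \cdot \exp\left\{ \Big[ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} - \frac{1}{k} \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big] \cdot \Delta \cdot \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \cdot \hat{\Delta}_k - \Big[ \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 - \frac{1}{2k} \Big( \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big)^2 \Big] \Delta^2 \right\} \tag{38} \\
={}& \lambda_k \cdot \exp\Bigg\{ \underbrace{\Big[ \frac{1}{k+1} \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} - \frac{1}{k} \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big]}_{O_p(k^{-1})} \times \underbrace{\sum_{j=1}^{k} \hat{\sigma}_j^{-1} \hat{\Delta}_k}_{O_p(k^{1/2}) \text{ by (30)}} \times \Delta - \underbrace{\Big[ \frac{1}{2(k+1)} \Big( \sum_{j=1}^{k+1} \hat{\sigma}_j^{-1} \Big)^2 - \frac{1}{2k} \Big( \sum_{j=1}^{k} \hat{\sigma}_j^{-1} \Big)^2 \Big]}_{O_p(1)} \times \Delta^2 \Bigg\} \tag{39} \\
={}& \lambda_k \cdot \exp\{o_p(1)\}. \tag{40}
\end{align}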

A.2.1 HYPERPARAMETERS

There are three hyperparameters in our algorithm: the batch size $m$, the variance of the mixture density $\tau^2$, and the failure time $M$. We did not tune these hyperparameters in our experiments, but used the same values for SST and SUBTLE. In the following, we discuss the effects of these hyperparameters on the performance of our tests and provide additional simulation results.

Apart from batch size 20 in Section 4.1, we also conduct experiments with batch size 40 under the same setting. The results are shown in Table 3. The results appear to be considerably robust to the choice of batch size.

In theory, the choice of the mixture density variance $\tau^2$ does not have any effect on the type I error control. Johari et al. (2015) proved that an optimal $\tau^2$ in terms of stopping time is the prior variance times a correction for truncating. This is the reason that we suggest using historical data to estimate the variance of the value difference $\Delta$. In addition, we conduct simulations with varying $\tau^2$. The data are generated from Model I in Table 1 with $c = 0$ or $c = 1$. When $c = 0$ we estimate the type I error, while when $c = 1$ we estimate the power. The results in Table 4 show that the type I error is always controlled below the significance level 0.05 and that the power is considerably robust to the choice of $\tau^2$.

As mentioned in Section 6, how to choose the optimal failure time $M$ is still an open problem. The larger the failure time, the higher the power to detect a difference, since more samples are collected. However, a large failure time also means a long waiting time and a high opportunity cost. Thus, there is a trade-off between waiting time and power.

A.2.2 OPTIMAL TREATMENT RULE FOR YAHOO DATA

Figure 3 gives the decision tree of the estimated optimal treatment rule. Each left branch contains the subpopulation whose covariates satisfy the condition on its parent node. The classification 0/1 on each leaf node indicates the optimal treatment rule for the corresponding subpopulation, and the two values separated by a slash give the numbers of users who "truly" (as estimated by the random forest) benefit from control and from treatment.



Figure 1: Estimated power vs. the number of noise covariates

Table 2: Estimated type I error or power for SUBTLE and SST with batch size 20

Table 3: Estimated type I error or power for SUBTLE and SST with batch size 40

Table 4: Estimated type I error and power for SUBTLE and SST with varying mixture density variance

