ONLINE TESTING OF SUBGROUP TREATMENT EFFECTS BASED ON VALUE DIFFERENCE

Anonymous authors
Paper under double-blind review

Abstract

Online A/B testing plays a critical role in the high-tech industry to guide product development and accelerate innovation. It performs a null hypothesis statistical test to determine which variant is better. However, a typical A/B test presents two problems: (i) the fixed-horizon framework inflates false positive errors under continuous monitoring; (ii) the homogeneous-effects assumption fails to identify subgroups with beneficial treatment effects. In this paper, we propose a sequential test for subgroup treatment effects based on value difference, named SUBTLE, to address these two problems simultaneously. SUBTLE allows experimenters to "peek" at the results during the experiment without harming the statistical guarantees. It assumes heterogeneous treatment effects and aims to test whether some subgroup of the population benefits from the investigative treatment. If the test indicates the existence of such a subgroup, the subgroup can be identified using a readily available estimated optimal treatment rule. We examine the empirical performance of the proposed test on both simulations and a real data set. The results show that SUBTLE has high detection power with type I error controlled at any time, is more robust to noise covariates, and can achieve early stopping compared with the corresponding fixed-horizon test.

1. INTRODUCTION

Online A/B testing, a form of randomized controlled experiment, is widely used in the high-tech industry to assess the value of ideas in a scientific manner (Kohavi et al., 2009). It randomly exposes users to one of two variants: control (A), the currently-used version, or treatment (B), a new version being evaluated, and collects a metric of interest, such as conversion rate or revenue. Then, a null hypothesis statistical test is performed to evaluate whether there is a statistically significant difference between the two variants on the metric of interest. This scientific design helps to control for external variation and thus establishes causality between the variants and the outcome. However, current A/B testing has limitations in both its framework and its model assumptions.

First, most A/B tests employ a fixed-horizon framework, whose validity requires that the sample size be fixed and determined before the experiment starts. In practice, however, experimenters driven by fast-paced product evolution often "peek" at the experiment and hope to find significance as quickly as possible, to avoid large (i) time costs: an A/B test may take a prohibitively long time to collect the predetermined number of samples; and (ii) opportunity costs: users who have been assigned to a suboptimal variant will be stuck with a bad experience for a long time (Ju et al., 2019). Continuously monitoring and concluding the experiment prematurely biases the analysis toward significant results and leads to very high false positive probabilities, well in excess of the nominal significance level α (Goodson, 2014; Simmons et al., 2011). Another limitation of A/B tests is that they assume homogeneous treatment effects among the population and mainly focus on testing the average treatment effect. However, it is common that treatment effects vary across sub-populations.
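The inflation of false positives under continuous monitoring can be seen in a small simulation. The sketch below is illustrative and not from the paper; all sample sizes, batch counts, and the number of replications are arbitrary choices. It runs A/A experiments (the null is true) and compares the false positive rate of a single fixed-horizon z-test with that of "peeking" after every batch and stopping at the first nominal significance.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_reps, n_batches, batch = 0.05, 2000, 20, 50
z_crit = 1.96  # two-sided critical value at alpha = 0.05

fixed_fp = peek_fp = 0
for _ in range(n_reps):
    # Both variants draw from the same distribution: H0 holds.
    a = rng.normal(0.0, 1.0, n_batches * batch)
    b = rng.normal(0.0, 1.0, n_batches * batch)
    diffs = a - b

    # Fixed-horizon: a single test at the pre-specified sample size.
    z_final = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(diffs.size))
    fixed_fp += abs(z_final) > z_crit

    # Peeking: test after every batch, stop at the first "significant" result.
    for k in range(1, n_batches + 1):
        d = diffs[: k * batch]
        z = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
        if abs(z) > z_crit:
            peek_fp += 1
            break

print(f"fixed-horizon false positive rate: {fixed_fp / n_reps:.3f}")
print(f"peeking false positive rate:       {peek_fp / n_reps:.3f}")
```

The fixed-horizon rate stays near the nominal 5%, while the peeking rate is several times larger, illustrating why repeated looks require a sequential procedure with an explicit anytime guarantee.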
Testing subgroup treatment effects helps decision makers distinguish the sub-populations that may benefit from a particular treatment from those that may not, and thereby guides companies' marketing strategies when promoting new products.

The first problem can be addressed by applying the sequential testing framework. Sequential testing, in contrast to the classic fixed-horizon test, is a statistical testing procedure that checks for significance at every new sample and stops the test as soon as a significant result is detected, while controlling the type I error at any time. It generally requires a substantially smaller sample size than a fixed-horizon test with the same type I and type II error control, and thus can end an experiment much earlier. The field was initiated by Wald (1945), who proposed the sequential probability ratio test (SPRT) for simple hypotheses using the likelihood ratio as the test statistic; it was later extended to composite hypotheses in a large body of follow-up work (Schwarz, 1962; Armitage et al., 1969; Cox, 1963; Robbins, 1970; Lai, 1988). A thorough review is given in Lai (2001). However, the advantage of sequential testing for online A/B testing was not widely recognized until recently, when Johari et al. (2015) brought the mSPRT, a variant of the SPRT, to A/B tests. The second problem shows a demand for a test of subgroup treatment effects. Although sequential testing is developing rapidly in online A/B testing, little work has focused on subgroup treatment effect testing. Yu et al. (2020) proposed a sequential score test (SST) based on the score statistic under a generalized linear model, which aims to test whether there is a difference between treatment and control among any subjects. However, this test relies on a restrictive parametric assumption on the treatment-covariate interaction and cannot be used to test subgroup treatment effects.
In this paper, we consider a flexible model and propose a sequential test for SUBgroup Treatment effects based on vaLuE difference (SUBTLE), which aims to test whether some subgroup of the population would benefit from the investigative treatment. Our method does not require specifying any parametric form for the covariate-specific treatment effects. If the null hypothesis is rejected, a beneficial subgroup can be easily obtained from the estimated optimal treatment rule. The remainder of this paper is structured as follows. In Section 2, we review the ideas behind the mSPRT and SST and discuss how they relate to our test. In Section 3, we introduce the proposed SUBTLE and provide a theoretical guarantee of its validity. We conduct simulations in Section 4 and real data experiments in Section 5 to demonstrate the validity, detection power, robustness, and efficiency of the proposed test. Finally, in Section 6, we conclude the paper and present future directions.

2.1. MIXTURE SEQUENTIAL PROBABILITY RATIO TEST

The mixture sequential probability ratio test (mSPRT) (Robbins, 1970) supposes that the independent and identically distributed (i.i.d.) random variables $Y_1, Y_2, \dots$ have a probability density function $f_\theta(x)$ induced by a parameter $\theta$, and aims to test
$$H_0: \theta = \theta_0 \quad \text{v.s.} \quad H_1: \theta \neq \theta_0. \tag{1}$$
Its test statistic $\Lambda^\pi_n$ at sample size $n$ is a mixture of likelihood ratios,
$$\Lambda^\pi_n = \int_\Theta \prod_{i=1}^n \frac{f_\theta(Y_i)}{f_{\theta_0}(Y_i)}\, \pi(\theta)\, d\theta, \tag{2}$$
with a mixture density $\pi(\cdot)$ over the parameter space $\Theta$. The mSPRT stops the sampling at the stage
$$N = \inf\{n \geq 1 : \Lambda^\pi_n \geq 1/\alpha\} \tag{3}$$
and rejects the null hypothesis $H_0$ in favor of $H_1$. If no such time exists, it continues the sampling indefinitely and accepts $H_0$. Since the likelihood ratio under $H_0$ is a nonnegative martingale with initial value equal to 1, and so is the mixture of such likelihood ratios $\Lambda^\pi_n$, the type I error of the mSPRT can be proved to be always controlled at $\alpha$ by an application of Markov's inequality and the optional stopping theorem:
$$P_{H_0}(\Lambda^\pi_n \geq \alpha^{-1}) \leq \frac{E_{H_0}[\Lambda^\pi_n]}{\alpha^{-1}} = \frac{E_{H_0}[\Lambda^\pi_0]}{\alpha^{-1}} = \alpha.$$
Besides, the mSPRT is a test of power one (Robbins & Siegmund, 1974), which means that any small deviation from $\theta_0$ can be detected provided one waits long enough. It is also almost optimal with respect to the expected stopping time for data from an exponential family of distributions (Pollak, 1978). The mSPRT was brought to A/B testing by Johari et al. (2015; 2017), who assume that the observations in the control ($A = 0$) and treatment ($A = 1$) groups arrive in pairs $(Y_i^{(0)}, Y_i^{(1)})$, $i = 1, 2, \dots$. They restricted their data model to the two most common cases in practice: normal distribution
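As a concrete illustration of the mSPRT, for i.i.d. $N(\theta, \sigma^2)$ observations with a normal mixture density $\pi = N(\theta_0, \tau^2)$, the mixture likelihood ratio has a well-known closed form, so the statistic can be updated in constant time per observation. The sketch below is illustrative only: the parameter values (`sigma`, `tau`, `alpha`) and the helper name `msprt_normal` are our own choices, not the paper's.

```python
import numpy as np

def msprt_normal(y, theta0=0.0, sigma=1.0, tau=1.0, alpha=0.05):
    """Run the mSPRT on the stream y; return (rejected, stopping time or None).

    Uses the closed-form mixture likelihood ratio for N(theta, sigma^2) data
    with a N(theta0, tau^2) mixture density:
      Lambda_n = sqrt(sigma^2 / (sigma^2 + n tau^2))
                 * exp(tau^2 * S_n^2 / (2 sigma^2 (sigma^2 + n tau^2))),
    where S_n = sum_{i<=n} (y_i - theta0).
    """
    y = np.asarray(y, dtype=float)
    n = np.arange(1, y.size + 1)
    s = np.cumsum(y - theta0)              # running sum of centered observations
    v = sigma**2 + n * tau**2
    log_lam = 0.5 * np.log(sigma**2 / v) + tau**2 * s**2 / (2 * sigma**2 * v)
    hits = np.nonzero(log_lam >= np.log(1.0 / alpha))[0]
    if hits.size:                          # first crossing of the 1/alpha boundary
        return True, int(hits[0]) + 1
    return False, None                     # boundary never crossed: accept H0

rng = np.random.default_rng(1)
# Under a true effect theta = 0.5, the test should reject, typically well
# before the 2000 available observations are exhausted.
rej, t = msprt_normal(rng.normal(0.5, 1.0, 2000))
print(rej, t)
```

The vectorized form above scans a finished stream for the first boundary crossing; in a live experiment the same recursion would be applied one observation at a time, stopping as soon as $\Lambda^\pi_n \geq 1/\alpha$.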

