ONLINE TESTING OF SUBGROUP TREATMENT EFFECTS BASED ON VALUE DIFFERENCE

Anonymous authors
Paper under double-blind review

Abstract

Online A/B testing plays a critical role in the high-tech industry, guiding product development and accelerating innovation. It performs a null hypothesis statistical test to determine which variant is better. However, a typical A/B test presents two problems: (i) its fixed-horizon framework inflates the false positive rate under continuous monitoring; (ii) its assumption of homogeneous treatment effects precludes identifying a subgroup with a beneficial treatment effect. In this paper, we propose a sequential test for subgroup treatment effects based on value difference, named SUBTLE, that addresses both problems simultaneously. SUBTLE allows experimenters to "peek" at the results during the experiment without harming the statistical guarantees. It assumes heterogeneous treatment effects and tests whether some subgroup of the population benefits from the treatment under investigation. If the test indicates that such a subgroup exists, one is identified using a readily available estimated optimal treatment rule. We examine the empirical performance of the proposed test in both simulations and on a real data set. The results show that SUBTLE has high detection power while controlling the type I error at any time, is more robust to noise covariates, and can stop earlier than the corresponding fixed-horizon test.

1 INTRODUCTION

Online A/B testing, a form of randomized controlled experiment, is widely used in the high-tech industry to assess the value of ideas in a scientific manner (Kohavi et al., 2009). It randomly exposes users to one of two variants, control (A), the currently used version, or treatment (B), a new version being evaluated, and collects a metric of interest, such as conversion rate or revenue. A null hypothesis statistical test is then performed to evaluate whether there is a statistically significant difference between the two variants on the metric of interest. This scientific design controls for external variation and thus establishes causality between the variants and the outcome. However, current A/B testing has limitations in both its framework and its model assumptions.

First, most A/B tests employ a fixed-horizon framework, whose validity requires that the sample size be fixed and determined before the experiment starts. In practice, experimenters driven by fast-paced product evolution often "peek" at the experiment, hoping to find significance as quickly as possible in order to avoid large (i) time costs: an A/B test may take a prohibitively long time to collect the predetermined number of samples; and (ii) opportunity costs: users assigned to a suboptimal variant are stuck in a bad experience for a long time (Ju et al., 2019). Continuously monitoring and concluding the experiment prematurely biases the analysis toward significant results and leads to very high false positive probabilities, well in excess of the nominal significance level α (Goodson, 2014; Simmons et al., 2011); the simulation sketch at the end of this section illustrates this inflation.

Another limitation of A/B tests is that they assume homogeneous treatment effects across the population and focus mainly on testing the average treatment effect. In practice, it is common for treatment effects to vary across sub-populations. Testing for subgroup treatment effects helps decision makers distinguish the sub-populations that may benefit from a particular treatment from those that may not, and thereby guides companies' marketing strategies when promoting new products.

The first problem can be addressed by the sequential testing framework; a sketch of one classical sequential procedure is also given below. Sequential testing, in contrast to the classic fixed-horizon test, is a statistical testing procedure that continuously checks for significance as each new sample arrives and stops the test as soon as a significant result is detected,
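To make the cost of peeking concrete, the following minimal simulation (our illustration, not part of the paper) repeatedly runs A/A experiments, in which both variants draw outcomes from the same distribution, and applies an ordinary fixed-horizon t-test at every peek; the sample sizes, peeking schedule, and number of replications are all illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

ALPHA = 0.05      # nominal significance level
N_MAX = 5000      # maximum sample size per variant
PEEK_EVERY = 100  # check for significance after every 100 samples
N_SIMS = 1000     # number of simulated A/A experiments

false_positives = 0
for _ in range(N_SIMS):
    # A/A experiment: both variants share the same outcome
    # distribution, so any rejection is a false positive.
    a = rng.normal(size=N_MAX)
    b = rng.normal(size=N_MAX)
    for n in range(PEEK_EVERY, N_MAX + 1, PEEK_EVERY):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < ALPHA:  # stop at the first "significant" peek
            false_positives += 1
            break

print(f"empirical type I error under peeking: {false_positives / N_SIMS:.3f}")
```

Each individual t-test is valid at level 0.05 in isolation; it is the option to stop at whichever peek first crosses the threshold that drives the empirical type I error far above the nominal level.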

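For the sequential side, one classical procedure that supports always-valid inference is the mixture sequential probability ratio test (mSPRT), which monitors a mixture likelihood ratio and converts it into a p-value that remains valid at every sample size. The sketch below is our illustration of that general idea under a Gaussian model with known variance, not necessarily the construction used by SUBTLE; `sigma2` (the outcome-difference variance) and `tau2` (the mixing-prior variance) are assumed inputs.

```python
import numpy as np

def msprt_pvalues(diffs, sigma2, tau2):
    """Always-valid p-values from the mixture SPRT for H0: E[diff] = 0.

    diffs  : stream of outcome differences (treatment minus control),
             modeled as N(theta, sigma2) with known variance sigma2
    tau2   : variance of the N(0, tau2) mixing prior over theta
    """
    n = np.arange(1, len(diffs) + 1)
    xbar = np.cumsum(diffs) / n
    # Mixture likelihood ratio Lambda_n under the Gaussian mixing prior:
    # Lambda_n = sqrt(sigma2 / (sigma2 + n*tau2))
    #            * exp(n^2 * tau2 * xbar^2 / (2*sigma2*(sigma2 + n*tau2)))
    log_lam = (0.5 * np.log(sigma2 / (sigma2 + n * tau2))
               + (n**2 * tau2 * xbar**2) / (2 * sigma2 * (sigma2 + n * tau2)))
    # Always-valid p-value: running minimum of min(1, 1/Lambda_n)
    return np.minimum.accumulate(np.minimum(1.0, np.exp(-log_lam)))

# A/A stream: the difference of two unit-variance outcomes has variance 2
rng = np.random.default_rng(1)
p = msprt_pvalues(rng.normal(0.0, np.sqrt(2.0), size=5000),
                  sigma2=2.0, tau2=1.0)
print("ever significant under H0:", bool((p <= 0.05).any()))  # False w.h.p.
```

Because the mixture likelihood ratio is a nonnegative martingale under the null, the probability that it ever exceeds 1/α is at most α, so the experimenter may check this p-value after every new sample without inflating the type I error.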
