AN ONLINE SEQUENTIAL TEST FOR QUALITATIVE TREATMENT EFFECTS

Anonymous

Abstract

Tech companies (e.g., Google or Facebook) often run randomized online experiments, or A/B tests, that compare a new product with an old one primarily on the basis of average treatment effects. However, it is also critically important to detect qualitative treatment effects, whereby the new product significantly outperforms the existing one only under some specific circumstances. The aim of this paper is to develop a powerful testing procedure that efficiently detects such qualitative treatment effects. We propose a scalable online updating algorithm to implement our test. It features three novelties: adaptive randomization, sequential monitoring, and online updating with guaranteed type-I error control. We also thoroughly examine the theoretical properties of the testing procedure, including the limiting distribution of the test statistic and the justification of an efficient bootstrap method. Extensive empirical studies are conducted to examine the finite-sample performance of our test procedure.

1. INTRODUCTION

Tech companies use randomized online experiments, or A/B testing, to compare a new product with a well-established one. Most works in the literature focus on the average treatment effect (ATE) between the new and existing products (see Kharitonov et al., 2015; Johari et al., 2015; 2017; Yang et al., 2017; Ju et al., 2019, and the references therein). In addition to the ATE, we are sometimes interested in locating, as early as possible, a subgroup (if one exists) on which the new product performs significantly better than the existing one. Consider a ride-hailing company (e.g., Uber). Suppose some passengers are in a recession state (at high risk of stopping using the company's app) and the company devises a strategy to intervene in the recession process. We would like to know whether some subgroups are sensitive to the strategy and, if so, to pinpoint them. This motivates us to consider the null hypothesis that the treatment effect is nonpositive for all passengers. Such a null hypothesis is closely related to the notion of qualitative treatment effects in medical studies (QTE, Gail & Simon, 1985; Roth & Simon, 2018; Shi et al., 2020a) and to conditional moment inequalities in economics (see, for example, Andrews & Shi, 2013; 2014; Chernozhukov et al., 2013; Armstrong & Chan, 2016; Chang et al., 2015; Hsu, 2017). However, these tests are conducted offline and may not be suitable for online settings. Moreover, those papers assume that observations are independent. In an online experiment, one may wish to adaptively allocate the treatment based on the observed data stream in order to maximize the cumulative reward or to detect the alternative more efficiently; the independence assumption is thus violated. In addition, it is desirable to terminate an online experiment as early as possible to save time and budget. Sequential testing for qualitative treatment effects remains largely unexplored.
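The null hypothesis above can be stated formally. In standard potential-outcome notation (the symbols below are our notation for illustration, not necessarily the paper's), let $\tau(x)$ denote the conditional average treatment effect given covariates $x$; the test then contrasts

```latex
% Conditional average treatment effect (potential-outcome notation):
%   \tau(x) = E[ Y(1) - Y(0) \mid X = x ]
\[
H_0:\ \tau(x) \le 0 \ \text{ for all } x \in \mathcal{X}
\qquad \text{versus} \qquad
H_1:\ \tau(x) > 0 \ \text{ for some } x \in \mathcal{X}.
\]
```

Rejecting $H_0$ thus asserts the existence of at least one covariate subgroup on which the new product strictly improves on the old one, which is precisely the qualitative treatment effect of interest.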
In the literature, there is a line of research on estimation and inference of heterogeneous treatment effects (HTE) (Athey & Imbens, 2016; Taddy et al., 2016; Wager & Athey, 2018; Yu et al., 2020). In particular, Yu et al. (2020) proposed an online test for HTE. We remark that HTE and QTE are related yet fundamentally different hypotheses: there are cases where HTE exists whereas QTE does not (for instance, a treatment effect that varies with the covariates but remains nonpositive everywhere). See Figure 1 for an illustration. Consequently, directly applying their test fails in our setting. The contributions of this paper are summarized as follows. First, we propose a new testing procedure for treatment comparison based on the notion of QTE. When the null hypothesis is not rejected, the new product is no better than the control for any realization of the covariates, and thus it is not useful at all. Otherwise, the company could implement different products according to the auxiliary

