AN ONLINE SEQUENTIAL TEST FOR QUALITATIVE TREATMENT EFFECTS

Anonymous

Abstract

Tech companies (e.g., Google or Facebook) often use randomized online experiments and/or A/B testing based primarily on average treatment effects to compare a new product with an existing one. However, it is also critically important to detect qualitative treatment effects, whereby the new product significantly outperforms the existing one only under some specific circumstances. The aim of this paper is to develop a powerful testing procedure that efficiently detects such qualitative treatment effects. We propose a scalable online updating algorithm to implement our test. It has three novelties: adaptive randomization, sequential monitoring, and online updating with guaranteed type-I error control. We also thoroughly examine the theoretical properties of our testing procedure, including the limiting distribution of the test statistic and the justification of an efficient bootstrap method. Extensive empirical studies are conducted to examine the finite-sample performance of our test.

1. INTRODUCTION

Tech companies use randomized online experiments, or A/B testing, to compare a new product with a well-established one. Most works in the literature focus on the average treatment effect (ATE) between the new and existing products (see Kharitonov et al., 2015; Johari et al., 2015; 2017; Yang et al., 2017; Ju et al., 2019, and the references therein). In addition to the ATE, we are sometimes interested in locating, as early as possible, the subgroup (if it exists) for which the new product performs significantly better than the existing one. Consider a ride-hailing company (e.g., Uber). Suppose some passengers are in a recession state (at high risk of stopping using the company's app) and the company devises a strategy to intervene in the recession process. We would like to know whether there are subgroups that are sensitive to the strategy and, if so, to pinpoint them. This motivates us to consider the null hypothesis that the treatment effect is nonpositive for all passengers. Such a null hypothesis is closely related to the notion of qualitative treatment effects in medical studies (QTE, Gail & Simon, 1985; Roth & Simon, 2018; Shi et al., 2020a), and to conditional moment inequalities in economics (see, for example, Andrews & Shi, 2013; 2014; Chernozhukov et al., 2013; Armstrong & Chan, 2016; Chang et al., 2015; Hsu, 2017). However, these tests are computed offline and might not be suitable for online settings. Moreover, those papers assume that observations are independent. In online experiments, one may wish to adaptively allocate the treatment based on the observed data stream in order to maximize the cumulative reward or to detect the alternative more efficiently; the independence assumption is thus violated. In addition, an online experiment should be terminated as early as possible to save time and budget. Sequential testing for qualitative treatment effects has been much less explored.
In the literature, there is a line of research on estimation and inference for heterogeneous treatment effects (HTE) (Athey & Imbens, 2016; Taddy et al., 2016; Wager & Athey, 2018; Yu et al., 2020). In particular, Yu et al. (2020) proposed an online test for HTE. We remark that HTE and QTE are related yet fundamentally different hypotheses: there are cases where HTE exists whereas QTE does not. See Figure 1 for an illustration. Consequently, applying their test will fail in our setting. The contributions of this paper are summarized as follows. First, we propose a new testing procedure for treatment comparison based on the notion of QTE. When the null hypothesis is not rejected, the new product is no better than the control for any realization of the covariates, and thus it is not useful at all. Otherwise, the company could implement different products according to the auxiliary covariates observed, so as to maximize the average reward. We remark that there are plenty of cases where the treatment effects are always nonpositive (see Section 5 of Chang et al., 2015; Shi et al., 2020a). A by-product of our test is that it yields a decision rule to implement personalization when the null is rejected (see Section 3.1 for details). Although we primarily focus on QTE in this paper, our procedure can be easily extended to testing the ATE as well (see Appendix D for details). Second, we propose a scalable online updating algorithm to implement our test.
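The distinction between HTE and QTE can be made concrete with a small numerical sketch. The three treatment-effect curves below are invented for illustration and mirror the three panels of Figure 1: a constant negative effect (neither HTE nor QTE), a varying but everywhere-negative effect (HTE without QTE), and an effect that is positive for some covariate values (both).

```python
import numpy as np

# Hypothetical treatment-effect functions tau(x) = Q0(x, 1) - Q0(x, 0)
# over a scalar covariate x in [0, 1]. All three are illustrative choices.
x = np.linspace(0.0, 1.0, 201)

tau_left = np.full_like(x, -0.5)    # constant negative effect
tau_middle = -0.2 - 0.8 * x**2      # varies with x, but always negative
tau_right = 0.6 - 1.5 * x           # positive for small x

def has_hte(tau, tol=1e-8):
    """HTE: the treatment effect varies with x."""
    return bool(tau.max() - tau.min() > tol)

def has_qte(tau):
    """QTE: the treatment effect is strictly positive somewhere (H1 holds)."""
    return bool(tau.max() > 0)

print(has_hte(tau_left), has_qte(tau_left))      # False False
print(has_hte(tau_middle), has_qte(tau_middle))  # True False
print(has_hte(tau_right), has_qte(tau_right))    # True True
```

The middle case is exactly the situation where an HTE test such as Yu et al. (2020) would reject while no qualitative effect exists.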
To allow for sequential monitoring, our procedure leverages ideas from the α-spending function approach (Lan & DeMets, 1983), originally designed for sequential analysis in clinical trials (see Jennison & Turnbull, 1999, for an overview). Classical sequential tests focus on the ATE: the test statistic at each interim stage is asymptotically normal, and the stopping boundary can be recursively updated via numerical integration. However, the limiting distribution of the proposed test statistic does not have a tractable analytical form, making the numerical integration method difficult to apply. To resolve this issue, we propose a scalable bootstrap-assisted procedure to determine the stopping boundary. Third, we adopt a theoretical framework that allows the maximum number of interim analyses K to diverge as the number of observations increases, since tech companies might analyze the results every few minutes (or hours) to decide whether to stop the experiment or continue collecting more data. This is fundamentally different from classical sequential analysis, where K is fixed. Moreover, the derivation of the asymptotic properties of the proposed test is further complicated by the adaptive randomization procedure, which makes observations dependent on each other. Despite these technical challenges, we establish a nonasymptotic upper bound on the type-I error rate by explicitly characterizing the conditions needed on the randomization procedure, K, and the number of samples observed at the initial decision point to ensure the validity of our test.
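The α-spending idea can be sketched in a few lines. A spending function α(t) specifies how much cumulative type-I error may be spent by information time t ∈ [0, 1]; the increment spent at each interim analysis then determines that stage's stopping boundary (here via a bootstrap quantile, since the limiting distribution is intractable). The Pocock-type spending function and the equally spaced interim times below are standard illustrative choices, not the paper's exact specification.

```python
import math

def pocock_spending(t, alpha=0.05):
    """Cumulative type-I error allowed to be spent by information time t in [0, 1]."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)

# Suppose K = 5 equally spaced interim analyses (hypothetical).
K = 5
info_times = [(k + 1) / K for k in range(K)]
cumulative = [pocock_spending(t) for t in info_times]

# Incremental alpha spent at each stage; the stage-k stopping boundary would be
# the (1 - incremental[k]) bootstrap quantile of the test statistic,
# conditional on not having stopped at earlier stages.
incremental = [cumulative[0]] + [cumulative[k] - cumulative[k - 1]
                                 for k in range(1, K)]

print([round(a, 4) for a in incremental])
# Total error spent over all K stages equals the nominal level alpha.
assert abs(sum(incremental) - 0.05) < 1e-12
```

Because the spending function is concave, early analyses spend relatively more error, which is what allows the experiment to terminate early while the overall type-I error stays at the nominal level.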

2. BACKGROUND AND PROBLEM FORMULATION

We adopt the potential outcome framework (Rubin, 2005) to formulate our problem. Suppose that we have two products, the control and the treatment. The observed data at time point $t$ consist of a sequence of triples $\{(X_i, A_i, Y_i)\}_{i=1}^{N(t)}$, where $N(\cdot)$ is a counting process independent of the data stream $\{(X_i, A_i, Y_i)\}_{i=1}^{+\infty}$, $A_i$ is a binary random variable indicating the product executed in the $i$-th experiment, $X_i \in \mathbb{R}^p$ denotes the associated covariates, and $Y_i$ stands for the associated reward (the larger the better, by convention). We allow $A_i$ to depend on $X_i$ and past observations $\{(X_j, A_j, Y_j)\}_{j<i}$, so that the randomization procedure can be adaptively changed. In addition, define $Y_i^*(0)$ and $Y_i^*(1)$ to be the potential outcomes that would have been observed had the corresponding product been executed in the $i$-th experiment. Suppose that $\{(X_i, Y_i^*(0), Y_i^*(1))\}_{i=1}^{+\infty}$ are independent and identically distributed copies of $(X, Y^*(0), Y^*(1))$. Let $\mathbb{X}$ be the support of $X$ and $Q_0(x, a) = E\{Y^*(a) \mid X = x\}$ for $a = 0, 1$. We focus on testing the following hypotheses: $$H_0: Q_0(x, 1) \le Q_0(x, 0), \ \forall x \in \mathbb{X} \quad \text{versus} \quad H_1: Q_0(x, 1) > Q_0(x, 0), \ \exists x \in \mathbb{X}.$$ Notice that when there are no covariates, i.e., $X = \emptyset$, the hypotheses reduce to $H_0: \tau_0 \le 0$ versus $H_1: \tau_0 > 0$, where $\tau_0$ corresponds to the ATE, i.e., $\tau_0 = E\{Y^*(1) - Y^*(0)\}$. In general, we require $\mathbb{X}$ to be a compact set. We consider a large linear approximation space $\mathcal{Q}$ for the conditional mean function $Q_0$. Specifically, let $\mathcal{Q} = \{Q(x, a; \beta_0, \beta_1) = \phi^\top(x)\beta_a : \beta_0, \beta_1 \in \mathbb{R}^q\}$ be the approximation space, where $\phi(x)$ is a $q$-dimensional vector composed of basis functions on $\mathbb{X}$. The
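A minimal sketch of the sieve estimation step may help fix ideas: fit $Q(x, a; \beta_0, \beta_1) = \phi^\top(x)\beta_a$ by least squares within each arm, then evaluate the estimated effect $Q(x, 1) - Q(x, 0)$ over a grid and take its supremum as a sup-type statistic. The polynomial basis, the simulated data, and the grid below are illustrative assumptions, not the paper's exact construction (which also involves adaptive randomization and a bootstrap calibration).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0.0, 1.0, size=n)   # scalar covariate on a compact support
A = rng.integers(0, 2, size=n)      # randomized binary treatment
# Simulated rewards with a constant negative effect of -0.3 (H0 holds).
Y = 1.0 + X - 0.3 * A + rng.normal(scale=0.1, size=n)

def phi(x, q=4):
    """Polynomial basis of dimension q on [0, 1] (illustrative choice)."""
    return np.vstack([x**j for j in range(q)]).T

# Least-squares fit of beta_a within each treatment arm.
betas = {}
for a in (0, 1):
    mask = A == a
    betas[a], *_ = np.linalg.lstsq(phi(X[mask]), Y[mask], rcond=None)

# Estimated effect Q(x, 1) - Q(x, 0) over a grid, and the sup-type statistic.
grid = np.linspace(0.0, 1.0, 101)
effect = phi(grid) @ (betas[1] - betas[0])
stat = effect.max()
print(round(stat, 3))   # negative here, since H0 holds in the simulation
```

Rejecting when `stat` exceeds a calibrated boundary targets exactly $H_1$: a positive effect for *some* $x$, and the maximizing $x$ suggests which subgroup to treat.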



Figure 1: Plots demonstrating QTE. X denotes the observed covariates, A denotes the received treatment and Y denotes the associated reward. In the ride-hailing example, X is a feature vector describing the characteristics of a passenger, A is a binary strategy indicator and Y is the passenger's number of rides in the following two weeks. In the left panel, the treatment effect does not depend on X; neither HTE nor QTE exists. In the middle panel, HTE exists, but the treatment effect is always negative, so QTE does not exist. In the right panel, both QTE and HTE exist.

