A REINFORCEMENT LEARNING APPROACH TO ESTIMATING LONG-TERM EFFECTS IN NONSTATIONARY ENVIRONMENTS

Abstract

Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare, and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation of randomized experiments is that they do not easily extend to measuring long-term effects, since running long experiments is time-consuming and expensive. In this paper, we take a reinforcement learning (RL) approach that estimates the average reward in a Markov process. Motivated by real-world scenarios where the observed state transitions are nonstationary, we develop a new algorithm for a class of nonstationary problems, and demonstrate promising results on two synthetic datasets and one online store dataset.

1. INTRODUCTION

Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare, and other applications. In an experiment, units like customers or patients are randomly split into a treatment bucket and a control bucket. For example, in a rideshare app, drivers in the control and treatment buckets are matched to customers in different ways (e.g., with different spatial ranges or different ranking functions). After we expose customers to one of these options for a period of time, usually a few days or weeks, we can record the corresponding customer engagements, and run a statistical hypothesis test on the engagement data to detect whether there is a statistically significant difference in customer preference for treatment over control. The result informs whether the app should launch the treatment or the control.

While this method has been widely successful (e.g., in online applications (Kohavi et al., 2020)), it typically measures the treatment effect during the short experiment window. However, in many problems, a treatment has a lasting effect that evolves over time. For example, a treatment that increases installations of a mobile app may result in a drop in short-term profit due to promotional benefits like discounts. But the installation allows the customer to benefit from the app, which will increase future engagements and profit in the long term.

A limitation of standard randomized experiments is that they do not easily extend to measuring long-term effects. We can run a long experiment for months or years to measure the long-term impacts, which, however, is time-consuming and expensive. We can also design proxy signals that are believed to correlate with long-term engagements (Kohavi et al., 2009), but finding a reliable proxy is challenging in practice. Another solution is the surrogacy method, which estimates delayed treatment impacts from surrogate changes during the experiment (Athey et al., 2019).
However, it does not estimate long-term impacts resulting from long-term treatment exposure, but rather from short-term exposure during the experiment. Shi et al. (2022b) mitigate this limitation of standard randomized experiments by framing the long-term effect as a reinforcement learning (RL) problem. Their method is closely related to recent advances in infinite-horizon off-policy evaluation (OPE) (Liu et al., 2018; Nachum et al., 2019a; Xie et al., 2019; Kallus & Uehara, 2020; Uehara et al., 2020; Chandak et al., 2021). However, their solution relies on a stationary Markov assumption, which fails to capture real-world nonstationary dynamics.

Motivated by real-world scenarios where the observed state transitions are nonstationary, we consider a class of nonstationary problems, where the observation consists of two additive terms: an endogenous term that follows a stationary Markov process, and an exogenous term that is time-varying but independent of the policy. Based on this assumption, we develop a new algorithm to jointly estimate the long-term reward and the exogenous variables.

Our contributions are threefold. First, ours is a novel application of RL to estimating long-term treatment effects, which is challenging for standard randomized experiments. Second, we develop an estimator for a class of nonstationary problems motivated by real-world scenarios, and give a preliminary theoretical analysis. Third, we demonstrate promising results on two synthetic datasets and one online store dataset.

2.1. LONG-TERM TREATMENT EFFECTS

Let $\pi_0$ and $\pi_1$ be the control and treatment policies, used to serve individuals in their respective buckets. In the rideshare example, a policy may decide how to match a driver to a nearby request. During the experiment, each individual (the driver) is randomly assigned to one of the policy groups, and we observe a sequence of behavioral features of that individual under the influence of the assigned policy. We use the variable $D \in \{0, 1\}$ to denote the random assignment of an individual to one of the policies. The observed features are denoted as a sequence of random variables $O_0, O_1, \ldots, O_t, \ldots$ in $\mathbb{R}^d$, where the subscript $t$ indicates the time step in the sequence. A time step may be one day or one week, depending on the application. The feature $O_t$ consists of information like the number of pickup orders.

We are interested in estimating the difference in average long-term reward between the treatment and control policies:
$$\Delta = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, D = 1\Big] - \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, D = 0\Big],$$
where $\mathbb{E}$ averages over individuals and their stochastic sequences of engagements, $R_t = r(O_t)$ is the reward signal (e.g., customer rating) at time step $t$, following a pre-defined reward function $r: \mathbb{R}^d \to \mathbb{R}$, and $\gamma \in (0, 1)$ is the discount factor. The discount factor $\gamma$ is a hyper-parameter specified by the decision maker to indicate how much they value future reward over the present. The closer $\gamma$ is to 1, the greater the weight future rewards carry in the discounted sum.

Suppose we have run a randomized experiment with the two policies for a short period of $T$ steps. In the experiment, a set of $n$ individuals are randomly split and exposed to one of the two policies $\pi_0$ and $\pi_1$. We denote by $d_j \in \{0, 1\}$ the policy assignment of individual $j$, and by $I_i$ the index set of individuals assigned to $\pi_i$, i.e., $j \in I_i$ iff $d_j = i$. The in-experiment trajectory of individual $j$ is $\tau_j = \{o_{j,0}, o_{j,1}, \ldots, o_{j,T}\}$. The in-experiment dataset is the collection of all individual data, $\mathcal{D}_n = \{(\tau_j, d_j)\}_{j=1}^n$.
Our goal is to find an estimator $\hat{\Delta}(\mathcal{D}_n) \approx \Delta$.
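To make the estimation target concrete, the following is a minimal sketch (not from the paper; all function names are ours) of the naive baseline: truncate the discounted sum at the experiment horizon $T$ and average within each bucket. This is exactly the short-window estimate whose limitations motivate the RL approach.

```python
import numpy as np

def truncated_discounted_sum(rewards, gamma):
    """In-experiment discounted return: sum_{t=0}^{T} gamma^t * r_t."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def naive_delta_estimate(trajectories, assignments, reward_fn, gamma):
    """Difference in mean truncated discounted reward, treatment minus control.

    trajectories: list of (T+1, d) arrays of observations o_{j,0..T}
    assignments:  list of 0/1 policy labels d_j
    reward_fn:    maps an observation vector to a scalar reward r(o)
    """
    sums = {0: [], 1: []}
    for traj, d in zip(trajectories, assignments):
        rewards = [reward_fn(o) for o in traj]
        sums[d].append(truncated_discounted_sum(rewards, gamma))
    return float(np.mean(sums[1]) - np.mean(sums[0]))
```

Because the sum stops at $T$, this estimator ignores all post-experiment reward, which is precisely the gap the average-reward RL formulation aims to close.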

2.2. ESTIMATION UNDER STATIONARY MARKOVIAN DYNAMICS

Inspired by recent advances in off-policy evaluation (OPE) (e.g., Liu et al., 2018; Nachum et al., 2019b), the simplest assumption is a fully observed Markov process, in which the observation at each time step fully determines the distribution of future observations under a stationary dynamics kernel. In this paper, we assume the dynamics kernel and reward function are both linear, following the setting in Parr et al. (2008). Linear representations are popular in the RL literature (e.g., Shi et al., 2022b), and are often preferable in industrial applications due to their simplicity and greater model interpretability.

Assumption 2.1 (Linear Dynamics). There is a matrix $M_i$ such that $\mathbb{E}[O_{t+1} \mid O_t = o, D = i] = M_i o$ for all $t \in \mathbb{N}$, $i \in \{0, 1\}$. (2)

Remark 2.2. Unlike standard RL, we do not have an explicit action for a policy. The difference between the control and treatment policies is revealed by their different transition matrices $M_0$ and $M_1$.

Assumption 2.3 (Linear Reward). There is a coefficient vector $\theta_r \in \mathbb{R}^d$ such that $r(O_t) = \theta_r^\top O_t$ for all $t \in \mathbb{N}$. (3)
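Under Assumptions 2.1 and 2.3, the expected discounted return from a starting observation has a closed form: $\mathbb{E}[\sum_t \gamma^t R_t \mid O_0 = o, D = i] = \theta_r^\top \sum_t (\gamma M_i)^t o = \theta_r^\top (I - \gamma M_i)^{-1} o$, provided the series converges. The sketch below (our own illustration, not the paper's algorithm) fits $M_i$ by least squares on observed one-step transitions and evaluates this closed-form value.

```python
import numpy as np

def fit_transition_matrix(trajectories):
    """Least-squares estimate of M from pooled one-step transitions.

    Solves min_M sum_{j,t} ||o_{j,t+1} - M o_{j,t}||^2 over a bucket's data.
    trajectories: list of (T+1, d) arrays of observations.
    """
    X = np.vstack([traj[:-1] for traj in trajectories])  # stacked o_t
    Y = np.vstack([traj[1:] for traj in trajectories])   # stacked o_{t+1}
    # lstsq solves X @ M.T ≈ Y, so transpose the solution to recover M.
    M_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return M_T.T

def long_term_value(theta_r, M, o0, gamma):
    """Closed-form discounted value under linear dynamics and linear reward:
    theta_r^T (I - gamma * M)^{-1} o0. Requires spectral radius of gamma*M < 1."""
    d = len(o0)
    return float(theta_r @ np.linalg.solve(np.eye(d) - gamma * M, o0))
```

Fitting one matrix per bucket ($M_0$ from control trajectories, $M_1$ from treatment) and differencing the two values yields a stationary-Markov estimate of $\Delta$; the nonstationary setting studied in this paper is what breaks this simple recipe.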

