A REINFORCEMENT LEARNING APPROACH TO ESTIMATING LONG-TERM EFFECTS IN NONSTATIONARY ENVIRONMENTS

Abstract

Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare, and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation of randomized experiments is that they do not easily extend to measuring long-term effects, since running long experiments is time-consuming and expensive. In this paper, we take a reinforcement learning (RL) approach that estimates the average reward in a Markov process. Motivated by real-world scenarios where the observed state transitions are nonstationary, we develop a new algorithm for a class of nonstationary problems, and demonstrate promising results on two synthetic datasets and one online store dataset.

1. INTRODUCTION

Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare, and other applications. In an experiment, units such as customers or patients are randomly split into a treatment bucket and a control bucket. For example, in a rideshare app, drivers in the control and treatment buckets are matched to customers in different ways (e.g., with different spatial ranges or different ranking functions). After exposing customers to one of these options for a period of time, usually a few days or weeks, we can record the corresponding customer engagements and run a statistical hypothesis test on the engagement data to detect whether there is a statistically significant difference in customer preference for treatment over control. The result informs whether the app should launch the treatment or the control.

While this method has been widely successful (e.g., in online applications (Kohavi et al., 2020)), it typically measures the treatment effect only during a short experiment window. However, in many problems, a treatment has a lasting effect that evolves over time. For example, a treatment that increases installations of a mobile app may cause a drop in short-term profit due to promotional benefits such as discounts. But the installation allows the customer to benefit from the app, which increases future engagements and profit in the long term.

A limitation of standard randomized experiments is that they do not easily extend to measuring long-term effects. We can run a long experiment for months or years to measure long-term impacts, but doing so is time-consuming and expensive. We can also design proxy signals that are believed to correlate with long-term engagements (Kohavi et al., 2009), but finding a reliable proxy is challenging in practice. Another solution is the surrogacy method, which estimates delayed treatment impacts from surrogate changes observed during the experiment (Athey et al., 2019).
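To make the hypothesis-testing step concrete, the sketch below compares engagement between a control and a treatment bucket using Welch's t-statistic. The engagement data, the effect size, and the bucket sizes are all simulated and purely hypothetical; in a real experiment one would use observed per-customer engagement metrics and a proper p-value computation.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_b - mean_a) / se


# Hypothetical engagement data: daily sessions per customer in each bucket,
# with a small simulated treatment lift of +0.2 sessions on average.
random.seed(0)
control = [random.gauss(5.0, 1.0) for _ in range(500)]
treatment = [random.gauss(5.2, 1.0) for _ in range(500)]

t_stat = welch_t(control, treatment)
print(f"Welch t-statistic: {t_stat:.2f}")
# A large |t| (e.g., beyond ~1.96 for a two-sided 5% level with these
# sample sizes) indicates a statistically significant difference.
```

Note that this short-window comparison is exactly the quantity that can miss delayed effects: the +0.2 lift measured here says nothing about how engagement evolves after the experiment ends.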
However, the surrogacy method does not estimate long-term impacts resulting from long-term treatment exposure, but rather from short-term exposure during the experiment. Shi et al. (2022b) mitigate the limitation of standard randomized experiments by framing the long-term effect as a reinforcement learning (RL) problem. Their method is closely related to recent advances in infinite-horizon off-policy evaluation (OPE) (Liu et al., 2018; Nachum et al., 2019a; Xie et al., 2019; Kallus & Uehara, 2020; Uehara et al., 2020; Chandak et al., 2021). However, their solution relies on a stationary Markov assumption, which fails to capture real-world nonstationary dynamics. Motivated by real-world scenarios where the observed state transitions are nonstationary, we consider a class of nonstationary problems, where the observation consists of two additive terms: an endogenous term that follows a stationary Markov process, and an exogenous

