A REINFORCEMENT LEARNING APPROACH TO ESTIMATING LONG-TERM EFFECTS IN NONSTATIONARY ENVIRONMENTS

Abstract

Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare, and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation of randomized experiments is that they do not easily extend to measure long-term effects, since running long experiments is time-consuming and expensive. In this paper, we take a reinforcement learning (RL) approach that estimates the average reward in a Markov process. Motivated by real-world scenarios where the observed state transitions are nonstationary, we develop a new algorithm for a class of nonstationary problems, and demonstrate promising results on two synthetic datasets and one online store dataset.

1. INTRODUCTION

Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare, and other applications. In an experiment, units like customers or patients are randomly split into a treatment bucket and a control bucket. For example, in a rideshare app, drivers in the control and treatment buckets are matched to customers in different ways (e.g., with different spatial ranges or different ranking functions). After we expose customers to one of these options for a period of time, usually a few days or weeks, we can record the corresponding customer engagements, and run a statistical hypothesis test on the engagement data to detect whether there is a statistically significant difference in customer preference of treatment over control. The result informs whether the app should launch the treatment or the control. While this method has been widely successful (e.g., in online applications (Kohavi et al., 2020)), it typically measures the treatment effect during a short experiment window. However, in many problems, a treatment has a lasting effect that evolves over time. For example, a treatment that increases installation of a mobile app may result in a drop of short-term profit due to promotional benefits like discounts. But the installation allows the customer to benefit from the app, which will increase future engagements and profit in the long term. A limitation of standard randomized experiments is that they do not easily extend to measure long-term effects. We can run a long experiment for months or years to measure the long-term impacts, but this is time-consuming and expensive. We can also design proxy signals that are believed to correlate with long-term engagements (Kohavi et al., 2009), but finding a reliable proxy is challenging in practice. Another solution is the surrogacy method, which estimates delayed treatment impacts from surrogate changes during the experiment (Athey et al., 2019).
However, it does not estimate long-term impacts resulting from long-term treatment exposure, but rather from short-term exposure during the experiment. Shi et al. (2022b) mitigate the limitations of standard randomized experiments by framing the long-term effect as a reinforcement learning (RL) problem. Their method is closely related to recent advances in infinite-horizon off-policy evaluation (OPE) (Liu et al., 2018; Nachum et al., 2019a; Xie et al., 2019; Kallus & Uehara, 2020; Uehara et al., 2020; Chandak et al., 2021). However, their solution relies on a stationary Markov assumption, which fails to capture real-world nonstationary dynamics. Motivated by real-world scenarios where the observed state transitions are nonstationary, we consider a class of nonstationary problems, where the observation consists of two additive terms: an endogenous term that follows a stationary Markov process, and an exogenous term that is time-varying but independent of the policy. Based on this assumption, we develop a new algorithm to jointly estimate the long-term reward and the exogenous variables. Our contributions are threefold. First, ours is a novel application of RL to estimate long-term treatment effects, which is challenging for standard randomized experiments. Second, we develop an estimator for a class of nonstationary problems motivated by real-world scenarios, and give a preliminary theoretical analysis. Third, we demonstrate promising results on two synthetic datasets and one online store dataset.

2. BACKGROUND

2.1 LONG-TERM TREATMENT EFFECTS

Let $\pi_0$ and $\pi_1$ be the control and treatment policies, used to serve individuals in the respective buckets. In the rideshare example, a policy may decide how to match a driver to a nearby request. During the experiment, each individual (the driver) is randomly assigned to one of the policy groups, and we observe a sequence of behavior features of that individual under the influence of the assigned policy. We use the variable $D \in \{0, 1\}$ to denote the random assignment of an individual to one of the policies. The observed features are denoted as a sequence of random variables $O_0, O_1, \ldots, O_t, \ldots$ in $\mathbb{R}^d$, where the subscript $t$ indicates the time step in the sequence. A time step may be one day or one week, depending on the application. The feature $O_t$ consists of information like the number of pickup orders. We are interested in estimating the difference in average long-term reward between the treatment and control policies:
$$\Delta = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, D = 1\Big] - \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, D = 0\Big], \qquad (1)$$
where $\mathbb{E}$ averages over individuals and their stochastic sequences of engagements, $R_t = r(O_t)$ is the reward signal (e.g., customer rating) at time step $t$, following a pre-defined reward function $r: \mathbb{R}^d \to \mathbb{R}$, and $\gamma \in (0, 1)$ is the discount factor. The discount factor $\gamma$ is a hyper-parameter specified by the decision maker to indicate how much they value future reward over the present. The closer $\gamma$ is to 1, the greater weight future rewards carry in the discounted sum. Suppose we have run a randomized experiment with the two policies for a short period of $T$ steps. In the experiment, a set of $n$ individuals are randomly split and exposed to one of the two policies $\pi_0$ and $\pi_1$. We denote by $d_j \in \{0, 1\}$ the policy assignment of individual $j$, and by $I_i$ the index set of individuals assigned to $\pi_i$, i.e., $j \in I_i$ iff $d_j = i$. The in-experiment trajectory of individual $j$ is $\tau_j = \{o_{j,0}, o_{j,1}, \ldots, o_{j,T}\}$.
The in-experiment dataset is the collection of all individual data, $D_n = \{(\tau_j, d_j)\}_{j=1}^{n}$. Our goal is to find an estimator $\hat\Delta(D_n) \approx \Delta$.
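To make the setup concrete, here is a minimal sketch (ours, not the paper's) of how the in-experiment dataset can be stored and how a naive truncated-horizon estimate of $\Delta$ would be computed, in the spirit of the Naive Average baseline discussed later in Section 5. The array layout, the function name `naive_delta`, and all parameter names are our assumptions:

```python
import numpy as np

def naive_delta(obs, assign, theta_r, gamma=0.9):
    """Truncated Monte Carlo estimate of the treatment effect.

    obs:     (n, T+1, d) array; obs[j, t] is individual j's feature o_{j,t}
    assign:  (n,) array in {0, 1}, the policy assignment d_j
    theta_r: (d,) reward coefficients, so R_t = theta_r @ o_t

    Truncates the infinite discounted sum at the experiment horizon T,
    so it ignores all post-experiment reward -- exactly the bias that the
    long-term estimators in this paper are designed to remove.
    """
    rewards = obs @ theta_r                       # (n, T+1) per-step rewards
    discounts = gamma ** np.arange(obs.shape[1])  # gamma^t for t = 0..T
    returns = rewards @ discounts                 # truncated discounted return per individual
    return returns[assign == 1].mean() - returns[assign == 0].mean()
```

Because the sum stops at $T$, this estimator systematically misses effects that materialize after the experiment window, which motivates the model-based estimators below.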

2.2. ESTIMATION UNDER STATIONARY MARKOVIAN DYNAMICS

Inspired by recent advances in off-policy evaluation (OPE) (e.g., Liu et al., 2018; Nachum et al., 2019b), the simplest assumption is a fully observed Markov process: the observation at each time step fully predicts the future distribution under a stationary dynamics kernel. In this paper, we assume the dynamics kernel and reward function are both linear, following the setting of Parr et al. (2008). Linear representations are popular in the RL literature (e.g., Shi et al., 2022b), and often preferable in industrial applications due to their simplicity and greater model interpretability.

Assumption 2.1 (Linear Dynamics). There is a matrix $M_i$ such that
$$\mathbb{E}[O_{t+1} \mid O_t = o, D = i] = M_i o, \qquad \forall t \in \mathbb{N},\ i \in \{0, 1\}. \qquad (2)$$

Remark 2.2. Unlike standard RL, we do not have an explicit action for a policy. The difference between the control and treatment policies is captured by the different transition matrices $M_i$.

Assumption 2.3 (Linear Reward). There is a coefficient vector $\theta_r \in \mathbb{R}^d$ such that
$$r(O_t) = \theta_r^\top O_t, \qquad \forall t \in \mathbb{N}. \qquad (3)$$

Remark 2.4. The reward signal may be one of the observed features. For example, if we are interested in customer rating, and rating is one of the observed features, then $\theta_r$ is just a one-hot vector with 1 in the corresponding coordinate. When the reward is complex with unknown coefficients, we can use ordinary least-squares to estimate $\theta_r$.

Proposition 2.5. Under Assumptions 2.1 and 2.3, if the spectral norm of $M_i$ is smaller than $\frac{1}{\gamma}$, then the expected long-term reward of policy $\pi_i$, $v(\pi_i) := \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid D = i]$, is given by
$$v(\pi_i) = \theta_r^\top (I - \gamma M_i)^{-1} \bar O_0^{(i)}, \qquad (4)$$
where $\bar O_0^{(i)} := \mathbb{E}[O_0 \mid D = i]$.

The only remaining step is to estimate $\bar O_0^{(i)}$ and $M_i$. The former can be estimated directly by the Monte Carlo average of the experimental data: $\hat O_0^{(i)} = \frac{1}{n_i} \sum_{j \in I_i} o_{j,0}$, where $n_i = |I_i|$ is the number of individuals assigned to policy $\pi_i$.
To estimate the latter, we may use ordinary least-squares on the observed transitions:
$$\hat M_i = \Big( \sum_{j \in I_i} \sum_{t=0}^{T-1} o_{j,t+1}\, o_{j,t}^\top \Big) \Big( \sum_{j \in I_i} \sum_{t=0}^{T-1} o_{j,t}\, o_{j,t}^\top \Big)^{-1}. \qquad (5)$$
The detailed derivation can be found in Parr et al. (2008). Once we have the estimates $\hat v_i \approx v(\pi_i)$, the long-term impact in Eq. (1) can be estimated as $\hat\Delta = \hat v_1 - \hat v_0$.

Remark 2.6. Although this is a model-based estimator, it is equivalent to other OPE estimators under the linear Markovian assumption (e.g., Nachum et al., 2019b; Duan et al., 2020; Miyaguchi, 2021), and it enjoys similar statistical guarantees.
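The stationary plug-in estimator can be sketched in a few lines of NumPy. This is a minimal illustration of Eq. (5) and Proposition 2.5 under Assumptions 2.1 and 2.3; the function name and array layout are our own assumptions:

```python
import numpy as np

def stationary_delta(obs, assign, theta_r, gamma=0.9):
    """Plug-in stationary estimator: OLS dynamics (Eq. 5) + closed-form value (Eq. 4).

    obs:    (n, T+1, d) array of observed trajectories o_{j,t}
    assign: (n,) array in {0, 1}, the policy assignment d_j
    """
    d = obs.shape[2]
    v = []
    for i in (0, 1):
        o = obs[assign == i]                 # (n_i, T+1, d) group trajectories
        x = o[:, :-1].reshape(-1, d)         # all o_{j,t},   t = 0..T-1
        y = o[:, 1:].reshape(-1, d)          # all o_{j,t+1}
        # OLS: M_i = (sum y x^T)(sum x x^T)^{-1}
        M = (y.T @ x) @ np.linalg.inv(x.T @ x)
        o0_bar = o[:, 0].mean(axis=0)        # Monte Carlo estimate of E[O_0 | D=i]
        # v_i = theta_r^T (I - gamma M_i)^{-1} O0_bar, via a linear solve
        v.append(theta_r @ np.linalg.solve(np.eye(d) - gamma * M, o0_bar))
    return v[1] - v[0]
```

Using `np.linalg.solve` instead of explicitly inverting $(I - \gamma M_i)$ is the standard numerically safer choice; the spectral-norm condition of Proposition 2.5 must hold for the value to be meaningful.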

3. OUR METHOD

In Section 2.2, we assumed the observation $O_t$ follows a stationary Markov process, and derived a model-based closed-form solution based on the linear reward Assumption 2.3. In reality, this model assumption has two major limitations. First, real-world environments are nonstationary. For example, in a hotel reservation system, seasonality heavily influences the prediction of future booking counts. Our stationary assumption does not capture those seasonal changes, resulting in poorly learned models and inaccurate predictions of long-term treatment effects. Second, in practice, we are unable to ensure that the observed features fully capture the dynamics. OPE methods based on stationarity and full-observability assumptions are unlikely to work robustly in complex, real-life scenarios. Figure 1 illustrates nonstationarity in data from an online store (see Section 5 for more details). The figure shows how the weekly average of a business metric changes over a span of 5 months, for two policies (C for control, and T4 for treatment). Such highly nonstationary data, especially during special seasons towards the right end of the plot, are common. However, the difference between the two policy groups remains much more stable. This is expected, as both policies are affected by the same exogenous effects (seasonal variations in this example). Figure 1 motivates a relaxed model assumption (Section 3.1), which introduces a nonstationary exogenous component on top of a stationary hidden state $S_t$. Our new assumption is that the observation $O_t$ can be decomposed additively into two parts: an endogenous part that follows a stationary Markovian dynamic for each policy group (treatment or control), and an exogenous part that is time-varying but shared across all groups. Based on the new assumption, we propose an alternating minimization algorithm that jointly estimates both the transition dynamics and the exogenous variables.

3.1. NONSTATIONARY MODEL RELAXATION

We assume there is an exogenous noise vector $z_t$ at each time step $t$, representing linear additive exogenous noise from the uncontrollable outside world (such as seasonal effects), which applies uniformly to every individual in each treatment bucket. We relax Assumption 2.1 as follows:

Assumption 3.1 (Linear Additive Exogenous Noise). The observed feature $O_t$ is the sum of an endogenous hidden feature $S_t$ and the time-varying exogenous noise $z_t$:
$$O_t = S_t + z_t, \qquad \forall t \in \mathbb{N},$$
where $z_t$ does not depend on the policy or on any individual in the experiment, and $S_t$ follows the linear Markovian kernel with transition matrix $M_i$:
$$\mathbb{E}[S_{t+1} \mid S_t = s, D = i] = M_i s, \qquad \forall t \in \mathbb{N},\ i \in \{0, 1\}. \qquad (6)$$

Remark 3.2 (Explanation of the Linear Additive Model). Our linear additive model is inspired by the parallel trend assumption in the Difference-in-Differences (DID) estimator (Lechner et al., 2011). In real-world environments, it is impossible to capture all the covariates that may affect the dynamics. The linear additive exogenous noise $z_t$ can be seen as a drive from the outside world that is both unobserved and uncontrolled. For example, in an intelligent agriculture system, the highly nonstationary weather condition can be seen as exogenous noise that we cannot control, while the amounts of water and fertilizer that affect the growth of the plant can be seen as the hidden state controlled by a pre-defined stationary policy. The features we observe in the real world (e.g., the condition of the crop) are the sum of these two factors.

From Assumption 3.1 and the linear reward Assumption 2.3, the closed form of $v(\pi_i)$ can be rewritten as follows:

Proposition 3.3. Under Assumptions 3.1 and 2.3, suppose $v(z_\infty) := \sum_{t=0}^{\infty} \gamma^t \theta_r^\top z_t < \infty$ and the spectral norm of $M_i$ is smaller than $\frac{1}{\gamma}$. Then the expected long-term reward is
$$v(\pi_i) = \theta_r^\top (I - \gamma M_i)^{-1} \bar S_0^{(i)} + v(z_\infty), \qquad (7)$$
where $\bar S_0^{(i)} = \mathbb{E}[S_0 \mid D = i]$.

The long-term reward in Eq.
(7) contains $v(z_\infty)$, which depends on the unknown exogenous noise sequence outside the experimental window and is thus unpredictable. However, the long-term treatment effect, $\Delta(\pi_1, \pi_0) = v(\pi_1) - v(\pi_0)$, cancels out the dependence on the exogenous term $v(z_\infty)$. For simplicity, we redefine $v(\pi_i) = \theta_r^\top (I - \gamma M_i)^{-1} \bar S_0^{(i)}$ without the term $v(z_\infty)$. Therefore, we only need to estimate $\bar S_0^{(i)}$ and $M_i$. Once we have access to $\hat z_0$, we can estimate $\bar S_0^{(i)}$ by a Monte Carlo average: $\hat S_0^{(i)} = \frac{1}{n_i} \sum_{j \in I_i} o_{j,0} - \hat z_0$. The next question is how to estimate the in-experiment exogenous variables $z_t$ and the underlying transition kernels.

3.2. OPTIMIZATION FRAMEWORK

We propose to optimize $\{z_t\}_{0 \le t \le T}$ and $\{M_0, M_1\}$ jointly under a single loss function, in the same spirit of reducing the reconstruction loss of each transition pair as in the model-based approach. For each individual $j$ in treatment group $i$, Assumption 3.1 implies that at time step $t+1$ the observation $o_{j,t+1}$ can be written as
$$o_{j,t+1} - z_{t+1} = M_i (o_{j,t} - z_t) + \varepsilon_{j,t}, \qquad \forall j \in I_i,\ 0 \le t \le T-1, \qquad (8)$$
where $\varepsilon_{j,t}$ is a zero-mean noise term, so that $M_i (o_{j,t} - z_t) = \mathbb{E}[S_{t+1} \mid S_t = o_{j,t} - z_t, D = i]$. Inspired by Eq. (8), given the observation history $D_n$, we minimize the empirical reconstruction risk over all transition pairs $(o_{j,t}, o_{j,t+1})$ with the loss function
$$L(M_0, M_1, \{z_t\}_{0 \le t \le T}; D_n) = \sum_{i=0}^{1} \sum_{j \in I_i} \sum_{t=0}^{T-1} \big\| o_{j,t+1} - z_{t+1} - M_i (o_{j,t} - z_t) \big\|_2^2. \qquad (9)$$
To simplify the notation, Eq. (9) can be rewritten in the vectorized form
$$L(M_0, M_1, z; D_n) = \sum_{i=0}^{1} \sum_{j \in I_i} \| A_i (o_j - z) \|_2^2, \qquad (10)$$
where $o_j = (o_{j,0}^\top, o_{j,1}^\top, \ldots, o_{j,T}^\top)^\top$ and $z = (z_0^\top, z_1^\top, \ldots, z_T^\top)^\top$ are column vectors stacked over the experiment horizon, and $A_i$ is the $dT \times d(T+1)$ block matrix built from $M_i$:
$$A_i = \begin{pmatrix} -M_i & I & & \\ & -M_i & I & \\ & & \ddots & \ddots \\ & & & -M_i & I \end{pmatrix}_{dT \times d(T+1)}. \qquad (11)$$

Algorithm 1: Estimating Long-Term Effect Under Nonstationary Dynamics
Input: in-experiment training data $D_n = \{(\tau_j, d_j)\}_{j=1}^{n}$, where $\tau_j = (o_{j,0}, o_{j,1}, \ldots, o_{j,T})$ are the in-experiment observed features of individual $j$, and $d_j \in \{0, 1\}$ indicates the policy group to which individual $j$ is assigned.
Initialize the exogenous-noise estimate $\hat z = 0$.
Optimization: while not converged do
  Update $\hat M_i$ as the ordinary least-squares solution given the current $\hat z$:
  $$\hat M_i = \Big( \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t+1} - \hat z_{t+1})(o_{j,t} - \hat z_t)^\top \Big) \Big( \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t} - \hat z_t)(o_{j,t} - \hat z_t)^\top \Big)^{-1}.$$
  Update $\hat z$ according to Eq. (12):
  $$\hat z = (n_0 G_0 + n_1 G_1)^{-1} \Big( \sum_{i=0}^{1} \sum_{j \in I_i} G_i o_j \Big).$$
end while
Evaluation: compute $\hat v_i = \theta_r^\top (I - \gamma \hat M_i)^{-1} \big( \hat O_0^{(i)} - \hat z_0 \big)$, where $\hat O_0^{(i)} = \frac{1}{n_i} \sum_{j \in I_i} o_{j,0}$.
Output the long-term impact estimate $\hat\Delta = \hat v_1 - \hat v_0$.

3.3. ALTERNATING MINIMIZATION

To reconstruct $M_i$ and $z$, we apply alternating minimization to the loss function $L(M_0, M_1, z; D_n)$ in Eq. (10). By examining the zero-gradient point of the loss function, under a proper non-degeneracy assumption (see Appendix for details), we have:

Proposition 3.4. Suppose $(n_0 G_0 + n_1 G_1)$ is nonsingular. The minimizer of $z$ given $M_0, M_1$ has the closed form
$$\arg\min_z L(M_0, M_1, z; D_n) = (n_0 G_0 + n_1 G_1)^{-1} \Big( \sum_{i=0}^{1} \sum_{j \in I_i} G_i o_j \Big), \quad \text{where } G_i = A_i^\top A_i. \qquad (12)$$
The minimizer of $M_i$ given $z$ is similar to Eq. (5), except that we subtract the exogenous part $z_t$ from the observations:
$$\arg\min_{M_i} L(M_0, M_1, z; D_n) = \Big( \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t+1} - z_{t+1})(o_{j,t} - z_t)^\top \Big) \Big( \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t} - z_t)(o_{j,t} - z_t)^\top \Big)^{-1}. \qquad (13)$$
The final optimization procedure is summarized in Algorithm 1.
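The alternating updates can be sketched as a short NumPy routine. This is our own minimal implementation of the two closed-form steps, with the ridge terms of Section 3.5 included as optional parameters and a `z_init` option added for illustration; all names and the array layout are our assumptions, not the paper's reference code:

```python
import numpy as np

def build_A(M, T):
    """Block-banded dT x d(T+1) matrix of Eq. (11)."""
    d = M.shape[0]
    A = np.zeros((d * T, d * (T + 1)))
    for t in range(T):
        A[d*t:d*(t+1), d*t:d*(t+1)] = -M
        A[d*t:d*(t+1), d*(t+1):d*(t+2)] = np.eye(d)
    return A

def fit_nonstationary(obs, assign, theta_r, gamma=0.9,
                      n_iters=50, lam_M=0.0, lam_z=0.0, z_init=None):
    """Alternating minimization of Eq. (10), in the spirit of Algorithm 1.

    obs: (n, T+1, d); assign: (n,) in {0, 1}. Returns (delta_hat, Ms, z_hat).
    lam_M, lam_z are the ridge terms of Section 3.5 (0 = unregularized).
    """
    n, T1, d = obs.shape
    T = T1 - 1
    z = np.zeros((T1, d)) if z_init is None else z_init.copy()
    Ms = [np.eye(d), np.eye(d)]
    for _ in range(n_iters):
        # M-step: (regularized) OLS on de-noised transitions, Eq. (13)
        for i in (0, 1):
            o = obs[assign == i] - z            # broadcast z over individuals
            x = o[:, :-1].reshape(-1, d)        # o_{j,t} - z_t
            y = o[:, 1:].reshape(-1, d)         # o_{j,t+1} - z_{t+1}
            Ms[i] = (lam_M * np.eye(d) + y.T @ x) @ np.linalg.inv(
                     lam_M * np.eye(d) + x.T @ x)
        # z-step: closed form of Eq. (12), with optional ridge lam_z
        lhs = lam_z * np.eye(d * T1)
        rhs = np.zeros(d * T1)
        for i in (0, 1):
            G = build_A(Ms[i], T).T @ build_A(Ms[i], T)   # G_i = A_i^T A_i
            group = obs[assign == i].reshape(-1, d * T1)  # stacked o_j rows
            lhs += len(group) * G
            rhs += G @ group.sum(axis=0)
        z = np.linalg.solve(lhs, rhs).reshape(T1, d)
    # evaluation: v_i = theta^T (I - gamma M_i)^{-1} (mean o_0 - z_0)
    v = [theta_r @ np.linalg.solve(np.eye(d) - gamma * Ms[i],
         obs[assign == i][:, 0].mean(axis=0) - z[0]) for i in (0, 1)]
    return v[1] - v[0], Ms, z
```

Since each step minimizes the (regularized) loss exactly in one block of variables, the objective is nonincreasing across iterations, though convergence to the global minimum is not guaranteed in general.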

3.4. THEORETICAL ANALYSIS

We give a preliminary theoretical analysis in this section to give readers some insight into how good our estimator is when partial oracle information is given. We will extend our analysis to quantify the error of the estimator at the convergence point of alternating minimization in future work. To simplify the analysis, we first assume access to the true transition matrices $M_i$, and our goal is to quantify the error between $\hat v(\pi_i)$ and the true policy value $v(\pi_i)$ for each policy $\pi_i$.

Proposition 3.5. Suppose the noise and matrices are bounded as in Assumptions A.1 and A.2, and suppose the individuals are split equally, $n_0 = n_1 = \frac{n}{2}$. Given access to the oracle transition matrices $M_i = M_i^*$, $i \in \{0, 1\}$, let $\hat z = \arg\min_z L(M_0^*, M_1^*, z; D_n)$. Plugging $\hat z$ into the estimate of $v(\pi_i)$, we have, with probability at least $1 - \delta$,
$$|\hat v(\pi_i) - v(\pi_i)| = O\Big(\tfrac{1}{\sqrt{n}}\Big).$$

In the second analysis, we assume access to an accurate $z$. In this case, the estimation of $M_i$ reduces to the stationary case of Assumption 2.1, where the hidden state variable $s_t = o_t - z_t$ is fully recovered. We follow the analysis of linear MDPs (e.g., Duan et al., 2020; Miyaguchi, 2021) to characterize the error.

Proposition 3.6 (Proposition 11 in Miyaguchi (2021)). Suppose we have access to the oracle exogenous noise $z^*$ during the experimental period, and let $\hat M_i = \arg\min_{M_i} L(\{M_i\}, z^*; D_n)$ as in Eq. (13). Under the assumptions of Proposition 11 in Miyaguchi (2021), the plug-in estimator $\hat v$ with $\hat M_i$ satisfies, with probability at least $1 - \delta$,
$$|\hat v(\pi_i) - v(\pi_i)| = O\big(n^{-\frac{1}{2d+2}}\big).$$

3.5. PRACTICAL CONSIDERATIONS

Regularizing the Transition Matrices. Degenerate cases can arise during alternating minimization when either 1) the spectral norm of $M_i$ is too large, i.e., $\|M_i\|_2 \ge \frac{1}{\gamma}$, so that the long-term operator $(I - \gamma M_i)^{-1} = \sum_{t=0}^{\infty} \gamma^t M_i^t$ in Eq. (7) diverges, or 2) the matrix inversion in the update of $M_i$ in Eq. (13) is ill-defined. To avoid these scenarios and stabilize the computation, we add a regularization term $\lambda_i \|M_i - I_d\|_2^2$ in our experiments. The intuition is that the transition matrix should be close to the identity matrix, since in practice the treatment policy typically deviates from the control policy in an incremental manner. After adding the regularization, the closed-form minimizer of $M_i$ of the regularized loss becomes:
$$\hat M_i = \Big( \lambda_i I_d + \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t+1} - z_{t+1})(o_{j,t} - z_t)^\top \Big) \Big( \lambda_i I_d + \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t} - z_t)(o_{j,t} - z_t)^\top \Big)^{-1}.$$

Regularizing the Exogenous Variable. A challenge in deriving the closed-form $\hat z$ in Eq. (12) is that $n_0 G_0 + n_1 G_1$ can be degenerate or nearly degenerate. By definition, each $G_i$ is singular. Moreover, without control of the minimal eigenvalue of $(n_0 G_0 + n_1 G_1)$ (e.g., when it is close to zero), the update of $z$ is uncontrolled, and the noise variance can be magnified in the direction of the minimal eigenvector. It is therefore crucial to regularize $z$. To handle such degenerate circumstances, a natural idea is to regularize the $\ell_2$ norm of $z$, giving the regularized loss
$$L_\lambda(z, M_0, M_1; D_n) = L(z, M_0, M_1; D_n) + \lambda_z \|z\|_2^2. \qquad (14)$$
Its minimizer in $z$ is
$$\hat z = (\lambda_z I + n_0 G_0 + n_1 G_1)^{-1} \Big( \sum_{i=0}^{1} \sum_{j \in I_i} G_i o_j \Big),$$
where $I$ is the identity matrix of dimension $d(T+1)$. It is worth mentioning that as the regularization parameter $\lambda_z$ increases to infinity, $\hat z$ goes to 0, and the solution reduces to the stationary case of Assumption 2.1.
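The ridge-regularized $z$ update can be sketched as below, together with the limiting behavior just described: as $\lambda_z$ grows, the solution shrinks toward 0, recovering the stationary case. This is a minimal sketch of ours; the function name and data layout are assumptions:

```python
import numpy as np

def ridge_z(groups, Ms, T, d, lam_z):
    """Ridge z update: (lam_z I + sum_i n_i G_i)^{-1} (sum_i G_i sum_j o_j).

    groups: list of (n_i, d*(T+1)) arrays of stacked trajectories, one per policy;
    Ms:     list of (d, d) transition matrices, one per policy.
    """
    def build_A(M):  # block-banded matrix of Eq. (11)
        A = np.zeros((d * T, d * (T + 1)))
        for t in range(T):
            A[d*t:d*(t+1), d*t:d*(t+1)] = -M
            A[d*t:d*(t+1), d*(t+1):d*(t+2)] = np.eye(d)
        return A
    lhs = lam_z * np.eye(d * (T + 1))
    rhs = np.zeros(d * (T + 1))
    for o_group, M in zip(groups, Ms):
        G = build_A(M).T @ build_A(M)        # G_i = A_i^T A_i (singular by itself)
        lhs += len(o_group) * G
        rhs += G @ o_group.sum(axis=0)
    return np.linalg.solve(lhs, rhs)         # lam_z > 0 guarantees invertibility
```

The ridge term guarantees the linear system is invertible even when the unregularized $n_0 G_0 + n_1 G_1$ is (nearly) singular.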

Extension to Multiple Treatment Policies

The optimization framework can be easily extended to the case of multiple treatment policies. Suppose we have $k$ different treatment policies $\pi_1, \pi_2, \ldots, \pi_k$, and let $\pi_0$ be the control policy. The closed-form solution for $\hat z$ over the datasets of the different treatment groups is
$$\hat z_\lambda = \Big( \lambda I + \sum_{i=0}^{k} n_i G_i \Big)^{-1} \Big( \sum_{i=0}^{k} \sum_{j \in I_i} G_i o_j \Big),$$
and the closed-form update for $M_i$ stays the same. The final estimate of the treatment effect of policy $\pi_i$ is $\hat\Delta_i = \hat v_i - \hat v_0$.

4. RELATED WORK

Estimating long-term treatment effects. Our work is related to causal inference with temporal data. The surrogate index method (Athey et al., 2019; 2020) makes a different assumption: that the long-term effect is independent of the treatment conditioned on the surrogate index measured during the experiment. It then estimates long-term impacts resulting from short-term exposure during the experiment. In contrast, our work aims to estimate long-term impacts resulting from long-term exposure. Time series methods (e.g., Bojinov & Shephard, 2019) require probabilistic treatments, which allow an individual to be exposed to different treatments at different time periods during an experiment. They estimate the temporal treatment effect averaged over all time steps, which differs from the traditional treatment effect averaged over randomized individuals. Our method draws inspiration from off-policy evaluation (OPE) and related areas, whose goal is to estimate the long-term policy value, usually from an offline dataset collected under different policies. Most early work focuses on the family of inverse propensity score estimators, which are prone to high variance in long-horizon problems (e.g., Precup et al., 2000; Murphy et al., 2001; Jiang & Li, 2016). Recently, there has been growing interest in long- and even infinite-horizon settings (Liu et al., 2018; Nachum et al., 2019a; Xie et al., 2019; Tang et al., 2020; Uehara et al., 2020; Dai et al., 2020; Chandak et al., 2021). In particular, Shi et al. (2022b) consider a similar problem of estimating long-term impacts, which is comparable to our stationary baseline. However, these methods either rely on a stationarity assumption that is violated in many applications, or consider the general nonstationary Markov decision process (Kallus & Uehara, 2020), which does not leverage domain-specific assumptions.
RL in nonstationary or confounded environments. Our model is a special case of the Partially Observable Markov Decision Process (POMDP) (Åström, 1965; Kaelbling et al., 1998). OPE in general POMDPs remains challenging unless various assumptions are made (e.g., Tennenholtz et al., 2020; Bennett et al., 2021; Shi et al., 2022a). Most such assumptions concern the causal relations in the logged data, such as the relations between state, action, and confounding variables. In contrast, we make an assumption motivated by real-world data, which allows our estimator to cancel out exogenous variables from the observations. Our assumption is also related to MDPs with exogenous variables (e.g., Dietterich et al., 2018; Chitnis & Lozano-Pérez, 2020), and to the Dynamics Parameter MDP (DPMDP) and Hidden Parameter MDP (HiP-MDP) (Al-Shedivat et al., 2017; Xie et al., 2020). The exogenous-variable works assume the observed features can be partitioned into two groups, where the exogenous group is not affected by the action and the endogenous group evolves as in a typical MDP; the major challenge is to infer the right partition. Several recent works (e.g., Misra et al., 2020; Du et al., 2019; Efroni et al., 2021) combine exogenous variables with rich observations in RL. This differs from our assumption that the observation is a sum of the two parts, which is more natural in applications like e-commerce. DPMDP and HiP-MDP assume a meta task variable that is nonstationary and changes across time, while its dynamics can be captured by a sequential model. Our assumption can be viewed as a linear special case, but our focus is not to better characterize the system; rather, it is to remove the exogenous part for better predictions.

5. EXPERIMENTS

We evaluate our methods on three problems: a synthetic dataset, a dataset from a Type-1 Diabetes RL simulator (Xie, 2019), and a real-world dataset from an online store. The ground truth $\Delta$ is computed either from a true simulator or from the average of real experimental data over a long time period. We compare three methods: the plug-in estimator of the stationary solution in Eq. (4), its nonstationary variant in Algorithm 1, and a Naive Average baseline. The baseline directly uses the short-term reward average as the estimate of the long-term effect.

5.1. SYNTHETIC SIMULATION

The synthetic environment generates 4 random matrices $M_i$ for policies $\{\pi_i\}_{i=0}^{3}$ and a trajectory of random exogenous noise $\{z_t\}_{t=0}^{T}$. See details of the synthetic dynamics in Appendix C. The generated sequences follow the nonstationary dynamics with a parameter $\alpha$ controlling the scale of the exogenous noise: $o_{j,t} = s_{j,t} + \alpha z_t$ for all $j, t$. We collect $n$ trajectories for each policy until $t = T$ (with varying $T$). We vary the parameters of the generating sequences: the number $n$ of trajectories, the horizon $T$, the data dimension $d$, and the scale $\alpha$ of the exogenous noise. We plot the logarithmic Mean Square Error (MSE) of each method in Figure 2. The results show that our estimator (the green line) clearly outperforms all baselines. Moreover, Figure 2(d) shows that increasing the scale of the exogenous noise does not affect the estimation accuracy of our method.
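A generator for this kind of synthetic environment might look as follows. This is a hedged sketch of our own: the function name, parameter choices, and the way stable dynamics are drawn are our assumptions, not the paper's exact protocol from Appendix C:

```python
import numpy as np

def make_synthetic(n_policies=4, n=100, T=20, d=5, alpha=1.0, noise=0.01, seed=0):
    """Generate o_{j,t} = s_{j,t} + alpha * z_t with per-policy linear dynamics."""
    rng = np.random.default_rng(seed)
    # random dynamics shrunk toward a contraction, so spectral norms stay < 1/gamma
    Ms = [0.7 * np.eye(d) + 0.1 * rng.standard_normal((d, d)) / np.sqrt(d)
          for _ in range(n_policies)]
    z = rng.standard_normal((T + 1, d))        # shared exogenous sequence
    data = []
    for M in Ms:
        s = rng.standard_normal((n, d))        # endogenous states s_{j,0}
        traj = [s]
        for t in range(T):
            s = s @ M.T + noise * rng.standard_normal((n, d))
            traj.append(s)
        data.append(np.stack(traj, axis=1) + alpha * z)   # (n, T+1, d) observations
    return Ms, z, data
```

Feeding `data` for a treatment/control pair into the estimators above, with varying `n`, `T`, `d`, and `alpha`, reproduces the kind of sweep shown in Figure 2.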

5.2. TYPE-1 DIABETE SIMULATOR

This environment is modified from an open-source implementation of the FDA-approved Type-1 Diabetes simulator (T1DMS) (Man et al., 2014). The environment simulates two days in an in-silico patient's life. Consumption of a meal increases the blood-glucose level in the body. If the level is too high, the patient suffers from hyperglycemia; if it is too low, the patient suffers from hypoglycemia. The goal is to control the blood-glucose level by regulating the insulin dosage, to minimize the risk associated with both hyperglycemia and hypoglycemia. We modify the Basal-Bolus (BB) policy (Bastani, 2014) in the codebase as the control policy, and set two glucose target levels with different noise levels as our two treatment policies. We collect data from the first 12 hours under all three policies, with 5000 randomized patients in each policy group, and use this data to predict the long-term effect. The observed feature is 2-dimensional: the glucose level (CGM) and the amount of insulin injection. The nonstationarity comes from the timing and amount of meal consumption, which is time-varying but shared by all patients. We average a 2-day simulation window over 250,000 random patients as the ground-truth treatment effect between policy groups.

Figure 3: Results of the Type-1 Diabetes environment. We vary two parameters in the simulation, the number of patients and the in-experiment horizon, to compare the performance of different methods under two evaluation metrics.

Similar to the synthetic simulator, we vary the number of patients and the experimental period. Figure 3 shows that the nonstationary method predicts both CGM and the amount of insulin injection more accurately than the stationary method. Even though the simulator is nonlinear, our simple linear additive exogenous noise assumption still captures the small local changes, which are approximately linear.
5.3. DATA

We test our methods on 4 long-running experiments in an online store, with a total of 7 different treatment policies (some experiments have more than one treatment). Each experiment has 1 control policy. We evaluate 4 business metrics related to customer purchases in the store (Metrics 1-4), and use d = 17 features. All experiments lasted 12 weeks. We treat the first 5 weeks as the experiment window, and use data from those weeks to estimate long-term impacts on the 4 metrics. The trailing 7-week averages of the metrics are used as ground truth to evaluate the accuracy of the various estimators. Table 1 reports the median of the Mean Absolute Percentage Error (MAPE) of the estimators; see full results in Appendix C.

Given the high cost of such long-running experiments, we cannot collect more data points for comparison or for computing statistical significance. That said, the reported numbers give good evidence that our method produces better predictions of long-term treatment effects than Naive Average. Furthermore, our method improves on the stationary baseline, suggesting the practical relevance of our nonstationary assumption and the effectiveness of the proposed estimator.

6. CONCLUSIONS

In this paper, we study how to estimate the long-term treatment effect using only in-experiment data in a nonstationary environment. We propose a novel nonstationary RL model and an algorithm to make predictions. A major limitation is the linearity assumption in both the dynamics model and the additive exogenous part; when the real-world system is highly nonlinear, the predicted value can be biased. Future directions include relaxing our model to the nonlinear case to better capture real-world environments.

Appendix

A PROOFS

In this section, we provide detailed proofs of the results in the main text. To keep the section self-contained, we briefly introduce the notation below, and adopt the regularized, multiple-policy-group setting throughout the appendix:
• $n$: total number of individuals.
• $I_i$: the index set for policy $\pi_i$; $n_i = |I_i|$ is the number of individuals under policy $\pi_i$.
• $k$: total number of policy groups.
• $D_n$: dataset of the $n$ individuals over the experimental period.
We denote the ground-truth dynamics $M_i^*$ and the ground-truth exogenous noise $z^*$ with a star to distinguish them from the variables $M_i$ and $z$ in the optimization process.

A.1 ASSUMPTIONS

The dynamics under our linear additive exogenous noise model (Assumption 3.1) can be rewritten as the following equation:
$$o_{j,t+1} - z_{t+1}^* = M_i^* (o_{j,t} - z_t^*) + \varepsilon_{j,t}, \qquad \forall j \in I_i,\ 0 \le t \le T-1, \qquad (16)$$
where $\varepsilon_{j,t}$ is a zero-mean noise term. Let $\varepsilon_j = (\varepsilon_{j,0}^\top, \varepsilon_{j,1}^\top, \ldots, \varepsilon_{j,T-1}^\top)^\top \in \mathbb{R}^{dT}$. The sequence $\{\varepsilon_j\}_{1 \le j \le n}$ forms a martingale difference sequence: $\mathbb{E}[\varepsilon_j \mid \mathcal{F}_{j-1}] = 0$, where the filtration $\mathcal{F}_{j-1} = \sigma(o_1, \ldots, o_{j-1})$ is the information up to the first $j-1$ individuals. We make an additional boundedness assumption on the zero-mean noise term for the proofs:

Assumption A.1 (Bounded Noise). Let $\varepsilon_{j,t} = (o_{j,t+1} - z_{t+1}^*) - M_i^* (o_{j,t} - z_t^*)$, $j \in I_i$, be the residual of the transition under the true transition matrix $M_i^*$. We assume $\|\varepsilon_j\|_2 \le C_\varepsilon$ for all $j$, where $C_\varepsilon$ is a uniform constant independent of the policy assignment.

We also assume that the empirical covariance matrices appearing in the intermediate steps of the calculation are bounded:

Assumption A.2 (Bounded Norms for Matrices).
1. $\|M_i^*\| \le C_{M_i} < \frac{1}{\gamma}$, $\forall i$.
2. $\|(\Lambda_n^*/n)^{-1}\| \le C_\Lambda$.

A.2 LOSS FUNCTION AND ALTERNATING MINIMIZATION

Our loss function can be written as
$$L(\{M_i\}_{1 \le i \le k}, z; D_n) = \sum_{i=1}^{k} \sum_{j \in I_i} \| A_i (z - o_j) \|_2^2 + \lambda_z \|z\|_2^2 + \sum_{i=1}^{k} \lambda_i \| M_i - I_d \|_F^2. \qquad (17)$$

Lemma A.3. Fix $\{M_i\}$ and denote $G_i = A_i^\top A_i$, where $A_i$ is defined in Eq. (11). The minimizer $z(\{M_i\}) = \arg\min_z L(\{M_i\}_{1 \le i \le k}, z; D_n)$ is
$$z(\{M_i\}) = \Big( \lambda_z I_{d(T+1)} + \sum_{i=1}^{k} n_i G_i \Big)^{-1} \Big( \sum_{i=1}^{k} \sum_{j \in I_i} G_i o_j \Big). \qquad (18)$$

Proof. Setting the gradient of the loss function to zero, we have
$$0 = \nabla_z L(\{M_i\}_{1 \le i \le k}, z; D_n) = 2 \sum_{i=1}^{k} \sum_{j \in I_i} G_i (z - o_j) + 2 \lambda_z z,$$
which implies
$$z(\{M_i\}) = \Big( \lambda_z I_{d(T+1)} + \sum_{i=1}^{k} n_i G_i \Big)^{-1} \Big( \sum_{i=1}^{k} \sum_{j \in I_i} G_i o_j \Big).$$
Here $G_i = A_i^\top A_i$ is positive semi-definite, so the matrix inverted on the right-hand side always exists when $\lambda_z > 0$. □

Similarly, we can obtain the minimizer of $M_i$ with $z$ fixed.

Lemma A.4. Fixing $z$, the minimizer of $M_i$ can be written as
$$M_i(z) = \Big( \lambda_i I_d + \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t+1} - z_{t+1})(o_{j,t} - z_t)^\top \Big) \Big( \lambda_i I_d + \sum_{j \in I_i} \sum_{t=0}^{T-1} (o_{j,t} - z_t)(o_{j,t} - z_t)^\top \Big)^{-1}. \qquad (20)$$

Proof. The proof similarly follows by setting the gradient with respect to $M_i$ to zero. □

If we set $\lambda_i = 0$ and $z = 0$, the minimization reduces to the estimate of $M$ in Eq. (5).

A.3 ERROR ANALYSIS

Lemma A.5. Let $M^*_i$ be the true dynamics of the underlying state. We have
$$z(\{M^*_i\}) - z^* = -\lambda_z (\Lambda^*_n)^{-1} z^* + (\Lambda^*_n)^{-1} \Big(\sum_{i=1}^k \sum_{j\in I_i} A_i^\top \varepsilon_j\Big), \quad \text{where } \Lambda^*_n = \lambda_z I_{d\times(T+1)} + \sum_{i=1}^k n_i G^*_i.$$

Proof. Expanding the definition of $z(\{M^*_i\})$, we have
$$z(\{M^*_i\}) = (\Lambda^*_n)^{-1}\Big(\sum_{i=1}^k\sum_{j\in I_i} G_i o_j\Big) = (\Lambda^*_n)^{-1}\Big(\sum_{i=1}^k\sum_{j\in I_i} A_i^\top (A_i o_j)\Big) = (\Lambda^*_n)^{-1}\Big(\sum_{i=1}^k\sum_{j\in I_i} A_i^\top (A_i z^* + \varepsilon_j)\Big)$$
$$= (\Lambda^*_n)^{-1}\Big(\Lambda^*_n z^* - \lambda_z z^* + \sum_{i=1}^k\sum_{j\in I_i} A_i^\top \varepsilon_j\Big) = z^* - \lambda_z(\Lambda^*_n)^{-1} z^* + (\Lambda^*_n)^{-1}\Big(\sum_{i=1}^k\sum_{j\in I_i} A_i^\top \varepsilon_j\Big).$$

Lemma A.6. Let $z^*$ be the true exogenous noise. We have
$$M_i(z^*) - M^*_i = \Big(\lambda_i (I_d - M^*_i) + \sum_{j\in I_i}\sum_{t=1}^{T-1} \varepsilon_{j,t}(o_{j,t} - z^*_t)^\top\Big)(\lambda_i I_d + \Sigma^*_n)^{-1},$$
where $\Sigma^*_n = \sum_{j\in I_i}\sum_{t=1}^{T-1} (o_{j,t} - z^*_t)(o_{j,t} - z^*_t)^\top$ is the empirical covariance matrix.

Proof. Expanding the definition of $M_i(z^*)$, we have
$$M_i(z^*) = \Big(\lambda_i I_d + \sum_{j\in I_i}\sum_{t=1}^{T-1} (o_{j,t+1} - z^*_{t+1})(o_{j,t} - z^*_t)^\top\Big)(\lambda_i I_d + \Sigma^*_n)^{-1} \qquad (23)$$
$$= \Big(\lambda_i I_d + \sum_{j\in I_i}\sum_{t=1}^{T-1} \big(\varepsilon_{j,t} + M^*_i(o_{j,t} - z^*_t)\big)(o_{j,t} - z^*_t)^\top\Big)(\lambda_i I_d + \Sigma^*_n)^{-1}$$
$$= M^*_i + \Big(\lambda_i (I_d - M^*_i) + \sum_{j\in I_i}\sum_{t=1}^{T-1} \varepsilon_{j,t}(o_{j,t} - z^*_t)^\top\Big)(\lambda_i I_d + \Sigma^*_n)^{-1}.$$

A.4 PROOF OF PROPOSITION 2.5

Proof. By induction, it is not hard to show that $\mathbb{E}[O_t \mid O_0 = o, D = i] = M_i^t o$. Taking the expectation over $O_0$, we have $\mathbb{E}[O_t \mid D = i] = M_i^t \mathbb{E}[O_0]$. By the definition of the long-term discounted reward $G$, we have
$$v(\pi_i) = \mathbb{E}\Big[\sum_{t=0}^\infty \gamma^t R_t \,\Big|\, D = i\Big] = \sum_{t=0}^\infty \gamma^t \mathbb{E}[\theta_r^\top O_t \mid D = i] = \theta_r^\top \sum_{t=0}^\infty \gamma^t M_i^t \mathbb{E}[O_0] = \theta_r^\top (I - \gamma M_i)^{-1} \mathbb{E}[O_0],$$
where the last equality holds when $\|M_i\| < \frac{1}{\gamma}$.

A.5 PROOF OF PROPOSITION 3.5

Proof. From Lemma A.5, with $\lambda_z = 0$ and assuming $(\Lambda^*_n)^{-1}$ exists, we have
$$\hat z - z^* = (\Lambda^*_n)^{-1}\Big(\sum_{i=0}^1 \sum_{j\in I_i} A_i^\top \varepsilon_j\Big).$$
Consider $\hat v(\pi_0)$ obtained by plugging in $\hat z$ and the true dynamics $M^*_0$. The error between $\hat v$ and $v$ is
$$\hat v(\pi_0) - v(\pi_0) = \theta_r^\top (I - \gamma M^*_0)^{-1} (\hat z_0 - z^*_0) := \beta_r^\top (\hat z_0 - z^*_0) = (\beta_r^\top, 0, \ldots, 0)(\hat z - z^*) = \tilde\beta_r^\top (\hat z - z^*),$$
where $\beta_r = (I - \gamma M^*_0)^{-\top} \theta_r$ and $\tilde\beta_r$ is the extended vector of $\beta_r$ obtained by filling in zeros at the other time steps. Expanding the difference $\hat z - z^*$, we have
$$\hat v(\pi_0) - v(\pi_0) = \tilde\beta_r^\top (\hat z - z^*) = \sum_{i=0}^1 \tilde\beta_r^\top (\Lambda^*_n)^{-1} A_i^\top \Big(\sum_{j\in I_i}\varepsilon_j\Big) = \sum_{i=0}^1 \tilde\beta_r^\top \Big(\frac{\Lambda^*_n}{n}\Big)^{-1} A_i^\top \Big(\sum_{j\in I_i}\frac{\varepsilon_j}{n}\Big) \le \|\tilde\beta_r\| \sum_{i=0}^1 \Big\|\Big(\frac{\Lambda^*_n}{n}\Big)^{-1} A_i\Big\| \,\Big\|\sum_{j\in I_i}\frac{\varepsilon_j}{n}\Big\|.$$
By Assumptions A.1 and A.2, the norm of $\tilde\beta_r$ equals that of $\beta_r$, which is bounded by $\|\beta_r\| \le \frac{1}{1-\gamma C_{M_i}}\|\theta_r\|$. The matrix norm in the middle factor is bounded by Assumption A.2. Finally, since $\varepsilon_j$ is norm-subGaussian (Jin et al., 2019), by a vector concentration inequality there exists a constant $c$ such that with probability at least $1-\delta$,
$$\Big\|\sum_{j\in I_i}\frac{\varepsilon_j}{n}\Big\| \le c\sqrt{\frac{\log(2dT/\delta)}{n}}.$$
In sum, the error is bounded by $O(\frac{1}{\sqrt n})$ with probability at least $1-\delta$, where the constant depends on $C_\varepsilon$, $C_{M_i}$, $C_\Lambda$, and $\|\theta_r\|$.

A.6 PROOF OF PROPOSITION 3.6

Proof. Since we have access to the ground-truth $z^*$, it remains to change the state to $s_{j,t} = o_{j,t} - z^*_t$, which reduces the problem back to a standard MDP. For the detailed proof, we refer to Proposition 11 in Miyaguchi (2021).
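The closed form of Proposition 2.5 can be checked numerically against a truncated version of the discounted series. The sketch below uses a randomly generated toy instance (the matrix $M$, reward vector $\theta_r$, and initial mean are our own illustrative values), scaled so that $\|M\| < 1/\gamma$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, gamma = 4, 0.9

# Hypothetical M with spectral norm < 1/gamma so the Neumann series converges.
M = rng.random((d, d))
M *= 0.8 / np.linalg.norm(M, 2)
theta = rng.random(d)   # reward weights theta_r
o0 = rng.random(d)      # stands in for E[O_0]

# Closed form from Proposition 2.5: theta_r^T (I - gamma M)^{-1} E[O_0].
v_closed = theta @ np.linalg.solve(np.eye(d) - gamma * M, o0)

# Truncated series sum_t gamma^t theta_r^T M^t E[O_0].
v_series, x = 0.0, o0.copy()
for t in range(2000):
    v_series += gamma ** t * theta @ x
    x = M @ x  # advance E[O_t] = M^t E[O_0]

print(abs(v_closed - v_series))  # ~0: series matches the closed form
```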

B REDUCE THE COMPUTATION COMPLEXITY WITH PRE-COMPUTATION

In this section, we explain how to reduce the computational complexity with pre-computation. The pre-computation itself has complexity $O(nTd^2 + d^3)$, where $d^2$ is the cost of each outer product (accumulated over the $n$ individuals and $T$ steps) and $d^3$ is the cost of the matrix inversion after summing up the matrices.

Pre-computation. Compute $M_i(0)$, the minimizer in Lemma A.4 evaluated at $z = 0$, together with the aggregated observation statistics $\bar o_t$.

In Each Iteration. The minimizer $M_i(z)$ in Lemma A.4 (with $\lambda_i = 0$) involves two $z$-dependent sums, each of which can be expanded around its pre-computed $z = 0$ counterpart. For the numerator,
$$\sum_{j\in I_i}\sum_{t=1}^{T-1} (o_{j,t+1} - z_{t+1})(o_{j,t} - z_t)^\top = \sum_{j\in I_i}\sum_{t=1}^{T-1} o_{j,t+1}\, o_{j,t}^\top - \sum_{t=1}^{T-1} z_{t+1}\, \bar o_t^\top - \sum_{t=1}^{T-1} \bar o_{t+1}\, z_t^\top + n_i \sum_{t=1}^{T-1} z_{t+1} z_t^\top,$$
where $\bar o_t = \sum_{j\in I_i} o_{j,t}$; the denominator expands analogously. Given the pre-computed sums, each update therefore requires only $O(Td^2)$ computation. Similarly, the computation of $z(G) = (\sum_i n_i G_i)^{-1}(\sum_i \sum_{j\in I_i} G_i o_j)$ requires $O(T^2 d^2)$ computation. Both steps are computationally scalable, since neither depends on the number of individuals $n$ (which is often much larger than $T$ and $d$).

Overall Computation Complexity. Suppose we run the alternating minimization for $k$ iterations; the total computational complexity is then $O(nTd^2 + d^3 + kT^2d^2)$. In practice, the number of individuals $n$ is far larger than the experiment horizon $T$ and the feature dimension $d$, so the computational cost essentially scales linearly with $n$.
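The pre-computation trick can be checked numerically. The sketch below (random toy data; we read $\bar o_t$ as the per-step aggregate $\sum_{j\in I_i} o_{j,t}$, which is our assumption about the notation) verifies that expanding the numerator sum of $M_i(z)$ in terms of pre-computed quantities matches the direct $O(nTd^2)$ evaluation:

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, n = 3, 5, 200
obs = rng.normal(size=(n, T + 1, d))  # o_{j,t} for one policy bucket
z = rng.normal(size=(T + 1, d))

# Direct numerator: sum over j and t of (o_{t+1} - z_{t+1})(o_t - z_t)^T.
direct = sum((obs[:, t + 1] - z[t + 1]).T @ (obs[:, t] - z[t])
             for t in range(1, T))

# Pre-computation: cross products and aggregates, computed once in O(nTd^2).
S = [obs[:, t + 1].T @ obs[:, t] for t in range(1, T)]  # sum_j o_{t+1} o_t^T
obar = obs.sum(axis=0)                                  # aggregate o-bar_t

# Each z-dependent rebuild now costs only O(Td^2), independent of n.
pre = sum(S[k] - np.outer(z[t + 1], obar[t]) - np.outer(obar[t + 1], z[t])
          + n * np.outer(z[t + 1], z[t])
          for k, t in enumerate(range(1, T)))

print(np.linalg.norm(direct - pre))  # ~0: the expansion matches
```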



https://github.com/jxx123/simglucose



Figure 1: An example of non-stationarity. The weekly average metric value is highly non-stationary during holiday season.

Figure 2: Results of the synthetic environment. We vary simulation parameters to compare the logarithmic MSE of various estimators: (a) number of trajectories; (b) horizon; (c) observation feature dimension; (d) scale of the exogenous noise.

Results in the online store dataset. The reported numbers are the median of MAPE over 7 different treatment policies. Columns correspond to business metrics of interest.


C EXPERIMENTS DETAILS C.1 SYNTHETIC SIMULATION

The synthetic environment generates 4 random matrices $M_i$ for the policies $\{\pi_i\}_{i=0}^3$, where each entry of $M_i$ is a positive number sampled from the uniform distribution on $(0, 1)$. We normalize each row so that it sums to 1, and set $\tilde M_i = 0.5I + 0.5M_i$ as the final transition matrix; the $0.5I$ component ensures the matrices are not too far from one another. We generate a set of i.i.d. random vectors $\eta_t \sim \mathcal{N}(0, 1.5I)$ and set $z_{t+1} = z_t + \eta_t$ recursively, and we let $\tilde z_t = \alpha_t z_t$ be the final exogenous noise, where $\alpha_t = e^{\beta_t}$ with $\beta_t \sim \mathcal{N}(0, 0.5I)$ i.i.d. All parameters of the dynamics ($z_t$ and $M_i$) are fixed once generated, and we use the dynamics to generate the observations for each individual, where $\varepsilon_t$ is independently drawn from a standard normal distribution and $\alpha$ controls the level of exogenous noise.
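The generation recipe above can be sketched directly in NumPy. This is an illustrative reproduction of the described procedure, with toy dimensions of our choosing; the observation-generation step (which uses $\varepsilon_t$ and $\alpha$) is omitted since its exact equation is not shown in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, n_policies = 5, 10, 4  # toy feature dim, horizon, number of policies

# Transition matrices: uniform (0,1) entries, rows normalized to sum to 1,
# then mixed with 0.5 I so the matrices stay close to one another.
Ms = []
for _ in range(n_policies):
    M = rng.uniform(0.0, 1.0, size=(d, d))
    M /= M.sum(axis=1, keepdims=True)  # each row sums to 1
    Ms.append(0.5 * np.eye(d) + 0.5 * M)

# Exogenous noise: random walk z_{t+1} = z_t + eta_t with eta_t ~ N(0, 1.5 I),
# rescaled elementwise by alpha_t = e^{beta_t}, beta_t ~ N(0, 0.5 I).
z = np.zeros((T + 1, d))
for t in range(T):
    z[t + 1] = z[t] + rng.normal(0.0, np.sqrt(1.5), size=d)
alpha = np.exp(rng.normal(0.0, np.sqrt(0.5), size=(T + 1, d)))
z_tilde = alpha * z

# The 0.5/0.5 mix preserves row-stochasticity of every final matrix.
print([np.allclose(M.sum(axis=1), 1.0) for M in Ms])  # → [True, True, True, True]
```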

C.2 POLICY CONSTRUCTION IN THE TYPE-1 DIABETES SIMULATOR

The Basal-Bolus policy is a parametrized policy based on the amount of insulin that a person with diabetes is instructed to inject prior to eating a meal (Bastani, 2014):
$$\text{injection} = \frac{\text{current blood glucose} - \text{target blood glucose}}{CF} + \frac{\text{meal size}}{CR},$$
where $CF$ and $CR$ are parameters based on patient information such as body weight, which are already specified in the simulator. We set our two treatment policies with target blood glucose levels of 145 and 130 (compared to the control: 140), and we increase the noise in the insulin pump simulator in both treatment policies.
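The injection rule is a simple two-term formula, sketched below. The numeric values for $CF$, $CR$, glucose, and meal size are illustrative placeholders (in the simulator these depend on patient information), not values from the paper:

```python
def bolus_injection(glucose, target, meal_size, cf, cr):
    """Insulin dose from the basal-bolus rule in Appendix C.2:
    a glucose-correction term plus a meal-coverage term."""
    return (glucose - target) / cf + meal_size / cr

# Hypothetical parameter values, for illustration only.
dose = bolus_injection(glucose=180.0, target=140.0, meal_size=60.0,
                       cf=20.0, cr=10.0)
print(dose)  # (180-140)/20 + 60/10 = 2.0 + 6.0 = 8.0
```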

C.3 RANDOM PATIENT GENERATION IN THE TYPE-1 DIABETES SIMULATOR

The Type-1 Diabetes simulator pre-stores parameters for 30 patients.

