ADVERSARIAL COUNTERFACTUAL ENVIRONMENT MODEL LEARNING

Abstract

A good model for action-effect prediction, i.e., the environment model, is essential for sample-efficient policy learning, in which the agent can take numerous free trials to find good policies. Currently, the model is commonly learned by fitting historical transition data through empirical risk minimization (ERM). However, we discover that simple data fitting can lead to a model that is totally wrong in guiding policy learning due to selection bias in offline dataset collection. In this work, we introduce weighted empirical risk minimization (WERM) to handle this problem in model learning. A typical WERM method utilizes inverse propensity scores to re-weight the training data to approximate the target distribution. However, during policy training, the data distributions of the candidate policies can be varied and unknown. Thus, we propose an adversarial weighted empirical risk minimization (AWRM) objective that learns the model with respect to the worst case of the target distributions. We implement AWRM in a sequential decision-making structure, resulting in the GALILEO model learning algorithm. We also discover that GALILEO is closely related to adversarial model learning, explaining the empirical effectiveness of the latter. We apply GALILEO to synthetic tasks and verify that it makes accurate predictions on counterfactual data. Finally, we apply GALILEO to real-world offline policy learning tasks and find that it significantly improves policy performance in real-world testing.

1. INTRODUCTION

A good environment model is important for sample-efficient decision-making policy learning techniques like reinforcement learning (RL) (James & Johns, 2016). The agent can take trials with this model to find better policies, so costly real-world trial-and-error can be reduced (James & Johns, 2016; Yu et al., 2020) or completely waived (Shi et al., 2019). In this process, the core requirement on the model is to answer queries on counterfactual data unbiasedly, that is, given states, correctly answer what might happen if we were to carry out actions unseen in the training data (Levine et al., 2020). Requiring counterfactual queries makes environment model learning essentially different from standard supervised learning (SL), which directly fits the offline dataset. In real-world applications, the offline data is often collected with selection bias, that is, for each state, each action might be chosen unfairly. Consider the example in Fig. 1(a): to keep the ball following a target line, the behavior policy uses a smaller force when the ball's location is closer to the target line. When a dataset is collected with such selection bias, the association between the (location) states and (force) actions makes it hard for SL to identify the correct causal relationship of the states and actions to the next states. Then, when we query the model with counterfactual data, the predictions might be catastrophic failures: in Fig. 1(c), the SL model mistakenly predicts that smaller forces will increase the ball's next location. Generally speaking, the problem corresponds to the challenge of training the model on one dataset but testing it on another dataset with a shifted distribution (i.e., the dataset generated by counterfactual queries), which is beyond SL's capability as it violates the independent and identically distributed (i.i.d.) assumption.
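The failure mode described above can be reproduced with a minimal simulation of the ball example. The sketch below (the transition noise scale and rollout length are our assumptions, not the paper's exact values) rolls out the biased behavior policy and then naively fits the next position on the action alone; the association flips the sign of the action effect:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 62.5  # target line of the behavior policy

# Roll out the behavior policy of the ball example.
def collect(n_steps=20000, y0=50.0, sigma=1.0):
    ys, acts, ys_next = [], [], []
    y = y0
    for _ in range(n_steps):
        a = rng.normal((phi - y) / 15.0, 0.05)   # selection-biased policy
        y_next = rng.normal(y + a, sigma)        # true effect of a is +1
        ys.append(y); acts.append(a); ys_next.append(y_next)
        y = y_next
    return np.array(ys), np.array(acts), np.array(ys_next)

y, a, y_next = collect()

# A naive 1-D fit of y_{t+1} on a_t picks up the selection bias:
# the slope comes out negative although the causal effect of a is +1,
# because the policy chooses large a exactly when y (and hence y_{t+1})
# is small.
slope = np.polyfit(a, y_next, 1)[0]
```

Under the biased policy, large actions co-occur with small positions, so the marginal association between action and next position is negative even though the causal effect is +1.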
The problem is widely discussed in causal inference for individual treatment effects (ITEs) estimation in many scenarios like patients' treatment selection (Imbens, 1999; Alaa & van der Schaar, 2018). ITEs are the effects of treatments on individuals, which are measured by treating each individual under a uniform policy and evaluating the effect differences. Practical solutions use weighted empirical risk minimization (WERM) to handle this problem (Jung et al., 2020; Shimodaira, 2000; Hassanpour & Greiner, 2019). In particular, they estimate an inverse propensity score (IPS) to re-weight the training data to approximate the data distribution under a uniform policy. Then a model is trained under the reweighted data distribution. The distribution-shift problem is solved as ITEs estimation and model training are under the same distribution.

Figure 1: Subfigure (a) shows how the data is collected: a ball located in a 2D plane at position (x_t, y_t) at time t moves to (x_{t+1}, y_{t+1}) according to x_{t+1} = x_t + 1 and y_{t+1} ~ N(y_t + a_t, 2). Here, a_t is chosen by a control policy a_t ~ N((φ - y_t)/15, 0.05) parameterized by φ, which tries to keep the ball near the line y = φ. In Subfigure (a), φ is set to 62.5. Subfigure (b) shows the collected training data (grey dashed line) and the learned models' predictions of the next position of y. All the models discovered the relation that the next y will be smaller with a larger action. However, the truth is not that a larger a_t causes a smaller y_{t+1}; rather, the policy selects a small a_t when y_t is close to the target line. When we estimate the response curves by fixing y_t and reassigning action a_t with other actions a_t + Δa, where Δa ∈ [-1, 1] is a variation of the action value, the SL model exploits the association and gives opposite responses, while the predictions of AWRM and its practical implementation GALILEO are closer to the ground truths.
The result is in Subfigure (c), where the darker a region is, the more samples fall in it.

The selection bias can be regarded as an instance of the "distributional shift" problem in offline model-based RL, which has also received great attention (Levine et al., 2020; Yu et al., 2020; Kidambi et al., 2020; Chen et al., 2021). However, previous methods, where naive supervised learning is used for environment model learning, ignore the problem in environment model learning and instead handle it by suppressing policy exploration and learning in risky regions. Although these methods have made great progress in many tasks, so far, how to learn a better environment model that alleviates the problem for faithful offline policy optimization has rarely been discussed.

Figure 2: The prediction risks are measured with mean square error (MSE). The error of SL is small only on the training data (φ = 62.5) but becomes much larger on datasets "far away from" the training data. AWRM-oracle selects the oracle worst counterfactual dataset for training in each iteration (pseudocode is in Alg. 1), which reaches small MSE in all datasets and gives correct response curves (Fig. 1(c)). GALILEO approximates the optimal adversarial counterfactual data distribution based on the training data and model. Although the MSE of GALILEO is a bit larger than SL on the training data, on the counterfactual datasets its MSE is on the same scale as AWRM-oracle's.

In this work, for faithful offline policy optimization, we introduce WERM to environment model learning. The extra challenge of model learning for policy optimization is that we have to query the feedback of numerous different policies besides the uniform policy to find a good policy. Thus the target data distribution to reweight toward can be varied and unknown. To solve the problem, we propose an objective called adversarial weighted empirical risk minimization (AWRM).
AWRM introduces adversarial policies whose corresponding counterfactual datasets maximize the prediction error of the model. In each iteration, the model is learned to make its prediction risk as small as possible under the adversarial counterfactual dataset. However, the adversarial counterfactual dataset cannot be obtained in the offline setting, so we derive an approximation of the counterfactual data distribution queried by the optimal adversarial policy and use a variational representation to give a tractable way to learn a model from the approximated data distribution. As a result, we derive a practical approach named Generative Adversarial offLIne counterfactuaL Environment mOdel learning (GALILEO) for AWRM. Fig. 2 shows the differences in the prediction errors of the models learned by these algorithms. We also discover that GALILEO is closely related to existing generative-adversarial model learning techniques, explaining the effectiveness of the latter. Experiments are conducted in two synthetic and two realistic environments. The results in the synthetic environments show that GALILEO can reconstruct correct responses to counterfactual queries. The evaluation in two realistic environments also demonstrates that GALILEO has better counterfactual-query ability than the baselines. We finally search for a policy based on the learned model in a real-world online platform; the policy significantly improves performance on the business indicators of concern.

2. RELATED WORK

We give related adversarial algorithms for model learning in the following and leave other related work to Appx. F. GANITE (Yoon et al., 2018) uses a generator to fill in counterfactual outcomes for each data pair and a discriminator to judge the source (treatment group or control group) of the filled data pair. The generator is trained to minimize the output of the discriminator, and GANITE is trained until the discriminator cannot determine which of the components is the factual outcome. Bica et al. (2020) propose SCIGAN, which extends GANITE to continuous treatment effect estimation (a.k.a., dosage-response estimation) via a hierarchical discriminator architecture. In real-world applications, environment model learning based on Generative Adversarial Imitation Learning (GAIL) has also been adopted for sequential decision-making problems (Ho & Ermon, 2016). GAIL was first proposed for policy imitation (Ho & Ermon, 2016): the imitated policy generates trajectories by interacting with the environment and is learned through RL to maximize the cumulative rewards given by the discriminator. Shi et al. (2019); Chen et al. (2019); Shang et al. (2019) use GAIL for environment model learning by regarding the environment model as the generator and the behavior policy as the "environment" of standard GAIL. These studies empirically demonstrate that adversarial model learning algorithms have better generalization ability for counterfactual queries, while our study reveals the connection between adversarial model learning and WERM through IPS. Our derived practical algorithm GALILEO is closely related to these adversarial model learning algorithms, explaining their effectiveness.

3. PRELIMINARIES

3.1. SINGLE-STEP INDIVIDUALIZED TREATMENT EFFECTS ESTIMATION AND WEIGHTED EMPIRICAL RISK MINIMIZATION

We first introduce individualized treatment effects (ITEs) estimation (Rosenbaum & Rubin, 1983), which can be regarded as the scenario in which the environment model has only a single step. ITEs are typically defined as ITE(x) := E[M*(y|x, 1)|A = 1, X = x] - E[M*(y|x, 0)|A = 0, X = x], where y is the feedback of the environment M*(y|x, a), X denotes the state vector containing pre-treatment covariates (such as age and weight), and A denotes the treatment variable, which is the action intervening on the state X; A should be sampled from a uniform policy. In the two-treatment scenario, A is in {0, 1}, where 1 is the action to intervene and 0 is the action to do nothing. A correct ITEs estimation should be done via Randomized Controlled Trials (RCT), in which, for each X, we have the same probability of samples with A = 1 and A = 0. Here we use lowercase x, a, and y to denote samples of the random variables X, A, and Y, and use X, A, and Y to denote the spaces of the samples. In practice, we prefer to estimate ITEs from observational studies, in which datasets are pre-collected from the real world by a behavior policy such as a human-expert policy. In this case, a common approach for estimating the ITEs in deterministic prediction is ÎTE(x_i) = a_i (y_i^F - M(x_i, 1 - a_i)) + (1 - a_i)(M(x_i, 1 - a_i) - y_i^F) (Shalit et al., 2017), where x_i and y_i^F denote the covariate and factual feedback of the i-th sample, and M ∈ M denotes an approximated feedback model; M is the space of the model. In this formulation, the training set is an empirical factual data distribution P^F = {(x_i, a_i)}_{i}^{n} and the testing set is an empirical counterfactual data distribution P^CF = {(x_i, 1 - a_i)}_{i}^{n}. If a is not sampled from a discrete uniform policy, i.e., the policy has selection bias, P^F and P^CF will be two different distributions, which violates the i.i.d. assumption of standard supervised learning.

In stochastic prediction, ÎTE(x) = E[M(y|x, 1)] - E[M(y|x, 0)], and the counterfactual distribution for testing is the dataset with actions sampled from a uniform policy. Generally speaking, in ITEs estimation, the risks of queries under counterfactual data are caused by the gap between the policies underlying the training and testing data distributions. Without further processing, minimizing the empirical risks cannot guarantee that the counterfactual-query risks are minimized. Assuming that the policy μ in the training data satisfies μ(a|x) > 0, ∀a ∈ A, ∀x ∈ X (often named the overlap assumption), a classical solution to this problem is weighted empirical risk minimization (WERM) through an inverse propensity scoring (IPS) term ω (Shimodaira, 2000; Assaad et al., 2021; Hassanpour & Greiner, 2019; Jung et al., 2020):

Definition 3.1. The learning objective of WERM through IPS is formulated as min_{M∈M} L(M) = min_{M∈M} E_{x,a,y∼p^μ_{M*}} [ω(x, a) ℓ(M(y|x, a), y)], s.t. ω(x, a) = β(a|x)/μ(a|x), where β and μ denote the policies in the testing and training domains, and p^μ_{M*} is the joint probability p^μ_{M*}(x, a, y) := ρ_0(x) μ(a|x) M*(y|x, a), in which ρ_0(x) is the distribution of the state. M is the model space and ℓ is a loss function for model learning.

The weight ω is also known as the importance sampling (IS) weight, which corrects the sampling bias by aligning the training data distribution to the testing data distribution. By selecting different ω̂ to approximate ω for learning the model M, current environment model learning algorithms for ITEs estimation fall into this framework. In standard supervised learning and some works for ITEs estimation (Wager & Athey, 2018; Weiss et al., 2015), ω̂(x, a) = 1, as the distribution-shift problem is ignored. In Shimodaira (2000); Assaad et al. (2021); Hassanpour & Greiner (2019), ω̂ = 1/μ̂ (i.e., β is a uniform policy) for balancing the treatment and control groups, where μ̂ is an approximation of the behavior policy μ. Note that this is a reasonable weight for ITEs estimation: ITEs are defined to evaluate, for each state, the effect difference between treatment and control under a uniform policy.
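The effect of the IPS weight in Def. 3.1 can be illustrated numerically. The sketch below is our own toy construction (the feedback function, behavior policy, and target policy are assumptions): a naive least-squares fit of y on a is biased by the behavior policy, while weighting each sample by ω = β(a|x)/μ(a|x) recovers the causal action effect under the target policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy observational data: true feedback y = x + a + noise,
# behavior policy mu(a|x) = N(-x, 0.5) (selection bias),
# target policy beta(a) = N(0, 0.5), independent of x.
x = rng.uniform(-1.0, 1.0, n)
a = rng.normal(-x, 0.5)
y = x + a + rng.normal(0.0, 0.1, n)

def gauss_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# IPS weights omega = beta(a|x) / mu(a|x), as in Def. 3.1.
w = gauss_pdf(a, 0.0, 0.5) / gauss_pdf(a, -x, 0.5)

def wls_slope(a, y, w):
    # (Self-normalized) weighted least-squares slope of y on a.
    am = np.average(a, weights=w)
    ym = np.average(y, weights=w)
    return (np.average((a - am) * (y - ym), weights=w)
            / np.average((a - am) ** 2, weights=w))

naive = wls_slope(a, y, np.ones(n))  # biased by the behavior policy
ips = wls_slope(a, y, w)             # re-weighted toward beta
```

Under β, the action is independent of the state, so the IPS-weighted slope approaches the true causal effect (+1), while the naive slope is pulled far below it by the state-action association.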

3.2. SEQUENTIAL DECISION-MAKING SETTING

Decision-making processes in a sequential environment are often formulated into Markov Decision Process (MDP) (Sutton & Barto, 1998) . MDP depicts an agent interacting with the environment through actions. In the first step, states are sampled from an initial state distribution x 0 ∼ ρ 0 (x). Then at each time-step t ∈ {0, 1, 2, ...}, the agent takes an action a t ∈ A through a policy π(a t |x t ) ∈ Π based on the state x t ∈ X , then the agent receives a reward r t from a reward function r(x t , a t ) ∈ R and transits to the next state x t+1 given by a transition function M * (x t+1 |x t , a t ) built in the environment. Π, X , and A denote the policy, state, and action spaces.
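The interaction loop above can be written out directly. The sketch below uses toy one-dimensional dynamics, policy, and reward (all our own stand-ins) just to mirror the notation x_t, a_t, r_t, x_{t+1}:

```python
import numpy as np

rng = np.random.default_rng(0)

def rho0():                       # initial state distribution rho_0(x)
    return rng.normal(0.0, 1.0)

def policy(x):                    # pi(a|x), a toy stochastic policy
    return rng.normal(-0.5 * x, 0.1)

def transition(x, a):             # M*(x'|x, a), toy dynamics
    return rng.normal(x + a, 0.1)

def reward(x, a):                 # r(x, a), toy reward
    return -abs(x)

# Collect one trajectory of (x_t, a_t, r_t, x_{t+1}) tuples.
x, traj = rho0(), []
for t in range(100):
    a = policy(x)
    r = reward(x, a)
    x_next = transition(x, a)
    traj.append((x, a, r, x_next))
    x = x_next
```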

4. METHOD

In this section, we first propose a new offline model-learning objective based on Def. 3.1 for policy optimization tasks in Sec. 4.1; in Sec. 4.2, we derive a tractable solution to the proposed objective; finally, we give a practical implementation in Sec. 4.3.

4.1. PROBLEM FORMULATION

For offline policy optimization, we require the environment model to have generalization ability for counterfactual queries, since we need to query the correct feedback of numerous different policies from M. Referring to the formulation of WERM through IPS in Def. 3.1, policy optimization requires M to minimize counterfactual-query risks under numerous unknown policies rather than a specific target policy β. More specifically, the question is: if β is unknown and can vary, how should we reduce the risks of counterfactual queries? In this article, we call the model learning problem in this setting "counterfactual environment model learning" and propose a new objective to handle it. To be compatible with multi-step environment model learning, we first define a generalized WERM through IPS based on Def. 3.1.

Definition 4.1. Given an MDP transition function M* that satisfies M*(x′|x, a) > 0, ∀x ∈ X, ∀a ∈ A, ∀x′ ∈ X, and μ satisfying μ(a|x) > 0, ∀a ∈ A, ∀x ∈ X, the learning objective of generalized WERM through IPS is formulated as min_{M∈M} L(M) = min_{M∈M} E_{x,a,x′∼ρ^μ_{M*}} [ω(x, a, x′) ℓ_M(x, a, x′)], s.t. ω(x, a, x′) = ρ^β_{M*}(x, a, x′) / ρ^μ_{M*}(x, a, x′), where ρ^μ_{M*} is the training data distribution (collected by policy μ) and ρ^β_{M*} is the testing data distribution (collected by policy β). We define ℓ_M(x, a, x′) := ℓ(M(x′|x, a), x′) for brevity. In an MDP, given any policy π, ρ^π_{M*}(x, a, x′) = ρ^π_{M*}(x) π(a|x) M*(x′|x, a), where ρ^π_{M*}(x) := (1 - γ) E_{x_0∼ρ_0}[Σ_{t=0}^{∞} γ^t Pr(x_t = x | x_0, M*)] denotes the occupancy measure of x for policy π (Sutton & Barto, 1998; Ho & Ermon, 2016).

Since the target policy β is unknown and can vary during policy optimization, AWRM learns the model under the worst-case target distribution:

min_{M∈M} max_{β∈Π} L(ρ^β_{M*}, M) = min_{M∈M} max_{β∈Π} E_{x,a∼ρ^μ_{M*}} [ω(x, a|ρ^β_{M*}) E_{M*}[-log M(x′|x, a)]],  (Eq. 3)

where ω(x, a|ρ^β_{M*}) = ρ^β_{M*}(x, a) / ρ^μ_{M*}(x, a) and E_{M*}[·] denotes E_{x′∼M*(x′|x,a)}[·]. The core problem is how to construct the data distribution ρ^{β*}_{M*} of the best-response policy β* in M*, as it is costly to get extra data from M* in real-world applications.
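The oracle version of this objective (the AWRM-oracle of Fig. 2, whose pseudocode is Alg. 1) can be sketched on the ball example of Fig. 1. Everything below is a simplification of ours: a small candidate-policy set stands in for the max over β, a linear model stands in for M, and the oracle step queries the real environment:

```python
import numpy as np

rng = np.random.default_rng(1)

def rollout(phi, n):
    # Ball dynamics of Fig. 1: y' = y + a + noise, with a biased policy.
    y, rows = 50.0, []
    for _ in range(n):
        a = rng.normal((phi - y) / 15.0, 0.05)
        y_next = rng.normal(y + a, 1.0)
        rows.append((y, a, y_next))
        y = y_next
    return np.array(rows)

def mse(batch, w):
    y, a, yn = batch.T
    return float(np.mean((w[0] * y + w[1] * a + w[2] - yn) ** 2))

phi_candidates = [40.0, 55.0, 62.5, 70.0, 80.0]  # hypothetical target policies
pool = rollout(62.5, 1000)                       # offline data from the behavior policy
w = np.zeros(3)                                  # model: y' = w0*y + w1*a + w2

for it in range(30):
    # Oracle step: collect data under every candidate policy and keep the
    # dataset on which the current model is worst.
    batches = [rollout(phi, 500) for phi in phi_candidates]
    worst = max(batches, key=lambda b: mse(b, w))
    pool = np.vstack([pool, worst])
    # Model step: refit on all worst-case data gathered so far.
    y, a, yn = pool.T
    A = np.stack([y, a, np.ones_like(y)], axis=1)
    w, *_ = np.linalg.lstsq(A, yn, rcond=None)
```

Because the worst-case datasets come from several different target policies, the pooled data breaks the state-action collinearity of the behavior policy, and the refit recovers the true positive action effect that a fit on the behavior data alone cannot pin down.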
Instead of deriving the optimal β* directly, our solution is to estimate the optimal adversarial distribution ρ^{β*}_{M*} offline with respect to M, so that we can construct a surrogate objective to optimize M without directly querying the real environment M*.

4.2.1. OPTIMAL ADVERSARIAL DATA DISTRIBUTION APPROXIMATION

Ideally, given any M, the optimal β is the one that makes ρ^β_{M*}(x, a) assign all density to the point with the largest negative log-likelihood. However, searching for this maximum is impractical, especially in continuous spaces. To obtain a relaxed but tractable solution, we add an L2 regularizer ∥ρ^β_{M*}(·, ·)∥²₂ = ∫_{X×A} (ρ^β_{M*}(x, a))² dx da to the original objective Eq. 3:

min_{M∈M} max_{β∈Π} L(ρ^β_{M*}, M) = min_{M∈M} max_{β∈Π} E_{x,a∼ρ^μ_{M*}} [ω(x, a) E_{M*}[-log M(x′|x, a)]] - (α/2) ∥ρ^β_{M*}(·, ·)∥²₂,

where α denotes the regularization coefficient. By adjusting the weights ω, the learning process will exploit subtle errors in any data point, however small its proportion of the data, to correct potential generalization errors on counterfactual data.

4.2.2. TRACTABLE SOLUTION TO THM. 4.4

In Thm. 4.4, the term f(ρ^κ_{M_{θ_t}}(x, a, x′) / ρ^κ_{M*}(x, a, x′)) - f(ρ^κ_{M_{θ_t}}(x, a) / ρ^κ_{M*}(x, a)) is still intractable. To solve the problem, we first resort to the first-order approximation of f. Given some u ∈ (1 - ξ, 1 + ξ), ξ > 0, we have f(u) ≈ f(1) + f′(u)(u - 1), where f′ is the first-order derivative of f. By Taylor's formula and the fact that f′(u) of the generator function f is bounded in (1 - ξ, 1 + ξ), the approximation error is no more than O(ξ²). Letting u = p(x)/q(x) in Eq. 7, the pattern f(p(x)/q(x)) in Thm. 4.4 can be converted to f′(p(x)/q(x))(p(x)/q(x) - 1) + f(1). Here p(x)/q(x) can be approximated by sampling from the datasets, and f′(p(x)/q(x)) can be approximated by the corresponding variational representation T_{φ*} according to Lemma A.9 (Nowozin et al., 2016).

4.3. PRACTICAL IMPLEMENTATION

In practice, we lower-bound the approximated behavior policy μ̂ with a small value ε_μ > 0 to keep the overlap assumption holding. Besides, we add small Gaussian noises N(0, ε_D) to the inputs of D_φ to handle the mismatch between ρ^μ_{M*} and ρ^{μ̂}_{M*} caused by ε_μ. In Eq. 8, H_{M*} is unknown in advance; in practice, we estimate it with H_{M_θ}. More specifically, the neural network of M_θ is modeled as a Gaussian distribution whose variance is given by global variables Σ for each output dimension, and we estimate H_{M*} with the closed-form Gaussian entropy through Σ. Based on the above techniques, we propose Generative Adversarial offLIne counterfactuaL Environment mOdel learning (GALILEO) for environment model learning. GALILEO can be adopted in both single-step and sequential environment model learning. The detailed implementation and comparison to previous adversarial methods are in Appx. E.
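The O(ξ²) claim for the first-order approximation f(u) ≈ f(1) + f′(u)(u - 1) is easy to check numerically. The sketch below takes f(u) = u log u (the KL generator) as an assumed example and verifies that halving ξ cuts the maximal error by about 4x:

```python
import numpy as np

# f and its derivative for the assumed KL generator f(u) = u*log(u).
f = lambda u: u * np.log(u)
fp = lambda u: np.log(u) + 1.0

def max_err(xi):
    # Maximal error of f(u) ≈ f(1) + f'(u)(u - 1) over u in (1 - xi, 1 + xi).
    u = np.linspace(1.0 - xi, 1.0 + xi, 10001)
    return float(np.max(np.abs(f(u) - (f(1.0) + fp(u) * (u - 1.0)))))

# If the error is O(xi^2), this ratio should be close to 4.
ratio = max_err(0.1) / max_err(0.05)
```

For this f the error reduces to |log u - u + 1|, which is ξ²/2 to leading order, so the ratio sits slightly above 4.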

5. EXPERIMENTS

In this section, we first conduct experiments in synthetic environments (Bica et al., 2020) to verify GALILEO on counterfactual queries and the compatibility of GALILEO with sequential and single-step environments. We select the Mean Integrated Square Error, MISE = E_X [∫_A (M*(y|x, a) - M(y|x, a))² da], as the metric, which is commonly used to measure accuracy on counterfactual queries by considering the prediction errors over the whole action space. Then we analyze the benefits of the implementation techniques described in Sec. 4.3 and the problems without them. Finally, we deploy GALILEO in two complex environments: MuJoCo in Gym (Todorov et al., 2012) and a real-world food-delivery platform.

Test in sequential environments Fig. 1(a) is also an example of GNFC. We give the detailed motivation, the effect of selection bias, and other details in Appx. G.1.1. We construct tasks on GNFC by adding behavior policies μ with different scales of uniform noise U(-e, e) injected with different probabilities p. In particular, with e ∈ {1.0, 0.2, 0.05} and p ∈ {1.0, 0.2, 0.05}, we construct 9 tasks and name them in the format "e*_p*". For example, e1_p0.2 is the task whose behavior policy is injected with U(-1, 1) noise with probability 0.2. The results on the GNFC tasks are summarized in Fig. 3(a), and the detailed results can be found in Tab. 8. The results show that the properties of the behavior policy (i.e., e and p) dominate the generalization ability of the baseline algorithms. When e = 0.05, almost all of the baselines fail and give completely opposite response curves (see Fig. 4(a) and Appx. H.2). IPW still performs well when 0.2 ≤ e ≤ 1.0 but fails when e = 0.05 and p ≤ 0.2. We also found that SCIGAN can reach better performance than the other baselines when e = 0.05 and p ≤ 0.2, but its results on the other tasks are unstable. GALILEO is the only algorithm that is robust to the selection bias and outputs correct response curves in all of the tasks.
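The MISE metric above can be estimated by discretizing the action space on a uniform grid. The sketch below does this for deterministic mean predictions; the toy response functions are hypothetical, chosen only so the integral has a closed form to compare against:

```python
import numpy as np

def mise(true_mean, model_mean, states, a_grid):
    # Approximate E_x ∫_A (M*(x,a) - M(x,a))^2 da with a Riemann sum
    # over a uniform action grid.
    da = a_grid[1] - a_grid[0]
    per_state = [np.sum((true_mean(x, a_grid) - model_mean(x, a_grid)) ** 2) * da
                 for x in states]
    return float(np.mean(per_state))

true_mean = lambda x, a: x + a        # ground-truth mean response (toy)
model_mean = lambda x, a: x + 0.9 * a  # a model that shrinks the action effect

states = np.linspace(0.0, 1.0, 11)
a_grid = np.linspace(-1.0, 1.0, 201)
score = mise(true_mean, model_mean, states, a_grid)
# Analytically, the integral is ∫_{-1}^{1} (0.1 a)^2 da = 2/300 per state.
```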
Based on this experiment, we also note that the commonly used overlap assumption is unreasonable to a certain extent, especially in real-world applications, since it is impractical to inject noise into the whole action space. The case where the overlap assumption is violated should be taken into consideration; otherwise, an algorithm that is sensitive to the noise range will be hard to use in practice.

Test in single-step environments Previous experiments on counterfactual environment model learning are based on single-step semi-synthetic data simulation (Bica et al., 2020). Since GALILEO is compatible with single-step environment model learning, we select the same task, named TCGA, from Bica et al. (2020) to test GALILEO. Based on the three synthetic response functions in TCGA, we construct 9 tasks by choosing different parameters of selection bias on μ, which is constructed with a beta distribution, and design a coefficient c to control the selection bias of the beta distribution. We name the tasks in the format "t?_bias?". For example, t1_bias2 is the task with the first response function and c = 2. The details of TCGA are in Appx. G.1.2. The results on the TCGA tasks are summarized in Fig. 3(b), and the detailed results can be found in Tab. 9 in the Appendix. The phenomenon in this experiment is similar to that in GNFC, which demonstrates the compatibility of GALILEO with single-step environments. We also found that the results of IPW are unstable in this experiment. It might be because the behavior policy is modeled with a beta distribution while the propensity score μ̂ is modeled with a Gaussian distribution. Since IPW directly reweights the loss function with 1/μ̂, the results are sensitive to the error in μ̂. GALILEO also models μ̂ with a Gaussian distribution, but its results are more stable since GALILEO does not reweight through μ̂ explicitly.

Table 1: Results of policy performance directly optimized through standard SAC (Haarnoja et al., 2018) using the learned dynamics models and deployed in MuJoCo environments. MAX-RETURN is the policy performance of SAC in the MuJoCo environments, and "avg. norm." is the averaged normalized return of the policies in the 9 tasks, where the returns are normalized to lie between 0 and 100; a score of 0 corresponds to the worst policy and 100 corresponds to MAX-RETURN.

Task       |           Hopper            |           Walker2d           |          HalfCheetah          | avg. norm.
Horizon    | H=10     H=20     H=40      | H=10     H=20      H=40      | H=10     H=20      H=40       |
GALILEO    | 13.0±0.1 33.2±0.1 53.5±1.2  | 11.7±0.2 29.9±0.3  61.2±3.4  | 0.7±0.2  -1.1±0.2  -14.2±1.4  | 51.1
SL         | 4.8±0.5  3.0±0.2  4.6±0.2   | 10.7±0.2 20.1±0.8  37.5±6.7  | 0.4±0.5  -1.1±0.6  -13.2±0.3  | 21.1
IPW        | 5.9±0.7  4.1±0.6  5.9±0.2   | 4.7±1.1  2.8±3.9   14.5±1.4  | 1.6±0.2  0.5±0.8   -11.3±0.9  | 19.7
SCIGAN     | 12.7±0.1 29.2±0.6 46.2±5.2  | 8.4±0.5  9.1±1.7   1.0±5.8   | 1.2±0.3  -0.3±1.0  -11.4±0.3  | 41.8
MAX-RETURN | 13.2±0.0 33.3±0.2 71.0±0.5  | 14.9±1.3 60.7±11.1 221.1±8.9 | 2.6±0.1  13.3±1.1  49.1±2.3   | 100.0

Response curve visualization We plot the averaged response curves, which are constructed by equidistantly sampling actions from the action space and averaging the feedback over the states in the dataset as the averaged response. Parts of the results are shown in Fig. 4 (all curves can be seen in Appx. H.2). For those tasks where the baselines fail to reconstruct response curves, GALILEO not only reaches a better MISE score but reconstructs almost exact responses.

Ablation studies In Sec. 4.3, we introduced several techniques to develop a practical GALILEO algorithm. Based on task e0.2_p0.05 of GNFC, we give ablation studies to investigate the effects of these techniques. As the main-body space is limited, we leave the results in Appx. H.3.
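The "avg. norm." column of Tab. 1 follows a standard normalization. The helper below sketches it; the choice of the worst-policy reference (here, the minimum over the compared methods) is our assumption, and the usage numbers are just the Hopper H=10 column read off Tab. 1:

```python
def normalized_return(r, r_worst, r_max):
    # Map a raw return onto [0, 100]: 0 = worst policy, 100 = MAX-RETURN.
    return 100.0 * (r - r_worst) / (r_max - r_worst)

# Hypothetical usage with the Hopper H=10 column of Tab. 1.
returns = {"GALILEO": 13.0, "SL": 4.8, "IPW": 5.9, "SCIGAN": 12.7}
r_worst = min(returns.values())
r_max = 13.2  # MAX-RETURN for this column
norm = {k: normalized_return(v, r_worst, r_max) for k, v in returns.items()}
```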

5.2. TEST IN COMPLEX ENVIRONMENTS

In MuJoCo tasks MuJoCo is a benchmark in Gym (Todorov et al., 2012; Brockman et al., 2016) where we need to control a robot with specific dynamics to complete some task (e.g., standing or running). We select 3 environments from D4RL (Fu et al., 2020) to construct our model learning tasks. We compare GALILEO with a standard transition model learning algorithm used in previous offline model-based RL algorithms (Yu et al., 2020; Kidambi et al., 2020), which is a variant of supervised learning; we name this method OFF-SL. Besides, we also implement IPW and SCIGAN as baselines. In the D4RL benchmark, only the "medium" tasks are collected with a fixed policy (i.e., a behavior policy with about 1/3 of the performance of the expert policy), which best matches our proposed problem. So we train models on the datasets HalfCheetah-medium, Walker2d-medium, and Hopper-medium. We trained the models with the same number of gradient steps and saved the models.

We first verify the generalization ability of the models by adopting them in offline model-based RL. Instead of designing sophisticated tricks to suppress policy exploration and learning in risky regions as current offline model-based RL algorithms do (Yu et al., 2020; Kidambi et al., 2020), we just use the standard SAC algorithm (Haarnoja et al., 2018) to exploit the models for policy learning, to strictly verify the ability of the models. Unfortunately, we found that the compounding error still becomes inevitably large in 1,000-step rollouts, which is the standard horizon in MuJoCo tasks, leading all models to fail to derive a reasonable policy. To better verify the effects of the models on policy optimization, we learn and evaluate the policies with three smaller horizons: H ∈ {10, 20, 40}. The results are listed in Tab. 1. We first average the normalized return (refer to "avg. norm.") over the tasks, and we can see that the policies obtained through GALILEO significantly outperform those from the other models (the improvements are 24% to 161%). At the same time, we found that SCIGAN performs better in policy learning, while IPW performs similarly to SL. This is in line with our expectations: IPW only considers the uniform policy as the target policy for debiasing, while policy optimization requires querying a wide variety of policies, and minimizing the prediction risks only under a uniform policy cannot yield a good environment model for policy optimization. Besides, IPW ignores the cumulative effect of the policy on the state distribution. On the other hand, SCIGAN, as a partial implementation of GALILEO (refer to Appx. E.2), also roughly achieves AWRM and considers the cumulative effect of the policy on the state distribution, so its overall performance is better. In addition, GALILEO achieves significant improvements in 6 of the 9 tasks, though in HalfCheetah IPW works slightly better. However, compared with MAX-RETURN, all methods fail to derive reasonable policies in HalfCheetah, since their policies' performances are far away from the optimal policy. By further visualizing the trajectories, we found that all the learned policies just keep the cheetah standing in the same place or even going backward. This phenomenon is also similar to the results in MOPO (Yu et al., 2020): in MOPO's experiments on the medium datasets, the truncated-rollout horizon used in Walker2d and Hopper for policy training is set to 5, while in HalfCheetah it has to be set to the minimal value, 1. These phenomena indicate that HalfCheetah may still have unknown problems that create a generalization bottleneck for the models. Besides, we also test the prediction error of the learned models on the corresponding unseen "expert" and "medium-replay" datasets. The detailed results are in Appx. H.5.
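The truncated-horizon evaluation protocol above can be sketched in a few lines: run the policy inside the learned model for H steps and accumulate rewards. The model, policy, and reward below are toy one-dimensional stand-ins of ours, not the MuJoCo setup:

```python
import numpy as np

def evaluate_in_model(model_step, policy, reward, x0_batch, H):
    # Roll a batch of states through the learned model for a truncated
    # horizon H, accumulating rewards along the way.
    x = np.array(x0_batch, dtype=float)
    returns = np.zeros(len(x))
    for _ in range(H):
        a = policy(x)
        returns += reward(x, a)
        x = model_step(x, a)
    return float(returns.mean())

model_step = lambda x, a: x + a       # learned-dynamics stand-in
policy = lambda x: -0.5 * x           # policy under evaluation
reward = lambda x, a: -np.abs(x)      # task-reward stand-in

score_h10 = evaluate_in_model(model_step, policy, reward, np.ones(8), H=10)
```

With these stand-ins the state halves each step, so the return is the negative geometric sum -(1 + 0.5 + ... + 0.5^9).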
In a real-world platform We finally deploy GALILEO in a real-world large-scale food-delivery platform. The goal of the platform is to balance the demand from take-out food orders and the supply of delivery clerks, i.e., helping delivery clerks fulfill more orders by giving reasonable strategies. We focus on a Budget Allocation task over Time periods (BAT) in the platform (see Appx. G.1.3 for details). The goal of the BAT task is to handle the imbalance between the orders demanded by customers and the supply of delivery clerks in different time periods by allocating reasonable allowances to those time periods. The core challenge of environment model learning in BAT tasks is similar to the challenge in Fig. 1. Specifically, the behavior policy in BAT tasks is a human-expert policy, which tends to increase the budget of allowance in time periods with a lower supply of delivery clerks and to decrease it otherwise (Fig. 12 gives a real-data instance of this phenomenon). We first learn a model to predict the supply of delivery clerks (measured by the fulfilled order amount) given allowances. Although the SL model can efficiently fit the offline data, the tendency of its response curve is easily incorrect: as can be seen in Fig. 5(a), with a larger budget of allowance, the predicted supply decreases, which obviously goes against our prior knowledge. This is because, in the offline dataset, the corresponding supply is smaller when the allowance is larger. It is conceivable that if we learn a policy through the SL model, the optimal solution is to cancel all of the allowances, which is obviously incorrect in practice. On the other hand, the tendency of GALILEO's response is correct. Fig. 13 plots all the results in 6 cities. Second, we conduct randomized controlled trials (RCT) in one of the testing cities.
Using the RCT samples, we can evaluate the generalization ability of the model predictions via the Area Under the Uplift Curve (AUUC) (Betlei et al., 2020), which measures the correctness of the sort order of the model predictions on RCT samples. The AUUC further shows that GALILEO gives a reasonable sort order on the supply prediction (see Fig. 5(b)), while the standard SL technique fails to complete this task. Finally, we search for the optimal policy via the cross-entropy method planner (Hafner et al., 2019) based on the learned model and deploy the policy on the real-world platform. The results of the A/B test in City A are shown in Fig. 5(c). It can be seen that after the day the A/B test started, the treatment group (deploying our policy) significantly improves the five-minute order-taken rate over the baseline policy (the same as the behavior policy). In summary, the policy improves the supply by 0.14 to 1.63 percentage points over the behavior policies in the 6 cities. The details of these results are in Appx. H.6.
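As a concrete reference for how the AUUC evaluation on RCT samples works, the following is a minimal sketch; the function name and the simple cumulative-difference form of the uplift curve are illustrative (Betlei et al. (2020) discuss several AUUC variants), not the exact implementation used above:

```python
import numpy as np

def auuc(scores, treatment, outcome):
    """Area Under the Uplift Curve: rank RCT samples by predicted uplift
    `scores`, then, for every cutoff k, accumulate the difference between the
    treated and control outcome rates among the top-k samples, scaled by k."""
    order = np.argsort(-scores)
    t, y = treatment[order], outcome[order]
    n_t = np.cumsum(t)               # number of treated samples in the top-k
    n_c = np.cumsum(1 - t)           # number of control samples in the top-k
    y_t = np.cumsum(y * t)           # treated outcomes in the top-k
    y_c = np.cumsum(y * (1 - t))     # control outcomes in the top-k
    uplift_at_k = y_t / np.maximum(n_t, 1) - y_c / np.maximum(n_c, 1)
    curve = uplift_at_k * np.arange(1, len(scores) + 1)
    return curve.mean()              # area under the curve, up to normalization
```

A model that ranks truly high-uplift samples earlier obtains a larger value, which is how the sort-order quality of the supply predictions is compared.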

6. DISCUSSION AND FUTURE WORK

In this work, we propose AWRM, which handles the generalization challenges of counterfactual environment model learning. Through theoretical modeling, we give a tractable solution to AWRM and propose GALILEO. GALILEO is verified in synthetic environments, complex robot control tasks, and a real-world platform, and shows great generalization ability on counterfactual queries. Giving correct answers to counterfactual queries is important for policy learning. We hope this work can inspire researchers to develop more powerful tools for counterfactual environment model learning. The current limitations are: there are several simplifications in the theoretical modeling process (further discussed in Appx. B), which could be modeled more elaborately; besides, the experiments on MuJoCo indicate that these tasks remain challenging for making correct predictions on counterfactual data. Both should be further investigated in future work.

A PROOF OF THEORETICAL RESULTS

The overall pipeline to model the tractable solution to AWRM is given in Fig. 6. In the proof section, we replace the notation of E with an integral for brevity. Now we rewrite the original objective L(ρ^β_{M*}, M) as:
$$\min_{M\in\mathcal{M}}\max_{\beta\in\Pi}\int_{\mathcal{X},\mathcal{A}} \rho^{\mu}_{M^*}(x,a)\,\omega(x,a)\int_{\mathcal{X}} M^*(x'|x,a)\left(-\log M(x'|x,a)\right)dx'\,da\,dx-\frac{\alpha}{2}\left\|\rho^{\beta}_{M^*}(\cdot,\cdot)\right\|_2^2,$$
where $\omega(x,a)=\frac{\rho^{\beta}_{M^*}(x,a)}{\rho^{\mu}_{M^*}(x,a)}$ and $\|\rho^{\beta}_{M^*}(\cdot,\cdot)\|_2^2=\int_{\mathcal{X},\mathcal{A}}\rho^{\beta}_{M^*}(x,a)^2\,da\,dx$ is the squared $\ell_2$-norm. In an MDP, given any policy π, ρ^π_{M*}(x,a,x') = ρ^π_{M*}(x) π(a|x) M*(x'|x,a), where ρ^π_{M*}(x) denotes the occupancy measure of x for policy π, which can be defined (Sutton & Barto, 1998; Ho & Ermon, 2016) as $\rho^{\pi}_{M^*}(x):=(1-\gamma)\,\mathbb{E}_{x_0\sim\rho_0}\left[\sum_{t=0}^{\infty}\gamma^t \Pr(x_t=x\,|\,x_0,M^*)\right]$, where Pr(x_t = x | x_0, M*) is the probability that π, starting at state x_0 in model M*, visits x at timestep t, and γ ∈ [0, 1] is the discount factor.

Figure 6: The overall pipeline to model the tractable solution to AWRM. f is a generator function defined by f-divergence (Nowozin et al., 2016). κ is an intermediary policy introduced in the estimation.

A.1 PROOF OF LEMMA 4.3

For better readability, we first rewrite Lemma 4.3 as follows:

Lemma A.1. Given any M in L(ρ^β_{M*}, M), the distribution of the ideal best-response policy β* satisfies
$$\rho^{\beta^*}_{M^*}(x,a)=\frac{1}{\alpha_M}\left(D_{\mathrm{KL}}(M^*(\cdot|x,a),M(\cdot|x,a))+H_{M^*}(x,a)\right),$$
where α_M is the regularization coefficient α in Eq. 9 and also serves as a normalizer.

Proof.
Given a transition function M of an MDP, the distribution of the best-response policy β* satisfies:
$$\begin{aligned}
\rho^{\beta^*}_{M^*} &= \arg\max_{\rho^{\beta}_{M^*}} \int_{\mathcal{X},\mathcal{A}} \rho^{\mu}_{M^*}(x,a)\,\omega(x,a) \int_{\mathcal{X}} M^*(x'|x,a)\left(-\log M(x'|x,a)\right) dx'\,da\,dx - \frac{\alpha}{2}\left\|\rho^{\beta}_{M^*}(\cdot,\cdot)\right\|_2^2 \\
&= \arg\max_{\rho^{\beta}_{M^*}} \int_{\mathcal{X},\mathcal{A}} \rho^{\beta}_{M^*}(x,a) \underbrace{\int_{\mathcal{X}} M^*(x'|x,a)\left(-\log M(x'|x,a)\right) dx'}_{g(x,a)}\, da\,dx - \frac{\alpha}{2}\left\|\rho^{\beta}_{M^*}(\cdot,\cdot)\right\|_2^2 \\
&= \arg\max_{\rho^{\beta}_{M^*}} \frac{2}{\alpha}\int_{\mathcal{X},\mathcal{A}} \rho^{\beta}_{M^*}(x,a)\,g(x,a)\,da\,dx - \left\|\rho^{\beta}_{M^*}(\cdot,\cdot)\right\|_2^2 \\
&= \arg\max_{\rho^{\beta}_{M^*}} \frac{2}{\alpha}\int_{\mathcal{X},\mathcal{A}} \rho^{\beta}_{M^*}(x,a)\,g(x,a)\,da\,dx - \left\|\rho^{\beta}_{M^*}(\cdot,\cdot)\right\|_2^2 - \frac{\|g(\cdot,\cdot)\|_2^2}{\alpha^2} \\
&= \arg\max_{\rho^{\beta}_{M^*}} -\left(-2\int_{\mathcal{X},\mathcal{A}} \rho^{\beta}_{M^*}(x,a)\,\frac{g(x,a)}{\alpha}\,da\,dx + \left\|\rho^{\beta}_{M^*}(\cdot,\cdot)\right\|_2^2 + \frac{\|g(\cdot,\cdot)\|_2^2}{\alpha^2}\right) \\
&= \arg\max_{\rho^{\beta}_{M^*}} -\left\|\rho^{\beta}_{M^*}(\cdot,\cdot) - \frac{g(\cdot,\cdot)}{\alpha}\right\|_2^2.
\end{aligned}$$
We know that the occupancy measure ρ^β_{M*} is a density function with the constraint ∫_X ∫_A ρ^β_{M*}(x,a) da dx = 1. Assuming the occupancy measure ρ^β_{M*} has an upper bound c, that is 0 ≤ ρ^β_{M*}(x,a) ≤ c, ∀a ∈ A, ∀x ∈ X, and constructing a regularization coefficient α_M = ∫_X ∫_A (D_KL(M*(·|x,a), M(·|x,a)) + H_{M*}(x,a)) dx da, which is a constant value given any M, we have
$$\rho^{\beta^*}_{M^*}(x,a) = \frac{g(x,a)}{\alpha_M} = \frac{\int_{\mathcal{X}} M^*(x'|x,a)\log\frac{M^*(x'|x,a)}{M(x'|x,a)}\,dx' - \int_{\mathcal{X}} M^*(x'|x,a)\log M^*(x'|x,a)\,dx'}{\alpha_M} = \frac{D_{\mathrm{KL}}(M^*(\cdot|x,a),M(\cdot|x,a)) + H_{M^*}(x,a)}{\alpha_M},$$
which is proportional to D_KL(M*(·|x,a), M(·|x,a)) + H_{M*}(x,a) and is the optimal density function of Eq. 9 with α = α_M. Note that for some particular M*, we still cannot construct a β that generates an occupancy specified by g(x,a)/α_M for any M. We can only claim that the distribution of the ideal best-response policy β* satisfies
$$\rho^{\beta^*}_{M^*}(x,a) = \frac{1}{\alpha_M}\left(D_{\mathrm{KL}}(M^*(\cdot|x,a),M(\cdot|x,a)) + H_{M^*}(x,a)\right),$$
where α_M is the normalizer α_M = ∫_X ∫_A (D_KL(M*(·|x,a), M(·|x,a)) + H_{M*}(x,a)) dx da. We discuss the rationality of the ideal best-response policy β* as a replacement of the real best-response policy β* in Remark A.2.

Remark A.2. The optimal solution Eq. 11 relies on g(x, a).
In some particular M*, it is intractable to derive a β that can generate an occupancy specified by g(x,a)/α_M. Consider the following case: a state x₁ in M* might be harder to reach than another state x₂, e.g., M*(x₁|x,a) < M*(x₂|x,a), ∀x ∈ X, ∀a ∈ A; then it is impossible to find a β whose occupancy satisfies ρ^β_{M*}(x₁,a) > ρ^β_{M*}(x₂,a). In this case, Eq. 11 can be a sub-optimal solution. Since this work focuses on task-agnostic solution derivation, while the solution to the above problem should rely on the specific description of M*, we leave it as future work. However, we point out that Eq. 11 is a reasonable re-weighting term even as a sub-optimum: ρ^{β*}_{M*} gives larger densities to the data where the distribution distance between the approximation model and the real model (i.e., D_KL(M*, M)) is larger or where the stochasticity of the real model (i.e., H_{M*}) is larger.

A.2 PROOF OF EQ. 6

The integral process of D_KL in Eq. 5 is intractable in the offline setting as it explicitly requires the conditional probability function of M*. Our motivation for the tractable solution is to utilize the offline dataset D_real as the empirical joint distribution ρ^μ_{M*}(x,a,x') and adopt practical techniques for distance estimation between two joint distributions, like GAN (Goodfellow et al., 2014; Nowozin et al., 2016), to approximate Eq. 5. To adopt that solution, we should first transform Eq. 5 into a form under joint distributions. Without loss of generality, we introduce an intermediary policy κ, of which μ can be regarded as a specific instance. Then we have M(x'|x,a) = ρ^κ_M(x,a,x')/ρ^κ_M(x,a) for any M if ρ^κ_M(x,a) > 0. Assuming ∀x ∈ X, ∀a ∈ A, ρ^κ_{M*}(x,a) > 0 if ρ^{β*}_{M*}(x,a) > 0, which holds when κ overlaps with μ, then Eq.
5 can be transformed to:
$$\begin{aligned}
\rho^{\beta^*}_{M^*}(x,a) &= \frac{D_{\mathrm{KL}}(M^*(\cdot|x,a), M(\cdot|x,a)) + H_{M^*}(x,a)}{\alpha_M} \\
&= \frac{1}{\alpha_M}\int_{\mathcal{X}} M^*(x'|x,a)\left[\log\frac{M^*(x'|x,a)}{M(x'|x,a)} - \log M^*(x'|x,a)\right]dx' \\
&= \frac{1}{\alpha_M \rho^{\kappa}_{M^*}(x,a)}\int_{\mathcal{X}} \rho^{\kappa}_{M^*}(x,a)M^*(x'|x,a)\left[\log\frac{M^*(x'|x,a)}{M(x'|x,a)} - \log M^*(x'|x,a)\right]dx' \qquad (12)\\
&= \frac{1}{\alpha_M \rho^{\kappa}_{M^*}(x,a)}\int_{\mathcal{X}} \rho^{\kappa}_{M^*}(x,a,x')\left[\log\frac{\rho^{\kappa}_{M^*}(x,a,x')}{\rho^{\kappa}_{M}(x,a,x')} + \log\frac{\rho^{\kappa}_{M}(x,a)}{\rho^{\kappa}_{M^*}(x,a)} - \log M^*(x'|x,a)\right]dx' \\
&= \frac{1}{\alpha_M \rho^{\kappa}_{M^*}(x,a)}\Bigg[\int_{\mathcal{X}} \rho^{\kappa}_{M^*}(x,a,x')\log\frac{\rho^{\kappa}_{M^*}(x,a,x')}{\rho^{\kappa}_{M}(x,a,x')}dx' - \rho^{\kappa}_{M^*}(x,a)\log\frac{\rho^{\kappa}_{M^*}(x,a)}{\rho^{\kappa}_{M}(x,a)}\underbrace{\int_{\mathcal{X}} M^*(x'|x,a)dx'}_{=1} - \rho^{\kappa}_{M^*}(x,a)\int_{\mathcal{X}} M^*(x'|x,a)\log M^*(x'|x,a)dx'\Bigg] \\
&= \frac{1}{\alpha_0(x,a)}\left[\int_{\mathcal{X}} \rho^{\kappa}_{M^*}(x,a,x')\log\frac{\rho^{\kappa}_{M^*}(x,a,x')}{\rho^{\kappa}_{M}(x,a,x')}dx' - \rho^{\kappa}_{M^*}(x,a)\log\frac{\rho^{\kappa}_{M^*}(x,a)}{\rho^{\kappa}_{M}(x,a)} + \rho^{\kappa}_{M^*}(x,a)H_{M^*}(x,a)\right],
\end{aligned}$$
where α₀(x,a) = α_M ρ^κ_{M*}(x,a).

Definition A.3 (f-divergence). Given two distributions P and Q with absolutely continuous density functions p and q with respect to a base measure dx defined on the domain X, the f-divergence (Nowozin et al., 2016) is
$$D_f(P\,\|\,Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)dx,$$
where the generator function f: ℝ⁺ → ℝ is a convex, lower-semicontinuous function.

We notice that the terms ρ^κ_{M*}(x,a,x') log(ρ^κ_{M*}(x,a,x')/ρ^κ_M(x,a,x')) and ρ^κ_{M*}(x,a) log(ρ^κ_{M*}(x,a)/ρ^κ_M(x,a)) are the integrands of the reverse KL divergence, which is an instance of the f function of the f-divergence (see the Reverse-KL row of Tab. 1 in Nowozin et al. (2016) for more details). Replacing the form q log(q/p) with q f(p/q), we obtain a generalized representation of ρ^{β*}_{M*}:
$$\rho^{\beta^*}_{M^*} := \frac{1}{\alpha_0(x,a)}\left[\int_{\mathcal{X}} \rho^{\kappa}_{M^*}(x,a,x')\,f\!\left(\frac{\rho^{\kappa}_{M}(x,a,x')}{\rho^{\kappa}_{M^*}(x,a,x')}\right)dx' - \rho^{\kappa}_{M^*}(x,a)\left(f\!\left(\frac{\rho^{\kappa}_{M}(x,a)}{\rho^{\kappa}_{M^*}(x,a)}\right) - H_{M^*}(x,a)\right)\right].$$

A.3 PROOF OF THM. 4.4

We first introduce several useful lemmas for the proof.

Lemma A.4 (Rearrangement inequality). For two sequences a₁ ≥ a₂ ≥ ... ≥ aₙ and b₁ ≥ b₂ ≥ ... ≥ bₙ, the inequalities
$$a_1b_1 + a_2b_2 + \cdots + a_nb_n \;\ge\; a_1b_{\pi(1)} + a_2b_{\pi(2)} + \cdots + a_nb_{\pi(n)} \;\ge\; a_1b_n + a_2b_{n-1} + \cdots + a_nb_1$$
hold, where π(1), π(2), ..., π(n) is any permutation of 1, 2, ..., n.

Lemma A.5. For two sequences a₁ ≥ a₂ ≥ ... ≥ aₙ and b₁ ≥ b₂ ≥ ... ≥ bₙ, the inequality
$$\sum_{i=1}^{n} \frac{1}{n} a_i b_i \;\ge\; \left(\sum_{i=1}^{n} \frac{1}{n} a_i\right)\left(\sum_{i=1}^{n} \frac{1}{n} b_i\right)$$
holds.

Proof. By the rearrangement inequality, we have
$$\begin{aligned}
\sum_{i=1}^{n} a_ib_i &\ge a_1b_1 + a_2b_2 + \cdots + a_nb_n \\
\sum_{i=1}^{n} a_ib_i &\ge a_1b_2 + a_2b_3 + \cdots + a_nb_1 \\
\sum_{i=1}^{n} a_ib_i &\ge a_1b_3 + a_2b_4 + \cdots + a_nb_2 \\
&\;\;\vdots \\
\sum_{i=1}^{n} a_ib_i &\ge a_1b_n + a_2b_1 + \cdots + a_nb_{n-1}.
\end{aligned}$$
Summing up these n inequalities, we have
$$n\sum_{i=1}^{n} a_ib_i \ge \sum_{i=1}^{n} a_i \sum_{i=1}^{n} b_i, \quad\text{i.e.,}\quad \sum_{i=1}^{n} \frac{1}{n}a_ib_i \ge \sum_{i=1}^{n} \frac{1}{n}a_i \sum_{i=1}^{n} \frac{1}{n}b_i.$$

Now we extend Lemma A.5 to the continuous integral scenario:

Lemma A.6. Given X ⊂ ℝ, for two functions f: X → ℝ and g: X → ℝ such that f(x) ≥ f(y) if and only if g(x) ≥ g(y), ∀x, y ∈ X, the inequality
$$\int_{\mathcal{X}} p(x)f(x)g(x)dx \;\ge\; \int_{\mathcal{X}} p(x)f(x)dx \int_{\mathcal{X}} p(x)g(x)dx$$
holds, where p: X → ℝ, p(x) > 0 ∀x ∈ X, and ∫_X p(x)dx = 1.

Proof. Since (f(x) - f(y))(g(x) - g(y)) ≥ 0, ∀x, y ∈ X, we have
$$\begin{aligned}
&\int_{x\in\mathcal{X}}\int_{y\in\mathcal{X}} p(x)p(y)\left(f(x)-f(y)\right)\left(g(x)-g(y)\right)dy\,dx \ge 0 \\
\Rightarrow\;& \int_{x\in\mathcal{X}}\int_{y\in\mathcal{X}} \left[p(x)p(y)f(x)g(x) + p(x)p(y)f(y)g(y)\right]dy\,dx \;\ge\; \int_{x\in\mathcal{X}}\int_{y\in\mathcal{X}} \left[p(x)p(y)f(x)g(y) + p(x)p(y)f(y)g(x)\right]dy\,dx,
\end{aligned}$$
which yields the claim.

Corollary A.7. Let g(p(x)/q(x)) = -log(p(x)/q(x)), where p(x) > 0 and q(x) > 0, ∀x ∈ X. For υ > 0, the inequality
$$\int_{\mathcal{X}} q(x)\,f\!\left(\upsilon\frac{p(x)}{q(x)}\right)g\!\left(\frac{p(x)}{q(x)}\right)dx \;\ge\; \int_{\mathcal{X}} q(x)\,f\!\left(\upsilon\frac{p(x)}{q(x)}\right)dx \int_{\mathcal{X}} q(x)\,g\!\left(\frac{p(x)}{q(x)}\right)dx$$
holds if f'(x) ≤ 0, ∀x ∈ X. This condition is not always satisfied by the f functions of f-divergences; we list a comparison of f functions on this condition in Tab. 2.

Proof. Since g(x) = -log x, g'(x) = -1/x < 0, ∀x ∈ X. Suppose f'(x) ≤ 0, ∀x ∈ X; then f(x) ≥ f(y) if and only if g(x) ≥ g(y), ∀x, y ∈ X. Thus f(υ p(x)/q(x)) ≥ f(υ p(y)/q(y)) if and only if g(p(x)/q(x)) ≥ g(p(y)/q(y)), ∀x, y ∈ X, for all υ > 0.
By defining F(x) = f(υ p(x)/q(x)) and G(x) = g(p(x)/q(x)) and using Lemma A.6, we have:
$$\int_{\mathcal{X}} q(x)F(x)G(x)dx \;\ge\; \int_{\mathcal{X}} q(x)F(x)dx \int_{\mathcal{X}} q(x)G(x)dx.$$

Then we know

$$\int_{\mathcal{X}} q(x)\,f\!\left(\upsilon\frac{p(x)}{q(x)}\right)g\!\left(\frac{p(x)}{q(x)}\right)dx \;\ge\; \int_{\mathcal{X}} q(x)\,f\!\left(\upsilon\frac{p(x)}{q(x)}\right)dx \int_{\mathcal{X}} q(x)\,g\!\left(\frac{p(x)}{q(x)}\right)dx$$
holds.

Table 2: Properties of f'(x) ≤ 0, ∀x ∈ X for f-divergences.

Name                 Generator function f(x)            f'(x) ≤ 0, ∀x ∈ X?
Kullback-Leibler     x log x                            False
Reverse KL           -log x                             True
Pearson χ²           (x - 1)²                           False
Squared Hellinger    (√x - 1)²                          False
Jensen-Shannon       -(x+1) log((1+x)/2) + x log x      False
GAN                  x log x - (x+1) log(x+1)           True

Now, we prove Thm. 4.4. For better readability, we first rewrite Thm. 4.4 as follows:

Theorem A.8. Let ρ^{β*}_{M*} be the data distribution of the best-response policy β* in Eq. 4 under the model M_θ parameterized by θ. Then we can find the optimal θ* of min_θ max_{β∈Π} L(ρ^β_{M*}, M_θ) (Eq. 4) by iteratively optimizing the objective
$$\theta_{t+1} = \arg\max_{\theta}\; \mathbb{E}_{\rho^{\kappa}_{M^*}}\Bigg[\frac{1}{\alpha_0(x,a)} \log M_{\theta}(x'|x,a)\, \underbrace{\left(f\!\left(\frac{\rho^{\kappa}_{M_{\theta_t}}(x,a,x')}{\rho^{\kappa}_{M^*}(x,a,x')}\right) - f\!\left(\frac{\rho^{\kappa}_{M_{\theta_t}}(x,a)}{\rho^{\kappa}_{M^*}(x,a)}\right) + H_{M^*}(x,a)\right)}_{W(x,a,x')}\Bigg],$$
where α₀(x,a) = α_{M_{θ_t}} ρ^κ_{M*}(x,a), E_{ρ^κ_{M*}}[·] denotes E_{x,a,x'∼ρ^κ_{M*}}[·], f is the generator function of an f-divergence satisfying f'(x) ≤ 0, ∀x ∈ X, and θ are the parameters of M. M_{θ_t} denotes a probability function with the same parametric form as the learned model but with its parameters fixed to θ_t, used only for sampling.

Proof. Let

ρ^{β*}_{M*} be the data distribution of the best-response policy β* in Eq. 4 under the model M_θ parameterized by θ. Then we can find the optimal θ_{t+1} of min_θ max_{β∈Π} L(ρ^β_{M*}, M_θ) (Eq. 4) by iteratively optimizing
$$\theta_{t+1} = \arg\max_{\theta} \int_{\mathcal{X},\mathcal{A},\mathcal{X}} \frac{1}{\alpha_0(x,a)}\,\rho^{\kappa}_{M^*}(x,a,x')\, \log M_{\theta}(x'|x,a)\left[f\!\left(\frac{\rho^{\kappa}_{M_{\theta_t}}(x,a,x')}{\rho^{\kappa}_{M^*}(x,a,x')}\right) - f\!\left(\frac{\rho^{\kappa}_{M_{\theta_t}}(x,a)}{\rho^{\kappa}_{M^*}(x,a)}\right) + H_{M^*}(x,a)\right]dx'\,da\,dx,$$
where M_{θ_t} is introduced to approximate the term ρ^{β*}_{M*} and is fixed while optimizing θ. In Eq. 16, the term ||ρ^β_{M*}(·,·)||²₂ of Eq. 9 is eliminated as it does not contribute to the gradient of θ. Assuming f'(x) ≤ 0, ∀x ∈ X, and letting υ(x,a) := ρ^κ_{M_{θ_t}}(x,a)/ρ^κ_{M*}(x,a) > 0, p(x'|x,a) = M_θ(x'|x,a), and q(x'|x,a) = M*(x'|x,a), the first inequality can be derived by adopting Corollary A.7 and eliminating the first H_{M*}, since it does not contribute to the gradient of θ.
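The condition in Corollary A.7, checked symbolically in Tab. 2, can also be verified numerically. The sketch below (helper names are illustrative) evaluates f' by central finite differences on a grid for each generator function; only Reverse KL and GAN have f'(x) ≤ 0 everywhere:

```python
import numpy as np

# Generator functions from Tab. 2.
fs = {
    "Kullback-Leibler":  lambda x: x * np.log(x),
    "Reverse KL":        lambda x: -np.log(x),
    "Pearson chi^2":     lambda x: (x - 1) ** 2,
    "Squared Hellinger": lambda x: (np.sqrt(x) - 1) ** 2,
    "Jensen-Shannon":    lambda x: -(x + 1) * np.log((1 + x) / 2) + x * np.log(x),
    "GAN":               lambda x: x * np.log(x) - (x + 1) * np.log(x + 1),
}

def derivative_nonpositive(f, xs, eps=1e-6):
    """Check f'(x) <= 0 on the grid via central finite differences."""
    grad = (f(xs + eps) - f(xs - eps)) / (2 * eps)
    return bool(np.all(grad <= 1e-8))

xs = np.linspace(0.05, 20.0, 500)
results = {name: derivative_nonpositive(f, xs) for name, f in fs.items()}
# Matches Tab. 2: only "Reverse KL" and "GAN" come out True.
```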

A.4 PROOF OF THE TRACTABLE SOLUTION

Now we are ready to prove the tractable solution:

Proof. The core challenge is that the term f(ρ^κ_{M_{θ_t}}(x,a,x')/ρ^κ_{M*}(x,a,x')) - f(ρ^κ_{M_{θ_t}}(x,a)/ρ^κ_{M*}(x,a)) is still intractable. In the following, we give a tractable solution to Thm. 4.4. First, we resort to a first-order approximation. Given some u ∈ (1-ξ, 1+ξ), ξ > 0, we have
$$f(u) \approx f(1) + f'(u)(u-1), \qquad (18)$$
where f' is the first-order derivative of f. By Taylor's formula and the fact that f'(u) of the generator function f is bounded on (1-ξ, 1+ξ), the approximation error is no more than O(ξ²). Substituting u with p(x)/q(x) in Eq. 18, the pattern f(p(x)/q(x)) in Eq. 17 can be converted to
$$\frac{p(x)}{q(x)}\, f'\!\left(\frac{p(x)}{q(x)}\right) - f'\!\left(\frac{p(x)}{q(x)}\right) + f(1).$$
We can estimate f'(ρ^κ_{M_{θ_t}}(x,a)/ρ^κ_{M*}(x,a)) and f'(ρ^κ_{M_{θ_t}}(x,a,x')/ρ^κ_{M*}(x,a,x')) through Lemma A.9.

Lemma A.9 (f'(p/q) estimation (Nguyen et al., 2010)). Given a function T_φ: X → ℝ parameterized by φ ∈ Φ, if f is convex and lower semi-continuous, by finding the maximum point φ* of the objective
$$\varphi^* = \arg\max_{\varphi}\; \mathbb{E}_{x\sim p(x)}\left[T_{\varphi}(x)\right] - \mathbb{E}_{x\sim q(x)}\left[f^{\star}(T_{\varphi}(x))\right],$$
we have f'(p(x)/q(x)) = T_{φ*}(x), where f* is the Fenchel conjugate of f (Hiriart-Urruty & Lemaréchal, 2001).

In particular, with
$$\begin{aligned}
\varphi_0^* &= \arg\max_{\varphi_0}\; \mathbb{E}_{x,a,x'\sim\rho^{\kappa}_{M^*}}\left[T_{\varphi_0}(x,a,x')\right] - \mathbb{E}_{x,a,x'\sim\rho^{\kappa}_{M_{\theta_t}}}\left[f^{\star}(T_{\varphi_0}(x,a,x'))\right] \\
\varphi_1^* &= \arg\max_{\varphi_1}\; \mathbb{E}_{x,a\sim\rho^{\kappa}_{M^*}}\left[T_{\varphi_1}(x,a)\right] - \mathbb{E}_{x,a\sim\rho^{\kappa}_{M_{\theta_t}}}\left[f^{\star}(T_{\varphi_1}(x,a))\right],
\end{aligned}$$
we have f'(ρ^κ_{M_{θ_t}}(x,a,x')/ρ^κ_{M*}(x,a,x')) ≈ T_{φ₀*}(x,a,x') and f'(ρ^κ_{M_{θ_t}}(x,a)/ρ^κ_{M*}(x,a)) ≈ T_{φ₁*}(x,a).
Given φ₀* and φ₁*, let A_{φ₀*,φ₁*}(x,a,x') = T_{φ₀*}(x,a,x') - T_{φ₁*}(x,a). Then we can optimize θ via:
$$\begin{aligned}
\theta_{t+1} &= \arg\max_{\theta} \int_{\mathcal{X},\mathcal{A},\mathcal{X}} \rho^{\kappa}_{M_{\theta_t}}(x,a,x')\left(T_{\varphi_0^*}(x,a,x') - T_{\varphi_1^*}(x,a)\right)\log M_{\theta}(x'|x,a)\,dx'\,da\,dx \\
&\qquad + \int_{\mathcal{X},\mathcal{A},\mathcal{X}} \rho^{\kappa}_{M^*}(x,a,x')\left(T_{\varphi_1^*}(x,a) - T_{\varphi_0^*}(x,a,x') + H_{M^*}(x,a)\right)\log M_{\theta}(x'|x,a)\,dx'\,da\,dx \\
&= \arg\max_{\theta} \int_{\mathcal{X},\mathcal{A},\mathcal{X}} \rho^{\kappa}_{M_{\theta_t}}(x,a,x')\,A_{\varphi_0^*,\varphi_1^*}(x,a,x')\log M_{\theta}(x'|x,a)\,dx'\,da\,dx \\
&\qquad + \int_{\mathcal{X},\mathcal{A},\mathcal{X}} \rho^{\kappa}_{M^*}(x,a,x')\left(-A_{\varphi_0^*,\varphi_1^*}(x,a,x') + H_{M^*}(x,a)\right)\log M_{\theta}(x'|x,a)\,dx'\,da\,dx.
\end{aligned}$$
Based on the specific f-divergence, we can represent T and f*(T) with a discriminator D_φ. It can be verified that the choice f(u) = u log u - (u+1) log(u+1), T_φ(u) = log D_φ(u), and f*(T_φ(u)) = -log(1 - D_φ(u)) proposed in Nowozin et al. (2016) satisfies the condition f'(x) ≤ 0, ∀x ∈ X (see Tab. 2). We select it in our implementation and convert the tractable solution to:
$$\begin{aligned}
\theta_{t+1} = \arg\max_{\theta}\;& \mathbb{E}_{\rho^{\kappa}_{M_{\theta_t}}}\left[A_{\varphi_0^*,\varphi_1^*}(x,a,x')\log M_{\theta}(x'|x,a)\right] + \mathbb{E}_{\rho^{\kappa}_{M^*}}\left[\left(H_{M^*}(x,a) - A_{\varphi_0^*,\varphi_1^*}(x,a,x')\right)\log M_{\theta}(x'|x,a)\right] \\
\text{s.t.}\quad \varphi_0^* &= \arg\max_{\varphi_0}\; \mathbb{E}_{\rho^{\kappa}_{M^*}}\left[\log D_{\varphi_0}(x,a,x')\right] + \mathbb{E}_{\rho^{\kappa}_{M_{\theta_t}}}\left[\log(1 - D_{\varphi_0}(x,a,x'))\right] \\
\varphi_1^* &= \arg\max_{\varphi_1}\; \mathbb{E}_{\rho^{\kappa}_{M^*}}\left[\log D_{\varphi_1}(x,a)\right] + \mathbb{E}_{\rho^{\kappa}_{M_{\theta_t}}}\left[\log(1 - D_{\varphi_1}(x,a))\right],
\end{aligned}$$
where A_{φ₀*,φ₁*}(x,a,x') = log D_{φ₀*}(x,a,x') - log D_{φ₁*}(x,a), and E_{ρ^κ_{M_{θ_t}}}[·] is a simplification of E_{x,a,x'∼ρ^κ_{M_{θ_t}}}[·].
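The quality of the first-order approximation in Eq. 18 is easy to check numerically for the GAN generator selected above; the following is a minimal sketch (ξ = 0.1 is chosen purely for illustration):

```python
import numpy as np

def gan_f(u):
    """GAN generator function: f(u) = u log u - (u + 1) log(u + 1)."""
    return u * np.log(u) - (u + 1) * np.log(u + 1)

def gan_f_prime(u):
    """Its first derivative: f'(u) = log(u / (u + 1))."""
    return np.log(u / (u + 1))

xi = 0.1
us = np.linspace(1 - xi, 1 + xi, 201)
approx = gan_f(1.0) + gan_f_prime(us) * (us - 1.0)   # Eq. 18
err = np.abs(gan_f(us) - approx)
# By Taylor's formula, the error stays on the order of xi^2 over (1-xi, 1+xi).
```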

B DISCUSSION OF THE THEORETICAL RESULTS

We summarize the limitations of the current theoretical results and future work as follows:
1. As discussed in Remark A.2, the solution Eq. 11 relies on ρ^β_{M*}(x,a) ∈ [0,c], ∀a ∈ A, ∀x ∈ X. In some particular M*, it is intractable to derive a β that can generate an occupancy specified by g(x,a)/α_M. If more knowledge of M* or β* is provided, or some mild assumptions can be made on the properties of M* or β*, we may model ρ in a more sophisticated way to alleviate this problem.
2. In the tractable solution derivation, we ignore the term α₀(x,a) = α_{M_{θ_t}} ρ^κ_{M*}(x,a) (see Eq. 20). The benefit is that ρ^κ_{M*}(x,a,x') in the tractable solution can then be estimated directly from offline datasets. Although our experimental results show that this does not produce significant negative effects in these tasks, ignoring ρ^κ_{M*}(x,a) does incur extra bias in theory. In future work, techniques for estimating ρ^κ_{M*}(x,a) (Liu et al., 2020) can be incorporated to correct the bias. On the other hand, α_{M_{θ_t}} is also ignored in the process. α_{M_{θ_t}} can be regarded as a global rescaling term of the final objective Eq. 20. Intuitively, it constructs an adaptive learning rate for Eq. 20, which increases the step size when the model fits better and decreases it otherwise. It could be used to further improve the learning process in future work, e.g., by cooperating with empirical risk minimization and balancing the weights of the two objectives through α_{M_{θ_t}}.

C SOCIETAL IMPACT

This work studies a method toward counterfactual environment model learning. Reconstructing an accurate environment of the real world will promote the wide adoption of decision-making policy optimization methods in real life, enhancing our daily experience. We are aware that decision-making policy in some domains like recommendation systems that interact with customers may have risks of causing price discrimination and misleading customers if inappropriately used. A promising way to reduce the risk is to introduce fairness into policy optimization and rules to constrain the actions (Also see our policy design in Sec. G.1.3). We are involved in and advocating research in such directions. We believe that business organizations would like to embrace fair systems that can ultimately bring long-term financial benefits by providing a better user experience.

D AWRM-ORACLE PSEUDOCODE

We list the pseudocode of AWRM-oracle in Alg. 1.

Algorithm 1 AWRM with Oracle Counterfactual Datasets

Input: Φ: policy space; N: total iterations
Process:
1: Generate counterfactual datasets {D_{π_φ}} for all adversarial policies π_φ, φ ∈ Φ
2: Initialize an environment model M_θ
3: for i = 1:N do
4:   Select the D_{π_φ} with the worst prediction error under M_θ from {D_{π_φ}}
5:   Optimize M_θ with standard supervised learning on D_{π_φ}
6: end for
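The loop of Alg. 1 can be sketched as follows; `fit` and `loss` are illustrative stand-ins for supervised training on, and prediction-error evaluation of, a dataset:

```python
import numpy as np

def awrm_oracle(model, fit, loss, counterfactual_datasets, n_iters):
    """AWRM with oracle counterfactual datasets (Alg. 1): at each iteration,
    pick the dataset on which the current model has the worst prediction
    error, then fit the model on it with standard supervised learning.

    counterfactual_datasets: list of (X, y) pairs, one per adversarial policy.
    fit(model, X, y) -> model ; loss(model, X, y) -> float.
    """
    for _ in range(n_iters):
        errors = [loss(model, X, y) for X, y in counterfactual_datasets]
        worst = int(np.argmax(errors))         # best-response dataset
        X, y = counterfactual_datasets[worst]
        model = fit(model, X, y)               # standard ERM on the worst case
    return model
```

The point of the oracle setting is only the selection rule in the loop; with real environments, the counterfactual datasets are of course unavailable, which is what motivates the tractable solution above.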

E IMPLEMENTATION

E.1 DETAILS OF THE GALILEO IMPLEMENTATION

The approximation of Eq. 18 holds only when p(x)/q(x) is close to 1, which might not always be satisfied. To handle this problem, we inject a standard supervised learning loss
$$\arg\max_{\theta}\; \mathbb{E}_{\rho^{\kappa}_{M^*}}\left[\log M_{\theta}(x'|x,a)\right] \qquad (22)$$
to replace the second term of the above objective when the output probability of D is far away from 0.5 (f'(1) = log 0.5). In the offline model-learning setting, we only have a real-world dataset D collected by the behavior policy μ. We learn a policy μ̂ ≈ μ via behavior cloning on D (Pomerleau, 1991; Ho & Ermon, 2016) and let μ̂ be the policy κ. We regard D as the empirical data distribution of ρ^κ_{M*} and the trajectories collected by μ̂ in the model M_{θ_t} as the empirical data distribution of ρ^κ_{M_{θ_t}}. However, the assumption ∀x ∈ X, ∀a ∈ A, μ(a|x) > 0 might not be satisfied. In behavior cloning, we model μ̂ with a Gaussian distribution and constrain the lower bound of the variance with a small value ε_μ > 0 to keep the assumption holding. Besides, we add small Gaussian noises u ∼ N(0, ε_D) to the inputs of D_φ to handle the mismatch between ρ^μ_{M*} and ρ^{μ̂}_{M*} due to ε_μ. In particular, for the φ₀ and φ₁ learning, we have:
$$\begin{aligned}
\varphi_0^* &= \arg\max_{\varphi_0}\; \mathbb{E}_{\rho^{\kappa}_{M^*},u}\left[\log D_{\varphi_0}(x+u_x, a+u_a, x'+u_{x'})\right] + \mathbb{E}_{\rho^{\kappa}_{M_{\theta_t}},u}\left[\log(1 - D_{\varphi_0}(x+u_x, a+u_a, x'+u_{x'}))\right] \\
\varphi_1^* &= \arg\max_{\varphi_1}\; \mathbb{E}_{\rho^{\kappa}_{M^*},u}\left[\log D_{\varphi_1}(x+u_x, a+u_a)\right] + \mathbb{E}_{\rho^{\kappa}_{M_{\theta_t}},u}\left[\log(1 - D_{\varphi_1}(x+u_x, a+u_a))\right],
\end{aligned}$$
where E_{ρ^κ_{M_{θ_t}},u}[·] is a simplification of E_{x,a,x'∼ρ^κ_{M_{θ_t}}, u∼N(0,ε_D)}[·] and u = [u_x, u_a, u_{x'}]. On the other hand, we notice that the first term in Eq. 21 is similar to the objective of GAIL (Ho & Ermon, 2016) if we regard M_θ as the policy to learn and κ as the environment to generate data.
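The input-noise trick for the discriminators can be sketched as below; the function name and batch layout are illustrative, not the implementation used in the experiments:

```python
import numpy as np

def noisy_disc_inputs(x, a, x_next, eps_d, rng=None):
    """Perturb every discriminator input with small Gaussian noise
    u ~ N(0, eps_d), smoothing the mismatch between the data distributions of
    the true behavior policy mu and the cloned policy mu_hat."""
    if rng is None:
        rng = np.random.default_rng(0)
    perturb = lambda arr: arr + rng.normal(0.0, eps_d, arr.shape)
    return perturb(x), perturb(a), perturb(x_next)
```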
For better capability in sequential environment model learning, we introduce some practical tricks inspired by GAIL for model learning (Shi et al., 2019; Shang et al., 2019): we introduce an MDP for κ and M_θ, where the reward is defined by the discriminator D, i.e., r(x,a,x') = log D(x,a,x'), and M_θ is learned to maximize the cumulative rewards. With advanced policy gradient methods (Schulman et al., 2015; 2017), the objective is converted to
$$\max_{\theta}\; A_{\varphi_0^*,\varphi_1^*}(x,a,x')\log M_{\theta}(x'|x,a), \qquad (23)$$
where A = Q^κ_{M_{θ_t}} - V^κ_{M_{θ_t}},
$$Q^{\kappa}_{M_{\theta}}(x,a,x') = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t r(x_t,a_t,x_{t+1}) \,\Big|\, (x_t,a_t,x_{t+1})=(x,a,x'), \kappa, M_{\theta_t}\right],$$
and V^κ_{M_θ}(x,a) = E_{M_θ}[Q^κ_{M_θ}(x,a,x')]. A in Eq. 21 can also be constructed similarly. Although this looks unnecessary in theory, since the one-step optimal model M_θ is the globally optimal model in this setting, the technique is helpful in practice as it makes A more sensitive to the compounding effect of one-step prediction errors: we thereby consider the cumulative effects of prediction errors induced by multi-step transitions in environments.

Algorithm 2 GALILEO pseudocode
Input: D_real: offline dataset sampled from ρ^μ_{M*}, where μ is the behavior policy; N: total iterations
Process:
1: Approximate a behavior policy μ̂ via behavior cloning
2: Initialize an environment model M_{θ₁}
3: for t = 1:N do
4:   Collect rollouts D_gen with μ̂ in the model M_{θ_t}
5:   Update the discriminators D_{φ₀} and D_{φ₁} via Eq. 26 and Eq. 27 through D_real and D_gen
6:   Update Q and V via Eq. 24 and Eq. 25 through D_gen, D_{φ₀}, and D_{φ₁}
7:   Update the model M_{θ_t} via the first term of Eq. 23, implemented with a standard policy gradient method like TRPO (Schulman et al., 2015) or PPO (Schulman et al., 2017); record the policy gradient g_pg
8:   if p₀ < E_{D_gen}[D_{φ₀}(x_t, a_t, x_{t+1})] < p₁ then
9:     Compute the gradient of M_{θ_t} via the second term of Eq. 23 and record it as g_sl
10:  else
11:    Compute the gradient of M_{θ_t} via Eq. 22 and record it as g_sl
12:  end if
13:  Rescale g_sl via Eq. 28
14:  Update the model M_{θ_t} via the gradient g_sl and obtain M_{θ_{t+1}}
15: end for
In particular, to consider the cumulative effects of prediction errors induced by multi-step transitions in environments, we overwrite the function A_{φ₀*,φ₁*} as A_{φ₀*,φ₁*} = Q^κ_{M_{θ_t}} - V^κ_{M_{θ_t}}, where
$$Q^{\kappa}_{M_{\theta_t}}(x,a,x') = \mathbb{E}\left[\sum_{t}\gamma^t \log D_{\varphi_0^*}(x_t,a_t,x_{t+1}) \,\Big|\, (x_t,a_t,x_{t+1})=(x,a,x'), \kappa, M_{\theta_t}\right]$$
and
$$V^{\kappa}_{M_{\theta_t}}(x,a) = \mathbb{E}\left[\sum_{t}\gamma^t \log D_{\varphi_1^*}(x_t,a_t) \,\Big|\, (x_t,a_t)=(x,a), \kappa, M_{\theta_t}\right].$$
To obtain an algorithm for single-step environment model learning, we can simply set γ in Q and V to 0. The second term of Eq. 23 also involves H_{M*}; since H_{M*} is unknown, we use H_{M_θ} to estimate it. When the mean output probability of D on a batch of data is larger than 0.6 or smaller than 0.4, we replace the second term of Eq. 23 with standard supervised learning in Eq. 22. Besides, unreliable gradients also exist in the process of optimizing the second term of Eq. 23. In our implementation, we use the scale of the policy gradients to constrain the gradients of the second term of Eq. 23. In particular, we first compute the l₂-norm of the gradient of the first term of Eq. 23 via conservative policy gradient algorithms, denoted ||g_pg||₂. Then we compute the l₂-norm of the gradient of the second term of Eq. 23, denoted ||g_sl||₂. Finally, we rescale the gradient of the second term g_sl by
$$g_{sl} \leftarrow g_{sl}\, \frac{\|g_{pg}\|_2}{\max\{\|g_{pg}\|_2,\, \|g_{sl}\|_2\}}. \qquad (28)$$
In each iteration, Eq. 23, Eq. 26, and Eq. 27 are trained for a certain number of steps (see Tab. 5), following the same framework as GAIL. Based on the above techniques, we summarize the pseudocode of GALILEO in Alg. 2, where p₀ and p₁ are set to 0.4 and 0.6 in all of our experiments. The overall architecture is shown in Fig. 7.
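Eq. 28 amounts to clipping the norm of g_sl against the policy-gradient norm; a minimal sketch (the tiny epsilon in the denominator is our addition, to avoid division by zero):

```python
import numpy as np

def rescale_sl_gradient(g_sl, g_pg):
    """Eq. 28: shrink the supervised-learning gradient so that its l2-norm
    never exceeds the policy-gradient norm; leave it unchanged otherwise."""
    n_pg = np.linalg.norm(g_pg)
    n_sl = np.linalg.norm(g_sl)
    return g_sl * n_pg / max(n_pg, n_sl, 1e-12)
```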

E.2 CONNECTION WITH PREVIOUS ADVERSARIAL ALGORITHMS

Standard GAN (Goodfellow et al., 2014) can be regarded as a partial implementation of GALILEO that includes the first term of Eq. 23 and Eq. 26, degraded to the single-step scenario. In the context of GALILEO, the objective of GAN is
$$\begin{aligned}
\theta_{t+1} &= \arg\max_{\theta}\; \mathbb{E}_{\rho^{\kappa}_{M_{\theta_t}}}\left[A_{\varphi^*}(x,a,x')\log M_{\theta}(x'|x,a)\right] \\
\text{s.t.}\quad \varphi^* &= \arg\max_{\varphi}\; \mathbb{E}_{\rho^{\kappa}_{M^*}}\left[\log D_{\varphi}(x,a,x')\right] + \mathbb{E}_{\rho^{\kappa}_{M_{\theta_t}}}\left[\log(1 - D_{\varphi}(x,a,x'))\right],
\end{aligned}$$
where A_{φ*}(x,a,x') = log D_{φ*}(x,a,x'). In the single-step scenario, ρ^κ_{M_{θ_t}}(x,a,x') = ρ₀(x)κ(a|x)M_{θ_t}(x'|x,a). The term E_{ρ^κ_{M_{θ_t}}}[A_{φ*}(x,a,x') log M_θ(x'|x,a)] can be converted to E_{ρ^κ_{M_θ}}[log D_{φ*}(x,a,x')] by replacing the gradient M_{θ_t}(x'|x,a)∇_θ log M_θ(x'|x,a) with ∇_θ M_θ(x'|x,a) (Sutton & Barto, 1998). Previous algorithms like GANITE (Yoon et al., 2018) and SCIGAN (Bica et al., 2020) can be regarded as variants of the above training framework. The first term of Eq. 23 and Eq. 26 are also similar to the objective of GAIL if we regard M_θ as the "policy" to imitate and μ̂ as the "environment" to collect data.

In offline model-based RL, the problem is called distribution shift (Yu et al., 2020; Levine et al., 2020; Chen et al., 2021), which has received great attention. However, previous algorithms do not handle the model learning challenge directly but instead propose techniques to suppress policy sampling and learning in risky regions (Yu et al., 2020; Kidambi et al., 2020). Although these algorithms have made great progress in offline policy optimization on many tasks, so far, how to learn a better environment model in this scenario has rarely been discussed.

G EXPERIMENT DETAILS

G.1 SETTINGS

G.1.1 GENERAL NEGATIVE FEEDBACK CONTROL (GNFC)

The design of GNFC is inspired by a classic type of scenario in which behavior policies μ have selection bias and easily lead to counterfactual risks: on some internet platforms, we would like to allocate budgets to a set of targets (e.g., customers or cities) to increase the targets' engagement with the platform. Our task is to train a model to predict the targets' engagement feedback given the targets' features and the allocated budgets. In these tasks, for better benefits, the online working policy (i.e., the behavior policy) tends to cut down the budgets of targets with better engagement and to increase the budgets otherwise. The risk for counterfactual environment model learning in this task is that targets with better historical engagement receive smaller budgets because of the selection bias of the behavior policies; the model might then exploit this correlation and conclude that increasing budgets will reduce the targets' engagement, which violates the real causality. We construct an environment and a behavior policy to mimic the above process. In particular, the behavior policy μ_GNFC is
$$\mu_{GNFC}(x) = \frac{62.5 - \mathrm{mean}(x)}{15} + \epsilon,$$
where ε is a sampled noise, which will be discussed later. The environment includes two parts: (1) a response function M₁(y|x,a) = N(mean(x) + a, 2); (2) a mapping function M₂(x'|x,a,y) = y - mean(x) + x. The transition function M* is the composite M*(x'|x,a) = M₂(x'|x,a,M₁(y|x,a)). The behavior policies have selection bias: the actions taken are negatively correlated with the states, as illustrated in Fig. 8(a) and Fig. 8(b). We control the difficulty of distinguishing the correct causality among x, a, and y by designing different strategies of noise sampling for ε.
In principle, with more frequent or more pronounced disturbances, there are more samples violating the negative correlation between x and a, and thus more samples that can be used to identify the correct causality. Therefore, we can control the difficulty of counterfactual environment model learning by controlling the strength of the disturbance. In particular, we sample ε from a uniform distribution U(-e, e) with probability p; that is, ε = 0 with probability 1-p and ε ∼ U(-e, e) with probability p. Then, with larger p, there are more samples in the dataset violating the negative correlation (i.e., μ_GNFC), and with larger e, the differences in the feedback are more obvious. By selecting different e and p, we can construct different tasks to verify the effectiveness and ability of counterfactual environment model learning algorithms.
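The GNFC data-generating process can be sketched as follows; this is a minimal version in which the fraction form of μ_GNFC and the vector state layout are our reading of the setup above, not a verified reproduction:

```python
import numpy as np

def behavior_policy(x, e=1.0, p=0.1, rng=None):
    """mu_GNFC: action negatively correlated with the state mean, plus a
    disturbance eps ~ U(-e, e) applied with probability p (eps = 0 otherwise)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = x.shape[0]
    eps = np.where(rng.random(n) < p, rng.uniform(-e, e, n), 0.0)
    return (62.5 - x.mean(axis=1)) / 15.0 + eps

def transition(x, a, rng=None):
    """M*: response y ~ N(mean(x) + a, 2); next state x' = x + (y - mean(x))."""
    if rng is None:
        rng = np.random.default_rng(1)
    m = x.mean(axis=1)
    y = rng.normal(m + a, 2.0)
    x_next = x + (y - m)[:, None]
    return x_next, y
```

With p = 0, actions are perfectly anticorrelated with the state mean, which is exactly the selection-bias regime that defeats plain supervised learning.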

G.1.2 THE CANCER GENOMIC ATLAS (TCGA)

The Cancer Genomic Atlas (TCGA) is a project that has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein, and epigenetic levels. The resulting rich data provide a significant opportunity to accelerate the understanding of the molecular basis of cancer. We obtain features x from the TCGA dataset and consider three continuous treatments as done in SCIGAN (Bica et al., 2020). Each treatment a is associated with a set of parameters v₁, v₂, v₃, each sampled randomly from a standard normal distribution and scaled by its norm. We assign interventions by sampling a treatment a from a beta distribution, a | x ∼ Beta(α, β). Here α ≥ 1 controls the sampling bias and
$$\beta = \frac{\alpha - 1}{a^*} + 2 - \alpha,$$
where a* is the optimal treatment. This setting of β ensures that the mode of Beta(α, β) is a*. The calculation of the treatment responses and optimal treatments is shown in Table 3.

Table 3: Treatment responses used to generate semi-synthetic outcomes for patient features x. In the experiments, we set C = 10.

Treatment   Treatment response                                                          Optimal treatment
1           f₁(x, a₁) = C((v₁¹)ᵀx + 12(v₂¹)ᵀx a₁ - 12(v₃¹)ᵀx a₁²)                       a₁* = (v₂¹)ᵀx / (2(v₃¹)ᵀx)
2           f₂(x, a₂) = C((v₁²)ᵀx + sin(π ((v₂²)ᵀx / (v₃²)ᵀx) a₂))                      a₂* = (v₃²)ᵀx / (2(v₂²)ᵀx)
3           f₃(x, a₃) = C((v₁³)ᵀx + 12a₃(a₃ - b)²), where b = 0.75 (v₂³)ᵀx / (v₃³)ᵀx    a₃* = b/3 if b ≥ 0.75, else 1

We conduct experiments on the three treatments separately and change the value of the bias α to assess the robustness of different methods to treatment bias. When the treatment bias is large, i.e., α is large, the training set contains data with a strong bias on the treatment, so it is difficult for models to appropriately predict the treatment responses outside the distribution of the training data.
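The biased treatment assignment can be sketched as below; the function name is illustrative, and the mode identity it relies on is the standard Beta-distribution mode (α-1)/(α+β-2):

```python
import numpy as np

def sample_treatment(a_star, alpha, size=None, rng=None):
    """Sample a | x ~ Beta(alpha, beta) with beta = (alpha - 1)/a* + 2 - alpha,
    chosen so that the mode (alpha - 1)/(alpha + beta - 2) of the Beta
    distribution equals the optimal treatment a*; larger alpha concentrates
    the sampled treatments around a*, i.e., stronger selection bias."""
    if rng is None:
        rng = np.random.default_rng(0)
    beta = (alpha - 1.0) / a_star + 2.0 - alpha
    return rng.beta(alpha, beta, size)
```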

G.1.3 BUDGET ALLOCATION TASK TO THE TIME PERIOD (BAT)

We deploy GALILEO in a real-world large-scale food-delivery platform. The platform contains various food stores and food delivery clerks. The overall workflow is as follows: the platform presents the nearby food stores to the customers, and the customers make orders, i.e., purchase take-out food from stores on the platform. The food delivery clerks can select orders from the platform to fulfill. After an order is selected, the delivery clerk will take the ordered take-out food from the store and then deliver it to the customer. The platform pays the delivery clerks (mainly in proportion to the distance between the store and the customer's location) once the orders are fulfilled. An illustration of the workflow can be found in Fig. 9. However, there is an imbalance between the orders demanded by customers and the supply of delivery clerks to fulfill these orders. For example, at peak times like lunchtime, there are many more demanded orders than in other periods, and the existing delivery clerks might not be able to fulfill all of these orders in time. The goal of the Budget Allocation task to the Time period (BAT) is to handle this imbalance across time periods by sending reasonable allowances to different time periods. More precisely, the goal of BAT is to make all orders (i.e., the demand) sent in different time periods fulfillable (i.e., the supply) in time. To handle the imbalance across time periods, on the platform, the orders in different time periods t ∈ {0, 1, 2, ..., 23} are allocated different allowances c. For example, at 10 A.M. (i.e., t = 10), we add 0.5$ (i.e., c = 0.5) of allowance to all of the demanded orders. From 10 A.M. to 11 A.M., the delivery clerks who take orders and deliver food to customers will receive the extra allowance. Specifically, if the platform pays a delivery clerk 2$ for fulfilling an order, he/she now receives 2.5$.
For each day, the budget of allowances C is fixed. We aim to find the best budget allocation policy π*(c|t) under the limited budget C so that as many orders as possible are taken in a timely manner. To find the policy, we first learn a model M(y_{t+1} | s_t, p_t, c_t) to reconstruct the response of each delivery clerk to the allowance, where y_{t+1} is the number of orders taken by the delivery clerks in state s_t, c_t is the allowance, and p_t denotes static features of the time period t. In particular, the state s_t includes historical order-taken information of the delivery clerks, current order information, weather features, city information, and so on. Then we use a rule-based mapping function f to fill in the complete next-time-period state, i.e., s_{t+1} = f(s_t, p_t, c_t, y_{t+1}). We define the composition of the functions M and f as M_f. Finally, we learn a budget allocation policy based on the learned model. For each day, the policy we would like to find is:

max_π E_{s_0 ∼ S} [ Σ_{t=0}^{23} y_t | M_f, π ],  s.t.  Σ_{t, s∈S} c_t y_t ≤ C.

In our experiment, we evaluate the degree of balance between demand and supply by computing the averaged five-minute order-taken rate, that is, the percentage of orders picked up within five minutes. Note that the behavior policy has been fixed for a long time in this application, so we directly use data replay with a small scale of noise (see Tab. 5) to reconstruct the behavior policy for model learning in GALILEO. Also note that although we model the response of each delivery clerk, for fairness, the budget allocation policy only determines the allowance of each time period t and keeps the allowance the same for every delivery clerk s.
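As an illustrative sketch of the constrained search (not the planner actually used in the paper; the response model and all numbers below are synthetic stand-ins for M), a greedy marginal-gain allocation under the budget constraint can look like this:

```python
import numpy as np

# Hypothetical stand-in for the learned response model M(y_{t+1} | s_t, p_t, c_t):
# predicted taken orders in a period respond to the allowance c with
# diminishing returns. The functional form is illustrative only.
def predicted_taken(demand, c):
    return demand * (1.0 - np.exp(-(0.5 + 0.8 * c)))

def greedy_allocation(demand, budget, step=0.5, max_c=3.0):
    """Greedy marginal-gain search for allowances c_t over the 24 periods,
    approximating: max sum_t y_t  s.t.  sum_t c_t * y_t <= C.
    (A sketch: it stops at the first infeasible increment.)"""
    c = np.zeros(len(demand))
    while True:
        gain = predicted_taken(demand, np.minimum(c + step, max_c)) - predicted_taken(demand, c)
        gain[c + step > max_c] = -np.inf      # respect the per-period cap
        t = int(np.argmax(gain))
        if not np.isfinite(gain[t]):
            break                              # every period is capped
        trial = c.copy()
        trial[t] += step
        if np.sum(trial * predicted_taken(demand, trial)) > budget:
            break                              # the budget would be exceeded
        c = trial
    return c

# Made-up hourly demand with lunch/dinner peaks.
demand = np.array([5, 3, 2, 2, 2, 4, 8, 12, 10, 8, 9, 14,
                   18, 12, 8, 7, 8, 12, 20, 16, 10, 8, 6, 5], dtype=float)
alloc = greedy_allocation(demand, budget=60.0)
```

Under this sketch, high-demand periods receive allowance first because their marginal gain in taken orders is largest.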

G.2 BASELINE ALGORITHMS

The algorithms we compared are: (1) Supervised Learning (SL): training an environment model to minimize the expected prediction error, without considering the counterfactual risks; (2) inverse propensity weighting (IPW) (Spirtes, 2010): a practical way to balance the selection bias by re-weighting. It can be regarded as ω = 1/μ, where μ is another model learned to approximate the behavior policy; (3) SCIGAN: a recently proposed adversarial model learning algorithm for continuous-valued interventions (Bica et al., 2020). All of the baseline algorithms are implemented with the same capacity of neural networks (see Tab. 5).

G.2.1 SUPERVISED LEARNING (SL)

As a baseline, we train a multilayer perceptron model to directly predict the response of different treatments, without considering the counterfactual risks. We use the mean square error to measure the performance of the model, so the loss function can be expressed as MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)², where n is the number of samples, y is the true response and ŷ is the predicted response. In practice, we train the SL models with the Adam optimizer and an initial learning rate of 3e-4 on both the TCGA and GNFC datasets. The architecture of the neural networks is listed in Tab. 5.
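As an illustration of this objective only (a linear model trained by plain gradient descent on synthetic data, not the paper's MLP-with-Adam setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data standing in for (state, action) -> response pairs.
X = rng.normal(size=(256, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=256)

# Minimize MSE = (1/n) * sum_i (y_i - yhat_i)^2 by full-batch gradient descent.
w = np.zeros(4)
lr = 3e-2  # illustrative step size; the paper trains its networks with Adam at 3e-4
for _ in range(2000):
    residual = X @ w - y
    w -= lr * (2.0 / len(y)) * (X.T @ residual)

mse = np.mean((y - X @ w) ** 2)
```

With i.i.d. training and test data this plain fit is fine; the point of the paper is that it fails under the distribution shift induced by counterfactual queries.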

G.2.2 INVERSE PROPENSITY WEIGHTING (IPW)

Inverse propensity weighting (Spirtes, 2010) is an approach in which the treatment-outcome model uses sample weights to balance the selection bias by re-weighting. The weights are defined as the inverse propensity of actually receiving the treatment, which can be expressed as 1/μ(a|x), where x stands for the feature vector in a dataset, a is the corresponding action, and μ(a|x) is the probability of taking action a given the features x within the dataset. μ is learned with standard supervised learning. Standard IPW produces large weights for points with small sampling probabilities, which makes the learning process unstable. We mitigate the problem by clipping the propensity score from below: μ ← max(μ, 0.05), which is commonly used in existing studies (Ionides, 2008). The loss function can thus be expressed as (1/n) Σ_{i=1}^{n} (1/μ(a_i|x_i)) (y_i − ŷ_i)². The architecture of the neural networks is listed in Tab. 5.
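A minimal sketch of the clipped-IPW loss on synthetic biased data (the data-generating process, names, and the clipping variant, which bounds the weights by clipping the propensity from below, are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic logged data with selection bias: with probability 0.99 the
# behavior policy picks a = -0.5 * x (names and forms are illustrative).
x = rng.normal(size=1000)
take_biased = rng.random(1000) < 0.99
a = np.where(take_biased, -0.5 * x, rng.normal(size=1000))
mu = np.where(take_biased, 0.99, 0.01)          # behavior propensity mu(a|x)
y = x + a + rng.normal(scale=0.1, size=1000)    # true outcome model

# Clip small propensities from below to bound the weights 1/mu.
mu_clipped = np.maximum(mu, 0.05)

def ipw_loss(pred, y, mu_clipped):
    """Weighted MSE: (1/n) * sum_i (1 / mu(a_i|x_i)) * (y_i - yhat_i)^2."""
    return np.mean((y - pred) ** 2 / mu_clipped)

loss_correct = ipw_loss(x + a, y, mu_clipped)   # the true causal model
loss_spurious = ipw_loss(x - a, y, mu_clipped)  # a sign-flipped, spurious model
```

Re-weighting up-weights the rare off-policy actions, so the spurious model that exploits the x–a association is penalized.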

G.2.3 SCIGAN

SCIGAN (Bica et al., 2020) is a model that uses generative adversarial networks to learn the data distribution of counterfactual outcomes and thus generate individualized response curves. SCIGAN places no restrictions on the form of the treatment-dose response functions and is capable of estimating patient outcomes for multiple treatments, each with an associated parameter. SCIGAN first trains a generator to generate response curves for each sample in the training dataset. The learned generator is then used to train an inference network with standard supervised methods. For a fair comparison, we increase the number of parameters of the open-source version of SCIGAN so that the SCIGAN model has the same order of magnitude of network parameters as GALILEO. In addition, we finetune the hyperparameters (Tab. 4) of the enlarged SCIGAN to realize its full strength. We set num_dosage_samples = 9 and λ = 10.

G.3 HYPER-PARAMETERS

We list the hyper-parameters of GALILEO in Tab. 5.

G.4 COMPUTATION RESOURCES

We use one Tesla V100 PCIe 32GB GPU and a 32-core Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz to train all of our models. We give the results of GNFC in Tab. 8, TCGA in Tab. 9, BAT in Tab. 7, and MuJoCo in Tab. 6.

H.2 AVERAGED RESPONSE CURVES

We give the averaged responses for all of the tasks and algorithms in Fig. 16 to Fig. 23. We randomly select 20% of the states in the dataset, equidistantly sample actions from the action space for each sampled state, and plot the averaged predicted feedback of each action. The real response differs slightly among the figures because the randomly selected states used for testing are different. We sample 9 points in the GNFC tasks and 33 points in the TCGA tasks for plotting.

H.3 ABLATION STUDIES

In Sec. 4.3 and Appx. E.1, we introduce several techniques to develop a practical GALILEO algorithm. Based on task e0.2_p0.05 in GNFC, we conduct ablation studies to investigate the effects of these techniques. We first compare two variants that do not handle the assumption-violation problems: (1) NO_INJECT_NOISE: set ϵ_µ and ϵ_D to zero, which makes the overlap assumption unsatisfied; (2) SINGLE_SL: do not replace the second term in Eq. 8 with standard supervised learning even when the output probability of D is far away from 0.5. Besides, we introduce several tricks inspired by GAIL and compare against GAIL: (3) ONE_STEP: use the one-step reward instead of cumulative rewards (i.e., Q and V; see Eq. 24 and Eq. 25) for re-weighting, implemented by setting γ to 0; (4) SINGLE_DIS: remove T_{φ*_1}(x, a) and replace it with E_{M_θ}[T_{φ*_0}(x, a, x′)], inspired by GAIL, which uses a value function as a baseline instead of another discriminator; (5) PURE_GAIL: remove the second term in Eq. 8, which can be regarded as a naive adoption of GAIL and a partial implementation of GALILEO. We summarize the results in Fig. 10. From the results of NO_INJECT_NOISE and SINGLE_SL, we can see that handling the assumption-violation problems is important and increases the ability of the model to answer counterfactual queries.

H.6 DETAILED RESULTS IN THE BAT TASKS

The core challenge of environment model learning in the BAT tasks is similar to the challenge in Fig. 1. Specifically, the behavior policy in the BAT tasks is a human-expert policy, which tends to increase the budget of allowance in time periods with a lower supply of delivery clerks and otherwise decreases the budget (Fig. 12 gives an instance of this phenomenon in the real data). Since there is no oracle environment model for querying, we describe the results with other metrics.

Figure 13: Illustration of the response curves in the 6 cities (City-A to City-F). Although the ground-truth curves are unknown, through human expert knowledge we know that they are expected to be monotonically increasing.

First, we review whether the tendency of the response curve is consistent. In this application, a larger budget of allowance should not decrease the supply. As can be seen in Fig. 13, the tendency of GALILEO's response is valid in all 6 cities, while almost all of the SL models give responses in the opposite direction.
If we learned a policy through the SL model, the optimal solution would be to cancel all of the allowances, which is obviously incorrect in practice. Second, we conduct randomized controlled trials (RCT) in one of the testing cities. Using the RCT samples, we can evaluate the correctness of the ordering of the model predictions via the Area Under the Uplift Curve (AUUC) (Betlei et al., 2020). To plot the AUUC, we first sort the RCT samples by the predicted treatment effects. The cumulative treatment effects are then computed by scanning the sorted sample list. If the ordering given by the model predictions is better, samples with larger treatment effects are accumulated earlier, and the area under the curve is larger than that obtained with a random ordering. The AUUC results show that GALILEO gives a reasonable ordering of the RCT samples (see Fig. 14). Finally, we search for the optimal policy via the cross-entropy-method planner (Hafner et al., 2019) based on the learned model. We test the online supply improvement in 6 cities. The compared baseline is a human-expert policy, which is also the behavior policy that collected the offline datasets. We conduct online A/B tests in each of the cities. For each test, we randomly split a city into two partitions: one deploys the optimal policy learned from the GALILEO model, and the other serves as a control group, which keeps the human-expert policy as before. Before the intervention, we collect 10 days of observation data and compute the averaged five-minute order-taken rates as the baselines of the treatment and control groups, denoted b^t and b^c respectively. Then we start the intervention and observe the five-minute order-taken rate for the following 14 days in the two groups. The results of the treatment and control groups are y_i^t and y_i^c respectively, where i denotes the i-th day of the deployment.
The percentage points of supply improvement are computed via difference-in-differences (DID): Σ_{i=1}^{T} [(y_i^t − b^t) − (y_i^c − b^c)] / T × 100, where T is the total number of days of the intervention and T = 14 in our experiments. The results are summarized in Tab. 7. The online experiment is conducted over 14 days, and the results show that the policy learned with GALILEO makes better budget allocations than the behavior policy in all the testing cities, with supply improvements ranging from 0.14 to 1.63 percentage points. We give detailed results, which record the supply difference between the treatment group and the control group, in Fig. 15.
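The DID computation can be sketched as follows (the daily rates below are made-up placeholders, not the paper's measurements):

```python
import numpy as np

def did_improvement_pp(y_treat, y_ctrl, b_treat, b_ctrl):
    """Difference-in-differences in percentage points:
    sum_i [(y_i^t - b^t) - (y_i^c - b^c)] / T * 100."""
    y_treat = np.asarray(y_treat)
    y_ctrl = np.asarray(y_ctrl)
    T = len(y_treat)
    return np.sum((y_treat - b_treat) - (y_ctrl - b_ctrl)) / T * 100.0

# Hypothetical 14-day A/B observations of five-minute order-taken rates.
y_t = [0.82, 0.83, 0.81, 0.84, 0.82, 0.83, 0.85,
       0.82, 0.84, 0.83, 0.82, 0.84, 0.83, 0.85]
y_c = [0.80, 0.81, 0.80, 0.81, 0.80, 0.80, 0.82,
       0.80, 0.81, 0.80, 0.80, 0.81, 0.80, 0.82]
pp = did_improvement_pp(y_t, y_c, b_treat=0.81, b_ctrl=0.80)
```

Subtracting the pre-intervention baselines b^t and b^c removes the static gap between the two groups, so only the intervention effect remains.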






Figure 1: An example of selection bias and predictions under counterfactual queries. Subfigure (a) shows how the data is collected: a ball located in a 2D plane has position (x_t, y_t) at time t. The ball moves to (x_{t+1}, y_{t+1}) according to x_{t+1} = x_t + 1 and y_{t+1} ∼ N(y_t + a_t, 2). Here, a_t is chosen by a control policy a_t ∼ N((ϕ − y_t)/15, 0.05) parameterized by ϕ, which tries to keep the ball near the line y = ϕ. In Subfigure (a), ϕ is set to 62.5. Subfigure (b) shows the collected training data (grey dashed line) and the two learned models' predictions of the next position of y. All the models discovered the relation that the next y will be smaller with a larger action. However, the truth is not that a larger a_t causes a smaller y_{t+1}, but that the policy selects a small a_t when y_t is close to the target line. When we estimate the response curves by fixing y_t and reassigning the action a_t with other actions a_t + ∆a, where ∆a ∈ [−1, 1] is a variation of the action value, the SL model exploits the association and gives opposite responses, while AWRM and its practical implementation GALILEO give predictions closer to the ground truths. The result is in Subfigure (c), where the darker a region is, the more samples fall in it.

Figure 2: An illustration of the prediction error on counterfactual datasets. The prediction risk is measured with the mean square error (MSE). The error of SL is small only on the training data (ϕ = 62.5) but becomes much larger on datasets "far away from" the training data. AWRM-oracle selects the oracle worst counterfactual dataset for training at each iteration (pseudocode is in Alg. 1), reaches a small MSE on all datasets, and gives correct response curves (Fig. 1(c)). GALILEO approximates the optimal adversarial counterfactual data distribution based on the training data and the model. Although the MSE of GALILEO is a bit larger than SL on the training data, on the counterfactual datasets its MSE is on the same scale as AWRM-oracle.

Then we can approximate the optimal distribution ρ_{β*}^{M*} via Lemma 4.3.

Lemma 4.3. Given any M in L(ρ_β^{M*}, M), the distribution of the ideal best-response policy β* satisfies:

ρ_{β*}^{M*}(x, a) = (1/α_M) (D_KL(M*(·|x, a), M(·|x, a)) + H_{M*}(x, a)),   (5)

where D_KL(M*(·|x, a), M(·|x, a)) is the Kullback-Leibler (KL) divergence between M*(·|x, a) and M(·|x, a), H_{M*}(x, a) denotes the entropy of M*(·|x, a), and α_M is the regularization coefficient α in Eq. 4, which also serves as the normalizer of Eq. 5.

Figure 3: Illustration of the performance in GNFC and TCGA. The grey bar denotes the standard error (×0.3 for brevity) of 3 random seeds.

Figure 4: Illustration of the averaged response curves.

Figure 5: An illustration of the performance in the BAT tasks. Fig. 5(a) demonstrates the averaged response curves of the SL and GALILEO models in City A; according to our prior knowledge, the curve is expected to be monotonically increasing. In Fig. 5(b), a model with a larger area above the "random" line makes better predictions on randomized-controlled-trial data (Betlei et al., 2020). Fig. 5(c) shows the daily responses in the A/B test in City A. The complete results are in Appx. H.6.

Figure 6: Roadmap of the derivation: starting from min_{M∈M} max_{β∈Π} L(ρ_β^{M*}, M), we estimate the optimal adversarial distribution ρ_{β*}^{M*} given M (Sec. A.1); derive ρ̃_{β*}^{M*} as a generalized representation of ρ_{β*}^{M*} (Sec. A.2); letting ρ_{β*}^{M*} be the distribution of the best-response policy arg max_{β∈Π} L(ρ_β^{M*}, M), we obtain the surrogate objective min_{M∈M} L(ρ̃_{β*}^{M*}, M) (Sec. A.3); finally, we approximate ρ̃_{β*}^{M*} with a variational representation, modeling an easy-to-estimate distribution of the best-response policy β* through an intermediary policy κ and a generator function f, which yields a tractable solution (Sec. A.4).

where D_KL(M*(·|x, a), M(·|x, a)) is the Kullback-Leibler (KL) divergence between M*(·|x, a) and M(·|x, a), and H_{M*}(x, a) denotes the entropy of M*(·|x, a).

2 ∫_{x∈X} p(x)f(x)g(x)dx ≥ ∫_{y∈X} ∫_{x∈X} [p(x)p(y)f(x)g(y) + p(x)p(y)f(y)g(x)] dx dy = 2 ∫_{y∈X} ∫_{x∈X} p(x)p(y)f(x)g(y) dx dy = 2 ∫_{x∈X} p(x)f(x)dx ∫_{x∈X} p(x)g(x)dx, and therefore ∫_{x∈X} p(x)f(x)g(x)dx ≥ ∫_{x∈X} p(x)f(x)dx · ∫_{x∈X} p(x)g(x)dx.
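This inequality chain (Chebyshev's integral inequality for similarly ordered f and g) can be checked numerically; the density and functions below are our illustrative choices:

```python
import numpy as np

# Numerical check of E_p[f g] >= E_p[f] * E_p[g] for similarly ordered
# (here, both nondecreasing) f and g, on a discretized density p(x).
x = np.linspace(-3.0, 3.0, 1001)
p = np.exp(-0.5 * x**2)
p /= p.sum()                 # normalize to a probability vector
f = np.tanh(x) + 1.2         # nondecreasing
g = x**3 + 30.0              # nondecreasing

lhs = np.sum(p * f * g)      # E_p[f g]
rhs = np.sum(p * f) * np.sum(p * g)
```

The gap lhs − rhs is exactly the covariance of f and g under p, which is nonnegative when f and g are similarly ordered.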

L(ρ̃_{β*}^{M*}, M_θ), where ρ̃_{β*}^{M*} is approximated via the last-iteration model M_{θ_t}. Based on Corollary A.7, we have an upper-bound objective for min_θ L(ρ̃_{β*}^{M*}, M_θ) and derive the following objective

generate a dataset D_gen with the model M_{θ_t}
5: Update the discriminators D_{φ_0} and D_{φ_1} via Eq. 26 and Eq. 27 respectively, where ρ_μ^{M_{θ_t}} is estimated with D_gen and ρ_μ^{M*} is estimated with D_real
6:
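A generic GAN-style discriminator update of this kind can be sketched as follows (synthetic data and a single logistic discriminator standing in for D_{φ_0}/D_{φ_1}; this is not GALILEO's exact objective in Eq. 26 and Eq. 27):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-ins: D_real holds (x, a, x') triples from the true environment,
# D_gen holds triples rolled out from the current model M_{theta_t}.
d_real = rng.normal(loc=1.0, size=(512, 3))
d_gen = rng.normal(loc=0.0, size=(512, 3))

# Logistic discriminator trained by gradient ascent on the GAN objective
# max_phi E_real[log D] + E_gen[log(1 - D)].
w, b = np.zeros(3), 0.0
for _ in range(200):
    p_real = sigmoid(d_real @ w + b)
    p_gen = sigmoid(d_gen @ w + b)
    grad_w = d_real.T @ (1.0 - p_real) / len(d_real) - d_gen.T @ p_gen / len(d_gen)
    grad_b = np.mean(1.0 - p_real) - np.mean(p_gen)
    w += 0.5 * grad_w
    b += 0.5 * grad_b

# Fraction of samples the trained discriminator classifies correctly.
acc = 0.5 * (np.mean(sigmoid(d_real @ w + b) > 0.5)
             + np.mean(sigmoid(d_gen @ w + b) < 0.5))
```

The trained discriminator's output is what supplies the density-ratio signal used for re-weighting in adversarial model learning.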

Figure 7: Illustration of the workflow of the GALILEO algorithm.


Figure 8: Illustration of information about the collected dataset in GNFC. Each color of the line denotes one of the collected trajectories. The X-axis denotes the timestep of a trajectory.

Figure 9: Illustration of the workflow of the food-delivery platform.

Figure 10: Illustration of the ablation studies. The error bars are the standard error.

Figure 11: Illustration of the learning curves on the MuJoCo tasks. The X-axis records the number of environment-model update steps, and the Y-axis is the corresponding prediction error. Figures with titles ending in "(train)" use the dataset for training, while titles ending in "(test)" use datasets held out for testing. The solid curves are the mean and the shadow is the standard error over three seeds.

Figure 14: Illustration of the AUUC result for BAT.

Figure 16: Illustration of the averaged response curves of Supervised Learning (SL) in TCGA.

Figure 18: Illustration of the averaged response curves of Inverse Propensity Weighting (IPW) in TCGA.

Figure 19: Illustration of the averaged response curves of Inverse Propensity Weighting (IPW) in GNFC.

Figure 22: Illustration of the averaged response curves of GALILEO in TCGA.

A Proof of Theoretical Results
A.1 Proof of Lemma 4.3
A.2 Proof of Eq. 6
A.3 Proof of Thm. 4.4
A.4 Proof of the Tractable Solution

Table 4: Hyper-parameters for SCIGAN.

Table 5: Hyper-parameters for all of the tasks.



Table 7: Results on BAT. We use City-X to denote the experiments on different cities. "pp" is an abbreviation of percentage points of supply improvement.

Figure 15: Illustration of the daily responses in the A/B test in the 6 cities.

Table 8: √MISE results on GNFC. We bold the lowest error for each task. ± is the standard deviation over three random seeds.

Table 9: √MISE results on TCGA. We bold the lowest error for each task. ± is the standard deviation over three random seeds.

√MMSE results on GNFC. We bold the lowest error for each task. ± is the standard deviation over three random seeds.

√MMSE results on TCGA. We bold the lowest error for each task. ± is the standard deviation over three random seeds.


Our primitive objective is inspired by weighted empirical risk minimization (WERM) based on the inverse propensity score (IPS). WERM was originally proposed to solve the generalization problem of domain adaptation in the machine learning literature. For instance, we would like to train a predictor M(y|x) in a domain with distribution P_train(x) to minimize the prediction risk in a domain with distribution P_test(x), where P_test ≠ P_train. To solve the problem, we can train a weighted objective max_M E_{x∼P_train}[(P_test(x)/P_train(x)) log M(y|x)], which is called a weighted empirical risk minimization method (Ben-David et al., 2006; 2010; Cortes et al., 2010; Byrd & Lipton, 2019; Quinonero-Candela et al., 2008). These results have been extended and applied to causal inference, where the predictor is required to generalize from the data distribution of observational studies (the source domain) to the data distribution of randomized controlled trials (the target domain) (Shimodaira, 2000; Assaad et al., 2021; Hassanpour & Greiner, 2019; Jung et al., 2020; Johansson et al., 2018). In this case, the input features include a state x (a.k.a. covariates) and an action a (a.k.a. the treatment variable), which is sampled from a policy. We often assume the distribution of x, P(x), is the same in the source and target domains; then we have P_test(x, a)/P_train(x, a) = P(x)β(a|x) / (P(x)µ(a|x)) = β(a|x)/µ(a|x), where µ and β are the policies in the source and target domains respectively. In Shimodaira (2000); Assaad et al. (2021); Hassanpour & Greiner (2019), the policy in randomized controlled trials is modeled as a uniform policy, so β(a|x)/µ(a|x) ∝ 1/µ(a|x). 1/µ(a|x) is also known as the inverse propensity score (IPS). Johansson et al. (2018) assumes that the policy in the target domain is predefined as β(a|x) before environment model learning, and uses β/µ as the IPS.
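The WERM re-weighting identity E_{P_train}[(P_test/P_train) ℓ] = E_{P_test}[ℓ] can be illustrated numerically; the Gaussian domains and the loss below are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

loss = lambda x: (x - 1.0) ** 2          # an arbitrary per-sample loss

# Estimate E_{P_test}[loss] from P_train samples with the density-ratio weight
# w(x) = p_test(x) / p_train(x), where P_train = N(0, 1) and P_test = N(0.5, 1).
x_train = rng.normal(0.0, 1.0, size=200_000)
w = normal_pdf(x_train, 0.5, 1.0) / normal_pdf(x_train, 0.0, 1.0)
weighted_est = np.mean(w * loss(x_train))

x_test = rng.normal(0.5, 1.0, size=200_000)
direct_est = np.mean(loss(x_test))       # both should be near 1 + 0.5^2 = 1.25
```

The weighted estimate over the source domain matches the direct estimate over the target domain, which is the principle behind IPS re-weighting when the ratio reduces to β(a|x)/µ(a|x).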
The differences between AWRM and previous works fall into two aspects: (1) We consider the distribution-shift problem in the sequential decision-making scenario. In this scenario, we consider not only the mismatch between the action distributions of the behavior policy µ and the policy to evaluate β, but also the follow-up effects of the policies on the state distribution. (2) For faithful offline policy optimization, we require the environment model to generalize across numerous different policies. The objective of AWRM is proposed to guarantee the generalization ability of M across numerous different policies instead of a specific policy. On a different thread, there are also studies that bring counterfactual inference techniques from causal inference into model-based RL (Buesing et al., 2019; Pitis et al., 2020; Sontakke et al., 2021). These works consider that the transition function is relevant to some hidden noise variables and use Pearl-style structural causal models (SCMs), which are directed acyclic graphs defining the causality among the nodes of an environment, to handle the problem. SCMs can help RL in different ways: Buesing et al. (2019) approximate the posterior of the noise variables based on the observed data, and environment models are learned based on the inferred noises; the generalization ability is improved if the correct values of the noise variables can be inferred. Pitis et al. (2020) discover several local causal structural models of a global environment model, then build data augmentation strategies that leverage these local structures to generate counterfactual experiences. Sontakke et al. (2021) propose a representation learning technique for causal factors, an instance of the hidden noise variables, in partially observable Markov decision processes (POMDPs); with the learned representation of causal factors, the performance of policy learning and transfer in downstream tasks is improved.
Instead of considering hidden noise variables in the environments, our study considers the environment model learning problem in the fully observed setting and focuses on unbiased causal effect estimation from offline datasets collected by behavior policies with selection bias.

H.5 DETAILED RESULTS IN THE MUJOCO TASKS

We select 3 environments from D4RL (Fu et al., 2020) to construct our model learning tasks. We compare GALILEO with a typical transition-model learning algorithm used in previous offline model-based RL algorithms (Yu et al., 2020; Kidambi et al., 2020), which is a variant of standard supervised learning; we name this method OFF-SL. We train models on the HalfCheetah-medium, Walker2d-medium, and Hopper-medium datasets, which are collected by behavior policies with 1/3 of the performance of the expert policy, and then test them on the corresponding expert datasets. We report the converged results and learning curves of GALILEO and OFF-SL on the three MuJoCo tasks in Tab. 6 and Fig. 11 respectively. In Fig. 11, we can see that both OFF-SL and GALILEO perform well on the training datasets; OFF-SL even reaches a slightly lower error on halfcheetah and walker2d. However, when we verify the models on the "expert" and "medium-replay" datasets, which are collected by other policies, GALILEO is significantly more stable and better than OFF-SL. As training continues, OFF-SL even gets worse and worse. In summary, GALILEO reaches significantly better performance on the expert datasets: the averaged declines of root MSE in the three environments are 56.5%, 49.2%, and 34.8%. However, for both GALILEO and OFF-SL, the test error is at least 2x worse than the training error. This phenomenon indicates that although GALILEO performs better on counterfactual queries, the risks of using the models remain large and are still challenging to reduce further.
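For reference, the reported decline figures are relative root-MSE reductions; the values in the snippet below are hypothetical stand-ins chosen only to reproduce the arithmetic:

```python
def rmse_decline_pct(rmse_off_sl, rmse_galileo):
    """Relative decline of root MSE of GALILEO vs. OFF-SL, in percent."""
    return (rmse_off_sl - rmse_galileo) / rmse_off_sl * 100.0

# Hypothetical pair: an OFF-SL root MSE of 2.0 against a GALILEO root MSE
# of 0.87 would correspond to a 56.5% decline.
decline = rmse_decline_pct(2.0, 0.87)
```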

