REPRESENTATION BALANCING OFFLINE MODEL-BASED REINFORCEMENT LEARNING

Abstract

One of the main challenges in offline and off-policy reinforcement learning is to cope with the distribution shift that arises from the mismatch between the target policy and the data collection policy. In this paper, we focus on a model-based approach, particularly on learning the representation for a robust model of the environment under the distribution shift, which was first studied by Representation Balancing MDP (RepBM). Although this prior work has shown promising results, a number of shortcomings still hinder its applicability to practical tasks. In particular, we address the curse of horizon exhibited by RepBM, which causes it to reject most of the pre-collected data in long-term tasks. We present a new objective for model learning motivated by recent advances in the estimation of stationary distribution corrections. This effectively overcomes the aforementioned limitation of RepBM, while also naturally extending to continuous action spaces and stochastic policies. We also present an offline model-based policy optimization algorithm using this new objective, yielding state-of-the-art performance in a representative set of benchmark offline RL tasks.

1. INTRODUCTION

Reinforcement learning (RL) has accomplished remarkable results in a wide range of domains, but its successes have mostly relied on a large number of online interactions with the environment. However, in many real-world tasks, exploratory online interactions are either very expensive or dangerous (e.g. robotics, autonomous driving, and healthcare), and applying standard online RL would be impractical. Consequently, the ability to optimize RL agents reliably without online interactions has been considered a key to practical deployment, which is the main goal of batch RL, also known as offline RL (Fujimoto et al., 2019; Levine et al., 2020). In an offline RL algorithm, accurate policy evaluation and reliable policy improvement are both crucial for the successful training of the agent.

Evaluating policies in offline RL is essentially an off-policy evaluation (OPE) task, which aims to evaluate the target policy given a dataset collected from the behavior policy. The difference between the target and behavior policies causes a distribution shift in the estimation, which needs to be adequately addressed for accurate policy evaluation. OPE itself is one of the long-standing hard problems in RL (Sutton et al., 1998; 2009; Thomas & Brunskill, 2016; Hallak & Mannor, 2017). However, recent offline RL studies mainly focus on how to improve the policy conservatively while using a common policy evaluation technique without much consideration of the distribution shift, e.g. mean squared temporal difference error minimization or maximum-likelihood training of an environment model (Fujimoto et al., 2019; Kumar et al., 2019; Yu et al., 2020). While conservative policy improvement helps policy evaluation by reducing the off-policyness, we hypothesize that addressing the distribution shift explicitly during policy evaluation can further improve the overall performance, since it provides a better foundation for policy improvement.
To this end, we aim to explicitly address the distribution shift of the OPE estimator used in the offline RL algorithm. In particular, we focus on the model-based approach, where we train an environment model robust to the distribution shift. One of the notable prior works is Representation Balancing MDP (RepBM) (Liu et al., 2018b), which regularizes the representation learning of the model to be invariant between the distributions. However, despite the promising results of RepBM, its step-wise estimation of the distance between the distributions has a few drawbacks that limit the algorithm's practicality: not only does it assume a discrete-action task where the target policy is deterministic, but it also performs poorly in long-term tasks due to the curse of horizon of step-wise importance sampling (IS) estimators (Liu et al., 2018a). To address these limitations, we present the Representation Balancing with Stationary Distribution Estimation (RepB-SDE) framework, where we aim to learn a balanced representation by regularizing, in the representation space, the distance between the data distribution and the discounted stationary distribution induced by the target policy. Motivated by the recent advances in estimating stationary distribution corrections, we present a new representation balancing objective to train a model of the environment that no longer suffers from the curse of horizon. We empirically show that the model trained by the RepB-SDE objective is robust to the distribution shift for the OPE task, particularly when the difference between the target and the behavior is large. We also introduce a model-based offline RL algorithm based on the RepB-SDE framework and report its performance on the D4RL benchmark (Fu et al., 2020), showing state-of-the-art performance in a representative set of tasks.

2. RELATED WORK

Learning balanced representation Learning a representation invariant to specific aspects of the data is an established method for overcoming the distribution shift that arises in unsupervised domain adaptation (Ben-David et al., 2007; Zemel et al., 2013) and in causal inference from observational data (Shalit et al., 2017; Johansson et al., 2018). These works show that imposing a bound on the generalization error under the distribution shift leads to an objective that learns a balanced representation, i.e. one under which the training and test distributions look similar. RepBM (Liu et al., 2018b) can be seen as a direct extension to the sequential case, encouraging the representation to be invariant under the target and behavior policies at each timestep.

Stationary distribution correction estimation (DICE)

Step-wise importance sampling (IS) estimators (Precup, 2000) compute importance weights by taking the product of per-step distribution ratios. Consequently, these methods suffer from variance that grows exponentially in the length of trajectories, a phenomenon called the curse of horizon (Liu et al., 2018a). Recently, techniques for computing a stationary DIstribution Correction Estimation (DICE) have made remarkable progress in effectively addressing the curse of horizon (Liu et al., 2018a; Nachum et al., 2019a; Tang et al., 2020; Zhang et al., 2020; Mousavi et al., 2020). DICE has also been used to explicitly address the distribution shift in online model-free RL, by directly applying IS to the policy and action-value objectives (Liu et al., 2019; Gelada & Bellemare, 2019). We adopt one of these estimation techniques, DualDICE (Nachum et al., 2019a), to measure the distance between the stationary distribution and the data distribution in the representation space.

Offline reinforcement learning There are extensive studies on adapting standard online model-free RL algorithms (Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) for stable learning in the offline setting. The main idea behind them is to conservatively improve the policy by (1) quantifying the uncertainty of the value function estimate, e.g. using bootstrapped ensembles (Kumar et al., 2019; Agarwal et al., 2020), and/or (2) constraining the optimized target policy to be close to the behavior policy (i.e. behavior regularization approaches) (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Lee et al., 2020). A notable exception is AlgaeDICE (Nachum et al., 2019b), which implicitly uses DICE to regularize the discounted stationary distribution induced by the target policy to stay inside the data support, similar to this work. On the other hand, Yu et al.
(2020) argued that the model-based approach can be advantageous due to its ability to generalize predictions to states outside of the data support. They introduced MOPO (Yu et al., 2020), which uses truncated rollouts and penalized rewards for conservative policy improvement. MOReL (Kidambi et al., 2020) trains a state-action novelty detector and uses it to penalize rewards in the data-sparse region. Matsushima et al. (2020), MOOSE (Swazinna et al., 2020), and MBOP (Argenson & Dulac-Arnold, 2020) guide their policy optimization using the behavior policy, similar to the behavior regularization approaches. Note that the aforementioned offline RL methods build on the standard approximate dynamic programming algorithm for action-value estimation (model-free) or on a maximum-likelihood environment model (model-based), without explicitly addressing the distribution shift in the estimator. In contrast, we augment the objective for model learning to obtain a model robust under the distribution shift, which, to the best of our knowledge, is the first such attempt in offline RL.

3. PRELIMINARIES

A Markov Decision Process (MDP) is specified by a tuple $M = \langle S, A, T, R, d_0, \gamma \rangle$, consisting of the state space $S$, action space $A$, transition function $T : S \times A \to \Delta(S)$, reward function $R : S \times A \to \Delta([0, r_{\max}])$, initial state distribution $d_0$, and discount rate $\gamma$. In this paper, we mainly focus on a continuous state space $S \subseteq \mathbb{R}^{d_s}$ and conduct experiments on both discrete action spaces $A = \{a_0, \ldots, a_{n_a}\}$ and continuous action spaces $A \subseteq \mathbb{R}^{d_a}$. Given an MDP $M$ and a policy $\pi$, which is a (stochastic) mapping from states to actions, trajectories can be generated in the form $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$, where $s_0 \sim d_0$ and, for each timestep $t \ge 0$, $a_t \sim \pi(s_t)$, $r_t \sim R(s_t, a_t)$, and $s_{t+1} \sim T(s_t, a_t)$. The goal of RL is to optimize or evaluate a policy based on the normalized expected discounted return $R^\pi \triangleq (1-\gamma)\,\mathbb{E}_{M,\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$. A useful and important concept throughout the paper is the discounted stationary distribution, which represents the long-term occupancy of state-action pairs:

$d^\pi(s, a) \triangleq (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a \mid M, \pi).$

From this definition, it can be observed that $R^\pi = \mathbb{E}_{(s,a) \sim d^\pi}[r(s,a)]$.

Offline RL and off-policy evaluation In this paper, we focus on the offline RL problem, where the agent can only access a static dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ for the maximization of $R^\pi$. We consider a behavior-agnostic setting where we do not have any knowledge of the data collection process. We denote the empirical distribution of the dataset by $d^D$. Before improving the policy, we first aim to better evaluate $R^\pi$ given a target policy $\pi$ and a static dataset $D$, which corresponds to an off-policy evaluation (OPE) problem. We mainly focus on a model-based approach, where the algorithm first estimates the unknown dynamics $T, R$ using the dataset $D$. This defines an approximate MDP $\widehat{M} = \langle S, A, \widehat{T}, \widehat{R}, d_0, \gamma \rangle$, with the approximate expected discounted return $\widehat{R}^\pi \triangleq (1-\gamma)\,\mathbb{E}_{\widehat{M},\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$ obtained from $\widehat{M}$.
In this paper, we are interested in an MDP estimate $\widehat{M}$ that can effectively reduce the policy evaluation error $|R^\pi - \widehat{R}^\pi|$. In order to do so, we need to learn a good representation of the model that results in a small OPE error. We assume a bijective representation function $\phi : S \times A \to Z$, where $Z \subseteq \mathbb{R}^{d_z}$ is the representation space. We define the transition and reward models in terms of the representation function $\phi$, i.e. $\widehat{T} = \widehat{T}_z \circ \phi$ and $\widehat{R} = \widehat{R}_z \circ \phi$. In practice, where we use a neural network for $\widehat{T}$ and $\widehat{R}$, $z$ can be chosen to be the output of an intermediate hidden layer, so that $\phi$ is represented by the lower layers and $\widehat{T}_z, \widehat{R}_z$ by the remaining upper layers. We define $d^\pi_\phi(z)$ as the discounted stationary distribution on $Z$ induced by $d^\pi(s, a)$ under the representation function $z = \phi(s, a)$, and similarly $d^D_\phi(z)$.
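To make the model-based OPE procedure above concrete, the approximate return can be estimated by Monte Carlo rollouts of the target policy inside the learned model. Below is a minimal sketch, not the paper's implementation: `d0_sample`, `policy`, and `model` are hypothetical callables standing in for the initial distribution, the policy, and the learned dynamics and reward.

```python
def model_based_return(d0_sample, policy, model, gamma=0.99,
                       n_rollouts=100, horizon=1000):
    """Monte Carlo estimate of the normalized return
    (1 - gamma) * E[sum_t gamma^t r_t] under a learned model.

    `d0_sample()` draws an initial state, `policy(s)` samples an action,
    and `model(s, a)` returns (next_state, reward, done); all three are
    illustrative callables, not interfaces from the paper.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = d0_sample()
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = model(s, a)
            ret += disc * r   # accumulate discounted reward
            disc *= gamma
            if done:
                break
        total += ret
    return (1.0 - gamma) * total / n_rollouts
```

With a model that always emits reward 1, the estimate approaches $1 - \gamma^H$ for horizon $H$, matching the normalization by $(1-\gamma)$.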

4.1. GENERALIZATION ERROR BOUND FOR MODEL-BASED OFF-POLICY EVALUATION

We aim to construct a model $\widehat{M}$ from the dataset $D$ that can accurately evaluate the policy $\pi$, by minimizing a good upper bound of the policy evaluation error $|R^\pi - \widehat{R}^\pi|$. We define the following point-wise model loss for notational convenience:

$\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a) \triangleq c_R\, D_{TV}\big(R(r|s,a) \,\|\, \widehat{R}_z(r|\phi(s,a))\big) + c_T\, D_{TV}\big(T(s'|s,a) \,\|\, \widehat{T}_z(s'|\phi(s,a))\big),$

where $c_R = 2(1-\gamma)$ and $c_T = 2\gamma r_{\max}$. Then, we start by restating the simulation lemma (Kearns & Singh, 2002) to bound the policy evaluation error in terms of the point-wise model loss. The proof is available in Appendix A.

Lemma 4.1. Given an MDP $M$ and its estimate $\widehat{M}$ with a bijective representation function $\phi$, i.e. $(\widehat{T}, \widehat{R}) = (\widehat{T}_z \circ \phi, \widehat{R}_z \circ \phi)$, the policy evaluation error of a policy $\pi$ can be bounded by:

$|R^\pi - \widehat{R}^\pi| \le \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)\big]. \qquad (1)$

Lemma 4.1 has a natural interpretation: if the model error is small in the states frequently visited by following the policy $\pi$, the resulting policy evaluation error will also be small. However, minimizing the RHS of Eq. (1) in the off-policy evaluation (OPE) task is generally intractable since the distribution $d^\pi$ is not directly accessible. Therefore, the common practice has been to construct a maximum-likelihood MDP using $D$ while ignoring the distribution shift, but its OPE performance is not guaranteed. Instead, we will derive a tractable upper bound on the policy evaluation error by eliminating the direct dependence on $d^\pi$ in Eq. (1). To this end, we adopt a distance metric between the two distributions over representations, $d^\pi_\phi$ and $d^D_\phi$, that can bound their difference in expectations: the Integral Probability Metric (IPM) (Müller, 1997):

$\mathrm{IPM}_G(p, q) = \sup_{g \in G} \big| \mathbb{E}_{z \sim p}[g(z)] - \mathbb{E}_{z \sim q}[g(z)] \big|, \qquad (2)$

where particular choices of $G$ make the IPM equivalent to well-known distances between distributions, e.g. the total variation distance or the Wasserstein distance (Sriperumbudur et al., 2009). Theorem 4.2.
Given an MDP $M$ and its estimate $\widehat{M}$ with a bijective representation function $\phi$, i.e. $(\widehat{T}, \widehat{R}) = (\widehat{T}_z \circ \phi, \widehat{R}_z \circ \phi)$, assume that there exist a constant $B_\phi > 0$ and a function class $G \subseteq \{g : Z \to \mathbb{R}\}$ such that $\frac{1}{B_\phi} \mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(\phi^{-1}(\cdot)) \in G$. Then, for any policy $\pi$,

$|R^\pi - \widehat{R}^\pi| \le \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)\big] + B_\phi\, \mathrm{IPM}_G(d^\pi_\phi, d^D_\phi). \qquad (3)$

This theorem is an adaptation of Lemma 1 of Shalit et al. (2017) to infinite-horizon model-based policy evaluation and can be derived from the definition of $\mathrm{IPM}_G(d^\pi_\phi, d^D_\phi)$, since the IPM serves as an upper bound on the difference in the expectations of any function in $G$. The first term in Eq. (3) corresponds to the fitness to the data following $d^D$, while the second term serves as a regularizer. To see this, note that minimizing the second term alone would yield a near-constant representation function, which would be bad for the first term since such a representation cannot distinguish states and actions well enough. This shows a natural trade-off between optimizing the model to fit the data better and learning a representation that is invariant with respect to $d^\pi_\phi$ and $d^D_\phi$.

Nevertheless, the RHS of Eq. (3) still cannot be evaluated naively due to its dependence on $d^\pi$ when estimating the IPM. We address this challenge via a change of variable known as the DualDICE trick (Nachum et al., 2019a). For each $g$, define $\nu : Z \to \mathbb{R}$ as a function of state-action pairs satisfying:

$\nu(\phi(s,a)) = g(\phi(s,a)) + \gamma\, \mathbb{E}_{s' \sim T(s,a),\, a' \sim \pi(s')}\big[\nu(\phi(s', a'))\big], \quad \forall (s,a) \in S \times A.$

Then we can rewrite the IPM as:

$\mathrm{IPM}_G(d^\pi_\phi, d^D_\phi) = \sup_{g \in G} \big| \mathbb{E}_{(s,a) \sim d^\pi}[g(\phi(s,a))] - \mathbb{E}_{(s,a) \sim d^D}[g(\phi(s,a))] \big|$
$= \sup_{\nu \in F} \big| (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}[\nu(\phi(s,a))] - \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}[\nu(\phi(s,a)) - \gamma\, \nu(\phi(s', a'))] \big|, \qquad (4)$

where $F = \big\{ \nu : \nu(z) = \mathbb{E}_{T,\pi}\big[\sum_{t=0}^\infty \gamma^t g(\phi(s_t, a_t)) \,\big|\, (s_0, a_0) = \phi^{-1}(z)\big],\ g \in G \big\}$. In other words, we are now taking a supremum over the new function class $F$, which consists of functions that return the expected discounted sum of $g(\phi(s_t, a_t))$ when following the policy $\pi$ in the MDP $M$ from an initial representation $z$.
While it is now difficult to choose $F$ from $G$, Eq. (3) can still be kept valid by using a sufficiently rich function class for $F$. In this work, we choose $F$ to be the family of functions in the unit ball of a reproducing kernel Hilbert space (RKHS) $H_k$ with kernel $k$, which allows the following closed-form formula (see Lemma A.3 in the Appendix for details):

$\mathrm{IPM}_G(d^\pi_\phi, d^D_\phi)^2 = \mathbb{E}\big[ k(\phi(s,a), \phi(\bar{s},\bar{a})) + (1-\gamma)^2\, k(\phi(s_0,a_0), \phi(\bar{s}_0,\bar{a}_0)) + \gamma^2\, k(\phi(s',a'), \phi(\bar{s}',\bar{a}'))$
$\qquad - 2(1-\gamma)\, k(\phi(s_0,a_0), \phi(\bar{s},\bar{a})) - 2\gamma\, k(\phi(s',a'), \phi(\bar{s},\bar{a})) + 2\gamma(1-\gamma)\, k(\phi(s',a'), \phi(\bar{s}_0,\bar{a}_0)) \big], \qquad (5)$

where the expectation is over $s_0 \sim d_0$, $a_0 \sim \pi(s_0)$, $(s,a,s') \sim d^D$, $a' \sim \pi(s')$ and an independent copy $\bar{s}_0 \sim d_0$, $\bar{a}_0 \sim \pi(\bar{s}_0)$, $(\bar{s},\bar{a},\bar{s}') \sim d^D$, $\bar{a}' \sim \pi(\bar{s}')$. This completes our derivation of a tractable upper bound on the policy evaluation error (Eq. (3)), whose direct dependence on $d^\pi$ is eliminated by Eq. (5). Finally, we can train a model by minimizing this upper bound, which encourages learning a balanced representation while improving data fitness, and the trained model readily provides a model-based OPE.

The term $\mathrm{IPM}_G(d^\pi_\phi, d^D_\phi)^2$ in Eq. (5) can be estimated via finite random samples, and we denote its sample-based estimator by $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2$. We show in the following that a valid upper bound can be established from the sample-based estimators instead of the exact terms in the RHS of Eq. (3), under certain conditions.

Theorem 4.3. Given an MDP $M$, its estimate $\widehat{M}$ with a bijective representation function $\phi$, i.e. $(\widehat{T}, \widehat{R}) = (\widehat{T}_z \circ \phi, \widehat{R}_z \circ \phi)$, and an RKHS $H_k \subset (Z \to \mathbb{R})$ induced by a universal kernel $k$ such that $\sup_{z \in Z} k(z, z) = \bar{k}$, assume that $f_{\phi, \widehat{R}_z, \widehat{T}_z}(z) = \mathbb{E}_{T,\pi}\big[\sum_{t=0}^\infty \gamma^t\, \mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s_t, a_t) \,\big|\, (s_0, a_0) = \phi^{-1}(z)\big] \in H_k$ with $B_\phi = \|f_{\phi, \widehat{R}_z, \widehat{T}_z}\|_{H_k}$, and that the loss is bounded by $\bar{\mathcal{E}} = \sup_{s \in S, a \in A} \mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)$. Let $n$ be the number of data points in $D$. Then, with probability $1 - 2\delta$,

$|R^\pi - \widehat{R}^\pi| \le \frac{1}{n} \sum_{(s,a) \in D} \mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a) + B_\phi\, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) + \sqrt{\frac{\bar{\mathcal{E}}^2}{2n} \log \frac{1}{\delta}} + B_\phi \sqrt{\frac{\bar{k}}{n}} \Big( 4 + \sqrt{8 \log \frac{3}{\delta}} \Big).$
This result can be proved by adapting the convergence results for the empirical estimate of the MMD (Gretton et al., 2012) and Hoeffding's inequality (Hoeffding, 1963). With the choice of an RKHS $H_k$, we can now interpret $B_\phi$ as the RKHS norm $\|f_{\phi, \widehat{R}_z, \widehat{T}_z}\|_{H_k}$, which captures the magnitude and the smoothness of the expected cumulative model loss $f_{\phi, \widehat{R}_z, \widehat{T}_z}$. In general, assuming smooth underlying dynamics, we can expect $B_\phi$ to be small when the model error is small. Furthermore, although $\bar{k}$ depends on the kernel function we use, we can always let $\bar{k} = 1$ and subsume it into $B_\phi$ as long as it is bounded, i.e. by using $\bar{B}_\phi \triangleq B_\phi \sqrt{\bar{k}}$. In the next section, we develop algorithms based on practical approximations of Eq. (3).
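To give a concrete sense of the sample-based estimator, the sketch below computes the core quantity behind Eq. (5): a biased squared-MMD estimate between two batches of representation vectors under an RBF kernel. This is a simplification, not the paper's estimator: the full Eq. (5) carries six cross-terms weighted by $(1-\gamma)$ and $\gamma$ over initial-state, data, and next-state samples, but each term reduces to kernel evaluations of exactly this form. The function names and the bandwidth parameter are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Pairwise k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(Z_p, Z_q, bandwidth=1.0):
    """Biased sample estimate of MMD^2 between representation batches
    Z_p ~ p and Z_q ~ q (rows are representation vectors phi(s, a))."""
    return (rbf_kernel(Z_p, Z_p, bandwidth).mean()
            - 2.0 * rbf_kernel(Z_p, Z_q, bandwidth).mean()
            + rbf_kernel(Z_q, Z_q, bandwidth).mean())
```

The estimate is (near) zero when the two batches are drawn from the same distribution and grows as the batches separate in representation space, which is exactly the quantity the balancing regularizer penalizes.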

Detailed comparison to RepBM

As previously stated, RepBM (Liu et al., 2018b) is a model-based finite-horizon OPE algorithm that trains the model to have a balanced representation $\phi$, which is encouraged to be invariant under the target and behavior policies. Specifically, given the deterministic target policy $\pi$ and the behavior policy $\mu$, at each timestep $t$ it defines the factual distribution on $Z$ given that the actions until timestep $t$ have been executed according to the policy $\pi$, $p^F_{\phi,t}(z) = \Pr(z_t \mid M, \mu, a_{0:t} = \pi(s_{0:t}))$, and the counterfactual distribution on $Z$ given the same condition except for the action at timestep $t$, $p^{CF}_{\phi,t}(z) = \Pr(z_t \mid M, \mu, a_{0:t-1} = \pi(s_{0:t-1}), a_t \neq \pi(s_t))$. Then, RepBM bounds the OPE error as

$|R^\pi - \widehat{R}^\pi| \le (1-\gamma) \sum_{t=0}^\infty \gamma^t \Big( \mathbb{E}_{p(s_t \mid M, \mu, a_{0:t} = \pi(s_{0:t}))}\big[\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s_t, \pi(s_t))\big] + B_{\phi,t}\, \mathrm{IPM}_{G_t}(p^F_{\phi,t}, p^{CF}_{\phi,t}) \Big).$

Although RepBM achieves performance improvements over other OPE algorithms, we found a number of practical challenges. From the definition of $\mathrm{IPM}_{G_t}(p^F_{\phi,t}, p^{CF}_{\phi,t})$, it requires a discrete-action environment and a deterministic policy $\pi$, which cannot be met in many practical RL settings. In addition, since the sample-based estimation of $\mathrm{IPM}_{G_t}(p^F_{\phi,t}, p^{CF}_{\phi,t})$ requires samples consistent with the policy $\pi$, i.e. $a_{0:t-1} = \pi(s_{0:t-1})$, the algorithm rejects exponentially many samples with respect to $t$, which is the curse of horizon (Liu et al., 2018a). When there is a large difference between the behavior and target policies in long-term tasks, their implementation becomes close to using the maximum-likelihood objective, which can also be observed empirically in our experiments. In contrast, our work is free from the aforementioned practical limitations: it performs balancing between the discounted stationary distribution $d^\pi$ and the data distribution $d^D$, leveraging the recent advances in stationary distribution correction estimation (i.e.
the DualDICE trick) to overcome the difficulty posed by the expectation with respect to $d^\pi$ required to evaluate the IPM in the objective.
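The curse of horizon discussed above can be reproduced numerically: step-wise IS weights have expectation 1 at every horizon, yet with a fixed number of trajectories their empirical mean collapses toward 0 as the horizon grows, because the weight mass concentrates on exponentially rare trajectories. Below is a toy sketch (two actions, uniform behavior policy, target policy putting probability 0.8 on one action); all numbers are illustrative, not from the paper's experiments.

```python
import numpy as np

def stepwise_is_weight(target_probs, behavior_probs):
    """Product of per-step ratios pi(a_t|s_t) / mu(a_t|s_t) for one trajectory."""
    return float(np.prod(np.asarray(target_probs) / np.asarray(behavior_probs)))

# Toy setting: behavior mu is uniform over two actions, target pi puts 0.8 on
# action 0, so each per-step ratio is either 1.6 or 0.4 and E[ratio] = 1.
rng = np.random.default_rng(0)

def empirical_mean_weight(horizon, n_traj=10_000):
    """Sample n_traj trajectories under mu and average the IS weights.
    The true expectation is 1 for every horizon."""
    actions = rng.integers(0, 2, size=(n_traj, horizon))
    ratios = np.where(actions == 0, 0.8 / 0.5, 0.2 / 0.5)
    return float(ratios.prod(axis=1).mean())
```

At horizon 10 the empirical mean stays near 1, while at horizon 200 virtually every sampled weight is astronomically small and the estimate degenerates toward 0; the per-step balancing of RepBM inherits the same exponential sample rejection.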

4.2. REPRESENTATION BALANCING WITH STATIONARY DISTRIBUTION ESTIMATION

In the following, we describe algorithms for OPE and offline RL based on practical approximations to Eq. (3), which we call the RepB-SDE framework.

Objective for off-policy evaluation As mentioned earlier, we aim to minimize the upper bound of the OPE error $|R^\pi - \widehat{R}^\pi|$ specified in Theorem 4.2. To make the RHS of Eq. (3) tractable for optimization, we replace the intractable total variation distance with the KL divergence, which can be minimized by maximizing the data log-likelihood. We also replace the IPM with its sample-based estimator to obtain the learning objective:

$\min_{\phi, \widehat{R}_z, \widehat{T}_z}\ \frac{1}{n} \sum_{(s,a,s',r) \in D} \big[ -\log \widehat{R}_z(r \mid \phi(s,a)) - \log \widehat{T}_z(s' \mid \phi(s,a)) \big] + \alpha_M\, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi), \qquad (6)$

where minimizing the two negative log-likelihood terms corresponds to minimizing $D_{KL}(R(r|s,a) \,\|\, \widehat{R}_z(r|\phi(s,a)))$ and $D_{KL}(T(s'|s,a) \,\|\, \widehat{T}_z(s'|\phi(s,a)))$ in expectation. The constant $B_\phi$ in Theorem 4.2 depends on the function classes and cannot be estimated; we therefore replace it with a tunable hyperparameter $\alpha_M$ that balances data fitness against representation invariance. Note that $\alpha_M = 0$ recovers the simple maximum-likelihood objective. By simulating the target policy $\pi$ in the environment model $(\widehat{T}, \widehat{R})$ obtained by minimizing Eq. (6), we can perform a model-based OPE that approximately minimizes the upper bound of the OPE error.

Objectives for offline model-based RL By rearranging Eq. (3), we have

$R^\pi \ge \widehat{R}^\pi - \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)\big] - B_\phi\, \mathrm{IPM}_G(d^\pi_\phi, d^D_\phi). \qquad (7)$

Then, we can maximize the RHS of Eq. (7) to obtain the model and the policy that maximize the lower bound of the true return $R^\pi$. Similar to the derivation of Eq. (6), we replace the total variation distance with the KL divergence to obtain the following learning objectives:

$L_M(\widehat{M}, \pi, \alpha_M) = \mathbb{E}_{d^D}\big[ -\log \widehat{R}_z(r \mid \phi(s,a)) - \log \widehat{T}_z(s' \mid \phi(s,a)) \big] + \alpha_M\, \mathrm{IPM}_G(d^\pi_\phi, d^D_\phi), \qquad (8)$

$J_\pi(\pi, \widehat{M}, \alpha_\pi) = \mathbb{E}_{\widehat{M}, \pi}\Big[ \sum_{t=0}^\infty \gamma^t r_t \Big] - \alpha_\pi\, \mathrm{IPM}_G(d^\pi_\phi, d^D_\phi), \qquad (9)$

where the expectation in Eq. (9) can be optimized using various model-based RL algorithms, e.g. with planning (Chua et al., 2018) or using a model-free learner (Janner et al., 2019).
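A sketch of how the Eq. (6) objective combines its two ingredients, assuming Gaussian model heads so that the negative log-likelihood has a closed form; the helper names, array shapes, and the Gaussian parameterization are our illustrative assumptions, not fixed by the text.

```python
import numpy as np

def gaussian_nll(target, mean, log_std):
    """Per-sample -log N(target; mean, exp(log_std)^2), summed over dims.
    `target`, `mean`, `log_std` have shape (batch, dim)."""
    var = np.exp(2.0 * log_std)
    return 0.5 * (((target - mean) ** 2) / var
                  + 2.0 * log_std + np.log(2 * np.pi)).sum(-1)

def repb_sde_objective(s_next, r, mu_s, log_std_s, mu_r, log_std_r,
                       ipm_hat, alpha_m=0.1):
    """Eq. (6)-style loss: NLL of the transition and reward heads plus
    alpha_M times the sample-based IPM estimate.  alpha_m = 0 recovers
    the plain maximum-likelihood objective."""
    nll = gaussian_nll(s_next, mu_s, log_std_s) + gaussian_nll(r, mu_r, log_std_r)
    return float(nll.mean() + alpha_m * ipm_hat)
```

In practice the predictions and the IPM estimate would come from the representation network and its regularizer, and the loss would be minimized by stochastic gradient descent; here plain arrays stand in for both.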
By alternating between minimizing $L_M$ with respect to $\widehat{M}$ and maximizing $J_\pi$ with respect to $\pi$ by stochastic gradient methods, we can perform offline model-based RL that approximately maximizes the lower bound of the true return $R^\pi$.

Implementation details Following recent practice (Chua et al., 2018; Janner et al., 2019; Yu et al., 2020), we model the dynamics $(\widehat{T}, \widehat{R})$ using a bootstrapped ensemble of neural networks. To optimize a policy based on the objective, we perform full rollouts (until reaching the terminal states or the maximum number of timesteps) using the learned dynamics $(\widehat{T}, \widehat{R})$. While generating the rollouts, we pessimistically augment the estimated reward function with a penalty proportional to the bootstrapped uncertainty, which helped the algorithm perform robustly. We suspect that the difficulty of computing an accurate IPM estimate is what makes the additional pessimism beneficial. We store the generated experiences in a separate dataset $\widehat{D}$ and update the policy $\pi$ with IPM-regularized soft actor-critic (SAC) (Haarnoja et al., 2018) using samples from both datasets $D \cup \widehat{D}$, similar to MBPO (Janner et al., 2019). Since the presented model objective requires a policy $\pi$ to perform balancing, we initially train the model and the policy with $\alpha_M = \alpha_\pi = 0$: $\widehat{M}_0 = \arg\min_{\widehat{M}} L_M(\widehat{M}, \cdot, 0)$ and $\pi_0 = \arg\max_\pi J_\pi(\pi, \widehat{M}_0, 0)$. Then, we retrain the model and the policy using $\pi_0$: $\widehat{M}_1 = \arg\min_{\widehat{M}} L_M(\widehat{M}, \pi_0, \alpha_M)$ and $\pi_1 = \arg\max_\pi J_\pi(\pi, \widehat{M}_1, \alpha_\pi)$ for some non-negative $\alpha_M$ and $\alpha_\pi$. While it is desirable to repeat the optimization of the model and the policy until convergence, we did not observe significant improvement after the first iteration and report the performance of the policy after the first iteration, $\pi_1$.
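The reward penalty used during rollouts can take several forms; one plausible sketch of a "penalty proportional to the bootstrapped uncertainty" uses the standard deviation of reward predictions across ensemble members. The exact penalty shape and coefficient are implementation choices not pinned down in the text above, so treat this as an assumption.

```python
import numpy as np

def penalized_reward(ensemble_preds, lam=1.0):
    """Pessimistic reward for a model rollout: the ensemble-mean prediction
    minus lam times the disagreement (std) across bootstrap members.

    `ensemble_preds` has shape (n_models, batch); `lam` is a tunable
    pessimism coefficient (illustrative, not from the paper)."""
    mu = ensemble_preds.mean(axis=0)
    penalty = ensemble_preds.std(axis=0)   # bootstrap disagreement
    return mu - lam * penalty
```

When all members agree the penalty vanishes and the rollout uses the plain model reward; in data-sparse regions where members disagree, the agent is discouraged from exploiting model error.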

5. EXPERIMENTS

We demonstrate the effectiveness of the RepB-SDE framework by comparing the OPE performance of Eq. (6) to that of RepBM and by evaluating the presented model-based offline RL algorithm on the benchmarks. The code used to produce the results is available online. A detailed description of the experiments can be found in Appendix B.

5.1. MODEL-BASED OFF-POLICY EVALUATION

For the sake of comparison with RepBM, we test our OPE algorithm on three continuous-state discrete-action tasks where the goal is to evaluate a deterministic target policy. We trained a suboptimal deterministic target policy π and used an ε-greedy policy with various values of ε as the data collection policy. In each experiment, we trained environment models for a fixed number of epochs under three different objectives: the simple maximum-likelihood baseline, the step-wise representation balancing objective used in RepBM (Liu et al., 2018b), and the presented OPE objective of RepB-SDE (Eq. (6)). We measured the individual mean squared error (Liu et al., 2018b). The normalized error of each algorithm, relative to the error of the baseline valued at 1, is presented in Figure 1. We can observe that the presented objective of RepB-SDE significantly reduces the OPE error from the baseline, outperforming RepBM in most of the cases. As the off-policyness between the policies (ε) increases, representation balancing algorithms should in principle benefit more compared to the maximum-likelihood baseline. However, the results show that the performance of RepBM barely improves, due to its increased sample rejection rate under large ε.

5.2. OFFLINE MODEL-BASED POLICY OPTIMIZATION

We evaluate the offline model-based RL algorithm presented in Section 4.2 on a subset of datasets in the D4RL benchmark (Fu et al., 2020): four types of datasets (Random, Medium, Medium-Replay, and Medium-Expert) from three MuJoCo environments (HalfCheetah-v2, Hopper-v2, and Walker2d-v2) (Todorov et al., 2012). The Random dataset contains 10^6 experience tuples from a random policy. The Medium dataset contains 10^6 experience tuples from a policy trained to approximately 1/3 of the performance of the expert, where the expert is an agent trained to completion with SAC. The Medium-Replay dataset contains 10^5 (2 × 10^5 for Walker2d-v2) experience tuples taken from the replay buffer of a policy trained up to the performance of the medium agent.
The Medium-Expert dataset is the Medium dataset combined with 10^6 samples from the expert. This experimental setting exactly follows that of Yu et al. (2020) and Argenson & Dulac-Arnold (2020). The normalized score of each algorithm is presented in Table 1. MF denotes the best score among offline model-free algorithms (taken from Fu et al. (2020) and Kumar et al. (2020)), including SAC (Haarnoja et al., 2018), BCQ (Fujimoto et al., 2019), BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019), AWR (Peng et al., 2019), cREM (Agarwal et al., 2020), AlgaeDICE (Nachum et al., 2019b), and CQL (Kumar et al., 2020); the actual algorithm achieving the reported score is shown next to the number. Base shows the performance of the most naive baseline, which attempts to maximize the estimated policy return under the maximum-likelihood model. RP denotes the performance of Base equipped with the reward penalty based on the bootstrapped uncertainty of the model, which is equivalent to π_0 described in Section 4.2. RepB-SDE denotes the performance after a single iteration of our algorithm, corresponding to π_1. We also provide BC, the performance of direct behavior cloning from the data, and MOPO (Yu et al., 2020), an offline model-based RL algorithm that optimizes the policy based on truncated rollouts with a heuristic reward penalty. The significant gap between RepB-SDE and RP shows the advantage brought by our framework, which encourages balanced representation. Our approach was less successful on some of the datasets (mostly in the Hopper-v2 environment); we hypothesize that the different conservative training techniques (the behavior regularization exploited in the model-free algorithms, the rollout truncation in MOPO, and the pessimistic training based on bootstrapped uncertainty estimates adopted in our algorithm) exhibit their strengths on different datasets.
For example, it may be the case that the ensemble models are overconfident, especially in Hopper-v2, and should be regularized with more explicit methods. Nevertheless, we emphasize that the presented framework can be used jointly with any other conservative training technique to improve its performance.
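The normalized scores in Table 1 follow the D4RL convention, which linearly rescales the raw return so that a random policy scores 0 and an expert policy scores 100 on each task:

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """D4RL convention: 0 corresponds to the task's random-policy return
    and 100 to its expert-policy return."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)
```

This makes scores comparable across environments with very different reward scales; scores above 100 are possible if a policy exceeds the reference expert.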

6. CONCLUSION AND FUTURE WORK

In this paper, we presented RepB-SDE, a framework for balancing the model representation with stationary distribution estimation, aiming to obtain a model robust to the distribution shift that arises in off-policy and offline RL. We started from the theoretical observation that the model-based policy evaluation error can be upper-bounded by the data fitness plus the distance between two distributions in the representation space. Motivated by the bound, we presented a model learning objective for off-policy evaluation and model-based offline policy optimization. RepB-SDE can be seen as an extension of RepBM that addresses the curse of horizon by leveraging the recent advances in stationary distribution correction estimation (i.e. the DualDICE trick). Using the stationary distribution also frees us from the other limitations of RepBM, allowing application to more practical settings. To the best of our knowledge, this is the first attempt to introduce an augmented objective for learning a model robust to a specific distribution shift in offline RL. In the experiments, we empirically demonstrated that we can significantly reduce the OPE error from the baseline, outperforming RepBM in most cases. We also showed that the robust model helps in offline model-based policy optimization, yielding state-of-the-art performance in a representative set of D4RL benchmarks. We emphasize that our approach can be directly adopted in many other model-based offline RL algorithms. There are a number of promising directions for future work. Most importantly, we have not leveraged the learned representation when optimizing the policy, although it is very natural to do so. We could easily incorporate the representation into the policy by assuming an energy-based model, but this would make the computation of entropy intractable in entropy-regularized policy optimization algorithms.
It would also be interesting to see whether the proposed framework for learning balanced representations can benefit off-policy (and offline) model-free methods.

A. PROOFS

Lemma 4.1. Given an MDP $M$ and its estimate $\widehat{M}$ with a bijective representation function $\phi$, i.e. $(\widehat{T}, \widehat{R}) = (\widehat{T}_z \circ \phi, \widehat{R}_z \circ \phi)$, the policy evaluation error of a policy $\pi$ can be bounded by:

$|R^\pi - \widehat{R}^\pi| \le \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)\big].$

Proof. We define the value function $V^\pi(s) = \mathbb{E}_{M,\pi}\big[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\big]$, the expected discounted return starting from the state $s$. By definition, we can write $R^\pi = (1-\gamma)\, \mathbb{E}_{s \sim d_0}[V^\pi(s)]$. The following recursive equation also holds: $V^\pi(s) = \mathbb{E}_{a \sim \pi(s)}\big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim T(s,a)}[V^\pi(s')] \big]$, and similarly for the approximate value function $\widehat{V}^\pi(s) = \mathbb{E}_{\widehat{M},\pi}\big[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\big]$ under the MDP estimate $\widehat{M}$. Then,

$\frac{1}{1-\gamma}\big(R^\pi - \widehat{R}^\pi\big) = \mathbb{E}_{d_0}\big[V^\pi(s_0) - \widehat{V}^\pi(s_0)\big]$
$= \mathbb{E}_{d_0,\pi}[r_0 - \hat{r}_0] + \gamma\, \mathbb{E}_{d_0} \int \big( V^\pi(s_1)\, T(s_1|s_0,a_0) - \widehat{V}^\pi(s_1)\, \widehat{T}(s_1|s_0,a_0) \big)\, \pi(a_0|s_0)\, ds_1\, da_0$
$= \mathbb{E}_{d_0,\pi}[r_0 - \hat{r}_0] + \gamma\, \mathbb{E}_{d_0} \int \big( V^\pi(s_1) - \widehat{V}^\pi(s_1) \big)\, T(s_1|s_0,a_0)\, \pi(a_0|s_0)\, ds_1\, da_0 \quad (\text{this forms a recursive equation})$
$\qquad + \gamma\, \mathbb{E}_{d_0} \int \widehat{V}^\pi(s_1)\, \big( T(s_1|s_0,a_0) - \widehat{T}(s_1|s_0,a_0) \big)\, \pi(a_0|s_0)\, ds_1\, da_0,$

where the last equality adds and subtracts $\widehat{V}^\pi(s_1)\, T(s_1|s_0,a_0)$. Unrolling the recursive term over timesteps turns the sum of $\gamma^t$-weighted expectations into an expectation over the discounted stationary distribution $d^\pi$. Bounding the reward term by the total variation distance of the reward model, and the transition term by $\|\widehat{V}^\pi\|_\infty \le r_{\max}/(1-\gamma)$ together with the total variation distance of the transition model, we obtain

$\frac{1}{1-\gamma}\big|R^\pi - \widehat{R}^\pi\big| \le \frac{2(1-\gamma)}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^\pi}\big[ D_{TV}\big(R(r|s,a) \,\|\, \widehat{R}_z(r|\phi(s,a))\big) \big] + \frac{2\gamma r_{\max}}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^\pi}\big[ D_{TV}\big(T(s'|s,a) \,\|\, \widehat{T}_z(s'|\phi(s,a))\big) \big],$

and multiplying both sides by $(1-\gamma)$ gives the claim.

Theorem 4.2. Given an MDP $M$ and its estimate $\widehat{M}$ with a bijective representation function $\phi$, i.e. $(\widehat{T}, \widehat{R}) = (\widehat{T}_z \circ \phi, \widehat{R}_z \circ \phi)$, assume that there exist a constant $B_\phi > 0$ and a function class $G \subseteq \{g : Z \to \mathbb{R}\}$ such that $\frac{1}{B_\phi} \mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(\phi^{-1}(\cdot)) \in G$. Then, for any policy $\pi$,

$|R^\pi - \widehat{R}^\pi| \le \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)\big] + B_\phi\, \mathrm{IPM}_G(d^\pi_\phi, d^D_\phi).$

Proof. From Lemma 4.1, we directly have:

$|R^\pi - \widehat{R}^\pi| \le \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)\big] + \int \mathcal{E}_{\phi, \widehat{R}_z, \widehat{T}_z}(s, a)\, \big( d^\pi(s,a) - d^D(s,a) \big)\, ds\, da.$
Then, by the change of variables $z = \phi(s,a)$,
$$\int E_{\phi,\hat R_z,\hat T_z}(s,a)\,\big(d^\pi(s,a) - d^D(s,a)\big)\,ds\,da = \int E_{\phi,\hat R_z,\hat T_z}(\phi^{-1}(z))\,\big(d^\pi_\phi(z) - d^D_\phi(z)\big)\,dz$$
$$\le B_\phi \int \frac{1}{B_\phi} E_{\phi,\hat R_z,\hat T_z}(\phi^{-1}(z))\,\big(d^\pi_\phi(z) - d^D_\phi(z)\big)\,dz \qquad (B_\phi > 0)$$
$$\le B_\phi \sup_{g\in\mathcal G} \int g(z)\,\big(d^\pi_\phi(z) - d^D_\phi(z)\big)\,dz \qquad \Big(\tfrac{1}{B_\phi} E_{\phi,\hat R_z,\hat T_z}(\phi^{-1}(\cdot)) \in \mathcal G\Big)$$
$$= B_\phi\,\mathrm{IPM}_{\mathcal G}\big(d^\pi_\phi, d^D_\phi\big).$$

We state some previous results, which are required for the further proofs.

Theorem A.1 (McDiarmid's inequality (McDiarmid, 1989)). Let $\{X_i\}_{i=1}^n$ be independent random variables taking values in a set $\mathcal X$, and assume that $f : \mathcal X^n \to \mathbb R$ satisfies
$$\sup_{\{x_i\}_{i=1}^n \in \mathcal X^n,\, x \in \mathcal X} \big|f(\{x_i\}_{i=1}^n) - f(x_1, \dots, x_{i-1}, x, x_{i+1}, \dots, x_n)\big| \le c_i.$$
Then for every $\epsilon > 0$,
$$\Pr\Big\{f(\{X_i\}_{i=1}^n) - \mathbb E_{\{X_i\}_{i=1}^n}\big[f(\{X_i\}_{i=1}^n)\big] \ge \epsilon\Big\} \le \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\Big).$$

Lemma A.2 (Rademacher complexity of an RKHS (Bartlett & Mendelson, 2002)). Let $\mathcal F$ be the unit ball in a universal RKHS on a compact domain $\mathcal X$, with kernel bounded according to $0 \le k(x, x') \le \bar k$. Let $\{x_i\}_{i=1}^n$ be an i.i.d. sample of size $n$ drawn according to a probability measure $p$ on $\mathcal X$, and let $\sigma_i$ be i.i.d. taking values in $\{-1, 1\}$ with equal probability. The Rademacher complexity, defined below, is upper-bounded as:
$$R_n(\mathcal F) := \mathbb E_{\{x_i\}_{i=1}^n, \sigma}\Big[\sup_{f\in\mathcal F} \frac1n \sum_{i=1}^n \sigma_i f(x_i)\Big] \le \sqrt{\frac{\bar k}{n}}.$$
The upper bound follows from Lemma 22 of Bartlett & Mendelson (2002). Now we prove the following using the results above.

Lemma A.3. Let $\mathcal H_k$ be the RKHS associated with a universal kernel $k(\cdot,\cdot)$, and let $\langle\cdot,\cdot\rangle_{\mathcal H_k}$ be the inner product of $\mathcal H_k$, which satisfies the reproducing property $\nu(z) = \langle\nu, k(\cdot, z)\rangle_{\mathcal H_k}$.
When $\mathcal G$ is chosen such that
$$\mathcal G = \Big\{g : \mathcal Z \to \mathbb R \;:\; g(z) = \nu(z) - \gamma\,\mathbb E_{s'\sim T(\phi^{-1}(z)),\, a'\sim\pi(s')}\big[\nu(\phi(s',a'))\big],\;\; \nu \in (\mathcal Z \to \mathbb R),\; \langle\nu,\nu\rangle_{\mathcal H_k} \le 1\Big\},$$
the $\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi)$ has the following closed-form definition:
$$\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi)^2 = \mathbb E_{\substack{s_0\sim d_0,\, a_0\sim\pi(s_0),\, (s,a,s')\sim d^D,\, a'\sim\pi(s') \\ \bar s_0\sim d_0,\, \bar a_0\sim\pi(\bar s_0),\, (\bar s,\bar a,\bar s')\sim d^D,\, \bar a'\sim\pi(\bar s')}}\Big[ k(\phi(s,a), \phi(\bar s,\bar a)) + (1-\gamma)^2\, k(\phi(s_0,a_0), \phi(\bar s_0,\bar a_0)) + \gamma^2\, k(\phi(s',a'), \phi(\bar s',\bar a'))$$
$$\qquad - 2(1-\gamma)\, k(\phi(s_0,a_0), \phi(\bar s,\bar a)) - 2\gamma\, k(\phi(s',a'), \phi(\bar s,\bar a)) + 2\gamma(1-\gamma)\, k(\phi(s',a'), \phi(\bar s_0,\bar a_0))\Big].$$
Furthermore, suppose that $\bar k := \sup_{z\in\mathcal Z} k(z,z)$. The estimator $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2$, which is the sample-based estimate of $\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi)^2$ from $n$ samples, satisfies, with probability at least $1-\delta$,
$$\Big|\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)\Big| \le \sqrt{\frac{\bar k}{n}}\Big(4 + \sqrt{8\log\tfrac3\delta}\Big).$$

Proof. Below we write $\mathrm{IPM}_{\mathcal G}$ as shorthand for $\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi)$. As in Eq. (4), we can rewrite the IPM term as:
$$\mathrm{IPM}_{\mathcal G} = \sup_{\nu\in\mathcal F}\Big|(1-\gamma)\,\mathbb E_{s\sim d_0,\, a\sim\pi(s)}\big[\nu(\phi(s,a))\big] - \mathbb E_{(s,a,s')\sim d^D,\, a'\sim\pi(s')}\big[\nu(\phi(s,a)) - \gamma\,\nu(\phi(s',a'))\big]\Big|,$$
where $\mathcal F = \{\nu \in (\mathcal Z \to \mathbb R) : \langle\nu,\nu\rangle_{\mathcal H_k} \le 1\}$ is the unit ball in the RKHS $\mathcal H_k$. Using the reproducing property of $\mathcal H_k$:
$$\mathrm{IPM}_{\mathcal G}^2 = \sup_{\nu\in\mathcal F}\Big((1-\gamma)\,\mathbb E_{s\sim d_0,\, a\sim\pi(s)}\big[\langle\nu, k(\cdot,\phi(s,a))\rangle_{\mathcal H_k}\big] - \mathbb E_{(s,a,s')\sim d^D,\, a'\sim\pi(s')}\big[\langle\nu, k(\cdot,\phi(s,a))\rangle_{\mathcal H_k} - \gamma\,\langle\nu, k(\cdot,\phi(s',a'))\rangle_{\mathcal H_k}\big]\Big)^2 = \sup_{\nu\in\mathcal F}\langle\nu, \nu^*\rangle^2_{\mathcal H_k},$$
where
$$\nu^*(\cdot) = (1-\gamma)\,\mathbb E_{s\sim d_0,\, a\sim\pi(s)}\big[k(\cdot,\phi(s,a))\big] - \mathbb E_{(s,a,s')\sim d^D,\, a'\sim\pi(s')}\big[k(\cdot,\phi(s,a)) - \gamma\, k(\cdot,\phi(s',a'))\big].$$
By the Cauchy–Schwarz inequality and $\langle\nu,\nu\rangle_{\mathcal H_k} \le 1$ for all $\nu\in\mathcal F$, we have $\langle\nu,\nu^*\rangle^2_{\mathcal H_k} \le \langle\nu,\nu\rangle_{\mathcal H_k}\langle\nu^*,\nu^*\rangle_{\mathcal H_k} \le \langle\nu^*,\nu^*\rangle_{\mathcal H_k}$, with equality attained, so $\sup_{\nu\in\mathcal F}\langle\nu,\nu^*\rangle^2_{\mathcal H_k} = \langle\nu^*,\nu^*\rangle_{\mathcal H_k}$. Using the property $\langle k(\cdot,z), k(\cdot,\bar z)\rangle_{\mathcal H_k} = k(z,\bar z)$, we can derive the closed-form expression in the lemma from $\langle\nu^*,\nu^*\rangle_{\mathcal H_k}$. Now we prove the error bound of the estimator.
First, we divide $\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi)$ into three parts: $\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi) = \sup_{\nu\in\mathcal F}|f_1(\nu) + f_2(\nu) + f_3(\nu)|$, where
$$f_1(\nu) = (1-\gamma)\,\mathbb E_{s\sim d_0,\, a\sim\pi(s)}\big[\nu(\phi(s,a))\big], \quad f_2(\nu) = -\,\mathbb E_{(s,a)\sim d^D}\big[\nu(\phi(s,a))\big], \quad f_3(\nu) = \gamma\,\mathbb E_{(s,a,s')\sim d^D,\, a'\sim\pi(s')}\big[\nu(\phi(s',a'))\big].$$
Given $n$ samples $\{s_0^{(i)}, a_0^{(i)}, s^{(i)}, a^{(i)}, s'^{(i)}, a'^{(i)}\}_{i=1}^n$ from the generative process $s_0^{(i)} \sim d_0$, $a_0^{(i)} \sim \pi(s_0^{(i)})$, $(s^{(i)}, a^{(i)}, s'^{(i)}) \sim d^D$, $a'^{(i)} \sim \pi(s'^{(i)})$ for all $i$, we define the sample-based estimator $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)$:
$$\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2 = \frac{1}{n^2}\sum_{i,j}\Big[ k\big(\phi(s^{(i)},a^{(i)}), \phi(s^{(j)},a^{(j)})\big) + (1-\gamma)^2\, k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(s_0^{(j)},a_0^{(j)})\big) + \gamma^2\, k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s'^{(j)},a'^{(j)})\big)$$
$$\qquad - 2(1-\gamma)\, k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(s^{(j)},a^{(j)})\big) - 2\gamma\, k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s^{(j)},a^{(j)})\big) + 2\gamma(1-\gamma)\, k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s_0^{(j)},a_0^{(j)})\big)\Big].$$
By deriving in reverse order, we can recover its equivalent supremum form, which can also be divided into three parts: $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) = \sup_{\nu\in\mathcal F}|\hat f_1(\nu) + \hat f_2(\nu) + \hat f_3(\nu)|$, where
$$\hat f_1(\nu) = \frac{1-\gamma}{n}\sum_{i=1}^n \nu\big(\phi(s_0^{(i)}, a_0^{(i)})\big), \quad \hat f_2(\nu) = -\frac1n\sum_{i=1}^n \nu\big(\phi(s^{(i)}, a^{(i)})\big), \quad \hat f_3(\nu) = \frac{\gamma}{n}\sum_{i=1}^n \nu\big(\phi(s'^{(i)}, a'^{(i)})\big).$$
We can bound the error of the sample-based estimator by the individual errors:
$$\Big|\mathrm{IPM}_{\mathcal G} - \widehat{\mathrm{IPM}}\Big| = \Big|\sup_{\nu\in\mathcal F}|f_1(\nu)+f_2(\nu)+f_3(\nu)| - \sup_{\nu\in\mathcal F}|\hat f_1(\nu)+\hat f_2(\nu)+\hat f_3(\nu)|\Big|$$
$$\le \sup_{\nu\in\mathcal F}\Big||f_1(\nu)+f_2(\nu)+f_3(\nu)| - |\hat f_1(\nu)+\hat f_2(\nu)+\hat f_3(\nu)|\Big| \le \sup_{\nu\in\mathcal F}\big|f_1(\nu)+f_2(\nu)+f_3(\nu) - \hat f_1(\nu)-\hat f_2(\nu)-\hat f_3(\nu)\big|$$
$$\le \sup_{\nu\in\mathcal F}\big|f_1(\nu)-\hat f_1(\nu)\big| + \sup_{\nu\in\mathcal F}\big|f_2(\nu)-\hat f_2(\nu)\big| + \sup_{\nu\in\mathcal F}\big|f_3(\nu)-\hat f_3(\nu)\big|.$$
We then observe that
$$\sup_{\nu\in\mathcal F}\big|f_1(\nu)-\hat f_1(\nu)\big| = (1-\gamma)\Big(\mathbb E_{\substack{s\sim d_0,\, a\sim\pi(s)\\ \bar s\sim d_0,\, \bar a\sim\pi(\bar s)}}\big[k(\phi(s,a), \phi(\bar s,\bar a))\big] - \frac2n\sum_{i=1}^n \mathbb E_{\bar s\sim d_0,\, \bar a\sim\pi(\bar s)}\big[k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(\bar s,\bar a)\big)\big] + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(s_0^{(j)},a_0^{(j)})\big)\Big)^{1/2},$$
which shows that changing one sample $(s_0^{(i)}, a_0^{(i)})$ changes $\sup_{\nu\in\mathcal F}|f_1(\nu)-\hat f_1(\nu)|$ by at most $2(1-\gamma)\bar k^{1/2}/n$, where $\bar k = \sup_{z\in\mathcal Z} k(z,z)$. Therefore, by McDiarmid's inequality (Theorem A.1),
$$\Pr\Big\{\sup_{\nu\in\mathcal F}\big|f_1(\nu)-\hat f_1(\nu)\big| - \mathbb E\Big[\sup_{\nu\in\mathcal F}\big|f_1(\nu)-\hat f_1(\nu)\big|\Big] \ge \epsilon\Big\} \le \exp\Big(-\frac{n\epsilon^2}{2(1-\gamma)^2\bar k}\Big).$$
Also, by a standard symmetrization argument with Rademacher variables $\sigma^{(i)}$,
$$\mathbb E\Big[\sup_{\nu\in\mathcal F}\big|f_1(\nu)-\hat f_1(\nu)\big|\Big] \le \frac{1-\gamma}{n}\,\mathbb E\Big[\sup_{\nu\in\mathcal F}\Big|\sum_{i=1}^n \sigma^{(i)}\Big(\nu\big(\phi(\bar s_0^{(i)}, \bar a_0^{(i)})\big) - \nu\big(\phi(s_0^{(i)}, a_0^{(i)})\big)\Big)\Big|\Big] \le 2(1-\gamma)\sqrt{\frac{\bar k}{n}},$$
where the last inequality is from Lemma A.2. Combining the results, we get
$$\Pr\Big\{\sup_{\nu\in\mathcal F}\big|f_1(\nu)-\hat f_1(\nu)\big| - 2(1-\gamma)\sqrt{\tfrac{\bar k}{n}} \ge \epsilon\Big\} \le \exp\Big(-\frac{n\epsilon^2}{2(1-\gamma)^2\bar k}\Big).$$
Similarly, we derive bounds for $f_2$ and $f_3$, respectively:
$$\Pr\Big\{\sup_{\nu\in\mathcal F}\big|f_2(\nu)-\hat f_2(\nu)\big| - 2\sqrt{\tfrac{\bar k}{n}} \ge \epsilon\Big\} \le \exp\Big(-\frac{n\epsilon^2}{2\bar k}\Big), \qquad \Pr\Big\{\sup_{\nu\in\mathcal F}\big|f_3(\nu)-\hat f_3(\nu)\big| - 2\gamma\sqrt{\tfrac{\bar k}{n}} \ge \epsilon\Big\} \le \exp\Big(-\frac{n\epsilon^2}{2\gamma^2\bar k}\Big).$$
Setting the RHS of each bound to $\delta/3$ and applying the union bound, we get, with probability $1-\delta$,
$$\Big|\mathrm{IPM}_{\mathcal G}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)\Big| \le \sqrt{\frac{\bar k}{n}}\Big(4 + \sqrt{8\log\tfrac3\delta}\Big).$$

The relationship between $\mathcal F$ and
$$\mathcal G = \Big\{g : \mathcal Z \to \mathbb R \;:\; g(z) = \nu(z) - \gamma\,\mathbb E_{s'\sim T(\phi^{-1}(z)),\, a'\sim\pi(s')}\big[\nu(\phi(s',a'))\big],\; \langle\nu,\nu\rangle_{\mathcal H_k} \le 1\Big\}$$
shows that when the conditional expectation $\mathbb E_{s'\sim T(\phi^{-1}(\cdot)),\, a'\sim\pi(s')}[\nu(\phi(s',a'))] : \mathcal Z \to \mathbb R$ is a function in the RKHS $\mathcal H_k$, $\mathcal G$ also becomes a subset of $\mathcal H_k$. Then we can prove the following theorem.

Theorem 4.3. Given an MDP $M$, its estimate $\hat M$ with a bijective representation function $\phi$, i.e.
$(\hat T, \hat R) = (\hat T_z \circ \phi, \hat R_z \circ \phi)$, and an RKHS $\mathcal H_k \subset (\mathcal Z \to \mathbb R)$ induced by a universal kernel $k$ such that $\sup_{z\in\mathcal Z} k(z,z) = \bar k$, assume that
$$f_{\phi,\hat R_z,\hat T_z}(z) = \mathbb E_{T,\pi}\Big[\sum_{t=0}^\infty \gamma^t E_{\phi,\hat R_z,\hat T_z}(s_t, a_t) \,\Big|\, (s_0, a_0) = \phi^{-1}(z)\Big] \in \mathcal H_k$$
with $B_\phi = \|f_{\phi,\hat R_z,\hat T_z}\|_{\mathcal H_k}$, and that the loss is bounded by $\bar E = \sup_{s\in\mathcal S,\, a\in\mathcal A} E_{\phi,\hat R_z,\hat T_z}(s,a)$. Let $n$ be the number of data in $D$. Then, with probability $1-2\delta$,
$$\big|R^\pi - \hat R^\pi\big| \le \frac1n\sum_{(s,a)\in D} E_{\phi,\hat R_z,\hat T_z}(s,a) + B_\phi\,\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) + \sqrt{\frac{\bar E^2}{2n}\log\frac1\delta} + B_\phi\sqrt{\frac{\bar k}{n}}\Big(4 + \sqrt{8\log\tfrac3\delta}\Big).$$

Proof. Applying Hoeffding's inequality (Hoeffding, 1963), with probability $1-\delta$, we get:
$$\mathbb E_{(s,a)\sim d^D}\big[E_{\phi,\hat R_z,\hat T_z}(s,a)\big] \le \frac1n\sum_{(s,a)\in D} E_{\phi,\hat R_z,\hat T_z}(s,a) + \sqrt{\frac{\bar E^2}{2n}\log\frac1\delta}.$$
Using a union bound with Eq. (13) and substituting the terms in Eq. (3), we recover the result.

Following the choice of Liu et al. (2018b), we use the dot-product kernel $k(\phi(s), \phi(\bar s)) = \phi(s)^\top \phi(\bar s)$ for the OPE experiment, which is not universal but allows us to avoid a search over kernel hyperparameters such as length-scales. After training, we generate another 200 trajectories (50 in the case of the HIV simulator) and roll out in both the true environment and the simulated environment (based on the learned model) to evaluate the models. We measure the individual MSE,
$$\text{Individual MSE} = \mathbb E_{s_0\sim d_0}\Big[\Big(\mathbb E_{\hat M,\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0\Big] - \mathbb E_{M,\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0\Big]\Big)^2\Big],$$
for each model.

Comparison to other baselines. In Figure 2, the OPE results with other model-free baselines are presented. FQE: fitted Q-evaluation; IS: step-wise importance sampling; DR: the doubly robust estimator based on step-wise importance sampling, using the value function learned with fitted Q-evaluation (Jiang & Li, 2016); DualDICE: the stationary distribution correction algorithm (Nachum et al., 2019a). In the case of DualDICE, we used the implementation provided by the authors.
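As a concrete sketch, the sample-based squared-IPM estimator of Lemma A.3 with this dot-product kernel can be computed as follows; the function name and array layout are our own, and the representations are assumed to be precomputed:

```python
import numpy as np

def mmd2_ipm(z0, z, zp, gamma, kernel=lambda a, b: a @ b.T):
    """Sample-based squared-IPM estimator of Lemma A.3.

    z0 : (n, d) representations phi(s0, a0) with s0 ~ d0, a0 ~ pi(s0)
    z  : (n, d) representations phi(s, a) with (s, a) ~ d^D
    zp : (n, d) representations phi(s', a') with s' ~ T(s, a), a' ~ pi(s')
    kernel defaults to the dot-product kernel used in the OPE experiment.
    """
    n = z.shape[0]
    K   = kernel(z, z)    # k(phi(s, a), phi(s_bar, a_bar)) for all pairs
    K00 = kernel(z0, z0)
    Kpp = kernel(zp, zp)
    K0z = kernel(z0, z)
    Kpz = kernel(zp, z)
    Kp0 = kernel(zp, z0)
    # the six terms of the closed-form expression, averaged over all pairs
    total = (K + (1 - gamma) ** 2 * K00 + gamma ** 2 * Kpp
             - 2 * (1 - gamma) * K0z - 2 * gamma * Kpz
             + 2 * gamma * (1 - gamma) * Kp0)
    return total.sum() / n ** 2
```

With the dot-product kernel this reduces to the squared Euclidean distance between the mean embeddings $(1-\gamma)\bar z_0 + \gamma \bar z'$ and $\bar z$, so the estimate is nonnegative; it vanishes when the two mean embeddings coincide.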

B.3 DETAILS OF THE OFFLINE RL EXPERIMENTS

Task details. We used 12 datasets from D4RL (Fu et al., 2020), over four dataset types and three environments, as specified in the main text. In Table 1, the normalized scores suggested by Fu et al. (2020) are used to report the results, where a score of 0 corresponds to a random policy and 100 corresponds to a converged SAC policy. In HalfCheetah-v2, a score of 0 corresponds to an undiscounted return of -280, and a score of 100 to an undiscounted return of 12135. In Hopper-v2, a score of 0 corresponds to an undiscounted return of -20, and a score of 100 to 3234. In Walker2d-v2, a score of 0 corresponds to an undiscounted return of 2, and a score of 100 to 4592. We assume that the termination conditions of the tasks are known a priori.

Representation balancing maximizing undiscounted return. As we report the undiscounted sum of rewards in the experiments, maximizing the lower bound of $R^\pi$ may underutilize the experiences of later timesteps. One way to mitigate this mismatch is to optimize the policy by maximizing the returns starting from the states in the dataset. This corresponds to maximizing $\tilde R^\pi = (1-\gamma)\,\mathbb E_{s_0\sim d^D, M, \pi}\big[\sum_{t=0}^\infty \gamma^t r_t\big]$ instead of $R^\pi$, where the expectation with respect to the initial state distribution $d_0$ is replaced by the data distribution $d^D$. Consequently, to bound the error of $\tilde R^\pi$, the representation should be balanced with another discounted stationary distribution $\tilde d^\pi(s,a) := (1-\gamma)\sum_{t=0}^\infty \gamma^t \Pr(s_t = s, a_t = a \mid s_0 \sim d^D, T, \pi)$, the distribution induced by the policy $\pi$ when the initial state is sampled from the data distribution. The derivations can be easily adapted by noting that:

Model and algorithm details. Similar to the model used in the OPE experiment, the model we learn is composed of a representation module and a dynamics module. The representation module is a feed-forward network with two hidden layers that takes the state-action pair as input and outputs the representation through the tanh activation function.
The dynamics module is a single-hidden-layer network that takes the representation as input and outputs the parameters of a diagonal Gaussian distribution predicting the state difference and the reward. We use 200 hidden units for all intermediate layers, including the representation layer. Across all domains, we train an ensemble of 7 models and pick the best 5 models according to their validation error on a hold-out set of 1000 transitions from the dataset. The inputs and outputs of the neural network are normalized. We present the pseudo-code of the presented Representation Balancing Offline Model-based RL algorithm below.

Algorithm 1 Representation Balancing Offline Model-based RL
Input: Offline dataset D, previous policy π
Output: Optimized policy π
1: Sample K independent datasets with replacement from D.
2: Train a bootstrapped ensemble of K models $\{\hat T_i, \hat R_i\}_{i=1}^K$ minimizing Eq. (8) (adapted with $\tilde d^\pi_\phi$).
3: for repeat = 0, 1, . . . do
4:   for rollout = 0, 1, . . . , B do
5:     Sample an initial rollout state $s_0$ from D.
6:     for t = 0, 1, . . . do
7:       Sample an action $a_t \sim \pi(s_t)$.
8:       Randomly pick $(\hat T_i, \hat R_i)$ and sample $(s_{t+1}, r_t) \sim (\hat T_i, \hat R_i)(s_t, a_t)$.
9:       Compute $\tilde r_t = r_t - \gamma\lambda\, V_K[\mu(s_t, a_t)]$ and store $(s_t, a_t, \tilde r_t, s_{t+1})$ in $\tilde D$.
10:     end for
11:   end for
⋮
15: end for

For the results of MOPO (Yu et al., 2020), we ran the code kindly provided by the authors without any modification of the algorithm or the hyperparameters. All algorithms we experimented with (Base, RP, RepB-SDE) share all hyperparameters except the ones associated with the changed objectives. We run SAC on the full rollouts from the trained ensemble models, as shown in Algorithm 1. The common hyperparameters shared among the algorithms are shown in Table 3. We simply tried the listed hyperparameters and did not tune them further. For RP and RepB-SDE, we penalize the reward from the simulated environments with the standard deviation of the prediction means of the neural-network ensemble. We used the standardized outputs of all 7 neural networks to compute the reward penalty.
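The ensemble-disagreement penalty described above can be sketched as follows; how the per-dimension standard deviations are aggregated into a scalar (a Euclidean norm here) is our assumption, not the paper's exact specification:

```python
import numpy as np

def penalized_reward(reward, ensemble_means, lam):
    """Penalize a model-generated reward with ensemble disagreement, in the
    spirit of step 9 of Algorithm 1.

    ensemble_means : (K, d) array holding each ensemble member's
        (standardized) prediction mean for one (s_t, a_t).
    lam : penalty coefficient (lambda in the algorithm).
    """
    # per-dimension std across the K models, collapsed to a scalar
    penalty = float(np.linalg.norm(np.std(ensemble_means, axis=0)))
    return reward - lam * penalty
```

When all ensemble members agree, the penalty is zero and the reward is passed through unchanged; disagreement strictly reduces the reward used for policy optimization.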



Footnotes:
- We adapted their formulation to the infinite-horizon discounted MDP setting.
- https://github.com/dlqudwns/repb-sde
- https://github.com/rlpy/rlpy
- https://github.com/tianheyu927/mopo



Figure 1: The OPE results of different model learning algorithms with varying off-policyness on the x-axis. The y-axis plots the normalized individual MSE on test trajectories where the performance of the baseline model is set to 1. The tasks used are infinite-horizon discounted environments (γ = 0.98 (HIV), γ = 0.99 (others)), where we truncated at t = 1000. The experiments are repeated 200 times, and the error bars indicate 95% confidence intervals.

$$\mathbb E_{(s,a)\sim d^\pi}\Big[\Big|\int \big(R(r|s,a) - \hat R(r|s,a)\big)\,dr\Big| + \gamma\,\Big|\int \hat V^\pi(s')\big(T(s'|s,a) - \hat T(s'|s,a)\big)\,ds'\Big|\Big]$$
$$\le \mathbb E_{(s,a)\sim d^\pi}\Big[\int \big|R(r|s,a) - \hat R(r|s,a)\big|\,dr + \gamma\int \hat V^\pi(s')\,\big|T(s'|s,a) - \hat T(s'|s,a)\big|\,ds'\Big]$$
$$\le \mathbb E_{(s,a)\sim d^\pi}\Big[\int \big|R(r|s,a) - \hat R(r|s,a)\big|\,dr + \frac{\gamma\, r_{\max}}{1-\gamma}\int \big|T(s'|s,a) - \hat T(s'|s,a)\big|\,ds'\Big]$$
$$= 2\,\mathbb E_{(s,a)\sim d^\pi}\big[D_{TV}\big(R(\cdot|s,a)\,\|\,\hat R_z(\cdot|\phi(s,a))\big)\big] + \frac{2\gamma r_{\max}}{1-\gamma}\,\mathbb E_{(s,a)\sim d^\pi}\big[D_{TV}\big(T(\cdot|s,a)\,\|\,\hat T_z(\cdot|\phi(s,a))\big)\big]$$

$f_{\phi,\hat R_z,\hat T_z} \in \mathcal H_k$ with $B_\phi = \|f_{\phi,\hat R_z,\hat T_z}\|_{\mathcal H_k}$, and the loss is bounded by $\bar E = \sup_{s\in\mathcal S,\, a\in\mathcal A} E_{\phi,\hat R_z,\hat T_z}(s,a)$. Let $n$ be the number of data in $D$. With probability $1-2\delta$:

$\mathbb E_{(s,a)\sim d^D}\big[E_{\phi,\hat R_z,\hat T_z}(s,a)\big]$

Figure 2: The OPE results compared to other model-free baselines. All experiments are repeated 200 times, and the error bars denote 95% confidence intervals.

$$\mathrm{IPM}_{\mathcal G}\big(\tilde d^\pi_\phi, d^D_\phi\big) = \sup_{\nu\in\mathcal F}\Big|(1-\gamma)\,\mathbb E_{s\sim d^D,\, a\sim\pi(s)}\big[\nu(\phi(s,a))\big] - \mathbb E_{(s,a,s')\sim d^D,\, a'\sim\pi(s')}\big[\nu(\phi(s,a)) - \gamma\,\nu(\phi(s',a'))\big]\Big|,$$
and changing the initial-state sampling distribution to $d^D$ during the estimation of the IPM.

… D to compute $\widehat{\mathrm{IPM}}\big(\tilde d^\pi_\phi, d^D_\phi\big)$.

… D and $\tilde D$ to update the critic Q.
14: Maximize $\mathbb E_{s\sim D\cup\tilde D,\, a\sim\pi(s)}\big[Q(s,a) - \tau\log\pi(a|s)\big] - \alpha_\pi\,\widehat{\mathrm{IPM}}\big(\tilde d^\pi_\phi, d^D_\phi\big)$ to update π.
15: end for

Table 1: Normalized scores on the D4RL MuJoCo benchmark datasets (Fu et al., 2020), where a score of 0 corresponds to a random policy and 100 corresponds to a converged SAC policy. All results (except MF, which is taken from Fu et al. (2020) and Kumar et al. (2020)) are averaged over 5 runs, where ± denotes the standard error. The highest scores are highlighted in boldface.
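The normalization used in the table can be written as a small helper; the reference returns are the ones quoted in Appendix B.3, and the dictionary keys are illustrative:

```python
# Reference undiscounted returns quoted in Appendix B.3
# (score 0 = random policy, score 100 = converged SAC policy).
REFERENCE_RETURNS = {
    "halfcheetah": (-280.0, 12135.0),
    "hopper": (-20.0, 3234.0),
    "walker2d": (2.0, 4592.0),
}

def normalized_score(env, undiscounted_return):
    """Map a raw undiscounted return onto the D4RL normalized scale."""
    lo, hi = REFERENCE_RETURNS[env]
    return 100.0 * (undiscounted_return - lo) / (hi - lo)
```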

for measuring the performance of each model. The whole experiment, from sampling the data to learning and evaluating the model, is repeated 200 times with different random seeds.

Choice and effect of the hyperparameter α. To choose the hyperparameter α for each algorithm, we searched over α ∈ {0.001, 0.01, 0.1, 1, 10} for each level of off-policyness and each environment. The chosen values were mainly α ∈ {0.001, 0.01} for CartPole-v0, α ∈ {1, 10} for Acrobot-v1, and α ∈ {0.01, 0.1} for the HIV simulator, for both algorithms. In general, a large α was beneficial when high off-policyness is present and/or the task is hard to generalize. On the right we show an example of the effect of varying α in CartPole-v0.
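A minimal Monte Carlo version of the individual MSE described above might look like this; the rollout collection is assumed to be done elsewhere, and the names are our own:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # sum_t gamma^t * r_t for a single trajectory
    total, g = 0.0, 1.0
    for r in rewards:
        total += g * r
        g *= gamma
    return total

def individual_mse(true_rollouts, model_rollouts, gamma):
    """Monte Carlo individual MSE.

    true_rollouts[i] / model_rollouts[i] : lists of reward sequences for
    trajectories started from the same initial state s0_i, collected in
    the real environment and in the learned model, respectively.
    """
    errors = []
    for true_rs, model_rs in zip(true_rollouts, model_rollouts):
        v_true = np.mean([discounted_return(r, gamma) for r in true_rs])
        v_model = np.mean([discounted_return(r, gamma) for r in model_rs])
        errors.append((v_model - v_true) ** 2)
    return float(np.mean(errors))
```

Each inner average estimates the expected discounted return from one start state, and the squared gaps are then averaged over start states, matching the definition of the individual MSE.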

Table 3: Common hyperparameters used in the offline RL experiments.

Normalized scores on the D4RL MuJoCo benchmark datasets (Fu et al., 2020) with standard errors fully specified.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019M3F2A1072238 and NRF-2019R1A2C1087634), and the Ministry of Science and Information communication Technology (MSIT) of Korea (IITP No. 2019-0-00075, IITP No. 2020-0-00940 and IITP No. 2017-0-01779 XAI).

B EXPERIMENT DETAILS B.1 COMPUTING INFRASTRUCTURE

All experiments were conducted on the Google Cloud Platform. Specifically, we used compute-optimized machines (c2-standard-4) that provide 4 vCPUs and 16 GB of memory for the evaluation experiments of Section 5.1, and high-memory machines (n1-highmem-4) that provide 4 vCPUs and 26 GB of memory, equipped with an Nvidia Tesla K80 GPU, for the RL experiments of Section 5.2.

B.2 DETAILS OF THE OPE EXPERIMENT

Task details. We did not modify the CartPole-v0 and Acrobot-v1 environments from the original implementation in OpenAI Gym (Brockman et al., 2016), except for the maximum trajectory length. We ran PPO (Schulman et al., 2017) to optimize policies up to a certain performance level and set them as the target policies for CartPole-v0 and Acrobot-v1. For the HIV simulator, we used the code adapted by Liu et al. (2018b), which originates from the implementation in RLPy. We modified the environment to have more randomness in the initial state (up to 10% perturbation from the baseline initial state) and to use a reward function that gives the logarithm of the original reward values, as the original rewards scale up to 10^10. We used the tree-based fitted Q-iteration algorithm implemented by Liu et al. (2018b) to optimize the target policy for the HIV simulator. All other details are shown in Table 2. We assume that the termination conditions of the tasks are known a priori.

Model and algorithm details. The model we learn is composed of a representation module and a dynamics module. To be consistent with the experimental settings of Liu et al. (2018b), we use as the representation module a single-hidden-layer feed-forward network that takes the state as input and outputs the representation, squashed into (-1, 1) by the tanh activation function. The dynamics module is also a single-hidden-layer feed-forward network that takes the representation as input and outputs a state-difference and reward prediction for each action. We use the swish activation function (Ramachandran et al., 2017) for the hidden layers of the two modules.
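As an illustration of the two-module architecture described above, here is a small numpy sketch of the forward pass; the layer sizes, initialization, and output layout per discrete action are assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):
    # swish / SiLU activation used in the hidden layers
    return x / (1.0 + np.exp(-x))

def init_model(state_dim, num_actions, hidden=200, rep_dim=32):
    """Illustrative weights for the two-module OPE model."""
    def layer(n_in, n_out):
        return rng.normal(0, 1 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)
    return {
        "rep": [layer(state_dim, hidden), layer(hidden, rep_dim)],
        "dyn": [layer(rep_dim, hidden), layer(hidden, num_actions * (state_dim + 1))],
    }

def forward(model, s, a, state_dim, num_actions):
    # representation module: swish hidden layer, tanh-squashed representation
    (w1, b1), (w2, b2) = model["rep"]
    z = np.tanh(swish(s @ w1 + b1) @ w2 + b2)
    # dynamics module: per-action state-difference and reward prediction
    (w3, b3), (w4, b4) = model["dyn"]
    out = (swish(z @ w3 + b3) @ w4 + b4).reshape(-1, num_actions, state_dim + 1)
    out = out[np.arange(s.shape[0]), a]          # select the taken action
    return s + out[:, :state_dim], out[:, state_dim], z
```

The representation `z` is the quantity regularized by the balancing objective; the dynamics head predicts the state difference (added back to `s`) and the reward.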
As a whole, the model can be seen as a feed-forward network with three hidden layers of varying activation functions, where the output of the second hidden layer is the representation we regularize. For the purpose of comparison, we minimize the L2 distance between the model prediction and the desired outcome from the data, which corresponds to using a model with a Gaussian predictive distribution of fixed variance. We standardized the inputs and outputs of the neural network and used Adam (Kingma & Ba, 2014) with a learning rate of 3 × 10^-4 for the optimization. Compared to the similar experiments conducted in Liu et al. (2018b), we used a larger and more expressive model with more optimization steps and a smaller learning rate, for a more accurate comparison. While the derivation of the RepB-SDE objective was based on a state-action representation function, we use a state representation in this experiment for a direct comparison with RepBM, which uses a state representation (this can also be understood as using an action-invariant kernel). We follow the kernel choice of Liu et al. (2018b).

