REPRESENTATION BALANCING OFFLINE MODEL-BASED REINFORCEMENT LEARNING

Abstract

One of the main challenges in offline and off-policy reinforcement learning is to cope with the distribution shift that arises from the mismatch between the target policy and the data-collection policy. In this paper, we focus on a model-based approach, particularly on learning a representation for a robust model of the environment under the distribution shift, which was first studied by Representation Balancing MDP (RepBM). Although this prior work has shown promising results, a number of shortcomings still hinder its applicability to practical tasks. In particular, we address the curse of horizon exhibited by RepBM, which causes it to reject most of the pre-collected data in long-term tasks. We present a new objective for model learning motivated by recent advances in the estimation of stationary distribution corrections. This effectively overcomes the aforementioned limitation of RepBM and naturally extends to continuous action spaces and stochastic policies. We also present an offline model-based policy optimization algorithm using this new objective, yielding state-of-the-art performance on a representative set of benchmark offline RL tasks.

1. INTRODUCTION

Reinforcement learning (RL) has accomplished remarkable results in a wide range of domains, but these successes have mostly relied on a large number of online interactions with the environment. However, in many real-world tasks, exploratory online interactions are either very expensive or dangerous (e.g. robotics, autonomous driving, and healthcare), and applying a standard online RL algorithm would be impractical. Consequently, the ability to optimize RL agents reliably without online interactions has been considered a key to practical deployment, which is the main goal of batch RL, also known as offline RL (Fujimoto et al., 2019; Levine et al., 2020). In an offline RL algorithm, accurate policy evaluation and reliable policy improvement are both crucial for the successful training of the agent. Evaluating policies in offline RL is essentially an off-policy evaluation (OPE) task, which aims to evaluate the target policy given a dataset collected from the behavior policy. The difference between the target and behavior policies causes a distribution shift in the estimation, which needs to be adequately addressed for accurate policy evaluation. OPE itself is one of the long-standing hard problems in RL (Sutton et al., 1998; 2009; Thomas & Brunskill, 2016; Hallak & Mannor, 2017). However, recent offline RL studies mainly focus on how to improve the policy conservatively, while using a common policy evaluation technique without much consideration for the distribution shift, e.g. mean squared temporal difference error minimization or maximum-likelihood training of an environment model (Fujimoto et al., 2019; Kumar et al., 2019; Yu et al., 2020). While conservative policy improvement helps policy evaluation by reducing the off-policyness, we hypothesize that addressing the distribution shift explicitly during policy evaluation can further improve the overall performance, since it provides a better foundation for policy improvement.
To this end, we aim to explicitly address the distribution shift in the OPE estimator used within the offline RL algorithm. In particular, we focus on the model-based approach, where we train an environment model that is robust to the distribution shift. One of the notable prior works is Representation Balancing MDP (RepBM) (Liu et al., 2018b), which regularizes the representation learning of the model to be invariant between the distributions. However, despite the promising results of RepBM, its step-wise estimation of the distance between the distributions has a few drawbacks that keep the algorithm from being practical: not only does it assume a discrete-action task where the target policy is deterministic, but it also performs poorly in long-term tasks due to the curse of horizon of step-wise importance sampling (IS) estimators (Liu et al., 2018a). To address these limitations, we present the Representation Balancing with Stationary Distribution Estimation (RepB-SDE) framework, where we aim to learn a balanced representation by regularizing, in the representation space, the distance between the data distribution and the discounted stationary distribution induced by the target policy. Motivated by recent advances in estimating stationary distribution corrections, we present a new representation balancing objective for training a model of the environment that no longer suffers from the curse of horizon. We empirically show that a model trained with the RepB-SDE objective is robust to the distribution shift in the OPE task, particularly when the difference between the target and behavior policies is large. We also introduce a model-based offline RL algorithm based on the RepB-SDE framework and report its performance on the D4RL benchmark (Fu et al., 2020), showing state-of-the-art performance on a representative set of tasks.

2. RELATED WORK

Learning balanced representation Learning a representation invariant to specific aspects of the data is an established method for overcoming the distribution shift that arises in unsupervised domain adaptation (Ben-David et al., 2007; Zemel et al., 2013) and in causal inference from observational data (Shalit et al., 2017; Johansson et al., 2018). These works show that bounding the generalization error under the distribution shift leads to an objective that learns a balanced representation, under which the training and test distributions look similar. RepBM (Liu et al., 2018b) can be seen as a direct extension to the sequential setting, encouraging the representation to be invariant under the target and behavior policies at each timestep.
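As a rough illustration (not the method of this paper), a balancing regularizer can penalize a distance between the encoded features of two batches drawn from the two distributions; the sketch below uses a linear-kernel maximum mean discrepancy (MMD), with the toy encoder and all names being our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy representation network: a single linear layer with tanh."""
    return np.tanh(x @ W)

def mmd2_linear(phi_p, phi_q):
    """Squared MMD with a linear kernel: distance between feature means."""
    diff = phi_p.mean(axis=0) - phi_q.mean(axis=0)
    return float(diff @ diff)

# Two batches from shifted input distributions, standing in for
# behavior-policy data and target-policy data.
x_behavior = rng.normal(0.0, 1.0, size=(256, 4))
x_target = rng.normal(0.5, 1.0, size=(256, 4))

W = rng.normal(size=(4, 8))
# A balancing penalty of this kind would be added to the model's training
# loss, pushing the encoder toward representations that look similar
# across the two distributions.
penalty = mmd2_linear(encode(x_behavior, W), encode(x_target, W))
print(penalty)
```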

Stationary distribution correction estimation (DICE)

Step-wise importance sampling (IS) estimators (Precup, 2000) compute importance weights by taking the product of per-step distribution ratios. Consequently, these methods suffer from variance that grows exponentially with the length of the trajectory, a phenomenon called the curse of horizon (Liu et al., 2018a). Recently, techniques for computing a stationary DIstribution Correction Estimation (DICE) have made remarkable progress, effectively addressing the curse of horizon (Liu et al., 2018a; Nachum et al., 2019a; Tang et al., 2020; Zhang et al., 2020; Mousavi et al., 2020). DICE has also been used to explicitly address the distribution shift in online model-free RL, by directly applying IS to the policy and action-value objectives (Liu et al., 2019; Gelada & Bellemare, 2019). We adopt one of the estimation techniques, DualDICE (Nachum et al., 2019a), to measure the distance between the stationary distribution and the data distribution in the representation space.

Offline reinforcement learning

There are extensive studies on adapting standard online model-free RL algorithms (Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) for stable learning in the offline setting. The main idea behind them is to improve the policy conservatively by (1) quantifying the uncertainty of the value function estimate, e.g. using bootstrapped ensembles (Kumar et al., 2019; Agarwal et al., 2020), and/or (2) constraining the optimized target policy to be close to the behavior policy (i.e. behavior regularization approaches) (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Lee et al., 2020). A notable exception is AlgaeDICE (Nachum et al., 2019b), which implicitly uses DICE to regularize the discounted stationary distribution induced by the target policy to stay within the data support, similar to this work. On the other hand, Yu et al. (2020) argued that the model-based approach can be advantageous due to its ability to generalize predictions to states outside of the data support. They introduce MOPO (Yu et al., 2020), which uses truncated rollouts and penalized rewards for conservative policy improvement. MOReL (Kidambi et al., 2020) trains a state-action novelty detector and uses it to penalize rewards in data-sparse regions. Matsushima et al. (2020) and MOOSE (Swazinna et al., 2020) are other recent model-based approaches to offline RL.
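To make the curse of horizon concrete, the following toy simulation (our own illustration, not code from any cited work) shows how the variance of step-wise IS weights, i.e. products of per-step ratios that each have mean one, explodes as the horizon grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-step importance ratio pi(a|s) / mu(a|s): a two-point distribution
# with mean 1, mimicking a mild policy mismatch at every step.
ratios = np.array([0.5, 1.5])

def stepwise_is_weight(horizon):
    """Product of per-step ratios along one sampled trajectory."""
    return float(np.prod(rng.choice(ratios, size=horizon)))

variances = {}
for horizon in [1, 10, 50]:
    weights = np.array([stepwise_is_weight(horizon) for _ in range(10_000)])
    variances[horizon] = weights.var()
    print(horizon, weights.var())

# The exact variance here is 1.25**horizon - 1: exponential in the horizon,
# even though every per-step ratio has mean exactly 1. In practice the
# heavy-tailed weights concentrate the estimate on a handful of trajectories.
```

Stationary distribution corrections avoid this by weighting each transition once, with a single state(-action) distribution ratio, instead of multiplying ratios along the trajectory.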

