BEHAVIOR PRIOR REPRESENTATION LEARNING FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR provides performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds on the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks.
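The two-phase recipe described above can be sketched with a toy linear example. This is a minimal illustration, not the paper's implementation: the dataset, dimensions, and the linear encoder/decoder are all made up for exposition. Phase 1 learns an encoder by behavior cloning (regressing dataset actions from encoded states); phase 2 freezes the encoder and fits a policy head on top of the fixed features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset: states (N, d_s) and behavior actions (N, d_a),
# generated by a noisy linear behavior policy (illustrative only).
N, d_s, d_a, d_z = 256, 8, 2, 4
true_W = rng.normal(size=(d_s, d_a))
states = rng.normal(size=(N, d_s))
actions = states @ true_W + 0.01 * rng.normal(size=(N, d_a))

# Phase 1 (representation learning): train encoder phi and a BC action
# decoder jointly, by gradient descent on the action-prediction MSE.
phi = 0.1 * rng.normal(size=(d_s, d_z))   # encoder weights
dec = 0.1 * rng.normal(size=(d_z, d_a))   # BC action decoder
lr = 0.05
for _ in range(2000):
    z = states @ phi
    err = z @ dec - actions               # residual of predicted actions
    dec -= lr * z.T @ err / N             # MSE gradient w.r.t. decoder
    phi -= lr * states.T @ (err @ dec.T) / N  # MSE gradient w.r.t. encoder

# Phase 2: freeze phi and fit a policy head on the fixed representation
# (here a least-squares BC head stands in for an off-the-shelf Offline RL
# algorithm such as those the paper combines BPR with).
z = states @ phi                          # fixed, pre-trained features
head = np.linalg.lstsq(z, actions, rcond=None)[0]
mse = float(np.mean((z @ head - actions) ** 2))
print(f"phase-2 action MSE on frozen features: {mse:.4f}")
```

Because the encoder is trained only to predict dataset actions, its features retain exactly the state information relevant to the behavior policy, which is the intuition behind using a behavior prior for representation learning.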

1. INTRODUCTION

Offline Reinforcement Learning (Offline RL) is one of the most promising data-driven ways of optimizing sequential decision-making. Offline RL differs from the typical settings of Deep Reinforcement Learning (DRL) in that the agent is trained on a fixed dataset that was previously collected by some arbitrary process, and does not interact with the environment during learning (Lange et al., 2012; Levine et al., 2020). Consequently, it benefits scenarios where online exploration is challenging and/or unsafe, especially in application domains such as healthcare (Wang et al., 2018; Gottesman et al., 2019; Satija et al., 2021) and autonomous driving (Bojarski et al., 2016; Yurtsever et al., 2020).

A common baseline for Offline RL is Behavior Cloning (BC) (Pomerleau, 1991). BC performs maximum-likelihood training on a collected set of demonstrations, essentially mimicking the behavior policy to produce predictions (actions) conditioned on observations. While BC can only achieve proficient policies when given expert demonstrations, Offline RL goes beyond the goal of simply imitating and aims to train a policy that improves over the behavior one. Despite promising results, Offline RL algorithms still suffer from two main issues: i) difficulty dealing with limited high-dimensional data, especially visual observations with continuous action spaces (Lu et al., 2022); ii) implicit under-parameterization of value networks exacerbated by highly re-used data, that is, an expressive value network implicitly behaves as an under-parameterized one when trained using bootstrapping (Kumar et al., 2021a; b).

In this paper, we focus on state representation learning for Offline RL to mitigate the above issues: projecting high-dimensional observations to a low-dimensional space can lead to better performance given limited data in the Offline RL scenario.
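To make BC's maximum-likelihood objective concrete, here is a hedged toy sketch (the dataset and the softmax policy are illustrative assumptions, not the paper's setup): a discrete-action policy is fit by minimizing the negative log-likelihood of the dataset actions, i.e. softmax cross-entropy against the behavior policy's choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset of (observation, expert action) pairs with 3 discrete
# actions; the "expert" picks the argmax of a random linear score.
N, d, n_act = 500, 6, 3
W_true = rng.normal(size=(d, n_act))
obs = rng.normal(size=(N, d))
acts = np.argmax(obs @ W_true, axis=1)

# Behavior cloning = maximum likelihood: gradient descent on the negative
# log-likelihood of dataset actions under a linear-softmax policy.
W = np.zeros((d, n_act))
lr = 0.5
for _ in range(500):
    logits = obs @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # policy probabilities
    p[np.arange(N), acts] -= 1.0                  # softmax cross-entropy grad
    W -= lr * obs.T @ p / N

acc = float(np.mean(np.argmax(obs @ W, axis=1) == acts))
print(f"cloned-policy accuracy on the dataset: {acc:.2f}")
```

The cloned policy can at best match the behavior policy that generated the data, which is why Offline RL methods aim to improve over, rather than merely reproduce, that behavior.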
Moreover, disentangling representation learning from policy training (or value function learning), referred to as pre-training the state representations, can potentially mitigate the "implicit under-parameterization" phenomenon associated with the emergence of low-rank features in the value network (Wang et al., 2022) . In contrast to

* Correspondence to Xin Li. This work was partially supported by NSFC under Grants 62276024 and 92270125.
† Work done while at Microsoft Research Montreal.

The code is available at https://github.com/bit1029public/offline_bpr.

