BEHAVIOR PRIOR REPRESENTATION LEARNING FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on pre-training state representations, followed by policy training. In this work, we introduce a simple yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR retains performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds on the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks.

1. INTRODUCTION

Offline Reinforcement Learning (Offline RL) is one of the most promising data-driven approaches to optimizing sequential decision-making. Offline RL differs from the typical setting of Deep Reinforcement Learning (DRL) in that the agent is trained on a fixed dataset previously collected by some arbitrary process, and does not interact with the environment during learning (Lange et al., 2012; Levine et al., 2020). Consequently, it benefits scenarios where online exploration is challenging and/or unsafe, especially in application domains such as healthcare (Wang et al., 2018; Gottesman et al., 2019; Satija et al., 2021) and autonomous driving (Bojarski et al., 2016; Yurtsever et al., 2020). A common baseline for Offline RL is Behavior Cloning (BC) (Pomerleau, 1991). BC performs maximum-likelihood training on a collected set of demonstrations, essentially mimicking the behavior policy by producing predictions (actions) conditioned on observations. While BC can only yield proficient policies when given expert demonstrations, Offline RL goes beyond simple imitation and aims to train a policy that improves over the behavior policy. Despite promising results, Offline RL algorithms still suffer from two main issues: i) difficulty dealing with limited high-dimensional data, especially visual observations with continuous action spaces (Lu et al., 2022); ii) implicit under-parameterization of value networks exacerbated by highly re-used data, i.e., an expressive value network implicitly behaves as an under-parameterized one when trained with bootstrapping (Kumar et al., 2021a;b). In this paper, we focus on state representation learning for Offline RL to mitigate the above issues: projecting high-dimensional observations into a low-dimensional space can lead to better performance given the limited data of the Offline RL scenario.
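As a toy illustration of the maximum-likelihood view of BC mentioned above (our own sketch, not code from the paper), note that under a Gaussian policy with fixed variance, BC reduces to least-squares regression of dataset actions on states. All names and the synthetic dataset below are hypothetical:

```python
import numpy as np

# Synthetic offline dataset: (state, action) pairs from an unknown
# behavior policy. Here the behavior is exactly linear, so least
# squares can recover it perfectly.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))     # fixed dataset of states
true_w = rng.normal(size=(4, 2))       # unknown behavior parameters
actions = states @ true_w              # behavior policy's actions

# Maximum-likelihood BC with a fixed-variance Gaussian policy
# is ordinary least squares: argmin_w ||states @ w - actions||^2.
w_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)

# Reconstruction error of the cloned policy on this toy data
# (near zero, since the behavior is exactly linear here).
mse = float(np.mean((states @ w_bc - actions) ** 2))
```

In practice the policy is a neural network trained by gradient descent on the negative log-likelihood, but the objective is the same regression-style fit to the dataset actions.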
Moreover, disentangling representation learning from policy training (or value function learning), referred to as pre-training the state representations, can potentially mitigate the "implicit under-parameterization" phenomenon associated with the emergence of low-rank features in the value network (Wang et al., 2022). In contrast to previous works that pre-train state representations by specifying required properties, e.g., maximizing the diversity of states encountered by the agent (Liu & Abbeel, 2021; Eysenbach et al., 2019), exploiting attentive knowledge over sub-trajectories (Yang & Nachum, 2021), or capturing temporal information about the environment (Schwarzer et al., 2021a), we use the behavior policy to learn generic state representations rather than enforcing hand-specified properties. Many existing Offline RL methods regularize the policy to stay close to the behavior policy (Fujimoto et al., 2019; Laroche et al., 2019b; Kumar et al., 2019) or constrain the learned value function so that OOD actions are not overestimated (Kumar et al., 2020; Kostrikov et al., 2021). Beyond these uses, the behavior policy is often ignored, as it does not directly provide information about the environment. However, the choice of behavior policy has a huge impact on the Offline RL task. As shown by recent theoretical work (Xiao et al., 2022; Foster et al., 2022), under an agnostic baseline, the Offline RL task is intractable (the sample complexity of near-optimality is exponential in the state space size), but it becomes tractable with a well-designed behavior policy (e.g., the optimal policy or a policy trained online). This indicates that the information collected from the behavior policy deserves more attention. To this end, we propose Behavior Prior Representation (BPR), a state representation learning method tailored to Offline RL settings (Figure 1).
BPR learns state representations implicitly by enforcing them to be predictive of the actions performed by the behavior policy, normalized to lie on the unit sphere. The learned encoder is then frozen and used to train a downstream policy with any Offline RL algorithm. Intuitively, to be predictive of the normalized actions, BPR encourages the encoder to discard task-irrelevant information while retaining the task-specific knowledge embodied in the behavior policy, which we posit is an efficient way to learn a state representation. Theoretically, we prove that BPR provides performance guarantees when combined with conservative or pessimistic Offline RL algorithms. While an uninformative behavior policy may lead to poor representations and therefore degraded performance, such a scenario can often be anticipated from the empirical returns of the dataset. Furthermore, since the learning procedure of BPR involves neither value functions nor bootstrapping methods such as Temporal-Difference learning, it naturally mitigates the "implicit under-parameterization" phenomenon. We verify this empirically by using an effective-dimension measure to evaluate feature compactness in the value network's penultimate layer. The key contributions of our work are summarized below:
• We propose a simple, yet effective method for state representation learning in Offline RL, relying on behavior cloning of the dataset actions, and find that this approach is effective across several offline benchmarks, including raw-state and pixel-based ones. Our approach can be combined with any existing Offline RL pipeline with minimal changes.
• Behavior Prior Representation (BPR) is theoretically grounded: we show, under usual assumptions, that the policy improvement guarantees of Offline RL algorithms are retained through BPR, at the sole expense of an additive behavior cloning error term.
• We provide extensive empirical studies, comparing BPR to several state representation objectives for Offline RL, and show that it outperforms the baselines across a wide range of tasks.
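The two-stage pipeline described above (pre-train an encoder by predicting normalized behavior actions, then freeze it for downstream policy learning) can be sketched in miniature as follows. This is our own illustrative reading with a linear encoder, a synthetic dataset, and a least-squares stand-in for the downstream offline learner; none of it is the paper's implementation:

```python
import numpy as np

# Synthetic offline dataset: states S and behavior-policy actions A,
# with actions normalized onto the unit sphere as in BPR.
rng = np.random.default_rng(1)
S = rng.normal(size=(512, 8))                     # raw states
A = S[:, :2] @ rng.normal(size=(2, 2))            # behavior actions
A = A / np.linalg.norm(A, axis=1, keepdims=True)  # unit-sphere normalization

# Stage 1: jointly train a linear encoder phi and an action head by
# regressing the normalized behavior actions (the BC objective).
phi = rng.normal(size=(8, 3)) * 0.1               # state encoder
head = rng.normal(size=(3, 2)) * 0.1              # action-prediction head
init_loss = float(np.mean((S @ phi @ head - A) ** 2))
for _ in range(3000):
    err = (S @ phi) @ head - A                    # BC prediction error
    g_head = (S @ phi).T @ err / len(S)           # gradient w.r.t. head
    g_phi = S.T @ (err @ head.T) / len(S)         # gradient w.r.t. encoder
    head -= 0.1 * g_head
    phi -= 0.1 * g_phi
bc_loss = float(np.mean((S @ phi @ head - A) ** 2))

# Stage 2: freeze phi; any off-the-shelf Offline RL algorithm now trains
# a policy on the fixed features Z = S @ phi. A least-squares fit stands
# in for that downstream learner here.
Z = S @ phi                                        # frozen representation
w_pi, *_ = np.linalg.lstsq(Z, A, rcond=None)
```

The key structural point is that stage 2 never updates `phi`, so policy or value training cannot collapse the representation, which is how BPR sidesteps bootstrapping-induced rank collapse.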

2. RELATED WORK

Offline RL with behavior regularization. Although, to the best of our knowledge, we are the first to leverage behavior cloning (BC) to learn a state representation in Offline RL, combining Offline RL with behavior regularization has been considered previously in many works. A common way of combining BC with RL is to use it as a reference for policy optimization with baseline methods, such as natural policy gradient (Rajeswaran et al., 2018), DDPG (Nair et al., 2018; Goecks et al., 2020), BCQ (Fujimoto et al., 2019), SPIBB (Laroche et al., 2019a; Nadjahi et al.,



Figure 1: Illustration of Behavior Prior Representations and comparison with Behavior Cloning.


* Correspondence to Xin Li. This work was partially supported by NSFC under Grant 62276024 and 92270125. † Work done while at Microsoft Research Montreal.


The code is available at https://github.com/bit1029public/offline_bpr.

