BEHAVIOR PROXIMAL POLICY OPTIMIZATION

Abstract

Offline reinforcement learning (RL) is a challenging setting in which existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution state-action pairs. Thus, various additional constraints or regularizations have been proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from an analysis of offline monotonic policy improvement, we reach a surprising conclusion: online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what offline RL needs to overcome overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without introducing any extra constraint or regularization beyond PPO. Extensive experiments on the D4RL benchmark empirically show that this extremely succinct method outperforms state-of-the-art offline RL algorithms.

1. INTRODUCTION

Typically, reinforcement learning (RL) is thought of as a paradigm for online learning, where the agent interacts with the environment to collect experiences and then uses them to improve itself (Sutton et al., 1998). This online process is a major obstacle to real-world RL applications because data collection is expensive or even risky in some fields, such as navigation (Mirowski et al., 2018) and healthcare (Yu et al., 2021a). As an alternative, offline RL eliminates online interaction and learns from a fixed dataset collected by some arbitrary and possibly unknown process (Lange et al., 2012; Fu et al., 2020). This data-driven mode (Levine et al., 2020) is highly promising and is widely expected to enable real-world RL applications.

Unfortunately, the main advantage of offline RL, the absence of online interaction, also raises another challenge. Although offline RL can be viewed as an extreme off-policy case, classical off-policy iterative algorithms tend to underperform because they overestimate out-of-distribution (OOD) state-action pairs. More specifically, when the Q-function poorly estimates the value of OOD state-action pairs during policy evaluation, the agent tends to take OOD actions with erroneously high estimated values, resulting in low performance after policy improvement (Fujimoto et al., 2019). Thus, to overcome this overestimation issue, many methods keep the learned policy close to the behavior policy (or the offline dataset) (Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021).

Most offline RL algorithms adopt online interactions to select hyperparameters, because offline hyperparameter selection, i.e., selecting hyperparameters without any online interaction, remains an open problem lacking satisfactory solutions (Paine et al., 2020; Zhang & Jiang, 2021). Deploying a policy learned by offline RL is potentially risky in certain areas (Mirowski et al., 2018; Yu et al., 2021a) since its performance is unknown. However, the risk during online interaction is greatly reduced if the deployed policy is guaranteed to perform at least as well as the behavior policy. This inspires us to consider how to use the offline dataset to improve the behavior policy with a monotonic performance guarantee. We formulate this problem as offline monotonic policy improvement.
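To make the overestimation issue discussed above concrete, the following is a minimal sketch (a hypothetical toy example, not an algorithm from this paper): naive off-policy Q-learning on a fixed dataset whose behavior policy only ever took one action, so the Bellman target maxes over never-observed actions whose values are pure approximation noise. The toy MDP, noise scale, and hyperparameters are all assumed for illustration.

import numpy as np

# Minimal sketch of OOD overestimation under naive off-policy bootstrapping
# on a fixed offline dataset (hypothetical toy MDP, not the paper's method).
rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 5, 4, 0.99, 0.1

# Behavior policy only ever took action 0; every reward is 0, so the true
# value of every state-action pair is exactly 0. Actions 1..3 are OOD.
dataset = [(s, 0, 0.0, (s + 1) % n_states) for s in range(n_states)] * 50

# Random initialization stands in for function-approximation error on
# state-action pairs the dataset never covers (these entries are never updated).
Q = rng.normal(0.0, 1.0, size=(n_states, n_actions))

for _ in range(500):
    for s, a, r, s_next in dataset:
        # The off-policy target maxes over ALL actions, including OOD ones
        # whose values are pure noise, so positive errors propagate.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])

print("learned Q-values (true value is 0 everywhere):")
print(np.round(Q, 2))
print("greedy action per state (0 is in-distribution):", Q.argmax(axis=1))

In this sketch the learned Q-values sit well above the true value of zero, and the greedy policy frequently selects OOD actions, which is precisely the failure mode that motivates constraining the learned policy to stay close to the behavior policy.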


