OFFLINE REINFORCEMENT LEARNING VIA HIGH-FIDELITY GENERATIVE BEHAVIOR MODELING

Abstract

In offline reinforcement learning, weighted regression is a common method to ensure the learned policy stays close to the behavior policy and to prevent selecting out-of-sample actions. In this work, we show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training, which deviates from their initial motivation. To address this problem, we adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model. The key insight is that such decoupling avoids learning an explicitly parameterized policy model with a closed-form expression. Directly learning the behavior policy allows us to leverage existing advances in generative modeling, such as diffusion-based methods, to model diverse behaviors. For action evaluation, we combine our method with an in-sample planning technique to further avoid selecting out-of-sample actions and to increase computational efficiency. Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail. The source code is provided at https://github.com/ChenDRAG/SfBC.

1. INTRODUCTION

Offline reinforcement learning seeks to solve decision-making problems without interacting with the environment. This is compelling because online data collection can be dangerous or expensive in many realistic tasks. However, relying entirely on a static dataset imposes new challenges. One is that policy evaluation is hard because the mismatch between the behavior policy and the learned policy usually introduces extrapolation error (Fujimoto et al., 2019). In most offline tasks, it is difficult or even impossible for the collected transitions to cover the whole state-action space. When evaluating the current policy via dynamic programming, leveraging actions that are not present in the dataset (out-of-sample actions) may lead to highly unreliable results, and thus to performance degradation. Consequently, in offline RL it is critical to stay close to the behavior policy during training.

Recent advances in model-free offline methods mainly comprise two lines of work. The first adapts existing off-policy algorithms; these methods usually incorporate value pessimism about unseen actions or restrictions on the feasible action space (Fujimoto et al., 2019; Kumar et al., 2019; 2020). The other line of work (Peng et al., 2019; Wang et al., 2020; Nair et al., 2020) is derived from constrained policy search and mainly trains a parameterized policy via weighted regression, using evaluations of every state-action pair in the dataset as regression weights. The main motivation behind weighted policy regression is that it helps prevent querying out-of-sample actions (Nair et al., 2020; Kostrikov et al., 2022). However, we find that this argument is untenable in
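To make the weighted-regression idea concrete, the following is a minimal sketch of an advantage-weighted behavior-cloning loss in the style of the cited constrained-policy-search methods. It assumes a diagonal-Gaussian policy and exponentiated-advantage weights exp(A(s, a) / beta); the temperature `beta`, the weight clip `w_max`, and all function names here are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def awr_weights(advantages, beta=1.0, w_max=20.0):
    # Exponentiated advantages serve as regression weights; clipping at
    # w_max is a common stabilization trick, not part of the derivation.
    return np.minimum(np.exp(advantages / beta), w_max)

def gaussian_log_prob(actions, mu, sigma):
    # Log-density of a diagonal Gaussian policy pi(a|s) = N(mu(s), sigma(s)^2),
    # summed over action dimensions.
    return -0.5 * np.sum(
        ((actions - mu) / sigma) ** 2 + 2.0 * np.log(sigma) + np.log(2.0 * np.pi),
        axis=-1,
    )

def weighted_regression_loss(actions, mu, sigma, advantages, beta=1.0):
    # Weighted behavior cloning: each dataset action is regressed onto with
    # a weight proportional to exp(A(s, a) / beta), so high-advantage
    # actions dominate the fit while staying on dataset support.
    w = awr_weights(advantages, beta)
    return -np.mean(w * gaussian_log_prob(actions, mu, sigma))
```

Note that because the policy here is a unimodal Gaussian, the fitted mean can fall between distinct dataset modes, which is exactly the expressivity limitation the paper highlights.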

