OFFLINE REINFORCEMENT LEARNING VIA HIGH-FIDELITY GENERATIVE BEHAVIOR MODELING

Abstract

In offline reinforcement learning, weighted regression is a common method to ensure the learned policy stays close to the behavior policy and to prevent selecting out-of-sample actions. In this work, we show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training, which deviates from their initial motivation. To address this problem, we adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model. The key insight is that such decoupling avoids learning an explicitly parameterized policy model with a closed-form expression. Directly learning the behavior policy allows us to leverage existing advances in generative modeling, such as diffusion-based methods, to model diverse behaviors. As for action evaluation, we combine our method with an in-sample planning technique to further avoid selecting out-of-sample actions and increase computational efficiency. Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail. The source code is provided at https://github.com/ChenDRAG/SfBC.

1. INTRODUCTION

Offline reinforcement learning seeks to solve decision-making problems without interacting with the environment. This is compelling because online data collection can be dangerous or expensive in many realistic tasks. However, relying entirely on a static dataset imposes new challenges. One is that policy evaluation is hard because the mismatch between the behavior and the learned policy usually introduces extrapolation error (Fujimoto et al., 2019). In most offline tasks, it is difficult or even impossible for the collected transitions to cover the whole state-action space. When evaluating the current policy via dynamic programming, leveraging actions that are not present in the dataset (out-of-sample actions) may lead to highly unreliable results, and thus to performance degradation. Consequently, in offline RL it is critical to stay close to the behavior policy during training. Recent advances in model-free offline methods mainly include two lines of work. The first is the adaptation of existing off-policy algorithms. These methods usually include value pessimism about unseen actions or restrictions on the feasible action space (Fujimoto et al., 2019; Kumar et al., 2019; 2020). The other line of work (Peng et al., 2019; Wang et al., 2020; Nair et al., 2020) is derived from constrained policy search and mainly trains a parameterized policy via weighted regression, using evaluations of every state-action pair in the dataset as regression weights. The main motivation behind weighted policy regression is that it helps prevent querying out-of-sample actions (Nair et al., 2020; Kostrikov et al., 2022). However, we find that this argument is untenable in certain settings. Our key observation is that policy models in existing weighted policy regression methods are usually unimodal Gaussian models and thus lack distributional expressivity, while in the real world collected behaviors can be highly diverse.
This distributional discrepancy might eventually lead to selecting unseen actions. For instance, given a bimodal target distribution, fitting it with a unimodal distribution unavoidably results in covering the low-density area between the two peaks. In Section 3.1, we empirically show that this lack of policy expressivity may lead to performance degradation. Ideally, the problem could be solved by switching to a more expressive distribution class. However, this is nontrivial in practice since weighted regression requires exact and differentiable density calculation, which restricts the distribution classes we can choose from. In particular, we may not know what the behavior or optimal policy looks like in advance. To overcome the limited-expressivity problem, we propose to decouple the learned policy into two parts: an expressive generative behavior model and an action evaluation model. Such decoupling avoids explicitly learning a policy model whose target distribution is difficult to sample from, whereas learning a behavior model is much easier because sampling from the behavior policy is straightforward given the offline dataset collected by itself. Access to data samples from the target distribution is critical because it allows us to leverage existing advances in generative methods to model diverse behaviors. To sample from the learned policy, we use importance sampling to select actions from candidates proposed by the behavior model, with the importance weights computed by the action evaluation model; we refer to this as Selecting from Behavior Candidates (SfBC). However, the selecting-from-behavior-candidates approach introduces new challenges because it requires modeling behaviors with high fidelity, which directly determines the feasible action space.
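The SfBC sampling scheme described above can be illustrated with a minimal sketch. Here `sample_behavior` and `q_value` stand in for the learned diffusion behavior model and the action evaluation model (both are hypothetical interfaces, not the paper's actual networks), and the softmax temperature `alpha` is an assumed hyperparameter trading off greediness against closeness to the behavior distribution:

```python
import numpy as np

def select_from_behavior_candidates(sample_behavior, q_value, state,
                                    n_candidates=32, alpha=10.0, rng=None):
    """Pick an action by importance resampling over behavior candidates.

    sample_behavior(state, n) -> (n, action_dim) candidates a_i ~ mu(.|s)
    q_value(state, candidates) -> (n,) critic estimates Q(s, a_i)
    """
    rng = np.random.default_rng() if rng is None else rng
    candidates = sample_behavior(state, n_candidates)
    q = np.asarray(q_value(state, candidates), dtype=float)
    # Softmax of scaled Q-values as importance weights; subtracting the max
    # before exponentiating is the usual numerical-stability trick.
    w = np.exp(alpha * (q - q.max()))
    w /= w.sum()
    # Resample one candidate in proportion to its weight. Because every
    # candidate was drawn from the behavior model, no out-of-sample action
    # can be selected (up to the behavior model's fidelity).
    idx = rng.choice(n_candidates, p=w)
    return candidates[idx]
```

With a large `alpha` this reduces to a greedy argmax over behavior candidates; with `alpha` near zero it reduces to sampling from the behavior model itself.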
A prior work (Ghasemipour et al., 2021) finds that typically used VAEs do not align well with the behavior dataset, and that building good inductive biases into the behavior model improves algorithm performance. Instead, we propose to learn from diverse behaviors with a much more expressive generative modeling method, namely diffusion probabilistic models (Ho et al., 2020), which have recently achieved great success in modeling diverse image distributions, outperforming other existing generative models (Dhariwal & Nichol, 2021). We also propose a planning-based operator for Q-learning, which performs implicit planning strictly within dataset trajectories based on the current policy, and is provably convergent. The planning scheme greatly reduces the number of bootstrapping steps required for dynamic programming, which further suppresses extrapolation error and increases computational efficiency.

The main contributions of this paper are threefold:
1. We address the problem of limited policy expressivity in conventional methods by decoupling policy learning into behavior learning and action evaluation, which allows the policy to inherit distributional expressivity from a diffusion-based behavior model.
2. The learned policy is further combined with an implicit in-sample planning technique to suppress extrapolation error and assist dynamic programming over long horizons.
3. Extensive experiments demonstrate that our method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in sparse-reward tasks such as AntMaze.
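The key property of in-sample planning, bootstrapping only from state-action pairs that actually appear in the dataset, can be conveyed with a deliberately simplified tabular sketch. This is our own illustration under strong assumptions (a single trajectory, plain discounted returns, no policy reweighting), not the paper's convergent operator:

```python
def in_sample_backup(trajectory, gamma=0.99):
    """One backward sweep of value backup along a single dataset trajectory.

    trajectory: list of (state, action, reward) tuples in time order.
    Returns a dict mapping (state, action) -> estimated return. Because the
    backup at each step only uses the next step that actually occurs in the
    data, no out-of-sample action is ever queried, and a full trajectory is
    evaluated in one sweep rather than many bootstrapping iterations.
    """
    q = {}
    ret = 0.0
    # Backward pass: the return at step t is r_t + gamma * (return at t+1).
    for state, action, reward in reversed(trajectory):
        ret = reward + gamma * ret
        q[(state, action)] = ret
    return q
```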

2. BACKGROUND

2.1 CONSTRAINED POLICY SEARCH IN OFFLINE RL

Consider a Markov Decision Process (MDP) described by a tuple ⟨S, A, P, r, γ⟩, where S denotes the state space and A the action space. P(s′|s, a) and r(s, a) respectively represent the transition and reward functions, and γ ∈ (0, 1] is the discount factor. Our goal is to maximize the expected discounted return J(π) = E_{s∼ρ_π(s)} E_{a∼π(·|s)} [r(s, a)] of policy π, where ρ_π(s) = Σ_{n=0}^∞ γ^n p_π(s_n = s) denotes the discounted state visitation frequencies induced by the policy π (Sutton & Barto, 1998). According to the policy gradient theorem (Sutton et al., 1999), given a parameterized policy π_θ and the policy's state-action value function Q^π, the gradient of J(π_θ) can be derived as:

∇_θ J(π_θ) = ∫_S ρ_π(s) ∫_A ∇_θ π_θ(a|s) Q^π(s, a) da ds.  (1)

When online data collection from policy π is not possible, it is difficult to estimate ρ_π(s) in Equation 1, and thus the expected value of the Q-function η(π_θ) := ∫_S ρ_π(s) ∫_A π_θ(a|s) Q^π(s, a) da ds. Given a static

