LET OFFLINE RL FLOW: TRAINING CONSERVATIVE AGENTS IN THE LATENT SPACE OF NORMALIZING FLOWS

Anonymous

Abstract

Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error, caused by approximating the values of state-action pairs not well covered by the training data, and (2) distributional shift between the behavior and inference policies. One way to tackle these problems is to induce conservatism, i.e., to keep the learned policies close to the behavioral ones. To achieve this, we build upon recent work on learning policies in latent action spaces and use a special form of Normalizing Flows to construct a generative model, which serves as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, after which an additional policy model, a controller acting in the latent space, is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that it outperforms recently proposed algorithms with generative action models on a large portion of the datasets.
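The key property that makes a flow usable as an action encoder, as described above, is exact invertibility: the flow can be trained by maximum likelihood on dataset actions via its inverse, and the latent controller can act through its forward direction. A minimal sketch of one state-conditioned affine coupling layer illustrates this; the linear weights and shapes here are purely hypothetical, not the paper's architecture:

```python
import numpy as np

def coupling_forward(z, state, w):
    # One RealNVP-style affine coupling layer, conditioned on the state.
    # The first half of z passes through unchanged; the second half is
    # rescaled and shifted by functions of the first half and the state.
    d = z.shape[0] // 2
    h = np.concatenate([z[:d], state])
    s = np.tanh(w["scale"] @ h)          # log-scale, bounded for stability
    t = w["shift"] @ h
    return np.concatenate([z[:d], z[d:] * np.exp(s) + t])

def coupling_inverse(a, state, w):
    # Exact inverse: recovering z from a dataset action is what allows
    # the flow to be trained by maximum likelihood on offline data.
    d = a.shape[0] // 2
    h = np.concatenate([a[:d], state])
    s = np.tanh(w["scale"] @ h)
    t = w["shift"] @ h
    return np.concatenate([a[:d], (a[d:] - t) * np.exp(-s)])

rng = np.random.default_rng(1)
w = {"scale": rng.normal(size=(1, 3)) * 0.1,
     "shift": rng.normal(size=(1, 3)) * 0.1}
z, state = rng.normal(size=2), rng.normal(size=2)
a = coupling_forward(z, state, w)        # latent code -> action
z_rec = coupling_inverse(a, state, w)    # action -> latent code
assert np.allclose(z, z_rec)
```

Because the mapping is bijective for any weights, no latent code is ever "off the manifold" of the decoder itself; conservatism must instead come from how the latent space is shaped during pre-training.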

1. INTRODUCTION

Offline Reinforcement Learning (ORL) addresses the problem of training a new decision-making policy from a static, pre-recorded dataset collected by some other policies, without any additional data collection (Lange et al., 2012; Levine et al., 2020). One of the main challenges in this setting is the extrapolation error (Fujimoto et al., 2019), i.e., the inability to properly estimate the values of state-action pairs not well supported by the training data, which in turn leads to overestimation bias. This problem is typically resolved with various forms of conservatism: for example, Implicit Q-Learning (Kostrikov et al., 2021) completely avoids estimating out-of-sample actions, Conservative Q-Learning (Kumar et al., 2020) penalizes Q-values for out-of-distribution actions, and other methods (Fujimoto & Gu, 2021; Kumar et al., 2019) put explicit constraints on the trained policy to stay close to the behavioral policies. An alternative approach to constraining trained policies was introduced in PLAS (Zhou et al., 2020), where the authors proposed to construct a latent space that maps to actions well supported by the training data. To achieve this, Zhou et al. (2020) use a Variational Autoencoder (VAE) (Kingma et al., 2019) to learn a latent action space and then train a controller within it. However, as demonstrated in Chen et al. (2022), their specific use of the VAE necessitates clipping the latent space. Otherwise, the training process becomes unstable, and the optimized controller can exploit the newly constructed action space, arriving at regions that map to out-of-distribution actions in the original space. While the described clipping procedure was found to be effective, this solution is rather ad-hoc and discards some of the in-dataset actions, which could potentially limit the performance of the trained policies.
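The PLAS-style scheme described above can be sketched in a few lines: a controller proposes a latent code, the code is clipped to a box, and a pretrained decoder maps it back to the original action space. Everything here (the linear decoder, the random controller, all dimensions) is a hypothetical stand-in for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained generative decoder: in PLAS this
# would be the decoder of a VAE trained on the offline dataset's actions.
W_s, W_z = rng.normal(size=(3, 4)), rng.normal(size=(3, 2))

def decode(state, z):
    # A toy linear decoder with a tanh squash, used purely for illustration.
    return np.tanh(W_s @ state + W_z @ z)

def latent_policy(state, clip=2.0):
    # The latent-space controller proposes a latent action; clipping keeps
    # it inside a box where the decoder was trained, at the cost of
    # discarding some in-dataset latent codes near the boundary.
    z_raw = rng.normal(size=2)           # placeholder for a learned controller
    z = np.clip(z_raw, -clip, clip)
    return decode(state, z)

action = latent_policy(np.ones(4))
assert action.shape == (3,) and np.all(np.abs(action) <= 1.0)
```

The `clip` bound is exactly the ad-hoc element criticized above: it trades off coverage of in-dataset actions against stability of the latent controller.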
In this work, inspired by the recent success of Normalizing Flows (NFs) (Singh et al., 2020) in the online reinforcement learning setting, we propose a new method called Conservative Normalizing Flows (CNF) for constructing a latent action space suitable for offline RL. First, we describe why a naive approach to constructing latent action spaces with NFs is also prone to extrapolation error, and

