LET OFFLINE RL FLOW: TRAINING CONSERVATIVE AGENTS IN THE LATENT SPACE OF NORMALIZING FLOWS

Anonymous

Abstract

Offline reinforcement learning aims to train a policy on a pre-recorded, fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error, caused by approximating the value of state-action pairs not well covered by the training data, and (2) distributional shift between the behavior and inference policies. One way to tackle these problems is to induce conservatism, i.e., to keep the learned policies close to the behavioral ones. To achieve this, we build upon recent work on learning policies in latent action spaces and use a special form of Normalizing Flows to construct a generative model, which serves as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, after which an additional policy model, a controller acting in the latent space, is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that it outperforms recently proposed algorithms with generative action models on a large portion of the datasets.

1. INTRODUCTION

Offline Reinforcement Learning (ORL) addresses the problem of training a new decision-making policy from a static, pre-recorded dataset collected by some other policies, without any additional data collection (Lange et al., 2012; Levine et al., 2020). One of the main challenges in this setting is the extrapolation error (Fujimoto et al., 2019), i.e., the inability to properly estimate the values of state-action pairs not well supported by the training data, which in turn leads to overestimation bias. This problem is typically resolved with various forms of conservatism: for example, Implicit Q-Learning (Kostrikov et al., 2021) completely avoids estimates of out-of-sample actions, Conservative Q-Learning (Kumar et al., 2020) penalizes Q-values for out-of-distribution actions, and other methods (Fujimoto & Gu, 2021; Kumar et al., 2019) put explicit constraints on the policy to stay close to the behavioral one. An alternative approach to constraining trained policies was introduced in PLAS (Zhou et al., 2020), whose authors proposed to construct a latent space that maps to actions well supported by the training data. To achieve this, Zhou et al. (2020) use a Variational Autoencoder (VAE) (Kingma et al., 2019) to learn a latent action space and then train a controller within it. However, as was demonstrated in Chen et al. (2022), their specific use of the VAE makes it necessary to clip the latent space. Otherwise, the training process becomes unstable, and the optimized controller can exploit the newly constructed action space, arriving at regions that decode to out-of-distribution actions in the original space. While the described clipping procedure was found to be effective, it is rather ad hoc and discards some of the in-dataset actions, which could limit the performance of the trained policies.
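The latent-clipping workaround described above can be illustrated with a minimal sketch. Everything here is hypothetical: the linear-plus-tanh `decode` is a stand-in for a pre-trained VAE decoder (which in a real PLAS setup would also condition on the state), and the clip threshold is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained VAE decoder mapping a 2-D latent to a 2-D
# action; a real decoder would be a neural network conditioned on the state.
W = rng.normal(size=(2, 2))

def decode(z):
    """Hypothetical decoder: latent z -> action in the original action space."""
    return np.tanh(W @ z)

def latent_policy_action(z_raw, clip=2.0):
    # PLAS-style fix: clip the latent proposed by the controller so the
    # decoder is only queried near the region covered during training.
    # Latents far outside the N(0, I) prior may decode to actions that are
    # out-of-distribution in the original space.
    z = np.clip(z_raw, -clip, clip)
    return decode(z)

# An extreme latent proposed by the RL controller gets clipped before decoding.
a = latent_policy_action(np.array([10.0, -10.0]))
```

The drawback motivating this paper is visible in the sketch: the hard clip threshold is a manual hyperparameter, and any in-dataset action whose latent lies outside the clipped cube becomes unreachable.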
In this work, inspired by the recent success of Normalizing Flows (NFs) (Singh et al., 2020) in the online reinforcement learning setup, we propose a new method, Conservative Normalizing Flows (CNF), for constructing a latent action space suitable for offline RL. First, we describe why a naive approach to constructing latent action spaces with NFs is also prone to extrapolation error, and then outline a straightforward architectural modification that avoids it without the need for manual post-hoc clipping. Our method is schematically presented in Figure 1, where we highlight the key differences from the previous approach. We benchmark our method against other competitors based on generative models and show that it performs favorably on a large portion of the D4RL (Fu et al., 2020) locomotion and maze2d datasets.

2. PRELIMINARIES

Offline RL. The goal of offline RL is to find a policy that maximizes the expected discounted return given a static, pre-recorded dataset $D$ consisting of state-action-reward tuples. The underlying decision-making problem is usually formulated as a Markov Decision Process (MDP), defined as a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$ consisting of a state space $\mathcal{S}$; an action space $\mathcal{A}$; a state transition probability $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, \infty)$, which represents the probability density of the next state $s' \in \mathcal{S}$ given the current state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$; a bounded reward function $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [r_{\min}, r_{\max}]$; and a scalar discount factor $\gamma \in (0, 1)$. We denote the reward $r(s_t, a_t, s_{t+1})$ as $r_t$. The discounted return is defined as $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. We also introduce the advantage function $A(s, a)$: the difference between the state-action value $Q(s, a)$ and state value $V(s)$ functions:

$$Q^\pi(s_t, a_t) = r_t + \mathbb{E}_\pi\Big[\sum_{k=1}^{\infty} \gamma^k r_{t+k}\Big], \qquad V^\pi(s) = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s, a)\big], \qquad A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \qquad (1)$$

Advantage Weighted Actor Critic. One way to learn a policy in the offline RL setting is to follow the gradient of the expected discounted return estimated via importance sampling (Levine et al., 2020); however, methods employing estimation of the Q-function were found to be more empirically successful (Kumar et al., 2020; Nair et al., 2020; Wang et al., 2020). Here, we describe Advantage Weighted Actor Critic (Nair et al., 2020), in which the policy is trained by optimizing the log-probabilities of actions from the data buffer, re-weighted by the exponentiated advantage. In practice, there are two trained models: a policy $\pi_\theta$ with parameters $\theta$ and a critic $Q_\psi$ with parameters $\psi$. The training process consists of two alternating phases: policy evaluation and policy improvement.
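The quantities defined above can be made concrete with a short sketch: computing the discounted return $R_t$ for every timestep of a trajectory, and the exponentiated-advantage weight that AWAC applies during policy improvement. The function names are ours, introduced only for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{k>=0} gamma^k r_{t+k} for every t, computed by a
    backwards pass over the trajectory."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def awac_weight(advantage, lam=1.0):
    """Exponentiated-advantage weight exp(A(s, a) / lambda) used to
    re-weight the behavioral log-likelihood during policy improvement."""
    return np.exp(advantage / lam)

# Three unit rewards with gamma = 0.5:
# R_2 = 1, R_1 = 1 + 0.5, R_0 = 1 + 0.5 * 1.5
rets = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```

Note how the weight reduces to 1 for zero-advantage actions, so AWAC then simply imitates the dataset; actions with positive advantage are up-weighted exponentially.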
During the policy evaluation phase, the critic $Q_\psi$ estimates the action-value function of the current policy, and during the policy improvement phase, the actor $\pi_\theta$ is updated based on the current estimate of the advantage. Putting it all together, the following two losses are minimized using gradient descent:

$$L_\pi(\theta) = \mathbb{E}_{(s,a) \sim D}\big[-\log \pi_\theta(a \mid s) \cdot \exp\big(A^\psi(s, a)/\lambda\big)\big] \qquad L_{TD}(\psi) = \mathbb{E}_{(s,a,r,s') \sim D}\Big[\big(r + \gamma Q_\psi(s', a') - Q_\psi(s, a)\big)^2\Big], \quad a' \sim \pi_\theta(\cdot \mid s') \qquad (2)$$

where $A^\psi(s, a)$ is computed according to Equation 1 using the critic $Q_\psi$, and $\lambda$ is a temperature hyperparameter.

Normalizing Flows. Given a dataset $D = \{x^{(i)}\}_{i=1}^{N}$ with points $x^{(i)}$ drawn from an unknown distribution with density $p_X$, the goal of a Normalizing Flow model (Dinh et al., 2016; Kingma & Dhariwal, 2018) is to train an invertible mapping $z = f_\phi(x)$ with parameters $\phi$ to a simpler base distribution with density $p_Z$, typically a spherical Gaussian: $z \sim \mathcal{N}(0, I)$. The mapping is required to be invertible by design so that new points can be sampled from the data distribution by applying the inverse mapping to samples from the base distribution: $x = f_\phi^{-1}(z)$. A full flow model is a composition of $K$ invertible functions $f_i$, and the relationship between $z$ and $x$ can be written as:

$$x \overset{f_1}{\longleftrightarrow} h_1 \overset{f_2}{\longleftrightarrow} h_2 \cdots \overset{f_K}{\longleftrightarrow} z \qquad (3)$$

The log-likelihood of a data point $x$ is obtained using the change-of-variables formula and can be written as:



$$\log p_\phi(x) = \log p_Z(z) + \log\big|\det(dz/dx)\big| = \log p_Z(z) + \sum_{i=1}^{K} \log\big|\det(dh_i/dh_{i-1})\big| \qquad (4)$$
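As a concrete instance of the change-of-variables formula, consider a minimal sketch with a single element-wise affine flow layer. The parameters here are fixed and hypothetical; real flows such as RealNVP stack many learned layers, each contributing one term to the sum in Equation 4.

```python
import numpy as np

# One affine flow layer z = f(x) = exp(s) * x + b. It is invertible by
# construction, with x = (z - b) * exp(-s), and its Jacobian dz/dx is
# diagonal, so log|det(dz/dx)| = sum(s).
s = np.array([0.5, -0.3])   # log-scales (hypothetical "trained" parameters)
b = np.array([0.1, 0.2])    # shifts

def forward(x):
    return np.exp(s) * x + b

def inverse(z):
    return (z - b) * np.exp(-s)

def log_prob(x):
    """Equation (4) for this one-layer flow:
    log p(x) = log p_Z(f(x)) + log|det(dz/dx)|."""
    z = forward(x)
    log_pz = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))  # standard normal base
    log_det = np.sum(s)
    return log_pz + log_det

x = np.array([0.3, -0.7])
x_rec = inverse(forward(x))   # invertibility: recovers x exactly
```

Sampling works in the opposite direction: draw $z \sim \mathcal{N}(0, I)$ and apply `inverse(z)`, which is the property the paper exploits to keep decoded actions tied to the data distribution.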

