IMPLICIT OFFLINE REINFORCEMENT LEARNING VIA SUPERVISED LEARNING

Abstract

Offline Reinforcement Learning (RL) via Supervised Learning is a simple and effective way to learn robotic skills from a dataset collected by policies of different expertise levels. It is as simple as Behavior Cloning (BC) but additionally takes advantage of return information. On datasets collected by policies of similar expertise, implicit BC has been shown to match or outperform explicit BC. Despite the benefits of using implicit models to learn robotic skills via BC, Offline RL via Supervised Learning algorithms have been limited to explicit models. We show how implicit models can leverage return information and match or outperform explicit algorithms at acquiring robotic skills from fixed datasets. Furthermore, we show the close relationship between our implicit methods and other popular RL via Supervised Learning algorithms, providing a unified framework. Finally, we demonstrate the effectiveness of our method on high-dimensional manipulation and locomotion tasks.

1. INTRODUCTION

Large datasets of varied interactions combined with Offline RL algorithms are a promising direction for learning robotic skills safely (Levine et al., 2020). A practical and straightforward approach to leveraging large and varied robotics datasets is to convert the RL problem into a supervised learning problem (Emmons et al., 2021; Chen et al., 2021). RL via Supervised Learning (RvS) algorithms can be seen as return-conditioned, return-filtered, or return-weighted BC. Here, we use RvS to refer to policies whose action distribution is conditioned on a return defined by the user or generated by the agent. RvS algorithms are as simple as BC ones, but because they take advantage of return information, they can leverage sub-optimal interactions. To be effective on datasets collected by policies of different expertise levels, these algorithms must model multi-modal joint distributions, e.g., over the actions and the return. Previous RvS methods either discretized the variables so that multinomial distributions could be used (Chen et al., 2021), or ignored the multi-modality of the variables and used simple distributions such as a Gaussian (Kumar et al., 2019). This paper shows that implicit models can be excellent at modeling multi-modal distributions without the need to discretize the return and action variables. In addition, implicit models have been shown to extrapolate and model discontinuities better than explicit models, allowing them to capture complex robotic behaviors (Florence et al., 2021). Despite these advantages, RvS algorithms have been limited to explicit models. In this work, we bridge the gap between implicit models and RvS and propose the first implicit RvS (IRvS) algorithm.
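To make the return-conditioned view of RvS concrete, the following minimal sketch (illustrative only; the function and array names are our own, not from any published RvS implementation) appends the return-to-go to each state so that a standard supervised learner can condition its action distribution on it:

```python
import numpy as np

def returns_to_go(rewards):
    """Cumulative return G_i = sum of r_t for t >= i, for every step of a trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

def make_rvs_dataset(states, actions, rewards):
    """Return-conditioned BC: the policy input is (s_i, G_i); the regression target is a_i."""
    G = returns_to_go(rewards)
    inputs = np.concatenate([states, G[:, None]], axis=1)
    return inputs, actions

# A toy 3-step trajectory with 2-D states and 1-D actions.
states = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
actions = np.array([[0.5], [0.2], [-0.1]])
rewards = np.array([1.0, 0.0, 2.0])

X, y = make_rvs_dataset(states, actions, rewards)
print(X)  # each state augmented with its return-to-go: [3., 2., 2.]
```

At evaluation time, the conditioning value G_i is replaced by a desired return supplied by the user or by a second model.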
We show that our IRvS algorithm is closely related to other RvS algorithms but has the advantages of 1) not requiring the user or a second neural network to specify a target return, and 2) not requiring continuous outputs to be discretized to capture multi-modal distributions. We also demonstrate the superiority of IRvS over explicit RvS at interpolating, modeling discontinuities, and handling higher-dimensional action spaces in an environment with linear dynamics. In addition, using the same objective as IRvS, we derive a novel RvS formulation in which the target return is obtained by maximizing the exponential tilt of the return. Finally, we show that our method achieves state-of-the-art results on the difficult ADROIT and FrankaKitchen manipulation tasks and is competitive with Offline RL and RvS algorithms on a suite of robotic locomotion environments.

Figure 1: Comparison of implicit and explicit Reinforcement Learning via Supervised Learning (RvS). Implicit models can be more straightforward and use a single neural network. In contrast, explicit models require either a second neural network (not depicted here) or the user to specify a task-specific target return Ĝ.

2. PRELIMINARIES

Offline Reinforcement Learning. We consider a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$ with states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, transition function $\mathcal{P}(s' \mid s, a)$, and reward function $r = \mathcal{R}(s, a)$. We aim to learn a policy $\pi : \mathcal{S} \to \mathcal{A}$ maximizing the expected return $\mathbb{E}_{\tau \sim \pi}[G(\tau)]$, where $G(\tau) = \sum_{t=0}^{H} r_t$. Contrary to "online" RL agents that can interact with the environment, we focus on agents that must learn a policy exclusively from a dataset $\mathcal{D}$ of trajectories $\tau = (s_0, a_0, r_0, \dots, s_H, a_H, r_H)$ collected under a behavior policy $\mu$. For the rest of the paper, we work with the tuples $(s_i, a_i, G_i)$, where $G_i = \sum_{t=i}^{H} r_t$ is the cumulative return starting from the state-action pair $(s_i, a_i)$.

Implicit Models and Energy-Based Models. We define an implicit model as a general function approximator $E : \mathbb{R}^D \to \mathbb{R}$ for which inference is performed via the optimization $\hat{y} = \arg\min_y E_\theta(x, y)$. In this work, we focus on Energy-Based Models (EBMs) (LeCun et al., 2006). An EBM associates a scalar energy with a configuration of variables $(x, y)$; the model is trained to predict lower energy on observed data than on unobserved data. We use an EBM to model the conditional density

$$p_\theta(y \mid x) = \frac{\exp(-E_\theta(x, y))}{Z_\theta}, \qquad (1)$$

where $E_\theta$ is the energy and $Z_\theta = \int_y \exp(-E_\theta(x, y)) \, dy$ is the normalizing constant. The density model can be trained using the InfoNCE loss function (Oord et al., 2018), defined as

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} -\log \tilde{p}_\theta\big(y_i \mid x_i, \{\tilde{y}_i^j\}_{j=1}^{N_{\text{neg.}}}\big), \qquad (2)$$

$$\tilde{p}_\theta\big(y_i \mid x_i, \{\tilde{y}_i^j\}_{j=1}^{N_{\text{neg.}}}\big) = \frac{\exp(-E_\theta(x_i, y_i))}{\exp(-E_\theta(x_i, y_i)) + \sum_{j=1}^{N_{\text{neg.}}} \exp(-E_\theta(x_i, \tilde{y}_i^j))},$$

where $\{\tilde{y}_i^j\}_{j=1}^{N_{\text{neg.}}}$ is a set of negative counter-examples. Negative sampling can be performed by approximately solving $\tilde{y}_i^j = \arg\min_y E_\theta(x_i, y)$ using stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011); see Algorithm 1 for details. Finally, prediction is done by minimizing the energy

$$\hat{y} = \arg\min_{y \in \mathcal{Y}} E_\theta(x, y), \qquad (3)$$

which can also be done via SGLD.
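As a toy numeric illustration of the InfoNCE objective and energy-minimizing inference (a sketch under simplifying assumptions: a hand-written quadratic energy stands in for the neural network, and uniform sampling stands in for SGLD negatives):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x, y):
    """Toy energy E(x, y): low when y is close to 2*x (a stand-in for a learned network)."""
    return (y - 2.0 * x) ** 2

def info_nce_loss(x, y_pos, y_negs):
    """InfoNCE: softmax cross-entropy of the positive against N_neg counter-examples."""
    logits = -np.concatenate([[energy(x, y_pos)], energy(x, y_negs)])
    log_prob_pos = logits[0] - np.log(np.sum(np.exp(logits)))
    return -log_prob_pos

def implicit_inference(x, y_min=-5.0, y_max=5.0, n_candidates=1000):
    """y_hat = argmin_y E(x, y), here by dense candidate sampling instead of SGLD."""
    candidates = np.linspace(y_min, y_max, n_candidates)
    return candidates[np.argmin(energy(x, candidates))]

x, y_pos = 1.0, 2.0                       # observed pair: y = 2x
y_negs = rng.uniform(-5.0, 5.0, size=64)  # uniform counter-examples
loss = info_nce_loss(x, y_pos, y_negs)
y_hat = implicit_inference(x)
print(loss, y_hat)  # small positive loss; y_hat close to 2.0
```

In practice the negatives come from SGLD chains run on the current energy landscape, which concentrates counter-examples near low-energy regions and sharpens the contrastive signal.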
Implicit Behavior Cloning. Behavior Cloning (Pomerleau, 1988) is a simple and popular way to acquire robotic skills from a dataset of expert demonstrations. The policy is usually modeled via

