CONTRASTIVE VALUE LEARNING: IMPLICIT MODELS FOR SIMPLE OFFLINE RL

Abstract

Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.

1. INTRODUCTION

While the offline RL setting is relevant to many real-world applications where the ability to collect data online is limited, it often requires RL algorithms to find policies that are not well supported by the training data. Instead of learning via trial and error, offline RL algorithms must leverage logged historical data to learn about the outcomes of different actions, potentially by capturing the environment dynamics as a proxy signal. Many prior approaches to this offline RL setting have been proposed, whether model-free (Wu et al., 2019; Fujimoto et al., 2019; Kumar et al., 2020) or model-based (Kidambi et al., 2020; Yu et al., 2021). Our focus will be on those that address this prediction problem head-on: by learning a predictive model of the environment which can be used in conjunction with most model-free algorithms. Prior model-based methods (Yu et al., 2020b; Argenson and Dulac-Arnold, 2020; Kidambi et al., 2020; Yu et al., 2021) learn a model that predicts the observation at the next time step. This model is then used to generate synthetic data that can be passed to an off-the-shelf RL algorithm. While these approaches can work well on some benchmarks, they can be complex and expensive: the model must predict high-dimensional observations, and determining the value of an action may require unrolling the model for many steps. Learning a model of the environment has not made the RL problem any simpler. Moreover, as we will show later in the paper, the environment dynamics are intertwined with the policy inside the value function; model-based methods aim to decouple these quantities by estimating them separately. On the other hand, we show that one can directly learn a long-horizon transition model for a given policy, which is then used to estimate the value function.
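To make the cost of the standard model-based pipeline concrete, the following is a toy sketch (not any particular prior method's implementation). A learned 1-step model is unrolled for a full horizon to score a single policy: every step requires predicting the next observation, and prediction errors compound across the rollout. The linear `dynamics_model` and quadratic `reward_model` below are illustrative stand-ins for learned networks, and all names are hypothetical.

```python
import numpy as np

# Toy 1-D environment: a learned 1-step model would play the role of
# these hand-coded functions (illustrative stand-ins, not learned here).
A, B = 0.9, 0.5

def dynamics_model(s, a):
    # 1-step model: predicts the next state from (state, action).
    return A * s + B * a

def reward_model(s, a):
    # Learned reward head: scores a (state, action) pair.
    return -s ** 2 - 0.1 * a ** 2

def rollout_value(s0, policy, horizon=50, gamma=0.99):
    """Estimate a policy's value by unrolling the 1-step model.

    Each of the `horizon` model calls feeds its own prediction back in,
    so model errors compound; with image observations, each call would
    also have to predict a high-dimensional frame.
    """
    s, value = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        value += gamma ** t * reward_model(s, a)
        s = dynamics_model(s, a)
    return value

# Example: score a simple proportional policy from state s0 = 1.0.
v = rollout_value(1.0, policy=lambda s: -0.5 * s)
```

Note that even in this scalar toy case, estimating one value requires `horizon` sequential model calls; this is the expense the implicit multi-step model introduced below is designed to avoid.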
A natural use case for learning this long-horizon transition model (specifically, a state occupancy measure) from unlabelled data is multi-task pretraining, where the implicit dynamics model is trained on trajectory data across a collection of tasks, often exhibiting positive transfer. As we demonstrate in our experiments, this multi-task occupancy measure can then be finetuned using reward-labelled states on the task of interest, greatly improving performance over both existing pretraining methods and tabula rasa approaches. In this paper, we propose to learn a different type of model for offline RL, a model which (1) does not require predicting high-dimensional observations and (2) can be used directly to estimate Q-values without requiring either model-based rollouts or model-free temporal difference learning. Precisely, we will learn an implicit model of the discounted state occupancy measure, i.e., a function which takes in a state, an action, and a future state, and outputs a scalar proportional to the likelihood of visiting that future state under some fixed policy.
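The implicit model just described can be sketched as an inner product of two learned representations, one of the (state, action) pair and one of the candidate future state. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the encoders are random linear maps standing in for trained networks, and after contrastive (e.g., InfoNCE-style) training one would expect the exponentiated critic score to be proportional to the discounted occupancy of the future state. The Q-value estimate then reduces to reweighting rewards of candidate future states by these scores, with no TD updates or model rollouts. All names (`critic`, `q_estimate`, the weight matrices) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # representation dimension (illustrative)

# Stand-ins for learned encoders: phi(s, a) and psi(s_future).
W_sa = rng.normal(size=(d, 3))  # encodes the concatenation [s, a]
W_f = rng.normal(size=(d, 2))   # encodes a future state

def critic(s, a, s_future):
    """Implicit multi-step model: inner product of the two embeddings.

    After contrastive training, exp(critic(s, a, s_f)) would be
    proportional to the discounted likelihood of visiting s_f from
    (s, a) under the data-collecting policy; it never has to predict
    the high-dimensional observation itself.
    """
    phi = W_sa @ np.concatenate([s, [a]])
    psi = W_f @ s_future
    return float(phi @ psi)

def q_estimate(s, a, future_states, rewards):
    """Estimate Q(s, a) by reweighting future-state rewards with the
    critic's (softmax-normalized) scores -- no TD learning, no rollouts."""
    scores = np.array([critic(s, a, sf) for sf in future_states])
    weights = np.exp(scores - scores.max())  # stabilized softmax
    weights /= weights.sum()
    return float(weights @ rewards)
```

Because the weights form a convex combination, the estimate always lies between the smallest and largest candidate reward; the critic only changes how the candidates are weighted.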

