IS CONDITIONAL GENERATIVE MODELING ALL YOU NEED FOR DECISION-MAKING?

Abstract

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.

1. INTRODUCTION

Over the last few years, conditional generative modeling has yielded impressive results in a range of domains, including high-resolution image generation from text descriptions (DALL-E, Imagen) (Ramesh et al., 2022; Saharia et al., 2022), language generation (GPT) (Brown et al., 2020), and step-by-step solutions to math problems (Minerva) (Lewkowycz et al., 2022). The success of generative models in these domains motivates us to apply them to decision-making. Conveniently, there exists a wide body of research on recovering high-performing policies from data logged by already operational systems (Kostrikov et al., 2022; Kumar et al., 2020; Walke et al., 2022). This is particularly useful in real-world settings where interacting with the environment is not always possible and exploratory decisions can have fatal consequences (Dulac-Arnold et al., 2021). With access to such offline datasets, the problem of decision-making reduces to learning a probabilistic model of trajectories, a setting where generative models have already found success. In offline decision-making, we aim to recover optimal reward-maximizing trajectories by stitching together sub-optimal reward-labeled trajectories in the training dataset. Prior works (Kumar et al., 2020; Kostrikov et al., 2022; Wu et al., 2019; Kostrikov et al., 2021; Dadashi et al., 2021; Ajay et al., 2020; Ghosh et al., 2022) have tackled this problem with reinforcement learning (RL) methods that rely on dynamic programming for trajectory stitching. To enable dynamic programming, these works learn a value function that estimates the discounted sum of rewards from a given state. However, value function estimation is prone to instabilities arising from the combination of function approximation, off-policy learning, and bootstrapping, together known as the deadly triad (Sutton & Barto, 2018). Furthermore, to stabilize value estimation in the offline regime, these works rely on heuristics that keep the policy within the dataset distribution. These challenges make it difficult to scale existing offline RL algorithms. A minimal sketch of the alternative view taken here, conditioning a generative model of trajectories on returns, is given below.
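To make the contrast with value-function-based offline RL concrete, the following is a minimal, hypothetical sketch of return-conditioned generative modeling over trajectories: a denoising diffusion model receives a noised trajectory together with its (normalized) return and is trained to predict the injected noise, with no value function or bootstrapping involved. The horizon, state dimension, network, and noise schedule below are illustrative assumptions and do not reflect the architecture or hyperparameters used in this paper.

```python
# Minimal illustrative sketch (not the paper's implementation): training a
# return-conditioned denoising diffusion model over state trajectories.
import torch
import torch.nn as nn

H, D = 16, 4            # hypothetical planning horizon and state dimension
T = 100                 # number of diffusion steps
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class NoiseModel(nn.Module):
    """Predicts the noise added to a trajectory, conditioned on diffusion step and return."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(H * D + 2, 256), nn.ReLU(),
            nn.Linear(256, H * D),
        )

    def forward(self, x, t, ret):
        # x: (B, H, D) noised trajectory, t: (B,) diffusion step, ret: (B,) normalized return
        inp = torch.cat([x.flatten(1), t.float().unsqueeze(1) / T, ret.unsqueeze(1)], dim=1)
        return self.net(inp).view(-1, H, D)

model = NoiseModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(traj, ret):
    """One denoising step on a batch of (trajectory, return) pairs from an offline dataset."""
    t = torch.randint(0, T, (traj.shape[0],))
    eps = torch.randn_like(traj)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    x_t = a_bar.sqrt() * traj + (1 - a_bar).sqrt() * eps   # forward noising
    loss = ((model(x_t, t, ret) - eps) ** 2).mean()        # predict the injected noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random tensors standing in for an offline dataset of trajectories and returns.
loss = train_step(torch.randn(32, H, D), torch.rand(32))
```

At test time, such a model would be sampled while conditioning on a high target return, so that high-reward trajectories are generated directly rather than recovered through dynamic programming.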

