IS CONDITIONAL GENERATIVE MODELING ALL YOU NEED FOR DECISION-MAKING?

Abstract

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.

1. INTRODUCTION

Over the last few years, conditional generative modeling has yielded impressive results in a range of domains, including high-resolution image generation from text descriptions (DALL-E, Imagen) (Ramesh et al., 2022; Saharia et al., 2022), language generation (GPT) (Brown et al., 2020), and step-by-step solutions to math problems (Minerva) (Lewkowycz et al., 2022). The success of generative models in countless domains motivates us to apply them to decision-making. Conveniently, there exists a wide body of research on recovering high-performing policies from data logged by already operational systems (Kostrikov et al., 2022; Kumar et al., 2020; Walke et al., 2022). This is particularly useful in real-world settings where interacting with the environment is not always possible and exploratory decisions can have fatal consequences (Dulac-Arnold et al., 2021). With access to such offline datasets, the problem of decision-making reduces to learning a probabilistic model of trajectories, a setting where generative models have already found success.

In offline decision-making, we aim to recover optimal reward-maximizing trajectories by stitching together sub-optimal reward-labeled trajectories in the training dataset. Prior works (Kumar et al., 2020; Kostrikov et al., 2022; Wu et al., 2019; Kostrikov et al., 2021; Dadashi et al., 2021; Ajay et al., 2020; Ghosh et al., 2022) have tackled this problem with reinforcement learning (RL) that uses dynamic programming for trajectory stitching. To enable dynamic programming, these works learn a value function that estimates the discounted sum of rewards from a given state. However, value function estimation is prone to instabilities arising from the combination of function approximation, off-policy learning, and bootstrapping, together known as the deadly triad (Sutton & Barto, 2018). Furthermore, to stabilize value estimation in the offline regime, these works rely on heuristics that keep the policy within the dataset distribution. These challenges make it difficult to scale existing offline RL algorithms.

In this paper, we ask whether sub-optimal trajectories can be stitched into an optimal trajectory without relying on value estimation. Since conditional diffusion generative models can generate novel data points by composing training data (Saharia et al., 2022), we leverage this ability for trajectory stitching in offline decision-making. Given a dataset of reward-labeled trajectories, we adapt diffusion models (Sohl-Dickstein et al., 2015) to learn a return-conditional trajectory model. During inference, we use classifier-free guidance with low-temperature sampling, which we hypothesize implicitly performs dynamic programming, to capture the best behaviors in the dataset and recover return-maximizing trajectories (see Appendix A). Our straightforward conditional generative modeling formulation outperforms existing approaches on standard D4RL tasks (Fu et al., 2020).

Viewing offline decision-making through the lens of conditional generative modeling allows going beyond conditioning on returns (Figure 1). Consider an example (detailed in Appendix A) where a robot with linear dynamics navigates an environment containing two concentric circles (Figure 2). We are given a dataset of state-action trajectories of the robot, each satisfying one of two constraints: (i) the final position of the robot is within the larger circle, and (ii) the final position of the robot is outside the smaller circle.
With conditional diffusion modeling, we can use this dataset to learn a constraint-conditioned model that generates trajectories satisfying any set of constraints. During inference, the learned trajectory model can merge constraints from the dataset and generate trajectories that satisfy the combined constraint. Figure 2 shows that the constraint-conditioned model can generate trajectories such that the final position of the robot lies between the two concentric circles.
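To make the composition mechanism concrete, below is a minimal sketch of how two constraint conditionings might be combined into a single noise prediction at sampling time with classifier-free guidance. The `denoiser` interface, the one-hot constraint codes, the guidance scale `w`, and the additive composition rule (summing per-constraint guidance offsets around the unconditional prediction) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def composed_epsilon(denoiser, x_t, t, constraints, w=1.2):
    """Combine several constraint conditionings into one noise prediction."""
    eps_uncond = denoiser(x_t, t, cond=None)             # unconditional branch
    eps = eps_uncond.clone()
    for c in constraints:                                # add each constraint's guidance offset
        eps = eps + w * (denoiser(x_t, t, cond=c) - eps_uncond)
    return eps

def toy_denoiser(x_t, t, cond=None):
    """Illustrative stub standing in for a trained conditional denoiser."""
    shift = 0.0 if cond is None else float(cond.sum())
    return 0.1 * x_t + shift

x_t = torch.randn(1, 16, 4)                              # (batch, horizon, state_dim)
c_inside = torch.tensor([1.0, 0.0])                      # "end inside the larger circle"
c_outside = torch.tensor([0.0, 1.0])                     # "end outside the smaller circle"
eps_hat = composed_epsilon(toy_denoiser, x_t, t=10, constraints=[c_inside, c_outside])
print(eps_hat.shape)                                     # torch.Size([1, 16, 4])
```

At inference, a trained denoiser would replace the stub, and the composed prediction would drive the usual reverse-diffusion update so the sampled trajectory respects both constraints at once.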


Here, we demonstrate the benefits of modeling policies as conditional generative models. First, conditioning on constraints allows policies not only to generate behaviors satisfying individual constraints but also to generate novel behaviors by flexibly combining constraints at test time. Further, conditioning on skills allows policies not only to imitate individual skills but also to generate novel behaviors by composing those skills. We instantiate this idea with a state-sequence based diffusion probabilistic model (Ho et al., 2020) called Decision Diffuser, visualized in Figure 1. In summary, our contributions include (i) illustrating conditional generative modeling as an effective tool in offline decision-making, (ii) using classifier-free guidance with low-temperature sampling, instead of dynamic programming, to obtain return-maximizing trajectories, and (iii) leveraging the framework of conditional generative modeling to flexibly combine constraints and compose skills during inference.
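As a rough illustration of contribution (ii), the sketch below shows return-conditioned sampling with classifier-free guidance and low-temperature (reduced-variance) noise. The noise-prediction model `eps_model`, the linear beta schedule, and the way `temperature` scales the added noise are simplifying assumptions for exposition rather than the exact Decision Diffuser sampler.

```python
import torch

@torch.no_grad()
def sample_trajectory(eps_model, horizon, state_dim, target_return,
                      n_steps=50, guidance=1.2, temperature=0.5):
    """Reverse-diffuse a state trajectory conditioned on a target return."""
    betas = torch.linspace(1e-4, 2e-2, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon, state_dim)                # start from pure noise
    for t in reversed(range(n_steps)):
        # Classifier-free guidance: steer the prediction toward the
        # return-conditioned branch.
        eps_cond = eps_model(x, t, target_return)
        eps_uncond = eps_model(x, t, None)
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)

        # DDPM-style mean; the added noise is shrunk by `temperature` < 1,
        # i.e. low-temperature sampling.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + temperature * torch.sqrt(betas[t]) * noise
    return x

def toy_eps_model(x, t, target_return):
    """Illustrative stub standing in for a trained noise-prediction network."""
    return 0.1 * x if target_return is None else 0.1 * x - 0.01 * target_return

traj = sample_trajectory(toy_eps_model, horizon=16, state_dim=4, target_return=1.0)
print(traj.shape)                                         # torch.Size([1, 16, 4])
```

Conditioning on a high target return and shrinking the sampling noise both bias generation toward the best behaviors represented in the dataset, which is the intuition behind replacing value-based dynamic programming with guided sampling.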

2. PRELIMINARIES

2.1. REINFORCEMENT LEARNING

We formulate the sequential decision-making problem as a discounted Markov Decision Process (MDP) defined by the tuple $\langle \rho_0, \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where $\rho_0$ is the initial state distribution, $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the transition function, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ gives the reward at any transition, and $\gamma \in [0, 1)$ is a discount factor. The agent acts with a stochastic policy $\pi: \mathcal{S} \to \Delta_{\mathcal{A}}$, generating a sequence of state-action-reward transitions, or trajectory, $\tau := (s_k, a_k, r_k)_{k \geq 0}$ with probability $p_\pi(\tau)$ and return $R(\tau) := \sum_{k \geq 0} \gamma^k r_k$. The standard objective in RL is to find a return-maximizing policy $\pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim p_\pi}[R(\tau)]$.
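For concreteness, here is a minimal sketch of the discounted return $R(\tau) = \sum_{k \geq 0} \gamma^k r_k$ computed over a finite reward-labeled trajectory; the reward values are hypothetical.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute R(tau) = sum_{k >= 0} gamma^k * r_k for a finite trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# Hypothetical reward-labeled trajectory of length four.
rewards = [1.0, 0.0, 0.5, 1.0]
print(discounted_return(rewards))   # ~2.46 with gamma = 0.99
```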



Figure 1: Decision Making using Conditional Generative Modeling. Framing decision making as a conditional generative modeling problem allows us to maximize rewards, satisfy constraints and compose skills.

Figure 2: Illustrative example. We visualize the 2D robot navigation environment and the constraints satisfied by the trajectories in the dataset derived from it: the final position lies inside the larger circle ($x^2 + y^2 \le R^2$) or outside the smaller circle ($x^2 + y^2 \ge r^2$). We show that the conditional diffusion model can generate trajectories satisfying the combined constraint $r^2 \le x^2 + y^2 \le R^2$.

