REDUCE, REUSE, RECYCLE: COMPOSITIONAL GENERATION WITH ENERGY-BASED DIFFUSION MODELS AND MCMC

Abstract

Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly, we find these samplers lead to notable improvements in compositional generation across a wide variety of problems, such as classifier-guided ImageNet modeling and compositional text-to-image generation.
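To ground the score-based interpretation referenced above (these are standard identities from the cited diffusion-model literature, not notation introduced in this abstract): a noise-prediction network \epsilon_\theta(x_t, t) trained at noise scale \sigma_t estimates the score of the noised marginal p_t, and classifier guidance composes scores additively via Bayes' rule,

\nabla_{x_t} \log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}, \qquad \nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t).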

1. INTRODUCTION

In recent years, tremendous progress has been made in generative modeling across a variety of domains (Brown et al., 2020; Brock et al., 2018; Ho et al., 2020). These models now serve as powerful priors for downstream applications such as code generation (Li et al., 2022), text-to-image generation (Saharia et al., 2022), question-answering (Brown et al., 2020), and many more. However, to fit this complex data, generative models have grown inexorably larger, requiring tens or even hundreds of billions of parameters (Kaplan et al., 2020), and demand datasets containing non-negligible fractions of the entire internet, making it costly and difficult to train or finetune such models. Despite this, some of the most compelling applications of large generative models do not rely on finetuning. For example, prompting (Brown et al., 2020) has been a successful strategy for selectively extracting insights from large models. In this paper, we explore an alternative to finetuning and prompting, through which we may repurpose the underlying prior learned by generative models for downstream tasks.

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are a recently popular approach to generative modeling which has demonstrated a favorable combination of scalability, sample quality, and log-likelihood. A key feature of diffusion models is that their sampling can be "guided" after training. This involves combining the pre-trained diffusion model p_θ(x) with a predictive model p_θ(y|x) to generate samples from p_θ(x|y); a minimal sketch of this mechanism is given below. The predictive model can be either explicitly defined, such as a pre-trained classifier (Sohl-Dickstein et al., 2015; Dhariwal & Nichol, 2021), or implicitly defined through the combination of a conditional and an unconditional generative model (Ho & Salimans, 2022). These forms of conditioning are particularly appealing (especially the former) as they allow us to reuse pre-trained generative models for many downstream applications beyond those considered at training time.

These conditioning methods are a form of model composition, i.e., combining probabilistic models to create new models. Compositional models have a long history, dating back to early work on Mixtures-of-Experts (Jacobs et al., 1991) and Products-of-Experts models (Hinton, 2002; Mayraz & Hinton, 2000).
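As a minimal illustrative sketch of the guidance mechanism above (all names here, such as toy_score, toy_classifier_grad, and guidance_scale, are our own toy stand-ins, not the paper's implementation): the composed score follows Bayes' rule, grad log p(x|y) = grad log p(x) + grad log p(y|x), and an unadjusted Langevin step uses it as a drift.

    import numpy as np

    def toy_score(x, t):
        # Stand-in for a learned score network: the exact score of N(0, (1 + t)^2 I).
        return -x / (1.0 + t) ** 2

    def toy_classifier_grad(x, t, y_mean):
        # Stand-in for grad_x log p(y | x_t); here a quadratic pull toward a class mean.
        return 0.1 * (y_mean - x)

    def guided_score(x, t, y_mean, guidance_scale=1.0):
        # Bayes' rule in score form: grad log p(x|y) = grad log p(x) + grad log p(y|x).
        return toy_score(x, t) + guidance_scale * toy_classifier_grad(x, t, y_mean)

    def langevin_step(x, t, y_mean, step_size=1e-2, rng=None):
        # One unadjusted Langevin step targeting the guided density at noise level t.
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal(x.shape)
        return x + step_size * guided_score(x, t, y_mean) + np.sqrt(2.0 * step_size) * noise

    # Usage: run a short chain at a fixed noise level, pulled toward the class mean.
    x = np.zeros(2)
    for _ in range(100):
        x = langevin_step(x, t=0.5, y_mean=np.array([2.0, 2.0]))

In a full diffusion sampler this update would be applied across a decreasing sequence of noise levels t; the sketch fixes t only to isolate how the two gradients combine.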

