REDUCE, REUSE, RECYCLE: COMPOSITIONAL GENERATION WITH ENERGY-BASED DIFFUSION MODELS AND MCMC

Abstract

Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly, we find these samplers lead to notable improvements in compositional generation across a wide variety of problems, such as classifier-guided ImageNet modeling and compositional text-to-image generation.

1. INTRODUCTION

In recent years, tremendous progress has been made in generative modeling across a variety of domains (Brown et al., 2020; Brock et al., 2018; Ho et al., 2020). These models now serve as powerful priors for downstream applications such as code generation (Li et al., 2022), text-to-image generation (Saharia et al., 2022), question-answering (Brown et al., 2020), and many more. However, to fit this complex data, generative models have grown inexorably larger, requiring tens or even hundreds of billions of parameters (Kaplan et al., 2020) and datasets containing non-negligible fractions of the entire internet, making it costly and difficult to train or finetune such models. Despite this, some of the most compelling applications of large generative models do not rely on finetuning. For example, prompting (Brown et al., 2020) has been a successful strategy to selectively extract insights from large models. In this paper, we explore an alternative to finetuning and prompting, through which we may repurpose the underlying prior learned by generative models for downstream tasks.

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are a recently popular approach to generative modeling which have demonstrated a favorable combination of scalability, sample quality, and log-likelihood. A key feature of diffusion models is that their sampling can be "guided" after training. This involves combining the pre-trained diffusion model p_θ(x) with a predictive model p_θ(y|x) to generate samples from p_θ(x|y). This predictive model can be either explicitly defined, such as a pre-trained classifier (Sohl-Dickstein et al., 2015; Dhariwal & Nichol, 2021), or an implicit predictive model defined through the combination of a conditional and an unconditional generative model (Ho & Salimans, 2022).
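Concretely, classifier guidance shifts the model's noise prediction against the gradient of the classifier's log-probability, scaled by the noise level. A minimal sketch of this update, where eps_theta and grad_log_p below are toy stand-ins rather than the paper's trained networks:

```python
import numpy as np

def guided_epsilon(eps_theta, grad_log_p, x, t, sigma_t, scale=1.0):
    """Classifier-guided noise prediction: shift the unconditional
    prediction eps_theta(x, t) by the classifier gradient
    grad_x log p(y|x), scaled by the noise level sigma_t."""
    return eps_theta(x, t) - scale * sigma_t * grad_log_p(x, t)

# Toy stand-ins: a "model" predicting zero noise, and a Gaussian
# "classifier" whose log-density gradient pulls samples toward the origin.
eps_theta = lambda x, t: np.zeros_like(x)
grad_log_p = lambda x, t: -x  # gradient of log N(x; 0, I)

x = np.ones(4)
eps_hat = guided_epsilon(eps_theta, grad_log_p, x, t=10, sigma_t=0.5, scale=2.0)
```

With these stand-ins, the guided prediction is pushed away from zero in proportion to the guidance scale and noise level, illustrating how the same pre-trained model can be steered without retraining.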
These forms of conditioning are particularly appealing (especially the former) as they allow us to reuse pre-trained generative models for many downstream applications beyond those considered at training time. These conditioning methods are a form of model composition, i.e., combining probabilistic models together to create new models. Compositional models have a long history, dating back to early work on Mixtures-of-Experts (Jacobs et al., 1991) and Product-of-Experts models (Hinton, 2002; Mayraz & Hinton, 2000). There, many simple models or predictors were combined to increase their capacity. Much of this early work on model composition was done in the context of Energy-Based Models (Hinton, 2002), an alternative class of generative model which bears many similarities to diffusion models.

In this work, we explore the ways that diffusion models can be reused and composed with one another. First, we introduce a set of methods which allow pre-trained diffusion models to be composed, with one another and with other models, to create new models without retraining. Second, we illustrate how existing methods for composing diffusion models are not fully correct, and propose a remedy to these issues with MCMC-derived sampling. Next, we propose the use of an energy-based parameterization for diffusion models, where the unnormalized density of each reverse diffusion distribution is explicitly modeled. We illustrate how this parameterization enables both additional ways to compose diffusion models and the use of more powerful Metropolis-adjusted MCMC samplers. Finally, we demonstrate the effectiveness of our approach in settings ranging from 2D data to high-resolution text-to-image generation. An illustration of our domains can be found in Figure 1.
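For intuition on the product composition p_prod(x) ∝ p_1(x) p_2(x) illustrated in Figure 1: taking the log turns the product into a sum, so the score of the product is just the sum of the component scores. A toy sketch using closed-form Gaussian scores (the function names here are our own, not part of the paper's method):

```python
import numpy as np

def product_score(scores, x):
    """Score of an (unnormalized) product of experts:
    grad_x log [p_1(x) * ... * p_k(x)] = sum_i grad_x log p_i(x).
    The normalizer Z drops out, since it does not depend on x."""
    return sum(s(x) for s in scores)

# Two Gaussian experts N(1, I) and N(-1, I); their product is again
# Gaussian, and its score is exactly the sum of the two scores below.
score_a = lambda x: -(x - 1.0)  # grad log N(x; 1, I)
score_b = lambda x: -(x + 1.0)  # grad log N(x; -1, I)

x = np.array([0.5, -0.5])
s = product_score([score_a, score_b], x)
```

This additivity is what makes score-based samplers attractive for composition: each expert only needs to supply its own score, and no retraining is required to form the product.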

2.1. DIFFUSION MODELS

Diffusion models seek to model a data distribution q(x_0). We augment this distribution with auxiliary variables {x_t}_{t=1}^T defining a Gaussian diffusion q(x_0, ..., x_T) = q(x_0) q(x_1|x_0) ··· q(x_T|x_{T−1}), where each transition is defined as q(x_t|x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I) for some 0 < β_t ≤ 1. This transition first scales down x_{t−1} by √(1 − β_t) and then adds Gaussian noise of variance β_t. For large enough T, we will have q(x_T) ≈ N(0, I). Our model p_θ(x_{t−1}|x_t) seeks to learn the reverse of q(x_t|x_{t−1}), i.e., to denoise x_t back to x_{t−1}. In the limit of small β_t this reversal becomes Gaussian (Sohl-Dickstein et al., 2015), so we parameterize our model as p_θ(x_{t−1}|x_t) = N(x_{t−1}; μ_θ(x_t, t), β̃_t I) with:

μ_θ(x_t, t) = (1/√α_t) (x_t − (β_t / √(1 − ᾱ_t)) ε_θ(x_t, t)),    (1)

where ε_θ(x_t, t) is a neural network, and α_t, ᾱ_t, β̃_t are functions of {β_t}_{t=1}^T. A useful feature of the diffusion process q is that we can analytically derive any time marginal q(x_t|x_0) = N(x_t; √(1 − σ_t²) x_0, σ_t² I), where again σ_t is a function of {β_t}_{t=1}^T. We can sample x_t from this distribution using reparameterization, i.e., x_t(x_0, ε) = √(1 − σ_t²) x_0 + σ_t ε where ε ∼ N(0, I). Exploiting this, diffusion models are typically trained with the loss

L(θ) = Σ_{t=1}^T L_t(θ),    L_t(θ) = E_{q(x_0), N(ε; 0, I)} [ ‖ε − ε_θ(x_t(x_0, ε), t)‖² ].    (2)

Once ε_θ(x, t) is trained, we recover μ_θ(x, t) with Equation 1 to parameterize p_θ(x_{t−1}|x_t) and perform ancestral sampling (also known as the reverse process) to reverse the diffusion: sample x_T ∼ N(0, I), then for t = T → 1, sample x_{t−1} ∼ p_θ(x_{t−1}|x_t). A more detailed description can be found in Appendix B.
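The forward corruption, the per-step loss L_t of Equation 2, and ancestral sampling with the mean of Equation 1 can be sketched in a few lines of NumPy. Here eps_theta is a placeholder for the trained network ε_θ, the linear β schedule is purely illustrative, and the helper names (make_schedule, diffusion_loss, ancestral_sample) are our own:

```python
import numpy as np

def make_schedule(T=100, beta_min=1e-4, beta_max=0.02):
    """Illustrative linear beta schedule with derived quantities.
    Note alpha_bar_t = prod_s alpha_s equals 1 - sigma_t^2 in the
    marginal q(x_t | x_0) above."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def diffusion_loss(eps_theta, x0, t, alpha_bars, rng):
    """Single term L_t of Equation 2: corrupt x0 to x_t by
    reparameterization, then regress the injected noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

def ancestral_sample(eps_theta, shape, betas, alphas, alpha_bars, rng):
    """Reverse process: x_T ~ N(0, I), then repeatedly sample
    x_{t-1} ~ N(mu_theta(x_t, t), beta_t I) using Equation 1."""
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        mu = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_theta(x, t)) \
             / np.sqrt(alphas[t])
        noise = np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else 0.0
        x = mu + noise
    return x
```

A quick usage check with a trivial zero-predicting "network" confirms the shapes and the loop structure; a real model would of course replace eps_theta with a trained neural network evaluated at (x_t, t).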



Figure 1: Creating new models through composition. Simple operators enable diffusion models to be composed without retraining in settings such as (a) products, (b) classifier conditioning, and (c) compositional text-to-image generation with a product (left) and a mixture (right). All samples generated by trained models.

