WHERE TO DIFFUSE, HOW TO DIFFUSE, AND HOW TO GET BACK: AUTOMATED LEARNING FOR MULTIVARIATE DIFFUSIONS

Abstract

Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this process to generate samples. The choice of noising process, or inference diffusion process, affects both likelihoods and sample quality. For example, extending the inference process with auxiliary variables leads to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study Multivariate Diffusion Models (MDMs). For any number of auxiliary variables, we provide a recipe for maximizing a lower bound on the MDM likelihood without requiring any model-specific analysis. We then demonstrate how to parameterize the diffusion for a specified target noise distribution; together, these two results enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over all linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions and learn diffusion processes on the MNIST, CIFAR10, and IMAGENET32 datasets. We show that learned MDMs match or surpass bits-per-dim (BPD) relative to fixed choices of diffusion for a given dataset and model architecture.

1. INTRODUCTION

Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this process to generate samples. They have achieved impressive performance in image generation, editing, and translation (Dhariwal & Nichol, 2021; Nichol & Dhariwal, 2021; Sasaki et al., 2021; Ho et al., 2022), conditional text-to-image tasks (Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022), and music and audio generation (Chen et al., 2020; Kong et al., 2020; Mittal et al., 2021). They are often trained by maximizing a lower bound on the log likelihood, featuring an inference process interpreted as gradually "noising" the data (Sohl-Dickstein et al., 2015; Ho et al., 2020). The choice of this inference process affects both likelihoods and sample quality. Different inference processes work better on different datasets and models; there is no universal best choice of inference, and the choice matters (Song et al., 2020b). While some work has improved performance by designing score model architectures (Ho et al., 2020; Kingma et al., 2021; Dhariwal & Nichol, 2021), Dockhorn et al. (2021) instead introduce the critically-damped Langevin diffusion (CLD), showing that significant improvements in sample generation can be gained by carefully designing new processes. CLD pairs each data dimension with an auxiliary "velocity" variable and diffuses them jointly using second-order Langevin dynamics. A natural question arises: if introducing new diffusions can yield dramatic performance gains, why are only a handful of diffusions (the variance-preserving stochastic differential equation (VPSDE), variance-exploding (VE) SDE, CLD, and sub-VPSDE) used in DBGMs? For instance, are there other auxiliary-variable diffusions that would lead to improvements like CLD?
This avenue seems promising, as auxiliary variables have improved other generative models and inference methods, such as normalizing flows (Huang et al., 2020), neural ordinary differential equations (ODEs) (Dupont et al., 2019), hierarchical variational models (Ranganath et al., 2016), and ladder variational autoencoders (Sønderby et al., 2016), among others. The success of CLD, however, also illustrates that each new process requires significant model-specific analysis. Deriving the evidence lower bound (ELBO) and training algorithm for diffusions is challenging (Huang et al., 2021; Kingma et al., 2021; Song et al., 2021) and is carried out in a case-by-case manner for new diffusions (Campbell et al., 2022). Auxiliary variables complicate this process further: computing conditionals of the inference process necessitates solving matrix Lyapunov equations (section 3.3), and deriving the inference stationary distribution, which helps the model and inference match, can be intractable. These challenges limit rapid prototyping and evaluation of new inference processes. Concretely, training a diffusion model requires: (R1) selecting an inference and model process pair such that the inference process converges to the model prior; (R2) deriving the ELBO for this pair; and (R3) estimating the ELBO and its gradients by deriving and computing the inference process's transition kernel. In this work, we introduce Multivariate Diffusion Models (MDMs) and a method for training and evaluating them. MDMs are diffusion-based generative models trained with auxiliary variables. We provide a recipe for training MDMs beyond specific instantiations, like VPSDE and CLD, covering all linear inference processes that have a stationary distribution, with any number of auxiliary variables. First, we bring results from gradient-based MCMC (Ma et al., 2015) to diffusion modeling to construct MDMs that converge to a chosen model prior (R1); this tightens the ELBO.
Second, for any number of auxiliary variables, we derive the MDM ELBO (R2). Finally, we show that the transition kernel of linear MDMs, necessary for the ELBO, can be computed automatically and generically for higher-dimensional auxiliary systems (R3). With these tools, we explore a variety of new inference processes for diffusion-based generative models. We then note that the automatic transitions and fixed stationary distributions make it possible to directly learn the inference process by maximizing the MDM ELBO. Learning turns diffusion model training into a search not only over score models but also over inference processes, at no extra derivational cost.

Methodological Contributions. In summary, our methodological contributions are:

1. Deriving ELBOs for training and evaluating multivariate diffusion models (MDMs) with auxiliary variables.

2. Showing that the diffusion transition covariance does not need to be manually derived for each new diffusion. We instead demonstrate that a matrix factorization technique, previously unused in diffusion models, can automatically compute the covariance analytically for any linear MDM.

3. Using results from gradient-based Markov chain Monte Carlo (MCMC) to construct MDMs with a complete parameterization of inference processes whose stationary distribution matches the model prior.

4. Combining the above into an algorithm, Automatic Multivariate Diffusion Training (AMDT), that enables training without diffusion-specific derivations. AMDT supports training score models for any linear diffusion, including optimizing the diffusion and score jointly.

To demonstrate these ideas, we develop MDMs with two specific diffusions as well as learned multivariate diffusions. The specific diffusions are the accelerated Langevin diffusion (ALDA), introduced in Mou et al. (2019) as a higher-order scheme for gradient-based MCMC, and an alteration, the modified accelerated Langevin diffusion (MALDA). Previously, using these diffusions for generative modeling would have required significant model-specific analysis; with AMDT, it is derivation-free.
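The complete parameterization from gradient-based MCMC (Ma et al., 2015) states that, for state-independent matrices, any diffusion of the form dz = -(D + Q)∇H(z) dt + sqrt(2D) dW with D symmetric positive semi-definite and Q skew-symmetric has stationary density proportional to exp(-H(z)). A minimal numpy sketch for a standard-normal prior, where H(z) = ||z||²/2 so ∇H(z) = z and the drift is linear; the function name and the stationarity check are ours:

```python
import numpy as np


def gaussian_stationary_drift(D, Q):
    """Drift matrix A = -(D + Q) for the linear SDE
        dz = A z dt + sqrt(2 D) dW,
    which leaves N(0, I) stationary when D is symmetric PSD and
    Q is skew-symmetric (Ma et al., 2015 parameterization)."""
    return -(D + Q)


rng = np.random.default_rng(0)
d = 4
S = rng.standard_normal((d, d))
D = S @ S.T + np.eye(d)   # symmetric positive definite
R = rng.standard_normal((d, d))
Q = R - R.T               # skew-symmetric
A = gaussian_stationary_drift(D, Q)

# Stationarity of N(0, I): the Lyapunov equation A I + I A^T + 2D = 0
# holds because A + A^T = -2D when Q is skew-symmetric.
assert np.allclose(A + A.T + 2 * D, 0.0)
```

Freely parameterizing D (e.g., via a Cholesky factor) and Q (via an arbitrary matrix minus its transpose) thus yields inference processes that all converge to the chosen model prior, which is what makes the diffusion itself learnable under the ELBO.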

