WHERE TO DIFFUSE, HOW TO DIFFUSE, AND HOW TO GET BACK: AUTOMATED LEARNING FOR MULTIVARIATE DIFFUSIONS

Abstract

Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this process to generate samples. The choice of noising process, or inference diffusion process, affects both likelihoods and sample quality. For example, extending the inference process with auxiliary variables leads to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study Multivariate Diffusion Models (MDMs). For any number of auxiliary variables, we provide a recipe for maximizing a lower bound on the MDM likelihood without requiring any model-specific analysis. We then demonstrate how to parameterize the diffusion for a specified target noise distribution; together, these two points enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over all linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions and learn a diffusion process on the MNIST, CIFAR10, and IMAGENET32 datasets. We show that learned MDMs match or surpass the bits-per-dim (BPD) of fixed choices of diffusion for a given dataset and model architecture.

1. INTRODUCTION

Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this process to generate samples. They have achieved impressive performance in image generation, editing, and translation (Dhariwal & Nichol, 2021; Nichol & Dhariwal, 2021; Sasaki et al., 2021; Ho et al., 2022), conditional text-to-image tasks (Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022), and music and audio generation (Chen et al., 2020; Kong et al., 2020; Mittal et al., 2021). They are often trained by maximizing a lower bound on the log likelihood, featuring an inference process interpreted as gradually "noising" the data (Sohl-Dickstein et al., 2015; Ho et al., 2020). The choice of this inference process affects both likelihoods and sample quality. On different datasets and models, different inference processes work better; there is no universal best choice of inference, and the choice matters (Song et al., 2020b). While some work has improved performance by designing score model architectures (Ho et al., 2020; Kingma et al., 2021; Dhariwal & Nichol, 2021), Dockhorn et al. (2021) instead introduce the critically-damped Langevin diffusion (CLD), showing that significant improvements in sample generation can be gained by carefully designing new processes. CLD pairs each data dimension with an auxiliary "velocity" variable and diffuses them jointly using second-order Langevin dynamics. A natural question follows: if introducing new diffusions results in dramatic performance gains, why are there only a handful of diffusions (the variance-preserving stochastic differential equation (VPSDE), variance exploding (VE), CLD, and sub-VPSDE) used in DBGMs? For instance, are there other auxiliary variable diffusions that would lead to improvements like CLD?
This avenue seems promising, as auxiliary variables have improved other generative models and inference methods, such as normalizing flows (Huang et al., 2020), neural ordinary differential equations (ODEs) (Dupont et al., 2019), hierarchical variational models (Ranganath et al., 2016), and ladder variational autoencoders (Sønderby et al., 2016), among others. Despite its success, CLD also provides evidence that each new process requires significant model-specific analysis. Deriving the evidence lower bound (ELBO) and training algorithm for diffusions is challenging (Huang et al., 2021; Kingma et al., 2021; Song et al., 2021) and is carried out in a case-by-case manner for new diffusions (Campbell et al., 2022). Auxiliary variables seemingly complicate this process further: computing conditionals of the inference process necessitates solving matrix Lyapunov equations (section 3.3), and deriving the inference stationary distribution, which helps the model and inference match, can be intractable. These challenges limit rapid prototyping and evaluation of new inference processes. Concretely, training a diffusion model requires: (R1) selecting an inference and model process pair such that the inference process converges to the model prior; (R2) deriving the ELBO for this pair; and (R3) estimating the ELBO and its gradients by deriving and computing the inference process's transition kernel. In this work, we introduce Multivariate Diffusion Models (MDMs) and a method for training and evaluating them. MDMs are diffusion-based generative models trained with auxiliary variables. We provide a recipe for training MDMs beyond specific instantiations, like VPSDE and CLD, to all linear inference processes that have a stationary distribution, with any number of auxiliary variables. First, we bring results from gradient-based MCMC (Ma et al., 2015) to diffusion modeling to construct MDMs that converge to a chosen model prior (R1); this tightens the ELBO.
Secondly, for any number of auxiliary variables, we derive the MDM ELBO (R2). Finally, we show that the transition kernel of linear MDMs, necessary for the ELBO, can be computed automatically and generically for higher-dimensional auxiliary systems (R3). With these tools, we explore a variety of new inference processes for diffusion-based generative models. We then note that the automatic transitions and fixed stationary distributions facilitate directly learning the inference to maximize the MDM ELBO. Learning turns diffusion model training into a search not only over score models but also over inference processes, at no extra derivational cost.

Methodological contributions. In summary, our methodological contributions are:
1. Deriving ELBOs for training and evaluating multivariate diffusion models (MDMs) with auxiliary variables.
2. Showing that the diffusion transition covariance does not need to be manually derived for each new diffusion. We instead demonstrate that a matrix factorization technique, previously unused in diffusion models, can automatically compute the covariance analytically for any linear MDM.
3. Using results from gradient-based Markov chain Monte Carlo (MCMC) to construct MDMs with a complete parameterization of inference processes whose stationary distribution matches the model prior.

To demonstrate these ideas, we develop MDMs with two specific diffusions as well as learned multivariate diffusions. The specific diffusions are the accelerated Langevin diffusion (ALDA), introduced in Mou et al. (2019) as a higher-order scheme for gradient-based MCMC, and an alteration, the modified accelerated Langevin diffusion (MALDA). Previously, using these diffusions for generative modeling would require significant model-specific analysis. Instead, Automatic Multivariate Diffusion Training (AMDT) for these diffusions is derivation-free.

Empirical contributions. We train MDMs on the MNIST, IMAGENET32, and CIFAR-10 datasets. In the experiments, we show that:
1. Training new and existing fixed diffusions, such as ALDA and MALDA, is easy with the proposed algorithm AMDT.
2. Using AMDT to learn the choice of diffusion for the MDM matches or surpasses the performance of fixed choices of diffusion process; sometimes the learned diffusion and VPSDE do best, other times the learned diffusion and CLD do best.
3. There are new and existing MDMs, trained and evaluated with the MDM ELBO, that account for as much performance improvement over VPSDE as a three-fold increase in score model size for a fixed univariate diffusion.

These findings affirm that the choice of diffusion affects the optimization problem, and that learning the choice bypasses the process of choosing diffusions for each new dataset and score architecture. We additionally show the utility of the MDM ELBO by showing on a dataset that CLD achieves better bits-per-dims (BPDs) than previously reported with the probability flow ODE (Dockhorn et al., 2021).

2. SETUP

We present diffusions by starting with the generative model and then describing its likelihood lower bound (Sohl-Dickstein et al., 2015; Huang et al., 2021; Kingma et al., 2021). Diffusions sample from a model prior z_0 ∼ π_θ and then evolve a continuous-time stochastic process z_t ∈ R^d:

dz = h_θ(z, t)dt + β_θ(t)dB_t,  t ∈ [0, T],

where B_t is a d-dimensional Brownian motion. The model is trained so that z_T approximates the data x ∼ q_data. Maximum likelihood training of diffusion models is intractable (Huang et al., 2021; Song et al., 2021; Kingma et al., 2021). Instead, they are trained using a variational lower bound on log p_θ(z_T = x). The bound requires an inference process q_ϕ(y_s | y_0 = x):

dy = f_ϕ(y, s)ds + g_ϕ(s)dB̄_s,  s ∈ [0, T],

where B̄_s is another Brownian motion independent of B_t. The inference process is usually taken to be specified rather than learned, and chosen to be i.i.d. for each y_tj conditional on each x_j. This leads to the interpretation of the y_tj as noisy versions of the features x_j (Ho et al., 2020). While the diffusion ELBO is challenging to derive in general, Huang et al. (2021) and Song et al. (2021) show that when the model process takes the form

dz = [g²_ϕ(T−t) s_θ(z, T−t) − f_ϕ(z, T−t)]dt + g_ϕ(T−t)dB_t,

the ELBO is

log p_θ(x) ≥ L_ism(x) = E_{q_ϕ(y|x)}[ log π_θ(y_T) + ∫_0^T ( −½‖s_θ‖²_{g²_ϕ} − ∇·(g²_ϕ s_θ − f_ϕ) ) ds ],

where f_ϕ, g_ϕ, s_θ are evaluated at (y_s, s), ‖x‖²_A = x^⊤Ax, and g² = gg^⊤. Equation (4) features the Implicit Score Matching (ISM) loss (Song et al., 2020a), and can be re-written as an ELBO L_dsm featuring Denoising Score Matching (DSM) (Vincent, 2011; Song et al., 2020b); see appendix F.1.

3. A RECIPE FOR MULTIVARIATE DIFFUSION MODELS

As has been shown in prior work (Song et al., 2021; Dockhorn et al., 2021), the choice of diffusion matters. Drawing on principles from previous generative models (section 6), we consider a wide class of diffusion inference processes constructed using auxiliary variables. At first glance, training such diffusions can seem challenging. First, one needs an ELBO that includes auxiliary variables. This ELBO requires sampling from the transition kernel and setting the model prior to the specified inference stationary distribution. But doing such diffusion-specific analysis manually is challenging and hinders rapid prototyping. In this section we address these challenges and introduce an algorithm, AMDT, to simplify and automate modeling with MDMs. AMDT can be used to train new and existing diffusions, including those with auxiliary variables and those that learn the inference process. In appendix A we discuss how the presented methods can also be used to automate and improve the simplified score matching and noise prediction objectives used to train diffusion models.

3.1. MULTIVARIATE MODEL AND INFERENCE

For the j-th data coordinate at each time t, MDMs pair z_tj ∈ R with a vector of auxiliary variables v_tj ∈ R^{K−1} into a joint vector u_t and diffuse in the extended space:

u_0 ∼ π_θ,  du = h_θ(u_t, t)dt + β_θ(t)dB_t,  u_t = [z_t, v_t].

MDMs model the data x with z_T, a coordinate in u_T ∼ p_θ. For the j-th feature x_j, each u_tj ∈ R^K consists of a "data" dimension u^z_tj and auxiliary variables u^v_tj; therefore u ∈ R^{dK}. We extend the drift coefficient h_θ from a function R^d × R₊ → R^d to the extended space R^{dK} × R₊ → R^{dK}. We likewise extend the diffusion coefficient to a matrix β_θ acting on a Brownian motion B_t ∈ R^{dK}. Because the MDM model is over the extended space, the inference process y must be too. We set q(y^v_0 | y^z_0 = x) to any chosen initial distribution, e.g. N(0, I), and discuss this choice in section 4. Then y_s evolves according to the auxiliary variable inference process:

dy = f_ϕ(y, s)ds + g_ϕ(s)dB̄_s,

where the inference drift and diffusion coefficients f_ϕ, g_ϕ are now over the extended space y = [y^z, y^v]. The function f_ϕ lets the z and v coordinates of y_tj interact in the inference process.

ASSUMPTIONS

This work demonstrates how to parameterize time-varying Itô processes, used for diffusion modeling, to have a stationary distribution that matches the given model prior. To take advantage of the automatic transition kernels also presented, the inference processes considered for modeling are linear time-varying processes of the form:

dy = A_ϕ(s)y ds + g_ϕ(s)dB_s,

where A_ϕ(s): R₊ → R^{dK×dK} and g_ϕ(s): R₊ → R^{dK×dK} are matrix-valued functions.

3.2. ELBO FOR MDMs

We now show how to train MDMs to optimize a lower bound on the log likelihood of the data. Like in the univariate case, we use the parameterization in eq. (3) to obtain a tractable ELBO.

Theorem 1. The MDM log marginal likelihood of the data is lower-bounded by:

log p_θ(x) ≥ E_{q_ϕ(y|x)}[ log π_θ(y_T) − ∫_0^T ( ½‖s_θ‖²_{g²_ϕ} + ∇·(g²_ϕ s_θ − f_ϕ) ) ds − log q_ϕ(y^v_0 | x) ]   (L_mism)
= E_{q_ϕ(y|x)}[ log π_θ(y_T) + ∫_0^T ( ½‖s_ϕ‖²_{g²_ϕ} − ½‖s_θ − s_ϕ‖²_{g²_ϕ} + ∇·f_ϕ ) ds − log q_ϕ(y^v_0 | x) ]   (L_mdsm),

where divergences and gradients are taken with respect to y_s and s_ϕ = ∇_{y_s} log q_ϕ(y_s | x).

Proof. The proof for the MDM ISM ELBO L_mism is in appendix F. In short, we introduce auxiliary variables, apply Theorem 1 of Huang et al. (2021) (equivalently, Theorem 3 of Song et al. (2021) or appendix E of Kingma et al. (2021)) to the joint space, and then apply an additional variational bound to v_0. The MDM DSM ELBO L_mdsm is likewise derived in appendix F, similarly to Huang et al. (2021) and Song et al. (2021), but extended to multivariate diffusions.

We train MDMs by estimating the gradients of L_mdsm, as estimates of L_mism can be computationally prohibitive. For numerical stability, the integral in eq. (7) is computed on [ϵ, T] rather than [0, T]. One can regard this as a bound for the variable u_ϵ. To maintain a proper likelihood bound for the data, one can choose a likelihood u_0 | u_ϵ and compose bounds as we demonstrate in appendix I. We report the ELBO with this likelihood term, which plays the same role as the discretized Gaussian in Nichol & Dhariwal (2021) and Tweedie's formula in Song et al. (2021).

3.3. INGREDIENT 1: COMPUTING THE TRANSITION q_ϕ(y_s | x)

To estimate eq. (7) and its gradients, we need samples from q(y_s | x) and the ability to compute ∇ log q(y_s | x). While intractable for MDMs in general, we provide two ingredients for tightening and optimizing these bounds in a generic fashion for linear inference MDMs.
We first show how to automate computation of q(y_s | y_0) and then q(y_s | x). For linear MDMs of the form dy = A(s)y ds + g(s)dB_s, the transition kernel q(y_s | y_0) is Gaussian (Särkkä & Solin, 2019). Let f(y, s) = A(s)y. Then, the mean and covariance are solutions to the following ODEs:

dm_{s|0}/ds = A(s)m_{s|0},
dΣ_{s|0}/ds = A(s)Σ_{s|0} + Σ_{s|0}A^⊤(s) + g²(s).

The mean can be solved analytically:

m_{s|0} = exp(∫_0^s A(ν)dν) y_0 = exp(sA)y_0  (no integration if A(ν) = A).

The covariance equation does not have as simple a solution because the unknown matrix Σ_{s|0} is multiplied both from the left and the right. Instead of solving the covariance ODE for a specific diffusion manually, as done in previous work (e.g. pages 50-54 of Dockhorn et al. (2021)), we show that a matrix factorization technique (Särkkä & Solin (2019), sec. 6.3) previously unused in diffusion-based generative models can compute Σ_{s|0} automatically and generically for any linear MDM. Define C_s, H_s that evolve according to:

d/ds [C_s; H_s] = [[A(s), g²(s)]; [0, −A^⊤(s)]] [C_s; H_s],

then Σ_{s|0} = C_s H_s^{−1} for C_0 = Σ_0 and H_0 = I (appendix D). These equations can be solved in closed form:

[C_s; H_s] = exp([[ [A]_s, [g²]_s ]; [0, −[A^⊤]_s ]]) [Σ_0; I] = exp(s [[A, g²]; [0, −A^⊤]]) [Σ_0; I]  (no integration if A(ν) = A, g(ν) = g),

where [A]_s = ∫_0^s A(ν)dν. To condition on y_0 = (x, v), we set Σ_0 = 0.

Computing q_ϕ(y_s | x). To condition on x instead of y_0, we set the initial covariance to

Σ_0 = [[0, 0]; [0, Σ_{v0}]].

The mean is the same expression as for q(y_s | y_0), but with a different initial condition:

m_{s|0} = exp(∫_0^s A(ν)dν) [x; E_q[y^v_0 | x]].

See appendix D for more details.
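As a concrete sketch of the block-matrix computation above, the following implements q(y_s | y_0) for a time-homogeneous linear SDE using scipy's matrix exponential; the function name and the Ornstein-Uhlenbeck sanity check are our own illustration, not notation from the paper:

```python
import numpy as np
from scipy.linalg import expm

def linear_sde_transition(A, g, s, y0, Sigma0=None):
    """Mean and covariance of q(y_s | y_0) for the time-homogeneous
    linear SDE dy = A y ds + g dB_s (Sarkka & Solin, 2019, sec. 6.3)."""
    d = A.shape[0]
    if Sigma0 is None:
        Sigma0 = np.zeros((d, d))  # Sigma_0 = 0: condition on y_0 exactly
    G = g @ g.T
    # d/ds [C; H] = [[A, G], [0, -A^T]] [C; H], with C_0 = Sigma_0, H_0 = I
    block = np.block([[A, G], [np.zeros((d, d)), -A.T]])
    CH = expm(s * block) @ np.vstack([Sigma0, np.eye(d)])
    C, H = CH[:d], CH[d:]
    return expm(s * A) @ y0, C @ np.linalg.inv(H)

# Sanity check against the 1-D Ornstein-Uhlenbeck process dy = -y ds + 0.5 dB,
# whose covariance is known analytically: (0.25 / 2) (1 - exp(-2 s))
m, S = linear_sde_transition(np.array([[-1.0]]), np.array([[0.5]]), 0.7, np.array([2.0]))
assert np.isclose(m[0], 2.0 * np.exp(-0.7))
assert np.isclose(S[0, 0], 0.125 * (1.0 - np.exp(-1.4)))
```

For time-varying A(s), g(s), the same block system can be integrated numerically instead of exponentiated, which is what removes the need for a hand-derived covariance like the one in Dockhorn et al. (2021).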

Algorithm 1 Automatic Multivariate Diffusion Training

Input: Data {x_i}, inference process matrices Q_ϕ, D_ϕ, model prior π_θ, initial distribution q_ϕ(y^v_0 | x), and score model architecture s_θ
Returns: Trained score model s_θ
while s_θ not converged do
    Sample x ∼ (1/N) Σ_{i=1}^N δ_{x_i}, v_0 ∼ q_ϕ(y^v_0 | x)
    Sample s ∼ U[0, T] and y_s, y_T ∼ q_ϕ(y_s | x) using algorithm 2
    Estimate the stochastic gradient of the MDM ELBO, ∇_θ L(θ, ϕ), using eq. (7)
    θ ← θ + α∇_θ L(θ, ϕ)
    if learning inference then
        ϕ ← ϕ + α∇_ϕ L(θ, ϕ)
    end if
end while
Output s_θ

The table shows the extra computational cost of the automated algorithm is negligible. This automation likewise applies to simplified score matching and noise prediction objectives, since all rely on q_ϕ(y_s | x) (appendix A).

3.4. INGREDIENT 2: MDM PARAMETERIZATION

The MDM ELBO (eq. (7)) is tighter when the inference y_T tends toward the model's prior π_θ. Here we construct inference processes with the model prior π_θ as a specified stationary distribution q_∞. Ma et al. (2015) provide a complete recipe for constructing gradient-based MCMC samplers; the recipe constructs non-linear time-homogeneous Itô processes with a given stationary distribution, and they show that the parameterization spans all such Itô processes with that stationary distribution. Diffusion models usually have time-varying drift and diffusion coefficients (e.g. the use of the β(t) function). To build diffusion models that match the model prior, we first extend Theorem 1 of Ma et al. (2015) to construct non-linear Itô processes with time-varying drift and diffusion coefficients and a given stationary distribution (appendix C). Then, to keep transitions tractable (per section 3.3), we specialize this result to linear Itô diffusions. We directly state the result for linear time-varying diffusions with stationary distributions. The parameterization requires a skew-symmetric matrix Q(s) = −Q(s)^⊤, a positive semi-definite matrix D(s), and a function ∇H(y) such that the desired stationary distribution q_∞ is proportional to exp[−H(y)]. Linear Itô diffusions have Gaussian stationary distributions (Särkkä & Solin, 2019), meaning that ∇H is linear and can be expressed as Sy for some matrix S. For a matrix A, let √A refer to the matrix square root defined by a = √A ⟺ A = aa^⊤. Then, the Itô diffusion

dy = −[Q(s) + D(s)]Sy ds + √(2D(s)) dB̄_s,

with drift f(y, s) = −[Q(s) + D(s)]Sy and diffusion coefficient g(s) = √(2D(s)), has Gaussian stationary distribution N(0, S^{−1}), where Q(s), D(s), and S are parameters. For a discussion of convergence to the stationary distribution, as well as skew-symmetric and positive semi-definite parameterizations, see appendix C, where we also show that existing diffusion processes such as VPSDE and CLD are included in the Q/D parameterization.
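The claim can be checked numerically: for the linear process above, stationarity of N(0, S^{−1}) is equivalent to the continuous Lyapunov equation AΣ + ΣA^⊤ + gg^⊤ = 0 with drift matrix A = −(Q + D)S and gg^⊤ = 2D. This sketch with random Q, D, S (our own illustration, not from the paper) confirms the residual vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
M = rng.standard_normal((K, K)); Q = M - M.T              # skew-symmetric
B = rng.standard_normal((K, K)); D = B @ B.T              # positive semi-definite
C = rng.standard_normal((K, K)); S = C @ C.T + np.eye(K)  # prior precision (SPD)

A = -(Q + D) @ S          # drift matrix of dy = -(Q + D) S y ds + sqrt(2D) dB
Sigma = np.linalg.inv(S)  # claimed stationary covariance, N(0, S^{-1})

# Stationarity of a linear SDE <=> continuous Lyapunov equation holds
residual = A @ Sigma + Sigma @ A.T + 2 * D
assert np.allclose(residual, 0)
```

The cancellation is exact: AΣ = −(Q + D) and ΣA^⊤ = Q − D, so the three terms sum to zero for any skew-symmetric Q and any D.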
We display the ELBO in terms of Q/D in appendix G and an algorithm in appendix H. For score matching and noise prediction losses and a given q_ϕ, minimizing with respect to s_θ does not imply that the generative model score will match the inference score. Modeling the data also requires the marginal distribution q_ϕ,T to approximate π. When q_ϕ is fixed, it is important to confirm the stationary distribution is appropriately set; the tools used here for the ELBO can be used to satisfy this requirement for score matching and noise prediction (appendix A).

3.5. LEARNING THE INFERENCE PROCESS

The choice of diffusion matters, and the ELBOs in eq. (7) have no requirement for a fixed q_ϕ. We therefore learn the inference process jointly with s_θ. Under linear transitions (ingredient 1), no algorithmic details change as the diffusion changes during training. Under the stationary parameterization (ingredient 2), we can learn without the stationary distribution going awry. In the experiments, learning matches or surpasses the BPDs of fixed diffusions for a given dataset and score architecture. In L_mdsm or L_mism, q_ϕ,∞ may be set to equal π_θ, but it is y_T ∼ q_ϕ,T for the chosen T that is featured in the ELBO. Learning q_ϕ can choose y_T to reduce the cross-entropy:

−E_{q_ϕ(y_T|x)}[log π_θ(y_T)].   (14)

Minimizing eq. (14) tightens the ELBO for any s_θ. Next, q_ϕ is featured in the remaining terms that feature s_θ; optimizing q_ϕ tightens and improves the ELBO alongside s_θ. Finally, q_ϕ is featured in the expectations and the −log q_ϕ term:

log p_θ(u^z_T = x) ≥ E_{q_ϕ(y^v_0 = v|x)}[(L_dsm or L_ism) − log q_ϕ(y^v_0 = v | x)].   (15)

The q_ϕ(y^v_0 | x) terms impose an optimality condition that p_θ(u^v_T | u^z_T) = q_ϕ(y^v_0 | y^z_0) (appendix E). When it is satisfied, no looseness in the ELBO is due to the initial time-zero auxiliary variables. To learn, Q and D need to be specified with parameters ϕ that enable gradients. We keep S fixed at the inverse covariance of π_θ. The transition kernel q_ϕ(y_s | x) depends on Q, D through its mean and covariance. Gaussian distributions permit gradient estimation with reparameterization or score-function gradients (Kingma & Welling, 2013; Ranganath et al., 2014; Rezende & Mohamed, 2015; Titsias & Lázaro-Gredilla, 2014). Reparameterization is accomplished via:

y_s = m_{s|0} + L_{s|0}ϵ,   (16)

where ϵ ∼ N(0, I_{dK}) and L_{s|0} satisfies L_{s|0}L^⊤_{s|0} = Σ_{s|0}, derived using a coordinate-wise Cholesky decomposition. Gradients flow through eq. (16) from y_s to m_{s|0} and Σ_{s|0}, to Q, D, and on to the parameters ϕ.
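As a sketch of the reparameterization in eq. (16), the snippet below (our own illustration, using PyTorch on a 2-dimensional example with a scalar parameter for Q and a diagonal D) builds Σ_{s|0} through the block-matrix exponential, samples y_s = m_{s|0} + L_{s|0}ϵ, and checks that gradients reach the inference parameters:

```python
import torch

torch.manual_seed(0)
K, s = 2, 0.5
q_raw = torch.tensor(0.5, requires_grad=True)         # parameterizes skew-symmetric Q
d_raw = torch.tensor([0.3, 0.7], requires_grad=True)  # parameterizes diagonal PSD D

zero = torch.zeros(())
Q = torch.stack([torch.stack([zero, -q_raw]), torch.stack([q_raw, zero])])
D = torch.diag(torch.nn.functional.softplus(d_raw))   # softplus keeps D positive
S = torch.eye(K)                                      # prior precision, held fixed
A = -(Q + D) @ S

# Transition covariance via the block-matrix exponential (ingredient 1)
top = torch.cat([A, 2 * D], dim=1)
bot = torch.cat([torch.zeros(K, K), -A.T], dim=1)
CH = torch.matrix_exp(s * torch.cat([top, bot], dim=0)) \
     @ torch.cat([torch.zeros(K, K), torch.eye(K)], dim=0)
Sigma = CH[:K] @ torch.linalg.inv(CH[K:])
Sigma = 0.5 * (Sigma + Sigma.T)                       # symmetrize for Cholesky
L = torch.linalg.cholesky(Sigma)

x = torch.tensor([1.0, 0.0])
m = torch.matrix_exp(s * A) @ x
y_s = m + L @ torch.randn(K)                          # reparameterized sample
y_s.sum().backward()                                  # gradients reach q_raw, d_raw
```

In a real MDM the sum would be replaced by the ELBO integrand, and Q, D would enter per feature; the point is only that m_{s|0} and L_{s|0} are differentiable functions of ϕ.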

Algorithm 1 displays Automatic Multivariate Diffusion Training (AMDT). AMDT provides a training method for diffusion-based generative models for either fixed Q, D matrices or for learning the Q_ϕ, D_ϕ matrices, without requiring any diffusion-specific analysis.

Learning in other diffusion objectives. Like in the ELBO, learning in score matching or noise prediction objectives can improve the match between the inference process and the implied generative model (appendix A).

4. INSIGHTS INTO MULTIVARIATE DIFFUSIONS

Scalar versus Multivariate Processes. Equation (13) clarifies what can change while preserving q_∞. Recall that Q and D are K × K for K − 1 auxiliary variables. Because 0 is the only 1 × 1 skew-symmetric matrix, scalar processes must set Q = 0. With q_ϕ,∞ = N(0, I), the process is:

dy = −D(s)y ds + √(2D(s)) dB̄_s.

What is left is the VPSDE process used widely in diffusion models, where D(s) = ½β(s) is 1 × 1 (Song et al., 2020b). This reveals that the VPSDE process is the only scalar diffusion with a stationary distribution. This also clarifies the role of Q: it accounts for mixing between dimensions in multivariate processes, as do non-diagonal entries in D for K > 1.

CLD optimizes a log-likelihood lower bound. Differentiating L_mdsm (eq. (7)) with respect to the score model parameters, we show that the objective for CLD (Dockhorn et al., 2021) maximizes a lower bound on log p_θ(x), not just log p_θ(u_0), without appealing to the probability flow ODE.
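A quick simulation (our own check, with a constant β for simplicity) confirms that the scalar process dy = −½β y ds + √β dB̄ drives an arbitrary starting point toward the stationary N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.8                        # constant beta(s); D(s) = beta / 2
dt, n_steps, n_paths = 0.01, 1500, 100_000
y = np.full(n_paths, 3.0)         # start far from the stationary distribution

for _ in range(n_steps):
    # Euler-Maruyama step of dy = -(beta / 2) y ds + sqrt(beta) dB
    y += -0.5 * beta * y * dt + np.sqrt(beta * dt) * rng.standard_normal(n_paths)

# After s = 15, the marginal is close to N(0, 1)
assert abs(y.mean()) < 0.05 and abs(y.var() - 1.0) < 0.05
```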

Does my model use auxiliary variables?

An example initial distribution is q(y v 0 |x) = N (0, I). It is also common to set π θ = N (0, I). Because the optimum for diffusions is p θ = q, the optimal model has main and auxiliary dimensions independent at endpoints 0 and T . Does this mean that the model does not use auxiliary variables? In appendix B, we show that in this case the model can still use auxiliary variables at intermediate times. A sufficient condition is non-diagonal Q + D.

5. EXPERIMENTS

We test the MDM framework with handcrafted and learned diffusions. The handcrafted diffusions are (a) ALDA, used in Mou et al. (2019) for accelerated gradient-based MCMC sampling (eq. (32)), and (b) MALDA, a modified version of ALDA (eq. (33)). Both have two auxiliary variables. We also learn diffusions with 1 and 2 auxiliary variables. We compare with VPSDE and ELBO-trained CLD. We input the auxiliary variables as extra channels, which only increases the score model parameters in the input and output convolutions (CLD and Learned 2 have 7,000 more parameters than VPSDE on CIFAR-10 and IMAGENET32, and only 865 more for MNIST). We use simple uniform dequantization. We report estimates of L_mdsm (which reduces to the standard L_dsm for K = 1). We sample times using the importance sampling distribution from Song et al. (2021) with truncation set to ϵ = 10^{−3}. To ensure the truncated bound is proper, we use a likelihood described in appendix I.

Results. Table 2 shows that the inference process matters. It displays DBGMs that we train and evaluate on CIFAR-10, IMAGENET32, and MNIST. This includes the existing VPSDE and CLD, the new MALDA and ALDA, and the new learned inference processes. All are trained with the 35.7M parameter architecture. For CIFAR-10, learning outperforms CLD, and both outperform the standard choice of VPSDE. For MNIST, learned diffusions match VPSDE while the three fixed auxiliary diffusions are worse. On IMAGENET32, all perform similarly. The take-away is that learning matches or surpasses the best fixed diffusion performance and bypasses the choice of diffusion for each new dataset or score architecture. In Figure 1 we plot generated samples from CIFAR-10.

6. RELATED WORK

Auxiliary variables. Dupont et al. (2019) show that augmented neural ODEs model a richer set of functions, and Huang et al. (2020) use this principle for normalizing flows.
Hierarchical variational models and auto-encoders marginalize auxiliary variables to build expressive distributions (Ranganath et al., 2016; Sønderby et al., 2016; Maaløe et al., 2019; Vahdat & Kautz, 2020; Child, 2020). We apply this principle to DBGMs, including and extending CLD (Dockhorn et al., 2021).

Learning inference. Learning q_ϕ jointly with p_θ is motivated in previous work.

7. DISCUSSION

We present an algorithm for training multivariate diffusions with linear time-varying inference processes that have a specified stationary distribution and any number of auxiliary variables. This includes automating transition kernel computation and providing a parameterization of diffusions with a specified stationary distribution, both of which facilitate working with new diffusion processes, including learning the diffusion. The experiments show that learning matches or surpasses the best fixed diffusion performance, bypassing the need to choose a diffusion. MDMs achieve BPDs similar to those of univariate diffusions with as many as three times more score parameters.

A AUTOMATED SCORE MATCHING WITH LEARNED INFERENCE

Like for the MDM ELBO, the methods in this work apply to training with the score matching loss:

L_SM(x, θ, ϕ) = T E_{t∼U[0,T]} E_{q_ϕ(y|x)}[ λ(t) ‖s_θ(y_t, t) − ∇_{y_t} log q_ϕ(y_t | x)‖²₂ ],

where λ: [0, T] → R₊ is a weighting function. The score-matching loss is often optimized in its simplified noise prediction form:

L_NP(x, θ, ϕ) = T E_{t∼U[0,T]} E_{q_ϕ(y|x)}[ ‖ϵ_θ(y_t, t) − ϵ‖²₂ ],

where s_θ = −L_t^{−⊤}ϵ_θ, y_t = μ_t + L_tϵ, and ϵ is the noise used in sampling y_t. We describe here how the improvements to the ELBO studied in this work carry over to L_SM and L_NP. In the following, let q_0 be the data distribution, let p_(θ,ϕ),0 be the model's distribution of the data, and recall that the model is defined by (s_θ, f_ϕ, g_ϕ) and prior π via a continuous-time stochastic process with drift coefficient g²_ϕ s_θ − f_ϕ and diffusion coefficient g_ϕ. First, minimizing L_SM or L_NP so that ∇_{y_t} log q_ϕ(y_t) = s_θ(y_t, t) does not alone imply that p_(θ,ϕ),0 will equal q_0; it must also be that q_ϕ,T ≈ π. Foregoing this requirement means π will produce samples that the generative model may not be able to push onto the path the model was trained on (formally, the score of the generative model would not equal the time-reversal of the forward score even if s_θ equals the forward score). This condition can be satisfied if q_ϕ can be chosen with stationary distribution π. Section 3.4 describes how to accomplish this. Next, for any fixed q_ϕ, the automatic transitions from section 3.3 streamline the computation of the score matching loss, allowing simple score computation for a wide class of diffusions beyond VP. Finally, for a fixed q_ϕ with q_ϕ,T ≈ π and a score architecture s_θ, minimizing L_SM or L_NP w.r.t. θ may be suboptimal. Optimization, like for the ELBO, carries over to score matching and can close this gap; learning w.r.t. both θ and ϕ increases the ability to successfully minimize the loss at each t (section 3.5).
In other words, since the generative model is defined by (s θ , f ϕ , g ϕ ), learning q ϕ means the loss trains all three components of the generative model rather than just one. In summary, score matching is automatic and can learn over the space of linear diffusions that tend to the model prior.
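The identity s_θ = −L_t^{−⊤}ϵ_θ linking the two losses follows from the Gaussian transition: with y_t = μ_t + L_tϵ and Σ_t = L_tL_t^⊤, the true score is ∇_{y_t} log q_ϕ(y_t | x) = −Σ_t^{−1}(y_t − μ_t) = −L_t^{−⊤}ϵ. A small numerical check of this identity (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# A random Gaussian transition q(y_t | x) = N(mu, Sigma), Sigma = L L^T
mu = rng.standard_normal(d)
M = rng.standard_normal((d, d))
Sigma = M @ M.T + d * np.eye(d)
L = np.linalg.cholesky(Sigma)

eps = rng.standard_normal(d)
y = mu + L @ eps                          # reparameterized sample of y_t

score = -np.linalg.inv(Sigma) @ (y - mu)  # Gaussian score at y_t
target = -np.linalg.inv(L).T @ eps        # noise-prediction form, -L^{-T} eps
assert np.allclose(score, target)
```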

B DOES MY MODEL USE AUXILIARY VARIABLES?

In section 3 we gave the example choice of q(y^v_0 | x) = N(0, I) coordinate-wise. It is also a common choice to set π_θ = N(0, I). Because the optimum in diffusion models is p_θ = q for all t, we see a peculiar phenomenon under this choice: the model has main and auxiliary dimensions independent at both endpoints 0 and T. Does this mean that the model does not use auxiliary variables? We show that even when q_ϕ(y_0) and π_θ have main and auxiliary variables independent, the model can use the auxiliary variables. A sufficient condition is that Q + D is non-diagonal. To make this precise, recall that we model with p_θ(u^z_T = x). To show the model is using auxiliary variables, we just need to show that u^z_T (the main coordinate at T) depends on u^v_t (the auxiliary coordinate at t) for T > t. At the optimum, p_θ(u^z_T, u^v_t) = q_ϕ(y^z_0, y^v_{T−t}). Therefore it is sufficient to show that for some time s, q_ϕ(y^v_s | y^z_0) ≠ q_ϕ(y^v_s). Because y^z_0 is determined by x, we need to show that q_ϕ(y^v_s | x) ≠ q_ϕ(y^v_s). To do that, we first derive q(y_s | x) and then marginalize to get q(y^v_s | x) from it. Since the former is a 2D Gaussian, the latter is available in terms of the former's mean and covariance. Suppose E[y^v_0] = 0, Q = [[0, −1], [1, 0]], D = [[1, 0], [0, 1]], and s = 0.1. We have:

E[y_s | x] = exp(−s(Q + D)) [x; 0] = exp([[−0.1, 0.1]; [−0.1, −0.1]]) [x; 0] = [0.9003x; −0.0903x].

Regardless of the covariance, any one-dimensional marginal of this 2D Gaussian has a mean that is a function of x, meaning that q(y^v_s | x) does not equal q(y^v_s) (which is also a Gaussian, but with mean depending on x's mean rather than x itself). Therefore, even under the setup with independent endpoints, the optimal model makes use of the intermediate auxiliary variables in its final modeling distribution p_θ(u^z_T = x). Are there choices of Q and D that lead to learning models that don't make use of the extra dimensions?
As mentioned, in the inference process, Q is responsible for mixing information among the coordinates, and is the only source of this mixing when D is diagonal. If Q = 0 and D is diagonal, none of the coordinates for a given feature x_j (including u^z_tj, u^{v_1}_tj, ..., u^{v_{K−1}}_tj) interact for any t. Then, since p_θ = q at the optimum, independence of the coordinates at all t in q implies the same in p_θ, and the model will not make use of any auxiliary variables when modeling the marginal log p_θ(u^z_T = x).
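The mean computation in the example above can be reproduced directly with a matrix exponential (our own sketch):

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[0.0, -1.0], [1.0, 0.0]])  # skew-symmetric
D = np.eye(2)                             # diagonal (identity) D
s = 0.1

# E[y_s | x] = exp(-s (Q + D)) [x, 0]^T with S = I; take x = 1
mean = expm(-s * (Q + D)) @ np.array([1.0, 0.0])
# The auxiliary coordinate's mean is a nonzero multiple of x (about -0.0903 x),
# so q(y^v_s | x) differs from the marginal q(y^v_s)
assert abs(mean[1]) > 0.05
```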

C STATIONARY PARAMETERIZATION

The non-linear time-homogeneous Itô process family is:

dy = f(y)dt + g(y)dB_t.

This family can be restricted to those with stationary distributions. Ma et al. (2015) show a complete recipe to span the subset of this family with a desired stationary distribution. Let Q be skew-symmetric (Q = −Q^⊤) and let D be positive semi-definite. Suppose the desired stationary distribution is q_∞(y). For a matrix A, let √A refer to the matrix square root defined by a = √A ⟺ A = aa^⊤. Then, Ma et al. (2015) show that setting H(y) = −log q_∞(y), g(y) = √(2D(y)), and

f(y) = −[D(y) + Q(y)]∇H(y) + Γ(y),  Γ_i(y) = Σ_{j=1}^d ∂/∂y_j (D_ij(y) + Q_ij(y)),

yields a process y_t with stationary distribution q_∞. We extend this to time-varying (time-inhomogeneous) processes.

Theorem 2. q_∞(y) ∝ exp[−H(y)] is a stationary distribution of

dy = (−[D(y, t) + Q(y, t)]∇H(y) + Γ(y, t))dt + √(2D(y, t))dB_t,

for Γ_i(y, t) = Σ_{j=1}^d ∂/∂y_j (D_ij(y, t) + Q_ij(y, t)).

Proof. The Fokker-Planck equation is:

∂_t q(y, t) = −Σ_i ∂/∂y_i [f_i(y, t)q(y, t)] + Σ_{i,j} ∂²/∂y_i∂y_j [D_ij(y, t)q(y, t)].

A stationary distribution is one where the Fokker-Planck right-hand side equals 0. To show that the stationary characterization also holds for time-inhomogeneous processes with D(y, t) and Q(y, t), we take two steps, closely following Yin & Ao (2006), Shi et al. (2012), and Ma et al. (2015), but noting that there is no requirement for Q, D to be free of t. First, we show that the Fokker-Planck equation can be re-written as:

∂_t q(y, t) = ∇ · ([D(y, t) + Q(y, t)][q(y, t)∇H(y) + ∇q(y, t)]).

Second, because the whole expression equals 0 when the inside expression

q(y, t)∇H(y) + ∇q(y, t) = 0,

we just need to show that this holds when q(y, t) = exp[−H(y)]/Z. The second step is concluded because

q(y, t)∇H(y) + ∇q(y, t) = (1/Z)(exp[−H(y)]∇H(y) + ∇exp[−H(y)]) = 0,

where Z is the normalization constant of exp(−H(y)). It only remains to show that the Fokker-Planck equation can be re-written in divergence form with time-dependent Q, D.
In the following, we write $Q_{ijt}$ for $Q_{ij}(y,t)$ and likewise $D_{ijt}$ for $D_{ij}(y,t)$; let $\partial_i$ denote $\partial/\partial y_i$ for scalar functions, and recall $[Ax]_i = \sum_j A_{ij}x_j$. Then
$$\partial_t q_t = \nabla\cdot\big[(D + Q)(q\nabla H + \nabla q)\big] = \sum_i \partial_i \sum_j [D_{ijt}+Q_{ijt}]\big(q\,\partial_j H + \partial_j q\big) = \sum_i \partial_i \sum_j [D_{ijt}+Q_{ijt}]\,q\,\partial_j H + \sum_i \partial_i \sum_j D_{ijt}\,\partial_j q + \sum_i \partial_i \sum_j Q_{ijt}\,\partial_j q.$$
We rewrite the second and third terms. Holding $i$ fixed and noting $q$ is scalar, the product rule gives $\sum_j D_{ijt}(\partial_j q) = \sum_j \partial_j[D_{ijt}q] - q\sum_j \partial_j D_{ijt}$ for each $i$, and likewise for $Q$:
$$\partial_t q_t = \sum_i \partial_i \sum_j [D_{ijt}+Q_{ijt}]\,q\,\partial_j H + \sum_i \partial_i\Big(\sum_j \partial_j[D_{ijt}q] - q\sum_j \partial_j D_{ijt}\Big) + \sum_i \partial_i\Big(\sum_j \partial_j[Q_{ijt}q] - q\sum_j \partial_j Q_{ijt}\Big).$$
Because $Q(y,t)$ is skew-symmetric, we have $\sum_i \partial_i \sum_j \partial_j[Q_{ijt}q] = 0$, leaving
$$\partial_t q_t = \sum_i \partial_i\Big[\Big(\sum_j [D_{ijt}+Q_{ijt}]\,\partial_j H - \sum_j \partial_j(D_{ijt}+Q_{ijt})\Big)q\Big] + \sum_{i,j} \partial^2_{y_i y_j}\big(D_{ijt}\,q\big).$$
Recalling that $f_i(y,t) = \big[-(D+Q)\nabla H + \Gamma\big]_i$ and again that $[Ax]_i = \sum_j A_{ij}x_j$, this is exactly the original Fokker-Planck right-hand side:
$$\partial_t q_t = -\sum_i \frac{\partial}{\partial y_i}\big[f_i(y,t)q(y,t)\big] + \sum_{i,j}\frac{\partial^2}{\partial y_i\partial y_j}\big[D_{ij}(y,t)q(y,t)\big] = \partial_t q(y,t).$$
We have shown that $\exp[-H(y)]/Z$ is a stationary distribution of the time-varying non-linear Itô process $dy = \big(-[D(y,t)+Q(y,t)]\nabla H(y) + \Gamma(y,t)\big)\,dt + \sqrt{2D(y,t)}\,dB_t$. However, for some choices of $Q, D$, $\exp[-H(y)]/Z$ is not necessarily the unique stationary distribution. One problematic case can occur as follows. Suppose that row $i$ of $(Q + D)$ is all-zero; then $dy_i = 0$, which implies $(y_i)_t = (y_i)_0$ for all $t > 0$, so the initial distribution is also a stationary distribution.
To rule out such pathological diffusions, we assume that $Q + D$ is full rank. Then, for uniqueness, recall that stationary distributions are the zeros of
$$\partial_t q(y,t) = \nabla \cdot \Big[\big(D(y,t) + Q(y,t)\big)\big(q(y,t)\nabla H(y) + \nabla q(y,t)\big)\Big],$$
where the expression is of the form $Av$ for $A = D(y,t) + Q(y,t)$ and $v = q(y,t)\nabla H(y) + \nabla q(y,t)$. Under the full-rank assumption, the expression can only be zero when $v$ is zero. To show uniqueness under this assumption, one must then show that $\nabla q(y,t) = -q(y,t)\nabla H(y)$ holds only if $q(y,t) = \exp[-H(y)]/Z$. Even when $\exp[-H(y)]/Z$ is the unique stationary distribution, convergence to that distribution remains a separate question; see Zhang & Chen (2013) for more details. Learning $Q_\phi, D_\phi$ in the MDM ELBO helps push $y_T$ toward the model prior $\pi_\theta$ and avoids issues like those discussed.
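Theorem 2 can be spot-checked numerically. The following sketch (not from the paper; all parameters are made up) takes $H(y) = \|y\|^2/2$, a time-varying skew-symmetric $Q(t)$, and a time-varying PSD $D(t)$, starts particles at the claimed stationary distribution $\mathcal{N}(0, I)$, and verifies that an Euler-Maruyama simulation of $dy = -(Q(t)+D(t))y\,dt + \sqrt{2D(t)}\,dB_t$ keeps the empirical covariance near the identity:

```python
import numpy as np

# Sanity check of Theorem 2: with H(y) = ||y||^2 / 2 (so grad H = y), a
# skew-symmetric Q(t) and PSD D(t), N(0, I) should remain stationary even
# though the coefficients vary in time. Parameters below are arbitrary.
rng = np.random.default_rng(0)
n, dt, steps = 20000, 0.005, 800

Q0 = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric base matrix
y = rng.standard_normal((n, 2))            # start at the stationary N(0, I)
for k in range(steps):
    t = k * dt
    bq = 1.0 + 0.5 * np.sin(t)             # time-varying scalar schedules
    bd = 1.0 + 0.5 * np.cos(t)
    Qt = bq * Q0                           # still skew-symmetric for every t
    Dt = bd * np.eye(2)                    # still PSD for every t
    drift = -y @ (Qt + Dt).T               # f(y, t) = -(Q + D) grad H
    noise = rng.standard_normal((n, 2)) @ np.sqrt(2 * Dt * dt).T
    y = y + drift * dt + noise             # Euler-Maruyama step

cov = np.cov(y.T)
max_err = np.abs(cov - np.eye(2)).max()    # small up to MC/discretization error
```

The tolerance below is loose because both Monte Carlo noise and the Euler-Maruyama discretization contribute bias.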

C.1 LINEAR PROCESSES

Next, we specialize this general family to linear Itô processes to maintain tractable transition distributions. A linear process is one where the drift $f(y,t)$ and diffusion $g(y,t)$ are linear functions of $y$. We express the drift of a non-linear time-varying Itô process with stationary distribution proportional to $\exp[-H(y)]$ as $-(Q(y,t) + D(y,t))\nabla H(y) + \Gamma(y,t)$. Linear Itô processes have Gaussian stationary distributions (Särkkä & Solin, 2019), so $H(y)$ must be quadratic and $\nabla H(y)$ linear, and neither is constant in $y$. Because $\nabla H(y)$ is linear, it can be expressed as $Sy$ for some matrix $S$, where $S$ is the inverse of the stationary covariance matrix. Because $\nabla H$ is multiplied by $Q, D$, both $Q$ and $D$ must be free of $y$. Recalling that $\Gamma$ is a sum of derivatives with respect to $y$ of $Q + D$, it follows that $\Gamma = 0$. Finally, because the stationary recipe requires $g(t) = \sqrt{2D(y,t)}$, the restriction on $D$ implies that the diffusion coefficient must also be independent of the state $y$. Our final form for linear time-varying processes with stationary distribution $\mathcal{N}(0, S^{-1})$ is
$$dy = \underbrace{-\big[Q(t)+D(t)\big]Sy}_{f(y,t)}\,dt + \underbrace{\sqrt{2D(t)}}_{g(t)}\,dB_t.$$

C.2 PARAMETERIZING $Q_\phi$

Suppose $b_q(s)$ is a positive scalar function defined on the time domain with known integral, and $\tilde{Q}_\phi$ is any matrix. Then $\tilde{Q}_\phi - \tilde{Q}_\phi^\top$ is skew-symmetric. We set
$$Q_\phi(s) = b_q(s)\cdot\big(\tilde{Q}_\phi - \tilde{Q}_\phi^\top\big).$$
This is a general parameterization of time-independent skew-symmetric matrices, whose number of degrees of freedom equals the number of entries in one triangle of the matrix, excluding the diagonal.
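For the linear form above, stationarity of $\mathcal{N}(0, S^{-1})$ reduces to an exact algebraic identity: $\Sigma = S^{-1}$ solves the Lyapunov equation $A\Sigma + \Sigma A^\top + g^2 = 0$ with $A = -(Q+D)S$ and $g^2 = 2D$. A minimal sketch with arbitrary made-up matrices:

```python
import numpy as np

# Check that Sigma = S^{-1} solves A Sigma + Sigma A^T + 2D = 0 for
# A = -(Q + D) S, with Q skew-symmetric and D PSD built as in C.2/C.3.
rng = np.random.default_rng(1)
K = 4
Q_tilde = rng.standard_normal((K, K))
Q = Q_tilde - Q_tilde.T               # skew-symmetric
D_tilde = rng.standard_normal((K, K))
D = D_tilde @ D_tilde.T               # positive semi-definite
S_tilde = rng.standard_normal((K, K))
S = S_tilde @ S_tilde.T + K * np.eye(K)   # precision of the target Gaussian

A = -(Q + D) @ S
Sigma = np.linalg.inv(S)
# A Sigma = -(Q + D); Sigma A^T = -(Q + D)^T = Q - D; sum + 2D = 0 exactly.
residual = A @ Sigma + Sigma @ A.T + 2 * D
lyap_err = np.abs(residual).max()
```

The residual vanishes to floating-point roundoff regardless of the random draws, since the cancellation is algebraic rather than numerical.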

C.3 PARAMETERIZING D ϕ

Suppose $b_d(s)$ is a positive scalar function defined on the time domain with known integral, and $\tilde{D}_\phi$ is any matrix. Then $\tilde{D}_\phi\tilde{D}_\phi^\top$ is positive semi-definite, and this form spans all time-independent positive semi-definite matrices. We set
$$D_\phi(s) = b_d(s)\cdot \tilde{D}_\phi\tilde{D}_\phi^\top.$$
To show that $\tilde{D}\tilde{D}^\top$ spans all positive semi-definite matrices: suppose $M$ is positive semi-definite. Then it is square and can be eigen-decomposed as $M = V\Sigma V^\top$. Taking $R = V\sqrt{\Sigma}$ gives $M = RR^\top$, where the square root is taken element-wise because $\Sigma$ is diagonal, and is real because each $\Sigma_{ii} \ge 0$, which holds because $M$ is positive semi-definite. Take $\tilde{D} = R$. In our experiments we parameterize $\tilde{D}$ as a diagonal-only matrix.
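Both directions of the argument can be checked in a few lines (a sketch with made-up matrices): $\tilde{D}\tilde{D}^\top$ is PSD for any $\tilde{D}$, and the factor $R = V\sqrt{\Sigma}$ reconstructs an arbitrary PSD target.

```python
import numpy as np

# Forward direction: D_tilde @ D_tilde.T is PSD for any matrix.
rng = np.random.default_rng(6)
K = 4
D_tilde = rng.standard_normal((K, K))
Dmat = D_tilde @ D_tilde.T
eig_min = np.linalg.eigvalsh(Dmat).min()           # >= 0 up to roundoff

# Reverse direction: any PSD M factors as R R^T with R = V sqrt(Lambda).
B = rng.standard_normal((K, K))
M_psd = B @ B.T                                    # arbitrary PSD target
lam, V = np.linalg.eigh(M_psd)
R = V @ np.diag(np.sqrt(np.clip(lam, 0.0, None)))  # element-wise sqrt of eigenvalues
recon_err = np.abs(R @ R.T - M_psd).max()
```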

C.4 INTEGRALS

The known-integral requirement comes from the integrals required in the transition kernel, and can be relaxed in two possible ways:
• numerically integrate the function with unknown integral; this is expected to have low error given that the function is scalar-in, scalar-out;
• directly parameterize the integral and use auto-grad when the non-integrated functions are needed.
We stick with known integrals. In conclusion, the underlying parameters are positive scalar functions $b_q(s), b_d(s)$ defined on the time domain with known integrals, and general matrices $\tilde{Q}_\phi, \tilde{D}_\phi$.

C.5 INSTANCES

VPSDE. VPSDE has $K = 1$; consequently, $Q, D$ are $K \times K$. The only $1\times 1$ skew-symmetric matrix is $0$, so $Q = 0$. Setting $D(t) = \frac{1}{2}\beta(t)$ recovers VPSDE:
$$dy = -\tfrac{\beta(t)}{2}\,y\,dt + \sqrt{\beta(t)}\,dB_t,$$
with $\nabla H(y) = y$, so $H(y) = \frac{1}{2}\|y\|_2^2$. The stationary distribution is $\mathcal{N}(0, I)$.

CLD. The CLD process (eq. 5 in Dockhorn et al. (2021)) is defined as
$$dy_t = \begin{pmatrix} dz_t \\ dv_t \end{pmatrix} = \begin{pmatrix} 0 & \tfrac{\beta}{M} \\ -\beta & -\tfrac{\Gamma\beta}{M} \end{pmatrix} y_t\,dt + \begin{pmatrix} 0 & 0 \\ 0 & \sqrt{2\Gamma\beta} \end{pmatrix} dB_t.$$
In the $Q/D$ parameterization, we have
$$H(y) = \tfrac{1}{2}\|z\|_2^2 + \tfrac{1}{2M}\|v\|_2^2, \qquad \nabla H(y) = \begin{pmatrix} z \\ \tfrac{1}{M}v \end{pmatrix}, \qquad Q = \begin{pmatrix} 0 & -\beta \\ \beta & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0 \\ 0 & \Gamma\beta \end{pmatrix}.$$
The stationary distribution of this process is $q_{\phi,\infty} \propto \exp(-H(y)) = \mathcal{N}(z; 0, I_d)\,\mathcal{N}(v; 0, M I_d)$.

ALDA. Mou et al. (2019) define a third-order diffusion process for the purpose of gradient-based MCMC sampling. The ALDA diffusion process can be specified as
$$Q = \begin{pmatrix} 0 & -\tfrac{1}{L}I & 0 \\ \tfrac{1}{L}I & 0 & -\gamma I \\ 0 & \gamma I & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \tfrac{\xi}{L}I \end{pmatrix}.$$
Note that $Q$ is skew-symmetric and $D$ is positive semi-definite, therefore $q_t(u) \to q_{\phi,\infty}$. In this case, $q_{\phi,\infty} = \mathcal{N}(z; 0, I_d)\,\mathcal{N}(v_1; 0, \tfrac{1}{L}I_d)\,\mathcal{N}(v_2; 0, \tfrac{1}{L}I_d)$.

MALDA. Similar to ALDA, we specify a diffusion process we term MALDA:
$$Q = \begin{pmatrix} 0 & -\tfrac{1}{L}I & -\tfrac{1}{L}I \\ \tfrac{1}{L}I & 0 & -\gamma I \\ \tfrac{1}{L}I & \gamma I & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \tfrac{1}{L}I & 0 \\ 0 & 0 & \tfrac{1}{L}I \end{pmatrix}.$$
Note that $Q$ is skew-symmetric and $D$ is positive semi-definite. In this case, $q_{\phi,\infty} = \mathcal{N}(z; 0, I_d)\,\mathcal{N}(v_1; 0, \tfrac{1}{L}I_d)\,\mathcal{N}(v_2; 0, \tfrac{1}{L}I_d)$.
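The CLD instance above can be verified mechanically: plugging its $Q$, $D$, and $\nabla H = Sy$ into the recipe drift $-(Q+D)S$ and diffusion $\sqrt{2D}$ should reproduce the coefficients of eq. 5 in Dockhorn et al. (2021). A sketch per data dimension, with arbitrary made-up values for $\beta$, $\Gamma$, $M$:

```python
import numpy as np

# Recover the CLD drift and diffusion from its Q/D parameterization.
beta, Gamma, M = 4.0, 0.04, 0.25

Q = np.array([[0.0, -beta], [beta, 0.0]])          # skew-symmetric
D = np.array([[0.0, 0.0], [0.0, Gamma * beta]])    # PSD
S = np.diag([1.0, 1.0 / M])                        # grad H(y) = S y

drift = -(Q + D) @ S                               # recipe drift coefficient
g = np.sqrt(2 * D)                                 # element-wise sqrt (D diagonal)

expected_drift = np.array([[0.0, beta / M], [-beta, -Gamma * beta / M]])
expected_g = np.array([[0.0, 0.0], [0.0, np.sqrt(2 * Gamma * beta)]])
```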

D TRANSITIONS FOR LINEAR PROCESSES

For time variable $s$ and Brownian motion $B_s$ driving diffusions of the form $dy = f(y,s)\,ds + g(s)\,dB_s$, when $f_\phi(y_s,s), g_\phi(s)$ are linear, the transition kernel $q_\phi(y_s|y_0)$ is always normal (Särkkä & Solin, 2019). Therefore, we need only find the mean $m_{s|0}$ and covariance $\Sigma_{s|0}$ of $q(y_s|y_0)$. Let $f(y,s) = A(s)y$. The unconditional time-$s$ mean and covariance are solutions to
$$dm_s/ds = A(s)m_s, \qquad d\Sigma_s/ds = A(s)\Sigma_s + \Sigma_s A^\top(s) + g^2(s).$$
By (6.6) in Särkkä & Solin (2019), to compute conditionals $q(y_s|y_0)$ we can take the marginal-distribution ODEs and set the time-0 mean and covariance initial conditions to the conditioning value and to $0$, respectively. We take (6.36-6.39) and set $m_0 = y_0$ and $\Sigma_0 = 0$ to condition. Let $[A]_s = \int_0^s A(\nu)\,d\nu$ and $[g^2]_s = \int_0^s g^2(\nu)\,d\nu$, where $\exp$ denotes the matrix exponential, so that $m_{s|0} = \exp([A]_s)\,y_0$. (6.36-6.39) state the covariance of $q(y_s|y_0)$ as a matrix factorization $\Sigma_s = C_s(H_s)^{-1}$, for which a derivation is provided below, where $C_s, H_s$ are the solutions of
$$\frac{d}{ds}\begin{pmatrix} C_s \\ H_s \end{pmatrix} = \begin{pmatrix} A(s) & g^2(s) \\ 0 & -A^\top(s) \end{pmatrix}\begin{pmatrix} C_s \\ H_s \end{pmatrix}.$$
To condition and get $\Sigma_{s|0}$ from $\Sigma_s$, we set $\Sigma_0 = 0$ and initialize $C_s, H_s$ by $C_0 = 0$ and $H_0 = I$:
$$\begin{pmatrix} C_s \\ H_s \end{pmatrix} = \exp\begin{pmatrix} [A]_s & [g^2]_s \\ 0 & -[A^\top]_s \end{pmatrix}\begin{pmatrix} 0 \\ I \end{pmatrix} = \exp\left(s\begin{pmatrix} A & g^2 \\ 0 & -A^\top \end{pmatrix}\right)\begin{pmatrix} 0 \\ I \end{pmatrix} \quad \text{(no integration if } A(\nu)=A,\ g(\nu)=g\text{)}.$$
Finally, $\Sigma_{s|0} = C_s(H_s)^{-1}$.
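The matrix-exponential recipe can be checked against a case with a known closed form. For a scalar OU process $dy = -\theta y\,ds + \sigma\,dB_s$, the transition variance is $\mathrm{Var}(y_s|y_0) = \frac{\sigma^2}{2\theta}(1 - e^{-2\theta s})$. A sketch with made-up constants:

```python
import numpy as np
from scipy.linalg import expm

# Block-matrix-exponential transition covariance vs. the 1-d OU closed form.
theta, sigma, s = 1.3, 0.7, 0.9
A = np.array([[-theta]])
g2 = np.array([[sigma ** 2]])

block = np.block([[A, g2], [np.zeros((1, 1)), -A.T]])
CH = expm(s * block) @ np.array([[0.0], [1.0]])    # C_0 = 0, H_0 = I
C_s, H_s = CH[0, 0], CH[1, 0]
Sigma_s = C_s / H_s                                # Sigma_{s|0} = C_s H_s^{-1}

closed_form = sigma ** 2 / (2 * theta) * (1 - np.exp(-2 * theta * s))
```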

D.1 DERIVATION OF THE COVARIANCE MATRIX SOLUTION

Equation (35) gives an expression for $d\Sigma_s/ds$. To derive the matrix factorization technique used in eq. (37), we use eq. (35) and the desired condition $\Sigma_s = C_sH_s^{-1}$ to derive expressions for $dC_s/ds$ and $dH_s/ds$, with suitable initial conditions so that the factorization also starts at the desired $\Sigma_0$. Let $\Sigma_s = C_sH_s^{-1}$; then $C_s, H_s$ satisfy
$$\frac{d}{ds}\Sigma_s = \frac{d}{ds}\big(C_sH_s^{-1}\big) = C_s\frac{d}{ds}H_s^{-1} + \Big(\frac{d}{ds}C_s\Big)H_s^{-1}.$$
Using the fact that $\frac{d}{ds}\big(H_sH_s^{-1}\big) = 0$, i.e.
$$H_s\frac{d}{ds}H_s^{-1} + \Big(\frac{d}{ds}H_s\Big)H_s^{-1} = 0 \implies \frac{d}{ds}H_s^{-1} = -H_s^{-1}\Big(\frac{d}{ds}H_s\Big)H_s^{-1},$$
we get
$$-C_sH_s^{-1}\Big(\frac{d}{ds}H_s\Big)H_s^{-1} + \Big(\frac{d}{ds}C_s\Big)H_s^{-1} = A(s)C_sH_s^{-1} + C_sH_s^{-1}A^\top(s) + g^2(s).$$
Right-multiplying by $H_s$,
$$-C_sH_s^{-1}\frac{d}{ds}H_s + \frac{d}{ds}C_s = A(s)C_s + C_sH_s^{-1}A^\top(s)H_s + g^2(s)H_s.$$
Now we note that $C_s, H_s$ satisfying
$$\frac{d}{ds}H_s = -A^\top(s)H_s, \qquad \frac{d}{ds}C_s = A(s)C_s + g^2(s)H_s$$
solve this equation, which implies
$$\frac{d}{ds}\begin{pmatrix} C_s \\ H_s \end{pmatrix} = \begin{pmatrix} A(s) & g^2(s) \\ 0 & -A^\top(s) \end{pmatrix}\begin{pmatrix} C_s \\ H_s \end{pmatrix},$$
with $C_0 = \Sigma_0$ and $H_0 = I_d$, as $C_0H_0^{-1} = \Sigma_0$.
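The factorization can be validated end-to-end by finite differencing: $\Sigma_s = C_sH_s^{-1}$ computed from the block exponential should satisfy the covariance ODE $d\Sigma_s/ds = A\Sigma_s + \Sigma_s A^\top + g^2$. A sketch with an arbitrary constant-coefficient system:

```python
import numpy as np
from scipy.linalg import expm

# Check that Sigma(s) = C_s H_s^{-1} satisfies the covariance ODE.
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
G = np.diag([0.3, 0.7])                           # g^2, constant in s
B = np.block([[A, G], [np.zeros((2, 2)), -A.T]])
init = np.vstack([np.zeros((2, 2)), np.eye(2)])   # C_0 = 0, H_0 = I

def Sigma(s):
    CH = expm(s * B) @ init
    C, H = CH[:2], CH[2:]
    return C @ np.linalg.inv(H)

s, h = 0.8, 1e-5
lhs = (Sigma(s + h) - Sigma(s - h)) / (2 * h)     # numerical dSigma/ds
Sig = Sigma(s)
rhs = A @ Sig + Sig @ A.T + G                     # ODE right-hand side
fd_err = np.abs(lhs - rhs).max()
```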

D.2 HYBRID SCORE MATCHING

Instead of computing $q(y_s|y_0)$, we can apply the hybrid score matching principle (Dockhorn et al., 2021) to reduce variance by computing objectives using $q(y_s|x)$ instead of $q(y_s|y_0)$, which amounts to integrating out $v_0$. To accomplish this, following Särkkä & Solin (2019), we simply replace $y_0$ with $[x, \mathbb{E}[v_0]]$ in the expression for $m_{s|0}$, i.e. replace the conditioning value of $v_0$ with the mean of its chosen initial distribution:
$$\mathbb{E}[y_s|x] = \exp\Big(\int_0^s A(\nu)\,d\nu\Big)\begin{pmatrix} x \\ \mathbb{E}[v_0] \end{pmatrix}.$$
For the covariance, instead of using $C_0 = \Sigma_0 = 0$, we use a block matrix to condition on $x$ but not on $v_0$. We decompose $\Sigma_0$ into its blocks $\Sigma_{0,xx}, \Sigma_{0,vv}, \Sigma_{0,xv}$. As before, to condition on $x$ we set $\Sigma_{0,xx} = 0$. Because $q(v_0)$ is set to be independent of $x$, $\Sigma_{0,xv}$ is also set to $0$. Finally, instead of $0$, to marginalize out $v_0$, $\Sigma_{0,vv}$ is set to the covariance of the chosen initial time-zero distribution for $v_0$; e.g. if $v_{0,j} \sim \mathcal{N}(0,\gamma)$ for each dimension, then $\Sigma_{0,vv} = \gamma I$. We operationalize this in a simple piece of code, which makes the ELBO tractable and easy, i.e. skips both analytic derivations and numerical forward integration during training.
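The block initialization can be checked against the analytic marginalization identity for a time-invariant linear SDE: seeding the factorization with $C_0 = \Sigma_0$ must give $\Phi_s\Sigma_0\Phi_s^\top + \Sigma_{s|0}$ for $\Phi_s = e^{sA}$. A sketch with a made-up two-variable (data, auxiliary) system:

```python
import numpy as np
from scipy.linalg import expm

# Hybrid initialization: C_0 = Sigma_0 (zero x-block, gamma for v) should
# equal propagating the initial covariance plus the conditional covariance.
gamma, s = 0.04, 0.6
A = np.array([[0.0, 4.0], [-4.0, -0.4]])          # arbitrary (x, v) drift
G = np.diag([0.0, 0.32])                          # g^2: noise on v only
B = np.block([[A, G], [np.zeros((2, 2)), -A.T]])

Sigma0 = np.diag([0.0, gamma])                    # condition on x, marginalize v_0
CH = expm(s * B) @ np.vstack([Sigma0, np.eye(2)])
Sigma_hybrid = CH[:2] @ np.linalg.inv(CH[2:])

# Reference: conditional covariance (C_0 = 0) plus propagated Sigma_0.
CH0 = expm(s * B) @ np.vstack([np.zeros((2, 2)), np.eye(2)])
Sigma_cond = CH0[:2] @ np.linalg.inv(CH0[2:])
Phi = expm(s * A)
Sigma_ref = Phi @ Sigma0 @ Phi.T + Sigma_cond
hyb_err = np.abs(Sigma_hybrid - Sigma_ref).max()
```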

D.3 TRANSITIONS IN STATIONARY PARAMETERIZATION

In terms of $Q, D$, the transitions $q(y_s|y_0)$ for time $s$ are normal with mean $m_{s|0}$ and covariance $\Sigma_{s|0}$ equal to
$$m_{s|0} = \exp\big(-[Q+D]_s\big)\,y_0, \qquad \begin{pmatrix} C_s \\ H_s \end{pmatrix} = \exp\begin{pmatrix} -[Q+D]_s & [2D]_s \\ 0 & [(Q+D)^\top]_s \end{pmatrix}\begin{pmatrix} 0 \\ I \end{pmatrix},$$
where $\Sigma_{s|0} = C_s(H_s)^{-1}$. For the time-invariant case, this simplifies to
$$m_{s|0} = \exp[-s(Q+D)]\,y_0, \qquad \begin{pmatrix} C_s \\ H_s \end{pmatrix} = \exp\left(s\begin{pmatrix} -(Q+D) & 2D \\ 0 & (Q+D)^\top \end{pmatrix}\right)\begin{pmatrix} 0 \\ I \end{pmatrix}.$$

E GENERIC CHANGE OF MEASURE AND JENSEN'S FOR APPROXIMATE MARGINALIZATION

Suppose $u = [z, v]$ and we have an expression for $p(u = [z,v]) = p(z = z, v = v)$. By marginalization, we can get $p(z = z)$, and we can introduce another distribution $q$ to pick a sampling distribution of our choice:
$$p(z=z) = \int_v p(z=z, v=v)\,dv = \int_v p(z=z|v=v)\,p(v=v)\,dv = \int_v \frac{q(v=v|z=z)}{q(v=v|z=z)}\,p(z=z|v=v)\,p(v=v)\,dv = \mathbb{E}_{q(v=v|z=z)}\left[\frac{p(z=z, v=v)}{q(v=v|z=z)}\right].$$
We often work with these expressions in log space, and need to pull the expectation outside to use Monte Carlo. Jensen's inequality allows this:
$$\log p(z=z) = \log \mathbb{E}_{q(v=v|z=z)}\left[\frac{p(z=z,v=v)}{q(v=v|z=z)}\right] \ge \mathbb{E}_{q(v=v|z=z)}\left[\log\frac{p(z=z,v=v)}{q(v=v|z=z)}\right].$$
The following shows that the bound is tight when $q(v=v|z=z) = p(v=v|z=z)$:
$$\mathbb{E}_{q(v|z)}\left[\log\frac{p(z,v)}{q(v|z)}\right] \overset{\text{assume}}{=} \mathbb{E}_{p(v|z)}\left[\log\frac{p(z,v)}{p(v|z)}\right] = \mathbb{E}_{p(v|z)}\left[\log\frac{p(z,v)}{p(v,z)}\cdot p(z)\right] = \mathbb{E}_{p(v|z)}\big[\log p(z)\big] = \log p(z).$$

F ELBO FOR MDMs

$$\begin{aligned} \log p_\theta(x) &= \log \int_{v_0} p_\theta(x_0, v_0)\,dv_0 & (45) \\ &= \log \int_{v_0} p_\theta(u_0 = [x, v_0]) & (46) \\ &= \log \int_{v_0} \frac{q(v_0|x)}{q(v_0|x)}\,p_\theta(u_0 = [x, v_0]) & (47) \\ &= \log \mathbb{E}_{q(v_0|x)}\left[\frac{p_\theta(u_0=[x,v_0])}{q(v_0|x)}\right] & (48) \\ &\ge \mathbb{E}_{q(v_0|x)}\big[\log p_\theta(u_0=[x,v_0]) - \log q(v_0|x)\big] & (49) \\ &\ge \mathbb{E}_{q(y|x)}\left[\log \pi_\theta(y_T) + \int_0^T \Big(-\tfrac{1}{2}\|s_\theta\|^2_{g^2} - \nabla\cdot(g^2 s_\theta - f)\Big)ds - \log q(v_0|x)\right]. \end{aligned}$$
The first inequality holds due to Jensen's inequality and the second due to an application of Theorem 1 from Huang et al. (2021) or Theorem 3 from Song et al. (2021) applied to the joint variable $u_0$.

F.1.1 INTEGRATION BY PARTS

We will need a form of multivariate integration by parts which gives, for some $f$ and some $q(x)$, $\mathbb{E}_{q(x)}[\nabla_x \cdot f(x)] = -\mathbb{E}_{q(x)}[f(x)^\top \nabla_x \log q(x)]$:
$$\begin{aligned} \mathbb{E}_{q(x)}[\nabla_x\cdot f(x)] &= \int q(x)\sum_{i=1}^d \nabla_{x_i}f_i(x)\,dx = \sum_{i=1}^d \int_{x_{-i}}\int_{x_i} q(x)\,\nabla_{x_i}f_i(x)\,dx_i\,dx_{-i} \\ &= \sum_{i=1}^d \int_{x_{-i}}\Big(\big[q(x)f_i(x)\big]_{-\infty}^{\infty} - \int_{x_i}\nabla_{x_i}q(x)\,f_i(x)\,dx_i\Big)dx_{-i} \\ &= \sum_{i=1}^d \int_{x_{-i}}\int_{x_i} -q(x)\,\nabla_{x_i}\log q(x)\,f_i(x)\,dx_i\,dx_{-i} \\ &= \sum_{i=1}^d -\mathbb{E}_{q(x)}\big[\nabla_{x_i}\log q(x)\,f_i(x)\big] = -\mathbb{E}_{q(x)}\big[f(x)^\top\nabla_x\log q(x)\big], \end{aligned}$$
where the boundary term vanishes under the standard assumption that $q(x)f_i(x) \to 0$ as $|x_i|\to\infty$. This equality also follows directly from the Stein operator obtained by applying the generator method to the Langevin diffusion (Barbour, 1988).
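The integration-by-parts identity can be spot-checked by Monte Carlo with a made-up linear vector field: for $q = \mathcal{N}(0, I)$ and $f(x) = Wx$, the left-hand side is $\mathrm{tr}(W)$ exactly, while $\nabla_x \log q(x) = -x$:

```python
import numpy as np

# MC check of E_q[div f] = -E_q[f^T grad log q] for q = N(0, I), f(x) = W x.
rng = np.random.default_rng(2)
d, n = 3, 400_000
W = rng.standard_normal((d, d))

x = rng.standard_normal((n, d))
div_f = np.trace(W)                                # exact left-hand side
rhs = -np.mean(np.sum((x @ W.T) * (-x), axis=1))   # MC right-hand side
ibp_gap = abs(div_f - rhs)
```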

F.1.2 DSM ELBO

Using the "expectation by parts", we have
$$\mathbb{E}_{q(u_t|x)}\big[\nabla_{u_t}\cdot\big(g^2(t)s_\theta(u_t,t)\big)\big] = -\mathbb{E}_{q(u_t|x)}\big[\big(g^2(t)s_\theta(u_t,t)\big)^\top\nabla_{u_t}\log q(u_t|x)\big].$$
Also, for $s_\theta$ evaluated at $(u_t, t)$, completing the square gives
$$-\tfrac{1}{2}\|s_\theta\|^2_{g^2(t)} + s_\theta^\top g^2(t)\nabla\log q(u_t|x) = -\tfrac{1}{2}\|s_\theta - \nabla\log q(u_t|x)\|^2_{g^2(t)} + \tfrac{1}{2}\|\nabla\log q(u_t|x)\|^2_{g^2(t)}.$$
The two together give us
$$\begin{aligned} \log p(x) &\ge \mathbb{E}_{q(u_T|x)}[\log\pi] + \int_0^T \mathbb{E}_{q(u_t|x)}\Big[-\nabla\cdot\big(g^2 s_\theta\big) - \tfrac{1}{2}\|s_\theta\|^2_{g^2(t)} + \nabla\cdot f\Big]\,dt \\ &= \mathbb{E}_{q(u_T|x)}[\log\pi] + \int_0^T \mathbb{E}_{q(u_t|x)}\Big[\big(g^2 s_\theta\big)^\top\nabla_{u_t}\log q(u_t|x) - \tfrac{1}{2}\|s_\theta\|^2_{g^2(t)} + \nabla\cdot f\Big]\,dt \\ &= \mathbb{E}_{q(u_T|x)}[\log\pi] + \int_0^T \mathbb{E}_{q(u_t|x)}\Big[-\tfrac{1}{2}\|s_\theta - \nabla\log q(u_t|x)\|^2_{g^2(t)} + \tfrac{1}{2}\|\nabla\log q(u_t|x)\|^2_{g^2(t)} + \nabla_{u_t}\cdot f\Big]\,dt. \end{aligned}$$
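The completing-the-square step is a purely algebraic identity and can be confirmed numerically with arbitrary vectors and an arbitrary PSD weighting in place of $g^2(t)$ (a sketch; none of these values come from the paper):

```python
import numpy as np

# Check -1/2 ||s||^2_G + s^T G w = -1/2 ||s - w||^2_G + 1/2 ||w||^2_G,
# with ||a||^2_G = a^T G a for a symmetric PSD G.
rng = np.random.default_rng(3)
d = 5
R = rng.standard_normal((d, d))
G = R @ R.T                                   # PSD weight, role of g^2(t)
s_vec = rng.standard_normal(d)                # stand-in for the model score
w = rng.standard_normal(d)                    # stand-in for grad log q(u_t | x)

sq = lambda a: a @ G @ a
lhs = -0.5 * sq(s_vec) + s_vec @ G @ w
rhs = -0.5 * sq(s_vec - w) + 0.5 * sq(w)
cs_gap = abs(lhs - rhs)
```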

F.2 NOISE PREDICTION

We have that for the normal $\mathcal{N}(y_s; m_{s|0}, \Sigma_{s|0})$, we can sample $y_s$ with normal noise $\epsilon \sim \mathcal{N}(0, I)$ via $y_s = m_{s|0} + L\epsilon$, where $L$ is the Cholesky factor of $\Sigma_{s|0}$. Then the score is
$$\nabla_{y_s}\log q(y_s|y_0)\Big|_{y_s = m_{s|0}+L\epsilon} = -\Sigma^{-1}_{s|0}\big(y_s - m_{s|0}\big) = -\Sigma^{-1}_{s|0}L\epsilon = -\big(LL^\top\big)^{-1}L\epsilon = -L^{-\top}L^{-1}L\epsilon = -L^{-\top}\epsilon.$$
Parameterize $s_\theta(y_s, s) = -L^{-\top}\epsilon_\theta(y_s, s)$. This gives
$$\tfrac{1}{2}\big\|{-L^{-\top}}\epsilon_\theta(y_s,s) - \big({-L^{-\top}}\epsilon\big)\big\|^2_{g^2_\phi(s)} = \tfrac{1}{2}\big\|L^{-\top}\big(\epsilon - \epsilon_\theta(y_s,s)\big)\big\|^2_{g^2_\phi(s)} = \tfrac{1}{2}\Big(L^{-\top}\big(\epsilon - \epsilon_\theta(y_s,s)\big)\Big)^\top g^2_\phi(s)\Big(L^{-\top}\big(\epsilon - \epsilon_\theta(y_s,s)\big)\Big).$$
We can also use this insight to analytically compute the quadratic score term (the following is computed per data dimension, so it must be multiplied by $D$ when computing the ELBO):
$$\mathbb{E}_{y_0}\mathbb{E}_{y_s|y_0}\Big[\tfrac{1}{2}\big\|\nabla_{y_s}\log q_\phi(y_s|y_0)\big\|^2_{g^2_\phi(s)}\Big] = \tfrac{1}{2}\,\mathbb{E}_\epsilon\Big[\big({-L^{-\top}}\epsilon\big)^\top g^2_\phi(s)\big({-L^{-\top}}\epsilon\big)\Big] = \tfrac{1}{2}\,\mathbb{E}_\epsilon\Big[\epsilon^\top L^{-1}g^2_\phi(s)L^{-\top}\epsilon\Big] = \tfrac{1}{2}\,\mathrm{Trace}\Big(L^{-1}g^2_\phi(s)L^{-\top}\Big).$$

G ELBOS IN STATIONARY PARAMETERIZATION

We use the stationary parameterization described in appendix C and now specialize the ELBO to the linear stationary parameterization. Recall $f_\phi(y,s) = -[Q_\phi(s)+D_\phi(s)]y$ and $g_\phi(s) = \sqrt{2D_\phi(s)}$, so $g^2_\phi(s) = 2D_\phi(s)$. We can write the MDM ISM ELBO as
$$\mathcal{L}_{\text{mism}} = \mathbb{E}_{v\sim q_\gamma}\,\mathbb{E}_{s\sim\text{Unif}(0,T)}\Big[\ell^{(\text{ism})}_s + \ell_T + \ell_q\Big] \quad (52)$$
where we set $y_0 = [x, 0_1, \dots, 0_{K-1}]$, $\Sigma_{0,zz} = 0$, and $\Sigma_{0,zv}, \Sigma_{0,vv}$ to the chosen initial distribution; compute the mean $m_{s|0} = \gamma_{s|0}\,y_0$; compute
$$\ell_{s_\theta} = -\tfrac{1}{2}\big\|s_\theta(y_s,s)\big\|^2_{\underbrace{2D_\phi(s)}_{g^2_\phi}}, \qquad \ell_{\text{div-fgs}} = \nabla_{y_s}\cdot\Big(\underbrace{-[Q_\phi(s)+D_\phi(s)]\,y_s}_{f_\phi} - 2D_\phi(s)\,s_\theta\Big);$$
compute
$$\begin{pmatrix} C_s \\ H_s \end{pmatrix} = \exp\begin{pmatrix} M_s & N_s \\ 0 & -M_s^\top \end{pmatrix}\begin{pmatrix} \Sigma_0 \\ I \end{pmatrix};$$
and compute $\Sigma_{s|0} = C_s(H_s)^{-1}$ (transition cov.).
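The two noise-prediction identities in F.2 (the reparameterized score $-L^{-\top}\epsilon$ and the trace form of the quadratic score term) can be checked directly, the first exactly and the second by Monte Carlo, with made-up matrices:

```python
import numpy as np

# Check: score of N(m, Sigma) at y = m + L eps equals -L^{-T} eps, and
# E_eps[eps^T L^{-1} G L^{-T} eps] = trace(L^{-1} G L^{-T}).
rng = np.random.default_rng(4)
d = 4
R = rng.standard_normal((d, d))
Sigma = R @ R.T + d * np.eye(d)
L = np.linalg.cholesky(Sigma)
m = rng.standard_normal(d)
eps = rng.standard_normal(d)

y = m + L @ eps
score_direct = -np.linalg.solve(Sigma, y - m)      # -Sigma^{-1}(y - m)
score_noise = -np.linalg.solve(L.T, eps)           # -L^{-T} eps
score_gap = np.abs(score_direct - score_noise).max()

G = np.diag(rng.uniform(0.5, 2.0, size=d))         # stand-in for g^2_phi(s)
Linv = np.linalg.inv(L)
E = rng.standard_normal((200_000, d))
U = E @ Linv                                       # rows are (L^{-T} eps_n)^T
quad_mc = np.einsum('ni,ij,nj->n', U, G, U).mean()
quad_exact = np.trace(Linv @ G @ Linv.T)
trace_gap = abs(quad_mc - quad_exact)
```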
instantiate: $q_{\phi,s,(x,v)} = q_\phi(y_s|y_0) = \mathcal{N}(m_{s|0}, \Sigma_{s|0})$
Output: $q_{\phi,s,(x,v)}$, $A_s$, $g^2_s$

Algorithm 5 Compute ELBO with ism or dsm
input: data point $x$ and current params $\theta, \phi, \gamma$
draw: an aux. sample $v \sim q_\gamma(v|x)$
draw: a sample $s \sim \text{Unif}(0,T)$
set: $y_0 = (x, v)$
set: $q_{\phi,s,y_0}, A_s, g^2_s \leftarrow$ algorithm 4 called on $y_0, s, \phi$
draw: $y_s \sim q_{\phi,s,y_0}$
compute: $\ell_s$ with dsm(s) (algorithm 6) or ism(s) (algorithm 7) on $y_s, \theta, A_s, g^2_s, q_{\phi,s,y_0}$
set: $q_{\phi,T,y_0}, \_, \_ \leftarrow$ algorithm 4 called on $y_0, T, \phi$
draw: $y_T \sim q_{\phi,T,y_0}$
output: $\ell_s + \log\pi_\theta(y_T) - \log q_\gamma(v)$

Algorithm 6 Compute dsm(s)
input: $y_s, \theta, A_s, g^2_s, q_{\phi,s,y_0}$
compute: fwd-score $= \nabla_{y_s}\log q_\phi(y_s|y_0)$
compute: model-score $= s_\theta(y_s, s)$
compute: fwd-score-term $= \frac{1}{2}\,(\text{fwd-score})^\top g^2_s\,(\text{fwd-score})$
compute: score-diff $= \text{model-score} - \text{fwd-score}$
compute: diff-term $= -\frac{1}{2}\,(\text{score-diff})^\top g^2_s\,(\text{score-diff})$
compute: div-f $= \nabla_{y_s}\cdot(A_s y_s)$
output: dsm(s) $=$ fwd-score-term $+$ diff-term $+$ div-f

Algorithm 7 Compute ism(s)
input: $y_s, \theta, A_s, g^2_s, q_{\phi,s,y_0}$
compute: model-score $= s_\theta(y_s, s)$
compute: score-term $= -\frac{1}{2}\,(\text{model-score})^\top g^2_s\,(\text{model-score})$
compute: div-gs $= \nabla_{y_s}\cdot\big(g^2_s\, s_\theta(y_s,s)\big)$
compute: div-f $= \nabla_{y_s}\cdot(A_s y_s)$
compute: div-term $= -\text{div-gs} + \text{div-f}$
output: ism(s) $=$ score-term $+$ div-term
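The transition-distribution pipeline (Algorithms 3-4) can be sketched compactly for the special case of constant schedules $b_q = b_d = 1$, where the known integrals are just $s$. This is a simplified illustration with made-up parameter matrices, not the paper's implementation; with $H(y) = \|y\|^2/2$ the stationary distribution is $\mathcal{N}(0, I)$, so the transition should approach mean $0$ and identity covariance for large $s$:

```python
import numpy as np
from scipy.linalg import expm

# Minimal Algorithm 3+4 sketch: constant schedules, K = 2, S = I.
Q_tilde = np.array([[0.0, 1.0], [0.0, 0.0]])   # arbitrary parameter matrix
D_tilde = np.diag([1.0, 0.8])                  # diagonal, as in the experiments

def get_transition(y0, s):
    Q = Q_tilde - Q_tilde.T                    # skew-symmetric (Algorithm 3)
    D = D_tilde @ D_tilde.T                    # positive semi-definite (Algorithm 3)
    A = -(Q + D)                               # drift coefficient
    m = expm(s * A) @ y0                       # transition mean (Algorithm 4)
    block = np.block([[A, 2 * D], [np.zeros((2, 2)), -A.T]])
    CH = expm(s * block) @ np.vstack([np.zeros((2, 2)), np.eye(2)])
    Sigma = CH[:2] @ np.linalg.inv(CH[2:])     # transition covariance
    return m, Sigma

m, Sigma = get_transition(np.array([3.0, -2.0]), 20.0)
mean_err = np.abs(m).max()                     # should decay toward 0
cov_err = np.abs(Sigma - np.eye(2)).max()      # should approach stationary I
```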

I VALID ELBO WITH TRUNCATION

The integrand in the ELBO and its gradients are not bounded at time 0. Therefore, following Sohl-Dickstein et al. (2015) and Song et al. (2021), the integrand in eq. (7) is integrated over $[\epsilon, T]$ rather than $[0, T]$. However, that integral is not a valid lower bound on $\log p_\theta(x)$. Instead, it can be viewed as a proper lower bound on the prior for a latent variable $y_\epsilon$. Therefore, to provide a bound for the data, one can introduce a likelihood and substitute the prior lower bound into a standard variational bound that integrates out the latent.



Following Huang et al. (2021); Dockhorn et al. (2021), we integrate all processes in forward time from 0 to $T$. It may be helpful to think of an additional variable $\bar{x}_t \triangleq z_{T-t}$ so that $\bar{x}_0$ approximates $x \sim q_{\text{data}}$. We use $y$ as the inference variable over the same space as the model's $z$. Some processes, such as sub-VPSDE (Song et al., 2020b), are covered in the sense that they tend to members of this parameterization as $T$ grows: sub-VP converges to VPSDE.



We combine the above into an algorithm called Automatic Multivariate Diffusion Training (AMDT) that enables training without diffusion-specific derivations. AMDT enables training score models for any linear diffusion, including optimizing the diffusion and score jointly.

both with the 108 million parameter score model from Song et al. (2021) (labeled "large"). The rest are DBGMs that we train using the U-Net with 35.7 million parameters for CIFAR-10 and IMAGENET32 and 1.1 million for MNIST. Despite using significantly fewer parameters, the learned diffusion achieves similar BPD compared to the larger models, showing that changes in inference can account for as much improvement as a three-fold increase in parameters. While the larger architecture requires two A100 GPUs for batch size 128 on CIFAR-10, the smaller one requires only one; exploring inference processes can make diffusions more computationally accessible. Table 3 also demonstrates a tighter bound for CLD trained and evaluated with the MDM ELBO (≤ 3.11) relative to existing probability-flow-based evaluations (3.31).

Figure 1: CIFAR10 samples generated from the "learned 2" and MALDA generative models.


Here $q_{\phi,s,(x,v)}$ depends on $Q, D$, as do $\ell_{s_\theta} + \ell_{\text{div-fgs}}$, and
$$\ell_T = \mathbb{E}_{q_{\phi,T,(x,v)}}\big[\log\pi_\theta(y_T)\big], \qquad \ell_q = -\log q_\gamma(v|x), \quad (53)$$
where $q_{\phi,T,(x,v)}$ depends on $Q, D$. For the DSM form,
$$\mathcal{L}_{\text{mdsm}} = \mathbb{E}_{v\sim q_\gamma}\,\mathbb{E}_{s\sim\text{Unif}(0,T)}\Big[\ell^{(\text{dsm})}_s + \ell_T + \ell_q\Big] \quad (54)$$
where
$$\ell_{\text{div-f}} = \nabla_{y_s}\cdot\Big(\underbrace{-[Q_\phi(s)+D_\phi(s)]\,y_s}_{f_\phi}\Big), \qquad \ell_{\text{fwd-score}} = \tfrac{1}{2}\big\|\nabla_{y_s}\log q_\phi(y_s|y_0)\big\|^2_{2D_\phi(s)}, \qquad \ell_{\text{neg-scorediff}} = -\tfrac{1}{2}\big\|s_\theta(y_s, s) - \nabla_{y_s}\log q_\phi(y_s|y_0)\big\|^2_{2D_\phi(s)},$$
and $\ell^{(\text{dsm})}_s = \mathbb{E}_{q_{\phi,s,(x,v)}}\big[\ell_{\text{neg-scorediff}} + \ell_{\text{fwd-score}} + \ell_{\text{div-f}}\big]$, where $q_{\phi,s,(x,v)}$ depends on $Q, D$.

Get transition distribution $y_s|x$
Input: data $x$, time $s$, $A$, $g$.
compute: $A(s)$ and $g(s)$
compute: $M_s = \int_0^s A(t)\,dt$ (integrated drift)
compute: $N_s = \int_0^s g^2(t)\,dt$ (integrated squared diffusion)
compute: $\gamma_{s|0} = \exp(M_s)$ (mean coefficient)

compute: $\Sigma_{s|0} = C_s(H_s)^{-1}$ (cov.)
Output: $\mathcal{N}(m_{s|0}, \Sigma_{s|0})$

H.2 TRANSITIONS WITH Q, D

The current parameter matrices are $\tilde{Q}_\phi, \tilde{D}_\phi$, along with fixed time-in scalar-out functions $b_q(s), b_d(s)$ and their known integrals $B_q(s), B_d(s)$. $q_\gamma(v_0|z_0 = x)$ is taken to be parameterless so that $v_0 \sim \mathcal{N}(0, I)$. Model params are $s_\theta$ and the fixed $\pi_\theta$.

Algorithm 3 Get Q, D and their integrated terms M, N
Input: time $s$ and current params $\phi$
compute: $[b_q]_s = \int_0^s b_q(\nu)\,d\nu$ using the known integral $B_q(s) - B_q(0)$
compute: $[b_d]_s = \int_0^s b_d(\nu)\,d\nu$ using the known integral $B_d(s) - B_d(0)$
compute: $[Q_\phi]_s = [b_q]_s \cdot \big(\tilde{Q}_\phi - \tilde{Q}_\phi^\top\big)$ for current params $\tilde{Q}_\phi$
compute: $[D_\phi]_s = [b_d]_s \cdot \tilde{D}_\phi\tilde{D}_\phi^\top$ for current params $\tilde{D}_\phi$
compute: $M_s = -\big([Q_\phi]_s + [D_\phi]_s\big)$ ($M$ just a variable name)
compute: $N_s = [2D_\phi]_s = 2\cdot[D_\phi]_s$ ($N$ just a variable name)
compute: $Q_s = b_q(s)\cdot\big(\tilde{Q}_\phi - \tilde{Q}_\phi^\top\big)$ (not integrated)
compute: $D_s = b_d(s)\cdot\tilde{D}_\phi\tilde{D}_\phi^\top$ (not integrated)
compute: $A_s = -[Q_s + D_s]$ (drift coef.)
compute: $g^2_s = 2D_s$ (diffusion coef. squared)
Output: $A_s, g^2_s, M_s, N_s$

H.3 ELBO ALGORITHMS

Algorithm 4 Get transition distributions
Input: sample $y_0 = (x, v)$ and time $s$; current params $\phi$
set: $A_s, g^2_s, M_s, N_s \leftarrow$ algorithm 3
compute: $m_{s|0} = \exp(M_s)\,y_0$ (transition mean)
compute: ingredients for the transition covariance matrix:

Runtime Comparison: we compare the run time of sampling from the CLD diffusion analytically versus using the automated algorithm.

BPD upper-bounds on image generation for a fixed architecture. CIFAR-10: learning outperforms CLD, and both outperform the standard choice of VPSDE. MNIST: learning matches VPSDE while the fixed auxiliary diffusions are worse. IMAGENET32: all perform similarly. Learning matches or surpasses the best fixed diffusion, while bypassing the need to choose a diffusion.

Parameter Efficiency. The first two rows display diffusions from previous work: VPSDE and CLD, both using score models with 108 million parameters on CIFAR-10. We train the rest using a score model with 35.7 million parameters. The learned diffusion matches the performance of VPSDE-large; changes in the inference can account for as much improvement as a 3x increase in score parameters. BPDs are upper bounds.

Following prior work, we train DBGMs for image generation. We use the U-Net from Ho et al.


(Kingma & Welling, 2013; Sohl-Dickstein et al., 2015; Kingma et al., 2021). Kingma et al. (2021) learn the noise schedule for VPSDE. For MDMs, there are parameters to learn beyond the noise schedule: $Q$ can be non-zero, $D$ can be diagonal or full, $Q$ and $D$ can be given different time-varying functions, and $\nabla H$ can be learned.

Yin and P. Ao. Existence and construction of dynamical potential in nonequilibrium processes without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593, 2006.



8. ACKNOWLEDGEMENTS

This work was generously funded by NIH/NHLBI Award R01HL148248, NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science, and NSF CAREER Award 2145542. The authors would additionally like to thank Chin-Wei Huang for helpful discussions regarding Huang et al. (2021).

Published as a conference paper at ICLR 2023

To provide a valid lower bound for multivariate diffusions, we extend Theorem 6 in Song et al. (2021) from univariate to multivariate diffusions.

Theorem 3. For transition kernel $q_\phi(y_s \mid y_0)$, we can bound the model likelihood at time 0 as stated below, for any $\epsilon > 0$, where $\mathcal{L}_{\text{mdm}}(y_\epsilon, \epsilon)$ is the truncated ELBO.

Proof. For transition kernel $q_\phi(y_s \mid y_0)$, we can bound the model likelihood at time 0 via an application of the variational bound. A lower bound for $\log p_\theta(y_\epsilon)$ can be derived in a similar manner to eq. (7). The choice of $p_\theta(y_0 \mid y_\epsilon)$ is arbitrary; however, following Sohl-Dickstein et al. (2015); Song et al. (2021), we let $p_\theta(y_0 \mid y_\epsilon)$ be Gaussian with mean $\mu_{p_\theta,\epsilon}$ and covariance $\Sigma_{p_\theta,\epsilon}$. Suppose $q_\phi(y_\epsilon \mid y_0) = \mathcal{N}(y_\epsilon \mid Ay_0, \Sigma)$; then we select a mean $\mu_{p_\theta,\epsilon}$ and covariance $\Sigma_{p_\theta,\epsilon}$ for $p_\theta(y_0 \mid y_\epsilon)$ derived using Tweedie's formula (Efron, 2011). We next derive this choice as an approximation of the optimal Gaussian likelihood.

I.1 LIKELIHOOD DERIVATION

Suppose $y_0 \sim q_0(y_0)$ and $y_\epsilon \sim \mathcal{N}(y_\epsilon \mid Ay_0, \Sigma)$. Here, $A, \Sigma$ are the mean coefficient and covariance derived from the transition kernel at time $\epsilon$. We use Tweedie's formula to get the mean and covariance of $y_0$ given $y_\epsilon$ under $q$. This mean and covariance feature the true score $\nabla_{y_\epsilon}\log q(y_\epsilon)$. We replace the score with the score model $s_\theta$ and then set $p_\theta(y_0|y_\epsilon)$ to have the resulting approximate mean and covariance. We make this choice because the optimal $p_\theta(y_0|y_\epsilon)$ equals the true $q(y_0|y_\epsilon)$, as discussed throughout the work.

Let $\eta$ be the natural parameter for the multivariate Gaussian likelihood $\mathcal{N}(y_\epsilon \mid Ay_0, \Sigma)$. Then Tweedie's formula (Efron, 2011) gives the posterior mean of $\eta$ in terms of the following quantities:
• $l(y_\epsilon) = \log q(y_\epsilon)$;
• $s_\theta(y_\epsilon, \epsilon)$ is taken to be the true score $\nabla_{y_\epsilon}\log q(y_\epsilon)$, so that $\nabla_{y_\epsilon} l(y_\epsilon) = s_\theta(y_\epsilon, \epsilon)$;
• $l_0$ is the log of the base distribution defined in the exponential-family parameterization; the base distribution is a multivariate Gaussian with mean $0$ and covariance $\Sigma$, therefore $\nabla_{y_\epsilon} l_0(y_\epsilon) = -\Sigma^{-1}y_\epsilon$.
However, Tweedie's formula is not directly applicable since our $y_\epsilon$ is not directly normal with mean $y_0$. Instead, to derive the conditional mean of $y_0$ given $y_\epsilon$, we use the relation $\eta = \Sigma^{-1}Ay_0$ and the linearity of conditional expectation. For the variance, we use the relation $y_\epsilon = Ay_0 + \sqrt{\Sigma}\,\epsilon$. Therefore, for the model posterior distribution $p_\theta(y_0 \mid y_\epsilon)$, we choose a Normal with mean and covariance
$$\mu_{p_\theta,\epsilon} = A^{-1}\Sigma\, s_\theta(y_\epsilon, \epsilon) + A^{-1}y_\epsilon, \qquad \Sigma_{p_\theta,\epsilon} = A^{-1}\Sigma A^{-\top}.$$
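When the score is exact and $q_0$ is itself Gaussian, the Tweedie-based mean coincides with the exact Gaussian posterior mean, which gives a concrete check (a sketch with made-up matrices): for $y_0 \sim \mathcal{N}(0, \Sigma_0)$ and $y_\epsilon \mid y_0 \sim \mathcal{N}(Ay_0, \Sigma)$, the marginal score is $-M^{-1}y_\epsilon$ with $M = A\Sigma_0 A^\top + \Sigma$, and $A^{-1}\Sigma\,s(y_\epsilon) + A^{-1}y_\epsilon$ should equal the standard conditioning formula $\Sigma_0 A^\top M^{-1} y_\epsilon$.

```python
import numpy as np

# Tweedie-based posterior mean vs. exact Gaussian conditioning.
rng = np.random.default_rng(5)
d = 3
A = rng.standard_normal((d, d)) + 3 * np.eye(d)    # invertible mean coefficient
R0 = rng.standard_normal((d, d))
Sigma0 = R0 @ R0.T + np.eye(d)                     # prior covariance of y_0
R = rng.standard_normal((d, d))
Sigma = R @ R.T + np.eye(d)                        # transition covariance

M = A @ Sigma0 @ A.T + Sigma                       # marginal covariance of y_eps
y_eps = rng.standard_normal(d)
score = -np.linalg.solve(M, y_eps)                 # exact marginal score

mu_tweedie = np.linalg.solve(A, Sigma @ score + y_eps)
mu_exact = Sigma0 @ A.T @ np.linalg.solve(M, y_eps)
tweedie_gap = np.abs(mu_tweedie - mu_exact).max()
```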

