UNDERSTANDING DDPM LATENT CODES THROUGH OPTIMAL TRANSPORT

Abstract

Diffusion models have recently outperformed alternative approaches to modeling the distribution of natural images. Such diffusion models allow for deterministic sampling via the probability flow ODE, giving rise to a latent space and an encoder map. While this map has important practical applications, such as likelihood estimation, its theoretical properties are not yet fully understood. In the present work, we partially address this question for the popular case of the VP-SDE (DDPM) approach. We show that, perhaps surprisingly, the DDPM encoder map coincides with the optimal transport map for common distributions; we support this hypothesis by extensive numerical experiments using an advanced tensor-train solver for the multidimensional Fokker-Planck equation. We provide additional theoretical evidence for the case of multivariate normal distributions.

1. INTRODUCTION

Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) have recently outperformed alternative approaches to modeling the distribution of natural images, both in the realism of individual samples and in their diversity (Dhariwal & Nichol, 2021). These advantages of diffusion models are successfully exploited in applications such as colorization (Song et al., 2021b), inpainting (Song et al., 2021b), super-resolution (Saharia et al., 2021; Li et al., 2022), and semantic editing (Meng et al., 2021), where DDPMs often achieve more impressive results than GANs. One crucial feature of diffusion models is the existence of a deterministic invertible mapping from the data distribution to the limiting distribution of the diffusion process, commonly a standard normal distribution. This approach, termed the denoising diffusion implicit model (DDIM) (Song et al., 2021a), or the probability flow ODE in the continuous formulation (Song et al., 2021b), allows one to easily invert real images, perform data manipulations, obtain a uniquely identifiable encoding, and compute exact likelihoods. Despite these appealing features, not much is known about the actual mathematical properties of the encoder map and the corresponding latent codes, which is the question we address in this work. Concretely, in this paper we show that for the DDPM diffusion process, based on the Variance Preserving (VP) SDE (Song et al., 2021b), this encoder map coincides, to high numerical accuracy, with the Monge optimal transport map between the data distribution and the standard normal distribution. We provide extensive empirical evidence on controlled synthetic examples and real datasets, and give a proof of equality for the case of multivariate normal distributions. Our findings suggest a complete description of the encoder map and an intuitive approach to understanding the 'structure' of a latent code for DDPMs trained on visual data.
Under this description, the pixel-based Euclidean distance between latent codes corresponds to high-level texture and color similarity, which is directly observed in real DDPMs. To summarize, the contributions of our paper are:

1. We theoretically verify that for the case of multivariate normal distributions the Monge optimal transport map coincides with the DDPM encoder map.

2. We study the DDPM encoder map by numerically solving the Fokker-Planck equation on a large class of synthetic distributions and show that the equality holds up to negligible errors.

3. We provide additional qualitative empirical evidence supporting our hypothesis on real image datasets.

2. REMINDER ON DIFFUSION MODELS

We start by recalling various concepts from the theory of diffusion models and stochastic differential equations (SDEs).

2.1. DENOISING DIFFUSION PROBABILISTIC MODELS

Denoising diffusion probabilistic models (DDPMs) are a class of generative models recently shown to obtain excellent performance on the task of image synthesis (Dhariwal & Nichol, 2021; Ho et al., 2020; Song et al., 2021b). We start with a forward (non-parametric) diffusion which gradually adds noise to the data, transforming it into a Gaussian distribution. Formally, we specify the transition probabilities as
$$q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\big),$$
for some fixed variance schedule $\beta_1, \ldots, \beta_T$. Importantly, a noisy sample $x_t$ can be obtained directly from the clean sample $x_0$ as $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, z$, with $z \sim \mathcal{N}(0, I)$, $\alpha_t := 1 - \beta_t$, and $\bar\alpha_t := \prod_{s=1}^{t} \alpha_s$. The generative model then learns to reverse this process and thus gradually produce realistic samples from noise. Specifically, the DDPM learns parameterized Gaussian transitions:
$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\, a_\theta(x_t, t),\, \sigma_t\big). \quad (2)$$
In practice, rather than predicting the mean of the distribution in Equation (2), the noise predictor network $\epsilon_\theta(x_t, t)$ predicts the noise component from the sample $x_t$ and the step $t$; the mean is then a linear combination of this noise component and $x_t$. The covariances $\sigma_t$ can be either fixed or learned as well; the latter was shown to improve the quality of models (Nichol & Dhariwal, 2021). Interestingly, the noise predictor $\epsilon_\theta(x_t, t)$ is tightly related to the score function (Stein, 1972; Liu et al., 2016; Gorham, 2017) of the intermediate distributions; specifically, by defining $s_\theta(x_t, t)$ as
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}},$$
we obtain that $s_{\theta^*}(x_t, t) \approx \nabla_x \log p_t(x_t)$, with $p_t(x_t)$ being the density of the target distribution after $t$ steps of the diffusion process and $\theta^*$ being the parameters at convergence.

Stochastic and deterministic sampling. Given the DDPM model, the generative process is expressed as
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z,$$
with the initial sample $x_T \sim \mathcal{N}(0, I)$ and $z \sim \mathcal{N}(0, I)$.
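To make these formulas concrete, the closed-form noising step, one stochastic reverse step, and the score relation can be sketched in NumPy. The linear variance schedule and the function names below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Hypothetical linear variance schedule beta_1..beta_T (an assumption;
# the text does not fix a particular schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, z):
    """Closed-form forward step: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) z."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * z

def ddpm_reverse_step(x_t, t, eps_pred, z, sigma_t):
    """One stochastic reverse step:
    x_{t-1} = (x_t - (1 - alpha_t)/sqrt(1 - abar_t) * eps) / sqrt(alpha_t) + sigma_t z."""
    a, ab = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - a) / np.sqrt(1.0 - ab) * eps_pred) / np.sqrt(a)
    return mean + sigma_t * z

def score_from_eps(eps_pred, t):
    """Score estimate implied by the noise predictor: s = -eps / sqrt(1 - abar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bars[t])
```

In a trained model, `eps_pred` would come from the noise predictor network $\epsilon_\theta(x_t, t)$; here it is left as a free argument.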
The sampling procedure is stochastic, so no single 'latent space' exists. The authors of Song et al. (2021a) proposed a deterministic approach to producing samples from the target distribution, termed DDIM (denoising diffusion implicit model). Importantly, this approach does not require retraining the DDPM and only changes the sampling algorithm; the resulting marginal probability distributions $p_t$ are equal to those produced by stochastic sampling. It takes the following form:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}}\, \epsilon_\theta(x_t, t).$$
By utilizing DDIM, we obtain a notion of a latent space and an encoder for diffusion models, since the only input to the generative model is now $x_T \sim \mathcal{N}(0, I)$ for sufficiently large $T$. In the next section, we will see how this map is defined in the continuous setup.
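The deterministic DDIM update can be sketched as follows (a minimal illustration under an assumed linear variance schedule; the helper names are our own, not from the paper):

```python
import numpy as np

# Assumed linear variance schedule (illustrative; not fixed by the text).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(x_t, t, eps_pred):
    """Deterministic DDIM update:
    x_{t-1} = sqrt(abar_{t-1}) * (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
              + sqrt(1 - abar_{t-1}) * eps.
    The first term rescales the model's current estimate of x_0; the second
    re-adds the predicted noise at the previous noise level."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred
```

A sanity check on this form: if `eps_pred` equals the true noise used to form $x_t$ from $x_0$, the update maps $x_t$ exactly onto the corresponding $x_{t-1}$ on the same noise trajectory, so the marginals are preserved while the sampling path stays deterministic.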

