UNDERSTANDING DDPM LATENT CODES THROUGH OPTIMAL TRANSPORT

Abstract

Diffusion models have recently outperformed alternative approaches to modeling the distribution of natural images. Such diffusion models allow for deterministic sampling via the probability flow ODE, giving rise to a latent space and an encoder map. While this map has important practical applications, such as likelihood estimation, its theoretical properties are not yet fully understood. In the present work, we partially address this question for the popular case of the VP-SDE (DDPM) approach. We show that, perhaps surprisingly, the DDPM encoder map coincides with the optimal transport map for common distributions; we support this hypothesis with extensive numerical experiments using an advanced tensor-train solver for the multidimensional Fokker-Planck equation. We provide additional theoretical evidence for the case of multivariate normal distributions.

1. INTRODUCTION

Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) have recently outperformed alternative approaches to modeling the distribution of natural images, both in the realism of individual samples and in their diversity (Dhariwal & Nichol, 2021). These advantages of diffusion models are successfully exploited in applications such as colorization (Song et al., 2021b), inpainting (Song et al., 2021b), super-resolution (Saharia et al., 2021; Li et al., 2022), and semantic editing (Meng et al., 2021), where DDPMs often achieve more impressive results than GANs. One crucial feature of diffusion models is the existence of a deterministic invertible mapping from the data distribution to the limiting distribution of the diffusion process, commonly a standard normal distribution. This approach, termed the denoising diffusion implicit model (DDIM) (Song et al., 2021a), or the probability flow ODE in the continuous-time formulation (Song et al., 2021b), makes it possible to invert real images easily, perform data manipulations, obtain a uniquely identifiable encoding, and compute exact likelihoods. Despite these appealing features, little is known about the actual mathematical properties of the encoder map and the corresponding latent codes, which is the question we address in this work. Concretely, in this paper we show that for the DDPM diffusion process, based on the Variance Preserving (VP) SDE (Song et al., 2021b), this encoder map coincides, to high numerical accuracy, with the Monge optimal transport map between the data distribution and the standard normal distribution. We provide extensive empirical evidence on controlled synthetic examples and real datasets, and give a proof of equality for the case of multivariate normal distributions. Our findings suggest a complete description of the encoder map and an intuitive approach to understanding the 'structure' of a latent code for DDPMs trained on visual data.
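The claimed coincidence can be checked directly in the one-dimensional Gaussian case, where the score of the VP-SDE marginals is available in closed form. The following sketch (ours, not the paper's implementation; the linear beta schedule, its endpoints, and the Euler step count are illustrative assumptions) integrates the probability flow ODE for data distributed as N(0, sigma0^2) and compares the result with the Monge optimal transport map to N(0, 1), which for centered Gaussians is simply x -> x / sigma0:

```python
import numpy as np

# For 1-D data x0 ~ N(0, sigma0^2), the VP-SDE marginal at time t is
# N(0, sigma_t^2) with sigma_t^2 = sigma0^2 * alpha(t) + 1 - alpha(t),
# where alpha(t) = exp(-int_0^t beta(s) ds). The score is then
# score(x, t) = -x / sigma_t^2, so the probability flow ODE
#   dx/dt = -0.5 * beta(t) * (x + score(x, t))
# can be integrated without any learned model.

def beta(t, beta_min=0.1, beta_max=20.0):
    # Linear beta schedule (an illustrative choice, common for VP-SDE).
    return beta_min + t * (beta_max - beta_min)

def int_beta(t, beta_min=0.1, beta_max=20.0):
    # Closed-form int_0^t beta(s) ds for the linear schedule.
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def encode(x0, sigma0, T=1.0, n_steps=10_000):
    # Euler integration of the probability flow ODE from t = 0 to t = T.
    x = np.asarray(x0, dtype=float)
    ts = np.linspace(0.0, T, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        a = np.exp(-int_beta(t0))
        var_t = sigma0 ** 2 * a + 1.0 - a   # marginal variance at t0
        score = -x / var_t                  # exact score of a centered Gaussian
        x = x + (t1 - t0) * (-0.5 * beta(t0) * (x + score))
    return x

sigma0 = 2.0
x0 = np.array([-2.0, -0.5, 1.0, 3.0])
z = encode(x0, sigma0)
# Analytically, x(t) = x0 * sigma_t / sigma0, which tends to the OT map
# x0 / sigma0 as sigma_T -> 1; the numerical output should be close to it.
print(np.max(np.abs(z - x0 / sigma0)))
```

The printed deviation from the optimal transport map is small and is dominated by the Euler discretization and by sigma_T not being exactly 1 at finite T, in line with the multivariate normal result stated above.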
For DDPMs trained on visual data, proximity in the pixel-based Euclidean metric of the latent space corresponds to high-level texture and color similarity, an effect directly observed on real DDPMs. To summarize, the contributions of our paper are:

