DIFFUSION PROBABILISTIC FIELDS

Abstract

Diffusion probabilistic models have quickly become a major approach for generative modeling of images, 3D geometry, video and other domains. However, to adapt diffusion generative modeling to these domains, the denoising network needs to be carefully designed for each domain independently, oftentimes under the assumption that data lives on a Euclidean grid. In this paper we introduce Diffusion Probabilistic Fields (DPF), a diffusion model that can learn distributions over continuous functions defined over metric spaces, commonly known as fields. We extend the formulation of diffusion probabilistic models to deal with this field parametrization in an explicit way, enabling us to define an end-to-end learning algorithm that side-steps the requirement of representing fields with latent vectors as in previous approaches (Dupont et al., 2022a; Du et al., 2021). We empirically show that, while using the same denoising network, DPF effectively deals with different modalities like 2D images and 3D geometry, in addition to modeling distributions over fields defined on non-Euclidean metric spaces.

1. INTRODUCTION

Diffusion probabilistic modeling has quickly become a central approach for learning data distributions, obtaining impressive empirical results across multiple domains like images (Nichol & Dhariwal, 2021), videos (Ho et al., 2022) or even 3D geometry (Luo & Hu, 2021). In particular, Denoising Diffusion Probabilistic Models (often referred to as DDPMs or diffusion generative models) (Ho et al., 2020; Nichol & Dhariwal, 2021) and their continuous-time extension (Song et al., 2021b) both present a training objective that is more stable than those of precursors like generative adversarial nets (GANs) (Goodfellow et al., 2014) or energy-based models (EBMs) (Du et al., 2020). In addition, diffusion generative models have been shown to empirically outperform GANs in the image domain (Dhariwal & Nichol, 2021) and to suffer less from mode-seeking pathologies during training (Kodali et al., 2017). A diffusion generative model consists of three main components: the forward (or diffusion) process, the backward (or inference) process, and the denoising network (also referred to as the score network¹ due to its equivalence with denoising score-matching approaches (Sohl-Dickstein et al., 2015)). A substantial body of work has addressed different definitions of the forward and backward processes (Rissanen et al., 2022; Bansal et al., 2022; Song et al., 2021a), focusing on the image domain. However, two caveats of current diffusion models remain open. The first is that data is typically assumed to live on a discrete Euclidean grid (exceptions include work on molecules (Hoogeboom et al., 2022) and point clouds (Luo & Hu, 2021)). The second is that the denoising network is heavily tuned for each specific data domain, with different network architectures used for images (Nichol & Dhariwal, 2021), video (Ho et al., 2022), or geometry (Luo & Hu, 2021).
In order to extend the success of diffusion generative models to the large number of diverse areas in science and engineering, a unification of the score formulation is required. Importantly, such a unification enables use of the same score network across data domains exhibiting different geometric structure, without requiring data to live in, or be projected onto, a discrete Euclidean grid. To achieve this, in this paper we introduce the Diffusion Probabilistic Field (DPF). DPFs make progress towards the ambitious goal of unifying diffusion generative modeling across domains by learning distributions over continuous functions. For this, we take a functional view of the data, interpreting a data point x ∈ R^d as a function f : M → Y (Dupont et al., 2022a; Du et al., 2021). The function f maps elements from a metric space M to a signal space Y. This functional view of the data is commonly referred to as a field representation (Xie et al., 2022), a term we adopt for functions of this type. An illustration of this field interpretation is provided in Fig. 1. Using the image domain as an illustrative example, one can either interpret an image as a multidimensional array x_i ∈ R^{h×w×3} or as a field f : R^2 → R^3 that maps 2D pixel coordinates to RGB values. This field view enables a unification of seemingly different data domains under the same parametrization. For instance, 3D geometry data is represented via f : R^3 → R, and spherical signals become fields f : S^2 → R^d. In an effort to unify generative modeling across different data domains, field data representations have shown promise in three recent approaches: From data to functa (Functa) (Dupont et al., 2022a), GEnerative Manifold learning (GEM) (Du et al., 2021) and Generative Adversarial Stochastic Process (GASP) (Dupont et al., 2022b).
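For concreteness, the array view and the field view of an image can be related as follows. This is a minimal NumPy sketch under assumptions of our own choosing (a hypothetical 4×4 image and a [-1, 1] coordinate convention); it is not taken from the paper's implementation:

```python
import numpy as np

# Array view: a hypothetical 4x4 RGB "image" as a multidimensional array.
h, w = 4, 4
image = np.random.rand(h, w, 3)

# Field view: the same image as (coordinate, signal) pairs, i.e., a discrete
# sampling of a field f: R^2 -> R^3 on a [-1, 1]^2 pixel-coordinate grid.
ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
coords = np.stack([ys, xs], axis=-1).reshape(-1, 2)  # (h*w, 2) coordinates
values = image.reshape(-1, 3)                        # (h*w, 3) RGB values
```

The same (coordinate, value) interface carries over to the other examples in the text, e.g. 3D coordinates paired with occupancy for geometry.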
The first two approaches adopt a latent field parametrization (Park et al., 2019), where a field network is parametrized via a hypernetwork (Ha et al., 2017) that takes a trainable latent vector as input. During training, a latent vector for each field is optimized in an initial reconstruction stage (Park et al., 2019). In Functa (Dupont et al., 2022a) the authors then propose to learn the distribution of optimized latents in an independent second training stage, similar to the approach of Rombach et al. (2022) and Vahdat et al. (2021). Du et al. (2021) define additional latent neighborhood regularizers during the reconstruction stage. Sampling is then performed in a non-parametric way: one chooses a random latent vector from the set of optimized latents and projects it into its local neighborhood before adding Gaussian noise. See Fig. 2 for a visual summary of the differences between Functa (Dupont et al., 2022a), GEM (Du et al., 2021) and our DPF. GASP (Dupont et al., 2022b) employs a GAN paradigm: the generator produces a field while the discriminator operates on discrete points from the field and distinguishes the input source, i.e., real or generated. In contrast to prior work (Dupont et al., 2022a; Du et al., 2021; Dupont et al., 2022b), we formulate a diffusion generative model over functions in a single-stage approach. This permits efficient end-to-end training without relying on an initial reconstruction stage or tweaking an adversarial game, which we empirically find to lead to compelling results. Our contributions are summarized as follows:
• We introduce the Diffusion Probabilistic Field (DPF), which extends the formulation of diffusion generative models to field representations.
• We formulate a probabilistic generative model over fields as a single-stage model using an explicit field parametrization, which differs from recent work (Dupont et al., 2022a; Du et al., 2021) and simplifies the training process by enabling end-to-end learning.
• We empirically demonstrate that DPF can successfully capture distributions over functions across different domains like images, 3D geometry and spherical data, outperforming recent work (Dupont et al., 2022a; Du et al., 2021; Dupont et al., 2022b) .
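For concreteness, the latent field parametrization adopted by Functa and GEM, i.e., a hypernetwork that maps a per-datum latent vector to the weights of a field network, can be sketched as below. The architecture, dimensions and untrained random projections are illustrative assumptions only, not the cited methods' learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_hidden = 8, 16

# Fixed (here: random, untrained) hypernetwork projections that decode a
# latent vector z into the parameters of a tiny field MLP f: R^2 -> R^3.
P_W1 = 0.1 * rng.standard_normal((d_latent, 2 * d_hidden))
P_W2 = 0.1 * rng.standard_normal((d_latent, d_hidden * 3))

def field(coords, z):
    # Decode latent z into MLP weights, then evaluate the field at coords.
    W1 = (z @ P_W1).reshape(2, d_hidden)
    W2 = (z @ P_W2).reshape(d_hidden, 3)
    return np.tanh(coords @ W1) @ W2

z = rng.standard_normal(d_latent)          # one latent vector = one field/datum
coords = rng.uniform(-1, 1, size=(5, 2))   # query 5 pixel coordinates
out = field(coords, z)                     # (5, 3) RGB-like outputs
```

Under this parametrization the latent z is the object being modeled, which is why those methods need a reconstruction stage to obtain one latent per datum; DPF instead diffuses the field's (coordinate, value) representation directly.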

2. BACKGROUND: DENOISING DIFFUSION PROBABILISTIC MODELS

Denoising Diffusion Probabilistic Models (DDPMs) belong to the broad family of latent variable models. We refer the reader to Everett (2013) for an in-depth review. In short, to learn a parametric data distribution p_θ(x_0) from an empirical distribution of finite samples q(x_0), DDPMs reverse a diffusion Markov chain (i.e., the forward diffusion process) that generates latents x_{1:T} by gradually adding Gaussian noise to the data x_0 ∼ q(x_0) for T time-steps, admitting the closed-form marginal:

q(x_t | x_0) := N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I).    (1)

Here, ᾱ_t is the cumulative product of fixed variances with a handcrafted scheduling up to time-step t. Ho et al. (2020) highlight two important observations that make training of DDPMs efficient: i) Eq. (1) allows sampling x_t at an arbitrary time-step in closed form during the forward diffusion process; ii) reversing the diffusion
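The closed-form sampling property of Eq. (1) can be sketched as follows. The linear variance schedule and its hyper-parameters are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    # Handcrafted (here: linear) variance schedule beta_1..beta_T;
    # alpha_bar_t is the cumulative product of (1 - beta_t) up to t.
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    # Draw x_t from q(x_t | x_0) = N(sqrt(ab_t) x_0, (1 - ab_t) I) in one
    # step, without simulating the t intermediate noising steps.
    ab = alpha_bar[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((8,))           # a toy data point
xt = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

As t grows, ᾱ_t shrinks toward zero and x_t approaches pure Gaussian noise, which is what makes the terminal latent easy to sample at generation time.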



† Work was completed while P.Z. was an intern with Apple.
¹ We use the terms score network/function and denoising network/function interchangeably in the paper.

