f-DM: A MULTI-STAGE DIFFUSION MODEL VIA PROGRESSIVE SIGNAL TRANSFORMATION

Abstract

Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains. Standard DMs can be viewed as an instantiation of hierarchical variational autoencoders (VAEs) where the latent variables are inferred from input-centered Gaussian distributions with fixed scales and variances. Unlike VAEs, this formulation constrains DMs from changing the latent spaces and learning abstract representations. In this work, we propose f-DM, a generalized family of DMs which allows progressive signal transformation. More precisely, we extend DMs to incorporate a set of (hand-designed or learned) transformations, where the transformed input is the mean of each diffusion step. We propose a generalized formulation of DMs and derive the corresponding denoising objective together with a modified sampling algorithm. As a demonstration, we apply f-DM to image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations based on the encoders of pretrained VAEs. In addition, we identify the importance of adjusting the noise levels whenever the signal is sub-sampled and propose a simple rescaling recipe. f-DM can produce high-quality samples on standard image generation benchmarks like FFHQ, AFHQ, LSUN and ImageNet with better efficiency and semantic interpretation. Please check our videos at http://jiataogu.me/fdm/.

Figure 1: Visualization of reverse diffusion from f-DMs with various signal transformations. x_t is the denoised output, and z_s is the input to the next diffusion step. We plot the first three channels of the VQVAE latent variables. Low-resolution images are resized to 256² for ease of visualization.

1. INTRODUCTION

Diffusion probabilistic models (DMs, Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021) and score-based generative models (Song et al., 2021b) have become increasingly popular tools for high-quality image (Dhariwal & Nichol, 2021), video (Ho et al., 2022b), text-to-speech (Popov et al., 2021) and text-to-image (Rombach et al., 2021; Ramesh et al., 2022; Saharia et al., 2022a) synthesis. Despite this empirical success, conventional DMs are restricted to operating in the ambient space throughout the Gaussian noising process. In contrast, common generative models like VAEs (Kingma & Welling, 2013) and GANs (Goodfellow et al., 2014; Karras et al., 2021) employ a coarse-to-fine process that hierarchically generates high-resolution outputs. We are interested in combining the best of the two worlds: the expressivity of DMs and the benefits of hierarchical features.

To this end, we propose f-DM, a generalized multi-stage framework of DMs that incorporates progressive transformations of the inputs. As an important property of our formulation, f-DM makes no assumptions about the type of transformation. This makes it compatible with many possible designs, ranging from domain-specific operations to generic neural networks. In this work, we consider representative types of transformations, including down-sampling, blurring, and neural-based transformations. What these functions share in common is that they allow one to derive increasingly more global, coarse, and/or compact representations, which we believe can lead to better sampling quality as well as reduced computation.

Incorporating arbitrary transformations into DMs also brings immediate modeling challenges. For instance, certain transformations destroy information drastically, and some also change the dimensionality. For the former, we derive an interpolation-based formulation to smoothly bridge consecutive transformations.
For the latter, we verify the importance of rescaling the noise level, and propose a resolution-agnostic signal-to-noise ratio (SNR) as a practical guideline for noise rescaling. Extensive experiments are performed on image generation benchmarks, including FFHQ, AFHQ, LSUN Bed/Church and ImageNet. f-DMs consistently match or outperform the baselines, while requiring relatively less computation thanks to the progressive transformations. Furthermore, given a pre-trained f-DM, we can readily manipulate the learned latent space, and perform conditional generation tasks (e.g., super-resolution) without additional training.

2. BACKGROUND

Diffusion Models (DMs, Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are deep generative models that can be viewed as a special case of hierarchical VAEs (Kingma et al., 2021). In this paper, we consider diffusion in continuous time, similar to Song et al. (2021b); Kingma et al. (2021).
Given a datapoint x ∈ R^N, a DM models time-dependent latent variables z = {z_t | t ∈ [0, 1], z_0 = x} based on a fixed signal-noise schedule {α_t, σ_t}:

q(z_t | z_s) = N(z_t; α_{t|s} z_s, σ²_{t|s} I), where α_{t|s} = α_t/α_s, σ²_{t|s} = σ²_t − α²_{t|s} σ²_s, s < t. (1)

It also defines the marginal distribution q(z_t | x):

q(z_t | x) = N(z_t; α_t x, σ²_t I). (2)

By default, we assume the variance-preserving form (Ho et al., 2020), that is, α²_t + σ²_t = 1, α_0 = σ_1 = 1, and the signal-to-noise ratio (SNR, α²_t/σ²_t) decreases monotonically with t.

For generation, a parametric function θ is optimized to reverse the diffusion process by denoising z_t = α_t x + σ_t ε back to the clean input x with a weighted reconstruction loss. For example, the "simple loss" proposed in Ho et al. (2020) is equivalent to weighting residuals by ω_t = α²_t/σ²_t:

L_θ = E_{z_t ∼ q(z_t|x), t ∼ [0,1]} [ω_t · ‖x_θ(z_t, t) − x‖²₂].

In practice, θ is parameterized as a U-Net (Ronneberger et al., 2015). As suggested in Ho et al. (2020), predicting the noise ε_θ empirically achieves better performance than predicting x_θ directly, where x_θ(z_t, t) = (z_t − σ_t ε_θ(z_t, t))/α_t.

Sampling from such a learned model can be performed with ancestral sampling (DDPM, Ho et al., 2020) or the deterministic DDIM sampler (Song et al., 2021a). Starting from z_1 ∼ N(0, I), a sequence of timesteps 1 = t_0 > ... > t_N = 0 is chosen for iterative generation, and both methods can be summarized for each step as:

z_s = α_s · x_θ(z_t) + √(σ²_s − η² σ̃²) · ε_θ(z_t) + η σ̃ · ε, ε ∼ N(0, I), s < t,

where σ̃ = σ_s σ_{t|s}/σ_t, and η controls the proportion of additional noise (η = 0 gives DDIM). As the score function ε_θ is defined in the ambient space, all the latent variables z are forced to have the same shape as the input data x (R^N).
This not only leads to inefficient training, especially for steps with high noise levels (Jing et al., 2022), but also makes it hard for DMs to learn an abstract and semantically meaningful latent space, as pointed out by Preechakul et al. (2022).
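To make the reverse update above concrete, here is a minimal NumPy sketch of one reverse step; `reverse_step` is a hypothetical helper name, and the cosine schedule α_t = cos(0.5πt) matches the one used later in Section 4.1.

```python
import numpy as np

def reverse_step(x_pred, eps_pred, alpha_s, sigma_s, alpha_t, sigma_t, eta, rng):
    """One reverse step z_t -> z_s; eta = 1 recovers ancestral (DDPM) sampling,
    eta = 0 the deterministic DDIM update."""
    alpha_ts = alpha_t / alpha_s                              # alpha_{t|s}
    sigma_ts2 = sigma_t ** 2 - alpha_ts ** 2 * sigma_s ** 2   # sigma^2_{t|s}
    sigma_tilde = sigma_s * np.sqrt(sigma_ts2) / sigma_t
    noise = rng.standard_normal(np.shape(x_pred))
    return (alpha_s * x_pred
            + np.sqrt(sigma_s ** 2 - (eta * sigma_tilde) ** 2) * eps_pred
            + eta * sigma_tilde * noise)

# toy usage with the cosine schedule alpha_t = cos(0.5 * pi * t)
alpha = lambda u: np.cos(0.5 * np.pi * u)
sigma = lambda u: np.sin(0.5 * np.pi * u)
rng = np.random.default_rng(0)
z_s = reverse_step(np.zeros(4), np.ones(4), alpha(0.6), sigma(0.6),
                   alpha(0.8), sigma(0.8), eta=0.0, rng=rng)
```

With η = 0 the injected noise vanishes and the step is fully determined by the model predictions, which is the DDIM behavior used for the latent manipulations later in the paper.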

3. METHOD

In this section, we introduce f-DM, an extended family of DMs that enables diffusion on transformed signals, in a way similar to a standard hierarchical VAE. We start with the definition of the proposed multi-stage formulation with general signal transformations, followed by the modified training and generation algorithms (Section 3.1). Then, we apply f-DM to three categories of transformations (Section 3.2).

3.1. MULTI-STAGE DIFFUSION

Signal Transformations We consider a sequence of deterministic functions f = {f_0, ..., f_K}, where f_0 ... f_k progressively transform the input signal x ∈ R^N into x_k = f_{0:k}(x) ∈ R^{M_k}. We assume x_0 = f_0(x) = x. In principle, f can be any function. In this work, we focus on transformations that gradually destroy the information contained in x (e.g., down-sampling), leading towards more compact representations. Without loss of generality, we assume M_0 ≥ M_1 ≥ ... ≥ M_K. A sequence of inverse mappings g = {g_0, ..., g_{K−1}} is used to connect the corresponding pairs of consecutive spaces. Specifically, we define x̄_k as:

x̄_k := g_k(f_{k+1}(x_k)) ≈ x_k, if k < K;  x̄_K := x_K. (3)

The approximation in Equation 3 (k < K) is not necessarily (and sometimes cannot be) accurate. For instance, if f_k downsamples an input image x from 128² to 64² with average pooling, g_k can be a bilinear interpolation that upsamples back to 128², which is a lossy reconstruction. The definition of f and g can be seen as a direct analogy to the encoder (φ) and decoder (θ) in hierarchical VAEs (see Figure 2(b)). However, there are still major differences: (1) the VAE encoder/decoder is stochastic, and the encoder's outputs are regularized by the prior; in contrast, f and g are deterministic, and the encoder output x_K does not necessarily follow a simple prior; (2) VAEs directly use the decoder for generation, while f and g are fused into the diffusion steps of f-DM.
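For intuition, here is a minimal NumPy sketch of one (f_k, g_k) pair and the lossy reconstruction x̄_k of Equation 3; average pooling and nearest-neighbor upsampling are our simplifications for a dependency-free example (the experiments use bilinear interpolation for both, see Appendix B.1).

```python
import numpy as np

def f_k(x):
    """One transformation step: downsample by 2 with average pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def g_k(x_small):
    """Lossy inverse mapping: nearest-neighbor upsampling back to the input size."""
    return np.repeat(np.repeat(x_small, 2, axis=0), 2, axis=1)

x = np.arange(16.0).reshape(4, 4)
x_bar = g_k(f_k(x))  # x_bar approximates x, but the reconstruction is lossy
```

Here M_{k−1} = 4 M_k, matching the d_k = 4 dimension change discussed for the noise rescaling below.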

Forward Diffusion

We extend continuous-time DMs to signal transformations. We split the diffusion time 0 → 1 into K+1 stages, and within each stage a partial diffusion process is performed. More specifically, we define a set of time boundaries 0 = τ_0 < τ_1 < ... < τ_K < τ_{K+1} = 1, and for t ∈ [0, 1] the latent z_t has the following marginal probability:

q(z_t | x) = N(z_t; α_t x_t, σ²_t I), where x_t = [(t − τ_k) x̄_k + (τ_{k+1} − t) x_k] / (τ_{k+1} − τ_k), τ_k ≤ t < τ_{k+1}. (4)

That is, x_t is the interpolation of x_k and its approximation x̄_k when t falls in stage k. A simple illustration of the relationship between x_t, x̄_k, x_k and z_t is shown in Figure 10. We argue that interpolation is crucial, as it creates a continuous transformation that slowly corrupts information inside each stage; such change can be easily reversed by our model. It is also nontrivial to find the optimal stage schedule τ_k for each model, as it highly depends on how much information is destroyed in each stage f_k. In this work, we tested two heuristics: (1) a linear schedule τ_k = k/(K+1); (2) a cosine schedule τ_k = cos(0.5π(1 − k/(K+1))). Note that standard DMs can be seen as a special case of f-DM with only one stage (K = 0).

Equation 4 does not guarantee a Markovian transition. Nevertheless, our formulation only needs q(z_t | z_s, x), which has the following simple form for diffusion steps within a stage:

q(z_t | z_s, x) = N(z_t; α_{t|s} z_s + α_t · (x_t − x_s), σ²_{t|s} I), τ_k ≤ s < t < τ_{k+1}. (5)

From Equation 4, we can further rewrite x_t − x_s = −δ_t · (t − s)/(t − τ_k), where δ_t = x_k − x_t is the signal degradation. Equation 5 also indicates that the reverse diffusion distribution q(z_s | z_t, x) ∝ q(z_t | z_s, x) q(z_s | x) can be written as a function of x_t and δ_t, which will be our learning targets.

Boundary Condition To enable diffusion across stages, we need transitions at the stage boundaries τ_k.
More specifically, when the step approaches a boundary from the left (τ⁻, the left limit of τ), the transition q(z_τ | z_{τ⁻}, x) should be as deterministic (ideally invertible) and smooth as possible to minimize information loss. First, we can expand z_τ and z_{τ⁻} as signal-noise combinations:

Before: z_{τ⁻} = α_{τ⁻} · x_{τ⁻} + σ_{τ⁻} · ε, p(ε) = N(0, I);
After: z_τ = α_τ · x_τ + σ_τ · ζ(ε), p(ζ(ε)) = N(0, I). (6)

By definition, x_{τ⁻} = x̄_{k−1} = g_{k−1}(x_k) = g_{k−1}(x_τ), which means the signal part is invertible. Therefore we only need to find ζ. Under the initial assumption M_k ≤ M_{k−1}, this can be achieved easily by dropping elements from ε. Take down-sampling (M_{k−1} = 4 M_k) as an example: we can directly drop 3 out of every 2 × 2 values from ε. More details are included in Appendix A.4.

The second requirement, a smooth transition, is not as straightforward as it looks: it asks the "noisiness" of the latents z to remain unchanged across the boundary. We argue that the conventional measure in the DM literature, the signal-to-noise ratio (SNR), is not compatible with resolution changes, as it averages the signal/noise power element-wise. In this work, we propose a generalized, resolution-agnostic SNR by viewing data as points sampled from a continuous field:

SNR(z) = E_{Ω∼I} ‖E_{i∼Ω} SIGNAL(z_i)‖² / E_{Ω∼I} ‖E_{i∼Ω} NOISE(z_i)‖², (7)

where I is the data range, SIGNAL represents the real data value (such as image pixels), and NOISE is the unstructured Gaussian noise added to the data. Ω is a patch relative to I, which can be of any size as long as it is invariant to different sampling rates (resolutions). As shown in Figure 3 (left), we can obtain a reliable measure of noisiness by averaging the signal/noise inside patches. We derive α_τ, σ_τ from α_{τ⁻}, σ_{τ⁻} for any transformation by forcing SNR(z_τ) = SNR(z_{τ⁻}) under this new definition.
Specifically, if the dimensionality change is solely caused by a change of sampling rate (e.g., down-sampling, averaging RGB channels, deconvolution), we obtain the following relation:

α²_τ / σ²_τ = d_k · γ_k · α²_{τ⁻} / σ²_{τ⁻}, (8)

where d_k = M_{k−1}/M_k is the total dimension change, and γ_k = E‖x̄_{k−1}‖² / E‖x_k‖² is the change of signal power. For example, d_k = 4 and γ_k ≈ 1 for down-sampling. Following Equation 8, the straightforward rule is to rescale the magnitude of the noise and keep the signal part unchanged:

α ← α, σ ← σ/√d_k,

which we refer to as signal-preserving (SP) rescaling. Note that, to ensure the noise schedule is continuous over time and close to the original schedule, such rescaling is applied to the noise of the entire stage, and accumulates when multiple transformations are used. As the comparison in Figure 3 shows, the resulting images are visually closer to those of the standard DM. However, the variance of z_t becomes very small, especially when t → 1, which might be hard for the neural network to distinguish. Therefore, we propose the variance-preserving (VP) alternative, which further normalizes the rescaled α, σ so that α² + σ² = 1. We show the visualization in Figure 3(b).

Algorithm 1: Reverse diffusion for image generation using f-DM
  Input: model θ, f, g, stage schedule {τ_0, ..., τ_K}, rescaled noise schedule functions α(·), σ(·), step size Δt, ε_full ∼ N(0, I), DDPM ratio η
  Initialize z from ε_full
  for k = K; k ≥ 0; k ← k − 1 do
    for t = τ_{k+1}; t > τ_k; t ← t − Δt, s = t − Δt do
      ε_θ, δ_θ = θ(z, t);  x_θ = (z − σ(t) · ε_θ)/α(t)
      if s > τ_k then
        z = α(s) · (x_θ + δ_θ · (t − s)/(t − τ_k)) + √(σ²(s) − η² σ̃²) · ε_θ + η σ̃ · ε,  ε ∼ N(0, I)
    if k > 0 then
      re-sample noise ε_rs from ε_θ and ε_full;  z = α(τ_k) · g_{k−1}(x_θ) + σ(τ_k) · ε_rs
  return x_θ

Training We train a neural network θ to denoise. We also show the training pipeline in Figure 10.
In f-DM, noise is caused by two factors: (1) the perturbation ε from noise injection; (2) the degradation δ due to the signal transformation. Thus, we propose to predict x_θ and δ_θ jointly, which simultaneously removes both from z_t, with a "double reconstruction" loss:

L_θ = E_{z_t ∼ q(z_t|x), t ∼ [0,1]} [ω_t · (‖x_θ(z_t, t) − x_t‖²₂ + ‖δ_θ(z_t, t) − δ_t‖²₂)],

where the denoised output is x_θ(z_t, t) + δ_θ(z_t, t). Unlike standard DMs, the denoising targets are the transformed signals of each stage rather than the final real images, which are generally simpler to recover. As in standard DMs, we also choose to predict ε_θ and compute x_θ = (z_t − σ_t ε_θ)/α_t. We adopt the same U-Net architecture for all stages, where the input z_t is directed to the corresponding inner layer based on its spatial resolution (see Appendix Figure 11 for details).

Unconditional Generation We present the generation steps in Algorithm 1, where x_t and δ_t are replaced by the model's predictions x_θ, δ_θ. Thanks to the interpolation formulation (Equation 4), generation is independent of the transformations f; only the inverse mappings g, which might be simple and easy to compute, are needed to map the signals at the boundaries. This brings flexibility and efficiency when learning complex or even test-time-inaccessible transformations. In addition, Algorithm 1 includes a "noise re-sampling" step at each stage boundary, which reverses ζ(ε) in Equation 6. While ζ is deterministic, the reverse process needs additional randomness. For instance, if ζ drops elements in the forward process, the reverse step should inject standard Gaussian noise back at the dropped locations. Because we assume M_0 ≥ ... ≥ M_K, we propose to sample a full-size noise ε_full before generation and gradually add subsets of ε_full at each stage. Thus, ε_full encodes multi-scale information, similar to RealNVP (Dinh et al., 2016).
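As a toy sketch of the pieces above (the interpolated mean x_t of Equation 4, the degradation δ_t, and the double-reconstruction objective), with hypothetical helper names:

```python
import numpy as np

def stage_mean(t, tau, xs, xbars):
    """x_t of Eq. 4: within stage k, interpolate from x_k (at tau[k]) to
    x_bar_k (at tau[k+1]); also return the degradation delta_t = x_k - x_t."""
    k = min(np.searchsorted(tau, t, side='right') - 1, len(tau) - 2)
    w = (t - tau[k]) / (tau[k + 1] - tau[k])
    x_t = (1 - w) * xs[k] + w * xbars[k]
    return x_t, xs[k] - x_t, k

def double_recon_loss(x_pred, d_pred, x_t, d_t, w_t=1.0):
    """The "double reconstruction" objective; the denoised output is x_pred + d_pred."""
    return w_t * (np.mean((x_pred - x_t) ** 2) + np.mean((d_pred - d_t) ** 2))

# two toy stages: x_0 = 3, x_bar_0 = 1 (= x_1), x_bar_1 = 0.5
tau = [0.0, 0.5, 1.0]
xs = [np.full(4, 3.0), np.full(4, 1.0)]
xbars = [np.full(4, 1.0), np.full(4, 0.5)]
x_t, d_t, k = stage_mean(0.25, tau, xs, xbars)
```

At t = 0.25 (halfway through stage 0) the mean is halfway between x_0 and x̄_0, and δ_t is half the stage's total degradation, consistent with the relation x_t − x_s = −δ_t (t − s)/(t − τ_k).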
Conditional Generation Given an unconditional f-DM, we can perform conditional generation by replacing the denoised output x_θ with a condition x_c at a suitable time T, and starting diffusion from T. For example, if f is down-sampling and x_c is a low-resolution image, f-DM enables super-resolution (SR) without additional training. To achieve this, it is critical to initialize z_T properly, which implicitly requires z_T ≈ α_T x_c + σ_T ε_θ(z_T). In practice, we choose T to be the corresponding stage boundary, and initialize z_T by adding random noise σ_T ε to α_T x_c. A gradient-based method is then used to iteratively update z_T ← z_T − λ ∇_{z_T} ‖x_θ(z_T) − x_c‖²₂ for a few steps before diffusion starts.
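The gradient-based initialization can be sketched as follows; `init_latent` is our hypothetical name, and the finite-difference gradient is a stand-in for the autograd one used in practice.

```python
import numpy as np

def init_latent(x_c, alpha_T, sigma_T, x_theta, steps=5, lr=0.1, rng=None):
    """Initialize z_T for conditional generation: start from alpha_T*x_c + sigma_T*eps,
    then take a few gradient steps on ||x_theta(z) - x_c||^2 (finite differences here)."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = alpha_T * x_c + sigma_T * rng.standard_normal(x_c.shape)
    for _ in range(steps):
        h = 1e-4
        base = np.sum((x_theta(z) - x_c) ** 2)
        grad = np.zeros_like(z)
        for i in np.ndindex(z.shape):   # O(N) model calls: only for tiny toy inputs
            z2 = z.copy()
            z2[i] += h
            grad[i] = (np.sum((x_theta(z2) - x_c) ** 2) - base) / h
        z = z - lr * grad
    return z

# toy "denoiser" that simply inverts the signal scaling: x_theta(z) = z / alpha_T
x_c = np.array([1.0, -1.0])
z_T = init_latent(x_c, alpha_T=0.5, sigma_T=0.8, x_theta=lambda z: z / 0.5)
```

For this linear toy denoiser the update contracts towards the fixed point z* = α_T x_c, illustrating why a few steps suffice to make z_T consistent with the condition.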

3.2. APPLICATIONS ON VARIOUS TRANSFORMATIONS

With the definitions in Section 3.1, we now apply f-DM with different transformations. In this paper, we consider the following three categories.

Downsampling. As the motivating example in Section 3.1, we let f be a sequence of downsampling operations that transforms a given image (e.g., 256²) progressively down to 16², where each f_k(·) reduces the side length by 2 and, correspondingly, g_k(·) upsamples by 2. Thus, generation starts from low-resolution noise and progressively performs super-resolution. We denote this model as f-DM-DS, where d_k = 4 and γ_k = 1 in Equation 8.

Blurring. We also consider blurring transformations, both upsampling-based (f-DM-Blur-U) and Gaussian (f-DM-Blur-G); see Appendix B.2 for the kernel schedules.

Learned transformations. We use the encoders of pretrained VQ models as f. Our VQVAE is trained on ImageNet (Deng et al., 2009) and is frozen for the rest of the experiments. For f-DM-VQGAN, we directly take the checkpoint provided by Rombach et al. (2021). Besides, we need to tune γ_k separately for each encoder due to the change in signal magnitude.

4. EXPERIMENTS

4.1. EXPERIMENTAL SETTINGS

Datasets. We evaluate f-DMs on five commonly used benchmarks testing generation on a range of domains: FFHQ (Karras et al., 2019), AFHQ (Choi et al., 2020), LSUN Church & Bed (Yu et al., 2015), and ImageNet (Deng et al., 2009). All images are center-cropped and resized to 256 × 256.

Training Details. We implement the three types of transformations with the same architecture and hyper-parameters except for the stage-specific adapters. We adopt a lighter version of ADM (Dhariwal & Nichol, 2021) as the main U-Net architecture. For all experiments, we use the same training scheme: the AdamW (Kingma & Ba, 2014) optimizer with a learning rate of 2e-5 and an EMA decay factor of 0.9999. We set the weight ω_t = sigmoid(−log(α²_t/σ²_t)) following P2-weighting (Choi et al., 2022). The cosine noise schedule α_t = cos(0.5πt) is adopted for diffusion in the 256² × 3 image space. As proposed in Equation 8, noise rescaling (VP by default) is applied for f-DMs whenever the resolution changes. All our models are trained with a batch size of 32 images for 500K (FFHQ, AFHQ, LSUN Church), 1.2M (LSUN Bed) and 2.5M (ImageNet) iterations, respectively.

Baselines & Evaluation. We compare f-DMs against a standard DM (DDPM, Ho et al., 2020) on all five datasets. To ensure a fair comparison, we train DDPM following the same settings and continuous-time formulation as our approaches. We also include transformation-specific baselines: (1) we re-implement the cascaded DM (Cascaded, Ho et al., 2022a) to match the f-DM-DS setup, going from 16² progressively to 256², where for each stage a separate DM is trained conditioned on the consecutive downsampled image; (2) we re-train a latent-diffusion model (LDM, Rombach et al., 2021) on the latents extracted by our pretrained VQVAE; (3) to compare with f-DM-Blur-G, we include the scores and synthesized examples of IHDM (Rissanen et al., 2022). We set 250 timesteps (Δt = 0.004) for f-DMs and the baselines, with η = 1 (Algorithm 1).
We use Fréchet Inception Distance (FID, Heusel et al., 2017) and Precision/Recall (PR, Kynkäänniemi et al., 2019) as measures of visual quality, computed with 50K samples against the entire training set.

Qualitative Comparison

To demonstrate the capability of handling various complex datasets, Figure 4 (↑) presents an uncurated set of images generated by f-DM-DS. We show more samples from all types of f-DMs in Appendix E.4, and a comparison between f-DMs and the baselines in Figure 4 (↓).

v.s. Cascaded DMs. While cascaded DMs are widely used (Dhariwal & Nichol, 2021; Ho et al., 2022a), it is underexplored to apply cascades in a sequence of consecutive resolutions (16 → 32 → 64 → ...) like ours. In such cases, prediction errors easily accumulate during generation, yielding serious artifacts at the final resolution. To ease this, the Cascaded DM (Ho et al., 2022a) applies "noise conditioning augmentation", which reduces the domain gap between stages by adding random noise to the input condition. However, it is not straightforward to tune the noise level for both training and inference. By contrast, f-DM is non-cascaded by design, and there are no domain gaps between stages. That is, we can train our model end-to-end without worrying about additional tuning parameters, and achieve stable results.

v.s. LDMs. We show comparisons with LDMs (Rombach et al., 2021) in Table 1. LDMs generate more efficiently, as diffusion only happens in the latent space. However, generation is heavily biased by the behavior of the fixed decoder. For instance, it is challenging for VQVAE decoders to synthesize sharp images, which causes low scores in Table 1. LDMs with VQGAN decoders, on the other hand, are able to generate sharp details, which are typically favored by the InceptionV3 (Szegedy et al., 2016) used in FID and PR. Therefore, despite having artifacts in the output (see Figure 4, below, rightmost), LDMs (GAN) still obtain good scores. In contrast, f-DM, as a pure DM, naturally bridges the latent and image spaces, and generation is not restricted by the decoder.

v.s. Blurring DMs. Table 1 also compares with a recently proposed blurring-based method (IHDM, Rissanen et al., 2022).
Different from our approach, IHDM formulates a fully deterministic forward process. We conjecture that this lack of randomness causes its poor generation quality. Instead, f-DM incorporates blurring with stochastic noise in a natural way, yielding better quantitative and qualitative results.

Conditional Generation. In Figure 5(a), we demonstrate using pre-trained f-DMs for conditional generation based on learned transformations. We downsample and blur sampled real images, and start reverse diffusion following Section 3.1 with f-DM-DS and f-DM-Blur-U, respectively. Despite differences in fine details, both models faithfully generate high-fidelity outputs close to the real images. The same algorithm applies to extracted latent representations: compared with the original VQVAE output, f-DM-VQVAE obtains better reconstructions. We provide additional conditional generation samples, with an ablation of the gradient-based initialization, in Appendix E.3.

Latent Space Manipulation

To demonstrate that f-DMs learn certain abstract representations by modeling signal transformations, we show latent manipulation results in Figure 5. Here we assume DDIM sampling (η = 0), so the only stochasticity comes from the initially sampled noise ε_full. In (b), we obtain a semantically smooth transition between two cat faces by linearly interpolating the low-resolution noises; in (c), we show samples of the same identity with different fine details (e.g., expressions, poses), achieved simply by sampling f-DM-DS with the low-resolution (16²) noise fixed. This implies that f-DM is able to allocate high-level and fine-grained information to different stages by learning with downsampling.

Ablation Studies. Table 2 presents ablations of the key design choices. As expected, the interpolation formulation (Equation 4) effectively bridges the information gap between stages; without it, prediction errors accumulate, resulting in blurry outputs and poor scores. Table 2 also demonstrates the importance of correct noise scaling: for both models, rescaling improves the FID and recall by large margins, with SP working slightly worse than VP. In addition, we empirically explore different stage schedules. Compared to VAE-based models, we usually have more stages in DS/Blur-based models to generate high-resolution images; the cosine schedule helps diffusion move faster in regions with low information density (e.g., low-resolution or heavily blurred stages).
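Both manipulations amount to operating only on the low-resolution portion of ε_full. A small sketch, where we simplify the per-stage noise subsets by treating the top-left 16² block of a single 2-D array as the low-resolution noise (names are ours):

```python
import numpy as np

def mix_lowres(eps_a, eps_b, lam, hw=16):
    """Linearly interpolate only the low-resolution noise block (as in Fig. 5b);
    keeping that block fixed while redrawing the rest gives same-identity
    variations (as in Fig. 5c)."""
    out = eps_a.copy()
    out[:hw, :hw] = (1 - lam) * eps_a[:hw, :hw] + lam * eps_b[:hw, :hw]
    return out

rng = np.random.default_rng(0)
eps_a = rng.standard_normal((32, 32))
eps_b = rng.standard_normal((32, 32))
mixed = mix_lowres(eps_a, eps_b, lam=0.5)
```

With η = 0, feeding such a mixed ε_full through the sampler deterministically maps the noise interpolation into a semantic interpolation of the outputs.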

5. RELATED WORK

Progressive Generation with DMs. Conventional DMs generate images at a fixed resolution. Therefore, existing work generally adopts cascaded approaches (Nichol & Dhariwal, 2021; Ho et al., 2022a; Saharia et al., 2022a) that chain a series of conditional DMs to generate coarse-to-fine; such cascades have also been used for super-resolution (SR3, Saharia et al., 2022b). However, cascaded models tend to suffer from error propagation. More recently, Ryu & Ye (2022) dropped the need for conditioning and proposed to generate images in a pyramidal fashion with additional reconstruction guidance; Jing et al. (2022) explored learning subspace DMs and connecting them to the full space with Langevin dynamics. The proposed f-DM is distinct from all of the above: it requires only one diffusion process, and images get naturally up-sampled through reverse diffusion.

Blurring DMs. Several concurrent works (Rissanen et al., 2022; Daras et al., 2022; Lee et al., 2022) have recently looked into DM alternatives that combine blurring with the diffusion process, some of which also showed the possibility of deterministic generation (Bansal et al., 2022). Although sharing similarities, our work starts from a different view based on signal transformation. Furthermore, our empirical results show that stochasticity plays a critical role in high-quality generation.

Latent Space DMs. Existing work has also investigated combining DMs with standard latent variable models. To the best of our knowledge, most of these works adopt DMs for learning a prior over the latent space, where sampling is followed by a pre-trained (Rombach et al., 2021) or jointly optimized (Vahdat et al., 2021) decoder. Conversely, f-DM does not rely on the quality of a separate decoder.

ETHICS STATEMENT

Our work focuses on technical development, i.e., synthesizing high-quality images with a range of signal transformations (e.g., downsampling, blurring). Our approach has various applications, such as movie post-production, gaming, helping artists reduce workload, and generating synthetic data as training data for other computer vision tasks. Our approach can be used to synthesize human-related images (e.g., faces), and it is not biased towards any specific gender, race, region, or social class. However, the ability of generative models, including our approach, to generate high-quality images that are indistinguishable from real images raises concerns about the misuse of these methods, e.g., generating fake images. To address these concerns, we need to mark all generated results as "synthetic". In addition, we believe it is crucial to have authenticity assessment, such as fake image detection and identity verification, which will alleviate the potential for misuse. We hope our approach can be used to foster the development of technologies for authenticity assessment. Finally, we believe that creating a set of appropriate regulations and laws would significantly reduce the risks of misuse while bolstering positive effects on technology development.

REPRODUCIBILITY STATEMENT

We assure that all the results shown in the paper and supplemental materials can be reproduced. We believe we have provided enough implementation details in the paper and supplemental materials for readers to reproduce the results.

Figure 7: Illustration of the noise schedule (α_t and σ_t) for f-DM-DS models with 5 stages (16² → 256²). We use the standard cosine noise schedule α_t = cos(0.5πt). We also show the difference between the linear/cosine stage schedules, as well as the proposed SP/VP re-scaling methods.

A.4 NOISE AT BOUNDARIES

In this paper, the overall principle for handling transitions across stage boundaries is to ensure that the forward diffusion is deterministic and smooth, so that almost no information is lost during the stage change. This requirement is important, as it directly correlates with denoising performance: failing to recover the lost information directly affects the diversity of the model's generations.

Figure 6: Two naïve ways for down-sampling.

Forward diffusion As described in Section 3.1, since we control the signal and the noise separately, we can directly apply the deterministic transformation to the signal while dropping noise elements. Alternatively, we also implemented a different ζ(ε) based on averaging. As shown in Figure 6, if the transformation is down-sampling, we can use the fact that the mean of Gaussian noises is still Gaussian with lower variance: (ε_0 + ε_1 + ε_2 + ε_3)/4 ∼ N(0, ¼ I). Therefore, a ×2 rescaling is needed on the resulting noise.

Reverse diffusion Similarly, we can define the reverse process when ζ is chosen to be averaging. Different from "dropping", where the reverse process simply adds independent Gaussian noise, reversing "averaging" requires sampling ε_0 + ε_1 + ε_2 + ε_3 = 2ε given the input noise ε, while having p(ε_i) = N(0, I), i = 0, 1, 2, 3. This problem has a closed-form solution and can be implemented in an autoregressive fashion:

a = 2ε;
ε_0 = a/4 + √(3/4) · ε̃_1, a ← a − ε_0, ε̃_1 ∼ N(0, I);
ε_1 = a/3 + √(2/3) · ε̃_2, a ← a − ε_1, ε̃_2 ∼ N(0, I);
ε_2 = a/2 + √(1/2) · ε̃_3, a ← a − ε_2, ε̃_3 ∼ N(0, I);
ε_3 = a.

Similar to the case of "dropping", we need 3 additional samples ε̃_{1:3} to produce the four noises, so it can be implemented in the same way as described in Section 3.1. Empirically, reversing the "averaging" steps tends to produce samples with better FID scores; however, it introduces correlations into the added noise, which may cause undesired biases, especially in DDIM sampling.
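The autoregressive construction above can be written directly as code; `split_noise` is our hypothetical name. Each ε_i is sampled from its exact conditional distribution given the remaining sum, so the four parts are marginally standard Gaussian while summing to 2ε.

```python
import numpy as np

def split_noise(eps, rng):
    """Reverse the 'averaging' zeta: sample eps_0..eps_3, each marginally N(0, I),
    conditioned on eps_0 + eps_1 + eps_2 + eps_3 = 2 * eps."""
    a = 2.0 * eps
    parts = []
    for n in (4, 3, 2):
        # conditional of one of n iid N(0,1) terms given their sum a: N(a/n, (n-1)/n)
        e = a / n + np.sqrt((n - 1) / n) * rng.standard_normal(eps.shape)
        parts.append(e)
        a = a - e
    parts.append(a)  # the last term is fully determined
    return parts

rng = np.random.default_rng(0)
eps = rng.standard_normal((4, 4))
parts = split_noise(eps, rng)
```

Note the parts are correlated with ε (and with each other), which is exactly the source of the DDIM bias discussed above.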
Intuition behind Re-scaling Here we present a simple justification for noise rescaling. Suppose the signal dimensionality changes from M_{k−1} to M_k when crossing the stage boundary, and such change is caused by different sampling rates. Based on the proposed resolution-agnostic SNR (Equation 7), the number of sampled points inside Ω is proportional to the dimensionality. Generally, it is safe to assume signals are mostly low-frequency, so averaging signals does not change their variance. By contrast, as shown above, averaging Gaussian noises results in lower variance; in our case, the variance of the patch-averaged noise is proportional to M⁻¹. Therefore, assuming the signal magnitude does not change, we can derive the re-scaling rule by forcing SNR(z_τ) = SNR(z_{τ⁻}) at the stage boundary:

σ²_{τ⁻} · M⁻¹_{k−1} = σ²_τ · M⁻¹_k,

which gives the signal-preserving (SP) rescaling in Equation 8. In Figure 7, we show an example of the change of α and σ with and without applying the re-scaling technique for f-DM-DS models.
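This argument can be checked numerically. Using the averaging ζ from above, patch-averaged signal and noise means commute with the resolution change, so the patch SNR of Equation 7 is preserved exactly once σ is divided by √d_k. A small sketch (helper names are ours):

```python
import numpy as np

def patch_mean(a, p):
    """Average values inside non-overlapping p x p patches."""
    h, w = a.shape
    return a.reshape(h // p, p, w // p, p).mean(axis=(1, 3))

def patch_snr(sig, noi, p):
    """Resolution-agnostic SNR of Eq. 7 with patches covering a fixed image fraction."""
    return np.mean(patch_mean(sig, p) ** 2) / np.mean(patch_mean(noi, p) ** 2)

rng = np.random.default_rng(0)
sig = rng.standard_normal((8, 8))   # stand-in for the signal
eps = rng.standard_normal((8, 8))   # unit Gaussian noise
alpha, sigma, d_k = 0.6, 0.8, 4

# cross the boundary: average-pool the signal; the averaged noise re-normalized
# by x2 is again ~ N(0, I); apply SP rescaling sigma <- sigma / sqrt(d_k)
sig_lo = patch_mean(sig, 2)
eps_lo = patch_mean(eps, 2) * 2.0
snr_hi = patch_snr(alpha * sig, sigma * eps, p=4)
snr_lo = patch_snr(alpha * sig_lo, (sigma / np.sqrt(d_k)) * eps_lo, p=2)
```

The two SNR values agree because a 2 × 2 patch at the low resolution averages exactly the same underlying values as a 4 × 4 patch at the high resolution.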

A.5 DDIM SAMPLING

The above derivations only describe standard ancestral sampling (η = 1), where q(z_s | z_t, x) is determined by Bayes' theorem. Optionally, one can define any proper reverse diffusion distribution as long as the marginal distributions match. For example, f-DM can also perform deterministic DDIM sampling (Song et al., 2021a) by setting η = 0 in Algorithm 1; similar to Song et al. (2021a), the proof follows an induction argument. Figure 8 shows a comparison of DDIM sampling between standard DMs and the proposed f-DM. In DDIM sampling (η = 0), the only randomness comes from the initial noise at t = 1. Due to the proposed noise re-sampling technique, f-DM enables a multi-scale noising process where the sampled noises are split and injected at different steps of the diffusion process. In this case, compared to standard DMs, we gain the ability to control image generation at different levels, resulting in smooth semantic interpretation.

B DETAILED INFORMATION OF TRANSFORMATIONS

We show the difference of all the transformations used in this paper in Figure 9 .

B.1 DOWNSAMPLING

In the early development of this work, we explored various combinations for down-sampling: f ∈ {bilinear, nearest, Gaussian blur + subsample}, g ∈ {bilinear, bicubic, nearest, neural-based}. While all these combinations produced similar results, we empirically found on FFHQ that choosing bilinear interpolation for both f and g achieves the most stable results. Therefore, all the main experiments of f-DM-DS use bilinear interpolation. As discussed in Section 3.2, we choose K = 4, which progressively downsamples a 256² image to 16².
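The K = 4 stage construction can be sketched as follows. We use simple 2×2 block-mean downsampling as a stand-in for the bilinear resize used in the paper (the helper names are ours); each stage halves the resolution, so four stages take 256² to 16².

```python
import numpy as np

def downsample2x(img):
    """2x2 block-mean downsampling (a simple stand-in for bilinear resize)."""
    H, W = img.shape[:2]
    return img.reshape(H // 2, 2, W // 2, 2, -1).mean(axis=(1, 3))

def build_stages(x0, K=4):
    """Return [x_0, f(x_0), ..., f^K(x_0)]: K progressive downsampling stages."""
    stages = [x0]
    for _ in range(K):
        stages.append(downsample2x(stages[-1]))
    return stages
```

Block averaging preserves the global mean exactly, which is consistent with the low-frequency assumption used in the rescaling argument above.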

B.2 BLURRING

We experimented with two types of blurring functions. For upsampling-based blurring, we use the same number of stages as in the downsampling case; for Gaussian-based blurring, we adopt K = 7 with kernel sizes σ_B = 15 sin²(π τ_k / 2), where τ_k follows the cosine stage schedule. In practice, we implement the blurring function in the frequency domain, following Rissanen et al. (2022), based on the discrete cosine transform (DCT).
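The kernel-size schedule can be computed as below. The σ_B formula is from the text; the exact form of the cosine stage schedule for τ_k is our assumption for illustration (any monotone schedule mapping k = 0..K onto [0, 1] would plug in the same way).

```python
import numpy as np

def blur_sigma(tau):
    """Gaussian blur kernel size schedule: sigma_B = 15 * sin^2(pi/2 * tau)."""
    return 15.0 * np.sin(0.5 * np.pi * tau) ** 2

K = 7
# hypothetical cosine-style stage schedule over [0, 1]; the paper's exact
# tau_k may differ
tau_k = 1.0 - np.cos(0.5 * np.pi * np.arange(K + 1) / K)
sigmas = blur_sigma(tau_k)  # monotonically increasing from 0 to 15
```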

B.3 VAES

In this paper, we only consider vector-quantized (VQ) models with a single-layer latent space, though our method can be readily applied to hierarchical (Razavi et al., 2019) and KL-regularized VAE models (Vahdat & Kautz, 2020). Following Rombach et al. (2021), we take the feature vectors before the quantization layers as the latent space, and keep the quantization step in the decoder (g) when training diffusion models. We follow an open-sourced implementation (https://github.com/rosinality/vq-vae-2-pytorch) to train our VQVAE model on ImageNet. The model consists of two strided convolution blocks, which by default downsample the input image by a factor of 8. We use the default hyper-parameters and train the model for 50 epochs with a batch size of 128. For a fair comparison, to match the latent size of the VQVAE, we use the pre-trained autoencoding model (Rombach et al., 2021) with the setting {f=8, VQ (Z=256, d=4)}. We directly use the checkpoint provided by the authors (https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip). Note that the above setting is not the best-performing model (LDM-4) in the original paper; it therefore generates more artifacts when reconstructing images from the latents. Before training, we compute the signal magnitude ratio γ_k (Equation 8) over the entire training set of FFHQ, where we empirically set γ_k = 2.77 for VQ-GAN and γ_k = 2.0 for VQ-VAE, respectively.
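The dataset-level estimation of γ_k can be sketched as below. This is our reading of the quantity, assuming γ_k compares per-element RMS magnitudes of the image-space and latent-space signals; the helper name and the exact statistic are illustrative, not the paper's implementation.

```python
import numpy as np

def magnitude_ratio(images, latents):
    """Estimate gamma_k as the ratio of per-element RMS magnitudes between
    the image-space signal and the latent-space signal (assumed reading
    of Equation 8; computed over a dataset in practice)."""
    rms = lambda a: np.sqrt(np.mean(np.square(a)))
    return rms(images) / rms(latents)
```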

C DATASET DETAILS

FFHQ (https://github.com/NVlabs/ffhq-dataset) contains 70k images of real human faces at resolution 1024². For most of our experiments, we resize the images to 256².

AFHQ (https://github.com/clovaai/stargan-v2#animal-faces-hq-dataset-afhq) contains 15k images of animal faces at resolution 512², covering three categories: cat, dog, and wild. We train conditional diffusion models by merging all training images with the label information. All images are resized to 256².

LSUN (https://www.yf.io/p/lsun) is a collection of large-scale image datasets containing 10 scenes and 20 object categories. Following previous work (Rombach et al., 2021), we choose two categories, Church (126k images) and Bed (3M images), and train separate unconditional models on them. As LSUN-Bed is relatively large, we train it for more iterations than the other datasets. All images are resized to 256² with center-crop.

D.1 ARCHITECTURE

As shown in Figure 11, the input z_t is directed to the corresponding inner layer based on its spatial resolution, and a stage-specific adapter transforms the channel dimension. This architecture also allows memory-efficient batching across stages: we can create a batch with various resolutions and split the computation based on resolution.

D.2 HYPER-PARAMETERS

In our experiments, we adopt the following two sets of parameters based on the complexity of the dataset: base (FFHQ, AFHQ, LSUN-Church/Bed) and big (ImageNet). For base, we use 1 residual block per resolution with a base dimension of 128. For big, we use 2 residual blocks with a base dimension of 192. Given one dataset, all the models with various transformations, including the baseline DMs, share the same hyper-parameters except for the adapters. We list the hyper-parameter details in Table 3.

E ADDITIONAL RESULTS

E.1 QUANTITATIVE COMPARISON WITH DDIM

We also include a comparison of f-DM with the standard DM using DDIM sampling (η = 0) in Table 4. Similar to the conclusion drawn from Table 1, the proposed f-DM achieves comparable or even better performance than the baseline DM with η = 0 (generation controlled only by the initial noise, see Figure 8), while achieving better scores for DDIM with half the generation steps.

E.2 QUALITATIVE COMPARISONS

We include more comparisons in Figures 12 and 13. In Figure 12, we compare the generation process of f-DM and the cascaded DM. It is clear that f-DM conducts coarse-to-fine generation in a more natural way, and its results do not suffer from error propagation. As shown in Figure 13, LDM outputs are easily affected by the chosen decoder: the VQVAE decoder tends to output blurry images, while the output of the VQGAN decoder has much finer details but noticeable artifacts (e.g., eyes, fur). By contrast, f-DM performs stably in both latent spaces.

E.3 CONDITIONAL GENERATION

We include additional results of conditional generation, i.e., super-resolution (Figure 14) and de-blurring (Figure 15). We also show comparisons with and without the proposed gradient-based initialization, which greatly improves the faithfulness of conditional generation when the input noise is high (e.g., for 16 × 16 inputs).

E.4 ADDITIONAL QUALITATIVE RESULTS

Finally, we provide additional qualitative results for our unconditional models for FFHQ (Figure 16 ), AFHQ (Figure 17 ), LSUN (Figure 18 ) and our class-conditional ImageNet model (Figure 19, 20) .

F LIMITATIONS AND FUTURE WORK

Although f-DM enables diffusion with signal transformations, which greatly extends the scope of DMs to transformed spaces, limitations and opportunities for future work remain. First, finding the optimal stage schedule for each transformation is an empirical question. Our ablation studies also show that different heuristics behave differently for DS-based and VAE-based models. A metric that can automatically determine the best stage schedule based on the properties of each transformation is needed and will be explored in future work. In addition, although the current method achieves faster inference when generating with transformations like down-sampling, the speed-up is not very significant, as we still take the standard DDPM steps. Further accelerating the inference of DMs is a challenging and orthogonal direction; for example, combining f-DM with speed-up techniques such as knowledge distillation (Salimans & Ho, 2022) has great potential. Moreover, all the transformations used in f-DM, whether hand-designed or learned, are fixed when training the DM. This differs from typical VAEs, where the encoder and decoder are jointly optimized during training. Therefore, starting from a random or imperfect transformation and training f-DM jointly with the transformations towards certain target objectives will be studied as future work.



For simplicity, we omit the subscript k for τ_k in the following paragraphs.

CONCLUSION

We proposed f-DM, a generalized family of diffusion models that enables generation with signal transformations. As a demonstration, we applied f-DM to image generation tasks with a range of transformations, including downsampling, blurring, and VAEs, where f-DMs outperform the baselines in terms of synthesis quality and semantic interpretation.



Figure 2: (a) the standard DMs; (b) a bottom-up hierarchical VAEs; (c) our proposed f -DM.

Figure 3: Left: an illustration of the proposed SNR computation for different sampling rates; Right: the comparison of rescaling the noise level for progressive down-sampling. Without noise rescaling, the diffused images in low resolution quickly become too noisy to distinguish the underlying signal.

Figure 4: ↑ Random samples from f -DM-DS trained on various datasets; ↓ Comparison of f -DMs and the corresponding baselines under various transformations. Best viewed when zoomed in. All faces presented are synthesized by the models, and are not real identities.

Figure 5: Random DDIM samples (η = 0) from (a) f-DMs on AFHQ and LSUN-Church given {downsampled, blurred, latent} images as conditions; (b) f-DM-VQVAE by interpolating the initial noise of the latent stage; (c) f-DM-DS starting from the same initial noise of the 16 × 16 stage. For (c), we also show the "mean image" of 300 random samples using the same initial noise.

Figure 8: Comparison of DDIM sampling between the standard DM and f-DM.

Figure 9: Examples of the five transformations (downsampling, upsampling-based and Gaussian blurring, VQ-VAE, and VQ-GAN) used in this paper. For downsampling, we resize the image with a nearest-neighbor upsampler; for VQ-VAE/VQ-GAN, we visualize the first 3 channels of the latent feature maps.

Figure 10: An illustration of the training pipeline.

ImageNet (https://image-net.org/download.php): we use the standard ImageNet-1K dataset, which contains 1.28M images across 1000 classes. We directly merge all the training images with their class labels. All images are resized to 256² with center-crop. For both f-DM and the baseline models, we adopt classifier-free guidance (Ho & Salimans, 2022) with unconditional probability 0.2. At inference time, we use guidance scale s = 2 for computing FIDs and s = 3 to synthesize examples for comparison.

Figure 11: An illustration of the modified U-Net architecture. Time conditioning is omitted. The parameters are partially shared across stages based on the resolutions. Stage-specific adapters are adopted to transform the input dimensions.

Figure 12: Additional comparisons with Cascaded DM on AFHQ. ↑ Comparison of the reverse diffusion process from 16 2 to 256 2 . We visualize the denoised outputs (x t ) and the corresponding next noised input (z s ) near the start & end of each resolution diffusion. ↓ Comparison of random samples generated by Cascaded DM and f -DM-DS.

Figure 13: Additional comparisons with LDMs on AFHQ.

Figure 14: Additional examples of super-resolution (SR) with the unconditional f-DM-DS trained on AFHQ. ↑ The same input image at various resolutions: 16², 32², 64², 128². We sample 3 random seeds for each input resolution. We also show the difference with and without applying gradient-based initialization (Grad-Init) on z. ↓ SR results for various 16² inputs.

Figure 15: Additional examples of de-blurring with the unconditional f-DM-Blur-G trained on AFHQ. ↑ The same input image with various Gaussian kernel sizes σ = 15, 9, 4, 1.4. We sample 3 random seeds for each input. We also show the difference with and without applying gradient-based initialization (Grad-Init) on z. ↓ De-blurred results for various blurred images.

Figure 16: Random samples generated by five f -DMs trained on FFHQ 256 × 256. All faces presented are synthesized by the models, and are not real identities.

Figure 17: Random samples generated by five f -DMs trained on AFHQ 256 × 256.

Figure 18: Random samples generated by f -DMs trained on LSUN-Church & -Bed 256 × 256.

Figure 19: Random samples generated by f -DM-DS/VQVAE trained on ImageNet 256 × 256 with classifier-free guidance (s = 3). Classes from top to bottom: red panda, robin, daisy, valley, trifle, comic book.

Figure 20: Random samples generated by f -DM-DS/VQVAE trained on ImageNet 256 × 256 with classifier-free guidance (s = 3). Classes from top to bottom: school bus, pizza, seashore, photocopier, golden retriever, axolotl.

Equation 8, and K = 4 for 256² images.

Blurring. f-DM also supports general blur transformations. Unlike recent works (Rissanen et al., 2022; Hoogeboom & Salimans, 2022) that focus on continuous-time blurring (heat dissipation), Equation 4 can be seen as an instantiation of a progressive blur function if we treat x_k as a blurred version of x_{k-1}. This design brings more flexibility in choosing any kind of blurring function and using the blurred versions as stages. In this paper, we experiment with two types of blurring functions: (1) f-DM-Blur-U, which uses the same downsample operators as f-DM-DS while always up-sampling the images back to the original size; (2) f-DM-Blur-G, which applies standard Gaussian blurring kernels following Rissanen et al. (2022). In both cases, we use g_k(x) = x. As the dimension does not change, no rescaling or noise resampling is required.

Image → Latent Trans. We further consider diffusion with learned non-linear transformations such as VAEs (see Figure 2(b); f: VAE encoder, g: VAE decoder). By inverting such an encoding process, we can generate data from a low-dimensional latent space, similar to Rombach et al. (LDM, 2021). As a major difference, LDM operates only on the latent variables, while f-DM learns diffusion in the latent and image spaces jointly. Because of this, our performance is not bounded by the quality of the VAE decoder. In this paper, we consider VQVAE (Van Den Oord et al., 2017) together with its GAN variant (VQGAN, Esser et al., 2021). For both cases, we transform 256² × 3 images into a 32² × 4 (i.e., d_k = 48) latent space. The VQVAE encoder/decoder is trained on ImageNet.

Quantitative comparisons on various datasets. The speed relative to DDPM is calculated with batch size 1 on CPU. The best-performing DMs are shown in bold. We measure the generation quality (FID and precision/recall) and relative inference speed of f-DMs and the baselines in Table 1. Across all five datasets, f-DMs con-

Ablation of design choices for f -DMs trained on FFHQ. All faces are not real identities.


Hyperparameters and settings for f -DM on different datasets.

APPENDIX A DETAILED DERIVATION OF f -DMS

A.1 q(z_t | z_s, x)

We derive the definition in Equation 5 with the change-of-variable trick, given the fact that x_t, x_s and x_k are all deterministic functions of x. More precisely, suppose z_t ~ N(α_t x_t, σ_t² I) and z_s ~ N(α_s x_s, σ_s² I), where τ_k ≤ s < t < τ_{k+1}. It is then equivalent to express z_t and z_s via auxiliary variables u_t and u_s, which it is reasonable to assume follow the standard DM transition. As typically x_t ≠ x_s, and both x_t and x_s are functions of x_k, z_t is dependent on both z_s and x_k, resulting in a non-Markovian transition. Note that this holds only when x_t, x_s and x_k lie in the same space, and we make no specific assumptions about the form of x_t.

A.2 q(z_s | z_t, x)

The reverse diffusion distribution follows Bayes' Theorem: q(z_s|z_t, x) ∝ q(z_s|x) q(z_t|z_s, x), where q(z_s|x) and q(z_t|z_s, x) are Gaussian distributions with the general forms N(z_s | μ, σ² I) and N(z_t | A z_s + b, σ'² I), respectively. Based on Bishop & Nasrabadi (2006, Eq. 2.116), we can derive

q(z_s | z_t, x) = N(z_s | σ̃² (σ'^{-2} Aᵀ(z_t − b) + σ^{-2} μ), σ̃² I), where σ̃² = (σ^{-2} + σ'^{-2} ∥A∥²)^{-1}.

Plugging in our variables, with σ' = σ_{t|s}, we get

q(z_s | z_t, x) = N(z_s | α_s x_s + √(σ_s² − σ̃²) ϵ_t, σ̃² I),

where ϵ_t = (z_t − α_t x_t)/σ_t and σ̃ = σ_s σ_{t|s} / σ_t.

Alternatively, if we assume x_t takes the interpolation form of Equation 4, we can re-write x_s as x_t + ((t−s)/(t−τ_k)) δ_t, where we define a new variable δ_t = x_k − x_t. As stated in the main text (Section 3.1), this change avoids computing x_s, which may be potentially costly. In this way, the above equation becomes

q(z_s | z_t, x) = N(z_s | α_s (x_t + ((t−s)/(t−τ_k)) δ_t) + √(σ_s² − σ̃²) ϵ_t, σ̃² I).
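The Gaussian posterior used here (product of a Gaussian prior and a linear-Gaussian likelihood) can be sanity-checked numerically in one dimension. The scalar values below are illustrative; the closed form is compared against a brute-force grid computation.

```python
import numpy as np

# prior  z_s ~ N(mu, s2);  likelihood  z_t | z_s ~ N(a*z_s + b, sp2)
mu, s2 = 0.5, 0.4          # illustrative prior mean / variance
a, b, sp2 = 0.8, 0.1, 0.2  # illustrative likelihood parameters
z_t = 1.3                  # observed value

# closed form (Bishop 2.116, scalar case)
post_var = 1.0 / (1.0 / s2 + a * a / sp2)
post_mean = post_var * (a * (z_t - b) / sp2 + mu / s2)

# brute-force posterior moments on a dense grid of z_s values
grid = np.linspace(-10, 10, 200001)
log_joint = (-0.5 * (grid - mu) ** 2 / s2
             - 0.5 * (z_t - a * grid - b) ** 2 / sp2)
w = np.exp(log_joint - log_joint.max())
w /= w.sum()
num_mean = (w * grid).sum()
num_var = (w * (grid - num_mean) ** 2).sum()
```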

A.3 DIFFUSION INSIDE STAGES

At inference time, we generate data by iteratively sampling from the conditional distribution p(z_s|z_t) = E_x[q(z_s|z_t, x)] based on Equation 10. In practice, the expectation over x is approximated by the model's prediction. As shown in Equation 9, we propose a "double-prediction" network θ that reads z_t and simultaneously predicts x_t and δ_t via x_θ and δ_θ, respectively. The predicted Gaussian noise is denoted ϵ_θ = (z_t − α_t x_θ)/σ_t. Note that the predictions x_θ and ϵ_θ are interchangeable: we can readily derive one from the other. Therefore, by replacing x_t, δ_t, ϵ_t with x_θ, δ_θ, ϵ_θ in Equation 10, we obtain the sampling step shown in Algorithm 1, Line 6.
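One reverse step inside a stage can be sketched as below. This is a hypothetical reading of Algorithm 1 (Line 6), assuming the usual transition variance σ²_{t|s} = σ_t² − (α_t/α_s)² σ_s² implied by the Gaussian marginals; names and signatures are ours.

```python
import numpy as np

def ancestral_step(z_t, x_pred, delta_pred, t, s, tau_k,
                   alpha_t, sigma_t, alpha_s, sigma_s, eta=1.0, rng=None):
    """One reverse diffusion step inside a stage (sketch of Alg. 1, Line 6),
    using the network's double prediction (x_theta, delta_theta)."""
    rng = rng if rng is not None else np.random.default_rng()
    eps_pred = (z_t - alpha_t * x_pred) / sigma_t            # implied noise
    x_s_pred = x_pred + (t - s) / (t - tau_k) * delta_pred   # interpolated x_s
    # transition std sigma_{t|s} for marginals N(alpha x, sigma^2 I)
    sigma_ts = np.sqrt(sigma_t ** 2 - (alpha_t / alpha_s) ** 2 * sigma_s ** 2)
    sigma_tilde = eta * sigma_s * sigma_ts / sigma_t         # posterior std
    mean = alpha_s * x_s_pred + np.sqrt(sigma_s ** 2 - sigma_tilde ** 2) * eps_pred
    return mean + sigma_tilde * rng.standard_normal(z_t.shape)
```

With η = 0 this reduces to the deterministic DDIM update; with η = 1 it recovers the ancestral posterior derived in A.2.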

