ITERATIVE α-(DE)BLENDING: LEARNING A DETERMINISTIC MAPPING BETWEEN ARBITRARY DENSITIES

Abstract

We present a learning method that produces a mapping between arbitrary densities, such that random samples of a density can be mapped to random samples of another. In practice, our method is similar to deterministic diffusion processes where samples of the target density are blended with Gaussian noise. The originality of our approach is that, in contrast to several recent works, we do not rely on Langevin dynamics or score-matching concepts. We propose a simpler take on the topic, which is based solely on basic sampling concepts. By studying blended samples and their posteriors, we show that iteratively blending and deblending samples produces random paths between arbitrary densities. We prove that, for finite-variance densities, these paths converge towards a deterministic mapping that can be learnt with a neural network trained to deblend samples. Our method can thus be seen as a generalization of deterministic denoising diffusion where, instead of learning to denoise Gaussian noise, we learn to deblend arbitrary data. We provide a short video overview of the paper in our supplementary material.

1. INTRODUCTION

Diffusion models have recently become one of the most popular generative modeling tools (Ramesh et al., 2022). They have outperformed state-of-the-art GANs (Karras et al., 2020; 2021) and have been applied to many tasks such as image generation (Rombach et al., 2021; Dhariwal & Nichol, 2021), image processing (Saharia et al., 2021; Kawar et al., 2022; Whang et al., 2022), text-to-image (Saharia et al., 2022b), video (Ho et al., 2022) or audio (Kong et al., 2020).

First, there were stochastic diffusion models... These diffusion models have in common that they can be formulated as Stochastic Differential Equations (SDEs) (Song et al., 2021b) such as Langevin dynamics. Langevin's equation models a random walk that obeys a balance between two operations related to Gaussian noise: increasing noise by adding more noise and decreasing noise by climbing the gradient of the log density. Increasing noise performs large steps but moves the samples away from the true density. Decreasing noise projects the samples back onto the true density. Carefully tracking and controlling this balance makes the random walk efficient and provides a sampling procedure for the true density. This is the core of denoising diffusion approaches. Noise Conditional Score Networks (NCSNs) (Song & Ermon, 2019; 2020) use Langevin's equation directly by leveraging the fact that the score (the gradient of the log density in Langevin's equation) can be learnt via a denoiser when the samples are corrupted with Gaussian noise (Vincent, 2011). Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020; Nichol & Dhariwal, 2021) use a Markov chain formalism with a Gaussian prior that provides an SDE similar to Langevin dynamics where the score is also implicitly learnt with a denoiser.

...then came deterministic diffusion models. Variants of Langevin's SDE describe an equilibrium between noise injection and noise removal.
Nullifying the noise injection in these SDEs yields Ordinary Differential Equations (ODEs), also called Probability Flow ODEs (Song et al., 2021b), that simply describe the deterministic trajectory of a noisy sample projected back onto the true density. For instance, Denoising Diffusion Implicit Models (DDIMs) (Song et al., 2021a) are the ODE variants of DDPMs. These ODEs provide a smooth deterministic mapping between the Gaussian noise density and the true density. Deterministic diffusion models have gained interest recently because an ODE requires far fewer solver iterations than its SDE counterpart. Furthermore, a deterministic mapping presents multiple practical advantages because samples are uniquely determined by their prior Gaussian noise, can be interpolated via the Gaussian noise, etc.

Is there a simpler approach to deterministic diffusion? The point of the above story is that, in the recent line of work on diffusion models, stochastic diffusion models came first and deterministic diffusion models came after, framed as special cases of the stochastic ones. They hence inherited the underlying mindset and mathematical framework. As a result, advanced concepts such as Langevin dynamics, score matching, how they relate to Gaussian noise, etc. appear to be necessary background to grasp recent deterministic diffusion models. We argue that this is a significant detour to something that can be framed in a much simpler and more general way. We propose a fresh take on deterministic diffusion with another mindset, using only basic sampling concepts.

• We derive a deterministic diffusion-like model based on the sampling interpretation of blending and deblending. We call it Iterative α-(de)Blending (IADB) in reference to the Computer Graphics α-blending technique that composes images with a transparency parameter (Porter & Duff, 1984). Our model defines a mapping between arbitrary densities (of finite variance).
• We show that when the initial density is Gaussian, the mappings defined by IADB are exactly the same as the ones defined by DDIM (Song et al., 2021a). On the theoretical side, our model can thus be seen as a generalization of DDIM to arbitrary sampling densities rather than just Gaussian. Furthermore, our alternative derivation leads to a more numerically stable sampling formulation. Our experiments show that IADB consistently outperforms DDIM in terms of final FID on several datasets and is more stable with a small number of steps in the sampling stage.

• We explore the generalization to arbitrary non-Gaussian densities provided by our model. We report that, although this generalization seems promising on the theoretical side, the application possibilities were disappointing in our experiments in image generation. We found that sampling with non-Gaussian densities can significantly lower the quality of the generated samples and that the mappings are not always interesting for image processing applications.

2. A DETERMINISTIC MAPPING BETWEEN ARBITRARY DENSITIES

We consider two densities $p_0, p_1 : \mathbb{R}^d \to \mathbb{R}^+$ represented respectively by the red triangle and the green square in Figure 2. Our objective is to define a deterministic mapping such that i.i.d. samples $x_0 \sim p_0$ passed through the mapping produce i.i.d. samples $x_1 \sim p_1$.

2.1. BLENDING AND DEBLENDING AS SAMPLING OPERATIONS

(a) α-blending. We call $p_\alpha$ the density of the blended samples $x_\alpha = (1-\alpha)\, x_0 + \alpha\, x_1$ obtained by blending random samples $(x_0, x_1) \sim p_0 \times p_1$ with a blending parameter $\alpha \in [0, 1]$. This is illustrated in Figure 2-(a).

(b) α-deblending. We call α-deblending the inverse sampling operation, i.e. generating random $x_0$ and $x_1$ from the initial densities that could have been α-blended into a point $x$, as shown in Figure 2-(b). More formally, it means generating random posterior samples $(x_0, x_1)_{|(x,\alpha)} \sim (p_0 \times p_1)_{|(x,\alpha)}$. Note that we never use these posterior samples in practice; we use them only for the derivation of our method.

Proposition 1. If $x \in \mathbb{R}^d$ is a fixed point, the posterior samples $(x_0, x_1)_{|(x,\alpha)} \sim (p_0 \times p_1)_{|(x,\alpha)}$ are distributed according to the posterior densities. However, if $x_\alpha \sim p_\alpha$ is a random sample, the posterior samples are distributed according to the initial densities: $(x_0, x_1)_{|(x_\alpha \sim p_\alpha,\, \alpha)} \sim (p_0 \times p_1)$.

Proof. It follows directly from the law of total probability. We provide more details in Appendix A.

(c) α-(de)blending. Let us consider two blending parameters $\alpha_1, \alpha_2 \in [0, 1]$. Using the previous proposition, we can chain $\alpha_1$-deblending and $\alpha_2$-blending to map a random sample $x_{\alpha_1} \sim p_{\alpha_1}$ to a random sample $x_{\alpha_2} \sim p_{\alpha_2}$. Indeed, by sampling posteriors for a random sample $x_{\alpha_1} \sim p_{\alpha_1}$, we obtain random samples $(x_0, x_1) \sim (p_0 \times p_1)$ from the initial densities, and blending them with parameter $\alpha_2$ provides a random sample $x_{\alpha_2} \sim p_{\alpha_2}$. This is illustrated in Figure 2-(c).
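[Figure 2 panels. (a) α-blending: $(x_0, x_1) \sim p_0 \times p_1 \to x_\alpha \sim p_\alpha$. (b) α-deblending: $x_\alpha \sim p_\alpha \to (x_0, x_1)_{|(x_\alpha \sim p_\alpha,\, \alpha)} \sim p_0 \times p_1$. (c) α-(de)blending: $x_{\alpha_1} \sim p_{\alpha_1} \to (x_0, x_1)_{|(x_{\alpha_1} \sim p_{\alpha_1},\, \alpha_1)} \sim p_0 \times p_1 \to x_{\alpha_2} \sim p_{\alpha_2}$. (d) iterative α-(de)blending: $x_0 \sim p_0 \to \dots \to x_\alpha \sim p_\alpha \to \dots \to x_1 \sim p_1$.]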
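Proposition 1 can be checked exactly on a toy case. The sketch below is only an illustration under stated assumptions: the two small discrete distributions `p0` and `p1` and the value `alpha = 1/2` are hypothetical stand-ins for the densities of the text, chosen so that two different pairs blend into the same point. It blends, deblends with the exact posterior, and verifies that averaging the posteriors over $x_\alpha \sim p_\alpha$ recovers the prior $p_0 \times p_1$.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical discrete stand-ins for the densities p0 and p1.
p0 = {0: Fraction(1, 4), 2: Fraction(3, 4)}
p1 = {0: Fraction(1, 2), 2: Fraction(1, 2)}
alpha = Fraction(1, 2)

# alpha-blending: joint prior over (x0, x1) and the blended density p_alpha.
prior = {(a, b): pa * pb for a, pa in p0.items() for b, pb in p1.items()}
p_alpha = defaultdict(Fraction)
for (a, b), w in prior.items():
    p_alpha[(1 - alpha) * a + alpha * b] += w


def posterior(x):
    """alpha-deblending: distribution of (x0, x1) given that they blend into x."""
    return {
        pair: w / p_alpha[x]
        for pair, w in prior.items()
        if (1 - alpha) * pair[0] + alpha * pair[1] == x
    }


# Proposition 1: posteriors weighted by p_alpha(x) add up to the prior.
recovered = defaultdict(Fraction)
for x, wx in p_alpha.items():
    for pair, w in posterior(x).items():
        recovered[pair] += wx * w
```

Here the point $x = 1$ can come from $(0, 2)$ or $(2, 0)$, so its posterior is a genuine mixture, while the points $0$ and $2$ have a single possible origin.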

2.2. ITERATIVE α-(DE)BLENDING (IADB)

We introduce Iterative α-(de)Blending (IADB), an iterative algorithm that can be implemented stochastically or deterministically. Our main result is that both variants converge towards the same limit, which yields a deterministic mapping between the densities $p_0$ and $p_1$ shown in Figure 2-(d).

Algorithm 1: iterative α-(de)blending (stochastic). Let us consider a number of iterations $T$ and evenly distributed blending parameters $\alpha_t = t/T$, $t \in \{0, .., T\}$. This algorithm creates a sequence $(x_{\alpha_t} \sim p_{\alpha_t},\ t \in \{0, .., T\})$ that starts with a random sample $x_0 \sim p_0$ and ends with a random sample $x_{\alpha_T} = x_1 \sim p_1$ by applying α-(de)blending iteratively. In each iteration, $x_{\alpha_t} \sim p_{\alpha_t}$ is $\alpha_t$-deblended by sampling random posteriors, which are then $\alpha_{t+1}$-blended again to obtain a new sample $x_{\alpha_{t+1}} \sim p_{\alpha_{t+1}}$. End-to-end, this algorithm provides a stochastic mapping between samples $x_0 \sim p_0$ and samples $x_1 \sim p_1$.

Algorithm 2: iterative α-(de)blending (deterministic). This algorithm is the same as Algorithm 1 except that, in each iteration, the random posterior samples are replaced by their expectations. The algorithm is thus not stochastic but deterministic.

Theorem 1: convergence of iterative α-(de)blending. If $p_0$ and $p_1$ are Riemann-integrable densities of finite variance, the sequences computed by Algorithm 1 and Algorithm 2 converge towards the same limit as the number of steps $T$ increases, i.e. for any $\alpha \in [0, 1]$ we have

$\lim_{T \to \infty} x_\alpha \text{ computed by Algorithm 1}(x_0, T) = \lim_{T \to \infty} x_\alpha \text{ computed by Algorithm 2}(x_0, T).$ (1)

Proof. We detail this proof in Appendix B. Intuitively, in each iteration, Algorithm 1 makes a small step $\Delta x_\alpha = (x_1 - x_0)\, \Delta\alpha$ along the segment given by random posterior samples. As the number of iterations increases, many small random steps average out, and the infinitesimal steps are described by an ODE that involves the expected posteriors like in Algorithm 2:

$dx_\alpha = \left(\bar{x}_{1|(x_\alpha,\alpha)} - \bar{x}_{0|(x_\alpha,\alpha)}\right) d\alpha.$ (2)

Algorithm 1 Iterative α-(de)blending (stoch.)
Require: $x_0$, $T$
for $t = 0, .., T-1$ do
    sample $(x_0, x_1) \sim (p_0 \times p_1)_{|(x_{\alpha_t}, \alpha_t)}$
    $x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, x_0 + \alpha_{t+1}\, x_1$
end for

Algorithm 2 Iterative α-(de)blending (deter.)
Require: $x_0$, $T$
for $t = 0, .., T-1$ do
    $(\bar{x}_0, \bar{x}_1) = E_{(p_0 \times p_1)_{|(x_{\alpha_t}, \alpha_t)}}\left[(x_0, x_1)\right]$
    $x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, \bar{x}_0 + \alpha_{t+1}\, \bar{x}_1$
end for

Figure 3 (panels: T = 10 steps, T = 1000 steps): Both algorithms step iteratively by moving the samples along segments defined by their posterior densities. The difference is that Algorithm 1 uses segments between random posterior samples, which creates stochastic paths, while Algorithm 2 uses the segment between the averages of the posterior samples, which creates deterministic paths. As the number of steps $T$ increases, the randomness of the stochastic paths averages out and they converge towards the deterministic paths.
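Both algorithms can be run exactly when the posteriors are known in closed form. The following sketch assumes a hypothetical 1D setting that is not taken from the paper: $p_0 = \mathcal{N}(0, 1)$ and $p_1 = \mathcal{N}(\mu, \sigma^2)$, for which the posterior of $x_1$ given a blended point follows from standard Gaussian conditioning, and the posterior $x_0$ is then fixed by the blending constraint. The names `MU`, `SIGMA`, `posterior_x1` and `iadb` are illustrative. In this setting the deterministic limit works out to $x_1 = \mu + \sigma x_0$, which the deterministic run should approach as $T$ grows.

```python
import math
import random

MU, SIGMA = 2.0, 0.5  # hypothetical target p1 = N(MU, SIGMA^2); p0 = N(0, 1)


def posterior_x1(x, alpha):
    """Mean and std of x1 given the blended point x (Gaussian conditioning)."""
    v = (1 - alpha) ** 2 + (alpha * SIGMA) ** 2  # variance of x_alpha
    mean = MU + alpha * SIGMA ** 2 / v * (x - alpha * MU)
    std = SIGMA * (1 - alpha) / math.sqrt(v)
    return mean, std


def iadb(x0, T, rng=None):
    """Algorithm 1 if rng is given (random posteriors), else Algorithm 2."""
    x = x0
    for t in range(T):
        a, a_next = t / T, (t + 1) / T
        mean, std = posterior_x1(x, a)
        # Deblend: random posterior sample (Alg. 1) or its expectation (Alg. 2).
        x1 = rng.gauss(mean, std) if rng is not None else mean
        # The matching x0 is fixed by x = (1 - a) * x0 + a * x1.
        x0_post = (x - a * x1) / (1 - a)
        # Blend again with the next parameter alpha_{t+1}.
        x = (1 - a_next) * x0_post + a_next * x1
    return x


deterministic = iadb(1.0, T=2000)
stochastic = iadb(1.0, T=20000, rng=random.Random(0))
```

With few steps the stochastic endpoint scatters around the deterministic one, and the scatter shrinks as $T$ increases, which is the behaviour Theorem 1 describes.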

3. LEARNING ITERATIVE α-(DE)BLENDING

In this section, we explain how to use iterative α-(de)blending in a machine learning context, where we train a neural network D θ to predict the average posterior samples used in Algorithm 2.

3.1. VARIANT FORMULATIONS OF ITERATIVE α-(DE)BLENDING

A direct transposition of Algorithm 2 means learning the averages of both posterior samples $\bar{x}_0$ and $\bar{x}_1$. However, each one is implicitly given by the other, so it is not necessary to learn both, and variants of Algorithm 2 are possible. The fact that multiple, theoretically equivalent, variants are possible is pointed out by Salimans & Ho (2022). However, they are not equivalent in practice. In Table 1, we summarize four variants derived in Appendix C and compare their practical properties.

Variant (a) is the vanilla transposition of Algorithm 2. It is highly unstable because instead of moving the current sample $x_{\alpha_t}$, the new sample $x_{\alpha_{t+1}}$ is plainly recomputed from the outputs of the neural network, such that its residual learning errors accumulate in each step. The larger the number of steps $T$, the more this variant diverges.

Variants (b) and (c) consist of learning only $\bar{x}_0$ or only $\bar{x}_1$, respectively. The sampling suffers from numerical instability near $\alpha_t = 0$ and $\alpha_t = 1$, respectively, because of the divisions by $\alpha_t$ and $1 - \alpha_t$.

We recommend using variant (d), which consists of learning the average difference vector $\bar{x}_1 - \bar{x}_0$. It is a direct transposition of the ODE defined in Equation 2. This variant updates the current samples in each iteration without any division. We found it to be the most stable variant for both training and sampling.

Table 1: Variant formulations of iterative α-(de)blending (equivalent in theory, not in practice).
(a) learn $\bar{x}_0$ and $\bar{x}_1$: $(\bar{x}_0, \bar{x}_1) = D_\theta(x_{\alpha_t}, \alpha_t)$; update $x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, \bar{x}_0 + \alpha_{t+1}\, \bar{x}_1$; unstable.
(b) learn only $\bar{x}_0$: $\bar{x}_0 = D_\theta(x_{\alpha_t}, \alpha_t)$; update $x_{\alpha_{t+1}} = \bar{x}_0 + \frac{\alpha_{t+1}}{\alpha_t} (x_{\alpha_t} - \bar{x}_0)$; unstable when $\alpha_t \to 0$.
(c) learn only $\bar{x}_1$: $\bar{x}_1 = D_\theta(x_{\alpha_t}, \alpha_t)$; update $x_{\alpha_{t+1}} = \bar{x}_1 + \frac{1 - \alpha_{t+1}}{1 - \alpha_t} (x_{\alpha_t} - \bar{x}_1)$; unstable when $\alpha_t \to 1$.
(d) learn $\bar{x}_1 - \bar{x}_0$: $\bar{x}_1 - \bar{x}_0 = D_\theta(x_{\alpha_t}, \alpha_t)$; update $x_{\alpha_{t+1}} = x_{\alpha_t} + (\alpha_{t+1} - \alpha_t)(\bar{x}_1 - \bar{x}_0)$; stable.
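The equivalence of the four update rules, and part of the reason they differ in practice, can be illustrated with scalar values. This is a minimal sketch with hypothetical numbers: `xb0` and `xb1` play the role of exact posterior averages, and `eps` is an assumed error in the network output. It checks that the four updates of Table 1 give the same next sample when the averages are exact, then compares how the same output error moves the next sample under variants (b) and (d) for a small $\alpha_t$.

```python
def step_a(xb0, xb1, a_next):
    return (1 - a_next) * xb0 + a_next * xb1


def step_b(x, xb0, a, a_next):
    return xb0 + a_next / a * (x - xb0)


def step_c(x, xb1, a, a_next):
    return xb1 + (1 - a_next) / (1 - a) * (x - xb1)


def step_d(x, diff, a, a_next):
    return x + (a_next - a) * diff


xb0, xb1 = -0.3, 1.7         # hypothetical exact posterior averages
a, a_next = 0.25, 0.5
x = (1 - a) * xb0 + a * xb1  # the current sample is their blend
steps = [
    step_a(xb0, xb1, a_next),
    step_b(x, xb0, a, a_next),
    step_c(x, xb1, a, a_next),
    step_d(x, xb1 - xb0, a, a_next),
]

# Same output error eps, applied at a small alpha_t.
eps, s, s_next = 1e-3, 1e-3, 2e-3
x_s = (1 - s) * xb0 + s * xb1
err_b = abs(step_b(x_s, xb0 + eps, s, s_next) - step_b(x_s, xb0, s, s_next))
err_d = abs(step_d(x_s, xb1 - xb0 + eps, s, s_next) - step_d(x_s, xb1 - xb0, s, s_next))
```

In this example the error moves the next sample by `eps * |1 - s_next / s|` under variant (b) and by `eps * (s_next - s)` under variant (d), and variant (b) is undefined at $\alpha_t = 0$.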

3.2. TRAINING AND SAMPLING

Following variant (d) of Table 1, we train the neural network $D_\theta$ to predict the average difference vector between the posterior samples. Our learning objective is defined by

$\min_\theta\ E_{\alpha,\, x_\alpha} \left[ \left\| D_\theta(x_\alpha, \alpha) - E_{x_{0|(x_\alpha,\alpha)},\, x_{1|(x_\alpha,\alpha)}} \left[ x_{1|(x_\alpha,\alpha)} - x_{0|(x_\alpha,\alpha)} \right] \right\|^2 \right].$ (3)

Note that minimizing the squared $\ell_2$ distance to the average of a distribution has the same minimizer as minimizing the expected squared $\ell_2$ distance to the samples of the distribution. We obtain the equivalent objective

$\min_\theta\ E_{\alpha,\, x_\alpha,\, x_{0|(x_\alpha,\alpha)},\, x_{1|(x_\alpha,\alpha)}} \left[ \left\| D_\theta(x_\alpha, \alpha) - \left( x_{1|(x_\alpha,\alpha)} - x_{0|(x_\alpha,\alpha)} \right) \right\|^2 \right].$ (4)

Finally, as explained in Section 2.1, sampling $x_\alpha \sim p_\alpha$ and $(x_{0|(x_\alpha,\alpha)}, x_{1|(x_\alpha,\alpha)})$ in this order is equivalent to sampling $x_0 \sim p_0$ and $x_1 \sim p_1$ and blending them to obtain $x_\alpha \sim p_\alpha$. We obtain our final learning objective

$\min_\theta\ E_{\alpha,\, x_0,\, x_1} \left[ \left\| D_\theta((1 - \alpha)\, x_0 + \alpha\, x_1,\ \alpha) - (x_1 - x_0) \right\|^2 \right],$

which we use to optimize $\theta$ in Algorithm 3. Finally, in Algorithm 4, we iteratively map samples $x_0 \sim p_0$ to samples $x_1 \sim p_1$ in the same way as in Algorithm 2, where we use the neural network $D_\theta$ to obtain the average posterior difference.
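The step from the objective on the posterior average to the objective on posterior samples rests on a standard decomposition, sketched here for a fixed $(x_\alpha, \alpha)$. The shorthand $y$, $\bar{y}$ and $D$ is introduced only for this illustration: $y = x_{1|(x_\alpha,\alpha)} - x_{0|(x_\alpha,\alpha)}$ is the random posterior difference, $\bar{y} = E[y]$ its average, and $D = D_\theta(x_\alpha, \alpha)$ the network output.

```latex
\begin{align*}
E_y\left[\|D - y\|^2\right]
  &= E_y\left[\|(D - \bar{y}) + (\bar{y} - y)\|^2\right] \\
  &= \|D - \bar{y}\|^2
     + 2\,(D - \bar{y})^\top \underbrace{E_y\left[\bar{y} - y\right]}_{=\,0}
     + E_y\left[\|\bar{y} - y\|^2\right] \\
  &= \|D - \bar{y}\|^2 + E_y\left[\|y - \bar{y}\|^2\right].
\end{align*}
```

The last term is the posterior variance and does not depend on $\theta$, so the two objectives differ by a constant and share the same minimizer, although their values are not equal.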

Algorithm 3 Training

Require: $x_0 \sim p_0$, $x_1 \sim p_1$, $\alpha \in [0, 1]$
$x_\alpha = (1 - \alpha)\, x_0 + \alpha\, x_1$
$l = \| D_\theta(x_\alpha, \alpha) - (x_1 - x_0) \|^2$
backprop from $l$ and update $\theta$

Algorithm 4 Sampling

Require: $x_0$, $T$
for $t = 0, .., T-1$ do
    $x_{\alpha_{t+1}} = x_{\alpha_t} + (\alpha_{t+1} - \alpha_t)\, D_\theta(x_{\alpha_t}, \alpha_t)$
end for
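The training and sampling loops can be exercised end to end without a deep learning framework. The sketch below is not the paper's setup: it swaps the neural network $D_\theta$ for one least-squares line per blending parameter $\alpha_t$, which is an assumption that is adequate only for the hypothetical 1D Gaussian densities used here ($p_0 = \mathcal{N}(0,1)$, $p_1 = \mathcal{N}(\mu, \sigma^2)$), where the average posterior difference is linear in $x_\alpha$. The names `fit_line`, `theta`, `D` and `sample` are illustrative.

```python
import random

rng = random.Random(0)
MU, SIGMA = 2.0, 0.5  # hypothetical target p1 = N(MU, SIGMA^2); p0 = N(0, 1)
T, N = 200, 1000      # sampling steps and training pairs per alpha_t


def fit_line(xs, ys):
    """Least-squares fit y ~ w * x + b, i.e. the l2 objective for a linear model."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return w, my - w * mx


# Algorithm 3 (training), with one linear model per alpha_t standing in for D_theta.
theta = []
for t in range(T):
    alpha = t / T
    inputs, targets = [], []
    for _ in range(N):
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(MU, SIGMA)
        inputs.append((1 - alpha) * x0 + alpha * x1)  # blended sample x_alpha
        targets.append(x1 - x0)                       # regression target
    theta.append(fit_line(inputs, targets))


def D(x, t):
    """Stand-in for D_theta(x_{alpha_t}, alpha_t)."""
    w, b = theta[t]
    return w * x + b


def sample(x0):
    """Algorithm 4 (sampling) with the uniform schedule alpha_t = t / T."""
    x = x0
    for t in range(T):
        x += (1.0 / T) * D(x, t)
    return x
```

For these Gaussians the exact mapping is $x_0 \mapsto \mu + \sigma x_0$, so `sample(1.0)` should land near 2.5 up to regression noise and discretization error.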

4. EXPERIMENTS WITH ANALYTIC DENSITIES

Mapping 1D densities. In Figure 4, we experiment with analytic 1D densities where the expectation $\bar{x}_1 - \bar{x}_0$ can be computed analytically rather than being learnt by a neural network $D_\theta$. The experiment confirms that the analytic version matches the reference and that the neural network trained with the $\ell_2$ loss approximates the same mapping. We also tested training the neural network with the $\ell_1$ loss, which makes the neural network approximate the median of $x_1 - x_0$ rather than its average. The resulting mapping does not match the reference. This confirms that learning the average via $\ell_2$ training is a key component of our model, as explained in Section 3.2.

Mapping 2D densities. Figure 5 shows that the intermediate blended densities $p_\alpha$ computed by our mapping match the reference blended densities. Figure 6 shows how our algorithm maps the samples of $p_0$ to samples of $p_1$.

[Figure panels: rows "reference" and "IADB"; columns $\alpha = 0$, $\alpha = 1/3$, $\alpha = 2/3$, $\alpha = 1$.]
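The difference between the two losses can be seen on a handful of numbers. In the sketch below, `samples` is a hypothetical skewed set of values standing in for posterior differences $x_1 - x_0$ at a fixed $(x_\alpha, \alpha)$, and the best constant prediction is found by a plain grid search; neither comes from the paper's experiments.

```python
from statistics import mean, median

samples = [0.0, 0.0, 0.0, 1.0, 10.0]      # hypothetical skewed values of x1 - x0
candidates = [i / 100 for i in range(1001)]  # grid of constant predictions in [0, 10]


def best(loss):
    """Constant prediction minimizing the summed loss over the samples."""
    return min(candidates, key=lambda c: sum(loss(c - s) for s in samples))


best_l2 = best(lambda e: e * e)  # squared error: lands on the mean
best_l1 = best(abs)              # absolute error: lands on the median
```

The squared loss selects the mean of the samples and the absolute loss selects their median, so for skewed posteriors a network trained with $\ell_1$ would follow a different vector field than the one in Equation 2.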

5. RELATION TO DDIM

In this section, we explain that IADB can be framed as a more stable generalization of DDIM (Song et al., 2021a) that enjoys the possibility of using non-Gaussian densities for $p_0$.

Proposition 2. If $p_0$ is a Gaussian density, IADB and DDIM define the same deterministic mapping.

Proof. In Appendix D we show that with a simple change of parameterization, the update rule of DDIM is exactly the variant (b) in our Table 1.

Experimental comparison of IADB and DDIM. In Figure 7, we experiment under the same conditions (architecture, training time, 1st-order solver, uniform schedule, Gaussian $p_0$) and measure the FID score (Heusel et al., 2017) for a varying number of sampling steps on 3 image datasets: LSUN Bedrooms (64x64), CelebA (64x64) and AFHQ Cats (128x128), with 120 hours of training. We use a U-Net architecture from the Diffusers library. We observe a consistently better performance of IADB compared to DDIM. This might be due to our numerically stable formulation explained in Section 3.1. The formulation generally used in DDIM corresponds to the variant (b) presented in Table 1: they train a denoiser to predict the Gaussian noise present in the noisy image samples, i.e. their model learns to predict $\bar{x}_0$. However, as explained in Section 3.1, this variant makes the sampling less stable because of the division by $\alpha_t$ near 0. As a matter of fact, in their implementation, the sampler starts at some $\epsilon > 0$ precisely to avoid dividing by 0. Our variant (d) does not suffer from this problem. Another possibility is that the learning objective defined by variant (d) provides a better optimization landscape than variant (b). For instance, the effort for learning $\bar{x}_0$ is likely imbalanced in $\alpha$ while the effort for learning $\bar{x}_1 - \bar{x}_0$ is likely more balanced over $\alpha$.

Discussion. The Langevin/score-matching approach puts the emphasis on the fact that the gradient of the log density in Langevin's equation can be learnt via $\bar{x}_0$ when $p_0$ is Gaussian (Vincent, 2011).
This mindset naturally leads to variant (b) in Table 1. In contrast, the derivation of IADB emphasizes that $x_\alpha$, $\bar{x}_0$ and $\bar{x}_1$ are aligned and thus that all the four variants of

6. (DISAPPOINTING) EXPERIMENTS WITH ARBITRARY IMAGE DENSITIES

(Disappointing) sampling quality. In the experiment of Figure 8, we use IADB to compute mappings between different image datasets, i.e. in contrast to the experiment of Figure 7, we use real images rather than Gaussian noise to sample other images. Note that, in contrast to the Gaussian density, the implicit density represented by an image dataset might not be Riemann-integrable, for instance if the images are distributed on a lower-dimensional manifold. To make sure that our theorem applies, we regularize $p_0$ by adding a small amount of noise to the images. In theory, with this regularization, IADB is proven to produce a correct sampling of $p_1$ regardless of $p_0$. However, in practice, we observe that the qualitative performance of the mapping is significantly lower than with Gaussian noise. This might be because we use an architecture that has been designed specifically for denoising Gaussian noise or because denoising noise might be fundamentally simpler than deblending arbitrary images. In any case, our takeaway is that an experimental setup that works well with Gaussian noise does not necessarily transfer successfully to other densities.

(Disappointing) mappings. For some applications, we would like IADB to learn a "meaningful" mapping between $p_0$ and $p_1$. Unfortunately, this is not always the case. Indeed, the theory predicts that the mapping will produce a valid sampling of $p_1$ using $p_0$ but not that the mapping will be what a human user expects. For instance, Figure 9 shows a result where IADB was trained to learn a mapping between corrupted images and clean images, and we would like the mapping to behave like an image restoration process. Unfortunately, although IADB effectively learns to sample clean images using corrupted ones, the mapping does not produce a valid restoration because the clean images do not always resemble the corrupted ones.

Stochastic Differential Equations (SDEs).
The random sequence computed by the stochastic version of IADB presented in Algorithm 1 is a Markov chain. This algorithm might thus be reminiscent of stochastic diffusion models (Song et al., 2021b) based on SDEs. However, it is not related to an SDE. Indeed, SDEs model stochastic behaviors at the infinitesimal scale while our mapping is stochastic for discrete steps and becomes a deterministic ODE in the infinitesimal limit.

Non-Gaussian denoising diffusion. Some previous works are dedicated to replacing Gaussian noise by other noise distributions such as the generalised normal (exponential power) distribution (Deasy et al., 2021) or the Gamma distribution (Nachmani et al., 2021). Our more general derivation works with any finite-variance density rather than specific noise alternatives. Peluchetti (2022) proposes a more general SDE framework. Our ODE can be derived from his SDE by nullifying the stochastic component and following the aggregation method.

Alternative deterministic diffusion. Cold diffusion (Bansal et al., 2022) shows that diffusion-like methods can reverse a variety of empirical image degradation processes. Our method is similar to their animorphosis example where human faces are progressively blended and deblended with animal faces. Our model provides a proven way to sample from the right density for this application.

Image-to-image translation. In Section 6, we have seen that IADB's mapping does not provide a faithful image-to-image translation. Previous works show that conditioning the diffusion process seems to be necessary to get faithful translations, for instance by adding an energy guide during the ODE integration (Zhao et al., 2022), by progressively injecting features (Meng et al., 2021), or by sampling conditional densities (Saharia et al., 2021; 2022a).
In Figure 10, we experiment with the latter with Gaussian noise for $x_0$, clean images for $x_1$, and a corrupted version of $x_1$ for the condition $c$ passed as an additional argument to the neural network: $D_\theta((1 - \alpha)\, x_0 + \alpha\, x_1,\ c,\ \alpha) = \bar{x}_1 - \bar{x}_0$.

8. CONCLUSION

The objective of this work was to find a simple and intuitive way to approach deterministic diffusion. We derived Iterative α-(de)Blending (IADB), a deterministic diffusion model based on a sampling interpretation of blending and deblending. Our model is similar to DDIM (Song et al., 2021a) but its derivation is significantly simpler and reveals that the model is valid for arbitrary (finite-variance) densities rather than only Gaussian densities. Furthermore, our derivation leads to a variant learning formulation that happens to be more numerically stable than the one of DDIM.

An important takeaway of our experiments is that our results in image generation are significantly worse when using non-Gaussian densities. The theory allows the use of non-Gaussian densities for sampling and our experiments on 2D non-Gaussian densities were successful in this regard. However, this did not transfer well to image generation in practice. This might be because we used neural networks that have been designed specifically for the denoising task or because denoising noise might be fundamentally simpler than deblending arbitrary images. Finally, note that we experimented with our model in its vanilla setting with a uniform blending schedule and a first-order ODE solver. It might benefit from the improvements brought to denoising diffusion such as better blending schedules and higher-order ODE solvers (Karras et al., 2022).

A LAW OF TOTAL PROBABILITY

The law of total probability states that the prior density $p_0 \times p_1$ can be written as a weighted linear combination of its posterior densities:

$p_0(x_0)\, p_1(x_1) = \int_{\mathbb{R}^d} \left( p_0(x_0)\, p_1(x_1) \right)_{|(x_\alpha,\alpha)}\ p_\alpha(x_\alpha)\ dx_\alpha,$

where $p_\alpha(x_\alpha)$ is the weight of each posterior density $(p_0(x_0)\, p_1(x_1))_{|(x_\alpha,\alpha)}$. As a result, sampling the prior density $p_0 \times p_1$ can be achieved by choosing a random posterior density proportionally to its weight, i.e. sampling $x_\alpha \sim p_\alpha$, and choosing a random sample in the posterior density conditioned by parameter $x_\alpha$, i.e. sampling $(x_0, x_1)_{|(x_\alpha \sim p_\alpha,\, \alpha)}$. This is illustrated in Figure 11.

We first recall some properties that are required in the derivation of the limit of Algorithm 1.

The posterior distributions have finite variance. The theorem requires that $p_0$ and $p_1$ are of finite variance, such that their respective posterior densities $p_{0|(x,\alpha)}$ and $p_{1|(x,\alpha)}$ are also of finite variance. This is because the idea of the proof is that averaging many small random steps (provided by the posterior distributions) converges towards their expectations, which is true only if their variance is finite. We use this between Equation (23) and Equation (24).

The expectations of the posterior distributions are continuous. If $p_0$ and $p_1$ are classic, Riemann-integrable densities, then they are continuous almost everywhere. Since the blended distributions are essentially convolutions of $p_0$ and $p_1$, it follows that the posterior densities $p_{0|(x,\alpha)}$ and $p_{1|(x,\alpha)}$ are also continuous almost everywhere and the expectations of their samples $x_{0|(x,\alpha)} \sim p_{0|(x,\alpha)}$ and $x_{1|(x,\alpha)} \sim p_{1|(x,\alpha)}$ are continuous everywhere (the expectation cancels out the null set where they are not continuous). In summary, for any $x \in \mathbb{R}^d$ and $\alpha \in [0, 1]$ we have:

$\lim_{x' \to x} E\left[x_{0|(x',\alpha)}\right] = E\left[x_{0|(x,\alpha)}\right], \qquad \lim_{x' \to x} E\left[x_{1|(x',\alpha)}\right] = E\left[x_{1|(x,\alpha)}\right],$
$\lim_{\alpha' \to \alpha} E\left[x_{0|(x,\alpha')}\right] = E\left[x_{0|(x,\alpha)}\right], \qquad \lim_{\alpha' \to \alpha} E\left[x_{1|(x,\alpha')}\right] = E\left[x_{1|(x,\alpha)}\right].$

We use this between Equation (24) and Equation (25).

B.2 OBJECTIVE OF THE PROOF

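To prove that Algorithm 1 and Algorithm 2 converge towards the same limit as the number of steps $T$ increases, we need to show that the trajectories of the samples are the same. This is the case if, in the limit, the derivatives $\frac{dx_\alpha}{d\alpha}$ are the same with both algorithms. The discrete update at step $t$ is:

$\Delta\alpha_t = \alpha_{t+1} - \alpha_t = \frac{1}{T}, \qquad \Delta x_{\alpha_t} = x_{\alpha_{t+1}} - x_{\alpha_t},$

and we want to prove that for any $\alpha \in [0, 1]$ and at point $x_\alpha \in \mathbb{R}^d$ the continuous limit exists and is the same with both algorithms:

$\frac{dx_\alpha}{d\alpha} = \lim_{\Delta\alpha \to 0} \frac{\Delta x_\alpha}{\Delta\alpha}.$

B.3 LIMIT OF ALGORITHM 2

In step $t$ of Algorithm 2 we use the averages of the posterior samples, which are such that:

$x_{\alpha_t} = (1 - \alpha_t)\, \bar{x}_{0|(x_{\alpha_t},\alpha_t)} + \alpha_t\, \bar{x}_{1|(x_{\alpha_t},\alpha_t)},$ (12)
$x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, \bar{x}_{0|(x_{\alpha_t},\alpha_t)} + \alpha_{t+1}\, \bar{x}_{1|(x_{\alpha_t},\alpha_t)},$ (13)

where Equation (12) is a property of the average posteriors of $x_{\alpha_t}$ and Equation (13) is true by definition in Algorithm 2. We thus have the discrete difference:

$\Delta x_{\alpha_t} = x_{\alpha_{t+1}} - x_{\alpha_t} = \Delta\alpha_t \left( \bar{x}_{1|(x_{\alpha_t},\alpha_t)} - \bar{x}_{0|(x_{\alpha_t},\alpha_t)} \right).$ (14)

We obtain the discrete ratio

$\frac{\Delta x_\alpha}{\Delta\alpha} = \bar{x}_{1|(x_\alpha,\alpha)} - \bar{x}_{0|(x_\alpha,\alpha)},$ (15)

which is independent of $\Delta\alpha$. The limit hence exists and is defined by

$\frac{dx_\alpha}{d\alpha} = \lim_{\Delta\alpha \to 0} \frac{\Delta x_\alpha}{\Delta\alpha} = \frac{\Delta x_\alpha}{\Delta\alpha} = \bar{x}_{1|(x_\alpha,\alpha)} - \bar{x}_{0|(x_\alpha,\alpha)}.$ (16)

B.4 LIMIT OF ALGORITHM 1

In step $t$ of Algorithm 1 we sample random posterior samples $x_{0|(x_{\alpha_t},\alpha_t)}$ and $x_{1|(x_{\alpha_t},\alpha_t)}$ that are such that:

$x_{\alpha_t} = (1 - \alpha_t)\, x_{0|(x_{\alpha_t},\alpha_t)} + \alpha_t\, x_{1|(x_{\alpha_t},\alpha_t)},$ (17)
$x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, x_{0|(x_{\alpha_t},\alpha_t)} + \alpha_{t+1}\, x_{1|(x_{\alpha_t},\alpha_t)},$ (18)

where Equation (17) is a property of the posteriors of $x_{\alpha_t}$ and Equation (18) is true by definition in Algorithm 1. We thus have the discrete difference:

$\Delta x_{\alpha_t} = x_{\alpha_{t+1}} - x_{\alpha_t} = \Delta\alpha_t \left( x_{1|(x_{\alpha_t},\alpha_t)} - x_{0|(x_{\alpha_t},\alpha_t)} \right).$ (19)

We obtain the discrete difference for any parameter $\alpha \in [0, 1]$ and any location $x_\alpha \in \mathbb{R}^d$:

$\Delta x_\alpha = \Delta\alpha \left( x_{1|(x_\alpha,\alpha)} - x_{0|(x_\alpha,\alpha)} \right).$ (20)

Furthermore, increasing the number of steps is equivalent to decomposing each step $\Delta\alpha$ into $N$ smaller steps $\Delta\alpha/N$. We rewrite the discrete difference as

$\Delta x_\alpha = \frac{\Delta\alpha}{N} \sum_{n=0}^{N-1} \left( x_{1|(x_{\alpha + n\Delta\alpha/N},\ \alpha + n\Delta\alpha/N)} - x_{0|(x_{\alpha + n\Delta\alpha/N},\ \alpha + n\Delta\alpha/N)} \right).$ (21)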
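With this modification, if the derivative exists, it is defined by the limit:

$\frac{dx_\alpha}{d\alpha} = \lim_{\Delta\alpha \to 0} \lim_{N \to \infty} \frac{\Delta x_\alpha}{\Delta\alpha}$ (22)
$\phantom{\frac{dx_\alpha}{d\alpha}} = \lim_{\Delta\alpha \to 0} \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \left( x_{1|(x_{\alpha + n\Delta\alpha/N},\ \alpha + n\Delta\alpha/N)} - x_{0|(x_{\alpha + n\Delta\alpha/N},\ \alpha + n\Delta\alpha/N)} \right).$ (23)

Thanks to the finite-variance condition on $p_0$ and $p_1$, the normalized sum converges towards the average of the posterior samples over $\alpha' \in [\alpha, \alpha + \Delta\alpha]$ as $N$ increases:

$\frac{dx_\alpha}{d\alpha} = \lim_{\Delta\alpha \to 0} \left( E_{\alpha' \in [\alpha, \alpha + \Delta\alpha]}\left[ x_{1|(x_{\alpha'},\alpha')} \right] - E_{\alpha' \in [\alpha, \alpha + \Delta\alpha]}\left[ x_{0|(x_{\alpha'},\alpha')} \right] \right).$ (24)

Finally, because the expectations of the posterior densities are continuous, the expectations over $[\alpha, \alpha + \Delta\alpha]$ converge towards the expectation at $\alpha$, such that

$\frac{dx_\alpha}{d\alpha} = E\left[ x_{1|(x_\alpha,\alpha)} \right] - E\left[ x_{0|(x_\alpha,\alpha)} \right] = \bar{x}_{1|(x_\alpha,\alpha)} - \bar{x}_{0|(x_\alpha,\alpha)}.$ (25)

This is the same result as in Equation (16) with Algorithm 2.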
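As a worked instance of this limit, the ODE can be solved by hand in a simple case. The setting below is an assumption made for illustration and does not appear in the paper: one dimension, $p_0 = \mathcal{N}(0, 1)$ and $p_1 = \mathcal{N}(\mu, \sigma^2)$, with $v(\alpha)$ introduced as shorthand for the variance of $x_\alpha$. The posterior averages follow from standard Gaussian conditioning.

```latex
\begin{align*}
x_\alpha &\sim \mathcal{N}\big(\alpha\mu,\ v(\alpha)\big),
  \qquad v(\alpha) = (1-\alpha)^2 + \alpha^2\sigma^2, \\
\bar{x}_{0|(x_\alpha,\alpha)} &= \frac{1-\alpha}{v(\alpha)}\,(x_\alpha - \alpha\mu),
  \qquad
\bar{x}_{1|(x_\alpha,\alpha)} = \mu + \frac{\alpha\sigma^2}{v(\alpha)}\,(x_\alpha - \alpha\mu), \\
\frac{dx_\alpha}{d\alpha} &= \bar{x}_{1|(x_\alpha,\alpha)} - \bar{x}_{0|(x_\alpha,\alpha)}
  = \mu + \frac{v'(\alpha)}{2\,v(\alpha)}\,(x_\alpha - \alpha\mu), \\
x_\alpha &= \alpha\mu + x_0\,\sqrt{v(\alpha)}
  \quad\Longrightarrow\quad x_1 = \mu + \sigma\,x_0 .
\end{align*}
```

The third line uses $v'(\alpha) = 2\,(\alpha\sigma^2 - (1-\alpha))$, and the last line can be checked by differentiating it. In this case the deterministic mapping is the affine map that rescales and shifts the initial sample.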

C VARIANT FORMULATIONS

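We derive the variant formulations introduced in Section 3.1.

Blended samples. A blended sample is by definition the blending of its posterior samples:

$x_{\alpha_t} = (1 - \alpha_t)\, x_0 + \alpha_t\, x_1.$ (26)

Since blending is linear, a blended sample is also the blending of the averages of its posterior samples:

$x_{\alpha_t} = (1 - \alpha_t)\, \bar{x}_0 + \alpha_t\, \bar{x}_1.$ (27)

We can thus rewrite its average posterior samples $\bar{x}_0$ and $\bar{x}_1$ in the following way:

$\bar{x}_0 = \frac{x_{\alpha_t}}{1 - \alpha_t} - \frac{\alpha_t\, \bar{x}_1}{1 - \alpha_t},$ (28)
$\bar{x}_1 = \frac{x_{\alpha_t}}{\alpha_t} - \frac{(1 - \alpha_t)\, \bar{x}_0}{\alpha_t}.$ (29)

Variant (a): In the vanilla version of the algorithm, a blended sample of parameter $\alpha_{t+1}$ is obtained by blending $\bar{x}_0$ and $\bar{x}_1$:

$x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, \bar{x}_0 + \alpha_{t+1}\, \bar{x}_1.$ (30)

Variant (b): By expanding $\bar{x}_1$ in Equation (30) using Equation (29), we obtain:

$x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, \bar{x}_0 + \alpha_{t+1}\, \bar{x}_1$
$\phantom{x_{\alpha_{t+1}}} = (1 - \alpha_{t+1})\, \bar{x}_0 + \alpha_{t+1} \left( \frac{x_{\alpha_t}}{\alpha_t} - \frac{(1 - \alpha_t)\, \bar{x}_0}{\alpha_t} \right)$
$\phantom{x_{\alpha_{t+1}}} = \left( 1 - \frac{\alpha_{t+1}}{\alpha_t} \right) \bar{x}_0 + \frac{\alpha_{t+1}}{\alpha_t}\, x_{\alpha_t}$
$\phantom{x_{\alpha_{t+1}}} = \bar{x}_0 + \frac{\alpha_{t+1}}{\alpha_t} \left( x_{\alpha_t} - \bar{x}_0 \right).$

Variant (c): By expanding $\bar{x}_0$ in Equation (30) using Equation (28), we obtain:

$x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, \bar{x}_0 + \alpha_{t+1}\, \bar{x}_1$
$\phantom{x_{\alpha_{t+1}}} = (1 - \alpha_{t+1}) \left( \frac{x_{\alpha_t}}{1 - \alpha_t} - \frac{\alpha_t\, \bar{x}_1}{1 - \alpha_t} \right) + \alpha_{t+1}\, \bar{x}_1$
$\phantom{x_{\alpha_{t+1}}} = \left( \alpha_{t+1} - \frac{(1 - \alpha_{t+1})\, \alpha_t}{1 - \alpha_t} \right) \bar{x}_1 + \frac{1 - \alpha_{t+1}}{1 - \alpha_t}\, x_{\alpha_t}$
$\phantom{x_{\alpha_{t+1}}} = \left( 1 - \frac{1 - \alpha_{t+1}}{1 - \alpha_t} \right) \bar{x}_1 + \frac{1 - \alpha_{t+1}}{1 - \alpha_t}\, x_{\alpha_t}$
$\phantom{x_{\alpha_{t+1}}} = \bar{x}_1 + \frac{1 - \alpha_{t+1}}{1 - \alpha_t} \left( x_{\alpha_t} - \bar{x}_1 \right).$

Variant (d): By rewriting $\alpha_{t+1} = \alpha_{t+1} + \alpha_t - \alpha_t$ in the definition of $x_{\alpha_{t+1}}$, we obtain:

$x_{\alpha_{t+1}} = (1 - \alpha_{t+1})\, \bar{x}_0 + \alpha_{t+1}\, \bar{x}_1$
$\phantom{x_{\alpha_{t+1}}} = (1 - \alpha_t)\, \bar{x}_0 + \alpha_t\, \bar{x}_1 + (\alpha_{t+1} - \alpha_t)(\bar{x}_1 - \bar{x}_0)$
$\phantom{x_{\alpha_{t+1}}} = x_{\alpha_t} + (\alpha_{t+1} - \alpha_t)(\bar{x}_1 - \bar{x}_0).$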



Footnote: Diffusers library, https://github.com/huggingface/diffusers



Figure 1: Iterative α-blending and deblending. We train a neural network to deblend blended inputs. By deblending and reblending iteratively we obtain a mapping between arbitrary densities.

Figure 2: Blending and deblending as sampling operations.

Table 1 column headers: (a) learn $\bar{x}_0$ and $\bar{x}_1$; (b) learn only $\bar{x}_0$; (c) learn only $\bar{x}_1$; (d) learn $\bar{x}_1 - \bar{x}_0$.

Figure 4: We map a bi-Normal distribution with modes $\mu_1 = -0.5$ and $\mu_2 = 0.5$ with $\sigma_{1|2} = 0.1$ (in red) to a Normal distribution of unit variance (in blue). The reference shows the intermediate blended densities $p_\alpha$ obtained by analytically convolving both densities. The other densities are the histograms of the samples $x_\alpha$ computed in Algorithm 4 using either analytic expressions or neural networks (nn) for $D_\theta$. The neural network is an MLP with 5 hidden layers of 64 filters.

Figure 5: We display blended samples $x_\alpha \sim p_\alpha$ (red) and samples $x_\alpha$ computed by Algorithm 4 using an MLP with 5 hidden layers of 64 filters for $D_\theta$ (green).

Figure 6: We show samples of the reference densities $p_0$ and $p_1$ and the mapping computed by Algorithm 4 using an MLP with 5 hidden layers of 64 filters for $D_\theta$. The final samples computed by the algorithm (green) match the reference samples $x_1 \sim p_1$ (red).

Figure 8: Sampling with non-Gaussian densities. We use the same experimental setup as in Figure 7 except that we replace the Gaussian noise by an image database.

Figure 9: Image restoration with IADB. In this experiment, we use IADB to map corrupted images to clean images. (left) The corruption is a downscaling+noise. (right) The corruption is a decolorization+noise. Ideally, the mapping learnt by IADB would restore the corrupted image. Unfortunately, the mapping creates a clean image but it does not match the corrupted one anymore.

Figure 10: Conditional image restoration with IADB. From a corrupted image, obtained by either downscaling (left) or decolorization (right), we create various restorations using different Gaussian noise samples $x_0$.

Figure 11: Each posterior density represents only a subset of the prior density. However, thanks to the law of total probability, we know that the average of random posterior densities is exactly the prior density. As a result, sampling random blended samples $x_\alpha \sim p_\alpha$ and random posterior samples $(x_0, x_1)_{|(x_\alpha,\alpha)} \sim (p_0 \times p_1)_{|(x_\alpha,\alpha)}$ is equivalent to sampling random $(x_0, x_1) \sim p_0 \times p_1$ directly in the prior density.

Figure 12: We trained an MLP with 5 hidden layers of 64 filters to learn $D_\theta$ for IADB (a) and the same architecture to learn $\epsilon_\theta$ for DDIM (b) and (c). For (c), we convert points generated by DDIM using the scaling equation. The trajectories of the samples for IADB (a) and rescaled DDIM (c) match.

D RELATION TO DDIM

In this section, we follow the notation of Song et al. (2021a): $x_0$ is a sample of a target density and $\epsilon$ is a random Gaussian sample. The denoiser of DDIM is defined such that, for an input […]. We define […] with […]. $y_t$ is an alpha-blended sample such as the one we defined in Section 2. It follows that we have: […]. We now turn to Equation (13) of Song et al. (2021a): […]. By injecting into this expression the scaled coordinate at line 48, we obtain: […] and since […] we can simplify line 52 to: […]. This last form is exactly variant (b) of IADB (see Table 1).

To validate this claim, we trained a DDIM denoiser and applied the rescaling formula to the output samples. We show in Figure 12 that when we rescale the output of DDIM, the generated trajectory matches that of IADB.

