BOOMERANG: LOCAL SAMPLING ON IMAGE MANIFOLDS USING DIFFUSION MODELS

Abstract

Diffusion models can be viewed as mapping points in a high-dimensional latent space onto a low-dimensional learned manifold, typically an image manifold. The intermediate values between the latent space and image manifold can be interpreted as noisy images, where the amount of noise is determined by the noise schedule employed during pretraining. We exploit this interpretation to introduce Boomerang, a local image manifold sampling approach that uses the dynamics of diffusion models. We call it Boomerang because we first add noise to an input image, moving it closer to the latent space, and then bring it back to the image manifold through the reverse diffusion dynamics. We use this method to generate images that are similar, but not identical, to the original input images, and we can control how close the generated image is to the original through the amount of noise we add. Additionally, the generated images have a degree of stochasticity, allowing us to sample locally as many times as we want without repetition. We show three applications for which Boomerang can be used. First, we provide a framework for constructing privacy-preserving datasets with controllable degrees of anonymity. Second, we show how to use Boomerang for data augmentation while staying on the image manifold. Third, we introduce a framework for image super-resolution with 8x upsampling. Boomerang does not require any modification to the training of diffusion models and can be used with pretrained models on a single, inexpensive GPU.

[Figure 1: an initial image is noised forward to t = 200, 500, 700, and 800 and mapped back to t = 0 with the prompt "person" (and, in one panel, "cat"); full caption below.]

1. INTRODUCTION

Generative models have seen a tremendous rise in popularity and applications over the past decade, ranging from image synthesis, audio generation, out-of-distribution data detection, and reinforcement learning to drug synthesis (Kingma & Welling, 2014; Goodfellow et al., 2014; Bond-Taylor et al., 2021). One of the key benefits of generative models is that they can generate new samples from an unknown probability distribution for which we only have samples. Recently, with the advent of models such as Dall-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022a), and Stable Diffusion (Rombach et al., 2022), a family of generative models known as diffusion models has gained attention in both the academic and public spotlights. A key difference between diffusion models and previous state-of-the-art generative models, such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), is that diffusion models take a fundamentally different approach to image synthesis, opting for a series of denoising objectives, whereas GANs rely on saddle-point objectives that have proven difficult to train (Yadav et al., 2018; Mescheder et al., 2018; Dhariwal & Nichol, 2021).

Generative models estimate the underlying probability distribution, or manifold, of a dataset. As such, generative models should be able to produce samples close to a particular data point, known as local sampling on the manifold, given that they have some knowledge about the properties of the data manifold. One application of local sampling is removing noise from an image, especially in cases of severe noise where traditional denoising methods might fail (Kawar et al., 2021). Another application of local sampling is data augmentation, which involves applying transformations to copies of data points to produce new data points.
These data points are still from the original data distribution, but different enough from existing data to encourage the model to generalize in downstream tasks such as classification (Wong et al., 2016). While established image augmentation techniques include crops, flips, rotations, and color manipulations, data augmentation for images and other data types remains an active field of research (Yang et al., 2022; Wen et al., 2021; Feng et al., 2021). Diffusion models are typically designed to sample globally on the image manifold rather than locally. GANs (Goodfellow et al., 2014), variational autoencoders (VAEs) (Kingma & Welling, 2014), and normalizing flows (NFs) (Rezende & Mohamed, 2015) can sample locally to a limited extent. VAEs and NFs can project a data point x into a latent vector z in their respective latent spaces and then re-project z back into the original data space, producing an estimate x′ of the original data point. The key difference between x′_VAE and x′_NF is that neither the encoder nor the decoder of a VAE is invertible, while the projection function of an NF is invertible, resulting in x′_VAE ≈ x = x′_NF (Kobyzev et al., 2021). Meanwhile, GANs generally do not learn a map from the data space to the latent space, instead learning only a map from the latent space to the data space; finding or training the best methods for GAN inversion is an ongoing area of research (Karras et al., 2020; Xia et al., 2022). Previous work on GAN inversion has shown that, without special attention to underrepresented data points during GAN training, reconstruction of certain data points can fail (Yu et al., 2020). As such, VAEs and NFs can locally sample around a data point x by projecting a perturbed version of its corresponding latent z back into the original data space, and GANs have the potential to do so as well, albeit requiring a suitable GAN inversion method (Wang et al., 2022; Zhu et al., 2020).
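As a concrete illustration of the NF-style local sampling described above, the following is a minimal sketch in which a toy invertible linear map stands in for a trained flow; the map f, its parameters A and b, and the noise scale sigma are illustrative assumptions, not any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy invertible "flow": an invertible linear map plus a shift.
# A real normalizing flow composes many learned invertible layers.
A = np.array([[2.0, 0.5], [0.0, 1.5]])  # nonzero determinant, hence invertible
b = np.array([0.1, -0.3])

def f(x):
    """Data -> latent."""
    return A @ x + b

def f_inv(z):
    """Latent -> data: the exact inverse of f, unlike a VAE decoder."""
    return np.linalg.solve(A, z - b)

x = np.array([1.0, 2.0])
z = f(x)

# Invertibility gives exact reconstruction: x'_NF = x.
assert np.allclose(f_inv(z), x)

# Local sampling: perturb the latent, then map back to data space.
sigma = 0.1
x_local = f_inv(z + sigma * rng.standard_normal(2))
```

Because f is exactly invertible, shrinking sigma makes x_local approach x continuously, which is the controllable locality that GANs can only approximate via inversion.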
While the straightforward tractability of VAE- or NF-based local sampling is attractive, GANs currently outperform VAEs and NFs on popular tasks such as image synthesis (Zhang et al., 2022; Sauer et al., 2022). However, given the recent advent and popularity of diffusion models and the brittle nature of GAN training (Saxena & Cao, 2021), local sampling using diffusion models is a promising avenue for tackling problems such as anonymization, data augmentation, and super-resolution. We propose the Boomerang algorithm to enable local sampling of image manifolds using diffusion models. The algorithm earns its name from its principal mechanism: using noise of a certain variance to push data away from the image manifold, and then using a diffusion model to pull the noised data back onto the manifold. The variance of the noise is the only parameter in the algorithm, and it governs how similar the new image is to the old image, as reported by Ho et al. (2020). We show how this technique can be used within three applications: (1) data anonymization for privacy-preserving machine learning; (2) data augmentation; and (3) image super-resolution. We show that by exploiting the proposed local sampling technique we are able to anonymize a dataset and maintain better classification accuracy when compared with state-of-the-art generated data from
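The push-then-pull mechanism above can be sketched in a few lines. The sketch below uses a toy linear noise schedule and an oracle noise predictor (toy_eps_model) that peeks at x0 purely to keep the example self-contained; both are stand-ins for a pretrained model's schedule and learned noise-prediction network, not the paper's actual Stable Diffusion implementation. With a perfect oracle the reverse process reconstructs x0 exactly; a real model's imperfect predictions are what produce diverse, non-identical local samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule, a stand-in for a pretrained model's schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_to_t(x0, t):
    """Forward diffusion: jump from x_0 directly to x_t in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def toy_eps_model(x_t, t, x0_ref):
    """Oracle stand-in for a trained noise predictor. A real model
    estimates eps without access to x0; we use x0_ref only so the
    sketch runs on its own."""
    return (x_t - np.sqrt(alpha_bars[t]) * x0_ref) / np.sqrt(1.0 - alpha_bars[t])

def boomerang(x0, t_boom):
    """Noise x0 forward to step t_boom, then run the reverse DDPM
    updates (Ho et al., 2020) from t_boom back to 0. Larger t_boom
    means the sample lands farther from x0 on the manifold."""
    x = forward_to_t(x0, t_boom)
    for t in range(t_boom, -1, -1):
        eps_hat = toy_eps_model(x, t, x0)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        z = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * z
    return x

x0 = rng.standard_normal(8)          # a toy "image"
x_local = boomerang(x0, t_boom=200)  # oracle predictor: collapses back to x0
```

With the oracle predictor the output equals x0 up to floating-point error, confirming that the reverse dynamics pull the noised point back onto the manifold; swapping in a learned network keeps the pull but adds the stochastic spread that makes each run a distinct local sample.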



Figure 1: Boomerang via Stable Diffusion (Rombach et al., 2022). Code available here. Starting from an initial image x_0 ∼ p(x_0), we add varying levels of noise to the latent variables according to the noise schedule of the forward diffusion process. Boomerang maps the noisy latent variables back to the image manifold by running the reverse diffusion process starting from the reverse step associated with the added noise. The resulting images are local samples from the image manifold, where the closeness is determined by the amount of added noise. While Boomerang is applied here to the Stable Diffusion model, it is applicable to other types of diffusion models, e.g., denoising diffusion models (Ho et al., 2020). Additional images are provided in Appendix A.1.

