KNN-DIFFUSION: IMAGE GENERATION VIA LARGE-SCALE RETRIEVAL

Abstract

Figure 1: (a) Samples of stickers generated from text inputs, (b) Semantic text-guided manipulations applied to the "Original" image without using edit masks.

1. INTRODUCTION

Large-scale generative models have been applied successfully to image generation tasks (Gafni et al., 2022; Ramesh et al., 2021; Nichol et al., 2021; Saharia et al., 2022; Yu et al., 2022), and have shown outstanding capabilities in extending human creativity through editing and user control. However, these models face several significant challenges: (i) Large-scale paired data requirement. To achieve high-quality results, text-to-image models rely heavily on large-scale datasets of (text, image) pairs collected from the internet. Due to this requirement, such models cannot be applied to new or customized domains that contain only unannotated images. (ii) Computational cost and efficiency. Training these models on the highly complex distribution of natural images usually requires scaling up the model size, data, batch size, and training time, which makes them challenging to train and less accessible to the community.

Recently, several works have proposed text-to-image models trained without an explicit paired text-image dataset. Liu et al. (2021) performed direct optimization against a pre-trained model using a CLIP loss (Radford et al., 2021). Such approaches are time-consuming, since they require a separate optimization for each input. Zhou et al. (2021) proposed training on CLIP image embeddings perturbed with Gaussian noise; however, to achieve high-quality results, an additional model must be trained on an annotated dataset of text-image pairs.

In this work, we introduce a novel generative model, kNN-Diffusion, which tackles these issues and progresses towards more accessible models for the research community and other users. Our model leverages a large-scale retrieval method, k-Nearest-Neighbors (kNN) search, in order to train the model without an explicit text-image dataset.
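The per-input optimization approach mentioned above can be illustrated with a deliberately simplified sketch. The real method optimizes through a deep generator against a CLIP cosine-similarity loss; here, as stand-ins, the "generator plus image encoder" is a fixed random linear map and the loss is a squared distance to a stand-in text embedding, so only the structure of the per-input optimization loop is shown.

```python
import numpy as np

# Toy stand-ins (NOT the real models): a fixed linear map playing the role
# of "generator + image encoder", and a fixed stand-in text embedding.
rng = np.random.default_rng(0)
D_LATENT, D_EMB = 16, 8
W = rng.normal(size=(D_EMB, D_LATENT))   # stand-in generator/encoder
t = rng.normal(size=D_EMB)               # stand-in CLIP text embedding

def loss(z):
    # Simplified "CLIP loss": squared distance between the generated
    # embedding and the text embedding (real CLIP uses cosine similarity).
    return float(np.sum((W @ z - t) ** 2))

def optimize(z, lr=0.01, steps=200):
    # Per-input gradient descent, as in optimization-based approaches:
    # the gradient of ||Wz - t||^2 w.r.t. z is 2 W^T (Wz - t).
    for _ in range(steps):
        z = z - lr * 2.0 * W.T @ (W @ z - t)
    return z

z0 = rng.normal(size=D_LATENT)
z1 = optimize(z0)
print(loss(z0), "->", loss(z1))  # loss decreases after optimization
```

The key drawback the text points out is visible here: the loop must be rerun from scratch for every new input, which is why such methods are slow at inference time.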
Specifically, our diffusion model is conditioned on two inputs: (1) an image embedding (at training time) or a text embedding (at inference), extracted using a pre-trained CLIP encoder, and (2) kNN embeddings, representing the k most similar images in the CLIP latent space. During training, we assume that no paired text is available, and hence condition only on the CLIP image embedding and on k additional image embeddings selected by the retrieval model. At inference, only text inputs are given, so instead of an image embedding we use the text embedding, which shares a joint embedding space with the image embeddings. Here, the kNN image embeddings are retrieved using the text embedding.

The additional kNN embeddings have three main benefits: (1) they extend the distribution of conditioning embeddings and keep this distribution similar between training and inference, thus helping to bridge the gap between the image and text embedding distributions (see Fig. 5); (2) they teach the model to generate images from a target distribution by using samples from that distribution, which allows generalizing to different distributions at test time and generating out-of-distribution samples; (3) they hold information that does not need to be stored in the model, which allows the model to be substantially smaller. We demonstrate the effectiveness of our kNN approach in Sec. 4.

To assess the performance of our method, we train our model on two large-scale datasets: the Public Multimodal Dataset (Singh et al., 2021) and an image-only stickers dataset collected from the Internet. We show state-of-the-art zero-shot results on MS-COCO (Lin et al., 2014), LN-COCO (Pont-Tuset et al., 2020), and CUB (Wah et al., 2011). To further demonstrate the advantage of retrieval methods in text-to-image generation, we train two diffusion backbones using our kNN approach: continuous (Ramesh et al., 2022) and discrete (Gu et al., 2021). In both cases we outperform the model trained without kNN.
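The conditioning scheme above can be sketched in a few lines: retrieve the k nearest image embeddings under cosine similarity in the shared CLIP space, then stack them with the query embedding to form the conditioning input. This is a minimal illustration assuming random vectors as stand-in CLIP embeddings (a real system would use a pre-trained CLIP encoder and an approximate-nearest-neighbor index for large databases); the function names are ours, not the paper's.

```python
import numpy as np

def normalize(x):
    # L2-normalize rows so that a dot product equals cosine similarity,
    # as in CLIP's joint embedding space.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def knn_condition(query_emb, index_embs, k):
    """Retrieve the k nearest image embeddings (cosine similarity) and
    stack them with the query to form the conditioning input.
    `query_emb` is a CLIP image embedding at training time and a CLIP
    text embedding at inference."""
    q = normalize(query_emb[None, :])[0]
    db = normalize(index_embs)
    sims = db @ q                      # cosine similarity to each database entry
    top = np.argsort(-sims)[:k]        # indices of the k nearest neighbors
    return np.concatenate([q[None, :], db[top]], axis=0)  # shape (k+1, D)

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 512))              # stand-in image-embedding database
query = db[42] + 0.01 * rng.normal(size=512)   # a query close to entry 42
cond = knn_condition(query, db, k=4)
print(cond.shape)  # (5, 512)
```

Note how benefit (2) in the text follows directly from this design: swapping `db` for a database from a different image domain changes the retrieved neighbors, and hence the distribution the model is steered towards, without retraining.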
In comparison to the alternative methods presented in Sec. 4, we achieve state-of-the-art results in both human evaluations and FID score, with only 400 million parameters and an inference time of 7 seconds.

Lastly, we introduce a new approach for local and semantic manipulations that is based on CLIP and kNN, without relying on user-provided masks. Specifically, we fine-tune our model to perform local and complex modifications that satisfy a given target text prompt. For example, given the image of the teddy bear in Fig. 4 and the target text "holds a heart", our method automatically locates the local region that should be modified and synthesizes a high-resolution manipulated image in which (1) the teddy bear's identity is accurately preserved and (2) the manipulation is aligned with the target text. We demonstrate our qualitative advantage by comparing our results with two state-of-the-art models, Text2Live (Bar-Tal et al., 2022) and Textual Inversion (Gal et al., 2022), which perform image manipulations without masks (Figs. 4, 21, and 22).

We summarize the contributions of this paper as follows: (1) We propose kNN-Diffusion, a novel and efficient model that utilizes a large-scale retrieval method for training a text-to-image model with only pre-trained multi-modal embeddings, without an explicit text-image dataset. (2) We demonstrate efficient out-of-distribution generation, achieved by substituting retrieval databases. (3) We present a new approach for local and semantic image manipulation that does not require masks. (4) We evaluate our method on two diffusion backbones, discrete and continuous, as well as on several datasets, and present state-of-the-art results compared to baselines.
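To give intuition for mask-free, text-driven localization, the sketch below shows one plausible scheme: score image patches by cosine similarity to the target text embedding in a shared embedding space and turn the scores into a soft mask. This is purely illustrative and is not the fine-tuning procedure used in the paper; the patch embeddings, function name, and softmax weighting are all our assumptions, with random vectors standing in for real CLIP features.

```python
import numpy as np

def soft_mask_from_text(patch_embs, text_emb, temperature=0.1):
    """Hypothetical localization sketch: score each image patch by cosine
    similarity to the target text embedding, then convert the scores into
    a soft mask with a softmax. Illustration only, not the paper's method."""
    p = patch_embs / np.linalg.norm(patch_embs, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = p @ t                       # cosine similarity per patch
    w = np.exp(scores / temperature)
    return w / w.sum()                   # weights sum to 1; high = edit here

rng = np.random.default_rng(1)
patches = rng.normal(size=(16, 64))            # stand-in patch embeddings
text = patches[5] + 0.1 * rng.normal(size=64)  # text "describes" patch 5
mask = soft_mask_from_text(patches, text)
print(int(mask.argmax()))  # patch 5 receives the largest weight
```

A soft mask like this could then weight where a generative model is allowed to change the image, preserving identity elsewhere, which matches the behavior the text describes (identity preserved outside the edited region).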

2. RELATED WORK

Text-to-image models. Text-to-image generation is a well-studied task that focuses on generating images from text descriptions. While GANs (Xu et al., 2018; Zhu et al., 2019; Zhang et al., 2021) and Transformer-based methods (Ramesh et al., 2021; Gafni et al., 2022; Yu et al., 2022; Ding et al., 2021) have shown remarkable results, impressive results have recently been attained with dis-

