KNN-DIFFUSION: IMAGE GENERATION VIA LARGE-SCALE RETRIEVAL

Abstract

Figure 1: (a) Samples of stickers generated from text inputs, (b) Semantic text-guided manipulations applied to the "Original" image without using edit masks.

1. INTRODUCTION

Large-scale generative models have been applied successfully to image generation tasks (Gafni et al., 2022; Ramesh et al., 2021; Nichol et al., 2021; Saharia et al., 2022; Yu et al., 2022), and have shown outstanding capabilities in extending human creativity through editing and user control. However, these models face several significant challenges: (i) Large-scale paired-data requirement. To achieve high-quality results, text-to-image models rely heavily on large-scale datasets of (text, image) pairs collected from the internet. Because of this paired-data requirement, these models cannot be applied to new or customized domains for which only unannotated images are available. (ii) Computational cost and efficiency. Training these models on the highly complex distribution of natural images usually requires scaling up the model size, data, batch size, and training time, which makes them challenging to train and less accessible to the community. Recently, several works proposed text-to-image models

