AN IMAGE IS WORTH ONE WORD: PERSONALIZING TEXT-TO-IMAGE GENERATION USING TEXTUAL INVERSION

Abstract

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Code, data and new words are available at our project page.

1. INTRODUCTION

Large-scale text-to-image models (Rombach et al., 2021; Ramesh et al., 2021; 2022; Nichol et al., 2021; Yu et al., 2022; Saharia et al., 2022) have demonstrated an unprecedented capability to reason over natural language descriptions. They allow users to synthesize novel scenes with unseen compositions and produce vivid pictures in a myriad of styles. These tools have been used for artistic creation, as sources of inspiration, and even to design new, physical products (Yacoubian, 2022). Their use, however, is constrained by the user's ability to describe the desired target through text. One can then ask: How could we instruct such models to mimic the likeness of a specific object? How could we ask them to craft a novel scene containing a cherished childhood toy? Or to pull our child's drawing from its place on the fridge, and turn it into an artistic showpiece?

Introducing new concepts into large-scale models is often difficult. Re-training a model with an expanded dataset for each new concept is prohibitively expensive, and fine-tuning on few examples typically leads to catastrophic forgetting (Ding et al., 2022; Li et al., 2022). More measured approaches freeze the model and train transformation modules to adapt its output when faced with new concepts (Zhou et al., 2021; Gao et al., 2021; Skantze & Willemsen, 2022). However, these approaches are still prone to forgetting prior knowledge, or face difficulties in accessing it concurrently with newly learned concepts (Kumar et al., 2022; Cohen et al., 2022).

We propose to overcome these challenges by finding new words in the textual embedding space of pre-trained text-to-image models. We consider the first stage of the text encoding process (Figure 2). Here, an input string is first converted to a set of tokens. Each token is then replaced with its own embedding vector, and these vectors are fed through the downstream model. Our goal is to find new embedding vectors that represent new, specific concepts.
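To make this first stage of text encoding concrete, the following is a minimal toy sketch (not the actual encoder of any particular model): a string is split into tokens, each token id indexes a row of an embedding table, and introducing a new concept amounts to adding a single new row addressed by a new pseudo-word. The vocabulary, dimensions, and whitespace tokenizer here are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the text encoder's first stage:
# string -> token ids -> embedding vectors.
vocab = {"a": 0, "photo": 1, "of": 2, "cat": 3}
embed_dim = 4
embedding_table = rng.normal(size=(len(vocab), embed_dim))

def encode(prompt):
    """Tokenize (whitespace split, for illustration) and look up embeddings."""
    token_ids = [vocab[tok] for tok in prompt.split()]
    return embedding_table[token_ids]  # shape: (num_tokens, embed_dim)

# Adding a new concept = adding one new row to the embedding table,
# addressed by a new pseudo-word "S*". Nothing else in the model changes.
vocab["S*"] = len(vocab)
embedding_table = np.vstack([embedding_table, rng.normal(size=(1, embed_dim))])

vectors = encode("a photo of S*")
print(vectors.shape)  # (4, 4)
```

In a real system the tokenizer and embedding table belong to the frozen text encoder; the only learnable quantity introduced is the new row for S*.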
We represent a new embedding vector with a new pseudo-word (Rathvon, 2004), which we denote by S*. This pseudo-word is then treated like any other word, and can be used to compose novel textual queries for the generative models. One can therefore ask for "a photograph of S* on the beach", "an oil painting of a S* hanging on the wall", or even compose two concepts, such as "a drawing of S1* in the style of S2*". Importantly, this process leaves the generative model untouched. In doing so, we retain the rich textual understanding and generalization capabilities that are typically lost when fine-tuning vision and language models on new tasks.

To find these pseudo-words, we frame the task as one of inversion. We are given a pre-trained text-to-image model and a small (3-5) image set depicting the concept. We aim to find a word embedding, so that prompts of the form "A photo of S*" will lead to the reconstruction of images from our set. This embedding is found through an optimization process, which we refer to as "Textual Inversion".

We further investigate a series of extensions based on tools typically used in Generative Adversarial Network (GAN) inversion. Our analysis reveals that, while some core principles remain, applying the prior art in a naïve way may harm performance. We demonstrate the effectiveness of our approach over a wide range of concepts and prompts, showing that it can inject unique objects into new scenes, transform them across different styles, transfer poses, diminish biases, and even imagine new products.

In summary, our contributions are as follows:

• We introduce the task of personalized text-to-image generation, where we synthesize novel scenes of user-provided concepts guided by natural language instructions.

• We present the idea of "Textual Inversion" in the context of generative models. Here, the goal is to find new pseudo-words in the embedding space of a text encoder that can capture both high-level semantics and fine visual details.

• We conduct a preliminary analysis of the properties of the textual embedding space.
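The inversion procedure described above can be caricatured in a few lines: every parameter of the generative model stays frozen, and only the single embedding vector for S* is updated to minimize a reconstruction loss. In this toy sketch a fixed random linear map stands in for the frozen text-to-image model and a fixed vector stands in for features of the concept images; the dimensions, learning rate, and squared-error loss are illustrative assumptions, not the paper's actual diffusion objective.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim, out_dim = 8, 16
W = rng.normal(size=(out_dim, embed_dim))  # frozen stand-in "model"
y = rng.normal(size=out_dim)               # stand-in for the concept images
v = rng.normal(size=embed_dim)             # the learned embedding for S*

def loss(v):
    # Squared reconstruction error of the frozen model's output.
    return float(np.sum((W @ v - y) ** 2))

lr = 0.01
initial = loss(v)
for _ in range(500):
    grad = 2.0 * W.T @ (W @ v - y)  # gradient of ||Wv - y||^2 w.r.t. v
    v -= lr * grad                  # only v is updated; W never changes

final = loss(v)
print(initial, "->", final)
```

The key design choice mirrored here is that the optimization variable lives entirely in the text encoder's input embedding space, which is why the model's prior knowledge is left intact.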

2. RELATED WORK

Text-guided synthesis. Text-guided image synthesis has been widely studied in the context of GANs (Goodfellow et al., 2014). Typically, a conditional model is trained to reproduce samples from paired image-caption datasets (Zhu et al., 2019; Tao et al., 2020) by leveraging attention mechanisms (Xu et al., 2018) or cross-modal contrastive approaches (Zhang et al., 2021; Ye et al., 2021). More recently, impressive visual results were achieved using large-scale auto-regressive (Ramesh et al., 2021; Yu et al., 2022) or diffusion models (Ramesh et al., 2022; Saharia et al., 2022; Nichol et al., 2021; Rombach et al., 2021). Alternatively, test-time optimization can be used to explore the latent space of pre-trained generators (Crowson et al., 2022; Murdock, 2021; Crowson, 2021), typically by maximizing a text-to-image similarity score derived from CLIP (Radford et al., 2021). Moving beyond pure generation, a large body of work explores the use of text-based interfaces for image editing (Patashnik et al., 2021; Abdal et al., 2021; Avrahami et al., 2022b), generator domain adaptation (Gal et al., 2021; Kim et al., 2022) and style transfer (Kwon & Ye, 2021; Liu et al., 2022). Our approach builds on the open-ended, conditional synthesis models. Rather than training a new model from scratch, we show that we can expand a frozen model's vocabulary and introduce new pseudo-words that describe specific concepts.
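For readers unfamiliar with the CLIP-derived score those test-time methods maximize, it is essentially a cosine similarity between an image embedding and a text embedding in a shared space. The sketch below uses random vectors in place of real CLIP encoder outputs; the 512-dimensional size is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_style_score(image_emb, text_emb):
    """Cosine similarity between two embeddings, used as an optimization
    target by CLIP-guided test-time methods."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

text_emb = rng.normal(size=512)
score_match = clip_style_score(text_emb.copy(), text_emb)       # aligned
score_other = clip_style_score(rng.normal(size=512), text_emb)  # unrelated
print(score_match, score_other)
```

Test-time generation methods ascend the gradient of such a score with respect to a generator's latent code, steering the output toward the prompt.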



Figure 1: (left) We find new pseudo-words in the embedding space of pre-trained text-to-image models which describe specific concepts. (right) These pseudo-words are composed into new sentences, placing our targets in new scenes, changing their style or ingraining them into new products.

