AN IMAGE IS WORTH ONE WORD: PERSONALIZING TEXT-TO-IMAGE GENERATION USING TEXTUAL INVERSION

Abstract

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Code, data and new words are available at our project page.

1. INTRODUCTION

Large-scale text-to-image models (Rombach et al., 2021; Ramesh et al., 2021; 2022; Nichol et al., 2021; Yu et al., 2022; Saharia et al., 2022) have demonstrated an unprecedented capability to reason over natural language descriptions. They allow users to synthesize novel scenes with unseen compositions and produce vivid pictures in a myriad of styles. These tools have been used for artistic creation, as sources of inspiration, and even to design new, physical products (Yacoubian, 2022). Their use, however, is constrained by the user's ability to describe the desired target through text. One can then ask: how could we instruct such models to mimic the likeness of a specific object? How could we ask them to craft a novel scene containing a cherished childhood toy? Or to pull our child's drawing from its place on the fridge, and turn it into an artistic showpiece?

Introducing new concepts into large-scale models is often difficult. Re-training a model with an expanded dataset for each new concept is prohibitively expensive, and fine-tuning on a few examples



Figure 1: (left) We find new pseudo-words in the embedding space of pre-trained text-to-image models which describe specific concepts. (right) These pseudo-words are composed into new sentences, placing our targets in new scenes, changing their style or ingraining them into new products.
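The core idea described above — keeping the entire text-to-image model frozen and optimizing only a single new word embedding — can be illustrated with a toy sketch. The snippet below is an assumption-laden simplification, not the paper's actual training code: a random linear map (frozen_net) stands in for the frozen diffusion model's conditioning pathway, and a fixed target vector stands in for the denoising objective computed on the 3-5 user images. Only the pseudo-word vector s_star receives gradient updates.

```python
import torch

# Toy sketch (hypothetical, not the paper's implementation): learn one new
# "word" embedding S* against a fully frozen model.
torch.manual_seed(0)
vocab_size, dim = 1000, 32

embeddings = torch.nn.Embedding(vocab_size, dim)
embeddings.weight.requires_grad_(False)          # freeze the original vocabulary
frozen_copy = embeddings.weight.clone()          # snapshot: nothing but S* should move
s_star = torch.randn(dim, requires_grad=True)    # the single trainable pseudo-word

# Hypothetical frozen stand-in for the model's text-conditioning pathway.
frozen_net = torch.nn.Linear(dim, dim)
for p in frozen_net.parameters():
    p.requires_grad_(False)

target = torch.randn(dim)  # stands in for the loss signal from the user's images
optimizer = torch.optim.Adam([s_star], lr=1e-1)

initial_loss = None
for step in range(300):
    optimizer.zero_grad()
    # Embed a fixed prompt (think "a photo of S*") and splice in the trainable word.
    prompt = embeddings(torch.tensor([5, 17, 42]))
    conditioned = torch.cat([prompt, s_star.unsqueeze(0)]).mean(dim=0)
    loss = ((frozen_net(conditioned) - target) ** 2).mean()
    if initial_loss is None:
        initial_loss = loss.item()
    loss.backward()
    optimizer.step()

final_loss = loss.item()
```

Because the optimizer only sees s_star, the original vocabulary and the frozen network remain byte-identical after training; the learned vector can then be spliced into any prompt, which is what allows the pseudo-word to be composed into new sentences.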

