ADVERSARIAL TEXT TO CONTINUOUS IMAGE GENERATION

Abstract

Implicit Neural Representations (INRs) provide a natural way to parameterize images as continuous signals, using an MLP that predicts the RGB color at an (x, y) image location. Recently, it has been shown that high-quality INR decoders can be designed and integrated with Generative Adversarial Networks (GANs) to facilitate unconditional continuous image generation that is no longer bound to a particular spatial resolution. In this paper, we introduce HyperCGAN, a conceptually simple approach for Adversarial Text to Continuous Image Generation based on HyperNetworks, i.e., networks that produce the parameters of another network. HyperCGAN utilizes HyperNetworks to condition an INR-based GAN model on text. In this setting, the generator and discriminator weights are controlled by their corresponding HyperNetworks, which modulate the weight parameters using the provided text query. We propose an effective Word-level hyper-modulation Attention operator, termed WhAtt, which encourages grounding words to individual pixels at input (x, y) coordinates. To the best of our knowledge, our work is the first to explore Text-to-Continuous-Image generation (T2CI). We conduct comprehensive experiments on the COCO 256², CUB 256², and ArtEmis 256² benchmarks, the last of which we introduce in this paper. HyperCGAN improves the performance of text-controllable image generators over the baselines while significantly reducing the gap between text-to-continuous and text-to-discrete image synthesis. Additionally, we show that HyperCGAN, when conditioned on text, retains the desired properties of continuous generative models (e.g., extrapolation outside of image boundaries, accelerated inference of low-resolution images, out-of-the-box super-resolution). The code and the ArtEmis 256² benchmark will be made publicly available.

1. INTRODUCTION

Humans have the innate ability to connect what they visualize with language or textual descriptions. Text-to-image (T2I) synthesis, an AI task inspired by this ability, aims to generate an image conditioned on an input textual description. Compared to other possible inputs in the conditional generation literature, sentences are an intuitive and flexible way to express the visual content that we may want to generate. The main challenge in traditional T2I synthesis lies in learning from the unstructured description and connecting the different statistical properties of vision and language inputs. This field has seen significant progress in recent years in synthesis quality, in the size and complexity of the datasets used, as well as in image-text alignment (Xu et al., 2018; Li et al., 2019; Zhu et al., 2019; Tao et al., 2022; Zhang et al., 2021; Ramesh et al., 2021).



Figure 1: Text-Conditioned Extrapolation outside of Image Boundaries: The red rectangles indicate the resolution boundaries at which our HyperCGAN model was trained. On three datasets, our model can synthesize meaningful pixels at surrounding (x, y) coordinates beyond these boundaries.

Existing methods for T2I can be broadly categorized based on the architectural innovations developed to condition on text. Models that condition on a single caption include stacked architectures (Zhang et al., 2017), attention mechanisms (Xu et al., 2018), Siamese architectures (Yin et al., 2019), cycle-consistency approaches (Qiao et al., 2019), and dynamic memory networks (Zhu et al., 2019). A parallel line of work (Yuan & Peng, 2019; Souza et al., 2020; Wang et al., 2020) adapts unconditional models for T2I synthesis. Despite this significant progress, images in existing approaches are typically represented as a discrete 2D pixel array, which is a cropped, quantized version of the true continuous underlying 2D signal. We take an alternative view, in which we use an implicit neural representation (INR) to approximate the continuous signal. This paradigm accepts coordinate locations (x, y) as input and produces the RGB values at the corresponding locations of the continuous image. Working directly with continuous images enables several useful features, such as extrapolation outside of image boundaries, accelerated inference of low-resolution images, and out-of-the-box super-resolution. Our proposed network, HyperCGAN, uses a HyperNetwork-based conditioning mechanism that we developed for text-to-continuous-image generation. It extends the INR-GAN (Skorokhodov et al., 2021a) backbone to efficiently generate continuous images conditioned on input text while preserving the desired properties of the continuous signal. Figure 1 shows examples of images generated by our HyperCGAN model on the CUB (Wah et al., 2011), COCO (Lin et al., 2015), and ArtEmis (Achlioptas et al., 2021) datasets. By design, and while conditioning on the input text, HyperCGAN trained on the CUB dataset can extend bird images with natural details such as the tail and background (see Figure 1) and the branches (top right).
We observe similar behavior on scene-level benchmarks, including COCO and ArtEmis (introduced in this paper).
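To make the INR view of images concrete, the following is a minimal sketch of a coordinate MLP that maps (x, y) locations to RGB values. All weights here are random and hypothetical; in an INR-based GAN they would be produced or modulated by the generator. The same network can be sampled on a grid of any resolution, and extrapolation amounts to simply querying coordinates outside the training range:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy INR: a 2-layer MLP mapping an (x, y) coordinate to an RGB value.
# Weights are random placeholders, not a trained model.
W1, b1 = rng.normal(size=(2, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 3)), np.zeros(3)

def inr(coords):
    """coords: (N, 2) array of (x, y) points -> (N, 3) RGB values in [0, 1]."""
    h = np.maximum(coords @ W1 + b1, 0.0)          # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid -> valid RGB range

def render(resolution):
    """Sample the same continuous signal on a square grid of any resolution."""
    xs = np.linspace(0.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
    return inr(grid).reshape(resolution, resolution, 3)

# Same weights, different resolutions: the parameter count does not change.
low, high = render(32), render(256)
print(low.shape, high.shape)   # (32, 32, 3) (256, 256, 3)

# Extrapolation beyond image boundaries: query coordinates outside [0, 1]^2.
outside = inr(np.array([[1.5, -0.25]]))
print(outside.shape)           # (1, 3)
```

Note that the memory needed to store the image is the (fixed) size of the MLP, regardless of the resolution at which it is rendered, which is the property the discussion below relies on.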

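The HyperNetwork-based conditioning mentioned above can be sketched as follows. This is a simplified, hypothetical illustration: it uses a rank-1 multiplicative modulation of a single layer's weight as a stand-in for the tensor-decomposed modulation used in our model, and all shapes and names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
d_text, d_in, d_out = 16, 64, 64

# Base weight of one generator layer, shared across all text conditions.
W = rng.normal(size=(d_in, d_out)) * 0.1

# HyperNetwork parameters: project a text embedding c to two factor vectors.
Hu = rng.normal(size=(d_text, d_in)) * 0.1
Hv = rng.normal(size=(d_text, d_out)) * 0.1

def modulated_weight(c):
    """Rank-1 multiplicative modulation: W_c = W * (1 + u v^T), with u, v = f(c).

    Emitting the factors u and v (d_in + d_out values) instead of a full
    weight matrix (d_in * d_out values) keeps the HyperNetwork output small.
    """
    u = c @ Hu                       # (d_in,)
    v = c @ Hv                       # (d_out,)
    return W * (1.0 + np.outer(u, v))

c = rng.normal(size=d_text)          # a text embedding, e.g. from a sentence encoder
W_c = modulated_weight(c)            # same shape as W, now text-conditioned
print(W_c.shape)
```

Each conditioning vector c thus yields a different effective weight, i.e., a different instance of the network, without the HyperNetwork ever emitting the full parameter vector.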
Figure 2: Scalability limitations in discrete decoders: increasing the training resolution decreases the per-GPU batch size until GPU memory limits are hit.

By representing signals as continuous functions, INRs do not depend on the spatial resolution. Thus, the memory requirements to parameterize the signal do not grow with the spatial resolution but only with the complexity of the signal. This type of representation enables generated images to have arbitrary spatial resolutions while keeping the memory requirements near constant. In contrast, discrete-based models need both the generator and the discriminator to scale with the spatial resolution, making training of these models at high resolutions impractical. Figure 2 shows that for discrete-based models, increasing the training resolution leads to a decreasing effective batch size during training due to GPU memory limits. Coupled with the expressiveness of HyperNetworks, we believe that building conditional generative models that are naturally capable of producing images of arbitrary resolutions, while maintaining visual-semantic consistency at low training cost, is a promising paradigm for the future progress of generative models. Our work introduces a step in this direction.

The prevalent T2I models in the literature, such as AttnGAN (Xu et al., 2018), ControlGAN (Li et al., 2019), and XMC-GAN (Zhang et al., 2021), use architecture-specific ways to condition the generator and discriminator on textual information and often introduce additional text-matching losses. These approaches use a text embedding c to condition the model by updating a hidden representation h. Unlike these approaches, we explore a different paradigm in HyperCGAN and use HyperNetworks (Ha et al., 2016) to condition the model on the textual information c by modulating the model weights. Such a procedure can be viewed as creating a different instance of the model for each conditioning vector c, and it was recently shown to be significantly more expressive than embedding-based conditioning (Galanti & Wolf, 2020). A traditional HyperNetwork (Chang et al., 2020) generates the entire parameter vector θ from the conditioning signal c, i.e., θ = F(c), where F is the modulating HyperNetwork. However, this quickly becomes infeasible in modern neural networks, where |θ| can easily span millions of parameters. Our HyperNetwork instead produces a tensor-decomposed modulation

