ADVERSARIAL TEXT TO CONTINUOUS IMAGE GENERATION

Abstract

Implicit Neural Representations (INR) provide a natural way to parametrize images as a continuous signal, using an MLP that predicts the RGB color at an (x, y) image location. Recently, it has been shown that high-quality INR decoders can be designed and integrated with Generative Adversarial Networks (GANs) to facilitate unconditional continuous image generation that is no longer bound to a particular spatial resolution. In this paper, we introduce HyperCGAN, a conceptually simple approach for Adversarial Text to Continuous Image Generation based on HyperNetworks, which produce parameters for another network. HyperCGAN utilizes HyperNetworks to condition an INR-based GAN model on text. In this setting, the generator and the discriminator weights are controlled by their corresponding HyperNetworks, which modulate weight parameters using the provided text query. We propose an effective Word-level hyper-modulation Attention operator, termed WhAtt, which encourages grounding words to independent pixels at input (x, y) coordinates. To the best of our knowledge, our work is the first to explore Text to Continuous Image Generation (T2CI). We conduct comprehensive experiments on the COCO 256², CUB 256², and ArtEmis 256² benchmarks, the last of which we introduce in this paper. HyperCGAN improves the performance of text-controllable image generators over the baselines while significantly reducing the gap between text-to-continuous and text-to-discrete image synthesis. Additionally, we show that HyperCGAN, when conditioned on text, retains the desired properties of continuous generative models (e.g., extrapolation outside of image boundaries, accelerated inference of low-resolution images, out-of-the-box super-resolution). Code and the ArtEmis 256² benchmark will be made publicly available.

1. INTRODUCTION

[Figure 1: Example input captions from the COCO, ArtEmis, and CUB datasets with corresponding generated images.]

Humans have the innate ability to connect what they visualize with language or textual descriptions. Text-to-image (T2I) synthesis, an AI task inspired by this ability, aims to generate an image conditioned on an input textual description. Compared to other possible inputs in the conditional generation literature, sentences are an intuitive and flexible way to express the visual content we may want to generate. The main challenge in traditional T2I synthesis lies in learning from the unstructured description and connecting the different statistical properties of vision and language inputs. The field has seen significant progress in recent years in synthesis quality, the size and complexity of the datasets used, and image-text alignment (Xu et al., 2018; Li et al., 2019; Zhu et al., 2019; Tao et al., 2022; Zhang et al., 2021; Ramesh et al., 2021). Existing methods for T2I can be broadly categorized by the architectural innovations developed to condition on text. Models that condition on a single caption include stacked architectures (Zhang et al., 2017), attention mechanisms (Xu et al., 2018), Siamese architectures (Yin et al., 2019), cycle-consistency approaches (Qiao et al., 2019), and dynamic memory networks (Zhu et al., 2019).
A parallel line of work (Yuan & Peng, 2019; Souza et al., 2020; Wang et al., 2020) looks at adapting unconditional models for T2I synthesis. Despite this significant progress, images in existing approaches are typically represented as a discrete 2D pixel array, which is a cropped, quantized version of the true continuous underlying 2D signal. We take an alternative view and use an implicit neural representation (INR) to approximate the continuous signal. This paradigm accepts coordinate locations (x, y) as input and produces RGB values at the corresponding locations of the continuous image. Working directly with continuous images enables several useful features, such as extrapolation outside of image boundaries, accelerated inference of low-resolution images, and out-of-the-box super-resolution. Our proposed network, HyperCGAN, uses a HyperNetwork-based conditioning mechanism that we developed for text-to-continuous-image generation. It extends the INR-GAN (Skorokhodov et al., 2021a) backbone to efficiently generate continuous images conditioned on input text while preserving the desired properties of the continuous signal. Figure 1 shows examples of images generated by our HyperCGAN model on the CUB (Wah et al., 2011), COCO (Lin et al., 2015), and ArtEmis (Achlioptas et al., 2021) datasets. While conditioning on the input text, HyperCGAN trained on the CUB dataset can, by design, extend bird images with natural details such as the tail and background (see Figure 1) and the branches (top right). We observe similar behavior on scene-level benchmarks, including COCO and ArtEmis (introduced in this paper). By representing signals as continuous functions, INRs do not depend on spatial resolution: the memory required to parameterize the signal grows not with spatial resolution but only with the complexity of the signal.
This type of representation enables generated images to have arbitrary spatial resolutions while keeping memory requirements near constant. In contrast, discrete-based models need both the generator and the discriminator to scale with spatial resolution, making training impractical at high resolutions. Figure 2 shows that for discrete-based models, increasing the training resolution decreases the effective batch size during training due to GPU memory limits. Coupled with the expressiveness of HyperNetworks, we believe that building conditional generative models that are naturally capable of producing images of arbitrary resolution while maintaining visual-semantic consistency at low training cost is a promising paradigm for the future progress of generative models. Our work introduces a step in this direction. The prevalent T2I models in the literature, such as AttnGAN (Xu et al., 2018), ControlGAN (Li et al., 2019), and XMC-GAN (Zhang et al., 2021), use architecture-specific ways to condition the generator and discriminator on textual information and often introduce additional text-matching losses. These approaches use text embeddings c to condition the model by updating a hidden representation h. Unlike them, we explore a different paradigm in HyperCGAN and use HyperNetworks (Ha et al., 2016) to condition the model on textual information c by modulating the model weights. Such a procedure can be viewed as creating a different instance of the model for each conditioning vector c, and was recently shown to be significantly more expressive than embedding-based conditioning approaches (Galanti & Wolf, 2020). A traditional HyperNetwork (Chang et al., 2020) generates the entire parameter vector θ from the conditioning signal c, i.e., θ = F(c), where F is the modulating HyperNetwork. However, this quickly becomes infeasible in modern neural networks, where |θ| can easily span millions of parameters.
Our HyperNetwork instead produces a tensor-decomposed modulation F(c) = M of the same size as the weight tensor W. This tensor is then used to alter W via an elementwise multiplicative operation W_c = W ⊙ F(c). Additionally, we develop an attention-based word-level modulation, WhAtt, to alter the weight tensors W of both the generator and the discriminator using F(c). Our primary contributions are as follows:
• We propose the HyperCGAN framework for synthesizing continuous images from text input. The model is augmented with a novel language-guided mechanism, termed WhAtt, that modulates weights at the word level.
• We show that our method can meaningfully extrapolate outside the image boundaries and can outperform most existing discrete methods on the COCO and ArtEmis datasets, including stacked-generator and single-generator methods.
• We establish a baseline on a new affective T2I benchmark based on the ArtEmis dataset (Achlioptas et al., 2021), which has 455,000 affective utterances collected on more than 80K artworks. ArtEmis contains captions that explain emotions elicited by a visual stimulus, which can lead to more emotion-aware T2I synthesis models.
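As a toy illustration of the core idea (sizes and variable names are ours, not the paper's), the elementwise weight modulation W_c = W ⊙ F(c) can be sketched in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, d_c = 64, 32, 16   # hypothetical layer and text-embedding sizes

# Shared base weight of one generator layer.
W = rng.standard_normal((c_out, c_in))

# Modulating hypernetwork: a single linear head mapping the conditioning
# vector c to a tensor M = F(c) of the same shape as W.
F = rng.standard_normal((c_out * c_in, d_c)) / np.sqrt(d_c)
c = rng.standard_normal(d_c)                 # text conditioning vector
M = np.tanh(F @ c).reshape(c_out, c_in)      # modulating tensor F(c)

# Conditioned weights: W_c = W ⊙ F(c), elementwise multiplication.
W_c = W * M
print(W_c.shape)   # (64, 32)
```

Each conditioning vector c thus yields a different effective weight W_c, i.e., a different instance of the network, without storing a separate weight tensor per caption.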

2. RELATED WORK

Text-to-Image Generation. T2I synthesis has been an active area of research since at least (Mansimov et al., 2015), who proposed a DRAW-based (Gregor et al., 2015) model to generate images from captions. (Reed et al., 2016a) first demonstrated improved fidelity of images generated from text using GANs (Goodfellow et al., 2014), and several GAN-based approaches for T2I synthesis have emerged since. StackGAN (Zhang et al., 2017) decomposed T2I generation into two coarse-to-fine stages and used conditional augmentation of the conditioning text. Later, (Xu et al., 2018) proposed AttnGAN, an extended version of StackGAN that adopts cross-modal attention for improved visual-semantic alignment and grounding. Following the architecture of AttnGAN, further approaches improved generation quality (Li et al., 2019; Zhu et al., 2019). XMC-GAN (Zhang et al., 2021) and DF-GAN (Tao et al., 2022) use additional auxiliary losses to improve visual-semantic alignment. Non-GAN generative models have also been explored for T2I, e.g., autoregressive approaches (Reed et al., 2016b; 2017; Ramesh et al., 2021; Gafni et al., 2022) and flow-based models (Mahajan et al., 2020). Diffusion Models. With the introduction of diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020), which learn to perform a denoising task, a breakthrough has been made in T2I through diffusion-based models conditioned on text (Ramesh et al., 2022; Nichol et al., 2021; Saharia et al., 2022; Rombach et al., 2022; Gu et al., 2022). These methods cannot be directly compared with our work since they have a huge number of parameters and require a massive amount of data for training. Diffusion-based methods do not suffer from mode collapse, but their compute cost and carbon footprint are much higher than those of GAN-based approaches. Art generation.
Synthetically generating realistic artwork with conditional GANs is challenging due to unstructured shapes and art's metaphoric nature. Several works have explored learning artistic style representations. ArtGAN (Tan et al., 2017; 2018) trained a conditional GAN on artist, genre, and style labels. (Alvarez-Melis & Amores, 2017) proposed emotion-to-art generation by training an AC-GAN (Odena et al., 2017) on ten classes of emotions. Another line of work includes CAN (Elgammal et al., 2017) and the later H-CAN (Sbai et al., 2018), which generate creative art by learning about styles and then deviating from style norms. We extend prior work by applying our HyperNetwork-based conditioning to the novel text-to-continuous-image generation task on the challenging ArtEmis (Achlioptas et al., 2021) dataset, where we leverage verbal explanations as conditioning signals to achieve more human cognition-aware T2I synthesis.

Implicit Neural Representation (INR).

INRs parametrize any type of signal (e.g., images, audio, 3D shapes) as a continuous function that maps the domain of the signal to values at a specified coordinate (Genova et al., 2019; Mildenhall et al., 2020; Sitzmann et al., 2019; 2020). For 2D image synthesis, several works have explored ways to enable INRs in generative models (Anokhin et al., 2021; Skorokhodov et al., 2021a; b). Connection to HyperNetworks. HyperNetworks are models that generate parameters for other models. They have been applied to several tasks in architecture search (Zhang et al., 2019), few-shot learning (Bertinetto et al., 2016), and continual learning (von Oswald et al., 2020). Generative HyperNetworks, also called implicit generators (Skorokhodov et al., 2021a; Anokhin et al., 2021), were recently shown to rival StyleGAN2 (Karras et al., 2020) in generation quality. Our HyperCGAN generates continuous images conditioned on text using two types of HyperNetworks: (1) an image-generator HyperNetwork, which produces an image represented by its INR, and (2) a text-controlling HyperNetwork that guides the image-generator HyperNetwork using the input text. Despite the progress in unconditional INR-based decoders (e.g., Lin et al., 2019; Skorokhodov et al., 2021a; Anokhin et al., 2021; Skorokhodov et al., 2021b), generating high-quality continuous images conditioned on text is less studied than discrete image generation. Our HyperNetwork-augmented modulation approach facilitates conditioning the continuous image generator on text while preserving the desired INR properties (e.g., out-of-the-box super-resolution, extrapolation outside image boundaries).

3. APPROACH

The T2I generation task can be formulated as modeling the data distribution of images P_r given a conditioning signal c. We use a standard GAN training setup in which we model the image distribution with a generator G. In our case, c is text information in the form of sentence or word embeddings. During training, we alternate between optimizing the discriminator and generator objectives:

$$\mathcal{L}_D(c) = -\mathbb{E}_{x \sim P_r}[D(x, c)] - \mathbb{E}_{G(z,c) \sim P_g}[1 - D(G(z, c), c)]$$
$$\mathcal{L}_G(c) = -\mathbb{E}_{G(z,c) \sim P_g}[D(G(z, c), c)] + \lambda \mathcal{L}_{\text{contrastive}} \quad (1)$$

where P_g is the generated image distribution and P_r is the real distribution; L_D(c) and L_G(c) are the discriminator and generator losses, respectively. To facilitate continuous image generation, HyperCGAN augments the unconditional baseline INR-GAN (Skorokhodov et al., 2021a) with an effective modulation mechanism that encourages better sentence-level and word-level alignment. To encourage fine-grained image-text matching, the generator is regularized with an auxiliary contrastive loss based on the Deep Attentional Multimodal Similarity Model (DAMSM) (Xu et al., 2018), which measures the similarity between generated images and global sentence-level as well as fine-grained word-level information. As shown in our experiments, our proposed modulation helps the DAMSM loss improve continuous image-text alignment at the word level while preserving high image fidelity. We also explore integrating a CLIP (Radford et al., 2021) loss to improve the alignment between the text and the generated continuous images. The following subsections introduce INR-based decoders and describe how we adapt them in HyperCGAN to facilitate continuous image generation conditioned on text.
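The two objectives in Eq. 1 can be written directly as functions of the discriminator's scores. The numpy sketch below uses hypothetical mini-batch scores and a placeholder contrastive term; λ and the batch values are ours, not the paper's:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # L_D(c) = -E_x[D(x, c)] - E_z[1 - D(G(z, c), c)]
    return -np.mean(d_real) - np.mean(1.0 - d_fake)

def g_loss(d_fake, l_contrastive=0.0, lam=0.1):
    # L_G(c) = -E_z[D(G(z, c), c)] + lambda * L_contrastive
    return -np.mean(d_fake) + lam * l_contrastive

# Hypothetical discriminator scores for one mini-batch.
d_real = np.array([0.9, 0.8, 1.1])
d_fake = np.array([0.1, -0.2, 0.0])
print(d_loss(d_real, d_fake))
print(g_loss(d_fake, l_contrastive=0.5, lam=0.1))
```

In training, these two losses are minimized in alternation: one step updates the discriminator with d_loss, the next updates the generator (and its hypernetworks) with g_loss.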

3.1. INR-BASED GENERATOR BACKBONE: INR-GAN (SKOROKHODOV ET AL., 2021A; ANOKHIN ET AL., 2020)

An INR represents a 2D image implicitly, with a neural network producing RGB pixel values given image coordinate locations (x, y). We build our approach upon the INR-based generator (Skorokhodov et al., 2021a; Anokhin et al., 2020), which consists of two main modules: a hypernetwork H(z) and an MLP F_{θ(z)}(x, y). The hypernetwork H(z) takes a noise vector z ∼ N(0, I) and generates the parameters θ(z) of the MLP. The MLP F_{θ(z)}(x, y) then predicts RGB values at each location (x, y) of a predefined coordinate grid to synthesize an image. The weights of the MLP are modulated through a Factorized Multiplicative Modulation (FMM) mechanism: two matrices are multiplied together and passed through an activation function to obtain a modulating tensor, which is then multiplied with the shared parameter matrix of the MLP network.
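A minimal numpy sketch of this hypernetwork-plus-MLP pipeline, under the assumption of a toy 2-layer MLP and a random linear hypernetwork (not the paper's architecture or FMM mechanism), shows why one set of generated parameters can be rendered at any resolution:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_h = 8, 16   # toy noise and hidden sizes

def hypernetwork(z):
    # Toy stand-in for H(z): a fixed random linear map from z to all
    # parameters of F_theta, a 2 -> d_h -> 3 MLP.
    n_params = 2 * d_h + d_h + d_h * 3 + 3
    H = np.random.default_rng(1).standard_normal((n_params, d_z)) * 0.1
    return H @ z

def inr_decode(theta, resolution):
    # Split theta into the weights of the MLP F_theta(x, y) -> RGB.
    i = 0
    W1 = theta[i:i + 2 * d_h].reshape(d_h, 2); i += 2 * d_h
    b1 = theta[i:i + d_h]; i += d_h
    W2 = theta[i:i + d_h * 3].reshape(3, d_h); i += d_h * 3
    b2 = theta[i:i + 3]
    # Evaluate on a coordinate grid in [0, 1]^2; resolution is arbitrary.
    xs = np.linspace(0.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
    h = np.tanh(grid @ W1.T + b1)
    rgb = 1.0 / (1.0 + np.exp(-(h @ W2.T + b2)))   # sigmoid -> [0, 1]
    return rgb.reshape(resolution, resolution, 3)

theta = hypernetwork(rng.standard_normal(d_z))
print(inr_decode(theta, 32).shape)   # the same theta renders any grid size
print(inr_decode(theta, 64).shape)
```

The decoder's parameter count is independent of the output grid, which is exactly the property that enables extrapolation and out-of-the-box super-resolution in the full model.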

3.2. HYPER-CONDITIONAL GANS (HYPERCGANS)

Architecture Overview. Our generator architecture is based on the multi-scale INR-GAN and mainly consists of fully-connected linear layers followed by activations. The weights of these layers are two-dimensional, i.e., W^ℓ ∈ R^{c_out × c_in × 1 × 1} at layer ℓ. We use the StyleGAN2 discriminator during training, which comprises a series of ConvNet blocks whose convolutional weights are four-dimensional tensors W^ℓ ∈ R^{c_out × c_in × k_h × k_w}. In INR-GAN, these weights are not conditioned on the text. As our goal is to condition both the generator and the discriminator on the input text c, we apply the HyperCGAN conditional modulation framework to both. In this framework, c is transformed by a HyperNetwork to produce modulating tensors for the weight tensors. Figure 3 gives an overview of our proposed HyperCGAN approach.

3.2.1. LEVERAGING CONDITIONAL SIGNAL FOR WEIGHT MODULATION

When conditioning the generator, we use two strategies to generate the modulating tensor M for linear layers, depending on the language-representation granularity (word-level or sentence-level). Sentence-level Conditioning: Here we condition on sentence embeddings e_s. The HyperNet backbone receives as input the concatenation of a noise vector z ∼ N(0, I) of size d_z and a sentence embedding e_s of size d_c, i.e., [z, e_s]. Then, for each linear layer ℓ in the INR MLP-decoder, a separate modulating tensor M^ℓ_{z,s} is generated through fully-connected (FC) layers (see Figure 3a). This tensor M^ℓ_{z,s} is used to modulate the generator's weight W^ℓ_G at layer ℓ through elementwise multiplication; see Equation 5. Word-level Conditioning: Word embeddings e_w ∈ R^{Ω × d_w} are represented as a sequence of individual vectors of size d_w, one per word in the sentence, where Ω denotes the sequence length (i.e., the number of words). Two hypernetworks condition the generator: the first is an MLP that receives the noise vector z and outputs the modulating tensor M_z; the second is a Conv1x1 followed by a novel word-level hyper-attention mechanism proposed in this work, termed WhAtt, detailed later in this section. Slightly differently from the generator, the hypernetworks of the discriminator are either an FC layer that takes the sentence embedding e_s as input and generates a tensor M^ℓ_s, or a Conv1x1 that receives the word embeddings e_w and generates a tensor M^ℓ_w for modulation (see Figure 3a,b). In addition, the final projection head in the discriminator is conditioned through s = h^T F(e_s), where h is the output of the last discriminator block and F(e_s) is the vector produced by our hypernetwork.
This form resembles the traditional Projection Discriminator (Miyato & Koyama, 2018), which uses the output s = h^T j with j a one-hot label; we generalize it to condition on signals beyond one-hot class labels (see the discriminator in Figure 3a). Extreme Modulating Tensor Factorization. Producing a full-rank tensor M^ℓ for each layer ℓ is memory-intensive and infeasible even for modestly sized architectures. For example, if the hidden layer of our hypernetwork has size d_h = 512 and the convolutional weight tensor at layer ℓ has dimensionality d_o = c_out × c_in × k_h × k_w = 512 × 512 × 3 × 3 ≈ 2.4 million, then the output weight matrix of the hypernetwork would have d_o × d_h ≈ 1.2 billion entries. To overcome this issue, we factorize the modulating tensor with an extreme low-rank tensor decomposition for learning efficiency. The canonical polyadic (CP) decomposition (Kiers, 2000) expresses a rank-R tensor T ∈ R^{d_1 × ... × d_n} as a sum of R rank-1 tensors:

$$T = \sum_{r=1}^{R} t^r_1 \otimes \cdots \otimes t^r_n \quad (2)$$

where ⊗ is the tensor product and t^r_k is a vector of length d_k. Returning to the example above, if we instead generate the low-rank factors separately and build the modulating tensor from them, then d_o = c_out + c_in + k_h + k_w = 512 + 512 + 3 + 3 = 1030, and the output weight matrix of the hypernetwork has d_o × d_h = 527,360 entries, a ≈ 99.95% decrease in the parameter size of the hypernetworks. Therefore, M^ℓ_{z,s} is the tensor product of four rank-1 factors t_1, t_2, t_3, and t_4 of sizes c_out, c_in, k_h, and k_w, respectively.
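The parameter counts in this example can be checked directly. The numpy sketch below (our own variable names) builds the rank-1 modulating tensor from the four generated factors:

```python
import numpy as np

c_out, c_in, kh, kw = 512, 512, 3, 3   # discriminator conv weight shape
d_h = 512                               # hypernetwork hidden size

# Naive head: emit all c_out*c_in*kh*kw values -> ~1.2 billion weights.
full_head = c_out * c_in * kh * kw * d_h
# Factorized head: emit only the four rank-1 factors -> 1030 * 512 weights.
factored_head = (c_out + c_in + kh + kw) * d_h

# Build the rank-1 modulating tensor from the four generated factors.
rng = np.random.default_rng(0)
t1, t2, t3, t4 = (rng.standard_normal(d) for d in (c_out, c_in, kh, kw))
M = np.einsum('a,b,c,d->abcd', t1, t2, t3, t4)   # outer product t1⊗t2⊗t3⊗t4

print(M.shape)                       # (512, 512, 3, 3)
print(1 - factored_head / full_head) # ~0.9996, i.e., the ~99.95% reduction
```

The full-size tensor M is materialized only to modulate W; the hypernetwork itself never outputs more than the 1030 factor entries per layer.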

3.2.2. WORD-LEVEL MODULATION WITH WHATT ATTENTION

In contrast to a sentence embedding, where all words are summarized in one vector, word embeddings form a sequence of individual word encodings containing fine-grained information that is typically visually grounded in the image. Hence, we focus on how to leverage this information in our model. We introduce a word-level hyper-attention mechanism, denoted WhAtt, that can leverage word-level as well as sentence-level information through attention. WhAtt Attention. First, word embeddings are extracted from the text encoder. These embeddings are of size Ω × d, where Ω denotes the sequence length (i.e., the number of words) and d is the embedding size. The word embeddings are further encoded by a separate HyperNetwork consisting of a single Conv1x1 layer for each layer ℓ, yielding a tensor T^ℓ ∈ R^{Ω × (c_in + k_h + k_w)}. The tensor T^ℓ is composed of Ω vectors v^i ∈ R^{c_in + k_h + k_w}, one per word i. Each vector v^i is "sliced" into three low-rank factors v^i_in, v^i_h, v^i_w of dimensions c_in, k_h, k_w, respectively. From the entire tensor T^ℓ we then derive a two-dimensional matrix Q^ℓ ∈ R^{Ω × (c_in × k_h × k_w)} using the tensor factorization of Eq. 2, expressed via an outer product:

$$Q^ℓ_i = v^i_{in} \otimes v^i_h \otimes v^i_w \quad (3)$$

where Q^ℓ_i is the i-th row of Q^ℓ. We apply scaled dot-product attention (Vaswani et al., 2017) to attend to the relevant words, producing the word-level modulating tensor M^ℓ_w:

$$M^ℓ_w = \mathrm{WhAtt}(W^ℓ, Q^ℓ) = \mathrm{softmax}\!\left(\frac{W^ℓ (Q^ℓ)^T}{\sqrt{c_{out}}}\right) Q^ℓ \quad (4)$$

where W^ℓ is the weight matrix at layer ℓ and W^ℓ, M^ℓ_w ∈ R^{c_out × c_in × k_h × k_w}. Finally, the modulating tensors for the generator and discriminator under sentence-based and word-based modulation are defined by Eq. 5 and Eq. 6, respectively:

$$\hat{W}^ℓ_G = M^ℓ_{z,s} \odot W^ℓ_G, \qquad \hat{W}^ℓ_D = M^ℓ_s \odot W^ℓ_D \quad (5)$$
$$\hat{W}^ℓ_G = M^ℓ_z \odot M^ℓ_w \odot W^ℓ_G, \qquad \hat{W}^ℓ_D = M^ℓ_w \odot W^ℓ_D \quad (6)$$

where Ŵ^ℓ_G and Ŵ^ℓ_D are the modulated weights at layer ℓ for the generator and the discriminator, respectively, and W^ℓ_G and W^ℓ_D are the corresponding weights before modulation. Note that, as in sentence-level modulation, k_h = k_w = 1 for the generator, while in the discriminator they equal the convolution kernel size. Our word-level modulation aims at grounding words to independent pixels at input (x, y) coordinates, represented as low-resolution features in the earlier layers and as the final RGB value in the last layer. The benefit of word-level conditioning for visual-semantic consistency was first demonstrated for discrete decoders in AttnGAN (Xu et al., 2018); our word-level modulation brings similar properties to text-conditioned continuous image generation.
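Under the simplifying assumption of a single flattened generator layer with k_h = k_w = 1 (toy sizes, not the trained model), the WhAtt computation of Eqs. 3-6 can be sketched as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
omega, c_out, c_in, kh, kw = 12, 64, 32, 1, 1   # kh = kw = 1 in the generator

# Per-word rank-1 factors, as produced by the Conv1x1 hypernetwork.
v_in = rng.standard_normal((omega, c_in))
v_h = rng.standard_normal((omega, kh))
v_w = rng.standard_normal((omega, kw))

# Q[i] = v_in[i] ⊗ v_h[i] ⊗ v_w[i], flattened to a row of length c_in*kh*kw.
Q = np.einsum('ia,ib,ic->iabc', v_in, v_h, v_w).reshape(omega, -1)

# Layer weight, flattened to (c_out, c_in*kh*kw).
W = rng.standard_normal((c_out, c_in * kh * kw))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# M_w = softmax(W Q^T / sqrt(c_out)) Q: each output channel attends to words.
M_w = softmax(W @ Q.T / np.sqrt(c_out)) @ Q
W_mod = M_w * W                      # Ŵ = M_w ⊙ W, the modulated weight

print(M_w.shape, W_mod.shape)        # both (64, 32)
```

The softmax runs over the Ω words for each row of W, so every output channel of the layer receives a word-weighted mixture of the rank-1 factors before the elementwise modulation.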

4. EXPERIMENTS AND RESULTS

In this section, we first describe the datasets, metrics, and baselines; we then compare our model to the baselines on the benchmarks and study the properties and limitations of our approach. Datasets. We comprehensively evaluate HyperCGAN on the challenging MS-COCO (Lin et al., 2015), ArtEmis (Achlioptas et al., 2021), and CUB (Wah et al., 2011) datasets.
- COCO 256² contains over 80K images for training and more than 40K images for testing. Each image has 5 associated captions that describe its visual content. We use the splits proposed in (Xu et al., 2018) to train and test our models.
- ArtEmis 256² (our introduced T2I benchmark) contains over 450K emotion attributes and explanations from humans on more than 81K artworks from the WikiArt dataset. Each image is associated with at least 5 captions. The unique aspect of the dataset is that the utterances are affective and subjective rather than descriptive, which imposes additional challenges on the T2I generation task. We use the train and test splits provided by the authors and benchmark recent T2I methods on it. Both COCO and ArtEmis are scene-level T2I benchmarks.
- CUB 256² contains 8,855 training and 2,933 test images of bird species. Each image has 10 corresponding text descriptions. In contrast to COCO and ArtEmis, CUB is an object-level benchmark, yet challenging since the bird species are fine-grained.
Evaluation Metrics. We evaluate all models in terms of both image quality and text-image alignment. Due to the limitations of the Inception Score (IS) (Salimans et al., 2016) in capturing the diversity and quality of generation, we report the Fréchet Inception Distance (FID) (Heusel et al., 2017) following previous works (Zhang & Schomaker, 2020; Tao et al., 2022). Additionally, we compute R-precision, since image-quality scores alone cannot reflect whether the generated image is well conditioned on the given text description.
Given a generated image, R-precision measures the retrieval rate of the corresponding caption using a surrogate multimodal network that computes the similarity between image and text features. Prior works (Xu et al., 2018; Li et al., 2019; Zhu et al., 2019) relied on a pretrained DAMSM model, consisting of a text encoder and an image encoder, to compute the similarity between generated images and text descriptions for R-precision, termed DAMSM-R. However, using the same DAMSM model during both training and evaluation leads to severely biased behavior on this metric (see Table 9 in the Appendix). Therefore, as suggested in (Park et al., 2021), we also report an R-precision score where image-text similarity is computed with CLIP (Radford et al., 2021), dubbed CLIP-R. Moreover, we conduct a human evaluation to assess the meaningfulness and image-text alignment of the extrapolated regions enabled by HyperCGAN's conditional continuous image generation. Text to Continuous Image (T2CI) Generation Baselines. Since our work is the first attempt at T2CI, we define the following baselines. INR-CGAN_sent: we transform the unconditional INR-GAN to be conditioned on sentence embeddings. This baseline simply takes the concatenated noise vector and sentence embedding and generates parameters for the decoder to synthesize an image. We condition its discriminator via a projection head as in our approach but do not modulate the convolution layers on text. This corresponds to configuration B in Table 1. WhAtt Attention Generalization on Discrete Decoders. Our sentence-based modulation and word-level WhAtt conditioning mechanism can easily be applied to conventional convolution-based generators. In this case, our hypernetwork-based methods modulate the convolution weights of the generator, whose layers have kernel sizes larger than 1 (k_h > 1 and k_w > 1).
To show this, we enable the standard unconditional StyleGAN2 (Karras et al., 2020) backbone to be conditioned on either sentence or word embeddings, dubbed HyperC-SG_sent and HyperC-SG_word. Table 2 shows that HyperC-SG_word DAMSM, equipped with our proposed WhAtt mechanism, boosts CLIP-R to 61.49% compared to 54.45% for HyperC-SG_sent DAMSM, while also significantly improving image quality, with the FID dropping from 31.47 to 20.81. Table 2 also includes the same comparison for our continuous model HyperCGAN, which achieves the best results, and the WhAtt improvement is more significant (18.58 FID, 64.14 CLIP-R); see Table 8 and Figure 9 for further results. b) Extrapolation. Figure 1 shows the ability of HyperCGAN to extrapolate outside of the training image boundaries. After training on a coordinate grid with a specific range, HyperCGAN can be evaluated on a wider grid. We are interested in whether the extended regions make sense beyond the training-data coordinates. To validate this, we conducted user studies where subjects were asked to indicate a) whether the extended area in the generations is meaningful and b) whether the extended area makes the image more aligned with the text description. In Table 3, for COCO, 68.8% of responses indicate that the out-of-boundary extrapolation is meaningful, while 20.8% say it is not. For ArtEmis, 75.6% of responses find it meaningful, while 16.4% indicate the opposite. In Table 4, more than 65% of responses suggest that the alignment between the image and the text description improved or remained the same for both COCO and ArtEmis. Comparison to the State of the Art. To demonstrate how much we reduced the gap to discrete T2I decoders, we compare HyperCGAN with discrete state-of-the-art approaches (Xu et al., 2018; Li et al., 2019; Zhu et al., 2019; Zhang et al., 2021; Tao et al., 2022). Note that AttnGAN (Xu et al., 2018) and DM-GAN (Zhu et al., 2019) are multi-stage generators.
Figure 4 shows qualitative results of our model compared to the baselines; the generation quality is comparable to the state of the art. Table 5 shows that our models achieve the highest CLIP-R on COCO and results comparable to XMC-GAN on ArtEmis and CUB. For a fair comparison to the baselines that use the DAMSM regularizer, we report the scores of our HyperCGAN_word DAMSM model, which achieves a higher CLIP-R of 64.14% and an FID of 16.26 on COCO. Note that every baseline except DF-GAN utilizes both sentence and word embeddings, while our model is conditioned on only one type of text embedding during training and still achieves superior FID compared to the other baselines on all datasets (except XMC-GAN, which has 2.5 times more parameters than our model). HyperCGAN also compares favorably to DALL-E (Ramesh et al., 2021) on COCO and CUB: 27.21 vs. 27.50 and 11.00 vs. 56.10 FID, respectively. However, the comparison to DALL-E might not be fair since its result is based on zero-shot T2I generation: DALL-E was trained on an amount of data orders of magnitude larger than our benchmarks, and that data may or may not cover data similar to these benchmarks.

5. CONCLUSION

In this paper, we propose HyperCGAN, a novel HyperNetwork-based conditional continuous GAN. HyperCGAN is a text-to-continuous-image generative model with a single generator that operates with a novel language-guided tensor modulation operator for sentence-level and word-level attention. To our knowledge, HyperCGAN is the first approach that facilitates text-to-continuous-image generation, and we show its ability to meaningfully extrapolate images beyond the training image dimensions while maintaining alignment with the input language description. We showed that HyperCGAN achieves better performance than existing discrete text-to-image synthesis baselines. In addition, we demonstrated that our hypernetwork-modulation methods can be applied to discrete GANs as well. We hope that our method encourages future work on HyperNetworks for Text to Continuous Image Generation (T2CI).

6.1. IMPLEMENTATION DETAILS

For all experiments, we kept the batch size equal to 16 and ran for 25k iterations. For the COCO dataset, we followed the standard splits; for ArtEmis, we split the dataset into train/val/test splits in a ratio of 0.85/0.10/0.05. At inference, we used only the test split to generate art images.

6.2. SENTENCE-LEVEL INFORMATION

Similar to (Xu et al., 2018; Li et al., 2019; Zhu et al., 2019; Tao et al., 2022), we first extract 256-dimensional sentence embeddings, denoted c, from an LSTM-based pretrained text encoder. We then concatenate the extracted embedding with a noise vector z of dimension 512 and pass the result through a hypernetwork T_G(z, c) of the generator.

6.3. EFFICIENT SENTENCE-LEVEL MODULATION

To efficiently produce the conditioning tensors for sentence-level modulation, we use the aforementioned factorization technique to predict the modulation mask M of size c_out × c_in × k_h × k_w from only 4 vectors: M = t_1 ⊗ t_2 ⊗ t_3 ⊗ t_4, where t_1, t_2, t_3, t_4 have dimensions c_out, c_in, k_h, and k_w, respectively (see Figure 3).
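A minimal NumPy sketch of this rank-1 factorization (the factor vectors here are random stand-ins for the hypernetwork outputs, and the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, k_h, k_w = 64, 32, 3, 3

# Four factor vectors; in the paper these are predicted by the hypernetwork.
t1 = rng.normal(size=c_out)
t2 = rng.normal(size=c_in)
t3 = rng.normal(size=k_h)
t4 = rng.normal(size=k_w)

# Rank-1 outer product: M[o, i, h, w] = t1[o] * t2[i] * t3[h] * t4[w]
M = np.einsum('o,i,h,w->oihw', t1, t2, t3, t4)
assert M.shape == (c_out, c_in, k_h, k_w)

# The mask modulates a base convolution weight elementwise, e.g.:
W = rng.normal(size=(c_out, c_in, k_h, k_w))
W_mod = W * M
```

The savings are substantial: the hypernetwork emits c_out + c_in + k_h + k_w = 102 numbers instead of the full c_out · c_in · k_h · k_w = 18432 entries of M.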

6.4. WORD-LEVEL INFORMATION

In order to leverage word-level information, we extract the word embeddings from the same text encoder mentioned above. However, word embeddings have different sequence lengths and are not suitable for batch processing. Therefore, the word embeddings are zero-padded to match the maximum word length. Then, the padded embeddings go through hypernetworks with a single conv1×1 layer to generate style vectors of dimension τ × (c_in + k_w + k_h) (see Figure 8).
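The padding and the per-word conv1×1 projection can be sketched as follows (a toy NumPy version with random embeddings; the embedding dimension and layer sizes are illustrative, and a conv1×1 over the sequence axis reduces to a shared linear map applied to each word):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                   # word embedding dimension
c_in, k_h, k_w = 32, 3, 3
out_dim = c_in + k_h + k_w                # per-word style dimension

# Variable-length word embeddings for a batch of two captions.
caps = [rng.normal(size=(5, d)), rng.normal(size=(8, d))]
tau = max(e.shape[0] for e in caps)       # max word length in the batch

# Zero-pad every caption to the common length tau for batch processing.
batch = np.zeros((len(caps), tau, d))
for i, e in enumerate(caps):
    batch[i, :e.shape[0]] = e

# conv1x1 == shared linear projection per word position.
W = rng.normal(size=(out_dim, d)) * 0.01
styles = batch @ W.T                      # (batch, tau, c_in + k_h + k_w)
print(styles.shape)                       # (2, 8, 38)
```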

6.5. TEXT ENCODER

For the text encoder, we adopt the pretrained text encoder from AttnGAN. This text encoder is used in all the baselines reported in the paper; therefore, for consistency, we also use the AttnGAN text encoder, which is based on a bi-directional Long Short-Term Memory (LSTM) network. In the bi-directional LSTM, each word corresponds to two hidden states, one for each direction, and the two hidden states are concatenated to represent the semantic meaning of the word. The last hidden states of the bi-directional LSTM are concatenated to form the global sentence vector. The hidden size of both embeddings is 256.

6.6. DAMSM LOSS

The DAMSM loss (Xu et al., 2018) is defined on top of the Inception-v3 image model (Szegedy et al., 2016), which is used to extract image features f ∈ R^{768×289} (reshaped from 768×17×17); 768 is the dimension of the local feature vector, and 289 is the number of sub-regions in the image. These features are converted to the common semantic space of the text features by an FC layer: v = W f and u = W̄ f̄ (with f̄ the global image feature), where v_i is the visual feature vector of the i-th sub-region of the image, and u ∈ R^D is the global vector for the whole image. We then calculate the similarity matrix for all possible pairs of words in the sentence and sub-regions in the image,

s = e^T v, (7)

where s_{i,j} is the dot-product similarity between the i-th word of the sentence and the j-th sub-region of the image. We find it beneficial to normalize the similarity matrix as

s̄_{i,j} = exp(s_{i,j}) / Σ_{k=0}^{T−1} exp(s_{k,j}).

Then, a region-context vector c_i is defined as a representation of the image's sub-regions related to the i-th word of the sentence. It is computed as the weighted sum over all regional visual vectors, i.e.,

c_i = Σ_{j=0}^{288} α_j v_j, where α_j = exp(γ_1 s̄_{i,j}) / Σ_{k=0}^{288} exp(γ_1 s̄_{i,k}).

The relevance between the i-th word and the image is then measured via the cosine similarity between c_i and e_i, i.e., R(c_i, e_i) = (c_i^T e_i) / (||c_i|| ||e_i||). The attention-driven image-text matching score between the entire image q and the whole text description d is defined as

R(q, d) = log ( Σ_{i=1}^{T−1} exp(γ_2 R(c_i, e_i)) )^{1/γ_2}, (10)

where we used the default parameters of (Xu et al., 2018). The DAMSM loss is finally defined as

L_DAMSM = L_1^w + L_2^w + L_1^s + L_2^s, (11)

where

L_1^w = − Σ_{i=1}^M log P(d_i | q_i), (12)
L_2^w = − Σ_{i=1}^M log P(q_i | d_i), (13)

with 'w' standing for "word", and P(q_i | d_i) = exp(γ_3 R(q_i, d_i)) / Σ_{j=1}^M exp(γ_3 R(q_j, d_i)) the posterior probability that image q_i is matched with its corresponding sentence d_i. If we redefine Eq. 10 by R(q, d) = (u^T e) / (||u|| ||e||) and substitute it into Eqs. 12 and 13, we obtain the loss functions L_1^s and L_2^s (where 's' stands for "sentence"), which use the sentence vector e and the global image vector u. The DAMSM loss is designed to learn the attention model in a semi-supervised manner, in which the only supervision is the matching between entire images and whole sentences (sequences of words). Similar to (Fang et al., 2015; Huang et al., 2013), for a batch of image-sentence pairs {(q_i, d_i)}_{i=1}^M, the posterior probability of sentence d_i being matched with image q_i is computed as

P(d_i | q_i) = exp(γ_3 R(q_i, d_i)) / Σ_{j=1}^M exp(γ_3 R(q_i, d_j)),

where γ_3 is a smoothing factor determined experimentally. In this batch of sentences, only d_i matches the image q_i, and all other M−1 sentences are treated as mismatching descriptions. The loss is thus the negative log posterior probability that the images are matched with their corresponding text descriptions (ground truth), as in Eq. 12.
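The word-region attention underlying DAMSM can be sketched in a few lines of NumPy (a toy reimplementation over random features; the dimensions and the γ_1 value here are illustrative assumptions, not the trained pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, N = 768, 12, 289                    # feature dim, words, image sub-regions
gamma1 = 5.0                              # smoothing factor (illustrative value)

e = rng.normal(size=(D, T))               # word features from the text encoder
v = rng.normal(size=(D, N))               # region features (e.g. Inception-v3 + FC)

s = e.T @ v                               # (T, N) word-region similarities (Eq. 7)

# Normalize over words for each sub-region.
s_bar = np.exp(s) / np.exp(s).sum(axis=0, keepdims=True)

# Attention weights over regions for each word, then region-context vectors.
alpha = np.exp(gamma1 * s_bar)
alpha /= alpha.sum(axis=1, keepdims=True)  # rows sum to 1 over regions
c = v @ alpha.T                            # (D, T): c[:, i] attends regions for word i

# Word-image relevance R(c_i, e_i) via cosine similarity.
R = np.sum(c * e, axis=0) / (np.linalg.norm(c, axis=0) * np.linalg.norm(e, axis=0))
print(R.shape)                             # (T,)
```

The matching score R(q, d) then aggregates these per-word relevances with the log-sum-exp of Eq. 10.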

6.7. ABLATIONS FOR RANK VALUES

We experimented with different ranks for producing the modulating tensor (e.g., R = 1, 3, 5, 10) and found that, for the discrete-based generator, R = 1 is enough to achieve good results. Increasing the rank value did not improve performance but rather increased the parameter count.
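Under the same factorization assumptions as in the sentence-level modulation section, a rank-R mask is a sum of R outer products; the sketch below (random factors as stand-ins for hypernetwork outputs, illustrative layer sizes) also makes the parameter trade-off explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, k_h, k_w = 64, 32, 3, 3

def modulation_mask(R):
    """Rank-R modulation mask: a sum of R outer products of four factor
    vectors (factors are random placeholders for hypernetwork outputs)."""
    M = np.zeros((c_out, c_in, k_h, k_w))
    for _ in range(R):
        t1, t2, t3, t4 = (rng.normal(size=n) for n in (c_out, c_in, k_h, k_w))
        M += np.einsum('o,i,h,w->oihw', t1, t2, t3, t4)
    return M

# Each extra rank costs c_out + c_in + k_h + k_w predicted numbers,
# versus the full mask size:
per_rank = c_out + c_in + k_h + k_w        # 102
full = c_out * c_in * k_h * k_w            # 18432
print(per_rank, full)
```

This is consistent with the ablation: higher ranks grow the hypernetwork output linearly without a matching quality gain.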



Figure 1: Text-Conditioned Extrapolation outside of Image Boundaries: The red rectangles indicate the resolution boundaries that our HyperCGAN model was trained on. On three datasets, our model can synthesize meaningful pixels at surrounding (x, y) coordinates beyond these boundaries.

Figure 2: Scalability limitations in discrete decoders: increasing the training resolution decreases the batch size per GPU until GPU memory limits are hit.

Figure 3: The architecture of the proposed HyperCGAN, with two ways of conditioning. a) Sentence-Level Modulation: The generator is conditioned with a hypernetwork which takes the concatenation of the noise vector z and the sentence embedding e_s. Then, the weights of every Linear layer of the generator are modulated by a modulating tensor. The discriminator's convolutional weights at block l are modulated by the hypernetwork operating at level l. The final projection head is conditioned as h^T F(e_s), where F(e_s) is a two-layer MLP and h is the output of the last discriminator block. b) Word-Level Modulation with WhAtt attention: Two hypernetworks are used to condition the generator. The first is an MLP which receives the noise vector z and outputs a modulating tensor M_z, and the second is a Conv1×1 followed by the Word-level hyper-Attention mechanism proposed in this work, dubbed WhAtt. Details are introduced in Section 3.2.2.

Figure 4: HyperCGAN qualitative results on COCO 256^2, CUB 256^2, and ArtEmis 256^2.

Figure 5: Examples of affective captions and the corresponding emotion from the ArtEmis dataset, with generations from HyperC-SG_word. (Columns: DM-GAN, DF-GAN, HyperC-SG.)

Figure 7: Numpy-like pseudocode for core tensor modulation implementation.

Figure 8: Numpy-like pseudocode for attention-based word-level tensor modulation.

6.10. GENERATING HIGH RESOLUTION IMAGES (COCO)

Input captions for Figure 10: "a house being built with lots of wood"; "a laptop computer sits on a computer desk next to a mouse"; "a batter backs up as the ball is thrown"; "a man holding a fully topped pizza in front of the camera"; "soccer players are running after the ball together"; "a plane is flying high in the very cloudy sky".

Figure 10: High-resolution generations (1024x1024) from our HyperC-SGs trained on COCO.

HyperCGAN sent: This baseline is built on top of INR-CGAN sent. The generator stays unchanged, but the discriminator convolution weights are modulated with our "Efficient Sentence-Level Modulation" (config E in Table 1). HyperCGAN word: The generator and discriminator of this model are conditioned via our proposed WhAtt mechanism (config H in Table 1). For both HyperCGAN sent and HyperCGAN word, we either use DAMSM-based or CLIP-based regularizers. When the model is trained with one of these regularizers, we indicate it with a subscript, e.g., HyperCGAN word DAMSM or HyperCGAN word CLIP.

T2CI Results. Table 1 reports T2CI performance on COCO 256^2 (blue marks the best result and green the second best; note that CLIP-R is meaningless for the unconditional INR-GAN, config A, and that config F starts from the unconditional INR-GAN, similar to configs B and C). INR-CGAN sent (config B) with naive conditioning achieves 34.91% CLIP-R and 27.73 FID. When its discriminator is changed to a hyper-modulated one (config E), CLIP-R improves to 40.81% with a comparable FID of 28.29. Our hypernetwork-based conditioning makes it possible to use word-level conditioning, which is crucial for achieving good results: when both the INR decoder and the discriminator are conditioned via our WhAtt method on word embeddings (config H), the FID score improves from 28.25 to 25.39, at a slightly lower CLIP-R of 37.23% compared to config B. Coupling our WhAtt conditioning with contrastive regularizers improves CLIP-R retrieval much further, leading to the best scores of 64.14% and 63.99%, with better FID scores of 18.91 and 27.21 for HyperCGAN word DAMSM and HyperCGAN word CLIP (configs L and M), respectively. The word-level modulation performs significantly better due to the improved granularity connecting the generated images to the input text. It is interesting to observe that HyperCGAN word DAMSM even outperforms the unconditional INR-GAN by 5.83 FID points. Although the regularizers focus more on visual-semantic alignment (CLIP-R) than on image quality (FID), we also observe a relative improvement in FID in most cases, which could be due to the improved representation guided by the text.

Discrete and continuous synthesis performance with our hypernet-based conditioning on COCO 256^2.

Human Subject Experiment on Extrapolation meaningfulness.

Human Subject Experiment on Extrapolation alignment with text.

DAMSM with WhAtt achieves the best FID score of 11.00. Compared to XMC-GAN on ArtEmis and CUB, the results in terms of FID and CLIP-R are almost the same. However, our model requires far fewer parameters, making it more efficient during training. Almost all the baselines'

Comparison to SOTA Discrete T2I models.

Modulated dimensions      FID     CLIP-R
c_out, c_in               21.11   47.86%
c_in, k_h, k_w            20.32   49.42%
c_out, k_h, k_w           23.49   43.95%
c_out, c_in, k_h, k_w     23.59   49.85%

Effect of different choices of modulating tensors in HyperC-SG sent.


We performed additional experiments where we trained our models on the CUB dataset and compared them to recent baselines (DF-GAN, DM-GAN). In Table 7, the results indicate that our models achieve the best FID and comparable CLIP-R scores. The results in Table 8 show that our model outperforms standard upsampling techniques on all datasets. See Figure 9 for qualitative results.

AttnGAN (Xu et al., 2018): 81.52% / 78.68% / 67.82%
ControlGAN (Li et al., 2019): 82.43% / 78.75% / 69.33%
DM-GAN (Zhu et al., 2019): 88.56% / 93.54% / 75.89%
XMC-GAN (Zhang et al., 2021)

