MVP: MULTIVARIATE POLYNOMIALS FOR CONDITIONAL DATA GENERATION

Abstract

Conditional Generative Adversarial Nets (cGANs) have been widely adopted for image generation. cGANs take i) a noise vector and ii) a conditional variable as input. The conditional variable can be discrete (e.g., a class label) or continuous (e.g., an input image), resulting in class-conditional (image) generation and image-to-image translation models, respectively. However, depending on whether the conditional variable is discrete or continuous, various cGANs employ substantially different deep architectures and loss functions for their training. In this paper, we propose a novel framework, called MVP, for conditional data generation. MVP resorts to multivariate polynomials of higher order and treats both discrete and continuous conditional variables in a unified way. MVP is highly expressive, capturing higher-order auto- and cross-correlations of the input variables (noise vector and conditional variable). Tailored sharing schemes are designed between the polynomial's parameter tensors, which result in simple recursive formulas. MVP can synthesize realistic images in both class-conditional and image-to-image translation tasks even in the absence of activation functions between the layers.

1. INTRODUCTION

Modelling high-dimensional distributions and generating samples from complex distributions are fundamental tasks in machine learning. Generative adversarial networks (GANs) (Goodfellow et al., 2014) have demonstrated spectacular results in both tasks using both unsupervised (Miyato et al., 2018) and supervised (Brock et al., 2019) learning. In the unsupervised setting, (the generator of) a GAN accepts as input a noise vector $z_I$ and maps it to a high-dimensional output. The supervised models, called conditional Generative Adversarial Nets (cGANs) (Mirza & Osindero, 2014), accept both a noise vector $z_I$ and an additional conditional variable $z_{II}$ that facilitates the generation. The conditional variable can be discrete (e.g., a class or an attribute label) or continuous (e.g., a low-resolution image). The impressive results obtained with both discrete conditional input (Brock et al., 2019) and continuous conditional input (Park et al., 2019; Ledig et al., 2017) have led to a plethora of applications that range from text-to-image synthesis (Qiao et al., 2019) to deblurring (Yan & Wang, 2017) and medical analysis (You et al., 2019). Despite the similarity of the formulation for discrete and continuous conditional input (i.e., learning the function $G(z_I, z_{II})$), the literature has focused on substantially different architectures and losses. Frequently, techniques are developed simultaneously, e.g., the self-attention in the class-conditional Self-Attention GAN (Zhang et al., 2019) and in the Attention-GAN (Chen et al., 2018) with continuous conditional input. This delays progress, since practitioners develop twice as many architectures and losses for every case. A couple of straightforward ideas can be employed to unify the treatment of the two conditional variable types. One idea is to use an encoder network to obtain representations that are independent of the conditional variable.
This has two drawbacks: i) the network ignores the noise and a deterministic one-variable mapping is learned (Isola et al., 2017); ii) such an encoder has not been successful so far for discrete conditional input. An alternative idea is to directly concatenate the labels in the latent space instead of finding an embedding. In AC-GAN (Odena et al., 2017) the class labels are concatenated with the noise; however, the model does not scale well beyond 10 classes. We argue that concatenation of the inputs captures only additive correlations and not higher-order interactions between the inputs. A detailed discussion is conducted in sec. D (in the Appendix). A polynomial expansion with respect to the input variables can capture such higher-order correlations. Π-Net (Chrysos et al., 2020) casts the function approximation into a polynomial expansion of a single input variable. By concatenating the input variables, we could express the function approximation as a polynomial of the fused variable. However, the concatenation reduces the flexibility of the model significantly, e.g., it enforces the same order of expansion with respect to the different variables and it only allows the same parameter-sharing scheme for all variables. We introduce a multivariate framework, called MVP, for conditional data generation. MVP resorts to multivariate polynomials with two input variables, i.e., $z_I$ for the noise vector and $z_{II}$ for the conditional variable. MVP captures higher-order auto- and cross-correlations between the variables. By imposing a tailored structure on the higher-order interactions, we obtain an intuitive, recursive formulation for MVP. The formulation is flexible and enables different constraints to be applied to each variable and its associated parameters. The formulation can be trivially extended to M input variables.
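To make the additive-correlation argument concrete: a linear layer applied to the concatenation $[z_I; z_{II}]$ decomposes into the sum of two independent linear maps, so no multiplicative term $z_{I,\lambda} z_{II,\mu}$ can appear. A minimal numpy sketch of this observation (variable names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
W = rng.standard_normal((k, 2 * d))            # one weight matrix over [z_I; z_II]
z_i, z_ii = rng.standard_normal(d), rng.standard_normal(d)

# A linear map on the concatenation splits into two purely additive terms:
out_concat = W @ np.concatenate([z_i, z_ii])
out_additive = W[:, :d] @ z_i + W[:, d:] @ z_ii
assert np.allclose(out_concat, out_additive)   # no z_i * z_ii cross-term appears
```

Any interaction of the form $z_{I,\lambda} z_{II,\mu}$ therefore requires an explicitly multiplicative model, which is what the polynomial expansion provides.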
In summary, our contributions are the following:
• We introduce a framework, called MVP, that expresses a high-order, multivariate polynomial for conditional data generation. Importantly, MVP treats both discrete and continuous conditional variables in a unified way.
• We offer an in-depth relationship with state-of-the-art works, such as SPADE (Park et al., 2019), that can be interpreted as polynomial expansions. We believe this perspective better explains the success of such architectures and offers a new direction for their extension.
• MVP is trained on eight different datasets for both class-conditional generation and image-to-image translation tasks. The trained models rely on both input variables, i.e., they do not ignore the noise vector.
• To illustrate the expressivity of the model, we also experiment with generators that do not use activation functions between the layers. We verify that MVP can synthesize realistic images even in the absence of activation functions between the layers.
The source code of MVP will be published upon the acceptance of the paper.

2. RELATED WORK

The literature on conditional data generation is vast; dedicated surveys per task (Agnese et al., 2019; Wu et al., 2017b) can be found for the interested reader. Below, we review representative works in conditional generation and then we summarize the recent progress in multiplicative interactions.

2.1. CONDITIONAL GENERATIVE MODELS

The challenging nature of image/video generation has led to a proliferation of conditional models. Although cGAN (Mirza & Osindero, 2014) is a general framework, the methods developed since then for conditional generation differ substantially depending on the type of conditional data. We present below representative works of the two categories, i.e., discrete and continuous conditional data, and their combination. Discrete conditional variable: This is most frequently used for class-conditional generation (Miyato et al., 2018; Brock et al., 2019; Kaneko et al., 2019). Conditional normalization (Dumoulin et al., 2017; De Vries et al., 2017) techniques have been popular in the case of discrete conditional input, e.g., in the generation of natural scene images (Miyato et al., 2018; Brock et al., 2019). Conditional normalization cannot trivially generalize to a continuous conditional variable. In AC-GAN (Odena et al., 2017), the class labels are concatenated with the noise; however, the model does not scale well (i.e., one model is trained per 10 classes). The aforementioned methods cannot be trivially used or modified for continuous conditional input. Text-to-image generation models (Qiao et al., 2019; Li et al., 2019; Zhang et al., 2018; Xu et al., 2018) use a specialized branch to embed the text labels.

Continuous conditional variable:

The influential work of pix2pix (Isola et al., 2017) has become the reference point for continuous conditional input. The conditional input is embedded in a low-dimensional space (with an encoder), and then mapped to a high-dimensional output (through a decoder). The framework has been widely used for inverse tasks (Ledig et al., 2017; Pathak et al., 2016; Wu et al., 2017a; Iizuka et al., 2017; Huang et al., 2017; Yu et al., 2018a; Grm et al., 2019; Xie et al., 2018; Yan & Wang, 2017), conditional pose generation (Ma et al., 2017; Siarohin et al., 2018; Liang et al., 2019), representation learning (Tran et al., 2017), conditional video generation (Wang et al., 2018a), generation from semantic labels (Wang et al., 2018b), and image blending (Wu et al., 2019; Zhan et al., 2019). We recognize two major drawbacks in the aforementioned methods: a) they cannot be easily adapted for discrete conditional input; b) they learn a deterministic mapping, i.e., the noise is typically ignored. However, in many real applications, such as inverse tasks, the mapping is not one-to-one; there are multiple plausible outputs for every conditional input. The auxiliary losses used in such works, e.g., the ℓ1 loss (Isola et al., 2017) or the perceptual loss (Ledig et al., 2017), are an additional drawback. Those losses add hyper-parameters that require tuning and are domain-specific, hence it is challenging to transfer them to different domains or even different datasets. On the contrary, in our experiments, we do not use any additional loss. Discrete and continuous conditional variables: Few works combine both discrete and continuous conditional inputs (Yu et al., 2018b; Xu et al., 2017; Lu et al., 2018). However, these methods include significant engineering (e.g., multiple discriminators (Xu et al., 2017), auxiliary losses), while often the generator learns to ignore the noise (similarly to the continuous conditional input case). Antipov et al. (2017) design a generator for face aging.
The generator combines continuous with discrete variables (age classes); however, no Gaussian noise is utilized, i.e., a deterministic transformation is learned for each input face. InfoGAN (Chen et al., 2016) includes both discrete and continuous conditional variables. However, the authors explicitly mention that additional losses are required, otherwise the generator is 'free to ignore' the additional variables. The idea of Li et al. (2020) is most closely related to our work. They introduce a unifying framework for paired (Isola et al., 2017) and unpaired (Zhu et al., 2017a) learning. However, their framework assumes a continuous conditional input, while ours can handle discrete conditional input (e.g., class labels). In addition, their method requires a pre-trained teacher generator, while ours consists of a single generator trained end-to-end. Diverse data generation: Conditional image generation often suffers from deterministic mappings, i.e., the noise variable often has a negligible or negative impact on the generator (Zhu et al., 2017b; Isola et al., 2017). This has been tackled in the literature with additional loss terms and/or auxiliary network modules. A discussion of representative methods that tackle diverse generation is deferred to sec. I in the Appendix. In Table 1 the differences of the core techniques are summarized. Even though diverse generation is a significant task, we advocate that learning a generator that does not ignore the input variables can be achieved without such additional loss terms. We highlight that diverse generation is a byproduct of MVP and not our main goal. Particularly, we believe that diverse images can be synthesized because the higher-order correlations of the input variables are captured effectively by the proposed method. Multiplicative interactions: Multiplicative connections have long been adopted in computer vision and machine learning (Shin & Ghosh, 1991; Hochreiter & Schmidhuber, 1997; Bahdanau et al., 2015).
The idea is to combine the inputs through elementwise products or other diagonal forms. Even though multiplicative connections have successfully been applied to different tasks, until recently there was no comprehensive study of their expressivity versus standard feed-forward networks. Jayakumar et al. (2020) include the proof that second-order multiplicative operators can represent a greater class of functions than classic feed-forward networks. Even though we capitalize on this theoretical argument, our framework can express any higher-order interactions, while the framework of Jayakumar et al. (2020) is limited to second-order interactions. (Table 1 compares Π-Net (Chrysos et al., 2020), StyleGAN (Karras et al., 2019), sBN (Chen et al., 2019), SPADE (Park et al., 2019), and MVP (ours).) Higher-order interactions have been studied in the tensor-related literature (Kolda & Bader, 2009; Debals & De Lathauwer, 2017). However, their adaptation in modern deep architectures has been slower. Chrysos et al. (2020) propose a high-order polynomial for mapping the input $z$ to the output $x = G(z)$. Π-Net focuses on a single input variable and cannot handle the multivariate cases that are the focus of this work. Three additional works that can be thought of as polynomial expansions are Karras et al. (2019); Park et al. (2019); Chen et al. (2019). The three works were originally introduced as (conditional) normalization variants, but we attribute their improvements to the expressiveness of their polynomial expansions. Under the polynomial expansion perspective, they can be expressed as special cases of the proposed MVP. A detailed discussion is conducted in sec. F in the Appendix. We believe that the proposed framework offers a direction to further extend the results of such works, e.g., by allowing more than one conditional variable.

3. METHOD

The framework for a multivariate polynomial with a two-variable input is introduced in sec. 3.1. The derivation, further intuition and additional models are deferred to the Appendix (sec. B). The crucial technical details, including the stability of the polynomial, are developed in sec. 3.2. We emphasize that a multivariate polynomial can approximate any function (Stone, 1948; Nikol'skii, 2013), i.e., a multivariate polynomial is a universal approximator. Notation: Tensors/matrices/vectors are symbolized by calligraphic/uppercase/lowercase boldface letters, e.g., $\mathcal{W}$, $W$, $w$. The mode-$m$ vector product of $\mathcal{W}$ (of order $M$) with a vector $u \in \mathbb{R}^{I_m}$ is $\mathcal{W} \times_m u$ and results in a tensor of order $M-1$. We adopt the convention that $\prod_{i=a}^{b} x_i = 1$ when $a > b$. The core symbols are summarized in Table 3, while a detailed tensor notation is deferred to the Appendix (sec. B.1).

3.1. TWO INPUT VARIABLES

Given two input variables $z_I, z_{II} \in \mathbb{K}^d$, where $\mathbb{K} \subseteq \mathbb{R}$ or $\mathbb{K} \subseteq \mathbb{N}$, the goal is to learn a function $G : \mathbb{K}^{d \times d} \to \mathbb{R}^o$ that captures the higher-degree interactions between the elements of the two inputs. We can learn such higher-degree interactions as polynomials of two input variables. A polynomial of expansion order $N \in \mathbb{N}$ with output $x \in \mathbb{R}^o$ has the form:

$$x = G(z_I, z_{II}) = \sum_{n=1}^{N} \sum_{\rho=1}^{n+1} \left( \mathcal{W}^{[n,\rho]} \prod_{j=2}^{\rho} \times_j z_I \prod_{\tau=\rho+1}^{n+1} \times_\tau z_{II} \right) + \beta \quad (1)$$

where $\beta \in \mathbb{R}^o$ and $\mathcal{W}^{[n,\rho]} \in \mathbb{R}^{o \times \prod_{m=1}^{n} \times_m d}$ for $n \in [1, N]$, $\rho \in [1, n+1]$ are the learnable parameters. The expansion depends on two (independent) variables, hence we use $n$ and $\rho$ as auxiliary variables. The two products of (1) do not overlap, i.e., the first multiplies the modes $[2, \rho]$ (of $\mathcal{W}^{[n,\rho]}$) with $z_I$ and the other multiplies the modes $[\rho+1, n+1]$ with $z_{II}$.

(Figure 1: Abstract schematic for the $N$th order approximation of $x = G(z_I, z_{II})$.) The inputs $z_I, z_{II}$ are symmetric in our formulation. We denote with $z_I$ a sample from a prior distribution (e.g., Gaussian), while $z_{II}$ symbolizes a sample from a conditional input (e.g., a class label or a low-resolution image).

Recursive relationship: The aforementioned derivation can be generalized to an arbitrary expansion order. The recursive formula for an arbitrary order $N \in \mathbb{N}$ is the following:

$$x_n = x_{n-1} + \left( U_{[n,I]}^T z_I + U_{[n,II]}^T z_{II} \right) * x_{n-1} \quad (2)$$

for $n = 2, \ldots, N$, with $x_1 = U_{[1,I]}^T z_I + U_{[1,II]}^T z_{II}$ and $x = C x_N + \beta$. The parameters $C \in \mathbb{R}^{o \times k}$ and $U_{[n,\phi]} \in \mathbb{R}^{d \times k}$ for $n = 1, \ldots, N$ and $\phi \in \{I, II\}$ are learnable. The intuition behind this model is the following: an embedding is initially found for each of the two input variables, then the two embeddings are added together, and the result is multiplied elementwise with the previous approximation. The different embeddings for each of the input variables allow us to implement $U_{[n,I]}$ and $U_{[n,II]}$ with different constraints, e.g., $U_{[n,I]}$ as a dense layer and $U_{[n,II]}$ as a convolution.
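The recursive relationship can be sketched in a few lines of numpy; the dense matrices below are simple stand-ins for whatever layers (dense or convolutional) implement $U_{[n,I]}$ and $U_{[n,II]}$ in practice, and the function and variable names are ours, not from the paper's released code:

```python
import numpy as np

def mvp_forward(z_i, z_ii, U_i, U_ii, C, beta):
    """Hypothetical dense sketch of the recursive two-variable polynomial.

    U_i, U_ii: lists of N matrices of shape (d, k); C: (o, k); beta: (o,).
    """
    x = U_i[0].T @ z_i + U_ii[0].T @ z_ii                # first-order term x_1
    for n in range(1, len(U_i)):                          # orders n = 2, ..., N
        x = x + (U_i[n].T @ z_i + U_ii[n].T @ z_ii) * x   # elementwise (Hadamard)
    return C @ x + beta                                   # linear head plus bias

rng = np.random.default_rng(0)
d, k, o, N = 4, 8, 5, 3
U_i  = [rng.standard_normal((d, k)) for _ in range(N)]
U_ii = [rng.standard_normal((d, k)) for _ in range(N)]
C, beta = rng.standard_normal((o, k)), rng.standard_normal(o)
x = mvp_forward(rng.standard_normal(d), rng.standard_normal(d), U_i, U_ii, C, beta)
assert x.shape == (o,)
```

Each loop iteration raises the total polynomial order of the output by one, without any activation function.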

3.2. MODEL EXTENSIONS AND TECHNICAL DETAILS

There are three limitations in (2): a) (2) describes a polynomial expansion of a two-variable input; b) each expansion order requires additional layers; c) high-order polynomials might suffer from unbounded values. These limitations are addressed below. Our model can be readily extended beyond two-variable input; an extension with three-variable input is developed in sec. C. The pattern (for each order) is similar to the two-variable input: a) a different embedding is found for each input variable, b) the embeddings are added together, c) the result is multiplied elementwise with the representation of the previous order. The polynomial expansion of (2) requires $\Theta(N)$ layers for an $N$th order expansion. That is, each new order $n$ of expansion requires new parameters $U_{[n,I]}$ and $U_{[n,II]}$. However, the order of expansion can be increased without increasing the parameters substantially. To that end, we can capitalize on the product of polynomials. Specifically, let $N_1$ be the expansion order of the first polynomial. The output of the first polynomial is fed into a second polynomial, which has expansion order $N_2$. Then, the output of the second polynomial has an expansion order of $N_1 \cdot N_2$. The product of polynomials can be used with an arbitrary number of polynomials; it suffices that the output of the $\tau$th polynomial is the input to the $(\tau+1)$th polynomial. For instance, if we assume a product of $\Phi \in \mathbb{N}$ polynomials, where each polynomial has an expansion order of two, then the overall expansion is of order $2^{\Phi}$. In other words, we need $\Theta(\log_2(N))$ layers to achieve an $N$th order expansion. In algebra, higher-order polynomials are unbounded and can thus suffer from instability for large values.
To avoid such instability, we take the following three steps: a) MVP samples the noise vector from the uniform distribution, i.e., from the bounded interval $[-1, 1]$; b) a hyperbolic tangent is used in the output of the generator as a normalization, i.e., it constrains the outputs to the bounded interval $[-1, 1]$; c) batch normalization (Ioffe & Szegedy, 2015) is used to convert the representations to zero mean. We emphasize that in GANs the hyperbolic tangent is the default activation function in the output of the generator, hence it is not an additional requirement of our method. Additionally, in our preliminary experiments, the uniform distribution can be exchanged for a Gaussian distribution without any instability. A theoretical analysis of the bounds of such multivariate polynomials would be an interesting subject for future work.
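The product-of-polynomials argument can be checked symbolically: composing $\Phi$ polynomials of order two yields a polynomial of order $2^{\Phi}$ in the input. A one-variable numpy sketch (a toy stand-in for the vector-valued model):

```python
import numpy as np

# Composing Phi polynomials, each of expansion order 2, yields order 2**Phi.
p = np.polynomial.Polynomial([0.0, 1.0, 1.0])   # q(t) = t + t^2, order 2
Phi = 3
q = p
for _ in range(Phi - 1):
    q = p(q)                                    # feed the output into the next polynomial
assert q.degree() == 2 ** Phi                   # 2^3 = 8th-order expansion overall
```

Three chained second-order blocks thus realize an eighth-order expansion, which is the source of the $\Theta(\log_2(N))$ layer count.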

4. EXPERIMENTS

The proposed MVP is empirically evaluated in three settings: a) class-conditional generation, i.e., with discrete conditional input; b) image-to-image translation, i.e., with continuous conditional input; c) a mixed conditional setting with two conditional variables. The goal is to showcase how MVP can be used with both discrete and continuous conditional inputs. Even though architectures specialized for a single task (e.g., Ledig et al. (2017)) perform well in that task, their well-selected inductive biases (e.g., the perceptual or the ℓ1 loss) do not generalize well to other domains or different conditional inputs. Hence, our goal is not to demonstrate state-of-the-art results in specific tasks, but rather to propose one generic formulation. Further experiments (e.g., class-conditional generation with SVHN or MNIST-to-SVHN translation; sec. H), the details on the datasets and the evaluation metrics (sec. G) are deferred to the Appendix. Throughout the experimental section, we reserve the symbol $z_{II}$ for the conditional input (e.g., a class label). Our framework, e.g., (2), does not include any activation functions. To verify the expressivity of our framework, we maintain this setting for the majority of the experiments below. Particularly, the generator does not have activation functions between the layers; there is only a hyperbolic tangent in the output space for normalization. Training a generator without activation functions between the layers also emerged in Π-Net (Chrysos et al., 2020), where the authors demonstrate the challenges of such a framework. However, we conduct one experiment using a strong baseline with activation functions. That is, a comparison with SNGAN (Miyato & Koyama, 2018) in class-conditional generation is performed (sec. 4.1). Baselines: 'Π-Net-SICONC' implements a polynomial expansion of a single variable, i.e., by concatenating all the input variables. 'SPADE' implements a polynomial expansion with respect to the conditional variable.
Also, 'GAN-CONC' and 'GAN-ADD' are added as baselines, where we replace the Hadamard products with concatenation and addition, respectively. An abstract schematic of the differences between the compared polynomial methods is depicted in Fig. 6, while a detailed description of all methods is deferred to sec. G. Each experiment is conducted five times and the mean and the standard deviation are reported.

4.1. CLASS-CONDITIONAL GENERATION

The first experiment is on class-conditional generation, where the conditional input is a class label in the form of a one-hot vector. Two types of networks are utilized: a) a resnet-based generator (SNGAN), b) a polynomial generator (Π-Net) based on Chrysos et al. (2020). The former network has exhibited strong performance over the last few years, while the latter bears resemblance to the formulation we propose in this work. Table 4: Quantitative evaluation on class-conditional generation with a resnet-based generator (i.e., SNGAN). A higher Inception Score (IS) (Salimans et al., 2016) (a lower Frechet Inception Distance (FID) (Heusel et al., 2017)) indicates better performance. The baselines improve the IS of SNGAN, however they cannot improve the FID. Nevertheless, SNGAN-MVP improves upon all the baselines in both the IS and the FID. Resnet-based generator: The experiment is conducted by augmenting the resnet-based generator of SNGAN. The quantitative results are in Table 4 and synthesized samples are illustrated in Fig. 2(a). SNGAN-MVP improves upon all the baselines in both the Inception Score (IS) (Salimans et al., 2016) and the FID (Heusel et al., 2017). The proposed formulation enables inter-class interpolations. That is, the noise $z_I$ is fixed, while the class $z_{II}$ is interpolated. The scores in 5 (right) depict a substantial improvement over all the baselines (53.9% reduction over the best-performing baseline). We should note that the baseline was not built for conditional generation; however, we have made our best effort to optimize the respective hyper-parameters. We hypothesize that the improvement arises because of the correlations of the classes. That is, the 196 classes might be correlated (e.g., the SUV cars of different carmakers share several patterns). Such correlations are captured by our framework, while they might be missed when learning different normalization statistics per class. Overall, MVP synthesizes plausible images (Fig. 11) even in the absence of activation functions.

4.2. CONTINUOUS CONDITIONAL INPUT

The performance of MVP is scrutinized in tasks with continuous conditional input, e.g., super-resolution. The conditional input $z_{II}$ is an input image, e.g., a low-resolution sample or a corrupted sample. Even though the core architecture remains the same, a single change is made in the structure of the discriminator: motivated by Miyato & Koyama (2018), we include an elementwise product of $z_{II}$ with the real/fake image in the discriminator. This stabilizes the training and improves the results. A wealth of literature is available on such continuous conditional inputs (sec. 2.1); however, we select the challenging setting of using a generator without activation functions between the layers. The experiments are performed on (a) super-resolution and (b) block-inpainting. Super-resolution assumes a low-resolution image is available, while in block-inpainting a (rectangular) part of the image is missing. The two tasks belong to the broader category of 'inverse tasks', and they are significant both for academic and for commercial reasons (Sood et al., 2018; You et al., 2019). Such inverse tasks are underdetermined; each input image corresponds to several plausible output images. The FID scores on Cars196 for the task of super-resolution are reported in Table 6. In super-resolution 16×, $z_{II}$ has 48 dimensions, while in super-resolution 8×, $z_{II}$ has 192 dimensions. Notice that the performance of Π-Net-SICONC deteriorates substantially when the dimensionality of the conditional variable increases. That validates our intuition about the concatenation in the input of the generator (sec. E). We also report SPADE-MVP, which captures higher-order correlations with respect to the first variable as well (further details in sec. G). The proposed SPADE-MVP outperforms the original SPADE; however, it cannot outperform the full two-variable model, i.e., MVP. MVP outperforms all baselines by a large margin.
The qualitative results on (a) super-resolution 8× on CelebA, (b) super-resolution 8× on Cars196, and (c) super-resolution 16× on Cars196 are illustrated in Fig. 3. Similarly, the qualitative results on block-inpainting are visualized in Fig. 11. For each conditional image, different noise vectors $z_I$ are sampled. Notice that the corresponding synthesized images differ in the fine details. For instance, changes in the mouth region, the car type or position, and even background changes are observed. Thus, MVP results in high-resolution images that i) correspond to the conditional input, ii) vary in fine details. Similar variation has emerged even when the source and the target domains differ substantially, e.g., in the translation of MNIST digits to SVHN digits (sec. H.3). We should mention that regularization techniques have been proposed specifically for image-to-image translation, e.g.,

5. CONCLUSION

The topic of conditional data generation is the focus of this work. A multivariate polynomial model, called MVP, is introduced. MVP approximates a function $G(z_I, z_{II})$ with inputs $z_I$ (e.g., a sample from a Gaussian distribution) and $z_{II}$ (e.g., a class label or a low-resolution image). MVP resorts to multivariate polynomials with arbitrary conditional inputs, which capture high-order correlations of the inputs. The empirical evaluation confirms that our framework can synthesize realistic images in class-conditional generation (trained on CIFAR10, Cars196 and SVHN), attribute-guided generation, and image-to-image translation (i.e., super-resolution, block-inpainting, edges-to-shoes, edges-to-handbags, MNIST-to-SVHN). We also showcase that it can be extended to three-variable input with class-conditional super-resolution. In addition to conditional data generation, the proposed framework can be used in tasks requiring the fusion of different types of variables.

A SUMMARY OF SECTIONS IN THE APPENDIX

In the following sections, further details and derivations are provided to elaborate on the details of MVP. Specifically, in sec. B the decomposition and related details of the method are developed. The extension of our method beyond two input variables is studied in sec. C. A method frequently used in the literature for fusing information is concatenation; we analyze how concatenation captures only additive and not more complex correlations (e.g., multiplicative) in sec. D. The differences from Π-Net (Chrysos et al., 2020) are explored in sec. E. In sec. F, some recent (conditional) data generation methods are cast into the polynomial neural network framework and their differences from the proposed framework are analyzed. The experimental details, including the evaluation metrics and details on the baselines, are developed in sec. G. In sec. H, additional experimental results are included. Lastly, the differences from works that perform diverse generation are explored in sec. I.

B METHOD DERIVATIONS

In this section, we expand on the method details, including the scalar output case and the notation. Specifically, a more detailed notation is determined in sec. B.1; the scalar output case is analyzed in sec. B.2. In sec. B.3 a second-order expansion is assumed to illustrate the connection between the polynomial expansion and the recursive formula. Subsequently, we derive an alternative model with different factor sharing. This model, called Nested-MVP, has a nested factor-sharing format (sec. B.4).

B.1 NOTATION

Our derivations rely on tensors (i.e., the multidimensional equivalent of matrices) and (tensor) products. We summarize below the core notation used in our work; the interested reader can find further information in the tensor-related literature (Kolda & Bader, 2009; Debals & De Lathauwer, 2017).

Symbols of variables:

Tensors/matrices/vectors are symbolized by calligraphic/uppercase/lowercase boldface letters, e.g., $\mathcal{W}$, $W$, $w$.

Matrix products:

The Hadamard product of $A, B \in \mathbb{R}^{I \times N}$ is defined as $A * B$ and is equal to $a_{(i,j)} b_{(i,j)}$ for the $(i, j)$ element. The Khatri-Rao product of matrices $A \in \mathbb{R}^{I \times N}$ and $B \in \mathbb{R}^{J \times N}$ is denoted by $A \odot B$ and yields a matrix of dimensions $(IJ) \times N$. The Khatri-Rao product for a set of matrices $\{A_{[m]} \in \mathbb{R}^{I_m \times N}\}_{m=1}^{M}$ is abbreviated by $A_{[1]} \odot A_{[2]} \odot \cdots \odot A_{[M]} \doteq \bigodot_{m=1}^{M} A_{[m]}$.

Tensors: Each element of an $M$th-order tensor $\mathcal{W}$ is addressed by $M$ indices, i.e., $(\mathcal{W})_{i_1, i_2, \ldots, i_M} \doteq w_{i_1, i_2, \ldots, i_M}$. An $M$th-order tensor $\mathcal{W}$ is defined over the tensor space $\mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$, where $I_m \in \mathbb{Z}$ for $m = 1, 2, \ldots, M$. The mode-$m$ unfolding of a tensor $\mathcal{W} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ maps $\mathcal{W}$ to a matrix $W_{(m)} \in \mathbb{R}^{I_m \times \bar{I}_m}$ with $\bar{I}_m = \prod_{k=1, k \neq m}^{M} I_k$, such that the tensor element $w_{i_1, i_2, \ldots, i_M}$ is mapped to the matrix element $w_{i_m, j}$ where $j = 1 + \sum_{k=1, k \neq m}^{M} (i_k - 1) J_k$ with $J_k = \prod_{n=1, n \neq m}^{k-1} I_n$. The mode-$m$ vector product of $\mathcal{W}$ with a vector $u \in \mathbb{R}^{I_m}$, denoted by $\mathcal{W} \times_m u \in \mathbb{R}^{I_1 \times \cdots \times I_{m-1} \times I_{m+1} \times \cdots \times I_M}$, results in a tensor of order $M - 1$:

$$(\mathcal{W} \times_m u)_{i_1, \ldots, i_{m-1}, i_{m+1}, \ldots, i_M} = \sum_{i_m = 1}^{I_m} w_{i_1, i_2, \ldots, i_M} u_{i_m}.$$

We denote $\mathcal{W} \times_1 u^{(1)} \times_2 u^{(2)} \times_3 \cdots \times_M u^{(M)} \doteq \mathcal{W} \prod_{m=1}^{M} \times_m u^{(m)}$.

The CP decomposition (Kolda & Bader, 2009) factorizes a tensor into a sum of component rank-one tensors. The rank-$R$ CP decomposition of an $M$th-order tensor $\mathcal{W}$ is written as:

$$\mathcal{W} \doteq [[U_{[1]}, U_{[2]}, \ldots, U_{[M]}]] = \sum_{r=1}^{R} u_r^{(1)} \circ u_r^{(2)} \circ \cdots \circ u_r^{(M)},$$

where $\circ$ is the vector outer product. The factor matrices $\{U_{[m]} = [u_1^{(m)}, u_2^{(m)}, \ldots, u_R^{(m)}] \in \mathbb{R}^{I_m \times R}\}_{m=1}^{M}$ collect the vectors from the rank-one components. By considering the mode-1 unfolding of $\mathcal{W}$, the CP decomposition can be written in matrix form as:

$$W_{(1)} \doteq U_{[1]} \left( \bigodot_{m=M}^{2} U_{[m]} \right)^T \quad (5)$$

The following lemma is useful in our method:

Lemma 1. For a set of $N$ matrices $\{A_{[\nu]} \in \mathbb{R}^{I_\nu \times K}\}_{\nu=1}^{N}$ and $\{B_{[\nu]} \in \mathbb{R}^{I_\nu \times L}\}_{\nu=1}^{N}$, the following equality holds:

$$\left( \bigodot_{\nu=1}^{N} A_{[\nu]} \right)^T \cdot \left( \bigodot_{\nu=1}^{N} B_{[\nu]} \right) = \left( A_{[1]}^T \cdot B_{[1]} \right) * \cdots * \left( A_{[N]}^T \cdot B_{[N]} \right) \quad (6)$$

An indicative proof can be found in the Appendix of Chrysos et al. (2019).
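Lemma 1 (the mixed-product property of the Khatri-Rao product) is straightforward to verify numerically; the helper below is a hypothetical implementation that assumes the standard row-ordering of the column-wise Kronecker product:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I*J) x N from (I x N) and (J x N)."""
    I, N = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, N)

rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((3, 4)), rng.standard_normal((5, 4))
B1, B2 = rng.standard_normal((3, 6)), rng.standard_normal((5, 6))

# Lemma 1 for N = 2: (A1 kr A2)^T (B1 kr B2) = (A1^T B1) * (A2^T B2)
lhs = khatri_rao(A1, A2).T @ khatri_rao(B1, B2)
rhs = (A1.T @ B1) * (A2.T @ B2)
assert np.allclose(lhs, rhs)
```

This identity is what converts the Khatri-Rao factorized expansion into the elementwise (Hadamard) products of the recursive formula.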

B.2 SCALAR OUTPUT

The proposed formulation expresses higher-order interactions of the input variables. To elaborate, we develop the single-output case below. That is, we focus on an element $\tau$ of the output vector, e.g., a single pixel. In the next few paragraphs, we consider the case of a scalar output $x_\tau$, with $\tau \in [1, o]$, when the input variables are $z_I, z_{II} \in \mathbb{K}^d$. To avoid cluttering the notation, we only refer to the scalar output as $x_\tau$ in the next few paragraphs. As a reminder, the polynomial of expansion order $N \in \mathbb{N}$ with output $x \in \mathbb{R}^o$ has the form:

$$x = G(z_I, z_{II}) = \sum_{n=1}^{N} \sum_{\rho=1}^{n+1} \left( \mathcal{W}^{[n,\rho]} \prod_{j=2}^{\rho} \times_j z_I \prod_{\tau=\rho+1}^{n+1} \times_\tau z_{II} \right) + \beta \quad (7)$$

We assume a second-order expansion ($N = 2$) and let $\tau$ denote an arbitrary scalar output of $x$. The first-order correlations can be expressed through the sums $\sum_{\lambda=1}^{d} w^{[1,1]}_{\tau,\lambda} z_{II,\lambda}$ and $\sum_{\lambda=1}^{d} w^{[1,2]}_{\tau,\lambda} z_{I,\lambda}$. The second-order correlations include both auto- and cross-correlations. The tensors $\mathcal{W}^{[2,1]}$ and $\mathcal{W}^{[2,3]}$ capture the auto-correlations, while the tensor $\mathcal{W}^{[2,2]}$ captures the cross-correlations. (A pictorial representation of the correlations is given in Fig. 4.) Collecting all the terms in one equation, each output is expressed as:

$$x_\tau = \beta_\tau + \sum_{\lambda=1}^{d} \Big[ w^{[1,1]}_{\tau,\lambda} z_{II,\lambda} + w^{[1,2]}_{\tau,\lambda} z_{I,\lambda} + \sum_{\mu=1}^{d} w^{[2,1]}_{\tau,\lambda,\mu} z_{II,\lambda} z_{II,\mu} + \sum_{\mu=1}^{d} w^{[2,3]}_{\tau,\lambda,\mu} z_{I,\lambda} z_{I,\mu} + \sum_{\mu=1}^{d} w^{[2,2]}_{\tau,\lambda,\mu} z_{I,\lambda} z_{II,\mu} \Big] \quad (8)$$

where $\beta_\tau \in \mathbb{R}$. Notice that all the correlations of up to second order are captured in equation 8.
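The five terms of the second-order expansion can be written directly as einsum contractions over the parameter tensors; a minimal sketch (tensor names are ours, mirroring the $\mathcal{W}^{[n,\rho]}$ notation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, o = 3, 2
z_i, z_ii = rng.standard_normal(d), rng.standard_normal(d)
W11, W12 = rng.standard_normal((o, d)), rng.standard_normal((o, d))
W21, W22, W23 = (rng.standard_normal((o, d, d)) for _ in range(3))
beta = rng.standard_normal(o)

# Second-order expansion of equation 8, one einsum per interaction term.
x = (beta + W11 @ z_ii + W12 @ z_i                 # first-order terms
     + np.einsum('tlm,l,m->t', W21, z_ii, z_ii)    # auto-correlation of z_ii
     + np.einsum('tlm,l,m->t', W23, z_i, z_i)      # auto-correlation of z_i
     + np.einsum('tlm,l,m->t', W22, z_i, z_ii))    # cross-correlation term
assert x.shape == (o,)
```

The einsum indices `t, l, m` correspond to $\tau, \lambda, \mu$ in equation 8; the parameter count grows exponentially with the order, which motivates the coupled factorization of the next subsection.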

B.3 SECOND ORDER DERIVATION FOR TWO-VARIABLE INPUT

In all our derivations, the variables associated with the first input $z_I$ carry an $I$ in the notation, e.g., $U_{[1,I]}$; respectively, the notation $II$ is used for the second input $z_{II}$. Even though equation 7 enables any order of expansion, the learnable parameters increase exponentially with the order; we therefore use a coupled factorization to reduce the parameters. Next, we derive the factorization for a second-order expansion (i.e., $N = 2$) and then provide the recursive relationship that generalizes it to an arbitrary order.

Second-order derivation: For a second-order expansion (i.e., $N = 2$ in equation 1), we factorize each parameter tensor $\mathcal{W}_{[n,\rho]}$. We assume a coupled CP decomposition for each parameter as follows:

• Let $W_{[1,1]} = C U_{[1,II]}^T$ and $W_{[1,2](1)} = C U_{[1,I]}^T$ be the parameters for $n = 1$.

• Let $W_{[2,1](1)} = C (U_{[2,II]} \odot U_{[1,II]})^T$ and $W_{[2,3](1)} = C (U_{[2,I]} \odot U_{[1,I]})^T$ capture the second-order correlations of a single variable ($z_{II}$ and $z_I$ respectively).

• The cross-terms are expressed in $\mathcal{W}_{[2,2]} \times_2 z_I \times_3 z_{II}$. The output of the $\tau$ elementfoot_1 is $\sum_{\lambda,\mu=1}^{d} w^{[2,2]}_{\tau,\lambda,\mu} z_{I,\lambda} z_{II,\mu}$. The product $\hat{\mathcal{W}}_{[2,2]} \times_2 z_{II} \times_3 z_I$ also results in the same elementwise expression. Hence, to allow for a symmetric expression, we factorize the term $W_{[2,2](1)}$ as the sum of the two terms $C (U_{[2,II]} \odot U_{[1,I]})^T$ and $C (U_{[2,I]} \odot U_{[1,II]})^T$. For each of the two terms, we assume that the vector-valued inputs are multiplied accordingly.

The parameters $C \in \mathbb{R}^{o \times k}$, $U_{[m,\phi]} \in \mathbb{R}^{d \times k}$ ($m = 1, 2$ and $\phi \in \{I, II\}$) are learnable. The aforementioned factorization results in the following equation:

$x = C U_{[1,II]}^T z_{II} + C U_{[1,I]}^T z_I + C \big( U_{[2,II]} \odot U_{[1,II]} \big)^T \big( z_{II} \odot z_{II} \big) + C \big( U_{[2,I]} \odot U_{[1,I]} \big)^T \big( z_I \odot z_I \big) + C \big( U_{[2,I]} \odot U_{[1,II]} \big)^T \big( z_I \odot z_{II} \big) + C \big( U_{[2,II]} \odot U_{[1,I]} \big)^T \big( z_{II} \odot z_I \big) + \beta \quad (9)$

This expansion captures the correlations (up to second order) of the two input variables $z_I, z_{II}$.

To make the proof more complete, we remind the reader that the recursive relationship (i.e., (2) in the main paper) is:

$x_n = x_{n-1} + \big( U_{[n,I]}^T z_I + U_{[n,II]}^T z_{II} \big) * x_{n-1} \quad (10)$

for $n = 2, \dots, N$, with $x_1 = U_{[1,I]}^T z_I + U_{[1,II]}^T z_{II}$ and $x = C x_N + \beta$.

Claim 1. Equation (9) is a special format of a polynomial that is visualized as in Fig. 1 of the main paper. Equivalently, (9) follows the recursive relationship of (10).

Proof. We observe that the first two terms of equation 9 are equal to $C x_1$ (from equation 10). By applying Lemma 1 to the terms that contain Khatri-Rao products, we obtain:

$x = \beta + C x_1 + C \Big\{ \big( U_{[2,II]}^T z_{II} \big) * \big( U_{[1,II]}^T z_{II} \big) + \big( U_{[2,I]}^T z_I \big) * \big( U_{[1,I]}^T z_I \big) + \big( U_{[2,I]}^T z_I \big) * \big( U_{[1,II]}^T z_{II} \big) + \big( U_{[2,II]}^T z_{II} \big) * \big( U_{[1,I]}^T z_I \big) \Big\} = \beta + C x_1 + C \Big\{ \Big[ \big( U_{[2,I]}^T z_I \big) + \big( U_{[2,II]}^T z_{II} \big) \Big] * x_1 \Big\} = C x_2 + \beta$

The last equation is precisely the one that arises from the recursive relationship of equation 10. To prove the recursive formula for the $N$th-order expansion, a similar pattern as in sec. C of PolyGAN (Chrysos et al., 2019) can be followed. Specifically, the difference here is that, because of the two input variables, the auto- and cross-correlation terms should be included; other than that, the same factor sharing is followed.
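Claim 1 can also be verified numerically. The sketch below (minimal numpy, with our own illustrative variable names) evaluates the explicit factorized second-order form and the recursive form with the same random parameters, and confirms they coincide; vectors are treated as single-column matrices, so their Khatri-Rao product reduces to a Kronecker product:

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: (I*J) x K from (I x K) and (J x K).
    I, K = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

rng = np.random.default_rng(2)
d, k, o = 3, 5, 2
zI, zII = rng.standard_normal(d), rng.standard_normal(d)
U1I, U1II, U2I, U2II = (rng.standard_normal((d, k)) for _ in range(4))
C, beta = rng.standard_normal((o, k)), rng.standard_normal(o)

# Explicit factorised second-order model (the six terms of the expansion).
x_explicit = (C @ (U1II.T @ zII) + C @ (U1I.T @ zI)
              + C @ khatri_rao(U2II, U1II).T @ np.kron(zII, zII)
              + C @ khatri_rao(U2I, U1I).T @ np.kron(zI, zI)
              + C @ khatri_rao(U2I, U1II).T @ np.kron(zI, zII)
              + C @ khatri_rao(U2II, U1I).T @ np.kron(zII, zI)
              + beta)

# Recursive form: x_2 = x_1 + (U2I^T zI + U2II^T zII) * x_1.
x1 = U1I.T @ zI + U1II.T @ zII
x2 = x1 + (U2I.T @ zI + U2II.T @ zII) * x1
x_recursive = C @ x2 + beta
assert np.allclose(x_explicit, x_recursive)
```

The agreement is exactly the collapse of Khatri-Rao products into Hadamard products that Lemma 1 provides.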

B.4 NESTED-MVP MODEL FOR TWO-VARIABLE INPUT

The model proposed above (i.e., equation 10) relies on a single coupled CP decomposition; however, a more flexible model can factorize each level with its own CP decomposition. To effectively do that, we utilize learnable hyper-parameters $b_{[n]} \in \mathbb{R}^{\omega}$ for $n \in [1, N]$, which act as scaling factors for each parameter tensor. Then, a polynomial of expansion order $N \in \mathbb{N}$ with output $x \in \mathbb{R}^o$ has the form:

$x = G(z_I, z_{II}) = \sum_{n=1}^{N} \sum_{\rho=2}^{n+2} \Big( \mathcal{W}_{[n,\rho-1]} \times_2 b_{[N+1-n]} \prod_{j=3}^{\rho} \times_j z_I \prod_{\tau=\rho+1}^{n+2} \times_\tau z_{II} \Big) + \beta \quad (12)$

To demonstrate the factorization without cluttering the notation, we assume a second-order expansion of equation 12.

Second-order derivation: The second-order expansion, i.e., $N = 2$, is derived below. We jointly factorize all parameters of equation 12 with a nested decomposition as follows:

• First-order parameters: $W_{[1,1](1)} = C (A_{[2,II]} \odot B_{[2]})^T$ and $W_{[1,2](1)} = C (A_{[2,I]} \odot B_{[2]})^T$.

• Let $W_{[2,1](1)} = C \big\{ A_{[2,II]} \odot \big[ (A_{[1,II]} \odot B_{[1]}) V_{[2]} \big] \big\}^T$ and $W_{[2,3](1)} = C \big\{ A_{[2,I]} \odot \big[ (A_{[1,I]} \odot B_{[1]}) V_{[2]} \big] \big\}^T$ capture the second-order auto-correlations.

• The cross-terms are included in $\mathcal{W}_{[2,2]} \times_2 b_{[1]} \times_3 z_I \times_4 z_{II}$. The output of the $\tau$ element is expressed as $\sum_{\nu=1}^{\omega} \sum_{\lambda,\mu=1}^{d} w^{[2,2]}_{\tau,\nu,\lambda,\mu} b_{[1],\nu} z_{I,\lambda} z_{II,\mu}$. Similarly, the product $\hat{\mathcal{W}}_{[2,2]} \times_2 b_{[1]} \times_3 z_{II} \times_4 z_I$ has output $\sum_{\nu=1}^{\omega} \sum_{\lambda,\mu=1}^{d} w^{[2,2]}_{\tau,\nu,\mu,\lambda} b_{[1],\nu} z_{I,\lambda} z_{II,\mu}$ for the $\tau$ element. Notice that the only change between the two expressions is the permutation of the third and fourth modes of the tensor; the rest of the expression remains the same. Therefore, to account for this symmetry, we factorize the term $\mathcal{W}_{[2,2]}$ as the sum of the two terms $C \big\{ A_{[2,I]} \odot \big[ (A_{[1,II]} \odot B_{[1]}) V_{[2]} \big] \big\}^T$ and $C \big\{ A_{[2,II]} \odot \big[ (A_{[1,I]} \odot B_{[1]}) V_{[2]} \big] \big\}^T$, and assume that each term is multiplied by the respective inputs.

The parameters $C \in \mathbb{R}^{o \times k}$, $A_{[n,\phi]} \in \mathbb{R}^{d \times k}$, $V_{[n]} \in \mathbb{R}^{k \times k}$, $B_{[n]} \in \mathbb{R}^{\omega \times k}$ for $n = 1, 2$ and $\phi \in \{I, II\}$ are learnable.

Collecting all the terms above and extracting $C$ as a common factor (we omit $C$ below to avoid cluttering the notation):

$x = \big( A_{[2,II]} \odot B_{[2]} \big)^T \big( z_{II} \odot b_{[2]} \big) + \big( A_{[2,I]} \odot B_{[2]} \big)^T \big( z_I \odot b_{[2]} \big) + \big\{ A_{[2,II]} \odot \big[ (A_{[1,II]} \odot B_{[1]}) V_{[2]} \big] \big\}^T \big( z_{II} \odot z_{II} \odot b_{[1]} \big) + \big\{ A_{[2,I]} \odot \big[ (A_{[1,I]} \odot B_{[1]}) V_{[2]} \big] \big\}^T \big( z_I \odot z_I \odot b_{[1]} \big) + \big\{ A_{[2,I]} \odot \big[ (A_{[1,II]} \odot B_{[1]}) V_{[2]} \big] \big\}^T \big( z_I \odot z_{II} \odot b_{[1]} \big) + \big\{ A_{[2,II]} \odot \big[ (A_{[1,I]} \odot B_{[1]}) V_{[2]} \big] \big\}^T \big( z_{II} \odot z_I \odot b_{[1]} \big) = \big( A_{[2,I]}^T z_I + A_{[2,II]}^T z_{II} \big) * \big( V_{[2]}^T x_1 + B_{[2]}^T b_{[2]} \big) \quad (13)$

with $x_1 = \big( A_{[1,I]}^T z_I + A_{[1,II]}^T z_{II} \big) * \big( B_{[1]}^T b_{[1]} \big)$, where the second equality follows from Lemma 1.

Recursive relationship: The recursive formula for the Nested-MVP model with arbitrary expansion order $N \in \mathbb{N}$ is the following:

$x_n = \big( A_{[n,I]}^T z_I + A_{[n,II]}^T z_{II} \big) * \big( V_{[n]}^T x_{n-1} + B_{[n]}^T b_{[n]} \big) \quad (14)$

where $n \in [2, N]$ and $x_1 = \big( A_{[1,I]}^T z_I + A_{[1,II]}^T z_{II} \big) * \big( B_{[1]}^T b_{[1]} \big)$. The parameters $C \in \mathbb{R}^{o \times k}$, $A_{[n,\phi]} \in \mathbb{R}^{d \times k}$, $V_{[n]} \in \mathbb{R}^{k \times k}$, $B_{[n]} \in \mathbb{R}^{\omega \times k}$ for $\phi \in \{I, II\}$ are learnable. Then, the output is $x = C x_N + \beta$.

The Nested-MVP model manifests an alternative network that relies on slightly modified assumptions on the decomposition. Thus, changing the underlying assumptions of the decomposition can modify the resulting network. This can be an important tool for domain-specific applications, e.g., when domain knowledge should be inserted in the last layers.
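The Nested-MVP recursion of equation 14 is simple to implement. The sketch below (minimal numpy; the function and variable names are ours, not from the paper) evaluates an order-$N$ Nested-MVP forward pass:

```python
import numpy as np

def nested_mvp(zI, zII, A_I, A_II, V, B, b, C, beta):
    """Nested-MVP recursion (equation 14); all parameter lists have length N."""
    # x_1 = (A_{1,I}^T zI + A_{1,II}^T zII) * (B_1^T b_1)
    x = (A_I[0].T @ zI + A_II[0].T @ zII) * (B[0].T @ b[0])
    for n in range(1, len(A_I)):
        # x_n = (A_{n,I}^T zI + A_{n,II}^T zII) * (V_n^T x_{n-1} + B_n^T b_n)
        x = (A_I[n].T @ zI + A_II[n].T @ zII) * (V[n].T @ x + B[n].T @ b[n])
    return C @ x + beta

rng = np.random.default_rng(3)
d, k, o, w, N = 3, 4, 2, 3, 3
zI, zII = rng.standard_normal(d), rng.standard_normal(d)
A_I  = [rng.standard_normal((d, k)) for _ in range(N)]
A_II = [rng.standard_normal((d, k)) for _ in range(N)]
V    = [rng.standard_normal((k, k)) for _ in range(N)]   # V[0] is unused
B    = [rng.standard_normal((w, k)) for _ in range(N)]
b    = [rng.standard_normal(w) for _ in range(N)]
C, beta = rng.standard_normal((o, k)), rng.standard_normal(o)

x = nested_mvp(zI, zII, A_I, A_II, V, B, b, C, beta)
assert x.shape == (o,)
```

Note that zeroing all $A_{[n,I]}$ makes the output independent of $z_I$, which is exactly the degenerate SPADE-like case discussed in sec. F.1.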

C BEYOND TWO VARIABLES

Frequently, more than one conditional input is required (Yu et al., 2018b; Xu et al., 2017; Maximov et al., 2020). In such tasks, the aforementioned framework can be generalized to more than two input variables. We demonstrate how this is possible with three variables; the extension to an arbitrary number of input variables is then trivial. Let $z_I, z_{II}, z_{III} \in \mathbb{K}^d$ denote the three input variables. We aim to learn a function that captures the higher-order interactions of the input variables. The polynomial of expansion order $N \in \mathbb{N}$ with output $x \in \mathbb{R}^o$ has the form:

$x = G(z_I, z_{II}, z_{III}) = \sum_{n=1}^{N} \sum_{\rho=1}^{n+1} \sum_{\delta=\rho}^{n+1} \Big( \mathcal{W}_{[n,\rho,\delta]} \prod_{j=2}^{\rho} \times_j z_I \prod_{\tau=\rho+1}^{\delta} \times_\tau z_{II} \prod_{\zeta=\delta+1}^{n+1} \times_\zeta z_{III} \Big) + \beta \quad (15)$

where $\beta \in \mathbb{R}^o$ and $\mathcal{W}_{[n,\rho,\delta]} \in \mathbb{R}^{o \times \prod_{m=1}^{n} \times_m d}$ (for $n \in [1, N]$ and $\rho, \delta \in [1, n+1]$) are the learnable parameters. As in the two-variable input, the number of unknown parameters increases exponentially with the order. To that end, we utilize a joint factorization with factor sharing. The recursive relationship of such a factorization is:

$x_n = x_{n-1} + \big( U_{[n,I]}^T z_I + U_{[n,II]}^T z_{II} + U_{[n,III]}^T z_{III} \big) * x_{n-1} \quad (16)$

for $n = 2, \dots, N$, with $x_1 = U_{[1,I]}^T z_I + U_{[1,II]}^T z_{II} + U_{[1,III]}^T z_{III}$ and $x = C x_N + \beta$. Notice that the pattern (for each order) is similar to the two-variable input: a) a different embedding is found for each input variable, b) the embeddings are added together, c) the result is multiplied elementwise with the representation of the previous order.
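The three-step pattern above extends verbatim to any number of inputs. The minimal numpy sketch below (our own illustrative names) implements the recursion for a list of $M$ input variables; since each step multiplies the previous representation elementwise by $(1 + s_n)$, where $s_n$ is the sum of the order-$n$ embeddings, the recursion admits a simple closed form that can be checked:

```python
import numpy as np

def mvp_multi(inputs, U, C, beta):
    """MVP recursion for M input variables (the three-variable case of eq. 16).
    inputs: list of M vectors; U[n][m]: projection of input m at order n+1."""
    # x_1: sum of the first-order embeddings of all inputs.
    x = sum(U[0][m].T @ z for m, z in enumerate(inputs))
    for n in range(1, len(U)):
        # x_n = x_{n-1} + (sum of order-n embeddings) * x_{n-1}
        x = x + sum(U[n][m].T @ z for m, z in enumerate(inputs)) * x
    return C @ x + beta

rng = np.random.default_rng(4)
d, k, o, N, M = 3, 4, 2, 3, 3   # three input variables, third-order expansion
zs = [rng.standard_normal(d) for _ in range(M)]
U = [[rng.standard_normal((d, k)) for _ in range(M)] for _ in range(N)]
C, beta = rng.standard_normal((o, k)), rng.standard_normal(o)

x = mvp_multi(zs, U, C, beta)
assert x.shape == (o,)
```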

D CONCATENATION OF INPUTS

A popular method used for conditional generation is to concatenate the conditional input with the noise vector. However, as we showcase below, concatenation has two significant drawbacks when compared to our framework. To explain those, we first define a concatenation model.

Let $z_I \in \mathbb{K}_1^{d_1}$, $z_{II} \in \mathbb{K}_2^{d_2}$, where $\mathbb{K}_1, \mathbb{K}_2$ can be subsets of the real or natural numbers. The output of a concatenation layer is $x = P^T [z_I; z_{II}]$, where the symbol ';' denotes concatenation and $P \in \mathbb{R}^{(d_1 + d_2) \times o}$ is an affine transformation on the concatenated vector. The $j$th output is $x_j = \sum_{\tau=1}^{d_1} p_{\tau,j} z_{I,\tau} + \sum_{\tau=1}^{d_2} p_{\tau+d_1,j} z_{II,\tau}$. Therefore, the two differences from the concatenation case are:

• If the input variables are concatenated together we obtain an additive format, not a multiplicative one that can capture cross-term correlations. That is, only the multiplicative format allows capturing higher-order auto- and cross-term correlations.

• The concatenation changes the dimensionality of the embedding space. Specifically, the input space has dimensionality $d_1 + d_2$. That has a significant toll on the size of the filters (i.e., it increases the learnable parameters), while still having only an additive impact. On the contrary, our framework does not change the dimensionality of the embedding spaces.
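The additive nature of concatenation can be demonstrated directly: for a purely additive map, perturbing both inputs jointly equals the sum of the two separate perturbations, while a multiplicative (Hadamard) layer fails this test because of its cross-term $z_I z_{II}$. A minimal numpy sketch (variable names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2, o = 3, 4, 2
P = rng.standard_normal((d1 + d2, o))
UI, UII = rng.standard_normal((d1, o)), rng.standard_normal((d2, o))

def concat_layer(zI, zII):
    # x = P^T [zI; zII]: a purely additive map of the two inputs.
    return P.T @ np.concatenate([zI, zII])

def mult_layer(zI, zII):
    # A multiplicative (Hadamard) interaction of the two embeddings.
    return (UI.T @ zI) * (UII.T @ zII)

zI, zII = rng.standard_normal(d1), rng.standard_normal(d2)
uI, uII = rng.standard_normal(d1), rng.standard_normal(d2)

# Additive map: joint perturbation == sum of separate perturbations.
base = concat_layer(zI, zII)
joint = concat_layer(zI + uI, zII + uII) - base
separate = (concat_layer(zI + uI, zII) - base) + (concat_layer(zI, zII + uII) - base)
assert np.allclose(joint, separate)

# Multiplicative map: the cross-term (UI^T uI) * (UII^T uII) breaks additivity.
base2 = mult_layer(zI, zII)
joint2 = mult_layer(zI + uI, zII + uII) - base2
separate2 = (mult_layer(zI + uI, zII) - base2) + (mult_layer(zI, zII + uII) - base2)
assert not np.allclose(joint2, separate2)
```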

E IN-DEPTH DIFFERENCES FROM Π-NET

In the next few paragraphs, we conduct an in-depth analysis of the differences between Π-Net and MVP. The analysis assumes knowledge of the proposed model, i.e., (2). Chrysos et al. (2020) introduce Π-Net as a polynomial expansion of a single input variable. Their goal is to model functions $x = G(z)$ as high-order polynomial expansions of $z$. Their focus is on a single input variable $z$, which can be noise in the case of image generation or an image in discriminative experiments. The authors express the StyleGAN architecture (Karras et al., 2019) as a polynomial expansion, and advocate that its impressive results can be attributed to the polynomial expansion. To facilitate the in-depth analysis, the recursive relationship that corresponds to (2) is provided below. An $N$th-order expansion in Π-Net is expressed as:

$x_n = \big( \Lambda_{[n]}^T z \big) * x_{n-1} + x_{n-1} \quad (17)$

for $n = 2, \dots, N$, with $x_1 = \Lambda_{[1]}^T z$ and $x = \Gamma x_N + \beta$. The parameters $\Lambda, \Gamma$ are learnable. In this work, we focus on conditional data generation, i.e., there are multiple input variables available as auxiliary information. The trivial application of Π-Net would be to concatenate all $M$ input variables $z_I, z_{II}, z_{III}, \dots$. The input variable $z$ becomes $z = [z_I; z_{II}; z_{III}; \dots]$, where the symbol ';' denotes concatenation. Then, the polynomial expansion of Π-Net can be learned on the concatenated $z$. However, there are four significant reasons why we believe this is not as flexible as the proposed MVP. When we refer to Π-Net below, we refer to the model with concatenated input. In addition, let $z_I \in \mathbb{K}_1^{d_1}$, $z_{II} \in \mathbb{K}_2^{d_2}$ denote the input variables, where $\mathbb{K}_1, \mathbb{K}_2$ can be subsets of the real or natural numbers.

Parameter sharing: MVP allows additional flexibility in the structure of the architecture, since MVP utilizes a different projection layer for each input variable. We utilize this flexibility to share the parameters of the conditional input variable; as we detail in (19), we set $U_{[n,II]} = U_{[1,II]}$ in (2). If we wanted to perform a similar sharing in Π-Net, the formulation equivalent to (17) would set $(\lambda_{[n]})_i = (\lambda_{[1]})_i$ for $i = d_1, \dots, d_1 + d_2$. However, sharing only part of the matrix might be challenging. Additionally, when $\Lambda$ is a convolution, the sharing pattern is not straightforward to compute. Therefore, MVP enables additional flexibility in the model, which is hard to include in Π-Net.

Inductive bias: The inductive bias is crucial in machine learning (Zhao et al., 2018); however, concatenating the variables restricts the flexibility of the model (i.e., Π-Net). To illustrate that, let us use the super-resolution experiments as an example. The input variable $z_I$ is the noise vector and $z_{II}$ is the (vectorized) low-resolution image. If we concatenate the two variables, then we should use a fully-connected (dense) layer, which does not model spatial correlations well. Instead, with MVP, we use a fully-connected layer for the noise vector and a convolution for $z_{II}$ (the low-resolution image). The convolution reduces the number of parameters and captures the spatial correlations in the image. Thus, by concatenating the variables, we reduce the flexibility of the model.

Dimensionality of the inputs:

The dimensionality of the inputs might vary by orders of magnitude, which might create an imbalance during learning. For instance, in class-conditional generation, concatenating the one-hot labels in the input does not scale well when there are hundreds of classes (Odena et al., 2017). We observe a similar phenomenon in class-conditional generation: in Cars196 (with 196 classes) the performance of Π-Net deteriorates considerably when compared to its (relative) performance on CIFAR10 (with 10 classes). On the contrary, MVP does not fuse the elements of the input variables directly, but projects them into a subspace appropriate for adding them.

Order of expansion with respect to each variable: Frequently, the two inputs do not require the same order of expansion. Without loss of generality, assume that we need correlations up to order $N_I$ and $N_{II}$ (with $N_I < N_{II}$) for $z_I$ and $z_{II}$ respectively. MVP includes a different transformation for each variable, i.e., $U_{[n,I]}$ for $z_I$ and $U_{[n,II]}$ for $z_{II}$. Then, we can set $U_{[n,I]} = 0$ for $n > N_I$. On the contrary, the concatenation of inputs (in Π-Net) constrains the expansion to have the same order with respect to each variable. All in all, we can use concatenation to fuse the variables and use Π-Net; however, an inherently multivariate model is more flexible and can better encode the types of inductive bias required for conditional data generation.
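The per-variable order truncation can be verified numerically: zeroing $U_{[n,I]}$ for $n > N_I$ in the two-variable recursion makes the output a polynomial of degree at most $N_I$ in $z_I$, even though the expansion order with respect to $z_{II}$ stays at $N$. A minimal numpy sketch (our own illustrative names) checks this by fitting a polynomial in the scaling factor $t$ of $z_I$:

```python
import numpy as np

def mvp(zI, zII, U_I, U_II, C, beta):
    # Two-variable recursion: x_n = x_{n-1} + (U_I[n]^T zI + U_II[n]^T zII) * x_{n-1}.
    x = U_I[0].T @ zI + U_II[0].T @ zII
    for n in range(1, len(U_I)):
        x = x + (U_I[n].T @ zI + U_II[n].T @ zII) * x
    return C @ x + beta

rng = np.random.default_rng(7)
d, k, o, N, N_I = 3, 4, 1, 3, 1
zI, zII = rng.standard_normal(d), rng.standard_normal(d)
U_I  = [rng.standard_normal((d, k)) for _ in range(N)]
U_II = [rng.standard_normal((d, k)) for _ in range(N)]
C, beta = rng.standard_normal((o, k)), rng.standard_normal(o)

# Truncate the expansion order with respect to z_I: zero projections for n > N_I.
for n in range(N_I, N):
    U_I[n] = np.zeros((d, k))

# G(t * zI, zII), as a function of t, is now a polynomial of degree <= N_I.
ts = np.arange(N + 1.0)
vals = np.array([mvp(t * zI, zII, U_I, U_II, C, beta)[0] for t in ts])
coeffs = np.polynomial.polynomial.polyfit(ts, vals, N)
assert np.allclose(coeffs[N_I + 1:], 0.0, atol=1e-8)
```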

F DIFFERENCES FROM OTHER NETWORKS CAST AS POLYNOMIAL NEURAL NETWORKS

A number of networks with impressive results have emerged in (conditional) data generation over the last few years. Three such networks that are particularly interesting in our context are Karras et al. (2019); Park et al. (2019); Chen et al. (2019). We analyze below each method and how it relates to polynomial expansions:

• Karras et al. (2019) propose an adaptive instance normalization (AdaIN) method for unsupervised image generation. An AdaIN layer expresses a second-order interactionfoot_2: $h = (\Lambda^T w) * n(c(h_{in}))$, where $n$ is a normalization, $c$ the convolution operator, and $w$ is the transformed noise $w = MLP(z_I)$ (mapping network). The parameters $\Lambda$ are learnable, while $h_{in}$ is the input to the AdaIN layer. Stacking AdaIN layers results in a polynomial expansion with a single variable.

• Chen et al. (2019) propose a normalization method, called sBN, to stabilize GAN training. The method performs a 'self-modulation' with respect to the noise variable and, optionally, the conditional variable in the class-conditional generation setting. Henceforth, we focus on the class-conditional setting, which is closer to our work. sBN injects a multiplicative interaction of the input variables into the network layers. Specifically, sBN projects the conditional variable into the space of the variable $z_I$ through an embedding function. Then, the interaction of the two vector-like variables is passed through a fully-connected layer (and a ReLU activation function); the result is injected into the network through the batch-normalization parameters. If cast as a polynomial expansion, a network with sBN layers expresses a single polynomial expansionfoot_3.

• Park et al. (2019) introduce a spatially-adaptive normalization, i.e., SPADE, to improve semantic image synthesis. Their model, referred to as SPADE in the remainder of this work, assumes a semantic layout as a conditional input that facilitates the image generation. We analyze in sec. F.1 how to obtain the formulation of their spatially-adaptive normalization. If cast as a polynomial expansion, SPADE expresses a polynomial expansion with respect to the conditional variable.

The aforementioned works propose or modify the batch-normalization layer to improve the performance or stabilize the training, while in our work we propose the multivariate polynomial as a general function-approximation technique for conditional data generation. Nevertheless, given the interpretation of the previous works in the perspective of polynomials, we can still express them as special cases of MVP. Methodologically, there are two significant limitations that none of the aforementioned works tackle:

• The aforementioned architectures focus on no or one conditional variable. Extending these frameworks to multiple conditional variables might not be trivial, while MVP naturally extends to arbitrarily many conditional variables.

• Even though the aforementioned three architectures (implicitly) use a polynomial expansion, a significant factor is the order of the expansion. In our work, the product of polynomials enables capturing higher-order correlations without substantially increasing the number of layers (sec. 3.2).

In addition to the aforementioned methodological differences, our work is the only polynomial expansion that conducts experiments on a variety of conditional data generation tasks. Thus, we both demonstrate methodologically and verify experimentally that MVP can be used for a wide range of conditional data generation tasks.

F.1 IN-DEPTH DIFFERENCES FROM SPADE

In the next few paragraphs, we conduct an in-depth analysis of the differences between SPADE and MVP. Park et al. (2019) introduce a spatially-adaptive normalization, i.e., SPADE, to improve semantic image synthesis. Their model, referred to as SPADE in the remainder of this work, assumes a semantic layout as a conditional input that facilitates the image generation. The $n$th model block applies a normalization on the representation $x_{n-1}$ of the previous layer and then performs an elementwise multiplication with a transformed semantic layout. The transformed semantic layout can be denoted as $A_{[n,II]}^T z_{II}$, where $z_{II}$ denotes the conditional input to the generator. The output of this elementwise multiplication is then propagated to the next model block, which performs the same operations. Stacking $N$ such blocks results in an $N$th-order polynomial expansion, which is expressed as:

$x_n = \big( A_{[n,II]}^T z_{II} \big) * \big( V_{[n]}^T x_{n-1} + B_{[n]}^T b_{[n]} \big) \quad (18)$

where $n \in [2, N]$ and $x_1 = A_{[1,I]}^T z_I$. The parameters $C \in \mathbb{R}^{o \times k}$, $A_{[n,\phi]} \in \mathbb{R}^{d \times k}$, $V_{[n]} \in \mathbb{R}^{k \times k}$, $B_{[n]} \in \mathbb{R}^{\omega \times k}$ for $\phi \in \{I, II\}$ are learnable. Then, the output is $x = C x_N + \beta$.

SPADE as expressed in (18) resembles one of the proposed models of MVP (specifically (14)). In particular, it expresses a polynomial with respect to the conditional variable. The parameters $A_{[n,I]}$ are set to zero, which means that there are no higher-order correlations with respect to the input variable $z_I$. Therefore, our work bears the following differences from Park et al. (2019):

• SPADE proposes a normalization scheme that is only applied to semantic image generation. On the contrary, our proposed MVP can be applied to any conditional data generation task, e.g., class-conditional generation or image-to-image translation.

• SPADE is a special case of MVP. In particular, by setting i) $A_{[1,II]}$ equal to zero and ii) $A_{[n,I]}$ in (14) equal to zero, we obtain SPADE. In addition, MVP allows different assumptions on the decompositions, which lead to an alternative structure, such as (2).

• SPADE proposes a polynomial expansion with respect to a single variable. On the other hand, our model can extend to an arbitrary number of input variables to account for auxiliary labels, e.g., (16).

• Even though SPADE models higher-order correlations of the conditional variable, it still does not leverage the higher-order correlations of the representations (e.g., as in the product of polynomials), and hence, without activation functions, it might not work as well as the two-variable expansion.

Park et al. (2019) exhibit impressive generation results with large-scale computing (i.e., they report results using an NVIDIA DGX with 8 V100 GPUs). Our goal is not to compete in computationally heavy, large-scale experiments, but rather to illustrate the benefits of the generic formulation of MVP. SPADE is an important baseline for our work. In particular, we augment SPADE in two ways: a) by extending it to accept both continuous and discrete variables in $z_{II}$, and b) by adding polynomial terms with respect to the input variable $z_I$. The latter model is referred to as SPADE-MVP (details in the next section).

Metrics:

The two most popular metrics (Lucic et al., 2018; Creswell et al., 2018) for the evaluation of synthesized images are the Inception Score (IS) (Salimans et al., 2016) and the Fréchet Inception Distance (FID) (Heusel et al., 2017). The metrics utilize the pretrained Inception network (Szegedy et al., 2015) to extract representations of the synthesized images. FID assumes that the extracted representations follow a Gaussian distribution and matches the statistics (i.e., mean and variance) of the representations between real and synthesized samples. Alternative evaluation metrics have been reported as inaccurate, e.g., in Theis et al. (2016); thus we use the IS and FID. Following the standard practice of the literature, the IS is computed by synthesizing 5,000 samples, while the FID is computed using 10,000 samples. The IS is used as a metric exclusively for images of natural scenes, the reasoning being that the Inception network has been trained on images of natural scenes. On the contrary, the FID metric relies on the first- and second-order moments of the representations, which are considered more robust for different types of images. Hence, we only report the IS for the CIFAR10-related experiments, while for the rest the FID is reported.

Dataset details: The main datasets used in this work are the following:

• Large-scale CelebFaces Attributes (or CelebA for short) (Liu et al., 2015) is a large-scale face-attributes dataset with 202,000 celebrity images. We use 160,000 images for training our method.

• Cars196 (Krause et al., 2013) is a dataset that includes different models of cars in different positions and backgrounds. Cars196 has 16,000 images, which have substantially more variation than CelebA faces.

• CIFAR10 (Krizhevsky et al., 2014) contains 60,000 images of natural scenes. Each image is of resolution 32 × 32 × 3 and is classified into one of 10 classes. CIFAR10 is frequently used as a benchmark for image generation.

• The Street View House Numbers dataset (or SVHN for short) (Netzer et al., 2011) has 100,000 images of digits (73,257 of which are for training). SVHN includes color house-number images which are classified into 10 classes; each class corresponds to a digit from 0 to 9. SVHN images are diverse (e.g., with respect to background and scale).

• MNIST (LeCun et al., 1998) consists of images of handwritten digits. Each image depicts a single digit (annotated from 0 to 9) at a 28 × 28 resolution. The dataset includes 60,000 images for training.

• Shoes (Yu & Grauman, 2014; Xie & Tu, 2015) consists of 50,000 images of shoes, where the edges of each shoe are extracted (Isola et al., 2017).

• Handbags (Zhu et al., 2016; Xie & Tu, 2015) consists of more than 130,000 images of handbag items. The edges have been computed for each image and are used as conditional input to the generator (Isola et al., 2017).

• The Anime characters dataset (Jin et al., 2017) consists of anime characters that are generated based on specific attributes, e.g., hair color. The public version usedfoot_4 contains annotations on the hair color and the eye color. We consider 7 classes for the hair color and 6 classes for the eye color, with a total of 14,000 training images.

All the images of CelebA, Cars196, Shoes and Handbags are resized to 64 × 64 resolution.

Architectures: The discriminator structure is kept the same for each experiment; we focus only on the generator architecture. All the architectures are based on two different generator schemes, i.e., the SNGAN (Miyato & Koyama, 2018) and the polynomial expansion of Chrysos et al. (2020) that does not include activation functions in the generator. The variants of the generator of SNGAN are described below:

• SNGAN (Miyato & Koyama, 2018): The generator consists of a convolution, followed by three residual blocks. The discriminator is also based on successive residual blocks.
The public implementation of SNGAN with conditional batch normalization (CBN) is used as the baseline.

• SNGAN-MVP [proposed]: We convert the resnet-based generator of SNGAN into an MVP model. To obtain MVP, the SNGAN is modified in two ways: a) the conditional batch normalization (CBN) is converted into batch normalization (Ioffe & Szegedy, 2015), b) the injections of the two embeddings (from the inputs) are added after each residual block, i.e., the formula of (2). In other words, the generator is converted into a product of two-variable polynomials.

• SNGAN-CONC: Based on SNGAN-MVP, we replace each Hadamard product with a concatenation. This implements the variant mentioned in sec. D.

• SNGAN-SPADE (Park et al., 2019): As described in sec. F.1, SPADE is a polynomial with respect to the conditional variable $z_{II}$. The generator of SNGAN-MVP is modified to perform the Hadamard product with respect to the conditional variable every time.

The variants of the generator of Π-Net are described below:

• Π-Net (Chrysos et al., 2020): The generator is based on a product of polynomials. The first polynomials use fully-connected layers, while the next few polynomials use cross-correlations. The discriminator is based on the residual blocks of SNGAN. We stress that the generator does not include any activation functions apart from a hyperbolic tangent in the output space for normalization. The authors advocate that this exhibits the expressivity of the designed model.

• Π-Net-SICONC: The generator structure is based on Π-Net with two modifications: a) the conditional batch normalization is converted into batch normalization (Ioffe & Szegedy, 2015), b) the second input is concatenated with the first (i.e., the noise) at the input of the generator. Thus, this is a single-variable polynomial, i.e., a Π-Net, where the second input is vectorized and concatenated with the first. This baseline implements the Π-Net described in sec. E.
• MVP [proposed]: The generator of Π-Net is converted into an MVP model with two modifications: a) the conditional batch normalization is converted into batch normalization (Ioffe & Szegedy, 2015), b) instead of having a Hadamard product with a single variable as in Π-Net, the formula with the two-variable input (e.g., (2)) is followed.

• GAN-CONC: Based on MVP, each Hadamard product is replaced by a concatenation. This implements the variant mentioned in sec. D.

• GAN-ADD: Based on MVP, each Hadamard product is replaced by an addition. This modifies (14) to $x_n = \big( A_{[n,I]}^T z_I + A_{[n,II]}^T z_{II} \big) + \big( V_{[n]}^T x_{n-1} + B_{[n]}^T b_{[n]} \big)$.

• SPADE (Park et al., 2019): As described in sec. F.1, SPADE defines a polynomial with respect to the conditional variable $z_{II}$. The generator of Π-Net is modified to perform the Hadamard product with respect to the conditional variable every time.

• SPADE-MVP [proposed]: This is a variant we develop to bridge the gap between SPADE and the proposed MVP. Specifically, we augment the aforementioned SPADE twofold: a) the dense layers in the input space are converted into a polynomial with respect to the variable $z_I$, and b) we also convert the polynomial in the output (i.e., the rightmost polynomial in the Fig. 6 schematics) into a polynomial with respect to the variable $z_I$. This model captures higher-order correlations of the variable $z_I$ that SPADE did not originally include. This model still includes single-variable polynomials; however, the input to each polynomial varies and is not only the conditional variable.

The two baselines GAN-CONC and GAN-ADD capture only additive correlations, hence they cannot effectively model complex distributions without activation functions. Nevertheless, they are added as a reference point to emphasize the benefits of higher-order polynomial expansions. An abstract schematic of the generators that are in the form of products of polynomials is depicted in Fig. 6.
Notice that the compared methods from the literature use polynomials of a single variable, while we propose a polynomial with an arbitrary number of inputs (e.g., the two-input case shown in the schematic). This also enables the non-discrete conditional variables (e.g., low-resolution images) to be concatenated.

From the caption of Fig. 6: (c) SPADE implements a single-variable polynomial for conditional image generation. The polynomial is built with respect to the conditional variable $z_{II}$. This is substantially different from the polynomial with multiple input variables, i.e., MVP. Two additional differences are that (i) SPADE is motivated as a spatially-adaptive method (i.e., for continuous conditional variables), while MVP can be used both for discrete and for continuous variables, and (ii) there is no polynomial in the dense layers of SPADE; however, as illustrated in Π-Net, converting the dense layers into a higher-order polynomial can further boost the performance. (d) The proposed generator structure.

Implementation details of MVP: Throughout this work, we reserve the symbol $z_{II}$ for the conditional input (e.g., a class label). In each polynomial, we further reduce the parameters by using the same embedding for the conditional variable. That is expressed as:

$U_{[n,II]} = U_{[1,II]} \quad \text{for } n = 2, \dots, N \quad (19)$

Equivalently, that would be $A_{[n,II]} = A_{[1,II]}$ in (14). Additionally, Nested-MVP performed better in our preliminary experiments, thus we use (14) to design each polynomial. Given the aforementioned sharing, the $N$th-order expansion is described by:

$x_n = \big( A_{[n,I]}^T z_I + A_{[1,II]}^T z_{II} \big) * \big( V_{[n]}^T x_{n-1} + B_{[n]}^T b_{[n]} \big) \quad (20)$

for $n = 2, \dots, N$. Lastly, the factor $A_{[1,II]}$ is a convolutional layer in the case of continuous conditional input, while it is a fully-connected layer in the case of discrete conditional input.

[...] partial visibility of other digits. Therefore, the generation of digits of SVHN is challenging for a generator without activation functions between the layers.
Our framework, e.g., equation 14, does not include any activation functions. To verify the expressivity of our framework, we maintain the same setting for this experiment. Particularly, the generator does not have activation functions between the layers; there is only a hyperbolic tangent in the output space for normalization. The generator receives a noise sample and a class as input, i.e., it is a class-conditional polynomial generator. The results in Fig. 12(b) illustrate that, despite the noise, MVP learns the distribution. As mentioned in the main paper, our formulation naturally enables both inter-class and intra-class interpolations. In an inter-class interpolation, the noise $z_I$ is fixed while the class $z_{II}$ is interpolated. In Fig. 12(d) several inter-class interpolations are visualized. The visualization exhibits that our framework is able to synthesize realistic images even with inter-class interpolations.

H.3 TRANSLATION OF MNIST DIGITS TO SVHN DIGITS

An experiment on image translation from the domain of binary digits to house numbers is conducted below. The images of MNIST are used as the source domain (i.e., the conditional variable $z_{II}$), while the images of SVHN are used as the target domain. The correspondence of the source to the target domain is assumed to be many-to-many, i.e., each MNIST digit can synthesize multiple SVHN images. No additional loss is used; the setting of continuous conditional input from sec. 4.2 is used. The images in Fig. 13 illustrate that MVP can translate MNIST digits into SVHN digits. Additionally, for each source digit, there is significant variation in the synthesized images.

Figure 13: Qualitative results on MNIST-to-SVHN translation. The first row depicts the conditional input (i.e., an MNIST digit). Rows 2-6 depict outputs of the MVP when a noise vector is sampled per row. Notice that for each source digit, there is significant variation in the synthesized images.

H.4 TRANSLATION OF EDGES TO IMAGES

An additional experiment on translation is conducted, where the source domain depicts edges and the target domain is the output image. Specifically, the tasks of edges-to-handbags (on the Handbags dataset) and edges-to-shoes (on the Shoes dataset) have been selected (Isola et al., 2017). In this experiment, the MVP model of sec. 4.2 is utilized, i.e., a generator without activation functions between the layers. The training is conducted using only the adversarial loss. Visual results for both edges-to-handbags and edges-to-shoes are depicted in Fig. 14. The first row depicts the conditional input z_II, i.e., an edge, while the rows 2-6 depict the synthesized images. Note that in both the handbags and the shoes cases there is significant variation in the synthesized images, while they follow the edges provided as input.

(a) edges-to-handbags (b) edges-to-shoes

Figure 14: Qualitative results on edges-to-image translation. The first row depicts the conditional input (i.e., the edges). The rows 2-6 depict outputs of the MVP when we vary z_I. Notice that for each edge, there is a significant variation in the synthesized images.

H.5 MULTIPLE, DISCRETE CONDITIONAL INPUTS

Frequently, more than one conditional input is available. Our formulation can be extended beyond two input variables (sec. C); we experimentally verify this case. The selected task is attribute-guided generation trained on images of Anime characters. Each image is annotated with respect to the color of the eyes (6 values) and the color of the hair (7 values). Since SPADE only accepts a single conditional variable, the two attributes must be fused into a single variable. Simply concatenating the attributes directly did not work well in our preliminary experiments. Instead, we use the total number of combinations, which is the product of the individual attribute cardinalities, i.e., in our case the total number of combinations is 42. Obviously, this causes few images to belong to each unique combination, i.e., there are 340 images on average that belong to each combination. On the contrary, there are 2380 images on average for each eye color. SPADE and Π-Net are trained by using the two attributes in a single combination, while in our case, we consider the multiple conditional variable setting. In each case, only the generator differs depending on the compared method. In Fig. 15 a few indicative images are visualized for each method; each row depicts a single combination of attributes, i.e., hair and eye color. Notice that SPADE results in a single image per combination, while in Π-Net-SINCONC there is considerable repetition in each case. The single image in SPADE can be explained by the lack of higher-order correlations with respect to the noise variable z_I. In addition to the diversity of the images per combination, an image from every combination is visualized in Fig. 16. MVP synthesizes more realistic images than the compared methods of Π-Net-SINCONC and SPADE. SPADE results in some combinations that do not follow the structure of the face, e.g., the 3rd column in the last row.
Similarly, in Π-Net-SINCONC some of the synthesized images are not completely realistic, e.g., penultimate row. MVP synthesizes images that resemble faces for every combination.
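The difference between the single-variable conditioning of the baselines and the multi-variable conditioning of MVP can be illustrated with a short sketch. The function names and the index ordering of the combined label are hypothetical choices for illustration only:

```python
import numpy as np

NUM_EYE, NUM_HAIR = 6, 7  # attribute cardinalities from the Anime experiment

def one_hot(index, num_classes):
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

# Single-variable baselines (e.g., SPADE, Π-Net-SINCONC) must fuse the two
# attributes into one combined label: 6 * 7 = 42 rare combinations.
def combined_label(eye, hair):
    return one_hot(eye * NUM_HAIR + hair, NUM_EYE * NUM_HAIR)

# MVP keeps each attribute as a separate conditional variable, so each
# embedding is trained on all images sharing that attribute value.
def separate_labels(eye, hair):
    return one_hot(eye, NUM_EYE), one_hot(hair, NUM_HAIR)

z_comb = combined_label(2, 5)          # one of 42 combinations
z_eye, z_hair = separate_labels(2, 5)  # two small one-hot vectors
```

With the combined label, only the images of one specific combination (340 on average) supervise each embedding; with separate labels, every image with that eye color (2380 on average) contributes.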

H.6 MULTIPLE CONDITIONAL INPUTS WITH MIXED CONDITIONAL VARIABLES

We extend the previous experiment with multiple conditional variables to the case of mixed conditional variables, i.e., one discrete and one continuous conditional variable. The discrete conditional variable captures the class label, while the continuous conditional variable captures the low-resolution image. Thus, the task is class-conditional super-resolution. We use the experimental details of sec. 4.2 in super-resolution 8×. The three-variable generative model is depicted in Fig. 17.

H.7 DIVERSITY-INDUCING REGULARIZATION

As emphasized in sec. I, various methods have been utilized for synthesizing more diverse images in conditional image generation tasks. A reasonable question is whether our method can be used in conjunction with such methods, since it already synthesizes diverse results. Our hypothesis is that when MVP is used in conjunction with any diversity-inducing technique, it will further improve the diversity of the synthesized images. To assess the hypothesis, we conduct an experiment on edges-to-images, a popular benchmark in diverse generation tasks (Zhu et al., 2017b; Yang et al., 2019). The plug-n-play regularization term of Yang et al. (2019) is selected and added to the GAN loss during training. The objective is to maximize the following term:

L_reg = min( ||G(z_{I,1}, z_{II}) - G(z_{I,2}, z_{II})||_1 / ||z_{I,1} - z_{I,2}||_1 , τ )    (21)

where τ is a predefined constant and z_{I,1}, z_{I,2} are different noise samples. The motivation behind this term lies in encouraging the generator to produce outputs that differ when the input noise samples differ. In our experiments, we follow the implementation of the original paper with τ = 10. The regularization loss of equation 21 is added to the GAN loss; the architecture of the generator remains similar to sec. H.4. The translation tasks are edges-to-handbags (on the Handbags dataset) and edges-to-shoes (on the Shoes dataset). The synthesized images are depicted in Fig. 18.
The regularization loss causes more diverse images to be synthesized (i.e., when compared to the visualization of Fig. 14, which was trained using only the adversarial loss). For instance, in both the shoes and the handbags, new shades of blue are synthesized, while yellow handbags can now be synthesized as well. The empirical results validate the hypothesis that our model can be used in conjunction with diversity regularization losses to improve the results. Nevertheless, the experiment in sec. H.4 indicates that a regularization term is not necessary to synthesize images that do not ignore the noise, as feed-forward generators previously did.
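The regularization term of equation 21 can be sketched in a few lines, assuming a generic generator callable G(z_I, z_II). The toy linear generator below is hypothetical and only exercises the formula:

```python
import numpy as np

TAU = 10.0  # the predefined constant τ used in the experiments

def diversity_regularizer(G, z_I_1, z_I_2, z_II, tau=TAU):
    """Regularization term of equation 21: the term (to be maximized) is
    large when different noise samples yield different outputs."""
    num = np.abs(G(z_I_1, z_II) - G(z_I_2, z_II)).sum()  # L1 output distance
    den = np.abs(z_I_1 - z_I_2).sum()                    # L1 noise distance
    return min(num / den, tau)

# Toy linear 'generator' (hypothetical), used only to exercise the formula.
G = lambda z_I, z_II: z_I + z_II

r = diversity_regularizer(G, np.ones(4), -np.ones(4), np.zeros(4))
# For this toy G the output distance equals the noise distance, so r = 1.
```

In training, the negative of this term would be added to the generator loss, so that maximizing the ratio (clamped at τ) discourages the generator from collapsing different noise samples to the same output.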

I DIFFERENCE OF MVP FROM OTHER DIVERSE GENERATION TECHNIQUES

One challenge that often arises in conditional data generation is that one of the variables gets ignored by the generator (Isola et al., 2017). This has been widely acknowledged in the literature, e.g., Zhu et al. (2017b) advocate that it is hard to utilize a simple architecture, like that of Isola et al. (2017), with noise. A similar conclusion is drawn in InfoGAN (Chen et al., 2016), where the authors explicitly mention that additional losses are required, otherwise the generator is 'free to ignore' the additional variables. To mitigate this, a variety of methods have been developed. We summarize the most prominent methods from the literature, starting from image-to-image translation methods:

• BicycleGAN (Zhu et al., 2017b) proposes a framework that can synthesize diverse images in image-to-image translation. The framework contains 2 encoders, 1 decoder and 2 discriminators. This results in multiple loss terms (e.g., eq. 9 of the paper). Interestingly, the authors utilize a separate training scheme for the encoder-decoder and the second encoder, as training them together 'hides the information of the latent code without learning meaningful modes'.

• Almahairi et al. (2018) augment the deterministic mapping of CycleGAN (Zhu et al., 2017a) with a marginal matching loss. The framework learns diverse mappings utilizing additional encoders; it includes 4 encoders, 2 decoders and 2 discriminators.

• MUNIT (Huang et al., 2018) focuses on diverse generation in unpaired image-to-image translation. MUNIT demonstrates impressive translation results, while the inverse translation is also learnt simultaneously. That is, in the case of edges-to-shoes, the translation shoes-to-edges is also learnt during training. The mapping learnt comes at the cost of multiple network modules. Particularly, MUNIT includes 2 encoders, 2 decoders and 2 discriminators for learning.
This also results in multiple loss terms (e.g., eq. 5 of the paper), along with additional hyper-parameters and network parameters.

• Drit++ (Lee et al., 2020) extends unpaired image-to-image translation with disentangled representation learning, while allowing multi-domain image-to-image translations. Drit++ uses 4 encoders, 2 decoders and 2 discriminators for learning. Similarly to the previous methods, this results in multiple loss terms (e.g., eq. 6-7 of the paper) and additional hyper-parameters.

• Choi et al. (2020) introduce a method that supports multiple target domains. The method includes four modules: a generator, a mapping network, a style encoder and a discriminator. All modules (apart from the generator) include domain-specific sub-networks in the case of multiple target domains. To ensure diverse generation, Choi et al. (2020) utilize a regularization loss (i.e., eq. 3 of the paper), while their final objective consists of multiple loss terms.

The aforementioned frameworks contain additional network modules for training, which also results in additional hyper-parameters in the loss function and the network architecture. Furthermore, these frameworks focus exclusively on image-to-image translation and do not cover all conditional generation cases, e.g., they do not tackle class-conditional or attribute-based generation. An interesting technique for diverse, class-conditional generation is the self-conditional GAN of Liu et al. (2020). The method conditions the generator with pseudo-labels that are automatically derived from clustering on the feature space of the discriminator. This enables the generator to synthesize more diverse samples. This method is orthogonal to ours, i.e., the generator of Liu et al. (2020) can be replaced with MVP. Using regularization terms in the loss function has been an alternative way to achieve diverse generation. Mao et al. (2019); Yang et al. (2019) propose simple regularization terms that can be



To avoid cluttering the notation, we use the same dimensionality for the two inputs. However, the derivations apply for different dimensionalities; only the dimensionality of the tensors changes slightly.
An element-wise analysis (with a scalar output) is provided in the Appendix (sec. B.2).
The formulation is derived from the public implementation of the authors.
In MVP, we do not learn a single embedding function for the conditional variable. In addition, we do not project the (transformed) conditional variable to the space of the noise variable. Both of these can be achieved by making simplifying assumptions on the factor matrices of MVP.
The version is downloaded following the instructions of https://github.com/bchao1/Anime-Generation.



Figure 2: Synthesized images by MVP in the class-conditional CIFAR10 (with resnet-based generator): (a) Random samples where each row depicts the same class, (b) Intra-class linear interpolation from a source to the target, (c) inter-class linear interpolation. In inter-class interpolation, the class labels of the leftmost and rightmost images are one-hot vectors, while the rest are interpolated in-between; the resulting images are visualized. In all three cases, MVP synthesizes realistic images.

Figure 3: Synthesized images for super-resolution by (a), (b) 8×, (c) 16×. The first row depicts the conditional input (i.e., the low-resolution image). The rows 2-6 depict outputs of the MVP when a noise vector is sampled per row. Notice how the noise changes (a) the smile or the pose of the head, (b) the color, car type or even the background, (c) the position of the car.

Yang et al. (2019); Lee et al. (2019). However, such works utilize additional losses and even require additional networks for training, which makes the training more computationally heavy and more sensitive to design choices.

Figure 4: Schematic for the second order expansion with scalar output x_τ ∈ R. The symbols z_{I,λ}, z_{I,µ} denote elements of z_I, with λ, µ ∈ [1, d]. Similarly, z_{II,λ}, z_{II,µ} are elements of z_II. The first two terms (on the right side of the equation) are the first-order correlations; the next two terms are the second order auto-correlations. The last term expresses the second order cross-correlations.
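The expansion described in the caption above can be verified numerically. The coefficient names (w_I, W_auto_I, W_cross, etc.) are hypothetical placeholders for the corresponding parameter tensors, and the double sum over λ, µ reproduces the element-wise analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3  # shared dimensionality of the two inputs (hypothetical)

# Hypothetical coefficients for a single scalar output x_tau.
w_I = rng.standard_normal(d)             # first-order terms in z_I
w_II = rng.standard_normal(d)            # first-order terms in z_II
W_auto_I = rng.standard_normal((d, d))   # second-order auto-correlations of z_I
W_auto_II = rng.standard_normal((d, d))  # second-order auto-correlations of z_II
W_cross = rng.standard_normal((d, d))    # second-order cross-correlations

def x_tau(z_I, z_II):
    """Second-order two-input polynomial: first-order terms, the two
    auto-correlation terms, and the cross-correlation term."""
    return (w_I @ z_I + w_II @ z_II
            + z_I @ W_auto_I @ z_I + z_II @ W_auto_II @ z_II
            + z_I @ W_cross @ z_II)

z_I, z_II = rng.standard_normal(d), rng.standard_normal(d)
out = x_tau(z_I, z_II)

# Element-wise double sum over lambda, mu -- matches the vectorized form.
out_loops = sum(w_I[l] * z_I[l] + w_II[l] * z_II[l] for l in range(d))
out_loops += sum(W_auto_I[l, m] * z_I[l] * z_I[m]
                 + W_auto_II[l, m] * z_II[l] * z_II[m]
                 + W_cross[l, m] * z_I[l] * z_II[m]
                 for l in range(d) for m in range(d))
```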

The terms involving B_{[1]} and V_{[2]}^T capture the second order correlations of a single variable (z_II and z_I, respectively).

is precisely a recursive equation that can be expressed with Fig. 5 or, equivalently, the generalized recursive relationship below.

Figure 5: Abstract schematic for the N-th order approximation of x = G(z_I, z_II) with the Nested-MVP model. The inputs z_I, z_II are symmetric in our formulation. We denote with z_I a sample from the noise distribution (e.g., Gaussian), while z_II symbolizes a sample from a conditional input (e.g., a class label or a low-resolution image).
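The Nested-MVP recursion of equation 20 can be sketched with a minimal NumPy program. The dimensions, the random initialization, and the bias vector beta (standing in for B_{[n]}^T b_{[n]}) are hypothetical choices for illustration; in the model itself the factor matrices are learned layers:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, N = 8, 16, 3  # input dim, hidden dim, polynomial order (hypothetical)

# Hypothetical factor matrices mirroring equation 20: A_[n,I] per order,
# a single shared conditional embedding A_[1,II], the matrices V_[n], and
# a bias vector beta[n] standing in for B_[n]^T b_[n].
A_I = [rng.standard_normal((d, k)) for _ in range(N)]
A_II = rng.standard_normal((d, k))  # shared across all orders n
V = [None] + [rng.standard_normal((k, k)) for _ in range(N - 1)]
beta = [None] + [rng.standard_normal(k) for _ in range(N - 1)]

def nested_mvp(z_I, z_II):
    """Nested-MVP recursion of equation 20; '*' is the Hadamard product."""
    x = A_I[0].T @ z_I + A_II.T @ z_II  # first-order term x_1
    for n in range(1, N):
        x = (A_I[n].T @ z_I + A_II.T @ z_II) * (V[n].T @ x + beta[n])
    return x

z_I = rng.standard_normal(d)   # noise sample
z_II = rng.standard_normal(d)  # conditional variable (e.g., class embedding)
out = nested_mvp(z_I, z_II)
```

In the full generator, a product of such polynomials is used, and the shared factor A_{[1,II]} is a convolutional layer for continuous conditional inputs or a fully-connected layer for discrete ones.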

Figure 6: Abstract schematic of the different compared generators. All the generators are products of polynomials. Each colored box represents a different type of polynomial, i.e., the green box symbolizes polynomial(s) with dense layers, the blue box denotes convolutional or cross-correlation layers. The red box includes the up-sampling layers. (a) Π-Net implements a single-variable polynomial for modeling functions x = G(z). Π-Net enables class-conditional generation by using conditional batch normalization (CBN). (b) An alternative to CBN is to concatenate the conditional variable in the input, as in Π-Net-SINCONC. This also enables the non-discrete conditional variables (e.g., low-resolution images) to be concatenated. (c) SPADE implements a single-variable polynomial for conditional image generation. The polynomial is built with respect to the conditional variable z_II. This is substantially different from the polynomial with multiple input variables, i.e., MVP. Two additional differences are that (i) SPADE is motivated as a spatially-adaptive method (i.e., for continuous conditional variables), while MVP can be used both for discrete and for continuous conditional variables, and (ii) there is no polynomial in the dense layers in SPADE. However, as illustrated in Π-Net, converting the dense layers into a higher-order polynomial can further boost the performance. (d) The proposed generator structure.

Further visualizations are provided in this section. Additional visualizations for class-conditional generation are provided in sec. H.1. An additional experiment with class-conditional generation with SVHN digits is performed in sec. H.2. An experiment that learns the translation of MNIST to SVHN digits is conducted in sec. H.3. To explore further the image-to-image translation, two additional experiments are conducted in sec. H.4. An attribute-guided generation is performed in sec. H.5 to illustrate the benefit of our framework with respect to multiple, discrete conditional inputs. This is further extended in sec. H.6, where an experiment with mixed conditional input is conducted. Finally, an additional diversity-inducing regularization term is used in sec. H.7 to assess whether it can further boost the diversity of the synthesized images.

H.1 ADDITIONAL VISUALIZATIONS IN CLASS-CONDITIONAL GENERATION

In Fig. 7 the qualitative results of the compared methods in class-conditional generation on CIFAR10 are shared. Both the generator of SNGAN and ours have activation functions in this experiment.

Figure 7: Qualitative results on CIFAR10. Each row depicts random samples from a single class.

Figure 10: Inter-class linear interpolations across different methods. In inter-class interpolation, the class labels of the leftmost and rightmost images are one-hot vectors, while the rest are interpolated in-between; the resulting images are visualized. Many of the intermediate images in SNGAN-CONC and SNGAN-ADD are either blurry or not realistic. On the contrary, in SPADE and MVP the higher-order polynomial expansion results in more realistic intermediate images. Nevertheless, MVP results in sharper shapes and images even in the intermediate results when compared to SPADE.

Figure 11: Synthesized images by MVP in (a), (b) class-conditional generation (sec. 4.1) and (c) block-inpainting (sec. 4.2). The networks do not include activation functions between the layers. In class-conditional generation, each row depicts a single class. Notice how the MVP synthesizes diverse images even in the absence of activation functions.

Figure 15: Each row depicts a single combination of attributes, i.e., hair and eye color. Please zoom in to check the finer details. The method of SPADE synthesizes a single image per combination. Π-Net-SINCONC synthesizes a few distinct images but has many repeated elements, while some combinations result in unrealistic faces, e.g., the 5th or the 7th row. On the contrary, MVP synthesizes much more diverse images for every combination.

Figure 16: Each row depicts a single hair color, while each column depicts a single eye color. SPADE results in some combinations that do not follow the structure of the face, e.g., the 3rd column in the last row. Similarly, in Π-Net-SINCONC some of the synthesized images are not completely realistic, e.g., the penultimate row. MVP synthesizes images that resemble faces for every combination.

Figure 17: Three-variable input generative model.

Figure 18: Qualitative results on edges-to-image translation with regularization loss for diverse generation (sec. H.7). The first row depicts the conditional input (i.e., the edges). The rows 2-6 depict outputs of the MVP when we vary z_I. Diverse images are synthesized for each edge. The regularization loss results in 'new' shades of blue emerging in the synthesized images in both the shoes and the handbags cases.

Methods for diverse generation: a comparison of the techniques used for diverse, conditional generation. The majority of the methods insert additional loss terms, while some of them even require additional networks to be trained to achieve diverse generation results. MVP learns a non-deterministic mapping without additional networks or loss terms, thus simplifying the training. Nevertheless, as we empirically show in sec. H.7, dedicated works that tackle diverse generation can be used in conjunction with the proposed MVP to further boost the diversity of the synthesized images.

Attributes of polynomial-like networks: a comparison of the attributes of polynomial-like neural networks. Even though the architectures of Karras et al. (2019); Chen et al. (2019); Park et al. (2019) were not posed as polynomial expansions, we believe that their success can be (partly) attributed to the polynomial expansion (please check sec. F for further information). Π-Net and StyleGAN are not designed for conditional data generation. In practice, learning complex distributions requires high-order polynomial expansions; this can be effectively achieved with products of polynomials, as detailed in sec. 3.2. Only Π-Net and MVP include such a formulation. Additionally, the only work that enables multiple conditional variables (and includes experiments with both continuous and discrete conditional variables) is the proposed MVP.

Symbols

Π-Net-based generator: A product of polynomials, based on Π-Net, is selected as the baseline architecture for the generator. Π-Net has conditional batch normalization (CBN) in the generator, while in the rest of the compared methods CBN is replaced by batch normalization. The results on CIFAR10 are summarized in Table 5 (left), where MVP outperforms all the baselines by a large margin. An additional experiment is performed on Cars196, which has 196 classes. The results in Table




plugged into any architecture to encourage diverse generation. Lee et al. (2019) propose two variants of a regularization term, with the 'more stable variant' requiring additional network modules. We emphasize that our method can be used in conjunction with many of the aforementioned techniques to obtain more diverse examples. We demonstrate that this is possible in the experiment of sec. H.7.

