MVP: MULTIVARIATE POLYNOMIALS FOR CONDITIONAL DATA GENERATION

Abstract

Conditional Generative Adversarial Nets (cGANs) have been widely adopted for image generation. cGANs take i) a noise vector and ii) a conditional variable as input. The conditional variable can be discrete (e.g., a class label) or continuous (e.g., an input image), resulting in class-conditional (image) generation and image-to-image translation models, respectively. However, depending on whether the conditional variable is discrete or continuous, various cGANs employ substantially different deep architectures and loss functions for their training. In this paper, we propose a novel framework, called MVP, for conditional data generation. MVP resorts to higher-order multivariate polynomials and treats both discrete and continuous conditional variables in a unified way. MVP is highly expressive, capturing higher-order auto- and cross-correlations of the input variables (noise vector and conditional variable). Tailored sharing schemes are designed between the polynomial's parameter tensors, which result in simple recursive formulas. MVP can synthesize realistic images in both class-conditional and image-to-image translation tasks, even in the absence of activation functions between the layers.

1. INTRODUCTION

Modelling high-dimensional distributions and generating samples from complex distributions are fundamental tasks in machine learning. Generative adversarial networks (GANs) (Goodfellow et al., 2014) have demonstrated spectacular results in the two tasks using both unsupervised (Miyato et al., 2018) and supervised (Brock et al., 2019) learning. In the unsupervised setting, (the generator of) a GAN accepts as input a noise vector z_I and maps it to a high-dimensional output. The supervised models, called conditional Generative Adversarial Nets (cGANs) (Mirza & Osindero, 2014), accept both a noise vector z_I and an additional conditional variable z_II that facilitates the generation. The conditional variable can be discrete (e.g., a class or an attribute label) or continuous (e.g., a low-resolution image). The impressive results obtained with both discrete conditional input (Brock et al., 2019) and continuous conditional input (Park et al., 2019; Ledig et al., 2017) have led to a plethora of applications that range from text-to-image synthesis (Qiao et al., 2019) to deblurring (Yan & Wang, 2017) and medical analysis (You et al., 2019).

Despite the similarity of the formulation for discrete and continuous conditional input (i.e., learning the function G(z_I, z_II)), the literature has focused on substantially different architectures and losses. Frequently, similar techniques are developed independently for each setting, e.g., the self-attention in the class-conditional Self-Attention GAN (Zhang et al., 2019) and in AttentionGAN (Chen et al., 2018) with continuous conditional input. This duplication slows progress, since practitioners must develop twice as many architectures and losses. A couple of straightforward ideas could be employed to unify the treatment of the two conditional variable types. One idea is to use an encoder network to obtain representations that are independent of the conditional variable.
This has two drawbacks: i) the network ignores the noise and a deterministic one-variable mapping is learned (Isola et al., 2017); ii) such an encoder has not been successful so far for discrete conditional input. An alternative idea is to directly concatenate the labels in the latent space instead of finding an embedding. In AC-GAN (Odena et al., 2017) the class labels are concatenated with the noise; however, the model does not scale well beyond 10 classes. We argue that concatenation of the inputs captures only additive correlations, not higher-order interactions between the inputs. A detailed discussion is provided in Sec. D (in the Appendix).

A polynomial expansion with respect to the input variables can capture such higher-order correlations. Π-Net (Chrysos et al., 2020) casts function approximation as a polynomial expansion of a single input variable. By concatenating the input variables, we could express the function approximation as a polynomial of the fused variable. However, the concatenation reduces the flexibility of the model significantly: e.g., it enforces the same order of expansion with respect to the different variables and it allows only a single parameter sharing scheme for all variables.

We introduce a multivariate framework, called MVP, for conditional data generation. MVP resorts to multivariate polynomials with two input variables, i.e., z_I for the noise vector and z_II for the conditional variable. MVP captures higher-order auto- and cross-correlations between the variables. By imposing a tailored structure on the higher-order interactions, we obtain an intuitive, recursive formulation for MVP. The formulation is flexible and enables different constraints to be applied to each variable and its associated parameters. The formulation can be trivially extended to M input variables.
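To make the recursive two-variable formulation concrete, the following is a minimal PyTorch sketch under our own assumptions: the class name MVPSketch, the per-order projections U_n and V_n, and the product-plus-skip recursion are hypothetical simplifications, not the paper's exact parameterization. Each recursion step takes the Hadamard product of a joint linear embedding of z_I and z_II with the running representation, raising the polynomial order by one, with no activation functions in between.

```python
import torch

class MVPSketch(torch.nn.Module):
    """Hypothetical sketch of a two-variable polynomial expansion.

    Each recursion step multiplies (Hadamard product) a joint linear
    embedding of the two inputs with the running representation, so the
    order-N model contains auto- and cross-correlation terms up to
    degree N. No activation functions are used between the steps.
    """
    def __init__(self, dim_z1, dim_z2, hidden, out_dim, order=3):
        super().__init__()
        self.order = order
        # Separate projections per variable (and per order) allow
        # different constraints/sharing schemes for each variable,
        # unlike a plain concatenation of the two inputs.
        self.U = torch.nn.ModuleList(
            [torch.nn.Linear(dim_z1, hidden, bias=False) for _ in range(order)])
        self.V = torch.nn.ModuleList(
            [torch.nn.Linear(dim_z2, hidden, bias=False) for _ in range(order)])
        self.C = torch.nn.Linear(hidden, out_dim)

    def forward(self, z1, z2):
        x = self.U[0](z1) + self.V[0](z2)  # first-order term
        for n in range(1, self.order):
            # Hadamard product injects higher-order interactions;
            # the additive skip keeps all lower-order terms.
            x = (self.U[n](z1) + self.V[n](z2)) * x + x
        return self.C(x)
```

A third-order instance, e.g., `MVPSketch(128, 10, 64, 3)(z1, z2)`, already contains cross terms such as z_I ⊗ z_II and z_I ⊗ z_I ⊗ z_II, which a simple concatenation followed by a linear layer cannot express.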
In summary, our contributions are the following:

• We introduce a framework, called MVP, that expresses a high-order, multivariate polynomial for conditional data generation. Importantly, MVP treats both discrete and continuous conditional variables in a unified way.

• We offer an in-depth relationship with state-of-the-art works, such as SPADE (Park et al., 2019), which can be interpreted as polynomial expansions. We believe this perspective better explains the success of such architectures and offers a new direction for their extension.

• MVP is trained on eight different datasets for both class-conditional generation and image-to-image translation tasks. The trained models rely on both input variables, i.e., they do not ignore the noise vector.

• To illustrate the expressivity of the model, we also experiment with generators that do not use activation functions between the layers. We verify that MVP can synthesize realistic images even in the absence of activation functions.

The source code of MVP will be published upon the acceptance of the paper.

2. RELATED WORK

The literature on conditional data generation is vast; dedicated surveys per task (Agnese et al., 2019; Wu et al., 2017b) can be found for the interested reader. Below, we review representative works in conditional generation and then summarize the recent progress in multiplicative interactions.

2.1. CONDITIONAL GENERATIVE MODELS

The challenging nature of image/video generation has led to a proliferation of conditional models. Although cGAN (Mirza & Osindero, 2014) is a general framework, the methods developed for conditional generation since then differ substantially depending on the type of conditional data. We present below representative works of the two categories, i.e., discrete and continuous conditional data, and their combination.

Discrete conditional variable: This is most frequently used for class-conditional generation (Miyato et al., 2018; Brock et al., 2019; Kaneko et al., 2019). Conditional normalization techniques (Dumoulin et al., 2017; De Vries et al., 2017) have been popular in the case of discrete conditional input, e.g., in the generation of natural scene images (Miyato et al., 2018; Brock et al., 2019). Conditional normalization does not trivially generalize to a continuous conditional variable. In AC-GAN (Odena et al., 2017), the class labels are concatenated with the noise; however, the model does not scale well (i.e., one model is trained per 10 classes). The aforementioned methods cannot be trivially used or modified for continuous conditional input. Text-to-image generation models (Qiao et al., 2019; Li et al., 2019; Zhang et al., 2018; Xu et al., 2018) use a specialized branch to embed the text labels.
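Conditional normalization, as used in the class-conditional models above, can be sketched as follows. This is a minimal illustration (the class name and initialization choices are our own assumptions, not a specific paper's implementation): per-class gain and bias embeddings modulate the normalized features. The reliance on a discrete class index y is precisely what prevents a trivial extension to continuous conditional input.

```python
import torch

class ConditionalBatchNorm(torch.nn.Module):
    """Minimal sketch of class-conditional batch normalization.

    Features are normalized without learned affine parameters; a
    per-class gain and bias, looked up by the discrete label, then
    modulate the normalized activations.
    """
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.bn = torch.nn.BatchNorm2d(num_features, affine=False)
        self.gain = torch.nn.Embedding(num_classes, num_features)
        self.bias = torch.nn.Embedding(num_classes, num_features)
        # Start as an identity modulation: gain 1, bias 0.
        torch.nn.init.ones_(self.gain.weight)
        torch.nn.init.zeros_(self.bias.weight)

    def forward(self, x, y):
        # x: (B, C, H, W) feature map, y: (B,) integer class labels.
        out = self.bn(x)
        g = self.gain(y).view(-1, out.size(1), 1, 1)
        b = self.bias(y).view(-1, out.size(1), 1, 1)
        return g * out + b
```

Because the gain and bias are rows of an embedding table indexed by y, replacing y with a continuous variable (e.g., an input image) requires a different mechanism altogether, which is the gap the unified MVP formulation targets.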

Continuous conditional variable:

The influential work of pix2pix (Isola et al., 2017) has become the reference point for continuous conditional input. The conditional input is embedded in a low-dimensional space (with an encoder) and then mapped to a high-dimensional output (through a decoder). The framework has been widely used for inverse tasks (Ledig et al., 2017; Pathak et al.,

