CONDITIONAL GENERATIVE MODELING VIA LEARNING THE LATENT SPACE

Abstract

Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces, that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find solutions corresponding to multiple output modes. Compared to existing generative solutions, our approach demonstrates faster and more stable convergence, and can learn better representations for downstream tasks. Importantly, it provides a simple generic model that can perform better than highly engineered pipelines tailored using domain expertise on a variety of tasks, while generating diverse outputs.

1. INTRODUCTION

Conditional generative models provide a natural mechanism to jointly learn a data distribution and optimize predictions. In contrast, discriminative models improve predictions by modeling the label distribution. Learning to model the data distribution allows generating novel samples and is considered a preferred way to understand the real world. Existing conditional generative models have generally been explored in single-modal settings, where a one-to-one mapping between input and output domains exists (Nalisnick et al., 2019; Fetaya et al., 2020). Here, we investigate continuous multimodal (CMM) spaces for generative modeling, where one-to-many mappings exist between input and output domains. This is critical since many real-world situations are inherently multimodal, e.g., humans can imagine several completions for a given occluded image. In a discrete setting, this problem becomes relatively easy to tackle using techniques such as maximum-likelihood estimation, since the output can be predicted as a vector (Zhang et al., 2016), which is not possible in continuous domains.

One way to model CMM spaces is by using variational inference, e.g., variational autoencoders (VAE) (Kingma & Welling, 2013). However, the approximate posterior distributions of VAEs are often restricted to the Gaussian family, which hinders their ability to model more complex distributions. As a solution, Maaløe et al. (2016) suggested using auxiliary variables to improve the variational distribution. To this end, the latent variables are hierarchically correlated through injected auxiliary variables, which can produce non-Gaussian distributions. In a related work, Rezende & Mohamed (2015) proposed Normalizing Flows, which can hierarchically generate more complex probability distributions by applying a series of bijective mappings to an original simpler distribution. Recently, Chang et al. (2019) proposed a model in which a separate variable can be used to vary the impact of different loss components at inference, which allows diverse outputs. For a more detailed discussion of these methods, see App. 1.

In addition to the aforesaid methods, a prominent approach in the literature for modeling CMM spaces is to use a combination of reconstruction and adversarial losses (Isola et al., 2017; Zhang et al., 2016; Pathak et al., 2016). However, this entails key shortcomings. 1) The goals of the adversarial and reconstruction losses are contradictory (Sec. 4), hence model engineering and numerous regularizers are required to support convergence (Lee et al., 2019; Mao et al., 2019), resulting in less generic models tailored for specific applications (Zeng et al., 2019; Vitoria et al., 2020). 2) Adversarial-loss-based models are notorious for difficult convergence, due to the challenge of finding the Nash equilibrium of a non-convex min-max game in high dimensions (Barnett, 2018; Chu et al., 2020; Kodali et al., 2017). 3) The convergence is heavily dependent on the architecture, hence such models lack scalability (Thanh-Tung et al., 2019; Arora & Zhang, 2017). 4) Assisting downstream tasks remains challenging, with a large gap in performance between generative modelling approaches and their discriminative counterparts (Grathwohl et al., 2020; Jing & Tian, 2020).

In this work, we propose a general-purpose framework, Conditional Generation by Modeling the Latent Space (cGML), for modeling CMM spaces using a set of domain-agnostic regression cost functions instead of the adversarial loss. This improves stability and eliminates the incompatibility between the adversarial and reconstruction losses, allowing more precise outputs while maintaining diversity.
The underlying notion is to learn the 'behaviour of the latent variables' in minimizing these cost functions while converging to an optimum mode during the training phase, and to mimic the same behaviour at inference. Despite being a novel direction, the proposed framework showcases promising attributes by: (a) achieving state-of-the-art results on a diverse set of tasks using a generic model, implying generalizability, (b) converging rapidly to optimal modes despite architectural changes, (c) learning useful features for downstream tasks, and (d) producing diverse outputs via traversal through multiple output modes at inference.

2. PROPOSED METHODOLOGY

We define a family of cost functions $\{E_{i,j} = d(y^g_{i,j}, G(x_j, w))\}$, where $x_j \sim \chi$ is the input, $y^g_{i,j} \sim \Upsilon$ is the $i$-th ground-truth mode for $x_j$, $G$ is a generator function with weights $w$, and $d(\cdot, \cdot)$ is a distance function. Note that the number of cost functions $E_{(\cdot,j)}$ for a given $x_j$ can vary over $\chi$. Our aim here is to come up with a generator function $G(x_j, w)$ that can minimize each $E_{i,j}$, $\forall i$, as $G(x_j, w) \to y^g_{i,j}$. However, since $G$ is a deterministic function ($x$ and $w$ are both fixed at inference), it can only produce a single output. Therefore, we introduce a latent vector $z$ to the generator function, which can be used to converge $\bar{y}_{i,j} = G(x_j, w, z_{i,j})$ towards a ground truth $y^g_{i,j}$ at inference, and possibly, to multiple solutions. Formally, the family of cost functions now becomes $\{\hat{E}_{i,j} = d(y^g_{i,j}, G(x_j, w, z_{i,j}))\}$, $\forall z_{i,j} \sim \zeta$. Then, our training objective can be defined as finding a set of optimal $z^*_i \in \zeta$ and $w^* \in \omega$ by minimizing $\mathbb{E}_{i \sim I}[\hat{E}_{i,j}]$, where $I$ is the number of possible solutions for $x_j$. Note that $w^*$ is fixed for all $i$ and a different $z^*_i$ exists for each $i$. Considering all the training samples $x_j \sim \chi$, our training objective becomes

$$\{\{z^*_{i,j}\}, w^*\} = \arg\min_{z_{i,j} \in \zeta,\, w \in \omega} \mathbb{E}_{i \in I,\, j \in J}[\hat{E}_{i,j}]. \tag{1}$$

Eq. 1 can be optimized via Algorithm 1 (proof in App. 2.2). Intuitively, the goal of Eq. 1 is to obtain a family of optimal latent codes $\{z^*_{i,j}\}$, each yielding a global minimum of the corresponding $\hat{E}_{i,j}$ as $y^g_{i,j} = G(x_j, w, z^*_{i,j})$. Consequently, at inference, we can optimize $\bar{y}_{i,j}$ to converge to an optimal mode in the output space by varying $z$. Therefore, we predict an estimate $\bar{z}_{i,j}$ at inference, $\bar{z}_{i,j} \approx \arg\min_z \hat{E}_{i,j}$, for each $y^g_{i,j}$, which in turn can be used to obtain the prediction $G(x_j, w, \bar{z}_{i,j}) \approx y^g_{i,j}$. In other words, for a selected $x_j$, let $\bar{y}^{t}_{i,j}$ be the initial estimate for $\bar{y}_{i,j}$.
At inference, $z$ can traverse gradually towards an optimum point $y^g_{i,j}$ in the space, forcing $\bar{y}^{t+n}_{i,j} \to y^g_{i,j}$ in a finite number of steps $n$. However, a critical problem remains: Eq. 2 depends on $y^g_{i,j}$, which is not available at inference. As a remedy, we enforce Lipschitz constraints on $G$ over $(x_j, z_{i,j})$, which bound the gradient norm as

$$\frac{\|G(x_j, w^*, z^*_{i,j}) - G(x_j, w^*, z_0)\|}{\|z^*_{i,j} - z_0\|} \le \int_0^1 \|\nabla_z G(x_j, w^*, \gamma(t))\| \, dt \le C, \tag{3}$$

where $z_0 \sim \zeta$ is an arbitrary random initialization, $C$ is a constant, and $\gamma(\cdot)$ is a straight path from $z_0$ to $z^*_{i,j}$ (proof in App. 2.1). Intuitively, Eq. 3 implies that the gradients $\nabla_z G(x_j, w^*, z_0)$ along the path $\gamma(\cdot)$ do not tend to vanish or explode; hence, finding the path to the optimal $z^*_{i,j}$ in the space $\zeta$ becomes a fairly straightforward regression problem. Moreover, enforcing the Lipschitz constraint encourages meaningful structuring of the latent space: suppose $z^*_{1,j}$ and $z^*_{2,j}$ are two optimal codes corresponding to two ground-truth modes for a particular input. Since $\|z^*_{2,j} - z^*_{1,j}\|$ is lower bounded by $\|G(x_j, w^*, z^*_{2,j}) - G(x_j, w^*, z^*_{1,j})\| / L$, where $L$ is the Lipschitz constant, the minimum distance between the two latent codes is proportional to the difference between the corresponding ground-truth modes. In practice, we observed that this encourages the optimum latent codes to be placed sparsely (visual illustration in App. 2), which helps a network to learn distinctive paths towards different modes.
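The idea that one set of weights can reach several ground-truth modes through different latent codes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the generator is a hand-written toy $G(x, w, z) = w \, z \, (x, x^2, x^3)$ with a scalar latent, the weight $w$ is held fixed rather than trained, the distance $d$ is a squared error (chosen for smooth gradients; it is not the loss used in the experiments), and the names `features`, `generator` and `optimize_latent` are hypothetical. The loop is plain gradient descent on $z$ and needs the ground truth, so it corresponds to the training-time view only.

```python
import numpy as np

def features(x):
    """Hypothetical basis p(x) = (x, x^2, x^3) used by the toy generator."""
    return np.array([x, x**2, x**3])

def generator(x, w, z):
    """Toy generator G(x, w, z) = w * z * p(x) with a scalar latent z."""
    return w * z * features(x)

def optimize_latent(x, w, y_gt, z0, beta=0.05, steps=100):
    """Gradient descent on z for the cost ||G(x, w, z) - y_gt||^2 (w is fixed)."""
    z = z0
    for _ in range(steps):
        residual = generator(x, w, z) - y_gt
        grad_z = 2.0 * residual @ (w * features(x))  # d/dz of the squared error
        z = z - beta * grad_z
    return z

x, w = 0.5, 4.0
y_pos = 4.0 * features(x)    # first ground-truth mode
y_neg = -4.0 * features(x)   # second ground-truth mode

# Same input, same weights, same initial z: the cost decides which code is found.
z_pos = optimize_latent(x, w, y_pos, z0=0.1)
z_neg = optimize_latent(x, w, y_neg, z0=0.1)
print(z_pos, z_neg)
```

In this toy setting the two modes are reached by two distinct codes while $w$ is shared, which is the structure the training objective asks for.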

2.1. CONVERGENCE AT INFERENCE

We formulate finding the convergence path of $z$ at inference as a regression problem, i.e., $z_{t+1} = r(z_t, x_j)$. We implement $r(\cdot)$ as a recurrent neural network (RNN). The series of predicted values $\{z_{(t+k)} : k = 1, 2, \ldots, N\}$ can be modeled as a first-order Markov chain, requiring no memory for the RNN. We observe that enforcing Lipschitz continuity on $G$ over $z$ leads to smooth trajectories even in high-dimensional settings; hence, memorizing more than one step of the history is redundant. However, $z_{t+1}$ is not a state variable, i.e., the existence of multiple modes for the output prediction $\bar{y}$ leads to multiple possible solutions for $z_{t+1}$. In contrast, $\mathbb{E}[z_{t+1}]$ is a state variable w.r.t. the state $(z_t, x)$, which can be used as an approximation to reach the optimal $z^*$ at inference. Therefore, instead of directly learning $r(\cdot)$, we learn a simplified version $r'(z_t, x) = \mathbb{E}[z_{t+1}]$. Intuitively, the whole process can be understood as observing the behavior of $z$ on a smooth surface at the training stage, and predicting the movement at inference. A key aspect of $r'(z_t, x)$ is that the model is capable of converging to multiple possible optimum modes at inference, based on the initial position of $z$.
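The memoryless rollout described above can be sketched as follows. This is only an illustration under stated assumptions: `r_prime` is a hand-written stand-in for the learned predictor of $\mathbb{E}[z_{t+1}]$ (it simply moves halfway towards the nearest of two hypothetical optimal codes), the latent is a scalar, and the input `x` is ignored by the toy predictor. None of these choices come from the paper.

```python
import numpy as np

# Hypothetical optimal latent codes for two output modes of one input.
Z_STAR = np.array([-1.0, 1.0])

def r_prime(z_t, x):
    """Stand-in for the learned E[z_{t+1}]: move halfway to the nearest code.

    A real model would be a trained network conditioned on x; x is unused here.
    """
    nearest = Z_STAR[np.argmin(np.abs(Z_STAR - z_t))]
    return z_t + 0.5 * (nearest - z_t)

def rollout(z0, x, steps=30):
    """First-order Markov rollout: each step depends only on (z_t, x)."""
    z = z0
    for _ in range(steps):
        z = r_prime(z, x)
    return z

z_a = rollout(-0.3, x=None)  # initialised left of the origin
z_b = rollout(0.3, x=None)   # initialised right of the origin
print(z_a, z_b)
```

Two different initial positions end at two different codes, which is the sense in which the choice of the initial $z$ selects the output mode.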

2.2. MOMENTUM AS A SUPPLEMENTARY AID

Based on Sec. 2.1, $z$ can now traverse to an optimal position $z^*$ during inference. However, there can exist rare symmetrical positions in $\zeta$ where $\mathbb{E}[z_{t+1}] - z_t \approx 0$, although far away from $\{z^*\}$, forcing $z_{t+1} \approx z_t$. Put simply, this can occur if $z_{t+1}$ has traveled from such a position in many non-orthogonal directions, so that the displacements cancel in the vector sum. This can fool the system into falsely identifying convergence points, forming phantom optimum point distributions amongst the true distribution (see Fig. 3). To avoid such behavior, we learn the expected momentum $\mathbb{E}[\rho(z_t, x_j)] = \alpha \, \mathbb{E}[\,|z_{t+1} - z_t|_{x_j}]$ at each $(z_t, x_j)$ during the training phase, where $\alpha$ is an empirically chosen scalar. In practice, $\mathbb{E}[\rho(z_t, x_j)] \to 0$ as $z_{t+1}, z_t \to \{z^*\}$. Thus, to avoid phantom distributions, we improve the $z$ update as

$$z_{t+1} = z_t + \mathbb{E}[\rho(z_t, x_j)] \, \frac{r'(z_t, x_j) - z_t}{\|r'(z_t, x_j) - z_t\|}. \tag{4}$$

Since both $\mathbb{E}[\rho(z_t, x_j)]$ and $r'(z_t, x_j)$ are functions of $(z_t, x_j)$, we jointly learn these two functions using a single network $Z(z_t, x_j)$. Note that the coefficient $\mathbb{E}[\rho(z_t, x_j)]$ serves two practical purposes: 1) it slows down the movement of $z$ near true distributions, and 2) it pushes $z$ out of the phantom distributions.
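A single momentum-corrected update of $z$ can be written in a few lines. The sketch below assumes that the predicted next position and the expected momentum are already available as plain arrays and scalars (in the method they come from the network $Z$); the function name `momentum_step` and the small `eps` guard against division by zero are additions for illustration.

```python
import numpy as np

def momentum_step(z_t, r_pred, rho, eps=1e-12):
    """Move z_t by step length rho along the unit direction towards r_pred."""
    direction = r_pred - z_t
    unit = direction / (np.linalg.norm(direction) + eps)  # eps avoids 0/0
    return z_t + rho * unit

# Ordinary case: the predicted position is far away, the step length is rho.
z_next = momentum_step(np.array([0.0, 0.0]), np.array([3.0, 4.0]), rho=0.5)

# Phantom-like case: the predicted displacement is tiny, but a large rho
# still moves z by the full step length instead of stalling.
z_escape = momentum_step(np.array([0.0, 0.0]), np.array([1e-6, 0.0]), rho=0.2)
print(z_next, z_escape)
```

Because the direction is normalised, the step length is governed by the momentum alone: a small momentum near a true mode slows $z$ down, while a large momentum at a phantom position carries $z$ away even though the predicted displacement is close to zero.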

3. OVERALL DESIGN

The proposed model consists of three major blocks as shown in Fig. 1: an encoder H, a generator G, and Z. The detailed architecture diagram for 128 × 128 inputs is shown in Fig. 2. Note that for the derivations in Sec. 2, we used $x$ instead of $h = H(x)$, as $h$ is a high-level representation of $x$. The training process is illustrated in Algorithm 1. At each optimization step $z_{t+1} = z_t - \beta \nabla_{z_t}[\hat{E}_{i,j}]$, Z is trained separately to approximate $(z_{t+1}, \rho)$. At inference, $x$ is fed to H, and then Z optimizes the output $\bar{y}$ by updating $z$ for a pre-defined number of iterations of Eq. 4. For $\hat{E}(\cdot, \cdot)$, we use the $L_1$ loss. Furthermore, it is important to limit the search space for $z_{t+1}$ to improve the performance of Z. To this end, we 

4. MOTIVATION

Here, we explain the drawbacks of conditional GAN methods and illustrate our idea via a toy example.

Incompatibility of adversarial and reconstruction losses: cGANs use a combination of adversarial and reconstruction losses. We note that this combination is suboptimal for modeling CMM spaces.

Remark: Consider a generator $G(x, z)$ and a discriminator $D(x, z)$, where $x$ and $z$ are the input and the noise vector, respectively. Then, consider an arbitrary input $x_j$ and the corresponding set of ground-truths $\{y^g_{i,j}\}$, $i = 1, 2, \ldots, N$. Further, let us define the optimal generator $G^*(x_j, z) = \hat{y}$, $\hat{y} \in \{y^g_{i,j}\}$, $L_{GAN} = \mathbb{E}_i[\log D(y^g_{i,j})] + \mathbb{E}_z[\log(1 - D(G(x_j, z)))]$ and $L_{\ell} = \mathbb{E}_{i,z}[|y^g_{i,j} - G(x_j, z)|]$. Then, $G^* \neq \hat{G}^*$, where $\hat{G}^* = \arg\min_G \max_D L_{GAN} + \lambda L_{\ell}$, $\forall \lambda \neq 0$ (proof in App. 2.3).

Generalizability: The incompatibility of the above-mentioned loss functions demands domain-specific design choices from models that target high realism in CMM settings. This hinders generalizability across different tasks (Vitoria et al., 2020; Zeng et al., 2019). We further argue that, due to this discrepancy, cGANs learn sub-optimal features which are less useful for downstream tasks (Sec. 5.3).

Convergence and the sensitivity to the architecture: The difficulty of converging GANs to the Nash equilibrium of a non-convex min-max game in high-dimensional spaces is well explored (Barnett, 2018; Chu et al., 2020; Kodali et al., 2017). Goodfellow et al. (2014b) underline that if the discriminator has enough capacity and is optimal at every step of the GAN algorithm, then the generated distribution converges to the real distribution; this cannot be guaranteed in a practical scenario. In fact, Arora et al. (2018) confirmed that the adversarial objective can easily approach an equilibrium even if the generated distribution has very low support, and further, that the number of training samples required to avoid mode collapse can be in the order of $\exp(d)$, where $d$ is the data dimension.

A toy example: Here, we experiment with the formulations in Sec. 2. Consider a 3D CMM space $y = \pm 4(x, x^2, x^3)$. We construct multi-layer perceptrons (MLP) with three layers to represent each of the functions H, G, and Z, and compare the proposed method against the $L_1$ loss. Figure 3 illustrates the results. As expected, the $L_1$ loss generates the line $y = 0$, and is inadequate to model the multimodal space. As explained in Sec. 2.2, without momentum correction, the network is fooled by a phantom distribution where $\mathbb{E}[z_{t+1}] \approx 0$ at training time. However, the push of momentum removes the phantom distribution and refines the output to closely resemble the input distribution. As implied in Sec. 2.2, the momentum is minimized near the true distribution and maximized otherwise.
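The weakness of a single-valued predictor on this toy space can be checked numerically. The sketch below is illustrative only: it evaluates hand-picked predictors rather than trained MLPs, pairs every $x$ with both of its modes, and uses hypothetical helper names (`modes`, `l1_risk`). It compares the mean $L_1$ error of the line $y = 0$, of a deterministic predictor that always commits to the positive mode, and of a predictor that is allowed to pick the mode (as a latent variable would).

```python
import numpy as np

def modes(x):
    """Both ground-truth modes y = +4(x, x^2, x^3) and y = -4(x, x^2, x^3)."""
    p = np.stack([x, x**2, x**3], axis=-1)
    return 4.0 * p, -4.0 * p

def l1_risk(pred_pos, pred_neg, y_pos, y_neg):
    """Mean L1 error when each x appears once with each of its two modes."""
    err_pos = np.abs(pred_pos - y_pos).sum(axis=-1)
    err_neg = np.abs(pred_neg - y_neg).sum(axis=-1)
    return 0.5 * (err_pos.mean() + err_neg.mean())

x = np.linspace(-1.0, 1.0, 201)
y_pos, y_neg = modes(x)
zeros = np.zeros_like(y_pos)

risk_zero = l1_risk(zeros, zeros, y_pos, y_neg)      # deterministic: y = 0
risk_one_mode = l1_risk(y_pos, y_pos, y_pos, y_neg)  # deterministic: always +
risk_latent = l1_risk(y_pos, y_neg, y_pos, y_neg)    # mode chosen per sample
print(risk_zero, risk_one_mode, risk_latent)
```

Under these assumptions, the line $y = 0$ and the single-mode predictor incur the same non-zero $L_1$ risk, so the loss gives a deterministic model no reason to leave $y = 0$, whereas a predictor that can select the mode reaches zero error.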

5. EXPERIMENTS AND DISCUSSIONS

The distribution of natural images lies on a high-dimensional manifold, making the task of modelling it extremely challenging. Moreover, conditional image generation poses an additional challenge with its constrained multimodal output space (a single input may correspond to multiple outputs, while not all of them are available for training). In this section, we experiment on several such tasks. For a fair comparison with a similar-capacity GAN, we use the encoder and decoder architectures used in 

5.1. CORRUPTED IMAGE RECOVERY

We design this task as image completion, i.e., given a masked image as input, our goal is to recover the masked area. Interestingly, we observed that the MNIST dataset, in its original form, does not exhibit multimodal behaviour, i.e., a fraction of the input image maps to only a single output. Therefore, we modify the training data as follows: first, we overlap the top half of an input image with the top half of another randomly sampled image. We carry out this corruption for 20% of the training data. Corrupted samples are not fixed across epochs. Then, we apply a random-sized mask to the top half and ask the network to predict the missing pixels. We choose two competitive baselines here: our network with the $L_1$ loss, and CE (Pathak et al., 2016). Fig. 4 illustrates the predictions. As shown, our model converges to the most probable non-corrupted mode without any ambiguity, while the baselines give sub-optimal results. In the next experiment, we add a small white box to the top part of the ground-truth images at different rates. At inference, our model was able to converge to both modes (Fig. 5), depending on the initial position of $z$, as the probability of the alternate mode reaches 0.3.

In the colorization cost function $E$, $KL(\cdot\|\cdot)$ is the KL divergence and $u(0, 1)$ is a uniform distribution (see App. 3.3). Fig. 7 and Table 2 depict our qualitative and quantitative results, respectively. We demonstrate the superior performance of our method on four metrics: LPIP, PieAPP, SSIM and PSNR (App. 3.2). Fig. 10 depicts examples of multimodality captured by our model (more examples in App. 3.4). Fig. 6 shows the colorization behaviour as $z$ converges during inference.

5.2. AUTOMATIC IMAGE COLORIZATION

User study: We also conduct two user studies to further validate the quality of the generated samples (Table 1). a) In the PSYCHOPHYSICAL STUDY, we present volunteers with batches of 3 images, each generated with a different method. A batch is displayed for 5 seconds and the user has to pick the most realistic image. After 5 seconds, the next image batch is displayed. b) We conduct a TURING TEST to validate our output quality against the ground-truth, following the setting proposed by Zhang et al. (2016). The volunteers are presented with a series of paired images (ground-truth and our output). The images are visible for 1 second, and then the user has unlimited time to pick the realistic image.

5.3. IMAGE COMPLETION

In this case, we show that our generic model outperforms a similar-capacity GAN (CE) as well as task-specific GANs. In contrast to task-specific models, we do not use any domain-specific modifications to make our outputs perceptually pleasing. We observe that with random irregular and fixed-sized masks, all the models perform well, and we were not able to visually observe a considerable difference (Fig. 8; see App. 3.11 for more results). Therefore, we present the models with a more challenging task: train with random-sized square masks and evaluate the performance against masks of varying sizes. Fig. 9 illustrates qualitative results of the models with 25% masked data. As evident, our model recovers details more accurately compared to the state-of-the-art. Notably, all models produce comparable results when trained with a fixed-sized center mask, but find this setting more challenging. Table 3 includes a quantitative comparison. Observe that in the case of smaller masks, PN performs slightly better than ours, but worse otherwise. We also evaluate the learned features of the models on a downstream classification task (Table 5). First, we train all the models on Facades (Tyleček & Šára, 2013) with random masks, then apply the trained models to CIFAR10 (Krizhevsky et al., 2009) to extract bottleneck features, and finally pass them through an FC layer for classification (App. 3.7). We compare PN and ours against an oracle (AlexNet features pre-trained on ImageNet) and show that our model performs closer to the oracle.

5.3.1. DIVERSITY AND OTHER COMPELLING ATTRIBUTES

We also experiment on a diverse set of image translation tasks to demonstrate our generalizability. Figs. 12, 13, 14, 16 and 17 illustrate the qualitative results of the sketch-to-handbag, sketch-to-shoes, map-to-aerial, landmarks-to-faces and surface-normals-to-pets tasks. Figs. 10, 11, 12, 13, 16 and 17 show the ability of our model to converge to multiple modes, depending on the $z$ initialization. Fig. 15 demonstrates the quantitative comparison against other models. See App. 3.4 for further details on the experiments. Another appealing feature of our model is its strong convergence properties irrespective of the architecture, and hence its scalability to different input sizes. Fig. 19 shows examples from image completion and colorization for varying input sizes. We add layers to the architecture to be trained on increasingly high-resolution inputs, where our model was able to converge to optimal modes at each scale (App. 3.8). Fig. 18 demonstrates our faster and more stable convergence. Table 7 compares the number of FLOPS required by the models for a batch size of 10.

Spectral information of 3D objects has not been used before for self-supervised learning, a key reason being the difficulty of learning representations in the spectral domain due to the complex structure and unbounded spectral coefficients. Here, we present an efficient pretext task that is conducted in the spectral domain: denoising 3D spectral maps. We use two types of spectral spaces: spherical harmonics and Zernike polynomials (App. 4). We first convert the 3D point clouds to spherical harmonic coefficients, arrange the values as a 2D map, and mask or add noise to a portion of the map (App. 3.12). The goal is to recover the original spectral map. Fig. 20 and Table 6 depict our qualitative and quantitative results. We perform favorably against other methods.
To evaluate the learned features, we use Zernike polynomials, as they are more discriminative compared to spherical harmonics (Ramasinghe et al., 2019a). We first train the network on the 55k ShapeNet objects by denoising spectral maps, and then apply the trained network to ModelNet10 & 40. The features are then extracted from the bottleneck (similar to Sec. 5.3) and fed to an FC classifier (Table 4). We achieve state-of-the-art results on ModelNet40 with a simple pretext task.

6. CONCLUSION

Conditional generation in multimodal domains is a challenging task due to its ill-posed nature. In this paper, we propose a novel generative framework that minimizes a family of cost functions during training. Further, it observes the convergence patterns of the latent variables during training and applies this knowledge at inference to traverse to multiple output modes. Despite using a simple and generic architecture, we show impressive results on a diverse set of tasks. The proposed approach demonstrates faster convergence, scalability, generalizability, diversity and superior representation learning capability for downstream tasks.



Figure 1: Training and inference process. Refer to Algorithm 1 for the training process. At inference, z is iteratively updated using the predictions of Z and fed to G to obtain increasingly fine-tuned outputs (see Sec. 3).

Figure 2: Overall architecture for 128 × 128 inputs.

Figure 3: Toy Example: Plots generated for each dimension of the CMM space Υ. (a) Groundtruth distributions. (b) Model outputs for L1 loss. (c) Output when trained with the proposed objective (without ρ correction). Note the phantom distribution identified by the model. (d) E[ρ] as a heatmap on (x, y). E[ρ] is lower near the true distribution and higher otherwise. (e) Model outputs after ρ correction.

Colorization: Deep models have tackled this problem using semantic priors (Iizuka et al., 2016; Vitoria et al., 2020), adversarial and $L_1$ losses (Isola et al., 2017; Zhu et al., 2017a; Lee et al., 2019), or by conversion to a discrete form through binning of color values (Zhang et al., 2016). Although these methods provide compelling results, several inherent limitations exist: (a) the use of semantic priors results in complex models, (b) the adversarial loss suffers from drawbacks (see Sec. 4), and (c) discretization reduces the precision. In contrast, we achieve better results using a simpler model. The input and the output of the network are the $l$ and $(a, b)$ planes, respectively (LAB color space). However, since the color distributions of the $a$ and $b$ spaces are highly imbalanced over a natural dataset (Zhang et al., 2016), we add another constraint to the cost function $E$ to push the predicted $a$ and $b$ colors towards a uniform distribution: $E = \|a_{gt} - a\| + \|b_{gt} - b\| + \lambda(loss_{kl,a} + loss_{kl,b})$, where $loss_{kl,\cdot} = KL(\cdot\|u(0, 1))$.
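The structure of this cost can be sketched with NumPy. The sketch rests on several assumptions that are not taken from the paper: the predicted $a$ and $b$ values are assumed to be rescaled to $[0, 1]$, the KL term is estimated from a simple histogram (which is not differentiable, so a trainable version would need a different estimator; the actual procedure is described in App. 3.3 and is not reproduced here), the reconstruction term is a mean absolute error, and the names `kl_to_uniform` and `colorization_cost` as well as the defaults `lam=0.1` and `bins=10` are hypothetical.

```python
import numpy as np

def kl_to_uniform(values, bins=10):
    """Histogram estimate of KL(p || u(0, 1)) for values assumed to lie in [0, 1]."""
    counts, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    nz = p > 0  # empty bins contribute nothing to the sum
    return float(np.sum(p[nz] * np.log(p[nz] * bins)))

def colorization_cost(a_gt, a, b_gt, b, lam=0.1, bins=10):
    """Mean absolute reconstruction error plus a weighted uniformity penalty."""
    recon = np.abs(a_gt - a).mean() + np.abs(b_gt - b).mean()
    penalty = kl_to_uniform(a, bins) + kl_to_uniform(b, bins)
    return recon + lam * penalty

spread = (np.arange(1000) + 0.5) / 1000  # values covering [0, 1] evenly
collapsed = np.full(1000, 0.5)           # every pixel predicts the same value

print(kl_to_uniform(spread), kl_to_uniform(collapsed))
print(colorization_cost(spread, collapsed, spread, collapsed))
```

In this sketch the penalty vanishes for evenly spread predictions and grows to $\log(\text{bins})$ when all predictions collapse to one value, which is the behaviour the added constraint is meant to discourage.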

Figure 4: Performance with 20% corrupted data. Our model demonstrates better convergence compared to L1 loss and a similar capacity GAN (Pathak et al., 2016).

Figure 5: With >30% alternate mode data, our model can converge to both the input modes (cols 4-5).

Figure 6: The prediction quality increases as z traverses to an optimum position at inference (panels: iterations 0, 5, 10, 15 and 20).

Figure 11: Multi-modality of our predictions on Celeb-HQ dataset. (Best viewed with zoom)

Figure 13: Translation from shoe sketches to images.

Figure 14: Map to aerial image translation. From left: GT, Input and Output. Also see App. 5.2.

Figure 15: Diversity: Quantitative comparisons.

Figure 16: Translation from facial landmarks to faces.

Figure 17: Translation from surface-normals to pet faces.

Denoising of 3D objects in spectral space: Spectral moments of 3D objects provide a compact representation and help in building light-weight networks (Ramasinghe et al., 2020; 2019b; Cohen et al., 2018; Esteves et al., 2018).

Figure 19: Scalability: we subsequently add layers to the architecture to be trained on increasingly high-resolution inputs

Table 1: Colorization: Psychophysical study and Turing test results. All performances are in %.

Table 2: Colorization: Quantitative analysis of our method against the state-of-the-art. Ours performs better on a variety of metrics (LPIP ↓, PieAPP ↓, SSIM ↑, PSNR ↑).

Table 3: Image completion: Quantitative analysis of our method against the state-of-the-art on a variety of metrics (LPIP ↓, PieAPP ↓, PSNR ↑, SSIM ↑).

Downstream

Comparison on down-

Table 6: Reconstruction mAP of 3D spectral denoising.

Table 7: Model complexity comparison.

availability

//github.com/samgregoost

