CONDITIONAL GENERATIVE MODELING VIA LEARNING THE LATENT SPACE

Abstract

Although deep learning has achieved appealing results on several machine learning tasks, most models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find solutions corresponding to multiple output modes. Compared to existing generative solutions, our approach demonstrates faster and more stable convergence, and can learn better representations for downstream tasks. Importantly, it provides a simple generic model that can outperform highly engineered pipelines tailored with domain expertise on a variety of tasks, while generating diverse outputs.

1. INTRODUCTION

Conditional generative models provide a natural mechanism to jointly learn a data distribution and optimize predictions. In contrast, discriminative models improve predictions by modeling only the label distribution. Learning to model the data distribution allows generating novel samples and is considered a preferred way to understand the real world. Existing conditional generative models have generally been explored in single-modal settings, where a one-to-one mapping between input and output domains exists (Nalisnick et al., 2019; Fetaya et al., 2020). Here, we investigate continuous multimodal (CMM) spaces for generative modeling, where one-to-many mappings exist between input and output domains. This is critical since many real-world situations are inherently multimodal, e.g., humans can imagine several completions for a given occluded image. In a discrete setting, this problem becomes relatively easy to tackle using techniques such as maximum likelihood estimation, since the output can be predicted as a vector (Zhang et al., 2016), which is not possible in continuous domains. One way to model CMM spaces is via variational inference, e.g., variational autoencoders (VAEs) (Kingma & Welling, 2013). However, the approximate posterior distributions of VAEs are often restricted to the Gaussian family, which hinders their ability to model more complex distributions. As a solution, Maaløe et al. (2016) suggested using auxiliary variables to improve the variational distribution. To this end, the latent variables are hierarchically correlated through injected auxiliary variables, which can produce non-Gaussian distributions. In related work, Rezende & Mohamed (2015) proposed normalizing flows, which hierarchically generate more complex probability distributions by applying a series of bijective mappings to an original, simpler distribution. Recently, Chang et al. (2019) proposed a model where a separate variable can be used to vary the impact of different loss components at inference, which allows diverse outputs. For a more detailed discussion of these methods, see App. 1.

In addition to the aforesaid methods, a prominent approach in the literature for modeling CMM spaces is to use a combination of reconstruction and adversarial losses (Isola et al., 2017; Zhang et al., 2016; Pathak et al., 2016). However, this entails key shortcomings: 1) the goals of the adversarial and reconstruction losses are contradictory (Sec. 4), hence model engineering and numerous regularizers are required to support convergence (Lee et al., 2019; Mao et al., 2019), resulting in less-generic models tailored for specific applications (Zeng et al., 2019; Vitoria et al., 2020); 2) adversarial-loss-based models are notorious for difficult convergence due to the challenge of finding the Nash equilibrium of a non-convex min-max game in high dimensions (Barnett, 2018; Chu et al., 2020; Kodali et al., 2017); 3) convergence is heavily dependent on the architecture, hence such models lack scalability (Thanh-Tung et al., 2019; Arora & Zhang, 2017); 4) the promise of assisting downstream tasks remains challenging, with a large gap in performance between generative modeling approaches and their discriminative counterparts (Grathwohl et al., 2020; Jing & Tian, 2020).

In this work, we propose a general-purpose framework, Conditional Generation by Modeling the Latent Space (cGML), for modeling CMM spaces using a set of domain-agnostic regression cost functions instead of the adversarial loss. This improves stability and eliminates the incompatibility between the adversarial and reconstruction losses, allowing more precise outputs while maintaining diversity.
The underlying notion is to learn the 'behaviour of the latent variables' in minimizing these cost functions while converging to an optimum mode during training, and to mimic this behaviour at inference. Despite being a novel direction, the proposed framework showcases promising attributes by: (a) achieving state-of-the-art results on a diverse set of tasks using a generic model, implying generalizability; (b) rapidly converging to optimal modes despite architectural changes; (c) learning useful features for downstream tasks; and (d) producing diverse outputs via traversal through multiple output modes at inference.

2. PROPOSED METHODOLOGY

We define a family of cost functions $\{E_{i,j} = d(y^g_{i,j}, G(x_j, w))\}$, where $x_j \sim \chi$ is the input, $y^g_{i,j} \sim \Upsilon$ is the $i$-th ground-truth mode for $x_j$, $G$ is a generator function with weights $w$, and $d(\cdot, \cdot)$ is a distance function. Note that the number of cost functions $E_{(\cdot, j)}$ for a given $x_j$ can vary over $\chi$. Our aim is to construct a generator function $G(x_j, w)$ that can minimize each $E_{i,j}$, $\forall i$, as $G(x_j, w) \to y^g_{i,j}$. However, since $G$ is a deterministic function ($x$ and $w$ are both fixed at inference), it can only produce a single output. Therefore, we introduce a latent vector $z$ to the generator function, which can be used to converge $\bar{y}_{i,j} = G(x_j, w, z_{i,j})$ towards a ground truth $y^g_{i,j}$ at inference, and possibly, to multiple solutions. Formally, the family of cost functions now becomes $\{\hat{E}_{i,j} = d(y^g_{i,j}, G(x_j, w, z_{i,j}))\}$, $\forall z_{i,j} \sim \zeta$. Then, our training objective can be defined as finding a set of optimal $z^*_i \in \zeta$ and $w^* \in \omega$ by minimizing $\mathbb{E}_{i \sim I}[\hat{E}_{i,j}]$, where $I$ is the number of possible solutions for $x_j$. Note that $w^*$ is fixed for all $i$, while a different $z^*_i$ exists for each $i$. Considering all the training samples $x_j \sim \chi$, our training objective becomes
$$\{\{z^*_{i,j}\}, w^*\} = \operatorname*{arg\,min}_{z_{i,j} \in \zeta,\; w \in \omega} \mathbb{E}_{i \in I,\, j \in J}[\hat{E}_{i,j}]. \quad (1)$$
Eq. 1 can be optimized via Algorithm 1 (proof in App. 2.2). Intuitively, the goal of Eq. 1 is to obtain a family of optimal latent codes $\{z^*_{i,j}\}$, each inducing a global minimum in the corresponding $\hat{E}_{i,j}$ as $y^g_{i,j} = G(x_j, w, z^*_{i,j})$. Consequently, at inference, we can optimize $\bar{y}_{i,j}$ to converge to an optimal mode in the output space by varying $z$. Therefore, at inference we predict an estimate $\hat{z}_{i,j}$ for each $y^g_{i,j}$,
$$\hat{z}_{i,j} \approx \operatorname*{arg\,min}_{z} \hat{E}_{i,j}, \quad (2)$$
which in turn can be used to obtain the prediction $G(x_j, w, \hat{z}_{i,j}) \approx y^g_{i,j}$. In other words, for a selected $x_j$, let $\bar{y}^t_{i,j}$ be the initial estimate for $\bar{y}_{i,j}$.
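As a minimal sketch of the joint objective in Eq. 1 (not the paper's Algorithm 1), the following toy example jointly descends on shared weights $w$ and per-sample latent codes $z_{i,j}$ for a linear generator under a squared-error distance $d$; the names (`fit_cgml`, `latent_dim`), the linear generator, and the plain gradient-descent updates are all illustrative assumptions, and one latent code per (sample, mode) pair is kept for simplicity.

```python
import numpy as np

def fit_cgml(X, Y, latent_dim=2, lr=0.05, steps=2000, seed=0):
    """Toy instance of Eq. 1: jointly learn shared weights w and a
    per-(sample, mode) latent code z_{i,j} for a linear generator
    G(x, z) = [x; z] @ w, under squared-error distance d.
    X: (n, dx) inputs; Y: (n, dy) one ground-truth mode per row."""
    rng = np.random.default_rng(seed)
    n, dx = X.shape
    dy = Y.shape[1]
    w = rng.normal(scale=0.1, size=(dx + latent_dim, dy))  # shared weights
    Z = rng.normal(scale=0.1, size=(n, latent_dim))        # one z per (i, j)
    for _ in range(steps):
        inp = np.concatenate([X, Z], axis=1)   # [x_j; z_{i,j}]
        pred = inp @ w                         # G(x_j, w, z_{i,j})
        err = pred - Y                         # gradient of 0.5 * d(., .)
        grad_inp = err @ w.T
        w -= lr * inp.T @ err / n              # descend on shared w
        Z -= lr * grad_inp[:, dx:]             # descend on each z_{i,j}
    return w, Z

# Two conflicting output modes for the same input x = 1: a deterministic
# G(x, w) cannot fit both, but distinct latent codes can.
X = np.array([[1.0], [1.0]])
Y = np.array([[1.0], [-1.0]])
w, Z = fit_cgml(X, Y)
pred = np.concatenate([X, Z], axis=1) @ w
print(np.abs(pred - Y).max())  # small residual for both modes
```

The two rows share the same input but carry different targets, so the residual can only become small if the optimized latent codes separate the two modes, mirroring the role of $\{z^*_{i,j}\}$ in Eq. 1.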
At inference, $z$ can traverse gradually towards an optimum point $y^g_{i,j}$ in the output space, forcing $\bar{y}^{t+n}_{i,j} \to y^g_{i,j}$ in a finite number of steps $n$. However, a critical problem still exists: Eq. 2 depends on $y^g_{i,j}$, which is not available at inference. As a remedy, we enforce Lipschitz constraints on $G$ over $(x_j, z_{i,j})$, which bound the gradient norm as
$$\frac{\| G(x_j, w^*, z^*_{i,j}) - G(x_j, w^*, z_0) \|}{\| z^*_{i,j} - z_0 \|} \le \int_0^1 \| \nabla_z G(x_j, w^*, \gamma(t)) \|\, dt \le C, \quad (3)$$
where $z_0 \sim \zeta$ is an arbitrary random initialization, $C$ is a constant, and $\gamma(\cdot)$ is a straight path from $z_0$ to $z^*_{i,j}$ (proof in App. 2.1). Intuitively, Eq. 3 implies that the gradients $\nabla_z G(x_j, w^*, z_0)$ along the path $\gamma(\cdot)$ do not tend to vanish or explode; hence, finding the path to the optimal $z^*_{i,j}$ in the space $\zeta$ becomes a fairly straightforward regression problem. Moreover, enforcing the Lipschitz constraint
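The inference-time traversal and the bound of Eq. 3 can be illustrated with a hand-built linear generator, for which $\nabla_z G$ is constant and the Lipschitz constant is simply $C = \|w_z\|$. This sketch assumes the target mode is observable (purely for illustration; in the actual framework $y^g_{i,j}$ is unavailable at inference), and the names (`infer_z`, `wx`, `wz`) are hypothetical.

```python
import numpy as np

# Toy trained generator: G(x, z) = x * wx + z @ wz, built so that
# different latent codes reproduce different output modes for x = 1.
wx = np.array([0.0])
wz = np.array([[1.0], [0.0]])  # dG/dz is constant => Lipschitz with C = ||wz||

def G(x, z):
    return x * wx + z @ wz

def infer_z(x, target, z0, lr=0.2, steps=200):
    """Gradient descent on E = 0.5 * ||G(x, z) - target||^2 w.r.t. z only,
    i.e., the traversal of Eq. 2 from a random initialization z0."""
    z = z0.copy()
    for _ in range(steps):
        err = G(x, z) - target
        z -= lr * (err @ wz.T)  # dE/dz
    return z

x = np.array([1.0])
# Different initializations / targets reach different output modes.
z_a = infer_z(x, np.array([1.0]), np.array([0.3, -0.2]))
z_b = infer_z(x, np.array([-1.0]), np.array([-0.3, 0.1]))
print(G(x, z_a), G(x, z_b))  # close to the two modes +1 and -1

# Eq. 3: the secant slope from z0 to the optimum never exceeds C = ||wz||.
z0 = np.array([0.3, -0.2])
slope = np.linalg.norm(G(x, z_a) - G(x, z0)) / np.linalg.norm(z_a - z0)
print(slope <= np.linalg.norm(wz) + 1e-9)
```

Because the gradient norm along the straight path from $z_0$ to $z^*$ is bounded, each descent step makes predictable progress, which is the intuition behind treating the search for $z^*_{i,j}$ as a well-behaved regression problem.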

Code availability: //github.com/samgregoost

