

Abstract

We propose conditional transport (CT) as a new divergence to measure the difference between two probability distributions. The CT divergence consists of the expected cost of a forward CT, which constructs a navigator to stochastically transport a data point of one distribution to the other distribution, and that of a backward CT, which reverses the transport direction. To apply it to distributions whose probability density functions are unknown but from which random samples are accessible, we further introduce asymptotic CT (ACT), whose estimation only requires access to mini-batch based discrete empirical distributions. Equipped with two navigators that amortize the computation of conditional transport plans, the ACT divergence comes with unbiased sample gradients that are straightforward to compute, making it amenable to mini-batch stochastic gradient descent based optimization. When applied to train a generative model, the ACT divergence is shown to strike a good balance between mode-covering and mode-seeking behaviors and to strongly resist mode collapse. To model high-dimensional data, we show that it is sufficient to modify the adversarial game of an existing generative adversarial network (GAN) into a game played by a generator, a forward navigator, and a backward navigator, which try to minimize a distribution-to-distribution transport cost by optimizing both the distribution of the generator and the conditional transport plans specified by the navigators, against a critic that does the opposite by inflating the point-to-point transport cost. On a wide variety of benchmark datasets for generative modeling, substituting the default statistical distance of an existing GAN with the ACT divergence is shown to consistently improve performance.

1. INTRODUCTION

Measuring the difference between two probability distributions is a fundamental problem in statistics and machine learning (Cover, 1999; Bishop, 2006; Murphy, 2012). A variety of statistical distances have been proposed to quantify the difference, which often serves as the first step in building a generative model. Commonly used statistical distances include the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), the Jensen-Shannon (JS) divergence (Lin, 1991), and the Wasserstein distance (Kantorovich, 2006). While widely used for generative modeling (Kingma and Welling, 2013; Goodfellow et al., 2014; Arjovsky et al., 2017; Balaji et al., 2019), they all have their own limitations. The KL divergence, directly related to both maximum likelihood estimation and variational inference, is amenable to mini-batch stochastic gradient descent (SGD) based optimization (Wainwright and Jordan, 2008; Hoffman et al., 2013; Blei et al., 2017). However, it requires the two probability distributions to share the same support, and hence is often inapplicable if either of them is an implicit distribution whose probability density function (PDF) is unknown (Mohamed and Lakshminarayanan, 2016; Huszár, 2017; Tran et al., 2017; Yin and Zhou, 2018). The JS divergence is directly related to the mini-max loss of a generative adversarial net (GAN) when the discriminator is optimal (Goodfellow et al., 2014). However, it is difficult to maintain a good balance between the generator and the discriminator, making GANs notoriously brittle to train. The Wasserstein distance is a widely used metric that allows the two distributions to have non-overlapping supports (Villani, 2008; Santambrogio, 2015; Peyré and Cuturi, 2019). However, it is challenging to estimate in its primal form and generally results in biased sample gradients when its dual form is employed (Arjovsky et al., 2017; Bellemare et al., 2017; Bottou et al., 2017; Bińkowski et al., 2018; Bernton et al., 2019).
To address the limitations of these existing statistical distances, we introduce conditional transport (CT) as a new divergence to quantify the difference between two probability distributions. We refer to them as the source and target distributions and denote their probability density functions (PDFs) as $p_X(x)$ and $p_Y(y)$, respectively. The CT divergence is defined with a bidirectional distribution-to-distribution transport. It consists of a forward CT that transports the source to the target distribution, and a backward CT that reverses the transport direction. Our intuition is that given a source (target) point, it is more likely to be transported to a target (source) point closer to it. Denoting $d(x, y) = d(y, x)$ as a learnable function and $c(x, y) = c(y, x) \ge 0$, with $c(x, y) = 0$ when $x = y$, as the point-to-point transport cost, the goal is to minimize the transport cost between the two distributions. The forward CT is constructed in three steps: 1) We define a forward "navigator" as $\pi(y \mid x) = e^{-d(x,y)} p_Y(y) / \int e^{-d(x,y')} p_Y(y')\, dy'$, a conditional distribution specifying how likely a given source point $x$ is to be transported to distribution $p_Y(y)$ via path $x \to y$; 2) We define the cost of a forward $x$-transporting CT as $\int c(x, y)\, \pi(y \mid x)\, dy$, the expected cost of employing the forward navigator to transport $x$ to a random target point; 3) We define the total cost of the forward CT as $\int p_X(x) \int c(x, y)\, \pi(y \mid x)\, dy\, dx$, the expectation of the cost of a forward $x$-transporting CT with respect to $p_X(x)$. Similarly, we construct the backward CT by first defining a backward navigator as $\pi(x \mid y) = e^{-d(x,y)} p_X(x) / \int e^{-d(x',y)} p_X(x')\, dx'$ and then its total cost as $\int p_Y(y) \int c(x, y)\, \pi(x \mid y)\, dx\, dy$. Estimating the CT divergence involves both $\pi(x \mid y)$ and $\pi(y \mid x)$, which, however, are generally intractable to evaluate and sample from, except in a few limited settings where both $p_X(x)$ and $p_Y(y)$ are exponential family distributions conjugate to $e^{-d(x,y)}$.
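One case where the navigator is exactly tractable is a target distribution supported on finitely many points: the normalizing integral becomes a finite sum, so $\pi(\cdot \mid x)$ is a softmax over negative distances. A minimal NumPy sketch of this special case follows; the function name `forward_navigator` and the use of a plain Python callable for $d$ are illustrative assumptions, not part of the paper:

```python
import numpy as np

def forward_navigator(x, ys, d):
    """pi(y_j | x) for a uniform discrete target supported on ys.

    With finitely many target points, the normalizing integral in the
    navigator definition reduces to a sum, so pi(. | x) is exactly a
    softmax over the negative distances -d(x, y_j).
    """
    logits = -np.array([d(x, y) for y in ys], dtype=float)
    logits -= logits.max()          # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()
```

Nearer target points receive exponentially larger transport probability, matching the stated intuition that a source point is more likely transported to a closer target point.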
To apply the CT divergence in a general setting where we only have access to random samples from the distributions, we introduce asymptotic CT (ACT) as a divergence measure that is friendly to mini-batch SGD based optimization. The ACT divergence is the expected value of the CT divergence in which $p_X(x)$ and $p_Y(y)$ are both replaced with their discrete empirical distributions, supported respectively on $N$ independent and identically distributed (iid) random samples from $p_X(x)$ and $M$ iid random samples from $p_Y(y)$. The ACT divergence is asymptotically equivalent to the CT divergence as both $N \to \infty$ and $M \to \infty$. Intuitively, it can also be interpreted as performing both a forward one-to-$M$ stochastic CT from the source to the target and a backward one-to-$N$ stochastic CT from the target to the source, with the expected cost providing an unbiased sample estimate of the ACT divergence. We show that, similar to the KL divergence, ACT provides unbiased sample gradients, but unlike it, neither $p_X(x)$ nor $p_Y(y)$ needs to be known. Similar to the Wasserstein distance, it does not require the distributions to share the same support, but unlike it, the sample estimates of ACT and its gradients are unbiased and straightforward to compute. In GANs or Wasserstein GANs (Arjovsky et al., 2017), an optimal discriminator or critic is required to unbiasedly estimate the JS divergence or Wasserstein distance and hence the gradients of the generator (Bottou et al., 2017). However, this is rarely the case in practice, motivating a common remedy to stabilize training by carefully regularizing the gradients, such as clipping or normalizing their values (Gulrajani et al., 2017; Miyato et al., 2018). By contrast, in an adversarial game under ACT, the optimization of the critic, which manipulates the point-to-point transport cost $c(x, y)$ but not the navigators' conditional distributions for $x \to y$ and $x \leftarrow y$, has no impact on how ACT is estimated.
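Concretely, on mini-batches $\{x_i\}_{i=1}^N$ and $\{y_j\}_{j=1}^M$, both navigators reduce to softmax weightings of the pairwise distances, and the ACT estimate is the sum of the softly weighted forward and backward costs. The following is a minimal NumPy sketch under that reading; the function names and the squared-Euclidean defaults for $d$ and $c$ are illustrative assumptions (in the paper, $d$ is a learnable function):

```python
import numpy as np

def _softmax(Z, axis):
    # Numerically stable softmax along the given axis.
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def act_divergence(xs, ys, d=None, c=None):
    """Mini-batch ACT estimate between two empirical distributions.

    xs: (N, V) samples from the source p_X; ys: (M, V) samples from
    the target p_Y. d is the navigators' distance function and c the
    point-to-point cost; both default to squared Euclidean distance
    here purely for illustration.
    """
    if d is None:
        d = lambda X, Y: ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    if c is None:
        c = d
    D = d(xs, ys)                   # (N, M) navigator "distances"
    C = c(xs, ys)                   # (N, M) transport costs
    # Forward one-to-M CT: each x_i is stochastically routed to the M
    # target samples with probabilities softmax_j(-d(x_i, y_j)).
    forward = (_softmax(-D, axis=1) * C).sum(axis=1).mean()
    # Backward one-to-N CT: each y_j is routed to the N source samples
    # analogously, with probabilities softmax_i(-d(x_i, y_j)).
    backward = (_softmax(-D, axis=0) * C).sum(axis=0).mean()
    return forward + backward
```

Because the softmax weights and costs are computed directly from the samples, the estimate and its gradients are straightforward to obtain with automatic differentiation, with no inner optimization of a dual potential as in the Wasserstein distance.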
For this reason, the sample gradients stay unbiased regardless of how well the critic is optimized. To demonstrate the use of the ACT (or CT) divergence, we apply it to train implicit (or explicit) distributions to model 1D and 2D toy data, MNIST digits, and natural images. The implicit distribution is defined by a deep generative model (DGM) that is simple to sample from. We focus on adapting existing GANs, with minimal changes to their settings except for substituting the statistical distances in their loss functions with the ACT divergence; we leave tailoring the network architectures to the ACT divergence to future study. More specifically, we modify the GAN loss function into an adversarial game between a generator, a forward navigator, and a backward navigator, which try to minimize the distribution-to-distribution transport cost by optimizing both the fake data distribution $p_Y(y)$ and the two conditional point-to-point navigation-path distributions $\pi(y \mid x)$ and $\pi(x \mid y)$, versus a critic that does the opposite by inflating the point-to-point transport cost $c(x, y)$. Modifying an existing (Wasserstein) GAN with the ACT divergence, our experiments show consistent improvements in not only quantitative performance and generation quality, but also learning stability.

2. CONDITIONAL TRANSPORT WITH GENERATOR, NAVIGATORS, AND CRITIC

Denote $x$ as a data point taking its value in $\mathbb{R}^V$. In practice, we observe a finite set $\mathcal{X} = \{x_i\}_{i=1}^{|\mathcal{X}|}$, consisting of $|\mathcal{X}|$ data samples assumed to be iid drawn from $p_X(x)$. Given $\mathcal{X}$, the usual task is to learn a distribution to approximate $p_X(x)$, explaining how the data in $\mathcal{X}$ are generated. To approximate

