

Abstract

We propose conditional transport (CT) as a new divergence to measure the difference between two probability distributions. The CT divergence consists of the expected cost of a forward CT, which constructs a navigator to stochastically transport a data point of one distribution to the other distribution, and that of a backward CT, which reverses the transport direction. To apply it to distributions whose probability density functions are unknown but from which random samples are accessible, we further introduce asymptotic CT (ACT), whose estimation only requires access to mini-batch based discrete empirical distributions. Equipped with two navigators that amortize the computation of conditional transport plans, the ACT divergence comes with unbiased sample gradients that are straightforward to compute, making it amenable to mini-batch stochastic gradient descent based optimization. When applied to train a generative model, the ACT divergence is shown to strike a good balance between mode covering and mode seeking behaviors and to strongly resist mode collapse. To model high-dimensional data, we show that it is sufficient to modify the adversarial game of an existing generative adversarial network (GAN) into a game played by a generator, a forward navigator, and a backward navigator, which try to minimize a distribution-to-distribution transport cost by optimizing both the distribution of the generator and the conditional transport plans specified by the navigators, versus a critic that does the opposite by inflating the point-to-point transport cost. On a wide variety of benchmark datasets for generative modeling, substituting the default statistical distance of an existing GAN with the ACT divergence is shown to consistently improve the performance.

(Under review as a conference paper at ICLR 2021.)
The CT divergence is defined through a bidirectional distribution-to-distribution transport: a forward CT that transports the source distribution to the target distribution, and a backward CT that reverses the transport direction. Our intuition is that a given source (target) point is more likely to be transported to a target (source) point closer to it. Denote d(x, y) = d(y, x) as a learnable function and c(x, y) = c(y, x) ≥ 0 as the point-to-point transport cost, where the equality holds when x = y; the goal is to minimize the transport cost between the two distributions. The forward CT is constructed in three steps: 1) we define a forward "navigator" as π(y | x) = e^{-d(x,y)} p_Y(y) / ∫ e^{-d(x,y')} p_Y(y') dy', a conditional distribution specifying how likely a given source point x is to be transported to distribution p_Y(y) via path x → y; 2) we define the cost of a forward x-transporting CT as ∫ c(x, y) π(y | x) dy, the expected cost of employing the forward navigator to transport x to a random target point; 3) we define the total cost of the forward CT as ∫ p_X(x) ∫ c(x, y) π(y | x) dy dx, the expectation of the cost of a forward x-transporting CT with respect to p_X(x). Similarly, we construct the backward CT by first defining a backward navigator as π(x | y) = e^{-d(x,y)} p_X(x) / ∫ e^{-d(x',y)} p_X(x') dx' and then its total cost as ∫ p_Y(y) ∫ c(x, y) π(x | y) dx dy. Estimating the CT divergence involves both π(x | y) and π(y | x), which are generally intractable to evaluate and sample from, except in a few limited settings where both p_X(x) and p_Y(y) are exponential-family distributions conjugate to e^{-d(x,y)}. To apply the CT divergence in a general setting where we only have access to random samples from the distributions, we introduce asymptotic CT (ACT) as a divergence measure that is friendly to mini-batch SGD based optimization.
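When the target distribution is supported on a few discrete atoms, the forward navigator above reduces to a softmax over -d(x, y) weighted by p_Y. The following NumPy sketch (our own illustration, not from the paper, with a squared Euclidean d as an assumption) makes that concrete:

```python
import numpy as np

def forward_navigator(x, ys, p_y, d=lambda a, b: (a - b) ** 2):
    """pi(y_j | x) ∝ exp(-d(x, y_j)) * p_Y(y_j), normalized over the atoms y_j."""
    logits = -np.array([d(x, yj) for yj in ys]) + np.log(p_y)
    logits -= logits.max()                      # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Three target atoms; the navigator should favor the atom closest to x.
ys = np.array([-2.0, 0.0, 3.0])
p_y = np.array([1 / 3, 1 / 3, 1 / 3])
pi = forward_navigator(0.1, ys, p_y)
assert np.isclose(pi.sum(), 1.0)
assert pi.argmax() == 1  # y = 0.0 is the closest atom to x = 0.1
```

This matches the intuition stated above: a source point is transported to a nearby target point with higher probability.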
The ACT divergence is the expected value of the CT divergence in which p_X(x) and p_Y(y) are replaced with their discrete empirical distributions, respectively supported on N independent and identically distributed (iid) random samples from p_X(x) and M iid random samples from p_Y(y). The ACT divergence is asymptotically equivalent to the CT divergence as both N → ∞ and M → ∞. Intuitively, it can also be interpreted as performing both a forward one-to-M stochastic CT from the source to the target and a backward one-to-N stochastic CT from the target to the source, with the expected cost providing an unbiased sample estimate of the ACT divergence. We show that, similar to the KL divergence, ACT provides unbiased sample gradients, but unlike it, neither p_X(x) nor p_Y(y) needs to be known. Similar to the Wasserstein distance, it does not require the distributions to share the same support, but unlike it, the sample estimates of ACT and its gradients are unbiased and straightforward to compute. In GANs or Wasserstein GANs (Arjovsky et al., 2017), an optimal discriminator or critic is required to unbiasedly estimate the JS divergence or Wasserstein distance and hence the gradients of the generator (Bottou et al., 2017). However, this is rarely the case in practice, motivating a common remedy of stabilizing the training by carefully regularizing the gradients, such as clipping or normalizing their values (Gulrajani et al., 2017; Miyato et al., 2018). By contrast, in an adversarial game under ACT, the optimization of the critic, which manipulates the point-to-point transport cost c(x, y) but not the navigators' conditional distributions for x → y and x ← y, has no impact on how ACT is estimated. For this reason, the sample gradients stay unbiased regardless of how well the critic is optimized.
To demonstrate the use of the ACT (or CT) divergence, we apply it to train implicit (or explicit) distributions to model both 1D and 2D toy data, MNIST digits, and natural images. The implicit distribution is defined by a deep generative model (DGM) that is simple to sample from. We focus on adapting existing GANs, with minimal changes to their settings except for substituting the statistical distances in their loss functions with the ACT divergence. We leave tailoring the network architectures to the ACT divergence to future study. More specifically, we modify the GAN loss function to an adversarial game between a generator, a forward navigator, and a backward navigator, which try to minimize the distribution-to-distribution transport cost by optimizing both the fake data distribution p Y (y) and two conditional point-to-point navigation-path distributions π(y | x) and π(x | y), versus a critic that does the opposite by inflating the point-to-point transport cost c(x, y). Modifying an existing (Wasserstein) GAN with the ACT divergence, our experiments show consistent improvements in not only quantitative performance and generation quality, but also learning stability.

1. INTRODUCTION

Measuring the difference between two probability distributions is a fundamental problem in statistics and machine learning (Cover, 1999; Bishop, 2006; Murphy, 2012) . A variety of statistical distances have been proposed to quantify the difference, which often serves as the first step to build a generative model. Commonly used statistical distances include the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) , Jensen-Shannon (JS) divergence (Lin, 1991) , and Wasserstein distance (Kantorovich, 2006) . While being widely used for generative modeling (Kingma and Welling, 2013; Goodfellow et al., 2014; Arjovsky et al., 2017; Balaji et al., 2019) , they all have their own limitations. The KL divergence, directly related to both maximum likelihood estimation and variational inference, is amenable to mini-batch stochastic gradient descent (SGD) based optimization (Wainwright and Jordan, 2008; Hoffman et al., 2013; Blei et al., 2017) . However, it requires the two probability distributions to share the same support, and hence is often inapplicable if either of them is an implicit distribution whose probability density function (PDF) is unknown (Mohamed and Lakshminarayanan, 2016; Huszár, 2017; Tran et al., 2017; Yin and Zhou, 2018) . The JS divergence is directly related to the mini-max loss of a generative adversarial net (GAN) when the discriminator is optimal (Goodfellow et al., 2014) . However, it is difficult to maintain a good balance between the generator and discriminator, making GANs notoriously brittle to train. The Wasserstein distance is a widely used metric that allows the two distributions to have non-overlapping supports (Villani, 2008; Santambrogio, 2015; Peyré and Cuturi, 2019) . However, it is challenging to estimate in its primal form and generally results in biased sample gradients when its dual form is employed (Arjovsky et al., 2017; Bellemare et al., 2017; Bottou et al., 2017; Bińkowski et al., 2018; Bernton et al., 2019) . 
To address the limitations of existing measurement methods, we introduce conditional transport (CT) as a new divergence to quantify the difference between two probability distributions, which we refer to as the source and target distributions, with PDFs p_X(x) and p_Y(y), respectively. Given random samples from the source distribution p_X(x), we consider a DGM defined as y = G_θ(ε), ε ∼ p(ε), where G_θ is a generator that transforms noise ε ∼ p(ε) via a deep neural network parameterized by θ to generate a random sample y ∈ R^V. While the PDF of the generator, denoted as p_θ(y), is often intractable to evaluate, it is straightforward to draw y ∼ p_θ(y) with G_θ. Denote both µ(dx) = p_X(x)dx and ν(dy) = p_θ(y)dy as continuous probability measures over R^V, with µ(R^V) = ∫_{R^V} p_X(x)dx = 1 and ν(R^V) = ∫_{R^V} p_θ(y)dy = 1. The Wasserstein distance in its primal form can be defined with Kantorovich's optimal transport problem (Kantorovich, 2006; Villani, 2008; Santambrogio, 2015; Peyré and Cuturi, 2019):

W(µ, ν) = min_{π∈Π(µ,ν)} ∫_{R^V×R^V} c(x, y) π(dx, dy) = min_{π∈Π(µ,ν)} E_{(x,y)∼π(x,y)}[c(x, y)],   (1)

where the minimum is taken over Π(µ, ν), defined as the set of all possible joint probability measures π on R^V × R^V with marginals π(A, R^V) = µ(A) and π(R^V, A) = ν(A) for any Borel set A ⊂ R^V. When c(x, y) = ‖x − y‖, we obtain the Wasserstein-1 distance, also known as the Earth Mover's distance, for which the Kantorovich duality yields the dual form W_1(µ, ν) = sup_{f∈Lip_1} {E_{x∼p_X(x)}[f(x)] − E_{y∼p_Y(y)}[f(y)]}, where f is referred to as the "critic" and Lip_1 denotes the set of all 1-Lipschitz functions (Villani, 2008). Intuitively, the critic f plays the role of "amortizing" the computation of the optimal transport plan. However, as it is difficult to enforce the 1-Lipschitz constraint, one often resorts to approximations (Arjovsky et al., 2017; Gulrajani et al., 2017; Wei et al., 2018; Miyato et al., 2018) that inevitably introduce bias into the estimation of W_1 and its gradient (Bellemare et al., 2017; Bottou et al., 2017).
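For equal-size empirical measures with uniform weights, the Kantorovich primal problem in (1) reduces to an assignment problem over permutations. The sketch below (our own illustration, not from the paper) brute-forces it for a tiny 1D example and checks the well-known fact that, in 1D with a convex cost, the optimal plan matches sorted samples:

```python
import numpy as np
from itertools import permutations

def primal_ot_cost(xs, ys, c=lambda a, b: abs(a - b)):
    """Kantorovich primal cost between two equal-size uniform empirical measures.

    With weights 1/n on both sides, an optimal coupling is a permutation
    (by Birkhoff's theorem), so brute force over permutations suffices for tiny n.
    """
    n = len(xs)
    return min(
        sum(c(xs[i], ys[p[i]]) for i in range(n)) / n
        for p in permutations(range(n))
    )

xs = [0.0, 1.0, 2.0]
ys = [2.5, 0.4, 1.1]
w1 = primal_ot_cost(xs, ys)
sorted_cost = np.mean(np.abs(np.sort(xs) - np.sort(ys)))
assert np.isclose(w1, sorted_cost)  # 1D: sorting solves the primal problem
```

The factorial blow-up of this brute force is exactly the combinatorial difficulty the text refers to, which Sinkhorn-type methods and, in this paper, amortized navigators are designed to avoid.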

2.1. FORWARD AND BACKWARD NAVIGATORS AND CONDITIONAL TRANSPORT PLANS

Constraining π ∈ Π(µ, ν), the Wasserstein distance satisfies W(µ, ν) = W(ν, µ). By contrast, the proposed divergence allows π ∉ Π(µ, ν). Denote T_φ(·) ∈ R^H as a neural network based function that transforms its input in R^V into a feature vector in R^H, and d(h_1, h_2) as a function that measures the difference between h_1, h_2 ∈ R^H. We introduce a forward CT, whose transport cost is defined as

C_{φ,θ}(µ → ν) = E_{x∼p_X(x)} E_{y∼π_φ(y|x)}[c(x, y)],   π_φ(y | x) := e^{-d(T_φ(x),T_φ(y))} p_θ(y) / ∫ e^{-d(T_φ(x),T_φ(y'))} p_θ(y') dy',   (2)

where π_φ(y | x) will be analogized to the forward navigator that defines the forward conditional transport plan. Similarly, we introduce a backward CT, whose transport cost is defined as

C_{φ,θ}(µ ← ν) = E_{y∼p_θ(y)} E_{x∼π_φ(x|y)}[c(x, y)],   π_φ(x | y) := e^{-d(T_φ(x),T_φ(y))} p_X(x) / ∫ e^{-d(T_φ(x'),T_φ(y))} p_X(x') dx',   (3)

where π_φ(x | y) will be analogized to a backward navigator. We now define the CT problem as

min_{φ,θ} {C_{φ,θ}(µ, ν)},   C_{φ,θ}(µ, ν) := (1/2) C_{φ,θ}(µ → ν) + (1/2) C_{φ,θ}(µ ← ν),   (4)

where C_{φ,θ}(µ, ν) = C_{φ,θ}(ν, µ) will be referred to as the CT divergence between µ and ν.

Lemma 1. If y ∼ p_θ(y) is equal to x ∼ p_X(x) in distribution and T_φ is chosen such that e^{-d(T_φ(x),T_φ(y))} = 1(x = y), where 1(·) is an indicator function, then both the joint probability measure π defined with p_X(x)π_φ(y | x) in (2) and that defined with p_θ(y)π_φ(x | y) in (3) are in Π(µ, ν).

Lemma 2. If y ∼ p_θ(y) is equal to x ∼ p_X(x) in distribution, then C_{φ,θ}(µ, ν) = C_{φ,θ}(µ → ν) = C_{φ,θ}(µ ← ν) ≥ W(µ, ν) = 0, where the equality can be achieved if e^{-d(T_φ(x),T_φ(y))} = 1(x = y).

The proofs are deferred to Appendix A.
Note that in general, before both θ and φ reach their optima, the conditions specified in Lemmas 1 and 2 are not satisfied and the joint probability measure π defined with p_X(x)π_φ(y | x) or p_θ(y)π_φ(x | y) is not restricted to be in Π(µ, ν); hence it is possible for C_{φ,θ}(µ → ν), C_{φ,θ}(µ ← ν), or C_{φ,θ}(µ, ν) to go below W(µ, ν) during training.

2.2. ASYMPTOTIC CONDITIONAL TRANSPORT

Computing the CT divergence requires either knowing the PDFs of both navigators π_φ(y | x) and π_φ(x | y), or being able to draw random samples from them. However, usually neither is true unless both p_θ(y) and p_X(x) are known and conjugate to e^{-d(T_φ(x),T_φ(y))}. For example, if d(T_φ(x), T_φ(y)) = ‖φx − φy‖²₂ = (x − y)ᵀ(φᵀφ)(x − y), where φ ∈ R^{V×V} is a full-rank matrix, and both p_X(x) and p_θ(y) are multivariate Gaussian distributions, then one may show that both π_φ(y | x) and π_φ(x | y) are also multivariate Gaussian. In the experimental results section, we will provide a toy example based on univariate normal distributions for illustration. Below we show how to apply the CT divergence in a general setting that only requires access to random samples of both x and y. While knowing neither p_X(x) nor p_θ(y), we can obtain mini-batch based empirical probability measures μ̂_N and ν̂_M, as defined below, to guide the optimization of G_θ in an iterative manner. With N random observations sampled without replacement from X, we define

μ̂_N = (1/N) Σ_{i=1}^N δ_{x_i},   {x_1, . . . , x_N} ⊆ X,   (5)

as an empirical probability measure for x. Similarly, with M random samples of the generator, we define an empirical probability measure for y as

ν̂_M = (1/M) Σ_{j=1}^M δ_{y_j},   y_j = G_θ(ε_j),   ε_j iid∼ p(ε).   (6)

Substituting p_θ(y) in (2) with ν̂_M(y), the continuous forward navigator becomes a discrete one:

π̂_φ(y | x) = Σ_{j=1}^M π_M(y_j | x, φ) δ_{y_j},   π_M(y_j | x, φ) := e^{-d(T_φ(x),T_φ(y_j))} / Σ_{j'=1}^M e^{-d(T_φ(x),T_φ(y_{j'}))}.   (7)

Thus the cost of a forward CT becomes

C_{φ,θ}(µ → ν̂_M) = E_{x∼p_X(x)} E_{y∼π̂_φ(y|x)}[c(x, y)] = E_{x∼p_X(x)}[C_{φ,θ}(x → ν̂_M)],   where C_{φ,θ}(x → ν̂_M) := Σ_{j=1}^M c(x, y_j) π_M(y_j | x, φ).   (8)

Similarly, we have the backward navigator and the cost of a backward CT as
π̂_φ(x | y) = Σ_{i=1}^N π_N(x_i | y, φ) δ_{x_i},   π_N(x_i | y, φ) := e^{-d(T_φ(x_i),T_φ(y))} / Σ_{i'=1}^N e^{-d(T_φ(x_{i'}),T_φ(y))},   (9)

C_{φ,θ}(μ̂_N ← ν) = E_{y∼p_θ(y)} E_{x∼π̂_φ(x|y)}[c(x, y)] = E_{y∼p_θ(y)}[C_{φ,θ}(μ̂_N ← y)],   where C_{φ,θ}(μ̂_N ← y) := Σ_{i=1}^N c(x_i, y) π_N(x_i | y, φ).   (10)

Combining both the forward and backward CTs, we define the asymptotic CT (ACT) problem as

min_{φ,θ} {C_{φ,θ}(µ, ν, N, M)},   (11)

where C_{φ,θ}(µ, ν, N, M) is the ACT divergence defined as

C_{φ,θ}(µ, ν, N, M) = (1/2) E_{y_{1:M} iid∼ p_θ(y)}[C_{φ,θ}(µ → ν̂_M)] + (1/2) E_{x_{1:N} iid∼ p_X(x)}[C_{φ,θ}(μ̂_N ← ν)].   (12)

Lemma 3. The ACT divergence is asymptotic in that lim_{N,M→∞} C_{φ,θ}(µ, ν, N, M) = C_{φ,θ}(µ, ν).

Lemma 4. With x_{1:N} iid∼ p_X(x) and y_{1:M} iid∼ p_θ(y), and drawing x ∼ μ̂_N(x) and y ∼ ν̂_M(y), an unbiased sample estimator of the ACT divergence can be expressed as

L_{φ,θ}(x_{1:N}, y_{1:M}) = (1/2) Σ_{j=1}^M c(x, y_j) π_M(y_j | x, φ) + (1/2) Σ_{i=1}^N c(x_i, y) π_N(x_i | y, φ).   (13)

Intuitively, the first term in the summation can be interpreted as the expected cost of following the forward navigator to stochastically transport a random source point x to one of the M randomly instantiated "anchors" of the target distribution. The second term shares a similar interpretation. Note that in optimal transport, the Wasserstein distance W(µ, ν) in its primal form, shown in (1), is in general intractable to compute. To use the primal form, one often resorts to the sample Wasserstein distance W(μ̂_N, ν̂_M), computing which, however, requires solving a combinatorial optimization problem (Peyré and Cuturi, 2019). To make W(μ̂_N, ν̂_M) practical to compute, one remedy is to smooth the optimal transport plan between μ̂_N and ν̂_M with an entropic regularization term, resulting in the Sinkhorn distance, which still requires an iterative estimation procedure whose convergence is sensitive to the entropic regularization coefficient (Cuturi, 2013; Genevay et al., 2016; 2018; Xie et al., 2020).
When the entropic regularization coefficient goes to infinity, one recovers the maximum mean discrepancy (MMD), which MMD-GAN (Li et al., 2015; 2017) minimizes in a kernel space found by an adversarial mechanism. By contrast, equipped with two navigators, ACT can directly compute a forward point-to-distribution transport cost, denoted as C_{φ,θ}(x → ν̂_M) in (8), and a backward one, denoted as C_{φ,θ}(μ̂_N ← y) in (10), which are combined to define an unbiased sample estimator, shown in (13), of the ACT divergence. Intuitively, the navigators play the role of "amortizing" the computation of the conditional transport plans between two empirical distributions, removing the need for an iterative procedure to estimate the transport cost. From this amortization perspective, the navigators for ACT are analogous to the critic for the Wasserstein distance in its dual form.

Lemma 5. Another unbiased sample estimator, fully using the data in mini-batches x_{1:N} and y_{1:M} and computing an amortized transport cost between two empirical distributions, can be expressed as

L_{φ,θ}(x_{1:N}, y_{1:M}) = Σ_{i=1}^N Σ_{j=1}^M c(x_i, y_j) [ (1/(2N)) π_M(y_j | x_i, φ) + (1/(2M)) π_N(x_i | y_j, φ) ].   (14)
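The mini-batch estimator in (14) is a few lines of matrix arithmetic. The NumPy sketch below is a simplified illustration of it for 1D samples, taking c(x, y) = (x − y)², T_φ as the identity map, and a scalar sharpness standing in for the learnable navigator parameters (all our own simplifying assumptions):

```python
import numpy as np

def act_loss(x, y, scale=1.0):
    """Mini-batch ACT estimator of Lemma 5 for 1D samples.

    c(x, y) = (x - y)^2 and d(T(x), T(y)) = scale * (x - y)^2, i.e. T is the
    identity map; `scale` mimics the learnable sharpness of the navigators.
    """
    N, M = len(x), len(y)
    cost = (x[:, None] - y[None, :]) ** 2          # N x M point-to-point costs
    logits = -scale * cost
    # Forward navigator: each row (fixed x_i) is a distribution over y_1:M.
    pi_fwd = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi_fwd /= pi_fwd.sum(axis=1, keepdims=True)
    # Backward navigator: each column (fixed y_j) is a distribution over x_1:N.
    pi_bwd = np.exp(logits - logits.max(axis=0, keepdims=True))
    pi_bwd /= pi_bwd.sum(axis=0, keepdims=True)
    return (cost * (pi_fwd / (2 * N) + pi_bwd / (2 * M))).sum()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=256)
loss_matched = act_loss(x, rng.normal(0.0, 1.0, size=256))
loss_shifted = act_loss(x, rng.normal(5.0, 1.0, size=256))
assert loss_matched >= 0 and loss_shifted > loss_matched
```

As expected of a divergence, the estimate is non-negative and grows when the two sample sets are pulled apart; unlike the sample Wasserstein distance, no iterative inner solve is needed.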

2.3. CRITIC BASED ADVERSARIAL FEATURE EXTRACTION

A naive definition of the transport cost between x and y is some distance between their raw feature vectors, such as c(x, y) = ‖x − y‖²₂, which, however, often poorly reflects the difference between high-dimensional data residing on low-dimensional manifolds. For this reason, following the use of cosine dissimilarity (Salimans et al., 2018), we introduce a critic T_η(·), parameterized by η, to help define an adversarial transport cost between two high-dimensional data points:

c_η(x, y) = 1 − cos(T_η(x), T_η(y)),   cos(h_1, h_2) := |h_1ᵀ h_2| / (√(h_1ᵀ h_1) √(h_2ᵀ h_2)).   (15)

Intuitively, to minimize the distribution-to-distribution transport cost, the generator tries to mimic the true data and both navigators try to optimize the conditional path distributions. By contrast, the critic does the opposite by inflating the point-to-point transport cost. In summary, given the training data set X, to train the generator G_θ, the forward navigator π_φ(y | x), the backward navigator π_φ(x | y), and the critic T_η, we propose to solve the mini-max problem

min_{φ,θ} max_η E_{x_{1:N}⊆X, ε_{1:M} iid∼ p(ε)}[L_{φ,θ,η}(x_{1:N}, {G_θ(ε_j)}_{j=1}^M)],   (16)

where L_{φ,θ,η} is defined the same as in (13) or (14), except that c(x_i, y_j) is replaced with c_η(x_i, y_j) shown in (15), and y_{1:M} := {G_θ(ε_j)}_{j=1}^M is drawn using reparameterization as in (6). We train φ and θ with SGD using ∇_{φ,θ} L_{φ,θ,η}(x_{1:N}, {G_θ(ε_j)}_{j=1}^M) and, if the critic is employed, train η with stochastic gradient ascent using ∇_η L_{φ,θ,η}(x_{1:N}, {G_θ(ε_j)}_{j=1}^M). Note that in existing critic-based GANs, how well the critic is optimized directly affects how accurately and stably the generator's gradients can be estimated. By contrast, regardless of how well the critic is optimized to inflate c_η(x, y), Lemma 4 shows that the sample estimate L_{φ,θ,η}(x_{1:N}, {G_θ(ε_j)}_{j=1}^M) of the ACT divergence and its gradients stay unbiased.
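As an illustration of the adversarial cost in (15), the sketch below implements c_η with a fixed random linear map standing in for the critic T_η (an assumption for the example; in the paper T_η is a trained neural network). Whatever the critic's parameters, the cost keeps the properties the CT construction relies on: c_η(x, x) = 0, symmetry, and boundedness.

```python
import numpy as np

rng = np.random.default_rng(1)
W_eta = rng.normal(size=(64, 784))  # stand-in critic: a fixed linear T_eta

def T_eta(x):
    return W_eta @ x

def c_eta(x, y):
    """Adversarial cost c_eta(x, y) = 1 - |cos| similarity of critic features."""
    hx, hy = T_eta(x), T_eta(y)
    cos = abs(hx @ hy) / (np.linalg.norm(hx) * np.linalg.norm(hy))
    return 1.0 - cos

x = rng.normal(size=784)
y = rng.normal(size=784)
assert np.isclose(c_eta(x, x), 0.0)          # zero cost for identical points
assert np.isclose(c_eta(x, y), c_eta(y, x))  # symmetric
assert 0.0 <= c_eta(x, y) <= 1.0             # bounded, by the |cos| definition
```

In training, η would be updated by gradient ascent on the ACT loss to inflate this cost, while θ and φ descend on it, as in (16).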
Thus in ACT one can also train the critic parameter η using a loss other than (16), such as the cross-entropy discriminator loss used in vanilla GANs to discriminate between x and G_θ(ε). This point will be verified in our ablation study.

3. EXPERIMENTAL RESULTS

CT divergence for toy data: As a proof of concept, we illustrate optimization under the CT divergence in 1D, with x, y, φ, θ ∈ R. We consider a univariate normal distribution based example: p(x) = N(0, 1), p_θ(y) = N(0, e^θ), c(x, y) = (x − y)², and d(T_φ(x), T_φ(y)) = (x − y)²/(2e^φ). Thus θ = 0 is the optimal solution that makes ν = µ. Denoting σ(a) = 1/(1 + e^{−a}) as the sigmoid function, we have analytic forms of the Wasserstein distance as W₂(µ, ν)² = (1 − e^{θ/2})², the forward and backward navigators as π_φ(y | x) = N(σ(θ − φ)x, σ(θ − φ)e^φ) and π_φ(x | y) = N(σ(−φ)y, σ(φ)), and the forward and backward CT costs as C_{φ,θ}(µ → ν) = σ(φ − θ)(e^θ + σ(φ − θ)) and C_{φ,θ}(µ ← ν) = σ(φ)(1 + σ(φ)e^θ) (see Appendix B.1 for more details). Thus when applying gradient descent to minimize the CT divergence C_{φ,θ}(µ, ν), we expect the generator parameter θ → 0 as long as the learning rate of the navigator parameter φ is appropriately controlled to prevent e^φ → 0 from happening too soon. This is confirmed by Fig. 1, which shows that long before e^φ approaches zero, θ has already converged close to zero. This suggests that the navigator parameter φ mainly plays the role of assisting the learning of θ. It is also interesting to observe that the CT divergence keeps descending towards zero even when W₂(µ, ν)² has already reached close to zero. As θ and φ converge towards their optimal solutions under the CT divergence, we observe that C_{φ,θ}(µ → ν) and C_{φ,θ}(µ ← ν) get closer to each other. Moreover, C_{φ,θ}(µ, ν) initially stays above W₂(µ, ν)² = (1 − e^{θ/2})² but eventually becomes very close to it, which agrees with what Lemma 2 suggests. The second and third subplots of Fig. 1 describe the descent trace of the gradient of the CT cost with respect to (w.r.t.) θ and φ, respectively, while the fourth to sixth subplots show the forward, backward, and bidirectional CT costs, respectively, against θ when e^φ is optimized close to its optimum (see Fig. 8 for analogous plots for additional values of e^φ).
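The analytic costs above are easy to check numerically. The sketch below (a self-contained illustration, not code from the paper) evaluates the bidirectional CT cost on a grid of θ for a fixed small e^φ and confirms that its minimizer sits near the optimum θ = 0:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ct_cost(theta, phi):
    """Analytic bidirectional CT cost for p(x)=N(0,1), p_theta(y)=N(0,e^theta),
    c(x,y)=(x-y)^2, d=(x-y)^2/(2 e^phi), from the formulas in the text."""
    fwd = sigmoid(phi - theta) * (np.exp(theta) + sigmoid(phi - theta))
    bwd = sigmoid(phi) * (1.0 + sigmoid(phi) * np.exp(theta))
    return 0.5 * (fwd + bwd)

thetas = np.linspace(-1.0, 1.0, 2001)
# With e^phi small (phi = -4), the bidirectional cost is minimized near theta = 0.
theta_star = thetas[np.argmin(ct_cost(thetas, -4.0))]
assert abs(theta_star) < 0.1
```

Plotting `ct_cost` against θ for several φ values reproduces the behavior described next: the forward and backward terms pull θ in opposite directions, and their combination is minimized near the true optimum.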
It is interesting to notice that the forward cost is minimized at e^θ > 1, which implies mode covering, and the backward cost is minimized as e^θ → 0, which implies mode seeking, while the bidirectional cost is minimized at around the optimal solution e^θ = 1. The forward CT cost exhibits a flattened curve on the right-hand side of its minimum; adding the backward CT cost to it not only moves that minimum left, making it closer to θ = 0, but also raises the whole curve on the right-hand side, making the optimum of θ easier to reach via gradient descent.

[Figure 1. Left: Evolution of the CT divergence, its parameters, its forward and backward costs, and the corresponding Wasserstein distance. Middle: Gradients of the CT w.r.t. θ or φ; the 2D trace of (θ, e^φ) is marked with red arrows. Right: The forward cost C_{φ,θ}(µ → ν) = σ(φ − θ)(e^θ + σ(φ − θ)), backward cost C_{φ,θ}(µ ← ν) = σ(φ)(1 + σ(φ)e^θ), and bidirectional CT cost against θ when e^φ is optimized to a small value, showing that combining the forward and backward costs balances mode covering and seeking, making it easier for θ to move towards its optimum.]

We further consider a 1D example to illustrate the properties of the conditional transport distributions of the navigators, and analyze the risk of them degenerating to point-mass distributions when optimized under the CT or ACT divergence. We defer the details to Appendix B.2 and Fig. 9.

ACT for 1D toy data: We move on to model empirical samples from a true data distribution, for which it is natural to apply the ACT divergence. To parameterize ACT, we apply one deep neural network to the generator G_θ and another to T_φ, which is shared by both navigators. We use the squared Euclidean (i.e., L²₂) distance to define both the cost c(x, y) and the distance d(h_1, h_2). We first consider a 1D example, where X consists of |X| = 5,000 samples x_i ∈ R from a bimodal Gaussian mixture p_X(x) = (1/4) N(x; −5, 1) + (3/4) N(x; 2, 1). We illustrate in Fig.
2 the training with unbiased sample gradients ∇_{φ,θ} L_{φ,θ}(X, y_{1:M}) of the ACT divergence shown in (14), where y_j = G_θ(ε_j). The top panel shows the ACT divergence, its backward and forward costs, and the Wasserstein distance between the empirical probability measures μ̂_N and ν̂_M defined as in (5) and (6). We set M = N, and hence W₂(μ̂_X, ν̂_Y)² can be exactly computed by sorting the 1D elements of x_{1:N} and y_{1:N} (Peyré and Cuturi, 2019). We first consider N = |X| = 5000. Fig. 2 (top) shows that the ACT divergence converges close to W₂(μ̂_X, ν̂_Y)², and the forward and backward costs move closer to each other and can sometimes go below W₂(μ̂_X, ν̂_Y)². Fig. 2 (bottom) shows that minimizing the ACT divergence successfully drives the generator distribution towards the true data density: from left to right, we observe that initially the generator focuses on fitting a single mode; at around the 500th iteration, as the forward and backward navigators get better, they start to help the generator locate the missing mode, and a blue density mode starts to form there; as the generator and both navigators are further optimized, the generator clearly captures both modes and the fit keeps improving; finally, the generator well approximates the data density. Under the guidance of the ACT divergence, the generator and navigators help each other: an optimized generator helps the two navigators train and realize the missing mode, and the optimized navigators help the generator locate under-fitted regions and hence better fit the true data density. Given the same X, below we further consider setting N = 20, 200, or 5000 to train the generator, using either the Wasserstein distance W₂(μ̂_N, ν̂_N)² or the ACT divergence L_{φ,θ}(x_{1:N}, y_{1:N}) as the loss function. As shown in the right column of Fig.
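The exact 1D sample Wasserstein distance used as a reference above is a one-liner based on sorting. A short sketch (our own, with assumed Gaussian inputs):

```python
import numpy as np

def w2_squared_1d(x, y):
    """Exact squared Wasserstein-2 distance between two equal-size 1D
    empirical measures: in 1D the optimal plan matches sorted samples."""
    assert len(x) == len(y)
    return np.mean((np.sort(x) - np.sort(y)) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)
assert np.isclose(w2_squared_1d(x, x.copy()), 0.0)  # identical samples
shift = w2_squared_1d(x, x + 3.0)
assert np.isclose(shift, 9.0)                       # pure translation: W2^2 = 3^2
```

This exact computation is what makes 1D a convenient testbed; in higher dimensions the sample Wasserstein distance requires solving the combinatorial problem discussed in Section 2.2.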
3, when the mini-batch size N is as large as 5000, both Wasserstein and ACT lead to a well-trained generator. However, as shown in the left and middle columns, as N gets much smaller, the generator trained with Wasserstein clearly underperforms the one trained with ACT, especially when the mini-batch size becomes as small as N = 20. While the Wasserstein distance W(µ, ν) in theory can well guide the training of a generative model, the sample Wasserstein distance W(μ̂_N, ν̂_N), whose optimal transport plan is locally re-computed for each mini-batch, can be sensitive to the mini-batch size N, which also explains why in practice sample Wasserstein based generative models are difficult to train and require a large mini-batch size (Salimans et al., 2018). By contrast, ACT amortizes its conditional transport plans through its navigators, whose parameter φ is globally updated across mini-batches, leading to a well-trained generator whose performance has low sensitivity to the mini-batch size.

ACT for 2D toy data: We further conduct experiments on four representative 2D datasets: 8-Gaussian mixture, Swiss Roll, Half Moons, and 25-Gaussian mixture, whose results are shown in Figs. 10-13 of Appendix B.3. We apply the vanilla GAN (Goodfellow et al., 2014) and the Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017) as two representatives of mini-max DGMs, which require solving a mini-max loss to train the generator. We then apply generators trained under the sliced Wasserstein distance (SWD) (Deshpande et al., 2018) and the ACT divergence as two representatives of mini-max-free DGMs. Compared to mini-max DGMs, which require an adversarially learned critic in order to train the generator, one clear advantage of mini-max-free DGMs is that the generator is stable to train without the need for an adversarial game. On each 2D dataset, we train these DGMs as one would normally do during the first 15k iterations.
We then train only the generator and freeze all the other learnable model parameters, i.e., the discriminator in GAN, the critic in WGAN, and the navigator parameter φ of the ACT divergence, for another 15k iterations. Fig. 10 illustrates this training process on the 8-Gaussian mixture dataset: for both mini-max DGMs, the mode collapse issue deteriorates after the first 15k iterations, while the training for SWD remains stable and that for ACT continues to improve. Compared to SWD, our method covers all 8 data density modes and moves the generator much closer to the true data density. On the other three datasets, the ACT divergence based DGM also exhibits good training consistency, high stability, and close-to-optimal data generation, as shown in Appendix B.3.

Resistance to mode collapse: We use an 8-Gaussian mixture to empirically evaluate how well a DGM resists mode collapse. Unlike the data in Fig. 10, where the 8 modes are equally weighted, here the mode at the lower-left corner is set to have weight ρ, while the other modes share the same weight of (1 − ρ)/7. We set X with 5000 samples and the mini-batch size to N = 100. When ρ is lowered to 0.05, its corresponding mode is missed by GAN, WGAN, and the SWD-based DGM, while it is well kept by the ACT-based DGM. As an explanation, GANs are known to be susceptible to mode collapse; WGAN and SWD-based DGMs are sensitive to the mini-batch size, as when ρ is small, samples from this mode appear in the mini-batches less frequently than those from any other mode, amplifying the missing mode problem. Similarly, when ρ is increased to 0.5, the other modes are likely to be missed by the baseline DGMs, while the ACT-based DGM does not miss any modes.
The resistance of ACT to mode dropping can be attributed to the amortized computation of its conditional transport plans provided by the navigators, whose parameter is optimized with SGD over mini-batches and, as indicated by Fig. 3, is robust to estimate across a wide range of mini-batch sizes.

Balancing mode covering and seeking: We generalize the ACT divergence by combining the forward and backward transport costs with weights γ and 1 − γ, denoting the result as ACT_γ, which recovers the divergence in (12) when γ = 0.5. Fig. 5 shows the fitting results of ACT_γ on the same 1D bimodal Gaussian mixture used in Fig. 2 and the 2D 8-Gaussian mixture used in Fig. 10; the other experimental settings are kept the same. Comparing the results of different γ in Fig. 5 suggests that minimizing the forward transport cost only encourages the generator to exhibit mode covering behaviors, while minimizing the backward transport cost only encourages mode seeking/dropping behaviors; by contrast, combining both costs provides a user-controllable balance between mode covering and seeking, leading to satisfactory fitting performance, as shown in columns 2 to 4. Note that for a fair comparison, we stop the fitting at the same iteration; in practice, we find that with more training iterations, both γ = 0.75 and γ = 0.25 can achieve results comparable to γ = 0.5. Allowing the mode covering and seeking behaviors to be controlled by tuning γ is an attractive property of ACT_γ. We leave the theoretical analysis of the mode covering/seeking behaviors of ACT_γ for future study.

ACT for natural images: We conduct a variety of experiments on natural images to evaluate the performance and reveal the properties of DGMs optimized under the ACT divergence. We consider three widely used image datasets, including CIFAR-10 (Krizhevsky et al., 2009), CelebA (Liu et al., 2015), and LSUN-bedroom (Yu et al., 2015), and compare the results of DGMs optimized with the ACT divergence against DGMs trained with the vanilla GAN and its various generalizations.
Note that, different from the previous experiments on toy data, where the transport cost c(x, y) can be defined by directly comparing x and y, for natural images, whose differences in raw pixel values are often not that meaningful, we need to compare T_η(x) and T_η(y), where T_η(·) is a critic that plays the role of adversarial feature extraction, as discussed in Section 2.3. In particular, we use (15) to define the transport cost as c_η(x, y) = 1 − cos(T_η(x), T_η(y)). To parameterize the navigators, we also set d(T_φ(x), T_φ(y)) = 1 − cos(T_φ(x), T_φ(y)). We test with the architecture suggested in Radford et al. (2015) as a standard CNN backbone and also apply the architecture in Miyato et al. (2018) as the ResNet (He et al., 2016) backbone. Specifically, we use the same architecture for the generator, and slightly modify the output dimension of the discriminator architecture to 2048 for both T_η of the critic and T_φ used by the two navigators. We train this model with (16) and (14), and to stay close to each backbone's original experimental setting, we set N = M = 64 for all experiments. We summarize the results in Table 1.

Table 1: FID (lower is better) on CIFAR-10, CelebA, and LSUN-bedroom, and Inception Score (IS, higher is better) on CIFAR-10.

Method | FID (CIFAR-10) | FID (CelebA) | FID (LSUN) | IS (CIFAR-10)
WGAN (Arjovsky et al., 2017) | 51.3±1.5 | 37.1±1.9 | 73.3±2.5 | 6.9±0.1
WGAN-GP (Gulrajani et al., 2017) | 19.0±0.8 | 18.0±0.7 | 26.9±1.1 | 7.9±0.1
MMD-GAN (Li et al., 2017) | 73.9±0.1 | - | - | 6.2±0.1
Cramér-GAN (Bellemare et al., 2017) | 40.3±0.2 | 31.3±0.2 | 54.2±0.4 | 6.4±0.1
CTGAN (Wei et al., 2018) | 17.6±0.7 | 15.8±0.6 | 19.5±1.2 | 5.1±0.1
OT-GAN (Salimans et al., 2018) | 32.5±0.6 | 19.4±3.0 | 70.5±5.3 | 8.5±0.1
SWG (Deshpande et al., 2018) | 33.7±1.5 | 21.9±2.0 | 67.9±2.7 | -
Max-SWG (Deshpande et al., 2019) | 23.6±0.5 | 10.1±0.6 | 40.1±4.5 | -
SWGAN (Wu et al., 2019) | 17.0±1.0 | 13.2±0.7 | 14.9±1.0 | -
DCGAN (Radford et al., 2015) | 30.2±0.9 | 52.5±2.2 | 61.7±2.9 | 6.2±0.1
DCGAN backbone + ACT divergence (ACT-DCGAN) | 24.8±1.0 | 29.2±2.0 | 37.4±2.5 | 7.5±0.1
SNGAN (Miyato et al., 2018) | 21.5±1. | | |
2018), self-supervised learning (Chen et al., 2019), and data augmentation (Karras et al., 2020; Zhao et al., 2020a;b), which we leave for future study. As this paper is primarily focused on constructing and validating a new divergence measure, we have concentrated on demonstrating its efficacy on toy data and benchmark image data of moderate resolution. We have focused on adapting DCGAN and SNGAN under ACT, and we leave to future work using ACT to optimize a big DGM, such as BigGAN, which is often trained for high-resolution images with a substantially bigger network and larger mini-batch size and hence requires intensive computation that is not easy to afford.

4. CONCLUSION

We propose conditional transport (CT) as a new divergence to measure the difference between two probability distributions, via the use of both forward-path and backward-path point-to-point conditional distributions. To apply CT to two distributions that have unknown density functions but are easy to sample from, we introduce the asymptotic CT (ACT) divergence, whose estimation only requires access to empirical samples. ACT amortizes the computation of its conditional transport plans via its navigators, removing the need for a separate iterative procedure for each mini-batch, and provides unbiased mini-batch based sample gradients that are simple to compute. Its minimization, achieved through the collaboration between the generator, forward navigator, and backward navigator, is shown to be robust to the mini-batch size. In addition, empirical analysis suggests that the combination weight of the forward and backward CT costs can be adjusted to encourage either mode covering or mode seeking behaviors. We further show that a critic can be integrated into the point-to-point transport cost of ACT to adversarially extract features from high-dimensional data without biasing the sample gradients. We apply ACT to train both a vanilla GAN and a Wasserstein GAN. Consistent improvement is observed in our experiments, which shows the potential of the ACT divergence in broader settings where quantifying the difference between distributions plays an essential role.

A PROOFS

Proof of Lemma 1. If $y \sim p_\theta(y)$ is equal to $x \sim p_X(x)$ in distribution, then $p_X(x) = p_\theta(x)$ for any $x \in \mathbb{R}^V$. For (2) we have
$$\int p_X(x)\,\pi_\phi(y\,|\,x)\,dy = p_X(x)\int \pi_\phi(y\,|\,x)\,dy = p_X(x)$$
and
$$\int p_X(x)\,\pi_\phi(y\,|\,x)\,dx = \int p_X(x)\,\frac{e^{-d(T_\phi(x),T_\phi(y))}\,p_\theta(y)}{\int e^{-d(T_\phi(x),T_\phi(y))}\,p_\theta(y)\,dy}\,dx = p_\theta(y)\int \frac{e^{-d(T_\phi(x),T_\phi(y))}\,p_X(x)}{\int e^{-d(T_\phi(x),T_\phi(y))}\,p_\theta(y)\,dy}\,dx.$$
If $e^{-d(T_\phi(x),T_\phi(y))} = \mathbf{1}(x=y)$, then we further have
$$\int \frac{\mathbf{1}(x=y)\,p_X(x)}{\int \mathbf{1}(x=y)\,p_\theta(y)\,dy}\,dx = \int \mathbf{1}(x=y)\,\frac{p_X(x)}{p_\theta(x)}\,dx = 1$$
and hence it is true that $\int p_X(x)\,\pi_\phi(y\,|\,x)\,dx = p_\theta(y)$. Similarly, for (3) we have $\int p_\theta(y)\,\pi_\phi(x\,|\,y)\,dx = p_\theta(y)$ and can prove $\int p_\theta(y)\,\pi_\phi(x\,|\,y)\,dy = p_X(x)$ given these two conditions.

Proof of Lemma 2. Since $c(x,y) \ge 0$ by definition, we have $\mathcal{C}_{\phi,\theta}(\mu \to \nu) \ge 0$ and $\mathcal{C}_{\phi,\theta}(\mu \leftarrow \nu) \ge 0$. When $\mu = \nu$, it is known that $W(\mu,\nu) = 0$. If $y \sim p_\theta(y)$ is equal to $x \sim p_X(x)$ in distribution, which means $p_X(x) = p_\theta(x)$ and $p_X(y) = p_\theta(y)$ for any $x, y \in \mathbb{R}^V$ and $\mu = \nu$, then we have
$$\begin{aligned}
\mathcal{C}_{\phi,\theta}(\mu \to \nu) &= \iint c(x,y)\,p_X(x)\,\pi_\phi(y\,|\,x)\,dx\,dy = \iint c(x,y)\,\frac{e^{-d(T_\phi(x),T_\phi(y))}\,p_X(x)\,p_\theta(y)}{\int e^{-d(T_\phi(x),T_\phi(y))}\,p_\theta(y)\,dy}\,dx\,dy\\
&= \iint c(y,x)\,\frac{e^{-d(T_\phi(y),T_\phi(x))}\,p_X(y)\,p_\theta(x)}{\int e^{-d(T_\phi(y),T_\phi(x))}\,p_\theta(x)\,dx}\,dx\,dy
= \iint c(y,x)\,\frac{e^{-d(T_\phi(y),T_\phi(x))}\,p_\theta(y)\,p_X(x)}{\int e^{-d(T_\phi(y),T_\phi(x))}\,p_X(x)\,dx}\,dx\,dy\\
&= \iint c(x,y)\,p_\theta(y)\,\frac{e^{-d(T_\phi(x),T_\phi(y))}\,p_X(x)}{\int e^{-d(T_\phi(x),T_\phi(y))}\,p_X(x)\,dx}\,dx\,dy
= \iint c(x,y)\,p_\theta(y)\,\pi_\phi(x\,|\,y)\,dx\,dy = \mathcal{C}_{\phi,\theta}(\mu \leftarrow \nu)
\end{aligned}$$
and hence $\mathcal{C}_{\phi,\theta}(\mu,\nu) = \mathcal{C}_{\phi,\theta}(\mu \to \nu) = \mathcal{C}_{\phi,\theta}(\mu \leftarrow \nu) \ge 0 = W(\mu,\nu)$. If $e^{-d(T_\phi(x),T_\phi(y))} = \mathbf{1}(x=y)$, since $c(x,x) = 0$ by definition, we have
$$\mathcal{C}_{\phi,\theta}(\mu \to \nu) = \iint c(x,y)\,\frac{\mathbf{1}(x=y)\,p_X(x)\,p_\theta(y)}{\int \mathbf{1}(x=y)\,p_\theta(y)\,dy}\,dx\,dy = \iint c(x,y)\,\frac{\mathbf{1}(x=y)\,p_X(x)\,p_\theta(y)}{p_\theta(x)}\,dx\,dy = \iint c(x,y)\,\mathbf{1}(x=y)\,p_\theta(y)\,dx\,dy = \int c(x,x)\,p_\theta(x)\,dx = 0.$$

Proof of Lemma 3.
According to the strong law of large numbers, when $M \to \infty$, $\hat\nu_M(A) = \frac{1}{M}\sum_{j=1}^M \mathbf{1}(y_j \in A)$ converges almost surely to $\frac{1}{M}\sum_{j=1}^M \mathbb{E}_{y_j \sim p_\theta(y)}[\mathbf{1}(y_j \in A)] = \int_A p_\theta(y)\,dy = \nu(A)$, and hence $\mathcal{C}_{\phi,\theta}(\mu \to \hat\nu_M)$ converges to $\mathcal{C}_{\phi,\theta}(\mu \to \nu)$. Therefore, $\mathbb{E}_{y_{1:M}\overset{iid}{\sim}p_\theta(y)}[\mathcal{C}_{\phi,\theta}(\mu \to \hat\nu_M)]$ converges to $\mathcal{C}_{\phi,\theta}(\mu \to \nu)$. Similarly, we can prove that as $N \to \infty$, $\mathbb{E}_{x_{1:N}\overset{iid}{\sim}p_X(x)}[\mathcal{C}_{\phi,\theta}(\hat\mu_N \leftarrow \nu)]$ converges to $\mathcal{C}_{\phi,\theta}(\mu \leftarrow \nu)$. Therefore, $\mathcal{C}_{\phi,\theta}(\mu,\nu,N,M)$ defined in (12) converges to $\frac12\mathcal{C}_{\phi,\theta}(\mu \to \nu) + \frac12\mathcal{C}_{\phi,\theta}(\mu \leftarrow \nu) = \mathcal{C}_{\phi,\theta}(\mu,\nu)$ as $N, M \to \infty$.

Proof of Lemma 4.
$$\begin{aligned}
\mathcal{C}_{\phi,\theta}(\mu,\nu,N,M) &= \tfrac12\,\mathbb{E}_{y_{1:M}\overset{iid}{\sim}p_\theta(y)}\big[\mathcal{C}_{\phi,\theta}(\mu \to \hat\nu_M)\big] + \tfrac12\,\mathbb{E}_{x_{1:N}\overset{iid}{\sim}p_X(x)}\big[\mathcal{C}_{\phi,\theta}(\hat\mu_N \leftarrow \nu)\big]\\
&= \tfrac12\,\mathbb{E}_{x\sim p_X(x),\,y_{1:M}\overset{iid}{\sim}p_\theta(y)}\big[\mathcal{C}_{\phi,\theta}(x \to \hat\nu_M)\big] + \tfrac12\,\mathbb{E}_{x_{1:N}\overset{iid}{\sim}p_X(x),\,y\sim p_\theta(y)}\big[\mathcal{C}_{\phi,\theta}(\hat\mu_N \leftarrow y)\big]\\
&= \tfrac12\,\mathbb{E}_{x\sim\hat p_N(x)}\,\mathbb{E}_{x_{1:N}\overset{iid}{\sim}p_X(x),\,y_{1:M}\overset{iid}{\sim}p_\theta(y)}\big[\mathcal{C}_{\phi,\theta}(x \to \hat\nu_M)\big] + \tfrac12\,\mathbb{E}_{y\sim\hat p_M(y)}\,\mathbb{E}_{x_{1:N}\overset{iid}{\sim}p_X(x),\,y_{1:M}\overset{iid}{\sim}p_\theta(y)}\big[\mathcal{C}_{\phi,\theta}(\hat\mu_N \leftarrow y)\big]\\
&= \mathbb{E}_{x\sim\hat p_N(x),\,y\sim\hat p_M(y)}\,\mathbb{E}_{x_{1:N}\overset{iid}{\sim}p_X(x),\,y_{1:M}\overset{iid}{\sim}p_\theta(y)}\big[\tfrac12\mathcal{C}_{\phi,\theta}(x \to \hat\nu_M) + \tfrac12\mathcal{C}_{\phi,\theta}(\hat\mu_N \leftarrow y)\big]. \quad (18)
\end{aligned}$$
Plugging (8) and (10) into the above equation concludes the proof.

Proof of Lemma 5. Solving the first expectation of (18), we have
$$\mathcal{C}_{\phi,\theta}(\mu,\nu,N,M) = \mathbb{E}_{x_{1:N}\overset{iid}{\sim}p_X(x),\,y_{1:M}\overset{iid}{\sim}p_\theta(y)}\Big[\tfrac{1}{2N}\sum_{i=1}^N \mathcal{C}_{\phi,\theta}(x_i \to \hat\nu_M) + \tfrac{1}{2M}\sum_{j=1}^M \mathcal{C}_{\phi,\theta}(\hat\mu_N \leftarrow y_j)\Big].$$
Plugging (8) and (10) into the above equation concludes the proof.
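The strong-law step in the proof of Lemma 3 is easy to check numerically: the empirical measure of iid draws from ν assigns to a set A a mass that approaches ν(A). A toy NumPy check of ours with ν = N(0, 1) and A = (−1, 1), where ν(A) = Φ(1) − Φ(−1) ≈ 0.6827:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 1_000_000
y = rng.standard_normal(M)                  # y_1, ..., y_M iid ~ nu = N(0, 1)
nu_hat_A = np.mean((y > -1.0) & (y < 1.0))  # empirical measure nu_hat_M(A)
# nu(A) = Phi(1) - Phi(-1) ~= 0.6827; nu_hat_A should agree to about 3 decimals
```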

B.1 ADDITIONAL DETAILS FOR THE UNIVARIATE NORMAL TOY EXAMPLE SHOWN IN (17)

For the toy example specified in (17), exploiting the normal-normal conjugacy, we have an analytical conditional distribution for the forward navigator,
$$\pi_\phi(y\,|\,x) \propto e^{-\frac{(x-y)^2}{2e^\phi}}\,\mathcal{N}(y;0,e^\theta) \propto \mathcal{N}(x;y,e^\phi)\,\mathcal{N}(y;0,e^\theta) = \mathcal{N}\!\left(\frac{e^\theta}{e^\theta+e^\phi}\,x,\ \frac{e^\phi e^\theta}{e^\theta+e^\phi}\right),$$
and an analytical conditional distribution for the backward navigator,
$$\pi_\phi(x\,|\,y) \propto e^{-\frac{(x-y)^2}{2e^\phi}}\,\mathcal{N}(x;0,1) \propto \mathcal{N}(y;x,e^\phi)\,\mathcal{N}(x;0,1) = \mathcal{N}\!\left(\frac{y}{1+e^\phi},\ \frac{e^\phi}{1+e^\phi}\right).$$
Plugging them into (2) and (3), respectively, and solving the expectations, we have
$$\mathcal{C}_{\phi,\theta}(\mu \to \nu) = \mathbb{E}_{x\sim\mathcal{N}(0,1)}\!\left[\frac{e^\phi}{e^\theta+e^\phi}\,e^\theta + \left(\frac{e^\phi}{e^\theta+e^\phi}\,x\right)^{\!2}\right] = \frac{e^\phi}{e^\theta+e^\phi}\,e^\theta + \left(\frac{e^\phi}{e^\theta+e^\phi}\right)^{\!2},$$
$$\mathcal{C}_{\phi,\theta}(\mu \leftarrow \nu) = \mathbb{E}_{y\sim\mathcal{N}(0,e^\theta)}\!\left[\frac{e^\phi}{1+e^\phi} + \left(\frac{e^\phi}{1+e^\phi}\,y\right)^{\!2}\right] = \frac{e^\phi}{1+e^\phi} + \left(\frac{e^\phi}{1+e^\phi}\right)^{\!2}e^\theta.$$

Figure 8: For the univariate normal based toy example specified in (17), we plot the forward, backward, and CT values against θ at four different values of φ, which shows that combining the forward and backward costs balances mode covering and seeking, making it easier for θ to move towards its optimum.
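The closed-form forward cost above can be compared against a mini-batch estimate in the spirit of Lemma 3: draw y_{1:M} from ν, form the softmax navigator weights, and average the resulting transport cost over x ∼ N(0, 1). A NumPy sanity check of ours (not code from the paper) at φ = θ = 0, where the forward cost equals 1/2 + 1/4 = 0.75:

```python
import numpy as np

phi, theta = 0.0, 0.0
a = np.exp(phi) / (np.exp(theta) + np.exp(phi))
analytic_forward = a * np.exp(theta) + a ** 2     # = 0.75 at phi = theta = 0

rng = np.random.default_rng(0)
N, M = 2000, 2000
x = rng.standard_normal(N)                        # x ~ mu = N(0, 1)
y = np.exp(theta / 2.0) * rng.standard_normal(M)  # y ~ nu = N(0, e^theta)
c = (x[:, None] - y[None, :]) ** 2                # cost c(x, y) = (x - y)^2
logits = -c / (2.0 * np.exp(phi))                 # -d(x, y)
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)                 # empirical plan pi_hat_M(y_j | x_i)
mc_forward = (w * c).sum(axis=1).mean()           # approaches 0.75 as N, M grow
```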

B.2 ANALYSIS OF CONDITIONAL TRANSPORT PLANS ON 1D MIXTURE

We consider a 1D example to illustrate the properties of the conditional transport plans of the navigators, and analyze the risk for them to degenerate to point mass distributions when optimized under the CT or ACT divergence. We consider two representative scenarios that both seem to pose a high risk for the conditional transport distributions to degenerate. In the first scenario, we consider both the source and target distributions as a mixture of a point mass and a Gaussian distribution, where the location of the point mass and the center of the Gaussian are far away from each other; we obtain the analytic forms of the conditional transport plans under the CT divergence, and analyze the properties of the empirical conditional transport plans under the ACT divergence. In the second scenario, we consider both the source and target distributions as discrete distributions; we make the support of each distribution contain an outlier supporting point far away from all the other supporting points. In this scenario, we are essentially learning how to transport between two discrete sets.

B.2.1 CT AND ACT BETWEEN TWO POINT-MASS AND GAUSSIAN MIXTURE DISTRIBUTIONS

Below we consider the first scenario, where we assume $p_X(x) = \rho\,\delta_{-1} + (1-\rho)\,\mathcal{N}(-1000,1)$ and $p_Y(y) = \rho\,\delta_{1} + (1-\rho)\,\mathcal{N}(1000,1)$, where $\rho \in [0,1]$ is the probability for $x = -1$ and $y = 1$. We construct this specific example to check whether there is a danger that the forward CT distribution $\pi(y\,|\,x)$ will degenerate to a point mass distribution that concentrates its probability mass at $y = 1$. Setting $d(T_\phi(x),T_\phi(y)) = \frac{(x-y)^2}{2e^\phi}$, via the definition of the conditional transport distributions and the property of the normal distribution, we can show that
$$\pi(y\,|\,x,\phi) = \frac{\rho\,r(y,x)}{\rho\,r(y,x)+1-\rho}\,\delta_1 + \frac{1-\rho}{\rho\,r(y,x)+1-\rho}\,\mathcal{N}(\mu,\sigma^2),\quad r(y,x) \overset{def}{=} \frac{\mathcal{N}(y;\mu,\sigma^2)}{\mathcal{N}(y;1000,1)},\quad \mu \overset{def}{=} 1000\,\frac{e^\phi}{1+e^\phi} + x\,\frac{1}{1+e^\phi},\quad \sigma^2 \overset{def}{=} \frac{e^\phi}{1+e^\phi}.$$
Similarly, for the backward CT, we can show that
$$\pi(x\,|\,y,\phi) = \frac{\rho\,r'(y,x)}{\rho\,r'(y,x)+1-\rho}\,\delta_{-1} + \frac{1-\rho}{\rho\,r'(y,x)+1-\rho}\,\mathcal{N}(\mu',\sigma'^2),\quad r'(y,x) \overset{def}{=} \frac{\mathcal{N}(x;\mu',\sigma'^2)}{\mathcal{N}(x;-1000,1)},\quad \mu' \overset{def}{=} -1000\,\frac{e^\phi}{1+e^\phi} + y\,\frac{1}{1+e^\phi},\quad \sigma'^2 \overset{def}{=} \frac{e^\phi}{1+e^\phi}.$$
As shown in the top panel of Fig. 9, it is clear that when $\phi \to \infty$, we have $\pi(y=1\,|\,x) = \rho$, and when $\phi \to -\infty$, we have $\pi(y=1\,|\,x) = 0$. Assuming $\rho = 0.01$, when $\phi = 15$, we have $\pi_\phi(y=1\,|\,x=-1000) = 0.0157$ and $\pi_\phi(y=1\,|\,x=-1) = 0.0116$; when $\phi = -1.095945$, we have $\pi_\phi(y=1\,|\,x=-1000) = 0.0626$ and $\pi_\phi(y=1\,|\,x=-1) = 1$. These analyses suggest that as long as the navigator parameter $\phi$ is chosen appropriately, the conditional transport distributions $\pi(x\,|\,y,\phi)$ and $\pi(y\,|\,x,\phi)$ will not degenerate to a point mass distribution. Next we analyze the degeneration risk if using the ACT between $p_X(x) = \rho\,\delta_{-1} + (1-\rho)\,\mathcal{N}(-1000,1)$ and $p_Y(y) = \rho\,\delta_{1} + (1-\rho)\,\mathcal{N}(1000,1)$, in which case we assume we don't know $p_X(x)$ and $p_Y(y)$ but can access random samples from them. In ACT, we would approximate $\pi(y\,|\,x)$ with
$$\hat\pi_M(y\,|\,x) \overset{def}{=} \sum_{j=1}^M \frac{e^{-\frac{(x-y_j)^2}{2e^\phi}}}{\sum_{j'=1}^M e^{-\frac{(x-y_{j'})^2}{2e^\phi}}}\,\delta_{y_j},\qquad y_j \overset{iid}{\sim} p_Y(y).$$
Even if $\phi$ is set at a value such that the forward CT distribution $\pi_\phi(y=1\,|\,x) \to 1$, under ACT, as long as $1 \notin \{y_1, \ldots, y_M\}$, $x$ will surely not be transported to $y = 1$ and will instead be transported to a $y_j$ drawn from $\mathcal{N}(1000,1)$; note that $1 \notin \{y_1, \ldots, y_M\}$ happens with probability $\prod_{j=1}^M P(y_j \neq 1) = (1-\rho)^M$, which is as large as 36.6% when $\rho = 0.01$ and $M = 100$. Therefore, mini-batch based optimization under the ACT divergence would further reduce the risk for the conditional transport plans to degenerate.

Figure 9: Top: The risk for the forward and backward conditional transport plans to degenerate w.r.t. the value of φ, based on the CT analysis. Bottom: The forward conditional transport plan between two discrete sets of S = 500 points; among the S data points, one point is an outlier. The ACT navigators are optimized with SGD over mini-batches, whose elements are randomly sampled (with replacement) from their corresponding discrete sets, and the mini-batch size varies from 50 to 500.
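The probabilities quoted above are straightforward to verify numerically from the closed-form plan, computing the ratio r = N(1; µ, σ²)/N(1; 1000, 1) in log space to avoid underflow. A verification script of ours under the stated φ = 15 and ρ = 0.01:

```python
import math

def atom_weight(x, phi=15.0, rho=0.01):
    """pi_phi(y = 1 | x) for p_Y = rho*delta_1 + (1-rho)*N(1000, 1)."""
    e = math.exp(phi)
    mu = 1000.0 * e / (1.0 + e) + x / (1.0 + e)   # posterior mean
    var = e / (1.0 + e)                           # posterior variance
    def log_npdf(z, m, v):
        return -0.5 * math.log(2.0 * math.pi * v) - (z - m) ** 2 / (2.0 * v)
    log_r = log_npdf(1.0, mu, var) - log_npdf(1.0, 1000.0, 1.0)  # log-space ratio
    r = math.exp(log_r)
    return rho * r / (rho * r + 1.0 - rho)

w_far = atom_weight(-1000.0)   # ~0.0157, matching the text
w_atom = atom_weight(-1.0)     # ~0.0116, matching the text
p_miss = (1.0 - 0.01) ** 100   # P(1 not in {y_1, ..., y_100}) ~0.366
```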

B.2.2 ACT BETWEEN TWO DISCRETE DISTRIBUTIONS WITH OUTLIER SUPPORTS

Below we consider the second scenario, where we assume two discrete distributions supported on $S = 500$ data points each: $p_X(x) = \frac{1}{S}\delta_{-1} + \frac{1}{S}\sum_{i=1}^{S-1}\delta_{x_i}$, where $x_i \overset{iid}{\sim} \mathcal{N}(-1000,1)$, and $p_Y(y) = \frac{1}{S}\delta_{1} + \frac{1}{S}\sum_{i=1}^{S-1}\delta_{y_i}$, where $y_i \overset{iid}{\sim} \mathcal{N}(1000,1)$. In this scenario, we optimize the navigator parameters using SGD over mini-batches, and apply the navigators optimized with a mini-batch size of N to calculate the conditional transport plans between the two discrete sets of S = 500 data points. More specifically, we use (14) as the loss to train the navigator parameter φ (note that here both distributions are fixed and there is no generator to train), with the mini-batch size set as N = M = 50, 200, 400, or 500. Each mini-batch consists of two sets of M data points iid drawn from their corresponding discrete distributions (i.e., sampled with replacement from their corresponding discrete sets). With the navigator parameter φ optimized under the ACT divergence, the bottom panel of Fig. 9 reports the forward conditional transport plan from the Gaussian to $p_Y$, i.e., $(\pi(y_1\,|\,x,\phi), \ldots, \pi(y_S\,|\,x,\phi))$, where $x \sim \mathcal{N}(-1000,1)$ and φ is learned with four different mini-batch sizes. These results suggest that even for two discrete distributions whose supporting points contain outliers, there is a low risk for the conditional transport plans optimized under ACT to degenerate.

B.3 MORE RESULTS ON 2D TOY DATASETS

We visualize the results on the 8-Gaussian mixture and three additional 2D toy datasets. Compared to the 8-Gaussian mixture dataset, the mode collapse issue of both GAN and WGAN-GP becomes more severe on the Swiss-Roll, Half-Moon, and 25-Gaussian datasets, while ACT consistently shows good and stable performance on all of them. We also illustrate the data points and generated samples in Fig. 14. The first column shows the generated samples (marked in blue) and the samples from the data distribution (marked in red). To visualize how the feature extractor T_φ used by the navigators works, we set its output dimension to 1, plot the logits in the third and fifth columns, and map the corresponding data (in the second column) and generated samples (in the fourth column) with the same color. Similarly, we visualize the GAN's generated samples and the logits produced by its discriminator in Fig. 15. We observe that the discriminator maps the data to very close values. Specifically, in both the 8-Gaussian and 25-Gaussian mixture cases, when mode collapse occurs, the logits of the missed modes have values similar to those of the other modes. This saturation is commonly observed in GANs and contributes to their mode collapse problem. Different from the GAN case, the navigator in our ACT model maps the data to non-saturating logits: in the various multi-mode cases, different modes are assigned different values by the navigator. This property helps ACT resist mode collapse and stabilizes the training.

B.4 RESULTS FOR ABLATION STUDY

Transport cost in pixel space vs. feature space: We visualize the difference between using the transport cost in the pixel space and in the feature space. In both Figs. 16 and 17, we test on MNIST and CIFAR-10, with the L²₂ distance and the cosine dissimilarity as the transport cost, respectively. For the MNIST dataset, due to its simple data structure, ACT can still be trained to generate meaningful digits, though some digits appear blurry. On CIFAR-10, we observe that the model fails to generate any class of CIFAR images. As the dimensionality of the input space increases, using a distance in the pixel space as the transport cost may lose information essential for the transport and increase the training complexity of the navigator. We also consider using the perceptual similarity (Zhang et al., 2018) to define the cost function. Here we test four configurations: 1) apply a fixed, pre-trained perceptual loss to calculate the distance between the data and generated samples, use that distance as the point-to-point cost, and calculate ACT to train the generator; 2) apply a fixed, pre-trained perceptual loss to calculate the energy distance between the data and generated samples to train the generator; 3) apply the pre-trained perceptual loss as the cost and fine-tune it with ACT while training the generator; 4) apply the pre-trained perceptual loss as the cost, fine-tune it with ACT while training the generator, and then fix it for 20 more training epochs. We report the visual results of these four configurations in Fig. 18. As shown, fixing the metric and calculating the distance (either ACT or the energy distance) in the feature space does not yield good generation results. An explanation could be that the pre-trained perceptual loss is not trained on the generation task and hence might not be compatible with the learning objective, so the cost cannot feed back a useful signal to guide the generator. Using the energy distance as in Bellemare et al.
(2017) shows similar results. We further use the perceptual network as the initialization of our critic T_η and train with ACT by maximizing the cost for 40 epochs on MNIST, which produces good-quality generations, as shown in the third column. We then fix this critic and train the generator for 20 more epochs, which leads to degraded generation quality; we expect the generation quality to get worse as training under the fixed critic continues. Using alternative cost functions: We also test ACT with different configurations. As discussed in previous sections, the defined cost may also affect the training of the model. Here we vary the choice of the transport cost c_η(x, y) and the navigator cost d(T_φ(x), T_φ(y)). For both costs, we test the L²₂ distance and the cosine dissimilarity. Moreover, we also compare the effects of distances in the original pixel space and in the feature space (equipped with the critic T_η). The results in Table 2 highlight the importance of defining the cost in the feature space when dealing with high-dimensional image data. Moreover, compared to the L²₂ distance, the cosine dissimilarity is observed to improve the model when applied as the transport and navigator cost, especially with the ResNet architecture. Training the critic with the discriminator loss of a vanilla GAN: Contrary to the existing critic-based GANs, the sample estimates of the ACT divergence and its gradient are unbiased regardless of how well the critic is trained. We thus keep the same experimental settings and train the ACT-DCGAN model's critic with the discriminator loss of standard GANs, i.e., $\mathbb{E}_{x\sim P_d}[-\log T_\eta(x)] + \mathbb{E}_{x\sim P_g}[-\log(1-T_\eta(x))]$. The results in Fig. 19 show that ACT works well in conjunction with this alternative critic training. The quantitative and qualitative results on MNIST, CIFAR-10, CelebA, and LSUN are shown in Fig. 19.
We observe that the quality of the generated samples, while clearly not as good as when training the critic with the ACT divergence, is still comparable to some of the benchmarks in Table 1.

Figure 19: Visual results of using a standard cross-entropy discriminator loss in lieu of the ACT divergence to train the critic of ACT.

B.5 MORE RESULTS ON IMAGE DATASETS

For the experiments on the image datasets, we provide more visual results in this part. Apart from the datasets described in the experiment section, we also test the capacity for single-channel image generation with the MNIST dataset. Considering that the inception score and the FID are designed for RGB natural images, we also calculate the inception score of the real testing sets for reference. The presented methods are all able to generate meaningful digits on MNIST. Taking a closer look at the digits, those generated with the L² cost are less natural than those with the cosine cost. Moreover, we show both unconditional and conditional generation results on CIFAR-10. In both settings, our proposed method achieves good quantitative and qualitative results.

B.6 EXPERIMENT DETAILS

Preparation of datasets: We use the commonly used training sets of MNIST (50K images, 28×28 pixels) (Lecun et al., 1998), CIFAR-10 (50K images, 32×32 pixels) (Krizhevsky et al., 2009), CelebA (about 203K images, resized to 64×64 pixels) (Liu et al., 2015), and LSUN bedrooms (around 3 million images, resized to 64×64 pixels) (Yu et al., 2015). The images are scaled to the range [−1, 1]. For MNIST, when calculating the inception score, we repeat the channel to convert each gray-scale image into RGB format. Network architecture and hyperparameters: For the network architectures presented here, the slopes of all lReLU functions are set to 0.1 by default. For the toy experiments, typically 10,000 update steps are sufficient. However, our experiments show that a DGM optimized with the ACT divergence can be stably trained for at least 500,000 steps (or possibly more if allowed to run non-stop), regardless of whether the navigators are frozen after a certain number of iterations, whereas the GAN's discriminator usually diverges long before reaching that many iterations even if we do not freeze it. For all image experiments, the output feature dimension of the navigators and that of the critic (i.e., T_φ(·), T_η(·) ∈ R^m) are set to m = 2048. All models can be trained on a single GPU, such as an NVidia GTX 1080-TI in our experiments, with 150,000 generator updates (50,000 iterations for CIFAR-10). To keep close to the configuration of the DCGAN and SNGAN experimental settings, we use the Adam optimizer (Kingma and Ba, 2015) with learning rate α = 2 × 10⁻⁴ and β₁ = 0.5, β₂ = 0.99 for the parameters of the generator, navigators, and critic. On the DCGAN backbone, all the modules are updated with the same frequency, while on the SNGAN backbone, the critic is updated once per 5 generator update steps. The performance might be further improved with more careful fine-tuning.
For example, the learning rate of the navigator parameters could be made smaller than that of the generator parameters. The true-data mini-batch size is fixed to N = 64 for all experiments, and we set the generated sample size M equal to the mini-batch size N for the ACT computation. With this batch size, we have monitored the average time per update step on a single NVidia GTX 1080-TI GPU: on CIFAR-10, each update step takes around 0.1s and 0.2s for DCGAN and SNGAN, respectively, and around 0.4s and 0.7s for the DCGAN and SNGAN backbones trained with the ACT divergence. On CelebA and LSUN, each update takes 0.6s and 0.7s for DCGAN and SNGAN, respectively; when trained with ACT, the elapsed time per update increases to 3.3s and 3.6s, respectively.

Table 3: Network architecture for toy datasets (V indicates the dimensionality of the data).



Figure 1: Illustration of minimizing the CT divergence C φ,θ (µ, ν) between N (0, 1) and N (0, e θ ).

Figure 2: Illustration of how minimizing the ACT divergence between the empirical distribution of a generator and that of a bimodal Gaussian mixture, whose 5000 random samples are given, helps optimize the generator distribution towards the true one. Top: Plots of the ACT divergence C(μN , νM ), forward ACT cost C(μN → νM ), backward ACT cost C(μN ← νM ), and Wasserstein distance W2(μN , νM ) 2 , where N = M = 5000. Bottom: The PDF of the true data distribution µ(dx) = pX (x)dx (red) and the generator distribution ν(dy) = p θ (y)dy (blue, visualized via kernel density estimation) at different training iterations.

Figure 3: Top: Plot of the sample Wasserstein distance W2(μ5000, ν5000) 2 against the number of training epochs, where the generator is trained with either W2(μN , νN ) 2 or the ACT divergence between μN and νN , with the mini-batch size set as N = 20 (left), N = 200 (middle), or N = 5000 (right); one epoch consists of 5000/N SGD iterations. Bottom: The fitting results of different configurations, where the KDE curves of the data distribution and the generative one are marked in red and blue, respectively.

Figure 4: Comparison of the generation quality on the 8-Gaussian mixture data: one of the 8 modes has weight ρ and the remaining modes have equal weight (1−ρ)/7.

Figure 5: Fitting 1D bi-modal Gaussian (top) and 2D 8-Gaussian mixture (bottom) by interpolating between the forward ACT (γ = 1) and backward ACT (γ = 0).

Figure 6: Generated samples of the deep generative model that adopts the backbone of SNGAN but is optimized with the ACT divergence on CIFAR-10, CelebA, and LSUN-Bedroom. See Appendix B for more results.

Figure 7: Analogous plot to Fig. 6 for Left: LSUN-Bedroom (128x128) and Right: CelebA-HQ (256x256).

Figure 10: On an 8-Gaussian mixture dataset, comparison of generation quality and training stability between two mini-max deep generative models (DGMs), the vanilla GAN and the Wasserstein GAN with gradient penalty (WGAN-GP), and two mini-max-free DGMs, whose generators are trained under the sliced Wasserstein distance (SWD) and the proposed ACT divergence, respectively. The critics of GAN and WGAN-GP and the navigators of ACT are fixed after 15k iterations. The first column shows the true data density.

Figure 11: Analogous plot to Fig. 10 for the Swiss-Roll dataset.

Figure 12: Analogous plot to Fig. 10 for the Half-Moon dataset.

Figure 13: Analogous plot to Fig. 10 for the 25-Gaussian mixture dataset.

Figure 14: Visual results of ACT: generated samples (blue dots) compared to real samples (red dots) on Swiss Roll, Half Moons, 8-Gaussian mixture, and 25-Gaussian mixture. The second and third columns map the data points and their corresponding navigator logits by color; the fourth and fifth columns map the generated points and their corresponding navigator logits by color.

Figure 16: Visual results of generated samples on MNIST and CIFAR-10 using pixel-wise transport cost, with DCGAN (standard CNN) backbone.

Figure 17: Visual results of generated samples on MNIST and CIFAR-10 using pixel-wise transport cost, with SNGAN (ResNet) backbone. The Inception and FID scores are not shown due to poor visual quality.

Figure 18: Visual results of generated samples with the perceptual similarity (Zhang et al., 2018) with four different training configurations.

Figure 20: Unconditional generated samples and inception scores of MNIST, with DCGAN (standard CNN) backbone.

Figure 21: Unconditional generated samples and FIDs of CIFAR-10, with DCGAN (standard CNN) backbone.



Figure 22: Unconditional generated samples and FIDs of CIFAR-10, with SNGAN (ResNet) backbone.

Figure 23: Conditional generated samples and FIDs of CIFAR-10, with SNGAN (ResNet) backbone.

Figure 24: Generated samples and FIDs of CelebA, with DCGAN (standard CNN) backbone.

Figure 25: Generated samples and FIDs of CelebA, with SNGAN (ResNet) backbone.

Figure 26: Generated samples and FIDs of LSUN, with DCGAN (standard CNN) backbone.

Figure 27: Generated samples and FIDs of LSUN, with SNGAN (ResNet) backbone.

(a) Generator G θ: z ∈ R^50 ∼ N (0, 1); 50 → 100, dense, lReLU; 100 → 50, dense, lReLU; 50 → V, dense, linear.
(b) Navigator T φ / Discriminator D φ: x ∈ R^V; V → 100, dense, lReLU; 100 → 50, dense, lReLU; 50 → 1, dense, linear.

Table 1: Comparison of generative models with the Fréchet inception distance (FID) of Heusel et al. (2017) on all datasets and the Inception Score (IS) of Salimans et al. (2016) on CIFAR-10. Both FID and IS are calculated using a pre-trained inception model (Szegedy et al., 2016). Lower FID and higher IS indicate better image quality. We observe that ACT-DCGAN and ACT-SNGAN, which are the DCGAN and SNGAN backbones optimized with the ACT divergence, convincingly outperform DCGAN and SNGAN, respectively, suggesting that ACT is compatible with standard GANs and WGANs and generally helps improve generation quality. Moreover, ranked among the top 3 on all datasets, ACT-SNGAN performs on par with the best benchmarks on CIFAR-10 and CelebA, while slightly worse on LSUN. The qualitative results shown in Fig. 6 are consistent with the quantitative results in Table 1. To additionally show how ACT works for more complex generation tasks, we show in Fig. 7 example higher-resolution images generated by ACT-SNGAN on LSUN-bedroom and CelebA-HQ.

Table 2: FID comparison for ACT-DCGAN and ACT-SNGAN on CIFAR-10 with different costs and architectures.

