AUGMENTED SLICED WASSERSTEIN DISTANCES

Abstract

While theoretically appealing, the application of the Wasserstein distance to large-scale machine learning problems has been hampered by its prohibitive computational cost. The sliced Wasserstein distance and its variants improve the computational efficiency through random projection, yet they suffer from low projection efficiency because the majority of projections result in trivially small values. In this work, we propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs), constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. It is derived from a key observation that (random) linear projections of samples residing on these hypersurfaces would translate to much more flexible nonlinear projections in the original sample space, so they can capture complex structures of the data distribution. We show that the hypersurfaces can be optimized by gradient ascent efficiently. We provide the condition under which the ASWD is a valid metric and show that this can be obtained by an injective neural network architecture. Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.

1. INTRODUCTION

Comparing samples from two probability distributions is a fundamental problem in statistics and machine learning. Optimal transport (OT) theory (Villani, 2008) provides a powerful and flexible theoretical tool for comparing degenerate distributions by accounting for the metric of the underlying spaces. The Wasserstein distance, which arises from optimal transport theory, has become an increasingly popular choice in various machine learning domains ranging from generative models to transfer learning (Gulrajani et al., 2017; Arjovsky et al., 2017; Kolouri et al., 2019b; Lee et al., 2019; Cuturi & Doucet, 2014; Claici et al., 2018; Courty et al., 2016; Shen et al., 2018; Patrini et al., 2018). Despite its favorable properties, such as robustness to disjoint supports and numerical stability (Arjovsky et al., 2017), the Wasserstein distance suffers from high computational complexity, especially when the sample size is large. Besides, the Wasserstein distance is itself the result of an optimization problem, so it is non-trivial to integrate into an end-to-end training pipeline of deep neural networks unless the solver for the optimization problem can be made differentiable. Recent advances in computational optimal transport focus on alternative OT-based metrics that are computationally efficient and differentiably solvable (Peyré & Cuturi, 2019). Entropy regularization is introduced in the Sinkhorn distance (Cuturi, 2013) and its variants (Altschuler et al., 2017; Dessein et al., 2018; Lin et al., 2019) to smooth the optimal transport problem; as a result, iterative matrix scaling algorithms can be applied to provide significantly faster solutions with improved sample complexity (Genevay et al., 2019). An alternative approach is to approximate the Wasserstein distance through slicing, i.e. linearly projecting, the distributions to be compared.
The sliced Wasserstein distance (SWD) (Bonneel et al., 2015) is defined as the expected value of the Wasserstein distances between one-dimensional random projections of high-dimensional distributions. The SWD shares similar theoretical properties with the Wasserstein distance (Bonnotte, 2013) and is computationally efficient, since the Wasserstein distance in a one-dimensional space has a closed-form solution based on sorting. Deshpande et al. (2019) extend the sliced Wasserstein distance to the max-sliced Wasserstein distance (max-SWD) by finding a single projection direction with the maximal distance between projected samples. In Nguyen et al. (2020), the distributional sliced Wasserstein distance (DSWD) finds a distribution of projections that maximizes the expected distance over these projections. The subspace robust Wasserstein distance extends the idea of slicing to projecting distributions onto linear subspaces (Paty & Cuturi, 2019). However, the linear nature of these projections usually leads to low projection efficiency of the resulting metrics in high-dimensional spaces (Deshpande et al., 2019; Liutkus et al., 2019; Kolouri et al., 2019a). More recently, there is growing interest in, and evidence that, slice-based Wasserstein distances with nonlinear projections improve projection efficiency, reducing the number of projections needed to capture the structure of the data distribution (Kolouri et al., 2019a; Nguyen et al., 2020). Kolouri et al. (2019a) extend the connection between the sliced Wasserstein distance and the Radon transform (Abraham et al., 2017) to define generalized sliced Wasserstein distances (GSWDs) by utilizing generalized Radon transforms (GRTs). It is shown in Kolouri et al. (2019a) that the GSWD is indeed a metric if and only if the adopted GRT is injective. Injective GRTs are also used to extend the DSWD to the distributional generalized sliced Wasserstein distance (DGSWD) (Nguyen et al., 2020).
However, both the GSWD and the DGSWD are restricted to the limited class of injective GRTs, which use circular functions and a finite number of harmonic polynomial functions with odd degrees as their defining functions (Kuchment, 2006; Ehrenpreis, 2003). The results reported in Kolouri et al. (2019a) and Nguyen et al. (2020) show impressive performance of the GSWD and the DGSWD, yet they require one to specify a particular form of defining function from the aforementioned limited class of candidates. The selection of the defining function is usually task-dependent and requires domain knowledge, and the impact of different defining functions on performance is still unclear. One variant of the GSWD (Kolouri et al., 2019a) is the GSWD-NN, which generates projections directly from neural network outputs to remove the limitation of slicing distributions with predefined GRTs. In the GSWD-NN, the number of projections, which equals the number of nodes in the neural network's output layer, is fixed; hence different neural networks are needed if one wants to change the number of projections. There are also no random projections involved in the resulting GSWD-NN, since the projection results are determined by the neural network's weights. Besides, the GSWD-NN is a pseudo-metric, since it uses a vanilla neural network, rather than the Radon transform or GRTs, as its push-forward operator. Therefore, the GSWD-NN does not fit into the theoretical framework of the GSWD and does not inherit its geometric properties. In this paper, we present the augmented sliced Wasserstein distance (ASWD), a distance metric constructed by first mapping samples to hypersurfaces in an augmented space, which enables flexible nonlinear slicing of data distributions for improved projection efficiency (see Figure 1).
Our main contributions include: (i) we exploit the capacity of the nonlinear projections employed in the ASWD by constructing injective mappings with arbitrary neural networks; (ii) we prove that the ASWD is a valid distance metric; (iii) we provide a mechanism through which the hypersurfaces that high-dimensional distributions are projected onto can be optimized, and show that the optimization of hypersurfaces helps slice-based Wasserstein distances improve their projection efficiency. The ASWD is hence data-adaptive, i.e. the hypersurfaces can be learned from data, which implies that one does not need to specify a defining function from a limited class of candidates; (iv) we demonstrate the superior performance of the ASWD in numerical experiments on both synthetic and real-world datasets. The remainder of the paper is organized as follows. Section 2 reviews the necessary background. We present the proposed method and its numerical implementation in Section 3. Numerical experimental results are presented and discussed in Section 4. We conclude the paper in Section 5.

2. BACKGROUND

In this section, we provide a brief review of concepts related to the proposed work, including the Wasserstein distance, the (generalized) Radon transform, and (generalized) sliced Wasserstein distances.

Wasserstein distance: Let P_k(Ω) be a set of Borel probability measures with finite k-th moment on a Polish metric space (Ω, d) (Villani, 2008). Given two probability measures µ, ν ∈ P_k(Ω), whose probability density functions (PDFs) are p_µ and p_ν, the Wasserstein distance of order k ∈ [1, +∞) between µ and ν is defined as:

$$W_k(\mu,\nu)=\Big(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{\Omega\times\Omega} d(x,y)^k\,\mathrm{d}\gamma(x,y)\Big)^{1/k}, \qquad (1)$$

where d(·,·)^k is the cost function and Γ(µ,ν) represents the set of all transportation plans γ, i.e. joint distributions whose marginals are p_µ and p_ν, respectively. With a slight abuse of notation, we use W_k(µ,ν) and W_k(p_µ,p_ν) interchangeably. While the Wasserstein distance is generally intractable for high-dimensional distributions, there are several favorable cases where the optimal transport problem can be solved efficiently. In particular, if µ and ν are continuous one-dimensional measures, the Wasserstein distance between µ and ν has a closed-form solution (Bonneel et al., 2015):

$$W_k(\mu,\nu)=\Big(\int_0^1 d\big(F_\mu^{-1}(z),F_\nu^{-1}(z)\big)^k\,\mathrm{d}z\Big)^{1/k}, \qquad (2)$$

where F_µ^{-1} and F_ν^{-1} are the inverse cumulative distribution functions (CDFs) of µ and ν, respectively.

Radon transform and generalized Radon transform: The Radon transform (Radon, 1917) maps a function f(·) ∈ L¹(ℝ^d) to the space of functions defined over the space of lines in ℝ^d. The Radon transform of f(·) is defined by integrals of f(·) along all possible hyperplanes in ℝ^d:

$$\mathcal{R}f(t,\theta)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\langle x,\theta\rangle)\,\mathrm{d}x, \qquad (3)$$

where t ∈ ℝ and θ ∈ S^{d−1} represent the parameters of hyperplanes in ℝ^d, δ(·) is the Dirac delta function, and ⟨·,·⟩ refers to the Euclidean inner product. By replacing the inner product ⟨x,θ⟩ in Equation (3) with β(x,θ), a specific family of functions named defining functions in Kolouri et al.
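The one-dimensional closed-form solution above is what makes slicing practical: for empirical measures with equal numbers of equally weighted samples, the integral over inverse CDFs reduces to a sum over sorted samples. A minimal NumPy sketch (the function name and the uniform 1/N weighting are our own illustrative choices):

```python
import numpy as np

def w1d(u, v, k=2):
    """Closed-form order-k Wasserstein distance between two one-dimensional
    empirical measures with equal sample counts and uniform weights:
    sorting the samples evaluates the inverse CDFs on a shared grid."""
    u, v = np.sort(u), np.sort(v)
    return np.mean(np.abs(u - v) ** k) ** (1.0 / k)
```

For a pure translation, sorting aligns corresponding quantiles exactly, so `w1d(u, u + c)` returns `c` up to floating-point error.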
(2019a), the generalized Radon transform (GRT) (Beylkin, 1984) is defined as:

$$\mathcal{G}f(t,\theta)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\beta(x,\theta))\,\mathrm{d}x, \qquad (4)$$

where t ∈ ℝ and θ ∈ Ω_θ, with Ω_θ a compact set of all feasible θ, e.g. Ω_θ = S^{d−1} for β(x,θ) = ⟨x,θ⟩ (Kolouri et al., 2019a). In practice, we can empirically approximate the Radon transform and the GRT of a probability density function p_µ via:

$$\mathcal{R}p_\mu(t,\theta)\approx\frac{1}{N}\sum_{n=1}^{N}\delta(t-\langle x_n,\theta\rangle),\qquad \mathcal{G}p_\mu(t,\theta)\approx\frac{1}{N}\sum_{n=1}^{N}\delta(t-\beta(x_n,\theta)), \qquad (5)$$

where x_n ∼ p_µ and N is the number of samples. Notably, the Radon transform is a linear bijection (Helgason, 1980), and the GRT is a bijection if the defining function β satisfies certain conditions (Beylkin, 1984).

Sliced Wasserstein distance and generalized sliced Wasserstein distance: By applying the Radon transform to p_µ and p_ν to obtain multiple projections, the sliced Wasserstein distance (SWD) decomposes the high-dimensional Wasserstein distance into multiple one-dimensional Wasserstein distances, which can be evaluated efficiently (Bonneel et al., 2015). The k-SWD between µ and ν is defined by:

$$\mathrm{SWD}_k(\mu,\nu)=\Big(\int_{S^{d-1}} W_k^k\big(\mathcal{R}p_\mu(\cdot,\theta),\mathcal{R}p_\nu(\cdot,\theta)\big)\,\mathrm{d}\theta\Big)^{1/k}, \qquad (6)$$

where the Radon transform R defined by Equation (3) is adopted as the measure push-forward operator. The GSWD generalizes the idea of the SWD by projecting distributions onto hypersurfaces rather than hyperplanes (Kolouri et al., 2019a). The GSWD is defined as:

$$\mathrm{GSWD}_k(\mu,\nu)=\Big(\int_{\Omega_\theta} W_k^k\big(\mathcal{G}p_\mu(\cdot,\theta),\mathcal{G}p_\nu(\cdot,\theta)\big)\,\mathrm{d}\theta\Big)^{1/k}, \qquad (7)$$

where the GRT G is used as the measure push-forward operator.
The Wasserstein distance between one-dimensional distributions can be obtained by sorting the projected samples and calculating the distance between the sorted samples (Kolouri et al., 2019b): with L random projections, the SWD and GSWD between µ and ν can be approximated by:

$$\mathrm{SWD}_k(\mu,\nu)\approx\Big(\frac{1}{L}\sum_{l=1}^{L}\sum_{n=1}^{N}\big|\langle x_{I_x^l[n]},\theta_l\rangle-\langle y_{I_y^l[n]},\theta_l\rangle\big|^k\Big)^{1/k}, \qquad (8)$$

$$\mathrm{GSWD}_k(\mu,\nu)\approx\Big(\frac{1}{L}\sum_{l=1}^{L}\sum_{n=1}^{N}\big|\beta(x_{I_x^l[n]},\theta_l)-\beta(y_{I_y^l[n]},\theta_l)\big|^k\Big)^{1/k}, \qquad (9)$$

where I_x^l and I_y^l are sequences consisting of the indices of the sorted samples, which satisfy ⟨x_{I_x^l[n]},θ_l⟩ ≤ ⟨x_{I_x^l[n+1]},θ_l⟩ and ⟨y_{I_y^l[n]},θ_l⟩ ≤ ⟨y_{I_y^l[n+1]},θ_l⟩ in the SWD, and β(x_{I_x^l[n]},θ_l) ≤ β(x_{I_x^l[n+1]},θ_l) and β(y_{I_y^l[n]},θ_l) ≤ β(y_{I_y^l[n+1]},θ_l) in the GSWD. It is proved in Bonnotte (2013) that the SWD is a valid distance metric. The GSWD is a valid metric except for its neural network variant (Kolouri et al., 2019a).
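The sorting-based approximation above can be sketched in a few lines of NumPy. This is a minimal Monte Carlo estimator, not the authors' implementation; we weight each sample by 1/N, i.e. the inner sum is replaced by an average, so the estimate does not grow with the number of samples:

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=100, k=2, rng=None):
    """Monte Carlo estimate of the order-k sliced Wasserstein distance
    between two empirical distributions with equal sample sizes.

    x, y: (N, d) arrays of samples. Each direction theta is drawn uniformly
    on the unit sphere S^{d-1}; sorting the projected samples solves each
    one-dimensional optimal transport problem in closed form.
    """
    rng = np.random.default_rng(rng)
    theta = rng.standard_normal((n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform on sphere
    x_proj = np.sort(theta @ x.T, axis=1)  # (n_proj, N), sorted per direction
    y_proj = np.sort(theta @ y.T, axis=1)
    return np.mean(np.abs(x_proj - y_proj) ** k) ** (1.0 / k)
```

Because sorting matches empirical quantiles, the estimate is zero for identical sample sets and, with the same random directions, symmetric in its arguments.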

3. AUGMENTED SLICED WASSERSTEIN DISTANCES

In this section, we propose a new distance metric called the augmented sliced Wasserstein distance (ASWD), which embeds flexible nonlinear projections in its construction. We also provide an implementation recipe for the ASWD.

3.1. SPATIAL RADON TRANSFORM AND AUGMENTED SLICED WASSERSTEIN DISTANCE

In the definitions of the SWD and the GSWD, the Radon transform (Radon, 1917) and the generalized Radon transform (GRT) (Beylkin, 1984) are used as the push-forward operators for projecting distributions onto a one-dimensional space. However, it is not straightforward to design defining functions β(x,θ) (Kolouri et al., 2019a) for the GRT due to certain non-trivial requirements on the function (Beylkin, 1984). In practice, this assumption on the transform can be relaxed: Theorem 1 shows that as long as the transform is injective, the corresponding ASWD is a valid distance metric. To help us define the augmented sliced Wasserstein distance, we first introduce the spatial Radon transform, which includes the vanilla Radon transform and the polynomial GRT as special cases (see Remark 2).

Definition 1. Given an injective mapping g(·): ℝ^d → ℝ^{d_θ} and a probability measure µ ∈ P(ℝ^d) whose probability density function (PDF) is p_µ, the spatial Radon transform of p_µ is defined as

$$\mathcal{H}p_\mu(t,\theta;g)=\int_{\mathbb{R}^d} p_\mu(x)\,\delta(t-\langle g(x),\theta\rangle)\,\mathrm{d}x, \qquad (11)$$

where t ∈ ℝ and θ ∈ S^{d_θ−1} are the parameters of hypersurfaces in ℝ^d.

Remark 1. The spatial Radon transform can be interpreted as applying the vanilla Radon transform to the PDF of x̂ = g(x), where x ∼ p_µ. Denoting the PDF of x̂ by p_{µ̂_g}, the spatial Radon transform defined by Equation (11) can be rewritten as:

$$\mathcal{H}p_\mu(t,\theta;g)=\mathbb{E}_{x\sim p_\mu}[\delta(t-\langle g(x),\theta\rangle)]=\mathbb{E}_{\hat{x}\sim p_{\hat{\mu}_g}}[\delta(t-\langle\hat{x},\theta\rangle)]=\int p_{\hat{\mu}_g}(\hat{x})\,\delta(t-\langle\hat{x},\theta\rangle)\,\mathrm{d}\hat{x}=\mathcal{R}p_{\hat{\mu}_g}(t,\theta). \qquad (12)$$

Hence the spatial Radon transform inherits the theoretical properties of the Radon transform subject to certain conditions on g(·), and incorporates nonlinear projections through g(·). In what follows, we use f_1 ≡ f_2 to denote functions f_1(·): X → ℝ and f_2(·): X → ℝ that satisfy f_1(x) = f_2(x) for almost all x ∈ X. See Appendix A for the proof of Lemma 1.

Remark 2. The spatial Radon transform degenerates to the vanilla Radon transform when the mapping g(·) is an identity mapping.
When g(·) is a homogeneous polynomial function with an odd degree, the spatial Radon transform is equivalent to the polynomial GRT (Ehrenpreis, 2003). Appendix B provides the proof of Remark 2. We now introduce the augmented sliced Wasserstein distance, which uses the spatial Radon transform as the measure push-forward operator:

Definition 2. Given two probability measures µ, ν ∈ P_k(ℝ^d), whose probability density functions are p_µ and p_ν, respectively, and an injective mapping g(·): ℝ^d → ℝ^{d_θ}, the augmented sliced Wasserstein distance (ASWD) of order k ∈ [1, +∞) is defined as:

$$\mathrm{ASWD}_k(\mu,\nu;g)=\Big(\int_{S^{d_\theta-1}} W_k^k\big(\mathcal{H}p_\mu(\cdot,\theta;g),\mathcal{H}p_\nu(\cdot,\theta;g)\big)\,\mathrm{d}\theta\Big)^{1/k}, \qquad (13)$$

where θ ∈ S^{d_θ−1}, W_k is the k-Wasserstein distance defined by Equation (1), and H refers to the spatial Radon transform defined by Equation (11).

Remark 3. Following the connection between the spatial Radon transform and the vanilla Radon transform shown in Equation (12), the ASWD can be rewritten as:

$$\mathrm{ASWD}_k(\mu,\nu;g)=\Big(\int_{S^{d_\theta-1}} W_k^k\big(\mathcal{R}p_{\hat{\mu}_g}(\cdot,\theta),\mathcal{R}p_{\hat{\nu}_g}(\cdot,\theta)\big)\,\mathrm{d}\theta\Big)^{1/k}=\mathrm{SWD}_k(\hat{\mu}_g,\hat{\nu}_g), \qquad (14)$$

where µ̂_g and ν̂_g are probability measures on ℝ^{d_θ} which satisfy g(x) ∼ µ̂_g for x ∼ µ and g(y) ∼ ν̂_g for y ∼ ν.

Theorem 1. The augmented sliced Wasserstein distance (ASWD) of order k ∈ [1, +∞) defined by Equation (13) with a mapping g(·): ℝ^d → ℝ^{d_θ} is a metric on P_k(ℝ^d) if and only if g(·) is injective. The proof of Theorem 1 is provided in Appendix C.
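Remark 3 suggests a direct recipe for computing the ASWD empirically: push the samples through the injective map g, then evaluate an ordinary SWD in the augmented space. A NumPy sketch under the assumption that g is given (the weight matrix W and the toy map below are our own illustrative stand-ins, not a construction from the paper):

```python
import numpy as np

def swd(x, y, n_proj=100, k=2, rng=None):
    # Empirical sliced Wasserstein distance via random projections + sorting.
    rng = np.random.default_rng(rng)
    theta = rng.standard_normal((n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    xp, yp = np.sort(theta @ x.T, axis=1), np.sort(theta @ y.T, axis=1)
    return np.mean(np.abs(xp - yp) ** k) ** (1.0 / k)

def aswd(x, y, g, n_proj=100, k=2, rng=None):
    """ASWD_k(mu, nu; g): map the samples into the augmented space
    R^{d_theta} with the injective g, then take the ordinary SWD there."""
    return swd(g(x), g(y), n_proj=n_proj, k=k, rng=rng)

# A toy injective map: concatenate the input with a nonlinear feature,
# so the original sample survives in the first coordinates.
W = np.array([[1.0, -0.5], [0.3, 2.0]])
g = lambda x: np.concatenate([x, np.maximum(x @ W.T, 0.0)], axis=1)
```

In practice g would be the trained injective network; here a fixed map suffices to illustrate that the ASWD reduces to an SWD between mapped sample sets.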

3.2. NUMERICAL IMPLEMENTATION

In this section, we discuss how to realize the injective mapping g(·) with neural networks, chosen for their expressiveness, and how to optimize it with gradient-based methods.

Injective neural networks: As stated in Lemma 1 and Theorem 1, the injectivity of g(·) is a sufficient and necessary condition for the ASWD to be a valid metric. Thus we need specific architectural designs when implementing g(·) with neural networks. One option is the family of invertible neural networks (Behrmann et al., 2019; Karami et al., 2019; Song et al., 2019), which are both injective and surjective. However, the running cost of those models is usually much higher than that of vanilla neural networks. We propose an alternative approach that concatenates the input x of an arbitrary neural network to its output φ_ω(x):

$$g_\omega(x)=[x,\phi_\omega(x)]. \qquad (15)$$

It is trivial to show that g_ω(x) is injective, since different inputs lead to different outputs. Although embarrassingly simple, this idea of concatenating the input and output of neural networks has found success in preserving information with the dense blocks of the DenseNet (Huang et al., 2017), where the input of each layer is injective to the output of all preceding layers.

Optimization objective: We aim to project samples onto hypersurfaces that maximally discriminate between the two distributions, so that the projected samples of the two distributions are most dissimilar subject to certain constraints on the hypersurface, as shown in Figure 1. Similar ideas have been employed to identify important projection directions (Deshpande et al., 2019; Kolouri et al., 2019a; Paty & Cuturi, 2019) or a discriminative ground metric (Salimans et al., 2018) in optimal transport metrics.
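The concatenation construction above is injective regardless of what the network φ_ω computes, because the original input can always be read back from the first d coordinates of the output. A minimal NumPy sketch (the random, untrained one-hidden-layer φ below is a hypothetical stand-in for the paper's trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_phi(d, hidden=16):
    """A hypothetical one-hidden-layer MLP phi_omega with random weights,
    standing in for an arbitrary (not necessarily injective) network."""
    W1 = rng.standard_normal((hidden, d)); b1 = rng.standard_normal(hidden)
    W2 = rng.standard_normal((d, hidden)); b2 = rng.standard_normal(d)
    return lambda x: np.maximum(x @ W1.T + b1, 0.0) @ W2.T + b2

def g(x, phi):
    # Concatenate input and network output: injective by construction,
    # since x itself is preserved in the first d output coordinates.
    return np.concatenate([x, phi(x)], axis=1)
```

Even if φ_ω collapses distinct inputs to the same feature, g_ω cannot: distinct inputs differ in the copied coordinates.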
For the ASWD, the parameterized injective neural network g_ω(·) is optimized by maximizing the following objective:

$$\mathcal{L}(\mu,\nu;g_\omega,\lambda)=\Big(\int_{S^{d_\theta-1}} W_k^k\big(\mathcal{H}p_\mu(\cdot,\theta;g_\omega),\mathcal{H}p_\nu(\cdot,\theta;g_\omega)\big)\,\mathrm{d}\theta\Big)^{1/k}-\mathcal{L}_\lambda, \qquad (16)$$

where λ > 0 and the regularization term $\mathcal{L}_\lambda=\lambda\,\mathbb{E}_{x\sim\mu,\,y\sim\nu}\big[\|g_\omega(x)\|_2+\|g_\omega(y)\|_2\big]$ is used to control the norm of the output of g_ω(·); otherwise the projections may become arbitrarily large.

Remark 4. The regularization coefficient λ adjusts the amount of nonlinearity introduced in the evaluation of the ASWD by controlling the norm of φ_ω(·) in Equation (15). In particular, when λ → ∞, the nonlinear term φ_ω(·) shrinks to 0. The rank of the augmented space is hence explicitly controlled by the flexible choice of φ_ω(·) and implicitly regularized by L_λ.

By plugging the optimized g*_{ω,λ}(·) = argmax_{g_ω} L(µ,ν;g_ω,λ) into Equation (13), we obtain the empirical version of the ASWD. Notably, the regularization term L_λ is only used when maximizing the objective in Equation (16); once the optimization is completed, L_λ is not used in the calculation of the ASWD defined by Equation (13). Pseudocode is provided in Appendix D.
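The objective above is straightforward to evaluate empirically: estimate the sliced distance of the mapped samples, then subtract the norm penalty. A NumPy sketch of the objective value only (in practice g_ω is a neural network and this quantity is maximized by gradient ascent, which we omit here):

```python
import numpy as np

def objective(x, y, g, lam, n_proj=50, k=2, rng=None):
    """Empirical value of the training objective: the sliced distance of the
    mapped samples minus the regularization term L_lambda that penalizes the
    norms of the mapped samples."""
    rng = np.random.default_rng(rng)
    gx, gy = g(x), g(y)
    theta = rng.standard_normal((n_proj, gx.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    xp, yp = np.sort(theta @ gx.T, axis=1), np.sort(theta @ gy.T, axis=1)
    aswd_est = np.mean(np.abs(xp - yp) ** k) ** (1.0 / k)
    reg = lam * (np.mean(np.linalg.norm(gx, axis=1)) +
                 np.mean(np.linalg.norm(gy, axis=1)))
    return aswd_est - reg
```

With λ = 0 the objective is just the distance estimate; increasing λ lowers the objective for any fixed g, which during training pushes the optimizer toward maps with smaller output norms.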

4. EXPERIMENTS

In this section, we describe the experiments we conducted to evaluate the performance of the proposed distance metric. The GSWD achieves the best performance in the sliced Wasserstein flow problem reported in Kolouri et al. (2019a), and the DSWD outperforms the compared methods in the generative modeling task examined in Nguyen et al. (2020). We therefore compare the performance of the ASWD with these state-of-the-art distance metrics on the same examples and report the results below. To examine the robustness of the ASWD, throughout the experiments we adopt the injective network architecture given in Equation (15) and set φ_ω to be a one-layer fully-connected neural network whose output dimension equals its input dimension, with a ReLU activation function.

4.1. SLICED WASSERSTEIN FLOWS

We first consider the problem of evolving a source distribution µ to a target distribution ν by minimizing sliced Wasserstein distances between them, following the sliced Wasserstein flow task reported in Kolouri et al. (2019a):

$$\partial_t\mu_t=-\nabla\,\mathrm{SWD}(\mu_t,\nu), \qquad (17)$$

where µ_t refers to the updated source distribution at iteration t. The SWD in Equation (17) can be replaced by any of the other sliced Wasserstein distances to be evaluated. As in Kolouri et al. (2019a), the 2-Wasserstein distance is used to evaluate the performance of the different distance metrics in this task. The set of hyperparameter values used in this experiment can be found in Appendix E.1. Without loss of generality, we initialize µ_0 to be the standard normal distribution N(0, I). We repeat each experiment 50 times and record the 2-Wasserstein distance between µ and ν at every iteration. In Figure 2, we plot the 2-Wasserstein distances between the source and target distributions as a function of training epochs, with the 8-Gaussian, Knot, Moon, and Swiss roll distributions as the respective targets. For clarity, Figure 2 displays the results of the 6 best-performing distance metrics out of the 12 we compared: the ASWD, the DSWD, the SWD, the GSWD-NN 1, which generates projections directly through a one-layer MLP, and the GSWDs with the polynomial of degree 3 and circular defining functions. We observe from Figure 2 that the ASWD not only leads to lower 2-Wasserstein distances, but also converges faster, achieving better results with fewer iterations than the other methods on these four target distributions. A complete set of experimental results with all 12 compared distance metrics and 8 target distributions is included in Appendix F.1. The ASWD outperforms the compared state-of-the-art sliced Wasserstein distance metrics on 7 out of the 8 target distributions, the exception being the 25-Gaussian.
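The flow above can be discretized on particles: for k = 2, the gradient of the (squared) sliced distance with respect to each source sample follows from the sorted matching of the projections. A NumPy sketch of this Euler discretization (our own minimal version; step size, projection count, and iteration budget are arbitrary choices, not the paper's settings):

```python
import numpy as np

def swd2_grad(x, y, n_proj=50, rng=None):
    """Gradient of the squared sliced 2-Wasserstein estimate with respect to
    the source particles x, via the sorted matching of projected samples."""
    rng = np.random.default_rng(rng)
    n, d = x.shape
    grad = np.zeros_like(x)
    theta = rng.standard_normal((n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    for th in theta:
        ix = np.argsort(x @ th)              # order of source projections
        iy = np.argsort(y @ th)              # order of target projections
        diff = (x[ix] @ th) - (y[iy] @ th)   # matched quantile differences
        grad[ix] += 2.0 * diff[:, None] * th / (n_proj * n)
    return grad

def flow(x, y, steps=300, lr=2.0, rng=0):
    # Euler discretization of d mu_t / dt = -grad SWD(mu_t, nu) on particles,
    # with fresh random projections at every step.
    for t in range(steps):
        x = x - lr * swd2_grad(x, y, rng=rng + t)
    return x
```

Each step moves every source particle toward its matched target quantile along the sampled directions, so the particle cloud drifts onto the target distribution.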
This is achieved with the simple injective network architecture given in Equation (15) and a one-layer fully-connected neural network with equal input and output dimensions throughout the experiments. In addition, an ablation study is conducted to examine the effect of injective mappings, the regularization coefficient λ, and the optimization of hypersurfaces in the ASWD. Details can be found in Appendix F.2.

4.2. GENERATIVE MODELING

In this experiment, we use sliced Wasserstein distances for the generative modeling task described in Nguyen et al. (2020). The task is to generate images using generative adversarial networks (GANs) (Goodfellow et al., 2014) trained on either the CIFAR10 dataset (64×64 resolution) (Krizhevsky, 2009) or the CELEBA dataset (64×64 resolution) (Liu et al., 2015). Denoting the hidden layers and the output layer of the discriminator by h_ψ and D_Ψ, and the generator by G_Φ, we train GAN models with the following objectives:

$$\min_{\Phi}\ \mathrm{SWD}\big(h_\psi(p_r),h_\psi(G_\Phi(p_z))\big), \qquad (18)$$

$$\max_{\Psi,\psi}\ \mathbb{E}_{x\sim p_r}[\log(D_\Psi(h_\psi(x)))]+\mathbb{E}_{z\sim p_z}[\log(1-D_\Psi(h_\psi(G_\Phi(z))))], \qquad (19)$$

where p_z is the prior of the latent variable z and p_r is the distribution of the real data. The SWD in Equation (18) is replaced by the ASWD and other variants of the SWD to compare their performance. The GSWD with the polynomial defining function and the DGSWD are not included in this experiment due to their excessively high computational cost in high-dimensional spaces. The Fréchet Inception Distance (FID score) (Heusel et al., 2017) is used to assess the quality of the generated images. More details on the network structures and the parameter setup used in this experiment are available in Appendix E.2. We run 200 and 100 training epochs to train the GAN models on the CIFAR10 and CELEBA datasets, respectively. Each experiment is repeated 10 times. We report experimental results in Table 1. With the same number of projections and a similar computational cost, the ASWD leads to significantly improved FID scores among all evaluated distance metrics on both datasets, which implies that images generated with the ASWD are of higher quality. Figure 3 plots the FID scores recorded during the training process. The GAN model trained with the ASWD exhibits faster convergence, as it reaches lower FID scores within fewer epochs. Randomly selected samples of generated images are presented in Appendix G.

5. CONCLUSION

We proposed a novel variant of the sliced Wasserstein distance, namely the augmented sliced Wasserstein distance (ASWD), which is flexible, has high projection efficiency, and generalizes well. The ASWD adaptively updates the hypersurface that samples are projected onto by learning from data. We proved that the ASWD is a valid distance metric and presented its numerical implementation. We reported the empirical performance of the ASWD against state-of-the-art sliced Wasserstein metrics in numerical experiments. The ASWD leads to the smallest distance errors on the majority of datasets in a sliced Wasserstein flow task and to superior performance in a generative modeling task. The ASWD can be extended in several directions. One future research topic is to incorporate the framework of the ASWD into the sliced Gromov-Wasserstein distance, such that data-adaptive, nonlinear projections can be learned to compare distributions whose supports do not necessarily lie in the same metric space. It would also be interesting to explore the identification of barycenters of objects via the ASWD, e.g. through an optimization scheme alternating between the optimization of the nonlinear maps in the ASWD framework and the barycenter. Another interesting direction is the incorporation of the ASWD into state-of-the-art generative models to further exploit its projection efficiency. We anticipate that these directions will help generalize sliced Wasserstein distances to wider application domains.



Code to reproduce experiment results is available at : https://bit.ly/2Y23wOz.



Figure 1: (a) and (b) are visualizations of projections for the ASWD and the SWD between two 2-dimensional Gaussians. (c) and (d) are distance histograms for the ASWD and the SWD between two 100-dimensional Gaussians. Figure 1(a) shows that the injective neural network embedded in the ASWD learns data patterns (in the X-Y plane) and produces well-separated projected values (Z-axis) between distributions in a random projection direction. The high projection efficiency of the ASWD is evident in Figure 1(c), as almost all random projection directions in a 100-dimensional space lead to significant distances between 1-dimensional projections. In contrast, random linear mappings in the SWD often produce closer 1-d projections (Z-axis) (Figure 1(b)); as a result, a large percentage of random projection directions in the 100-d space result in trivially small distances (Figure 1(d)), leading to low projection efficiency in high-dimensional spaces.

Figure 2: The first and third columns are target distributions. The second and fourth columns are log 2-Wasserstein distances between the target distribution and the source distribution. The horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs. A more extensive set of experimental results can be found in Appendix F.1.

Figure 3: FID scores of generative models trained with different metrics on CIFAR10 and CELEBA datasets with L = 1000 projections. The error bar represents the standard deviation of the FID scores at the specified training epoch among 10 simulation runs.

Lemma 1. Given an injective mapping g(·): ℝ^d → ℝ^{d_θ} and two probability measures µ, ν ∈ P(ℝ^d) whose probability density functions are p_µ and p_ν, respectively, for all t ∈ ℝ and θ ∈ S^{d_θ−1}, Hp_µ(t,θ;g) ≡ Hp_ν(t,θ;g) if and only if µ = ν.

Table 1: FID scores of generative models trained with different distance metrics. Lower scores indicate better image quality. L is the number of projections. We run each experiment 10 times and report the average values and standard errors of the FID scores on the CIFAR10 and CELEBA datasets. The running time per training iteration for one batch of 512 samples is measured on a computer with an Intel(R) Xeon(R) Gold 5218 CPU @ 2.3 GHz, 16GB of RAM, and an RTX 6000 graphics card with 22GB of memory.

