HIERARCHICAL SLICED WASSERSTEIN DISTANCE

ABSTRACT

Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of the sliced Wasserstein distance is the average transportation cost between one-dimensional representations (projections) of the original measures, obtained via the Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein distance requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where mini-batch approaches are utilized, the complexity of the matrix multiplication in the Radon Transform becomes the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections, which we name bottleneck projections. We explain the usage of these projections by introducing the Hierarchical Radon Transform (HRT), which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW, including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets, including CIFAR10, CelebA, and Tiny ImageNet.[1]

1. INTRODUCTION

Wasserstein distance (Villani, 2008; Peyré & Cuturi, 2020) has been widely used in applications such as generative modeling on images (Arjovsky et al., 2017; Tolstikhin et al., 2018; litu Rout et al., 2022), domain adaptation to transfer knowledge from source to target domains (Courty et al., 2017; Bhushan Damodaran et al., 2018), clustering problems (Ho et al., 2017), and various other applications (Le et al., 2021; Xu et al., 2021; Yang et al., 2020). Despite the increasing importance of Wasserstein distance in applications, prior works have alluded to concerns surrounding its high computational complexity. When the probability measures have at most n supports, the computational complexity of Wasserstein distance scales with the order of O(n^3 log n) (Pele & Werman, 2009). Additionally, it suffers from the curse of dimensionality, i.e., its sample complexity (the bounding gap of the distance between a probability measure and the empirical measures from its random samples) is of the order of O(n^{-1/d}) (Fournier & Guillin, 2015), where n is the sample size and d is the number of dimensions. Over the years, numerous attempts have been made to improve the computational and sample complexities of the Wasserstein distance. One primal line of research focuses on using entropic regularization (Cuturi, 2013). This variant is known as entropic regularized optimal transport (or, in short, entropic regularized Wasserstein). By using the entropic version, one can approximate the Wasserstein distance with the computational complexity O(n^2) (Altschuler et al., 2017; Lin et al., 2019b;a; 2020) (up to some polynomial orders of approximation errors). Furthermore, the sample complexity of the entropic version has also been shown to be of the order of O(n^{-1/2}) (Mena & Weed, 2019), which indicates that it does not suffer from the curse of dimensionality.
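As a point of reference for the entropic line of work, the regularized problem can be solved with Sinkhorn's matrix-scaling iterations, whose per-iteration cost is dominated by O(n^2) matrix-vector products. The following NumPy sketch is our own illustration (the function name, regularization value, and iteration count are arbitrary choices, not taken from the works cited above):

```python
import numpy as np

def sinkhorn_cost(a, b, C, reg=1.0, n_iters=200):
    """Entropic regularized OT cost between histograms a, b with cost matrix C."""
    K = np.exp(-C / reg)                 # Gibbs kernel, n x n
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # alternately rescale columns ...
        u = a / (K @ v)                  # ... and rows to match the marginals
    P = u[:, None] * K * v[None, :]      # approximate transport plan
    return float(np.sum(P * C))          # transport cost under that plan

# two uniform empirical measures over small random point clouds
rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
C = np.linalg.norm(x[:, None] - y[None, :], axis=-1) ** 2
a = b = np.full(5, 1 / 5)
cost = sinkhorn_cost(a, b, C)
```

Each iteration only touches the n × n kernel through matrix-vector products, which is the source of the quadratic per-iteration complexity mentioned above.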
Another line of work builds upon the closed-form solution of optimal transport in one dimension. A notable distance metric along this direction is the sliced Wasserstein (SW) distance (Bonneel et al., 2015). SW is defined between two probability measures whose supports belong to a vector space, e.g., R^d. As defined in (Bonneel et al., 2015), SW is written as the expectation of the one-dimensional Wasserstein distance between two projected measures over the uniform distribution on the unit sphere. Due to the intractability of the expectation, Monte Carlo samples from the uniform distribution over the unit sphere are used to approximate the SW distance. The number of samples is often called the number of projections and is denoted as L. On the computational side, the projecting directions matrix of size d × L is sampled and then multiplied by the two data matrices of size n × d, resulting in two matrices of size n × L that represent L one-dimensional projected probability measures. Thereafter, L one-dimensional Wasserstein distances are computed between the two corresponding projected measures with the same projecting direction. Finally, the average of those distances yields an approximation of the value of the sliced Wasserstein distance. Prior works (Kolouri et al., 2018a; Deshpande et al., 2018; 2019; Nguyen et al., 2021a;b) show that the number of projections L should be large enough compared to the dimension d for a good performance of the SW. Despite the large L, SW has many benefits in practice. It can be computed in O(n log₂ n) time, with a statistical rate of O(n^{-1/2}) that does not suffer from the curse of dimensionality, while being more memory efficient[2] than the vanilla Wasserstein distance.
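The Monte Carlo procedure above (sample L directions, project, solve the L one-dimensional problems by sorting, then average) can be sketched as follows for two uniform empirical measures with n supports each; `sliced_wasserstein` is a hypothetical helper name, not code released with the paper:

```python
import numpy as np

def sliced_wasserstein(X, Y, L=100, p=2, seed=None):
    """Monte Carlo estimate of SW_p between uniform empirical measures on X, Y (n x d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(size=(d, L))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # uniform directions on the unit sphere
    Xp, Yp = X @ theta, Y @ theta        # (n x d) @ (d x L): L one-dimensional projected measures
    Xs, Ys = np.sort(Xp, axis=0), np.sort(Yp, axis=0)      # 1-d optimal transport via sorting
    return np.mean(np.abs(Xs - Ys) ** p) ** (1 / p)        # average over supports and projections
```

Sorting gives each one-dimensional Wasserstein distance in O(n log n) time, so when n is much smaller than d, the overall cost of a call is dominated by the d × L projection step.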
For these reasons, it has been successfully applied in several applications, such as (deep) generative modeling (Wu et al., 2019; Kolouri et al., 2018a; Nguyen & Ho, 2022a), domain adaptation (Lee et al., 2019), and clustering (Kolouri et al., 2018b). Nevertheless, it also suffers from certain limitations in, e.g., deep learning applications where mini-batch approaches (Fatras et al., 2020) are utilized. Here, the number of supports n is often much smaller than the number of dimensions d. Therefore, the computational complexity of solving L one-dimensional Wasserstein distances, Θ(Ln log₂ n), is small compared to the computational complexity of the matrix multiplication, Θ(Ldn). This indicates that almost all computation goes into the projection step. The situation is ubiquitous since there are several deep learning applications involving high-dimensional data, including images (Genevay et al., 2018; Nguyen & Ho, 2022b), videos (Wu et al., 2019), and text (Schmitz et al., 2018). Motivated by the low-rank decomposition of matrices, we propose a more efficient approach to project the original measures to their one-dimensional projected measures. In particular, two original measures are first projected into k one-dimensional projected measures via the Radon transform, where k < L. For convenience, we call these projected measures bottleneck projections. Then, L new one-dimensional projected measures are created as random linear combinations of the bottleneck projections. The linear mixing step can be seen as applying the Radon transform on the joint distribution of the k one-dimensional projected measures. From the computational point of view, the projecting step consists of two consecutive matrix multiplications. The first multiplication is between the data matrix of size n × d and the bottleneck projecting directions matrix of size d × k, and the second multiplication is between the bottleneck projection matrix and the linear mixing matrix of size k × L.
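Assuming both matrices are sampled as column-normalized Gaussians (the paper's exact sampling scheme may differ, and the helper name is ours), the two-stage projection can be sketched as:

```python
import numpy as np

def hierarchical_project(X, k, L, seed=None):
    """Map n x d data to L one-dimensional measures via k bottleneck projections."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    U = rng.normal(size=(d, k))
    U /= np.linalg.norm(U, axis=0, keepdims=True)  # bottleneck projecting directions, d x k
    W = rng.normal(size=(k, L))
    W /= np.linalg.norm(W, axis=0, keepdims=True)  # linear mixing matrix, k x L
    # two multiplications cost O(ndk + nkL), versus O(ndL) for a direct d x L projection
    return (X @ U) @ W
```

Since the composite map X @ (U @ W) has rank at most k, the resulting L projections are random linear combinations of only k bottleneck projections, which is exactly where the saving comes from when k < L and n is much smaller than d.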
Columns of both the bottleneck projecting directions matrix and the linear mixing matrix are sampled



[1] Code for experiments in the paper is published at https://github.com/UT-Austin-Data-Science-Group/HSW.
[2] SW does not need to store the cost matrix between supports.

