SPHERICAL SLICED-WASSERSTEIN

Abstract

Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular, the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.

Published as a conference paper at ICLR 2023

1. INTRODUCTION

Optimal transport (OT) (Villani, 2009) has received a lot of attention in machine learning in the past few years. As it allows comparing distributions with metrics, it has been used for different tasks such as domain adaptation (Courty et al., 2016) or generative models (Arjovsky et al., 2017), to name a few. The most classical distance used in OT is the Wasserstein distance. However, calculating it can be computationally expensive. Hence, several variants were proposed to alleviate the computational burden, such as the entropic regularization (Cuturi, 2013; Scetbon et al., 2021), minibatch OT (Fatras et al., 2020) or the sliced-Wasserstein distance (SW) for distributions supported on Euclidean spaces (Rabin et al., 2011b).

Although embedded in higher-dimensional Euclidean spaces, data generally lie in practice on manifolds (Fefferman et al., 2016). A simple manifold, but with lots of practical applications, is the hypersphere $S^{d-1}$. Several types of data are spherical by nature: a good example is found in directional data (Mardia et al., 2000; Pewsey & García-Portugués, 2021), for which dedicated machine learning solutions are being developed (Sra, 2018), but other applications concern for instance geophysical data (Di Marzio et al., 2014), meteorology (Besombes et al., 2021), cosmology (Perraudin et al., 2019) or extreme value theory for the estimation of spectral measures (Guillou et al., 2015). Remarkably, in a more abstract setting, considering hyperspherical latent representations of data is becoming more and more common (e.g. Liu et al., 2017; Xu & Durrett, 2018; Davidson et al., 2018). For example, in the context of variational autoencoders (Kingma & Welling, 2013), using priors on the sphere has been demonstrated to be beneficial (Davidson et al., 2018). Also, in the context of self-supervised learning (SSL), where one wants to learn discriminative representations in an unsupervised way, the hypersphere is usually considered for the latent representation (Wu et al., 2018; Chen et al., 2020a; Wang & Isola, 2020; Grill et al., 2020; Caron et al., 2020). It is thus of primary importance to develop machine learning tools that accommodate this specific geometry well.

The OT theory on manifolds is well developed (Villani, 2009; Figalli & Villani, 2011; McCann, 2001) and several works started to use it in practice, with a focus mainly on the approximation of OT maps. For example, Cohen et al. (2021); Rezende & Racanière (2021) approximate the OT map to define normalizing flows on Riemannian manifolds, Hamfeldt & Turnquist (2021a;b); Cui et al. (2019) derive algorithms to approximate the OT map on the sphere, Alvarez-Melis et al. (2020); Hoyos-Idrobo (2020) learn the transport map on hyperbolic spaces. However, the computational bottleneck to compute the Wasserstein distance on such spaces remains, and, as underlined in the conclusion of (Nadjahi, 2021), defining SW distances on manifolds would be of much interest. Notably, Rustamov & Majumdar (2020) proposed a variant of SW, based on the spectral decomposition of the Laplace-Beltrami operator, which generalizes to manifolds given the availability of the eigenvalues and eigenfunctions. However, it is not directly related to the original SW on Euclidean spaces.

Contributions. Therefore, by leveraging properties of the Wasserstein distance on the circle (Rabin et al., 2011a), we define the first, to the best of our knowledge, natural generalization of the original SW discrepancy on a non-trivial manifold, namely the sphere $S^{d-1}$, and hence we make a first step towards defining SW distances on Riemannian manifolds. We make connections with a new spherical Radon transform and analyze some of its properties. We discuss the underlying algorithmic procedure, and notably provide an efficient implementation when computing the discrepancy against a uniform distribution. Then, we show that we can use this discrepancy on different tasks such as sampling, density estimation or generative modeling.

2. BACKGROUND

The aim of this paper is to define a Sliced-Wasserstein discrepancy on the hypersphere $S^{d-1} = \{x \in \mathbb{R}^d,\ \|x\|_2 = 1\}$. Therefore, in this section, we introduce the Wasserstein distance on manifolds and the classical SW distance on $\mathbb{R}^d$.

2.1. WASSERSTEIN DISTANCE

Since we are interested in defining a SW discrepancy on the sphere, we start by introducing the Wasserstein distance on a Riemannian manifold $M$ endowed with the Riemannian distance $d$. We refer to (Villani, 2009; Figalli & Villani, 2011) for more details. Let $p \ge 1$ and $\mu, \nu \in \mathcal{P}_p(M) = \{\mu \in \mathcal{P}(M),\ \int_M d^p(x, x_0)\, d\mu(x) < \infty \text{ for some } x_0 \in M\}$. Then, the $p$-Wasserstein distance between $\mu$ and $\nu$ is defined as

$$W_p^p(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{M \times M} d^p(x, y)\, d\gamma(x, y),$$

where $\Pi(\mu, \nu) = \{\gamma \in \mathcal{P}(M \times M),\ \forall A \subset M,\ \gamma(M \times A) = \nu(A) \text{ and } \gamma(A \times M) = \mu(A)\}$ denotes the set of couplings. For discrete probability measures, the Wasserstein distance can be computed using linear programs (Peyré et al., 2019). However, these algorithms have a $O(n^3 \log n)$ complexity w.r.t. the number of samples $n$, which is computationally intensive. Therefore, a whole line of work is devoted to defining alternative discrepancies which are cheaper to compute. On Euclidean spaces, one of them is the Sliced-Wasserstein distance.

2.2. SLICED-WASSERSTEIN DISTANCE

On $M = \mathbb{R}^d$ with $d(x, y) = \|x - y\|_p^p$, a more attractive distance is the Sliced-Wasserstein (SW) distance. This distance relies on the appealing fact that for one-dimensional measures $\mu, \nu \in \mathcal{P}(\mathbb{R})$, we have the following closed form (Peyré et al., 2019, Remark 2.30):

$$W_p^p(\mu, \nu) = \int_0^1 \left| F_\mu^{-1}(u) - F_\nu^{-1}(u) \right|^p du,$$

where $F_\mu^{-1}$ (resp. $F_\nu^{-1}$) is the quantile function of $\mu$ (resp. $\nu$). From this property, Rabin et al. (2011b); Bonnotte (2013) defined the SW distance as

$$\forall \mu, \nu \in \mathcal{P}_p(\mathbb{R}^d),\quad SW_p^p(\mu, \nu) = \int_{S^{d-1}} W_p^p(P^\theta_\# \mu, P^\theta_\# \nu)\, d\lambda(\theta),$$

where $P^\theta(x) = \langle x, \theta \rangle$, $\lambda$ is the uniform distribution on $S^{d-1}$ and, for any Borel set $A \in \mathcal{B}(\mathbb{R})$, $P^\theta_\# \mu(A) = \mu((P^\theta)^{-1}(A))$. This distance can be approximated efficiently by using a Monte-Carlo approximation (Nadjahi et al., 2019), and amounts to a complexity of $O(Ln(d + \log n))$, where $L$ denotes the number of projections used for the Monte-Carlo approximation and $n$ the number of samples.

SW can also be written through the Radon transform (Bonneel et al., 2015). Let $f \in L^1(\mathbb{R}^d)$, then the Radon transform $R : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times S^{d-1})$ is defined as (Helgason et al., 2011)

$$\forall \theta \in S^{d-1},\ \forall t \in \mathbb{R},\quad Rf(t, \theta) = \int_{\mathbb{R}^d} f(x) \mathbb{1}_{\{\langle x, \theta \rangle = t\}}\, dx.$$

Its dual $R^* : C_0(\mathbb{R} \times S^{d-1}) \to C_0(\mathbb{R}^d)$ (also known as back-projection operator), where $C_0$ denotes the set of continuous functions that vanish at infinity, satisfies for all $f, g$, $\langle Rf, g \rangle_{\mathbb{R} \times S^{d-1}} = \langle f, R^* g \rangle_{\mathbb{R}^d}$ and can be defined as (Boman & Lindskog, 2009; Bonneel et al., 2015)

$$\forall g \in C_0(\mathbb{R} \times S^{d-1}),\ \forall x \in \mathbb{R}^d,\quad R^* g(x) = \int_{S^{d-1}} g(\langle x, \theta \rangle, \theta)\, d\theta. \tag{5}$$

Therefore, by duality, we can define the Radon transform of a measure $\mu \in \mathcal{M}(\mathbb{R}^d)$ as the measure $R\mu \in \mathcal{M}(\mathbb{R} \times S^{d-1})$ such that for all $g \in C_0(\mathbb{R} \times S^{d-1})$, $\langle R\mu, g \rangle_{\mathbb{R} \times S^{d-1}} = \langle \mu, R^* g \rangle_{\mathbb{R}^d}$. Since $R\mu$ is a measure on the product space $\mathbb{R} \times S^{d-1}$, we can disintegrate it w.r.t. $\lambda$, the uniform measure on $S^{d-1}$ (Ambrosio et al., 2005), as $R\mu = \lambda \otimes K$ with $K$ a probability kernel on $S^{d-1} \times \mathcal{B}(\mathbb{R})$, i.e. for all $\theta \in S^{d-1}$, $K(\theta, \cdot)$ is a probability on $\mathbb{R}$, for any Borel set $A \in \mathcal{B}(\mathbb{R})$, $K(\cdot, A)$ is measurable, and

$$\forall \phi \in C(\mathbb{R} \times S^{d-1}),\quad \int_{\mathbb{R} \times S^{d-1}} \phi(t, \theta)\, d(R\mu)(t, \theta) = \int_{S^{d-1}} \int_{\mathbb{R}} \phi(t, \theta)\, K(\theta, dt)\, d\lambda(\theta),$$

with $C(\mathbb{R} \times S^{d-1})$ the set of continuous functions on $\mathbb{R} \times S^{d-1}$. By Proposition 6 in (Bonneel et al., 2015), we have that for $\lambda$-almost every $\theta \in S^{d-1}$, $(R\mu)^\theta = P^\theta_\# \mu$, where we denote $K(\theta, \cdot) = (R\mu)^\theta$. Therefore, we have

$$\forall \mu, \nu \in \mathcal{P}_p(\mathbb{R}^d),\quad SW_p^p(\mu, \nu) = \int_{S^{d-1}} W_p^p\big((R\mu)^\theta, (R\nu)^\theta\big)\, d\lambda(\theta).$$

Variants of SW have been defined in recent works, either by integrating w.r.t. different distributions (Deshpande et al., 2019; Nguyen et al., 2021; 2020), by projecting on $\mathbb{R}$ using different projections (Nguyen & Ho, 2022a;b; Rustamov & Majumdar, 2020) or Radon transforms (Kolouri et al., 2019; Chen et al., 2020b), or by projecting on subspaces of higher dimensions (Paty & Cuturi, 2019; Lin et al., 2020; 2021; Huang et al., 2021).
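The Monte-Carlo approximation of the Euclidean SW distance described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the paper; the function names (`random_direction`, `wasserstein_1d`, `sliced_wasserstein`) are hypothetical, and the sketch assumes both empirical measures have the same number of equally weighted samples, so that the quantile formula reduces to matching sorted projections.

```python
import math
import random


def random_direction(d, rng):
    """Draw theta uniformly on S^{d-1} by normalizing a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def wasserstein_1d(xs, ys, p=2):
    """W_p^p between two empirical measures on R with the same number of atoms.

    With equal weights, the quantile formula reduces to matching sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of samples")
    return sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def sliced_wasserstein(X, Y, n_proj=200, p=2, seed=0):
    """Monte-Carlo estimate of SW_p^p(X, Y) using n_proj random directions."""
    rng = random.Random(seed)
    d = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        theta = random_direction(d, rng)
        # One-dimensional projections P^theta(x) = <x, theta>.
        proj_x = [sum(a * t for a, t in zip(x, theta)) for x in X]
        proj_y = [sum(a * t for a, t in zip(y, theta)) for y in Y]
        total += wasserstein_1d(proj_x, proj_y, p)
    return total / n_proj
```

Each projection costs $O(nd)$ for the inner products and $O(n \log n)$ for the sort, which matches the $O(Ln(d + \log n))$ complexity quoted above.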

3. A SLICED-WASSERSTEIN DISCREPANCY ON THE SPHERE

Our goal here is to define a sliced-Wasserstein distance on the sphere $S^{d-1}$. To that aim, we proceed analogously to the classical Euclidean case. We first rely on the nice properties of the Wasserstein distance on the circle (Rabin et al., 2011a) and then propose to project distributions lying on the sphere to great circles. Hence, circles play the role of the real line for the hypersphere. In this section, we first describe the OT problem on the circle, then we define a sliced-Wasserstein discrepancy on the sphere and discuss some of its properties. Notably, we derive a new spherical Radon transform which is linked to our newly defined spherical SW. We refer to Appendix A for the proofs.

3.1. OPTIMAL TRANSPORT ON THE CIRCLE

On the circle $S^1 = \mathbb{R}/\mathbb{Z}$ equipped with the geodesic distance $d_{S^1}$, an appealing formulation of the Wasserstein distance is available (Delon et al., 2010). First, let us parametrize $S^1$ by $[0, 1[$, then the geodesic distance can be written as (Rabin et al., 2011a)

$$\forall x, y \in [0, 1[,\quad d_{S^1}(x, y) = \min(|x - y|, 1 - |x - y|).$$

Then, for the cost function $c(x, y) = h(d_{S^1}(x, y))$ with $h : \mathbb{R} \to \mathbb{R}_+$ an increasing convex function, the Wasserstein distance between $\mu \in \mathcal{P}(S^1)$ and $\nu \in \mathcal{P}(S^1)$ can be written as

$$W_c(\mu, \nu) = \inf_{\alpha \in \mathbb{R}} \int_0^1 h\big(|F_\mu^{-1}(t) - (F_\nu - \alpha)^{-1}(t)|\big)\, dt,$$

where $F_\mu : [0, 1[ \to [0, 1]$ denotes the cumulative distribution function (cdf) of $\mu$, $F_\mu^{-1}$ its quantile function and $\alpha$ is a shift parameter. The optimization problem over the shifted cdf $F_\nu - \alpha$ can be seen as looking for the best "cut" (or origin) of the circle into the real line because of the 1-periodicity. Indeed, the proof of this result for discrete distributions in (Rabin et al., 2011a) consists in cutting the circle at the optimal point and unwrapping it onto the real line, for which the optimal transport map is the increasing rearrangement $F_\nu^{-1} \circ F_\mu$, which can be obtained for discrete distributions by sorting the points (Peyré et al., 2019).

Rabin et al. (2011a) showed that the minimization problem is convex and coercive in the shift parameter and Delon et al. (2010) derived a binary search algorithm to find it. For the particular case of $h = \mathrm{Id}$, it can further be shown (Werman et al., 1985; Cabrelli & Molter, 1995) that

$$W_1(\mu, \nu) = \inf_{\alpha \in \mathbb{R}} \int_0^1 |F_\mu(t) - F_\nu(t) - \alpha|\, dt.$$

In this case, we know exactly the minimum, which is attained at the level median (Hundrieser et al., 2021). For $f : [0, 1[ \to \mathbb{R}$,

$$\mathrm{LevMed}(f) = \min\Big\{ \operatorname*{argmin}_{\alpha \in \mathbb{R}} \int_0^1 |f(t) - \alpha|\, dt \Big\} = \inf\Big\{ t \in \mathbb{R},\ \beta(\{x \in [0, 1[,\ f(x) \le t\}) \ge \frac{1}{2} \Big\},$$

where $\beta$ is the Lebesgue measure. Therefore, we also have

$$W_1(\mu, \nu) = \int_0^1 |F_\mu(t) - F_\nu(t) - \mathrm{LevMed}(F_\mu - F_\nu)|\, dt. \tag{11}$$

Since we know the minimum, we do not need the binary search and we can approximate the integral very efficiently, as we only need to sort the samples to compute the level median and the cdfs.

Another interesting setting in practice is to compute $W_2$, i.e. with $h(x) = x^2$, w.r.t. a uniform distribution $\nu$ on the circle. We derive here the optimal shift $\hat\alpha$ for the Wasserstein distance between an arbitrary distribution $\mu$ on $S^1$ and $\nu$. We also provide a closed form when $\mu$ is a discrete distribution.

Proposition 1. Let $\mu \in \mathcal{P}_2(S^1)$ and $\nu = \mathrm{Unif}(S^1)$. Then,

$$W_2^2(\mu, \nu) = \int_0^1 |F_\mu^{-1}(t) - t - \hat\alpha|^2\, dt \quad \text{with} \quad \hat\alpha = \int x\, d\mu(x) - \frac{1}{2}. \tag{12}$$

In particular, if $x_1 < \dots < x_n$ and $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$, then

$$W_2^2(\mu_n, \nu) = \frac{1}{n} \sum_{i=1}^n x_i^2 - \Big( \frac{1}{n} \sum_{i=1}^n x_i \Big)^2 + \frac{1}{n^2} \sum_{i=1}^n (n + 1 - 2i)\, x_i + \frac{1}{12}. \tag{13}$$

This proposition offers an intuitive interpretation: the optimal cut point between an empirical and a uniform distribution is the antipodal point of the circular mean of the discrete samples. Moreover, a very efficient algorithm can be derived from this property, as it solely requires a sorting operation to compute the order statistics of the samples.
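The two closed forms above can be illustrated for empirical measures with a short sketch in plain Python. It is only an illustration under stated assumptions: points are given as coordinates in $[0, 1[$, the circle therefore has unit length, and the function names `circle_w1` and `circle_w2_to_uniform` are hypothetical. The first function evaluates the level-median formula (11) exactly on the piecewise-constant function $F_\mu - F_\nu$, the second evaluates the closed form (13).

```python
def circle_w1(xs, ys):
    """W_1 on the circle [0, 1[ between two empirical measures, via the level median."""
    n, m = len(xs), len(ys)
    # F_mu - F_nu is piecewise constant: +1/n at each x, -1/m at each y.
    events = sorted([(x, 1.0 / n) for x in xs] + [(y, -1.0 / m) for y in ys])
    pieces = []  # (value of F_mu - F_nu, length of the interval)
    prev, level = 0.0, 0.0
    for pos, jump in events:
        pieces.append((level, pos - prev))
        level += jump
        prev = pos
    pieces.append((level, 1.0 - prev))
    # Level median: smallest value whose sublevel set has Lebesgue measure >= 1/2.
    pieces.sort()
    median, acc = pieces[-1][0], 0.0
    for value, length in pieces:
        acc += length
        if acc >= 0.5:
            median = value
            break
    return sum(abs(value - median) * length for value, length in pieces)


def circle_w2_to_uniform(xs):
    """Closed form (13): W_2^2 between the empirical measure of xs and Unif(S^1)."""
    n = len(xs)
    xs = sorted(xs)
    mean = sum(xs) / n
    second_moment = sum(x * x for x in xs) / n
    rank_term = sum((n + 1 - 2 * i) * x for i, x in enumerate(xs, start=1)) / n ** 2
    return second_moment - mean ** 2 + rank_term + 1.0 / 12.0
```

As a sanity check, two Dirac masses at 0.1 and 0.9 are at geodesic distance 0.2, which is the value `circle_w1([0.1], [0.9])` should return, and a single Dirac mass is at squared distance $2\int_0^{1/2} t^2\, dt = 1/12$ from the uniform distribution.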

3.2. DEFINITION OF SW ON THE SPHERE

On the hypersphere, the counterparts of straight lines are the great circles, which are circles with the same diameter as the sphere, and which correspond to the geodesics. Moreover, we can compute the Wasserstein distance on the circle fairly efficiently. Hence, to define a sliced-Wasserstein discrepancy on this manifold, we propose, analogously to the classical SW distance, to project measures on great circles. The most natural way to project points from $S^{d-1}$ to a great circle $C$ is to use the geodesic projection (Jung, 2021; Fletcher et al., 2004) defined as

$$\forall x \in S^{d-1},\quad P^C(x) = \operatorname*{argmin}_{y \in C} d_{S^{d-1}}(x, y),$$

where $d_{S^{d-1}}(x, y) = \arccos(\langle x, y \rangle)$ is the geodesic distance. See Figure 1 for an illustration of the geodesic projection on a great circle. Note that the projection is unique for almost every $x$ (see (Bardelli & Mennucci, 2017, Proposition 4.2) and Appendix B.1) and hence the pushforward $P^C_\# \mu$ of $\mu \in \mathcal{P}_{p,ac}(S^{d-1})$, where $\mathcal{P}_{p,ac}(S^{d-1})$ denotes the set of absolutely continuous measures w.r.t. the Lebesgue measure and with moments of order $p$, is well defined.

Great circles can be obtained by intersecting $S^{d-1}$ with a 2-dimensional plane (Jung et al., 2012). Therefore, to average over all great circles, we propose to integrate over the Grassmann manifold (Absil et al., 2004; Bendokat et al., 2020) and then to project the distribution onto the intersection with the hypersphere. Since the Grassmannian is not very practical, we consider the identification using the set of rank 2 projectors:

$$G_{d,2} = \{E \subset \mathbb{R}^d,\ \dim(E) = 2\} = \{P \in \mathbb{R}^{d \times d},\ P^T = P,\ P^2 = P,\ \mathrm{Tr}(P) = 2\} = \{U U^T,\ U \in V_{d,2}\},$$

where $V_{d,2} = \{U \in \mathbb{R}^{d \times 2},\ U^T U = I_2\}$ is the Stiefel manifold (Bendokat et al., 2020).

Finally, we can define the Spherical Sliced-Wasserstein distance (SSW) for $p \ge 1$ between locally absolutely continuous measures w.r.t. the Lebesgue measure (Bardelli & Mennucci, 2017) $\mu, \nu \in \mathcal{P}_{p,ac}(S^{d-1})$ as

$$SSW_p^p(\mu, \nu) = \int_{V_{d,2}} W_p^p(P^U_\# \mu, P^U_\# \nu)\, d\sigma(U), \tag{16}$$

where $\sigma$ is the uniform distribution over the Stiefel manifold $V_{d,2}$, $P^U$ is the geodesic projection on the great circle generated by $U$ and then projected on $S^1$, i.e.

$$\forall U \in V_{d,2},\ \forall x \in S^{d-1},\quad P^U(x) = U^T \operatorname*{argmin}_{y \in \mathrm{span}(U U^T) \cap S^{d-1}} d_{S^{d-1}}(x, y) = \operatorname*{argmin}_{z \in S^1} d_{S^{d-1}}(x, U z), \tag{17}$$

and the Wasserstein distance is defined with the geodesic distance $d_{S^1}$.

Figure 1: Illustration of the geodesic projections on a great circle (in black). In red, random points sampled on the sphere. In green the projections and in blue the trajectories.

Moreover, we can derive a closed-form expression which will be very useful in practice:

Lemma 1. Let $U \in V_{d,2}$, then for a.e. $x \in S^{d-1}$,

$$P^U(x) = \frac{U^T x}{\|U^T x\|_2}.$$

Hence, we notice from this expression of the projection that we recover almost the same formula as Lin et al. (2020), but with an additional $\ell_2$ normalization which projects the data on the circle. As in (Lin et al., 2020), we could project on a higher dimensional subsphere by integrating over $V_{d,k}$ with $k \ge 2$. However, we would lose the computational efficiency provided by the properties of the Wasserstein distance on the circle.
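Lemma 1 can be checked numerically with a small sketch. The code below is an illustration in plain Python with hypothetical helper names (`project_on_circle`, `geodesic_distance_to`); the matrix $U$ is represented by its two orthonormal columns `u1` and `u2`, which is an assumption of this sketch rather than a convention of the paper.

```python
import math


def project_on_circle(u1, u2, x):
    """Lemma 1: P^U(x) = U^T x / ||U^T x||_2 for U = (u1 u2) with orthonormal columns."""
    a = sum(p * q for p, q in zip(u1, x))
    b = sum(p * q for p, q in zip(u2, x))
    norm = math.hypot(a, b)  # vanishes only on a set of measure zero
    return (a / norm, b / norm)


def geodesic_distance_to(u1, u2, x, z):
    """Geodesic distance on the sphere between x and the great-circle point U z."""
    y = [z[0] * p + z[1] * q for p, q in zip(u1, u2)]
    inner = sum(p * q for p, q in zip(x, y))
    return math.acos(max(-1.0, min(1.0, inner)))  # clamp against rounding errors


# Example: the equator of S^2 and the point x = (1, 2, 2) / 3.
u1, u2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
x = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
z_star = project_on_circle(u1, u2, x)
d_star = geodesic_distance_to(u1, u2, x, z_star)
```

In this example, $\|U^T x\|_2 = \sqrt{5}/3$, so the distance to the projection is $\arccos(\sqrt{5}/3)$, and a brute-force scan over points $z = (\cos\varphi, \sin\varphi)$ of the circle should not find a closer point, in line with the argmin characterization (17).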

3.3. A SPHERICAL RADON TRANSFORM

As for the classical SW distance, we can derive a second formulation using a Radon transform. Let $f \in L^1(S^{d-1})$, we define a spherical Radon transform $\tilde{R} : L^1(S^{d-1}) \to L^1(S^1 \times V_{d,2})$ as

$$\forall z \in S^1,\ \forall U \in V_{d,2},\quad \tilde{R} f(z, U) = \int_{S^{d-1}} f(x) \mathbb{1}_{\{z = P^U(x)\}}\, dx.$$

This is basically the same formulation as the classical Radon transform (Natterer, 2001; Helgason et al., 2011), where we replaced the real line coordinate $t$ by the coordinate on the circle $z$, and the projection is the geodesic one, which is well suited to the sphere. This transform is actually new since we integrate over different sets compared to existing works on spherical Radon transforms. Then, analogously to the classical Radon transform, we can define the back-projection operator $\tilde{R}^* : C_0(S^1 \times V_{d,2}) \to C_b(S^{d-1})$, $C_b(S^{d-1})$ being the space of continuous bounded functions, for $g \in C_0(S^1 \times V_{d,2})$ as

$$\text{for a.e. } x \in S^{d-1},\quad \tilde{R}^* g(x) = \int_{V_{d,2}} g(P^U(x), U)\, d\sigma(U).$$

Proposition 2. $\tilde{R}^*$ is the dual operator of $\tilde{R}$, i.e. for all $f \in L^1(S^{d-1})$, $g \in C_0(S^1 \times V_{d,2})$, $\langle \tilde{R} f, g \rangle_{S^1 \times V_{d,2}} = \langle f, \tilde{R}^* g \rangle_{S^{d-1}}$.

Now that we have a dual operator, we can also define the Radon transform of an absolutely continuous measure $\mu \in \mathcal{M}_{ac}(S^{d-1})$ by duality (Boman & Lindskog, 2009; Bonneel et al., 2015) as the measure $\tilde{R}\mu$ satisfying

$$\forall g \in C_0(S^1 \times V_{d,2}),\quad \int_{S^1 \times V_{d,2}} g(z, U)\, d(\tilde{R}\mu)(z, U) = \int_{S^{d-1}} \tilde{R}^* g(x)\, d\mu(x).$$

Since $\tilde{R}\mu$ is a measure on the product space $S^1 \times V_{d,2}$, $\tilde{R}\mu$ can be disintegrated (Ambrosio et al., 2005, Theorem 5.3.1) w.r.t. $\sigma$ as $\tilde{R}\mu = \sigma \otimes K$, where $K$ is a probability kernel on $V_{d,2} \times \mathcal{S}^1$ with $\mathcal{S}^1$ the Borel $\sigma$-field of $S^1$. We will denote for $\sigma$-almost every $U \in V_{d,2}$, $(\tilde{R}\mu)^U = K(U, \cdot)$ the conditional probability.

Proposition 3. Let $\mu \in \mathcal{M}_{ac}(S^{d-1})$, then for $\sigma$-almost every $U \in V_{d,2}$, $(\tilde{R}\mu)^U = P^U_\# \mu$.

Finally, we can write SSW (16) using this Radon transform:

$$\forall \mu, \nu \in \mathcal{P}_{p,ac}(S^{d-1}),\quad SSW_p^p(\mu, \nu) = \int_{V_{d,2}} W_p^p\big((\tilde{R}\mu)^U, (\tilde{R}\nu)^U\big)\, d\sigma(U). \tag{23}$$

Note that a natural way to define SW distances can be through already known Radon transforms using the formulation (23). It is for example what was done in (Kolouri et al., 2019) using generalized Radon transforms (Ehrenpreis, 2003; Homan & Zhou, 2017) to define generalized SW distances, or in (Chen et al., 2020b) with the spatial Radon transform. However, for known spherical Radon transforms (Abouelaz & Daher, 1993; Antipov et al., 2011) such as the Minkowski-Funk transform (Dann, 2010) or more generally the geodesic Radon transform (Rubin, 2002), we know of no natural way to integrate over some product space that would allow defining a SW distance using disintegration.

As observed by Kolouri et al. (2019) for the generalized SW distances (GSW), studying the injectivity of the related Radon transforms allows one to study the set on which SW is actually a distance. While the classical Radon transform integrates over hyperplanes of $\mathbb{R}^d$, the generalized Radon transform over hypersurfaces (Kolouri et al., 2019) and the Minkowski-Funk transform over "big circles", i.e. the intersection between a hyperplane and $S^{d-1}$ (Rubin, 2003), the set of integration here is a half of a big circle. Hence, $\tilde{R}$ is related to the hemispherical transform (Rubin, 1999) on $S^{d-2}$. We refer to Appendix A.6 for more details on the links with the hemispherical transform. Using these connections, we can derive the kernel of $\tilde{R}$ as the set of even measures which are null over all hyperplanes intersected with $S^{d-1}$.

Proposition 4. $\ker(\tilde{R}) = \{\mu \in \mathcal{M}_{even}(S^{d-1}),\ \forall H \in G_{d,d-1},\ \mu(H \cap S^{d-1}) = 0\}$, where $\mu \in \mathcal{M}_{even}$ if for all $f \in C(S^{d-1})$, $\langle \mu, f \rangle = \langle \mu, f_+ \rangle$ with $f_+(x) = (f(x) + f(-x))/2$ for all $x$.

We leave for future works checking whether this set is null or not. Hence, we conclude here that SSW is a pseudo-distance, but a distance on the sets of injectivity of $\tilde{R}$ (Agranovsky & Quinto, 1996).

Proposition 5. Let $p \ge 1$, $SSW_p$ is a pseudo-distance on $\mathcal{P}_{p,ac}(S^{d-1})$.

4. IMPLEMENTATION

In practice, we approximate the distributions with empirical approximations and, as for the classical SW distance, we rely on the Monte-Carlo approximation of the integral on $V_{d,2}$. We first need to sample from the uniform distribution $\sigma \in \mathcal{P}(V_{d,2})$. This can be done by first constructing $Z \in \mathbb{R}^{d \times 2}$ by drawing each of its components from the standard normal distribution $\mathcal{N}(0, 1)$ and then applying the QR decomposition (Lin et al., 2021). Once we have $(U_\ell)_{\ell=1}^L \sim \sigma$, we project the samples on the circle $S^1$ by applying Lemma 1 and we compute the coordinates on the circle using the atan2 function. Finally, we can compute the Wasserstein distance on the circle by either applying the binary search algorithm of (Delon et al., 2010) or the level median formulation (11) for $SSW_1$. In the particular case in which we want to compute $SSW_2$ between a measure $\mu$ and the uniform measure on the sphere $\nu = \mathrm{Unif}(S^{d-1})$, we can use the appealing fact that the projection of $\nu$ on the circle is uniform, i.e. $P^U_\# \nu = \mathrm{Unif}(S^1)$ (particular case of Theorem 3.1 in (Jung, 2021), see Appendix B.3). Hence, we can use Proposition 1 to compute $W_2$, which allows a very efficient implementation either by the closed form (13) or by approximating (12) with the rectangle method. This will be of particular interest for applications in Section 5 such as autoencoders. We sum up the procedure in Algorithm 1.
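The sampling and projection steps just described can be sketched in plain Python. This is not the authors' PyTorch code: the names `sample_stiefel` and `circle_coordinate` are hypothetical, and the QR decomposition is replaced by an explicit Gram-Schmidt orthonormalization of the two Gaussian columns, which yields the Q factor of a QR decomposition whose R factor has a positive diagonal.

```python
import math
import random


def sample_stiefel(d, rng):
    """Draw U = (u1 u2) with orthonormal columns from a d x 2 Gaussian matrix."""
    g1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n1 = math.sqrt(sum(v * v for v in g1))
    u1 = [v / n1 for v in g1]
    # Remove from g2 its component along u1, then normalize.
    overlap = sum(a * b for a, b in zip(g2, u1))
    residual = [b - overlap * a for a, b in zip(u1, g2)]
    n2 = math.sqrt(sum(v * v for v in residual))
    u2 = [v / n2 for v in residual]
    return u1, u2


def circle_coordinate(u1, u2, x):
    """Coordinate in [0, 1] of P^U(x), using the atan2 formula of Algorithm 1."""
    a = sum(p * q for p, q in zip(u1, x))
    b = sum(p * q for p, q in zip(u2, x))
    # atan2 is invariant to positive rescaling of (a, b), so the l2
    # normalization of Lemma 1 does not need to be applied explicitly.
    return (math.pi + math.atan2(-b, -a)) / (2.0 * math.pi)
```

With this convention, the returned coordinate is the angle of $U^T x$ divided by $2\pi$; for instance the point $(0, 1)$ of the circle is mapped to $0.25$.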

Algorithm 1 SSW

Input: $(x_i)_{i=1}^n \sim \mu$, $(y_j)_{j=1}^m \sim \nu$, $L$ the number of projections, $p$ the order
for $\ell = 1$ to $L$ do
    Draw a random matrix $Z \in \mathbb{R}^{d \times 2}$ with, for all $i, j$, $Z_{i,j} \sim \mathcal{N}(0, 1)$
    $U = \mathrm{QR}(Z) \sim \sigma$
    Project on $S^1$ the points: $\forall i, j$, $\hat{x}_i^\ell = \frac{U^T x_i}{\|U^T x_i\|_2}$, $\hat{y}_j^\ell = \frac{U^T y_j}{\|U^T y_j\|_2}$
    Compute the coordinates on the circle $S^1$: $\forall i, j$, $\tilde{x}_i^\ell = (\pi + \mathrm{atan2}(-\hat{x}_{i,2}^\ell, -\hat{x}_{i,1}^\ell))/(2\pi)$, $\tilde{y}_j^\ell = (\pi + \mathrm{atan2}(-\hat{y}_{j,2}^\ell, -\hat{y}_{j,1}^\ell))/(2\pi)$
    Compute $W_p^p\big(\frac{1}{n} \sum_{i=1}^n \delta_{\tilde{x}_i^\ell}, \frac{1}{m} \sum_{j=1}^m \delta_{\tilde{y}_j^\ell}\big)$ by binary search or (11) for $p = 1$
end for
Return $SSW_p^p(\mu, \nu) \approx \frac{1}{L} \sum_{\ell=1}^L W_p^p\big(\frac{1}{n} \sum_{i=1}^n \delta_{\tilde{x}_i^\ell}, \frac{1}{m} \sum_{j=1}^m \delta_{\tilde{y}_j^\ell}\big)$

Complexity. Let us denote by $n$ (resp. $m$) the number of samples of $\mu$ (resp. $\nu$), and by $L$ the number of projections. First, we need to compute the QR factorization of $L$ matrices of size $d \times 2$. This can be done in $O(Ld)$ by using e.g. Householder reflections (Golub & Van Loan, 2013, Chapter 5.2) or the Schwarz-Rutishauser algorithm (Gander, 1980). Projecting the points on $S^1$ by Lemma 1 is in $O((n + m)dL)$ since we need to compute $L(n + m)$ products between $U_\ell^T \in \mathbb{R}^{2 \times d}$ and $x \in \mathbb{R}^d$. For the binary search or the particular case formulas (11) and (13), we first need to sort the points. But the binary search also adds a cost of $O((n + m) \log(\frac{1}{\epsilon}))$ to approximate the solution with precision $\epsilon$ (Delon et al., 2010), and the computation of the level median requires sorting $(n + m)$ points. Hence, for the general $SSW_p$, the complexity is $O(L(n + m)(d + \log(\frac{1}{\epsilon})) + Ln \log n + Lm \log m)$, versus $O(L(n + m)(d + \log(n + m)))$ for $SSW_1$ with the level median and $O(Ln(d + \log n))$ for $SSW_2$ against a uniform, with the particular advantage that we do not need uniform samples in this case.

Runtime Comparison. We perform here some runtime comparisons. Using PyTorch (Paszke et al., 2019), we implemented the binary search algorithm of (Delon et al., 2010) and used it with $\epsilon = 10^{-6}$. We also implemented $SSW_1$ using the level median formula (11) and $SSW_2$ against a uniform measure (12). All experiments are conducted on GPU.
Figure 2: [...] (11) between two distributions on $S^2$. The time includes the calculation of the distance matrices.

On Figure 2, we compare, for two distributions on $S^2$, the runtime of SSW, SW, the Wasserstein distance and the entropic approximation using the Sinkhorn algorithm (Cuturi, 2013) with the geodesic distance as cost function. The distributions were approximated using $n \in \{10^2, 10^3, 10^4, 5 \cdot 10^4, 10^5, 5 \cdot 10^5\}$ samples of each distribution and we report the mean over 20 computations. We use the Python Optimal Transport (POT) library (Flamary et al., 2021) to compute the Wasserstein distance and the entropic approximation. For large enough batches, we observe that SSW is much faster than its Wasserstein counterpart, and it also scales better in terms of memory, since the Wasserstein distance requires storing the $n \times n$ cost matrix. For small batches, the computation of SSW actually takes longer because of the computation of the QR factorizations and of the projections. For bigger batches, it is bounded by the sorting operation and we recover the quasi-linear slope. Furthermore, as expected, the fastest algorithms are $SSW_1$ with the level median and $SSW_2$ against a uniform, as they have a quasi-linear complexity. We report in Appendix C.2 other runtime experiments w.r.t. e.g. the number of projections or the dimension.

Additionally, we study both theoretically and empirically the projection and sample complexities in Appendices A.9 and C.1. We obtain results similar to those derived by Nadjahi et al. (2020) for the SW distance. Notably, the sample complexity is independent of the dimension.
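As an illustration of Algorithm 1 in the particular case of $SSW_2$ against the uniform measure, the following sketch chains the Stiefel sampling, the projection of Lemma 1 and the closed form (13). It is written in plain Python for readability and is not the PyTorch implementation benchmarked above; the names `sample_stiefel`, `w2_to_uniform` and `ssw2_to_uniform` are hypothetical, and circle coordinates live in $[0, 1]$, so distances are expressed for a circle of unit length.

```python
import math
import random


def sample_stiefel(d, rng):
    """Orthonormalize two Gaussian vectors of R^d (Gram-Schmidt, i.e. a thin QR)."""
    g1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n1 = math.sqrt(sum(v * v for v in g1))
    u1 = [v / n1 for v in g1]
    overlap = sum(a * b for a, b in zip(g2, u1))
    residual = [b - overlap * a for a, b in zip(u1, g2)]
    n2 = math.sqrt(sum(v * v for v in residual))
    return u1, [v / n2 for v in residual]


def w2_to_uniform(coords):
    """Closed form (13) for W_2^2 between an empirical measure and Unif(S^1)."""
    n = len(coords)
    xs = sorted(coords)
    mean = sum(xs) / n
    second_moment = sum(x * x for x in xs) / n
    rank_term = sum((n + 1 - 2 * i) * x for i, x in enumerate(xs, start=1)) / n ** 2
    return second_moment - mean ** 2 + rank_term + 1.0 / 12.0


def ssw2_to_uniform(X, n_proj=100, seed=0):
    """Monte-Carlo estimate of SSW_2^2 between the samples X and Unif(S^{d-1})."""
    rng = random.Random(seed)
    d = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        u1, u2 = sample_stiefel(d, rng)
        coords = []
        for x in X:
            a = sum(p * q for p, q in zip(u1, x))
            b = sum(p * q for p, q in zip(u2, x))
            # Circle coordinate of P^U(x); atan2 makes the normalization implicit.
            coords.append((math.pi + math.atan2(-b, -a)) / (2.0 * math.pi))
        total += w2_to_uniform(coords)
    return total / n_proj
```

No samples from the uniform distribution are needed, and each projection costs one sort of $n$ coordinates. When all samples coincide, every projected measure is a Dirac mass and the estimate equals $1/12$, whereas samples spread uniformly over the sphere should give a value close to zero.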

5. EXPERIMENTS

Apart from showing that SSW is an effective discrepancy for learning problems defined over the sphere, the objective of this experimental section is to show that it behaves better than using the more immediate SW in the embedding space. We first illustrate the ability to approximate different distributions by minimizing SSW w.r.t. some target distributions on $S^2$ and by performing density estimation experiments on real earth data. Then, we apply SSW to generative modeling tasks using the framework of Sliced-Wasserstein autoencoders, and we show that we obtain competitive results with other Wasserstein autoencoder based methods using a prior on higher dimensional hyperspheres. Complete details about the experimental settings and optimization strategies are given in Appendix C. We also report in Appendices C.5 and C.7 complementary experiments on variational inference on the sphere and on self-supervised learning with a uniformity prior on the embedding hypersphere, which further assess the effectiveness of SSW in a wide range of learning tasks. The code is available online.

5.1. SSW AS A LOSS

Gradient flow on toy data. In a first experiment, we verify that we can learn some target distribution $\nu \in \mathcal{P}(S^{d-1})$ by minimizing SSW, i.e. we consider the minimization problem $\operatorname{argmin}_\mu SSW_p^p(\mu, \nu)$. We suppose that we have access to the target distribution $\nu$ through samples, i.e. through $\hat\nu_m = \frac{1}{m} \sum_{j=1}^m \delta_{y_j}$, where $(y_j)_{j=1}^m$ are i.i.d. samples of $\nu$. We add in Appendix C.5 the case where we know the density up to some constant, which can be dealt with using the sliced-Wasserstein variational inference framework introduced in (Yi & Liu, 2021). We choose as target distribution a mixture of 6 well separated von Mises-Fisher distributions (Mardia, 1975). This is a fairly challenging distribution since there are 6 modes which are not connected. We show on Figure 3 the Mollweide projection of the density approximated by a kernel density estimator for a distribution with 500 particles. To optimize directly over particles, we perform a Riemannian gradient descent on the sphere (Absil et al., 2009).

Density estimation on earth data. We perform density estimation on datasets first gathered by Mathieu & Nickel (2020), which contain locations of wild fires (EOSDIS, 2020), floods (Brakenridge, 2017) or earthquakes (NOAA, 2022). We use exponential map normalizing flows introduced in (Rezende et al., 2020) (see Appendix B.4), which are invertible transformations mapping the data to some prior that we need to enforce. Here, we choose as prior a uniform distribution on $S^2$ and we learn the model using SSW. These transformations allow evaluating exactly the density at any point. More precisely, let $T$ be such a transformation, let $p_Z$ be a prior distribution on $S^2$ and $\mu$ the measure of interest, which we know from samples, i.e. through $\hat\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$. Then, we solve the following optimization problem:

$$\min_T\ SSW_2^2(T_\# \mu, p_Z).$$
Once it is fitted, the learned density $f_\mu$ can be obtained by

$$\forall x \in S^2,\quad f_\mu(x) = p_Z\big(T(x)\big)\, |\det J_T(x)|,$$

where we used the change of variable formula. We show on Figure 5 the density of test data learned. We observe on this figure that the normalizing flows (NFs) put mass where most data points lie, and hence are able to somewhat recover the principal modes of the data. We also compare on Table 1 the negative test log likelihood, averaged over 5 trainings with different splits of the data, between different OT metrics, namely SSW, SW and the stereographic projection model (Gemici et al., 2016), which first projects the data on $\mathbb{R}^2$ and uses a regular NF in the projected space. We observe that SSW allows to better fit the data compared to the other OT based methods, which are less suited to the sphere.

5.2. SSW AUTOENCODERS

In this section, we use SSW to learn the latent space of autoencoders (AE). We rely on the SWAE framework introduced by Kolouri et al. (2018). Let $f$ be some encoder and $g$ be some decoder, denote $p_Z$ a prior distribution, then the loss minimized in SWAE is

$$\mathcal{L}(f, g) = \int c\big(x, g(f(x))\big)\, d\mu(x) + \lambda\, SW_2^2(f_\# \mu, p_Z),$$

where $\mu$ is the distribution of the data for which we have access to samples. One advantage of this framework over more classical VAEs (Kingma & Welling, 2013) is that no reparametrization trick is needed here and therefore the choice of the prior is less constrained. In several concurrent works, it was shown that using a prior on the hypersphere can improve the results (Davidson et al., 2018; Xu & Durrett, 2018). Hence, we propose in the same fashion as (Kolouri et al., 2018; 2019; Patrini et al., 2020) to replace SW by SSW, which we denote SSWAE, and to enforce a prior on the sphere. In the following, we use the MNIST (LeCun & Cortes, 2010), FashionMNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky, 2009) datasets, and we put an $\ell_2$ normalization at the output of the encoder. As a prior, we use the uniform distribution on $S^{10}$ for MNIST and FashionMNIST, and on $S^{64}$ for CIFAR10. We compare in Table 2 the Fréchet Inception Distance (FID) (Heusel et al., 2017), for 10000 samples and averaged over 5 trainings, obtained with the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018), the classical SWAE (Kolouri et al., 2018), the Sinkhorn Autoencoder (SAE) (Patrini et al., 2020) and circular GSWAE (Kolouri et al., 2019). We observe that we obtain fairly competitive results on the different datasets. We show on Figure 4 the latent space obtained with a uniform prior on $S^2$ on MNIST. We notably observe a better separation between classes for SSWAE.

6. CONCLUSION AND DISCUSSION

In this work, we derive a new sliced-Wasserstein discrepancy on the hypersphere, which comes with practical advantages when computing optimal transport distances on hyperspherical data. We notably showed that it is competitive with, or even sometimes better than, other metrics defined directly on $\mathbb{R}^d$ on a variety of machine learning tasks, including density estimation or generative models. Our work is, to the best of our knowledge, the first to adapt the classical sliced-Wasserstein framework to non-trivial manifolds. The three main ingredients are: i) a closed form for Wasserstein on the circle, ii) a closed-form solution to the projection onto great circles, and iii) a novel Radon transform on the sphere. An immediate extension of this work would be to consider sliced-Wasserstein discrepancies in hyperbolic spaces, where geodesics are circular arcs as in the Poincaré disk. Beyond the generalization to other, possibly well behaved, manifolds, asymptotic properties as well as statistical and topological aspects need to be examined. While we postulate that results comparable to the Euclidean case might be reached, the fact that the manifold is closed might bring interesting differences and justify further use of this type of discrepancies rather than their Euclidean counterparts.

A PROOFS

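A.1 PROOF OF PROPOSITION 1

Optimal α. Let $\mu \in \mathcal{P}_2(S^1)$, $\nu = \mathrm{Unif}(S^1)$. Since $\nu$ is the uniform distribution on $S^1$, its cdf is the identity on $[0, 1]$ (where we identified $S^1$ and $[0, 1]$). We can extend the cdf $F$ on the real line as in (Rabin et al., 2011a) with the convention $F(y + 1) = F(y) + 1$. Therefore, $F_\nu = \mathrm{Id}$ on $\mathbb{R}$. Moreover, we know that for all $x \in S^1$, $(F_\nu - \alpha)^{-1}(x) = F_\nu^{-1}(x + \alpha) = x + \alpha$ and
$$W_2^2(\mu, \nu) = \inf_{\alpha \in \mathbb{R}} \int_0^1 \big|F_\mu^{-1}(t) - (F_\nu - \alpha)^{-1}(t)\big|^2 \, \mathrm{d}t.$$
For all $\alpha \in \mathbb{R}$, let $f(\alpha) = \int_0^1 \big(F_\mu^{-1}(t) - (F_\nu - \alpha)^{-1}(t)\big)^2 \, \mathrm{d}t$. Then, we have, for all $\alpha \in \mathbb{R}$,
$$\begin{aligned} f(\alpha) &= \int_0^1 \big(F_\mu^{-1}(t) - t - \alpha\big)^2 \, \mathrm{d}t \\ &= \int_0^1 \big(F_\mu^{-1}(t) - t\big)^2 \, \mathrm{d}t + \alpha^2 - 2\alpha \int_0^1 \big(F_\mu^{-1}(t) - t\big) \, \mathrm{d}t \\ &= \int_0^1 \big(F_\mu^{-1}(t) - t\big)^2 \, \mathrm{d}t + \alpha^2 - 2\alpha \Big(\int_0^1 x \, \mathrm{d}\mu(x) - \frac{1}{2}\Big), \end{aligned}$$
where we used that $(F_\mu^{-1})_\# \mathrm{Unif}([0, 1]) = \mu$. Hence, $f'(\alpha) = 0 \iff \alpha = \int_0^1 x \, \mathrm{d}\mu(x) - \frac{1}{2}$.

Closed-form for empirical distributions. Let $(x_i)_{i=1}^n \in [0, 1[^n$ such that $x_1 < \dots < x_n$ and let $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ be a discrete distribution. To compute the closed-form of $W_2$ between $\mu_n$ and $\nu = \mathrm{Unif}(S^1)$, we first have that the optimal $\alpha$ is $\hat{\alpha}_n = \frac{1}{n} \sum_{i=1}^n x_i - \frac{1}{2}$. Moreover, we also have:
$$\begin{aligned} W_2^2(\mu_n, \nu) &= \int_0^1 \big(F_{\mu_n}^{-1}(t) - (t + \hat{\alpha}_n)\big)^2 \, \mathrm{d}t \\ &= \int_0^1 F_{\mu_n}^{-1}(t)^2 \, \mathrm{d}t - 2 \int_0^1 t F_{\mu_n}^{-1}(t) \, \mathrm{d}t - 2 \hat{\alpha}_n \int_0^1 F_{\mu_n}^{-1}(t) \, \mathrm{d}t + \frac{1}{3} + \hat{\alpha}_n + \hat{\alpha}_n^2. \end{aligned}$$
Then, by noticing that $F_{\mu_n}^{-1}(t) = x_i$ for all $t \in [\frac{i-1}{n}, \frac{i}{n}[$, we have
$$\int_0^1 t F_{\mu_n}^{-1}(t) \, \mathrm{d}t = \sum_{i=1}^n \int_{\frac{i-1}{n}}^{\frac{i}{n}} t x_i \, \mathrm{d}t = \frac{1}{2n^2} \sum_{i=1}^n x_i (2i - 1),$$
$$\int_0^1 F_{\mu_n}^{-1}(t)^2 \, \mathrm{d}t = \frac{1}{n} \sum_{i=1}^n x_i^2, \qquad \int_0^1 F_{\mu_n}^{-1}(t) \, \mathrm{d}t = \frac{1}{n} \sum_{i=1}^n x_i,$$
and we also have:
$$\hat{\alpha}_n + \hat{\alpha}_n^2 = \frac{1}{n} \sum_{i=1}^n x_i - \frac{1}{2} + \Big(\frac{1}{n} \sum_{i=1}^n x_i\Big)^2 + \frac{1}{4} - \frac{1}{n} \sum_{i=1}^n x_i = \Big(\frac{1}{n} \sum_{i=1}^n x_i\Big)^2 - \frac{1}{4}.$$
Then, by plugging these results into the expansion of $W_2^2(\mu_n, \nu)$ above (28), we obtain
$$\begin{aligned} W_2^2(\mu_n, \nu) &= \frac{1}{n} \sum_{i=1}^n x_i^2 - \frac{1}{n^2} \sum_{i=1}^n (2i - 1) x_i - 2 \Big(\frac{1}{n} \sum_{i=1}^n x_i\Big)^2 + \frac{1}{n} \sum_{i=1}^n x_i + \frac{1}{3} + \Big(\frac{1}{n} \sum_{i=1}^n x_i\Big)^2 - \frac{1}{4} \\ &= \frac{1}{n} \sum_{i=1}^n x_i^2 - \Big(\frac{1}{n} \sum_{i=1}^n x_i\Big)^2 + \frac{1}{n^2} \sum_{i=1}^n (n + 1 - 2i) x_i + \frac{1}{12}. \end{aligned}$$

A.2 PROOF OF EQUATION 17

Let $U \in V_{d,2}$.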
Then the great circle generated by $U \in V_{d,2}$ is defined as the intersection between $\mathrm{span}(UU^T)$ and $S^{d-1}$. And we have the following characterization:
$$\begin{aligned} x \in \mathrm{span}(UU^T) \cap S^{d-1} &\iff \exists y \in \mathbb{R}^d,\ x = UU^T y \text{ and } \|x\|_2^2 = 1 \\ &\iff \exists y \in \mathbb{R}^d,\ x = UU^T y \text{ and } \|UU^T y\|_2^2 = y^T UU^T y = \|U^T y\|_2^2 = 1 \\ &\iff \exists z \in S^1,\ x = Uz. \end{aligned}$$
And we deduce that
$$\forall U \in V_{d,2},\ x \in S^{d-1}, \quad P^U(x) = \operatorname{argmin}_{z \in S^1} d_{S^{d-1}}(x, Uz).$$

A.3 PROOF OF LEMMA 1

Let $U \in V_{d,2}$ and $x \in S^{d-1}$ such that $U^T x \neq 0$. Denote $U = (u_1\ u_2)$, i.e. the 2-plane $E$ is $E = \mathrm{span}(UU^T) = \mathrm{span}(u_1, u_2)$ and $(u_1, u_2)$ is an orthonormal basis of $E$. Then, for all $x \in S^{d-1}$, the projection on $E$ is $p^E(x) = \langle u_1, x\rangle u_1 + \langle u_2, x\rangle u_2 = UU^T x$. Now, let us compute the geodesic distance between $x \in S^{d-1}$ and $\frac{p^E(x)}{\|p^E(x)\|_2} \in E \cap S^{d-1}$:
$$d_{S^{d-1}}\Big(x, \frac{p^E(x)}{\|p^E(x)\|_2}\Big) = \arccos\Big(\Big\langle x, \frac{p^E(x)}{\|p^E(x)\|_2}\Big\rangle\Big) = \arccos(\|p^E(x)\|_2),$$
using that $x = p^E(x) + p^{E^\perp}(x)$. Let $y \in E \cap S^{d-1}$ be another point on the great circle. By the Cauchy-Schwarz inequality, we have $\langle x, y\rangle = \langle p^E(x), y\rangle \leq \|p^E(x)\|_2 \|y\|_2 = \|p^E(x)\|_2$. Therefore, using that arccos is decreasing on $(-1, 1)$,
$$d_{S^{d-1}}(x, y) = \arccos(\langle x, y\rangle) \geq \arccos(\|p^E(x)\|_2) = d_{S^{d-1}}\Big(x, \frac{p^E(x)}{\|p^E(x)\|_2}\Big).$$
Moreover, we have equality if and only if $y = \lambda p^E(x)$. And since $y \in S^{d-1}$, $|\lambda| = \frac{1}{\|p^E(x)\|_2}$. Using again that arccos is decreasing, we deduce that the minimum is attained at $y = \frac{p^E(x)}{\|p^E(x)\|_2} = \frac{UU^T x}{\|UU^T x\|_2}$. Finally, using that $\|UU^T x\|_2 = \sqrt{x^T UU^T UU^T x} = \sqrt{x^T UU^T x} = \|U^T x\|_2$, we deduce that
$$P^U(x) = \frac{U^T x}{\|U^T x\|_2}.$$
Finally, by noticing that the projection is unique if and only if $U^T x \neq 0$, and using (Bardelli & Mennucci, 2017, Proposition 4.2), which states that there is a unique projection for a.e. $x$, we deduce that $\{x \in S^{d-1},\ U^T x = 0\}$ has measure zero and hence, for a.e. $x \in S^{d-1}$, we have the result.
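As a numerical illustration of Lemma 1, the following minimal sketch (an illustration written for this appendix, not the released implementation) draws a random $U \in V_{d,2}$, applies $P^U(x) = U^T x / \|U^T x\|_2$, and compares the resulting geodesic distance with the distance to a fine grid of points on the great circle. The helper names `sample_stiefel`, `project_great_circle` and `geodesic` are hypothetical, and sampling $U$ through a QR decomposition of a Gaussian matrix is an assumption of the sketch.

```python
import numpy as np

def sample_stiefel(d, rng):
    """Draw U with orthonormal columns (an element of V_{d,2}) via QR."""
    q, r = np.linalg.qr(rng.standard_normal((d, 2)))
    # Fix the column signs so that the draw does not depend on the QR convention.
    return q * np.sign(np.diag(r))

def project_great_circle(x, U):
    """Coordinates on S^1 of the projection: U^T x / ||U^T x||_2 (needs U^T x != 0)."""
    c = U.T @ x
    return c / np.linalg.norm(c)

def geodesic(x, y):
    """Geodesic distance on the unit sphere."""
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

rng = np.random.default_rng(0)
d = 5
U = sample_stiefel(d, rng)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

z = project_great_circle(x, U)      # point of S^1
best = geodesic(x, U @ z)           # distance from x to its projection U z

# Distances from x to a fine grid of points U z' on the same great circle.
angles = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
circle = U @ np.stack([np.cos(angles), np.sin(angles)])   # shape (d, 2000)
others = np.arccos(np.clip(x @ circle, -1.0, 1.0))
```

In this sketch, `best` should agree with $\arccos(\|U^T x\|_2)$ and should not exceed any entry of `others`, which is the content of the lemma.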
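A.4 PROOF OF PROPOSITION 2

Let $f \in L^1(S^{d-1})$, $g \in C_0(S^1 \times V_{d,2})$, then by Fubini's theorem,
$$\begin{aligned} \langle \tilde{R}f, g\rangle_{S^1 \times V_{d,2}} &= \int_{V_{d,2}} \int_{S^1} \tilde{R}f(z, U)\, g(z, U) \, \mathrm{d}z\, \mathrm{d}\sigma(U) \\ &= \int_{V_{d,2}} \int_{S^1} \int_{S^{d-1}} f(x)\, \mathbb{1}_{\{z = P^U(x)\}}\, g(z, U) \, \mathrm{d}x\, \mathrm{d}z\, \mathrm{d}\sigma(U) \\ &= \int_{S^{d-1}} f(x) \int_{V_{d,2}} \int_{S^1} g(z, U)\, \mathbb{1}_{\{z = P^U(x)\}} \, \mathrm{d}z\, \mathrm{d}\sigma(U)\, \mathrm{d}x \\ &= \int_{S^{d-1}} f(x) \int_{V_{d,2}} g\big(P^U(x), U\big) \, \mathrm{d}\sigma(U)\, \mathrm{d}x \\ &= \int_{S^{d-1}} f(x)\, \tilde{R}^* g(x) \, \mathrm{d}x = \langle f, \tilde{R}^* g\rangle_{S^{d-1}}. \end{aligned}$$

A.5 PROOF OF PROPOSITION 3

Let $g \in C_0(S^1 \times V_{d,2})$,
$$\begin{aligned} \int_{V_{d,2}} \int_{S^1} g(z, U)\, (\tilde{R}\mu)^U(\mathrm{d}z)\, \mathrm{d}\sigma(U) &= \int_{S^1 \times V_{d,2}} g(z, U)\, \mathrm{d}(\tilde{R}\mu)(z, U) \\ &= \int_{S^{d-1}} \tilde{R}^* g(x)\, \mathrm{d}\mu(x) \\ &= \int_{S^{d-1}} \int_{V_{d,2}} g\big(P^U(x), U\big)\, \mathrm{d}\sigma(U)\, \mathrm{d}\mu(x) \\ &= \int_{V_{d,2}} \int_{S^{d-1}} g\big(P^U(x), U\big)\, \mathrm{d}\mu(x)\, \mathrm{d}\sigma(U) \\ &= \int_{V_{d,2}} \int_{S^1} g(z, U)\, \mathrm{d}(P^U_\# \mu)(z)\, \mathrm{d}\sigma(U). \end{aligned}$$
Hence, for $\sigma$-almost every $U \in V_{d,2}$, $(\tilde{R}\mu)^U = P^U_\# \mu$.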

A.6 STUDY OF THE SPHERICAL RADON TRANSFORM R

In this Section, we first discuss the set of integration of the spherical Radon transform R (19). We further show that it is related to the hemispherical Radon transform and we derive its kernel. Set of integration. While the classical Radon transform integrates over hyperplanes of R d and the generalized Radon transform integrates over hypersurfaces (Kolouri et al., 2019) , the set of integration of the spherical Radon transform ( 19) is a half of a "big circle", i.e. half of the intersection between a hyperplane and S d-1 (Rubin, 2003) . We illustrate this on S 2 in Figure 6 . On S 2 , the intersection between a hyperplane and S 2 is a great circle.  Proposition 6. Let U ∈ V d,2 , z ∈ S 1 . The set of integration of (19) is {x ∈ S d-1 , P U (x) = z} = {x ∈ F ∩ S d-1 , ⟨x, U z⟩ > 0}, ( ) where F = span(U U T ) ⊥ ⊕ span(U z). Proof. Let U ∈ V d,2 , z ∈ S 1 . Denote E = span(U U T ) the 2-plane generating the great circle, and E ⊥ its orthogonal complementary. Hence, E ⊕ E ⊥ = R d and dim(E ⊥ ) = d -2. Now, let F = E ⊥ ⊕ span(U z). Since U z = U U T U z ∈ E, we have that dim(F ) = d -1. Hence, F is a hyperplane and F ∩ S d-1 is a "big circle" (Rubin, 2003) , i.e. a (d -2)-dimensional subsphere of S d-1 . Now, for the first inclusion, let x ∈ {x ∈ S d-1 , P U (x) = z}. First, we show that x ∈ F ∩ S d-1 . By Lemma 1 and hypothesis, we know that P U (x) = U T x ∥U T x∥2 = z. By denoting by p E the projection on E, we have: p E (x) = U U T x = U (∥U T x∥ 2 z) = ∥U T x∥ 2 U z ∈ span(U z). Hence, x = p E (x) + x E ⊥ = ∥U T x∥ 2 U z + x E ⊥ ∈ F . Moreover, as ⟨x, U z⟩ = ∥U T x∥ 2 ⟨U z, U z⟩ = ∥U T x∥ 2 > 0, we deduce that x ∈ {F ∩ S d-1 , ⟨x, U z⟩ > 0}. For the other inclusion, let x ∈ {F ∩ S d-1 , ⟨x, U z⟩ > 0}. Since x ∈ F , we have x = x E ⊥ + λU z, λ ∈ R. Hence, using Lemma 1, P U (x) = U T x ∥U T x∥ 2 = λ |λ| z ∥z∥ 2 = sign(λ)z. But, we also have ⟨x, U z⟩ = λ∥U z∥ 2 2 = λ > 0. Therefore, sign(λ) = 1 and P U (x) = z. 
Finally, we conclude that {x ∈ S d-1 , P U (x) = z⟩} = {x ∈ F ∩ S d-1 , ⟨x, U z⟩ > 0}. Link with Hemispherical transform. Since the intersection between a hyperplane and S d-1 is isometric to S d-2 (Jung et al., 2012) , we can relate R to the hemispherical transform H (Rubin, 2003) on S d-2 . First, the hemispherical transform of a function f ∈ L 1 (S d-1 ) is defined as ∀x ∈ S d-1 , Hf (x) = S d-1 f (y)1 {⟨x,y⟩>0} dy. ( ) From Proposition 6, we can write the spherical Radon transform (19) as a hemispherical transform on S d-2 . Proposition 7. Let f ∈ L 1 (S d-1 ), U ∈ V d,2 and z ∈ S 1 , then Rf (z, U ) = S d-2 f (x)1 {⟨x, Ũ z⟩>0} dx = H f ( Ũ z), where for all x ∈ S d-2 , f (x) = f (O T Jx) with O the rotation matrix such that for all x ∈ F , Ox ∈ span(e 1 , . . . , e d-1 ) where (e 1 , . . . , e d ) denotes the canonical basis, and J = I d-1 0 1,d-1 , and Ũ = J T OU ∈ R (d-1)×2 . Proof. Let f ∈ L 1 (S d-1 ), z ∈ S 1 , U ∈ V d,2 , then by Proposition 6, Rf (z, U ) = S d-1 ∩F f (x)1 {⟨x,U z⟩>0} dx. F is a hyperplane. Let O ∈ R d×d be the rotation such that for all x ∈ F , Ox ∈ span(e 1 , . . . , e d-1 ) = F where (e 1 , . . . , e d ) is the canonical basis. By applying the change of variable Ox = y, and since O -1 = O T , det O = 1, we obtain Rf (z, U ) = O(F ∩S d-1 ) f (O T y)1 {⟨O T y,U z⟩>0} dy = F ∩S d-1 f (O T y)1 {⟨y,OU z⟩>0} dy. (47) Now, we have that OU ∈ V d,2 since (OU ) T (OU ) = I 2 , and since U z ∈ F , OU z ∈ F . For all y ∈ F , we have ⟨y, 1) , then for all y ∈ F ∩ S d-1 , e d ⟩ = y d = 0. Let J = I d-1 0 1,d-1 ∈ R d×(d- y = J ỹ where ỹ ∈ S d-2 is composed of the d -1 first coordinates of y. Let's define, for all ỹ ∈ S d-2 , f (ỹ) = f (O T J ỹ), Ũ = J T OU . Then, since F ∩ S d-1 ∼ = S d-2 , we can write: Rf (z, U ) = S d-2 f (ỹ)1 {⟨ỹ, Ũ z⟩>0} dỹ = H f ( Ũ z). Kernel of R. By exploiting the expression using the hemispherical transform in Proposition 7, we can derive its kernel in Appendix A.7.

A.7 PROOF OF PROPOSITION 4

First, we recall Lemma 2.3 of (Rubin, 1999) on S d-2 . Lemma 2 (Lemma 2.3 (Rubin, 1999) ). ker(H) = {µ ∈ M even (S d-2 ), µ(S d-2 ) = 0} where M even is the set of even measures, i.e. measures such that for all f ∈ C(S d-2 ), ⟨µ, f ⟩ = ⟨µ, f -⟩ where f -(x) = f (-x) for all x ∈ S d-2 . Let µ ∈ M ac (S d-1 ). First, we notice that the density of Rµ w.r.t. λ ⊗ σ is, for all z ∈ S 1 , U ∈ V d,2 , ( Rµ)(z, U ) = S d-1 1 {P U (x)=z} dµ(x) = F ∩S d-1 1 {⟨x,U z⟩>0} dµ(x). ( ) Indeed, using Proposition 2, and Proposition 6, we have for all g ∈ C 0 (S 1 × V d,2 ), ⟨ Rµ, g⟩ S 1 ×V d,2 = ⟨µ, R * g⟩ S d-1 = S d-1 R * g(x)dµ(x) = S d-1 V d,2 S 1 g(z, U )1 {z=P U (x)} dzdσ(U )dµ(x) = V d,2 ×S 1 g(z, U ) S d-1 1 {z=P U (x)} dµ(x) dzdσ(U ) = V d,2 ×S 1 g(z, U ) F ∩S d-1 1 {⟨x,U z⟩>0} dµ(x) dzdσ(U ). (50) Hence, using Proposition 7, we can write ( Rµ)(z, U ) = (Hμ)( Ũ z) where μ = J T # O # µ. Now, let µ ∈ ker( R), then for all z ∈ S 1 , U ∈ V d,2 , Rµ(z, U ) = Hμ( Ũ z) = 0 and hence μ ∈ ker(H) = {μ ∈ M even (S d-2 ), μ(S d-2 ) = 0}. First, let's show that µ ∈ M even (S d-1 ). Let f ∈ C(S d-1 ) and U ∈ V d,2 , then, by using the same notation as in Propositions 6 and 7, we have ⟨µ, f ⟩ S d-1 = S d-1 f (x)dµ(x) = S d-1 S 1 f (x)1 {z=P U (x)} dz dµ(x) = S 1 S d-1 f (x)1 {z=P U (x)} dµ(x)dz = S 1 F ∩S d-1 f (x)1 {⟨x,U z⟩>0} dµ(x)dz by Prop. 6 = S 1 S d-2 f (y)1 {⟨y, Ũ z⟩>0} dμ(y)dz = S 1 ⟨Hμ, f ⟩ S d-2 dz = S 1 ⟨μ, H f ⟩ S d-2 dz = S 1 ⟨μ, (H f ) -⟩ S d-2 dz since μ ∈ M even = S d-1 f -(x)dµ(x) = ⟨µ, f -⟩ S d-1 , using for the last line all the opposite transformations. Therefore, µ ∈ M even (S d-1 ). Now, we need to find on which set the measure is null. We have ∀z ∈ S 1 , U ∈ V d,2 , μ(S d-2 ) = 0 ⇐⇒ ∀z ∈ S 1 , U ∈ V d,2 , µ(O -1 ((J T ) -1 (S d-2 ))) = µ(F ∩ S d-1 ) = 0. ( ) Hence, we deduce that ker( R) = {µ ∈ M even (S d-1 ), ∀U ∈ V d,2 , ∀z ∈ S 1 , F = span(U U T ) ⊥ ∩ span(U z), µ(F ∩ S d-1 ) = 0}. 
(53) Moreover, we have that ∪ U,z F U,z ∩ S d-1 = {H ∩ S d-1 ⊂ R d , dim(H) = d -1}. Indeed, on the one hand, let H an hyperplane, x ∈ H ∩ S d-1 , U ∈ V d,2 , and note z = P U (x). Then, x ∈ F ∩ S d-1 by Proposition 6 and H ∩ S d-1 ⊂ ∪ U,z F U,z . On the other hand, let U ∈ V d,2 , z ∈ S 1 , F is a hyperplane since dim(F ) = d -1 and therefore F ∩ S d-1 ⊂ {H, dim(H) = d -1}. Finally, we deduce that ker( R) = µ ∈ M even (S d-1 ), ∀H ∈ G d,d-1 , µ(H ∩ S d-1 ) . A.8 PROOF OF PROPOSITION 5 Let p ≥ 1. First, it is straightforward to see that for all µ, ν ∈ P p (S d-1 ), SSW p (µ, ν) ≥ 0, SSW p (µ, ν) = SSW p (ν, µ), µ = ν =⇒ SSW p (µ, ν) = 0 and that we have the triangular inequality since ∀µ, ν, α ∈ P p (S d-1 ), SSW p (µ, ν) = V d,2 W p p (P U # µ, P U # ν) dσ(U ) 1 p ≤ V d,2 W p (P U # µ, P U # α) + W p (P U # α, P U # ν) p dσ(U ) 1 p ≤ V d,2 W p p (P U # µ, P U # α) dσ(U ) 1 p + V d,2 W p p (P U # α, P U # ν) dσ(U ) 1 p = SSW p (µ, α) + SSW p (α, ν), (55) using the triangular inequality for W p and the Minkowski inequality. Therefore, it is at least a pseudo-distance. To be a distance, we also need SSW p (µ, ν) = 0 =⇒ µ = ν. Suppose that SSW p (µ, ν) = 0. Since, for all U ∈ V d,2 , W p p (P U # µ, P U # ν) ≥ 0, SSW p p (µ, ν) = 0 implies that for σ-ae U ∈ V d,2 , W p p (P U # µ, P U # ν) = 0 and hence P U # µ = P U # ν or ( Rµ) U = ( Rν) U for σ-ae U ∈ V d,2 since W p is a distance on the circle. Therefore, it is a distance on the sets of injectivity of R.

A.9 ADDITIONAL PROPERTIES

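In this Section, we derive additional properties of SSW. First, we will show that the weak convergence implies the convergence w.r.t. SSW. Then, we will show that the sample complexity is independent of the dimension. Finally, we will derive the projection complexity of SSW.

Convergence Properties.

Proposition 8. Let $(\mu_k)_k$, $\mu \in \mathcal{P}_p(S^{d-1})$ such that $\mu_k \xrightarrow[k \to \infty]{} \mu$ weakly, then
$$SSW_p(\mu_k, \mu) \xrightarrow[k \to \infty]{} 0.$$

Proof. Since the Wasserstein distance metrizes the weak convergence (Corollary 6.11 (Villani, 2009)), we have
$$P^U_\# \mu_k \xrightarrow[k \to \infty]{} P^U_\# \mu \ \text{(by continuity)} \iff W_p^p(P^U_\# \mu_k, P^U_\# \mu) \xrightarrow[k \to \infty]{} 0.$$
Since the circle is compact, $W_p^p(P^U_\# \mu_k, P^U_\# \mu)$ is bounded uniformly in $U$ and $k$, and the result follows by integrating over $U$ with the dominated convergence theorem.

Sample Complexity. We show here that the sample complexity is independent of the dimension. Actually, this is a well-known property of sliced-based distances and it was studied first in (Nadjahi et al., 2020). To the best of our knowledge, the sample complexity of the Wasserstein distance on the circle has not been previously derived. We suppose in the next proposition that it is known, as we mainly want to show that the sample complexity of SSW does not depend on the dimension.

Proposition 9. Let $p \geq 1$. Suppose that for $\mu, \nu \in \mathcal{P}(S^1)$, with empirical measures $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ and $\hat{\nu}_n = \frac{1}{n} \sum_{i=1}^n \delta_{y_i}$, where $(x_i)_i \sim \mu$, $(y_i)_i \sim \nu$ are independent samples, we have
$$\mathbb{E}\big[|W_p^p(\hat{\mu}_n, \hat{\nu}_n) - W_p^p(\mu, \nu)|\big] \leq \beta(p, n).$$
Then, for any $\mu, \nu \in \mathcal{P}_{p,ac}(S^{d-1})$ with empirical measures $\hat{\mu}_n$ and $\hat{\nu}_n$, we have
$$\mathbb{E}\big[|SSW_p^p(\hat{\mu}_n, \hat{\nu}_n) - SSW_p^p(\mu, \nu)|\big] \leq \beta(p, n).$$

Proof. By using the triangle inequality, Fubini-Tonelli, and the hypothesis on the sample complexity of $W_p^p$ on $S^1$, we obtain:
$$\begin{aligned} \mathbb{E}\big[|SSW_p^p(\hat{\mu}_n, \hat{\nu}_n) - SSW_p^p(\mu, \nu)|\big] &= \mathbb{E}\Big[\Big|\int_{V_{d,2}} \big(W_p^p(P^U_\# \hat{\mu}_n, P^U_\# \hat{\nu}_n) - W_p^p(P^U_\# \mu, P^U_\# \nu)\big)\, \mathrm{d}\sigma(U)\Big|\Big] \\ &\leq \mathbb{E}\Big[\int_{V_{d,2}} \big|W_p^p(P^U_\# \hat{\mu}_n, P^U_\# \hat{\nu}_n) - W_p^p(P^U_\# \mu, P^U_\# \nu)\big|\, \mathrm{d}\sigma(U)\Big] \\ &= \int_{V_{d,2}} \mathbb{E}\Big[\big|W_p^p(P^U_\# \hat{\mu}_n, P^U_\# \hat{\nu}_n) - W_p^p(P^U_\# \mu, P^U_\# \nu)\big|\Big]\, \mathrm{d}\sigma(U) \\ &\leq \int_{V_{d,2}} \beta(p, n)\, \mathrm{d}\sigma(U) = \beta(p, n). \end{aligned}$$

Projection Complexity.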
We derive in the next proposition the projection complexity, which refers to the convergence rate of the Monte Carlo approximate w.r.t of the number of projections L towards the true integral. Note that we find the typical rate of Monte Carlo estimates, and that it has already been derive for sliced-based distances in (Nadjahi et al., 2020) . Proposition 10. Let p ≥ 1, µ, ν ∈ P p,ac (S d-1 ). Then, the error made with the Monte Carlo estimate of SSW p can be bounded as E U | SSW p p,L (µ, ν) -SSW p p (µ, ν)| 2 ≤ 1 L V d,2 W p p (P U # µ, P U # ν) -SSW p p (µ, ν) 2 dσ(U ) = 1 L Var U W p p (P U # µ, P U # ν) , p p,L (µ, ν) = 1 L L i=1 W p p (P Ui # µ, P U i # ν) with (U i ) L i=1 ∼ σ independent samples. Proof. Let (U i ) L i=1 be iid samples of σ. Then, by first using Jensen inequality and then remembering that E U [W p p (P U # µ, P U # ν)] = SSW p p (µ, ν), we have E U | SSW p p,L (µ, ν) -SSW p p (µ, ν)| 2 ≤ E U SSW p p,L (µ, ν) -SSW p p (µ, ν) 2 = E U   1 L L i=1 W p p (P Ui # µ, P Ui # ν) -SSW p p (µ, ν) 2   = 1 L 2 Var U L i=1 W p p (P Ui # µ, P Ui # ν) = 1 L Var U W p p (P U # µ, P U # ν) = 1 L V d,2 W p p (P U # µ, P U # ν) -SSW p p (µ, ν) 2 dσ(U ). B BACKGROUND ON THE SPHERE B.1 UNIQUENESS OF THE PROJECTION Here, we discuss the uniqueness of the projection P U for almost every x. For that, we recall some results of (Bardelli & Mennucci, 2017) . Let M be a closed subset of a complete finite-dimensional Riemannian manifold N . Let d be the Riemannian distance on N . Then, the distance from the set M is defined as d M (x) = inf y∈M d(x, y). ( ) The infimum is a minimum since M is closed and N locally compact, but the minimum might not be unique. When it is unique, let's denote the point which attains the minimum as π(x), i.e. d(x, π(x)) = d M (x). Proposition 11 (Proposition 4.2 in (Bardelli & Mennucci, 2017) ). Let M be a closed set in a complete m-dimensional Riemannian manifold N . 
Then, for almost every x, there exists a unique point π(x) ∈ M that realizes the minimum of the distance from x. From this Proposition, they further deduce that the measure π # γ is well defined on M with γ a locally absolutely continuous measure w.r.t. the Lebesgue measure. In our setting, for all U ∈ V d,2 , we want to project a measure µ ∈ P(S d-1 ) on the great circle span(U U T ) ∩ S d-1 . Hence, we have N = S d-1 , which is a complete finite-dimensional Riemannian manifold, and M = span(U U T ) ∩ S d-1 , which is a closed set in N . Therefore, we can apply Proposition 11 and the push-forward measures are well defined for absolutely continuous measures.

B.2 OPTIMIZATION ON THE SPHERE

Let $f : S^{d-1} \to \mathbb{R}$ be some functional on the sphere. Then, we can perform a gradient descent on a Riemannian manifold by following the geodesics, which are the counterpart of straight lines in $\mathbb{R}^d$. Hence, the gradient descent algorithm (Absil et al., 2009; Bonnabel, 2013) reads as
$$\forall k \geq 0, \quad x_{k+1} = \exp_{x_k}\big(-\gamma\, \mathrm{grad} f(x_k)\big),$$
where for all $x \in S^{d-1}$, $\exp_x : T_x S^{d-1} \to S^{d-1}$ is a map from the tangent space $T_x S^{d-1} = \{v \in \mathbb{R}^d,\ \langle x, v\rangle = 0\}$ to $S^{d-1}$ such that for all $v \in T_x S^{d-1}$, $\exp_x(v) = \gamma_v(1)$ with $\gamma_v$ the unique geodesic starting from $x$ with speed $v$, i.e. $\gamma_v(0) = x$ and $\gamma_v'(0) = v$. For $S^{d-1}$, the exponential map is known and is
$$\forall x \in S^{d-1},\ \forall v \in T_x S^{d-1}, \quad \exp_x(v) = \cos(\|v\|_2)\, x + \sin(\|v\|_2)\, \frac{v}{\|v\|_2}.$$
Moreover, the Riemannian gradient on $S^{d-1}$ is known as (Absil et al., 2009, Eq. 3.37)
$$\mathrm{grad} f(x) = \mathrm{Proj}_x(\nabla f(x)) = \nabla f(x) - \langle \nabla f(x), x\rangle x,$$
with $\mathrm{Proj}_x$ denoting the orthogonal projection on $T_x S^{d-1}$. For more details, we refer to (Absil et al., 2009; Boumal, 2022).
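The update above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the code used in the experiments: the function names (`exp_map`, `riemannian_grad`, `riemannian_gd`), the fixed step size and the toy objective $f(x) = -\langle a, x\rangle$ are all chosen here for the example.

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: cos(|v|) x + sin(|v|) v / |v|."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:          # zero tangent vector: stay at x
        return x
    return np.cos(nv) * x + np.sin(nv) * v / nv

def riemannian_grad(x, egrad):
    """Project the Euclidean gradient onto the tangent space at x."""
    return egrad - (egrad @ x) * x

def riemannian_gd(egrad_fn, x0, step=0.1, n_iter=500):
    """Riemannian gradient descent: x_{k+1} = exp_{x_k}(-step * grad f(x_k))."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        g = riemannian_grad(x, egrad_fn(x))
        x = exp_map(x, -step * g)
        x /= np.linalg.norm(x)   # guard against round-off drift off the sphere
    return x

# Toy objective (an assumption of this example): f(x) = -<a, x>,
# whose Euclidean gradient is -a and whose minimizer on the sphere is a / ||a||.
a = np.array([1.0, 2.0, -2.0])
x_star = riemannian_gd(lambda x: -a, x0=np.array([1.0, 0.0, 0.0]))
```

Because each iterate is produced by the exponential map, it stays on the sphere without any explicit constraint; the renormalization only compensates for floating-point error.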

B.3 VON MISES-FISHER DISTRIBUTION

The von Mises-Fisher (vMF) distribution is a distribution on S d-1 characterized by a concentration parameter κ > 0 and a location parameter µ ∈ S d-1 through the density ∀θ ∈ S d-1 , f vMF (θ; µ, κ) = κ d/2-1 (2π) d/2 I d/2-1 (κ) exp(κµ T θ), where I ν (κ) = 1 2π π 0 exp(κ cos(θ)) cos(νθ)dθ is the modified Bessel function of the first kind. Several algorithms allow to sample from it, see e.g. (Wood, 1994; Ulrich, 1984) for algorithms using rejection sampling or (Kurz & Hanebeck, 2015) without rejection sampling. For d = 1, the vMF coincides with the von Mises (vM) distribution, which has for density ∀θ ∈ [-π, π[, f vM (θ; µ, κ) = 1 I 0 (κ) exp(κ cos(θ -µ)), with µ ∈ [0, 2π[ the mean direction and κ > 0 its concentration parameter. We refer to (Mardia et al., 2000, Section 3.5 and Chapter 9) for more details on these distributions. In particular, for κ = 0, the vMF (resp. vM) distribution coincides with the uniform distribution on the sphere (resp. the circle). Jung (2021) studied the law of the projection of a vMF on a great circle. In particular, they showed that, while the vMF plays the role of the normal distributions for directional data, the projection actually does not follow a von Mises distribution. More precisely, they showed the following theorem: Theorem 1 (Theorem 3.1 in (Jung, 2021) ). Let d ≥ 3, X ∼ vMF(µ, κ) ∈ S d-1 , U ∈ V d,2 and T = P U (X) the projection on the great circle generated by U . Then, the density function of T is ∀t ∈ [-π, π[, f (t) = 1 0 f R (r)f vM (t; 0, κ cos(δ)r) dr, ( ) where δ is the deviation of the great circle (geodesic) from µ and the mixing density is ∀r ∈]0, 1[, f R (r) = 2 I * ν (κ) I 0 (κ cos(δ)r)r(1 -r 2 ) ν-1 I * ν-1 (κ sin(δ) 1 -r 2 ), with ν = (d -2)/2 and I * ν (z) = ( z 2 ) -ν I ν (z) for z > 0, I * ν (0) = 1/Γ(ν + 1). Hence, as noticed by Jung (2021) , in the particular case κ = 0, i.e. 
$X \sim \mathrm{Unif}(S^{d-1})$, then $f(t) = \int_0^1 f_R(r)\, f_{vM}(t; 0, 0)\, \mathrm{d}r = f_{vM}(t; 0, 0) \int_0^1 f_R(r)\, \mathrm{d}r = f_{vM}(t; 0, 0)$, and hence $T \sim \mathrm{Unif}(S^1)$.
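Since the projection of the uniform distribution on the sphere is uniform on the circle, the closed form of Appendix A.1 can be applied to each projection when the second measure is $\mathrm{Unif}(S^{d-1})$. The sketch below illustrates this under several assumptions: the circle is identified with $[0, 1[$ through the angle (so distances are in units where the circle has length 1), the projection directions are drawn by QR decomposition of Gaussian matrices, and the function names are hypothetical rather than taken from the released code. A quadrature of $\int_0^1 (F^{-1}_{\mu_n}(t) - t - \hat{\alpha}_n)^2 \mathrm{d}t$ is included as a sanity check of the closed form.

```python
import numpy as np

def w2_sq_circle_vs_uniform(t):
    """Closed form of W_2^2(mu_n, Unif(S^1)) for points t in [0, 1)."""
    x = np.sort(t)
    n = x.shape[0]
    i = np.arange(1, n + 1)
    return (np.mean(x ** 2) - np.mean(x) ** 2
            + np.sum((n + 1 - 2 * i) * x) / n ** 2 + 1.0 / 12.0)

def w2_sq_numeric(t, grid=200_000):
    """Midpoint quadrature of int_0^1 (F^{-1}(s) - s - alpha_n)^2 ds."""
    x = np.sort(t)
    n = x.shape[0]
    alpha = np.mean(x) - 0.5
    s = (np.arange(grid) + 0.5) / grid
    quantile = x[np.minimum((s * n).astype(int), n - 1)]
    return np.mean((quantile - s - alpha) ** 2)

def ssw2_sq_vs_uniform(X, n_proj=200, seed=0):
    """Monte Carlo estimate of SSW_2^2 between the rows of X and Unif(S^{d-1})."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        # Orthonormal 2-frame; only the plane it spans matters for this quantity.
        U, _ = np.linalg.qr(rng.standard_normal((d, 2)))
        c = X @ U                                  # rows are U^T x
        ang = np.arctan2(c[:, 1], c[:, 0])         # angle of U^T x / ||U^T x||
        t = np.mod((ang + np.pi) / (2.0 * np.pi), 1.0)
        total += w2_sq_circle_vs_uniform(t)
    return total / n_proj

rng = np.random.default_rng(1)
X_unif = rng.standard_normal((2000, 3))
X_unif /= np.linalg.norm(X_unif, axis=1, keepdims=True)
value_uniform = ssw2_sq_vs_uniform(X_unif, n_proj=50)
```

Each projection costs one sort, so the estimate runs in $O(L\, n \log n)$ for $L$ projections and $n$ samples. For samples that are themselves uniform, `value_uniform` should be close to zero, while a point mass gives $1/12$ on every projection.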

B.4 NORMALIZING FLOWS ON THE SPHERE

Normalizing flows (Papamakarios et al., 2021) are invertible transformations. There has been a recent interest in defining such transformations on manifolds, and in particular on the sphere (Rezende et al., 2020; Cohen et al., 2021; Rezende & Racanière, 2021).

Exponential map normalizing flows. Here, we implemented the exponential map normalizing flows introduced in (Rezende et al., 2020). The transformation $T$ is
$$\forall x \in S^{d-1}, \quad z = T(x) = \exp_x\big(\mathrm{Proj}_x(\nabla \phi(x))\big),$$
where $\phi(x) = \sum_{i=1}^K \frac{\alpha_i}{\beta_i} e^{\beta_i (x^T \mu_i - 1)}$, $\alpha_i \geq 0$, $\sum_i \alpha_i \leq 1$, $\mu_i \in S^{d-1}$ and $\beta_i > 0$ for all $i$. $(\alpha_i)_i$, $(\beta_i)_i$ and $(\mu_i)_i$ are the learnable parameters. The density of $z$ can be obtained as
$$p_Z(z) = p_X(x) \det\big(E(x)^T J_T(x)^T J_T(x) E(x)\big)^{-\frac{1}{2}},$$
where $J_T$ is the Jacobian in the embedded space and $E(x)$ is the matrix whose columns form an orthonormal basis of $T_x S^{d-1}$. The common way of training normalizing flows is to use either the reverse or forward KL divergence. Here, we use them with a different loss, namely SSW.

Stereographic projection. The stereographic projection $\rho : S^{d-1} \to \mathbb{R}^{d-1}$ maps the sphere $S^{d-1}$ to the Euclidean space. A strategy first introduced in (Gemici et al., 2016) is to use it before applying a normalizing flow in the Euclidean space in order to map some prior, which allows to perform density estimation. More precisely, the stereographic projection is defined as
$$\forall x \in S^{d-1}, \quad \rho(x) = \frac{x_{2:d}}{1 + x_1},$$
and its inverse is
$$\forall u \in \mathbb{R}^{d-1}, \quad \rho^{-1}(u) = \begin{pmatrix} \frac{2u}{\|u\|_2^2 + 1} \\ 1 - \frac{2}{\|u\|_2^2 + 1} \end{pmatrix}.$$
Gemici et al. (2016) derived the change of variable formula for this transformation, which comes from the theory of probability between manifolds. If we have a transformation $T = f \circ \rho$, where $f$ is a normalizing flow on $\mathbb{R}^{d-1}$, e.g.
a RealNVP (Dinh et al., 2016), then the log density of the target distribution can be obtained as
$$\begin{aligned} \log p(x) &= \log p_Z(z) + \log |\det J_f(z)| - \frac{1}{2} \log \big|\det J_{\rho^{-1}}^T J_{\rho^{-1}}(\rho(x))\big| \\ &= \log p_Z(z) + \log |\det J_f(z)| - d \log\Big(\frac{2}{\|\rho(x)\|_2^2 + 1}\Big), \end{aligned}$$
where we used the formula of (Gemici et al., 2016) for the change of variable formula of $\rho$, and where $p_Z$ is the density of some prior on $\mathbb{R}^{d-1}$, typically of a standard Gaussian. We refer to (Gemici et al., 2016; Mathieu & Nickel, 2020) for more details about these transformations.

C ADDITIONAL EXPERIMENTS C.1 EVOLUTION OF SSW BETWEEN VON MISES-FISHER DISTRIBUTIONS

The KL divergence between the von Mises-Fisher distribution and the uniform distribution has been derived analytically in (Davidson et al., 2018; Xu & Durrett, 2018) as
$$\mathrm{KL}\big(\mathrm{vMF}(\mu, \kappa)\,\|\,\mathrm{vMF}(\cdot, 0)\big) = \kappa \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)} + \Big(\frac{d}{2} - 1\Big) \log \kappa - \frac{d}{2} \log(2\pi) - \log I_{d/2-1}(\kappa) + \frac{d}{2} \log \pi + \log 2 - \log \Gamma\Big(\frac{d}{2}\Big). \quad (76)$$
We plot on Figure 7 the evolution of KL and SSW w.r.t. $\kappa$ for different dimensions. We observe a different trend: SSW seems to get lower with the dimension, contrary to KL. For SW, we used 100 projections (for memory reasons for $d = 100$), and computed it for $\kappa \in \{1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250\}$, 10 times by dimension and $\kappa$, and with 500 samples of both distributions.

As a sanity check, we compare on Figure 8 the evolution of SSW between vMF distributions where we fix $\mathrm{vMF}(\mu_0, 10)$ and we rotate the first vMF along a great circle. More precisely, we plot $SW_2^2\big(\mathrm{vMF}((1, 0, 0, \dots), 10), \mathrm{vMF}((\cos\theta, \sin\theta, 0, \dots), 10)\big)$ for $\theta \in \{\frac{k\pi}{6}\}_{k \in \{0, \dots, 12\}}$. As expected, we obtain a bell shape which is maximal when the second vMF distribution has $-\mu_0$ for location parameter. We observe a similar behavior between $SSW_2$, $SSW_1$ and $SW_2$ with different scales.

On Figure 9, we plot the evolution of SSW w.r.t. the number of projections for different dimensions. We observe that for around 100 projections, the variance seems to be low enough.

Nadjahi et al. (2020) proved that, contrary to the Wasserstein distance, the classical sliced-Wasserstein distance has a sample complexity independent of the dimension $d$. As shown in Proposition 9, we have similar results for SSW. We show it empirically on Figure 10 by plotting SSW and the Wasserstein distance (with geodesic distance) between samples of the uniform distribution on the sphere w.r.t. the number of samples. We observe indeed that the convergence rate of SSW is independent of the dimension.

it consists in simply following the geodesics of the regular ULA step, i.e.
∀k > 0, x k+1 = exp x k Proj x k (-γ∇V (x k ) + 2γZ) , Z ∼ N (0, I), where for the sphere, ∀x ∈ S d-1 , ∀v ∈ T x S d-1 , exp x (v) = x cos(∥v∥) + v ∥v∥ sin(∥v∥), Proj x is the projection on the tangent space T x S d-1 = {v ∈ R d , ⟨x, v⟩ = 0} (which is the orthogonal space) and is defined as Proj x (v) = v -⟨x, v⟩x. For more details, we refer to (Absil et al., 2009) . We use GLA here for simplicity and as a proof of concept. But note that GLA, as ULA, is biased and therefore the distribution learned will not be the exact true stationary distribution. However, a Metropolis-Hastings step at each iteration could be used to enforce the reversibility w.r. (2020) . This distribution has the advantage over the vMF distribution to allow for the direct use of the reparameterization trick since it does not require rejection sampling. The pdf is obtained as, ∀x ∈ S d-1 , p X (x; µ, κ) ∝ (1 + µ T x) κ with µ ∈ S d-1 and κ > 0. We can sample from drawing first Z ∼ Beta( d-1 2 + κ, d-1 2 ), v ∼ Unif(S d-2 ), then constructing T = 2Z -1 and Y = [T, v T √ 1 -T 2 ] T . Finally, apply a Householder reflection about µ to Y . All the operations are well differentiable and allow to apply the reparametrization trick. For the algorithm, see Algorithm 1 in (De Cao & Aziz, 2020) . Hence, in this case, if we denote g θ the map which takes samples from a uniform distribution on S d-2 and from a Beta distribution as input and outputs samples of power spherical distribution with parameters θ = (κ, µ), we can use it as the sampler. We test the algorithm with a target being a power spherical distribution of parameter µ = (0, 1, 0) and κ = 10, starting from µ = (1, 1, 1) and κ = 0.1. Performing 2000 optimization steps with a gradient descent (Riemannian gradient descent on µ to stay on the sphere), and 20 steps of the GLA algorithm, we are getting close enough to the true distribution as we can see on Figure 15 . 
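The sampling procedure described above for the power spherical distribution can be sketched as follows. This is a minimal NumPy illustration, not the algorithm of De Cao & Aziz (2020) verbatim: the function name `sample_power_spherical` is hypothetical, and the Householder reflection is written here, as an assumption, as the reflection that maps the first canonical basis vector $e_1$ onto $\mu$.

```python
import numpy as np

def sample_power_spherical(mu, kappa, n, rng):
    """Draw n samples on S^{d-1} with density proportional to (1 + mu^T x)^kappa."""
    d = mu.shape[0]
    # Z ~ Beta((d-1)/2 + kappa, (d-1)/2), then T = 2Z - 1 is the coordinate along mu.
    z = rng.beta((d - 1) / 2 + kappa, (d - 1) / 2, size=n)
    t = 2.0 * z - 1.0
    # v ~ Unif(S^{d-2}), obtained by normalizing Gaussian vectors.
    v = rng.standard_normal((n, d - 1))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    # Y = [T, sqrt(1 - T^2) v] is distributed around e_1.
    radial = np.sqrt(np.maximum(1.0 - t ** 2, 0.0))
    y = np.concatenate([t[:, None], radial[:, None] * v], axis=1)
    # Householder reflection sending e_1 to mu (identity if mu is already e_1).
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = e1 - mu
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:
        return y
    u /= norm_u
    return y - 2.0 * (y @ u)[:, None] * u

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, 0.0])
samples = sample_power_spherical(mu, kappa=10.0, n=20000, rng=rng)
```

Every operation after drawing `z` and `v` is a smooth function of `mu` and `kappa`, which is what makes the reparametrization trick applicable when the same steps are written in an automatic differentiation framework. Since the reflection is orthogonal, $\langle X, \mu\rangle = T$, so the sample mean of `samples @ mu` should be close to $\mathbb{E}[T] = \kappa / (d - 1 + \kappa)$.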
For the hyperparameters, we used a step size of $10^{-3}$ for GLA, 1000 projections to approximate SSW, a Riemannian gradient descent on the sphere (Absil et al., 2009) to learn the location parameter $\mu$ with a learning rate of 2, and a learning rate of 200 for $\kappa$. We performed $K = 2000$ steps and used $N = 500$ particles. i.e. the target $\nu$ has a density $p(x) \propto \sum_{k=1}^4 e^{10 x^T T_{s \to e}(\mu_k)}$ with $\mu_1 = (0.7, 1.5)$, $\mu_2 = (-1, 1)$, $\mu_3 = (0.6, 0.5)$ and $\mu_4 = (-0.7, 4)$. These are spherical coordinates, which are converted to Euclidean coordinates using $T_{s \to e}(\theta, \phi) = (\sin\phi \cos\theta, \sin\phi \sin\theta, \cos\phi)$. The exponential map normalizing flow is composed of $N = 6$ blocks with $K = 5$ components. We run the algorithm for 10000 iterations, with at each iteration 20 steps of GLA with $\gamma = 10^{-1}$ as learning rate, and one step of backpropagation through SSW using the Adam (Kingma & Ba, 2014) optimizer with a learning rate of $10^{-3}$. We report on Figure 16 the Mollweide projection of the learned density. Since we learn to sample from a noise distribution, here the uniform distribution on the sphere, we do not have direct access to the density and we report a kernel density estimate with a Gaussian kernel using the implementation of SciPy (Virtanen et al., 2020). We also report in Figure 17 the effective sample size (ESS) (Doucet et al., 2001; Liu & Chen, 1995) over the iterations. The ESS is estimated by (Rezende et al., 2020)

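$$\mathrm{ESS} = \frac{\mathrm{Var}_{\mathrm{Unif}}\big(e^{-\beta u(X)}\big)}{\mathrm{Var}_q\Big(\frac{e^{-\beta u(X)}}{q_\eta(X)}\Big)} \approx \frac{\big(\sum_{s=1}^S w_s\big)^2}{\sum_{s=1}^S w_s^2},$$

where $w_s = e^{-\beta u(x_s)} / q_\eta(x_s)$. The ESS is reported as a percentage of the sample size. Higher ESS indicates that the flow matches the target better (Rezende et al., 2020).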
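The sample-based ESS estimate can be computed stably from log-weights, as in the short sketch below. The function name `ess_fraction` and the choice to work with $\log w_s = -\beta u(x_s) - \log q_\eta(x_s)$ are assumptions made for this illustration; the output is the estimate divided by the sample size $S$, i.e. the fraction that is reported as a percentage.

```python
import numpy as np

def ess_fraction(log_w):
    """Return (sum_s w_s)^2 / (S * sum_s w_s^2) from log-weights log_w."""
    log_w = np.asarray(log_w, dtype=float)
    # Subtracting the maximum avoids overflow; the rescaling cancels in the ratio.
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / (w.size * np.sum(w ** 2))

# Equal weights: every sample counts, so the fraction is 1 (i.e. 100%).
ess_equal = ess_fraction(np.zeros(100))
# One dominant weight among four: effectively a single sample, fraction 1/4.
ess_degenerate = ess_fraction(np.array([0.0, -1000.0, -1000.0, -1000.0]))
```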

C.6 SLICED-WASSERSTEIN AUTOENCODER

We recall that in the WAE framework, we want to minimize L(f, g) = c x, g(f (x)) dµ(x) + λD(f # µ, p Z ), where f is an encoder, g a decoder, p Z a prior distribution, c some cost function and D is a divergence in the latent space. Several D were proposed. For example, Tolstikhin et al. Architecture and procedure. We first detail the hyperparameters and architectures of neural networks for MNIST and Fashion MNIST. For the encoder f and the decoder g, we use the same architecture as Kolouri et al. (2018) . For both the encoder and the decoder architecture, we use fully convolutional architectures with 3x3 convolutional filters. More precisely, the architecture of the encoder is x ∈ R 28×28 → Conv2d 16 → LeakyReLU 0.2 → Conv2d 16 → LeakyReLU 0.2 → AvgPool 2 → Conv2d 32 → LeakyReLU 0.2 → Conv2d 32 → LeakyReLU 0.2 → AvgPool 2 → Conv2d 64 → LeakyReLU 0.2 → Conv2d 64 → LeakyReLU 0.2 → AvgPool 2 → Flatten → FC 128 → ReLU → FC d Z → ℓ 2 normalization where d Z is the dimension of the latent space (either 11 for S 10 or 3 for S 2 ). The architecture of the decoder is To compare the different autoencoders, we used as the reconstruction loss the binary cross entropy, λ = 10, Adam (Kingma & Ba, 2014) as optimizer with a learning rate of 10 -3 and Pytorch's default momentum parameters for 800 epochs with batch of size n = 500. Moreover, when using SW type of distance, we approximated it with L = 1000 projections. z ∈ R d Z → For the experiment on CIFAR10, we use the same architecture as Tolstikhin et al. (2018) . More precisely, the architecture of the encoder is We conduct experiments using SSW to prevent collapsing representations in contrastive self-supervised learning (SSL) models. Such contrastive losses on the hypersphere have exhibited great representative capacity (Wu et al., 2018; Chen et al., Caron et al., 2020) on unlabelled datasets by learning robust image representations invariantly to augmentations. 
As proposed in (Wang & Isola, 2020), the contrastive objective can be decomposed into an alignment loss, which forces positive representations coming from the same image to be similar, and a uniformity loss, which preserves maximal information of the feature distribution and hence avoids collapsing representations. Without the uniformity loss, the representations tend to converge towards a constant representation, which yields the best possible alignment loss but contains no information about the original images. Wang & Isola (2020) propose to enforce uniformity by leveraging the Gaussian potential kernel, which is bound to the uniform distribution on the sphere. This formulation is also related to the denominator of the contrastive loss as specified in Chen et al. (2020a). We propose to replace the Gaussian kernel uniformity loss with SSW, whose complexity is closer to linear w.r.t. the number of batch samples. A simple choice of alignment loss is to minimize the mean squared Euclidean distance between pairs of different augmented versions of the same image. A self-supervised learning network is pre-trained using this alignment loss added to a uniformity term. Our overall self-supervised loss can be defined as:
$$\mathcal{L}_{\text{SSW-SSL}} = \underbrace{\frac{1}{n} \sum_{i=1}^n \|z_i^A - z_i^B\|_2^2}_{\text{Alignment loss}} + \underbrace{\frac{\lambda}{2} \Big(SSW_2^2(z^A, \nu) + SSW_2^2(z^B, \nu)\Big)}_{\text{Uniformity loss}},$$
where $z^A, z^B \in \mathbb{R}^{n \times d}$ are the representations from the network projected on the hypersphere of two augmented versions of the same images, $\nu = \mathrm{Unif}(S^{d-1})$ is the uniform distribution on the hypersphere and $\lambda > 0$ is used to balance the two terms. We pretrain a ResNet18 (He et al., 2016) model on the CIFAR10 (Krizhevsky, 2009) data with features projected onto the sphere $S^2$. This feature dimension allows us to visualize the entire validation set of CIFAR10 and its distribution on the sphere. The projections on $S^2$ are shown in Figure 20.
We then evaluate the performance of each contrastive objective by fitting a linear classifier on top of the output of the layer before the projection on the sphere on the training dataset, as is common for SSL methods. For comparison, we also report the results when the features are taken directly on the sphere. As a baseline, we also train a predictive supervised encoder by jointly training the linear classifier and the image encoder in a supervised manner using cross entropy. We use a ResNet18 (He et al., 2016) encoder which outputs 1024 features that are then projected onto the sphere S 2 using a last fully connected layer followed by an ℓ 2 normalization. We pretrain the model for 200 epochs using minibatch stochastic gradient descent (SGD) with a momentum of 0.9, a weight decay of 0.001 and an initial learning rate of 0.05. We use a batch size of 512 samples. The images are augmented using a standard set of random augmentations for SSL: random crops, horizontal flipping, color jittering and gray scale transformation, as done in Wang & Isola (2020). For the trade-off parameter λ, we use λ = 20 for SSW and λ = 1 for SW. To evaluate the performance of representations, we use the common linear evaluation protocol where a linear classifier is fitted on top of the pre-trained representations and the best validation accuracy is reported. The linear classifiers are trained for 100 epochs using the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 0.001 with a decay of 0.2 at epochs 60 and 80. We compare our methods with two other contrastive objectives, Chen et al. (2020a) with the normalized temperature-scaled cross-entropy (NT-Xent) loss and Wang & Isola (2020) which proposes to decompose the



https://github.com/clbonet/Spherical_Sliced-Wasserstein



Figure 2: Runtime comparison in log-log scale between W, Sinkhorn with the geodesic distance, SW 2 , SSW 2 with the binary search (BS) and uniform distribution (12) and SSW 1 with formula (11) between two distributions on S 2 . The time includes the calculation of the distance matrices.

Figure 3: Minimization of SSW with respect to a mixture of vMF.

Figure 4: Latent space of SWAE and SSWAE on MNIST for a uniform prior on S 2 .

Figure 5: Density estimation of models trained on earth data. We plot the density on the test data.

Figure 6: Set of integration of the spherical Radon transform (19). The great circle is in black and the set of integration in blue. The point U z ∈ span(U U T ) ∩ S d-1 is in blue.

Figure 7: Evolution w.r.t. κ between vMF(µ, κ) and vMF(•, 0). For SW, we used 100 projections (for memory reasons for d = 100), and computed it for κ ∈ {1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250}, 10 times for each dimension and κ, with 500 samples of both distributions.

Figure 8: Evolution of SW between vMF samples in S d-1 (mean over 100 batch).

Figure 9: Influence of the number of projections. We compute SW 2 2 vMF(µ, κ)||vMF(•, 0) 20 times, for n = 500 samples in dimension d = 3.

Figure 10: Spherical Sliced-Wasserstein and Wasserstein distance (with geodesic distance) between samples of the uniform distribution on the sphere. Results are averaged over 20 runs and the shaded area corresponds to the standard deviation.

Figure 15: SWVI on Power Spherical Distributions with Mollweide projections.

Figure 16: SSWVI on mixture of vMF

(2018) proposed to use the MMD, Kolouri et al. (2018) used the SW distance, Patrini et al. (2020) used the Sinkhorn divergence, Kolouri et al. (2019) used the generalized SW distance. Here, we use D = SSW 2 2 .

(a) SSW-SSL, λ = 20, L = 10 (Ours) (b) SW-SSL, λ = 1, L = 10 (c) Wang & Isola (2020) (d) Chen et al. (2020a) (e) Supervised prediction (f) Random initialization

Figure 20: The CIFAR10 validation set on S 2 after pre-training.

Negative test log likelihood.

FID (Lower is better).

t. the target distribution or we could use other MCMC with more appealing convergence properties (see e.g. Liu et al. (2016)). Power spherical distribution. First, as a simple example on S 2 , we use the power spherical distribution introduced by De Cao & Aziz

FC 128 → FC 1024 → ReLU → Reshape(64x4x4) → Upsample 2 → Conv 64 → LeakyReLU 0.2 → Conv 64 → LeakyReLU 0.2 → Upsample 2 → Conv 64 → LeakyReLU 0.2 → Conv 32 → LeakyReLU 0.2 → Upsample 2 → Conv 32 → LeakyReLU 0.2 → Conv 1 → Sigmoid

Conv2d 128 → BatchNorm → ReLU → Conv2d 256 → BatchNorm → ReLU → Conv2d 512 → BatchNorm → ReLU → Conv2d 1024 → BatchNorm → ReLU → FC d_z → ℓ2 normalization, where d_z = 65.

Linear evaluation on CIFAR10. The features are taken either on the encoder output or directly on the sphere S 2 .

ACKNOWLEDGMENTS

Clément Bonet thanks Benoît Malézieux for fruitful discussions. This work was performed partly using HPC resources from GENCI-IDRIS (Grant 2022-AD011013514). This research was funded by project DynaLearn from Labex CominLabs and Région Bretagne ARED DLearnMe, and by the project OTTOPIA ANR-20-CHIA-0030 of the French National Research Agency (ANR).

annex

Published as a conference paper at ICLR 2023

C.2 RUNTIME COMPARISONS

We study here the evolution of the runtime w.r.t. different parameters. In Figure 11, we plot, for several dimensions, the runtime to compute SSW 2 w.r.t. the number of projections and the number of samples. We observe that the runtime is linear w.r.t. the number of projections and quasi-linear w.r.t. the number of samples.

C.3 GRADIENT FLOWS

Mixture of vMF distributions. For the experiment in Section 5.1, we use as target distribution a mixture of 6 vMF distributions, from which we have access to samples. We refer to Appendix B.3 for background on vMF distributions.

The 6 vMF distributions have weights 1/6, concentration parameter κ = 10 and location parameters µ 1 = (1, 0, 0), µ 2 = (0, 1, 0), µ 3 = (0, 0, 1), µ 4 = (-1, 0, 0), µ 5 = (0, -1, 0) and µ 6 = (0, 0, -1).

We use two different approximations of the distribution. First, we approximate it using the empirical distribution, i.e. $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$, and we optimize over the particles $(x_i)_{i=1}^{n}$. To optimize over particles, we can either use a projected gradient descent, i.e. a Euclidean gradient step on the objective followed by an ℓ2 normalization of each particle, or a Riemannian gradient descent on the sphere (Absil et al., 2009) (see Appendix B.2 for more details). Note that the projected gradient descent is a Riemannian gradient descent with retraction (Boumal, 2022).

We can also use neural networks such as a multilayer perceptron (MLP). We used an MLP composed of 5 layers of 100 units with leaky ReLU activation functions. The output of the MLP is normalized on the sphere using an ℓ2 normalization. We perform a gradient descent using Adam (Kingma & Ba, 2014) as the optimizer with a learning rate of $10^{-4}$ for 2000 epochs. We approximate SSW with L = 1000 projections and a batch size of 500. The base distribution is chosen as the uniform distribution on the sphere.

We report in Figure 12 a comparison of the 2 approximations, where the density is estimated with a Gaussian kernel density estimator.

Let T be a normalizing flow (NF). For a density estimation task, we have access to a distribution µ through samples $(x_i)_{i=1}^{n}$, i.e. through the empirical measure $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$. The goal is to find an invertible transformation T such that $T_\# \mu = p_Z$, where $p_Z$ is a prior distribution for which we know the density.
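As an illustration of the two particle updates discussed in this subsection for the gradient flow experiment, the sketch below implements one projected gradient step and one Riemannian gradient step (tangent projection followed by the exponential map) in NumPy. It is only a sketch under stated assumptions: the function names are hypothetical, and the objective is a toy one, f(x) = -⟨x, µ⟩, chosen because its gradient is available in closed form. In the experiments, the gradient would instead come from automatic differentiation of SSW.

```python
import numpy as np


def project_to_sphere(x):
    """l2-normalize each particle (row) back onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def projected_gradient_step(x, euclid_grad, lr):
    """Euclidean gradient step followed by a projection on the sphere."""
    return project_to_sphere(x - lr * euclid_grad)


def riemannian_gradient_step(x, euclid_grad, lr):
    """Tangent-space gradient step followed by the exponential map."""
    # Riemannian gradient: remove the component of the gradient along x.
    rgrad = euclid_grad - np.sum(euclid_grad * x, axis=-1, keepdims=True) * x
    v = -lr * rgrad
    norm_v = np.linalg.norm(v, axis=-1, keepdims=True)
    safe_norm = np.where(norm_v > 0, norm_v, 1.0)  # avoid 0/0 at stationary points
    # Exponential map on the sphere: cos(|v|) x + sin(|v|) v / |v|.
    return np.cos(norm_v) * x + np.sin(norm_v) * v / safe_norm


# Toy objective f(x) = -<x, mu>, minimized on the sphere at x = mu.
rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0, 1.0])
x0 = project_to_sphere(rng.normal(size=(20, 3)))
x_proj, x_riem = x0.copy(), x0.copy()
for _ in range(200):
    grad = np.broadcast_to(-mu, x0.shape)  # Euclidean gradient of f
    x_proj = projected_gradient_step(x_proj, grad, lr=0.5)
    x_riem = riemannian_gradient_step(x_riem, grad, lr=0.5)
```

Both updates keep the particles on the sphere and, on this toy objective, drive them towards µ. They differ only in how the step is mapped back to the manifold, normalization versus exponential map.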
In that case, indeed, the density of µ, denoted as $f_\mu$, can be obtained by the change of variables formula as $f_\mu(x) = p_Z(T(x)) \, |\det J_T(x)|$.

To fit $T_\# \mu = p_Z$, we use either SSW, SW on the sphere, or SW on $\mathbb{R}^{d-1}$ for the stereographic projection based NF. For the exponential map normalizing flow, we compose 48 blocks, each one with 100 components. These transformations have 24000 parameters. For Real NVP, we compose 10 blocks of Real NVPs, with shifting and scaling given by multilayer perceptrons composed of 10 layers and 25 hidden units, with leaky ReLU of parameter 0.2 as the activation function. The number of parameters of these networks is 27520.

For the training process, we perform 20000 epochs with full batch size. We use Adam as optimizer with a learning rate of $10^{-1}$. For the stereographic NF, we use a learning rate of $10^{-3}$.

We report in Table 3 details of the datasets.

Algorithm 2 SWVI (Yi & Liu, 2021)
Input: V a potential, K the number of iterations of SWVI, N the batch size, ℓ the number of MCMC steps
Initialization: Choose q θ a sampler

et al., 1999; Blei et al., 2017), we have some observed data $(x_i)_{i=1}^{n}$ and some latent data $(z_i)_{i=1}^{n}$. The goal of variational inference is to approximate the posterior distribution p(•|x) by some distribution q ∈ Q, where Q is a family of probability distributions. The usual way of doing that is to minimize the Kullback-Leibler divergence between q and p(•|x) over this family.

But the KL divergence suffers from some drawbacks: it is only a divergence (i.e. it does not satisfy the triangle inequality and is not symmetric), and it tends to underestimate the target distribution (or to overestimate it for the reverse KL).

Yi & Liu (2021) propose to use an optimal transport distance instead, namely the SW distance, which gives the sliced-Wasserstein variational inference method.
Basically, given some unnormalized probability p(•|x) that we want to approximate with some variational distribution q ϕ , we can first apply an MCMC algorithm and then learn q ϕ using a gradient descent on SW, with the target being the empirical distribution of the samples given by the MCMC. But running long MCMC chains is time consuming and it might be difficult to diagnose the burn-in period. Therefore, they propose to only run, at each iteration, some number of steps t of the MCMC chain, and then to learn the variational distribution by gradient descent. The variational distribution is thus guided at each step by the MCMC samples toward the stationary distribution, which is the target. This is called an amortized sampler (see Problem 1 in (Wang & Liu, 2016)). We sum up the procedure in Algorithm 2.

We propose here to substitute SW by SSW in order to perform SSWVI on the sphere. To do that, we first need an MCMC method on the sphere.
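The amortized procedure described above can be illustrated on a deliberately simple toy problem. The sketch below is Euclidean and one-dimensional, not spherical. It assumes a Gaussian sampler q = N(m, s²) and an unadjusted Langevin algorithm (ULA) as the MCMC kernel, and uses the fact that in one dimension SW reduces to the Wasserstein distance, computed by sorting. All names (`ula_steps`, `w2_sq_1d_and_grads`, `amortized_vi`) and hyperparameter values are illustrative assumptions; gradients are written by hand instead of using automatic differentiation.

```python
import numpy as np


def ula_steps(x, grad_potential, step, n_steps, rng):
    """Run a few unadjusted Langevin steps targeting exp(-V)."""
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - step * grad_potential(x) + np.sqrt(2.0 * step) * noise
    return x


def w2_sq_1d_and_grads(eps, m, s, targets):
    """Squared 1D Wasserstein-2 distance between z = m + s * eps and targets.

    In 1D the optimal coupling matches sorted samples; the targets are
    treated as constants when differentiating w.r.t. (m, s).
    """
    z = m + s * eps
    order = np.argsort(z)
    diff = z[order] - np.sort(targets)
    loss = np.mean(diff ** 2)
    grad_m = 2.0 * np.mean(diff)
    grad_s = 2.0 * np.mean(diff * eps[order])
    return loss, grad_m, grad_s


def amortized_vi(grad_potential, n_iters=400, batch=500, n_mcmc=5,
                 step=0.1, lr=0.1, seed=0):
    """Alternate a few MCMC steps and one gradient step on the sampler."""
    rng = np.random.default_rng(seed)
    m, s = 0.0, 0.5  # initial sampler parameters
    for _ in range(n_iters):
        eps = rng.normal(size=batch)
        z = m + s * eps                                         # sample from q
        targets = ula_steps(z, grad_potential, step, n_mcmc, rng)  # short chains
        _, g_m, g_s = w2_sq_1d_and_grads(eps, m, s, targets)
        m -= lr * g_m
        s -= lr * g_s
    return m, s


# Target N(3, 1), i.e. potential V(x) = (x - 3)^2 / 2 with gradient x - 3.
m_fit, s_fit = amortized_vi(lambda x: x - 3.0)
```

On this example the sampler parameters drift towards the mean and scale of the target, up to the discretization bias of ULA. Moving to the sphere amounts to replacing the MCMC kernel by one defined on the sphere and the 1D Wasserstein distance by SSW.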

C.5.2 MCMC ON THE SPHERE

Several MCMC methods on the sphere have been proposed. For example, Hamiltonian Monte-Carlo (HMC) methods were proposed in (Byrne & Girolami, 2013; Lan et al., 2014; Liu et al., 2016), and Riemannian Langevin algorithms were proposed in (Li & Erdogdu, 2020; Wang et al., 2020).

In our experiments, we use the Geodesic Langevin algorithm (GLA) introduced by Wang et al. (2020). This algorithm is a natural generalization of the Unadjusted Langevin Algorithm (ULA) and

The architecture of the decoder is

We use here a batch size of n = 128, λ = 0.1, the binary cross entropy as reconstruction loss and Adam as optimizer with a learning rate of $10^{-3}$.

We report in Table 2 the FID obtained using 10000 samples and we report the mean over 5 trainings.

For SSW, we used the formulation using the uniform distribution (12). To compute SW, we used the POT library (Flamary et al., 2021). To compute the Sinkhorn divergence, we used the GeomLoss package (Feydy et al., 2019).

Additional experiments. We report in Figure 18 samples obtained with SSW for a uniform prior on S 10 .

objective in two distinct terms L align and L uniform . We recall the respective uniformity loss of each method in Table 5. As one can see in Table 4, our method achieves here performance comparable to two state-of-the-art approaches, while slightly under-performing compared to (Chen et al., 2020a).

We suspect that a finer validation of the balancing parameter λ is needed, especially since the representations in Figure 20a are not completely uniformly distributed over the sphere after pre-training, compared to other contrastive methods. Nevertheless, these preliminary results show that SSW-SSL is a promising contrastive learning approach without explicit distances between negative samples, especially compared to SW on the sphere. To this end, further work should be devoted to finding a good balance between the alignment and uniformity objectives.

