LEARNING TOPOLOGY-PRESERVING DATA REPRESENTATIONS

Abstract

We propose a method for learning topology-preserving data representations (dimensionality reduction). The method aims to provide topological similarity between the data manifold and its latent representation by enforcing similarity in the topological features (clusters, loops, 2D voids, etc.) and their localization. The core of the method is the minimization of the Representation Topology Divergence (RTD) between the original high-dimensional data and the low-dimensional representation in latent space. RTD minimization provides closeness in topological features with strong theoretical guarantees. We develop a scheme for RTD differentiation and apply it as a loss term for the autoencoder. The proposed method "RTD-AE" better preserves the global structure and topology of the data manifold than state-of-the-art competitors, as measured by linear correlation, triplet distance ranking accuracy, and the Wasserstein distance between persistence barcodes.

1. INTRODUCTION

Dimensionality reduction is a useful tool for data visualization, preprocessing, and exploratory data analysis. Clearly, embedding high-dimensional data into 2D or 3D space is impossible without distortions, which vary across popular methods. Dimensionality reduction methods can be broadly classified into global and local ones. Classical global methods (PCA, MDS) tend to preserve the global structure of a manifold. However, in many practical applications, the produced visualizations are non-informative since they do not capture complex non-linear structures. Local methods (UMAP (McInnes et al., 2018), PaCMAP (Wang et al., 2021), t-SNE (Van der Maaten & Hinton, 2008), Laplacian Eigenmaps (Belkin & Niyogi, 2001), ISOMAP (Tenenbaum et al., 2000)) focus on preserving neighborhoods and local structure at the cost of sacrificing the global structure. The most popular methods, t-SNE and UMAP, are a good choice for inferring cluster structure but often fail to correctly describe the topology of the data manifold. t-SNE and UMAP have hyperparameters that control the size of the neighborhood taken into account; different hyperparameter values lead to significantly different visualizations, and none of them is the "canonical" one that correctly represents the high-dimensional data.

We take a different perspective on dimensionality reduction and propose an approach based on Topological Data Analysis (TDA). Topological Data Analysis (Barannikov, 1994; Zomorodian, 2001; Chazal & Michel, 2017) is a field devoted to the numerical description of multi-scale topological properties of data distributions by analyzing point clouds sampled from them. TDA methods naturally capture properties of data manifolds at multiple distance scales and are arguably a good trade-off between local and global approaches. The state-of-the-art TDA approach of this kind is TopoAE (Moor et al., 2020).
However, it has several weaknesses: 1) the loss term is not continuous; 2) the nullity of the loss term is a necessary but not sufficient condition for the coincidence of topology, as measured by persistence barcodes; see details in Appendix J. In our paper, we suggest using the Representation Topology Divergence (RTD) (Barannikov et al., 2022) to produce topology-aware dimensionality reduction. RTD measures the topological discrepancy between two point clouds with a one-to-one correspondence between them and enjoys nice theoretical properties (Section 3.2). The major obstacle to incorporating RTD into deep learning is its differentiation. There exist approaches to the differentiation of barcodes and of generic barcode-based functions with respect to deformations of the filtration (Carriére et al., 2021), and to TDA differentiation in special cases (Hofer et al., 2019; Poulenard et al., 2018). In this paper, we make the following contributions:

1. We develop an approach for RTD differentiation. Topological metrics are difficult to differentiate; the differentiability of RTD and its implementation on GPU is a valuable step forward in the TDA context, opening novel possibilities for topological optimization.

2. We propose a new method for topology-aware dimensionality reduction: an autoencoder enhanced with the differentiable RTD loss, "RTD-AE". Minimization of the RTD loss between the real and latent spaces forces closeness in topological features and their localization, with strong theoretical guarantees.

3. In computational experiments, we show that the proposed RTD-AE outperforms state-of-the-art dimensionality reduction methods and the vanilla autoencoder in terms of preserving the global structure and topology of a data manifold; we measure this by the linear correlation, the triplet distance ranking accuracy, the Wasserstein distance between persistence barcodes, and RTD.
In some cases, the proposed RTD-AE produces more faithful and visually appealing low-dimensional embeddings than state-of-the-art algorithms. We release the RTD-AE source code.

2. RELATED WORK

Various dimensionality reduction methods have been proposed to obtain 2D/3D visualizations of high-dimensional data (Tenenbaum et al., 2000; Belkin & Niyogi, 2001; Van der Maaten & Hinton, 2008; McInnes et al., 2018). Natural science researchers often use dimensionality reduction methods for exploratory data analysis or even to guide further experiments (Becht et al., 2019; Kobak & Berens, 2019; Karlov et al., 2019; Andronov et al., 2021; Szubert et al., 2019). The main problem with these methods is their inevitable distortions (Chari et al., 2021; Batson et al., 2021; Wang et al., 2021) and incoherent results for different hyperparameters. These distortions can largely affect the global representation structure, such as inter-cluster relationships and pairwise distances. Since the interpretation of these quantities in domains such as physics or biology can lead to incorrect conclusions, it is highly important to preserve them as much as possible. UMAP and t-SNE visualizations are frequently sporadic and cannot be considered the "canonical" representation of high-dimensional data. An often overlooked issue is initialization, which significantly contributes to the performance of dimensionality reduction methods (Kobak & Linderman, 2021; Wang et al., 2021). Damrich & Hamprecht (2021) revealed that UMAP's true loss function differs from the one purported by its theory because of negative sampling. A number of works try to tackle the distortion problem and preserve as many inter-data relationships as possible. The authors of PHATE (Moon et al., 2019) and ivis (Szubert et al., 2019) claim that their methods are able to capture local as well as global features, but provide no theoretical guarantees for this. Wagner et al. (2021) propose DIPOLE, an approach to dimensionality reduction combining techniques of metric geometry and distributed persistent homology.
From a broader view, deep representation learning is also dedicated to obtaining low-dimensional representations of data. The autoencoder (Hinton & Salakhutdinov, 2006) and the variational autoencoder (Kingma & Welling, 2013) are mostly used to learn representations of objects useful for solving downstream tasks or for data generation. They are not designed for data visualization and fail to simultaneously preserve local and global structure in 2D/3D spaces. However, their parametric nature makes them scalable and applicable to large datasets, which is why they are used in methods such as parametric UMAP (Sainburg et al., 2021), ivis (Szubert et al., 2019), and ours. Moor et al. (2020) proposed TopoAE, which adds a loss to the autoencoder to preserve topological structures of the input space in the latent representations; the topological similarity is achieved by retaining similarity in the multi-scale connectivity information. Our approach has a stronger theoretical foundation and outperforms TopoAE in computational experiments. An approach for the differentiation of persistent homology-based functions was proposed by Carriére et al. (2021). Leygonie et al. (2021) systematize different approaches to the regularization of persistence-diagram-based functions and define notions of differentiability for maps to and from the space of persistence barcodes. Luo et al. (2021) proposed a topology-preserving dimensionality reduction method based on a graph autoencoder. Kim et al. (2020) proposed a differentiable topological layer for general deep learning models based on persistence landscapes.

3.1. TOPOLOGICAL DATA ANALYSIS, PERSISTENT HOMOLOGY

Topology is often considered to describe the "shape of data", that is, multi-scale properties of a dataset. Topological information is generally recognized as important for various data analysis problems. Under the commonly assumed manifold hypothesis (Goodfellow et al., 2016), datasets are concentrated near low-dimensional manifolds located in high-dimensional ambient spaces. The standard direction is to study the topological features of the underlying manifold. The common approach is to cover the manifold with simplices: given a threshold $\alpha$, we take the sets of points from the dataset $X$ which are pairwise closer than $\alpha$. The family of such sets is called the Vietoris-Rips simplicial complex. For further convenience, we introduce the fully-connected weighted graph $G$ whose vertices are the points from $X$ and whose edge weights are given by the distances between the points. Then, the Vietoris-Rips simplicial complex is defined as
$$\mathrm{VR}_\alpha(G) = \left\{ \{i_0, \dots, i_k\},\ i_m \in \mathrm{Vert}(G) \mid m_{i,j} \le \alpha \text{ for all } i, j \in \{i_0, \dots, i_k\} \right\},$$
where $m_{i,j}$ is the distance between points and $\mathrm{Vert}(G) = \{1, \dots, |X|\}$ is the vertex set of the graph $G$. For each $\mathrm{VR}_\alpha(G)$, we define the vector space $C_k$, which consists of formal linear combinations of all $k$-dimensional simplices from $\mathrm{VR}_\alpha(G)$ with modulo-2 arithmetic. The boundary operator $\partial_k : C_k \to C_{k-1}$ maps every simplex to the sum of its facets. One can show that $\partial_k \circ \partial_{k+1} = 0$, so a chain complex can be formed:
$$\cdots \to C_{k+1} \xrightarrow{\partial_{k+1}} C_k \xrightarrow{\partial_k} C_{k-1} \to \cdots.$$

The quotient vector space $H_k = \ker(\partial_k)/\mathrm{im}(\partial_{k+1})$ is called the $k$-th homology group; elements of $H_k$ are called homology classes. The dimension $\beta_k = \dim(H_k)$ is called the $k$-th Betti number, and it approximates the number of basic topological features of the manifold represented by the point cloud $X$. The immediate problem here is the selection of an appropriate $\alpha$, which is not known beforehand. The standard solution is to analyze all $\alpha > 0$. Obviously, if $\alpha_1 \le \alpha_2 \le \dots \le \alpha_m$, then $\mathrm{VR}_{\alpha_1}(G) \subseteq \mathrm{VR}_{\alpha_2}(G) \subseteq \dots \subseteq \mathrm{VR}_{\alpha_m}(G)$; this nested sequence is called a filtration. The evolution of cycles across the nested family of simplicial complexes $S_{\alpha_i}$ is canonically decomposed into "births" and "deaths" of basic topological features: a basic feature $c$ appears in $H_k(S_\alpha)$ at a specific threshold $\alpha_c$ and disappears at a specific threshold $\beta_c$; the difference $\beta_c - \alpha_c$ describes the "lifespan" or persistence of the homology class. The set of the corresponding intervals $[\alpha_c, \beta_c]$ for the basic homology classes from $H_k$ is called the persistence barcode; the whole theory is dubbed persistent homology (Chazal & Michel, 2017; Barannikov, 1994; Zomorodian, 2001).
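To make the degree-0 case concrete, here is a minimal sketch (not the paper's code) of a persistence-barcode computation for a Vietoris-Rips filtration: every connected component is born at $\alpha = 0$ and dies at the weight of a minimum-spanning-tree edge, so Kruskal's algorithm with a union-find suffices. Higher-degree barcodes ($H_1$ and above) require a dedicated library such as Ripser or GUDHI; the function name below is our own.

```python
import numpy as np

def h0_barcode(points):
    """H0 persistence barcode of the Vietoris-Rips filtration of a point cloud.

    Components are born at alpha = 0 and merge (die) exactly at the weights of
    the minimum-spanning-tree edges (Kruskal's algorithm with union-find).
    The single essential bar [0, inf) is omitted.
    """
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    bars = []
    for alpha, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # an MST edge: one component dies here
            parent[ri] = rj
            bars.append((0.0, float(alpha)))
    return bars

# Two well-separated pairs of points: the longest finite bar reflects the gap.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
bars = h0_barcode(pts)
```

On this toy cloud, two bars die quickly (the tight pairs) and one long bar persists until the two pairs merge, which is exactly the multi-scale cluster information the barcode encodes.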

3.2. REPRESENTATION TOPOLOGY DIVERGENCE (RTD)

The classic persistent homology is dedicated to the analysis of a single point cloud $X$. Recently, the Representation Topology Divergence (RTD) (Barannikov et al., 2022) was proposed to measure the dissimilarity in the multi-scale topology between two point clouds $X$, $\tilde X$ of equal size $N$ with a one-to-one correspondence between them. Let $G^w$, $G^{\tilde w}$ be graphs with edge weights equal to the pairwise distances of $X$, $\tilde X$. To provide the comparison, an auxiliary graph $\hat G^{w, \tilde w}$ with a doubled set of vertices and an edge weight matrix $m(w, \tilde w)$ is created, see details in Appendix B. The persistence barcode of the graph $\hat G^{w, \tilde w}$ is called the R-Cross-Barcode; it tracks the differences in the multi-scale topology of the two point clouds by comparing their $\alpha$-neighborhood graphs for all $\alpha$.

Here we give a simple example of an R-Cross-Barcode for two point clouds $A$ and $B$, see also (Cherniavskii et al., 2022). A discrepancy appears at the threshold $\alpha = 0.53$, when the vertex sets $\{4\}$ and $\{3, 6, 7\}$ are joined into one connected component in the union of the neighborhood graphs of $A$ and $B$ by the edge $(4, 7)$. We identify the "death" of this R-Cross-Barcode feature at $\alpha = 0.57$, when these two sets are joined into one connected component in the neighborhood graph of cloud $A$ (via the edge $(4, 7)$ in Figure 2 becoming grey). By definition, $\mathrm{RTD}_k(X, \tilde X)$ is the sum of the intervals' lengths in the R-Cross-Barcode$_k(X, \tilde X)$ and measures its closeness to the empty set.

Proposition 1 (Barannikov et al. (2022)). If $\mathrm{RTD}_k(X, \tilde X) = \mathrm{RTD}_k(\tilde X, X) = 0$ for all $k \ge 1$, then the barcodes of the weighted graphs $G^w$ and $G^{\tilde w}$ are the same in any degree. Moreover, in this case, the topological features are located in the same places: the inclusions $\mathrm{VR}_\alpha(G^w) \subseteq \mathrm{VR}_\alpha(G^{\min(w, \tilde w)})$ and $\mathrm{VR}_\alpha(G^{\tilde w}) \subseteq \mathrm{VR}_\alpha(G^{\min(w, \tilde w)})$ induce homology isomorphisms for any threshold $\alpha$.

Proposition 1 is a strong basis for topology comparison and optimization.
Given a fixed data representation $X$, how do we find $\tilde X$, lying in a different space, whose topology is similar to that of $X$, in particular with similar persistence barcodes? Proposition 1 states that it is sufficient to minimize $\sum_{i \ge 1} \left( \mathrm{RTD}_i(X, \tilde X) + \mathrm{RTD}_i(\tilde X, X) \right)$. In most of our experiments we minimized $\mathrm{RTD}_1(X, \tilde X) + \mathrm{RTD}_1(\tilde X, X)$: $\mathrm{RTD}_1$ can be calculated faster than $\mathrm{RTD}_{2+}$, and $\mathrm{RTD}_{2+}$ are often close to zero. To simplify notation, we denote $\mathrm{RTD}(X, \tilde X) := \frac{1}{2}\left( \mathrm{RTD}_1(X, \tilde X) + \mathrm{RTD}_1(\tilde X, X) \right)$.

Comparison with the TopoAE loss. TopoAE (Moor et al., 2020) is the state-of-the-art algorithm for topology-preserving dimensionality reduction. The TopoAE topological loss is based on the comparison of minimum spanning trees in the $X$ and $\tilde X$ spaces. However, it has several weak spots. First, when the TopoAE loss is zero, there is no guarantee that the persistence barcodes of $X$ and $\tilde X$ coincide. Second, the TopoAE loss can be discontinuous in rather standard situations, see Appendix J. At the same time, the RTD loss is continuous, and its nullity guarantees the coincidence of the persistence barcodes of $X$ and $\tilde X$. The continuity of the RTD loss follows from the stability of the R-Cross-Barcode$_k$ (Proposition 2).

Proposition 2. (a) For any quadruple of edge weight sets $w_{ij}, \tilde w_{ij}, v_{ij}, \tilde v_{ij}$ on $G$:
$$d_B\left(\text{R-Cross-Barcode}_k(w, \tilde w),\ \text{R-Cross-Barcode}_k(v, \tilde v)\right) \le \max\left(\max_{ij} |v_{ij} - w_{ij}|,\ \max_{ij} |\tilde v_{ij} - \tilde w_{ij}|\right).$$
(b) For any pair of edge weight sets $w_{ij}, \tilde w_{ij}$ on $G$: $\left\| \text{R-Cross-Barcode}_k(w, \tilde w) \right\|_B \le \max_{ij} |w_{ij} - \tilde w_{ij}|$.
(c) Let $w, w', \tilde w$ be a triple of metrics on a measure space $(\mathcal{X}, \mu)$, and let $X = \{x_1, \dots, x_n\}$, $x_i \in \mathcal{X}$, be a sample from $(\mathcal{X}, \mu)$, with $w_{ij} = w(x_i, x_j)$, $w'_{ij} = w'(x_i, x_j)$, $\tilde w_{ij} = \tilde w(x_i, x_j)$. Then the expectation of the bottleneck distance between R-Cross-Barcode$_k(w, \tilde w)$ and R-Cross-Barcode$_k(w', \tilde w)$ is upper bounded by the Gromov-Wasserstein distance between $w$ and $w'$:
$$\int_{\mathcal{X} \times \dots \times \mathcal{X}} d_B\left(\text{R-Cross-Barcode}_k(w, \tilde w),\ \text{R-Cross-Barcode}_k(w', \tilde w)\right) d\mu^{\otimes n} \le n\, GW(w, w').$$
(d) Let $w, \tilde w$ be a pair of metrics on a measure space $(\mathcal{X}, \mu)$, and let $X = \{x_1, \dots, x_n\}$, $x_i \in \mathcal{X}$, be a sample from $(\mathcal{X}, \mu)$, with edge weights $w_{ij} = w(x_i, x_j)$, $\tilde w_{ij} = \tilde w(x_i, x_j)$. Then the expectation of the bottleneck norm of R-Cross-Barcode$_k(w, \tilde w)$ is upper bounded by the Gromov-Wasserstein distance between $w$ and $\tilde w$.

4.1. RTD DIFFERENTIATION

We propose to use RTD as a loss in neural networks; here we describe our approach to RTD differentiation. Denote by $\Sigma_k$ the set of all $k$-simplices in the Vietoris-Rips complex of the graph $\hat G^{w, \tilde w}$, and by $T_k$ the set of all intervals in the R-Cross-Barcode$_k(X, \tilde X)$. Fix an (arbitrary) strict order on $T_k$. There exists a function $f_k : \bigcup_{(b_i, d_i) \in T_k} \{b_i, d_i\} \to \Sigma_k$ that maps $b_i$ (or $d_i$) to the simplex $\sigma$ whose appearance leads to the "birth" (or "death") of the corresponding homology class. Let $m_\sigma = \max_{i,j \in \sigma} m_{i,j}$ denote the filtration value at which the simplex $\sigma$ joins the filtration. Since $\frac{\partial\, \mathrm{RTD}_k(X, \tilde X)}{\partial d_i} = -\frac{\partial\, \mathrm{RTD}_k(X, \tilde X)}{\partial b_i} = 1$, we obtain the following equation for the subgradient:
$$\frac{\partial\, \mathrm{RTD}_k(X, \tilde X)}{\partial m_\sigma} = \sum_{i \in T_k} \mathbb{I}\{f_k(d_i) = \sigma\} - \sum_{i \in T_k} \mathbb{I}\{f_k(b_i) = \sigma\}.$$
Here, for any $\sigma$, at most one term has a non-zero indicator. Then
$$\frac{\partial\, \mathrm{RTD}_k(X, \tilde X)}{\partial m_{i,j}} = \sum_{\sigma \in \Sigma_k} \frac{\partial\, \mathrm{RTD}_k(X, \tilde X)}{\partial m_\sigma} \frac{\partial m_\sigma}{\partial m_{i,j}}.$$
The only thing left is to obtain the subgradients of $\mathrm{RTD}(X, \tilde X)$ with respect to the points of $X$ and $\tilde X$. Consider an (arbitrary) element $m_{i,j}$ of the matrix $m$. There are 4 possible cases:

1. $i, j \le N$, i.e. $m_{i,j}$ lies in the upper-left quadrant of $m$. Its value is constant, thus $\forall l: \frac{\partial m_{i,j}}{\partial X_l} = \frac{\partial m_{i,j}}{\partial \tilde X_l} = 0$.

2. $i \le N < j$, i.e. $m_{i,j}$ lies in the upper-right quadrant of $m$. Its value is computed as a Euclidean distance, thus $\frac{\partial m_{i,j}}{\partial X_i} = \frac{X_i - X_{j-N}}{\|X_i - X_{j-N}\|_2}$ (similarly for $X_{j-N}$).

3. $j \le N < i$: similar to the previous case.

4. $N < i, j$, i.e. $m_{i,j}$ lies in the bottom-right quadrant of $m$. Here we have subgradients like
$$\frac{\partial m_{i,j}}{\partial X_{i-N}} = \frac{X_{i-N} - X_{j-N}}{\|X_{i-N} - X_{j-N}\|_2}\, \mathbb{I}\{w_{i-N, j-N} < \tilde w_{i-N, j-N}\},$$
and similarly for $X_{j-N}$, $\tilde X_{i-N}$, and $\tilde X_{j-N}$.

The subgradients $\frac{\partial\, \mathrm{RTD}(X, \tilde X)}{\partial X_i}$ and $\frac{\partial\, \mathrm{RTD}(X, \tilde X)}{\partial \tilde X_i}$ can be derived from the above using the chain rule and the formula for the full (sub)gradient. We are now able to minimize $\mathrm{RTD}(X, \tilde X)$ by methods of (sub)gradient optimization. We discuss some possible tricks for improving RTD differentiation in Appendix I.
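The chain "bar endpoint → critical simplex → edge endpoints" can be illustrated on a simpler objective than RTD: the total degree-0 persistence of a single cloud (the sum of MST edge lengths). This is our own minimal analogue, not the paper's RTD subgradient; it assumes all points are distinct (otherwise the unit vector is undefined). Each death $d_i$ maps to a critical edge $\sigma = f(d_i)$ with $\partial L / \partial m_\sigma = 1$, and $m_\sigma$ is the edge length, so the gradient at each endpoint is the unit vector along the edge.

```python
import numpy as np

def total_h0_persistence_grad(points):
    """Loss = sum of H0 death times (= sum of MST edge lengths) and its
    subgradient with respect to point coordinates.

    Mirrors the scheme in Section 4.1: Kruskal's algorithm identifies the
    critical edges; the filtration value of a critical edge is the Euclidean
    distance between its endpoints, so the gradient is the unit vector
    along the edge at each endpoint.
    """
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    loss, grad = 0.0, np.zeros_like(points, dtype=float)
    for alpha, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # critical edge of a death event
            parent[ri] = rj
            loss += alpha
            u = (points[i] - points[j]) / alpha   # d|x_i - x_j| / d x_i
            grad[i] += u
            grad[j] -= u
    return loss, grad
```

Taking a step against `grad` shrinks the total persistence, i.e., pulls connected components together, which is the same mechanism RTD optimization exploits for the auxiliary graph.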

4.2. RTD AUTOENCODER

Given the data $X = \{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, in a high-dimensional space, our goal is to find a representation in a low-dimensional space $Z = \{z_i\}$, $z_i \in \mathbb{R}^p$. For visualization purposes, $p = 2, 3$. Our idea is to find a representation $Z$ which preserves persistence barcodes, that is, the multi-scale topological properties of the point clouds, as much as possible. The straightforward approach is to solve $\min_Z \mathrm{RTD}(X, Z)$, where the optimization is performed over the $n$ vectors $z_i \in \mathbb{R}^p$, in a flavor similar to UMAP and t-SNE. This approach is workable albeit very time-consuming and can be applied only to small datasets, see Appendix F. A practical solution is to learn representations via the encoder network $E(w, x): X \to Z$, see Figure 3.

Algorithm. Initially, we train the autoencoder for $E_1$ epochs with the reconstruction loss $\frac{1}{2}\|X - X_{rec}\|^2$ only. Then, we train for $E_2$ epochs with the loss $\frac{1}{2}\|X - X_{rec}\|^2 + \mathrm{RTD}(X, Z)$. Both losses are calculated on mini-batches. The two-step procedure speeds up training, since calculating $\mathrm{RTD}(X, Z)$ for the untrained network takes significant time.
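The two-phase schedule above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the architecture, optimizer settings, and in particular the `topo_loss_stub` (a pairwise-distance-matrix discrepancy standing in for the actual differentiable RTD loss) are all assumptions made so the skeleton runs end to end.

```python
import torch
import torch.nn as nn

def topo_loss_stub(x, z):
    """Stand-in for the differentiable RTD loss of Section 4.1.

    The real loss requires persistence-pair computation (e.g. via a
    modified Ripser++); a distance-matrix discrepancy is used here only
    to make the training skeleton executable.
    """
    return ((torch.cdist(x, x) - torch.cdist(z, z)) ** 2).mean()

def train_rtd_ae(X, latent_dim=2, e1=5, e2=5, lr=1e-3):
    """Two-phase schedule from Section 4.2: E1 epochs with the reconstruction
    loss only, then E2 epochs with reconstruction + topological loss."""
    d = X.shape[1]
    enc = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, latent_dim))
    dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, d))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for epoch in range(e1 + e2):
        opt.zero_grad()
        z = enc(X)
        loss = 0.5 * ((X - dec(z)) ** 2).sum(dim=1).mean()  # reconstruction
        if epoch >= e1:                                     # topo term kicks in
            loss = loss + topo_loss_stub(X, z)
        loss.backward()
        opt.step()
    return enc, dec
```

In practice both loss terms would be computed on mini-batches, since the topological term scales with the square (and worse) of the batch size.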

5. EXPERIMENTS

In computational experiments, we perform dimensionality reduction to high-dimensional and 2D/3D spaces, the latter for ease of visualization. We compare the original data with the latent representations by (1) linear correlation of pairwise distances, (2) Wasserstein distance (W.D.) between $H_0$ persistence barcodes (Chazal & Michel, 2017), (3) triplet distance ranking accuracy (Wang et al., 2021), and (4) RTD. All of the quality measures are tailored to evaluate how well the manifold's global structure and topology are preserved. We note that RTD, as a quality measure, provides a more precise comparison of topology than the W.D. between $H_0$ persistence barcodes. First, RTD takes into account the localization of topological features, while W.D. does not. Second, W.D. is invariant to permutations of points, but we are interested in a comparison between original data and latent representations where a natural one-to-one correspondence holds. We compare the proposed RTD-AE with t-SNE (Van der Maaten & Hinton, 2008), UMAP (McInnes et al., 2018), TopoAE (Moor et al., 2020), the vanilla autoencoder (AE), PHATE (Moon et al., 2019), ivis (Szubert & Drozdov, 2019), and PaCMAP (Wang et al., 2021). The complete description of all the used datasets can be found in Appendix L. See hyperparameters in Appendix H.
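Measures (1) and (3) can be sketched as follows. This is our simplified reading of the two measures, not the paper's evaluation code; the exact protocols (e.g. triplet sampling scheme) may differ.

```python
import numpy as np

def pairwise_dists(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def linear_correlation(X, Z):
    """Pearson correlation between the pairwise distances of the original
    data X and of the latent representation Z (upper triangles only)."""
    iu = np.triu_indices(len(X), k=1)
    return np.corrcoef(pairwise_dists(X)[iu], pairwise_dists(Z)[iu])[0, 1]

def triplet_accuracy(X, Z, n_triplets=1000, seed=0):
    """Fraction of random triplets (i, j, k) for which the order of
    d(x_i, x_j) vs d(x_i, x_k) is the same in both spaces."""
    rng = np.random.default_rng(seed)
    dX, dZ = pairwise_dists(X), pairwise_dists(Z)
    hits = 0
    for _ in range(n_triplets):
        i, j, k = rng.choice(len(X), size=3, replace=False)
        hits += (dX[i, j] < dX[i, k]) == (dZ[i, j] < dZ[i, k])
    return hits / n_triplets
```

A representation that is an exact similarity transform of the input (e.g. a global rescaling) attains correlation 1 and triplet accuracy 1, which is why both measures probe global rather than purely local structure.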

5.1. SYNTHETIC DATASETS

We start with the synthetic dataset "Spheres": eleven 100D spheres in 101D space; any two of them do not intersect, and one of the spheres contains all the others inside. For visualization, we perform dimensionality reduction to 3D space. Figure 4 shows the results: RTD-AE is the best at preserving the nestedness of the "Spheres" dataset. RTD-AE also outperforms the other methods on the quality measures, see Table 1. We were unable to run MDS on the "Spheres" dataset because it was too large for that method. See more results in Appendix M.

5.2. REAL WORLD DATASETS

We performed experiments with a number of real-world datasets: MNIST (LeCun et al., 1998), F-MNIST (Xiao et al., 2017), COIL-20 (Nene et al., 1996), scRNA mice (Yuan et al., 2017), and scRNA melanoma (Tirosh et al., 2016), with latent dimensions of 16 and 2, see Tables 2, 5. The choice of scRNA datasets was motivated by the increased importance of dimensionality reduction methods in the natural sciences, as mentioned previously. RTD-AE is consistently better than its competitors; moreover, the gap in metrics for latent dimension 16 is larger than that for latent dimension 2 (see Appendix D). For latent dimension 2, RTD-AE is first or second among the methods by the quality measures (see Table 5, Figure 7 in Appendix D). We conclude that the proposed RTD-AE does a good job of preserving the global structure of data manifolds. (PHATE execution took too much time, so its results are not presented for many datasets.) For the "Mammoth" (Coenen & Pearce, 2019b) dataset (Figure 1), we performed dimensionality reduction 3D → 2D. Besides good quality measures, RTD-AE produced an appealing 2D visualization: both large-scale (shape) and small-scale (chest bones, toes, tusks) features are preserved.

5.3. ANALYSIS OF DISTORTIONS

Next, to study the distortions produced by various dimensionality reduction methods, we learn a transformation from 2D to 2D space, see Figure 5. Here, we observe that RTD-AE in general recovers the global structure of all the datasets. RTD-AE typically does not suffer from the squeezing (or bottleneck) issue, unlike AE, which is noticeable on "Random", "3 Clusters", and "Circle". Whereas t-SNE and UMAP struggle to preserve cluster densities and inter-cluster distances, RTD-AE manages to do so in every case. It does not cluster random points together, as t-SNE does. Finally, the overall shape of the representations produced by RTD-AE is consistent: it does not tear apart close points, which UMAP does in some cases, as shown on the "Circle" dataset. The metrics, presented in Table 6 in Appendix E, also confirm these statements: RTD-AE typically has a higher pairwise-distance linear correlation and triplet accuracy, which accounts for good multi-scale properties, while having a lower Wasserstein distance between persistence barcodes.

5.5. DISCUSSION

Experimental results show that RTD-AE preserves the global structure of the data manifold better than its competitors. The most interesting comparison is with TopoAE, the state-of-the-art method, which uses an alternative topology-preserving loss. The measures of interest for topology comparison are the Wasserstein distances between persistence barcodes. Tables 2, 5, 6 show that RTD-AE is better than TopoAE. RTD minimization also has a stronger theoretical foundation than the loss from TopoAE (see Section 3.2).

6. CONCLUSIONS

In this paper, we have proposed an approach for topology-preserving representation learning (dimensionality reduction). The topological similarity between data points in the original and latent spaces is achieved by minimizing the Representation Topology Divergence (RTD) between the original data and the latent representations. Our approach is theoretically sound: RTD = 0 means that the persistence barcodes of any degree coincide and that the topological features are located in the same places. We proposed a scheme to make RTD differentiable and implemented it as an additional loss of the autoencoder, constructing the RTD-autoencoder (RTD-AE). Computational experiments show that the proposed RTD-AE better preserves the global structure of the data manifold (as measured by linear correlation, triplet distance ranking accuracy, and Wasserstein distance between persistence barcodes) than the popular methods t-SNE and UMAP. We also achieve higher topological similarity than the alternative TopoAE method. Of course, the application of the RTD loss is not limited to autoencoders, and we expect more deep learning applications involving a one-to-one correspondence between points. The main limitation is that the calculation of persistence barcodes, and of RTD in particular, is computationally demanding. We see here an opportunity for further research.

A SIMPLICIAL COMPLEXES AND FILTRATIONS

Here we briefly recall the basic topological objects mentioned in our paper. Suppose we have a set of points $X = \{x_0, x_1, \dots, x_n\}$ in some metric space $(R, d)$.

Definition A.1 Any set $\sigma \subseteq X$ is a (combinatorial) simplex. Its vertices are the points belonging to $\sigma$. Its dimension is the number $|\sigma| - 1$. Its faces are all proper subsets of $\sigma$.

Definition A.2 A simplicial complex $C$ is a set of simplices such that for every simplex $\sigma \in C$ it contains all faces of $\sigma$, and for every two simplices $\sigma_1, \sigma_2 \in C$ their intersection is a face of both of them.

Simplicial complexes can be seen as a higher-dimensional generalization of graphs. There are many ways to build a simplicial complex from a set of points, but only two are important for this work: the Vietoris-Rips and Čech complexes.

Definition A.3 Given a threshold $\alpha$, the Vietoris-Rips (simplicial) complex at threshold $\alpha$ (denoted $\mathrm{VR}_\alpha(X)$) is defined as the set of all simplices $\sigma$ such that $d(x_i, x_j) \le \alpha$ for all $x_i, x_j \in \sigma$.

Definition A.4 Given a threshold $\alpha$, the Čech (simplicial) complex at threshold $\alpha$ (denoted $\mathrm{Cech}_\alpha(X)$) is defined as the set of all simplices $\sigma$ such that all closed balls of radius $\alpha$ centered at the vertices of $\sigma$ have a non-empty intersection.

Although Čech complexes are rarely used in applications of Topological Data Analysis, they are important because their fundamental topological properties coincide with those of the manifold 'behind' $X$ (the so-called Nerve theorem, see (Chazal & Michel, 2017) for a proper explanation). The Vietoris-Rips complexes 'approximate' Čech complexes: $\mathrm{VR}_\alpha(X) \subseteq \mathrm{Cech}_\alpha(X) \subseteq \mathrm{VR}_{2\alpha}(X)$. Note that the definition of the Vietoris-Rips complex does not require (even indirectly) the function $d(\cdot, \cdot)$ to be a metric; it should only be symmetric and non-negative. Hence we can define the Vietoris-Rips complex of a weighted graph $G = (V, E)$.
To do so, we modify Definition A.3 by replacing $X$ with $V$ and taking $d(v_i, v_j)$ to be the weight of the edge between $v_i$ and $v_j$ for $i \ne j$, and $d(v_i, v_i) = 0$ for all $i$. In the scope of this work, we consider only Vietoris-Rips complexes of graphs.

Definition A.5 A filtration of a simplicial complex $C$ is a nested family of subcomplexes $(C_t)_{t \in T}$, where $T \subseteq \mathbb{R}$, such that for any $t_1, t_2 \in T$, if $t_1 \le t_2$ then $C_{t_1} \subseteq C_{t_2}$, and $C = \bigcup_{t \in T} C_t$. The set $T$ may be either finite or infinite.

The Vietoris-Rips filtration can 'reflect' the topology of a dataset at every scale. Datasets are usually finite, so there is a finite number of thresholds that give different Vietoris-Rips complexes, and thus a finite filtration suffices.

B FORMAL DEFINITION OF RTD

Let $\mathrm{VR}_\alpha(G^w)$, $\mathrm{VR}_\alpha(G^{\tilde w})$ be two Vietoris-Rips simplicial complexes, where $w$, $\tilde w$ are the distance matrices of $X$, $\tilde X$. The idea behind RTD is to compare $\mathrm{VR}_\alpha(G^w)$ with $\mathrm{VR}_\alpha(G^{\min(w, \tilde w)})$, where $G^{\min(w, \tilde w)}$ is the graph having the weights $\min(w, \tilde w)$ on its edges. By definition, $\mathrm{VR}_\alpha(G^w) \subseteq \mathrm{VR}_\alpha(G^{\min(w, \tilde w)})$ and $\mathrm{VR}_\alpha(G^{\tilde w}) \subseteq \mathrm{VR}_\alpha(G^{\min(w, \tilde w)})$. To carry out the comparison, the auxiliary graph $\hat G^{w, \tilde w}$ with a doubled set of vertices is constructed (Figure 6), with edge weights given in the simplest case by
$$m = \begin{pmatrix} 0 & (w^+)^\top \\ w^+ & \min(w, \tilde w) \end{pmatrix},$$
where $w^+$ is the matrix $w$ with the lower-triangular part replaced by $+\infty$; see (Barannikov et al., 2022, Section 2.2) for the general form of the matrix $m$. The persistence barcode of the weighted graph $\hat G^{w, \tilde w}$ is called the R-Cross-Barcode (for Representations' Cross-Barcode). Note that for every two nodes in the graph $\hat G^{w, \tilde w}$ there exists a path with edges having zero weights; thus, the $H_0$ barcode in the R-Cross-Barcode is always empty. Intuitively, the $k$-th barcode of $\mathrm{VR}_\alpha(\hat G^{w, \tilde w})$ records the $k$-dimensional topological features that are born in $\mathrm{VR}_\alpha(G^{\min(w, \tilde w)})$ but are not yet born near the same place in $\mathrm{VR}_\alpha(G^w)$, and the $(k-1)$-dimensional topological features that are dead in $\mathrm{VR}_\alpha(G^{\min(w, \tilde w)})$ but are not yet dead in $\mathrm{VR}_\alpha(G^w)$. The R-Cross-Barcode$_k(X, \tilde X)$ records the differences in the multi-scale topology of the two point clouds; topological features with longer lifespans indicate, in general, the essential differences. Basic properties of R-Cross-Barcode$_k(X, \tilde X)$ (Barannikov et al. (2022)) are:

• if $X = \tilde X$, then R-Cross-Barcode$_k(X, \tilde X) = \emptyset$ for all $k$;

• if all distances within $\tilde X$ are zero, i.e. all objects are represented by the same point in $\tilde X$, then for all $k \ge 0$: R-Cross-Barcode$_{k+1}(X, \tilde X) = \mathrm{Barcode}_k(X)$, the standard barcode of the point cloud $X$;

• for any value of the threshold $\alpha$, the following sequence of natural linear maps of homology groups
$$\cdots \xrightarrow{r_{3i+3}} H_i(\mathrm{VR}_\alpha(G^w)) \xrightarrow{r_{3i+2}} H_i(\mathrm{VR}_\alpha(G^{\min(w, \tilde w)})) \xrightarrow{r_{3i+1}} H_i(\mathrm{VR}_\alpha(\hat G^{w, \tilde w})) \xrightarrow{r_{3i}} H_{i-1}(\mathrm{VR}_\alpha(G^w)) \xrightarrow{r_{3i-1}} \cdots \xrightarrow{r_1} H_0(\mathrm{VR}_\alpha(G^{\min(w, \tilde w)})) \xrightarrow{r_0} 0 \quad (1)$$
is exact, i.e. for any $j$ the kernel of the map $r_j$ is the image of the map $r_{j+1}$.

Proposition 3. Given an exact sequence as in (1) with finite-dimensional filtered complexes $A_\alpha$, $B_\alpha$, $C_\alpha$, the alternating sums over $k$ of their topological features' lifespans satisfy
$$\sum_k (-1)^k l_k(A) - \sum_k (-1)^k l_k(B) + \sum_k (-1)^k l_k(C) = 0, \quad (2)$$
where $l_k(Z)$ denotes the sum of bar lengths in $\mathrm{Barcode}_k(Z)$; here, for simplicity, all lifespans are assumed to be finite.

Proof. The exact sequence implies that the alternating sums of the dimensions of the homology groups satisfy, for any $\alpha$,
$$\sum_k (-1)^k \dim H_k(A_\alpha) - \sum_k (-1)^k \dim H_k(B_\alpha) + \sum_k (-1)^k \dim H_k(C_\alpha) = 0.$$
Notice that for any $\alpha_1 < \alpha_2$,
$$\dim H_k(Z_{\alpha_2}) - \dim H_k(Z_{\alpha_1}) = \#b(Z, (\alpha_1, \alpha_2], k) - \#d(Z, (\alpha_1, \alpha_2], k),$$
where $\#b(Z, (\alpha_1, \alpha_2], k)$, respectively $\#d(Z, (\alpha_1, \alpha_2], k)$, is the number of births, respectively deaths, of dimension-$k$ topological features in $Z$ at thresholds $\alpha_1 < \alpha \le \alpha_2$. Hence
$$\sum_k (-1)^k (\#b - \#d)(A, (\alpha_1, \alpha_2], k) - \sum_k (-1)^k (\#b - \#d)(B, (\alpha_1, \alpha_2], k) + \sum_k (-1)^k (\#b - \#d)(C, (\alpha_1, \alpha_2], k) = 0. \quad (3)$$
Setting $\alpha_1 = \alpha - \epsilon$, $\alpha_2 = \alpha + \epsilon$, we get, for any $\alpha$,
$$\sum_k (-1)^k (\#b - \#d)(A, \alpha, k) - \sum_k (-1)^k (\#b - \#d)(B, \alpha, k) + \sum_k (-1)^k (\#b - \#d)(C, \alpha, k) = 0,$$
where $\#b(Z, \alpha, k)$, respectively $\#d(Z, \alpha, k)$, is the number of births, respectively deaths, of dimension-$k$ topological features in $Z$ at the threshold $\alpha$. Summing this over all nontrivial filtration steps $\alpha$ gives the identity (2).

Proposition 4.
$$\sum_k (-1)^k \mathrm{RTD}_k(w, \tilde w) - \sum_k (-1)^k l_k(\mathrm{VR}(G^w)) + \sum_k (-1)^k l_k(\mathrm{VR}(G^{\min(w, \tilde w)})) = 0.$$

C LEARNING REPRESENTATIONS IN HIGHER DIMENSIONS

The following table shows the results of the experiment with latent dimensions 16, 32, 64, and 128 for the F-MNIST dataset. RTD-AE is consistently better than the competitors.

D REAL WORLD DATASETS, 2D LATENT SPACE

Table 5 and Figure 7 present the results.

E SYNTHETIC DATASETS, 2D LATENT SPACE

Table 6 shows the results.

F RTD MINIMIZATION WITHOUT THE AUTOENCODER

Given the set $X = \{x_i\}_{i=1}^n$ of $n$ objects in a high-dimensional space, $x_i \in \mathbb{R}^d$, our goal is to find their representations in a low-dimensional space $Z = \{z_i\}$, $z_i \in \mathbb{R}^k$. It is possible to solve $\min_Z \mathrm{RTD}(X, Z)$ directly with respect to the $n$ vectors $z_i \in \mathbb{R}^k$, in a flavor similar to UMAP and t-SNE. Figures 8, 9 show the results of two experiments with 3D → 2D dimensionality reduction. We conclude that dimensionality reduction via direct RTD optimization better preserves the data topology: meridians are kept connected (Figure 8) and the nestedness is retained (Figure 9). The optimization took ∼1 hour. For the experiment with nested spheres, the RTD optimization was warm-started with the MDS solution.

G AN ALTERNATIVE VARIANT OF RTD

Recall that the auxiliary graph $\hat G^{w, \tilde w}$ uses, in the simplest case, the weight matrix
$$m = \begin{pmatrix} 0 & (w^+)^\top \\ w^+ & \min(w, \tilde w) \end{pmatrix}.$$
An alternative variant of RTD is possible with the following matrix of weights (again in the simplest case):
$$m = \begin{pmatrix} 0 & (\max(w, \tilde w)^+)^\top \\ \max(w, \tilde w)^+ & \tilde w \end{pmatrix}.$$
Both variants share similar properties and guarantee that $\mathrm{RTD}(X, Z) = 0$ when all the pairwise distances in the point clouds $X$ and $Z$ are the same. Also, in both cases, if $\mathrm{RTD}_k(X, Z) = \mathrm{RTD}_k(Z, X) = 0$ for $k \ge 1$, then the persistence diagrams of $X$ and $Z$ coincide. The minimization of the sum of both variants of RTD leads to richer gradient information. We used this loss in the experiment with the "Mammoth" dataset.

H HYPERPARAMETERS

In the experiments with projecting to 3D space we trained the model for 100 epochs using the Adam optimizer. We initially trained the autoencoder for 10 epochs with only the reconstruction loss and learning rate 1e-4, then continued with RTD. Epochs 11-30 were trained with learning rate 1e-2, epochs 31-50 with learning rate 1e-3, and for all epochs after that learning rate 1e-4 was used. The batch size was 80. For 2D and high-dimensional projections, we used fully-connected autoencoders with the hyperparameters specified in Table 7. The autoencoder was initially trained only with the reconstruction loss for some number of epochs, and then the RTD loss kicked in. The learning rate stayed the same for the entire duration of training. For the experiments we used an NVIDIA TITAN RTX.
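For concreteness, the staged 3D-projection schedule just described can be written as a small helper (a sketch of the schedule as stated above, not the released training code):

```python
def schedule(epoch):
    """Learning rate and whether the RTD loss is active for a given
    1-indexed epoch, following the stated 3D-projection schedule."""
    if epoch <= 10:          # reconstruction-only warm-up
        return 1e-4, False
    if epoch <= 30:          # RTD loss switched on, high learning rate
        return 1e-2, True
    if epoch <= 50:
        return 1e-3, True
    return 1e-4, True        # all remaining epochs up to 100

print(schedule(1), schedule(20), schedule(60))
```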

I RTD OPTIMIZATION SPEED-UPS

For all computations of RTD-barcodes in this work we used a modified version of the Ripser++ software (Zhang et al., 2020). Our modification decreases the computation time by exploiting the structure of the graph $\hat{G}^{w,\tilde{w}}$ (see Section 3.2). The idea is to reduce the size of the filtered complex by excluding the simplices that do not affect the persistent homology. Here we consider only simplices of dimension at least 1. We exclude all simplices spanned by vertices from the first half of the vertex set of the graph $\hat{G}^{w,\tilde{w}}$; those are the vertices corresponding to the upper-left quadrant of the graph's edge weight matrix. In particular, this eliminates around 1/8 of the rows and 1/4 of the columns (around 1/3 of the cells in total) from the boundary matrix used for the computation of persistence pairs of dimension 1. On average, compared to the standard Ripser++ computation, this gives ≈45% less time for the computation of persistence intervals of dimension 1.

Next, we describe some techniques that can improve convergence when RTD is minimized without an autoencoder (Appendix F). Usually we perform (sub)gradient descent to minimize $\mathrm{RTD}(X, \tilde{X})$ between a "movable" cloud $X$ and a given constant cloud $\tilde{X}$.

Gradient smoothing. The subgradients computed at each step of this procedure associate each homological class with at most 4 points from $X$, while topological structures often include many more points. Moreover, the adjustments they induce may be inconsistent for nearby points. To overcome this, we "smooth" the gradients by passing to each point the averaged gradients of all its neighbours. Let $\nabla_i^{(k)}$ be the gradient value for $X_i$ at step $k$ and $U(X_i^{(k)})$ be some neighbourhood of $X_i^{(k)}$. Then each step of the gradient descent is

$$X_i^{(k)} = X_i^{(k-1)} - \lambda_k \left( \beta \nabla_i^{(k)} + (1 - \beta)\, \frac{1}{\#\{X_j^{(k)} \in U(X_i^{(k)})\}} \sum_{X_j^{(k)} \in U(X_i^{(k)})} \nabla_j^{(k)} \right)$$

where $\beta \in [0, 1]$ is a parameter.
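The smoothing step above can be transcribed directly in NumPy; the ball neighbourhood and the inclusion of the point itself in $U(X_i)$ are our own illustrative choices, not necessarily those of the authors:

```python
import numpy as np

def smooth_gradients(X, grads, radius, beta):
    """Blend each point's gradient with the mean gradient over its
    neighbourhood U(X_i) -- here a radius ball that includes X_i itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    out = np.empty_like(grads)
    for i in range(len(X)):
        nbrs = d[i] <= radius                     # neighbourhood U(X_i)
        out[i] = beta * grads[i] + (1 - beta) * grads[nbrs].mean(axis=0)
    return out

# two nearby points exchange gradient information; the far point keeps its own
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
g = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
smoothed = smooth_gradients(X, g, radius=0.5, beta=0.5)
```

With β = 0.5, the two nearby points end up with gradients pulled halfway toward their shared neighbourhood average, while the isolated point is unchanged.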
Minimum bypassing. Suppose we want to shorten an edge $m_{i+N,j+N}$ from the bottom-right quadrant of the matrix $m$, i.e. $\partial\,\mathrm{RTD}(X, \tilde{X}) / \partial m_{i+N,j+N} < 0$. It may occur that $w_{ij} > \tilde{w}_{ij}$, so

$$\frac{\partial m_{i+N,j+N}}{\partial X_i} = \frac{X_i - X_j}{\|X_i - X_j\|_2}\, \mathbb{I}\{w_{ij} < \tilde{w}_{ij}\} = 0,$$

and gradient descent gets stuck here (since $\tilde{X}$ is constant). Thus there may be a certain threshold below which $\mathrm{RTD}(X, \tilde{X})$ cannot be minimized in this case. It can, however, be minimized further if we move the points $X_i$ and $X_j$ close enough to each other that $w_{ij} < \tilde{w}_{ij}$. To do so, whenever $\partial\,\mathrm{RTD}(X, \tilde{X}) / \partial m_{i+N,j+N} < 0$, we compute $\partial m_{i+N,j+N} / \partial X_i$ without the indicator, i.e. as $\frac{X_i - X_j}{\|X_i - X_j\|_2}$. This ensures that $w_{ij}$ decreases and at a certain point becomes lower than $\tilde{w}_{ij}$. If $\partial\,\mathrm{RTD}(X, \tilde{X}) / \partial m_{i+N,j+N} \ge 0$, we change nothing, because the effect in question appears only when minimizing a minimum of a function and a constant.

We performed an experiment transforming a cloud shaped like the infinity sign by minimizing the RTD between this cloud and a ring-shaped cloud. Both clouds had 100 points and we did not use batch optimisation. We performed 100 iterations of gradient descent to minimize RTD in each of the following four setups: using none, either one, or both of the Gradient Smoothing and Minimum Bypassing tricks. For each setup we also searched for the best learning rate. Table 8 shows the results after 100 iterations.
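The minimum-bypassing rule can be sketched for a single matrix entry as follows (our own toy transcription of the subgradient, not the authors' code):

```python
import numpy as np

def dm_dXi(Xi, Xj, w_tilde_ij, bypass=False):
    """(Sub)gradient of m_{i+N,j+N} = min(w_ij, w~_ij) w.r.t. X_i, where
    w_ij = ||X_i - X_j||_2.  Without bypassing, the indicator kills the
    gradient whenever w_ij > w~_ij; with bypass=True the indicator is
    dropped, so gradient descent keeps pulling X_i towards X_j."""
    diff = Xi - Xj
    w_ij = np.linalg.norm(diff)
    grad = diff / w_ij
    if bypass:
        return grad
    return grad if w_ij < w_tilde_ij else np.zeros_like(grad)

Xi, Xj = np.array([3.0, 0.0]), np.array([0.0, 0.0])
stuck = dm_dXi(Xi, Xj, w_tilde_ij=1.0)               # w_ij = 3 > w~_ij = 1
moving = dm_dXi(Xi, Xj, w_tilde_ij=1.0, bypass=True)  # indicator dropped
```

Here `stuck` is the zero vector (gradient descent stalls), while `moving` is the unit vector that keeps shrinking the edge until $w_{ij} < \tilde{w}_{ij}$.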

J COMPARISON WITH TOPOAE LOSS

The following simple example, illustrated in Figure 10, shows that the TopoAE loss can be discontinuous in a rather standard situation. The TopoAE loss (Moor et al., 2020) is constructed by first computing the two minimal spanning trees $\Gamma$, $\tilde{\Gamma}$ of the graphs $G^w$, $G^{\tilde{w}}$, whose weights are the distances within the two point clouds $X$ and $\tilde{X}$. The TopoAE loss is then the sum of two terms, $L_{\text{TopoAE}} = l + \tilde{l}$. One term is the sum over the set of edges of $\Gamma$: $l = \frac{1}{2} \sum_{ij \in \text{Edges}(\Gamma)} (w_{ij} - \tilde{w}_{ij})^2$, and the other is the analogous sum over the edges of $\tilde{\Gamma}$: $\tilde{l} = \frac{1}{2} \sum_{ij \in \text{Edges}(\tilde{\Gamma})} (w_{ij} - \tilde{w}_{ij})^2$. Under a small perturbation of the points, the minimal spanning tree $\Gamma$ may change, e.g. with a change of the pair of closest points from two clusters; but then the corresponding weights $w$ in general change discontinuously. The point cloud $X$ in Figure 10 consists of two clusters $\{1, 2, 3\}$ and $\{4\}$. The point cloud $\tilde{X}$ consists of two clusters $\{1, 2\}$ and $\{3, 4\}$. We set the distances within each cluster to be of order $10^{-1}$ and the distances between the clusters equal to $10^3 \pm 10^{-1}$. When the point 3 moves slightly in $X$ as indicated, the minimal spanning tree $\Gamma$, coloured yellow, changes, and the term $(w_{14} - \tilde{w}_{14})^2 \sim 10^{-2}$ in $l$ is replaced by $(w_{23} - \tilde{w}_{23})^2 \sim 10^6$.

Proposition 5. The RTD loss is continuous: $\mathrm{RTD}_k(X, \tilde{X})$ depends continuously on $(X, \tilde{X})$.

The proof follows from the stability of the barcode of the filtered complex $VR_\alpha(\hat{G}^{w,\tilde{w}})$ with respect to the bottleneck distance under perturbation of the edge weights; see Appendix K.

K STABILITY OF R-Cross-Barcode AND RTD

Proposition 6. For any perturbations $X'$ of a point cloud $X$ and $\tilde{X}'$ of a point cloud $\tilde{X}$,

$$d_B\big(\text{R-Cross-Barcode}_k(X, \tilde{X}),\ \text{R-Cross-Barcode}_k(X', \tilde{X}')\big) \le 2 \max\big(\max_i \|X'_i - X_i\|,\ \max_j \|\tilde{X}'_j - \tilde{X}_j\|\big) \qquad (5)$$

where $d_B$ denotes the bottleneck distance.

Proof. By construction, $\text{R-Cross-Barcode}_k(X, \tilde{X})$ is the $k$-th persistence barcode of the weighted graph $\hat{G}^{w,\tilde{w}}$ with the weights $w_{ij} = \|X_i - X_j\|$ and $\min(w_{ij}, \tilde{w}_{ij})$, where $\tilde{w}_{ij} = \|\tilde{X}_i - \tilde{X}_j\|$.
If $\max_i \|X'_i - X_i\| = \varepsilon$, then $|w'_{ij} - w_{ij}| \le 2\varepsilon$ for $w'_{ij} = \|X'_i - X'_j\|$. Similarly, $|\tilde{w}'_{ij} - \tilde{w}_{ij}| \le 2\tilde{\varepsilon}$, where $\tilde{\varepsilon} = \max_j \|\tilde{X}'_j - \tilde{X}_j\|$. It follows that $|\min(w'_{ij}, \tilde{w}'_{ij}) - \min(w_{ij}, \tilde{w}_{ij})| \le 2\max(\varepsilon, \tilde{\varepsilon})$. Hence the filtration value of each simplex in $VR_\alpha(\hat{G}^{w,\tilde{w}})$ changes by at most $2\max(\varepsilon, \tilde{\varepsilon})$ under the perturbations. Next, it follows, e.g. from the description of metamorphoses of canonical forms in (Barannikov, 1994), that the birth or the death of each segment in the $k$-th barcode of $\hat{G}^{w,\tilde{w}}$ changes under such perturbations by at most $2\max(\varepsilon, \tilde{\varepsilon})$.

The above arguments also prove the following stability result.

Proposition 7. For any quadruple of edge weight sets $w_{ij}$, $\tilde{w}_{ij}$, $w'_{ij}$, $\tilde{w}'_{ij}$ on $G$:

$$d_B\big(\text{R-Cross-Barcode}_k(w, \tilde{w}),\ \text{R-Cross-Barcode}_k(w', \tilde{w}')\big) \le \max\big(\max_{ij} |w'_{ij} - w_{ij}|,\ \max_{ij} |\tilde{w}'_{ij} - \tilde{w}_{ij}|\big) \qquad (6)$$

where $d_B$ denotes the bottleneck distance and $\text{R-Cross-Barcode}_k(w, \tilde{w})$ denotes the persistence barcode of the weighted graph $\hat{G}^{w,\tilde{w}}$.

Proposition 8. For any pair of edge weight sets $w_{ij}$, $\tilde{w}_{ij}$:

$$\|\text{R-Cross-Barcode}_k(w, \tilde{w})\|_B \le \max_{ij} |w_{ij} - \tilde{w}_{ij}| \qquad (7)$$

where $\|\cdot\|_B$ denotes the bottleneck norm.

Proof. Substitute $w' = \tilde{w}' = \tilde{w}$ into (6). Notice that (7) is analogous to (Barannikov et al., 2021, Proposition 1).

Given a pair of metrics $u$, $u'$ on a measure space $(\mathcal{X}, \mu)$, an analogue of the Gromov-Wasserstein distance between $u$ and $u'$ is

$$GW(u, u') = \inf_{e, e' : \mathcal{X} \to Z} \int_{\mathcal{X}} \rho_Z(e(x), e'(x)) \, d\mu,$$

where $e : \mathcal{X} \to Z$, $e' : \mathcal{X} \to Z$ are embeddings into various metric spaces $(Z, \rho_Z)$ that are isometric with respect to $u$, $u'$.

For the "Spheres" dataset we have also performed experiments on dimensionality reduction to 2D space. The overall results are quite similar to those obtained in the 3D case. The behavior of the baselines remains essentially the same. The only interesting change is that RTD-AE now projects the bigger sphere to a ring and puts the projections of the smaller spheres into the ring's hollow center.
RTD-AE outperforms the other methods in terms of linear correlation and triplet accuracy. All of the representations were generated with the default parameters of the baseline methods. Results are presented in Figures 4 ("Spheres" to 3D space) and 11 ("Spheres" to 2D space and "Torus" to 3D).

N ABLATION STUDY

In this section we investigate the effect of adding the RTD loss on the performance of the model. We add a hyperparameter $\lambda$ responsible for the scale of the RTD loss term: $L_{rec}(X, \tilde{X}) + \lambda\,\mathrm{RTD}(X, Z)$. We run our experiments on two datasets: COIL-20 and Circle. The hyperparameter value ranged

Q RECONSTRUCTION LOSS

See Table 12 for results.

R HYPERPARAMETERS SEARCH FOR SPHERES DATASET (INTO 2D)

For TopoAE we performed a hyperparameter search in accordance with the original paper Moor et al. (2020) and selected the best combination according to the $KL_{0.1}$-divergence. For RTD-AE we searched for the batch size in [20; 250] and $\lambda$ in [0.1; 10]. The best combination was once again selected w.r.t. the $KL_{0.1}$-divergence. Results are presented in Table 13. For the Wasserstein Distance and Triplet Accuracy the difference between the means is smaller than the standard deviations, and because of this we performed a one-tailed Student's t-test to assess their relation. According to its results, we can reject the null hypothesis that the mean W.D.
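Such a one-tailed comparison can be reproduced with scipy.stats; the per-run values below are illustrative placeholders of our own, not the actual numbers behind Table 13:

```python
import numpy as np
from scipy import stats

# hypothetical Wasserstein distances over five runs for two methods
wd_rtd_ae = np.array([36.1, 38.0, 37.5, 36.8, 38.2])
wd_topoae = np.array([41.0, 43.5, 42.2, 40.9, 43.0])

# one-tailed Welch t-test; H1: the first sample's mean is smaller
t, p = stats.ttest_ind(wd_rtd_ae, wd_topoae, equal_var=False,
                       alternative="less")
```

With these illustrative samples the test rejects the null hypothesis at the 5% level; `alternative="less"` is what makes the test one-tailed.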

S ON IDENTITY OF INDISCERNIBLES FOR THE TOPOAE LOSS

We compare the two point clouds $X$, $\tilde{X}$ from Figure 16. For these point clouds, $\mathrm{RTD}(X, \tilde{X}) = 0.207$, while the topological part of the TopoAE loss equals 0. The distinguishing topological feature between $X$ and $\tilde{X}$ is the cycle in $\tilde{X}$ which is born at $\alpha = 1$ and dies at $\alpha = \sqrt{2}$. The $\text{R-Cross-Barcode}_1(\tilde{X}, X)$ depicts this difference.
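The MST part of this comparison is easy to verify with SciPy on a hypothetical pair of clouds in the spirit of Figure 16 (four collinear unit-spaced points versus the unit square; our own reconstruction, not the paper's exact coordinates): the MST of the collinear cloud is the unit-edge path 1-2-3-4, and those three distances also equal 1 in the square, so the corresponding TopoAE term vanishes even though the two clouds have different $H_1$ barcodes.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def topoae_term(W_src, W_dst):
    """l = 1/2 * sum over the MST edges of W_src of (W_src[ij] - W_dst[ij])^2.
    The full TopoAE loss adds the symmetric term with the roles swapped."""
    mst = minimum_spanning_tree(W_src).tocoo()
    return 0.5 * sum((W_src[i, j] - W_dst[i, j]) ** 2
                     for i, j in zip(mst.row, mst.col))

X  = np.array([[0.0, 0], [1, 0], [2, 0], [3, 0]])   # collinear cloud
Xt = np.array([[0.0, 0], [1, 0], [1, 1], [0, 1]])   # unit square
l = topoae_term(cdist(X, X), cdist(Xt, Xt))          # exactly 0.0
```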



github.com/danchern97/RTD_AE



Figure 1: Dimensionality reduction (3D → 2D) on the "Mammoth" dataset. The proposed RTD-AE method better captures both global and local structure.

Suppose we have two point clouds A and B, of seven points each, with the distances between points as shown in the top row of Figure 2. Consider the R-Cross-Barcode$_1$(A, B): it consists of 4 intervals (the bottom row of the figure). The 4 intervals describe the topological discrepancies between the connected components of the α-neighborhood graphs of A and B.
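The α-neighborhood graphs underlying this comparison are straightforward to build; the toy clouds below are our own, not the seven-point clouds of Figure 2:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def n_components(X, alpha):
    """Number of connected components of the alpha-neighborhood graph of X
    (vertices are points, edges join points at distance <= alpha)."""
    adj = (cdist(X, X) <= alpha).astype(int)
    return connected_components(adj, directed=False)[0]

A = np.array([[0.0, 0], [1, 0], [4, 0]])
B = np.array([[0.0, 0], [2, 0], [4, 0]])
# at alpha = 1.5 the two left points of A are already merged, B's are not;
# such mismatches are what the R-Cross-Barcode intervals track across alpha
cA, cB = n_components(A, 1.5), n_components(B, 1.5)
```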

Figure 2: A graphical representation of the R-Cross-Barcode$_1$(A, B) for the point clouds A and B. The pairwise distance matrices for A and B are shown in the top row. Edges present in the α-neighborhood graphs for B but not for A are colored red. Edges present in the α-neighborhood graph for A are colored grey. The timeline for the appearance and disappearance of topological features distinguishing the two graphs is shown. The appearance-disappearance process is illustrated by the underlying bars connecting the corresponding thresholds.

Figure 3: RTD Autoencoder

Figure 4: Results on dimensionality reduction to 3D-space

Figure 5: Results on synthetic 2D data. First column: original data. Other columns: results of dimensionality reduction methods.

Figure 6: The graph $\hat{G}^{w,\tilde{w}}$ used to compare $G^w = \{A_1, A_2, A_3\}$ and $G^{\tilde{w}} = \{\tilde{A}_1, \tilde{A}_2, \tilde{A}_3\}$. Dashed edges correspond to zero weights, green edges to $w$, blue edges to $\min(w, \tilde{w})$; edges with weight $+\infty$ are not shown.

Figure 7: Results on real-world data reduction to 2D.

Figure 9: "Nested spheres" dataset.

Figure 10: Discontinuity of the TopoAE loss. The point cloud $X$ (top) consists of two clusters $\{1, 2, 3\}$ and $\{4\}$. The point cloud $\tilde{X}$ (bottom left) consists of two clusters $\{1, 2\}$ and $\{3, 4\}$. The distances within each cluster are of order $10^{-1}$ and the distances between the clusters equal $10^3 \pm 10^{-1}$. The TopoAE loss is discontinuous because, under a small perturbation of points, the minimal spanning tree $\Gamma$ may change. When the point 3 moves slightly as indicated, the minimal spanning tree $\Gamma$, coloured yellow, changes and the term $(w_{14} - \tilde{w}_{14})^2 \sim 10^{-2}$ in the TopoAE loss is replaced by $(w_{23} - \tilde{w}_{23})^2 \sim 10^6$.

Figure 11: Results on dimensionality reduction of Spheres to 2D-space and Torus to 3D-space

Figure 13: Additional dimensionality reduction methods applied to the "Mammoth" dataset

Figure 14: R-Cross-Barcodes between latent representations and original data points. Top: R-Cross-Barcode($Z_0$, X), R-Cross-Barcode(X, $Z_0$). Bottom: R-Cross-Barcode(Z, X), R-Cross-Barcode(X, Z). X is the "Mammoth" dataset, Z the latent representation from RTD-AE, $Z_0$ the latent representation from the untrained autoencoder. The intervals in the R-Cross-Barcodes are smaller after training.

Figure 15: Dimensionality reduction of Spheres dataset to 2D-space after hyperparameter search.

(a) The point cloud X. (b) The point cloud X. (c) The barcode of X. (d) The barcode of X.

Figure 16: Two point clouds $X$, $\tilde{X}$ for which the identity-of-indiscernibles property does not hold for the topological term in the TopoAE loss. The one-to-one correspondence between the clouds is indicated by the numbers. The minimal spanning trees 1-2-3-4 have edges of identical length for both point clouds. For these point clouds, $\mathrm{RTD}(X, \tilde{X}) = 0.207$, while the topological term of the TopoAE loss equals 0. The topology of the two point clouds is different; in particular, they have different barcodes. The distinguishing topological feature is the cycle in $\tilde{X}$ which is born at $\alpha = 1$ and dies at $\alpha = \sqrt{2}$. The $\text{R-Cross-Barcode}_1(\tilde{X}, X)$ depicts this difference.

MNIST: AE $3.78 \times 10^{-3}$, TopoAE $3.70 \times 10^{-3}$, RTD-AE $4.88 \times 10^{-3}$
scRNA mice: AE $1.31 \times 10^{-3}$, TopoAE $1.23 \times 10^{-3}$, RTD-AE $1.32 \times 10^{-3}$
scRNA melanoma: AE $1.16 \times 10^{-3}$, TopoAE $1.11 \times 10^{-3}$, RTD-AE $1.15 \times 10^{-3}$

Quality of data manifold global structure preservation at projection from 101D into 3D space.

Quality of data manifold global structure preservation at projection into 16D space.


Quality of data manifold global structure preservation at projection into high-dimensional space.

Quality of data manifold global structure preservation for real-world data dimension reduction to 2D.

Quality of data manifold global structure preservation on synthetic data.

Hyperparameters description.

All of them have diameters equal to zero, and if any such simplex spawns a topological feature, it is immediately killed by another such simplex. As before, let N be the number of vertices in the point clouds. Then $\hat{G}^{w,\tilde{w}}$ has 2N vertices and our modification eliminates N d out of 2N

RTD optimization speed-ups

Quality of data manifold global structure preservation at projection of the Torus dataset from 100D into 3D space and the Spheres dataset from 101D to 2D.

Quality of data manifold global structure preservation for projection of COIL-20 into 3D space. AE takes the third place with very similar quality; see Figure 11 and Table

Reconstruction loss when projecting into a 16-dimensional latent space. TopoAE $2.84 \times 10^{-3}$, RTD-AE 3.17

Hyperparameter search for the Spheres (into 2D) dataset.

ACKNOWLEDGEMENTS

The work was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021) 

REPRODUCIBILITY STATEMENT

To ensure reproducibility, we release the source code of the proposed RTD-AE (see Section 1); for hyperparameters, see Appendix H. For the other methods, we used either official implementations or implementations from scikit-learn with default hyperparameters. We used public datasets (see Section 5, Appendix L). We generated several synthetic datasets and made the generating code available.


Proposition 9. Given a triple of metrics $u$, $u'$, $\tilde{u}$ on a measure space $(\mathcal{X}, \mu)$, the expectation of the bottleneck distance between $\text{R-Cross-Barcode}_k(w, \tilde{w})$ and $\text{R-Cross-Barcode}_k(w', \tilde{w})$, comparing the pairs of weighted graphs associated with a sample $X = \{x_1, \dots, x_n\}$, $x_i \in \mathcal{X}$, with the edge weights $w_{ij} = u(x_i, x_j)$, $w'_{ij} = u'(x_i, x_j)$, $\tilde{w}_{ij} = \tilde{u}(x_i, x_j)$, is upper bounded by the Gromov-Wasserstein distance between $u$ and $u'$.

Proof. It follows from the R-Cross-Barcode stability (6) that

$$d_B\big(\text{R-Cross-Barcode}_k(w, \tilde{w}),\ \text{R-Cross-Barcode}_k(w', \tilde{w})\big) \le \max_{ij} |w'_{ij} - w_{ij}|.$$

For any pair of isometric embeddings $e : \mathcal{X} \to Z$, $e' : \mathcal{X} \to Z$,

$$|w'_{ij} - w_{ij}| = |\rho_Z(e'(x_i), e'(x_j)) - \rho_Z(e(x_i), e(x_j))| \le \rho_Z(e(x_i), e'(x_i)) + \rho_Z(e(x_j), e'(x_j))$$

by the triangle inequality for $\rho_Z$. Therefore, taking the expectation over samples and the infimum over isometric embeddings yields the claimed bound.

Proposition 10. The expectation of the bottleneck norm of $\text{R-Cross-Barcode}_k(w, \tilde{w})$ for two weighted graphs with edge weights $w_{ij} = u(x_i, x_j)$, $\tilde{w}_{ij} = \tilde{u}(x_i, x_j)$, where $u$, $\tilde{u}$ is a pair of metrics on a measure space $(\mathcal{X}, \mu)$ and $X = \{x_1, \dots, x_n\}$, $x_i \in \mathcal{X}$, is a sample from $(\mathcal{X}, \mu)$, is upper bounded by the Gromov-Wasserstein distance between $u$ and $\tilde{u}$.

L DATASETS

The exact size, nature and dimension of the datasets are presented in Table 9. The errors for the synthetic data are not reported as they are zero due to the small sizes of the datasets.

L.1 SYNTHETIC DATA

The "Random" dataset consists of 500 points randomly distributed on a 2-dimensional unit square. The choice of this dataset was inspired by Coenen & Pearce (2019a) and the ability of UMAP to find clusters in noise.

The "Circle" dataset consists of 100 points randomly distributed on a 2D circle. This dataset has a simple non-trivial topology.

The "2 Clusters" dataset consists of 200 points, half of which form a dense Gaussian cluster and the other half a sparse Gaussian cluster with the same mean. It is used to test the methods' ability to preserve cluster density.

The "3 Clusters" dataset consists of 3 Gaussian clusters, each having 100 points. Two clusters are located much closer to each other than the remaining one.
We propose it to test the preservation of the global structure, i.e. the distances between clusters.

Published as a conference paper at ICLR 2023

(Szubert et al., 2019).

Dataset licences:
• Mammoth (Coenen & Pearce, 2019b), CC Zero License. Mammuthus primigenius (Blumbach), The Smithsonian Institute, https://3d.si.edu/object/3d/mammuthus-primigenius-blumbach:341c96cd-f967-4540-8ed1-d3fc56d31f12
• MNIST (LeCun et al., 1998), MIT License.
• Fashion-MNIST (Xiao et al., 2017), MIT License.
• COIL-20 (Nene et al., 1996).
• scRNA mice (Yuan et al., 2017).
• scRNA melanoma (Tirosh et al., 2016).

M MORE DETAILS ON EXPERIMENTS WITH "SPHERES" AND "TORUS"

We performed experiments on dimensionality reduction to 3D space to evaluate how well our method preserves 3-dimensional structures in the data. The experimental setup was outlined in Section 5. For this task we used two synthetic datasets.

The "Spheres" dataset consists of 17,250 points randomly distributed on the surfaces of eleven 100-spheres in 101-dimensional space. No two of them intersect, and one of the spheres contains all the others inside. Similarly to the "Circle" dataset (Section 5.3), UMAP splits the bigger sphere (light grey) into 10 parts and wraps each small sphere into one of them. PaCMAP performs similarly, but it also splits off a part of the bigger sphere into a separate sphere. PCA and Ivis preserve the shape of the inner spheres only and turn the whole structure 'inside out'. Both t-SNE and the regular AE project all points onto one sphere without a clear separation between the clouds. The addition of a topological loss, both in TopoAE and in our RTD-AE, preserves the global structure of the inlaid clusters. However, TopoAE flattens the inner clusters into disks, while RTD-AE turns them into (hollow) spheres.

The "Torus" dataset consists of 5,000 points randomly distributed on the surface of a 2-torus ($T^2$) immersed into 100-dimensional space. Due to the nature of this dataset, PCA and MDS methods

