LEARNING TOPOLOGY-PRESERVING DATA REPRESENTATIONS

Abstract

We propose a method for learning topology-preserving data representations (dimensionality reduction). The method aims to provide topological similarity between the data manifold and its latent representation by enforcing similarity in topological features (clusters, loops, 2D voids, etc.) and their localization. The core of the method is the minimization of the Representation Topology Divergence (RTD) between the original high-dimensional data and its low-dimensional representation in latent space. RTD minimization provides closeness in topological features with strong theoretical guarantees. We develop a scheme for RTD differentiation and apply it as a loss term for an autoencoder. The proposed method, "RTD-AE", better preserves the global structure and topology of the data manifold than state-of-the-art competitors, as measured by linear correlation, triplet distance ranking accuracy, and the Wasserstein distance between persistence barcodes.

1. INTRODUCTION

Dimensionality reduction is a useful tool for data visualization, preprocessing, and exploratory data analysis. Clearly, immersion of high-dimensional data into 2D or 3D space is impossible without distortions, which vary across popular methods. Dimensionality reduction methods can be broadly classified into global and local methods. Classical global methods (PCA, MDS) tend to preserve the global structure of a manifold. However, in many practical applications, the produced visualizations are non-informative since they don't capture complex non-linear structures. Local methods (UMAP (McInnes et al., 2018), PaCMAP (Wang et al., 2021), t-SNE (Van der Maaten & Hinton, 2008), Laplacian Eigenmaps (Belkin & Niyogi, 2001), ISOMAP (Tenenbaum et al., 2000)) focus on preserving neighborhood data and local structure at the cost of sacrificing the global structure. The most popular methods, t-SNE and UMAP, are a good choice for inferring cluster structure but often fail to correctly describe the topology of the data manifold. t-SNE and UMAP have hyperparameters that control the size of the neighborhood taken into account when building the representation. Different values of these hyperparameters lead to significantly different visualizations, and none of them is the "canonical" one that correctly represents the high-dimensional data.

We take a different perspective on dimensionality reduction and propose an approach based on Topological Data Analysis (TDA). TDA (Barannikov, 1994; Zomorodian, 2001; Chazal & Michel, 2017) is a field devoted to the numerical description of multi-scale topological properties of data distributions by analyzing point clouds sampled from them. TDA methods naturally capture properties of data manifolds on multiple distance scales and are arguably a good trade-off between local and global approaches. The state-of-the-art TDA approach of this kind is TopoAE (Moor et al., 2020).
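To make the multi-scale summaries used in TDA concrete, the simplest one — the 0-dimensional persistence barcode of a point cloud, which tracks how clusters merge as the distance scale grows — can be computed with a small union-find sketch. This is an illustrative implementation of a standard TDA construction, not the method proposed in this paper:

```python
import numpy as np

def h0_barcode(points):
    """0-dimensional persistence barcode of the Vietoris-Rips filtration.

    Every point is born at scale 0; a bar dies at the scale where its
    connected component merges into another one (Kruskal-style union-find).
    Returns a list of (birth, death) pairs; the last bar is infinite.
    """
    n = len(points)
    # Pairwise Euclidean distances give the filtration values of edges.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                # two components merge at scale d
            bars.append((0.0, d))   # one of the two bars dies here
            parent[ri] = rj
    bars.append((0.0, np.inf))      # one component survives at all scales
    return bars
```

For example, four points forming two well-separated pairs yield two short bars (within-pair merges), one long bar (the merge of the two pairs), and one infinite bar — exactly the kind of multi-scale cluster description that RTD compares between the original and latent spaces.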
However, it has several weaknesses: 1) the loss term is not continuous; 2) vanishing of the loss term is a necessary but not a sufficient condition for the coincidence of topology, as measured by persistence barcodes; see Appendix J for details.

In this paper, we make the following contributions:

1. We develop an approach for RTD differentiation. Topological metrics are difficult to differentiate; the differentiability of RTD and its GPU implementation are a valuable step forward in the TDA context and open novel possibilities for topological optimization.
2. We propose a new method for topology-aware dimensionality reduction: an autoencoder enhanced with the differentiable RTD loss, "RTD-AE". Minimizing the RTD loss between the real and latent spaces forces closeness in topological features and their localization, with strong theoretical guarantees.
3. In computational experiments, we show that the proposed RTD-AE outperforms state-of-the-art dimensionality reduction methods and the vanilla autoencoder in preserving the global structure and topology of a data manifold, as measured by linear correlation, triplet distance ranking accuracy, the Wasserstein distance between persistence barcodes, and RTD itself. In some cases, RTD-AE produces more faithful and visually appealing low-dimensional embeddings than state-of-the-art algorithms. We release the RTD-AE source code.
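One of the evaluation measures listed above, triplet distance ranking accuracy, checks whether the relative ordering of pairwise distances survives the embedding. The sketch below is our illustrative reading of such a metric (the function name and the uniform triplet-sampling scheme are our assumptions, not necessarily the paper's exact protocol):

```python
import numpy as np

def triplet_accuracy(X, Z, n_triplets=1000, seed=0):
    """Fraction of random triplets (i, j, k) for which the ordering of
    d(x_i, x_j) vs d(x_i, x_k) agrees between the original data X and
    its low-dimensional embedding Z. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.choice(n, size=(n_triplets, 3))
    # Drop degenerate triplets with repeated indices.
    ok = (idx[:, 0] != idx[:, 1]) & (idx[:, 0] != idx[:, 2]) & (idx[:, 1] != idx[:, 2])
    i, j, k = idx[ok].T
    # True where j is closer to i than k is, in each space.
    order_x = np.linalg.norm(X[i] - X[j], axis=1) < np.linalg.norm(X[i] - X[k], axis=1)
    order_z = np.linalg.norm(Z[i] - Z[j], axis=1) < np.linalg.norm(Z[i] - Z[k], axis=1)
    return float(np.mean(order_x == order_z))
```

A perfect (e.g. isometric) embedding scores 1.0, while an embedding that scrambles relative distances scores near 0.5 (chance level), which is why this measure is sensitive to distortions of global structure.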

2. RELATED WORK

Various dimensionality reduction methods have been proposed to obtain 2D/3D visualizations of high-dimensional data (Tenenbaum et al., 2000; Belkin & Niyogi, 2001; Van der Maaten & Hinton, 2008; McInnes et al., 2018). Natural science researchers often use dimensionality reduction methods for exploratory data analysis or even to guide further experiments (Becht et al., 2019; Kobak & Berens, 2019; Karlov et al., 2019; Andronov et al., 2021; Szubert et al., 2019). The main problems with these methods are inevitable distortions (Chari et al., 2021; Batson et al., 2021; Wang et al., 2021) and incoherent results for different hyperparameters. These distortions can largely affect the global representation structure, such as inter-cluster relationships and pairwise distances. Since misinterpretation of these quantities in domains such as physics or biology can lead to incorrect conclusions, it is highly important to preserve them as much as possible. UMAP and t-SNE visualizations are frequently sporadic and cannot be considered the "canonical" representation of high-dimensional data. An often overlooked issue is initialization, which significantly contributes to the performance of dimensionality reduction methods (Kobak & Linderman, 2021; Wang et al., 2021). Damrich & Hamprecht (2021) revealed that, because of negative sampling, UMAP's true loss function differs from the one purported by its theory. A number of works try to tackle the distortion problem and preserve as many inter-data relationships as possible. The authors of PHATE (Moon et al., 2019) and ivis (Szubert et al., 2019) claim that their methods capture local as well as global features, but provide no theoretical guarantees for this. (Wagner et al., 2021)



github.com/danchern97/RTD_AE



Figure 1: Dimensionality reduction (3D → 2D) on the "Mammoth" dataset. The proposed RTD-AE method better captures both global and local structure.

