TOAST: TOPOLOGICAL ALGORITHM FOR SINGULARITY TRACKING

Abstract

The manifold hypothesis, which assumes that data lie on or close to an unknown manifold of low intrinsic dimensionality, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibit distinct non-manifold structures, which result in singularities that can lead to erroneous conclusions about the data. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address the detection of singularities by developing (i) persistent local homology, a new topology-driven framework for quantifying the intrinsic dimension of a data set locally, and (ii) Euclidicity, a topology-based multi-scale measure for assessing the 'manifoldness' of individual points. We show that our approach can reliably identify singularities of complex spaces, while also capturing singular structures in real-world data sets.

1. INTRODUCTION

The ever-increasing amount and complexity of real-world data necessitate the development of new methods to extract less complex, but still meaningful, representations of the underlying data. One approach to this problem is via dimensionality reduction techniques, where the data is assumed to be of strictly lower dimension than its number of features. Traditional algorithms in this field, such as PCA, are restricted to linear descriptions of data and are therefore of limited use for the complex, non-linear data sets that often appear in practice. By contrast, non-linear dimensionality reduction algorithms, such as UMAP (McInnes et al., 2018), t-SNE (van der Maaten & Hinton, 2008), or autoencoders (Kingma & Welling, 2019), share one common assumption: the underlying data is supposed to be close to a manifold with small intrinsic dimension, i.e. while the input data may have a large ambient dimension N, there is an n-dimensional manifold with n ≪ N that best describes the data. For some data sets, this manifold hypothesis is appropriate: certain natural images are known to be well-described by a manifold, for instance (Carlsson, 2009), enabling the use of specialised autoencoders for visualisation (Moor et al., 2020). However, recent research shows evidence that the manifold hypothesis does not necessarily hold for complex data sets (Brown et al., 2022), and that manifold learning techniques tend to fail for non-manifold data (Rieck & Leitte, 2015; Scoccola & Perea, 2022). These failures are typically the result of singularities, i.e. regions of a space that violate the properties of a manifold. For example, the 'pinched torus,' an object obtained by compressing a neighbourhood of a random point in a torus to a single point, fails to satisfy the manifold hypothesis at the 'pinch point': this point, unlike all other points of the 'pinched torus,' does not have a neighbourhood homeomorphic to R^2 (see Fig. 1 for an illustration).

Since singularities (unlike outliers that arise, for example, from incorrect labels) may carry relevant information (Jakubowski et al., 2020), we address the shortcomings of existing dimensionality reduction methods by assuming an agnostic view on any given data set. Instead of trying to prescribe the rigid requirements of a manifold, we consider intrinsic dimensionality to be a fundamentally local phenomenon: we permit dimensionality to vary across points in the data set, and, more importantly, across the scale of locality to be considered. The only assumption we make is that the data is of significantly lower dimension than the dimension of the ambient space. This perspective enables us to assess the deviation of individual points from idealised non-singular spaces, resulting in a measure of the Euclidicity of a point. Our method is based on a local version of topological data analysis (TDA), a method from computational topology that is capable of quantifying the shape of a data set on multiple scales (Edelsbrunner & Harer, 2010).
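As a rough illustration of why the pinch point of the 'pinched torus' stands out when a data set is inspected locally, the following Python sketch samples such a space and compares a simple local PCA statistic at the pinch point and at a regular point. This is not persistent local homology or Euclidicity; it is only a linear heuristic used here for intuition. The parametrisation (tube radius r * sin(phi / 2), which collapses one meridian circle to a point), the function names `sample_pinched_torus` and `planarity_defect`, and all parameter values are assumptions made for this example.

```python
import numpy as np


def sample_pinched_torus(n, R=2.0, r=1.0, seed=0):
    """Sample n points from a pinched torus embedded in R^3.

    Assumed parametrisation (illustrative only): the tube radius
    r * sin(phi / 2) shrinks to zero at phi = 0, so the meridian circle
    there collapses to the pinch point (R, 0, 0).
    """
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    tube = r * np.sin(phi / 2.0)
    x = (R + tube * np.cos(theta)) * np.cos(phi)
    y = (R + tube * np.cos(theta)) * np.sin(phi)
    z = tube * np.sin(theta)
    return np.column_stack([x, y, z])


def planarity_defect(points, centre, radius):
    """Ratio of the smallest to the middle local PCA eigenvalue.

    Values near 0 mean the neighbours of `centre` are spread over a
    roughly flat 2D patch; values near 1 mean that no single tangent
    plane fits the neighbourhood.
    """
    dists = np.linalg.norm(points - centre, axis=1)
    nbrs = points[dists <= radius]
    if len(nbrs) < 4:
        raise ValueError("too few neighbours; increase radius or sample size")
    cov = np.cov(nbrs, rowvar=False)   # 3 x 3 covariance of the neighbours
    eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return eigvals[0] / eigvals[1]


if __name__ == "__main__":
    cloud = sample_pinched_torus(20_000)
    pinch = np.array([2.0, 0.0, 0.0])     # phi = 0: the pinch point
    regular = np.array([-3.0, 0.0, 0.0])  # phi = pi, theta = 0: a regular point
    print("pinch point:  ", planarity_defect(cloud, pinch, radius=0.3))
    print("regular point:", planarity_defect(cloud, regular, radius=0.3))
```

Under these assumptions, the neighbourhood of the pinch point resembles a double cone, so the two smallest eigenvalues are of similar size, whereas near the regular point the smallest eigenvalue is much smaller than the other two. The statistic depends on the chosen radius, which hints at why a multi-scale treatment of locality is useful.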

