TOAST: TOPOLOGICAL ALGORITHM FOR SINGULARITY TRACKING

Abstract

The manifold hypothesis, which assumes that data lie on or close to an unknown manifold of low intrinsic dimensionality, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibit distinct non-manifold structures, which result in singularities that can lead to erroneous conclusions about the data. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address detecting singularities by developing (i) persistent local homology, a new topology-driven framework for quantifying the intrinsic dimension of a data set locally, and (ii) Euclidicity, a topology-based multi-scale measure for assessing the 'manifoldness' of individual points. We show that our approach can reliably identify singularities of complex spaces, while also capturing singular structures in real-world data sets.

1. INTRODUCTION

The ever-increasing amount and complexity of real-world data necessitate the development of new methods to extract less complex, but still meaningful, representations of the underlying data. One approach to this problem is via dimensionality reduction techniques, where the data is assumed to be of strictly lower dimension than its number of features. Traditional algorithms in this field, such as PCA, are restricted to linear descriptions of data, and are therefore of limited use for complex, non-linear data sets that often appear in practice. By contrast, non-linear dimensionality reduction algorithms, such as UMAP (McInnes et al., 2018), t-SNE (van der Maaten & Hinton, 2008), or autoencoders (Kingma & Welling, 2019), share one common assumption: the underlying data is supposed to be close to a manifold with small intrinsic dimension, i.e. while the input data may have a large ambient dimension N, there is an n-dimensional manifold with n ≪ N that best describes the data. For some data sets, this manifold hypothesis is appropriate: certain natural images are known to be well-described by a manifold, for instance (Carlsson, 2009), enabling the use of specialised autoencoders for visualisation (Moor et al., 2020). However, recent research shows evidence that the manifold hypothesis does not necessarily hold for complex data sets (Brown et al., 2022), and that manifold learning techniques tend to fail for non-manifold data (Rieck & Leitte, 2015; Scoccola & Perea, 2022). These failures are typically the result of singularities, i.e. regions of a space that violate the properties of a manifold. For example, the 'pinched torus,' an object obtained by compressing a neighbourhood of a random point in a torus to a single point, fails to satisfy the manifold hypothesis at the 'pinch point': this point, unlike all other points of the 'pinched torus,' does not have a neighbourhood homeomorphic to ℝ² (see Fig. 1 for an illustration).

Figure 1. Overview of our method. Using persistent local homology (PLH), we derive a persistent intrinsic dimension and, subsequently, a Euclidicity score that measures the deviation of a space from a Euclidean model space. Here, Euclidicity highlights the singularity at the 'pinch point.' Please refer to Section 4 for more details.

Since singularities (unlike outliers that arise, for example, from incorrect labels) may carry relevant information (Jakubowski et al., 2020), we address the shortcomings of existing dimensionality reduction methods by assuming an agnostic view on any given data set. Instead of trying to prescribe the rigid requirements of a manifold, we consider intrinsic dimensionality to be a fundamentally local phenomenon: we permit dimensionality to vary across points in the data set, and, more importantly, across the scale of locality to be considered. The only assumption we make is that the data is of significantly lower dimension than the dimension of the ambient space. This perspective enables us to assess the deviation of individual points from idealised non-singular spaces, resulting in a measure of the Euclidicity of a point. Our method is based on a local version of topological data analysis (TDA), a method from computational topology that is capable of quantifying the shape of a data set on multiple scales (Edelsbrunner & Harer, 2010).

Our contributions. We present a universal framework for detecting singular regions in data. This framework is agnostic with respect to geometric or stochastic properties of the underlying data and only requires a notion of intrinsic dimension of neighbourhoods. Our approach is based on a novel formulation of persistent local homology (PLH), a multi-parameter tool that detects the shape of local neighbourhoods of a given point in the data set, making use of multiple scales of locality.

Bottleneck distance: $d_B(D, D') := \inf_{\gamma} \sup_{x \in D} \| x - \gamma(x) \|_\infty$, where $\gamma$ ranges over all bijections between $D$ and $D'$.

For readers familiar with persistent homology, we depart from the usual convention of using ϵ as the threshold parameter since we will require it to denote the scale of our persistent local homology calculations.
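The bottleneck distance defined in the footnote above can be made concrete with a small brute-force sketch. It assumes that D and D′ are finite lists of (birth, death) pairs of equal size and that no point is matched to the diagonal; both are simplifications made for illustration only, and the function names are hypothetical. The loop over all permutations is exponential in the number of points, so this is only meant to mirror the formula, not to serve as a practical implementation.

```python
from itertools import permutations


def linf(p, q):
    """L-infinity distance between two points in the plane."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def bottleneck_bruteforce(D1, D2):
    """Infimum over bijections of the supremum of L-infinity distances.

    Assumption (not from the paper): both inputs have the same number of
    points, and matching points to the diagonal is not considered.
    """
    if len(D1) != len(D2):
        raise ValueError("this sketch only handles inputs of equal size")
    if not D1:
        return 0.0
    # Each permutation of the indices of D2 encodes one bijection gamma.
    return min(
        max(linf(x, D2[j]) for x, j in zip(D1, perm))
        for perm in permutations(range(len(D2)))
    )


# Two toy inputs: the identity matching costs 0.5, the swapped one 1.5.
print(bottleneck_bruteforce([(0, 1), (0, 3)], [(0, 1.5), (0, 2.5)]))  # 0.5
```

The minimum over permutations plays the role of the infimum over γ, and the inner maximum plays the role of the supremum over x ∈ D.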


We employ PLH in two different capacities: (i) We use PLH to estimate the intrinsic dimension of a point locally. This enables us to assess how complex a given data set is, both in terms of the magnitude of the intrinsic dimension and in terms of the variance of its intrinsic dimension across individual points. (ii) Given the intrinsic dimension of the neighbourhood of a point, we use PLH to measure Euclidicity, a novel quantity that we define to measure the deviation of a point from being Euclidean. We also provide theoretical guarantees on the approximation quality for certain classes of spaces and show the utility of our proposed method experimentally on several data sets.

We first provide an overview of persistent homology and stratified spaces, as well as their relation to local homology. The former concept constitutes a generic framework for assessing complex data at multiple scales by measuring its topological characteristics such as 'holes' and 'voids,' while the latter will subsequently serve as a general setting to describe singularities, in which our framework admits advantageous properties.
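To give a rough sense of what assessing data "at multiple scales" can look like, the sketch below tracks only the simplest topological characteristic, the number of connected components, of a small point cloud as a distance threshold t grows. It is an illustration under stated assumptions rather than the method of this paper: the function names, the toy point cloud, and the convention that two points are linked when their Euclidean distance is at most t are all hypothetical choices, and higher-dimensional features such as 'holes' and 'voids' would require a dedicated persistent homology library.

```python
from itertools import combinations
from math import dist


def _find(parent, i):
    """Return the representative of i in a union-find forest."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def component_counts(points, thresholds):
    """Count connected components of the t-neighbourhood graph for each t.

    Assumption for this sketch: two points are linked at scale t if their
    Euclidean distance is at most t.
    """
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    counts = []
    for t in thresholds:
        parent = list(range(len(points)))
        n_components = len(points)
        for d, i, j in edges:
            if d > t:
                break  # edges are sorted, so all remaining ones are longer
            ri, rj = _find(parent, i), _find(parent, j)
            if ri != rj:
                parent[ri] = rj
                n_components -= 1
        counts.append(n_components)
    return counts


# Two well-separated clusters: three points near the origin, two far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
print(component_counts(points, [0.5, 1.0, 2.0, 10.0]))  # [5, 2, 2, 1]
```

The two clusters appear as separate components over a wide range of thresholds and merge only at a large scale, which is the kind of scale-dependent information that persistent homology records more generally.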


Persistent homology. Persistent homology is a method for computing topological features at different scales, capturing an intrinsic notion of relevance in terms of spatial scale parameters. Given a finite metric space $(X, d)$, the Vietoris-Rips complex at step $t$ is defined as the abstract simplicial complex $V(X, t)$, in which an abstract $k$-simplex $(x_0, \dots, x_k)$ of points in $X$ is spanned if and only if $d(x_i, x_j) \le t$ for all $0 \le i \le j \le k$. Since $V(X, t_1) \subseteq V(X, t_2)$ whenever $t_1 \le t_2$, this construction yields a filtration, i.e. a sequence of nested simplicial complexes, which we denote by $V(X, \bullet)$. Applying the $i$th homology functor to this collection of spaces and inclusions between them induces maps on the homology level $f_i^{t_1,t_2} \colon H_i(V(X, t_1)) \to H_i(V(X, t_2))$ for any $t_1 \le t_2$. The $i$th persistent homology (PH) of $X$ with respect to the Vietoris-Rips construction is defined to be the collection of all these $i$th homology groups, together with the respective induced maps between them, and denoted by $\mathrm{PH}_i(X; V)$. PH can therefore be viewed as a tool that keeps track of topological fea-

