A SPARSE, FAST, AND STABLE REPRESENTATION FOR MULTIPARAMETER TOPOLOGICAL DATA ANALYSIS

Abstract

Topological data analysis (TDA) is a new area of geometric data analysis that focuses on using invariants from algebraic topology to provide multiscale shape descriptors for point clouds. One of the most important shape descriptors is persistent homology, which studies the topological variations as a filtration parameter changes; a typical parameter is the feature scale. For many data sets, it is useful to consider varying multiple filtration parameters at once, for example scale and density. While the theoretical properties of single parameter persistent homology are well understood, less is known about the multiparameter case. Of particular interest is the problem of representing multiparameter persistent homology by elements of a vector space for integration with traditional machine learning. Existing approaches to this problem either ignore most of the multiparameter information to reduce to the one-parameter case or are heuristic and potentially unstable in the face of noise. In this article, we introduce a general representation framework for multiparameter persistent homology that encompasses previous approaches. We establish theoretical stability guarantees under this framework as well as efficient algorithms for practical computation, making this framework an applicable and versatile tool for TDA practitioners. We validate our stability results and algorithms with numerical experiments that demonstrate statistical convergence, prediction accuracy, and fast running times on several real data sets.

1. INTRODUCTION

Topological Data Analysis (TDA) (Carlsson, 2009) is a methodology for analyzing data sets using multiscale shape descriptors coming from algebraic topology. There has been intense interest in the field in the last decade, since topological features have allowed practitioners to compute and encode information that classical approaches do not capture. Moreover, TDA rests on solid theoretical grounds, with guarantees accompanying many of its methods and descriptors. TDA has proved useful in a wide variety of application areas, including computer graphics (Carrière et al., 2015a; Poulenard et al., 2018), computational biology (Rabadán & Blumberg, 2019), and material science (Buchet et al., 2018; Saadatfar et al., 2017), among many others.

The main tool of TDA is persistent homology. In its most standard form, one is given a finite metric space X (e.g., a finite set of points and their pairwise distances) and a continuous function f : X → R. This function usually represents a parameter of interest (such as scale or density for point clouds, marker genes for single-cell data, etc.), and the goal of persistent homology is to characterize the topological variations of this function on the data. Of course, the idea of considering multiscale representations of geometric data is not new (Chapelle et al., 2002; Ozer, 2019; Witkin, 1987); the contribution of persistent homology is to obtain a novel and theoretically tractable multiscale shape descriptor. More formally, this is achieved by computing the so-called persistence diagram of f, which is obtained by looking at all sublevel sets of the form {f^{-1}((-∞, α])}_α and by computing a decomposition of these sets, that is, by recording the appearances and disappearances of topological features (connected components, loops, enclosed spheres, etc.) in these sets. When such a feature appears (resp. disappears), e.g., in a sublevel set f^{-1}((-∞, α_b]), we call the corresponding threshold α_b (resp. α_d) the birth time (resp. death time) of the topological feature, and we summarize this information in the persistence diagram D(f) := {(α_b, α_d)}_{α ∈ A} ⊂ R^2. Moreover, it is also usual to consider the distance to the diagonal α_d - α_b as a proxy for the statistical significance of the corresponding feature. The most important foundational result in the subject establishes that persistence diagrams are stable: they will not change much when computed on a small perturbation of the function f (Cohen-Steiner et al., 2007).

However, an inherent limitation of this formulation of persistent homology is that it can handle only a single filtration parameter f. In practice, one commonly has to deal with multiple parameters, which translates into multiple filtration functions: a standard example is given in Figure 1, where both feature scale and density functions are necessary to obtain a meaningful topological representation of a noisy point cloud. An extension of persistent homology to this more general setting is called multiparameter persistent homology (Botnan & Lesnick, 2022; Carlsson & Zomorodian, 2009), since roughly speaking it amounts to studying the topological variation of a continuous multiparameter function f : X → R^n with n ∈ N^*. This setting is notoriously difficult to analyze theoretically, as there is no general decomposition theorem and hence no analogue of persistence diagrams. Still, it remains possible to define topological invariants, even in this setting. The most common one is the so-called rank invariant, which describes how the topological features associated to any pair of sublevel sets {x ∈ X : f(x) ≤ α} and {x ∈ X : f(x) ≤ β} such that α ≤ β (w.r.t. the partial order in R^n) are connected.
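As a small illustration of the single-parameter notions above, the following Python sketch computes birth and death times of connected components (degree-0 homology) for the sublevel sets of a function, and then evaluates the one-parameter rank invariant from the resulting diagram. It is a toy setting chosen for brevity, not the setting of this article: f is assumed to be sampled on the vertices of a path graph, an edge is assumed to enter the filtration at the larger of its two endpoint values, and the names `sublevel_persistence_0d` and `rank_invariant_1d` are hypothetical. Merges follow the usual elder rule, under which the younger component dies.

```python
import math


def sublevel_persistence_0d(values):
    """Degree-0 sublevel-set persistence of a function on a path graph.

    Vertex i has value values[i]; edge (i, i+1) appears at threshold
    max(values[i], values[i+1]). Returns a sorted list of (birth, death)
    pairs, with death = inf for the component that never dies.
    """
    n = len(values)
    parent = list(range(n))   # union-find forest
    birth = list(values)      # birth time of the component rooted at i

    def find(i):
        # Find the root of i with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges by increasing appearance threshold.
    edges = sorted(range(n - 1), key=lambda i: max(values[i], values[i + 1]))
    diagram = []
    for i in edges:
        threshold = max(values[i], values[i + 1])
        older, younger = find(i), find(i + 1)
        if older == younger:
            continue  # cannot happen on a path; kept for safety
        if birth[older] > birth[younger]:
            older, younger = younger, older
        # Elder rule: the younger component dies at the merge threshold.
        if birth[younger] < threshold:  # drop zero-persistence pairs
            diagram.append((birth[younger], threshold))
        parent[younger] = older
    # Components that survive every merge have infinite death time.
    for root in {find(i) for i in range(n)}:
        diagram.append((birth[root], math.inf))
    return sorted(diagram)


def rank_invariant_1d(diagram, alpha, beta):
    """Number of features alive at both thresholds alpha <= beta."""
    if alpha > beta:
        raise ValueError("rank invariant requires alpha <= beta")
    return sum(1 for b, d in diagram if b <= alpha and beta < d)


if __name__ == "__main__":
    dgm = sublevel_persistence_0d([0.0, 2.0, 1.0, 3.0, 0.5])
    print(dgm)
    print(rank_invariant_1d(dgm, 1.0, 1.5), rank_invariant_1d(dgm, 1.0, 2.5))
```

On the sample values `[0.0, 2.0, 1.0, 3.0, 0.5]`, the three local minima give the diagram `[(0.0, inf), (0.5, 3.0), (1.0, 2.0)]`, and the rank invariant equals 3 for the pair (1.0, 1.5) and 2 for the pair (1.0, 2.5). In the multiparameter case the thresholds α and β become vectors compared in the partial order of R^n, and the rank invariant can no longer be read off from a single list of intervals in this way.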
Since the rank invariant is defined as an algebraic construction and is therefore not suitable as input for subsequent data science tasks, finding appropriate representations of this invariant, i.e., embeddings into Hilbert spaces, is critical. Hence, a number of such representations have been defined recently (Corbet et al., 2019; Coskunuzer et al., 2021; Vipond, 2020). However, while the rank invariant is equivalent to the persistence diagram in the single-parameter case, it is known to be much less informative in the multiparameter case, even when the function admits a decomposition: many functions have different decompositions yet the same rank invariant. Therefore, all aforementioned representations can encode only limited multiparameter topological information. Instead, in this work, we define representations based on candidate decompositions of the function, in order to create descriptors that are strictly more powerful than the rank invariant. Indeed, while there is no general decomposition theorem, there is now a whole body of work that allows one to define stable candidate decompositions (Asashiba et al., 2019; Botnan et al., 2021; Cai et al., 2020; Dey & Xin, 2021; Loiseaux et al., 2022) that are more informative than the rank invariant and that agree with the true decomposition when it exists.¹ Our representation is motivated by this series of recent contributions, and expects one such candidate decomposition as input. Closely related to our method is the recent contribution of Carrière & Blumberg (2020), which also proposes a representation built from a given decomposition. However, their approach, while efficient in practice, is a heuristic with no corresponding mathematical guarantees. In particular, it is known to be unstable: similar decompositions can lead to very different representations.
Our approach can be understood as a generalization of the work of Carrière & Blumberg (2020), with new mathematical guarantees that allow one to derive, e.g., statistical rates of convergence.

Figure 1: (left) Example of a point cloud X filtered by both feature scale (computed from unions of balls centered on X with increasing radii) and (co)density. A point will be in the bifiltration if its density is high enough and if it is close enough to X. Note that the circle is present in the red zone, and can be detected as a large summand in the decomposition of the corresponding module (middle), or in a representation of it (right).



¹ Strictly speaking, multiparameter persistent homology can always be decomposed in an abstract way (Botnan & Lesnick, 2022, Theorem 4.2). However, such abstract decompositions are hard to interpret and work with, hence the motivation for finding candidate decompositions that are more visual and intuitive.

