A SPARSE, FAST, AND STABLE REPRESENTATION FOR MULTIPARAMETER TOPOLOGICAL DATA ANALYSIS

Abstract

Topological data analysis (TDA) is a new area of geometric data analysis that focuses on using invariants from algebraic topology to provide multiscale shape descriptors for point clouds. One of the most important shape descriptors is persistent homology, which studies the topological variations as a filtration parameter changes; a typical parameter is the feature scale. For many data sets, it is useful to consider varying multiple filtration parameters at once, for example scale and density. While the theoretical properties of single-parameter persistent homology are well understood, less is known about the multiparameter case. Of particular interest is the problem of representing multiparameter persistent homology by elements of a vector space for integration with traditional machine learning. Existing approaches to this problem either ignore most of the multiparameter information to reduce to the one-parameter case or are heuristic and potentially unstable in the face of noise. In this article, we introduce a general representation framework for multiparameter persistent homology that encompasses previous approaches. We establish theoretical stability guarantees under this framework as well as efficient algorithms for practical computation, making this framework a practical and versatile tool for TDA practitioners. We validate our stability results and algorithms with numerical experiments that demonstrate statistical convergence, prediction accuracy, and fast running times on several real data sets.

1. INTRODUCTION

Topological Data Analysis (TDA) (Carlsson, 2009) is a methodology for analyzing data sets using multiscale shape descriptors coming from algebraic topology. There has been intense interest in the field in the last decade, since topological features have allowed practitioners to compute and encode information that classical approaches do not capture. Moreover, TDA rests on solid theoretical grounds, with guarantees accompanying many of its methods and descriptors. TDA has proved useful in a wide variety of application areas, including computer graphics (Carrière et al., 2015a; Poulenard et al., 2018), computational biology (Rabadán & Blumberg, 2019), and materials science (Buchet et al., 2018; Saadatfar et al., 2017), among many others.

The main tool of TDA is persistent homology. In its most standard form, one is given a finite metric space $X$ (e.g., a finite set of points and their pairwise distances) and a continuous function $f : X \to \mathbb{R}$. This function usually represents a parameter of interest (e.g., scale or density for point clouds, or marker genes for single-cell data), and the goal of persistent homology is to characterize the topological variations of this function on the data. Of course, the idea of considering multiscale representations of geometric data is not new (Chapelle et al., 2002; Ozer, 2019; Witkin, 1987); the contribution of persistent homology is to obtain a novel and theoretically tractable multiscale shape descriptor. More formally, this is achieved by computing the so-called persistence diagram of $f$, which is obtained by looking at all sublevel sets of the form $\{f^{-1}((-\infty, \alpha])\}_{\alpha}$ and by computing a decomposition of these sets, that is, by recording the appearances and disappearances of topological features (connected components, loops, enclosed spheres, etc.) in these sets. When such a feature appears (resp. disappears), e.g., in a sublevel set $f^{-1}((-\infty, \alpha_b])$, we call the corresponding threshold $\alpha_b$ (resp. $\alpha_d$) the birth time (resp. death time) of the topological feature, and we summarize this information in the persistence diagram $D(f) := \{(\alpha_b, \alpha_d)\}_{\alpha \in A} \subset \mathbb{R}^2$. Moreover, it is also

