FUZZY C-MEANS CLUSTERING FOR PERSISTENCE DIAGRAMS

Anonymous authors
Paper under double-blind review

Abstract

Persistence diagrams concisely represent the topology of a point cloud whilst having strong theoretical guarantees. Most current approaches to integrating topological information into machine learning implicitly map persistence diagrams to a Hilbert space, resulting in deformation of the underlying metric structure whilst also generally requiring prior knowledge about the true topology of the space. In this paper we give an algorithm for Fuzzy c-Means (FCM) clustering directly on the space of persistence diagrams, enabling unsupervised learning that automatically captures the topological structure of data, with no prior knowledge or additional processing of persistence diagrams. We prove the same convergence guarantees as traditional FCM clustering: every convergent subsequence of iterates tends to a local minimum or saddle point. We end by presenting experiments where our fuzzy topological clustering algorithm allows for unsupervised top-k candidate selection in settings where (i) the properties of persistence diagrams make them the natural choice over geometric equivalents, and (ii) the probabilistic membership values let us rank candidates in settings where verifying candidate suitability is expensive: lattice structure classification in materials science and pre-trained model selection in machine learning.

1. INTRODUCTION

Persistence diagrams, a concise representation of the topology of a point cloud with strong theoretical guarantees, have emerged as a new tool in the field of data analysis (Edelsbrunner & Harer, 2010). Persistence diagrams have been successfully used to analyse problems ranging from financial crashes (Gidea & Katz, 2018) to protein binding (Kovacev-Nikolic et al., 2014), but the non-Hilbertian nature of the space of persistence diagrams makes it difficult to use them directly for machine learning. In order to better integrate diagrams into machine learning workflows, efforts have been made to map them into a more manageable form, primarily through embeddings into finite feature vectors, functional summaries, or by defining a positive-definite kernel on diagram space. In all cases, this explicitly or implicitly embeds diagrams into a Hilbert space, which deforms the metric structure and potentially loses important information. With the exception of Topological Autoencoders, techniques that integrate these persistence-based summaries as topological regularisers and loss functions currently require prior knowledge about the correct topology of the dataset, which is clearly not feasible in most scenarios. Against this background, we give an algorithm to perform Fuzzy c-Means (FCM) clustering (Bezdek, 1980) directly on collections of persistence diagrams, yielding an important unsupervised learning algorithm that learns from persistence diagrams without deforming the metric structure. We perform the convergence analysis for our algorithm, giving the same guarantees as traditional FCM clustering: every convergent subsequence of iterates tends to a local minimum or saddle point. We demonstrate the value of our fuzzy clustering algorithm by using it to cluster datasets that benefit from both the topological and the fuzzy nature of our algorithm.
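For concreteness, the membership update at the heart of FCM clustering can be sketched as follows. This is the standard update from Bezdek's algorithm with fuzziness parameter m, written against an abstract distance matrix; in our setting the entries would be distances between persistence diagrams and cluster-centre diagrams. The function name and the toy distance matrix are illustrative, not part of the paper's method.

```python
import numpy as np

def fcm_memberships(distances, m=2.0, eps=1e-12):
    """Standard fuzzy c-means membership update.

    distances: (n, c) array of distances d(x_i, c_j) between n data
    points and c cluster centres.  Returns an (n, c) array U with
    U[i, j] = 1 / sum_k (d(x_i, c_j) / d(x_i, c_k))^(2/(m-1)),
    so each row sums to 1.
    """
    d = np.maximum(distances, eps)  # guard against division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# toy example: 3 points, 2 cluster centres
D = np.array([[1.0, 3.0],
              [3.0, 1.0],
              [2.0, 2.0]])
U = fcm_memberships(D)
```

A point equidistant from both centres (the third row) receives membership 0.5 in each cluster, which is exactly the graded information that hard clustering discards.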
We apply our technique in two settings: lattice structures in materials science and the decision boundaries of CNNs. A key property for machine learning in materials science has been identified as "invariance to the basis symmetries of physics [...] rotation, reflection, translation" (Schmidt et al., 2019). Geometric clustering algorithms do not have this invariance, but persistence diagrams do, making them ideally suited to this application; we can cluster transformed lattice-structure datasets where geometric equivalents fail. In addition, our probabilistic membership values allow us to rank the top-k most likely lattices assigned to a cluster. This is particularly important in materials science, as further investigation requires expensive laboratory time and expertise. Our second application is inspired by Ramamurthy et al. (2019), who show that models perform better on tasks if they have topologically similar decision boundaries. We use our algorithm to cluster models and tasks by the persistence diagrams of their decision boundaries. Not only is our algorithm able to successfully cluster models to the correct task, based solely on the topology of their decision boundaries, but we also show that higher membership values imply better performance on unseen tasks.

1.1. RELATED WORK

Means of persistence diagrams. Our work relies on the existence of statistics in the space of persistence diagrams. Mileyko et al. (2011) first showed that means and expectations are well-defined in the space of persistence diagrams. Specifically, they showed that the Fréchet mean, an extension of means to metric spaces, is well-defined under weak assumptions on the space of persistence diagrams. Turner et al. (2012) then developed an algorithm to compute the Fréchet mean. We adapt the algorithm by Turner et al. to the weighted case, but the combinatorial nature of their algorithm makes it computationally intensive. Lacombe et al. (2018) framed the computation of means and barycentres in the space of persistence diagrams as an optimal transport problem, allowing them to use the Sinkhorn algorithm (Cuturi & Doucet, 2014) for fast computation of approximate solutions. The vectorisation of the diagram required by the algorithm of Lacombe et al. makes it unsuitable for integration into our work, as we remain in the space of persistence diagrams. Techniques to speed up the matching problem fundamental to our computation have also been proposed by Vidal et al. (2020) and Kerber et al. (2017).

Learning with persistence-based summaries. Integrating diagrams into machine learning workflows remained challenging even with well-defined means, as the space is non-Hilbertian (Turner & Spreemann, 2019). As such, efforts have been made to map diagrams into a Hilbert space; primarily either by embedding into finite feature vectors (Kališnik, 2018; Fabio & Ferri, 2015; Chepushtanova et al., 2015) or functional summaries (Bubenik, 2015; Rieck et al., 2019), or by defining a positive-definite kernel on diagram space (Reininghaus et al., 2015; Carrière et al., 2017; Le & Yamada, 2018). These vectorisations have been integrated into deep learning either by learning parameters for the embedding (Hofer et al., 2017; Carrière et al., 2020; Kim et al., 2020; Zhao & Wang, 2019; Zieliński et al., 2019), or as part of a topological loss or regulariser (Chen et al., 2018; Gabrielsson et al., 2020; Clough et al., 2020; Moor et al., 2019). However, the embeddings used in these techniques deform the metric structure of persistence diagram space (Bubenik & Wagner, 2019; Wagner, 2019; Carrière & Bauer, 2019), potentially leading to the loss of important information. Furthermore, these techniques generally require prior knowledge of a 'correct' target topology, which cannot plausibly be known in most scenarios. In comparison, our algorithm acts in the space of persistence diagrams, so it does not deform the structure of diagram space via embeddings, and it is entirely unsupervised, requiring no prior knowledge about the topology.

Hard clustering. Maroulas et al. (2017) gave an algorithm for hard clustering persistence diagrams based on the algorithm by Turner et al. Lacombe et al. (2018) gave an alternative implementation of hard clustering based on their algorithm for barycentre computation, providing a computational speed-up over the previous work by Maroulas et al. The primary advantages of our work over previous work on hard clustering are as follows. (i) The probabilistic membership values allow us to rank datasets in a cluster, enabling top-k candidate selection in settings where verifying correctness is expensive. The value provided by this fuzzy information is demonstrated in the experiments. (ii) The fuzzy membership values provide information about proximity to all clusters, whereas hard labelling loses most of that information. In our experiments we demonstrate that this additional information can be utilised in practice. (iii) The weighted cost function makes the convergence analysis (which we provide) entirely nontrivial in comparison to the non-fuzzy case. We consider this convergence analysis a primary contribution of our paper. (iv) Fuzzy membership values have been shown to be more robust to noise than discrete labels (Klawonn, 2004).
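The matching problem underlying these computations can be made concrete with a small sketch. Computing the 2-Wasserstein distance between two finite diagrams reduces to a linear assignment problem once each diagram is augmented with the diagonal projections of the other's points, so that points may be matched to the diagonal. The sketch below uses SciPy's assignment solver and an L2 ground metric; it is illustrative only, not the optimised matching of Kerber et al. or the entropic approximation of Lacombe et al.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein2(dgm1, dgm2):
    """2-Wasserstein distance between two finite persistence diagrams.

    dgm1, dgm2: (n, 2) and (m, 2) arrays of (birth, death) pairs.
    Each diagram is augmented with the diagonal projections of the
    other's points; the optimal partial matching then becomes a
    standard linear assignment problem.
    """
    def diag(d):  # orthogonal projection of each point onto the diagonal
        mid = (d[:, 0] + d[:, 1]) / 2.0
        return np.column_stack([mid, mid])

    a = np.vstack([dgm1, diag(dgm2)])  # n + m points
    b = np.vstack([dgm2, diag(dgm1)])  # m + n points
    # squared Euclidean cost between all pairs
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    n, m = len(dgm1), len(dgm2)
    cost[n:, m:] = 0.0  # diagonal points match each other for free
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].sum())

d1 = np.array([[0.0, 2.0], [1.0, 4.0]])
d2 = np.array([[0.1, 2.1], [1.0, 3.8]])
dist = wasserstein2(d1, d2)
```

Because unmatched points are sent to the diagonal rather than forced into a bijection, diagrams of different sizes are compared without any vectorisation, which is what lets the clustering remain in diagram space.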

