FUZZY C-MEANS CLUSTERING FOR PERSISTENCE DIAGRAMS

Anonymous authors
Paper under double-blind review

Abstract

Persistence diagrams concisely represent the topology of a point cloud whilst having strong theoretical guarantees. Most current approaches to integrating topological information into machine learning implicitly map persistence diagrams to a Hilbert space, resulting in deformation of the underlying metric structure whilst also generally requiring prior knowledge about the true topology of the space. In this paper we give an algorithm for Fuzzy c-Means (FCM) clustering directly on the space of persistence diagrams, enabling unsupervised learning that automatically captures the topological structure of data, with no prior knowledge or additional processing of persistence diagrams. We prove the same convergence guarantees as traditional FCM clustering: every convergent subsequence of iterates tends to a local minimum or saddle point. We end by presenting experiments where our fuzzy topological clustering algorithm allows for unsupervised top-k candidate selection in settings where (i) the properties of persistence diagrams make them the natural choice over geometric equivalents, and (ii) the probabilistic membership values let us rank candidates in settings where verifying candidate suitability is expensive: lattice structure classification in materials science and pre-trained model selection in machine learning.

1. INTRODUCTION

Persistence diagrams, a concise representation of the topology of a point cloud with strong theoretical guarantees, have emerged as a new tool in the field of data analysis (Edelsbrunner & Harer, 2010). Persistence diagrams have been successfully used to analyse problems ranging from financial crashes (Gidea & Katz, 2018) to protein binding (Kovacev-Nikolic et al., 2014), but the non-Hilbertian nature of the space of persistence diagrams makes it difficult to use them directly for machine learning. To better integrate diagrams into machine learning workflows, efforts have been made to map them into a more manageable form, primarily through embeddings into finite feature vectors, functional summaries, or positive-definite kernels on diagram space. In all cases, this explicitly or implicitly embeds diagrams into a Hilbert space, which deforms the metric structure and potentially loses important information. With the exception of Topological Autoencoders, techniques that integrate these persistence-based summaries as topological regularisers and loss functions currently require prior knowledge about the correct topology of the dataset, which is not feasible in most scenarios.

Against this background, we give an algorithm to perform Fuzzy c-Means (FCM) clustering (Bezdek, 1980) directly on collections of persistence diagrams, providing an important unsupervised learning algorithm and enabling learning from persistence diagrams without deforming their metric structure. We perform the convergence analysis for our algorithm, obtaining the same guarantees as traditional FCM clustering: every convergent subsequence of iterates tends to a local minimum or saddle point. We demonstrate the value of our fuzzy clustering algorithm by using it to cluster datasets that benefit from both the topological and fuzzy nature of our algorithm.
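For readers unfamiliar with FCM, the classical algorithm alternates two updates until convergence: cluster centres are recomputed as membership-weighted means, and memberships are recomputed from inverse relative distances to the centres. The sketch below is a minimal Euclidean illustration of these alternating updates (Bezdek's scheme); it is not the paper's method, which would replace the Euclidean mean and distance with their counterparts on the metric space of persistence diagrams.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Classical Euclidean fuzzy c-means with fuzzifier m > 1.

    Alternates the two FCM updates:
      1. centres  <- membership-weighted means of the data,
      2. memberships <- inverse relative distances to the centres.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial fuzzy memberships; each row sums to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centre update: weighted mean of all points per cluster.
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centres

# Two well-separated blobs: memberships become near-crisp.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)),
               rng.normal(5.0, 0.1, (30, 2))])
U, C = fuzzy_c_means(X, c=2)
```

Each iterate of this alternating scheme does not increase the FCM objective, which is the structure behind the convergence guarantee cited above: every convergent subsequence of iterates tends to a local minimum or saddle point.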
We apply our technique in two settings: lattice structures in materials science and the decision boundaries of CNNs. A key property for machine learning in materials science has been identified as "invariance to the basis symmetries of physics [...] rotation, reflection, translation" (Schmidt et al., 2019). Geometric clustering algorithms do not have this invariance, but persistence diagrams do, making them ideally suited for this application; we can cluster transformed lattice structure datasets where geometric equivalents fail. In addition to this, our probabilistic membership values allow us to rank the top-k

