EXPECTED PROBABILISTIC HIERARCHIES

Abstract

Hierarchical clustering has usually been addressed by discrete optimization using heuristics or by continuous optimization of relaxed scores for hierarchies. In this work, we propose to optimize expected scores under a probabilistic model over hierarchies. (1) We show theoretically that the globally optimal values of the expected Dasgupta cost and Tree-Sampling Divergence (TSD), two unsupervised metrics for hierarchical clustering, are equal to the optimal values of their discrete counterparts, in contrast to some relaxed scores. (2) We propose Expected Probabilistic Hierarchies (EPH), a probabilistic model that learns hierarchies in data by optimizing expected scores. EPH uses differentiable hierarchy sampling, enabling end-to-end gradient-based optimization, and an unbiased subgraph sampling approach to scale to large datasets. (3) We evaluate EPH on synthetic and real-world datasets, including vector and graph datasets. EPH outperforms all other approaches in quantitative evaluations and provides meaningful hierarchies in qualitative ones.

1. INTRODUCTION

A fundamental problem in unsupervised learning is clustering. Given a dataset, the task is to partition the instances into similar groups. While flat clustering algorithms such as k-means group data points into disjoint groups, a hierarchical clustering divides the data recursively into smaller clusters, which yields several advantages over a flat one. Instead of only providing cluster assignments of the data points, it captures the clustering at multiple granularities, allowing the user to choose the desired level of granularity depending on the task. The hierarchical structure can be easily visualized in a dendrogram (e.g., see Fig. 4), making it easy to interpret and analyze. Hierarchical clustering finds applications in many areas, from personalized recommendation (Zhang et al., 2014) and document clustering (Steinbach et al., 2000) to gene expression (Eisen et al., 1998) and phylogenetics (Felsenstein, 2004). Furthermore, the presence of hierarchical structures can be observed in many real-world graphs in nature and society (Ravasz & Barabási, 2003).

A first family of methods for hierarchical clustering are discrete approaches. They aim at optimizing some hierarchical clustering quality score over a discrete search space, i.e.: max_T score(X, T) s.t. T ∈ discrete hierarchies, where X denotes a given (vector or graph) dataset. Examples include the minimization of the discrete Dasgupta cost (Dasgupta, 2016), the minimization of the error sum of squares (Ward Jr, 1963), the maximization of the discrete TSD (Charpentier & Bonald, 2019), and the maximization of the modularity score (Blondel et al., 2008).
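To make the discrete objective concrete: the Dasgupta cost of a hierarchy T over a weighted graph sums, for every edge (i, j), the weight w_ij times the number of leaves under the lowest common ancestor of i and j in T. The following is a minimal sketch of that computation; the child-to-parent tree encoding and helper names are our own illustration, not the paper's.

```python
# Dasgupta cost of a discrete hierarchy: sum over edges (i, j, w) of
# w * |leaves(lca(i, j))|.  The tree is encoded as a child -> parent
# map whose leaves are the data points (illustrative encoding).

def dasgupta_cost(parent, leaves, edges):
    def ancestors(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path

    # Precompute the number of leaves in the subtree of every node.
    leaf_count = {}
    for leaf in leaves:
        for a in ancestors(leaf):
            leaf_count[a] = leaf_count.get(a, 0) + 1

    def lca(u, v):
        anc_u = set(ancestors(u))
        for a in ancestors(v):
            if a in anc_u:
                return a

    return sum(w * leaf_count[lca(i, j)] for i, j, w in edges)

# Balanced tree over 4 leaves: root "r" with children "a" (leaves 0, 1)
# and "b" (leaves 2, 3).
parent = {0: "a", 1: "a", 2: "b", 3: "b", "a": "r", "b": "r"}
edges = [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 0.5)]
cost = dasgupta_cost(parent, [0, 1, 2, 3], edges)
print(cost)  # 1*2 + 1*2 + 0.5*4 = 6.0
```

Minimizing this cost favors hierarchies that place heavily connected nodes under small, deep subtrees, which is why heavy edges crossing the root (like the 0.5 edge above) are penalized most.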
Discrete approaches have two main limitations: the search space of discrete hierarchies is large and constrained, which often makes the problem intractable without heuristics, and the learning procedure is not differentiable and thus not amenable to gradient-based optimization, as used by most deep learning approaches. To mitigate these issues, a second, more recent family of continuous methods proposes to optimize (soft-)scores over a continuous search space of relaxed hierarchies: max_T soft-score(X, T) s.t. T ∈ relaxed hierarchies. Examples are relaxations of the Dasgupta cost (Chami et al., 2020; Chierchia & Perret, 2019; Zügner et al., 2021) or of the TSD score (Zügner et al., 2021). A major drawback of continuous methods is that the optimal value of soft scores might not align with their discrete counterparts.

Contributions. In this work, we propose to optimize expected discrete scores, called Exp-Das and Exp-TSD, instead of the relaxed soft scores Soft-Das and Soft-TSD (Zügner et al., 2021). In particular, our contributions can be summarized as follows:
• Theoretical contribution: We analyze the theoretical properties of both the soft scores and the expected scores. We show that the optimal values of the expected scores are equal to their optimal discrete counterparts. Further, we show that the minimal value of Soft-Das can differ from that of the discrete Dasgupta cost.
• Model contribution: We propose a new method called Expected Probabilistic Hierarchies (EPH) to optimize Exp-Das and Exp-TSD. EPH provides an unbiased estimate of Exp-Das and Exp-TSD with biased gradients based on differentiable hierarchy sampling. EPH scales even to large (vector) datasets based on unbiased subgraph sampling.
• Experimental contribution: In quantitative experiments, we show that EPH outperforms other baselines in 20/24 cases on 16 datasets, including both graph and vector datasets. In qualitative experiments, we show that EPH provides meaningful hierarchies.

2. RELATED WORK

Continuous Methods. In recent years, many continuous algorithms have emerged for hierarchical clustering. These methods minimize continuous relaxations of the Dasgupta cost using gradient-descent-based optimizers. Monath et al. (2017) optimized a probabilistic cost version: to parametrize the probabilities, they applied a softmax operation to learnable routing functions from each node on a fixed binary hierarchy. Chierchia & Perret (2019) proposed UFit, a model operating in the ultrametric space; to optimize it, they presented a soft-cardinality measure yielding a differentiable relaxed version of the Dasgupta cost. Other approaches operate on continuous representations in hyperbolic space, such as gHHC (Monath et al., 2019) and HypHC (Chami et al., 2020). Zügner et al. (2021) recently presented a flexible probabilistic hierarchy model (FPH), on which our method is based. FPH directly parametrizes a probabilistic hierarchy and substitutes the discrete terms in the Dasgupta cost and Tree-Sampling Divergence with their probabilistic counterparts. This results in a differentiable objective function which they optimize using projected gradient descent.

Differentiable Sampling Methods. Stochastic models with discrete random variables are difficult to train, as the backpropagation algorithm requires all operations to be differentiable. To address this problem, estimators such as Gumbel-Softmax (Jang et al., 2016) or Gumbel-Sinkhorn (Mena et al., 2018) are used to retain gradients when sampling discrete variables. These differentiable sampling methods have been used for several tasks, including DAG prediction (Charpentier et al., 2022), spanning trees and subset selection (Paulus et al., 2020), and graph generation (Bojchevski et al., 2018). Note that sampling spanning trees is not applicable in our case since we have a restricted structure, where the nodes of the graph correspond to the leaves of the tree.

3. PROBABILISTIC HIERARCHICAL CLUSTERING

We consider a graph dataset. Let G = (V, E) be a graph with n vertices V = {v_1, . . . , v_n} and m edges E = {e_1, . . . , e_m}. Let w_{i,j} denote the weight of the edge connecting the nodes v_i and v_j if
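Edge weights induce the node-pair distribution that graph-based scores such as TSD build on: a pair (v_i, v_j) is drawn with probability proportional to w_{i,j}. A minimal sketch of that sampling step, with an edge-list encoding of our own choosing (not the paper's):

```python
import random

# Sample a node pair (v_i, v_j) with probability w_ij / sum of all
# weights -- the edge distribution underlying graph-based scores
# such as TSD.  Edge-list encoding is our own illustration.

def sample_edge(edges, rng=random):
    pairs = [(i, j) for i, j, _ in edges]
    weights = [w for _, _, w in edges]
    return rng.choices(pairs, weights=weights, k=1)[0]

edges = [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 0.5)]
rng = random.Random(0)
counts = {}
for _ in range(10000):
    pair = sample_edge(edges, rng)
    counts[pair] = counts.get(pair, 0) + 1
# Empirical frequencies approach 1.0/2.5, 1.0/2.5, 0.5/2.5.
```

Drawing many such pairs yields the empirical distribution that a good hierarchy should place under deep, small subtrees.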

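The Gumbel-Softmax estimator referenced above replaces a hard categorical sample with a temperature-controlled soft one-hot vector, so gradients can flow through the sampling step. A generic sketch of the estimator (Jang et al., 2016), not EPH's specific hierarchy-sampling procedure:

```python
import math
import random

# Gumbel-Softmax: differentiable relaxation of sampling from a
# categorical distribution given unnormalized log-probabilities.
# The temperature tau controls how close samples are to one-hot;
# generic sketch, not the paper's exact parametrization.

def gumbel_softmax(logits, tau=0.5, rng=random):
    # Perturb each logit with Gumbel(0, 1) noise: -log(-log(U)).
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    perturbed = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed logits.
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    z = sum(exps)
    return [e / z for e in exps]  # soft one-hot, sums to 1

sample = gumbel_softmax([1.0, 0.0, -1.0], tau=0.1, rng=random.Random(0))
```

As tau decreases, samples concentrate near one-hot vectors (approaching discrete samples) at the price of higher-variance gradients; larger tau gives smoother, more biased relaxations.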
