EXPECTED PROBABILISTIC HIERARCHIES

Abstract

Hierarchical clustering has usually been addressed by discrete optimization using heuristics or continuous optimization of relaxed scores for hierarchies. In this work, we propose to optimize expected scores under a probabilistic model over hierarchies. (1) We show theoretically that the global optimal values of the expected Dasgupta cost and Tree-Sampling divergence (TSD), two unsupervised metrics for hierarchical clustering, are equal to the optimal values of their discrete counterparts contrary to some relaxed scores. (2) We propose Expected Probabilistic Hierarchies (EPH), a probabilistic model to learn hierarchies in data by optimizing expected scores. EPH uses differentiable hierarchy sampling enabling end-to-end gradient-descent based optimization, and an unbiased subgraph sampling approach to scale to large datasets. (3) We evaluate EPH on synthetic and real-world datasets including vector and graph datasets. EPH outperforms all other approaches on quantitative results and provides meaningful hierarchies in qualitative evaluations.

1. INTRODUCTION

A fundamental problem in unsupervised learning is clustering. Given a dataset, the task is to partition the instances into similar groups. While flat clustering algorithms such as k-means group data points into disjoint groups, a hierarchical clustering divides the data recursively into smaller clusters, which yields several advantages over a flat one. Instead of only providing cluster assignments of the data points, it captures the clustering at multiple granularities, allowing the user to choose the desired level of granularity depending on the task. The hierarchical structure can be easily visualized in a dendrogram (e.g. see Fig. 4), making it easy to interpret and analyze. Hierarchical clustering finds applications in many areas, from personalized recommendation (Zhang et al., 2014) and document clustering (Steinbach et al., 2000) to gene expression (Eisen et al., 1998) and phylogenetics (Felsenstein, 2004). Furthermore, hierarchical structures can be observed in many real-world graphs in nature and society (Ravasz & Barabási, 2003). A first family of methods for hierarchical clustering consists of discrete approaches. They aim at optimizing some hierarchical clustering quality score on a discrete search space, i.e.:

$$\max_{\hat{T}} \; \text{score}(X, \hat{T}) \quad \text{s.t.} \quad \hat{T} \in \text{discrete hierarchies},$$

where X denotes a given (vector or graph) dataset. Examples of score optimization include the minimization of the discrete Dasgupta cost (Dasgupta, 2016), the minimization of the error sum of squares (Ward Jr, 1963), the maximization of the discrete TSD (Charpentier & Bonald, 2019), and the maximization of the modularity score (Blondel et al., 2008). Discrete approaches have two main limitations: the search space of discrete hierarchies is large and constrained, which often makes the problem intractable without heuristics, and the learning procedure is not differentiable and thus not amenable to gradient-based optimization, as done by most deep learning approaches. To mitigate these issues, a second, more recent family of continuous methods proposes to optimize some (soft-)scores on a continuous search space of relaxed hierarchies:

$$\max_{T} \; \text{soft-score}(X, T) \quad \text{s.t.} \quad T \in \text{relaxed hierarchies}.$$

Examples are the relaxations of the Dasgupta cost (Chami et al., 2020; Chierchia & Perret, 2019; Zügner et al., 2021) or of the TSD score (Zügner et al., 2021). A major drawback of continuous methods is that the optimal value of a soft score might not align with that of its discrete counterpart.

Contributions. In this work, we propose to optimize expected discrete scores, called Exp-Das and Exp-TSD, instead of the relaxed soft scores called Soft-Das and Soft-TSD (Zügner et al., 2021). In particular, our contributions can be summarized as follows:

• Theoretical contribution: We analyze the theoretical properties of both the soft scores and the expected scores. We show that the optimal values of the expected scores are equal to their optimal discrete counterparts. Further, we show that the minimal value of Soft-Das can be different from that of the discrete Dasgupta cost.

• Model contribution: We propose a new method called Expected Probabilistic Hierarchies (EPH) to optimize Exp-Das and Exp-TSD. EPH provides an unbiased estimate of Exp-Das and Exp-TSD with biased gradients based on differentiable hierarchy sampling. EPH scales even to large (vector) datasets based on an unbiased subgraph sampling.

• Experimental contribution: In quantitative experiments, we show that EPH outperforms other baselines in 20/24 cases on 16 datasets including both graph and vector datasets. In qualitative experiments, we show that EPH provides meaningful hierarchies.

2. RELATED WORK

Discrete Methods. We further differentiate between agglomerative (bottom-up) and divisive (top-down) discrete algorithms. Well-established agglomerative methods are the linkage algorithms, which successively merge the two clusters with the lowest distance into a new cluster. There are several ways to define the distance between two clusters. The average linkage (AL) method uses the average distance, while single linkage (SL) and complete linkage (CL) use the minimum and maximum distance between the groups, respectively (Hastie et al., 2009). Finally, the Ward linkage (WL) algorithm (Ward Jr, 1963) operates on Euclidean distances and merges the two clusters with the lowest increase in the sum of squares. Another agglomerative approach is the Louvain algorithm (Blondel et al., 2008), which iteratively maximizes the modularity score. Unlike agglomerative methods, divisive algorithms work in a top-down fashion. Initially, all leaves share the same cluster and are recursively divided into smaller ones using flat clustering algorithms. Famous examples are based on the k-means algorithm (Steinbach et al., 2000) or use approximations of the sparsest cut (Dasgupta, 2016).

Continuous Methods. In recent years, many continuous algorithms emerged to solve hierarchical clustering. These methods minimize continuous relaxations of the Dasgupta cost using gradient-descent-based optimizers. Monath et al. (2017) optimized a probabilistic version of the cost. To parametrize the probabilities, they performed a softmax operation on learnable routing functions from each node on a fixed binary hierarchy. Chierchia & Perret (2019) proposed UFit, a model operating in the ultrametric space. Furthermore, to optimize their model, they presented a soft-cardinal measure to compute a differentiable relaxed version of the Dasgupta cost. Other approaches operate on continuous representations in hyperbolic space, such as gHHC (Monath et al., 2019) and HypHC (Chami et al., 2020). Zügner et al. (2021) recently presented a flexible probabilistic hierarchy model (FPH), on which our method is based. FPH directly parametrizes a probabilistic hierarchy and substitutes the discrete terms in the Dasgupta cost and Tree-Sampling Divergence with their probabilistic counterparts. This results in a differentiable objective function, which they optimize using projected gradient descent.

Differentiable Sampling Methods. Stochastic models with discrete random variables are difficult to train as the backpropagation algorithm requires all operations to be differentiable. To address this problem, estimators such as the Gumbel-Softmax (Jang et al., 2016) or Gumbel-Sinkhorn (Mena et al., 2018) are used to retain gradients when sampling discrete variables. These differentiable sampling methods have been used for several tasks including DAG predictions (Charpentier et al., 2022), spanning trees or subset selection (Paulus et al., 2020), and generating graphs (Bojchevski et al., 2018). Note that sampling spanning trees is not applicable in our case since we have a restricted structure, where the nodes of the graph correspond to the leaves of the tree.

3. PROBABILISTIC HIERARCHICAL CLUSTERING

We consider a graph dataset. Let $G = (V, E)$ be a graph with $n$ vertices $V = \{v_1, \dots, v_n\}$ and $m$ edges $E = \{e_1, \dots, e_m\}$. Let $w_{i,j}$ denote the weight of the edge connecting the nodes $v_i$ and $v_j$ if $(i, j) \in E$, and 0 otherwise, and let $w_i = \sum_j w_{i,j}$ denote the weight of the node $v_i$. We define the edge distribution $P(v_i, v_j)$ for pairs of nodes, $P(v_i, v_j) \propto w_{i,j}$ s.t. $\sum_{v_i, v_j \in V} P(v_i, v_j) = 1$, and equivalently the node distribution $P(v_i) \propto w_i$ s.t. $\sum_{v_i \in V} P(v_i) = 1$. We can extend this representation to any vector dataset $D = \{x_1, \dots, x_n\}$ and interpret the dataset as a graph by using the data points $x_i$ as nodes and pairwise similarities (e.g. cosine similarities) as edge weights.

Discrete hierarchical clustering. We define a discrete hierarchical clustering $\hat{T}$ of a graph $G$ as a rooted tree with $n$ leaves and $n'$ internal nodes. The leaves $V = \{v_1, v_2, \dots, v_n\}$ represent the nodes of $G$, while the internal nodes $Z = \{z_1, z_2, \dots, z_{n'}\}$ represent clusters, with $z_{n'}$ being the root node. Each internal node groups the data into disjoint sub-clusters, where edges reflect memberships of clusters. We can represent the hierarchy using two binary adjacency matrices $\hat{A} \in \{0, 1\}^{n \times n'}$ and $\hat{B} \in \{0, 1\}^{n' \times n'}$, i.e. $\hat{T} = (\hat{A}, \hat{B})$. While $\hat{A}$ describes the edges from the leaves to the internal nodes, $\hat{B}$ specifies the edges between the internal nodes. Since every node in the hierarchy except the root has exactly one outgoing edge, we have the following constraints: $\sum_{j=1}^{n'} \hat{A}_{i,j} = 1$ for $1 \le i \le n$, $\sum_{j=1}^{n'} \hat{B}_{i,j} = 1$ for $1 \le i < n'$, and $\sum_{j=1}^{n'} \hat{B}_{n',j} = 0$ for the last row. Thus, except for the last row of $\hat{B}$, both matrices are row-stochastic. We denote the ancestors of $v$ as $\text{anc}(v)$, and the lowest common ancestor (LCA) of the two leaves $v_i$ and $v_j$ in $\hat{T}$ as $v_i \wedge v_j$.

Probabilistic hierarchical clustering. Zügner et al. (2021) recently proposed probabilistic hierarchies. The idea is to use a continuous relaxation of the binary adjacency matrices while keeping the row-stochasticity constraints. Thus, we end up with two matrices $A \in [0, 1]^{n \times n'}$ and $B \in [0, 1]^{n' \times n'}$. The entries represent parent probabilities, i.e. $A_{i,j} := p(z_j | v_i)$ describes the probability of the internal node $z_j$ being the parent of $v_i$, and $B_{i,j} := p(z_j | z_i)$ the probability of the internal node $z_j$ being the parent of $z_i$. Together, they define a probabilistic hierarchy $T = (A, B)$. Given such a probabilistic hierarchy, one can easily obtain a discrete hierarchy by interpreting the corresponding rows of $A$ and $B$ as categorical distributions and sampling an outgoing edge for each leaf and internal node. Since $B$ is restricted to be an upper triangular matrix, this tree-sampling procedure results in a valid discrete hierarchy, denoted by $\hat{T} = (\hat{A}, \hat{B}) \sim P_{A,B}(T)$.
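The tree-sampling procedure described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function name `sample_hierarchy`, the parent-list encoding of the sampled hierarchy, and the toy matrices are assumptions made for this example. It assumes B is strictly upper triangular with an all-zero last row, so that every sampled parent has a larger index than its child.

```python
import random


def sample_hierarchy(A, B, rng):
    """Draw one discrete hierarchy from a probabilistic hierarchy (A, B).

    A[i][j] is the probability that internal node j is the parent of leaf i.
    B[i][j] is the probability that internal node j is the parent of internal
    node i. B is assumed strictly upper triangular with an all-zero last row
    (the root, stored last, has no parent).
    """
    n_internal = len(B)
    candidates = range(n_internal)
    # One categorical draw per leaf (rows of A).
    leaf_parent = [rng.choices(candidates, weights=row)[0] for row in A]
    # One categorical draw per non-root internal node (rows of B).
    internal_parent = [rng.choices(candidates, weights=row)[0] for row in B[:-1]]
    internal_parent.append(None)  # the root has no parent
    return leaf_parent, internal_parent


# Hypothetical toy hierarchy: 4 leaves, 3 internal nodes (index 2 is the root).
A = [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.0],
     [0.1, 0.9, 0.0],
     [0.0, 0.5, 0.5]]
B = [[0.0, 0.3, 0.7],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]

leaf_parent, internal_parent = sample_hierarchy(A, B, random.Random(0))
print(leaf_parent, internal_parent)
```

Because parents always have a larger index than their children, following parent pointers from any leaf terminates at the root, which is why the upper triangular restriction yields a valid tree.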

4. EXPECTED PROBABILISTIC HIERARCHICAL CLUSTERING

4.1. EXPECTED METRICS

Unlike flat clusterings, there has been a shortage of objective functions for hierarchical clusterings. Thus, many algorithms to derive hierarchies were developed without a precise objective. An objective function not only allows us to evaluate the performance of a hierarchy, but also yields possibilities for optimization techniques. Recently, the two unsupervised functions Dasgupta cost (Das) (Dasgupta, 2016) and Tree-Sampling Divergence (TSD) (Charpentier & Bonald, 2019) were proposed, triggering the development of a new generation of hierarchical clustering algorithms. The Dasgupta cost is a well-established metric for graphs and vector data, while the TSD is a recent metric specifically designed for graphs. Both metrics are unsupervised, i.e., applicable in cases where the data is unlabeled. They can be written as:

$$\text{Das}(\hat{T}) = \sum_{v_i, v_j \in V} P(v_i, v_j)\, c(v_i \wedge v_j) \quad \text{and} \quad \text{TSD}(\hat{T}) = \text{KL}(p(z) \,\|\, q(z)),$$

where $c(z)$ is the number of leaves whose ancestor is $z$, i.e. $c(z) = \sum_{v_i \in V} 1_{[z \in \text{anc}(v_i)]}$, and $p(z)$ and $q(z)$ are two distributions induced by the edge and node distributions, i.e. $p(z) = \sum_{v_i, v_j} 1_{[z = v_i \wedge v_j]} P(v_i, v_j)$ and $q(z) = \sum_{v_i, v_j} 1_{[z = v_i \wedge v_j]} P(v_i) P(v_j)$. Importantly, both the Dasgupta and TSD scores have intuitive motivations. Dasgupta favors similar leaves to have lowest common ancestors low in the hierarchy (Dasgupta, 2016). TSD quantifies the ability to reconstruct the graph from the hierarchy in terms of information loss (Charpentier & Bonald, 2019).

Recently, Zügner et al. (2021) proposed the Flexible Probabilistic Hierarchy (FPH) method. FPH substitutes the indicator functions with their corresponding probabilities under the tree-sampling procedure, obtaining cost functions for probabilistic hierarchies, called Soft-Das and Soft-TSD. These two metrics correspond to the scores of the expected hierarchies (see App. A.1). In contrast, we propose in this work to optimize the expected metrics under the tree-sampling procedure. Intuitively, this corresponds to moving the expectation from inside the metric functions to outside, reflecting the natural way of performing Monte Carlo approximation via (tree-)sampling. More specifically, our objectives are:

$$\min_{A,B} \; \mathbb{E}_{\hat{T}}\big[\text{Das}(\hat{T})\big] \;\; \text{s.t.} \;\; \hat{T} \sim P_{A,B}(T) \quad \text{and} \quad \max_{A,B} \; \mathbb{E}_{\hat{T}}\big[\text{TSD}(\hat{T})\big] \;\; \text{s.t.} \;\; \hat{T} \sim P_{A,B}(T),$$

which we denote as Exp-Das and Exp-TSD. Note that we optimize over $A$ and $B$, which parametrize a probabilistic hierarchy, while the edge weights are given by the dataset and used to compute the node and edge distributions. We show in Section 4.2 that the optimal values of the expected scores share the same intuitive meaning as their discrete counterparts. While the probabilities used in the FPH computation are consistent, its relaxed scores are not consistent with the expected scores under the tree-sampling procedure. In Fig. 2 we show a simple case where Soft-Das does not align with the global optimal value whereas Exp-Das does.
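To make the definition of the Dasgupta cost concrete, the following minimal sketch evaluates it on the unweighted K_4 graph used in Fig. 2. The parent-list encoding of a discrete hierarchy (`leaf_parent`, `internal_parent`) and the function names are illustrative assumptions, not the authors' implementation. Each undirected edge is counted once with normalized weight 1/6, and exact fractions are used so the result is not affected by rounding.

```python
from fractions import Fraction


def ancestors(leaf, leaf_parent, internal_parent):
    """Internal nodes on the path from a leaf to the root, bottom-up."""
    path = []
    z = leaf_parent[leaf]
    while z is not None:
        path.append(z)
        z = internal_parent[z]
    return path


def dasgupta_cost(edge_prob, leaf_parent, internal_parent):
    """Das = sum over edges (i, j) of P(v_i, v_j) * c(v_i ^ v_j)."""
    n = len(leaf_parent)
    anc = [ancestors(v, leaf_parent, internal_parent) for v in range(n)]
    # c(z): number of leaves that have z as an ancestor.
    c = {}
    for path in anc:
        for z in path:
            c[z] = c.get(z, 0) + 1
    cost = 0
    for (i, j), p in edge_prob.items():
        anc_j = set(anc[j])
        # Lowest common ancestor: first ancestor of i that is also one of j.
        lca = next(z for z in anc[i] if z in anc_j)
        cost += p * c[lca]
    return cost


# Unweighted K4: six edges, each with normalized weight 1/6.
edges = {(i, j): Fraction(1, 6) for i in range(4) for j in range(i + 1, 4)}

# Internal nodes 0 and 1 each hold two leaves; internal node 2 is the root.
balanced = ([0, 0, 1, 1], [2, 2, None])
# All four leaves attached directly to the root.
flat = ([2, 2, 2, 2], [2, 2, None])

print(dasgupta_cost(edges, *balanced))  # 10/3
print(dasgupta_cost(edges, *flat))      # 4
```

The two values match the costs quoted in the caption of Fig. 2: 10/3 for a minimizing hierarchy and 4.0 for the hierarchy in which every leaf hangs off the root.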

4.2. THEORETICAL ANALYSIS OF EPH AND FPH

The main motivation for using the expected metrics is that their global optimal value, i.e. the score obtained by the globally optimal hierarchy (the optimizer), is equal to that of their discrete counterparts, as we show in Theo. 1.

Theorem 1. Let $A$ and $B$ be probabilistic transition matrices. Then the following equalities hold (see proof in App. A.5):

$$\min_{A,B} \; \mathbb{E}_{\hat{T} \sim P_{A,B}(T)}\big[\text{Das}(\hat{T})\big] = \min_{\hat{T}} \text{Das}(\hat{T}) \quad \text{and} \quad \max_{A,B} \; \mathbb{E}_{\hat{T} \sim P_{A,B}(T)}\big[\text{TSD}(\hat{T})\big] = \max_{\hat{T}} \text{TSD}(\hat{T}) \qquad (5)$$

Consequently, optimizing our cost function aims to find the optimal discrete hierarchy. Furthermore, we prove in Theo. 2 that Soft-Das is a lower bound of Exp-Das; therefore, its minimum is a lower bound of the optimal discrete Dasgupta cost.

Theorem 2. Let $A$ and $B$ be transition matrices describing a probabilistic hierarchy. Then, Soft-Das will be lower than or equal to the expected Dasgupta cost under the tree-sampling procedure (see proof in App. A.4), i.e.,

$$\text{Soft-Das}(T) \le \mathbb{E}_{\hat{T} \sim P_{A,B}(T)}\big[\text{Das}(\hat{T})\big].$$

In Fig. 2 we show a specific example where the minimizer of Soft-Das is continuous and FPH fails to find the optimal hierarchy. For EPH, we know that an integral solution exists since Exp-Das and Exp-TSD are convex combinations of their discrete counterparts. Furthermore, Exp-Das is neither convex nor concave, as we show in App. A.6. In Table 1 we provide an overview of properties of the cost functions of FPH and EPH.
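The convex-combination remark can be spelled out in one line for the Dasgupta cost. The following is only a sketch of the reasoning and not a substitute for the proof in App. A.5; the symbol $\hat{T}^{*}$ for a discrete minimizer is notation introduced here.

```latex
\mathbb{E}_{\hat{T} \sim P_{A,B}(T)}\big[\mathrm{Das}(\hat{T})\big]
  \;=\; \sum_{\hat{T}} P_{A,B}(\hat{T})\,\mathrm{Das}(\hat{T})
  \;\ge\; \min_{\hat{T}} \mathrm{Das}(\hat{T}),
\qquad \text{with equality if } P_{A,B}(\hat{T}^{*}) = 1
\text{ for a minimizer } \hat{T}^{*}.
```

Since the weights $P_{A,B}(\hat{T})$ are non-negative and sum to one, the expectation can never fall below the smallest discrete cost, and choosing $A$ and $B$ as the binary matrices of $\hat{T}^{*}$ puts all probability mass on that minimizer. The analogous argument with the inequality reversed applies to the maximization of TSD.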

4.3. UNBIASED COMPUTATION OF EXPECTED SCORES VIA DIFFERENTIABLE SAMPLING

In principle, the expected scores could be computed with a closed-form expression. To derive such expressions for Exp-Das and Exp-TSD, we would need to calculate the probability $p(z = v_i \wedge v_j, z \in \text{anc}(v))$, for which no known solution exists, and the expectation of a logarithm (see Eq. 13 and Eq. 14). An alternative to the closed-form solution is to approximate the expectations via the Monte Carlo method. We propose to approximate Exp-Das and Exp-TSD with $N$ differentiably sampled hierarchies $\{\hat{T}^{(1)}, \dots, \hat{T}^{(N)}\}$ (see "Loss computation" in Fig. 1):

$$\text{Exp-Das}(T) \approx \frac{1}{N} \sum_{i=1}^{N} \text{Das}(\hat{T}^{(i)}) \quad \text{and} \quad \text{Exp-TSD}(T) \approx \frac{1}{N} \sum_{i=1}^{N} \text{TSD}(\hat{T}^{(i)}).$$

[Figure 2: Example where FPH fails to infer a minimizing hierarchy. A hierarchy minimizing the Dasgupta cost and the inferred hierarchies by FPH and EPH on the unweighted $K_4$ graph, i.e. every normalized edge weight is equal to 1/6. While FPH achieves a Dasgupta cost of 4.0 after discretization, the continuous hierarchy has a Soft-Das score below 3.0. On the other hand, EPH finds a minimizing hierarchy with the cost of 10/3. Panels: (a) K4 graph, (b) a minimizing hierarchy, (c) FPH, (d) EPH.]

However, differentiable sampling of discrete structures like hierarchies is often complex. To this end, our differentiable hierarchy sampling algorithm combines the tree-sampling procedure (Zügner et al., 2021) and the straight-through Gumbel-Softmax estimator (Jang et al., 2016) in three steps: (1) We sample the parents of the leaf nodes by interpreting the rows of $A$ as parameters of straight-through Gumbel-Softmax estimators. (2) We sample the parents of the internal nodes by interpreting the rows of $B$ as parameters of straight-through Gumbel-Softmax estimators. This sampling procedure is differentiable (each step is differentiable) and expressive (it can sample any hierarchy with $n$ leaves and $n'$ internal nodes). (3) We use the Monte Carlo method to approximate the expectations by computing the arithmetic mean of the scores of the sampled hierarchies. We reuse the differentiable computation of Soft-Das and Soft-TSD, which match the discrete scores for discrete hierarchies while providing gradients w.r.t. $A$ and $B$ (see Fig. 1 for an overview).

Complexity. Since we sample $N$ hierarchies from $n' + n - 1$ categorical distributions with $O(n')$ classes each, the sampling process can be done with a complexity of $O(N \times n \times n' + N \times n'^2)$. The dominating term is the computation of the Das and TSD scores, with a complexity of $O(N \times m \times n'^2)$ for graph datasets and $O(N \times n^2 \times n'^2)$ for vector datasets (Zügner et al., 2021). This is often efficient as we typically have $n' \ll n$ and, for graphs, $m \ll n^2$. In Sec. 4.4 we propose a subgraph sampling approach to reduce the complexity to $O(N \times M \times n'^2 + n^2)$ for large vector datasets, where $M < n^2$.

Limitations. While the previously explained MC estimators of the expectations are unbiased in the forward pass, the estimation of the gradients is not (Paulus et al., 2021), which impacts the EPH optimization. Furthermore, even though the global optimal values of the expected and discrete scores match, EPH does not guarantee convergence to a global optimum when optimizing using gradient descent methods.
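The forward pass of a single straight-through Gumbel-Softmax draw, as used in the sampling steps of this section, can be sketched as follows. This is a standard-library illustration under stated assumptions, not the authors' code: the function name, the temperature argument `tau`, and the example row are hypothetical. Pure Python has no automatic differentiation, so the sketch only returns the two quantities involved: the hard one-hot sample used in the forward pass and the soft relaxation through which an autodiff framework would route gradients. PyTorch exposes a comparable built-in, `torch.nn.functional.gumbel_softmax(logits, tau, hard=True)`; whether EPH uses it is not stated in the text.

```python
import math
import random


def gumbel_softmax_forward(probs, tau, rng):
    """One Gumbel-Softmax draw from a categorical distribution.

    Returns (hard, soft): `hard` is the one-hot sample, `soft` is the
    temperature-scaled softmax of the Gumbel-perturbed log-probabilities.
    """
    scores = []
    for p in probs:
        if p <= 0.0:
            scores.append(-math.inf)  # impossible class, never selected
            continue
        u = rng.random()
        while u == 0.0:  # random() lies in [0, 1); avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        scores.append((math.log(p) + gumbel) / tau)
    # Numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    soft = [e / total for e in exps]
    # The hard sample is the argmax of the perturbed scores.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if j == k else 0.0 for j in range(len(soft))]
    return hard, soft


# One row of parent probabilities: the parent is internal node 0 or 2.
hard, soft = gumbel_softmax_forward([0.2, 0.0, 0.8], tau=0.5, rng=random.Random(0))
print(hard, soft)
```

Applying this draw to every row of A and to every non-root row of B yields one sampled hierarchy; averaging the scores of N such samples gives the Monte Carlo estimate of step (3).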

4.4. SCALABLE EXP-DAS COMPUTATION VIA SUBGRAPH SAMPLING

As discussed in the complexity analysis, the limiting factor is $O(n^2 \times n'^2)$, corresponding to the evaluation of the Dasgupta cost, which becomes prohibitive for large datasets. To reduce the complexity, we propose an unbiased subgraph sampling approach. First, we note that the normalized similarities $P(v_i, v_j)$ can be interpreted as the probability mass function of a categorical distribution. This interpretation allows the Dasgupta cost to be rewritten as an expectation and approximated via a sampling procedure. More specifically,

$$\text{Das}(\hat{T}) = \sum_{v_i, v_j} P(v_i, v_j)\, c(v_i \wedge v_j) = \mathbb{E}_{(v_i, v_j) \sim P(v_i, v_j)}\big[c(v_i \wedge v_j)\big] \approx \frac{1}{M} \sum_{k=1}^{M} c\big(v_i^{(k)} \wedge v_j^{(k)}\big) \qquad (8)$$

where $\{(v_i^{(1)}, v_j^{(1)}), \dots, (v_i^{(M)}, v_j^{(M)})\}$ are $M$ edges sampled from the edge distribution $P(v_i, v_j)$, which can be done in $O(M + n^2)$ (Kronmal & Peterson, 1979). We refer to this sampling approach as subgraph sampling (see Fig. 1). Using the same procedure, we can approximate the expected Dasgupta cost. In contrast to Exp-Das, Exp-TSD cannot be easily viewed as an expectation over edges, which makes the approximation via subgraph sampling impractical. However, since TSD is a metric originally designed for graphs, which are generally sparse, subgraph sampling would not yield substantial benefits there.

Note that we end up with two different sampling procedures. First, we have the differentiable hierarchy sampling (see Eq. 7). This is necessary to approximate the expectations: since we do not have a closed-form expression of Exp-Das and Exp-TSD, we sample discrete hierarchies from the probabilistic ones and average the scores. Second, we have the subgraph sampling (see Eq. 8), which interprets the Dasgupta cost as an expectation. This is done to speed up the runtime for vector datasets, since the number of pairwise similarities grows quadratically in the number of data points. The estimation is unbiased and introduces an additional parameter, i.e. the number of sampled edges, which allows a trade-off between runtime and quality. By inserting the probabilistic edge sampling approach into the tree-sampling, we estimate Exp-Das in a way that scales to large vector datasets. An overview of our model is shown in Fig. 1 and a formal description in App. B.8.
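The edge-sampling estimator of Eq. 8 can be sketched as below for a fixed discrete hierarchy. This is a minimal standard-library illustration with hypothetical names, not the authors' implementation. One deliberate substitution: `random.choices` is used to draw edges in place of the alias method of Kronmal & Peterson (1979) cited in the text, so the sketch does not reproduce the stated sampling complexity.

```python
import random


def leaf_ancestors(leaf_parent, internal_parent):
    """For every leaf, the internal nodes on its path to the root."""
    paths = []
    for z in leaf_parent:
        path = []
        while z is not None:
            path.append(z)
            z = internal_parent[z]
        paths.append(path)
    return paths


def sampled_dasgupta(pairs, probs, leaf_parent, internal_parent, M, rng):
    """Estimate Das by averaging c(v_i ^ v_j) over M edges drawn from P."""
    anc = leaf_ancestors(leaf_parent, internal_parent)
    # c(z): number of leaves below internal node z.
    c = {}
    for path in anc:
        for z in path:
            c[z] = c.get(z, 0) + 1
    # Stand-in for the alias method: weighted sampling with replacement.
    sampled = rng.choices(pairs, weights=probs, k=M)
    total = 0
    for i, j in sampled:
        anc_j = set(anc[j])
        lca = next(z for z in anc[i] if z in anc_j)
        total += c[lca]
    return total / M


# Unweighted K4 with a uniform edge distribution.
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
probs = [1 / 6] * 6

estimate = sampled_dasgupta(pairs, probs, [0, 0, 1, 1], [2, 2, None],
                            M=2000, rng=random.Random(0))
print(estimate)  # fluctuates around the exact cost 10/3
```

The number of sampled edges M plays the role of the trade-off parameter mentioned in the text: larger M reduces the variance of the estimate at the price of more LCA evaluations.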

5.1. EXPERIMENTAL SETUP

Datasets. We evaluate our method on both graph and vector datasets. Graph datasets: We use the same graphs and preprocessing as Zügner et al. (2021). More specifically, we use the datasets Polblogs (Adamic & Glance, 2005), Brain (Amunts et al., 2013), Citeseer (Sen et al., 2008), Genes (Cho et al., 2014), Cora-ML (McCallum et al., 2000; Bojchevski & Günnemann, 2018), OpenFlight (Patokallio), WikiPhysics (Aspert et al., 2019), and DBLP (Yang & Leskovec, 2015). To preprocess a graph, we first extract the largest connected component. Second, every edge is made bidirectional and unweighted. An overview of the graphs is shown in Tab. 6 in the appendix. Vector datasets: We test our method on vector data for the Dasgupta cost. Here we selected the seven datasets Zoo, Iris, Glass, Digits, Segmentation, Spambase, and Letter from the UCI Machine Learning repository (Dua & Graff, 2017). Furthermore, we also use Cifar-100 (Krizhevsky et al., 2009). Digits and Cifar-100 are image datasets; the remaining ones are vector data. While we only flatten the images of Digits, we preprocess Cifar-100 using the ResNet-101 BiT-M-R101x1 by Kolesnikov et al. (2020), which was pretrained on ImageNet-21k (Deng et al., 2009). More specifically, we use the 2048-dimensional activations of the final layer for each image in Cifar-100 as its feature vector. Furthermore, we normalize all features to have a mean of zero and a standard deviation of one. We compute cosine similarities between all pairs of data points using their normalized features. This results in a dense similarity matrix. Finally, we remove the self-loops. Note that in contrast to the graph datasets, the vector data similarities are weighted. An overview is shown in Tab. 7 in the appendix. Since we are in an unsupervised setting, we have no train/test split, i.e. we train and evaluate on the whole graph.

Baselines. We compare our model against both discrete and continuous approaches. For discrete approaches, we use the single, average, complete (Hastie et al., 2009), and Ward linkage (Ward Jr, 1963) algorithms, respectively referred to as SL, AL, CL, and WL. We do not report results of SL and CL on the graph datasets, which do not have edge weights, since these methods are not applicable to unweighted graphs. In addition to the linkage algorithms, we also compare to the recursive sparsest cut (RSC) (Dasgupta, 2016) and the Louvain method (Louv.) (Blondel et al., 2008). For continuous approaches, we use the gradient-based optimization approaches Ultrametric Fitting (UF) (Chierchia & Perret, 2019), Hyperbolic Hierarchical Clustering (HypHC) (Chami et al., 2020), gradient-based Hyperbolic Hierarchical Clustering (gHHC) (Monath et al., 2019), and Flexible Probabilistic Hierarchy (FPH) (Zügner et al., 2021). While the linkage algorithms derive a hierarchy based on heuristics or local objectives, UF, HypHC, gHHC, and FPH aim to reduce a relaxed Dasgupta cost. For all methods, we set a time limit of 120 hours and provide a budget of 512GB of memory for each experiment.

Experimental Setup. We repeat the randomized methods with five random seeds and report the scores of the discrete hierarchies. We use the same experimental setup as Zügner et al. (2021), i.e., we use $n' = 512$ internal nodes, compress hierarchies using the scheme presented by Charpentier & Bonald (2019), and use 10- and 32-dimensional DeepWalk embeddings (Perozzi et al., 2014) on the graphs for methods that require features. We train EPH using PAdamax (projected Adamax (Kingma & Ba, 2014)) for 10000 epochs for Exp-Das and 3000 epochs for Exp-TSD. Additionally, every 1000 epochs we reduce the learning rate for $B$ by a factor of 0.1 and reset the probabilistic hierarchy to the best discrete hierarchy found so far. To approximate the expectation in EPH, we use 20 samples, except for Spambase, Letter, and Cifar-100, where we use 10, 1, and 1, respectively, to reduce the runtime. On the datasets Digits, Segmentation, Spambase, Letter, and Cifar-100, we train EPH and FPH by sampling $n\sqrt{n}$ edges; on the remaining datasets, we use the full graph. Both EPH and FPH are initialized using the average linkage algorithm. We train FPH with its original setting and with our proposed scheduler and report the minimum of both for each dataset. Finally, to obtain the discrete hierarchy for EPH and FPH, we take the most likely edge for each row in $A$ and $B$, as Zügner et al. (2021) did. For the remaining methods, we use the recommended hyperparameters. An overview of the hyperparameters is shown in Tab. 9 and an ablation study in App. B.5.
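The final discretization step, taking the most likely edge in each row of A and B, amounts to a row-wise argmax. The following minimal sketch illustrates it on a toy probabilistic hierarchy; the function name, the parent-list output format, and the example matrices are assumptions made for this illustration.

```python
def discretize(A, B):
    """Pick the most likely parent for every leaf and non-root internal node."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    leaf_parent = [argmax(row) for row in A]
    internal_parent = [argmax(row) for row in B[:-1]]
    internal_parent.append(None)  # the last internal node is the root
    return leaf_parent, internal_parent


# Hypothetical toy hierarchy with 4 leaves and 3 internal nodes.
A = [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.0],
     [0.1, 0.9, 0.0],
     [0.0, 0.6, 0.4]]
B = [[0.0, 0.3, 0.7],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]

print(discretize(A, B))  # ([0, 0, 1, 1], [2, 2, None])
```

Because B is upper triangular, the argmax of each row always points to an internal node with a larger index, so the discretized parents form a valid tree.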

Graph Dataset Results. We report the Dasgupta and Tree-Sampling Divergence results for the graph datasets in Tab. 2. EPH achieves 13/16 best scores and second-best scores otherwise. In particular, EPH, which optimizes Exp-Das, always achieves a better Dasgupta cost than FPH, which optimizes Soft-Das. This observation aligns with the theoretical advantages of Exp-Das compared to Soft-Das (see Sec. 4.2). EPH and FPH, which both use the tree-sampling probabilistic framework, always achieve the best results. This highlights the benefit of the tree-sampling probabilistic framework for hierarchical clustering. The discrete approaches, which use heuristics, achieve competitive results but are consistently inferior to EPH. Furthermore, we can observe that the performance of the linkage algorithms WL and AL and of the Louvain method is competitive, even though they use heuristics or local objectives to infer a hierarchy. Finally, the inferior performance of gHHC and HypHC can intuitively be explained by the fact that these methods are originally designed for vector datasets. WL, UF, and HypHC were not able to scale to the DBLP dataset within the memory budget. Indeed, they require computing a dense $n^2$ similarity matrix, leading to out-of-memory (OOM) issues.

Vector Dataset Results. We report the Dasgupta costs of several methods on the vector datasets in Tab. 3. Similarly to the graph datasets, EPH outperforms all baselines and achieves 7/8 best scores. These results demonstrate the capacity of EPH to also adapt to vector datasets. Further, EPH consistently outperforms FPH. This emphasizes the benefit of optimizing expected scores compared to soft scores. In contrast to the graph datasets, HypHC performs competitively on vector datasets. This is reasonable since the method is originally designed for vector datasets. FPH has a slightly worse performance than HypHC on most datasets and is only better on Iris.

Hyperparameter study. We show in Fig. 3 (left) the effect of the number of sampled hierarchies on the performance of EPH. On the one hand, we observe that a large number of sampled hierarchies (i.e. N ≥ 20) generally yields better results than a small number of sampled hierarchies (i.e. N ≤ 10), except for Citeseer. Intuitively, a higher number of sampled hierarchies should lead to a more accurate approximation of the expected score. On the other hand, we observe that a very large number of sampled hierarchies (i.e. N ≥ 100) might not lead to significant improvements while requiring more computational resources. Intuitively, the noise induced by a lower number of sampled hierarchies could be beneficial for escaping local optima. In general, we found that 20 samples lead to satisfactory results for all datasets, thus achieving a good trade-off between approximation accuracy, optimization noise, and computational requirements. We show in Fig. 3 (right) the effect of the number of sampled edges on the performance of EPH on vector datasets. Using more edges consistently leads to better results. In particular, going from $n$ to $n\sqrt{n}$ shows a significant performance improvement, while going from $n\sqrt{n}$ to $n^2$ yields only minor improvements. Hence, controlling the number of sampled edges allows us to scale our method to large datasets while maintaining high performance. On the small datasets Zoo, Iris, and Glass, we use the whole graph, while for the other datasets, we sample $n\sqrt{n}$ edges as a trade-off between runtime and quality of the hierarchical clustering.

External Evaluation. We complement the evaluation based on the internal metrics Dasgupta cost and TSD with external evaluation metrics. However, since we typically do not have access to ground-truth hierarchies in real-world data, it is difficult to perform external evaluation.
To address this, we evaluate our models on synthetic datasets with known ground-truth hierarchies and investigate whether the inferred hierarchies on the vector datasets preserve the flat class labels. For the graph datasets, we use two hierarchical stochastic block models (HSBMs), which allow us to compare the inferred hierarchies with the ground-truth hierarchies. As the HSBM graphs are generated by a random process, the ground-truth hierarchy is not necessarily the best in terms of the Dasgupta cost or Tree-Sampling Divergence. Indeed, we observe that the Dasgupta cost and Tree-Sampling Divergence of the hierarchies inferred by EPH are even better than those of the ground-truth hierarchies on the HSBMs. This underlines the capacity of EPH to optimize the Dasgupta and TSD scores. Furthermore, we compute the normalized mutual information (NMI) between the different levels of the ground-truth hierarchy and the inferred hierarchy (see Tab. 4). We observe that EPH recovers the first three levels of the ground-truth hierarchy almost perfectly. Interestingly, the TSD objective appears to be a more suitable metric to recover the ground-truth HSBM levels. We show the results of FPH in App. B.3 and a visualization of the ground-truth and inferred hierarchies in Fig. 4. For vector datasets, we flatten the derived hierarchies and compare the clusters with the available ground-truth labels by applying the Hungarian algorithm to align the cluster assignments with the labels, as explained by Zhou et al. (2022). This procedure allows us to compute the accuracy, which we show on the right-hand side of Tab. 3. While the linkage al-

Qualitative Evaluation. We visualize the largest cluster inferred on Cifar-100 using EPH, i.e. the internal node with the most directly connected leaves. Furthermore, we sort the images by their probability, i.e. their entry in the matrix A.
We show the 16 images with the highest probability and the 16 with the lowest probability for the largest cluster in Fig. 5. We observe that the images with high probabilities are related to insects. This shows that EPH is able to group similar images together. In contrast, the images with the lowest probability do not fit into the group. This demonstrates the capacity of EPH to measure the uncertainty in the cluster assignments. We show additional results with the same behavior for other clusters in App. B.4 (see Fig. 8, Fig. 9, and Fig. 10). Furthermore, we visualize the graph and the hierarchies inferred by EPH for OpenFlight in Fig. 11 in the appendix. Optimizing either Exp-Das or Exp-TSD generates reasonable clusters and successfully distinguishes different world regions.

6. CONCLUSION

In this work, we propose EPH, a novel end-to-end learnable approach to infer hierarchies in data. EPH operates on probabilistic hierarchies and directly optimizes the expected Dasgupta cost and expected Tree-Sampling Divergence using differentiable hierarchy sampling. We show that the global optima of the expected scores are equal to their discrete counterparts. Furthermore, we present an unbiased subgraph sampling approach to scale EPH to large datasets. We demonstrate the capacity of our model by evaluating it on several synthetic and real-world datasets. EPH outperforms traditional and recent state-of-the-art baselines.

ETHICS STATEMENT

EPH is not tied to a specific real-world application; its outcome therefore depends solely on how the practitioner uses it. It could potentially be abused by governments or corporations to analyze collected data at large scale. However, EPH can also make positive contributions by supporting scientists in finding hierarchies in data. While EPH outperforms other state-of-the-art methods for hierarchical clustering, we raise awareness of the possibility that the algorithm fails to generate meaningful hierarchies, especially in a novel setting. Therefore, the results should be carefully assessed by the practitioner.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our experiments, we provide an overview of our datasets in App. B. Furthermore, we provide a detailed description of the experimental setup and data preprocessing in Sec. 5.1 and an overview of the hyperparameters we used in Table 9. Our model is implemented in PyTorch and will be publicly available upon acceptance. We use sklearn¹ to flatten the hierarchies and to run the Louvain algorithm. For verifiability of our theoretical results, we provide proofs of the theorems in App. A.

A APPENDIX

A.1 EQUATIONS OF SOFT-DAS AND SOFT-TSD

In the following, we show the equations of Soft-Das and Soft-TSD.

\[ \text{Soft-Das}(T) = \sum_{v_i, v_j \in V} P(v_i, v_j) \sum_{z \in Z} \sum_{v \in V} p(z = v_i \wedge v_j)\, p(z \in \mathrm{anc}(v)) \tag{9} \]

\[ \text{Soft-TSD}(T) = \sum_{z \in Z} p(z) \log \frac{p(z)}{q(z)} \tag{10} \]

where

\[ p(z) = \sum_{v_i, v_j \in V} P(v_i, v_j)\, p(z = v_i \wedge v_j) \tag{11} \]

\[ q(z) = \sum_{v_i, v_j \in V} P(v_i) P(v_j)\, p(z = v_i \wedge v_j) \tag{12} \]

A.2 CLOSED FORM SOLUTIONS OF EXP-DAS AND EXP-TSD

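To compute closed form solutions of the expectations, the following equations need to be solved:

\[ \text{Exp-Das}(T) = \sum_{v_i, v_j \in V} P(v_i, v_j) \sum_{z \in Z} \sum_{v \in V} p(z = v_i \wedge v_j, z \in \mathrm{anc}(v)) \tag{13} \]

\[ \text{Exp-TSD}(T) = \sum_{z \in Z} \mathbb{E}_{\hat{T} \sim P_{A,B}(T = (A, B))} \left[ p(z) \log \frac{p(z)}{q(z)} \right] \tag{14} \]

Inside the expectation, p(z) and q(z) are evaluated on the sampled hierarchy.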
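For a single discrete hierarchy, the joint probability in Eq. 13 is an indicator, so the double sum over z and v reduces to the number of leaves below the lowest common ancestor of each edge. The following sketch illustrates this special case. The function names and the `parent` dictionary representation are assumptions made for illustration, and each listed pair is counted once.

```python
def ancestors(node, parent):
    """Ancestors of `node`, ordered from its direct parent up to the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def dasgupta_cost(edge_probs, parent, leaves):
    """Sum over edges of P(vi, vj) times the number of leaves under their LCA."""
    anc = {v: ancestors(v, parent) for v in leaves}

    # Count how many leaves lie below each internal node.
    n_leaves = {}
    for v in leaves:
        for z in anc[v]:
            n_leaves[z] = n_leaves.get(z, 0) + 1

    cost = 0.0
    for (vi, vj), p in edge_probs.items():
        anc_j = set(anc[vj])
        # The first shared ancestor on the path from vi to the root is the LCA.
        lca = next(z for z in anc[vi] if z in anc_j)
        cost += p * n_leaves[lca]
    return cost


# Toy hierarchy: z1 = {v1, v2}, z2 = {v3, v4}, root z3 = {z1, z2}.
parent = {"v1": "z1", "v2": "z1", "v3": "z2", "v4": "z2", "z1": "z3", "z2": "z3"}
leaves = ["v1", "v2", "v3", "v4"]
edge_probs = {("v1", "v2"): 0.5, ("v3", "v4"): 0.25, ("v2", "v3"): 0.25}
cost = dasgupta_cost(edge_probs, parent, leaves)
```

Here the two within-cluster edges meet at nodes with two leaves and the cross edge meets at the root with four leaves, giving 0.5·2 + 0.25·2 + 0.25·4 = 2.5.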

A.3 RELATION BETWEEN JOINT AND INDEPENDENT LCA AND ANCESTOR PROBABILITIES

While the LCA probabilities are crucial to compute Soft-Das, Exp-Das requires the joint LCA and ancestor probabilities, i.e. $p(z_k = v_i \wedge v_j, z_k \in \mathrm{anc}(v))$, for the leaves $v_i$, $v_j$ and $v$ and the internal node $z_k$. In Theo. 3, we show that the joint probabilities are an upper bound of the product of the single terms.

Theorem 3. Let $p$ describe the probability under the tree-sampling procedure, $z_k$ an internal node, and $v_1$, $v_2$ and $v$ leaves. Then, the following inequality holds:

\[ p(z_k = v_1 \wedge v_2)\, p(z_k \in \mathrm{anc}(v)) \le p(z_k = v_1 \wedge v_2, z_k \in \mathrm{anc}(v)) \tag{15} \]

[Figure 6 panels: (a) $r_v^{z_k}$ and $r_{v_1}^{z_k}$ meet at node $z_{k'}$; (b) $r_v^{z_k}$ and $r_{v_2}^{z_k}$ meet at node $z_{k'}$; (c) $r_v^{z_k}$, $r_{v_1}^{z_k}$ and $r_{v_2}^{z_k}$ meet at node $z_k$.]

Proof. First, we observe that the right-hand side of the inequality can be rewritten as:

\[ p(z_k = v_1 \wedge v_2, z_k \in \mathrm{anc}(v)) = p(z_k = v_1 \wedge v_2 \mid z_k \in \mathrm{anc}(v))\, p(z_k \in \mathrm{anc}(v)) \tag{16} \]

To prove the non-trivial case $p(z_k \in \mathrm{anc}(v)) \neq 0$, we need to show that the following holds:

\[ p(z_k = v_1 \wedge v_2) \le p(z_k = v_1 \wedge v_2 \mid z_k \in \mathrm{anc}(v)) \]

Let $r_{v_i}^{z_j} = (v_i, \dots, z_j)$ denote a path from a leaf $v_i$ to an internal node $z_j$ and let $z_{n'}$ be the root node. Recalling from Zügner et al. (2021) that the paired path probability under the tree-sampling procedure is $p((r_{v_1}^{z_{n'}}, r_{v_2}^{z_{n'}})) = p(r_{v_1}^{z_k})\, p(r_{v_2}^{z_k})\, p(r_{z_k}^{z_{n'}})$, with $z_k = v_1 \wedge v_2$, we can rewrite the LCA probabilities as

\[ p(z_k = v_1 \wedge v_2) = \sum_{(r_{v_1}^{z_k},\, r_{v_2}^{z_k}) :\, z_k = v_1 \wedge v_2} p(r_{v_1}^{z_k})\, p(r_{v_2}^{z_k}) \]

Adding the condition $z_k \in \mathrm{anc}(v)$ means there exists a path from the leaf $v$ to the internal node $z_k$. There are three different cases: either the path meets $r_{v_1}^{z_k}$ and $r_{v_2}^{z_k}$ at $z_k$ for the first time, or the path meets $r_{v_1}^{z_k}$ or $r_{v_2}^{z_k}$ in a lower node $z_{k'}$, with $k' < k$. The cases are shown in Fig. 6. In the first case, all three paths are independent. Thus, the LCA probabilities do not change. In the other two cases, they are only independent up to the node $z_{k'}$. The probability for the path $r_{z_{k'}}^{z_k}$ is equal to 1 since we know that $z_k \in \mathrm{anc}(v)$. More formally, the conditional probability is

\[ p(z_k = v_1 \wedge v_2 \mid z_k \in \mathrm{anc}(v)) = \sum_{(r_{v_1}^{z_k},\, r_{v_2}^{z_k}) :\, z_k = v_1 \wedge v_2} p(r_{v_1}^{z_k} \mid z_k \in \mathrm{anc}(v))\, p(r_{v_2}^{z_k} \mid z_k \in \mathrm{anc}(v)) \]

Assuming that the path from $v$ to $z_k$ meets the path from $v_1$ to $z_k$ in the node $z_{k'}$ with $k' \le k$, we have

\[ p(r_{v_1}^{z_k} \mid z_k \in \mathrm{anc}(v))\, p(r_{v_2}^{z_k} \mid z_k \in \mathrm{anc}(v)) = p(r_{v_1}^{z_{k'}})\, p(r_{v_2}^{z_k}) \ge p(r_{v_1}^{z_k})\, p(r_{v_2}^{z_k}) \]

The last inequality follows since $r_{v_1}^{z_{k'}}$ is a subpath of $r_{v_1}^{z_k}$ and therefore has a probability that is at least as high. This concludes the proof.

A.4 PROOF OF THEO. 2

In the following, we provide the proof of the inequality shown in Theo. 2.

Proof. To prove it, we first write out the definitions of Soft-Das and the expected Dasgupta cost.

\[ \text{Soft-Das}(T) = \sum_{v_1, v_2} P(v_1, v_2) \sum_{z} \sum_{v} p(z = v_1 \wedge v_2)\, p(z \in \mathrm{anc}(v)) \]

and

\begin{align}
\mathbb{E}_{\hat{T} \sim P_{A,B}(T)} \left[ \text{Das}(\hat{T}) \right]
&= \mathbb{E}_{\hat{T} \sim P_{A,B}(T)} \left[ \sum_{v_1, v_2} P(v_1, v_2) \sum_{z} \sum_{v} \mathbb{I}_{[z = v_1 \wedge v_2]}\, \mathbb{I}_{[z \in \mathrm{anc}(v)]} \right] \nonumber \\
&= \mathbb{E}_{\hat{T} \sim P_{A,B}(T)} \left[ \sum_{v_1, v_2} P(v_1, v_2) \sum_{z} \sum_{v} \mathbb{I}_{[z = v_1 \wedge v_2,\, z \in \mathrm{anc}(v)]} \right] \nonumber \\
&= \sum_{v_1, v_2} P(v_1, v_2) \sum_{z} \sum_{v} \mathbb{E}_{\hat{T} \sim P_{A,B}(T)} \left[ \mathbb{I}_{[z = v_1 \wedge v_2,\, z \in \mathrm{anc}(v)]} \right] \tag{25} \\
&= \sum_{v_1, v_2} P(v_1, v_2) \sum_{z} \sum_{v} p(z = v_1 \wedge v_2, z \in \mathrm{anc}(v)) \nonumber
\end{align}

The proof follows by applying Theo. 3 to each summand.

A.5 PROOF OF THEO. 1

Here we provide the proof of Theo. 1.

Proof. To prove the left-hand side, we first observe that the expected Dasgupta cost can be rewritten as a convex combination of the Dasgupta costs of all possible hierarchies under the tree-sampling procedure. More formally,

\[ \mathbb{E}_{\hat{T} \sim P_{A,B}(T)} \left[ \text{Das}(\hat{T}) \right] = \sum_{\hat{T} \in H(n, n')} P_{A,B}(\hat{T})\, \text{Das}(\hat{T}) \]

where $H(n, n')$ describes the set of all valid hierarchies with $n$ leaves and $n'$ internal nodes. Thus, the minimizer of the expected Dasgupta cost is a convex combination of all minimizing hierarchies, with the minimum being equal to the optimal Dasgupta cost. The equation on the right-hand side for TSD can be proved equivalently.

Note that, since the expectation is a convex combination of discrete scores, any discrete optimizer (i.e. a discrete hierarchy achieving the optimum value) of the discrete scores will be an optimizer of the expected scores and vice versa. In this case, discrete hierarchies are represented by deterministic A, B matrices. Only probabilistic hierarchies that are optimizers of the expected scores and are represented by non-discrete A, B matrices are not optimizers of the discrete scores. This is expected since those probabilistic hierarchies do not belong to the valid input domain of the discrete scores. In addition, any sample we draw from these probabilistic optimizers is also a discrete optimizer of Dasgupta or TSD, again because the expectation is a convex combination.
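The convex-combination argument of A.5 can be spelled out in one line. The following is a sketch in which $\text{Das}^{*}$ is notation introduced here for the optimal discrete Dasgupta cost over $H(n, n')$; it does not appear in the original text.

```latex
\begin{align*}
\mathbb{E}_{\hat{T} \sim P_{A,B}(T)} \left[ \text{Das}(\hat{T}) \right]
  &= \sum_{\hat{T} \in H(n, n')} P_{A,B}(\hat{T})\, \text{Das}(\hat{T}) \\
  &\ge \sum_{\hat{T} \in H(n, n')} P_{A,B}(\hat{T})\, \text{Das}^{*}
   = \text{Das}^{*},
  \qquad \text{Das}^{*} = \min_{\hat{T} \in H(n, n')} \text{Das}(\hat{T}).
\end{align*}
```

Equality holds exactly when $P_{A,B}$ places all of its mass on hierarchies with $\text{Das}(\hat{T}) = \text{Das}^{*}$. Deterministic matrices A, B encoding a minimizing hierarchy attain the bound, and any sample drawn from a probabilistic optimizer must itself be a discrete optimizer.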
A.6 NON-CONVEXITY AND NON-CONCAVITY OF EXP-DAS

Minimizing a convex function using gradient descent is easier than minimizing a concave one. In a constrained setting, minimizing a concave function heavily depends on the initialization. Exp-Das(T = (A, B)) is neither convex nor concave with respect to A and B; a counter-example exists for each property. This implies that we cannot tell whether Exp-Das converges to a local or a global minimum during training. To show that Exp-Das is not concave, it is sufficient to find two hierarchies $T_1 = (A_1, B_1)$ and $T_2 = (A_2, B_2)$ such that the following inequality holds strictly:

\[ \tfrac{1}{2} \text{Exp-Das}(T_1) + \tfrac{1}{2} \text{Exp-Das}(T_2) > \text{Exp-Das}\left( \tfrac{1}{2} (T_1 + T_2) \right) \tag{28} \]

and, analogously, to show that it is not convex:

\[ \tfrac{1}{2} \text{Exp-Das}(T_1) + \tfrac{1}{2} \text{Exp-Das}(T_2) < \text{Exp-Das}\left( \tfrac{1}{2} (T_1 + T_2) \right) \tag{29} \]

where $T_1 + T_2 = (A_1 + A_2, B_1 + B_2)$. In Fig. 7, we show these two examples. In (a) and (b) we show two hierarchies and in (c) a linear interpolation of these two. The graph in (d) satisfies Eq. 28, while the graph in (e) satisfies Eq. 29. We report the Dasgupta costs for all hierarchy and graph combinations in Tab. 5.

We show an overview of the hyperparameters we used in Tab. 9.

[Figure 7 panels: (a) $\hat{T}_1 = (\hat{A}_1, \hat{B}_1)$; (b) $\hat{T}_2 = (\hat{A}_2, \hat{B}_2)$; (c) $T_I = 0.5 \cdot \hat{T}_1 + 0.5 \cdot \hat{T}_2$; (d) Convex Example; (e) Concave Example. Each hierarchy has leaves $v_1, \dots, v_4$ and internal nodes $z_1, z_2, z_3$.]

B.3 HSBM RESULTS FOR FPH

We show the results for FPH on the HSBM graphs in Tab. 10.

Constrained vs. Unconstrained Optimization. We require the matrices A and B to be row-stochastic. There are several ways to enforce this: we can either perform constrained optimization using projections onto the probability simplex, or simply apply a softmax operation over the rows. In Tab. 11 we compare the Dasgupta costs of both options on several graph datasets. We observe that constrained optimization, i.e. using projections after each step, yields better results than unconstrained optimization on every graph. This aligns with the findings of Zügner et al. (2021). Therefore, we recommend using constrained optimization.

Initialization. The initialization of a model can play a crucial role. Zügner et al. (2021) found that using the AL algorithm as initialization yields substantial improvements. Therefore, we compare random and AL initialization, and additionally test using their algorithm FPH as initialization. We show the Dasgupta costs for several graph datasets in Tab. 12. As expected, using the AL algorithm or FPH as initialization yields significant improvements over a random initialization. Even though the FPH initialization starts with a better hierarchy, the resulting hierarchies are inferior to those obtained with the AL initialization. This could be caused by local minima in which the model gets stuck. We recommend using AL as initialization since it performs best on most datasets and has a lower computational cost than FPH.

Direct vs. Embedding Parametrization. In addition to the direct parametrization of the matrices A and B, we test an embedding parametrization for each node in the hierarchy. More specifically, we use d-dimensional embeddings for the leaves and internal nodes. To infer A and B, we perform a softmax operation with an additional learnable temperature parameter t_i over the cosine similarities between the embeddings. The main advantage of the embedding approach is that, in addition to the hierarchical clustering, we gain node embeddings that can be used for downstream tasks such as classification or regression. We test the embedding parametrization with d = 128 on several graph datasets, once with a learnable t_i and once with t_i frozen to 1. We compare the results to the constrained optimization. While we train the direct parametrization for 1000 epochs, the embedding approach is trained for 20000 epochs. This is done to ensure convergence since it is randomly initialized. We show the results in Tab. 13. First, we observe that freezing the temperature parameter yields substantially worse results. Furthermore, the embedding parametrization is inferior to the direct parametrization, even though it was trained for 20000 epochs, while the constrained optimization was only trained for 1000. Only on PolBlogs is the embedding approach slightly better than the direct parametrization. We attribute the inferior performance to the random initialization and the fact that we have to use a softmax operation instead of projections. Our results are in line with the ablation study of Zügner et al. (2021), who also parametrized their model using embeddings and used the softmax function on the negative Euclidean distances to infer the matrices A and B. Since the embedding approach yields worse results with longer training times, we recommend using the direct parametrization.
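The two ways of enforcing row-stochasticity discussed above can be contrasted on a single row. This is a minimal sketch, not the authors' implementation. The projection uses the standard sort-based Euclidean projection onto the probability simplex. How the temperature enters the softmax (here as a divisor) is an assumption, as are the function names.

```python
import math


def project_to_simplex(row):
    """Euclidean projection of a real vector onto the probability simplex."""
    sorted_desc = sorted(row, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        # Keep the threshold of the largest prefix that stays positive.
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in row]


def softmax(row, temperature=1.0):
    """Numerically stable softmax; dividing by the temperature is an assumption."""
    shift = max(row)
    exps = [math.exp((x - shift) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


row = [0.5, 1.5, -0.2]
projected = project_to_simplex(row)  # can contain exact zeros
soft = softmax(row)                  # strictly positive entries
```

Both outputs sum to one, but the projection can return exact zeros and therefore fully deterministic rows, whereas softmax outputs are always strictly positive.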

Number of Internal Nodes. As in many real-world problems, we do not know the number of internal nodes n′ beforehand in our experiments. While increasing n′ generally leads to more refined and expressive hierarchies, it reduces interpretability and comes with a higher computational cost. To select the hyperparameter n′, we test various choices on several datasets. We show the corresponding Dasgupta costs and TSD scores in Fig. 13 and Fig. 14. We found that n′ = 512 is sufficient to capture most information. In practice, we recommend using the Elbow method.

Number of Sampled Hierarchies. Another crucial hyperparameter for EPH is the number of sampled hierarchies. In addition to Fig. 3, we provide the raw Dasgupta costs and standard errors after training in Fig. 15. Furthermore, we show the influence of the number of samples used to approximate the expected Dasgupta cost on randomly initialized hierarchies in Fig. 16.

B.6 STANDARD DEVIATIONS

We show the standard deviations of the randomized models on the graph datasets in Tab. 14 and for the vector datasets in Tab. 15.

B.7 RUNTIMES

We report the runtimes for EPH and the baselines in Tab. 16 and Tab. 17. While HypHC, FPH and EPH are executed on a GPU (NVIDIA A100), the remaining methods do not support or did not require GPU acceleration. Since gHHC has a lower runtime than the other randomized methods, we run it with 50 random seeds instead of 5.

B.8 PSEUDOCODE

In the following we provide a formal description of our EPH algorithm, the subgraph sampling, and how we normalize graphs.

Algorithm 1 EPH (lines 6 to 11)
6: g_t ← g_t + ∇_T Score(Ĝ, T̂)
7: end for
8: T_t ← T_{t-1} − (α / K) g_t
9: T_t ← P(T_t) ▷ simplex projection
10: end for
11: return T_t

Algorithm 3 SampleSubgraph (lines 4 to 8)
4: Ê.add(e)
5: end for
6: Ĝ ← (V, Ê)
7: NormalizeGraph(Ĝ)
8: return Ĝ
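The graph normalization and subgraph sampling steps can be sketched in a few lines of plain Python. This is an illustrative reading of the pseudocode rather than the authors' code: the dictionary-of-weights graph representation, the function names, and the choice to store both orientations of an undirected edge are assumptions.

```python
import random
from collections import Counter


def normalize_graph(weights):
    """Return the edge distribution P(vi, vj) and node marginals P(vi)."""
    total = sum(weights.values())
    p_edge = {edge: w / total for edge, w in weights.items()}
    p_node = Counter()
    for (vi, _vj), p in p_edge.items():
        p_node[vi] += p  # marginalize over the second endpoint
    return p_edge, dict(p_node)


def sample_subgraph(weights, num_edges, rng):
    """Draw `num_edges` edges with replacement, then renormalize the multiset."""
    p_edge, _ = normalize_graph(weights)
    edges = list(p_edge)
    drawn = rng.choices(edges, weights=[p_edge[e] for e in edges], k=num_edges)
    multiset = Counter(drawn)  # duplicate draws become integer edge weights
    return normalize_graph(multiset)


# Toy graph with both orientations of each undirected edge stored explicitly.
weights = {("a", "b"): 2.0, ("b", "a"): 2.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
full_p_edge, full_p_node = normalize_graph(weights)
sub_p_edge, sub_p_node = sample_subgraph(weights, 100, random.Random(0))
```

Because duplicates are kept in the multiset, an edge that is drawn more often receives a proportionally larger probability in the sampled graph.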



¹ https://scikit-learn.org/stable/index.html



Figure 1: Overview of our proposed EPH model. A formal description is given in App. B.8.

Figure 3: Hyperparameter study. Normalized Dasgupta costs for different numbers of sampled hierarchies (left) and different numbers of sampled edges (right) after the EPH training, including the average linkage algorithm (AL) and a training on the full graph (FG). The scores are normalized such that each dataset has a mean of zero and a standard deviation of one.

Figure 4: Ground truth clusters and dendrograms compared to the inferred ones for the HSBMs.

Figure 6: The different cases of the event $p(z_k = v_1 \wedge v_2 \mid z_k \in \mathrm{anc}(v))$. While the LCA of $v_1$ and $v_2$ is $z_k$ in every case, the LCA of $v_1$ and $v$ and the LCA of $v_2$ and $v$ are different. We have three cases: either the paths from $v_1$ or $v_2$ and $v$ meet before $z_k$ at node $z_{k'}$ (shown in (a) and (b)), or all paths meet for the first time at $z_k$ (shown in (c)).

the graph and vector datasets are given in Tab. 6 and Tab. 7. The details of the HSBMs are shown in Tab. 8.


Figure 7: Three hierarchies and two graphs that show that Exp-Das is neither convex nor concave with respect to A and B. The hierarchy in (c) is a linear interpolation of the hierarchies in (a) and (b). The graphs in (d) and (e) are counter-examples, with convex and concave behavior, respectively.

Figure 8: Largest derived clusters on Digits. On the left in each subplot the 16 images with the highest probability, on the right the 16 images with the lowest probability.

Figure 9: Second largest derived cluster on Cifar-100.

Figure 10: Third largest derived cluster on Cifar-100.

Figure 14: TSD scores for different numbers of internal nodes.

Figure 15: Dasgupta costs and standard error for different numbers of sampled hierarchies after the EPH training.

Figure 16: Approximated Expected Dasgupta costs for different numbers of sampled hierarchies for randomly initialized probabilistic hierarchies.

Algorithm 1 EPH
Require: G = (V, E): Graph
Require: T = (A, B): Initial hierarchy
Require: α: Learning rate
Require: K: Number of sampled hierarchies
1: for t = 1, . . . do

Algorithm 2 NormalizeGraph
Require: G = (V, E): Graph
1: P(v_i, v_j) ← w_{i,j} / Σ_{u,v∈V} w_{u,v}
2: P(v_i) ← Σ_j w_{i,j} / Σ_{u,v∈V} w_{u,v}

Algorithm 3 SampleSubgraph
Require: G = (V, E): Graph
Require: M: Number of sampled edges
1: Ê ← MultiSet() ▷ allow duplicate edges
2: for m = 1 . . . M do
3: e = (v_i, v_j) ∼ P(v_i, v_j)

Properties of Soft-Das, Exp-Das, Soft-TSD, and Exp-TSD.

Results for the graph datasets.

Results for the vector datasets.

Results of EPH for the HSBMs with n ′ =# Cluster.

Overview of the graph datasets.

Overview of the vector datasets.

Overview of the HSBMs.

Overview of the Hyperparameters.

Results of EPH for the HSBMs with n ′ =# Cluster.

Dasgupta costs for different initializations on several graph datasets with n ′ = 512 internal nodes. In the first three rows the initial Dasgupta costs and in the last three rows the Dasgupta costs after the training. Best scores in bold, second best underlined.

Dasgupta costs for the direct and embedding parametrization on several graph datasets with n ′ = 512 internal nodes. Best scores in bold, second best underlined.

Standard Deviations for the graph datasets.

Standard Deviations for the vector datasets.

