DIVIDE-AND-CLUSTER: SPATIAL DECOMPOSITION BASED HIERARCHICAL CLUSTERING

Abstract

This paper is about increasing the computational efficiency of clustering algorithms. Many clustering algorithms are based on properties of the relative locations of points, globally or locally, e.g., interpoint distances and nearest-neighbor distances. This amounts to using less than the full dimensionality D of the space in which the points are embedded. We present a clustering algorithm, Divide-and-Cluster (DAC), which detects local clusters in small neighborhoods and then merges adjacent clusters hierarchically, following the Divide-and-Conquer paradigm. This significantly reduces computation time, which may otherwise grow nonlinearly in the number n of points. We define local clusters as those within hypercubes in a recursive hypercubical decomposition of space, represented by a tree. Clusters within each hypercube at a tree level are merged with those from neighboring hypercubes to form clusters of the parent hypercube at the next level. We expect DAC to perform better than many other algorithms because (a) as clusters merge into larger clusters (components), their number steadily decreases relative to the number of points, (b) only neighboring entities are compared, and (c) those entities are components, not points. The recursive merging yields a cluster hierarchy (tree). Further, our use of small neighborhoods allows piecewise uniform approximation of large, nonuniform, arbitrarily shaped clusters, thus avoiding the need for global cluster models. We present DAC's complexity and experimentally verify the correctness of detected clusters on several datasets posing a variety of challenges, and show that DAC's runtime is significantly better than representative algorithms of other types for increasing values of n and D.

1. INTRODUCTION

Finding clusters formed by n data points in a D-dimensional space is a very frequent operation in machine learning, even more so in unsupervised learning. This paper is about increasing the computational efficiency of clustering. The notion of a cluster is fundamentally a perceptual one; its precise definition, used implicitly or explicitly by clustering algorithms, varies. Given a set of n points x_1, x_2, ..., x_n in a D-dimensional space (x_i ∈ R^D), clustering aims to identify groupings/components of nearby points according to the definition of cluster being used. In this paper, we model a cluster as a contiguous set of overlapping small neighborhoods such that: points in each neighborhood are distributed (nearly) uniformly, the cluster shape is arbitrary, and at their closest approach any two clusters are separated by a distance larger than the distance between nearby within-cluster neighbors. Many common clustering algorithms are based on assumptions about global properties of point positions, such as distances between points or sets of points, as in K-means. Another class of algorithms uses local structure, defining locality via edge connectivity in different types of graphs defined over the points. Other algorithms use nearby points, e.g., those located within a neighborhood of chosen size. A few approaches make full use of the D-dimensional geometry, e.g., those based on Voronoi neighborhoods (Ahuja & Tuceryan, 1989); while they work well, their constituent algorithms (e.g., for Voronoi tessellation) are available only for small values of D. In this paper, we present a clustering algorithm, called Divide-and-Cluster (DAC), that detects clusters in small local neighborhoods formed by hierarchically decomposing the D-space, and then repeatedly grows them outwards by remerging parts of a cluster chopped by neighborhood boundaries. This implements the Divide-and-Conquer paradigm of computation.
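A minimal sketch of this local-detect-and-merge idea follows. It is an illustrative reconstruction, not the authors' implementation: the function names (`dac_cluster`, `_merge`), the single-linkage merge rule with threshold `eps`, and the `leaf_size` leaf cutoff are all assumptions, and the paper's complexity-saving devices (e.g., restricting distance computation to component boundaries) are omitted.

```python
import math

def dac_cluster(points, eps, leaf_size=4):
    """Recursively split the bounding hypercube into 2^D children,
    cluster each child, then merge child clusters whose closest
    points come within distance eps of each other."""
    if len(points) <= leaf_size:
        return _merge([[p] for p in points], eps)  # each point starts alone
    D = len(points[0])
    lo = [min(p[d] for p in points) for d in range(D)]
    hi = [max(p[d] for p in points) for d in range(D)]
    mid = [(l + h) / 2 for l, h in zip(lo, hi)]
    # Partition points among the 2^D child hypercubes by midpoint.
    children = {}
    for p in points:
        key = tuple(p[d] >= mid[d] for d in range(D))
        children.setdefault(key, []).append(p)
    if len(children) == 1:
        # Degenerate split (e.g., duplicate points): cluster directly.
        return _merge([[p] for p in points], eps)
    clusters = []
    for child in children.values():
        clusters.extend(dac_cluster(child, eps, leaf_size))
    # Remerge clusters chopped by child-hypercube boundaries.
    return _merge(clusters, eps)

def _gap(c1, c2):
    """Closest-approach distance between two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def _merge(clusters, eps):
    """Repeatedly merge any two clusters separated by at most eps."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if _gap(clusters[i], clusters[j]) <= eps:
                    clusters[i] = clusters[i] + clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

On two well-separated point groups, e.g., three points near the origin and three near (10, 10) with `eps=1.0`, the sketch returns two clusters of three points each.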
We use hypercubical neighborhoods formed by a recursive hypercubical tessellation of the D-space, represented by a tree. Clusters straddling boundaries of neighboring hypercubes/nodes at one level are combined to form clusters of their common parent node at the next level, in a bottom-up tree traversal. Once the recursive merging is completed at the tree root, all detected clusters are agglomeratively merged, closest clusters first, to form a cluster hierarchy. We evaluate our algorithm against other algorithms in the following ways. (1) Correctness of Detected Clusters: We compare detected clusters on the standard Fundamental Clustering Problems Suite (FCPS) of 2D and 3D datasets, which poses a variety of important clustering challenges. (2) Runtimes: We compare runtimes on datasets having a range of n and D values. (3) Downstream Task: We observe the impact of DAC as a clustering substitute in a CNN.
The advantages of the proposed algorithm lead to the following contributions:
1. Spatial Divide-and-Conquer: Our approach recursively divides the D-space using a fixed, predetermined geometric criterion. This divides the points without the additional computation that data-dependent division criteria may require. The resulting logarithmic run-time increases with E. DAC also offers a new mechanism for trading off among n, D, and E.
2. Point Components: DAC's time complexity over global/point-based algorithms further benefits from the use of components of points instead of individual points: (a) it limits computation of the closest distance between neighboring components to mutually visible points (e.g., half) of the component boundaries, instead of all points in the components, and (b) the agglomerative generation of hierarchical clusters need only merge entire, closest components, implying a complexity in terms of the smaller number of components rather than points.
3. Cluster Density: DAC's local cluster detection allows arbitrarily simple local models of clusters, avoiding cluster-wide global models. For example, by repeated division, DAC can increasingly improve the validity of a piecewise uniform approximation of cluster density.
4. Cluster Shape: Compositional detection of a cluster from its contiguous small pieces also allows detection of clusters of arbitrary shapes, thus avoiding shape models such as the isotropy/compactness assumed by K-means.
In the following, we first review related work (Sec 2), followed by the spatial decomposition (tree representation) used to define DAC (Sec 3), the DAC algorithm (Sec 4), computational complexity (Sec 5), experimental results (Sec 6), and limitations and conclusions (Sec 7).

2. RELATED WORK

Clustering obtains a partition of the points that minimizes a cost function chosen to reflect the types of clusters to be detected, often along with some user-specified parameters. An example of a cost function is the Sum of Squared Error (SSE), the sum of the squares of distances of points from their cluster centers. Hierarchical clustering approaches derive an agglomerative tree (hierarchy) of clusters formed by repeatedly merging, bottom up, lowest cost first. Partitioning algorithms form the hierarchy by repeatedly splitting the clusters, maximally reducing the cost at each step. Following are some broad categories of algorithms. Graph-based clustering algorithms typically use a partitioning strategy on a graph of data points with weighted edges. Examples are the Minimal Spanning Tree (MST) and Shared Nearest Neighbor (SNN) clustering algorithms. MST clustering defines edge weight as the distance between two points and removes the k-1 highest-weighted edges from the MST of the graph to obtain k clusters, where k is a parameter. SNN defines the similarity of two points as a thresholded overlap in their nearest neighbors (where k' nearest neighbors are found for each point); similar vertices are connected, and connected components are discovered to return the clusters. Spectral approaches aim to partition a graph such that similar points are grouped together and dissimilar points are not. They define a similarity matrix that contains a similarity score for each pair of points, and use this matrix to find the graph cut with minimum cost.
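As a concrete instance of the graph-based family, the MST procedure described above can be sketched as follows. This is an illustrative sketch under my own assumptions (Euclidean edge weights, Prim's algorithm on the complete graph, and the hypothetical name `mst_clusters`), not code from any cited work.

```python
import math

def mst_clusters(points, k):
    """Build the minimum spanning tree of the complete Euclidean graph,
    remove the k-1 heaviest MST edges, and return the resulting
    connected components as clusters (lists of point indices)."""
    n = len(points)
    # Prim's algorithm: grow the tree from point 0, tracking for each
    # outside point its cheapest connection (weight, tree_vertex).
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, n)}
    mst_edges = []
    while best:
        j = min(best, key=lambda v: best[v][0])
        w, i = best.pop(j)
        mst_edges.append((w, i, j))
        for v in best:
            d = math.dist(points[j], points[v])
            if d < best[v][0]:
                best[v] = (d, j)
    # Keep all but the k-1 heaviest of the n-1 MST edges.
    kept = sorted(mst_edges)[: n - k]
    # Union-find to extract the connected components.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())
```

With k=2 on two well-separated groups of points, the single long "bridge" edge of the MST is the heaviest and is removed, leaving the two groups as components.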


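The SNN procedure mentioned above can be sketched similarly. This is a simplified reconstruction: the name `snn_clusters`, the brute-force nearest-neighbor search, and the raw shared-count threshold `min_shared` are my assumptions; published SNN variants (e.g., Jarvis-Patrick) often add further conditions such as mutual neighborhood.

```python
import math

def snn_clusters(points, k_nn, min_shared):
    """Shared Nearest Neighbor clustering: connect two points when their
    k_nn-nearest-neighbor sets share at least min_shared members, then
    return the connected components of the resulting graph."""
    n = len(points)
    # k_nn nearest neighbors of each point (brute force, excludes self).
    neighbors = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        neighbors.append(set(order[1 : k_nn + 1]))
    # Connect pairs with sufficient shared-neighbor overlap.
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(neighbors[i] & neighbors[j]) >= min_shared:
                adj[i].append(j)
                adj[j].append(i)
    # Connected components by depth-first search.
    seen, clusters = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(comp)
    return clusters
```

Because neighbor sets of points in different well-separated groups share no members, cross-group edges never form, and each group emerges as its own component.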