DIVIDE-AND-CLUSTER: SPATIAL DECOMPOSITION BASED HIERARCHICAL CLUSTERING

Abstract

This paper is about increasing the computational efficiency of clustering algorithms. Many clustering algorithms are based on properties of the relative locations of points, globally or locally, e.g., interpoint distances and nearest-neighbor distances. This amounts to using less than the full dimensionality D of the space in which the points are embedded. We present a clustering algorithm, Divide-and-Cluster (DAC), which detects local clusters in small neighborhoods and then merges adjacent clusters hierarchically, following the Divide-and-Conquer paradigm. This significantly reduces computation time, which may otherwise grow nonlinearly in the number n of points. We define local clusters as those within hypercubes in a recursive hypercubical decomposition of space, represented by a tree. Clusters within each hypercube at a tree level are merged with those from neighboring hypercubes to form clusters of the parent hypercube at the next level. We expect DAC to outperform many other algorithms because (a) as clusters merge into larger clusters (components), their number steadily decreases relative to the number of points, (b) we merge components rather than points, and (c) we merge only neighboring components. The recursive merging yields a cluster hierarchy (tree). Further, our use of small neighborhoods allows piecewise uniform approximation of large, nonuniform, arbitrarily shaped clusters, thus avoiding the need for global cluster models. We present DAC's complexity and experimentally verify the correctness of detected clusters on several datasets posing a variety of challenges, and show that DAC's runtime is significantly better than that of representative algorithms of other types, for increasing values of n and D.
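The decompose-and-merge scheme summarized above can be illustrated with a minimal sketch (our own simplification for exposition, not the paper's implementation): points are recursively partitioned into up to 2^D child hypercubes, components are formed within leaf hypercubes, and components of sibling hypercubes are merged as the recursion unwinds. For brevity this sketch links all cross-sibling point pairs within an assumed distance threshold eps, whereas DAC restricts merging to neighborhoods near shared hypercube faces; the names dac_sketch, eps, and leaf_size are hypothetical.

```python
import numpy as np

class DSU:
    """Union-find over point indices; connected components become clusters."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def dac_sketch(points, eps, leaf_size=8):
    """Recursive 2^D hypercube decomposition with bottom-up cluster merging."""
    pts = np.asarray(points, dtype=float)
    dsu = DSU(len(pts))

    def link_pairs(idx_a, idx_b):
        # Link points from two disjoint index lists that lie within eps.
        for i in idx_a:
            for j in idx_b:
                if np.linalg.norm(pts[i] - pts[j]) <= eps:
                    dsu.union(i, j)

    def link_within(idx):
        for k, i in enumerate(idx):
            for j in idx[k + 1:]:
                if np.linalg.norm(pts[i] - pts[j]) <= eps:
                    dsu.union(i, j)

    def recurse(idx, lo, hi):
        if len(idx) <= leaf_size:
            link_within(idx)          # leaf hypercube: cluster points directly
            return
        mid = (lo + hi) / 2.0
        children = {}                 # partition into up to 2^D child hypercubes
        for i in idx:
            children.setdefault(tuple(pts[i] >= mid), []).append(i)
        if len(children) == 1:        # degenerate split (coincident points)
            link_within(idx)
            return
        blocks = []
        for key, child in children.items():
            clo = np.where(key, mid, lo)
            chi = np.where(key, hi, mid)
            recurse(child, clo, chi)  # cluster each child hypercube
            blocks.append(child)
        # Merge step: join components across sibling hypercubes.
        # (Real DAC examines only neighborhoods near shared faces.)
        for a in range(len(blocks)):
            for b in range(a + 1, len(blocks)):
                link_pairs(blocks[a], blocks[b])

    recurse(list(range(len(pts))), pts.min(axis=0), pts.max(axis=0))
    labels, seen = [], {}
    for i in range(len(pts)):
        labels.append(seen.setdefault(dsu.find(i), len(seen)))
    return labels
```

The sketch preserves the key structural property of the abstract: each pair of points is examined exactly once, at the tree level where the two points first fall into different hypercubes, so the merge steps near the root operate on a number of components that is much smaller than n.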

1. INTRODUCTION

Finding clusters formed by n data points in a D-dimensional space is a very frequent operation in machine learning, even more so in unsupervised learning. This paper is about increasing the computational efficiency of clustering. The notion of a cluster is fundamentally a perceptual one; its precise definition, used implicitly or explicitly by clustering algorithms, varies. Given a set of n points x_1, x_2, ..., x_n in a D-dimensional space (x_i ∈ R^D), clustering aims to identify different groupings/components of nearby points according to the definition of cluster being used. In this paper, we model a cluster as a contiguous set of overlapping small neighborhoods such that: points in each neighborhood are distributed (nearly) uniformly, the cluster shape is arbitrary, and at their closest approach any two clusters are separated by a distance larger than the distance between nearby within-cluster neighbors. Many common clustering algorithms are based on assumptions about global properties of point positions, such as distances between points or sets of points, as in K-means. Another class of algorithms uses local structure, defining locality via edge-connectivity in different types of graphs defined by the points. Other algorithms use nearby points, e.g., those located within a neighborhood of chosen size. A few approaches make full use of the D-dimensional geometry, e.g., those based on Voronoi neighborhoods (Ahuja & Tuceryan, 1989); while they work well, their constituent algorithms (e.g., for Voronoi tessellation) are available only for small values of D. In this paper, we present a clustering algorithm, called Divide-and-Cluster (DAC), that detects clusters in small local neighborhoods formed by hierarchically decomposing the D-space and then re-

