MINI-BATCH k-MEANS TERMINATES WITHIN O(d/ϵ) ITERATIONS

Abstract

We answer the question: "Does local progress (on batches) imply global progress (on the entire dataset) for mini-batch k-means?" Specifically, we consider mini-batch k-means which terminates only when the improvement in the quality of the clustering on the sampled batch is below some threshold. Although at first glance it appears that this algorithm might execute forever, we answer the above question in the affirmative and show that if the batch is of size Ω((d/ϵ)^2), the algorithm must terminate within O(d/ϵ) iterations with high probability, where d is the dimension of the input and ϵ is a threshold parameter for termination. This is true regardless of how the centers are initialized. When the algorithm is initialized with the k-means++ initialization scheme, it achieves an approximation ratio of O(log k) (the same as the full-batch version). Finally, we show the applicability of our results to the mini-batch k-means algorithm implemented in the scikit-learn (sklearn) Python library.

1. INTRODUCTION

The mini-batch k-means algorithm (Sculley, 2010) is one of the most popular clustering algorithms used in practice (Pedregosa et al., 2011). However, due to its stochastic nature, it appears that if we do not explicitly bound the number of iterations of the algorithm, then it might never terminate. We show that, when the batch size is sufficiently large, using only an "early-stopping" condition, which terminates the algorithm when the local progress observed on a batch is below some threshold, we can guarantee a bound on the number of iterations of the algorithm that is independent of the input size.

Problem statement. We consider the following optimization problem. We are given an input (dataset) X = {x_i}_{i=1}^n ⊆ [0,1]^d of n d-dimensional real vectors and a parameter k. Note that the assumption that X ⊆ [0,1]^d is standard in the literature (Arthur et al., 2011), and is meant to simplify notation (otherwise we would have to introduce a new parameter for the diameter of X). Our goal is to find a set C of k centers (vectors in [0,1]^d) such that the following goal function is minimized:

(1/n) Σ_{x∈X} min_{c∈C} ∥c − x∥²

Usually, the 1/n factor does not appear, as it does not affect the optimization goal; however, in our case it will be useful to define it as such.

Lloyd's algorithm. The most popular method for solving the above problem is Lloyd's algorithm, often referred to as the k-means algorithm (Lloyd, 1982). It works by randomly initializing a set of k centers and repeating the following two steps: (1) assign every point in X to the center closest to it; (2) update every center to be the mean of the points assigned to it. The algorithm terminates when no point is reassigned to a new center. This algorithm is extremely fast in practice but has a worst-case exponential running time (Arthur & Vassilvitskii, 2006; Vattani, 2011).
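The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's algorithm: it initializes centers as a random sample of k input points (the paper later considers k-means++ initialization), and it keeps a center fixed if its cluster becomes empty, which is one common convention.

```python
import random

def lloyd(points, k, seed=0):
    """A minimal sketch of Lloyd's algorithm on a list of same-length tuples."""
    rng = random.Random(seed)
    # Initialize with a random sample of k distinct input points (one common choice).
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    while True:
        # Step 1: assign every point to the closest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda j: sum((p[t] - centers[j][t]) ** 2 for t in range(len(p))))
            for p in points
        ]
        # Terminate when no point is reassigned to a new center.
        if new_assignment == assignment:
            return centers, assignment
        assignment = new_assignment
        # Step 2: move every center to the mean of the points assigned to it.
        for j in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == j]
            if cluster:  # keep the old center if its cluster is empty
                dim = len(cluster[0])
                centers[j] = [sum(p[t] for p in cluster) / len(cluster)
                              for t in range(dim)]
```

On two well-separated groups of points, any initialization here converges to the natural 2-clustering; the exponential worst case cited above requires carefully constructed inputs.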

