MINI-BATCH k-MEANS TERMINATES WITHIN O(d/ϵ) ITERATIONS

Abstract

We answer the question: "Does local progress (on batches) imply global progress (on the entire dataset) for mini-batch k-means?". Specifically, we consider mini-batch k-means which terminates only when the improvement in the quality of the clustering on the sampled batch is below some threshold. Although at first glance it appears that this algorithm might execute forever, we answer the above question in the affirmative and show that if the batch is of size Ω((d/ϵ)²), it must terminate within O(d/ϵ) iterations with high probability, where d is the dimension of the input and ϵ is a threshold parameter for termination. This holds regardless of how the centers are initialized. When the algorithm is initialized with the k-means++ initialization scheme, it achieves an approximation ratio of O(log k) (the same as the full-batch version). Finally, we show the applicability of our results to the mini-batch k-means algorithm implemented in the scikit-learn (sklearn) Python library.

1. INTRODUCTION

The mini-batch k-means algorithm (Sculley, 2010) is one of the most popular clustering algorithms used in practice (Pedregosa et al., 2011). However, due to its stochastic nature, if we do not explicitly bound the number of iterations of the algorithm, it might never terminate. We show that, when the batch size is sufficiently large, using only an "early-stopping" condition, which terminates the algorithm when the local progress observed on a batch is below some threshold, we can guarantee a bound on the number of iterations that the algorithm performs that is independent of the input size.

Problem statement

We consider the following optimization problem. We are given an input (dataset) X = {x_i}_{i=1}^n ⊆ [0, 1]^d of n d-dimensional real vectors and a parameter k. Note that the assumption that X ⊆ [0, 1]^d is standard in the literature (Arthur et al., 2011) and is meant to simplify notation (otherwise we would have to introduce a new parameter for the diameter of X). Our goal is to find a set C of k centers (vectors in [0, 1]^d) such that the following goal function is minimized:

(1/n) Σ_{x∈X} min_{c∈C} ∥c − x∥²

Usually, the 1/n factor does not appear, as it does not affect the optimization goal; however, in our case it will be useful to define it as such.

Lloyd's algorithm

The most popular method to solve the above problem is Lloyd's algorithm (often referred to as the k-means algorithm) (Lloyd, 1982). It works by randomly initializing a set of k centers and repeating the following two steps: (1) assign every point in X to the center closest to it; (2) update every center to be the mean of the points assigned to it. The algorithm terminates when no point is reassigned to a new center. This algorithm is extremely fast in practice but has a worst-case exponential running time (Arthur & Vassilvitskii, 2006; Vattani, 2011).
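The goal function and one Lloyd iteration can be sketched as follows (a minimal NumPy sketch; the function names are ours, not from any library):

```python
import numpy as np

def kmeans_cost(X, C):
    """Goal function: (1/n) * sum over x in X of min_{c in C} ||c - x||^2."""
    # Pairwise squared distances between points and centers, shape (n, k).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def lloyd_step(X, C):
    """One iteration of Lloyd's algorithm: (1) assign each point to its
    nearest center, (2) move each center to the mean of its points."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    new_C = C.copy()
    for j in range(len(C)):
        members = X[assign == j]
        if len(members) > 0:          # empty clusters keep their old center
            new_C[j] = members.mean(axis=0)
    return new_C, assign
```

Each step can only decrease the goal function, which is why Lloyd's algorithm terminates once no point changes its assignment.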

Mini-batch k-means

To update the centers, Lloyd's algorithm must go over the entire input at every iteration. This can be computationally expensive when the input data is extremely large. To tackle this, the mini-batch k-means method was introduced by Sculley (2010). It is similar to Lloyd's algorithm, except that steps (1) and (2) are performed on a batch of b elements sampled uniformly at random with repetition, and in step (2) the centers are updated slightly differently. Specifically, every center is updated to be the weighted average of its current value and the mean of the points (in the batch) assigned to it. The parameter by which we weigh these values is called the learning rate, and its value differs between centers and iterations. In the original paper by Sculley, there is no stopping condition similar to that of Lloyd's algorithm; instead, the algorithm is simply executed for t iterations, where t is an input parameter. In practice (for example, in sklearn (Pedregosa et al., 2011)), together with an upper bound on the number of iterations there are several "early stopping" conditions: we may terminate the algorithm when the change in the locations of the centers is sufficiently small, or when the goal function for several consecutive batches does not improve. We note that in both theory (Tang & Monteleoni, 2017; Sculley, 2010) and practice (Pedregosa et al., 2011) the learning rate goes to 0 over time. That is, over time the movement of the centers becomes smaller and smaller, which guarantees termination for most reasonable early-stopping conditions at the limit. In particular, the analysis of Tang & Monteleoni (2017) sets the learning rate at iteration t to O(1/(n² + t)); taking this into account, one gets a convergence rate of Ω(n²/t), which implies, at best, a quadratic bound on the execution time of the algorithm. Our results are the first to show extremely fast termination guarantees for mini-batch k-means with early-stopping conditions. Surprisingly, we need not require the learning rate to go to 0. Our results do not guarantee convergence to any local minimum; however, they guarantee an exponentially faster runtime bound.
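One mini-batch iteration of the kind described above can be sketched as follows (a hedged sketch: the per-center learning rate here follows Sculley's 1/count rule, where the count tracks how many points have ever been assigned to a center, and all function names are ours):

```python
import numpy as np

def minibatch_step(X, C, counts, b, rng):
    """One mini-batch k-means iteration: sample b points uniformly with
    repetition, assign each to its nearest center, then move each center
    toward the sampled point by a per-center learning rate (1/count)."""
    batch = X[rng.integers(0, len(X), size=b)]   # uniform, with repetition
    d2 = ((batch[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    for x, j in zip(batch, assign):
        counts[j] += 1
        lr = 1.0 / counts[j]                      # learning rate for center j
        C[j] = (1.0 - lr) * C[j] + lr * x         # weighted-average update
    return C, counts
```

Because the counts only grow, the learning rate shrinks over time, which is exactly the "learning rate goes to 0" behavior discussed above.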

Our results

We analyze the mini-batch k-means algorithm described above (Sculley, 2010), where the algorithm terminates only when the improvement in the quality of the clustering for the sampled batch is less than some threshold parameter ϵ. That is, we terminate if, for some batch, the difference in the quality of the clustering before the update and after the update is less than ϵ. Our stopping condition is slightly different from what is used in practice: in sklearn, termination is determined based on the changes in the cluster centers. In Section 5 we prove that this condition also fits within our framework. Our main goal is to answer the following theoretical question: "Does local progress (on batches) imply global progress (on the entire dataset) for mini-batch k-means, even when the learning rate does not go to 0?". Intuitively, it is clear that the answer depends on the batch size used by the algorithm. If the batch is the entire dataset, the claim is trivial and results in a termination guarantee of O(d/ϵ) iterations¹. We show that when the batch size exceeds a certain threshold, local progress indeed implies global progress, and we achieve the same asymptotic bound on the number of iterations as when the batch is the entire dataset. We present several results. We start with a warm-up in Section 3, showing that when b = Ω(kd³ϵ⁻²) we can guarantee termination within Õ(d/ϵ) iterations² w.h.p. (with high probability)³. We require the additional assumption that every real number in the system can be represented using O(1) bits (e.g., 64-bit floats). The above bound holds regardless of how the cluster centers are initialized or updated. That is, this bound holds for any center update rule, not only for the "standard" center update rule described above. Our proof uses elementary tools and is presented to set the stage for our main result.
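The early-stopping rule analyzed here can be sketched as follows (a sketch under our own naming; for concreteness the center update moves each center to the mean of its batch points, which is one simple instance of a center-update rule, and the loop stops once the improvement in the batch cost falls below ϵ):

```python
import numpy as np

def batch_cost(B, C):
    """Mean squared distance from each batch point to its nearest center."""
    d2 = ((B[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def run_until_small_progress(X, C, b, eps, rng, max_iters=10_000):
    """Iterate mini-batch updates; terminate when the improvement in the
    clustering quality on the sampled batch (before vs. after the update)
    drops below eps."""
    for t in range(max_iters):
        batch = X[rng.integers(0, len(X), size=b)]
        before = batch_cost(batch, C)
        # Assign batch points, then move each center to the batch mean of
        # its assigned points.
        d2 = ((batch[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(len(C)):
            members = batch[assign == j]
            if len(members) > 0:
                C[j] = members.mean(axis=0)
        if before - batch_cost(batch, C) < eps:   # local progress below threshold
            return C, t + 1
    return C, max_iters
```

Note that the stopping test measures progress only on the sampled batch; the question above is precisely whether this local test certifies progress on the full dataset.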



¹ This holds because the maximum value of the goal function is d (Lemma 2.1).
² Throughout this paper, the tilde notation hides logarithmic factors in n, k, d, and ϵ.
³ This is usually taken to be 1 − 1/n^p for some constant p ≥ 1. For our case, it holds that p = 1; however, this can be amplified arbitrarily by increasing the batch size by a multiplicative constant factor.

