IMPROVED LEARNING-AUGMENTED ALGORITHMS FOR K-MEANS AND K-MEDIANS CLUSTERING

Abstract

We consider the problem of clustering in the learning-augmented setting. We are given a data set in d-dimensional Euclidean space, and a label for each data point, provided by a predictor, indicating which subsets of points should be clustered together. This setting captures situations where we have access to some auxiliary information about the data set that is relevant to our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an α ∈ (0, c) fraction of false positives and false negatives, for some c < 1; in the absence of these errors the labels would attain the optimal clustering cost OPT. For a data set of size m, we propose a deterministic k-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm, while preserving the O(dm log m) runtime. Furthermore, our algorithm works even when the predictions are not very accurate: our cost bound holds for α up to 1/2, an improvement over the bound of at most 1/7 on α in previous work. For the k-medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter α, obtaining a cost of (1 + O(α))·OPT while requiring essentially just O(md log³(m/α)) runtime.

1. INTRODUCTION

In this paper we study k-means and k-medians clustering in the learning-augmented setting. In both problems we are given an input data set P of m points in d-dimensional Euclidean space and an associated distance function dist(·, ·). The goal is to compute a set C = {c_1, . . . , c_k} of k points in the same space that minimizes the cost function cost(P, C) = ∑_{p∈P} min_{i∈[k]} dist(p, c_i). In words, the cost associated with a single data point is its distance to the closest point in C, and the cost of the whole data set is the sum of the costs of its individual points. In the k-means setting dist(x, y) := ∥x − y∥², i.e., the square of the Euclidean distance, and in the k-medians setting dist(x, y) := ∥x − y∥, although here, instead of the norm of x − y, in principle any other distance function can be used. These problems are well studied in the algorithms and machine learning literature, and are known to be hard to solve exactly (Dasgupta, 2008), or even to approximate well beyond a certain factor (Cohen-Addad & Karthik C. S., 2019). Although approximation algorithms for these problems exist and are widely used in practice, the theoretical approximation factors of practical algorithms can be quite large, e.g., the 50-approximation of Song & Rajasekaran (2010) and the O(ln k)-approximation of Arthur & Vassilvitskii (2006). Meanwhile, the algorithms with relatively tight approximation factors do not necessarily scale well in practice (Ahmadian et al., 2019). To overcome these computational barriers, Ergun et al. (2022) proposed a learning-augmented setting in which we have access to some auxiliary information about the input data set. This is motivated by the fact that in practice we expect the data set of interest to have exploitable structure relevant to the optimal clustering. For instance, a classifier's predictions on the points of a data set can help group similar instances together. This notion was formalized in Ergun et al.
(2022) by assuming that we have access to a predictor in the form of a labelling P = P_1 ∪ · · · ∪ P_k (all points in P_i share the label i ∈ [k]), such that there exist an unknown optimal clustering P = P*_1 ∪ · · · ∪ P*_k, an associated set of centers (c*_1, . . . , c*_k) that achieves the optimal clustering cost OPT (that is, ∑_{i∈[k]} cost(P*_i, {c*_i}) = OPT), and a known label error rate α such that |P_i ∩ P*_i| ≥ (1 − α) · max(|P_i|, |P*_i|). In simpler terms, the auxiliary partitioning (P_1, . . . , P_k) is close to some optimal clustering: each predicted cluster has at most an α-fraction of points from outside its corresponding optimal cluster, and at most an α-fraction of the points in the corresponding optimal cluster are missing from the predicted cluster. The predictor, in other words, has false positive and false negative rates of at most α for each label. Observe that even when the predicted clusters P_i are close to a set of true clusters P*_i, in the sense that the label error rate α is very small, computing the means or medians of the P_i can lead to arbitrarily bad solutions. It is known that for k-means the center allocated to an optimal cluster should simply be the average of all points in that cluster (this can be seen by differentiating the convex 1-mean objective and solving for the minimizer). However, a single false positive located far from the cluster can move this center arbitrarily far from the true points in the cluster and drive the cost up arbitrarily high. Clustering algorithms in this setting must therefore process the predicted clusters in a way that precludes this possibility. Using tools from the robust statistics literature, the authors of Ergun et al. (2022) proposed a randomized algorithm that achieves a (1 + 20α)-approximation given a label error rate α < 1/7 and a guarantee that each predicted cluster has Ω(k/α) points. For the k-medians problem, the authors of Ergun et al.
(2022) also proposed an algorithm that achieves a (1 + α′)-approximation if each predicted cluster contains Ω(n/k) points and the label error rate α is at most O(α′⁴/(k log(k/α′))), where the big-Oh notation hides a small unspecified constant, and α′ < 1. The requirement in both algorithms of Ergun et al. (2022) that the label error rate α be small leads us to investigate the following question: Is it possible to design k-means and k-medians algorithms that achieve a (1 + α)-approximate clustering even when the predictor is not very accurate?
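The sensitivity of the mean to a single far-away false positive, and the relative robustness of the coordinate-wise median, can be made concrete with a small sketch (NumPy; the helper `clustering_cost` and the synthetic data are our own illustration, not the algorithm of Ergun et al.):

```python
import numpy as np

def clustering_cost(points, centers, squared=True):
    """Sum over points of the distance to the nearest center.

    squared=True gives the k-means cost (squared Euclidean distance);
    squared=False gives the k-medians cost (Euclidean distance).
    """
    # Pairwise distances, shape (num_points, num_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    return float((nearest ** 2).sum() if squared else nearest.sum())

rng = np.random.default_rng(0)

# 99 "true" points tightly concentrated near the origin, plus a single
# false positive far away -- together they form one predicted cluster.
true_points = rng.normal(loc=0.0, scale=0.1, size=(99, 2))
false_positive = np.array([[1000.0, 1000.0]])
predicted_cluster = np.vstack([true_points, false_positive])

# The mean is dragged a constant fraction of the way toward the outlier
# (here to roughly (10, 10)), while the coordinate-wise median of the
# predicted cluster stays near the origin.
mean_center = predicted_cluster.mean(axis=0, keepdims=True)
median_center = np.median(predicted_cluster, axis=0, keepdims=True)

# k-means cost of the true points under each candidate center.
cost_with_mean = clustering_cost(true_points, mean_center)
cost_with_median = clustering_cost(true_points, median_center)
```

Here `cost_with_mean` exceeds `cost_with_median` by several orders of magnitude, and placing the false positive farther away inflates the gap without bound, which is why robust estimation of each cluster's center is needed.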

1.1. OUR CONTRIBUTIONS

In this work, we not only give an affirmative answer to the question above for both the k-means and the k-medians problems; our algorithms also have improved bounds on the clustering cost, preserve the time complexity of the previous approaches, and remove the requirement of a lower bound on the size of each predicted cluster. For learning-augmented k-means, we modify the main subroutine of the previous randomized algorithm to get a deterministic method that works for all α < 1/2, which is the natural breaking point (as explained below). In the regime where the k-means algorithm of Ergun et al. (2022) applies, we improve the approximation factor to 1 + 7.7α. For the larger domain α ∈ [0, 1/2), we derive a more general expression, reproduced in Table 1. Our algorithm thus achieves a better bound on the clustering cost than the previous approach while preserving the O(md log m) runtime and requiring no lower bound on the size of each predicted cluster. Our k-medians algorithm improves upon the algorithm in Ergun et al. (2022) by achieving a (1 + O(α))-approximation for α < 1/2, thereby improving both the range of α and the dependence of the approximation factor on the label error rate, from biquadratic to near-linear. For success probability 1 − δ, our runtime is O((1/(1 − 2α)) · md log³(m/α) · log((k log(k/δ))/((1 − 2α)δ)) · log(k/δ)), so by setting δ = 1/poly(k) we get only a logarithmic dependence of the runtime on k, as opposed to a polynomial dependence.

Funding

* Equal contribution. All three authors were supported in part by NSF CAREER grant CCF-1750716 and NSF grant CCF-1909314. 

