IMPROVED LEARNING-AUGMENTED ALGORITHMS FOR K-MEANS AND K-MEDIANS CLUSTERING

Abstract

We consider the problem of clustering in the learning-augmented setting. We are given a data set in d-dimensional Euclidean space, and a label for each data point, given by a predictor, indicating which subsets of points should be clustered together. This setting captures situations where we have access to auxiliary information about the data set that is relevant for our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an α ∈ (0, c) fraction of false positives and false negatives, for some constant c < 1, and that in the absence of these errors the labels would attain the optimal clustering cost OPT. For a data set of size m, we propose a deterministic k-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm, while preserving the O(dm log m) runtime. Furthermore, our algorithm works even when the predictions are not very accurate: our cost bound holds for α up to 1/2, an improvement over the requirement that α be at most 1/7 in previous work. For the k-medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter α, obtaining a cost of (1 + O(α))OPT while requiring essentially just O(md log^3(m)/α) runtime.

1. INTRODUCTION

In this paper we study k-means and k-medians clustering in the learning-augmented setting. In both problems we are given an input data set P of m points in d-dimensional Euclidean space and an associated distance function dist(·, ·). The goal is to compute a set C = {c_1, ..., c_k} of k points in that same space that minimizes the cost function cost(P, C) = Σ_{p∈P} min_{i∈[k]} dist(p, c_i). In words, the cost associated with a single data point is its distance to the closest point in C, and the cost of the whole data set is the sum of the costs of its individual points. In the k-means setting dist(x, y) := ∥x − y∥², i.e., the squared Euclidean distance, and in the k-medians setting we set dist(x, y) := ∥x − y∥, although here, instead of the norm of x − y, we can in principle use any other distance function. These problems are well-studied in the algorithms and machine learning literature, and are known to be hard to solve exactly (Dasgupta, 2008), or even to approximate well beyond a certain factor (Cohen-Addad & Karthik C. S., 2019). Although approximation algorithms for these problems exist and are used widely in practice, the theoretical approximation factors of practical algorithms can be quite large, e.g., the 50-approximation in Song & Rajasekaran (2010) and the O(ln k)-approximation in Arthur & Vassilvitskii (2006). Meanwhile, the algorithms with relatively tight approximation factors do not necessarily scale well in practice (Ahmadian et al., 2019). To overcome these computational barriers, Ergun et al. (2022) proposed a learning-augmented setting where we have access to auxiliary information about the input data set. This is motivated
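The cost function above can be made concrete with a minimal sketch; the function and variable names here are illustrative, not taken from any reference implementation. It evaluates cost(P, C) = Σ_{p∈P} min_{i∈[k]} dist(p, c_i) under the two distance functions used by k-means and k-medians, respectively.

```python
import math

def clustering_cost(points, centers, dist):
    # Each point pays its distance (under `dist`) to the nearest center;
    # the total cost is the sum over all points.
    return sum(min(dist(p, c) for c in centers) for p in points)

def kmeans_dist(x, y):
    # k-means: squared Euclidean distance, dist(x, y) = ||x - y||^2
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def kmedians_dist(x, y):
    # k-medians: Euclidean distance, dist(x, y) = ||x - y||
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Toy data set P of m = 3 points in d = 2 dimensions, and k = 2 centers.
P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
C = [(0.5, 0.0), (10.0, 0.0)]

print(clustering_cost(P, C, kmeans_dist))    # 0.25 + 0.25 + 0 = 0.5
print(clustering_cost(P, C, kmedians_dist))  # 0.5 + 0.5 + 0 = 1.0
```

As the text notes, k-medians admits other choices of distance function as well; any `dist` callable can be passed in its place.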

Funding

* Equal contribution. All three authors were supported in part by NSF CAREER grant CCF-1750716 and NSF grant CCF-1909314. 

