IMPROVED LEARNING-AUGMENTED ALGORITHMS FOR K-MEANS AND K-MEDIANS CLUSTERING

Abstract

We consider the problem of clustering in the learning-augmented setting. We are given a data set in $d$-dimensional Euclidean space, and a label for each data point given by a predictor indicating what subsets of points should be clustered together. This setting captures situations where we have access to some auxiliary information about the data set relevant for our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an $\alpha$ fraction of false positives and false negatives, where $\alpha \in (0, c)$ for some $c < 1$, and that in the absence of these errors the labels would attain the optimal clustering cost OPT. For a data set of size $m$, we propose a deterministic k-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm while preserving the $O(dm \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate: our cost bound holds for all $\alpha < 1/2$, an improvement over the requirement $\alpha < 1/7$ in previous work. For the k-medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$, obtaining a cost of $(1 + O(\alpha))\mathrm{OPT}$ while requiring essentially just $O(md \log^3(m/\alpha))$ runtime.

1. INTRODUCTION

In this paper we study k-means and k-medians clustering in the learning-augmented setting. In both problems we are given an input data set $P$ of $m$ points in $d$-dimensional Euclidean space and an associated distance function $\mathrm{dist}(\cdot, \cdot)$. The goal is to compute a set $C = \{c_1, \dots, c_k\}$ of $k$ points in that same space that minimizes the following cost function:

$$\mathrm{cost}(P, C) = \sum_{p \in P} \min_{i \in [k]} \mathrm{dist}(p, c_i).$$

In words, the cost associated with a single data point is its distance to the closest point in $C$, and the cost of the whole data set is the sum of the costs of its individual points. In the k-means setting $\mathrm{dist}(x, y) := \|x - y\|^2$, i.e., the square of the Euclidean distance, and in the k-medians setting $\mathrm{dist}(x, y) := \|x - y\|$, although here, instead of the norm of $x - y$, we can in principle use any other distance function. These problems are well studied in the algorithms and machine learning literature, and are known to be hard to solve exactly (Dasgupta, 2008), or even to approximate beyond a certain factor (Cohen-Addad & Karthik C. S., 2019). Although approximation algorithms exist for these problems and are widely used in practice, the theoretical approximation factors of practical algorithms can be quite large, e.g., the 50-approximation in Song & Rajasekaran (2010) and the $O(\ln k)$-approximation in Arthur & Vassilvitskii (2006). Meanwhile, the algorithms with relatively tight approximation factors do not necessarily scale well in practice (Ahmadian et al., 2019).

To overcome these computational barriers, Ergun et al. (2022) proposed a learning-augmented setting where we have access to some auxiliary information about the input data set. This is motivated by the fact that in practice we expect the data set of interest to have exploitable structure relevant to the optimal clustering. For instance, a classifier's predictions on the points of a data set can help group similar instances together. This notion was formalized in Ergun et al. (2022) by assuming that we have access to a predictor in the form of a labelling $P = P_1 \cup \dots \cup P_k$ (all the points in $P_i$ have the same label $i \in [k]$), such that there exist an unknown optimal clustering $P = P^*_1 \cup \dots \cup P^*_k$, an associated set of centers $C = (c^*_1, \dots, c^*_k)$ that achieve the optimal clustering cost OPT (that is, $\sum_{i \in [k]} \mathrm{cost}(P^*_i, \{c^*_i\}) = \mathrm{OPT}$), and a known label error rate $\alpha$ such that:

$$|P_i \cap P^*_i| \ge (1 - \alpha) \max(|P_i|, |P^*_i|).$$

In simpler terms, the auxiliary partitioning $(P_1, \dots, P_k)$ is close to some optimal clustering: each predicted cluster has at most an $\alpha$-fraction of points from outside its corresponding optimal cluster, and at most an $\alpha$-fraction of the points in the corresponding optimal cluster are missing from the predicted cluster. The predictor, in other words, has a false positive rate and a false negative rate of at most $\alpha$ for each label.

Observe that even when the predicted clusters $P_i$ are close to a set of true clusters $P^*_i$ in the sense that the label error rate $\alpha$ is very small, computing the means or medians of the $P_i$ can lead to arbitrarily bad solutions. It is known that for k-means the point allocated to an optimal cluster should simply be the average of all points in that cluster (this can be seen by differentiating the convex 1-mean objective and solving for the minimizer). However, a single false positive located far from the cluster can move this allocated point arbitrarily far from the true points in the cluster and drive the cost up arbitrarily high. The clustering algorithm must therefore process the predicted clusters in a way that precludes this possibility.

Using tools from the robust statistics literature, the authors of Ergun et al. (2022) proposed a randomized algorithm that achieves a $(1 + 20\alpha)$-approximation given a label error rate $\alpha < 1/7$ and a guarantee that each predicted cluster has $\Omega(k/\alpha)$ points. For the k-medians problem, the authors of Ergun et al. (2022) also proposed an algorithm that achieves a $(1 + \alpha')$-approximation if each predicted cluster contains $\Omega(m/k)$ points and the label error rate $\alpha$ is at most $O\left(\frac{\alpha'^4}{k \log(k/\alpha')}\right)$, where the big-Oh notation hides some small unspecified constant, and $\alpha' < 1$. The requirement in both algorithms of Ergun et al. (2022) that the label error rate $\alpha$ be small leads us to investigate the following question: Is it possible to design a k-means and a k-medians algorithm that achieve $(1 + \alpha)$-approximate clustering when the predictor is not very accurate?

1.1. OUR CONTRIBUTIONS

In this work, we not only give an affirmative answer to the question above for both the k-means and the k-medians problems, but our algorithms also have improved bounds on the clustering cost, while preserving the time complexity of the previous approaches and removing the requirement of a lower bound on the size of each predicted cluster. For learning-augmented k-means, we modify the main subroutine of the previous randomized algorithm to get a deterministic method that works for all $\alpha < 1/2$, which is the natural breaking point (as explained below). In the regime where the k-means algorithm of Ergun et al. (2022) applies, we improve the approximation factor to $1 + 7.7\alpha$. For the larger domain $\alpha \in [0, 1/2)$, we derive a more general expression, reproduced in Table 1. The algorithm preserves the $O(md \log m)$ runtime of the previous approach and does not require a lower bound on the size of each predicted cluster. Our k-medians algorithm improves upon the algorithm in Ergun et al. (2022) by achieving a $(1 + O(\alpha))$-approximation for $\alpha < 1/2$, thereby improving both the range of $\alpha$ and the dependence of the approximation factor on the label error rate from bi-quadratic to near-linear. For success probability $1 - \delta$, our runtime is

$$O\left(\frac{1}{1-2\alpha}\, md \log^3\left(\frac{m}{\alpha}\right) \log\left(\frac{k \log(k/\delta)}{(1-2\alpha)\delta}\right) \log\left(\frac{k}{\delta}\right)\right),$$

so by setting $\delta = 1/\mathrm{poly}(k)$ we have just a logarithmic dependence of the runtime on $k$, as opposed to a polynomial dependence.

Work, Problem | Approx. Factor | Label Error Range | Time Complexity
Ergun et al. (2022), k-Means | $1 + 20\alpha$ | $\left(\frac{10 \log m}{\sqrt{m}}, 1/7\right)$ | $O(md \log m)$
Algorithm 1, k-Means | $1 + \frac{5\alpha - 2\alpha^2}{(1-2\alpha)(1-\alpha)}$ | $[0, 1/2)$ | $O(md \log m)$
Algorithm 1, k-Means | $1 + 7.7\alpha$ | $[0, 1/7)$ | $O(md \log m)$
Ergun et al. (2022), k-Medians | $1 + \tilde{O}((k\alpha)^{1/4})$ | small constant | $O(md \log^3 m + \mathrm{poly}(k, \log m))$
Algorithm 2, k-Medians | $1 + c\alpha$ with $c = \frac{7 + 10\alpha - 10\alpha^2}{(1-\alpha)(1-2\alpha)}$ | $[0, 1/2)$ | $\tilde{O}\left(\frac{1}{1-2\alpha}\, md \log^3 m \log^2 \frac{k}{\delta}\right)$

Table 1: Comparison of learning-augmented k-means and k-medians algorithms. We recall that $m$ is the data set size, $d$ is the ambient dimension, $\alpha$ is the label error rate, and $\delta$ is the failure probability (where applicable). The success probability of the k-medians algorithm of Ergun et al. (2022) is $1 - \mathrm{poly}(1/k)$. The $\tilde{O}$ notation hides some log factors to simplify the expressions.

Upper bound on $\alpha$. Note that if the label error rate $\alpha$ equals 1/2, then even for three clusters there is no longer a clear relationship between the predicted clusters and the related optimal clusters. For instance, given three optimal clusters $P^*_1, P^*_2, P^*_3$ with equally many points, if for all $i \in [3]$ the predicted cluster $P_i$ consists of half the points in $P^*_i$ and half the points in $P^*_{(i \bmod 3) + 1}$, then the label error rate $\alpha = 1/2$ is achieved, but there is no clear relationship between $P^*_i$ and $P_i$. In other words, it is not clear whether the predicted labels give us any useful information about an optimal clustering. In this sense, $\alpha = 1/2$ is a natural stopping point for this problem.
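To make the three-cluster example concrete, the following minimal Python sketch computes the smallest $\alpha$ for which a predicted labelling satisfies $|P_i \cap P^*_i| \ge (1-\alpha)\max(|P_i|, |P^*_i|)$ for every $i$. The function name `label_error_rate` and the 0-based label encoding are illustrative assumptions, not part of the paper.

```python
import numpy as np


def label_error_rate(pred, opt, k):
    """Smallest alpha with |P_i & P*_i| >= (1 - alpha) * max(|P_i|, |P*_i|) for all i.

    `pred` and `opt` are integer label arrays with values in {0, ..., k-1}
    (0-based labels are an assumption of this sketch).
    """
    pred = np.asarray(pred)
    opt = np.asarray(opt)
    alpha = 0.0
    for i in range(k):
        in_pred = pred == i
        in_opt = opt == i
        overlap = np.count_nonzero(in_pred & in_opt)
        larger = max(np.count_nonzero(in_pred), np.count_nonzero(in_opt))
        if larger > 0:
            alpha = max(alpha, 1.0 - overlap / larger)
    return alpha


# Three optimal clusters of 4 points each.
opt = np.repeat([0, 1, 2], 4)
# Predicted cluster i keeps half of P*_i and takes half of the next optimal cluster.
pred = np.array([0, 0, 2, 2, 1, 1, 0, 0, 2, 2, 1, 1])
print(label_error_rate(pred, opt, k=3))
```

In this toy instance every predicted cluster shares exactly half of its points with its optimal counterpart, so the computed rate sits at the boundary value discussed above.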

1.2. RELATED WORK

This work belongs to a growing literature on learning-augmented algorithms. Machine learning has been used to improve algorithms for a number of classical problems, including data structures (Kraska et al., 2018; Mitzenmacher, 2018; Lin et al., 2022), online algorithms (Purohit et al., 2018), graph algorithms (Khalil et al., 2017; Chen et al., 2022a;b), frequency estimation (Du et al., 2021), caching (Rohatgi, 2020; Wei, 2020), and support estimation (Eden et al., 2021). We refer the reader to Mitzenmacher & Vassilvitskii (2020) for an overview and applications of the framework. Another relevant line of work is clustering with side information. The works of Balcan & Blum (2008), Awasthi et al. (2014), and Vikram & Dasgupta (2016) studied an interactive clustering setting where an oracle interactively provides advice about whether or not to merge two clusters. Basu et al. (2004) proposed an active learning framework for clustering, where the algorithm has access to a predictor that determines if two points should or should not belong to the same cluster. Ashtiani et al. (2016) introduced a semi-supervised active clustering framework where the algorithm has access to a predictor that answers queries about whether two particular points belong to the same cluster in an optimal clustering. The goal is to produce a $(1 + \alpha)$-approximate clustering while minimizing the query complexity to the oracle. Approximation stability, proposed in Balcan et al. (2013), is another assumption used to circumvent the NP-hardness of approximation for k-means clustering. More formally, the concept of $(c, \alpha)$-stability requires that every $c$-approximate clustering is $\alpha$-close to the optimal solution in terms of the fraction of incorrectly clustered points. This is different from our setting, where at most an $\alpha$ fraction of the points are incorrectly clustered and can worsen the clustering cost arbitrarily. Gamlath et al. 
(2022) study the problem of k-means clustering in the presence of noisy labels, where the cluster label of each point is created by either an adversarial or a random perturbation of the optimal solution. Their Balanced Adversarial Noise Model assumes that the size of the symmetric difference between the predicted cluster $P_i$ and the optimal cluster $P^*_i$ is bounded by $\alpha|P^*_i|$. Their algorithm uses a subroutine with runtime exponential in $k$ and $d$ for a fixed $\alpha \le 1/4$. In this work, we make different assumptions on the predicted cluster $P_i$ and the optimal cluster $P^*_i$. Moreover, our focus is on practical, nearly linear-time algorithms for k-means and k-medians clustering that can scale to very large data sets.

2. k-MEANS

Algorithm 1 Deterministic Learning-augmented k-Means Clustering

Require: Data set $P$ of $m$ points, partition $P = P_1 \cup \dots \cup P_k$ from a predictor, accuracy parameter $\alpha$
  for $i \in [k]$ do
    for $j \in [d]$ do
      Let $\omega_{i,j}$ be the collection of all subsets of $(1-\alpha)m_i$ contiguous points in $P_{i,j}$.
      $I_{i,j} \leftarrow \arg\min_{Z \in \omega_{i,j}} \mathrm{cost}(Z, \overline{Z}) = \arg\min_{Z \in \omega_{i,j}} \sum_{z \in Z} z^2 - \frac{1}{|Z|}\left(\sum_{z' \in Z} z'\right)^2$
    end for
    Let $\hat{c}_i = (\overline{I_{i,j}})_{j \in [d]}$
  end for
  Return $\{\hat{c}_1, \dots, \hat{c}_k\}$

We briefly recall some notation for ease of reference.

Definition 1. We make the following definitions:
1. The given data set is denoted as $P$, and $m := |P|$. The output of the predictor is a partition $(P_1, \dots, P_k)$ of $P$. Further, $m_i := |P_i|$.
2. There exists an optimal partition $(P^*_1, \dots, P^*_k)$ of $P$ with centers $(c^*_1, \dots, c^*_k)$. We write $m^*_i := |P^*_i|$ and denote the set of true positives by $Q_i := P^*_i \cap P_i$. Recall that $|Q_i| \ge (1-\alpha)\max(|P_i|, |P^*_i|)$ for some $\alpha < 1/2$.
3. We denote the average of a set $X$ by $\overline{X}$. For the sets $X_i$ and $P_i$ we denote their projections onto the $j$-th dimension by $X_{i,j}$ and $P_{i,j}$, respectively.

Before we describe our algorithm, we recall why the naive solution of simply taking the average of each cluster provided by the predictor is insufficient. Consider $P_i$, the set of points labeled $i$ by the predictor. Recall that the optimal 1-means solution for this set is its mean, $\overline{P_i}$. Since the predictor is not perfect, there might exist a number of points in $P_i$ that are not actually in $P^*_i$. Thus, if the points in $P_i \setminus P^*_i$ are far away from $P^*_i$, they will increase the clustering cost arbitrarily if we simply use $\overline{P_i}$ as the center. The following well-known identity formalizes this observation.

Lemma 2 (Inaba et al. (1994)). Consider a set $X \subset \mathbb{R}^d$ of size $n$ and $c \in \mathbb{R}^d$. Then

$$\mathrm{cost}(X, c) = \min_{c' \in \mathbb{R}^d} \mathrm{cost}(X, c') + n \cdot \|c - \overline{X}\|^2 = \mathrm{cost}(X, \overline{X}) + n \cdot \|c - \overline{X}\|^2.$$

Ideally, we would like to be able to recover the set $Q_i = P_i \cap P^*_i$ and use the average of $Q_i$ as the center. We know that $|P^*_i \setminus Q_i| \le \alpha m^*_i$. 
By lemma 3, it is not hard to show that

$$\mathrm{cost}(P^*_i, \overline{Q_i}) \le \left(1 + \frac{\alpha}{1-\alpha}\right) \mathrm{cost}(P^*_i, \overline{P^*_i}) = \left(1 + \frac{\alpha}{1-\alpha}\right) \mathrm{cost}(P^*_i, c^*_i),$$

which also implies a $\left(1 + \frac{\alpha}{1-\alpha}\right)$-approximation for the problem.

Lemma 3. For any partition $J_1 \cup J_2$ of a set $J \subset \mathbb{R}$ of size $n$, if $|J_1| \ge (1-\lambda)n$, then $|\overline{J} - \overline{J_1}|^2 \le \frac{\lambda}{(1-\lambda)n} \mathrm{cost}(J, \overline{J})$.

Since we do not have access to $Q_i$, the main technical challenge is to filter out the outlier points in $P_i$ and construct a center close to $\overline{Q_i}$. Minimizing the distance of the center to $\overline{Q_i}$ implies reducing the distance to $c^*_i$ as well as the clustering cost. Our algorithm for k-means, algorithm 1, iterates over all clusters given by the predictor and, in each dimension, finds a set of $(1-\alpha)m_i$ contiguous points with the smallest clustering cost. At a high level, our analysis shows that the average of the chosen points, $\overline{I_{i,j}}$, is not too far away from that of the true positives, $\overline{Q_{i,j}}$. This also implies that the additive clustering cost incurred by $\overline{I_{i,j}}$ is not too large. Since we can analyze the clustering cost by bounding the cost in every cluster $i$ and dimension $j$, for simplicity we will not refer to a specific $i$ and $j$ when discussing the intuition of the algorithm. The proofs of the following lemmas and theorem are included in the appendix.

Note that there can be multiple optimal solutions in the optimization step. The algorithm can either be randomized by choosing an arbitrary optimal set, or be deterministic by always choosing the first optimal solution. Lemma 4 shows that the optimization step guarantees that $I_{i,j}$ has the smallest clustering cost among all subsets of size $(1-\alpha)m_i$ of $P_{i,j}$, not only the contiguous ones.

Lemma 4. For all $i \in [k]$, $j \in [d]$, let $\omega'_{i,j}$ be the collection of all subsets of $(1-\alpha)m_i$ points in $P_{i,j}$. Then $\mathrm{cost}(I_{i,j}, \overline{I_{i,j}}) = \min_{Z' \in \omega'_{i,j}} \mathrm{cost}(Z', \overline{Z'})$.

Since we know that $|Q_{i,j}| \ge (1-\alpha)m_i$, it can be shown from lemma 4 that the cost of the set $I_{i,j}$ is no larger than that of $Q_{i,j}$. 
More precisely,

$$\mathrm{cost}(I_{i,j}, \overline{I_{i,j}}) \le \frac{(1-\alpha)m_i}{|Q_i|} \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}}). \quad (1)$$

With this fact, we are ready to bound the clustering cost by bounding $|\overline{I_{i,j}} - \overline{Q_{i,j}}|^2$:

$$|\overline{I_{i,j}} - \overline{Q_{i,j}}|^2 \le 2|\overline{I_{i,j}} - \overline{I_{i,j} \cap Q_{i,j}}|^2 + 2|\overline{I_{i,j} \cap Q_{i,j}} - \overline{Q_{i,j}}|^2.$$

Using lemma 3, we can bound $|\overline{I_{i,j}} - \overline{I_{i,j} \cap Q_{i,j}}|^2$ and $|\overline{I_{i,j} \cap Q_{i,j}} - \overline{Q_{i,j}}|^2$ respectively by $\mathrm{cost}(I_{i,j}, \overline{I_{i,j}})$ and $\mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})$. Combining this fact with eq. (1), we can bound $|\overline{I_{i,j}} - \overline{Q_{i,j}}|^2$ by $\mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})$.

Lemma 5. The following bound holds:

$$|\overline{I_{i,j}} - \overline{Q_{i,j}}|^2 \le \frac{4\alpha}{1-2\alpha} \cdot \frac{\mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})}{|Q_i|}.$$

Notice that lemma 5 also applies to any set in $\omega_{i,j}$ with cost smaller than the expected cost of a subset of size $(1-\alpha)m_i$ drawn uniformly at random from $Q_{i,j}$. Instead of repeatedly sampling different subsets of $Q_{i,j}$ and returning the one with the lowest clustering cost, the optimization step not only simplifies the analysis of the algorithm, but also guarantees that we find such a subset efficiently. This is the main innovation of the algorithm.

In the notation of lemma 2, we can consider $c = \overline{I_{i,j}}$, $X = P^*_{i,j}$, $n = m^*_i$. Thus, we want to bound $|\overline{P^*_{i,j}} - \overline{I_{i,j}}|^2$ by $\mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}})/m^*_i$ to achieve a $(1 + O(\alpha))$-approximation. Recall that we bound $|\overline{I_{i,j}} - \overline{Q_{i,j}}|^2$ by $\mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})/|Q_i|$ in lemma 5. In lemma 6 we relate $\mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})$ to $\mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}})$ as follows:

$$\mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) \ge \frac{1-\alpha}{\alpha} m^*_i |\overline{P^*_{i,j}} - \overline{Q_{i,j}}|^2 + \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}}).$$

We can then apply lemma 5 to bound $|\overline{P^*_{i,j}} - \overline{I_{i,j}}|^2$ by $\mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}})/m^*_i$.

Lemma 6. The following bound holds:

$$|\overline{P^*_{i,j}} - \overline{I_{i,j}}|^2 \le \left(\frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)}\right) \frac{\mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}})}{m^*_i}.$$

Applying lemma 6 and lemma 2 to all $i \in [k]$, $j \in [d]$, we are able to bound the total clustering cost.

Theorem 7. Algorithm 1 is a deterministic algorithm for k-means clustering such that given a data set $P \in \mathbb{R}^{m \times d}$ and a partition $(P_1, \dots, P_k)$ with error rate $\alpha < 1/2$, it outputs a $\left(1 + \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)}\right)$-approximation in time $O(dm \log m)$.

Corollary 8. For $\alpha \le 1/7$, algorithm 1 achieves a clustering cost of at most $(1 + 7.7\alpha)\mathrm{OPT}$.
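The per-coordinate optimization step of algorithm 1 can be sketched as follows: sort each coordinate of a predicted cluster, evaluate the 1-means cost of every contiguous window with prefix sums, and keep the mean of the cheapest window. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the rounding of $(1-\alpha)m_i$ up to an integer, the handling of empty clusters, and the centering step for numerical stability are all choices made for this sketch. The helper `theorem7_factor` simply evaluates the approximation factor stated in Theorem 7.

```python
import math
import numpy as np


def best_window_mean(values, alpha):
    """Mean of the contiguous window (in sorted order) of about (1 - alpha) * n
    values with the smallest cost sum(z^2) - (sum z)^2 / |Z|."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    # Window size; rounding up is an assumption of this sketch.
    w = max(1, min(n, math.ceil((1.0 - alpha) * n - 1e-9)))
    # Center the data to limit cancellation in sum-of-squares arithmetic.
    shift = x[n // 2]
    x = x - shift
    # Prefix sums give each window's sum and sum of squares in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))
    win_sum = s1[w:] - s1[:-w]
    win_sq = s2[w:] - s2[:-w]
    cost = win_sq - win_sum ** 2 / w
    best = int(np.argmin(cost))  # first minimizer, so the output is deterministic
    return float(win_sum[best] / w + shift)


def learning_augmented_kmeans(points, labels, k, alpha):
    """One center per predicted cluster, built coordinate by coordinate."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    centers = np.zeros((k, points.shape[1]))
    for i in range(k):
        cluster = points[labels == i]
        if cluster.shape[0] == 0:
            continue  # empty predicted cluster: left at the origin (assumption)
        for j in range(points.shape[1]):
            centers[i, j] = best_window_mean(cluster[:, j], alpha)
    return centers


def theorem7_factor(alpha):
    """Approximation factor stated in Theorem 7."""
    return 1 + alpha / (1 - alpha) + 4 * alpha / ((1 - 2 * alpha) * (1 - alpha))
```

Sorting dominates the work, which is consistent with the $O(dm \log m)$ runtime stated in Theorem 7. Evaluating `theorem7_factor(1/7)` gives the same value as $1 + 7.7 \cdot \frac{1}{7}$, which is where the constant in Corollary 8 comes from.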

3. k-MEDIANS

In this section we describe our algorithm for learning-augmented k-medians clustering and give theoretical bounds on its clustering cost and runtime. Our algorithm works for ambient spaces equipped with any metric $\mathrm{dist}(\cdot, \cdot)$ for which it is possible to efficiently compute the geometric median, which is the minimizer of the 1-medians clustering cost. For instance, it is known from prior work (Cohen et al., 2016) that the geometric median with respect to the $\ell_2$-metric can be efficiently calculated, and appealing to this result as a subroutine allows us to derive a guarantee for learning-augmented k-medians with respect to the $\ell_2$ norm.

Theorem 9 (Cohen et al. (2016)). There is an algorithm that computes a $(1 + \epsilon)$-approximation to the geometric median of a set of size $n$ in $d$-dimensional Euclidean space with respect to the $\ell_2$ distance metric with constant probability in $O(nd \log^3(n/\epsilon))$ time.

Looking ahead at the pseudocode of algorithm 2, we see that to eventually derive a bound on the time complexity, we need to account for adjusting the success probability in the many calls to theorem 9.

Corollary 10. It follows from theorem 9 that with probability $1 - \frac{\delta}{2k}$, for all $j \in [R]$, $\mathrm{cost}(P_i \setminus P'_i, \hat{c}^j_i)$ is a $(1 + \gamma)$-approximation to the optimal 1-median cost for $P_i \setminus P'_i$, while taking time $O(m_i d \log^3(m_i/\gamma) \log(Rk/\delta))$.

We refer the reader to definition 1 for all notation that is undefined in this section; the only additional notation we introduce is the following definition.

Definition 11. We make the following definitions:
1. We denote the optimal clustering cost of $P$ by OPT, and the optimal 1-median clustering cost of $P^*_i$ by $\mathrm{OPT}_i$, with which notation we have that $\sum_{i \in [k]} \mathrm{OPT}_i = \mathrm{OPT}$.
2. We denote the distance $\mathrm{dist}(x, y)$ between two points by $\|x - y\|$.

We now describe at a high level a run of our algorithm. 
Algorithm 2 operates sequentially on each cluster estimate; for the cluster estimate $P_i$, it samples a point $x \in P_i$ uniformly at random, and removes from $P_i$ the $\lceil \alpha m_i \rceil$-many points that lie furthest from $x$. It then computes the median of the clipped set, which is where we appeal to an algorithm for the geometric median, for instance theorem 9 when the ambient metric for the input data set is the $\ell_2$ metric. It turns out that this subroutine already gives us a good median for the cluster $P^*_i$ with constant probability (lemma 14); to boost the success probability we repeat this subroutine $R$ times (the exact expression is given in the pseudocode and justified in lemma 15), and pick the median with the lowest cost, denoted $\hat{c}_i$. Collecting the $\hat{c}_i$ across $i \in [k]$, we get our final solution $\{\hat{c}_1, \dots, \hat{c}_k\}$.
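The run described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the geometric median is computed with a plain Weiszfeld iteration as a stand-in for the nearly linear-time solver of Cohen et al. (2016), the number of rounds follows the expression $R = \frac{2}{1-2\alpha}\log\frac{2k}{\delta}$ from the pseudocode (rounded up), and the function names, default parameters, and the requirement that every predicted cluster be non-empty are assumptions of this sketch.

```python
import math
import numpy as np


def weiszfeld(points, iters=100, eps=1e-9):
    """Approximate geometric median via Weiszfeld's iteration."""
    y = points.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - y, axis=1), eps)
        w = 1.0 / dist
        y_next = (points * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_next - y) <= eps:
            return y_next
        y = y_next
    return y


def robust_median(cluster, alpha, delta, k, rng):
    """Sample x, clip the ceil(alpha * m_i) points farthest from x, take the
    median of the rest; repeat and keep the median with the lowest clipped cost."""
    m_i = cluster.shape[0]
    n_clip = min(math.ceil(alpha * m_i), m_i - 1)
    rounds = max(1, math.ceil(2.0 / (1.0 - 2.0 * alpha) * math.log(2.0 * k / delta)))
    best_center, best_cost = None, math.inf
    for _ in range(rounds):
        x = cluster[rng.integers(m_i)]
        order = np.argsort(np.linalg.norm(cluster - x, axis=1))
        kept = cluster[order[: m_i - n_clip]]
        center = weiszfeld(kept)
        cost = np.linalg.norm(kept - center, axis=1).sum()
        if cost < best_cost:
            best_center, best_cost = center, cost
    return best_center


def learning_augmented_kmedians(points, labels, k, alpha, delta=0.1, seed=0):
    """One center per predicted cluster; assumes every cluster is non-empty."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return np.stack(
        [robust_median(points[labels == i], alpha, delta, k, rng) for i in range(k)]
    )
```

The candidate medians are compared by their cost on the clipped set $P_i \setminus P'_i$, since that is the quantity the analysis controls; other selection rules would need their own justification.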

Algorithm 2 Learning-augmented k-Medians Clustering

Require: Data set $P$ of $m$ points, partition $P = P_1 \cup \dots \cup P_k$ from a predictor, accuracy parameter $\alpha < 1/2$
  for $i \in [k]$ do
    Let $R = \frac{2}{1-2\alpha} \log \frac{2k}{\delta}$
    for $j \in [R]$ do
      Sample $x \sim P_i$ u.a.r.
      Let $P'_i$ be the $\lceil \alpha m_i \rceil$ points farthest from $x$
      $\hat{c}^j_i \leftarrow$ median of $P_i \setminus P'_i$
    end for
    Let $\hat{c}_i$ be the $\hat{c}^j_i$ with minimum cost
  end for
  Return $\{\hat{c}_1, \dots, \hat{c}_k\}$

Although our algorithm itself is relatively straightforward, the analysis turns out to be more involved. We trace the proof at a high level in this section and mention the main steps, and defer all proofs to the appendix. It would suffice to allocate a center that works well for the true cluster $P^*_i$, but we only have access to the set $P_i$ with the promise that the two have a significant overlap (as characterized by $\alpha$). Fixing an arbitrary true median $c^*_i$, one key insight is that the "false" points, i.e., points in $P_i \setminus P^*_i$, will only significantly distort the median if they happen to lie far from $c^*_i$. If there were a way to identify and remove the false points which lie far from $c^*_i$, then simply computing the geometric median of the clipped data set should work well. By a direct application of Markov's inequality it is possible to show that a point $x$ picked uniformly at random will in fact lie at a distance on the order of the average clustering cost $\mathrm{OPT}_i/m_i$ with constant probability, as formalized in lemma 12.

Lemma 12. With probability $\frac{1-2\alpha}{2}$, $\|x - c^*_i\| \le 2\mathrm{OPT}_i/m_i$.

As we will condition on this good event holding, it will be convenient to introduce the notation $E$.

Definition 13. We let $E$ denote the event that $\|x - c^*_i\| \le 2\mathrm{OPT}_i/m_i$.

Having identified a good point $x$ to serve as a proxy for where the true median $c^*_i$ lies, we need to figure out a good way to clip the data set so as to avoid false points which lie very far from $c^*_i$. 
We observe that since there are guaranteed to be at most $\lceil \alpha m_i \rceil$-many false points, if we were to remove the $\lceil \alpha m_i \rceil$-many points that lie farthest from $x$ (denoted $P'_i$), then we either remove false points that lie very far from $c^*_i$, or true points ($P^*_i \cap P'_i$) which are at the same distance from $c^*_i$ as the remaining false points (the points in $P_i \setminus (P'_i \cup P^*_i)$). In particular, this implies that the impact of the remaining false points is roughly dominated by the clustering cost of an equal number of true points, and we are able to exploit this to show that the clustering cost of $P_i \setminus P'_i$ with respect to its own median estimate $\hat{c}_i$ is already close to the optimal cost of the true cluster $P^*_i$.

Lemma 14. Conditioned on $E$, $\mathrm{cost}(P_i \setminus P'_i, \hat{c}^j_i) \le (1 + 5\alpha)\mathrm{OPT}_i$.

Since the event $E$ that the randomly sampled point $x$ is close to a true median $c^*_i$ holds only with constant probability, we boost the success probability by running this subroutine $R$ times and letting $\hat{c}_i$ be the median estimate with respect to which the respective clipped data set had the lowest clustering cost.

Lemma 15. For $R = O\left(\frac{1}{1-2\alpha} \log \frac{2k}{\delta}\right)$ many repetitions, with probability at least $1 - \frac{\delta}{2k}$, we have that $\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i) \le (1 + 5\alpha)\mathrm{OPT}_i$.

We see from lemma 21 that the set $P_i \setminus P'_i$ differs from the true positives $P_i \cap P^*_i$ by sets of size at most $\lceil \alpha m_i \rceil$. It follows that as long as the distance between $\hat{c}_i$ and $c^*_i$ is on the order of $\mathrm{OPT}_i/m_i$, these points will not influence the clustering cost by more than an $O(\alpha \mathrm{OPT}_i)$ additive term, which we will be able to absorb into the $(1 + O(\alpha))$ multiplicative approximation factor. We formalize this in lemma 16.

Lemma 16. If $\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i) \le (1 + 5\alpha)\mathrm{OPT}_i$, then $\|\hat{c}_i - c^*_i\| \le \frac{2 + 5\alpha}{1 - 2\alpha} \cdot \frac{\mathrm{OPT}_i}{m_i}$.

We finally put everything together to show that the clustering cost of the set of true points $P_i \cap P^*_i$ with respect to the estimate $\hat{c}_i$ is at most an additive $O(\alpha \mathrm{OPT}_i)$ more than the cost with respect to the true median $c^*_i$. 
The key technical point in the analysis is that we can only appeal to the fact that the cost of $P_i \setminus P'_i$ is close to $\mathrm{OPT}_i$, and we cannot directly reason about $\hat{c}_i$ apart from appealing to lemma 16.

Lemma 17. With probability $1 - \delta/k$, $\mathrm{cost}(P_i \cap P^*_i, \hat{c}_i) \le \mathrm{cost}(P_i \cap P^*_i, c^*_i) + \frac{(5\alpha + 10\alpha^2)\mathrm{OPT}_i}{1 - 2\alpha}$.

We can now derive our main cost bound stated in lemma 18. Doing so only requires that we account for the mislabelled points $P^*_i \setminus P_i$ which were not accounted for during our clustering. Again, from lemma 16 it suffices to appeal to the fact that the estimate $\hat{c}_i$ lies within an $O(\mathrm{OPT}_i/m_i)$ distance of the true median $c^*_i$.

Lemma 18. With probability $1 - \delta/k$, $\mathrm{cost}(P^*_i, \hat{c}_i) \le (1 + c\alpha)\mathrm{OPT}_i$ for $c = \frac{7 + 10\alpha - 10\alpha^2}{(1-\alpha)(1-2\alpha)}$.

We now formalize our main cost bound, success probability and runtime guarantees in theorem 19.

Theorem 19. There is an algorithm for k-medians clustering such that given a data set $P$ and a labelling $(P_1, \dots, P_k)$ with error rate $\alpha < 1/2$, it outputs, with probability at least $1 - \delta$, a set of centers $\hat{C} = (\hat{c}_1, \dots, \hat{c}_k)$ such that

$$\sum_{i \in [k]} \mathrm{cost}(P^*_i, \hat{c}_i) \le (1 + c\alpha)\mathrm{OPT} \quad \text{for } c = \frac{7 + 10\alpha - 10\alpha^2}{(1-\alpha)(1-2\alpha)},$$

and does so in time

$$O\left(\frac{1}{1-2\alpha}\, md \log^3\left(\frac{m}{\alpha}\right) \log\left(\frac{k \log(k/\delta)}{(1-2\alpha)\delta}\right) \log\left(\frac{k}{\delta}\right)\right).$$

Proof. We see from lemma 18 that by applying our subroutine for 1-median clustering on each labelled partition $P_i$, we get a center $\hat{c}_i$ with the promise that with probability $1 - \frac{\delta}{k}$, $\mathrm{cost}(P^*_i, \hat{c}_i) \le (1 + c\alpha)\mathrm{OPT}_i$. By the union bound, it follows that with probability $1 - \delta$,

$$\sum_{i \in [k]} \mathrm{cost}(P^*_i, \hat{c}_i) \le \sum_{i \in [k]} (1 + c\alpha)\mathrm{OPT}_i = (1 + c\alpha)\mathrm{OPT}.$$

Since $P = P^*_1 \cup \dots \cup P^*_k$, it follows that $\mathrm{cost}(P, \hat{C}) \le (1 + c\alpha)\mathrm{OPT}$.

The time taken to execute the 1-median clustering subroutine on partition $P_i$ is

$$R\left(m_i d + O(m_i \log m_i) + O(m_i d \log^3(m_i/\gamma) \log(Rk/\delta)) + m_i d\right).$$

This is because we have $R$ iterations, in each of which we first compute the distances of all $m_i$ points from the sampled point $x$ in time $m_i d$, followed by sorting the $m_i$ points by their distances in time $O(m_i \log m_i)$, followed by $O(\log(Rk/\delta))$ many iterations of the median computation for the clipped sets (wherein we appeal to corollary 10), followed by a calculation of the 1-median clustering cost achieved in time $m_i d$. We recall that we set $R = O\left(\frac{1}{1-2\alpha} \log \frac{k}{\delta}\right)$. Further, we note that the expression for the upper bound on the time complexity is convex in $m_i$ and vanishes at $m_i = 0$, hence it is superadditive: if we denote the value of this expression on a set of size $m_i$ by $T(m_i)$, it follows that $\sum_{i \in [k]} T(m_i) \le T\left(\sum_{i \in [k]} m_i\right) = T(m)$. Putting everything together, we get that the net time complexity is

$$O\left(\frac{1}{1-2\alpha}\, md \log^3\left(\frac{m}{\alpha}\right) \log\left(\frac{k \log(k/\delta)}{(1-2\alpha)\delta}\right) \log\left(\frac{k}{\delta}\right)\right).$$

4. EXPERIMENTS

In this section, we evaluate algorithm 1 and algorithm 2 on real-world datasets. Our experiments were run on an i9-12900KF processor with 32GB RAM. For all experiments, we fix the number of points to be allocated to $k = 10$, and report the average and the standard deviation of the clustering cost over 20 independent runs.

Datasets. We test the algorithms on the testing set of the CIFAR-10 dataset (Krizhevsky et al., 2009).

Predictor description. For each dataset, we create a predictor by first finding good k-means and k-medians solutions. Specifically, for k-means we initialize by kmeans++ and then run Lloyd's algorithm until convergence. For k-medians, we use the "alternating" heuristic (Park & Jun, 2009) of the k-medoids problem to find the center of each cluster. In both settings, we use the label given to each point by the k-means and k-medians solutions to form the optimal partition $(P^*_1, \dots, P^*_{10})$ (recall we set $k = 10$). In order to test the algorithms' performance under different error rates of the predictor, for each cluster $i$, we change the labels of the $\alpha m_i$ points closest to the mean (or median) to that of a random center. For every dataset, we generate the set of corrupted labels $(P_1, \dots, P_{10})$ for $\alpha$ from 0.1 to 0.5. Furthermore, we use the same optimal partition $(P^*_1, \dots, P^*_{10})$ across all instances of the algorithms. By fixing the optimal partition, we can investigate the effects of increasing $\alpha$ on the clustering cost.

Guessing the error rate. Note that in most situations we will not have access to the error rate $\alpha$ and must try out different guesses of $\alpha$, then return the clustering with the best cost. For algorithm 1, algorithm 2, and the k-medians algorithm of Ergun et al. (2022), we iterate over 15 possible values of $\alpha$ uniformly spanning the interval $[0.1, 0.5]$. The k-means algorithm of Ergun et al. (2022) is defined only for $\alpha < 1/5$ (not to be confused with the assumption that $\alpha < 1/7$ for the bound on the clustering cost). Thus, the range is $[0.1, 1/5]$ for that algorithm.

Baselines. We report the clustering costs of the initial optimal k-means and k-medians solution $(P^*_1, \dots, P^*_{10})$ along with that of the naive approach of taking the average and geometric median of each group returned by the predictor, e.g., returning $(\overline{P_1}, \dots, \overline{P_{10}})$ for k-means. We also report a sampling baseline: we first randomly select a $q$-fraction of points from each cluster for $q$ varied from 1% to 50%. Then, we compute the means and the geometric medians of the sampled points to calculate the clustering cost. Finally, we return the clustering corresponding to the value of $q$ with the best cost.

Implementation. We use the implementation provided in Ergun et al. (2022) for their k-means algorithm. Although both our k-medians algorithm and the algorithm in Ergun et al. (2022) use the approach in Cohen et al. (2016) as the subroutine to compute the geometric median in nearly linear time, in the experiments we instead use Weiszfeld's algorithm, a well-known method to compute the geometric median, as implemented in Pillutla et al. (2022), for the k-medians algorithms. To generate the predictions, we use Pedregosa et al. (2011) and Scikit-Learn-Contrib (2021) for the implementations of the k-means and k-medoids algorithms. For algorithm 2, we can treat the number of rounds $R$ as a hyperparameter. We set $R = 1$; as shown below, this is already enough to achieve good performance compared to the other approaches.
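The procedure for guessing the error rate can be sketched as a simple grid search. This is a minimal illustration under assumptions: `cluster_fn` is a hypothetical callable mapping a guess of $\alpha$ to an array of centers (for example, a wrapper around algorithm 1), the grid includes both endpoints of $[0.1, 0.5]$, and the k-means cost is used to compare guesses; the text does not specify these details.

```python
import numpy as np


def kmeans_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def best_over_alpha_grid(points, cluster_fn, lo=0.1, hi=0.5, num=15):
    """Try `num` evenly spaced guesses of alpha and keep the cheapest clustering.

    `cluster_fn(alpha)` is a hypothetical callable returning a (k, d) array of
    centers; including both endpoints of [lo, hi] is an assumption.
    """
    best_cost, best_alpha, best_centers = np.inf, None, None
    for alpha in np.linspace(lo, hi, num):
        centers = np.asarray(cluster_fn(float(alpha)), dtype=float)
        cost = kmeans_cost(points, centers)
        if cost < best_cost:
            best_cost, best_alpha, best_centers = cost, float(alpha), centers
    return best_cost, best_alpha, best_centers
```

For the k-medians experiments the same loop applies with the sum of unsquared distances in place of `kmeans_cost`.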

4.1. RESULTS

In Figure 1, we omit the Sampling and the Prediction approaches for the PHY dataset as they have much larger clustering costs than our algorithm and the k-means algorithm in Ergun et al. (2022). For the CIFAR-10 dataset, we observe that the approach in Ergun et al. (2022) has slightly better clustering costs as $\alpha$ increases. For the MNIST dataset, our approach has slightly improved costs across all values of $\alpha$. For the PHY dataset, algorithm 1 is comparable to the algorithm of Ergun et al. (2022). In summary, the mean clustering costs of the two learning-augmented algorithms are similar across the datasets. It is important to note that our algorithm achieves a clustering cost similar to that of Ergun et al. (2022) without any variance, as it is a deterministic technique. Figure 2 shows that our k-medians algorithm has the best clustering cost across all the datasets. We also observe that the sampling approach outperforms the approach of Ergun et al. (2022) for the CIFAR-10 and the MNIST datasets. This is expected since the latter algorithm samples a random subset of a fixed size in each cluster, while the baseline approach samples subsets of different sizes and uses the one with the best cost.

A APPENDIX

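A.1 MISSING PROOFS FOR k-MEANS

Lemma 3. For any partition $J_1 \cup J_2$ of a set $J \subset \mathbb{R}$ of size $n$, if $|J_1| \ge (1-\lambda)n$, then $|\overline{J} - \overline{J_1}|^2 \le \frac{\lambda}{(1-\lambda)n} \mathrm{cost}(J, \overline{J})$.

Proof. We know $|J_1| = (1-x)n$, $|J_2| = xn$ for some $x \le \lambda$. It follows that $\overline{J} = (1-x)\overline{J_1} + x\overline{J_2}$, so $|\overline{J} - \overline{J_1}| = x|\overline{J_2} - \overline{J_1}|$ and $|\overline{J} - \overline{J_2}| = (1-x)|\overline{J_2} - \overline{J_1}|$, which gives

$$|\overline{J} - \overline{J_2}| = \frac{1-x}{x} |\overline{J} - \overline{J_1}|. \quad (2)$$

We now observe that we can write $\mathrm{cost}(J, \overline{J}) = \mathrm{cost}(J_1, \overline{J}) + \mathrm{cost}(J_2, \overline{J})$, and recall the identity $\mathrm{cost}(J_b, \overline{J}) = \mathrm{cost}(J_b, \overline{J_b}) + |J_b| \cdot |\overline{J} - \overline{J_b}|^2$ for $b \in \{1, 2\}$. It then follows that

$$\begin{aligned}
\mathrm{cost}(J, \overline{J}) &\ge |J_1| \cdot |\overline{J} - \overline{J_1}|^2 + |J_2| \cdot |\overline{J} - \overline{J_2}|^2 \\
&= (1-x)n|\overline{J} - \overline{J_1}|^2 + xn|\overline{J} - \overline{J_2}|^2 \\
&= (1-x)n|\overline{J} - \overline{J_1}|^2 + \frac{(1-x)^2 n}{x} |\overline{J} - \overline{J_1}|^2 \\
&= \frac{(1-x)n}{x} |\overline{J} - \overline{J_1}|^2 \ge \frac{(1-\lambda)n}{\lambda} |\overline{J} - \overline{J_1}|^2,
\end{aligned}$$

which implies $|\overline{J} - \overline{J_1}|^2 \le \frac{\lambda}{(1-\lambda)n} \mathrm{cost}(J, \overline{J})$.

Lemma 4. For all $i \in [k]$, $j \in [d]$, let $\omega'_{i,j}$ be the collection of all subsets of $(1-\alpha)m_i$ points in $P_{i,j}$. Then $\mathrm{cost}(I_{i,j}, \overline{I_{i,j}}) = \min_{Z' \in \omega'_{i,j}} \mathrm{cost}(Z', \overline{Z'})$.

Proof. Suppose $I'_{i,j} = \arg\min_{Z' \in \omega'_{i,j}} \mathrm{cost}(Z', \overline{Z'})$. If $I'_{i,j} \in \omega_{i,j}$ then we are done since we know

$$\mathrm{cost}(I_{i,j}, \overline{I_{i,j}}) = \min_{Z \in \omega_{i,j}} \mathrm{cost}(Z, \overline{Z}).$$

If $I'_{i,j} \notin \omega_{i,j}$, let $a$ and $b$ be the minimum and maximum points in $I'_{i,j}$. We know there exists a point $p \in P_{i,j} \cap (a, b)$ such that $p \notin I'_{i,j}$. If $|I'_{i,j}| = 2$, then we have a contradiction since

$$\mathrm{cost}(I'_{i,j}, \overline{I'_{i,j}}) = (b-a)^2/2 > (b-p)^2/2 = \mathrm{cost}(\{b, p\}, \overline{\{b, p\}}).$$

If $|I'_{i,j}| \ge 3$, we know either $a$ or $b$ is the furthest point from $\overline{I'_{i,j} \setminus \{a, b\}}$ in the interval $[a, b]$. Suppose $a$ is such a point, and consider $K_{i,j} = (I'_{i,j} \setminus \{a\}) \cup \{p\}$. We have the following:

$$\begin{aligned}
\mathrm{cost}(K_{i,j} \setminus \{p\}, p) &= \mathrm{cost}(K_{i,j} \setminus \{p, b\}, p) + |p - b|^2 \\
&= \mathrm{cost}(I'_{i,j} \setminus \{a, b\}, p) + |p - b|^2 \\
&= \mathrm{cost}(I'_{i,j} \setminus \{a, b\}, \overline{I'_{i,j} \setminus \{a, b\}}) + |I'_{i,j} \setminus \{a, b\}| \cdot |p - \overline{I'_{i,j} \setminus \{a, b\}}|^2 + |p - b|^2 \\
&< \mathrm{cost}(I'_{i,j} \setminus \{a, b\}, \overline{I'_{i,j} \setminus \{a, b\}}) + |I'_{i,j} \setminus \{a, b\}| \cdot |a - \overline{I'_{i,j} \setminus \{a, b\}}|^2 + |a - b|^2 \\
&= \mathrm{cost}(I'_{i,j} \setminus \{a, b\}, a) + |a - b|^2 \\
&= \mathrm{cost}(I'_{i,j} \setminus \{a\}, a).
\end{aligned}$$
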
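For the inequality, we used the fact that $a$ is the furthest point from $\overline{I'_{i,j} \setminus \{a, b\}}$ in the interval $[a, b]$, and $p \in (a, b)$. We have
$$\begin{aligned}
\mathrm{cost}(K_{i,j}, \overline{K_{i,j}}) &= \frac{1}{(1-\alpha)m_i} \sum_{y_1, y_2 \in K_{i,j}} |y_1 - y_2|^2 \\
&= \frac{1}{(1-\alpha)m_i} \Bigg( \sum_{y_1, y_2 \in K_{i,j} \setminus \{p\}} |y_1 - y_2|^2 + \mathrm{cost}(K_{i,j} \setminus \{p\}, p) \Bigg) \\
&= \frac{1}{(1-\alpha)m_i} \Bigg( \sum_{y_1, y_2 \in I'_{i,j} \setminus \{a\}} |y_1 - y_2|^2 + \mathrm{cost}(I'_{i,j} \setminus \{a\}, p) \Bigg) \\
&< \frac{1}{(1-\alpha)m_i} \Bigg( \sum_{y_1, y_2 \in I'_{i,j} \setminus \{a\}} |y_1 - y_2|^2 + \mathrm{cost}(I'_{i,j} \setminus \{a\}, a) \Bigg) \\
&= \frac{1}{(1-\alpha)m_i} \sum_{y_1, y_2 \in I'_{i,j}} |y_1 - y_2|^2 = \mathrm{cost}(I'_{i,j}, \overline{I'_{i,j}}).
\end{aligned}$$
Hence $\mathrm{cost}(K_{i,j}, \overline{K_{i,j}}) < \mathrm{cost}(I'_{i,j}, \overline{I'_{i,j}})$ and we have a contradiction.

Lemma 5. The following bound holds:
$$|\overline{I_{i,j}} - \overline{Q_{i,j}}|^2 \le \frac{4\alpha}{1-2\alpha} \cdot \frac{\mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})}{|Q_i|}.$$

Proof. Consider the set $S_{i,j} = \{(\overline{Q_{i,j}} - q)^2 : q \in Q_{i,j}\}$. Let $V_{i,j}$ be a subset of size $(1-\alpha)m_i$ drawn uniformly at random from $Q_{i,j}$. Since the sample mean is an unbiased estimator for the population mean, we know
$$\frac{1}{(1-\alpha)m_i}\, \mathbb{E}\Bigg[ \sum_{q \in V_{i,j}} (\overline{Q_{i,j}} - q)^2 \Bigg] = \overline{S_{i,j}} = \frac{\mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})}{|Q_i|}.$$
We also know that
$$\mathbb{E}\Bigg[ \sum_{q \in V_{i,j}} (\overline{Q_{i,j}} - q)^2 \Bigg] = \mathbb{E}\big[\mathrm{cost}(V_{i,j}, \overline{Q_{i,j}})\big] \ge \mathbb{E}\big[\mathrm{cost}(V_{i,j}, \overline{V_{i,j}})\big] \ge \mathrm{cost}(I_{i,j}, \overline{I_{i,j}}),$$
where we used the fact that $I_{i,j}$ is a subset of size $(1-\alpha)|P_i|$ with minimum 1-means clustering cost (lemma 4). Thus, we have
$$\mathrm{cost}(I_{i,j}, \overline{I_{i,j}}) \le \frac{(1-\alpha)m_i}{|Q_i|}\, \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}}).$$
Now, in the notation of lemma 3, we set $J = I_{i,j}$ and $J_1 = I_{i,j} \cap P^*_{i,j}$. Since $|I_{i,j}| = (1-\alpha)m_i$ and
$$|I_{i,j} \cap P^*_{i,j}| = |I_{i,j} \cap Q_{i,j}| \ge \Big(1 - \frac{|P_{i,j} \setminus Q_{i,j}|}{(1-\alpha)m_i}\Big)(1-\alpha)m_i,$$
we can set $\lambda = \frac{|P_{i,j} \setminus Q_{i,j}|}{(1-\alpha)m_i}$ and get that
$$\begin{aligned}
|\overline{I_{i,j}} - \overline{I_{i,j} \cap Q_{i,j}}|^2 &\le \frac{|P_{i,j} \setminus Q_{i,j}|\, \mathrm{cost}(I_{i,j}, \overline{I_{i,j}})}{\big((1-\alpha)m_i - |P_{i,j} \setminus Q_{i,j}|\big)(1-\alpha)m_i} \\
&\le \frac{|P_{i,j} \setminus Q_{i,j}|\, \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})}{\big((1-\alpha)m_i - |P_{i,j} \setminus Q_{i,j}|\big)|Q_i|} \\
&\le \frac{\alpha\, \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})}{(1-2\alpha)|Q_i|},
\end{aligned}$$
where we use the fact that $|P_{i,j} \setminus Q_{i,j}| \le \alpha m_i$.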
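Also, by lemma 3,
$$|\overline{Q_{i,j}} - \overline{I_{i,j} \cap Q_{i,j}}|^2 \le \frac{\alpha m_i\, \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})}{(|Q_{i,j}| - \alpha m_i)|Q_{i,j}|} \le \frac{\alpha\, \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}})}{(1-2\alpha)|Q_i|}.$$
We conclude the proof by noting that
$$|\overline{I_{i,j}} - \overline{Q_{i,j}}|^2 \le 2|\overline{I_{i,j}} - \overline{I_{i,j} \cap Q_{i,j}}|^2 + 2|\overline{I_{i,j} \cap Q_{i,j}} - \overline{Q_{i,j}}|^2.$$

Lemma 6. The following bound holds:
$$|\overline{P^*_{i,j}} - \overline{I_{i,j}}|^2 \le \mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) \Big( \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)} \Big) \Big/ m_i^*.$$

Proof. By eq. (2),
$$|\overline{P^*_{i,j}} - \overline{P^*_{i,j} \setminus Q_{i,j}}|^2 = \frac{(1-z)^2}{z^2}\, |\overline{P^*_{i,j}} - \overline{Q_{i,j}}|^2, \quad \text{where } z = \frac{|P^*_{i,j} \setminus Q_{i,j}|}{|P^*_{i,j}|} \le \alpha.$$
We have
$$\begin{aligned}
\mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) &= \mathrm{cost}(P^*_{i,j} \setminus Q_{i,j}, \overline{P^*_{i,j}}) + \mathrm{cost}(Q_{i,j}, \overline{P^*_{i,j}}) \\
&= \mathrm{cost}(P^*_{i,j} \setminus Q_{i,j}, \overline{P^*_{i,j} \setminus Q_{i,j}}) + z m_i^*\, |\overline{P^*_{i,j}} - \overline{P^*_{i,j} \setminus Q_{i,j}}|^2 \\
&\quad + \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}}) + (1-z) m_i^*\, |\overline{P^*_{i,j}} - \overline{Q_{i,j}}|^2 \\
&= \frac{1-z}{z}\, m_i^*\, |\overline{P^*_{i,j}} - \overline{Q_{i,j}}|^2 + \mathrm{cost}(P^*_{i,j} \setminus Q_{i,j}, \overline{P^*_{i,j} \setminus Q_{i,j}}) + \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}}) \\
&\ge \frac{1-\alpha}{\alpha}\, m_i^*\, |\overline{P^*_{i,j}} - \overline{Q_{i,j}}|^2 + \mathrm{cost}(Q_{i,j}, \overline{Q_{i,j}}).
\end{aligned}$$
Applying lemma 5, we have
$$\mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) \ge \frac{1-\alpha}{\alpha}\, m_i^*\, |\overline{P^*_{i,j}} - \overline{Q_{i,j}}|^2 + \frac{1-2\alpha}{4\alpha} \cdot (1-\alpha) m_i^*\, |\overline{I_{i,j}} - \overline{Q_{i,j}}|^2.$$
By Cauchy-Schwarz,
$$\begin{aligned}
\big( |\overline{P^*_{i,j}} - \overline{Q_{i,j}}| + |\overline{I_{i,j}} - \overline{Q_{i,j}}| \big)^2 &\le \Big( \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)} \Big) \Big( \frac{1-\alpha}{\alpha}\, m_i^*\, |\overline{P^*_{i,j}} - \overline{Q_{i,j}}|^2 + \frac{1-2\alpha}{4\alpha} \cdot (1-\alpha) m_i^*\, |\overline{I_{i,j}} - \overline{Q_{i,j}}|^2 \Big) \Big/ m_i^* \\
&\le \mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) \Big( \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)} \Big) \Big/ m_i^*.
\end{aligned}$$
We conclude the proof by the fact that $|\overline{P^*_{i,j}} - \overline{I_{i,j}}|^2 \le \big( |\overline{P^*_{i,j}} - \overline{Q_{i,j}}| + |\overline{I_{i,j}} - \overline{Q_{i,j}}| \big)^2$.

Theorem 7. Algorithm 1 is a deterministic algorithm for k-means clustering such that given a data set $P \in \mathbb{R}^{m \times d}$ and a partition $(P_1, \ldots, P_k)$ with error rate $\alpha < 1/2$, it outputs a $\big(1 + \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)}\big)$-approximation in time $O(dm \log m)$.

Proof. Recall that the k-means clustering cost can be written as the sum of the clustering costs in each dimension.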
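We have
$$\begin{aligned}
\sum_{i \in [k]} \mathrm{cost}(P_i^*, \{\hat{c}_j\}_{j=1}^k) &\le \sum_{i \in [k]} \mathrm{cost}(P_i^*, \hat{c}_i) = \sum_{i \in [k]} \sum_{j \in [d]} \mathrm{cost}(P^*_{i,j}, \hat{c}_{i,j}) \\
&= \sum_{i \in [k]} \sum_{j \in [d]} \Big( \mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) + m_i^*\, |\hat{c}_{i,j} - \overline{P^*_{i,j}}|^2 \Big) \\
&= \sum_{i \in [k]} \sum_{j \in [d]} \Big( \mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) + m_i^*\, |\overline{I_{i,j}} - \overline{P^*_{i,j}}|^2 \Big) \\
&\le \sum_{i \in [k]} \sum_{j \in [d]} \Big( 1 + \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)} \Big)\, \mathrm{cost}(P^*_{i,j}, \overline{P^*_{i,j}}) \\
&= \Big( 1 + \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)} \Big) \sum_{i \in [k]} \mathrm{cost}(P_i^*, c_i^*).
\end{aligned}$$
The inequality is due to lemma 6. We now analyze the runtime of algorithm 1. Notice that for every $i \in [k]$, $j \in [d]$, computing $\overline{I_{i,j}}$ involves sorting the points $P_{i,j}$, iterating from the smallest to the largest point, and taking the average of the interval in $\omega_{i,j}$ with the smallest cost. This takes $O(m_i \log m_i)$ time. Note that $\sum_{i \in [k]} m_i = m$. Thus, the total time over all $i \in [k]$ and $j \in [d]$ is $O(dm \log m)$.

Corollary 8. For $\alpha \le 1/7$, algorithm 1 achieves a clustering cost of $(1 + 7.7\alpha)\mathrm{OPT}$.

Proof. We recall that the generic guarantee for $\alpha < 1/2$ is
$$\mathrm{cost}(P, \{\hat{c}_1, \ldots, \hat{c}_k\}) \le \Big( 1 + \frac{\alpha}{1-\alpha} + \frac{4\alpha}{(1-2\alpha)(1-\alpha)} \Big) \mathrm{OPT}.$$
We see that for $\alpha \le 1/7$, $\frac{\alpha}{1-\alpha} \le \frac{7\alpha}{6}$ and $\frac{4\alpha}{(1-2\alpha)(1-\alpha)} \le \frac{49 \cdot 4\alpha}{30}$, so in sum the net approximation factor is at most $1 + 7.7\alpha$.

A.2 MISSING PROOFS FOR k-MEDIANS

Lemma 12. With probability at least $\frac{1-2\alpha}{2}$, $\|x - c_i^*\| \le 2\,\mathrm{OPT}_i / m_i$.

Proof. We observe that $\mathrm{cost}(P_i \cap P_i^*, c_i^*) \le \mathrm{OPT}_i$. It follows that
$$\mathbb{E}_{x \sim P_i \cap P_i^*}\big[ \|x - c_i^*\| \big] \le \frac{\mathrm{OPT}_i}{|P_i \cap P_i^*|} \le \frac{\mathrm{OPT}_i}{(1-\alpha)m_i}.$$
By Markov's inequality,
$$\begin{aligned}
&\Pr\Big[ \|x - c_i^*\| > (1+\epsilon) \cdot \frac{\mathrm{OPT}_i}{(1-\alpha)m_i} \,\Big|\, x \in P_i \cap P_i^* \Big] \le \frac{1}{1+\epsilon} \\
\Rightarrow\; &\frac{\Pr\Big[ \|x - c_i^*\| \le \frac{(1+\epsilon)\mathrm{OPT}_i}{(1-\alpha)m_i} \wedge x \in P_i \cap P_i^* \Big]}{\Pr[x \in P_i \cap P_i^*]} \ge \frac{\epsilon}{1+\epsilon} \\
\Rightarrow\; &\Pr\Big[ \|x - c_i^*\| \le \frac{(1+\epsilon)\mathrm{OPT}_i}{(1-\alpha)m_i} \Big] \ge \frac{\epsilon}{1+\epsilon}\, \Pr[x \in P_i \cap P_i^*] \ge \frac{\epsilon(1-\alpha)}{1+\epsilon}.
\end{aligned}$$
To get the stated bound we set $\epsilon = 1 - 2\alpha$.

Lemma 14. Conditioned on $E$, $\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i^j) \le (1 + 5\alpha)\mathrm{OPT}_i$.

We first define some notation for the sets of false positive and false negative points that occur in our proof of lemma 14, and prove a technical lemma relating the sets $P_i \cap P_i^*$ and $P_i \setminus P'_i$.

Definition 20. We make the following definitions: 1.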
Let $E_1$ denote the event that $\|x - c_i^*\| \le 2\,\mathrm{OPT}_i / m_i$. 2. Let $A$ denote the set of false negatives, i.e. $P_i^* \cap P'_i$. 3. Let $B$ denote the set of false positives, i.e. $P_i \setminus (P'_i \cup P_i^*)$.

To bound the clustering cost of $P_i \setminus P'_i$ in terms of the cost of $P_i \cap P_i^*$, we first relate these two sets in terms of the false positives $B$ and the false negatives $A$.

Lemma 21. We can write $P_i \cap P_i^* = ((P_i \setminus P'_i) \setminus B) \cup A$ (see definition 20 for the definitions of $A$ and $B$).

Proof. To see this we observe that
$$P_i \setminus P'_i = ((P_i \setminus P'_i) \cap P_i^*) \cup ((P_i \setminus P'_i) \setminus P_i^*) = ((P_i \setminus P'_i) \cap P_i^*) \cup B \;\Rightarrow\; (P_i \setminus P'_i) \cap P_i^* = (P_i \setminus P'_i) \setminus B.$$
We also have that
$$P_i \cap P_i^* = ((P_i \cap P_i^*) \cap P'_i) \cup ((P_i \cap P_i^*) \setminus P'_i) \;\Rightarrow\; (P_i \cap P_i^*) \setminus P'_i = (P_i \cap P_i^*) \setminus ((P_i \cap P_i^*) \cap P'_i) = (P_i \cap P_i^*) \setminus A,$$
wherein we use that $A = (P_i \cap P_i^*) \cap P'_i$. Since $(P_i \cap P_i^*) \setminus P'_i = (P_i \setminus P'_i) \cap P_i^*$, we can identify the left-hand sides in the last two displays and write
$$(P_i \cap P_i^*) \setminus A = (P_i \setminus P'_i) \setminus B \;\Rightarrow\; P_i \cap P_i^* = ((P_i \setminus P'_i) \setminus B) \cup A.$$

We can now formalize our main argument showing that the clipped data set $P_i \setminus P'_i$ has a clustering cost close to that of the true cluster $P_i^*$.

Proof of lemma 14. By lemma 21, we first observe that
$$\mathrm{cost}(P_i \setminus P'_i, c_i^*) = \mathrm{cost}(P_i \cap P_i^*, c_i^*) - \mathrm{cost}(A, c_i^*) + \mathrm{cost}(B, c_i^*),$$
where $A$ and $B$ are defined as in definition 20. Again by lemma 21, $P_i \setminus P'_i = ((P_i \cap P_i^*) \setminus A) \cup B$ with $A \subset P_i \cap P_i^*$ and $B \cap (P_i \cap P_i^*) = \emptyset$, so it follows that $|P_i \setminus P'_i| = |P_i \cap P_i^*| - |A| + |B|$. Further, we know that $|P_i \setminus P'_i| \le (1-\alpha)|P_i|$ and $|P_i \cap P_i^*| \ge (1-\alpha)|P_i|$. It follows that $|B| \le |A| \le \alpha m_i$. Therefore, for every false positive $p \in B$, we can assign a unique corresponding false negative $n_p \in A$ arbitrarily.
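We observe that every point in $A$ is farther from $x$ than every point in $B$, and so we can write
$$\begin{aligned}
\|n_p - c_i^*\| &\ge \|n_p - x\| - \|x - c_i^*\| \\
&\ge \|p - x\| - \|x - c_i^*\| \\
&\ge \|p - c_i^*\| - 2\|x - c_i^*\| \\
&\ge \|p - c_i^*\| - \frac{4\,\mathrm{OPT}_i}{m_i},
\end{aligned}$$
so that $\|p - c_i^*\| \le \|n_p - c_i^*\| + \frac{4\,\mathrm{OPT}_i}{m_i}$. It follows that
$$\mathrm{cost}(B, c_i^*) = \sum_{p \in B} \|p - c_i^*\| \le \sum_{p \in B} \Big( \|n_p - c_i^*\| + \frac{4\,\mathrm{OPT}_i}{m_i} \Big) \le \mathrm{cost}(A, c_i^*) + 4\alpha\,\mathrm{OPT}_i.$$
Returning to our expression for $\mathrm{cost}(P_i \setminus P'_i, c_i^*)$, we get that
$$\begin{aligned}
\mathrm{cost}(P_i \setminus P'_i, c_i^*) &= \mathrm{cost}((P_i \cap P_i^*) \setminus A, c_i^*) + \mathrm{cost}(B, c_i^*) \\
&= \mathrm{cost}(P_i \cap P_i^*, c_i^*) - \mathrm{cost}(A, c_i^*) + \mathrm{cost}(B, c_i^*) \\
&\le \mathrm{cost}(P_i \cap P_i^*, c_i^*) + 4\alpha\,\mathrm{OPT}_i \le (1 + 4\alpha)\mathrm{OPT}_i.
\end{aligned}$$
It follows that the optimal clustering cost for the set $P_i \setminus P'_i$ is at most $(1 + 4\alpha)\mathrm{OPT}_i$, and hence that
$$\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i^j) \le (1 + \gamma)(1 + 4\alpha)\mathrm{OPT}_i \le (1 + 5\alpha)\mathrm{OPT}_i$$
for suitably small $\gamma \le \frac{\alpha}{1+4\alpha}$.

Lemma 15. For $R = O\big( \frac{1}{1-2\alpha} \log \frac{2k}{\delta} \big)$ many repetitions, with probability at least $1 - \frac{\delta}{2k}$, we have that $\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i) \le (1 + 5\alpha)\mathrm{OPT}_i$.

Proof. By lemma 12, the probability of $E_1$ holding for a given $\hat{c}_i^j$ is at least $(1-2\alpha)/2$. The probability of $E_1$ not holding for any of the $\hat{c}_i^j$ is therefore at most $(1 - (1-2\alpha)/2)^R$. It follows that for $R = \frac{2}{1-2\alpha} \ln \frac{2k}{\delta}$, the probability of $E_1$ not holding for any of the $\hat{c}_i^j$ is at most
$$\big( 1 - (1-2\alpha)/2 \big)^R \le \exp\big( -(1-2\alpha)R/2 \big) = \exp\big( -\ln(2k/\delta) \big) = \frac{\delta}{2k}.$$
It follows that with probability $1 - \frac{\delta}{2k}$, $E_1$ holds for some $\hat{c}_i^j$, and consequently by the union bound $\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i) \le (1 + 5\alpha)\mathrm{OPT}_i$ holds with probability $1 - \frac{\delta}{k}$.

Lemma 16. If $\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i) \le (1 + 5\alpha)\mathrm{OPT}_i$, then $\|\hat{c}_i - c_i^*\| \le \frac{2+5\alpha}{1-2\alpha} \cdot \frac{\mathrm{OPT}_i}{m_i}$.

Proof. By the reverse triangle inequality we have that for every point $p \in P_i^* \cap (P_i \setminus P'_i)$, $\|\hat{c}_i - p\| \ge \|\hat{c}_i - c_i^*\| - \|p - c_i^*\|$.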
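Summing up across $p$, we get
$$\begin{aligned}
&\sum_{p \in P_i^* \cap (P_i \setminus P'_i)} \|\hat{c}_i - p\| \ge |P_i^* \cap (P_i \setminus P'_i)| \cdot \|\hat{c}_i - c_i^*\| - \sum_{p \in P_i^* \cap (P_i \setminus P'_i)} \|p - c_i^*\| \\
\Rightarrow\; &(1 + 5\alpha)\mathrm{OPT}_i \ge |P_i^* \cap (P_i \setminus P'_i)| \cdot \|\hat{c}_i - c_i^*\| - \mathrm{OPT}_i \\
\Rightarrow\; &|P_i^* \cap (P_i \setminus P'_i)| \cdot \|\hat{c}_i - c_i^*\| \le ((1 + 5\alpha) + 1)\mathrm{OPT}_i \\
\Rightarrow\; &\|\hat{c}_i - c_i^*\| \le \frac{(2 + 5\alpha)\mathrm{OPT}_i}{(1-2\alpha)m_i},
\end{aligned}$$
where the last step uses $|P_i^* \cap (P_i \setminus P'_i)| \ge (1-2\alpha)m_i$.

Lemma 17. With probability $1 - \delta/k$,
$$\mathrm{cost}(P_i \cap P_i^*, \hat{c}_i) \le \mathrm{cost}(P_i \cap P_i^*, c_i^*) + \frac{(5\alpha + 10\alpha^2)\mathrm{OPT}_i}{1-2\alpha}.$$

Proof. From corollary 10, we know that with probability $1 - \frac{\delta}{2k}$, the following bound holds:
$$\mathrm{cost}(P_i \setminus P'_i, \hat{c}_i) \le (1 + \gamma)\, \mathrm{cost}(P_i \setminus P'_i, c'_i),$$
where $\gamma \le \frac{\alpha}{1+4\alpha}$ and $c'_i$ is an optimal 1-median for $P_i \setminus P'_i$. Also, it follows by definition that $\mathrm{cost}(P_i \setminus P'_i, c'_i) \le \mathrm{cost}(P_i \setminus P'_i, c_i^*)$. Further, from lemma 15 and lemma 16 it follows that with probability $1 - \frac{\delta}{2k}$,
$$\|\hat{c}_i - c_i^*\| \le \frac{(2 + 5\alpha)\mathrm{OPT}_i}{(1-2\alpha)m_i}.$$
By the union bound, both these events hold simultaneously with probability $1 - \frac{\delta}{k}$. Conditioning on this being the case, since $P_i \cap P_i^* = ((P_i \setminus P'_i) \setminus B) \cup A$, we can write
$$\begin{aligned}
\mathrm{cost}(P_i \cap P_i^*, \hat{c}_i) - \mathrm{cost}(P_i \cap P_i^*, c_i^*) &= \big( \mathrm{cost}(P_i \setminus P'_i, \hat{c}_i) - \mathrm{cost}(P_i \setminus P'_i, c_i^*) \big) + \big( \mathrm{cost}(B, c_i^*) - \mathrm{cost}(B, \hat{c}_i) \big) + \big( \mathrm{cost}(A, \hat{c}_i) - \mathrm{cost}(A, c_i^*) \big) \\
&\le (1 + \gamma)\, \mathrm{cost}(P_i \setminus P'_i, c'_i) - \mathrm{cost}(P_i \setminus P'_i, c'_i) + |B| \cdot \|\hat{c}_i - c_i^*\| + |A| \cdot \|\hat{c}_i - c_i^*\| \\
&\le \gamma \cdot \mathrm{cost}(P_i \setminus P'_i, c_i^*) + |B| \cdot \|\hat{c}_i - c_i^*\| + |A| \cdot \|\hat{c}_i - c_i^*\| \\
&\le \alpha\,\mathrm{OPT}_i + (\alpha m_i + \alpha m_i) \cdot \frac{(2 + 5\alpha)\mathrm{OPT}_i}{(1-2\alpha)m_i} \\
&\le \frac{\big( \alpha + 2\alpha(2 + 5\alpha) \big)\mathrm{OPT}_i}{1-2\alpha} = \frac{(5\alpha + 10\alpha^2)\mathrm{OPT}_i}{1-2\alpha}.
\end{aligned}$$

Lemma 18. With probability $1 - \delta/k$, $\mathrm{cost}(P_i^*, \hat{c}_i) \le (1 + c\alpha)\mathrm{OPT}_i$ for $c = \frac{7 + 10\alpha - 10\alpha^2}{(1-\alpha)(1-2\alpha)}$.

Proof. We have that $\mathrm{cost}(P_i^*, \hat{c}_i) = \mathrm{cost}(P_i^* \cap P_i, \hat{c}_i) + \mathrm{cost}(P_i^* \setminus P_i, \hat{c}_i)$. We bound the second summand as follows:
$$\mathrm{cost}(P_i^* \setminus P_i, \hat{c}_i) \le \mathrm{cost}(P_i^* \setminus P_i, c_i^*) + |P_i^* \setminus P_i| \cdot \frac{(2 + 5\alpha)\mathrm{OPT}_i}{(1-2\alpha)m_i} \le \mathrm{cost}(P_i^* \setminus P_i, c_i^*) + \frac{\alpha(2 + 5\alpha)\mathrm{OPT}_i}{(1-\alpha)(1-2\alpha)}.$$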
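Bounding the first summand $\mathrm{cost}(P_i^* \cap P_i, \hat{c}_i)$ using the bound from above, we get
$$\begin{aligned}
\mathrm{cost}(P_i^*, \hat{c}_i) &\le \mathrm{cost}(P_i^* \cap P_i, c_i^*) + \frac{(5\alpha + 10\alpha^2)\mathrm{OPT}_i}{1-2\alpha} + \mathrm{cost}(P_i^* \setminus P_i, c_i^*) + \frac{\alpha(2 + 5\alpha)\mathrm{OPT}_i}{(1-\alpha)(1-2\alpha)} \\
&= \mathrm{cost}(P_i^*, c_i^*) + \frac{(5\alpha + 10\alpha^2 - 5\alpha^2 - 10\alpha^3 + 2\alpha + 5\alpha^2)\mathrm{OPT}_i}{(1-\alpha)(1-2\alpha)} \\
&= \mathrm{cost}(P_i^*, c_i^*) + \frac{(7\alpha + 10\alpha^2 - 10\alpha^3)\mathrm{OPT}_i}{(1-\alpha)(1-2\alpha)},
\end{aligned}$$
which is $(1 + c\alpha)\mathrm{OPT}_i$ with $c$ as stated, since $\mathrm{cost}(P_i^*, c_i^*) = \mathrm{OPT}_i$.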
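As a numerical sanity check on the constants derived in this appendix, the following minimal sketch evaluates the k-means factor of theorem 7 and the per-cluster k-medians factor of lemma 18, and compares the former against the $1 + 7.7\alpha$ bound of corollary 8 on a grid of α values. The function names `kmeans_factor` and `kmedians_factor` are illustrative assumptions and are not taken from the authors' code.

```python
def kmeans_factor(alpha: float) -> float:
    """Factor from theorem 7: 1 + a/(1-a) + 4a/((1-2a)(1-a))."""
    if not 0.0 <= alpha < 0.5:
        raise ValueError("the bound is stated for alpha in [0, 1/2)")
    return (1.0
            + alpha / (1.0 - alpha)
            + 4.0 * alpha / ((1.0 - 2.0 * alpha) * (1.0 - alpha)))


def kmedians_factor(alpha: float) -> float:
    """Factor from lemma 18: 1 + c*a with c = (7 + 10a - 10a^2)/((1-a)(1-2a))."""
    if not 0.0 <= alpha < 0.5:
        raise ValueError("the expression is only finite for alpha in [0, 1/2)")
    c = (7.0 + 10.0 * alpha - 10.0 * alpha ** 2) / ((1.0 - alpha) * (1.0 - 2.0 * alpha))
    return 1.0 + c * alpha


if __name__ == "__main__":
    # Grid over [0, 1/7], the range covered by corollary 8.
    for step in range(11):
        alpha = step / 70.0
        linear_bound = 1.0 + 7.7 * alpha
        print(f"alpha={alpha:.4f}  k-means={kmeans_factor(alpha):.4f}  "
              f"1+7.7*alpha={linear_bound:.4f}  k-medians={kmedians_factor(alpha):.4f}")
```

On this grid the k-means factor stays below $1 + 7.7\alpha$ and meets it at $\alpha = 1/7$, which matches the arithmetic in the proof of corollary 8.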

B EXPERIMENTS ON RUNTIME

In this section, we report the runtimes of our k-means and k-medians approaches and the methods in Ergun et al. (2022). We sample subsets of points from the CIFAR-10 and the PHY datasets, and report the runtimes (means and standard deviations) of the algorithms over 20 random runs. The subset sizes are varied from 20% to 100% of the size of the datasets, k is fixed at 10, and α is fixed at 0.2. For k-means, we observe in fig. 3 that the runtimes of the two approaches are comparable, except for subset sizes 80% and 100% of CIFAR-10, where ours is slightly slower. This is expected since finding a subset of size $(1-\alpha)m_i$ with the best clustering cost in our algorithm and computing the shortest interval containing $m_i(1-5\alpha)/2$ points in the approach of Ergun et al. (2022) both involve sorting the points and take $O(m_i \log m_i)$ time. We observe similar trends in the k-medians setting in fig. 4. This is also expected given that the runtimes of both algorithms are dominated by calls to compute the 1-median center of the filtered points in each predicted cluster.
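The sort-and-scan step discussed above can be sketched for a single coordinate of a single predicted cluster as follows. This is a minimal illustration and not the authors' implementation: the function name `best_interval_mean`, the rounding of the window size $(1-\alpha)m_i$ up to an integer, and the use of prefix sums to evaluate each window's 1-means cost in constant time are all assumptions made for the example.

```python
import math
from itertools import accumulate


def best_interval_mean(values, alpha):
    """Return (mean, cost) of the contiguous window of the sorted values
    that has the smallest 1-means cost among windows of ~(1 - alpha) * m points."""
    xs = sorted(values)                      # O(m log m), dominates the runtime
    m = len(xs)
    if m == 0:
        raise ValueError("values must be non-empty")
    # Window size; rounding up (with a small tolerance) is an assumption.
    w = max(1, math.ceil((1.0 - alpha) * m - 1e-9))

    # Prefix sums of x and x^2 give any window's cost in O(1).
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(x * x for x in xs))

    best_mean, best_cost = None, math.inf
    for start in range(m - w + 1):           # O(m) scan over all windows
        total = s1[start + w] - s1[start]
        total_sq = s2[start + w] - s2[start]
        # Sum of squared deviations from the window mean.
        cost = total_sq - total * total / w
        if cost < best_cost:
            best_mean, best_cost = total / w, cost
    return best_mean, best_cost


if __name__ == "__main__":
    # One outlier among four points; with alpha = 0.25 the window holds 3 points.
    print(best_interval_mean([100.0, 2.0, 0.0, 1.0], alpha=0.25))
```

Applying such a routine to each of the $d$ coordinates of each of the $k$ predicted clusters gives a total cost of the form $O(dm \log m)$, consistent with the runtime stated for algorithm 1.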



The repository is hosted at github.com/thydnguyen/LA-Clustering.



($m = 10^4$, $d = 3072$), the PHY dataset from KDD Cup 2004 (KDD Cup 2004), and the MNIST dataset (Deng, 2012) ($m = 1797$, $d = 64$). For the PHY dataset, we take $m = 10^4$ random samples to form our dataset ($d = 50$).

Figure 2: Experimental comparison of algorithm 2 with prior work and baselines for k-Medians

Figure 3: Runtime comparison of algorithm 1 with Ergun et al. (2022)

For each cluster i ∈ [k], denote the set of true positives P * i

The two baselines help us see how much the clustering cost increases for different error rate α. The clustering cost

funding

* Equal contribution. All three authors were supported in part by NSF CAREER grant CCF-1750716 and NSF grant CCF-1909314. 

