NEAR-OPTIMAL CORESETS FOR ROBUST CLUSTERING

Abstract

We consider robust clustering problems in $\mathbb{R}^d$, specifically $k$-clustering problems (e.g., k-MEDIAN and k-MEANS) with $m$ outliers, where the cost for a given center set $C \subset \mathbb{R}^d$ aggregates the distances from $C$ to all but the furthest $m$ data points, instead of all points as in classical clustering. We focus on the $\epsilon$-coreset for robust clustering, a small proxy of the dataset that preserves the clustering cost within $\epsilon$-relative error for all center sets. Our main result is an $\epsilon$-coreset of size $O(m + \mathrm{poly}(k\epsilon^{-1}))$ that can be constructed in near-linear time. This significantly improves previous results, which either suffer an exponential dependence on $(m + k)$ (Feldman & Schulman, 2012) or have a weaker bi-criteria guarantee (Huang et al., 2018). Furthermore, we show that this dependence on $m$ is nearly optimal, and the fact that it is isolated from other factors may be crucial for dealing with a large number of outliers. We construct our coresets by adapting to the outlier setting a recent framework (Braverman et al., 2022) that was designed for capacity-constrained clustering, overcoming a new challenge: the participating terms in the cost, particularly the excluded $m$ outlier points, depend on the center set $C$. We validate our coresets on various datasets, and we observe a superior size-accuracy tradeoff compared with popular baselines including uniform sampling and sensitivity sampling. We also achieve a significant speedup of existing approximation algorithms for robust clustering using our coresets.

1. INTRODUCTION

We give near-optimal $\epsilon$-coresets for k-MEDIAN and k-MEANS (and more generally, $(k, z)$-CLUSTERING) with outliers in Euclidean spaces. Clustering is a central task in data analysis, and popular center-based clustering methods, such as k-MEDIAN and k-MEANS, have been widely applied. In the vanilla version of these clustering problems, given a center set $C$ of $k$ points, the objective is usually defined as the sum of (squared) distances from each data point to $C$. This formulation, while quite intuitive and simple to use, has severe robustness issues when dealing with noisy/adversarial data; for instance, an adversary may add a few noisy outlier points that are far from the center to "fool" the clustering algorithm into wrongly placing centers towards those points in order to minimize the cost. Indeed, the robustness issue introduced by outliers has become a major challenge in data science and machine learning, and it has attracted extensive algorithmic research (Charikar et al., 2001; Chen, 2008; Candès et al., 2011; Chawla & Gionis, 2013; Mount et al., 2014; Gupta et al., 2017; Statman et al., 2020; Ding & Wang, 2020). Moreover, similar issues have also been studied from the angle of statistics (Huber & Ronchetti, 2009).

Robust Clustering

We consider robust versions of these clustering problems, particularly a natural and popular variant called clustering with outliers (Charikar et al., 2001). Specifically, given a dataset $X \subset \mathbb{R}^d$, the $(k, z, m)$-ROBUST CLUSTERING problem is to find a center set $C \subset \mathbb{R}^d$ of $k$ points (repetitions allowed) that minimizes the objective function

$$\mathrm{cost}^{(m)}_z(X, C) := \min_{L \subseteq X : |L| = m} \sum_{x \in X \setminus L} (\mathrm{dist}(x, C))^z. \tag{1}$$

Here, $L$ denotes the set of outliers, $\mathrm{dist}$ denotes the Euclidean distance, and $\mathrm{dist}(x, C) := \min_{c \in C} \mathrm{dist}(x, c)$. Intuitively, the outliers capture the furthest points in a cluster, which are "not well-clustered" and are most likely to be noise.
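As a concrete reading of objective (1), the following minimal Python sketch evaluates the robust cost of an unweighted dataset for a fixed center set. The function names and the toy data are illustrative assumptions, not part of the paper; the sketch relies only on the fact that the minimizing outlier set $L$ consists of the $m$ points furthest from $C$.

```python
import math


def dist_to_centers(x, C):
    """dist(x, C): Euclidean distance from x to its nearest center in C."""
    return min(math.dist(x, c) for c in C)


def robust_cost(X, C, m, z=1):
    """cost_z^{(m)}(X, C): sum of dist(x, C)^z over all but the m furthest points."""
    d = sorted(dist_to_centers(x, C) for x in X)
    kept = d[: len(d) - m]  # dropping the m largest distances realizes the min over L
    return sum(v ** z for v in kept)


# Toy data (hypothetical): three points near the origin and one far outlier.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0)]
C = [(0.0, 0.0)]
print(robust_cost(X, C, m=0, z=2))  # vanilla k-MEANS cost, dominated by the outlier
print(robust_cost(X, C, m=1, z=2))  # robust cost, with the far point excluded
```

Setting `m=0` recovers the vanilla objective, which makes the effect of a single far point on the cost easy to compare.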
Notice that the parameter $z$ captures various (robust) clustering problems, including $(k, m)$-ROBUST MEDIAN (where $z = 1$) and $(k, m)$-ROBUST MEANS (where $z = 2$). On the other hand, if the number of outliers is $m = 0$, then the robust clustering problem falls back to the non-robust version. The $(k, z, m)$-ROBUST CLUSTERING problem has been widely studied in the literature (Chen, 2008; Gupta et al., 2017; Krishnaswamy et al., 2018; Friggstad et al., 2019; Statman et al., 2020). Moreover, the idea of removing outliers has also been considered in other machine learning tasks, e.g., robust PCA (Bhaskara & Kumar, 2018) and robust regression (Rousseeuw & Leroy, 1987; Mount et al., 2014).

Computational Challenges

However, the presence of outliers introduces significant computational challenges, and it has inspired a line of research on efficient algorithms for robust clustering. On one hand, approximation algorithms with strict accuracy guarantees have been obtained (Charikar et al., 2001; Chen, 2008; Gupta et al., 2017; Krishnaswamy et al., 2018; Feng et al., 2019; Friggstad et al., 2019; Zhang et al., 2021), but their running time is a high-degree polynomial, which is impractical. On the other hand, more scalable algorithms have also been proposed (Bhaskara et al., 2019; Deshpande et al., 2020); however, their approximation ratio is worse, and a more severe limitation is that their guarantee usually violates the required number of outliers. Moreover, we are not aware of works that design algorithms in sublinear models, such as streaming and distributed computing.

Coresets

In order to tackle the computational challenges, we consider coresets for robust clustering. Roughly, an $\epsilon$-coreset is a tiny proxy of the massive input dataset, on which the clustering objective is preserved within $\epsilon$-error for every potential center set.
Existing algorithms may benefit from a significant speedup when run on top of a coreset, and more importantly, coresets can be used to derive sublinear algorithms, including streaming algorithms (Har-Peled & Mazumdar, 2004), distributed algorithms (Balcan et al., 2013) and dynamic algorithms (Henzinger & Kale, 2020), which are highly useful for dealing with massive datasets.

Stemming from Har-Peled & Mazumdar (2004), the study of coresets for the non-robust version of clustering, i.e., $(k, z)$-CLUSTERING, has been very fruitful (Feldman & Langberg, 2011; Feldman et al., 2020; Sohler & Woodruff, 2018; Huang & Vishnoi, 2020; Braverman et al., 2021; Cohen-Addad et al., 2021b; Braverman et al., 2022), and the state-of-the-art coreset achieves a size of $\mathrm{poly}(k\epsilon^{-1})$, independent of $d$ and $n$. However, coresets for robust clustering are much less understood. Existing results either suffer an exponential $(k + m)^{k+m}$ factor in the coreset size (Feldman & Schulman, 2012) or need to violate the required number of outliers (Huang et al., 2018). This gap leads to the following question: can we efficiently construct an $\epsilon$-coreset of size $\mathrm{poly}(m, k, \epsilon^{-1})$ for $(k, z, m)$-ROBUST CLUSTERING (without violating the number of outliers)?

1.1. OUR CONTRIBUTIONS

Our main contribution, stated in Theorem 1.1, is a near-optimal coreset for robust clustering, affirmatively answering the above question. In fact, the dependence on $m$ that we achieve is not only polynomial but linear, and it is isolated from the other factors. This can be very useful when the number of outliers $m$ is large.

Theorem 1.1 (Informal; see Theorem 3.1). There exists a near-linear time algorithm that, given a dataset $X \subset \mathbb{R}^d$, $z \ge 1$, $\epsilon \in (0, 0.3)$ and integers $k, m \ge 1$, computes with constant probability an $\epsilon$-coreset of $X$ for $(k, z, m)$-ROBUST CLUSTERING of size $O(m) + 2^{O(z \log z)} \cdot \tilde{O}(k^3 \epsilon^{-3z-2})$.

Our coreset improves over previous results in several aspects.
Notably, compared with Feldman & Schulman (2012), our result avoids their exponential $(k + m)^{k+m}$ factor in the coreset size, which is likely to be impractical since typical values of $k$ and/or $m$ may be $O(\log n)$. In fact, as observed in our experiments, the value of $m$ can be as large as 1500 in real datasets, so the dependence in Feldman & Schulman (2012) is prohibitively large, which leads to inferior practical performance (see Section 4). Moreover, our coreset has a strict guarantee for $m$ outliers instead of a bi-criteria guarantee as in Huang et al. (2018), which needs to allow more or fewer outliers in the objective for the coreset. We also note that our coreset is composable (Remark 3.2).

Furthermore, we show that the linear dependence on $m$ is necessary (Theorem 1.2). Hence, combining this with a recent size lower bound of $\Omega(k\epsilon^{-2})$ (Cohen-Addad et al., 2022) for vanilla clustering (i.e., $m = 0$), we conclude that the dependence on every parameter (i.e., $m$, $k$, $\epsilon$) is nearly tight.

Theorem 1.2. For every integer $m \ge 1$, there exists a dataset $X \subset \mathbb{R}$ of $n \ge m$ points, such that for every $0 < \epsilon < 0.5$, any $\epsilon$-coreset for $(1, m)$-ROBUST MEDIAN must have size $\Omega(m)$.

For the lower bound, we observe that when $m = n - 1$, the clustering cost for $(1, m)$-ROBUST MEDIAN reduces to the distance from the center $c$ to its nearest neighbor in the dataset. This is easily shown to require $\Omega(n) = \Omega(m)$ points in the coreset in order to achieve any finite approximation. The formal proof can be found in Section H.

Experiments

We evaluate the empirical performance of our coreset on various datasets (in Section 4). We validate the size-accuracy tradeoff of our coreset compared with popular coreset construction methods, particularly uniform sampling (which is a natural heuristic) and sensitivity sampling (Feldman & Schulman, 2012), and we observe that our coreset consistently outperforms these baselines in accuracy by a significant margin for every experimented coreset size, ranging from 500 to 5000.
We also run existing approximation algorithms on top of our coreset, and we achieve about 100x speedup for both a) a Lloyd heuristic adapted to the outlier setting (Chawla & Gionis, 2013) that is seeded by an outlier version of k-MEANS++ (Bhaskara et al., 2019), and b) a natural local search algorithm (Friggstad et al., 2019). These numbers show that our coreset is not only near-optimal in theory but also has the potential to be used in practice.
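The nearest-neighbor observation behind Theorem 1.2 can be checked numerically. The sketch below is only an illustration of that observation on a hypothetical one-dimensional dataset, not the formal proof from Section H: with $m = n - 1$ the $(1, m)$-ROBUST MEDIAN cost equals the nearest-neighbor distance, so a center placed on a data point has cost 0, while any candidate subset that misses this point sits at a positive distance from it.

```python
def robust_median_cost(X, c, m):
    """(1, m)-ROBUST MEDIAN cost on the real line: sum of all but the m largest |x - c|."""
    d = sorted(abs(x - c) for x in X)
    return sum(d[: len(d) - m])


def nearest_neighbor_dist(X, c):
    """Distance from the center c to its nearest point of X."""
    return min(abs(x - c) for x in X)


# Hypothetical dataset; with m = n - 1 only the nearest point is kept as an inlier.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
m = len(X) - 1
for c in [-0.5, 1.25, 3.0, 10.0]:
    assert robust_median_cost(X, c, m) == nearest_neighbor_dist(X, c)

# A candidate subset that drops the point 3.0: the full data has cost 0 at c = 3.0,
# but the nearest point of S is at distance 1, so the two cannot agree up to a
# finite relative error once S keeps any inlier weight.
S = [0.0, 1.0, 2.0, 4.0]
full_cost_at_dropped_point = robust_median_cost(X, 3.0, m)
gap_at_dropped_point = nearest_neighbor_dist(S, 3.0)
print(full_cost_at_dropped_point, gap_at_dropped_point)
```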

1.2. TECHNICAL OVERVIEW

Similar to many previous coreset constructions, we first compute a near-optimal solution, an $(\alpha, \beta, \gamma)$-approximation (see Definition 2.2) $C^* := \{c^*_i \mid i \in [\beta k]\}$, obtained using known approximation algorithms (see the discussion in Section A). Then, with respect to $C^*$, we identify the outliers $L^* \subset X$ of $C^*$ and partition the remaining inlier points $X \setminus L^*$ into $|C^*|$ clusters $\{X_i\}_i$. We start by including $L^*$ in our coreset, and we also include a weighted subset of the remaining inlier points $X \setminus L^*$ by using a method built upon a recent framework (Braverman et al., 2022), which was originally designed for clustering with capacity constraints. The step of including $L^*$ in the coreset is natural, since otherwise one may miss the remote outlier points, which can incur a huge error; furthermore, the necessity of this step is also justified by our $\Omega(m)$ lower bound (Theorem 1.2). Similar to Braverman et al. (2022), for each cluster $X_i$ among the remaining inlier points $X \setminus L^*$, we identify a subset of $X_i$ that consists of $\mathrm{poly}(k\epsilon^{-1})$ rings, and merge the remaining part into $\mathrm{poly}(k\epsilon^{-1})$ groups of rings such that each group $G$ has a tiny cost (Theorem 3.3). We use a general strategy that is similar to Braverman et al. (2022) to handle the rings (Lemma 3.6) and the groups (Lemma 3.7) separately, but the actual details differ significantly due to the presence of outliers.

Handling Rings

Similar to Braverman et al. (2022), for a ring data subset $R = \mathrm{ring}(c^*_i, r, 2r) \subseteq X_i$, i.e., a subset such that every point is at a similar distance (up to a factor of 2) to the center $c^*_i$, we apply uniform sampling to construct a coreset (with additive error, Definition 3.5). In Braverman et al. (2022), for any center set $C \subset \mathbb{R}^d$, the error incurred by uniform sampling is bounded by $\epsilon \cdot \mathrm{cost}_z(R, C)$, which is $\epsilon$ times the total cost without outliers from $R$ to $C$ (ignoring some negligible additive term).
However, in the presence of outliers, their error bound $\epsilon \cdot \mathrm{cost}_z(R, C)$ can hardly be charged to $\epsilon \cdot \mathrm{cost}^{(m_R)}_z(R, C)$, where $m_R$ is the number of outliers in $R$ with respect to $C$. This is because $\mathrm{cost}^{(m_R)}_z(R, C)$ can be very small and even close to 0 when $m_R \approx |R|$. Moreover, the number of outliers $m_R$ is not known a priori and can be any number between 0 and $m$. Hence, we provide a stronger guarantee (Lemma 3.6), where we give an alternative upper bound that eventually charges the error to $\epsilon \cdot \mathrm{cost}^{(m_R)}_z(R, C)$ and $\epsilon \cdot \mathrm{opt}$. We use the fact that $\mathrm{cost}^{(m_R)}_z(R, C)$ is "small enough" compared to $\mathrm{opt}$ for large $m_R$, while for small $m_R$, we rewrite the robust clustering cost as an integration of ball ranges (Fact F.1) and use the fact that uniform sampling approximately estimates all ball ranges (Lemma F.3). A similar idea of writing the cost as an integration has also been used in previous works, e.g., Huang et al. (2018); Braverman et al. (2022).
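To make the "integration of ball ranges" idea concrete, the sketch below checks a standard layer-cake identity numerically. The exact statement of Fact F.1 is not reproduced in this section, so the identity used here is an assumption about its form: for $t$ outliers, $\mathrm{cost}^{(t)}_z(R, C) = \int_0^\infty z u^{z-1} \max(|R \setminus \mathrm{Balls}(C, u)| - t, 0)\, du$. The function names are illustrative, and the input is simply the list of distances $\mathrm{dist}(x, C)$ for $x \in R$.

```python
import math


def robust_cost_direct(dists, t, z):
    """Sum of dist^z over all but the t largest distances."""
    d = sorted(dists)
    return sum(v ** z for v in d[: len(d) - t])


def robust_cost_via_ball_ranges(dists, t, z):
    """Evaluate the integral of z*u^(z-1) * max(#{x : dist(x, C) > u} - t, 0) exactly.

    The count of points outside Balls(C, u) is piecewise constant in u, so the
    integral splits into intervals between consecutive sorted distances.
    """
    d = sorted(dists)
    n = len(d)
    total, prev = 0.0, 0.0
    for j, cur in enumerate(d):
        outside = n - j  # points with distance > u for every u in [prev, cur)
        total += max(outside - t, 0) * (cur ** z - prev ** z)
        prev = cur
    return total


# Hypothetical distances from the points of a ring to a center set.
dists = [0.5, 1.0, 2.0, 2.0, 7.0]
for t in range(len(dists) + 1):
    for z in (1, 2, 3):
        assert math.isclose(
            robust_cost_direct(dists, t, z),
            robust_cost_via_ball_ranges(dists, t, z),
        )
```

The point of this rewriting is that the cost depends on the data only through the counts $|R \setminus \mathrm{Balls}(C, u)|$, which are the quantities a uniform sample can estimate.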

Handling Groups

The main technical difficulty is to handle groups (Lemma 3.7). We still construct a two-point coreset (Definition 3.4) for every group $G \subset X_i$, as in Braverman et al. (2022). To analyze the error of this two-point coreset for an arbitrary center set $C \subset \mathbb{R}^d$, we partition the groups into colored and uncolored groups with respect to $C$ (Lemma G.1) in a way similar to Braverman et al. (2022). Let us call a group $G$ "bad" if the error incurred by the two-point coreset is much larger than $\epsilon \cdot \mathrm{cost}^{(m_G)}_z(G, C)$, which is $\epsilon$ times the contribution of $G$. We focus our discussion on bounding the error of bad groups. We first show (in Lemma G.1) that, even with outliers, the error for each bad group is at most $\epsilon \cdot \mathrm{cost}_z(G, C^*)$. Hence, it remains to bound the number of bad groups. In Braverman et al. (2022), only a colored group can be bad, and the number of them is bounded by $O(k \log \frac{z}{\epsilon})$. However, a key difference in the outlier setting is that uncolored groups can also be bad (for a reason similar to that for rings: $\mathrm{cost}^{(m_G)}_z(G, C)$ may be too small when the number of outliers $m_G$ in $G$ is large), and we call them "special uncolored groups" (22). To bound their number, in Lemma G.5 we make a crucial geometric observation: even though special uncolored groups may change significantly for varying $C$, an invariant is that they must always be consecutive uncolored groups, due to the way we decompose $X_i$, and apart from the two groups that partially intersect the outliers, every other group within the consecutive sequence consists of outliers only. This key geometric observation implies that the number of such groups is $O(1)$ (Lemma G.5), and consequently, the total number of bad groups is at most $O(k \log \frac{z}{\epsilon})$ with respect to any $C$.
Finally, in addition to the above new steps, we remark that it is also necessary to make these bounds for rings/groups work for all numbers of outliers $0 \le t \le m$ "simultaneously", since one does not know in advance how many outliers reside in each ring/group for an arbitrarily chosen $C$.

1.3. OTHER RELATED WORKS

Robust Clustering in $\mathbb{R}^d$

Robust clustering, first proposed by Charikar et al. (2001), has been studied for two decades. For $(k, m)$-ROBUST MEDIAN, Charikar et al. (2001) designed a bi-criteria approximation algorithm with violations on $k$. Chen (2008) first showed a pure constant-factor approximation algorithm, whose approximation ratio was improved to $7.081 + \epsilon$ (Krishnaswamy et al., 2018). When $k = O(1)$, Feng et al. (2019) [...]

[...] is $\tilde{O}(z^{O(z)} k \epsilon^{-2} \cdot \min\{k, \epsilon^{-z}\})$. This bound nearly matches a lower bound of $\Omega(k\epsilon^{-2} + k \min\{d, 2^{z/20}\})$ (Cohen-Addad et al., 2022; Huang & Vishnoi, 2020). In addition, coresets for constrained clustering in Euclidean spaces have also been considered, such as capacitated clustering and the closely related fair clustering (Schmidt et al., 2019; Huang et al., 2019; Braverman et al., 2022), and ordered weighted clustering (Braverman et al., 2019). Going beyond Euclidean spaces, coresets of size $\mathrm{poly}(k\epsilon^{-1})$ are known for $(k, z)$-CLUSTERING in doubling metrics (Huang et al., 2018), shortest-path metrics of graphs with bounded treewidth (Baker et al., 2020) and graphs that exclude a fixed minor (Braverman et al., 2021).

2. PRELIMINARIES

Balls and Rings

For a point $a \in \mathbb{R}^d$ and positive real numbers $r' > r > 0$, define $\mathrm{Ball}(a, r) = \{x \in \mathbb{R}^d : \mathrm{dist}(x, a) \le r\}$ and $\mathrm{ring}(a, r, r') = \mathrm{Ball}(a, r') \setminus \mathrm{Ball}(a, r)$. For a set of points $A \subset \mathbb{R}^d$, define $\mathrm{Balls}(A, r) = \bigcup_{a \in A} \mathrm{Ball}(a, r)$.

Weighted Outliers

Since our coreset uses weighted points, we need to define the notion of weighted sets and weighted outliers. We call a set $S$ with an associated weight function $w_S : S \to \mathbb{R}_{\ge 0}$ a weighted set. Given two weighted sets $(X, w_X)$ and $(Y, w_Y)$ such that $Y \subseteq X$ and $w_Y(x) \le w_X(x)$ for any $x \in Y$, let $X - Y$ denote the weighted set $(Z, w_Z)$ such that $w_Z = w_X - w_Y$ and $Z$ is the support of $w_Z$. Moreover, for a weighted set $X$, we denote by $\mathcal{L}^{(m)}_X$ the collection of all possible sets of weighted outliers $(Y, w_Y)$ satisfying $Y \subseteq X$, $\sum_{x \in Y} w_Y(x) = m$ and $w_Y(x) \le w_X(x)$ for all $x \in X$. In this definition, since $X$ is a weighted point set, we need to pick outliers of total weight $m$ in the objective $\mathrm{cost}^{(m)}_z(X, C)$, instead of $m$ distinct points, whose total weight may be much larger than $m$.

Weighted Cost Functions

For $m = 0$, we write $\mathrm{cost}_z$ for $\mathrm{cost}^{(m)}_z$. We extend the definition of the cost function to a weighted set $X \subset \mathbb{R}^d$. For $m = 0$, we define $\mathrm{cost}_z(X, C) := \sum_{x \in X} w_X(x) \cdot (\mathrm{dist}(x, C))^z$. For general $m \ge 1$, the cost is defined using the notion of weighted outliers and aggregated using the $\mathrm{cost}_z$ function of the $m = 0$ case:

$$\mathrm{cost}^{(m)}_z(X, C) := \min_{(L, w_L) \in \mathcal{L}^{(m)}_X} \mathrm{cost}_z(X - L, C).$$

One can check that this definition generalizes the unweighted case (1). For a weighted set $X \subset \mathbb{R}^d$, let the optimal cost be $\mathrm{opt}^{(m)}_z(X) := \min_{C \subset \mathbb{R}^d, |C| = k} \mathrm{cost}^{(m)}_z(X, C)$.

Definition 2.1 (Coreset).
Given a point set $X \subset \mathbb{R}^d$ and $\epsilon \in (0, 1)$, an $\epsilon$-coreset for $(k, z, m)$-ROBUST CLUSTERING is a weighted subset $(S, w_S)$ of $X$ such that

$$\forall C \subset \mathbb{R}^d, |C| = k: \quad \mathrm{cost}^{(m)}_z(S, C) \in (1 \pm \epsilon) \cdot \mathrm{cost}^{(m)}_z(X, C).$$

Even though Definition 2.1 naturally extends the definition of coresets for vanilla clustering (Har-Peled & Mazumdar, 2004; Feldman & Langberg, 2011; Feldman et al., 2020), it is surprising that this exact definition does not seem to have appeared in the literature. A closely related definition (Huang et al., 2018) considers a relaxed "bi-criteria" guarantee of the cost (with respect to the number of outliers), i.e., $(1 - \epsilon) \cdot \mathrm{cost}^{((1+\beta)m)}_z(S, C) \le \mathrm{cost}^{(m)}_z(X, C) \le (1 + \epsilon) \cdot \mathrm{cost}^{((1-\beta)m)}_z(S, C)$ for $\beta \in [0, 1)$, and their coreset size depends on $\beta^{-1}$. Another definition was considered in Feldman & Schulman (2012) for a more general problem called weighted clustering (so their coreset implies our Definition 2.1). Unfortunately, this generality leads to an exponential-size coreset (in $k$, $m$).

Definition 2.2 ($(\alpha, \beta, \gamma)$-Approximation). Given a dataset $X \subset \mathbb{R}^d$ and real numbers $\alpha, \beta, \gamma \ge 1$, an $(\alpha, \beta, \gamma)$-approximate solution for $(k, z, m)$-ROBUST CLUSTERING on $X$ is a center set $C^* \subset \mathbb{R}^d$ with $|C^*| \le \beta k$ such that $\mathrm{cost}^{(\gamma m)}_z(X, C^*) \le \alpha \cdot \mathrm{opt}^{(m)}_z(X)$.
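The weighted cost with weighted outliers can be evaluated greedily, which the following minimal sketch illustrates. The representation of a weighted set as a list of `(point, weight)` pairs and the function name are assumptions made for this example. The minimization over $\mathcal{L}^{(m)}_X$ is a fractional-knapsack problem, so it is solved by removing weight from the furthest points first, possibly taking only part of the weight of the last point touched.

```python
import math


def weighted_robust_cost(S, C, m, z=1):
    """cost_z^{(m)}(S, C) for a weighted set S = [(point, weight), ...].

    Assumes m does not exceed the total weight of S; otherwise there is no
    valid weighted outlier set and this sketch simply returns 0.
    """
    # Pair each point's distance to its nearest center with its weight, furthest first.
    pairs = sorted(
        ((min(math.dist(x, c) for c in C), w) for x, w in S), reverse=True
    )
    remaining, total = float(m), 0.0
    for d, w in pairs:
        removed = min(w, remaining)  # outlier weight taken from this point
        remaining -= removed
        total += (w - removed) * d ** z
    return total


# Hypothetical weighted set on the real line: the far point carries weight 3,
# but only total weight m = 2 may be discarded, so weight 1 of it still counts.
S = [((0.0,), 1.0), ((1.0,), 1.0), ((10.0,), 3.0)]
C = [(0.0,)]
print(weighted_robust_cost(S, C, m=2, z=1))
```

With unit weights and integer `m`, the function agrees with the unweighted objective (1), which is the generalization property mentioned above.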

3. CORESETS FOR (k, z, m)-ROBUST CLUSTERING

We present our main theorem in Theorem 3.1. As mentioned, the proof of Theorem 3.1 is based on the framework in Braverman et al. (2022), and we review the necessary ingredients in Section 3.1. The statement of our algorithm and the proof of Theorem 3.1 can be found in Section 3.2.

Theorem 3.1. Given an input dataset $P \subset \mathbb{R}^d$ with $|P| = n$, integers $k, m \ge 1$, and a real number $z \ge 1$, assume there exists an algorithm that computes an $(\alpha, \beta, \gamma)$-approximation of $P$ for $(k, z, m)$-ROBUST CLUSTERING in time $A(n, k, d, z)$. Then Algorithm 1 uses time $A(n, k, d, z) + O(nkd)$ to construct a weighted subset $(S, w_S)$ of size $|S| = \gamma m + 2^{O(z \log z)} \cdot \beta \cdot \tilde{O}(k^3 \epsilon^{-3z-2})$, such that with probability at least 0.9, for every integer $0 \le t \le m$, $S$ is an $\alpha\epsilon$-coreset of $P$ for $(k, z, t)$-ROBUST CLUSTERING.

Remark 3.2. By rescaling $\epsilon$ to $\epsilon/\alpha$ in the input of Algorithm 1, we obtain an $\epsilon$-coreset of size $\gamma m + 2^{O(z \log z)} \cdot \alpha^{3z+2} \beta \cdot \tilde{O}(k^3 \epsilon^{-3z-2})$. We discuss how to obtain $(\alpha, \beta, \gamma)$-approximations in Section A. We also note that Theorem 3.1 actually yields an $\epsilon$-coreset for $(k, z, t)$-ROBUST CLUSTERING simultaneously for every integer $0 \le t \le m$, which implies that our coreset is composable. Specifically, if for every integer $0 \le t \le m$, $S_X$ is an $\epsilon$-coreset of $X$ for $(k, z, t)$-ROBUST CLUSTERING and $S_Y$ is an $\epsilon$-coreset of $Y$ for $(k, z, t)$-ROBUST CLUSTERING, then for every integer $0 \le t \le m$, $S_X \cup S_Y$ is an $\epsilon$-coreset of $X \cup Y$ for $(k, z, t)$-ROBUST CLUSTERING.

[Figure 1: decomposition of a cluster with center $c^*_i$ into rings and groups. Recoverable labels: $R_1 \in \mathcal{R}$, $R_2$, $R_3$, $R_4 = G_2 \in \mathcal{G}$, $R_5 \in \mathcal{R}$, $G_1 \in \mathcal{G}$, $c^*_i$, data points.]

3.1. THE FRAMEWORK OF BRAVERMAN ET AL. (2022)

Theorem 3.3 is a general geometric decomposition theorem for coresets, which we use crucially. It partitions an arbitrary cluster into $\mathrm{poly}(k/\epsilon)$ rings and merges the remaining rings into $\mathrm{poly}(k/\epsilon)$ groups with low contribution to $\mathrm{cost}_z(X_i, c^*_i)$. (See Figure 1 for an illustration.)

Theorem 3.3 (Decomposition into rings and groups (Braverman et al., 2022, Theorem 3.2)). Let $X \subset \mathbb{R}^d$ be a set and $c \in \mathbb{R}^d$ be a center point. There exists an $O(nkd)$-time algorithm that computes a partition of $X$ into two disjoint collections of sets $\mathcal{R}$ and $\mathcal{G}$, such that $X = (\bigcup_{R \in \mathcal{R}} R) \cup (\bigcup_{G \in \mathcal{G}} G)$, where $\mathcal{R}$ is a collection of disjoint rings satisfying

1. $\forall R \in \mathcal{R}$, $R$ is a ring of the form $R = R_i(X, c)$ for some integer $i \in \mathbb{Z} \cup \{-\infty\}$, where $R_i(X, c) := X \cap \mathrm{ring}(c, 2^{i-1}, 2^i)$ for $i \in \mathbb{Z}$ and $R_{-\infty}(X, c) := X \cap \{c\}$;
2. $|\mathcal{R}| \le 2^{O(z \log z)} \cdot \tilde{O}(k\epsilon^{-z})$;

and $\mathcal{G}$ is a collection of disjoint groups satisfying

1. $\forall G \in \mathcal{G}$, $G$ is the union of consecutive rings of $(X, c)$. Formally, $\forall G \in \mathcal{G}$, there exist two integers $-\infty \le l_G \le r_G$ such that $G = \bigcup_{i=l_G}^{r_G} R_i(X, c)$, and the intervals $\{[l_G, r_G] : G \in \mathcal{G}\}$ are disjoint for different $G \in \mathcal{G}$;
2. $|\mathcal{G}| \le 2^{O(z \log z)} \cdot \tilde{O}(k\epsilon^{-z})$, and $\forall G \in \mathcal{G}$, $\mathrm{cost}_z(G, c) \le (\frac{\epsilon}{6z})^z \cdot \frac{\mathrm{cost}_z(X, c)}{k \cdot \log(24z/\epsilon)}$.

Rings and groups are inherently different geometric objects, hence they require different coreset construction methods. As in Braverman et al. (2022), uniform sampling is applied to rings, but a two-point coreset, whose construction is given in Definition 3.4, is applied to each group. Our main algorithm (Algorithm 1) also follows this general strategy.

Definition 3.4. For a group $G$ and a center point $c$, let $p^G_{\mathrm{far}}$ and $p^G_{\mathrm{close}}$ denote the furthest and the closest points of $G$ to $c$, respectively. For every $p \in G$, let $\lambda_p \in [0, 1]$ be such that $\mathrm{dist}^z(p, c) = \lambda_p \cdot \mathrm{dist}^z(p^G_{\mathrm{close}}, c) + (1 - \lambda_p) \cdot \mathrm{dist}^z(p^G_{\mathrm{far}}, c)$. Let $D_G = \{p^G_{\mathrm{far}}, p^G_{\mathrm{close}}\}$, $w_{D_G}(p^G_{\mathrm{close}}) = \sum_{p \in G} \lambda_p$, and $w_{D_G}(p^G_{\mathrm{far}}) = \sum_{p \in G} (1 - \lambda_p)$. $D_G$ is called the two-point coreset of $G$ with respect to $c$.
By definition, we can verify that $w_{D_G}(D_G) = |G|$ and $\mathrm{cost}_z(D_G, c) = \mathrm{cost}_z(G, c)$, which are useful for upper bounding the error induced by such two-point coresets.
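The two-point coreset and the two identities just stated can be sketched in a few lines. This is a minimal illustration assuming an unweighted group given as a list of coordinate tuples; the function name and the toy group are hypothetical. For each point, $\lambda_p$ is obtained by solving the defining equation, which is linear in $\lambda_p$.

```python
import math


def two_point_coreset(G, c, z=1):
    """Return [(p_close, w_close), (p_far, w_far)] for a group G and center c."""
    p_close = min(G, key=lambda p: math.dist(p, c))
    p_far = max(G, key=lambda p: math.dist(p, c))
    d_close = math.dist(p_close, c) ** z
    d_far = math.dist(p_far, c) ** z
    w_close = w_far = 0.0
    for p in G:
        d = math.dist(p, c) ** z
        # lambda_p solves d = lam * d_close + (1 - lam) * d_far; if all points are
        # equidistant from c, any value in [0, 1] works and we pick 1.
        lam = 1.0 if d_far == d_close else (d_far - d) / (d_far - d_close)
        w_close += lam
        w_far += 1.0 - lam
    return [(p_close, w_close), (p_far, w_far)]


# Toy group (hypothetical) with center at the origin and z = 2.
G = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
c = (0.0, 0.0)
D = two_point_coreset(G, c, z=2)
total_weight = sum(w for _, w in D)
coreset_cost = sum(w * math.dist(p, c) ** 2 for p, w in D)
group_cost = sum(math.dist(p, c) ** 2 for p in G)
print(total_weight, coreset_cost, group_cost)
```

The identities hold with respect to the center $c$ only; for other center sets the two-point coreset incurs an error, which is what the analysis of groups has to bound.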

3.2. PROOF OF THEOREM 3.1

Coreset Construction Algorithm

We present our main algorithm in Algorithm 1. In Line 1 and Line 2, the set $L^*$ of outliers of $C^*$ is the set of the $\gamma m$ furthest points to $C^*$, and $L^*$ is directly added into the coreset $S$. In Line 3 and Line 4, the inliers $P \setminus L^*$ are decomposed into $\beta k$ clusters with respect to $C^*$, and the linear-time decomposition algorithm of Theorem 3.3 is applied to each cluster. In Line 5 and Line 6, similar to Braverman et al. (2022), uniform sampling and a two-point coreset (see Definition 3.4) are applied to construct coresets for rings and groups, respectively.

Algorithm 1 Coreset Construction for $(k, z, m)$-ROBUST CLUSTERING
Input: dataset $P \subset \mathbb{R}^d$, $z \ge 1$, integers $k, m \ge 1$, an $(\alpha, \beta, \gamma)$-approximation $C^* = \{c^*_i\}_{i=1}^{\beta k}$
1: let $L^* \leftarrow$ the $\gamma m$ furthest points of $P$ to $C^*$
2: add $L^*$ into $S$
3: decompose the inliers $P \setminus L^*$ into $\beta k$ clusters with respect to $C^*$
4: for $i \in [\beta k]$, apply the decomposition of Theorem 3.3 to the $i$-th cluster with center $c^*_i$ to obtain rings $\mathcal{R}_i$ and groups $\mathcal{G}_i$
5: for $i \in [\beta k]$ and every ring $R \in \mathcal{R}_i$, take a uniform sample $Q_R$ of size $2^{O(z \log z)} \cdot \tilde{O}(\frac{k}{\epsilon^{2z+2}})$ from $R$, set $\forall x \in Q_R$, $w_{Q_R}(x) \leftarrow \frac{|R|}{|Q_R|}$, and add $(Q_R, w_{Q_R})$ into $S$
6: for $i \in [\beta k]$ and every group $G \in \mathcal{G}_i$ with center $c^*_i$, construct a two-point coreset $(D_G, w_{D_G})$ of $G$ as in Definition 3.4 and add $(D_G, w_{D_G})$ into $S$
7: return $(S, w_S)$

Error Analysis

Recall that $P$ is decomposed into three parts: the outliers $L^*$, the collection of rings, and the collection of groups. We prove the coreset property for each of the three parts and claim that the union yields an $\epsilon$-coreset of $P$ for $(k, z, m)$-ROBUST CLUSTERING. As $L^*$ is identical in the dataset $P$ and the coreset $S$, we only have to deal with the rings and groups. We first introduce the following relaxed coreset definition, which allows additive error.

Definition 3.5. Let $P \subset \mathbb{R}^d$, $0 < \epsilon < 1$ and $A \ge 0$. A weighted set $(S, w_S)$ is an $(\epsilon, A)$-coreset of $P$ for $(k, z, t)$-ROBUST CLUSTERING if for every $C \subset \mathbb{R}^d$, $|C| = k$,

$$|\mathrm{cost}^{(t)}_z(P, C) - \mathrm{cost}^{(t)}_z(S, C)| \le \epsilon \cdot \mathrm{cost}^{(t)}_z(P, C) + \epsilon \cdot A.$$
This allowance of additive error turns out to be crucial in our analysis, and eventually we are able to charge the total additive error to the (near-)optimal cost, which enables us to obtain the coreset (without additive error). The following two lemmas are key to the proof of Theorem 3.1; they analyze the guarantee of the uniform-sampling coresets for rings (Lemma 3.6) and the two-point coresets for groups (Lemma 3.7).

Lemma 3.6 (Coresets for rings). Let $Q = \bigcup_{i \in [\beta k]} \bigcup_{R \in \mathcal{R}_i} Q_R$ denote the coreset of the rings $R_{\mathrm{all}} = \bigcup_{i \in [\beta k]} \bigcup_{R \in \mathcal{R}_i} R$, constructed by uniform sampling as in Line 5 of Algorithm 1. Then for every $0 \le t \le m$, $Q$ is an $(\epsilon, \mathrm{cost}_z(R_{\mathrm{all}}, C^*))$-coreset of $R_{\mathrm{all}}$ for $(k, z, t)$-ROBUST CLUSTERING.

Proof. The proof can be found in Section F.

[...]

$$\mathrm{cost}^{(t)}_z(S, C) \le (1 + \epsilon)\, \mathrm{cost}^{(t)}_z(P, C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*).$$

Let $t_R = |L \cap R_{\mathrm{all}}|$ and $t_G = |L \cap G_{\mathrm{all}}|$. By Lemma 3.6 and Lemma 3.7, there exist weighted subsets $T_Q \subset Q$ and $T_D \subset D$ such that $w_{T_Q}(T_Q) = t_R$, $w_{T_D}(T_D) = t_G$,

$$\mathrm{cost}_z(Q - T_Q, C) \le (1 + \epsilon)\, \mathrm{cost}_z(R_{\mathrm{all}} - (L \cap R_{\mathrm{all}}), C) + \epsilon \cdot \mathrm{cost}_z(R_{\mathrm{all}}, C^*) \tag{3}$$

and

$$\mathrm{cost}_z(D - T_D, C) \le (1 + \epsilon)\, \mathrm{cost}_z(G_{\mathrm{all}} - (L \cap G_{\mathrm{all}}), C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*). \tag{4}$$

Define a weighted subset $(T, w_T)$ of $S$ such that $T = (L \cap L^*) \cup T_Q \cup T_D$. Then $w_T(T) = t$ and

$$\begin{aligned}
\mathrm{cost}^{(t)}_z(S, C) &\le \mathrm{cost}_z(S - T, C) \\
&= \mathrm{cost}_z(L^* - (L \cap L^*), C) + \mathrm{cost}_z(Q - T_Q, C) + \mathrm{cost}_z(D - T_D, C) \\
&\le \mathrm{cost}_z(L^* - (L \cap L^*), C) + (1 + \epsilon)\, \mathrm{cost}_z(R_{\mathrm{all}} - (L \cap R_{\mathrm{all}}), C) + \epsilon \cdot \mathrm{cost}_z(R_{\mathrm{all}}, C^*) \\
&\quad + (1 + \epsilon)\, \mathrm{cost}_z(G_{\mathrm{all}} - (L \cap G_{\mathrm{all}}), C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*) \\
&\le (1 + \epsilon)\, \mathrm{cost}_z(P - L, C) + O(\epsilon) \cdot \mathrm{cost}_z(P \setminus L^*, C^*) \\
&\le (1 + O(\alpha \cdot \epsilon))\, \mathrm{cost}^{(t)}_z(P, C).
\end{aligned}$$

Similarly, we can also obtain that $\mathrm{cost}^{(t)}_z(P, C) \le (1 + O(\alpha \cdot \epsilon))\, \mathrm{cost}^{(t)}_z(S, C)$ for any $0 \le t \le m$. It remains to scale $\epsilon$ by a universal constant.

Finally, we analyze the time complexity. Clearly, the running time of Algorithm 1 is dominated by the first four lines, each of which takes $O(nkd)$ time.
Apart from the steps of building the coresets, the time for the initial tri-criteria approximation is discussed in Section 1.3 and Section A.
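The ring step of Algorithm 1 (Line 5) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it assigns every point of a cluster to its ring $R_i(X, c)$ from Theorem 3.3 but does not perform the merging of rings into groups, it samples without replacement (the paper does not specify this detail here), and the sample size is passed in directly rather than derived from $\epsilon$. All function names are hypothetical.

```python
import math
import random


def ring_index(x, c):
    """Index i with x in ring(c, 2^(i-1), 2^i); None stands for i = -infinity (x == c)."""
    d = math.dist(x, c)
    if d == 0.0:
        return None
    mant, exp = math.frexp(d)  # d = mant * 2**exp with 0.5 <= mant < 1
    # 2^(exp-1) <= d < 2^exp; an exact power of two belongs to the lower ring.
    return exp - 1 if mant == 0.5 else exp


def partition_into_rings(X, c):
    """Group the points of X by their ring index with respect to the center c."""
    rings = {}
    for x in X:
        rings.setdefault(ring_index(x, c), []).append(x)
    return rings


def uniform_sample_ring(R, size, rng):
    """Uniform sample Q_R of a ring R with weights |R| / |Q_R|."""
    size = min(size, len(R))
    Q = rng.sample(R, size)  # assumption: sampling without replacement
    return [(q, len(R) / size) for q in Q]


rng = random.Random(0)
X = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 0.0), (0.0, 4.0), (-3.5, 0.0)]
rings = partition_into_rings(X, (0.0, 0.0))
Q = uniform_sample_ring(rings[2], 2, rng)
print(sorted(rings, key=lambda i: -math.inf if i is None else i), Q)
```

The weights $|R| / |Q_R|$ make the total weight of the sample equal to $|R|$, so the sample stands in for the whole ring when outlier weight is removed.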

4. EXPERIMENTS

We implement our coreset construction algorithm and evaluate its empirical performance on various real datasets. We compare it with several baselines and demonstrate the superior performance of our coreset. In addition, we show that our coresets can significantly speed up approximation algorithms for both the $(k, m)$-ROBUST MEDIAN and $(k, m)$-ROBUST MEANS problems.

Experiment Setup

Our experiments are conducted on publicly available clustering datasets; see Table 1 for a summary of specifications and the choice of parameters. For all datasets, we select numerical features to form a vector in $\mathbb{R}^d$ for each record. For larger datasets, particularly Census1990 and Twitter, we subsample to $10^5$ points so that inefficient baselines can still finish in a reasonable amount of time. Unless otherwise specified, we set $k = 5$ for the number of centers. The number of outliers $m$ is determined on a per-dataset basis, by observing the distribution of distances from points to a near-optimal center set (see Section C for details). All experiments are conducted on a PC with an Intel Core i7 CPU and 16 GB memory, and algorithms are implemented in C++11. We implement our coreset following Algorithm 1 except for a few modifications. The detailed modifications are described in Section C.

Empirical Error

We evaluate the tradeoff between the coreset size and the empirical error under the $(k, m)$-ROBUST MEDIAN objective. In general, for $(k, z, m)$-ROBUST CLUSTERING, given a coreset $S$, define its empirical error for a specific center set $C \subset \mathbb{R}^d$, $|C| = k$, as

$$\varepsilon(S, C) := \frac{|\mathrm{cost}^{(m)}_z(X, C) - \mathrm{cost}^{(m)}_z(S, C)|}{\mathrm{cost}^{(m)}_z(X, C)}.$$

Since it is difficult to exactly verify whether a coreset preserves the objective for all center sets (as required by the definition), we evaluate the empirical error $\varepsilon(S)$ of the coreset $S$ as the maximum empirical error over $\mathcal{C}$, a collection of 500 randomly chosen center sets, i.e., $\varepsilon(S) := \max_{C \in \mathcal{C}} \varepsilon(S, C)$. Note that $\varepsilon(S)$ is defined in a way similar to the worst-case error parameter $\epsilon$ in Definition 2.1.

Baselines

We compare our coreset with the following baselines: a) uniform sampling (US), where we draw $N$ independent uniform samples from $X$ and set the weight $\frac{|X|}{N}$ for each sample, b) outlier-aware uniform sampling (OAUS), where we follow Line 1 and Line 2 of Algorithm 1 to add $m$ outliers $L^*$ to the coreset and sample $N - m$ data points from $X \setminus L^*$ as in the US baseline, and c) sensitivity sampling (SS), the previous coreset construction algorithm of Feldman & Schulman (2012).

Experiment: Size-error Tradeoff

For each coreset algorithm, we run it to construct coresets of varying target sizes $N$, ranging from $m + 300$ to $m + 4800$ with a step size of 500. We evaluate the empirical error $\varepsilon(\cdot)$ and plot the size-error curves in Figure 2 for each baseline and dataset. To make the measurement stable, the coreset construction and evaluation are run 100 times independently and the average is reported. As can be seen from Figure 2, our coreset admits a similar error curve regardless of the dataset, and it achieves about 2.5% error using a coreset of size $m + 800$ (within 2.3%-2.5% of the data size), which is consistent with our theory that the coreset size is $O(m + \mathrm{poly}(k\epsilon^{-1}))$.
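The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the C++ implementation used in the experiments: the way the random center sets are drawn (here, $k$ data points sampled uniformly), the synthetic dataset, and all function names are assumptions. The sketch computes $\varepsilon(S)$ for the US baseline, using the weighted robust cost in which outlier weight is removed from the furthest points first.

```python
import math
import random


def weighted_robust_cost(S, C, m, z=1):
    """cost_z^{(m)}(S, C) for a weighted set S = [(point, weight), ...]."""
    pairs = sorted(
        ((min(math.dist(x, c) for c in C), w) for x, w in S), reverse=True
    )
    remaining, total = float(m), 0.0
    for d, w in pairs:
        removed = min(w, remaining)  # outlier weight taken from this point
        remaining -= removed
        total += (w - removed) * d ** z
    return total


def empirical_error(X, S, center_sets, m, z=1):
    """Maximum relative error of the weighted set S over the given center sets."""
    data = [(x, 1.0) for x in X]
    worst = 0.0
    for C in center_sets:
        full = weighted_robust_cost(data, C, m, z)
        if full == 0.0:
            continue  # relative error undefined; skipped in this sketch
        core = weighted_robust_cost(S, C, m, z)
        worst = max(worst, abs(full - core) / full)
    return worst


def uniform_sampling_baseline(X, N, rng):
    """US baseline: N independent uniform samples, each with weight |X| / N."""
    return [(x, len(X) / N) for x in rng.choices(X, k=N)]


# Synthetic data (hypothetical): a Gaussian cluster plus 20 far-away points.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
X += [(50.0 + i, 50.0) for i in range(20)]
center_sets = [rng.sample(X, 5) for _ in range(50)]  # assumption: centers drawn from X
S = uniform_sampling_baseline(X, 200, rng)
print(empirical_error(X, S, center_sets, m=20))
```

Taking the maximum over a finite collection of center sets only lower-bounds the worst-case error of Definition 2.1, which is why the text describes $\varepsilon(S)$ as an empirical proxy.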
Our coresets outperform all three baselines by a significant margin on every dataset and for every target coreset size. Interestingly, the two baselines SS and US seem to perform similarly, even though the construction of SS (Feldman & Schulman, 2012) is far more costly, since its running time has an exponential dependence on $k + m$, which is already impractical in our setting of parameters. Another interesting finding is that OAUS performs no better than US overall, and both are much worse than ours. This indicates that it is not the added initial outliers $L^*$ (as in Algorithm 1) that lead to the superior performance of our coreset. Finally, we also observe that our coreset has a smaller variance in the empirical error ($\approx 10^{-6}$) compared with the other baselines ($\approx 10^{-4}$).

Experiment: Impact of The Number of Outliers

We also examine the impact of the number of outliers $m$ on the empirical error. The details of this experiment can be found in Section D.

Experiment: Speeding Up Existing Approximation Algorithms

We validate the ability of our coresets to speed up existing approximation algorithms for robust clustering. Due to the space limit, the details and results can be found in Section E.

[...] 1) and achieves $\alpha = O(1)$ (Chen, 2008; Krishnaswamy et al., 2018). [...] 2019)

B TECHNICAL LEMMAS

Lemma B.1 (Generalized triangle inequalities). [...]

1. ([...]) $(a + b)^z \le (1 + \delta)^{z-1} \cdot a^z + (1 + \frac{1}{\delta})^{z-1} \cdot b^z$
2. (Claim 5 of Sohler & Woodruff (2018)) $(a + b)^z \le (1 + \delta) \cdot a^z + (\frac{3z}{\delta})^{z-1} \cdot b^z$

The following lemma is a simple but useful way to bound the error between coresets and datasets; the proof idea is similar to Lemma 3.5 of Braverman et al. (2022).

[...] $w_U(U) = w_V(V) = N$, then for every $C \subset \mathbb{R}^d$, $|C| = k$, we have

$$|\mathrm{cost}_z(U, C) - \mathrm{cost}_z(V, C)| \le \epsilon \cdot \mathrm{cost}_z(U, C) + \left(\frac{6z}{\epsilon}\right)^{z-1} \cdot \left(\mathrm{cost}_z(U, c^*_i) + \mathrm{cost}_z(V, c^*_i)\right). \tag{5}$$

Proof. Since $w_U(U) = w_V(V)$, there must exist a matching $M : U \times V \to \mathbb{R}_{\ge 0}$ between the mass of $U$ and $V$, so that $\forall u \in U$, $\sum_{v \in V} M(u, v) = w_U(u)$ and $\forall v \in V$, $\sum_{u \in U} M(u, v) = w_V(v)$. By the generalized triangle inequality (Lemma B.1), we have

$$\begin{aligned}
|\mathrm{cost}_z(U, C) - \mathrm{cost}_z(V, C)| &\le \sum_{u \in U} \sum_{v \in V} M(u, v)\, |\mathrm{dist}(u, C)^z - \mathrm{dist}(v, C)^z| \\
&\le \sum_{u \in U} \sum_{v \in V} M(u, v) \left( \epsilon \cdot \mathrm{dist}(u, C)^z + \left(\frac{3z}{\epsilon}\right)^{z-1} \cdot |\mathrm{dist}(u, C) - \mathrm{dist}(v, C)|^z \right) \\
&\le \epsilon \sum_{u \in U} w_U(u) \cdot \mathrm{dist}(u, C)^z + \left(\frac{3z}{\epsilon}\right)^{z-1} \cdot \sum_{u \in U} \sum_{v \in V} M(u, v) \cdot (\mathrm{dist}(u, c^*_i) + \mathrm{dist}(v, c^*_i))^z \\
&\le \epsilon \cdot \mathrm{cost}_z(U, C) + \left(\frac{3z}{\epsilon}\right)^{z-1} \cdot \left( \sum_{u \in U} w_U(u) \cdot 2^{z-1} \cdot \mathrm{dist}(u, c^*_i)^z + \sum_{v \in V} w_V(v) \cdot 2^{z-1} \cdot \mathrm{dist}(v, c^*_i)^z \right) \\
&\le \epsilon \cdot \mathrm{cost}_z(U, C) + \left(\frac{6z}{\epsilon}\right)^{z-1} \cdot \left(\mathrm{cost}_z(U, c^*_i) + \mathrm{cost}_z(V, c^*_i)\right).
\end{aligned}$$

C MORE

[...] and plot the distribution of distances from data points to the found near-optimal centers. As shown in Figure 3, every dataset admits a clear breaking point that defines the outliers, and we pick $m$ accordingly.

Implementation Details

Our coreset implementation mostly follows Algorithm 1 except for a few modifications. For efficiency, we use a near-linear time algorithm by Bhaskara et al. (2019) to compute an $(O(1), O(1), O(1))$-approximation (as required by Algorithm 1), but we still add $m$ outliers to the coreset (in Line 2) instead of adding all the found ones. Moreover, since it is more practical to directly set the target coreset size $N$ (instead of solving for $N$ from $\epsilon$), we modify the algorithm so that the generated coreset has exactly $N$ points.
Specifically, the coreset size is affected by two key parameters: a threshold, denoted t, that determines how the rings and groups are formed in the construction of Theorem 3.3 (whose details can be found in Braverman et al. (2022)), and the size of each uniform sample (used in Line 5), denoted s. Here, we heuristically set $t = O\left(\frac{1}{N - m}\right)$ and solve for s such that the total size equals N.
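The step of solving for s can be pictured as a budget calculation. The sketch below is not the authors' code. It assumes, purely for illustration, that the final coreset consists of the m outliers, at most s sampled points from each ring (a ring with fewer than s points is kept whole), and two points per group, and it searches for the largest s whose total stays within N. The function name `solve_sample_size` and the arguments `ring_sizes` and `num_groups` are hypothetical.

```python
def solve_sample_size(N, m, ring_sizes, num_groups):
    """Largest per-ring sample size s such that
    m + sum(min(s, |R|) for rings R) + 2 * num_groups <= N.

    Assumed size accounting (hypothetical, for illustration only):
    m outliers, min(s, |R|) points per ring, two points per group.
    """
    budget = N - m - 2 * num_groups
    if budget < 0:
        raise ValueError("target size N is too small for m outliers and the groups")
    lo, hi = 0, max(ring_sizes, default=0)
    # The total sample count is non-decreasing in s, so binary search applies.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(min(mid, r) for r in ring_sizes) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Example: 3 rings, 10 groups, m = 100 outliers, target size N = 800.
s = solve_sample_size(800, 100, [1000, 1000, 50], 10)
```

In this example the budget for ring samples is 800 - 100 - 20 = 680, and s = 315 uses it exactly (315 + 315 + 50). When the budget cannot be met exactly, the returned s leaves a small remainder.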

D EXPERIMENT: IMPACT OF THE NUMBER OF OUTLIERS

We examine the impact of the number of outliers m on the empirical error. Specifically, we vary m while keeping N − m fixed, where N − m is the number of "samples" besides the included outliers L* in our algorithm. We pick a typical value of N − m = 800 based on the curves of Figure 2. We plot this outlier-error curve in Figure 4, and we observe that while some of our baselines have a fluctuating empirical error, the error curve of our coreset is relatively stable. This suggests that the empirical error of our coreset is mainly determined by the number of additional samples N − m, and is mostly independent of the number of outliers m itself.
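For concreteness, the quantities in this experiment can be computed as in the following sketch. It assumes that the empirical error of a weighted coreset S at a center set C is the relative difference |cost(S, C) − cost(X, C)| / cost(X, C), matching the ϵ-relative error in the coreset definition. The helper names `robust_cost` and `empirical_error` are introduced here for illustration only, and points are plain tuples of floats.

```python
import math


def robust_cost(points, weights, centers, m, z):
    """Weighted (k, z, m)-robust cost: remove total weight m from the points
    furthest from the centers, then sum weight * dist^z over what is left."""
    # Distance from each point to its nearest center, paired with its weight.
    dists = [(min(math.dist(p, c) for c in centers), w)
             for p, w in zip(points, weights)]
    dists.sort(key=lambda dw: dw[0], reverse=True)  # furthest first
    remaining = m  # outlier weight still to be removed
    total = 0.0
    for d, w in dists:
        removed = min(w, remaining)  # a weighted point may be removed partially
        remaining -= removed
        total += (w - removed) * d ** z
    return total


def empirical_error(X, S, w_S, centers, m, z):
    """Relative error of the weighted set (S, w_S) against the unweighted
    dataset X at one center set (assumed definition; needs nonzero cost on X)."""
    full = robust_cost(X, [1.0] * len(X), centers, m, z)
    approx = robust_cost(S, w_S, centers, m, z)
    return abs(approx - full) / full
```

For example, with X = {0, 1, 2, 10} on the line, a single center at 1, m = 1 and z = 1, the point 10 is discarded and the robust cost is 1 + 0 + 1 = 2.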

E EXPERIMENT: SPEEDING UP EXISTING APPROXIMATION ALGORITHMS

We validate the ability of our coresets to speed up existing approximation algorithms for robust clustering. We consider two natural algorithms and run them on top of our coreset for speedup: a Lloyd-style algorithm tailored to (k, m)-ROBUST MEANS (Chawla & Gionis, 2013) seeded by a modified k-MEANS++ for robust clustering (Bhaskara et al., 2019), which we call "LL", and a local search algorithm for (k, m)-ROBUST MEDIAN (Friggstad et al., 2019), which we call "LS". We note that for LS, we uniformly sample 100 points from the dataset and use them as the only potential centers, since otherwise it takes too long to run on the original dataset (without coresets). We use a coreset of size m + 500 for each dataset (recalling that m is picked per dataset according to Figure 3) to speed up the algorithms. To make a consistent comparison, we measure the clustering costs on the original dataset for all runs (instead of on the coreset). We report in Table 2 the running time and the cost achieved by LL and LS, with and without coresets. The results show that the error incurred by using the coreset is tiny (< 5%), while the speedup is a significant 80x-250x for LL and 100x-200x for LS. Even taking the coreset construction time into consideration, we still achieve a 10x-30x speedup for LL and an 80x-140x speedup for LS. We conclude that our coreset drastically improves the running time of existing approximation algorithms, while incurring only a negligible error.

Figure 4: The impact of the number of outliers m on the empirical error.

F PROOF OF LEMMA 3.6: ERROR ANALYSIS OF UNIFORM SAMPLING

As with recent works in Euclidean coresets (Cohen-Addad et al., 2021a; b; 2022; Braverman et al., 2022), we make use of an iterative size reduction Braverman et al.
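(2021) …

If $t < \left(1 - \left(\frac{\epsilon}{12z}\right)^z\right) \cdot |R|$, using (7), (8), Fact F.1 and the generalized triangle inequality (Lemma B.1), we have

$$\begin{aligned}
|\mathrm{cost}_z^{(t)}(R, C) - \mathrm{cost}_z^{(t)}(Q_R, C)|
&\le \int_{T_{\mathrm{close}}}^{T_{\mathrm{far}}} z u^{z-1} \cdot \left| w_R(\mathrm{Balls}(C, u) \cap R) - w_{Q_R}(\mathrm{Balls}(C, u) \cap Q_R) \right| \, du \\
&\le \left(\frac{\epsilon}{12z}\right)^z \cdot |R| \cdot \left(T_{\mathrm{far}}^z - T_{\mathrm{close}}^z\right) \\
&\le \left(\frac{\epsilon}{12z}\right)^z \cdot |R| \cdot \left( \epsilon \cdot T_{\mathrm{close}}^z + \left(\frac{3z}{\epsilon}\right)^{z-1} \cdot (4r)^z \right) \\
&\le \epsilon \cdot \left(\frac{\epsilon}{12z}\right)^z \cdot |R| \cdot T_{\mathrm{close}}^z + \epsilon r^z |R| \\
&\le \epsilon \cdot \mathrm{cost}_z^{(t)}(R, C) + \epsilon r^z |R|,
\end{aligned}$$

where the fourth inequality uses $\left(\frac{\epsilon}{12z}\right)^z \left(\frac{3z}{\epsilon}\right)^{z-1} (4r)^z = \frac{\epsilon r^z}{3z} \le \epsilon r^z$, and the last inequality uses the fact that $\mathrm{cost}_z^{(t)}(R, C) \ge (|R| - t) \cdot T_{\mathrm{close}}^z \ge \left(\frac{\epsilon}{12z}\right)^z \cdot |R| \cdot T_{\mathrm{close}}^z$.

We are ready to prove Lemma 3.6.

Proof of Lemma 3.6. Fix a center set $C \subset \mathbb{R}^d$, $|C| = k$. By Lemma F.3, the sample size in Line 5 of Algorithm 1 implies that $Q_R$ is an $\left(\frac{\epsilon}{12z}\right)^z$-approximation of the k-balls range space on $R$ for every $R \in \bigcup_{i \in [\beta k]} \mathcal{R}_i$. By Lemma F.4 and the union bound, with probability at least 0.9, for every $i \in [\beta k]$, for every ring $R \in \mathcal{R}_i$, and for every $e \in [0, |R|]$,

$$|\mathrm{cost}_z^{(e)}(R, C) - \mathrm{cost}_z^{(e)}(Q_R, C)| \le \epsilon \cdot \mathrm{cost}_z^{(e)}(R, C) + \epsilon \cdot \mathrm{cost}_z(R, c_i^*). \tag{9}$$

Let $L$ denote the set of $t$ outliers of $R_{\mathrm{all}}$ with respect to $C$. By (9), for every $R \in \bigcup_{i \in [\beta k]} \mathcal{R}_i$, there exists a weighted subset $T_R \subset Q_R$ such that $w_{T_R}(T_R) = w_L(L \cap R)$ and

$$\mathrm{cost}_z(Q_R - T_R, C) \le (1 + \epsilon) \cdot \mathrm{cost}_z(R - (L \cap R), C) + \epsilon \cdot \mathrm{cost}_z(R, c_i^*).$$

Summing over all $R \in \bigcup_{i \in [\beta k]} \mathcal{R}_i$, we know that

$$\begin{aligned}
\mathrm{cost}_z^{(t)}(Q, C)
&\le \sum_{i \in [\beta k]} \sum_{R \in \mathcal{R}_i} \mathrm{cost}_z(Q_R - T_R, C) \\
&\le \sum_{i \in [\beta k]} \sum_{R \in \mathcal{R}_i} \left( (1 + \epsilon) \cdot \mathrm{cost}_z(R - (L \cap R), C) + \epsilon \cdot \mathrm{cost}_z(R, c_i^*) \right) \\
&= (1 + \epsilon) \cdot \mathrm{cost}_z(R_{\mathrm{all}} - L, C) + \epsilon \cdot \mathrm{cost}_z(R_{\mathrm{all}}, C^*) \\
&= (1 + \epsilon) \cdot \mathrm{cost}_z^{(t)}(R_{\mathrm{all}}, C) + \epsilon \cdot \mathrm{cost}_z(R_{\mathrm{all}}, C^*).
\end{aligned}$$

In the same way, we can show that $\mathrm{cost}_z^{(t)}(R_{\mathrm{all}}, C) \le (1 + \epsilon) \cdot \mathrm{cost}_z^{(t)}(Q, C) + \epsilon \cdot \mathrm{cost}_z(R_{\mathrm{all}}, C^*)$. This finishes the proof.

G PROOF OF LEMMA 3.7: ERROR ANALYSIS OF TWO-POINT CORESETS

Throughout this section, we fix a center set $C \subset \mathbb{R}^d$, $|C| = k$, and prove the coreset property of $D$ with respect to $C$.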
To analyze the error of the two-point coreset for $G_{\mathrm{all}}$, we further decompose all groups into colored groups and uncolored groups based on the position of $C$, as in the following Lemma G.1, which was also considered in Braverman et al. (2022). Furthermore, inside our proof, we also consider a more refined type of groups called special groups. An overview illustration of these groups and other relevant notions can be found in Figure 5.

Lemma G.1 (Colored groups and uncolored groups (Braverman et al., 2022)). For a center set $C \subset \mathbb{R}^d$, $|C| = k$, a collection of groups $\mathcal{G}_i$ can be further divided into colored groups and uncolored groups with respect to $C$ such that

1. there are at most $O(k \log \frac{z}{\epsilon})$ colored groups, and
2. for every uncolored group $G \in \mathcal{G}_i$ and every $u \in C$, either $\forall p \in G$, $\mathrm{dist}(u, c_i^*) < \frac{\epsilon}{9z} \cdot \mathrm{dist}(p, c_i^*)$, or $\forall p \in G$, $\mathrm{dist}(u, c_i^*) > \frac{24z}{\epsilon} \cdot \mathrm{dist}(p, c_i^*)$.

Let $G \in \mathcal{G}_i$ be an uncolored group with respect to $C$. Lemma G.1 implies that the center set $C$ can be decomposed into a "close" portion and a "far" portion with respect to $G$, as in the following Definition G.2.

Definition G.2 (Braverman et al. (2022)). For a center set $C$, assume $G \in \mathcal{G}_i$ is an uncolored group with respect to $C$. Define

$$C^G_{\mathrm{far}} = \left\{ u \in C \mid \forall p \in G, \ \mathrm{dist}(u, c_i^*) > \tfrac{24z}{\epsilon} \, \mathrm{dist}(p, c_i^*) \right\}, \qquad C^G_{\mathrm{close}} = \left\{ u \in C \mid \forall p \in G, \ \mathrm{dist}(u, c_i^*) < \tfrac{\epsilon}{9z} \cdot \mathrm{dist}(p, c_i^*) \right\}.$$

Remark that $C = C^G_{\mathrm{far}} \cup C^G_{\mathrm{close}}$. … $(G_i, c_i^*) \le \left(\frac{\epsilon}{6z}\right)^z \cdot \frac{\mathrm{cost}_z(P_i, c_i^*)}{k \log(24z/\epsilon)}$, we can obtain the following inequality.

Lemma G.3 (Robust variant of (Braverman et al., 2022, Lemma 3.5)). For a group $G \in \mathcal{G}_i$, assume $(U, w_U)$ and $(V, w_V)$ are two weighted subsets of $G$ such that $w_U(U) = w_V(V)$. Then for every $C \subset \mathbb{R}^d$, $|C| = k$,

$$|\mathrm{cost}_z(U, C) - \mathrm{cost}_z(V, C)| \le \epsilon \cdot \mathrm{cost}_z(U, C) + \epsilon \cdot \frac{\mathrm{cost}_z(P_i, c_i^*)}{2k \log(z/\epsilon)}. \tag{10}$$

Lemma G.4. Let $G$ denote an uncolored group with respect to $C$. Suppose $(U, w_U)$ and $(V, w_V)$ are two weighted subsets of $G$ such that one of the following holds:

1. either $C^G_{\mathrm{close}} \ne \emptyset$ and $\mathrm{cost}_z(U, c_i^*) = \mathrm{cost}_z(V, c_i^*)$,
2. or $C^G_{\mathrm{close}} = \emptyset$ and $w_U(U) = w_V(V)$.

Then we have

$$\mathrm{cost}_z(U, C) \in (1 \pm \epsilon) \, \mathrm{cost}_z(V, C). \tag{11}$$

Proof.
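If $C^G_{\mathrm{close}} \ne \emptyset$, by the property of uncolored groups in Lemma G.1, we know that $\forall x \in G$, $\mathrm{dist}(x, C) \in (1 \pm \frac{\epsilon}{3z}) \cdot \mathrm{dist}(x, c_i^*)$. So we have $\mathrm{cost}_z(U, C) \in (1 \pm \epsilon) \, \mathrm{cost}_z(U, c_i^*)$ and $\mathrm{cost}_z(V, C) \in (1 \pm \epsilon) \, \mathrm{cost}_z(V, c_i^*)$. Combining these two bounds with $\mathrm{cost}_z(U, c_i^*) = \mathrm{cost}_z(V, c_i^*)$ and rescaling $\epsilon$, we obtain (11).

In the other case, if $C^G_{\mathrm{close}} = \emptyset$, Lemma G.1 implies $\forall x \in G$, $\mathrm{dist}(x, C) > \frac{9z}{\epsilon} \cdot \mathrm{dist}(x, c_i^*)$. By the triangle inequality, we know that $\mathrm{dist}(x, C) \in (1 \pm \frac{\epsilon}{3z}) \, \mathrm{dist}(c_i^*, C)$. So we have $\mathrm{cost}_z(U, C) \in (1 \pm \epsilon) \cdot w_U(U) \cdot \mathrm{cost}_z(c_i^*, C)$ and $\mathrm{cost}_z(V, C) \in (1 \pm \epsilon) \cdot w_V(V) \cdot \mathrm{cost}_z(c_i^*, C)$. Moreover, since $w_U(U) = w_V(V)$, we conclude (11) by rescaling $\epsilon$.

We are ready to prove Lemma 3.7.

Proof of Lemma 3.7. It suffices to prove the following two directions separately,

$$\mathrm{cost}_z^{(t)}(D, C) \le (1 + \epsilon) \, \mathrm{cost}_z^{(t)}(G_{\mathrm{all}}, C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*), \tag{12}$$

$$\mathrm{cost}_z^{(t)}(G_{\mathrm{all}}, C) \le (1 + \epsilon) \, \mathrm{cost}_z^{(t)}(D, C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*), \tag{13}$$

and rescale $\epsilon$.

Proof of (12). Let $(L, w_L)$ denote the outliers of $G_{\mathrm{all}}$ with respect to $C$. Namely, $L \subset G_{\mathrm{all}}$, $w_L(L) = t$ and $\mathrm{cost}_z(G_{\mathrm{all}} - L, C) = \mathrm{cost}_z^{(t)}(G_{\mathrm{all}}, C)$. It suffices to find a weighted subset $(T, w_T)$ of $D$ such that $w_T(T) = t$ and

$$\mathrm{cost}_z(D - T, C) \le (1 + \epsilon) \, \mathrm{cost}_z(G_{\mathrm{all}} - L, C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*). \tag{14}$$

We define $T$ as follows. Recall that $G_{\mathrm{all}} = \bigcup_{i \in [\beta k]} \bigcup_{G \in \mathcal{G}_i} G$. For every $G \in \mathcal{G}_i$, we add $\{p^G_{\mathrm{close}}, p^G_{\mathrm{far}}\}$ into $T$ and set

$$w_T(p^G_{\mathrm{close}}) = \sum_{x \in L \cap G} \lambda_x, \qquad w_T(p^G_{\mathrm{far}}) = \sum_{x \in L \cap G} (1 - \lambda_x),$$

where we recall that $\lambda_x$ is the unique number in $[0, 1]$ such that $\mathrm{dist}^z(x, c_i^*) = \lambda_x \cdot \mathrm{dist}^z(p^G_{\mathrm{close}}, c_i^*) + (1 - \lambda_x) \cdot \mathrm{dist}^z(p^G_{\mathrm{far}}, c_i^*)$.

If $G$ is a colored group, we apply Lemma G.3 to obtain

$$\mathrm{cost}_z(D_G - (T \cap D_G), C) \le (1 + \epsilon) \, \mathrm{cost}_z(G - (L \cap G), C) + \epsilon \cdot \frac{\mathrm{cost}_z(P_i, c_i^*)}{2k \log(z/\epsilon)}. \tag{15}$$

Now suppose $G$ is an uncolored group. Observe that by construction, $w_T(T \cap D_G) = w_L(L \cap G)$ and $\mathrm{cost}_z(T \cap D_G, c_i^*) = \mathrm{cost}_z(L \cap G, c_i^*)$.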
Applying Lemma G.4 to $D_G - (T \cap D_G)$ and $G - (L \cap G)$, we obtain

$$\mathrm{cost}_z(D_G - (T \cap D_G), C) \le (1 + \epsilon) \, \mathrm{cost}_z(G - (L \cap G), C). \tag{16}$$

By Lemma G.1, there are at most $k \log(z/\epsilon)$ colored groups in each cluster $P_i$. Combining this with (15) and (16), we have

$$\begin{aligned}
\mathrm{cost}_z(D - T, C)
&= \sum_{i \in [\beta k]} \sum_{G \in \mathcal{G}_i} \mathrm{cost}_z(D_G - (T \cap D_G), C) \\
&\le \sum_{i \in [\beta k]} \sum_{G \in \mathcal{G}_i} (1 + \epsilon) \, \mathrm{cost}_z(G - (L \cap G), C) + k \log(z/\epsilon) \sum_{i \in [\beta k]} \epsilon \cdot \frac{\mathrm{cost}_z(P_i, c_i^*)}{2k \log(z/\epsilon)} \\
&\le (1 + \epsilon) \, \mathrm{cost}_z(G_{\mathrm{all}} - L, C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*),
\end{aligned}$$

which is (14).

Proof of (13). Let $(T, w_T)$ denote the outliers of $D$ (of total weight $w_T(T) = t$) with respect to $C$. Namely, $\mathrm{cost}_z^{(t)}(D, C) = \mathrm{cost}_z(D - T, C)$. It suffices to find a weighted subset $(L, w_L)$ of $G_{\mathrm{all}}$ such that $w_L(L) = t$ and $\mathrm{cost}_z(G_{\mathrm{all}} - L, C) \le (1 + \epsilon) \cdot \mathrm{cost}$ …

In other words, $L_G$ is the subset of $G$ of total weight $m_G$ that is furthest from $C$. Add $L_G$ into $L$ and set $w_L(x) = w_{L_G}(x)$ for every $x \in L_G$. We prove that $L$ satisfies (17) by the following case study.

• If $G$ is a colored group, we simply apply Lemma G. …

Combining (18), (19), (21), (22), and the fact that there are at most $k \log(z/\epsilon)$ colored groups and 2 special groups in each $\mathcal{G}_i$, we have …

Proof. For the sake of contradiction, assume there are 3 special uncolored groups $G_1$, $G_2$, and $G_3$ in cluster $P_i$. Assume w.l.o.g. that $G_1$ is the furthest from the center $c_i^*$ and $G_3$ is the closest. Since $G_1$ is a special uncolored group, we know that $C^{G_1}_{\mathrm{close}} \ne \emptyset$, so $\forall x \in G_1$, $\mathrm{dist}(x, C) \in (1 \pm \epsilon) \cdot \mathrm{dist}(x, c_i^*)$. In particular, there exists an inlier $y_1 \in D_{G_1}$ such that $\mathrm{dist}(y_1, C) \ge (1 - \epsilon) \cdot \mathrm{dist}(y_1, c_i^*)$. Similarly, there exists an outlier $y_3 \in G_3$ such that $\mathrm{dist}(y_3, C) \le (1 + \epsilon) \cdot \mathrm{dist}(y_3, c_i^*)$. However, $G_1$, $G_2$ and $G_3$ are disjoint groups, each of which is a union of consecutive rings.
So $\mathrm{dist}(y_1, c_i^*) \ge 2 \, \mathrm{dist}(y_3, c_i^*)$, and this implies

$$\mathrm{dist}(y_1, C) \ge (1 - \epsilon) \cdot \mathrm{dist}(y_1, c_i^*) \ge 2(1 - \epsilon) \cdot \mathrm{dist}(y_3, c_i^*) > (1 + \epsilon) \cdot \mathrm{dist}(y_3, c_i^*) \ge \mathrm{dist}(y_3, C),$$

where we have used that $\epsilon < 0.3$. However, this contradicts the fact that $y_1$ is an inlier while $y_3$ is an outlier.
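The two-point coreset of Definition 3.4, used throughout this section, can be illustrated with a small sketch. It assumes a group G with unit weights and takes the weights of the two retained points to be the sums of λ and of 1 − λ over G, which mirrors the construction of T in the proof of (12); this weighting is an inference from that proof and not a quotation of the definition. The function name `two_point_coreset` is hypothetical.

```python
import math


def two_point_coreset(group, c, z):
    """Return ((p_close, w_close), (p_far, w_far)) for a unit-weight group.

    For each p, lambda_p in [0, 1] solves
        dist(p, c)^z = lambda_p * dist(p_close, c)^z + (1 - lambda_p) * dist(p_far, c)^z.
    Assumed weighting: w_close = sum of lambda_p, w_far = sum of (1 - lambda_p).
    """
    d = [math.dist(p, c) for p in group]
    i_close = min(range(len(group)), key=d.__getitem__)
    i_far = max(range(len(group)), key=d.__getitem__)
    a, b = d[i_close] ** z, d[i_far] ** z
    w_close = w_far = 0.0
    for dist_p in d:
        # If all points are equidistant from c, any lambda works; use 1.
        lam = 1.0 if a == b else (b - dist_p ** z) / (b - a)
        w_close += lam
        w_far += 1.0 - lam
    return (group[i_close], w_close), (group[i_far], w_far)


# Example: three points on the line, center at the origin, z = 2.
(p_c, w_c), (p_f, w_f) = two_point_coreset([(1.0,), (2.0,), (3.0,)], (0.0,), 2)
```

In the example, the two weights sum to |G| = 3 and the weighted cost to c equals 1 + 4 + 9 = 14, so both the total weight and the cost with respect to the cluster center are preserved; these are the two quantities that the conditions of Lemma G.4 compare.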

H LOWER BOUNDS

We show in Theorem H.1 that the factor m is necessary in the coreset size, even for the very simple case of k = 1 and one dimension, for (k, m)-ROBUST MEDIAN.

Theorem H.1. For every integer $m \ge 1$, there exists a dataset $X \subset \mathbb{R}$ of $n \ge m$ points, such that for every $0 < \epsilon < 0.5$, any ϵ-coreset for (1, m)-ROBUST MEDIAN must have size $\Omega(m)$.

Proof. Fix $0 < \epsilon < 0.5$. Consider the following instance $X = \{x_0, \ldots, x_m\} \subset \mathbb{R}^1$ of size $n = m + 1$, where $x_0 = 0$ and $x_i = i$ for $i \in [m]$. Suppose $(S, w_S)$ is an ϵ-coreset for (1, m)-ROBUST MEDIAN. We first claim that $w_S(S) \ge m + 1 - \epsilon$. This can be verified by letting the center $c \to +\infty$, and we have cost … which is a contradiction. Hence, either $x_{i-1}$ or $x_i$ must be contained in $S$. It is not hard to conclude that $|S| \ge \frac{m-1}{2}$, which completes the proof.
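The instance in this proof is easy to check numerically. The following sketch is only an illustration and not part of the argument: it evaluates the robust 1-median cost of X at each midpoint between consecutive points, and then at one midpoint for a hypothetical candidate set S that contains neither neighbor. The weights of S (1.5 each, for a total of m + 1) are an arbitrary choice made here, and the function name `robust_median_cost_1d` is hypothetical.

```python
def robust_median_cost_1d(points, weights, c, m):
    """Weighted (1, m)-robust median cost on the line: remove total weight m
    from the points furthest from c and sum weight * |x - c| over the rest."""
    pairs = sorted(((abs(x - c), w) for x, w in zip(points, weights)),
                   reverse=True)  # furthest first
    remaining, total = m, 0.0
    for d, w in pairs:
        removed = min(w, remaining)
        remaining -= removed
        total += (w - removed) * d
    return total


m = 5
X = list(range(m + 1))  # x_0 = 0, x_i = i
# Cost of the full dataset at each midpoint between x_{i-1} and x_i.
midpoint_costs = [robust_median_cost_1d(X, [1.0] * len(X), i - 0.5, m)
                  for i in range(1, m + 1)]

# Hypothetical candidate that contains neither x_2 nor x_3.
S, w_S = [0, 1, 4, 5], [1.5] * 4
cost_S = robust_median_cost_1d(S, w_S, 2.5, m)
```

Every entry of `midpoint_costs` equals 0.5, the distance from the midpoint to its nearest neighbor, whereas `cost_S` equals 1.5. Since (1 + ϵ) · 0.5 < 0.75 for ϵ < 0.5, this particular S cannot be an ϵ-coreset, in line with the claim that one of the two neighbors must be kept.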



Here, if $x \notin Y$, we let $w_Y(x) = 0$.

In Braverman et al. (2022), they mark some of the rings (which they call heavy rings), then group the remaining (unmarked) rings into groups. Our notion of ring corresponds to their "marked ring", and our group is the same as theirs. However, we do not need the concept of unmarked rings explicitly, since we only need to deal with the groups that are formed from them (and the construction of groups follows from a black box in Braverman et al. (2022)).

Bhaskara et al. (2019) only showed the case of z = 2, but we check that it also generalizes to other values of z.



also proposed a PTAS for (k, m)-ROBUST MEDIAN. For (k, m)-ROBUST MEANS, Gupta et al. (2017) designed a bi-criteria approximation algorithm with violations on m. Krishnaswamy et al. (2018) first proposed a constant-factor approximation algorithm, and the approximation ratio was improved to 6 + ϵ by Feng et al. (2019). For general (k, z, m)-ROBUST CLUSTERING, Friggstad et al. (2019) achieved an $O(z^z)$-approximate solution with (1 + ϵ)k centers. Due to the wide range of applications, scalable algorithms have also been designed for (k, m)-ROBUST MEANS (Bhaskara et al., 2019; Deshpande et al., 2020) besides the theoretical study; these may have worse provable guarantees but are more efficient in practice.

Coresets for Clustering There is a large body of work that studies coreset construction for vanilla (k, z)-CLUSTERING in $\mathbb{R}^d$ (Har-Peled & Mazumdar, 2004; Feldman & Langberg, 2011; Braverman et al., 2016; Huang et al., 2018; Cohen-Addad et al., 2021b; 2022). The state-of-the-art result for general (k, z)-CLUSTERING is by Cohen-Addad et al. (2022), where the coreset size

Figure 1: Illustration of Theorem 3.3 (plotted distance is the logarithm of the real distance).

Definition 3.4 (Construction of two-point coreset (Braverman et al., 2022)). For a group $G \subset \mathbb{R}^d$ and a center point $c \in \mathbb{R}^d$, let $p^G_{\mathrm{far}}$ and $p^G_{\mathrm{close}}$ denote the furthest and closest points to $c$ in $G$. For every $p \in G$, compute the unique λ

Lemma 3.7 (Two-point coresets for groups). Let $D = \bigcup_{i \in [\beta k]} \bigcup_{G \in \mathcal{G}_i} D_G$ denote the two-point coresets of the groups $G_{\mathrm{all}} = \bigcup_{i \in [\beta k]} \bigcup_{G \in \mathcal{G}_i} G$, as in Line 6 of Algorithm 1. Then for every $t$, $0 \le t \le m$, $D$ is an $(\epsilon, \mathrm{cost}_z(P \setminus L^*, C^*))$-coreset of $G_{\mathrm{all}}$ for (k, z, t)-ROBUST CLUSTERING.

Proof. The proof can be found in Section G.

Proof of Theorem 3.1. Fix a center set $C \subset \mathbb{R}^d$, $|C| = k$, and fix a $t \in [0, m]$. We first prove that

Figure 2: The tradeoff between the coreset size and the empirical error.

ALGORITHMS FOR TRI-CRITERIA APPROXIMATION

Various known algorithms that offer different tradeoffs may be used for the required (α, β, γ)-approximation. In particular, Friggstad et al. (2019) designed a polynomial-time algorithm, $A(n, k, d, z) = n^{O(1)}$, with $\alpha = O(2^z)$, $\beta = O(1)$, $\gamma = 1$; Bhaskara et al. (2019) gave a near-linear time algorithm, $A(n, k, d, z) = \tilde{O}(nkd)$, with $\alpha = O(2^{O(z)})$, $\beta = O(1)$, $\gamma = O(1)$ (which implies the statement in Theorem 1.1).³ Finally, true approximation algorithms, i.e., $\beta = \gamma = 1$, are known for both (k, m)-ROBUST MEDIAN and (k, m)-ROBUST MEANS; they run in polynomial time $A(n, k, d, z) = n^{O(1)}$ and achieve $\alpha = O(1)$ (Chen, 2008; Krishnaswamy et al., 2018).

Lemma B.1 (Generalized triangle inequalities). Let a, b ≥ 0 and δ ∈ (0, 1), then for z ≥ 1, 1. (Lemma A.1 of Makarychev et al. (

Figure 3: Distances to the found near-optimal center (using a vanilla clustering algorithm) for each point, sorted decreasingly and rescaled to [0, 1].

Figure 5: An illustration of the decomposition into colored, uncolored and special groups with respect to C = C far ∪ C close , where the radii of balls are taken the logarithm.

… $\mathrm{cost}_z(D - T, C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*)$. (17)

We construct $L$ as follows. For every $i \in [\beta k]$ and every $G \in \mathcal{G}_i$, let $m_G = w_T(T \cap D_G)$ and let $(L_G, w_{L_G})$ denote a weighted subset of $G$ such that $\mathrm{cost}_z^{(m_G)}(G, C) = \mathrm{cost}_z(G - L_G, C)$.

… 3 to obtain

$$\mathrm{cost}_z(G - L_G, C) \le (1 + \epsilon) \, \mathrm{cost}_z(D_G - (T \cap D_G), C) + \epsilon \cdot \frac{\mathrm{cost}_z(P_i, c_i^*)}{2k \log(z/\epsilon)}. \tag{18}$$

• If $G$ is an uncolored group and $C^G_{\mathrm{close}} = \emptyset$, by Lemma G.4 we know that

$$\mathrm{cost}_z(G - L_G, C) \le (1 + \epsilon) \, \mathrm{cost}_z(D_G - (T \cap D_G), C). \tag{19}$$

• If $G$ is an uncolored group, $C^G_{\mathrm{close}} \ne \emptyset$, and $m_G \in \{0, |G|\}$, note that in this case $L_G = G$ or $L_G = \emptyset$. So we have

$$\mathrm{cost}_z(G - L_G, c_i^*) = \mathrm{cost}_z(D_G - (T \cap D_G), c_i^*) \tag{20}$$

by the fact that $D_G$ is the two-point coreset of $G$, satisfying Definition 3.4. So in this case the conditions of Lemma G.4 are satisfied, and we have

$$\mathrm{cost}_z(G - L_G, C) \le (1 + \epsilon) \, \mathrm{cost}_z(D_G - (T \cap D_G), C). \tag{21}$$

• If $G$ is an uncolored group, $C^G_{\mathrm{close}} \ne \emptyset$, and $m_G \notin \{0, |G|\}$, we call such a group a special uncolored group and prove in Lemma G.5 that there are at most 2 special groups in every $\mathcal{G}_i$. (See Figure 5 for an illustration.) Then we use Lemma G.3 to obtain

$$\mathrm{cost}_z(G - L_G, C) \le (1 + \epsilon) \, \mathrm{cost}_z(D_G - (T \cap D_G), C) + \epsilon \cdot \frac{\mathrm{cost}_z(P_i, c_i^*)}{2k \log(z/\epsilon)}. \tag{22}$$

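$$\begin{aligned}
\mathrm{cost}_z(G_{\mathrm{all}} - L, C)
&= \sum_{i \in [\beta k]} \sum_{G \in \mathcal{G}_i} \mathrm{cost}_z(G - L_G, C) \\
&\le (1 + \epsilon) \sum_{i \in [\beta k]} \sum_{G \in \mathcal{G}_i} \mathrm{cost}_z(D_G - (T \cap D_G), C) + (k \log(z/\epsilon) + 2) \cdot \sum_{i \in [\beta k]} \epsilon \cdot \frac{\mathrm{cost}_z(P_i, c_i^*)}{2k \log(z/\epsilon)} \\
&\le (1 + \epsilon) \, \mathrm{cost}_z(D - T, C) + \epsilon \cdot \mathrm{cost}_z(P \setminus L^*, C^*).
\end{aligned}$$

Lemma G.5. For a center set $C \subset \mathbb{R}^d$, $|C| = k$, in every $\mathcal{G}_i$ there are at most 2 special uncolored groups with respect to $C$.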

… $c) = w_S(S) - m \in 1 \pm \epsilon$. Next, let $c = \frac{x_{i-1} + x_i}{2}$ for some $i \in [m + 1]$, which implies that $\mathrm{cost}_1^{(m)}(X, c) = |x_i - c| = 0.5$, i.e., the distance to the nearest neighbor of $c$ in $X$. Suppose both $x_{i-1}$ and $x_i$ are not in $S$ and we have cost …

… $\arg\min_{|L| = \gamma m} \mathrm{cost}_z(P \setminus L, C^*)$ denote the set of $\gamma m$ outliers
2: add $L^*$ into $S$ and set $\forall x \in L^*$, $w_S(x) \leftarrow 1$
3: partition $P \setminus L^*$ into $\beta k$ clusters $P_1, \ldots, P_{\beta k}$ such that $P_i$ is the subset of $P \setminus L^*$ closest to $c_i^*$
4: for each $i \in [\beta k]$, apply the decomposition of Theorem 3.3 to $(P_i, c_i^*)$ and obtain a collection $\mathcal{R}_i$ of disjoint rings and a collection $\mathcal{G}_i$ of disjoint groups
5: for $i \in [\beta k]$ and every ring $R$ …

To this end, assume L ⊂ P is the set of outliers for C with |L| = t.

Specifications of datasets and the choice of the parameters.

Table 2: Running time and costs for LL and LS with/without coresets. TX and TS are the running times without/with the coreset, respectively. Similarly, cost and cost′ are the clustering costs without/with the coreset. TC is the coreset construction time. The entire experiment is repeated 10 times and the average is reported.

and a terminal embedding If t < (1 -( ϵ 12z ) z ) • |R|, using (7), (8), Fact F.1 and the generalized triangle inequality Lemma B.1, we have, | cost (t) z

a collection of groups G i can be further divided into colored groups and uncolored groups with respect to C such that 1. there are at most O(k log z ϵ ) colored groups and 2. for every uncolored group

ACKNOWLEDGMENTS

Research is partially supported by a national key R&D program of China No. 2021YFA1000900, a startup fund from Peking University, and the Advanced Institute of Information Technology, Peking University.


technique Narayanan & Nelson (2019), which allows us to trade the factor $O(d)$ in the coreset size bound for a factor of $O\left(\frac{\log(k/\epsilon)}{\epsilon^2}\right)$. Hence, it suffices to prove that a uniform sample of size $\tilde{O}\left(\frac{kd}{\epsilon^{2z}}\right)$ yields the desired coreset.

The following simple formula can be obtained via integration by parts.

Fact F.1. Let $(Y, w_Y)$ denote a weighted dataset and …

The following notion of ϵ-approximation for the k-balls range space is well studied in the PAC learning and computational geometry communities (see e.g. Har-Peled (2011)).

Definition F.2 (ϵ-Approximation for k-balls range space). Let … denote the set of unions of k balls with the same radius. For a dataset $P \subset \mathbb{R}^d$, the k-balls range space on $P$ is denoted by $(P, P_k)$ where …

The following lemma reduces the construction of an ϵ-approximation to uniform sampling.

Lemma F.3 (Li et al. (2001)). Assume $Q$ is a uniform sample of size $\tilde{O}\left(\frac{kd}{\epsilon^2}\right)$ from $P$. Then with probability at least $1 - \frac{1}{\mathrm{poly}(k/\epsilon)}$, $Q$ is an ϵ-approximation of the k-balls range space on $P$.

The following Lemma F.4 shows that an $\left(\frac{\epsilon}{12z}\right)^z$-approximation yields a $2^{O(z \log z)} \cdot \epsilon$-coreset for (k, z, t)-ROBUST CLUSTERING for every t.

Lemma F.4. Assume $R = P_i \cap \mathrm{ring}(c_i^*, r, 2r)$ is a ring in the cluster $P_i$. Let $Q_R$ be an $\left(\frac{\epsilon}{12z}\right)^z$-approximation of the k-balls range space on $R$. Suppose every element of $Q_R$ is re-weighted by …

… $\left(\frac{\epsilon}{12z}\right)^z$-approximation of the k-balls range space on $R$, we know that for every $u > 0$, …

Let $T_{\mathrm{close}} = \min_{x \in R} \mathrm{dist}(x, C)$ and $T_{\mathrm{far}} = \max_{x \in R} \mathrm{dist}(x, C)$. Since $R \subset \mathrm{ring}(c_i^*, r, 2r)$, the diameter of $R$ is at most $4r$, and this implies $T_{\mathrm{far}} - T_{\mathrm{close}} \le 4r$. Since $Q_R$ is a subset of $R$, we know that for every $u \notin [T_{\mathrm{close}}, T_{\mathrm{far}}]$, …

To prove (6), we do the following case analysis.

