CONSTANT-FACTOR APPROXIMATION ALGORITHMS FOR SOCIALLY FAIR k-CLUSTERING

Abstract

We study approximation algorithms for the socially fair (ℓ_p, k)-clustering problem with m groups, which includes the socially fair k-median (p = 1) and k-means (p = 2). We present (1) a polynomial-time (5 + 2√6)^p-approximation with at most k + m centers, (2) a (5 + 2√6 + ϵ)^p-approximation with k centers in time (nk)^{2^{O(p)} m²/ϵ}, and (3) a (15 + 6√6)^p-approximation with k centers in time k^m · poly(n). The first is obtained by a refinement of the iterative rounding method via a sequence of linear programs. The latter two are obtained by converting a solution with up to k + m centers into one with k centers, via sparsification for (2) and via exhaustive search for (3). We also compare the performance of our algorithms with existing approximation algorithms on benchmark datasets and find that our algorithms outperform existing methods.

1. INTRODUCTION

Automated decision making using machine learning algorithms is widely adopted in modern society. Examples of real-world decisions made by ML algorithms are innumerable and include applications with considerable societal effects, such as automated content moderation Gorwa et al. (2020) and recidivism prediction Angwin et al. (2016). This necessitates designing (new) machine learning algorithms that incorporate societal considerations, especially fairness Dwork et al. (2012); Kearns and Roth (2019).

The facility location problem is a well-studied problem in combinatorial optimization. Famous instances include the k-means, k-median, and k-center problems, where the input is a finite metric and the goal is to find k points ("centers" or "facilities") such that a function of the distance of each given point to its nearest center is minimized. For k-means, the objective is the average squared distance to the nearest center; for k-median, it is the average distance; and for k-center, it is the maximum distance.
These are all captured by the (ℓ_p, k)-clustering problem, defined as follows: given a set of clients A of size n, a set of candidate facility locations F, and a metric d, find a subset F ⊆ F of size k that minimizes Σ_{i∈A} d(i, F)^p, where d(i, F) = min_{j∈F} d(i, j). This is NP-hard for all p, and also hard to approximate Drineas et al. (2004); Guha and Khuller (1999). A 2^{O(p)}-approximation algorithm was given by Charikar et al. (2002). The current best approximation factors for k-median and k-means on general metrics are (2.675 + ϵ)-approximation Byrka et al.



Here we consider socially fair extensions of the (ℓ_p, k)-clustering problem, in which m different (not necessarily disjoint) subgroups A = A_1 ∪ ⋯ ∪ A_m of the data are given, and the goal is to minimize the maximum cost over the groups, so that a common solution is not too expensive for any one of them. Each group can be a subset of the data or, more generally, any nonnegative weighting. The goal is to minimize the maximum weighted cost among the groups, i.e.,

min_{F ⊆ F : |F| = k}  max_{s∈[m]}  Σ_{i∈A_s} w_s(i) d(i, F)^p.   (1)

The weighting w_s(i) = 1/|A_s| for i ∈ A_s corresponds to the average cost of each group. The groups usually arise from sensitive attributes such as race and gender (which are protected against discrimination under the Civil Rights Act of 1968 Hutchinson and Mitchell (2019); Benthall and Haynes (2019)). The cases p = 1 and p = 2 are the socially fair k-median and k-means, respectively, introduced by Ghadiri et al. (2021); Abbasi et al. (2021). As discussed in Ghadiri et al. (2021), the objective of socially fair k-means promotes a more equitable average clustering cost among different groups. The objective function of socially fair k-median was first studied by Anthony et al. (2010), who gave an O(log m + log n)-approximation algorithm. Moreover, the existing approximation algorithms for the vanilla k-means and k-median can be used to find O(m)-approximate solutions for the socially fair versions Ghadiri et al. (2021); Abbasi et al. (2021). The proof technique directly yields an m · 2^{O(p)}-approximation for socially fair (ℓ_p, k)-clustering. The natural linear programming (LP) relaxation of the socially fair k-median problem has an integrality gap of Ω(m) Abbasi et al. (2021). More recently, Makarychev and Vakilian (2021) strengthened the LP relaxation of the socially fair (ℓ_p, k)-clustering by a sparsification technique.
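As a concrete illustration of objective (1), the following Python sketch (not from the paper; the Euclidean metric and all names are illustrative assumptions) evaluates the socially fair cost of a fixed set of centers:

```python
import math

def fair_cost(points, groups, weights, centers, p):
    """Socially fair (l_p, k)-clustering objective (Eq. 1):
    the maximum over groups of the weighted sum of p-th powers
    of distances from each client to its nearest open center."""
    cost = 0.0
    for s, members in enumerate(groups):
        group_cost = sum(
            weights[s][i] * min(math.dist(points[i], c) for c in centers) ** p
            for i in members)
        cost = max(cost, group_cost)
    return cost

# Two groups on a line; w_s(i) = 1/|A_s| gives each group's average cost.
points = [(0.0,), (1.0,), (4.0,), (5.0,)]
groups = [[0, 1], [2, 3]]
weights = [{0: 0.5, 1: 0.5}, {2: 0.5, 3: 0.5}]
centers = [(0.0,), (5.0,)]
print(fair_cost(points, groups, weights, centers, p=1))  # 0.5
```

Both groups incur average cost 0.5 here; moving either center toward one group would raise the other group's cost and hence the max.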
The stronger LP has an integrality gap of Ω(log m / log log m), and their rounding algorithm (similar to Charikar et al. (2002)) gives a 2^{O(p)} · (log m / log log m)-approximation for socially fair (ℓ_p, k)-clustering. For socially fair k-median, this is asymptotically the best possible in polynomial time under the assumption NP ⊄ ∪_{δ>0} DTIME(2^{n^δ}) Bhattacharya et al. (2014). Due to this hardness result, it is natural to consider a bicriteria approximation, which allows more centers whose total cost is close to the optimal cost for k centers. For socially fair k-median and 0 < ϵ < 1, Abbasi et al. (2021) present an algorithm that opens at most k/(1 − ϵ) centers with objective value at most 2^{O(p)}/ϵ times the optimum for k centers. Our first result is an improved bicriteria approximation algorithm for socially fair (ℓ_p, k)-clustering with only m additional centers (m is usually a small constant).

Theorem 1.1. There is a polynomial-time bicriteria approximation algorithm for the socially fair (ℓ_p, k)-clustering problem with m groups that finds a solution with at most k + m centers of cost at most (5 + 2√6)^p ≈ 9.9^p times the optimal cost for a solution with k centers.

Goyal and Jaiswal (2021) show that a solution to the socially fair (ℓ_p, k)-clustering problem with k′ > k centers and cost C can be converted to a solution with k centers and cost at most 3^{p−1}(C + 2·opt) by simply taking the k-subset of the k′ centers of lowest cost. A proof is in the appendix for completeness. We improve this factor using a sparsification technique.

Theorem 1.2. For any ϵ > 0, there is a (5 + 2√6 + ϵ)^p-approximation algorithm for the socially fair (ℓ_p, k)-clustering problem that runs in time (nk)^{2^{O(p)} m²/ϵ}; there is a (15 + 6√6)^p-approximation algorithm that runs in time k^m · poly(n).

This raises the question of whether a faster constant-factor approximation is possible.
Goyal and Jaiswal (2021) show that under the Gap-Exponential Time Hypothesis, it is hard to approximate socially fair k-median and k-means within factors of 1 + 2/e − ϵ and 1 + 8/e − ϵ, respectively, in time g(k) · n^{f(m)·o(k)} for any f, g : R⁺ → R⁺; socially fair (ℓ_p, k)-clustering is hard to approximate within a factor of 3^p − ϵ in time g(k) · n^{o(k)}. They also give a (3 + ϵ)^p-approximation in time (k/ϵ)^{O(k)} poly(n/ϵ). This leaves open the possibility of a constant-factor approximation in time f(m) · poly(n, k). As p → ∞, the problem reduces to the fair k-center problem after taking the p-th root of the objective. That problem is much better understood and has been widely studied along with many generalizations Jia et al. (2021); Anegg et al. (2021); Makarychev and Vakilian (2021). The result of Makarychev and Vakilian (2021) implies an O(1)-approximation in this case. We compare the performance of our bicriteria algorithm against Abbasi et al. (2021), and of our algorithm with exactly k centers against Makarychev and Vakilian (2021), on three different benchmark datasets. Our experiments show that our algorithms consistently outperform these in practice (Section 5) and often select fewer centers than the algorithm of Abbasi et al. (2021) (Section E.3).

1.1. APPROACH AND TECHNIQUES

Our starting point is an LP relaxation of the problem. The integrality gap of the natural LP relaxation is Ω(m) Abbasi et al. (2021). For our bicriteria result, we use an iterative rounding procedure inspired by Krishnaswamy et al. (2018). In each iteration, we solve an LP whose constraints change from one iteration to the next. We show that the feasible region of the final LP is the intersection of a matroid polytope and m affine spaces. This implies that the size of the support of an optimal extreme solution is at most k + m; see Lemma 1.3. Rounding up all of these fractional variables results in a solution with k + m centers.

There are two approaches to convert a solution with up to k + m centers into a solution with k centers. The first is to take the best k-subset of the k + m centers, which results in a (15 + 6√6)^p-approximation for an additional cost of O(k^m · n(k + m)) in the running time. This follows from the work of Goyal and Jaiswal (2021). For completeness, we include it as Lemma A.1 in the appendix.

The second approach is to "sparsify" the given instance of the problem. We show that if the instance is "sparse," then the integrality gap of the LP is small. A similar idea was used by Li and Svensson (2016) for the classic k-median problem. We extend this sparsification technique to socially fair clustering. We define an α-sparse instance of the socially fair k-median problem as an instance in which, for an optimum set of facilities O, any group s ∈ [m], and any facility i not in the optimum solution, the number of clients of group s in a ball of radius d(i, O)/3 centered at i is less than 3α|A_s|/(2 d(i, O)). For such an instance, given a set of facilities, replacing a facility i with the closest facility to i in O can only increase the total cost of the clients served by this facility by a constant factor plus 2α. We show that if an instance is O(opt/m)-sparse, then the integrality gap of the LP is constant.
For an O(opt/m)-sparse instance of the socially fair k-median problem, a solution with k + m centers can be converted to a solution with k centers in time n^{O(m²)} while increasing the objective value only by a constant factor. Our conversion algorithm is based on the fact that there are at most O(m²) facilities that are far from the facilities in the optimal solution. We enumerate candidates for these facilities and then solve an optimization problem for the facilities that are close to the facilities in the optimal solution. This optimization step is again over the intersection of a matroid polytope with m half-spaces.

In summary, our algorithm consists of three main steps.
1. We produce n^{O(m²)} instances of the problem such that at least one is O(opt/m)-sparse and its optimal objective value is equal to that of the original instance (Section 3, Lemma 3.2).
2. For each of the instances produced in the previous step, we find a pseudo-solution with at most k + m centers by an iterative rounding procedure (Section 2, Lemma 2.1).
3. We convert each pseudo-solution with k + m centers to a solution with k centers (Section 4, Lemma 4.2) and return the solution with the minimum cost.

1.2. PRELIMINARIES

We use the terms centers and facilities interchangeably. For a set S and an item i, we denote S ∪ {i} by S + i. For sets S_1, …, S_k, we denote their Cartesian product by ∏_{j∈[k]} S_j, i.e., (s_1, …, s_k) ∈ ∏_{j∈[k]} S_j if and only if s_1 ∈ S_1, …, s_k ∈ S_k. For an instance I of the problem, we denote an optimal solution of I and its objective value by OPT_I and opt_I, respectively.

A pair M = (E, I), where I is a non-empty family of subsets of E, is a matroid if: 1) for any S ⊆ T ⊆ E, if T ∈ I then S ∈ I (hereditary property); and 2) for any S, T ∈ I, if |S| < |T|, then there exists i ∈ T \ S such that S + i ∈ I (exchange property); see Schrijver (2003). We call I the set of independent sets of the matroid M. The bases of M are the independent sets of M of maximal size. All bases of a matroid have the same size, called the rank of the matroid. We use the following lemma in the analysis of both our bicriteria algorithm and the algorithm with exactly k centers.

Lemma 1.3. Any extreme point of the intersection of a matroid polytope with m additional half-spaces or affine spaces has support of size at most the rank of the matroid plus m.

Related notions of fairness. As shown by Gupta et al. (2020), different notions of fairness are incompatible in the sense that they cannot all be satisfied simultaneously. For example, see the discussion and experimental results regarding the incompatibility of social fairness and equitable representation in Ghadiri et al. (2021). In addition, several other notions of fairness for clustering have been considered Chen et al. (2019); Jung et al. (2020); Mahabadi and Vakilian (2020); Micha and Shah (2020); Brubach et al. (2020).
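For small ground sets, the two matroid axioms above can be verified exhaustively. The sketch below (illustrative, not from the paper) does exactly that, and distinguishes the uniform matroid from a set family that fails the exchange property:

```python
from itertools import combinations

def is_matroid(E, independent):
    """Brute-force check of the two matroid axioms for a finite ground
    set E and a predicate `independent` on frozensets of elements."""
    subsets = [frozenset(c) for r in range(len(E) + 1)
               for c in combinations(E, r)]
    indep = [S for S in subsets if independent(S)]
    # Hereditary: single-element removals suffice, by induction,
    # since we test them for every independent set.
    hereditary = all(independent(S - {x}) for S in indep for x in S)
    # Exchange: |S| < |T| implies some i in T \ S with S + i independent.
    exchange = all(any(independent(S | {i}) for i in T - S)
                   for S in indep for T in indep if len(S) < len(T))
    return hereditary and exchange

uniform2 = lambda S: len(S) <= 2                     # uniform matroid U_{2,4}
not_matroid = lambda S: S <= {0, 1} or S <= {2, 3}   # fails exchange
print(is_matroid(range(4), uniform2))      # True
print(is_matroid(range(4), not_matroid))   # False
```

The second family is hereditary but not a matroid: for S = {0} and T = {2, 3}, no element of T can be added to S.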

2. BICRITERIA APPROXIMATION

In this section, we prove Theorem 1.1. Our method relies on solving a series of linear programs and utilizes the iterative rounding framework for the k-median problem developed in Krishnaswamy et al. (2018); Gupta et al. (2021). We aim for a cleaner exposition over smaller constants below. We use the following standard linear programming (LP) relaxation (LP1). Theorem 1.1 follows as a corollary to Lemma 2.1, as we can pick all the fractional centers integrally. Observe that, once the centers are fixed, the optimal allocation of clients to facilities is straightforward: every client connects to the nearest opened facility.

Lemma 2.1. Let 0 < λ ≤ 1. There is a polynomial-time algorithm that, given a feasible solution (x̃, ỹ, z̃) to the linear program LP1, returns a feasible solution (x′, y′, z′) where z′ ≤ ((1 + 2(1 + λ)/λ)(1 + λ))^p z̃ and the size of the support of y′ is at most k + m. The running time is polynomial in n and the logarithm of the distance of the farthest points divided by λ.

(LP1)  min z
       s.t.  z ≥ Σ_{i∈A_s, j∈F} w_s(i) d(i, j)^p x_ij,  ∀ 1 ≤ s ≤ m,
             x_ij ≤ y_j,  ∀ i ∈ A, j ∈ F,
             Σ_{j∈F} y_j = k,
             Σ_{j∈F} x_ij = 1,  ∀ i ∈ A,
             x, y ≥ 0.

(LP2)  min z
       s.t.  z ≥ Σ_{i∈A_s} Σ_{j∈F_i} w_s(i) d(i, j)^p y_j,  ∀ 1 ≤ s ≤ m,   (2)
             Σ_{j∈F} y_j = k,   (3)
             Σ_{j∈F_i} y_j = 1,  ∀ i ∈ A,   (4)
             y ≥ 0.

Proof. We describe the iterative rounding argument to round the solution (x̃, ỹ, z̃). As a first step, we work with an equivalent linear program LP2, in which we have removed the assignment variables x. This is achieved by splitting each facility j according to the distinct nonzero values x̃_ij and setting the corresponding variables for these facilities accordingly; e.g., if the distinct values are x̃_1j < x̃_2j < x̃_3j, then the corresponding weights for the copies of j are x̃_1j, x̃_2j − x̃_1j, and x̃_3j − x̃_2j, and the weight of the connection between each new facility and a client is either zero or the full weight of that facility. Let F be the set of all (split) copies of facilities.
Then we can assume x̃_ij ∈ {0, ỹ_j} for each i, j (where j ∈ F). We set F_i = {j ∈ F : x̃_ij > 0}. Note that F_i can contain multiple copies of original facilities. Observe that LP2 has a feasible solution (ỹ, z̃) for this choice of F_i for each i ∈ A. Moreover, any feasible solution to LP2 can be converted to a solution of LP1 of the same cost while ensuring that each client i connects to the original copy of a facility in F_i.

The iterative argument is based on the following idea. We group nearby clients and pick only one representative for each group, such that if each client is served by the facility that serves its representative, the cost is at most ((1 + 2(1 + λ)/λ)(1 + λ))^p z̃. Moreover, we ensure that the candidate facility sets F_i of representative clients are disjoint. In this case, one observes that the constraints Eq. 3-Eq. 4 in LP2 define the convex hull of a partition matroid and every extreme point must be integral. Indeed, this already gives an integral solution to the basic k-median problem. But in the socially fair clustering problem, there are m additional constraints, one for each of the m groups. Nevertheless, by Lemma 1.3, any extreme point solution to the matroid polytope intersected with at most m linear constraints has a support of size at most k + m (see also Lau et al. (2011), Chap. 11).
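The facility-splitting step in the proof can be made concrete. The sketch below (an illustrative reading of the construction, not the paper's code) splits one facility into copies so that every client's fractional assignment becomes all-or-nothing on some copy:

```python
def split_facility(x_col, y_j):
    """Split one facility j into copies so that each client's assignment
    to j is all-or-nothing on each copy.  x_col[i] is client i's
    fractional assignment x_ij (0 <= x_ij <= y_j).  Returns
    (copy_weights, serves), where serves[c] is the set of clients
    fully using copy c."""
    levels = sorted({v for v in x_col.values() if v > 0})
    prev, copies, serves = 0.0, [], []
    for lv in levels:
        copies.append(lv - prev)                # y-weight of this copy
        serves.append({i for i, v in x_col.items() if v >= lv})
        prev = lv
    if y_j > prev:                              # leftover unused capacity
        copies.append(y_j - prev)
        serves.append(set())
    return copies, serves

# x_1j = 0.2 and x_2j = 0.5 with y_j = 0.5: copies of weight 0.2 and 0.3;
# client 1 uses only the first copy, client 2 uses both.
copies, serves = split_facility({1: 0.2, 2: 0.5}, 0.5)
print(copies)   # [0.2, 0.3]
print(serves)   # [{1, 2}, {2}]
```

The copy weights sum to y_j, and each client's total weight over the copies it fully uses equals its original x_ij, matching the requirement x̃_ij ∈ {0, ỹ_j} per copy.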

Algorithm 1: Iterative Rounding

1  Input: A = A_1 ∪ ⋯ ∪ A_m, F, k, d, λ
2  Output: A set of centers of size at most k + m.
3  Solve LP1 to get an optimal solution (x*, y*, z*) and create the set F by splitting facilities.
4  Set F_i = {j ∈ F : x*_ij > 0} and D_i = max{d(i, j) : x*_ij > 0} for each i ∈ A.
5  Sort clients in increasing order of D_i and greedily include clients in U* while maintaining that the sets {F_i : i ∈ U*} remain disjoint.
6  Set U_f = A \ U*.
7  while there is some tight constraint from Eq. 8 do
8      if there exists i ∈ U_f such that y(B_i) = 1 (i.e., Eq. 8 is tight for i) then
9          F_i ← B_i, D_i ← D_i/(1+λ), B_i ← {j ∈ F_i : d(i, j) ≤ D_i/(1+λ)}, and Update-U*(i).
10 Find an extreme point solution y to the linear program LP(U*, U_f, D).
11 return the support of y.
Procedure Update-U*(i):
    if D_{i′} > D_i for every i′ ∈ U* with F_i ∩ F_{i′} ≠ ∅ then
        Remove all such i′ from U* and add them to U_f.
        U* ← U* ∪ {i}.

We now formalize the argument and specify how clients are iteratively grouped. We iteratively remove/change constraints in LP2 as we do the grouping, while ensuring that the linear program's cost does not increase. We initialize D_i = max{d(i, j) : x̃_ij > 0} = max{d(i, j) : j ∈ F_i} for each client i. We maintain a set of representative clients U*. We say a client i ∈ U* represents a client i′ if they share a facility, i.e., F_i ∩ F_{i′} ≠ ∅, and D_i ≤ D_{i′}. The representative clients do not share any facility with each other. The non-representative clients are put in the set U_f. We initialize U* as follows. Sort all clients in increasing order of D_i. Greedily add clients to U* while maintaining F_i ∩ F_{i′} = ∅ for all i ≠ i′ ∈ U*. Observe that U* is maximal with the above property, i.e., for every i′ ∉ U*, there is i ∈ U* such that F_i ∩ F_{i′} ≠ ∅ and D_i ≤ D_{i′}. We maintain this invariant throughout the algorithm. For clients i ∈ U_f, we set B_i to be the set of facilities in F_i within distance D_i/(1+λ) of i.
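The greedy initialization of U* can be sketched in a few lines of Python (illustrative, not from the paper; D maps clients to their radii D_i and F maps clients to their candidate facility sets F_i):

```python
def pick_representatives(D, F):
    """Greedy initialization of U* in Algorithm 1: scan clients in
    increasing order of D_i and keep each client whose candidate
    facility set F_i is disjoint from all previously kept clients'."""
    reps, used = [], set()
    for i in sorted(D, key=D.get):
        if F[i].isdisjoint(used):
            reps.append(i)
            used |= F[i]
    return reps

D = {'a': 1.0, 'b': 2.0, 'c': 3.0}
F = {'a': {1, 2}, 'b': {2, 3}, 'c': {4}}
print(pick_representatives(D, F))  # ['a', 'c']
```

Client 'b' is skipped because it shares facility 2 with the earlier client 'a' (which has D_a ≤ D_b), so 'a' represents 'b'; this is exactly the maximality invariant described above.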
In each iteration, we solve the following linear program and update U*, U_f, and the D_i's, B_i's, and F_i's.

(LP(U*, U_f, D))  min z
  s.t.  z ≥ Σ_{i∈A_s∩U*} w_s(i) Σ_{j∈F_i} d(i, j)^p y_j + Σ_{i∈A_s∩U_f} w_s(i) [ Σ_{j∈B_i} d(i, j)^p y_j + (1 − y(B_i)) D_i^p ],  ∀ 1 ≤ s ≤ m,   (5)
        Σ_{j∈F} y_j = k,   (6)
        Σ_{j∈F_i} y_j = 1,  ∀ i ∈ U*,   (7)
        Σ_{j∈B_i} y_j ≤ 1,  ∀ i ∈ U_f,   (8)
        y ≥ 0.

For clients i ∈ U_f, we only insist that we pick at most one facility from B_i (see Eq. 8). The objective is modified to pay D_i for any fractional shortfall of facilities in this smaller ball (see Eq. 5). Observe that if the constraint Eq. 8 becomes tight for some i ∈ U_f, we can decrease D_i by a factor of (1 + λ) for this client and then update U* to check whether i can be included in it. Also, we round each d(i, j) up to the nearest power of (1 + λ). This changes the objective by a factor of at most (1 + λ)^p, and we abuse notation to assume that d satisfies this property (it might no longer be a metric, but in the final assignment we work with its metric completion). The iterative algorithm runs as described in Algorithm 1. It is possible that a client moves between U_f and U*, but every time a point in U_f is processed in the while-loop of Algorithm 1, its D_i is divided by (1 + λ). Thus the algorithm takes O(n log(diam)/λ) iterations, where diam is the distance between the two farthest points. Finally, the result is implied by the following claims (proved in the appendix).

Claim 2.2. The cost of the LP is non-increasing over iterations. Moreover, when the algorithm ends, there are at most k + m facilities in the support.

Algorithm 2: Sparsify

Input: A = A_1 ∪ ⋯ ∪ A_m, F, k, d, t ∈ N
Output: A set of fair k-median instances.
for t′ = 1, …, m²t and all choices of t′ facility pairs (j_1, j′_1), …, (j_{t′}, j′_{t′}) do
    Output I′ = (F′, A, k, d), where F′ = F \ ∪_{r=1}^{t′} FBall(j_r, d(j_r, j′_r)).

Claim 2.3. For any client i′ ∈ U_f, there is always one total facility at a distance of at most (1 + 2(1 + λ)/λ) D_{i′}, i.e., Σ_{j : d(i′,j) ≤ (1+2(1+λ)/λ) D_{i′}} y_j ≥ 1.

Claim 2.4. Let ŷ be an integral solution to the linear program LP(U*, U_f, D) after the last iteration. Then there is a solution (x̂, ŷ) to the linear program LP1 whose objective is at most ((1 + 2(1 + λ)/λ)(1 + λ))^p times the objective of LP(U*, U_f, D).

We now prove Theorem 1.1 by substituting the best λ into Lemma 2.1.

Proof of Theorem 1.1. By Lemma 2.1, the output vector of Algorithm 1 corresponding to the centers has support of size at most k + m. Rounding up all the fractional centers, we get a solution with at most k + m centers and cost at most ((1 + 2(1 + λ)/λ)(1 + λ))^p times the optimal. We optimize over λ by setting the derivative of (1 + 2(1 + λ)/λ)(1 + λ) = 3λ + 5 + 2/λ to zero. This gives the optimal value λ = √(2/3). Substituting this gives a total approximation factor of (5 + 2√6)^p.
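The choice of λ can be sanity-checked numerically; the snippet below (illustrative, not from the paper) evaluates the per-client factor f(λ) = (1 + 2(1 + λ)/λ)(1 + λ) = 3λ + 5 + 2/λ at its stationary point:

```python
import math

# Per-client approximation factor in Lemma 2.1, before raising to the power p.
f = lambda lam: (1 + 2 * (1 + lam) / lam) * (1 + lam)   # = 3*lam + 5 + 2/lam

lam_star = math.sqrt(2 / 3)            # zero of the derivative 3 - 2/lam^2
print(round(f(lam_star), 6))           # 9.898979  (= 5 + 2*sqrt(6))
print(round(5 + 2 * math.sqrt(6), 6))  # 9.898979
print(round(f(2 / 3), 6))              # 10.0  (lam = 2/3 is slightly worse)
```

Since f(λ) = 3λ + 5 + 2/λ is convex on λ > 0, the stationary point λ = √(2/3) ≈ 0.816 is the global minimum, with value √6 + 5 + √6 = 5 + 2√6.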

3. APPROXIMATION ALGORITHMS FOR FAIR k-CLUSTERING

We first show how to generate a set of instances such that at least one of them is sparse and has the same optimal objective value as the original instance. Then we present our algorithm to find a solution with k facilities from a pseudo-solution with k + m facilities for a sparse instance, inspired by Li and Svensson (2016). We need to address new difficulties: the sparsity is with respect to all groups s ∈ [m]; and since our pseudo-solution has m additional centers (instead of O(1) additional centers), we need a sparser instance than Li and Svensson (2016). One new technique is solving the optimization problem in Algorithm 3 (Lemma 4.1). This is trivial for the vanilla k-median, but in the fair setting we use certain properties of the extreme points of the intersection of a matroid polytope with half-spaces, combined with a careful enumeration.

For an instance I, we denote the cost of a set of facilities F by cost_I(F). For a point q and r > 0, we denote the set of facilities in the ball of radius r around q by FBall_I(q, r); this does not contain facilities at distance exactly r from q. For a group s ∈ [m], the set of clients of s in the ball of radius r around q is CBall_{I,s}(q, r). Note that because we consider the clients as weights on points, CBall_{I,s}(q, r) is actually a set of (point, weight) pairs. We let |CBall_{I,s}(q, r)| = Σ_{i∈CBall_{I,s}(q,r)} w_s(i).

Definition 3.1 (Sparse Instance). For α > 0, an instance I = (k, F, A, d) of the fair ℓ_p clustering problem is α-sparse if for each facility j ∈ F and group s ∈ [m],
  (2/3) d(j, OPT_I)^p · |CBall_{I,s}(j, (1/3) d(j, OPT_I))| ≤ α.
We say that a facility j is α-dense if it violates the above for some group s ∈ [m].

To motivate the definition, let I be an α-sparse instance, OPT_I an optimal solution of I, j a facility not in OPT_I, and j* the closest facility in OPT_I to j.
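Definition 3.1 is straightforward to test on a toy instance. The sketch below (illustrative, not from the paper; it assumes a 1-dimensional metric and unit client weights for brevity) checks α-sparsity of every facility:

```python
def is_alpha_sparse(facilities, opt_centers, clients, alpha, p):
    """Check Definition 3.1: the instance is alpha-sparse if for every
    facility j and every group s,
        (2/3) * d(j, OPT)^p * |CBall_s(j, d(j, OPT)/3)| <= alpha,
    where |CBall_s| is the total weight of group-s clients strictly
    inside the ball.  clients[s] is a list of (position, weight) pairs."""
    for j in facilities:
        r = min(abs(j - o) for o in opt_centers)   # d(j, OPT) on the line
        for group in clients:
            ball_weight = sum(w for x, w in group if abs(x - j) < r / 3)
            if (2 / 3) * r ** p * ball_weight > alpha:
                return False   # facility j is alpha-dense for this group
    return True

# One group of unit-weight clients packed around facility 2.0, while the
# optimum OPT = {10.0} is far away: facility 2.0 is dense for small alpha.
clients = [[(2.0, 1.0), (2.1, 1.0), (2.2, 1.0)]]
print(is_alpha_sparse([2.0], [10.0], clients, alpha=5.0, p=1))    # False
print(is_alpha_sparse([2.0], [10.0], clients, alpha=100.0, p=1))  # True
```

Here d(2.0, OPT) = 8, the ball of radius 8/3 captures all three clients, and (2/3)·8·3 = 16, so the facility is α-dense for any α < 16.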
Let F be a solution that contains j, and let η_{j,s} be the total cost of the clients of group s ∈ [m] that are connected to j in solution F. Then
  (cost of group s for solution F ∪ {j*} \ {j}) ≤ (cost of group s for solution F) + 2^{O(p)} · (α + η_{j,s}).
This property implies that if α ≤ opt_I/m, then replacing m different facilities can increase the cost by a factor of 2^{O(p)} plus 2^{O(p)} · opt_I, and the integrality gap of the LP relaxation is 2^{O(p)}. The next algorithm generates a set of instances such that at least one of them has objective value equal to opt_I and is (opt_I/(mt))-sparse, for a fixed integer t.

Lemma 3.2. Algorithm 2 runs in n^{O(m²t)} time and produces instances of the socially fair ℓ_p clustering problem such that at least one produced instance I′ satisfies the following: (1) the optimal value of the original instance I is equal to the optimal value of I′; (2) I′ is (opt_I/(mt))-sparse.

4. CONVERTING A SOLUTION WITH k + m CENTERS TO ONE WITH k CENTERS

We first analyze the special case where the set of facilities is partitioned into k disjoint sets and we are constrained to pick exactly one facility from each set. This will be a subroutine in our algorithm.

Lemma 4.1. Let S_1, …, S_k be disjoint sets such that S_1 ∪ ⋯ ∪ S_k = [n]. For g ∈ [m], j ∈ [k], v ∈ S_j, let α_v^{(g,j)} ≥ 0. Then there is an (nk)^{O(m²/ϵ)}-time algorithm that finds a (1 + ϵ)-approximate solution to
  min_{v_j ∈ S_j : j∈[k]}  max_{g∈[m]}  Σ_{j∈[k]} α_{v_j}^{(g,j)}.

Proof. The LP relaxation of the above problem is: min θ such that
  Σ_{j∈[k]} Σ_{v∈S_j} α_v^{(g,j)} x_v^{(j)} ≤ θ,  ∀ g ∈ [m],
  Σ_{v∈S_j} x_v^{(j)} = 1,  ∀ j ∈ [k],
  x^{(j)} ≥ 0,  ∀ j ∈ [k].
Note that this is equivalent to optimizing over a partition matroid with m extra linear constraints. Therefore, by Lemma 1.3, an extreme point solution has support of size at most k + m. Now suppose θ* is the optimal integral objective value, and v*_1 ∈ S_1, …, v*_k ∈ S_k are the points that achieve this optimal objective.
For each g ∈ [m], at most m/ϵ of the values α_{v*_j}^{(g,j)}, j ∈ [k], can exceed (ϵ/m) θ*, because Σ_{j∈[k]} α_{v*_j}^{(g,j)} ≤ θ*. Suppose, for each g ∈ [m], we guess the set of indices T_g = {j ∈ [k] : α_{v*_j}^{(g,j)} ≥ (ϵ/m) θ*}. This takes k^{O(m²/ϵ)} time by enumerating over subsets of [k]. Let T = T_1 ∪ ⋯ ∪ T_m. For j ∈ T, we also guess v*_j in the optimum solution by enumerating over the sets S_j, j ∈ T. This increases the running time by a multiplicative factor of n^{O(m²/ϵ)}, since v*_j ∈ S_j and |S_j| ≤ n. Based on these guesses, we fix the corresponding variables in the LP: for each j ∈ T and v ∈ S_j, we add the constraints x_v^{(j)} = 1 if v = v*_j and x_v^{(j)} = 0 otherwise. We solve all such LPs to get optimal extreme points. Let (θ, x^{(1)}, …, x^{(k)}) be an optimal extreme point for the LP corresponding to the right guess (i.e., the guess in which we have identified all indices j ∈ [k], along with their corresponding v*_j, such that there exists g ∈ [m] with α_{v*_j}^{(g,j)} ≥ (ϵ/m) θ*). Then θ ≤ θ*. Let R = {j ∈ [k] : x^{(j)} ∉ {0, 1}^{|S_j|}}. Since the feasible region of the LP is the intersection of a matroid polytope and m half-spaces, by Lemma 1.3 the support of an extreme solution has size at most k + m. Moreover, every cluster j ∈ [k] with fractional centers in the extreme point solution contributes at least 2 to the size of the support, because of the equality constraint in the LP. Therefore 2|R| + (k − |R|) ≤ k + m, which implies |R| ≤ m. Now we guess v*_j for all j ∈ R. By construction, R ∩ T = ∅; therefore for all j ∈ R and g ∈ [m], α_{v*_j}^{(g,j)} < (ϵ/m) θ*, and hence for all g ∈ [m], Σ_{j∈R} α_{v*_j}^{(g,j)} ≤ m · (ϵ/m) θ* = ϵθ*. Thus, for the right guess of v*_j, j ∈ R, we get an integral solution of cost at most θ + ϵθ* ≤ (1 + ϵ)θ*.

Algorithm 3 is our main procedure to convert a solution with k + m centers to a solution with k centers. We need β to lie in the interval specified in Lemma 4.2.
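For small instances, the min-max problem of Lemma 4.1 can be solved exactly by brute force, which is useful as a baseline against the LP-guessing algorithm. The sketch below is illustrative and not from the paper; it is exponential in k, whereas the lemma's algorithm achieves (1 + ϵ) in time (nk)^{O(m²/ϵ)}:

```python
from itertools import product

def min_max_assignment(S, alpha):
    """Exact brute force for the problem of Lemma 4.1: pick one element
    v_j from each part S_j minimizing the maximum, over groups g, of
    sum_j alpha[g][j][v_j]."""
    best, best_choice = float('inf'), None
    for choice in product(*S):
        val = max(sum(alpha[g][j][v] for j, v in enumerate(choice))
                  for g in range(len(alpha)))
        if val < best:
            best, best_choice = val, choice
    return best, best_choice

# k = 2 parts, m = 2 groups; alpha[g][j] maps each element to its cost.
S = [['a', 'b'], ['c', 'd']]
alpha = [[{'a': 1, 'b': 4}, {'c': 1, 'd': 0}],   # group 0
         [{'a': 2, 'b': 0}, {'c': 0, 'd': 5}]]   # group 1
print(min_max_assignment(S, alpha))  # (2, ('a', 'c'))
```

Note the min-max trade-off: ('a', 'd') would be best for group 0 alone (cost 1) but costs group 1 a total of 7, so the balanced choice ('a', 'c') wins.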
To achieve this, we guess opt_I as different powers of two and try the corresponding β's. The main idea behind the algorithm is that in a pseudo-solution of a sparse instance, only a few (< m²t) facilities are far from facilities in the optimal solution. The algorithm therefore guesses those facilities and replaces them with facilities in the optimal solution. For the rest of the facilities in the pseudo-solution (which are close to facilities in the optimal solution), the algorithm solves an optimization problem (Lemma 4.1) to find a set of facilities with cost comparable to the optimal solution. Finally, combining the following lemma (proved in the appendix) with Lemma 3.2 (sparsification) and Theorem 1.1 (bicriteria algorithm) implies Theorem 1.2.

Under review as a conference paper at ICLR 2023

Algorithm 3: Obtaining a solution from a pseudo-solution
1  Input: Instance I, β, a pseudo-solution T, ϵ′ > 0, δ ∈ (0, min{1/8, log(1+ϵ′)/12}), and an integer t ≥ 4(1 + 3/δ)^p.
2  Output: A solution with at most k centers.
3  T′ ← T
4  while |T′| > k and there is j ∈ T′ s.t. cost_I(T′ \ {j}) ≤ cost_I(T′) + β do T′ ← T′ \ {j}
5  if |T′| ≤ k then return T′
6  forall D ⊆ T′ and V ⊆ F such that |D| + |V| = k and |V| < m²t do
7      For j ∈ D, set L_j = d(j, T′ \ {j}).
8      For s ∈ [m], j ∈ D, f_j ∈ FBall_I(j, δL_j), set α_{f_j}^{(s,j)} = Σ_{i∈CBall_{I,s}(j, L_j/3)} min{d(i, f_j)^p, d(i, V)^p}.
9      Let (f̂_j : j ∈ D) ∈ ∏_{j∈D} FBall_I(j, δL_j) be a (1 + ϵ)-approximate solution to (see Lemma 4.1)
           min_{(f_j : j∈D) ∈ ∏_{j∈D} FBall_I(j, δL_j)}  max_{s∈[m]}  Σ_{j′∈D} α_{f_{j′}}^{(s,j′)}
       and set S_{D,V} ← V ∪ {f̂_j : j ∈ D}.
10 return S := argmin_{S_{D,V}} cost_I(S_{D,V})

Lemma 4.2. Let I = (k, F, A, d) be an (opt_I/(mt))-sparse instance of the (ℓ_p, k)-clustering problem, T a pseudo-solution with at most k + m centers, ϵ′ > 0, δ ∈ (0, min{1/8, log(1+ϵ′)/12}), and t ≥ 4(1 + 3/δ)^p an integer, and suppose
  (2/(mt)) [opt_I + (1 + 3/δ)^p cost_I(T)] ≤ β ≤ (2/(mt)) [2·opt_I + (1 + 3/δ)^p cost_I(T)].
Then Algorithm 3 finds a set S ⊆ F in time n^{m²·2^{O(p)}} such that |S| ≤ k and cost_I(S) ≤ (O(1) + (1 + ϵ′)^p)(cost_I(T) + opt_I).

5. EMPIRICAL STUDY

We compare our algorithms with the previous best algorithms in the literature on benchmark datasets for the socially fair k-median problem. Namely, we compare our bicriteria algorithm with Abbasi et al. (2021) (ABV), and our exact algorithm (which outputs exactly k centers) with Makarychev and Vakilian (2021) (MV). Since our bicriteria algorithm produces only a small number of extra centers (e.g., for two groups, our algorithm only produces one extra center; see Section E.3), we search over the best k-subset of the k + m selected centers. However, instead of performing an exhaustive search combinatorially, we use a mixed-integer linear programming solver to find the best k-subset. Our code is written in MATLAB. We use IBM ILOG CPLEX 12.10 to solve the linear programs (and mixed-integer programs). We used a MacBook Pro (2019) with a 2.3 GHz 8-Core Intel Core i9 processor, 16 GB of 2667 MHz DDR4 memory, an Intel UHD Graphics 630 1536 MB graphics card, 1 TB of SSD storage, and macOS version 12.3.1.

Datasets. We use three benchmark datasets that have been extensively used in the fairness literature. Similar to other works in fair clustering Chierichetti et al. (2017), we subsample the points in the datasets; namely, we consider the first 500 examples in each dataset. A quick overview of the datasets follows. (1) The Credit dataset Yeh and Lien (2009) consists of records of 30000 individuals with 21 features. We divided the multi-categorical education attribute into two demographic groups: "higher educated" and "lower educated." (2) The Adult dataset Kohavi et al. (1996); Asuncion and Newman (2007) contains records of 48842 individuals collected from census data, with 103 features. We consider five racial groups, "Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "White", and "Other", for one of our experiments. For another experiment, we consider the intersectional groups of race and gender (male and female), resulting in 10 groups.
(3) The COMPAS dataset Angwin et al. (2016) was gathered by ProPublica and contains the recidivism rates for 9898 defendants. The data is divided into two racial groups: African-Americans (AA) and Caucasians (C). The results for the COMPAS dataset are included in the appendix.

Bicriteria approximation. The ABV algorithm first solves the natural LP relaxation and then uses the "filtering" technique Lin and Vitter (1992); Charikar et al. (2002) to round the fractional solution to an integral one. Given a parameter 0 < ϵ < 1, the algorithm outputs at most k/(1 − ϵ) centers and guarantees a 2/ϵ approximation. In our comparison, we choose ϵ so that ABV opens almost the same number of centers as our algorithm. The tables in Section E.3 summarize the number of selected centers for different k and ϵ. The λ parameter in our algorithm (see Algorithm 1 and Lemma 2.1) determines the factor of decrease in the radii of client balls in the iterative rounding algorithm. As illustrated in

Empirical running time. We compare the running times of our algorithms with those of ABV and MV on the three datasets (see Section E.4). To summarize, the running times of our bicriteria algorithm and of the exact algorithm with exhaustive search are virtually the same when the number of groups is at most 5. Moreover, our algorithms' running times are comparable to ABV's and significantly lower than MV's in most cases. The latter is because MV needs to guess the optimal objective value and therefore must run its algorithm multiple times; we run MV with 5 different guesses (obtained by multiplying by different powers of two) and output the best of these.

6. CONCLUSION

We presented a polynomial-time bicriteria algorithm for socially fair (ℓp, k)-clustering with m groups that outputs at most k + m centers. Using this, we presented two constant-factor approximation algorithms that return exactly k centers. An interesting direction for future work is to investigate the use of recently introduced techniques for k-means++ Lattanzi and Sohler (2019); Choo et al. (2020) to obtain faster constant-factor approximation algorithms for socially fair k-means. It would also be interesting to explore scalable algorithms, for example, using the coreset framework.

A EXHAUSTIVE SEARCH

Lemma A.1 (Goyal and Jaiswal (2021)). Let k′ > k and let S be a set of centers of size k′ and cost C for the socially fair (ℓp, k)-clustering problem with m groups. Let T ⊆ S be a set of size k with minimum cost among all size-k subsets of S. Then the cost of T is at most 3^{p−1}(C + 2·opt), where opt is the cost of the optimal solution.

Proof. Let OPT be an optimal set of centers. For each center o ∈ OPT, let s_o be the closest center in S to o, i.e., s_o := arg min_{s∈S} d(s, o), and let T′ := {s_o : o ∈ OPT}. Because |OPT| = k, we have |T′| ≤ k. We show that the cost of T′ is at most 3^{p−1}(C + 2·opt); the lemma follows because T′ ⊆ S and T has minimum cost among all size-k subsets of S.

Let i be a client and let o_i be the closest facility in OPT to i. Let t′_i be the closest facility to o_i in T′, which by construction is also the closest facility to o_i in S, and let s_i be the closest facility to i in S. By the triangle inequality, d(i, t′_i) ≤ d(i, o_i) + d(o_i, t′_i). By the definition of t′_i, d(o_i, t′_i) ≤ d(o_i, s_i), so d(i, t′_i) ≤ d(i, o_i) + d(o_i, s_i). Again by the triangle inequality, d(o_i, s_i) ≤ d(i, o_i) + d(i, s_i), hence d(i, t′_i) ≤ 2d(i, o_i) + d(i, s_i). Raising both sides to the power p and applying the power mean inequality, i.e., (x + y + z)^p ≤ 3^{p−1}(x^p + y^p + z^p), we conclude d(i, t′_i)^p ≤ 3^{p−1}(2d(i, o_i)^p + d(i, s_i)^p). The result follows by summing this inequality over the clients of each group and taking the maximum over groups.
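As an illustration of Lemma A.1 (not part of the paper's experiments), one can brute-force both the best k-subset of a larger center set S and an optimal set of k centers on a toy one-dimensional instance and confirm the 3^{p−1}(C + 2·opt) bound. The candidate universe (the client points themselves) and the helper below are assumptions of this sketch:

```python
import random
from itertools import combinations


def fair_cost(points, groups, centers, p):
    """Max over groups of the sum of p-th powers of distances to the
    nearest center (1-d points, so the metric is |x - y|)."""
    cost = {}
    for x, g in zip(points, groups):
        d = min(abs(x - c) for c in centers)
        cost[g] = cost.get(g, 0.0) + d ** p
    return max(cost.values())


def check_lemma_a1(points, groups, S, k, p):
    """Verify cost(T) <= 3^(p-1) * (C + 2*opt) for the best k-subset T of S.

    Here the optimal solution is brute-forced over the client points as
    candidate centers, which suffices since the lemma holds against any
    fixed reference solution of size k.
    """
    C = fair_cost(points, groups, S, p)
    best_T = min(fair_cost(points, groups, T, p) for T in combinations(S, k))
    opt = min(fair_cost(points, groups, T, p) for T in combinations(points, k))
    return best_T <= 3 ** (p - 1) * (C + 2 * opt)
```

The inequality is guaranteed by the lemma, so the check should pass on any instance; it is useful only as a sanity test of one's implementation of the objective.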

B OMITTED PROOFS OF SECTION 2

Proof of Claim 2.2. In each iteration, we add at most one client to U⋆. For this client, Eq. 8 is tight, i.e., Σ_{j∈B_i} y_j = 1. Note that we update F_i to B_i, so the new point in U⋆ satisfies Eq. 7. Moreover, for a point i′ that is removed from U⋆, we have Σ_{j∈B_{i′}} y_j ≤ Σ_{j∈F_{i′}} y_j = 1, so such a point satisfies Eq. 8. Hence a feasible solution to the LP of iteration t is also feasible for iteration t + 1, and therefore the cost of the LP is non-increasing over iterations. The second statement follows since, if no constraint from Eq. 8 is tight, then the linear program is the intersection of a matroid polytope with m linear constraints, and the result follows from Lemma 1.3.

Proof of Claim 2.3. Let t be the iteration where D_{i′} is updated for the last time; if D_{i′} is only set once at Line 4 of Algorithm 1 and never updated, then t = 0. We first show that immediately after iteration t, there is one total facility at distance at most 3D_{i′} from i′. If t = 0, then there existed i ∈ U⋆ such that F_i ∩ F_{i′} ≠ ∅ and D_i ≤ D_{i′}. Therefore, by the triangle inequality, all the facilities in F_i are within distance 3D_{i′} of i′; see Figure 1(a). Hence, because Eq. 7 enforces one total facility in F_i, there exists one total facility at distance at most 3D_{i′} from i′. If t > 0, then i′ was moved from U⋆ to U_f because at iteration t a client i was added to U⋆ such that D_i < D_{i′} and F_i ∩ F_{i′} ≠ ∅ (see the condition of Procedure Update-U⋆(i) in Algorithm 1). Again, by the enforcement of Eq. 7 and the triangle inequality, there exists one total facility at distance at most 3D_{i′} from i′ immediately after iteration t. Now note that after iteration t, the client i ∈ U⋆ with F_i ∩ F_{i′} ≠ ∅ and D_i ≤ D_{i′} might itself be removed from U⋆, in which case we no longer have the guarantee of Eq. 7. Let i_0 := i.
We define i_{q+1} to be the client that caused the removal of client i_q from U⋆ (through Procedure Update-U⋆) after iteration t. Note that by the condition of Update-U⋆, D_{i_{q+1}} < D_{i_q}; and because we have rounded the distances to powers of (1 + λ), this gives D_{i_{q+1}} ≤ D_{i_q}/(1 + λ). Let i_r be the last point in this chain, i.e., i_r caused the removal of i_{r−1} and i_r stayed in U⋆ until the termination of the algorithm. Then, by the guarantee of Eq. 7 and the triangle inequality, there is one total facility for i′ within distance

D_{i′} + Σ_{j=0}^{r} 2D_{i_j} ≤ D_{i′} + 2 Σ_{j=0}^{r} D_{i′}/(1 + λ)^j ≤ (1 + 2(1 + λ)/λ) D_{i′}.

Proof of Claim 2.4. By Claim 2.2, the cost of the linear program is non-increasing over iterations, since a feasible solution to one iteration remains feasible for the next. Thus the objective value of ŷ in LP(U⋆, U_f, D) is at most (1 + λ)^p times the optimal cost of LP1 (the factor (1 + λ)^p is lost by rounding all distances to powers of (1 + λ)). We now construct x such that (x, ŷ) is feasible for LP1; the construction processes the clients of U⋆ and U_f one by one and always terminates. For any i ∈ U⋆, we define x_{ij} = ŷ_j for each j ∈ F_i. Observe that Σ_{j∈F_i} x_{ij} = 1 for such i, so we obtain feasibility for this client. For any i ∈ U_f, we define x_{ij} = ŷ_j for each j ∈ B_i. Since we only insisted Σ_{j∈B_i} ŷ_j ≤ 1, we still need to assign 1 − Σ_{j∈B_i} x_{ij} = 1 − ŷ(B_i) facilities to client i. For this remaining amount 1 − ŷ(B_i), we notice that by Claim 2.3 there is at least one total facility within distance (1 + 2(1 + λ)/λ) D_i of this client. Thus we can assign the remaining 1 − ŷ(B_i) fraction to client i at distance at most (1 + 2(1 + λ)/λ) D_i. Note that the cost increases by a factor of at most (1 + 2(1 + λ)/λ)^p.

C OMITTED PROOFS OF SECTION 3

Proof of Lemma 3.2. First note that a facility i in OPT_I cannot be α-dense because d(i, OPT_I) = 0. Let (j_1, j′_1), . . . , (j_ℓ, j′_ℓ) be a sequence of pairs of facilities such that for every b = 1, . . . , ℓ:

• j_b ∈ F \ ∪_{z=1}^{b−1} FBall_I(j_z, d(j_z, j′_z)) is an opt_I/(mt)-dense facility; and
• j′_b is the closest facility to j_b in OPT_I.

We show that ℓ ≤ m²t. For b ∈ [ℓ] and s ∈ [m], let B_{b,s} := CBall_{I,s}(j_b, (1/3)d(j_b, j′_b)). First we show that for any group s ∈ [m], the client balls B_{1,s}, . . . , B_{ℓ,s} are disjoint. Let 1 ≤ z < w ≤ ℓ. By the triangle inequality, d(j_w, j′_z) ≤ d(j_w, j_z) + d(j_z, j′_z). Moreover, by definition, j_w ∉ FBall_I(j_z, d(j_z, j′_z)), so d(j_z, j′_z) ≤ d(j_w, j_z) and hence d(j_w, j′_z) ≤ 2d(j_w, j_z). Since j′_w is the closest facility to j_w in OPT_I, d(j_w, j′_w) ≤ d(j_w, j′_z) ≤ 2d(j_w, j_z). Combining this with d(j_z, j′_z) ≤ d(j_w, j_z) implies (1/3)(d(j_z, j′_z) + d(j_w, j′_w)) ≤ d(j_z, j_w). If B_{z,s} and B_{w,s} overlap, then there exists u ∈ B_{z,s} ∩ B_{w,s}, and by the triangle inequality d(j_z, j_w) ≤ d(j_z, u) + d(j_w, u) < (1/3)d(j_z, j′_z) + (1/3)d(j_w, j′_w), which is a contradiction. Therefore, for each s ∈ [m], the balls B_{1,s}, . . . , B_{ℓ,s} are disjoint. Also, since A_1, . . . , A_m are disjoint, all of the B_{b,s} are disjoint over b ∈ [ℓ] and s ∈ [m].

By definition, for any b ∈ [ℓ], there exists s_b ∈ [m] such that ((2/3)d(j_b, OPT_I))^p |B_{b,s_b}| > opt_I/(mt). Therefore, if ℓ > m²t, then Σ_{b=1}^{ℓ} ((2/3)d(j_b, OPT_I))^p |B_{b,s_b}| > m·opt_I. Thus

m · max_{s∈[m]} Σ_{b=1}^{ℓ} ((2/3)d(j_b, OPT_I))^p |B_{b,s}| ≥ Σ_{s∈[m]} Σ_{b=1}^{ℓ} ((2/3)d(j_b, OPT_I))^p |B_{b,s}| > m·opt_I.

Note that the connection cost of a client in B_{b,s} in the optimal solution is at least ((2/3)d(j_b, OPT_I))^p = ((2/3)d(j_b, j′_b))^p. Therefore, as the B_{b,s} are disjoint, opt_I ≥ max_{s∈[m]} Σ_{b=1}^{ℓ} ((2/3)d(j_b, OPT_I))^p |B_{b,s}|. This is a contradiction. Therefore ℓ ≤ m²t.
Thus Algorithm 2 returns an instance with the desired properties.
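The final counting step of the proof of Lemma 3.2 can be written as a single chain (in the proof's notation; the first inequality assumes, for contradiction, that ℓ > m²t, so that each of the ℓ terms exceeds opt_I/(mt)):

```latex
\[
m\,\mathrm{opt}_I
\;<\; \sum_{b=1}^{\ell} \Bigl(\tfrac{2}{3}\, d(j_b,\mathrm{OPT}_I)\Bigr)^{p} \lvert B_{b,s_b}\rvert
\;\le\; \sum_{s\in[m]} \sum_{b=1}^{\ell} \Bigl(\tfrac{2}{3}\, d(j_b,\mathrm{OPT}_I)\Bigr)^{p} \lvert B_{b,s}\rvert
\;\le\; m \max_{s\in[m]} \sum_{b=1}^{\ell} \Bigl(\tfrac{2}{3}\, d(j_b,\mathrm{OPT}_I)\Bigr)^{p} \lvert B_{b,s}\rvert
\;\le\; m\,\mathrm{opt}_I ,
\]
```

where the last inequality uses the disjointness of the balls together with the lower bound on each client's optimal connection cost; the strict inequality against itself is the desired contradiction.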

D OMITTED PROOFS OF SECTION 4

Proof of Lemma 4.2. If Algorithm 3 ends in Step 7, then cost_I(T) + mβ is at most cost_I(T) + (2/t)(2 opt_I + (1 + 3/δ)^p cost_I(T)) = O(opt_I + cost_I(T)). Otherwise, we run the loop. Now we show that there exist sets D_0 ⊆ T′ and V_0 ⊆ F such that |V_0| < m²t, |D_0| + |V_0| = k, and S_{D_0,V_0} satisfies the desired properties.

For a facility j ∈ T′, let L_j = d(j, T′ \ {j}) and ℓ_j = d(j, OPT_I). We say j ∈ T′ is determined if ℓ_j ≤ δL_j, and undetermined otherwise. Let D_0 = {j ∈ T′ : ℓ_j ≤ δL_j}. For j ∈ D_0, let f*_j be the closest facility to j in OPT_I, and let V_0 = OPT_I \ {f*_j : j ∈ D_0}. First note that for any two distinct facilities j, j′ ∈ D_0, d(j, j′) ≥ max{L_j, L_{j′}}. Moreover, by definition, d(j, f*_j) ≤ δL_j ≤ δ max{L_j, L_{j′}}. Therefore, by the triangle inequality, d(j′, f*_j) ≥ (1 − δ) max{L_j, L_{j′}}. Moreover, by definition and because δ ∈ (0, 1/8), (1 − δ) max{L_j, L_{j′}} > δL_{j′} ≥ d(j′, f*_{j′}). Therefore d(j′, f*_j) > d(j′, f*_{j′}). Thus for any two distinct j, j′ ∈ D_0, f*_j ≠ f*_{j′}, hence |{f*_j : j ∈ D_0}| = |D_0| and |V_0| = |OPT_I| − |D_0| = k − |D_0|.

Let U_0 = T′ \ D_0 be the set of undetermined facilities. Since |T′| > k, |V_0| = k − |D_0| = k − |T′| + |U_0| < |U_0|. We show |U_0| < m²t. For every j ∈ T′ and s ∈ [m], let A_{s,j} be the set of clients of group s that are connected to j in solution T′ and let C_{s,j} be the total connection cost of these clients, so that cost_I(T′) = max_{s∈[m]} Σ_{j∈T′} C_{s,j}. Let j* := arg min_{j∈U_0} Σ_{s∈[m]} C_{s,j}, and let j be the closest facility to j* in T′ \ {j*}, i.e., d(j*, j) = L_{j*}. Then, combining with Eq. 10, cost_I(T′ \ {j*}) − cost_I(T′) ≤ max_{s∈[m]} Σ_{i∈A_{s,j*}} d(i, j)^p. For s ∈ [m], let A^in_{s,j*} := A_{s,j*} ∩ CBall_{I,s}(j*, (1/3)δL_{j*}) and A^out_{s,j*} := A_{s,j*} \ A^in_{s,j*}. By the triangle inequality, for any i ∈ A^in_{s,j*}, d(i, j) ≤ (1 + (1/3)δ)L_{j*}.
Moreover, since j* is undetermined, L_{j*} < ℓ_{j*}/δ, so for i ∈ A^in_{s,j*}, d(i, j) < (1 + (1/3)δ)(1/δ)ℓ_{j*} = (1/δ + 1/3)ℓ_{j*}. For i ∈ A^out_{s,j*}, d(i, j) ≤ d(i, j*) + d(j*, j) = d(i, j*) + L_{j*} ≤ (1 + 3/δ)d(i, j*). Therefore

cost_I(T′ \ {j*}) − cost_I(T′) ≤ max_{s∈[m]} ( opt_I/(mt) + (1 + 3/δ)^p C_{s,j*} ) ≤ opt_I/(mt) + (1 + 3/δ)^p Σ_{s∈[m]} C_{s,j*}.

Hence cost_I(T′ \ {j*}) − cost_I(T′) ≤ 2 opt_I/(mt) + (3/2)(1 + 3/δ)^p cost_I(T)/(mt) ≤ β. This is a contradiction, because then j* should have been removed in Step 4 of Algorithm 3. Therefore |U_0| < m²t.

Now we need to bound the cost of S_{D_0,V_0}. For j ∈ D_0 and s ∈ [m], let i ∈ CBall_{I,s}(j, (1/3)L_j). By the triangle inequality, the distance of i to any facility in FBall_I(j, δL_j) is at most (1/3 + δ)L_j. For a facility j′ ∈ D_0 with j′ ≠ j, by the triangle inequality and because d(j, j′) ≥ max{L_j, L_{j′}}, the distance of i to any facility in FBall_I(j′, δL_{j′}) is at least d(j, j′) − L_j/3 − δL_{j′} ≥ d(j, j′) − (1/3 + δ)d(j, j′) = (2/3 − δ)d(j, j′) ≥ (2/3 − δ)L_j. For δ < 1/8, we have 1/3 + δ < 2/3 − δ. Therefore, i is connected either to f_j or to a facility in V_0. Let the α^{(s,j)}_{f_j} be as defined in Algorithm 3 for D_0 and V_0, and let (f̂_j : j ∈ D_0) be a (1 + ϵ)-approximate solution for the following, obtained by Lemma 4.1. For s ∈ [m], let T_s = ∪_{j∈D_0} CBall_{I,s}(j, L_j/3). Since for each j ∈ D_0 the facility f*_j ∈ OPT_I also lies in the ball FBall_I(j, δL_j),

max_{s∈[m]} Σ_{i∈T_s} d(i, S_{D_0,V_0})^p ≤ (1 + ϵ) max_{s∈[m]} Σ_{i∈T_s} d(i, OPT_I)^p.

Now consider a client i ∈ A_s \ T_s. If, in the optimal solution, i is connected to a facility in V_0, then by definition d(i, OPT_I) ≥ d(i, S_{D_0,V_0}). Otherwise, in the optimal solution, i is connected to f*_j ∈ FBall_I(j, δL_j) for some j ∈ D_0. We compare d(i, f̂_j) to d(i, f*_j). Since f̂_j, f*_j ∈ FBall_I(j, δL_j), by the triangle inequality and because d(i, j) ≥ L_j/3,

d(i, f̂_j)^p / d(i, f*_j)^p ≤ (d(i, j) + δL_j)^p / (d(i, j) − δL_j)^p ≤ (L_j/3 + δL_j)^p / (L_j/3 − δL_j)^p = ((1 + 3δ)/(1 − 3δ))^p.



In some other works, the p-th root of the objective is considered, and therefore the approximation factors look different in such works. Informally, Gap-ETH states that there is no 2^{o(n)}-time algorithm to distinguish between a satisfiable formula and a formula that is not even (1 − ϵ)-satisfiable.



Figure 1: (a) Distance of i ′ from the facilities of its representative i. (b) Solid and dashed circles are the balls corresponding to representative (U ⋆ ) and non-representative clients (U f ), respectively.

otherwise. The number of LPs generated by this enumeration is (nk)^{O(m²/ϵ)}.

(a) Credit dataset (2 groups). (b) Adult dataset (5 groups). (c) Adult dataset (10 groups).

Figure 2: Comparison of our bicriteria algorithm with ABV Abbasi et al. (2021). The number of centers our algorithm selects is close to k and is often smaller than ABV's (see Section E.3).

As illustrated in Section E.2, the performance of our algorithm does not change significantly as λ varies, so in our comparisons we fix λ = 0.6. Figure 4 illustrates that our algorithm outperforms ABV on different benchmark datasets. The gap between the performance of our algorithm and ABV grows as the number of groups and k grow. For example, for the Adult dataset with 10 groups and k = 50, the objective value of ABV is almost twice the objective that our algorithm achieves.

Exactly k centers. The MV algorithm first sparsifies the linear programming relaxation by setting to zero the connection variables of points that are far from each other. It then applies a randomized rounding algorithm similar to Charikar et al. (2002), based on consolidating centers and points. In the process of rounding, it produces a (1 − γ)-restricted solution, i.e., a solution in which each center is either open by a fraction of at least (1 − γ) or closed; the algorithm requires γ < 0.5. The results of MV for different values of γ are presented in Section E.2. It appears that MV performs better for larger values of γ, so below we use γ = 0.1 and γ = 0.4 for our comparisons. Figure 5 illustrates that our algorithm outperforms MV on different benchmark datasets. As in the bicriteria case, the gap between the performance of our algorithm and MV grows as the number of groups and k grow. For example, for the Adult dataset with 5 or 10 groups and k = 50, the objective value of MV is almost three times the objective that our algorithm achieves.
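The guessing loop used for MV (running it with several geometrically spaced estimates of the optimal objective and keeping the best outcome) can be sketched as follows; `run_mv` is a hypothetical stand-in for the MV rounding routine, not its actual interface:

```python
def guess_and_run(run_mv, base_guess, num_guesses=5):
    """Run an algorithm that requires an estimate of the optimal objective
    with geometrically spaced guesses (successive factors of two), and
    keep the best solution found.  `run_mv(guess)` is assumed to return a
    (solution, objective) pair."""
    best_sol, best_obj = None, float("inf")
    for i in range(num_guesses):
        sol, obj = run_mv(base_guess * 2 ** i)
        if obj < best_obj:
            best_sol, best_obj = sol, obj
    return best_sol, best_obj
```

Because the guesses grow geometrically, one of them is within a factor of two of the true optimum, at the price of running the rounding `num_guesses` times.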

Figure 3: Comparison of our algorithm with k centers with MV Makarychev and Vakilian (2021).

By definition, Σ_{s∈[m]} C_{s,j*} = min_{j∈U_0} Σ_{s∈[m]} C_{s,j} ≤ m·cost_I(T′)/|U_0|. So if |U_0| ≥ m²t, then Σ_{s∈[m]} C_{s,j*} ≤ cost_I(T′)/(mt). Moreover, since |T \ T′| < m, cost_I(T′) < cost_I(T) + mβ.

Thus, because δ ≤ 1/8, (1 + 3δ)/(1 − 3δ) ≤ 1 + 12δ. Moreover, since δ < log(1 + ϵ′)/12, cost_I(S_{D_0,V_0}) ≤ (1 + ϵ′)^p · opt_I. Finally, note that the loop runs for n^{O(m²t)} iterations because |V| < m²t and |T \ D| ≤ m²t + m. Moreover, by Lemma 4.1, each iteration runs in (nk)^{O(m²/ϵ)} time.

E OMITTED EMPIRICAL RESULTS

E.1 COMPARISON OF ALGORITHMS SHOWING BOTH MAXIMUM AND MINIMUM

(a) Credit Dataset (2 groups). (b) COMPAS dataset (2 groups). (c) Adult dataset (5 groups).

Figure 4: Comparison of our bicriteria algorithm with ABV Abbasi et al. (2021). The max and min on Subfigure (c) are across the demographic groups and are used to prevent cluttering plots with 5 groups. The number of centers our algorithm selects is close to k and is often smaller than ABV (see Section E.3).

Figure 5: Comparison of our algorithm with exactly k centers with MV Makarychev and Vakilian (2021). The max and min on Subfigure (c) are across the groups and are used to prevent cluttering plots with 5 groups.

Figure 6: Performance of our bicriteria algorithm for different values of λ. The max and min on Subfigure (c) are across the demographic groups.

Figure 7: Performance of our bicriteria algorithm of ABV Abbasi et al. (2021) for different values of ϵ. The max and min on Subfigure (c) are across the demographic groups.

(a) Credit Dataset (2 groups). (b) COMPAS dataset (2 groups). (c) Adult dataset (5 groups).

Figure 8: Performance of our algorithm with exactly k centers for different values of λ. The max and min on Subfigure (c) are across the demographic groups.

Figure 9: Performance of the MV algorithm Makarychev and Vakilian (2021) for different values of γ. The max and min on Subfigure (c) are across the demographic groups.


and for any s ∈ [m], CBall_{I,s}(j*, (1/3)δL_{j*}) ⊆ CBall_{I,s}(j*, (1/3)ℓ_{j*}). Thus Σ_{i∈A^in_{s,j*}} d(i, j)^p ≤ ((1/δ + 1/3)ℓ_{j*})^p |CBall_{I,s}(j*, (1/3)ℓ_{j*})|. Therefore, since I is an opt_I/(mt)-sparse instance, Σ_{i∈A^in_{s,j*}} d(i, j)^p ≤ opt_I/(mt). For i ∈ A^out_{s,j*}, d(i, j*) ≥ (1/3)δL_{j*}; thus (3/δ)d(i, j*) ≥ L_{j*}, and by the triangle inequality, d(i, j) ≤ d(i, j*) + L_{j*} ≤ (1 + 3/δ)d(i, j*).

The number of selected centers for our bicriteria algorithm on the Credit dataset. λ is a parameter of the algorithm and denotes the amount of decrease in radii of balls around the clients in the iterative rounding algorithm.

The number of selected centers for the bicriteria algorithm of Abbasi, Bhaskara, and Venkatasubramanian (Abbasi et al. (2021)) on the Credit dataset. ϵ is a parameter of the algorithm. The maximum number of selected centers is k/(1 − ϵ), which achieves a 2/ϵ approximation factor.

The number of selected centers for our bicriteria algorithm on the COMPAS dataset. λ is a parameter of the algorithm and denotes the amount of decrease in radii of balls around the clients in the iterative rounding algorithm.

The number of selected centers for the bicriteria algorithm of Abbasi, Bhaskara, and Venkatasubramanian (Abbasi et al. (2021)) on the COMPAS dataset. ϵ is a parameter of the algorithm. The maximum number of selected centers is k/(1 − ϵ), which achieves a 2/ϵ approximation factor.

The number of selected centers for our bicriteria algorithm on the Adult dataset with 5 groups. λ is a parameter of the algorithm and denotes the amount of decrease in radii of balls around the clients in the iterative rounding algorithm.

The number of selected centers for the bicriteria algorithm of Abbasi, Bhaskara, and Venkatasubramanian (Abbasi et al. (2021)) on the Adult dataset with 5 groups. ϵ is a parameter of the algorithm. The maximum number of selected centers is k/(1 − ϵ), which achieves a 2/ϵ approximation factor.

The number of selected centers for our bicriteria algorithm on the Adult dataset with 10 groups. λ is a parameter of the algorithm and denotes the amount of decrease in radii of balls around the clients in the iterative rounding algorithm.

The number of selected centers for the bicriteria algorithm of Abbasi, Bhaskara, and Venkatasubramanian (Abbasi et al. (2021)) on the Adult dataset with 10 groups. ϵ is a parameter of the algorithm. The maximum number of selected centers is k/(1 − ϵ), which achieves a 2/ϵ approximation factor.

E.4 COMPARISON OF RUNNING TIME OF DIFFERENT ALGORITHMS IN PRACTICE

Comparison of the running time of different algorithms on the first 200 samples of the Credit dataset averaged over five runs.

Comparison of the running time of different algorithms on the first 200 samples of the COMPAS dataset averaged over five runs.

Comparison of the running time of different algorithms on the first 200 samples of the Adult dataset with 5 race groups averaged over five runs.

Comparison of the running time of different algorithms on the first 500 samples of the Adult dataset with 10 race and gender groups averaged over five runs.

