CONSTANT-FACTOR APPROXIMATION ALGORITHMS FOR SOCIALLY FAIR k-CLUSTERING

Abstract

We study approximation algorithms for the socially fair $(\ell_p, k)$-clustering problem with m groups, which includes the socially fair k-median (p = 1) and k-means (p = 2) problems. We present (1) a polynomial-time $(5 + 2\sqrt{6})^p$-approximation with at most k + m centers, (2) a $(5 + 2\sqrt{6} + \epsilon)^p$-approximation with k centers in time $(nk)^{2^{O(p)} m^2/\epsilon}$, and (3) a $(15 + 6\sqrt{6})^p$-approximation with k centers in time $k^m \cdot \mathrm{poly}(n)$. The first is obtained by a refinement of the iterative rounding method via a sequence of linear programs. The latter two are obtained by converting a solution with up to k + m centers to one with k centers, by sparsification methods for (2) and via an exhaustive search for (3). We also compare the performance of our algorithms with existing approximation algorithms on benchmark datasets, and find that our algorithms outperform existing methods.

1. INTRODUCTION

Automated decision making using machine learning algorithms is being widely adopted in modern society. Examples of real-world decisions being made by ML algorithms are innumerable and include applications with considerable societal effects, such as automated content moderation Gorwa et al. (2020) and recidivism prediction Angwin et al. (2016). This necessitates designing (new) machine learning algorithms that incorporate societal considerations, especially fairness Dwork et al. (2012); Kearns and Roth (2019).

The facility location problem is a well-studied problem in combinatorial optimization. Famous instances include the k-means, k-median and k-center problems, where the input is a finite metric and the goal is to find k points ("centers" or "facilities") such that a function of the distances of each given point to its nearest center is minimized. For k-means, the objective is the average squared distance to the nearest center; for k-median, it is the average distance; and for k-center, it is the maximum distance. These are all captured by the $(\ell_p, k)$-clustering problem, defined as follows: given a set of clients $A$ of size $n$, a set of candidate facility locations $\mathcal{F}$, and a metric $d$, find a subset $F \subset \mathcal{F}$ of size $k$ that minimizes $\sum_{i \in A} d(i, F)^p$, where $d(i, F) = \min_{j \in F} d(i, j)$. This is NP-hard for all p, and also hard to approximate Drineas et al. (2004); Guha and Khuller (1999). A $2^{O(p)}$-approximation algorithm was given by Charikar et al. (2002).¹

Here we consider socially fair extensions of the $(\ell_p, k)$-clustering problem in which m different (not necessarily disjoint) subgroups, $A = A_1 \cup \cdots \cup A_m$, among the data are given, and the goal is to minimize the maximum cost over the groups, so that a common solution is not too expensive for any one of them. Each group can be a subset of the data or simply any nonnegative weighting. Formally, the goal is to minimize the maximum weighted cost among the groups, i.e.,

$$\min_{F \subset \mathcal{F} : |F| = k} \; \max_{s \in [m]} \; \sum_{i \in A_s} w_s(i)\, d(i, F)^p. \tag{1}$$
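To make the min-max objective above concrete, here is a minimal Python sketch. It is an illustration only, not one of the algorithms developed in this paper. It evaluates the socially fair cost of a candidate set of centers and minimizes it by exhaustive search over all k-subsets, which is feasible only for tiny inputs. The function names, the one-dimensional metric, and the toy instance are all assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations


def fair_cost(centers, groups, dist, p=1):
    """Socially fair cost: max over groups s of sum_i w_s(i) * d(i, F)^p."""
    return max(
        sum(w * min(dist(i, j) for j in centers) ** p for i, w in weights.items())
        for weights in groups
    )


def total_cost(centers, clients, dist, p=1):
    """Plain (l_p, k)-clustering cost: sum_i d(i, F)^p, ignoring groups."""
    return sum(min(dist(i, j) for j in centers) ** p for i in clients)


def brute_force(facilities, k, cost):
    """Return a k-subset of facilities minimizing cost (tiny instances only)."""
    return min(combinations(facilities, k), key=cost)


# Hypothetical toy instance on the real line (an illustrative assumption).
dist = lambda a, b: abs(a - b)
facilities = [1, 6, 10]
# Each group maps a client to its weight; here w_s(i) = 1/|A_s|.
group_1 = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
group_2 = {10: Fraction(1)}
groups = [group_1, group_2]
clients = [0, 1, 2, 10]

fair_centers = brute_force(facilities, 1, lambda F: fair_cost(F, groups, dist))
plain_centers = brute_force(facilities, 1, lambda F: total_cost(F, clients, dist))
print(fair_centers, fair_cost(fair_centers, groups, dist))
print(plain_centers, fair_cost(plain_centers, groups, dist))
```

On this instance, the search under the fair objective selects the facility at 6, with fair cost 5. Minimizing the total distance instead selects the facility at 1, which leaves the second group with cost 9.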
In objective (1), a weighting of $w_s(i) = 1/|A_s|$ for $i \in A_s$ corresponds to the average of groups. The groups usually arise from sensitive attributes such as race and gender (that are protected against discrimination under the Civil Rights Act of 1968 Hutchinson and Mitchell (2019); Benthall and Haynes (2019)). The cases of p = 1 and p = 2 are the socially fair k-median and k-means, respectively, introduced by Ghadiri et al. (2021); Abbasi et al. (2021). As discussed in Ghadiri et al. (2021), the objective of the socially fair k-means promotes a more equitable average clustering cost among different groups. The objective function of socially fair k-median was first studied by Anthony et al. (2010), who gave an $O(\log m + \log n)$-approximation algorithm. Moreover, the existing approximation algorithms for the vanilla k-means and k-median can be used to find $O(m)$-approximate solutions for the socially fair versions Ghadiri et al. (2021); Abbasi et al. (2021). The proof technique directly yields an $m \cdot 2^{O(p)}$-approximation for the socially fair $(\ell_p, k)$-clustering. The current best approximation factors for vanilla k-median and k-means on general metrics are a $(2.675 + \epsilon)$-approximation Byrka et al. (2014) and a $(9 + \epsilon)$-approximation Kanungo et al. (2004); Ahmadian et al. (2019), respectively.

The natural linear programming (LP) relaxation of the socially fair k-median problem has an integrality gap of $\Omega(m)$ Abbasi et al. (2021). More recently, Makarychev and Vakilian (2021) strengthened the LP relaxation of the socially fair $(\ell_p, k)$-clustering by a sparsification technique. The stronger LP has an integrality gap of $\Omega(\log m / \log\log m)$, and their rounding algorithm (similar to Charikar et al. (2002)) gives a $(2^{O(p)} \log m / \log\log m)$-approximation for the socially fair $(\ell_p, k)$-clustering. For the socially fair k-median, this is asymptotically the best possible in polynomial time under the assumption $\mathrm{NP} \not\subseteq \bigcap_{\delta > 0} \mathrm{DTIME}(2^{n^{\delta}})$ Bhattacharya et al. (2014).

Due to this hardness result, it is natural to consider a bicriteria approximation, which allows for more centers whose total cost is close to the optimal cost for k centers. For the socially fair k-median and $0 < \epsilon < 1$, Abbasi et al. (2021) present an algorithm that gives at most $k/(1 - \epsilon)$ centers with objective value at most $2^{O(p)}/\epsilon$ times the optimum for k centers. Our first result is an improved bicriteria approximation algorithm for the socially fair $\ell_p$ clustering problem with only m additional centers (m is usually a small constant).

Theorem 1.1. There is a polynomial-time bicriteria approximation algorithm for the socially fair $(\ell_p, k)$-clustering problem with m groups that finds a solution with at most k + m centers of cost at most $(5 + 2\sqrt{6})^p \approx 9.9^p$ times the optimal cost for a solution with k centers.

Goyal and Jaiswal (2021) show that a solution to the socially fair $(\ell_p, k)$-clustering problem with $k' > k$ centers and cost $C$ can be converted to a solution with k centers and cost at most $3^{p-1}(C + 2\,\mathrm{opt})$ by simply taking the k-subset of the $k'$ centers of lowest cost. A proof is in the appendix for completeness. We improve this factor using a sparsification technique.

Theorem 1.2. For any $\epsilon > 0$, there is a $(5 + 2\sqrt{6} + \epsilon)^p$-approximation algorithm for the socially fair $(\ell_p, k)$-clustering problem that runs in time $(nk)^{2^{O(p)} m^2/\epsilon}$; there is a $(15 + 6\sqrt{6})^p$-approximation algorithm that runs in time $k^m \cdot \mathrm{poly}(n)$.

This raises the question of whether a faster constant-factor approximation is possible. Goyal and Jaiswal (2021) show that, under the Gap-Exponential Time Hypothesis (Gap-ETH)², it is hard to approximate socially fair k-median and k-means within factors of $1 + 2/e - \epsilon$ and $1 + 8/e - \epsilon$, respectively, in time $g(k) \cdot n^{f(m) \cdot o(k)}$, for $f, g : \mathbb{R}^+ \to \mathbb{R}^+$; socially fair $(\ell_p, k)$-clustering is hard to approximate within a factor of $3^p - \epsilon$ in time $g(k) \cdot n^{o(k)}$. They also give a $(3 + \epsilon)^p$-approximation in time $(k/\epsilon)^{O(k)}\, \mathrm{poly}(n/\epsilon)$. This leaves open the possibility of a constant-factor approximation in time $f(m)\, \mathrm{poly}(n, k)$.

For the case of $p \to \infty$, the problem reduces to the fair k-center problem if we take the p-th root of the objective. This problem is much better understood and has been widely studied, along with many generalizations Jia et al. (2021); Anegg et al. (2021); Makarychev and Vakilian (2021). The result of Makarychev and Vakilian (2021) implies an $O(1)$-approximation in this case.

We compare the performance of our bicriteria algorithm against Abbasi et al. (2021) and our algorithm with exactly k centers against Makarychev and Vakilian (2021) on three different benchmark datasets. Our experiments show that our algorithms consistently outperform these in practice (Section 5) and often select fewer centers than the algorithm of Abbasi et al. (2021) (Section E.3).

¹ In some other works, the p-th root of the objective is considered, and therefore the approximation factors look different in such works.
² Informally, Gap-ETH states that there is no $2^{o(n)}$-time algorithm to distinguish between a satisfiable formula and a formula that is not even $(1 - \epsilon)$-satisfiable.

1.1 APPROACH AND TECHNIQUES

Our starting point is an LP relaxation of the problem. The integrality gap of the natural LP relaxation is m Abbasi et al. (2021). For our bicriteria result, we use an iterative rounding procedure, inspired

