CONSTANT-FACTOR APPROXIMATION ALGORITHMS FOR SOCIALLY FAIR k-CLUSTERING

Abstract

We study approximation algorithms for the socially fair $(\ell_p, k)$-clustering problem with $m$ groups, which includes the socially fair k-median ($p = 1$) and k-means ($p = 2$). We present (1) a polynomial-time $(5 + 2\sqrt{6})^p$-approximation with at most $k + m$ centers, (2) a $(5 + 2\sqrt{6} + \epsilon)^p$-approximation with $k$ centers in time $(nk)^{2^{O(p)} \cdot m^2/\epsilon}$, and (3) a $(15 + 6\sqrt{6})^p$-approximation with $k$ centers in time $k^m \cdot \mathrm{poly}(n)$. The first is obtained by a refinement of the iterative rounding method via a sequence of linear programs. The latter two are obtained by converting a solution with up to $k + m$ centers to one with $k$ centers: by sparsification methods for (2), and via an exhaustive search for (3). We also compare the performance of our algorithms with existing approximation algorithms on benchmark datasets, and find that our algorithms outperform existing methods.

1. INTRODUCTION

Automated decision making using machine learning algorithms is being widely adopted in modern society. Examples of real-world decisions being made by ML algorithms are innumerable and include applications with considerable societal effects, such as automated content moderation Gorwa et al. (2020) and recidivism prediction Angwin et al. (2016). This necessitates designing (new) machine learning algorithms that incorporate societal considerations, especially fairness Dwork et al. (2012); Kearns and Roth (2019).

The facility location problem is a well-studied problem in combinatorial optimization. Famous instances include the k-means, k-median, and k-center problems, where the input is a finite metric and the goal is to find $k$ points ("centers" or "facilities") such that a function of the distances of each given point to its nearest center is minimized. For k-means, the objective is the average squared distance to the nearest center; for k-median, it is the average distance; and for k-center, it is the maximum distance. These are all captured by the $(\ell_p, k)$-clustering problem, defined as follows: given a set of clients $A$ of size $n$, a set of candidate facility locations $\mathcal{F}$, and a metric $d$, find a subset $F \subset \mathcal{F}$ of size $k$ that minimizes $\sum_{i \in A} d(i, F)^p$, where $d(i, F) = \min_{j \in F} d(i, j)$. This is NP-hard for all $p$, and also hard to approximate Drineas et al. (2004); Guha and Khuller (1999). A $2^{O(p)}$-approximation algorithm was given by Charikar et al. (2002). The current best approximation factors for k-median and k-means on general metrics are a $(2.675 + \epsilon)$-approximation Byrka et al. (2014) and a $(9 + \epsilon)$-approximation Kanungo et al. (2004); Ahmadian et al. (2019), respectively.

Here we consider socially fair extensions of the $(\ell_p, k)$-clustering problem in which $m$ different (not necessarily disjoint) subgroups, $A = A_1 \cup \cdots \cup A_m$, among the data are given, and the goal is to minimize the maximum cost over the groups, so that a common solution is not too expensive for any one of them. Each group can be a subset of the data or simply any nonnegative weighting. The goal is to minimize the maximum weighted cost among the groups, i.e.,
$$\min_{F \subset \mathcal{F} : |F| = k} \; \max_{s \in [m]} \; \sum_{i \in A_s} w_s(i)\, d(i, F)^p.$$
A weighting of $w_s(i) = 1/|A_s|$, for $i \in A_s$, corresponds to the average cost of each group. The groups usually arise from sensitive attributes such as race and gender (that are protected against discrimination)^1.

^1 In some other works, the $p$'th root of the objective is considered, and therefore the approximation factors look different in such works.
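To make the socially fair objective concrete, the following is a minimal Python sketch that evaluates it and finds an exact minimizer by enumerating all size-$k$ facility subsets. All function names here are illustrative, not from the paper, and the brute-force search is exponential in $k$; the approximation algorithms discussed in this paper exist precisely to avoid such enumeration.

```python
import itertools
import math

def fair_clustering_cost(points, groups, weights, centers, p, dist):
    """Socially fair cost: max over groups s of sum_{i in A_s} w_s(i) * d(i, F)^p,
    where d(i, F) is the distance from point i to its nearest center in F."""
    def group_cost(s):
        return sum(
            weights[s][i] * min(dist(points[i], c) for c in centers) ** p
            for i in groups[s]
        )
    return max(group_cost(s) for s in range(len(groups)))

def brute_force_fair_clustering(points, facilities, groups, weights, k, p, dist):
    """Exact minimizer over all size-k subsets of facilities.
    Exponential in k; for illustrating the objective only."""
    best_cost, best_F = math.inf, None
    for F in itertools.combinations(facilities, k):
        c = fair_clustering_cost(points, groups, weights, F, p, dist)
        if c < best_cost:
            best_cost, best_F = c, F
    return best_cost, best_F

# Tiny 1-D example: two well-separated pairs on the line, one pair per group,
# with the uniform weighting w_s(i) = 1/|A_s| (the average-cost setting).
points = [0.0, 1.0, 10.0, 11.0]
groups = [[0, 1], [2, 3]]          # indices of points in each group A_s
weights = [{0: 0.5, 1: 0.5}, {2: 0.5, 3: 0.5}]
dist = lambda a, b: abs(a - b)

# With k = 1 some group must pay for the far pair; with k = 2 both groups
# can be served cheaply (p = 1, i.e., the k-median-style objective).
cost1, _ = brute_force_fair_clustering(points, points, groups, weights, 1, 1, dist)
cost2, _ = brute_force_fair_clustering(points, points, groups, weights, 2, 1, dist)
```

Here `cost1` is 9.5 (with a single center, the farther group averages a distance of at least 9.5) while `cost2` drops to 0.5, illustrating how the min-max objective tracks the worst-off group rather than the overall average.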

