STOCHASTIC DIFFERENTIALLY PRIVATE AND FAIR LEARNING

Abstract

Machine learning models are increasingly used in high-stakes decision-making systems. In such applications, a major concern is that these models sometimes discriminate against certain demographic groups, such as individuals of a certain race, gender, or age. Another major concern in these applications is the violation of users' privacy. While fair learning algorithms have been developed to mitigate discrimination issues, these algorithms can still leak sensitive information, such as individuals' health or financial records. Utilizing the notion of differential privacy (DP), prior works aimed to develop learning algorithms that are both private and fair. However, existing algorithms for DP fair learning are either not guaranteed to converge or require a full batch of data in each iteration of the algorithm to converge. In this paper, we provide the first stochastic differentially private algorithm for fair learning that is guaranteed to converge. Here, the term "stochastic" refers to the fact that our proposed algorithm converges even when minibatches of data are used at each iteration (i.e. stochastic optimization). Our framework is flexible enough to permit different fairness notions, including demographic parity and equalized odds. In addition, our algorithm can be applied to non-binary classification tasks with multiple (non-binary) sensitive attributes. As a byproduct of our convergence analysis, we provide the first utility guarantee for a DP algorithm for solving nonconvex-strongly concave min-max problems. Our numerical experiments show that the proposed algorithm consistently offers significant performance gains over the state-of-the-art baselines, and can be applied to larger-scale problems with non-binary target/sensitive attributes.

1. INTRODUCTION

In recent years, machine learning algorithms have been increasingly used to inform decisions with far-reaching consequences (e.g. whether to release someone from prison or grant them a loan), raising concerns about their compliance with laws, regulations, societal norms, and ethical values. Specifically, machine learning algorithms have been found to discriminate against certain "sensitive" demographic groups (e.g. racial minorities), prompting a profusion of algorithmic fairness research (Dwork et al., 2012; Sweeney, 2013; Datta et al., 2015; Feldman et al., 2015; Bolukbasi et al., 2016; Angwin et al., 2016; Calmon et al., 2017; Hardt et al., 2016a; Fish et al., 2016; Woodworth et al., 2017; Zafar et al., 2017; Bechavod & Ligett, 2017; Kearns et al., 2018; Prost et al., 2019; Baharlouei et al., 2020; Lowy et al., 2022a). The algorithmic fairness literature aims to develop fair machine learning algorithms that output non-discriminatory predictions. Fair learning algorithms typically need access to the sensitive data in order to ensure that the trained model is non-discriminatory. However, consumer privacy laws (such as the E.U. General Data Protection Regulation) restrict the use of sensitive demographic data in algorithmic decision-making. These two requirements (fair algorithms trained with private data) present a quandary: how can we train a model to be fair to a certain demographic group if we do not even know which of our training examples belong to that group? The works of Veale & Binns (2017) and Kilbertus et al. (2018) proposed a solution to this quandary using secure multi-party computation (MPC), which allows the learner to train a fair model without directly accessing the sensitive attributes. Unfortunately, as Jagielski et al. (2019) observed, MPC does not prevent the trained model from leaking sensitive data.
For example, with MPC, the output of the trained model could be used to infer the race of an individual in the training data set (Fredrikson et al., 2015; He et al., 2019; Song et al., 2020; Carlini et al., 2021). To prevent such leaks, Jagielski et al. (2019) argued for the use of differential privacy (Dwork et al., 2006) in fair learning. Differential privacy (DP) provides a strong guarantee that no company (or adversary) can learn much more about any individual than they could have learned had that individual's data never been used. Since Jagielski et al. (2019), several follow-up works have proposed alternate approaches to DP fair learning (Xu et al., 2019; Ding et al., 2020; Mozannar et al., 2020; Tran et al., 2021b;a; 2022). As shown in Fig. 1, each of these approaches suffers from at least two critical shortcomings. In particular, none of these methods has convergence guarantees when mini-batches of data are used in training. In training large-scale models, memory and efficiency constraints require the use of small minibatches in each iteration of training (i.e. stochastic optimization). Thus, existing DP fair learning methods cannot be used in such settings, since they require computations on the full training data set in every iteration. See Appendix A for a more comprehensive discussion of related work.

Our Contributions: In this work, we propose a novel algorithmic framework for DP fair learning. Our approach builds on the non-private fair learning method of Lowy et al. (2022a). We consider a regularized empirical risk minimization (ERM) problem in which the regularizer penalizes fairness violations, as measured by the Exponential Rényi Mutual Information (ERMI). Using a result from Lowy et al. (2022a), we reformulate this fair ERM problem as a min-max optimization problem. Then, we use an efficient differentially private variation of stochastic gradient descent-ascent (DP-SGDA) to solve this fair ERM min-max objective.
The main features of our algorithm are:
1. Guaranteed convergence for any privacy and fairness level, even when mini-batches of data are used in each iteration of training (i.e. the stochastic optimization setting). As discussed, stochastic optimization is essential in large-scale machine learning scenarios. Our algorithm is the first stochastic DP fair learning method with provable convergence.
2. Flexibility to handle non-binary classification with multiple (non-binary) sensitive attributes (e.g. race and gender) under different fairness notions such as demographic parity or equalized odds. In each of these cases, our algorithm is guaranteed to converge.
Empirically, we show that our method outperforms the previous state-of-the-art methods in terms of the fairness vs. accuracy trade-off across all privacy levels. Moreover, our algorithm is capable of training with mini-batch updates and can handle non-binary target and non-binary sensitive attributes. By contrast, existing DP fairness algorithms could not converge in our stochastic/non-binary experiment. A byproduct of our algorithmic developments and analyses is the first convergent DP algorithm for nonconvex min-max optimization: namely, we provide an upper bound on the stationarity gap of DP-SGDA for solving problems of the form $\min_\theta \max_W F(\theta, W)$, where $F(\cdot, W)$ is non-convex. We expect this result to be of independent interest to the DP optimization community. Prior works that provide convergence results for DP min-max problems have assumed that $F(\cdot, W)$ is either (strongly) convex (Boob & Guzmán, 2021; Zhang et al., 2022) or satisfies a generalization of strong convexity known as the Polyak-Łojasiewicz (PL) condition (Yang et al., 2022).

2. PROBLEM SETTING AND PRELIMINARIES

Let Z " tz i " px i , s i , y i qu n i"1 be a data set with non-sensitive features x i P X , discrete sensitive attributes (e.g. race, gender) s i P rks fi t1, . . . , ku, and labels y i P rls. Let p y θ pxq denote the model predictions parameterized by θ, and ℓpθ, x, yq " ℓpp y θ pxq, yq be a loss function (e.g. cross-entropy loss). Our goal is to (approximately) solve the empirical risk minimization (ERM) problem min θ # p Lpθq :" 1 n n ÿ i"1 ℓpθ, x i , y i q + (1) in a fair manner, while maintaining the differential privacy of the sensitive data ts i u n i"1 . We consider two different notions of fairness in this work:foot_0 Definition 2.1 (Fairness Notions). Let A : Z Ñ Y be a classifier. • A satisfies demographic parity (Dwork et al., 2012) if the predictions ApZq are statistically independent of the sensitive attributes. • A satisfies equalized odds (Hardt et al., 2016a) if the predictions ApZq are conditionally independent of the sensitive attributes given Y " y for all y. Depending on the specific problem at hand, one fairness notion may be more desirable than the other (Dwork et al., 2012; Hardt et al., 2016a) . In practical applications, achieving exact fairness, i.e. (conditional) independence of p Y and S, is unrealistic. In fact, achieving exact fairness can be impossible for a differentially private algorithm that achieves non-trivial accuracy (Cummings et al., 2019) . Thus, we instead aim to design an algorithm that achieves small fairness violation on the given data set Z. Fairness violation can be measured in different ways: see e.g. Lowy et al. (2022a) for a thorough survey. For example, if demographic parity is the desired fairness notion, then one can measure (empirical) demographic parity violation by max p yPY max sPS ˇˇp p Y |S pp y|sq ´p p Y pp yq ˇˇ, where p denotes an empirical probability calculated directly from pZ, tp y i u n i"1 q. Next, we define differential privacy (DP). 
Following the DP fair learning literature (Jagielski et al., 2019; Tran et al., 2021b; 2022), we consider a relaxation of DP in which only the sensitive attributes require privacy. Say $Z$ and $Z'$ are adjacent with respect to sensitive data if $Z = \{(x_i, y_i, s_i)\}_{i=1}^n$, $Z' = \{(x_i, y_i, s'_i)\}_{i=1}^n$, and there is a unique $i \in [n]$ such that $s_i \neq s'_i$.

Definition 2.2 (Differential Privacy w.r.t. Sensitive Attributes). Let $\epsilon \geq 0$, $\delta \in [0, 1)$. A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private w.r.t. sensitive attributes $S$ (DP) if for all pairs of data sets $Z, Z'$ that are adjacent w.r.t. sensitive attributes, we have
$$\mathbb{P}(\mathcal{A}(Z) \in O) \leq e^\epsilon\, \mathbb{P}(\mathcal{A}(Z') \in O) + \delta \qquad (3)$$
for all measurable $O \subseteq \mathcal{Y}$.

As discussed in Section 1, Definition 2.2 is useful if a company wants to train a fair model, but is unable to use the sensitive attributes (which are needed to train a fair model) due to privacy concerns and laws (e.g., the E.U. GDPR). Definition 2.2 enables the company to privately use the sensitive attributes to train a fair model, while satisfying legal and ethical constraints. That being said, Definition 2.2 still may not prevent leakage of non-sensitive data. Thus, if the company is concerned with privacy of user data beyond the sensitive demographic attributes, then it should impose DP for all the features. Our algorithm and analysis readily extend to DP for all features: see Section 3. Throughout the paper, we restrict attention to data sets that contain at least a $\rho$-fraction of every sensitive attribute for some $\rho \in (0, 1)$: i.e.
$$\frac{1}{|Z|} \sum_{i=1}^{|Z|} \mathbb{1}_{\{s_i = r\}} \geq \rho \quad \text{for all } r \in [k].$$
This is a reasonable assumption in practice: for example, if sex is the sensitive attribute and a data set contains all men, then training a model that is fair with respect to sex and has non-trivial performance (better than random) seems almost impossible. Understanding what performance is (im)possible for DP fair learning in the absence of sample diversity is an important direction for future work.

3. PRIVATE FAIR ERM VIA EXPONENTIAL RÉNYI MUTUAL INFORMATION

A standard in-processing strategy in the literature for enforcing fairness is to add a regularization term to the empirical objective that penalizes fairness violations (Zhang et al., 2018; Donini et al., 2018; Mary et al., 2019; Baharlouei et al., 2020; Cho et al., 2020b; Lowy et al., 2022a). We can then jointly optimize for fairness and accuracy by solving
$$\min_\theta \Big\{ \hat{L}(\theta) + \lambda D(\hat{Y}, S, Y) \Big\},$$
where $D$ is some measure of statistical (conditional) dependence between the sensitive attributes and the predictions (given $Y$), and $\lambda \geq 0$ is a scalar balancing fairness and accuracy considerations. Following Lowy et al. (2022a), we take $D$ to be the empirical Exponential Rényi Mutual Information (ERMI), denoted $\hat{D}_R(\hat{Y}, S)$ (Theorem 3.1). Further, ERMI provides an upper bound on other commonly used measures of fairness violation: e.g. the demographic parity violation (2), Shannon mutual information (Cho et al., 2020a), Rényi correlation (Baharlouei et al., 2020), and $L_q$ fairness violation (Kearns et al., 2018; Hardt et al., 2016a) (Lowy et al., 2022a). This implies that any algorithm that makes ERMI small will also have small fairness violation with respect to these other notions. Lastly, (Lowy et al., 2022a, Proposition 2) shows that empirical ERMI (Theorem 3.1) is an asymptotically unbiased estimator of "population ERMI", which can be defined as in Theorem 3.1, except that empirical distributions are replaced by their population counterparts. Our approach to enforcing fairness is to augment (1) with an ERMI regularizer and privately solve:
$$\min_\theta \Big\{ \text{FERMI}(\theta) := \hat{L}(\theta) + \lambda \hat{D}_R(\hat{Y}_\theta(X), S) \Big\}. \qquad \text{(FERMI obj.)}$$
Since empirical ERMI is an asymptotically unbiased estimator of population ERMI, a solution to (FERMI obj.) is likely to generalize to the corresponding fair population risk minimization problem (Lowy et al., 2022a). There are numerous ways to privately solve (FERMI obj.). For example, one could use the exponential mechanism (McSherry & Talwar, 2007), or run noisy gradient descent (GD) (Bassily et al., 2014).
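The statement of Theorem 3.1 (the definition of empirical ERMI) is not reproduced in this excerpt. By analogy with its conditional variant in eq. (5) of Appendix B, the demographic-parity version can be written as follows (our reconstruction, following Lowy et al. (2022a); to be checked against the original):

```latex
\hat{D}_R(\hat{Y}; S)
  := \mathbb{E}\!\left[\frac{\hat{p}_{\hat{Y},S}(\hat{Y}, S)}
                            {\hat{p}_{\hat{Y}}(\hat{Y})\,\hat{p}_S(S)}\right] - 1
  = \sum_{j=1}^{l}\sum_{r=1}^{k}
      \frac{\hat{p}_{\hat{Y},S}(j, r)^2}{\hat{p}_{\hat{Y}}(j)\,\hat{p}_S(r)} - 1,
```

where $\hat{p}_{\hat{Y},S}$, $\hat{p}_{\hat{Y}}$, and $\hat{p}_S$ are empirical distributions; $\hat{D}_R(\hat{Y}; S) = 0$ if and only if $\hat{Y}$ and $S$ are independent (demographic parity).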
The problem with these approaches is that they are inefficient or require computing $n$ gradients at every iteration, which is prohibitive for large-scale problems, as discussed earlier. Notice that we could not simply run noisy stochastic GD (SGD) on (FERMI obj.), because we do not (yet) have a statistically unbiased estimate of $\nabla_\theta \hat{D}_R(\hat{Y}_\theta(X), S)$. Our next goal is to derive a stochastic, differentially private fair learning algorithm. For feature input $x$, let the predicted class labels be given by $\hat{y}(x, \theta) = j \in [l]$ with probability $\mathcal{F}_j(x, \theta)$, where $\mathcal{F}(x, \theta)$ is differentiable in $\theta$, has range $[0, 1]^l$, and $\sum_{j=1}^l \mathcal{F}_j(x, \theta) = 1$. For instance, $\mathcal{F}(x, \theta) = (\mathcal{F}_1(x, \theta), \ldots, \mathcal{F}_l(x, \theta))$ could represent the output of a neural network after a softmax layer, or the label probabilities assigned by a logistic regression model. Then we have the following min-max re-formulation of (FERMI obj.):

Theorem 3.2 (Lowy et al. (2022a)). There are differentiable functions $\hat{\psi}_i$ such that (FERMI obj.) is equivalent to
$$\min_\theta \max_{W \in \mathbb{R}^{k \times l}} \Big\{ \hat{F}(\theta, W) := \hat{L}(\theta) + \lambda \frac{1}{n} \sum_{i=1}^n \hat{\psi}_i(\theta, W) \Big\}. \qquad (4)$$
Further, $\hat{\psi}_i(\theta, \cdot)$ is strongly concave for all $\theta$.

The functions $\hat{\psi}_i$ are given explicitly in Appendix C. Theorem 3.2 is useful because it permits us to use stochastic optimization to solve (FERMI obj.): for any batch size $m \in [n]$, the gradients (with respect to $\theta$ and $W$) of $\frac{1}{m} \sum_{i \in B} \big[ \ell(x_i, y_i; \theta) + \lambda \hat{\psi}_i(\theta, W) \big]$ are statistically unbiased estimators of the gradients of $\hat{F}(\theta, W)$ if $B$ is drawn uniformly from $Z$. However, when differential privacy of the sensitive attributes is also desired, the formulation (4) presents some challenges, due to the non-convexity of $\hat{F}(\cdot, W)$. Indeed, no known DP algorithm for solving non-convex min-max problems had been proven to converge. Next, we provide the first such convergence guarantee.

3.1. NOISY DP-FERMI FOR STOCHASTIC PRIVATE FAIR ERM

Our proposed stochastic DP algorithm for solving (FERMI obj.) is given in Algorithm 1. It is a noisy DP variation of two-timescale stochastic gradient descent-ascent (SGDA) (Lin et al., 2020).

Algorithm 1 DP-FERMI Algorithm for Private Fair ERM
1: Input: $\theta_0 \in \mathbb{R}^{d_\theta}$, $W_0 = 0 \in \mathbb{R}^{k \times l}$, step-sizes $(\eta_\theta, \eta_w)$, fairness parameter $\lambda \geq 0$, iteration number $T$, minibatch size $|B_t| = m \in [n]$, set $\mathcal{W} \subset \mathbb{R}^{k \times l}$, noise parameters $\sigma_w^2, \sigma_\theta^2$.
2: Compute $\hat{P}_S^{-1/2}$.
3: for $t = 0, 1, \ldots, T$ do
4:   Draw a mini-batch $B_t$ of data points $\{(x_i, s_i, y_i)\}_{i \in B_t}$.
5:   Set $\theta_{t+1} \leftarrow \theta_t - \frac{\eta_\theta}{|B_t|} \sum_{i \in B_t} \big[ \nabla_\theta \ell(x_i, y_i; \theta_t) + \lambda \big( \nabla_\theta \hat{\psi}_i(\theta_t, W_t) + u_t \big) \big]$, where $u_t \sim N(0, \sigma_\theta^2 I_{d_\theta})$.
6:   Set $W_{t+1} \leftarrow \Pi_{\mathcal{W}} \big( W_t + \eta_w \big[ \frac{\lambda}{|B_t|} \sum_{i \in B_t} \nabla_w \hat{\psi}_i(\theta_t, W_t) + V_t \big] \big)$, where $V_t$ is a $k \times l$ matrix with independent Gaussian entries $(V_t)_{r,j} \sim N(0, \sigma_w^2)$.
7: end for
8: Pick $\hat{t}$ uniformly at random from $\{1, \ldots, T\}$.
9: Return: $\hat{\theta}_T := \theta_{\hat{t}}$.

Explicit formulae for $\nabla_\theta \hat{\psi}_i(\theta_t, W_t)$ and $\nabla_w \hat{\psi}_i(\theta_t, W_t)$ are given in Theorem D.1 (Appendix D). We provide the privacy guarantee of Algorithm 1 in Theorem 3.3:

Theorem 3.3. Let $\epsilon \leq 2 \ln(1/\delta)$, $\delta \in (0, 1)$, and $T \geq \big( \frac{n \sqrt{\epsilon}}{2m} \big)^2$. Assume $\mathcal{F}(x, \cdot)$ is $L_\theta$-Lipschitz for all $x$, and $|(W_t)_{r,j}| \leq D$ for all $t \in [T]$, $r \in [k]$, $j \in [l]$. Then, for $\sigma_w^2 \geq \frac{16 T \ln(1/\delta)}{\epsilon^2 n^2 \rho}$ and $\sigma_\theta^2 \geq \frac{16 L_\theta^2 D^2 \ln(1/\delta) T}{\epsilon^2 n^2 \rho}$, Algorithm 1 is $(\epsilon, \delta)$-DP with respect to the sensitive attributes for all data sets containing at least a $\rho$-fraction of minority attributes. Further, if $\sigma_w^2 \geq \frac{32 T \ln(1/\delta)}{\epsilon^2 n^2} \big( \frac{1}{\rho} + D^2 \big)$ and $\sigma_\theta^2 \geq \frac{64 L_\theta^2 D^2 \ln(1/\delta) T}{\epsilon^2 n^2 \rho} + \frac{32 D^4 L_\theta^2 l^2 T \ln(1/\delta)}{\epsilon^2 n^2}$, then Algorithm 1 is $(\epsilon, \delta)$-DP (with respect to all features) for all data sets containing at least a $\rho$-fraction of minority attributes.

See Appendix D for the proof. Next, we give a convergence guarantee for Algorithm 1:

Theorem 3.4.
Assume the loss function $\ell(\cdot, x, y)$ and $\mathcal{F}(x, \cdot)$ are Lipschitz continuous with Lipschitz gradients for all $(x, y)$, and $\hat{p}_S(r) \geq \rho > 0$ for all $r \in [k]$. In Algorithm 1, choose $\mathcal{W}$ to be a sufficiently large ball that contains $W^*(\theta) := \arg\max_W \hat{F}(\theta, W)$ for every $\theta$ in some neighborhood of $\theta^* \in \arg\min_\theta \max_W \hat{F}(\theta, W)$. Then there exist algorithmic parameters such that the $(\epsilon, \delta)$-DP Algorithm 1 returns $\hat{\theta}_T$ with
$$\mathbb{E} \big\| \nabla \text{FERMI}(\hat{\theta}_T) \big\|^2 = O\left( \frac{\sqrt{\max(d_\theta, kl) \ln(1/\delta)}}{\epsilon n} \right),$$
treating $D = \text{diameter}(\mathcal{W})$, $\lambda$, $\rho$, $l$, and the Lipschitz and smoothness parameters of $\ell$ and $\mathcal{F}$ as constants.

Theorem 3.4 shows that Algorithm 1 finds an approximate stationary point of (FERMI obj.). Finding approximate stationary points is generally the best one can hope to do in polynomial time for nonconvex optimization (Murty & Kabadi, 1985). The stationarity gap in Theorem 3.4 depends on the number of samples $n$ and model parameters $d_\theta$, the desired level of privacy $(\epsilon, \delta)$, and the number of labels $l$ and sensitive attributes $k$. For large-scale models (e.g. deep neural nets), we typically have $d_\theta \gg 1$ and $k, l = O(1)$, so that the convergence rate of Algorithm 1 is essentially immune to the number of labels and sensitive attributes. In contrast, no existing works with convergence guarantees are able to handle non-binary classification ($l > 2$), even with full batches and a single binary sensitive attribute. A few more remarks are in order. First, the utility bound in Theorem 3.4 corresponds to DP for all of the features. If DP is only required for the sensitive attributes, then using the smaller $\sigma_\theta^2, \sigma_w^2$ in Theorem 3.3 would improve the dependence on the constants $D, l, L_\theta$ in the utility bound. Second, the choice of $\mathcal{W}$ in Theorem 3.4 implies that (4) is equivalent to $\min_\theta \max_{W \in \mathcal{W}} \hat{F}(\theta, W)$, which is what our algorithm directly solves (cf. (7)).
Lastly, note that while we return a uniformly random iterate in Algorithm 1 for our theoretical convergence analysis, we recommend returning the last iterate $\theta_T$ in practice: our numerical experiments show strong performance of the last iterate. In Theorem E.1 of Appendix E, we prove a result that is more general than Theorem 3.4. Theorem E.1 shows that noisy DP-SGDA converges to an approximate stationary point of any smooth nonconvex-strongly concave min-max optimization problem (not just (4)). We expect Theorem E.1 to be of general interest to the DP optimization community beyond its applications to DP fair learning, since it is the first DP convergence guarantee for nonconvex min-max optimization. We also give a bound on the iteration complexity $T$ in Appendix E. The proof of Theorem E.1 involves a careful analysis of how the Gaussian noises propagate through the optimization trajectories of $\theta_t$ and $W_t$. Compared with DP non-convex minimization analyses (e.g. Wang et al. (2019); Hu et al. (2021); Ding et al. (2021b); Lowy et al. (2022b)), the two noises required to privatize the solution of the min-max problem we consider complicate the analysis and require careful tuning of $\eta_\theta$ and $\eta_w$. Compared to existing analyses of DP min-max games in Boob & Guzmán (2021); Yang et al. (2022); Zhang et al. (2022), which assume that $f(\cdot, w)$ is convex or PL, dealing with non-convexity is a challenge that requires different optimization techniques.
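The update pattern of Algorithm 1 can be sketched schematically as follows. This is our own illustrative code, not the paper's implementation: the gradient callables (`grad_theta_loss`, `grad_theta_psi`, `grad_w_psi`) are hypothetical stand-ins for the formulas of Theorem D.1, and the box clipping is a stand-in for the projection $\Pi_{\mathcal{W}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgda_step(theta, W, batch, lam, eta_theta, eta_w,
                 sigma_theta, sigma_w, D,
                 grad_theta_loss, grad_theta_psi, grad_w_psi):
    """One noisy descent-ascent step in the spirit of Algorithm 1 (DP-FERMI):
    noisy gradient descent on theta, noisy projected gradient ascent on W."""
    # descent on theta; Gaussian noise privatizes the sensitive-data-dependent part
    g_theta = np.mean([grad_theta_loss(theta, z)
                       + lam * grad_theta_psi(theta, W, z) for z in batch], axis=0)
    theta = theta - eta_theta * (g_theta
                                 + lam * rng.normal(0.0, sigma_theta, theta.shape))
    # ascent on W, then projection onto the box {|W_rj| <= D} (stand-in for Pi_W)
    g_w = lam * np.mean([grad_w_psi(theta, W, z) for z in batch], axis=0)
    W = np.clip(W + eta_w * (g_w + rng.normal(0.0, sigma_w, W.shape)), -D, D)
    return theta, W
```

In the actual algorithm, the noise variances $\sigma_\theta^2, \sigma_w^2$ are calibrated via Theorem 3.3 to the desired $(\epsilon, \delta)$.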

4. NUMERICAL EXPERIMENTS

In this section, we evaluate the performance of our proposed approach (DP-FERMI) in terms of fairness violation vs. test error at different privacy levels. We present our results in two parts. In Section 4.1, we assess the performance of our method in training logistic regression models on several benchmark tabular datasets. Since this is a standard setup that existing DP fairness algorithms can handle, we are able to compare our method against the state-of-the-art baselines. We carefully tuned the hyperparameters of all baselines for a fair comparison. We find that DP-FERMI consistently outperforms all state-of-the-art baselines across all data sets and all privacy levels. These observations hold for both the demographic parity and equalized odds fairness notions. To quantify the improvement of our results over the state-of-the-art baselines, we calculated the performance gain with respect to fairness violation (at a fixed accuracy level) that our model yields over all the datasets. For demographic parity, we obtained a performance gain of 79.648% over Tran et al. (2021b) on average, and 65.89% in median. For equalized odds, the average performance gain was 96.65% and the median gain was 90.02%. In Section 4.2, we showcase the scalability of DP-FERMI by using it to train a deep convolutional neural network for classification on a large image dataset. In Appendix F, we give detailed descriptions of the data sets, experimental setups, and training procedures, along with additional results.

4.1 STANDARD BENCHMARK EXPERIMENTS: LOGISTIC REGRESSION ON TABULAR DATASETS

In the first set of experiments, we train a logistic regression model using DP-FERMI (Algorithm 1) for demographic parity and a modified version of DP-FERMI (described in Appendix F) for equalized odds. We compare DP-FERMI against all applicable publicly available baselines in each experiment.
For the Adult dataset, the task is to predict whether income exceeds $50K, with gender as the sensitive attribute. The Retired Adult dataset is the same as the Adult dataset, but with updated data; we use the same target and sensitive attributes for both experiments. The results for Adult and Retired Adult are shown in Figs. 2 and 6. The results in Fig. 5 empirically verify our main theoretical result: DP-FERMI converges even for non-binary classification with small batch size and non-binary sensitive attributes. We took Tran et al. (2021a;b) as our baselines and attempted to adapt them to this non-binary large-scale task. We observed that the baselines were very unstable during training and mostly gave degenerate results (predicting a single output irrespective of the input). By contrast, our method was able to obtain stable and meaningful tradeoff curves. Also, while Tran et al. (2022) reported results on UTK-Face, their code is not publicly available and we were unable to reproduce their results.

5. CONCLUDING REMARKS

Motivated by pressing legal, ethical, and social considerations, we studied the challenging problem of learning fair models with differentially private demographic data. We observed that existing works suffer from a few crucial limitations that render their approaches impractical for large-scale problems. Specifically, existing approaches require full batches of data in each iteration (and/or exponential runtime) in order to provide convergence/accuracy guarantees. We addressed these limitations by deriving a DP stochastic optimization algorithm for fair learning, and rigorously proved the convergence of the proposed method. Our convergence guarantee holds even for non-binary classification (with any hypothesis class, even one of infinite VC dimension, cf. Jagielski et al. (2019)) with multiple sensitive attributes and access to random minibatches of data in each iteration. Finally, we evaluated our method in extensive numerical experiments and found that it significantly outperforms the previous state-of-the-art methods in terms of the fairness-accuracy tradeoff. The potential societal impacts of our work are discussed in Appendix G.

APPENDIX

A ADDITIONAL DISCUSSION OF RELATED WORK

The study of differentially private fair learning algorithms was initiated by Jagielski et al. (2019), who considered equalized odds and proposed two DP algorithms: 1) an ϵ-DP post-processing approach derived from Hardt et al. (2016a); and 2) an (ϵ, δ)-DP in-processing approach based on Agarwal et al. (2018). The major drawback of their post-processing approach is the unrealistic requirement that the algorithm have access to the sensitive attributes at test time, which Jagielski et al. (2019) acknowledge. Tran et al. (2022) provided a semi-supervised fair "Private Aggregation of Teacher Ensembles" framework. A shortcoming of each of these most recent works is their lack of theoretical convergence or accuracy guarantees. In another vein, some works have observed the disparate impact of privacy constraints on demographic subgroups (Bagdasaryan et al., 2019; Tran et al., 2021c).

B EQUALIZED ODDS VERSION OF ERMI

If equalized odds (Hardt et al., 2016b) is the desired fairness notion, then one should use the following variation of ERMI as a regularizer (Lowy et al., 2022a):
$$\hat{D}_R(\hat{Y}; S | Y) := \mathbb{E} \left\{ \frac{\hat{p}_{\hat{Y}, S | Y}(\hat{Y}, S | Y)}{\hat{p}_{\hat{Y} | Y}(\hat{Y} | Y)\, \hat{p}_{S | Y}(S | Y)} \right\} - 1 = \sum_{y=1}^{l} \sum_{j=1}^{l} \sum_{r=1}^{k} \frac{\hat{p}_{\hat{Y}, S | Y}(j, r | y)^2}{\hat{p}_{\hat{Y} | Y}(j | y)\, \hat{p}_{S | Y}(r | y)}\, \hat{p}_Y(y) - 1. \qquad (5)$$
Here $\hat{p}_{\hat{Y}, S | Y}$ denotes the empirical joint distribution of the predictions and sensitive attributes $(\hat{Y}, S)$ conditional on the true labels $Y$. In particular, if $\hat{D}_R(\hat{Y}; S | Y) = 0$, then $\hat{Y}$ and $S$ are conditionally independent given $Y$ (i.e. equalized odds is satisfied).
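Eq. (5) can be estimated directly (non-privately) from a labeled sample. A minimal sketch of such an estimator (our own code; the function name is ours):

```python
import numpy as np

def ermi_equalized_odds(y_pred, s, y):
    """Empirical conditional ERMI, eq. (5): sum over (y, j, r) of
    phat(j, r | y)^2 / (phat(j | y) * phat(r | y)) * phat(y), minus 1."""
    y_pred, s, y = map(np.asarray, (y_pred, s, y))
    total = 0.0
    for yv in np.unique(y):
        mask = (y == yv)
        p_y = mask.mean()
        for j in np.unique(y_pred):
            p_j = (y_pred[mask] == j).mean()
            for r in np.unique(s):
                p_r = (s[mask] == r).mean()
                p_jr = ((y_pred[mask] == j) & (s[mask] == r)).mean()
                if p_j > 0 and p_r > 0:
                    total += p_jr**2 / (p_j * p_r) * p_y
    return total - 1.0
```

If the predictions are conditionally independent of $s$ given $y$, the double sum over $(j, r)$ equals 1 for each value of $y$, and the estimator returns 0.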

C COMPLETE VERSION OF THEOREM 3.2

Let $\hat{y}(x_i; \theta) \in \{0, 1\}^l$ and $s_i \in \{0, 1\}^k$ be the one-hot encodings of $\hat{y}(x_i, \theta)$ and $s_i$, respectively: i.e., $\hat{y}_j(x_i; \theta) = \mathbb{1}_{\{\hat{y}(x_i, \theta) = j\}}$ and $s_{i,r} = \mathbb{1}_{\{s_i = r\}}$ for $j \in [l]$, $r \in [k]$. Also, denote $\hat{P}_S = \text{diag}(\hat{p}_S(1), \ldots, \hat{p}_S(k))$, where $\hat{p}_S(r) := \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{s_i = r\}} \geq \rho > 0$ is the empirical probability of attribute $r$ ($r \in [k]$). Then we have the following re-formulation of (FERMI obj.) as a min-max problem:

Theorem C.1 (Lowy et al. (2022a)). (FERMI obj.) is equivalent to
$$\min_\theta \max_{W \in \mathbb{R}^{k \times l}} \Big\{ \hat{F}(\theta, W) := \hat{L}(\theta) + \lambda \frac{1}{n} \sum_{i=1}^n \hat{\psi}_i(\theta, W) \Big\},$$
where
$$\hat{\psi}_i(\theta, W) := -\text{Tr}\big( W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i]\, W^T \big) + 2\, \text{Tr}\big( W^T \hat{P}_S^{-1/2}\, \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] \big).$$

D PROOF OF THEOREM 3.3

Theorem 3.3 (Restated). Let $\epsilon \leq 2 \ln(1/\delta)$, $\delta \in (0, 1)$, and $T \geq \big( \frac{n \sqrt{\epsilon}}{2m} \big)^2$. Assume $\mathcal{F}(x, \cdot)$ is $L_\theta$-Lipschitz for all $x$, and $|(W_t)_{r,j}| \leq D$ for all $t \in [T]$, $r \in [k]$, $j \in [l]$. Then, for $\sigma_w^2 \geq \frac{16 T \ln(1/\delta)}{\epsilon^2 n^2 \rho}$ and $\sigma_\theta^2 \geq \frac{16 L_\theta^2 D^2 \ln(1/\delta) T}{\epsilon^2 n^2 \rho}$, Algorithm 1 is $(\epsilon, \delta)$-DP with respect to the sensitive attributes for all data sets containing at least a $\rho$-fraction of minority attributes. Further, if $\sigma_w^2 \geq \frac{32 T \ln(1/\delta)}{\epsilon^2 n^2} \big( \frac{1}{\rho} + D^2 \big)$ and $\sigma_\theta^2 \geq \frac{64 L_\theta^2 D^2 \ln(1/\delta) T}{\epsilon^2 n^2 \rho} + \frac{32 D^4 L_\theta^2 l^2 T \ln(1/\delta)}{\epsilon^2 n^2}$, then Algorithm 1 is $(\epsilon, \delta)$-DP (with respect to all features) for all data sets containing at least a $\rho$-fraction of minority attributes.

Proof. First consider the case in which only the sensitive attributes are private. By the moments accountant (Theorem 1 in Abadi et al. (2016)), it suffices to bound the sensitivities of the gradient updates by $\Delta_\theta^2 \leq \frac{8 D^2 L_\theta^2}{m^2 \rho}$ and $\Delta_w^2 \leq \frac{8}{m^2 \rho}$. Here
$$\Delta_\theta^2 = \sup_{Z \sim Z', \theta, W} \Big\| \frac{1}{m} \sum_{i \in B_t} \big[ \nabla_\theta \hat{\psi}(\theta, W; z_i) - \nabla_\theta \hat{\psi}(\theta, W; z'_i) \big] \Big\|^2,$$
and $Z \sim Z'$ means that $Z$ and $Z'$ are two data sets (both with a $\rho$-fraction of minority attributes) that differ in exactly one person's sensitive attributes: i.e. $s_i \neq s'_i$ for some unique $i \in [n]$, but $z_j = z'_j$ for all $j \neq i$ and $(x_i, y_i) = (x'_i, y'_i)$. Likewise,
$$\Delta_w^2 = \sup_{Z \sim Z', \theta, W} \Big\| \frac{1}{m} \sum_{i \in B_t} \big[ \nabla_w \hat{\psi}(\theta, W; z_i) - \nabla_w \hat{\psi}(\theta, W; z'_i) \big] \Big\|^2.$$
Now, by Theorem D.1,
$$\nabla_\theta \hat{\psi}_i(\theta, W) = -\nabla_\theta \text{vec}\big( \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] \big)^T \text{vec}(W^T W) + 2 \nabla_\theta \text{vec}\big( \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] \big)\, \text{vec}\big( W^T \hat{P}_S^{-1/2} \big),$$
and notice that only the second term depends on $S$. Therefore, we can bound the $\ell_2$-sensitivity of the $\theta$-gradient updates by:
$$\Delta_\theta^2 = \sup_{Z \sim Z', W, \theta} \Big\| \frac{1}{m} \Big[ 2 \nabla_\theta \text{vec}\big( \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] \big) \text{vec}\big( W^T \hat{P}_S^{-1/2} \big) - 2 \nabla_\theta \text{vec}\big( \mathbb{E}[s'_i \hat{y}(x_i, \theta)^T | x_i, s'_i] \big) \text{vec}\big( W^T \hat{P}_{S'}^{-1/2} \big) \Big] \Big\|^2$$
$$\leq \frac{4}{m^2} \sup_{x, s_i, s'_i, W, \theta} \Bigg[ \sum_{r=1}^k \sum_{j=1}^l \| \nabla_\theta \mathcal{F}_j(\theta, x) \|^2\, W_{r,j}^2 \Bigg( \frac{s_{i,r}}{\sqrt{\hat{p}_S(r)}} - \frac{s'_{i,r}}{\sqrt{\hat{p}_{S'}(r)}} \Bigg)^2 \Bigg] \leq \frac{8}{\rho m^2} \sup_{x, \theta} \sum_{j=1}^l \| \nabla_\theta \mathcal{F}_j(\theta, x) \|^2 \max_{r,j} W_{r,j}^2 \leq \frac{8 D^2 L_\theta^2}{\rho m^2},$$
using the Lipschitz continuity of $\mathcal{F}(x, \cdot)$, the assumption that the entries of $W$ are bounded by $D$, and the assumption that the data sets contain at least a $\rho$-fraction of each sensitive attribute $r \in [k]$. Similarly, for the $W$-gradients, Theorem D.1 gives
$$\nabla_w \hat{\psi}_i(\theta, W) = -2 W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] + 2 \hat{P}_S^{-1/2}\, \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i].$$
Hence
$$\Delta_w^2 = \sup_{\theta, W, s_i, s'_i} \frac{4}{m^2} \Big\| -W \text{diag}(\mathcal{F}_1(\theta, x_i), \ldots, \mathcal{F}_l(\theta, x_i)) + \hat{P}_S^{-1/2} \mathbb{E}[s_i \hat{y}(x_i; \theta_t)^T | x_i, s_i] + W \text{diag}(\mathcal{F}_1(\theta, x_i), \ldots, \mathcal{F}_l(\theta, x_i)) - \hat{P}_{S'}^{-1/2} \mathbb{E}[s'_i \hat{y}(x_i; \theta_t)^T | x_i, s'_i] \Big\|^2$$
$$\leq \frac{4}{m^2} \sup_{\theta, s_i, s'_i} \sum_{j=1}^l \mathcal{F}_j(\theta, x_i)^2 \sum_{r=1}^k \Bigg( \frac{s_{i,r}}{\sqrt{\hat{p}_S(r)}} - \frac{s'_{i,r}}{\sqrt{\hat{p}_{S'}(r)}} \Bigg)^2 \leq \frac{8}{m^2 \rho},$$
since $\sum_{j=1}^l \mathcal{F}_j(\theta, x_i)^2 \leq \sum_{j=1}^l \mathcal{F}_j(\theta, x_i) = 1$. This establishes the desired privacy guarantee with respect to the sensitive attributes for Algorithm 1. Now consider the case in which all features are private. We aim to bound the sensitivities of the gradient updates to changes in a single sample $z_i = (s_i, x_i, y_i)$.
Denote these new sensitivities by
$$\tilde{\Delta}_\theta^2 = \sup_{Z \sim Z', \theta, W} \Big\| \frac{1}{m} \sum_{i \in B_t} \big[ \nabla_\theta \hat{\psi}(\theta, W; z_i) - \nabla_\theta \hat{\psi}(\theta, W; z'_i) \big] \Big\|^2,$$
where we now write $Z \sim Z'$ to mean that $Z$ and $Z'$ are two data sets (both with a $\rho$-fraction of minority attributes) that differ in exactly one person's (sensitive and non-sensitive) data: i.e. $z_i \neq z'_i$ for some unique $i \in [n]$. Likewise,
$$\tilde{\Delta}_w^2 = \sup_{Z \sim Z', \theta, W} \Big\| \frac{1}{m} \sum_{i \in B_t} \big[ \nabla_w \hat{\psi}(\theta, W; z_i) - \nabla_w \hat{\psi}(\theta, W; z'_i) \big] \Big\|^2.$$
Then
$$\tilde{\Delta}_\theta = \frac{1}{m} \sup_{z_i, z'_i, \theta, W} \Big\| -\nabla_\theta \text{vec}\big( \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] \big)^T \text{vec}(W^T W) + 2 \nabla_\theta \text{vec}\big( \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] \big) \text{vec}\big( W^T \hat{P}_S^{-1/2} \big) + \nabla_\theta \text{vec}\big( \mathbb{E}[\hat{y}(x'_i, \theta) \hat{y}(x'_i, \theta)^T | x'_i] \big)^T \text{vec}(W^T W) - 2 \nabla_\theta \text{vec}\big( \mathbb{E}[s'_i \hat{y}(x'_i, \theta)^T | x'_i, s'_i] \big) \text{vec}\big( W^T \hat{P}_{S'}^{-1/2} \big) \Big\| \leq \frac{2 L_\theta l D}{m} + \Delta_\theta.$$
Thus, $\tilde{\Delta}_\theta^2 \leq \frac{4 L_\theta^2 l^2 D^2}{m^2} + 2 \Delta_\theta^2$. Therefore, by the moments accountant, the collection of all $\theta_t$ updates in Algorithm 1 is $(\epsilon, \delta)$-DP if
$$\sigma_\theta^2 \geq \frac{32 D^2 L_\theta^2 T \ln(1/\delta)}{\rho \epsilon^2 n^2} + \frac{8 D^2 L_\theta^2 l^2 T \ln(1/\delta)}{\epsilon^2 n^2} = \frac{8 L_\theta^2 D^2 T \ln(1/\delta)}{\epsilon^2 n^2} \Big( \frac{4}{\rho} + l^2 \Big).$$
Next, we bound the sensitivity $\tilde{\Delta}_w$ of the $W$-gradient updates. We have
$$\tilde{\Delta}_w^2 = \sup_{\theta, W, z_i, z'_i} \frac{4}{m^2} \Big\| -W \text{diag}(\mathcal{F}_1(\theta, x_i), \ldots, \mathcal{F}_l(\theta, x_i)) + \hat{P}_S^{-1/2} \mathbb{E}[s_i \hat{y}(x_i; \theta_t)^T | x_i, s_i] + W \text{diag}(\mathcal{F}_1(\theta, x'_i), \ldots, \mathcal{F}_l(\theta, x'_i)) - \hat{P}_{S'}^{-1/2} \mathbb{E}[s'_i \hat{y}(x'_i; \theta_t)^T | x'_i, s'_i] \Big\|^2$$
$$\leq 2 \Delta_w^2 + \frac{8}{m^2} \sup_{\theta, W, x_i, x'_i} \Big\| W \text{diag}\big( \mathcal{F}_1(\theta, x_i) - \mathcal{F}_1(\theta, x'_i), \ldots, \mathcal{F}_l(\theta, x_i) - \mathcal{F}_l(\theta, x'_i) \big) \Big\|^2 \leq 2 \Delta_w^2 + \frac{16 D^2}{m^2} \sup_{\theta, x_i} \sum_{j=1}^l \mathcal{F}_j(\theta, x_i)^2 \leq 2 \Delta_w^2 + \frac{16 D^2}{m^2}.$$
Therefore, by the moments accountant, the collection of all $W_t$ updates in Algorithm 1 is $(\epsilon, \delta)$-DP if $\sigma_w^2 \geq \frac{32 T \ln(1/\delta)}{\epsilon^2 n^2} \big( \frac{1}{\rho} + D^2 \big)$. This completes the proof.

E DP-FERMI ALGORITHM: UTILITY

To prove Theorem 3.4, we will first derive a more general result. Namely, in Appendix E.1, we will provide a precise upper bound on the stationarity gap of noisy DP stochastic gradient descent ascent (DP-SGDA).

E.1 NOISY DP-SGDA FOR NONCONVEX-STRONGLY CONCAVE MIN-MAX PROBLEMS

Consider a generic (smooth) nonconvex-strongly concave min-max ERM problem:
$$\min_{\theta \in \mathbb{R}^{d_\theta}} \max_{w \in \mathcal{W}} \left\{ F(\theta, w) := \frac{1}{n} \sum_{i=1}^n f(\theta, w; z_i) \right\}, \qquad (7)$$
where $f(\theta, \cdot\,; z)$ is $\mu$-strongly concave[2] for all $\theta, z$, but $f(\cdot, w; z)$ is potentially non-convex.

Published as a conference paper at ICLR 2023

Algorithm 2 Noisy Differentially Private Stochastic Gradient Descent-Ascent (DP-SGDA)
1: Input: data $Z$, $\theta_0 \in \mathbb{R}^{d_\theta}$, $w_0 \in \mathcal{W}$, step-sizes $(\eta_\theta, \eta_w)$, privacy noise parameters $\sigma_\theta, \sigma_w$, batch size $m$, iteration number $T \ge 1$.
2: for $t = 0, 1, \ldots, T-1$ do
3: Draw a batch of data points $\{z_i\}_{i=1}^m$ uniformly at random from $Z$.
4: Update $\theta_{t+1} \leftarrow \theta_t - \eta_\theta \left( \frac{1}{m} \sum_{i=1}^m \nabla_\theta f(\theta_t, w_t; z_i) + u_t \right)$, where $u_t \sim N(0, \sigma_\theta^2 I_{d_\theta})$, and $w_{t+1} \leftarrow \Pi_{\mathcal{W}}\left[ w_t + \eta_w \left( \frac{1}{m} \sum_{i=1}^m \nabla_w f(\theta_t, w_t; z_i) + v_t \right) \right]$, where $v_t \sim N(0, \sigma_w^2 I_{d_w})$.
5: end for
6: Draw $\widehat{\theta}_T$ uniformly at random from $\{\theta_t\}_{t=1}^T$.
7: Return: $\widehat{\theta}_T$

We propose Noisy DP-SGDA[3] (Algorithm 2) for privately solving (7); it is a noisy DP variant of two-timescale SGDA (Lin et al., 2020). We now provide the first theoretical convergence guarantee for DP non-convex min-max optimization:

Theorem E.1 (Privacy and Utility of Algorithm 2, Informal Version). Let $\epsilon \le 2\ln(1/\delta)$, $\delta \in (0,1)$. Assume: $f(\cdot, w; z)$ is $L_\theta$-Lipschitz[4] and $f(\theta, \cdot\,; z)$ is $L_w$-Lipschitz for all $\theta, w, z$; and $\mathcal{W} \subset \mathbb{R}^{d_w}$ is a convex, compact set. Denote $\Phi(\theta) = \max_{w \in \mathcal{W}} F(\theta, w)$. Choose $\sigma_w^2 = \frac{8 T L_w^2 \ln(1/\delta)}{\epsilon^2 n^2}$, $\sigma_\theta^2 = \frac{8 T L_\theta^2 \ln(1/\delta)}{\epsilon^2 n^2}$, and $T \ge \left( \frac{n\sqrt{\epsilon}}{2m} \right)^2$. Then, Algorithm 2 is $(\epsilon, \delta)$-DP. Further, if $f(\cdot, \cdot\,; z)$ has Lipschitz gradients and $f(\theta, \cdot\,; z)$ is strongly concave, then there exist $T, \eta_\theta, \eta_w$ such that
$$\mathbb{E}\|\nabla \Phi(\widehat{\theta}_T)\|^2 = O\left( \frac{\sqrt{d \ln(1/\delta)}}{\epsilon n} \right), \quad \text{where } d = \max(d_\theta, d_w).$$
(The expectation is solely over the algorithm.)

In our DP fair learning application, $f(\theta, W; z_i) = \ell(\theta, x_i, y_i) + \lambda \widehat{\psi}_i(\theta, W)$, and the strong concavity assumption on $f$ in Theorem E.1 is automatically satisfied, by Lowy et al. (2022a). The Lipschitz and smoothness assumptions on $f$ are standard in the optimization literature and are satisfied by the loss functions typically used in practice. In our application to DP-FERMI, these assumptions hold as long as the loss function $\ell$ and $\mathcal{F}$ are Lipschitz continuous with Lipschitz gradients.

Our next goal is to prove (the precise, scale-invariant version of) Theorem E.1. To that end, we require the following notation.

Notation and Assumptions: Let $f : \mathbb{R}^{d_\theta} \times \mathbb{R}^{d_w} \times \mathcal{Z} \to \mathbb{R}$, and $F(\theta, w) = \frac{1}{n}\sum_{i=1}^n f(\theta, w; z_i)$ for fixed training data $Z = (z_1, \cdots, z_n) \in \mathcal{Z}^n$. Let $\mathcal{W} \subset \mathbb{R}^{d_w}$ be a convex, compact set.
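To make the update rule of Algorithm 2 concrete, here is a minimal numpy sketch (our own illustrative implementation, not the exact experimental code): noisy minibatch descent in $\theta$, noisy projected minibatch ascent in $w$ over an origin-centered $\ell_2$ ball, and a uniformly random iterate returned at the end. Note that `sigma_theta` and `sigma_w` here are noise standard deviations.

```python
import numpy as np

def noisy_dp_sgda(grad_theta, grad_w, data, theta0, w0, eta_theta, eta_w,
                  sigma_theta, sigma_w, batch_size, T, radius, seed=0):
    """Sketch of Noisy DP-SGDA: minibatch gradient descent in theta and
    projected gradient ascent in w, with Gaussian noise added to both updates.
    Returns a uniformly random iterate theta_t (line 6 of Algorithm 2)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    w = np.asarray(w0, dtype=float)
    iterates = []
    for _ in range(T):
        idx = rng.choice(len(data), size=batch_size, replace=False)
        batch = data[idx]
        g_theta = np.mean([grad_theta(theta, w, z) for z in batch], axis=0)
        g_w = np.mean([grad_w(theta, w, z) for z in batch], axis=0)
        # noisy descent in theta, noisy ascent in w
        theta = theta - eta_theta * (g_theta + rng.normal(0.0, sigma_theta, theta.shape))
        w = w + eta_w * (g_w + rng.normal(0.0, sigma_w, w.shape))
        # Euclidean projection onto the origin-centered ball W of the given radius
        norm_w = np.linalg.norm(w)
        if norm_w > radius:
            w = w * (radius / norm_w)
        iterates.append(theta.copy())
    return iterates[rng.integers(T)]
```

Instantiating `grad_theta` and `grad_w` with the gradients of a regularized fair-learning objective recovers the structure of the DP fair learning algorithm analyzed in this appendix.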
For any $\theta \in \mathbb{R}^{d_\theta}$, denote $w^*(\theta) \in \mathrm{argmax}_{w \in \mathcal{W}} F(\theta, w)$ and $\widehat{\Phi}(\theta) = \max_{w \in \mathcal{W}} F(\theta, w)$. Let $\Delta_\Phi = \widehat{\Phi}(\theta_0) - \inf_\theta \widehat{\Phi}(\theta)$. Recall that a function $h$ is $\beta$-smooth if its derivative $\nabla h$ is $\beta$-Lipschitz. We write $a \lesssim b$ if there is an absolute constant $C > 0$ such that $a \le C b$.

Assumption E.2.
1. $f(\cdot, w; z)$ is $L_\theta$-Lipschitz and $\beta_\theta$-smooth for all $w \in \mathcal{W}$, $z \in \mathcal{Z}$.
2. $f(\theta, \cdot\,; z)$ is $L_w$-Lipschitz, $\beta_w$-smooth, and $\mu$-strongly concave on $\mathcal{W}$ for all $\theta \in \mathbb{R}^{d_\theta}$, $z \in \mathcal{Z}$.
3. $\|\nabla_w f(\theta, w; z) - \nabla_w f(\theta', w; z)\| \le \beta_{\theta w} \|\theta - \theta'\|$ and $\|\nabla_\theta f(\theta, w; z) - \nabla_\theta f(\theta, w'; z)\| \le \beta_{\theta w} \|w - w'\|$ for all $\theta, \theta', w, w', z$.
4. $\mathcal{W}$ has $\ell_2$ diameter bounded by $D \ge 0$.
5. $\nabla_w F(\theta, w^*(\theta)) = 0$ for all $\theta$, where $w^*(\theta)$ denotes the unconstrained global maximizer of $F(\theta, \cdot)$.

The first four assumptions are standard in (DP and min-max) optimization. The fifth assumption means that $\mathcal{W}$ contains the unconstrained global maximizer $w^*(\theta)$ of $F(\theta, \cdot)$ for all $\theta$. Hence (7) is equivalent to $\min_{\theta \in \mathbb{R}^{d_\theta}} \max_{w \in \mathbb{R}^{d_w}} F(\theta, w)$. This assumption is not actually necessary for our convergence result to hold, but we will need it when we apply our results to the DP fairness problem. Moreover, it simplifies the proof of our convergence result. We refer to problems of the form (7) that satisfy Assumption E.2 as "(smooth) nonconvex-strongly concave min-max." We denote $\kappa_w := \frac{\beta_w}{\mu}$ and $\kappa_{\theta w} := \frac{\beta_{\theta w}}{\mu}$. We can now provide the complete, precise version of Theorem E.1:

Theorem E.3 (Privacy and Utility of Algorithm 2, Formal Version). Let $\epsilon \le 2\ln(1/\delta)$, $\delta \in (0,1)$. Grant Assumption E.2. Choose $\sigma_w^2 = \frac{8TL_w^2\ln(1/\delta)}{\epsilon^2 n^2}$, $\sigma_\theta^2 = \frac{8TL_\theta^2\ln(1/\delta)}{\epsilon^2 n^2}$, and $T \ge \left(\frac{n\sqrt{\epsilon}}{2m}\right)^2$. Then Algorithm 2 is $(\epsilon,\delta)$-DP. Further, if we choose $\eta_\theta = \frac{1}{16\kappa_w(\beta_\theta + \beta_{\theta w}\kappa_{\theta w})}$, $\eta_w = \frac{1}{\beta_w}$, and
$$T \approx \sqrt{\kappa_w\big[\Delta_\Phi(\beta_\theta + \beta_{\theta w}\kappa_{\theta w}) + \beta_{\theta w}^2 D^2\big]}\;\epsilon n\; \min\left(\frac{1}{L_\theta\sqrt{d_\theta}},\; \frac{\beta_w}{\beta_{\theta w} L_w \sqrt{\kappa_w d_w}}\right),$$
then
$$\mathbb{E}\|\nabla\Phi(\widehat{\theta}_T)\|^2 \lesssim \sqrt{\kappa_w\big[\Delta_\Phi(\beta_\theta + \beta_{\theta w}\kappa_{\theta w}) + \beta_{\theta w}^2 D^2\big]} \left[ \frac{L_\theta\sqrt{d_\theta \ln(1/\delta)}}{\epsilon n} + \left(\frac{\beta_{\theta w}\sqrt{\kappa_w}}{\beta_w}\right)\frac{L_w\sqrt{d_w\ln(1/\delta)}}{\epsilon n} \right] + \frac{\mathbb{1}_{\{m<n\}}}{m}\left(L_\theta^2 + \frac{\kappa_w\beta_{\theta w}^2 L_w^2}{\beta_w^2}\right).$$
In particular, if
$$m \ge \min\left(\frac{\epsilon n L_\theta}{\sqrt{d_\theta\,\kappa_w\big[\Delta_\Phi(\beta_\theta+\beta_{\theta w}\kappa_{\theta w}) + \beta_{\theta w}^2 D^2\big]}},\; \frac{\epsilon n L_w\sqrt{\kappa_w}\,\beta_{\theta w}}{\beta_w\sqrt{d_w\,\kappa_w\big[\Delta_\Phi(\beta_\theta+\beta_{\theta w}\kappa_{\theta w}) + \beta_{\theta w}^2 D^2\big]}}\right),$$
then
$$\mathbb{E}\|\nabla\Phi(\widehat{\theta}_T)\|^2 \lesssim \sqrt{\kappa_w\big[\Delta_\Phi(\beta_\theta+\beta_{\theta w}\kappa_{\theta w}) + \beta_{\theta w}^2 D^2\big]}\left(\frac{\sqrt{\ln(1/\delta)}}{\epsilon n}\right)\left(L_\theta\sqrt{d_\theta} + \left(\frac{\beta_{\theta w}\sqrt{\kappa_w}}{\beta_w}\right)L_w\sqrt{d_w}\right).$$
The proof of Theorem E.3 will require several technical lemmas. These technical lemmas, in turn, require some preliminary lemmas, which we present below. We begin with a refinement of Lemma 4.3 from Lin et al. (2020):

Lemma E.4. Grant Assumption E.2. Then $\Phi$ is $2(\beta_\theta + \beta_{\theta w}\kappa_{\theta w})$-smooth with $\nabla\Phi(\theta) = \nabla_\theta F(\theta, w^*(\theta))$, and $w^*(\cdot)$ is $\kappa_{\theta w}$-Lipschitz.

Proof. The proof follows almost exactly as in the proof of Lemma 4.3 of Lin et al. (2020), using Danskin's theorem, but we carefully track the different smoothness parameters with respect to $w$ and $\theta$ (and their units) to obtain the more precise result.

Lemma E.5 (Lei et al. (2017)). Let $\{a_l\}_{l\in[n]}$ be an arbitrary collection of vectors such that $\sum_{l=1}^n a_l = 0$. Further, let $S$ be a uniformly random subset of $[n]$ of size $m$. Then
$$\mathbb{E}\left\|\frac{1}{m}\sum_{l\in S} a_l\right\|^2 = \frac{n-m}{(n-1)m}\cdot\frac{1}{n}\sum_{l=1}^n \|a_l\|^2 \le \frac{\mathbb{1}_{\{m<n\}}}{mn}\sum_{l=1}^n \|a_l\|^2.$$

Lemma E.6 (Co-coercivity of the gradient). For any $\beta$-smooth and convex function $g$, we have
$$\|\nabla g(a) - \nabla g(b)\|^2 \le 2\beta\big(g(a) - g(b) - \langle\nabla g(b), a - b\rangle\big), \quad \text{for all } a, b \in \mathrm{domain}(g).$$

Having recalled the necessary preliminaries, we now provide the novel technical ingredients that we will need for the proof of Theorem E.3. The next lemma quantifies the progress made in minimizing $\Phi$ from a single step of noisy stochastic gradient descent in $\theta$ (i.e. line 4 of Algorithm 2):

Lemma E.7. For all $t \in [T]$, the iterates of Algorithm 2 satisfy

Lemma E.9. Grant Assumption E.2. If $\eta_w = \frac{1}{\beta_w}$, then the iterates of Algorithm 2 satisfy, for all $t \ge 0$,
$$\mathbb{E}\|w^*(\theta_{t+1}) - w_{t+1}\|^2 \le \left(1 - \frac{1}{2\kappa_w} + 4\kappa_w\kappa_{\theta w}^2\eta_\theta^2\beta_{\theta w}^2\right)\mathbb{E}\|w^*(\theta_t) - w_t\|^2 + \frac{2}{\beta_w^2}\left(\frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}} + d_w\sigma_w^2\right) + 4\kappa_w\kappa_{\theta w}^2\eta_\theta^2\left(\mathbb{E}\|\nabla\Phi(\theta_t)\|^2 + d_\theta\sigma_\theta^2\right).$$

Proof. Fix any $t$ and denote $\delta_t := \mathbb{E}\|w^*(\theta_t) - w_t\|^2$, writing $w^* := w^*(\theta_t)$.
We may assume without loss of generality that $f(\theta, \cdot\,; z)$ is $\mu$-strongly convex (replacing $f$ by $-f$ and descending rather than ascending in $w$), so that $w_{t+1} = \Pi_{\mathcal{W}}\big[w_t - \frac{1}{\beta_w}\big(\frac{1}{m}\sum_{i=1}^m \nabla_w f(\theta_t, w_t; z_i) + v_t\big)\big] =: \Pi_{\mathcal{W}}\big[w_t - \frac{1}{\beta_w}(\nabla h(w_t) + v_t)\big] =: \Pi_{\mathcal{W}}\big[w_t - \frac{1}{\beta_w}\widetilde{\nabla}h(w_t)\big]$, where $\nabla h(w_t)$ denotes the minibatch gradient; with this sign convention, $F(\theta_t, \cdot)$ below denotes the strongly convex objective being minimized. Now,
$$\mathbb{E}\|w_{t+1} - w^*\|^2 = \mathbb{E}\left\|\Pi_{\mathcal{W}}\left[w_t - \tfrac{1}{\beta_w}\widetilde{\nabla}h(w_t)\right] - w^*\right\|^2 \le \mathbb{E}\left\|w_t - \tfrac{1}{\beta_w}\widetilde{\nabla}h(w_t) - w^*\right\|^2$$
$$= \mathbb{E}\|w_t - w^*\|^2 + \tfrac{1}{\beta_w^2}\left[\mathbb{E}\|\nabla h(w_t)\|^2 + d_w\sigma_w^2\right] - \tfrac{2}{\beta_w}\mathbb{E}\left\langle w_t - w^*, \widetilde{\nabla}h(w_t)\right\rangle$$
$$\le \mathbb{E}\|w_t - w^*\|^2 + \tfrac{1}{\beta_w^2}\left[\mathbb{E}\|\nabla h(w_t)\|^2 + d_w\sigma_w^2\right] - \tfrac{2}{\beta_w}\mathbb{E}\left[F(\theta_t, w_t) - F(\theta_t, w^*) + \tfrac{\mu}{2}\|w_t - w^*\|^2\right]$$
$$\le \delta_t\left(1 - \tfrac{\mu}{\beta_w}\right) - \tfrac{2}{\beta_w}\mathbb{E}\left[F(\theta_t, w_t) - F(\theta_t, w^*)\right] + \tfrac{\mathbb{E}\|\nabla h(w_t)\|^2}{\beta_w^2} + \tfrac{d_w\sigma_w^2}{\beta_w^2}.$$
Further,
$$\mathbb{E}\|\nabla h(w_t)\|^2 = \mathbb{E}\left[\|\nabla h(w_t) - \nabla_w F(\theta_t, w_t)\|^2 + \|\nabla_w F(\theta_t, w_t)\|^2\right] \le \frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}} + \mathbb{E}\|\nabla_w F(\theta_t, w_t)\|^2 \le \frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}} + 2\beta_w\,\mathbb{E}\left[F(\theta_t, w_t) - F(\theta_t, w^*(\theta_t))\right],$$
using independence and Lemma E.5 plus Lipschitz continuity of $f$ in the first inequality, and Lemma E.6 (plus part 5 of Assumption E.2) in the second inequality. This implies
$$\mathbb{E}\|w_{t+1} - w^*\|^2 \le \delta_t\left(1 - \frac{1}{\kappa_w}\right) + \frac{1}{\beta_w^2}\left[d_w\sigma_w^2 + \frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}}\right].$$
Therefore,
$$\delta_{t+1} = \mathbb{E}\|w_{t+1} - w^*(\theta_t) + w^*(\theta_t) - w^*(\theta_{t+1})\|^2 \le \left(1 + \frac{1}{2\kappa_w - 1}\right)\mathbb{E}\|w_{t+1} - w^*(\theta_t)\|^2 + 2\kappa_w\,\mathbb{E}\|w^*(\theta_t) - w^*(\theta_{t+1})\|^2$$
$$\le \left(1 + \frac{1}{2\kappa_w - 1}\right)\left(1 - \frac{1}{\kappa_w}\right)\delta_t + \frac{2}{\beta_w^2}\left[d_w\sigma_w^2 + \frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}}\right] + 2\kappa_w\,\mathbb{E}\|w^*(\theta_t) - w^*(\theta_{t+1})\|^2$$
$$\le \left(1 + \frac{1}{2\kappa_w - 1}\right)\left(1 - \frac{1}{\kappa_w}\right)\delta_t + \frac{2}{\beta_w^2}\left[d_w\sigma_w^2 + \frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}}\right] + 2\kappa_w\kappa_{\theta w}^2\,\mathbb{E}\|\theta_t - \theta_{t+1}\|^2$$
$$\le \left(1 + \frac{1}{2\kappa_w - 1}\right)\left(1 - \frac{1}{\kappa_w}\right)\delta_t + \frac{2}{\beta_w^2}\left[d_w\sigma_w^2 + \frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}}\right] + 4\kappa_w\kappa_{\theta w}^2\eta_\theta^2\left[\mathbb{E}\|\nabla_\theta F(\theta_t, w_t) - \nabla\Phi(\theta_t)\|^2 + \mathbb{E}\|\nabla\Phi(\theta_t)\|^2 + d_\theta\sigma_\theta^2\right],$$
where the third inequality uses the $\kappa_{\theta w}$-Lipschitzness of $w^*(\cdot)$ (Lemma E.4). Since $\nabla\Phi(\theta_t) = \nabla_\theta F(\theta_t, w^*(\theta_t))$ and $\nabla_\theta F(\theta_t, \cdot)$ is $\beta_{\theta w}$-Lipschitz, we have $\mathbb{E}\|\nabla_\theta F(\theta_t, w_t) - \nabla\Phi(\theta_t)\|^2 \le \beta_{\theta w}^2\delta_t$; combining this with $\left(1 + \frac{1}{2\kappa_w - 1}\right)\left(1 - \frac{1}{\kappa_w}\right) \le 1 - \frac{1}{2\kappa_w}$ gives the claimed bound.

The convergence proof of Theorem E.3 unrolls this recursion: denoting $\zeta := 1 - \frac{1}{2\kappa_w} + 4\kappa_w\kappa_{\theta w}^2\eta_\theta^2\beta_{\theta w}^2$ and
$$C_t := \frac{2}{\beta_w^2}\left(\frac{4L_w^2}{m}\mathbb{1}_{\{m<n\}} + d_w\sigma_w^2\right) + 4\kappa_w\kappa_{\theta w}^2\eta_\theta^2 d_\theta\sigma_\theta^2,$$
the resulting sums $\sum_{t=1}^T\sum_{j=0}^{t-1}\zeta^{t-1-j}\,\mathbb{E}\|\nabla\Phi(\theta_j)\|^2$ are absorbed by the step-size choice, and the noise and diameter terms contribute $\frac{\kappa_w\beta_{\theta w}^2 L_w^2 d_w T\ln(1/\delta)}{\beta_w^2\epsilon^2 n^2} + \frac{\beta_{\theta w}^2 D^2\kappa_w}{T}$ to the final bound.
Our choice of $T$ then implies
$$\mathbb{E}\|\nabla\Phi(\widehat{\theta}_T)\|^2 \lesssim \sqrt{\kappa_w\big[\Delta_\Phi(\beta_\theta+\beta_{\theta w}\kappa_{\theta w}) + \beta_{\theta w}^2 D^2\big]} \left[ \frac{L_\theta\sqrt{d_\theta\ln(1/\delta)}}{\epsilon n} + \left(\frac{\beta_{\theta w}\sqrt{\kappa_w}}{\beta_w}\right)\frac{L_w\sqrt{d_w\ln(1/\delta)}}{\epsilon n} \right] + \frac{\mathbb{1}_{\{m<n\}}}{m}\left(L_\theta^2 + \frac{\kappa_w\beta_{\theta w}^2 L_w^2}{\beta_w^2}\right).$$
Finally, our choice of sufficiently large $m$ yields the last claim in Theorem E.3.

E.2 PROOF OF THEOREM 3.4

Theorem 3.4 is an easy consequence of Theorem E.1, which we proved in the above subsection:

Theorem E.10 (Re-statement of Theorem 3.4). Assume the loss function $\ell(\cdot, x, y)$ and $\mathcal{F}(x, \cdot)$ are Lipschitz continuous with Lipschitz gradients for all $(x, y)$, and $\widehat{p}_S(r) \ge \rho > 0$ for all $r \in [k]$. In Algorithm 1, choose $\mathcal{W}$ to be a sufficiently large ball that contains $W^*(\theta) := \mathrm{argmax}_W \widehat{F}(\theta, W)$ for every $\theta$ in some neighborhood of $\theta^* \in \mathrm{argmin}_\theta \max_W \widehat{F}(\theta, W)$. Then there exist algorithmic parameters such that the $(\epsilon, \delta)$-DP Algorithm 1 returns $\widehat{\theta}_T$ with
$$\mathbb{E}\|\nabla\,\mathrm{FERMI}(\widehat{\theta}_T)\|^2 = O\left(\frac{\sqrt{\max(d_\theta, kl)\ln(1/\delta)}}{\epsilon n}\right),$$
treating $D = \mathrm{diameter}(\mathcal{W})$, $\lambda$, $\rho$, $l$, and the Lipschitz and smoothness parameters of $\ell$ and $\mathcal{F}$ as constants.

Proof. By Theorem E.1, it suffices to show that $f(\theta, W; z_i) := \ell(\theta, x_i, y_i) + \lambda\widehat{\psi}_i(\theta, W)$ is Lipschitz continuous with Lipschitz gradient in both the $\theta$ and $W$ variables for any $z_i = (x_i, y_i, s_i)$, and that $f(\theta, \cdot\,; z_i)$ is strongly concave. We assumed $\ell(\cdot, x_i, y_i)$ is Lipschitz continuous with Lipschitz gradient. Further, the work of Lowy et al. (2022a) showed that $f(\theta, \cdot\,; z_i)$ is strongly concave. Thus, it suffices to show that $\widehat{\psi}_i(\theta, W)$ is Lipschitz continuous with Lipschitz gradient. This clearly holds by Theorem D.1, since $\mathcal{F}(x, \cdot)$ is Lipschitz continuous with Lipschitz gradient and $W \in \mathcal{W}$ is bounded.

F.1 MEASURING DEMOGRAPHIC PARITY AND EQUALIZED ODDS VIOLATION

We used the expressions given in (10) and (11) to measure the demographic parity violation and the equalized odds violation, respectively. We denote by $\mathcal{Y}$ the set of all possible output classes and by $\mathcal{S}$ the classes of the sensitive attribute. $P[E]$ denotes the empirical probability of the occurrence of an event $E$.
$$\max_{y'\in\mathcal{Y},\; s_1,s_2\in\mathcal{S}} \big|P[\widehat{y}=y' \mid s=s_1] - P[\widehat{y}=y' \mid s=s_2]\big| \qquad (10)$$
$$\max_{y'\in\mathcal{Y},\; s_1,s_2\in\mathcal{S}} \max\Big( \big|P[\widehat{y}=y' \mid s=s_1, y=y'] - P[\widehat{y}=y' \mid s=s_2, y=y']\big|,\; \big|P[\widehat{y}=y' \mid s=s_1, y\neq y'] - P[\widehat{y}=y' \mid s=s_2, y\neq y']\big| \Big) \qquad (11)$$

F.2 TABULAR DATASETS

F.2.1 MODEL DESCRIPTION AND EXPERIMENTAL DETAILS

Demographic Parity: We split each dataset in a 3:1 train:test ratio. We preprocess the data similarly to Hardt et al. (2016a) and use a simple logistic regression model with sigmoid output $O = \sigma(Wx + b)$, which we treat as the conditional probabilities $p(\widehat{y} = i \mid x)$. The predicted variables and sensitive attributes are both binary across all of these datasets. We analyze fairness-accuracy trade-offs with four different values of $\epsilon \in \{0.5, 1, 3, 9\}$ for each dataset. We compare against the state-of-the-art algorithms proposed in Tran et al. (2021a) and (the demographic parity objective of) Tran et al. (2021b). The trade-off curves for DP-FERMI were generated by sweeping over values of $\lambda \in [0, 2.5]$. The learning rates for descent and ascent, $\eta_\theta$ and $\eta_w$, were held constant during optimization and chosen from $[0.005, 0.01]$. The batch size was 1024. We tuned the $\ell_2$ diameter of the projection set $\mathcal{W}$ and the $\theta$-gradient clipping threshold in $[1, 5]$ in order to generate stable results at high privacy (i.e. low $\epsilon$). Each model was trained for 200 epochs. The results displayed are averages over 15 trials (random seeds) for each value of $\epsilon$. Equalized Odds: We replicated the experimental setup described above, but took the $\ell_2$ diameter of $\mathcal{W}$ and the $\theta$-gradient clipping threshold in $[1, 2]$, and tested only three values of $\epsilon \in \{0.5, 1, 3\}$.
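The demographic parity and equalized odds violations reported in these experiments are the empirical quantities (10) and (11) from Appendix F.1; they can be computed from prediction arrays as follows (a self-contained numpy sketch; for simplicity, the class sets are inferred from the arrays rather than passed explicitly):

```python
import numpy as np
from itertools import combinations

def demographic_parity_violation(y_pred, s):
    """Empirical demographic parity violation, Eq. (10): the max over classes
    y' and sensitive-group pairs of |P[yhat=y'|s=s1] - P[yhat=y'|s=s2]|."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    gap = 0.0
    for y1 in np.unique(y_pred):
        for s1, s2 in combinations(np.unique(s), 2):
            p1 = np.mean(y_pred[s == s1] == y1)
            p2 = np.mean(y_pred[s == s2] == y1)
            gap = max(gap, abs(p1 - p2))
    return gap

def equalized_odds_violation(y_pred, y_true, s):
    """Empirical equalized odds violation, Eq. (11): the max conditional-rate
    gap across groups, conditioning on y = y' and on y != y'."""
    y_pred, y_true, s = map(np.asarray, (y_pred, y_true, s))
    gap = 0.0
    for y1 in np.unique(y_true):
        for s1, s2 in combinations(np.unique(s), 2):
            for cond in (y_true == y1, y_true != y1):
                m1, m2 = cond & (s == s1), cond & (s == s2)
                if m1.any() and m2.any():
                    g = abs(np.mean(y_pred[m1] == y1) - np.mean(y_pred[m2] == y1))
                    gap = max(gap, g)
    return gap
```

Both functions handle non-binary labels and multiple sensitive classes, matching the multi-class setting of the image experiments below.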

F.2.2 DESCRIPTION OF DATASETS

Adult Income Dataset: This dataset contains census information about individuals. The classification task is to predict whether a person earns more than 50k per year. We followed a preprocessing approach similar to Lowy et al. (2022a); after preprocessing, there were a total of 102 input features. The sensitive attribute was taken to be gender. The dataset consists of around 48,000 entries spanning two CSV files, which we combine before taking a 3:1 train-test split. Retired Adult Income Dataset: The Retired Adult Income Dataset, proposed by Ding et al. (2021a), is essentially a superset of the Adult Income Dataset that attempts to address some caveats of the Adult dataset. Its input and output attributes are the same as those of Adult, and the sensitive attribute we consider is the same as well. This dataset contains around 45,000 entries. Parkinsons Dataset: For the Parkinsons dataset, we use the portion of the data that has UPDRS scores along with features of recordings obtained from individuals with and without Parkinson's disease. The classification task is to predict from these features whether the UPDRS score is greater than the median. After preprocessing, there were a total of 19 input features, and the sensitive attribute was taken to be gender. This dataset contains around 5,800 entries in total; we took a 3:1 train-test split. Credit Card Dataset: This dataset contains financial data of clients of a bank in Taiwan, consisting of their gender, education level, age, marital status, previous bills, and payments. The classification task is to predict whether a person defaults on their credit card bills, i.e., to assess whether a client is creditworthy.
We followed a preprocessing approach similar to Lowy et al. (2022a). After preprocessing, there were a total of 85 input features. The sensitive attribute was taken to be gender. This dataset consists of around 30,000 entries, from which we take a 3:1 train-test split. UTK-Face Dataset: This is a large-scale image dataset with an age span from 0 to 116. It consists of over 20,000 face images annotated with age, gender, and ethnicity, and covers large variation in pose, facial expression, illumination, occlusion, and resolution. We consider the age classification task with 9 age groups, similar to the experimental setup in Tran et al. (2022), and take the sensitive attribute to be ethnicity, which consists of 5 classes. For equalized odds, the ERMI between the predicted and sensitive attributes is minimized conditionally on each label of the output variable: the FERMI regularizer is split into as many parts as there are output labels, with each part minimizing ERMI given that output label. Each part has its own matrix $W$, which is maximized in order to create a stochastic estimator of ERMI with respect to that output label.
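The per-label decomposition just described can be sketched as follows (illustrative only; `per_label_ermi` is a hypothetical stand-in for the stochastic ERMI estimator conditioned on one output label, and the $W$ shapes are placeholders):

```python
import numpy as np

def equalized_odds_fermi_regularizer(per_label_ermi, W_per_label, labels):
    """Sum of per-label ERMI estimates: one adversarial matrix W per output
    label, as in the equalized-odds variant described above. Each W_per_label[y]
    is maximized separately during training."""
    return sum(per_label_ermi(y, W_per_label[y]) for y in labels)
```

In training, the ascent step of the min-max algorithm updates each `W_per_label[y]` independently, while the descent step updates the model parameters against the summed regularizer.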

G SOCIETAL IMPACTS

In this paper, we considered the socially consequential problem of privately learning fair models from sensitive data. Motivated by the lack of scalable private fair learning methods in the literature, we developed the first differentially private (DP) fair learning algorithm that is guaranteed to converge with small batches (stochastic optimization). We hope that our method will be used to help companies, governments, and other organizations to responsibly use sensitive, private data. Specifically, we hope that our DP-FERMI algorithm will be useful in reducing discrimination in algorithmic decision-making while simultaneously preventing leakage of sensitive user data. The stochastic nature of our algorithm might be especially appealing to companies that are using very large models and datasets. On the other hand, there are also some important limitations of our method that need to be considered before deployment. One caveat of our work is that we have assumed that the given data set contains fair and accurate labels. For example, if gender is the sensitive attribute and "likelihood of repaying a loan" is the target, then we assume that the training data accurately describes everyone's financial history without discrimination. If training data is biased against a certain demographic group, then it is possible that our algorithm could amplify (rather than mitigate) unfairness. See e.g. Kilbertus et al. (2020); Bechavod et al. (2019) for further discussion. Another important practical consideration is how to weigh/value the different desiderata (privacy, fairness, and accuracy) when deploying our method. As shown in prior works (e.g., Cummings et al. (2019)) and reinforced in the present work, there are fundamental tradeoffs between fairness, accuracy, and privacy: improvements in one generally come at a cost to the other two. Determining the relative importance of each of these three desiderata is a critical question that lacks a clear or general answer.
Depending on the application, one might be seriously concerned with either discrimination or privacy attacks, and should calibrate ϵ and λ accordingly. Or, perhaps very high accuracy is necessary for a particular task, with privacy and/or fairness as an afterthought. In such a case, one might want to start with very large ϵ and small λ to ensure high accuracy, and then gradually shrink ϵ and/or increase λ to improve privacy/fairness until training accuracy dips below a critical threshold. A thorough and rigorous exploration of these issues could be an interesting direction for future work.



Footnotes:
1. Our method can also handle any other fairness notion that can be defined in terms of statistical (conditional) independence, such as equal opportunity. However, our method cannot handle all fairness notions: for example, false discovery rate and calibration error are not covered by our framework. To simplify the presentation, we will assume that demographic parity is the fairness notion of interest in the remainder of this section. However, we consider both fairness notions in our numerical experiments.
2. We say a differentiable function $g$ is $\mu$-strongly concave if $g(\alpha) + \langle\nabla g(\alpha), \alpha' - \alpha\rangle - \frac{\mu}{2}\|\alpha - \alpha'\|^2 \ge g(\alpha')$ for all $\alpha, \alpha'$.
3. DP-SGDA was also used in Yang et al. (2022) for convex and PL min-max problems.
4. We say a function $g$ is $L$-Lipschitz if $\|g(\alpha) - g(\alpha')\| \le L\|\alpha - \alpha'\|$ for all $\alpha, \alpha'$.



Figure 1: Comparison with existing works. "Guarantee" refers to provable guarantee. N/A: the post-processing method of Jagielski et al. (2019) is not an iterative algorithm. *Method requires access to the sensitive data at test time. The in-processing method of Jagielski et al. (2019) is inefficient. The work of Mozannar et al. (2020) specializes to equalized odds, but most of their analysis seems to be extendable to other fairness notions.

4.1.1 DEMOGRAPHIC PARITY

We use four benchmark tabular datasets: Adult Income, Retired Adult, Parkinsons, and Credit Card, from the UCI machine learning repository (Dua & Graff, 2017). The predicted variables and sensitive attributes are both binary in these datasets. We analyze fairness-accuracy trade-offs with four different values of ϵ ∈ {0.5, 1, 3, 9} for each dataset. We compare against state-of-the-art algorithms proposed in Tran et al. (2021a) and (the demographic parity objective of) Tran et al. (2021b). The results displayed are averages over 15 trials (random seeds) for each value of ϵ.

(in Appendix F.2). Compared to Tran et al. (2021a;b), DP-FERMI offers superior fairness-accuracy tradeoffs at every privacy (ϵ) level.

Figure 2: Private, Fair (Demographic Parity) logistic regression on the Adult Dataset. In the Parkinsons dataset, the task is to predict whether a patient's total UPDRS score is greater than the median, with gender as the sensitive attribute. Results for ϵ ∈ {1, 3} are shown in Fig. 3; see Fig. 8 in Appendix F for ϵ ∈ {0.5, 9}. Our algorithm again outperforms the baselines Tran et al. (2021a;b) at all tested privacy levels. In the Credit Card dataset, the task is to predict whether a user will default on their payment the next month, with gender as the sensitive attribute. Results are shown in Fig. 7 in Appendix F.2. Once again, DP-FERMI provides the most favorable privacy-fairness-accuracy profile.

Figure 3: Private, Fair (Demographic Parity) logistic regression on Parkinsons Dataset

Figure 4: Private, Fair (Equalized Odds) logistic regression on Credit Card Dataset

Figure 5: DP-FERMI on a Deep CNN for Image Classification on UTK-Face

Since Jagielski et al. (2019), several works have proposed other DP fair learning algorithms, but none has simultaneously addressed all the shortcomings of that method. The work of Xu et al. (2019) proposed DP and fair binary logistic regression, but did not provide any theoretical convergence/performance guarantees. The work of Mozannar et al. (2020) combined aspects of both Hardt et al. (2016a) and Agarwal et al. (2018) in a two-step locally differentially private fairness algorithm. Their approach is limited to binary classification; moreover, their algorithm requires n/2 samples in each iteration (of their in-processing step), making it impractical for large-scale problems. More recently, Tran et al. (2021b) devised another DP in-processing method based on Lagrangian duality, which covers non-binary classification problems. In a subsequent work, Tran et al. (2021a) studied the effect of DP on accuracy parity in ERM, and proposed using a regularizer to promote fairness. Finally, Tran et al. (

Figure 6: Private, fair logistic regression on the Retired Adult Dataset

admits "isn't feasible (or legal) in certain applications." Additionally, post-processing approaches are known to suffer from inferior fairness-accuracy trade-offs compared with in-processing methods. While the in-processing method of Jagielski et al. (2019) does not require access to sensitive attributes at test time, it comes with a different set of disadvantages: 1) it is limited to binary classification; 2) its theoretical performance guarantees require the use of the computationally inefficient (i.e. exponential-time) exponential mechanism (McSherry & Talwar, 2007); 3) its theoretical performance guarantees require computations on the full training set and do not permit mini-batch implementations; 4) it requires the hypothesis class H to have finite VC dimension. In this work, we propose the first algorithm that overcomes all of these pitfalls: our algorithm is amenable to multi-way classification with multiple sensitive attributes, is computationally efficient, and comes with convergence guarantees that hold even when mini-batches of m < n samples are used in each iteration of training, and even when VC(H) = ∞. Furthermore, our framework is flexible enough to accommodate many notions of group fairness besides equalized odds (e.g. demographic parity, accuracy parity).

$\mathrm{Tr}\big(W\,\mathbb{E}[\widehat{y}(x_i;\theta)s_i^T \mid x_i, s_i]\big)$; here $\widehat{\mathbb{E}}[\widehat{y}(x_i;\theta)\widehat{y}(x_i;\theta)^T \mid x_i] = \mathrm{diag}(F_1(x_i,\theta), \ldots, F_l(x_i,\theta))$, and $\mathbb{E}[s_i\widehat{y}(x_i;\theta)^T \mid x_i, s_i]$ is a $k \times l$ matrix with $\mathbb{E}[s_i\widehat{y}(x_i;\theta)^T \mid x_i, s_i]_{r,j} = s_{i,r} F_j(x_i, \theta)$.

Proof of Theorem E.3. Privacy: This is an easy consequence of Theorem 1 in Abadi et al. (2016) (with precise constants obtained from the proof therein, as in Bassily et al. (2019)), applied to both the min (descent in $\theta$) and max (ascent in $w$) updates. Unlike Abadi et al. (2016), we do not clip the gradients here before adding noise, but the Lipschitz continuity assumptions (parts 1 and 2 of Assumption E.2) imply that the $\ell_2$-sensitivities of the gradient updates in lines 4 and 5 of Algorithm 2 are nevertheless bounded by $2L_\theta/m$ and $2L_w/m$, respectively. Thus, Theorem 1 in Abadi et al. (2016) still applies. Convergence: Denote $\zeta := 1 - \frac{1}{2\kappa_w} + 4\kappa_w\kappa_{\theta w}^2\eta_\theta^2\beta_{\theta w}^2$ and $\delta_t = \mathbb{E}\|w^*(\theta_t) - w_t\|^2$, and

ACKNOWLEDGMENTS

This work was supported in part with funding from the NSF CAREER award 2144985, from the YIP AFOSR award, from a gift from the USC-Meta Center for Research and Education in AI & Learning, and from a gift from the USC-Amazon Center on Secure & Trusted Machine Learning.

