THE COST OF PRIVACY IN FAIR MACHINE LEARNING

Anonymous authors
Paper under double-blind review

Abstract

A common task in fair machine learning is training ML models that preserve certain summary statistics across subpopulations defined by sensitive attributes. However, access to such sensitive attributes in training data is restricted and the learner must rely on noisy proxies for the sensitive attributes. In this paper, we study the effect of a privacy mechanism that obfuscates the sensitive attributes from the learner on the fairness of the resulting classifier. We show that the cost of privacy in fair ML is a decline in the generalizability of fairness constraints.

1. INTRODUCTION

The fairness of machine learning systems has gained increasing attention in recent years. Among the numerous fairness objectives is ensuring that a machine learning model does not discriminate against subpopulations that are typically identified by sensitive attributes (e.g., race, gender). Training a fair model and evaluating model bias both require the sensitive attributes; however, access to and use of such sensitive data is frequently prohibited by laws and regulations. Credit card companies, for instance, are not permitted to inquire about a person's race when they apply for credit, yet they must demonstrate that their decisions are not discriminatory (Chen et al., 2019).

Ideally, sensitive personal information should not be disclosed during the training of ML models. However, it is impossible to ensure exact notions of fairness (such as demographic parity or equality of opportunity) without any knowledge of the sensitive data. Fortunately, differential privacy (Dwork et al., 2006) is a promising workaround, which can offer a graceful compromise between privacy and utility. Mozannar et al. (2020) propose to release sensitive attributes in a locally differentially private way: adding noise to the sensitive data so that adversaries cannot infer any information with high confidence about a single record. The advantage of the privacy mechanism proposed by Mozannar et al. (2020) is an invariance property: exact notions of fairness with regard to true sensitive attributes and noisy sensitive attributes are equivalent. An implication of the invariance property is that the optimal fair model can be learned at the population level. Nonetheless, it remains unclear what the precise statistical impact of privacy on fairness is.

In this work, we study the statistical cost of privacy on fairness in the task of learning fair ML models with differentially private sensitive attributes. The developed theory is:

1. statistically principled: We propose a statistically principled metric to characterize the cost of privacy on fairness. A restricted notion of statistical efficiency precisely quantifies the privacy cost asymptotically.
2. interpretable: Privacy leads to a decline in the statistical efficiency. Such efficiency loss is interpretable: it explicitly depends on the privacy budget, the subpopulation imbalance level, and a few other problem-specific parameters.

The rest of this paper is organized as follows. In Section 2, we formalize the problem setup, which consists of the constrained stochastic optimization problem for fair machine learning, the local differential privacy mechanism for releasing sensitive attributes, the learning procedure for a fair model using private sensitive attributes, and the definition of asymptotic relative efficiency in terms of fairness violations. In Section 3, we develop theory for the privacy cost under a single exact fairness constraint and then generalize this theory to multiple demographic groups and to missing sensitive attributes. By simulating a risk-parity linear regression problem in Section 4, we validate our theory and illustrate the utility of our tools. Finally, we summarize our work in Section 5 and point out an interesting avenue of future work.

1.1. RELATED WORK

The interaction between fairness and privacy has been investigated from three perspectives: learning approximately fair models without sensitive attributes (Hashimoto et al., 2018; Lahoti et al., 2020), learning approximately fair models with wildly noisy sensitive attributes (Kallus et al., 2019; Awasthi et al., 2020; Wang et al., 2020), and learning exactly fair models with structured noisy sensitive attributes (Lamy et al., 2020; Mozannar et al., 2020). This paper focuses on the third perspective.

The works most pertinent to ours are Lamy et al. (2020) and Mozannar et al. (2020). Lamy et al. (2020) assume that the sensitive attributes are subject to noise from the mutually contaminated learning model. Under such a structured noise mechanism, the noise rates can be consistently estimated, and enforcing fairness with regard to the noisy groups only requires tightening the fairness tolerance parameter. Mozannar et al. (2020) suggest a differentially private model to release the sensitive attributes, which is a special case of the mutually contaminated learning model. Under such a designed noise mechanism, Mozannar et al. (2020) show that if the classifier does not depend on the sensitive attributes, then exact fairness with regard to noisy sensitive attributes is equivalent to exact fairness with regard to true sensitive attributes. The idea of this equivalence can already be found in Lamy et al. (2020), while Mozannar et al. (2020) turn it into a formal statement. In essence, we study the statistical cost of privacy on the generalizability of fairness when using the method of Lamy et al. (2020) under the privacy mechanism of Mozannar et al. (2020).

2.1. FAIR MACHINE LEARNING AS CONSTRAINED STOCHASTIC OPTIMIZATION

In-processing fair machine learning is typically a supervised learning problem with fairness constraints (Zafar et al., 2017; Agarwal et al., 2018). Such a problem can most often be formulated as a constrained stochastic optimization problem: (empirical) risk minimization subject to (empirical) fairness constraints.

Consider a fair binary classification problem. Let $\mathcal{X} \subset \mathbb{R}^d$ be the input space, $\mathcal{Y} = \{0, 1\}$ be the set of possible labels, and $\mathcal{A}$ be the set of possible values of the protected/sensitive attribute. In this setup, training and test examples are tuples of the form $(X, A, Y) \in \mathcal{X} \times \mathcal{A} \times \mathcal{Y}$, and a classifier is a map $f : \mathcal{X} \to \{0, 1\}$. Two popular definitions of algorithmic fairness for binary classification are demographic parity (Dwork et al., 2011) and equality of opportunity (Hardt et al., 2016).

Definition 2.1 (Demographic parity). Let $\hat{Y} = f(X)$ be the output of the classifier. Demographic parity entails $\mathbb{P}\{\hat{Y} = 1 \mid A = a\} = \mathbb{P}\{\hat{Y} = 1 \mid A = a'\}$ for all $a, a' \in \mathcal{A}$.

Demographic parity, also known as statistical parity, means that the prediction $\hat{Y} = f(X)$ is statistically independent of the protected attribute $A$.

Definition 2.2 (Equality of opportunity). Let $Y = 1$ be the advantaged label that is associated with a positive outcome and $\hat{Y} = f(X)$ be the output of the classifier. Equality of opportunity entails $\mathbb{P}\{\hat{Y} = 1 \mid A = a, Y = 1\} = \mathbb{P}\{\hat{Y} = 1 \mid A = a', Y = 1\}$ for all $a, a' \in \mathcal{A}$.

Equality of opportunity, also known as true positive rate parity, means that the prediction $\hat{Y} = f(X)$ conditioned on the advantaged label $Y = 1$ is statistically independent of the protected attribute $A$.

Given a parametric model space $\mathcal{H} = \{f_\theta(\cdot) : \theta \in \Theta\}$ and a loss function $\ell : \Theta \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ (where $\Theta \subset \mathbb{R}^d$ is a finite-dimensional parameter space), an in-processing fair ML routine minimizes the (empirical) risk $\mathbb{E}[\ell(\theta; X, Y)]$ while satisfying some fairness constraints. To keep things simple, we assume there are only two demographic groups, i.e. $|\mathcal{A}| = 2$. Without loss of generality, we refer to one group as advantaged ($A = 1$) and the other as disadvantaged ($A = 0$). Consider fair learning with demographic parity as an example. At the population level, the goal is to solve the problem

$$\theta^\star \in \arg\min_{\theta \in \Theta} \Big\{ \mathbb{E}[\ell(\theta; X, Y)] \ \text{ subject to } \ \mathbb{E}[1\{f_\theta(X) = 1\} \mid A = 1] - \mathbb{E}[1\{f_\theta(X) = 1\} \mid A = 0] = 0 \Big\}, \tag{2.1}$$

where the expectation is with respect to the distribution of the tuple $(X, A, Y)$. The true underlying distribution is unknown, so we cannot solve (2.1) directly. Instead, we observe IID training samples $\{(X_i, A_i, Y_i)\}_{i=1}^n$ from the true distribution and solve the empirical version of (2.1):

$$\hat{\theta}_n \in \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n 1\{f_\theta(X_i) = 1, A_i = 1\}}{\sum_{i=1}^n 1\{A_i = 1\}} - \frac{\sum_{i=1}^n 1\{f_\theta(X_i) = 1, A_i = 0\}}{\sum_{i=1}^n 1\{A_i = 0\}} \right| \le \alpha_n \right\}, \tag{2.2}$$

where $0 < \alpha_n = o(1/\sqrt{n})$ is a slackness term shrinking to zero at a rate faster than $1/\sqrt{n}$. Throughout the rest of the paper, $\alpha_n$ always denotes a positive number of order $o(1/\sqrt{n})$.
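The empirical constraint in (2.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dp_gap` and `satisfies_constraint` are hypothetical, predictions and group labels are plain 0/1 lists, and the slack $n^{-3/4}$ is just one possible choice of a sequence of order $o(1/\sqrt{n})$.

```python
def dp_gap(preds, groups):
    """Empirical demographic parity gap:
    fraction of positive predictions in group 1 minus that in group 0."""
    n1 = sum(1 for a in groups if a == 1)
    n0 = sum(1 for a in groups if a == 0)
    if n1 == 0 or n0 == 0:
        raise ValueError("both groups must be non-empty")
    rate1 = sum(1 for p, a in zip(preds, groups) if p == 1 and a == 1) / n1
    rate0 = sum(1 for p, a in zip(preds, groups) if p == 1 and a == 0) / n0
    return rate1 - rate0


def satisfies_constraint(preds, groups, alpha_n=None):
    """Check |gap| <= alpha_n; the default slack n^(-3/4) is an assumed
    example of a sequence that shrinks faster than 1/sqrt(n)."""
    n = len(preds)
    if alpha_n is None:
        alpha_n = n ** -0.75
    return abs(dp_gap(preds, groups)) <= alpha_n
```

Replacing `groups` by the privatized labels gives the proxy constraint used later with private sensitive attributes; nothing else in the check changes.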

2.2. LOCAL DIFFERENTIAL PRIVACY MECHANISM FOR RELEASING SENSITIVE ATTRIBUTES

Consider the randomized response mechanism (Warner, 1965; Kairouz et al., 2014) for releasing the privatized sensitive attribute:

$$Q(Z = z \mid A = a) = \begin{cases} \dfrac{e^{\varepsilon}}{|\mathcal{A}| - 1 + e^{\varepsilon}} & \text{if } z = a, \\[2mm] \dfrac{1}{|\mathcal{A}| - 1 + e^{\varepsilon}} & \text{if } z \neq a, \end{cases} \tag{2.3}$$

for all $a, z \in \mathcal{A}$, where $\varepsilon > 0$ controls the privacy level. The privatized sensitive attribute $Z$ of the true sensitive attribute $A$ is drawn as $Z \sim Q(\cdot \mid A)$. In addition, the sampling mechanism $Q$ requires $Z \perp\!\!\!\perp (X, Y) \mid A$. The private dataset $\{(X_i, Z_i, Y_i)\}_{i=1}^n$ is then generated from the original dataset $\{(X_i, A_i, Y_i)\}_{i=1}^n$ via the transition kernel $Q$. The randomized response mechanism (2.3) is a locally $\varepsilon$-differentially private mechanism (Duchi et al., 2013), that is,

$$\max_{z, a, a' \in \mathcal{A}} \frac{Q(Z = z \mid A = a)}{Q(Z = z \mid A = a')} \le e^{\varepsilon}.$$

Here a smaller parameter $\varepsilon$ indicates a stronger privacy guarantee. Moreover, the mechanism $Q$ is considered optimal for distribution estimation under local differential privacy constraints (Kairouz et al., 2014; 2016).

From this point forward (with the exception of the general theory presented in Section 3.1), we assume that there are only two demographic groups, i.e. $|\mathcal{A}| = 2$. The mechanism (2.3) becomes

$$Q(Z = z \mid A = a) = \begin{cases} \dfrac{e^{\varepsilon}}{1 + e^{\varepsilon}} \triangleq 1 - \gamma & \text{if } z = a, \\[2mm] \dfrac{1}{1 + e^{\varepsilon}} \triangleq \gamma & \text{if } z \neq a, \end{cases} \tag{2.4}$$

for $a \in \{0, 1\}$, where $\gamma \in [0, 0.5)$. The parameter $\gamma = 0$ (or equivalently $\varepsilon = \infty$) signifies a complete lack of privacy, whereas $\gamma \to 0.5$ (or equivalently $\varepsilon \to 0$) corresponds to perfect privacy.
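The two-group mechanism (2.4) amounts to flipping each label with probability $\gamma = 1/(1 + e^{\varepsilon})$. The following is a minimal sketch, not the authors' code; the function names are hypothetical and the random number generator is passed in explicitly so that runs are reproducible.

```python
import math
import random


def flip_probability(eps):
    """gamma = 1 / (1 + e^eps), the probability of reporting the other group."""
    return 1.0 / (1.0 + math.exp(eps))


def randomized_response(a, eps, rng):
    """Release a binary attribute a in {0, 1}: flip it with probability gamma."""
    gamma = flip_probability(eps)
    return 1 - a if rng.random() < gamma else a


def ldp_ratio(eps):
    """Worst-case likelihood ratio (1 - gamma) / gamma of the mechanism,
    which equals e^eps and so meets the local privacy bound with equality."""
    gamma = flip_probability(eps)
    return (1.0 - gamma) / gamma


rng = random.Random(0)
private_labels = [randomized_response(a, eps=1.0, rng=rng) for a in [0, 1, 1, 0, 1]]
```

The released labels depend on the data only through the true attribute, which mirrors the conditional independence requirement on $Z$ stated above.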

2.3. FAIR MACHINE LEARNING WITH PRIVATE SENSITIVE ATTRIBUTES

The privatized sensitive attribute $Z$ can serve as a noisy proxy for the true sensitive attribute $A$. One may wish to learn a fair classifier by directly enforcing the fairness notion on the $Z_i$'s, the proxies for the $A_i$'s. This approach is feasible and justifiable (at the population level) due to the invariance of exact fairness under local differential privacy.

Proposition 2.3 (Proposition 1 in Mozannar et al. (2020)). Consider any exact fairness notion among demographic parity and equality of opportunity. Let $\hat{Y} = f(X)$ be a binary classifier. Then $\hat{Y}$ is fair with respect to $A$ if and only if $\hat{Y}$ is fair with respect to $Z$.

Proposition 2.3 requires that $\hat{Y}$ be a function of $X$ only. Mozannar et al. (2020) show by construction the existence of a classifier $\hat{Y} = f(X, Z)$ which is fair with respect to $Z$ but unfair with respect to $A$.

Now we consider fair ML with private sensitive attributes by (empirical) risk minimization subject to fairness constraints with respect to $Z$. Take fair learning with demographic parity as an example. At the population level, the goal is to solve the problem

$$\theta^\star \in \arg\min_{\theta \in \Theta} \Big\{ \mathbb{E}[\ell(\theta; X, Y)] \ \text{ subject to } \ \mathbb{E}[1\{f_\theta(X) = 1\} \mid Z = 1] - \mathbb{E}[1\{f_\theta(X) = 1\} \mid Z = 0] = 0 \Big\}, \tag{2.5}$$

where the expectation is with respect to the distribution of the tuple $(X, Z, Y)$. Here $Z$ is the proxy sensitive attribute, while the true sensitive attribute $A$ is unobservable. The true underlying distribution is unknown, so we cannot solve (2.5) directly. Instead, we observe IID (private) training samples $\{(X_i, Z_i, Y_i)\}_{i=1}^n$ from the true distribution and solve the empirical version of (2.5):

$$\tilde{\theta}_n \in \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n 1\{f_\theta(X_i) = 1, Z_i = 1\}}{\sum_{i=1}^n 1\{Z_i = 1\}} - \frac{\sum_{i=1}^n 1\{f_\theta(X_i) = 1, Z_i = 0\}}{\sum_{i=1}^n 1\{Z_i = 0\}} \right| \le \alpha_n \right\}. \tag{2.6}$$

A direct corollary of Proposition 2.3 is that (2.1) and (2.5) have exactly the same solution $\theta^\star$ (assuming uniqueness of the solution). One can also show that, under regularity conditions, both $\hat{\theta}_n$ and $\tilde{\theta}_n$, the solutions to (2.2) and (2.6) respectively, are $\sqrt{n}$-consistent for $\theta^\star$. We wish to compare the estimation quality of $\hat{\theta}_n$ and $\tilde{\theta}_n$, and to quantify the difference in terms of the privacy level parameter $\gamma$ (or $\varepsilon$) and a few other problem-specific parameters.

2.4. ASYMPTOTIC RELATIVE EFFICIENCY

In statistics, consistency and efficiency are popular notions to evaluate the performance of estimators.

Definition 2.4 (Consistency). An estimator $\hat{\theta}_n$ is consistent for $\theta^\star$ if $\hat{\theta}_n \xrightarrow{p} \theta^\star$ as $n \to \infty$.

Suppose that we have two consistent estimators $\hat{\theta}_n$ and $\tilde{\theta}_n$. Both of them are reasonable, but which one should be preferred? To answer this question, we can employ the notion of efficiency, i.e. measuring how spread out about $\theta^\star$ the sampling distribution of $\hat{\theta}_n$ (or $\tilde{\theta}_n$) is. In light of this, we now adapt the concept of statistical efficiency to fair machine learning. In fair ML, the most important metric to evaluate the performance of a classifier is the fairness violation. Let $c : \Theta \to \mathbb{R}$ be the constraint function. For example, the demographic parity constraint corresponds to $c(\theta) = \mathbb{E}[1\{f_\theta(X) = 1\} \mid A = 1] - \mathbb{E}[1\{f_\theta(X) = 1\} \mid A = 0]$. Since under the exact fairness notion a classifier $f_\theta$ is fair if $c(\theta) = 0$, we define the (signed) fairness violation of $\theta$ as $c(\theta)$ itself.

Definition 2.5 (Efficiency in terms of constraint violations). Suppose that we have two consistent estimators $\hat{\theta}_n$ and $\tilde{\theta}_n$ satisfying

$$\sqrt{n}\{c(\hat{\theta}_n) - c(\theta^\star)\} \xrightarrow{d} N(0, \sigma^2) \quad \text{and} \quad \sqrt{n}\{c(\tilde{\theta}_n) - c(\theta^\star)\} \xrightarrow{d} N(0, \tilde{\sigma}^2)$$

as $n \to \infty$. We say that the estimator $\hat{\theta}_n$ is more efficient (in terms of constraint violations) than $\tilde{\theta}_n$ if $\sigma^2 \le \tilde{\sigma}^2$. The asymptotic relative efficiency (ARE) of $\tilde{\theta}_n$ to $\hat{\theta}_n$ is

$$\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) \triangleq \frac{\sigma^2}{\tilde{\sigma}^2}.$$

In other words, the estimator $\hat{\theta}_n$ is more efficient than $\tilde{\theta}_n$ if $\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) \le 1$.

Another way to examine the efficiency loss is to look at the asymptotic joint distribution of $c(\hat{\theta}_n)$ and $c(\tilde{\theta}_n)$. Let $\rho$ be the asymptotic correlation between $c(\hat{\theta}_n)$ and $c(\tilde{\theta}_n)$. The fairness violations of the two estimators can be compared using the ratio of $c(\hat{\theta}_n)$ to $c(\tilde{\theta}_n)$, which converges in distribution to a Cauchy random variable $U$:

$$\frac{c(\hat{\theta}_n)}{c(\tilde{\theta}_n)} \xrightarrow{d} U \sim p_U(u) = \frac{1}{\pi} \frac{\beta}{(u - \alpha)^2 + \beta^2} \quad \text{with} \quad \alpha = \frac{\rho \sigma}{\tilde{\sigma}}, \quad \beta = \frac{\sigma}{\tilde{\sigma}} \sqrt{1 - \rho^2}.$$

The constraint violation inflates if the ratio satisfies $|c(\hat{\theta}_n) / c(\tilde{\theta}_n)| < 1$. Assume $\hat{\theta}_n$ is more efficient than $\tilde{\theta}_n$, i.e. $\sigma^2 < \tilde{\sigma}^2$. Since $|\rho| \le 1$, the median and mode $\alpha$ of $U$ satisfies $|\alpha| < 1$, which indicates a high likelihood of constraint violation inflation. Precisely, the asymptotic probability of constraint violation inflation is

$$\lim_{n \to \infty} \mathbb{P}\left( \left| \frac{c(\hat{\theta}_n)}{c(\tilde{\theta}_n)} \right| < 1 \right) = \frac{1}{\pi} \left\{ \tan^{-1}\!\left( \frac{\tilde{\sigma} - \rho \sigma}{\sigma \sqrt{1 - \rho^2}} \right) + \tan^{-1}\!\left( \frac{\tilde{\sigma} + \rho \sigma}{\sigma \sqrt{1 - \rho^2}} \right) \right\} > \frac{1}{2}.$$

In the rest of the paper, the asymptotic relative efficiency (ARE) is the key quantity of interest; it compares the asymptotic variances of the two estimators through $\mathrm{ARE} = \lim_{n \to \infty} \mathrm{Var}[c(\hat{\theta}_n)] / \mathrm{Var}[c(\tilde{\theta}_n)]$.
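The inflation probability above is easy to evaluate numerically. The sketch below assumes the reading in which $\sigma$ belongs to the estimator trained on true attributes and $\tilde{\sigma}$ to the one trained on private attributes; the function name and argument names are hypothetical.

```python
import math


def inflation_probability(sigma, sigma_tilde, rho):
    """Asymptotic probability that the ratio of fairness violations
    (true-attribute estimator over private-attribute estimator) lies in (-1, 1).

    sigma, sigma_tilde: asymptotic standard deviations of the two violations.
    rho: their asymptotic correlation, assumed strictly inside (-1, 1).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    scale = sigma * math.sqrt(1.0 - rho ** 2)
    left = math.atan((sigma_tilde - rho * sigma) / scale)
    right = math.atan((sigma_tilde + rho * sigma) / scale)
    return (left + right) / math.pi


# Equal variances and no correlation give exactly one half;
# a larger sigma_tilde pushes the probability above one half.
p_equal = inflation_probability(1.0, 1.0, 0.0)
p_private = inflation_probability(1.0, 2.0, 0.3)
```

The probability exceeds one half exactly when $\tilde{\sigma} > \sigma$, since the two arctangent arguments are positive and their product is larger than one in that case.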

3. PRIVACY COST IN FAIR MACHINE LEARNING

In this section, we study $\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n)$, the asymptotic relative efficiency (ARE) of $\tilde{\theta}_n$ to $\hat{\theta}_n$, which solve (2.6) and (2.2) respectively. To this end, we extend the notions of demographic parity and equality of opportunity to a more general form: we say that $\theta$ is fair (with respect to $A$) if

$$c(\theta) \triangleq \frac{\mathbb{E}[g(\theta; X, Y) \mid A = 1]}{\mathbb{E}[h(X, Y) \mid A = 1]} - \frac{\mathbb{E}[g(\theta; X, Y) \mid A = 0]}{\mathbb{E}[h(X, Y) \mid A = 0]} = 0. \tag{3.1}$$

The fairness notion (3.1) is known as the linear-fractional fairness constraint (Celis et al., 2021). Note that demographic parity is a special case of (3.1) if we take $g(\theta; X, Y) = 1\{f_\theta(X) = 1\}$ and $h \equiv 1$. Besides, (3.1) becomes equality of opportunity if we take $g(\theta; X, Y) = 1\{f_\theta(X) = 1, Y = 1\}$ and $h(X, Y) = 1\{Y = 1\}$. When $h \equiv 1$, (3.1) degenerates to linear fairness (see Appendix A).

Let the marginal distribution of $A$ and the conditional distribution of $(X, Y)$ given $A$ be

$$\mathbb{P}(A = 0) = \pi_0, \quad \mathbb{P}(A = 1) = \pi_1, \qquad (X, Y) \mid A = 0 \sim Q_0, \quad (X, Y) \mid A = 1 \sim Q_1. \tag{3.2}$$

Then the distribution of $(X, A, Y)$ is uniquely identified by (3.2). Moreover, $(X, Y) \sim \pi_0 Q_0 + \pi_1 Q_1$ is a mixture of $Q_0$ and $Q_1$ weighted by $\pi_0$ and $\pi_1$. Denote the marginal distribution of $Z$ and the conditional distribution of $(X, Y)$ given $Z$ by $\mathbb{P}(Z = k) = \tilde{\pi}_k$ and $(X, Y) \mid Z = k \sim \tilde{Q}_k$ for $k \in \{0, 1\}$. Enforcing the fairness notion (3.1) with respect to $Z$ reads

$$\tilde{c}(\theta) \triangleq \frac{\mathbb{E}[g(\theta; X, Y) \mid Z = 1]}{\mathbb{E}[h(X, Y) \mid Z = 1]} - \frac{\mathbb{E}[g(\theta; X, Y) \mid Z = 0]}{\mathbb{E}[h(X, Y) \mid Z = 0]} = 0.$$

By some algebra, we find that the proxy constraint function $\tilde{c}(\theta)$ is equal to the true constraint function $c(\theta)$ up to a scaling factor:

$$\tilde{c}(\theta) = \psi_{\mathrm{frac}}(\gamma, \pi_0, \pi_1, m_0, m_1) \times c(\theta),$$

where

$$\psi_{\mathrm{frac}}(\gamma, \pi_0, \pi_1, m_0, m_1) \triangleq \frac{(1 - 2\gamma)\, \pi_0 \pi_1 m_0 m_1}{\{\gamma \pi_0 m_0 + (1 - \gamma) \pi_1 m_1\}\, \{(1 - \gamma) \pi_0 m_0 + \gamma \pi_1 m_1\}},$$

with $m_0 \triangleq \mathbb{E}_{Q_0}[h(X, Y)]$ and $m_1 \triangleq \mathbb{E}_{Q_1}[h(X, Y)]$. Since $\psi_{\mathrm{frac}} \neq 0$ for $\gamma < 0.5$, this also implies that $\tilde{c}(\theta) = 0$ if and only if $c(\theta) = 0$, offering an alternative proof of Proposition 2.3 and extending Proposition 2.3 to the linear-fractional fairness notion (3.1).
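The scaling identity between the proxy and the true constraint can be checked exactly on a toy discrete distribution. The sketch below is an illustration under assumptions: the three-point support, the probability vectors `q0` and `q1`, and the values of `g` and `h` are all made up, and the helper names are hypothetical.

```python
def psi_frac(gamma, pi0, pi1, m0, m1):
    """Scaling factor between the proxy and the true linear-fractional constraint."""
    num = (1 - 2 * gamma) * pi0 * pi1 * m0 * m1
    den = (gamma * pi0 * m0 + (1 - gamma) * pi1 * m1) * (
        (1 - gamma) * pi0 * m0 + gamma * pi1 * m1
    )
    return num / den


def expect(weights, values):
    """Weighted sum over a finite support (weights need not be normalised)."""
    return sum(w * v for w, v in zip(weights, values))


def true_constraint(q0, q1, g, h):
    """Constraint value with respect to the true attribute A."""
    return expect(q1, g) / expect(q1, h) - expect(q0, g) / expect(q0, h)


def proxy_constraint(gamma, pi0, pi1, q0, q1, g, h):
    """Constraint value with respect to the privatized attribute Z.
    The joint weights of (Z = z, support point) are left unnormalised
    because the normalising constant cancels in each ratio."""
    w1 = [gamma * pi0 * a + (1 - gamma) * pi1 * b for a, b in zip(q0, q1)]
    w0 = [(1 - gamma) * pi0 * a + gamma * pi1 * b for a, b in zip(q0, q1)]
    return expect(w1, g) / expect(w1, h) - expect(w0, g) / expect(w0, h)


# Toy equality-of-opportunity example on three support points (assumed values).
q0, q1 = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
g, h = [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]
gamma, pi0, pi1 = 0.2, 0.3, 0.7
m0, m1 = expect(q0, h), expect(q1, h)
lhs = proxy_constraint(gamma, pi0, pi1, q0, q1, g, h)
rhs = psi_frac(gamma, pi0, pi1, m0, m1) * true_constraint(q0, q1, g, h)
```

In this example `lhs` and `rhs` agree up to floating-point error, which is the content of the identity; the factor shrinks toward zero as `gamma` approaches 0.5.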
Now we are ready to show the privacy cost in learning under the linear-fractional fairness constraint (3.1). First, let the true parameter $\theta^\star$, i.e. the solution to the population problem, be

$$\theta^\star \in \arg\min_{\theta \in \Theta} \left\{ \mathbb{E}[\ell(\theta; X, Y)] \ \text{ subject to } \ \frac{\mathbb{E}[g(\theta; X, Y) \mid A = 1]}{\mathbb{E}[h(X, Y) \mid A = 1]} - \frac{\mathbb{E}[g(\theta; X, Y) \mid A = 0]}{\mathbb{E}[h(X, Y) \mid A = 0]} = 0 \right\}, \tag{3.3}$$

where the expectation is with respect to the underlying distribution of the tuple $(X, A, Y)$. Then, let the estimator $\hat{\theta}_n$ be the solution to the empirical problem given the true sensitive attribute,

$$\hat{\theta}_n \in \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{A_i = 1\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{A_i = 1\}} - \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{A_i = 0\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{A_i = 0\}} \right| \le \alpha_n \right\}.$$

Finally, let $\tilde{\theta}_n$ be the solution to the empirical problem given the proxy sensitive attribute,

$$\tilde{\theta}_n \in \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{Z_i = 1\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{Z_i = 1\}} - \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{Z_i = 0\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{Z_i = 0\}} \right| \le \alpha_n \right\}.$$

We make the following technical assumptions on the population problem (3.3).

1. smoothness and concentration: $\ell$ and $g$ are twice continuously differentiable with respect to $\theta$, and $\ell(\theta^\star; X, Y)$, $\nabla \ell(\theta^\star; X, Y)$, $g(\theta^\star; X, Y)$, $\nabla g(\theta^\star; X, Y)$, $h(X, Y)$ are sub-Gaussian.
2. uniqueness: the stochastic optimization problem with a single expected value constraint (3.3) has a unique optimal primal-dual pair $(\theta^\star, \lambda^\star)$, and $\theta^\star$ belongs to the interior of the compact set $\Theta$.
3. positive definiteness: the Hessian of the Lagrangian evaluated at $(\theta^\star, \lambda^\star)$ is positive definite.

The preceding assumptions are not the most general, but they are easy to interpret. The smoothness conditions on $\ell$ and $g$ with respect to $\theta$, the concentration conditions on $\ell(\theta^\star)$, $g(\theta^\star)$ and $h$, and the uniqueness condition facilitate the use of standard tools from asymptotic statistics to study the large-sample properties of the constraint value. The positive definiteness condition postulates that the Lagrangian of the equality-constrained optimization problem is locally strongly convex at $(\theta^\star, \lambda^\star)$. The main technical result characterizes the efficiency of $\hat{\theta}_n$ and $\tilde{\theta}_n$ (see the proof in Appendix C).

Theorem 3.1 (Privacy cost in linear-fractional fairness (3.1)-aware learning). Under the standing assumptions, let the estimators $\hat{\theta}_n$ and $\tilde{\theta}_n$ be consistent for $\theta^\star$. Then

$$\sqrt{n}\{c(\hat{\theta}_n) - c(\theta^\star)\} \xrightarrow{d} N(0, \sigma^2) \quad \text{and} \quad \sqrt{n}\{c(\tilde{\theta}_n) - c(\theta^\star)\} \xrightarrow{d} N(0, \tilde{\sigma}^2),$$

where

$$\sigma^2 = \frac{\mathrm{Var}_{Q_0}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\pi_0 (\mathbb{E}_{Q_0}[h(X, Y)])^2} + \frac{\mathrm{Var}_{Q_1}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\pi_1 (\mathbb{E}_{Q_1}[h(X, Y)])^2},$$

$$\tilde{\sigma}^2 = \psi_{\mathrm{frac}}^{-2} \times \left\{ \frac{\mathrm{Var}_{\tilde{Q}_0}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\tilde{\pi}_0 (\mathbb{E}_{\tilde{Q}_0}[h(X, Y)])^2} + \frac{\mathrm{Var}_{\tilde{Q}_1}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\tilde{\pi}_1 (\mathbb{E}_{\tilde{Q}_1}[h(X, Y)])^2} \right\},$$

and

$$\kappa \triangleq \frac{\mathbb{E}_{Q_0}[g(\theta^\star; X, Y)]}{\mathbb{E}_{Q_0}[h(X, Y)]} = \frac{\mathbb{E}_{Q_1}[g(\theta^\star; X, Y)]}{\mathbb{E}_{Q_1}[h(X, Y)]}.$$

The asymptotic relative efficiency (ARE) of $\tilde{\theta}_n$ to $\hat{\theta}_n$ is

$$\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) = \varphi\!\left( \gamma, \ \frac{\pi_0 m_0}{\pi_1 m_1}, \ \frac{\mathrm{Var}[g(\theta^\star; X, Y) - \kappa h(X, Y) \mid A = 0] / m_0}{\mathrm{Var}[g(\theta^\star; X, Y) - \kappa h(X, Y) \mid A = 1] / m_1} \right), \tag{3.4}$$

where

$$\varphi(\gamma, r_1, r_2) \triangleq \frac{(1 - 2\gamma)^2\, r_1 (r_1 + r_2)}{\{\gamma r_1 + (1 - \gamma)\}^2 \{(1 - \gamma) r_1 r_2 + \gamma\} + \{(1 - \gamma) r_1 + \gamma\}^2 \{\gamma r_1 r_2 + (1 - \gamma)\}}.$$

Recall that demographic parity corresponds to $h \equiv 1$ and equality of opportunity corresponds to $h(X, Y) = 1\{Y = 1\}$. In order to interpret (3.4), we therefore take $h(X, Y) = 1\{E(X, Y)\}$, where $E(X, Y)$ is an event of $X$ and $Y$.
Then the ARE (3.4) becomes

$$\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) = \varphi\!\left( \gamma, \ \frac{\mathbb{P}(E(X, Y), A = 0)}{\mathbb{P}(E(X, Y), A = 1)}, \ \frac{\mathrm{Var}[g(\theta^\star; X, Y) \mid E(X, Y), A = 0]}{\mathrm{Var}[g(\theta^\star; X, Y) \mid E(X, Y), A = 1]} \right).$$

Note that the ARE is jointly determined by the level of privacy, a ratio of marginal probabilities of the minority and majority groups, and a ratio of their conditional variances.

Theorem 3.1 demonstrates that the cost of privacy is an efficiency loss in terms of fairness violations. For fixed ratios

$$r_1 \triangleq \frac{\mathbb{P}(E(X, Y), A = 0)}{\mathbb{P}(E(X, Y), A = 1)} > 0 \quad \text{and} \quad r_2 \triangleq \frac{\mathrm{Var}[g(\theta^\star; X, Y) \mid E(X, Y), A = 0]}{\mathrm{Var}[g(\theta^\star; X, Y) \mid E(X, Y), A = 1]} > 0,$$

the function $\varphi(\gamma, r_1, r_2)$ is decreasing in $\gamma$. In the absence of privacy, $\varphi(0, r_1, r_2) = 1$ means no efficiency loss. Under perfect privacy, $\varphi(0.5, r_1, r_2) = 0$ indicates total loss of efficiency. Moreover, $\hat{\theta}_n$ is always more efficient than $\tilde{\theta}_n$ because $\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) \le 1$.

Figure 1 shows the asymptotic relative efficiency (ARE) as a function of the privacy level $\gamma$ for varying ratios $r_1$ and $r_2$. The ARE is always upper bounded by $(1 - 2\gamma)^2$, which is achieved only if $r_1 = 1$. Therefore, for any fixed $\gamma$ and $r_2$, the ARE achieves its maximum only if the dataset is balanced in the sense that $\mathbb{P}(E(X, Y), A = 0) = \mathbb{P}(E(X, Y), A = 1)$. Moreover, for any fixed $\gamma$ and $r_2$, the ARE is strictly increasing in $r_1$ (assuming $r_1 \le 1$). This reveals the effect of subpopulation size imbalance: demographic group imbalance exacerbates the efficiency loss in privately fair learning. In the literature, the effect of group size imbalance on the difficulty of learning a fair classifier from contaminated data (note that a private sensitive attribute is a particular type of data contamination) was also reported in Konstantinov & Lampert (2022) and the references therein. Lastly, the ARE is strictly increasing in the problem-specific parameter $r_2$, given fixed $\gamma$ and $r_1 < 1$.
[Figure 1: Asymptotic relative efficiency (ARE) as a function of the privacy level γ. Curves are shown for r1 ∈ {1:1, 1:9, 1:99} and r2 ∈ {1:1, 2:3, 3:2}.]
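The efficiency function of Theorem 3.1 is cheap to evaluate, so curves like those in Figure 1 can be reproduced without training any model. The sketch below implements $\varphi$ as stated and, as a sanity check in the linear case $h \equiv 1$, recomputes the ratio $\sigma^2 / \tilde{\sigma}^2$ directly from the mixture structure of the privatized groups. The function names are hypothetical, and the direct computation assumes that both groups share the same mean of $g$ at $\theta^\star$, which is what the fairness constraint enforces.

```python
def phi(gamma, r1, r2):
    """ARE as a function of the flip probability gamma and the ratios r1, r2."""
    num = (1 - 2 * gamma) ** 2 * r1 * (r1 + r2)
    den = (gamma * r1 + (1 - gamma)) ** 2 * ((1 - gamma) * r1 * r2 + gamma) + (
        (1 - gamma) * r1 + gamma
    ) ** 2 * (gamma * r1 * r2 + (1 - gamma))
    return num / den


def are_linear_direct(gamma, pi0, pi1, v0, v1):
    """sigma^2 / sigma_tilde^2 for h = 1, computed from group-level quantities.
    v0, v1 are the variances of g within the true groups A = 0 and A = 1."""
    pt1 = gamma * pi0 + (1 - gamma) * pi1        # P(Z = 1)
    pt0 = (1 - gamma) * pi0 + gamma * pi1        # P(Z = 0)
    # Variance of g within each privatized group: a mixture of v0 and v1
    # (no between-group term because the group means coincide at the optimum).
    vt1 = (gamma * pi0 * v0 + (1 - gamma) * pi1 * v1) / pt1
    vt0 = ((1 - gamma) * pi0 * v0 + gamma * pi1 * v1) / pt0
    psi = (1 - 2 * gamma) * pi0 * pi1 / (pt1 * pt0)
    sigma2 = v0 / pi0 + v1 / pi1
    sigma2_tilde = (vt0 / pt0 + vt1 / pt1) / psi ** 2
    return sigma2 / sigma2_tilde


# One ARE curve over a grid of privacy levels (r1 = 1:9, r2 = 1:1).
curve = [(g / 20, phi(g / 20, 1 / 9, 1.0)) for g in range(0, 10)]
```

Evaluating `phi` on a grid of `r1` and `r2` values is also a quick way to probe how the efficiency loss responds to group imbalance and to the variance ratio for a given privacy level.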

3.1. GENERAL THEORY

In this subsection, we discuss some extensions to the established theory.

Multiple demographic groups.

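It is natural to extend our theory from two demographic groups to a general number of groups. Suppose we have $K + 1$ ($K \ge 2$) groups indexed by $0, 1, \ldots, K$. The notion of linear-fractional fairness (3.1) can be adapted to more than two groups: we say $\theta$ is fair if

$$\frac{\mathbb{E}[g(\theta; X, Y) \mid A = k]}{\mathbb{E}[h(X, Y) \mid A = k]} - \frac{\mathbb{E}[g(\theta; X, Y) \mid A = 0]}{\mathbb{E}[h(X, Y) \mid A = 0]} = 0 \quad \text{for } k \in [K], \tag{3.5}$$

where group 0 is referred to as the reference group. Let the marginal distribution of $A$ and the conditional distribution of $(X, Y)$ given $A$ be

$$\mathbb{P}(A = k) = \pi_k, \qquad (X, Y) \mid A = k \sim Q_k \quad \text{for } k \in \{0\} \cup [K]. \tag{3.6}$$

Then the distribution of $(X, A, Y)$ is uniquely identified by (3.6). Moreover, the distribution of $(X, Y)$, namely $Q^\star \triangleq \sum_{k=0}^K \pi_k Q_k$, is a mixture of the $Q_k$'s weighted by the $\pi_k$'s. Let the private mechanism $Q$ be

$$Q(Z = z \mid A = a) = \begin{cases} \dfrac{e^{\varepsilon}}{K + e^{\varepsilon}} \triangleq 1 - K\gamma & \text{if } z = a, \\[2mm] \dfrac{1}{K + e^{\varepsilon}} \triangleq \gamma & \text{if } z \neq a, \end{cases}$$

where $\gamma \in [0, \frac{1}{K+1})$. The mechanism $Q$ perturbs the membership of a group to a different group that is picked uniformly at random from the other groups. The parameter $\gamma = 0$ (or equivalently $\varepsilon = \infty$) signifies a complete lack of privacy, whereas $\gamma \to \frac{1}{K+1}$ (or equivalently $\varepsilon \to 0$) means perfect privacy. The joint distribution of $(X, Z, Y)$ is uniquely identified by the following marginal and conditional distributions, where $|\mathcal{A}| = K + 1$:

$$\mathbb{P}(Z = k) = \gamma + (1 - |\mathcal{A}|\gamma)\pi_k \triangleq \tilde{\pi}_k, \qquad (X, Y) \mid Z = k \sim \frac{\gamma}{\gamma + (1 - |\mathcal{A}|\gamma)\pi_k} Q^\star + \frac{(1 - |\mathcal{A}|\gamma)\pi_k}{\gamma + (1 - |\mathcal{A}|\gamma)\pi_k} Q_k \triangleq \tilde{Q}_k \quad \text{for } k \in \{0\} \cup [K]. \tag{3.7}$$

Let the true parameter $\theta^\star$, i.e. the solution to the population problem, be

$$\theta^\star \in \arg\min_{\theta \in \Theta} \left\{ \mathbb{E}[\ell(\theta; X, Y)] \ \text{ subject to } \ \frac{\mathbb{E}[g(\theta; X, Y) \mid A = k]}{\mathbb{E}[h(X, Y) \mid A = k]} - \frac{\mathbb{E}[g(\theta; X, Y) \mid A = 0]}{\mathbb{E}[h(X, Y) \mid A = 0]} = 0, \ k \in [K] \right\},$$

where the expectation is with respect to the underlying distribution of the tuple $(X, A, Y)$. Then, let the estimator $\hat{\theta}_n$ be the solution to the empirical problem given the true sensitive attribute,

$$\hat{\theta}_n \in \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{A_i = k\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{A_i = k\}} - \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{A_i = 0\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{A_i = 0\}} \right| \le \alpha_n, \ k \in [K] \right\}.$$

Finally, let $\tilde{\theta}_n$ be the solution to the empirical problem given the proxy sensitive attribute,

$$\tilde{\theta}_n \in \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{Z_i = k\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{Z_i = k\}} - \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{Z_i = 0\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{Z_i = 0\}} \right| \le \alpha_n, \ k \in [K] \right\}.$$

The true fairness constraint function $c(\theta) : \mathbb{R}^d \to \mathbb{R}^K$ is defined as $c(\theta) \triangleq (c_1(\theta), \ldots, c_K(\theta))^\top$ with

$$c_k(\theta) = \frac{\mathbb{E}[g(\theta; X, Y) \mid A = k]}{\mathbb{E}[h(X, Y) \mid A = k]} - \frac{\mathbb{E}[g(\theta; X, Y) \mid A = 0]}{\mathbb{E}[h(X, Y) \mid A = 0]}, \quad k \in [K].$$

Under the same assumptions as in the two-group problem, we have the following main technical result (see Appendix D for a complete treatment of the problem with a general number of groups).

Theorem 3.2 (Privacy cost in linear-fractional fairness (3.5)-aware learning). Under the standing assumptions, let the estimators $\hat{\theta}_n$ and $\tilde{\theta}_n$ be consistent for $\theta^\star$. Then

$$\sqrt{n}\{c(\hat{\theta}_n) - c(\theta^\star)\} \xrightarrow{d} N(0, \Sigma) \quad \text{and} \quad \sqrt{n}\{c(\tilde{\theta}_n) - c(\theta^\star)\} \xrightarrow{d} N(0, \Psi_{\mathrm{frac}}^{-1} \tilde{\Sigma} \Psi_{\mathrm{frac}}^{-\top}),$$

where

$$\Sigma_{kl} = \frac{\mathrm{Var}_{Q_0}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\pi_0 (\mathbb{E}_{Q_0}[h(X, Y)])^2} + \frac{\mathrm{Var}_{Q_k}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\pi_k (\mathbb{E}_{Q_k}[h(X, Y)])^2}\, 1\{k = l\},$$

$$\tilde{\Sigma}_{kl} = \frac{\mathrm{Var}_{\tilde{Q}_0}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\tilde{\pi}_0 (\mathbb{E}_{\tilde{Q}_0}[h(X, Y)])^2} + \frac{\mathrm{Var}_{\tilde{Q}_k}[g(\theta^\star; X, Y) - \kappa h(X, Y)]}{\tilde{\pi}_k (\mathbb{E}_{\tilde{Q}_k}[h(X, Y)])^2}\, 1\{k = l\}$$

for $k, l \in [K]$, and

$$\kappa \triangleq \frac{\mathbb{E}_{Q_0}[g(\theta^\star; X, Y)]}{\mathbb{E}_{Q_0}[h(X, Y)]} = \frac{\mathbb{E}_{Q_1}[g(\theta^\star; X, Y)]}{\mathbb{E}_{Q_1}[h(X, Y)]} = \cdots = \frac{\mathbb{E}_{Q_K}[g(\theta^\star; X, Y)]}{\mathbb{E}_{Q_K}[h(X, Y)]}.$$

Missing sensitive attributes. Some users may choose not to disclose their demographic identities during data collection due to privacy concerns. We investigate how the absence of sensitive attributes impacts the generalizability of fairness constraints. Consider the following missing data mechanism for sensitive attributes:

$$\mathbb{P}(R = 1 \mid X, A, Y) = \mathbb{P}(R = 1 \mid A) \triangleq \omega_A, \tag{3.8}$$

where $R = 1$ corresponds to response (i.e., $A$ is observed) and $R = 0$ corresponds to non-response (i.e., $A$ is missing).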
The missingness mechanism (3.8) is a particular type of missing at random (MAR) at the population level and missing completely at random (MCAR) within each subpopulation. One common approach for analyzing data with missing values is to use only the completely observed samples (i.e., samples with all features observed) and to discard the samples with some missing features. We employ this strategy by solving the following empirical problem:

$$\tilde{\theta}_n \in \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{A_i = 1, R_i = 1\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{A_i = 1, R_i = 1\}} - \frac{\sum_{i=1}^n g(\theta; X_i, Y_i) 1\{A_i = 0, R_i = 1\}}{\sum_{i=1}^n h(X_i, Y_i) 1\{A_i = 0, R_i = 1\}} \right| \le \alpha_n \right\},$$

in which the empirical risk is computed with all samples while the fairness constraint is computed only with the samples whose sensitive attribute is observed. Under the same assumptions as in the two-group problem, and further assuming that the response probability is non-vanishing, i.e., $\omega_a > 0$ for $a \in \{0, 1\}$, the asymptotic relative efficiency (ARE) of $\tilde{\theta}_n$ to $\hat{\theta}_n$ is (see Appendix E for a complete treatment of the missing sensitive attributes problem)

$$\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{r_2 + r_1}{\omega_0^{-1} r_2 + \omega_1^{-1} r_1}, \qquad r_1 = \frac{\pi_0 m_0}{\pi_1 m_1}, \qquad r_2 = \frac{\mathrm{Var}[g(\theta^\star; X, Y) - \kappa h(X, Y) \mid A = 0] / m_0}{\mathrm{Var}[g(\theta^\star; X, Y) - \kappa h(X, Y) \mid A = 1] / m_1}.$$

This indicates that missing sensitive attributes degrade the asymptotic efficiency of the estimator through the inverse response probabilities $\omega_0^{-1}$ and $\omega_1^{-1}$. In particular, when both groups respond at the same rate $\omega_0 = \omega_1 = \omega$, the ARE equals $\omega$.
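The missing-attribute ARE is a one-line formula. The sketch below evaluates it for given response probabilities and ratios; the function name `are_missing` is hypothetical and the numeric inputs are made-up examples.

```python
def are_missing(omega0, omega1, r1, r2):
    """ARE when the sensitive attribute is observed with probability
    omega0 in group A = 0 and omega1 in group A = 1."""
    if omega0 <= 0 or omega1 <= 0:
        raise ValueError("response probabilities must be positive")
    return (r2 + r1) / (r2 / omega0 + r1 / omega1)


# Equal response rates: the ARE reduces to the common response rate.
equal_rates = are_missing(0.7, 0.7, r1=0.4, r2=1.5)

# Unequal response rates: the ARE lies between the two rates.
unequal_rates = are_missing(0.5, 0.9, r1=0.4, r2=1.5)
```

Because the formula is the reciprocal of a weighted average of $\omega_0^{-1}$ and $\omega_1^{-1}$ with weights proportional to $r_2$ and $r_1$, the result always falls between the smaller and the larger response probability.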

4. SIMULATIONS

We simulate the asymptotic relative efficiency (ARE) for the risk-parity linear regression problem

$$\min_{\beta \in \Theta} \ \mathbb{E}[(Y - \beta^\top X)^2] \quad \text{subject to} \quad \mathbb{E}[(Y - \beta^\top X)^2 \mid A = 1] - \mathbb{E}[(Y - \beta^\top X)^2 \mid A = 0] = 0, \tag{4.1}$$

where we generate $n \in \{300, 3000\}$ samples by the following data generating process: $A \sim \mathrm{Bernoulli}(1 - \pi_0)$, $X \mid A = a \sim N(\mu_a, \Sigma_a)$ and $Y \mid X, A = a \sim N(\beta_a^\top X, \sigma_a^2)$ for $a \in \{0, 1\}$. We pick $\mu_0 = (1, 2)^\top$, $\mu_1 = (2, 1)^\top$, $\Sigma_0 = \Sigma_1 = I_2$, $\sigma_0^2 = \sigma_1^2 = 1$ and investigate two scenarios: imbalanced subgroups with $\pi_0 = 0.3$ and balanced subgroups with $\pi_0 = 0.5$. The goal of the optimization problem (4.1) is to minimize the population risk (in the least-squares sense) while equalizing the subpopulation risks of group $A = 0$ and group $A = 1$.

In Figure 2, we plot relative efficiency curves for $\pi_0 = 0.3$ and $\pi_0 = 0.5$, each averaged over 500 replicates. For the larger sample size $n$, the empirical relative efficiency curves are close to the theoretical ARE curve, validating our theory in the large-sample regime. As a by-product, our theory can visualize the fairness-privacy trade-off without retraining models under varying privacy budgets.
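The data generating process above, followed by privatization of the group label, can be sketched as follows. This is an illustration under assumptions rather than the experimental code: the group-specific coefficients `beta` are hypothetical placeholders (their values are not stated in the text), the helper names are made up, and only the standard library is used.

```python
import math
import random

MU = {0: (1.0, 2.0), 1: (2.0, 1.0)}        # group means of X, as in the text
BETA = {0: (1.0, 0.0), 1: (0.0, 1.0)}      # hypothetical coefficients, not from the paper


def simulate(n, pi0, eps, rng, beta=BETA):
    """Draw n samples (x, a, z, y): a is the true group label and
    z its randomized-response release with privacy parameter eps."""
    gamma = 1.0 / (1.0 + math.exp(eps))
    rows = []
    for _ in range(n):
        a = 1 if rng.random() < 1.0 - pi0 else 0
        x = tuple(rng.gauss(m, 1.0) for m in MU[a])           # identity covariance
        y = rng.gauss(sum(b * xi for b, xi in zip(beta[a], x)), 1.0)
        z = 1 - a if rng.random() < gamma else a
        rows.append((x, a, z, y))
    return rows


def risk_gap(rows, coef, use_private=True):
    """Empirical risk-parity gap: mean squared error in group 1 minus group 0,
    with groups defined by z (private) or a (true)."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, a, z, y in rows:
        grp = z if use_private else a
        sums[grp] += (y - sum(c * xi for c, xi in zip(coef, x))) ** 2
        counts[grp] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]


rows = simulate(3000, pi0=0.3, eps=1.0, rng=random.Random(0))
gap_private = risk_gap(rows, coef=(0.5, 0.5), use_private=True)
gap_true = risk_gap(rows, coef=(0.5, 0.5), use_private=False)
```

A full replication would additionally solve the constrained least-squares problem on each replicate and compare the variances of the resulting constraint values, which is beyond this sketch.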

5. SUMMARY AND DISCUSSION

In this work, we study the statistical impact of privacy on fairness in the task of learning fair machine learning models with private sensitive attributes. We define a restricted notion of asymptotic statistical efficiency in order to examine this impact. Quantitatively, the cost of privacy on fairness generalizability is represented by a relative decline in statistical efficiency. The relative efficiency loss is interpretable: it explicitly depends on the privacy budget, the subpopulation imbalance level, and a few other problem-specific quantities. We validate and demonstrate the utility of our theory on a synthetic task of risk-parity linear regression with private group membership.

For the sake of clarity, we consider $h \equiv 1$. Denote the loss vectors with regard to the true sensitive attribute $A$ and the noisy sensitive attribute $Z$, and the Markov transition matrix induced by the privacy mechanism $Q$ in (2.4), by

$$L_A(\theta) = \begin{pmatrix} \mathbb{E}[g(\theta; X, Y) \mid A = 1] \\ \mathbb{E}[g(\theta; X, Y) \mid A = 0] \end{pmatrix}, \qquad L_Z(\theta) = \begin{pmatrix} \mathbb{E}[g(\theta; X, Y) \mid Z = 1] \\ \mathbb{E}[g(\theta; X, Y) \mid Z = 0] \end{pmatrix}, \qquad M = \begin{pmatrix} 1 - \gamma & \gamma \\ \gamma & 1 - \gamma \end{pmatrix}.$$

Further, let $b = (1, -1)^\top$. The noiseless, noisy, and debiased constraints are equivalent to each other at the population level in the sense that

$$b^\top L_A(\theta) = 0 \iff b^\top L_Z(\theta) = 0 \iff b^\top M^{-1} L_Z(\theta) = 0.$$

Considering their empirical counterparts, we note that

$$b^\top L_{Z,n}(\theta) = 0 \iff b^\top M^{-1} L_{Z,n}(\theta) = 0.$$

Combined with our theory, this empirical-level equivalence of the two constraints implies that using the inverse of the transition matrix to match the noisy constraint to the noiseless constraint cannot improve the efficiency of the in-processing training procedure. Developing a principled in-processing method to increase the statistical efficiency is an intriguing direction for future research.
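The equivalence between the noisy and the debiased constraint comes down to the fact that $b^\top M^{-1}$ is a scalar multiple of $b^\top$. A minimal numeric check is sketched below; the helper names are hypothetical and the loss vector is an arbitrary made-up example.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def debiased_gap(gamma, loss_z):
    """Compute b^T M^{-1} L_Z with b = (1, -1) and M the flip matrix."""
    m_inv = inverse_2x2([[1 - gamma, gamma], [gamma, 1 - gamma]])
    debiased = [
        m_inv[0][0] * loss_z[0] + m_inv[0][1] * loss_z[1],
        m_inv[1][0] * loss_z[0] + m_inv[1][1] * loss_z[1],
    ]
    return debiased[0] - debiased[1]


def noisy_gap(loss_z):
    """Compute b^T L_Z with b = (1, -1)."""
    return loss_z[0] - loss_z[1]


gamma = 0.2
loss_z = [0.7, 0.4]                      # arbitrary example values
ratio = debiased_gap(gamma, loss_z) / noisy_gap(loss_z)
```

In this example `ratio` equals $1 / (1 - 2\gamma)$ regardless of the loss vector, so the debiased constraint vanishes exactly when the noisy one does, which is why the debiasing step cannot change the feasible set at the empirical level.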

A LINEAR FAIRNESS CONSTRAINT

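We extend the notion of demographic parity to a more general form: we say that $\theta$ is fair (with respect to $A$) if

$$\mathbb{E}[g(\theta; X, Y) \mid A = 1] - \mathbb{E}[g(\theta; X, Y) \mid A = 0] = 0. \tag{A.1}$$

The fairness notion (A.1) is known as the linear fairness constraint (Celis et al., 2021). Note that demographic parity is a special case of (A.1) if we take $g(\theta; X, Y) = 1\{f_\theta(X) = 1\}$.

On the one hand, enforcing the fairness notion (A.1) with respect to $A$ reads $\mathbb{E}_{(X,Y) \mid A = 1}[g(\theta; X, Y)] - \mathbb{E}_{(X,Y) \mid A = 0}[g(\theta; X, Y)] = 0$, or equivalently

$$\mathbb{E}_{Q_1}[g(\theta; X, Y)] - \mathbb{E}_{Q_0}[g(\theta; X, Y)] = 0.$$

On the other hand, enforcing the fairness notion (A.1) with respect to $Z$ reads $\mathbb{E}_{(X,Y) \mid Z = 1}[g(\theta; X, Y)] - \mathbb{E}_{(X,Y) \mid Z = 0}[g(\theta; X, Y)] = 0$, or equivalently

$$\mathbb{E}_{\tilde{Q}_1}[g(\theta; X, Y)] - \mathbb{E}_{\tilde{Q}_0}[g(\theta; X, Y)] = 0,$$

where

$$\tilde{Q}_1 = \frac{\gamma \pi_0}{\gamma \pi_0 + (1 - \gamma) \pi_1} Q_0 + \frac{(1 - \gamma) \pi_1}{\gamma \pi_0 + (1 - \gamma) \pi_1} Q_1, \qquad \tilde{Q}_0 = \frac{(1 - \gamma) \pi_0}{(1 - \gamma) \pi_0 + \gamma \pi_1} Q_0 + \frac{\gamma \pi_1}{(1 - \gamma) \pi_0 + \gamma \pi_1} Q_1.$$

Therefore, the true fairness constraint function is

$$c(\theta) = \int_{\mathcal{X} \times \mathcal{Y}} g(\theta; x, y)\, d(Q_1 - Q_0)(x, y),$$

while the proxy fairness constraint function is

$$\begin{aligned} \tilde{c}(\theta) &= \left\{ -\frac{\gamma \pi_0}{\gamma \pi_0 + (1 - \gamma) \pi_1} + \frac{(1 - \gamma) \pi_0}{(1 - \gamma) \pi_0 + \gamma \pi_1} \right\} \int_{\mathcal{X} \times \mathcal{Y}} g(\theta; x, y)\, d(Q_1 - Q_0)(x, y) \\ &= \left\{ \frac{(1 - \gamma) \pi_1}{\gamma \pi_0 + (1 - \gamma) \pi_1} - \frac{\gamma \pi_1}{(1 - \gamma) \pi_0 + \gamma \pi_1} \right\} \int_{\mathcal{X} \times \mathcal{Y}} g(\theta; x, y)\, d(Q_1 - Q_0)(x, y) \\ &\triangleq \psi_{\mathrm{lin}}(\gamma, \pi_0, \pi_1) \times c(\theta). \end{aligned} \tag{A.2}$$

By (A.2), the proxy constraint function $\tilde{c}(\theta)$ is equal to the true constraint function $c(\theta)$ up to the scaling factor

$$\begin{aligned} \psi_{\mathrm{lin}}(\gamma, \pi_0, \pi_1) &\triangleq -\frac{\gamma \pi_0}{\gamma \pi_0 + (1 - \gamma) \pi_1} + \frac{(1 - \gamma) \pi_0}{(1 - \gamma) \pi_0 + \gamma \pi_1} \\ &= \frac{(1 - \gamma) \pi_1}{\gamma \pi_0 + (1 - \gamma) \pi_1} - \frac{\gamma \pi_1}{(1 - \gamma) \pi_0 + \gamma \pi_1} \\ &= \frac{(1 - 2\gamma)\, \pi_0 \pi_1}{\{\gamma \pi_0 + (1 - \gamma) \pi_1\}\, \{(1 - \gamma) \pi_0 + \gamma \pi_1\}}. \end{aligned} \tag{A.3}$$

This also implies that $\tilde{c}(\theta) = 0$ if and only if $c(\theta) = 0$, providing an alternative proof of Proposition 2.3.

Now we are ready to show the privacy cost in learning under the linear fairness constraint (A.1). First, let the true parameter $\theta^\star$, i.e. the solution to the population problem, be

$$\theta^\star \in \arg\min_{\theta \in \Theta} \Big\{ \mathbb{E}[\ell(\theta; X, Y)] \ \text{ subject to } \ \mathbb{E}[g(\theta; X, Y) \mid A = 1] - \mathbb{E}[g(\theta; X, Y) \mid A = 0] = 0 \Big\}, \tag{A.4}$$

where the expectation is with respect to the underlying distribution of the tuple $(X, A, Y)$.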
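Then, let the estimator $\hat\theta_n$ be the solution to the empirical problem given the true sensitive attribute,
\[
\hat\theta_n \in \left\{ \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell(\theta;X_i,Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=1\}}{\sum_{i=1}^n \mathbf{1}\{A_i=1\}} - \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=0\}}{\sum_{i=1}^n \mathbf{1}\{A_i=0\}} \right| \le \alpha_n \right\}.
\]
Finally, let the estimator $\tilde\theta_n$ be the solution to the empirical problem given the proxy sensitive attribute,
\[
\tilde\theta_n \in \left\{ \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell(\theta;X_i,Y_i) \ \text{ subject to } \ \left| \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{Z_i=1\}}{\sum_{i=1}^n \mathbf{1}\{Z_i=1\}} - \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{Z_i=0\}}{\sum_{i=1}^n \mathbf{1}\{Z_i=0\}} \right| \le \alpha_n \right\}.
\]
We make the following technical assumptions on the problem (A.4).

1. smoothness and concentration: $\ell$ and $g$ are twice continuously differentiable with respect to $\theta$, and $\ell(\theta^\star;X,Y)$, $\nabla\ell(\theta^\star;X,Y)$, $g(\theta^\star;X,Y)$, $\nabla g(\theta^\star;X,Y)$ are sub-Gaussian random variables.

2. uniqueness: the stochastic optimization problem with a single expected value constraint (A.4) has a unique optimal primal-dual pair $(\theta^\star, \lambda^\star)$, and $\theta^\star$ belongs to the interior of the compact set $\Theta$.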

3. positive definiteness: the Hessian of the Lagrangian evaluated at $(\theta^\star, \lambda^\star)$ is positive definite.

We have the main technical result as follows.

Theorem A.1 (Privacy cost in linear fairness (A.1)-aware learning). Under the standing assumptions, let estimators $\hat\theta_n$ and $\tilde\theta_n$ be consistent for $\theta^\star$. Then
\[
\sqrt{n}\{c(\hat\theta_n) - \cancel{c(\theta^\star)}\} \xrightarrow{d} N(0, \sigma^2) \quad \text{and} \quad \sqrt{n}\{c(\tilde\theta_n) - \cancel{c(\theta^\star)}\} \xrightarrow{d} N(0, \tilde\sigma^2),
\]
where
\[
\sigma^2 = \frac{\operatorname{Var}_{Q_0}[g(\theta^\star;X,Y)]}{\pi_0} + \frac{\operatorname{Var}_{Q_1}[g(\theta^\star;X,Y)]}{\pi_1}
\quad \text{and} \quad
\tilde\sigma^2 = \psi_{\mathrm{lin}}^{-2} \times \left\{ \frac{\operatorname{Var}_{\tilde Q_0}[g(\theta^\star;X,Y)]}{\tilde\pi_0} + \frac{\operatorname{Var}_{\tilde Q_1}[g(\theta^\star;X,Y)]}{\tilde\pi_1} \right\}.
\]
The asymptotic relative efficiency (ARE) of $\tilde\theta_n$ to $\hat\theta_n$ is
\[
\operatorname{ARE}(\tilde\theta_n, \hat\theta_n) = \varphi\left( \gamma, \frac{\pi_0}{\pi_1}, \frac{\operatorname{Var}[g(\theta^\star;X,Y)\mid A=0]}{\operatorname{Var}[g(\theta^\star;X,Y)\mid A=1]} \right),
\]
where
\[
\varphi(\gamma, r_1, r_2) \triangleq \frac{(1-2\gamma)^2\, r_1 (r_1+r_2)}{\{\gamma r_1+(1-\gamma)\}^2 \{(1-\gamma) r_1 r_2+\gamma\} + \{(1-\gamma) r_1+\gamma\}^2 \{\gamma r_1 r_2+(1-\gamma)\}}.
\]

Proof of Theorem A.1. Theorem 3.1 implies Theorem A.1 by letting $h(X,Y) \equiv 1$. Therefore, it is sufficient to prove Theorem 3.1, whose proof can be found in Appendix C. □

Theorem A.1 demonstrates that the cost of privacy is an efficiency loss in terms of fairness violations. For fixed ratios $r_1 \triangleq \pi_0/\pi_1 > 0$ and $r_2 \triangleq \operatorname{Var}[g(\theta^\star;X,Y)\mid A=0]/\operatorname{Var}[g(\theta^\star;X,Y)\mid A=1] > 0$, $\varphi(\gamma, r_1, r_2)$ is a decreasing function of $\gamma$. In the absence of privacy, $\varphi(0, r_1, r_2) = 1$ means no efficiency loss. Under perfect privacy, $\varphi(0.5, r_1, r_2) = 0$ indicates total loss of efficiency. Moreover, $\hat\theta_n$ is always more efficient than $\tilde\theta_n$ because $\operatorname{ARE}(\tilde\theta_n, \hat\theta_n) \le 1$.

Figure 3 shows the asymptotic relative efficiency (ARE) as a function of the privacy level $\gamma$ for varying ratios $r_1$ and $r_2$. The ARE is always upper bounded by $(1-2\gamma)^2$, which is achieved only if $\pi_0 = \pi_1 = 0.5$. Recall that $\pi_0 = \mathbb{P}(A=0)$ and $\pi_1 = \mathbb{P}(A=1)$. Therefore, for any fixed $\gamma$ and $r_2$, the ARE achieves its maximum only if the dataset is balanced in the sensitive attribute $A$. Moreover, for any fixed $\gamma$ and $r_2$, the ARE is strictly increasing in $\pi_0$ (assuming $\pi_0 < 0.5$).
This reveals the effect of subgroup size imbalance: demographic group imbalance aggravates the efficiency loss in privately fair learning. Lastly, the ARE is strictly increasing in the problem-specific parameter $r_2$, given fixed $\gamma$ and $r_1 < 1$.

[Figure 3 legend: $\pi_0 : \pi_1 \in \{1:1,\ 1:9,\ 1:99\}$, each combined with $\operatorname{Var}_0 : \operatorname{Var}_1 \in \{1:1,\ 2:3,\ 3:2\}$.]
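The efficiency function $\varphi$ of Theorem A.1 is simple enough to evaluate directly, which makes the boundary cases and the balanced-data bound easy to check. The sketch below is illustrative; the function name `are_phi` is hypothetical, and the inputs are assumed to satisfy $0 \le \gamma \le 1/2$ and $r_1, r_2 > 0$.

```python
def are_phi(gamma, r1, r2):
    """Asymptotic relative efficiency phi(gamma, r1, r2) from Theorem A.1.

    r1 is the ratio pi_0 / pi_1 and r2 is the ratio of the group-wise
    variances of g; both are assumed strictly positive.
    """
    num = (1 - 2 * gamma) ** 2 * r1 * (r1 + r2)
    a = gamma * r1 + (1 - gamma)
    b = (1 - gamma) * r1 + gamma
    den = (a ** 2 * ((1 - gamma) * r1 * r2 + gamma)
           + b ** 2 * (gamma * r1 * r2 + (1 - gamma)))
    return num / den
```

For example, `are_phi(gamma, 1.0, r2)` reduces to `(1 - 2 * gamma) ** 2` for any `r2`, which is the balanced-data upper bound, while an imbalanced ratio such as `r1 = 1 / 9` gives a strictly smaller value at the same `gamma`.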

B LINEAR-FRACTIONAL FAIRNESS CONSTRAINT

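We provide further discussion to supplement Section 3. Recall the marginal distributions and conditional distributions in (3.2):
\[
\mathbb{P}(Z=0) = \tilde\pi_0, \quad \mathbb{P}(Z=1) = \tilde\pi_1, \quad (X,Y)\mid Z=0 \sim \tilde Q_0, \quad (X,Y)\mid Z=1 \sim \tilde Q_1.
\]
Under the privacy mechanism $Q$ in (2.4), we have
\[
\begin{cases}
\tilde\pi_0 = (1-\gamma)\pi_0 + \gamma\pi_1, \quad \tilde\pi_1 = \gamma\pi_0 + (1-\gamma)\pi_1, \\[4pt]
\tilde Q_0 \overset{d}{=} \dfrac{(1-\gamma)\pi_0}{(1-\gamma)\pi_0+\gamma\pi_1}\, Q_0 + \dfrac{\gamma\pi_1}{(1-\gamma)\pi_0+\gamma\pi_1}\, Q_1, \\[8pt]
\tilde Q_1 \overset{d}{=} \dfrac{\gamma\pi_0}{\gamma\pi_0+(1-\gamma)\pi_1}\, Q_0 + \dfrac{(1-\gamma)\pi_1}{\gamma\pi_0+(1-\gamma)\pi_1}\, Q_1.
\end{cases} \tag{B.1}
\]
The marginal distribution and conditional distribution in (B.1) uniquely identify the joint distribution of $(X, Z, Y)$.

On the one hand, enforcing fairness notion (3.1) with respect to $A$ is
\[
\frac{\mathbb{E}_{(X,Y)\mid A=1}\, g(\theta;X,Y)}{\mathbb{E}_{(X,Y)\mid A=1}\, h(X,Y)} - \frac{\mathbb{E}_{(X,Y)\mid A=0}\, g(\theta;X,Y)}{\mathbb{E}_{(X,Y)\mid A=0}\, h(X,Y)} = 0,
\]
or equivalently
\[
c(\theta) \triangleq \frac{\mathbb{E}_{Q_1}\, g(\theta;X,Y)}{\mathbb{E}_{Q_1}\, h(X,Y)} - \frac{\mathbb{E}_{Q_0}\, g(\theta;X,Y)}{\mathbb{E}_{Q_0}\, h(X,Y)} = 0.
\]
On the other hand, enforcing fairness notion (3.1) with respect to $Z$ is
\[
\frac{\mathbb{E}_{(X,Y)\mid Z=1}\, g(\theta;X,Y)}{\mathbb{E}_{(X,Y)\mid Z=1}\, h(X,Y)} - \frac{\mathbb{E}_{(X,Y)\mid Z=0}\, g(\theta;X,Y)}{\mathbb{E}_{(X,Y)\mid Z=0}\, h(X,Y)} = 0,
\]
or equivalently
\[
\tilde c(\theta) \triangleq \frac{\gamma\pi_0\, \mathbb{E}_{Q_0} g(\theta;X,Y) + (1-\gamma)\pi_1\, \mathbb{E}_{Q_1} g(\theta;X,Y)}{\gamma\pi_0\, \mathbb{E}_{Q_0} h(X,Y) + (1-\gamma)\pi_1\, \mathbb{E}_{Q_1} h(X,Y)} - \frac{(1-\gamma)\pi_0\, \mathbb{E}_{Q_0} g(\theta;X,Y) + \gamma\pi_1\, \mathbb{E}_{Q_1} g(\theta;X,Y)}{(1-\gamma)\pi_0\, \mathbb{E}_{Q_0} h(X,Y) + \gamma\pi_1\, \mathbb{E}_{Q_1} h(X,Y)} = 0.
\]
By some algebra, we find that the proxy constraint function $\tilde c(\theta)$ is equal to the true constraint function $c(\theta)$ up to a scaling factor: $\tilde c(\theta) = \psi_{\mathrm{frac}}(\gamma,\pi_0,\pi_1,m_0,m_1) \times c(\theta)$, where
\[
\psi_{\mathrm{frac}}(\gamma,\pi_0,\pi_1,m_0,m_1) \triangleq \frac{(1-2\gamma)\pi_0\pi_1 m_0 m_1}{\{\gamma\pi_0 m_0+(1-\gamma)\pi_1 m_1\}\{(1-\gamma)\pi_0 m_0+\gamma\pi_1 m_1\}}, \tag{B.2}
\]
with $m_0 \triangleq \mathbb{E}_{Q_0} h(X,Y)$ and $m_1 \triangleq \mathbb{E}_{Q_1} h(X,Y)$. By comparing the scaling factor (B.2) with the functional form of (A.3), we can rewrite $\psi_{\mathrm{frac}}(\cdot)$ as
\[
\psi_{\mathrm{frac}}(\gamma,\pi_0,\pi_1,m_0,m_1) = \psi_{\mathrm{lin}}(\gamma, \pi_0 m_0, \pi_1 m_1).
\]
Therefore, we can interpret the scaling factor $\psi_{\mathrm{frac}}(\cdot)$ by treating $\pi_0 m_0$ and $\pi_1 m_1$ each as a single quantity, allowing us to understand the privacy cost from a different perspective.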
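The identity $\tilde c(\theta) = \psi_{\mathrm{frac}} \times c(\theta)$ can be checked numerically for any fixed values of the group-wise expectations. In the sketch below, `g0`, `g1` stand for $\mathbb{E}_{Q_0} g$ and $\mathbb{E}_{Q_1} g$ at a fixed $\theta$, and `m0`, `m1` for $\mathbb{E}_{Q_0} h$ and $\mathbb{E}_{Q_1} h$; all function names are hypothetical and the numbers used to exercise them are arbitrary.

```python
def psi_frac(gamma, pi0, pi1, m0, m1):
    """Scaling factor psi_frac in (B.2)."""
    d1 = gamma * pi0 * m0 + (1 - gamma) * pi1 * m1
    d0 = (1 - gamma) * pi0 * m0 + gamma * pi1 * m1
    return (1 - 2 * gamma) * pi0 * pi1 * m0 * m1 / (d1 * d0)


def true_constraint(g0, g1, m0, m1):
    """c(theta): difference of the ratios E_{Q_a} g / E_{Q_a} h."""
    return g1 / m1 - g0 / m0


def proxy_constraint(gamma, pi0, pi1, g0, g1, m0, m1):
    """c-tilde(theta): the same difference under the noisy groups Z = 1 and Z = 0."""
    num1 = gamma * pi0 * g0 + (1 - gamma) * pi1 * g1
    den1 = gamma * pi0 * m0 + (1 - gamma) * pi1 * m1
    num0 = (1 - gamma) * pi0 * g0 + gamma * pi1 * g1
    den0 = (1 - gamma) * pi0 * m0 + gamma * pi1 * m1
    return num1 / den1 - num0 / den0
```

Evaluating `proxy_constraint(...)` and `psi_frac(...) * true_constraint(...)` on the same inputs should agree up to floating-point error, and setting `m0 = m1 = 1` recovers the linear case (A.3).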
Note that for equality of opportunity, we have π a m a = P(A = a)P(Y = 1|A = a) = P(Y = 1, A = a) for a ∈ {0, 1}. For equality of opportunity, Mozannar et al. (2020) show a sample complexity bound for the fairness violation of the estimator θ n : c( θ n ) - ¨c (θ ⋆ ) ≤ C 1 (1 -γ) (1 -2γ)p 2 C 2 + C 3 R np 4 (F) + C 4 √ nδp (B.3) with probability at least 1 -δ, where p = min{P(Y = 1, A = 0), P(Y = 1, A = 1)}, R • (•) is the Rademacher complexity, and C i 's (1 ≤ i ≤ 4) are some universal constants. Not precisely, the upper bound (B.3) reflects the effect of privacy level via γ and the effect of dataset imbalance through p. Comparing to this, our theory states that lim n→∞ Var[c( θ n ) - ¨c (θ ⋆ )] Var[c( θ n ) - ¨c (θ ⋆ )] = φ γ, P(Y = 1, A = 0) P(Y = 1, A = 1) , 1 , which is depicted by Figure 4 . C PROOF OF THEOREM 3.1 First, we prove the case when α n = 0 for all n. For this case both the population problem and the empirical problem are subject to equality constraints. Consider a stochastic optimization problem with linear-fractional constraint (P 0 ) : θ ⋆ ∈      arg min θ∈Θ E ℓ(θ; X, Y ) subject to E g(θ; X, Y )|A = 1 E h(X, Y )|A = 1 - E g(θ; X, Y )|A = 0 E h(X, Y )|A = 0 = 0      , where the expectation is with respect to the underlying distribution of tuple (X, A, Y ). The corresponding empirical problem given the true sensitive attribute is (P n ) : θ n ∈          arg min θ∈Θ 1 n n i=1 ℓ(θ; X i , Y i ) subject to n i=1 g(θ; X i , Y i )1{A i = 1} n i=1 h(X i , Y i )1{A i = 1} - n i=1 g(θ; X i , Y i )1{A i = 0} n i=1 h(X i , Y i )1{A i = 0} = 0          . The corresponding empirical problem given the proxy sensitive attribute is ( P n ) : θ n ∈          arg min θ∈Θ 1 n n i=1 ℓ(θ; X i , Y i ) subject to n i=1 g(θ; X i , Y i )1{Z i = 1} n i=1 h(X i , Y i )1{Z i = 1} - n i=1 g(θ; X i , Y i )1{Z i = 0} n i=1 h(X i , Y i )1{Z i = 0} = 0          . 
We denote
\[
F(\theta) = \mathbb{E}[\ell(\theta;X,Y)], \quad \hat F_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta;X_i,Y_i), \quad
G(\theta) = \frac{\mathbb{E}[g(\theta;X,Y)\mid A=1]}{\mathbb{E}[h(X,Y)\mid A=1]} - \frac{\mathbb{E}[g(\theta;X,Y)\mid A=0]}{\mathbb{E}[h(X,Y)\mid A=0]},
\]
and
\[
\hat G_n(\theta) = \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=1\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=1\}} - \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=0\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=0\}}.
\]
Note that $\hat F_n(\cdot)$ and $\hat G_n(\cdot)$ are random functions serving as approximations to $F(\cdot)$ and $G(\cdot)$. Consider the Lagrangian functions $L(\theta,\lambda) = F(\theta) + \lambda G(\theta)$ and $\hat L_n(\theta,\lambda) = \hat F_n(\theta) + \lambda \hat G_n(\theta)$ of the programs $(P_0)$ and $(P_n)$, respectively.

Lemma C.1 (A version of Theorem 6.6.2 in Rubinstein & Shapiro (1993)). Suppose that:

(i) The functions $F(\theta)$ and $G(\theta)$ are twice continuously differentiable.

(ii) The true program $(P_0)$ has a unique optimal solution $\theta^\star$ and a unique Lagrange multiplier $\lambda^\star$, with $\theta^\star$ being an interior point of $\Theta$.

(iii) The Hessian matrix $\nabla^2 L(\theta^\star,\lambda^\star)$ is positive definite.

(iv) The random function $\hat G_n(\theta)$ is Lipschitz continuous in a neighborhood of $\theta^\star$ and differentiable at $\theta^\star$ with probability 1.

(v) $\|\Delta_{in}(\theta^\star)\|_2 = O_p(n^{-1/2})$, $i = 1, 2, 3$, and there is a neighborhood $U$ of $\theta^\star$ such that
\[
\sup_{\theta\in U} \frac{\|\Delta_{in}(\theta) - \Delta_{in}(\theta^\star)\|_2}{n^{-1/2} + \|\theta-\theta^\star\|_2} = o_p(1), \quad i = 1, 2, 3.
\]
Here we define the random mappings $\Delta_{1n}(\theta) = \nabla\hat F_n(\theta) - \nabla F(\theta)$, $\Delta_{2n}(\theta) = \hat G_n(\theta) - G(\theta)$, and $\Delta_{3n}(\theta) = \nabla\hat G_n(\theta) - \nabla G(\theta)$.

(vi) The random vectors $\sqrt{n}(\nabla\hat L_n(\theta^\star,\lambda^\star), \hat G_n(\theta^\star))$ converge in distribution to $Y = (Y_1, Y_2)$ as $n \to \infty$, where $Y_1$ is a random vector and $Y_2$ is a random variable.

Let $\hat\theta_n$ be an optimal solution of $(P_n)$ converging in probability to $\theta^\star$ as $n \to \infty$. Then $\sqrt{n}(\hat\theta_n - \theta^\star) \xrightarrow{d} x(Y)$, where $x = x(Y)$ is the optimal solution to the quadratic programming problem
\[
\operatorname*{minimize}_{x} \ x^\top Y_1 + \tfrac{1}{2}\, x^\top \nabla^2 L(\theta^\star,\lambda^\star)\, x \quad \text{subject to} \quad \nabla G(\theta^\star)^\top x + Y_2 = 0.
\]

Recall the standing assumptions: (i), (iv), and (v) are guaranteed by the smoothness and concentration assumption, (ii) is postulated by the uniqueness assumption, and (iii) is the positive definiteness assumption.
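Now we derive the limiting distribution of the random vectors $\sqrt{n}(\nabla\hat L_n(\theta^\star,\lambda^\star), \hat G_n(\theta^\star))$ required in (vi). For $a \in \{0,1\}$, we have
\[
\mathbb{E}[g(\theta^\star;X,Y)\mathbf{1}\{A=a\}] = \mathbb{P}(A=a)\, \mathbb{E}[g(\theta^\star;X,Y)\mid A=a] = \pi_a \mathbb{E}_{Q_a}[g],
\]
and
\begin{align*}
\operatorname{Var}[g(\theta^\star;X,Y)\mathbf{1}\{A=a\}] &= \mathbb{E}[g^2(\theta^\star;X,Y)\mathbf{1}\{A=a\}] - \left(\mathbb{E}[g(\theta^\star;X,Y)\mathbf{1}\{A=a\}]\right)^2 \\
&= \pi_a \mathbb{E}_{Q_a}[g^2] - \pi_a^2 (\mathbb{E}_{Q_a}[g])^2 \\
&= \pi_a \left(\mathbb{E}_{Q_a}[g^2] - (\mathbb{E}_{Q_a}[g])^2\right) + (\pi_a - \pi_a^2)(\mathbb{E}_{Q_a}[g])^2 \\
&= \pi_a \operatorname{Var}_{Q_a}[g] + \pi_0\pi_1 (\mathbb{E}_{Q_a}[g])^2.
\end{align*}
Similarly, for $a \in \{0,1\}$, we have
\[
\mathbb{E}[h(X,Y)\mathbf{1}\{A=a\}] = \pi_a \mathbb{E}_{Q_a}[h] \quad \text{and} \quad \operatorname{Var}[h(X,Y)\mathbf{1}\{A=a\}] = \pi_a \operatorname{Var}_{Q_a}[h] + \pi_0\pi_1 (\mathbb{E}_{Q_a}[h])^2.
\]
Moreover, we have
\begin{align*}
\operatorname{Cov}(g(\theta^\star;X,Y)\mathbf{1}\{A=1\},\, g(\theta^\star;X,Y)\mathbf{1}\{A=0\}) &= \mathbb{E}[g^2(\theta^\star;X,Y)\mathbf{1}\{A=0\}\mathbf{1}\{A=1\}] - \mathbb{E}[g(\theta^\star;X,Y)\mathbf{1}\{A=0\}] \times \mathbb{E}[g(\theta^\star;X,Y)\mathbf{1}\{A=1\}] \\
&= -\pi_0\pi_1 \mathbb{E}_{Q_0}[g]\, \mathbb{E}_{Q_1}[g],
\end{align*}
and similarly we can derive
\begin{align*}
\operatorname{Cov}(h(X,Y)\mathbf{1}\{A=1\},\, h(X,Y)\mathbf{1}\{A=0\}) &= -\pi_0\pi_1 \mathbb{E}_{Q_0}[h]\, \mathbb{E}_{Q_1}[h], \\
\operatorname{Cov}(g(\theta^\star;X,Y)\mathbf{1}\{A=a\},\, h(X,Y)\mathbf{1}\{A=a\}) &= \pi_a \operatorname{Cov}_{Q_a}[g,h] + \pi_0\pi_1 \mathbb{E}_{Q_a}[g]\, \mathbb{E}_{Q_a}[h], \\
\operatorname{Cov}(g(\theta^\star;X,Y)\mathbf{1}\{A=a\},\, h(X,Y)\mathbf{1}\{A=1-a\}) &= -\pi_0\pi_1 \mathbb{E}_{Q_a}[g]\, \mathbb{E}_{Q_{1-a}}[h]
\end{align*}
for $a \in \{0,1\}$. Let $\eta_1 = \mathbb{E}[\nabla\ell(\theta^\star;X,Y)]$, $\eta_2 = \pi_1 \mathbb{E}_{Q_1}[\nabla g(\theta^\star;X,Y)]$, and $\eta_3 = \pi_0 \mathbb{E}_{Q_0}[\nabla g(\theta^\star;X,Y)]$.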
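The variance identity for $g\,\mathbf{1}\{A=a\}$ can be verified exactly on a small discrete example by enumerating the atoms of the joint law of $(A, g)$. This is a minimal sketch under assumed toy inputs; the function names and the representation of each $Q_a$ as a list of `(value, probability)` pairs are hypothetical.

```python
def var_indicator_product(pi, dist, a):
    """Var[g * 1{A = a}] computed directly from the joint atoms of (A, g).

    pi[k] is P(A = k); dist[k] lists (value, prob) pairs for g under Q_k.
    """
    atoms = [(v if grp == a else 0.0, pi[grp] * p)
             for grp in (0, 1) for v, p in dist[grp]]
    mean = sum(t * w for t, w in atoms)
    return sum((t - mean) ** 2 * w for t, w in atoms)


def var_indicator_formula(pi, dist, a):
    """pi_a * Var_{Q_a}[g] + pi_0 * pi_1 * (E_{Q_a}[g])^2, as derived in the text."""
    mean_q = sum(v * p for v, p in dist[a])
    var_q = sum((v - mean_q) ** 2 * p for v, p in dist[a])
    return pi[a] * var_q + pi[0] * pi[1] * mean_q ** 2
```

With, say, `pi = (0.3, 0.7)` and two-point distributions for each group, the two functions should return the same number for both `a = 0` and `a = 1`.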
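By the central limit theorem,
\[
\sqrt{n}\left\{
\begin{pmatrix}
n^{-1}\sum_{i=1}^n \nabla\ell(\theta^\star;X_i,Y_i) \\
n^{-1}\sum_{i=1}^n \nabla g(\theta^\star;X_i,Y_i)\mathbf{1}\{A_i=1\} \\
n^{-1}\sum_{i=1}^n \nabla g(\theta^\star;X_i,Y_i)\mathbf{1}\{A_i=0\} \\
n^{-1}\sum_{i=1}^n g(\theta^\star;X_i,Y_i)\mathbf{1}\{A_i=1\} \\
n^{-1}\sum_{i=1}^n g(\theta^\star;X_i,Y_i)\mathbf{1}\{A_i=0\} \\
n^{-1}\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=1\} \\
n^{-1}\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=0\}
\end{pmatrix}
-
\begin{pmatrix}
\eta_1 \\ \eta_2 \\ \eta_3 \\ \pi_1\mathbb{E}_{Q_1}[g] \\ \pi_0\mathbb{E}_{Q_0}[g] \\ \pi_1\mathbb{E}_{Q_1}[h] \\ \pi_0\mathbb{E}_{Q_0}[h]
\end{pmatrix}
\right\}
\xrightarrow{d} N\left(0, \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}\right), \tag{C.1}
\]
where $\Omega_{11} \in \mathbb{R}^{3d\times 3d}$, $\Omega_{21} \in \mathbb{R}^{4\times 3d}$, $\Omega_{12} = \Omega_{21}^\top$, and, writing $Q_a g = \mathbb{E}_{Q_a}[g]$, $Q_a^2[g] = \operatorname{Var}_{Q_a}[g]$, and $Q_a^2[g,h] = \operatorname{Cov}_{Q_a}[g,h]$, the block $\Omega_{22}$ is given by
\[
\Omega_{22} = \begin{pmatrix}
\pi_1 Q_1^2[g] + \pi_0\pi_1 (Q_1 g)^2 & -\pi_0\pi_1 Q_0 g\, Q_1 g & \pi_1 Q_1^2[g,h] + \pi_0\pi_1 Q_1 g\, Q_1 h & -\pi_0\pi_1 Q_0 h\, Q_1 g \\
-\pi_0\pi_1 Q_0 g\, Q_1 g & \pi_0 Q_0^2[g] + \pi_0\pi_1 (Q_0 g)^2 & -\pi_0\pi_1 Q_0 g\, Q_1 h & \pi_0 Q_0^2[g,h] + \pi_0\pi_1 Q_0 g\, Q_0 h \\
\pi_1 Q_1^2[g,h] + \pi_0\pi_1 Q_1 g\, Q_1 h & -\pi_0\pi_1 Q_0 g\, Q_1 h & \pi_1 Q_1^2[h] + \pi_0\pi_1 (Q_1 h)^2 & -\pi_0\pi_1 Q_0 h\, Q_1 h \\
-\pi_0\pi_1 Q_0 h\, Q_1 g & \pi_0 Q_0^2[g,h] + \pi_0\pi_1 Q_0 g\, Q_0 h & -\pi_0\pi_1 Q_0 h\, Q_1 h & \pi_0 Q_0^2[h] + \pi_0\pi_1 (Q_0 h)^2
\end{pmatrix}.
\]
Let the function $w : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \times \mathbb{R} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{d+1}$ be
\[
w(v_1, v_2, v_3, s_1, s_2, s_3, s_4) = \left( v_1 + \lambda^\star\left(\frac{v_2}{s_3} - \frac{v_3}{s_4}\right),\ \frac{s_1}{s_3} - \frac{s_2}{s_4} \right)^\top.
\]
The gradient of $w$ evaluated at $(v_1, v_2, v_3) = (\eta_1, \eta_2, \eta_3)$ and $(s_1, s_2, s_3, s_4) = (\pi_1\mathbb{E}_{Q_1}[g], \pi_0\mathbb{E}_{Q_0}[g], \pi_1\mathbb{E}_{Q_1}[h], \pi_0\mathbb{E}_{Q_0}[h])$ is given by
\[
\nabla w = \begin{pmatrix} *_{3d\times d} & 0_{3d\times 1} \\ *_{4\times d} & \xi_{4\times 1} \end{pmatrix} \in \mathbb{R}^{(3d+4)\times(d+1)},
\quad \text{where} \quad
\xi = \left( \frac{1}{\pi_1 Q_1 h},\ -\frac{1}{\pi_0 Q_0 h},\ -\frac{Q_1 g}{\pi_1 (Q_1 h)^2},\ \frac{Q_0 g}{\pi_0 (Q_0 h)^2} \right)^\top.
\]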
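Applying the delta method to (C.1) with $w(\cdot)$, we have
\[
\sqrt{n}\begin{pmatrix} \nabla\hat L_n(\theta^\star,\lambda^\star) \\ \hat G_n(\theta^\star) \end{pmatrix}
\xrightarrow{d} N\left(0, \nabla w^\top \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix} \nabla w\right)
\overset{d}{=} N\left(0, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \sigma^2 \end{pmatrix}\right),
\]
where
\[
\sigma^2 = \xi^\top \Omega_{22}\, \xi = \frac{Q_0^2[g]}{\pi_0 (Q_0 h)^2} + \frac{Q_0^2[h](Q_0 g)^2}{\pi_0 (Q_0 h)^4} - \frac{2 Q_0^2[g,h]\, Q_0 g}{\pi_0 (Q_0 h)^3} + \frac{Q_1^2[g]}{\pi_1 (Q_1 h)^2} + \frac{Q_1^2[h](Q_1 g)^2}{\pi_1 (Q_1 h)^4} - \frac{2 Q_1^2[g,h]\, Q_1 g}{\pi_1 (Q_1 h)^3}. \tag{C.2}
\]
Note that the KKT condition implies
\[
\eta_1 + \lambda^\star \left( \frac{\eta_2}{\pi_1 Q_1 h} - \frac{\eta_3}{\pi_0 Q_0 h} \right) = 0 \quad \text{and} \quad \frac{Q_1 g}{Q_1 h} = \frac{Q_0 g}{Q_0 h} \triangleq \kappa. \tag{C.3}
\]
Combining (C.2) and (C.3), we have
\begin{align*}
\sigma^2 &= \frac{\operatorname{Var}_{Q_0}[g] + \operatorname{Var}_{Q_0}[\kappa h] - 2\operatorname{Cov}_{Q_0}[g, \kappa h]}{\pi_0 (\mathbb{E}_{Q_0}[h])^2} + \frac{\operatorname{Var}_{Q_1}[g] + \operatorname{Var}_{Q_1}[\kappa h] - 2\operatorname{Cov}_{Q_1}[g, \kappa h]}{\pi_1 (\mathbb{E}_{Q_1}[h])^2} \\
&= \frac{\operatorname{Var}_{Q_0}[g - \kappa h]}{\pi_0 (\mathbb{E}_{Q_0}[h])^2} + \frac{\operatorname{Var}_{Q_1}[g - \kappa h]}{\pi_1 (\mathbb{E}_{Q_1}[h])^2}. \tag{C.4}
\end{align*}
Therefore, we conclude that the limiting distribution of $\sqrt{n}(\nabla\hat L_n(\theta^\star,\lambda^\star), \hat G_n(\theta^\star))$ is
\[
(Y_1, Y_2) \sim N\left(0, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \sigma^2 \end{pmatrix}\right).
\]
By Lemma C.1, we have $\sqrt{n}(\hat\theta_n - \theta^\star) \xrightarrow{d} x$, where $x$ is given by the linear system
\[
\underbrace{\begin{pmatrix} \nabla^2 L(\theta^\star,\lambda^\star) & \nabla G(\theta^\star) \\ \nabla G(\theta^\star)^\top & 0 \end{pmatrix}}_{\triangleq B} \begin{pmatrix} x \\ \lambda \end{pmatrix} = -\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim N\left(0, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \sigma^2 \end{pmatrix}\right),
\]
or
\[
\begin{pmatrix} x \\ \lambda \end{pmatrix} \sim N\left(0, B^{-1} \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \sigma^2 \end{pmatrix} B^{-1}\right), \tag{C.5}
\]
which implies $\sqrt{n}(\hat\theta_n - \theta^\star) \xrightarrow{d} x \sim N(0, \Sigma)$ for some $\Sigma$ determined by (C.5). By the delta method, we have
\[
\sqrt{n}\, G(\hat\theta_n) = \sqrt{n}\{G(\hat\theta_n) - \underbrace{G(\theta^\star)}_{=0}\} \xrightarrow{d} N(0, \nabla G(\theta^\star)^\top \Sigma\, \nabla G(\theta^\star)).
\]
Now we calculate $\nabla G(\theta^\star)^\top \Sigma\, \nabla G(\theta^\star)$. For notational simplicity, we denote $\nabla^2 L = \nabla^2 L(\theta^\star,\lambda^\star)$, $\nabla G = \nabla G(\theta^\star)$, and $H = (\nabla^2 L)^{-1}\nabla G\, [\nabla G^\top (\nabla^2 L)^{-1} \nabla G]^{-1}$. By block matrix inversion, we have
\[
B^{-1} = \begin{pmatrix} (\nabla^2 L)^{-1} - H \nabla G^\top (\nabla^2 L)^{-1} & H \\ H^\top & -[\nabla G^\top (\nabla^2 L)^{-1} \nabla G]^{-1} \end{pmatrix}.
\]
Note that $\nabla G^\top H = 1$ and $\nabla G^\top \{(\nabla^2 L)^{-1} - H\nabla G^\top (\nabla^2 L)^{-1}\} = 0$. We have
\begin{align*}
\nabla G(\theta^\star)^\top \Sigma\, \nabla G(\theta^\star)
&= \nabla G^\top \left[ \{(\nabla^2 L)^{-1} - H\nabla G^\top(\nabla^2 L)^{-1}\}\Sigma_{11} + H\Sigma_{21} \right] \underbrace{\{(\nabla^2 L)^{-1} - (\nabla^2 L)^{-1}\nabla G H^\top\}\nabla G}_{=0} \\
&\quad + \underbrace{\nabla G^\top \{(\nabla^2 L)^{-1} - H\nabla G^\top(\nabla^2 L)^{-1}\}}_{=0}\, \Sigma_{12} H^\top \nabla G + \nabla G^\top H \sigma^2 H^\top \nabla G \\
&= \sigma^2.
\end{align*}
Therefore, we conclude that
\[
\sqrt{n}\{c(\hat\theta_n) - \cancel{c(\theta^\star)}\} = \sqrt{n}\, G(\hat\theta_n) \xrightarrow{d} N(0, \sigma^2) \overset{d}{=} N\left(0, \frac{\operatorname{Var}_{Q_0}[g - \kappa h]}{\pi_0 (\mathbb{E}_{Q_0}[h])^2} + \frac{\operatorname{Var}_{Q_1}[g - \kappa h]}{\pi_1 (\mathbb{E}_{Q_1}[h])^2}\right).
\]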
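The reduction of the quadratic form $\xi^\top \Omega_{22}\, \xi$ to the residual-variance expression (C.4) can be checked numerically on a toy example. The sketch below builds $\Omega_{22}$ and $\xi$ from discrete group-wise distributions of $(g, h)$ and compares the quadratic form with (C.4). All function names and the toy inputs are hypothetical, and the example distributions are assumed to satisfy $Q_1 g / Q_1 h = Q_0 g / Q_0 h$, as required by (C.3).

```python
def group_moments(atoms):
    """atoms: list of (g, h, prob) under Q_a. Returns E g, E h, Var g, Var h, Cov(g, h)."""
    eg = sum(g * p for g, h, p in atoms)
    eh = sum(h * p for g, h, p in atoms)
    vg = sum((g - eg) ** 2 * p for g, h, p in atoms)
    vh = sum((h - eh) ** 2 * p for g, h, p in atoms)
    cgh = sum((g - eg) * (h - eh) * p for g, h, p in atoms)
    return eg, eh, vg, vh, cgh


def sigma2_quadratic_form(pi0, pi1, atoms0, atoms1):
    """xi^T Omega_22 xi with coordinates ordered (g1{A=1}, g1{A=0}, h1{A=1}, h1{A=0})."""
    g0, h0, vg0, vh0, c0 = group_moments(atoms0)
    g1, h1, vg1, vh1, c1 = group_moments(atoms1)
    pp = pi0 * pi1
    omega = [
        [pi1 * vg1 + pp * g1 * g1, -pp * g0 * g1, pi1 * c1 + pp * g1 * h1, -pp * h0 * g1],
        [-pp * g0 * g1, pi0 * vg0 + pp * g0 * g0, -pp * g0 * h1, pi0 * c0 + pp * g0 * h0],
        [pi1 * c1 + pp * g1 * h1, -pp * g0 * h1, pi1 * vh1 + pp * h1 * h1, -pp * h0 * h1],
        [-pp * h0 * g1, pi0 * c0 + pp * g0 * h0, -pp * h0 * h1, pi0 * vh0 + pp * h0 * h0],
    ]
    xi = [1 / (pi1 * h1), -1 / (pi0 * h0), -g1 / (pi1 * h1 ** 2), g0 / (pi0 * h0 ** 2)]
    return sum(xi[i] * omega[i][j] * xi[j] for i in range(4) for j in range(4))


def sigma2_closed_form(pi0, pi1, atoms0, atoms1):
    """Residual-variance form (C.4); assumes both groups share the ratio kappa."""
    g0, h0 = group_moments(atoms0)[:2]
    h1 = group_moments(atoms1)[1]
    kappa = g0 / h0

    def var_resid(atoms):
        mean = sum((g - kappa * h) * p for g, h, p in atoms)
        return sum((g - kappa * h - mean) ** 2 * p for g, h, p in atoms)

    return var_resid(atoms0) / (pi0 * h0 ** 2) + var_resid(atoms1) / (pi1 * h1 ** 2)
```

In this check the cross-group terms and the $\pi_0\pi_1$ terms of the quadratic form cancel, leaving only the within-group variance and covariance contributions that make up (C.4).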
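By a similar argument, we have
\[
\sqrt{n}\{\psi_{\mathrm{frac}} \times c(\tilde\theta_n) - \cancel{\psi_{\mathrm{frac}} \times c(\theta^\star)}\} \xrightarrow{d} N\left(0, \frac{\operatorname{Var}_{\tilde Q_0}[g - \kappa h]}{\tilde\pi_0 (\mathbb{E}_{\tilde Q_0}[h])^2} + \frac{\operatorname{Var}_{\tilde Q_1}[g - \kappa h]}{\tilde\pi_1 (\mathbb{E}_{\tilde Q_1}[h])^2}\right),
\]
which implies
\[
\sqrt{n} \times c(\tilde\theta_n) \xrightarrow{d} N(0, \tilde\sigma^2) \overset{d}{=} N\left(0, \psi_{\mathrm{frac}}^{-2} \times \left\{ \frac{\operatorname{Var}_{\tilde Q_0}[g - \kappa h]}{\tilde\pi_0 (\mathbb{E}_{\tilde Q_0}[h])^2} + \frac{\operatorname{Var}_{\tilde Q_1}[g - \kappa h]}{\tilde\pi_1 (\mathbb{E}_{\tilde Q_1}[h])^2} \right\}\right).
\]

Now, we prove the case when $\alpha_n = o(1/\sqrt{n})$. For this case, note that the equality constraint of the population problem can be rewritten as two inequality constraints:
\[
(P_0):\quad \theta^\star \in \left\{
\begin{array}{ll}
\arg\min_{\theta\in\Theta} & \mathbb{E}[\ell(\theta;X,Y)] \\[4pt]
\text{subject to} & \dfrac{\mathbb{E}[g(\theta;X,Y)\mid A=1]}{\mathbb{E}[h(X,Y)\mid A=1]} - \dfrac{\mathbb{E}[g(\theta;X,Y)\mid A=0]}{\mathbb{E}[h(X,Y)\mid A=0]} \le 0 \\[10pt]
& \dfrac{\mathbb{E}[g(\theta;X,Y)\mid A=0]}{\mathbb{E}[h(X,Y)\mid A=0]} - \dfrac{\mathbb{E}[g(\theta;X,Y)\mid A=1]}{\mathbb{E}[h(X,Y)\mid A=1]} \le 0
\end{array}
\right\},
\]
where the expectation is with respect to the underlying distribution of the tuple $(X, A, Y)$. The corresponding empirical problem given the true sensitive attribute is
\[
(P_n):\quad \hat\theta_n \in \left\{
\begin{array}{ll}
\arg\min_{\theta\in\Theta} & \dfrac{1}{n}\sum_{i=1}^n \ell(\theta;X_i,Y_i) \\[6pt]
\text{subject to} & \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=1\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=1\}} - \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=0\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=0\}} - \alpha_n \le 0 \\[10pt]
& \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=0\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=0\}} - \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=1\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=1\}} - \alpha_n \le 0
\end{array}
\right\}.
\]
The corresponding empirical problem given the proxy sensitive attribute is
\[
(\tilde P_n):\quad \tilde\theta_n \in \left\{
\begin{array}{ll}
\arg\min_{\theta\in\Theta} & \dfrac{1}{n}\sum_{i=1}^n \ell(\theta;X_i,Y_i) \\[6pt]
\text{subject to} & \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{Z_i=1\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{Z_i=1\}} - \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{Z_i=0\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{Z_i=0\}} - \alpha_n \le 0 \\[10pt]
& \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{Z_i=0\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{Z_i=0\}} - \dfrac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{Z_i=1\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{Z_i=1\}} - \alpha_n \le 0
\end{array}
\right\}.
\]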
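We denote
\begin{align*}
F(\theta) &= \mathbb{E}[\ell(\theta;X,Y)], \qquad \hat F_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta;X_i,Y_i), \\
G_1(\theta) &= \frac{\mathbb{E}[g(\theta;X,Y)\mid A=1]}{\mathbb{E}[h(X,Y)\mid A=1]} - \frac{\mathbb{E}[g(\theta;X,Y)\mid A=0]}{\mathbb{E}[h(X,Y)\mid A=0]}, \\
G_2(\theta) &= \frac{\mathbb{E}[g(\theta;X,Y)\mid A=0]}{\mathbb{E}[h(X,Y)\mid A=0]} - \frac{\mathbb{E}[g(\theta;X,Y)\mid A=1]}{\mathbb{E}[h(X,Y)\mid A=1]}, \\
\hat G_{1n}(\theta) &= \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=1\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=1\}} - \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=0\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=0\}} - \alpha_n, \\
\hat G_{2n}(\theta) &= \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=0\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=0\}} - \frac{\sum_{i=1}^n g(\theta;X_i,Y_i)\mathbf{1}\{A_i=1\}}{\sum_{i=1}^n h(X_i,Y_i)\mathbf{1}\{A_i=1\}} - \alpha_n.
\end{align*}
Consider the Lagrangian functions $L(\theta,\lambda) = F(\theta) + \lambda_1 G_1(\theta) + \lambda_2 G_2(\theta)$ and $\hat L_n(\theta,\lambda) = \hat F_n(\theta) + \lambda_1 \hat G_{1n}(\theta) + \lambda_2 \hat G_{2n}(\theta)$ of the programs $(P_0)$ and $(P_n)$, respectively.

Lemma C.2 (A version of Theorem 6.6.2 in Rubinstein & Shapiro (1993)). Suppose that:

(i) The functions $F(\theta)$, $G_1(\theta)$, and $G_2(\theta)$ are twice continuously differentiable.

(ii) The true program $(P_0)$ has a unique optimal solution $\theta^\star$ and a unique Lagrange multiplier $\lambda^\star$, with $\theta^\star$ being an interior point of $\Theta$.

(iii) The Hessian matrix $\nabla^2 L(\theta^\star,\lambda^\star)$ is positive definite.

(iv) The random functions $\hat G_{1n}(\theta)$ and $\hat G_{2n}(\theta)$ are Lipschitz continuous in a neighborhood of $\theta^\star$ and differentiable at $\theta^\star$ with probability 1.

(v) $\|\Delta_{in}(\theta^\star)\|_2 = O_p(n^{-1/2})$, $i = 1, 2, 3$, and there is a neighborhood $U$ of $\theta^\star$ such that
\[
\sup_{\theta\in U} \frac{\|\Delta_{in}(\theta) - \Delta_{in}(\theta^\star)\|_2}{n^{-1/2} + \|\theta-\theta^\star\|_2} = o_p(1), \quad i = 1, 2, 3.
\]
Here we define the random mappings $\Delta_{1n}(\theta) = \nabla\hat F_n(\theta) - \nabla F(\theta)$, $\Delta_{2n}(\theta) = \hat G_n(\theta) - G(\theta)$, and $\Delta_{3n}(\theta) = \nabla\hat G_n(\theta) - \nabla G(\theta)$.

(vi) The random vectors $\sqrt{n}(\nabla\hat L_n(\theta^\star,\lambda^\star), \hat G_{1n}(\theta^\star), \hat G_{2n}(\theta^\star))$ converge in distribution to $Y = (Y_1, Y_2, Y_3)$ as $n \to \infty$, where $Y_1$ is a random vector and $Y_2$ and $Y_3$ are random variables.

Let $\hat\theta_n$ be an optimal solution of $(P_n)$ converging in probability to $\theta^\star$ as $n \to \infty$. Then $\sqrt{n}(\hat\theta_n - \theta^\star) \xrightarrow{d} x(Y)$, where $x = x(Y)$ is the optimal solution to the quadratic programming problem
\[
\operatorname*{minimize}_{x} \ x^\top Y_1 + \tfrac{1}{2}\, x^\top \nabla^2 L(\theta^\star,\lambda^\star)\, x \quad \text{subject to} \quad \nabla G_1(\theta^\star)^\top x + Y_2 \le 0, \quad \nabla G_2(\theta^\star)^\top x + Y_3 \le 0.
\]

Note that $\nabla G_1(\theta^\star)^\top x + Y_2 \le 0 \iff \nabla G(\theta^\star)^\top x + Y \le 0$ and $\nabla G_2(\theta^\star)^\top x + Y_3 \le 0 \iff -\nabla G(\theta^\star)^\top x + (-Y) \le 0$. Therefore, the last quadratic programming problem with two inequality constraints reduces to the quadratic programming problem with a single equality constraint obtained when $\alpha_n \equiv 0$. The limiting distributional results thus persist as proved for the $\alpha_n \equiv 0$ case.

Lastly, we calculate the asymptotic relative efficiency (ARE) of $\tilde\theta_n$ to $\hat\theta_n$. Recall that
\[
\sigma^2 = \frac{\operatorname{Var}_{Q_0}[g - \kappa h]}{\pi_0 (\mathbb{E}_{Q_0}[h])^2} + \frac{\operatorname{Var}_{Q_1}[g - \kappa h]}{\pi_1 (\mathbb{E}_{Q_1}[h])^2},
\]
\begin{align*}
\tilde\sigma^2 &= \psi_{\mathrm{frac}}^{-2} \times \left\{ \frac{\operatorname{Var}_{\tilde Q_0}[g - \kappa h]}{\tilde\pi_0 (\mathbb{E}_{\tilde Q_0}[h])^2} + \frac{\operatorname{Var}_{\tilde Q_1}[g - \kappa h]}{\tilde\pi_1 (\mathbb{E}_{\tilde Q_1}[h])^2} \right\} \\
&= \psi_{\mathrm{frac}}^{-2} \times \left\{ \frac{(1-\gamma)\pi_0 \operatorname{Var}_{Q_0}[g - \kappa h] + \gamma\pi_1 \operatorname{Var}_{Q_1}[g - \kappa h]}{\{(1-\gamma)\pi_0 \mathbb{E}_{Q_0}[h] + \gamma\pi_1 \mathbb{E}_{Q_1}[h]\}^2} + \frac{\gamma\pi_0 \operatorname{Var}_{Q_0}[g - \kappa h] + (1-\gamma)\pi_1 \operatorname{Var}_{Q_1}[g - \kappa h]}{\{\gamma\pi_0 \mathbb{E}_{Q_0}[h] + (1-\gamma)\pi_1 \mathbb{E}_{Q_1}[h]\}^2} \right\},
\end{align*}
and
\[
\psi_{\mathrm{frac}} = \frac{(1-2\gamma)\pi_0\pi_1 \mathbb{E}_{Q_0}[h]\, \mathbb{E}_{Q_1}[h]}{\{\gamma\pi_0 \mathbb{E}_{Q_0}[h] + (1-\gamma)\pi_1 \mathbb{E}_{Q_1}[h]\}\{(1-\gamma)\pi_0 \mathbb{E}_{Q_0}[h] + \gamma\pi_1 \mathbb{E}_{Q_1}[h]\}}.
\]
Therefore, we have
\begin{align*}
\operatorname{ARE}(\tilde\theta_n, \hat\theta_n) = \frac{\sigma^2}{\tilde\sigma^2}
&= \varphi\left(\gamma, \frac{\pi_0 \mathbb{E}_{Q_0}[h]}{\pi_1 \mathbb{E}_{Q_1}[h]}, \frac{\operatorname{Var}_{Q_0}[g(\theta^\star;X,Y) - \kappa h(X,Y)]/\mathbb{E}_{Q_0}[h]}{\operatorname{Var}_{Q_1}[g(\theta^\star;X,Y) - \kappa h(X,Y)]/\mathbb{E}_{Q_1}[h]}\right) \\
&= \varphi\left(\gamma, \frac{\pi_0 m_0}{\pi_1 m_1}, \frac{\operatorname{Var}[g(\theta^\star;X,Y) - \kappa h(X,Y)\mid A=0]/m_0}{\operatorname{Var}[g(\theta^\star;X,Y) - \kappa h(X,Y)\mid A=1]/m_1}\right),
\end{align*}
where
\[
\varphi(\gamma, r_1, r_2) \triangleq \frac{(1-2\gamma)^2\, r_1 (r_1+r_2)}{\{\gamma r_1+(1-\gamma)\}^2 \{(1-\gamma) r_1 r_2+\gamma\} + \{(1-\gamma) r_1+\gamma\}^2 \{\gamma r_1 r_2+(1-\gamma)\}}.
\]
Hence we complete the proof of Theorem 3.1. □

D MULTIPLE DEMOGRAPHIC GROUPS

We provide further discussion to supplement Section 3.1. Note that the fairness notion (3.5) uses group 0 as a reference group. One can also define a fairness notion by $\mathbb{E}[g(\theta;X,Y)\mid A=k]/\mathbb{E}[h(X,Y)\mid A=k]$ […] in group indices. Due to the equivalence of (D.1) and (3.5), we opt to use (3.5) for a comparison with the two-group theory. Theorem 3.2 is a direct extension of Theorem 3.1 and follows the same proof procedure. Moreover, letting $h \equiv 1$, the linear-fractional fairness (3.5) degenerates into linear fairness:
\[
\mathbb{E}[g(\theta;X,Y)\mid A=k] - \mathbb{E}[g(\theta;X,Y)\mid A=0] = 0 \quad \text{for } k \in [K]. \tag{D.2}
\]
By Theorem 3.2, we immediately have the following corollary.

Corollary (Privacy cost in linear fairness (D.2)-aware learning). Under the standing assumptions, let estimators $\hat\theta_n$ and $\tilde\theta_n$ be consistent for $\theta^\star$. Then
\[
\sqrt{n}\{c(\hat\theta_n) - \cancel{c(\theta^\star)}\} \xrightarrow{d} N(0, \Sigma) \quad \text{and} \quad \sqrt{n}\{c(\tilde\theta_n) - \cancel{c(\theta^\star)}\} \xrightarrow{d} N(0, \Psi_{\mathrm{lin}}^{-1} \tilde\Sigma\, \Psi_{\mathrm{lin}}^{-\top}),
\]
where
\[
\Sigma_{kl} = \frac{\operatorname{Var}_{Q_0}[g(\theta^\star;X,Y)]}{\pi_0} + \frac{\operatorname{Var}_{Q_k}[g(\theta^\star;X,Y)]}{\pi_k}\mathbf{1}\{k=l\}, \qquad
\tilde\Sigma_{kl} = \frac{\operatorname{Var}_{\tilde Q_0}[g(\theta^\star;X,Y)]}{\tilde\pi_0} + \frac{\operatorname{Var}_{\tilde Q_k}[g(\theta^\star;X,Y)]}{\tilde\pi_k}\mathbf{1}\{k=l\},
\]
and
\[
(\Psi_{\mathrm{lin}})_{kl} = \begin{cases}
\left(\dfrac{1-K\gamma}{\tilde\pi_k} - \dfrac{\gamma}{\tilde\pi_0}\right)\pi_k & \text{if } k = l, \\[10pt]
\left(\dfrac{1}{\tilde\pi_k} - \dfrac{1}{\tilde\pi_0}\right)\gamma\pi_l & \text{if } k \ne l,
\end{cases}
\qquad \text{for } k, l \in [K].
\]

Figure 1: Asymptotic relative efficiency curve of γ for varying r 1 and r 2 .

Figure 2: Relative efficiency curves for π 0 = 0.3 (left) and π 0 = 0.5 (right).

Figure 3: Asymptotic relative efficiency curve of γ for varying ratios of π 0 to π 1 and Var[g(θ ⋆ ; X, Y )|A = 0] to Var[g(θ ⋆ ; X, Y )|A = 1].

Figure 4: Asymptotic relative efficiency curve of γ for varying ratio of P(Y = 1, A = 0) to P(Y = 1, A = 1).

E MISSING SENSITIVE ATTRIBUTES

Under the missingness mechanism (3.8), the probability of observing a complete sample from group $a$ is $\mathbb{P}(A=a, R=1) = \omega_a \pi_a$ for $a \in \{0,1\}$. By the intermediate conclusion of Theorem 3.1, we have
\[
\sqrt{n}\{c(\hat\theta_n) - \cancel{c(\theta^\star)}\} \xrightarrow{d} N\left(0, \frac{\operatorname{Var}_{Q_0}[g(\theta^\star;X,Y) - \kappa h(X,Y)]}{\pi_0 (\mathbb{E}_{Q_0}[h(X,Y)])^2} + \frac{\operatorname{Var}_{Q_1}[g(\theta^\star;X,Y) - \kappa h(X,Y)]}{\pi_1 (\mathbb{E}_{Q_1}[h(X,Y)])^2}\right)
\]
and
\[
\sqrt{n}\{c(\tilde\theta_n) - \cancel{c(\theta^\star)}\} \xrightarrow{d} N\left(0, \frac{\operatorname{Var}_{Q_0}[g(\theta^\star;X,Y) - \kappa h(X,Y)]}{\omega_0 \pi_0 (\mathbb{E}_{Q_0}[h(X,Y)])^2} + \frac{\operatorname{Var}_{Q_1}[g(\theta^\star;X,Y) - \kappa h(X,Y)]}{\omega_1 \pi_1 (\mathbb{E}_{Q_1}[h(X,Y)])^2}\right).
\]
Comparing the two asymptotic variances, we conclude that
\[
\operatorname{ARE}(\tilde\theta_n, \hat\theta_n) = \frac{r_2 + r_1}{\omega_0^{-1} r_2 + \omega_1^{-1} r_1},
\]
where
\[
r_1 = \frac{\pi_0 m_0}{\pi_1 m_1} \quad \text{and} \quad r_2 = \frac{\operatorname{Var}[g(\theta^\star;X,Y) - \kappa h(X,Y)\mid A=0]/m_0}{\operatorname{Var}[g(\theta^\star;X,Y) - \kappa h(X,Y)\mid A=1]/m_1}.
\]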
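The ARE under missing sensitive attributes is a simple ratio and can be evaluated directly to see how the observation probabilities $\omega_0$ and $\omega_1$ drive the efficiency loss. The sketch below is illustrative; the function name `are_missing` is hypothetical, and the inputs are assumed to satisfy $r_1, r_2 > 0$ and $0 < \omega_a \le 1$.

```python
def are_missing(r1, r2, omega0, omega1):
    """ARE = (r2 + r1) / (r2 / omega0 + r1 / omega1) under missing attributes.

    omega0 and omega1 are the probabilities that the sensitive attribute is
    observed in group 0 and group 1, respectively.
    """
    if not (0 < omega0 <= 1 and 0 < omega1 <= 1):
        raise ValueError("omega0 and omega1 must lie in (0, 1]")
    if r1 <= 0 or r2 <= 0:
        raise ValueError("r1 and r2 must be positive")
    return (r2 + r1) / (r2 / omega0 + r1 / omega1)
```

When both groups are fully observed the ratio is 1, and when both groups share the same observation probability $\omega$ the ratio equals $\omega$, regardless of `r1` and `r2`.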


