DISPARATE IMPACT IN DIFFERENTIAL PRIVACY FROM GRADIENT MISALIGNMENT

Abstract

As machine learning becomes more widespread throughout society, aspects including data privacy and fairness must be carefully considered, and are crucial for deployment in highly regulated industries. Unfortunately, the application of privacy enhancing technologies can worsen unfair tendencies in models. In particular, one of the most widely used techniques for private model training, differentially private stochastic gradient descent (DPSGD), frequently intensifies disparate impact on groups within data. In this work we study the fine-grained causes of unfairness in DPSGD and identify gradient misalignment due to inequitable gradient clipping as the most significant source. This observation leads us to a new method for reducing unfairness by preventing gradient misalignment in DPSGD.

1. INTRODUCTION

The increasingly widespread use of machine learning throughout society has brought into focus social, ethical, and legal considerations surrounding its use. In highly regulated industries, such as healthcare and banking, regional laws and regulations require data collection and analysis to respect the privacy of individuals. Other regulations focus on the fairness of how models are developed and used. As machine learning is progressively adopted in highly regulated industries, the privacy and fairness aspects of models must be considered at all stages of the modelling lifecycle.

There are many privacy enhancing technologies, including differential privacy (Dwork et al., 2006), federated learning (McMahan et al., 2017), secure multiparty computation (Yao, 1986), and homomorphic encryption (Gentry, 2009), that are used separately or jointly to protect the privacy of individuals whose data is used for machine learning (Choquette-Choo et al., 2020; Adnan et al., 2022; Kalra et al., 2021). The latter three technologies find usage in sharing schemes and can allow data to be analysed while preventing its exposure to the wrong parties. However, the procedures usually return a trained model which itself can leak private information (Carlini et al., 2019). On the other hand, differential privacy (DP) focuses on quantifying the privacy cost of disclosing aggregated information about a dataset, and can guarantee that nothing is learned about individuals that could not be inferred from population-level correlations (Jagielski et al., 2019). Hence, DP is often used when the results of data analysis will be made publicly available, for instance when exposing the outputs of a model, or the results of the most recent US census (Abowd, 2018).

Not only must privacy be protected for applications in regulated industries, models must also be fair.
While there is no single definition that captures what it means to be fair, with regard to model-based decision making fairness may preclude disparate treatment or disparate impact (Mehrabi et al., 2021). Disparate treatment is usually concerned with how models are applied across populations, whereas disparate impact can arise from biases in datasets that are amplified by the greedy nature of loss minimization algorithms (Buolamwini & Gebru, 2018). Differences in model performance across protected groups can result in a significant negative monetary, health, or societal impact for individuals who are discriminated against (Chouldechova & Roth, 2020). Unfortunately, it has been observed that disparate impact can be exacerbated by applying DP in machine learning (Bagdasaryan et al., 2019). Applications of DP always come with a privacy-utility tradeoff, where stronger guarantees of privacy negatively impact the usefulness of results, which in this context means model performance (Dwork & Roth, 2014). Underrepresented groups within the population can experience disparity in the cost of adding privacy; hence, fairness concerns are a major obstacle to deploying models trained with DP.

The causes of unfairness in DP depend on the techniques used, but are not fully understood. For the most widely used technique, differentially private stochastic gradient descent (DPSGD), two sources of error are introduced that impact model utility: per-sample gradients are clipped to a fixed upper bound on their norm, then noise is added to the averaged gradient. Disparate impact from DPSGD was initially hypothesized to be rooted in unbalanced datasets (Bagdasaryan et al., 2019), though counterexamples were found by Xu et al. (2021). Recent research claims disparate impact to be caused by incommensurate clipping errors across groups, in turn brought about by a large difference in average group gradient norms (Xu et al., 2021; Tran et al., 2021a).
In this work we highlight the disparate impact of gradient misalignment. In particular, we claim that the most significant cause of disparate impact is the difference in the direction of the unclipped and clipped gradients, which in turn can be caused by aggressive clipping and imbalances of gradient norms between groups. Our analysis of direction errors leads to a variant of DPSGD with properly aligned gradients. We explore this alternate method in relation to disparate impact and show that it not only significantly reduces the cost of privacy across all protected groups, it also reduces the difference in cost of privacy for all groups. Hence, it removes disparate impact and is more effective than previous proposals in doing so. On top of this, it is the only approach which does not require access to protected group labels, and thereby avoids disparate treatment of groups. In summary we:
• Conduct a more fine-grained analysis of disparate impact in DPSGD, and demonstrate gradient misalignment to be the most significant cause;
• Identify an existing algorithm, previously undiscussed in the fairness context, which properly aligns gradients, and show it reduces disparate impact and disparate treatment;
• Improve the utility of said algorithm via two alterations;
• Experimentally verify that aligning gradients is more successful at mitigating disparate impact than previous approaches.

2. RELATED WORK

Privacy and Fairness: While privacy and fairness have been extensively studied separately, recently their interactions have come into focus. Ekstrand et al. (2018) considered the intersection of privacy and fairness for several definitions of privacy. This research gained new urgency when Bagdasaryan et al. (2019) observed that DPSGD exacerbated existing disparity in model accuracy on underrepresented groups. Disparate impact due to DP was further observed in Pujol et al. (2020) and Farrand et al. (2020) for varying levels of group imbalance. Using an adversarial definition of privacy, Jaiswal & Mower Provost (2020) found that overrepresented groups can incur higher privacy costs. Similar examples were shown in Xu et al. (2021) for DPSGD, and disparate impact was linked to groups having larger gradient norms. Other fairness-aware learning research has evaluated the fairness of a private model's outcomes on protected groups. In this context fairness might refer to a statistical condition of non-discrimination with respect to groups (Mozannar et al., 2020; Tran et al., 2021b), for example, equalized odds (Jagielski et al., 2019), equality of opportunity (Cummings et al., 2019), or demographic parity (Xu et al., 2019; Farrand et al., 2020). Chang & Shokri (2021) empirically found that imposing fairness constraints on private models could lead to higher privacy loss for certain groups. We consider cross-model fairness, where the cost of adding privacy to a non-private model must be fairly distributed between groups.

Adaptive Clipping: Many variations on the clipping procedure in DPSGD have been proposed to improve properties other than fairness. Adaptive clipping comes in many forms, but usually tunes the clipping threshold during training to provide better privacy-utility tradeoffs and convergence (Andrew et al., 2021; Pichapati et al., 2019). The convergence of DPSGD connects to the symmetry properties of the distribution of gradients (Chen et al., 2020), which are affected by clipping.

3.1. SETTING AND DEFINITIONS

We begin by laying out the problem setting and reviewing the relevant definitions for discussing fairness in privacy. For concreteness we consider a binary classification problem on a dataset $D$ which consists of $n$ points of the form $(x_i, a_i, y_i)$, where $x_i \in \mathbb{R}^d$ is a feature vector, $y_i \in \{0, 1\}$ is a binary label, and $a_i \in [K]$ refers to a protected group attribute which partitions the data. The group label $a_i$ can optionally be an attribute in $x_i$, the label value $y_i$, or some distinct auxiliary value. The goal is to train a model $f_\theta : \mathbb{R}^d \to [0, 1]$ with parameter vector $\theta$ that is simultaneously useful and private, and in which the application of privacy is fair.

Utility in the empirical risk minimization (ERM) problem is governed by the per-sample loss $\ell : [0, 1] \times \{0, 1\} \to \mathbb{R}$, with the optimal model minimizing the objective
$$L(\theta; D) = \frac{1}{n} \sum_{i \in D} \ell(f_\theta(x_i), y_i),$$
which happens for optimal parameters $\theta^* = \arg\min_\theta L(\theta; D)$. The requirement of privacy is applied to the model through its parameters; private parameters $\tilde\theta$ must be obtained while exposing a minimal amount of private information in $D$. For this we apply the framework of differential privacy, recounted in the next section.

Fairness of the privacy methodology can be measured in terms of the disparate impact that applying privacy has on the protected groups. As in Bagdasaryan et al. (2019), we use a version of accuracy parity, the difference in classification accuracy across protected groups after adding privacy. We denote a subset of the data containing all points belonging to group $k$ as $D_k = \{(x_i, a_i, y_i) \in D \mid a_i = k\}$. A private model has accuracy parity for subset $D_k$ if it minimizes the privacy cost
$$\pi(\theta, D_k) = \mathrm{acc}(\theta^*; D_k) - \mathbb{E}_{\tilde\theta}[\mathrm{acc}(\tilde\theta; D_k)], \qquad (1)$$
where the expectation is over the randomness involved in acquiring private model parameters. Of course, metrics other than classification accuracy could be used as required by the problem setting.
Alternatively, fairness for privacy can be measured at the level of the loss function as in Tran et al. (2021a), which is more amenable to analyzing the causes of unfairness. The excessive risk over the course of training experienced by a group is
$$R(\theta, D_k) = \mathbb{E}_{\tilde\theta}[L(\tilde\theta; D_k)] - L(\theta^*; D_k). \qquad (2)$$
The goal of a fair private classifier is to minimize the privacy cost and/or excessive risk for all values of the protected group attribute, while maintaining small fairness gaps.
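As a rough illustration of how the two per-group metrics above might be estimated in practice, the following sketch replaces the expectation over private parameters with an average over several private training runs, and uses a trained non-private model as a stand-in for the optimal parameters. The function names, the toy predictions, and the use of an absolute difference for the gap between two groups are assumptions made for this example only.

```python
from statistics import mean

def group_accuracy(preds, labels, groups, k):
    """Classification accuracy on D_k, the points whose group attribute is k."""
    idx = [i for i, a in enumerate(groups) if a == k]
    return sum(preds[i] == labels[i] for i in idx) / len(idx)

def privacy_cost(nonprivate_preds, private_runs, labels, groups, k):
    """Estimate of the privacy cost: non-private accuracy minus mean private accuracy on D_k."""
    base = group_accuracy(nonprivate_preds, labels, groups, k)
    private = mean(group_accuracy(p, labels, groups, k) for p in private_runs)
    return base - private

def excessive_risk(nonprivate_loss, private_losses):
    """Estimate of the excessive risk: mean private loss on D_k minus non-private loss on D_k."""
    return mean(private_losses) - nonprivate_loss

# Toy inputs, made up for illustration (not taken from the experiments).
labels = [0, 1, 1, 0, 1, 0]
groups = [0, 0, 0, 1, 1, 1]
nonprivate_preds = [0, 1, 1, 0, 1, 0]
private_runs = [[0, 1, 1, 0, 0, 1], [0, 1, 0, 0, 1, 1]]  # one prediction list per run

cost_0 = privacy_cost(nonprivate_preds, private_runs, labels, groups, 0)
cost_1 = privacy_cost(nonprivate_preds, private_runs, labels, groups, 1)
gap = abs(cost_1 - cost_0)  # assumed gap definition: absolute difference between groups
```

In this toy case group 1 loses half of its accuracy under privacy while group 0 loses one sixth, so the privacy cost is unevenly distributed even though both groups have the same size.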

3.2. DIFFERENTIAL PRIVACY

Algorithm 1 DPSGD
Require: Iterations $T$, dataset $D$, sampling rate $q$, clipping bound $C_0$, noise multiplier $\sigma$, learning rates $\eta_t$
  Initialize $\theta_0$ randomly
  for $t$ in $0, \dots, T-1$ do
    $B \leftarrow$ Poisson sample of $D$ with rate $q$
    for $(x_i, y_i)$ in $B$ do
      $g_i \leftarrow \nabla_\theta \ell(f_{\theta_t}(x_i), y_i)$
      $\bar g_i \leftarrow g_i \cdot \min\left(1, \frac{C_0}{\|g_i\|}\right)$
    $\tilde g_B \leftarrow \frac{1}{|B|} \sum_{i \in B} \bar g_i + N(0, \sigma^2 C_0^2 I)$
    $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde g_B$

Differential privacy (DP) (Dwork et al., 2006) is a widely used framework for quantifying the privacy consumed by a data analysis procedure. Formally, let $D$ represent a set of data points, and $M$ a probabilistic function, or mechanism, acting on datasets. We say that the mechanism is $(\epsilon, \delta)$-differentially private if for all subsets of possible outputs $S \subseteq \mathrm{Range}(M)$, and for all pairs of databases $D$ and $D'$ that differ by the addition or removal of one element,
$$\Pr[M(D) \in S] \le \exp(\epsilon) \Pr[M(D') \in S] + \delta. \qquad (3)$$

For the ERM problem, there are several ways to train a differentially private model (Chaudhuri et al., 2011). In this work we consider models that can be trained with stochastic gradient descent (SGD), such as neural networks, and focus on the most successful approach, DPSGD (Abadi et al., 2016), in which the Gaussian mechanism (Dwork & Roth, 2014) is applied to gradient updates as in Alg. 1. Since per-sample gradients $g_i$ generally do not have finite sensitivity, defined as $\Delta_h = \max_{D, D'} \|h(D) - h(D')\|$ for a function $h$, they are first clipped to have norm upper bounded by a fixed hyperparameter $C_0$. Clipped gradients $\bar g_i$ in a batch $B \subset D$ are aggregated into $\bar g_B$ and noise is added to produce $\tilde g_B$ used in the parameter update.
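A single DPSGD update as described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: per-sample gradients are assumed to be given as lists of floats, the helper names (`poisson_sample`, `clip`, `dpsgd_step`) are invented for this example, the noise is added to the averaged clipped gradient exactly as the update is written in the text, and no privacy accounting is performed.

```python
import math
import random

def poisson_sample(data, q, rng):
    """Include each element independently with probability q."""
    return [x for x in data if rng.random() < q]

def clip(g, c0):
    """Rescale g so that its Euclidean norm is at most c0."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, c0 / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

def dpsgd_step(theta, per_sample_grads, c0, sigma, lr, rng):
    """One parameter update; assumes a non-empty batch of per-sample gradients."""
    clipped = [clip(g, c0) for g in per_sample_grads]
    n = len(clipped)
    # Average the clipped gradients, then add Gaussian noise with std sigma * c0.
    noisy = [sum(col) / n + rng.gauss(0.0, sigma * c0) for col in zip(*clipped)]
    return [t - lr * g for t, g in zip(theta, noisy)]

rng = random.Random(0)
# With sigma = 0 the step is deterministic, which makes the clipping effect visible.
theta_next = dpsgd_step([0.0, 0.0], [[3.0, 4.0], [0.3, 0.4]], c0=1.0, sigma=0.0, lr=1.0, rng=rng)
```

Here the first gradient has norm 5 and is scaled down to norm 1, while the second has norm 0.5 and passes through unchanged, so the two samples no longer contribute in their original proportions.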

3.3. FAIRNESS CONCERNS FROM CLIPPING AND NOISE IN DPSGD

The two most significant steps in DPSGD, clipping and adding noise, can impact the learning process disproportionately across groups, but the exact conditions where disparate impact will occur have been debated (Bagdasaryan et al., 2019; Farrand et al., 2020; Xu et al., 2021; Tran et al., 2021a). The most concrete connection so far appears in Tran et al. (2021a), where the expected loss $L(\theta; D_a)$ is decomposed into terms contributing to the excessive risk at a single iteration for group $a$, $R_a$:

Proposition 1 (Tran et al. (2021a)). Consider the ERM problem with twice-differentiable loss $\ell$ with respect to the model parameters. The expected loss $\mathbb{E}[L(\theta_{t+1}; D_a)]$ of group $a \in [K]$ at iteration $t$ is approximated up to second order in $\|\theta_{t+1} - \theta_t\|$ as:
$$\begin{aligned}
\mathbb{E}[L(\theta_{t+1}; D_a)] \approx{} & L(\theta_t; D_a) - \eta_t \langle g_{D_a}, g_D \rangle + \frac{\eta_t^2}{2} \mathbb{E}[g_B^T H_\ell^a g_B] && \text{(non-private term)} \\
& + \eta_t \langle g_{D_a}, g_D - \bar g_D \rangle + \frac{\eta_t^2}{2} \left( \mathbb{E}[\bar g_B^T H_\ell^a \bar g_B] - \mathbb{E}[g_B^T H_\ell^a g_B] \right) && (R_a^{\text{clip}}) \\
& + \frac{\eta_t^2}{2} \mathrm{Tr}(H_\ell^a) C_0^2 \sigma^2. && (R_a^{\text{noise}})
\end{aligned}$$
The expectation is taken over the randomness of the DP mechanisms, and batches of data.

Terms in the first line appear for ordinary SGD, and do not contribute to the excessive risk Eq. (2). The terms in the second line, $R_a^{\text{clip}}$, are caused by clipping since they cancel when $\bar g_B = g_B$ for every batch. They involve gradients $g_{D_a}$ and Hessians $H_\ell^a$, averaged over datapoints belonging to group $a$. The final term, $R_a^{\text{noise}}$, depends on the scale of noise added in Alg. 1, as well as the trace of the Hessian, also called the Laplacian, averaged over $D_a$. Based on Prop. 1, Tran et al. (2021a) argue that clipping causes excessive risk to groups with large gradient norms, which can result from large input norms $\|x_i\|$. Whether or not a group is underrepresented has less influence. In the next section we provide a new perspective on $R_a^{\text{clip}}$ and the underlying causes of unfairness in DPSGD.

Clipping in DPSGD introduces two types of error to the clipped batch gradient $\bar g_B$. It will generally have a different norm than $\|g_B\|$, and be misaligned compared to the SGD batch gradient, $g_B$. At a high level, gradient misalignment poses a more serious problem to the convergence of DPSGD than magnitude error, as illustrated in Fig. 1. Changing only the norm means gradient descent will still step towards the (local) minimum of the loss function, and any norm error could be completely compensated for by adapting the learning rate $\eta_t$. In contrast, a misaligned gradient could result in a step towards significantly worse regions of the loss landscape, causing catastrophic failures of convergence. Misaligned gradients add bias which compounds over training, as underrepresented or complex groups are systematically clipped. For comparison, adding noise to the aggregated gradient does not add bias, so noise errors tend to cancel out over training. We aim to quantify the relative impact of these effects and how they contribute to the excessive risk.
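The two error types described above can be measured directly for a given batch: the norm ratio between the clipped and unclipped batch gradients captures magnitude error, and their cosine similarity captures misalignment. The sketch below is only illustrative; the helper names and the toy gradients are assumptions, and the two returned vectors are simply a copy of the unclipped batch gradient rescaled to the clipped norm and a copy of the clipped batch gradient rescaled to the unclipped norm.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def clip(g, c0):
    n = norm(g)
    return [x * min(1.0, c0 / n) for x in g] if n > 0 else list(g)

def clipping_errors(per_sample_grads, c0):
    """Return (norm ratio, cosine, magnitude-only vector, direction-only vector)."""
    g_b = mean_vec(per_sample_grads)                              # unclipped batch gradient
    g_bar = mean_vec([clip(g, c0) for g in per_sample_grads])     # clipped batch gradient
    ratio = norm(g_bar) / norm(g_b)
    cosine = dot(g_b, g_bar) / (norm(g_b) * norm(g_bar))
    magnitude_only = [ratio * x for x in g_b]    # direction of g_b, norm of g_bar
    direction_only = [x / ratio for x in g_bar]  # direction of g_bar, norm of g_b
    return ratio, cosine, magnitude_only, direction_only

# Made-up batch: one sample with a large gradient, three with small gradients.
toy = [[10.0, 0.0], [0.0, 0.5], [0.0, 0.5], [0.0, 0.5]]
ratio, cosine, mag_step, dir_step = clipping_errors(toy, c0=0.5)

# When every per-sample gradient points the same way, clipping only shrinks the norm.
_, aligned_cosine, _, _ = clipping_errors([[3.0, 4.0], [6.0, 8.0]], c0=1.0)
```

In the toy batch the single large gradient dominates the unclipped average but is cut to the same norm as the others after clipping, so the clipped batch gradient points in a noticeably different direction (cosine below 0.5).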

4. DISPARATE IMPACT IS CAUSED BY GRADIENT MISALIGNMENT

We can distinguish the effects of clipping by rewriting the clipped batch gradient as $\bar g_B = M_B \frac{\|\bar g_B\|}{\|g_B\|} g_B$ for an orthogonal matrix $M_B$ such that $\bar g_B$ and $M_B g_B$ are colinear. As a proof of concept that gradient misalignment is the more severe error, we compared models trained by taking steps $\frac{\|\bar g_B\|}{\|g_B\|} g_B$ vs. $M_B g_B$ with no noise added. These represent magnitude errors and direction errors from clipping, respectively. The models were trained on MNIST with class 8 undersampled, and the results compare the typical class 2 to the underrepresented class 8; full details are provided in App. B. As seen in Table 1, direction error is more detrimental to performance than magnitude error. In particular, it disproportionately increases loss and decreases accuracy on the underrepresented class 8.

Proposition 2. Consider the ERM problem with twice-differentiable loss $\ell$ with respect to the model parameters. The excessive risk due to clipping experienced by group $a \in [K]$ at iteration $t$ is approximated up to second order in $\|\theta_{t+1} - \theta_t\|$ as
$$\begin{aligned}
R_a^{\text{clip}} \approx{} & \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\left(1 - \frac{\|\bar g_B\|}{\|g_B\|}\right) g_B\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\left(\frac{\|\bar g_B\|^2}{\|g_B\|^2} - 1\right) g_B^T H_\ell^a g_B\right] && (R_a^{\text{mag}}) \\
& + \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} (g_B - M_B g_B)\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} \left( (M_B g_B)^T H_\ell^a (M_B g_B) - g_B^T H_\ell^a g_B \right)\right], && (R_a^{\text{dir}})
\end{aligned}$$
where $g_{D_a}$, $\bar g_{D_a}$ denote the average non-clipped and clipped gradients over group $a$ at iteration $t$, $H_\ell^a$ refers to the Hessian over group $a$, and $M_B$ is an orthogonal matrix such that $\bar g_B$ and $M_B g_B$ are colinear. The expectations are taken over batches of data.

We provide a derivation in App. A. Note that when the magnitude error is zero for all batches, $\|g_B\| = \|\bar g_B\|$, we have that $R_a^{\text{mag}} = 0$ as expected. As well, when there is no gradient misalignment then $M_B$ is the identity matrix for every batch, and so $R_a^{\text{dir}} = 0$.

To determine the characteristics of groups that will have unfair outcomes from clipping in DPSGD we can distill a simpler condition for when $R_a^{\text{dir}} > R_b^{\text{dir}}$. Tran et al. (2021a) already provide such a condition for clipping overall; however, it does not effectively account for the danger of gradient misalignment. Their condition is sufficient, but not necessary, and some of its looseness stems from the inequality $x^T y \ge -\|x\|\|y\|$ used to convert all terms in $R_a^{\text{clip}}$ into expressions involving group gradient norms. This approach loses information about gradient direction. We instead propose a tighter analysis of $R_a^{\text{dir}} - R_b^{\text{dir}}$ using $x^T y = \|x\|\|y\| \cos\theta$, where $\theta = \angle(x, y)$.

Proposition 3. Assume the loss $\ell$ is twice continuously differentiable and convex with respect to the model parameters. As well, assume that $\eta_t \le (\max_{k \in [K]} \lambda_k)^{-1}$ where $\lambda_k$ is the maximum eigenvalue of the Hessian $H_\ell^k$. For groups $a, b \in [K]$, $R_a^{\text{dir}} > R_b^{\text{dir}}$ if
$$\mathbb{E}\left[\|\bar g_B\| (\cos\theta_B^a - \cos\bar\theta_B^a)\right] > \frac{\|g_{D_b}\|}{\|g_{D_a}\|} \mathbb{E}\left[\|\bar g_B\| (\cos\theta_B^b - \cos\bar\theta_B^b)\right] + \frac{\mathbb{E}[\|\bar g_B\|^2]}{\|g_{D_a}\|}, \qquad (4)$$
where $\theta_B^k = \angle(g_{D_k}, g_B)$ and $\bar\theta_B^k = \angle(g_{D_k}, \bar g_B)$ for a group $k \in [K]$. Furthermore, the bound is tight.

App. A contains our proof. Prop. 3 shows that if the clipping operation disproportionately and sufficiently increases the direction error for group $a$ relative to group $b$, then group $a$ incurs larger excessive risk due to gradient misalignment. The lower bound for $R_a^{\text{dir}} - R_b^{\text{dir}}$ inferred from Eq. 4 is tight, and in our experiments we empirically show that it is close to saturation in a typical case. Hence, when the direction errors for groups $a$ and $b$ are small (i.e. we expect that $\theta_B^i \approx \bar\theta_B^i$ for $i = a, b$), we have that $R_a^{\text{dir}} - R_b^{\text{dir}} \approx 0$ regardless of the size of $\|g_{D_a}\|$ relative to $\|g_{D_b}\|$. It follows that clipping does not negatively impact excessive risk if gradients remain aligned. On the other hand, if direction error is not close to zero, large group gradient norms do exacerbate the error in direction, as the dominant term of $R_a^{\text{dir}}$ scales with $\|g_{D_a}\|$.

The excessive risk in Eq. 2 is evaluated at the end of training, whereas Props. 1 and 2 estimate it per-iteration. Fig. 1 demonstrates that the full impact of clipping errors may not be felt per-iteration, but only at convergence.
Indeed what matters to the end user is how fair the final model is, not how fair any intermediate training step is. However, it is not possible to attribute overall excessive risk to the per-iteration terms $R_a^{\text{dir}}$, $R_a^{\text{mag}}$, and $R_a^{\text{noise}}$, since the optimal $\theta_i^*$ used in the expansions of Props. 1 and 2 differ at each iteration, and do not equal the overall optimal $\theta^*$. Still, Table 1 demonstrates that gradient misalignment is the main cause of disparate impact, so we seek a method to prevent it.

Algorithm 2 DPSGD-Global, with the DPSGD-Global-Adapt alterations in parentheses
Require: Iterations $T$, dataset $D$, sampling rate $q$, clipping bound $C_0$, strict clipping bound $Z \ge C_0$, noise multipliers $\sigma_1$, ($\sigma_2$), learning rates $\eta_t$, (clipping learning rate $\eta_Z$, threshold $\tau \ge 0$)
  Initialize $\theta_0$ randomly
  for $t$ in $0, \dots, T-1$ do
    $B \leftarrow$ Poisson sample of $D$ with rate $q$
    for $(x_i, y_i)$ in $B$ do
      $g_i \leftarrow \nabla_\theta \ell(f_{\theta_t}(x_i), y_i)$
      $\gamma_i \leftarrow \frac{C_0}{Z}$ if $\|g_i\| \le Z$; $\gamma_i \leftarrow 0$ $\left(\frac{C_0}{\|g_i\|}\right)$ if $\|g_i\| > Z$
      $\bar g_i \leftarrow \gamma_i g_i$
    $\tilde g_B \leftarrow \frac{1}{|B|} \sum_{i \in B} \bar g_i + N(0, \sigma_1^2 C_0^2 I)$
    $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde g_B$
    (Adaptively set $Z$):
      ($b_t \leftarrow |\{i : \|g_i\| > \tau \cdot Z\}|$)
      ($\tilde b_t \leftarrow \frac{1}{|B|} (b_t + N(0, \sigma_2^2))$)
      ($Z \leftarrow Z \cdot \exp(-\eta_Z + \tilde b_t)$)

Our results so far show that gradient misalignment due to clipping is the most significant cause of unfairness in DPSGD. Logically, $R_a^{\text{dir}}$ would be minimized if privatization left the direction of $g_B$ unchanged. A promising avenue is to scale down all per-sample gradients in a batch by the same amount. This is the approach taken by DPSGD-Global (Bu et al., 2021), which was recently proposed to improve the convergence of DPSGD, and has not been discussed in the context of fairness before. Our theoretical results suggest that global scaling will reduce disparate impact.

DPSGD-Global (Alg. 2) aims to preserve privacy by scaling gradients as $\bar g_i = \gamma g_i$, $0 < \gamma < 1$. Of course, scaling alone is insufficient to ensure per-sample gradients have bounded sensitivity.
However, supposing that there were a strict upper bound $Z \ge \|g_i\|$ for all $i \in D$, then scaling all gradients by $\gamma = C_0 / Z$ would guarantee bounded sensitivity of $C_0$ for each $\bar g_i$ (Fig. 2a). Given sufficient smoothness of the loss function, for any sample of data there will be such an upper bound $\max_{i \in D} \|g_i\|$, but determining it exactly cannot be done in a differentially private manner. DPSGD-Global sets $Z$ as a hyperparameter without looking at the data, in the same way $C_0$ is chosen in DPSGD. If $Z$ fails to be a strict upper bound, any gradients with $\|g_i\| > Z$ are discarded to guarantee a bound on sensitivity. When $Z$ is chosen sufficiently large, no gradients are discarded and gradient misalignment is avoided. The drawback of a large $Z$ is that the scaled gradients $\bar g_i$ will become small and convergence of gradient descent may be hindered.

In addition to identifying that DPSGD-Global has the potential to reduce disparate impact, we propose two modifications to improve its utility. First, we note that discarding gradients with $\|g_i\| > Z$ can exacerbate disparate impact as it is often underrepresented groups that have large gradient norms (Xu et al., 2021). Instead, we clip large gradients to have norm $C_0$, which preserves more information while maintaining finite sensitivity (Fig. 2b). Second, rather than choosing $Z$ as a hyperparameter, we adaptively update $Z$ to upper-bound $\max_{i \in B} \|g_i\|$. When $Z$ is larger than all gradients it should be reduced to scale down gradients less, but if gradients are being clipped, $Z$ should be increased. $Z$ can be updated each iteration by privately estimating $b_t$, the number of gradients in $B$ that are larger than $Z$ times a tolerance threshold $\tau \ge 0$. Since $b_t$ is a unit sensitivity quantity we can estimate it privately as $\tilde b_t = \frac{1}{|B|} (b_t + N(0, \sigma_2^2))$. Then, we use the geometric update rule $Z \leftarrow Z \cdot \exp(-\eta_Z + \tilde b_t)$ with a learning rate $\eta_Z$ (cf. Andrew et al., 2021).
When all samples have gradient norm less than or equal to $\tau \cdot Z$, then in expectation $\tilde b_t = 0$ and $Z$ is decreased by a factor of $\exp(-\eta_Z)$. Alternatively, $Z$ is increased when $\tilde b_t > \eta_Z$, which occurs with probability at least 0.977 when $\frac{b_t}{|B|} \ge \eta_Z + \frac{2\sigma_2}{|B|}$, since the noise then has to fall more than two standard deviations below zero to prevent the increase. As a result, with high probability the algorithm will not have more than $|B| \eta_Z + 2\sigma_2$ gradients with norm exceeding $\tau \cdot Z$.

We call the method with our two alterations DPSGD-Global-Adapt, shown in Alg. 2 in parentheses. We empirically find in Sec. 6 that both global approaches improve fairness compared to prior methods, and that DPSGD-Global-Adapt has improved utility over DPSGD-Global. While the alterations are minor, our main contributions are elucidating that gradient misalignment is the main cause of disparate impact, and identifying that global scaling can prevent this problem.

Both global methods apply the sampled Gaussian mechanism (Mironov et al., 2019) to gradient norms with a sensitivity of $C_0$, and hence are amenable to the same DP analysis as DPSGD itself. In DPSGD-Global-Adapt, the additional step of privately estimating the number of gradients with norm larger than $\tau \cdot Z$ must be accounted for in the overall DP guarantee via a composition of sampled Gaussian mechanisms. From the analysis in Mironov et al. (2019), DPSGD-Global-Adapt is $(\epsilon, \delta)$-DP for any $\sigma_1, \sigma_2 > 0$, where $\epsilon$ can be determined numerically given $\delta$. However, our adaptive method is empirically not sensitive to the exact count $b_t$, so a relatively large amount of noise can be used; see Andrew et al. (2021) for comparison. In practice we used $\sigma_2 \approx 10\sigma_1$, which produced a negligible additional cost in the overall privacy budget.

Finally, we note that other approaches for mitigating unfairness, specifically DPSGD-F (Xu et al., 2021) and that of Tran et al. (2021a), require protected group labels for the training set.
Collecting such labels may expose individuals to additional privacy risks in the case of security breaches, or may be prohibited in practice. Both global methods have the advantage of not requiring protected group labels for training data, and treat all training examples on an equal footing, thereby avoiding disparate treatment, while disparate impact is mitigated by reducing gradient misalignment.
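The global scaling rule and the adaptive update of Z described in this section can be sketched as follows. This is a minimal, hedged illustration in plain Python: the function names and the `adapt` flag are assumptions for this example, per-sample gradients are given as lists of floats, the batch is assumed non-empty, and no privacy accounting is included. Setting `adapt=False` mimics the discarding behaviour of DPSGD-Global, while `adapt=True` applies the two alterations.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def scale_factor(g_norm, c0, z, adapt=True):
    """Per-sample factor: C0/Z below the bound; above it, C0/||g|| (adapt) or 0 (discard)."""
    if g_norm <= z:
        return c0 / z
    return c0 / g_norm if adapt else 0.0

def global_step(theta, grads, c0, z, sigma1, lr, rng,
                adapt=True, sigma2=0.0, lr_z=0.0, tau=1.0):
    """One update of the parameters and (when adapt=True) of the bound Z."""
    n = len(grads)
    norms = [norm(g) for g in grads]
    scaled = [[x * scale_factor(m, c0, z, adapt) for x in g] for g, m in zip(grads, norms)]
    noisy = [sum(col) / n + rng.gauss(0.0, sigma1 * c0) for col in zip(*scaled)]
    new_theta = [t - lr * g for t, g in zip(theta, noisy)]
    if adapt:
        b_t = sum(m > tau * z for m in norms)            # count of gradients above tau * Z
        b_noisy = (b_t + rng.gauss(0.0, sigma2)) / n     # noisy fraction
        z = z * math.exp(-lr_z + b_noisy)                # geometric update of Z
    return new_theta, z

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # made-up gradients with norms 5 and 0.5
# Noise is switched off (sigma1 = sigma2 = 0) so the effect of scaling is easy to read.
theta_adapt, z_adapt = global_step([0.0, 0.0], grads, c0=1.0, z=2.0, sigma1=0.0, lr=1.0,
                                   rng=rng, adapt=True, sigma2=0.0, lr_z=0.1, tau=1.0)
theta_plain, z_plain = global_step([0.0, 0.0], grads, c0=1.0, z=2.0, sigma1=0.0, lr=1.0,
                                   rng=rng, adapt=False)
```

With these toy values one gradient exceeds Z, so the adaptive variant raises Z for the next iteration, whereas the non-adaptive variant drops that gradient entirely and leaves Z unchanged. When no gradient exceeds Z, every sample is multiplied by the same factor and the batch direction is preserved.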

6. EXPERIMENTS

In our experiments we provide evidence that gradient misalignment is the most significant cause of unfairness, and demonstrate that global scaling can effectively reduce unfairness by aligning gradients. Our code for reproducing the experiments is provided as supplementary material.

6.1. EXPERIMENT SETTINGS

For all experiments, full details are provided in App. B. We use an artificially unbalanced MNIST training dataset where class 8 only constitutes about 1% of the dataset on average, and protected groups are the classes. We also use two census datasets popular in the ML fairness literature, Adult and Dutch (van der Laan, 2000), preprocessed as in Le Quy et al. (2022). For both datasets, "sex" is the protected group attribute which is balanced between males and females. Finally, we use the CelebA dataset (Liu et al., 2015) for binary classification on the gender label. The protected group attribute is whether the image contains eyeglasses. Images with eyeglasses comprise 12% of male images but only 2% of female images, and are more difficult to classify accurately.

We compare both global scaling techniques (Alg. 2) against two methods designed to reduce unfairness, DPSGD-F (Xu et al., 2021) (Alg. 3) and the Fairness-Lens method (Tran et al., 2021a) (Alg. 4), both of which are reviewed in App. B.5. Each method's effectiveness in removing disparate impact is measured using privacy cost $\pi_a$ (Eq. 1), and excessive risk $R_a$ (Eq. 2) per group, as well as the privacy cost gap $\pi_{a,b}$, and excessive risk gap $R_{a,b}$ between groups. For MNIST, the underrepresented group 8 is compared to group 2 (Xu et al., 2021). All experiments were run for 5 random seeds, and results are given as means ± standard errors.

For MNIST and CelebA, all methods train a convolutional neural network with two layers of 32 and 16 channels, 3x3 kernels, and tanh activations. Adult uses an MLP model with two hidden layers of 256 units, while Dutch uses a logistic regression model. For all private methods, we use an RDP accountant (Mironov, 2017) with $\delta = 10^{-6}$. As a baseline, for DPSGD we set $\sigma = 1$, $C_0 = 0.5$ for Adult, $\sigma = 1$, $C_0 = 0.1$ for Dutch, and $\sigma = 0.8$, $C_0 = 1$ for image datasets.
With this, training 20 epochs for tabular datasets, 60 epochs for MNIST and 30 epochs for CelebA gives $\epsilon = 3.41$ for Adult, $\epsilon = 2.27$ for Dutch, $\epsilon = 5.90$ for MNIST, and $\epsilon = 2.49$ for CelebA. DPSGD-F has negligibly higher $\epsilon$, while our method achieves the same $\epsilon$ guarantees to three significant digits. Complete hyperparameters are given in App. B.2.

Published as a conference paper at ICLR 2023

Tables 2 and 3 display the accuracy and loss, along with privacy cost and excessive risk metrics respectively for MNIST on classes 2 and 8 and CelebA on group W with eyeglasses, and group W/O without. Recall that higher is better for accuracy, but for all other metrics lower is better. According to the one-sided Wilcoxon signed rank test, both global methods have statistically significant (p < 0.05) improvement over DPSGD on accuracy, loss, privacy cost gap, and excessive risk gap. Similarly, DPSGD-Global-Adapt has statistically significant improvement over DPSGD-Global and DPSGD-F on accuracy and loss. The same conclusions hold for the Adult dataset, and also for Dutch with the exception of DPSGD-Global being comparable to DPSGD in loss, see Tables 4 and 5 in App. B.7. We infer that the global scaling technique mitigates unfairness, while our modifications further improve utility.
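For readers unfamiliar with the one-sided Wilcoxon signed-rank test mentioned above, the sketch below computes its exact p-value for a small number of paired measurements by enumerating all sign assignments. How the measurements were paired in the experiments is not stated here, so pairing per random seed is an assumption, the function name is invented, and the numbers are arbitrary placeholders rather than results. The sketch also assumes no zero differences and no tied absolute differences.

```python
from itertools import product

def wilcoxon_one_sided(x, y):
    """Exact p-value for the alternative 'x tends to exceed y' on paired samples.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    abs_diffs = [abs(d) for d in diffs]
    if 0 in abs_diffs or len(set(abs_diffs)) != len(abs_diffs):
        raise ValueError("zero or tied differences are not handled in this sketch")
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(len(diffs)), key=lambda i: abs_diffs[i])
    ranks = [0] * len(diffs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    as_extreme = sum(
        1 for signs in product((0, 1), repeat=len(diffs))
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, as_extreme / 2 ** len(diffs)

# Arbitrary placeholder values for five paired runs (not experimental results).
method_a = [12, 15, 11, 18, 14]
method_b = [10, 11, 10, 12, 11]
w_all_better, p_all_better = wilcoxon_one_sided(method_a, method_b)

# Same magnitudes, but the smallest difference now has the opposite sign.
w_one_worse, p_one_worse = wilcoxon_one_sided([2, 0, 3, 4, 5], [0, 1, 0, 0, 0])
```

With five pairs, the smallest attainable one-sided p-value is 1/32, which is reached only when every paired difference favours the same method; a single reversed difference already pushes the exact p-value to 2/32.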

6.2. RESULTS

Not only are final model metrics improved, we see that DPSGD-Global-Adapt trains more similarly to non-private SGD in Fig. 3 for Dutch (cf. Figs. 8, 9, and 10 in App. B.7 for Adult, MNIST, and CelebA). This shows the average train loss per iteration, and average norm of the batched gradient. The difference in loss for groups in DPSGD-Global-Adapt resembles that of the non-private method more closely than other methods. Consider Fig. 3 (bottom), where the group M average norm does not converge to 0 in DPSGD, a problem which is somewhat improved in DPSGD-F, while for FairLens the group F norms become much larger. In DPSGD-Global-Adapt the norms for both groups remain small, but importantly the gap between groups is reduced. 

7. DISCUSSION

In this paper we identified a core cause of disparate impact in DPSGD, gradient misalignment, and proposed a mitigating solution, global scaling. We empirically verified that global scaling is successful in improving fairness in terms of accuracy and loss over DPSGD and other fair baselines on several datasets. Our method has additional advantages over other fair baselines in that it does not require the collection of protected group data for training, does not involve disparate treatment, and it removes disparate impact for all groups simultaneously. It is important to note that while global scaling is effective at reducing disparate impact by aligning gradients, it does not resolve the privacy-utility trade-off, which exists in any private mechanism fundamentally. Nor does it ensure that the model is non-discriminatory towards subgroups, only that adding privacy does not exacerbate unfairness. For example, biases in data collection or discriminatory modelling assumptions can cause disparate impact within the non-private model, which overlaying global scaling will not cure. Any models trained with global scaling should still be validated for fairness independently; failure to do so could unknowingly cause additional unfairness.

A THEORETICAL RESULTS

A.1 PROOFS OF MAIN RESULTS

In this section we provide complete proofs for our theoretical contributions.

Proposition 2. Consider the ERM problem with twice-differentiable loss $\ell$ with respect to the model parameters. The excessive risk due to clipping experienced by group $a \in [K]$ at iteration $t$ is approximated up to second order in $\|\theta_{t+1} - \theta_t\|$ as
$$\begin{aligned}
R_a^{\text{clip}} \approx{} & \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\left(1 - \frac{\|\bar g_B\|}{\|g_B\|}\right) g_B\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\left(\frac{\|\bar g_B\|^2}{\|g_B\|^2} - 1\right) g_B^T H_\ell^a g_B\right] && (R_a^{\text{mag}}) \\
& + \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} (g_B - M_B g_B)\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} \left( (M_B g_B)^T H_\ell^a (M_B g_B) - g_B^T H_\ell^a g_B \right)\right], && (R_a^{\text{dir}})
\end{aligned}$$
where $g_{D_a}$, $\bar g_{D_a}$ denote the average non-clipped and clipped gradients over group $a$ at iteration $t$, $H_\ell^a$ refers to the Hessian over group $a$, and $M_B$ is an orthogonal matrix such that $\bar g_B$ and $M_B g_B$ are colinear. The expectations are taken over batches of data.

We remark that assuming a twice-differentiable loss is a mild requirement in machine learning where most loss functions and models are designed to be smooth enough for backpropagation.

Proof.

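The proof is based on a Taylor expansion of the excessive risk, as in Tran et al. (2021a). Let $M_B$ be an orthogonal matrix such that $\bar g_B = M_B \frac{\|\bar g_B\|}{\|g_B\|} g_B$. In this way, $\|\bar g_B\| = \left\| \frac{\|\bar g_B\|}{\|g_B\|} g_B \right\|$, and $g_B$ and $\frac{\|\bar g_B\|}{\|g_B\|} g_B$ are colinear, so $M_B$ characterizes the direction error and the factor $\frac{\|\bar g_B\|}{\|g_B\|}$ the error in magnitude. The excessive risk due to error in magnitude for group $a$ at iteration $t$ is then given by
$$\mathbb{E}\left[ L\left(\theta_t - \eta_t \frac{\|\bar g_B\|}{\|g_B\|} g_B; D_a\right) - L(\theta_t - \eta_t g_B; D_a) \right],$$
the cost in loss of using the update vector $\frac{\|\bar g_B\|}{\|g_B\|} g_B$ rather than $g_B$, where the expectation is over randomness of batch sampling. We perform a second-order Taylor expansion of $L\left(\theta_t - \eta_t \frac{\|\bar g_B\|}{\|g_B\|} g_B; D_a\right)$ and take the expectation to get that
$$\mathbb{E}\left[ L\left(\theta_t - \eta_t \frac{\|\bar g_B\|}{\|g_B\|} g_B; D_a\right) \right] \approx L(\theta_t; D_a) - \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} g_B^T H_\ell^a g_B\right].$$
Hence,
$$\begin{aligned}
R_a^{\text{clip}} ={} & \eta_t \langle g_{D_a}, g_D - \bar g_D \rangle + \frac{\eta_t^2}{2} \left( \mathbb{E}[\bar g_B^T H_\ell^a \bar g_B] - \mathbb{E}[g_B^T H_\ell^a g_B] \right) \\
={} & \eta_t \langle g_{D_a}, g_D - \bar g_D \rangle + \frac{\eta_t^2}{2} \left( \mathbb{E}[\bar g_B^T H_\ell^a \bar g_B] - \mathbb{E}[g_B^T H_\ell^a g_B] \right) \\
& - \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} g_B^T H_\ell^a g_B\right] \\
& + \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B\right] \right\rangle - \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} g_B^T H_\ell^a g_B\right] \\
={} & \eta_t \left\langle g_{D_a}, g_D - \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B\right] \right\rangle + \frac{\eta_t^2}{2} \left( \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} g_B^T H_\ell^a g_B\right] - \mathbb{E}[g_B^T H_\ell^a g_B] \right) \\
& + \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B\right] - \bar g_D \right\rangle + \frac{\eta_t^2}{2} \left( \mathbb{E}[\bar g_B^T H_\ell^a \bar g_B] - \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} g_B^T H_\ell^a g_B\right] \right) \\
={} & \eta_t \left\langle g_{D_a}, g_D - \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\left(\frac{\|\bar g_B\|^2}{\|g_B\|^2} - 1\right) g_B^T H_\ell^a g_B\right] && (R_a^{\text{mag}}) \\
& + \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B\right] - \bar g_D \right\rangle + \frac{\eta_t^2}{2} \left( \mathbb{E}[\bar g_B^T H_\ell^a \bar g_B] - \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} g_B^T H_\ell^a g_B\right] \right). && (R_a^{\text{dir}})
\end{aligned}$$
We can also further simplify $R_a^{\text{dir}}$ by using that $\bar g_D = \mathbb{E}[\bar g_B]$, $\bar g_B = M_B \frac{\|\bar g_B\|}{\|g_B\|} g_B$, and that $M_B$ is a linear transformation:
$$\begin{aligned}
R_a^{\text{dir}} ={} & \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} g_B - M_B \frac{\|\bar g_B\|}{\|g_B\|} g_B\right] \right\rangle + \frac{\eta_t^2}{2} \left( \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} (M_B g_B)^T H_\ell^a (M_B g_B)\right] - \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} g_B^T H_\ell^a g_B\right] \right) \qquad (5) \\
={} & \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} (g_B - M_B g_B)\right] \right\rangle + \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} \left( (M_B g_B)^T H_\ell^a (M_B g_B) - g_B^T H_\ell^a g_B \right)\right].
\end{aligned}$$

Proposition 3. Assume the loss $\ell$ is twice continuously differentiable and convex with respect to the model parameters. As well, assume that $\eta_t \le (\max_{k \in [K]} \lambda_k)^{-1}$ where $\lambda_k$ is the maximum eigenvalue of the Hessian $H_\ell^k$. For groups $a, b \in [K]$, $R_a^{\text{dir}} > R_b^{\text{dir}}$ if
$$\mathbb{E}\left[\|\bar g_B\| (\cos\theta_B^a - \cos\bar\theta_B^a)\right] > \frac{\|g_{D_b}\|}{\|g_{D_a}\|} \mathbb{E}\left[\|\bar g_B\| (\cos\theta_B^b - \cos\bar\theta_B^b)\right] + \frac{\mathbb{E}[\|\bar g_B\|^2]}{\|g_{D_a}\|},$$
where $\theta_B^k = \angle(g_{D_k}, g_B)$ and $\bar\theta_B^k = \angle(g_{D_k}, \bar g_B)$ for a group $k \in [K]$. Furthermore, the bound is tight.

Again, requiring a twice continuously differentiable loss is a mild requirement. However, when neural networks are used most loss functions are non-convex. Empirically we see in Fig. 5 that the lower bound can still apply in practice. The requirement on the learning rate is under the control of the practitioner, and we have verified that in practice it can be satisfied.

Proof.

This proof follows some steps presented in Lemma 2 of Tran et al. (2021a). We seek a simplified condition for when the following is positive,
$$\begin{aligned}
R_a^{\text{dir}} - R_b^{\text{dir}} ={} & \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} (g_B - M_B g_B)\right] \right\rangle - \eta_t \left\langle g_{D_b}, \mathbb{E}\left[\frac{\|\bar g_B\|}{\|g_B\|} (g_B - M_B g_B)\right] \right\rangle \\
& + \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} \left( (M_B g_B)^T H_\ell^a (M_B g_B) - g_B^T H_\ell^a g_B \right)\right] \\
& - \frac{\eta_t^2}{2} \mathbb{E}\left[\frac{\|\bar g_B\|^2}{\|g_B\|^2} \left( (M_B g_B)^T H_\ell^b (M_B g_B) - g_B^T H_\ell^b g_B \right)\right].
\end{aligned}$$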
Looking at one of the inner product terms, we use that ⟨x, y⟩ = ∥x∥∥y∥ cos(x, y) and linearity of expectation to obtain g Da , E ∥ḡ B ∥ ∥g B ∥ (g B -M B g B ) = E ∥ḡ B ∥ ∥g B ∥ (⟨g Da , g B ⟩ -⟨g Da , M B g B ⟩) = ∥g Da ∥E ∥ḡ B ∥ ∥g B ∥ ∥g B ∥ cos(g Da , g B ) -∥M B g B ∥ cos(g Da , M B g B ) = ∥g Da ∥E ∥ḡ B ∥(cos θ a B -cos θa B ) , where θ a B = ∠(g Da , g B ) and θa B = ∠(g Da , M B g B ) = ∠(g Da , ḡB ). The last equality follows from the definition of M B such that ḡB and M B g B are aligned and ∥g B ∥ = ∥M B g B ∥. We can also get a bound on the difference in conjugates of the Hessian, E ∥ḡ B ∥ 2 ∥g B ∥ 2 (M B g B ) T H a ℓ (M B g B ) -g T B H a ℓ g B . Note that since we assume the loss ℓ is convex, the Hessian H a ℓ is positive semi-definite such that x T H a ℓ x ≥ 0 for all vectors x. It follows that E[x T H a ℓ x] ≥ 0 and so using linearity of expectation, E ∥ḡ B ∥ 2 ∥g B ∥ 2 (M B g B ) T H a ℓ (M B g B ) -g T B H a ℓ g B ≤ E ∥ḡ B ∥ 2 ∥g B ∥ 2 (M B g B ) T H a ℓ (M B g B ) . ( ) Since ℓ is twice continuously differentiable we have that H a ℓ is symmetric and hence x T H a ℓ x ≤ λ a ∥x∥ 2 where λ a is the maximum eigenvalue of H a ℓ . We then again use that ∥M B g B ∥ = ∥g B ∥ and linearity of expectation to obtain E ∥ḡ B ∥ 2 ∥g B ∥ 2 (M B g B ) T H a ℓ (M B g B ) -g T B H a ℓ g B ≤ λ a E ∥ḡ B ∥ 2 . ( ) Similar analysis gives that E ∥ḡ B ∥ 2 ∥g B ∥ 2 (M B g B ) T H a ℓ (M B g B ) -g T B H a ℓ g B ≥ -λ a E[∥ḡ B ∥ 2 ]. Combining the above, it follows that R dir a -R dir b ≥ η t ∥g Da ∥E ∥ḡ B ∥(cos θ a B -cos θa B ) -∥g D b ∥E ∥ḡ B ∥(cos θ b B -cos θb B ) - η 2 t 2 (λ a + λ b )E[∥ḡ B ∥ 2 ], and since we assume η t ≤ 1 max k∈[K] λ k , R dir a -R dir b ≥ η t ∥g Da ∥E ∥ḡ B ∥(cos θ a B -cos θa B ) -∥g D b ∥E ∥ḡ B ∥(cos θ b B -cos θb B ) -E[∥ḡ B ∥ 2 ] . It follows that R dir a > R dir b when the following is satisfied: E ∥ḡ B ∥(cos θ a B -cos θa B ) > ∥g D b ∥ ∥g Da ∥ E ∥ḡ B ∥(cos θ b B -cos θb B ) + E[∥ḡ B ∥ 2 ] ∥g Da ∥ . 
( ) Finally, to see that the bound is tight we simply note that the inequalities that were introduced can all be saturated simultaneously. In Eq. 9 we require that g T B H a ℓ g B = 0, and in Eq. 10 we require (M B g B ) T H a ℓ (M B g B ) = λ a ∥M B g B ∥ 2 for each batch. These independent conditions can plausibly be met for some H a ℓ , g B , and M B . The only other inequality introduced is the assumption η t ≤ 1 max k∈[K] λ k , which we can strengthen to η t = 1 max k∈[K] λ k for the sake of achieving saturation.
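For readers following the proof of Prop. 3, the step that absorbs the Hessian term using the learning-rate assumption can be spelled out as below. This is only a restatement under the proposition's own assumptions (convexity gives $\lambda_a, \lambda_b \ge 0$, and we take $\eta_t > 0$); no new symbols are introduced.

$$
\frac{\eta_t^2}{2} (\lambda_a + \lambda_b) \;\le\; \eta_t^2 \max_{k \in [K]} \lambda_k \;=\; \eta_t \left( \eta_t \max_{k \in [K]} \lambda_k \right) \;\le\; \eta_t ,
$$

so that $-\frac{\eta_t^2}{2} (\lambda_a + \lambda_b)\, \mathbb{E}[\|\bar g_B\|^2] \ge -\eta_t\, \mathbb{E}[\|\bar g_B\|^2]$, which is what allows the common factor $\eta_t$ to be pulled out of the lower bound.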

A.2 ALTERNATE DECOMPOSITIONS OF THE CLIPPING ERROR

In Sec. A.2 we proposed a decomposition of the clipped batch gradient into parts representing magnitude and direction error, $\bar g_B = M_B \frac{\|\bar g_B\|}{\|g_B\|} g_B$. We presented a simple experiment in Table 1 to demonstrate that direction error causes the most severe problems for the final performance of models, and analysed the contributions of the two effects to the excessive risk in Prop. 2. However, the decomposition we used is not unique, and furthermore it is not possible to completely isolate the two effects in the excessive risk analysis.

For example, if we think of magnitude error as the difference in loss between using update vector $g_B$ and $\frac{\|\bar g_B\|}{\|g_B\|} g_B$ ($\gamma$ in Fig. 6), then it follows that the remaining error is due to gradient misalignment, in other words, the difference in loss between using update vector $\frac{\|\bar g_B\|}{\|g_B\|} g_B$ and $\bar g_B$ ($\lambda$ in Fig. 6). In this example, the error due to gradient misalignment includes both error in direction and error in magnitude, while magnitude error is "pure",

$$
\begin{aligned}
R^{\text{mag}}_a &= \mathbb{E}\left[ L\left(\theta_t - \eta_t \frac{\|\bar g_B\|}{\|g_B\|} g_B; D_a\right) - L(\theta_t - \eta_t g_B; D_a) \right], \\
R^{\text{dir}}_a &= \mathbb{E}\left[ L(\theta_t - \eta_t \bar g_B; D_a) - L\left(\theta_t - \eta_t \frac{\|\bar g_B\|}{\|g_B\|} g_B; D_a\right) \right].
\end{aligned}
$$

A different way of decomposing the clipping error is considering the direction error as the difference in loss between using update vector $g_B$ and $M_B g_B$ ($\alpha$ in Fig. 6). In this case, direction error is pure, i.e. does not include difference in magnitudes. It follows that the remaining error is magnitude error, that is, the difference in loss between using update vector $M_B g_B$ and $\bar g_B$ ($\beta$ in Fig. 6). Thus, the magnitude error in this case quantifies the difference in loss of scaling the already misaligned $\bar g_B$,

$$
\begin{aligned}
R^{\text{dir}*}_a &= \mathbb{E}\left[ L(\theta_t - \eta_t M_B g_B; D_a) - L(\theta_t - \eta_t g_B; D_a) \right], \\
R^{\text{mag}*}_a &= \mathbb{E}\left[ L(\theta_t - \eta_t \bar g_B; D_a) - L(\theta_t - \eta_t M_B g_B; D_a) \right].
\end{aligned}
$$
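To make the non-uniqueness of the decomposition concrete, the sketch below evaluates both splittings for a single batch on a toy quadratic loss. It is an illustration only: the diagonal-Hessian loss, the numerical values, and the helper names (`loss`, `step`, `decompositions`) are assumptions introduced here, not part of our implementation. It uses the fact that $M_B g_B = \frac{\|g_B\|}{\|\bar g_B\|} \bar g_B$, so $M_B$ never needs to be formed.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def loss(theta, h):
    # Hypothetical group loss: 0.5 * sum_i h_i * theta_i^2 (diagonal Hessian h).
    return 0.5 * sum(hi * ti * ti for hi, ti in zip(h, theta))

def step(theta, eta, u):
    # One gradient-descent style update with update vector u.
    return [t - eta * x for t, x in zip(theta, u)]

def decompositions(theta, eta, g, g_bar, h):
    """Single-batch versions of (R_mag, R_dir) and (R_dir*, R_mag*)."""
    n_g, n_bar = norm(g), norm(g_bar)
    rescaled = [n_bar / n_g * x for x in g]      # direction of g_B, norm of g_bar_B
    rotated = [n_g / n_bar * x for x in g_bar]   # M_B g_B: direction of g_bar_B, norm of g_B

    def after(u):
        return loss(step(theta, eta, u), h)

    first = {"mag": after(rescaled) - after(g),      # gamma in Fig. 6
             "dir": after(g_bar) - after(rescaled)}  # lambda in Fig. 6
    second = {"dir": after(rotated) - after(g),      # alpha in Fig. 6
              "mag": after(g_bar) - after(rotated)}  # beta in Fig. 6
    return first, second

# Hypothetical values for illustration.
theta, eta, h = [1.0, -2.0], 0.1, [2.0, 0.5]
g, g_bar = [3.0, 4.0], [1.0, 0.0]
first, second = decompositions(theta, eta, g, g_bar, h)
total = loss(step(theta, eta, g_bar), h) - loss(step(theta, eta, g), h)
```

Both splittings sum to the same total clipping error, while the individual magnitude and direction parts generally differ between them.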
In our analysis we used the first decomposition where magnitude error can be completely corrected by an adjustment of the learning rate, and direction error, what we hypothesized to be the largest cause of disparate impact, is the remaining part of the clipping error. For completeness, by using the second decomposition we can derive alternative versions of Props. 2 and 3:

Proposition 2*. Consider the ERM problem with twice-differentiable loss $\ell$ with respect to the model parameters. The excessive risk due to clipping experienced by group $a \in [K]$ at iteration $t$ is approximated up to second order in $\|\theta_{t+1} - \theta_t\|$ as

$$
\begin{aligned}
R^{\text{clip}}_a \approx{} & \eta_t \left\langle g_{D_a}, \mathbb{E}\left[\left(\frac{\|g_B\|}{\|\bar g_B\|} - 1\right) \bar g_B\right] \right\rangle + \frac{\eta_t^2}{2}\, \mathbb{E}\left[\left(1 - \frac{\|g_B\|^2}{\|\bar g_B\|^2}\right) \bar g_B^T H^a_\ell\, \bar g_B\right] && (R^{\text{mag}*}_a) \\
& + \mathbb{E}\left[\eta_t \langle g_{D_a}, g_D - M_B g_B \rangle\right] + \frac{\eta_t^2}{2}\, \mathbb{E}\left[(M_B g_B)^T H^a_\ell (M_B g_B) - g_B^T H^a_\ell\, g_B\right], && (R^{\text{dir}*}_a)
\end{aligned}
$$

where $g_{D_a}$, $\bar g_{D_a}$ denote the average non-clipped and clipped gradients over group $a$ at iteration $t$, $H^a_\ell$ refers to the Hessian over group $a$, and $M_B$ is an orthogonal matrix such that $\bar g_B$ and $M_B g_B$ are colinear. The expectations are taken over batches of data.

Proposition 3*. Assume the loss $\ell$ is twice continuously differentiable and convex with respect to the model parameters. As well, assume that $\eta_t \le (\max_{k \in [K]} \lambda_k)^{-1}$ where $\lambda_k$ is the maximum eigenvalue of the Hessian $H^k_\ell$. For groups $a, b \in [K]$, $R^{\text{dir}*}_a > R^{\text{dir}*}_b$ if

$$
\mathbb{E}\left[\|g_B\| (\cos\theta^a_B - \cos\bar\theta^a_B)\right] > \frac{\|g_{D_b}\|}{\|g_{D_a}\|}\, \mathbb{E}\left[\|g_B\| (\cos\theta^b_B - \cos\bar\theta^b_B)\right] + \frac{\mathbb{E}[\|g_B\|^2]}{\|g_{D_a}\|},
$$

where $\theta^k_B = \angle(g_{D_k}, g_B)$ and $\bar\theta^k_B = \angle(g_{D_k}, \bar g_B)$ for a group $k \in [K]$.

We omit the proofs since they are directly analogous to those in App. A.1.

Adult

The original Adult dataset consists of 48,842 samples, reduced to 45,222 by removing all samples with missing values. The "final weight" feature is removed and the "race" attribute is discretized by {white, non-white}, giving 5 numerical, 3 binary and 6 categorical features. The numerical features are normalized and the categorical features are one-hot encoded. As is typical in the fairness literature, choices for the protected attribute are "sex", "race" (binary) and possibly the discretized "age". We use "sex" by default. The classification label is "income" (whether or not income exceeds $50,000). Prior to sampling, the Adult dataset is unbalanced with respect to sex with 30,527 males and 14,695 females. We sample a balanced dataset as in Xu et al. (2021) with 14,000 females and 14,000 males on average.

Dutch

The Dutch dataset (van der Laan, 2000) is preprocessed by dropping underage samples (14 and under) and removing the "weight" feature. As well, all "unemployed" samples are removed, as well as those with missing or middle-level "occupation", for a total of 60,420 samples. Specifically, "occupation" values 3, 6, 7, 8 are considered middle-level. "Occupation" is then made binary by considering values 4, 5, 9 as low-level professions (0) and 1, 2 as high-level professions (1). The binary classification task is to predict "occupation", given the rest of the features. We consider "sex" as the protected group attribute. The processed dataset is balanced with respect to "sex" with 30,147 male and 30,273 female samples. We use an 80/20 train/test split for both tabular datasets.

CelebA

The CelebA dataset (Liu et al., 2015) consists of 64x64 pixel RGB images of celebrity faces, along with binary attributes describing each image. Many of these attributes are subjective, but we chose to use the most objective ones for training and group labels. We used the binary attribute "Male" for the classification target, which is roughly balanced at 84,434 males in 202,599 total images. The attribute "Eyeglasses" was our protected group label; although wearing eyeglasses in public typically does not constitute sensitive information, we used this attribute because it was objectively defined, and formed a minority group which was empirically more difficult for models to classify accurately. Of the male images, 10,478 have eyeglasses, while only 2,715 female images have them. The training/validation/test split is provided with the dataset and is roughly in an 80/10/10 ratio.

B.2 EXPERIMENT SETTINGS

We set $\sigma = 1$, $C_0 = 0.5$ for Adult, $\sigma = 1$, $C_0 = 0.1$ for Dutch, while for MNIST and CelebA we set $\sigma = 0.8$ and $C_0 = 1$. For DPSGD-F, the gradient noise is unchanged, $\sigma_2 = \sigma$, and $\sigma_1 = 10\sigma_2$. For FairLens, we use regularization weights as in Tran et al. (2021a), $\lambda_1 = \lambda_2 = 1$. For non-global methods, the learning rate is $\eta_t = 0.01$ for all iterations $t$ and all datasets except Dutch, which has $\eta_t = 0.8$. For DPSGD-Global we have $\eta_t = 1$, $Z = 50$ for Adult, $\eta_t = 2$, $Z = 1$ for Dutch, $\eta_t = 0.2$, $Z = 100$ for MNIST, and $\eta_t = 0.1$, $Z = 100$ for CelebA. For DPSGD-Global-Adapt we have $\sigma_2 = 10$, $Z = 50$, $\eta_Z = 0.1$ for all datasets (the only exception is CelebA, with $Z = 100$), $\eta_t = 0.2$, $\tau = 1$ for Adult, $\eta_t = 1$, $\tau = 1$ for Dutch, and $\eta_t = 0.1$, $\tau = 0.7$ for MNIST and CelebA. All methods for all datasets use training and test batches of size 256. Experiments were conducted on single TITAN V GPU machines. Approximately four GPU-days were used to train all methods over five seeds for the four datasets.

B.3 IMPLEMENTATION DETAILS

The excessive risk terms for different groups ($R^{\text{clip}}_a$ and $R^{\text{noise}}_a$ in Prop. 1 and $R^{\text{mag}}_a$ and $R^{\text{dir}}_a$ in Prop. 2) all involve the Hessian of the loss function with respect to the model parameters. Calculating the Hessian as a matrix is computationally expensive, but more crucially requires memory that scales quadratically in the number of parameters. In the previous work studying $R^{\text{clip}}_a$ and $R^{\text{noise}}_a$, Tran et al. (2021a) use the PyHessian library to compute the Hessian as a matrix, and then used it to compute the products and traces needed for $R^{\text{clip}}_a$ and $R^{\text{noise}}_a$. Because this approach incurs a high memory burden, the models trained were limited to small MLPs with a single hidden layer of 20 hidden units.

In our implementation, provided as supplemental material, we avoid computing the Hessian as a matrix altogether, which allows us to scale our experiments to common image datasets. For the four datasets, our models have parameter counts of $N = 91650$ for Adult, $N = 120$ for Dutch, $N = 80522$ for MNIST, and $N = 120722$ for CelebA, which would produce Hessian matrices with up to 14.5 billion entries. Instead, we compute the terms involving Hessians like $H^a_\ell\, g_B$ through Hessian-vector products (HVPs) using the functorch library with PyTorch 1.11. Using HVPs requires memory comparable to that used when computing gradients for SGD.

For the trace of the Hessian matrix, also called the Laplacian, one possible approach that does not require realizing the entire matrix in memory is to compute HVPs with unit vectors to isolate each diagonal element: $\mathrm{Tr}(H^a_\ell) = \sum_{i=1}^{N} I_i^T H^a_\ell\, I_i$, where $I_i$ is the $i$th column of the identity matrix. While exact, this approach requires $N$ HVPs for each group $a \in [K]$, of which there are at least two. Since this method is much too expensive for even the simple MLPs and CNNs we used, we instead employed Hutchinson's trace estimator (Hutchinson, 1990) to estimate $\mathrm{Tr}(H^a_\ell) = \mathbb{E}_z[z^T H^a_\ell\, z]$. This estimator is unbiased when $z$ is drawn from a Rademacher distribution, which we used, and only requires $n$ HVPs per group, where $n$ can be chosen as large as required for convergence of the estimate. In practice we used $n = 100$.

Additionally, whereas Tran et al. (2021a) replace dataset gradients $g_D$ and $g_{D_a}$ with batch gradients when computing $R^{\text{clip}}_a$ and $R^{\text{noise}}_a$ in Prop. 1, we use the exact $g_D$ and $g_{D_a}$. This eliminates an easily preventable source of noise in our results.
To further reduce computation time, we only evaluate excessive risk terms (Hessians) every 50, 100, 200, or 200 iterations for the Adult, Dutch, MNIST, and CelebA datasets respectively.
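The trace estimation described above can be sketched in a few lines. This is a minimal, framework-free illustration rather than our actual implementation: the Hessian-vector product is passed in as a callable (in practice it would come from automatic differentiation), and the small matrix `A`, the function names, and the seed are hypothetical stand-ins.

```python
import random

def hutchinson_trace(hvp, dim, n_samples=100, seed=0):
    """Estimate Tr(H) as the average of z^T (H z) over Rademacher vectors z.

    hvp: callable mapping a vector v (list of floats) to H v without exposing H.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        total += sum(zi * hi for zi, hi in zip(z, hz))
    return total / n_samples

def exact_trace(hvp, dim):
    """Exact trace from one HVP per unit vector (dim HVPs in total)."""
    total = 0.0
    for i in range(dim):
        e = [1.0 if j == i else 0.0 for j in range(dim)]
        total += hvp(e)[i]
    return total

# Hypothetical symmetric matrix standing in for a group Hessian.
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 3.0]]

def toy_hvp(v):
    # Matrix-vector product; a real HVP would not materialize A.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

estimate = hutchinson_trace(toy_hvp, dim=3, n_samples=2000)
reference = exact_trace(toy_hvp, dim=3)
```

The estimator needs `n_samples` HVPs regardless of the parameter count, whereas the exact version needs one HVP per parameter, which is the trade-off motivating its use here.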

B.4 DIRECTION ERROR IS MORE SEVERE THAN MAGNITUDE ERROR

As noted earlier, Prop. 2 only evaluates excessive risk for a single iteration, not necessarily capturing how each of $R^{\text{dir}}_a$ and $R^{\text{mag}}_a$ contribute to convergence and disparate impact over the course of training. In order to evaluate the full impact of magnitude error and error due to gradient misalignment, we consider the difference in final loss and accuracy between models which have zero magnitude error and zero direction error in Table 1. In these experiments, we consider zero magnitude error to be when $\|\bar g_B\| = \|g_B\|$ for all batches, and zero direction error to be when $g_B$ and $\bar g_B$ are aligned for all batches. Note that these definitions correspond to comparing update vectors $g_B$ and $\frac{\|g_B\|}{\|\bar g_B\|} \bar g_B$ for the zero magnitude error experiment, and comparing update vectors $g_B$ and $\frac{\|\bar g_B\|}{\|g_B\|} g_B$ for the zero direction error experiment. These do not correspond to the definitions of $R^{\text{dir}}_a$ and $R^{\text{mag}}_a$ in Prop. 2, but capture the intuitive definitions of direction and magnitude error. As described in App. A.2, while $R^{\text{clip}}_a = R^{\text{mag}}_a + R^{\text{dir}}_a$, direction error and magnitude error cannot be purely separated with any definition of $R^{\text{mag}}_a$, $R^{\text{dir}}_a$.
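The two counterfactual update vectors used in this experiment can be built directly from the per-sample gradients of a batch. The sketch below is illustrative only: the function names and the toy gradients are assumptions, noise addition is omitted, and both batch gradients are assumed to be nonzero.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def clip(g, c0):
    # Standard per-sample clipping: rescale only if the norm exceeds c0.
    n = norm(g)
    factor = 1.0 if n == 0.0 else min(1.0, c0 / n)
    return [factor * x for x in g]

def counterfactual_updates(per_sample_grads, c0):
    """Return g_B, clipped g_bar_B, and the two counterfactual update vectors."""
    g_b = mean_vec(per_sample_grads)
    g_bar = mean_vec([clip(g, c0) for g in per_sample_grads])
    n_g, n_bar = norm(g_b), norm(g_bar)  # assumed nonzero in this sketch
    # Zero magnitude error: direction of g_bar_B, norm of g_B.
    zero_mag = [n_g / n_bar * x for x in g_bar]
    # Zero direction error: direction of g_B, norm of g_bar_B.
    zero_dir = [n_bar / n_g * x for x in g_b]
    return g_b, g_bar, zero_mag, zero_dir

# Hypothetical batch: one small gradient and one large gradient that gets clipped.
grads = [[0.3, 0.0], [0.0, 4.0]]
g_b, g_bar, zero_mag, zero_dir = counterfactual_updates(grads, c0=0.5)
```

In this toy batch only the large gradient is clipped, so the clipped average `g_bar` is rotated toward the small gradient, which is the misalignment the zero direction error vector removes.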

B.5 BASELINE METHODS

We compared our approach DPSGD-Global-Adapt with its predecessor DPSGD-Global, which was designed to improve convergence, not fairness, as well as two approaches specifically designed to improve fairness. DPSGD-Global (Bu et al., 2021) is presented in Alg. 2, and involves scaling almost all per-sample gradients by a global factor rather than only scaling large gradients with $\|g_i\| > C_0$ by a norm-dependent factor. We say "almost all", because scaling alone does not provide a strict upper bound on the sensitivity, as required for an application of the Gaussian mechanism, see Fig. 2b. The method additionally clips gradients to zero if their norm is above a strict upper bound $Z$, which we found to be unnecessarily aggressive. Otherwise, the global scaling factor is $C_0/Z$, which ensures that the sensitivity, namely $C_0$, is finite. The advantage of DPSGD-Global is that it can better preserve the direction of $\bar g_B$, especially when no gradients are clipped to zero. Hence, Bu et al. (2021) advocate for setting $Z$ larger than $\|g_i\|$ for any sample in the batch. The drawback of a large $Z$ is that all gradients are scaled down by a larger factor, so the convergence will be slowed unless the learning rate is increased to compensate. Setting $Z$ is itself a challenge because we cannot inspect the batch to determine $\max_i \|g_i\|$ without accounting for that expense in our privacy budget. In Sec. 5 we described how DPSGD-Global-Adapt resolves these concerns, first by clipping less aggressively, to $C_0$ instead of 0, while maintaining the same sensitivity, and second by adaptively setting $Z$ each round according to a private estimate of how many gradients in a batch exceeded $\tau \cdot Z$ (using the tolerance threshold $\tau$). Xu et al. (2021) designed DPSGD-F as a method for removing disparate impact caused by DPSGD by adaptively setting the clipping threshold for different protected groups.
The method was based on the observation that negatively impacted groups tended to have large gradient norms which were affected more by clipping. Hence, the clipping threshold is raised for groups with larger gradient norms, based on a private estimate of how many gradients per group have $\|g_i\| > C_0$. Given large enough batch sizes, the private estimate can be done with much more noise as compared to the gradient update, so it does not meaningfully increase the privacy budget. One drawback of this approach is that it requires group label information for every datapoint in the training set. In practice, especially in highly regulated industries, such information may not be permissible to use or even collect. Collecting additional private information from data subjects on protected attributes can itself be a negative process and creates unnecessary privacy risks. One major advantage of DPSGD-Global-Adapt is that it reduces unfairness without ever using group label information. While each group is clipped using its own threshold, noise is added to the batched gradient based on the sensitivity, determined by the largest group threshold. While all groups receive the same theoretical privacy guarantee in terms of $(\epsilon, \delta)$, groups that are clipped to smaller thresholds may enjoy stronger empirical privacy guarantees, as determined for example by adversarial attacks (Jagielski et al., 2020; Nasr et al., 2021). Hence, it appears likely that DPSGD-F can produce unfairness in the amount of privacy afforded to different groups. DPSGD-F is shown in Alg. 3. Note that we present the algorithm as implemented in the authors' codebase, not as written in their paper. In our experiments we use the version shown in Alg. 3. Our final baseline, referred to as "FairLens", was developed by Tran et al. (2021a) to reduce excessive risk from clipping, $R^{\text{clip}}_a$, and adding noise, $R^{\text{noise}}_a$.
Regularization terms are added to the loss function in DPSGD that specifically target these sources of excessive risk. The source of $R^{\text{noise}}$ was identified to involve the per-group Laplacian of the loss $\ell$ with respect to model parameters, a second order derivative whose computation scales poorly with model size. To avoid this difficulty, the authors used a stand-in for the Laplacian based on the distance of a point to the decision boundary. Our implementation is directly based off of code made available by the authors on OpenReview at openreview.net/forum?id=7EFdodSWee4. The version implemented in their code is shown in Alg. 4, and assumes there are only two mutually exclusive protected groups, denoted $a$ and $b$. Hence, it is not applicable to the MNIST dataset. We also attempted to use this code for our CelebA experiments but found that the implementation did not scale to the simple CNNs we used. Therefore, we omitted FairLens from the CelebA experiments.

Algorithm 3 DPSGD-F

Require: Iterations $T$, Dataset $D$, sampling rate $q$, clipping bound $C_0$, noise multipliers $\sigma_1$, $\sigma_2$, learning rates $\eta_t$
Initialize $\theta_0$ randomly
for $t$ in $0, \ldots, T-1$ do
  $B \leftarrow$ Poisson sample of $D$ with rate $q$
  for $(x_i, a_i, y_i)$ in $B$ do
    $g_i \leftarrow \nabla_\theta \ell(f_{\theta_t}(x_i), y_i)$  ▷ Compute per-sample gradients
  for $k$ in $[K]$ do
    $m_k \leftarrow |\{i : \|g^k_i\| > C_0\}|$  ▷ Count samples per-group above/below clipping bound
    $o_k \leftarrow |\{i : \|g^k_i\| \le C_0\}|$
  $\{\tilde m_k, \tilde o_k\}_{k \in [K]} \leftarrow \{m_k, o_k\}_{k \in [K]} + \mathcal{N}(0, \sigma_1^2 I)$  ▷ Privatize unit sensitivity count vectors
  $\{\tilde m_k, \tilde o_k\}_{k \in [K]} \leftarrow \{\max(\lfloor \tilde m_k \rfloor, 0), \max(\lfloor \tilde o_k \rfloor, 0)\}_{k \in [K]}$  ▷ Postprocessing
  $\tilde m = \sum_{k \in [K]} \tilde m_k$
  for $k$ in $[K]$ do
    $\tilde b_k = \tilde m_k + \tilde o_k$
    $C_k = C_0 \cdot \left(1 + \frac{\tilde m_k / \tilde b_k}{\tilde m / |B|}\right)$
  for $(x_i, a_i, y_i)$ in $B$ do
    $\bar g_i \leftarrow g_i \cdot \min\left(1, \frac{C_k}{\|g_i\|}\right)$ where $k = a_i$  ▷ Clip according to per-group clipping bounds
  $\tilde g_B \leftarrow \frac{1}{|B|} \sum_{i \in B} \bar g_i + \mathcal{N}(0, \sigma_2^2 C_0^2 I)$
  $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde g_B$

In Sec. 6.3 and Fig. 5 we compared the usefulness of our lower bound on $R^{\text{dir}}_a - R^{\text{dir}}_b$ from Prop. 3, to a previous lower bound in the literature.
For our lower bound to be valid, the assumptions of Prop. 3 should be satisfied. The first assumption, that the loss is twice continuously differentiable with respect to the model parameters, holds since the model architecture is an MLP with tanh activations. However, the loss is not in general convex. The third assumption, that the inverse of the learning rate upper bounds the largest eigenvalue of any group's Hessian, is checked empirically for each iteration in Fig. 7 .
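As a companion to the DPSGD-F baseline listed in Alg. 3, the per-group clipping thresholds can be computed from gradient norms and group labels as in the following sketch. It is a simplified illustration, not the code used in our experiments: the function name and arguments are hypothetical, and the fallback to $C_0$ when a noisy count is zero is an assumption added here to avoid division by zero.

```python
import math
import random

def dpsgd_f_thresholds(grad_norms, groups, num_groups, c0, sigma1, rng):
    """Per-group clipping bounds C_k following the counting steps of Alg. 3."""
    m = [0] * num_groups  # samples per group with norm above c0
    o = [0] * num_groups  # samples per group with norm at most c0
    for n, k in zip(grad_norms, groups):
        if n > c0:
            m[k] += 1
        else:
            o[k] += 1
    # Privatize the counts with Gaussian noise, then floor and truncate at zero.
    m_t = [max(math.floor(x + rng.gauss(0.0, sigma1)), 0) for x in m]
    o_t = [max(math.floor(x + rng.gauss(0.0, sigma1)), 0) for x in o]
    m_total = sum(m_t)
    batch_size = len(grad_norms)
    thresholds = []
    for k in range(num_groups):
        b_k = m_t[k] + o_t[k]
        if b_k == 0 or m_total == 0:
            thresholds.append(c0)  # assumption: keep the base bound if undefined
        else:
            thresholds.append(c0 * (1.0 + (m_t[k] / b_k) / (m_total / batch_size)))
    return thresholds

# Hypothetical batch: group 1 has larger gradient norms than group 0.
norms = [0.1, 0.2, 2.0, 3.0]
labels = [0, 0, 1, 1]
# sigma1 = 0 disables the noise so the arithmetic can be followed by hand.
bounds = dpsgd_f_thresholds(norms, labels, 2, c0=0.5, sigma1=0.0, rng=random.Random(0))
```

With noise disabled, group 0 keeps the base bound while group 1, whose gradients all exceed $C_0$, receives a threshold three times larger, reflecting the idea that groups with larger gradient norms are clipped less.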

Algorithm 4 FairLens

Require: Iterations $T$, Dataset $D$, sampling rate $q$, clipping bound $C_0$, noise multiplier $\sigma$, learning rates $\eta_t$, regularization weights $\gamma_1$, $\gamma_2$
Initialize $\theta_0$ randomly
for $t$ in $0, \ldots, T-1$ do
  $B \leftarrow$ Poisson sample of $D$ with rate $q$
  for $(x_i, a_i, y_i)$ in $B$ do
    $g_i \leftarrow \nabla_\theta \ell(f_{\theta_t}(x_i), y_i)$  ▷ Compute per-sample gradients of original loss
    $\bar g_i \leftarrow g_i \cdot \min\left(1, \frac{C_0}{\|g_i\|}\right)$
  $g_B \leftarrow \frac{1}{|B|} \sum_{i \in B} g_i$
  $\bar g_B \leftarrow \frac{1}{|B|} \sum_{i \in B} \bar g_i$
  for $k$ in $\{a, b\}$ do
    $g_{B_k} \leftarrow \frac{1}{|B_k|} \sum_{i \in B, a_i = k} g_i$
    $f_k \leftarrow \frac{1}{|B_k|} \sum_{i \in B, a_i = k} f_{\theta_t}(x_i)$
  $R_1 = |\langle g_{B_a} - g_{B_b}, \bar g_B - g_B \rangle|$
  $R_2 = \frac{1}{2} \left( f_a \cdot (1 - f_a) + f_b \cdot (1 - f_b) \right)$
  $L = \ell(f_{\theta_t}(x_i), y_i) + \gamma_1 R_1 + \gamma_2 R_2$  ▷ Define regularized loss
  for $(x_i, a_i, y_i)$ in $B$ do
    $g'_i \leftarrow \nabla_\theta L(f_{\theta_t}(x_i), y_i)$  ▷ Compute per-sample gradients of regularized loss
    $\bar g'_i \leftarrow g'_i \cdot \min\left(1, \frac{C_0}{\|g'_i\|}\right)$  ▷ Clip to ensure finite sensitivity
  $\tilde g'_B \leftarrow \frac{1}{|B|} \sum_{i \in B} \bar g'_i + \mathcal{N}(0, \sigma^2 C_0^2 I)$
  $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde g'_B$

B.7 ADDITIONAL RESULTS

In this section we complete the set of experimental results shown in Sec. 6 over all datasets and methods. All results are averaged over five random seeds with one standard error shown.

78.5±0.5 89.9±0.1 2.0±0.2 2.2±0.1 0.2±0.2 0.43±0.00 0.25±0.00 0.04±0.00 0.05±0.00 0.02±0.00
DPSGD-G.-A. 80.7±0.4 92.3±0.1 -0.1±0.1 -0.1±0.1 0.0±0.1 0.39±0.00 0.18±0.00 0.00±0.00 0.00±0.00 0.00±0.00

79.0±0.2 86.5±0.1 0.8±0.1 0.4±0.0 0.4±0.2 0.510±0.001 0.460±0.001 0.012±0.001 0.013±0.001 0.002±0.001
DPSGD-G.-A. 79.4±0.1 86.7±0.1 0.4±0.2 0.2±0.0 0.2±0.2 0.504±0.001 0.452±0.001 0.006±0.001 0.005±0.001 0.001±0.001

First we look at the final performance and fairness metrics on the test set for Adult in Table 4 and Dutch in Table 5 (cf. MNIST in Table 2 and CelebA in Table 3). We see that FairLens is inconsistent in reducing the privacy cost gap and excessive risk gap compared to DPSGD. DPSGD-F improves both fairness metrics while achieving better performance.
DPSGD-Global improves over or is comparable to DPSGD-F in all metrics, and does so without requiring access to protected group membership information. Our method DPSGD-Global-Adapt further improves both performance and fairness by clipping less aggressively and adaptively setting the upper clipping threshold Z. 2, 3 , 4, 5. Gaussian noise adds zero bias and the errors it introduces tend to cancel out over the course of training. These observations further validate that direction error is the core cause of disparate impact, and minimizing gradient misalignment should be prioritized over other sources of unfairness.



Examples of laws governing data privacy include the General Data Protection Regulation in Europe, Health Insurance Portability and Accountability Act in the USA, and Personal Information Protection and Electronic Documents Act in Canada.

In the USA, fair lending laws including the Fair Housing Act, and Equal Credit Opportunity Act prohibit discrimination based on protected characteristics such as race, age, and sex.

The FairLens method (Tran et al., 2021a) is not compared for MNIST and CelebA because the author-provided code only handles binary classification problems, and does not scale to image datasets.

Figure 3: Dutch dataset. Top: Train loss per epoch. Bottom: $\|g_B\|$ averaged over batches per epoch.

The Adult dataset is available at archive.ics.uci.edu/ml/datasets/Adult.

The Dutch dataset is also available through the work of Le Quy et al. (2022) at raw.githubusercontent.com/tailequy/fairness dataset/main/Dutch census/dutch census 2001.arff.

We accessed the CelebA dataset via kaggle.com/datasets/jessicali9530/celeba-dataset.

See implementation available at openreview.net/forum?id=7EFdodSWee4.

See documentation at pytorch.org/functorch/stable/.



When the model is clear from context we denote $R(\theta; D_k)$ as $R_k$, and similarly for privacy cost $\pi_k$. For both accuracy and loss we consider the gap between disparate impact values across groups. The privacy cost gap is $\pi_{a,b} = |\pi_a - \pi_b|$ for groups $a, b \in [K]$, and the excessive risk gap refers to $R_{a,b} = |R_a - R_b|$.

Figure 1: Direction errors from clipping are more severe than magnitude errors over the course of training and can lead to suboptimal convergence.

(a) Left: Per-sample gradients colored based on group membership. Top: Local clipping in DPSGD Bottom: Global scaling in DPSGD-Global. (b) In DPSGD-Global-Adapt scaling alone does not guarantee finite sensitivity, so gradients with norm above Z are clipped to C0 (DPSGD-Global clips large gradients to 0 rather than C0).

Figure 2: Illustration of privatization steps in DPSGD, DPSGD-Global, and DPSGD-Global-Adapt
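The per-sample rules illustrated in Fig. 2 can be written out as three small functions. This is a schematic sketch of the scaling step only (no noise addition and no adaptive update of $Z$); the function names are hypothetical, and treating a norm exactly equal to $Z$ as "not above $Z$" is an assumption made here.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dpsgd_clip(g, c0):
    # Local clipping: only gradients with norm above c0 are rescaled, to norm c0.
    n = norm(g)
    factor = 1.0 if n == 0.0 else min(1.0, c0 / n)
    return [factor * x for x in g]

def global_scale(g, c0, z):
    # DPSGD-Global: scale by c0 / z, but send gradients with norm above z to zero.
    factor = c0 / z if norm(g) <= z else 0.0
    return [factor * x for x in g]

def global_adapt_scale(g, c0, z):
    # DPSGD-Global-Adapt: scale by c0 / z, but clip gradients above z to norm c0.
    n = norm(g)
    factor = c0 / z if n <= z else c0 / n
    return [factor * x for x in g]

g = [3.0, 4.0]  # hypothetical per-sample gradient with norm 5
local = dpsgd_clip(g, c0=1.0)
scaled = global_scale(g, c0=1.0, z=10.0)
dropped = global_scale(g, c0=1.0, z=2.0)
kept = global_adapt_scale(g, c0=1.0, z=2.0)
```

In all three cases the output norm is at most $C_0$, so the sensitivity is the same; the rules differ in whether small and large gradients keep their relative sizes, and in what happens to gradients above $Z$.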

Figure 4: Adult dataset. Top: $R^{\text{dir}}_a$, excessive risk due to gradient misalignment per group. Bottom: $R^{\text{mag}}_a$, excessive risk due to magnitude error per group. See Prop. 2 for definitions.

Fig. 4 shows the excessive risk terms due to gradient misalignment $R^{\text{dir}}_a$, and magnitude error $R^{\text{mag}}_a$ for Adult at each iteration (see Figs. 11, 12, and 13 in App. B.7 for Dutch, MNIST, and CelebA). We see that global clipping almost completely removes direction errors as intended, but as a tradeoff increases magnitude error. However, we have argued that direction error is the more severe cause of disparate impact over the course of training, which is borne out by the results in Tables 1, 2 and 3, as well as 4 and 5 in App. B.7. Direction errors introduce bias which accumulates, whereas magnitude errors do not alter the convergence path, and noise errors add zero bias and tend to cancel out.

6.3 TIGHTNESS OF LOWER BOUNDS

Figure 6: Decomposition of steps between g B and ḡB .

in Prop. 1 and R mag a and R dir a in Prop. 2) all involve the Hessian of the loss function with respect to the model parameters. Calculating the Hessian as a matrix is computationally expensive, but more crucially requires memory that scales quadratically in the number of parameters. In the previous work studying R clip a and R noise a , Tran et al. (2021a) use the PyHessian library to compute the Hessian as a matrix, and then used it to compute the products and traces needed for R clip a and R noise a

Figure 7: The maximum eigenvalue of any group's Hessian remains below η t -1 .

Figure 8: Adult dataset. Top: Train loss per epoch. Bottom: ∥g B ∥ averaged over batches per epoch.

Figure 9: MNIST dataset. Top: Train loss per epoch. Bottom: ∥g B ∥ averaged over batches per epoch.

Figure 10: CelebA dataset. Top: Train loss per epoch. Bottom: $\|g_B\|$ averaged over batches per epoch.

To go along with the training curves shown for Dutch in Fig. 3, we present the same for Adult in Fig. 8, MNIST in Fig. 9, and CelebA in Fig. 10. The trends are consistent across datasets: whereas DPSGD produces large values and a large gap for the gradient norms and losses between protected groups, our method DPSGD-Global-Adapt reduces the values and gap at all stages of training.

Figure 11: Dutch dataset. Top: Excessive risk due to gradient misalignment per group. Bottom: Excessive risk due to magnitude error per group.

Figure 12: MNIST dataset. Top: Excessive risk due to gradient misalignment per group. Bottom: Excessive risk due to magnitude error per group.

Figure 13: CelebA dataset. Top: Excessive risk due to gradient misalignment per group. Bottom: Excessive risk due to magnitude error per group.

We also present the values of terms $R^{\text{dir}}_a$ and $R^{\text{mag}}_a$ over training for Dutch in Fig. 11, for MNIST in Fig. 12, and CelebA in Fig. 13, as was done for Adult in Fig. 4. Both Global methods dramatically reduce $R^{\text{dir}}_a$ compared to DPSGD at the cost of larger $R^{\text{mag}}_a$. Comparing to the final training results where global methods also show the best performance, this provides further evidence for our hypothesis that gradient misalignment is the most significant cause of disparate impact in DPSGD.

Figure 14: Excessive risk due to noise error per group for the Adult dataset

Figure 15: Excessive risk due to noise error per group for the Dutch dataset

Figure 16: Excessive risk due to noise error per group for the MNIST dataset

Figure 17: Excessive risk due to noise error per group for the CelebA dataset

Effect of direction vs. magnitude error on MNIST with class 8 undersampled. The results compare accuracy and loss on the typical class 2 to the underrepresented class 8.

Taylor expansion of the expected loss using $\bar g_B$ in the gradient descent update compared to $g_B$. The excessive risk from magnitude error comes from comparing $g_B$ to $\frac{\|\bar g_B\|}{\|g_B\|} g_B$, while that of gradient misalignment is isolated by comparing $\bar g_B = M_B \frac{\|\bar g_B\|}{\|g_B\|} g_B$ to $\frac{\|\bar g_B\|}{\|g_B\|} g_B$ (see Fig. 1).

Performance and Fairness metrics for MNIST

Performance and Fairness metrics for CelebA

Pascal Van Hentenryck. Differentially Private and Fair Deep Learning: A Lagrangian Dual Approach. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9932-9939, May 2021b. Paul van der Laan. The 2001 Census in the Netherlands Integration of Registers and Surveys. In Insee-Eurostat Seminar on Censuses after 2001, 2000.

B.1 DATASET PREPROCESSING

MNIST

We use the artificially unbalanced MNIST training dataset where class 8 is sampled with probability 9% such that class 8 only constitutes about 1% of the dataset on average. This gives about 6000 data samples for each class, other than class 8 with about 500. The protected group values are the class labels. As in Xu et al. (2021), we compare models on how they treat the under-represented class 8 versus the well-represented class 2. The test set remains balanced, with approximately 1000 samples for each class. Data is scaled to be in the domain [0, 1].

Performance and Fairness metrics for Adult dataset

Performance and Fairness metrics for Dutch dataset

