INDIVIDUAL PRIVACY ACCOUNTING FOR DIFFERENTIALLY PRIVATE STOCHASTIC GRADIENT DESCENT

Abstract

Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose an efficient algorithm to compute privacy guarantees for individual examples when releasing models trained by DP-SGD. We use our algorithm to investigate individual privacy parameters across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility are simultaneously underserved in terms of privacy guarantee. For example, on CIFAR-10, the average ε of the class with the lowest test accuracy is 43.6% higher than that of the class with the highest accuracy.

1. INTRODUCTION

Differential privacy is a strong notion of data privacy, enabling rich forms of privacy-preserving data analysis (Dwork & Roth, 2014). Informally speaking, it quantitatively bounds the maximum influence of any datapoint using a privacy parameter ε, where a smaller value of ε corresponds to a stronger privacy guarantee. Training deep models with differential privacy is an active research area (Papernot et al., 2017; Bu et al., 2020; Yu et al., 2022; Anil et al., 2021; Li et al., 2022; Golatkar et al., 2022; Mehta et al., 2022; De et al., 2022; Bu et al., 2022). Models trained with differential privacy not only provide theoretical privacy guarantees for their data but are also more robust against empirical attacks (Bernau et al., 2019; Carlini et al., 2019; Jagielski et al., 2020; Nasr et al., 2021).

Differentially private stochastic gradient descent (DP-SGD) is the de facto choice for differentially private deep learning (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016). DP-SGD first clips individual gradients and then adds Gaussian noise to the average of the clipped gradients. Standard privacy accounting takes a worst-case approach and provides all examples with the same privacy parameter ε. However, from the perspective of machine learning, different examples can have very different impacts on a learning algorithm (Koh & Liang, 2017; Feldman & Zhang, 2020). For example, consider support vector machines: removing a non-support vector has no effect on the resulting model, and hence that example would have perfect privacy.

In this paper, we give an efficient algorithm to accurately estimate individual privacy parameters of models trained by DP-SGD. Our privacy guarantee adapts to the training trajectory of one execution of DP-SGD to provide a precise characterization of privacy cost (see Section 2.1 for more details). Inspecting individual privacy parameters allows us to better understand example-wise impacts.
It turns out that, for common benchmarks, many examples experience much stronger privacy guarantees than the worst-case bound. To illustrate this, we plot the individual privacy parameters of MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and UTKFace (Zhang et al., 2017) in Figure 1. Experimental details, as well as more results, can be found in Sections 4 and 5. The disparity in individual privacy guarantees naturally arises when running DP-SGD. To the best of our knowledge, our investigation is the first to explicitly reveal such disparity.

Figure 1: Individual privacy parameters of models trained by DP-SGD. The value of δ is $1 \times 10^{-5}$. The dashed lines indicate 10%, 30%, and 50% of datapoints. The black solid line shows the privacy parameter of the original analysis.

We propose two techniques to make individual privacy accounting viable for DP-SGD. First, we maintain estimates of the gradient norms for all examples so the individual privacy costs can be computed accurately at every update. Second, we round the gradient norms with a small precision r to control the number of different privacy costs, which need to be computed numerically. We explain why these two techniques are necessary in Section 2. More details of the proposed algorithm, as well as methods to release individual privacy parameters, are in Section 3.

We further demonstrate a strong correlation between the privacy parameter of an example and its final training loss. We find that examples with higher training loss also have higher privacy parameters in general. This suggests that the same examples suffer a simultaneous unfairness in terms of worse privacy and worse utility. While prior works have shown that underrepresented groups experience worse utility (Buolamwini & Gebru, 2018), and that these disparities are amplified when models are trained privately (Bagdasaryan et al., 2019; Suriyakumar et al., 2021; Hansen et al., 2022; Noe et al., 2022), we are the first to show that the privacy guarantee and utility are negatively impacted concurrently. This contrasts with prior work in the differentially private setting, which took a worst-case perspective for privacy accounting, resulting in a uniform privacy guarantee for all training examples. For instance, when running gender classification on UTKFace, the average ε of the race with the lowest test accuracy is 25% higher than that of the race with the highest accuracy. We also study the disparity in privacy when models are trained without differential privacy, which may be of independent interest to the community. We use the success rates of membership inference attacks to measure privacy in this case and show that groups with worse accuracy suffer from higher privacy risks.

1.1. RELATED WORK

Several works have explored example-wise privacy guarantees in differentially private learning. Jorgensen et al. (2015) propose personalized differential privacy, which provides pre-specified individual privacy parameters that are independent of the learning algorithm, e.g., users can choose different levels of privacy guarantees based on their sensitivities to privacy leakage (Mühl & Boenisch, 2022). A recent line of work also uses the variation in example-wise sensitivities that naturally arises in learning to study example-wise privacy. Per-instance differential privacy captures the privacy parameter of a target example with respect to a fixed training set (Wang, 2019; Redberg & Wang, 2021; Golatkar et al., 2022). Feldman & Zrnic (2021) design an individual Rényi differential privacy filter. The filter stops when the accumulated cost reaches a target budget that is defined before training. It allows examples with smaller per-step privacy costs to run for more steps. The final privacy guarantee offered by the filter is still the worst case over all possible outputs, as the predefined budget has to be independent of the algorithm outcomes.
In this work, we propose output-specific differential privacy and give an efficient algorithm to compute individual guarantees of DP-SGD. We further discover that the disparity in individual privacy parameters correlates well with the disparity in utility.

2. PRELIMINARIES

We first give some background on differential privacy. Then we highlight the challenges in computing individual privacy for DP-SGD. Finally, we argue that providing the same privacy bound to all examples is not ideal because of the variation in individual gradient norms.

2.1. BACKGROUND ON DIFFERENTIALLY PRIVATE LEARNING

Differential privacy builds on the notion of neighboring datasets. A dataset $D = \{d_i\}_{i=1}^{n}$ is a neighboring dataset of $D'$ (denoted as $D \sim D'$) if $D'$ can be obtained by adding one example to or removing one example from $D$. The privacy guarantees in this work take the form of (ε, δ)-differential privacy.

Definition 1. [Individual (ε, δ)-DP] For a datapoint $d$, let $D$ be an arbitrary dataset and $D' = D \cup \{d\}$ be its neighboring dataset. An algorithm $\mathcal{A}$ satisfies $(\varepsilon(d), \delta)$-individual DP if for any subset of outputs $S \subset \mathrm{Range}(\mathcal{A})$ it holds that

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon(d)} \Pr[\mathcal{A}(D') \in S] + \delta \quad \text{and} \quad \Pr[\mathcal{A}(D') \in S] \le e^{\varepsilon(d)} \Pr[\mathcal{A}(D) \in S] + \delta.$$

We further allow the privacy parameter ε to be a function of a subset of outcomes to provide a sharper characterization of privacy. We term this variant output-specific (ε, δ)-DP.

Definition 2. [Output-specific individual (ε, δ)-DP] Fix a datapoint $d$ and a set of outcomes $A \subset \mathrm{Range}(\mathcal{A})$, and let $D$ be an arbitrary dataset and $D' = D \cup \{d\}$. An algorithm $\mathcal{A}$ satisfies $(\varepsilon(A, d), \delta)$-output-specific individual DP for $d$ at $A$ if for any $S \subset A$ it holds that

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon(A,d)} \Pr[\mathcal{A}(D') \in S] + \delta \quad \text{and} \quad \Pr[\mathcal{A}(D') \in S] \le e^{\varepsilon(A,d)} \Pr[\mathcal{A}(D) \in S] + \delta.$$

Output-specific DP shares insights with ex-post DP (Ligett et al., 2017; Redberg & Wang, 2021), as both definitions tailor the privacy guarantee to algorithm outcomes. Ex-post DP characterizes privacy after observing a particular outcome. In contrast, the canonical (ε, δ)-DP is an ex-ante privacy notion, where ex-ante refers to 'before sampling from the distribution'. Ex-ante DP builds on a property of the distribution of the outcome. When the set $A$ in Definition 2 contains more than one outcome, output-specific (ε, δ)-DP remains ex-ante because it measures how the outcome is distributed over $A$. Definition 2 has the same semantics as Definition 1 once it is known that the algorithm's outcome belongs to $A$. It is a strict generalization of (ε, δ)-DP, as one can recover (ε, δ)-DP by maximizing $\varepsilon(A, d)$ over $A$ and $d$.

Making this generalization is crucial for us to precisely compute the privacy parameters of models trained with DP-SGD. The output of DP-SGD is a sequence of models $\{\theta_t\}_{t=1}^{T}$, where $T$ is the number of iterations. The privacy risk at step $t$ depends heavily on the training trajectory (the first $t - 1$ models). For example, one can adversarially modify the training data and hence change the training trajectory to maximize the privacy risk of a target example (Tramèr et al., 2022). With Definition 2, we can analyze the privacy guarantee of DP-SGD with regard to a fixed training trajectory, which specifies a subset of DP-SGD's outcomes. In comparison, canonical DP analyzes the worst-case privacy over all possible outcomes, which could give loose privacy parameters.

A common approach for doing deep learning with differential privacy is to make each update differentially private instead of protecting the trained model directly. The composition property of differential privacy allows us to reason about the overall privacy of running several such steps. We give a simple example to illustrate how to privatize each update. Suppose we take the sum of all gradients $v = \sum_{i=1}^{n} g_i$ from dataset $D$. Without loss of generality, further assume we add an arbitrary example $d'$ to obtain a neighboring dataset $D'$. The summed gradient becomes $v' = v + g'$, where $g'$ is the gradient of $d'$. If we add isotropic Gaussian noise with variance $\sigma^2$, then the output distributions of the two neighboring datasets are $\mathcal{A}(D) \sim \mathcal{N}(v, \sigma^2 I)$ and $\mathcal{A}(D') \sim \mathcal{N}(v', \sigma^2 I)$. We can then bound the divergence between the two Gaussian distributions to prove differential privacy, e.g., through Rényi differential privacy (RDP) (Mironov, 2017). We give the definition of individual RDP in Appendix A. The expectations of $\mathcal{A}(D)$ and $\mathcal{A}(D')$ differ by $g'$. A larger gradient leads to a larger difference and hence a worse privacy parameter.
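To make the last point concrete, the following minimal sketch evaluates the well-known closed form for the Rényi divergence of order α between two isotropic Gaussians with a shared covariance, which equals α‖g′‖²/(2σ²). The function name and the numeric values are illustrative assumptions, not part of the paper.

```python
def gaussian_rdp(grad_norm, sigma, alpha):
    """Renyi divergence of order alpha between N(v, sigma^2 I) and
    N(v + g', sigma^2 I), where grad_norm = ||g'||.

    Uses the closed form alpha * ||g'||^2 / (2 * sigma^2).
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return alpha * grad_norm ** 2 / (2.0 * sigma ** 2)


# Hypothetical values chosen only for illustration.
sigma = 1.0
alpha = 8.0
small = gaussian_rdp(0.1, sigma, alpha)  # example with a small gradient
large = gaussian_rdp(1.0, sigma, alpha)  # example with a large gradient
print(small, large)
```

Under these assumed values the example with the smaller gradient norm incurs a cost that is 100 times lower, since the cost scales with the squared norm.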

2.2. CHALLENGES OF COMPUTING INDIVIDUAL PRIVACY PARAMETERS FOR DP-SGD

Privacy accounting in DP-SGD is more complex than the simple example in Section 2.1 because the analysis involves privacy amplification by subsampling. Roughly speaking, randomly sampling a minibatch in DP-SGD strengthens the privacy guarantees since most points in the dataset are not involved in a single step. How subsampling complicates the theoretical privacy analysis has been studied extensively (Abadi et al., 2016; Balle et al., 2018; Mironov et al., 2019; Zhu & Wang, 2019; Wang et al., 2019). In this section, we focus on how subsampling complicates the empirical computation of individual privacy parameters.

Before we expand on these difficulties, we first describe the output distributions of neighboring datasets in DP-SGD (Abadi et al., 2016). Poisson sampling is assumed, i.e., each example is sampled independently with probability $p$. Let $v = \sum_{i \in M} g_i$ be the sum of the minibatch of gradients of $D$, where $M$ is the set of sampled indices. Consider also a neighboring dataset $D'$ that has one datapoint with gradient $g'$ added. Because of Poisson sampling, the summed gradient is exactly $v$ with probability $1 - p$ ($g'$ is not sampled) and is $v' = v + g'$ with probability $p$ ($g'$ is sampled). If we again add isotropic Gaussian noise, the output distributions of the two neighboring datasets are

$$\mathcal{A}(D) \sim \mathcal{N}(v, \sigma^2 I), \qquad (1)$$

$$\mathcal{A}(D') \sim \begin{cases} \mathcal{N}(v, \sigma^2 I) & \text{with prob. } 1 - p, \\ \mathcal{N}(v', \sigma^2 I) & \text{with prob. } p. \end{cases} \qquad (2)$$

With Equations (1) and (2), we explain the challenges in computing individual privacy parameters.
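The sampling and noising steps described above can be sketched as follows. This is a toy illustration under stated assumptions: gradients are plain Python lists, and the helper names `poisson_sample` and `noisy_sum` are hypothetical rather than taken from any library.

```python
import random


def poisson_sample(n, p, rng):
    """Poisson sampling: include each of the n indices independently with probability p."""
    return [i for i in range(n) if rng.random() < p]


def noisy_sum(grads, indices, sigma, rng):
    """Sum the selected gradients and add isotropic Gaussian noise with std sigma."""
    dim = len(grads[0])
    total = [0.0] * dim
    for i in indices:
        for k in range(dim):
            total[k] += grads[i][k]
    return [total[k] + rng.gauss(0.0, sigma) for k in range(dim)]


rng = random.Random(0)
# Toy two-dimensional gradients, for illustration only.
grads = [[0.1 * i, -0.05 * i] for i in range(100)]
batch = poisson_sample(len(grads), p=0.1, rng=rng)
print(len(batch), noisy_sum(grads, batch, sigma=1.0, rng=rng))
```

An added example with gradient g′ would enter `batch` with probability p, which is exactly the mixture in Equation (2).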

2.2.1. FULL BATCH GRADIENT NORMS ARE REQUIRED AT EVERY ITERATION

There is some privacy cost for $d'$ even if it is not sampled in the current iteration because the analysis makes use of the subsampling process. For a given sampling probability and noise variance, the amount of privacy cost is determined by $\|g'\|$. Therefore, we need accurate gradient norms of all examples to compute accurate privacy costs at every iteration. However, when running SGD, we only compute minibatch gradients. Previous analysis of DP-SGD evades this problem by simply assuming all examples have the maximum possible norm, i.e., the clipping threshold.

The density function of $\mathcal{A}(D')$ is a mixture of two Gaussian distributions. Abadi et al. (2016) compute the Rényi divergence between $\mathcal{A}(D)$ and $\mathcal{A}(D')$ numerically to get tight privacy parameters. Although there are some asymptotic bounds, those bounds are looser than numerical computation, and thus such numerical computations are necessary in practice (Abadi et al., 2016; Wang et al., 2019; Mironov et al., 2019; Gopi et al., 2021). In the classic analysis, there is only one numerical computation as all examples have the same privacy cost over all iterations. However, naive computation of individual privacy parameters would require up to $n \times T$ numerical computations, where $n$ is the dataset size and $T$ is the number of iterations.
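As a rough illustration of what one such numerical computation looks like, the sketch below evaluates the Rényi divergence between the mixture in Equation (2) and the Gaussian in Equation (1) at an integer order, using the binomial expansion commonly used by RDP accountants. It is a simplified stand-in, not necessarily the exact routine used in the paper: it handles integer orders and one direction of the divergence only, and the function names are assumptions.

```python
import math


def _log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def subsampled_gaussian_rdp(q, noise_multiplier, alpha):
    """RDP at integer order alpha >= 2 for sampling probability q and
    noise_multiplier = (noise std) / (gradient norm)."""
    if not (0.0 < q <= 1.0):
        raise ValueError("q must be in (0, 1]")
    if not isinstance(alpha, int) or alpha < 2:
        raise ValueError("alpha must be an integer >= 2")
    if q == 1.0:
        # No subsampling: plain Gaussian mechanism.
        return alpha / (2.0 * noise_multiplier ** 2)
    log_a = -math.inf
    for k in range(alpha + 1):
        log_term = (
            math.log(math.comb(alpha, k))
            + k * math.log(q)
            + (alpha - k) * math.log(1.0 - q)
            + (k * k - k) / (2.0 * noise_multiplier ** 2)
        )
        log_a = _log_add(log_a, log_term)
    return log_a / (alpha - 1)


def per_example_rdp(grad_norm, sigma, q, alpha):
    """Per-step cost of an example with gradient norm grad_norm under noise std sigma."""
    if grad_norm == 0.0:
        return 0.0
    return subsampled_gaussian_rdp(q, sigma / grad_norm, alpha)


# Hypothetical settings: the cost shrinks as the gradient norm shrinks.
print(per_example_rdp(1.0, 1.0, 0.01, 8), per_example_rdp(0.5, 1.0, 0.01, 8))
```

Because the result depends on the gradient norm, a separate evaluation is needed for every distinct norm, which is the source of the $n \times T$ cost mentioned above.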

2.3. AN OBSERVATION: GRADIENT NORMS IN DEEP LEARNING VARY SIGNIFICANTLY

As shown in Equations (1) and (2), the privacy parameter of an example is determined by its gradient norm once the noise variance is given. It is worth noting that examples with larger gradient norms usually have higher training loss. This implies that the privacy cost of an example correlates with its training loss, which we empirically demonstrate in Sections 4 and 5. In this section, we show that the gradient norms of different examples vary significantly, which demonstrates that different examples experience very different privacy costs. We train a ResNet-20 model with DP-SGD. The maximum clipping threshold is the median of gradient norms at initialization. We plot the average norms of three different classes in Figure 2. The gradient norms of different classes show significant stratification. Such stratification naturally leads to different individual privacy costs. This suggests that quantifying individual privacy parameters may be valuable despite the aforementioned challenges.

3. DEEP LEARNING WITH INDIVIDUAL PRIVACY

We give an efficient algorithm (Algorithm 1) to estimate individual privacy parameters for DP-SGD. The privacy analysis of Algorithm 1 is in Section 3.1. We make two modifications to make individual privacy accounting feasible with small computational overhead. First, we compute full batch gradient norms once in a while, e.g., at the beginning of every epoch, and use the results to estimate the gradient norms for subsequent iterations. We show the estimations of gradient norms are accurate in Section 3.2. Second, we round the gradient norms to a given precision so the number of numerical computations is independent of the dataset size. More details of this modification are in Section 3.3. Finally, we discuss how to make use of the individual privacy parameters in Section 3.4.

In Algorithm 1, we clip individual gradients with estimations of gradient norms (referred to as individual clipping). This is different from vanilla DP-SGD, which uses the maximum threshold to clip all examples (Abadi et al., 2016). Individual clipping gives us exact upper bounds on gradient norms and hence exact bounds on privacy costs. Our analysis is also applicable to vanilla DP-SGD, though the privacy parameters in this case become estimates of exact guarantees. This is inevitable because one cannot have exact bounds on gradient norms at every iteration when running vanilla SGD. We report the results for vanilla DP-SGD in Appendix C.2. Our observations include: (1) the estimates of privacy guarantees for vanilla DP-SGD are very close to the exact ones, (2) all our conclusions in the main text still hold when running vanilla DP-SGD, and (3) the individual privacy of Algorithm 1 is very close to that of vanilla DP-SGD.

Algorithm 1
  for each rounded norm $c \in \mathcal{C}$ do
    Compute the Rényi divergences at different orders between Equations (1) and (2) and store the results in $\rho_c$.
  end
  Let $\{o^{(i)} = 0\}_{i=1}^{n}$ be the accumulated Rényi divergences at different orders.
  for $e = 1$ to $E$ do
    for $t' = 1$ to $\lfloor 1/p \rfloor$ do
      $t = t' + \lfloor 1/p \rfloor \times (e - 1)$  // Global step count.
      if $t \bmod \lfloor 1/(pK) \rfloor = 0$ then
        Compute full batch gradient norms $\{S^{(i)}\}_{i=1}^{n}$.
        Update gradient estimates with rounded norms $\{C^{(i)} = \arg\min_{c \in \mathcal{C}} |c - S^{(i)}|\}_{i=1}^{n}$.
      end
      Sample a minibatch of gradients $\{g^{(I_j)}\}_{j=1}^{|I|}$, where $I$ is the set of sampled indices.
      Clip gradients $\bar{g}^{(I_j)} = \mathrm{clip}(g^{(I_j)}, C^{(I_j)})$.
      Update model $\theta_t = \theta_{t-1} - \eta \left( \sum_{j} \bar{g}^{(I_j)} + z \right)$, where $z \sim \mathcal{N}(0, \sigma^2 I)$.
      for $i = 1$ to $n$ do
        Set $\rho^{(i)} = \rho_c$ where $c = C^{(i)}$.
        $o^{(i)} = o^{(i)} + \rho^{(i)}$  // Update individual privacy costs.
      end
    end
  end
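The accounting part of Algorithm 1 can be sketched in a few lines. The sketch below is a toy illustration under several assumptions: it tracks a single Rényi order, it takes precomputed per-step gradient norms as input instead of training a model, it refreshes the estimates at step 0, and the lookup table `rho_table` holds placeholder costs rather than real Rényi divergences. All names are hypothetical.

```python
def round_to_grid(norm, grid):
    """Return the grid value closest to norm."""
    return min(grid, key=lambda c: abs(c - norm))


def individual_accounting(norm_history, grid, rho_table, steps_per_refresh):
    """Accumulate per-example privacy costs over training.

    norm_history[t][i] is the gradient norm of example i at step t (toy input).
    Estimates are refreshed every steps_per_refresh steps and reused in between.
    """
    n = len(norm_history[0])
    accumulated = [0.0] * n
    estimates = [grid[-1]] * n  # assumption: start from the maximum threshold
    for t, norms in enumerate(norm_history):
        if t % steps_per_refresh == 0:
            estimates = [round_to_grid(s, grid) for s in norms]
        for i in range(n):
            # Every example pays a cost at every step, sampled or not.
            accumulated[i] += rho_table[estimates[i]]
    return accumulated


grid = [0.25, 0.5, 0.75, 1.0]
rho_table = {c: c * c for c in grid}  # placeholder costs, not real RDP values
history = [[0.2, 0.9], [0.3, 0.95], [0.1, 1.0]]
print(individual_accounting(history, grid, rho_table, steps_per_refresh=2))
```

In a faithful implementation, `rho_table` would store one vector of Rényi divergences per grid value, one entry per order, and clipping with the current estimates would happen in the training loop itself.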

3.1. PRIVACY ANALYSIS OF ALGORITHM 1

We state the privacy guarantee of Algorithm 1 in Theorem 3.1. We use output-specific DP to precisely account for the privacy costs of a realized run of Algorithm 1.

Theorem 3.1. [Output-specific privacy guarantee] Algorithm 1 at step $t$ satisfies $\left( o^{(i)}_{\alpha} + \frac{\log(1/\delta)}{\alpha - 1}, \delta \right)$-output-specific individual DP for the $i$-th example at $A = (\theta_1, \ldots, \theta_{t-1}, \Theta_t)$, where $o^{(i)}_{\alpha}$ is the accumulated RDP at order $\alpha$ and $\Theta_t$ is the domain of $\theta_t$.

To prove Theorem 3.1, we first define a $t$-step non-adaptive composition with $\theta_1, \ldots, \theta_{t-1}$. We then show that the RDP of the non-adaptive composition gives an output-specific privacy bound on the adaptive composition in Algorithm 1. We relegate the proof to Appendix A.
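The conversion in Theorem 3.1 is simple to evaluate once the accumulated RDP is known. The sketch below applies the expression from the theorem at each tracked order and keeps the smallest resulting ε, which is the usual practice since the bound holds for every order. The function name and the input values are illustrative assumptions.

```python
import math


def rdp_to_epsilon(accumulated_rdp, delta):
    """Convert accumulated RDP values into an epsilon at the given delta.

    accumulated_rdp maps each order alpha > 1 to the accumulated RDP o_alpha.
    Returns the smallest o_alpha + log(1/delta) / (alpha - 1) and its order.
    """
    best_eps, best_alpha = math.inf, None
    for alpha, rdp in accumulated_rdp.items():
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        if eps < best_eps:
            best_eps, best_alpha = eps, alpha
    return best_eps, best_alpha


# Hypothetical accumulated costs for one example at two orders.
eps, alpha = rdp_to_epsilon({2: 0.1, 11: 1.0}, delta=1e-5)
print(eps, alpha)
```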

3.2. ESTIMATIONS OF GRADIENT NORMS ARE ACCURATE

Although the gradient norms used for privacy accounting are updated only occasionally, we show that the individual privacy parameters based on estimations are very close to those based on exact norms (Figure 3). This indicates that the estimations of gradient norms are close to the exact ones. We comment that the privacy parameters computed with exact norms are not equivalent to those of vanilla DP-SGD because individual clipping in Algorithm 1 slightly modifies the training trajectory. We show that individual clipping has a small influence on the individual privacy of vanilla DP-SGD in Appendix C.2.

To compute the exact gradient norms, we randomly sample 1000 examples and compute their exact gradient norms at every iteration. We compute the Pearson correlation coefficient between the privacy parameters computed with estimated norms and those computed with exact norms. We also compute the average and the worst absolute errors. We report results on MNIST, CIFAR-10, and UTKFace. Details about the experiments are in Section 4. We plot the results in Figure 3. The ε values based on estimations are close to those based on exact norms (Pearson's r > 0.99) even if we only update the gradient norms every two epochs (K = 0.5). Updating full batch gradient norms more frequently further improves the estimation, though doing so would increase the computational overhead.

It is worth noting that the maximum clipping threshold C affects the computed privacy parameters. A large C increases the variation of gradient norms (and hence the variation of privacy parameters) but leads to large noise variance, while a small C suppresses the variation and leads to large gradient bias. Large noise variance and large gradient bias are both harmful for learning (Chen et al., 2020; Song et al., 2021). In Appendix D, we show the influence of using different C on both accuracy and privacy.
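The three agreement measures used in this comparison (Pearson's r, average absolute error, and worst absolute error) can be computed as in the following sketch. The function names and the ε values are made up for illustration and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def error_summary(estimated, exact):
    """Compare privacy parameters from estimated norms with those from exact norms."""
    errors = [abs(a - b) for a, b in zip(estimated, exact)]
    return {
        "pearson_r": pearson_r(estimated, exact),
        "mean_abs_error": sum(errors) / len(errors),
        "max_abs_error": max(errors),
    }


# Hypothetical epsilon values for four examples.
summary = error_summary([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 2.9, 4.2])
print(summary)
```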

3.3. ROUNDING INDIVIDUAL GRADIENT NORMS

The rounding operation in Algorithm 1 is essential to make the computation of individual privacy parameters feasible. The privacy cost of one example at each step is the Rényi divergence between Equations (1) and (2). For a fixed sampling probability and noise variance, we use $\rho_c$ to denote the privacy cost of an example with gradient norm $c$. Note that $\rho_c$ is different for every value of $c$. Consequently, there are at most $n \times E$ different privacy costs because individual gradient norms vary across different examples and epochs ($n$ is the number of datapoints and $E$ is the number of training epochs). To make the number of different $\rho_c$ tractable, we round individual gradient norms with a prespecified precision $r$ (see the rounding step in Algorithm 1). Because gradient norms are bounded by the maximum clipping threshold $C$, which is usually a small constant, there are at most $\lceil C/r \rceil$ different rounded norms and hence at most $\lceil C/r \rceil$ different values of $\rho_c$. Throughout this paper we set $r = 0.01C$, which has almost no impact on the precision of privacy accounting.

We give a concrete comparison of the computational costs in Table 1. We run the numerical method in Mironov et al. (2019) once for every different privacy cost (with the default setup in the Opacus library (Yousefpour et al., 2021)). We run DP-SGD on CIFAR-10 for 200 epochs with K = 1. All results in Table 1 use multiprocessing with 5 cores of an AMD EPYC™ 7V13 CPU. With rounding, the overhead of computing individual privacy parameters is negligible. In contrast, the computational cost without rounding is more than 7 hours.
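A minimal sketch of the rounding step follows. It assumes the set of candidate norms is the grid {r, 2r, ..., C}, which is one natural reading of the text rather than a detail confirmed by it, and the helper names are hypothetical.

```python
import math
import random


def make_grid(max_threshold, precision):
    """Candidate norms r, 2r, ..., capped at the maximum threshold C (assumed form)."""
    num = math.ceil(max_threshold / precision - 1e-9)
    return [min((k + 1) * precision, max_threshold) for k in range(num)]


def round_norm(norm, grid):
    """Round a gradient norm to the nearest candidate value."""
    return min(grid, key=lambda c: abs(c - norm))


C = 1.0
r = 0.01 * C
grid = make_grid(C, r)

rng = random.Random(0)
norms = [rng.uniform(0.0, C) for _ in range(1000)]
distinct = {round_norm(s, grid) for s in norms}
# However many examples there are, at most len(grid) costs must be computed.
print(len(grid), len(distinct))
```

With r = 0.01C the grid has 100 entries, so the number of numerical computations per update of the estimates no longer grows with the dataset size.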

3.4. WHAT CAN WE DO WITH INDIVIDUAL PRIVACY PARAMETERS?

Note that individual privacy parameters are dependent on the private data and thus sensitive, and consequently may not be released publicly without care. We describe some approaches to safely make use of individual privacy parameters.

The first approach is to release ε_i to the owner of the i-th example. Although we use gradient norms without adding noise, this approach does not incur additional privacy loss for two reasons. First, it is safe for the i-th example because only the rightful owner sees ε_i. Second, releasing ε_i does not increase the privacy loss of any other example. This is because the computation only uses $(\theta_1, \ldots, \theta_{t-1})$, which is reported in a privacy-preserving manner. We prove the claim in Appendix E.1.

The second approach is to privately release aggregate statistics of the population, e.g., the average or quantiles of the ε values. Recent works have demonstrated that such statistics can be published accurately with minor privacy cost (Andrew et al., 2021). We show the statistics can be released accurately with a very small privacy parameter (ε = 0.1) in Appendix E.2.

Finally, individual privacy parameters can also serve as a powerful tool for a trusted data curator to improve the model quality. By analysing the individual privacy parameters of a dataset, a trusted curator can focus on collecting more data representative of the groups that have higher privacy risks to mitigate the disparity in privacy.

4. INDIVIDUAL PRIVACY PARAMETERS ON DIFFERENT DATASETS

We first show the distribution of individual privacy parameters of running DP-SGD on four classification tasks in Section 4.1. Then we investigate how individual privacy parameters correlate with training loss in Section 4.2. The experimental setup is as follows.

Datasets. We use two benchmark datasets, MNIST (n = 60000) and CIFAR-10 (n = 50000) (LeCun et al., 1998; Krizhevsky, 2009), as well as the UTKFace dataset (n ≃ 15000) (Zhang et al., 2017), which contains face images of four different races (White, n ≃ 7000; Black, n ≃ 3500; Asian, n ≃ 2000; Indian, n ≃ 2800). We construct two tasks on UTKFace: predicting gender, and predicting whether the age is under 30.¹ We slightly modify the dataset between these two tasks by randomly removing a few examples to ensure each race has balanced positive and negative labels.

Models and hyperparameters.

For CIFAR-10, we use the WRN16-4 model in De et al. (2022), which achieves strong performance in the private setting. We follow the implementation details in De et al. (2022), except for their data augmentation method, which we omit to reduce computational cost. For MNIST and UTKFace, we use ResNet20 models with batch normalization layers replaced by group normalization layers. For UTKFace, we initialize the model with weights pre-trained on ImageNet. We set C = 1 on CIFAR-10, following De et al. (2022). For MNIST and UTKFace, we set C as the median of gradient norms at initialization, following the suggestion in Abadi et al. (2016). We set K = 3 for all experiments in this section. More details about the hyperparameters are in Appendix B.

4.1. INDIVIDUAL PRIVACY PARAMETERS VARY SIGNIFICANTLY

Figure 1 shows the individual privacy parameters on all datasets. The privacy parameters vary across a large range on all tasks. On the CIFAR-10 dataset, the maximum ε_i value is 7.8 while the minimum ε_i value is only 1.1. We also observe that, for easier tasks, more examples enjoy stronger privacy guarantees. For example, ∼40% of examples reach the worst-case ε on CIFAR-10 while only ∼3% do so on MNIST. This may be because the loss decreases quickly when the task is easy, so gradient norms also decrease and the privacy loss is reduced.

4.2. PRIVACY PARAMETERS AND LOSS ARE POSITIVELY CORRELATED

We study how individual privacy parameters correlate with individual training loss. The analysis in Section 2 suggests that the privacy parameter of one example depends on its gradient norms across training. Intuitively, an example would have high loss after training if its gradient norms are large. We visualize individual privacy parameters and the final training loss values in Figure 4. The individual privacy parameters on all datasets increase with loss until they reach the maximum ε. To quantify the form of the correlation, we further fit the points with one-dimensional logarithmic functions and compute the Pearson correlation coefficients with the privacy parameters predicted by the fitted curves. The Pearson correlation coefficients are larger than 0.9 on all datasets, showing a logarithmic correlation between the privacy parameter of a datapoint and its training loss.
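One way to carry out such a fit is ordinary least squares on the logarithm of the loss, as sketched below. The model form ε ≈ a·log(loss) + b is an assumption about what a one-dimensional logarithmic function means here, and the data points are synthetic.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_log_curve(losses, eps_values):
    """Least-squares fit of eps ~ a * log(loss) + b; returns (a, b)."""
    xs = [math.log(loss) for loss in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(eps_values) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, eps_values))
    a = sxy / sxx
    b = my - a * mx
    return a, b


# Synthetic (loss, epsilon) pairs, for illustration only.
losses = [0.01, 0.05, 0.2, 1.0, 3.0]
eps_values = [1.2, 2.4, 3.1, 4.6, 5.0]
a, b = fit_log_curve(losses, eps_values)
predicted = [a * math.log(loss) + b for loss in losses]
print(a, b, pearson_r(predicted, eps_values))
```

A linear fit in log-loss cannot capture the saturation at the maximum ε, so the correlation reported by such a sketch is only a summary of the increasing part of the curve.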

5. GROUPS ARE SIMULTANEOUSLY UNDERSERVED IN BOTH ACCURACY AND PRIVACY

5.1. LOW-ACCURACY GROUPS HAVE WORSE PRIVACY PARAMETERS

It is well documented that machine learning models may have large differences in accuracy on different groups (Buolamwini & Gebru, 2018; Bagdasaryan et al., 2019). Our finding demonstrates that this disparity may be simultaneous in terms of both accuracy and privacy. We empirically verify this by plotting the average ε and training/test accuracy of different groups. The experimental setup is the same as in Section 4. For CIFAR-10 and MNIST, the groups are the data from different classes, while for UTKFace, the groups are the data from different races.

We plot the results in Figure 5. The groups are sorted based on the average ε. Both training and test accuracy correlate well with ε. Groups with worse accuracy do have worse privacy guarantees in general. On CIFAR-10, the average ε of the 'Cat' class (which has the worst test accuracy) is 43.6% higher than the average ε of the 'Automobile' class (which has the highest test accuracy). On UTKFace-Gender, the average ε of the group with the lowest test accuracy ('Asian') is 25.0% higher than the average ε of the group with the highest accuracy ('Indian'). Similar observations hold on the other tasks. To the best of our knowledge, our work is the first to reveal this simultaneous disparity.

5.2. LOW-ACCURACY GROUPS SUFFER FROM HIGHER EMPIRICAL PRIVACY RISKS

We run membership inference (MI) attacks against models trained without differential privacy to study whether underserved groups are more vulnerable to empirical privacy attacks. Several recent works show that MI attacks have higher success rates on some examples than on others (Song & Mittal, 2020; Choquette-Choo et al., 2021; Carlini et al., 2021). In this section, we directly connect such disparity in privacy risks with the unfairness in utility. We use a simple loss-threshold attack that predicts an example is a member if its loss value is smaller than a prespecified threshold (Sablayrolles et al., 2019). For each group, we use its whole test set and a random subset of the training set so that the numbers of training and test losses are balanced. We further split the data evenly into two subsets, find the optimal threshold on one, and report the success rate on the other. More implementation details are in Appendix B.

The results are in Figure 6. The groups are sorted based on their average ε. The disparity in privacy risks is clear. On UTKFace-Age, the MI success rate is 70.6% on the 'Black' group while it is only 60.1% on the 'Asian' group. This suggests that empirical privacy risks also correlate well with the disparity in utility.
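The loss-threshold attack and its evaluation protocol can be sketched as follows. This is a simplified illustration: the split takes the first and second halves of each list rather than a random split, the loss values are made up, and the function names are hypothetical.

```python
def attack_accuracy(member_losses, nonmember_losses, threshold):
    """Fraction of correct guesses when predicting 'member' iff loss < threshold."""
    correct = sum(1 for loss in member_losses if loss < threshold)
    correct += sum(1 for loss in nonmember_losses if loss >= threshold)
    return correct / (len(member_losses) + len(nonmember_losses))


def best_threshold(member_losses, nonmember_losses):
    """Pick the candidate threshold with the highest accuracy on the given data."""
    candidates = sorted(set(member_losses) | set(nonmember_losses))
    candidates.append(candidates[-1] + 1.0)  # a threshold above every loss
    return max(
        candidates,
        key=lambda th: attack_accuracy(member_losses, nonmember_losses, th),
    )


def evaluate_attack(member_losses, nonmember_losses):
    """Tune the threshold on the first halves and report accuracy on the second halves."""
    hm, hn = len(member_losses) // 2, len(nonmember_losses) // 2
    threshold = best_threshold(member_losses[:hm], nonmember_losses[:hn])
    accuracy = attack_accuracy(member_losses[hm:], nonmember_losses[hn:], threshold)
    return threshold, accuracy


# Made-up losses: training examples (members) tend to have lower loss.
members = [0.1, 0.2, 0.15, 0.05]
nonmembers = [0.9, 1.1, 0.8, 1.0]
print(evaluate_attack(members, nonmembers))
```

Running this per group, with balanced member and non-member sets as described above, gives one success rate per group that can be compared across groups.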

6. CONCLUSION

We propose an efficient algorithm to accurately estimate the individual privacy parameters for DP-SGD. We use this new algorithm to examine individual privacy guarantees on several datasets. Significantly, we find that groups with worse utility also suffer from worse privacy. This new finding reveals the complex yet interesting relation among utility, privacy, and fairness. It has two immediate implications. First, it shows that the learning objective aligns with privacy protection for a given privacy budget, i.e., the better the model learns an example, the better the privacy that example gets. Second, it suggests that mitigating utility unfairness under differential privacy is trickier than doing so in the non-private case. This is because classic methods such as upweighting underserved examples would exacerbate the disparity in privacy.

We comment that the RDP parameters of the adaptive composition of Algorithm 1 are random variables because they depend on the randomness of the intermediate outcomes. This is different from the conventional analysis that chooses a constant privacy parameter before training. Composition of random RDP parameters requires additional care because one needs to upper bound the privacy parameter over its randomness (Feldman & Zrnic, 2021; Lécuyer, 2021; Whitehouse et al., 2022). In this paper, we focus on the realizations of those random RDP parameters and hence provide a precise output-specific privacy bound.

B MORE IMPLEMENTATION DETAILS

The batch size is 4K, 2K, and 200 for CIFAR-10, MNIST, and UTKFace, respectively. We train for 300 epochs on CIFAR-10 and for 100 epochs on MNIST and UTKFace. We use the package in Yousefpour et al. (2021) to compute the noise multiplier. The standard deviation of the noise in Algorithm 1 is the noise multiplier times the maximum clipping threshold. The non-private models in Section 5.2 are as follows. For CIFAR-10, we use the ResNet20 model in He et al. (2016), which has ∼0.2M parameters, with all batch normalization layers replaced by group normalization layers. For UTKFace, we use the same models as in Section 4. We remove both gradient clipping and noise for non-private training. Other settings are the same as those in Section 4. All experiments are run on a single Tesla V100 GPU with 32 GB of memory. Our source code, including the implementation of our algorithm as well as scripts to reproduce the plots, will be released soon.

C MORE DISCUSSION ON INDIVIDUAL CLIPPING

C.1 INDIVIDUAL CLIPPING DOES NOT AFFECT ACCURACY

Here we run experiments to check the influence of using individual clipping thresholds on utility. Algorithm 1 uses individual clipping thresholds to ensure the computed privacy parameters are strict privacy guarantees. If the clipping thresholds are close to the actual gradient norms, then the clipped results are close to those of using a single maximum clipping threshold. However, if the estimations of gradient norms are not accurate, individual thresholds would clip more signal than using a single maximum threshold.

We get accurate estimations of actual guarantees. The privacy parameters are still computed with the estimations of gradient norms and hence are estimations of the actual guarantees. We compare the privacy parameters and actual guarantees in Figure 7. To compute the actual guarantees, we randomly sample 1000 examples and compute their exact gradient norms at every iteration. The results suggest that the privacy parameters are accurate estimations (Pearson's r > 0.99 and small maximum absolute errors).

Privacy parameters still have a strong correlation with training loss. In Figure 8 which completes the proof.

E.2 RELEASING POPULATION STATISTICS OF PRIVACY PARAMETERS

The individual privacy parameters computed by Algorithm 1 are sensitive and hence cannot be released directly to the public. Here we show that population-level statistics of the individual parameters can be released at a minor privacy cost. Specifically, we compute the average and the quantiles of the ε values with differential privacy. For the average, we release the noisy aggregate through the Gaussian mechanism. For the quantiles, we solve the objective function in Andrew et al. (2021) with 20 steps of full-batch gradient descent. The results on MNIST and CIFAR-10 are in Table 3 and Table 4, respectively. The released statistics are close to the actual values on both datasets with $(0.1, 10^{-5})$-DP.
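The release of the average can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the text: every ε value is clipped to [0, ε_max], where ε_max is the public worst-case bound; the dataset size n is treated as public; neighboring datasets differ by adding or removing one example; and the noise is calibrated with the classical Gaussian mechanism bound, which is valid for a target ε below 1. The function name `release_noisy_mean` and the synthetic ε values are hypothetical.

```python
import numpy as np


def release_noisy_mean(eps_values, eps_max, target_eps, target_delta, rng):
    """Release the mean of per-example privacy parameters with Gaussian noise.

    Assumptions: values lie in [0, eps_max], n is public, and the classical
    calibration sigma = sensitivity * sqrt(2 ln(1.25 / delta)) / eps is used.
    """
    clipped = np.clip(np.asarray(eps_values, dtype=float), 0.0, eps_max)
    n = clipped.size
    # Adding or removing one example changes the sum by at most eps_max.
    sensitivity = eps_max
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / target_delta)) / target_eps
    noisy_sum = clipped.sum() + rng.normal(0.0, sigma)
    # Dividing by the public n is post-processing.
    return noisy_sum / n, sigma / n


rng = np.random.default_rng(0)
# Synthetic stand-in for individual privacy parameters (not real results).
fake_eps = rng.uniform(1.0, 7.8, size=50_000)
noisy_mean, noise_std = release_noisy_mean(fake_eps, 7.8, 0.1, 1e-5, rng)
print(noisy_mean, fake_eps.mean(), noise_std)
```

Because the noise on the mean scales as 1/n, a dataset with tens of thousands of examples keeps the error small even with a tight budget, which is in line with the observation that the released statistics are close to the actual values.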



We acknowledge that predicting gender and age from images may be problematic. Nonetheless, as facial images have previously been highlighted as a setting where machine learning has disparate accuracy on different groups, we revisit this domain through a related lens. The labels are provided by the dataset curators.



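[Output-specific individual (ε, δ)-DP] Fix a datapoint d and a set of outcomes $A \subset \mathrm{Range}(\mathcal{A})$, let D be an arbitrary dataset and $D' = D \cup \{d\}$. An algorithm $\mathcal{A}$ satisfies $(\varepsilon(A, d), \delta)$-output-specific individual DP for d at A if for any $S \subset A$ it holds that

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon(A, d)} \Pr[\mathcal{A}(D') \in S] + \delta \quad \text{and} \quad \Pr[\mathcal{A}(D') \in S] \le e^{\varepsilon(A, d)} \Pr[\mathcal{A}(D) \in S] + \delta.$$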

Figure 2: Gradient norms on CIFAR-10.

Figure 3: Privacy parameters based on estimations of gradient norms (ε) versus those based on exact norms (έ). The results suggest that the estimations of gradient norms are close to the exact norms.

Figure 4: Privacy parameters and final training losses. Each point shows the loss and privacy parameter of one example. Pearson's r is computed between privacy parameters and log losses.

Figure 5: Accuracy and average ε of different groups. Groups with worse accuracy also have worse privacy in general.

Figure 6: MI attack success rates on different groups. Target models are trained without DP.

Figure 9: Test accuracy and privacy parameters computed with/without individual clipping (I.C.). Groups with worse test accuracy also have worse privacy in general.

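\begin{align}
\Pr[f(\mathcal{A}(D), d_i) \in S] &= \Pr[\mathcal{A}(D) \in T] \tag{8} \\
&\le e^{\varepsilon_j} \Pr[\mathcal{A}(D') \in T] + \delta \tag{9} \\
&= e^{\varepsilon_j} \Pr[f(\mathcal{A}(D'), d_i) \in S] + \delta, \notag
\end{align}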

Algorithm 1: Deep Learning with Individual Privacy Accounting
Input: Maximum clipping threshold C, rounding precision r, noise variance $\sigma^2$, sampling probability p, frequency of updating full gradient norms at every epoch K, number of epochs E.
Let $\{C^{(i)}\}_{i=1}^{n}$ be estimated gradient norms of all examples and initialize $C^{(i)} = C$.
Let $\mathcal{C} = \{r, 2r, 3r, \ldots, C\}$ be all possible norms under rounding.
for $c \in \mathcal{C}$ do
    Compute Rényi divergences at different orders between Equation (

Computational costs of computing individual privacy parameters for CIFAR-10 with K = 1.

Table 2: Comparison between the test accuracy of using individual clipping thresholds and that of using a single maximum clipping threshold. The maximum ε is 7.8 for CIFAR-10 and 2.4 for MNIST.

We compare the accuracy of the two clipping methods in Table 2. The individual clipping thresholds are updated once per epoch. We repeat the experiment four times with different random seeds. Other setups are the same as those in Section 4. The results suggest that using individual clipping thresholds in Algorithm 1 has a negligible effect on accuracy.

C.2 EXPERIMENTS WITHOUT INDIVIDUAL CLIPPING

We run Algorithm 1 without individual clipping, i.e., vanilla DP-SGD in Abadi et al. (2016), to see whether our conclusions in the main text still hold. Specifically, we change Line 15 of Algorithm 1 to clip all gradients with the maximum clipping threshold C. The rest of the experimental setup is the same as that in Section 4 and Appendix B.
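The difference between the two clipping variants can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes that estimated norms are rounded to the nearest value in {r, 2r, ..., C}, that C is a multiple of r, and that per-example gradients are available as rows of an array. The helper names `round_to_grid` and `clip_gradients` and all numeric values are hypothetical.

```python
import numpy as np


def round_to_grid(est_norms, r, max_clip):
    """Snap estimated norms to the nearest value in {r, 2r, ..., max_clip}.

    Assumption: nearest-point rounding, with max_clip a multiple of r.
    """
    n_levels = int(round(max_clip / r))
    k = np.rint(np.asarray(est_norms, dtype=float) / r)
    return np.clip(k, 1, n_levels) * r


def clip_gradients(grads, thresholds):
    """Rescale each row so that its L2 norm is at most its threshold.

    `thresholds` may be one scalar (vanilla DP-SGD) or one value per example.
    """
    norms = np.linalg.norm(grads, axis=1)
    scale = np.minimum(1.0, np.asarray(thresholds) / np.maximum(norms, 1e-12))
    return grads * scale[:, None]


grads = np.array([[3.0, 4.0], [0.3, 0.4], [6.0, 8.0]])  # exact norms: 5, 0.5, 10
est_norms = np.array([4.9, 0.52, 12.0])                 # hypothetical stale estimates
C, r = 8.0, 0.1

individual = round_to_grid(est_norms, r, C)    # per-example thresholds
with_ic = clip_gradients(grads, individual)    # individual clipping
without_ic = clip_gradients(grads, C)          # single maximum threshold
print(np.linalg.norm(with_ic, axis=1), np.linalg.norm(without_ic, axis=1))
```

In this toy case the first example loses a little signal under individual clipping because its estimate (4.9) is below its exact norm (5), which is the effect that the accuracy comparison is designed to measure.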

Table 3: Population statistics of individual privacy parameters on MNIST. The average estimation error rate is 1.19%. The value of δ is $1 \times 10^{-5}$.

Table 4: Population statistics of individual privacy parameters on CIFAR-10. The average estimation error rate is 1.51%. The value of δ is $1 \times 10^{-5}$.

$T = \{a \in A : f(a, d_i) \in S\}$. Because f is a bijective function, we have


A PROOF OF THEOREM 3.1

Theorem 3.1. [Output-specific privacy guarantee] Algorithm 1 at step t satisfies $(o^{(i)}_{\alpha} + \frac{\log(1/\delta)}{\alpha-1}, \delta)$-output-specific individual DP for the i-th example at $A = (\theta_1, \ldots, \theta_{t-1}, \Theta_t)$, where $o^{(i)}_{\alpha}$ is the accumulated RDP at order α and $\Theta_t$ is the domain of $\theta_t$.

Here we give the proof of Theorem 3.1. We compose the privacy costs at different steps through Rényi differential privacy (RDP) (Mironov, 2017). RDP uses the Rényi divergence at different orders to measure privacy. We use $D_{\alpha}(\mu \| \nu) = \frac{1}{\alpha-1} \log \int \left(\frac{d\mu}{d\nu}\right)^{\alpha} d\nu$ to denote the Rényi divergence at order α between μ and ν, and $D^{\leftrightarrow}_{\alpha}(\mu \| \nu) = \max(D_{\alpha}(\mu \| \nu), D_{\alpha}(\nu \| \mu))$ to denote the maximum divergence over the two directions. The definition of individual RDP is as follows.

Definition 3. [Individual RDP (Feldman & Zrnic, 2021)] Fix a datapoint d, let D be an arbitrary dataset and

be a sequence of randomized algorithms and $(\theta_1, \ldots, \theta_{t-1})$ be some arbitrary fixed outcomes from the domain, we define

Note that $\hat{\mathcal{A}}^{(t)}$ is not an adaptive composition, as the input of each individual mechanism does not depend on the outputs of previous mechanisms. Further let

be the adaptive composition. In Theorem A.1, we show that an RDP bound on the non-adaptive composition $\hat{\mathcal{A}}^{(t)}$ gives an output-specific DP bound on the adaptive composition $\mathcal{A}^{(t)}$.

where $\theta_1, \ldots, \theta_{t-1}$ are some arbitrary fixed outcomes. If $\hat{\mathcal{A}}^{(t)}(\cdot)$ satisfies $o_{\alpha}$ RDP at order α, then $\mathcal{A}^{(t)}(D)$ satisfies $(o_{\alpha} + \frac{\log(1/\delta)}{\alpha-1}, \delta)$-output-specific differential privacy at A.

Proof. For a given outcome $\theta^{(t)} = (\theta_1, \theta_2, \ldots, \theta_{t-1}, \theta_t) \in A$, we have

by the product rule of conditional probability. Applying the product rule recursively on $P[\mathcal{A}^{(t-1)}(D) = \theta^{(t-1)}]$, we have

In words, $\mathcal{A}^{(t)}$ and $\hat{\mathcal{A}}^{(t)}$ are identical in A. Therefore,

0.9 for all datasets. Moreover, we show that our observation in Section 5, i.e., low-accuracy groups have worse privacy parameters, still holds in Figure 9.
We also make a direct comparison with privacy parameters computed with individual clipping. We find that privacy parameters computed with individual clipping are very close to those computed without individual clipping. We also find that the order of groups, sorted by the average ε, is exactly the same for both cases.
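For reference, the conversion in Theorem A.1 from an accumulated RDP value $o_{\alpha}$ to an (ε, δ) parameter can be sketched numerically. The sketch below is an illustration under simplifying assumptions rather than the accounting used in the experiments: it uses the RDP of the plain Gaussian mechanism, $\alpha c^2 / (2\sigma^2)$ per step for an example with clipping norm c, ignores subsampling amplification, and takes the best order from a small integer grid. The function name `rdp_to_eps` and all numeric values are hypothetical.

```python
import math


def rdp_to_eps(rdp_by_order, delta):
    """Return min over orders of o_alpha + log(1/delta) / (alpha - 1), and the order."""
    best_eps, best_alpha = math.inf, None
    for alpha, rdp in rdp_by_order.items():
        if alpha <= 1:
            continue  # the conversion is only defined for alpha > 1
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        if eps < best_eps:
            best_eps, best_alpha = eps, alpha
    return best_eps, best_alpha


def gaussian_rdp(clip_norm, noise_std, steps, orders):
    """Accumulated RDP of `steps` Gaussian mechanisms (no subsampling assumed)."""
    return {a: steps * a * clip_norm ** 2 / (2.0 * noise_std ** 2) for a in orders}


orders = range(2, 65)
# Two hypothetical examples with different (rounded) gradient norms.
eps_large, alpha_large = rdp_to_eps(gaussian_rdp(1.0, 10.0, 100, orders), 1e-5)
eps_small, alpha_small = rdp_to_eps(gaussian_rdp(0.5, 10.0, 100, orders), 1e-5)
print(eps_large, alpha_large, eps_small, alpha_small)
```

The example with the smaller gradient norm obtains the smaller ε, which is the mechanism behind the spread of individual privacy parameters.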

D THE INFLUENCE OF MAXIMUM CLIPPING ON INDIVIDUAL PRIVACY

The value of the maximum clipping threshold C affects individual privacy parameters. A large value of C increases the stratification in gradient norms but also increases the noise variance for a fixed privacy budget. A small value of C suppresses the stratification but also increases the gradient bias. Here we run experiments with different values of C on CIFAR-10. To reduce the computation cost, we use the ResNet20 for CIFAR-10 of He et al. (2016), which has only ∼0.2M parameters. All batch normalization layers are replaced with group normalization layers. Let M be the median of gradient norms at initialization. We choose C from the list [0.1M, 0.3M, 0.5M, 1.5M]. The rest of the experimental setup is the same as that in Section 4. The histograms of individual privacy parameters are in Figure 10. In terms of accuracy, using clipping thresholds near the median gives better test accuracy. In terms of privacy, using smaller clipping thresholds increases privacy parameters in general. The number of datapoints that reach the worst privacy parameter decreases as C increases. When C = 0.1M, nearly 70% of datapoints reach the worst privacy parameter, while only ∼2% of datapoints reach it when C = 1.5M.

The correlation between accuracy and privacy is in Figure 11. The disparity in average ε is clear for all choices of C. Another important observation is that when C decreases, the privacy parameters of underserved groups increase more quickly than those of other groups. When C changes from 1.5M to 0.5M, the average ε of 'Cat' increases from 4.8 to 7.4, almost reaching the worst-case bound. In comparison, the increase in ε of the 'Ship' class is only 1.3 (from 4.2 to 5.5).
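The choice of candidate thresholds described above can be written as a short sketch. It assumes that the per-example gradient norms at initialization have already been computed and stored in an array; the function name `candidate_thresholds` and the toy norms are hypothetical.

```python
import numpy as np


def candidate_thresholds(init_grad_norms, multipliers=(0.1, 0.3, 0.5, 1.5)):
    """Return the median M of the initial gradient norms and the list of k * M."""
    m = float(np.median(init_grad_norms))
    return m, [k * m for k in multipliers]


# Toy norms for illustration; real values would come from one pass over the data.
toy_norms = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
M, thresholds = candidate_thresholds(toy_norms)
print(M, thresholds)
```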

E MAKE USE OF INDIVIDUAL PRIVACY PARAMETERS

E.1 RELEASING INDIVIDUAL PRIVACY PARAMETERS TO RIGHTFUL OWNERS

Let $\varepsilon_i$ be the privacy parameter of the i-th user. We can compute $\varepsilon_i$ from the training trajectory and $d_i$ itself. Theorem E.1 shows that releasing $\varepsilon_i$ does not increase the privacy cost of any other example $d_j \neq d_i$. The proof uses the fact that computing $\varepsilon_i$ can be seen as a post-processing of $(\theta_1, \ldots, \theta_{t-1})$, which is reported in a privacy-preserving manner.

Proof. We first note that the construction of f is independent of $d_j$. Without loss of generality, let $D, D' \in \mathcal{D}$ be the neighboring datasets where $D' = D \cup \{d_j\}$. Let $S \subset F$ be an arbitrary event and

