INDIVIDUAL PRIVACY ACCOUNTING WITH GAUSSIAN DIFFERENTIAL PRIVACY

Abstract

Individual privacy accounting enables bounding the differential privacy (DP) loss individually for each participant involved in an analysis. This can be informative, as the individual privacy losses are often considerably smaller than those indicated by DP bounds based on worst-case bounds at each data access. In order to account for the individual privacy losses in a principled manner, we need a privacy accountant for adaptive compositions of randomised mechanisms, where the loss incurred at a given data access is allowed to be smaller than the worst-case loss. This kind of analysis has been carried out for the Rényi differential privacy (RDP) by Feldman and Zrnic (12), however not yet for the so-called optimal privacy accountants. We make first steps in this direction by providing a careful analysis using the Gaussian differential privacy, which gives optimal bounds for the Gaussian mechanism, one of the most versatile DP mechanisms. This approach is based on determining a certain supermartingale for the hockey-stick divergence and on extending the Rényi divergence-based fully adaptive composition results of Feldman and Zrnic (12). We also consider measuring the individual (ε, δ)-privacy losses using the so-called privacy loss distributions. With the help of the Blackwell theorem, we can then make use of the results of Feldman and Zrnic (12) to construct an approximative individual (ε, δ)-accountant.

Published as a conference paper at ICLR 2023

1. INTRODUCTION

Differential privacy (DP) (8) provides means to accurately bound the compound privacy loss of multiple accesses to a database. Common DP composition accounting techniques such as Rényi differential privacy (RDP) based techniques (23; 33; 38; 24) or so-called optimal accounting techniques (19; 15; 37) require that the privacy parameters of all algorithms are fixed beforehand. Rogers et al. (28) were the first to analyse fully adaptive compositions, wherein the mechanisms are allowed to be selected adaptively. Rogers et al. (28) introduced two objects for measuring privacy in fully adaptive compositions: privacy filters, which halt the algorithm when a given budget is exceeded, and privacy odometers, which output bounds on the privacy loss incurred so far. Whitehouse et al. (34) have tightened these composition bounds using filters to match the tightness of the so-called advanced composition theorem (9). Feldman and Zrnic (12) obtain similar (ε, δ)-asymptotics via RDP analysis. In their analysis using RDP, Feldman and Zrnic (12) consider individual filters, where the algorithm stops releasing information about the data elements that have exceeded a pre-defined RDP budget.

This kind of individual analysis has not yet been considered for the optimal privacy accountants. We make first steps in this direction by providing a fully adaptive individual DP analysis using the Gaussian differential privacy (7). Our analysis leads to tight bounds for the Gaussian mechanism, and it is based on determining a certain supermartingale for the hockey-stick divergence and on using similar proof techniques as in the RDP-based fully adaptive composition results of Feldman and Zrnic (12). We note that the idea of individual accounting of privacy losses has been previously considered in various forms by, e.g., Ghosh and Roth (13); Ebadi et al. (10); Wang (32); Cummings and Durfee (6); Ligett et al. (22); Redberg and Wang (27). We also consider measuring the individual (ε, δ)-privacy losses using the so-called privacy loss distributions (PLDs). Using the Blackwell theorem, we can in this case rely on the results of (12) to construct an approximative (ε, δ)-accountant that often leads to smaller individual ε-values than commonly used RDP accountants. For this accountant, evaluating the individual DP parameters using the existing methods requires computing an FFT at each step of the adaptive analysis. We speed up this computation by placing the individual DP hyperparameters into well-chosen buckets and by using pre-computed Fourier transforms. Moreover, by using the Plancherel theorem, we obtain a further speed-up.

1.1 OUR CONTRIBUTIONS

Our main contributions are the following:

• We show how to analyse fully adaptive compositions of DP mechanisms using the Gaussian differential privacy. Our results give tight (ε, δ)-bounds for compositions of Gaussian mechanisms and are the first results with tight bounds for fully adaptive compositions.
• Using the concept of dominating pairs of distributions and by utilising the Blackwell theorem, we propose an approximative individual (ε, δ)-accountant that in several cases leads to smaller individual ε-bounds than the individual RDP analysis.
• We propose efficient numerical techniques to compute individual privacy parameters using privacy loss distributions (PLDs) and the FFT algorithm. We show that individual ε-values can be accurately approximated in O(n) time, where n is the number of discretisation points for the PLDs. Due to the lack of space, this is described in Appendix D.
• We give experimental results that illustrate the benefits of replacing the RDP analysis with GDP accounting or with FFT-based numerical accounting techniques. As an observation of independent interest, we notice that individual filtering leads to a disparate loss of accuracy among subgroups when training a neural network using DP gradient descent.

2. BACKGROUND

We first shortly review the definitions and results required for our analysis. For a more detailed discussion, see e.g. (7) and (37).

An input dataset containing N data points is denoted as X = (x_1, ..., x_N) ∈ X^N, where x_i ∈ X, 1 ≤ i ≤ N. We say X and X′ are neighbours if we get one by adding or removing one element in the other (denoted X ∼ X′). Similarly to Feldman and Zrnic (12), we also denote by X^{-i} the dataset obtained by removing element x_i from X, i.e., X^{-i} = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_N).

Definition 1. A mechanism M is (ε, δ)-DP if for all neighbouring datasets X ∼ X′ and for all measurable sets E,

    P(M(X) ∈ E) ≤ e^ε P(M(X′) ∈ E) + δ.

We call M tightly (ε, δ)-DP if there does not exist δ′ < δ such that M is (ε, δ′)-DP.

The (ε, δ)-DP bounds can also be characterised using the hockey-stick divergence. For α > 0, the hockey-stick divergence H_α from a distribution P to a distribution Q is defined as

    H_α(P||Q) = ∫ [P(t) − α·Q(t)]_+ dt,

where [x]_+ = max{0, x} for x ∈ R. Tight (ε, δ)-values for a given mechanism can be obtained using the hockey-stick divergence:

Lemma 2 (Zhu et al. 37). For a given ε ≥ 0, the tight δ(ε) is given by the expression

    δ(ε) = max_{X∼X′} H_{e^ε}(M(X)||M(X′)).

Thus, if we can bound the divergence H_{e^ε}(M(X)||M(X′)) accurately, we also obtain accurate δ(ε)-bounds.
To this end we consider so-called dominating pairs of distributions:

Definition 3 (Zhu et al. 37). A pair of distributions (P, Q) is a dominating pair of distributions for a mechanism M if for all neighbouring datasets X and X′ and for all α > 0,

    H_α(M(X)||M(X′)) ≤ H_α(P||Q).

If the equality holds for all α for some X, X′, then (P, Q) is tightly dominating.

Dominating pairs of distributions also give upper bounds for adaptive compositions:

Theorem 4 (Zhu et al. 37). If (P, Q) dominates M and (P′, Q′) dominates M′, then (P × P′, Q × Q′) dominates the adaptive composition M ∘ M′.

To convert the hockey-stick divergence from P × P′ to Q × Q′ into an efficiently computable form, we consider so-called privacy loss random variables.

Definition 5. Let P and Q be probability density functions. We define the privacy loss function L_{P/Q} as

    L_{P/Q}(t) = log( P(t) / Q(t) ).

We define the privacy loss random variable (PRV) ω_{P/Q} as

    ω_{P/Q} = L_{P/Q}(t), t ∼ P.

With a slight abuse of notation, we denote the probability density function of the random variable ω_{P/Q} by ω_{P/Q}(t), and call it the privacy loss distribution (PLD).

Theorem 6 (Gopi et al. 15). The δ(ε)-bounds can be represented using the following representation that involves the PRV:

    H_{e^ε}(P||Q) = E_{t∼P} [1 − e^{ε − L_{P/Q}(t)}]_+ = E_{s∼ω_{P/Q}} [1 − e^{ε − s}]_+.   (2.1)

Moreover, if ω_{P/Q} is the PRV for the pair of distributions (P, Q) and ω_{P′/Q′} the PRV for the pair of distributions (P′, Q′), then the PRV for the pair of distributions (P × P′, Q × Q′) is given by ω_{P/Q} + ω_{P′/Q′}.

When we set α = e^ε, the following characterisation follows directly from Theorem 6.

Corollary 7. If the pair of distributions (P, Q) is a dominating pair of distributions for a mechanism M, then for all neighbouring datasets X and X′ and for all ε ∈ R,

    E_{t∼M(X)} [1 − e^{ε − L_{M(X)/M(X′)}(t)}]_+ ≤ E_{t∼P} [1 − e^{ε − L_{P/Q}(t)}]_+.

We will in particular consider the Gaussian mechanism and its subsampled variant.
Example: hockey-stick divergence between two Gaussians. Let x_0, x_1 ∈ R, σ ≥ 0, and let P be the density function of N(x_0, σ²) and Q the density function of N(x_1, σ²). Then, the PRV ω_{P/Q} is distributed as (Lemma 11 of 30)

    ω_{P/Q} ∼ N( (x_0 − x_1)² / (2σ²), (x_0 − x_1)² / σ² ).   (2.2)

Thus, in particular, H_α(P||Q) = H_α(Q||P) for all α > 0. Plugging the PLD ω_{P/Q} into the expression (2.1), we find that for all ε ≥ 0, the hockey-stick divergence H_{e^ε}(P||Q) is given by the expression

    δ(ε) = Φ( −εσ/∆ + ∆/(2σ) ) − e^ε Φ( −εσ/∆ − ∆/(2σ) ),   (2.3)

where Φ denotes the CDF of the standard univariate Gaussian distribution and ∆ = |x_0 − x_1|. This formula was originally given by Balle and Wang (3). If M is of the form M(X) = f(X) + Z, where f : X^N → R^d and Z ∼ N(0, σ² I_d), and ∆ = max_{X∼X′} ∥f(X) − f(X′)∥_2, then for x_0 = 0, x_1 = ∆, (P, Q) of the above form gives a tightly dominating pair of distributions for M (37). Subsequently, by Theorem 6, M is (ε, δ)-DP for δ(ε) given by (2.3).

Lemma 8 allows a tight analysis of the subsampled Gaussian mechanism using the hockey-stick divergence. We state the result for the case of Poisson subsampling with sampling rate γ.

Lemma 8 (Zhu et al. 37). If (P, Q) dominates a mechanism M for add neighbours, then (P, (1 − γ)·P + γ·Q) dominates the mechanism M ∘ S_Poisson for add neighbours, and ((1 − γ)·Q + γ·P, P) dominates M ∘ S_Poisson for removal neighbours.

We will also use the Rényi differential privacy (RDP) (23), which is defined as follows. The Rényi divergence of order α ∈ (1, ∞) between two distributions P and Q is defined as

    D_α(P||Q) = 1/(α − 1) · log ∫ ( P(t)/Q(t) )^α Q(t) dt.

By continuity, we have that lim_{α→1+} D_α(P||Q) equals the KL divergence KL(P||Q).

Definition 9. We say that a mechanism M is (α, ρ)-RDP if for all neighbouring datasets X, X′, the output distributions M(X) and M(X′) have Rényi divergence of order α less than ρ, i.e.,

    max_{X∼X′} { D_α(M(X)||M(X′)), D_α(M(X′)||M(X)) } ≤ ρ.
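As a concrete illustration, the closed-form δ(ε) of the Gaussian mechanism (the Balle–Wang formula, Eq. (2.3)) is straightforward to evaluate numerically. The sketch below uses only the Python standard library; the function names are ours, not from the paper.

```python
from math import erf, exp, sqrt

def std_normal_cdf(x: float) -> float:
    """CDF Phi of the standard univariate Gaussian."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gaussian_mechanism_delta(eps: float, sensitivity: float, sigma: float) -> float:
    """Tight delta(eps) for the Gaussian mechanism, Eq. (2.3):
    delta(eps) = Phi(-eps*sigma/Delta + Delta/(2*sigma))
                 - e^eps * Phi(-eps*sigma/Delta - Delta/(2*sigma))."""
    d = sensitivity
    return (std_normal_cdf(-eps * sigma / d + d / (2.0 * sigma))
            - exp(eps) * std_normal_cdf(-eps * sigma / d - d / (2.0 * sigma)))

# For Delta = sigma (i.e. mu = 1) and eps = 0 this reduces to
# Phi(1/2) - Phi(-1/2) ~= 0.3829.
```

For a 1-GDP mechanism (∆ = σ), `gaussian_mechanism_delta(0.0, 1.0, 1.0)` matches Φ(1/2) − Φ(−1/2), and δ(ε) decreases monotonically in ε, as the hockey-stick characterisation requires.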

2.2. INFORMAL DESCRIPTION: FILTRATIONS, SUPERMARTINGALES, STOPPING TIMES

Similarly to (34) and (12), we use the notions of filtrations and supermartingales for analysing fully adaptive compositions, where the individual worst-case pairs of distributions are not fixed but can be chosen adaptively based on the outcomes of the previous mechanisms. Given a probability space (Ω, F, P), a filtration (F_n)_{n∈N} of F is a sequence of σ-algebras satisfying (i) F_n ⊂ F_{n+1} for all n ∈ N, and (ii) F_n ⊂ F for all n ∈ N. In the context of the so-called natural filtration generated by a stochastic process X_t, t ∈ N, the σ-algebra F_n of the filtration represents all the information contained in the outcomes of the first n random variables (X_1, ..., X_n). The law of total expectation states that if F_n ⊂ F_{n+1}, then

    E( E(X | F_{n+1}) | F_n ) = E(X | F_n).   (2.4)

Thus, if F_0, ..., F_n is the natural filtration of a stochastic process X_0, ..., X_n, then E(X_n | F_0) = E( E( ... E(X_n | F_{n−1}) ... | F_1 ) | F_0 ). The supermartingale property means that for all n,

    E(X_n | F_{n−1}) ≤ X_{n−1}.   (2.5)

From the law of total expectation it then follows that for all n ∈ N, E(X_n | F_0) ≤ X_0. We follow the analysis of Feldman and Zrnic (12) and first set a maximum number of steps (denoted k) for the compositions. We do not release more information if a pre-defined privacy budget is exceeded. Informally speaking, the stochastic process X_n that we analyse represents the sum of the realised privacy loss up to step n and the budget remaining at that point. The privacy budget has to be constructed such that the amount of budget left at step n is included in the filtration F_{n−1}. This allows us to deduct the privacy loss of the adaptively chosen nth mechanism from the remaining budget. Mathematically, this means that the integration in E(X_n | F_{n−1}) is only with respect to the outputs of the nth mechanism. Consider, e.g.,
the differentially private version of the gradient descent (GD) method, where the amount of increase in the privacy budget depends on the gradient norms, which depend on the parameter values at step n − 1, i.e., they are included in F_{n−1}. Then, E(X_k | F_0) corresponds to the total privacy loss. If we can show that (2.5) holds for X_n, then by the law of total expectation the total privacy loss is at most X_0, the pre-defined budget. In our case the total budget X_0 will equal the (ε, δ)-curve of a µ-Gaussian DP mechanism, where µ determines the total privacy budget, and E[X_T], the expectation of the consumed privacy loss at step T, will equal the (ε, δ)-curve of the fully adaptive composition to be analysed.

A discrete-valued stopping time τ is a random variable in the probability space (Ω, F, {F_t}, P) with values in N which gives a decision of when to stop. It must be based only on the information present at time t, i.e., it has to hold that {τ = t} ∈ F_t. The optional stopping theorem states that if the stochastic process X_n is a supermartingale and T is a stopping time, then E[X_T] ≤ E[X_0]. In the analysis of fully adaptive compositions, the stopping time T will equal the step at which the privacy budget is about to exceed the limit B. Then, only the outputs of the (adaptively selected) mechanisms up to step T are released, and from the optional stopping theorem it follows that E[X_T] ≤ X_0.
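The role of the optional stopping theorem can be illustrated with a toy Monte Carlo experiment (our own illustration, not from the paper): simulate a supermartingale with adaptively chosen, downward-biased increments, stop it at a data-dependent time, and check that the stopped mean does not exceed X_0.

```python
import random

random.seed(0)

def stopped_value(x0: float = 0.0, k: int = 50) -> float:
    """One trajectory of a toy supermartingale X_n = X_{n-1} + xi_n, where
    each increment has conditional mean <= 0, stopped the first time X_n
    drops below -1 (a stopping time: it depends only on the past) or at
    the horizon k."""
    x = x0
    for _ in range(k):
        # Adaptive choice: the step scale depends on the history, but the
        # conditional mean of the increment stays <= 0.
        scale = 1.0 if x > -0.5 else 0.5
        x += random.gauss(-0.05, scale)
        if x < -1.0:
            break
    return x

n_trials = 5000
estimate = sum(stopped_value() for _ in range(n_trials)) / n_trials
# Optional stopping: E[X_T] <= X_0 = 0, up to Monte Carlo error.
```

The privacy analysis uses exactly this pattern, with X_n the hockey-stick supermartingale of Section 3 and T the step at which the filter halts.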

3. FULLY ADAPTIVE COMPOSITIONS

In order to compute tight δ(ε)-bounds for fully adaptive compositions, we determine a suitable supermartingale that gives us the analogues of the RDP results of (12).

3.1. NOTATION AND THE EXISTING ANALYSIS

Similarly to Feldman and Zrnic (12), we denote the mechanism corresponding to the fully adaptive composition of the first n mechanisms as

    M^(n)(X) = ( M_1(X), M_2(M_1(X), X), ..., M_n(M_1(X), ..., M_{n−1}(X), X) )

and the outcomes of M^(n)(X) as a^(n) = (a_1, ..., a_n). For datasets X and X′, define L^(n)_{X/X′} as

    L^(n)_{X/X′} = log [ P(M^(n)(X) = a^(n)) / P(M^(n)(X′) = a^(n)) ]

and, given a^(n−1), we define L^n_{X/X′} as the privacy loss of the mechanism M_n,

    L^n_{X/X′} = log [ P(M_n(a^(n−1), X) = a_n) / P(M_n(a^(n−1), X′) = a_n) ].

Using the Bayes rule it follows that

    L^(n)_{X/X′} = L^(n−1)_{X/X′} + L^n_{X/X′} = Σ_{m=1}^n L^m_{X/X′}.

Whitehouse et al. (34) obtain the advanced-composition-like (ε, δ)-privacy bounds for fully adaptive compositions via a certain privacy loss martingale. However, our approach is motivated by the analysis of Feldman and Zrnic (12). We review the main points of that analysis in Appendix A. The approach of Feldman and Zrnic (12) does not work directly in our case since the hockey-stick divergence does not factorise the way the Rényi divergence does. However, we can determine a certain random variable via the hockey-stick divergence and show that it has the desired properties in case the individual mechanisms M_i have dominating pairs of distributions that are Gaussians. As we show, this requirement is equivalent to them being Gaussian differentially private.
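The Bayes-rule decomposition L^(n) = Σ_m L^m can be sanity-checked in code for Gaussian steps, where the per-step privacy loss has the closed form L(t) = µ·t − µ²/2 (cf. Eq. (2.2)). The snippet below is our own illustration, not from the paper.

```python
from math import exp, log, pi, sqrt

def normal_pdf(t: float, mean: float) -> float:
    """Density of N(mean, 1)."""
    return exp(-0.5 * (t - mean) ** 2) / sqrt(2.0 * pi)

def priv_loss(t: float, mu: float) -> float:
    """Privacy loss log(P(t)/Q(t)) for P = N(mu, 1), Q = N(0, 1);
    it simplifies to mu*t - mu^2/2."""
    return mu * t - mu * mu / 2.0

# The closed form agrees with the log-density ratio...
mu1, mu2, t1, t2 = 0.7, 1.3, 0.25, -1.5
assert abs(priv_loss(t1, mu1) - log(normal_pdf(t1, mu1) / normal_pdf(t1, 0.0))) < 1e-12

# ...and, by the Bayes rule, the privacy loss of the two-step composition
# at (t1, t2) is the sum of the per-step privacy losses.
joint = log((normal_pdf(t1, mu1) * normal_pdf(t2, mu2))
            / (normal_pdf(t1, 0.0) * normal_pdf(t2, 0.0)))
assert abs(joint - (priv_loss(t1, mu1) + priv_loss(t2, mu2))) < 1e-12
```

Since t ∼ N(µ, 1) makes µ·t − µ²/2 a N(µ²/2, µ²) random variable, this also recovers the Gaussian PLD of Eq. (2.2).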

3.2. GAUSSIAN DIFFERENTIAL PRIVACY

Informally speaking, a randomised mechanism M is µ-GDP, µ ≥ 0, if for all neighbouring datasets the outcomes of M are no more distinguishable than two unit-variance Gaussians a distance µ apart (7). Commonly, the Gaussian differential privacy (GDP) is defined using so-called trade-off functions (7). For the purposes of this work, we equivalently formalise GDP using pairs of dominating distributions:

Lemma 10. A mechanism M is µ-GDP if and only if for all neighbouring datasets X, X′ and for all α > 0,

    H_α(M(X)||M(X′)) ≤ H_α( N(0, 1) || N(µ, 1) ).   (3.1)

Proof. By Corollary 2.13 of (7), a mechanism is µ-GDP if and only if it is (ε, δ)-DP for all ε ≥ 0, where

    δ(ε) = Φ( −ε/µ + µ/2 ) − e^ε Φ( −ε/µ − µ/2 ).

From (2.3) we see that this is equivalent to the fact that for all neighbouring datasets X, X′ and for all ε ≥ 0,

    H_{e^ε}(M(X)||M(X′)) ≤ H_{e^ε}( N(0, 1) || N(µ, 1) ).

By Lemma 31 of (37), H_α(M(X)||M(X′)) ≤ H_α(P||Q) for all α > 1 if and only if H_α(M(X)||M(X′)) ≤ H_α(Q||P) for all 0 < α ≤ 1. As P and Q are Gaussians, we see from the form of the privacy loss distribution (2.2) that H_α(Q||P) = H_α(P||Q) and that (3.1) holds for all α > 0.

3.3. GDP ANALYSIS OF FULLY ADAPTIVE COMPOSITIONS

Analogously to the individual RDP parameters (A.1), we define the conditional GDP parameters as

    µ_m = inf{ µ ≥ 0 : M_m(·, a^(m−1)) is µ-GDP }.   (3.2)

By Lemma 10 above, in particular, this means that for all neighbouring datasets X, X′ and for all α > 0,

    H_α( M_m(X, a^(m−1)) || M_m(X′, a^(m−1)) ) ≤ H_α( N(µ_m, 1) || N(0, 1) ).

Notice that for all m, the GDP parameter µ_m depends on the history a^(m−1) and is therefore a random variable, similarly to the conditional RDP values ρ_m defined in (A.1).

Example: private GD. Suppose each mechanism M_i, i ∈ [k], is of the form

    M_i(X, a) = Σ_{x∈X} f(x, a) + N(0, σ² I).

Since the hockey-stick divergence is scaling invariant, and since the sensitivity of the deterministic part of M_m(X, a^(m−1)) is max_{x∈X} ∥f(x, a^(m−1))∥_2, we have that

    µ_m = max_{x∈X} ∥f(x, a^(m−1))∥_2 / σ.

We now give the main theorem, which is a GDP equivalent of (Thm. 3.1, 12).

Theorem 11. Let k denote the maximum number of compositions. Suppose that, almost surely,

    Σ_{m=1}^k µ_m² ≤ B².

Then, M^(k)(X) is B-GDP.

Proof. We describe here the main points; a more detailed proof is given in Appendix B. We remark that an alternative proof of this result is given in the independent and concurrent work of Smith and Thakurta (29). First, recall the notation from Section 3.1: L^(k)_{X/X′} denotes the privacy loss between M^(k)(X) and M^(k)(X′) with outputs a^(k). Let ε ∈ R. Our proof is based on showing the supermartingale property for the random variable M_n(ε), n ∈ [k], defined as

    M_k(ε) = [1 − e^{ε − L^(k)_{X/X′}}]_+,
    M_n(ε) = E_{t∼R_n} [1 − e^{ε − L^(n)_{X/X′} − L_n(t)}]_+,  0 ≤ n ≤ k − 1,   (3.3)

where L_n(t) = log( R_n(t)/Q(t) ), R_n is the density function of N( (B² − Σ_{m=1}^n µ_m²)^{1/2}, 1 ) and Q is the density function of N(0, 1). In particular, M_0(ε) = E_{t∼R_0} [1 − e^{ε − L_0(t)}]_+, where L_0(t) = log( R_0(t)/Q(t) ) and R_0 is the density function of N(B, 1). This means that M_0(ε) gives δ(ε) for a B-GDP mechanism.
Let F_n denote the natural filtration σ(a^(n)). First, we need to show that E[M_k(ε) | F_{k−1}] ≤ M_{k−1}(ε). Since the pair of distributions ( N(µ_k, 1), N(0, 1) ) dominates the mechanism M_k, we have by the Bayes rule and Corollary 7,

    E[ M_k(ε) | F_{k−1} ] = E_{a_k∼M_k}[ [1 − e^{ε − L^(k−1)_{X/X′} − L^k_{X/X′}}]_+ | F_{k−1} ]
                          ≤ E_{t∼P_k} [1 − e^{ε − L^(k−1)_{X/X′} − L_k(t)}]_+
                          ≤ M_{k−1}(ε),   (3.4)

where L_k(t) = log( P_k(t)/Q(t) ), P_k is the density function of N(µ_k, 1) and Q is the density function of N(0, 1). Above we have also used the fact that L^(k−1)_{X/X′} ∈ F_{k−1}. The last inequality follows from the fact that Σ_{m=1}^k µ_m² ≤ B² a.s., i.e., µ_k ≤ (B² − Σ_{m=1}^{k−1} µ_m²)^{1/2} a.s., and from the data-processing inequality. Moreover, we see that (B² − Σ_{m=1}^{k−1} µ_m²)^{1/2} ∈ F_{k−2}. Thus we can repeat (3.4), use the fact that a composition of µ_1-GDP and µ_2-GDP mechanisms is (µ_1² + µ_2²)^{1/2}-GDP (Cor. 3.3, 7), and see by induction that M_n(ε) is a supermartingale. By the law of total expectation (2.4), E[M_k(ε)] ≤ M_0(ε). By Theorem 6, E[M_k(ε)] = H_{e^ε}( M^(k)(X) || M^(k)(X′) ) and M_0(ε) = H_{e^ε}( N(B, 1) || N(0, 1) ). As ε was taken to be an arbitrary real number, the inequality E[M_k(ε)] ≤ M_0(ε) holds for all ε ∈ R, and by Lemma 10, M^(k)(X) is B-GDP.
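For the private GD example above, the conditional GDP parameter µ_m is simply the largest per-example gradient norm divided by σ, and Theorem 11 only requires tracking the sum of the squares. A minimal sketch (our own, with hypothetical toy gradients):

```python
from math import sqrt

def step_gdp_param(per_example_grads: list[list[float]], sigma: float) -> float:
    """Conditional GDP parameter of one private GD step:
    mu_m = max_x ||f(x, a^(m-1))||_2 / sigma."""
    max_norm = max(sqrt(sum(g * g for g in grad)) for grad in per_example_grads)
    return max_norm / sigma

def within_budget(mus: list[float], B: float) -> bool:
    """The condition sum_m mu_m^2 <= B^2 of Theorem 11."""
    return sum(m * m for m in mus) <= B * B

# Toy step: two per-example gradients, noise scale sigma = 5;
# the largest gradient norm is ||(3, 4)||_2 = 5, so mu = 1.
mu = step_gdp_param([[3.0, 4.0], [0.0, 1.0]], sigma=5.0)
```

In a real DP-GD loop one would compute `mu` at every step (after clipping) and stop releasing once `within_budget` fails, which is exactly the filter of Section 4.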

4. INDIVIDUAL GDP FILTER

Similarly to (12), we can determine an individual GDP privacy filter that keeps track of individual privacy losses and adaptively drops the data elements for which the cumulative privacy loss is about to cross the pre-determined budget (Alg. 1). First, we need to define a GDP filter:

    F_B(µ_1, ..., µ_t) = HALT, if Σ_{i=1}^t µ_i² > B²; CONT, else.   (4.1)

Also, similarly to (12), we define S(x_i, n) as the set of dataset pairs (S, S′), where |S| = n and S′ is obtained from S by deleting the data element x_i from S.

Algorithm 1 Individual GDP Filter Algorithm

Input: budget B, maximum number of compositions k, initial value a_0.
for j = 1, ..., k do
    For each i ∈ [N], find µ^(i)_j ≥ 0 such that for all α > 0 and for all (S, S′) ∈ S(x_i, n):
        H_α( M_j(S, a^(j−1)) || M_j(S′, a^(j−1)) ) ≤ H_α( N(µ^(i)_j, 1) || N(0, 1) ).   (4.2)
    Define the active set S_j = { x_i : F_B(µ^(i)_1, ..., µ^(i)_j) = CONT }.
    For all x_i: set µ^(i)_j = µ^(i)_j · 1{x_i ∈ S_j}.
    Compute a_j = M_j(a^(j−1), S_j).
end for
return a^(k).

Using Theorem 11 and the supermartingale property of a personalised version of M_n(ε) (Eq. (3.3)), we can show that the output of Alg. 1 is B-GDP.

Theorem 12. Denote by M the output of Algorithm 1. Then M is B-GDP under the remove neighbourhood relation, meaning that for all datasets X ∈ X^N, for all i ∈ [N] and for all α > 0:

    max{ H_α( M(X) || M(X^{-i}) ), H_α( M(X^{-i}) || M(X) ) } ≤ H_α( N(B, 1) || N(0, 1) ).

Proof. The proof goes the same way as the proof of (Thm. 4.3, 12), which holds for the RDP filter. Let F_t denote the natural filtration σ(a^(t)), and let the privacy filter F_B be defined as in (4.1). We see that the random variable

    T = min{ min{ t : F_B(µ^(i)_1, ..., µ^(i)_{t+1}) = HALT }, k }

is a stopping time since {T = t} ∈ F_t. Let M_n(ε) be the random variable of Eq. (3.3) defined for the pair of datasets (X, X^{-i}) or (X^{-i}, X). From the optional stopping theorem (26) and the supermartingale property of M_n(ε) it follows that E[M_T(ε)] ≤ M_0(ε) for all ε ∈ R. By the reasoning of the proof of Thm. 11, Alg. 1 is B-GDP.
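The filter F_B of Eq. (4.1) and the active-set logic of Algorithm 1 reduce to a few lines of bookkeeping. The sketch below is our own simplification: it tracks per-element budgets only, while the mechanism itself and the computation of the per-element µ^(i)_j are application-specific and left out.

```python
def gdp_filter(mus: list[float], B: float) -> str:
    """F_B of Eq. (4.1): HALT once the squared GDP params exceed B^2."""
    return "HALT" if sum(m * m for m in mus) > B * B else "CONT"

def individual_filter_step(history: list[list[float]],
                           new_mus: list[float], B: float) -> list[int]:
    """One iteration of the Algorithm 1 bookkeeping.

    history[i] holds mu^(i)_1, ..., mu^(i)_{j-1} for element i; new_mus[i]
    is the candidate mu^(i)_j.  Returns the indices of the active set S_j;
    dropped elements contribute mu^(i)_j = 0 from this step on."""
    active = []
    for i, mu in enumerate(new_mus):
        if gdp_filter(history[i] + [mu], B) == "CONT":
            active.append(i)
            history[i].append(mu)
        else:
            history[i].append(0.0)
    return active

# Toy run with budget B = 1: element 1 (mu = 0.9 per step) exhausts its
# budget on the second step and is dropped, element 0 stays active.
hist = [[], []]
first = individual_filter_step(hist, [0.6, 0.9], B=1.0)
second = individual_filter_step(hist, [0.6, 0.9], B=1.0)
```

The released outputs are then computed only on the active set, matching the line a_j = M_j(a^(j−1), S_j) of Algorithm 1.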

4.1. BENEFITS OF GDP VS. RDP: MORE ITERATIONS FOR THE SAME PRIVACY

When we replace the RDP filter with a GDP filter for private GD, we get considerably smaller ε-values. As an example, consider the private GD experiment of Feldman and Zrnic (12) and set σ = 100 and the number of compositions k = 420 (this corresponds to a worst-case analysis with ε = 0.8 for δ = 10⁻⁵). When using GDP instead of RDP, we can run k = 495 iterations for an equal value of ε. Figure 3 (Section C.4) depicts the differences in (ε, δ)-values computed via RDP and GDP.
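Comparisons such as the one above require converting a GDP parameter µ into an ε at a fixed δ. Since δ(ε) = Φ(−ε/µ + µ/2) − e^ε Φ(−ε/µ − µ/2) is decreasing in ε, a simple bisection suffices; the sketch below is our own, not the paper's implementation.

```python
from math import erf, exp, sqrt

def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gdp_delta(mu: float, eps: float) -> float:
    """delta(eps) of a mu-GDP mechanism (Cor. 2.13 of Dong et al.)."""
    return (std_normal_cdf(-eps / mu + mu / 2.0)
            - exp(eps) * std_normal_cdf(-eps / mu - mu / 2.0))

def gdp_eps(mu: float, delta: float, lo: float = 0.0,
            hi: float = 50.0, iters: int = 200) -> float:
    """Smallest eps with delta(eps) <= delta, found by bisection
    (delta(eps) is monotonically decreasing in eps)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gdp_delta(mu, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, `gdp_eps(1.0, 1e-5)` returns the ε at which a 1-GDP mechanism is (ε, 10⁻⁵)-DP; evaluating `gdp_delta` at the returned ε recovers the target δ.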

5. APPROXIMATIVE (ε, δ)-FILTER VIA BLACKWELL'S THEOREM

We next consider a filter that can use any individual dominating pairs of distributions, not just Gaussians. To this end, we need to determine pairs of dominating distributions at each iteration.

Assumption. Given neighbouring datasets X, X′, we assume that for all i ∈ [n] we can determine a dominating pair of distributions (P_i, Q_i) such that for all α > 0,

    H_α( M_i(a^(i−1), X) || M_i(a^(i−1), X′) ) ≤ H_α( P_i || Q_i ).

A tightly dominating pair of distributions (P_i, Q_i) always exists (Proposition 8, 37), and on the other hand, determining such a pair is straightforward for the subsampled Gaussian mechanism, for example (see Lemma 8). For the so-called shufflers, such worst-case pairs can be obtained by post-processing (11). As we show in Appendix C.6, the orderings determined by the trade-off functions and by the hockey-stick divergence are equivalent. Therefore, from the Blackwell theorem (7, Thm. 2.10) it follows that there exists a stochastic transformation (Markov kernel) T such that T P_i = M_i(a^(i−1), X) and T Q_i = M_i(a^(i−1), X′).

First, we replace the GDP filter condition Σ_i µ_i² ≤ B² by the condition (µ > 0)

    E_{t_1∼P_1, ..., t_n∼P_n} [1 − e^{ε − Σ_{m=1}^n L_m(t_m)}]_+ ≤ H_{e^ε}( N(µ, 1) || N(0, 1) ) for all ε ∈ R.   (5.1)

By the Blackwell theorem there exists a stochastic transformation that maps N(µ, 1) and N(0, 1) to the product distributions P_1 × ... × P_n and Q_1 × ... × Q_n, respectively. From the data-processing inequality for the Rényi divergence we then have

    D_α( P_1 × ... × P_n || Q_1 × ... × Q_n ) ≤ D_α( N(µ, 1) || N(0, 1) )   (5.2)

for all α ≥ 1, where D_α denotes the Rényi divergence of order α. Since the pairs (P_i, Q_i), 1 ≤ i ≤ n, are the worst-case pairs also for RDP (as described above, due to the data-processing inequality), by (5.2) and the RDP filter results of Feldman and Zrnic (12), we have that for all α ≥ 1,

    D_α( M(X) || M(X′) ) ≤ D_α( N(µ, 1) || N(0, 1) ).   (5.3)

By converting the RDPs of Gaussians in (5.3) to (ε, δ)-bounds, this procedure provides (ε, δ)-upper bounds and can be straightforwardly modified into an individual PLD filter, as in the case of GDP. One difficulty, however, is how to compute the parameter µ in (5.1), given the individual pairs (P_i, Q_i), 1 ≤ i ≤ n. When the number of iterations is large, by the central limit theorem the PLD of the composition starts to resemble that of a Gaussian mechanism (30), and it is then easy to numerically approximate µ (see Fig. 1 for an example). It is well known that the (ε, δ)-bounds obtained via RDP are never tight, since the conversion from RDP to (ε, δ) is lossy (37). Moreover, often the computation of the RDP values themselves is lossy. In the procedure described here, the only loss comes from converting (5.3) to (ε, δ)-bounds. In Appendix D we show how to compute the individual PLDs numerically efficiently using FFT.

To illustrate the differences between the individual ε-values obtained with an RDP accountant and with our approximative PLD-based accountant, we consider DP-SGD training of a small feedforward network for MNIST classification. We choose randomly a subset of 1000 data elements and compute their individual ε-values (see Fig. 1). To compute the ε-values, we compare RDP accounting and our approach based on PLDs. We train for 50 epochs with batch size 300, noise parameter σ = 2.0 and clipping constant C = 5.0.
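One simple way to approximate the parameter µ in (5.1), assuming the composed PLD is already close to Gaussian (the CLT regime discussed above), is moment matching: a µ-GDP mechanism has PLD N(µ²/2, µ²), and PLD means (the per-step KL divergences) add up under composition. The helper below is our own heuristic sketch, not the paper's accountant.

```python
from math import sqrt

def fit_gdp_mu(step_pld_means: list[float]) -> float:
    """Moment-matching estimate of mu: the PLD of a mu-GDP mechanism is
    N(mu^2/2, mu^2), and the PLD means (KL divergences of the steps) add
    up under composition, so mu ~= sqrt(2 * sum of per-step PLD means)."""
    total_mean = sum(step_pld_means)
    return sqrt(2.0 * total_mean)

# The estimate is exact for Gaussian steps: composing steps with GDP
# parameters mu_i gives mu = sqrt(sum mu_i^2), and each step's PLD mean
# is mu_i^2 / 2.
step_mus = [0.1, 0.2, 0.2]
means = [m * m / 2.0 for m in step_mus]
```

For non-Gaussian dominating pairs this only gives a first-moment fit; the FFT-based computation of Appendix D is what produces the accurate individual bounds.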

6. EXPERIMENTS WITH MIMIC-III: GROUP-WISE ε-VALUES

For further illustration, we consider the phenotype classification task from a MIMIC-III benchmark library (16) on the clinical database MIMIC-III (17), freely available from PhysioNet (14). The task is a multi-label classification that aims to predict which of 25 acute care conditions are present in a patient's MIMIC-III record. We train a multi-layer perceptron to maximise the macro-averaged AUC-ROC, the task's primary metric. We train the model using DP-GD combined with the Adam optimizer, and use the individual GDP filtering Algorithm 1. See Appendix E for further details. To study the model behaviour between subgroups, we observe five non-overlapping groups of size 1000 from the train set and of size 400 from the test set, grouped by the present acute care condition: subgroup 0: no condition at all; subgroups 1 and 2: diagnosed/not diagnosed with pneumonia; subgroups 3 and 4: diagnosed/not diagnosed with acute myocardial infarction (heart attack). Similarly to Yu et al. (36), we see a correlation between individual ε-values and model accuracies across the subgroups: the groups with the best privacy protection (smallest average ε-values) also have the smallest average training and test losses. Fig. 2 shows that after running the filtered DP-GD beyond the worst-case ε-threshold for a number of iterations, both the training and test loss get smaller for the best performing group and larger for the other groups. Similarly to how DP-SGD has a disparate impact on model accuracies across subgroups (2), we find that while the individual filtering leads to more equal group-wise ε-values, it leads to even larger differences in model accuracies. Here, one could alternatively consider other than algorithmic solutions for balancing the privacy protection among subgroups, by, e.g., collecting more data from the groups with the weakest privacy protection according to the individual ε-values (36).
Finally, we observe negligible improvements of the macro-averaged AUC-ROC in the optimal hyperparameter regime when using filtered DP-GD, but, similarly to (12), improvements can be seen when choosing sub-optimal hyperparameters (see Appendix E.1).

7. CONCLUSIONS

To conclude, we have shown how to rigorously carry out fully adaptive analysis and individual DP accounting using the Gaussian DP. We have also proposed an approximative (ε, δ)-accountant that can utilise any dominating pairs of distributions, and shown how to implement it efficiently. As an application, we have studied the connection between group-wise individual privacy parameters and model accuracies when using DP-GD, and found that the filtering further amplifies the model accuracy imbalance between groups. An open question is how to carry out a tight fully adaptive analysis using arbitrary dominating pairs of distributions.

8. ETHICS STATEMENT

Our work is on improving differential privacy techniques, which contributes to the strong theoretical foundation of privacy-preserving machine learning, an essential component of trustworthy machine learning. Our method provides accurate estimates of individual privacy loss, therefore helping to evaluate the impact of privacy-preserving machine learning on individual privacy. Our experiments indicate that the filtered DP gradient descent has a disparate impact on subgroups of the data, and should therefore be used with caution. Our experiments use the MIMIC-III data set of pseudonymised health data by permission of the data providers. The data was processed according to the usage rules defined by the data providers, and all reported results are anonymised.

A EXISTING ANALYSIS USING RDP BY FELDMAN AND ZRNIC (12)

We next illustrate how the stochastic process X_n that is used to analyse fully adaptive compositions is determined in the case of RDP analysis (12). Central to the analysis is showing the supermartingale property (2.5) of X_n. The fully adaptive RDP analysis by Feldman and Zrnic (12) is based on studying the properties of the supermartingale M_n, which they define as

    M_n(X, X′) = Loss(a^(n), X, X′, α) · e^{−(α−1) Σ_{m=1}^n ρ_m},

where α ≥ 1,

    Loss(a^(n), X, X′, α) = [ P(M^(n)(X) = a^(n)) / P(M^(n)(X′) = a^(n)) ]^α,

and ρ_m gives the RDP of order α given M^(m−1)(X), i.e.,

    ρ_m = 1/(α − 1) · log sup_{(X,X′)∈S} E_{a^(m)∼M^(m)(X′)} [ ( P(M^(m)(X) = a^(m)) / P(M^(m)(X′) = a^(m)) )^α | a^(m−1) ],   (A.1)

where S is a pre-determined set of neighbouring datasets. In particular, the RDP bounds for the fully adaptive compositions are obtained by showing that M_n(X, X′) has the supermartingale property, meaning that

    E( M_n(X, X′) | F_{n−1} ) ≤ M_{n−1}(X, X′).   (A.2)

Feldman and Zrnic (12) show that from this property, and from the law of total expectation (2.4), it follows that if Σ_{i=1}^k ρ_i ≤ B almost surely, where k is the maximum number of compositions, then the fully adaptive composition is (α, B)-RDP (Thm. 3.1, 12). Due to the factorisability of the Rényi divergence, the property (A.2) is straightforward to show for the random variable M_n(X, X′) using the Bayes theorem:

    E( M_n(X, X′) | F_{n−1} ) = E( Loss(a^(n), X, X′, α) · e^{−(α−1) Σ_{m=1}^n ρ_m} | F_{n−1} )
    = E[ ( P(M_n(X) = a_n | a^(n−1)) / P(M_n(X′) = a_n | a^(n−1)) )^α | F_{n−1} ] · Loss(a^(n−1), X, X′, α) · e^{−(α−1) Σ_{m=1}^n ρ_m},   (A.3)

since ρ_1, ..., ρ_n ∈ F_{n−1} and Loss(a^(n−1), X, X′, α) ∈ F_{n−1}. Moreover, as M_n is ρ_n-RDP,

    E[ ( P(M_n(X) = a_n | a^(n−1)) / P(M_n(X′) = a_n | a^(n−1)) )^α | F_{n−1} ] ≤ e^{(α−1)ρ_n},

and the supermartingale property E( M_n(X, X′) | F_{n−1} ) ≤ M_{n−1}(X, X′) follows from (A.3). As the hockey-stick divergence does not factorise in this way, we need to take another approach to obtain the required supermartingale.
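The Gaussian Rényi divergences that appear throughout the RDP analysis have the closed form D_α(N(µ, 1)||N(0, 1)) = αµ²/2. As a sanity check (our own, using a plain midpoint quadrature of the defining integral), this can be verified numerically:

```python
from math import exp, log, pi

def renyi_gaussian_numeric(mu: float, alpha: float,
                           lo: float = -30.0, hi: float = 30.0,
                           n: int = 120000) -> float:
    """D_alpha(N(mu,1) || N(0,1)) computed as
    (1/(alpha-1)) * log( integral of P(t)^alpha * Q(t)^(1-alpha) dt )
    with a midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * h
        log_p = -0.5 * (t - mu) ** 2 - 0.5 * log(2.0 * pi)  # log N(t; mu, 1)
        log_q = -0.5 * t * t - 0.5 * log(2.0 * pi)           # log N(t; 0, 1)
        total += exp(alpha * log_p + (1.0 - alpha) * log_q) * h
    return log(total) / (alpha - 1.0)

# Closed form for comparison: alpha * mu^2 / 2.
```

For instance, µ = 1, α = 2 gives a value close to 1, matching αµ²/2; the same closed form underlies the per-step RDP bounds ρ_m for Gaussian mechanisms.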

B MAIN THEOREM

Theorem B.1. Let k denote the maximum number of compositions. Suppose that, almost surely,

    Σ_{m=1}^k µ_m² ≤ B².   (B.1)

Then, M^(k)(X) is B-GDP.

Proof. First, recall the notation from Section 3.1: L^(k)_{X/X′} denotes the privacy loss between M^(k)(X) and M^(k)(X′) with outputs a^(k). Let ε ∈ R. Our proof is based on showing the supermartingale property for the random variable M_n(ε), n ∈ [k], defined as

    M_k(ε) = [1 − e^{ε − L^(k)_{X/X′}}]_+,
    M_n(ε) = E_{t∼R_n} [1 − e^{ε − L^(n)_{X/X′} − L_n(t)}]_+,  0 ≤ n ≤ k − 1,   (B.2)

where L_n(t) = log( R_n(t)/Q(t) ), R_n is the density function of N( (B² − Σ_{m=1}^n µ_m²)^{1/2}, 1 ) and Q is the density function of N(0, 1). Moreover, M_0(ε) = E_{t∼R_0} [1 − e^{ε − L_0(t)}]_+, where L_0(t) = log( R_0(t)/Q(t) ) and R_0 is the density function of N(B, 1). Notice that in particular this means that M_0(ε) gives δ(ε) for a B-GDP mechanism.


We next show that $\mathbb{E}\big[ M_k(\varepsilon) \,\big|\, \mathcal{F}_{k-1} \big] \le M_{k-1}(\varepsilon)$ for all $k$; the supermartingale property then follows by induction. Since the pair of distributions $\big( \mathcal{N}(\mu_k, 1), \mathcal{N}(0, 1) \big)$ dominates the mechanism $M_k$, we have by the Bayes rule and Corollary 7 that
$$
\begin{aligned}
\mathbb{E}\big[ M_k(\varepsilon) \,\big|\, \mathcal{F}_{k-1} \big]
&= \mathbb{E}_{a^{(k)} \sim M^{(k)}} \Big[ \big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k)}_{X/X'}} \big]_+ \,\Big|\, \mathcal{F}_{k-1} \Big] \\
&= \mathbb{E}_{a_k \sim M_k} \Big[ \big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-1)}_{X/X'} - \mathcal{L}^{k}_{X/X'}} \big]_+ \,\Big|\, \mathcal{F}_{k-1} \Big] \\
&\le \mathbb{E}_{t \sim P_k} \Big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-1)}_{X/X'} - L_k(t)} \Big]_+,
\end{aligned}
$$
where $L_k(t) = \log P_k(t)/Q(t)$, $P_k$ is the density function of $\mathcal{N}(\mu_k, 1)$ and $Q$ is the density function of $\mathcal{N}(0, 1)$. Above we have also used the fact that $\mathcal{L}^{(k-1)}_{X/X'} \in \mathcal{F}_{k-1}$. Since $\sum_{m=1}^{k} \mu_m^2 \le B^2$ almost surely, i.e., $\mu_k \le \sqrt{B^2 - \sum_{m=1}^{k-1} \mu_m^2}$ almost surely, by the data-processing inequality for the $\alpha$-divergence we have that, almost surely,
$$
\mathbb{E}_{t \sim P_k} \Big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-1)}_{X/X'} - L_k(t)} \Big]_+ \le \mathbb{E}_{t \sim R_{k-1}} \Big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-1)}_{X/X'} - L_{k-1}(t)} \Big]_+ = M_{k-1}(\varepsilon),
$$
where $L_{k-1}(t) = \log R_{k-1}(t)/Q(t)$ and $R_{k-1}$ is the density function of $\mathcal{N}\big( \sqrt{B^2 - \sum_{m=1}^{k-1} \mu_m^2},\, 1 \big)$. Therefore, $\mathbb{E}\big[ M_k(\varepsilon) \,\big|\, \mathcal{F}_{k-1} \big] \le M_{k-1}(\varepsilon)$. Since $\mu_1, \ldots, \mu_{k-1} \in \mathcal{F}_{k-2}$, we have that, almost surely,
$$
\begin{aligned}
\mathbb{E}\big[ M_{k-1}(\varepsilon) \,\big|\, \mathcal{F}_{k-2} \big]
&= \mathbb{E}_{a^{(k-1)} \sim M^{(k-1)}} \Big[ \mathbb{E}_{t \sim R_{k-1}} \big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-1)}_{X/X'} - L_{k-1}(t)} \big]_+ \,\Big|\, \mathcal{F}_{k-2} \Big] \\
&= \mathbb{E}_{a_{k-1} \sim M_{k-1}} \Big[ \mathbb{E}_{t \sim R_{k-1}} \big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-2)}_{X/X'} - \mathcal{L}^{k-1}_{X/X'} - L_{k-1}(t)} \big]_+ \,\Big|\, \mathcal{F}_{k-2} \Big] \\
&= \mathbb{E}_{t \sim R_{k-1}} \Big[ \mathbb{E}_{a_{k-1} \sim M_{k-1}} \big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-2)}_{X/X'} - \mathcal{L}^{k-1}_{X/X'} - L_{k-1}(t)} \big]_+ \,\Big|\, \mathcal{F}_{k-2} \Big] \\
&\le \mathbb{E}_{t \sim R_{k-1}} \Big[ \mathbb{E}_{t_{k-1} \sim P_{k-1}} \big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-2)}_{X/X'} - L_{k-1}(t_{k-1}) - L_{k-1}(t)} \big]_+ \Big] \\
&= \mathbb{E}_{t \sim R_{k-2}} \Big[ 1 - \mathrm{e}^{\varepsilon - \mathcal{L}^{(k-2)}_{X/X'} - L_{k-2}(t)} \Big]_+ = M_{k-2}(\varepsilon),
\end{aligned} \tag{B.3}
$$
where $L_{k-1}(t) = \log P_{k-1}(t)/Q(t)$, $P_{k-1}$ is the density function of $\mathcal{N}(\mu_{k-1}, 1)$ and $Q$ is the density function of $\mathcal{N}(0, 1)$. In the inequality step we use Corollary 7 and the fact that the pair of distributions $\big( \mathcal{N}(\mu_{k-1}, 1), \mathcal{N}(0, 1) \big)$ dominates the mechanism $M_{k-1}$.
In the second to last step we have also used the fact that if $\widetilde{P}_1 \sim \mathcal{N}(\widetilde{\mu}_1, 1)$, $\widetilde{P}_2 \sim \mathcal{N}(\widetilde{\mu}_2, 1)$ and $Q \sim \mathcal{N}(0, 1)$, and $\widetilde{L}_1(t) = \log \widetilde{P}_1(t)/Q(t)$ and $\widetilde{L}_2(t) = \log \widetilde{P}_2(t)/Q(t)$, then
$$
\mathbb{E}_{t_1 \sim \widetilde{P}_1} \mathbb{E}_{t_2 \sim \widetilde{P}_2} \Big[ 1 - \mathrm{e}^{\varepsilon - \widetilde{L}_1(t_1) - \widetilde{L}_2(t_2)} \Big]_+ = \mathbb{E}_{t \sim \widetilde{P}_3} \Big[ 1 - \mathrm{e}^{\varepsilon - \widetilde{L}_3(t)} \Big]_+,
$$
where $\widetilde{P}_3 \sim \mathcal{N}\big( \sqrt{\widetilde{\mu}_1^2 + \widetilde{\mu}_2^2},\, 1 \big)$ and $\widetilde{L}_3(t) = \log \widetilde{P}_3(t)/Q(t)$. This follows directly from the facts that the PLDs determined by the pairs of distributions $(\widetilde{P}_1, Q)$ and $(\widetilde{P}_2, Q)$ are Gaussians (see Eq. (2.2)) and that the convolution of two Gaussians is a Gaussian. By induction, we see from (B.3) that the supermartingale property holds for the random variable $M_n(\varepsilon)$. By the law of total expectation (2.4), $\mathbb{E}[M_k(\varepsilon)] \le M_0(\varepsilon)$. By Theorem 6, $\mathbb{E}[M_k(\varepsilon)] = H_{\mathrm{e}^\varepsilon}\big( M^{(k)}(X) \,\|\, M^{(k)}(X') \big)$ and $M_0(\varepsilon) = H_{\mathrm{e}^\varepsilon}\big( \mathcal{N}(B, 1) \,\|\, \mathcal{N}(0, 1) \big)$. As $\varepsilon$ was taken to be an arbitrary real number, the inequality $\mathbb{E}[M_k(\varepsilon)] \le M_0(\varepsilon)$ holds for all $\varepsilon \in \mathbb{R}$, and by Lemma 10 we see that $M^{(k)}(X)$ is $B$-GDP.
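The Gaussian-convolution fact used above can also be checked numerically. The following sketch (ours, not part of the proof) discretises the PLDs of $(\mathcal{N}(\mu_1,1), \mathcal{N}(0,1))$ and $(\mathcal{N}(\mu_2,1), \mathcal{N}(0,1))$, which are Gaussians $\mathcal{N}(\mu^2/2, \mu^2)$, convolves them, and compares the resulting $\delta(\varepsilon)$ with the closed-form $\delta(\varepsilon)$ of a $\sqrt{\mu_1^2 + \mu_2^2}$-GDP mechanism:

```python
import numpy as np
from math import erf, exp, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gdp_delta(eps, mu):
    """Closed-form delta(eps) of a mu-GDP mechanism."""
    return Phi(-eps / mu + mu / 2) - exp(eps) * Phi(-eps / mu - mu / 2)

# Discretised PLD of the pair (N(mu,1), N(0,1)): a Gaussian N(mu^2/2, mu^2).
L, n = 20.0, 2**12 + 1
s = np.linspace(-L, L, n)
ds = s[1] - s[0]
def pld(mu):
    return np.exp(-(s - mu**2 / 2)**2 / (2 * mu**2)) / np.sqrt(2 * np.pi * mu**2) * ds

mu1, mu2 = 0.6, 0.8
# Composition = convolution of the two PLDs, re-centred on the same grid.
full = np.convolve(pld(mu1), pld(mu2))
composed = full[(n - 1) // 2 : (n - 1) // 2 + n]
# delta(eps) from the composed PLD via the discrete hockey-stick sum;
# it matches the closed form for mu = sqrt(mu1^2 + mu2^2) = 1.
eps = 1.0
delta_numeric = float(np.sum(np.maximum(0.0, 1.0 - np.exp(eps - s)) * composed))
print(delta_numeric, gdp_delta(eps, sqrt(mu1**2 + mu2**2)))
```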

C FILTERS AND ODOMETERS

We give here additional details on the GDP filters and also briefly discuss the implementation of GDP privacy odometers.

C.1 GDP -PRIVACY FILTER

For simplicity, we here consider a GDP filter that chooses the privacy parameters adaptively, but not individually (in contrast to the individual filter in the main text). That is, the amount of privacy budget spent at each step has to provide a guarantee over the whole dataset. To this end, we formally define a GDP filter as
$$
\mathcal{F}_B(\mu_1, \ldots, \mu_t) =
\begin{cases}
\mathrm{HALT}, & \text{if } \sum_{i=1}^{t} \mu_i^2 > B^2, \\
\mathrm{CONT}, & \text{otherwise.}
\end{cases}
$$
Using the filter $\mathcal{F}_B$, a GDP filter algorithm is given as in Alg. 2.

Algorithm 2 GDP Filter Algorithm

Input: budget $B$, maximum number of compositions $k$, initial value $a_0$.
for $j = 1, \ldots, k$ do
    Find parameter $\mu_j \ge 0$ such that $M_j(\cdot, a^{(j-1)})$ is $\mu_j$-GDP.
    if $\sum_{\ell=1}^{j} \mu_\ell^2 > B^2$ then BREAK
    else $a_j = M_j(X, a^{(j-1)})$
end for
return $a^{(j-1)}$.

In principle, the supermartingale property of the random variable $M_n(\varepsilon)$, as defined in (3.3), is sufficient to show that Algorithm 2 is $B$-GDP. The only difference is that the algorithm can stop at a random time. To include that feature in the analysis, we need to use the optional stopping theorem.

Theorem C.1. Denote by $\mathcal{M}$ the output of Algorithm 2. Then $\mathcal{M}$ is $B$-GDP under the remove neighbourhood relation, meaning that for all datasets $X \in \mathcal{X}^N$, for all $i \in [N]$ and for all $\alpha > 0$:
$$
\max\big\{ H_\alpha\big( \mathcal{M}(X) \,\|\, \mathcal{M}(X_{-i}) \big),\, H_\alpha\big( \mathcal{M}(X_{-i}) \,\|\, \mathcal{M}(X) \big) \big\} \le H_\alpha\big( \mathcal{N}(B, 1) \,\|\, \mathcal{N}(0, 1) \big). \tag{C.1}
$$

Proof. The proof goes exactly as the proof of (Thm. 4.3, 12), which holds for the RDP filter. Using the fact that for all $t \ge 0$, $\mu_{t+1} \in \mathcal{F}_t$, where $\mathcal{F}_t$ is the natural filtration $\sigma(a^{(t)})$, we see that the random variable $T = \min\{t : \mathcal{F}_B(\mu_1, \ldots, \mu_{t+1}) = \mathrm{HALT}\} \wedge k$ is a stopping time, since $\{T = t\} \in \mathcal{F}_t$ as $\mu_{t+1} \in \mathcal{F}_t$. Let $M_n(\varepsilon)$ be the random variable of Eq. (3.3) defined for the pair of datasets $(X, X_{-i})$ or $(X_{-i}, X)$. From the optional stopping theorem and the supermartingale property it then follows that for all $\varepsilon \in \mathbb{R}$, $\mathbb{E}[M_T(\varepsilon)] \le M_0(\varepsilon)$, which by the reasoning of the proof of Thm. 11 shows that (C.1) holds, i.e., the output of Alg. 2 is $B$-GDP w.r.t. the removal neighbourhood relation of datasets.

A benefit of the GDP filter compared to the RDP filter is that we obtain tight $(\varepsilon, \delta(\varepsilon))$-bounds for adaptive compositions of Gaussian mechanisms. Moreover, from Thm. 10 it follows that these tight $(\varepsilon, \delta(\varepsilon))$-DP bounds can be obtained by an analytic formula:

Corollary C.2. For a GDP budget $B > 0$, the outputs of Algorithms 1 and 2 are $(\varepsilon, \delta(\varepsilon))$-DP for all $\varepsilon \ge 0$, where
$$
\delta(\varepsilon) = \Phi\Big( -\frac{\varepsilon}{B} + \frac{B}{2} \Big) - \mathrm{e}^{\varepsilon}\, \Phi\Big( -\frac{\varepsilon}{B} - \frac{B}{2} \Big). \tag{C.2}
$$
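A minimal Python sketch of the filter logic in Alg. 2 together with the bound (C.2); the per-step mechanisms and their $\mu_j$-values below are hypothetical stand-ins, and only the budget bookkeeping and the $\delta(\varepsilon)$ formula follow the text:

```python
import math

def gdp_delta(eps, B):
    """delta(eps) of a B-GDP mechanism, Eq. (C.2)."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return Phi(-eps / B + B / 2) - math.exp(eps) * Phi(-eps / B - B / 2)

def gdp_filter_run(mechanisms, B):
    """Sketch of Alg. 2: run adaptively chosen mechanisms M_j until the
    squared GDP budget B^2 would be exceeded.  Each element of `mechanisms`
    is a (hypothetical) callable returning (mu_j, output_j) given the
    previous outputs."""
    outputs, spent_sq = [], 0.0
    for M in mechanisms:
        mu_j, a_j = M(outputs)
        if spent_sq + mu_j**2 > B**2:   # the filter F_B returns HALT
            break
        spent_sq += mu_j**2
        outputs.append(a_j)
    return outputs

# Example: three steps with mu = 0.5 each against a budget B = 0.8;
# only two steps fit, since 3 * 0.25 > 0.64.
steps = [lambda outs: (0.5, len(outs))] * 3
print(len(gdp_filter_run(steps, 0.8)))    # -> 2
print(round(gdp_delta(1.0, 0.8), 6))
```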

C.2 GDP PARAMETERS FOR THE INDIVIDUAL FILTERING OF PRIVATE GD

Notice that we have the following for the individual GDP filtering. Suppose each mechanism $M_i$, $i \in [k]$, in the sequence is of the form $M_i(X, a) = \sum_{x \in X} f(x, a) + \mathcal{N}(0, \sigma^2)$. Since the hockey-stick divergence is scaling invariant and since the sensitivity of $\sum_{x \in X} f(x, a^{(j-1)})$ w.r.t. the removal of $x_i$ is $\|f(x_i, a^{(j-1)})\|_2$, we have that $\mu^{(i)}_j = \|f(x_i, a^{(j-1)})\|_2 / \sigma$.

C.3 TIGHT BOUNDS FOR THE GAUSSIAN MECHANISM

When running e.g. the DP-GD algorithm and using either the filtering of Alg. 2 or the individual filtering of Alg. 1, each individual data element can be made to fully consume its privacy budget by an appropriate scaling of the gradients. This scaling for individual filtering is given in (Algorithm 3, 12).

Remark C.3. Suppose we use the Gaussian mechanism and scale the noise for each data element $x_i$, $i \in [N]$, at the last step such that the GDP budget is fully consumed, i.e., $\big( \sum_j (\mu^{(i)}_j)^2 \big)^{1/2} = B$. Then the resulting algorithm is tightly $(\varepsilon, \delta)$-DP for $\delta(\varepsilon)$ given by the expression (C.2), in the sense that for all $i \in [N]$,
$$
\max\big\{ H_{\mathrm{e}^\varepsilon}\big( M^{(k)}(X) \,\|\, M^{(k)}(X_{-i}) \big),\, H_{\mathrm{e}^\varepsilon}\big( M^{(k)}(X_{-i}) \,\|\, M^{(k)}(X) \big) \big\} = \delta(\varepsilon).
$$

C.4 BENEFITS OF GDP VS. RDP FILTERING

To experimentally illustrate the benefits of GDP accounting, consider one of the private GD experiments of (12), where $\sigma = 100$ and the number of compositions corresponding to the worst-case analysis is $k = 420$. The RDP value of order $\alpha$ corresponding to this composition is then $\alpha/(2 \widetilde{\sigma}^2)$, where $\widetilde{\sigma} = \sigma/\sqrt{k}$. Figure 3 shows the $(\varepsilon, \delta)$-values computed via RDP and GDP. To get the $(\varepsilon, \delta)$-values from the RDP values, we use the conversion formula of Lemma C.4 below, minimising over the RDP order $\lambda$ the values given by (C.3). When using GDP instead of RDP, we can run $k = 495$ iterations instead of the $k = 420$ iterations for an equal privacy budget of $\varepsilon = 0.8$, when $\delta = 10^{-5}$. This means that when we replace the RDP filter with the GDP filter for the private GD, we get roughly 10 percent smaller $\varepsilon$-values.

Lemma C.4 (Canonne et al., 4). Suppose the mechanism $\mathcal{M}$ is $(\lambda, \varepsilon')$-RDP. Then $\mathcal{M}$ is also $(\varepsilon, \delta(\varepsilon))$-DP for arbitrary $\varepsilon \ge 0$ with
$$
\delta(\varepsilon) = \frac{\exp\big( (\lambda - 1)(\varepsilon' - \varepsilon) \big)}{\lambda} \left( 1 - \frac{1}{\lambda} \right)^{\lambda - 1}. \tag{C.3}
$$
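The RDP-to-DP conversion just described can be sketched as follows (our illustration; the grid of RDP orders is a simple hypothetical choice and should cover the minimiser):

```python
import math

def rdp_to_dp_delta(lam, eps_prime, eps):
    """delta(eps) for a (lam, eps')-RDP mechanism via Lemma C.4."""
    return math.exp((lam - 1) * (eps_prime - eps)) / lam * (1 - 1 / lam)**(lam - 1)

def best_delta(eps, rdp_curve, lambdas):
    """Minimise the conversion over a grid of RDP orders lambda > 1;
    rdp_curve(lam) gives the RDP eps' of the composition at order lam."""
    return min(rdp_to_dp_delta(lam, rdp_curve(lam), eps) for lam in lambdas)

# Example: k compositions of a Gaussian mechanism with noise sigma have
# RDP eps'(lam) = k * lam / (2 * sigma^2), as in the experiment above.
k, sigma = 420, 100.0
rdp_curve = lambda lam: k * lam / (2 * sigma**2)
lambdas = [1.0 + 0.01 * j for j in range(1, 6400)]
delta = best_delta(0.8, rdp_curve, lambdas)
print(delta)   # on the order of 1e-5, consistent with the text
```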

C.4.1 FURTHER COMPARISONS OF RDP AND GDP

Figures 4 and 5 further illustrate the differences between RDP and GDP accounting for filtering. Figure 4 shows the effect of the number of compositions $k$ when $\sigma = 100$, and Figure 5 illustrates the maximum number of compositions for a given privacy budget $\varepsilon$, when $\sigma = 100$ and $\delta = 10^{-5}$.

C.5 GDP PRIVACY ODOMETERS

We here also briefly comment on privacy odometers, considered e.g. by (28; 12; 21). In practice, one might want to track the privacy loss incurred so far. Rogers et al. (28) were the first to formalise this in terms of a privacy odometer. Feldman and Zrnic (12) utilise a sequence of valid Rényi privacy filters such that a fixed sequence of privacy budgets $B_1, B_2, \ldots$ determines random stopping times $T_1, T_2, \ldots$ such that the privacy spent up to time $T_i$ is at most $B_i$. By assuming, for example, that for all $i$, $B_{i+1} - B_i = \Delta$ for a fixed discretisation parameter $\Delta > 0$, we may employ the RDP filter such that whenever the privacy budget counter crosses $\Delta$ (suppose for the $m$th time) we release the sequence $a^{(T_m)}$ and re-initialise the privacy loss counter. The fact that $a^{(T_m)}$ is $m\Delta$-RDP follows directly from the RDP results that hold for the filters. With GDP, we can construct in exactly the same way an algorithm that always outputs the state after every predetermined amount of GDP budget $\Delta_i$ is spent. If at round $i$ we spend a $\Delta_i$-GDP budget, by the results for the GDP filters we know that $B_{i+1}^2 = B_i^2 + \Delta_i^2$, and that the output $a^{(T_m)}$ is $B_m$-GDP, where $B_m = \sqrt{\Delta_1^2 + \cdots + \Delta_m^2}$.

C.6 BLACKWELL'S THEOREM VIA DOMINATING PAIRS OF DISTRIBUTIONS

There is a one-to-one relationship between the orderings determined by trade-off functions and by the hockey-stick divergence; this follows from the results by Zhu et al. (37) as follows. Let $P$ and $Q$ be probability distributions, denote by $T(P, Q)$ the trade-off function determined by $P$ and $Q$, and let $H_\alpha(P \| Q)$ denote the hockey-stick divergence of order $\alpha > 0$. Lemma 20 of (37) is a restatement of (Proposition 2.12, 7) and states that for any $\varepsilon \in \mathbb{R}$,
$$
H_{\mathrm{e}^\varepsilon}(P \| Q) = 1 + T[P, Q]^*(-\mathrm{e}^\varepsilon), \tag{C.4}
$$
where, for a trade-off function $f$, $f^*$ denotes the convex conjugate $f^*(y) = \sup_{x \in \mathbb{R}} \{ yx - f(x) \}$. From (C.4) the equivalence follows directly: if $(\widetilde{P}, \widetilde{Q})$ is another pair of probability distributions, then $H_{\mathrm{e}^\varepsilon}(\widetilde{P} \| \widetilde{Q}) \le H_{\mathrm{e}^\varepsilon}(P \| Q)$ for all $\varepsilon \in \mathbb{R}$ if and only if $T[\widetilde{P}, \widetilde{Q}] \ge T[P, Q]$. This also means that if $H_\alpha(\widetilde{P} \| \widetilde{Q}) \le H_\alpha(P \| Q)$ for all $\alpha > 0$, then by the Blackwell theorem (see e.g. Thm. 2.10, 7) there exists a stochastic transformation (Markov kernel) $T$ such that $T P = \widetilde{P}$ and $T Q = \widetilde{Q}$.

D EFFICIENT INDIVIDUAL NUMERICAL ACCOUNTING FOR DP-SGD

We next show how to compute the individual PLDs for DP-SGD. These are needed when implementing the approximative individual $(\varepsilon, \delta)$-accountant described in Section 5. The errors arising from the approximations are generally negligible, and a rigorous error analysis could be carried out using the techniques presented in (20) and (15). The numerical approximation is based on:

1. A numerical $\sigma$-grid which allows evaluating upper bounds for the $\delta$'s efficiently: we precompute FFTs for different $\sigma$-values, and no additional FFT computations are then needed during the evaluation of the individual $\varepsilon$-values. By the data-processing inequality, this grid approximation also leads to upper $(\varepsilon, \delta)$-bounds.
2. The Plancherel theorem, which removes the need to compute inverse FFTs when evaluating individual PLDs.

First, we recall some basics about numerical accounting using FFT (see also (19; 15)).

D.1 NUMERICAL EVALUATION OF DP PARAMETERS USING FFT

We use a Fast Fourier Transform (FFT)-based method by Koskela et al. (19; 20) called the Fourier Accountant (FA). The same approximation could be carried out using the PRV accountant by Gopi et al. (15). Using FFT requires that we truncate the PLD $\omega$ and place it on an equidistant numerical grid over an interval $[-L, L]$, $L > 0$. Convolutions are evaluated using the FFT algorithm, and using the existing error analysis (see e.g., 20) the error incurred by the numerical FFT approximation can be bounded. The discrete Fourier transform is defined as follows (5). Let $x = [x_0, \ldots, x_{n-1}]^T$, $w = [w_0, \ldots, w_{n-1}]^T \in \mathbb{R}^n$. The discrete Fourier transform $\mathcal{F}$ and its inverse $\mathcal{F}^{-1}$ are defined as (31)
$$
(\mathcal{F} x)_k = \sum_{j=0}^{n-1} x_j\, \mathrm{e}^{-\mathrm{i}\, 2\pi kj/n}, \qquad
(\mathcal{F}^{-1} w)_k = \frac{1}{n} \sum_{j=0}^{n-1} w_j\, \mathrm{e}^{\mathrm{i}\, 2\pi kj/n},
$$
where $\mathrm{i} = \sqrt{-1}$. Using FFT, the running time of evaluating $\mathcal{F} x$ and $\mathcal{F}^{-1} w$ reduces to $O(n \log n)$. FFT also enables evaluating discrete convolutions efficiently via the so-called convolution theorem. For obtaining computational speed-ups, we use the Plancherel theorem (Chpt. 12, 25), which states that the DFT preserves inner products: for $x, y \in \mathbb{R}^n$, $\langle x, y \rangle = \frac{1}{n} \langle \mathcal{F}(x), \mathcal{F}(y) \rangle$. When using FA to approximate $\delta(\varepsilon)$, we need to evaluate an expression of the form
$$
b^k = D\, \mathcal{F}^{-1}\big( \mathcal{F}(D a_1)^{\odot k_1} \odot \cdots \odot \mathcal{F}(D a_m)^{\odot k_m} \big), \qquad
D = \begin{bmatrix} 0 & I_{n/2} \\ I_{n/2} & 0 \end{bmatrix} \in \mathbb{R}^{n \times n},
$$
where $a_i$ corresponds to a numerical PLD for a combination of DP hyperparameters $i$, and $k_i$ is the number of times the composition contains a mechanism with this PLD. An approximation of $\delta(\varepsilon)$ is then obtained from the discrete sum that approximates the hockey-stick integral:
$$
\widetilde{\delta}(\varepsilon) = \sum_{-L + \ell \Delta x > \varepsilon} \big( 1 - \mathrm{e}^{\varepsilon - (-L + \ell \Delta x)} \big)\, b^k_\ell.
$$
The Plancherel theorem gives the following:

Lemma D.1. Let $\widetilde{\delta}(\varepsilon)$ and $b^k$ be defined as above. Denote $w_\varepsilon \in \mathbb{R}^n$ such that
$$
(w_\varepsilon)_\ell = \max\big\{ 0,\; 1 - \mathrm{e}^{\varepsilon - (-L + \ell \Delta x)} \big\}. \tag{D.1}
$$
Then, we have that
$$
\widetilde{\delta}(\varepsilon) = \frac{1}{n} \big\langle \mathcal{F}(D w_\varepsilon),\; \mathcal{F}(D a_1)^{\odot k_1} \odot \cdots \odot \mathcal{F}(D a_m)^{\odot k_m} \big\rangle.
$$
Proof. See (18).
We immediately see that if both $\mathcal{F}(D w_\varepsilon)$ and $\mathcal{F}(D a_i)$, $1 \le i \le m$, are precomputed, $\widetilde{\delta}(\varepsilon)$ can be computed in $O(n)$ time, where $n$ is the number of discretisation points for the PLD. We can utilise this by placing the individual DP hyperparameters into well-chosen buckets and by pre-computing the FFTs corresponding to the hyperparameter values of each bucket. Then, the approximative numerical PLD for each sequence of DP hyperparameters (e.g. a sequence of noise ratios) can be written in the form $\mathcal{F}(D a_1)^{\odot k_1} \odot \cdots \odot \mathcal{F}(D a_m)^{\odot k_m}$, where the $k_i$'s correspond to the number of elements in each bucket. If we also have $\mathcal{F}(D w_\varepsilon)$ precomputed for different values of $\varepsilon$, we can easily construct a numerical accountant that outputs an approximation of $\varepsilon$ as a function of $\delta$.
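The identity in Lemma D.1 is easy to verify numerically. The sketch below (our illustration, using a toy Gaussian PLD and a $k$-fold self-composition) computes $\widetilde{\delta}(\varepsilon)$ both via the inverse FFT route and via the Plancherel inner product:

```python
import numpy as np

n, L = 2**12, 10.0
dx = 2 * L / n
grid = -L + dx * np.arange(n)

# Toy discretised PLD: a Gaussian on the grid, normalised to a probability vector.
a = np.exp(-(grid - 0.3)**2 / (2 * 0.4))
a /= a.sum()

# Flip operator D (swaps the two halves of a vector), as in the text.
D = lambda v: np.concatenate([v[n // 2:], v[:n // 2]])

eps, k = 0.5, 5                                  # k-fold self-composition
w = np.maximum(0.0, 1.0 - np.exp(eps - grid))    # weight vector (D.1)
Fa = np.fft.fft(D(a))**k

# Route 1: inverse FFT, then the discrete hockey-stick sum.
b = D(np.real(np.fft.ifft(Fa)))
delta1 = float(np.sum(w * b))

# Route 2 (Plancherel): inner product in the Fourier domain, no inverse FFT.
delta2 = float(np.real(np.vdot(np.fft.fft(D(w)), Fa)) / n)

print(delta1, delta2)   # the two routes agree to floating-point accuracy
```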

D.2 NOISE VARIANCE GRID FOR FAST INDIVIDUAL ACCOUNTING

We next show how to construct the DP hyperparameter grid for DP-SGD: a numerical $\sigma$-grid. We remark that Yu et al. (36) carry out similar approximations to speed up their approximative individual RDP accountants. Suppose we have models $a_0, \ldots, a_T$ as the output of a DP-SGD iteration that we run with subsampling ratio $q$, clipping constant $C > 0$ and noise parameter $\sigma$. Also suppose that, for a given data element $x$, along the iteration the gradients have norms $C_{x,i} := \|\nabla_\theta f(a_i, x)\|$, $0 \le i \le T-1$. We then get the individual $\varepsilon_x$-value (or individual numerical PLD, more generally) for the entry $x$ by considering heterogeneous compositions of DP-SGD mechanisms with parameter values $q$ and
$$
\widetilde{\sigma}_{x,i} = \frac{C}{C_{x,i}} \cdot \sigma, \quad 0 \le i \le T-1.
$$
A naive approach would require up to $T$ FFT evaluations, which quickly becomes computationally heavy. For the approximation, we determine a $\sigma$-grid $\Sigma = \{\sigma_0, \ldots, \sigma_{n_\sigma}\}$, where $n_\sigma \in \mathbb{Z}^+$ is the number of intervals in the grid and
$$
\sigma_i = \sigma_{\min} + i \cdot \frac{\sigma_{\max} - \sigma_{\min}}{n_\sigma}.
$$
We then encode the sequence of noise ratios $\widetilde{\Sigma} := \{\widetilde{\sigma}_{x,0}, \ldots, \widetilde{\sigma}_{x,n-1}\}$ into a tuple of integers $k = (k_0, k_1, \ldots, k_{n_\sigma})$,
$$
k_i = \begin{cases}
\#\{\widetilde{\sigma} \in \widetilde{\Sigma} :\, \sigma_i \le \widetilde{\sigma} < \sigma_{i+1}\}, & i < n_\sigma, \\
\#\{\widetilde{\sigma} \in \widetilde{\Sigma} :\, \sigma_{n_\sigma} \le \widetilde{\sigma}\}, & i = n_\sigma,
\end{cases} \tag{D.2}
$$
i.e., $k_i$ is the number of scaled noise parameters $\widetilde{\sigma}$ hitting bin number $i$ in the grid $\Sigma$. By the construction of the approximation, we have the following:

Theorem D.2. Consider the approximation described above. Denote the FFT transformation of the approximative numerical PLD obtained with the $\Sigma$-grid by
$$
\widetilde{a}_x = \mathcal{F}(D \widetilde{a}_1)^{\odot k_1} \odot \cdots \odot \mathcal{F}(D \widetilde{a}_{n_\sigma})^{\odot k_{n_\sigma}}
$$
and the corresponding $\delta$ (as a function of $\varepsilon$), as given by Lemma D.1, by $\widetilde{\delta}(\varepsilon) = \frac{1}{n} \langle \mathcal{F}(D w_\varepsilon), \widetilde{a}_x \rangle$, where $w_\varepsilon$ is the weight vector (D.1). Then, we have for each $\varepsilon \ge 0$:
$$
\delta(\varepsilon) \le \widetilde{\delta}(\varepsilon) + \mathrm{err},
$$
where $\delta(\varepsilon)$ is the tight value of $\delta$ corresponding to the actual sequence of noise ratios $\{\widetilde{\sigma}_{x,0}, \ldots, \widetilde{\sigma}_{x,n-1}\}$ and err denotes the (controllable) numerical errors arising from the discretisations of the PLDs.

Proof.
The result follows from the data-processing inequality, since each $\widetilde{\sigma}_{x,i}$-value is placed into the bucket corresponding to a smaller noise ratio. The numerical error term err can also be bounded using the techniques and results of (15). The important point here is that if the FFTs $\mathcal{F}(D \widetilde{a}_i)$, $0 \le i \le n_\sigma$, are precomputed, as well as $\mathcal{F}(D w_\varepsilon)$, then evaluating $\widetilde{\delta}(\varepsilon)$ is an $O(n)$ operation. To implement the approximative accountant described in Section 5, we numerically approximate individual upper-bound $\mu$-GDP values using the bisection method.
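The bucket encoding (D.2) can be sketched as follows (our illustration; the example noise-ratio values are hypothetical). Note that the upper-bound guarantee of Theorem D.2 presumes $\sigma_{\min}$ is chosen no larger than the smallest possible noise ratio, so that each value is rounded down:

```python
import numpy as np

def encode_buckets(noise_ratios, sigma_min, sigma_max, n_sigma):
    """Encode per-step noise ratios into bucket counts k = (k_0, ..., k_{n_sigma})
    as in Eq. (D.2): bucket i covers [sigma_i, sigma_{i+1}) and the last bucket
    catches everything at or above sigma_{n_sigma}.  Assumes all ratios are
    >= sigma_min, so each value is rounded down to its bucket's left edge."""
    edges = sigma_min + np.arange(n_sigma + 1) * (sigma_max - sigma_min) / n_sigma
    k = np.zeros(n_sigma + 1, dtype=int)
    for s in noise_ratios:
        i = int(np.searchsorted(edges, s, side="right")) - 1
        k[min(i, n_sigma)] += 1
    return k

# Hypothetical ratios sigma~_{x,i} = C / C_{x,i} * sigma for one data element.
ratios = [1.2, 1.9, 2.5, 7.3, 2.5]
k = encode_buckets(ratios, sigma_min=1.0, sigma_max=5.0, n_sigma=4)
print(list(k))   # -> [2, 2, 0, 0, 1]  (buckets [1,2), [2,3), [3,4), [4,5), [5,inf))
```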

D.3 POISSON SUBSAMPLING OF THE GAUSSIAN MECHANISM

For completeness, we show how to determine the PLDs for the Poisson subsampled Gaussian mechanism, required for the individual accounting of DP-SGD. Consider the Gaussian mechanism
$$
\mathcal{M}(X) = \sum_{x \in X} f(x) + \mathcal{N}(0, \sigma^2 I_d),
$$
where $f$ is a function $f : \mathcal{X} \to \mathbb{R}^d$. Then, if the datasets $X'$ and $X$ are neighbours such that $X = X' \cup \{x'\}$ for some entry $x'$, from the translation invariance of the hockey-stick divergence and from the unitary invariance of the Gaussian noise it follows that, for all $\alpha \ge 0$,
$$
H_\alpha\big( \mathcal{M}(X) \,\|\, \mathcal{M}(X') \big) = H_\alpha\big( \mathcal{N}(\|f(x')\|_2, \sigma^2) \,\|\, \mathcal{N}(0, \sigma^2) \big).
$$
Furthermore, from the scaling invariance of the hockey-stick divergence, we have for all $\alpha \ge 0$
$$
H_\alpha\big( \mathcal{M}(X) \,\|\, \mathcal{M}(X') \big)
= H_\alpha\big( \mathcal{N}(C, \sigma^2) \,\|\, \mathcal{N}(0, \sigma^2) \big)
= H_\alpha\big( \mathcal{N}(1, (\sigma/C)^2) \,\|\, \mathcal{N}(0, (\sigma/C)^2) \big),
$$
where $C = \|f(x')\|_2$. Using the subsampling amplification results of Zhu et al. (37) we get a unique worst-case pair $(P, Q)$, where
$$
P = q \cdot \mathcal{N}(1, \widetilde{\sigma}^2) + (1-q) \cdot \mathcal{N}(0, \widetilde{\sigma}^2), \qquad Q = \mathcal{N}(0, \widetilde{\sigma}^2),
$$
where $\widetilde{\sigma} = \sigma / C$. The PLD $\omega_{P/Q}$ is then determined by $P$ and $Q$ as defined in Def. 5.
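The worst-case pair above can be discretised directly to obtain $\delta(\varepsilon)$ for a single DP-SGD step. Below is a small self-contained sketch (ours; the grid parameters and the example values of $q$ and $\sigma/C$ are illustrative, and the truncation interval $[-L, L]$ must be wide enough for the chosen $\varepsilon$):

```python
import numpy as np

def subsampled_gaussian_delta(eps, q, s, L=12.0, n=2**14):
    """delta(eps) for one Poisson-subsampled Gaussian step with dominating pair
    P = q*N(1, s^2) + (1-q)*N(0, s^2), Q = N(0, s^2), where s = sigma/C.
    Discretises the privacy loss log(P(t)/Q(t)) on a t-grid and evaluates
    H_{e^eps}(P||Q) = E_{t~P}[(1 - e^{eps - loss(t)})_+] as a Riemann sum."""
    t = np.linspace(-L, L, n)
    dt = t[1] - t[0]
    norm = lambda m: np.exp(-(t - m)**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
    P = q * norm(1.0) + (1 - q) * norm(0.0)
    Q = norm(0.0)
    loss = np.log(P / Q)
    return float(np.sum(np.maximum(0.0, 1.0 - np.exp(eps - loss)) * P * dt))

# Illustrative (hypothetical) values: q = 0.01, s = sigma/C = 2.0.
print(subsampled_gaussian_delta(eps=0.05, q=0.01, s=2.0))
```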

E EXPERIMENTS WITH MIMIC-III

We use the preprocessing provided by (16) to obtain the train and test data for the phenotyping task; we refer to (16) for details on the preprocessing pipeline and on the phenotyping task. We tune the hyperparameters with Bayesian optimisation using the hyperparameter tuning library Optuna (1) to maximise the macro-averaged AUC-ROC, the task's primary metric. We train using DP-GD and Opacus (Yousefpour et al.) with noise parameter $\sigma \approx 10.61$ and determine the optimal clipping constant as $C \approx 0.79$ in our training runs. We compute the budget $B$ so that filtering starts after 50 epochs and set the maximum number of epochs to 100. With these parameter values, $\varepsilon = 2.75$ when $\delta = 10^{-5}$.

E.1 EFFECT OF SUBOPTIMAL HYPERPARAMETER VALUES ON FILTERED DP-GD

We study here the effect of choosing sub-optimal clipping constants by evaluating the effects of filtering for clipping constants ranging from half the optimum to five times the optimum (Figure 6). We observe that filtering only improves the utility when the chosen clipping constant is sub-optimal (e.g., 5x the optimum). Our observations complement those of (12), who also observe the largest improvements from filtering in sub-optimal hyperparameter regimes.

E.2 HISTOGRAMS OF INDIVIDUAL ε-VALUES FOR THE MIMIC-III EXPERIMENT

As described in the main text, to observe the differences across subgroups, we choose five non-overlapping groups of size 1000 based on the following criteria: subgroup 0: no diagnosis at all; subgroups 1 and 2: pneumonia / no pneumonia; subgroups 3 and 4: heart attack / no heart attack. In the training data, there are in total 2072 cases without a diagnosis, 4105 pneumonia cases, and in total 9413 heart attack cases. We remark that it is not uncommon for a patient to have multiple conditions. During training, we track the gradient norms $C_{x,i}$ for all elements of the training dataset, and thus we can compute the individual $\varepsilon$-values after a given number of iterations for a given value of $\delta$ ($\delta$ is set to $10^{-5}$). In Figures 7 and 8 we display histograms of the individual $\varepsilon$-values after 50 epochs. With the optimal clipping constant, a majority of the data points have an individual $\varepsilon = 2.75$, which is near the budget. For a clipping constant that is 5x the optimum, most individual $\varepsilon$-values are significantly smaller than 2.75.

E.3 FURTHER EXPERIMENTAL RESULTS

We run the same experiment as above but, instead of a maximum privacy loss of ($\varepsilon = 2.75$, $\delta = 10^{-5}$), using maximum privacy losses of $\varepsilon = 0.5$ and $\varepsilon = 10.0$. Figures 9 and 10 depict the performance in these cases (test AUC-ROC curves). Similarly to the experiments of (12), we see that the overall performance slightly increases when using the individual filtering.



Figure 1: MNIST experiment. Left: Randomly chosen data element and its accurate (ε, δ)-curve after 50 epochs vs. the µ-GDP upper bound approximation. Right: Comparison of individual ε-values obtained via RDPs and PLDs: histograms for 1000 randomly selected samples after 50 epochs (δ = 10^{-6}). Computation using PLDs is better able to capture small individual ε-values.


Figure 3: Comparison of RDP and GDP for private GD with σ = 100 and number of iterations k = 420 (experiment considered by Feldman and Zrnic (12)). This means that when we replace the RDP filter with the GDP filter for the private GD, we can get roughly 10 percent smaller ε-values.


Figure 4: Comparison of RDP and GDP for private GD with σ = 100 and different number of compositions k.

Figure 5: Comparison of RDP and GDP for private GD with σ = 100: maximum number of allowed steps for private GD (number of compositions k) for different values of ε, when δ = 10 -5 .

Figure 6: Filtered DP-GD with a maximum privacy loss of (ε, δ) = (2.75, 10^{-5}) using different clipping constants. The rest of the hyperparameters are tuned. Left: The test AUC-ROC as a function of epochs. The red vertical line denotes the starting point of the filtering. Right: The number of active elements.

Figure 7: Histogram of individual ε after 50 epochs without filtering when using the optimal clipping constant. A majority of the individual ε-values are near ε = 2.75, which means that the corresponding elements will be deactivated in epoch 51, when filtering starts.

Figure 8: Histogram of individual ε after 50 epochs without filtering when using a clipping constant that is 5x the optimum. A majority of the individual ε-values are far from ε = 2.75, which means that the corresponding elements will not be instantly deactivated in epoch 51, when filtering starts.


Figure 9: Filtered DP-GD with a maximum privacy loss of (ε, δ) = (0.5, 10^{-5}) using tuned hyperparameters. Left: the test AUC-ROC as a function of epochs. Right: The number of active elements.


Figure 10: Filtered DP-GD with a maximum privacy loss of (ε, δ) = (10.0, 10 -5 ) using tuned hyperparameters. Left: the test AUC-ROC as a function of epochs. Right: The number of active elements.

Figure 2: MIMIC-III experiment and individual filtering for private GD. Comparison of the test losses, training losses and average privacy losses before and after filtering has started (at 50 epochs). The filtering has a further disparate impact on model accuracies across subgroups.

All the code related to MIMIC-III data set is publicly available (https://github.com/DPBayes/individual-accounting-gdp), as requested by Physionet (https://physionet.org/content/mimiciii/view-dua/1.4/).

ACKNOWLEDGMENTS

The authors acknowledge CSC -IT Center for Science, Finland, and the Finnish Computing Competence Infrastructure (FCCI) for computational and data storage resources. This work was supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; and grant 325573), the Strategic Research Council at the Academy of Finland (Grant 336032) as well as the European Union (Project 101070617). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them.

