DEEP LEARNING WITH DATA PRIVACY VIA RESIDUAL PERTURBATION

Abstract

Protecting data privacy in deep learning (DL) is an urgent need. Several celebrated privacy notions have been established and used for privacy-preserving DL. However, many existing mechanisms achieve data privacy at the cost of significant utility degradation. In this paper, we propose a residual perturbation for privacy-preserving DL, principled by stochastic differential equation theory, which injects Gaussian noise into each residual mapping of a ResNet. Theoretically, we prove that residual perturbation guarantees differential privacy (DP) and reduces the generalization gap of DL. Empirically, we show that residual perturbation outperforms the state-of-the-art DP stochastic gradient descent (DPSGD) in both membership privacy protection and maintaining the DL models' utility. For instance, when training ResNet8 for IDC dataset classification, residual perturbation obtains an accuracy of 85.7% and protects membership privacy almost perfectly; in contrast, DPSGD achieves an accuracy of 82.8% with weaker membership privacy protection.

1. INTRODUCTION

Many high-capacity deep nets (DNs) are trained on private data, including medical images and financial transaction records (Yuen et al., 2011; Feng et al., 2017; Liu et al., 2017). DNs usually overfit and can memorize their private training data, which exposes DN training to data privacy leakage (Fredrikson et al., 2015a; Shokri et al., 2017; Salem et al., 2018; Yeom et al., 2018; Sablayrolles et al., 2018). Given a pre-trained DN, the membership inference attack can determine whether an instance is in the training set based on the DN's response (Fredrikson et al., 2014; Shokri et al., 2017; Salem et al., 2018); the model extraction attack can learn a surrogate model that matches the target model given only black-box access to the target (Tramèr et al., 2016; Gong & Liu, 2018); the model inversion attack can infer certain features of a given input from the output of a target model (Fredrikson et al., 2015b; Al-Rubaie & Chang, 2016); and the attribute inference attack can deanonymize the anonymized training data (Gong & Liu, 2016; Zheng et al., 2018). Machine learning (ML) with data privacy is crucial in many applications (Lindell & Pinkas, 2000; Barreno et al., 2006; Hesamifard et al., 2018; Bae et al., 2019). Algorithms developed to reduce privacy leakage include differential privacy (DP) (Dwork et al., 2006), federated learning (FL) (McMahan et al., 2016; Konečnỳ et al., 2016), and k-anonymity (Sweeney, 2002; El Emam & Dankar, 2008). Objective, output, and gradient perturbations are among the most used approaches for ML with DP guarantees, at the cost of significant utility degradation (Chaudhuri et al., 2011; Bassily et al., 2014; Shokri & Shmatikov, 2015; Abadi et al., 2016b; Bagdasaryan et al., 2019). FL trains centralized ML models through gradient exchange, with the training data distributed across edge devices; however, the gradient exchange can still leak private information (Zhu et al., 2019; Wang et al., 2019c).
Most existing privacy protection is achieved at a tremendous sacrifice of utility. Moreover, training ML models with the state-of-the-art DP stochastic gradient descent (DPSGD) incurs a tremendous computational cost due to the requirement of computing and clipping per-sample gradients (Abadi et al., 2016a). It remains of great interest to develop new privacy-preserving ML algorithms that avoid excessive computational overhead and do not degrade the utility of the ML models.

1.1. OUR CONTRIBUTION

In this paper, we propose residual perturbation for privacy-preserving deep learning (DL) with DP guarantees. At the core of residual perturbation is injecting Gaussian noise into each residual mapping of ResNet (He et al., 2016), and the approach is theoretically principled by stochastic differential equation (SDE) theory. The major advantages of residual perturbation are threefold:
• It can protect the membership privacy of the training data almost perfectly, often without sacrificing ResNets' utility, and can even improve ResNets' classification accuracy.
• It has fewer hyperparameters to tune than the benchmark DPSGD. It is also more computationally efficient than DPSGD, which requires computing per-sample gradients.
• It can be implemented with a few lines of code in modern DL libraries.

1.2. RELATED WORK

Improving the utility of ML models with DP guarantees is an important task. PATE (Papernot et al., 2017; 2018) uses semi-supervised learning together with model transfer between "student" and "teacher" models to enhance utility. Several variants of the DP notion have been proposed to improve the privacy budget and can sometimes also improve the resulting model's utility at a given DP budget (Abadi et al., 2016b; Mironov, 2017; Wang et al., 2018; Dong et al., 2019). Post-processing techniques have also been developed to improve the utility of ML models with negligible computational overhead (Wang et al., 2019a; Liang et al., 2020). From the SDE viewpoint, (Li et al., 2019; Wang et al., 2015) showed that several stochastic gradient Monte Carlo samplers can reach state-of-the-art performance in terms of both privacy and utility in Bayesian learning. Gaussian noise injection in residual learning has been used to improve the robustness of ResNets (Rakin et al., 2018; Wang et al., 2019b; Liu et al., 2019).
In this paper, we inject Gaussian noise to each residual mapping to achieve data privacy instead of adversarial robustness.

1.3. ORGANIZATION

We organize this paper as follows: In Section 2, we introduce residual perturbation for privacy-preserving DL. In Section 3, we present the generalization and DP guarantees for residual perturbation. In Section 4, we numerically verify the efficiency of residual perturbation in protecting data privacy without degrading the underlying models' utility. We end with concluding remarks. Technical proofs and additional experimental details and results are provided in the appendix.

1.4. NOTATIONS

We denote scalars by lower- or upper-case letters, and vectors/matrices by lower-/upper-case boldface letters. For a vector $x = (x_1, \cdots, x_d) \in \mathbb{R}^d$, we use $\|x\|_2 = (\sum_{i=1}^d |x_i|^2)^{1/2}$ to denote its $\ell_2$ norm. For a matrix $A$, we use $\|A\|_2$ to denote its norm induced by the vector $\ell_2$ norm. We denote the standard Gaussian in $\mathbb{R}^d$ as $\mathcal{N}(0, I)$ with $I \in \mathbb{R}^{d\times d}$ being the identity matrix. The set of real (positive real) numbers is denoted as $\mathbb{R}$ ($\mathbb{R}^+$). We use $B(0, R)$ to denote the ball centered at 0 with radius $R$.

2.1. DEEP RESIDUAL LEARNING AND ITS CONTINUOUS ANALOGUE

Given the training set $S_N := \{x_i, y_i\}_{i=1}^N$, with $\{x_i, y_i\} \subset \mathbb{R}^d \times \mathbb{R}$ being a data-label pair. For a given $x_i$, the forward propagation of a ResNet with $M$ residual mappings can be written as
$$x^{l+1} = x^l + F(x^l, W^l), \quad l = 0, 1, \cdots, M-1, \ \ x^0 = x_i; \qquad \hat{y}_i = f(x^M), \eqno(1)$$
where $F(\cdot, W^l)$ is the nonlinear mapping of the $l$th residual mapping parameterized by $W^l$, $f$ is the output activation function, and $\hat{y}_i$ is the predicted label for $x_i$. The heuristic continuum limit of (1) is
$$dx(t) = F(x(t), W(t))\,dt, \quad x(0) = x, \eqno(2)$$
where $t$ is the time variable. The ordinary differential equation (ODE) (2) can be reversible, and thus its ResNet counterpart might be exposed to data privacy leakage. For instance, we use the ICLR logo (Fig. 1 (a)) as the initial data $x$ in (2). Then we simulate the forward propagation of ResNet by solving (2) from $t = 0$ to $t = 1$ using the forward Euler solver with time step size $\Delta t = 0.01$ and a given velocity field $F(x(t), W(t))$ (see Appendix E for the details of $F(x(t), W(t))$), which maps the original image to its features (Fig. 1 (b)). To recover the original image, we start from the features and use the backward Euler iteration, i.e., $\tilde{x}(t) = \tilde{x}(t + \Delta t) - \Delta t\, F(\tilde{x}(t + \Delta t), t + \Delta t)$, to evolve $\tilde{x}(t)$ from $t = 1$ to $t = 0$ with $\tilde{x}(1) = x(1)$ being the features obtained in the forward propagation. We plot the image recovered from the features in Fig. 1 (c): the original image can be almost perfectly recovered. We see that it is easy to break the privacy of the ODE model, but harder for the SDE.
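The reverse-engineering step above can be reproduced with a minimal numerical sketch. Here the velocity field is a hypothetical linear map $F(x) = Ax$ (a stand-in for the paper's trained field, not its actual choice), and the backward Euler iteration recovers the input from the ODE features up to discretization error:

```python
import numpy as np

# Hypothetical linear velocity field F(x) = A x; the paper's F is a given nonlinear field.
rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((2, 2))
F = lambda x: A @ x

dt, steps = 0.01, 100          # solve from t = 0 to t = 1 with step 0.01, as in the paper
x0 = rng.standard_normal(2)    # stand-in for the "original image"

# Forward Euler: x(t + dt) = x(t) + dt * F(x(t))  -> the "features"
x = x0.copy()
for _ in range(steps):
    x = x + dt * F(x)

# Backward Euler iteration from the features: x(t) = x(t + dt) - dt * F(x(t + dt))
x_rec = x.copy()
for _ in range(steps):
    x_rec = x_rec - dt * F(x_rec)

print(np.linalg.norm(x_rec - x0))   # small residual: the ODE flow is nearly invertible
```

The recovery error is only the O(Δt) mismatch between the explicit and implicit Euler maps, which is why the ODE model leaks the input so easily.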

2.2. RESIDUAL PERTURBATION AND ITS SDE ANALOGUE

In this part, we propose two SDE models to reduce the reversibility of (2); the corresponding residual perturbations can protect data privacy in DL.

Strategy I. For the first strategy, we consider the following SDE model:
$$dx(t) = F(x(t), W(t))\,dt + \gamma\, dB(t), \quad \gamma > 0, \eqno(3)$$
where $B(t)$ is the standard Brownian motion. We simulate the forward propagation and the reverse-engineering of the input from the output by solving the SDE model (3) with $\gamma = 1$, using the same $F(x(t), W(t))$ and initial data $x$. We use the following forward (4) and backward (5) Euler-Maruyama discretizations (Higham, 2001) of (3),
$$x(t + \Delta t) = x(t) + \Delta t\, F(x(t), W(t)) + \gamma\sqrt{\Delta t}\,\mathcal{N}(0, I), \eqno(4)$$
$$\tilde{x}(t) = \tilde{x}(t + \Delta t) - \Delta t\, F(\tilde{x}(t + \Delta t), W(t + \Delta t)) + \gamma\sqrt{\Delta t}\,\mathcal{N}(0, I), \eqno(5)$$
for the forward and backward propagation, respectively. Figure 1 (d) and (e) show the results of the forward and backward propagation by the SDE, respectively; these results show that it is much harder to reverse the features obtained by the SDE evolution. The SDE model informs us to inject Gaussian noise, in both the training and test phases, into each residual mapping of ResNet to protect data privacy, which results in
$$x^{i+1} = x^i + F(x^i, W^i) + \gamma n^i, \quad \text{where } n^i \sim \mathcal{N}(0, I). \eqno(6)$$

Strategy II. For the second strategy, we consider multiplicative instead of the additive noise used in (3) and (6); the corresponding SDE can be written as
$$dx(t) = F(x(t), W(t))\,dt + \gamma x(t) \odot dB(t), \quad \gamma > 0, \eqno(7)$$
where $\odot$ denotes the Hadamard product. Similarly, we can use the forward and backward Euler-Maruyama discretizations of (7) to propagate the image in Fig. 1 (a); we provide these results in Appendix D.1. The corresponding residual perturbation is
$$x^{i+1} = x^i + F(x^i, W^i) + \gamma x^i \odot n^i, \quad \text{where } n^i \sim \mathcal{N}(0, I); \eqno(8)$$
again, the noise $\gamma x^i \odot n^i$ is injected into each residual mapping in both the training and test phases.
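Both perturbations amount to one extra line in a residual block's forward pass. The following numpy sketch illustrates the two update rules; the toy residual mapping F and the dimensions are illustrative placeholders, not the paper's trained ResNet:

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 8, 0.5
W = [0.1 * rng.standard_normal((d, d)) for _ in range(4)]  # toy weights for 4 residual mappings
F = lambda x, Wl: np.tanh(Wl @ x)                          # stand-in residual mapping

def forward(x, strategy="I"):
    """Propagate through the residual mappings; noise is injected in BOTH train and test."""
    for Wl in W:
        n = rng.standard_normal(d)
        if strategy == "I":   # additive (Strategy I):        x <- x + F(x) + gamma * n
            x = x + F(x, Wl) + gamma * n
        else:                 # multiplicative (Strategy II):  x <- x + F(x) + gamma * x (.) n
            x = x + F(x, Wl) + gamma * x * n
    return x

x0 = rng.standard_normal(d)
print(forward(x0, "I"), forward(x0, "II"))
```

Because fresh noise is drawn on every call, two forward passes on the same input produce different features, which is exactly the irreversibility the SDE analysis relies on.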
We will provide theoretical guarantees for these two residual perturbation schemes, i.e., (6) and (8), in Section 3, and numerically verify their efficacy in Section 4.

2.3. UTILITY ENHANCEMENT VIA MODEL ENSEMBLE

Wang et al. (2019b) showed that an ensemble of noise-injected ResNets can improve models' utility. In this paper, we also study the model ensemble for utility enhancement. We inherit notations from (Wang et al., 2019b); e.g., we denote an ensemble of two noise-injected ResNet8 models as En_2ResNet8.
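Our reading of the En_k notation is averaging the class probabilities of k independently noise-perturbed copies at inference; the sketch below uses random linear "logits" as hypothetical stand-ins for trained nets (see Wang et al. (2019b) for the actual EnResNet construction):

```python
import numpy as np

rng = np.random.default_rng(6)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_predict(models, x, k=2):
    """En_k: average the class probabilities of k noise-injected model copies."""
    probs = [softmax(f(x)) for f in models[:k]]
    return np.mean(probs, axis=0)

# Toy stand-ins for two noise-injected nets (random linear logits + injected noise).
Ws = [rng.standard_normal((10, 8)) for _ in range(2)]
models = [(lambda x, W=W: W @ x + 0.1 * rng.standard_normal(10)) for W in Ws]
p = ensemble_predict(models, rng.standard_normal(8))
print(p)   # averaged probabilities still sum to 1
```

Averaging over copies reduces the prediction variance introduced by the injected noise, which is the intuition behind the utility gain of the ensemble.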

3. MAIN THEORY

In this section, we will provide theoretical guarantees for the above two residual perturbations.

3.1. DIFFERENTIAL PRIVACY GUARANTEE FOR STRATEGY I

We consider the following function class for ResNets with residual perturbation:
$$\mathcal{F}_1 := \{f(x) = w^T x^M \,|\, x^{i+1} = x^i + \phi(U^i x^i) + \gamma n^i, \ i = 0, \cdots, M-1, \ x^0 = \text{input data} + \pi n, \ n, n^i \sim \mathcal{N}(0, I), \ w \in \mathbb{R}^d, \ U^i \in \mathbb{R}^{d\times d}\}, \eqno(9)$$
where $x^0 \in \mathbb{R}^d$ is the noisy input, $U^i$ is the weight matrix in the $i$th residual mapping, and $w \in \mathbb{R}^d$ is the weight vector of the last layer. $\gamma, \pi > 0$ are hyperparameters. $\phi = \mathrm{BN}(\psi)$ with BN being batch normalization and $\psi$ an $L$-Lipschitz, monotonically increasing activation function (e.g., ReLU). We first recap the definition of differential privacy below.

Definition 1 ($(\epsilon, \delta)$-DP). (Dwork et al., 2006) A randomized mechanism $\mathcal{M} : \mathcal{S}^N \to \mathcal{R}$ satisfies $(\epsilon, \delta)$-DP if for any two datasets $S, S' \in \mathcal{S}^N$ that differ by one element, and any output subset $O \subseteq \mathcal{R}$, it holds that $\mathbb{P}[\mathcal{M}(S) \in O] \le e^{\epsilon} \cdot \mathbb{P}[\mathcal{M}(S') \in O] + \delta$, where $\delta \in (0, 1)$ and $\epsilon > 0$.

We have the following DP guarantee for Strategy I, whose proof is provided in Appendix A.

Theorem 1. Assume the input to the ResNet lies in $B(0, R)$ and the output of every residual mapping is normally distributed and bounded by $G$, in $\ell_2$ norm, in expectation. Let $T$ be the total number of iterations used for training the ResNet. For any $\epsilon > 0$ and $\delta, \lambda \in (0, 1)$, the parameters $U^i$ and $w$ in the ResNet with residual perturbation satisfy $((\lambda/i + (1-\lambda))\epsilon, \delta)$-DP and $((\lambda/M + (1-\lambda))\epsilon, \delta)$-DP, respectively, provided that $\pi > R\sqrt{(2Tb\alpha)/(N\lambda\epsilon)}$ and $\gamma > G\sqrt{(2Tb\alpha)/(N\lambda\epsilon)}$, where $\alpha = \log(1/\delta)/((1-\lambda)\epsilon) + 1$, and $M$, $N$, and $b$ are the number of residual mappings, the number of training data, and the batch size, respectively. In particular, when $\gamma > G\sqrt{(2Tb\alpha)/(NM\lambda\epsilon)}$, the whole model obtained by injecting noise according to Strategy I satisfies $(\epsilon, \delta)$-DP.
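Theorem 1's noise conditions can be evaluated numerically. The sketch below plugs in illustrative values (R = G = 30 matches Section 4.6; N, b, T, M are plausible stand-ins, not values reported by the paper):

```python
import math

# Illustrative constants; only R = G = 30 is taken from the paper's Section 4.6.
R, G = 30.0, 30.0
N, b, T, M = 50_000, 128, 10_000, 8
eps, delta, lam = 1.1e5, 1e-5, 0.5

alpha = math.log(1.0 / delta) / ((1.0 - lam) * eps) + 1.0
pi_min = R * math.sqrt(2 * T * b * alpha / (N * lam * eps))         # input-noise scale pi
gamma_min = G * math.sqrt(2 * T * b * alpha / (N * M * lam * eps))  # per-block scale for (eps, delta)-DP

print(alpha, pi_min, gamma_min)
```

Note how a very large ε is needed to keep the noise scales small, which is consistent with the loose budget the paper reports in Section 4.6.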

3.2. THEORETICAL GUARANTEES FOR STRATEGY II

Privacy. To analyze the residual perturbation (8), we consider the following function class:
$$\mathcal{F}_2 := \{f(x) = w^T x^M + \pi\|x^M\|_2\, n \,|\, x^{i+1} = x^i + \phi(U^i x^i) + \gamma \tilde{x}^i \odot n^i, \ i = 0, \cdots, M-1, \ n, n^i \sim \mathcal{N}(0, I), \ \|w\|_2 \le a\}, \eqno(10)$$
where $a > 0$ is a constant; we denote the entry of $x^i$ with the largest absolute value as $x^i_{\max}$, and $\tilde{x}^i$ is defined as $(\mathrm{sgn}(x^i_j)\max(|x^i_j|, \eta))_{j=1}^d$. Due to batch normalization, we assume $\phi$ is bounded by a positive constant $B$. The other notations are defined similarly to those in (9). Consider training $\mathcal{F}_2$ on two different datasets $S$ and $S'$, and denote the resulting models as
$$f(x|S) := w_1^T x^M + \pi\|x^M\|_2\, n^M; \quad x^{i+1} = x^i + \phi(U_1^i x^i) + \gamma \tilde{x}^i \odot n^i, \ i = 0, \cdots, M-1, \eqno(11)$$
$$f(x|S') := w_2^T x^M + \pi\|x^M\|_2\, n^M; \quad x^{i+1} = x^i + \phi(U_2^i x^i) + \gamma \tilde{x}^i \odot n^i, \ i = 0, \cdots, M-1. \eqno(12)$$

Theorem 2. Let $f(x|S)$ and $f(x|S')$ be defined in (11) and (12), respectively. Let $\lambda \in (0, 1)$, $\delta \in (0, 1)$, and $\epsilon > 0$. If $\gamma > (B/\eta)\sqrt{(2\alpha M)/(\lambda\epsilon)}$ and $\pi > a\sqrt{(2\alpha M)/(\lambda\epsilon)}$, where $\alpha = \log(1/\delta)/((1-\lambda)\epsilon) + 1$, then $\mathbb{P}[f(x|S) \in O] \le e^{\epsilon} \cdot \mathbb{P}[f(x|S') \in O] + \delta$ for any input $x$ and any subset $O$ of the output space.

We provide the proof of Theorem 2 in Appendix B. Theorem 2 guarantees the privacy of the training data given only black-box access to the model, i.e., the model outputs a prediction for any input without granting adversaries access to the model itself. In particular, one cannot infer whether the model was trained on $S$ or $S'$ no matter how one queries the model in a black-box fashion. We leave the theoretical DP guarantee for Strategy II as future work.

Generalization Gap. Many works have shown that overfitting in training ML models leads to privacy leakage (Salem et al., 2018), and reducing overfitting can mitigate privacy leakage (Shokri et al., 2017; Yeom et al., 2018; Sablayrolles et al., 2018; Salem et al., 2018; Wu et al., 2019b).
In this part, we will show that the residual perturbation (8) can reduce overfitting by computing the Rademacher complexity. For simplicity, we consider binary classification problems. Suppose $S_N = \{x_i, y_i\}_{i=1}^N$ is drawn from $X \times Y \subset \mathbb{R}^d \times \{-1, +1\}$ with $X$ and $Y$ being the input data and label spaces, respectively. Assume $\mathcal{D}$ is the underlying distribution of $X \times Y$, which is unknown. Let $\mathcal{H} \subset \mathcal{V}$ be the hypothesis class of the ML model. We first recap the definition of Rademacher complexity.

Definition 2. (Barlett & Mendelson, 2002) Let $\mathcal{H} : X \to \mathbb{R}$ be the space of real-valued functions on the space $X$. For a given sample $S = \{x_1, x_2, \cdots, x_N\}$ of size $N$, the empirical Rademacher complexity of $\mathcal{H}$ is defined as
$$R_S(\mathcal{H}) := \frac{1}{N}\,\mathbb{E}_{\sigma}\Big[\sup_{h\in\mathcal{H}}\sum_{i=1}^N \sigma_i h(x_i)\Big],$$
where $\sigma_1, \sigma_2, \cdots, \sigma_N$ are i.i.d. Rademacher random variables with $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$.

Rademacher complexity is a tool to bound the generalization gap (Barlett & Mendelson, 2002): the smaller the generalization gap, the less the model overfits. For any $x_i \in \mathbb{R}^d$ and constant $c \ge 0$, we consider the following two function classes:
$$\mathcal{F} := \{f(x, w) = wx^p(T) \,|\, dx(t) = Ux(t)\,dt, \ x(0) = x_i; \ w \in \mathbb{R}^{1\times d}, \ U \in \mathbb{R}^{d\times d}, \ \|w\|_2, \|U\|_2 \le c\},$$
$$\mathcal{G} := \{f(x, w) = \mathbb{E}(wx^p(T)) \,|\, dx(t) = Ux(t)\,dt + \gamma x(t) \odot dB(t), \ x(0) = x_i; \ w \in \mathbb{R}^{1\times d}, \ U \in \mathbb{R}^{d\times d}, \ \|w\|_2, \|U\|_2 \le c\},$$
where $0 < p < 1$ takes a value such that $x^p$ is well defined on the whole $\mathbb{R}^d$, $\gamma > 0$ is a hyperparameter, $U$ is a circulant matrix corresponding to a convolution layer in DNs, and $B(t)$ is the 1D Brownian motion. The function class $\mathcal{F}$ represents the continuous analogue of ResNet without inner nonlinear activation functions, and $\mathcal{G}$ denotes $\mathcal{F}$ with the residual perturbation (8).

Theorem 3. Given the training set $S_N = \{x_i, y_i\}_{i=1}^N$, we have $R_{S_N}(\mathcal{G}) < R_{S_N}(\mathcal{F})$.

We provide the proof of Theorem 3 in Appendix C, where we also provide quantitative lower and upper bounds on the above Rademacher complexities.
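Definition 2 can be estimated by Monte Carlo when the supremum has a closed form. For the simple linear class {x → ⟨w, x⟩ : ‖w‖₂ ≤ c} (an illustrative class, not the paper's F or G), the supremum equals c‖Σᵢσᵢxᵢ‖₂, giving a direct estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, c = 200, 5, 1.0
X = rng.standard_normal((N, d))   # synthetic sample of size N

def empirical_rademacher(X, c, trials=2000):
    """Monte Carlo estimate of Definition 2 for the linear class ||w||_2 <= c,
    where sup_w <w, sum_i sigma_i x_i> = c * ||sum_i sigma_i x_i||_2 in closed form."""
    N = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(trials, N))   # i.i.d. Rademacher variables
    return c * np.mean(np.linalg.norm(sigma @ X, axis=1)) / N

r = empirical_rademacher(X, c)
print(r)   # decays like O(sqrt(d / N)) for this class
```

The same recipe, with the closed-form suprema derived in Appendix C, is what makes the comparison between the two function classes tractable.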
Theorem 3 shows that residual perturbation (8) can reduce the generalization error. We will numerically verify this generalization error reduction for ResNet with residual perturbation in Section 4.

4. EXPERIMENTS

In this section, we numerically address the following questions: 1) Can residual perturbation protect data privacy, in particular membership privacy? 2) Can an ensemble of ResNets with residual perturbation improve classification accuracy? 3) Are skip connections crucial for residual perturbation in DL with data privacy? 4) What are the advantages of residual perturbation over DPSGD? We focus on Strategy I in this section and provide the results for Strategy II in Appendix D.

4.1. PRELIMINARIES

Datasets. We consider the CIFAR10/CIFAR100 (Krizhevsky et al., 2009) and IDC datasets. For each dataset, the training set is split into shadow and target subsets; the purpose of this splitting is the membership inference attack, which is discussed below.

Membership inference attack. To verify the efficiency of residual perturbation in protecting data privacy, we consider the membership inference attack (Salem et al., 2018) in all the experiments below. The attack proceeds as follows: 1) train the shadow model on $D^{\mathrm{train}}_{\mathrm{shadow}}$; 2) apply the trained shadow model to all data points in $D_{\mathrm{shadow}}$ and obtain the corresponding classification probabilities for each class. We then take the top three classification probabilities (or two in the case of binary classification) to form a feature vector for each data point. A feature vector is tagged 1 if the corresponding data point is in $D^{\mathrm{train}}_{\mathrm{shadow}}$, and 0 otherwise. We then train the attack model on all the labeled feature vectors; 3) train the target model on $D^{\mathrm{train}}_{\mathrm{target}}$ and obtain the feature vector for each point in $D_{\mathrm{target}}$. Finally, we use the attack model to decide whether a data point is in $D^{\mathrm{train}}_{\mathrm{target}}$.

Experimental settings. We consider En_5ResNet8 (an ensemble of 5 ResNet8 models with residual perturbation) and the standard ResNet8 as the target and shadow models. As the attack model, we use a multilayer perceptron with one hidden layer of 64 nodes followed by a softmax output, adapted from (Salem et al., 2018). We apply the same settings as (He et al., 2016) to train the target and shadow models on CIFAR10 and CIFAR100. For training models on the IDC dataset, we run 100 epochs of SGD with the same settings, except that we decay the learning rate by a factor of 4 at the 20th, 40th, 60th, and 80th epochs, respectively.
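Step 2 of the attack, building labeled feature vectors from the shadow model's top-3 class probabilities, can be sketched as follows; the probability arrays are synthetic placeholders standing in for real shadow-model outputs:

```python
import numpy as np

rng = np.random.default_rng(3)

def attack_features(probs, in_train):
    """Top-3 softmax probabilities per point, tagged 1 if the point was in D_shadow^train."""
    top3 = np.sort(probs, axis=1)[:, -3:][:, ::-1]       # three largest, descending
    labels = np.full(len(probs), int(in_train))
    return top3, labels

# Synthetic stand-ins: members tend to get peaked (memorized) predictions, non-members flatter.
p_members = rng.dirichlet(np.full(10, 0.1), size=100)
p_nonmembers = rng.dirichlet(np.ones(10), size=100)
Xm, ym = attack_features(p_members, True)
Xn, yn = attack_features(p_nonmembers, False)
X = np.vstack([Xm, Xn])
y = np.concatenate([ym, yn])   # training set for the attack MLP
print(X.shape, y.shape)
```

The attack MLP then learns to separate the two confidence distributions; residual perturbation defends precisely by making these distributions indistinguishable.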

4.2. EXPERIMENTS ON THE IDC DATASET

In this subsection, we numerically verify the effectiveness of residual perturbation in protecting data privacy while retaining classification accuracy on the IDC dataset. We select En_5ResNet8 as the benchmark architecture, with ResNet8 as its baseline (the details of the neural architectures are provided in Appendix F). As shown in Figure 3, we set four different thresholds to obtain different attack results with three different noise coefficients (γ), where γ = 0 corresponds to the standard ResNet8 without residual perturbation. We also depict the ROC curve for this experiment in Figure 7 (c).

4.2.1. RESIDUAL PERTURBATION VS. DPSGD

In this part, we compare residual perturbation with the benchmark TensorFlow DPSGD module (McMahan et al., 2018). We calibrate the hyperparameters, including the initial learning rate (0.1, decayed by a factor of 4 every 20 epochs), noise multiplier (1.1), clipping threshold (1.0), number of micro-batches (128), and number of epochs (100), such that the resulting model gives the optimal trade-off between membership privacy and classification accuracy. DPSGD is significantly more expensive due to the requirement of computing the per-sample gradient. We compare the standard ResNet8 trained by DPSGD with En_1ResNet8 and En_5ResNet8 with residual perturbation (γ = 0.3).
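The per-sample cost DPSGD pays comes from clipping each example's gradient before noising. A generic sketch of that mechanism (following Abadi et al.'s recipe in spirit, not the TensorFlow Privacy internals; the gradient array is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)

def dpsgd_step(per_sample_grads, clip=1.0, noise_multiplier=1.1):
    """Clip each per-sample gradient to L2 norm <= clip, then sum, noise, and average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads / np.maximum(1.0, norms / clip)     # per-example clipping
    noise = noise_multiplier * clip * rng.standard_normal(per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)

g = 5.0 * rng.standard_normal((128, 10))   # a synthetic batch of per-sample gradients
print(dpsgd_step(g))
```

Residual perturbation avoids this entirely: a single batched backward pass suffices because the noise lives in the forward pass, not in the gradient.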

4.3.1. EFFECTS OF THE NUMBER OF MODELS IN THE ENSEMBLE

In this part, we consider the effects of the number of residual-perturbed ResNets in the ensemble. Figure 6 illustrates the performance of EnResNet8 for CIFAR10 classification, measured by AUC and by training and test accuracy. These results show that tuning the noise coefficient and the number of models in the ensemble is crucial to optimizing the trade-off between accuracy and privacy.

4.4. ON THE IMPORTANCE OF SKIP CONNECTIONS

Residual perturbation theoretically relies on the irreversibility of the SDEs (3) and (7), and this irreversibility rests on the skip connections in the ResNet. We test both the standard ResNet and a modified ResNet without skip connections. For CIFAR10 classification, under the same noise coefficient (γ = 0.75), the test accuracy is 0.675 for En_5ResNet8 with skip connections, versus 0.653 for the corresponding ensemble without skip connections. Skip connections make EnResNet more resistant to noise injection, which is crucial for the success of residual perturbation in protecting data privacy.

4.5. ROC CURVES FOR EXPERIMENTS ON DIFFERENT DATASETS

The receiver operating characteristic (ROC) curve illustrates the discrimination ability of a binary classifier; it is obtained by plotting the true positive rate against the false positive rate at different thresholds. The true positive rate, also known as recall, is the fraction of the positive set (all positive samples) that the classifier correctly infers as positive. The false positive rate equals 1 − specificity, where specificity is the fraction of the negative set (all negative samples) correctly inferred as negative. In our case, the attack model is the binary classifier: data points in the training set of the target model are tagged as positive samples, and data points outside it as negative samples. We plot the ROC curves for the different datasets in Figure 7. These ROC curves show that if γ is sufficiently large, the attack model's prediction is nearly a random guess.
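The ROC points are just (false positive rate, true positive rate) pairs swept over thresholds; the following self-contained sketch uses synthetic attack scores in place of the real attack model's outputs:

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """TPR/FPR of a binary classifier (here: the attack model) at each threshold."""
    pts = []
    for t in thresholds:
        pred = scores >= t
        tpr = np.mean(pred[labels == 1])   # recall on members (positive set)
        fpr = np.mean(pred[labels == 0])   # 1 - specificity on non-members
        pts.append((fpr, tpr))
    return pts

rng = np.random.default_rng(5)
labels = np.concatenate([np.ones(100), np.zeros(100)]).astype(int)
scores = np.concatenate([rng.normal(0.6, 0.2, 100), rng.normal(0.4, 0.2, 100)])
pts = roc_points(scores, labels, np.linspace(-1, 2, 13))
print(pts)   # a perfectly private model would trace the diagonal TPR = FPR
```

With large γ the two score distributions overlap, so the curve collapses onto the diagonal, which is exactly what Figure 7 visualizes.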

4.6. REMARK ON THE PRIVACY BUDGET

In the experiments above, we set the constants $G$ and $R$ to 30 for Strategy I. For classifying IDC with ResNet8, the DP budget for Strategy I is $(\epsilon = 1.1\mathrm{e}5, \delta = 1\mathrm{e}{-5})$, while the DP budget for DPSGD is $(\epsilon = 15.79, \delta = 1\mathrm{e}{-5})$. For classifying CIFAR10 with ResNet8, the DP budget for Strategy I is $(\epsilon = 3\mathrm{e}5, \delta = 1\mathrm{e}{-5})$, while that for DPSGD is $(\epsilon = 22.33, \delta = 1\mathrm{e}{-5})$. Note that Theorem 1 offers a rather loose DP budget compared to DPSGD. Several difficulties must be overcome to obtain tight DP bounds for Strategy I, which is significantly harder than for DPSGD. In particular: 1) the loss function of the noise-injected ResNet is highly nonlinear and very complex with respect to the weights, and the noise injected into each residual mapping appears inside the loss function; together these make a tight estimate very difficult. 2) In our proof, we leveraged the framework of subsampled Rényi-DP (Wang et al., 2018) to find a feasible range of the noise variance parameter and then converted to DP to obtain the value of $\gamma$ for a given DP budget; this procedure significantly loosens the estimate of $\gamma$. We leave a tight DP guarantee as future work, in particular reducing the loss of accuracy in estimating $\epsilon$ due to the conversion between Rényi-DP and DP.

5. CONCLUDING REMARKS

In this paper, we proposed residual perturbations, whose theoretical foundation lies in the theory of stochastic differential equations, to protect data privacy in deep learning. Theoretically, we proved that residual perturbation can reduce the generalization gap with differential privacy guarantees. Numerically, we showed that residual perturbations effectively protect membership privacy on several benchmark datasets. In particular, on the IDC benchmark, residual perturbations protect membership privacy better than the state-of-the-art differentially private stochastic gradient descent and achieve remarkably better classification accuracy.

The appendices are structured as follows. In Sections A, B, and C, we prove Theorems 1, 2, and 3, respectively. In Section D, we provide numerical results for Strategy II. More experimental details are provided in Section E. Finally, in Section F, we summarize the architectures of the deep neural networks used in our experiments.

A PROOF OF THEOREM 1


A.1 RÉNYI DIFFERENTIAL PRIVACY

We use the notion of Rényi differential privacy (RDP) to prove the differential privacy (DP) guarantees for the proposed residual perturbations. First, let us review the definition and several results of Rényi differential privacy (Mironov, 2017).

Definition 3 (Rényi divergence). (Mironov, 2017) For any two probability distributions $P$ and $Q$ defined over $\mathcal{D}$, the Rényi divergence of order $\alpha > 1$ is
$$D_{\alpha}(P\|Q) = \frac{1}{\alpha - 1}\log \mathbb{E}_{x\sim Q}(P/Q)^{\alpha}.$$

Definition 4 ($(\alpha, \epsilon)$-RDP). (Mironov, 2017) A randomized mechanism $\mathcal{M} : \mathcal{D} \to \mathcal{R}$ is said to have $\epsilon$-Rényi differential privacy of order $\alpha$, or $(\alpha, \epsilon)$-RDP for short, if for any adjacent $S, S' \in \mathcal{D}$ that differ by only one entry, it holds that $D_{\alpha}(\mathcal{M}(S)\|\mathcal{M}(S')) \le \epsilon$.

Lemma 1. (Mironov, 2017) Let $f : \mathcal{D} \to \mathcal{R}_1$ be $(\alpha, \epsilon_1)$-RDP and $g : \mathcal{R}_1 \times \mathcal{D} \to \mathcal{R}_2$ be $(\alpha, \epsilon_2)$-RDP; then the mechanism defined as $(X, Y)$, where $X \sim f(\mathcal{D})$ and $Y \sim g(X, \mathcal{D})$, satisfies $(\alpha, \epsilon_1 + \epsilon_2)$-RDP.

Lemma 2 (From RDP to $(\epsilon, \delta)$-DP). (Mironov, 2017) If $f$ is an $(\alpha, \epsilon)$-RDP mechanism, then it also satisfies $(\epsilon - \log\delta/(\alpha - 1), \delta)$-differential privacy for any $0 < \delta < 1$.
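Lemma 2's conversion is a one-liner, and in practice one minimizes over the order α. The sketch below does this for the standard Gaussian mechanism with sensitivity 1 (an illustrative mechanism; the paper instead fixes α through its λ parameter):

```python
import math

def rdp_to_dp(alpha, eps_rdp, delta):
    """Lemma 2: an (alpha, eps_rdp)-RDP mechanism is (eps_rdp - log(delta)/(alpha-1), delta)-DP."""
    return eps_rdp - math.log(delta) / (alpha - 1)

# Gaussian mechanism with sensitivity 1 and std sigma has eps_rdp = alpha / (2 sigma^2);
# sweep integer orders and keep the tightest DP epsilon.
sigma, delta = 4.0, 1e-5
best = min(rdp_to_dp(a, a / (2 * sigma**2), delta) for a in range(2, 200))
print(best)
```

The trade-off is visible in the two terms: larger α shrinks the log(1/δ) penalty but inflates the RDP cost, so a moderate order wins.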

B PROOF OF THEOREM 2

In this section, we provide a proof of Theorem 2.

Proof. Let $\phi = \mathrm{BN}(\psi)$, where BN is the batch normalization operation and $\psi$ is an activation function. Due to the property of batch normalization, we assume that $\phi$ is bounded by a positive constant $B$. We show that the model defined by the function class $\mathcal{F}_2$ guarantees training data privacy given only black-box access to the model. Consider training the model with two different datasets $S$ and $S'$, and denote the resulting models as $f(\cdot|S)$ and $f(\cdot|S')$, respectively. In the following, we prove that with appropriate choices of $\gamma$ and $\pi$, $D_{\alpha}(f(x|S)\|f(x|S')) < \epsilon_p$ for any input $x$.

First, we consider the convolution layers, and let $\mathrm{Conv}(x)_i$ be the $i$th entry of the vectorized $\mathrm{Conv}(x)$. Then we have $\mathrm{Conv}(x)_i = x_i + \phi(Ux)_i + \gamma \tilde{x}_i n_i$. For the two training datasets,
$$\mathrm{Conv}(x)_i|S = x_i + \phi(U_1 x)_i + \gamma \tilde{x}_i n_i \sim \mathcal{N}(x_i + \phi(U_1 x)_i,\ \gamma^2 \tilde{x}_i^2),$$
$$\mathrm{Conv}(x)_i|S' = x_i + \phi(U_2 x)_i + \gamma \tilde{x}_i n_i \sim \mathcal{N}(x_i + \phi(U_2 x)_i,\ \gamma^2 \tilde{x}_i^2).$$
Therefore, if $\gamma > (B/\eta)\sqrt{(2\alpha M)/\epsilon_p}$, we have
$$D_{\alpha}\big(\mathcal{N}(x_i + \phi(U_1 x)_i, \gamma^2 \tilde{x}_i^2)\,\big\|\,\mathcal{N}(x_i + \phi(U_2 x)_i, \gamma^2 \tilde{x}_i^2)\big) = \frac{\alpha\,(\phi(U_1 x)_i - \phi(U_2 x)_i)^2}{2\gamma^2 \tilde{x}_i^2} \le \frac{4\alpha B^2}{2\gamma^2 \eta^2} = \frac{2\alpha B^2}{\gamma^2 \eta^2} \le \epsilon_p/M,$$
where we used $|\phi| \le B$ and $|\tilde{x}_i| \ge \eta$. Hence $\gamma > (B/\eta)\sqrt{(2\alpha M)/\epsilon_p}$ guarantees $(\alpha, \epsilon_p/M)$-RDP for every convolution layer. For the last fully connected layer, if $\pi > a\sqrt{(2\alpha M)/\epsilon_p}$, we have
$$D_{\alpha}\big(\mathcal{N}(w_1^T x, \pi^2\|x\|^2)\,\big\|\,\mathcal{N}(w_2^T x, \pi^2\|x\|^2)\big) = \frac{\alpha\,(w_1^T x - w_2^T x)^2}{2\pi^2\|x\|^2} \le \frac{4\alpha a^2\|x\|^2}{2\pi^2\|x\|^2} = \frac{2\alpha a^2}{\pi^2} \le \epsilon_p/M,$$
i.e., $\pi > a\sqrt{(2\alpha M)/\epsilon_p}$ guarantees that the fully connected layer is $(\alpha, \epsilon_p/M)$-RDP. By Lemma 1, we obtain an $(\alpha, \epsilon_p)$-RDP guarantee for the ResNet with $M$ residual mappings if $\gamma > (B/\eta)\sqrt{(2\alpha M)/\epsilon_p}$ and $\pi > a\sqrt{(2\alpha M)/\epsilon_p}$. Let $\lambda \in (0, 1)$; for any given $(\epsilon, \delta)$ pair, if $\epsilon_p \le \lambda\epsilon$ and $-(\log\delta)/(\alpha - 1) \le (1 - \lambda)\epsilon$, then by Lemma 2 we obtain the $(\epsilon, \delta)$-DP guarantee for the ResNet with $M$ residual mappings using residual perturbation Strategy II.

C PROOF OF THEOREM 3

In this section, we prove that the residual perturbation (8) can reduce the generalization error by computing the Rademacher complexity. Let us first recall some lemmas on stochastic differential equations and the Rademacher complexity.

C.1 SOME LEMMAS

Let $\ell : \mathcal{V} \times Y \to [0, B]$ be the loss function. Here we assume $\ell$ is bounded and $B$ is a positive constant. In addition, we denote the function class $\ell_{\mathcal{H}} = \{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}, (x, y) \in X \times Y\}$. The goal of the learning problem is to find $h \in \mathcal{H}$ such that the population risk $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]$ is minimized. The gap between the population risk and the empirical risk $R_{S_N}(h) = (1/N)\sum_{i=1}^N \ell(h(x_i), y_i)$ is known as the generalization error. We have the following lemmas connecting the population and empirical risks via the Rademacher complexity.

Lemma 4 (Ledoux-Talagrand inequality). (M & Talagrand, 2002) Let $H$ be a bounded real-valued function space and let $\phi : \mathbb{R} \to \mathbb{R}$ be Lipschitz with constant $L$ and $\phi(0) = 0$. Then
$$\frac{1}{n}\,\mathbb{E}_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^n \sigma_i \phi(h(x_i))\Big] \le \frac{L}{n}\,\mathbb{E}_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^n \sigma_i h(x_i)\Big].$$

Lemma 5. (Barlett & Mendelson, 2002) Let $S_N = \{(x_1, y_1), \cdots, (x_N, y_N)\}$ be samples chosen i.i.d. according to the distribution $\mathcal{D}$. If the loss function is bounded by $B > 0$, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following holds for all $h \in \mathcal{H}$:
$$R(h) \le R_{S_N}(h) + 2B\, R_{S_N}(\ell_{\mathcal{H}}) + 3B\sqrt{\log(2/\delta)/(2N)}.$$

In the proof below, we use circulant matrices of the form
$$C = \begin{pmatrix} a_1 & a_2 & \cdots & a_d \\ a_d & a_1 & \cdots & a_{d-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_2 & a_3 & \cdots & a_1 \end{pmatrix}.$$
Any circulant matrix admits the eigendecomposition $\Psi^H C \Psi = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, where
$$\sqrt{d}\,\Psi = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & m_1 & \cdots & m_{d-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & m_1^{d-1} & \cdots & m_{d-1}^{d-1} \end{pmatrix},$$
the $m_i$'s are roots of unity, and $\lambda_i = a_1 + a_2 m_i + \cdots + a_d m_i^{d-1}$.

C.2 THE PROOF OF THEOREM 3

Proof. By the definition of Rademacher complexity (Def. 2), we have
$$R_{S_N}(\mathcal{F}) = \frac{1}{N}\,\mathbb{E}_{\sigma}\Big[\sup_{f\in\mathcal{F}}\sum_{i=1}^N \sigma_i\, w x_i^p(T)\Big] = \frac{c}{N}\,\mathbb{E}_{\sigma}\Big[\sup_{\|U\|_2\le c}\Big\|\sum_{i=1}^N \sigma_i x_i^p(T)\Big\|_2\Big].$$
Let $u_i = \Psi x_i^p$ and denote the $j$th element of $u_i$ as $u_{i,j}$. Then by Lemma 7, we have
$$R_{S_N}(\mathcal{F}) = \frac{c}{N}\,\mathbb{E}_{\sigma}\sup_{\|U\|_2\le c}\Big(\sum_{i,j}\sigma_i\sigma_j\langle x_i^p(T), x_j^p(T)\rangle\Big)^{1/2} = \frac{c}{N}\,\mathbb{E}_{\sigma}\sup_{|\lambda_i|\le c}\Big\|\Psi^H \sum_{i=1}^N \sigma_i\big(u_{i,1}e^{\lambda_1 T p}, \cdots, u_{i,d}e^{\lambda_d T p}\big)^T\Big\|_2$$
$$= \frac{c}{N}\,\mathbb{E}_{\sigma}\sup_{|\lambda_i|\le c}\Big\{\sum_{j=1}^d\Big|\sum_{i=1}^N \sigma_i u_{i,j}e^{\lambda_j T p}\Big|^2\Big\}^{1/2} = \frac{c}{N}\,e^{cTp}\,\mathbb{E}_{\sigma}\Big\{\sum_{j=1}^d\Big|\sum_{i=1}^N \sigma_i u_{i,j}\Big|^2\Big\}^{1/2}$$
$$= \frac{c}{N}\,e^{cTp}\,\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^N \sigma_i u_i\Big\|_2 = \frac{c}{N}\,e^{cTp}\,\mathbb{E}_{\sigma}\Big\|\Psi\sum_{i=1}^N \sigma_i x_i^p\Big\|_2 = \frac{c}{N}\,e^{cTp}\,\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^N \sigma_i x_i^p\Big\|_2.$$
Note that $\mathbb{E}(w x_i^p(T)) = w\,\mathbb{E}(x_i^p(T))$, and according to Lemma 6, similarly to the derivation for the function class $\mathcal{F}$, we have
$$R_{S_N}(\mathcal{G}) = \frac{c}{N}\,\mathbb{E}_{\sigma}\sup_{\|U\|_2\le c}\Big(\sum_{i,j}\sigma_i\sigma_j\langle \mathbb{E}x_i^p(T), \mathbb{E}x_j^p(T)\rangle\Big)^{1/2}$$
$$= \frac{c}{N}\,\mathbb{E}_{\sigma}\sup_{|\lambda_i|\le c}\Big\|\Psi^H \sum_{i=1}^N \sigma_i\big(u_{i,1}e^{\lambda_1 T p - p(1-p)\gamma^2 T/2}, \cdots, u_{i,d}e^{\lambda_d T p - p(1-p)\gamma^2 T/2}\big)^T\Big\|_2$$
$$= \frac{c}{N}\,e^{cTp - p(1-p)\gamma^2 T/2}\,\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^N \sigma_i x_i^p\Big\|_2 < R_{S_N}(\mathcal{F}).$$
Therefore, we have shown that the Gaussian noise injected ResNet has a smaller Rademacher complexity, and hence a smaller generalization error bound, than the standard ResNet.
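The proof shows the two complexities differ by the explicit factor exp(−p(1−p)γ²T/2) < 1. A quick numeric check of that factor, with illustrative constants for p, γ, and T:

```python
import math

p, gamma, T = 0.5, 1.0, 1.0   # illustrative values; any 0 < p < 1, gamma > 0, T > 0 work
shrink = math.exp(-p * (1 - p) * gamma**2 * T / 2)   # ratio R(G) / R(F) from the proof
print(shrink)   # strictly less than 1
```

Since p(1−p) > 0 for 0 < p < 1, the factor is strictly below 1 for any γ > 0, and it decays as γ or T grows, matching the empirical observation that larger noise narrows the train/test gap.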

D.2 EXPERIMENTS ON THE IDC DATASET

In this subsection, we consider the performance of the second residual perturbation (8) in protecting membership privacy while retaining classification accuracy on the IDC dataset. We use the same ResNet models as for the first residual perturbation. The results, listed in Fig. 9, confirm that the residual perturbation (8) can effectively protect data privacy and maintain or even improve classification accuracy. In addition, we depict the ROC curve for this experiment in Figure 12 (c). We note that, as the noise coefficient increases, the gap between training and test accuracies narrows, which is consistent with Theorem 3. We have shown that the first residual perturbation (6) outperforms DPSGD in protecting membership privacy and improving classification accuracy. In this part, we further show that the second residual perturbation (8) also outperforms the benchmark DPSGD under the above settings: residual perturbation (8) is significantly more robust to the membership inference attack. For instance, the AUC of the attack model for ResNet8 and En_5ResNet8 (γ = 2.0) is 0.757 and 0.526, respectively. Also, the classification accuracy of En_5ResNet8 (γ = 2.0) is higher than that of ResNet8: 71.2% versus 66.9% for CIFAR10 classification.



Liu et al. (2019) and Wang et al. (2019b) used this kind of noise injection to improve the robustness of ResNets; in particular, Liu et al. (2019) injected multiplicative noise into neural networks to improve their robustness. In contrast, we add noise to the input for the DP guarantee.
https://github.com/tensorflow/privacy/tree/master/tutorials



Figure 1: Illustrations of the forward and backward propagation of the training data using the 2D ODE (2) and SDE (3) models. (a) the original image; (b) & (d) the features of the original image generated by forward propagation using the ODE and SDE models, respectively; (c) & (e) the images recovered by reverse-engineering the features shown in (b) & (d), respectively. We see that it is easy to break the privacy of the ODE model, but much harder for the SDE model.

Figure 2: Visualization of a few selected images from the IDC dataset.

Figure 4: Performance of En 5 ResNet8 with residual perturbation using different noise coefficients (γ) and membership inference attack thresholds on CIFAR10. Residual perturbation can not only enhance membership privacy but also improve the classification accuracy. γ = 0 corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)

Figure 6: Performance of residual perturbation with different noise coefficients (γ) and different numbers of models in the ensemble. The privacy-utility tradeoff can be tuned through these two hyperparameters. (Unit: %)

A Proof of Theorem 1
A.1 Rényi Differential Privacy
A.2 Proof of Theorem 1
B Proof of Theorem 2
C Proof of Theorem 3
C.1 Some Lemmas
C.2 The Proof of Theorem 3
D Experiments on Strategy II
D.1 Forward and Backward Propagation Using (7)
D.2 Experiments on the IDC Dataset
D.3 Experiments on the CIFAR10/CIFAR100 Datasets
D.4 ROC Curves for the Experiments on Different Datasets
E More Experimental Details
Architectures of the Used DNs

Figure 8 plots the forward and backward propagation of the ICLR logo using the SDE model (7). Again, we cannot simply use the backward Euler-Maruyama discretization to reverse the features generated by propagation through the forward Euler-Maruyama discretization.
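This irreversibility can be illustrated numerically. The following is a minimal NumPy sketch assuming a toy linear drift f(x) = -x (not the paper's trained residual mapping): the naive backward step approximately recovers the input in the ODE case, but fails in the SDE case because an adversary does not know the realized noise path.

```python
import numpy as np

def forward_em(x0, drift, gamma, n_steps, dt, rng):
    """Forward Euler-Maruyama: x_{k+1} = x_k + f(x_k) dt + gamma sqrt(dt) xi_k."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x + drift(x) * dt + gamma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

def naive_reverse(xT, drift, n_steps, dt):
    """Attempted reversal ignoring the (unknown) noise: x_k = x_{k+1} - f(x_{k+1}) dt."""
    x = xT.copy()
    for _ in range(n_steps):
        x = x - drift(x) * dt
    return x

drift = lambda x: -x                      # toy linear drift, for illustration only
rng = np.random.default_rng(1)
x0 = np.ones(4)

# ODE case (gamma = 0): reversal recovers the input up to discretization error.
xT_ode = forward_em(x0, drift, gamma=0.0, n_steps=100, dt=0.01, rng=rng)
err_ode = np.linalg.norm(naive_reverse(xT_ode, drift, 100, 0.01) - x0)

# SDE case (gamma > 0): the unknown noise path leaves a large reconstruction error.
xT_sde = forward_em(x0, drift, gamma=1.0, n_steps=100, dt=0.01, rng=rng)
err_sde = np.linalg.norm(naive_reverse(xT_sde, drift, 100, 0.01) - x0)

assert err_ode < err_sde
```

This mirrors Figures 1 and 8: the ODE features can be reverse-engineered, while the SDE features cannot.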

Figure 9: Performance of residual perturbation (8) for En 5 ResNet8 with different noise coefficients (γ) and membership inference attack thresholds on the IDC dataset. Residual perturbation can significantly improve membership privacy and reduce the generalization gap. γ = 0 corresponds to the baseline ResNet8. (Unit: %)

Figure 10: Performance of En 5 ResNet8 with residual perturbation (8) using different noise coefficients (γ) and membership inference attack thresholds on CIFAR10. Residual perturbation (8) can not only enhance membership privacy but also improve the classification accuracy. γ = 0 corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)

Figure 13: Architecture of the ResNet8 used in our experiments.

and the Invasive Ductal Carcinoma (IDC) datasets. Both CIFAR10 and CIFAR100 contain 60K 32 × 32 color images, with 50K used for training and 10K for testing. The IDC dataset is a breast-cancer-related benchmark dataset, which contains 277,524 patches of size 50 × 50, with 198,738 labeled negative (0) and 78,786 labeled positive (1). Figure 2 depicts a few patches from the IDC dataset. For the IDC dataset, we follow Wu et al. (2019a) and split the whole dataset into training, validation, and test sets. The training set consists of 10,788 positive patches and 29,164 negative patches, and the test set contains 11,595 positive patches and 31,825 negative patches; the remaining patches are used as the validation set. For each dataset, we split its training set into D shadow and D target of the same size. Furthermore, we split D shadow into two halves of the same size and denote them as D train

Performance evaluations. We consider both classification accuracy and the capability of protecting membership privacy. The attack model is a binary classifier that decides whether a data point is in the training set of the target model. For any x ∈ D target, we apply the attack model to predict its probability p of belonging to the training set of the target model. Given any fixed threshold t, if p ≥ t we classify x as a member of the training set (positive sample), and if p < t we conclude that x is not in the training set (negative sample); thus we obtain different attack results with different thresholds. Furthermore, we can plot the ROC curve of the attack model (see subsection 4.5 for details) and use the area under the ROC curve (AUC) to evaluate the membership inference attack. The target model protects perfect membership privacy if the AUC is 0.5 (the attack model performs a random guess), and the higher the AUC, the less private the target model. Moreover, we use the precision (the fraction of records inferred as members that are indeed members of the training set) and recall (the fraction of the training set that is correctly inferred as members by the attack model) to measure ResNets' capability of protecting membership privacy.
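The threshold-sweep evaluation above can be sketched as follows. This is a minimal NumPy sketch with toy attack scores; it uses the Mann-Whitney formulation of AUC, which equals the area under the ROC curve obtained by sweeping the threshold t.

```python
import numpy as np

def attack_auc(scores_members, scores_nonmembers):
    """AUC of a membership inference attack: the probability that a member
    receives a higher attack score than a non-member (ties counted as 1/2).
    AUC = 0.5 means the attack is a random guess (perfect membership privacy)."""
    s_m = np.asarray(scores_members)[:, None]
    s_n = np.asarray(scores_nonmembers)[None, :]
    return float((s_m > s_n).mean() + 0.5 * (s_m == s_n).mean())

def precision_recall_at(threshold, scores_members, scores_nonmembers):
    """Precision and recall of the attack at a fixed threshold t (p >= t => member)."""
    tp = sum(s >= threshold for s in scores_members)      # members caught
    fp = sum(s >= threshold for s in scores_nonmembers)   # non-members flagged
    fn = sum(s < threshold for s in scores_members)       # members missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy scores: an attack that cannot separate members from non-members has AUC 0.5.
members = [0.9, 0.8, 0.7, 0.6]
nonmembers = [0.9, 0.8, 0.7, 0.6]
assert attack_auc(members, nonmembers) == 0.5
```

A perfectly separating attack would score AUC 1.0; the residual-perturbation experiments aim to push the attacker's AUC toward 0.5.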

Residual perturbation vs. DPSGD in training ResNet8 and EnResNet8 for the IDC classification. Ensemble of ResNet8 with residual perturbation has higher test accuracy and protects better membership privacy (smaller AUC).

Table 1 lists the AUC of the attack model and the training and test accuracies of the target model; we see that residual perturbation can improve accuracy and protect membership privacy better.

addition, according to the Ledoux-Talagrand contraction inequality, assuming the loss function is $L$-Lipschitz, we have $R_{S_N}(\ell \circ \mathcal{H}) \le L\, R_{S_N}(\mathcal{H})$. So the population risk can be bounded by the empirical risk plus the Rademacher complexity of the function class $\mathcal{H}$. Since we cannot minimize the population risk directly, we minimize it indirectly by minimizing the empirical risk and the Rademacher complexity of $\mathcal{H}$. Next, we further discuss the Rademacher complexity of the function class $\mathcal{H}$; we first introduce several lemmas below.

Lemma 7. A matrix $C \in \mathbb{R}^{d\times d}$ is circulant if there exist real numbers $a_1, \cdots, a_d$ such that

Table 2 lists the AUC of the attack model and the training & test accuracy of the target model; we see that the second residual perturbation can also improve the classification accuracy and protect membership privacy better.

D.3 EXPERIMENTS ON THE CIFAR10/CIFAR100 DATASETS

In this subsection, we test the second residual perturbation (8) on the CIFAR10/CIFAR100 datasets with the same model and the same settings as before. Figure 10 plots the performance of En 5 ResNet8 on the CIFAR10 dataset. These results show that the ensemble of ResNets with residual perturbation (8) protects membership privacy while maintaining classification accuracy.

Residual perturbation (8) vs. DPSGD in training ResNet8 for the IDC dataset classification. Ensemble of ResNet8 with residual perturbation is more accurate for classification (higher test accuracy) and protects membership privacy better (smaller AUC).


Lemma 3 (Mironov, 2017) (Post-processing lemma). Let $M : D \to R$ be a randomized algorithm that is $(\alpha, \epsilon)$-RDP, and let $f : R \to R'$ be an arbitrary randomized mapping. Then $f(M(\cdot)) : D \to R'$ is $(\alpha, \epsilon)$-RDP.

A.2 PROOF OF THEOREM 1

In this subsection, we give a proof of Theorem 1, i.e., the DP guarantee for Strategy I.

Proof. We prove Theorem 1 by mathematical induction. Consider two adjacent datasets that differ by one entry. For the first residual mapping, it is easy to check that the desired RDP bound holds when $\gamma \ge R\sqrt{2\alpha/\epsilon_p}$. For the remaining residual mappings, we denote the response of the $i$th residual mapping, for any two input data $x_N$ and $\tilde{x}_N$, as $x^i_N$ and $\tilde{x}^i_N$, respectively. Based on our assumption, $\mu_{N,i}$ and $\tilde{\mu}_{N,i}$ are both bounded by the constant $G$. According to the post-processing lemma (Lemma 3), the Rényi divergence of each residual mapping is at most $\epsilon_p/i$, so the bound holds when $\gamma > \sqrt{2\alpha G^2/\epsilon_p}$. Leveraging the post-processing lemma (Lemma 3) again yields the guarantee for the network output. Let $B_t$ be the index set with $|B_t| = b$, and update $U^i$ accordingly, where $U^i_{t+1}$ denotes the weights after the $t$th training iteration. When $N \notin B_t$, it is obvious that $D_\alpha\big(U^i_{t+1}|S \,\|\, U^i_{t+1}|\tilde{S}\big) = 0$; when $N \in B_t$, the equations used to update $U^i$ can be rewritten so that, according to the post-processing lemma (Lemma 3), the same RDP bound applies.
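The condition on γ in the proof above comes from the standard RDP bound for the Gaussian mechanism. The following is a minimal sketch assuming the bound $D_\alpha = \alpha\,\Delta^2/(2\sigma^2)$ from Mironov (2017); the exact constant in the paper's condition on γ, including any dependence on the step size p, may differ.

```python
import math

def gaussian_rdp_epsilon(alpha, sensitivity, sigma):
    """Renyi divergence of order alpha between N(mu, sigma^2 I) and N(mu', sigma^2 I)
    with ||mu - mu'|| <= sensitivity:
        D_alpha = alpha * sensitivity^2 / (2 * sigma^2)
    (the standard RDP bound for the Gaussian mechanism, Mironov 2017)."""
    return alpha * sensitivity**2 / (2 * sigma**2)

# Choosing the noise scale gamma >= R * sqrt(2 * alpha / eps) yields an RDP
# guarantee of order alpha with epsilon at most eps, mirroring the condition on
# gamma in the proof above (R, alpha, eps here are illustrative values).
alpha, eps, R = 8, 1.0, 1.0
gamma = R * math.sqrt(2 * alpha / eps)
assert gaussian_rdp_epsilon(alpha, sensitivity=R, sigma=gamma) <= eps
```

By the post-processing lemma, this per-mapping bound is preserved under any further (randomized) processing, which is exactly how the induction over residual mappings proceeds.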

