DP-SGD-LF: IMPROVING UTILITY UNDER DIFFERENTIALLY PRIVATE LEARNING VIA LAYER FREEZING

Abstract

Differentially Private SGD (DP-SGD) is a widely used substitute for SGD to train deep learning models with privacy guarantees. However, these privacy guarantees come at a cost in model utility. The key DP-SGD steps responsible for this utility cost are per-sample gradient clipping, which introduces bias, and adding noise to the aggregated (clipped) gradients, which increases the variance of model updates. Inspired by the observation that different layers in a neural network often converge at different rates following a bottom-up pattern, we incorporate layer freezing into DP-SGD to increase model utility at a fixed privacy budget. Through theoretical analysis and empirical evidence, we show that layer freezing improves model utility by reducing both the bias and the variance introduced by gradient clipping and noising. These improvements in turn lead to better model accuracy, and empirically generalize over multiple datasets, models, and privacy budgets.

1. INTRODUCTION

Deep Neural Networks (DNNs) have seen growing success at many tasks across various domains in recent years. As a result, DNNs are now deployed in numerous applications, including some involving sensitive data, such as users' medical history, purchasing records, or chat histories. In these sensitive applications, data privacy is a concern. However, there is strong evidence that deep learning models memorize, and thus leak, information about their training data (Shokri et al., 2016; Carlini et al., 2020; 2022; Feldman & Zhang, 2020). To prevent data leakage, common DNN training algorithms such as Stochastic Gradient Descent (SGD) and its variants have been adapted to enforce Differential Privacy (DP) (Song et al., 2013; Dwork et al., 2006), a rigorous privacy guarantee which provably mitigates data leaks. As a convenient drop-in replacement for SGD, the DP-SGD algorithm is commonly used for privacy-preserving machine learning, and numerous efforts have improved its theoretical privacy analysis (Abadi et al., 2016; Mironov, 2017; Mironov et al., 2019). However, the privacy guarantees offered by DP-SGD still come at a substantial cost in model utility (accuracy), despite substantial practical improvements over time (De et al., 2022; Papernot et al., 2021). There are two key changes to SGD that DP-SGD introduces in each model update step. Each change is required to prove the privacy guarantees, and each contributes to the utility cost. The first change is to clip each per-sample gradient to a fixed L2-norm bound, which introduces bias in the estimation of the gradient descent direction. The second change is to add Gaussian noise to the aggregated (clipped) gradients, which increases the variance of model updates. We show through theoretical analysis that increasing the gradient clipping norm of a given DNN layer in DP-SGD reduces the variance introduced by DP noise and, under some assumptions, the clipping bias as well.
Both lead to better convergence upper-bounds for DP-SGD. We combine this result with the observation that different layers in a DNN trained with SGD converge at different rates following a bottom-up pattern (which we empirically verify also holds for DP-SGD), and introduce the DP-SGD Layer Freeze (DP-SGD-LF) algorithm. This algorithm freezes the lower layers (closer to the input) of a DNN towards the end of training, which increases the norm of the clipped gradients for the remaining layers, thereby decreasing the bias and variance introduced by DP-SGD when updating these parameters. Since the remaining layers benefit more from updates at this point of training, the final accuracy increases. We apply DP-SGD-LF to state-of-the-art DP-SGD implementations on three datasets (De et al., 2022; Papernot et al., 2021), and show that it improves the final model's accuracy by up to 1.3 percentage points, and is particularly effective in the high privacy (low DP ϵ) regime. We also show that DP-SGD-LF is not sensitive to its hyper-parameters, and we propose and use easy-to-set, reasonable defaults. The rest of the paper describes our contributions: after introducing the necessary background in §2, §3 introduces our algorithm and supports its design through empirical and theoretical analysis. §4 then empirically confirms the expected behavior, and shows that DP-SGD-LF improves the accuracy of different models over multiple image classification datasets.

2. BACKGROUND

Mini-batch SGD is one of the most commonly used optimization algorithms in non-private deep learning. For each iteration $t$, with step size $\eta_t$, SGD updates the parameters of the model $\theta$ by stepping in the direction of steepest descent, estimated with the averaged gradients over $B$ samples in a mini-batch:
$$\theta_{t+1} \leftarrow \theta_t - \eta_t \frac{1}{B}\sum_{i=1}^{B} g_t(x_i).$$
Convergence analyses of the SGD algorithm often rely on the following fundamental result.

Lemma 2.1 (Descent Lemma (Bottou et al., 2018)). Assume the objective function $f: \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and the gradient of $f$, $\nabla f: \mathbb{R}^d \to \mathbb{R}^d$, is Lipschitz continuous with Lipschitz constant $L > 0$, i.e. $\|\nabla f(v) - \nabla f(w)\| \leq L\|v - w\|\ \forall v, w$. Then
$$f(v) \leq f(w) + \nabla f(w)^T (v - w) + \frac{L}{2}\|v - w\|^2 \quad \forall v, w.$$

Under privacy constraints, the DP-SGD algorithm provides a convenient substitute for SGD for training DNNs with differential privacy guarantees (Abadi et al., 2016). The DP-SGD algorithm protects privacy by clipping each per-sample gradient vector, $g_t(x_i) \leftarrow \nabla_{\theta_t} f(\theta_t, x_i)$, and adding noise drawn from a Normal distribution to the aggregated clipped gradients. Let $C$ be the L2-norm clipping threshold, $\sigma$ the noise multiplier, and $d$ the dimension of the model's parameters. The update rule for DP-SGD in each iteration is:
$$\theta_{t+1} \leftarrow \theta_t - \eta_t \frac{1}{B}\left(\sum_{i=1}^{B} \mathrm{clip}(g_t(x_i), C) + z_t\right), \quad z_t \sim \mathcal{N}(0, \sigma^2 C^2 I_d),$$
$$\mathrm{clip}(g_t(x_i), C) \leftarrow g_t(x_i) / \max\left(1, \frac{\|g_t(x_i)\|_2}{C}\right),$$
where $C$ controls the maximum influence that an individual sample can have on the gradient (the sensitivity), and $\sigma$ controls the noise level, scaled with respect to the sensitivity. We use the analysis based on Rényi Differential Privacy (RDP) (Mironov, 2017) for privacy accounting. The composition over $t$ steps of training and the conversion of the RDP guarantee to the $(\epsilon, \delta)$-DP guarantee follow from the results in Mironov et al. (2019).
We use the publicly available implementation of the RDP privacy accountant in Opacus (Yousefpour et al., 2021).
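The DP-SGD update above can be sketched in a few lines; the following is an illustrative NumPy sketch (function names are ours, and this is not the Opacus implementation; privacy accounting is omitted):

```python
import numpy as np

def clip_grad(g, C):
    """Rescale a per-sample gradient to L2 norm at most C."""
    return g / max(1.0, np.linalg.norm(g) / C)

def dp_sgd_step(theta, per_sample_grads, eta, C, sigma, rng):
    """One DP-SGD update: clip each sample's gradient, sum, add Gaussian
    noise with per-coordinate std sigma * C, then average and step."""
    B = len(per_sample_grads)
    clipped_sum = np.sum([clip_grad(g, C) for g in per_sample_grads], axis=0)
    z = rng.normal(0.0, sigma * C, size=theta.shape)
    return theta - eta * (clipped_sum + z) / B
```

With sigma = 0 and all per-sample gradients already within norm C, this reduces to plain mini-batch SGD.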

3. DIFFERENTIALLY PRIVATE LEARNING WITH LAYER FREEZING

We propose to incorporate layer freezing into DP-SGD, and demonstrate its effectiveness in increasing the trained model's predictive accuracy at a fixed privacy budget. The intuition behind the performance gain is as follows. The two key steps in DP-SGD, clipping and noising, provide a DP guarantee at the cost of degrading model utility: clipping potentially introduces bias into the estimated descent direction, since it truncates individual gradients before aggregation to control sensitivity (Chen et al., 2021; Pichapati et al., 2019; Zhang et al., 2019); noising introduces variance on top of the biased estimate by adding random noise to the aggregated clipped gradients. Freezing parameters limits the model's capacity to learn representations, but can bring benefits by reducing the bias and variance caused by clipping and noising on the remaining trainable parameters. Given the observation that lower layers (closer to the input) converge faster than higher layers (closer to the prediction), we can freeze the parameters in lower layers during training, to minimally sacrifice model capacity in exchange for better updates to the upper-layer parameters. We detail our approach in the rest of this section. §3.1 presents the algorithm we propose. §3.2 presents empirical evidence that lower layers in a DP-SGD-trained neural network converge faster than the upper layers. §3.3 shows theoretically that clipping and noising can be expected to degrade the convergence of model training, and presents our key metric to quantify this negative impact: the distortion angle in estimating the descent direction. §3.4 shows that, under some assumptions, layer freezing reduces both the bias and the variance caused by clipping and noising with respect to the trainable parameters. §4 then empirically evaluates our claims and demonstrates the effectiveness of layer freezing in improving model utility under multiple settings.
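The update at the core of this approach can be sketched as follows; this is a minimal NumPy sketch of one DP-SGD-LF step under the description above (the dict-of-layers layout and names are ours for illustration, not our actual JAX implementation). Freezing a layer removes its coordinates from the joint clipping norm, so the remaining trainable gradients are clipped less aggressively:

```python
import numpy as np

def dp_sgd_lf_step(params, per_sample_grads, frozen, eta, C, sigma, rng):
    """One DP-SGD-LF update. `params` and each per-sample gradient are
    dicts mapping layer name -> parameter array. Layers listed in
    `frozen` are skipped, so the clipping norm is taken only over the
    remaining trainable parameters."""
    names = [n for n in params if n not in frozen]
    B = len(per_sample_grads)
    agg = {n: np.zeros_like(params[n]) for n in names}
    for g in per_sample_grads:
        # joint L2 norm over the trainable layers only
        norm = np.sqrt(sum(float(np.sum(g[n] ** 2)) for n in names))
        scale = 1.0 / max(1.0, norm / C)
        for n in names:
            agg[n] += scale * g[n]
    new_params = {n: params[n].copy() for n in params}  # frozen layers unchanged
    for n in names:
        z = rng.normal(0.0, sigma * C, size=params[n].shape)
        new_params[n] = params[n] - eta * (agg[n] + z) / B
    return new_params
```

Before the freezing step, `frozen` is empty and this coincides with plain DP-SGD; after the freezing step, the lower layers are placed in `frozen`.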

3.2. LAYER CONVERGENCE FOLLOWS A BOTTOM-UP PATTERN IN PRIVATE TRAINING

There is strong empirical evidence suggesting that, for DNNs trained in the non-private setting, layers converge at different rates, exhibiting a bottom-up pattern (Wang et al., 2022; Raghu et al., 2017; Morcos et al., 2018; Yosinski et al., 2014; Rogers et al., 2020). In this section, we verify that a similar observation holds under private training. The dataset, model, and algorithm used for demonstration are CIFAR-10, a 5-layer CNN (with the last layer being a softmax-activated classification layer), and DP-SGD with C = 3 and σ = 1. The accuracy of the full model is around 0.63 at the last training step, when ϵ is around 7. We examine the privacy-utility trade-off of the DP-SGD-trained model by training only a single layer after ϵ = 3, where the model achieves moderately high accuracy but is not yet fully trained. We note that the final layer is the softmax-activated classification layer and is not frozen in these experiments. As shown in Figure 1 (Left), we observe only a minor gain in accuracy when Layer 1 alone is trained further, while a larger increase is observed when only upper layers are trained. The final accuracy obtained by training Layer 1 only is also considerably lower than when training only any of the other layers. To support this observation, we also measure convergence quality using a post-hoc analysis tool, PWCCA (Morcos et al., 2018), comparing intermediate activation vectors throughout training to the converged activation vectors from a fully-trained model (i.e., the final iteration the model converges to, at which accuracy is reasonably high). Figure 1 (Right) shows the PWCCA score for the first 4 layers over training steps. A lower PWCCA score means that the layer has converged further. We observe that only Layer 1 shows a (weak) sign of convergence, while the upper layers are likely dominated by the accumulated noise and show no clear sign of convergence.
These experiments confirm that lower layers converge better and earlier during training with DP-SGD, and that focusing training on the higher layers leads to more utility.

Figure 2: Left: The true, unbiased, biased, and private signal vectors and the corresponding distortion angles, as in Definitions 3.1 and 3.2. Middle: Assuming gradients that were unclipped (resp. clipped) remain unclipped (resp. clipped) after freezing, when $s^U_t$ is more aligned with the clipped gradients, $\gamma^{UB}_t$ is reduced after freezing, since the clipped gradients are clipped less (their magnitude increases relative to before freezing) due to a decrease in $\|g_t(x_j)\|$. Right: For an arbitrary noise vector, since $s^B_t$ increases in magnitude, $\gamma^{BP}_t$ is reduced after freezing, such that noising makes $s^P_t$ less variable in direction.

3.3. DISTORTIONS IN DESCENT DIRECTION DEGRADES DP-SGD PERFORMANCE

We first define the true, unbiased, biased, and private signal vectors (which refer to the true descent direction, and different estimates of it from a mini-batch: without changes, with clipping, and with clipping and noise, respectively) and the corresponding distortion angle in each step $t$ of the optimization algorithm. Figure 2 (Left) shows a demonstration using 2-dimensional vectors. We adapt the Descent Lemma (Lemma 2.1) to DP-SGD, and show that bias and variance in the distortion angle increase the convergence upper-bound. This suggests that a larger bias (mis-oriented descent direction) and a higher variance in the distortion angle are likely to lead to slower convergence.

Definition 3.1 (The True, Unbiased, Biased and Private Signal Vectors). For each step $t$, the true signal vector $\nabla f_t$ is the gradient of the loss function $f$ evaluated at $\theta_t$ on all the training data. It is the true direction of steepest descent for the empirical loss. The unbiased signal vector $s^U_t$ is the mean gradient over samples in a batch drawn following a uniform sampling scheme. It is an unbiased estimate of $\nabla f_t$ (Bottou et al., 2018): $s^U_t := \frac{1}{B}\sum_{i=1}^{B} g_t(x_i)$. The biased signal vector $s^B_t$ is the mean of the sample-wise clipped gradients on a batch drawn following a uniform sampling scheme. It is a biased estimate of $s^U_t$, since per-sample gradients are re-scaled differently before aggregation: $s^B_t := \frac{1}{B}\sum_{i=1}^{B} \mathrm{clip}(g_t(x_i), C)$. Letting $z_t \sim \mathcal{N}(0, (1/B^2)\sigma^2 C^2 I)$ be the random noise vector, the private signal vector is the sum of $s^B_t$ and $z_t$, and is the actual descent direction in DP-SGD updates: $s^P_t := s^B_t + z_t$.

With Definition 3.1, the parameter update rule of DP-SGD can be rewritten as $\theta_{t+1} \leftarrow \theta_t - \eta_t s^P_t$, with $s^P_t = s^B_t + z_t$. We note that gradient clipping in DP-SGD has been shown to be biased, and a bias vector $b_t$ can be decomposed following the analysis of Chen et al. (2021).
We define the distortion in estimating the descent direction as the angle between $\nabla f_t$ and $s^P_t$.

Definition 3.2 (Distortion Angle). For each step $t$, the distortion angle $\gamma_t$ is the angle between $\nabla f_t$ and $s^P_t$:
$$\gamma_t := \arccos \frac{\langle \nabla f_t, s^P_t \rangle}{\|\nabla f_t\| \|s^P_t\|},$$
and it decomposes as $\gamma_t = \gamma^B_t + \gamma^{BP}_t$, where $\gamma^B_t$ is the angle between $\nabla f_t$ and $s^B_t$ (the bias component) and $\gamma^{BP}_t$ is the angle between $s^B_t$ and $s^P_t$ (the noise component).

In what follows, the expectations and variances of $s^U_t$, $s^B_t$, $s^P_t$, $\gamma_t$, $\gamma^B_t$, and $\gamma^{BP}_t$ are taken with respect to the data sampling and, when applicable, the DP noise distribution. We next show that the bias and variance introduced by clipping and noising lead to a worse convergence bound for DP-SGD. The proof of Lemma 3.1 is in Appendix A.

Lemma 3.1 (DP-SGD convergence bound). Following the proof of Bottou et al. (2018), adapted to DP-SGD, we show the following result. Assume the objective function $f: \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and the private gradient of $f$, $\nabla f: \mathbb{R}^d \to \mathbb{R}^d$, is Lipschitz continuous with Lipschitz constant $L > 0$, $\|\nabla f(v) - \nabla f(w)\| \leq L\|v - w\|\ \forall v, w$. Then the convergence bound of DP-SGD is
$$\min_{t=0,\dots,T} \mathbb{E}\left[\|\nabla f(\theta_t)\|^2\right] \leq \left( f(\theta_0) - \mathbb{E}[f(\theta_T)] - \sum_{t=1}^{T} \eta_t \mathbb{E}\left[\cos(\gamma^B_t)\|\nabla f(\theta_t)\|\|s^B_t\|\right] + \sum_{t=1}^{T} \frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2 \right) \Big/ \sum_{t=1}^{T} \eta_t.$$

Finally, we explain why $\gamma_t$ is an effective metric for utility. From the convergence bound above, we see the negative effect of $\gamma_t$ due to clipping and noising: a larger bias ($\gamma^B_t$) and a larger noise variance ($\sigma_{DP}$) make the upper-bound larger, and could therefore lead to worse model performance. (1) A negative value of $\mathbb{E}[\cos(\gamma^B_t)\|\nabla f(\theta_t)\|\|s^B_t\|]$ could be caused by a large bias angle $\gamma^B_t$; this makes the convergence bound worse by adding a positive term. A smaller $\gamma^B_t$ ($|\gamma^B_t| \in [0, \pi]$) means that the DP-SGD descent direction is better aligned with the true direction of steepest descent ($\gamma_t$ is smaller), and $\mathbb{E}[\cos(\gamma^B_t)\|\nabla f(\theta_t)\|\|s^B_t\|]$ is positive. However, $|\gamma^B_t| > 0$ still makes $\cos(\gamma^B_t) < 1$, increasing the bound. (2) Since $z_t$ is drawn from a zero-mean Normal distribution, adding noise does not bias the estimation.
However, from the convergence bound we see that a larger noise variance term $\frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2$ also makes the bound worse. A higher variance means that larger noise is more likely to be added to $s^B_t$, so we would expect $\gamma_t$ to be large.

3.4. LAYER FREEZING MITIGATES DISTORTIONS IN OPTIMIZATION DIRECTION

In this section, we introduce (strong) assumptions, supported by empirical measurements in Appendix E, under which we prove that freezing some parameters benefits the remaining trainable parameters by reducing the bias and variance in their distortion angle $\gamma_t$. Intuitively, since a subset of the parameters is frozen, each per-sample gradient's L2-norm is reduced (the frozen parameters have a gradient of zero). This in turn increases the magnitude of each sample-wise clipped gradient on the remaining parameters. Under our assumptions, clipping gradients less aggressively reduces the bias in the distortion angle $\gamma^B_t$, since each clipped gradient is closer to its original value. A larger biased signal is also more robust to noise, and $\gamma^{BP}_t$ is smaller for any fixed noise draw, leading to reduced variance. Figure 2 (Middle and Right) illustrates these effects.

Lemma 3.2 (The magnitude of the biased signal of the trainable parameters increases after freezing). Let $s^{B,b}_t(\theta'_t)$ and $s^{B,a}_t(\theta'_t)$ be the biased signals of the trainable parameters before and after freezing. Then $\|g^a_t(x_i)\| \leq \|g^b_t(x_i)\|\ \forall x_i$ and $\|s^{B,a}_t(\theta'_t)\| \geq \|s^{B,b}_t(\theta'_t)\|$.

Assumption 3.1. Per-sample gradients $g(x_i)$ with a larger magnitude, $\|g(x_i)\| > C$, are more aligned (smaller angle) with the gradient direction $\nabla f_t$, whereas those with a smaller magnitude, $\|g(x_i)\| \leq C$, are less aligned (larger angle).

Assumption 3.2. Rescaling the gradient norms $\|g(x_i)\|$ by freezing does not change which gradients are clipped and which are unclipped; it only rescales the clipped gradients.

Proposition 3.1 (Freezing reduces bias and variance in the distortion angle with respect to the trainable parameters). Let $\theta_t$ be the set of full parameters and $\theta'_t$ the subset of trainable parameters. Let $\gamma_t(\theta'_t)$ be the distortion angle with respect to the trainable parameters, and let superscripts $b$ and $a$ denote a quantity before and after freezing occurs. Under Assumptions 3.1 and 3.2, the following results hold: (1) $\mathbb{E}[\gamma^{B,a}_t(\theta'_t)] \leq \mathbb{E}[\gamma^{B,b}_t(\theta'_t)]$; (2) $\mathrm{Var}[\gamma^{BP,a}_t(\theta'_t)] \leq \mathrm{Var}[\gamma^{BP,b}_t(\theta'_t)]$; (3) $\mathbb{E}[\gamma^a_t(\theta'_t)] \leq \mathbb{E}[\gamma^b_t(\theta'_t)]$ and $\mathrm{Var}[\gamma^a_t(\theta'_t)] \leq \mathrm{Var}[\gamma^b_t(\theta'_t)]$.

Proof.
(1) By the clipping function (§2), each per-sample gradient is either clipped, if $\|g_t(x_i)\| > C$, or preserved at its original value, if $\|g_t(x_i)\| \leq C$. Let $g'$ denote the gradients with respect to the trainable parameters, and let $v$ and $w$ be the sums of the unclipped and clipped gradients, respectively, in a random batch of size $B$:
$$v = \sum_{j: \|g'_t(x_j)\| \leq C} g'_t(x_j), \quad w = \sum_{k: \|g'_t(x_k)\| > C} g'_t(x_k) \cdot \frac{C}{\|g'_t(x_k)\|}, \quad |j| + |k| = B.$$
By Lemma 3.2, since $\|g'^a_t(x_k)\| \leq \|g'^b_t(x_k)\|\ \forall k$, we have $g'_t(x_k) \cdot C / \|g'^a_t(x_k)\| \geq g'_t(x_k) \cdot C / \|g'^b_t(x_k)\|$ in magnitude, i.e., the magnitude of each clipped gradient increases after freezing compared to before. By Assumption 3.2, $v$ does not change. Since $\|w^a\| \geq \|w^b\|$, we have
$$\mathbb{E}\left[\arccos \frac{\langle v, v + w^a \rangle}{\|v\|\|v + w^a\|}\right] \leq \mathbb{E}\left[\arccos \frac{\langle v, v + w^b \rangle}{\|v\|\|v + w^b\|}\right],$$
which means $\mathbb{E}[\gamma^{vB,a}_t] \leq \mathbb{E}[\gamma^{vB,b}_t]$. Under Assumption 3.1, if $\mathbb{E}[\gamma^{Uv,a}_t] \geq \mathbb{E}[\gamma^{vw,b}_t]$, then $\mathbb{E}[\gamma^{B,a}_t(\theta'_t)] \leq \mathbb{E}[\gamma^{B,b}_t(\theta'_t)]$.

(2) Since the noise is drawn from a zero-mean Normal distribution, for each batch, taking the expectation over the random noise draw gives $\mathbb{E}[\gamma^{BP}_t] = 0$. The variance term can therefore be simplified as
$$\mathrm{Var}[\gamma^{BP}_t] = \mathbb{E}[(\gamma^{BP}_t)^2] - (\mathbb{E}[\gamma^{BP}_t])^2 = \mathbb{E}\left[\left(\arccos \frac{\langle s^B_t(\theta'_t), s^B_t(\theta'_t) + z_t \rangle}{\|s^B_t(\theta'_t)\|\|s^B_t(\theta'_t) + z_t\|}\right)^2\right].$$
Let $X$ denote the random variable inside $\arccos(\cdot)$; for every batch, the distribution of the private gradient is determined by the random noise distribution, so
$$\mathrm{Var}[\gamma^{BP}_t] = \int_{-\infty}^{\infty} (\arccos(x))^2 f(x)\,dx,$$
where $f$ is the density of $X$ induced by $z_t \sim \mathcal{N}(0, \sigma^2 C^2 I_d)$. We show the following result in Appendix C:
$$\mathbb{E}\left[\left(\arccos \frac{\langle s^B_t(\theta'_t)^a, s^B_t(\theta'_t)^a + z_t \rangle}{\|s^B_t(\theta'_t)^a\|\|s^B_t(\theta'_t)^a + z_t\|}\right)^2\right] \leq \mathbb{E}\left[\left(\arccos \frac{\langle s^B_t(\theta'_t)^b, s^B_t(\theta'_t)^b + z_t \rangle}{\|s^B_t(\theta'_t)^b\|\|s^B_t(\theta'_t)^b + z_t\|}\right)^2\right].$$
Therefore $\mathrm{Var}[\gamma^{BP,a}_t(\theta'_t)] \leq \mathrm{Var}[\gamma^{BP,b}_t(\theta'_t)]$.

(3) follows from (1) and (2), since $\mathbb{E}[\gamma^a_t(\theta'_t)] = \mathbb{E}[\gamma^{B,a}_t(\theta'_t)] + \mathbb{E}[\gamma^{BP}_t(\theta'_t)] = \mathbb{E}[\gamma^{B,a}_t(\theta'_t)]$, because $\mathbb{E}[\gamma^{BP}_t(\theta'_t)] = 0$.
Moreover, $\mathrm{Var}[\gamma_t(\theta'_t)] = \mathrm{Var}[\gamma^B_t(\theta'_t)] + \mathrm{Var}[\gamma^{BP}_t(\theta'_t)]$, since the noise $z_t$ is drawn independently. By Assumption 3.1, for every batch of size $B$, $\gamma^{B,a}_t \leq \gamma^{B,b}_t$, so $\mathrm{Var}[\gamma^{B,a}_t(\theta'_t)] \leq \mathrm{Var}[\gamma^{B,b}_t(\theta'_t)]$, and thus $\mathrm{Var}[\gamma^a_t(\theta'_t)] \leq \mathrm{Var}[\gamma^b_t(\theta'_t)]$.
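The per-sample effect behind Lemma 3.2 is easy to check numerically: zeroing the frozen coordinates can only shrink each per-sample gradient norm, so each clipped gradient, restricted to the trainable coordinates, is rescaled less aggressively. A sketch under synthetic Gaussian gradients (ours; lifting from per-sample magnitudes to the aggregated biased signal relies on the assumptions above):

```python
import numpy as np

rng = np.random.default_rng(42)
C, B, d = 1.0, 32, 10
grads = rng.normal(size=(B, d))        # synthetic per-sample gradients
trainable = np.arange(d // 2, d)       # freeze the first half of coordinates

for row in grads:
    sub = row[trainable]               # per-sample gradient after freezing
    scale_b = 1.0 / max(1.0, np.linalg.norm(row) / C)  # clip scale, before
    scale_a = 1.0 / max(1.0, np.linalg.norm(sub) / C)  # clip scale, after
    # Freezing zeroes coordinates, so the per-sample norm can only shrink,
    assert np.linalg.norm(sub) <= np.linalg.norm(row)
    # ... hence the clipping scale 1/max(1, ||g||/C) can only grow ...
    assert scale_a >= scale_b
    # ... and the clipped gradient over the trainable coordinates grows
    # in magnitude after freezing (the per-sample step of Lemma 3.2).
    assert np.linalg.norm(scale_a * sub) >= np.linalg.norm(scale_b * sub)
```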

4. EVALUATING LAYER FREEZING IN END-TO-END PRIVATE TRAINING

In this section we empirically evaluate our method's utility improvements over multiple datasets, models, and privacy levels. We also empirically evaluate the claims from previous sections, perform a sensitivity analysis on hyperparameters, and present our suggested defaults. Unless otherwise specified, the analysis results in this section are demonstrated using CIFAR-10 and a 5-layer CNN model.

Baseline models and experimental setup. We implement layer freezing on top of existing baseline models. For MNIST, FashionMNIST, and CIFAR10 with a CNN, we use the baseline model from Papernot et al. (2021). For the CIFAR10 experiment with a Wide-ResNet, the baseline model is from De et al. (2022). All hyperparameters related to the model, training, or DP settings are kept constant with and without layer freezing (and were tuned without freezing in the baselines' papers). For each experiment, we run the model with and without layer freezing 5 times independently, using different random seeds. For the CIFAR10 model with Wide-ResNet, we were only able to repeat the experiments for the ϵ = 1 setting due to limited computational resources. The exact hyperparameters and other details are included in Appendix D.

Model utility under DP-SGD-LF. Table 1 compares the performance of DP-SGD-LF on top of the current state-of-the-art baselines, showing the median and standard deviation over 5 runs. We observe that layer freezing generally improves predictive performance across datasets, models, and privacy budgets: it improves the median score and has a smaller standard deviation.

Layer freezing improves the angle of distortion in the optimization direction. Figure 3 shows the median of $\|s^B_t\|$ for the trainable layers after freezing the first 3 layers at step 5000. At the last step, the model accuracy is around 0.64, for ϵ = 7. We observe an increase in signal strength for both trainable layers after freezing the lower 3 layers at step 5000.
We observe a decrease in $\gamma^{UB}_t$ for both Layer 4 and Layer 5, indicating that the biased signal is closer to the unbiased signal after freezing, which matches the claim in §3.4. Figures 4 and 5 show the change in the distortion angle when layer freezing is imposed. We note that, since the full-sample gradient is expensive to compute, we measure the angle $\gamma^{UB}_t$ between the unclipped and unnoised SGD gradients on the same batch of data. We thus make the underlying assumption that moving the DP descent direction closer to the mini-batch direction in each iteration $t$ improves private training performance. We observe that both $\gamma^{UB}_t$ and $\gamma^{BP}_t$ decrease after freezing for the trainable layers. The distortion measured by $\gamma^{UB}_t$ is generally weaker than that measured by $\gamma^{BP}_t$, as the absolute scale is higher in the latter. Although the privacy hyperparameters C and σ affect the results, we generally observe that noising has a stronger negative effect than clipping in terms of distorting the optimization direction.

The tradeoff between the gains of freezing and model capacity. The benefits of freezing layers come at a cost in model capacity, which potentially limits performance. We evaluate this tradeoff by measuring the final model accuracy under different choices of the freezing hyperparameters: how many layers to freeze ($n_f$) and when to start freezing ($t_f$). Figure 6 (Left) compares the model's utility when freezing different numbers of layers. In these experiments, freezing starts after ϵ = 3. We observe that, in general, the choice of how many layers to freeze is not very sensitive, as long as the model retains reasonable capacity (e.g., leaving only the last layer trainable is insufficient). Figure 6 (Right) compares model utility when varying the step at which freezing starts. In these experiments, we freeze Layers 1-3 at different training steps.
We observe that freezing too early can result in worse performance, whereas freezing too late yields little gain over not freezing at all. Determining the optimal values for $t_f$ and $n_f$ might incur additional privacy cost. Since utility is not very sensitive to these hyperparameters over a wide, reasonable range, we suggest the following defaults, which are those we use for all experiments in Table 1: given a target privacy budget ϵ, we calculate the number of training steps to take (based on the DP and optimization parameters), and set $t_f$ to be about 20 steps earlier. We set $n_f$ to cover the lower half of all layers in the model.
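These defaults can be expressed as a small helper. An illustrative sketch (the function name, the 20-step margin default, and the rounding-down choice for "lower half" are ours; `total_steps` would be derived from the privacy accountant for the target ϵ):

```python
def default_freeze_schedule(total_steps, num_layers, margin=20):
    """Suggested DP-SGD-LF defaults: start freezing about `margin`
    steps before the end of training (t_f), and freeze the lower half
    of the layers, rounded down (n_f)."""
    t_f = max(0, total_steps - margin)
    n_f = num_layers // 2
    return t_f, n_f
```

For example, with 1000 total steps and a 5-layer model, this gives t_f = 980 and n_f = 2.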

5. RELATED WORK

There is a rich literature on representation learning in DNNs under non-private training. It is commonly observed that lower-level layers extract more general features and are easier to train, while higher-level layers capture more abstract, task-specific features and take more steps to learn good representations (Raghu et al., 2017; Morcos et al., 2018; Yosinski et al., 2014; Rogers et al., 2020). Wang et al. (2022) demonstrate on image classification tasks that different layers converge in a bottom-up pattern. This line of work underpins transfer learning, in which a subset of the parameters is inherited from pre-trained models and kept frozen when fine-tuning on downstream tasks. In non-private training, parameter freezing is mainly used to reduce data requirements, computational costs, or communication (Zhuang et al., 2019). Parameter freezing has also been studied in the private training setting, again under the transfer learning scenario but with additional privacy benefits: a subset of the model parameters is transferred from publicly trained models and frozen when fine-tuning with private data on the downstream task. Such an approach is empirically effective across multiple computer vision and natural language processing tasks (Tramèr & Boneh, 2021; Luo et al., 2021; Yu et al., 2021; Li et al., 2021; Mehta et al., 2022). In these works, the frozen parameters are either used directly as good initializations, or attached to additional layers for fine-tuning. In Luo et al. (2021), the parameters chosen to be frozen are those with smaller scales, which are usually considered less important in a neural network. This coincides with our observation that freezing the lower-level layers, which converge early, does not overly hurt model performance. A few closely related works incorporate techniques to increase sparsity in private learning. Zhang et al.
(2021) present a theoretical study on the benefits of sparse gradients in wide neural network models trained with DP. Talwar et al. (2015) study the DP LASSO model, which encourages sparsity by design. Huang et al. (2020) show that pruning can be an alternative approach to privacy.

A PROOF OF LEMMA 3.1

Lemma (DP-SGD convergence bound). Following the proof of Bottou et al. (2018), adapted to DP-SGD, we show the following result. Assume the objective function $f: \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and the private gradient of $f$, $\nabla f: \mathbb{R}^d \to \mathbb{R}^d$, is Lipschitz continuous with Lipschitz constant $L > 0$, $\|\nabla f(v) - \nabla f(w)\| \leq L\|v - w\|\ \forall v, w$. Then the convergence bound of DP-SGD is
$$\min_{t=0,\dots,T} \mathbb{E}\left[\|\nabla f(\theta_t)\|^2\right] \leq \left( f(\theta_0) - \mathbb{E}[f(\theta_T)] - \sum_{t=1}^{T} \eta_t \mathbb{E}\left[\nabla f(\theta_t)^T b_t\right] + \sum_{t=1}^{T} \frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2 \right) \Big/ \sum_{t=1}^{T} \eta_t.$$

Proof. Let $f$ be the loss function we want to optimize, let $\nabla f(\theta_t)$ be the true steepest-descent gradient vector, and let $s^U_t$ and $s^P_t$ be the unbiased and private gradients of Definition 3.1 at step $t$. Since the private gradients are assumed to be $L$-Lipschitz continuous, the Descent Lemma (§2) implies that
$$f(\theta_{t+1}) \leq f(\theta_t) + \nabla f(\theta_t)^T (\theta_{t+1} - \theta_t) + \frac{L}{2}\|\theta_{t+1} - \theta_t\|^2.$$
Substituting in the parameter update rule of DP-SGD, $\theta_{t+1} \leftarrow \theta_t - \eta_t s^P_t$, we get
$$f(\theta_{t+1}) \leq f(\theta_t) - \eta_t \nabla f(\theta_t)^T s^P_t + \frac{\eta_t^2 L}{2}\|s^P_t\|^2.$$
Taking the expectation over the data distribution, and assuming the step size $\eta_t$ is independent of which data are sampled in iteration $t$, we get
$$\mathbb{E}[f(\theta_{t+1})] \leq f(\theta_t) - \eta_t \nabla f(\theta_t)^T \mathbb{E}[s^P_t] + \frac{\eta_t^2 L}{2}\mathbb{E}[\|s^P_t\|^2].$$
The private gradient $s^P_t$ is composed of the biased gradient $s^B_t$ and the random noise $z_t$. The bias in $s^B_t$ can be isolated and quantified by integrating over the probability density function of the gradient noise caused by data sampling (Chen et al., 2021). Therefore we can simplify the expectation of $s^P_t$ as
$$\mathbb{E}[s^P_t] = \mathbb{E}[s^U_t + b_t] + \mathbb{E}[z_t] = \nabla f(\theta_t) + b_t,$$
for some bias vector $b_t$.
The second equality holds because $s^U_t$ is an unbiased estimator of the true gradient $\nabla f(\theta_t)$ under the assumption that each $x_i$ is drawn with a uniform sampling scheme, as in regular mini-batch SGD, and $\mathbb{E}[z_t] = 0$ since $z_t \sim \mathcal{N}(0, (1/B^2)\sigma^2 C^2 I)$. The variance of $s^P_t$ is bounded under DP-SGD, since we perform gradient clipping to control sensitivity and add noise drawn from a Normal distribution. Given batch size $B$, L2-clipping threshold $C$, and noise multiplier $\sigma_{DP}$,
$$\mathbb{E}[\|s^P_t\|^2] \leq \frac{1}{B^2}\sigma_{DP}^2 C^2.$$
Therefore, the progress bound for each parameter update step $t$ of DP-SGD is
$$\mathbb{E}[f(\theta_{t+1})] \leq f(\theta_t) - \eta_t \left(\|\nabla f(\theta_t)\|^2 + \nabla f(\theta_t)^T b_t\right) + \frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2.$$
Rearranging the terms and summing over all iterations $t = 1, \dots, T$, we get
$$\sum_{t=1}^{T} \eta_t \|\nabla f(\theta_t)\|^2 \leq \sum_{t=1}^{T} \left( f(\theta_t) - \mathbb{E}[f(\theta_{t+1})] - \eta_t \nabla f(\theta_t)^T b_t + \frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2 \right).$$
Taking expectations on both sides and simplifying using the Law of Iterated Expectations, we get
$$\sum_{t=1}^{T} \eta_t \mathbb{E}[\|\nabla f(\theta_t)\|^2] \leq \sum_{t=1}^{T} \left( \mathbb{E}[f(\theta_t)] - \mathbb{E}[f(\theta_{t+1})] \right) - \sum_{t=1}^{T} \eta_t \mathbb{E}[\nabla f(\theta_t)^T b_t] + \sum_{t=1}^{T} \frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2.$$
As in the usual gradient descent setting, when the magnitude of the gradient vector shrinks toward 0, we consider the algorithm to have converged (to an optimum or a saddle point). We can rewrite the above inequality in terms of the smallest gradient norm over all training steps, using the telescoping sum of the expected loss terms:
$$\min_{t=0,\dots,T} \mathbb{E}[\|\nabla f(\theta_t)\|^2] \sum_{t=1}^{T} \eta_t \leq \sum_{t=1}^{T} \eta_t \mathbb{E}[\|\nabla f(\theta_t)\|^2] \leq f(\theta_0) - \mathbb{E}[f(\theta_T)] - \sum_{t=1}^{T} \eta_t \mathbb{E}[\nabla f(\theta_t)^T b_t] + \sum_{t=1}^{T} \frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2.$$
Let $s^B_t$ and $\gamma^B_t$ be as in Definitions 3.1 and 3.2. We can rewrite the bound in terms of these quantities using
$$\mathbb{E}[\nabla f(\theta_t)^T b_t] = \mathbb{E}[\langle \nabla f(\theta_t), s^B_t \rangle] = \mathbb{E}[\cos(\gamma^B_t)\|\nabla f(\theta_t)\|\|s^B_t\|].$$
Therefore the convergence bound of DP-SGD is
$$\min_{t=0,\dots,T} \mathbb{E}[\|\nabla f(\theta_t)\|^2] \leq \left( f(\theta_0) - \mathbb{E}[f(\theta_T)] - \sum_{t=1}^{T} \eta_t \mathbb{E}[\cos(\gamma^B_t)\|\nabla f(\theta_t)\|\|s^B_t\|] + \sum_{t=1}^{T} \frac{\eta_t^2 L}{2B^2}\sigma_{DP}^2 C^2 \right) \Big/ \sum_{t=1}^{T} \eta_t.$$

B PROOF OF LEMMA 3.2

Lemma (The magnitude of the biased signal of the trainable parameters increases after freezing). Let $s^{B,b}_t(\theta'_t)$ and $s^{B,a}_t(\theta'_t)$ be the biased signals of the trainable parameters before and after freezing. Then $\|g^a_t(x_i)\| \leq \|g^b_t(x_i)\|\ \forall x_i$ and $\|s^{B,a}_t(\theta'_t)\| \geq \|s^{B,b}_t(\theta'_t)\|$.

Proof. Let $g_t(x_i, \theta_t)$, $g_t(x_i, \bar{\theta}'_t)$, and $g_t(x_i, \theta'_t)$ be the gradient vectors of the full, frozen, and trainable parameters, respectively, for each instance $x_i$. Since $\theta_t = \bar{\theta}'_t \cup \theta'_t$,
$$g^b_t(x_i, \theta_t) = g^b_t(x_i, \bar{\theta}'_t) \cup g^b_t(x_i, \theta'_t), \quad g^a_t(x_i, \theta_t) = g^a_t(x_i, \theta'_t) = g^b_t(x_i, \theta'_t).$$
Since the frozen and trainable parameters occupy disjoint coordinate blocks,
$$\|g^b_t(x_i, \theta_t)\|^2 = \|g^b_t(x_i, \bar{\theta}'_t)\|^2 + \|g^b_t(x_i, \theta'_t)\|^2,$$
and since $\|g^b_t(x_i, \bar{\theta}'_t)\| \geq 0$,
$$\|g^b_t(x_i, \theta_t)\| \geq \|g^b_t(x_i, \theta'_t)\| = \|g^a_t(x_i, \theta_t)\|.$$
By Definition 3.1,
$$s^{B,b}_t(\theta'_t) = \frac{1}{B}\sum_{i=1}^{B} g_t(x_i) \Big/ \max\left(1, \frac{\|g^b(x_i, \theta'_t)\|}{C}\right), \quad s^{B,a}_t(\theta'_t) = \frac{1}{B}\sum_{i=1}^{B} g_t(x_i) \Big/ \max\left(1, \frac{\|g^a(x_i, \theta'_t)\|}{C}\right),$$
so it follows that $\|g^a_t(x_i)\| \leq \|g^b_t(x_i)\|\ \forall x_i$ and $\|s^{B,a}_t(\theta'_t)\| \geq \|s^{B,b}_t(\theta'_t)\|$.

C SUPPORTING PROOFS FOR PROPOSITION 3.1

We first show a common result that the proofs below rely on: the cosine similarity between two vectors increases when one of them is the sum of the other and a third vector, and that other vector grows in magnitude. Let $v_1$ and $v_2$ be two arbitrary vectors, and let $v_3$ be their sum, $v_3 = v_1 + v_2$.
Keeping $v_1$ fixed, if the magnitude of $v_2$ increases, then the cosine similarity between $v_2$ and $v_3$ increases, leading to a smaller angle between them. The same result holds if $v_2$ is kept fixed and the magnitude of $v_1$ increases. If $v_1$, $v_2$, and $v_3$ are 2-dimensional vectors, the claim follows naturally from the Parallelogram law; we show that it holds in the general $p$-dimensional case. Let $v_i = [v_{i1}, v_{i2}, \dots, v_{ip}]$ be the flattened $p$-dimensional vectors. We show one direction of the result, in which $v_1$ is kept fixed and the magnitude of $v_2$ increases, $v'_2 = k v_2$ for some $k \geq 1$; the other direction follows by symmetry. We show that
$$\mathrm{LHS} = \frac{\langle v'_2, v'_2 + v_1 \rangle}{\|v'_2\|\|v'_2 + v_1\|} \geq \frac{\langle v_2, v_2 + v_1 \rangle}{\|v_2\|\|v_2 + v_1\|} = \mathrm{RHS}.$$
Expanding the dot products and norms into summations on both sides, and letting $a = \sum_i v_{2i}^2$, $b = \sum_i v_{1i} v_{2i}$, $c = \sum_i v_{1i}^2$, we can simplify LHS and RHS as
$$\mathrm{LHS} = \frac{k^2 a + k b}{k\sqrt{a}\sqrt{k^2 a + c + 2kb}}, \quad \mathrm{RHS} = \frac{a + b}{\sqrt{a}\sqrt{a + c + 2b}}.$$
Since $a \geq 0$ and $c \geq 0$, if $b \geq 0$ we can verify that: (1) at $k = 1$, $\mathrm{LHS} - \mathrm{RHS} = 0$; (2) $\lim_{k\to\infty} (\mathrm{LHS} - \mathrm{RHS}) \geq 0$; (3) $\nabla_k (\mathrm{LHS} - \mathrm{RHS}) \geq 0$. Therefore $\mathrm{LHS} - \mathrm{RHS}$ is a nonnegative, non-decreasing function of $k$ for $k \geq 1$, i.e., $\mathrm{LHS} \geq \mathrm{RHS}$. Finally, we show that the following result is true:
$$\mathbb{E}\left[\left(\arccos \frac{\langle s^B_t(\theta'_t)^a, s^B_t(\theta'_t)^a + z_t \rangle}{\|s^B_t(\theta'_t)^a\|\|s^B_t(\theta'_t)^a + z_t\|}\right)^2\right] \leq \mathbb{E}\left[\left(\arccos \frac{\langle s^B_t(\theta'_t)^b, s^B_t(\theta'_t)^b + z_t \rangle}{\|s^B_t(\theta'_t)^b\|\|s^B_t(\theta'_t)^b + z_t\|}\right)^2\right].$$
Let $X$ denote the random variable inside the $\arccos$ function; for each fixed $s^B_t$, the only source of randomness is the added noise, so the distribution of $X$ is induced by $z_t \sim \mathcal{N}(0, \sigma^2 C^2 I_d)$.
Let $z$ be a sample of the noise $z_t$, and let $x_b(z)$ and $x_a(z)$ denote the quantity inside the arccos before and after freezing:
$$x_a(z) = \frac{\langle s^{B,a}_t(\theta'_t), s^{B,a}_t(\theta'_t) + z \rangle}{\|s^{B,a}_t(\theta'_t)\|\,\|s^{B,a}_t(\theta'_t) + z\|}, \qquad x_b(z) = \frac{\langle s^{B,b}_t(\theta'_t), s^{B,b}_t(\theta'_t) + z \rangle}{\|s^{B,b}_t(\theta'_t)\|\,\|s^{B,b}_t(\theta'_t) + z\|}.$$
Since $x_a(z) \ge x_b(z)$, we have $\arccos(x_a(z))^2 \le \arccos(x_b(z))^2$ for every $z$, and therefore
$$\int_{-\infty}^{\infty} \arccos(x_a(z))^2 f(z)\,dz \le \int_{-\infty}^{\infty} \arccos(x_b(z))^2 f(z)\,dz,$$
where $f$ is the probability density function of $z_t \sim \mathcal{N}(0, \sigma^2 C^2 I_d)$. Therefore $\mathbb{E}[(\gamma^{BP}_t)^2]_a \le \mathbb{E}[(\gamma^{BP}_t)^2]_b$. ∎
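The final expectation bound can also be checked with a quick Monte Carlo estimate, using a shared set of noise draws and a signal that grows in magnitude while keeping its direction, as in the pointwise argument above (a sketch; all names and the chosen magnitudes are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma_C, n = 32, 1.0, 20000

s_b = rng.normal(size=d)
s_b /= np.linalg.norm(s_b)   # unit-norm signal "before freezing"
s_a = 5.0 * s_b              # same direction, larger magnitude "after freezing"
z = rng.normal(scale=sigma_C, size=(n, d))   # shared DP noise draws

def sq_angle(s, z):
    noisy = s + z
    cos = (noisy @ s) / (np.linalg.norm(noisy, axis=1) * np.linalg.norm(s))
    return np.arccos(np.clip(cos, -1.0, 1.0)) ** 2

before, after = sq_angle(s_b, z).mean(), sq_angle(s_a, z).mean()
print(before, after)   # the expected squared angle shrinks for the larger signal
```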

D EXPERIMENT DETAILS

We implement our method and the CNN baseline model (Papernot et al., 2021) with JAX (Bradbury et al., 2018). The MNIST experiments use a 2-layer CNN model; FashionMNIST and CIFAR-10 use a 5-layer CNN model. The architectures are the same as in Tables 1 and 2 of Papernot et al. (2021). The Wide-ResNet baseline model is from Balle et al. (2022), and we implement the layer freezing on top of it. We mostly adopt the hyperparameters suggested in the original papers. Below are the details for each experiment: (1) MNIST, 2-layer CNN: C = 1.0, σ_DP = 1.923, B = 2048, η = 2.0; the activation function is a tempered sigmoid with scale 1.58, inverse temperature 3.0, and offset 0.71, as suggested in the original paper. When using layer freezing, we freeze the first 2 layers after epoch 3, 17, and 39 for ε = 1, 2, 3 respectively. (4) CIFAR10, Wide-ResNet: We adopt the publicly available configuration file of Balle et al. (2022). We freeze the first 2 convolution groups after update steps 800 and 3000 for ε = 1, 2 respectively.

E EMPIRICALLY EVALUATING ASSUMPTIONS 3.1 AND 3.2

In this section we empirically examine the assumptions in Section 3.4. We observe that Assumption 3.1 is generally well supported empirically, as the sum of the clipped gradients is closer, in terms of having a smaller angle, to the true gradient direction ∇f_t than the sum of the unclipped gradients. We also observe that Assumption 3.2 is generally not true, since freezing rescales ∥g(x_i)∥ and thus moves some per-sample gradients g(x_i) from being clipped to unclipped. However, since only a small number of g(x_i) are moved, the directions of the summed clipped and unclipped gradients after freezing remain well aligned with the directions before freezing. Intuitively, if we clip less by freezing a subset of the parameters, then under Assumption 3.1 we assign a larger weight to the clipped gradients, which are more aligned with the truth, and so decrease the bias.
To verify Assumption 3.1, in each iteration we record the clipping status of every per-sample gradient, compute the sums of the clipped (those with ∥g(x_i)∥ ≥ C) and unclipped (∥g(x_i)∥ < C) gradients, and compute their distortion angles (as in Definition 3.2) to the true gradient ∇f_t computed on all training examples. Figure 7 shows the results with different L2 clipping norms C, run with DP-SGD on CIFAR10 with σ_DP = 1.0, batch size B = 512, and learning rate η = 0.15. At the end of training, ε = 7.0 and the test accuracy is 0.42, 0.58, 0.60, and 0.58 for C = 0.1, 1, 5, 10 respectively. We observe that, in general, the summed clipped gradients are more aligned with the true gradient than the summed unclipped gradients, having a smaller angle across the different values of C, which empirically supports Assumption 3.1. This matches our intuition: there are often more gradients being clipped than unclipped throughout training, given a reasonable range of C from 0.1 to 10 as suggested by the Opacus authors (Yousefpour et al., 2021), and these per-sample gradients with larger magnitudes are more likely to dominate the true gradient direction. Assumption 3.2 is used to characterize sample gradients that change clipping status, from clipped to unclipped, due to layer freezing. Figure 8 shows the change in the number of clipped and unclipped gradients. We see that a small number of gradients do change category, as some data points' gradients go from clipped to unclipped when ∥g(x_i)∥ is rescaled. Referring to the demonstration in Figure 2, Assumption 3.2 ensures that the directions of g_t(x_j) and g_t(x_k) (the summed unclipped and clipped gradients, respectively) do not change after freezing, so that the observed change in γ^{UB}_t is mostly due to rescaling the gradient norms, on which the analysis relies. We empirically measure the change in angle before and after freezing for the sum of gradients in the clipped and unclipped categories.
We observe that the change in the angle of the summed clipped gradients is 0.09 and 0.11 radians, and 0.10 and 0.12 radians for the summed unclipped gradients, when freezing the first 3 layers of parameters after epochs 20 and 40 respectively. This matches our intuition that the direction of the sum of gradients over a minibatch is stable to moving a small number of medium-sized gradients from the clipped to the unclipped category.
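The per-iteration measurement described above can be sketched as follows. The gradients here are synthetic stand-ins for real per-sample gradients, and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(3)
B, d, C = 512, 100, 1.0

# Synthetic per-sample gradients whose norms straddle the clipping threshold C.
per_sample = rng.normal(scale=0.1, size=(B, d))
true_grad = per_sample.mean(axis=0)   # stand-in for the full-data gradient

def distortion_angle(u, w):
    cos = (u @ w) / (np.linalg.norm(u) * np.linalg.norm(w))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Record clipping status and compare each group's summed direction to the truth.
clipped = np.linalg.norm(per_sample, axis=1) >= C
angle_clipped = distortion_angle(per_sample[clipped].sum(axis=0), true_grad)
angle_unclipped = distortion_angle(per_sample[~clipped].sum(axis=0), true_grad)
print(angle_clipped, angle_unclipped)   # radians
```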



Figure 1: Left: The privacy-utility trade-off when only a single layer is trained, with every other layer frozen (except for the final classification layer), after ε = 3. Layer 1 has the smallest gain in accuracy when the other layers are frozen, while the upper layers perform similarly. Right: The PWCCA score over training steps for different layers. A lower PWCCA score indicates better convergence. Layer 1 shows a weak sign of convergence, while the upper layers show no strong evidence of converging.

Figure 3: The median of ∥s^B_t∥ for the trainable layers over training steps after the lower 3 layers are frozen at step 5000. We observe an increase in signal strength in both Layers 4 and 5.

Figure 5: The change in γ^{BP}_t for the trainable layers after freezing the lower 3 layers at step 5000. We observe a decrease in γ^{BP}_t in both Layers 4 and 5, indicating that the improved signal strength makes the update more robust to noise, as the added noise changes its direction less.

Figure 6: Evaluating the hyperparameter choices of how many layers to freeze (m_f) and when to start freezing (t_f). Left: The privacy-utility trade-off when freezing different numbers of layers after ε = 3. Right: The privacy-utility trade-off when freezing the lower 3 layers at different steps.

(2) FashionMNIST, 5-layer CNN: C = 1.0, σ_DP = 2.15, B = 2048, η = 4.0; the activation function is a tempered sigmoid with scale 1.58, inverse temperature 3.0, and offset 0.71, as suggested in the original paper. When using layer freezing, we freeze the first 3 layers after epoch 4, 19, and 40 for ε = 1, 2, 3 respectively. (3) CIFAR10, 5-layer CNN: For the ε = 3 experiment we used C = 3.0, σ_DP = 1.0, B = 512, η = 0.15, with the same tempered sigmoid activation (scale 1.58, inverse temperature 3.0, offset 0.71) as suggested in the original paper; we freeze the first 3 layers after epoch 20. For the ε = 7 experiment we used C = 1.0, σ_DP = 1.47, B = 2048, η = 4.0, and we freeze the first 3 layers after epoch 75.
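For quick reference, the CNN settings listed in this appendix can be collected into a single structure. This is only an illustrative layout; the key names are ours, and the values are transcribed from the text above (`freeze_after` maps the privacy budget ε to the epoch at which freezing starts):

```python
# Hyperparameters transcribed from the experiment details above.
EXPERIMENTS = {
    "mnist_cnn2": dict(C=1.0, sigma_dp=1.923, B=2048, eta=2.0,
                       frozen_layers=2, freeze_after={1: 3, 2: 17, 3: 39}),
    "fashionmnist_cnn5": dict(C=1.0, sigma_dp=2.15, B=2048, eta=4.0,
                              frozen_layers=3, freeze_after={1: 4, 2: 19, 3: 40}),
    "cifar10_cnn5_eps3": dict(C=3.0, sigma_dp=1.0, B=512, eta=0.15,
                              frozen_layers=3, freeze_after={3: 20}),
    "cifar10_cnn5_eps7": dict(C=1.0, sigma_dp=1.47, B=2048, eta=4.0,
                              frozen_layers=3, freeze_after={7: 75}),
}
print(len(EXPERIMENTS))
```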

1 THE DP-SGD-LF ALGORITHM

Algorithm 1 shows the pseudo-code of our method. When the current iteration index t exceeds a preset threshold t_f, we freeze the parameters of the first m_f layers (closest to the input) of the model architecture. t_f and m_f are the two hyperparameters of the algorithm.

Input: Dataset D = {(x_i, y_i)}_{i=0}^N, loss function f, learning rate η_t, batch size B, noise multiplier σ, L2-norm clipping threshold C, privacy parameters (ε, δ), freezing parameters t_f, m_f
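The update step can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: the per-sample gradients are assumed to be given (in practice they come from per-example autodiff, e.g. in JAX), and the flat-parameter layer partition is a hypothetical layout of ours:

```python
import numpy as np

def dpsgd_lf_step(per_sample_grads, layer_slices, t, t_f, m_f,
                  C, sigma, eta, theta, rng):
    """One DP-SGD-LF update on parameters theta (a sketch).

    per_sample_grads: (B, d) array of per-sample loss gradients.
    layer_slices: coordinate slices of theta per layer, input side first.
    """
    B, d = per_sample_grads.shape
    trainable = np.ones(d, dtype=bool)
    if t > t_f:                        # freeze the first m_f layers after t_f
        for sl in layer_slices[:m_f]:
            trainable[sl] = False
    g = np.where(trainable, per_sample_grads, 0.0)
    # Per-sample clipping: after freezing, only trainable coordinates
    # contribute to the norm, so fewer samples are clipped.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / C)
    # Aggregate clipped gradients and add Gaussian noise calibrated to C.
    noisy = g.sum(axis=0) + rng.normal(scale=sigma * C, size=d)
    theta = theta.copy()
    theta[trainable] -= eta * noisy[trainable] / B
    return theta

# Toy usage: 3 "layers" of 4 parameters each; the first 2 layers are frozen.
rng = np.random.default_rng(0)
layers = [slice(0, 4), slice(4, 8), slice(8, 12)]
theta = dpsgd_lf_step(rng.normal(size=(16, 12)), layers, t=10, t_f=5, m_f=2,
                      C=1.0, sigma=1.0, eta=0.1, theta=np.zeros(12), rng=rng)
print(theta)   # coordinates of the two frozen layers stay at zero
```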

Figure 2 (Middle and Right) demonstrates this intuition with 2-dimensional vectors, and the formal results are stated below, following from results we prove in Appendices B and C.

Lemma 3.2 (The magnitude of the biased signal of the trainable parameters increases after freezing).



Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity, 2019. URL https://arxiv.org/abs/1905.11881.

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. CoRR, abs/1911.02685, 2019. URL http://arxiv.org/abs/1911.02685.

