

Abstract

Differentially Private methods for training Deep Neural Networks (DNNs) have progressed recently, in particular with the use of massive batches and aggregated data augmentations for a large number of steps. These techniques require much more compute than their non-private counterparts, shifting the traditional privacy-accuracy trade-off to a privacy-accuracy-compute trade-off and making hyper-parameter search virtually impossible for realistic scenarios. In this work, we decouple the privacy analysis and the experimental behavior of noisy training to explore the trade-off with minimal computational requirements. We first use the tools of Rényi Differential Privacy (RDP) to show that the privacy budget, when not overcharged, only depends on the total amount of noise (TAN) injected throughout training. We then derive scaling laws for training models with DP-SGD to optimize hyper-parameters with more than a 100× reduction in computational budget. We apply the proposed method on CIFAR-10 and ImageNet and, in particular, strongly improve the state-of-the-art on ImageNet with a gain of +9 points of accuracy for a privacy budget ε = 8.

1. INTRODUCTION

Deep neural networks (DNNs) have become a fundamental tool of modern artificial intelligence, producing cutting-edge performance in many domains such as computer vision (He et al., 2016), natural language processing (Devlin et al., 2018) or speech recognition (Amodei et al., 2016). The performance of these models generally increases with their training data size (Brown et al., 2020; Rae et al., 2021; Ramesh et al., 2022), which encourages the inclusion of more data in the model's training set. This phenomenon also introduces a potential privacy risk for data that gets incorporated. Indeed, AI models not only learn about general statistics or trends of their training data distribution (such as grammar for language models), but also remember verbatim information about individual points (e.g., credit card numbers), which compromises their privacy (Carlini et al., 2019; 2021). Access to a trained model thus potentially leaks information about its training data.

The gold standard of disclosure control for individual information is Differential Privacy (DP) (Dwork et al., 2006). Informally, DP ensures that the training algorithm does not produce very different models if a sample is added or removed from the dataset. Motivated by applications in deep learning, DP-SGD (Abadi et al., 2016) is an adaptation of Stochastic Gradient Descent (SGD) that clips individual gradients and adds Gaussian noise to their sum. Its DP guarantees depend on the privacy parameters: the sampling rate q = B/N (where B is the batch size and N is the number of training examples), the number of gradient steps S, and the noise variance σ².

Training neural networks with DP-SGD has seen progress recently, due to several factors. The first is the use of pre-trained models, with DP finetuning on downstream tasks (Li et al., 2021; De et al., 2022).
This circumvents the traditional problems of DP, because the model learns meaningful features from public data and can adapt to downstream data with minimal information. In the remainder of this paper, we only consider models trained from scratch, as we focus on obtaining information through the DP channel. Another emerging trend among DP practitioners is to use massive batch sizes for a large number of steps to achieve a better trade-off between privacy and utility: Anil et al. (2021) have successfully pre-trained BERT with DP-SGD using batch sizes of 2 million. This paradigm makes training models computationally intensive and hyper-parameter (HP) search effectively impractical for realistic datasets and architectures.

In this context, we look at DP-SGD through the lens of the Total Amount of Noise (TAN) injected during training, and use it to decouple two aspects: privacy accounting and the influence of noisy updates on the training dynamics. We first show that within a wide range of the privacy parameters, the privacy budget ε is a function of the total amount of noise only. Using the tools of RDP accounting, we approximate ε by a closed-form expression. We then analyze the scaling laws of DNNs at constant TAN and show that performance at very large batch sizes (computationally intensive) is linearly predictable from performance at small batch sizes, as illustrated in Figure 1.

In summary, our contributions are as follows:

• We define the notion of Total Amount of Noise (TAN) and show that when the budget ε is not overcharged, it only depends on TAN;

• We derive scaling laws and showcase the predictive power of TAN to reduce the computational cost of hyper-parameter tuning with DP-SGD, saving a factor of 128 in compute on ImageNet experiments (Figure 1). We then use TAN to find optimal privacy parameters, leading to a gain of +9 points under ε = 8 compared to the previous SOTA;

• We leverage TAN to quantify the impact of the dataset size on the privacy/utility trade-off and demonstrate that doubling the dataset size halves ε while providing better performance.
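The scaling rule behind these savings can be sketched in a few lines. This is a hypothetical illustration (not the authors' code): at a fixed number of steps S and dataset size N, scaling the batch size B and the noise multiplier σ by the same factor leaves σ/B unchanged, so the standard deviation of the noise added to the averaged gradient is identical in the cheap small-batch run and the expensive large-batch run. The constants σ_ref = 2.5 and B_ref = 16384 are taken from Figure 1; the factor k = 128 matches the compute reduction quoted above.

```python
# Hypothetical sketch of the constant-TAN scaling rule: dividing both the
# batch size B and the noise multiplier sigma by the same factor k keeps
# sigma / B constant, so the effective per-step noise on the averaged
# gradient is unchanged while each step costs k times less compute.
import math

def noise_per_average_gradient(sigma: float, batch_size: int) -> float:
    """Std of the Gaussian noise added to the averaged clipped gradient,
    taking the clipping norm C = 1: sigma / B in the DP-SGD update."""
    return sigma / batch_size

# Reference (expensive) configuration from Figure 1.
sigma_ref, B_ref = 2.5, 16384

# Cheap configuration for HP search: divide both by k = 128.
k = 128
sigma_small, B_small = sigma_ref / k, B_ref // k

assert math.isclose(
    noise_per_average_gradient(sigma_ref, B_ref),
    noise_per_average_gradient(sigma_small, B_small),
)
print(B_small, sigma_small)  # batch size 128 with proportionally smaller noise
```

Under this rule, one can search hyper-parameters at batch size 128 and only pay for a single large-batch run with the best setup found.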

2. BACKGROUND AND RELATED WORK

In this section, we review traditional definitions of DP and RDP. We consider a randomized mechanism M that takes as input a dataset D and outputs a machine learning model θ ∼ M(D).



Figure 1: Training with DP-SGD on ImageNet for a constant number of steps S = 72k. All points are obtained at constant σ/B, with σ_ref = 2.5 and B_ref = 16384. The dashed lines are computed using a linear regression on the crosses, and the dots and stars illustrate the predictive power of TAN. We perform low-compute hyper-parameter (HP) search at batch size 128 and extrapolate our best setup for a single run at large batch size: stars show our reproduction of the previous SOTA from De et al. (2022) and the improved performance obtained under the privacy budget ε = 8, with a +6 point gain in top-1 accuracy. The shaded blue areas denote 2 standard deviations over three runs.

Definition 1 (Differential Privacy). A randomized mechanism M satisfies (ε, δ)-DP (Dwork et al., 2006) if, for any pair of datasets D and D′ that differ by one sample and for all subsets R ⊂ Im(M),

P(M(D) ∈ R) ≤ P(M(D′) ∈ R) exp(ε) + δ.    (1)

DP-SGD (Abadi et al., 2016) is the most popular DP algorithm to train DNNs. It selects samples uniformly at random with probability q = B/N (Poisson sampling), clips per-sample gradients to a norm C (clip_C), aggregates them and adds (Gaussian) noise. With θ the parameters of the DNN, the noisy gradient is

g_noisy = (1/B) Σ_{i∈B} clip_C(∇_θ ℓ(θ, (x_i, y_i))) + N(0, C²σ²/B²).    (2)
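The noisy gradient above can be sketched directly in NumPy. This is a minimal illustration, not the authors' implementation: it assumes the per-sample gradients for one batch are already materialized as a (B, d) array (the function name `dp_sgd_noisy_gradient` is ours), clips each row to norm C, averages, and adds Gaussian noise with standard deviation Cσ/B per coordinate.

```python
# Minimal sketch of the DP-SGD noisy gradient g_noisy:
# (1/B) * sum_i clip_C(grad_i) + N(0, C^2 sigma^2 / B^2) per coordinate.
import numpy as np

def dp_sgd_noisy_gradient(per_sample_grads: np.ndarray,
                          C: float, sigma: float,
                          rng: np.random.Generator) -> np.ndarray:
    """per_sample_grads: shape (B, d), one flattened gradient per example."""
    B, d = per_sample_grads.shape
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Clip: rescale each row so its L2 norm is at most C.
    clipped = per_sample_grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))
    # Average the clipped gradients, then add Gaussian noise of std C*sigma/B.
    return clipped.mean(axis=0) + rng.normal(0.0, C * sigma / B, size=d)

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))  # toy batch of 32 per-sample gradients
g_noisy = dp_sgd_noisy_gradient(grads, C=1.0, sigma=2.5, rng=rng)
```

In practice, per-sample gradients are the expensive part; libraries obtain them via vectorized per-example backpropagation rather than a Python loop.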

