EFFECTIVELY USING PUBLIC DATA IN PRIVACY-PRESERVING MACHINE LEARNING

Abstract

A key challenge in differentially private machine learning is balancing the trade-off between privacy and utility. A recent line of work has demonstrated that leveraging public data samples can enhance the utility of DP-trained models (for the same privacy guarantees). In this work, we show that public data can be used to improve utility in DP models significantly more than shown in recent works. Towards this end, we introduce a modified DP-SGD algorithm that leverages public data during its training process. Our technique uses public data in two complementary ways: (1) it uses generative models trained on public data to produce synthetic data that is embedded in multiple steps of the training pipeline; (2) it uses a new gradient clipping mechanism (required for achieving differential privacy) that changes the origin of gradient vectors using information inferred from the available public and generated data. Our experimental results demonstrate the effectiveness of our approach in improving the state of the art in differentially private machine learning across multiple datasets, network architectures, and application domains. Notably, we achieve 75.1% accuracy on CIFAR10 when using only 2,000 public images; this is significantly higher than the state-of-the-art accuracy of 68.1% for DP-SGD with a privacy budget of ε = 2, δ = 10^-5 (given the same number of public data points).

1. INTRODUCTION

Machine learning is becoming an essential part of many technological advancements in various fields. A major concern with the use of machine learning is the privacy of the individuals whose data is used to develop the models. To address this concern, recent works (De et al. (2022); Kurakin et al. (2022); Abadi et al. (2016); Yu et al. (2021); Amid et al. (2021); Li et al. (2022a)) train ML models with differential privacy (Dwork et al. (2014)) guarantees. However, existing differential privacy (DP) techniques for ML significantly degrade the utility of the trained models compared to non-private models. Recent works have made substantial improvements to the utility-privacy trade-off of such private ML techniques (e.g., by scaling the hyper-parameters (De et al. (2022))); however, a large gap remains between the accuracy of DP-guaranteeing ML mechanisms and their non-private alternatives, e.g., De et al. (2022) achieve an accuracy of 65% on CIFAR10 (for ε = 2.0 and δ = 10^-5) compared to the > 90% accuracy of non-private models.

In this work, we explore an emerging approach to closing the utility gap between private and non-private models. Specifically, recent works (De et al. (2022); Kurakin et al. (2022); Abadi et al. (2016); Yu et al. (2021)) show that leveraging publicly available (therefore, non-private) data can enhance the utility of DP-trained models without impacting their privacy guarantees. In such works, the public data is used to pre-train the model, and the pre-trained model is then fine-tuned with the private data while applying DP protections. We show that, while recent works use public data only to pre-train private models, public data can be used much more effectively to enhance the utility of private models.
To this aim, we design a generic method to utilize public data in differentially private machine learning, an approach we call Differentially Private Origin Estimation Stochastic Gradient Descent (DOPE-SGD). Our work uses two complementary techniques to enhance the utility of differentially private models. First, it improves the quality of the noisy gradients based on the available non-private data. This reduces the variance of the noise added to the gradients in the DP model and therefore better preserves the information in the original gradient vector (Section 3). Second, DOPE-SGD uses advanced data augmentation techniques to enhance the quality of the data used for training, reducing overfitting to the public data and improving generalization. Through extensive experiments, we show that DOPE-SGD's use of public data along with data augmentation improves the privacy-utility trade-offs of private models by large margins. For instance, we show improvements of up to 12.3% over DP-SGD models on the CIFAR10 dataset, pre-trained with the same public data. We also show improvements on language models, both when training from scratch (from 221 to 198 on a small BERT model) and when fine-tuning (from 21.23 to 19.09 on GPT-2), with ε = 1.0 and δ = 10^-5.

2. BACKGROUND

Differential privacy (Dwork (2011); Dwork et al. (2014)) is the gold standard for data privacy. It is formally defined as follows:

Definition 1 (Differential Privacy). A randomized mechanism M with domain D and range R preserves (ε, δ)-differential privacy iff for any two neighboring datasets D, D′ ∈ D and for any subset S ⊆ R we have:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ,    (1)

where ε is the privacy budget and δ is the failure probability.
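As a concrete illustration of Definition 1 (not part of this paper's method), the classical Gaussian mechanism achieves (ε, δ)-DP by adding noise calibrated to a query's sensitivity; the function name and interface below are our own illustrative choices:

```python
import numpy as np

def gaussian_mechanism(data, query, sensitivity, epsilon, delta, rng=None):
    """Release query(data) with Gaussian noise calibrated to (epsilon, delta)-DP.

    Uses the classical calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon
    (valid for epsilon <= 1; see Dwork et al. (2014)). `sensitivity` is the maximum
    L2 change in query(data) between any two neighboring datasets.
    """
    rng = np.random.default_rng() if rng is None else rng
    true_answer = np.asarray(query(data), dtype=float)
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return true_answer + rng.normal(0.0, sigma, size=true_answer.shape)
```

For example, releasing the mean of a small dataset with ε = 1 and δ = 10^-5 yields an unbiased but noisy answer; averaging many independent releases recovers the true value, which is exactly why the privacy budget must be tracked across repeated queries.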

3.1. DP-SGD WITH ADAPTIVE ORIGIN

One of the main steps of DP-SGD is to clip the gradient of each instance based on a clipping value and then add noise to the aggregated gradient of each batch. Clipping the gradient introduces a bias into the optimization process, which slows convergence. One way to mitigate this is to use larger clipping values; however, this requires adding more noise to obtain the same privacy guarantee. The main idea of this work is to clip the gradients around an estimate of the gradient instead of clipping them around the origin, as shown in Figure 1. As a result, we can potentially clip more aggressively (i.e., use smaller clipping values) around these carefully chosen centers, with less bias in the optimization than DP-SGD, and obtain better accuracy while maintaining privacy. We first introduce a general algorithm called DP-SGDA (Algorithm 1) that uses adaptive origin selection. We use a function G that takes the history of the protocol and also some auxiliary
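The clipping-around-an-estimate idea described above can be sketched in a few lines. This is a rough sketch only: the function names, the choice of the origin, and the DP-SGD-style aggregation are our illustrative assumptions, not the paper's Algorithm 1:

```python
import numpy as np

def clip_around_origin(per_example_grads, origin, clip_norm):
    """Clip each per-example gradient to an L2 ball of radius clip_norm centered
    at `origin` (e.g., an estimate computed from public data), instead of the
    usual ball centered at zero as in standard DP-SGD."""
    clipped = []
    for g in per_example_grads:
        diff = g - origin                                       # shift: estimate is the new origin
        norm = np.linalg.norm(diff)
        diff = diff * min(1.0, clip_norm / max(norm, 1e-12))    # standard L2 clipping
        clipped.append(origin + diff)                           # shift back
    return np.stack(clipped)

def noisy_batch_gradient(per_example_grads, origin, clip_norm, noise_mult, rng):
    """One DP-SGD-style aggregation with an adaptive origin: since `origin` is
    public, each example contributes at most clip_norm to the shifted sum, so the
    Gaussian noise only needs to scale with clip_norm."""
    clipped = clip_around_origin(per_example_grads, origin, clip_norm)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=np.asarray(origin).shape)
    return clipped.mean(axis=0) + noise / len(per_example_grads)
```

If the origin estimate is close to the true batch gradient, even a very small clip_norm discards little information, which is the intuition behind clipping "more aggressively" with less bias.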



Several works have used differential privacy in traditional machine learning algorithms to protect the privacy of the training data (Li et al. (2014); Chaudhuri et al. (2011); Feldman et al. (2018); Zhang et al. (2016); Bassily et al. (2014)). Many of these works (Feldman et al. (2018); Bassily et al. (2014); Chaudhuri et al. (2011)) rely on properties such as convexity or smoothness for their privacy analysis, which do not necessarily hold in deep learning, and therefore many of these methods cannot be used in practice. Abadi et al. (2016) designed a deep learning training algorithm, DP-SGD, which uses gradient clipping to limit the sensitivity of the learning algorithm and then adds noise to the clipped model gradient proportional to its sensitivity. Since training a deep learning model is an iterative process, the main approach to analyzing the privacy cost of private deep learning is to compute the privacy cost of a single step of the learning algorithm and then use composition theorems to calculate the overall privacy cost; this accounting is commonly done in Rényi differential privacy (RDP) (Mironov (2017)) instead of (ε, δ)-DP. Another important feature of differential privacy is post-processing (Mironov (2017); Dwork et al. (2014)), which we utilize in this work. DP-SGD is now commonly used to train differentially private deep learning models.

3. DOPE-SGD: OUR IMPROVED DP-SGD ALTERNATIVE

We can improve the utility-privacy trade-off of differentially private machine learning algorithms in three phases: (1) use pre-training to improve the initial point of private training, (2) use better algorithms for differentially private training, and (3) apply post-processing to the private models. Previous works showed the effect of pre-training (Abadi et al. (2016); Kurakin et al. (2022); De et al. (2022)), so in this work we mainly focus on the last two phases of private training.
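The RDP accounting mentioned above (per-step privacy cost composed over iterations, then converted to (ε, δ)-DP) can be sketched roughly as follows. This simplified sketch assumes each step is a plain Gaussian mechanism and ignores subsampling amplification, which real accountants (e.g., in Opacus or TensorFlow Privacy) do include; the function name is ours:

```python
import numpy as np

def rdp_to_dp(steps, noise_multiplier, delta, alphas=None):
    """Simplified RDP accounting sketch (no subsampling amplification).

    Each Gaussian-mechanism step has RDP cost epsilon(alpha) = alpha / (2 sigma^2);
    RDP composes additively over steps, and converts to (epsilon, delta)-DP via
    epsilon = min over alpha of [ T * alpha / (2 sigma^2) + log(1/delta) / (alpha - 1) ]
    (Mironov (2017)).
    """
    if alphas is None:
        alphas = np.arange(2, 256, dtype=float)   # candidate Renyi orders
    sigma = noise_multiplier
    rdp = steps * alphas / (2.0 * sigma ** 2)     # composed RDP cost at each order
    eps = rdp + np.log(1.0 / delta) / (alphas - 1.0)
    return float(eps.min())
```

The key qualitative behavior is visible even in this sketch: ε grows with the number of steps and shrinks as the noise multiplier increases, which is exactly the utility-privacy tension that motivates making each noisy step more informative.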

