EFFECTIVELY USING PUBLIC DATA IN PRIVACY-PRESERVING MACHINE LEARNING

Abstract

A key challenge in differentially private machine learning is balancing the trade-off between privacy and utility. A recent line of work has demonstrated that leveraging public data samples can enhance the utility of DP-trained models (for the same privacy guarantees). In this work, we show that public data can be used to improve utility in DP models significantly more than shown in recent works. Towards this end, we introduce a modified DP-SGD algorithm that leverages public data during its training process. Our technique uses public data in two complementary ways: (1) it uses generative models trained on public data to produce synthetic data that is effectively embedded in multiple steps of the training pipeline; (2) it uses a new gradient clipping mechanism (required for achieving differential privacy) that changes the origin of gradient vectors using information inferred from the available public and generated data. Our experimental results demonstrate the effectiveness of our approach in improving the state of the art in differentially private machine learning across multiple datasets, network architectures, and application domains. Notably, we achieve 75.1% accuracy on CIFAR10 when using only 2,000 public images; this is significantly higher than the state-of-the-art accuracy of 68.1% for DP-SGD with a privacy budget of ε = 2, δ = 10^-5 (given the same number of public data points).

1. INTRODUCTION

Machine learning is becoming an essential part of many technological advancements in various fields. One major concern with the usage of machine learning is the privacy of the individuals whose data is used to develop the machine learning models. To tackle this concern, recent works (De et al. (2022); Kurakin et al. (2022); Abadi et al. (2016); Yu et al. (2021); Amid et al. (2021); Li et al. (2022a)) suggest training ML models with differential privacy (Dwork et al. (2014)) guarantees. However, existing differential privacy (DP) techniques for ML impose a large degradation on the utility of the trained models in comparison to non-private models. Recent works have made substantial improvements to the utility-privacy trade-off of such private ML techniques (e.g., by scaling the hyper-parameters (De et al. (2022); Kurakin et al. (2022); Abadi et al. (2016); Yu et al. (2021))); however, there still exists a huge gap between the accuracy of DP-guaranteeing ML mechanisms and their non-private alternatives, e.g., De et al. (2022) achieves an accuracy of 65% on CIFAR10 (for ε = 2.0 and δ = 10^-5) compared to the >90% accuracy of non-private models.

In this work, we explore an emerging approach to closing the utility gap between private and non-private models. Specifically, recent works (e.g., De et al. (2022); Yu et al. (2021)) show that leveraging publicly available (and therefore non-private) data can enhance the utility of DP-trained models without impacting their privacy guarantees. In such works, the public data is used to pre-train the model, and the pre-trained model is then fine-tuned on the private data while applying DP protections. In this work, we show that, while recent works use public data only to pre-train private models, public data can be used much more effectively to enhance the utility of private models. To this aim, we design a generic method to utilize public data in differentially private machine learning, an approach we call Differentially Private Origin Estimation Stochastic Gradient Descent (DOPE-SGD). Our work uses two complementary techniques to enhance the utility of differentially private models. First, it improves the quality of the noisy gradients based on the available non-private data. This helps by reducing the variance of the noise added to the gradients in the DP model, therefore better preserving
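To make the origin-estimation idea concrete, the following is a minimal sketch of a DP-SGD aggregation step in which per-example gradients are clipped around an origin vector (e.g., a mean gradient estimated from public or synthetic data) rather than around zero. The function names, tensor shapes, and the use of plain NumPy are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def clip(v, c):
    """Scale v down so that its L2 norm is at most c."""
    norm = np.linalg.norm(v)
    return v * min(1.0, c / (norm + 1e-12))

def dpsgd_step(per_example_grads, clip_norm, noise_mult, origin=None, rng=None):
    """One noisy gradient-aggregation step.

    Standard DP-SGD clips each per-example gradient around the zero
    vector. Passing an `origin` (e.g., a mean gradient computed on public
    data, which costs no privacy budget) clips the *difference* from that
    origin instead; when the origin is close to the true gradients, the
    clipped quantities are smaller, so the same absolute noise perturbs
    the signal relatively less.
    """
    rng = rng or np.random.default_rng(0)
    if origin is None:
        origin = np.zeros_like(per_example_grads[0])
    # Clip each gradient's offset from the origin, then shift back.
    clipped = [origin + clip(g - origin, clip_norm) for g in per_example_grads]
    # Gaussian-mechanism noise calibrated to the clipping norm.
    noise = rng.normal(0.0, noise_mult * clip_norm, size=origin.shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
```

With `origin=None` this reduces to the usual DP-SGD clipped mean; with a public-data origin, gradients whose offset from the origin already lies within `clip_norm` pass through unchanged even if their own norms are large.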

