SELF-SUPERVISED PRETRAINING FOR DIFFERENTIALLY PRIVATE LEARNING

Abstract

We demonstrate that self-supervised pretraining (SSP) is a scalable solution to deep learning with differential privacy (DP) in image classification, regardless of the size of the available public datasets. When no public datasets are available, we show that the features generated by SSP on only a single image enable a private classifier to obtain much better utility than non-learned handcrafted features under the same privacy budget. When a moderate or large public dataset is available, the features produced by SSP greatly outperform the features trained with labels on various complex private datasets under the same privacy budget. We also compare multiple DP-enabled training frameworks for training a private classifier on the features generated by SSP.

1. INTRODUCTION

Machine learning (ML) has been applied ubiquitously in the analysis of sensitive data such as medical images (Tajbakhsh et al., 2016), financial records (Fischer & Krauss, 2018), and social media channels (Agrawal & Awekar, 2018). Many attacks (Shokri et al., 2017; Carlini et al., 2021) have been developed to successfully extract meaningful training data from standard ML models. According to recent governmental regulations, e.g., GDPR and CCPA, ML models have to protect sensitive training data. Differential privacy (DP) (Chaudhuri et al., 2011; Bu et al., 2020; Abadi et al., 2016) has emerged as an effective framework for training models resilient to private training data leakage. Unfortunately, training models with strong DP guarantees significantly hurts model utility (i.e., accuracy) (Papernot et al., 2018; Abadi et al., 2016). Although non-learned handcrafted features such as ScatterNet (Oyallon & Mallat, 2015; Oyallon et al., 2018) allow a private linear model (Tramer & Boneh, 2021) to achieve the state-of-the-art (SOTA) utility of < 70% under the privacy budget of (ϵ ≤ 3, δ = 10^-5) on a private CIFAR-10 dataset, it is difficult to learn better features in the DP domain, since the clipped and perturbed gradients during DP training provide only a noisy estimate of the update direction. In contrast, pretrained features learned from large public labeled datasets (Luo et al., 2021) can greatly mitigate the utility gap between private and non-private models. However, sometimes no public dataset is available for training a feature extractor due to legal or ethical constraints (Flanders, 2009). In this paper, we aim to demonstrate that self-supervised pretraining (SSP) is a scalable solution to improving the utility of deep learning with DP in image classification, regardless of the size of the available public datasets. Any update to the learnable parameters of a differentially private model increases the privacy overhead.
It is therefore easier to achieve both high utility and small privacy loss via the features generated by a well-trained feature extractor that can fully exploit SOTA network architectures and public datasets. Even when no large public dataset is available, we show that a feature extractor built upon the data-efficient HarmonicNet (Ulicny et al., 2019) and trained by self-supervised SimCLRv2 (Chen et al., 2020) on only a single image (YM. et al., 2020) enables a private linear classifier to obtain much better utility than non-learned handcrafted features (Tramer & Boneh, 2021) under the same privacy budget. With a larger public dataset, the features generated by SSP substantially outperform the features trained with labels on various complex private datasets, as shown in Table 1. To better explore the trade-off between utility and privacy, we compared SOTA DP-enabled training frameworks, i.e., DP stochastic gradient descent (DPSGD) (Abadi et al., 2016), DP direct feedback alignment (DPDFA) (Ohana et al., 2021; Lee & Kifer, 2020), DP stochastic gradient Langevin dynamics (DPSGLD) (Bu et al., 2021), and Private Aggregation of Teacher Ensembles (PATE), for training a private classifier on the features generated by SSP.

Our contributions are summarized as follows:

• When facing a lack of public datasets, we adopt HarmonicNet as the backbone of SimCLRv2 to learn from only one image. The features extracted by HarmonicNet greatly outperform the non-learned handcrafted ScatterNet features on various private datasets: by 0.6% on CIFAR10, 1.4% on CIFAR100, 6.7% on CropDiseases, 49.7% on EuroSAT, and 3.5% on ISIC2018, when ϵ = 2.
• When a moderate or large public dataset is available, the features produced by SSP improve the utility on these complex private datasets over the features trained with labels by 0.5% ∼ 8.6% under the privacy budget of ϵ = 2.
• We compared SOTA DP-enabled training frameworks, i.e., DPSGD, DPDFA, DPSGLD, and PATE, for training a private classifier on the features produced by SSP. Compared to DPSGD, DPSGLD obtains better utility on private datasets when ϵ ≤ 1, while DPDFA achieves higher utility than DPSGD on private datasets with a smaller learning distance from the public dataset when ϵ > 0.5.
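The overall recipe in the contributions above, i.e., freezing an SSP feature extractor and training only a small private classifier on its features, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the "encoder" is a fixed random projection standing in for a pretrained SimCLRv2 network, and all names (`extract_features`, `train_private_linear`, the hyper-parameters) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen SSP feature extractor (e.g. a SimCLRv2-pretrained
# network); here just a fixed random projection so the sketch is self-contained.
W_enc = rng.normal(size=(32, 8))

def extract_features(x):
    # x: (n, 32) raw inputs -> (n, 8) frozen features; no gradients flow here,
    # so the DP noise budget is spent only on the linear classifier below.
    return np.tanh(x @ W_enc)

def train_private_linear(feats, labels, clip=1.0, sigma=1.0, lr=0.5, steps=300):
    """Logistic-regression head trained with DPSGD-style clipped, noisy updates."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        preds = 1.0 / (1.0 + np.exp(-feats @ w))
        per_ex = (preds - labels)[:, None] * feats            # per-example grads
        norms = np.linalg.norm(per_ex, axis=1, keepdims=True)
        per_ex = per_ex * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        g = per_ex.sum(axis=0) + rng.normal(0.0, sigma * clip, size=w.shape)
        w -= lr * g / len(feats)
    return w

# Toy usage: labels defined by an unknown linear rule in feature space.
x = rng.normal(size=(200, 32))
feats = extract_features(x)
w_true = rng.normal(size=8)
labels = (feats @ w_true > 0).astype(float)
w = train_private_linear(feats, labels)
acc = ((feats @ w > 0) == (labels > 0.5)).mean()
```

Because only the linear head sees per-example gradients, the noise needed for a given (ϵ, δ) perturbs far fewer parameters than training the whole network privately.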

2. BACKGROUND

2.1 DIFFERENTIALLY PRIVATE LEARNING

Differential privacy. A network model M : D → R is trained on one of two datasets D, D′ ∈ D that differ in only a single data record. The model satisfies (ϵ, δ)-differential privacy (DP) (Abadi et al., 2016) if, for any subset of outputs R ⊆ R,

Pr[M(D) ∈ R] ≤ e^ϵ · Pr[M(D′) ∈ R] + δ.

In other words, ϵ bounds the privacy loss on any single sample, and δ is the probability that this bound does not hold. Rényi DP (RDP) (Mironov, 2017) is a generalization of (ϵ, δ)-DP that uses the Rényi divergence as its distance metric. More RDP details can be found in Appendix A.
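As a concrete illustration of this definition (not part of the paper's method), the classical Gaussian mechanism releases a bounded-sensitivity query under (ϵ, δ)-DP by adding noise calibrated as σ = √(2 ln(1.25/δ)) · Δ / ϵ, a bound valid for ϵ < 1. The function names and the example query below are hypothetical.

```python
import numpy as np

def gaussian_mechanism(query_result, sensitivity, epsilon, delta, rng):
    """Release a query result with (epsilon, delta)-DP by adding Gaussian noise.

    Uses the classical calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity
    / epsilon, which guarantees (epsilon, delta)-DP for epsilon < 1."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(query_result))
    return query_result + noise, sigma

# Example: privately release the mean of a bounded dataset (values in [0, 1]).
# Changing one of the n records moves the mean by at most 1/n, so the
# L2 sensitivity of the mean query is 1/n.
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=1000)
private_mean, sigma = gaussian_mechanism(data.mean(), 1.0 / len(data),
                                         epsilon=0.5, delta=1e-5, rng=rng)
```

Note how the noise scale grows as ϵ shrinks: halving ϵ doubles σ, which is the utility cost of a stronger guarantee that the paper repeatedly trades off against.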



Table 1: The utility and privacy comparison on a private CIFAR10 dataset. DP-SOTA-1: Tramer & Boneh (2021); DP-SOTA-2: Luo et al. (2021). Private datasets having different learning distances from the public dataset favor different training frameworks under different privacy budgets.

• DPSGD. Each per-example gradient is clipped to an L2 norm bound C, and Gaussian noise of variance σ²C² is added to the gradient updates for a noise scale parameter σ. The privacy cost ϵ of DPSGD can be measured by the Moments Accountant (Abadi et al., 2016), which computes a tighter upper bound on ϵ than standard composition theorems. The privacy loss rate depends on the hyper-parameters (Mohapatra et al., 2021) of DPSGD.
• DPSGLD. SGLD (Welling & Teh, 2011) is a gradient technique to train Bayesian networks. SGLD makes the weights of a model converge to a posterior distribution rather than to a point estimate.
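The DPSGD update described above (clip each per-example gradient to norm C, sum, add noise of variance σ²C², average) can be sketched in a few lines. This is an illustrative sketch with hypothetical names, not the implementation used in the paper's experiments.

```python
import numpy as np

def dpsgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DPSGD update direction.

    Clips each per-example gradient to L2 norm clip_norm (C), sums them,
    adds Gaussian noise with standard deviation noise_multiplier * C
    (i.e. variance sigma^2 * C^2), and averages over the batch."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    total = total + rng.normal(0.0, noise_multiplier * clip_norm,
                               size=total.shape)
    return total / len(per_example_grads)

# Usage with noise disabled, to show the clipping behavior in isolation:
grads = [np.array([3.0, 4.0]),   # norm 5.0 -> rescaled to norm 1.0
         np.array([0.3, 0.4])]   # norm 0.5 -> left unchanged
rng = np.random.default_rng(0)
update = dpsgd_step(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
# update == ([0.6, 0.8] + [0.3, 0.4]) / 2 == [0.45, 0.6]
```

Clipping bounds each example's contribution (the sensitivity), which is what makes the added Gaussian noise sufficient for the DP guarantee.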

