EXPLORING THE LIMITS OF DIFFERENTIALLY PRIVATE DEEP LEARNING WITH GROUP-WISE CLIPPING

Abstract

Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of group-wise clipping. To reduce the compute time overhead of private learning, we show that per-layer clipping, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with per-device clipping that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at ϵ = 1 better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.

1. INTRODUCTION

Recent works on deep learning with differential privacy (DP) have substantially improved the computational efficiency (Subramani et al., 2021; Anil et al., 2021) and privacy-utility trade-off (Li et al., 2022a; Yu et al., 2022; De et al., 2022; Mehta et al., 2022), resulting in cost-effective private learning workflows with favourable utility under common levels of privacy guarantee. Common to most of these works is the use of differentially private stochastic gradient descent (DP-SGD), which clips per-example gradients (herein referred to as flat clipping) and noises their average before performing the parameter update based on a minibatch (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016). We explore whether further improvements in computational efficiency and privacy-utility trade-off are possible and provide affirmative answers for both directions, leveraging two instantiations of group-wise clipping for DP-SGD.

DP-SGD is known to be computationally costly due to clipping per-example gradients. Instantiating per-example gradients and (potentially) normalizing them can incur both high memory and time costs in standard machine learning frameworks (Paszke et al., 2019; Frostig et al., 2018), and thus private machine learning with DP-SGD is reportedly much more memory-demanding and/or slower than its non-private counterpart (Carlini et al., 2019; Hoory et al., 2021). Recent works have considerably improved the memory and time efficiency of DP-SGD with better software primitives (Subramani et al., 2021) and algorithms (Yousefpour et al., 2021; Lee & Kifer, 2021; Li et al., 2022b; Bu et al., 2022). Nevertheless, private learning still shows non-trivial increases in either memory usage or compute time when compared head-to-head with non-private learning.
For instance, better software primitives do not eliminate the inherent increase in memory spending (Subramani et al., 2021), and improved algorithms remove this memory overhead only at the cost of extra runtime (Li et al., 2022b). The first research question we study is therefore: Can private learning be as memory- and time-efficient (per epoch) as non-private learning?

We answer this question affirmatively by giving an efficient implementation of per-layer clipping, which had been studied in past works but not from a computational efficiency perspective (McMahan et al., 2018b; Dupuy et al., 2022). Clipping the per-example gradients of separate neural network layers (e.g., linear, convolution) separately allows clipping to be performed in conjunction with backpropagation. This results in private learning that is as memory-efficient and almost as time-efficient per training update as non-private learning for many small- to moderate-scale workflows of practical interest. While per-layer clipping with static clipping thresholds chosen by hand tends to underperform flat clipping, we show that per-layer clipping with adaptively estimated thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance with less wall time.

DP-SGD is known to (possibly) incur substantial performance losses compared to non-private learning. To improve the privacy-utility trade-off, several past works have leveraged large-scale publicly pretrained models (Yu et al., 2022; Li et al., 2022b; De et al., 2022; Mehta et al., 2022). These works observe that the privacy-utility trade-off improves with the use of larger (and thus better) pretrained models.¹ We extend this line of research and study a second research question: Can the privacy-utility trade-off be further improved with even better/larger pretrained models?
To study this, we scale DP fine-tuning to work with one of the largest and most performant pretrained language models to date: the original 175 billion-parameter GPT-3. The weights of this model cannot be hosted in the memory of a single GPU and must be distributed across multiple devices. This presents challenges for flat clipping, which calls for communicating per-example gradient norms across devices. To bypass these challenges, we turn to per-device clipping, where each device is prescribed a clipping threshold for clipping per-example gradients of the model piece it hosts. Per-device clipping incurs no additional communication cost and allows us to obtain with GPT-3 a private fine-tuning performance at ϵ = 1 that is better than what is attainable by non-privately fine-tuning the largest GPT-2 on a challenging summarization task.

Our contributions are summarized below.
(1) We show that per-layer clipping enables clipping to be performed in conjunction with backpropagation in DP optimization and results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many small- to moderate-scale workflows of interest.
(2) We show that adaptive per-layer clipping matches or outperforms flat clipping under fixed training epoch constraints, and thus attains similar or better task performance with less wall time.
(3) We bypass the scaling challenges associated with communicating per-example gradient norms via per-device clipping, with which we scale DP fine-tuning to work with the 175 billion-parameter GPT-3 and obtain improved task performance on a challenging summarization task at ϵ = 1.
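The two group-wise clipping instantiations above can be sketched as follows. This is a simplified, single-process, pure-Python illustration under our own assumptions, not the authors' implementation: real systems clip framework tensors as they become available during backpropagation, and the threshold values and shard layout here are hypothetical.

```python
import math


def clip_to(grad, threshold):
    """Scale grad (a flat list of floats) so its L2 norm is at most threshold."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, threshold / (norm + 1e-12))
    return [scale * x for x in grad]


def per_layer_clipped_sum(per_example_layer_grads, thresholds):
    """Per-layer clipping fused with backpropagation: layer gradients arrive
    one at a time in reverse-mode autodiff, and each layer's per-example
    gradients are clipped and accumulated immediately, so full per-example
    gradients never need to be materialized all at once."""
    layer_sums = []
    for layer_grads, c in zip(per_example_layer_grads, thresholds):
        acc = [0.0] * len(layer_grads[0])
        for g in layer_grads:  # g: one example's gradient for this layer
            for i, v in enumerate(clip_to(g, c)):
                acc[i] += v
        layer_sums.append(acc)
    return layer_sums


def per_device_clip(sharded_grad, device_thresholds):
    """Per-device clipping: each device clips the per-example gradient of
    its hosted model piece using only local information, so no per-example
    gradient norms need to be communicated across devices."""
    return {dev: clip_to(g, device_thresholds[dev])
            for dev, g in sharded_grad.items()}
```

In both schemes the clipped sums are then noised and averaged as in standard DP-SGD; since each group's contribution is bounded by its own threshold cᵢ, the overall per-example gradient norm is bounded by (Σᵢ cᵢ²)^½, which is what the privacy accounting uses.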

2. PRELIMINARIES

This section covers background on gradient clipping in DP optimization and explains why alternative group-wise clipping strategies can be attractive from a computational efficiency standpoint. Our work trains machine learning models (more precisely, deep neural networks) with optimizers that guarantee DP. For completeness, we recap the definition of approximate DP, i.e., (ϵ, δ)-DP, below.



* The work of Jiyan He, Xuechen Li, and Da Yu was done while they were interns at Microsoft Research. † Equal contributions. Correspondence to Jiyan He, Xuechen Li, Huishuai Zhang, and Janardhan Kulkarni.

¹ While size does not equate to quality, the two are strongly correlated under currently popular pretraining techniques (Liu et al., 2019; Brown et al., 2020). We are optimistic that future smaller models pretrained with improved techniques can be as performant as current large models (Hoffmann et al., 2022).



Definition 2.1 ((ϵ, δ)-DP). A randomized algorithm M : X → Y is (ϵ, δ)-DP if for all neighboring datasets D, D′ ∈ X and all Y ⊆ Y, it holds that Pr[M(D) ∈ Y] ≤ exp(ϵ) · Pr[M(D′) ∈ Y] + δ.

