REGRESSION WITH LABEL DIFFERENTIAL PRIVACY *

Abstract

We study the task of training regression models with the guarantee of label differential privacy (DP). Based on a global prior distribution on label values, which could be obtained privately, we derive a label DP randomization mechanism that is optimal under a given regression loss function. We prove that the optimal mechanism takes the form of a "randomized response on bins", and propose an efficient algorithm for finding the optimal bin values. We carry out a thorough experimental evaluation on several datasets demonstrating the efficacy of our algorithm.

1. INTRODUCTION

In recent years, differential privacy (DP; Dwork et al., 2006a;b) has emerged as a popular notion of user privacy in machine learning (ML). At a high level, it guarantees that the output model weights remain statistically indistinguishable when any single training example is arbitrarily modified. Numerous DP training algorithms have been proposed, with open-source libraries tightly integrated into popular ML frameworks such as TensorFlow Privacy (Radebaugh & Erlingsson, 2019) and PyTorch Opacus (Yousefpour et al., 2021).

In the context of supervised ML, a training example consists of input features and a target label. While many existing research works focus on protecting both features and labels (e.g., Abadi et al. (2016)), there are also important scenarios where the input features are already known to the adversary, and thus protecting the privacy of the features is not needed. A canonical example arises in computational advertising, where the features are known to one website (the publisher), whereas the conversion events, i.e., the labels, are known to another website (the advertiser). Thus, from the first website's perspective, only the labels can be treated as unknown and private. This motivates the study of label DP algorithms, where statistical indistinguishability is required only when the label of a single example is modified. The study of this model goes back at least to the work of Chaudhuri & Hsu (2011). Recently, several works, including Ghazi et al. (2021a) and Malek Esmaeili et al. (2021), studied label DP deep learning algorithms for classification objectives.

Our Contributions. In this work, we study label DP for regression tasks. We provide a new algorithm that, given a global prior distribution on label values (which, if unknown, could be estimated privately), derives a label DP mechanism that is optimal under a given objective loss function.
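To make the label DP guarantee concrete, the following is a minimal sketch (not from the paper) of classical k-ary randomized response applied only to the label of an example: for any two labels y, y' and any output o, the ratio P[o | y] / P[o | y'] is at most e^ε, which is exactly the indistinguishability requirement above. The function name and interface are illustrative, not the paper's notation.

```python
import math
import random


def randomized_response(label, label_set, epsilon):
    """k-ary randomized response on the label only (features untouched).

    Reports the true label with probability e^eps / (e^eps + k - 1),
    and each other label with probability 1 / (e^eps + k - 1).
    For any output o and any pair of labels y, y':
        P[o | y] / P[o | y'] <= e^eps,
    i.e., the mechanism satisfies epsilon-label-DP.
    """
    k = len(label_set)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_true:
        return label
    # Otherwise, report a uniformly random incorrect label.
    return random.choice([l for l in label_set if l != label])
```

For example, with three labels and ε = 1, the true label is kept with probability e / (e + 2) ≈ 0.576, and the likelihood ratio between any two outputs is bounded by e.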
We provide an explicit characterization of the optimal mechanism for a broad family of objective functions, including the most commonly used regression losses such as the Poisson log loss, the mean squared error, and the mean absolute error. More specifically, we show that the optimal mechanism belongs to a class of randomized response on bins (Algorithm 1). We show this by writing the optimization problem as a linear program (LP) and characterizing its optimum. With this characterization in mind, it suffices for us to compute the optimal bin values.

* Authors in alphabetical order. Email: {badih.ghazi, ravi.k53}@gmail.com, {pritishk, ethanleeman, pasin, avaradar, chiyuan}@google.com
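A minimal sketch of the "randomized response on bins" idea, assuming the bin values have already been chosen (the paper's contribution includes how to choose them optimally; the names and interface here are illustrative): the true label is first snapped to the bin that minimizes the given loss, and randomized response is then applied over the bin values.

```python
import math
import random


def rr_on_bins(y, bins, loss, epsilon):
    """Sketch of randomized response on bins.

    1. Snap the true label y to the bin value minimizing loss(y, b).
    2. Apply k-ary randomized response over the k bin values, so the
       output reveals the snapped bin only up to an e^eps likelihood
       ratio, giving epsilon-label-DP.

    The choice of bin values drives the expected loss; the paper shows
    how to find the loss-optimal bins given a prior on labels.
    """
    k = len(bins)
    best = min(bins, key=lambda b: loss(y, b))
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_keep:
        return best
    return random.choice([b for b in bins if b != best])
```

For instance, with bins [0, 5, 10] and the squared loss, a true label of 4 is snapped to the bin value 5 before randomization.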

A similar use case arises in mobile advertising, where websites are replaced by apps. We note that this label DP setting is particularly timely and relevant for ad attribution and conversion measurement, given the deprecation of third-party cookies by several browsers and platforms (Wilander, 2020; Wood, 2019; Schuh, 2020).

