EASY DIFFERENTIALLY PRIVATE LINEAR REGRESSION

Abstract

Linear regression is a fundamental tool for statistical analysis. This has motivated the development of linear regression methods that also satisfy differential privacy and thus guarantee that the learned model reveals little about any one data point used to construct it. However, existing differentially private solutions assume that the end user can easily specify good data bounds and hyperparameters. Both present significant practical obstacles. In this paper, we study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models. Given n samples of d-dimensional data used to train m models, we construct an efficient analogue using an approximate Tukey depth that runs in time O(d^2 n + dm log(m)). We find that this algorithm obtains strong empirical performance in the data-rich setting with no data bounds or hyperparameter selection required.

1. INTRODUCTION

Existing methods for differentially private linear regression include objective perturbation (Kifer et al., 2012), ordinary least squares (OLS) using noisy sufficient statistics (Dwork et al., 2014; Wang, 2018; Sheffet, 2019), and DP-SGD (Abadi et al., 2016). Carefully applied, these methods can obtain high utility in certain settings. However, each method also has its drawbacks. Objective perturbation and sufficient statistics require the user to provide bounds on the feature and label norms, and DP-SGD requires extensive hyperparameter tuning (of clipping norm, learning rate, batch size, and so on). In practice, users of differentially private algorithms struggle to provide instance-specific inputs like feature and label norms without looking at the private data (Sarathy et al., 2022). Unfortunately, looking at the private data also nullifies the desired differential privacy guarantee. Similarly, while recent work has advanced the state of the art of private hyperparameter tuning (Liu & Talwar, 2019; Papernot & Steinke, 2022), non-private hyperparameter tuning remains the most common and highest utility approach in practice. Even ignoring its (typically elided) privacy cost, this tuning adds significant time and implementation overhead. Both considerations present obstacles to differentially private linear regression in practice.

With these challenges in mind, the goal of this work is to provide an easy differentially private linear regression algorithm that works quickly and with no user input beyond the data itself. Here, "ease" refers to the experience of end users. The algorithm we propose requires care to construct and implement, but it only requires an end user to specify their dataset and desired level of privacy. We also emphasize that ease of use, while nice to have, is not itself the primary goal.
Ease of use affects both privacy and utility, as an algorithm that is difficult to use will sacrifice one or both when data bounds and hyperparameters are imperfectly set.
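To illustrate how data bounds enter the picture, the sketch below computes OLS from Gaussian-noised sufficient statistics, in the spirit of the baselines discussed above. It is our own simplified illustration, not a specific published algorithm: the function name, the even budget split, and the noise calibration are illustrative choices. The key point is that the user must supply feature_bound and label_bound up front, and the clipping step silently distorts any data that violates those guesses.

```python
import numpy as np

def noisy_suff_stats_ols(X, y, epsilon, delta, feature_bound, label_bound, seed=0):
    """Illustrative OLS from noisy sufficient statistics; not a published method."""
    rng = np.random.default_rng(seed)
    # Clip rows and labels so the user-supplied bounds actually hold.
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    X = X * np.minimum(1.0, feature_bound / norms)
    y = np.clip(y, -label_bound, label_bound)
    d = X.shape[1]
    # Adding/removing one row changes X^T X by at most feature_bound^2
    # (Frobenius norm) and X^T y by at most feature_bound * label_bound (L2).
    # Split (epsilon, delta) evenly across the two statistics.
    sigma = np.sqrt(2 * np.log(1.25 / (delta / 2))) / (epsilon / 2)
    noise_xx = rng.normal(0.0, sigma * feature_bound**2, (d, d))
    noise_xx = (noise_xx + noise_xx.T) / np.sqrt(2)  # keep X^T X symmetric
    xx = X.T @ X + noise_xx
    xy = X.T @ y + rng.normal(0.0, sigma * feature_bound * label_bound, d)
    # Small ridge term for numerical stability (chosen non-privately here).
    return np.linalg.solve(xx + np.eye(d), xy)
```

Misjudging either bound hurts: bounds set too small clip away signal, while bounds set too large inflate the noise scale.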

1.1. CONTRIBUTIONS

Our algorithm generalizes previous work by Alabi et al. (2022), which proposes a differentially private variant of the Theil-Sen estimator for one-dimensional linear regression (Theil, 1992). The core idea is to partition the data into m subsets, non-privately estimate a regression model on each, and then apply the exponential mechanism with some notion of depth to privately estimate a high-depth model from a restricted domain that the end user specifies. In the simple one-dimensional case (Alabi et al., 2022), each model is a slope, the natural notion of high depth is the median, and the user provides an interval of candidate slopes. We generalize this in two ways to obtain our algorithm, TukeyEM. First, we replace the median with a multidimensional analogue based on Tukey depth. Second, we adapt a technique based on propose-test-release (PTR), originally introduced by Brown et al. (2021) for private estimation of unbounded Gaussians, to construct an algorithm that does not require bounds on the domain of the overall exponential mechanism.

We find that a version of TukeyEM using an approximate and efficiently computable notion of Tukey depth achieves empirical performance competitive with (and often exceeding) that of non-privately tuned baseline private linear regression algorithms, across several synthetic and real datasets. We highlight that the approximation only affects utility and efficiency; TukeyEM is still differentially private. Given an instance where TukeyEM constructs m models from n samples of d-dimensional data, the main guarantee for our algorithm is the following:

Theorem 1.1. TukeyEM is (ε, δ)-DP and takes time O(d^2 n + dm log(m)).

Two caveats apply. First, our use of PTR comes at the cost of an approximate (ε, δ)-DP guarantee as well as a failure probability: depending on the dataset, it is possible that the PTR step fails and no regression model is output.
Second, the algorithm technically has one hyperparameter, the number m of models trained. Our mitigation of both issues is empirical. Across several datasets, we observe that a simple heuristic relating the number of samples n to the number of features d, derived from synthetic experiments, typically suffices to ensure that the PTR step passes and specifies a high-utility choice of m. For the bulk of our experiments, the required relationship is on the order of n ≳ 1000 · d. We emphasize that this heuristic is based only on the data dimensions n and d and does not require further knowledge of the data itself.
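To make the partition-fit-select idea concrete, the following is a heavily simplified Python sketch using a coordinatewise approximate Tukey depth (the smaller of the counts of models on either side, per coordinate, minimized over coordinates). It is a sketch only: the actual TukeyEM samples from continuous depth regions weighted by volume and runs the PTR check first, both of which are omitted here, so this code should not be read as the full (ε, δ)-DP algorithm.

```python
import numpy as np

def fit_models(X, y, m, rng):
    # Randomly partition the n samples into m subsets and fit OLS on each.
    idx = rng.permutation(len(X))
    models = []
    for part in np.array_split(idx, m):
        beta, *_ = np.linalg.lstsq(X[part], y[part], rcond=None)
        models.append(beta)
    return np.array(models)  # shape (m, d)

def approx_tukey_depth(point, models):
    # Coordinatewise approximate Tukey depth: for each coordinate, count
    # models on either side, take the smaller count, then min over coordinates.
    below = np.sum(models <= point, axis=0)
    above = np.sum(models >= point, axis=0)
    return int(np.min(np.minimum(below, above)))

def tukey_em_sketch(X, y, m, epsilon, seed=0):
    rng = np.random.default_rng(seed)
    models = fit_models(X, y, m, rng)
    # Score every candidate model by its depth among all m models.
    depths = np.array([approx_tukey_depth(p, models) for p in models])
    # Exponential mechanism over the candidates: adding or removing one
    # sample changes one model, so depth has sensitivity 1.
    logits = epsilon * depths / 2
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return models[rng.choice(m, p=probs)]
```

Because each subset is fit non-privately with plain OLS, no feature or label bounds are needed; the privacy comes entirely from the selection step over the m models.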

1.2. RELATED WORK

Linear regression is a specific instance of the more general problem of convex optimization. Ignoring dependence on the parameter and input space diameter for brevity, DP-SGD (Bassily et al., 2014) and objective perturbation (Kifer et al., 2012) obtain the optimal O(√d/ε) error for empirical risk minimization, and AdaOPS and AdaSSP also match this bound (Wang, 2018). Similar results are known for population loss (Bassily et al., 2019), with still stronger results available under additional statistical assumptions on the data (Cai et al., 2020; Varshney et al., 2022). Recent work provides theoretical guarantees with no boundedness assumptions on the features or labels (Milionis et al., 2022), but it requires bounds on the data's covariance matrix to use an efficient subroutine for private Gaussian estimation and does not include an empirical evaluation. The main difference between these works and ours is empirical utility without data bounds and hyperparameter tuning.

Another relevant work is that of Liu et al. (2022), which also composes a PTR step adapted from Brown et al. (2021) with a call to a restricted exponential mechanism. The main drawback of this work is that, as in the earlier work (Brown et al., 2021), neither the PTR step nor the restricted exponential mechanism is efficient. The same applies to other works that have applied Tukey depth to private estimation (Beimel et al., 2019; Kaplan et al., 2020; Liu et al., 2021; Ramsay & Chenouri, 2021). The main difference between these works and ours is that our approach produces an efficient, implemented mechanism.

Finally, concurrent independent work by Cumings-Menon (2022) also studies the use of Tukey depth, as well as the separate notion of regression depth, to privately select from a collection of non-private regression models. A few differences exist between their work and ours. First, they rely on additive noise scaled to smooth sensitivity to construct a private estimate of a high-depth point.
Second, their methods are not computationally efficient beyond small d, and are only evaluated for d ≤ 2. Third, their methods require the end user to specify bounds on the parameter space.

2. PRELIMINARIES

We start with the definition of differential privacy, using the "add-remove" variant. 



* {kamin, mtjoseph, mribero, sergeiv}@google.com. Part of this work done while Mónica was at UT Austin.



Definition 2.1 (Dwork et al. (2006)). Databases D, D′ from data domain X are neighbors, denoted D ∼ D′, if they differ in the presence or absence of a single record. A randomized mechanism

