EASY DIFFERENTIALLY PRIVATE LINEAR REGRESSION

Abstract

Linear regression is a fundamental tool for statistical analysis. This has motivated the development of linear regression methods that also satisfy differential privacy and thus guarantee that the learned model reveals little about any one data point used to construct it. However, existing differentially private solutions assume that the end user can easily specify good data bounds and hyperparameters. Both present significant practical obstacles. In this paper, we study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models. Given n samples of d-dimensional data used to train m models, we construct an efficient analogue using an approximate Tukey depth that runs in time O(d^2 n + dm log(m)). We find that this algorithm obtains strong empirical performance in the data-rich setting with no data bounds or hyperparameter selection required.

1. INTRODUCTION

Existing methods for differentially private linear regression include objective perturbation (Kifer et al., 2012), ordinary least squares (OLS) using noisy sufficient statistics (Dwork et al., 2014; Wang, 2018; Sheffet, 2019), and DP-SGD (Abadi et al., 2016). Carefully applied, these methods can obtain high utility in certain settings. However, each method also has its drawbacks. Objective perturbation and sufficient statistics require the user to provide bounds on the feature and label norms, and DP-SGD requires extensive hyperparameter tuning (of clipping norm, learning rate, batch size, and so on). In practice, users of differentially private algorithms struggle to provide instance-specific inputs like feature and label norms without looking at the private data (Sarathy et al., 2022). Unfortunately, looking at the private data also nullifies the desired differential privacy guarantee. Similarly, while recent work has advanced the state of the art in private hyperparameter tuning (Liu & Talwar, 2019; Papernot & Steinke, 2022), non-private hyperparameter tuning remains the most common and highest-utility approach in practice. Even ignoring its (typically elided) privacy cost, this tuning adds significant time and implementation overhead. Both considerations present obstacles to differentially private linear regression in practice.

With these challenges in mind, the goal of this work is to provide an easy differentially private linear regression algorithm that works quickly and with no user input beyond the data itself. Here, "ease" refers to the experience of end users. The algorithm we propose requires care to construct and implement, but it only requires an end user to specify their dataset and desired level of privacy. We also emphasize that ease of use, while nice to have, is not itself the primary goal. Ease of use affects both privacy and utility, as an algorithm that is difficult to use will sacrifice one or both when data bounds and hyperparameters are imperfectly set.
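To make the data-bounds obstacle concrete, the following is a minimal sketch of the noisy-sufficient-statistics approach mentioned above: OLS computed from Gaussian-perturbed X^T X and X^T y. All parameter choices here (the noise calibration, the bound arguments) are illustrative assumptions, not the calibration of any of the cited papers; note that the user must supply `feature_bound` and `label_bound`, the exact inputs the text argues are hard to set without inspecting the private data.

```python
import numpy as np

def dp_ols_suff_stats(X, y, epsilon, feature_bound, label_bound, rng=None):
    """Illustrative sketch: OLS from Gaussian-noised sufficient statistics.

    Assumes rows of X have norm at most feature_bound and entries of y
    have magnitude at most label_bound (the user-supplied bounds the
    paper argues are hard to set). Noise calibration is a loose sketch,
    not a tight accounting.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    # Sensitivities of X^T X and X^T y under add/remove of one row.
    sens_xx = feature_bound ** 2
    sens_xy = feature_bound * label_bound
    # Rough Gaussian-mechanism scale for illustration (delta fixed at 1e-5).
    sigma = np.sqrt(2 * np.log(1.25 / 1e-5)) / epsilon
    noise_xx = rng.normal(scale=sigma * sens_xx, size=(d, d))
    noise_xx = (noise_xx + noise_xx.T) / 2  # keep the estimate symmetric
    noisy_xx = X.T @ X + noise_xx
    noisy_xy = X.T @ y + rng.normal(scale=sigma * sens_xy, size=d)
    return np.linalg.solve(noisy_xx, noisy_xy)
```

If either bound is set too small, clipping to it biases the estimate; too large, and the injected noise swamps the signal. This tension is exactly the usability problem the present work removes.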

1.1. CONTRIBUTIONS

Our algorithm generalizes previous work by Alabi et al. (2022), which proposes a differentially private variant of the Theil-Sen estimator for one-dimensional linear regression (Theil, 1992). The core idea is to partition the data into m subsets, non-privately estimate a regression model on each, and then apply the exponential mechanism with some notion of depth to privately estimate a high-depth model from a restricted domain that the end user specifies. In the simple one-dimensional case (Alabi et al., 2022),
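The partition-fit-score steps above can be sketched as follows. This is a simplified illustration only: it scores each candidate with a coordinatewise approximation of Tukey depth (for each model, the minimum over dimensions of the number of models on its smaller side) and forms exponential-mechanism weights over the m candidates, assuming depth sensitivity 1. It omits the restricted domain and deliberately stops short of releasing a model, since selecting one of the non-private candidates directly is not what the mechanism described in this paper does (and would not by itself be differentially private).

```python
import numpy as np

def approx_tukey_depth(models):
    """Coordinatewise approximate Tukey depth: for each candidate model,
    the min over dimensions of min(#models at or below, #models at or above)."""
    m, _ = models.shape
    depths = np.empty(m, dtype=int)
    for i in range(m):
        below = np.sum(models <= models[i], axis=0)
        above = np.sum(models >= models[i], axis=0)
        depths[i] = np.min(np.minimum(below, above))
    return depths

def partition_fit_score(X, y, epsilon, m, rng=None):
    """Sketch of the pipeline: split the data into m subsets, fit OLS on
    each, score candidates by approximate Tukey depth, and return the
    candidates with their exponential-mechanism weights (utility = depth,
    sensitivity assumed 1 for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    n, _ = X.shape
    idx = rng.permutation(n)
    models = np.stack([
        np.linalg.lstsq(X[chunk], y[chunk], rcond=None)[0]
        for chunk in np.array_split(idx, m)
    ])
    depths = approx_tukey_depth(models)
    scores = epsilon * depths / 2.0
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return models, depths, weights
```

The coordinatewise depth is what makes the method fast: each dimension is handled independently, avoiding the exponential cost of exact Tukey depth in high dimensions.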

