EASY DIFFERENTIALLY PRIVATE LINEAR REGRESSION

Abstract

Linear regression is a fundamental tool for statistical analysis. This has motivated the development of linear regression methods that also satisfy differential privacy and thus guarantee that the learned model reveals little about any one data point used to construct it. However, existing differentially private solutions assume that the end user can easily specify good data bounds and hyperparameters. Both present significant practical obstacles. In this paper, we study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models. Given n samples of d-dimensional data used to train m models, we construct an efficient analogue using an approximate Tukey depth that runs in time O(d^2 n + dm log(m)). We find that this algorithm obtains strong empirical performance in the data-rich setting with no data bounds or hyperparameter selection required.

1. INTRODUCTION

Existing methods for differentially private linear regression include objective perturbation (Kifer et al., 2012), ordinary least squares (OLS) using noisy sufficient statistics (Dwork et al., 2014; Wang, 2018; Sheffet, 2019), and DP-SGD (Abadi et al., 2016). Carefully applied, these methods can obtain high utility in certain settings. However, each method also has its drawbacks. Objective perturbation and sufficient statistics require the user to provide bounds on the feature and label norms, and DP-SGD requires extensive hyperparameter tuning (of clipping norm, learning rate, batch size, and so on). In practice, users of differentially private algorithms struggle to provide instance-specific inputs like feature and label norms without looking at the private data (Sarathy et al., 2022). Unfortunately, looking at the private data also nullifies the desired differential privacy guarantee. Similarly, while recent work has advanced the state of the art of private hyperparameter tuning (Liu & Talwar, 2019; Papernot & Steinke, 2022), non-private hyperparameter tuning remains the most common and highest-utility approach in practice. Even ignoring its (typically elided) privacy cost, this tuning adds significant time and implementation overhead. Both considerations present obstacles to differentially private linear regression in practice.

With these challenges in mind, the goal of this work is to provide an easy differentially private linear regression algorithm that works quickly and with no user input beyond the data itself. Here, "ease" refers to the experience of end users. The algorithm we propose requires care to construct and implement, but it only requires an end user to specify their dataset and desired level of privacy. We also emphasize that ease of use, while nice to have, is not itself the primary goal. Ease of use affects both privacy and utility, as an algorithm that is difficult to use will sacrifice one or both when data bounds and hyperparameters are imperfectly set.

1.1. CONTRIBUTIONS

Our algorithm generalizes previous work by Alabi et al. (2022), which proposes a differentially private variant of the Theil-Sen estimator for one-dimensional linear regression (Theil, 1992). The core idea is to partition the data into m subsets, non-privately estimate a regression model on each, and then apply the exponential mechanism with some notion of depth to privately estimate a high-depth model from a restricted domain that the end user specifies. In the simple one-dimensional case (Alabi et al., 2022) each model is a slope, the natural notion of high depth is the median, and the user provides an interval for candidate slopes. We generalize this in two ways to obtain our algorithm, TukeyEM. First, we replace the median with a multidimensional analogue based on Tukey depth. Second, we adapt a technique based on propose-test-release (PTR), originally introduced by Brown et al. (2021) for private estimation of unbounded Gaussians, to construct an algorithm which does not require bounds on the domain for the overall exponential mechanism. We find that a version of TukeyEM using an approximate and efficiently computable notion of Tukey depth achieves empirical performance competitive with (and often exceeding) that of non-privately tuned baseline private linear regression algorithms, across several synthetic and real datasets. We highlight that the approximation only affects utility and efficiency; TukeyEM is still differentially private. Given an instance where TukeyEM constructs m models from n samples of d-dimensional data, the main guarantee for our algorithm is the following:

Theorem 1.1. TukeyEM is (ε, δ)-DP and takes time O(d^2 n + dm log(m)).

Two caveats apply. First, our use of PTR comes at the cost of an approximate (ε, δ)-DP guarantee as well as a failure probability: depending on the dataset, it is possible that the PTR step fails, and no regression model is output. Second, the algorithm technically has one hyperparameter, the number m of models trained. Our mitigation of both issues is empirical. Across several datasets, we observe that a simple heuristic about the relationship between the number of samples n and the number of features d, derived from synthetic experiments, typically suffices to ensure that the PTR step passes and specifies a high-utility choice of m. For the bulk of our experiments, the required relationship is on the order of n ≳ 1000 · d. We emphasize that this heuristic is based only on the data dimensions n and d and does not require further knowledge of the data itself.

1.2. RELATED WORK

Linear regression is a specific instance of the more general problem of convex optimization. Ignoring dependence on the parameter and input space diameter for brevity, DP-SGD (Bassily et al., 2014) and objective perturbation (Kifer et al., 2012) obtain the optimal O(√d/ε) error for empirical risk minimization. AdaOPS and AdaSSP also match this bound (Wang, 2018). Similar results are known for population loss (Bassily et al., 2019), and still stronger results are known under additional statistical assumptions on the data (Cai et al., 2020; Varshney et al., 2022). Recent work provides theoretical guarantees with no boundedness assumptions on the features or labels (Milionis et al., 2022) but requires bounds on the data's covariance matrix to use an efficient subroutine for private Gaussian estimation and does not include an empirical evaluation. The main difference between these works and ours is empirical utility without data bounds and hyperparameter tuning.

Another relevant work is that of Liu et al. (2022), which also composes a PTR step adapted from Brown et al. (2021) with a call to a restricted exponential mechanism. The main drawback of this work is that, as with the previous work (Brown et al., 2021), neither the PTR step nor the restricted exponential mechanism step is efficient. This applies to other works that have applied Tukey depth to private estimation as well (Beimel et al., 2019; Kaplan et al., 2020; Liu et al., 2021; Ramsay & Chenouri, 2021). The main difference between these works and ours is that our approach produces an efficient, implemented mechanism.

Finally, concurrent independent work by Cumings-Menon (2022) also studies the usage of Tukey depth, as well as the separate notion of regression depth, to privately select from a collection of non-private regression models. A few differences exist between their work and ours. First, they rely on additive noise scaled to smooth sensitivity to construct a private estimate of a high-depth point. Second, their methods are not computationally efficient beyond small d, and are only evaluated for d ≤ 2. Third, their methods require the end user to specify bounds on the parameter space.

2. PRELIMINARIES

We start with the definition of differential privacy, using the "add-remove" variant.

Definition 2.1 (Dwork et al. (2006)). Databases D, D' from data domain X are neighbors, denoted D ∼ D', if they differ in the presence or absence of a single record. A randomized mechanism M : X → Y is (ε, δ)-differentially private (DP) if for all D ∼ D' ∈ X and any S ⊆ Y,

P_M[M(D) ∈ S] ≤ e^ε · P_M[M(D') ∈ S] + δ.

When δ = 0, M is ε-DP.

One general ε-DP algorithm is the exponential mechanism.

Definition 2.2. Given database D and a utility function u mapping (database, output) pairs to real-valued scores with sensitivity ∆_u, the exponential mechanism selects output y with probability proportional to exp(εu(D, y)/(2∆_u)). We say the utility function u is monotonic if, for D_1 ⊂ D_2, for any y, u(D_1, y) ≤ u(D_2, y). Given monotonic u, the 2 inside the exponent denominator can be dropped.

Lemma 2.3 (McSherry & Talwar (2007)). The exponential mechanism is ε-DP.

Finally, we define Tukey depth.

Definition 2.4 (Tukey (1975)). A halfspace h_v is defined by a vector v ∈ R^d, h_v = {y ∈ R^d | ⟨v, y⟩ ≥ 0}. Let D ⊂ R^d [...]

Note that for a collection of n points, the maximum possible Tukey depth is n/2. We will prove a theoretical utility result for a version of our algorithm that uses exact Tukey depth. However, Tukey depth is NP-hard to compute for arbitrary d (Johnson & Preparata, 1978) [...]

Theorem 3.1. [...] E[β_i] = β* ∈ R^d. Given β ∈ R^d with Tukey depth at least p with respect to S, there exists a constant c > 0 such that when m ≥ c · (d + log(1/γ))/α^2, with probability 1 − γ, ‖β − β*‖_Σ ≤ Φ^{-1}(1 − p/m + α), where Φ denotes the CDF of the standard univariate Gaussian.

In practice, we observe that empirical distributions of models for real data often feature Gaussian-like concentration, fast tail decay, and symmetry. Histograms of the models learned by TukeyEM on experiment datasets appear in the Appendix's Section 7.7. Nonetheless, we emphasize that Theorem 3.1 is a statement of sufficiency, not necessity. TukeyEM does not require any distributional assumption to be private, nor does non-Gaussianity preclude accurate estimation.
The remaining subsections elaborate on the details of our version using approximate Tukey depth, culminating in the full pseudocode in Algorithm 2 and overall result, Theorem 1.1.
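As a small illustration of the approximate notion of depth used in the rest of this section, the following sketch counts, along each canonical axis, how many points lie on either side of a query point and returns the minimum over all 2d counts. The function name `approx_tukey_depth` and the convention that ties count on both sides are assumptions made for this example; they are chosen to agree with the closed intervals used in the proof of Lemma 3.3.

```python
import numpy as np


def approx_tukey_depth(points: np.ndarray, y: np.ndarray) -> int:
    """Approximate Tukey depth of y with respect to the rows of `points`.

    points: array of shape (m, d), one point per row.
    y: array of shape (d,).
    Assumption for this sketch: a point lying exactly on the boundary of an
    axis-aligned halfspace is counted as inside it.
    """
    # For each axis j, count the points with coordinate <= y_j and >= y_j.
    at_or_below = (points <= y).sum(axis=0)
    at_or_above = (points >= y).sum(axis=0)
    # The depth is the smallest of these 2d counts.
    return int(np.minimum(at_or_below, at_or_above).min())


if __name__ == "__main__":
    pts = np.array([(1, 1), (7, 3), (5, 7), (3, 3), (5, 5), (6, 3)], dtype=float)
    for query in [(0.0, 0.0), (2.0, 2.0), (5.0, 4.0)]:
        print(query, approx_tukey_depth(pts, np.array(query)))
```

Each call costs O(dm) as written; sorting each coordinate once, as in Section 3.1, amortizes this across queries.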

3.1. COMPUTING VOLUMES

We start by describing how to compute volumes corresponding to different Tukey depths. As shown in the next subsection, these volumes will be necessary for the PTR subroutine.

Definition 3.2. Given database D, define V_{i,D} = vol({y | y ∈ R^d and T̃_D(y) ≥ i}), the volume of the region of points in R^d with approximate Tukey depth at least i in D. When D is clear from context, we write V_i for brevity.

Since our notion of approximate Tukey depth uses the canonical basis (Definition 2.5), it follows that the regions corresponding to V_1, V_2, ..., V_{m/2} (see footnote 1) form a sequence of nested (hyper)rectangles, as shown in Figure 3. With this observation, computing a given V_i is simple. For each axis, project the non-private models {β_i}_{i=1}^m onto the axis and compute the distance between the two points of exact Tukey depth i (from the "left" and "right") in the one-dimensional sorted array. This yields one side length for the hyperrectangle. Repeating this d times in total and taking the product then yields the total volume of the hyperrectangle, as formalized next. The simple proof appears in the Appendix's Section 7.2.
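The volume computation described above can be sketched as follows. This is a minimal illustration under the assumption that the m non-private models are given as the rows of an array; the name `depth_volumes` is hypothetical. The sketch multiplies side lengths directly, whereas a practical implementation would likely work with logarithms to avoid overflow or underflow in high dimension.

```python
import numpy as np


def depth_volumes(models: np.ndarray) -> np.ndarray:
    """Return V with V[i - 1] = volume of the region of approximate depth >= i.

    models: array of shape (m, d), one non-private model per row.
    The output has one entry for each depth i = 1, ..., m // 2.
    """
    m, _ = models.shape
    # S has shape (d, m); each row holds one coordinate in nondecreasing order.
    S = np.sort(models.T, axis=1)
    volumes = []
    for i in range(1, m // 2 + 1):
        # Side length along axis j is S_{j, m-(i-1)} - S_{j, i} (1-indexed).
        side_lengths = S[:, m - i] - S[:, i - 1]
        volumes.append(np.prod(side_lengths))
    return np.array(volumes)


if __name__ == "__main__":
    pts = np.array([(1, 1), (7, 3), (5, 7), (3, 3), (5, 5), (6, 3)], dtype=float)
    print(depth_volumes(pts))
```

Sorting dominates the cost at O(dm log(m)), and the loop over depths adds O(dm).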

3.2. APPLYING PROPOSE-TEST-RELEASE

The next step of TukeyEM employs PTR to restrict the output region eventually used by the exponential mechanism. We collect this process into a subroutine PTRCheck. The overall strategy applies work done by Brown et al. (2021). Their algorithm privately checks if the given database has a large Hamming distance to any "unsafe" database and then, if this PTR check passes, runs an exponential mechanism restricted to a domain of high Tukey depth. Since a "safe" database is defined as one where the restricted exponential mechanism has a similar output distribution on any neighboring database, the overall algorithm is DP. As part of their utility analysis, they prove a lemma translating a volume condition on regions of different Tukey depths to a lower bound on the Hamming distance to an unsafe database (Lemma 3.8 of Brown et al. (2021)). This enables them to argue that the PTR check typically passes if it receives enough Gaussian data, and the utility guarantee follows.

However, their algorithm requires computing both exact Tukey depths of the samples and the current database's exact Hamming distance to unsafety. The given runtimes for both computations are exponential in the dimension d (see Section C.2 of Brown et al. (2021)). We rely on approximate Tukey depth (Definition 2.5) to resolve both issues. First, as the previous section demonstrated, computing the approximate Tukey depths of a collection of m d-dimensional points only takes time O(dm log(m)). Second, we adapt their lower bound to give a 1-sensitive lower bound on the Hamming distance between the current database and any unsafe database. This yields an efficient replacement for the exact Hamming distance calculation used by Brown et al. (2021).
The overall structure of PTRCheck is therefore as follows: use the volume condition to compute a 1-sensitive lower bound on the given database's distance to unsafety; add noise to the lower bound and compare it to a threshold calibrated so that an unsafe dataset has probability ≤ δ of passing; and if the check passes, run the exponential mechanism to pick a point of high approximate Tukey depth from the domain of points with moderately high approximate Tukey depth.

Algorithm 1 PTRCheck
1: Input: Tukey depth region volumes V, privacy parameters ε and δ
2: Use Lemma 3.6 with t = |V|/2 and δ/(8e^ε) to compute lower bound k for distance to unsafe database
3: if k + Lap(1/ε) ≥ log(1/(2δ))/ε then
4:     Return True
5: else
6:     Return False

Before proceeding to the details of the algorithm, we first define a few necessary terms.

Definition 3.4 (Definition 2.1 of Brown et al. (2021)). Two distributions P, Q over domain 𝒲 are (ε, δ)-indistinguishable, denoted P ≈_{ε,δ} Q, if for any measurable subset W ⊆ 𝒲,

P_{w∼P}[w ∈ W] ≤ e^ε · P_{w∼Q}[w ∈ W] + δ and P_{w∼Q}[w ∈ W] ≤ e^ε · P_{w∼P}[w ∈ W] + δ.

Note that (ε, δ)-DP is equivalent to (ε, δ)-indistinguishability between output distributions on arbitrary neighboring databases. Given database D, let A denote the exponential mechanism with utility function T̃_D (see Definition 2.5). Given nonnegative integer t, let A_t denote the same mechanism that assigns score −∞ to any point with score < t, i.e., only samples from points of score ≥ t. We will say a database is "safe" if A_t is indistinguishable between neighbors.

Definition 3.5 (Definition 3.1 of Brown et al. (2021)). Database D is (ε, δ, t)-safe if for all neighboring D' ∼ D, we have A_t(D) ≈_{ε,δ} A_t(D'). Let Safe_{(ε,δ,t)} be the set of safe databases, and let Unsafe_{(ε,δ,t)} be its complement.

We now state the main result of this section, Lemma 3.6. Briefly, it modifies Lemma 3.8 from Brown et al. (2021) to construct a 1-sensitive lower bound on distance to unsafety.

Lemma 3.6. Define M(D) to be a mechanism that receives as input database D and computes the largest k ∈ {0, ..., t − 1} such that there exists g > 0 where, for volumes V defined using a monotonic utility function,

(V_{t−k−1,D} / V_{t+k+g+1,D}) · e^{−εg/2} ≤ δ,

or outputs −1 if the inequality does not hold for any such k. Then for arbitrary D,
1. M is 1-sensitive, and
2. for all z ∈ Unsafe_{(ε,4e^ε δ,t)}, d_H(D, z) > M(D).

The proof of Lemma 3.6 appears in the Appendix's Section 7.2. Our implementation of the algorithm described by Lemma 3.6 randomly perturbs the models with a small amount of noise to avoid having regions with 0 volume. We note that this does not affect the overall privacy guarantee. PTRCheck therefore runs the mechanism defined by Lemma 3.6, adds Laplace noise to the result, and proceeds to the restricted exponential mechanism if the noisy statistic crosses a threshold. Pseudocode appears in Algorithm 1, and we now state its guarantee as proved in the Appendix's Section 7.2.

Lemma 3.7. Given the depth volumes V computed in Lines 9 to 10 of Algorithm 2, PTRCheck(V, ε, δ) is ε-DP and takes time O(m log(m)).
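To make the structure of PTRCheck concrete, the sketch below computes the lower bound of Lemma 3.6 by brute force and then applies the noisy threshold test from Algorithm 1. Several choices here are assumptions of this example rather than details confirmed by the text: the function names, the convention that `V[i - 1]` stores V_i, treating V_0 as infinite and any zero or out-of-range volume as failing the inequality, and the O(m^2) double loop (the binary search described in Section 7.2 is faster).

```python
import math

import numpy as np


def log_volume(V, i):
    """log(V_i) with V[i - 1] = V_i; assumed conventions for the edge cases."""
    if i < 1:
        return math.inf            # depth >= 0 covers all of R^d
    if i > len(V) or V[i - 1] <= 0:
        return -math.inf           # no volume at this depth
    return math.log(V[i - 1])


def distance_lower_bound(V, t, eps, delta):
    """Largest k in {0, ..., t-1} such that some g > 0 satisfies
    V_{t-k-1} / V_{t+k+g+1} * exp(-eps * g / 2) <= delta, or -1 if none."""
    best = -1
    for k in range(t):
        log_num = log_volume(V, t - k - 1)
        for g in range(1, len(V) - t - k):
            log_den = log_volume(V, t + k + g + 1)
            if math.isinf(log_num) or math.isinf(log_den):
                continue
            if log_num - log_den - eps * g / 2 <= math.log(delta):
                best = k
                break
    return best


def ptr_check(V, eps, delta, rng):
    """Noisy threshold test following the three steps of Algorithm 1."""
    t = len(V) // 2
    k = distance_lower_bound(V, t, eps, delta / (8 * math.exp(eps)))
    noisy_k = k + rng.laplace(scale=1.0 / eps)
    return bool(noisy_k >= math.log(1.0 / (2.0 * delta)) / eps)


if __name__ == "__main__":
    volumes = np.exp(-0.01 * np.arange(1, 101))  # synthetic, slowly shrinking
    print(distance_lower_bound(volumes, 50, 1.0, 1e-6))
    print(ptr_check(volumes, 1.0, 1e-6, np.random.default_rng(0)))
```

The synthetic volumes in the usage example are illustrative only; real volumes would come from the depth regions of the trained models.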

3.3. SAMPLING

If PTRCheck passes, TukeyEM then calls the exponential mechanism restricted to points of approximate Tukey depth at least t = m/4, a subroutine denoted RestrictedTukeyEM (Line 12 in Algorithm 2). Note that the passage of PTR ensures that with probability at least 1 − δ, running RestrictedTukeyEM is (ε, δ)-DP. We use a common two-step process for sampling from an exponential mechanism over a continuous space: 1) sample a depth using the exponential mechanism, then 2) return a uniform sample from the region corresponding to the sampled depth.

3.3.1. SAMPLING A DEPTH

We first define a slight modification W of the volumes V introduced earlier.

Definition 3.8. Given database D, define W_{i,D} = vol({y | y ∈ R^d and T̃_D(y) = i}), the volume of the region of points in R^d with approximate Tukey depth exactly i in D.

To execute the first step of sampling, note that for i ∈ {m/4, m/4 + 1, ..., m/2}, W_{i,D} = V_{i,D} − V_{i+1,D}, so we can compute {W_{i,D}}_{i=m/4}^{m/2} from the V computed earlier in time O(m). The restricted exponential mechanism then selects approximate Tukey depth i ∈ {m/4, m/4 + 1, ..., m/2} with probability

P[i] ∝ W_{i,D} · exp(ε · i).

Note that this expression drops the 2 in the standard exponential mechanism because approximate Tukey depth is monotonic; see Appendix Section 7.3 for details. For numerical stability, racing sampling (Medina & Gillenwater, 2020) can sample from this distribution using logarithmic quantities.
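The depth-sampling step can be sketched in log space as follows. Note that this example swaps in the Gumbel-max trick rather than racing sampling; both avoid exponentiating large scores, but they are different techniques. The name `sample_depth`, the convention that `V[i - 1]` stores V_i, and the assumption that the volume beyond the largest stored depth is zero are all choices made for this illustration.

```python
import numpy as np


def sample_depth(V, t, eps, rng):
    """Sample depth i in {t, ..., len(V)} with P[i] proportional to
    W_i * exp(eps * i), where W_i = V_i - V_{i+1} and V[i - 1] = V_i."""
    V = np.asarray(V, dtype=float)
    # Assumption: no volume lies beyond the largest stored depth.
    V_next = np.append(V[1:], 0.0)
    W = V - V_next
    depths = np.arange(1, len(V) + 1)
    keep = depths >= t
    # Unnormalized log-probabilities; zero-volume depths get -inf.
    with np.errstate(divide="ignore"):
        log_weights = np.log(W[keep]) + eps * depths[keep]
    # Gumbel-max: the argmax of (log-weight + Gumbel noise) is a draw from
    # the normalized distribution, with no explicit exponentiation.
    noise = rng.gumbel(size=log_weights.shape)
    return int(depths[keep][np.argmax(log_weights + noise)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(sample_depth([36.0, 6.0, 0.0], t=1, eps=0.5, rng=rng))
```

In TukeyEM the restriction would be t = m/4; the usage example uses t = 1 only to keep the toy input small.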

3.3.2. UNIFORMLY SAMPLING FROM A REGION

Having sampled a depth î, it remains to return a uniform random point of approximate Tukey depth î. By construction, W î,D is the volume of the set of points y = (y 1 , ..., y d ) such that the depth along every dimension j is at least î, and the depth along at least one dimension j is exactly î. The result is straightforward when d = 1: draw a uniform sample from the union of the two intervals of points of depth exactly î (depth from the "left" and "right"). For d > 1, the basic idea of the sampling process is to partition the overall volume into disjoint subsets, compute each subset volume, sample a subset according to its proportion in the whole volume, and then sample uniformly from that subset. Our partition will split the overall region of depth exactly i according to the first dimension with dimension-specific depth exactly i. Since any point in the overall region has at least one such dimension, this produces a valid partition, and we will see that computing the volumes of these partitions is straightforward using the S computed earlier. Finally, the last sampling step will be easy because the final subset will simply be a pair of (hyper)rectangles. Since space is constrained and the details are relatively straightforward from the sketch above, full specification and proofs for this process SamplePointWithDepth(S, i) appear in Section 7.4. For immediate purposes, it suffices to record the following guarantee: Lemma 3.9. SamplePointWithDepth(S, i) returns a uniform random sample from the region of points with approximate Tukey depth i in S in time O(d).
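The partition-based sampler sketched in the paragraph above might look like the following. This is a hedged illustration, not the specification from Section 7.4: the function name, the use of direct cumulative products instead of logarithms, and the restriction to depths 1 ≤ i < m/2 (so that the "left" and "right" intervals of depth exactly i are disjoint) are assumptions of this example. Cell j collects points whose first coordinate of depth exactly i is coordinate j.

```python
import numpy as np


def sample_point_with_depth(S, i, rng):
    """Uniform sample from the region of approximate depth exactly i.

    S: array of shape (d, m), each row sorted in nondecreasing order.
    i: depth with 1 <= i < m / 2 (assumption of this sketch).
    """
    d, m = S.shape
    lo_i, hi_i = S[:, i - 1], S[:, m - i]          # bounds for depth >= i
    lo_next, hi_next = S[:, i], S[:, m - i - 1]    # bounds for depth >= i + 1
    width_ge_i = hi_i - lo_i
    width_ge_next = hi_next - lo_next
    left_len, right_len = lo_next - lo_i, hi_i - hi_next
    width_eq_i = left_len + right_len              # depth exactly i on one axis
    # vol(C_j) = prod_{j' < j} width_ge_next * width_eq_i[j] * prod_{j' > j} width_ge_i
    before = np.concatenate(([1.0], np.cumprod(width_ge_next)[:-1]))
    after = np.concatenate((np.cumprod(width_ge_i[::-1])[::-1][1:], [1.0]))
    cell_vol = before * width_eq_i * after
    j_star = int(rng.choice(d, p=cell_vol / cell_vol.sum()))
    y = np.empty(d)
    for j in range(d):
        if j < j_star:      # earlier axes must have depth >= i + 1
            y[j] = rng.uniform(lo_next[j], hi_next[j])
        elif j > j_star:    # later axes only need depth >= i
            y[j] = rng.uniform(lo_i[j], hi_i[j])
    # Axis j_star has depth exactly i: pick the left or right interval
    # in proportion to its length, then sample uniformly inside it.
    if rng.uniform() < left_len[j_star] / width_eq_i[j_star]:
        y[j_star] = rng.uniform(lo_i[j_star], lo_next[j_star])
    else:
        y[j_star] = rng.uniform(hi_next[j_star], hi_i[j_star])
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = np.sort(rng.normal(size=(3, 20)), axis=1)
    print(sample_point_with_depth(S, 4, rng))
```

All steps are O(d) once S is available, matching the cost stated in Lemma 3.9.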

3.4. OVERALL ALGORITHM

We now have all of the necessary material for the main result, Theorem 1.1, restated below. The proof essentially collects the results so far into a concise summary.

Theorem 3.10. TukeyEM, given in Algorithm 2, is (ε, δ)-DP and takes time O(d^2 n + dm log(m)).

Proof. Line 11 of the TukeyEM pseudocode in Algorithm 2 calls the check with privacy parameters ε/2 and δ/[8e^ε]. By the sensitivity guarantee of Lemma 3.6, the check itself is ε/2-DP. By the safety guarantee of Lemma 3.6 and our choice of threshold, if it passes, with probability at least 1 − δ/2, the given database lies in Safe_{(ε/2,δ/2,t)}. A passing check therefore ensures that the sampling step in Line 12 is (ε/2, δ)-DP. By composition, the overall privacy guarantee is (ε, δ)-DP.

Algorithm 2 TukeyEM
1: Input: Features matrix X ∈ R^{n×d}, label vector y ∈ R^n, number of models m, privacy parameters ε and δ
2: Evenly and randomly partition X and y into subsets {(X_i, y_i)}_{i=1}^m
3: for i = 1, ..., m do
4:     Compute OLS estimator β_i ← (X_i^T X_i)^{-1} X_i^T y_i
5: for dimension j ∈ [d] do
6:     {β_{i,j}}_{i=1}^m ← projection of {β_i}_{i=1}^m onto dimension j
7:     (S_{j,1}, ..., S_{j,m}) ← {β_{i,j}}_{i=1}^m sorted in nondecreasing order
8: Collect projected estimators into S ∈ R^{d×m}, where each row is nondecreasing
9: for i ∈ [m/2] do
10:     Compute volume of region of depth ≥ i, V_i ← ∏_{j=1}^d (S_{j,m−(i−1)} − S_{j,i})
11: if PTRCheck(V, ε/2, δ) then
12:     β ← RestrictedTukeyEM(V, S, m/4, ε/2)
13:     Return β
14: else
15:     Return ⊥

2. AdaSSP (Wang, 2018) computes a DP OLS estimator based on noisy versions of X^T X and X^T y. This requires the end user to supply bounds on both ‖X‖_2 and ‖y‖_2. Our implementation uses these values non-privately for each dataset. The implementation is therefore not private and represents an artificially strong version of AdaSSP. As specified by Wang (2018), AdaSSP (privately) selects a ridge parameter and runs ridge regression.

3. DPSGD (Abadi et al., 2016) uses DP-SGD, as implemented in TensorFlow Privacy and Keras (Chollet et al., 2015), to optimize mean squared error using a single linear layer. The layer's weights are regression coefficients. A discussion of hyperparameter selection appears in Section 4.4. As we will see, an appropriate choice of these hyperparameters is both dataset-specific and crucial to DPSGD's performance. Since we allow DPSGD to tune these non-privately for each dataset, our implementation of DPSGD is also artificially strong.

All experiment code can be found on GitHub (Google, 2022).

4.2. DATASETS

We evaluate all four algorithms on the following datasets. The first dataset is synthetic, and the rest are real. The datasets are intentionally selected to be relatively easy use cases for linear regression, as reflected by the consistently high R^2 for NonDP (see footnote 2). However, we emphasize that, beyond the constraints on d and n suggested by Section 4.3, they have not been selected to favor TukeyEM: all feature selection occurred before running any of the algorithms, and we include all datasets evaluated where NonDP achieved a positive R^2. A complete description of the datasets appears both in the public code and the Appendix's Section 7.5. For each dataset, we additionally add an intercept feature.

1. Synthetic (d = 11, n = 22,000, Pedregosa et al. (2011)). This dataset uses sklearn's make_regression and N(0, σ^2) label noise with σ = 10.

2. California (d = 9, n = 20,433, Nugent (2017)) predicting house price.

4.3. CHOOSING THE NUMBER OF MODELS

Before turning to the results of this comparison, recall from Section 3 that TukeyEM privately aggregates m non-private OLS models. If m is too low, PTRCheck will probably fail; if m is too high, and each model is trained on only a small number of points, even a non-private aggregation of inaccurate models will be an inaccurate model as well. Experiments on synthetic data support this intuition. In the left plot in Figure 2, each solid line represents synthetic data with a different number of features, generated by the same process as the Synthetic dataset described in the previous section. We vary the number of models m on the x-axis and plot the distance computed by Lemma 3.6. As d grows, the number of models required to pass the PTRCheck threshold, demarcated by the dashed horizontal line, grows as well. To select the value of m used for TukeyEM, we ran it on each dataset using m = 250, 500, ..., 2000 and selected the smallest m where all PTR checks passed. We give additional details in Section 7.6 but note here that the resulting choices closely track those given by Figure 2. Furthermore, across many datasets, simply selecting m = 1000 typically produces nearly optimal R^2, with several datasets exhibiting little dependence on the exact choice of m.

4.4. ACCURACY COMPARISON

Our main experiments compare the four methods at (ln(3), 10^{-5})-DP. A concise summary of the experiment results appears in Figure 1. For every method other than NonDP (which is deterministic), we report the median R^2 values across the trials. For each dataset, the methods with interquartile ranges overlapping that of the method with the highest median R^2 are bolded. Extended plots recording R^2 for various m appear in Section 7.6. All datasets use 10 trials, except for California and Diamonds, which use 50.

[Figure 1: table of R^2 by dataset and method (columns include Dataset and NonDP).]

We briefly elaborate on the latter. Our experiments tune DPSGD over a large grid consisting of 2,184 joint hyperparameter settings, over learning rate ∈ {10^{-6}, 10^{-5}, ..., 1}, clip norm ∈ {10^{-6}, 10^{-5}, ..., 10^6}, microbatches ∈ {2^5, 2^6, ..., 2^10}, and epochs ∈ {1, 5, 10, 20}. Ignoring the extensive computational resources required to do so at this scale (100 trials of each of the 2,184 hyperparameter combinations, for each dataset), we highlight that even mildly suboptimal hyperparameters are sufficient to significantly decrease DPSGD's utility. Figure 1 quantifies this by recording the R^2 obtained by the hyperparameters that achieved the highest and 90th percentile median R^2 during tuning. While the optimal hyperparameters consistently produce results competitive with or sometimes exceeding those of TukeyEM, even the mildly suboptimal hyperparameters nearly always produce results significantly worse than those of TukeyEM. The exact hyperparameters used appear in Section 7.6.

We conclude our discussion of DPSGD by noting that it has so far omitted any attempt at differentially private hyperparameter tuning. We suggest that the results here indicate that any such method will need to select hyperparameters with high accuracy while using little privacy budget, and emphasize that the presentation of DPSGD in our experiments is generous.

Overall, TukeyEM's performance on the eight datasets is strong. We propose that the empirical evidence is enough to justify TukeyEM as a first-cut method for linear regression problems whose data dimensions satisfy its heuristic requirements (n ≳ 1000 · d).
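As a quick check on the size of the DPSGD tuning grid described above, the following sketch enumerates the four ranges and counts their combinations. It assumes each elided range steps by one power of ten (or of two), which is the reading that reproduces the stated total; the variable names are illustrative only.

```python
from itertools import product

# Assumed reading of the elided ranges: consecutive integer exponents.
learning_rates = [10.0 ** e for e in range(-6, 1)]   # 10^-6, ..., 1
clip_norms = [10.0 ** e for e in range(-6, 7)]       # 10^-6, ..., 10^6
microbatches = [2 ** e for e in range(5, 11)]        # 2^5, ..., 2^10
epochs = [1, 5, 10, 20]

grid = list(product(learning_rates, clip_norms, microbatches, epochs))

if __name__ == "__main__":
    # 7 * 13 * 6 * 4 joint settings
    print(len(grid))
```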

4.5. TIME COMPARISON

We conclude with a brief discussion of runtime. The rightmost plot in Figure 2 records the average runtime in seconds over 10 trials of each method. TukeyEM is slower than the covariance matrix-based methods NonDP and AdaSSP, but it still runs in under one second, and it is substantially faster than DPSGD. TukeyEM's runtime also, as expected, depends linearly on the number of models m. Since the plots are essentially identical across datasets, we only include results for the Synthetic dataset here. Finally, we note that, for most reasonable settings of m, TukeyEM has runtime asymptotically identical to that of NonDP (Theorem 1.1). The gap in practical performance is likely a consequence of the relatively unoptimized nature of our implementation.

5. FUTURE DIRECTIONS

An immediate natural extension of TukeyEM would generalize the approach to similar problems such as logistic regression. More broadly, while this work focused on linear regression for the sake of simplicity and wide applicability, the basic idea of TukeyEM can in principle be applied to select from arbitrary non-private models that admit expression as vectors in R^d. Part of our analysis observed that TukeyEM may benefit from algorithms and data that lead to Gaussian-like distributions over models; describing the characteristics of algorithms and data that induce this property (or a similar property that better characterizes the performance of TukeyEM) is an open question.

In both figures, the set of points is {(1, 1), (7, 3), (5, 7), (3, 3), (5, 5), (6, 3)}, the region of depth 0 is white, the region of depth 1 is light gray, and the region of depth 2 is dark gray. Note that for exact Tukey depth, the regions of different depths form a sequence of nested convex polygons; for approximate Tukey depth, they form a sequence of nested rectangles.

7.2. OMITTED PROOFS

We start with the proof of our utility result for exact Tukey depth, Theorem 3.1.

Proof of Theorem 3.1. This is a direct application of the results of Brown et al. (2021). They analyzed a notion of probabilistic (and normalized) Tukey depth over samples from a distribution:

T_{N(µ,Σ)}(y) := min_v P_{X∼N(µ,Σ)}[⟨X, v⟩ ≥ ⟨y, v⟩].

Their Proposition 3.3 shows that T_{N(µ,Σ)}(y) can be characterized in terms of Φ, the CDF of the standard one-dimensional Gaussian distribution. Specifically, they show T_{N(µ,Σ)}(y) = Φ(−‖y − µ‖_Σ). From their Lemma 3.5, if m ≥ c · (d + log(1/γ))/α^2, then with probability 1 − γ, |p/m − T_{N(β*,Σ)}(β)| ≤ α. Thus

−α ≤ T_{N(β*,Σ)}(β) − p/m
p/m − α ≤ Φ(−‖β − β*‖_Σ)
p/m − α ≤ 1 − Φ(‖β − β*‖_Σ)
Φ(‖β − β*‖_Σ) ≤ 1 − p/m + α
‖β − β*‖_Σ ≤ Φ^{-1}(1 − p/m + α)

where the third inequality used the symmetry of Φ.

Next, we prove our result about computing the volumes associated with different approximate Tukey depths, Lemma 3.3.

Proof of Lemma 3.3. By the definition of approximate Tukey depth, for arbitrary y = (y_1, ..., y_d) of Tukey depth at least i, each of the 2d halfspaces h_{y_1·e_1}, h_{y_1·(−e_1)}, ..., h_{y_d·e_d}, h_{y_d·(−e_d)} contains at least i points from D, where x · y denotes multiplication of a scalar and vector. Fix some dimension j ∈ [d]. Since min(|h_{y_j·e_j} ∩ D|, |h_{y_j·(−e_j)} ∩ D|) ≥ i, y_j ∈ [S_{j,i}, S_{j,m−(i−1)}]. Thus V_{i,D} = ∏_{j=1}^d (S_{j,m−(i−1)} − S_{j,i}). The computation of S starting in Line 5 sorts d arrays of length m and so takes time O(dm log(m)). Line 9 iterates over m/2 depths and computes d quantities, each in constant time, so its total time is O(dm).

The next result is our 1-sensitive and efficient adaptation of the lower bound from Brown et al. (2021). We first restate that result. While their paper uses swap DP, the same result holds for add-remove DP.

Lemma 7.1 (Lemma 3.8 of Brown et al. (2021)). For any k ≥ 0, if there exists a g > 0 such that

(V_{t−k−1,D} / V_{t+k+g+1,D}) · e^{−εg/2} ≤ δ,

then for every database z in Unsafe_{(ε,4e^ε δ,t)}, d_H(D, z) > k, where d_H denotes Hamming distance.

We now prove its adaptation, Lemma 3.6.

Proof of Lemma 3.6. We first prove item 1. Let D and D' be neighboring databases, D' = D ∪ {x}, and let k*_D and k*_{D'} denote the mechanism's outputs on the respective databases. It suffices to show |k*_D − k*_{D'}| ≤ 1. Consider some V_p for nonnegative integer p. If x has depth less than p in D, then V_{p−1,D} ≥ V_{p,D'} ≥ V_{p,D} > V_{p+1,D'}. Otherwise, V_{p+1,D'} < V_{p,D} = V_{p,D'} < V_{p−1,D}. In either case,

V_{p+1,D'} < V_{p,D} ≤ V_{p,D'} ≤ V_{p−1,D}.    (1)

Now suppose there exist k*_D ≥ 0 and g*_D > 0 such that (V_{t−k*_D−1,D} / V_{t+k*_D+g*_D+1,D}) · e^{−εg*_D/2} ≤ δ. Then by Equation 1, (V_{t−k*_D,D'} / V_{t+k*_D+g*_D,D'}) · e^{−εg*_D/2} ≤ δ, so k*_{D'} ≥ k*_D − 1. Similarly, if there exist k*_{D'} ≥ 0 and g*_{D'} > 0 such that (V_{t−k*_{D'}−1,D'} / V_{t+k*_{D'}+g*_{D'}+1,D'}) · e^{−εg*_{D'}/2} ≤ δ, then by Equation 1, (V_{t−k*_{D'},D} / V_{t+k*_{D'}+g*_{D'},D}) · e^{−εg*_{D'}/2} ≤ δ, so k*_D ≥ k*_{D'} − 1. Thus if k*_D ≥ 0 or k*_{D'} ≥ 0, |k*_D − k*_{D'}| ≤ 1. The result then follows since k* ≥ −1.

We now prove item 2. This holds for k ≥ 0 by Lemma 7.1; for k = −1, the lower bound on distance to unsafety is the trivial one, d_H(D, z) ≥ 0.

The next proof is for Lemma 3.7, which verifies the overall privacy and runtime of PTRCheck.

Proof of Lemma 3.7. The privacy guarantee follows from the 1-sensitivity of computing k (Lemma 3.6). For the runtime guarantee, we perform a binary search over the distance lower bound k starting from the maximum possible m/4 and, for each k, an exhaustive search over g. Note that if some k, g pair satisfies the inequality in Lemma 7.1, there exists some g for every k' < k that satisfies it as well. Thus since both have range ≤ m/4, the total time is O(m log(m)).

7.3. USING MONOTONICITY

This section discusses our use of monotonicity in the restricted exponential mechanism. Definition 2.2 states that, if $u$ is monotonic, the exponential mechanism can sample an output $y$ with probability proportional to $\exp\left(\frac{\varepsilon u(D,y)}{\Delta_u}\right)$ and satisfy $\varepsilon$-DP. Approximate Tukey depth is monotonic, so our application can also sample from this distribution.

It remains to incorporate monotonicity into the PTR step. It suffices to show that Lemma 7.1 also holds for a restricted exponential mechanism using a monotonic score function. Turning to the proof of Lemma 7.1 given by Brown et al. (2021), it suffices to prove their Lemma 3.7 using $w_x(S) = \int_S \exp(\varepsilon q(x; y))\,dy$. Note that their $w_x(S)$ differs by the 2 in its denominator inside the exponent term; this modification is where we incorporate monotonicity.

This difference shows up in two places in their argument. First, we can replace their bound

$$\frac{P[M_{\varepsilon,t}(x) = y]}{P[M_{\varepsilon,t}(x') = y]} \leq e^{\varepsilon/2} \cdot \frac{w_{x'}(Y_{t,x'})}{w_x(Y_{t,x})} \leq e^{\varepsilon} \cdot \frac{w_x(Y_{t,x'})}{w_x(Y_{t,x})}$$

with the two cases that arise in add-remove differential privacy. The first considers $x \supsetneq x'$ and yields

$$\frac{P[M_{\varepsilon,t}(x) = y]}{P[M_{\varepsilon,t}(x') = y]} \leq e^{\varepsilon} \cdot \frac{w_{x'}(Y_{t,x'})}{w_x(Y_{t,x})} \leq e^{\varepsilon} \cdot \frac{w_x(Y_{t,x'})}{w_x(Y_{t,x})}$$

since the mechanism on $x$ never assigns a lower score to an output than on $x'$. Using the same logic, the second considers $x \subsetneq x'$, and we get

$$\frac{P[M_{\varepsilon,t}(x) = y]}{P[M_{\varepsilon,t}(x') = y]} \leq \frac{w_{x'}(Y_{t,x'})}{w_x(Y_{t,x})} \leq e^{\varepsilon} \cdot \frac{w_x(Y_{t,x'})}{w_x(Y_{t,x})}.$$

Algorithm 3 RestrictedTukeyEM
1: Input: Tukey depth region volumes $V$, sorted collection of estimators $S$, depth restriction $t$, privacy parameter $\varepsilon$
2: for $i = t, t+1, \ldots, |V| - 1$ do
3:     Compute volume of region of Tukey depth exactly $i$, $W_i \leftarrow V_i - V_{i+1}$
4: Sample depth $\hat{i}$ from distribution where $P[i] \propto W_i \exp(\varepsilon \cdot i)$

SamplePointWithDepth, steps 5 to 13:
5:     Compute $V_{<j,i+1}$, $W_{j,i}$, and $V_{>j,i}$ using Lemma 7.4
6:     Compute $\mathrm{vol}(C_{j,i}) = V_{<j,i+1} \cdot W_{j,i} \cdot V_{>j,i}$
7: Sample index $j^* \in [d]$ with probability $\frac{\mathrm{vol}(C_{j,i})}{V_{\geq 1,i}}$
8: for $j' = 1, \ldots, j^* - 1$ do
9:     $y_{j'} \leftarrow$ uniform random sample from $[S_{j',i+1}, S_{j',m-i}]$
10: $y_{j^*} \leftarrow$ uniform random sample from $[S_{j^*,i}, S_{j^*,i+1}) \cup (S_{j^*,m-i}, S_{j^*,m-(i-1)}]$
11: for $j' = j^* + 1, \ldots, d$ do
12:     $y_{j'} \leftarrow$ uniform random sample from $[S_{j',i}, S_{j',m-(i-1)}]$
13: Return $y$

Each $C_{j,i}$ is the Cartesian product of three lower-dimensional regions, and thus its volume is the product of the corresponding volumes, $\mathrm{vol}(C_{j,i}) = V_{<j,i+1} \cdot W_{j,i} \cdot V_{>j,i}$. These quantities, along with the normalizing constant $V_{\geq 1,i}$, can be computed using Lemma 7.4. Since $i$ is fixed, computing the full set of $V_{j,i}$ and $W_{j,i}$ takes time $O(d)$, and by tracking partial sums and using logarithms, we can compute the full set of $V_{<j,i}$ and $V_{>j,i}$ in time $O(d)$ as well. The last step is sampling the final point $y$, which takes time $O(d)$ using the previously computed $S$.
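The two sampling steps can be sketched in Python as follows. This is a hedged illustration under several assumptions, not the authors' code: the function names, the layout of `V` (with `V[i]` the volume of depth at least `i` and `V[0]` unused), and the side-length formula $V_{j,i} = S_{j,m-(i-1)} - S_{j,i}$ are inferred here from the sampling intervals in the listing, since the statement of Lemma 7.4 is not reproduced in this text. The sketch uses plain products instead of logarithms, normalizes the cell probabilities by the sum of the cell volumes, and assumes $1 \leq i < m/2$.

```python
import math
import random


def sample_depth(V, t, eps, rng=random):
    """Sample a depth i >= t with P[i] proportional to (V[i] - V[i+1]) * exp(eps * i).

    Assumed layout (hypothetical): V[i] is the volume of depth >= i, V[0] unused.
    """
    depths = list(range(t, len(V) - 1))
    # Subtracting t inside the exponent rescales all weights by the same factor.
    weights = [(V[i] - V[i + 1]) * math.exp(eps * (i - t)) for i in depths]
    return rng.choices(depths, weights=weights, k=1)[0]


def sample_point_with_depth(S, i, rng=random):
    """Sample a point of approximate Tukey depth exactly i.

    S is a list of d rows, each holding m sorted coordinates. The paper's
    1-indexed S_{j,c} corresponds to S[j-1][c-1] here.
    """
    d, m = len(S), len(S[0])
    assert 1 <= i < m / 2
    outer = [(row[i - 1], row[m - i]) for row in S]      # depth >= i in dimension j
    inner = [(row[i], row[m - i - 1]) for row in S]      # depth >= i + 1 in dimension j
    v_outer = [hi - lo for lo, hi in outer]
    v_inner = [hi - lo for lo, hi in inner]
    w = [vo - vi for vo, vi in zip(v_outer, v_inner)]    # length with depth exactly i

    # prefix[j]: product of inner lengths before j; suffix[j]: outer lengths after j.
    prefix = [1.0] * d
    for j in range(1, d):
        prefix[j] = prefix[j - 1] * v_inner[j - 1]
    suffix = [1.0] * d
    for j in range(d - 2, -1, -1):
        suffix[j] = suffix[j + 1] * v_outer[j + 1]
    vol_c = [prefix[j] * w[j] * suffix[j] for j in range(d)]

    # Pick the first dimension in which depth exactly i occurs.
    j_star = rng.choices(range(d), weights=vol_c, k=1)[0]
    y = []
    for j in range(d):
        if j < j_star:
            y.append(rng.uniform(*inner[j]))
        elif j > j_star:
            y.append(rng.uniform(*outer[j]))
        else:
            left = inner[j][0] - outer[j][0]
            right = outer[j][1] - inner[j][1]
            if rng.random() * (left + right) < left:
                y.append(rng.uniform(outer[j][0], inner[j][0]))
            else:
                y.append(rng.uniform(inner[j][1], outer[j][1]))
    return y
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient for checking that every sampled point lies in the depth-$i$ shell.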

7.5. DATASET FEATURE SELECTION DETAILS

This section provides details for each of the real datasets evaluated in our experiments.

1. California Housing Nugent (2017): The label is median_house_value, and the categorical ocean_proximity is dropped.

For Beijing Housing, the features are days on market (DOM), followers, area of house in meters (square), number of kitchens (kitchen), buildingType, renovationCondition, building material (buildingStructure), ladders per residence (ladderRatio), elevator presence (elevator), whether previous owner owned for at least five years (fiveYearsProperty), proximity to subway (subway), district, and nearby average housing price (communityAverage). Categorical buildingType, renovationCondition, and buildingStructure are encoded as one-hot variables. We additionally removed a single outlier row (60422) whose norm is more than two orders of magnitude larger than that of other points; none of the DP algorithms achieved positive $R^2$ with the row included.



We assume $m$ is even for simplicity. The algorithm and its guarantees are essentially the same when $m$ is odd, and our implementation handles both cases.

4. EXPERIMENTS

4.1. BASELINES

1. NonDP computes the standard non-private OLS estimator $\beta^* = (X^T X)^{-1} X^T y$.

$R^2$ measures the variation in labels accounted for by the features. $R^2 = 1$ is perfect, $R^2 = 0$ is the trivial baseline achieved by simply predicting the average label, and $R^2 < 0$ is worse than the trivial baseline.
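The three $R^2$ regimes can be checked with a few lines of Python. This is a minimal sketch using the standard definition $R^2 = 1 - \mathrm{SS}_{\mathrm{res}}/\mathrm{SS}_{\mathrm{tot}}$; the function name `r_squared` and the toy labels are illustrative assumptions, not part of the experiments.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(labels, labels)                   # exact predictions
trivial = r_squared(labels, [2.5] * 4)                # always predict the average label
worse = r_squared(labels, [4.0, 3.0, 2.0, 1.0])       # reversed predictions
```

Here `perfect` is 1, `trivial` is 0, and `worse` is negative, matching the three cases described above.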



Definition 2.2 (McSherry & Talwar (2007)). Given database $D$ and utility function $u : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ mapping (database, output) pairs to scores with sensitivity $\Delta_u = \max_{D \sim D', y \in \mathcal{Y}} |u(D, y) - u(D', y)|$, the exponential mechanism selects item $y \in \mathcal{Y}$ with probability proportional to exp ( u(D

be a collection of $n$ points. The Tukey depth $T_D(y)$ of a point $y \in \mathbb{R}^d$ with respect to $D$ is the minimum number of points in $D$ in any halfspace containing $y$,

$$T_D(y) = \min_{h_v \mid y \in h_v} \sum_{x \in D} \mathbb{1}_{x \in h_v}.$$

Lemma 3.3. Lines 5 to 10 of Algorithm 2 compute $\{V_i\}_{i=1}^{m/2}$ in time $O(dm \log(m))$.
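One way to see where the $O(dm \log(m))$ bound comes from is the following Python sketch. It is not a transcription of Algorithm 2: the function name `depth_region_volumes` is hypothetical, and the sketch assumes that the region of approximate Tukey depth at least $i$ is the rectangle whose side in dimension $j$ runs from the $i$-th smallest to the $i$-th largest coordinate, which matches the intervals $[S_{j,i}, S_{j,m-(i-1)}]$ used by SamplePointWithDepth. Sorting each of the $d$ coordinate lists dominates the cost.

```python
import math


def depth_region_volumes(models):
    """Map each depth i in 1..m//2 to the volume V_i of the region of
    approximate Tukey depth >= i, for a list of m points in R^d."""
    m, d = len(models), len(models[0])
    # Sort each coordinate separately: O(d * m * log(m)).
    S = [sorted(p[j] for p in models) for j in range(d)]
    volumes = {}
    for i in range(1, m // 2 + 1):
        # Side length in dimension j: i-th largest minus i-th smallest value.
        volumes[i] = math.prod(S[j][m - i] - S[j][i - 1] for j in range(d))
    return volumes


# The six points used in Figure 3.
points = [(1, 1), (7, 3), (5, 7), (3, 3), (5, 5), (6, 3)]
volumes = depth_region_volumes(points)
```

For the Figure 3 points the nested rectangles have volumes 36, 6, and 0 at depths 1, 2, and 3.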

Turning to runtime, the $m$ OLS computations inside Line 3 each multiply $d \times \frac{n}{m}$ and $\frac{n}{m} \times d$ matrices, for $O(d^2 n)$ time overall. From Lemma 3.3, Lines 5 to 10 take time $O(dm \log(m))$. Lemma 3.7 gives the $O(m \log(m))$ time for Line 11, and Lemma 3.9 gives the $O(d)$ time for Line 12.

3. Diamonds (d = 10, n = 53,940, Agarwal (2017)), predicting diamond price.
4. Traffic (d = 3, n = 7,909, NYSDOT (2013)), predicting number of passenger vehicles.
5. NBA (d = 6, n = 21,613, Lauga (2022)), predicting home team score.
6. Beijing (d = 25, n = 159,375, ruiqurm (2018)), predicting house price.
7. Garbage (d = 8, n = 18,810, DSNY (2022)), predicting tons of garbage collected.
8. MLB (d = 11, n = 140,657, Samaniego (2018)), predicting home team score.

Figure 2: Left: plot of Hamming distance to unsafety using Lemma 3.6 as the feature dimension $d$ and number of models $m$ vary, using $(\ln(3), 10^{-5})$-DP and $n = (d + 1)m$ throughout. Right: plot of average time in seconds as the number of models $m$ used by TukeyEM varies

Figure 3: An illustrated comparison between exact (left) and approximate (right) Tukey depth. In both figures, the set of points is {(1, 1), (7, 3), (5, 7), (3, 3), (5, 5), (6, 3)}, the region of depth 0 is white, the region of depth 1 is light gray, and the region of depth 2 is dark gray. Note that for exact Tukey depth, the regions of different depths form a sequence of nested convex polygons; for approximate Tukey depth, they form a sequence of nested rectangles.
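The approximate depth illustrated on the right of Figure 3 can be computed with a short Python sketch. The function name `approx_tukey_depth` is hypothetical, and the sketch assumes closed axis-aligned halfspaces, so a point lying exactly on a data coordinate counts that data point on both sides.

```python
def approx_tukey_depth(y, points):
    """Minimum, over the 2d axis-aligned halfspaces containing y,
    of the number of data points in that halfspace."""
    depth = len(points)
    for j in range(len(y)):
        at_or_below = sum(1 for p in points if p[j] <= y[j])
        at_or_above = sum(1 for p in points if p[j] >= y[j])
        depth = min(depth, at_or_below, at_or_above)
    return depth


# The six points from Figure 3.
points = [(1, 1), (7, 3), (5, 7), (3, 3), (5, 5), (6, 3)]
outside = approx_tukey_depth((0, 0), points)   # outside the bounding rectangle
shallow = approx_tukey_depth((2, 2), points)   # in the outer ring
deep = approx_tukey_depth((4, 4), points)      # in the inner rectangle
```

The three query points fall in the white, light gray, and dark gray regions respectively, with depths 0, 1, and 2.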

Return $y \leftarrow$ SamplePointWithDepth($S$, $\hat{i}$)

Algorithm 4 SamplePointWithDepth
1: Input: Sorted collection of estimators $S$, depth $i$
2: $d, m \leftarrow$ number of rows and columns in $S$
3: Compute $V_{\geq 1,i}$ using Lemma 7.4
4: for $j = 1, \ldots, d$ do

2. Diamonds Agarwal (2017): The label is price. Ordinal categorical features (carat, color, clarity) are replaced with integers 1, 2, . . . .

3. Traffic NYSDOT (2013): The label is passenger vehicle count (Class 2), and the remaining features are motorcycles (Class 1) and pickups, panels, and vans (Class 3).

4. NBA Lauga (2022): The label is PTS_home, and the features are FT_PCT_home, FG3_PCT_home, FG_PCT_home, AST_home, and REB_home.

5. Beijing Housing ruiqurm (2018): The label is totalPrice.

Figure 4: Hyperparameter settings used by DPSGD on each dataset.

Figure 5: Plots of $R^2$ as the number of models $m$ used by TukeyEM varies. The lines mark medians and the shaded regions span the first and third quartiles. All datasets except Housing and Diamonds use 10 trials. Housing and Diamonds use 50 trials due to the variance of TukeyEM. Methods other than TukeyEM appear as flat lines because they do not vary with $m$. Each plot varies the number of models $m$ in increments of 250, starting with the $m$ sufficient to pass PTR in all trials.

Figure 6: Histograms of models on the California dataset.

Figure 7: Histograms of models on Synthetic.

Figure 9: Histograms of models on Diamonds.

Figure 10: Histograms of models on NBA.

, so our experiments instead use a notion of approximate Tukey depth that can be computed efficiently. The approximate notion of Tukey depth only takes a minimum over the $2d$ halfspaces corresponding to the canonical basis.

Definition 2.5. Let $E = \{e_1, \ldots, e_d\}$ be the canonical basis for $\mathbb{R}^d$ and let $D \subset \mathbb{R}^d$. The approximate Tukey depth of a point $y \in \mathbb{R}^d$ with respect to $D$, denoted $\tilde{T}_D(y)$, is the minimum number of points in $D$ in any of the $2d$ halfspaces determined by $E$ containing $y$.

Let $0 < \alpha, \gamma < 1$ and let $S = \{\beta_1, \ldots, \beta_m\}$ be an i.i.d. sample from the multivariate normal distribution $N(\beta^*, \Sigma)$ with covariance $\Sigma \in \mathbb{R}^{d \times d}$ and mean

For each dataset, the DP methods with interquartile ranges overlapping that of the DP method with the highest median $R^2$ are bolded.

A few takeaways are immediate. First, on most datasets TukeyEM obtains $R^2$ exceeding or matching that of both AdaSSP and DPSGD. TukeyEM achieves this even though AdaSSP receives non-private access to the true feature and label norms, and DPSGD receives non-private access to extensive hyperparameter tuning.

6. New York Garbage DSNY (2022): The label is REFUSETONSCOLLECTED. The features are PAPERTONSCOLLECTED and MPGTONSCOLLECTED. The categorical BOROUGH is encoded as one-hot variables.

6. ACKNOWLEDGMENTS

We thank Gavin Brown for helpful discussion of Brown et al. (2021), and we thank Jenny Gillenwater for useful feedback on an early draft. We also thank attendees of the Fields Institute Workshop on Differential Privacy and Statistical Data Analysis for helpful general discussions.


The second application in their argument, which bounds $\frac{P[M_{\varepsilon,t}(x') = y]}{P[M_{\varepsilon,t}(x) = y]}$, uses the same logic. As a result, their Lemma 3.7 also holds for a monotonic restricted exponential mechanism, and we can drop the 2 in the sampling distribution as desired.

7.4. SAMPLING FROM A REGION DETAILS

We start by formally defining our partition.

Definition 7.2. Given $d$-dimensional database $D$ and dimension $j \in [d]$, for $y \in \mathbb{R}^d$, let $T_{D,j}(y)$ denote the exact (one-dimensional) Tukey depth of point $y$ with respect to dimension $j$ in database $D$. Let $B_i$ denote the region of points with approximate Tukey depth $i$. Define the partition $\{C_{j,i}\}_{j=1}^{d}$ of $B_i$ as the volume of points where depth $i$ occurs in dimension $j$ for the first time, i.e.,

$$C_{j,i} = \{y \in B_i \mid \min_{j' < j} T_{D,j'}(y) > i \text{ and } T_{D,j}(y) = i \text{ and } \min_{j' > j} T_{D,j'}(y) \geq i\}.$$

The partition is well defined because any point with approximate Tukey depth $i$ is in exactly one of the $C_{j,i}$ volumes. Each $C_{j,i}$ is also the Cartesian product of three sets: any $y \in C_{j,i}$ must have 1) depth strictly greater than $i$ in dimensions $1, \ldots, j-1$, 2) depth $i$ in dimension $j$, and 3) depth at least $i$ in dimensions $j+1, \ldots, d$. Being the Cartesian product of three sets, the total volume of $C_{j,i}$ can be computed as the product of the three corresponding volumes in lower dimensions. We will denote these by $V_{<j,i}$, $W_{j,i}$, $V_{>j,i}$, formalized below.

Definition 7.3. Given $d$-dimensional database $D$, dimension $j \in [d]$, and depth $i$, define
1. $V_{j,i,D} = \mathrm{vol}(\{y_j \mid y \in \mathbb{R}^d, \tilde{T}_D(y) \geq i\})$, the total length in dimension $j$ of the region with approximate Tukey depth at least $i$.
2. $W_{j,i,D} = \mathrm{vol}(\{y_j \mid y \in \mathbb{R}^d, T_{D,j}(y) = i \text{ and } \tilde{T}_D(y) \geq i\})$, the total length in dimension $j$ of the region with depth exactly $i$ in dimension $j$ and approximate Tukey depth at least $i$.
3. $V_{<j,i,D}$, the volume of the projection onto the first $j-1$ dimensions of points with approximate Tukey depth at least $i$. Define $V_{>j,i,D}$ analogously.

When $D$ is clear from context, we drop it from the subscript.

The next lemma shows how to compute these and other relevant volumes. We again note that we fix $m$ to be even for neatness. The odd case is similar.

Lemma 7.4. Given matrix $S \in \mathbb{R}^{d \times m}$ of projected and sorted models, as in Line 8 of Algorithm 2,

Proof. The proof of the first item uses essentially the same reasoning as the proof of Lemma 3.3. For the second item, any point contributing to $V_{j,i}$ but not $V_{j,i+1}$ has depth exactly $i$ in dimension $j$. For the third item, a point contributes to $V_{<j,i}$ if and only if it has depth at least $i$ in all $d$ dimensions; since the resulting region is a rectangle, its volume is the product of its side lengths.

With Lemma 7.4, we can now prove that SamplePointWithDepth works as intended.

Proof of Lemma 3.9. Given $S \in \mathbb{R}^{d \times m}$, define $B_i$ and $\{C_{j,i}\}_{j=1}^{d}$ as in Definition 7.2. To show the outcome of SamplePointWithDepth($S$, $i$) is uniformly distributed over points with approximate Tukey depth $i$, it suffices to show the algorithm samples a $C_{j,i}$ with probability proportional to its volume. Recall that $C_{j,i}$ is the set of points with depth greater than $i$ in dimensions $1, 2, \ldots, j-1$, exactly $i$ in dimension $j$, and at least $i$ in the remaining dimensions. With this interpretation, $C_{j,i}$ is

