NEAR OPTIMAL PRIVATE AND ROBUST LINEAR REGRESSION

Abstract

We study the canonical statistical estimation problem of linear regression from n i.i.d. examples under (ε, δ)-differential privacy when a fraction of the response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: full-batch gradient descent to improve sample complexity, and a novel adaptive clipping scheme to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves near optimal sample complexity. Under label corruption, this is the first efficient linear regression algorithm to provably guarantee both (ε, δ)-DP and robustness. Synthetic experiments confirm the superiority of our approach.

1. INTRODUCTION

Differential Privacy (DP) is a widely accepted notion of privacy introduced in Dwork et al. (2006), and is now standard in industry and government (Tang et al., 2017; Erlingsson et al., 2014; Fanti et al., 2016; Abowd, 2018). A query to a database is (ε, δ)-differentially private if a strong adversary who knows all other entries cannot identify with high confidence whether you participated in the database. The parameters ε and δ restrict the Type-I and Type-II errors achievable by the adversary (Kairouz et al., 2015); smaller ε > 0 and δ ∈ [0, 1] imply stronger privacy guarantees. Significant advances have been made recently in understanding the utility-privacy trade-offs of canonical statistical tasks; we provide a survey in App. A. However, several important questions remain open, some of which we address below.

In the canonical statistical task of linear regression, n i.i.d. samples {(x_i ∈ R^d, y_i ∈ R)}_{i=1}^n are drawn from x_i ∼ N(0, Σ) and y_i = x_i^⊤ w* + z_i with z_i ∼ N(0, σ²). The error is measured in ∥ŵ − w*∥_Σ := ∥Σ^{1/2}(ŵ − w*)∥, which correctly accounts for the signal-to-noise ratio in each direction: in the direction of a large eigenvalue of Σ, we have a larger signal in x_i but the same noise in z_i, and hence expect a smaller error.

When computational complexity is of no concern, the best known algorithm is High-dimensional Propose-Test-Release (HPTR), introduced by Liu et al. (2022b), which can be flexibly applied to a variety of statistical tasks to achieve the optimal sample complexity under (ε, δ)-DP. For linear regression, n = O(d/α² + d/(εα)) samples are sufficient for HPTR to achieve an error of (1/σ)∥ŵ − w*∥_Σ = α with high probability. This is optimal, matching known information-theoretic lower bounds. It remains an important open question whether this can be achieved with an efficient algorithm. After a series of works surveyed in App. A, Varshney et al.
(2022) achieves the best known sample complexity for an efficient algorithm: n = Õ(κ²d/ε + d/α² + κd/(εα)). The last term is suboptimal by a factor of κ, the condition number of the covariance Σ of the covariates, and the first term is unnecessary. We close this gap in the following.

Theorem 1 (informal version of Theorem 3 with no adversary). Under the (Σ, σ², w*, K, a)-model in Assumption 1, n = Õ(d/α² + κ^{1/2}d/(εα)) samples are sufficient for Algorithm 1 to achieve an error rate of (1/σ)∥ŵ − w*∥_Σ = Õ(α) and (ε, δ)-DP, where κ := λ_max(Σ)/λ_min(Σ).

Perhaps surprisingly, we show that the same algorithm is also robust against label corruption, where an adversary selects an arbitrary α_corrupt fraction of the data points and changes their response variables arbitrarily. When computational complexity is of no concern, the best known algorithm is again HPTR by Liu et al. (2022b), which provides optimal robustness and (ε, δ)-DP simultaneously: n = O(d/α² + d/(εα)) samples are sufficient for HPTR to achieve an error of (1/σ)∥ŵ − w*∥_Σ = α for any corruption bounded by α_corrupt ≤ α. Note that this is a strong adversary who can corrupt both the covariate, x_i, and the response variable, y_i. Currently, there is no efficient algorithm that guarantees both privacy and robustness for linear regression. Under a weaker adversary who can only corrupt the response variables, we close this gap in the following.

Theorem 2 (informal version of Theorem 3 with adversarial label corruption). Under the hypotheses of Theorem 1 and the α_corrupt-corruption model of Assumption 2, if α_corrupt ≤ α then n = Õ(d/α² + κ^{1/2}d/(εα)) samples are sufficient for Algorithm 1 to achieve an error rate of (1/σ)∥ŵ − w*∥_Σ = Õ(α) and (ε, δ)-DP, where κ := λ_max(Σ)/λ_min(Σ).

We start with a formal description of the setting in Sec. 2 and present the approach of Varshney et al. (2022) for private but non-robust linear regression.
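As a concrete reference point for the model and error metric above, the following sketch (our own illustration, assuming NumPy; all variable names are ours) draws samples from the (Σ, σ², w*)-model and evaluates the Σ-weighted error ∥ŵ − w*∥_Σ = ∥Σ^{1/2}(ŵ − w*)∥ for a plain, non-private least-squares baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 20000, 1.0

# A positive definite covariance; here diagonal with condition number 4.
Sigma = np.diag(np.linspace(1.0, 4.0, d))
w_star = rng.normal(size=d)

# Draw x_i ~ N(0, Sigma) and y_i = <x_i, w*> + z_i with z_i ~ N(0, sigma^2).
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
y = X @ w_star + rng.normal(scale=sigma, size=n)

# Non-private OLS baseline (a private algorithm would replace this step).
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Error metric from the text: ||w_hat - w*||_Sigma = ||Sigma^{1/2}(w_hat - w*)||.
sqrt_Sigma = np.diag(np.sqrt(np.diag(Sigma)))
err = np.linalg.norm(sqrt_Sigma @ (w_hat - w_star))
print(f"(1/sigma) * ||w_hat - w*||_Sigma = {err / sigma:.4f}")
```

With Σ diagonal, Σ^{1/2} is simply the entrywise square root; the normalized error concentrates around the familiar non-private rate of order sqrt(d/n).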
We build upon this approach and make two innovations. First, we propose full-batch gradient descent, which is more challenging to analyze but achieves an improved dependence on the condition number κ; crucial in overcoming the challenges in the analysis is the notion of resilience explained in our proof sketch (Sec. 6). Second, we propose a novel adaptive clipping method that ensures robustness against label corruption. We present our main algorithm (Alg. 1) in Sec. 3, together with its theoretical analysis and a justification of the assumptions. Our adaptive clipping is both robust and private: we use a truncated mean to ensure robustness and a private histogram to ensure privacy, as described in Sec. 4. We present numerical experiments on synthetic data that demonstrate the sample efficiency of our approach in Sec. 5. We end with a sketch of our main proof ideas in Sec. 6, which might be of independent interest to those requiring a tight analysis of linear regression in other settings.
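The combination of full-batch clipped gradients and an adaptive threshold can be sketched as follows. This is our own illustrative simplification, not the paper's Algorithm 1: the clipping threshold here is a plain empirical quantile of the per-example gradient norms, whereas the paper selects it robustly via a truncated mean and privately via a histogram, and the noise calibration below omits the exact privacy accounting constants:

```python
import numpy as np

def dp_gd_adaptive_clip(X, y, noise_mult, steps=200, lr=None, quantile=0.9,
                        rng=None):
    """Illustrative full-batch clipped gradient descent for least squares.
    Per-example gradients are clipped at an adaptively chosen threshold,
    then averaged, and Gaussian noise scaled to the clipped sensitivity
    (constants omitted) is added before each update."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    if lr is None:
        # Step size ~ 1 / lambda_max(X^T X / n) for stability.
        lr = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(steps):
        residuals = X @ w - y
        grads = residuals[:, None] * X            # per-example gradients, n x d
        norms = np.linalg.norm(grads, axis=1)
        tau = np.quantile(norms, quantile)        # adaptive clipping threshold
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        clipped_mean = (grads * scale[:, None]).mean(axis=0)
        noise = rng.normal(scale=noise_mult * tau / n, size=d)
        w = w - lr * (clipped_mean + noise)
    return w
```

Because every example's gradient is clipped to norm at most τ, a single label corruption can shift the averaged gradient by at most 2τ/n, which is also what makes the Gaussian noise scale of order τ/n the natural choice for privacy.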

2. PROBLEM FORMULATION AND BACKGROUND

For linear regression without adversarial corruption, the following assumption is standard for the uncorrupted dataset S_good, except that we assume the more general family of (K, a)-sub-Weibull distributions, which recovers the standard sub-Gaussian family as a special case when a = 1/2.

Assumption 1 ((Σ, σ², w*, K, a)-model). A multiset S_good = {(x_i ∈ R^d, y_i ∈ R)}_{i=1}^n of n i.i.d. samples is drawn from a linear model y_i = ⟨x_i, w*⟩ + z_i, where the input vector x_i is zero mean, E[x_i] = 0, with a positive definite covariance Σ := E[x_i x_i^⊤] ≻ 0, and the (possibly input-dependent) label noise z_i is zero mean, E[z_i] = 0, with variance σ² := E[z_i²]. We further assume E[x_i z_i] = 0, which is equivalent to assuming that the true parameter satisfies w* = Σ^{−1} E[y_i x_i]. We assume that the marginal distributions of x_i and of z_i are both (K, a)-sub-Weibull, as defined below. Sub-Weibull distributions provide Gaussian-like tail bounds determining the resilience of the dataset in Lemma J.7, on which our analysis critically relies and whose necessity is justified in Sec. 3.3.

Definition 2.1 (sub-Weibull distribution; Kuchibhotla & Chakrabortty (2018)). For some K, a > 0, we say a random vector x ∈ R^d is from a (K, a)-sub-Weibull distribution if for all v ∈ R^d,

    E[ exp( ( ⟨v, x⟩² / (K² E[⟨v, x⟩²]) )^{1/(2a)} ) ] ≤ 2 .

Our goal is to estimate the unknown parameter w*, given upper bounds on the sub-Weibull parameters (K, a) and a corrupted dataset under the standard definition of label corruption of Bhatia et al. (2015). There are variations in the literature, which we survey in Appendix A.

Assumption 2 (α_corrupt-corruption). Given a dataset S_good = {(x_i, y_i)}_{i=1}^n, an adversary inspects all the data points, selects α_corrupt·n data points, denoted S_r, and replaces their labels with arbitrary labels while keeping the covariates unchanged.
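As a numerical sanity check on the sub-Weibull condition (our own illustration, reading the moment bound with the usual Orlicz-norm convention of an upper bound of 2): for Gaussian x and a = 1/2 the exponent 1/(2a) equals 1, ⟨v, x⟩/sqrt(E[⟨v, x⟩²]) is a standard Gaussian g, and the left-hand side has the closed form E[exp(g²/K²)] = (1 − 2/K²)^{−1/2}, e.g. sqrt(9/7) ≈ 1.134 for K = 3:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3.0
# g plays the role of <v, x> / sqrt(E[<v, x>^2]) for Gaussian x, any v.
g = rng.normal(size=500_000)

# Sub-Weibull moment with a = 1/2 (so 1/(2a) = 1): E[exp(g^2 / K^2)].
mc = np.exp(g**2 / K**2).mean()
closed_form = (1.0 - 2.0 / K**2) ** -0.5   # = sqrt(9/7) for K = 3

print(f"Monte Carlo: {mc:.4f}, closed form: {closed_form:.4f}")
```

The closed form requires K² > 2; this matches the intuition that the sub-Weibull parameter K cannot be taken arbitrarily small even for a Gaussian.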
We let S_bad denote this set of α_corrupt·n examples newly labelled by the adversary, and let the resulting set be S := S_good ∪ S_bad \ S_r. We further assume that the corruption rate is bounded by α_corrupt ≤ ᾱ, where ᾱ is a known positive constant satisfying ᾱ ≤ 1/10, 72 C_2 K² ᾱ log^{2a}(1/(6ᾱ)) log(κ) ≤ 1/2, and 2 C_2 K² log^{2a}(1/(2ᾱ)) ≥ 1 for the (K, a)-sub-Weibull distribution of interest, where C_2 is a positive constant defined in Lemma J.7 that depends only on (K, a).

Notation. A vector x ∈ R^d has Euclidean norm ∥x∥. For a matrix M, we use ∥M∥_2 to denote the spectral norm. The error is measured in ∥ŵ − w*∥_Σ := ∥Σ^{1/2}(ŵ − w*)∥ for a PSD matrix Σ.
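A minimal simulation of this corruption model may be helpful (our own illustration; the adversary below is one arbitrary choice among the many Assumption 2 allows). The adversary sees the whole dataset, picks the α_corrupt·n points it deems most damaging, and replaces only their labels, leaving every covariate untouched:

```python
import numpy as np

def corrupt_labels(X, y, alpha_corrupt):
    """Label corruption per Assumption 2: the adversary may inspect (X, y),
    then replaces the labels of an alpha_corrupt fraction of points with
    arbitrary values while keeping the covariates unchanged."""
    n = len(y)
    k = int(alpha_corrupt * n)
    # One concrete adversary: flip the sign of and inflate the k largest
    # responses in magnitude, which heavily biases a naive least-squares fit.
    idx = np.argsort(-np.abs(y))[:k]
    y_corrupted = y.copy()
    y_corrupted[idx] = -10.0 * y_corrupted[idx]
    return y_corrupted, idx
```

Note that the function never modifies X: under this weaker adversary only the responses change, which is what makes an efficient private and robust estimator possible in Theorem 2.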

