NEAR OPTIMAL PRIVATE AND ROBUST LINEAR REGRESSION

Abstract

We study the canonical statistical estimation problem of linear regression from n i.i.d. examples under (ε, δ)-differential privacy when a fraction of the response variables is adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: full-batch gradient descent to improve sample complexity, and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves near-optimal sample complexity. Under label corruption, this is the first efficient linear regression algorithm to provably guarantee both (ε, δ)-DP and robustness. Synthetic experiments confirm the superiority of our approach.

1. INTRODUCTION

Differential Privacy (DP) is a widely accepted notion of privacy introduced in Dwork et al. (2006), which is now standard in industry and government (Tang et al., 2017; Erlingsson et al., 2014; Fanti et al., 2016; Abowd, 2018). A query to a database is said to be (ε, δ)-differentially private if a strong adversary who knows all other entries cannot identify with high confidence whether you participated in the database or not. The parameters ε and δ restrict the Type-I and Type-II errors achievable by the adversary (Kairouz et al., 2015). Smaller ε > 0 and δ ∈ [0, 1] imply stronger privacy guarantees. Significant advances have been made recently in understanding the utility-privacy trade-offs in canonical statistical tasks; we provide a survey in App. A. However, several important questions remain open, some of which we address below.

A canonical statistical task is linear regression, where n i.i.d. samples {(x_i ∈ R^d, y_i ∈ R)}_{i=1}^n are drawn from x_i ∼ N(0, Σ) and y_i = x_i^⊤ w* + z_i with z_i ∼ N(0, σ²). The error is measured in ∥ŵ − w*∥_Σ := ∥Σ^{1/2}(ŵ − w*)∥, which correctly accounts for the signal-to-noise ratio in each direction; in the direction of a large eigenvalue of Σ, we have a larger signal in x_i and the same noise in z_i, and hence expect a smaller error. When computational complexity is of no concern, the best known algorithm is High-dimensional Propose-Test-Release (HPTR), introduced by Liu et al. (2022b), which can be flexibly applied to a variety of statistical tasks to achieve the optimal sample complexity under (ε, δ)-DP. For linear regression, n = O(d/α² + d/(εα)) samples are sufficient for HPTR to achieve an error of (1/σ)∥ŵ − w*∥_Σ = α with high probability. This is optimal, matching known information-theoretic lower bounds. It remains an important open question whether this can be achieved with an efficient algorithm. After a series of works surveyed in App. A, Varshney et al. (2022) achieves the best known sample complexity for an efficient algorithm: n = Õ(κ²d/ε + d/α² + κd/(εα)). The last term is suboptimal by a factor of κ, the condition number of the covariance Σ of the covariates, and the first term is unnecessary. We close this gap in the following.

Theorem 1 (informal version of Theorem 3 with no adversary). Under the (Σ, σ², w*, K, a)-model in Assumption 1, n = Õ(d/α² + κ^{1/2}d/(εα)) samples are sufficient for Algorithm 1 to achieve an error rate of (1/σ)∥ŵ − w*∥_Σ = Õ(α) and (ε, δ)-DP, where κ := λ_max(Σ)/λ_min(Σ).

Perhaps surprisingly, we show that the same algorithm is also robust against label corruption, where an adversary selects an arbitrary α_corrupt fraction of the data points and changes their response variables arbitrarily. When computational complexity is of no concern, the best known algorithm is again HPTR (Liu et al., 2022b), which provides optimal robustness and (ε, δ)-DP simultaneously: n = O(d/α² + d/(εα)) samples are sufficient for HPTR to achieve an error of (1/σ)∥ŵ − w*∥_Σ = α for any corruption bounded by α_corrupt ≤ α. Note that this is a strong adversary who can corrupt both the covariate, x_i, and the response variable, y_i. Currently, there is no efficient algorithm that can guarantee both privacy and robustness for linear regression. Under a weaker adversary who can only corrupt the response variable, we close this gap in the following.

Theorem 2 (informal version of Theorem 3 with adversarial label corruption). Under the hypotheses of Theorem 1 and under the α_corrupt-corruption model of Assumption 2, if α_corrupt ≤ α then n = Õ(d/α² + κ^{1/2}d/(εα)) samples are sufficient for Algorithm 1 to achieve an error rate of (1/σ)∥ŵ − w*∥_Σ = Õ(α) and (ε, δ)-DP, where κ := λ_max(Σ)/λ_min(Σ).

We start with the formal description of the setting in Sec. 2 and present the approach of Varshney et al. (2022) for private but non-robust linear regression.
We build upon this approach and make two innovations. First, we propose full-batch gradient descent, which is more challenging to analyze but achieves an improved dependence on the condition number κ. Crucial to overcoming the challenges in the analysis is the notion of resilience, explained in our proof sketch (Sec. 6). Second, we propose a novel adaptive clipping method that ensures robustness against label corruption. We present our main algorithm (Alg. 1) in Sec. 3 with theoretical analysis and justification of the assumptions. Our adaptive clipping is both robust and private: we use a trimmed mean to ensure robustness and a private histogram to ensure privacy in Sec. 4. We present numerical experiments on synthetic data that demonstrate the sample efficiency of our approach in Sec. 5. We end with a sketch of our main proof ideas in Sec. 6, which might be of independent interest to those requiring a tight analysis of linear regression in other settings.

2. PROBLEM FORMULATION AND BACKGROUND

For linear regression without adversarial corruption, the following assumption on the uncorrupted dataset S_good is standard, except that we assume a more general family of (K, a)-sub-Weibull distributions, which recovers the standard sub-Gaussian family as a special case when a = 1/2.

Assumption 1 ((Σ, σ², w*, K, a)-model). A multiset S_good = {(x_i ∈ R^d, y_i ∈ R)}_{i=1}^n of n i.i.d. samples is from a linear model y_i = ⟨x_i, w*⟩ + z_i, where the input vector x_i is zero mean, E[x_i] = 0, with a positive definite covariance Σ := E[x_i x_i^⊤] ≻ 0, and the (input dependent) label noise z_i is zero mean, E[z_i] = 0, with variance σ² := E[z_i²]. We further assume E[x_i z_i] = 0, which is equivalent to assuming that the true parameter is w* = Σ^{-1} E[y_i x_i]. We assume that the marginal distribution of x_i is (K, a)-sub-Weibull and that of z_i is also (K, a)-sub-Weibull, as defined below. Sub-Weibull distributions provide Gaussian-like tail bounds determining the resilience of the dataset in Lemma J.7, on which our analysis critically relies and whose necessity is justified in Sec. 3.3.

Definition 2.1 (sub-Weibull distribution, Kuchibhotla & Chakrabortty (2018)). For some K, a > 0, we say a random vector x ∈ R^d is from a (K, a)-sub-Weibull distribution if, for all v ∈ R^d,

    E[ exp( ( ⟨v, x⟩² / (K² E[⟨v, x⟩²]) )^{1/(2a)} ) ] ≤ 2 .

Our goal is to estimate the unknown parameter w*, given upper bounds on the sub-Weibull parameters (K, a) and a corrupted dataset under the standard definition of label corruption in (Bhatia et al., 2015). There are variations in the literature, which we survey in Appendix A.

Assumption 2 (α_corrupt-corruption). Given a dataset S_good = {(x_i, y_i)}_{i=1}^n, an adversary inspects all the data points, selects α_corrupt·n data points denoted as S_r, and replaces their labels with arbitrary labels while keeping the covariates unchanged.
We let S_bad denote this set of α_corrupt·n newly labelled examples. Let the resulting set be S := S_good ∪ S_bad \ S_r. We further assume that the corruption rate is bounded by α_corrupt ≤ ᾱ, where ᾱ is a known positive constant satisfying ᾱ ≤ 1/10, 72C₂K² ᾱ log^{2a}(1/(6ᾱ)) log(κ) ≤ 1/2, and 2C₂K² log^{2a}(1/(2ᾱ)) ≥ 1 for the (K, a)-sub-Weibull distribution of interest, where C₂ is a positive constant defined in Lemma J.7 that only depends on (K, a).

Notations.

A vector x ∈ R^d has Euclidean norm ∥x∥. For a matrix M, we use ∥M∥₂ to denote the spectral norm. The error is measured in ∥ŵ − w*∥_Σ := ∥Σ^{1/2}(ŵ − w*)∥ for a PSD matrix Σ. The identity matrix is denoted by I_d ∈ R^{d×d}. Let [n] = {1, 2, . . . , n}. Õ(·) hides constants, K, a = Θ(1), and poly-logarithmic terms in n, d, 1/ε, log(1/δ), 1/ζ, and 1/α_corrupt.

2.1. BACKGROUND ON DP

Differential Privacy (DP) is a standard measure of privacy leakage when a dataset is accessed via queries, introduced by Dwork et al. (2006). Algorithms with strong DP guarantees provide plausible deniability against a strong adversary who knows all other entries in the dataset and tries to identify a particular user's entry (Kairouz et al., 2015). Two datasets S and S′ are said to be neighbors if they differ by at most one entry, which is denoted by S ∼ S′. A stochastic query q is said to be (ε, δ)-differentially private for some ε > 0 and δ ∈ [0, 1] if P(q(S) ∈ A) ≤ e^ε P(q(S′) ∈ A) + δ for all neighboring datasets S ∼ S′ and all subsets A of the range of the query. We build upon two widely used DP primitives: the Gaussian mechanism and the private histogram. A central concept in DP mechanism design is the sensitivity of a query, defined as ∆_q := sup_{S∼S′} ∥q(S) − q(S′)∥. We describe the private histogram in App. B.

Lemma 2.2 (Gaussian mechanism, Dwork & Roth (2014)). For a query q with sensitivity ∆_q, the Gaussian mechanism outputs q(S) + N(0, (∆_q √(2 log(1.25/δ))/ε)² I_d) and achieves (ε, δ)-DP.
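As a concrete illustration of Lemma 2.2, the Gaussian mechanism for a clipped-mean query can be sketched as follows (a minimal example; the clipping bound B, the dataset, and the helper name are hypothetical, not from the paper):

```python
import numpy as np

def gaussian_mechanism_mean(data, B, eps, delta, rng):
    """Release the mean of `data` with (eps, delta)-DP.

    Clipping each row to norm B bounds the sensitivity of the mean query
    by Delta = 2B/n under replace-one neighboring datasets, so Gaussian
    noise with std Delta * sqrt(2 log(1.25/delta)) / eps suffices.
    """
    n, d = data.shape
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    clipped = data * np.minimum(1.0, B / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * B / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + sigma * rng.normal(size=d)

rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 5))          # true mean is 0
release = gaussian_mechanism_mean(x, B=3.0, eps=1.0, delta=1e-6, rng=rng)
# With n = 10,000 the calibrated noise is tiny, so the release stays near 0.
```

The same calibration (noise standard deviation proportional to the clipping threshold over the batch size) is what Algorithm 1 applies to its clipped gradient average.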

2.2. STANDARD APPROACH IN PRIVATE LINEAR REGRESSION

When there is no adversarial corruption, the state-of-the-art approach, introduced by Varshney et al. (2022), is based on stochastic gradient descent with clipping and additive Gaussian noise to ensure privacy. There are two main components in this approach: adaptive clipping and streaming SGD. Adaptive clipping with an appropriate threshold θ_t ensures that no data point is clipped while providing a bound on the sensitivity of the average mini-batch gradient, which ensures we do not add too much noise. The streaming approach, where a data point is touched only once and then discarded, ensures independence between the past iterate w_{t−1} and the gradients at round t, which the analysis critically relies on. For T = Θ(κ) iterations, where κ is the condition number of the covariance Σ of the covariates, the dataset S = {(x_i, y_i)}_{i=1}^n is partitioned into T subsets {B_t}_{t=1}^T of equal size. At each round t < T, the gradients are clipped and averaged with additive Gaussian noise:

    w_{t+1} ← w_t − η( (1/|B_t|) Σ_{i∈B_t} clip_{θ_t}( x_i(w_t^⊤x_i − y_i) ) + (θ_t √(2 log(1.25/δ)) / (ε|B_t|)) ν_t ),

where ν_t ∼ N(0, I_d). In Varshney et al. (2022), a variation of this streaming SGD is shown to require n = Õ(κ²d/ε + d/α² + κd/(εα)) samples to achieve an error of ∥w_T − w*∥²_Σ = O(σ²α²).

Our approach builds upon such gradient-based methods but makes two important innovations. First, we use full-batch gradient descent, as opposed to the streaming SGD above. Using all n samples reduces the sensitivity of the per-round gradient average, allowing us to improve the sample complexity to n = Õ(d/α² + κ^{1/2}d/(εα)) to achieve an error of ∥w_T − w*∥²_Σ = O(σ²α²). However, we lose the independence between w_{t−1} and the gradients in the current round, which makes the analysis more challenging. We instead rely on resilience to precisely track the bias and variance of the (dependent) full-batch gradient average. Resilience is a central concept in robust statistics, which we explain in Sec. 6.
The second innovation we make is to clip x_i and (w_t^⊤x_i − y_i) separately in the gradient. This is critical in achieving robustness to label corruption, as we explain in Sec. 3.1.
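The failure mode that motivates separate clipping can be seen numerically: if a corrupted point has a small covariate norm, the full gradient x_i(w^⊤x_i − y_i) can stay under a norm threshold even when the residual is huge, whereas clipping the residual separately caps that point's contribution. A small sketch (the thresholds and the corruption value are hypothetical):

```python
import numpy as np

def clip_to_norm(v, threshold):
    """Scale v down so that ||v|| <= threshold (standard gradient clipping)."""
    norm = np.linalg.norm(v)
    return v * min(1.0, threshold / norm) if norm > 0 else v

# A corrupted point with a tiny covariate and a wildly wrong label.
x = np.array([1e-3, 0.0])   # small norm, chosen by the adversary
y = 1000.0                  # corrupted response
w = np.zeros(2)
residual = w @ x - y        # = -1000: huge residual

# Standard clipping: the gradient norm is ||x|| * |residual| = 1, which
# slips under a norm threshold of, say, 5 -- the bias survives intact.
g_standard = clip_to_norm(x * residual, 5.0)

# Separate clipping (Sec. 3.1): clip ||x|| and the residual independently;
# with residual threshold theta, the bias is capped at ||x|| * theta.
theta, Theta = 10.0, 5.0
g_separate = clip_to_norm(x, Theta) * np.clip(residual, -theta, theta)

print(np.linalg.norm(g_standard))  # 1.0  (unclipped, fully biased)
print(np.linalg.norm(g_separate))  # 0.01 (residual capped at theta)
```

The corrupted point's contribution drops from 1 to ∥x∥·θ = 0.01 under separate clipping, which is the effect the adaptive threshold θ_t exploits.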

3. ROBUST AND DIFFERENTIALLY PRIVATE LINEAR REGRESSION

We introduce a gradient descent approach for linear regression with a novel adaptive clipping that ensures robustness against label-corruption. This achieves a near-optimal sample complexity and, for the special case of private linear regression without adversarial corruption, improves upon the state-of-the-art algorithm.

3.1. ALGORITHM

The skeleton of our approach in Alg. 1 is the general DP-SGD of Abadi et al. (2016); Song et al. (2013) with adaptive clipping (Andrew et al., 2021). However, standard adaptive clipping is not robust against label corruption under the more general (K, a)-sub-Weibull assumption. In particular, under a sub-Weibull distribution it is possible that a positive fraction of the covariates are close to the origin, which is not possible under Gaussian data due to concentration. In this case, the adversary can choose to corrupt those points with small norm ∥x_i∥, making large changes in the residual (y_i − w_t^⊤x_i) while evading standard clipping (by the norm of the gradient), since the norm of the gradient, ∥x_i(y_i − w_t^⊤x_i)∥ = ∥x_i∥|y_i − w_t^⊤x_i|, can remain under the threshold. This is problematic, since the bias due to the corrupted samples in the gradient scales proportionally to the magnitude of the residual (after clipping). To this end, we propose clipping the norm and the residual separately: clip_Θ(x_i)·clip_{θ_t}(w_t^⊤x_i − y_i). This keeps the sensitivity of the gradient average bounded by Θθ_t, and the subsequent Gaussian mechanism in line 8 ensures (ε_0, δ_0)-DP at each round. Applying the advanced composition of Lemma B.4 over T rounds, this ensures end-to-end (ε, δ)-DP.

Novel adaptive clipping. In clip_Θ(x_i), the only purpose of clipping the covariate by its norm, ∥x_i∥, is to bound the sensitivity of the resulting clipped gradient. In particular, we do not need to make it robust, as there is no corruption in the covariates. Ideally, we want to select the smallest threshold Θ that does not clip any of the covariates. Since the norm of a covariate is upper bounded as ∥x_i∥² ≤ K² Tr(Σ) log^{2a}(1/ζ) with probability 1 − ζ (Lemma J.3), we estimate the unknown Tr(Σ) using the Private Norm Estimator in Alg. 3 in App. F and set the norm threshold Θ = K√(2Γ) log^a(n/ζ) (line 3). The n in the logarithm ensures that the union bound holds.
In clip_{θ_t}(w_t^⊤x_i − y_i), the purpose of clipping the residual by its magnitude, |y_i − w_t^⊤x_i| = |(w* − w_t)^⊤x_i + z_i|, is to bound the sensitivity of the gradient and also to provide robustness against label corruption. We want to choose a threshold that only clips corrupt data points and at most a few clean data points. We know that any (1 − 2α_corrupt) fraction of the clean data points is sufficient to get a good estimate of the average gradient, and we can find such a large enough set of points satisfying |(w* − w_t)^⊤x_i + z_i|² ≤ (∥w_t − w*∥²_Σ + σ²)·CK² log^{2a}(1/(2α)) from Lemma J.3. At the same time, this threshold on the residual is small enough to guarantee robustness against the label-corrupted samples. We introduce the Robust Private Distance Estimator in Alg. 2 to estimate the unknown (squared and shifted) distance, ∥w_t − w*∥²_Σ + σ², and set the distance threshold θ_t = 2√(2γ_t)·9C₂K² log^{2a}(1/(2α)) (line 6). Both the norm and the distance estimation rely on the private histogram (Lemma B.1), but over a set of statistics computed on partitioned datasets, which we explain in detail in Sec. 4.

Algorithm 1: Robust and Private Linear Regression
Input: dataset S = {(x_i, y_i)}_{i=1}^{3n}, privacy parameters (ε, δ), number of iterations T, learning rate η, failure probability ζ, target error rate α, distribution parameters (K, a)
1: Partition the dataset S into three equal-sized disjoint subsets S = S_1 ∪ S_2 ∪ S_3
2: δ_0 ← δ/(2T), ε_0 ← ε/(4√(T log(1/δ_0))), ζ_0 ← ζ/3, w_0 ← 0
3: Γ ← PrivateNormEstimator(S_1, ε_0, δ_0, ζ_0), Θ ← K√(2Γ) log^a(n/ζ_0)
4: for t = 0, 1, . . . , T − 1 do
5:   γ_t ← RobustPrivateDistanceEstimator(S_2, w_t, ε_0, δ_0, α, ζ_0)
6:   θ_t ← 2√(2γ_t)·9C₂K² log^{2a}(1/(2α))
7:   Sample ν_t ∼ N(0, I_d)
8:   w_{t+1} ← w_t − η( (1/n) Σ_{i∈S_3} clip_Θ(x_i) clip_{θ_t}(w_t^⊤x_i − y_i) + (√(2 log(1.25/δ_0)) Θθ_t / (ε_0 n)) ν_t )
9: Return w_T
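A simplified rendition of Algorithm 1's inner loop may help fix ideas. This sketch takes the thresholds Θ and θ_t as given (in Alg. 1 they come from the private norm and distance estimators) and keeps only the clipped full-batch step with the Gaussian mechanism; the toy data and all parameter values below are hypothetical:

```python
import numpy as np

def robust_private_gd(X, y, Theta, thetas, eps0, delta0, eta, rng):
    """Sketch of Alg. 1's loop: full-batch gradient descent where the
    covariate norm is clipped at Theta and the residual at thetas[t], then
    Gaussian noise calibrated to sensitivity ~ Theta * thetas[t] / n is
    added. Theta and thetas are assumed given (privately estimated in the
    paper via Algorithms 2 and 3)."""
    n, d = X.shape
    w = np.zeros(d)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X * np.minimum(1.0, Theta / np.maximum(norms, 1e-12))  # clip_Theta(x_i)
    for theta_t in thetas:
        resid = np.clip(Xc @ w - y, -theta_t, theta_t)   # clip_theta(w^T x_i - y_i)
        grad = Xc.T @ resid / n
        sigma = np.sqrt(2.0 * np.log(1.25 / delta0)) * Theta * theta_t / (eps0 * n)
        w = w - eta * (grad + sigma * rng.normal(size=d))
    return w

# Toy run: well-conditioned Gaussian design, no corruption.
rng = np.random.default_rng(1)
n, d = 20_000, 5
w_star = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
w_hat = robust_private_gd(X, y, Theta=6.0, thetas=[3.0] * 50,
                          eps0=0.5, delta0=1e-6, eta=0.5, rng=rng)
# w_hat lands close to w_star despite clipping and the added DP noise.
```

The per-round privacy noise here is calibrated exactly as in line 8 of Alg. 1; composition over the T rounds (and the private threshold estimation) is omitted for brevity.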

3.2. ANALYSIS

We show that Algorithm 1 achieves a near-optimal sample complexity. We provide a proof in Appendix H and a sketch of the proof in Section 6. We address the necessity of the assumptions in Sec. 3.3, along with some lower bounds.

Theorem 3. Algorithm 1 is (ε, δ)-DP. Under the (Σ, σ², w*, K, a)-model of Assumption 1 and the α_corrupt-corruption of Assumption 2, for any failure probability ζ ∈ (0, 1) and target error rate α ≥ α_corrupt, if the sample size is large enough that

    n = Õ( K²d log^{2a+1}(1/ζ) + (d + log(1/ζ))/α² + dT^{1/2} log(1/δ) log^a(1/ζ)/(εα) ),

with a large enough constant, where Õ hides poly-logarithmic terms in d, n, and κ, then the choice of a small enough step size, η ≤ 1/(1.1λ_max(Σ)), and number of iterations, T = Θ(κ log(∥w*∥)), for condition number κ := λ_max(Σ)/λ_min(Σ), ensures that, with probability 1 − ζ, Algorithm 1 achieves

    E_{ν_1,...,ν_T ∼ N(0,I_d)}[∥w_T − w*∥²_Σ] = Õ( K⁴σ²α² log^{4a}(1/α) ),

where the expectation is taken over the noise added for DP, and Θ(·) hides logarithmic terms in K, σ, d, n, 1/ε, log(1/δ), 1/α, and κ.

Optimality. Omitting constant and logarithmic terms, Alg. 1 requires

    n = Õ( d/α² + κ^{1/2}d log(1/δ)/(εα) )    (4)

samples to ensure an error rate of E[∥w_T − w*∥²_Σ] = Õ(σ²α²) for any α ≥ α_corrupt. The lower bound on the achievable error, σ²α² ≥ σ²α²_corrupt, is due to the label corruption and cannot be improved, as it matches an information-theoretic lower bound we provide in Proposition 3.1. In the special case when the covariate follows a sub-Gaussian distribution, that is, (K, 1/2)-sub-Weibull for a constant K, there is an n = Ω(d/α² + d/(εα)) lower bound (Cai et al. (2019), Theorem 4.1), and our upper bound matches this lower bound up to a factor of κ^{1/2} in the second term and other logarithmic factors. Eq. (4) is the best known rate among all efficient private linear regression algorithms, strictly improving upon existing methods when log(1/δ) = Õ(1).
We discuss some exponential-time algorithms that close the κ^{1/2} gap in Sec. 3.3.

Comparisons with the state of the art. The best existing efficient algorithm, by Varshney et al. (2022), can only handle the case where there is no adversarial corruption, and requires n = Õ(κ²d log(1/δ)/ε + d/α² + κd log(1/δ)/(εα)) to achieve an error rate of σ²α². Compared to Eq. (4), the first term dominates in its dependence on κ, which is a factor of κ larger than Eq. (4). The third term is larger by a factor of κ^{1/2} but smaller by a factor of log^{1/2}(1/δ), compared to the second term in Eq. (4).

In the non-private case, when ε = ∞, a recent line of work has developed algorithms for linear regression that are robust to label corruptions (Bhatia et al., 2015; 2017; Suggala et al., 2019; Dalalyan & Thompson, 2019). Of these, Bhatia et al. (2015); Dalalyan & Thompson (2019) are relevant to our work as they consider the same adversary model as us. When the x_i's and z_i's are sampled from N(0, Σ) and N(0, σ²), Dalalyan & Thompson (2019) proposed a Huber-loss-based estimator that achieves an error rate of σ²α² log²(n/δ) when n = Õ(κ²d/α²). Under the same setting, Bhatia et al. (2015) proposed a hard-thresholding-based estimator that achieves a σ²α² error rate with Õ(d/α²) sample complexity. Our results in Theorem 3 match these rates, except for the suboptimal dependence on log^{4a}(1/α). Another line of work considers both label and covariate corruptions and develops optimal algorithms for parameter recovery (Diakonikolas et al., 2019c;b; Prasad et al., 2018; Pensia et al., 2020; Cherapanamjeri et al., 2020; Jambulapati et al., 2020; Klivans et al., 2018; Bakshi & Prasad, 2021; Zhu et al., 2019; Depersin, 2020). The best existing efficient algorithm, e.g., Pensia et al. (2020), achieves an error rate of σ²α² log(1/α) when n = Õ(d/α²) and the uncorrupted x_i and z_i are sampled from N(0, I) and N(0, σ²).
Under both privacy requirements and adversarial corruption, the only algorithm with a provable guarantee is the exponential-time approach known as High-dimensional Propose-Test-Release (HPTR) of (Liu et al., 2022b, Corollary C.2), which achieves a sample complexity of n = O(d/α² + (d + log(1/δ))/(εα)). Notice that there is no dependence on κ and that the log(1/δ) term scales as 1/(εα), as opposed to the κ^{1/2}d/(εα) scaling in Eq. (4). It remains an open question whether computationally efficient private linear regression algorithms can achieve such a κ-independent sample complexity. Further, HPTR is robust against a stronger adversary who also corrupts the covariates and not just the labels. Under this more powerful adversary, it remains an open question whether there is an efficient algorithm that achieves n = O(d/α² + d/(εα)) sample complexity, even for constant κ and δ.

3.3. LOWER BOUNDS

Necessity of our assumptions. A tail assumption on the covariate x_i, such as Assumption 1, is necessary to achieve the n = O(d) sample complexity in Eq. (4). Even when the covariance Σ is close to identity, without further assumptions on the tail of the covariate x, the result in Bassily et al. (2014) implies that for δ < 1/n and sufficiently large n, no (ε, δ)-DP estimator can achieve excess risk ∥ŵ − w*∥²_Σ better than Ω(d³/(ε²n²)) (see Eq. (3) in Wang (2018)). Note that this lower bound is a factor of d larger than our upper bound, which benefits from the additional tail assumption.

A tail assumption on the noise z_i, such as Assumption 1, is necessary to achieve the n = O(d/(εα)) dependence in the sample complexity of Eq. (4). For heavy-tailed noise, such as noise with bounded k-th moment, the dependence can be significantly larger. (Liu et al., 2022b, Proposition C.5) implies that for δ = e^{−Θ(d)} and x_i and z_i with bounded 4-th moments, any (ε, δ)-DP estimator requires n = Ω(d/(εα²)) to achieve excess risk E[∥ŵ − w*∥²_Σ] = Õ(σ²α²).

The assumption that only the labels are corrupted is critical for Algorithm 1. The average of the (adaptively) clipped gradient can be significantly more biased if the adversary can place the covariates of the corrupted samples in the same direction. In particular, the bound on the bias of our gradient step in Eq. (42) no longer holds; one potential direction is the exponential-time approach of Ashtiani & Liaw (2022). Pursuing this direction is outside the scope of this paper.

Lower bounds under label corruption. Under the α_corrupt label corruption setting (Assumption 2), even with infinite data and without privacy constraints, no algorithm is able to learn w* with ℓ₂ error better than α_corrupt. We provide a formal derivation for completeness.

Proposition 3.1. Let D_{Σ,σ²,w*,K,a} be the class of joint distributions on (x_i, y_i) from the (Σ, σ², w*, K, a)-model in Assumption 1. Let S_{n,α} be an α-corrupted dataset of n i.i.d. samples from some distribution D ∈ D_{Σ,σ²,w*,K,a} under Assumption 2.
Let M be the class of estimators that are functions of the dataset S_{n,α}. Then there exists a positive constant c such that

    min_{n, ŵ∈M} max_{D ∈ D_{Σ,σ²,w*,K,a}, S_{n,α}} E[∥ŵ − w*∥²_Σ] ≥ c α²σ².

A proof is provided in Appendix I.1. A similar lower bound can be found in (Bakshi & Prasad, 2021, Theorem 6.1).

4. ADAPTIVE CLIPPING FOR THE GRADIENT NORM

In the ideal clipping thresholds for the norm and the residual, there are unknown terms that we need to estimate adaptively, (∥w_t − w*∥²_Σ + σ²) and Tr(Σ), up to a constant multiplicative error. We privately estimate the (squared and shifted) distance to the optimum, (∥w_t − w*∥²_Σ + σ²), with Alg. 2, and privately estimate the average input norm, E[∥x_i∥²] = Tr(Σ), with Alg. 3 in App. F. These are used to set the clipping thresholds in Alg. 1. We propose a trimmed-mean approach below for distance estimation; the norm estimator is similar and is provided in App. F.

Private distance estimation using a private trimmed mean. The goal is to estimate the (shifted) distance to the optimum, ∥w_t − w*∥²_Σ + σ², up to a constant multiplicative error. Note that this is precisely the task of estimating the variance of the residual b_i = y_i − w_t^⊤x_i. When there are no adversarial corruptions and no privacy constraints, we can simply use the empirical variance estimator (1/n)Σ_{i∈[n]}(y_i − w_t^⊤x_i)² to obtain a good estimate. However, the empirical variance estimator is not robust against adversarial corruptions, since a single outlier can make the estimate arbitrarily large. A classical idea is to use the trimmed estimator of (Tukey & McLaughlin, 1963), which throws away the 2α fraction of residuals b_i with the largest magnitudes. For datasets with the resilience property assumed in this paper, this guarantees an accurate estimate of the distance to the optimum in the presence of an α fraction of corruptions. To make the estimator private, it is tempting to simply add Laplace noise to the estimate. However, the sensitivity of the trimmed estimator is unknown and depends on the distance to the optimum that we aim to estimate; we cannot determine the variance of the Laplace noise we need to add. Instead, we propose to partition the dataset into k batches, compute an estimate for each batch, and form a histogram over those k estimates.
Using a private histogram mechanism with geometrically increasing bin sizes, we return the bin containing the most estimates, which guarantees a constant-factor approximation of the distance to the optimum. We describe the algorithm as follows.

Algorithm 2: Robust Private Distance Estimator
Input: S_2 = {(x_i, y_i)}_{i=1}^n, current weight w_t, privacy parameters (ε_0, δ_0), ᾱ, ζ
1: Let b_i ← (y_i − w_t^⊤x_i)² for all i ∈ [n], and S ← {b_i}_{i=1}^n
2: Partition S into k = ⌊C_1 log(1/(δ_0ζ))/ε_0⌋ subsets of equal size and let G_j be the j-th subset
3: For j ∈ [k], let ψ_j be the (1 − 3ᾱ)-quantile of G_j and ϕ_j ← (1/|G_j|) Σ_{i∈G_j} b_i 1{b_i ≤ ψ_j}
4: Partition [0, ∞) into bins of geometrically increasing intervals Ω := {. . . , [2^{−1}, 1), [1, 2), [2, 2²), [2², 2³), . . .} ∪ {[0, 0]}
5: Run the (ε_0, δ_0)-DP histogram learner of Lemma B.1 on {ϕ_j}_{j=1}^k over Ω
6: if all the bins are empty then Return ⊥
7: Let [ℓ, r) be a non-empty bin that contains the maximum number of points in the DP histogram
8: Return ℓ

This algorithm gives an estimate of the distance up to a constant multiplicative error, as we show in the following theorem. We provide a proof in App. D.

Theorem 4. Algorithm 2 is (ε_0, δ_0)-DP. For an α_corrupt-corrupted dataset S_2 and an upper bound ᾱ on α_corrupt that satisfy Assumption 1 and 37C₂K²·ᾱ log^{2a}(1/(6ᾱ)) ≤ 1/4, and for any ζ ∈ (0, 1), if

    n = O( (d + log((log(1/(δ_0ζ)))/(ε_0ζ))) log(1/(δ_0ζ)) / (ᾱ²ε_0) ),

with a large enough constant, then, with probability 1 − ζ, Algorithm 2 returns ℓ such that (1/4)(∥w_t − w*∥²_Σ + σ²) ≤ ℓ ≤ 4(∥w_t − w*∥²_Σ + σ²).

Remark 4.1. While DP-STAT (Algorithm 3 in Varshney et al. (2022)) can also be used to estimate ∥w_t − w*∥_Σ + σ (and it would not change the ultimate sample complexity in its dependence on κ, d, ε, and n), there are three important improvements we make: (i) DP-STAT requires knowledge of ∥w*∥_Σ + σ; (ii) our utility guarantee has improved dependence on K and log^{2a}(n); and (iii) Algorithm 2 is robust against label corruption.

Upper bound on clipped good data points. Using the above estimated distance to the optimum in selecting a threshold θ_t, we also need to ensure that we do not clip too many clean data points. The tolerance of our algorithm for reaching the desired level of accuracy is clipping an O(α) fraction of clean data points. This is ensured by the following lemma; we provide a proof in Appendix E.

Lemma 4.2. Under Assumption 1, if θ_t ≥ 9C₂K² log^{2a}(1/(2α))·(∥w* − w_t∥_Σ + σ), then |{i ∈ S_3 ∩ S_good : |w_t^⊤x_i − y_i| ≥ θ_t}| ≤ αn, for all t ∈ [T].
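The batch-then-histogram idea behind Algorithm 2 can be sketched as follows. For brevity, this uses a plain noisy-count histogram rather than the stability-based DP histogram of Lemma B.1, and the batch count k and all toy values are hypothetical:

```python
import numpy as np

def robust_private_distance_est(b, k, alpha_bar, eps0, rng):
    """Sketch of Alg. 2 on squared residuals b_i = (y_i - w_t^T x_i)^2.

    Each batch reports a trimmed mean (zeroing out values above its
    (1 - 3*alpha_bar)-quantile, as in line 3 of Alg. 2); a histogram over
    geometric bins [2^j, 2^{j+1}) is privatized by adding Laplace noise to
    the bin counts, and the left edge of the most popular bin is returned.
    """
    batches = np.array_split(rng.permutation(b), k)
    phis = []
    for g in batches:
        psi = np.quantile(g, 1 - 3 * alpha_bar)      # (1 - 3*alpha_bar)-quantile
        phis.append((g * (g <= psi)).sum() / len(g))  # trimmed mean of the batch
    # Geometric bins: each phi_j falls in [2^j, 2^{j+1}).
    exps = np.floor(np.log2(np.maximum(phis, 1e-12))).astype(int)
    bins, counts = np.unique(exps, return_counts=True)
    noisy = counts + rng.laplace(scale=2.0 / eps0, size=len(counts))
    return 2.0 ** bins[np.argmax(noisy)]              # left edge of winning bin

# Toy check: squared residuals with E[b] = 4, and 2% corrupted to 1e6.
rng = np.random.default_rng(0)
b = (2.0 * rng.normal(size=20_000)) ** 2
b[:400] = 1e6                                         # corrupted residuals
ell = robust_private_distance_est(b, k=50, alpha_bar=0.05, eps0=1.0, rng=rng)
# ell is a constant-factor approximation of E[b] = 4 despite the outliers.
```

The huge corrupted values fall above each batch's quantile and are trimmed away, while the Laplace noise on bin counts, not on the estimate itself, sidesteps the unknown sensitivity of the trimmed mean.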

5. EXPERIMENTS

5.1. DP LINEAR REGRESSION

We present experimental results comparing our proposed technique (DP-ROBGD) with other baselines. We consider non-corrupted regression in this section and defer corrupted regression to the next section. We begin by describing the problem setup and the baseline algorithms.

Experiment Setup. We generate data for all the experiments using the following generative model. The parameter vector w* is uniformly sampled from the surface of the unit sphere. The covariates {x_i}_{i=1}^n are first sampled from N(0, Σ) and then projected onto the unit sphere. We consider diagonal covariances Σ of the following form: Σ[0, 0] = κ, and Σ[i, i] = 1 for all i ≥ 1. Here κ ≥ 1 is the condition number of Σ. We generate the noise z_i from the uniform distribution over [−σ, σ]. Finally, the response variables are generated as y_i = ⟨x_i, w*⟩ + z_i. All the experiments presented below are repeated 5 times and the averaged results are reported. We set the DP parameters as ε = 1 and δ = min(10^{−6}, n^{−2}). Experiments for ε = 0.1 can be found in the Appendix.

Baseline Algorithms. We compare our estimator with the following baselines:
• Non-private algorithms: ordinary least squares (OLS) and one-pass stochastic gradient descent with tail-averaging (SGD). For SGD, we use a constant step size of 1/(2λ_max) with minibatch size n/T, where T = 3κ log n.
• Private algorithms: sufficient statistics perturbation (DP-SSP) (Foulds et al., 2016; Wang, 2018) and differentially private stochastic gradient descent (DP-AMBSSGD) (Varshney et al., 2022). DP-SSP had the best empirical performance among the numerous techniques studied by Wang (2018), and DP-AMBSSGD has the best known theoretical guarantees. The DP-SSP algorithm releases X^⊤X and X^⊤y differentially privately and computes (X^⊤X)^{−1}X^⊤y. DP-AMBSSGD is a private version of SGD where the DP noise is set adaptively according to the excess error in each iteration.
For both algorithms, we use the hyper-parameters recommended in their respective papers. To improve the performance of DP-AMBSSGD, we reduce the clipping threshold recommended by the theory by a constant factor.

DP-ROBGD. We implement Algorithm 1 with the following key changes. Instead of relying on PrivateNormEstimator to estimate Γ, we set it to its true value Tr(Σ). This is done for a fair comparison with DP-AMBSSGD, which assumes knowledge of Tr(Σ). Next, we use 20% of the samples to compute γ_t in line 5 (instead of the 50% stated in Algorithm 1). In our experiments we also present results for a variant of our algorithm, called DP-ROBGD*, which outputs the best iterate based on γ_t instead of the last iterate. One could also perform tail-averaging instead of picking the best iterate. Both of these modifications are primarily used to reduce the variance of the output of Algorithm 1, and they achieved similar performance in our experiments.

Results. Figure 1 presents the performance of the various algorithms as we vary n, κ, and σ. DP-ROBGD outperforms DP-AMBSSGD in almost all settings. DP-SSP has poor performance when the noise σ is low, but performs slightly better than DP-ROBGD in other settings. A major drawback of DP-SSP is its computational complexity, which scales as O(nd² + d^ω). In contrast, the computational complexity of DP-ROBGD has a smaller dependence on d and scales as Õ(ndκ). Thus the latter is more computationally efficient for high-dimensional problems.

5.2. DP ROBUST LINEAR REGRESSION

We now illustrate the robustness of our algorithm. We consider the same experimental setup as above and randomly corrupt an α fraction of the response variables by setting them to 1000. The figure on the right presents the results from this experiment. None of the baselines are robust to adversarial corruptions: they can be made arbitrarily bad by increasing the magnitude of the corruptions. In contrast, DP-ROBGD handles the corruptions well. More experimental results on a harder adversary can be found in the Appendix.
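The fragility of non-robust baselines under this corruption model is easy to reproduce. A minimal sketch (the dimensions, the simple one-round trimmed refit, and all values below are illustrative choices, not the paper's algorithm or experiment code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 5_000, 10, 0.05
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = X @ w_star + rng.uniform(-0.1, 0.1, size=n)
y[: int(alpha * n)] = 1000.0                    # corrupt 5% of the labels

# OLS is destroyed: its error grows with the corruption magnitude.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# A single round of residual trimming (drop the 2*alpha fraction with the
# largest residuals, then refit) already restores accuracy here.
resid = np.abs(X @ w_ols - y)
keep = resid <= np.quantile(resid, 1 - 2 * alpha)
w_trim = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

print(np.linalg.norm(w_ols - w_star))   # large (order of several units)
print(np.linalg.norm(w_trim - w_star))  # small
```

The corrupted points dominate OLS's normal equations, whereas trimming by residual magnitude, the same principle underlying the adaptive residual clipping in Alg. 1, removes their influence almost entirely.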

6. SKETCH OF THE MAIN IDEAS IN THE ANALYSIS

We provide the main ideas behind the proof of Theorem 3. The privacy proof is straightforward: no matter what clipping thresholds we obtain from the private norm estimator and the private distance estimator, the noise we add is always proportional to the clipping threshold, which guarantees privacy. The focus of this section is therefore the utility of the algorithm. The utility proof relies heavily on resilience (Steinhardt et al., 2017) (also known as stability (Diakonikolas & Kane, 2019)), which states that, given a large enough sample set S, various statistics (for example, the sample mean and sample variance) of any large enough subset of S will be close to each other. We provide the formal definition of resilience in Appendix C.

The main effort in proving Theorem 3 lies in the analysis of the gradient descent algorithm. Without clipping and added noise for differential privacy, the convergence of gradient descent for linear regression is well known. The convergence proof of noisy gradient descent is also relatively straightforward. However, our algorithm requires clipping and added noise together, for robustness and privacy respectively, and the key difference between our setting and the classical one is the existence of adversarial bias and random noise in the gradient. We give an overview of the proof of our robust and private gradient descent as follows.

First we introduce some notation. Let g_i^{(t)} = (x_i^⊤ w_t − y_i) x_i be the raw gradient and g̃_i^{(t)} = clip_{θ_t}(x_i^⊤ w_t − y_i) clip_Θ(x_i) be the clipped gradient. Note that when the data follows our distributional assumption, clip_Θ(x_i) = x_i for i ∈ S_good. One step of the gradient update can be written as

w_{t+1} − w* = w_t − η((1/n) Σ_{i∈S} g̃_i^{(t)} + φ_t ν_t) − w*
            = (I − (η/n) Σ_{i∈G} x_i x_i^⊤)(w_t − w*) + (η/n) Σ_{i∈G} x_i z_i + (η/n) Σ_{i∈G} (g_i^{(t)} − g̃_i^{(t)}) − (η/n) Σ_{i∈S_bad} g̃_i^{(t)} − η φ_t ν_t .

In the above equation, the first term is a contraction, meaning that w_t moves toward w*.
The second term captures the noise from the randomness of the dataset. The third term, (η/n) Σ_{i∈G} (g_i^{(t)} − g̃_i^{(t)}), captures the bias introduced by the clipping operation; the fourth term, (η/n) Σ_{i∈S_bad} g̃_i^{(t)}, captures the bias introduced by the adversarial datapoints; and the fifth term captures the added Gaussian noise. The second term is standard and relatively easy to control, so our main focus is on the last three terms.

The third term can be controlled using the resilience property: we prove that, with our estimated threshold, clipping affects only a small number of datapoints, whose contribution to the gradient is collectively small. The fourth term, (η/n) Σ_{i∈S_bad} clip_{θ_t}(x_i^⊤ w_t − y_i) x_i, can be controlled because only a small number of data points have corrupted labels: the factor clip_{θ_t}(x_i^⊤ w_t − y_i) is bounded by the clipping threshold, and the x_i part satisfies the resilience property, which implies that a small set S_bad must have small ∥Σ_{i∈S_bad} x_i∥.

Having controlled the deterministic bias, we then upper bound the fifth term, the Gaussian noise added for privacy, and show that the expected prediction error decreases in every gradient step. The difficulty is that, since our clipping threshold is adaptive, the error decrease at each step depends on the estimation errors of all the previous steps. As a result, in some iterations the estimation error may actually increase. To get around this, we split the iterations into chunks of length κ and argue that the maximum estimation error in a chunk must be a constant factor smaller than in the previous chunk. This implies that we reach the desired error within Õ(κ) steps.
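A minimal sketch of one clipped, noised gradient step of this form (function names and the test constants are illustrative; the noise calibration follows the standard Gaussian mechanism, not the paper's exact constants):

```python
import numpy as np

def clip_scalar(r, threshold):
    """clip_theta(r): cap |r| at threshold, preserving sign."""
    return float(np.clip(r, -threshold, threshold))

def clip_vector(v, threshold):
    """clip_Theta(v): scale v down so its norm is at most threshold."""
    norm = np.linalg.norm(v)
    return v if norm <= threshold else v * (threshold / norm)

def robust_private_gd_step(X, y, w, eta, theta, Theta, eps0, delta0, rng):
    """One full-batch step: per-example gradients with residual clipped at
    theta and covariate clipped at Theta, plus Gaussian noise for DP."""
    n, d = X.shape
    grads = np.stack([
        clip_scalar(X[i] @ w - y[i], theta) * clip_vector(X[i], Theta)
        for i in range(n)
    ])
    # Each clipped gradient has norm <= theta * Theta, so the mean gradient
    # has sensitivity theta * Theta / n; calibrate Gaussian noise to it.
    phi = np.sqrt(2 * np.log(1.25 / delta0)) * theta * Theta / (eps0 * n)
    noise = phi * rng.standard_normal(d)
    return w - eta * (grads.mean(axis=0) + noise)
```

The 1/n sensitivity of the full-batch mean gradient is what drives the sample-complexity gain discussed in the analysis.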

B PRELIMINARY ON DIFFERENTIAL PRIVACY

Our algorithm builds upon two DP primitives: the Gaussian mechanism and the private histogram. The Gaussian mechanism is one example of a larger family of mechanisms known as output perturbation mechanisms. In practice, it is possible to get a better utility trade-off for an output perturbation mechanism by carefully designing the noise, such as the staircase mechanism, which is shown to achieve optimal utility in the variance (Geng et al., 2015) and also in hypothesis testing (Kairouz et al., 2014). However, the gain is only by constant factors, which we do not try to optimize in this paper. We provide a reference for the private histogram below.

Lemma B.1 (Stability-based histogram (Karwa & Vadhan, 2017, Lemma 2.3)). For each bin k, the privately released count p̂_k satisfies P(|p̂_k − p_k| ≤ β) ≥ 1 − α.

When the database is accessed multiple times, we use the following composition theorems to account for the end-to-end privacy leakage.

Lemma B.2 (Parallel composition (McSherry, 2009)). Consider a sequence of interactive queries {q_k}_{k=1}^K, each operating on a subset S_k of the database and each satisfying (ε, δ)-DP. If the S_k's are disjoint, then the composition (q_1(S_1), q_2(S_2), . . . , q_K(S_K)) is (ε, δ)-DP.

Lemma B.3 (Serial composition (Dwork & Roth, 2014)). If a database is accessed with an (ε_1, δ_1)-DP mechanism and then with an (ε_2, δ_2)-DP mechanism, then the end-to-end privacy guarantee is (ε_1 + ε_2, δ_1 + δ_2)-DP.

In most modern privacy analyses of iterative processes, the advanced composition theorem of Kairouz et al. (2015) gives a tight account of the end-to-end privacy budget. It can be improved for specific mechanisms using tighter accountants, e.g., Mironov (2017); Girgis et al. (2021); Wang et al. (2019); Zhu et al. (2022); Gopi et al. (2021).

Lemma B.4 (Advanced composition (Kairouz et al., 2015)). For ε ≤ 0.9, an end-to-end guarantee of (ε, δ)-differential privacy is satisfied if a database is accessed k times, each with an (ε/(2√(2k log(2/δ))), δ/(2k))-differentially private mechanism.
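The advanced composition rule above can be turned into a small per-step budget calculator (a direct transcription of the statement; the function name is ours):

```python
import math

def per_step_budget(eps_total, delta_total, k):
    """Per-access (eps_0, delta_0) so that k adaptive accesses compose to an
    end-to-end (eps_total, delta_total)-DP guarantee, via the advanced
    composition rule of Kairouz et al. (2015); valid for eps_total <= 0.9."""
    assert eps_total <= 0.9
    eps_0 = eps_total / (2 * math.sqrt(2 * k * math.log(2 / delta_total)))
    delta_0 = delta_total / (2 * k)
    return eps_0, delta_0
```

Note the 1/√k decay of the per-step ε, compared with the 1/k decay that naive serial composition (Lemma B.3) would require.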

C DEFINITION OF RESILIENCE

Definition C.1 ((Liu et al., 2022b, Definition 23)). For some α ∈ (0, 1) and ρ_1, ρ_2, ρ_3, ρ_4 ∈ R_+, we say a dataset S_good = {(x_i ∈ R^d, y_i ∈ R)}_{i=1}^n is (α, ρ_1, ρ_2, ρ_3, ρ_4)-resilient with respect to (w*, Σ, σ), for some w* ∈ R^d, positive definite Σ ≻ 0 ∈ R^{d×d}, and σ > 0, if for any T ⊂ S_good of size |T| ≥ (1 − α)n the following holds for all v ∈ R^d:

(1/|T|) Σ_{(x_i, y_i)∈T} ⟨v, x_i⟩(y_i − x_i^⊤ w*) ≤ ρ_1 √(v^⊤Σv) σ ,   (7)

|(1/|T|) Σ_{x_i∈T} ⟨v, x_i⟩² − v^⊤Σv| ≤ ρ_2 v^⊤Σv ,   (8)

|(1/|T|) Σ_{(x_i, y_i)∈T} (y_i − x_i^⊤ w*)² − σ²| ≤ ρ_3 σ² ,   (9)

(1/|T|) Σ_{(x_i, y_i)∈T} ⟨v, x_i⟩ ≤ ρ_4 √(v^⊤Σv) .   (10)
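To build intuition for resilience, here is a small one-dimensional numeric sketch of property (10) with Σ = 1 (constants and seed are illustrative): even an adversarially chosen (1 − α)-fraction subset of i.i.d. Gaussian points can shift the empirical mean only by roughly α√(log(1/α)), not by a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 20_000, 0.05
x = np.sort(rng.standard_normal(n))

# Worst-case subset choice: keep everything except the alpha*n largest
# points, which biases the empirical mean downward as much as possible.
k = int(alpha * n)
worst_subset_mean = x[: n - k].mean()
# |worst_subset_mean| is on the order of alpha * sqrt(log(1/alpha)) ~ 0.1,
# consistent with a resilience bound rho_4 = C * alpha * log^{1/2}(1/alpha).
```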

D PROOF OF THEOREM 4 ON THE PRIVATE DISTANCE ESTIMATION

We first analyze the privacy. Changing one data point (x_i, y_i) can affect at most one partition in {G_j}_{j=1}^k. This affects at most two histogram bins, increasing the count of one bin by one and decreasing the count of another by one. Under such bounded ℓ_1 sensitivity, the privacy guarantee follows from Lemma B.1.

Next, we analyze the utility. In the (private) histogram step, we claim that at most two consecutive bins can be occupied by the φ_j's. This remains true for the private histogram, because the private histogram of Lemma B.1 adds noise to non-empty bins only. By Lemma B.1, if k ≥ c log(1/(δ_0 ζ_0))/ε_0, one of these two intervals (whose union contains the true distance ∥w_t − w*∥²_Σ + σ²) is released. This results in a multiplicative error bound of four, as the bin sizes increase by factors of two.

To show that only two bins are occupied, we show that all the φ_j's are close to the true distance. We first show that each partition contains at most a 2ᾱ fraction of corrupted samples, and thus all partitions are (2ᾱ, 6ᾱ, 6ρ̄, 6ρ̄, 6ρ̄, 6ρ̄′)-corrupt good, where ρ̄(C_2, K, a, ᾱ) = C_2 K² ᾱ log^{2a}(1/(6ᾱ)) and ρ̄′(C_2, K, a, ᾱ) = C_2 K ᾱ log^a(1/(6ᾱ)), as defined in Definition J.6. Let B = ⌊n/k⌋ be the sample size in each partition, and let ζ_0 = ζ/2. Since the partition is drawn uniformly at random, for each partition G_j the number of corrupted samples α′B satisfies α′B ∼ Hypergeometric(n, α_corrupt n, n/k). A tail bound gives that, with probability 1 − ζ_0, α′ ≤ α_corrupt + (k/n) log(2/ζ_0) ≤ 2ᾱ, where the last inequality follows from the corruption level being bounded by α_corrupt ≤ ᾱ and the assumption on the sample size in Eq. (6), which implies n ≳ log(1/(δ_0 ζ_0)) log(1/ζ_0)/(ᾱ ε_0). For a particular subset G_j, Lemma J.7 implies that if B = O((d + log(1/ζ_0))/ᾱ²) with a large enough constant, then G_j is an (α′, 6ᾱ, 6ρ̄, 6ρ̄, 6ρ̄, 6ρ̄′)-corrupt good set with respect to (w*, Σ, σ) from Assumption 1.
This means that there exists a constant C 2 > 0 such that for any T 1 ⊂ S good with |T 1 | ≥ (1 -6 ᾱ)B, we have 1 |T 1 | i∈T1 ⟨x i , w * -w t ⟩ 2 -∥w * -w t ∥ 2 Σ ≤ 6C 2 K 2 ᾱ log 2a (1/(6 ᾱ))∥w * -w t ∥ 2 Σ , 1 |T 1 | i∈T1 z 2 i -σ 2 ≤ 6C 2 K 2 ᾱ log 2a (1/(6ᾱ))σ 2 , and 1 |T 1 | i∈T1 z i ⟨x i , w * -w t ⟩ ≤ 6C 2 K 2 ᾱ log 2a (1/(6ᾱ))∥w * -w t ∥ Σ σ . Note that for i ∈ S good , b i = z 2 i + 2z i (w * -w t ) ⊤ x i + (w * -w t ) ⊤ x i x ⊤ i (w * -w t ) . By the triangular inequality, we know, under above conditions, 1 |T 1 | i∈T1 b i -∥w * -w t ∥ 2 Σ -σ 2 ≤ 12C 2 K 2 ᾱ log 2a (1/(6ᾱ))(∥w * -w t ∥ 2 Σ + σ 2 ) . ( ) Which also implies that any subset T 2 ⊂ S good and |T 2 | ≤ 6ᾱ|S good |, we have 1 |T 2 | i∈T2 b i -∥w * -w t ∥ 2 Σ -σ 2 ≤ 12C 2 K 2 log 2a (1/(6ᾱ))(∥w * -w t ∥ 2 Σ + σ 2 ) . ( ) Recall that ψ j is the (1-3 ᾱ)-quantile of the dataset G j . Let T := {i ∈ S good : b i ≤ ψ j }, where with a slight abuse of notations, we use S good to denote the set of uncorrupted samples corresponding to G j and S bad to denote the set of corrupted samples corresponding to G j . Since the corruption is less than α ′ , we know (1 -3ᾱ -α ′ )B ≤ |T | ≤ (1 -3ᾱ + α ′ )B. By our assumption that α ′ ≤ 2 ᾱ, we have | Ē| ≥ (3ᾱ -α ′ )B ≥ ᾱB where Ē := S good \ E. Using Eq. ( 12) with a choice of T 2 = Ē, we get that min i∈ Ē b i -∥w * -w t ∥ 2 Σ -σ 2 ≤ 12C 2 K 2 log 2a (1/(6ᾱ))(∥w * -w t ∥ 2 Σ + σ 2 ) . ( ) This implies that ψ j ≤ 12C 2 K 2 log 2a (1/(6 ᾱ))(∥w * -w t ∥ 2 Σ + σ 2 ). ( ) Hence ϕ j -∥w * -w t ∥ 2 Σ -σ 2 = 1 B i∈Gj b i • 1{b i ≤ ψ j } -∥w * -w t ∥ 2 Σ -σ 2 = 1 B i∈T b i -∥w * -w t ∥ 2 Σ -σ 2 + 1 B i∈S bad b i • 1{b i ≤ ψ j } ≤ 37C 2 K 2 • ᾱ log 2a (1/(6 ᾱ))(∥w * -w t ∥ 2 Σ + σ 2 ), where we applied Eq. ( 14) and Eq. ( 11) in the last inequality. 
Under review as a conference paper at ICLR 2023

On a fixed partition G_j, we showed that if B = O((d + log(1/ζ_0))/ᾱ²) then, with probability 1 − ζ_0, |φ_j − ∥w* − w_t∥²_Σ − σ²| ≤ (1/4)(∥w* − w_t∥²_Σ + σ²), which follows from our assumption that 37C_2 K² ᾱ log^{2a}(1/(6ᾱ)) ≤ 1/4. Using a union bound over all partitions, if B = O((d + log(k/ζ_0))/ᾱ²), then with probability 1 − ζ_0, |φ_j − ∥w* − w_t∥²_Σ − σ²| ≤ (1/4)(∥w* − w_t∥²_Σ + σ²) holds for all j ∈ [k]. Since the ratio between the upper and the lower bound is 5/3, which is less than 2, all the φ_j must lie in at most two consecutive bins, which results in a factor-of-4 multiplicative error.

E PROOF OF LEMMA 4.2 ON THE UPPER BOUND ON CLIPPED GOOD POINTS

Let ρ̄(C_2, K, a, α) = 2C_2 K² α log^{2a}(1/(2α)) and ρ̄′(C_2, K, a, α) = 2C_2 K α log^a(1/(2α)). Lemma J.7 implies that if n = O((d + log(1/ζ))/α²) with a large enough constant, then there exists a universal constant C_2 such that S_3 is (α_corrupt, 2α, ρ̄, ρ̄, ρ̄, ρ̄′)-corrupt good with respect to (w*, Σ, σ). The rest of the proof is under this (deterministic) resilience condition. By the resilience property in Eq. (8), we know that for any T ⊂ S_good with |T| ≥ (1 − 2α)n,

|(1/|T|) Σ_{i∈T} (w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t) − ∥w* − w_t∥²_Σ| ≤ 2C_2 K² α log^{2a}(1/(2α)) ∥w* − w_t∥²_Σ .   (16)

Let E := {i ∈ S_good : (w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t) > ∥w* − w_t∥²_Σ (8C_2 K² log^{2a}(1/(2α)) + 1)}. Denote α̃ := |E|/n. We want to show that α̃ ≤ α/2. Let T be the set of points with the smallest (1 − α/2)-fraction of values in {(w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t)}_{i∈S_good}. We know |T| = (1 − α/2)n ≥ (1 − 2α)n. To prove by contradiction, suppose α̃ > α/2, which means all data points in S_good \ T exceed ∥w* − w_t∥²_Σ (8C_2 K² log^{2a}(1/(2α)) + 1). From the resilience property in Eq. (16), we know

(1/n) Σ_{i∈S_good} (w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t)
  = (1/n) Σ_{i∈T} (w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t) + (1/n) Σ_{i∈S_good\T} (w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t)
  ≥ (1 − α/2)(1 − 2C_2 K² α log^{2a}(1/(2α))) ∥w* − w_t∥²_Σ + (α/2)(8C_2 K² log^{2a}(1/(2α)) + 1) ∥w* − w_t∥²_Σ
  > (1 + 2C_2 K² α log^{2a}(1/(2α))) ∥w* − w_t∥²_Σ ,

which contradicts Eq. (16) applied to S_good. This shows α̃ ≤ α/2. Similarly, we can show that |{i ∈ S_good : z_i² > σ² (8C_2 K² log^{2a}(1/(2α)) + 1)}| ≤ αn/2. This means the remaining (1 − α)n points in S_good satisfy √((w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t)) + |z_i| ≤ (∥w_t − w*∥_Σ + σ) √(8C_2 K² log^{2a}(1/(2α)) + 1). Note that for all i ∈ S_good, we have |x_i^⊤ w_t − y_i| = |x_i^⊤ (w_t − w*) − z_i| ≤ |x_i^⊤ (w_t − w*)| + |z_i| = √((w* − w_t)^⊤ x_i x_i^⊤ (w* − w_t)) + |z_i|. By our assumption that C_2 K² log^{2a}(1/(2ᾱ)) ≥ 1, which follows from Assumption 2, we have

|{i ∈ S_good : |x_i^⊤ w_t − y_i| ≤ (∥w_t − w*∥_Σ + σ) √(9C_2 K² log^{2a}(1/(2α)))}| ≥ (1 − α)n .
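The quantile-then-trim statistic φ_j analyzed in the private distance estimation (Appendix D) can be sketched numerically (illustrative constants; here Σ = I, so ∥w* − w_t∥²_Σ is the squared Euclidean distance):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha, sigma = 20_000, 5, 0.01, 1.0
w_star, w_t = np.ones(d), np.zeros(d)

X = rng.standard_normal((n, d))
y = X @ w_star + sigma * rng.standard_normal(n)
y[: int(alpha * n)] = 1e6                # corrupted labels

b = (X @ w_t - y) ** 2                   # per-point squared residuals
psi = np.quantile(b, 1 - 3 * alpha)      # the (1 - 3*alpha)-quantile
phi = b[b <= psi].sum() / n              # trimmed mean: outliers are cut off

true_value = np.linalg.norm(w_star - w_t) ** 2 + sigma ** 2   # = 6 here
```

The naive mean of b is dominated by the corrupted points, while the trimmed statistic φ stays within a constant factor of ∥w* − w_t∥²_Σ + σ², which is all the geometric binning needs.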

F PRIVATE NORM ESTIMATION: ALGORITHM AND ANALYSIS

Algorithm 3: Private Norm Estimator
Input: S_1 = {(x_i, y_i)}_{i=1}^n, target privacy (ε_0, δ_0), failure probability ζ.
1. Let a_i ← ∥x_i∥² and S = {a_i}_{i=1}^n.
2. Partition S into k = ⌊C_1 log(1/(δ_0 ζ))/ε_0⌋ subsets of equal size and let G_j be the j-th partition.
3. For each j ∈ [k], let ψ_j = (1/|G_j|) Σ_{i∈G_j} a_i.
4. Partition [0, ∞) into bins of geometrically increasing intervals: Ω := {. . . , [2^{−2/4}, 2^{−1/4}), [2^{−1/4}, 1), [1, 2^{1/4}), [2^{1/4}, 2^{2/4}), . . .} ∪ {[0, 0]}.
5. Run the (ε_0, δ_0)-DP histogram learner of Lemma B.1 on {ψ_j}_{j=1}^k over Ω.
6. If all the bins are empty, return ⊥.
7. Let [ℓ, r) be a non-empty bin that contains the maximum number of points in the DP histogram; return ℓ.

Lemma F.1. Algorithm 3 is (ε_0, δ_0)-DP. If {x_i}_{i=1}^n are i.i.d. samples from a (K, a)-sub-Weibull distribution with zero mean and covariance Σ, and n = Õ(log^{2a}(1/(δ_0 ζ))/ε_0) with a large enough constant, then Algorithm 3 returns Γ such that, with probability 1 − ζ, (1/√2) Tr(Σ) ≤ Γ ≤ √2 Tr(Σ). We provide a proof in App. F.1.
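A simplified executable sketch of Algorithm 3 (with plain Gaussian noise on the occupied bin range standing in for the stability-based histogram of Lemma B.1; the function name and constants are illustrative):

```python
import numpy as np

def private_norm_estimator(X, eps0, delta0, k, rng):
    """Constant-factor private estimate of Tr(Sigma) from rows of X."""
    n = X.shape[0]
    a = np.linalg.norm(X, axis=1) ** 2
    # Per-partition means psi_j; each data point touches exactly one partition.
    psi = a[: (n // k) * k].reshape(k, -1).mean(axis=1)
    # Geometric bins [2^{m/4}, 2^{(m+1)/4}); bin index is floor(4*log2(psi)).
    bins = np.floor(4 * np.log2(psi)).astype(int)
    lo, hi = bins.min() - 1, bins.max() + 1
    counts = np.array([(bins == m).sum() for m in range(lo, hi + 1)], float)
    # Noise the counts (Gaussian mechanism; one point moves one psi_j,
    # changing at most two bin counts by one each).
    scale = np.sqrt(2) * np.sqrt(2 * np.log(1.25 / delta0)) / eps0
    counts += scale * rng.standard_normal(counts.size)
    m_star = lo + int(np.argmax(counts))
    return 2.0 ** (m_star / 4)           # left endpoint of the winning bin
```

Because the partition means concentrate around Tr(Σ), they land in at most two adjacent geometric bins, so releasing either bin's left endpoint is correct up to a √2-type constant factor.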

F.1 PROOF OF LEMMA F.1 ON THE PRIVATE NORM ESTIMATION

By Hanson-Wright inequality in Lemma J.1 and union bound, there exists constant c > 0 such that with probability 1 -ζ, | 1 b b i=1 ∥x i ∥ 2 -Tr(Σ)| ≤ cK 2 Tr(Σ) log(1/ζ) b + log 2a (1/ζ) b , This means there exists a constant c ′ > 0 such that if b ≥ c ′ K 2 log 2a (k/ζ), then for all j ∈ [k]. |ψ j -Tr(Σ)| ≤ 2 1/8 Tr(Σ) With probability 1-ζ, {ψ j } k j=1 lie in interval of size 2 1/4 Tr(Σ). Thus, at most two consecutive bins are filled with {ψ j } k j=1 . Denote them as I = I 1 ∪ I 2 . Our analysis indicates that P(ψ i ∈ I) ≥ 0.99. By private histogram in Lemma B.1, if k ≥ log(1/(δζ))/ε, |p I -pI | ≤ 0. 01 where pI is the empirical count on I and pI is the noisy count on I. Under this condition, one of these two intervals are released. This results in multiplicative error of √ 2. G PROOF OF THE RESILIENCE IN LEMMA J.7 We apply following resilience property for general distribution characterized by Orlicz function from Zhu et al. (2019) . Lemma G.1 ((Zhu et al., 2019, Theorem 3.4)). Dataset S = {x i ∈ R d } n i=1 consists i.i.d. samples from a distribution D. Suppose D is zero mean and satisfies E x∼D ψ (v ⊤ x) 2 κ 2 E x∼D [(v ⊤ x) 2 ] ≤ 1 for all v ∈ R d , where ψ(•) is Orlicz function. Let Σ = E x∼D [xx ⊤ ]. Suppose α ≤ ᾱ, where ᾱ satisfies (1 + ᾱ/2) • 2κ 2 ᾱψ -1 (2/ᾱ) < 1/3, ᾱ ≤ 1/4. Then there exists constant c 1 , C 2 such that if n ≥ c 1 ((d + log(1/ζ))/(α 2 )), with probability 1 -ζ, for any T ⊂ S of size |T | ≥ (1 -α)n, the following holds: Σ -1/2 1 |T | i∈T x i ≤ C 2 κα ψ -1 (1/α) and I d -Σ -1/2 1 |T | i∈T x i x ⊤ i Σ -1/2 2 ≤ C 2 κ 2 αψ -1 (1/α) . Let ψ(t) = e t 1/(2a) . It is easy to see that ψ(t) is a valid Orlicz function. Then if x i is (K, a)-sub-Weibull, then we know Σ -1/2 1 |T | i∈T x i ≤ C 2 Kα log 2a (1/α) , and I d -Σ -1/2 1 |T | i∈T x i x ⊤ i Σ -1/2 2 ≤ C 2 K 2 α log 2a (1/α) . This implies (1 -C 2 K 2 α log 2a (1/α))I d ⪯ Σ -1/2 1 |T | i∈T x i x ⊤ i Σ -1/2 ⪯ (1 + C 2 K 2 α log 2a (1/α))I d . 
Using the fact that C ⊤ AC ⪯ C ⊤ BC if A ⪯ B, we know (1 -C 2 K 2 α log 2a (1/α))Σ ⪯ 1 |T | i∈T x i x ⊤ i ⪯ (1 + C 2 K 2 α log 2a (1/α))Σ . This implies resilience properties of x i and z i in Eq. ( 8) and Eq. ( 9) in Definition C.1 respectively. Next, we show the resilience property of x i z i . By ab ≤ a 2 2 + b 2 2 , for any fixed v ∈ R d , E[exp | ⟨x i z i , v⟩ | 2 K 4 σ 2 v ⊤ Σv 1/(4a) ] ≤ E exp | ⟨x i , v⟩ | 2 K 2 v ⊤ Σv 1/(2a) /2 exp z 2 i K 2 σ 2 1/(2a) /2 (26) ≤ 1 2 E exp | ⟨x i , v⟩ | 2 K 2 v ⊤ Σv 1/(2a) + E exp z 2 i K 2 σ 2 1/(2a) (27) ≤ 1 . ( ) Since E[x i z i ] = 0, (Zhu et al., 2019, Lemma E.3) implies that there exists constant c 1 , C 2 > 0 such that if n ≥ c 1 (d+log(1/ζ))/(α 2 ), with probability 1-ζ, for any T ⊂ S good of size |T | ≥ (1-α)n, Σ -1 1 |T | i∈T x i z i ≤ C 2 K 2 σα log 2a (1/α) . H PROOF OF THEOREM 3 ON THE ANALYSIS OF ALGORITHM 1 The main theorem builds upon the following lemma that analyzes a (stochastic) gradient descent method, where the randomness is from the DP noise we add and the analysis only relies on certain deterministic conditions on the dataset including resilienece and concentration. Theorem 3 follows in a straightforward manner by collecting Theorem 4, Lemma F.1, Lemma 4.2, and Lemma H.1. Lemma H.1. Algorithm 1 is (ε, δ)-DP. 
Under Assumptions 1 and 2 for any ζ ∈ (0, 1) and α ≥ α corrupt satisfying K 2 α log 2a (1/α) log(κ) ≤ c for some universal constant c > 0, if distance threshold is small enough such that θ t ≤ 3C 1/2 2 K log a (1/(2α)) • (∥w * -w t ∥ Σ + σ) , and large enough such that the number of clipped clean data points is no larger than αn, at every round, the norm threshold is large enough such that Θ ≥ K Tr(Σ) log a (n/ζ) , and sample size is large enough such that n = O K 2 d log(d/ζ) log 2a (n/ζ) + d + log(1/ζ) α 2 + K 2 T 1/2 d log(T /δ) log a (n/(αζ)) εα , with a large enough constant, then the choices of a step size, η = 1/(Cλ max (Σ)) for some C ≥ 1.1, and the number of iterations, T = Θ (κ log (∥w * ∥)) , ensures that Algorithm 1 outputs w T satisfying the following with probability 1 -ζ: E ν1,••• ,νt∼N (0,I d ) [∥w T -w * ∥ 2 Σ ] ≲ K 4 σ 2 log 2 (κ)α 2 log 4a (1/α) , where the expectation is taken over the noise added for DP and Θ(•) hides logarithmic terms in K, σ, d, n, 1/ε, log(1/δ), 1/α. Proof of Lemma H.1. We first prove a set of deterministic conditions on the clean dataset, which is sufficient for the analysis of the gradient descent. Step 1: Sufficient deterministic conditions on the clean dataset. Let S good be the uncorrupted dataset for S 3 and S bad be the corrupted datapoints in S 3 . Let G := S good ∩ S 3 = S 3 \ S bad denote the clean data that remains in the input dataset. Let λ max = ∥Σ∥ 2 . Define Σ := (1/n) i∈G x i x ⊤ i , B := I d -η Σ. Lemma J.4 implies that if n = O(K 2 d log(d/ζ) log 2a (n/ζ)), then 0.9Σ ⪯ Σ ⪯ 1.1Σ . ( ) We pick step size η such that η ≤ 1/(1.1λ max ) to ensure that η ≤ 1/∥ Σ∥ 2 . Since the covariates {x i } i∈S are not corrupted, from Lemma J.3, we know with probability 1 -ζ, for all i ∈ S 3 , ∥x i ∥ 2 ≤ K 2 Tr(Σ) log 2a (n/ζ) . 
( ) Lemma J.7 implies that if n = O((d + log(1/ζ))/(α 2 )), then there exists a universal constant C 2 such that S 3 is, following Definition J.6, with respect to (w * , Σ, σ), (α corrupt , α, C 2 K 2 α log 2a (1/α), C 2 K 2 α log 2a (1/α), C 2 K 2 α log 2a (1/α), C 2 Kα log a (1/α))- corrupt good. Such corrupt good sets have a sufficiently large, 1 -α corrupt , fraction of points that satisfy a good property that we need: resilience. The rest of the proof is under Eq. ( 34), Eq. ( 35), and that S good is resilient. Step 2: Upper bounding the deterministic noise in the gradient. In this step, we bound the deviation of the gradient from its mean. There are several sources of deviation: (i) clipping, (ii) adversarial corruptions, and (iii) randomness of the data noise and privacy noise. We will show that deviations from all these sources can be controlled deterministically under the corrupt-goodness (i.e., resilience). Let ϕ t = ( 2 log(1.25/δ 0 )Θθ t )/(ε 0 n), which ensures that we add enough noise to guarantee (ε 0 , δ 0 )-DP for each step of gradient descent. This follows from the standard Gaussian mechanism in Lemma 2.2 and the fact that each gradient is clipped to the norm of Θθ t , resulting in a DP sensitivity of Θθ t /n. The fact that this sensitivity scales as 1/n is one of the main reasons for the performance gain we get over Varshney et al. (2022) that uses a minimatch of size n/κ with sensitivity scaling as κ/n. Define g (t) i := x i (x ⊤ i w t -y i ). For i ∈ S good , we know y i = x ⊤ i w * + z i . Let gi = clip Θ (x i )clip θt (x ⊤ i w t -y i ). Note that under Eq. ( 35), clip Θ (x i ) = x i for all i ∈ S 3 . 
From Algorithm 1, we can write one-step update rule as follows:  and u (3) w t+1 -w * =w t -η 1 n i∈S g(t) i + ϕ t ν t -w * = I - η n i∈G x i x ⊤ i (w t -w * ) + η n i∈G x i z i + η n i∈G (g (t) i - g(t) i ) -ηϕ t ν t - η n i∈S bad g(t) i (36) Let E t := {i ∈ G : θ t ≤ |x ⊤ i w t -y i |} be the set of clipped clean data points such that i∈G (g (t) i - g(t) i ) = i∈Et (g (t) i - g(t) i ). We define v := (1/n) i∈G x i z i , u := (1/n) i∈Et x i x ⊤ i (w t - w * ), u (2) t := (1/n) i∈Et -x i z i , t := (1/n) i∈S bad ∪Et g(t) i . We can further write the update rule as: w t+1 -w * = B(w t -w * ) + ηv + ηu (1) t-1 + ηu (2) t-1 -ηϕ t ν t -ηu (3) t-1 . (37) We bound each term one-by-one. Since G ⊂ S good and |G| = (1 -α corrupt )n, using the resilience property in Eq. ( 7), we know ∥Σ -1/2 v∥ = (1 -α corrupt ) max ∥v∥=1 Σ -1/2 v, 1 (1 -α corrupt )n i∈G x i z i ≤ (1 -α corrupt )C 2 K 2 α log 2a (1/α)σ (38) ≤ C 2 K 2 α log 2a (1/α)σ . Let α = |E t |/n. By assumption, we know α ≤ α (which holds for the given dataset due to Lemma 4.2), and ∥Σ -1/2 u (1) t ∥ = ∥Σ -1/2 1 n i∈Et x i x ⊤ i (w t -w * )∥ . From Corollary J.8, we know ∥Σ -1/2 1 |E t | i∈Et x i x ⊤ i (w t -w * )∥ -∥w t -w * ∥ Σ = max u:∥u∥=1 1 |E t | i∈Et u ⊤ Σ -1/2 x i x ⊤ i (w t -w * )∥ -max v:∥v∥=1 v ⊤ Σ 1/2 (w t -w * ) ≤ max u:∥u∥=1 1 |E t | i∈Et u ⊤ Σ -1/2 x i x ⊤ i Σ -1/2 Σ 1/2 (w t -w * )∥ -u ⊤ Σ 1/2 (w t -w * ) ≤ max u:∥u∥=1 1 |E t | i∈Et u ⊤ Σ -1/2 x i x ⊤ i Σ -1/2 -I d Σ 1/2 (w t -w * )∥ = 1 |E t | i∈Et Σ -1/2 x i x ⊤ i Σ -1/2 -I d Σ 1/2 (w t -w * ) ≤ 1 |E t | i∈Et Σ -1/2 x i x ⊤ i Σ -1/2 -I d • Σ 1/2 (w t -w * ) ≤ 2 - α α C 2 K 2 α log 2a (1/α) ∥w t -w * ∥ Σ .

This implies that

∥Σ -1/2 u (1) t ∥ ≤ ∥Σ -1/2 1 n i∈E x i x ⊤ i (w t -w * )∥ ≤ α + 2C 2 K 2 α log 2a (1/α) ∥w t -w * ∥ Σ ≤ 3C 2 K 2 α log 2a (1/α) ∥w t -w * ∥ Σ , where the last inequality follows from the fact that α ≤ α and our assumption that C 2 K 2 log 2a (1/ᾱ) ≥ 1 from Assumption 2. Similarly, we use resilience property in Eq. ( 7) instead of Eq. ( 8), we can show that ∥Σ -1/2 u (2) t ∥ ≤ 3C 2 K 2 α log 2a (1/α)σ . Next, we consider u (3) t . Since |S bad | ≤ α corrupt n and |E t | ≤ αn, using Eq. ( 10) and Corollary J.8, we have ∥Σ -1/2 u (3) t ∥ = max v:∥v∥=1 1 n i∈S bad ∪Et v ⊤ Σ -1/2 x i clip θt (x ⊤ i w t -y i ) ≤ 2C 2 Kα log a (1/α)θ t ≤ 6C 1.5 2 K 2 α log 2a (1/α)(∥w t -w * ∥ Σ + σ) . Now we use Eq. ( 39), Eq. ( 40), Eq. ( 41) and Eq. ( 42) to bound the final error from update rule in Eq. (37). Step 3: Analysis of the t-steps recurrence relation. We have controlled the deterministic noise in the last step. In this step, we will upper bound the noise introduced by the Gaussian noise for the purpose of privacy, and show the expected distance to optimum decrease every step. Define u t = (v + u (1) t + u (2) t -u t ). We can rewrite Eq. ( 37) as w t+1 -w * = B(w t -w * ) + ηu t -ηϕ t ν t (43) = Bt+1 (w 0 -w * ) + η t i=0 Bi u t-i -η t i=0 ϕ t-i Bi ν t-i . Taking expectations of Σ-norm square with respect to ν 1 , • • • , ν t , we have E ν1,...,νt∼N (0,I d ) ∥w t+1 -w * ∥ 2 Σ (45) ≤ 2∥ Bt+1 (w 0 -w * )∥ 2 Σ + 2E[∥η t i=0 Bi u t-i ∥ 2 Σ] + η 2 t i=0 Tr( B2i Σ)E[ϕ 2 t-i ] (46) ≤ 2∥ Bt+1 (w 0 -w * )∥ 2 Σ + 2η 2 E[ t i=0 t j=0 ∥ Bi u t-i ∥ Σ∥ Bj u t-j ∥ Σ] (47) + η 2 t i=0 Tr( B2i Σ)E[ϕ 2 t-i ] , where at the second step we used the fact that ν 1 , ν 2 , • • • , ν t are independent isotropic Gaussian. Note that η∥ Bi u t-i ∥ Σ = η∥ Σ1/2 Bi Σ1/2 Σ-1/2 u t-i ∥ ≤ η∥ Σ1/2 Bi Σ1/2 ∥ 2 • ∥ Σ-1/2 u t-i ∥ ≤ η∥ Σ1/2 Bi Σ1/2 ∥ 2 ρ(α) (∥w t-i -w * ∥ Σ + σ) ≤ 1 i + 1 ρ(α) (∥w t-i -w * ∥ Σ + σ) , where ρ(α) = 1.1(6C 2 + 6C 1.5 2 )K 2 α log 2a (1/α), and the second inequality follows from Eq. 
( 40), Eq. ( 41), Eq. ( 42) and the deterministic condition in Eq. ( 34). Note that the last inequality is true because η ≤ 1/(1.1λ max ) and ∥ Σ1/2 Bi Σ1/2 ∥ 2 ≤ ∥I d -η Σ∥ i 2 ∥ Σ∥ 2 ≤ λ max /(i + 1) . This implies E[η 2 t i=0 t j=0 ∥ Bi u t-i ∥ Σ∥ Bj u t-j ∥ Σ] (49) ≤ 4 E[ t i=0 t j=0 ρ(α) 2 (i + 1)(j + 1) (E[∥w t-i -w * ∥ 2 Σ] + E[∥w t-j -w * ∥ 2 Σ] + σ 2 ) (50) ≤ 8( t i=0 1 i + 1 ) 2 ρ(α) 2 (max i E[∥w t-i -w * ∥ 2 Σ] + σ 2 ) (51) ≤ 8(log t) 2 ρ(α) 2 (max i E[∥w t-i -w * ∥ 2 Σ] + σ 2 ) , Then, ∥ Bt+1 (w 0 -w * )∥ 2 Σ = ∥ Σ1/2 Bt+1 Σ-1/2 Σ1/2 (w 0 -w * )∥ 2 ≤ (1 - 1 κ ) 2(t+1) ∥w 0 -w * ∥ 2 Σ ≤ e -2(t+1)/κ ∥w 0 -w * ∥ 2 Σ , and for n ≳ (1/ε) κd log(1/δ)/α, η 2 t i=0 Tr( B2i Σ)E[ϕ 2 t-i ] (53) ≤η 2 t i=0 ∥I d -η Σ∥ 2i 2 ∥ Σ∥ 2 • 2 log(1.25/δ 0 )K 2 Tr(Σ) log 2a (n/ζ 0 )C 2 K 2 log 2a (1/(2α))(E[∥w t-i -w * ∥ 2 Σ ] + σ 2 ) ε 2 0 n 2 (54) ≤4 t i=0 ( 1 i + 1 ) 2 ρ(α) 2 (E[∥w t-i -w * ∥ 2 Σ] + σ 2 ) . We have E ν1,...,νt∼N (0,I d ) [∥w t+1 -w * ∥ 2 Σ] ≤ 2e -2(t+1)/κ ∥w 0 -w * ∥ 2 Σ+20(log t) 2 ρ(α) 2 (max i∈[t] E[∥w t-i -w * ∥ 2 Σ]+σ 2 ) . Note that this also implies that E[∥(w t ′ +t -w * )∥ 2 Σ|w t ′ ] ≤ 2e -2t/κ ∥w t ′ -w * ∥ 2 Σ + 20ρ(α) 2 t-1 i=0 ( 1 i + 1 ) 2 (E[∥w t ′ +t-i -w * ∥ 2 Σ|w t ′ ] + σ 2 ) , ) which implies E[∥(w t ′ +t -w * )∥ 2 Σ] ≤ 2e -2t/κ E[∥w t ′ -w * ∥ 2 Σ] + 20ρ(α) 2 t-1 i=0 ( 1 i + 1 ) 2 (E[∥w t ′ +t-i -w * ∥ 2 Σ] + σ 2 ) (57) ≤ 2e -2t/κ E[∥w t ′ -w * ∥ 2 Σ] + 20(log t) 2 ρ(α) 2 (max i∈[t] E[∥w t ′ +t-i -w * ∥ 2 Σ] + σ 2 ) (58) Step 4: End-to-end analysis of the convergence. In the last step, we shown that the amount of estimation error decrease depends on the estimation error of the previous t steps. In order for the estimation error to decrease by a constant factor, we will take t = κ. Roughly speaking, we will prove that for every κ steps, the estimation error will decrease by a constant factor, if it is much larger than O((log κ) 2 ρ(α) 2 σ 2 ). This implies we will reach O((log κ) 2 ρ(α) 2 σ 2 ) error with in Õ(κ) steps. 
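A toy scalar simulation of this chunked recursion (treating the per-chunk inequality as an equality and plugging in hypothetical values of κ, ρ, and σ²) shows the per-chunk error maxima contracting geometrically down to the noise floor:

```python
import math

kappa, rho, sigma2 = 50, 0.01, 1.0
c = 20 * math.log(kappa) ** 2 * rho ** 2    # coefficient of the noise term
contraction = 2 * math.exp(-2.0) + c        # per-chunk factor (2e^{-2} + c)
assert contraction < 1                      # needs rho^2 (log kappa)^2 small

m = 100.0                                   # initial error ||w_0 - w*||_Sigma^2
history = [m]
for s in range(40):                         # 40 chunks of kappa steps each
    m = contraction * m + c * sigma2
    history.append(m)

floor = c * sigma2 / (1 - contraction)      # fixed point: O((log k)^2 rho^2 s^2)
```

After O(log(1/floor)) chunks, i.e., Õ(κ) gradient steps in total, the error sits at the (log κ)² ρ(α)² σ² floor, matching the qualitative claim above.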
Assuming ρ(α) 2 (log κ) 2 ≤ 1/2-1/e 2 , the maximum expected error in a length κ sequence decrease by a factor of 1/2 every time. Now we bound the maximum expected error in the first length κ sequence: max i∈[0,κ-1] E[∥w i - w * ∥ 2 Σ]. Since E[∥w i -w * ∥ 2 Σ] ≤ e -2i/κ ∥w 0 -w * ∥ 2 Σ + (log i) 2 ρ(α) 2 max j∈[0,i-1] E[∥w j -w * ∥ 2 Σ] + (log i) 2 ρ(α) 2 σ 2 . As a function of i, max j∈[0,i-1] E[∥w j -w * ∥ 2 Σ] only increase when it is smaller than 1 1 -(log i) 2 ρ(α) 2 (∥w 0 -w * ∥ 2 Σ + (log i) 2 ρ(α) 2 σ 2 ) . Definition J.6 (Corrupt good set). We say a dataset S is (α corrupt , α, ρ 1 , ρ 2 , ρ 3 , ρ 4 )-corrupt good with respect to (w * , Σ, σ) if it is α corrupt -corruption of an (α, ρ 1 , ρ 2 , ρ 3 , ρ 4 )-resilient dataset S good . Lemma J.7. Under Assumptions 1 and 2, there exists positive constants c 1 and C 2 such that if n ≥ c 1 ((d + log(1/ζ))/α 2 , then with probability 1 -ζ, S good is, with respect to (w * , Σ, σ), (α, C 2 K 2 α log 2a (1/α), C 2 K 2 α log 2a (1/α), C 2 K 2 α log 2a (1/α), C 2 Kα log a (1/α))-resilient. We provide a proof in Appendix G. Corollary J.8 (Lemma 10 from Steinhardt et al. (2017) and Lemma 25 from Liu et al. (2022b) ). For a (α, ρ 1 , ρ 2 , ρ 3 , ρ 4 )-resilient set S with respect to (w * , Σ, γ) and any 0 ≤ α ≤ α, the following holds for any subset T ⊂ S of size at least αn and for any unit vector v ∈ R d :  1 |T | (xi,yi)∈T ⟨v, x i ⟩(y i -x ⊤ i w * ) ≤ 2 - α α ρ 1 √ v ⊤ Σv σ , ( ) 1 |T | xi∈T ⟨v, x i ⟩ 2 -v ⊤ Σv ≤ 2 - α α ρ 2 v ⊤ Σv , ( ) 1 |T | (xi,yi)∈T (y i -x ⊤ i w * ) 2 -σ 2 ≤ 2 - α α ρ 3 σ 2 , and 1 |T | xi∈T ⟨v, x i ⟩ ≤ 2 - α α ρ 4 √ v ⊤ Σv . ≤ e -t 1/a e K(E[⟨v,x⟩ 2 ]) 1/(2a) (71) = exp - t 2 K 2 E[⟨v, x⟩ 2 ] 1/(2a) . ( ) This implies for any fixed v, with probability 1 -ζ, Experimental results for ϵ = 0.1 can be found in Figure 2 . The observations are similar to the ϵ = 1 case. In particular, DP-SSP has poor performance when σ is small. In other settings, DP-SSP has better performance than DP-ROBGD. 
⟨x, v⟩² ≤ K² v^⊤ E[xx^⊤] v · log^{2a}(1/ζ) .

K.2 DP ROBUST LINEAR REGRESSION

In this section, we consider a stronger adversary for DP-ROBGD than the one considered in Section 5. Recall, for the adversary model considered in Section 5, DP-ROBGD was able to consistently estimate the parameter w * (i.e., the parameter recovery error goes down to 0 as n → ∞). This is because the algorithm was able to easily identify the corruptions and ignore the corresponding points while performing gradient descent. We now construct a different instance where the corruptions are hard to identify. Consequently, DP-ROBGD can no longer be consistent against the adversary.



Figure 1: Performance of various techniques on DP linear regression. d = 10 in all the experiments. n = 10^7, κ = 1 in the 2nd experiment. n = 10^7, σ = 1 in the 3rd experiment.

Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 19-30, 2009.


For any integer s ≥ 0, as long as max_{i∈[(s−1)κ+1, sκ]} E[∥w_i − w*∥²_Σ] ≥ 2(log κ)² ρ(α)² σ², we have

max_{i∈[sκ+1, (s+1)κ]} E[∥w_i − w*∥²_Σ] ≤ (1/e² + (log κ)² ρ(α)²) max_{i∈[(s−1)κ+1, sκ]} E[∥w_i − w*∥²_Σ] + (log 2κ)² ρ(α)² σ² .   (59)

Thus we conclude that max_{i∈[0, κ−1]} E[∥w_i − w*∥²_Σ] ≤ (1/(1 − (log κ)² ρ(α)²)) (∥w_0 − w*∥²_Σ + (log κ)² ρ(α)² σ²). Setting s = log(∥w*∥/(ρ(α)σ)) gives E[∥w_{sκ+1} − w*∥²_Σ] ≤ (log κ)² ρ(α)² σ².

Figure 2: Performance of various techniques on DP linear regression. d = 10 in all the experiments. n = 10^7, κ = 1 in the 2nd experiment. n = 10^7, σ = 1 in the 3rd experiment.

Note that f(x) = x^α is a concave function for α ≤ 1 and x > 0. Then (a_1 + ⋯ + a_k)^α ≤ a_1^α + ⋯ + a_k^α holds for any positive numbers a_1, . . . , a_k > 0. By our assumption that 1/(2a) ≤ 1, we have

) in App. H would no longer hold. Against such strong attacks, one requires additional steps to estimate the mean of the gradients robustly and privately, similar to those used in robust private mean estimation Liu et al. (2021); Kothari et al. (2021); Hopkins et al. (2022);

Lemma B.1 (Stability-based histogram (Karwa & Vadhan, 2017, Lemma 2.3)). For every K ∈ N ∪ {∞}, domain Ω, every collection of disjoint bins B_1, . . . , B_K defined on Ω, n ∈ N, ε ≥ 0, δ ∈ (0, 1/n), β > 0, and α ∈ (0, 1), there exists an (ε, δ)-differentially private algorithm M : Ω^n → R^K such that for any set of data X_1, . . . , X_n ∈ Ω^n,
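A minimal executable sketch of a stability-based histogram in this spirit (Laplace noise on non-empty bins only, plus a release threshold; constants follow the standard construction up to factors, and the function name is ours):

```python
import math
import random

def stable_histogram(data, bins, eps, delta, rng):
    """(eps, delta)-DP histogram sketch: noise only the non-empty bins,
    then release a bin only if its noisy count clears a threshold.

    `bins` maps a data point to its bin key. Empty bins are never touched,
    which is what makes arbitrarily large bin collections feasible; the
    threshold absorbs the delta-probability event that a bin occupied by a
    single neighboring dataset leaks through.
    """
    counts = {}
    for x in data:
        key = bins(x)
        counts[key] = counts.get(key, 0) + 1
    threshold = 1 + 2 * math.log(2 / delta) / eps
    released = {}
    for key, c in counts.items():
        # Laplace(2/eps) as a difference of two Exponential(eps/2) draws.
        noisy = c + rng.expovariate(eps / 2) - rng.expovariate(eps / 2)
        if noisy > threshold:
            released[key] = noisy
    return released
```

Heavily occupied bins survive the threshold with their counts accurate to O(1/ε), while near-empty bins are suppressed, matching the |p̂_k − p_k| ≤ β guarantee up to constants.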

APPENDIX A RELATED WORK

Differentially private optimization. There is a long line of work at the intersection of differential privacy and optimization (Chaudhuri et al., 2011; Kifer et al., 2012; Bassily et al., 2014; Song et al., 2013; Bassily et al., 2019; Wu et al., 2017; Andrew et al., 2021; Feldman et al., 2020; Song et al., 2020; Asi et al., 2021; Kulkarni et al., 2021; Kamath et al., 2021; Zhang et al., 2022). As one of the most well-studied problems in differential privacy, DP Empirical Risk Minimization (DP-ERM) aims to privately minimize the empirical risk (1/n) Σ_{i∈S} ℓ(x_i; w). The optimal excess empirical risk for approximate DP (i.e., δ > 0) is known to be GD · √d/(εn), where the loss ℓ is convex and G-Lipschitz with respect to the data and D is the diameter of the convex parameter space (Bassily et al., 2014). This bound can be achieved by several DP-SGD methods, e.g., (Song et al., 2013; Bassily et al., 2014), with different computational complexities. Differentially private stochastic convex optimization considers minimizing the population risk E_{x∼D}[ℓ(x, w)], where data is drawn i.i.d. from some unknown distribution D; using variations of DP-SGD, Bassily et al. (2019) and Feldman et al. (2020) achieve optimal excess population risk.

Differentially private linear regression. Several works (Vu & Slavkovic, 2009; Kifer et al., 2012; Mir, 2013; Dimitrakakis et al., 2014; Wang et al., 2015; Foulds et al., 2016; Minami et al., 2016; Wang, 2018; Sheffet, 2019; Wang & Gu, 2019; Hu et al., 2022) and Milionis et al. (2022) analyze private linear regression algorithms with sub-optimal guarantees. (Dwork & Lei, 2009; Alabi et al., 2020; Amin et al., 2022; Liu et al., 2022b) also consider using robust statistics such as the Tukey median (Tukey, 1975) or the Theil-Sen estimator (Theil, 1950) for differentially private regression. However, (Dwork & Lei, 2009; Amin et al., 2022) lack utility guarantees and (Alabi et al., 2020) is restricted to one-dimensional data. Liu et al. (2022b) achieves optimal sample complexity but takes exponential time.

Robust linear regression.
Robust mean estimation and linear regression have long been studied in the statistics community (Tukey & McLaughlin, 1963; Huber, 1992; Tukey, 1975). However, for high-dimensional data, these estimators, which generalize the notion of the median to higher dimensions, are typically computationally intractable. Recent advances in filter-based algorithms, e.g., (Diakonikolas et al., 2017; 2020; 2019a; 2018; Cheng et al., 2019; Dong et al., 2019), achieve nearly optimal guarantees for mean estimation in time linear in the dimension of the dataset. For robust linear regression, some estimators, e.g., (Liu et al., 2022b), achieve nearly optimal rates using O(d) samples but require exponential time. An important special case of adversarial corruption is when the adversary only corrupts the response variables, studied both in supervised learning (Khetan et al., 2018) and in unsupervised learning (Thekumparampil et al., 2018). For linear regression with label corruption only, (Bhatia et al., 2015; Dalalyan & Thompson, 2019; Kong et al., 2022) achieve nearly optimal rates with O(d) samples. Under the oblivious label-corruption model, i.e., when the adversary corrupts a fraction of labels in complete ignorance of the data, (Bhatia et al., 2017; Suggala et al., 2019) provide consistent estimators ŵ_n, i.e., ∥ŵ_n − w*∥ → 0 as n → ∞.

Robust and private linear regression. Under the setting of both DP and data corruption, the only known algorithm, by Liu et al. (2022b), achieves a nearly optimal rate of α log(1/α)σ with an optimal sample complexity of O(d/α^2 + d/(εα)). However, their algorithm requires exponential time.
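The DP-SGD primitive recurring throughout this section combines per-example gradient clipping with Gaussian noise calibrated to the clipping threshold. Below is a minimal numpy sketch of one full-batch step for least squares; it is not the paper's exact algorithm (which uses a novel adaptive clipping), and the clipping threshold and noise multiplier are illustrative, not calibrated to a specific (ε, δ).

```python
import numpy as np

def dp_gd_step(w, X, y, clip, noise_mult, lr, rng):
    """One full-batch DP gradient step for least squares (illustrative sketch).

    Per-example gradients are clipped to norm `clip`, averaged, and Gaussian
    noise proportional to the clipping threshold is added before the update.
    """
    residuals = X @ w - y                          # shape (n,)
    grads = residuals[:, None] * X                 # per-example gradients, (n, d)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # clip
    noise = rng.normal(0.0, noise_mult * clip / len(y), size=w.shape)
    return w - lr * (grads.mean(axis=0) + noise)

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_star = np.ones(d)
y = X @ w_star + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for _ in range(200):
    w = dp_gd_step(w, X, y, clip=50.0, noise_mult=1.0, lr=0.3, rng=rng)
```

With the clipping threshold large enough to be inactive on most examples, the iterates contract toward the least-squares solution while the injected noise is what yields the privacy guarantee for an appropriately chosen noise multiplier.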
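The filter-based approach to robust mean estimation mentioned above can be sketched as follows: repeatedly test whether the empirical covariance has an abnormally large eigenvalue and, if so, remove the points with the largest projections onto the top eigenvector. This is a simplified illustration, not any specific cited algorithm; the stopping threshold and removal rule are assumptions for the sketch.

```python
import numpy as np

def filter_mean(X, eps, iters=10):
    """Illustrative spectral filter for robust mean estimation.

    While the empirical covariance has a large top eigenvalue, drop the
    eps-fraction of points with the largest squared projections onto the
    top eigenvector, then return the mean of the surviving points.
    """
    X = X.copy()
    for _ in range(iters):
        mu = X.mean(axis=0)
        centered = X - mu
        cov = centered.T @ centered / len(X)
        vals, vecs = np.linalg.eigh(cov)
        if vals[-1] < 1.5:        # covariance close to identity: stop filtering
            break
        scores = (centered @ vecs[:, -1]) ** 2
        keep = scores < np.quantile(scores, 1.0 - eps)  # drop the top eps tail
        X = X[keep]
    return X.mean(axis=0)

rng = np.random.default_rng(1)
n, d, eps = 5000, 20, 0.1
inliers = rng.normal(size=(int(n * (1 - eps)), d))      # true mean is zero
outliers = np.full((int(n * eps), d), 2.0)              # adversarial cluster
X = np.vstack([inliers, outliers])
```

On this example the naive sample mean is pulled far from zero by the outlier cluster, while the filtered mean removes the cluster because it dominates the top eigendirection of the covariance.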

APPENDIX I LOWER BOUNDS

I.1 PROOF OF PROPOSITION 3.1 FOR LABEL CORRUPTION LOWER BOUNDS

We first prove the following lemma.

Lemma I.1. Consider an α label-corrupted dataset S = {(x_i, y_i)}_{i=1}^n with α < 1/2 that is generated from either (i) x_i ∼ N(0, 1) and y_i ∼ N(0, 1) independently, or (ii) x_i ∼ N(0, 1), z_i ∼ N(0, 1 − α^2), and y_i = αx_i + z_i. It is impossible to distinguish the two hypotheses with probability larger than 1/2.

In the first case, (x_i, y_i) ∼ P_1 = N(0, I_2). In the second case, (x_i, y_i) ∼ P_2, the zero-mean Gaussian with unit variances and correlation α, since Cov(x_i, y_i) = E[x_i(αx_i + z_i)] = α and Var(y_i) = α^2 + (1 − α^2) = 1. By simple calculation, it holds that D_KL(P_1 ∥ P_2) ≤ α^2/2 for all α < 1/2. Then, Pinsker's inequality implies that D_TV(P_1 ∥ P_2) ≤ α/2. Since the covariate x_i follows the same distribution in the two cases, and the total variation distance between the two cases is less than α/2, there is a label-corruption adversary that changes an α/2 fraction of the y_i's in P_1 to make it identical to P_2. Therefore, no algorithm can distinguish the two cases with probability better than 1/2 under an α fraction of label corruption.

Since Σ = 1 and σ^2 ∈ [3/4, 1], the first case above has w* = 0 and the second case has w* = α. This implies that no algorithm is able to achieve E[∥ŵ − w*∥_Σ] < σα for all instances with ∥w*∥ ≤ 1 under an α fraction of label corruption.
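As a sanity check on the divergence calculation in this proof, the KL divergence between two zero-mean Gaussians has a closed form that can be evaluated numerically. The sketch below compares the two joint distributions of (x_i, y_i) at a sample value of α; the α^2 comparison in the test is a deliberately loose version of the bound used above.

```python
import numpy as np

def kl_gauss(S1, S2):
    """KL( N(0, S1) || N(0, S2) ) between zero-mean Gaussians, closed form."""
    d = S1.shape[0]
    return 0.5 * (np.trace(np.linalg.inv(S2) @ S1) - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

alpha = 0.3
P1 = np.eye(2)                                  # first case: (x, y) ~ N(0, I)
P2 = np.array([[1.0, alpha], [alpha, 1.0]])     # second case: correlation alpha
kl = kl_gauss(P1, P2)
tv_bound = np.sqrt(kl / 2.0)                    # Pinsker: TV <= sqrt(KL / 2)
```

The KL divergence vanishes at α = 0 and grows as O(α^2), so the Pinsker bound on the total variation distance is O(α), which is what the label-corruption adversary in the proof exploits.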

APPENDIX J TECHNICAL LEMMAS

Lemma J.1 (Hanson-Wright inequality for subWeibull distributions, Sambale (2020)). Let S = {x_i ∈ R^d}_{i=1}^n be a dataset consisting of i.i.d. samples from a (K, a)-subWeibull distribution. Then the following bounds hold, each with probability 1 − ζ. We provide a proof in Appendix J.1.1.

Then there exists a constant c_1 > 0 such that, with probability 1 − ζ, the following holds.

Lemma J.5 (Lemma F.1 from Liu et al. (2022a)). Let x ∈ R^d be drawn from N(0, Σ). Then there exists a universal constant C_6 such that the following holds with probability 1 − ζ.

Hard instance from Bakshi & Prasad (2021). This is a 2-dimensional problem where the first covariate is sampled uniformly from [−1, 1]. The second covariate, which is uncorrelated with the first, is sampled from a distribution with the following pdf. We set σ = 0.1 in our experiments. The noise z_i is sampled uniformly from [−σ, σ]. We consider two possible parameter vectors, w* = (1, 1) and w* = (1, −1). It can be shown that the total variation (TV) distance between these problem instances (each parameter vector corresponds to one problem instance) is Θ(α) (Bakshi & Prasad, 2021). This implies that one can corrupt at most an α fraction of the response variables and convert one problem instance into the other. Since the distance (in the Σ norm) between the two parameter vectors is Ω(ασ), any algorithm will suffer an error of Ω(ασ).

We generate 10^7 samples from this problem instance and add corruptions that convert one problem instance into the other. Figure 3 presents the results from this experiment. It can be seen that our algorithm works as expected: in particular, it is not consistent in this setting, and the parameter recovery error increases with the fraction of corruptions.
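A toy one-dimensional analogue of this experiment can be sketched as follows. The setup is hypothetical: the paper's two-covariate instance depends on a second-covariate pdf not reproduced here, so this sketch only keeps the uniform first covariate and uniform label noise, and the parameter values w1, w2 are illustrative. The adversary rewrites an α fraction of labels as if they came from a second instance, and ordinary least squares then lands between the two instances, so its error grows with the corruption fraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, alpha = 100_000, 0.1, 0.2
w1, w2 = 0.0, 1.0                      # the two (illustrative) instances

x = rng.uniform(-1.0, 1.0, size=n)     # covariate, as in the hard instance
y = w1 * x + rng.uniform(-sigma, sigma, size=n)   # uniform label noise

m = int(alpha * n)                     # corrupt an alpha fraction of labels
y[:m] = w2 * x[:m] + rng.uniform(-sigma, sigma, size=m)

w_hat = (x @ y) / (x @ x)              # ordinary least squares in one dimension
error = abs(w_hat - w1)                # recovery error vs the clean instance
```

Here the OLS estimate concentrates near (1 − α)·w1 + α·w2, so the recovery error scales linearly in α, mirroring the Ω(α) lower-bound behavior observed in Figure 3.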

