LOWER BOUNDS FOR DIFFERENTIALLY PRIVATE ERM: UNCONSTRAINED AND NON-EUCLIDEAN

Abstract

We consider lower bounds for differentially private empirical risk minimization (DP-ERM) of convex functions, in both the constrained and unconstrained cases, with respect to the general ℓ_p norm beyond the ℓ_2 norm considered by most previous works. We provide a simple black-box reduction approach that generalizes lower bounds from the constrained to the unconstrained case. Moreover, for (ε, δ)-DP, we achieve the optimal Ω(√(d log(1/δ))/(nε)) lower bound for both constrained and unconstrained cases and any ℓ_p geometry with p ≥ 1, by considering the ℓ_1 loss over the ℓ_∞ ball.

Under review as a conference paper at ICLR 2023

Bassily et al. (2014); Steinke & Ullman (2016); Wang et al. (2017); Bassily et al. (2019) achieved the tight bound Θ(√(d log(1/δ))/(nε)) for approximate-DP. DP-ERM in the unconstrained case was neglected before and has gathered attention recently. Jain & Thakurta (2014); Song et al. (2021) found a tight bound Õ(√rank/(nε)) for minimizing the excess empirical risk of Generalized Linear Models (GLMs; see Definition A.1 in the Appendix) in the unconstrained case, evading the curse of dimensionality, where rank is the rank of the feature matrix in the GLM problem. As a comparison, the tight bound Θ(√d/(nε)) holds for constrained DP-GLM, even in the overparameterized case when rank ≤ n ≪ d. The dimension-independent result is intriguing, as modern machine learning models are usually huge, with millions to billions of parameters (dimensions). A natural question is whether one can obtain similar dimension-independent results for a more general family of functions beyond GLMs. Unfortunately, Asi et al. (2021) provided a negative answer, giving an Ω(√d/(nε log d)) lower bound for some general convex functions. Their method chooses appropriate objective functions and utilizes one-way marginals, but the extra logarithmic factor in their bound seems nontrivial to remove in the unconstrained case. Another aspect is DP-ERM in non-Euclidean settings.
Most previous works in the literature consider the constrained Euclidean setting, where the convex domain and the (sub)gradients of the objective functions have bounded ℓ_2 norms; DP-ERM with respect to the general ℓ_p norm is much less well understood. Motivated by the importance and wide applications of non-Euclidean settings, some previous works (Talwar et al. (2015); Asi et al. (2021); Bassily et al. (2021b)) analyzed constrained DP-ERM with respect to the general ℓ_p norm with many exciting results, and there is still room for improvement in many regimes. See Table 1 for a comparison; the current best upper bounds are O(min{log d, 1/(p−1)}·√(d log(1/δ))/(nε)) for 1 ≤ p ≤ 2 and O(d^(1−1/p)·√log(1/δ)/(nε)) for 2 < p ≤ ∞.

1. INTRODUCTION

Since the seminal work of Dwork et al. (2006), differential privacy (DP), defined below, has become the standard and rigorous notion of privacy guarantee for machine learning algorithms.

Definition 1.1 (Differential privacy). A randomized mechanism M is (ε, δ)-differentially private if for any event O ⊆ Range(M) and any neighboring databases D and D′ that differ by a single data element, one has Pr[M(D) ∈ O] ≤ exp(ε) · Pr[M(D′) ∈ O] + δ.

Among the rich literature on DP, many fundamental problems are based on empirical risk minimization (ERM), and DP-ERM has become one of the most well-studied problems in the DP community. See, e.g., Chaudhuri & Monteleoni (2008); Rubinstein et al. (2009); Chaudhuri et al. (2011); Kifer et al. (2012); Song et al. (2013); Bassily et al. (2014); Jain & Thakurta (2014); Talwar et al. (2015); Kasiviswanathan & Jin (2016); Fukuchi et al. (2017); Wu et al. (2017); Zhang et al. (2017); Wang et al. (2017); Iyengar et al. (2019); Bassily et al. (2020); Kulkarni et al. (2021); Asi et al. (2021); Bassily et al. (2021b); Wang et al. (2021); Bassily et al. (2021a); Gopi et al. (2022); Arora et al. (2022); Ganesh et al. (2022). In DP-ERM, we are given a family of convex functions, where each function ℓ(·; z) is defined on a convex set K ⊆ R^d, and a dataset D = {z_1, ..., z_n}; the goal is to design a differentially private algorithm that minimizes the loss function L(θ; D) = (1/n)·Σ_{i=1}^n ℓ(θ; z_i). The value L(θ; D) − min_{θ′∈K} L(θ′; D) is called the excess empirical loss of the solution θ, measuring how it compares with the best solution in K. DP-ERM in the constrained case with Euclidean geometry (i.e., with respect to the ℓ_2 norm) was studied first and is the best understood; most of the previous literature belongs to this case.
More specifically, the Euclidean constrained case considers convex loss functions defined on a bounded convex set C ⊂ R^d, assuming the functions are 1-Lipschitz with respect to the ℓ_2 norm over a convex set of diameter 1. For pure-DP (i.e., (ε, 0)-DP), the seminal work of Bassily et al. (2014) achieved the tight bound Θ(d/(nε)).

Table 1: Comparison of lower bounds for private convex ERM. Our entry covers both the constrained and unconstrained settings, any 1 ≤ p ≤ ∞ and general convex losses, with lower bounds Ω(d/(nε)) for pure-DP and Ω(√(d log(1/δ))/(nε)) for approximate-DP. One can easily extend our lower bounds in the unconstrained case to the constrained case. The lower bound of Song et al. (2021) is weaker than ours in the important over-parameterized setting d ≫ n, as rank ≤ min{n, d}.
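For intuition, the excess empirical loss defined above can be computed explicitly for the linear-loss instance of Bassily et al. (2014) over the unit ℓ_2 ball. The following is a minimal sketch; the dataset, dimension and helper name `excess_empirical_loss` are illustrative choices, not from the paper:

```python
import numpy as np

# Excess empirical loss for the linear loss l(theta; z) = -<theta, z> over the
# unit l2 ball: the empirical minimizer is theta* = mean / ||mean||_2.
def excess_empirical_loss(theta, data):
    mean = data.mean(axis=0)
    best = -np.linalg.norm(mean)        # loss attained at theta*
    return -theta @ mean - best

rng = np.random.default_rng(0)
data = rng.choice([-1.0, 1.0], size=(100, 8)) / np.sqrt(8)
theta_star = data.mean(axis=0) / np.linalg.norm(data.mean(axis=0))

assert excess_empirical_loss(np.zeros(8), data) >= 0      # nonnegative everywhere
assert abs(excess_empirical_loss(theta_star, data)) < 1e-12  # zero at the minimizer
```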

1.1. OUR CONTRIBUTIONS

This paper considers lower bounds for DP-ERM in unconstrained and/or non-Euclidean settings. We summarize our main results as follows:

• We propose a black-box reduction approach that directly generalizes lower bounds from the constrained case to the unconstrained case. The method is beneficial for its simplicity: nearly all existing lower bounds in the constrained case can be extended to the unconstrained case directly, and any new progress in the constrained case is of immediate use due to the black-box nature of the reduction.

• We achieve Ω(√(d log(1/δ))/(nε)) lower bounds for both constrained and unconstrained cases and any ℓ_p geometry with p ≥ 1 at the same time, by considering the ℓ_1 loss over the ℓ_∞ ball. This bound improves previous results, exactly matches the upper bounds for 1 < p ≤ 2, and yields novel bounds for p > 2.

As an example application of our simple reduction approach, we show how to obtain the Ω(d/(nε)) lower bound in the unconstrained pure-DP case from the constrained-case result of Bassily et al. (2014), which is the first lower bound in this setting to the best of our knowledge. This reduction also demonstrates that evading the curse of dimensionality is impossible based on existing dimension-dependent lower bounds, even in the (arguably less complicated) unconstrained case.

1.2. RELATED WORK

Previous studies on DP-ERM primarily focus on the constrained setting. The unconstrained case recently attracted interest because Jain & Thakurta (2014); Song et al. (2021) found an O(√rank/(nε)) upper bound for minimizing the excess risk of GLMs, which evades the curse of dimensionality. It is known that the constraint plays a crucial role in blocking dimension independence, as pointed out by the Ω(√d/(nε)) lower bound in Bassily et al. (2014), particularly for minimizing constrained GLMs in the case when rank ≤ n ≪ d. There are fundamental differences between the constrained and unconstrained cases, and analyzing the unconstrained case is an important direction.

Most existing lower bounds for DP-ERM use GLM functions. For example, the objective function used in Bassily et al. (2014) is the linear function ℓ(θ; z) = ⟨θ, z⟩, which cannot be applied in the unconstrained case; otherwise the loss could decrease to negative infinity. Considering this limitation, Song et al. (2021) adopted the objective functions ℓ(θ; z) = |⟨θ, x⟩ − y|. They transferred the problem of minimizing a GLM to estimating the mean of a set of vectors, then obtained the lower bound via tools from coding theory. Kairouz et al. (2020); Zhou et al. (2020) considered how to evade the curse of dimensionality for more general functions beyond GLMs using public data, which serves to identify a low-rank subspace, similar to Song et al. (2021) in spirit.

DP Stochastic Convex Optimization (DP-SCO) (Feldman et al. (2020); Bassily et al. (2019; 2020); Kulkarni et al. (2021); Asi et al. (2021); Bassily et al. (2021b)) is another important and fundamental problem closely related to DP-ERM. Roughly speaking, in SCO the objective is to minimize E_{z∼P}[ℓ(θ; z)] for some underlying distribution P, which requires analyzing the generalization ability of the algorithms. The tight bound for DP-SCO is usually the maximum of the information-theoretic lower bound for (non-private) SCO and the lower bound for DP-ERM, so improved lower bounds on DP-ERM can also benefit research on DP-SCO.

2. PRELIMINARY

We introduce the background knowledge required in the rest of the paper. Additional background, such as the definition of GLMs, can be found in the appendix. We start by defining Lipschitz functions.

Definition 2.1 (G-Lipschitz continuity). A function f : K → R is G-Lipschitz continuous with respect to ℓ_p geometry if for all θ, θ′ ∈ K, one has |f(θ) − f(θ′)| ≤ G·‖θ − θ′‖_p.
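As a quick numeric sanity check of Definition 2.1 (a sketch with random test points; the ℓ_1 norm is used because it reappears later as a loss, and its Lipschitz constant w.r.t. ℓ_p is d^(1−1/p) by the Hölder inequality):

```python
import numpy as np

# The l1 norm f(t) = ||t||_1 is d^(1-1/p)-Lipschitz w.r.t. l_p geometry; verify
# that the ratio |f(a) - f(b)| / ||a - b||_p never exceeds G on random pairs.
d, p = 16, 2.0
G = d ** (1 - 1 / p)
f = lambda t: np.abs(t).sum()

rng = np.random.default_rng(1)
for _ in range(1000):
    a, b = rng.normal(size=d), rng.normal(size=d)
    ratio = abs(f(a) - f(b)) / np.linalg.norm(a - b, ord=p)
    assert ratio <= G + 1e-9
```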

2.1. PROPERTIES OF DIFFERENTIAL PRIVACY

In this subsection, we introduce several basic properties of differential privacy without proof (refer to Dwork et al. (2014) for details).

Proposition 2.2 (Group privacy). If M : X^n → Y is an (ε, δ)-differentially private mechanism, then for all pairs of datasets x, x′ ∈ X^n that differ in at most k locations, M(x) and M(x′) are (kε, kδe^{kε})-indistinguishable.

Proposition 2.3 (Post-processing). If M : X^n → Y is (ε, δ)-differentially private and A : Y → Z is any randomized function, then A ∘ M : X^n → Z is also (ε, δ)-differentially private.
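For intuition, the basic (ε, 0)-DP guarantee can be checked analytically for the standard Laplace mechanism. This is background material rather than part of the paper's constructions; the query values below are arbitrary:

```python
import numpy as np

def laplace_density(y, mu, b):
    """Density of the Laplace(mu, b) distribution at y."""
    return np.exp(-np.abs(y - mu) / b) / (2 * b)

eps, sensitivity = 0.5, 1.0
b = sensitivity / eps                 # Laplace mechanism: scale = sensitivity / eps
f_D, f_Dp = 10.0, 11.0                # query answers on two neighboring datasets
ys = np.linspace(-20.0, 40.0, 2001)
ratio = laplace_density(ys, f_D, b) / laplace_density(ys, f_Dp, b)
# Pointwise density ratio at most e^eps, which is the (eps, 0)-DP condition;
# by Proposition 2.3, any post-processing of the output inherits this guarantee.
assert ratio.max() <= np.exp(eps) + 1e-9
```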

3. REDUCTION APPROACH

In this section, we prove a general black-box reduction theorem, which directly generalizes lower bounds for DP-ERM from the constrained case to the unconstrained case. As an application, we give an example of using the reduction to obtain a pure-DP lower bound in the unconstrained case from the constrained-case result of Bassily et al. (2014). To begin with, we introduce the following lemma from Cobzas & Mustata (1978), which gives a Lipschitz extension of any convex Lipschitz function over a bounded convex set to the whole domain R^d.

Lemma 3.1 (Theorem 1 in Cobzas & Mustata (1978)). Let f be a convex function that is η-Lipschitz w.r.t. ℓ_2 and defined on a bounded convex set K ⊂ R^d. Define an auxiliary function g_y(x) := f(y) + η·‖x − y‖_2 for each y ∈ K and all x ∈ R^d, and consider the function f̃ : R^d → R defined as f̃(x) := min_{y∈K} g_y(x). Then f̃ is η-Lipschitz w.r.t. ℓ_2 on R^d, and f̃(x) = f(x) for any x ∈ K.

For any y ∈ R^d, we define Π_K(y) := arg min_{x∈K} ‖x − y‖_2. It is well known in convex analysis that for a compact convex set K and any point y ∈ R^d, this minimizer exists and is unique (Hazan (2019)). In short, to prove lower bounds for the unconstrained case, one can extend the loss function from the constrained domain to R^d, with an important observation about such a convex extension: the loss L(θ; D) at a point θ does not increase after projecting θ onto the convex domain K, i.e., L(θ; D) ≥ L(Π_K(θ); D). One can derive this property from the Pythagorean theorem for convex sets (Lemma 3.2) combined with the particular structure of the extension.

Lemma 3.2 (Pythagorean theorem for convex sets). Let K ⊂ R^d be a convex set, y ∈ R^d and x = Π_K(y). Then for any z ∈ K, we have ‖x − z‖_2 ≤ ‖y − z‖_2.

In the unconstrained case, we usually assume public prior knowledge of a bound C with C ≥ ‖θ_0 − θ*‖_2, where θ_0 is the public initial point and θ* is the optimal solution of L(·; D) over R^d.
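The extension in Lemma 3.1 is easy to visualize numerically. The following one-dimensional sketch approximates K = [−1, 1] by a grid; the particular function f is an arbitrary illustrative choice:

```python
import numpy as np

# f_tilde(x) = min_{y in K} f(y) + eta * |x - y|, with K = [-1, 1] on a grid.
eta = 1.0
f = lambda y: y * y / 2                 # convex and 1-Lipschitz on [-1, 1]
K = np.linspace(-1.0, 1.0, 2001)

def f_tilde(x):
    return np.min(f(K) + eta * np.abs(x - K))

# The extension agrees with f on K ...
assert abs(f_tilde(0.5) - f(0.5)) < 1e-6
# ... and grows at rate at most eta outside K (eta-Lipschitz on all of R).
assert abs(f_tilde(3.0) - f_tilde(2.0)) <= eta + 1e-9
```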
To proceed, we first assume some lower bound in the constrained case, from which we reduce. The definition below defines a witness function for any lower bound in the constrained case; for example, in Bassily et al. (2014) the (witness) loss function is simply linear and the lower bound is roughly Ω(min{1, √d/(nε)}).

Definition 3.3. Let n, d be large enough, 0 ≤ δ ≤ 1 and ε > 0. We say the family of functions ℓ is a witness to the lower bound function f if, for any (ε, δ)-DP algorithm, there exist a convex set K ⊂ R^d of diameter C, a family of G-Lipschitz (w.r.t. ℓ_2) convex functions ℓ(θ; z) defined on K, and a dataset D of size n, such that with probability at least 1/2 (over the random coins of the algorithm), L(θ_priv; D) − min_{θ∈K} L(θ; D) = Ω(f(d, n, ε, δ, G, C)), where L(θ; D) := (1/n)·Σ_{i=1}^n ℓ(θ; z_i) and θ_priv ∈ K is the output of the algorithm.

The function f can be any lower bound in the constrained case with its dependence on the parameters, and ℓ is the loss function used to construct the lower bound. We use the Lipschitz extension mentioned above to define our loss function in the unconstrained case, i.e.,

ℓ̃(θ; z) = min_{y∈K} ℓ(y; z) + G·‖θ − y‖_2,   (5)

which is convex, G-Lipschitz and equal to ℓ(θ; z) when θ ∈ K, by Lemma 3.1. Our intuition is simple: if θ_priv lies in K, then we are done by the witness function and lower bound from Definition 3.3. If not, projecting θ_priv onto K can only decrease the loss; however, the projected point cannot have small excess loss due to the lower bound in Definition 3.3, let alone θ_priv itself. Theorem 3.4 (stated in full below) achieves the general reduction approach in the unconstrained setting; we now prove it.

Proof. Without loss of generality, let K = {θ : ‖θ − θ_0‖_2 ≤ C} be the ℓ_2 ball around θ_0, let ℓ(θ; z) be the convex functions from Definition 3.3, and, as mentioned, define our loss functions ℓ̃(θ; z) = min_{y∈K} ℓ(y; z) + G·‖θ − y‖_2. As θ* ∈ K, we know that L(θ*; D) = min_{θ∈K} L(θ; D).
Denote by θ̃_priv = Π_K(θ_priv) the projection of θ_priv onto K. Because post-processing preserves privacy, outputting θ̃_priv is also (ε, δ)-DP. By Definition 3.3, we have L(θ̃_priv; D) − min_θ L(θ; D) = Ω(f(d, n, ε, δ, G, C)). If θ̃_priv = θ_priv, i.e., θ_priv ∈ K, then because ℓ̃(θ; z) equals ℓ(θ; z) for any θ ∈ K and any z, one has L(θ_priv; D) = L(θ̃_priv; D). If θ̃_priv ≠ θ_priv, i.e., θ_priv ∉ K, then since ℓ(·; z) is G-Lipschitz, for any z we have (denoting y* = arg min_{y∈K} ℓ(y; z) + G·‖θ_priv − y‖_2):

ℓ̃(θ_priv; z) = min_{y∈K} ℓ(y; z) + G·‖θ_priv − y‖_2
= ℓ(y*; z) + G·‖θ_priv − y*‖_2
≥ ℓ(y*; z) + G·‖θ̃_priv − y*‖_2
≥ min_{y∈K} ℓ(y; z) + G·‖θ̃_priv − y‖_2 = ℓ̃(θ̃_priv; z),

where the first inequality is by the Pythagorean theorem for convex sets (Lemma 3.2), since y* ∈ K. In either case, we get L(θ_priv; D) ≥ L(θ̃_priv; D). Combining the above, we have

L(θ_priv; D) − L(θ*; D) = L(θ_priv; D) − min_θ L(θ; D) ≥ L(θ̃_priv; D) − min_θ L(θ; D) ≥ Ω(f(d, n, ε, δ, G, C)).
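The key monotonicity step of this proof (projection onto K never increases the extended loss) can be checked numerically. The following sketch approximates K, the unit ℓ_2 ball, by sampled points and uses an arbitrary linear base loss; the helper names are illustrative:

```python
import numpy as np

def project_l2_ball(theta, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius at the origin."""
    n = np.linalg.norm(theta)
    return theta if n <= radius else radius * theta / n

def ell_tilde(theta, z, K_pts, G=1.0):
    """Extended loss min_{y in K} l(y; z) + G*||theta - y||_2 with l(y; z) = -<y, z>,
    where K is approximated by the sampled points K_pts."""
    return np.min(-(K_pts @ z) + G * np.linalg.norm(K_pts - theta, axis=1))

rng = np.random.default_rng(2)
d = 5
K_pts = rng.normal(size=(5000, d))
K_pts /= np.maximum(1.0, np.linalg.norm(K_pts, axis=1, keepdims=True))  # clip into ball
z = rng.normal(size=d); z /= np.linalg.norm(z)
theta = 3.0 * rng.normal(size=d)          # almost surely outside K
proj = project_l2_ball(theta)
# Lemma 3.2 forces the extended loss not to increase under projection:
assert ell_tilde(theta, z, K_pts) >= ell_tilde(proj, z, K_pts) - 1e-9
```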

3.1. EXAMPLE FOR PURE-DP

This subsection gives a concrete example of the reduction method in the pure-DP setting. In the lower-bound construction for constrained DP-ERM in Bassily et al. (2014), the objective is the linear function ℓ(θ; z) = −⟨θ, z⟩, which is not applicable in the unconstrained setting because it can decrease to negative infinity. Instead, we extend the linear loss from the unit ℓ_2 ball to the whole of R^d while preserving its Lipschitzness and convexity, and use this extension to define our loss function in the unconstrained case. Namely, we define

ℓ(θ; z) = min_{‖y‖_2 ≤ 1} −⟨y, z⟩ + ‖θ − y‖_2   (10)

for all θ ∈ R^d and z in the unit ℓ_2 ball, which is convex, 1-Lipschitz and equal to −⟨θ, z⟩ when ‖θ‖_2 ≤ 1, according to Lemma 3.1. In particular, it is easy to verify that ℓ(θ; 0) = max{0, ‖θ‖_2 − 1}. When ‖z‖_2 = 1, one has ℓ(θ; z) ≥ min_{‖y‖_2 ≤ 1} −⟨y, z⟩ = −1, where equality holds if and only if θ = z. For any dataset D = {z_1, ..., z_n}, we define L(θ; D) = (1/n)·Σ_{i=1}^n ℓ(θ; z_i). We need the following lemma from Bassily et al. (2014) to prove the lower bound. The proof is similar to that of Lemma 5.1 in Bassily et al. (2014), except that we change the construction by adding the point 0 (the all-zero d-dimensional vector) as a dummy point; for completeness, we include it in the appendix.

Lemma 3.5 (Part one of Lemma 5.1 in Bassily et al. (2014), with slight modifications). Let n, d ≥ 2 and ε > 0. There is a number n* = Ω(min(n, d/ε)) such that for any ε-differentially private algorithm A, there is a dataset D = {z_1, ..., z_n} ⊂ {1/√d, −1/√d}^d ∪ {0} with ‖Σ_{i=1}^n z_i‖_2 = n* such that, with probability at least 1/2 (over the algorithm's random coins), we have ‖A(D) − q(D)‖_2 = Ω(min(1, d/(nε))), where q(D) = (1/n)·Σ_{i=1}^n z_i.

Lemma 3.5 basically says that it is impossible for any ε-DP algorithm to estimate the mean of some dataset z_1, ..., z_n with accuracy o(min(1, d/(nε))).
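The closed form ℓ(θ; 0) = max{0, ‖θ‖_2 − 1} claimed above can be verified numerically. This 2-D sketch approximates the minimization over the unit ball by sampling; sample count and test points are arbitrary:

```python
import numpy as np

# l(theta; z) = min_{||y||_2 <= 1} -<y, z> + ||theta - y||_2, with the unit-ball
# minimization approximated by points sampled uniformly from the disk.
rng = np.random.default_rng(3)
Y = rng.normal(size=(200_000, 2))
Y = Y / np.linalg.norm(Y, axis=1, keepdims=True) * np.sqrt(rng.random((200_000, 1)))

def ell(theta, z):
    return np.min(-(Y @ z) + np.linalg.norm(theta - Y, axis=1))

theta = np.array([2.0, 0.0])
# Matches max{0, ||theta||_2 - 1} up to sampling error:
assert abs(ell(theta, np.zeros(2)) - (np.linalg.norm(theta) - 1.0)) < 1e-2
z = np.array([0.0, 1.0])                 # a unit vector
assert ell(theta, z) >= -1.0 - 1e-12     # the lower bound noted in the text
```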
Using the loss functions defined in Equation (10), Lemma 3.5 and our reduction Theorem 3.4, we have the following theorem, whose proof can be found in the appendix.

Theorem 3.6 (Lower bound for ε-differentially private algorithms). Let n, d be large enough and ε > 0. For every ε-differentially private algorithm with output θ_priv ∈ R^d, there is a dataset D = {z_1, ..., z_n} ⊂ {1/√d, −1/√d}^d ∪ {0} such that, with probability at least 1/2 (over the algorithm's random coins), we must have L(θ_priv; D) − min_{θ∈R^d} L(θ; D) = Ω(min(1, d/(nε))).

4. IMPROVED BOUNDS

In this section, we consider lower bounds for approximate DP. We aim to improve previous results in two ways: tightening the previous lower bounds, and extending them to any non-Euclidean geometry and to the unconstrained case. We assume 2^{−O(n)} < δ < o(1/n); this assumption on δ is common in the literature, for example in Steinke & Ullman (2016).

4.1. BACKGROUND

We briefly introduce previous lower bounds for constrained DP-ERM and explain how changing the loss to ℓ_1 and the domain to an ℓ_∞ ball generalizes the lower bounds to the unconstrained non-Euclidean case. As shown in Table 1, Bassily et al. (2014) demonstrates a lower bound Ω(√d/(nε)) for the constrained case. They choose K to be the unit ℓ_2 ball and the loss to be the linear function ℓ(θ; z) = −⟨θ, z⟩, so the empirical loss is L(θ; D) = −⟨θ, Σ_{i=1}^n z_i/n⟩ with minimizer θ* = Σ_{i=1}^n z_i / ‖Σ_{i=1}^n z_i‖_2. Hence for any solution θ, one has

L(θ; D) − L(θ*; D) = (‖Σ_{i=1}^n z_i‖_2 / n)·(1 − ⟨θ, θ*⟩) ≥ (‖Σ_{i=1}^n z_i‖_2 / (2n))·‖θ − θ*‖_2².

Therefore one can reduce the lower bound for DP-ERM to a lower bound for estimating the mean of a dataset, for example via the following lemma:

Lemma 4.1 (Part two of Lemma 5.1 in Bassily et al. (2014)). Let ε > 0, δ = o(1/n) and M = Ω(min(n, √d/ε)). For any (ε, δ)-DP algorithm A, there is a dataset D = {z_1, ..., z_n} ⊆ {−1/√d, 1/√d}^d with ‖Σ_{i=1}^n z_i‖_2 ∈ [M − 1, M + 1] such that, with probability at least 1/3 (over A's random coins), we have ‖A(D) − q(D)‖_2 = Ω(min(1, √d/(nε))), where q(D) = (1/n)·Σ_{i=1}^n z_i.

In a word, Bassily et al. (2014) uses linear functions as the objective and reduces the lower bound on DP-ERM to the lower bound on estimating the mean; Steinke & Ullman (2016) improved the lower bound by a logarithmic factor using the group privacy technique.

The previous framework fails in the unconstrained, non-Euclidean case for two reasons. First, it relies on the ℓ_2 ball as the domain, which does not generalize to the general ℓ_p norm. Second, to generalize the lower bound to the unconstrained case, linear functions are no longer appropriate loss functions, as they can take arbitrarily negative values and do not have a global minimum. To address these concerns, we consider our problem over an ℓ_∞ ball and choose the loss function ℓ(θ; z) = ‖θ − z‖_1. Formally, the loss function used is ℓ(θ; z) = ‖θ − z‖_1 for θ ∈ R^d and z ∈ {−1, 1}^d, and the convex domain K is the unit ℓ_∞ ball. For any dataset D = {z_1, ..., z_n}, we define L(θ; D) = (1/|D|)·Σ_{i=1}^{|D|} ℓ(θ; z_i) = (1/|D|)·Σ_{i=1}^{|D|} ‖θ − z_i‖_1. One main reason for our choice is that ℓ_1 and ℓ_∞ are the "strongest" norms for the loss and the domain, respectively, implying lower bounds for general ℓ_p geometry by the Hölder inequality. Moreover, unlike linear functions, the ℓ_1 loss generalizes to the unconstrained case directly.
It suffices to figure out how to prove lower bounds in this constrained setting (ℓ_1 loss functions over the ℓ_∞ unit ball). Looking into previous lower bounds, such as Bassily et al. (2014) and Steinke & Ullman (2016), one finds that the core idea has two steps: first reduce the DP-ERM lower bound to a lower bound for mean estimation, then establish the mean-estimation lower bound via coding theory, in particular the fingerprinting codes discussed later. In our case, we cannot directly reduce the DP-ERM lower bound to a mean-estimation lower bound, because of the change of loss function and domain: a large mean-estimation error does not necessarily imply a large empirical risk. Consider a simple example. Recall that we want to minimize L(θ; D) = (1/n)·Σ_{i=1}^n ℓ(θ; z_i) over the ℓ_∞ unit ball K, where ℓ(θ; z) = ‖θ − z‖_1 and each z_i ∈ {0, 1}^d as set up before. If (1/n)·Σ_{i=1}^n z_i = (1/2)·1, where 1 denotes the all-one vector, then L(θ; D) is a constant function, equal to d/2 for every θ ∈ K. In this example, even if ‖θ_bad − (1/n)·Σ_{i=1}^n z_i‖_2 is large for a bad estimator θ_bad, θ_bad can still minimize the loss, i.e., L(θ_bad; D) − min_{θ∈K} L(θ; D) = 0.
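The constant-loss example above is easy to verify numerically. This sketch forces the dataset mean to exactly 1/2 in every coordinate by pairing each point with its complement; dimensions are arbitrary:

```python
import numpy as np

# If the dataset mean is (1/2, ..., 1/2), the l1 empirical loss is constant on
# the unit cube [0, 1]^d: for z in {0, 1}, |t - z| + |t - (1 - z)| = 1 for t in [0, 1].
d, n = 6, 8
rng = np.random.default_rng(4)
Z_half = rng.integers(0, 2, size=(n // 2, d)).astype(float)
Z = np.vstack([Z_half, 1.0 - Z_half])      # column means are exactly 1/2

def L(theta):
    return np.abs(theta[None, :] - Z).sum(axis=1).mean()

assert np.allclose(Z.mean(axis=0), 0.5)
thetas = rng.random((50, d))               # arbitrary points in [0, 1]^d
assert np.allclose([L(t) for t in thetas], d / 2)
```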

4.2. FINGERPRINTING CODES

Fingerprinting codes were first introduced in Boneh & Shaw (1998a), and have since been developed and frequently used to prove lower bounds in the DP community (Bun et al. (2018); Steinke & Ullman (2015; 2016)). To overcome the challenge discussed above, we slightly modify the definition of the fingerprinting code used in this work.

Definition 4.2 (ℓ_1-loss fingerprinting code). A γ-complete, γ-sound, α-robust ℓ_1-loss fingerprinting code for n users with length d is a pair of random variables D ∈ {0, 1}^{n×d} and Trace : [0, 1]^d → 2^{[n]} such that the following hold:

Completeness: For any fixed M : {0, 1}^{n×d} → [0, 1]^d,
Pr[(L(M(D); D) − min_θ L(θ; D) ≤ αd) ∧ (Trace(M(D)) = ∅)] ≤ γ.

Soundness: For any i ∈ [n] and fixed M : {0, 1}^{n×d} → [0, 1]^d,
Pr[i ∈ Trace(M(D_{−i}))] ≤ γ,

where D_{−i} denotes D with the i-th row replaced by some fixed element of {0, 1}^d.

As discussed before, previous works use fingerprinting codes to build a lower bound for mean estimation, while we modify the definition and build a lower bound for DP-ERM under our setup. Following the optimal fingerprinting code construction of Tardos (2008) and the subsequent works Bun et al. (2018); Bassily et al. (2014), we have the following result demonstrating the existence of fingerprinting codes in our version.

Lemma 4.3. For every n ≥ 1 and γ ∈ (0, 1], there exists a γ-complete, γ-sound, 1/150-robust ℓ_1-loss fingerprinting code for n users with length d, where d = O(n² log(1/γ)).

4.3. MAIN RESULT IN EUCLIDEAN GEOMETRY

Similar to Bun et al. (2018), we have the following standard lemma, which allows us to reduce any ε < 1 to the ε = 1 case without loss of generality. The proof is based on the well-known 'secrecy of the sample' lemma from Kasiviswanathan et al. (2011).

Lemma 4.4. For 0 < ε < 1, a condition Q has sample complexity n* for algorithms with (1, o(1/n))-differential privacy (i.e., n* is the smallest sample size at which there exists a (1, o(1/n))-differentially private algorithm A satisfying Q) if and only if it has sample complexity Θ(n*/ε) for algorithms with (ε, o(1/n))-differential privacy.

We apply the group privacy technique of Steinke & Ullman (2016), which needs the following technical lemma.

Lemma 4.5. Let n, k be two large positive integers with k < n/1000, and let n_k = ⌊n/k⌋. Let z_1, ..., z_{n_k} be n_k numbers with z_i ∈ {0, 1/2, 1} for all i ∈ [n_k]. For any real value q ∈ [0, 1], if we copy each z_i k times and append n − k·n_k zeros to obtain n numbers z′_1, ..., z′_n, then |Σ_{i=1}^{n_k} |q − z_i|/n_k − Σ_{i=1}^n |q − z′_i|/n| ≤ 3k/n.

Proof. Without loss of generality, we can assume z′_{k(i−1)+1} = ... = z′_{ki} = z_i for each i ∈ [n_k], and z′_{kn_k+1} = ... = z′_n = 0. With this observation,

|Σ_{i=1}^{n_k} |q − z_i|/n_k − Σ_{i=1}^n |q − z′_i|/n|
= |Σ_{i=1}^{n_k} |q − z_i|·(1/n_k − k/n) − Σ_{i=kn_k+1}^n q/n|
≤ Σ_{i=1}^{n_k} |q − z_i|·(1/n_k − k/n) + Σ_{i=kn_k+1}^n q/n
≤ n_k·(1/n_k − k/n) + k/n = (1 − k·n_k/n) + k/n ≤ 2k/n ≤ 3k/n,

where the last line uses |q − z_i| ≤ 1, the fact that there are at most k appended zeros, and k·n_k > n − k.

This section's main result is the following theorem, which modifies and generalizes the techniques in Steinke & Ullman (2016); Bassily et al. (2014) to reach a tighter bound in the unconstrained case.

Theorem 4.6 (Lower bound for (ε, δ)-differentially private algorithms). Let n, d be large enough, 1 ≥ ε > 0 and 2^{−O(n)} < δ < o(1/n).
For every (ε, δ)-differentially private algorithm with output θ_priv ∈ R^d, there is a dataset D = {z_1, ..., z_n} ⊂ {0, 1}^d ∪ {1/2}^d such that E[L(θ_priv; D) − L(θ′; D)] = Ω(min(1, √(d log(1/δ))/(nε))·GC), where ℓ is G-Lipschitz w.r.t. ℓ_2 geometry, θ′ is a minimizer of L(·; D), and C = √d is the diameter w.r.t. ℓ_2 geometry of K, the unit ℓ_∞ ball containing all possible true minimizers (K thus plays a different role from its usual definition in the constrained setting).

Remark 4.7. The dependence on the product GC makes sense. For example, one can scale the loss function to ℓ̂(x; z) = a·‖x − z‖_1 for some constant a ∈ (0, 1), which decreases the Lipschitz constant G but increases the diameter C (we must choose K to contain all possible minimizers). This bound improves on Bassily et al. (2014) by a logarithmic factor and can be extended directly to the constrained bounded setting, by taking the constrained domain to be the unit ℓ_∞ ball.
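Lemma 4.5 above is easy to stress-test numerically. The following sketch uses arbitrary n, k (chosen to satisfy k < n/1000) and random values of q:

```python
import numpy as np

# Duplicate each of the n_k values k times, pad with zeros to length n, and
# compare the average l1 distances to q; the gap is at most 3k/n (Lemma 4.5).
n, k = 100_000, 7                      # satisfies k < n/1000
n_k = n // k
rng = np.random.default_rng(5)
z_small = rng.choice([0.0, 0.5, 1.0], size=n_k)
z_big = np.concatenate([np.repeat(z_small, k), np.zeros(n - k * n_k)])

for q in rng.random(20):
    gap = abs(np.abs(q - z_small).mean() - np.abs(q - z_big).mean())
    assert gap <= 3 * k / n
```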

4.4. EXTENSION TO NON-EUCLIDEAN GEOMETRY

We illustrate the power of our construction in Theorem 4.6 by showing that the same bound holds for any ℓ_p geometry with p ≥ 1 in the constrained setting, and that the bound is tight for all 1 < p ≤ 2, improving/generalizing existing results in Asi et al. (2021); Bassily et al. (2021b). Our construction is advantageous in that it uses the ℓ_1 loss and an ℓ_∞-ball-like domain in the constrained setting, both being the strongest in their direction when relaxing to ℓ_p geometry. Simply applying the Hölder inequality shows that the product of the Lipschitz constant G and the diameter C of the domain equals d as p varies in [1, ∞).

Theorem 4.8. Let n, d be large enough, 1 ≥ ε > 0, 2^{−O(n)} < δ < o(1/n) and p ≥ 1. There exists a convex set K ⊂ R^d such that for every (ε, δ)-differentially private algorithm with output θ_priv ∈ K, there is a dataset D = {z_1, ..., z_n} ⊂ {0, 1}^d ∪ {1/2}^d such that E[L(θ_priv; D) − L(θ′; D)] = Ω(min(1, √(d log(1/δ))/(nε))·GC), where θ′ is a minimizer of L(·; D), ℓ is G-Lipschitz, and C is the diameter of the domain K; both G and C are defined w.r.t. ℓ_p geometry.

Proof. We use the same construction as in Theorem 4.6, which considers ℓ_2 geometry; we only need to compute the Lipschitz constant G and the diameter of the domain K. For the Lipschitz constant, note that our loss is the ℓ_1 norm ℓ(θ; z) = ‖θ − z‖_1, which is evidently d^(1−1/p)-Lipschitz w.r.t. ℓ_p geometry. For the domain, the unit ℓ_∞ ball K obviously has diameter C = d^(1/p) w.r.t. ℓ_p geometry. To conclude, for any ℓ_p geometry with p ≥ 1, we have GC = d, which is independent of p, and the bound follows by applying Theorem 4.6. For the unconstrained case, we note that the optimal θ* under our construction must lie in the unit ℓ_∞ ball K = {x ∈ R^d : 0 ≤ x_i ≤ 1, ∀i ∈ [d]}, by observing that projecting any point onto K does not increase the ℓ_1 loss. Therefore, our result generalizes to the unconstrained case directly.
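The p-independence of the product GC in the proof above is a one-line computation (sketch; the dimension is arbitrary):

```python
import numpy as np

# l1 loss: G = d^(1-1/p) by Holder; unit l_inf cube [0,1]^d: l_p diameter
# C = ||1||_p = d^(1/p). Hence G * C = d for every p >= 1.
d = 32
for p in [1.0, 1.5, 2.0, 4.0, 10.0]:
    G = d ** (1.0 - 1.0 / p)
    C = np.linalg.norm(np.ones(d), ord=p)   # distance between opposite corners
    assert abs(G * C - d) < 1e-8
```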
In a word, our result presents lower bounds of Ω(√(d log(1/δ))/(nε)) for all p ≥ 1, in both the constrained and unconstrained cases. Remarkably, our bound is the best known for p near 1 and for p > 2, to our knowledge.

5. CONCLUSION

This paper studies lower bounds for DP-ERM in the unconstrained case and non-Euclidean geometry. We propose a simple but powerful black-box reduction approach that can transfer any lower bound in the constrained case to the unconstrained case, indicating that getting rid of the dimension dependence is generally impossible. We also prove better lower bounds for approximate-DP ERM for any ℓ_p geometry with p ≥ 1 by considering the ℓ_1 loss over the ℓ_∞ ball. Our bound is tight when 1 ≤ p ≤ 2 and novel for p > 2. However, there is a gap between the current best upper bound (Bassily et al. (2021b)) and our lower bound when p > 2, and filling this gap is an exciting and interesting open problem. Designing better algorithms for general (un)constrained DP-ERM based on our insights would also be an interesting and meaningful direction, which we leave as future work.

A.1 GENERALIZED LINEAR MODEL (GLM)

The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables with error distributions other than the normal distribution. To be specific:

Definition A.1 (Generalized linear model (GLM)). The generalized linear model (GLM) is a special class of ERM problems where the loss function ℓ(θ; z) takes the inner-product form ℓ(θ; z) = φ(⟨θ, x⟩; y) (16) for z = (x, y) and some function φ. Here, x ∈ R^d is usually called the feature vector and y ∈ R is called the response.

B CONSTRUCTION OF FINGERPRINTING CODES

Fingerprinting codes were introduced by Boneh & Shaw (1998b) to address the digital watermarking problem. Imagine a company selling software to users. A fingerprinting code is a pair of randomized algorithms (Gen, Trace), where Gen generates a length-d code for each user i. To deter any malicious coalition of users from copying and distributing the software, the Trace algorithm can trace one of the malicious users, given a code produced by the coalition. The coalition may only change the bits on which their codewords differ: any bit they have in common is potentially vital to the software and risky to change. In this section, we introduce the fingerprinting code used by Bun et al. (2018), which is based on the first optimal fingerprinting code of Tardos (2008), with additional error robustness. The mechanism of the fingerprinting code is described in Algorithm 1 for completeness. The sub-procedures are the original fingerprinting code of Tardos (2008), a pair of randomized algorithms (Gen′, Trace′). The code generator Gen′ outputs a codebook C ∈ {0, 1}^{n×d}; the i-th row of C is the codeword of user i, and the parameter d is called the length of the fingerprinting code. We give the formal definition of fingerprinting codes in Definition B.1 below. Fingerprinting codes imply the hardness of privately estimating the mean of a dataset over {0, 1}^d: otherwise, the coalition of users could simply use the rounded mean of their codewords to produce the copy. The DP-ERM problem can then be reduced to privately estimating the mean by using the linear loss, whose minimizer is precisely (the normalization of) the mean. The security property of fingerprinting codes asserts that any feasible combined codeword can be "traced" to a user i. Moreover, we require that the fingerprinting code find one of the malicious users even when they collude and combine their codewords in any way that respects the marking condition.
That is, the tracing algorithm Trace takes as input the codebook C and the combined codeword c′, and outputs one of the malicious users with high probability. The sub-procedure Gen′ first uses a sin²-shaped distribution to generate a parameter p_j (the column mean) for each column j independently, then generates C randomly by setting each entry of column j to 1 with probability p_j. The sub-procedure Trace′ computes a threshold value Z and a "score function" S_i(c′) for each user i, then reports i when its score exceeds the threshold. The main procedures were introduced in Bun et al. (2018), where Gen adds dummy columns to the original fingerprinting code and applies a random permutation.



Footnotes. When δ > 0, we may refer to (ε, δ)-DP as approximate-DP, and we sometimes call the particular case δ = 0 pure-DP. The current best upper bound is O(min{log d, 1/(p−1)}·√(d log(1/δ))/(nε)).




Theorem 3.4. Assume ℓ, f are the witness functions and the lower bound as in Definition 3.3. For any (ε, δ)-DP algorithm and any initial point θ_0 ∈ R^d, there exist a family of G-Lipschitz convex functions ℓ̃(θ; z) : R^d → R (the extensions of the ℓ from Definition 3.3), a dataset D of size n, and the same function f, such that with probability at least 1/2 (over the random coins of the algorithm), L(θ_priv; D) − L(θ*; D) = Ω(f(d, n, ε, δ, G, C)), (6) where L(θ; D) := (1/n)·Σ_{z_i∈D} ℓ̃(θ; z_i) is the ERM objective function, θ* = arg min_{θ∈R^d} L(θ; D), C ≥ ‖θ_0 − θ*‖_2, and θ_priv is the output of the algorithm.


Lemma 4.1 (Part Two of Lemma 5.1 in Bassily et al. (2014)). Let $\epsilon > 0$, $\delta = o(1/n)$ and $M = \Omega(\min(n, \sqrt d/\epsilon))$. For any $(\epsilon, \delta)$-DP algorithm $\mathcal{A}$, there is a dataset $D$

where $D_{-i}$ denotes $D$ with the $i$-th row replaced by some fixed element of $\{0,1\}^d$. Definition 4.2 is similar to the one in Steinke & Ullman (2016) (see Definition 3.2 therein), except that their requirement of completeness is $\Pr[\|M$

Definition B.1 (fingerprinting codes). Given $n, d \in \mathbb{N}$ and $\xi \in (0,1]$, a pair of (random) algorithms (Gen, Trace) is called an $(n, d)$-fingerprinting code with security $\xi$ if Gen outputs a codebook $C \in \{0,1\}^{n\times d}$ and, for any (possibly randomized) adversary $A_{FP}$ and any subset $S \subseteq [n]$, if we set $c' \leftarrow_R A_{FP}(C_S)$, then
• $\Pr[c' \in F(C_S) \wedge \mathrm{Trace}(C, c') = \perp] \le \xi$
• $\Pr[\mathrm{Trace}(C, c') \in [n]\setminus S] \le \xi$
where $F(C_S) = \{c' \in \{0,1\}^d \mid \forall j \in [d], \exists i \in S, c'_j = C_{ij}\}$, and the probability is taken over the coins of Gen, Trace and $A_{FP}$.
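The feasible set $F(C_S)$ is exactly the 'marking condition': every bit of the combined codeword must be copied from some colluder. A small sketch with toy sizes (the function name and example are ours):

```python
import numpy as np

def is_feasible(c, C_S):
    """Return True iff c is in F(C_S): for every column j, some colluder
    i in S has C_S[i, j] == c[j]."""
    return bool(((c[None, :] == C_S).any(axis=0)).all())

# Toy coalition of 3 users over 4 columns.
C_S = np.array([[0, 1, 0, 1],
                [0, 1, 1, 0],
                [0, 0, 1, 1]])
assert is_feasible(np.array([0, 1, 1, 1]), C_S)       # every bit appears in S
assert not is_feasible(np.array([1, 1, 1, 1]), C_S)   # column 0 is all-zero in S
```

In particular, on columns where the colluders all agree, any feasible codeword must reproduce the shared bit, which is what the tracing analysis exploits.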


Algorithm 1 The Fingerprinting Code (Gen, Trace)

Sub-procedure Gen$'$:
Let $d = 100n^2\log(n/\xi)$ be the length of the code. Let $t = 1/(300n)$ be a parameter and let $t'$ be such that $\sin^2 t' = t$.
for $j = 1, \dots, d$ do
  Choose $r_j$ uniformly at random from $[t', \pi/2 - t']$ and let $p_j = \sin^2 r_j$. Note that $p_j \in [t, 1-t]$.
  For each $i = 1, \dots, n$, set $C_{ij} = 1$ with probability $p_j$ independently.
end for
Return: C

Sub-procedure Trace$'(C, c')$:
Let $Z = 20n\log(n/\xi)$ be a threshold parameter. For each $j = 1, \dots, d$, let $q_j = \sqrt{(1-p_j)/p_j}$. For each $j = 1, \dots, d$ and each $i = 1, \dots, n$, let $U_{ij} = q_j$ if $C_{ij} = 1$ and $U_{ij} = -1/q_j$ otherwise.
for each $i = 1, \dots, n$ do
  Let $S_i(c') = \sum_{j=1}^d c'_j U_{ij}$; output i if $S_i(c') \ge Z/2$.
end for
Output $\perp$ if $S_i(c') < Z/2$ for every $i = 1, \dots, n$.

Main-procedure Gen:
Let C be the (random) output of Gen$'$, $C \in \{0,1\}^{n\times d}$.
Append 2d 0-marked columns and 2d 1-marked columns to C. Apply a random permutation $\pi$ to the columns of the augmented codebook. Let the new codebook be $C' \in \{0,1\}^{n\times 5d}$.
Return: $C'$.
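The two sub-procedures Gen$'$ and Trace$'$ above can be sketched in NumPy as follows; this is a minimal illustration (the dummy-column main procedure is omitted, all names are ours, and the usage at the end, where colluder 0 simply leaks their own codeword, is only an example of a feasible combined codeword):

```python
import numpy as np

def gen_prime(n, xi, rng):
    """Sub-procedure Gen': sample column biases from the sin^2 distribution,
    then draw the codebook."""
    d = int(100 * n**2 * np.log(n / xi))
    t = 1.0 / (300 * n)
    t_prime = np.arcsin(np.sqrt(t))                 # sin^2(t') = t
    r = rng.uniform(t_prime, np.pi / 2 - t_prime, size=d)
    p = np.sin(r) ** 2                              # p_j in [t, 1 - t]
    C = (rng.random((n, d)) < p).astype(np.int8)
    return C, p

def trace_prime(C, p, c_prime, xi):
    """Sub-procedure Trace': Tardos-style scores against the threshold Z/2."""
    n = C.shape[0]
    Z = 20 * n * np.log(n / xi)
    q = np.sqrt((1.0 - p) / p)
    U = np.where(C == 1, q, -1.0 / q)               # score weights U_ij
    scores = U @ c_prime                            # S_i(c') = sum_j c'_j U_ij
    accused = np.flatnonzero(scores >= Z / 2)
    return accused if accused.size else None        # None stands in for ⊥

rng = np.random.default_rng(1)
C, p = gen_prime(3, 0.1, rng)
accused = trace_prime(C, p, C[0], 0.1)              # colluder 0 leaks row 0
```

With these parameters the leaking user's score concentrates far above $Z/2$ while an innocent user's score has mean zero, so the tracer accuses user 0.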

Main-procedure Trace$(C', c')$:

Obtain $C'$ from the shared state with Gen. Obtain C by applying $\pi^{-1}$ to the columns of $C'$ and removing the dummy columns. Obtain c by applying $\pi^{-1}$ to $c'$ and removing the symbols corresponding to the dummy columns. Return: a uniformly random i from Trace$'(C, c)$.

permutation and remove the dummy columns, then use Trace$'$ as a black box. This procedure makes the fingerprinting code more robust, in that it tolerates a small fraction of errors to the marking condition. In particular, they prove that the fingerprinting code of Algorithm 1 has the following property.

Theorem B.2 (Theorem 3.4 in Bun et al. (2018)). For every d and $\gamma \in (0,1]$, there exists an $(n, d)$-fingerprinting code with security $\gamma$ robust to a 1/75 fraction of errors for $n = \Omega(\sqrt{d/\log(1/\gamma)})$.

The proof of Lemma 3.5 is basically the same as the proof in Bassily et al. (2014) with minor modifications; readers familiar with the literature can feel free to skip it.

Lemma 3.5. Let $n, d \ge 2$ and $\epsilon > 0$. There is a number $n^* = \Omega(\min(n, d/\epsilon))$ such that for any $\epsilon$-differentially private algorithm $\mathcal{A}$, there is a dataset $D$ of n points in $\{\pm 1/\sqrt d\}^d \cup \{0^d\}$ such that, with probability at least 1/2 (taken over the algorithm's random coins), we have $\|\mathcal{A}(D) - q(D)\|_2 = \Omega(n^*/n)$, where $q(D) := \frac{1}{n}\sum_{z \in D} z$.

Proof. By using a standard packing argument we can construct $K = 2^{\Omega(d)}$ points $z^{(1)}, \dots, z^{(K)} \in \{\pm 1/\sqrt d\}^d$ such that for every distinct pair $z^{(i)}, z^{(j)}$ of these points, we have $\|z^{(i)} - z^{(j)}\|_2 \ge 1/8$. It is easy to show the existence of such a set of points using the probabilistic method (for example, the Gilbert-Varshamov construction of a random linear binary code).

Fix $\epsilon > 0$ and define $n' = \lceil d/(20\epsilon)\rceil$. Let us first consider the case where $n \le n'$. We construct K datasets $D^{(1)}, \dots, D^{(K)}$ where for each $i \in [K]$, $D^{(i)}$ contains n copies of $z^{(i)}$. Note that $q(D^{(i)}) = z^{(i)}$, so for all $i \ne j$, $\|q(D^{(i)}) - q(D^{(j)})\|_2 \ge 1/8$.

Let $\mathcal{A}$ be any $\epsilon$-differentially private algorithm. Suppose that for every $i \in [K]$, $\Pr[\mathcal{A}(D^{(i)}) \in B(D^{(i)})] \ge 1/2$, where for any dataset D, $B(D)$ is defined as the $\ell_2$ ball of radius 1/16 centered at $q(D)$; these balls are pairwise disjoint. Note that for all $i \ne j$, $D^{(i)}$ and $D^{(j)}$ differ in all their n entries. Since $\mathcal{A}$ is $\epsilon$-differentially private, for all $i \in [K]$, $\Pr[\mathcal{A}(D^{(1)}) \in B(D^{(i)})] \ge e^{-\epsilon n}/2$; summing over the disjoint balls gives $1 \ge K e^{-\epsilon n}/2$, which implies that $n > n'$ for sufficiently large d, contradicting the fact that $n \le n'$.
Hence, there must exist a dataset $D^{(i)}$ on which $\mathcal{A}$'s $\ell_2$-error in estimating $q(D^{(i)})$ is at least 1/16 with probability at least 1/2. Note also that the $\ell_2$ norm of the sum of the entries of such a $D^{(i)}$ is n.

Next, we consider the case where $n > n'$. As before, we construct $K = 2^{\Omega(d)}$ datasets $\tilde D^{(1)}, \dots, \tilde D^{(K)}$ of size n where for every $i \in [K]$, the first $n'$ elements of $\tilde D^{(i)}$ are the same as the dataset $D^{(i)}$ from before (now of size $n'$), whereas the remaining $n - n'$ elements are $0^d$. Note that any two distinct datasets $\tilde D^{(i)}, \tilde D^{(j)}$ in this collection differ in exactly $n'$ entries. Let $\mathcal{A}$ be any $\epsilon$-differentially private algorithm for answering q. Suppose that for every $i \in [K]$, with probability at least 1/2, we have that $\|\mathcal{A}(\tilde D^{(i)}) - q(\tilde D^{(i)})\|_2 < \frac{n'}{16n}$. Note that for all $i \in [K]$, we have that $q(\tilde D^{(i)}) = \frac{n'}{n}\, q(D^{(i)})$. Now, we define an algorithm $\tilde{\mathcal{A}}$ for answering q on datasets D of size $n'$ as follows. First, $\tilde{\mathcal{A}}$ appends 0's as above to get a dataset $\tilde D$ of size n. Then, it runs $\mathcal{A}$ on $\tilde D$ and outputs $\frac{n}{n'}\mathcal{A}(\tilde D)$. By the post-processing property of differential privacy, $\tilde{\mathcal{A}}$ is $\epsilon$-differentially private since $\mathcal{A}$ is $\epsilon$-differentially private. Thus for every $i \in [K]$, with probability at least 1/2, we have that $\|\tilde{\mathcal{A}}(D^{(i)}) - q(D^{(i)})\|_2 < \frac{1}{16}$. However, this contradicts our result in the first part of the proof. Therefore, there must exist a dataset $\tilde D^{(i)}$ in the above collection such that, with probability at least 1/2, $\|\mathcal{A}(\tilde D^{(i)}) - q(\tilde D^{(i)})\|_2 \ge \frac{n'}{16n}$. Note that the $\ell_2$ norm of the sum of entries of such a $\tilde D^{(i)}$ is always $n'$.

C.2 PROOF OF THEOREM 3.6

The proof does not use the reduction Theorem 3.4 directly as a black box, but uses the intuition behind it in detail.

Theorem 3.6. Let $n, d \ge 2$ and $\epsilon > 0$. For every $\epsilon$-differentially private algorithm with output $\theta^{\mathrm{priv}} \in \mathbb{R}^d$, there is a dataset D such that, with probability at least 1/2 (over the algorithm's random coins), we must have that $L(\theta^{\mathrm{priv}}; D) - L(\theta^*; D) = \Omega\big(\min(1, \frac{d}{n\epsilon})\big)$.

Proof. We can prove this theorem directly by combining the lower bound in Bassily et al.
(2014) and our reduction approach (Theorem 3.4), but we give a complete proof as an example to demonstrate how our black-box reduction approach works.

Let $\mathcal{A}$ be an $\epsilon$-differentially private algorithm for minimizing L and let $\theta^{\mathrm{priv}}$ denote its output; define $r := \|\theta^{\mathrm{priv}} - \theta^*\|_2$. First, observe that for any $\theta \in \mathbb{R}^d$ and dataset D as constructed in Lemma 3.5 (recall that D consists of $n^*$ copies of a vector $z \in \{\pm 1/\sqrt d\}^d$ and $n - n^*$ copies of $0^d$), we have $\theta^* = z$, and also Equation (26), which combined with the fact that $|\langle y, z\rangle| \le 1$ proves the last inequality.

If $\|y - z\|_2 \ge r/2$, then we have $\min_{\|y\|_2 \le 1,\, \|y - z\|_2 \ge r/2} -\langle y, z\rangle \ge -1 + \frac{r^2}{8}$. To prove this, we may assume $z = e_1$ without loss of generality and write $y - z = (x_1, \dots, x_d)$; since $\|y\|_2 \le 1 = \|z\|_2$, expanding $\|y - z\|_2^2$ gives $\langle y, z\rangle \le 1 - \|y - z\|_2^2/2$, hence $-\langle y, z\rangle \ge -1 + r^2/8$ as desired, which finishes the discussion of the second case.

From the above result, a lower bound on the excess empirical risk follows from a lower bound on r. To proceed, suppose for the sake of a contradiction that for every dataset, with probability more than 1/2, we have $\|\theta^{\mathrm{priv}} - \theta^*\|_2 = r = o(1)$. Let $\tilde{\mathcal{A}}$ be the $\epsilon$-differentially private algorithm that first runs $\mathcal{A}$ on the data and then outputs $\frac{n^*}{n}\theta^{\mathrm{priv}}$. Recalling that $q(D) = \frac{n^*}{n}\theta^*$, this implies that for every dataset, with probability more than 1/2, $\|\tilde{\mathcal{A}}(D) - q(D)\|_2 = o(n^*/n)$, which contradicts Lemma 3.5. Thus, there must exist a dataset such that, with probability more than 1/2, we have $r = \|\theta^{\mathrm{priv}} - \theta^*\|_2 = \Omega(1)$, and as a result the claimed lower bound on the excess empirical risk follows.

Proof. We want to find $\alpha$ such that any set satisfying the completeness condition in the above definition is a subset of the $F_\beta$ set of Bun et al. (2018) after rounding to binary values. Suppose we round the output $M(D) \in [0,1]^d$ to a binary vector $c' \in \{0,1\}^d$ with $c' \notin F_\beta(D)$; then $c'$ has an 'illegal' bit on at least $\beta d$ columns, where each of these columns shares the same value across users (all-one or all-minus-one columns).
It means that on each of these columns, M(D) has the opposite sign to the shared value; since the rounding threshold is 1/2, on each such column, say j, the induced loss is lower bounded by 1/2.

Given a privacy parameter $\epsilon$ and a dataset X, we construct $\tilde{\mathcal{A}}$ to first generate a new dataset T by selecting each element of X independently with probability $\epsilon$, and then feed T to $\mathcal{A}$. Fix an event S and two neighboring datasets $X_1, X_2$ that differ in a single element i. Consider running $\tilde{\mathcal{A}}$ on $X_1$. If i is not included in the sample T, then the output is distributed the same as in a run on $X_2$. On the other hand, if i is included in T, then the behavior of $\mathcal{A}$ on T is only a factor of e off from the behavior of $\mathcal{A}$ on $T \setminus \{i\}$. Again, because of independence, the distribution of $T \setminus \{i\}$ is the same as the distribution of T conditioned on the omission of i. For a set X, let $p_X$ denote the distribution of $\tilde{\mathcal{A}}(X)$; we then have, for any event S, $p_{X_1}(S) \le (1-\epsilon)\,p_{X_2}(S) + \epsilon\,(e\,p_{X_2}(S) + \delta) \le e^{2\epsilon}\,p_{X_2}(S) + \epsilon\delta$. A lower bound of $p_{X_1}(S) \ge \exp(-2\epsilon)\,p_{X_2}(S) - \delta/e$ can be obtained similarly. To conclude, since $\delta = o(1/n)$ and the sample size decreases by a factor of $\epsilon$, $\tilde{\mathcal{A}}$ satisfies $(2\epsilon, o(1/n))$-differential privacy. The size of X is roughly $1/\epsilon$ times that of T; combined with the fact that $\mathcal{A}$ has sample complexity $n^*$ and T is what is fed to $\mathcal{A}$, $\tilde{\mathcal{A}}$ has sample complexity at least $\Theta(n^*/\epsilon)$.

For the other direction, simply using the composability of differential privacy yields the desired result. In particular, by the k-fold adaptive composition theorem in Dwork et al. (2006), we can combine $1/\epsilon$ independent copies of $(\epsilon, \delta)$-differentially private algorithms to get a $(1, \delta/\epsilon)$-differentially private one, and notice that if $\delta = o(1/n)$, then $\delta/\epsilon = o(1/n)$ as well, because the sample size n is scaled by a factor of $\epsilon$ at the same time, offsetting the increase in $\delta$.
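The subsampling step in this argument is an instance of privacy amplification by subsampling. A standard closed form of the amplified parameter (a sketch under the usual independent-sampling assumption; with sampling rate $\epsilon$ and a 1-DP base algorithm it matches the factor-2 loss used above) is:

```python
import math

def subsampled_eps(eps0, q):
    """If A is eps0-DP, running A on an independent q-subsample of its input
    is eps'-DP with eps' = log(1 + q * (e**eps0 - 1))."""
    return math.log(1.0 + q * (math.exp(eps0) - 1.0))

# q = 1 recovers the original guarantee exactly.
assert abs(subsampled_eps(1.0, 1.0) - 1.0) < 1e-12

# Sampling a 1-DP algorithm at rate eps, as in the proof above, gives <= 2*eps.
eps = 0.1
assert subsampled_eps(1.0, eps) <= 2 * eps
```

The bound is linear in the sampling rate for small rates, which is exactly why shrinking the sample by a factor of $\epsilon$ buys a factor-$\epsilon$ improvement in the privacy parameter.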

D.3 PROOF OF THEOREM 4.6

Theorem 4.6. Let n, d be large enough and $2^{-O(n)} < \delta < o(1/n)$. For any $(\epsilon, \delta)$-DP algorithm there exists a dataset D such that, with probability at least 1/2, $L(\theta^{\mathrm{priv}}; D) - L(\theta^*; D) = \Omega\big(\frac{GC\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$, where $\ell$ is G-Lipschitz w.r.t. $\ell_2$ geometry, $\theta^*$ is a minimizer of $L(\theta; D)$, and $C = \sqrt d$ is the diameter of K w.r.t. $\ell_2$ geometry, where K is the unit $\ell_\infty$ ball containing all possible true minimizers and differs from its usual definition in the constrained setting.

Proof. Let $k = \Theta(\log(1/\delta))$ be a parameter to be determined later satisfying $k/n < 1/6000$, and let $n_k = \lfloor n/k\rfloor$. Without loss of generality, we assume $\epsilon = 1$ due to Lemma 4.4, and $n_k = \Omega(\sqrt{d/(k\log(1/\delta))})$ corresponds to the number in Lemma 4.3, where we set $\gamma = \delta$.

We use contradiction to prove that for any $(\epsilon, \delta)$-DP mechanism M, there exists some $D \in \{0,1\}^{n\times d}$ on which the stated $\ell_1$ loss is incurred. Assume for contradiction that $M : \{0,1\}^{n\times d} \to [0,1]^d$ achieves a smaller loss for all $D \in \{0,1\}^{n\times d}$. We then construct a mechanism $M_k : \{0,1\}^{n_k\times d} \to [0,1]^d$ with respect to M as follows: with input $D_k \in \{0,1\}^{n_k\times d}$, $M_k$ copies $D_k$ k times and appends enough 0's to get a dataset $D \in \{0,1\}^{n\times d}$, and outputs $M_k(D_k) := M(D)$. $M_k$ is $(k, \frac{e^k-1}{e-1}\delta)$-DP by group privacy.

We consider the algorithm $A_{FP}$ to be the adversarial algorithm in the fingerprinting code, which rounds the output $M_k(D_k)$ to a binary vector, i.e., rounding those coordinates with values no less than 1/2 to 1 and the remaining to 0, and let $c' = A_{FP}(M(D))$ be the vector after rounding. As $M_k$ is $(k, \frac{e^k-1}{e-1}\delta)$-DP, $A_{FP}$ is also $(k, \frac{e^k-1}{e-1}\delta)$-DP by post-processing. Considering the $\ell_1$ loss, we can account for the loss caused by each coordinate separately. Recall that $M_k(D_k) = M(D)$; thus the assumed accuracy of M bounds the expected number of 'illegal' coordinates of $c'$ (we use Lemma 4.5 for the third line of the corresponding chain of inequalities).

By Markov's inequality, the probability that $c'$ violates the marking condition on more than a 1/75 fraction of the columns is at most 1/5. By a union bound, we can upper bound the probability $\Pr[\mathrm{Trace}(D_k, c') = \perp] \le 1/5 + \delta \le 1/2$. As a result, there exists $i^* \in [n_k]$ such that $\Pr[i^* \in \mathrm{Trace}(D_k, c')] \ge \frac{1}{2n_k}$.

Consider the database with $i^*$ removed, denoted by $D_k^{-i^*}$. Let $c'' = A_{FP}(M(D_k^{-i^*}))$ denote the corresponding vector after rounding. By the second property of fingerprinting codes, we have that $\Pr[i^* \in \mathrm{Trace}(D_k^{-i^*}, c'')] \le \delta$. By the differential privacy and post-processing properties of M, $\frac{1}{2n_k} \le e^k \Pr[i^* \in \mathrm{Trace}(D_k^{-i^*}, c'')] + \frac{e^k-1}{e-1}\delta$ (Equation (33)). Recall that $2^{-O(n)} < \delta < o(1/n)$ and $n_k = \lfloor n/k\rfloor$; Equation (33) then suggests $k/n \le 4e^k\delta$ for all valid k.
But it is easy to see that there exists $k = \Theta(\log(1/\delta))$ with $k < n/6000$ that makes this inequality false, which is a contradiction. As a result, there exists some $D \in \{0,1\}^{n\times d}$ such that the stated lower bound on the $\ell_1$ loss holds. For the $(\epsilon, \delta)$-DP case when $\epsilon < 1$, setting Q to be the condition

