MEDIAN DC FOR SIGN RECOVERY: PRIVACY CAN BE ACHIEVED BY DETERMINISTIC ALGORITHMS

Abstract

Privacy-preserving data analysis has become prevalent in recent years. It is common wisdom in the privacy literature that strict differential privacy can only be achieved by injecting additional randomness into the algorithm. In this paper, we study the problem of private sign recovery for sparse mean estimation and sparse linear regression in a distributed setup. By taking a coordinate-wise median of the reported local sign vectors, an approach we refer to as median divide-and-conquer (Med-DC), we can recover the signs of the true parameter with a provable consistency guarantee. Moreover, without adding any extra randomness to the algorithm, our Med-DC method protects data privacy with high probability. Simulation studies demonstrate the effectiveness of the proposed method.

1. INTRODUCTION

With the development of technology for data acquisition and storage, modern datasets have a larger scale, more complex structure, and more practical considerations, which poses new challenges for data analysis. In recent years, large quantities of sensitive data have been collected by individuals and companies. While we want to extract accurate statistical information from such distributed datasets, we must also guard against leakage of sensitive personal information during the training process. This calls for the study of distributed learning under privacy constraints (Pathak et al., 2010; Hamm et al., 2016; Jayaraman et al., 2018). In the privacy literature, differential privacy, first proposed in Dwork et al. (2006), has been the most widely adopted definition of privacy tailored to statistical data analysis, and it has achieved tremendous success in real-world applications. Denote the data universe by 𝒳, and consider a dataset X = {X_i}_{i=1}^n ∈ 𝒳^n, where the X_i are the data observations. (ε, δ)-differential privacy can be defined as follows.

Definition 1. (Differential Privacy in Dwork et al. (2006)) A randomized algorithm A : 𝒳^n → Θ is (ε, δ)-differentially private if for any pair of adjacent datasets X ∈ 𝒳^n and X′ ∈ 𝒳^n, there always holds

P(A(X) ∈ U) ≤ e^ε · P(A(X′) ∈ U) + δ,  (1)

for every subset U ⊆ Θ. Here two datasets X and X′ of the same size are adjacent if and only if their Hamming distance (Lei, 2011) satisfies H(X, X′) = 1.

As we can see, the quantities ε and δ measure the level of privacy loss. There are also several relaxations of differential privacy (see, e.g., Bun & Steinke (2016); Dwork & Rothblum (2016); Mironov (2017); Dong et al. (2019)) designed for ease of analysis. However, in all these definitions, the dataset X is assumed to be fixed, and the probability in (1) is taken only over the randomness of the algorithm A.
Therefore, it is impossible to achieve strict differential privacy without adding auxiliary perturbations to the algorithm. On the other hand, the statistical performance of the output is inevitably deteriorated by the additional randomness. This has led to a large body of work discussing the tradeoff between accuracy and privacy (Wasserman & Zhou, 2010; Bassily et al., 2014; Bun et al., 2018; Duchi et al., 2018; Cai et al., 2019). In this paper, we consider the private sign recovery problem in a distributed system. To be more precise, assume the parameter of interest is a sparse vector, i.e., many of its entries are zero. The task is to identify the signs of the parameter from observations stored on multiple machines while protecting each individual's privacy. The sign recovery problem, as an extension of sparsity pattern recovery, is significant in a broad variety of contexts, including variable selection (Tibshirani, 1996; Miller, 2002), graphical models (Meinshausen & Bühlmann, 2006; Cai et al., 2011), compressed sensing (Candes & Tao, 2005; Donoho, 2006), and signal denoising (Chen et al., 2001). However, this problem has rarely been considered in the privacy community. To address the sign recovery problem, we propose the Median Divide-and-Conquer (Med-DC) method, a simple two-step procedure. First, each local machine estimates the sparse parameter and sends its sign vector to the server; second, the server aggregates these sign vectors by a coordinate-wise median and outputs the final sign estimator. While mean-based divide-and-conquer (also referred to as Mean-DC) approaches have been widely analyzed in the distributed learning literature (Mcdonald et al., 2009; Zhang et al., 2013; Lee et al., 2017; Battey et al., 2018), the median-based counterpart has not yet been well explored. It is well known that naively averaging the local estimators behaves badly for nonlinear and penalized optimization problems.
This is because averaging cannot reduce the bias of the local sub-problems. In particular, for the distributed Lasso problem, as mentioned in Lee et al. (2017), the estimation error of the averaged local Lasso estimator is of the same order as that of the local estimators. However, when only the sign recovery problem is considered, we find that the Med-DC method perfectly fits the nature of the distributed private setup (see Section 2.2 for a more detailed discussion). For the sake of clarity, we only consider the sign recovery problem for sparse mean estimation and sparse linear regression, two fundamental models in statistics. The proposed Med-DC method has the following advantages:

• Consistent recovery. For both sparse mean estimation and sparse linear regression, the Med-DC method consistently recovers the signs of the true parameter with theoretical guarantees. Under some constraints, we can prove that our approach identifies signals larger than C√(log n/N) for some constant C > 0 (where N is the full sample size and n is the local sample size), which coincides with the minimal signal level in the single-machine setting (all data stored on one machine).

• Efficient communication. To recover the signs of the parameter of interest in the distributed setup, a naive approach is to estimate the parameter using existing private distributed estimation methods and take the signs of the estimators. However, these methods usually involve multi-round aggregation of gradient information or local estimators, which seems costly for the simple sign recovery problem. Instead, our approach only aggregates the sign vectors (bit information) in one shot, which is much more communication-efficient.

• Weak privacy. By relaxing differential privacy to a high-probability sense, our deterministic Med-DC method can be proved to be weakly 'private'. We also extend this concept to group privacy.
To the best of our knowledge, this is the first deterministic algorithm with a provable high-probability privacy guarantee. Moreover, as each machine only needs to transmit its sign vector, instead of local estimators or gradient vectors, our proposed method also protects the privacy of each local machine, since gradient sharing can itself result in privacy leakage (Zhu et al., 2019).

• Wide applicability. We believe the Med-DC approach deserves more attention due to its excellent practical performance and ease of implementation. For example, it is promising to apply the Med-DC method to wider classes of models (e.g., generalized linear models, M-estimation, etc.) or to hybridize it with sophisticated distributed algorithms such as the averaged de-biased estimator in Lee et al. (2017) and the Communication-Efficient Accurate Statistical Estimator (CEASE) in Fan et al. (2019).

Notations. For every vector v = (v_1, ..., v_p)^T, denote |v|_2 = (Σ_{l=1}^p v_l²)^{1/2}, |v|_1 = Σ_{l=1}^p |v_l|, and |v|_∞ = max_{1≤l≤p} |v_l|. Moreover, we use supp(v) = {1 ≤ l ≤ p | v_l ≠ 0} for the support of the vector v, and write v_{-l} = (v_1, ..., v_{l-1}, v_{l+1}, ..., v_p)^T. For every matrix A ∈ R^{p_1×p_2}, define ‖A‖ = sup_{|v|_2=1} |Av|_2, ‖A‖_∞ = max_{1≤l_1≤p_1, 1≤l_2≤p_2} |A_{l_1,l_2}|, and ‖A‖_{L∞} = sup_{|v|_∞=1} |Av|_∞. For simplicity, we denote by S^{p-1} and B^p the unit sphere and unit ball in R^p centered at 0. For a sequence of vectors {v_i}_{i=1}^n ⊆ R^p, we denote by med(·) the coordinate-wise median. Lastly, the generic constants are assumed to be independent of m, n, and p.

2.1. MEDIAN BASED DIVIDE-AND-CONQUER

Let µ* = (µ*_1, ..., µ*_p)^T be the true parameter of interest. We assume the vector is sparse in the sense that many entries µ*_l are zero. There are N i.i.d. observations X_i satisfying E[X_i] = µ*, and they are evenly stored on m different machines H_j (where 1 ≤ j ≤ m). Denote X = {X_1, ..., X_N} as the full dataset. For simplicity we assume N = mn, so that each machine has exactly n samples. Our task is to identify the signs of µ* on all coordinates (denoted as sgn(µ*)) in this distributed setup while protecting the privacy of every element X_i on each machine H_j. To recover the signs of the true mean vector µ* privately, the most direct way is to estimate the mean by some existing differentially private algorithm and take the signs of the estimator coordinate-wise. By the post-processing property of differential privacy (Proposition 2.1 in Dwork & Roth (2014)), this sign recovery method is also differentially private. Private mean estimation is a fundamental problem in private statistical analysis and has been studied intensively (Dwork et al., 2006; Lei, 2011; Bassily et al., 2014; Cai et al., 2019). The standard approach is to project the data onto a known bounded domain, and then inject noise calibrated to the diameter of the feasible domain and the privacy level. However, this method requires the input data or the true parameter to lie in a known bounded domain, which seems unsatisfactory in practice. Moreover, since we only want to estimate the signs, which take values in the discrete set {-1, 0, 1}, it seems unnecessary to perturb the mean directly. Indeed, to recover the signs of the true parameter, there is no need to obtain an accurate mean estimate from all data. Instead, we propose a Median based Divide-and-Conquer (Med-DC) approach. To be more precise, we estimate the signs on each local machine H_j and aggregate these sign vectors by taking a median to produce a more accurate sign estimator.
To present our method more clearly, we define the following quantization function Q_λ(·):

Q_λ(x) = sgn( sgn(x) · (|x| - λ)_+ ) = { sgn(x) if |x| > λ;  0 if |x| ≤ λ },

where (·)_+ denotes the shrinkage (soft-thresholding) operator and λ is a thresholding parameter. When x is a vector, Q_λ(x) performs the above operation coordinate-wise. In particular, when λ = 0, the function Q_0(·) acts the same as the sign function sgn(·). We then present our method in Algorithm 1.

Algorithm 1 Median divide-and-conquer (Med-DC) for sparse mean estimation.
Input: Dataset X = {X_1, ..., X_N} evenly divided over m local machines H_j (where j = 1, ..., m), the universal thresholding parameter λ_N.
1: for j = 1, ..., m do
2:   The j-th machine H_j computes the local sample mean X̄_j = n^{-1} Σ_{i∈H_j} X_i. Then H_j sends Q_j = Q_{λ_N}(X̄_j) to the server.
3: end for
4: The server takes the coordinate-wise median Q̂(X) = med( Q_{λ_N}(X̄_j) | 1 ≤ j ≤ m ).
Output: The vector of signs Q̂(X).

The choice of the thresholding parameter will be discussed after Theorem 1 in Section 2.3. We especially mention that there are cases where the median is not uniquely determined. For example, it is possible that there are the same numbers of 0's and 1's, in which case the median can be any value in [0, 1]. To avoid ambiguity, we simply take Q̂_l(X) = 0 (where l denotes the coordinate index) whenever the median Q̂_l(X) is not unique. Another important remark is that the proposed sign recovery algorithm Q̂(·) is deterministic. More precisely, as there is no additional random perturbation in Algorithm 1, the output Q̂(X) is completely determined by the input dataset X. It is nevertheless able to protect data privacy in a weaker sense.

[Figure 1: Illustration of the Med-DC mechanism: the local machines H_1, H_2, H_3, ..., H_m report sign vectors, and the server aggregates them by a coordinate-wise median over the coordinates in S_+, S_-, and S^c.]
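As a concrete illustration, here is a minimal pure-Python sketch of Algorithm 1 (our own code, not the authors'; the data layout and the helper names `quantize` and `med_dc_mean_signs` are illustrative assumptions):

```python
import random
from statistics import median

def quantize(x, lam):
    """Q_lambda(x): sgn(x) if |x| > lambda, and 0 otherwise."""
    if x > lam:
        return 1
    if x < -lam:
        return -1
    return 0

def med_dc_mean_signs(machines, lam):
    """Algorithm 1: each machine reports the quantized signs of its local
    sample mean; the server takes a coordinate-wise median of the reports.
    `machines` is a list of m local datasets, each a list of p-dim samples."""
    p = len(machines[0][0])
    sign_vectors = []
    for data in machines:
        n = len(data)
        local_mean = [sum(x[l] for x in data) / n for l in range(p)]
        sign_vectors.append([quantize(v, lam) for v in local_mean])
    # statistics.median averages the two middle values when m is even, so a
    # tied median (e.g. between 0 and 1) lands at 0.5; int() truncates it to
    # 0, matching the paper's convention for non-unique medians.
    return [int(median(q[l] for q in sign_vectors)) for l in range(p)]

# toy run: mu* = (1, 0, -1), m = 20 machines with n = 100 samples each
random.seed(0)
mu_star = [1.0, 0.0, -1.0]
m, n = 20, 100
machines = [[[mu_star[l] + random.gauss(0.0, 1.0) for l in range(3)]
             for _ in range(n)] for _ in range(m)]
lam_N = 0.2  # a generous threshold for this toy scale
print(med_dc_mean_signs(machines, lam_N))
```

With the gap between zero and nonzero coordinates well above λ_N, the median vote recovers the sign pattern (1, 0, -1) here.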

2.2. INTUITION BEHIND MED-DC

Before presenting the theoretical results for our Med-DC method, we briefly illustrate the intuition behind it. The median over a collection of discrete values in {-1, 0, 1} can be equivalently regarded as a voting game: at each coordinate, we take the element that gets more than half of the votes, and we output 0 when no such element exists. The mechanism of the Med-DC method is visualized in Figure 1.

Recovery Consistency. The key insight of the Med-DC method is that, according to the Berry-Esseen theorem, the distribution of the sample means on the local machines is close to a normal distribution centered at the true parameter µ*. Therefore, on each coordinate l, the local sample means {X̄_{1,l}, ..., X̄_{m,l}} are approximately symmetrically distributed around µ*_l. Based on this observation, the sign recovery consistency of the median mechanism becomes clear: for µ*_l > λ_N (resp. µ*_l < -λ_N), there are likely to be at least m/2 elements larger than λ_N (resp. smaller than -λ_N), which makes the median of the local signs inclined to be 1 (resp. -1). For µ*_l = 0, by the approximate symmetry of the local sample means, the numbers of 1's and -1's are likely to be balanced, so the median tends to be 0.

Weak Privacy. Given the dataset X and an adjacent dataset X′, when applying Algorithm 1 to X′, at most one element among the set of sign vectors {Q_{λ_N}(X̄_j)}_{j=1}^m changes. With high probability, the change of one element does not affect the median value Q̂(X). This interpretation conceptually coincides with the idea of differential privacy in Definition 1. However, this 'privacy' guarantee does not hold for all data. For instance, in the one-dimensional case, it is possible that X produces m/2 positives and m/2 negatives (so Q̂(X) = 0), while the adjacent dataset X′ produces m/2 + 1 positives and m/2 - 1 negatives (so Q̂(X′) = 1), which contradicts the standard definition of differential privacy.
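The weak-privacy intuition can be checked numerically. The sketch below (our own illustration, with hypothetical parameter choices) repeatedly draws a one-dimensional dataset at a null coordinate, perturbs a single observation to form an adjacent dataset, and counts how often the median output changes:

```python
import random
from statistics import median

def local_sign(data, lam):
    """Quantized sign of one machine's sample mean."""
    xbar = sum(data) / len(data)
    return 0 if abs(xbar) <= lam else (1 if xbar > 0 else -1)

def med_dc_output(machines, lam):
    """One coordinate of Algorithm 1: a median vote over the local signs."""
    return int(median(local_sign(d, lam) for d in machines))

random.seed(1)
m, n, lam, trials = 21, 100, 0.2, 200
unchanged = 0
for _ in range(trials):
    # a zero coordinate of mu*: every machine sees pure N(0, 1) noise
    machines = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    out = med_dc_output(machines, lam)
    machines[0][0] += 5.0        # adjacent dataset: one record changed
    out_adj = med_dc_output(machines, lam)
    unchanged += (out == out_adj)
print(unchanged / trials)  # typically 1.0: one record rarely moves the median
```

The fraction of unchanged outputs is essentially one because the tally of votes is almost never balanced enough for a single flipped vote to move the median, which is exactly the high-probability event behind Proposition 1.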
However, by Proposition 1, we can show that the Med-DC method guarantees privacy in a weaker sense with probability tending to 1, which implies that the aforementioned unpleasant case only occurs with small probability.

Connection with Robust Statistics. Our Med-DC method has an intimate connection with the median-of-means (MOM) estimator, a robust mean estimator that has attracted considerable recent interest in the statistics and machine learning communities (Nemirovsky & Yudin, 1983; Yin et al., 2018; Minsker, 2019; Lecué & Lerasle, 2020). To estimate the mean, the MOM estimator takes the median among the local sample means. Both the MOM estimator and our Med-DC method exploit the symmetrization effect of local averaging. As our task is to find the signs of µ*, we only aggregate the signs of each local estimator. Robust statistics studies estimators that are not much influenced by a small portion of the data, which is conceptually similar to differential privacy. Indeed, the connection between robustness and privacy has been pointed out in Dwork & Lei (2009); Smith (2011); Avella-Medina (2019); Brunel & Avella-Medina (2020). In particular, Brunel & Avella-Medina (2020) leverage the MOM estimator and the "Propose-Test-Release" (PTR) framework of Dwork & Lei (2009) to develop a private mean estimator without any boundedness assumptions on the data and the parameter. It is worth noting that we can also combine the PTR framework with our Med-DC approach to develop a strictly differentially private sign estimator.

2.3. THEORY OF MEAN VECTOR SIGN RECOVERY

To discuss the theoretical properties of our method, we introduce the distribution space of X:

P(µ*, C) = { P : E_P[X] = µ*, max_{1≤l≤p} E_P|X_l - µ*_l|³ ≤ C },  (4)

where C > 0 is some constant. This is a rather weak condition on the distribution of X: at each coordinate, we only assume X_l has a finite third-order moment, which is common in the median-of-means literature (Minsker, 2019). Then we have the sign consistency of the proposed estimator Q̂(X).

Theorem 1. (Sign consistency of Med-DC) Let N = mn i.i.d. random vectors {X_1, ..., X_N} sampled from P(µ*, C) be evenly distributed over m subsets H_1, ..., H_m. Moreover, suppose there are sufficiently large constants C_1, C_2, γ_0 > 0 such that
(a) the dimension p satisfies p = O(n^{γ_0}), and we take λ_N = C_1( √(log n/N) + 1/n );
(b) denoting S = supp(µ*), we have min_{l∈S} |µ*_l| ≥ C_2 λ_N.
Then Q̂(X) defined in Algorithm 1 satisfies, for some large γ_1 depending on C_1, C_2, γ_0,

P( Q̂(X) = sgn(µ*) ) ≥ 1 - n^{-γ_1}.

As we can see from assumptions (a) and (b), when m = O(n), the thresholding parameter λ_N can be chosen as C_1 √(log n/N), which means our algorithm can identify signals above the order of O(√(log n/N)). This coincides with the optimal signal-to-noise ratio in the single-machine setting (all N data stored on one machine). Moreover, we note that the assumptions in Theorem 1 do not really require the true parameter to be 'sparse' in the sense that s ≪ p. Instead, we only assume there is a fixed gap (at the level of O(√(log n/N))) between the zero and nonzero elements. In the following proposition, we show that our algorithm protects data privacy with high probability.

Proposition 1. (Privacy of Med-DC) Under the same assumptions as Theorem 1, denote by D(X) the collection of datasets X′ adjacent to X. Then for some large γ_2 > 0,

P( Q̂(X) = Q̂(X′), for all X′ ∈ D(X) ) ≥ 1 - n^{-γ_2}.
(5)

It is worth noting that the privacy guarantee formulated in (5) differs from the standard definition of differential privacy (Definition 1) in several aspects. First, as the algorithm Q̂(·) is deterministic, the randomness in (5) comes from the data generating mechanism. It implies that this algorithm preserves data privacy with probability tending to 1, precluding some extreme cases that happen only with small probability. This is essentially different from the standard definition of differential privacy and its variants, which always assume the dataset is fixed and the randomness comes from the algorithm itself. Second, combining Theorem 1 and Proposition 1, we know that with probability tending to 1, this algorithm recovers the true signs, and the modification of a single entry of X does not affect the output at all (zero privacy loss). Therefore, this method can roughly be regarded as a (0, 0)-differentially private algorithm.

3. PRIVATE SIGN RECOVERY OF LINEAR REGRESSION

3.1. MED-DC OF REGRESSION PARAMETER

Let (X_i, Y_i) (where i = 1, ..., N) be i.i.d. observations from the model

Y = X^T θ* + z,  (6)

where θ* = (θ*_1, ..., θ*_p)^T is the true sparse regression parameter, and z is noise independent of the covariate X. Denote the full dataset as X = {(X_1, Y_1), ..., (X_N, Y_N)}, evenly divided over m local machines H_j (where 1 ≤ j ≤ m). Similarly, we attempt to recover the vector of signs sgn(θ*) = (sgn(θ*_1), ..., sgn(θ*_p))^T. Sparse linear regression is an important topic in the statistical literature. The least absolute shrinkage and selection operator (Lasso), first introduced in Tibshirani (1996), has been one of the most popular approaches because of its benign theoretical guarantees and excellent empirical performance. Recently, a private iterative hard thresholding pursuit algorithm was developed in Cai et al. (2019) to solve the Lasso problem in the differential privacy framework. To deal with the private sign recovery problem in the distributed setup, a naive approach is to solve the private Lasso problem on each local machine and take the signs of the average of all local estimators. However, this method also requires a boundedness assumption on the covariates, and it is overly complicated for sign recovery. More importantly, the average of the local Lasso estimators is likely to produce more non-zero elements, because a coordinate becomes non-zero as soon as one of the local estimators is non-zero at that coordinate. By leveraging the idea of Med-DC, we present Algorithm 2 for sign recovery in sparse regression.

Algorithm 2 Median divide-and-conquer for sparse linear regression (Med-DC Lasso)
Input: Data on local machines {(X_i, Y_i) | i ∈ H_j} for j = 1, ..., m, the universal regularization parameter λ_N.
1: for j = 1, ..., m do
2:   The j-th machine H_j computes the local Lasso estimator

θ̂_j = argmin_{θ∈R^p} (1/2n) Σ_{i∈H_j} (Y_i - X_i^T θ)² + λ_N |θ|_1.  (7)
Then the j-th local machine sends Q_0(θ̂_j) to the server.
3: end for
4: The server takes the coordinate-wise median Q̂(X) = med( Q_0(θ̂_j) | 1 ≤ j ≤ m ).
Output: The vector of signs Q̂(X).

As in Algorithm 1, every step of the Med-DC Lasso method is deterministic. Moreover, each local subproblem (7) can be solved efficiently by many well-developed algorithms, such as FISTA (Beck & Teboulle, 2009) and ADMM (Boyd et al., 2011).
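As a sketch of Algorithm 2 (our own code, not the authors' implementation), each local subproblem can be solved by plain ISTA, the simple proximal-gradient scheme of which FISTA is an accelerated variant; the step size, iteration count, and toy dimensions below are illustrative assumptions:

```python
import random
from statistics import median

def soft_threshold(x, t):
    """Proximal map of t*|.|_1 (soft-thresholding); produces exact zeros."""
    return max(x - t, 0.0) if x > 0 else min(x + t, 0.0)

def local_lasso(X, y, lam, step=0.3, iters=500):
    """ISTA for the local Lasso subproblem (7). `step` is assumed to be
    below 1/L, where L is the largest eigenvalue of the local Gram matrix
    (which holds with high probability for this toy design)."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][l] * theta[l] for l in range(p)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][l] * resid[i] for i in range(n)) / n
                for l in range(p)]
        theta = [soft_threshold(theta[l] - step * grad[l], step * lam)
                 for l in range(p)]
    return theta

def med_dc_lasso_signs(machines, lam):
    """Algorithm 2: coordinate-wise median of the local Lasso sign vectors."""
    signs = []
    for X, y in machines:
        theta = local_lasso(X, y, lam)
        signs.append([0 if t == 0.0 else (1 if t > 0 else -1) for t in theta])
    p = len(signs[0])
    return [int(median(s[l] for s in signs)) for l in range(p)]

# toy run: theta* = (1, 0, -1), m = 9 machines, n = 100 samples each
random.seed(2)
theta_star = [1.0, 0.0, -1.0]
m, n = 9, 100
machines = []
for _ in range(m):
    X = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
    y = [sum(X[i][l] * theta_star[l] for l in range(3))
         + random.gauss(0.0, 1.0) for i in range(n)]
    machines.append((X, y))
signs_out = med_dc_lasso_signs(machines, lam=0.2)
print(signs_out)
```

Each machine only transmits its three-entry sign vector, and the median vote recovers the sign pattern (1, 0, -1) in this toy run even though individual local Lasso solutions may report spurious nonzeros.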

3.2. THEORY OF REGRESSION PARAMETER SIGN RECOVERY

For linear regression, we consider the following distribution space:

P_{X,Y}(θ*, η_1, C_1, η_2, C_2) = { P : sup_{|v|_2=1} E_P exp(η_1 |v^T X|²) ≤ C_1; z = Y - X^T θ*, z ⊥ X, E_P exp(η_2 |z|²) ≤ C_2 },  (8)

where C_1, C_2, η_1, η_2 are positive constants. This implies that both the covariate vector X and the noise z are sub-Gaussian. Then we have the following sign recovery consistency result.

Theorem 2. (Sign consistency of Med-DC Lasso) Let N = mn samples X = {(X_1, Y_1), ..., (X_N, Y_N)} from P_{X,Y}(θ*, η_1, C_1, η_2, C_2) be evenly distributed over m subsets H_1, ..., H_m. Moreover, suppose there are sufficiently large constants C_3, C_4, ∆_0 > 0 such that
(a) the dimension p is fixed, and we take λ_N = C_3( √(log n/N) + log n/n );
(b) denoting S = supp(θ*), the minimal signal satisfies min_{l∈S} |θ*_l| ≥ C_4 λ_N;
(c) the covariance matrix Σ = E[XX^T] is positive definite, and, writing Σ^{-1} = (ω_1, ..., ω_p), we have max_{l∈S^c} |ω_{l,-l}|_1 / ω_{l,l} ≤ 1 - ∆_0.
Then Q̂(X) defined in Algorithm 2 satisfies, for some large γ_1 depending on C_3, C_4, ∆_0,

P( Q̂(X) = sgn(θ*) ) ≥ 1 - n^{-γ_1}.

From assumption (b), when m log n = O(n), the minimal signal has the order O(√(log n/N)), which meets the "beta-min" condition (Wainwright, 2009) of the standard Lasso problem in the full-sample case (all N samples stored on a single machine). Note that the regularization parameter λ_N is a universal constant across all local machines. By assumption (a), the regularization parameter satisfies λ_N ≍ √(log n/N), which is smaller than the standard setting O(√(log n/n)) (Lee et al., 2017). Therefore, the local estimators θ̂_j are not very sparse, because the regularization parameter λ_N is unable to annihilate the noise brought by the local data. However, by taking the median among the local signs, the noise is canceled out and the signals become detectable. Owing to this smaller scale of λ_N, signals of smaller magnitude can be identified.
Assumption (c) implies that, for l ∈ S^c, the l-th row of the precision matrix Σ^{-1} is dominated by its diagonal entry ω_{l,l}. It can be regarded as a stricter irrepresentability condition (Zhao & Yu, 2006; Wainwright, 2009).

Proposition 2. (Privacy of Med-DC Lasso) Under the same assumptions as Theorem 2, denote by D(X) the collection of datasets X′ adjacent to X. Then for some large γ_2 > 0,

P( Q̂(X) = Q̂(X′), for all X′ ∈ D(X) ) ≥ 1 - n^{-γ_2}.  (9)

Similarly to (5), equation (9) is a weakened privacy guarantee that protects data privacy with high probability. In addition, following the proofs of Propositions 1 and 2, our proposed methods also guarantee group privacy with high probability (Section 10.1 in Dwork & Roth (2014)).

Corollary 1. (Group Privacy) Under the same assumptions as Theorem 2, denote by D_k(X) the collection of datasets X′ that differ from X in at most k elements. Then for some large γ_3 > 0,

P( Q̂(X) = Q̂(X′), for all X′ ∈ D_k(X) ) ≥ 1 - n^{-γ_3}.

4.1. RESULTS FOR SPARSE MEAN ESTIMATION

In the first experiment, we consider the sparse mean estimation problem: observations {X_1, ..., X_N} are sampled from the model X_i = µ* + z_i, where the noises z_i are drawn from the multivariate normal distribution N(0, I_p). We fix the dimension p = 200. The parameter of interest is defined as

µ* = ( 1, 0.8, ..., 0.2, 0, -0.2, ..., -0.8, -1, 0^T_{p-11} )^T.  (10)

We compare our Med-DC method with: (a) Mean-DC: aggregate the local reports by averaging instead of the median; (b) CWZ: the private algorithm of Cai et al. (2019) on all samples in a single machine, where, as in Cai et al. (2019), we adopt the oracle T = 2√(log N), s = 10, and (ε, δ) = (0.5, 10/N^{1.1}); (c) Pooled-Mean: take the average of all samples and apply the quantization function Q_{λ_N}(X̄). Note that Mean-DC and CWZ need to transmit the local estimators to the server, which is more communication-costly. The performance of sign recovery is measured by the following four criteria:

• Positive and Negative False Discovery Rate:
PFDR = Σ_{l∉S_+} I{Q̂_l(X) = 1} / max[ Σ_{l=1}^p I{Q̂_l(X) = 1}, 1 ],
NFDR = Σ_{l∉S_-} I{Q̂_l(X) = -1} / max[ Σ_{l=1}^p I{Q̂_l(X) = -1}, 1 ].

• Total False Discovery Rate and Power:
FDR = Σ_{l∉S} I{Q̂_l(X) ≠ 0} / max[ Σ_{l=1}^p I{Q̂_l(X) ≠ 0}, 1 ],
Power = Σ_{l∈S} I{Q̂_l(X) ≠ 0} / |S|.

As shown in Table 1, our Med-DC approach clearly outperforms the non-private Mean-DC and the private CWZ method. Compared with the pooled-mean estimator, our method has higher power.
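For reference, the four criteria can be computed as below (our own helper; the function name is illustrative, and Power is taken as the detected fraction of the true support, a standard convention):

```python
def sign_recovery_metrics(q_hat, q_true):
    """PFDR, NFDR, FDR and Power of an estimated sign vector `q_hat`
    against the true signs `q_true` (entries in {-1, 0, 1})."""
    S_plus = {l for l, s in enumerate(q_true) if s == 1}
    S_minus = {l for l, s in enumerate(q_true) if s == -1}
    S = S_plus | S_minus
    pos = [l for l, s in enumerate(q_hat) if s == 1]
    neg = [l for l, s in enumerate(q_hat) if s == -1]
    nonzero = [l for l, s in enumerate(q_hat) if s != 0]
    return {
        "PFDR": sum(l not in S_plus for l in pos) / max(len(pos), 1),
        "NFDR": sum(l not in S_minus for l in neg) / max(len(neg), 1),
        "FDR": sum(l not in S for l in nonzero) / max(len(nonzero), 1),
        "Power": sum(q_hat[l] != 0 for l in S) / max(len(S), 1),
    }

# example: one spurious positive, one spurious negative, one sign flip
print(sign_recovery_metrics([1, 1, 0, -1], [1, -1, 0, 0]))
```

The `max[·, 1]` guards in the denominators simply avoid division by zero when no coordinate is declared positive, negative, or nonzero.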

4.2. RESULTS FOR SPARSE LINEAR REGRESSION

In the second experiment, we consider the linear model defined in (6). The noises are i.i.d. from N(0, 1), and the i.i.d. covariate vectors X_i^T = (X_{i,1}, ..., X_{i,p}) (i = 1, ..., N) are drawn from a multivariate normal distribution N(0, Σ). The covariance matrix Σ is a p × p Toeplitz matrix with (i, j)-th entry Σ_{ij} = 0.5^{|i-j|}, where 1 ≤ i, j ≤ p. We fix the dimension p = 200 and set the true coefficient θ* the same as µ* in (10). Similarly, we set m = 100 and n = 200. For the choice of the regularization parameter λ_N, we first choose λ_N by five-fold cross-validation on the dataset of the first local machine H_1 and then, motivated by the theoretical scale difference in Theorem 2 and as suggested in the classical literature (Wainwright, 2009), divide it by √m. In addition to Mean-DC, we mainly compare with two other methods: (d) CSL: use the Communication-efficient Surrogate Likelihood (CSL) framework in Jordan et al. (2019) to obtain an estimator θ̂_CSL of the true parameter and take its signs; (e) Pooled-Lasso: solve the Lasso problem on all data in a single machine and take the signs. Note that none of the above-mentioned methods is private. Both the Mean-DC method and the CSL method require transmitting local parameters or multi-round gradient information, which is more communication-costly than our method. In particular, the CSL method includes an iterative refinement of the estimator; in our simulation study, we present the results of the five-step CSL method. For each experiment, we repeat 500 independent simulations and report the PFDR, NFDR, FDR, and power.

Table 2: The PFDR, NFDR, FDR, power and their standard errors (in parentheses) of different methods under sample size N = 200 × 100, local sample size n = 200.
          Med-DC            Mean-DC           CSL               Pooled-Lasso
PFDR      0.0639 (0.1329)   0.9454 (0.0425)   0.4453 (0.3447)   0.5703 (0.1758)
NFDR      0.0582 (0.1199)   0.9456 (0.0425)   0.4325 (0.3480)   0.5808 (0.1717)
FDR       0.0645 (0.1184)   0.9457 (0.0424)   0.4466 (0.3397)   0.5896 (0.1464)
Power     1.0000 (0.0000)   1.0000 (0.0000)   0.9998 (0.0045)   1.0000 (0.0000)

As can be observed from Table 2, while all these methods select the true support set, the Med-DC Lasso method has a clearly smaller false discovery rate than the others.
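A note on generating the design above: a Gaussian vector with Toeplitz covariance Σ_ij = 0.5^{|i-j|} can be drawn without a Cholesky factorization through the stationary AR(1) recursion. The sketch below (our own, with an illustrative empirical check and hypothetical numbers for the λ_N/√m rescaling) demonstrates both points:

```python
import math
import random

def ar1_row(p, rho, rng):
    """One draw of X ~ N(0, Sigma) with Sigma_ij = rho^{|i-j|}: the
    stationary AR(1) recursion x_1 = z_1, x_i = rho*x_{i-1}
    + sqrt(1 - rho^2)*z_i has exactly this Toeplitz covariance."""
    s = math.sqrt(1.0 - rho * rho)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(1, p):
        x.append(rho * x[-1] + s * rng.gauss(0.0, 1.0))
    return x

# empirical check of the covariance structure
rng = random.Random(3)
rho, p, reps = 0.5, 4, 20000
draws = [ar1_row(p, rho, rng) for _ in range(reps)]
cov01 = sum(x[0] * x[1] for x in draws) / reps   # should be near 0.5
cov02 = sum(x[0] * x[2] for x in draws) / reps   # should be near 0.25
var0 = sum(x[0] ** 2 for x in draws) / reps      # should be near 1.0
print(round(cov01, 2), round(cov02, 2), round(var0, 2))

# the local regularization parameter: cross-validated value scaled down by
# sqrt(m), matching the lambda_N ~ sqrt(log n / N) theory (toy numbers)
m, lam_cv = 100, 0.08
lam_local = lam_cv / math.sqrt(m)
```

The recursion costs O(p) per draw, which is convenient for the p = 200 design used in this section.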

A TECHNICAL LEMMAS

Lemma 1. (Berry-Esseen Theorem, Theorem 9.1.3 in Chow & Teicher (2012)) If {X_i, i ≥ 1} are i.i.d. mean-zero random variables with E[X_i²] = σ² and E|X_i|³ < ∞, then there exists a constant C_B > 0 such that

sup_{-∞<x<∞} | P( Σ_{i=1}^n X_i < √n σ x ) - Φ(x) | ≤ C_B / √n.

Lemma 2. (Exponential Inequality, Lemma 1 in Cai & Liu (2011)) Let X_1, ..., X_n be i.i.d. random variables with zero mean. Suppose that there exist some η > 0 and C > 0 such that E[X_i² e^{η|X_i|}] ≤ C. Then, uniformly for 0 < x ≤ C and n ≥ 1,

P( (1/n) Σ_{i=1}^n X_i ≥ (η + η^{-1}) x ) ≤ exp( -nx²/C ).

Lemma 3. Let N (= mn) i.i.d. random variables X_1, ..., X_N be evenly distributed over m subsets H_1, ..., H_m. Suppose E[X_i] = 0, E[X_i²] = σ², E|X_i|³ < ∞. Denote X̄_j = Σ_{i∈H_j} X_i / n as the local sample mean on H_j. For every γ > 1, denote

c_γ = C( 1/n + k/(m√n) + √(γ log n/(mn)) ),

where C > 0 is sufficiently large. Then for every fixed non-negative constant k, there is

P( Σ_{j=1}^m I{X̄_j < c_γ} ≤ m/2 + k ) + P( Σ_{j=1}^m I{X̄_j > -c_γ} ≤ m/2 + k ) = O(n^{-γ}).  (11)

Proof. For every x > 0, there is

P( Σ_{j=1}^m I{X̄_j < x} ≤ m/2 + k )
= P( (1/m) Σ_{j=1}^m [ I{X̄_j < x} - P(X̄_1 < x) ] ≤ 1/2 + k/m - P(X̄_1 < x) )
≤ P( (1/m) Σ_{j=1}^m [ I{X̄_j < x} - P(X̄_1 < x) ] ≤ 1/2 + k/m + C_B/√n - Φ(√n x/σ) )
= P( (1/m) Σ_{j=1}^m [ I{X̄_j < x} - P(X̄_1 < x) ] ≤ -√(Cγ log n/m) ),

where the second line uses the Berry-Esseen theorem (Lemma 1), and x is given by

x = (σ/√n) Φ^{-1}( 1/2 + k/m + C_B/√n + √(Cγ log n/m) ).

Applying Lemma 2 to the i.i.d. sequence I{X̄_j < x} - P(X̄_1 < x), we have

P( (1/m) Σ_{j=1}^m [ I{X̄_j < x} - P(X̄_1 < x) ] ≤ -√(Cγ log n/m) ) = O(n^{-γ}),

for some C large enough. Moreover, denoting by ψ the density of the standard normal distribution, we have the elementary fact

|Φ^{-1}(x_0)| = |Φ^{-1}(x_0) - Φ^{-1}(1/2)| ≤ |x_0 - 1/2| / ψ( Φ^{-1}(3/4) ),

which holds for any 1/4 ≤ x_0 < 3/4. On the other hand, we know that 1/4 ≤ Φ(√n x/σ) < 3/4 holds for m, n sufficiently large.
Denote C_δ = 1/ψ( Φ^{-1}(3/4) ). Then there is

P( Σ_{j=1}^m I{ X̄_j < C_δ (σ/√n)( C_B/√n + k/m + √(Cγ log n/m) ) } ≤ m/2 + k )
≤ P( (1/m) Σ_{j=1}^m [ I{X̄_j < x} - P(X̄_1 < x) ] ≤ -√(Cγ log n/m) ) ≤ n^{-γ}.

Therefore, if we choose C ≥ max{ C_δ C_B σ, C_δ √C σ }, the bound on the first term on the left-hand side of (11) is proved. By repeating the same procedure, we can prove the bound on the second term, which yields the desired result.

Lemma 4. Let X_1, ..., X_n be i.i.d. random vectors sampled from the distribution in (8). Denote its covariance matrix as Σ and the sample covariance matrix as Σ̂. Then for every γ > 1, there exists a constant C > 0 such that

P( ‖Σ^{-1} Σ̂ - I‖_∞ ≥ C √(log n/n) ) = O(n^{-γ}).

Proof. Recall that the inverse covariance matrix is denoted as Σ^{-1} = (ω_1, ..., ω_p), and let e_l be the l-th coordinate vector. Then the (l_1, l_2)-entry of the matrix Σ^{-1} Σ̂ - I is

( Σ^{-1} Σ̂ - I )_{l_1,l_2} = (1/n) Σ_{i=1}^n ω_{l_1}^T X_i · e_{l_2}^T X_i - δ_{l_1,l_2}.

Since the dimension p is assumed to be bounded and the covariance matrix is positive definite, there exists a constant ρ ∈ (0, 1) such that ρ ≤ Λ_min(Σ) ≤ Λ_max(Σ) ≤ ρ^{-1}. Then we have max_{1≤l≤p} |ω_l|_2 ≤ ‖Σ^{-1}‖ ≤ ρ^{-1}. Since X is sub-Gaussian by (8), we obtain

max_{1≤l_1,l_2≤p} E exp( η_1 ρ | ω_{l_1}^T X_i · e_{l_2}^T X_i - δ_{l_1,l_2} | ) ≤ e^{η_1 ρ} · sup_{|v|_2≤1} E exp( η_1 |v^T X|² ) ≤ e^{η_1 ρ} C_1.

Therefore, we can apply Lemma 2 to each coordinate and obtain

P( ‖Σ^{-1} Σ̂ - I‖_∞ ≥ C √(log n/n) )
≤ p² max_{1≤l_1,l_2≤p} P( | (1/n) Σ_{i=1}^n ω_{l_1}^T X_i · e_{l_2}^T X_i - δ_{l_1,l_2} | ≥ C √(log n/n) )
= O(p² n^{-γ-2}) = O(n^{-γ}),

for some C sufficiently large. Therefore, the lemma is proved.
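As a quick numerical sanity check of the Berry-Esseen rate in Lemma 1 (our own illustration, not part of the proofs): the Kolmogorov distance between a standardized sum of centered Exp(1) variables and Φ shrinks as n grows, roughly like 1/√n, up to a Monte Carlo noise floor of about 0.02 at this number of replications.

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF Phi via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_distance_to_normal(n, reps, rng):
    """Monte Carlo estimate of sup_x |P(S_n/(sigma*sqrt(n)) < x) - Phi(x)|
    for sums of centered Exp(1) variables (mean 0, variance 1, finite
    third moment, as Lemma 1 requires)."""
    sums = sorted(
        sum(rng.expovariate(1.0) - 1.0 for _ in range(n)) / math.sqrt(n)
        for _ in range(reps))
    d = 0.0
    for k, s in enumerate(sums):
        d = max(d, abs((k + 1) / reps - normal_cdf(s)),
                abs(k / reps - normal_cdf(s)))
    return d

rng = random.Random(4)
d_small = ks_distance_to_normal(5, 4000, rng)
d_large = ks_distance_to_normal(80, 4000, rng)
print(round(d_small, 3), round(d_large, 3))  # the distance shrinks with n
```

This is exactly the uniform normal-approximation error that drives the choice of x in the proof of Lemma 3 above.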

B PROOF OF MAIN RESULTS

Proof of Theorem 1. It suffices to show that

max_{1≤l≤p} P( Q̂_l(X) ≠ sgn(µ*_l) ) = O(n^{-γ}),  (12)

where γ > 0 is large enough. Then the theorem follows from

1 - P( Q̂(X) = sgn(µ*) ) ≤ p · max_{1≤l≤p} P( Q̂_l(X) ≠ sgn(µ*_l) ) = O(p n^{-γ}) = O(n^{-γ+γ_0}),

provided that γ > γ_0. For the l-th coordinate, we first suppose that µ*_l = 0. Since the X_i are sampled from the distribution (4), by Lemma 3 with k = 0, if λ_N = C_0( √(log n/(mn)) + 1/n ) ≥ c_γ, we have

P( Q̂_l(X) ≠ sgn(µ*_l) ) ≤ P( Σ_{j=1}^m I{X̄_{j,l} < λ_N} ≤ m/2 ) + P( Σ_{j=1}^m I{X̄_{j,l} > -λ_N} ≤ m/2 ) = O(n^{-γ}).

Next we assume µ*_l > 0. Using Lemma 3 again, if µ*_l ≥ C_1( √(log n/(mn)) + 1/n ) ≥ c_γ + λ_N, then there is

P( Q̂_l(X) ≠ sgn(µ*_l) ) ≤ P( Σ_{j=1}^m I{X̄_{j,l} > λ_N} ≤ m/2 ) ≤ P( Σ_{j=1}^m I{X̄_{j,l} - µ*_l > -c_γ} ≤ m/2 ) = O(n^{-γ}).

Lastly, when µ*_l < 0, the proof is the same as above. Therefore (12) is proved.

Proof of Proposition 1. For X′ ∈ D(X), denote the elements of X′ as X′_i (where 1 ≤ i ≤ N), where Σ_{i=1}^N I(X_i ≠ X′_i) = 1. Moreover, since these data are stored on m different machines H_1, ..., H_m, we have

0 ≤ Σ_{j=1}^m I( X̄′_j ≠ X̄_j ) ≤ 1.  (13)

Noticing that

P( Q̂(X) ≠ Q̂(X′) for some X′ ∈ D(X) )
≤ P( Q̂(X′) ≠ sgn(µ*) for some X′ ∈ D(X); Q̂(X) = sgn(µ*) ) + P( Q̂(X) ≠ sgn(µ*) )
≤ P( Q̂(X′) ≠ sgn(µ*) for some X′ ∈ D(X) ) + O(n^{-γ_1}),

we only need to show that

P( Q̂(X′) ≠ sgn(µ*) for some X′ ∈ D(X) ) = O(n^{-γ_2}).  (14)

For the l-th coordinate, we first suppose that µ*_l = 0. Then there is

P( Q̂_l(X′) ≠ 0 for some X′ ∈ D(X) )
≤ P( Σ_{j=1}^m I{X̄′_{j,l} < λ_N} ≤ m/2 for some X′ ∈ D(X) ) + P( Σ_{j=1}^m I{X̄′_{j,l} > -λ_N} ≤ m/2 for some X′ ∈ D(X) )
≤ P( Σ_{j=1}^m I{X̄_{j,l} < λ_N} ≤ m/2 + 1 ) + P( Σ_{j=1}^m I{X̄_{j,l} > -λ_N} ≤ m/2 + 1 ),

where the last inequality uses (13). Using Lemma 3 with k = 1, with λ_N properly chosen, there is

P( Σ_{j=1}^m I{X̄_{j,l} < λ_N} ≤ m/2 + 1 ) + P( Σ_{j=1}^m I{X̄_{j,l} > -λ_N} ≤ m/2 + 1 ) = O(n^{-γ}),

for some γ > 0.
Similarly, if $\mu^*_l>0$, then
$$\mathbb{P}\big(\widehat Q_l(X')\ne 1\text{ for some }X'\in D(X)\big) \le \mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\{\bar X'_{j,l}>\lambda_N\}\le\frac m2\text{ for some }X'\in D(X)\bigg) \le \mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\{\bar X_{j,l}>\lambda_N\}\le\frac m2+1\bigg) \le \mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\{\bar X_{j,l}-\mu^*_l>\lambda_N-\mu^*_l\}\le\frac m2+1\bigg) = O(n^{-\gamma}),$$
where the penultimate line uses (13) and the last line uses Lemma 3 with $k=1$. The proof when $\mu^*_l<0$ is similar; therefore
$$\mathbb{P}\big(\widehat Q(X')\ne\operatorname{sgn}(\mu^*)\text{ for some }X'\in D(X)\big) \le p\max_{1\le l\le p}\mathbb{P}\big(\widehat Q_l(X')\ne\operatorname{sgn}(\mu^*_l)\text{ for some }X'\in D(X)\big) = O(pn^{-\gamma}) = O(n^{-\gamma_2}),$$
which proves (14). Therefore the proposition is proved.

Proof of Theorem 2. For each $j\in\{1,\ldots,m\}$, taking the sub-gradient of (7) at $\widehat\theta_j$, we have
$$-\frac1n\sum_{i\in H_j}(Y_i-X_i^T\widehat\theta_j)X_i + \lambda_N Z_j = 0,$$
where $Z_j$ is the sub-gradient satisfying $|Z_j|_\infty\le1$. Rearranging the terms and multiplying both sides by $\Sigma^{-1}$, we have
$$\widehat\theta_j-\theta^* = \bigg(I-\Sigma^{-1}\frac1n\sum_{i\in H_j}X_iX_i^T\bigg)(\widehat\theta_j-\theta^*) + \frac1n\sum_{i\in H_j}\Sigma^{-1}X_iz_i - \lambda_N\Sigma^{-1}Z_j. \qquad (15)$$
Taking $\eta=\min\{\eta_1\rho,\eta_2\}$, since the noise and covariates are assumed to be sub-Gaussian in (8), for each coordinate $l\in\{1,\ldots,p\}$ we have
$$\max_{1\le l\le p}\mathbb{E}\exp\big(\eta|\omega_l^T X\cdot z|\big) \le \max_{1\le l\le p}\mathbb{E}\exp\Big(\tfrac12\eta\rho|\omega_l^T X|^2+\tfrac12\eta_2|z|^2\Big) \le \max_{1\le l\le p}\Big(\mathbb{E}\exp\big(\eta\rho|\omega_l^T X|^2\big)\cdot\mathbb{E}\exp\big(\eta_2|z|^2\big)\Big)^{1/2} \le (C_1C_2)^{1/2}.$$
Therefore, by Lemma 2 there exists a constant $\widetilde C_1>0$ such that
$$\max_{1\le j\le m}\bigg|\frac1n\sum_{i\in H_j}\Sigma^{-1}X_iz_i\bigg|_\infty \le \widetilde C_1\sqrt{\frac{\log n}{n}} \qquad (16)$$
with probability larger than $1-O(n^{-\gamma})$. On the other hand, by the fact that $|Z_j|_\infty\le1$, we have
$$|\lambda_N\Sigma^{-1}Z_j|_\infty \le \lambda_N\|\Sigma^{-1}\|_{L_\infty}. \qquad (17)$$
Moreover, by Lemma 4, we know
$$\max_{1\le j\le m}\bigg\|I-\Sigma^{-1}\frac1n\sum_{i\in H_j}X_iX_i^T\bigg\|_{L_\infty} = O_P\bigg(\sqrt{\frac{\log n}{n}}\bigg). \qquad (18)$$
Substituting (16), (17), and (18) into (15), we have
$$|\widehat\theta_j-\theta^*|_\infty \le 2\lambda_N\|\Sigma^{-1}\|_{L_\infty} + 2\widetilde C_1\sqrt{\frac{\log n}{n}}.$$
From (15), the $l$-th coordinate can be written in the following form:
$$\widehat\theta_{j,l}-\theta^*_l = -\lambda_N\omega_{l,l}Z_{j,l} - \lambda_N\omega_{l,-l}^T Z_{j,-l} + \frac1n\sum_{i\in H_j}\omega_l^T X_iz_i + o_P\bigg(\sqrt{\frac{\log n}{n}}\bigg). \qquad (19)$$
It remains to rehash the argument in the proof of Theorem 1: from Lemma 5 below we have
$$1-\mathbb{P}\big(\widehat Q(X)=\operatorname{sgn}(\theta^*)\big) \le p\cdot\max_{1\le l\le p}\mathbb{P}\big(\widehat Q_l(X)\ne\operatorname{sgn}(\theta^*_l)\big) = O(n^{-\gamma+1}).$$
Therefore Theorem 2 is proved.

Lemma 5. Assume the same assumptions as in Theorem 2.
For every $1\le l\le p$, we have
$$\mathbb{P}\big(\widehat Q_l(X)\ne\operatorname{sgn}(\theta^*_l)\big) = O(n^{-\gamma}) \qquad (20)$$
for arbitrarily fixed $\gamma>1$.

Proof. When $\theta^*_l=0$, we know that
$$\mathbb{P}\big(\widehat Q_l(X)\ne0\big) \le \mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\{\widehat\theta_{j,l}>0\}\ge\frac m2\bigg) + \mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\{\widehat\theta_{j,l}<0\}\ge\frac m2\bigg).$$
When $\widehat\theta_{j,l}>0$, we know that $Z_{j,l}=1$ (see (19)). Moreover, by assumption (c), we have
$$\lambda_N\omega_{l,l}Z_{j,l}+\lambda_N\omega_{l,-l}^T Z_{j,-l} \ge \lambda_N\omega_{l,l}-\lambda_N|\omega_{l,-l}|_1 \ge \Delta_0\lambda_N\omega_{l,l}.$$
Therefore from (19) we have that
$$\mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\{\widehat\theta_{j,l}>0\}\ge\frac m2\bigg) \le \mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\bigg\{\frac1n\sum_{i\in H_j}\omega_l^T X_iz_i\le\frac12\Delta_0\lambda_N\omega_{l,l}\bigg\}\le\frac m2\bigg).$$
Applying Lemma 3 with $k=0$ to the i.i.d. random variables $\omega_l^T X_iz_i$, we can prove (20) by taking
$$\lambda_N = \frac{2\widetilde C_1}{\Delta_0\omega_{l,l}}\bigg(\sqrt{\frac{\log n}{mn}}+\frac{\log n}{n}\bigg) \qquad (21)$$
with the constant sufficiently large. Repeating the argument for the other half, we can prove the case $\theta^*_l=0$.

When $\theta^*_l>0$, we similarly have
$$\mathbb{P}\big(\widehat Q_l(X)\ne1\big) \le \mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\bigg\{\frac1n\sum_{i\in H_j}\omega_l^T X_iz_i\le-\big(\theta^*_l-2\|\Sigma^{-1}\|_{L_\infty}\lambda_N\big)\bigg\}\ge\frac m2\bigg).$$
Applying Lemma 3 we can show that
$$\mathbb{P}\bigg(\sum_{j=1}^m\mathbb{I}\bigg\{\frac1n\sum_{i\in H_j}\omega_l^T X_iz_i\le-\big(\theta^*_l-2\|\Sigma^{-1}\|_{L_\infty}\lambda_N\big)\bigg\}\ge\frac m2\bigg) = O(n^{-\gamma}),$$
provided that $\theta^*_l-2\|\Sigma^{-1}\|_{L_\infty}\lambda_N \ge \widetilde C_2\big(\sqrt{\frac{\log n}{mn}}+\frac{\log n}{n}\big)$ with $\widetilde C_2$ sufficiently large. Combining with (21), this holds whenever $\theta^*_l\ge\widetilde C_3\lambda_N$ for some $\widetilde C_3>0$. When $\theta^*_l<0$, the proof is essentially the same as above, hence we omit it for brevity. Thus the lemma is proved.

Proof of Proposition 2. The proof is similar to that of Proposition 1. For $X'\in D(X)$, denote the elements of $X'$ by $(X'_i,Y'_i)$ ($1\le i\le N$), where $\sum_{i=1}^N\mathbb{I}\{(X'_i,Y'_i)\ne(X_i,Y_i)\}=1$. Since these data are stored in $m$ machines, we denote by $\widehat\theta'_j$ the local estimator given by the data $\{(X'_i,Y'_i)\mid i\in H_j\}$; then
$$0\le\sum_{j=1}^m\mathbb{I}\{\widehat\theta'_j\ne\widehat\theta_j\}\le1. \qquad (22)$$
Noticing that
$$\mathbb{P}\big(\widehat Q(X)\ne\widehat Q(X')\text{ for some }X'\in D(X)\big) \le \mathbb{P}\big(\widehat Q(X')\ne\operatorname{sgn}(\theta^*)\text{ for some }X'\in D(X);\,\widehat Q(X)=\operatorname{sgn}(\theta^*)\big) + \mathbb{P}\big(\widehat Q(X)\ne\operatorname{sgn}(\theta^*)\big) \le \mathbb{P}\big(\widehat Q(X')\ne\operatorname{sgn}(\theta^*)\text{ for some }X'\in D(X)\big) + O(n^{-\gamma_1}),$$
we only need to prove
$$\mathbb{P}\big(\widehat Q(X')\ne\operatorname{sgn}(\theta^*)\text{ for some }X'\in D(X)\big) = O(n^{-\gamma_2}). \qquad (23)$$
For each coordinate, the bound parallels that in Proposition 1, where the corresponding counting inequality uses (22); using the expansion (19) and rehashing the proof of Lemma 5 then gives the $O(n^{-\gamma})$ bound for each coordinate. A similar argument applies to the case $\theta^*_l<0$, which concludes the proof of (23). Therefore, Proposition 2 is proved.
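The mechanism analyzed above can be illustrated numerically. The sketch below implements the Med-DC sign estimator for the sparse-mean model and checks the stability property behind Proposition 1: replacing a single observation changes at most one local sign vector, which almost never moves the coordinate-wise median. The Gaussian data, the constant 4 in $\lambda_N$, and the size of the perturbation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def med_dc_sign(X, lam):
    """Med-DC sign recovery sketch: each machine thresholds its local mean
    at lam and reports a sign vector in {-1, 0, +1}^p; the center returns
    the coordinate-wise median of the reported sign vectors."""
    local_means = X.mean(axis=1)                            # shape (m, p)
    local_signs = np.where(local_means > lam, 1,
                  np.where(local_means < -lam, -1, 0))
    return np.median(local_signs, axis=0).astype(int)

m, n, p = 100, 200, 20
mu = np.zeros(p)
mu[:5], mu[5:10] = 0.5, -0.5                                # sparse true mean
X = mu + rng.standard_normal((m, n, p))                     # m machines, n samples each

# lam follows the rate in Theorem 1; the constant 4 is an illustrative choice.
lam = 4 * (np.sqrt(np.log(n) / (m * n)) + 1.0 / n)

q = med_dc_sign(X, lam)

# An adjacent dataset: one observation on machine 0 is replaced.  Its local
# mean moves by O(1/n), so the coordinate-wise median is unchanged w.h.p.
X_adj = X.copy()
X_adj[0, 0] = X[0, 0] + 5.0
q_adj = med_dc_sign(X_adj, lam)
```

With these margins the recovered sign vector matches $\operatorname{sgn}(\mu^*)$ and is identical on the adjacent dataset, which is exactly the high-probability privacy event in Proposition 1.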



$\|A\|_{L_\infty} = \max_{|v|_\infty=1}|Av|_\infty$ as various matrix norms, and $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$ as the largest and smallest eigenvalues of $A$, respectively. We use $\mathbb{I}(\cdot)$ as the indicator function and $\operatorname{sgn}(\cdot)$ as the sign function. For two sequences $a_n, b_n$, we write $a_n\asymp b_n$ when $a_n=O(b_n)$ and $b_n=O(a_n)$ hold at the same time.
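The operator norm $\|A\|_{L_\infty}=\max_{|v|_\infty=1}|Av|_\infty$ used in the proofs has the closed form of the maximum absolute row sum, which is how it would be computed in practice. A quick brute-force check (illustrative, not from the paper):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
p = 4
A = rng.standard_normal((p, p))

# Closed form: ||A||_{L_inf} = max_{|v|_inf = 1} |Av|_inf equals the maximum
# absolute row sum, attained at v = sgn(row) for the maximizing row.
row_sum_norm = np.abs(A).sum(axis=1).max()

# Brute force over the extreme points v in {-1, +1}^p; the maximum of the
# convex function |Av|_inf over the cube is attained at a vertex.
brute = max(np.abs(A @ np.array(v)).max() for v in product([-1.0, 1.0], repeat=p))
```

The two quantities agree exactly, since for each row the inner product $|a_i^T v|$ over $|v|_\infty\le1$ is maximized by the sign pattern of that row.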

Figure 1: This figure visualizes the mechanism of the median divide-and-conquer (Med-DC) approach. Denote $S_+$, $S_-$ and $S^c$ as the sets of positive, negative, and zero coordinates of the true parameter $\mu^*$, respectively. The black dots and white dots in each column represent the estimated positive and negative locations on each local machine, respectively.
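In code, the mechanism depicted in Figure 1 is a single coordinate-wise median over the locally reported sign vectors. A toy illustration (the sign vectors below are made up for exposition, not taken from the paper):

```python
import numpy as np

# Five machines report sign vectors in {-1, 0, +1}^4; occasional local
# mistakes (machines 2 and 3) are filtered out by the coordinate-wise median.
local_signs = np.array([
    [+1, -1,  0,  0],   # machine 1
    [+1, -1,  0, +1],   # machine 2: spurious positive in coordinate 4
    [+1, -1, -1,  0],   # machine 3: spurious negative in coordinate 3
    [+1, -1,  0,  0],   # machine 4
    [+1, -1,  0,  0],   # machine 5
])
q = np.median(local_signs, axis=0).astype(int)   # -> [ 1 -1  0  0]
```

A single aberrant machine cannot change any coordinate of the output, which is the intuition behind the deterministic privacy guarantee.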

10), which means the sparsity level $s$ is fixed at 10. The data are divided into 100 local machines $H_1,\ldots,H_{100}$, each with local sample size $n=200$; hence the entire sample size is $N=200\times100$. For the choice of the regularization parameter in each local machine, we first choose $\lambda_n$ based on the dataset in the first local machine $H_1$ by five-fold cross-validation. Then, motivated by the theoretical scale difference in Theorem 1, we further divide $\lambda_n$ by $\sqrt m$, namely, $\lambda_N=\lambda_n/\sqrt m$. We compare with the following three methods: (a) Mean-DC: replace the median aggregator in Med-DC by taking the average; (b) CWZ: perform the method proposed in Cai et al. (
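A minimal version of this comparison can be scripted as follows. The sketch reuses the setup above ($m=100$ machines, $n=200$, $s=10$) but substitutes the theoretical rate for the cross-validated $\lambda_n$ and only contrasts the median aggregator with the Mean-DC average; the signal strength 0.3 and dimension $p=50$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n, p, s = 100, 200, 50, 10           # machines, local size, dimension, sparsity
mu = np.zeros(p)
mu[:s] = 0.3                             # sparse true mean (illustrative signal)
X = mu + rng.standard_normal((m, n, p))

# Threshold at the Theorem 1 rate; note sqrt(log n/(m n)) = sqrt(log n/n)/sqrt(m),
# mirroring the lam_N = lam_n / sqrt(m) rescaling used with cross-validation.
lam = 4 * (np.sqrt(np.log(n) / (m * n)) + 1.0 / n)

local_means = X.mean(axis=1)
local_signs = np.where(local_means > lam, 1,
              np.where(local_means < -lam, -1, 0))

med_dc = np.median(local_signs, axis=0).astype(int)   # Med-DC aggregator
mean_dc = local_signs.mean(axis=0)                    # Mean-DC aggregator (method (a))
```

Here Med-DC recovers $\operatorname{sgn}(\mu^*)$ exactly, while the averaged sign vector is generally fractional on null coordinates and needs a second thresholding step before it can be read as a support estimate.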


The PFDR, NFDR, FDR, power, and their standard errors (in parentheses) of different methods under entire sample size $N = 200\times100$ and local sample size $n = 200$.

