MEDIAN DC FOR SIGN RECOVERY: PRIVACY CAN BE ACHIEVED BY DETERMINISTIC ALGORITHMS

Abstract

Privacy-preserving data analysis has become prevalent in recent years. It is a common belief in the privacy literature that strict differential privacy can only be obtained by injecting additional randomness into the algorithm. In this paper, we study the problem of private sign recovery for sparse mean estimation and sparse linear regression in a distributed setup. By taking a coordinate-wise median of the reported local sign vectors, an approach we refer to as median divide-and-conquer (Med-DC), we can recover the signs of the true parameter with a provable consistency guarantee. Moreover, without adding any extra randomness to the algorithm, our Med-DC method protects data privacy with high probability. Simulation studies demonstrate the effectiveness of the proposed method.

1. INTRODUCTION

With the development of technology for data acquisition and storage, modern datasets have larger scale, more complex structure, and more practical considerations, which poses new challenges for data analysis. In recent years, large quantities of sensitive data have been collected by individuals and companies. While one wants to extract accurate statistical information from a distributed dataset, one must also guard against leakage of this sensitive personal information during the training process. This calls for the study of distributed learning under privacy constraints (Pathak et al., 2010; Hamm et al., 2016; Jayaraman et al., 2018). In the privacy literature, differential privacy, first proposed in Dwork et al. (2006), has been the most widely adopted definition of privacy tailored to statistical data analysis, and it has achieved tremendous success in real-world applications. Denote the data universe by X, and let the dataset be X = {X_i}_{i=1}^n ∈ X^n, where the X_i's are the data observations. (ε, δ)-differential privacy is defined as follows.

Definition 1 (Differential Privacy; Dwork et al., 2006). A randomized algorithm A : X^n → Θ is (ε, δ)-differentially private if for any pair of adjacent datasets X ∈ X^n and X′ ∈ X^n,

    P(A(X) ∈ U) ≤ e^ε · P(A(X′) ∈ U) + δ    (1)

for every subset U ⊆ Θ. Here two datasets X and X′ are adjacent if and only if their Hamming distance (Lei, 2011) satisfies H(X, X′) = 1.

The quantities ε and δ measure the level of privacy loss. Several relaxed variants of differential privacy have also been designed for ease of analysis. However, in these definitions the dataset X is always assumed to be fixed, and the probability in (1) is taken only over the randomness of the algorithm A. Therefore, it is impossible to achieve strict differential privacy without adding auxiliary perturbations to the algorithm.
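To make Definition 1 concrete, consider the classical Laplace mechanism for a counting query (a standard textbook construction, not part of the method proposed in this paper). The dataset, threshold, and function names below are illustrative:

```python
import numpy as np

def private_count(data, threshold, epsilon, rng):
    """Release how many entries exceed `threshold` under (epsilon, 0)-DP.

    A counting query has sensitivity 1: changing a single record moves the
    count by at most 1, so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for x in data if x > threshold)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
data = [1.2, -0.5, 3.0, 0.1]  # illustrative sensitive records
release = private_count(data, threshold=0.0, epsilon=1.0, rng=rng)
```

Note that the guarantee here comes entirely from the injected randomness: with δ = 0 and no noise, inequality (1) fails whenever two adjacent datasets produce different counts. This is precisely the obstacle that the high-probability relaxation studied in this paper sidesteps.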
On the other hand, the statistical performance of the output is inevitably deteriorated by the additional randomness. This has led to a large body of work discussing the tradeoff between accuracy and privacy (Wasserman & Zhou, 2010; Bassily et al., 2014; Bun et al., 2018; Duchi et al., 2018; Cai et al., 2019).

In this paper, we consider the private sign recovery problem in a distributed system. More precisely, assume the parameter of interest is a sparse vector, i.e., many of its entries are zero. The task is to identify the signs of the parameter from observations stored on multiple machines while protecting each individual's privacy. The sign recovery problem, as an extension of sparsity pattern recovery, is significant in a broad variety of contexts, including variable selection (Tibshirani, 1996; Miller, 2002), graphical models (Meinshausen & Bühlmann, 2006; Cai et al., 2011), compressed sensing (Candes & Tao, 2005; Donoho, 2006), and signal denoising (Chen et al., 2001). However, this problem is rarely considered in the privacy community.

To address the sign recovery problem, we propose the Median Divide-and-Conquer (Med-DC) method, a simple two-step procedure. First, each local machine estimates the sparse parameter and sends its sign vector back to the server; second, the server aggregates these sign vectors by a coordinate-wise median and outputs the final sign estimator. While mean-based divide-and-conquer (also referred to as Mean-DC) approaches have been widely analyzed in the distributed learning literature (Mcdonald et al., 2009; Zhang et al., 2013; Lee et al., 2017; Battey et al., 2018), the median-based counterpart has not yet been well explored. It is well known that naively averaging the local estimators behaves badly for nonlinear and penalized optimization problems, because averaging cannot reduce the bias of the local sub-problems. In particular, for the distributed Lasso problem, as mentioned in Lee et al. (2017), the estimation error of the averaged local Lasso estimator is of the same order as that of the local estimators. However, when only the sign recovery problem is considered, we find that the Med-DC method fits the nature of the distributed private setup perfectly (see Section 2.2 for more detailed discussion).

For the sake of clarity, we only consider the sign recovery problem for sparse mean estimation and sparse linear regression, two fundamental models in statistics. The proposed Med-DC method has the following advantages:

• Consistent recovery. For both sparse mean estimation and sparse linear regression, the Med-DC method consistently recovers the signs of the true parameter with theoretical guarantees. Under some constraints, we can prove that our approach identifies signals larger than C log n/N for some constant C > 0 (where N is the full sample size and n is the local sample size), which coincides with the minimal signal level in the single-machine setting (all data stored on one machine).

• Efficient communication. To recover the signs of the parameter of interest in the distributed setup, a naive approach is to estimate the parameter using existing private distributed estimation methods and take the signs of the estimators. However, these methods usually involve multi-round aggregation of gradient information or local estimators, which seems costly for the simple sign recovery problem. Instead, our approach aggregates only the sign vectors (bit information) in one shot, which is far more communication-efficient.

• Weak privacy. By relaxing differential privacy to a high-probability sense, our deterministic Med-DC method can be proved to be weakly 'private'. We also extend this concept to group privacy. To the best of our knowledge, this is the first deterministic algorithm with a provable high-probability privacy guarantee.
Moreover, since each machine only needs to transmit its sign vector, instead of local estimators or gradient vectors, our proposed method also protects the privacy of each local machine, as gradient sharing can itself result in privacy leakage (Zhu et al., 2019).

• Wide applicability. We believe the Med-DC approach deserves more attention due to its excellent practical performance and ease of implementation. For example, it is promising to apply the Med-DC method to wider classes of models (e.g., generalized linear models, M-estimation, etc.), or to hybridize it with more sophisticated distributed algorithms such as the averaged de-biased estimator in Lee et al. (2017) and the Communication-Efficient Accurate Statistical Estimator (CEASE) in Fan et al. (2019).

Notations. For every vector v = (v_1, ..., v_p)^T, denote |v|_2 = (Σ_{l=1}^p v_l^2)^{1/2}, |v|_1 = Σ_{l=1}^p |v_l|, and |v|_∞ = max_{1≤l≤p} |v_l|. Moreover, we use supp(v) = {1 ≤ l ≤ p | v_l ≠ 0} for the support of the vector v, and v_{-l} = (v_1, ..., v_{l-1}, v_{l+1}, ..., v_p)^T. For every matrix A ∈ R^{p_1×p_2}, define ‖A‖ = sup_{|v|_2=1} |Av|_2, ‖A‖_∞ = max_{1≤l_1≤p_1, 1≤l_2≤p_2} |A_{l_1,l_2}|, and ‖A‖_{L∞} = sup_{|v|_∞=1} |Av|_∞ as various matrix norms, and Λ_max(A) and Λ_min(A) as the largest and smallest eigenvalues of A, respectively. We use I(·) as the indicator function and sgn(·) as the sign function. For two sequences a_n, b_n, we write a_n ≍ b_n when a_n = O(b_n) and b_n = O(a_n) hold simultaneously. For simplicity, we denote S^{p-1} and B^p as the unit sphere and unit ball in R^p centered at 0. For a sequence of vectors {v_i}_{i=1}^n ⊆ R^p, we denote med(·) as the coordinate-wise median. Lastly, the generic constants are assumed to be independent of m, n, and p.
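As a quick numerical illustration of the three matrix norms (an example constructed here, not taken from the paper): ‖A‖ is the spectral norm, ‖A‖_∞ the largest absolute entry, and ‖A‖_{L∞} coincides with the maximum absolute row sum.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [0.0,  3.0]])

spectral  = np.linalg.norm(A, ord=2)      # ‖A‖ = sup_{|v|_2=1} |Av|_2
entrywise = np.max(np.abs(A))             # ‖A‖_∞ = max |A_{l1,l2}|
row_sum   = np.max(np.abs(A).sum(axis=1)) # ‖A‖_{L∞} = sup_{|v|_∞=1} |Av|_∞
```

For this A, both ‖A‖_∞ and ‖A‖_{L∞} equal 3, while the spectral norm is (7 + 2√10)^{1/2} ≈ 3.65, showing the three norms genuinely differ.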


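The two-step Med-DC procedure described in the introduction can be sketched as follows. This is an illustrative reading of the pipeline under simplifying assumptions, not the paper's implementation: a soft-thresholded local sample mean stands in for whatever sparse local estimator each machine actually runs.

```python
import numpy as np

def local_sign_vector(X, lam):
    """Step 1 (on each machine): form a simple sparse local estimate --
    here, the soft-thresholded sample mean -- and reduce it to its sign
    vector. Only this {-1, 0, +1}-valued vector is sent to the server."""
    mu = X.mean(axis=0)
    est = np.sign(mu) * np.maximum(np.abs(mu) - lam, 0.0)
    return np.sign(est)

def med_dc(sign_vectors):
    """Step 2 (on the server): coordinate-wise median of the reported
    sign vectors, mapped back into {-1, 0, +1}."""
    return np.sign(np.median(np.stack(sign_vectors), axis=0))

# Toy run: m machines, n local samples each, sparse true mean theta.
rng = np.random.default_rng(1)
theta = np.array([2.0, -2.0, 0.0, 0.0, 0.0])
m, n = 30, 50
votes = [local_sign_vector(theta + rng.normal(size=(n, theta.size)), lam=0.5)
         for _ in range(m)]
recovered = med_dc(votes)  # matches sgn(theta) with high probability
```

The aggregation is deterministic and one-shot, and each machine reveals only p bits rather than raw data, local estimators, or gradients — the source of both the communication efficiency and the high-probability privacy discussed above.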