MEDIAN DC FOR SIGN RECOVERY: PRIVACY CAN BE ACHIEVED BY DETERMINISTIC ALGORITHMS

Abstract

Privacy-preserving data analysis has become prevalent in recent years. It is widely held in the privacy literature that strict differential privacy can only be obtained by injecting additional randomness into the algorithm. In this paper, we study the problem of private sign recovery for sparse mean estimation and sparse linear regression in a distributed setup. By taking a coordinate-wise median of the reported local sign vectors, an approach we refer to as median divide-and-conquer (Med-DC), we can recover the signs of the true parameter with a provable consistency guarantee. Moreover, without adding any extra randomness to the algorithm, our Med-DC method protects data privacy with high probability. Simulation studies demonstrate the effectiveness of the proposed method.

1. INTRODUCTION

With the development of technology for data acquisition and storage, modern datasets have larger scale, more complex structure, and more practical constraints, which poses new challenges for data analysis. In recent years, large quantities of sensitive data have been collected by individuals and companies. While one wants to extract accurate statistical information from such distributed datasets, one must also guard against the leakage of sensitive personal information during the training process. This calls for the study of distributed learning under privacy constraints (Pathak et al., 2010; Hamm et al., 2016; Jayaraman et al., 2018). In the privacy literature, differential privacy, first proposed in Dwork et al. (2006), has been the most widely adopted definition of privacy tailored to statistical data analysis, and it has achieved tremendous success in real-world applications.

Denote the data universe by X, and let X = {X_i}_{i=1}^n ∈ X^n be the dataset, where the X_i's are the data observations. (ε, δ)-differential privacy is defined as follows.

Definition 1 (Differential Privacy, Dwork et al. (2006)). A randomized algorithm A : X^n → Θ is (ε, δ)-differentially private if for any pair of adjacent datasets X ∈ X^n and X' ∈ X^n, there always holds

    P(A(X_{1:n}) ∈ U) ≤ e^ε · P(A(X'_{1:n}) ∈ U) + δ,    (1)

for every subset U ⊆ Θ. Here two datasets X and X' of the same size are adjacent if and only if their Hamming distance (Lei, 2011) satisfies H(X, X') = 1.

The quantities ε and δ measure the level of privacy loss. There are also several relaxations of differential privacy (see, e.g., Bun & Steinke (2016); Dwork & Rothblum (2016); Mironov (2017); Dong et al. (2019)) designed for ease of analysis. However, in all these definitions, the dataset X is assumed to be fixed, and the probability in (1) is taken only over the randomness of the algorithm A.
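To make Definition 1 concrete, the following sketch shows the standard way randomness is injected to satisfy it: the Laplace mechanism, which adds noise with scale sensitivity/ε to a numeric query and yields (ε, 0)-differential privacy. The function name, the example data, and the sensitivity bound 1/n (for a mean of values in [0, 1]) are illustrative choices, not part of this paper's method.

```python
import numpy as np

def laplace_mechanism(query_value, sensitivity, epsilon, rng=None):
    """Release a numeric query with (epsilon, 0)-differential privacy.

    Adds Laplace noise of scale sensitivity / epsilon, the classical
    mechanism for satisfying the inequality in Definition 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return query_value + rng.laplace(loc=0.0, scale=scale)

# Example: privately release the mean of data bounded in [0, 1].
# Changing one of the n records moves the mean by at most 1/n,
# so 1/n is a valid sensitivity bound for this query.
data = np.array([0.2, 0.7, 0.4, 0.9])
n = len(data)
private_mean = laplace_mechanism(data.mean(), sensitivity=1.0 / n, epsilon=1.0)
```

Smaller ε forces a larger noise scale, which is exactly the accuracy-privacy tension discussed next.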
Therefore, it is impossible to achieve strict differential privacy without adding auxiliary perturbations to the algorithm. On the other hand, the statistical performance of the output is inevitably degraded by this additional randomness. This has led to a large body of work on the tradeoff between accuracy and privacy (Wasserman & Zhou, 2010; Bassily et al., 2014; Bun et al., 2018; Duchi et al., 2018; Cai et al., 2019). In this paper, we consider the private sign recovery problem in a distributed system. More precisely, we assume the parameter of interest is a sparse vector, i.e., many of its entries are zero.
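The median divide-and-conquer idea described in the abstract can be sketched as follows: each machine reports only the (thresholded) sign vector of its local estimate, and the server takes a coordinate-wise median of the reported signs. This is an illustrative toy implementation, not the paper's exact procedure; the function name `med_dc_signs` and the threshold parameter `tau` are assumptions introduced here for the sketch.

```python
import numpy as np

def med_dc_signs(local_estimates, tau=0.0):
    """Toy median divide-and-conquer (Med-DC) sign recovery.

    local_estimates: (m, d) array, one local estimate per machine.
    tau: threshold below which a coordinate is reported as 0
         (an illustrative choice for handling the zero entries
         of a sparse parameter).
    Returns a d-vector in {-1, 0, +1} estimating sign(theta).
    """
    # Each machine reports only signs, zeroing out small coordinates.
    sign_vectors = np.sign(local_estimates) * (np.abs(local_estimates) > tau)
    # The server aggregates by a coordinate-wise median of the signs.
    return np.sign(np.median(sign_vectors, axis=0))

# Toy example: true parameter has sign pattern (+, 0, -);
# 50 machines hold independent noisy local estimates.
rng = np.random.default_rng(0)
theta = np.array([1.0, 0.0, -1.0])
local = theta + 0.3 * rng.standard_normal((50, 3))
recovered = med_dc_signs(local, tau=0.6)
```

Note that the aggregation is a deterministic function of the reported sign vectors: no noise is injected, which is the point of contrast with the perturbation-based mechanisms above.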

