ON INSTAHIDE, PHASE RETRIEVAL, AND SPARSE MATRIX FACTORIZATION

Abstract

In this work, we examine the security of InstaHide, a scheme recently proposed by Huang et al. (2020b) for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible complexity-theoretic assumption. The answer to this turns out to be quite subtle and closely related to the average-case complexity of a multi-task, missing-data version of the classic problem of phase retrieval that is interesting in its own right. Motivated by this connection, under the standard distributional assumption that the public/private feature vectors are isotropic Gaussian, we design an algorithm that can actually recover a private vector using only the public vectors and a sequence of synthetic vectors generated by InstaHide.

1. INTRODUCTION

In distributed learning, where decentralized parties each possess some private local data and work together to train a global model, a central challenge is to ensure that the security of any individual party's local data is not compromised. Huang et al. (2020b) recently proposed an interesting approach called InstaHide for this problem. At a high level, InstaHide is a method for aggregating local data into synthetic data that can hopefully preserve the privacy of the local datasets and be used to train good models. Informally, given a collection of public feature vectors (e.g. a publicly available dataset like ImageNet (Deng et al., 2009)) and a collection of private feature vectors (e.g. the union of all of the private datasets among learners), InstaHide produces a synthetic feature vector as follows. Let integers k_pub, k_priv be sparsity parameters.

1. Form a random convex combination of k_pub public and k_priv private vectors.

2. Multiply every coordinate of the resulting vector by an independent random sign in {±1}, and define this to be the synthetic feature vector.

The hope is that by removing any sign information from the vector obtained in Step 1, the synthetic vector reveals little about the private vectors used to produce it.
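The two-step encoding above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: in particular, the paper does not pin down the distribution of the mixing weights here, so a symmetric Dirichlet draw is used below as one natural way to obtain a random convex combination.

```python
import numpy as np

def instahide_encode(public, private, k_pub, k_priv, rng):
    """Produce one InstaHide synthetic vector from the two steps above.

    public:  array of shape (n_pub, d), the public feature vectors
    private: array of shape (n_priv, d), the private feature vectors
    """
    # Step 1: random convex combination of k_pub public and k_priv private vectors.
    pub_idx = rng.choice(len(public), size=k_pub, replace=False)
    priv_idx = rng.choice(len(private), size=k_priv, replace=False)
    chosen = np.vstack([public[pub_idx], private[priv_idx]])
    # Illustrative choice: Dirichlet weights are nonnegative and sum to 1.
    weights = rng.dirichlet(np.ones(k_pub + k_priv))
    mixed = weights @ chosen
    # Step 2: flip the sign of each coordinate independently with probability 1/2,
    # destroying all sign information in the mixture.
    signs = rng.choice([-1.0, 1.0], size=mixed.shape)
    return signs * mixed
```

Note that an observer of the output sees only the entrywise absolute values of the mixture (up to a global relabeling of signs), which is what connects the scheme to phase retrieval.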

Funding

* This work was supported in part by NSF CAREER Award CCF-1453261, NSF Large CCF-1565235 and Ankur Moitra's ONR Young Investigator Award.

