ON INSTAHIDE, PHASE RETRIEVAL, AND SPARSE MATRIX FACTORIZATION

Abstract

In this work, we examine the security of InstaHide, a scheme recently proposed by Huang et al. (2020b) for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible complexity-theoretic assumption. The answer to this turns out to be quite subtle and closely related to the average-case complexity of a multi-task, missing-data version of the classic problem of phase retrieval that is interesting in its own right. Motivated by this connection, under the standard distributional assumption that the public/private feature vectors are isotropic Gaussian, we design an algorithm that can actually recover a private vector using only the public vectors and a sequence of synthetic vectors generated by InstaHide.

1. INTRODUCTION

In distributed learning, where decentralized parties each possess some private local data and work together to train a global model, a central challenge is to ensure that the security of any individual party's local data is not compromised. Huang et al. (2020b) recently proposed an interesting approach called InstaHide for this problem. At a high level, InstaHide is a method for aggregating local data into synthetic data that can hopefully preserve the privacy of the local datasets and be used to train good models. Informally, given a collection of public feature vectors (e.g. a publicly available dataset like ImageNet Deng et al. (2009)) and a collection of private feature vectors (e.g. the union of all of the private datasets among learners), InstaHide produces a synthetic feature vector as follows. Let integers k_pub, k_priv be sparsity parameters.

1. Form a random convex combination of k_pub public and k_priv private vectors.
2. Multiply every coordinate of the resulting vector by an independent random sign in {±1}, and define this to be the synthetic feature vector.

The hope is that by removing any sign information from the vector obtained in Step 1, Step 2 makes it difficult to discern which public and private vectors were selected in Step 1. Strikingly, Huang et al. (2020b) demonstrated on real-world datasets that if one trains a ResNet-18 or a NASNet on a dataset consisting of synthetic vectors generated in this fashion, one can still get good test accuracy on the underlying private dataset for modest sparsity parameters (e.g. k_pub = k_priv = 2). The two outstanding theoretical challenges that InstaHide poses are understanding:

• Utility: What property, either of neural networks or of real-world distributions, lets one tolerate this kind of covariate shift between the synthetic and original datasets?
• Security: Can one rigorously formulate a refutable security claim for InstaHide, under a plausible average-case complexity-theoretic assumption?
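In code, one round of this generation procedure looks roughly as follows. This is a minimal NumPy sketch: the array names are ours, and the Dirichlet draw of random convex weights is an illustrative assumption (the paper's formal model, Definition 2.6, fixes the weights).

```python
import numpy as np

def instahide_encode(public, private, k_pub, k_priv, rng):
    """One synthetic vector: mix k_pub public and k_priv private
    vectors with random convex weights, then flip the sign of each
    coordinate independently with probability 1/2."""
    d = public.shape[1]
    pub_idx = rng.choice(public.shape[0], size=k_pub, replace=False)
    priv_idx = rng.choice(private.shape[0], size=k_priv, replace=False)
    # Step 1: random convex combination of the selected vectors.
    weights = rng.dirichlet(np.ones(k_pub + k_priv))
    mixed = weights @ np.vstack([public[pub_idx], private[priv_idx]])
    # Step 2: independent random sign per coordinate.
    signs = rng.choice([-1.0, 1.0], size=d)
    return signs * mixed

rng = np.random.default_rng(0)
public = rng.standard_normal((100, 32))   # public dataset, 100 vectors in R^32
private = rng.standard_normal((50, 32))   # private dataset
y = instahide_encode(public, private, k_pub=2, k_priv=2, rng=rng)
```

The attacker observes only vectors like `y` (plus the public dataset), never the selected indices or the signs.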
In this paper we consider the latter question. One informal security claim implicit in Huang et al. (2020b) is that given a synthetic dataset of a certain size, no efficient algorithm can recover a private image to within a certain level of accuracy (see Problem 1 for a formal statement of this recovery question). On the one hand, it is a worthwhile topic of debate whether this is a satisfactory guarantee from a security standpoint. On the other, even this kind of claim is quite delicate to pin down formally, in part because it seems impossible for such a claim to hold for arbitrary private datasets.

Known Attacks and the Importance of Distributional Assumptions If the private and public datasets consisted of natural images, for example, then attacks are known Jagielski (2020); Carlini et al. (2020). At a high level, the attack of Jagielski (2020) crucially leverages local Lipschitzness properties of natural images and shows that when k_priv + k_pub = 2, even a single synthetic image can reveal significant information. The very recent attack of Carlini et al. (2020), which was independent of the present work and appeared a month after this submission appeared online, is more sophisticated and bears interesting similarities to the algorithms we consider. We defer a detailed discussion of these similarities to Appendix A in the supplement. While the original InstaHide paper Huang et al. (2020b) focused on image data, their general approach has the potential to be applicable to other forms of real-valued data, and it is an interesting mathematical question whether the above attacks remain viable. For instance, for distributions over private vectors where individual features are nearly independent, one cannot hope to leverage the kinds of local Lipschitzness properties that the attack of Jagielski (2020) exploits.
Additionally, if the individual features are identically distributed, then it is information-theoretically impossible to discern anything from just a single synthetic vector. For instance, if a synthetic vector v is given by the entrywise absolute value of (1/2)v_1 + (1/2)v_2 for private vectors v_1, v_2, then an equally plausible pair of private vectors generating v would be v'_1, v'_2 given by swapping the i-th entry of v_1 with that of v_2 for any collection of indices i ∈ [d]. In other words, there are 2^d pairs of private vectors which are equally likely under the Gaussian measure and give rise to the exact same synthetic vector.

Gaussian Images, and Our Results A natural candidate for probing whether such properties can make the problem of recovering private vectors more challenging is the case where the public and private vectors are sampled from the standard Gaussian distribution over R^d. While this distribution does not capture datasets in the real world, it avoids some properties of distributions over natural images that might make InstaHide more vulnerable to attack and is thus a clean testbed for stress-testing candidate security claims for InstaHide. Furthermore, in light of known hardness results for certain learning problems over Gaussian space Diakonikolas et al. (2017); Bruna et al. (2020); Diakonikolas et al. (2020b); Goel et al. (2020a); Diakonikolas et al. (2020a); Klivans & Kothari (2014); Goel et al. (2020b); Bubeck et al. (2019); Regev & Vijayaraghavan (2017), one might hope that when the vectors are Gaussian, one could rigorously establish some lower bounds, e.g. on the size of the synthetic dataset (information-theoretic) and/or the runtime of the attacker (computational), perhaps under an average-case assumption, or in some restricted computational model like SQ.
Orthogonally, we note that the recovery task the attacker must solve appears to be an interesting inverse problem in its own right, namely a multi-task, missing-entry version of phase retrieval with an intriguing connection to sparse matrix factorization (see Section 2.2 and Section 3). The assumption of Gaussianity is a natural starting point for understanding the average-case complexity of this problem, and in this learning-theoretic context it is desirable to give algorithms with provable guarantees. Gaussianity is often a standard starting point for developing guarantees for such inverse problems Moitra & Valiant (2010); Netrapalli et al. (2013); Candes et al. (2015); Hardt & Price (2015); Zhong et al. (2017b;a); Li & Yuan (2017); Ge et al. (2018); Li & Liang (2018); Zhong et al. (2019); Chen et al. (2020); Kong et al. (2020); Diakonikolas et al. (2020b). Our main result is to show that when the private and public data is Gaussian, we can use the synthetic and public vectors to recover a subset of the private vectors.

Theorem 1.1 (Informal, see Theorem B.1). If there are n_priv private vectors and n_pub public vectors, each of which is an i.i.d. draw from N(0, Id_d), then as long as d = Ω(poly(k_pub, k_priv) log(n_pub + n_priv)), there is some m = o(n_priv^{k_priv}) such that, given a sample of m random synthetic vectors independently generated as above, one can exactly recover k_priv + 2 private vectors in time O(d(m + n_pub^2)) + poly(n_pub) with probability 9/10 over the randomness of the private and public vectors and the randomness of the selection vectors.

We emphasize that we can take m = o(n_priv^{k_priv}), meaning we can achieve recovery even with access to a vanishing fraction of all possible combinations of private vectors among the synthetic vectors generated. For instance, when k_priv = 2, we show that m = O(n_priv^{4/3}) suffices (see Theorem B.1). See Remark B.2 for additional discussion.
Additionally, to ensure we are not working in an uninteresting setting where InstaHide has zero utility, we empirically verify that in the setting of Theorem 1.1, one can train on the synthetic vectors and get reasonable test accuracy on the original Gaussian dataset (see Section 4). Qualitatively, the main takeaway of Theorem 1.1 is that to prove meaningful security guarantees for InstaHide, we must be careful about the properties we posit about the underlying distribution generating the public and private data, even in challenging settings where this data does not possess the nice properties of natural images that have made other attacks possible.

1.1. CONNECTIONS AND EXTENSIONS TO PHASE RETRIEVAL

Our algorithm is based on connections and extensions to the classic problem of phase retrieval. At a high level, this can be thought of as the problem of linear regression where the signs of the linear responses are hidden. More formally, this is a setting where we get pairs (x_1, y_1), ..., (x_N, y_N) ∈ C^n × R for which there exists a vector w ∈ C^n satisfying |⟨w, x_i⟩| = y_i for all i = 1, ..., N, and the goal is to recover w. Without distributional assumptions on how x_1, ..., x_N are generated, this problem is NP-hard Yi et al. (2014), and in the last decade, there has been a huge body of work, much of it coming from the machine learning community, on giving algorithms for recovering w under the assumption that x_1, ..., x_N are i.i.d. Gaussian, see e.g. Candes et al. (2013; 2015); Conca et al. (2015); Netrapalli et al. (2013).

To see the connection between InstaHide and phase retrieval, first imagine that InstaHide only works with public vectors (in the notation of Theorem 1.1, n_priv = k_priv = 0). Now, consider a synthetic vector y ∈ R^d generated by InstaHide, and let the vector w ∈ R^{n_pub} be the one specifying the convex combination of public vectors that generated y. The basic observation is that for any feature i ∈ [d], if p_i ∈ R^{n_pub} is the vector consisting of i-th coordinates of all the public vectors, then |⟨w, p_i⟩| = y_i. In other words, if InstaHide only works with public vectors, then the problem of recovering which public vectors generated a given synthetic vector is formally equivalent to phase retrieval. In particular, if the public dataset is Gaussian, then we can leverage the existing algorithms for Gaussian phase retrieval. Huang et al. (2020b) already noted this connection but argued that if InstaHide also uses private vectors, the existing algorithms for phase retrieval fail. Indeed, consider the extreme case where InstaHide only works with private vectors (i.e.
n_pub = 0), so that the only information we have access to is the synthetic vector (y_1, ..., y_d) generated by InstaHide. As noted above in the discussion about private distributions where the features are identically distributed, it is clearly information-theoretically impossible to recover anything about w or the private dataset. As we will see, the key workaround is to exploit the fact that InstaHide ultimately generates multiple synthetic vectors, each of which is defined by a random sparse convex combination of public/private vectors. And as we will make formal in Section 2.2, the right algorithmic question to study in this context can be thought of as a multi-task, missing-data version of phase retrieval (see Problem 2) that we believe to be of independent interest. Lastly, we remark that in spite of this conceptual connection to phase retrieval, and apart from one component of our algorithm (see Section B.1) which draws upon existing techniques for phase retrieval, the most involved parts of our algorithm and its analysis utilize techniques that are quite different from the existing ones in the phase retrieval literature. We elaborate upon these techniques in Section 3.

2. TECHNICAL PRELIMINARIES

Miscellaneous Notation Given a subset T, let C_T^k denote the set of all subsets of T of size exactly k. Given a vector v ∈ R^n and a subset S ⊆ [n], let [v]_S ∈ R^{|S|} denote the restriction of v to the coordinates indexed by S.

Definition 2.1. Given a Gaussian distribution N(0, Σ), let N_fold(0, Σ) denote the folded Gaussian distribution defined as follows: to sample from N_fold(0, Σ), sample g ∼ N(0, Σ) and output |g|.
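As a quick illustration of Definition 2.1, a folded Gaussian sample is just the entrywise absolute value of a Gaussian sample. The NumPy sketch below (function name ours) includes a sanity check against the standard fact that E|g| = √(2/π) ≈ 0.798 for g ∼ N(0, 1).

```python
import numpy as np

def sample_folded_gaussian(cov, n_samples, rng):
    """Draw from N_fold(0, cov): sample g ~ N(0, cov) and output |g|."""
    m = cov.shape[0]
    g = rng.multivariate_normal(np.zeros(m), cov, size=n_samples)
    return np.abs(g)

rng = np.random.default_rng(1)
samples = sample_folded_gaussian(np.eye(2), 200_000, rng)
# Each coordinate is a standard folded Gaussian, so its mean is sqrt(2/pi).
mean_abs = samples.mean(axis=0)
```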

2.1. THE GENERATIVE MODEL

Definition 2.2 (Image matrix notation). Let image matrix X ∈ R^{d×n} be a matrix whose columns consist of vectors x_1, ..., x_n corresponding to n images each with d pixels taking values in F. It will also be convenient to refer to the rows of X as p_1, ..., p_d ∈ R^n.

Definition 2.3 (Public/private notation). Let S ⊂ {1, ..., n} be some subset. We will refer to S and S^c := {1, ..., n}\S as the set of public and private images respectively, and given a vector w ∈ R^n, we will refer to supp(w) ∩ S and supp(w) ∩ S^c as the public and private coordinates of w respectively.

Definition 2.4 (Synthetic images). Given sparsity levels k_pub ≤ |S|, k_priv ≤ |S^c|, image matrix X and a selection vector w ∈ R^n for which [w]_S and [w]_{S^c} are k_pub- and k_priv-sparse respectively, the corresponding synthetic image is the vector y_{X,w} := |Xw|, where |·| denotes entrywise absolute value. We say that X and a sequence of selection vectors w_1, ..., w_m ∈ R^n give rise to a synthetic dataset consisting of the images {y_{X,w_1}, ..., y_{X,w_m}}. Note that instead of the entrywise absolute value of Xw, InstaHide in Huang et al. (2020b) randomly flips the sign of every entry of Xw, but these two operations are interchangeable in terms of information; it will be slightly more convenient to work with the former.

We will work with the following distributional assumption on the entries of X:

Definition 2.5 (Gaussian images). We say that X is a random Gaussian image matrix if its entries are sampled i.i.d. from N(0, 1).

We will also work with the following simple notion of "random convex combination" as our model for how the selection vectors w_1, . . . , w_m are generated:

Definition 2.6 (Distribution over selection vectors). Let D be the distribution over selection vectors defined as follows.
To sample once from D, draw random subsets T_1 ⊆ S, T_2 ⊆ S^c of sizes k_pub and k_priv respectively, and output the unit vector whose i-th entry is 1/√k_pub if i ∈ T_1, 1/√k_priv if i ∈ T_2, and 0 otherwise.

The main algorithmic question we study is the following:

Problem 1 (Private (exact) image recovery). Let X ∈ R^{d×n} be a Gaussian image matrix. Given access to the public images {x_s}_{s∈S} and the synthetic dataset {y_{X,w_1}, . . . , y_{X,w_m}}, where w_1, ..., w_m ∼ D are unknown selection vectors, output a vector x ∈ R^d for which there exists a private image x_s (where s ∈ S^c) satisfying |x_i| = |(x_s)_i| for all i ∈ [d].

Remark 2.7. Note that it is information-theoretically impossible to guarantee that x_i = (x_s)_i. This is because the distribution over X and the distribution over matrices given by sampling X and multiplying every private image by -1 are both Gaussian. And if the selection vectors w_1, ..., w_m generated the synthetic images in the former case, then the selection vectors w'_1, . . . , w'_m, where w'_j is obtained by multiplying the private coordinates of w_j by -1, would generate the exact same synthetic images.
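Definitions 2.4 and 2.6 can be combined into a short simulation of how an instance of Problem 1 arises. This is a NumPy sketch with illustrative parameter values of our own choosing.

```python
import numpy as np

def sample_selection_vector(n, S, k_pub, k_priv, rng):
    """Sample w ~ D: entries 1/sqrt(k_pub) on a random size-k_pub subset
    of S, entries 1/sqrt(k_priv) on a random size-k_priv subset of the
    complement S^c, and 0 elsewhere (Definition 2.6)."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(n), S)
    w = np.zeros(n)
    w[rng.choice(S, size=k_pub, replace=False)] = 1 / np.sqrt(k_pub)
    w[rng.choice(Sc, size=k_priv, replace=False)] = 1 / np.sqrt(k_priv)
    return w

rng = np.random.default_rng(2)
d, n = 64, 20
X = rng.standard_normal((d, n))   # Gaussian image matrix (Definition 2.5)
S = np.arange(10)                 # indices of the public images
w = sample_selection_vector(n, S, k_pub=2, k_priv=2, rng=rng)
y = np.abs(X @ w)                 # synthetic image y_{X,w} (Definition 2.4)
```

The attacker in Problem 1 sees `X[:, S]` and many samples like `y`, but not `w` or the private columns of `X`.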

2.2. MULTI-TASK PHASE RETRIEVAL WITH MISSING DATA

In this section we make formal the discussion in Section 1.1 and situate it in the notation above. First consider a synthetic dataset consisting of a single image y := y_{X,w}, where w is arbitrary and X is a random Gaussian image. From Eq. (1) we know that |⟨w, p_j⟩| = y_j for all j ∈ [d]. If S = {1, ..., n}, then the problem of recovering selection vector w from synthetic dataset {y} is merely that of recovering w from pairs (p_j, y_j), and this is exactly the problem of phase retrieval over Gaussians. More precisely, because w is assumed to be sparse, this is the problem of sparse phase retrieval over Gaussians. If S ⊊ {1, ..., n}, then it's clearly impossible to recover the private coordinates of w from y_{X,w} alone. But it may still be possible to recover the public coordinates: formally, we can hope to recover [w]_S given pairs ([p_j]_S, y_j), where the p_j's are sampled independently from N(0, Id_n). This can be thought of as a missing-data version of sparse phase retrieval where some known subset of the coordinates of the inputs, those indexed by S^c, are unobserved. But recall our ultimate goal is to say something about the private images. It turns out that because we actually observe multiple synthetic images, corresponding to multiple vectors w, it becomes possible to recover x_s for some s ∈ S^c (even in the extreme case where S = ∅!). This corresponds to the following inverse problem which is formally equivalent to Problem 1, but phrased in a self-contained way which may be of independent interest.

Problem 2 (Multi-task phase retrieval with missing data). Let S ⊆ [n] and S^c = [n]\S. Let X ∈ R^{d×n} be a matrix whose entries are i.i.d. draws from N(0, 1), with rows denoted by p_1, ..., p_d and columns denoted by x_1, ..., x_n. Let w_1, . . . , w_m ∼ D. For every j ∈ [d], we get a tuple ([p_j]_S, y_j^{(1)}, . . . , y_j^{(m)}) satisfying |⟨w_i, p_j⟩| = y_j^{(i)} for all i ∈ [m], j ∈ [d]. Using just these, output x ∈ R^d such that for some s ∈ S^c, |x_i| = |(x_s)_i| for all i ∈ [d].

3. PROOF OVERVIEW

At a high level, our algorithm has three components:

1. Learn the public coordinates of all the selection vectors w_1, ..., w_m used to generate the synthetic dataset.
2. Recover the m × m rescaled Gram matrix M whose (i, j)-th entry is k · ⟨w_i, w_j⟩.
3. Use M and the synthetic dataset to recover a private image.

Step 1 draws upon techniques in Gaussian phase retrieval, while Step 2 follows by leveraging the correspondence between the covariance matrix of a Gaussian and the covariance matrix of its corresponding folded Gaussian (see Definition 2.1). Step 3 is the trickiest part and calls for leveraging delicate properties of the distribution D over selection vectors.

Learning the Public Coordinates of Any Selection Vector

We begin by describing how to carry out Step 1 above. First consider the case where S = {1, . . . , n}, that is, where every image is public. Recall from the discussion in Sections 1.1 and 2.2 that in this case, the question of recovering w from synthetic image y_{X,w} is equivalent to Gaussian phase retrieval. One way to get a reasonable approximation to w is to consider the n × n matrix N := E_{p,y}[y^2 · (pp^⊤ - Id)], where p ∼ N(0, Id_n) and y = |⟨w, p⟩|. It is a standard calculation (see Lemma B.3) to show that N is a rank-one matrix proportional to ww^⊤. And as every one of p_1, . . . , p_d is an independent sample from N(0, Id), and (y_{X,w})_i satisfies |⟨w, p_i⟩| = (y_{X,w})_i for every pixel i ∈ [d], one can approximate N with the matrix N̂ := (1/d) Σ_{i=1}^d (y_{X,w})_i^2 · (p_i p_i^⊤ - Id). This is the basis for the spectral initialization procedure that is present in many works on Gaussian phase retrieval, see e.g. Candes et al. (2015); Netrapalli et al. (2013). N̂ will not be a sufficiently good spectral approximation to N when d ≪ n, so instead we use a standard post-processing step based on the canonical SDP for sparse PCA (see (2)). Instead of taking the top eigenvector of N̂, we can take the top eigenvector of the SDP solution and argue that as long as d = Ω(poly(k_pub) log n), this will be sufficiently close to w that we can exactly recover supp(w). Now what happens when S ⊊ {1, . . . , n}? Interestingly, if one simply modifies the definition of N to be E_{p,y}[y^2 · ([p]_S[p]_S^⊤ - Id)] and defines the corresponding empirical analogue N̂ formed from the pairs {([p_i]_S, (y_{X,w})_i)}_{i∈[d]}, one can still argue (see Lemma B.3) that N is a rank-one |S|×|S| matrix proportional to [w]_S[w]_S^⊤ and that the top eigenvector of the solution to a suitable SDP formed from N̂ will be close to w (see Lemma B.4).
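The spectral-initialization step above can be illustrated as follows. This is a NumPy sketch in the easy regime d ≫ n, where the top eigenvector of N̂ alone suffices and the sparse-PCA SDP post-processing is omitted; parameter values are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 20_000                        # d >> n, so plain spectral init works
w = np.zeros(n)
w[:2] = 1 / np.sqrt(2)                   # a 2-sparse unit selection vector
P = rng.standard_normal((d, n))          # rows p_i ~ N(0, Id_n)
y = np.abs(P @ w)                        # observed magnitudes |<w, p_i>|

# Empirical version of N = E[y^2 (p p^T - Id)]; in expectation this is
# proportional to w w^T (Lemma B.3 in the paper's notation).
N_hat = (P * (y**2)[:, None]).T @ P / d - (y**2).mean() * np.eye(n)

# The top eigenvector of N_hat approximates w up to a global sign.
eigvals, eigvecs = np.linalg.eigh(N_hat)
v = eigvecs[:, -1]
corr = abs(float(v @ w))                 # close to 1
```

When d ≪ n, this plain eigenvector estimate breaks down, which is exactly why the paper resorts to the sparse-PCA SDP.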
Recovering the Gram Matrix via Folded Gaussians As we noted earlier, it is information-theoretically impossible to recover [w_i]_{S^c} for any i ∈ [m] given only y_{X,w_i} and [w_i]_S, but we now show it's possible to recover the inner products ⟨[w_i]_{S^c}, [w_j]_{S^c}⟩ for any i, j ∈ [m]. For the remainder of the overview, we will work in the extreme case where S^c = {1, ..., n}, though it is not hard (see Section B.7) to combine the algorithms we are about to discuss with the algorithm for recovering the public coordinates to handle the case of general S. For brevity, let k := k_priv. First note that the m × d matrix whose rows consist of y_{X,w_1}, ..., y_{X,w_m} can be written as Y := (|⟨p_j, w_i⟩|)_{i∈[m], j∈[d]}, i.e. the matrix whose (i, j)-th entry is |⟨p_j, w_i⟩|. Observe that without absolute values, each column would be an independent draw from the m-variate Gaussian N(0, M), where M is the Gram matrix defined above. Instead, with the absolute values, each column of Y is actually an independent draw from the folded Gaussian N_fold(0, M) (Definition 2.1). The key point is that the covariance of N_fold(0, M) can be directly related to M (see Corollary B.6), so by estimating the covariance of the folded Gaussian N_fold(0, M) using the columns of Y, we can obtain a good enough approximation M̂ to M that we can simply round every entry of M̂ so that the rounded matrix exactly equals M. Furthermore, we only need to entrywise approximate the covariance of N_fold(0, M) for all this to work, which is why it suffices for d to grow logarithmically in m.

Discerning Structure From the Gram Matrix By this point, we have access to the Gram matrix M. Equivalently, we now know for any i, j ∈ [m] whether supp(w_i) ∩ supp(w_j) = ∅, that is, for any pair of synthetic images, we can tell whether the set of private images generating one of them overlaps with the set generating the other.
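The covariance-inversion step for recovering the Gram matrix can be illustrated concretely. In place of the paper's Corollary B.6, the sketch below uses the classical identity E[|g_i||g_j|] = (2/π)(√(1−ρ²) + ρ·arcsin ρ) for unit-variance jointly Gaussian coordinates with correlation ρ, inverted by grid search; this substitution and the toy parameters are our own.

```python
import numpy as np

def rho_to_absmoment(rho):
    """E[|g_i||g_j|] for unit-variance jointly Gaussian (g_i, g_j), corr rho."""
    return (2 / np.pi) * (np.sqrt(1 - rho**2) + rho * np.arcsin(rho))

def invert_absmoment(c, grid=np.linspace(0, 1, 10001)):
    """Recover rho >= 0 from an estimate of E[|g_i||g_j|] by grid search."""
    return grid[np.argmin(np.abs(rho_to_absmoment(grid) - c))]

rng = np.random.default_rng(4)
k, d = 2, 200_000
# Two unit-norm selection vectors whose supports overlap in one element.
w1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(k)
w2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(k)
M = np.array([[w1 @ w1, w1 @ w2], [w2 @ w1, w2 @ w2]])   # true Gram matrix
Y = np.abs(rng.multivariate_normal(np.zeros(2), M, size=d)).T  # folded samples

c_hat = float((Y[0] * Y[1]).mean())     # empirical E[|g_1||g_2|]
rho_hat = float(invert_absmoment(c_hat))
overlap = round(k * rho_hat)            # rounds to the true support overlap
```

Rounding `k * rho_hat` to the nearest integer mirrors the rounding step in the text, which is what allows a merely entrywise-accurate estimate to recover M exactly.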
Note that M = k · WW^⊤, where W is the matrix whose i-th row is w_i, so if we could factorize M in this way, we would be able to recover which private vectors generated each synthetic vector. Of course, this kind of factorization problem, even if we constrain the factors to be row-sparse like W, has multiple solutions. One reason is that any permutation of the columns of W would also be a viable solution, but this is not really an issue because the ordering of the private images is not identifiable to begin with. A more serious issue is that if m is too small, there might be row-sparse factorizations of M which appear to be valid, but which we could definitively rule out upon sampling more synthetic vectors. For instance, suppose the first k + 1 selection vectors all satisfied |supp(w_i) ∩ supp(w_j)| = k - 1 for all i ≠ j. Ignoring the fact that this is highly unlikely, in such a scenario it is impossible to distinguish between the case where the corresponding synthetic images all have the same k - 1 private images in common, and the case where there is a group T ⊆ [n] of k + 1 private images such that each of these synthetic images is comprised of a subset of T of size k. But if we then sampled a new selection vector w_{k+2} for which |supp(w_j) ∩ supp(w_{k+2})| = 1 for all j ∈ [k + 1], we could rule out the latter. This is indicative of a more general issue, namely that one cannot always recover the identity of a collection of subsets (even up to relabeling) if one only knows the sizes of their pairwise intersections! This leads to the following natural combinatorial question. What families of sets are uniquely identified (up to trivial ambiguities) by the sizes of the pairwise intersections? One answer to this question, as we show, is the family of all subsets of {1, ..., k + 2} of size k (see Lemma B.11). This leads us to the following definition:

Definition 3.1 (Floral Submatrices). A (k+2 choose k) × (k+2 choose k) matrix H is floral if the following holds.
Fix some lexicographic ordering on C^k_{[k+2]} and index H according to this ordering. There is some permutation matrix Π for which the matrix H̃ := Π^⊤HΠ satisfies, for every pair S, S' ∈ C^k_{[k+2]}, H̃_{S,S'} = |S ∩ S'|. See Example B.18 in the supplement. The upshot is that if we can identify a floral submatrix of M, then we know for certain that the subsets of private images picked by those selection vectors comprise all size-k subsets of some subset of [n] of size k + 2. In summary, using the pairwise intersection size information provided by the Gram matrix M, we can pinpoint collections of selection vectors which share a nontrivial amount of common structure.

Learning a Private Image With a Floral Submatrix What can we do with this common structure in a floral submatrix? Let t = (k+2 choose k). Given that the selection vectors w_{i_1}, ..., w_{i_t} corresponding to the rows of the floral submatrix only involve k + 2 different private images altogether, and there are t > k + 2 constraints of the form |⟨w_{i_j}, p_ℓ⟩| = (y_{X,w_{i_j}})_ℓ for any pixel ℓ ∈ [d], we can hope that for each pixel, we can uniquely recover the k + 2 private images from solving this system of equalities, where the unknowns are the values of the k + 2 private images at that particular pixel. A priori, the fact that the number of constraints in this system exceeds the number of unknowns does not immediately guarantee that this system has a unique solution up to multiplying the solution uniformly by -1. Here we exploit the fact that X is Gaussian to show that this is indeed the case almost surely (Lemma B.10). Finally, note that this system can be solved in time exp(O(k^2)) by simply enumerating over 2^t sign patterns. We conclude that if we could find a floral submatrix, then we would find not just one, but in fact k + 2 private images!
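The sign-pattern enumeration just described can be sketched for k = 2, where t = 6 and there are k + 2 = 4 unknown pixel values. This is a toy NumPy instance with variable names of our own; uniqueness up to a global sign is the content of Lemma B.10.

```python
import numpy as np
from itertools import combinations, product

rng = np.random.default_rng(8)
k = 2
subsets = list(combinations(range(k + 2), k))    # the t = 6 supports
t = len(subsets)

# Selection matrix: one row per support, weight 1/sqrt(k) on its k images.
W = np.zeros((t, k + 2))
for r, S in enumerate(subsets):
    W[r, list(S)] = 1 / np.sqrt(k)

z = rng.standard_normal(k + 2)                   # true pixel values
y = np.abs(W @ z)                                # observed magnitudes

# Enumerate all 2^t sign patterns and keep the least-squares solution
# with the smallest reconstruction residual.
best = None
for signs in product([-1.0, 1.0], repeat=t):
    z_cand, *_ = np.linalg.lstsq(W, np.array(signs) * y, rcond=None)
    res = float(np.linalg.norm(np.abs(W @ z_cand) - y))
    if best is None or res < best[0]:
        best = (res, z_cand)
z_hat = best[1]
```

For generic Gaussian z the winning solution equals ±z, matching the text's claim that the pixel values are recovered up to a global sign.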
Existence of a Floral Submatrix, and How to Find It It remains to understand how big m has to be before we can guarantee the existence of a floral submatrix inside the Gram matrix M with high probability. Obviously, if m were big enough that with high probability we see every possible synthetic image that could arise from a selection vector w in the support of D, then M will contain many floral submatrices. One surprising part of our result is that we can ensure the existence of a floral submatrix when m is much smaller. Our proof of this is quite technical, but at a high level it is based on the second moment method (see Lemma B.12). The final question is: provided a floral submatrix of M exists, how do we find it? Note that naively, we could always brute-force over all m^{O(k^2)} ≤ n^{O(k^3)} principal submatrices with exactly (k+2 choose k) rows/columns, and for each such principal submatrix we can check in exp(O(k)) time whether it is floral. Surprisingly, we give an algorithm that can identify a floral submatrix of M in time dominated by the time it takes to write down the entries of the Gram matrix M. Note that an off-the-shelf algorithm for subgraph isomorphism would not suffice, as the size of the submatrix in question is O(k^2), and furthermore such an algorithm would need to work for weighted graphs. Instead, our approach is to use the constructive nature of the proof of Lemma B.11, which shows that the family of all subsets of {1, ..., k + 2} of size k is uniquely identified by the sizes of the pairwise intersections. By algorithmizing this proof, we give an efficient procedure for finding a floral submatrix; see Algorithm 3 and Lemma B.17. An important fact we use is that if we restrict our attention to the entries of M equal to k - 1 or k - 2, this corresponds to a graph over the selection vectors which is sparse with high probability. We defer the formal specification and analysis of our algorithm to the supplement.
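For concreteness, the canonical floral pattern being searched for (Definition 3.1, before the unknown permutation Π) can be generated directly; a short NumPy sketch with a helper name of our own:

```python
import numpy as np
from itertools import combinations

def floral_matrix(k):
    """The canonical floral pattern: rows and columns indexed by all
    k-subsets of {0, ..., k+1}, with entries |S intersect S'|."""
    subsets = [frozenset(c) for c in combinations(range(k + 2), k)]
    t = len(subsets)
    H = np.zeros((t, t), dtype=int)
    for a in range(t):
        for b in range(t):
            H[a, b] = len(subsets[a] & subsets[b])
    return H

H = floral_matrix(2)   # k = 2: a 6 x 6 matrix with diagonal entries k = 2
```

A submatrix of M matching this pattern up to simultaneous row/column permutation certifies that the corresponding selection vectors cover all size-k subsets of some (k+2)-element set of private images.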

4. EXPERIMENTS

We describe an experiment demonstrating the utility of InstaHide for Gaussian images and comparing to the utility of another data augmentation scheme, MixUp Zhang et al. (2018) . We also informally report on our implementation of LEARNPUBLIC and its empirical efficacy.

4.1. CHOICE OF ARCHITECTURE AND PARAMETERS

As our empirical results are purely for proof-of-concept, we work with a fairly basic neural network architecture. We use a 4-layer neural network as a binary classifier,

y = argmax(softmax(W_4 σ(W_3 σ(W_2 σ(W_1 x + b_1) + b_2) + b_3) + b_4)),

where x ∈ R^10, W_1 ∈ R^{100×10}, W_2 ∈ R^{100×100}, W_3 ∈ R^{100×100}, W_4 ∈ R^{2×100}, b_1 ∈ R^100, b_2 ∈ R^100, b_3 ∈ R^100, b_4 ∈ R^2. We initialize the entries of each W_l and b_l to be i.i.d. draws from N(u_l, 1), where u_l is sampled from N(0, α) at the outset. We train the neural network for 100 epochs with cross-entropy loss and the SGD optimizer with a learning rate of 0.01. We do not need to distinguish between public and private images in our experiments, so we let k_priv = 0 and k_pub = k for k ∈ {1, 2, 3, 6}, and for each choice of k, we use random k-sparse selection vectors whose nonzero entries equal 1/k. In all of our experiments, we separate our original image data (before generating synthetic data) into two categories: 80% training data and 20% test data. We train on synthetic images generated by MixUp or InstaHide using the training data, and measure the "test accuracy" on the training data and the test data separately. We provide more choices of k in Appendix C.
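Written out as a forward pass, the classifier above looks as follows. This is a NumPy sketch: the text does not specify the activation σ, so ReLU is our assumption, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda z: np.maximum(z, 0)

def init_layer(shape, alpha=0.1):
    # Entries i.i.d. N(u_l, 1), with u_l ~ N(0, alpha) drawn at the outset.
    u = rng.normal(0, alpha)
    return rng.normal(u, 1, size=shape)

W1, b1 = init_layer((100, 10)), init_layer(100)
W2, b2 = init_layer((100, 100)), init_layer(100)
W3, b3 = init_layer((100, 100)), init_layer(100)
W4, b4 = init_layer((2, 100)), init_layer(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(x):
    h = relu(W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3)
    return int(np.argmax(softmax(W4 @ h + b4)))

label = predict(rng.standard_normal(10))
```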

4.2. GAUSSIAN DATA

Settings We considered binary classification on Gaussian data. We generated random images x_1, . . . , x_1000 ∈ R^10 from N(0, Id) and a random vector v ∈ R^10 from N(0, Id). We then ranked all the Gaussian images x based on Σ_i |x_i v_i| and labeled the largest half as '1' and the rest as '0'. The point of choosing this labeling function is that it would assign the same label to any x, x' which agree entrywise in magnitudes. Given a synthetic image generated via MixUp or InstaHide using selection vector w, we assigned it the label which is the convex combination of the one-hot encodings of the labels of the original images indexed by w.
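The labeling rule can be sketched as follows (NumPy, with variable names of our own); the final check illustrates the stated point that the label depends only on entrywise magnitudes, and hence is invariant under sign flips.

```python
import numpy as np

rng = np.random.default_rng(6)
n, dim = 1000, 10
X = rng.standard_normal((n, dim))          # Gaussian "images" x_1, ..., x_1000
v = rng.standard_normal(dim)

# Score each image by sum_i |x_i v_i|; label the largest half '1'.
scores = np.abs(X * v).sum(axis=1)
labels = (scores >= np.median(scores)).astype(int)

# Flipping entrywise signs of every image leaves all scores unchanged.
flip = rng.choice([-1.0, 1.0], size=dim)
flipped_scores = np.abs((X * flip) * v).sum(axis=1)
```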

Results

We compare training and test loss over epochs when training on a synthetic dataset generated by either MixUp or InstaHide, as shown in Figure 1. We use the convention in this paper of defining synthetic images under InstaHide to have all nonnegative entries (rather than imposing random sign flips), though we explore in the supplement how random sign flips can affect learnability. Compared to training on MixUp, training on InstaHide results in lower model performance (accuracy). As expected, when we increase k, both MixUp and InstaHide suffer a loss in accuracy: InstaHide's accuracy dropped by ∼10% at k = 6 compared to classical training (k = 1), while MixUp's dropped by ∼5%.

4.3. IMPLEMENTATION OF LEARNPUBLIC

We implemented LEARNPUBLIC for k_priv = 2 and n ∈ {2000, 5000, 7500, 10000}. For k_pub = 2, 4, 6, we respectively chose d = 1000, 1800, 2400. In particular, our choice of d is meant to work essentially for any choice of n (modulo the logarithmic dependence, which does not noticeably manifest in this regime). One heuristic modification that we made to LEARNPUBLIC was to use a diagonal thresholding approach from Cai et al. (2016) in place of solving the SDP in (2): namely, for every j ∈ [n] we computed the quantity (1/d) Σ_{i=1}^d y_i^2 · (x_i)_j^2, zeroed out all but the principal submatrix of M indexed by the top 25 such j, and computed the top eigenvector of the resulting matrix. For each parameter setting we found that, as expected, we were able to recover an average of at least 90% of the support. As this experiment was primarily to demonstrate that d can be much less than n, we did not explore further optimizations to the algorithm.

A DISCUSSION OF OTHER ATTACKS

Attack of Jagielski (2020) It has been pointed out Jagielski (2020) that for k_priv = 2, k_pub = 0, given a single synthetic image one can discern large regions of the constituent private images simply by taking the entrywise absolute value of the synthetic image. The reason is that the pixel values of a natural image are mostly continuous, i.e. nearby pixels typically have similar values, so the entrywise absolute value of the InstaHide image should be similarly continuous. That said, natural images have enough discontinuities that this breaks down if one mixes more than just two images, and as discussed above, this attack is not applicable when the individual private features are i.i.d. as in our setting.
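The diagonal-thresholding statistic used in the LEARNPUBLIC implementation above can be sketched in isolation. This is a NumPy toy instance of our own design: it keeps only the top-k rather than top-25 coordinates for the small example, and omits forming the thresholded matrix and its top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 500, 2000, 2
support = [3, 7]
w = np.zeros(n)
w[support] = 1 / np.sqrt(k)            # unknown k-sparse selection vector
P = rng.standard_normal((d, n))        # observed rows [p_i]
y = np.abs(P @ w)                      # observed synthetic pixels

# Diagonal statistic: (1/d) * sum_i y_i^2 * (p_i)_j^2 for each coordinate j.
# In expectation this is 1 + 2 w_j^2, so coordinates in supp(w) stand out.
stat = (y**2 @ P**2) / d
recovered = sorted(np.argsort(stat)[-k:].tolist())
```

Keeping only the highest-scoring coordinates and then running a small eigenvector computation on the induced principal submatrix is what replaces the SDP in this heuristic.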

Attack of Carlini et al. (2020)

A month after this submission, Carlini et al. (2020) independently gave an attack breaking the InstaHide challenge originally released by the authors of Huang et al. (2020b). In that challenge, the public dataset was ImageNet, the private dataset consisted of $n_{\mathrm{priv}} = 100$ natural images, and $k_{\mathrm{priv}} = 2$, $k_{\mathrm{pub}} = 4$, $m = 5000$. They were able to produce a visually similar copy of each private image. Most of their work goes towards recovering which private images contributed to each synthetic image. Their first step is to train a neural network on the public dataset to compute a similarity matrix with rows and columns indexed by the synthetic dataset, such that the (i, j)-th entry approximates the indicator for whether the pair of private images that are part of synthetic image i overlaps with the pair that is part of synthetic image j. Ignoring the rare event that two private images contribute to two distinct synthetic images, and ignoring the fact that the accuracy of the neural network for estimating similarity is not perfect, this similarity matrix is precisely our Gram matrix in the $k_{\mathrm{priv}} = 2$ case.

The bulk of the work of Carlini et al. (2020) is focused on giving a heuristic for factorizing this Gram matrix. They do so essentially by greedily decomposing the graph whose adjacency matrix is given by the Gram matrix into $n_{\mathrm{priv}}$ cliques (plus some k-means post-processing) and regarding each clique as consisting of synthetic images which share a private image in common. They then construct an $m \times n_{\mathrm{priv}}$ bipartite graph as follows: for every synthetic image index i and every private image index j, connect i to j if, for four randomly chosen elements $i_1, \ldots, i_4 \in [m]$ of the j-th clique, the $(i, i_\ell)$-th entries of the Gram matrix are nonzero. Finally, they compute a min-cost max-flow on this instance to assign every synthetic image to exactly $k_{\mathrm{priv}} = 2$ private images.

It then remains to handle the contribution from the public images. Their approach is quite different from our sparse PCA-based scheme. At a high level, they simply pretend the contribution from the public images is mean-zero noise and set up a nonconvex least-squares problem to solve for the values of the constituent private images.

Comparison to Our Generative Model

Before we compare our algorithmic approach to that of Carlini et al. (2020), we mention an important difference between the setting of the InstaHide challenge and the one studied in this work, namely the way in which the random subset of public/private images that get combined into a synthetic image is sampled. In our case, for each synthetic image, the subset is chosen independently and uniformly at random from the collection of all subsets consisting of $k_{\mathrm{priv}}$ private images and $k_{\mathrm{pub}}$ public images. For the InstaHide challenge, batches of $n_{\mathrm{priv}}$ synthetic images get sampled one at a time via the following process: for a given batch, sample two random permutations $\pi_1, \pi_2$ on $n_{\mathrm{priv}}$ elements and let the t-th synthetic image in this batch be given by combining the private images indexed by $\pi_1(t)$ and $\pi_2(t)$. Note that this process ensures that every private image appears exactly $2m/n_{\mathrm{priv}}$ times, barring the rare event that $\pi_1(t) = \pi_2(t)$ for some t in some batch. It remains to be seen to what extent the attack of Carlini et al. (2020) degrades in the absence of this sort of regularity property in our setting.
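The challenge's batch sampling process just described can be sketched as follows (the function name and the pair-list return format are illustrative):

```python
import numpy as np

def instahide_challenge_batches(n_priv, m, rng=None):
    """Batch sampling from the InstaHide challenge as described above:
    each batch of n_priv synthetic images draws two permutations pi1, pi2,
    and the t-th image in the batch mixes private images pi1(t), pi2(t)."""
    rng = np.random.default_rng(rng)
    pairs = []
    for _ in range(m // n_priv):
        pi1, pi2 = rng.permutation(n_priv), rng.permutation(n_priv)
        pairs.extend((int(a), int(b)) for a, b in zip(pi1, pi2))
    return pairs

pairs = instahide_challenge_batches(n_priv=10, m=50, rng=0)
flat = [i for p in pairs for i in p]
# the regularity property: every private image appears exactly 2m/n_priv times
assert all(flat.count(i) == 2 * 50 // 10 for i in range(10))
```

This makes the regularity property concrete: each index occurs exactly once per permutation per batch, hence $2m/n_{\mathrm{priv}}$ times overall, whereas our i.i.d. uniform model has no such guarantee.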

Comparison to Our Attack

The main commonality between our approach and that of Carlini et al. (2020) is to identify the question of extracting private information from the Gram matrix as the central algorithmic challenge. How we compute this Gram matrix differs: we use the relationship between the covariance of a folded Gaussian and the covariance of a Gaussian, while Carlini et al. (2020) train a neural network on the public dataset to approximate the Gram matrix. How we use this matrix also differs significantly. We do not produce a candidate factorization but instead pinpoint a collection of synthetic images such that we can provably ascertain that each one comprises $k_{\mathrm{priv}}$ private images from the same set of $k_{\mathrm{priv}} + 2$ private images. This allows us to set up an appropriate piecewise linear system of size $O(k_{\mathrm{priv}})$ with a provably unique solution and solve for the $k_{\mathrm{priv}} + 2$ private images. An exciting future direction is to understand how well the heuristic in Carlini et al. (2020) scales with $k_{\mathrm{priv}}$. Independent of the connection to InstaHide, it would be very interesting from a theoretical standpoint if one could show that their heuristic provably solves the multi-task phase retrieval problem defined in Problem 2 in time scaling only polynomially with $k_{\mathrm{priv}}$ (i.e. the sparsity of the vectors $w_1, \ldots, w_m$ in the notation of Problem 2).

B RECOVERING PRIVATE IMAGES FROM A GAUSSIAN DATASET

In this section we prove our main algorithmic result:

Theorem B.1 (Main). Let $S \subseteq [n]$, and let $n_{\mathrm{pub}} = |S|$ and $n_{\mathrm{priv}} = |S^c|$. Let $k = k_{\mathrm{pub}} + k_{\mathrm{priv}}$. If $d \geq \Omega(\mathrm{poly}(k_{\mathrm{pub}}, k_{\mathrm{priv}}) \cdot \log(n_{\mathrm{pub}} + n_{\mathrm{priv}}))$ and $m \geq \Omega\big(n_{\mathrm{priv}}^{\,k_{\mathrm{priv}} - 2/(k_{\mathrm{priv}}+1)} \cdot k^{\mathrm{poly}(k_{\mathrm{priv}})}\big)$, then with high probability over $X$ and the sequence of randomly chosen selection vectors $w_1, \ldots, w_m \sim \mathcal{D}$, there is an algorithm which takes as input the synthetic dataset $\{y_{X,w_i}\}_{i\in[m]}$ and the columns of $X$ indexed by $S$, and outputs $k_{\mathrm{priv}} + 2$ distinct images $\widetilde{x}_1, \ldots, \widetilde{x}_{k_{\mathrm{priv}}+2}$ for which there exist $k_{\mathrm{priv}} + 2$ distinct private images $x_{i_1}, \ldots, x_{i_{k_{\mathrm{priv}}+2}}$ satisfying $|\widetilde{x}_j| = |x_{i_j}|$ for all $j \in [k_{\mathrm{priv}}+2]$.

Two remarks on this guarantee (part of Remark B.2):

• Note that we can achieve recovery even when $m = o(n_{\mathrm{priv}}^{k_{\mathrm{priv}}})$. The reason this is significant is that as soon as $m = \Omega(n_{\mathrm{priv}}^{k_{\mathrm{priv}}})$, all possible combinations of $k_{\mathrm{priv}}$ private images are used. While it is still not immediately clear how to recover private images once this has happened, we regard the fact that we can do so well before this point as one of the most interesting aspects of our result.

• Finally, we remark that the runtime is largely dominated by the $O(m^2)$ term coming from forming an $m \times m$ matrix whose $(i,j)$-th entry turns out to equal $\langle w_i, w_j\rangle$ for all $i, j \in [m]$. In fact, naive implementations of the most sophisticated part of our algorithm (see Sections B.3, B.4, B.5, and B.6) require time $\omega(m^2)$, and getting these parts of the algorithm to run in $O(m^2)$ time turns out to be quite subtle.

B.1 LEARNING THE PUBLIC COORDINATES VIA GAUSSIAN PHASE RETRIEVAL

In this section we give a procedure which, given any synthetic image $y_{X,w}$, recovers the entire support of $[w]_S$. The algorithm is inspired by existing algorithms for sparse phase retrieval, with the catch that we need to handle the fact that we only get to observe the public subset of coordinates of any of the vectors $p_j$. Our algorithm, LEARNPUBLIC, is given in Algorithm 1 below; its main step is to solve the canonical sparse-PCA semidefinite program
$$\max_{Z \succeq 0}\ \langle Z, \widehat{M}\rangle \quad \text{subject to} \quad \mathrm{Tr}(Z) = 1,\ \ \sum_{i,j} |Z_{i,j}| \leq k_{\mathrm{pub}} \qquad (2)$$
and return the coordinates of the $k_{\mathrm{pub}}$ entries of the top eigenvector of the solution with the largest magnitudes.

We first show that the population version of the matrix $\widehat{M}$ formed in Step 1 is a rank-1 matrix whose top eigenvector is in the direction of $[w]_S$.

Lemma B.3. Let $w$ be a unit vector, and let $\widehat{M}$ be defined as
$$\widehat{M} \triangleq \frac{1}{d}\sum_{j=1}^{d}(y_j^2 - 1)\cdot\big([p_j]_S [p_j]_S^{\top} - \mathrm{Id}\big).$$
Then $\mathbb{E}[\widehat{M}] = 2\,[w]_S[w]_S^{\top}$.

Proof. First, the expectation of $\widehat{M}$ can be written as
$$\mathbb{E}[\widehat{M}] = \mathbb{E}_{p\sim\mathcal{N}(0,\mathrm{Id}_d)}\big[(\langle w,p\rangle^2 - 1)\cdot(p_S p_S^{\top} - \mathrm{Id})\big].$$
For any vector $v \in \mathbb{R}^n$ with $\|v\|_2 = 1$, we can compute
$$v^{\top}\mathbb{E}[\widehat{M}]v = \mathbb{E}_p\big[(\langle w,p\rangle^2 - 1)(\langle [v]_S, p\rangle^2 - 1)\big] = \underbrace{\mathbb{E}_p\Big[(\langle w,p\rangle^2-1)\Big(\|[v]_S\|_2^2\,\big\langle \tfrac{[v]_S}{\|[v]_S\|_2},\,p\big\rangle^2 - \|[v]_S\|_2^2\Big)\Big]}_{A_1} + \underbrace{\mathbb{E}_p\big[(\langle w,p\rangle^2-1)\big(\|[v]_S\|_2^2 - 1\big)\big]}_{A_2},$$
where the first step uses $\|v\|_2 = 1$. For the first term,
$$A_1 = \|[v]_S\|_2^2\,\mathbb{E}_p\Big[(\langle w,p\rangle^2-1)\Big(\big\langle \tfrac{[v]_S}{\|[v]_S\|_2}, p\big\rangle^2 - 1\Big)\Big] = 2\|[v]_S\|_2^2\,\mathbb{E}_p\Big[\phi_2(\langle w,p\rangle)\,\phi_2\big(\big\langle \tfrac{[v]_S}{\|[v]_S\|_2}, p\big\rangle\big)\Big] = 2\big\langle w, [v]_S\big\rangle^2,$$
where the second step uses that $w$ and $[v]_S/\|[v]_S\|_2$ are unit vectors and that $\phi_2$ denotes the normalized degree-2 Hermite polynomial $\phi_2(z) \triangleq \frac{1}{\sqrt{2}}(z^2 - 1)$, and the last step follows from the standard fact that $\mathbb{E}_{g\sim\mathcal{N}(0,\mathrm{Id}_d)}[\phi_i(\langle g, v_1\rangle)\phi_j(\langle g, v_2\rangle)] = \langle v_1, v_2\rangle^i$ if $i = j$ and $0$ otherwise. For the second term,
$$A_2 = (\|[v]_S\|_2^2 - 1)\cdot\mathbb{E}_p[\langle w,p\rangle^2 - 1] = 0.$$
Thus $v^{\top}\mathbb{E}[\widehat{M}]v = A_1 + A_2 = 2\langle w, [v]_S\rangle^2$. In particular, for $v = [w]_S/\|[w]_S\|_2$ the above quantity is $2\|[w]_S\|_2^2$, while for any unit $v$ with $[v]_S \perp [w]_S$ it is $0$, completing the proof.

Finally, we complete the proof of correctness of LEARNPUBLIC. Here we leverage the fact that we are running an SDP (the canonical SDP for sparse PCA) to show that as long as $d$ is at least polynomially large in $k_{\mathrm{pub}}$ and logarithmically large in $n$, with high probability we can recover $\mathrm{supp}([w]_S)$.

Lemma B.4 (Learning the public coordinates). For any $\delta > 0$, if $d \geq \mathrm{poly}(k_{\mathrm{pub}})\cdot\log(n/\delta)$, then with probability at least $1-\delta$ over the randomness of $X$, the coordinates output by LEARNPUBLIC$(\{([p_j]_S, y_j)\}_{j\in[d]})$ for $y_j \triangleq |\langle p_j, w\rangle|$ are exactly equal to $\mathrm{supp}([w]_S)$.

Proof. Let $Z$ be the solution to the SDP in (2), and define $w^* \triangleq [w]_S/\|[w]_S\|_2$.
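The computation above can be checked numerically. The following sketch (sizes illustrative) verifies that the top eigenvector of the empirical $\widehat{M}$ aligns with $[w]_S$:

```python
import numpy as np

# E[(<w,p>^2 - 1)([p]_S [p]_S^T - Id)] is a rank-1 matrix proportional to
# [w]_S [w]_S^T, so the top eigenvector of the empirical M-hat should
# align with [w]_S.
rng = np.random.default_rng(0)
d, n_tot, n_pub = 200_000, 12, 8          # S = first n_pub coordinates
w = np.zeros(n_tot)
w[[0, 1, 9]] = 1 / np.sqrt(3)             # unit w; public part on {0, 1}
P = rng.standard_normal((d, n_tot))
y = np.abs(P @ w)
pS = P[:, :n_pub]
M = (pS * (y**2 - 1)[:, None]).T @ pS / d - np.mean(y**2 - 1) * np.eye(n_pub)
top = np.linalg.eigh(M)[1][:, -1]
wS = w[:n_pub] / np.linalg.norm(w[:n_pub])
assert abs(float(top @ wS)) > 0.99
```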
Because $w^*{w^*}^{\top}$ is a feasible solution for the SDP, by optimality of $Z$ we get that
$$0 \leq \langle Z - w^*{w^*}^{\top}, \widehat{M}\rangle = \langle Z - w^*{w^*}^{\top}, \mathbb{E}[\widehat{M}]\rangle + \langle Z - w^*{w^*}^{\top}, \widehat{M} - \mathbb{E}[\widehat{M}]\rangle = 2\|[w]_S\|_2^2\,\underbrace{\langle Z - w^*{w^*}^{\top},\, w^*{w^*}^{\top}\rangle}_{①} + \underbrace{\langle Z - w^*{w^*}^{\top},\, \widehat{M} - \mathbb{E}[\widehat{M}]\rangle}_{②}, \qquad (3)$$
where in the last step we used Lemma B.3. Because $\|Z\|_F \leq \mathrm{Tr}(Z) = 1 = \|w^*{w^*}^{\top}\|_F$, we may upper bound ① by $-\frac{1}{2}\|Z - w^*{w^*}^{\top}\|_F^2$, and by Hölder's inequality ② is at most $2k_{\mathrm{pub}}\cdot\|\widehat{M} - \mathbb{E}[\widehat{M}]\|_{\max}$. Standard concentration implies that as long as $d \geq \log(n/\delta)/\eta^2$, we have $\|\widehat{M} - \mathbb{E}[\widehat{M}]\|_{\max} \leq \eta$. We conclude from (3) that
$$0 \leq -\|[w]_S\|_2^2\,\|Z - w^*{w^*}^{\top}\|_F^2 + 2k_{\mathrm{pub}}\eta, \quad\text{so}\quad \|Z - w^*{w^*}^{\top}\|_F^2 \leq 2k_{\mathrm{pub}}\eta/\|[w]_S\|_2^2 \leq 2\eta k_{\mathrm{pub}}^2,$$
where in the last step we used that if $w$ has at least one public coordinate, then $\|[w]_S\|_2^2 \geq 1/k_{\mathrm{pub}}$. By Davis-Kahan, this implies that the top eigenvector $\widehat{w}$ of $Z$ satisfies $\|\widehat{w} - w^*\|_2 \leq O(\sqrt{\eta}\,k_{\mathrm{pub}})$. As the nonzero entries of $w^*$ are at least $1/\sqrt{k_{\mathrm{pub}}}$, by taking $\eta = O(1/k_{\mathrm{pub}}^3)$ we ensure that $\|\widehat{w} - w^*\|_{\infty} \leq \|\widehat{w} - w^*\|_2 < 1/2\sqrt{k_{\mathrm{pub}}}$, so the largest entries of $\widehat{w}$ in magnitude will be in the same coordinates as the nonzero entries of $w^*$.

B.2 RECOVERING THE GRAM MATRIX VIA FOLDED GAUSSIANS

We now turn to the second step of our overall recovery algorithm: recovering the $m \times m$ Gram matrix whose $(i,j)$-th entry is $|\mathrm{supp}(w_i) \cap \mathrm{supp}(w_j)|$. For this section and the next four sections, we will assume that $S = \emptyset$, i.e. that all images are private; for brevity, let $k \triangleq k_{\mathrm{priv}}$. This turns out to be without loss of generality: given that in the case where $S \neq \emptyset$ we can recover the public coordinates of any selection vector using LEARNPUBLIC, passing to the case of general $S$ will be a simple matter of subtracting the contribution of the public coordinates from the entries of the Gram matrix obtained by GRAMEXTRACT to reduce to the case of $S = \emptyset$. We will elaborate on this in the final proof of Theorem B.1.

Given selection vectors $w_1, \ldots, w_m$, define the matrix $W \in \mathbb{R}^{m\times n}$ to have rows consisting of these vectors, so that the Gram matrix we are after is simply given by $WW^{\top}$. Recall that the $m\times d$ matrix whose rows consist of $y_{X,w_1}, \ldots, y_{X,w_m}$ can be written as
$$Y \triangleq \begin{pmatrix} |\langle p_1, w_1\rangle| & \cdots & |\langle p_d, w_1\rangle| \\ \vdots & \ddots & \vdots \\ |\langle p_1, w_m\rangle| & \cdots & |\langle p_d, w_m\rangle| \end{pmatrix},$$
and as each entry of $X$ is an independent standard Gaussian, the columns of $Y \in \mathbb{R}^{m\times d}_{\geq 0}$ can be regarded as independent draws from $\mathcal{N}^{\mathrm{fold}}(0, WW^{\top})$. Let $\Sigma^{\mathrm{fold}}$ denote the covariance of this folded Gaussian distribution. It is known that one can recover information about the covariance $WW^{\top}$ of the original Gaussian distribution from the covariance $\Sigma^{\mathrm{fold}}$ of its folded counterpart:

Lemma B.5 (Page 7 in Kan & Robotti (2017)). Given a Gaussian $\mathcal{N}(0,\Sigma)$, the covariance $\Sigma^{\mathrm{fold}} \in \mathbb{R}^{m\times m}$ of the corresponding folded Gaussian distribution $\mathcal{N}^{\mathrm{fold}}(0,\Sigma)$ is given by $\Sigma^{\mathrm{fold}}_{i,i} = \Sigma_{i,i}$ and, for $i \neq j$,
$$\Sigma^{\mathrm{fold}}_{i,j} = \Sigma_{i,j}\big(4\Phi_2(0,0;\rho_{i,j}) - 1\big) + 4\,\Sigma_{i,i}^{1/2}\Sigma_{j,j}^{1/2}\,(1-\rho_{i,j}^2)\,\phi_2(0,0;\rho_{i,j}) - \frac{2}{\pi}\,\Sigma_{i,i}^{1/2}\Sigma_{j,j}^{1/2},$$
where $\rho_{i,j} \triangleq \Sigma_{i,j}/(\Sigma_{i,i}^{1/2}\Sigma_{j,j}^{1/2})$.
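The folded-covariance relationship specialized to unit variances (Corollary B.6 below) can be checked with a quick Monte Carlo sketch (sample size illustrative):

```python
import numpy as np

def psi(z):
    """Psi(z) = (2/pi)(z*arcsin(z) + sqrt(1 - z^2) - 1), per Corollary B.6."""
    return (2 / np.pi) * (z * np.arcsin(z) + np.sqrt(1 - z**2) - 1)

# For unit-variance jointly Gaussian (g1, g2) with correlation rho,
# Cov(|g1|, |g2|) should equal Psi(rho).
rng = np.random.default_rng(0)
rho, N = 0.5, 2_000_000
g1 = rng.standard_normal(N)
g2 = rho * g1 + np.sqrt(1 - rho**2) * rng.standard_normal(N)
emp = (np.abs(g1) * np.abs(g2)).mean() - np.abs(g1).mean() * np.abs(g2).mean()
assert abs(emp - psi(rho)) < 5e-3
```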
We can apply Lemma B.5 in our specific setting to obtain the following relationship between $WW^{\top}$ and the covariance of $\mathcal{N}^{\mathrm{fold}}(0, WW^{\top})$:

Corollary B.6. If $\Sigma = WW^{\top} \in \mathbb{R}^{m\times m}$ for some matrix $W \in \mathbb{R}^{m\times n}$ whose rows are unit vectors, then the covariance $\Sigma^{\mathrm{fold}} \in \mathbb{R}^{m\times m}$ of the corresponding folded Gaussian distribution $\mathcal{N}^{\mathrm{fold}}(0,\Sigma)$ is given by
$$\Sigma^{\mathrm{fold}}_{i,j} = \begin{cases} 1 & \text{if } i = j; \\ \Psi(\langle w_i, w_j\rangle) & \text{if } i \neq j, \end{cases} \qquad\text{where } \Psi(z) \triangleq \frac{2}{\pi}\big(z\cdot\arcsin(z) + \sqrt{1-z^2} - 1\big).$$

Proof. Because the rows of $W$ are unit vectors, we have that $\Sigma_{i,j} = \rho_{i,j} = \langle w_i, w_j\rangle$ for all $i,j \in [m]$. To compute the off-diagonal entries of $\Sigma^{\mathrm{fold}}$, note that by definition of the bivariate Gaussian CDF and PDF,
$$\phi_2(0,0;\langle w_i,w_j\rangle) = \frac{1}{2\pi\sqrt{1 - \langle w_i,w_j\rangle^2}}, \qquad \Phi_2(0,0;\langle w_i,w_j\rangle) = \frac{1}{4} + \frac{\arcsin\langle w_i,w_j\rangle}{2\pi}.$$
The claim follows.

Algorithm 2: GRAMEXTRACT$(\{y_{X,w_i}\}_{i\in[m]}, \eta)$
Input: InstaHide dataset $\{y_{X,w_i}\}_{i\in[m]}$, accuracy parameter $\eta$
Output: Matrix $\mathbf{M}$ equal to the Gram matrix $k\cdot WW^{\top}$, scaled to have integer entries (see Lemma B.7)
1 $\eta^* \leftarrow O(\eta^2)$.
2 Let $z_1, \ldots, z_d \in \mathbb{R}^m$ be the vectors given by $(z_j)_i = (y_{X,w_i})_j$ for all $i \in [m]$, $j \in [d]$.
3 Form the empirical estimates $\widehat{\mu} = \frac{1}{d}\sum_{j=1}^{d} z_j$ and $\widehat{\Sigma} = \frac{1}{d}\sum_{j=1}^{d}(z_j - \widehat{\mu})(z_j - \widehat{\mu})^{\top}$, and define $\widehat{\Sigma}'$ to be the matrix obtained by applying the function $\mathrm{clip}_{\eta^*}$ entrywise to $\widehat{\Sigma}$.
4 Let $\widetilde{\Sigma}$ be the matrix obtained by applying $\Psi^{-1}$ entrywise to $\widehat{\Sigma}'$.
5 Let $\Sigma^*$ denote the matrix obtained by rounding every entry of $\widetilde{\Sigma}$ to the nearest multiple of $1/k$.
6 return $k\cdot\Sigma^*$.

We now show that provided the number of pixels is moderately large, we can recover the matrix exactly, regardless of the choice of selection vectors $w_1, \ldots, w_m \in \mathbb{R}^n$. The full algorithm, GRAMEXTRACT, is given in Algorithm 2 above.

Lemma B.7 (Extract Gram matrix). Suppose $d = \Omega(\log(m/\delta)/\eta^4)$. For a random Gaussian image matrix $X$ and arbitrary $w_1, \ldots, w_m \in \mathcal{S}^{n-1}_{\geq 0}$, let $\widetilde{\Sigma}$ be the matrix computed in Step 4 of GRAMEXTRACT$(\{y_{X,w_i}\}_{i\in[m]}, \eta)$, and let $k\cdot\Sigma^*$ be the output.
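GRAMEXTRACT can be sketched as follows. This is an illustrative reimplementation, not the paper's code: the bisection inverse for $\Psi$, the concrete clipping threshold (here, small entries are simply zeroed), and the test sizes are our choices.

```python
import numpy as np

def psi(z):
    """Psi(z) = (2/pi)(z*arcsin z + sqrt(1-z^2) - 1) from Corollary B.6."""
    return (2 / np.pi) * (z * np.arcsin(z) + np.sqrt(1 - z**2) - 1)

def psi_inv(y, tol=1e-12):
    """Invert the increasing map Psi: [0,1] -> [0, 1-2/pi] by bisection,
    clamping the input into Psi's range first."""
    y = min(max(y, 0.0), psi(1.0))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if psi(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def gram_extract(Y, k, clip=0.02):
    """Sketch of GRAMEXTRACT: rows of Y are synthetic images, columns are
    pixels.  Estimate the covariance across pixels, zero entries below the
    clipping threshold, apply Psi^{-1} entrywise, and round to the nearest
    multiple of 1/k.  Returns the integer matrix k * W W^T."""
    m, d = Y.shape
    mu = Y.mean(axis=1, keepdims=True)
    Sigma = (Y - mu) @ (Y - mu).T / d
    out = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            s = Sigma[i, j]
            out[i, j] = psi_inv(s) if s > clip else 0.0
    return np.rint(k * out).astype(int)
```

On the diagonal the centered empirical variance concentrates around $\mathrm{Var}(|g|) = 1 - 2/\pi = \Psi(1)$, so the clamped inverse maps it to 1 and the output diagonal to $k$, consistent with the rounding step.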
Then with probability $1-\delta$ over the randomness of $X$, we have that $|\widetilde{\Sigma}_{i,i'} - \langle w_i, w_{i'}\rangle| \leq \eta$ for all $i, i' \in [m]$. In particular, if $\eta = 1/2k$, then conditioned on this happening, $\Sigma^* = WW^{\top}$, so the output is exactly $k\cdot WW^{\top}$.

To prove this, we will need the following helper lemma about $\Psi^{-1}$.

Lemma B.8. There is an absolute constant $c > 0$ such that for any $0 < \eta < 1$ and $\widetilde{z}, z \geq \eta$, $|\Psi^{-1}(\widetilde{z}) - \Psi^{-1}(z)| \leq \frac{c}{\sqrt{\eta}}\cdot|\widetilde{z} - z|$.

Proof. Noting that $\Psi'(z) = 2\arcsin(z)/\pi$, we get that the derivative of $\Psi^{-1}$ at $z$ is given by $\frac{1}{\Psi'(\Psi^{-1}(z))} = \frac{\pi}{2\arcsin(\Psi^{-1}(z))}$. One can verify numerically that for $0 \leq x \leq 1$, $\frac{x^2}{\pi} \leq \Psi(x) \leq \frac{1.2\,x^2}{\pi}$, so in particular $\sqrt{\pi z/1.2} \leq \Psi^{-1}(z) \leq \sqrt{\pi z}$. The derivative of $\Psi^{-1}$ at $z$ is therefore upper bounded by $O\big(1/\arcsin(\sqrt{\pi z/1.2})\big) \leq O\big(\sqrt{1.2/(\pi z)}\big)$. In particular, for $z \geq \eta$, this is at most $O(1/\sqrt{\eta})$. In other words, over $\eta \leq z \leq 1$, $\Psi^{-1}$ is $O(1/\sqrt{\eta})$-Lipschitz, as claimed.

Up to this point we have not used the randomness of the process generating the selection vectors $w_1, \ldots, w_m$. Note that without leveraging this, there exist choices of $W$ for which it is information-theoretically impossible to discern anything. Indeed, consider a situation where $w_1, \ldots, w_m \in \mathcal{S}^{n-1}_{\geq 0}$ have pairwise disjoint supports. In this case all we learn is that the columns of $Y$ are entrywise absolute values of independent standard Gaussian vectors, as $WW^{\top} = \mathrm{Id}$. We now proceed to the most involved component of our proof, where we exploit the randomness of the selection vectors.
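The numerically verified inequality in the proof above can be checked directly over a fine grid:

```python
import numpy as np

# Check x^2/pi <= Psi(x) <= 1.2 x^2/pi over (0, 1], as claimed in the
# proof of Lemma B.8.
x = np.linspace(1e-6, 1.0, 100_000)
psi_vals = (2 / np.pi) * (x * np.arcsin(x) + np.sqrt(1 - x**2) - 1)
assert np.all(psi_vals >= x**2 / np.pi - 1e-12)
assert np.all(psi_vals <= 1.2 * x**2 / np.pi + 1e-12)
```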

B.3 SOLVING A LARGE SYSTEM OF EQUATIONS

In this section we show that if we can pinpoint a collection of selection vectors corresponding to all size-$k$ subsets of some set of $k+2$ private images, then we can solve a certain system of equations to uniquely (up to sign) recover those private images. We will need the following basic notion, corresponding to the fact that this system has only one solution up to sign. Let $\mathcal{C}^k_{[m]}$ denote the collection of size-$k$ subsets of $[m]$.

Definition B.9 (Generic solution of system of equations). For any $m$ and any vector $v = (v_S)_{S\in\mathcal{C}^k_{[m]}} \in \mathbb{R}^{\binom{m}{k}}$, we say that $v$ is generic if there are at most two solutions to the system
$$\Big|\sum_{i\in S} a_i\Big| = |v_S| \quad \forall\, S \in \mathcal{C}^k_{[m]}$$
in the variables $\{a_i\}_{i\in[m]}$. Note that there are exactly two solutions $\{a_i\}$ and $\{a'_i\}$ to this system if and only if $a'_i = -a_i$ for all $i \in [m]$ and $a_i \neq 0$ for some $i \in [m]$.

We now show that for Gaussian images, the abovementioned system of equations almost surely has a unique solution up to sign.

Lemma B.10 (Vector of Gaussian subset sums is generic). Let $g_1, \ldots, g_m$ be independent draws from $\mathcal{N}(0,1)$. For any $m$ satisfying $m \geq k+2$, the vector $v = (v_S)_{S\in\mathcal{C}^k_{[m]}}$ given by $v_S \triangleq \sum_{i\in S} g_i$ is generic almost surely (with respect to the randomness of $g_1, \ldots, g_m$).

Proof. First note that the entries of $v$ are all nonzero almost surely. For $v$ to not be generic, there must exist another vector $v'$ whose entrywise absolute value satisfies $|v'| = |v|$ but for which $v' \neq v, -v$, and for which there exist $h_1, \ldots, h_m$ satisfying $\sum_{i\in S} h_i = v'_S$ for all $S \in \mathcal{C}^k_{[m]}$. This would imply there exist indices $S, T$ for which $v'_S = v_S$ and $v'_T = -v_T$.

By the assumption that $m \geq k+2$ (and recalling that $k > 1$ in our setup), we have that $\binom{m}{k} > m$. In particular, the set of vectors $u = (u_S)_{S\in\mathcal{C}^k_{[m]}}$ for which there exist numbers $\{g'_i\}$ such that $u_S = \sum_{i\in S} g'_i$ for all $S$ is a proper subspace $U$ of $\mathbb{R}^{\binom{m}{k}}$. Let $\ell_1, \ldots, \ell_a$ be a basis for the set of vectors $\ell$ satisfying $\langle \ell, u\rangle = 0$ for all $u \in U$.

Note that there is at least one nonzero generic vector in $U$, for instance the vector $u^*$ given by $u^*_S = \mathbb{1}[1 \in S]$ (here we again use the fact that $m \geq k+2$). Letting $D \in \mathbb{R}^{\binom{m}{k}\times\binom{m}{k}}$ denote the diagonal matrix whose $S$-th diagonal entry is equal to $v'_S/v_S$, note that the existence of $h_1, \ldots, h_m$ above implies that $v$ additionally satisfies $\langle D\ell_i, v\rangle = 0$ for all $i \in [a]$. But there must be some $i$ for which $D\ell_i$ does not lie in the span of $\ell_1, \ldots, \ell_a$, or else we would conclude that for any $u \in U$, the vector $u'$ whose $S$-th entry is $u_S \cdot v'_S/v_S$ would also lie in $U$. Because of the existence of indices $S, T$ for which $v'_S = v_S$ and $v'_T = -v_T$, we know that $u' \neq u, -u$, so we would erroneously conclude that $u$ is not generic for any $u \in U$, contradicting the fact that the vector $u^*$ defined above is generic. We conclude that there is some $i$ for which $D\ell_i$ lies outside the span of $\ell_1, \ldots, \ell_a$. But then the fact that $\langle D\ell_i, v\rangle = 0$ for this particular $i$ implies that the variables $g_i$ satisfy some nontrivial linear relation. This almost surely cannot be the case because $g_1, \ldots, g_m$ are independent draws from $\mathcal{N}(0,1)$.
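A brute-force check of Lemma B.10 in the smallest interesting case ($k = 2$, $m = 4$; enumerating all sign patterns is only feasible at this toy scale):

```python
import itertools
import numpy as np

# Given only |g_i + g_j| over all pairs {i, j}, the vector g should be
# determined up to a single global sign.
rng = np.random.default_rng(0)
m, k = 4, 2
g = rng.standard_normal(m)
subsets = list(itertools.combinations(range(m), k))
A = np.zeros((len(subsets), m))
for r, S in enumerate(subsets):
    A[r, list(S)] = 1.0
v_abs = np.abs(A @ g)

solutions = []
for signs in itertools.product([-1.0, 1.0], repeat=len(subsets)):
    b = np.array(signs) * v_abs
    a = np.linalg.lstsq(A, b, rcond=None)[0]
    if np.allclose(A @ a, b, atol=1e-8):        # exact solution of the system
        if not any(np.allclose(a, s, atol=1e-6) for s in solutions):
            solutions.append(a)

assert len(solutions) == 2
assert np.allclose(solutions[0], -solutions[1])
```

Each candidate sign pattern is accepted only if the resulting linear system is exactly solvable; for Gaussian $g$ only the two patterns corresponding to $\pm g$ survive, matching the definition of generic.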

B.4 LOCATING A SET OF USEFUL SELECTION VECTORS

In the previous section we showed that we just need to find a set of selection vectors from among the rows of $W$ that correspond to all size-$k$ subsets of some set of $k+2$ private images. Here we show that such a collection of selection vectors is uniquely identified, up to trivial ambiguities, by their pairwise inner products (Lemma B.11 below).

Proof. For the reader's convenience, we illustrate the sequence of subsets constructed in the following proof in Table 1. Suppose without loss of generality that $\mathcal{F}$ contains the sets $S_{1,2} \triangleq \{1, \ldots, k\}$ and $S_{k+1,k+2} \triangleq \{3, \ldots, k+2\}$ (the indexing will become clear momentarily); write $S_0 \triangleq S_{1,2}$ and $S_1 \triangleq S_{k+1,k+2}$. We will show that $\{T_S\} = \mathcal{C}^k_U$ for $U = [k+2]$. Let $S^* \triangleq S_0 \cap S_1$. For any $S' \in \mathcal{C}^k_{[k+2]}$ satisfying $|S_0 \cap S'| = |S_1 \cap S'| = k-1$,

B.5 EXISTENCE OF A FLORAL SUBMATRIX

Recall the notion of a floral submatrix from Definition 3.1. In this section we show that with high probability $\mathbf{M}$ contains a floral principal submatrix. In the language of sets, this means that with high probability over a sufficiently long sequence of randomly chosen size-$k$ subsets of $[n]$, there is a collection of $\binom{k+2}{k}$ subsets in the sequence which together comprise all size-$k$ subsets of some $U \subseteq [n]$ of size $k+2$. Quantitatively, we have the following:

Lemma B.12 (Existence of a floral submatrix). Let $m \geq \Omega\big(k^{O(k^3)}\, n^{k - 2/(k+1)}\big)$. If sets $T_1, \ldots, T_m$ are independent draws from the uniform distribution over $\mathcal{C}^k_{[n]}$, then with probability at least $9/10$, there is some $U \in \mathcal{C}^{k+2}_{[n]}$ for which every element of $\mathcal{C}^k_U$ is present among $T_1, \ldots, T_m$.

Proof. Let $L = \binom{k+2}{k} = \frac{1}{2}(k+2)(k+1)$. Define
$$Z \triangleq \sum_{i_1 < \cdots < i_L \in [m]} \mathbb{1}\big[\{T_{i_1}, \ldots, T_{i_L}\} = \mathcal{C}^k_U \text{ for some } U \in \mathcal{C}^{k+2}_{[n]}\big].$$
By linearity of expectation, $\mathbb{E}[Z]$ is equal to $\binom{m}{L}$ times the probability that $\{T_1, \ldots, T_L\} = \mathcal{C}^k_U$ for some $U \in \mathcal{C}^{k+2}_{[n]}$. The latter probability is equal to $\binom{n}{k+2}\cdot L!\cdot\binom{n}{k}^{-L}$, so we conclude that
$$\mathbb{E}[Z] = \binom{m}{L}\cdot\binom{n}{k+2}\cdot L!\cdot\binom{n}{k}^{-L} \geq \frac{m^L\, n^{k+2}}{n^{kL}}\cdot\frac{L!\,(k!)^L}{L^L\,(k+2)^{k+2}} \geq \Omega\big(m^L\, n^{k+2-kL}\big) \geq \Omega(1),$$
where in the penultimate step we used that $\frac{L!\,(k!)^L}{L^L(k+2)^{k+2}}$ is nonnegative and increasing over $k \geq 2$, and in the last step we used that $m \geq \Omega\big(n^{k-2/(k+1)}\big)$.

We now upper bound $\mathbb{E}[Z^2]$. Consider a pair of distinct summands indexed by $(i_1, \ldots, i_L)$ and $(i'_1, \ldots, i'_L)$. Without loss of generality, we may assume these are $(1, \ldots, L)$ and $(L-s+1, \ldots, 2L-s)$ for some $0 \leq s \leq L$, so that the two tuples share $s$ indices. In order for $\{T_1, \ldots, T_L\} = \mathcal{C}^k_U$ and $\{T_{L-s+1}, \ldots, T_{2L-s}\} = \mathcal{C}^k_{U'}$ for some $U, U' \in \mathcal{C}^{k+2}_{[n]}$, it must be that $\{T_{L-s+1}, \ldots, T_L\} = \mathcal{C}^k_{U\cap U'}$. Note that if $|U \cap U'| = k+2$, then $U = U'$ and the two summands coincide. So for distinct summands with $s > 0$, it must be that $|U \cap U'| \in \{k, k+1\}$. In either case, the probability that $\{T_1, \ldots, T_{L-s}\} = \mathcal{C}^k_U \setminus \mathcal{C}^k_{U\cap U'}$, $\{T_{L+1}, \ldots, T_{2L-s}\} = \mathcal{C}^k_{U'}\setminus \mathcal{C}^k_{U\cap U'}$, and $\{T_{L-s+1}, \ldots, T_L\} = \mathcal{C}^k_{U\cap U'}$ is
$$(L-s)!^2\cdot s!\cdot\binom{n}{k}^{-(2L-s)} \leq L!^2\cdot(k/n)^{2kL - ks}.$$
If $|U\cap U'| = k$, then $s$ must be $1$, and there are $\binom{n}{k}\binom{n-k}{2}\binom{n-k-2}{2} \leq n^{k+4}$ choices for $(U, U')$. If $|U\cap U'| = k+1$, then $s$ must be $k+1$, and there are $\binom{n}{k+1}\cdot(n-k-1)\cdot(n-k-2) \leq n^{k+3}$ choices for $(U, U')$. Finally, note that there are $\binom{m}{L}$ pairs of summands which coincide, $m\cdot\binom{m-1}{L-1}\cdot\binom{m-L}{L-1} \leq \Theta(m)^{2L-1}\cdot L!^2$ pairs sharing $s = 1$ index, and $\binom{m}{k+1}\cdot\binom{m-k-1}{L-k-1}\cdot\binom{m-L}{L-k-1} \leq \Theta(m)^{2L-k-1}\cdot L!^2$ pairs sharing $s = k+1$ indices. Putting everything together, we conclude that
$$\mathbb{E}[Z^2] \leq \mathbb{E}[Z]^2 + \mathbb{E}[Z] + \Theta(m)^{2L-1}\cdot L!^4\cdot n^{k+4}\cdot(k/n)^{2kL-k} + \Theta(m)^{2L-k-1}\cdot L!^4\cdot n^{k+3}\cdot(k/n)^{2kL-k(k+1)} \leq \mathbb{E}[Z]^2\cdot\Big(1 + O(1/\mathbb{E}[Z]) + O(1/m)\cdot L!^4\cdot k^{2kL-k} + O(1/m^{k+1})\cdot L!^4\cdot k^{2kL-k(k+1)}\cdot n^{k^2-1}\Big) \leq (1.01\,\mathbb{E}[Z])^2,$$
where in the last step we used that $L \leq k^2$ and that $n^{k^2-1}/m^{k+1} \leq 1$ because $m \geq k^{\Omega(k^3)}\, n^{k-1}$. By Paley-Zygmund, we conclude that
$$\mathbb{P}\big[Z > 0.01\,\mathbb{E}[Z]\big] \geq 0.99^2\cdot\frac{\mathbb{E}[Z]^2}{\mathbb{E}[Z^2]} \geq 9/10,$$
as desired, upon picking constant factors appropriately.

Lemma B.12 implies that with probability at least $9/10$ over the randomness of the selection vectors $w_1, \ldots, w_m$, if $m \geq \Omega\big(k^{O(k^3)}\, n^{k-2/(k+1)}\big)$, then there is a subset of $[m]$ for which the corresponding principal submatrix of $k\cdot WW^{\top}$ is floral. By Lemma B.7, with high probability $\mathbf{M} = k\cdot WW^{\top}$, so this is also the case for the output of GRAMEXTRACT.
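The set-system statement of Lemma B.12 can be simulated at toy scale (parameters illustrative and far from the lemma's asymptotic regime):

```python
import itertools
import numpy as np

def has_floral_family(subset_list, n, k):
    """Return True iff some U in C^{k+2}_[n] has every element of C^k_U
    present among the sampled size-k subsets (the set-system form of a
    floral principal submatrix)."""
    seen = set(subset_list)
    for U in itertools.combinations(range(n), k + 2):
        if all(S in seen for S in itertools.combinations(U, k)):
            return True
    return False

rng = np.random.default_rng(0)
n, k, m = 6, 2, 150
draws = [tuple(sorted(rng.choice(n, size=k, replace=False).tolist()))
         for _ in range(m)]
assert has_floral_family(draws, n, k)
```

Here $L = \binom{4}{2} = 6$ specific pairs must all appear for some size-4 set $U$; with this many draws that happens with overwhelming probability.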

B.6 FINDING A FLORAL SUBMATRIX

As mentioned in Section 3, to find a floral principal submatrix of $\mathbf{M}$, one option is to enumerate over all subsets of $[m]$ of size $\binom{k+2}{k}$, which would take $n^{O(k^3)}$ time. We now give a much more efficient procedure for identifying a floral principal submatrix of $\mathbf{M}$, whose runtime is dominated by the time it takes to write down the entries of $\mathbf{M}$. At a high level, the reason we can obtain such dramatic savings is that the underlying graph defined by the large entries of $WW^{\top}$ is quite sparse, i.e. vertices of the graph typically have degree independent of $k$. We will need the following basic notion:

Definition B.13. Given $i \in [m]$ and an integer $0 \leq t \leq k$, let $N^t_i \triangleq \{j : \langle w_i, w_j\rangle = t/k\}$. For any $j \in N^t_i$, we refer to $i$ and $j$ as $t$-neighbors (this relation is obviously symmetric).

We will also need the following helper lemmas establishing certain deterministic regularity conditions that $WW^{\top}$ will satisfy with high probability.

Lemma B.14 (Hypergraph sparsity). For any $\delta > 0$, if $m \geq n^{k-1}\log(1/\delta)$, then with probability at least $1 - 2m\delta$ over the randomness of $w_1, \ldots, w_m$, we have that for every $j \in [m]$, there are at most $O(m\cdot k^{k+1}\cdot n^{1-k})$ $(k-1)$-neighbors of $j$, and at most $O(m\cdot k^{k+2}\cdot n^{2-k})$ $(k-2)$-neighbors of $j$.

15 if there are exactly $k-2$ choices of $i'$ which are $(k-1)$-neighbors of $i_z, i_\alpha, i_\beta$ and $(k-2)$-neighbors of $i_{1-z}, i_\gamma, i_\delta$, and which are not all $(k-1)$-neighbors of each other then 16 Add to $I'$ all such $i'$.
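Definition B.13 translates directly into code on the integer Gram matrix $\mathbf{M} = k\cdot WW^{\top}$ produced by GRAMEXTRACT (a one-line sketch; the function name is ours):

```python
import numpy as np

def t_neighbors(M, i, t):
    """N_i^t from Definition B.13, computed from the integer Gram matrix
    M = k * W W^T: indices j != i with |supp(w_i) ∩ supp(w_j)| = t
    (equivalently <w_i, w_j> = t/k)."""
    return [j for j in range(M.shape[0]) if j != i and M[i, j] == t]
```

For example, with $\mathbf{M} = \begin{pmatrix}3&2&0&1\\2&3&0&1\\0&0&3&1\\1&1&1&3\end{pmatrix}$ and $k = 3$, index 0 has the single 2-neighbor 1 and index 3 has 1-neighbors 0, 1, 2.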



Footnotes: We did not describe how the labels for the synthetic vectors are assigned, but this part of InstaHide will not be important for our theoretical results, and we defer discussion of labels to Section 4. See Problem 1 and Remark 2.7 for what exact recovery precisely means in this context. We will often refer to public/private/synthetic feature vectors as images, and their coordinates as pixels, in keeping with the original applications of InstaHide to image datasets in Huang et al. (2020b). Note that any such vector does not specify a convex combination, but this choice of normalization is just to make some of the analysis later on somewhat cleaner, and our results would still hold if we chose the vectors in the support of $\mathcal{D}$ to have entries summing to 1.

, while for $v$ with $[v]_S \perp [w]_S$, the above quantity is $0$. Thus we complete the proof.

return $I, F$.



Figure 1: Comparing MixUp and InstaHide training on a Gaussian dataset with different values of k.

Furthermore, the algorithm runs in time $O(dm^2 + dn_{\mathrm{pub}}^2 + n_{\mathrm{pub}}^{2\omega+1})$, where $\omega \approx 2.373$ is the exponent of matrix multiplication.

Remark B.2. Here we give some interpretation of the quantitative guarantees of Theorem B.1:

• The number of pixels $d$ only needs to depend logarithmically on the number of public/private images and polynomially on the sparsity parameters $k_{\mathrm{pub}}, k_{\mathrm{priv}}$, which will be some small positive integer (e.g. $k_{\mathrm{pub}} + k_{\mathrm{priv}} = 4$ or 8 in Huang et al. (2020a), $k_{\mathrm{pub}} + k_{\mathrm{priv}} = 4$ or 6 in Huang et al. (2020b), and $k_{\mathrm{pub}} + k_{\mathrm{priv}} = 2$ in the implementation of MixUp in Zhang et al. (2018)), so the regime in which Theorem B.1 applies is quite realistic.

Algorithm 1: LEARNPUBLIC$(\{([p_j]_S, y_j)\}_{j\in[d]})$
Input: Samples $([p_1]_S, y_1), \ldots, ([p_d]_S, y_d)$
Output: $\mathrm{supp}([w]_S)$ with probability at least $1-\delta$, provided $d \geq \mathrm{poly}(k_{\mathrm{pub}})\cdot\log(n/\delta)$
1 Form the matrix $\widehat{M} \triangleq \frac{1}{d}\sum_{j=1}^{d}(y_j^2 - 1)\cdot\big([p_j]_S[p_j]_S^{\top} - \mathrm{Id}\big)$.
2 Solve the semidefinite program (SDP) in (2) (this step takes $n_{\mathrm{pub}}^{2\omega+1}$ time via Jiang et al. (2020)), and let $Z$ be its solution.
3 Compute the top eigenvector $\widehat{w}$ of $Z$, and return the coordinates of the $k_{\mathrm{pub}}$ entries of $\widehat{w}$ with the largest magnitudes.

Lemma B.11 (Uniquely identifying a family of subsets). Let $\mathcal{F} = \{T_S\}_{S\in\mathcal{C}^k_{[k+2]}}$ be a collection of subsets of $[n]$ for which $|T_S \cap T_{S'}| = |S \cap S'|$ for all $S, S' \in \mathcal{C}^k_{[k+2]}$. Then there is some subset $U \subseteq [n]$ of size $k+2$ for which $\{T_S\} = \mathcal{C}^k_U$ as (unordered) sets.

Table 1: Illustration of the sequence of subsets constructed in the proof of Lemma B.11 for $k = 4$. Red and blue denote $S_0$ and $S_1$, purple denotes $S_{a,b}$ for $a \in \{1,2\}$, $b \in \{k+1, k+2\}$, green denotes the $4k-8$ sets $S'$, and gold denotes the $\binom{k-2}{k-4} = 1$ set $S''$.

observe that $S'$ must contain $S^*$ and one element from each of $S_0\setminus S_1 = \{1,2\}$ and $S_1\setminus S_0 = \{k+1, k+2\}$, so there are four such choices of $S'$, call them $\{S_{a,b}\}_{a\in\{1,2\},\,b\in\{k+1,k+2\}}$, and $\mathcal{F}$ must contain all of them.

Now consider any subset $S' \subset [k+2]$ for which, for some $b \neq b' \in \{k+1, k+2\}$, we have that $|S' \cap S_{1,2}| = |S' \cap S_{1,b}| = |S' \cap S_{2,b}| = k-1$ and $|S' \cap S_{k+1,k+2}| = |S' \cap S_{1,b'}| = |S' \cap S_{2,b'}| = k-2$. Observe that it must be that $|S' \cap S^*| = k-3$ and that $S'$ contains $\{1,2\}$, so there are $2\cdot\binom{k-2}{k-3} = 2k-4$ such choices of $S'$, and $\mathcal{F}$ must contain all of them. We can similarly consider $S'$ for which, for some $a \neq a' \in \{1,2\}$, we have that $|S' \cap S_{k+1,k+2}| = |S' \cap S_{a,k+1}| = |S' \cap S_{a,k+2}| = k-1$ and $|S' \cap S_{1,2}| = |S' \cap S_{a',k+1}| = |S' \cap S_{a',k+2}| = k-2$, for which there are again $2k-4$ choices of $S'$, and $\mathcal{F}$ must contain all of them.

Alternatively, if $\mathcal{F}$ contained $k-2$ subsets $S'$ satisfying $|S' \cap S_{1,2}| = |S' \cap S_{b,k+1}| = |S' \cap S_{b,k+2}| = k-1$ for some $b \in \{1,2\}$, then it would have to be that any such $S'$ contains the $k-1$ elements of $\{b, 3, \ldots, k\}$, and therefore the intersection between any pair of such $S'$ must be equal to $k-1$, violating the constraint that $|T_S \cap T_{S'}| = |S \cap S'|$ for all $S, S' \in \mathcal{C}^k_{[k+2]}$. The same reasoning applies to rule out the case where $\mathcal{F}$ contains $k-2$ subsets $S'$ satisfying $|S' \cap S_{k+1,k+2}| = |S' \cap S_{1,b}| = |S' \cap S_{2,b}| = k-1$ for some $b \in \{k+1, k+2\}$.

Finally, consider the set of all subsets $S''$ distinct from the ones exhibited thus far, for which $|S'' \cap S_0| = |S'' \cap S_1| = |S'' \cap S_{a,b}| = k-2$ for all $a \in \{1,2\}$, $b \in \{k+1,k+2\}$, and for which $|S'' \cap S'| = k-1$ for at least one of the $4k-8$ subsets $S'$ constructed two paragraphs above. Observe that any $S''$ distinct from the ones exhibited thus far which satisfies the first constraint must either contain $S^*$ and two elements outside of $\{1, \ldots, k+2\}$, or must satisfy $|S'' \cap S^*| = k-4$ and contain $\{1, 2, k+1, k+2\}$. In the former case, such an $S''$ would violate the second constraint. As for the latter case, there are $\binom{k-2}{k-4}$ such choices of $S''$, and $\mathcal{F}$ must therefore contain all of them.

We have now produced $4k-2+\binom{k-2}{k-4} = \binom{k+2}{k}$ unique subsets, all belonging to $\mathcal{C}^k_{[k+2]}$, and $\mathcal{F}$ is of size $\binom{k+2}{k}$, concluding the proof.

17 if $|I'| = 4k-8$ then
18 If $z = 0$, set $F(i_\alpha) \leftarrow \{1, 3, \ldots, k, k+1\}$, $F(i_\beta) \leftarrow \{2, 3, \ldots, k, k+1\}$, $F(i_\gamma) \leftarrow \{1, 3, \ldots, k, k+2\}$, and $F(i_\delta) \leftarrow \{2, 3, \ldots, k, k+2\}$.
19 If $z = 1$, set $F(i_\alpha) \leftarrow \{1, 3, \ldots, k, k+1\}$, $F(i_\beta) \leftarrow \{1, 3, \ldots, k, k+2\}$, $F(i_\gamma) \leftarrow \{2, 3, \ldots, k, k+1\}$, and $F(i_\delta) \leftarrow \{2, 3, \ldots, k, k+2\}$.
20 if there are exactly $\binom{k-2}{k-4}$ choices of $i''$ which are $(k-2)$-neighbors of $i_0, i_1, i_\alpha, i_\beta, i_\gamma$, and $i_\delta$, and which are also $(k-1)$-neighbors of at least one $i' \in I'$ then
21 Let $I''$ denote the set of such $i''$.
22 $I \leftarrow \{i_0, i_1\} \cup \{i_\alpha, i_\beta, i_\gamma, i_\delta\} \cup I' \cup I''$.
23 Let $\mathbf{M}_{\mathrm{sub}}$ denote the $\binom{k-2}{k-4}\times\binom{k-2}{k-4}$ submatrix of $\mathbf{M}$ given by restricting to the rows and columns indexed by $I''$ and subtracting 4 from every entry.
24 _, $G \leftarrow$ FINDFLORALSUBMATRIX$(\mathbf{M}_{\mathrm{sub}}, k-2)$.
25 For every $i'' \in I''$, set $F(i'') \leftarrow G(i'') \cup \{1, 2, k+1, k+2\}$.

LEARNPRIVATEIMAGE$(\{y_{X,w_i}\}_{i\in[m]})$
Input: InstaHide dataset $\{y_{X,w_i}\}_{i\in[m]}$
Output: Vectors $\widetilde{x}_1, \ldots, \widetilde{x}_{k+2} \in \mathbb{R}^d$ equal to $k+2$ images (up to signs) from the original private dataset
1 $\mathbf{M} \leftarrow \frac{1}{k_{\mathrm{priv}}}\cdot$GRAMEXTRACT$\big(\{y_{X,w_i}\}, \frac{1}{2k_{\mathrm{pub}}+2k_{\mathrm{priv}}}\big)$.
2 for $i \in [m]$ do
3 $S_i \leftarrow$ LEARNPUBLIC$(\{([p_j]_S, y_j)\}_{j\in[d]})$.

Figure 3: Comparing vanilla, MixUp, and InstaHide training on a Gaussian magnitude dataset with different values of k.

We may upper bound ① by $-\frac{1}{2}\|Z - w^*{w^*}^{\top}\|_F^2$. For ②, note that because the entrywise $\ell_1$-norms of $Z$ and $w^*{w^*}^{\top}$ are both upper bounded by $k_{\mathrm{pub}}$, by Hölder's inequality we can upper bound ② by $2k_{\mathrm{pub}}\cdot\|\widehat{M} - \mathbb{E}[\widehat{M}]\|_{\max}$. Standard concentration (see e.g. Neykov et al. (2016)) implies that as long as $d \geq \log(n/\delta)/\eta^2$, we have $\|\widehat{M} - \mathbb{E}[\widehat{M}]\|_{\max} \leq \eta$.

FUNDING

This work was supported in part by NSF CAREER Award CCF-1453261, NSF Large CCF-1565235, and Ankur Moitra's ONR Young Investigator Award.

ANNEX

Figure 2: Illustration of a house $(i; j_1, j_2, j_3, j_4)$, where the solid lines indicate an entry of $k-1$ in $\mathbf{M}$, while the dotted lines indicate an entry of $k-2$.

Proof. We will union bound over $j \in [m]$, so without loss of generality fix $j = 1$ in the argument below. Let $X_j$ (resp. $Y_j$) denote the indicator for the event that $1$ and $j$ are $(k-1)$-neighbors (resp. $(k-2)$-neighbors). As $w_j$ is sampled independently of $w_1$, conditioned on $w_1$ we know that $X_j$ is a Bernoulli random variable with expectation
$$\mathbb{E}[X_j] = \frac{k(n-k)}{\binom{n}{k}} \leq n^{1-k}\cdot k^{k+1},$$
where the factor of $k(n-k)$ comes from the number of ways to pick $\mathrm{supp}(w_1)\setminus\mathrm{supp}(w_j)$ and $\mathrm{supp}(w_j)\setminus\mathrm{supp}(w_1)$. Similarly, $Y_j$ is a Bernoulli random variable with expectation
$$\mathbb{E}[Y_j] = \binom{k}{2}\binom{n-k}{2}\Big/\binom{n}{k} \leq n^{2-k}\cdot k^{k+2}.$$
By Chernoff, $\sum_{j\geq 2} X_j > 2m\cdot n^{1-k}\cdot k^{k+1}$ with probability at most $\delta$, from which the first claim follows. Similarly, by Chernoff, $\sum_{j\geq 2} Y_j > 2m\cdot n^{2-k}\cdot k^{k+2}$ with probability at most $\delta$, from which the second claim follows.

Definition B.15. Given a symmetric matrix $\mathbf{M} \in \mathbb{Z}^{m\times m}$ and distinct indices $i, j_1, \ldots, j_4 \in [m]$ for which $j_1 < j_4$, we say that $(i; j_1, \ldots, j_4)$ is a house (see Figure 2) if $\mathbf{M}_{i,j_a} = k-1$ for all $a \in [4]$, while the entries $\mathbf{M}_{j_a,j_b}$ follow the pattern of solid ($k-1$) and dotted ($k-2$) edges depicted in Figure 2.

Lemma B.16 (Upper bounding the number of houses). If $m \geq \Omega(n^{2k/3})$, then with probability at least $9/10$ over the randomness of $w_1, \ldots, w_m$, there are at most $O\big(k^{O(k)}\cdot m^5\cdot n^{2-4k}\big)$ houses.

Proof. Define
$$Z \triangleq \sum_{i, j_1, \ldots, j_4 \text{ distinct},\ j_1 < j_4} \mathbb{1}\big[(i; j_1, \ldots, j_4) \text{ is a house}\big].$$
By linearity of expectation, $\mathbb{E}[Z]$ is equal to $m\cdot\binom{m-1}{4} \leq m^5$ times the probability that $(1; 2, 3, 4, 5)$ is a house. Note that the only way for $(1; 2, 3, 4, 5)$ to be a house is if there are disjoint subsets $S_1, S_2, T \subseteq [n]$ of size $2$, $2$, and $k-2$ respectively such that $w_1$ is supported on $S_1 \cup T$ and each of $w_2, \ldots, w_5$ is supported on $\{s_1, s_2\} \cup T$ where $s_1 \in S_1$, $s_2 \in S_2$. There are at most $n^{k+2}$ such choices of $(S_1, S_2, T)$, and for each there is an $O\big(\binom{n}{k}^{-5}\big)$ chance that the supports of $w_1, \ldots, w_5$ correspond to a given $(S_1, S_2, T)$, so we conclude that $\mathbb{E}[Z] \leq O\big(k^{O(k)}\cdot m^5\cdot n^{2-4k}\big)$. We now upper bound $\mathbb{E}[Z^2]$.
Consider a pair of distinct summands $(i; j_1, \dots, j_4)$ and $(i'; j'_1, \dots, j'_4)$, and recall that for both to be houses they must correspond to some $(S_1, S_2, T)$ and $(S'_1, S'_2, T')$ respectively. Note that if these tuples overlap in any index (e.g. $(1; 2, 3, 4, 5)$ and $(6; 1, 7, 8, 9)$), then the sets $U \triangleq S_1 \cup S_2 \cup T$ and $U' \triangleq S'_1 \cup S'_2 \cup T'$ must intersect, and for any such $U$ there are $\le \mathrm{poly}(k)$ ways of partitioning $U$ into three disjoint sets of size $2$, $2$, and $k-2$ respectively. We conclude that the contribution of any pair of distinct summands in the expansion of $\mathbb{E}[Z^2]$ is controlled by $b$, the number of distinct indices within the tuples $(i; j_1, \dots, j_4)$ and $(i'; j'_1, \dots, j'_4)$. For any $b$, there are $m^5 \cdot \binom{m-5}{b-5} \le m^b$ such pairs of tuples. In the special case where $b = 6$, we will use a slightly sharper bound by noting that then it must be that $S_1 \cup S_2 \cup T$ and $S'_1 \cup S'_2 \cup T'$ are identical, in which case we can improve the above bound of $O(n^{k+4})$ for the number of pairs $(U, U')$ to $O(n^{k+2})$. We conclude that $\mathbb{E}[Z^2] \le (1 + o(1)) \cdot \mathbb{E}[Z]^2$, where in the last step we used the fact that $m \ge \Omega(n^{2k/3})$ and $k \ge 2$ to bound the summands corresponding to $b = 6$ and $b = 7$. Finally, by our bounds on $\mathbb{E}[Z]$ and $\mathbb{E}[Z^2]$, we conclude by Chebyshev's inequality that with probability at least $9/10$, there are at most $2\mathbb{E}[Z] \le O(k^{5k} \cdot m^5 \cdot n^{2-4k})$ houses.

Lemma B.17 (Finding a floral submatrix). Suppose $m = \Omega(n^{k - 2/(k+1)})$. With probability at least $3/4$, FINDFLORALSUBMATRIX($\mathbf{M}$) runs in time $O(n^{2k - 4/(k+1)} \cdot \exp(\mathrm{poly}(k)))$ and outputs a subset $I \subseteq [m]$ of size $\binom{k+2}{k}$ indexing a $\binom{k+2}{k} \times \binom{k+2}{k}$ principal submatrix of $\mathbf{M}$ which is floral, together with a function $F : I \to \mathcal{C}^{[k+2]}_k$ for which $\mathbf{M}_{j,j'} = |F(j) \cap F(j')|$ for all $j, j' \in I$.

Proof. The proof of correctness essentially follows immediately from the proof of Lemma B.11, while the runtime analysis will depend crucially on the sparsity of the underlying weighted graph defined by $\mathbf{M}$, as guaranteed by Lemmas B.14 and B.16. Henceforth, condition on the events of those lemmas holding, which will happen with probability at least $3/4$.
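As a sanity check on the support structure used in the proof of Lemma B.16, the following sketch builds a small concrete instance (the value $k = 4$ and the particular sets $S_1, S_2, T$ are arbitrary choices, and taking the apex support to be $S_1 \cup T$ follows the reconstruction above) and verifies that the resulting overlaps form the house pattern of Figure 2: $k-1$ between the apex and each base support and along a 4-cycle of the base supports, and $k-2$ across the two diagonals.

```python
from itertools import combinations

def overlap(a, b):
    """Size of the intersection of two supports (sets of coordinates)."""
    return len(a & b)

# Small instantiation of the structure from the proof of Lemma B.16:
# disjoint S1, S2 of size 2 and T of size k - 2, here with k = 4.
k = 4
S1, S2, T = {0, 1}, {2, 3}, {4, 5}
assert len(T) == k - 2

# Apex support: S1 together with T (size k), per the reconstruction above.
w_apex = S1 | T
# The four "base" supports {s1, s2} | T, one per choice (s1, s2) in S1 x S2.
base = [frozenset({s1, s2}) | T for s1 in sorted(S1) for s2 in sorted(S2)]

# The apex overlaps every base support in exactly k - 1 coordinates ...
assert all(overlap(w_apex, w) == k - 1 for w in base)
# ... and each pair of base supports overlaps in k - 1 coordinates when the
# two choices share s1 or s2 (the 4-cycle), and k - 2 otherwise (diagonals).
counts = sorted(overlap(a, b) for a, b in combinations(base, 2))
assert counts == [k - 2, k - 2, k - 1, k - 1, k - 1, k - 1]
```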

First note that if one reaches as far as Step 20 in FINDFLORALSUBMATRIX, then by the proof of Lemma B.11, the $I$ produced in Step 22 indexes a principal submatrix of $\mathbf{M}$ which is floral. The recursive call in Step 24 is applied to a submatrix of $\mathbf{M}$ whose size is independent of $n$, so the time expended past that point is no worse than some $\exp(\mathrm{poly}(k))$, and inductively we know that the resulting $F$ produced in Step 25 when the recursion is complete correctly maps indices to subsets in $\mathcal{C}^{[k+2]}_k$ such that $\mathbf{M}_{j,j'} = |F(j) \cap F(j')|$ for all $j, j' \in I$.

To carry out the rest of the runtime analysis, it suffices to bound the time expended leading up to the recursive call. Consider any house $(i_0; j_1, j_2, j_3, j_4)$ encountered in Step 5. First note that one can compute $\bigcap_{a=1}^4 N^{k-1}_{j_a}$ with a basic hash table, so because the first part of Lemma B.14 tells us that with high probability every index has at most $O(m \cdot k^{k+1} \cdot n^{1-k})$ $(k-1)$-neighbors, Step 5 only requires $O(m \cdot k^{k+1} \cdot n^{1-k})$ time. Similarly, for each of the $O(1)$ possibilities in the loop in Step 14, it takes $O(m \cdot k^{k+1} \cdot n^{1-k})$ time to enumerate over the $(k-1)$-neighbors of $i_z, i_\alpha, i_\beta$ in Step 15 and, by the second part of Lemma B.14, $O(m \cdot k^{k+2} \cdot n^{2-k})$ time to enumerate over the $(k-2)$-neighbors of $i_{1-z}, i_\gamma, i_\delta$, and it takes $\mathrm{poly}(k)$ time to check that the resulting indices $i$ are not all $(k-1)$-neighbors of each other. And once more, in Step 20 it takes $O(m \cdot k^{k+2} \cdot n^{2-k})$ time to enumerate over all indices which are $(k-2)$-neighbors of $i_0, i_1$ and of every $i \in I'$.

We conclude that for every house $(i_0; j_1, j_2, j_3, j_4)$, FINDFLORALSUBMATRIX expends at most $O(m \cdot k^{k+2} \cdot n^{2-k})$ time checking whether the house can be expanded into a set of indices corresponding to a floral principal submatrix of $\mathbf{M}$. For any tuple $(i_0; j_1, j_2, j_3, j_4)$ encountered in Step 4 which is not a house, the algorithm expends only $O(1)$ time, and by Lemma B.14 there are at most $O(k^{4k+4} \cdot m^5 \cdot n^{4-4k})$ such tuples which are not houses. And because Lemma B.16 tells us that with high probability there are $O(k^{5k} \cdot m^5 \cdot n^{2-4k})$ houses in $\mathbf{M}$, FINDFLORALSUBMATRIX outputs None with low probability. In particular, given that any single house can take the algorithm from Step 9 all the way potentially to Step 24, we conclude that the houses contribute a total of at most $O(k^{6k+2} \cdot m^6 \cdot n^{4-5k})$ time. Putting everything together, we conclude that FINDFLORALSUBMATRIX runs in time $O(n^{2k - 4/(k+1)} \cdot \exp(\mathrm{poly}(k)))$. Lastly, note that $k + 4 - 10/(k+1) \le 2k - 4/(k+1)$ whenever $k \ge 2$, completing the proof.
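The elementary exponent comparisons used at the end of this proof, together with the fact that the hypothesis $m = \Omega(n^{k - 2/(k+1)})$ of Lemma B.17 implies the hypothesis $m \ge \Omega(n^{2k/3})$ of Lemma B.16, can be double-checked with exact rational arithmetic; the range of sparsities tested below is an arbitrary choice.

```python
from fractions import Fraction

# Verify, for sparsities k >= 2, the inequality k + 4 - 10/(k+1) <= 2k - 4/(k+1)
# invoked at the end of the proof of Lemma B.17, and the comparison
# k - 2/(k+1) >= 2k/3 showing that m = Omega(n^{k - 2/(k+1)}) implies
# m >= Omega(n^{2k/3}), the hypothesis of Lemma B.16.
for k in range(2, 100):
    assert k + 4 - Fraction(10, k + 1) <= 2 * k - Fraction(4, k + 1)
    assert k - Fraction(2, k + 1) >= Fraction(2 * k, 3)
print("exponent inequalities hold for k = 2, ..., 99")
```

Both inequalities are tight exactly at $k = 2$, which is why the proof singles out $k \ge 2$.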

B.7 PUTTING EVERYTHING TOGETHER

We are now ready to conclude the proof of correctness of our main algorithm, LEARNPRIVATEIMAGE. By Lemma B.17, with high probability the output $(I, F)$ of FINDFLORALSUBMATRIX in Step 7 satisfies that 1) the principal submatrix of $\mathbf{M}$ indexed by $I$, a set of indices of size $\binom{k_{\mathrm{priv}}+2}{k_{\mathrm{priv}}}$, is floral, and 2) the function $F : I \to \mathcal{C}^{[k_{\mathrm{priv}}+2]}_{k_{\mathrm{priv}}}$ satisfies $|F(i) \cap F(j)| = \mathbf{M}_{i,j}$ for all $i, j \in I$. By Lemma B.11, because the principal submatrix indexed by $I$ is floral, there exists some subset $U \subseteq [n]$ of size $k_{\mathrm{priv}} + 2$ for which the supports of the mixup vectors $w_j$ for $j \in I$ are all the subsets of $U$ of size $k_{\mathrm{priv}}$. Finally, by Lemma B.10 and the fact that the entries of $\mathbf{X}$ are independent Gaussians, for every pixel index $\ell \in [d]$, the solution $\{x^{(\ell)}_i\}$ to the system in Step 8 satisfies that there is some column $x$ of the original private image matrix $\mathbf{X}$ such that for every $i \in [k_{\mathrm{priv}} + 2]$, $x^{(\ell)}_i$ is, up to sign, equal to the $\ell$-th pixel of $x$. Note that the runtime of LEARNPRIVATEIMAGE is dominated by the operations of forming the matrix $\mathbf{M}$ and running FINDFLORALSUBMATRIX, which take time $O(m^2)$ by Lemma B.17.

B.8 EXAMPLE OF A FLORAL SUBMATRIX

Example B.18. For $k = 2$, the following $6 \times 6$ matrix, after dividing every entry by $k$, is floral, with $F$ mapping its indices, in order, to the 2-element subsets $\{1,2\}, \{1,3\}, \{1,4\}, \{2,3\}, \{2,4\}, \{3,4\}$ of $[4]$:

4 2 2 2 2 0
2 4 2 2 0 2
2 2 4 0 2 2
2 2 0 4 2 2
2 0 2 2 4 2
0 2 2 2 2 4

[Algorithm FINDFLORALSUBMATRIX (pseudocode, partially recovered): Step 4 loops over tuples $(i_0; j_1, j_2, j_3, j_4)$ for which $j_1 < j_4$; Step 5 checks whether $(i_0; j_1, \dots, j_4)$ is a house, and if so Step 6 increments $N_{\mathrm{houses}}$; Step 14 loops over $z \in \{0, 1\}$ and distinct $\alpha, \beta, \gamma, \delta \in [4]$ for which $\alpha < \beta$ and $i_\gamma$ (resp. $i_\delta$) is a $(k-1)$-neighbor of $i_\alpha$ (resp. $i_\beta$), and for which $i_0, i_\alpha, i_\beta$ are $(k-1)$-neighbors and $i_1, i_\gamma, i_\delta$ are $(k-1)$-neighbors.]
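The $k = 2$ example can be reproduced mechanically. The sketch below (the ordering of the 2-element subsets is an assumption; any ordering gives a floral matrix up to permutation) builds the matrix whose $(i, j)$ entry is $k \cdot |F(i) \cap F(j)|$, so that dividing every entry by $k$ recovers the intersection pattern.

```python
from itertools import combinations

# Index rows/columns by the binom(k+2, k) = 6 two-element subsets of
# {1, 2, 3, 4}, which play the role of the sets F(j) in Example B.18.
k = 2
subsets = [frozenset(c) for c in combinations(range(1, k + 3), k)]
assert len(subsets) == 6

# Entry (i, j) is k times the intersection size |F(i) & F(j)|, so dividing
# every entry by k yields a floral matrix.
M = [[k * len(a & b) for b in subsets] for a in subsets]
for row in M:
    print(row)
```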

