ONE-STEP ESTIMATOR FOR PERMUTED SPARSE RECOVERY

Abstract

This paper considers unlabeled sparse recovery under multiple measurements, i.e., Y = Π*XB* + W, where Y, Π*, X, B*, and W represent the observations, the missing (or incomplete) correspondence information, the sensing matrix, the sparse signals, and the additive sensing noise, respectively. Different from previous works on multiple measurements (m > 1), which all focus on the sufficient-samples regime, namely n > p, we consider a sparse matrix B* and investigate the insufficient-samples regime (i.e., n ≪ p) for the first time. To begin with, we establish lower bounds on the sample number and the signal-to-noise ratio (SNR) required for correct permutation recovery. Moreover, we present a simple yet effective estimator. Under mild conditions, we show that our estimator can restore the correct correspondence information with high probability. Numerical experiments are presented to corroborate our theoretical claims.

1. INTRODUCTION

In recent years, linear regression with permuted correspondence has received increasing attention due to its wide applications in machine learning, signal processing, and statistics. Among these applications, the two most prominent examples are (i) record linkage, which merges two datasets pertaining to the same objects into one comprehensive dataset; and (ii) data de-anonymization, which infers the hidden labels of private data from public datasets. Other applications include pose and correspondence estimation in graphics; time-domain sampling in the presence of clock jitter; multi-target tracking; unsupervised data alignment, etc. (Pananjady et al., 2018; Slawski & Ben-David, 2019; Slawski et al., 2020; Zhang et al., 2018). In this paper, we consider the canonical model, i.e., a linear sensing relation with permuted labels: Y = Π*XB* + W, where Y ∈ R^{n×m} is the sensing result, Π* ∈ R^{n×n} is an unknown permutation matrix, X ∈ R^{n×p} is the design (sensing) matrix, B* ∈ R^{p×m} represents the sparse signals of interest, and W ∈ R^{n×m} denotes the additive noise. Assuming B* is sparse, more specifically, that each column of B* is k-sparse, we would like to (i) study the statistical limits of permutation recovery in this scenario, e.g., the minimum sample number n and signal-to-noise ratio (SNR); and (ii) propose a practical estimator that can efficiently recover the permutation once the minimum requirements are met. To begin with, we briefly review previous works. Related Works. The study of permuted linear regression has a long history that dates back at least to DeGroot & Goel (1976; 1980); Goel (1975); Bai & Hsing (2005). Recent interest in this area starts from Unnikrishnan et al. (2015). Focusing on the noiseless case W = 0 with a single measurement (m = 1), Unnikrishnan et al.
(2015) establish the necessary condition n ≥ 2p for permutation recovery when B* is an arbitrary vector residing within the linear space R^p. Later, Pananjady et al. (2018) extend the analysis to the noisy scenario. They show that the minimum SNR must be at least of order Ω(n^c), where c > 0 is some positive constant; numerical experiments suggest c lies within the interval [4, 5]. Other works, such as Hsu et al. (2017); Abid et al. (2017); Slawski & Ben-David (2019); Tsakiris et al. (2020); Haghighatshoar & Caire (2018), also focus on this regime and obtain the same answer. In Emiya et al. (2014), the setting with a sparse signal B* is first studied; however, only an empirical investigation is conducted, without rigorous theoretical analysis. In the first work with theoretical analysis (Zhang & Li, 2021), both the statistical limits and practical estimators with almost optimal performance are presented for the permutation recovery. Peng et al. (2021) study the problem from the viewpoint of algebraic geometry. All existing works suggest that SNR = Ω(n^c) is inevitable for permutation reconstruction if only one measurement is conducted, namely m = 1. On the other hand, numerous works suggest that multiple measurements, i.e., m > 1, can greatly reduce the SNR requirement, even down to some positive constant. This line of research starts from Zhang et al. (2022), where the information-theoretic lower bounds and the maximum likelihood (ML) estimator are investigated. Later, Zhang & Li (2020) study the problem from the viewpoint of non-convex optimization and propose an optimal estimator for permutation recovery. Independently, Slawski et al. (2020) investigate the problem from the viewpoint of denoising. Putting parsimonious constraints on the number of permuted rows, they view (I − Π*)XB* as sparse outliers and design the permutation recovery algorithm accordingly. These works all focus on the sufficient-samples regime, namely n = Ω(p).
In this paper, we focus on the insufficient-samples regime. Assuming B* to be sparse, we show that the correct permutation can be obtained with n ≪ p and SNR = O(1). Our contributions are summarized as follows.

• We propose a one-step estimator for the correspondence recovery, which consists of two sub-parts: one for Π* and another for B*. By formulating the correspondence recovery as a linear assignment problem (LAP) (Kuhn, 1955; Bertsekas & Castañón, 1992; Burkard et al., 2012), the correct permutation matrix can be obtained once the SNR is above a certain positive constant.

On top of the above contributions, we briefly mention our proof strategy, which is based on a tailored version of the leave-one-out technique. Compared with previous works adopting the leave-one-out technique (Chen et al., 2020; Sur et al., 2019; El Karoui, 2013; 2018; Cai et al., 2021), our construction method has the following characteristics.

• Our construction method is adaptive: it replaces a varying number of rows, ranging from 2 to 4, depending on each permuted row. Meanwhile, previous works such as Chen et al. (2020); Sur et al. (2019); El Karoui (2013; 2018); Cai et al. (2021) replace a fixed number of rows (or columns).

• We not only leave out the rows but also modify the thresholding operator applied to the perturbed samples B̂^(·), from thres(·) to (·)_imax (its definition is deferred to Subsection 4.2). This step is essential in controlling the approximation error, since otherwise the non-zero elements of the matrices B̂ (∝ X⊤Y) and B̂^(·) may not share the same positions and the approximation error can be considerably large. A thorough understanding is deferred to the proof of Theorem 3.

Notations. Denote by c, c′, c_i some positive constants, whose values are not necessarily the same even under the same notation. We write a ≲ b if there exists some positive constant c_0 > 0 such that a ≤ c_0·b.
Similarly, we define a ≳ b provided a ≥ c_0·b for some positive constant c_0. We write a ≍ b when a ≲ b and a ≳ b hold simultaneously. For an arbitrary matrix M, we denote M_{i,:} as its ith row, M_{:,i} as its ith column, and M_{ij} as its (i,j)th element. Its Frobenius norm is written as |||M|||_F and its operator norm as |||M|||_OP; their definitions can be found in Section 2.3 of Golub & Van Loan (2013). In addition, we define its stable rank as srank(M) = |||M|||_F² / |||M|||_OP² (Section 2.1.15 in Tropp (2015)) and its support set as supp(M) = {(i,j) : M_{ij} ≠ 0}. The inner product between two matrices (or two vectors) is denoted as ⟨·, ·⟩. We define the set of all possible permutation matrices as P_n = {Π ∈ {0,1}^{n×n} : Σ_{i=1}^n Π_{ij} = 1, Σ_{j=1}^n Π_{ij} = 1}. Associated with each permutation matrix Π, we define the operator π(·) that maps index i to π(i) under Π. The Hamming distance between two permutation matrices Π_1 and Π_2 is defined as d_H(Π_1, Π_2) = Σ_{i=1}^n 1(π_1(i) ≠ π_2(i)). The SNR is defined as |||B*|||_F² / (m·σ²).
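The stable rank and the Hamming distance above admit one-line implementations; the following numpy snippet (an illustrative sketch, not part of the paper's algorithms) sanity-checks the two extreme cases of srank(·):

```python
import numpy as np

def srank(M):
    """Stable rank |||M|||_F^2 / |||M|||_OP^2."""
    return np.sum(M ** 2) / np.linalg.norm(M, ord=2) ** 2

def d_hamming(pi1, pi2):
    """Hamming distance between permutations given as index arrays."""
    return int(np.sum(pi1 != pi2))

# A rank-one matrix has stable rank 1; the identity on R^4 has stable rank 4.
u = np.ones((4, 1))
assert np.isclose(srank(u @ u.T), 1.0)
assert np.isclose(srank(np.eye(4)), 4.0)
assert d_hamming(np.array([0, 1, 2]), np.array([1, 0, 2])) == 2
```

The stable rank is at most the rank and measures how spread-out the singular values are, which is exactly the quantity the conditions of Theorem 3 constrain.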

2. PROBLEM FORMULATION

We start this section with a formal restatement of the considered problem:

Y = Π*XB* + W,  (1)

where Π* ∈ P_n denotes the (fixed but unknown) permutation matrix, X ∈ R^{n×p} is the sensing (design) matrix whose entries are i.i.d. standard normal random variables, i.e., X_{ij} ~ N(0, 1),¹ B* ∈ R^{p×m} is a fixed sparse matrix awaiting reconstruction (corresponding to the sparse signal), and W ∈ R^{n×m} denotes the noise, with entries W_{ij} i.i.d. ~ N(0, σ²). We put separate sparsity constraints on each column of B*, namely ‖B*_{:,ℓ}‖_0 ≤ k (1 ≤ ℓ ≤ m). In addition, we denote by h the number of permuted rows, or equivalently, the Hamming distance between the identity matrix I and the permutation matrix Π*, namely h ≜ d_H(I, Π*). Our goal is to reconstruct both the permutation matrix Π* and the sparse signal B* from (1). Note that we do not assume different columns of B* share the same support set. In fact, we prefer each column to have a different support set, since otherwise rank(B*) is bounded by k, which brings extra difficulties to the permutation recovery; a detailed explanation is deferred to Section 4. Before proceeding, we briefly review the prior art. In Unnikrishnan et al. (2015), where B* can reside within the entire linear space R^p, it is proved that n ≥ 2p is required for correct permutation recovery. As a result, subsequent works such as Pananjady et al. (2018); Zhang et al. (2022); Slawski & Ben-David (2019); Slawski et al. (2020); Zhang & Li (2020) all focus on the sufficient-samples regime, i.e., n = Ω(p). Not until Zhang & Li (2021) does the insufficient-samples regime, i.e., n = o(p), receive its first theoretical investigation. Similar to our setting, they put a sparsity assumption on B*, but focus on the single-measurement scenario, namely m = 1. They show the minimum SNR for correct recovery of (Π*, supp(B*)) to be Ω(n^{c_1/srank(B*)} · p^{c_2·p/n}).
In the following context, we will show that the minimum SNR can be significantly reduced, more specifically, to some positive constant, provided multiple measurements are made (m ≫ 1).
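For concreteness, the sensing relation (1) can be simulated as follows (a numpy sketch; all sizes, the noise level, and the cyclic choice of Π* are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, k, h, sigma = 50, 200, 10, 5, 12, 0.1   # note n << p

# k-sparse signal matrix B*, each column with its own support set.
B = np.zeros((p, m))
for j in range(m):
    B[rng.choice(p, size=k, replace=False), j] = rng.standard_normal(k)

# Permutation pi* displacing exactly h rows (one cycle over h indices).
pi = np.arange(n)
moved = rng.choice(n, size=h, replace=False)
pi[moved] = np.roll(pi[moved], 1)

X = rng.standard_normal((n, p))                  # i.i.d. N(0, 1) design
W = sigma * rng.standard_normal((n, m))          # additive Gaussian noise
Y = (X @ B)[pi] + W                              # Y = Pi* X B* + W, row-wise

assert int(np.sum(pi != np.arange(n))) == h      # d_H(I, Pi*) = h
assert all(np.count_nonzero(B[:, j]) <= k for j in range(m))
```

Representing Π* by the index array `pi` (row i of Y comes from row `pi[i]` of XB*) keeps d_H(I, Π*) easy to read off.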

3. INFORMATION THEORETIC LOWER BOUNDS

This section establishes the information-theoretic lower bounds for correct permutation recovery. Our goal is to ensure that both Π* and B* can be reliably reconstructed. We investigate this problem from two perspectives: (i) the sample number n and (ii) the minimum SNR.

3.1. THE MINIMUM SAMPLE NUMBER n

We obtain the minimum sample number n such that the sparse signal B* can be reliably recovered with high probability. Here we consider the oracle situation where Π* is given a priori.

3.2. THE MINIMUM SNR

Then we turn to the minimum SNR requirement for correct permutation recovery. To begin with, we restate the prior art in Zhang et al. (2022).

Theorem 1 (Theorem 1 in Zhang et al. (2022)). Consider the oracle case where B* is given a priori. Then there exists an integer n_0 such that for an arbitrary estimator Π̂, we have inf_Π̂ sup_{Π* ∈ P_n} P_X,W(Π̂ ≠ Π*) ≥ 1/2, provided that (i) log det(I + B*B*⊤/σ²) < log n and (ii) n ≥ n_0.

¹ Experiments suggest that this assumption may be relaxed to X_{ij} being i.i.d. sub-Gaussian random variables.

One drawback of this bound is the missing role of the sparsity number k. This is because Theorem 1 assumes B* to be perfectly known, while k only kicks in when B* needs to be reconstructed. To handle this issue, we take supp(B*) into account as well. Then we have

Theorem 2. There exists an integer n_0 ≥ 0 such that for arbitrary estimators Π̂ and B̂, we have inf_{Π̂,B̂} sup_{Π ∈ P_n, B ∈ B_{n,p,m,k}} P_X,W[(Π̂, supp(B̂)) ≠ (Π, supp(B))] ≥ 1/2 for all n ≥ n_0, where P_n denotes the set of all permutation matrices, B_{n,p,m,k} is defined as the set {B ∈ R^{p×m} : log det(I + BB⊤/σ²) ≤ log n + m·log C(p,k)/n} with C(p,k) denoting the binomial coefficient, and supp(·) ≜ {(i,j) : (·)_{i,j} ≠ 0} denotes the support set.

This theorem suggests that for all possible estimators of Π and supp(B), there exists at least one pair (Π, B), Π ∈ P_n, B ∈ B_{n,p,m,k}, such that the reconstruction error rate is at least 1/2. Hence, reliable correspondence reconstruction requires B* (a fixed but unknown matrix) to satisfy log det(I + B*B*⊤/σ²) > log n + m·log C(p,k)/n. We leave the rigorous proof to the supplementary material and only give an intuitive interpretation, which comes from coding theory. First, we assume each entry of B to be binary, i.e., B_{ij} ∈ {0,1} (1 ≤ i ≤ p, 1 ≤ j ≤ m). Thus, the information of B is fully captured by supp(B). In the following, we use supp(B) and B interchangeably, as they are identical.
Afterwards, we view the sensing relation in (1) as the following transmission process: (i) the pair (Π, B) is encoded into the codeword ΠXB; (ii) ΠXB passes through an additive Gaussian channel; and (iii) one observes Y, from which one would like to decode (Π, B). Using the terminology of coding theory, the corresponding code rate and channel capacity are (log n! + m·log C(p,k))/n and log det(I + BB⊤/σ²), respectively. By the Shannon theorem, we can expect a non-negligible decoding error whenever the code rate exceeds the channel capacity, which leads to log det(I + BB⊤/σ²) < log n + m·log C(p,k)/n, the condition in Theorem 2.

Remark 1. Note that the minimum SNR requirement can be derived from Theorem 2, as log det(I + BB⊤/σ²) is closely related to the SNR. When B is rank-one, log det(I + BB⊤/σ²) equals log(1 + m·SNR). When B is full-rank with identical singular values, log det(I + BB⊤/σ²) equals rank(B)·log(1 + SNR). Having obtained the information-theoretic lower bounds, we will propose a computationally efficient estimator that matches these lower bounds to a good extent.
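The code-rate versus capacity comparison can be checked numerically. The sketch below (plain Python; the parameter and SNR values are illustrative) evaluates the rate (log n! + m·log C(p,k))/n against the rank-one capacity log(1 + m·SNR):

```python
import math

def log_binom(p, k):
    """log C(p, k) computed via log-Gamma (stable for large p)."""
    return math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)

def code_rate(n, p, m, k):
    """(log n! + m log C(p, k)) / n: nats per row spent encoding (Pi, supp(B))."""
    return (math.lgamma(n + 1) + m * log_binom(p, k)) / n

def capacity_rank_one(m, snr):
    """log det(I + B B^T / sigma^2) = log(1 + m * SNR) for a rank-one B."""
    return math.log1p(m * snr)

n, p, m, k = 500, 2000, 20, 10
# Below capacity, recovery is hopeless; well above it, the bound is passable.
assert capacity_rank_one(m, snr=1.0) < code_rate(n, p, m, k)
assert capacity_rank_one(m, snr=200.0) > code_rate(n, p, m, k)
```

Using log-Gamma avoids overflow in n! and C(p,k) for the large n, p of interest.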

4. ESTIMATOR DESIGN

This section proposes a computationally efficient estimator for the permutation recovery. Denote by thres(·) the operator that keeps only the entry with the largest magnitude in each column and sets the rest to zero. We reconstruct Π* through the linear assignment problem (LAP) (Kuhn, 1955; Bertsekas & Castañón, 1992; Burkard et al., 2012)

Π_opt = argmax_{Π ∈ P_n} ⟨Π, Y·thres(X⊤Y)⊤·X⊤⟩.  (2)

Once the permutation matrix Π_opt is obtained, we can transform (1) back to the classical setting and iteratively recover each k-sparse column B*_{:,i}. A formal statement of the algorithm is given in Algorithm 1. Here we use the Lasso estimator to reconstruct the signal B*; it can be replaced with other estimators, say, the Dantzig selector.

Design intuition. The design of (2) shares a similar idea with the estimator in Zhang & Li (2020): we would like to approximate the direction of B* by the product X⊤Y. However, due to insufficient samples, X⊤Y is poorly aligned with B*; equivalently, approximating B* with X⊤Y incurs large errors and the correlation ⟨B*, X⊤Y⟩ is weak. To reduce the approximation errors, we apply thres(·) and set certain entries of X⊤Y to zero.

Algorithm 1: One-step estimator.
Input: observation Y and sensing matrix X.
Output: pair (Π_opt, B_opt), computed as
  Π_opt = argmax_{Π ∈ P_n} ⟨Π, Y·thres(X⊤Y)⊤·X⊤⟩,  (2)
  B_opt = argmin_B (2n)^{-1}·|||Π_opt⊤Y − XB|||_F² + λ_n·‖B‖_1,  (3)
where thres(·) applies column-wise and thresholds all entries to zero except the one with the largest magnitude, P_n denotes the set of all possible permutation matrices, ‖·‖_1 ≜ Σ_{i,j} |(·)_{i,j}| denotes the absolute sum of all entries, and λ_n > 0 is a regularization coefficient.

Note that we always keep exactly one non-zero entry per column in the operation thres(·), regardless of the sparsity number k. This operation differs from almost all previous works, ranging from Blumensath & Davies (2009); Foucart (2011) in the compressive-sensing literature to Jain et al. (2013); Yuan et al. (2014); Li et al.
(2016) in the optimization literature, since all these works suggest keeping at least O(k) non-zero elements for a k-sparse signal. More surprisingly, our numerical experiments suggest that keeping more non-zero elements is detrimental to the permutation recovery. An illustration is given in Figure 1, from which we observe that the SNR required for correct correspondence recovery increases with the number of non-zero elements kept in X⊤Y.

Remark 2. We apply the operator thres(·) to X⊤Y to better approximate the direction of B*, or equivalently, to increase the correlation ⟨thres(X⊤Y), B*⟩. Due to the insufficient sample number n, X⊤Y is poorly aligned with B*. Thus, keeping more non-zero elements in thres(·) leads to a potential decrease of ⟨thres(X⊤Y), B*⟩ and less satisfactory performance.
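A minimal sketch of the permutation step (2) follows, assuming numpy and a brute-force search over permutations in place of a proper LAP solver (feasible only for tiny n; in practice `scipy.optimize.linear_sum_assignment` would be used). All sizes are illustrative:

```python
import itertools
import numpy as np

def thres(M):
    """Keep only the largest-magnitude entry in each column; zero the rest."""
    out = np.zeros_like(M)
    cols = np.arange(M.shape[1])
    rows = np.argmax(np.abs(M), axis=0)
    out[rows, cols] = M[rows, cols]
    return out

def one_step_permutation(X, Y):
    """argmax_Pi <Pi, Y thres(X^T Y)^T X^T>, enumerating permutations.

    Returns pi with pi[i] the row of XB matched to row i of Y; <Pi, C>
    equals sum_i C[i, pi[i]] for the n-by-n profit matrix C.
    """
    n = X.shape[0]
    C = Y @ thres(X.T @ Y).T @ X.T
    best_val, best_pi = -np.inf, None
    for perm in itertools.permutations(range(n)):
        val = sum(C[i, perm[i]] for i in range(n))
        if val > best_val:
            best_val, best_pi = val, perm
    return np.array(best_pi)

# Tiny sanity run: insufficient samples (n < p), noiseless, two swapped rows.
rng = np.random.default_rng(1)
n, p, m = 6, 20, 5
X = rng.standard_normal((n, p))
B = np.zeros((p, m))
for j in range(m):
    B[rng.integers(p), j] = 3.0          # 1-sparse columns, for illustration
pi_star = np.array([1, 0, 2, 3, 4, 5])
Y = (X @ B)[pi_star]
pi_hat = one_step_permutation(X, Y)
assert sorted(pi_hat.tolist()) == list(range(n))   # a valid permutation
```

The enumeration is only to keep the sketch dependency-free; the LAP itself is solvable in polynomial time, which is what makes the estimator computationally efficient.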

4.1. MAIN RESULTS

This subsection formally presents our main results.

4.1.1. RESULTS IN RECOVERING Π

First, we study the correspondence recovery. The formal statement is given as follows.

Theorem 3. Suppose that n ≥ n_0 and p ≥ p_0, where n_0, p_0 > 0 are some positive constants. Provided that (i) n ≳ k(log n)(log²(mnp)), (ii) srank(B*) ≳ k²·log^{2(1+ε_0)} n, (iii) h ≤ c_0·n, and (iv) SNR ≥ c_1, we have {Π_opt = Π*} with probability at least 1 − c_2·n^{−c_3}, where ε_0 > 0 is an arbitrary positive constant and h ≜ d_H(I, Π*) denotes the number of permuted rows.

If we additionally assume that, for each column, the maximum entry's energy is at least a constant proportion of the total energy, i.e., inf_j max_i |B*_{i,j}| / ‖B*_{:,j}‖_2 ≥ ε_1 (1 ≤ i ≤ p, 1 ≤ j ≤ m), where ε_1 > 0 is an arbitrarily small positive constant, we can further relax the requirement on the stable rank srank(B*).

Discussion. A comparison with previous work is given in Table 1, from which we conclude that our work gives the first affirmative answer that SNR ≥ Ω(1) is sufficient to obtain the correct permutation matrix with insufficient samples, i.e., n ≪ p.

Table 1: Comparison with previous work. Here SNR_min, n_min, and h_max denote the minimum SNR required for correct permutation recovery, the minimum required sample number, and the maximum allowed number of permuted rows, respectively. Logarithmic terms are omitted in Ω(·).

                              SNR_min (≥)         n_min/p (≥)       h_max/n (≤)
                              m = 1    m ≫ 1      m = 1   m ≫ 1     m = 1        m ≫ 1
(Pananjady et al., 2018)      Ω(n^c)   --         Ω(1)    --        Ω(1)         --
(Slawski & Ben-David, 2019)   Ω(n^c)   --         Ω(1)    --        Ω(log⁻¹ n)   --
(Zhang et al., 2022)          --       Ω(1)       --      Ω(1)      --           Ω(log⁻¹ r(B*))
(Slawski et al., 2020)        --       Ω(1)       --      Ω(p)      --           Ω(log⁻¹ n)
(Zhang & Li, 2020)            Ω(n^c)   Ω(1)       Ω(1)    Ω(√p)     Ω(1)         Ω(1)
(Zhang & Li, 2021)            Ω(n^c)   --         o(1)    --        Ω(1)         --
Our estimator                 --       Ω(1)       --      o(1)      --           Ω(1)

In addition, we compare Theorem 3 with the lower bound. To begin with, we discuss the SNR requirement, which is the top priority of our analysis. From Theorem 3, the correct permutation matrix can be obtained provided the SNR is above some positive constant, while Theorem 2 requires SNR > 0. This means our SNR requirement is within a constant factor of the statistical lower bound. For the sample number n, the lower bound requires n to be at least of order Ω(k log p), while Theorem 3 requires n = Ω(k(log n)(log²(mnp))), which means the lower bound is matched up to multiplicative logarithmic terms. We conjecture that the logarithmic terms in the required sample number of Theorem 3 can be further optimized with a more delicate analysis. Moreover, our estimator allows the maximum number of permuted rows to be linearly proportional to the sample number, i.e., h_max ≍ n, which is order-optimal.

Remark 3. Compared with Zhang & Li (2020), which only requires srank(B*) to be above a certain positive constant, our estimator requires a larger stable rank srank(B*).
Although we cannot claim that srank(B*) must be lower-bounded by some non-decreasing function of log n, we have numerical evidence that srank(B*) may have to increase with the sample number n; in other words, its lower bound may not reduce to a positive constant. Fixing the parameters p, k, h, and srank(B*), we study the impact of the sample number n on the permutation recovery and report the results in Figure 2. We observe that a larger n has a negative impact on the permutation reconstruction once n exceeds a certain threshold. One possible reason is that the stable rank srank(B*) is fixed as a constant and hence violates the requirement srank(B*) ≳ log² n.

4.1.2. RESULTS IN RECOVERING B

Once the ground-truth Π* is obtained, (1) reduces to the traditional model in compressive sensing/sparse recovery (Candès et al., 2006; Candes et al., 2006; Donoho, 2006; Wainwright, 2019). One corollary is given as follows.

Corollary 1. Suppose that n ≥ n_0 and p ≥ p_0, where n_0, p_0 > 0 are some positive constants, and that (i) n ≳ k(log n)(log²(mnp)), (ii) srank(B*) ≳ k²·log^{2(1+ε)} n, (iii) h ≤ c_0·n, and (iv) SNR ≥ c_1. Setting λ_n in (3) as c_2·σ·√(log p / n), we conclude that |||B* − B_opt|||_F ≲ σ·√(mk·log p / n) holds with probability exceeding 1 − c_3·n^{−c_4} − c_5·p^{−c_6}.

Its proof is a simple combination of Theorem 3 and the previous results in Candès et al. (2006); Candes et al. (2006); Donoho (2006); Wainwright (2019). However, we observe a new phenomenon: the reconstruction error |||B_opt − B*|||_F is affected by the signal energy |||B*|||_F on top of the sensing noise σ. Moreover, this difference persists even when we replace (3) with other estimators, say, the Dantzig selector.
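The second step (3) of Algorithm 1 is a standard Lasso program. A minimal proximal-gradient (ISTA) sketch follows; the step size, iteration count, and test sizes are illustrative, and any off-the-shelf Lasso solver could be substituted:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=500):
    """Minimise (2n)^{-1} ||y - X b||_2^2 + lam * ||b||_1 via ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, ord=2) ** 2 / n   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ b - y) / n        # gradient of the smooth part
        b = soft(b - grad / L, lam / L)     # gradient step, then prox step
    return b

# Noiseless sanity check: a 3-sparse vector is recovered up to the Lasso bias.
rng = np.random.default_rng(4)
n, p = 100, 40
X = rng.standard_normal((n, p))
b_true = np.zeros(p)
b_true[[3, 17, 29]] = [2.0, -2.0, 2.0]
b_hat = lasso_ista(X, X @ b_true, lam=0.01)
assert np.linalg.norm(b_hat - b_true) < 0.5
```

In Algorithm 1 this solver would be applied column-by-column to Π_opt⊤Y, one Lasso per k-sparse column of B*.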

4.2. PROOF OUTLINE

Due to the space limit, we only give a sketch of our proof ideas and leave the technical details to the supplementary material. Denoting B̂ = (n − h)^{-1}·X⊤Y, our goal is to show that

⟨Y, Π*X·thres(B̂)⟩ > ⟨Y, ΠX·thres(B̂)⟩, ∀ Π ≠ Π*,  (4)

holds with high probability under the settings of Theorem 3. As in Zhang & Li (2020), our analysis faces the difficulties brought by (i) the combinatorial nature of the problem and (ii) high-order moments of sub-Gaussian random variables. On top of these challenges, we are subject to insufficient samples, i.e., n ≪ p. These issues are tackled by a combination of relaxation and a tailored leave-one-out analysis, which can be roughly divided into the following three stages.

Stage I. We consider a sufficient condition for (4), which reads

⟨Y_{i,:}, thres(B̂)⊤X_{π*(i),:}⟩ ≥ ⟨Y_{i,:}, thres(B̂)⊤X_{j,:}⟩, ∀ j ≠ π*(i).  (5)

Re-arranging the terms, we obtain the equivalent form

⟨B*⊤X_{π*(i),:}, thres(B̂)⊤X_{π*(i),:}⟩ ≥ ⟨B*⊤X_{π*(i),:}, thres(B̂)⊤X_{j,:}⟩ + ⟨W_{i,:}, thres(B̂)⊤(X_{j,:} − X_{π*(i),:})⟩.

Informally speaking, we first assume that thres(B̂) is almost parallel to B* and that the dependence of thres(B̂) on X_{π*(i),:}, X_{j,:} is negligible. Then we can approximate both sides of (5) by treating X_{π*(i),:} and X_{j,:} as independent N(0, I) Gaussian vectors, and we can easily see that (5) holds with high probability provided the SNR is sufficiently large. In the following two stages, we verify these two assumptions, that is, (i) the angle ∠(thres(B̂), B*) is small and (ii) the dependence between thres(B̂) and X_{π*(i),:}, X_{j,:} is negligible.

Stage II. We would like to lower-bound the inner product ⟨B*, thres(B̂)⟩. Denoting by β a column of B* and by β̂ the corresponding column of B̂, we can express ⟨B*, thres(B̂)⟩ as Σ_{β ∈ {B*_{:,ℓ}}, 1≤ℓ≤m} ⟨β, thres(β̂)⟩. From the definition of thres(·), we notice that thres(β̂) has only one non-zero entry. W.l.o.g. we assume its index is one and hence have

⟨β, thres(β̂)⟩ = β_1·β̂_1 = (β_1)² + β_1(β̂_1 − β_1) ≥ (β_1)² − max_i |β_i| · ‖β̂ − β‖_∞.
We then (i) upper-bound max_i |β_i| and ‖β̂ − β‖_∞; and (ii) lower-bound |β_1|. Part (i) is quite standard, while part (ii) lies in analyzing the event |β̂_1| ≥ max_j |β̂_j|, which follows from the definition of thres(·).

Stage III. We would like to show that the dependence between thres(B̂) and the rows X_{π*(i),:}, X_{j,:} is negligible. This is accomplished by a tailored leave-one-out analysis. For each pair of row indices (π*(i), j), we construct a perturbed matrix B̂^{(π*(i),j)} by replacing the rows X_{π*(i),:}, X_{j,:} with their i.i.d. substitutes. One can easily verify that B̂^{(π*(i),j)} is independent of the rows X_{π*(i),:}, X_{j,:}, as the latter are not involved in B̂^{(π*(i),j)}. Meanwhile, B̂^{(π*(i),j)} exhibits behavior similar to B̂, as they share almost identical components. This is the basic idea of the leave-one-out technique (Chen et al., 2020; Sur et al., 2019; El Karoui, 2013; 2018; Cai et al., 2021). Compared with these works, our construction method has the following characteristics.

• The number of replaced rows in our method varies with the pair of row indices, whereas the replacement number is fixed in the above-mentioned works. For a better explanation, we refer the reader to our constructed leave-one-out samples B̂^{(·)} in the appendix.

• We modify the operator thres(·) when approximating thres(B̂). While previous works usually keep the operator thres(·) intact, we approximate it with the operator (·)_imax, which keeps the positions of the non-zero elements of thres(B̂). In other words, the positions of the non-zero elements kept in the leave-one-out samples B̂^{(·)} are determined by thres(B̂) rather than thres(B̂^{(·)}). In our analysis, this step is essential in controlling the approximation errors. Otherwise, the approximation error can be considerably large, since thres(B̂) may not share the same support set as thres(B̂^{(·)}), let alone have small ℓ_2 differences. The explanation above is a simplified version of our proof technique.
The technical details, which are put in the supplementary material, can differ from the above but follow the same spirit. Moreover, we discuss our algorithm's computational complexity: in the first step, permutation recovery, our estimator requires only one matrix multiplication and one thresholding operation on top of the operations of the oracle estimator; in the second step, sparse signal recovery, our estimator needs one additional matrix multiplication compared with the permutation-free setting.
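The leave-one-out idea of Stage III can be illustrated numerically: replacing one row of X changes the surrogate B̂ = (n − h)⁻¹X⊤Y only by a rank-one correction of order 1/n, so the copy is independent of the left-out row yet statistically close to B̂. A numpy sketch with illustrative sizes, identity permutation, and no noise:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 100, 300, 10                      # insufficient samples: n < p
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, m)) / np.sqrt(p)
Y = X @ B                                   # noiseless, Pi* = I, so h = 0

Bhat = X.T @ Y / n                          # surrogate (n - h)^{-1} X^T Y

# Leave-one-out copy: swap row i of X for a fresh i.i.d. sample. The copy
# is independent of X[i, :] yet differs from Bhat by one rank-one term.
i = 7
x_new = rng.standard_normal(p)
Bhat_loo = Bhat + (np.outer(x_new, x_new @ B) - np.outer(X[i], X[i] @ B)) / n

gap = np.linalg.norm(Bhat - Bhat_loo) / np.linalg.norm(Bhat)
assert gap < 0.5                            # the two surrogates stay close
```

The paper's actual construction replaces between 2 and 4 rows per index pair and re-uses the support of thres(B̂); the snippet above only conveys the single-row version of the decoupling argument.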

5. NUMERICAL RESULTS

This section presents numerical experiments to verify our main theorem, Theorem 3: we would like to confirm that the correct permutation can be obtained, i.e., {Π_opt = Π*}, with n ≪ p and SNR ≥ c. We only present the numerical results on synthetic data here and defer those on real-world data to the supplementary material.

Experiment setting with Gaussian distribution. We let X_{ij} i.i.d. ~ N(0, 1), pick the sample number n from {100, 150}, and set h = n/4. We vary the signal length p over {500, 600}, the sparsity number k within {10, 15, 20}, and the stable rank srank(B*) within {150, 200, 250}. The corresponding simulation results (with X_{ij} ~ N(0, 1), plotted with respect to the SNR) can be found in Figure 3.

Discussion w.r.t. n/p. First, we confirm our theory that the correct permutation can be obtained with insufficient samples, i.e., n ≪ p. In addition, we notice that the permutation recovery becomes easier with a larger n/p: the first row of Figure 3 has n/p = 0.2 while the second row has n/p = 0.25, and we can verify that the SNR needed for correct permutation recovery is smaller in the second row than in the first. However, we should stress that this conclusion may not hold when srank(B*) is not sufficiently large; see Figure 2 for details.

Discussion w.r.t. sparsity number k. We vary the sparsity number k within {10, 15, 20} and conclude that a larger sparsity number k makes the permutation recovery more difficult. For example, consider the case (n, p, srank(B*)) = (100, 500, 200). When k = 10, correct permutation requires SNR ≥ 1.4; when k = 15, it needs SNR ≥ 2.2; and when k = 20, it requires SNR ≥ 4. The same conclusion holds for the other cases as well.

Experiment setting with sub-Gaussian distribution. In addition to the Gaussian setting, we also evaluate our estimator's performance when X_{ij} is sub-Gaussian.
Here, we pick X_{ij} as i.i.d. Rademacher random variables, i.e., P(X_{ij} = 1) = P(X_{ij} = −1) = 1/2. The corresponding results (plotted with respect to the SNR) are put in Figure 4, from which we observe a similar pattern to that of the Gaussian setting.
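The sub-Gaussian behavior can be spot-checked directly: with a Rademacher design, the column-wise argmax of X⊤Y still lands on supp(B*), which is all the operator thres(·) requires, mirroring the Gaussian case. A numpy sketch with illustrative sizes, Π* = I, and noiseless observations:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m, k = 200, 500, 20, 5
X = rng.choice([-1.0, 1.0], size=(n, p))    # Rademacher design, P(+-1) = 1/2
B = np.zeros((p, m))
for j in range(m):                          # k-sparse columns, distinct supports
    B[rng.choice(p, size=k, replace=False), j] = 1.0
Y = X @ B                                   # noiseless, Pi* = I for simplicity

Bhat = X.T @ Y / n                          # surrogate used by the estimator
# Column-wise, the largest-magnitude entry of Bhat should land on supp(B):
# that is all thres(.) needs, matching the Gaussian-design behaviour.
hits = sum(int(np.argmax(np.abs(Bhat[:, j])) in set(np.flatnonzero(B[:, j])))
           for j in range(m))
assert hits >= m // 2
```

Here X⊤X/n concentrates around the identity at rate 1/√n for any sub-Gaussian design, which is why the argmax test succeeds well inside the n ≪ p regime.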

6. CONCLUSION

In this paper, we studied unlabeled sparse recovery with multiple measurements (i.e., m > 1) in the insufficient-samples regime for the first time. To begin with, we investigated the lower bounds on the sample number n and the SNR. Furthermore, we proposed a simple yet effective estimator, which restores the permutation matrix via a linear assignment problem. We proved that our estimator obtains the correct correspondence information when the SNR is above a certain positive constant and the required sample number n depends linearly on the sparsity number k. In addition, we discovered multiple phenomena that are seldom encountered before: (i) keeping more non-zero elements in thres(·) deteriorates the permutation recovery; and (ii) increasing the sample number n plays a dual role in reconstructing the permutation. In the course of analyzing our estimator's performance and explaining the above phenomena, we developed a tailored version of the leave-one-out technique, which involves an adaptive number of replaced elements and a simultaneous modification of the thresholding operator. Moreover, we provided numerical experiments to corroborate our claims and showed that our estimator can reliably reconstruct the permutation matrix even when the entries X_{ij} are sub-Gaussian random variables.

A PROOF OF THEOREM 2

Proof. The proof technique is a combination of those in Zhang et al. (2022) and Zhang & Li (2021). First, we put a uniform distribution as the prior on Π, i.e., P(Π = Π_samp) = |P_n|^{-1}, where Π_samp is an arbitrary fixed permutation matrix and P_n denotes the set of all possible permutation matrices. In addition, we introduce distributions for the support set of B: for an arbitrary column B_{:,ℓ}, we assume its support set to be uniformly distributed among the C(p,k) possible patterns. One can easily verify the relation

sup_{Π,B} P_X,W[(Π, supp(B)) ≠ (Π̂, supp(B̂))] ≥ P_X,W,Π,B[(Π, supp(B)) ≠ (Π̂, supp(B̂))].  (6)
Since inf_{Π̂,B̂} can be safely added to the left-hand side of (6), our goal becomes lower-bounding P_X,W,Π,B[(Π, supp(B)) ≠ (Π̂, supp(B̂))]. Adopting the proof technique of Theorem 2.10.1 in Cover & Thomas (2012), we consider the entropy H(Π, supp(B)), which can be computed as

H(Π, supp(B)) =(a) H(Π) + H(supp(B)) =(b) log n! + m·log C(p,k),

where in (a) we use the independence between Π and supp(B), and in (b) we use the facts |P_n| = n! and that supp(B) takes C(p,k)^m equally likely values. Meanwhile, we have the relation

H(Π, supp(B)) =(c) H(Π, supp(B) | X) =(d) H(Π, supp(B) | X, Π̂, supp(B̂)) (=: ζ_1) + I(Π, supp(B); Π̂, supp(B̂) | X) (=: ζ_2),

where (c) is due to the independence between X and (Π, B), and (d) follows from the definitions of conditional entropy and mutual information. The proof is completed by separately bounding ζ_1 and ζ_2.

Analysis of ζ_1. We upper-bound ζ_1 with Fano's inequality (Cover & Thomas, 2012, Theorem 2.10.1), which proceeds as

ζ_1 ≤ H(Π, supp(B) | Π̂, supp(B̂)) ≤ 1 + log(|(Π, supp(B))|)·P_X,W,Π,B[(Π, supp(B)) ≠ (Π̂, supp(B̂))].  (7)

Analysis of ζ_2. Due to the Markov property of (Π, supp(B)) → Y → (Π̂, supp(B̂)), we invoke the data-processing inequality (Cover & Thomas, 2012, Thm. 2.8.1) and conclude ζ_2 ≤ I(Π, supp(B); Y | X). Invoking the definition of conditional mutual information, we have

I(Π, supp(B); Y | X) = E_X,W,Π[h(Y | X = x) − h(Y | Π, supp(B), X = x)] ≤(e) (1/2)·log det(E_X,W,Π[YY⊤]) − (mn/2)·log σ²,  (8)

where in (e) we use the property (Cover & Thomas, 2012, Theorem 8.6.5) h(Z) ≤ (1/2)·log det Cov(Z) ≤ (1/2)·log det E[ZZ⊤] for a random vector Z with finite covariance matrix Cov(Z), together with the entropy formula for a Gaussian random vector. Following the same procedure as in (Zhang et al., 2022, Lemma 11), we have log det(E_X,W,Π[YY⊤]) = nm·log σ² + n·log det(I + BB⊤/σ²), which further yields

I(Π, supp(B); Y | X) ≤ (n/2)·log det(I + BB⊤/σ²).  (9)

Under review as a conference paper at ICLR 2023

Summary.
Combining (7), (8), and (9) then leads to a lower bound on P_X,W,Π,B[(Π, supp(B)) ≠ (Π̂, supp(B̂))] reading as

P_X,W,Π,B[(Π, supp(B)) ≠ (Π̂, supp(B̂))] ≥ [log n! + m·log C(p,k) − 1 − (n/2)·log det(I + BB⊤/σ²)] / log(|(Π, supp(B))|).

One can easily verify that P_X,W,Π,B[(Π, supp(B)) ≠ (Π̂, supp(B̂))] is lower-bounded by 1/2 given the assumptions of Theorem 2, which completes the proof.

B PROOF OF THEOREM 3

We define B̂ as (n−h)⁻¹X^⊤Y and define the operator imax(i) ≜ argmax_j |B̂_{j,i}| (1 ≤ i ≤ m) for each column of B̂, which returns the index of the entry with the largest magnitude. With a slight abuse of notation, we denote B̂_imax ≜ thres(B̂). The benefits of this notation will be seen shortly. In addition, we define the error event E_err as

E_err ≜ { ∃ Π ≠ Π*, s.t. ⟨Y, ΠX·B̂_imax⟩ ≥ ⟨Y, Π*X·B̂_imax⟩ }.   (10)

According to the sensing relation Y = Π*XB* + W, we rewrite (10) as ⟨Π*XB* + W, Π*X·B̂_imax⟩ ≤ ⟨Π*XB* + W, ΠX·B̂_imax⟩, and we would like to show that it holds with probability near zero. The major technical difficulty comes from the fact that B̂ is correlated with the sensing matrix X, which introduces high-order moments. Our solution to this challenge is broadly divided into the following two parts.

Part I: Relaxation of the error event. We first relax the error event E_err to

E_err ⊆ { ⟨Y_{i,:}, B̂^⊤_imax X_{π*(i),:}⟩ ≤ ⟨Y_{i,:}, B̂^⊤_imax X_{j,:}⟩, ∃ j ≠ π*(i) } ≜ E_err-relax,   (11)

which implies P(E_err) ≤ P(E_err-relax).

Part II: Decoupling the dependence via a modified leave-one-out technique. To decouple the dependence between B̂ and the rows X_{π*(i),:} and X_{j,:}, we modify the leave-one-out technique and construct a perturbed matrix B̂^{(π*(i),j)}, which shares almost identical statistical behavior with B̂. Before delving into the technical details, we first provide a glimpse of the construction idea. Recalling the definition of B̂, which can be written as

B̂ = (n−h)⁻¹ Σ_{ℓ=1}^n X_{ℓ,:}^⊤ X_{π*(ℓ),:} B* + (n−h)⁻¹ X^⊤W,

we construct the perturbed matrix B̂^{(π*(i),j)} by replacing the corresponding rows with their i.i.d. copies. The detailed construction is stated as follows. To begin with, we draw an i.i.d. copy X̃_{i,:} of each row X_{i,:} (1 ≤ i ≤ n). Similarly, we draw copies W̃_{j,:} for each row of W (1 ≤ j ≤ n).
For arbitrary indices π*(i) and j such that j ≠ π*(i), we construct B̂^{(π*(i),j)} as (assuming i ≠ π*(i) and j ≠ π*(j))

B̂^{(π*(i),j)} = (n−h)⁻¹ Σ_{ℓ: ℓ, π*(ℓ) ∉ {π*(i), j}} X_{ℓ,:}^⊤ X_{π*(ℓ),:} B* + (n−h)⁻¹ ( X̃_{i,:}^⊤ X̃_{π*(i),:} + X̃_{π*(i),:}^⊤ X̃_{π*(π*(i)),:} + X̃_{j,:}^⊤ X̃_{π*(j),:} + X̃_{π*⁻¹(j),:}^⊤ X̃_{j,:} ) B* + (n−h)⁻¹ ( Σ_{ℓ ∉ {π*(i), i}} X_{ℓ,:}^⊤ W_{ℓ,:} + X̃_{π*(i),:}^⊤ W̃_{π*(i),:} + X̃_{i,:}^⊤ W̃_{i,:} ).   (12)

Provided that i = π*(i), the terms X̃_{i,:}^⊤X̃_{π*(i),:} + X̃_{π*(i),:}^⊤X̃_{π*(π*(i)),:} and X̃_{π*(i),:}^⊤W̃_{π*(i),:} + X̃_{i,:}^⊤W̃_{i,:} in the above construction simplify to X̃_{i,:}^⊤X̃_{i,:} and X̃_{i,:}^⊤W̃_{i,:}, respectively. Similarly, X̃_{j,:}^⊤X̃_{π*(j),:} + X̃_{π*⁻¹(j),:}^⊤X̃_{j,:} simplifies to X̃_{j,:}^⊤X̃_{j,:} when j = π*(j). With the above construction, one can easily verify that B̂^{(π*(i),j)} is independent of the rows X_{π*(i),:}, X_{j,:} and W_{i,:}, as none of them is involved in B̂^{(π*(i),j)}. Before delving into the technical details, we first collect all required notations.
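As a concrete companion to the definitions at the start of this proof, the one-step estimator itself (form B̂ = (n−h)⁻¹X^⊤Y, keep the largest-magnitude entry of each column, then maximize ⟨Y, ΠX·B̂_imax⟩ over permutations) can be sketched as follows. This is our illustrative implementation, not the authors' code; it assumes SciPy is available and solves the maximization over Π as a linear assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_step_permutation_estimate(X, Y, h):
    """Sketch of the one-step estimator: B_hat = (n-h)^{-1} X^T Y,
    hard-threshold each column to its largest-magnitude entry, then
    recover the permutation by linear assignment."""
    n, _ = X.shape
    m = Y.shape[1]
    B_hat = X.T @ Y / (n - h)
    # thres(.): keep only the largest-magnitude entry of each column
    B_thres = np.zeros_like(B_hat)
    rows = np.abs(B_hat).argmax(axis=0)
    cols = np.arange(m)
    B_thres[rows, cols] = B_hat[rows, cols]
    # pi_hat = argmax_pi sum_i <Y_i,:, (X B_thres)_{pi(i),:}>
    scores = Y @ (X @ B_thres).T   # scores[i, j] = <Y_i,:, (X B_thres)_j,:>
    _, pi_hat = linear_sum_assignment(scores, maximize=True)
    return pi_hat, B_thres
```

With strong signals, many measurement vectors, and mild shuffling, the assignment step typically restores the correct correspondence exactly in the noiseless regime.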

B.1 NOTATIONS

Define the following events:

E₁ ≜ { |(n−h)⁻¹ Σ_{ℓ: ℓ=π*(ℓ)} X²_{ℓ,i} − 1| ≤ c₀√(log(np)/(n−h)), ∀ 1 ≤ i ≤ p };
E₂(β) ≜ { |(n−h)⁻¹ Σ_{ℓ: ℓ=π*(ℓ)} X_{ℓ,i}·⟨X_{ℓ,∖i}, β_∖i⟩| ≲ √(log(mnp)/(n−h))·‖β_∖i‖₂, ∀ 1 ≤ i ≤ p };
E₃(β) ≜ { |(n−h)⁻¹ Σ_{ℓ: ℓ≠π*(ℓ)} X_{ℓ,i}·⟨X_{π*(ℓ),:}, β⟩| ≲ (√(h·log(mnp))/(n−h))·‖β‖₂, ∀ 1 ≤ i ≤ p };
E₄ ≜ { |(n−h)⁻¹ Σ_{ℓ=1}^n X_{ℓ,i}W_{ℓ,j}| ≲ σ√(n·log(mnp))/(n−h), ∀ 1 ≤ i ≤ p, 1 ≤ j ≤ m };
E₅ ≜ { ‖(B̂_imax − B̂^{(π*(i),j)}_imax)^⊤x‖₂ ≲ (log^{3/2}(np)/(n−h))·‖B*‖_F + σ(log np)√(m·log(mn))/(n−h), ∀ 1 ≤ π*(i) ≠ j ≤ n },

where β ∈ R^p denotes an arbitrary column B*_{:,ℓ} (1 ≤ ℓ ≤ m), β_∖i ∈ R^p denotes its copy with the i-th entry set to zero, and x ∈ R^p denotes an arbitrary row of the matrix X, which follows the Gaussian distribution N(0, I_{p×p}). Note that x is not necessarily independent of B̂ and B̂^{(π*(i),j)}. For an arbitrary event E, we denote its complement by Ē. In addition, we define the matrix M^{(π*(i),j)} ≜ B*(B̂^{(π*(i),j)}_imax)^⊤; for notational simplicity, we drop the superscript (π*(i), j) in M^{(π*(i),j)} when there is no ambiguity. The following context provides the technical details; a diagram representing the dependence among all lemmas is given in Figure 5.

Proof. The proof can be broadly divided into three stages.

Stage I. To begin with, we prove the relation E_err ⊆ E_err-relax, whose definition can be found in (11). Conditional on Ē_err-relax, we have ⟨Y_{i,:}, B̂^⊤_imax X_{π*(i),:}⟩ > ⟨Y_{i,:}, B̂^⊤_imax X_{j,:}⟩ for all j ≠ π*(i). Thus we conclude ⟨Y, Π*X·B̂_imax⟩ > ⟨Y, ΠX·B̂_imax⟩ for all Π ≠ Π*, which shows that Ē_err-relax automatically implies Ē_err, in other words, Ē_err-relax ⊆ Ē_err. Hence, we can upper bound P(E_err) by P(E_err-relax).

Stage II. Using the sensing relation Y_{i,:} = B*^⊤X_{π*(i),:} + W_{i,:}, we can recast the relation ⟨Y_{i,:}, B̂^⊤_imax X_{π*(i),:}⟩ ≤ ⟨Y_{i,:}, B̂^⊤_imax X_{j,:}⟩ as

η₁^{(π*(i),j)} ≜ ⟨B*^⊤X_{π*(i),:}, (B̂^{(π*(i),j)}_imax)^⊤X_{π*(i),:}⟩
≤ ⟨B*^⊤X_{π*(i),:}, (B̂^{(π*(i),j)}_imax)^⊤X_{j,:}⟩ [≜ η₂^{(π*(i),j)}]
+ ⟨B*^⊤X_{π*(i),:}, (B̂_imax − B̂^{(π*(i),j)}_imax)^⊤(X_{j,:} − X_{π*(i),:})⟩ [≜ η₃^{(π*(i),j)}]
+ ⟨W_{i,:}, (B̂^{(π*(i),j)}_imax)^⊤(X_{j,:} − X_{π*(i),:})⟩ [≜ η₄^{(π*(i),j)}]
+ ⟨W_{i,:}, (B̂_imax − B̂^{(π*(i),j)}_imax)^⊤(X_{j,:} − X_{π*(i),:})⟩ [≜ η₅^{(π*(i),j)}].
We emphasize that the subscript imax is solely determined by B̂ rather than its perturbed partner B̂^{(π*(i),j)}. This ensures that B̂_imax and B̂^{(π*(i),j)}_imax share the same support set. In the following analysis, we will see that this property plays an important role in bounding the difference between B̂_imax and B̂^{(π*(i),j)}_imax, which enters η₃^{(π*(i),j)} and η₅^{(π*(i),j)}. The following context separately studies each term η_ℓ^{(π*(i),j)} (1 ≤ ℓ ≤ 5). First, we define the quantities ∆_ℓ^{(π*(i),j)} (1 ≤ ℓ ≤ 5) as

∆₁^{(π*(i),j)} ≜ ‖B*‖²_F/k − mσ²(log mnp)²/n − (√m·σ·log(mnp)/√n)·‖B*‖_F − c₀√(log n/srank(B*))·‖B*‖_F·‖B̂^{(π*(i),j)}_imax‖_F;
∆₂^{(π*(i),j)} ≜ √(log n/srank(B*))·‖B*‖_F·‖B̂^{(π*(i),j)}_imax‖_F;
∆₃^{(π*(i),j)} ≜ ‖B*‖²_F·√(log n)·(log np)^{3/2}/n + ‖B*‖_F·σ·√(m·log n)·(log mn)^{1/2}(log np)/n;
∆₄^{(π*(i),j)} ≜ σ(log n)·‖B̂^{(π*(i),j)}_imax‖_F;
∆₅^{(π*(i),j)} ≜ (√m·(log n)·log^{3/2}(np)/(n−h))·‖B*‖_F·σ + mσ²(log np)·√((log n)(log mn))/(n−h).

In addition, we define the events F_ℓ (1 ≤ ℓ ≤ 6) as

F₁ ≜ { η₁^{(π*(i),j)} ≳ ∆₁^{(π*(i),j)}, ∀ 1 ≤ π*(i) ≠ j ≤ n };   (13)
F₂ ≜ { |η₂^{(π*(i),j)}| ≲ ∆₂^{(π*(i),j)}, ∀ 1 ≤ π*(i) ≠ j ≤ n };   (14)
F₃ ≜ { |η₃^{(π*(i),j)}| ≲ ∆₃^{(π*(i),j)}, ∀ 1 ≤ π*(i) ≠ j ≤ n };   (15)
F₄ ≜ { |η₄^{(π*(i),j)}| ≲ ∆₄^{(π*(i),j)}, ∀ 1 ≤ π*(i) ≠ j ≤ n };   (16)
F₅ ≜ { |η₅^{(π*(i),j)}| ≲ ∆₅^{(π*(i),j)}, ∀ 1 ≤ π*(i) ≠ j ≤ n };   (17)
F₆ ≜ { ‖B̂^{(π*(i),j)}_imax − B̂_imax‖_F ≲ (log(mnp)/√n)·‖B*‖_F + σ·log(mnp)·√m/√n, ∀ 1 ≤ π*(i) ≠ j ≤ n }.   (18)

In Lemma 1, Lemma 2, Lemma 3, Lemma 4, Lemma 5, and Lemma 12, we show that all the above events F_ℓ (1 ≤ ℓ ≤ 6) hold with probability approaching one.

Stage III. Given the assumptions in Theorem 3, we verify the relation ∩_{ℓ=1}^6 F_ℓ ⊆ Ē_err-relax. Considering the difference ∆₁^{(π*(i),j)} − Σ_{ℓ=2}^5 ∆_ℓ^{(π*(i),j)}, we can lower bound it as

(∆₁^{(π*(i),j)} − Σ_{ℓ=2}^5 ∆_ℓ^{(π*(i),j)})/(mσ²) ≳ ζ₁·SNR − ζ_{1/2}·√SNR − ζ₀,

where

ζ₁ ≜ 1/k − √(log n)·log^{3/2}(np)/n − 2√(log n/srank(B*));
ζ_{1/2} ≜ √(log n)(log np)·(√(log mn) + √(log np))/n + log(mnp)/√n + (log n)/√m + 2(log n)(log mnp)/√(n·srank(B*));
ζ₀ ≜ (log n)(log mnp)/√(mn) + (log²(mnp) + (log n)(log mn)(log np))/n.
Under the assumptions in Theorem 3, we have ζ₁ ≍ k⁻¹, ζ_{1/2} ≲ log(mnp)/√n + (log n)/√m, and ζ₀ ≲ (log n)(log mnp)/√(mn) + log²(mnp)/n, which leads to

(∆₁^{(π*(i),j)} − Σ_{ℓ=2}^5 ∆_ℓ^{(π*(i),j)})/(mσ²) ≳ ζ₁·SNR − ζ_{1/2}·√SNR − ζ₀ ≳ 1/k − 1/(k√(log n)) − 1/(k·log^ε n) − 1/(k²·log^{ε+1/2} n) − 1/(k²·log n) > 0,

where we use the relation m ≥ rank(B*) ≥ srank(B*) ≳ k²·log^{2(1+ε)} n. Then we conclude

Σ_{ℓ=2}^5 |η_ℓ^{(π*(i),j)}| ≤ Σ_{ℓ=2}^5 ∆_ℓ^{(π*(i),j)} ≤ ∆₁^{(π*(i),j)} ≤ η₁^{(π*(i),j)}, ∀ 1 ≤ π*(i) ≠ j ≤ n,

which means Ē_err-relax automatically holds on the intersection ∩_{ℓ=1}^6 F_ℓ.

Lemma 1. Conditional on E₁ ∩ (∩_{ℓ=1}^m E₂(B*_{:,ℓ})) ∩ (∩_{ℓ=1}^m E₃(B*_{:,ℓ})) ∩ E₄, we have P(F₁) ≥ 1 − c₀n^{−c₁} provided n ≳ k·log²(mnp).

Proof. Recalling the definition of M, i.e., M ≜ B*(B̂^{(π*(i),j)}_imax)^⊤, we divide the proof into two steps. For the clarity of presentation, we defer the proof of Step II to Lemma 13 and focus on Step I. Due to the construction of B̂^{(π*(i),j)} in (12), we conclude that B̂^{(π*(i),j)} is independent of the row X_{π*(i),:}. Hence we can first condition on M and rewrite the term η₁ as a quadratic form x^⊤Mx, where x ∈ R^p is a random vector with x ~ N(0, I_{p×p}). With the Hanson-Wright inequality (cf. Theorem 6.2.1 in Vershynin (2018)), x^⊤Mx concentrates around Tr(M) at scale √(log n)·|||M|||_F; the final bound then follows from the definition of the stable rank.

Lemma 2. We have P(F₂) ≥ 1 − 4n^{−c}, where the event F₂ is defined in (14).

Proof. First we fix the indices π*(i) and j such that π*(i) ≠ j. Due to the independence across the rows of X, we can condition on B̂^{(π*(i),j)}_imax and treat η₂ as a Gaussian bilinear form, using the rotation invariance of the Gaussian random vector. Regarding the event F̄₂, we invoke the union bound over the index pairs (π*(i), j) and complete the proof.

(Continuation of the proof of Lemma 8.) Since X has i.i.d. Gaussian entries, each entry of X_S β_∖i is Gaussian with zero mean and variance ‖β_∖i‖²₂. Hence ‖X_S β_∖i‖²₂/‖β_∖i‖²₂ is a χ² random variable with n − h degrees of freedom, which leads to

ζ₁' ≤ p·exp((n−h)/2·(log 2 − 1)) ≤ p·e^{−0.15(n−h)},

where the first inequality is due to Lemma 15. For ζ₂', we notice that X_{ℓ,i} is independent of the inner product ⟨X_{ℓ,:}, β_∖i⟩, since the i-th entry of β_∖i is zero.
Hence, conditionally on X_S β_∖i, we can view the product Σ_{ℓ: ℓ=π*(ℓ)} X_{ℓ,i}·⟨X_{ℓ,:}, β_∖i⟩ as a Gaussian random variable N(0, ‖X_S β_∖i‖²₂), which leads to

ζ₂' ≤ p·P( |Σ_{ℓ: ℓ=π*(ℓ)} X_{ℓ,i}·⟨X_{ℓ,:}, β_∖i⟩| ≳ √(log(mnp))·‖X_S β_∖i‖₂ ) ≤ 2n^{−c}m^{−c}p^{−c},

where we use the tail bound for Gaussian random variables.

Lemma 9. For a fixed β ∈ R^p, we have P(E₃(β)) ≥ 1 − c₀n^{−c₁}m^{−c₂}p^{−c₃}.

Proof. According to Lemma 16, we can decompose the index set {ℓ : ℓ ≠ π*(ℓ)} into three disjoint sets I_j such that (i) the indices ℓ and π*(ℓ) do not fall into the same set I_j; and (ii) the cardinality of each set satisfies h_j ≜ |I_j| ≥ ⌊h/3⌋ (1 ≤ j ≤ 3). Then we decompose the product Σ_{ℓ: ℓ≠π*(ℓ)} X_{ℓ,i}·⟨X_{π*(ℓ),:}, β⟩ accordingly. Due to the properties of the sets I_j, the factors X_{ℓ,i} and ⟨X_{π*(ℓ),:}, β⟩ are independent within each I_j, and hence Lemma 15 together with the Gaussian tail bound applies. Plugging the result into (19) then completes the proof.
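The decomposition of Lemma 16 invoked in the proof above can be realized by walking the cycles of the permutation and greedily 3-coloring them. The sketch below is our illustration, not the authors' construction; the greedy balancing rule is an assumption of ours, while the defining properties (disjointness, and separation of ℓ and π(ℓ)) match the lemma.

```python
def three_set_decomposition(pi):
    """Split {i : pi[i] != i} into 3 disjoint sets so that no set
    contains both i and pi[i] (cf. Lemma 16)."""
    n = len(pi)
    moved = [i for i in range(n) if pi[i] != i]
    color = {}
    sets = [[], [], []]
    seen = set()
    for start in moved:
        if start in seen:
            continue
        # walk the cycle containing `start`
        cycle = [start]
        seen.add(start)
        j = pi[start]
        while j != start:
            cycle.append(j)
            seen.add(j)
            j = pi[j]
        for idx, i in enumerate(cycle):
            # forbid the color of the predecessor, and of the cycle head
            # for the last element (since pi maps it back to the head)
            forbidden = {color[cycle[idx - 1]]} if idx > 0 else set()
            if idx == len(cycle) - 1 and len(cycle) > 1:
                forbidden.add(color[cycle[0]])
            # among allowed colors, pick the currently smallest set
            c = min((c for c in range(3) if c not in forbidden),
                    key=lambda c: len(sets[c]))
            color[i] = c
            sets[c].append(i)
    return sets
```

On typical inputs the greedy balancing keeps all three sets close to h/3 in size, comfortably above the h/5 guarantee.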



This corollary places one requirement on the SNR, which is affected by both ‖B*‖_F and σ². In addition, the error ‖B̂_opt − B*‖_F may be directly linked to ‖B*‖_F, since a wrong correspondence can be obtained under low SNR. This result is consistent with Theorem 2, as B_{n,p,m,k} covers a broader class of matrices B* when k increases (e.g., from 15 to 20), which implies that a larger SNR is required for correct permutation recovery.

Iterating over all columns of B*, we can show that E₅ holds provided ‖x‖_∞ ≲ √(log np), |⟨x, β⟩| ≲ √(log np)·‖β‖₂, and ‖w‖_∞ ≲ σ√(log nm); in other words, ϑ is zero.



Figure 1: Keeping more non-zero elements in X^⊤Y deteriorates permutation recovery. n = 100, p = 500, h = 25, k = 5, srank(B*) = 100. "#non-zero elements" refers to the number of non-zero elements kept in each column of X^⊤Y.

Figure 2: Illustration of the dual role of the sample number n. We consider the noiseless case (infinite SNR) and set p = 200 (signal length), k = 5 (sparsity level), and h = 25 (number of permuted rows).

-hand side as a sum of z₁^⊤B*B*^⊤z₂ and √2·z̄·‖B*‖_F up to some normalization constant, where z₁, z₂ ~ i.i.d. N(0, I_{p×p}) and z̄ ~ i.i.d.

Figure 3: Simulated recovery rate P( Π = Π ) with n = {100, 150}, p ∈ {500, 600}, h ∈ {25, 37}, and X ij i.i.d

Figure 4: Simulated recovery rate P( Π = Π ) with n = {100, 150}, p ∈ {500, 600}, h ∈ {25, 37}, and X ij i.i.d

Figure 5: Dependence diagram of lemmas.

Step I. Conditional on M, the lower bound η₁^{(π*(i),j)} ≳ Tr(M) − c₀√(log n)·|||M|||_F holds for all 1 ≤ π*(i) ≠ j ≤ n with probability exceeding 1 − n^{−c}. • Step II. Provided n ≳ k·log²(mnp), we work conditionally on E₁ ∩ (∩_{ℓ=1}^m E₂(B*_{:,ℓ})) ∩ (∩_{ℓ=1}^m E₃(B*_{:,ℓ})) ∩ E₄.

P( x^⊤Mx − Tr(M) ≤ −c₀√(log n)·|||M|||_F, ∃ 1 ≤ π*(i) ≠ j ≤ n ) ≤ n²·P( |x^⊤Mx − Tr(M)| ≥ c₀√(log n)·|||M|||_F ) ≤ n²·2exp(−c·c₀·log n); in other words, x^⊤Mx ≥ Tr(M) − c₀√(log n)·|||M|||_F holds for all 1 ≤ π*(i) ≠ j ≤ n with probability at least 1 − 2n^{−c}. Then we complete the proof by showing |||M|||_F ≤ |||B*|||_OP·‖B̂^{(π*(i),j)}_imax‖_F.

P( |x^⊤My| ≳ (log n)·|||M|||_F ) ≤ P( |x^⊤My| ≳ (log n)·|||M|||_F, ‖My‖₂ ≲ √(log n)·|||M|||_F ) + P( ‖My‖₂ ≳ √(log n)·|||M|||_F ) ≤ P( |x^⊤My| ≳ √(log n)·‖My‖₂ ) + 2n^{−c} ≤ 4n^{−c}, where M is defined as B*(B̂^{(π*(i),j)}_imax)^⊤.

P( |Σ_{ℓ∈I_j} X_{ℓ,i}·⟨X_{π*(ℓ),:}, β⟩| ≳ √(h_j·log(mnp))·‖β‖₂ ) ≤ P( ‖X_{I_j,:}β‖²₂ ≳ h_j·log(mnp)·‖β‖²₂ ) + P( |Σ_{ℓ∈I_j} X_{ℓ,i}·⟨X_{π*(ℓ),:}, β⟩| ≳ √(log mnp)·‖X_{I_j,:}β‖₂ ) ≲ n^{−c₀}m^{−c₀}p^{−c₀} + n^{−c₁}m^{−c₁}p^{−c₁} ≲ n^{−c}m^{−c}p^{−c}.

We study the lower bounds w.r.t. the sample number n and the signal-to-noise ratio (SNR) for the correct reconstruction of both the permutation matrix Π* and the signal B*. Assuming each column

Comparison with previous works. All results are presented in their best orders, which only hold true in certain regimes.

Hang Zhang and Ping Li. Sparse recovery with shuffled labels: Statistical limits and practical estimators. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 1760–1765, Melbourne, Australia, 2021.

Hang Zhang, Martin Slawski, and Ping Li. The benefits of diversity: Permutation recovery in unlabeled sensing from multiple measurement vectors. IEEE Trans. Inf. Theory.


Lemma 3. Conditional on E₅, we conclude P(F₃ | E₅) ≥ 1 − 2n^{−c}, where F₃ is defined in (15).

Proof. First we recall the definition of η₃^{(π*(i),j)}, which is written as ⟨B*^⊤X_{π*(i),:}, (B̂_imax − B̂^{(π*(i),j)}_imax)^⊤(X_{j,:} − X_{π*(i),:})⟩. We decompose the probability into two terms. For the first term, we invoke the Hanson-Wright inequality (cf. Theorem 6.2.1 in Vershynin (2018)). For the second term, we prove it to be zero: this is because, conditional on E₅, we have ‖B*^⊤X_{π*(i),:}‖₂ ≲ √(log n)·‖B*‖_F for all 1 ≤ π*(i) ≠ j ≤ n.

Lemma 4. We have P(F₄) ≥ 1 − 4n^{−c}.

Proof. First we fix the indices π*(i) and j such that π*(i) ≠ j. Invoking the definition of B̂^{(π*(i),j)}_imax, the terms W_{i,:}, X_{j,:}, and X_{π*(i),:} are independent of each other. Hence we can complete the proof with the tail bound for Gaussian random variables, followed by the union bound over the index pairs.

Lemma 5. Conditional on E₅, we conclude P(F₅ | E₅) ≥ 1 − cn^{−c'}.

Proof. The proof takes a similar form to that of Lemma 3. First we decompose the probability P(F̄₅ | E₅) into two terms, where the first is handled by the tail bound for the Gaussian random variable. We complete the proof by showing the second probability is zero.

Proof (of Lemma 6). We complete the proof by combining the fact that β̂^{(π*(i),j)}_imax − β̂_imax has only one non-zero element with Lemma 12.
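Several steps above (Lemmas 1, 3, and 7) invoke the Hanson-Wright inequality to control quadratic forms x^⊤Mx. As a quick Monte Carlo sanity check, the following sketch (our illustration, with a generic matrix M; the constants are not those of the proofs) verifies that x^⊤Mx concentrates around Tr(M) at the Frobenius-norm scale:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 50
M = rng.standard_normal((p, p)) / np.sqrt(p)   # a generic fixed matrix
Ms = (M + M.T) / 2                             # only the symmetric part enters x^T M x
fro = np.linalg.norm(Ms, "fro")

samples = rng.standard_normal((20000, p))
quad = np.einsum("ij,jk,ik->i", samples, M, samples)   # x^T M x per sample
dev = quad - np.trace(M)

# concentration around Tr(M): deviations live at the scale ||M||_F
assert abs(dev.mean()) < 0.1 * fro
assert (np.abs(dev) > 6 * fro).mean() < 1e-3
```

Empirically, the standard deviation of the deviations is close to √2·‖M_s‖_F, matching the sub-Gaussian regime of the Hanson-Wright bound.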

B.3 SUPPORTING LEMMAS

Lemma 7. We have P(E₁) ≥ 1 − 2n^{−c₁}p^{−c₂} when n − h ≳ log(np), and n, p are sufficiently large.

Proof. The claim follows from the Hanson-Wright inequality (cf. Theorem 6.2.1 in Vershynin (2018)), together with the fact n − h ≳ log(np).

Lemma 8. For a fixed β ∈ R^p, we have P(E₂(β)) ≥ 1 − c₀n^{−c₁}m^{−c₂}p^{−c₃}.

Proof. To begin with, we construct the matrix X_S by concatenating all rows X_{ℓ,:} such that ℓ = π*(ℓ). With the union bound, we can upper bound P(Ē₂(β)) and separately control the two resulting terms ζ₁' and ζ₂'.

Lemma 10. We have P(E₄) ≥ 1 − c₀n^{−c₁}m^{−c₂}p^{−c₃}.

Proof. We complete the proof with the union bound, Lemma 15, and the tail bound for Gaussian random variables.

Lemma 11. Conditional on the intersection of events E₁ ∩ E₂(β*) ∩ E₃(β*) ∩ E₄, a lower bound on |β̂_imax| holds, where imax and i*max are defined as the indices of β̂ and β* with the largest magnitude, respectively.

Proof. To begin with, we define ζ₁, ζ₂, ζ₃ as the three components of β̂_i, respectively. Then we can write β̂_i as their sum, where i*max is defined as the index of β* with the largest magnitude, i.e., i*max ≜ argmax_i |β*_i|. With the triangle inequality, the following context separately discusses each term: first |ζ₁|, and then the remaining terms conditional on E₂(β*) ∩ E₃(β*) ∩ E₄. Combining (20), (21), and (22) then yields the lower bound for |β̂_imax|, which concludes the proof as n ≫ h.

Lemma 12. Conditional on the intersection of the events above, an upper bound on ‖β̂^{(π*(i),j)} − β̂‖_∞ holds.

Proof. For an arbitrary index i, we consider the difference β̂_i − β̂^{(π*(i),j)}_i and bound each of its constituent terms,


respectively, and complete the proof using ‖β*‖₂ and h ≤ n.

Lemma 13. Conditional on the intersection of the events in Appendix B.1, the lower bound required in Step II of the proof of Lemma 1 holds.

Proof. First, we pick an arbitrary column β̂ of B̂. W.l.o.g. we assume that β̂_imax ≥ 0, and let β̂^{(π*(i),j)} denote the corresponding column in B̂^{(π*(i),j)}. Then we obtain a lower bound on the product β̂_imax·β̂^{(π*(i),j)}_imax, and we can show (23) holds as well when β̂_imax < 0. Recalling the definition of i*max, i.e., i*max ≜ argmax_i |β*_i|: for |β̂_imax| we can invoke the lower bound of Lemma 11, while for ‖β̂^{(π*(i),j)} − β̂‖_∞ we cannot directly use Lemma 12 since, strictly speaking, it concerns X with rows X_{π*(i),:}, X_{j,:} rather than X̃_{π*(i),:}, X̃_{j,:}. However, since they follow the same distributions, we can follow the same procedure and obtain the same bound. We then use the fact that x ↦ x² − 2ax is monotonically increasing on the region [a, ∞), where the equality is achieved at the boundary. Having obtained the lower bound for a single column of B̂, we complete the proof by reorganizing the terms over all columns.

Lemma 14. We have P(E₅) ≥ 1 − c₀n^{−c₁}.

Proof. We begin the proof with the union bound, where w_(·) and w̃_(·) denote the corresponding entries of W and W̃. Then we prove that ϑ is zero. Due to the fact that B̂^{(π*(i),j)}_imax shares the same support set as B̂_imax, the claimed bound follows once we condition on the relation ‖x‖_∞ ≲ √(log np). This step again relies heavily on the fact that B̂^{(π*(i),j)}_imax shares the same support set as B̂_imax; otherwise, the best available bound involves ‖B̂^{(π*(i),j)}_imax‖_∞. Afterwards, we obtain the claim by conditioning on the remaining events.

C EXPERIMENTS ON REAL DATA

In this experiment, each column B*_{:,i} represents an arbitrary image of digit i (0 ≤ i ≤ 9) in the training (resp. test) set. Then, we create a Gaussian sensing matrix X and a noise matrix W. Afterwards, we permute the sensing results and apply our algorithm in Algorithm 1 to reconstruct the images.
As the benchmark, we ignore the missing correspondence and directly estimate B* with the Lasso estimator. In other words, we let Π_opt in (3) be the identity matrix and use the resulting estimate to reconstruct the images. An illustration of the reconstructed images is put in Figure 6.
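The naive benchmark can be sketched with a plain iterative soft-thresholding (ISTA) Lasso solver. This is our minimal implementation, not the authors' code; the regularization strength `lam` and the iteration count are illustrative choices.

```python
import numpy as np

def lasso_ista(X, Y, lam=0.01, n_iter=1000):
    """Multi-response Lasso via ISTA:
    argmin_B 0.5 * ||Y - X B||_F^2 + lam * ||B||_1,
    i.e., the benchmark that fixes the permutation to the identity."""
    _, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        G = X.T @ (X @ B - Y)              # gradient of the quadratic part
        B = B - G / L
        # soft-thresholding (proximal step for the l1 penalty)
        B = np.sign(B) * np.maximum(np.abs(B) - lam / L, 0.0)
    return B
```

When the correspondence is actually intact (identity permutation), this benchmark recovers a sparse B* accurately in the noiseless regime; under shuffling, it degrades, which is the point of the comparison.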

D USEFUL FACTS

This section collects some useful facts about probability for the sake of self-containedness.

Lemma 15 (Dasgupta & Gupta, 2003). For a χ²-random variable Z with ℓ degrees of freedom, we have P(Z ≤ t) ≤ exp( (ℓ/2)·(log(t/ℓ) − t/ℓ + 1) ) for t < ℓ.

Lemma 16 (Pananjady et al., 2018). Suppose the permutation matrix Π has Hamming distance h from the identity matrix I, namely, d_H(I, Π) = h. We can decompose the index set {i : i ≠ π(i)} into 3 disjoint sets I_i (1 ≤ i ≤ 3) such that i and π(i) never fall into the same set, and the cardinality of each set satisfies |I_i| ≥ ⌊h/3⌋ ≥ h/5.
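The lower-tail bound of Lemma 15 is easy to check numerically. The following Monte Carlo sketch is our illustration; the degrees of freedom play the role of n − h, as in the proof of Lemma 8.

```python
import math
import numpy as np

def chisq_lower_tail_bound(t, ell):
    # Lemma 15: P(Z <= t) <= exp((ell/2)(log(t/ell) - t/ell + 1)), for t < ell
    return math.exp(0.5 * ell * (math.log(t / ell) - t / ell + 1.0))

rng = np.random.default_rng(0)
dof = 40                                     # plays the role of n - h
samples = rng.chisquare(dof, size=200_000)

for t in (10.0, 20.0, 30.0):
    empirical = (samples <= t).mean()
    assert empirical <= chisq_lower_tail_bound(t, dof)
```

The bound decays exponentially in the degrees of freedom, which is what delivers the p·e^{−c(n−h)} terms in the supporting lemmas.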

