IDENTIFYING PHASE TRANSITION THRESHOLDS OF PERMUTED LINEAR REGRESSION VIA MESSAGE PASSING

Abstract

This paper considers the permuted linear regression, i.e., $Y = \Pi^{\natural} X B^{\natural} + \sigma W$, where $Y \in \mathbb{R}^{n \times m}$, $\Pi^{\natural} \in \mathbb{R}^{n \times n}$, $X \in \mathbb{R}^{n \times p}$, $B^{\natural} \in \mathbb{R}^{p \times m}$, and $W \in \mathbb{R}^{n \times m}$ represent the observations, the missing (or incomplete) information about ordering, the sensing matrix, the signal of interest, and the additive sensing noise, respectively. As shown in previous work, there exist phase transition phenomena in terms of the signal-to-noise ratio (snr), the number of permuted rows, etc. While all existing works concern only the convergence rates without specifying the associated constants in front of them, we give a precise identification of the phase transition thresholds via the message passing algorithm. Depending on whether the signal $B^{\natural}$ is known or not, we separately identify the corresponding critical points around the phase transition regimes. Moreover, we provide numerical experiments and show that the empirical phase transition points are well aligned with the theoretical predictions.

1. INTRODUCTION

This paper considers the permuted linear regression $Y = \Pi^{\natural} X B^{\natural} + \sigma W$, where $Y \in \mathbb{R}^{n \times m}$ denotes the sensing result, $\Pi^{\natural} \in \mathbb{R}^{n \times n}$ represents the permutation matrix, $X \in \mathbb{R}^{n \times p}$ is the sensing matrix, $B^{\natural}$ is the signal of interest, $W$ denotes the additive noise, and $\sigma^2$ is the noise variance. Research on this problem dates back at least to the 1970s under the name "broken sample problem" (Goel, 1975; Bai & Hsing, 2005; DeGroot et al., 1971; DeGroot & Goel, 1976; 1980). In recent years, the problem has seen a revival due to its broad spectrum of applications in privacy protection, data integration, etc. (Pananjady et al., 2018; Unnikrishnan et al., 2015; Slawski et al., 2020; Slawski & Ben-David, 2019; Pananjady et al., 2017; Zhang et al., 2022; Zhang & Li, 2020).

Associated with this problem is a phase transition phenomenon: the error rate of permutation recovery suddenly drops to zero once certain parameters reach their thresholds. Although previous works such as Slawski et al. (2020); Slawski & Ben-David (2019); Pananjady et al. (2017); Zhang et al. (2022); Zhang & Li (2020) can all explain this phenomenon, they characterize only the statistical order of the phase transition thresholds and never their precise positions. In this work, we leverage the message passing (MP) algorithm to identify their precise locations. As a byproduct, we also obtain an algorithm that partially recovers the permutation matrix.

Related work. The first line of research is the literature on permuted linear regression. Among the works mentioned above, the most closely related are Slawski et al. (2020); Slawski & Ben-David (2019); Pananjady et al. (2017); Zhang et al. (2022); Zhang & Li (2020), which use almost the same settings as ours. Pananjady et al. (2018); Slawski & Ben-David (2019) consider the single observation model ($m = 1$) and prove that the snr required for correct permutation recovery is $O_P(n^{c})$, where $c > 0$ is some constant. Later, Slawski et al. (2020); Zhang et al. (2022); Zhang & Li (2020) investigate the multiple observations model ($m > 1$) and suggest that the snr requirement can be significantly decreased, namely from $O_P(n^{c})$ to $O_P(n^{c/m})$. In particular, Zhang & Li (2020) develop an estimator, which we analyze as part of our contributions. Although they obtain the correct convergence rate for restoring the correspondence, which is minimax-optimal in certain regimes, their results do not specify the leading coefficients or, equivalently, the precise location of the phase transition threshold. Moreover, their analysis does not consider the intertwined influence among the parameters $n$, $p$, $m$, etc. One example is the impact of the $p/n$ ratio on the maximum allowed number of permuted rows, which has not been studied before this work.

Another line of research comes from the field of statistical physics and begins with Mézard & Parisi (1986; 1985). Using the replica method, they study the linear assignment problem (LAP), i.e., $\min_{\Pi} \sum_{i,j} \Pi_{ij} E_{ij}$, where $\Pi$ denotes a permutation matrix and the $E_{ij}$ are i.i.d. random variables uniformly distributed on $[0, 1]$. Martin et al. (2005) then generalize the LAP to multi-index matching and present an investigation based on the MP algorithm. Caracciolo et al. (2017); Malatesta et al. (2019) extend the distribution of $E_{ij}$ to a broader class. However, none of the above works exhibits a phase transition. In Chertkov et al. (2010), this method is extended to the particle tracking problem, where a phase transition phenomenon is first observed. Later, Semerjian et al. (2020) modify it to fit the graph matching problem, which paves the way for our work on permuted linear regression.

Our technical contributions are summarized as follows.

• We propose the first framework that can identify the precise location of the phase transition thresholds associated with permuted linear regression. In the oracle case where $B^{\natural}$ is known, our scheme is able to determine the phase transition snr. In the non-oracle case where $B^{\natural}$ is not given, our scheme can further predict the maximum allowed number of permuted rows and uncover its dependence on the ratio $n/p$.

• We generalize the full permutation estimator and, for the first time, obtain a partial permutation estimator. Consider the example where the correspondence of a single index is desired. By removing all function nodes except the one corresponding to that index, we exploit the MP algorithm and design an algorithm that converges in one step. Moreover, we show that its performance almost matches that of the estimator for full permutation recovery.

In addition, we briefly mention the technical challenges. Compared with previous works (Mezard & Montanari, 2009; Talagrand, 2010; Linusson & Wästlund, 2004; Mézard & Parisi, 1987; 1986; Parisi & Ratiéville, 2002; Semerjian et al., 2020), where the edge weights are relatively simple, our edge weights usually involve high-order interactions across Gaussian random variables and are densely correlated. To tackle this issue, our proposed approximation method for computing the phase transition thresholds consists of three parts: (i) performing a Taylor expansion; (ii) modifying the leave-one-out technique; and (iii) applying a size correction scheme. A detailed explanation can be found in Section 5. We hope it will be of independent technical interest to researchers in the machine learning community.

Notations. We write $a \xrightarrow{\text{a.s.}} b$ to indicate that $a$ converges almost surely to $b$. We write $f(n) \simeq g(n)$ when $\lim_{n \to \infty} f(n)/g(n) = 1$. We write $f(n) = O_P(g(n))$ if the sequence $f(n)/g(n)$ is bounded in probability, and $f(n) = o_P(g(n))$ if the sequence $f(n)/g(n)$ converges to zero in probability. The inner product between two vectors (resp. matrices) is denoted as $\langle \cdot, \cdot \rangle$. In addition, for two distributions $d_1$ and $d_2$, we write $d_1 \cong d_2$ if they are equal up to some normalization. Moreover, we define $\mathcal{P}_n$ as the set of all possible permutation matrices, i.e., $\mathcal{P}_n \triangleq \{\Pi \in \{0, 1\}^{n \times n} : \sum_i \Pi_{ij} = 1, \sum_j \Pi_{ij} = 1\}$, and associate each permutation matrix $\Pi \in \mathcal{P}_n$ with a mapping $\pi$ of $\{1, 2, \ldots, n\}$, where $\pi(i)$ denotes the correspondence of index $i$ permuted by $\Pi$, $1 \le i \le n$. The signal-to-noise ratio (snr) is written as $\mathsf{snr} = |||B^{\natural}|||_F^2 / (m \cdot \sigma^2)$, where $|||\cdot|||_F$ is the Frobenius norm and $\sigma^2$ denotes the variance of the sensing noise.

2. PROBLEM SETTING

In this paper, we consider linear regression with permuted labels, which reads $Y = \Pi^{\natural} X B^{\natural} + \sigma W$, where $Y \in \mathbb{R}^{n \times m}$ represents the sensing result, $\Pi^{\natural} \in \mathcal{P}_n$ denotes a permutation matrix awaiting reconstruction, $X \in \mathbb{R}^{n \times p}$ is the sensing matrix with each entry $X_{ij}$ following the i.i.d. standard
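As an illustration only (it is not part of the original analysis), the following Python sketch makes three objects from Section 1 concrete: membership in the set of permutation matrices, the snr of a signal matrix, and a tiny instance of the linear assignment problem. All function names are hypothetical. The LAP is solved by plain enumeration, which is feasible only for very small n and is a stand-in for, not an implementation of, the message passing algorithm studied in this paper. The convention that entry (i, π(i)) of the permutation matrix equals 1 is likewise an assumption made for the example.

```python
import itertools
import random


def permutation_to_matrix(pi):
    """Build the n x n 0/1 matrix with a 1 at (i, pi[i]) (assumed convention)."""
    n = len(pi)
    return [[1 if pi[i] == j else 0 for j in range(n)] for i in range(n)]


def is_permutation_matrix(P):
    """Check membership in P_n: square, 0/1 entries, each row and column sums to 1."""
    n = len(P)
    if any(len(row) != n for row in P):
        return False
    if any(entry not in (0, 1) for row in P for entry in row):
        return False
    rows_ok = all(sum(row) == 1 for row in P)
    cols_ok = all(sum(P[i][j] for i in range(n)) == 1 for j in range(n))
    return rows_ok and cols_ok


def snr(B, sigma):
    """Squared Frobenius norm of the p x m matrix B divided by m * sigma^2."""
    m = len(B[0])
    frob_sq = sum(b * b for row in B for b in row)
    return frob_sq / (m * sigma ** 2)


def solve_lap_brute_force(E):
    """Minimise sum_i E[i][pi(i)] over all permutations pi by enumeration."""
    n = len(E)

    def cost(pi):
        return sum(E[i][pi[i]] for i in range(n))

    best = min(itertools.permutations(range(n)), key=cost)
    return list(best), cost(best)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
    n = 5
    # Edge weights drawn i.i.d. uniformly from [0, 1], as in the LAP described above.
    E = [[rng.random() for _ in range(n)] for _ in range(n)]
    pi, value = solve_lap_brute_force(E)
    print("optimal assignment:", pi, "cost:", value)
    print("valid permutation matrix:", is_permutation_matrix(permutation_to_matrix(pi)))
```

Enumeration visits all n! permutations, so the sketch is useful only for checking the definitions on toy inputs.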

