A NEW FRAMEWORK FOR TENSOR PCA BASED ON TRACE INVARIANTS

Abstract

We consider the Principal Component Analysis (PCA) problem for tensors T ∈ (R^n)^⊗k of large dimension n and of arbitrary order k ≥ 3. It consists in recovering a spike v_0^⊗k (built from a signal vector v_0 ∈ R^n) corrupted by a Gaussian noise tensor Z ∈ (R^n)^⊗k, so that T = β v_0^⊗k + Z, where β is the signal-to-noise ratio. In this paper, we propose a new framework based on tools developed by the theoretical physics community to address this important problem. These tools are trace invariants of tensors, built by judicious contractions (extensions of the matrix product) of the indices of the tensor T. Inspired by them, we introduce a new process that builds, for each invariant, a matrix whose top eigenvector is correlated with the signal for β sufficiently large. We then give examples of classes of invariants for which we demonstrate that this correlation occurs above the best algorithmic threshold (β ≥ n^{k/4}) known so far. This method has many algorithmic advantages: (i) it provides a detection algorithm that is linear in time and requires only O(1) memory; (ii) the algorithms are very suitable for parallel architectures and offer much room for optimization, given the simplicity of the mathematical tools involved; (iii) experimental results show an improvement over the state of the art for symmetric tensor PCA. We provide experimental results for these different cases that match well with our theoretical findings.

1. INTRODUCTION

Powerful computers and acquisition devices have made it possible to capture and store real-world multidimensional data. In practical applications (Kolda & Bader (2009)), analyzing and organizing these high-dimensional arrays (formally called tensors) leads to the well-known curse of dimensionality (Gao et al. (2017); Suzuki (2019)). Thus, dimensionality reduction is frequently employed to transform a high-dimensional data set by projecting it into a lower-dimensional space while retaining most of the information and underlying structure. One of these techniques is Principal Component Analysis (PCA), which has made remarkable progress in a large number of areas thanks to its simplicity and adaptability (Jolliffe & Cadima (2016); Seddik et al. (2019)). In tensor PCA, as introduced by Richard & Montanari (2014), we consider a model where we attempt to detect and retrieve an unknown unit vector v_0 from noise-corrupted multilinear measurements arranged in a tensor T. Using the notations found below, our model is T = β v_0^⊗k + Z, with Z a pure Gaussian noise tensor of order k and dimension n with independent identically distributed (i.i.d.) standard Gaussian entries, Z_{i_1,i_2,...,i_k} ∼ N(0, 1), and β the signal-to-noise ratio. Many methods have been proposed to solve this important problem. However, practical applications require optimizable and parallelizable algorithms that avoid the high computational cost caused by the unsatisfactory scalability of some of these methods; a summary of the time and space requirements of several existing methods can be found in Anandkumar et al. (2017). One way to obtain such parallelizable algorithms is through methods based on tensor contractions (Kim et al. (2018)), which are extensions of the matrix product. In recent years, tools based on tensor contractions have been developed by theoretical physicists, for whom random tensors have emerged as a generalization of random matrices.
In this paper, we investigate the algorithmic threshold of tensor PCA and some of its variants using the theoretical physics approach, and we show that it leads to new insights and knowledge in tensor PCA. Tensor PCA and tensor decomposition (the recovery of multiple spikes) are motivated by the increasing number of problems in which it is crucial to exploit the tensorial structure, such as quantum computing (Hastings (2020)) and statistical queries (Dudeja & Hsu (2020)). Recently, a fundamentally different set of mathematical tools, developed for tensors in the context of high energy physics, has been used to approach the problem. These tools are trace invariants of degree d ∈ N, obtained by contracting pairs of indices of d copies of the tensor T. They were used in Evnin (2020) to study the largest eigenvalue of a real symmetric Gaussian tensor. Subsequently, Gurau (2020) provided a theoretical study of a function based on an infinite sum of these invariants. Their results suggest a phase transition for the largest eigenvalue of a tensor at β around n^{1/2}, in a similar way to the BBP transition in the matrix case (Baik et al. (2005)). Thus, this function allows the detection of a spike. However, evaluating it involves computing an integral over an n-dimensional space, which may not be possible in polynomial time. The contribution of this paper is to use these invariant tools to build tractable algorithms with polynomial complexity. In contrast to Gurau (2020), instead of using a sum of an infinite number of invariants, we select one trace invariant with convenient properties to build our algorithms. This lets us detect the presence of the signal in linear time with an O(1) space requirement. Moreover, in order to recover the signal vector beyond simply detecting it, we introduce new tools in the form of matrices associated to this specific invariant.
Within this framework, we show, as particular cases, that the two simplest graphs (of degree two) correspond to the tensor unfolding and homotopy algorithms (the latter being equivalent to averaged gradient descent). These two algorithms are the main practical ones known from the point of view of space and time requirements (Anandkumar et al. (2017) provides a comparison table).

Notations. We use bold characters T, M, v for tensors, matrices and vectors, and T_{ijk}, M_{ij}, v_i for their components. [p] denotes the set {1, ..., p}. A real tensor is of order k if it is a member of the tensor product of k spaces R^{n_i}, i ∈ [k]: T ∈ ⊗_{i=1}^k R^{n_i}. It is symmetric if T_{i_1...i_k} = T_{τ(i_1)...τ(i_k)} for all τ ∈ S_k, where S_k is the symmetric group (more details are provided in the Appendix). For a vector v ∈ R^n, we use v^⊗p ≡ v ⊗ v ⊗ ··· ⊗ v ∈ ⊗^p R^n to denote its p-th tensor power. ⟨v, w⟩ denotes the scalar product of v and w. We define the operator norm, which for a tensor of any order plays the role of the largest eigenvalue: ‖X‖_op ≡ max { X_{i_1,...,i_k} (w_1)_{i_1} ... (w_k)_{i_k} : ‖w_i‖ ≤ 1 for all i ∈ [k] }. The trace of A is denoted Tr(A). We denote the expectation of a variable X by E(X) and its standard deviation by σ(X). We say that a function f is negligible compared to a positive function g, and we write f = o(g), if lim_{n→∞} f/g = 0.

Einstein summation convention. It is important to keep in mind that throughout the paper we follow the Einstein summation convention: when an index variable appears twice in a single term and is not otherwise defined, it implies summation of that term over all values of the index. For example: T_{ijk} T_{ijk} ≡ Σ_{i,j,k} T_{ijk} T_{ijk}. This common convention helps make tensor equations more readable.

2.1. WHAT DO WE USE TO STUDY THE SIGNAL?

An important concept in problems involving matrices is spectral theory, the study of the eigenvalues and eigenvectors of a matrix. It is of fundamental importance in many areas. In machine learning, matrix PCA computes the eigenvectors and eigenvalues of the covariance matrix of the features to perform a dimensionality reduction while ensuring that most of the key information is maintained; in this case, the eigenvalues are a very efficient tool to describe data variability. In signal processing, an eigenvalue can carry information about the intensity of the signal, while the eigenvector points in its direction. Lastly, a more theoretical example comes from quantum physics, where the spectrum of an operator is used to calculate the energy levels and their associated states. In all of these examples, an important property of the eigenvalues of an n-dimensional matrix M is their invariance under orthogonal transformations {M → O M O^{-1}, O ∈ O(n)}, where O(n) is the n-dimensional orthogonal group (i.e. the group of real matrices satisfying O O^T = I_n, which should not be confused with the computational complexity O(n)). Since these transformations essentially just rotate the basis defining the coordinate system, they must not affect intrinsic information like data variability, signal intensity, or the energy of a system. The eigenvalues capture some of this inherent information, but recovering the complete information requires computing the respective eigenvectors (for example, to find the principal component, the direction of the signal, or the physical state). There are more such invariants than the eigenvalues. Another important set worth mentioning are the traces of the first n matrix powers Tr(A), Tr(A^2), ..., Tr(A^n). Obtaining them uses slightly different methods than eigenvalues, but they contain the same information, since each set can be inferred from the other through basic algebraic operations.
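To make the invariance concrete, here is a minimal NumPy sketch (an illustration added for this write-up, not part of the original experiments): it conjugates a symmetric matrix by a random orthogonal matrix and checks that both the eigenvalues and the power traces Tr(A), ..., Tr(A^n) are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# A random symmetric matrix and a random orthogonal matrix O (via QR).
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
O, _ = np.linalg.qr(rng.standard_normal((n, n)))

B = O @ A @ O.T  # the orthogonal change of basis M -> O M O^{-1}

# Both invariant families survive the transformation.
eig_A = np.sort(np.linalg.eigvalsh(A))
eig_B = np.sort(np.linalg.eigvalsh(B))
traces_A = [np.trace(np.linalg.matrix_power(A, p)) for p in range(1, n + 1)]
traces_B = [np.trace(np.linalg.matrix_power(B, p)) for p in range(1, n + 1)]
```

Each set of invariants determines the other: the traces are the power sums of the eigenvalues, related through Newton's identities.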
On the basis of the matrix case, we expect that for a tensor T ∈ ⊗_{i=1}^k R^{n_i}, quantities that are invariant under orthogonal transformations (T_{a_1...a_k} → O^{(1)}_{a_1 b_1} ... O^{(k)}_{a_k b_k} T_{b_1...b_k} with O^{(i)} ∈ O(n_i) for all i ∈ [k]) should capture similar intrinsic information, like the intensity of the signal, and, conceivably, that there should be other objects related to these quantities that can indicate the direction of the signal. However, the concepts of eigenvalue and eigenvector are ill defined in the tensor case and impractical, given that the number of eigenvalues grows exponentially with the dimension n (Qi (2005); Cartwright & Sturmfels (2013)) and computing them is very complicated. In contrast, there is a very convenient generalization of the traces of matrix powers to tensors, which we call trace invariants. They have been extensively studied in recent years in the context of high energy physics, and many important properties have been proven (Gurau (2017)). We first give a more formal definition. Let T be a tensor with entries T_{i_1,...,i_k}. Define a contraction of a pair of indices as setting them equal to each other and summing over them, as in the trace of a matrix (A → Σ_i A_{ii}). The trace invariants of the tensor T correspond to the different ways to contract pairs of indices in a product of an even number of copies of T. The degree of a trace invariant is the number of copies of T contracted. For example, Σ_{i_1,i_2,i_3} T_{i_1 i_2 i_3} T_{i_1 i_2 i_3} and Σ_{i_1,i_2,i_3} T_{i_1 i_2 i_2} T_{i_1 i_3 i_3} are trace invariants of degree 2. In the remainder of this paper, we will use the Einstein summation convention defined in the notation subsection. A trace invariant of degree d of a tensor T of order k admits a practical graphical representation as an edge-colored graph G, obtained in two steps: first, we draw d vertices representing the d different copies of T.
The indices of each copy are represented by k half-edges, with a different color for each index position, as shown in Figure 1a. Then, when two indices are contracted in the tensor invariant, we connect their corresponding half-edges in G. Reciprocally, to obtain the tensor invariant associated to a graph G with d vertices, we take d copies of T (one for each vertex), we associate a color with each index position, and we contract the indices of the d copies of T following the coloring of the edges connecting the vertices. We denote this invariant I_G(T). Two important examples of trace invariants are the melon diagram (Figure 1b) and the tadpole (Figure 1c). Avohou et al. (2020) provides a thorough study of the number of trace invariants of a given degree d. A very useful asset of these invariants is that we can compute their expectation for tensors with Gaussian components using simple combinatorial analysis (Gurau (2017)). As previously mentioned in Section 2.1, an invariant should be able to detect a signal. But if our goal is to recover the signal, we need mathematical objects that can provide a vector. To this effect, we introduce in this paper a new set of tools in the form of matrices. We denote by M_{G,e} the matrix obtained by cutting an edge e of a graph G into two half-edges (see Figure 2 for an example). Indeed, this cut amounts to not summing over the two indices i_1 and i_2 associated to these two half-edges, and using them to index the matrix instead. We will drop the indices G, e of the matrix when the context is clear. For the melon graph with invariant I_G(T) = T_{ijk} T_{ijk}, cutting the edge e yields M_{G,e} ≡ (T_{i_1 jk} T_{i_2 jk})_{i_1, i_2 ∈ [n]}.
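Concretely, the degree-2 invariants and the edge-cutting construction take only a few lines of NumPy (a sketch added for illustration; the graph names follow Figure 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
T = rng.standard_normal((n, n, n))  # a generic order-3 tensor

# Two trace invariants of degree 2 (einsum makes the contractions explicit):
melon = np.einsum('ijk,ijk->', T, T)    # T_{ijk} T_{ijk}
tadpole = np.einsum('ijj,ikk->', T, T)  # T_{ijj} T_{ikk}

# Cutting one edge of the melon graph leaves the two indices i1, i2 free,
# turning the scalar invariant I_G(T) into the n x n matrix M_{G,e}.
M = np.einsum('ajk,bjk->ab', T, T)
```

Re-contracting the cut edge, i.e. taking Tr(M), recovers the melon invariant.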

2.3. PHASE TRANSITION WITHIN THIS FRAMEWORK

We can decompose the tensor from which we hope to extract the signal (represented graphically in Figure 3) as

T_{i j_1 ... j_{k-1}} = β v_i v_{j_1} ... v_{j_{k-1}} + Z_{i j_1 ... j_{k-1}}.   (1)

Substituting this decomposition into a trace invariant splits it into a pure noise contribution and a signal contribution:

I_G(T) = I_G^(N)(T) + I_G^(S)(T).   (2)

An identical decomposition can be carried out for the matrix. Let's consider a tensor T, a graph G and its associated trace invariant I_G(T), and let's denote I_{G\e}(T) the invariant associated to the subgraph obtained by removing from G the edge e and its two vertices. Expanding T in M_{G,e}, we can distinguish three kinds of contributions (see Figure 4): a pure noise matrix M^(N), with entries built from terms of the form Z_{i_1...} Z_{i_2...}; a contribution of the signal to the detection M^(D), with entries built from terms of the form v_{i_1} Z_{i_2...} + v_{i_2} Z_{i_1...}; and a contribution of the signal to the recovery M^(R), with entries built from terms of the form v_{i_1} v_{i_2}, so that M_{i_1 i_2} = M^(N)_{i_1 i_2} + M^(D)_{i_1 i_2} + M^(R)_{i_1 i_2}.

Figure 4: Decomposition of a matrix graph and the melon example.

Lemma 1. E(M^(N)) = (E(I_G^(N))/n) I_n.

Using Lemma 1, we identify three possible phases, depending on which matrix operator norm is much larger than the others:

• No detection and no recovery: if ‖M^(N) − E(M^(N))‖_op ≫ ‖M^(D)‖_op, ‖M^(R)‖_op, then neither detection nor recovery is possible: we cannot distinguish whether there is a signal. It is for example the phase for β → 0.

• Detection but no recovery: if ‖M^(D)‖_op ≫ ‖M^(N) − E(M^(N))‖_op, ‖M^(R)‖_op, we can detect the presence of the signal (thanks to the largest eigenvalue) but we cannot recover the signal vector, since the leading eigenvector is not correlated with it.

• Detection and recovery: if ‖M^(R)‖_op ≫ ‖M^(N) − E(M^(N))‖_op, ‖M^(D)‖_op, we can recover the signal vector. It is for example the phase for β → ∞.
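This phase picture can be probed numerically. The NumPy sketch below (with illustrative parameter values, not the paper's experimental setup) builds the three contributions of the melon matrix for an order-3 spiked tensor and checks that, for a large signal-to-noise ratio, the recovery part dominates in operator norm and the top eigenvector aligns with the signal:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 30, 200.0  # illustrative values; large beta puts us in the recovery phase

v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
S = beta * np.einsum('i,j,k->ijk', v0, v0, v0)  # the spike beta * v0^{tensor 3}
Z = rng.standard_normal((n, n, n))               # pure Gaussian noise

def melon_matrix(A, B):
    # M_{i1 i2} = A_{i1 jk} B_{i2 jk}: the melon graph with one cut edge
    return np.einsum('ajk,bjk->ab', A, B)

M_N = melon_matrix(Z, Z)                       # pure noise part
M_D = melon_matrix(S, Z) + melon_matrix(Z, S)  # signal-noise (detection) part
M_R = melon_matrix(S, S)                       # signal-signal (recovery) part

op = lambda X: np.linalg.norm(X, 2)   # operator (spectral) norm
E_MN = n**2 * np.eye(n)               # E(M^(N)) for the melon: (n^3 / n) I_n

# Top eigenvector of the full matrix M = M_N + M_D + M_R
eigvals, eigvecs = np.linalg.eigh(M_N + M_D + M_R)
corr = abs(eigvecs[:, -1] @ v0)
```

Here M^(R) = β² v0 v0^T exactly, so its operator norm is β², while the centered noise part fluctuates only at a polynomially smaller scale.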

2.4. ALGORITHMIC THRESHOLD FOR A GENERAL GRAPH

We can now state the main algorithms of this paper. It is important to keep in mind that the following claims concern the large-n limit. Empirically, the large-n approximation seems valid for n > 25. The first algorithm gives a criterion for distinguishing a pure noise tensor from a tensor with a spike. Denote E(I_G(B)) the expectation and σ(I_G(B)) the standard deviation of the trace invariant associated to a graph G, where B is a tensor with i.i.d. standard Gaussian components. The algorithm simply computes the trace invariant of the tensor and compares its distance from E(I_G(B)) with σ(I_G(B)). It is straightforward to see that calculating a trace invariant (which is a scalar) like T_{ijk} T_{ijk} needs only O(1) memory.

Theorem 2. Let G be a graph of degree d. There exists β_det > 0 such that Algorithm 1 detects the presence of a signal for β ≥ β_det.

The second algorithm recovers the spike in a tensor T through the construction of the n × n matrix M_{G,e}(T) associated to a given graph G and edge e.

Algorithm 2: Recovery algorithm associated to the graph G and edge e
Input: the tensor T = β v_0^⊗k + Z
Goal: estimate v_0
Result: an estimated vector v

Theorem 3. Let G be a graph of degree d. There exists β_rec > 0 such that Algorithm 2 gives an estimator v strongly correlated with v_0 (⟨v, v_0⟩ > 0.9) for β ≥ β_rec.

Since Algorithms 1 and 2 consist of algebraic operations on the tensor entries, they are very suitable for parallel architectures. Theorem 4 gives a lower bound on the threshold above which we can detect and recover a spike using a single graph. Interestingly, this threshold, which appears naturally in our framework, matches the threshold below which no known algorithm recovers the spike in polynomial time. We call the Gaussian variance of a graph G the variance of the invariant I_G(B), where the components of B are i.i.d. standard Gaussian.

Theorem 4. Let k ≥ 3. It is impossible to detect or recover the signal using a single graph for β ≤ n^{k/4}; this threshold corresponds to the minimal Gaussian variance over all graphs G.
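As a concrete sketch of Algorithm 1 for the melon graph (NumPy; the empirical calibration and the 5σ decision rule are illustrative choices added here, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 3

def melon_invariant(A):
    return np.einsum('ijk,ijk->', A, A)  # the scalar T_{ijk} T_{ijk}, O(1) memory

# Calibrate the pure-noise distribution empirically. For the melon one can
# also derive E(I_G(B)) = n^k and sigma(I_G(B))^2 = 2 n^k analytically.
samples = [melon_invariant(rng.standard_normal((n,) * k)) for _ in range(200)]
mean, std = np.mean(samples), np.std(samples)

def detect(T, n_sigma=5.0):
    """Flag a spike when the invariant is far from its pure-noise expectation."""
    return abs(melon_invariant(T) - mean) > n_sigma * std

v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
Z = rng.standard_normal((n,) * k)
beta = 5 * n ** (k / 4)  # above the algorithmic threshold n^{k/4}
T_spiked = beta * np.einsum('i,j,k->ijk', v0, v0, v0) + Z
```

With these values the spiked invariant exceeds its pure-noise expectation by roughly β², many standard deviations away, while a pure noise tensor stays within the calibrated band.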

3. SOME APPLICATIONS OF THIS FRAMEWORK

Using these algorithms, we can now investigate the performance of our framework in various theoretical settings. In the first two subsections, we study the algorithms associated to two trace invariants of degree 2: the melonic diagram, which gives a very practical detection algorithm and whose recovery algorithm is a variant of the unfolding algorithm, and the tadpole diagram, whose recovery algorithm is similar to the homotopy algorithm. The remaining subsections illustrate the versatility of this framework: we study the case where the dimensions n_i of the tensor T ∈ ⊗_{i=1}^k R^{n_i} are not necessarily equal, which is important for practical applications where the dimensions are naturally asymmetric. Our method allows us to derive a new algorithmic threshold for this case.

3.1. THE MELON GRAPH SIMILAR TO TENSOR UNFOLDING

Let's consider the invariant T_{i_1...i_k} T_{i_1...i_k} (illustrated by the graph in Figure 1b for k = 3). Its recovery algorithm (with the matrix obtained by cutting any of the edges) is similar to the tensor unfolding method presented in Richard & Montanari (2014). The difference is that the melonic algorithm uses only an n × n matrix instead of an n^{k/2} × n^{k/2} matrix for the tensor unfolding. However, the main contribution of this framework for this graph is that it allows detection in linear time (n^k operations for an input tensor with n^k entries) and in constant memory (it only computes a scalar). This makes it potentially useful as a first step for detecting the signal before deciding whether to use more computationally costly methods to recover it. Also, to the best of our knowledge, this framework is the first to theoretically prove the conjecture that the unfolding algorithm also works in the symmetric case. Theorem 5. Algorithms 1 and 2 work for the melon graph with β_det = β_rec = O(n^{k/4}), in linear time and with respectively O(1) and O(n^2) memory requirements.
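The relation to tensor unfolding can be made explicit: for an order-3 tensor, the melon matrix equals U U^T where U is the n × n² unfolding, so the two methods share their top eigenvector while the melonic variant stores only an n × n matrix. A small NumPy check (added for illustration, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
T = rng.standard_normal((n, n, n))

U = T.reshape(n, n * n)             # tensor unfolding: an n x n^2 matrix
M = np.einsum('ajk,bjk->ab', T, T)  # melon matrix M_{i1 i2} = T_{i1 jk} T_{i2 jk}

# M = U U^T, so the top eigenvector of M equals the top left singular
# vector of U (up to sign).
u_unfold = np.linalg.svd(U)[0][:, 0]
w_melon = np.linalg.eigh(M)[1][:, -1]
align = abs(u_unfold @ w_melon)
```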

3.2. THE TADPOLE GRAPH

The tadpole graph (Figure 1c) has a special characteristic: cutting a single edge splits it into two disconnected parts. Therefore, the matrix obtained by cutting that edge has rank one (it is of the form v v^T). The vector v has a weak correlation with the signal v_0, which allows tensor power iteration (v_i ← T_{ijk} v_j v_k) to empirically recover it (formal proofs require more sophisticated variants of power iteration, as in Anandkumar et al. (2017) and Biroli et al. (2020)). This algorithm is a variant of the existing homotopy algorithm. Theorem 6. The tadpole graph allows recovery of the signal vector for k ≥ 3 and β = O(n^{k/4}), by using local algorithms to enhance the signal contribution of the vector T_{ijj}.
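A minimal NumPy sketch of the tadpole recovery (the dimension, signal-to-noise ratio and iteration count are illustrative choices; the paper's formal guarantees rely on more sophisticated variants of power iteration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3
beta = 8 * n ** (k / 4)  # comfortably above the threshold, for illustration

v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
T = beta * np.einsum('i,j,k->ijk', v0, v0, v0) + rng.standard_normal((n,) * k)

# Tadpole: contracting the loop indices gives the vector T_{ijj}, whose
# signal part is beta * v0 (a weakly correlated initial estimate).
v = np.einsum('ijj->i', T)
v /= np.linalg.norm(v)
corr_init = abs(v @ v0)

# Tensor power iteration v_i <- T_{ijk} v_j v_k enhances the correlation.
for _ in range(20):
    v = np.einsum('ijk,j,k->i', T, v, v)
    v /= np.linalg.norm(v)
corr_final = abs(v @ v0)
```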

4. NUMERICAL EXPERIMENTS

In this section we investigate the empirical results of the previously mentioned applications in order to see whether they match our theoretical results. We restrict ourselves to order k = 3 for simplicity. More details about the experimental settings can be found in the Appendix.

4.1. COMPARISON OF RECOVERY METHODS

This distinction is easily visible, for n = 100 and β = 100, in Figure 5a, where we plot the histograms of the melonic invariant (in blue without signal and in orange with signal) for 500 independent instances of Gaussian random tensors Z. To measure the accuracy of the detection of the signal, we use the quantity 1 minus the cardinality of the intersection over the cardinality of the union (1-IoU). For the recovery algorithm, we focus on the symmetric case (the most studied case and the most consistent with a symmetric spike) and, as in Richard & Montanari (2014), we use two variants for every algorithm: the simple algorithm outputting v, and an algorithm where we apply 100 power iterations v_i ← T_{ijk} v_j v_k on v, distinguished by the prefix "p-". In Figure ??, we run 200 experiments for each value of β and plot the 95% confidence interval of the correlation of the recovered vector with the signal vector. We compare our method to two types of results: • Other algorithmic methods: the melonic (tensor unfolding) and the homotopy algorithms. To the best of our knowledge, they give the state of the art respectively for the symmetric and the asymmetric tensor (Biroli et al. (2020)). Other methods exist but are either too computationally expensive (sum of squares) or are variants of these algorithms. • Information-theoretical results: in Richard & Montanari (2014), it was proven that computing the global minimum v of the function v → T_{ijk} v_i v_j v_k recovers the signal vector v_0 above a theoretical threshold β_th = 2.87 √n, but in exponential time, and that no other approach can do significantly better. Thus, we plot as a red dashed line denoted "perf" the deep minimum closest to v_0, obtained by a gradient method initialized at v_0.
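For reference, the 1-IoU score on two histograms with shared bins can be computed as follows (a hedged sketch of one natural reading of the metric; the paper's exact binning is not specified here):

```python
import numpy as np

rng = np.random.default_rng(6)

def one_minus_iou(a, b, bins=50):
    """1 - |intersection| / |union| of two histograms over shared bins."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi))
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    inter = np.minimum(ha, hb).sum()
    union = np.maximum(ha, hb).sum()
    return 1.0 - inter / union

# Well-separated histograms score near 1 (reliable detection);
# indistinguishable ones score near 0.
separated = one_minus_iou(rng.normal(0, 1, 500), rng.normal(10, 1, 500))
overlapping = one_minus_iou(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
```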

5. CONCLUSION

In this paper we introduced a novel framework for tensor PCA based on trace invariants. Within this framework, we provide different algorithms to detect a signal vector or to recover it. These algorithms use tensor contractions, which have a high potential for parallelization and computational optimization. We illustrate the practical pertinence of our framework by presenting some example algorithms and proving their ability to detect and recover a signal vector in linear time for β above the best known algorithmic threshold. Note that one of the proposed detection algorithms requires only O(1) memory, which can be advantageous in some applications. Moreover, we also show that two well-known algorithms (homotopy and tensor unfolding) can be mapped into our framework, where they correspond to simple graphs (e.g. the melonic graph). An important direction of future research is to apply these new methods to real data.



Figure 1: Example of graphs and their associated invariants

Figure 2: Obtaining a matrix by cutting the edge of a trace invariant graph G

Figure 3: Graphical decomposition of the tensor T

(In Figure 4, we denoted the invariant I_G(T) by I and dropped the indices G, e for simplicity.)



Algorithm 1: Detection algorithm associated to the graph G and edge e
Input: the tensor T = β v_0^⊗k + Z
Goal: detect the presence of v_0
Result: the probability of the presence of a spike

Figure ?? suggests that a high-probability detection requires β ≥ 3n^{3/4}. [Histogram over (T_{ijk} T_{ijk})/n: distribution of the melon invariant without (blue) and with (orange) signal.]

Figure 5: Detection using the melonic graph.

Related work. Tensor PCA was introduced by Richard & Montanari (2014), where the authors suggested and analyzed different methods to recover the signal vector, such as matrix unfolding and power iteration. Since then, various other methods have been proposed. Hopkins et al. (2015) introduced algorithms based on the sum-of-squares hierarchy, with the first proven algorithmic threshold of n^{k/4}. However, this class of algorithms generally requires high computing resources and relies on complex mathematical tools (which makes its algorithmic optimization difficult). Other methods have been inspired by different perspectives, like homotopy in Anandkumar et al. (2017), statistical

