MANY-BODY APPROXIMATION FOR NON-NEGATIVE TENSORS

Abstract

We propose a non-negative tensor decomposition that focuses on the relationship between the modes of tensors. Traditional decomposition methods assume low-rankness in the representation, which leads to difficulties in global optimization and target rank selection. To address these problems, we present an alternative way to decompose tensors, a many-body approximation for tensors, based on an information-geometric formulation. A tensor is treated via an energy-based model, where the tensor and its modes correspond to a probability distribution and random variables, respectively, and the many-body approximation is performed by taking the interactions between variables into account. Our model can be globally optimized in polynomial time in terms of KL divergence minimization, and it is empirically faster than low-rank approximations while keeping comparable reconstruction error. Furthermore, we visualize interactions between modes as tensor networks and reveal a nontrivial relationship between many-body approximation and low-rank approximation.

1. INTRODUCTION

Tensors are generalizations of vectors and matrices. Data in various fields such as neuroscience (Erol & Hunyadi, 2022), bioinformatics (Luo et al., 2017), signal processing (Cichocki et al., 2015), and computer vision (Panagakis et al., 2021) are often stored in the form of tensors, and features are extracted from them. Tensor decomposition and its non-negative version (Shashua & Hazan, 2005) are popular methods that extract features by approximating a tensor by a sum of products of smaller tensors, often called factors. Such methods usually minimize the difference between the tensor reconstructed from the obtained factors and the original tensor, called the reconstruction error. Most tensor decomposition approaches assume a low-rank structure, where a given tensor is approximated by a linear combination of a small number of bases. Such decomposition requires the following two pieces of information. First, it requires the structure, which specifies the type of decomposition, such as CP decomposition (Hitchcock, 1927) and Tucker decomposition (Tucker, 1966). In recent years, tensor networks (Cichocki et al., 2016) have been introduced, which can intuitively and flexibly design the structure, including tensor train decomposition (Oseledets, 2011), tensor ring decomposition (Zhao et al., 2016), and tensor tree decomposition (Murg et al., 2010). Second, it requires the rank value, the number of bases used in the decomposition. Since larger ranks increase the capability of the model while increasing the computational cost, the user is required to find an appropriate rank in this tradeoff.
Since the above tensor decomposition via minimization of the reconstruction error is non-convex, which causes initial value dependence (Kolda & Bader, 2009, Chapter 3), the problem of finding an appropriate low-rank structure is highly nontrivial in practice, as it is hard to locate the cause when the decomposition does not perform well. As a result, to find a proper structure and rank, the user often needs to perform the decomposition multiple times with various settings, which is time and memory consuming. Instead of the low-rank structure that has been the focus of attention in the past, in this paper we propose a novel formulation of tensor decomposition, called many-body approximation, that focuses on the relationship among the modes of tensors. We determine the structure of the decomposition based on the existence of interactions between modes. The proposed method requires only the decomposition structure, naturally determined by the interactions between modes, and does not require the rank value, which traditional decomposition methods require and often struggle to determine. To describe interactions between modes, we follow the standard strategy in statistical mechanics: we use an energy function H(·) to represent interactions and consider the corresponding distribution exp(H(·)). This model is known as an energy-based model in machine learning, which has been used in Legendre decomposition (Sugiyama et al., 2018; 2016) to decompose tensors via convex optimization. Technically, it finds factors of a tensor by treating the tensor as a probability distribution and enforcing some of its natural parameters to be zero. We point out that interactions in the energy function H(·) can be represented using natural parameters of the distribution, and we can formulate many-body approximation as a special case of Legendre decomposition by setting some natural parameters to zero.
The advantage of this approach is that many-body approximation can also be achieved by convex optimization that minimizes the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951). Our approach, describing interactions between modes using energy functions, differs from existing methods that focus on interactions between mode matrices (Vasilescu & Terzopoulos, 2002; Vasilescu, 2011) or block tensors (Vasilescu et al., 2021). Furthermore, we introduce a way of representing tensor interactions that visualizes the presence or absence of interactions between modes. We discuss the correspondence between our representation and tensor networks, and point out that an operation called coarse-graining transformation (Levin & Nave, 2007), in which multiple tensors are viewed as a new tensor, reveals an unexpected relationship between the proposed method and existing methods such as tensor ring and tensor tree decomposition. We summarize our contributions as follows:
• By focusing on the interactions between modes of tensors, we introduce an alternative rank-free tensor decomposition, many-body approximation. This decomposition is realized by convex optimization.
• We present a way of describing tensor many-body approximation, interaction representation, a diagram that shows interactions within a tensor. This diagram can be transformed into tensor networks, which tells us the relationship between many-body approximation and existing low-rank approximations.
• We empirically show that many-body approximation is faster than low-rank approximation with competitive reconstruction errors.

2. TENSOR MANY-BODY APPROXIMATION

Our proposal, tensor many-body approximation, is based on the formulation of Legendre decomposition for tensors. We first review Legendre decomposition and its optimization in Section 2.1. We then introduce interactions between modes and their visual representation to prepare for many-body approximation in Section 2.2. Using interactions between modes, we define many-body approximation in Section 2.3. Finally, we transform the interaction representation into a tensor network and point out the connection between many-body approximation and existing low-rank decomposition methods in Section 2.4. In the following discussion, we consider $D$-order non-negative tensors of size $(I_1, \dots, I_D)$. We assume that the sum of all elements in P is 1 for simplicity; this assumption can be eliminated using a general property of the Kullback-Leibler (KL) divergence, $\lambda D_{KL}(\mathcal{P}, \mathcal{Q}) = D_{KL}(\lambda\mathcal{P}, \lambda\mathcal{Q})$, for any positive real number $\lambda$.

2.1. A REMINDER OF LEGENDRE DECOMPOSITION AND ITS OPTIMIZATION

Legendre decomposition is a method to decompose a non-negative tensor by regarding the tensor as a discrete distribution and representing it with a limited number of parameters. We describe a non-negative tensor P using natural parameters $\theta = (\theta_{1,\dots,1}, \dots, \theta_{I_1,\dots,I_D})$ and its energy function H as

$$P_{i_1,\dots,i_D} = \exp\left(H_{i_1,\dots,i_D}\right), \qquad H_{i_1,\dots,i_D} = \sum_{i'_1=1}^{i_1} \cdots \sum_{i'_D=1}^{i_D} \theta_{i'_1,\dots,i'_D}, \qquad (1)$$

where $\theta_{1,\dots,1}$ plays the role of normalization. Here it is clear that a tensor corresponds to a distribution whose sample space is its index set; that is, the value of each element is regarded as the probability of realizing the corresponding index (Sugiyama et al., 2017). As we can see in equation (1), we can uniquely identify a tensor from its natural parameters $\theta$. We can compute the natural parameter $\theta$ from a given tensor as

$$\theta_{i_1,\dots,i_D} = \sum_{i'_1=1}^{I_1} \cdots \sum_{i'_D=1}^{I_D} \mu^{i'_1,\dots,i'_D}_{i_1,\dots,i_D} \log P_{i'_1,\dots,i'_D} \qquad (2)$$

using the Möbius function $\mu : S \times S \to \{-1, 0, +1\}$, where $S$ is the set of indices, defined inductively as

$$\mu^{i'_1,\dots,i'_D}_{i_1,\dots,i_D} = \begin{cases} 1 & \text{if } i_d = i'_d \text{ for all } d \in \{1,\dots,D\}, \\[2pt] -\displaystyle\sum_{\substack{i_d \le j_d \le i'_d \\ (j_1,\dots,j_D) \ne (i'_1,\dots,i'_D)}} \mu^{j_1,\dots,j_D}_{i_1,\dots,i_D} & \text{else if } i_d \le i'_d \text{ for all } d \in \{1,\dots,D\}, \\[2pt] 0 & \text{otherwise.} \end{cases}$$

The above modelling of non-negative tensors is an instance of the log-linear model on posets (Sugiyama et al., 2017). Since the distribution described by equation (1) belongs to the exponential family, we can also identify each tensor by expectation parameters $\eta = (\eta_{1,\dots,1}, \dots, \eta_{I_1,\dots,I_D})$ using the Möbius inversion formula as

$$\eta_{i_1,\dots,i_D} = \sum_{i'_1=i_1}^{I_1} \cdots \sum_{i'_D=i_D}^{I_D} P_{i'_1,\dots,i'_D}, \qquad P_{i_1,\dots,i_D} = \sum_{i'_1=1}^{I_1} \cdots \sum_{i'_D=1}^{I_D} \mu^{i'_1,\dots,i'_D}_{i_1,\dots,i_D}\, \eta_{i'_1,\dots,i'_D}. \qquad (3)$$

See Supplemental Materials for examples of the above calculation.
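As a sanity check on the relations above, the following sketch (assuming NumPy; the tensor size and parameter values are illustrative, not from the paper) builds a $D = 2$ tensor from natural parameters via equation (1), computes the expectation parameters of equation (3) as tail sums, and recovers $\theta$ through the mixed log-differences that the Möbius function encodes in two dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2 = 3, 4  # illustrative sizes, D = 2

# sample natural parameters; theta[0, 0] plays the role of the normalizer
theta = 0.1 * rng.standard_normal((I1, I2))

# equation (1): H is the cumulative sum of theta over indices <= (i1, i2)
H = theta.cumsum(axis=0).cumsum(axis=1)
P = np.exp(H)
P /= P.sum()  # normalizing only shifts theta[0, 0]

# expectation parameters (equation (3)): tail sums of P over indices >= (i1, i2)
eta = P[::-1, ::-1].cumsum(axis=0).cumsum(axis=1)[::-1, ::-1]
assert np.isclose(eta[0, 0], 1.0)  # eta_{1,1} is the total mass

# equation (2) for D = 2: theta is the mixed log-difference of P
logP = np.log(P)
theta_rec = logP.copy()
theta_rec[1:, :] -= logP[:-1, :]
theta_rec[:, 1:] -= logP[:, :-1]
theta_rec[1:, 1:] += logP[:-1, :-1]

# every parameter except the normalizer theta[0, 0] is recovered exactly
assert np.allclose(theta_rec.ravel()[1:], theta.ravel()[1:])
```

The normalization constant cancels in every difference, which is why only the single normalizer entry differs after reconstruction.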
Since distribution is determined by specifying either θ-parameters or η-parameters, they form two coordinate systems called the θ-coordinate system and the η-coordinate system, respectively. By using the dual flatness, Legendre decomposition achieves convex optimization as shown in the following.

2.1.1. OPTIMIZATION

Legendre decomposition approximates a tensor by setting some $\theta$ values to zero, which corresponds to dropping some parameters for regularization. Let B be the set of indices of $\theta$-parameters that are not imposed to be 0. Then Legendre decomposition coincides with a projection of a given non-negative tensor P onto the subspace $\mathcal{B} = \{\theta \mid \theta_{i_1,\dots,i_D} = 0 \text{ if } (i_1,\dots,i_D) \notin B\}$. Let us consider the projection of a given tensor P onto $\mathcal{B}$. The space of probability distributions is not a Euclidean space; therefore, it is necessary to consider the geometry of probability distributions, which is studied in information geometry. It is known that a subspace with linear constraints on natural parameters $\theta$ is flat, called e-flat (Amari, 2016, Chapter 2). The subspace $\mathcal{B}$ is e-flat, meaning that the logarithmic combination, called the e-geodesic,

$$\{\, R_t \mid \log R_t = (1-t)\log Q_1 + t\log Q_2 - \phi(t),\ 0 < t < 1 \,\}$$

of any two points $Q_1, Q_2 \in \mathcal{B}$ is included in $\mathcal{B}$, where $\phi(t)$ is a normalizer. There is always a unique point $\bar{P}$ on the e-flat subspace that minimizes the KL divergence from any point P:

$$\bar{P} = \operatorname*{arg\,min}_{Q \in \mathcal{B}} D_{KL}(P, Q). \qquad (4)$$

This projection is called the m-projection. The m-projection onto an e-flat subspace is a convex optimization. We define two vectors $\theta_B = (\theta_b)_{b \in B}$ and $\eta_B = (\eta_b)_{b \in B}$, and write $|B|$ for the number of their elements, which equals the cardinality of B. The derivative of the KL divergence and the Hessian matrix $G \in \mathbb{R}^{|B| \times |B|}$ are given as

$$\frac{\partial}{\partial \theta_B} D_{KL}(P, Q) = \eta_B - \bar{\eta}_B, \qquad G_{u,v} = \eta_{\max(i_1,j_1),\dots,\max(i_D,j_D)} - \eta_{i_1,\dots,i_D}\, \eta_{j_1,\dots,j_D}, \qquad (5)$$

where $\eta_B$ and $\bar{\eta}_B$ are the expectation parameters of Q and P, respectively, and $u = (i_1,\dots,i_D), v = (j_1,\dots,j_D) \in B$. This matrix G is also known as the Fisher information matrix. Using gradient descent with the second-order derivative, we can update $\theta_B$ at each iteration t as

$$\theta_B^{t+1} = \theta_B^t - G^{-1}\left(\eta_B^t - \bar{\eta}_B\right). \qquad (6)$$
The distribution $Q_{t+1}$ is calculated from the updated natural parameters $\theta_{t+1}$. This step finds a point $Q_{t+1} \in \mathcal{B}$ that is closer to the destination $\bar{P}$ along the e-geodesic from $Q_t$ to $\bar{P}$. We can also calculate the expectation parameters $\eta_{t+1}$ from the distribution. By repeating this process until convergence, we can always find the globally optimal solution satisfying equation (4). This procedure is illustrated in Figure 1(a). See Supplemental Materials for more detail on the optimization.
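To make the projection concrete, here is a minimal numerical sketch (assuming NumPy; the subspace, step size, and iteration count are illustrative). It m-projects a 3×4 distribution onto the one-body subspace using plain first-order gradient descent on $\theta_B$ rather than the second-order update of equation (6); for this particular subspace the optimum is known in closed form, namely the outer product of the marginals of P, which the iteration reaches:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3, 4))
P /= P.sum()  # input tensor, viewed as a distribution

# one-body subspace: Q(a, b) proportional to exp(a_i + b_j);
# (a, b) plays the role of theta_B (gauge fixed only up to a constant shift)
a = np.zeros(3)
b = np.zeros(4)
for _ in range(2000):
    Q = np.exp(a[:, None] + b[None, :])
    Q /= Q.sum()
    # gradient of D_KL(P, Q) w.r.t. (a, b): model marginals minus data marginals
    a -= 0.5 * (Q.sum(axis=1) - P.sum(axis=1))
    b -= 0.5 * (Q.sum(axis=0) - P.sum(axis=0))

# the m-projection onto the one-body (mean-field) subspace is the
# outer product of the marginals of P
assert np.allclose(Q, np.outer(P.sum(axis=1), P.sum(axis=0)), atol=1e-8)
```

Because the objective is convex in $\theta_B$, the plain gradient iteration converges to the same unique optimum that the natural gradient of equation (6) reaches, only more slowly.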

2.2. INTERACTION AND ITS REPRESENTATION OF TENSORS

In this subsection, we introduce interactions between modes and their visual representation to prepare for many-body approximation. The following discussion enables us to intuitively describe relationships between modes and formulate our novel rank-free tensor decomposition. First we introduce n-body parameters, a generalized concept of the one-body and two-body parameters in (Ghalamkari & Sugiyama, 2022). The n of an n-body parameter is the number of indices that are not 1; for example, $\theta_{1,2,1,1}$ is a one-body parameter, $\theta_{4,3,1,1}$ is a two-body parameter, and $\theta_{1,2,4,3}$ is a three-body parameter. We also use the following notation for n-body parameters:

$$\theta^{(k)}_{i_k} = \theta_{1,\dots,1,i_k,1,\dots,1}, \qquad \theta^{(k,m)}_{i_k,i_m} = \theta_{1,\dots,1,i_k,1,\dots,1,i_m,1,\dots,1}, \qquad \theta^{(k,m,p)}_{i_k,i_m,i_p} = \theta_{1,\dots,i_k,\dots,i_m,\dots,i_p,\dots,1}$$

for n = 1, 2, and 3, respectively. We write the energy function H with n-body parameters as

$$H_{i_1,\dots,i_D} = H_0 + \sum_{k=1}^{D} H^{(k)}_{i_k} + \sum_{k=1}^{D}\sum_{m=1}^{k-1} H^{(k,m)}_{i_k,i_m} + \sum_{k=1}^{D}\sum_{m=1}^{k-1}\sum_{p=1}^{m-1} H^{(k,m,p)}_{i_k,i_m,i_p} + \cdots + H^{(1,\dots,D)}_{i_1,\dots,i_D}, \qquad (8)$$

where the n-th order energy is introduced as

$$H^{(l_1,\dots,l_n)}_{i_{l_1},\dots,i_{l_n}} = \sum_{i'_{l_1}=2}^{i_{l_1}} \cdots \sum_{i'_{l_n}=2}^{i_{l_n}} \theta^{(l_1,\dots,l_n)}_{i'_{l_1},\dots,i'_{l_n}}. \qquad (9)$$

For simplicity, we suppose that $1 \le l_1 < l_2 < \dots < l_n \le D$ holds, and we set $H_0 = \theta_{1,\dots,1}$. We say that an n-body interaction exists between modes $l_1, \dots, l_n$ if there are indices $i_{l_1}, \dots, i_{l_n}$ satisfying $H^{(l_1,\dots,l_n)}_{i_{l_1},\dots,i_{l_n}} \ne 0$. The first term $H_0$ in equation (8) is called the normalization factor or the partition function. The terms $H^{(k)}$ are called biases in machine learning, and magnetic fields or self-energies in statistical physics. The terms $H^{(k,m)}$ correspond to the weights of a Boltzmann machine in machine learning, and to two-body or electron-electron interactions in physics.
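The classification of a parameter into an n-body parameter depends only on how many of its indices differ from 1, as this small Python sketch (1-indexed tuples, as in the text) shows:

```python
def body_order(index):
    """n of an n-body parameter: the number of indices that are not 1."""
    return sum(i != 1 for i in index)

# the examples from the text
assert body_order((1, 2, 1, 1)) == 1  # one-body parameter
assert body_order((4, 3, 1, 1)) == 2  # two-body parameter
assert body_order((1, 2, 4, 3)) == 3  # three-body parameter
```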
To visualize the existence of interactions within a tensor, we newly introduce a diagram called interaction representation, which is inspired by factor graphs in graphical modelling (Bishop & Nasrabadi, 2006, Chapter 8). The graphical representation of a product of tensors is widely known as a tensor network; however, displaying the relations between the modes of a tensor as a factor graph is our novel approach. We represent an n-body interaction as a black square, ■, connected with n modes. We describe examples of the two-body interaction between modes (k, m) and the three-body interaction among modes (k, m, p) in Figure 1(b). Combining these interactions, the energy function including all two-body interactions is shown in Figure 1(c), and the energy function including all two-body and three-body interactions is shown in Figure 1(d) for D = 4. This visualization allows us to intuitively understand the relationship between the modes of a tensor. For simplicity, we omit one-body interactions from the diagrams, while we always assume them. Once an interaction representation is given, we can determine the corresponding decomposition of the tensor. In the following section, we remove some of the n-body interactions, that is, set $H^{(l_1,\dots,l_n)}_{i_{l_1},\dots,i_{l_n}} = 0$, by fixing each parameter $\theta^{(l_1,\dots,l_n)}_{i_{l_1},\dots,i_{l_n}} = 0$ for all indices $(i_{l_1},\dots,i_{l_n}) \in \{2,\dots,I_{l_1}\} \times \dots \times \{2,\dots,I_{l_n}\}$.

2.3. MANY-BODY APPROXIMATION

Our proposed method, tensor many-body approximation, approximates a given tensor by assuming the existence of dominant interactions between the modes of the tensor and ignoring other interactions. Since this operation can be understood as setting some natural parameters of the distribution to zero, it can be achieved by convex optimization through the theory of Legendre decomposition. As we see below, approximated tensors are represented without the summation symbol $\Sigma$. This property is different from existing low-rank approximations except for rank-1 approximation. As an example, we consider two types of approximations of a non-negative tensor P by tensors represented in Figure 1(c) and (d). If all energies of order greater than 2 or greater than 3 in equation (8) are ignored, that is, $H^{(l_1,\dots,l_n)}_{i_{l_1},\dots,i_{l_n}} = 0$ for n > 2 or n > 3, P is approximated as

$$P_{i_1,i_2,i_3,i_4} \simeq P^{\le 2}_{i_1,i_2,i_3,i_4} = X^{(1,2)}_{i_1,i_2} X^{(1,3)}_{i_1,i_3} X^{(1,4)}_{i_1,i_4} X^{(2,3)}_{i_2,i_3} X^{(2,4)}_{i_2,i_4} X^{(3,4)}_{i_3,i_4}, \qquad (10)$$

$$P_{i_1,i_2,i_3,i_4} \simeq P^{\le 3}_{i_1,i_2,i_3,i_4} = \chi^{(1,2,3)}_{i_1,i_2,i_3} \chi^{(1,2,4)}_{i_1,i_2,i_4} \chi^{(1,3,4)}_{i_1,i_3,i_4} \chi^{(2,3,4)}_{i_2,i_3,i_4}, \qquad (11)$$

where each factor on the right-hand side is represented as

$$X^{(k,m)}_{i_k,i_m} = \frac{1}{\sqrt[6]{Z}} \exp\left( \frac{1}{3} H^{(k)}_{i_k} + H^{(k,m)}_{i_k,i_m} + \frac{1}{3} H^{(m)}_{i_m} \right),$$

$$\chi^{(k,m,p)}_{i_k,i_m,i_p} = \frac{1}{\sqrt[4]{Z}} \exp\left( \frac{H^{(k)}_{i_k} + H^{(m)}_{i_m} + H^{(p)}_{i_p}}{3} + \frac{1}{2} H^{(k,m)}_{i_k,i_m} + \frac{1}{2} H^{(m,p)}_{i_m,i_p} + \frac{1}{2} H^{(k,p)}_{i_k,i_p} + H^{(k,m,p)}_{i_k,i_m,i_p} \right).$$

The partition function, or the normalization factor, is given as $Z = \exp(-\theta_{1,1,1,1})$, which does not depend on the indices $(i_1, i_2, i_3, i_4)$. When the tensor P is approximated by $P^{\le m}$, the set B contains only the indices of $n(\le m)$-body parameters. In the above discussion, we considered many-body approximation with all n-body parameters, while our formulation also allows us to use only a part of the n-body interactions, as shown in the following. We consider the situation where only one-body interactions and two-body interactions between modes (k, k+1) exist for all $k \in \{1, \dots, D\}$, where D + 1 is identified with 1. Figure 2(a) shows the interaction representation of the approximated tensor. As we can confirm by substituting 0 for $H^{(k,l)}_{i_k,i_l}$ with $l \ne k+1$, we can describe the approximated tensor as

$$P_{i_1,\dots,i_D} \simeq P^{\mathrm{cyc}}_{i_1,\dots,i_D} = X^{(1)}_{i_1,i_2} X^{(2)}_{i_2,i_3} \cdots X^{(D)}_{i_D,i_1}, \qquad (12)$$

where

$$X^{(k)}_{i_k,i_{k+1}} = \frac{1}{\sqrt[D]{Z}} \exp\left( \frac{1}{2} H^{(k)}_{i_k} + H^{(k,k+1)}_{i_k,i_{k+1}} + \frac{1}{2} H^{(k+1)}_{i_{k+1}} \right) \qquad (13)$$

with the normalization factor $Z = \exp(-\theta_{1,\dots,1})$. When the tensor P is approximated by $P^{\mathrm{cyc}}$, the set B contains only all one-body parameters and the two-body parameters $\theta^{(d,d+1)}_{i_d,i_{d+1}}$ for $d \in \{1, 2, \dots, D\}$. We call this approximation cyclic two-body approximation since the order of indices in equation (12) is cyclic. We show the connection between cyclic two-body approximation and existing tensor ring decomposition in the following subsection.
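Given factor matrices, reconstructing $P^{\mathrm{cyc}}$ from equation (12) requires no summation, only an elementwise, cyclically indexed product. A sketch with NumPy's einsum (hypothetical random factors, D = 4):

```python
import numpy as np

rng = np.random.default_rng(2)
I = (2, 3, 4, 5)
# hypothetical non-negative factors X^(k) of shape (I_k, I_{k+1}), cyclically
X = [rng.random((I[k], I[(k + 1) % 4])) for k in range(4)]

# equation (12): P^cyc_{ijkl} = X1_{ij} X2_{jk} X3_{kl} X4_{li}
# every index appears in the output subscripts, so nothing is summed over,
# unlike the contraction of a tensor ring decomposition
P_cyc = np.einsum('ij,jk,kl,li->ijkl', X[0], X[1], X[2], X[3])
assert P_cyc.shape == I

# spot-check one entry against the elementwise product
i, j, k, l = 1, 2, 3, 4
assert np.isclose(P_cyc[i, j, k, l],
                  X[0][i, j] * X[1][j, k] * X[2][k, l] * X[3][l, i])
```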

2.4. CONNECTION TO TENSOR NETWORK

Tensor interaction representation is a diagram that focuses on the relationship between modes. Tensor networks, which are well known as diagrams that focus on the factors after decomposition, represent a tensor as an undirected graph, whose nodes correspond to matrices or tensors and whose edges are the modes of summation in tensor products (Cichocki et al., 2016). Our tensor interaction representation has a tight connection to tensor networks, and we can convert a tensor interaction representation into a tensor network. For the conversion, we use a hyper-diagonal tensor $\Omega$ defined as $\Omega_{ijk} = \delta_{ij}\delta_{jk}\delta_{ki}$, where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise. The tensor $\Omega$ is often represented by • in tensor networks. In the tensor network community, $\Omega$ appears in the CNOT gate and as a special case of the Z-spider (Nielsen & Chuang, 2010). The tensor network in Figure 2(a) represents the formula

$$\prod_{d=1}^{D} \left( \sum_{j_d} \sum_{l_d} X^{(d)}_{l_d, j_{d+1}}\, \Omega_{j_{d+1}, i_{d+1}, l_{d+1}} \right), \qquad (14)$$

where $j_{D+1} = j_1$, $i_{D+1} = i_1$, and $l_{D+1} = l_1$. Substituting the definition of $\Omega$ into equation (14), we see that this tensor network corresponds to equation (12). We point out that the tensor network representation of cyclic two-body approximation is similar to that of tensor ring decomposition. The tensor ring decomposition is an extension of the tensor train decomposition, and its representation is shown as a tensor network in Figure 2(b). In fact, if we regard the region enclosed by the dotted line in the tensor network as a new tensor, the tensor network of the cyclic two-body approximation coincides with that of the tensor ring decomposition (see more details in the Supplemental Materials). This operation, in which multiple tensors are regarded as a new tensor in a tensor network, is called renormalization or coarse-graining transformation (Evenbly & Vidal, 2015).

Comparing the number of parameters. The number of elements of an input tensor is $I_1 \times I_2 \times \dots \times I_D$.
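The role of $\Omega$ can be verified numerically: contracting the factors of a cyclic two-body approximation through hyper-diagonal $\Omega$ tensors, as in the ring-shaped network of Figure 2(a), reproduces the plain cyclic product of equation (12). A sketch with NumPy (all mode sizes equal so that the cubic $\Omega$ is well defined; the factors are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
n, D = 3, 4
X = [rng.random((n, n)) for _ in range(D)]  # placeholder factors X^(d)

# hyper-diagonal tensor: Omega_ijk = 1 iff i = j = k
Omega = np.zeros((n, n, n))
for i in range(n):
    Omega[i, i, i] = 1.0

# plain cyclic product (equation (12)): no summation over shared indices
direct = np.einsum('ab,bc,cd,da->abcd', X[0], X[1], X[2], X[3])

# ring-shaped network alternating Omega and X^(d); the open legs w, x, y, z
# are the physical indices, and the internal sums collapse because each
# Omega forces its three legs to carry the same value
net = np.einsum('awb,bc,cxd,de,eyf,fg,gzh,ha->wxyz',
                Omega, X[0], Omega, X[1], Omega, X[2], Omega, X[3])

assert np.allclose(net, direct)
```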
After the cyclic two-body approximation, the number of parameters is given as

$$|B| = 1 + \sum_{d=1}^{D} (I_d - 1) + \sum_{d=1}^{D} (I_d - 1)(I_{d+1} - 1), \qquad (15)$$

where we assume $I_{D+1} = I_1$. The first term is for the normalizer, the second is the number of one-body parameters, and the final term is the number of two-body parameters. In contrast, in the tensor ring decomposition with target rank $(R_1, \dots, R_D)$, the number of parameters is given as $|R| = \sum_{k=1}^{D} R_k I_k R_{k+1}$. The ratio of the numbers of parameters of these two methods, $|B|/|R|$, is proportional to $I/R^2$ if we assume $R_d = R$ and $I_d = I$ for all $d \in \{1, \dots, D\}$ for simplicity. Therefore, when the target rank is small and the size of the input tensor is large, the proposed method has more parameters than the tensor ring decomposition.
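The two counts can be compared directly; a small Python sketch (uniform sizes and ranks chosen only to illustrate the $I/R^2$ behaviour):

```python
def num_params_cyclic(I):
    """|B| of equation (15) for a cyclic two-body approximation."""
    D = len(I)
    one_body = sum(i - 1 for i in I)
    two_body = sum((I[d] - 1) * (I[(d + 1) % D] - 1) for d in range(D))
    return 1 + one_body + two_body

def num_params_ring(I, R):
    """|R| for a tensor ring decomposition with target rank (R_1, ..., R_D)."""
    D = len(I)
    return sum(R[k] * I[k] * R[(k + 1) % D] for k in range(D))

I = (20,) * 5
assert num_params_cyclic(I) == 1 + 5 * 19 + 5 * 19 * 19   # 1901
assert num_params_ring(I, (10,) * 5) == 5 * 10 * 20 * 10  # 10000
# here |B| < |R|; for uniform sizes the ratio |B|/|R| behaves like I / R^2
```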

2.4.1. OTHER EXAMPLE OF MANY-BODY APPROXIMATION AND ITS TENSOR NETWORK

In the same way, we can find a correspondence between another example of many-body approximation and an existing low-rank approximation. For D = 9, we consider three-body and two-body interactions among $(i_1, i_2, i_3)$, $(i_4, i_5, i_6)$, and $(i_7, i_8, i_9)$, and a three-body interaction among $(i_3, i_6, i_9)$. We provide the interaction representation of the target energy function in Figure 3(a). In this approximation, the decomposed tensor can be described as

$$P_{i_1,\dots,i_9} = A_{i_1,i_2,i_3}\, B_{i_4,i_5,i_6}\, C_{i_7,i_8,i_9}\, G_{i_3,i_6,i_9}. \qquad (16)$$

In the same way as for the cyclic two-body approximation, we can convert the interaction representation into a tensor network, as described in Figure 3(a). The tensor network of a tensor tree decomposition in Figure 3(b) emerges when the region enclosed by the dotted line in Figure 3(a) is replaced with a new tensor (shown with a tilde). Such tensor tree decomposition is used in generative modeling (Cheng et al., 2019), computational chemistry (Murg et al., 2015), and quantum many-body physics (Shi et al., 2006). As we have seen above, by transforming tensor interaction representations into tensor networks and applying coarse-graining, we can reveal the relationship between tensor many-body approximations and low-rank approximations.

2.5. MANY-BODY APPROXIMATION AS GENERALIZATION OF MEAN-FIELD APPROXIMATION

It has already been pointed out that any tensor P can be represented by vectors $x^{(d)} \in \mathbb{R}^{I_d}$ for $d \in \{1, \dots, D\}$ as $P_{i_1,\dots,i_D} = x^{(1)}_{i_1} x^{(2)}_{i_2} \cdots x^{(D)}_{i_D}$ if and only if all $n(\ge 2)$-body $\theta$-parameters are 0 (Ghalamkari & Sugiyama, 2021). The right-hand side equals the Kronecker product of the D vectors $x^{(1)}, \dots, x^{(D)}$; therefore this approximation is equivalent to rank-1 approximation, since a tensor represented as such a product always has rank 1, which is also known to correspond to the mean-field approximation. In this study, we propose many-body approximation by relaxing the condition of the mean-field approximation that ignores all $n(\ge 2)$-body interactions. Therefore, many-body approximation is a generalization of rank-1 approximation and the mean-field approximation.
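This equivalence is easy to check numerically: a tensor with only one-body interactions is an outer product of vectors, and every unfolding of such a tensor has matrix rank 1. A NumPy sketch with random positive vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
x = [rng.random(s) + 0.1 for s in (3, 4, 5)]  # positive vectors x^(d)

# a tensor with all n(>=2)-body parameters equal to zero is the
# outer product P_{ijk} = x1_i x2_j x3_k
P = np.einsum('i,j,k->ijk', x[0], x[1], x[2])

# such a tensor is rank-1: every mode unfolding has matrix rank 1
for mode in range(3):
    unfolding = np.moveaxis(P, mode, 0).reshape(P.shape[mode], -1)
    assert np.linalg.matrix_rank(unfolding) == 1
```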

2.6. COMPUTATIONAL COMPLEXITY

We analyze the computational complexity of many-body approximation. The overall complexity is dominated by the update of $\theta$, which includes the matrix inversion of G. The complexity of computing the inverse of an $n \times n$ matrix is $O(n^3)$; therefore, the computational complexity of many-body approximation is $O(\gamma|B|^3)$, where $\gamma$ is the number of iterations. This complexity can be reduced if we reshape the tensor so that the size of each mode becomes small. For example, let us consider a 3-order tensor of size $(J^2, J^2, J^2)$ and its cyclic two-body approximation. In this case, the time complexity is $O(\gamma J^{12})$ since it holds that $|B| \propto J^4$ (see equation (15)). In contrast, if we reshape the input tensor into a 6-order tensor of size $(J, J, J, J, J, J)$, the time complexity becomes $O(\gamma J^6)$ since it holds that $|B| \propto J^2$. This technique of reshaping a tensor into a higher-order tensor is used practically not only in the proposed method but also in various methods based on tensor networks, such as tensor ring decomposition (Malik & Becker, 2021).
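The effect of reshaping on $|B|$ can be verified with the count from equation (15); a Python sketch (J = 10 is illustrative):

```python
def B_size_cyclic(I):
    # |B| for a cyclic two-body approximation, equation (15)
    D = len(I)
    return (1 + sum(i - 1 for i in I)
            + sum((I[d] - 1) * (I[(d + 1) % D] - 1) for d in range(D)))

J = 10
b3 = B_size_cyclic((J * J,) * 3)  # (J^2, J^2, J^2): |B| grows like J^4
b6 = B_size_cyclic((J,) * 6)      # (J, J, J, J, J, J): |B| grows like J^2
assert b3 == 1 + 3 * 99 + 3 * 99 * 99  # 29701
assert b6 == 1 + 6 * 9 + 6 * 9 * 9     # 541
assert b6 < b3  # the O(gamma |B|^3) update is far cheaper after reshaping
```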

3. EXPERIMENTS

As seen in Section 2.4, many-body approximation has a close connection to low-rank approximation. For example, in tensor ring decomposition, if we impose that the decomposed factors can be represented as products with hyper-diagonal tensors $\Omega$, the decomposition is equivalent to cyclic two-body approximation (see Figure 2). Therefore, to examine our conjecture that cyclic two-body approximation is as capable of approximating tensors as tensor ring decomposition, we empirically compare the efficiency and effectiveness of cyclic two-body approximation with tensor ring decomposition. As baselines, we use five existing methods of non-negative tensor ring decomposition: NTR-APG, NTR-HALS, NTR-MU, NTR-MM, and NTR-lraMM (Yu et al., 2021; 2022). These methods minimize the reconstruction error defined with the Frobenius norm by gradient methods. See Supplemental Materials for implementation details. We evaluate the approximation performance by the relative error $\|\mathcal{T} - \hat{\mathcal{T}}\|_F / \|\mathcal{T}\|_F$ for an input tensor $\mathcal{T}$ and a reconstructed tensor $\hat{\mathcal{T}}$, where $\|\cdot\|_F$ is the Frobenius norm. Since all the existing methods are based on non-convex optimization, we plot the best score (minimum relative error) among 5 restarts with random initialization. In contrast, the score of our method is obtained by a single run, as it is a convex optimization and such restarts are fundamentally unnecessary. We also compare the total running times. In each plot, the result of our method corresponds to the cross point of the horizontal and vertical red dotted lines. Please note that our method does not have the concept of rank; thus its score is invariant to changes of the target rank, unlike the other methods. If the cross point of the red dotted lines is lower than the other lines, the proposed method is better than the other methods.
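The evaluation metric is a one-liner; a sketch assuming NumPy (with default arguments, np.linalg.norm returns the 2-norm of the flattened array, i.e. the Frobenius norm for a tensor of any order):

```python
import numpy as np

def relative_error(T, T_hat):
    """Relative reconstruction error ||T - T_hat||_F / ||T||_F."""
    return np.linalg.norm(T - T_hat) / np.linalg.norm(T)

T = np.ones((2, 3, 4))
assert relative_error(T, T) == 0.0
assert np.isclose(relative_error(T, np.zeros_like(T)), 1.0)
```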

Synthetic data

In addition to the above case in which we assumed low-rankness, we also generated synthetic datasets without such an assumption. We created tensors of size 30^5 and 20^5 by sampling from a uniform distribution and performed the same experiment. Results are shown in Figure 4(c) and Figure 4(d). In all experiments, the proposed method is superior to the comparison methods in both efficiency and effectiveness. It should be noted that the relative error of the proposed method is smaller even when the target rank of the tensor ring decomposition is large and its number of parameters is several times larger than that of the proposed method. Real data Next, we evaluate our method on real data. 4DLFD is a 9-order tensor produced from the 4D Light Field Dataset (Honauer et al., 2016; Gortler et al., 1996; Levoy & Hanrahan, 1996). TT_ChartRes, TT_Origami, and TT_Paint are 7-order tensors produced from the TokyoTech Hyperspectral Image Dataset (Monno et al., 2015; 2017). Each tensor has been reshaped to reduce the computational complexity. See the dataset details in the Supplemental Materials. The proposed method is always faster than the baselines while keeping competitive relative errors. In the baseline methods, a slight change of the target rank can induce a significant increase of the reconstruction error due to their non-convex nature.

4. CONCLUSION

We propose many-body approximation for tensors, which decomposes tensors by focusing on the relationship between modes represented by an energy-based model. It approximates tensors by ignoring the energy corresponding to some interactions, which can be viewed as a generalization of the mean-field approximation that considers only one-body interactions. Our novel formulation enables us to achieve convex optimization of the model, while existing approaches based on the low-rank structure are non-convex. Furthermore, we introduce a way of visualizing interactions between modes, called interaction representation, to see the activated interactions between modes. We have established a transformation between our representation and tensor networks, which reveals a nontrivial connection between many-body approximation and classical low-rank tensor decomposition.

(θ, η)-coordinates and geodesics. In this study, we map a normalized D-order non-negative tensor $\mathcal{P} \in \mathbb{R}^{I_1 \times \dots \times I_D}_{\ge 0}$ to a discrete probability distribution with D random variables. Let $\mathcal{U}$ be the set of discrete probability distributions with D random variables. The entire space $\mathcal{U}$ is a non-Euclidean space with the Fisher information matrix G as its metric, which measures the distance between two points. In Euclidean space, the shortest path between two points is a straight line; in a non-Euclidean space, such a shortest path is called a geodesic. In the space $\mathcal{U}$, two kinds of geodesics can be introduced, e-geodesics and m-geodesics. For two points $P_1, P_2 \in \mathcal{U}$, the e- and m-geodesics are defined as

$$\{\, R_t \mid \log R_t = (1-t)\log P_1 + t\log P_2 - \phi(t) \,\}, \qquad \{\, R_t \mid R_t = (1-t)P_1 + tP_2 \,\},$$

respectively, where $0 \le t \le 1$ and $\phi(t)$ is a normalization factor keeping $R_t$ a distribution. We can parameterize distributions $P \in \mathcal{U}$ by the natural parameter. We have described the relationship between a distribution P and its natural parameter $\theta = (\theta_{1,\dots,1}, \dots, \theta_{I_1,\dots,I_D})$ in equation (1).
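The two geodesics can be traced numerically for discrete distributions; a sketch with NumPy (random endpoints and t = 0.3, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
P1 = rng.random(6)
P1 /= P1.sum()
P2 = rng.random(6)
P2 /= P2.sum()
t = 0.3

# m-geodesic: pointwise mixture, normalized automatically
R_m = (1 - t) * P1 + t * P2
assert np.isclose(R_m.sum(), 1.0)

# e-geodesic: log-linear mixture; renormalizing implements the -phi(t) term
R_e = np.exp((1 - t) * np.log(P1) + t * np.log(P2))
R_e /= R_e.sum()
assert np.isclose(R_e.sum(), 1.0)
```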
The natural parameter $\theta$ serves as a coordinate system of $\mathcal{U}$, since any distribution in $\mathcal{U}$ is specified by determining $\theta$. Furthermore, we can also specify a distribution P by its expectation parameter $\eta = (\eta_{1,\dots,1}, \dots, \eta_{I_1,\dots,I_D})$, which corresponds to expected values of the distribution and gives an alternative coordinate system of $\mathcal{U}$; its definition is described in equation (3). The $\theta$-coordinates and $\eta$-coordinates are orthogonal to each other, which means that the Fisher information matrix G satisfies $G_{u,v} = \partial\eta_u/\partial\theta_v$ and $(G^{-1})_{u,v} = \partial\theta_u/\partial\eta_v$. The e- and m-geodesics can also be described using these parameters as

$$\{\, \theta_t \mid \theta_t = (1-t)\theta^{P_1} + t\theta^{P_2} \,\}, \qquad \{\, \eta_t \mid \eta_t = (1-t)\eta^{P_1} + t\eta^{P_2} \,\},$$

where $\theta^{P}$ and $\eta^{P}$ are the $\theta$- and $\eta$-coordinates of a distribution $P \in \mathcal{U}$. The KL divergence from a discrete distribution $P \in \mathcal{U}$ to $Q \in \mathcal{U}$ is given as

$$D_{KL}(P, Q) = \sum_{i_1=1}^{I_1} \cdots \sum_{i_D=1}^{I_D} P_{i_1,\dots,i_D} \log \frac{P_{i_1,\dots,i_D}}{Q_{i_1,\dots,i_D}}.$$

It is known that a subspace with linear constraints on natural parameters $\theta$ is e-flat (Amari, 2016, Chapter 2). The proposed many-body approximation performs m-projection onto the subspace $\mathcal{B} \subset \mathcal{U}$ with some natural parameters fixed to 0. From this linear constraint, we know that $\mathcal{B}$ is e-flat; therefore, the optimal solution of the many-body approximation is always unique. When a space is e-flat and m-flat at the same time, we say that the space is dually flat; $\mathcal{U}$ is dually flat.

Natural gradient method. The e(m)-flatness guarantees that the cost functions to be optimized in equation (20) are convex. Therefore, m(e)-projection onto an e(m)-flat subspace can be implemented by a gradient method using second-order gradients, which we call the natural gradient method. The Fisher information matrix G appears as the second-order derivative of the KL divergence (see equation (5)). We can perform fast optimization using the update formula in equation (6), which uses the inverse of the Fisher information matrix.
Examples of the Möbius function. In the proposed method, we need to transform the distribution $P \in \mathbb{R}^{I_1 \times \dots \times I_D}$ between $\theta$ and $\eta$ using the Möbius function defined in Section 2.1. We provide examples here. In equation (2), the Möbius function is used to find the natural parameter $\theta$ from a distribution P. For example, for D = 2 and D = 3 it holds for indices greater than 1 that

$$\theta_{i_1,i_2} = \log P_{i_1,i_2} - \log P_{i_1-1,i_2} - \log P_{i_1,i_2-1} + \log P_{i_1-1,i_2-1},$$

$$\theta_{i_1,i_2,i_3} = \log P_{i_1,i_2,i_3} - \log P_{i_1-1,i_2,i_3} - \log P_{i_1,i_2-1,i_3} - \log P_{i_1,i_2,i_3-1} + \log P_{i_1-1,i_2-1,i_3} + \log P_{i_1-1,i_2,i_3-1} + \log P_{i_1,i_2-1,i_3-1} - \log P_{i_1-1,i_2-1,i_3-1}.$$



Figure 1: (a) An illustration of optimization of Legendre decomposition. Interaction representations corresponding to (c) equation 10 and (d) equation 11.

Figure 2: (a) Interaction representation of an example of cyclic two-body approximation and its transformed tensor network for D = 4. (b) Tensor network of tensor ring decomposition.

Figure 3: (a) Interaction representation corresponding to equation 16 and its transformed tensor network for D = 9. (b) Tensor network of a variant of tensor tree decomposition.

Figure 4: (a)(b) Results for a tensor with low ring rank. (c)(d) Results for tensors sampled from the uniform distribution. The vertical red dotted line indicates |B| (see equation 15).

Figure 5: Experimental results for real datasets. The vertical red dotted line indicates |B| (see equation 15).

Flat subspaces and projections  A subspace is called e-flat when any e-geodesic connecting two points in the subspace is included in the subspace. The vertical descent of an m-geodesic from a point P ∈ U onto an e-flat subspace $B_e$ is called m-projection. Similarly, e-projection is obtained by replacing every e with m and every m with e. The flatness of the subspace guarantees the uniqueness of the projection destination. The projection destinations $\bar{P}$ and $\tilde{P}$ obtained by m- or e-projection onto $B_e$ or $B_m$ minimize the following KL divergences:
$$\bar{P} = \operatorname*{arg\,min}_{Q \in B_e} D_{\mathrm{KL}}(P, Q), \qquad \tilde{P} = \operatorname*{arg\,min}_{Q \in B_m} D_{\mathrm{KL}}(Q, P).$$
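To make m-projection concrete, consider the e-flat subspace of product (one-body) distributions. A standard result, which can be checked numerically, is that the m-projection of P onto this subspace is the outer product of its marginals, and no other product distribution attains a smaller $D_{\mathrm{KL}}(P, Q)$. This snippet is our illustration, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.uniform(size=(4, 5))
P /= P.sum()                      # a joint distribution over two modes

def kl(P, Q):
    # KL divergence between two discrete distributions of the same shape
    return np.sum(P * np.log(P / Q))

# m-projection of P onto the e-flat set of product distributions:
# the outer product of the marginals of P
Q_star = np.outer(P.sum(axis=1), P.sum(axis=0))

# any other product distribution has a larger KL divergence from P
for _ in range(1000):
    a = rng.uniform(size=4); a /= a.sum()
    b = rng.uniform(size=5); b /= b.sum()
    assert kl(P, Q_star) <= kl(P, np.outer(a, b)) + 1e-12
```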


REPRODUCIBILITY STATEMENT

We provide implementation and dataset details, as well as the code for both the proposed method and the comparison methods, in the Supplemental Material.

ETHICS STATEMENT

This study is a theoretical analysis of tensors, and we believe that our theoretical discussion does not have any negative societal impacts.

A CYCLIC TWO-BODY APPROXIMATION AND RING DECOMPOSITION

We can interpret the cyclic two-body approximation as tensor ring decomposition with constraints, as described below. Non-negative tensor ring decomposition approximates a given tensor $P \in \mathbb{R}_{\geq 0}^{I_1 \times \cdots \times I_D}$ with D core tensors $\chi^{(1)}, \chi^{(2)}, \dots, \chi^{(D)}$ with $\chi^{(d)} \in \mathbb{R}_{\geq 0}^{R_{d-1} \times I_d \times R_d}$ for each d ∈ {1, 2, ..., D} as
$$P_{i_1,\dots,i_D} \simeq \sum_{r_1=1}^{R_1} \cdots \sum_{r_D=1}^{R_D} \prod_{d=1}^{D} \chi^{(d)}_{r_{d-1}, i_d, r_d}, \tag{18}$$
where $(R_1, \dots, R_D)$ is called the tensor ring rank. The decomposition is described in Figure 2(b). The cyclic two-body approximation also approximates the tensor P in the form of equation 18, imposing the additional constraint that each core tensor $\chi^{(d)}$ is decomposed as
$$\chi^{(d)}_{r_{d-1}, i_d, r_d} = \sum_{j} X^{(d)}_{r_{d-1}, j}\, \Omega_{j, i_d, r_d} \tag{19}$$
for each d ∈ {1, 2, ..., D}, where $\Omega_{ijk} = \delta_{ij}\delta_{jk}\delta_{ki}$, so that $R_d = I_d$. We assume $r_0 = r_D$ for simplicity. We obtain equation 12 by substituting equation 19 into equation 18.
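The equivalence can be checked numerically. Below is our own sketch (the names `X` and `chi` and the index convention $X^{(d)}(i_d, i_{d+1})$ for the two-body factors are ours): with ring ranks $R_d = I_d$ and cores constrained to $\chi^{(d)}[l, i, m] = \delta_{im} X[l, i]$, contracting the tensor ring reproduces the product of the two-body factors.

```python
import numpy as np

rng = np.random.default_rng(0)
I = [3, 4, 5]                     # mode sizes I1, I2, I3
D = 3

# two-body factors X^(d)(i_d, i_{d+1}), cyclically (i_{D+1} = i_1)
X = [rng.uniform(0.1, 1.0, size=(I[d], I[(d + 1) % D])) for d in range(D)]

# cyclic two-body tensor: Q[i1, i2, i3] = X1[i1, i2] * X2[i2, i3] * X3[i3, i1]
Q_cyc = np.einsum('ij,jk,ki->ijk', X[0], X[1], X[2])

# constrained tensor ring cores with ring rank R_d = I_d:
# chi^(d)[l, i, m] = delta_{i,m} * X^(d-1)[l, i]
chi = []
for d in range(D):
    Xprev = X[(d - 1) % D]        # shape (I_{d-1}, I_d)
    core = np.zeros((I[(d - 1) % D], I[d], I[d]))
    for i in range(I[d]):
        core[:, i, i] = Xprev[:, i]
    chi.append(core)

# contract the tensor ring: sum over r1, r2, r3 with r0 = r3
Q_tr = np.einsum('aib,bjc,cka->ijk', chi[0], chi[1], chi[2])

assert np.allclose(Q_cyc, Q_tr)
```

The delta constraint forces each ring index to coincide with the neighboring mode index, which is why the ring contraction collapses to a plain product of matrices over cyclically adjacent modes.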

B IMPLEMENTATION DETAIL

We describe the implementation details of the methods in the following.

Proposed method  Our method is implemented in Julia 1.8. We use a natural gradient method for the cyclic two-body approximation. The natural gradient method uses the inverse of the Fisher information matrix to perform second-order optimization in a non-Euclidean space. For non-normalized tensors, we conduct the following procedure. First, we compute the total sum of the elements of an input tensor. Then, we normalize the tensor by this sum. After that, we conduct Legendre decomposition on the normalized tensor. Finally, we multiply the result of the previous step by the total sum computed initially. The termination criterion is the same as in the original implementation of Legendre decomposition by Sugiyama et al. (2018); that is, the algorithm terminates when $\|\eta^B_t - \hat{\eta}^B\| < 10^{-5}$, where $\eta^B_t$ denotes the expectation parameters on the t-th step and $\hat{\eta}^B$ those of the input tensor, as defined in Section 2.1.1. The overall procedure is described in Algorithm 1. Note that this algorithm is based on Legendre decomposition by Sugiyama et al. (2018).

Baseline methods  We implemented the baseline methods by translating the MATLAB code provided by the authors into Julia for a fair comparison. As described in their original papers, NTR-APG, NTR-HALS, NTR-MU, NTR-MM, and NTR-lraMM use an inner and an outer loop to find a local solution. We repeat the inner loop 100 times and stop the outer loop when the difference between the relative errors of the previous and current iterations is less than $10^{-4}$. NTR-MM and NTR-lraMM require a diagonal parameter matrix Ξ; we set Ξ = ωI, where I is the identity matrix and ω = 0.1. NTR-lraMM performs a low-rank approximation of the matrix obtained by mode expansion of the input tensor; the target rank is set to 20, which is the default setting in the provided code. The initial factors of the baseline methods were sampled from the uniform distribution on (0, 1).
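The normalize-decompose-rescale procedure for non-normalized tensors can be sketched as follows. This is a minimal Python sketch, not the authors' Julia code; `decompose` is a placeholder standing in for Legendre decomposition of a normalized tensor:

```python
import numpy as np

def decompose_unnormalized(T, decompose):
    """Apply a decomposition defined on distributions to a non-normalized
    non-negative tensor: normalize, decompose, then rescale."""
    total = T.sum()            # 1. total sum of the input elements
    Q = decompose(T / total)   # 2.-3. normalize, then decompose the distribution
    return total * Q           # 4. multiply the reconstruction by the total sum

# usage with a placeholder decomposition (the identity) for illustration
T = np.arange(1.0, 25.0).reshape(2, 3, 4)
R = decompose_unnormalized(T, lambda P: P)
assert np.allclose(R, T)
```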

C DATASET DETAIL

We describe the details of each dataset in the following.

Synthetic datasets  For all experiments on synthetic datasets, we vary the target ring rank as (r, ..., r) for r = 2, 3, ..., 9 for the baseline methods.

Real datasets  4DLFD is originally a (9, 9, 512, 512, 3) tensor produced from the 4D Light Field Dataset described in Honauer et al. (2016); its license is the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. We use the dino images and their depth and disparity maps in the training scenes and concatenate them to produce a tensor, which we reshape as (6, 8, 6, 8, 6, 8, 6, 8, 12). For the baseline methods, we chose the target ring rank as (2, 3, 2, 2, 2, 2, 2, 2, 2), (2, 3, 2, 2, 3, 2, 2, 3, 2), (2, 2, 2, 2, 2, 2, 2, 2, 5), (2, 5, 2, 2, 5, 2, 2, 2, 2), (2, 2, 2, 2, 2, 2, 2, 2, 7), (2, 2, 2, 2, 3, 2, 2, 2, 7), and (2, 2, 2, 2, 2, 2, 2, 2, 9).

TT_ChartRes is originally a (736, 736, 31) tensor produced from the TokyoTech 31-band Hyperspectral Image Dataset. We use ChartRes.mat and reshape the tensor as (23, 8, 4, 23, 8, 4, 31). For the baseline methods, we chose the target ring rank as (2, 2, 2, 2, 2, 2, 2), (2, 2, 2, 2, 2, 2, 5), (2, 2, 2, 2, 2, 2, 8), (3, 2, 2, 3, 2, 2, 5), (2, 2, 2, 2, 2, 2, 9), (3, 2, 2, 3, 2, 2, 6), (4, 2, 2, 2, 2, 2, 6), (3, 2, 2, 4, 2, 2, 8), (3, 2, 2, 3, 2, 2, 9), (3, 2, 2, 3, 2, 2, 10), (3, 2, 2, 3, 2, 2, 12), (3, 2, 2, 3, 2, 2, 15), and (3, 2, 2, 3, 2, 2, 16).

TT_Origami and TT_Paint are originally (512, 512, 59) tensors produced from the TokyoTech 59-band Hyperspectral Image Dataset. We use Origami.mat and Paint.mat. In TT_Origami, 0.0016% of the elements were negative, hence we preprocessed TT_Origami by subtracting −0.000764, its smallest value, from all elements to make them non-negative. We reshape the tensors as (8, 8, 8, 8, 8, 8, 59). For the baseline methods, we chose the target ring rank as (2, 2, 2, 2, 2, 2, r) for r = 2, 3, ..., 15.
These reshapings reduce the computational complexity, as described in Section 2.6, allowing us to complete all the experiments in a reasonable time.
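The mode splitting used above is a plain reshape; for example, the (736, 736, 31) hyperspectral tensor becomes a 7th-order tensor because 23 × 8 × 4 = 736 (a sketch of ours):

```python
import numpy as np

T = np.zeros((736, 736, 31))            # original hyperspectral tensor shape
T7 = T.reshape(23, 8, 4, 23, 8, 4, 31)  # split each spatial mode: 23 * 8 * 4 = 736
assert T7.shape == (23, 8, 4, 23, 8, 4, 31)
```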

D PROJECTION THEORY IN INFORMATION GEOMETRY

We explain concepts of information geometry used in this study, including natural parameters, expectation parameters, model flatness, and convexity of optimization. In the following discussion, we consider only discrete probability distributions.

For the examples of θ in the paragraph on the Möbius function, we assume $P_{0,i_2} = P_{i_1,0} = 1$ and $P_{i_1,i_2,0} = P_{i_1,0,i_3} = P_{0,i_2,i_3} = 1$. Note that, to identify the value of $\theta_{i_1,\dots,i_D}$, we need only the entries of P whose indices are $i_d$ or $i_d - 1$ in each mode. In the same way, using equation 3, we can find the distribution P from the expectation parameter η. For example, if D = 2, 3, it holds that
$$P_{i_1,i_2} = \eta_{i_1,i_2} - \eta_{i_1+1,i_2} - \eta_{i_1,i_2+1} + \eta_{i_1+1,i_2+1},$$
$$\begin{aligned}
P_{i_1,i_2,i_3} = \eta_{i_1,i_2,i_3} &- \eta_{i_1+1,i_2,i_3} - \eta_{i_1,i_2+1,i_3} - \eta_{i_1,i_2,i_3+1} \\
&+ \eta_{i_1+1,i_2+1,i_3} + \eta_{i_1+1,i_2,i_3+1} + \eta_{i_1,i_2+1,i_3+1} - \eta_{i_1+1,i_2+1,i_3+1},
\end{aligned}$$
where we assume $\eta_{I_1+1,i_2} = \eta_{i_1,I_2+1} = 0$ and $\eta_{I_1+1,i_2,i_3} = \eta_{i_1,I_2+1,i_3} = \eta_{i_1,i_2,I_3+1} = 0$.
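Assuming η is given by the tail sums of P (equation 3), the D = 2 inversion formula with the convention $\eta_{I_1+1,i_2} = \eta_{i_1,I_2+1} = 0$ can be verified numerically (our sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.uniform(size=(4, 5))
P /= P.sum()                     # a normalized distribution

# expectation parameters as tail sums: eta[i, j] = sum_{k >= i, l >= j} P[k, l]
eta = np.flip(np.flip(P, 0).cumsum(0), 0)
eta = np.flip(np.flip(eta, 1).cumsum(1), 1)

# pad so that eta_{I1+1, j} = eta_{i, I2+1} = 0
eta_pad = np.zeros((5, 6))
eta_pad[:4, :5] = eta

# recover P: P[i, j] = eta[i, j] - eta[i+1, j] - eta[i, j+1] + eta[i+1, j+1]
P_rec = eta_pad[:4, :5] - eta_pad[1:, :5] - eta_pad[:4, 1:] + eta_pad[1:, 1:]
assert np.allclose(P, P_rec)
```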

