NEURAL NONNEGATIVE CP DECOMPOSITION FOR HIERARCHICAL TENSOR ANALYSIS

Abstract

There is a significant demand for topic modeling on large-scale data with complex multi-modal structure in applications such as multi-layer network analysis, temporal document classification, and video data analysis; frequently this multi-modal data has latent hierarchical structure. We propose a new hierarchical nonnegative CANDECOMP/PARAFAC (CP) decomposition (hierarchical NCPD) model and a training method, Neural NCPD, for performing hierarchical topic modeling on multi-modal tensor data. Neural NCPD utilizes a neural network architecture and backpropagation to mitigate error propagation through hierarchical NCPD.

1. INTRODUCTION

The recent explosion in the collection and availability of data has led to an unprecedented demand for scalable data analysis techniques. Furthermore, data that has a multi-modal tensor format has become ubiquitous across numerous fields (Cichocki et al., 2009) . The need to reduce redundant dimensions (across modes) and to identify meaningful latent trends within data has rightly become an integral focus of research within signal processing and computer science. An important application of these dimension-reduction techniques is topic modeling, the task of identifying latent topics and themes of a dataset in an unsupervised or partially supervised approach. A popular topic modeling approach for matrix data is the dimension-reduction technique nonnegative matrix factorization (NMF) (Lee & Seung, 1999) , which is generalized to multi-modal tensor data by the nonnegative CP decomposition (NCPD) (Carroll & Chang, 1970; Harshman et al., 1970) . These models identify r latent topics within the data; here the rank r is a user-defined parameter that can be challenging to select without a priori knowledge or a heuristic selection procedure. In topic modeling applications, one often additionally wishes to understand the hierarchical topic structure (i.e., how the topics are naturally related and combine into supertopics). For matrices (tensors), a naive approach is to apply NMF (NCPD) first with rank r and then again with rank j < r, and simply identify the j supertopics as linear (multilinear) combinations of the original r subtopics. However, due to the nonconvexity of the NMF (NCPD) objective function, the supertopics identified in this way need not be linearly (multi-linearly) related to the subtopics. For this reason, hierarchical models which enforce these relationships between subtopics and supertopics have become a popular direction of research. 
A challenge of these models is that the nonconvexity of the model at each level of hierarchy can yield cascading error through the layers of models; several works have proposed techniques for mitigating this cascade of error (Flenner & Hunter, 2018; Trigeorgis et al., 2016; Le Roux et al., 2015; Sun et al., 2017; Gao et al., 2019). In this work, we propose a hierarchical NCPD model and Neural NCPD, an algorithm for training this model which exploits backpropagation techniques to mitigate the effects of error introduced at earlier (subtopic) layers of hierarchy propagating downstream to later (supertopic) layers. This approach allows us to (1) explore the topics learned at different ranks simultaneously, and (2) illustrate the hierarchical relationship of topics learned at different tensor decomposition ranks.

Notation. We follow the notational conventions of Goodfellow et al. (2016); e.g., tensor $\boldsymbol{\mathsf{X}}$, matrix $\mathbf{X}$, vector $\mathbf{x}$, and (integer or real) scalar $x$. In all models, we use variable $r$ (with superscripts denoting layer of hierarchical models) to denote model rank and use $j$ when indexing through rank-one components. In all tensor decomposition models, we use $k$ to denote the order (number of modes) of the tensor and use $i$ when indexing through modes of the tensor. In all hierarchical models, we use $L$ to denote the number of layers in the model and use $\ell$ to index layers. We let $\otimes$ denote the vector outer product and adopt the CP decomposition notation

$$[\![\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_k]\!] \equiv \sum_{j=1}^{r} \mathbf{x}^{(1)}_j \otimes \mathbf{x}^{(2)}_j \otimes \cdots \otimes \mathbf{x}^{(k)}_j,$$

where $\mathbf{x}^{(i)}_j$ is the $j$th column of the $i$th factor matrix $\mathbf{X}_i$ (Kolda & Bader, 2009).

Contributions. Our main contributions are two-fold. First, we propose a novel hierarchical nonnegative tensor decomposition model that we denote hierarchical NCPD (HNCPD). Our model treats all tensor modes alike and the output is not affected by the order of the modes in the tensor representation; this is a property not shared by other hierarchical tensor decomposition models such as that of Cichocki et al. (2007a). Second, we propose an effective neural network-inspired training method that we call Neural NCPD. This method builds upon the Neural NMF method proposed in Gao et al. (2019), but is not a direct extension; Neural NCPD consists of a branch of Neural NMF for each tensor mode, but the backpropagation scheme must be adapted for factorization information flow between branches.

Organization. In the remainder of Section 1, we present related work on tensor decompositions and training methods. In Section 2, we present our main contributions, hierarchical NCPD and the Neural NCPD method. In Section 3, we test Neural NCPD on real and synthetic data, and offer some brief conclusions in Section 4. We include justification of several computational details of our method and further experimental results in Appendix A.
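As an illustration of the CP notation introduced in the Notation paragraph, the following is a minimal NumPy sketch that evaluates $[\![\mathbf{X}_1, \cdots, \mathbf{X}_k]\!]$ as a sum of rank-one outer products. The function name `cp_to_tensor` and the toy sizes are assumptions made for this example and are not part of the paper.

```python
import numpy as np

def cp_to_tensor(factors):
    """Evaluate [[X_1, ..., X_k]] as a sum of r rank-one outer products."""
    rank = factors[0].shape[1]
    tensor = np.zeros(tuple(f.shape[0] for f in factors))
    for j in range(rank):
        # Outer product of the j-th columns of all factor matrices.
        component = factors[0][:, j]
        for f in factors[1:]:
            component = np.multiply.outer(component, f[:, j])
        tensor += component
    return tensor

rng = np.random.default_rng(0)
factors = [rng.random((n, 3)) for n in (4, 5, 6)]  # toy sizes, rank r = 3
T = cp_to_tensor(factors)
print(T.shape)  # (4, 5, 6)
```

For an order-three tensor the same quantity can be written in one line as `np.einsum('ar,br,cr->abc', *factors)`; the explicit loop is kept here only because it mirrors the summation in the definition.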

1.1. RELATED WORK

In this section, we introduce NMF, hierarchical NMF, the Neural NMF method, and NCPD, and then summarize some relevant work.

Nonnegative Matrix Factorization (NMF). Given a nonnegative matrix $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2}_{\geq 0}$ and a desired dimension $r \in \mathbb{N}$, NMF seeks to decompose $\mathbf{X}$ into a product of two low-dimensional nonnegative matrices, the dictionary matrix $\mathbf{A} \in \mathbb{R}^{n_1 \times r}_{\geq 0}$ and the representation matrix $\mathbf{S} \in \mathbb{R}^{r \times n_2}_{\geq 0}$, so that

$$\mathbf{X} \approx \mathbf{A}\mathbf{S} = \sum_{j=1}^{r} \mathbf{a}_j \otimes \mathbf{s}_j,$$

where $\mathbf{a}_j$ is a column (topic) of $\mathbf{A}$ and $\mathbf{s}_j$ is a row of $\mathbf{S}$. Typically, $r$ is chosen such that $r < \min\{n_1, n_2\}$ to reduce the dimension of the original data matrix or reveal latent themes in the data. Each column of $\mathbf{S}$ provides the approximation of the respective column in $\mathbf{X}$ in the lower-dimensional space spanned by the columns of $\mathbf{A}$. The nonnegativity of the NMF factor matrices yields clear interpretability; thus, NMF has found application in document clustering (Xu et al., 2003; Gaussier & Goutte, 2005; Shahnaz et al., 2006), and image processing and computer vision (Lee & Seung, 1999; Guillamet & Vitria, 2002; Hoyer, 2002), amongst others. Popular training methods include multiplicative updates (Lee & Seung, 1999; 2001; Lee et al., 2009), projected gradient descent (Lin, 2007), and alternating least-squares (Kim et al., 2008; Kim & Park, 2008).

Hierarchical NMF (HNMF). HNMF seeks to illuminate hierarchical structure by recursively factorizing the NMF $\mathbf{S}$ matrices; see e.g., (Cichocki et al., 2009). We first apply NMF with rank $r^{(0)}$ and then apply NMF with rank $r^{(1)}$ to the $\mathbf{S}$ matrix, collecting the $r^{(0)}$ subtopics into $r^{(1)}$ supertopics. HNMF with $L$ layers approximately factors the data matrix as

$$\mathbf{X} \approx \mathbf{A}^{(0)}\mathbf{S}^{(0)} \approx \mathbf{A}^{(0)}\mathbf{A}^{(1)}\mathbf{S}^{(1)} \approx \cdots \approx \mathbf{A}^{(0)}\mathbf{A}^{(1)} \cdots \mathbf{A}^{(L-1)}\mathbf{S}^{(L-1)}. \quad (3)$$

Here the $\mathbf{A}^{(i)}$ matrix represents how the subtopics at layer $i$ collect into the supertopics at layer $i + 1$. Note that as $L$ increases, the error $\|\mathbf{X} - \mathbf{A}^{(0)}\mathbf{A}^{(1)} \cdots \mathbf{A}^{(L-1)}\mathbf{S}^{(L-1)}\|_F$ necessarily increases as error propagates with each step. As a result, significant error is introduced when $L$ is large. Choosing $r^{(0)}, r^{(1)}, \cdots, r^{(L-1)}$ in practice proves difficult as the number of possibilities grows combinatorially.

Neural NMF (NNMF). In the previous work of Gao et al. (2019), the authors developed an iterative algorithm for training HNMF that uses backpropagation techniques to mitigate cascading error through the layers. To form this hierarchical factorization, the Neural NMF algorithm uses a neural net architecture. Each layer $\ell$ of the network has weight matrix $\mathbf{A}^{(\ell)}$. In the forward propagation step, the network accepts a matrix $\mathbf{S}^{(\ell-1)}$, calculates the nonnegative least-squares solution

$$\mathbf{S}^{(\ell)} = q(\mathbf{A}^{(\ell)}, \mathbf{S}^{(\ell-1)}) \equiv \arg\min_{\mathbf{S} \geq 0} \|\mathbf{S}^{(\ell-1)} - \mathbf{A}^{(\ell)}\mathbf{S}\|_F, \quad (4)$$

and sends the matrix $\mathbf{S}^{(\ell)}$ to the next layer. In the backpropagation step, the algorithm calculates gradients and updates the weights of the network, which in this case are the $\mathbf{A}$ matrices.

Nonnegative CP Decomposition (NCPD). The NCPD generalizes NMF to higher-order tensors; specifically, given an order-$k$ tensor $\boldsymbol{\mathsf{X}} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_k}_{\geq 0}$ and a fixed integer $r$, the approximate NCPD of $\boldsymbol{\mathsf{X}}$ seeks $\mathbf{X}_1 \in \mathbb{R}^{n_1 \times r}_{\geq 0}, \mathbf{X}_2 \in \mathbb{R}^{n_2 \times r}_{\geq 0}, \cdots, \mathbf{X}_k \in \mathbb{R}^{n_k \times r}_{\geq 0}$ so that

$$\boldsymbol{\mathsf{X}} \approx [\![\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_k]\!].$$

The $\mathbf{X}_i$ matrices will be referred to as the NCPD factor matrices. A nonnegative approximation with fixed $r$ is obtained by approximately minimizing the reconstruction error between $\boldsymbol{\mathsf{X}}$ and the NCPD reconstruction. This decomposition has found numerous applications in the area of dynamic topic modeling where one seeks to discover topic emergence and evolution (Cichocki et al., 2007b; Traoré et al., 2018; Saha & Sindhwani, 2012). Methods for training NMF models can often be generalized to NCPD; for example, multiplicative updates (Welling & Weber, 2001) and alternating least-squares (Kim et al., 2014).

Other Related Work. Other works have sought to mitigate error propagation in HNMF models with techniques inspired by neural networks (Trigeorgis et al., 2016; Le Roux et al., 2015; Sun et al., 2017; Flenner & Hunter, 2018). Additionally, previous works have developed hierarchical tensor decomposition models and methods (Vasilescu & Kim, 2019; Song et al., 2013; Grasedyck, 2010). The model most similar to ours is that of Cichocki et al. (2007a), which we refer to as hierarchical nonnegative tensor factorization (HNTF). This model consists of a sequence of NCPDs, where a factor matrix for one mode is held constant, the remaining factor matrices produce the tensor which is decomposed at the second layer, and this decomposition is combined with the fixed matrix from the previous layer. We note that HNTF is dependent upon the ordering of the modes, and specifically which data mode appears first in the representation of the tensor. We refer to 'HNTF-i' as HNTF applied to the representation of the tensor where the modes are reordered with mode $i$ first.
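The sequential HNMF model described in this section can be sketched in a few lines of NumPy. The sketch below uses multiplicative updates for each layer; the function names `nmf_mu` and `hnmf`, the iteration count, the small constant `eps`, and the toy data are all assumptions made for illustration, and this is the plain sequential scheme rather than Neural NMF.

```python
import numpy as np

def nmf_mu(X, r, n_iter=500, eps=1e-10, seed=0):
    """Rank-r NMF X ~ A S via multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], r)) + eps
    S = rng.random((r, X.shape[1])) + eps
    for _ in range(n_iter):
        S *= (A.T @ X) / (A.T @ A @ S + eps)   # update representation matrix
        A *= (X @ S.T) / (A @ S @ S.T + eps)   # update dictionary matrix
    return A, S

def hnmf(X, ranks, **kwargs):
    """Sequential HNMF: factor X, then recursively factor each S matrix."""
    A_list, S_list, S = [], [], X
    for r in ranks:
        A, S = nmf_mu(S, r, **kwargs)
        A_list.append(A)
        S_list.append(S)
    return A_list, S_list

rng = np.random.default_rng(1)
X = rng.random((20, 4)) @ rng.random((4, 30))   # toy nonnegative data of rank 4
A_list, S_list = hnmf(X, ranks=[4, 2])          # r^(0) = 4, r^(1) = 2

# Relative error of the approximation at each layer.
for depth in range(len(A_list)):
    approx = np.linalg.multi_dot(A_list[: depth + 1] + [S_list[depth]])
    print(depth, np.linalg.norm(X - approx) / np.linalg.norm(X))
```

Because each layer is fit once and then frozen, any error made at layer 0 is inherited by every later layer; this is the cascading error that the neural training methods discussed above aim to reduce.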

2. OUR CONTRIBUTIONS

In this section, we present our two main contributions. We first describe the proposed hierarchical NCPD (HNCPD) model, and then propose a training method, Neural NCPD, for the model.

2.1. HIERARCHICAL NCPD (HNCPD)

Given an order-$k$ tensor $\boldsymbol{\mathsf{X}} \in \mathbb{R}^{n_1 \times \cdots \times n_k}$, HNCPD consists of an initial rank-$r$ NCPD layer with factor matrices $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k$, each with $r$ columns, and an HNMF with ranks $r^{(0)}, r^{(1)}, \cdots, r^{(L-2)}$ for each of these factor matrices; that is, for each $\mathbf{X}_i$ at layer $\ell$, we factorize $\mathbf{X}_i$ as

$$\mathbf{X}_i \approx \tilde{\mathbf{X}}_i \equiv \mathbf{A}_i^{(0)}\mathbf{A}_i^{(1)} \cdots \mathbf{A}_i^{(\ell-2)}\mathbf{S}_i^{(\ell-2)} \quad (6)$$

where $\mathbf{A}_i^{(\ell)}$ has $r^{(\ell)}$ columns; see Figure 1 for a visualization. Thus, HNCPD consists of tensor approximations

$$\boldsymbol{\mathsf{X}} \approx [\![\mathbf{A}_1^{(0)} \cdots \mathbf{A}_1^{(\ell-2)}\mathbf{S}_1^{(\ell-2)}, \cdots, \mathbf{A}_k^{(0)} \cdots \mathbf{A}_k^{(\ell-2)}\mathbf{S}_k^{(\ell-2)}]\!].$$

To access hierarchical structure between tensor topics at each layer, we need to utilize information in the $\mathbf{S}_i^{(\ell)}$ matrices for all modes. To simplify this hierarchical structure, we develop an approximation scheme such that the hierarchical topic structure for all modes is given by a single matrix. For simplicity, we first consider the two layer case. We note that

$$[\![\tilde{\mathbf{X}}_1, \tilde{\mathbf{X}}_2, \cdots, \tilde{\mathbf{X}}_k]\!] = \sum_{1 \leq j_1, j_2, \ldots, j_k \leq r^{(0)}} \alpha_{j_1, j_2, \ldots, j_k} \, (\mathbf{A}_1^{(0)})_{:,j_1} \otimes (\mathbf{A}_2^{(0)})_{:,j_2} \otimes \cdots \otimes (\mathbf{A}_k^{(0)})_{:,j_k} \quad (8)$$

where $\alpha_{j_1, j_2, \ldots, j_k} = \sum_{p=1}^{r} (\mathbf{S}_1^{(0)})_{j_1,p} (\mathbf{S}_2^{(0)})_{j_2,p} \cdots (\mathbf{S}_k^{(0)})_{j_k,p}$; we justify this statement in Appendix A. We refer to decomposition summands in (8) where $j_1 = j_2 = \cdots = j_k$ as vector outer products of same-index factor matrix topics, and all other summands as vector outer products of different-index factor matrix topics. To identify clear hierarchy, we avoid these different-index column outer products.

Figure 1: A visualization of a two-layer HNCPD model. The colored edges of the order-three tensor, $\boldsymbol{\mathsf{X}}$, represent the three modes.

The approximation scheme computes matrices $\tilde{\mathbf{A}}_i^{(0)}$ whose columns visualize the desired $r^{(0)}$ NCPD topics along each mode while avoiding different-index column outer products in the decomposition. We approximate the summation (8) by replacing all summands that include column $p_2$ of $\mathbf{A}_k^{(0)}$ with a single rank-one vector outer product,

$$(\tilde{\mathbf{A}}_1^{(0)})_{:,p_2} \otimes (\tilde{\mathbf{A}}_2^{(0)})_{:,p_2} \otimes \cdots \otimes (\tilde{\mathbf{A}}_{k-1}^{(0)})_{:,p_2} \otimes (\mathbf{A}_k^{(0)})_{:,p_2}.$$

To minimize error introduced by this approximation, we transform factor matrices $\mathbf{A}_i^{(0)}$ for $i \neq k$ to $\tilde{\mathbf{A}}_i^{(0)}$ by collecting into $(\tilde{\mathbf{A}}_i^{(0)})_{:,p_2}$ the approximate contribution of all columns of $\mathbf{A}_i^{(0)}$ in vector outer products with $(\mathbf{A}_k^{(0)})_{:,p_2}$ in (8). That is, for $1 \leq p_1, p_2 \leq r^{(0)}$ and $1 \leq i < k$, let $\mathbf{W}_i \in \mathbb{R}^{r^{(0)} \times r^{(0)}}$ be a matrix with

$$(\mathbf{W}_i)_{p_1,p_2} = \sum_{\substack{1 \leq j_1, j_2, \ldots, j_k \leq r^{(0)} \\ j_i = p_1, \; j_k = p_2}} \alpha_{j_1, j_2, \ldots, j_k} \quad \text{and} \quad \tilde{\mathbf{A}}_i^{(0)} = \mathbf{A}_i^{(0)}\mathbf{W}_i.$$

Furthermore, we can identify the topic hierarchy from the $\mathbf{S}_k^{(0)}$ matrix.

We can generalize this process to later layers by noting that we can group the $\mathbf{A}_i$ matrices together, so $\mathbf{X}_i \approx \mathbf{A}_i^{(0)}\mathbf{A}_i^{(1)} \cdots \mathbf{A}_i^{(\ell)}\mathbf{S}_i^{(\ell)}$. Thus, we can treat this approximation as above, replacing $\mathbf{A}_i^{(0)}$ with the product $\mathbf{A}_i^{(0)}\mathbf{A}_i^{(1)} \cdots \mathbf{A}_i^{(\ell)}$.

As in HNMF, errors in earlier layers can propagate through to later layers and produce highly suboptimal approximations. Challenges encountered during computation of HNMF are exacerbated in an HNCPD model. For this reason, we exploit approaches developed for HNMF in Gao et al. (2019) in our training method Neural NCPD. Furthermore, if the factorizations are applied sequentially, the computation of the HNMF factor matrices for $\mathbf{X}_i$ is independent of that for $\mathbf{X}_j$; Neural NCPD allows factor matrices in (6) for all other modes to influence the factorization of a given mode.
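The expansion (8) and the matrices $\mathbf{W}_i$ can be checked numerically in the two-layer, order-three case. The following NumPy sketch is an illustration only: the sizes, the random factors, and all variable names are assumptions, and the mode $k = 3$ is the one whose columns are left untransformed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, r0 = (6, 5, 4), 3, 2                  # toy sizes; r is the NCPD rank, r0 = r^(0)
A = [rng.random((n_i, r0)) for n_i in n]    # A_i^(0), shape n_i x r0
S = [rng.random((r0, r)) for _ in n]        # S_i^(0), shape r0 x r

# alpha[j1, j2, j3] = sum_p S_1[j1, p] * S_2[j2, p] * S_3[j3, p]
alpha = np.einsum('ap,bp,cp->abc', *S)

# Left side of (8): CP tensor built from the products A_i^(0) S_i^(0).
lhs = np.einsum('ip,jp,kp->ijk', *(A_i @ S_i for A_i, S_i in zip(A, S)))
# Right side of (8): alpha-weighted sum over all combinations of columns.
rhs = np.einsum('abc,ia,jb,kc->ijk', alpha, *A)
print(np.allclose(lhs, rhs))

# W_i sums alpha over every index except j_i (rows) and j_k (columns).
W1 = alpha.sum(axis=1)    # (W_1)[p1, p2] = sum_{j2} alpha[p1, j2, p2]
W2 = alpha.sum(axis=0)    # (W_2)[p1, p2] = sum_{j1} alpha[j1, p1, p2]
A_tilde = [A[0] @ W1, A[1] @ W2]            # transformed factors for modes 1 and 2
```

The columns of `A_tilde[0]`, `A_tilde[1]`, and `A[2]` with a common index are then the same-index topics that the scheme uses for visualization.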

2.2. NEURAL NCPD

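Our iterative method consists of two subroutines, a forward-propagation and a backpropagation. In Algorithms 1 and 2, we display the pseudocode for our proposed method. Following this learning process for the factor matrices in (6), we apply the approximation scheme described in Section 2.1 to the learned factor matrices to visualize the hierarchical structure of the computed HNCPD model.

In the forward-propagation step, for each mode $i$ and layer $\ell$ we compute

$$\mathbf{S}_i^{(\ell)} = q(\mathbf{A}_i^{(\ell)}, \mathbf{S}_i^{(\ell-1)}),$$

where $q$ is as defined in equation 4 and $\mathbf{S}_i^{(-1)} = \mathbf{X}_i$, producing the matrices $\mathbf{S}_i^{(0)}, \ldots, \mathbf{S}_i^{(L-2)}$ for $1 \leq i \leq k$. The function $q(\mathbf{A}^{(\ell)}, \mathbf{S}^{(\ell-1)})$, as a nonnegative least-squares problem, can be calculated via any convex optimization solver; we utilize an implementation of the Hanson-Lawson algorithm (Lawson & Hanson, 1995). Finally, we pass the $\mathbf{A}_i^{(\ell)}$ and $\mathbf{S}_i^{(\ell)}$ matrices and $\boldsymbol{\mathsf{X}}$ into a loss function, which we differentiate and backpropagate.

Algorithm 1 Forward Propagation
  procedure FORWARDPROP($\{\mathbf{X}_i\}_{i=1}^{k}$, $\{\mathbf{A}_i^{(\ell)}\}_{i=1,\ell=0}^{k,L-2}$)
    for $i = 1, \cdots, k$ do
      for $\ell = 0, \cdots, L-2$ do
        $\mathbf{S}_i^{(\ell)} \leftarrow q(\mathbf{A}_i^{(\ell)}, \mathbf{S}_i^{(\ell-1)})$    ▷ see equation 4

Algorithm 2 Neural NCPD
  Input: Tensor $\boldsymbol{\mathsf{X}} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_k}$, cost $C$
  $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k \leftarrow \text{NCPD}(\boldsymbol{\mathsf{X}})$, initialize $\{\mathbf{A}_i^{(\ell)}\}_{i=1,\ell=0}^{k,L-2}$
  for iterations $= 1, \ldots, T$ do
    ForwardProp($\{\mathbf{X}_i\}_{i=1}^{k}$, $\{\mathbf{A}_i^{(\ell)}\}_{i=1,\ell=0}^{k,L-2}$)    ▷ Alg. 1
    for $i = 1, \cdots, k$, $\ell = 0, \cdots, L-2$ do
      $\mathbf{A}_i^{(\ell)} \leftarrow \left(\text{optimizer}\left(\mathbf{A}_i^{(\ell)}, \frac{\partial C}{\partial \mathbf{A}_i^{(\ell)}}\right)\right)_+$    ▷ any first-order method

Backpropagation. Our goal is to differentiate our cost function $C$ with respect to the weights in each layer, the $\mathbf{A}_i^{(\ell)}$ matrices, and backpropagate. This algorithm accepts any first-order optimization method, denoted optimizer (e.g., SGD (Robbins & Monro, 1951), Adam (Kingma & Ba, 2014)), but projects the updated weight matrix into the positive orthant to maintain nonnegativity. For the NCPD task, the most natural loss function is the reconstruction loss,

$$C = \|\boldsymbol{\mathsf{X}} - [\![\tilde{\mathbf{X}}_1, \tilde{\mathbf{X}}_2, \cdots, \tilde{\mathbf{X}}_k]\!]\|_F. \quad (11)$$

In order to encourage optimal fit at each layer, we also introduce a loss function that we refer to as energy loss. First we denote the approximation of $\boldsymbol{\mathsf{X}}$ at layer $\ell$ of our network as

$$\tilde{\boldsymbol{\mathsf{X}}}_\ell = [\![\mathbf{A}_1^{(0)}\mathbf{A}_1^{(1)} \cdots \mathbf{A}_1^{(\ell-2)}\mathbf{S}_1^{(\ell-2)}, \; \mathbf{A}_2^{(0)}\mathbf{A}_2^{(1)} \cdots \mathbf{A}_2^{(\ell-2)}\mathbf{S}_2^{(\ell-2)}, \; \ldots, \; \mathbf{A}_k^{(0)}\mathbf{A}_k^{(1)} \cdots \mathbf{A}_k^{(\ell-2)}\mathbf{S}_k^{(\ell-2)}]\!].$$

Then, we calculate energy loss as

$$E = \|\boldsymbol{\mathsf{X}} - [\![\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_k]\!]\|_F + \sum_{\ell=0}^{L-2} \|\boldsymbol{\mathsf{X}} - \tilde{\boldsymbol{\mathsf{X}}}_\ell\|_F. \quad (12)$$

The derivatives of $q(\mathbf{A}, \mathbf{X})$ with respect to $\mathbf{A}$ and $\mathbf{X}$ are derived and exploited to differentiate a generic cost function for the hierarchical NMF model in Gao et al. (2019); here we summarize these derivatives and illustrate how to combine them with simple multilinear algebra for HNCPD. Gao et al. (2019) show that, if $\left(\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_1)}}\right)_{\mathbf{S}}$ is the derivative of $C$ with respect to $\mathbf{A}_i^{(\ell_1)}$ holding the $\mathbf{S}$ matrices constant, then

$$\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_1)}} = \left(\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_1)}}\right)_{\mathbf{S}} + \sum_{\ell_1 \leq \ell_2 \leq L-2} \; \sum_{1 \leq j \leq r} \mathbf{U}_i^{(\ell_1, \ell_2), j},$$

where $\mathbf{U}_i^{(\ell_1, \ell_2), j}$ relates $C$ to $\mathbf{A}_i^{(\ell_1)}$ through $\mathbf{S}_i^{(\ell_2)}$ and $\mathbf{S}_i^{(\ell_1)}$, is defined column-wise ($j$), and depends upon $\left(\frac{\partial C}{\partial \mathbf{S}_i^{(\ell_2)}}\right)^{*}$, the derivative of $C$ with respect to $\mathbf{S}_i^{(\ell_2)}$ holding $\mathbf{S}_i^{(\ell_2+1)}, \ldots, \mathbf{S}_i^{(L-2)}$ constant. The definition of $\mathbf{U}_i^{(\ell_1, \ell_2), j}$ is given in Gao et al. (2019) and utilizes, via the chain rule, the partial derivative of $q(\mathbf{A}_i^{(\ell)}, \mathbf{S}_i^{(\ell-1)})$ for all $\ell \in [\ell_1, \ell_2]$.

Example. The derivative of the previously defined cost functions, or of other differentiable cost functions, can be calculated using these results of Gao et al. (2019) and some simple multilinear algebra. As an example, we directly compute the backpropagation step for the reconstruction loss function $C$ given in equation 11. Let $\mathbf{X}_{(i)}$ be the mode-$i$ matricized version of $\boldsymbol{\mathsf{X}}$, and define $\mathbf{H}_i = \tilde{\mathbf{X}}_k \odot \cdots \odot \tilde{\mathbf{X}}_{i+1} \odot \tilde{\mathbf{X}}_{i-1} \odot \cdots \odot \tilde{\mathbf{X}}_1$, where $\odot$ denotes the Khatri-Rao product (see e.g., (Kolda & Bader, 2009)). Then we have that

$$\left(\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_j)}}\right)_{\mathbf{S}} = 2\left(\mathbf{A}_i^{(0)}\mathbf{A}_i^{(1)} \cdots \mathbf{A}_i^{(\ell_j-1)}\right)^{\top}\left(\mathbf{X}_{(i)} - \tilde{\mathbf{X}}_i\mathbf{H}_i^{\top}\right)\mathbf{H}_i\left(\mathbf{A}_i^{(\ell_j+1)} \cdots \mathbf{A}_i^{(L-2)}\mathbf{S}_i^{(L-2)}\right)^{\top},$$

and

$$\left(\frac{\partial C}{\partial \mathbf{S}_i^{(\ell_j)}}\right)^{*} = 2\left(\mathbf{A}_i^{(0)}\mathbf{A}_i^{(1)} \cdots \mathbf{A}_i^{(\ell_j)}\right)^{\top}\left(\mathbf{X}_{(i)} - \tilde{\mathbf{X}}_i\mathbf{H}_i^{\top}\right)\mathbf{H}_i.$$

These derivatives are justified in Appendix A. With equation 14, these derivatives are sufficient to calculate the partial derivative of $C$ with respect to any $\mathbf{A}$ matrix.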
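The forward-propagation step of Algorithm 1 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the nonnegative least-squares function $q$ is approximated here by a simple projected-gradient iteration (a stand-in chosen to keep the example dependency-free, not the Hanson-Lawson solver used in the paper), and the names `nnls_pgd` and `forward_prop`, the iteration count, and the toy sizes are hypothetical.

```python
import numpy as np

def nnls_pgd(A, B, n_iter=2000):
    """Approximate q(A, B) = argmin_{S >= 0} ||B - A S||_F by projected gradient."""
    AtA, AtB = A.T @ A, A.T @ B
    step = 1.0 / np.linalg.norm(AtA, 2)     # 1 / Lipschitz constant of the gradient
    S = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        # Gradient step on 0.5 * ||B - A S||_F^2, then projection onto S >= 0.
        S = np.maximum(S - step * (AtA @ S - AtB), 0.0)
    return S

def forward_prop(X_factors, A_weights):
    """For each mode i, compute S_i^(0), ..., S_i^(L-2) with S_i^(-1) = X_i."""
    S_all = []
    for X_i, A_i in zip(X_factors, A_weights):
        S_prev, S_i = X_i, []
        for A in A_i:                       # layers l = 0, ..., L-2
            S_prev = nnls_pgd(A, S_prev)
            S_i.append(S_prev)
        S_all.append(S_i)
    return S_all

rng = np.random.default_rng(0)
n, r, r0, r1 = (8, 7, 6), 5, 4, 2           # toy sizes: NCPD rank r, then r^(0), r^(1)
X_factors = [rng.random((n_i, r)) for n_i in n]
A_weights = [[rng.random((n_i, r0)), rng.random((r0, r1))] for n_i in n]
S_all = forward_prop(X_factors, A_weights)
print([S.shape for S in S_all[0]])          # [(4, 5), (2, 5)]
```

In a full training loop, the outputs `S_all` and the weights `A_weights` would be passed to the chosen loss, and each weight matrix would be updated by a first-order step followed by projection onto the nonnegative orthant, as in Algorithm 2.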

3. EXPERIMENTAL RESULTS

We test Neural NCPD on three datasets: one synthetic, one video, and one collected from Twitter. The synthetic dataset is constructed as a simple block tensor with hierarchical structure. The Twitter dataset consists of tweets from political candidates during the 2016 United States presidential election (Littman et al., 2016). We pull the video, a time-lapse of a forest over the span of one year, from (Solheim). We also compare Neural NCPD to Standard NCPD, in which we perform an independent NCPD decomposition at each rank, and to Standard HNCPD, in which we perform NCPD first on the full dataset, and apply HNMF to the fixed factor matrices; here we sequentially apply NMF to the factor matrices using multiplicative updates and do not update previous layer factorizations as in Neural NCPD. In all experiments, we use Tensorly (Kossaifi et al., 2018) for Standard NCPD calculations and to initialize the NCPD layer of our hierarchical NCPD, and in Neural NCPD we do not backpropagate to this layer as the initialization has usually found a stationary point. We use Energy Loss (Eq. 12) for all experiments to encourage fit at every layer. Because we do not backpropagate to the initial factor matrices, the first term in (Eq. 12) is fixed. For the Twitter and video experiments we use the approximation scheme of Section 2.1 to recover the relationship between the columns of the $\mathbf{A}_i^{(\ell)}$ matrices and visualize the $\tilde{\mathbf{A}}_i^{(\ell)}$ matrices.

3.1. EXPERIMENT ON SYNTHETIC DATA

We test the Neural NCPD algorithm first using a synthetic dataset. This dataset is a rank seven tensor of size 40 × 40 × 40 with positive noise added to each entry; we generate noise as $n = |g|$ where $g \sim \mathcal{N}(0, \sigma^2)$. To generate this dataset, we begin with the all-zeros tensor and create three large nonoverlapping blocks with value 1, and then overlay each block with either two or three additional blocks with value 3. We display this tensor with two levels of noise at the left of Figure 2; here we plot projections of all tensors (and all approximations) along the third mode; that is, we construct a matrix with entries equal to the largest entries of the mode-three fibers (see e.g., (Kolda & Bader, 2009) for relevant definitions). The projections on the remaining two modes are included in Appendix A, and are all similar to the third mode.

Table 1: Relative reconstruction loss, $C_{\text{rel}}$, on a synthetic dataset for Neural NCPD, Standard HNCPD, and HNTF with two different levels of noise. We list the loss of the approximation at ranks $r = 7$, $r^{(0)} = 5$, and $r^{(1)} = 3$. The results of HNTF are similar for all orderings of the modes, so we list only one. Columns: Method; $\sigma^2 = 0.05$ at $r = 7$, $r^{(0)} = 5$, $r^{(1)} = 3$; $\sigma^2 = 0.5$ at $r = 7$, $r^{(0)} = 5$, $r^{(1)} = 3$.

The results are shown in Figure 2 and Table 1; we present the relative reconstruction loss $C_{\text{rel}} = \|\boldsymbol{\mathsf{X}} - [\![\tilde{\mathbf{X}}_1, \tilde{\mathbf{X}}_2, \cdots, \tilde{\mathbf{X}}_k]\!]\|_F / \|\boldsymbol{\mathsf{X}}\|_F$. For each level of noise, we display the rank 7 approximation shared by all methods, the rank 5 and rank 3 approximations, and the $\mathbf{S}_3^{(0)}$ and $\mathbf{S}_3^{(1)}$ matrices, which show how rank 7 topics collect into rank 5 and rank 3 topics. From Table 1, we see that the loss for Neural NCPD is at or below that of Standard HNCPD and HNTF at each rank and level of noise.
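The noise model, the mode-three projection used for plotting, and the relative reconstruction loss described in this subsection can be sketched in NumPy as below. The block positions and sizes are hypothetical and do not attempt to reproduce the paper's rank-seven tensor, and the function names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical block layout: three nonoverlapping blocks of ones, each overlaid
# with a single smaller block of threes (the paper overlays two or three).
T = np.zeros((40, 40, 40))
for start in (0, 13, 26):
    T[start:start + 13, start:start + 13, start:start + 13] = 1.0
    T[start:start + 5, start:start + 5, start:start + 5] = 3.0

def add_positive_noise(tensor, sigma2, rng):
    """Add n = |g| with g ~ N(0, sigma2) to every entry."""
    g = rng.normal(0.0, np.sqrt(sigma2), size=tensor.shape)
    return tensor + np.abs(g)

def project_mode3(tensor):
    """Largest entry of each mode-three fiber, giving a 40 x 40 matrix."""
    return tensor.max(axis=2)

def relative_loss(tensor, approx):
    """C_rel = ||X - X_hat||_F / ||X||_F."""
    return np.linalg.norm(tensor - approx) / np.linalg.norm(tensor)

X_low = add_positive_noise(T, 0.05, rng)    # sigma^2 = 0.05
X_high = add_positive_noise(T, 0.5, rng)    # sigma^2 = 0.5
print(project_mode3(X_low).shape)           # (40, 40)
```

Here `relative_loss(X_low, approx)` would be evaluated with `approx` set to the reconstruction produced by each method at each rank.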

3.2. TEMPORAL DOCUMENT ANALYSIS

In Table 2, we display the relative reconstruction loss on the Twitter political dataset for all models. We see that Neural NCPD significantly outperforms Standard HNCPD, slightly outperforms Standard NCPD while offering a hierarchical topic structure, and outperforms all HNTF-i, for which loss varies significantly based on the arrangement of the tensor. In Figure 3, we show the topic keywords and factor matrices of a rank 8, 4, and 2 hierarchical NCPD approximation computed by Neural NCPD. Note that in the rank 8 candidates mode factor and keywords we see that nearly every topic is identified with a single candidate. Topic two of the rank 8 approximation aligns with

In Figure 4, we display the $\mathbf{S}_3^{(0)}$ (top) and $\mathbf{S}_3^{(1)}$ (bottom) matrices produced by Neural NCPD on the Twitter dataset, which illustrate how topics collect at each rank. We see topics 5 and 6 from the rank 8 factorization combine to form topic 3 at rank 4 and topic 2 at rank 2. This is expected because both topics include keywords from Cruz and Kasich, who had high presence in topics 5 and 6 respectively in the rank 8 factorization.

In Figure 5, we display the results of performing separate NCPD decomposition of ranks 4 and 2 on the Twitter dataset. We see that the results are similar to those of Neural NCPD, but these independent decompositions lack the clear hierarchical structure provided by Neural NCPD. Note that while the topics corresponding to Kasich and Clinton combine in the rank 4 NCPD, these candidates are present in different topics in the rank 2 NCPD; Neural NCPD prevents this breach of hierarchy.

3.3. VIDEO DATA ANALYSIS

We next apply Neural NCPD to video data constructed from a year-long time-lapse video of a forest; see Figure 6 for a selection of frames and Figure 11 in Appendix A for more details. We extract 37 frames and flatten each frame (RGB image) into a single matrix, to form a tensor $\boldsymbol{\mathsf{X}}$ of size 37 × 3 × 57600; here the first mode represents frames (temporal mode), the second colors (chromatic mode), and the third pixels (spatial mode).

Figure 6: We display seven of 37 extracted frames from a year-long time-lapse video of a forest.

In Figure 7, we show the three-layer Neural NCPD decompositions of the video tensor with ranks 8, 6, and 3. For each rank, we plot the topics in the spatial (left), temporal (top right), and chromatic (bottom right) modes. We note that many of the identified topics represent visual seasonal changes. Topic six of the rank 8 decomposition represents the green and leafy late-summer to early fall. Topic one of the rank 6 decomposition represents the winter sky and leafless trees. Topic three of the rank 3 decomposition represents the summer and fall sky and tree leaves.

We additionally apply NMF to the slices of the tensor along a single mode. Slicing along the temporal or spatial modes would make interpretation of the resulting topics challenging, so we choose to slice along the chromatic mode, producing three matrices. In Figure 8, we visualize a rank 3 NMF on each of the three chromatic slices of the video tensor.

A APPENDIX

In this supplementary material, we provide the details of the example derivative computations from Section 2.2, give a justification of the NCPD expansion formula exploited in Section 2.1, and provide further experimental results that we were not able to include in Section 3.

EXAMPLE DERIVATIONS

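Here, we justify the derivations provided in the example in Section 2.2. We note that Anaissi et al. (2020) and Kolda & Hong (2019) provide similar derivations for the CP tensor decomposition, but their decompositions do not attempt to further decompose the CP factor matrices, and thus, their results are not sufficient for providing derivatives with respect to the $\mathbf{A}$ and $\mathbf{S}$ matrices.

Consider the full reconstruction loss function for the order-$k$ tensor $\boldsymbol{\mathsf{X}}$,

$$C = \|\boldsymbol{\mathsf{X}} - [\![\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k]\!]\|_F^2,$$

where for some fixed $1 \leq i \leq k$, $\mathbf{X}_i = \mathbf{A}\mathbf{B}\mathbf{C}$, and consider the gradient $\frac{\partial C}{\partial \mathbf{B}}$. Let $\mathbf{H}_i = \mathbf{X}_k \odot \cdots \odot \mathbf{X}_{i+1} \odot \mathbf{X}_{i-1} \odot \cdots \odot \mathbf{X}_1$. Now, we let $\tilde{\boldsymbol{\mathsf{X}}} = [\![\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k]\!]$. Then, if $\tilde{\mathbf{X}}_{(i)}$ is the mode-$i$ matricization of $\tilde{\boldsymbol{\mathsf{X}}}$ (see e.g., (Kolda & Bader, 2009)), we have

$$\tilde{\mathbf{X}}_{(i)} = \mathbf{X}_i\left(\mathbf{X}_k \odot \cdots \odot \mathbf{X}_{i+1} \odot \mathbf{X}_{i-1} \odot \cdots \odot \mathbf{X}_1\right)^{\top} = \mathbf{X}_i\mathbf{H}_i^{\top}.$$

Thus, if we let $\mathbf{X}_{(i)}$ be the mode-$i$ matricization of $\boldsymbol{\mathsf{X}}$, we have that

$$C = \|\mathbf{X}_{(i)} - \tilde{\mathbf{X}}_{(i)}\|_F^2 = \|\mathbf{X}_{(i)} - (\mathbf{A}\mathbf{B}\mathbf{C})\left(\mathbf{X}_k \odot \cdots \odot \mathbf{X}_{i+1} \odot \mathbf{X}_{i-1} \odot \cdots \odot \mathbf{X}_1\right)^{\top}\|_F^2.$$

Now, we compute the desired gradient through a series of applications of the chain rule. We then see that

$$\frac{\partial C}{\partial \mathbf{B}} = \frac{\partial C}{\partial \mathbf{X}_i\mathbf{H}_i^{\top}} \, \frac{\partial \mathbf{X}_i\mathbf{H}_i^{\top}}{\partial \mathbf{A}\mathbf{B}\mathbf{C}} \, \frac{\partial \mathbf{A}\mathbf{B}\mathbf{C}}{\partial \mathbf{B}} = \mathbf{A}^{\top} \, \frac{\partial C}{\partial \mathbf{X}_i\mathbf{H}_i^{\top}} \, \frac{\partial \mathbf{X}_i\mathbf{H}_i^{\top}}{\partial \mathbf{X}_i} \, \mathbf{C}^{\top} = 2\mathbf{A}^{\top}\left(\mathbf{X}_{(i)} - \mathbf{X}_i\mathbf{H}_i^{\top}\right)\mathbf{H}_i\mathbf{C}^{\top}.$$

Now, using the calculations above we can proceed in calculating $\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_j)}}$. Gao et al. (2019) show that

$$\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_1)}} = \left(\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_1)}}\right)_{\mathbf{S}} + \sum_{\ell_1 \leq \ell_2 \leq L-2} \; \sum_{1 \leq j \leq r} \mathbf{U}_i^{(\ell_1, \ell_2), j},$$

where $\mathbf{U}_i^{(\ell_1, \ell_2), j}$ relates $C$ to $\mathbf{A}_i^{(\ell_1)}$ through $\mathbf{S}_i^{(\ell_2)}$ and $\mathbf{S}_i^{(\ell_1)}$, is defined column-wise ($j$), and depends upon $\left(\frac{\partial C}{\partial \mathbf{S}_i^{(\ell_2)}}\right)^{*}$, the derivative of $C$ with respect to $\mathbf{S}_i^{(\ell_2)}$ holding $\mathbf{S}_i^{(\ell_2+1)}, \ldots, \mathbf{S}_i^{(L-2)}$ constant.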
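The matricization identity $\tilde{\mathbf{X}}_{(i)} = \mathbf{X}_i\mathbf{H}_i^{\top}$ used in this derivation can be checked numerically. The NumPy sketch below does so for a small order-three CP tensor; the helper names `khatri_rao` and `unfold` and the toy sizes are assumptions, and the unfolding follows the column ordering of Kolda & Bader (2009), in which the lowest remaining mode index varies fastest.

```python
import numpy as np
from functools import reduce

def khatri_rao(A, B):
    """Column-wise Kronecker product; the row index of B varies fastest."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def unfold(tensor, mode):
    """Mode-`mode` matricization with the lowest remaining mode varying fastest."""
    moved = np.moveaxis(tensor, mode, 0)
    return np.reshape(moved, (tensor.shape[mode], -1), order='F')

rng = np.random.default_rng(0)
factors = [rng.random((n, 3)) for n in (4, 5, 6)]     # X_1, X_2, X_3 with r = 3
T = np.einsum('ar,br,cr->abc', *factors)              # [[X_1, X_2, X_3]]

checks = []
for i in range(3):
    # H_i = X_k (.) ... (.) X_{i+1} (.) X_{i-1} (.) ... (.) X_1
    others = [factors[m] for m in reversed(range(3)) if m != i]
    H_i = reduce(khatri_rao, others)
    checks.append(np.allclose(unfold(T, i), factors[i] @ H_i.T))
print(checks)
```

With this identity in hand, the reconstruction loss for mode $i$ reduces to a matrix least-squares residual in $\mathbf{X}_i$, which is what allows the matrix-calculus steps above to be applied.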



4. CONCLUSIONS

In this paper, we introduced the hierarchical NCPD model and presented a novel algorithm, Neural NCPD, to train this decomposition. We empirically demonstrate the promise of this method on both real and synthetic datasets; in particular, this model reveals the hierarchy of topics learned at different NCPD ranks, which is not available to standard NCPD or NMF-based approaches.



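Forward Propagation. The forward-propagation treats the $\mathbf{A}_i^{(\ell)}$ matrices as neural network weights and uses $\mathbf{A}_i^{(\ell)}$ and the previous layer output to compute $\mathbf{S}_i^{(\ell)} = q(\mathbf{A}_i^{(\ell)}, \mathbf{S}_i^{(\ell-1)})$.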

Figure 2: Data tensor $\boldsymbol{\mathsf{X}}$ with two levels of noise (left), ranks 7, 5, and 3 Neural NCPD and Standard HNCPD approximations of $\boldsymbol{\mathsf{X}}$ (middle), transposed Neural NCPD $\mathbf{S}_3^{(0)}$ and $\mathbf{S}_3^{(1)}$ matrices (right).

We next apply Neural NCPD to a dataset of tweets from four Republican [R] and four Democratic [D] 2016 presidential primary candidates, (1) Hillary Clinton [D], (2) Tim Kaine [D], (3) Martin O'Malley [D], (4) Bernie Sanders [D], (5) Ted Cruz [R], (6) John Kasich [R], (7) Marco Rubio [R], and (8) Donald Trump [R]; this is constructed from a subset of the dataset of Littman et al. (2016). We use a bag-of-words (12,721 words in corpus) representation of all tweets made by a candidate within bins of 30 days (from February to December 2016), and cap each of these groups at 100 tweets to avoid oversampling from any candidate; resulting in a tensor of size 8 × 10 × 12721.

Figure 3: A three-layer Neural NCPD on the Twitter dataset at ranks $r = 8$, $r^{(0)} = 4$, and $r^{(1)} = 2$. At each rank, we display the top keywords and topic heatmaps for candidate and temporal modes.

Figure 4: The $\mathbf{S}_3^{(0)}$ (top) and $\mathbf{S}_3^{(1)}$ (bottom) matrices produced by Neural NCPD on the Twitter dataset.

Figure 7: A three-layer Neural NCPD of the time-lapse video at ranks $r = 8$, $r^{(0)} = 6$, and $r^{(1)} = 3$. We display topics at each rank for spatial (left), temporal (top right), and chromatic (bottom right) modes. Relative reconstruction loss is 0.105, 0.109, and 0.122 respectively at each layer.

Figure 8: A decomposition of the time-lapse video by rank 3 NMF on slices of the tensor along the chromatic mode. For each color, we display the three topics in spatial (left) and temporal (right) modes. Relative reconstruction loss is 0.101.

The rank 3 NMF is applied to each of the three chromatic slices of the video tensor. The chromatic factorizations are nearly identical, illustrating little salient dynamic information. While similar to the rank 3 Neural NCPD layer, the chromatic NMFs obscure much of the chromatic interaction evidenced by Neural NCPD. In particular, Neural NCPD illustrates the spatial and temporal dynamics of multi-colored features and their co-occurrence hierarchy, while NMF provides only single-colored features and requires far more work to glean multi-colored feature co-occurrence information.

Figure 9: Here we display the projections onto all three modes for the original data tensor $\boldsymbol{\mathsf{X}}$ and approximations of $\boldsymbol{\mathsf{X}}$ at ranks $r = 7$, $r^{(0)} = 5$, and $r^{(1)} = 3$ produced by Neural NCPD, Standard HNCPD, and HNTF at two levels of noise.


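Since we can assume that $\mathbf{A}_i^{(\ell_1)}$ is independent of all other $\mathbf{A}$'s and $\mathbf{S}$'s, applying the expression for $\frac{\partial C}{\partial \mathbf{B}}$ with $\mathbf{B} = \mathbf{A}_i^{(\ell_1)}$ we have that

$$\left(\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_1)}}\right)_{\mathbf{S}} = 2\left(\mathbf{A}_i^{(0)}\mathbf{A}_i^{(1)} \cdots \mathbf{A}_i^{(\ell_1-1)}\right)^{\top}\left(\mathbf{X}_{(i)} - \tilde{\mathbf{X}}_i\mathbf{H}_i^{\top}\right)\mathbf{H}_i\left(\mathbf{A}_i^{(\ell_1+1)} \cdots \mathbf{A}_i^{(L-2)}\mathbf{S}_i^{(L-2)}\right)^{\top}.$$

Now, we calculate $\left(\frac{\partial C}{\partial \mathbf{S}_i^{(\ell_j)}}\right)^{*}$. Since we can assume that $\mathbf{S}_i^{(\ell_j)}$ is independent of all other $\mathbf{A}$'s and $\mathbf{S}$'s, we have that

$$\left(\frac{\partial C}{\partial \mathbf{S}_i^{(\ell_j)}}\right)^{*} = 2\left(\mathbf{A}_i^{(0)}\mathbf{A}_i^{(1)} \cdots \mathbf{A}_i^{(\ell_j)}\right)^{\top}\left(\mathbf{X}_{(i)} - \tilde{\mathbf{X}}_i\mathbf{H}_i^{\top}\right)\mathbf{H}_i.$$

Thus, we have the required derivatives to evaluate $\frac{\partial C}{\partial \mathbf{A}_i^{(\ell_j)}}$.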

HNCPD EXPANSION

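We now provide brief justification of the expansion of the NCPD in terms of later factorizations used in Section 2.1; that is,

$$[\![\tilde{\mathbf{X}}_1, \tilde{\mathbf{X}}_2, \cdots, \tilde{\mathbf{X}}_k]\!] = \sum_{1 \leq j_1, j_2, \ldots, j_k \leq r^{(0)}} \alpha_{j_1, j_2, \ldots, j_k} \, (\mathbf{A}_1^{(0)})_{:,j_1} \otimes (\mathbf{A}_2^{(0)})_{:,j_2} \otimes \cdots \otimes (\mathbf{A}_k^{(0)})_{:,j_k},$$

where $\alpha_{j_1, j_2, \ldots, j_k} = \sum_{p=1}^{r} (\mathbf{S}_1^{(0)})_{j_1,p} (\mathbf{S}_2^{(0)})_{j_2,p} \cdots (\mathbf{S}_k^{(0)})_{j_k,p}$. We have that by definition,

$$[\![\tilde{\mathbf{X}}_1, \tilde{\mathbf{X}}_2, \cdots, \tilde{\mathbf{X}}_k]\!] = \sum_{p=1}^{r} (\tilde{\mathbf{X}}_1)_{:,p} \otimes (\tilde{\mathbf{X}}_2)_{:,p} \otimes \cdots \otimes (\tilde{\mathbf{X}}_k)_{:,p}.$$

We also have that

$$(\tilde{\mathbf{X}}_i)_{:,p} = (\mathbf{A}_i^{(0)}\mathbf{S}_i^{(0)})_{:,p} = \sum_{j_i=1}^{r^{(0)}} (\mathbf{S}_i^{(0)})_{j_i,p} \, (\mathbf{A}_i^{(0)})_{:,j_i}.$$

Thus, by the linearity of the outer product we have that

$$[\![\tilde{\mathbf{X}}_1, \tilde{\mathbf{X}}_2, \cdots, \tilde{\mathbf{X}}_k]\!] = \sum_{p=1}^{r} \; \sum_{1 \leq j_1, j_2, \ldots, j_k \leq r^{(0)}} \alpha_{p, j_1, j_2, \ldots, j_k} \, (\mathbf{A}_1^{(0)})_{:,j_1} \otimes (\mathbf{A}_2^{(0)})_{:,j_2} \otimes \cdots \otimes (\mathbf{A}_k^{(0)})_{:,j_k},$$

where $\alpha_{p, j_1, j_2, \ldots, j_k} = (\mathbf{S}_1^{(0)})_{j_1,p} (\mathbf{S}_2^{(0)})_{j_2,p} \cdots (\mathbf{S}_k^{(0)})_{j_k,p}$. Letting $\alpha_{j_1, j_2, \ldots, j_k} = \sum_{p=1}^{r} \alpha_{p, j_1, j_2, \ldots, j_k}$, we arrive at the original statement.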

SYNTHETIC EXPERIMENT

In this section, we provide the additional views of the synthetic tensor and computed approximations from Section 3.1. In Figure 2 in the main text, for visualization we displayed the projection of each tensor onto the third mode. In Figure 9 , we display the projections of these tensors onto all three modes. We see that due to the simple block structure used to produce the synthetic data tensor, the three modes all tell a similar story; that is, Neural NCPD is able to recover meaningful structure along all three modes.

Figure 10 (panels HNTF-1, HNTF-2, HNTF-3): Here we display a three-layer HNTF on the Twitter dataset from Section 3.2 at ranks $r = 8$, $r^{(0)} = 4$, and $r^{(1)} = 2$, run separately for each of the possible orderings of the data tensor. We display the top keywords and heatmaps of topics in the candidate and temporal modes at ranks 4 (left) and 2 (right). We note that the rank 8 factorization is identical to that of Neural NCPD, so we do not re-display it here (see Section 3.2).

In Figure 10, we display the results from running HNTF on the Twitter dataset in Section 3.2, excluding the topics at rank 8 because they are identical to those learned by Neural NCPD (see Section 3.2). We see that while the factorization for the first possible ordering is similar to that of Neural NCPD and contains significant meaningful topic modeling information, the other two orderings lose significant information by the last layer and have topic presence from only 2 or 3 of the eight candidates.

VIDEO DATA EXPERIMENT

Figure 11: Here we display the first 36 of 37 frames of the time lapse video dataset from Section 3.3 (the 37th frame is included in Figure 6).

In Figure 11, we display the first 36 of 37 frames of the time lapse video dataset from Section 3.3 (the 37th frame is included in Figure 6) in order to make it clear how seasons progress throughout the frames. We see that the video begins in the white winter months, transitions to spring at around frame 16, and stays green until it transitions to fall around frame 28.

In Figure 12, we display the $\mathbf{S}_3^{(0)}$ matrix (top) and $\mathbf{S}_3^{(1)}$ matrix (bottom) produced by Neural NCPD on the time-lapse video tensor described in Section 3.3. By examining the $\mathbf{S}$ matrices from our Neural NCPD algorithm, we are also able to see the hierarchical relationship between the topics from different ranks. In the $\mathbf{S}_3^{(0)}$ matrix, we see the hierarchical relationship between the rank 6 and rank 8 topics. In the $\mathbf{S}_3^{(1)}$ matrix, we see the hierarchical relationship between the rank 3 and rank 8 topics. We note that the $\mathbf{S}_3^{(0)}$ matrix (top) illustrates that topic one of rank 6 NCPD is closely related to topic eight of rank 8 NCPD, and $\mathbf{S}_3^{(1)}$ (bottom) similarly illustrates that topic two of rank 3 NCPD is closely related to topic eight in rank 8 NCPD; these relationships are unsurprising because, as seen in Figure 7 in the main text, these topics are present temporally during winter and fall and spatially in the sky behind the trees.

