LANGUAGE MODELING USING TENSOR TRAINS

Abstract

Tensor networks have previously been shown to have theoretical potential for language modeling, but practical evidence has been lacking. We propose a novel Tensor Train Language Model (TTLM) based on the Tensor-Train decomposition. To show the usefulness of TTLM, we perform a principled experimental evaluation on real-world language modeling tasks, showing that our proposed variants, TTLM-Large and TTLM-Tiny, can be more effective than Vanilla RNNs at low-scale hidden sizes. We also demonstrate the relationship between TTLM and Second-order Recurrent Neural Networks (RNNs), Recurrent Arithmetic Circuits, and Multiplicative Integration RNNs: the architectures of all of these are, essentially, special cases of that of TTLM.¹

1. INTRODUCTION

A language model assigns probabilities to sequences of words from a vocabulary V; the number of possible texts grows exponentially w.r.t. the length N. Hence the domain of a language model is, by definition, the exponential space V^N. Because this exponential space is intractable, existing work tends to use recurrent or auto-regressive architectures that generate conditional probabilities based on the context (typically encapsulated in a fixed-length dense vector), which indeed simplifies the calculation. Recently, researchers (Pestun & Vlassopoulos, 2017; Miller et al., 2021; Zhang et al., 2019) have revisited the view of language models as joint probabilities of text, as it leads to exponential representations in tensor space, where connections between words can be preserved when measuring joint probabilities. To deal with the exponential space complexity, a mathematical tool called the 'tensor network'² has been used to reduce the exponential space of language modeling to a tractable one (Pestun & Vlassopoulos, 2017). However, the 'tensor network language model' of Pestun & Vlassopoulos (2017) is currently a concept whose practicality remains to be demonstrated. As proof-of-concept work, we derive a Tensor Train Language Model (TTLM), based on the simplest tensor network. Technically, we represent a sentence in the exponential semantic space constructed by the tensor product of word representations. The probability of the sentence is obtained by the inner product of two high-dimensional tensors: the input Φ(X) and the global coefficients A. Under the framework of TTLM, we propose two variants: TTLM-Tiny and TTLM-Large. We also clarify the relationship between the proposed TTLM and a series of Recurrent Neural Networks (RNNs), i.e., Second-order RNNs (Goudreau et al., 1994), Recurrent Arithmetic Circuits (RACs) (Levine et al., 2018), and Multiplicative Integration RNNs (MI-RNNs) (Wu et al., 2016).
These connections open a new perspective on understanding RNNs and suggest natural implementations of TTLM. We benchmark these TTLM variants and analyze the differences in their working mechanisms and behaviors. Experimental results on language modeling tasks show that our TTLM variants can outperform Vanilla-RNNs under the same training setting, demonstrating the feasibility of TTLM. The main contributions of our work can be summarized as follows: 1. We propose a novel Tensor Train Language Model, as a first attempt to apply tensor networks to real-world language modeling tasks. 2. We propose two novel TTLM variants, TTLM-Large and TTLM-Tiny, and theoretically demonstrate the relationship between TTLM and a series of existing RNNs. 3. Compared to Vanilla-RNNs on the WikiText-2 and PTB datasets, TTLM-Large reduces perplexity by 14.3 and 16.0, respectively, and TTLM-Tiny by 1.7 and 8.5, respectively.

2. RELATED WORK

Previous studies on tensor networks in machine learning have mainly been devoted to analyzing the theoretical properties of neural networks. A better understanding of feed-forward, convolutional, and recurrent architectures has been gained, including parameter compression (Novikov et al., 2015), expressive power (Cohen et al., 2016; Cohen & Shashua, 2016; Khrulkov et al., 2018), and depth efficiency for long-term memory (Levine et al., 2018). Focusing on natural language modeling, certain studies have tensorized existing network architectures (Novikov et al., 2015), while few studies have applied tensor networks alone as a language model. To the best of our knowledge, tensor network language models have remained a theoretical proposal rather than an empirical model (Pestun & Vlassopoulos, 2017; Pestun et al., 2017). Miller et al. (2021) is perhaps the only other work that uses tensor networks for probabilistic sequence modeling, but it does not scale up to real-world sequence modeling tasks. We are the first to derive a tensor network language model in a way that can be applied to real-world language modeling datasets, with variants that outperform Vanilla RNNs at low-scale hidden sizes.

3. PRELIMINARIES

We briefly recapitulate basic notions and notation³; full technical introductions can be found in standard textbooks (e.g., Bi et al. (2022); Itskov (2009)).

Notation. For the purposes of this paper, every tensor A is a multidimensional array of elements (called components) of R, each denoted by its integer coordinates in the array; e.g., for a two-dimensional array, the component at position i, j ∈ N is denoted A_{ij}. The order of a tensor is the number of its indices (e.g., a vector v is a first-order tensor, a matrix M is a second-order tensor, etc.). The dimension of a tensor refers to the number of values that a particular index (or so-called mode) can take; e.g., the dimension of B ∈ R^{I_1×I_2×I_3} is I_1 × I_2 × I_3.

Tensor product. For two tensors C ∈ R^{I_1×⋯×I_j} (order j) and D ∈ R^{I_{j+1}×⋯×I_{j+k}} (order k), their tensor product, denoted by ⊗, returns a tensor E of order j+k with components E_{i_1⋯i_{j+k}} = C_{i_1…i_j} · D_{i_{j+1}⋯i_{j+k}}.

Generalized inner product. For two tensors X, Y ∈ R^{I_1×I_2×⋯×I_N} of the same size, their inner product is defined as ⟨X, Y⟩ = Σ_{i_1=1}^{I_1} Σ_{i_2=1}^{I_2} ⋯ Σ_{i_N=1}^{I_N} X_{i_1,i_2,…,i_N} Y_{i_1,i_2,…,i_N}. For two tensors X ∈ R^{I_1×I_2×⋯×I_N×I_x} and Y ∈ R^{I_1×I_2×⋯×I_N×I_y} sharing N modes of the same size, the "generalized inner product" defined in Kossaifi et al. (2020) is calculated as (⟨X, Y⟩_N)_{i_x i_y} = Σ_{i_1=1}^{I_1} Σ_{i_2=1}^{I_2} ⋯ Σ_{i_N=1}^{I_N} X_{i_1,…,i_N,i_x} Y_{i_1,…,i_N,i_y}, with ⟨X, Y⟩_N ∈ R^{I_x×I_y}.
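To make the operations above concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) of the tensor product, the inner product, and the generalized inner product with the shapes used in the definitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tensor product: an order-2 tensor (x) an order-1 tensor -> an order-3 tensor.
C = rng.standard_normal((2, 3))          # order 2
D = rng.standard_normal(4)               # order 1
E = np.tensordot(C, D, axes=0)           # E[i,j,k] = C[i,j] * D[k]
assert E.shape == (2, 3, 4)

# Inner product of two same-sized tensors: sum over all shared indices.
X = rng.standard_normal((2, 3, 4))
Y = rng.standard_normal((2, 3, 4))
inner = np.tensordot(X, Y, axes=3)       # a scalar
assert np.isclose(inner, (X * Y).sum())

# Generalized inner product <X, Y>_N: contract the N shared leading modes,
# leaving the trailing modes I_x and I_y free.
Xg = rng.standard_normal((2, 3, 5))      # shared modes (2, 3), free mode I_x = 5
Yg = rng.standard_normal((2, 3, 7))      # shared modes (2, 3), free mode I_y = 7
G = np.tensordot(Xg, Yg, axes=([0, 1], [0, 1]))
assert G.shape == (5, 7)
```

The `axes` argument of `np.tensordot` selects which modes are contracted; with `axes=0` no modes are contracted and the result is the tensor (outer) product.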

4. LANGUAGE MODELING USING TENSOR TRAINS

We introduce a language model in tensor space in Sec. 4.1. We define our general Tensor Train Language Model in Sec. 4.2, and its special case, TTLM, in Sec. 4.3.


4.1. LANGUAGE MODELS IN A TENSOR SPACE

Natural language typically has complex dependencies between features (e.g., tokens or words) (Hou et al., 2013)⁴ that are not captured well by standard methods such as feature concatenation. A similar interaction between arbitrary features appears in factorization machines (Rendle, 2010). Suppose a given text consists of N words X = [x^(1), x^(2), ⋯, x^(N)] and let f_i ∈ R^{I_i} be a feature extractor (it can be a one-hot encoding or a word embedding). We now define a representation of X designed to capture these dependencies:

Φ(X) = f_1(x^(1)) ⊗ f_2(x^(2)) ⊗ ⋯ ⊗ f_N(x^(N)) = ⊗_{i=1}^{N} f_i(x^(i))   (1)

where the tensor space is R^{I_1} ⊗ R^{I_2} ⊗ ⋯ ⊗ R^{I_N}. Each component of f_i represents an independent meaning-bearing unit, such as a morpheme or a latent factor. For simplicity, we assume that a text shares the same one-hot encoding f(x^(t)) ∈ R^{|V|} in later sections. Consequently, Φ(X) is a |V|^N-dimensional tensor that records all possible combinations of words in X. Inspired by Kossaifi et al. (2020) and Zhang et al. (2019), we define a tensor regression model to obtain the estimated probability of each text X:

p(X) = ⟨A, Φ(X)⟩ = Σ_{i_1,i_2,⋯,i_N=1}^{|V|} A_{i_1,⋯,i_N} · Φ(X)_{i_1,⋯,i_N}   (2)

where ⟨·⟩ denotes the inner product of two same-sized tensors, and A is a regression weight tensor of the same shape as Φ(X), living in the tensor space V^{⊗N} = V ⊗ ⋯ ⊗ V (N times), where V refers to R^{|V|}.

4.2. GENERAL TENSOR TRAIN LANGUAGE MODEL

Storing A explicitly is intractable, so we represent it in the Tensor-Train (TT) format:

A_{w_1 w_2 … w_N} = G^(1)_{:,w_1} G^(2)_{:,w_2,:} ⋯ G^(N)_{:,w_N} = Σ_{α_1=1}^{R_1} Σ_{α_2=1}^{R_2} ⋯ Σ_{α_{N-1}=1}^{R_{N-1}} G^(1)_{w_1 α_1} G^(2)_{α_1 w_2 α_2} ⋯ G^(N)_{α_{N-1} w_N}   (3)

where G^(1)_{:,w_1} ∈ R^{1×R_1}, G^(2)_{:,w_2,:} ∈ R^{R_1×R_2}, G^(N)_{:,w_N} ∈ R^{R_{N-1}×1}, and the tensors G^(t) ∈ R^{R_{t-1}×|V|×R_t} (t = 1, …, N, with R_0 = R_N = 1 by definition) are called TT cores. We now combine Φ(X) and A in the TT format to define the general TTLM. The elements of G^(t)_{:,w_t,:} in Eq. 3 can be represented as:

G^(t)_{α_{t-1} w_t α_t} = Σ_{i=1}^{|V|} f(x^(t))_i G^(t)_{α_{t-1} i α_t}   (4)

where each f(x^(t)) is a one-hot vector with a 1 at position w_t and zeros elsewhere. Therefore, one-hot encoding enables us to integrate the input data into the TT format of A by inserting Eq. 4 into Eq. 3:

A_{w_1 w_2 … w_N} = Σ_{i_1,⋯,i_N=1}^{|V|} Σ_{α_1=1}^{R_1} ⋯ Σ_{α_{N-1}=1}^{R_{N-1}} f(x^(1))_{i_1} G^(1)_{i_1 α_1} f(x^(2))_{i_2} G^(2)_{α_1 i_2 α_2} ⋯ f(x^(N))_{i_N} G^(N)_{α_{N-1} i_N}
              = Σ_{α_1=1}^{R_1} Σ_{α_2=1}^{R_2} ⋯ Σ_{α_{N-1}=1}^{R_{N-1}} G^(1)_{w_1 α_1} G^(2)_{α_1 w_2 α_2} ⋯ G^(N)_{α_{N-1} w_N} = ⟨A, Φ(X)⟩ = p(X)   (5)

where Φ(X) = ⊗_{i=1}^{N} f(x^(i)). The difference between Eq. 5 and Eq. 3 is that Eq. 5 combines A and Φ(X) in low-dimensional form. This is possible because Eq. 3 computes the elements of A (Oseledets, 2011), and because Φ(X) is a tensor product of one-hot vectors, so that Eq. 5 computes Eq. 2. Further, since Eq. 5 now contains both the input data (one-hot vectors) and the weights (TT cores), we refer to Eq. 5 as our general TTLM.
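To make the TT format concrete, here is a minimal NumPy sketch (our own illustration with random cores; the helper `tt_component` is ours) verifying that contracting one-hot inputs with the TT cores, as in Eq. 5, reproduces the chained matrix products of Eq. 3:

```python
import numpy as np

rng = np.random.default_rng(1)
V, R, N = 4, 3, 5                 # vocabulary size, TT rank, sequence length

# Hypothetical TT cores: G1 in R^{|V| x R}, middle cores in R^{R x |V| x R},
# GN in R^{R x |V|}, matching R_0 = R_N = 1 in Eq. 3.
G1 = rng.standard_normal((V, R))
mid = [rng.standard_normal((R, V, R)) for _ in range(N - 2)]
GN = rng.standard_normal((R, V))

def tt_component(words):
    """A_{w1...wN} as a chain of matrix products over the ranks (Eq. 3)."""
    v = G1[words[0]]                      # shape (R,)
    for t, w in enumerate(words[1:-1]):
        v = v @ mid[t][:, w, :]           # contract rank index alpha_t
    return float(v @ GN[:, words[-1]])

# Eq. 5: contracting one-hot inputs with the cores selects the same slices.
words = [2, 0, 3, 1, 2]
onehots = np.eye(V)[words]
v = onehots[0] @ G1
for t in range(1, N - 1):
    v = np.einsum('i,a,aib->b', onehots[t], v, mid[t - 1])
p = float(np.einsum('a,i,ai->', v, onehots[-1], GN))
assert np.isclose(p, tt_component(words))
```

The storage cost drops from |V|^N components of A to N cores of at most R·|V|·R components each, which is what makes the model tractable.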

4.3. TTLM

Here we consider a special class of the general TTLM. Although its site-dependent TT cores G^(t) potentially give it more expressiveness for language modeling, this property currently creates unnecessary obstacles to applicability, such as the choice of each R_t. We therefore provide a detailed account of its special case: TTLM.

Definition. Suppose all the intermediate TT cores are equal to each other, G = G^(2) = … = G^(N-1) ∈ R^{R×|V|×R}, and G^(1) = G^(N) ∈ R^{|V|×R} in Eq. 5. Then TTLM is defined as follows:

p(X) = Σ_{i_1,⋯,i_N=1}^{|V|} Σ_{α_1,⋯,α_{N-1}=1}^{R} f(x^(1))_{i_1} G^(1)_{i_1 α_1} f(x^(2))_{i_2} G_{α_1 i_2 α_2} ⋯ f(x^(N))_{i_N} G^(N)_{α_{N-1} i_N}
     = Σ_{α_1,⋯,α_{N-1}=1}^{R} G^(1)_{w_1 α_1} G_{α_1 w_2 α_2} ⋯ G_{α_{N-2} w_{N-1} α_{N-1}} G^(N)_{α_{N-1} w_N}   (6)

where its tensor diagram notation is shown in Figure 2a.

Recursive information. We recursively unfold the calculation of TTLM in Eq. 6 and find that G has two sources of "input": the information from the previous recursive unfolding, and the input data f(x^(t)) (see Eq. 15 for a detailed version). From this perspective, G acts as a bilinear map G: R^{|V|} × R^R → R^R, and we can regard the information from the previous step as a hidden state h^(t)_TTLM, given by:

h^(t)_TTLM = f(x^(t))^T G h^(t-1)_TTLM   (7)

where f(x^(t)), G, and h^(t-1)_TTLM are contracted together (we permute the indices of G from R^{R×|V|×R} to R^{|V|×R×R}, which does not change the number of indices).

Recursive probability computation. We now provide further details on how p(X) is computed by TTLM in practice.

Figure 3: Recursive calculation of conditional probability in TTLM. An example: given the text x^(1:3), y^(4) = softmax(G^(4) h^(3)_TTLM), where y^(4) ∈ V is the probability distribution of the word x^(4).
In language modeling, p(X) is often decomposed using the chain rule (Bahl et al., 1983):

p(X) = ∏_{t=1}^{N} p(x^(t) | x^(1:t-1))

where x^(1:t-1) denotes the text [x^(1), x^(2), ⋯, x^(t-1)]. At time t, the output prediction of a model, y^(t) ∈ V, is a probability distribution over the word x^(t) given x^(1:t-1). In TTLM, we define y^(t) as follows:

y^(t) = softmax(G^(t) h^(t-1)_TTLM)   (8)

where G^(t) ∈ R^{|V|×R} is the last TT core in the TT format at time t. This definition parallels that of RNNs, which use hidden states and a weight matrix to calculate word probabilities. Fig. 3 provides a simple example. We can derive the definition of y^(t) in high-dimensional space by substituting Eqs. 6 and 7 for h^(t-1)_TTLM in Eq. 8:

y^(t) = softmax( Σ_{i_1,⋯,i_{t-1}=1}^{|V|} Σ_{α_1,⋯,α_{t-1}=1}^{R} f(x^(1))_{i_1} G^(1)_{i_1 α_1} f(x^(2))_{i_2} G_{α_1 i_2 α_2} ⋯ G^(t)_{:,α_{t-1}} )   (9)
     = softmax( ⟨A^(1:t), ⊗_{i=1}^{t-1} f(x^(i))⟩_{t-1} )   (10)
     = softmax( ⟨A^(1:t), Φ(X^(1:t-1))⟩_{t-1} )   (11)

where A^(1:t) ∈ V^{⊗t}, Φ(X^(1:t-1)) ∈ V^{⊗(t-1)}, and ⟨·⟩_{t-1} denotes the "generalized inner product" defined in the Preliminaries. Note that Eq. 9 is the low-dimensional form of Eq. 11, analogously to the relationship between Eq. 5 and Eq. 2. These definitions yield several noteworthy properties of TTLM: 1) we can use teacher forcing (Jurafsky, 2000) to learn the parameters of the TT cores; 2) the hidden-to-output tensor G^(t) is defined to be the same as the input-to-hidden tensor G^(1); 3) G and G^(t) have no parameters in common. We provide a detailed explanation of the relationship between the different TT cores in Appendix A.
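The recursion of Eq. 7 and the output rule of Eq. 8 can be sketched in a few lines of NumPy (our own illustration with hypothetical random cores; as in the paper, the hidden-to-output core is tied to G^(1)):

```python
import numpy as np

rng = np.random.default_rng(2)
V, R = 5, 4

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical parameters: G1 serves as both input-to-hidden and
# hidden-to-output core G^(t); G is the shared bilinear core.
G1 = rng.standard_normal((V, R))
G = rng.standard_normal((V, R, R))        # indices permuted to |V| x R x R

def ttlm_step(h, x):
    """Eq. 7: h^(t) = f(x^(t))^T G h^(t-1); the one-hot picks slice G[x]."""
    return G[x] @ h                        # G[x] in R^{R x R}

words = [3, 1, 4]
h = G1[words[0]]                           # h^(1) is a row of G^(1) (Appendix A)
for w in words[1:]:
    h = ttlm_step(h, w)
y = softmax(G1 @ h)                        # Eq. 8 with G^(t) tied to G^(1)
assert y.shape == (V,) and np.isclose(y.sum(), 1.0)
```

Each step costs O(R²) once the one-hot input selects a slice of G, so the cost of scoring a length-N text is linear in N.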

5. TTLM VARIANTS

To show the versatility and practical applicability of the TTLM framework, we now propose two new variants: TTLM-Large and TTLM-Tiny in Sec. 5.1. In Sec. 5.2, we briefly summarize the relationship between TTLM and some widely-used RNNs.

5.1. NEW VARIANTS: TTLM-LARGE AND TTLM-TINY

The TT core G in TTLM is a full third-order tensor. In the two variants, we decompose G into several separate tensors without violating the TT format, as shown in Fig. 2b and Fig. 2c. We define TTLM-Tiny and TTLM-Large as follows:

h^(t)_Tiny = f(x^(t))^T W_xe δ W_hh h^(t-1)_Tiny   (12)
h^(t)_Large = f(x^(t))^T W_xe W_eh δ W_hh h^(t-1)_Large   (13)

where W_xe ∈ R^{|V|×R×R} is the input-to-hidden tensor; W_eh ∈ R^{R×R×R×R}; and δ ∈ R^{R×R×R×R} is a fourth-order diagonal tensor such that δ_{ijkl} = 1 iff i = j = k = l, and δ_{ijkl} = 0 otherwise. The relationship between our proposed models and TTLM is as follows: W_xe in both models plays the same role as G^(t) in TTLM (i.e., input-to-hidden and hidden-to-output), while G = W_xe δ W_hh in TTLM-Tiny and G = W_xe W_eh δ W_hh in TTLM-Large. As in RNNs, we compute the conditional probability recursively for TTLM-Large and TTLM-Tiny as:

y^(t) = softmax(V P h^(t))

where V ∈ R^{R×|V|×R} is an output embedding tensor and P ∈ R^{R×R×R} is a projection tensor. We then tie the input tensor W_xe to the output embedding tensor V (we provide a detailed explanation in Appendix C). One clear advantage of our models is that they utilize information from the hidden layer and from the input data separately. Such interaction, particularly in TTLM-Tiny, can potentially avoid overfitting, similarly to Wu et al. (2016), where multiplicative integration between two sources of "input" outperforms many other methods. In Sec. 6.2, we provide the relevant experimental evidence.
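The two update rules can be sketched as follows. This is our own minimal NumPy illustration, under the assumption that the diagonal tensor δ reduces each contraction chain to a Hadamard product between the input path and the hidden path (as in the RAC analysis of Appendix B), with the weight shapes simplified to matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
V, R = 6, 4

# Hypothetical weights; delta acts as an elementwise product between the
# transformed input and the transformed hidden state.
W_xe = rng.standard_normal((V, R))   # input-to-hidden
W_eh = rng.standard_normal((R, R))   # extra mixing matrix (TTLM-Large only)
W_hh = rng.standard_normal((R, R))   # hidden-to-hidden

def tiny_step(h, f):
    # TTLM-Tiny: (W_xe^T f) elementwise-times (W_hh h)
    return (f @ W_xe) * (W_hh @ h)

def large_step(h, f):
    # TTLM-Large: extra W_eh on the input path before the product
    return (f @ W_xe @ W_eh) * (W_hh @ h)

f = np.eye(V)[2]                      # one-hot input word
h = rng.standard_normal(R)
h_tiny, h_large = tiny_step(h, f), large_step(h, f)
assert h_tiny.shape == (R,) and h_large.shape == (R,)

# With W_eh set to the identity, the two cells coincide, which mirrors the
# statement that W_eh is the only difference between the two variants.
W_eh = np.eye(R)
assert np.allclose(large_step(h, f), tiny_step(h, f))
```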

5.2. EXISTING TTLM VARIANTS

Given that the TT cores of TTLM can vary, Appendix B provides a detailed illustration that three existing models, namely Second-order RNNs, RACs, and MI-RNNs, can each be considered a "special" implementation of TTLM. Briefly, the differences between the three models are: 1) Second-order RNNs use the third-order tensor T as the TT cores with an activation function, given by Eq. 14; 2) RACs use W_hx ⊙ W_hh as the TT cores, given by Eq. 18; 3) MI-RNNs use W_hx ⊙ W_hh as the TT cores with an activation function, given by Eq. 19. Along with our two proposed models, we study the experimental performance of Second-order RNNs, RACs, and MI-RNNs compared to TTLM-Large and TTLM-Tiny in Section 6.

6. EXPERIMENTAL EVALUATION

To further understand the properties of TTLM variants, we now investigate the effectiveness of TTLM-Large and TTLM-Tiny compared to Second-order RNNs, RACs, MI-RNNs, and Vanilla-RNNs. We conduct experiments from two distinct perspectives: (1) the rank of the TT decomposition has been shown to correspond to the dimension of the hidden states of RNNs (Khrulkov et al., 2018); in Sec. 6.2, we study the influence of the rank on the effectiveness of our TTLM variants. (2) In Sec. 6.3, we analyse the influence of non-linearity on TTLM variants.

6.1. EXPERIMENTAL SETTING

Tasks, Datasets, and Metrics. We conduct experiments on word-level language modeling datasets. English Penn Treebank (PTB) consists of 929k training words, 73k validation words, and 82k test words, with a 10k-word vocabulary (Marcinkiewicz, 1994). The WikiText-2 dataset (Merity et al., 2016) is derived from Wikipedia articles and consists of 2088k training words, 217k validation words, 245k test words, and a vocabulary of over 30,000 words. We compare the models on the language modeling task, evaluated by perplexity (PPL) (Meister & Cotterell, 2021): the lower the perplexity, the better the model. Baselines. Vanilla RNNs, Second-order RNNs, RACs, and MI-RNNs are our baselines. We also provide some original results of RNN-based models as references (Mikolov & Zweig, 2012; Zaremba et al., 2014; Grave et al., 2016; Merity et al., 2017). The implementation details are given in Appendix C. Hyperparameters. To compare models at the same scale: 1) we set the rank/hidden size of the TTLM variants/Vanilla-RNNs to [5, 10, 20, 25, 30, 35, 40, 45, 50]; 2) the embedding size of these models is the square of the hidden size/rank; 3) to avoid the impact of a large embedding size on model performance, we also use several common embedding sizes for Vanilla-RNNs, setting the embedding size to [100, 200, 300] (we name these RNNs-100, RNNs-200, and RNNs-300 and use them in Fig. 5); 4) the random seed is fixed to ensure that the experimental results are not influenced by weight initialization; 5) we train all models for 50 epochs and choose the best model on the validation set to report results on the test set.
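Since all comparisons below are in terms of PPL, a short note on how it is computed may help. Word-level perplexity is the exponential of the average per-token negative log-likelihood; a sketch (our own, not the paper's evaluation code):

```python
import numpy as np

def perplexity(log_probs):
    """Word-level perplexity from per-token log-probabilities (natural log):
    PPL = exp( -(1/N) * sum_t log p(x_t | x_{<t}) )."""
    return float(np.exp(-np.mean(log_probs)))

# Sanity check: a uniform model over a 10k-word vocabulary (as in PTB)
# has perplexity equal to the vocabulary size.
uniform = np.full(100, np.log(1.0 / 10000))
assert np.isclose(perplexity(uniform), 10000.0)
```

This is also why lower PPL is better: it is the effective branching factor of the model's predictive distribution.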

6.2. RANK AND EFFECTIVENESS ANALYSIS

The rank of the TT format has been used to explain the expressive power and long-term memory capacity of RNNs (Khrulkov et al., 2018; Levine et al., 2018). However, the relationship between rank and effectiveness in language modeling has yet to be shown practically. Below, we evaluate the effectiveness of our models w.r.t. the rank.

Effectiveness. (1) Fig. 4 shows the influence of the rank on our models based on validation PPL. The validation PPL of TTLM-Large drops quickly in the early training steps but tends to rise as the rank increases. In contrast, the validation PPL of TTLM-Tiny decreases stably as the rank increases. (2) The influence of the rank based on test PPL, compared to Vanilla-RNNs, is shown in Fig. 5. Our models outperform Vanilla-RNNs on all parameter settings used. (3) Table 1 provides a supplementary comparison of our models with the baselines and some reference results. TTLM-Large and TTLM-Tiny are more effective than the baselines.

Overfitting. Based on Fig. 4a and Fig. 4b, we find that TTLM-Large is more prone to overfitting than TTLM-Tiny. As the only difference between the two models is W_eh, this suggests that the simpler the parameterization of the TT cores, the more easily the model avoids overfitting. This finding is consistent with the comparison between MI-RNNs and Second-order RNNs by Wu et al. (2016).

Low-scale. Despite the effectiveness of our models under the current hyperparameter settings, Fig. 5 reveals their limited potential when the rank is larger than 40, where the test perplexity of Vanilla-RNNs still decreases stably. We therefore expect our models to outperform vanilla RNNs at low-scale hidden sizes (i.e., from 5 to 50), but not at larger scales; this is a clear tradeoff of the simple parameterization of TT cores used in TTLMs.

6.3. NON-LINEARITY ANALYSIS

Fig. 6 shows the effects of the tanh activation function on TTLM variants based on validation PPL.
Regarding the speed of convergence, tanh speeds up TTLM-Large, TTLM-Tiny, and MI-RNNs while barely influencing Second-order RNNs. Regarding the magnitude of the lowest validation perplexity, tanh impairs the performance of TTLM-Large and TTLM-Tiny, but has little influence on multiplicative integration and on the third-order tensor T in Second-order RNNs. Thus, the influence of non-linearity on TTLM variants depends on the TT core settings, both for the convergence of validation PPL and for the magnitude of the lowest validation PPL. From an experimental point of view, the effect of a non-linearity on one TT variant therefore cannot simply be transferred or analogized to another TT variant. This also suggests that one should be wary of the analogy between tensor decompositions and existing neural network models at the implementation level declared by previous research (Khrulkov et al., 2018; Levine et al., 2018): the activation function can be a factor that influences such an analogy.

7. CONCLUSION

We first apply TT decomposition to real-world language modeling and name the framework TTLM. We propose two variants, TTLM-Large and TTLM-Tiny, and show that they are more effective than Vanilla-RNNs at low-scale hidden sizes.

APPENDIX A RELATIONSHIP BETWEEN TT CORES IN TTLM

To help readers understand the roles of the TT cores in TTLM, we here provide a detailed calculation of the probability of a text X = [x^(1), x^(2), ⋯, x^(N)] by TTLM. Note that all the intermediate TT cores are equal to each other: G = G^(2) = … = G^(N-1), and G^(1) = G^(N). The calculation of y^(t) (i.e., the conditional probability of x^(t) given x^(1:t-1) at time t) can be described in three steps.

As step I, suppose f(x^(1)) is a one-hot vector with f(x^(1))_1 = 1. The contraction of f(x^(1)) with G^(1) in TTLM is:

f(x^(1))^T G^(1) = [f(x^(1))_1, f(x^(1))_2, …, f(x^(1))_{|V|}] ·
  [ G^(1)_{11}   G^(1)_{12}   …  G^(1)_{1R}
    G^(1)_{21}   G^(1)_{22}   …  G^(1)_{2R}
    ⋮            ⋮                ⋮
    G^(1)_{|V|1} G^(1)_{|V|2} …  G^(1)_{|V|R} ]
= (G^(1)_{11}, G^(1)_{12}, ⋯, G^(1)_{1R})^T = h^(1)_TTLM

As step II, TTLM calculates f(x^(i))^T G h^(i-1)_TTLM for i ∈ {2, 3, ⋯, t-1}. For example, at time t = 2, h^(2)_TTLM is calculated by Eq. 7 as h^(2)_TTLM = f(x^(2))^T G h^(1)_TTLM.

As step III, TTLM outputs y^(t) via:

G^(t) h^(t-1)_TTLM =
  [ G^(t)_{11}   …  G^(t)_{1R}
    G^(t)_{21}   …  G^(t)_{2R}
    ⋮                 ⋮
    G^(t)_{|V|1} …  G^(t)_{|V|R} ] · (h^(t-1)_{TTLM,1}, h^(t-1)_{TTLM,2}, …, h^(t-1)_{TTLM,R})^T
= ( Σ_{i=1}^{R} G^(t)_{1i} h^(t-1)_{TTLM,i}, Σ_{i=1}^{R} G^(t)_{2i} h^(t-1)_{TTLM,i}, …, Σ_{i=1}^{R} G^(t)_{|V|i} h^(t-1)_{TTLM,i} )^T

Observing this calculation, G^(1), G, and G^(t) theoretically have no parameters in common (though we set G^(1) = G^(t) for simplicity). Further, their roles in TTLM differ: G^(1) can be viewed as a word embedding matrix; G combines two sources of information, i.e., the hidden state and the input word; G^(t) extracts the evidence in h^(t-1)_TTLM and generates a set of scores over the vocabulary.
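Steps I and III above can be checked numerically; a small NumPy sketch (our own, with random cores):

```python
import numpy as np

rng = np.random.default_rng(4)
V, R = 5, 3

G1 = rng.standard_normal((V, R))
f1 = np.eye(V)[0]                     # one-hot f(x^(1)) with a 1 in position 1

# Step I: contracting the one-hot input with G^(1) selects the matching row,
# so h^(1) is simply a row of the word-embedding-like matrix G^(1).
h1 = f1 @ G1
assert np.allclose(h1, G1[0])

# Step III: G^(t) h^(t-1) yields one score per vocabulary word,
# (G^(t) h)_w = sum_i G^(t)_{wi} h_i.
Gt = rng.standard_normal((V, R))
scores = Gt @ h1
assert np.allclose(scores, [Gt[w] @ h1 for w in range(V)])
```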

B RELATIONSHIP BETWEEN TTLM AND SOME RNNS

We now demonstrate the relationship between TTLM and Second-order RNNs, Recurrent Arithmetic Circuits (RACs), and Multiplicative Integration RNNs (MI-RNNs). To avoid symbol clutter when representing the different RNNs, the notation is: W_hx ∈ R^{R×|V|} denotes the input-to-hidden matrix, W_hh ∈ R^{R×R} denotes the hidden-to-hidden matrix, and ϕ(·) is an element-wise nonlinear activation function. The hidden states are denoted as: Second-order RNNs (h^(t)_2nd), RACs (h^(t)_RAC), and MI-RNNs (h^(t)_MI).

B.1 RELATION TO SECOND-ORDER RNNS

Unlike Vanilla-RNNs (Mikolov & Zweig, 2012), which have additive blocks, Second-order RNNs have a multiplicative interaction between hidden states and input data. This is achieved by a third-order tensor T, with the i-th coordinate of the hidden state h^(t)_2nd defined as (Hochreiter & Schmidhuber, 1997; Maupomé & Meurs, 2020):

h^(t)_{2nd,i} = ϕ(f(x^(t))^T T_{i,:,:} h^(t-1)_2nd + b)   (14)

where T_{i,:,:} ∈ R^{M×R} is the i-th slice of the tensor T ∈ R^{M×R×R}, and b is a bias vector. For simplicity, we omit b for the other RNN variants, since b can be seen as the 0-th component of f(x^(t)), set equal to 1. Rabusseau et al. (2019) showed that Tensor Trains can generalize linear Second-order RNNs. We here provide a basic proof from the perspective of the recursive property of TTLM.

Claim B.1. The third-order tensor T in Second-order RNNs equals the TT cores in TTLM. The hidden states of Second-order RNNs are identical to those of TTLM if the latter are accompanied by an activation function.

Proof. The proof is based on the following observation. We recursively unfold the calculation of TTLM in Eq. 5:

A_{w_1,⋯,w_N} = Σ_{i_1=1}^{|V|} f(x^(1))_{i_1} G^(1)_{i_1 α_1} ⋯
            = Σ_{i_1,i_2=1}^{|V|} Σ_{α_1=1}^{R} f(x^(1))_{i_1} G^(1)_{i_1 α_1} f(x^(2))_{i_2} G_{α_1 i_2 α_2} ⋯
            ⋮
            = Σ_{i_1,⋯,i_N=1}^{|V|} Σ_{α_1,⋯,α_{N-1}=1}^{R} f(x^(1))_{i_1} G^(1)_{i_1 α_1} f(x^(2))_{i_2} G_{α_1 i_2 α_2} ⋯ f(x^(N))_{i_N} G^(N)_{α_{N-1} i_N}   (15)

Observe that at each time step, G has two sources of "input": the information from the previous recursive unfolding (e.g., in the second line, the first line is the previous information), and the input data f(x^(t)). From this perspective, G acts as a bilinear map G: R^{|V|} × R^R → R^R, and we can regard the information in the previous line as a hidden state h_TTLM, given by:

h^(t)_{TTLM,α_t} = Σ_{i_t=1}^{|V|} Σ_{α_{t-1}=1}^{R} f(x^(t))_{i_t} G_{i_t α_t α_{t-1}} h^(t-1)_{TTLM,α_{t-1}}   (16)

where we permute the indices of G_{α_{t-1} i_t α_t} to G_{i_t α_t α_{t-1}} (note that this does not change the number of indices).
We can also express the hidden states of Second-order RNNs in Eq. 14 element-wise:

h^(t)_{2nd,i} = ϕ(f(x^(t))^T T_{i,:,:} h^(t-1)_2nd) = ϕ( Σ_{j=1}^{|V|} Σ_{k=1}^{R} f(x^(t))_j T_{jik} h^(t-1)_{2nd,k} )   (17)

where j, k are dummy indices corresponding to i_t, α_{t-1} in Eq. 16, and i specifies the coordinate of the hidden state. Thus, T and G are same-sized trainable bilinear maps. Having shown that the third-order tensor T in Second-order RNNs equals the TT core G, the only difference between the hidden states in Eq. 17 and Eq. 16 is the activation function. If we add an activation function to h^(t)_TTLM, the hidden states of Second-order RNNs and TTLM are identical, as shown in Fig. 7a.
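The equivalence claimed above is easy to verify numerically. A minimal NumPy sketch (our own illustration; the tensor T is random) showing that the second-order update (Eq. 17) and the TTLM bilinear contraction with G = T (Eq. 16) agree:

```python
import numpy as np

rng = np.random.default_rng(5)
V, R = 5, 3

T = rng.standard_normal((V, R, R))    # third-order tensor, indices (j, i, k)
f = np.eye(V)[1]                      # one-hot input
h = rng.standard_normal(R)

# Second-order RNN pre-activation, component i: f^T T[:, i, :] h  (Eq. 17).
second_order = np.array([f @ T[:, i, :] @ h for i in range(R)])

# TTLM hidden state with G = T (Eq. 16): the same bilinear contraction.
ttlm = np.einsum('j,jik,k->i', f, T, h)
assert np.allclose(second_order, ttlm)

# Adding the activation recovers the second-order RNN exactly (Claim B.1).
assert np.allclose(np.tanh(second_order), np.tanh(ttlm))
```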

B.2 RELATION TO RACS AND MI-RNNS

We here focus on Multiplicative Integration (MI), a way to combine two sources of input via the Hadamard product '⊙'. MI has been used in RACs, Multiplicative RNNs (M-RNNs) (Sutskever et al., 2011), and MI-RNNs. Recurrent Arithmetic Circuits (RACs) are recurrent networks with hidden states h^(t)_RAC defined as (Levine et al., 2018):

h^(t)_RAC = W_hx f(x^(t)) ⊙ W_hh h^(t-1)_RAC   (18)

where these hidden states are also used as an intermediate term in M-RNNs. Multiplicative Integration RNNs (MI-RNNs) are RACs with an activation function, with hidden states h^(t)_MI defined as (Wu et al., 2016):

h^(t)_MI = ϕ(W_hx f(x^(t)) ⊙ W_hh h^(t-1)_MI)   (19)

Claim B.2. Suppose the TT cores satisfy G = W_hx ⊙ W_hh. Then the hidden states of RACs are identical to those of TTLM, and the hidden states of MI-RNNs are identical to those of TTLM if the latter are accompanied by an activation function.

Proof. The proof is based on the following observation. In the language of tensor contractions, Eq. 18 contracts the input weight matrix W_hx with the input vector f(x^(t)), and the hidden weight matrix W_hh with h^(t-1)_RAC. The Hadamard product of the two corresponds to a third-order diagonal tensor δ ∈ R^{R×R×R} such that δ_{ijk} = 1 iff i = j = k, and δ_{ijk} = 0 otherwise. Thus, we can write Eq. 18 element-wise:

h^(t)_{RAC,α_t} = Σ_{i_t=1}^{|V|} Σ_{α_{t-1}=1}^{R} Σ_{j,k=1}^{R} f(x^(t))_{i_t} (W_hx)_{j i_t} δ_{j α_t k} (W_hh)_{k α_{t-1}} h^(t-1)_{RAC,α_{t-1}} = Σ_{i_t=1}^{|V|} Σ_{α_{t-1}=1}^{R} f(x^(t))_{i_t} G_{i_t α_t α_{t-1}} h^(t-1)_{RAC,α_{t-1}}   (20)

where G = W_hx ⊙ W_hh. In this case, the hidden state of TTLM in Eq. 16 equals the hidden state of RACs in Eq. 20, as shown in Fig. 7b. Similarly, if Eq. 16 is accompanied by an activation function ϕ, it equals the hidden state of MI-RNNs in Eq. 19, as shown in Fig. 7c. Given Claims B.1 and B.2, the three models should be simulable by TTLM with a non-linear activation function; we leave a rigorous proof of this conjecture to future work.
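The identity between the Hadamard product and the contraction with the diagonal tensor δ can be checked directly. A minimal NumPy sketch (our own, with random weights):

```python
import numpy as np

rng = np.random.default_rng(6)
V, R = 5, 3

W_hx = rng.standard_normal((R, V))
W_hh = rng.standard_normal((R, R))
f = np.eye(V)[2]                      # one-hot input
h = rng.standard_normal(R)

# RAC update, Eq. 18: Hadamard product of the two contracted inputs.
rac = (W_hx @ f) * (W_hh @ h)

# The same update via the third-order diagonal tensor delta (Eq. 20):
# delta_{jak} = 1 iff j == a == k, 0 otherwise.
delta = np.zeros((R, R, R))
for i in range(R):
    delta[i, i, i] = 1.0
u, v = W_hx @ f, W_hh @ h
via_delta = np.einsum('j,jak,k->a', u, delta, v)
assert np.allclose(rac, via_delta)
```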



Footnotes:
1. The code is available at https://github.com/tensortrainlm/tensortrainlm
2. Tensor networks are, roughly, decompositions of large tensors into sets of smaller tensors, and have been employed in physics, mathematics, and machine learning (Sun et al., 2020; Novikov et al., 2015; Cohen et al., 2016; Stoudenmire & Schwab, 2016b; Cheng et al., 2019; Novikov et al., 2016; Selvan & Dam, 2020).
3. Most of the notation here follows the textbook Deep Learning (Goodfellow et al., 2016).
4. Such dependencies (including collocation) have been viewed as an analogue of entanglement (Hou et al., 2013).



Figure 1: A quick introduction to tensor diagram notation. There are two rules of tensor diagrams: (1) tensors are notated by solid shapes with a number of 'legs' corresponding to their indices; (2) connecting two index lines implies a contraction or summation over the connected indices. In this paper, we augment our equations with these diagrams to make them easier to visualize.

Figure 2: a) Tensor Train Language Model based on Eq. 5. b) TT core of TTLM-Tiny. c) TT core of TTLM-Large. The dashed line in the square represents A, Φ(X), or G. Note that the only difference between TTLM-Large and TTLM-Tiny is whether to use tensor W eh .

Figure 4: Rank analysis for the TTLM-Large and TTLM-Tiny on PTB.

Figure 5: Comparison of test perplexity on PTB. RNNs here is Vanilla-RNNs, and its embedding size is the same as TTLM-Large and TTLM-Tiny. RNNs-100, RNNs-200 and RNNs-300 are the RNNs with fixed embedding sizes of 100, 200 and 300, respectively.

Figure 6: Comparison of the influence of non-linearity on TTLM variants on PTB. The suffix -tanh refers to a model using the tanh activation function. Second-linear refers to Second-order RNNs without an activation function.

Figure 7: a) Second-order RNNs under the TTLM framework. b) Hidden state of RACs under the TTLM framework. c) Hidden state of MI-RNNs under the TTLM framework. The dashed line in the square denotes A, Φ(X), or G. The small hollow circles denote the activation functions.

Table 1: PPL evaluation on the test sets of WikiText-2 and PTB. Models tagged with * are re-implemented by ourselves. The symbol "-" means the data are not available in the original paper. Params are the numbers of training parameters; the details are in Appendix C.

C IMPLEMENTATIONS

We implement all RNN models, TTLM, TTLM-Large, and TTLM-Tiny in PyTorch. The weights of the models are adjusted to minimize the average cross-entropy loss over training sequences via stochastic gradient descent, computed using the truncated backpropagation-through-time algorithm (Werbos, 1990; Williams & Peng, 1990). For RNNs, there are the following matrix parameters: W_xe ∈ R^{E×|V|} is the input embedding matrix, W_eh ∈ R^{E×H} is the embedding-to-hidden matrix, and W_hh ∈ R^{H×H} is the hidden-to-hidden matrix. We tie (share the training parameters of) the input embedding W_xe and the output embedding V, which has been shown to lead to a significant reduction in perplexity (Press & Wolf, 2016); accordingly, there is a projection matrix P ∈ R^{H×E} before the output embedding. This procedure is introduced in Press & Wolf (2016). For the TTLM models, we tie the input tensor W_xe and V. The diagonal tensor δ is implemented via a reshape operation, so the interaction between the hidden state and the input can be computed by matrix products. We also let G^(1) have the same parameters along the dimension |V| (i.e., G^(1) is simplified to G^(1) ∈ R^{1×R} and can be viewed as the initial hidden state h^(0)).
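The tying-plus-projection scheme described above can be sketched as follows (our own NumPy illustration with hypothetical sizes; the paper's implementation is in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(7)
Vsz, E, H = 100, 16, 4                # vocabulary, embedding, hidden sizes

W_xe = rng.standard_normal((Vsz, E))  # input embedding (one row per word)
P = rng.standard_normal((H, E))       # projection from hidden to embedding space

def logits(h):
    # The output embedding is tied to W_xe: project h from H to E, then score
    # every word by a dot product with its own input-embedding row, so input
    # and output layers share the same parameters.
    return W_xe @ (P.T @ h)

h = rng.standard_normal(H)
assert logits(h).shape == (Vsz,)
```

Tying halves the embedding parameter count, which matters here because the embedding matrices dominate the parameter budgets reported in Table 1.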

