LANGUAGE MODELING USING TENSOR TRAINS

Abstract

Tensor networks have previously been shown to have theoretical potential for language modeling, but practical evidence is lacking. We propose a novel Tensor Train Language Model (TTLM) based on the Tensor-Train decomposition. To show the usefulness of TTLM, we perform a principled experimental evaluation on real-world language modeling tasks, showing that our proposed variants, TTLM-Large and TTLM-Tiny, can be more effective than vanilla RNNs at small hidden sizes. We also demonstrate the relationship between TTLM and Second-order Recurrent Neural Networks (RNNs), Recurrent Arithmetic Circuits, and Multiplicative Integration RNNs: the architectures of all of these are, essentially, special cases of that of TTLM.[1]

1. INTRODUCTION

A language model assigns probabilities to sequences of words from a vocabulary V; the number of possible texts grows exponentially with the length N, so the domain of a language model is, by definition, the exponential space V^N. Because this exponential space is intractable, existing work tends to use recurrent or auto-regressive architectures that generate conditional probabilities based on the context (typically encapsulated as a fixed-length dense vector), which simplifies the calculation. Recently, researchers (Pestun & Vlassopoulos, 2017; Miller et al., 2021; Zhang et al., 2019) have reconsidered the view of language models as joint probabilities of text, as it leads to exponential representations in tensor space, where connections between words can be preserved when measuring joint probabilities. To deal with the exponential space complexity, a mathematical tool called the 'tensor network'[2] has been used to reduce the exponential space of language modeling to a tractable one (Pestun & Vlassopoulos, 2017). However, the 'tensor network language model' of Pestun & Vlassopoulos (2017) is, to date, a concept whose practical viability remains to be demonstrated. As a proof of concept, we derive a Tensor Train Language Model (TTLM), based on the tensor train, the simplest tensor network. Technically, we represent a sentence in the exponential semantic space constructed by the tensor product of word representations. The probability of the sentence is obtained as the inner product of two high-dimensional tensors: the input Φ(X) and the global coefficients A. Within the TTLM framework, we propose two variants: TTLM-Tiny and TTLM-Large. We also clarify the relationship between the proposed TTLM and a series of Recurrent Neural Networks (RNNs), namely Second-order RNNs (Goudreau et al., 1994), Recurrent Arithmetic Circuits (RACs) (Levine et al., 2018), and Multiplicative Integration RNNs (MI-RNNs) (Wu et al., 2016).
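To make the construction concrete, the following is a minimal NumPy sketch (with hypothetical toy dimensions, not the paper's actual model sizes) of scoring a sentence as the inner product of the rank-one feature tensor Φ(X) = φ(x_1) ⊗ ... ⊗ φ(x_N) with a coefficient tensor A stored in tensor-train form. Because Φ(X) is rank-one, the inner product collapses into a left-to-right contraction over the TT cores, which is exactly the RNN-like recurrence the paper's connections to RNNs rest on:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, r, N = 10, 6, 4, 5   # vocab size, feature dim, TT-rank, length (all hypothetical)

phi = rng.standard_normal((V, d))          # word feature map: phi[w] in R^d
cores = rng.standard_normal((N, r, d, r))  # TT cores of the coefficient tensor A
alpha = rng.standard_normal(r)             # left boundary vector
omega = rng.standard_normal(r)             # right boundary vector

sentence = [3, 1, 4, 1, 5]                 # word indices, length N

# Since Phi(X) = phi(x_1) ⊗ ... ⊗ phi(x_N) is rank-one, the inner product
# <Phi(X), A> collapses into a left-to-right contraction: an RNN-like recurrence.
h = alpha
for t, w in enumerate(sentence):
    # Contract core t with the word feature, then propagate the hidden state.
    M = np.einsum('idj,d->ij', cores[t], phi[w])  # shape (r, r)
    h = h @ M
score = h @ omega

# The same number computed naively in the full exponential space
# (feasible only for tiny N and d):
A = np.einsum('i,idj,jek,kfl,lgm,mhn,n->defgh',
              alpha, cores[0], cores[1], cores[2], cores[3], cores[4], omega)
Phi = np.einsum('a,b,c,d,e->abcde', *(phi[w] for w in sentence))
assert np.isclose(score, np.sum(A * Phi))
```

The sketch stores N·r·d·r parameters instead of the d^N entries of the dense coefficient tensor, which is the tractability argument made above.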
These connections open a new perspective on understanding RNNs and suggest natural implementations of TTLM. We benchmark the TTLM variants and analyze the differences in their working mechanisms and behaviors. Experimental results on language modeling tasks show that our TTLM variants can outperform vanilla RNNs under the same training settings, demonstrating the feasibility of TTLM. The main contributions of our work can be summarized as follows:

1. We propose a novel Tensor Train Language Model, as a first attempt to apply tensor networks to real-world language modeling tasks.

2. We propose two novel TTLM variants, TTLM-Large and TTLM-Tiny, and theoretically demonstrate the relationship between TTLM and a series of existing RNNs.

3. Compared to vanilla RNNs on the WikiText-2 and PTB datasets, TTLM-Large reduces perplexity by 14.3 and 16.0, respectively, and TTLM-Tiny reduces perplexity by 1.7 and 8.5, respectively.

2. RELATED WORK

Previous studies of tensor networks in machine learning have mainly been devoted to analyzing the theoretical properties of neural networks. They have improved our understanding of feed-forward, convolutional, and recurrent architectures, including parameter compression (Novikov et al., 2015), expressive power (Cohen et al., 2016; Cohen & Shashua, 2016; Khrulkov et al., 2018), and depth efficiency for long-term memory (Levine et al., 2018). In natural language modeling, certain studies have tensorized existing network architectures (Novikov et al., 2015), while few have applied tensor networks alone as a language model. To the best of our knowledge, tensor network language models have remained a theoretical proposal rather than an empirical model (Pestun & Vlassopoulos, 2017; Pestun et al., 2017). Miller et al. (2021) is, to our knowledge, the only other work that uses tensor networks for probabilistic sequence modeling, but it does not scale to real-world sequence modeling tasks. We are the first to derive a tensor network language model that can be applied to real-world language modeling datasets, with variants that outperform vanilla RNNs at smaller hidden sizes.

3. PRELIMINARIES

We briefly recapitulate basic notions and notation[3]; full technical introductions can be found in standard textbooks (e.g., Bi et al. (2022); Itskov (2009)).

Notation. For the purposes of this paper, every tensor A is a multidimensional array of elements (called components) of R, each denoted by its integer coordinates in the array; e.g., for a two-dimensional array, the component at position i, j ∈ N is denoted A_{ij}. The order of a tensor is the number of indices it has (e.g., a vector v is a first-order tensor, a matrix M is a second-order tensor, etc.). The dimension of a tensor refers to the number of values that a particular index (or so-called mode) can take; e.g., the dimension of B ∈ R^{I_1 × I_2 × I_3} is I_1 × I_2 × I_3.

Tensor product. For two tensors C ∈ R^{I_1 × ⋯ × I_j} (order j) and D ∈ R^{I_{j+1} × ⋯ × I_{j+k}} (order k), their tensor product, denoted ⊗, returns a tensor E = C ⊗ D of order j + k with components E_{i_1 ⋯ i_{j+k}} = C_{i_1 … i_j} · D_{i_{j+1} … i_{j+k}}.

Generalized inner product. For two tensors X, Y ∈ R^{I_1 × I_2 × ⋯ × I_N} of the same size, their inner product is defined as ⟨X, Y⟩ = Σ_{i_1=1}^{I_1} Σ_{i_2=1}^{I_2} ⋯ Σ_{i_N=1}^{I_N} X_{i_1, i_2, …, i_N} Y_{i_1, i_2, …, i_N}. For two tensors X ∈ R^{I_1 × I_2 × ⋯ × I_N × I_x} and Y ∈ R^{I_1 × I_2 × ⋯ × I_N × I_y} sharing N modes of the same size, the "generalized inner product" defined in Kossaifi et al. (2020) is calculated as (⟨X, Y⟩_N)_{i_x, i_y} = Σ_{i_1=1}^{I_1} Σ_{i_2=1}^{I_2} ⋯ Σ_{i_N=1}^{I_N} X_{i_1, i_2, …, i_N, i_x} Y_{i_1, i_2, …, i_N, i_y}, with ⟨X, Y⟩_N ∈ R^{I_x × I_y}.
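These three operations can be checked directly in NumPy; the following is a minimal sketch (the shapes are arbitrary illustrations, not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tensor product: C (order 2) ⊗ D (order 1) -> E (order 3),
# with E[i, j, k] = C[i, j] * D[k].
C = rng.standard_normal((2, 3))   # order-2 tensor, dimension 2 x 3
D = rng.standard_normal((4,))     # order-1 tensor, dimension 4
E = np.tensordot(C, D, axes=0)    # axes=0 gives the tensor (outer) product
assert E.shape == (2, 3, 4)
assert np.isclose(E[1, 2, 3], C[1, 2] * D[3])

# Inner product of two tensors of identical shape: sum over all indices.
X = rng.standard_normal((2, 3, 4))
Y = rng.standard_normal((2, 3, 4))
inner = np.sum(X * Y)

# Generalized inner product <X, Y>_N: contract the N shared leading modes,
# leaving a matrix of shape (I_x, I_y).
Xg = rng.standard_normal((2, 3, 5))  # shared modes (2, 3), free mode I_x = 5
Yg = rng.standard_normal((2, 3, 7))  # shared modes (2, 3), free mode I_y = 7
G = np.einsum('ijx,ijy->xy', Xg, Yg)
assert G.shape == (5, 7)
```

Note that the plain inner product is the special case of the generalized one in which both free modes are absent (or have size 1).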

4. LANGUAGE MODELING USING TENSOR TRAINS

We introduce a language model in tensor space in Sec. 4.1. We define our general Tensor Train Language Model in Sec. 4.2, and its special case, TTLM, in Sec. 4.3.



[1] The code is available at https://github.com/tensortrainlm/tensortrainlm

[2] Tensor networks are, roughly, decompositions of large tensors into sets of smaller tensors; they have been employed in physics, mathematics, and machine learning (Sun et al., 2020; Novikov et al., 2015; Cohen et al., 2016; Stoudenmire & Schwab, 2016b; Cheng et al., 2019; Novikov et al., 2016; Selvan & Dam, 2020).

[3] Most of the notation here follows the textbook Deep Learning (Goodfellow et al., 2016).

