TEXTTN: PROBABILISTIC ENCODING OF LANGUAGE ON TENSOR NETWORK

Abstract

As a novel model that bridges machine learning and quantum theory, the tensor network (TN) has recently gained increasing attention and has been successfully applied to processing natural images. For natural language, however, it remains unclear how to design a probabilistic encoding architecture that can efficiently and accurately learn and classify texts with a TN. This paper proposes a general two-step text classification scheme based on tensor networks, named TextTN. TextTN first encodes the word vectors in a probabilistic space by a generative TN (word-GTN), and then classifies a text sentence using a discriminative TN (sentence-DTN). Moreover, in the sentence-DTN, the key hyper-parameter (i.e., the bond dimension) can be analyzed and selected from the theoretical expressive power of TextTN. In experiments, our TextTN also obtains the state-of-the-art result on the SST-5 sentiment classification task.

1. INTRODUCTION

Machine learning combined with quantum mechanics forms a novel interdisciplinary field known as quantum machine learning (Huggins et al., 2019; Ran et al., 2020). The tensor network (TN), as a novel model, has become prominent in this field (Biamonte et al., 2017). On the one hand, tensor networks can be used as a mathematical tool to deepen the theoretical understanding of existing neural network methods (Levine et al., 2018; 2019). On the other hand, new machine learning algorithms have been built on tensor networks, e.g., the discriminative TN (DTN) (Stoudenmire & Schwab, 2016) for supervised tasks and the generative TN (GTN) (Han et al., 2018) for unsupervised scenarios. Based on the natural analogy between quantum concepts (e.g., the quantum many-body system (Levine et al., 2018)) and image representations, many studies and applications have been conducted for processing and learning natural pictures (Stoudenmire & Schwab, 2016; Sun et al., 2020; Liu et al., 2019). For natural language, however, it remains unclear how to design an efficient and effective TN approach that can accurately learn and classify texts.

In the field of natural language processing (NLP), researchers have noticed the analogy between the quantum many-body wave function and the interactions of words (via the tensor product) in a text sentence, and have developed a quantum-inspired language representation (Zhang et al., 2018). Based on quantum many-body physics and tensor decomposition techniques, Zhang et al. (2018) provided a mathematical understanding of existing convolutional neural network (CNN) based text classification methods. Similarly, a tensor space language model (TSLM) has been built on the tensor network formulation (Zhang et al., 2019); this work shows that TSLM is a more general language model than n-gram and recurrent neural network (RNN) based language models.
In implementation, however, TSLM did not provide a tensor network algorithm. The challenge lies in the high dimensionality of each word vector, which is much higher than that of a pixel representation in the image scenario. After the tensor product of a number of word vectors, the resulting high-order tensors become computationally intractable. More recently, a tensor network algorithm, namely the uniform matrix product state (u-MPS) model, has been proposed for probabilistic modeling of text sequences (Miller et al., 2020). u-MPS was evaluated on a context-free language task using a synthetic data set; however, it has not been applied to a real-world NLP task, e.g., a typical language modeling or text classification task. In addition, the expressive power of u-MPS has not been investigated. The expressive power of tensor networks is a fundamental property of various TNs and has been systematically studied for tensor network factorizations of multivariate probability distributions (Glasser et al., 2019). This motivates us to make use of the theoretical properties of TN expressive power to develop a tensor network based probabilistic model for natural language representation and classification.

To build such a text tensor network, we need to address two research problems in this paper. First, how can we design a probabilistic encoding architecture that efficiently and effectively learns and classifies text? Second, how can we analyze its expressive power, and use this analysis for a deeper theoretical understanding and better practical effectiveness of the text tensor network? We propose a novel tensor network architecture, named TextTN, which encodes each word vector with word-GTNs and classifies the sentence with a sentence-DTN. First, the proposed word-GTNs train a TN for each word, treating each element of a word vector as a node.
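To make the intractability concrete, the following minimal numpy sketch (illustrative only; the dimensions are hypothetical and not taken from the paper) compares the size of a full tensor product of word vectors with the parameter count of an MPS/tensor-train surrogate:

```python
import numpy as np

d, n = 50, 8                       # word-embedding dim, sentence length (toy values)
words = [np.random.rand(d) for _ in range(n)]

# The full tensor product of n word vectors has d**n entries -- already
# intractable at these toy sizes, let alone for 300-dim embeddings.
print(f"full tensor would hold {d**n:.2e} entries")

# An MPS/tensor-train with bond dimension D stores only about
# n * d * D**2 parameters, a low-rank surrogate for the same object.
D = 16
print(f"MPS with bond dim {D}: about {n * d * D * D} parameters")
```

This parameter-count gap (exponential versus linear in the number of words) is exactly the "exponential wall" that the low-rank MPS structure is meant to avoid.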
In this manner, the word-GTNs first map a high-dimensional word vector to a low-dimensional linear space through tensor network operators. Then, the second-layer tensor network, called the sentence-DTN, trains a TN for each sentence, taking the low-dimensional word vectors produced by the word-GTNs as its input. In TextTN, a sentence is represented by the tensor product of word vectors. Therefore, the interactions among different word vectors and among different dimensions are both modeled in TextTN. These interactions are encoded in high-order weight tensors, which represent a high-order semantic space. In both the word-GTNs and the sentence-DTN, the high-order tensor can be handled by a tensor network, i.e., a matrix product state model, which uses the idea of low-rank approximation to overcome the exponential wall problem (Watson & Dunn, 2010).

In the sentence-DTN, the bond dimension is an important hyper-parameter and reflects the expressive power of TextTN. In this paper, we analyze the upper and lower bounds of the bond dimension. In particular, its lower bound can be determined by the entanglement entropy, which can be regarded as a measure of the communication information encoded in the tensor network. A reference bond dimension can be set to this lower bound, as we assume that a larger value implies information redundancy while a smaller value indicates insufficient expressive power of the TN. In the experiments, this reference bond dimension achieves effective classification results, which indicates a practical advantage of TextTN: its potential to save hyper-parameter tuning effort. Moreover, word interactions are taken into account in the sentence-DTN through the joint effect of different words on the subsequent class prediction via the loss function.
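The entanglement-entropy criterion for the bond dimension can be sketched as follows. This is a toy numpy illustration under our own assumptions (a random state on six two-dimensional sites), not the paper's procedure: the von Neumann entropy S across a bipartition is computed from the singular values of the reshaped state, and exp(S) gives a rough lower bound on the number of significant singular values, i.e., a reference bond dimension.

```python
import numpy as np

def entanglement_entropy(psi, cut):
    """von Neumann entropy across a left/right bipartition of a state
    tensor psi of shape (d, ..., d), cut after site `cut`."""
    d_left = int(np.prod(psi.shape[:cut]))
    m = psi.reshape(d_left, -1)
    s = np.linalg.svd(m, compute_uv=False)
    p = s**2 / np.sum(s**2)          # Schmidt spectrum as probabilities
    p = p[p > 1e-12]                 # drop numerical zeros before the log
    return -np.sum(p * np.log(p))

# Toy 6-site random state (hypothetical data, not text features).
rng = np.random.default_rng(0)
psi = rng.normal(size=(2,) * 6)
psi /= np.linalg.norm(psi)

S = entanglement_entropy(psi, cut=3)
# exp(S) lower-bounds the number of significant Schmidt values,
# hence a reference bond dimension of roughly ceil(exp(S)).
print(S, int(np.ceil(np.exp(S))))
```

For a cut with at most 2^3 = 8 Schmidt values, S is bounded by log 8, so the reference bond dimension never exceeds the exact-representation bound, matching the upper/lower-bound picture described above.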
For the learning algorithm, we observe that different word positions carry different weights in a sentence, so the one-function training (for a specific position) of the original DTN is inappropriate. We therefore propose an all-function training process for the sentence-DTN to improve the stability of TextTN. We evaluate TextTN on four major text classification datasets (MR, Subj, CR, and MPQA). The results show that TextTN outperforms the convolutional neural network (CNN) (Kim, 2014) on all four datasets; this departs from vision tasks, where, according to the recent literature, a tensor network has not been reported to outperform a CNN. In addition, using word vectors from the pre-trained BERT model, TextTN achieves better results than BERT on the SST-2 and SST-5 tasks, and the accuracy of BERT+TextTN is comparable with the state-of-the-art (SOTA) result on the SST-5 dataset.
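The all-function idea can be sketched with a minimal numpy toy (our own parameterisation with hypothetical dimensions, not the authors' code): the label tensor is placed at each position j of a random MPS in turn, and the cross-entropy is averaged over all positions, rather than training the single fixed-position function of the original DTN.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, L = 4, 2, 3, 2            # words, feature dim, bond dim, classes (toy)

def shape(i):                       # bond sizes; boundary bonds have size 1
    return (1 if i == 0 else D, d, 1 if i == n - 1 else D)

cores = [0.1 * rng.normal(size=shape(i)) for i in range(n)]
# One label core per position j, carrying the extra class index of size L.
label_cores = [0.1 * rng.normal(size=(shape(j)[0], d, L, shape(j)[2]))
               for j in range(n)]

def logits(x, j):
    """Contract the MPS with inputs x (n x d), label index at position j."""
    left = np.ones((1,))
    for i in range(j):
        left = np.einsum('a,abc,b->c', left, cores[i], x[i])
    right = np.ones((1,))
    for i in range(n - 1, j, -1):
        right = np.einsum('abc,b,c->a', cores[i], x[i], right)
    return np.einsum('a,ablc,b,c->l', left, label_cores[j], x[j], right)

def all_function_loss(x, y):
    """Average cross-entropy over every label position (all-function training)."""
    losses = []
    for j in range(n):
        z = logits(x, j)
        losses.append(np.log(np.exp(z).sum()) - z[y])
    return float(np.mean(losses))

x = rng.random((n, d))              # toy encoded word features
loss = all_function_loss(x, y=1)
print(loss)
```

Averaging over all positions removes the arbitrary choice of where to attach the label tensor, which is the source of instability the all-function scheme is meant to address.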

2. BACKGROUND

We now provide the background on Matrix Product States (MPS), a family of tensor networks. MPS (Schollwöck, 2011) is also known as the tensor train decomposition (Oseledets, 2011). Owing to its low number of degrees of freedom, research based on MPS has developed rapidly. At present, tensor networks based on MPS can be roughly divided into two categories: the generative tensor network (GTN) (Han et al., 2018; Sun et al., 2020) and the supervised tensor network (also called the discriminative tensor network, DTN) (Stoudenmire & Schwab, 2016). We briefly describe existing GTN and DTN models for image classification tasks. GTNs are used to model the joint probability distribution of given data. For a picture X with n pixels, each pixel is encoded into a two-dimensional vector x_i = (p_i, 1 - p_i)^T by a feature mapping from a normalized pixel value p_i ∈ [0, 1].
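The feature map and the resulting GTN amplitude can be sketched in a few lines of numpy. This is an illustrative toy with random cores, not the implementation of Han et al. (2018): each pixel value p is mapped to (p, 1-p)^T, and contracting the MPS with these feature vectors yields an amplitude whose square is the unnormalised probability of the picture under the Born rule.

```python
import numpy as np

def feature_map(p):
    """Encode a pixel value p in [0, 1] as the 2-d vector (p, 1 - p)^T."""
    return np.array([p, 1.0 - p])

def mps_amplitude(cores, pixels):
    """Contract an MPS with the feature-mapped pixels; a GTN assigns the
    picture probability |amplitude|^2, up to normalisation."""
    v = np.ones((1,))
    for core, p in zip(cores, pixels):
        v = np.einsum('a,abc,b->c', v, core, feature_map(p))
    return v.item()                 # the final boundary bond has size 1

# Toy MPS over n = 4 pixels with bond dimension D = 3 (hypothetical sizes).
rng = np.random.default_rng(0)
n, D = 4, 3
shapes = [(1 if i == 0 else D, 2, 1 if i == n - 1 else D) for i in range(n)]
cores = [rng.normal(size=s) for s in shapes]

amp = mps_amplitude(cores, pixels=[0.1, 0.9, 0.5, 0.3])
print(amp**2)                       # unnormalised probability of this picture
```

Sweeping the contraction left to right keeps only a length-D vector in memory at any time, which is why the MPS form stays tractable where the full joint tensor would not.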



Footnote: In this paper, we focus on the text classification task. However, the idea and formulation of our proposed approach are general and have potential in other NLP tasks.

