NEURAL TOPIC MODEL VIA OPTIMAL TRANSPORT

Abstract

Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, their performance often degrades severely on short documents. The requirement of reparameterisation can also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distribution. Importantly, the cost matrix of the OT distance models the weights between topics and words, and is constructed from the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms state-of-the-art NTMs in discovering more coherent and diverse topics and in deriving better document representations, for both regular and short texts.

1. INTRODUCTION

As an unsupervised approach, topic modelling has enjoyed great success in automatic text analysis. In general, a topic model aims to discover a set of latent topics from a collection of documents, each of which describes an interpretable semantic concept. Topic models like Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its hierarchical/Bayesian extensions, e.g., Blei et al. (2010); Paisley et al. (2015); Gan et al. (2015); Zhou et al. (2016), have achieved impressive performance for document analysis. Recently, the development of Variational AutoEncoders (VAEs) and Autoencoding Variational Inference (AVI) (Kingma & Welling, 2013; Rezende et al., 2014) has facilitated the proposal of Neural Topic Models (NTMs), such as in Miao et al. (2016); Srivastava & Sutton (2017); Krishnan et al. (2018); Burkhardt & Kramer (2019). Inspired by VAEs, many NTMs use an encoder that takes the Bag-of-Words (BoW) representation of a document as input and approximates the posterior distribution of the latent topics. The posterior samples are further fed into a decoder to reconstruct the BoW representation. Compared with conventional topic models, NTMs usually enjoy better flexibility and scalability, which are important for applications on large-scale data. Despite their promising performance and recent popularity, existing NTMs have several shortcomings that could hinder their usefulness and further extension. i) The training and inference processes of NTMs are typically complex due to the prior and posterior constructions of latent topics. To encourage topic sparsity and smoothness, Dirichlet (Burkhardt & Kramer, 2019) or gamma (Zhang et al., 2018) distributions are usually used as the prior and posterior of topics, but reparameterisation is inapplicable to them; thus, complex sampling schemes or approximations have to be used, which can limit model flexibility.
ii) A desideratum of a topic model is to generate better topical representations of documents with more coherent and diverse topics, but for many existing NTMs it is hard to achieve good document representation and coherent/diverse topics at the same time. This is because the objective of NTMs is to achieve a lower reconstruction error, which usually means topics are less coherent and diverse, as observed and analysed in Srivastava & Sutton (2017); Burkhardt & Kramer (2019). iii) It is well known that topic models degrade severely on short documents such as tweets, news headlines and product reviews, as each individual document contains insufficient word co-occurrence information. This issue can be exacerbated for NTMs because of the use of encoder and decoder networks, which are usually more vulnerable to data sparsity. To address the above shortcomings of NTMs, in this paper we propose a neural topic model built upon a novel Optimal Transport (OT) framework derived from a new view of topic modelling. For a document, we consider its content to be encoded by two representations: the observed representation, x, a distribution over all the words in the vocabulary, and the latent representation, z, a distribution over all the topics. x can be obtained by normalising a document's word count vector, while z needs to be learned by a model. For a document collection, the vocabulary size (i.e., the number of unique words) can be very large, but an individual document usually contains only a tiny subset of the words. Therefore, x is a sparse and low-level representation of the semantic information of a document. As the number of topics is much smaller than the vocabulary size, z is a relatively dense and high-level representation of the same content. Therefore, the learning of a topic model can be viewed as the process of learning the distribution z to be as close to the distribution x as possible.
Accordingly, it is crucial to investigate how to measure the distance between two distributions with different supports (i.e., words for x and topics for z). Optimal transport is a powerful tool for measuring the distance travelled in transporting the mass of one distribution to match another, given a specific cost function, and recent developments in computational OT (e.g., Cuturi (2013); Frogner et al. (2015); Seguy et al. (2018); Peyré et al. (2019)) have shown that OT can be computed efficiently for large-scale problems, so it is natural for us to develop a new NTM based on the minimisation of OT. Specifically, our model leverages an encoder that outputs the topic distribution z of a document by taking its word count vector as input, like standard NTMs, but we minimise the OT distance between x and z, which are two discrete distributions on the supports of words and topics, respectively. Notably, the cost function of the OT distance specifies the weights between topics and words, which we define as distances in an embedding space. To represent their semantics, all the topics and words are embedded in this space. By leveraging pretrained word embeddings, the cost function becomes a function of the topic embeddings, which are learned jointly with the encoder. With the advanced properties of OT in modelling geometric structures on spaces of probability distributions, our model is able to achieve a better balance between obtaining good document representations and generating coherent/diverse topics. In addition, our model eases the burden of designing complex sampling schemes for the posterior of NTMs. More interestingly, our model is a natural way of incorporating pretrained word embeddings, which have been demonstrated to alleviate the issue of insufficient word co-occurrence information in short texts (Zhao et al., 2017; Dieng et al., 2020).
With extensive experiments, our model can be shown to enjoy the state-of-the-art performance in terms of both topic quality and document representations for both regular and short texts.

2. BACKGROUND

In this section, we recap the essential background of neural topic models and optimal transport.

2.1. NEURAL TOPIC MODELS

Most existing NTMs can be viewed as extensions of the VAE framework in which the latent variables are interpreted as topics. Suppose the document collection to be analysed has V unique words (i.e., the vocabulary size). Each document consists of a word count vector denoted x ∈ ℕ^V and a latent distribution over K topics, z ∈ ℝ^K. An NTM assumes that z for a document is generated from a prior distribution p(z) and that x is generated by the conditional distribution p_φ(x|z), modelled by a decoder φ. The model's goal is to infer the topic distribution given the word counts, i.e., to calculate the posterior p(z|x), which is approximated by the variational distribution q_θ(z|x), modelled by an encoder θ. Similar to VAEs, the training objective of NTMs is the maximisation of the Evidence Lower BOund (ELBO):

max_{θ,φ} E_{q_θ(z|x)}[log p_φ(x|z)] − KL[q_θ(z|x) ‖ p(z)]. (1)

The first term above is the expected log-likelihood or (negative) reconstruction error. As x is a count-valued vector, it is usually assumed to be generated from the multinomial distribution: p_φ(x|z) := Multi(φ(z)), where φ(z) is a probability vector output by the decoder. Therefore, the expected log-likelihood is proportional to x^T log φ(z). The second term is the Kullback-Leibler (KL) divergence that regularises q_θ(z|x) to be close to its prior p(z). To interpret topics with words, φ(z) is usually constructed by a single-layer network (Srivastava & Sutton, 2017): φ(z) := softmax(Wz), where W ∈ ℝ^{V×K} contains the weights between topics and words. Different NTMs may vary in the prior and posterior of z; for example, Miao et al. (2017) apply Gaussian distributions for both, while Srivastava & Sutton (2017) and Burkhardt & Kramer (2019) show that the Dirichlet is a better choice. However, reparameterisation cannot be directly applied to a Dirichlet, so various approximations and sampling schemes have been proposed.
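As a concrete illustration of the ELBO above, the following is a minimal numpy sketch of its two terms for a single document, assuming the Gaussian prior/posterior case of Miao et al. (2016). All dimensions and values here are toy stand-ins of our own, not the settings of any model in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 2000, 50                    # toy vocabulary size and topic number

x = rng.integers(0, 3, size=V).astype(float)   # word counts of one document
mu = rng.normal(size=K)                        # encoder output: mean of q(z|x)
logvar = 0.1 * rng.normal(size=K)              # encoder output: log-variance

# Reparameterised sample z ~ q(z|x) for the Gaussian case (Miao et al., 2016)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=K)

# Single-layer decoder: phi(z) = softmax(W z); W holds topic-word weights
W = 0.01 * rng.normal(size=(V, K))
logits = W @ z
m = logits.max()
log_phi = logits - m - np.log(np.exp(logits - m).sum())

# ELBO = expected log-likelihood (up to a constant) - KL[q(z|x) || N(0, I)]
recon = x @ log_phi
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
elbo = recon - kl
print(elbo)
```

The closed-form KL term is what the Gaussian choice buys; for a Dirichlet posterior no such closed form combines with simple reparameterisation, which is the difficulty noted above.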

2.2. OPTIMAL TRANSPORT

OT distances have been widely used for the comparison of probabilities. Here we limit our discussion to OT for discrete distributions, although it applies to continuous distributions as well. Specifically, consider two probability vectors r ∈ Δ^{D_r} and c ∈ Δ^{D_c}, where Δ^D denotes the (D − 1)-simplex. The OT distance between the two probability vectors can be defined as:

d_M(r, c) := min_{P ∈ U(r,c)} ⟨P, M⟩, (2)

where ⟨·, ·⟩ denotes the Frobenius dot-product; M ∈ ℝ_{≥0}^{D_r×D_c} is the cost matrix/function of the transport; P ∈ ℝ_{>0}^{D_r×D_c} is the transport matrix/plan; U(r, c) denotes the transport polytope of r and c, i.e., the polyhedral set of D_r × D_c matrices U(r, c) := {P ∈ ℝ_{>0}^{D_r×D_c} | P 1_{D_c} = r, P^T 1_{D_r} = c}; and 1_D is the D-dimensional vector of ones. Intuitively, if we consider two discrete random variables X ∼ Categorical(r) and Y ∼ Categorical(c), the transport matrix P is a joint probability of (X, Y), i.e., p(X = i, Y = j) = p_{ij}, and U(r, c) is the set of all such joint probabilities. The above OT distance is computed by finding the optimal transport matrix P*. It is also noteworthy that the Wasserstein distance can be viewed as a specific case of the OT distances. As directly optimising Eq. (2) can be time-consuming for large-scale problems, a regularised OT distance with an entropic constraint is introduced in Cuturi (2013), named the Sinkhorn distance:

d_{M,α}(r, c) := min_{P ∈ U_α(r,c)} ⟨P, M⟩, (3)

where U_α(r, c) := {P ∈ U(r, c) | h(P) ≥ h(r) + h(c) − α}, h(·) is the entropy function, and α ∈ [0, ∞). To compute the Sinkhorn distance, a Lagrange multiplier is introduced for the entropy constraint in Eq. (3), resulting in the Sinkhorn algorithm, which is widely used for discrete OT problems.
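The Sinkhorn algorithm mentioned above can be sketched in a few lines of numpy. This is an illustrative implementation of the standard Sinkhorn-Knopp scaling iterations; the regularisation strength and the toy distributions are our own choices, not values from this paper.

```python
import numpy as np

def sinkhorn(r, c, M, reg=0.2, n_iter=5000, tol=1e-12):
    """Entropic-regularised OT between probability vectors r and c.

    Implements the scaling iterations that arise when a Lagrange
    multiplier is introduced for the entropic constraint (Cuturi, 2013).
    Returns the transport plan P and its transport cost <P, M>.
    """
    H = np.exp(-M / reg)                 # Gibbs kernel
    u = np.ones_like(r) / r.size
    v = np.ones_like(c) / c.size
    for _ in range(n_iter):
        u_prev = u
        v = c / (H.T @ u)                # fit column marginals
        u = r / (H @ v)                  # fit row marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    P = u[:, None] * H * v[None, :]
    return P, float(np.sum(P * M))

# Tiny example: two 3-point distributions with |i - j| ground cost
r = np.array([0.5, 0.3, 0.2])
c = np.array([0.2, 0.2, 0.6])
M = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
P, cost = sinkhorn(r, c, M)
print(P.round(3), cost)
```

The returned cost lies between the exact OT distance (0.7 here) and the cost of the independent coupling, shrinking towards the former as `reg` decreases.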

3. PROPOSED MODEL

Now we introduce the details of our proposed model. Specifically, we represent each document as a distribution over V words, x̄ ∈ Δ^V, obtained by normalising x: x̄ := x/S, where S := Σ_{v=1}^{V} x_v is the length of the document. Each document is also associated with a distribution over K topics, z ∈ Δ^K, each entry of which indicates the proportion of one topic in the document. Like other NTMs, we leverage an encoder to generate z from x̄: z = softmax(θ(x̄)). Notably, θ is implemented as a neural network with dropout layers for adding randomness. As x̄ and z are two distributions with different supports for the same document, to learn the encoder we propose to minimise the following OT distance to push z towards x̄:

min_θ d_M(x̄, z). (4)

Here M ∈ ℝ_{>0}^{V×K} is the cost matrix, where m_{vk} indicates the semantic distance between topic k and word v. Therefore, each column of M captures the importance of the words in the corresponding topic. In addition to the encoder, M is a variable that needs to be learned in our model. However, learning the cost function is reported to be a non-trivial task (Cuturi & Avis, 2014; Sun et al., 2020). To address this problem, we specify the following construction of M:

m_{vk} = 1 − cos(e_v, g_k), (5)

where cos(·, ·) ∈ [−1, 1] is the cosine similarity, and g_k ∈ ℝ^L and e_v ∈ ℝ^L are the embeddings of topic k and word v, respectively. The embeddings are expected to capture the semantic information of the topics and words. Instead of learning the word embeddings, we propose to set them to pretrained word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). This not only reduces the parameter space, making the learning of M more stable, but also enables us to leverage the rich semantic information in pretrained word embeddings, which is beneficial for short documents.
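The construction of M in Eq. (5) amounts to a single matrix product on length-normalised embeddings. A minimal numpy sketch (random matrices stand in for the GloVe vectors and learned topic embeddings; the toy sizes are ours):

```python
import numpy as np

def cost_matrix(E, G):
    """m_vk = 1 - cos(e_v, g_k), as in Eq. (5); entries lie in [0, 2].

    E: (L, V) word embeddings (kept fixed, e.g., pretrained GloVe);
    G: (L, K) topic embeddings (learned jointly with the encoder).
    """
    En = E / np.linalg.norm(E, axis=0, keepdims=True)
    Gn = G / np.linalg.norm(G, axis=0, keepdims=True)
    return 1.0 - En.T @ Gn                         # (V, K)

rng = np.random.default_rng(0)
L, V, K = 50, 100, 10                              # toy sizes
E = rng.normal(size=(L, V))                        # stand-in for GloVe vectors
G = rng.normal(size=(L, K))
M = cost_matrix(E, G)
print(M.shape, float(M.min()), float(M.max()))
```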
Here the cosine distance is used instead of other metrics for two reasons: it is the most commonly-used distance metric for word embeddings, and the cost matrix M must be non-negative, so the similarity metric needs to be upper-bounded. As cosine similarity falls in the range [−1, 1], we have M ∈ [0, 2]^{V×K}. For ease of presentation, we denote by G ∈ ℝ^{L×K} and E ∈ ℝ^{L×V} the collections of the embeddings of all topics and words, respectively. Now we can rewrite Eq. (4) as:

min_{θ,G} d_M(x̄, z). (6)

Although the mechanisms are totally different, both M in our model and W in NTMs (see Section 2.1) capture the relations between topics and words (M stores distances while W stores similarities). M is the cost function of our OT loss, while W gives the weights in the decoder of NTMs. Unlike other NTMs based on VAEs, our model does not explicitly have a decoder that projects z back to the word space to reconstruct x̄, as the OT distance allows us to compute the distance between z and x̄ directly. To further understand our model, we can nonetheless project z to the space of x̄ by "virtually" defining a decoder: φ(z) := softmax((2 − M)z). With the notation of φ(z), we present the following theorem, which reveals the relationship between other NTMs and ours; its proof is given in Section A of the appendix.

Theorem 1. When V ≥ 8 and M ∈ [0, 2]^{V×K}, we have: d_M(x̄, z) ≤ −x̄^T log φ(z).

With Theorem 1, we have:

Lemma 1. Maximising the expected multinomial log-likelihood of NTMs is equivalent to minimising an upper bound of the OT distance in our model.

Frogner et al. (2015) propose to minimise the OT distance between the predicted and true label distributions for classification tasks. It is reported in that paper that combining the OT loss with the conventional cross-entropy loss gives better performance than using either of them alone. As the expected multinomial log-likelihood is easier to learn and can help guide the optimisation of the OT distance, empirically inspired by Frogner et al.
(2015) and theoretically motivated by Theorem 1, we propose the following joint loss for our model, which combines the OT distance with the expected log-likelihood:

max_{θ,G} ε x̄^T log φ(z) − d_M(x̄, z). (8)

Comparing the above loss with the ELBO of Eq. (1), it can be observed that, similar to the KL divergence of NTMs, our OT distance can be viewed as a regularisation term on the expected log-likelihood (x̄^T log φ(z) := (1/S) x^T log φ(z)). Compared with other NTMs, our model eases the burden of developing prior/posterior distributions and the associated sampling schemes. Moreover, with OT's ability to better model geometric structures, our model achieves better performance in terms of both document representation and topic quality. In addition, the cost function of the OT distance provides a natural way of incorporating pretrained word embeddings, which boosts our model's performance on short documents. Finally, we replace the OT distance with the Sinkhorn distance (Cuturi, 2013), which leads to the final loss function:

max_{θ,G} ε x̄^T log φ(z) − d_{M,α}(x̄, z), (9)

where z = softmax(θ(x̄)); M is parameterised by G; φ(z) := softmax((2 − M)z); x and x̄ are the word count vector and its normalisation, respectively; ε is the hyperparameter that controls the weight of the expected likelihood; and α is the hyperparameter of the Sinkhorn distance. To compute the Sinkhorn distance, we leverage the Sinkhorn algorithm (Cuturi, 2013), as shown below.

Algorithm 1: Training algorithm for NSTM.
for each minibatch X do
    Column-wise normalise X to get X̄;
    Compute M with G and E by Eq. (5);
    Compute Z = softmax(θ(X̄));
    Compute the first term of Eq. (9);
    # Sinkhorn iterations
    Ψ1 = ones(K, B)/K; Ψ2 = ones(V, B)/V; H = e^{−M/α};
    while Ψ1 changes or any other relevant stopping criterion do
        Ψ2 = X̄ ⊘ (HΨ1);
        Ψ1 = Z ⊘ (H^T Ψ2);
    end
    Compute the second term of Eq. (9): d_{M,α} = sum(Ψ2 ⊙ ((H ⊙ M)Ψ1));
    Compute the gradients of Eq. (9) in terms of θ, G;
    Update θ, G with the gradients;
end
In Algorithm 1, X ∈ ℕ^{V×B} and Z ∈ ℝ_{>0}^{K×B} consist of the word count vectors and topic distributions of the documents in a minibatch of size B, respectively; ⊙ and ⊘ denote element-wise multiplication and division. It is noteworthy that the Sinkhorn iterations can be implemented with the tensor operations of TensorFlow/PyTorch (Patrini et al., 2020). Therefore, the loss of Eq. (9) is differentiable in terms of θ and G, which can be optimised jointly in one training iteration. After training the model, we can infer z by a forward pass of the encoder θ with input x̄. In practice, x can be normalised by other methods, e.g., softmax, or one can use TF-IDF as the input data of the encoder.
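Putting the pieces together, one forward pass of the loss in Eq. (9) following Algorithm 1 can be sketched in numpy as below. This is an illustrative re-implementation, not the official TensorFlow code: the encoder is replaced by random logits, and the maximum iteration count and stop tolerance follow the values stated in Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, B, L = 60, 8, 4, 16                 # toy sizes: vocab, topics, batch, embedding dim
alpha, eps = 20.0, 0.07                   # hyperparameter values reported in Section 5

E = rng.normal(size=(L, V))               # stand-in for pretrained word embeddings (fixed)
G = rng.normal(size=(L, K))               # topic embeddings (learned in the real model)
X = rng.integers(1, 4, size=(V, B)).astype(float)   # word counts for a minibatch
X_bar = X / X.sum(axis=0, keepdims=True)  # column-wise normalisation

# Cost matrix, Eq. (5): m_vk = 1 - cos(e_v, g_k)
En = E / np.linalg.norm(E, axis=0, keepdims=True)
Gn = G / np.linalg.norm(G, axis=0, keepdims=True)
M = 1.0 - En.T @ Gn                       # (V, K)

# Stand-in for the encoder output: Z = softmax(theta(X_bar)) over topics
logits = rng.normal(size=(K, B))
Z = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# First term of Eq. (9): log-likelihood under the "virtual" decoder
# phi(z) = softmax((2 - M) z)
dec = (2.0 - M) @ Z                       # (V, B)
log_phi = dec - dec.max(axis=0) - np.log(np.exp(dec - dec.max(axis=0)).sum(axis=0))
loglik = float(np.sum(X_bar * log_phi))

# Second term of Eq. (9): batched Sinkhorn iterations as in Algorithm 1
H = np.exp(-M / alpha)                    # (V, K) Gibbs kernel
Psi1 = np.ones((K, B)) / K
for _ in range(1000):                     # max iterations, Section 5
    Psi1_old = Psi1
    Psi2 = X_bar / (H @ Psi1)             # (V, B)
    Psi1 = Z / (H.T @ Psi2)               # (K, B)
    if np.max(np.abs(Psi1 - Psi1_old)) < 0.005:   # stop tolerance, Section 5
        break
d_sinkhorn = float(np.sum(Psi2 * ((H * M) @ Psi1)))  # summed per-document costs

# Negative of the objective in Eq. (9), to be minimised w.r.t. theta and G
loss = -(eps * loglik - d_sinkhorn)
print(loss)
```

In the real model, `loss` would be minimised by automatic differentiation through both the encoder parameters θ and the topic embeddings G.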

4. RELATED WORKS

We first consider NTMs (e.g., Miao et al. (2016); Srivastava & Sutton (2017); Krishnan et al. (2018); Card et al. (2018); Burkhardt & Kramer (2019); Dieng et al. (2020)), reviewed in Section 2.1, as the line of related work closest to ours. For a detailed survey of NTMs, we refer to Zhao et al. (2021). Connections and comparisons between our model and NTMs have been discussed in Section 3. In addition, word embeddings have recently been widely used as complementary metadata for topic models, especially for modelling short texts. For Bayesian probabilistic topic models, word embeddings are usually incorporated into the generative process of word counts, as in Petterson et al. (2010); Nguyen et al. (2015); Li et al. (2016); Zhao et al. (2017). Due to the flexibility of NTMs, word embeddings can be incorporated as part of the encoder input, as in Card et al. (2018), or used in the generative process of words, as in Dieng et al. (2020). The novelty of NSTM is that word embeddings are naturally incorporated into the cost function of the OT distance. To our knowledge, works that connect topic modelling with OT are still very limited. Yurochkin et al. (2019) propose to compare two documents' similarity with the OT distance between their topic distributions extracted from a pretrained LDA, but their aim is not to learn a topic model. Another recent work related to ours is Wasserstein LDA (WLDA) (Nan et al., 2019), which adapts the framework of Wasserstein AutoEncoders (WAEs) (Tolstikhin et al., 2018). The key difference from ours is that WLDA minimises the Wasserstein distance between the fake data generated with topics and the real data, which can be viewed as an OT variant of VAE-NTMs. In contrast, our NSTM directly minimises the OT distance between z and x̄, with no explicit generative process from topics to data.
Two other related works are Distilled Wasserstein Learning (DWL) (Xu et al., 2018) and Optimal Transport LDA (OTLDA) (Huynh et al., 2020), which adapt the ideas of Wasserstein barycentres and Wasserstein Dictionary Learning (Rolet et al., 2016; Schmitz et al., 2018). Our model differs fundamentally from DWL and OTLDA in how documents, topics, and words are related. Specifically, in DWL and OTLDA, documents and topics both live in the word space (i.e., both are distributions over words) and x̄ is approximated by the weighted Wasserstein barycentre of all the topic-word distributions, where the weights can be interpreted as the topic proportions of the document, i.e., z. In NSTM, however, a document lives in both the topic space and the word space, and topics and words are embedded in an embedding space. These differences lead to different views of topic modelling and different frameworks. Moreover, DWL mainly focuses on learning word embeddings and representations for International Classification of Diseases (ICD) codes, while NSTM aims to be a general method for topic modelling. Finally, DWL and OTLDA are not neural network models, while ours is.

5. EXPERIMENTS

We conduct extensive experiments on several benchmark text datasets to evaluate the performance of NSTM against the state-of-the-art neural topic models.

Datasets:

Our experiments are conducted on five widely-used benchmark text datasets of varying sizes: 20 News Groups (20NG)foot_1, Web Snippets (WS) (Phan et al., 2008), Tag My News (TMN) (Vitale et al., 2012), Reuters, and RCV2.

Evaluation metrics: We report Topic Coherence (TC) and Topic Diversity (TD) as performance metrics for topic quality. TC measures the semantic coherence of the most significant words (top words) of a topic, given a reference corpus. We apply the widely-used Normalized Pointwise Mutual Information (NPMI) (Aletras & Stevenson, 2013; Lau et al., 2014), computed over the top 10 words of each topic by the Palmetto package (Röder et al., 2015). As not all the discovered topics are interpretable (Yang et al., 2015; Zhao et al., 2018), to comprehensively evaluate the topic quality, we choose the topics with the highest NPMI and report the average score over the selected topics. We vary the proportion of selected topics from 10% to 100%, where 10% indicates that the top 10% of topics with the highest NPMI are selected and 100% means all the topics are used. TD, as its name implies, measures how diverse the discovered topics are. We define topic diversity as the percentage of unique words in the top 25 words (Dieng et al., 2020) of the selected topics, chosen as in TC. TD close to 0 indicates redundant topics; TD close to 1 indicates more varied topics. As doc-topic distributions can be viewed as unsupervised document representations, to evaluate the quality of such representations, we perform document clustering tasks and report the purity and Normalized Mutual Information (NMI) (Manning et al., 2008) on 20NG, WS, and TMN, where the document labels are considered. With the default training/testing splits of the datasets, we train a model on the training documents and infer the topic distributions z on the testing documents. Given z, we adopt two strategies to perform the document clustering task: i) Following Nguyen et al.
(2015), we use the most significant topic of a testing document as its clustering assignment to compute purity and NMI (denoted by top-Purity and top-NMI); ii) We apply the KMeans algorithm on z (over all the topics) of the testing documents and report the purity and NMI of the KMeans clusters (denoted by km-Purity and km-NMI). For the first strategy, the number of clusters equals the number of topics, while for the second, we vary the number of KMeans clusters in the range {20, 40, 60, 80, 100}. Note that our goal is not to achieve state-of-the-art document clustering results but to compare the document representations of topic models. For all the metrics, higher values indicate better performance.

Baseline methods and their settings: We compare with the state-of-the-art NTMs, including: LDA with Products of Experts (ProdLDA) (Srivastava & Sutton, 2017), which replaces the mixture model in LDA with a product of experts and uses AVI for training; Dirichlet VAE (DVAE) (Burkhardt & Kramer, 2019), a neural topic model imposing a Dirichlet prior/posterior on z, for which we use the variant with rejection-sampling VI, reported to perform the best; Embedding Topic Model (ETM) (Dieng et al., 2020), a topic model that incorporates word embeddings and is learned by AVI; and Wasserstein LDA (WLDA) (Nan et al., 2019), a WAE-based topic model. For all the above baselines, we use their official code with the best reported settings.

Settings for NSTM: NSTM is implemented in TensorFlow. For the encoder θ, to keep it simple, we use a fully-connected neural network with one hidden layer of 200 units and ReLU activations, followed by a dropout layer (rate = 0.75) and a batch-norm layer, following the settings of Burkhardt & Kramer (2019). For the Sinkhorn algorithm, following Cuturi (2013), the maximum number of iterations is 1,000 and the stop tolerance is 0.005foot_7. In all the experiments, we fix α = 20 and ε = 0.07.
We further vary these two hyperparameters to study our model's sensitivity to them in Figure B.1 of the appendix. Fine-tuning the parameters for a specific dataset may give better results. The optimisation of NSTM is done by Adam (Kingma & Ba, 2015) with learning rate 0.001 and batch size 200 for at most 50 iterations. For NSTM and ETM, the 50-dimensional (i.e., L = 50, see Eq. (5)) GloVe word embeddings (Pennington et al., 2014) pre-trained on Wikipediafoot_8 are used. We use K = 100 topics in most cases and set K = 500 on RCV2 to test our model's scalability.
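As a concrete reference for the TD metric described above, the following numpy sketch computes topic diversity as the fraction of unique words among the top-25 words of the topics. This is our own illustrative implementation with toy topic-word matrices, not the paper's evaluation code.

```python
import numpy as np

def topic_diversity(topic_word, top_n=25):
    """TD: fraction of unique words among the top-n words of all topics.

    topic_word: (K, V) matrix of topic-word weights. TD close to 1 means
    varied topics; TD close to 0 means redundant topics (Dieng et al., 2020).
    """
    top = np.argsort(-topic_word, axis=1)[:, :top_n]
    return np.unique(top).size / top.size

rng = np.random.default_rng(0)
base = rng.normal(size=100)                     # toy 100-word vocabulary
redundant = np.stack([base, base])              # two identical topics
distinct = rng.normal(size=(2, 100))            # two unrelated topics
print(topic_diversity(redundant), topic_diversity(distinct))
```

Two identical topics share all top words, giving TD = 25/50 = 0.5 for K = 2; unrelated topics give a value close to 1.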

5.2. RESULTS

Quantitative results: We run all the models in comparison five times with different random seeds and report the mean and standard deviation (as error bars). We show the TC and TD results in Figure 1, top-Purity/NMI in Table 2, and km-Purity/NMI in Figure 2, respectively. We make the following remarks about the results: i) Our proposed NSTM significantly outperforms the others in terms of topic coherence while obtaining high topic diversity on all the datasets. Although other models may have higher TD than ours on one or two datasets, they usually cannot achieve a high TC at the same time. ii) In terms of document clustering, our model performs the best in general with a significant gap over the other NTMs, except that ours is second for KMeans clustering on 20NG. This demonstrates that NSTM not only discovers interpretable topics of better quality but also learns good document representations for clustering. It also shows that, with the OT distance, our model achieves a better balance across the comprehensive metrics of topic modelling. iii) For all the evaluation metrics, our model is consistently the best on the short-document datasets WS and TMN. This demonstrates the effectiveness of our way of incorporating pretrained word embeddings and shows our model's potential for short-text topic modelling. Although ETM also uses pretrained word embeddings, its performance does not match ours. Scalability: NSTM has comparable scalability to other NTMs and is able to scale to large datasets with a large number of topics. To demonstrate this, we run NSTM, DVAE and ProdLDA (these three are implemented in TensorFlow, while ETM is in PyTorch and WLDA is in MXNet) on RCV2 with K = 500. The three models run on a Titan RTX GPU with batch size 1,000. Figure 3 shows the training losses, which demonstrate that NSTM has a learning speed similar to ProdLDA and better than DVAE.
The TC and TD scores of this experiment are shown in Section C of the appendix, where it can be observed that with 500 topics, our model shows a similar performance advantage over the others.

Qualitative analysis: As topics in our model are embedded in the same space as the pretrained word embeddings, they share similar geometric properties. Figure 4 shows a qualitative analysis. For the t-SNE (Maaten & Hinton, 2008) visualisation, we select the top 50 topics with the highest NPMI learned by a run of NSTM on RCV2 with K = 100 and feed their (50-dimensional) embeddings into the t-SNE method. We also show the top five words and the topic number (1 to 50) of each topic.

A PROOF OF THEOREM 1

Proof. Before showing the proof, we introduce the following notation: we denote k ∈ {1, ..., K} and v ∈ {1, ..., V} as indexes; the s-th (s ∈ {1, ..., S}) token of the document picks a word in the vocabulary, denoted by w_s ∈ {1, ..., V}; the normaliser in the softmax function of φ(z) is denoted as φ̃, so:

φ̃ = Σ_{v=1}^{V} exp(Σ_{k=1}^{K} z_k (2 − m_{vk})) = e^2 Σ_{v=1}^{V} exp(−Σ_{k=1}^{K} z_k m_{vk}).

With these notations, we first have the following equation for the multinomial log-likelihood:

x̄^T log φ(z) = (1/S) Σ_{s=1}^{S} log φ(z)_{w_s} = (1/S) Σ_{s=1}^{S} [Σ_{k=1}^{K} z_k (2 − m_{w_s k}) − log φ̃] = 2 − log φ̃ − (1/S) Σ_{s=1}^{S} Σ_{k=1}^{K} z_k m_{w_s k}. (A.1)

Recall that in Eq. (2) of the main paper, the transport matrix P is one of the joint distributions of x̄ and z. We introduce the conditional distribution of z given x as Q, where q(v, k) indicates the probability of assigning a token of word v to topic k. Given that P satisfies P ∈ U(x̄, z) and p_{vk} = x̄_v q(v, k), Q must satisfy U(x̄, z) := {Q ∈ ℝ_{>0}^{V×K} | Σ_{v=1}^{V} x̄_v q(v, k) = z_k}. With Q, we can rewrite the OT distance as:

d_M(x̄, z) = min_{Q ∈ U(x̄,z)} Σ_{v=1}^{V} Σ_{k=1}^{K} x̄_v q(v, k) m_{vk} = (1/S) min_{Q ∈ U(x̄,z)} Σ_{k=1}^{K} Σ_{s=1}^{S} q(w_s, k) m_{w_s k}.
If we let q(v, k) = z_k, meaning that the tokens of a document are assigned to the topics according to the document's doc-topic distribution, then Q satisfies U(x̄, z), which leads to:

d_M(x̄, z) ≤ (1/S) Σ_{k=1}^{K} Σ_{s=1}^{S} z_k m_{w_s k}. (A.2)

Together with Eq. (A.1), the definition of φ̃, and the fact that m_{vk} ≤ 2, we have:

x̄^T log φ(z) = 2 − log φ̃ − (1/S) Σ_{s=1}^{S} Σ_{k=1}^{K} z_k m_{w_s k} ≤ −log Σ_{v=1}^{V} exp(−Σ_{k=1}^{K} z_k m_{vk}) − d_M(x̄, z) ≤ −(log V − 2) − d_M(x̄, z) ≤ −d_M(x̄, z), (A.3)

where the last inequality holds if log V > 2, i.e., V ≥ 8.
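The inequality of Theorem 1 can be sanity-checked numerically without solving an OT problem: the independent coupling x̄z^T is feasible, so its cost x̄^T M z upper-bounds d_M(x̄, z), and by the derivation above this cost already sits below −x̄^T log φ(z) whenever log V > 2. A small numpy check on random toy inputs (our own code, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 20, 5                            # V >= 8, as Theorem 1 requires

M = 1.0 - rng.uniform(-1.0, 1.0, size=(V, K))   # entries in [0, 2], as for Eq. (5)
x_bar = rng.dirichlet(np.ones(V))               # word distribution of a document
z = rng.dirichlet(np.ones(K))                   # its topic distribution

# "Virtual" decoder phi(z) = softmax((2 - M) z)
dec = (2.0 - M) @ z
log_phi = dec - dec.max() - np.log(np.exp(dec - dec.max()).sum())
upper = float(-x_bar @ log_phi)                 # RHS of Theorem 1

# Independent coupling P = x_bar z^T lies in U(x_bar, z), so
# d_M(x_bar, z) <= <P, M> = x_bar^T M z
plan_cost = float(x_bar @ M @ z)
print(plan_cost, upper, plan_cost <= upper)
```

Per Eq. (A.3), the gap between the two quantities is at least log V − 2, which the check below also verifies.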



Footnotes:
- To be precise, an OT distance becomes a "distance metric" in mathematics only if the cost function M is induced from a distance metric. We call it "OT distance" to assist the readability of our paper.
- http://qwone.com/~jason/20Newsgroups/
- http://acube.di.unipi.it/tmn-dataset/
- https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
- https://trec.nist.gov/data/reuters/reuters.html
- We do not consider the labels of Reuters and RCV2 as there are multiple labels for one document.
- http://palmetto.aksw.org
- The Sinkhorn algorithm usually reaches the stop tolerance in fewer than 50 iterations in NSTM.
- https://nlp.stanford.edu/projects/glove/




Figure 1: The first row shows the TC scores for all the datasets and the second row shows the corresponding TD scores. In each subfigure, the horizontal axis indicates the proportion of selected topics according to their NPMIs.

Figure 2: The first row shows the km-Purity scores and the second row shows the corresponding km-NMI scores. In each subfigure, the horizontal axis indicates the number of KMeans clusters.
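The purity and NMI scores reported in Figure 2 and Table 2 follow the definitions in Manning et al. (2008); a minimal numpy sketch of both metrics (our own illustrative implementation, not the paper's evaluation code) is:

```python
import numpy as np

def purity(clusters, labels):
    """Purity: each cluster votes for its majority gold label."""
    total = 0
    for c in np.unique(clusters):
        _, counts = np.unique(labels[clusters == c], return_counts=True)
        total += counts.max()
    return total / labels.size

def nmi(clusters, labels):
    """NMI = I(clusters; labels) / ((H(clusters) + H(labels)) / 2),
    the normalisation used in Manning et al. (2008)."""
    n = labels.size
    cs, ls = np.unique(clusters), np.unique(labels)
    joint = np.array([[np.sum((clusters == c) & (labels == l)) / n
                       for l in ls] for c in cs])
    pc, pl = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pc, pl)[nz]))
    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    return mi / ((h(pc) + h(pl)) / 2)

# Toy check: a perfect clustering (up to cluster relabelling) scores 1 on both
labels = np.array([0, 0, 1, 1, 2, 2])
perfect = np.array([5, 5, 7, 7, 9, 9])
print(purity(perfect, labels), nmi(perfect, labels))
```

For the top-assignment strategy, `clusters` would be the argmax of each document's topic distribution z; for the KMeans strategy, the KMeans cluster indices.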

Figure 3: Training loss.

Figure B.1: Parameter sensitivity of NSTM on 20News. The two rows show the performance with different values of ε and α, respectively: in the first row, we fix α = 20 and vary ε, while in the second row, we fix ε = 0.07 and vary α.

Figure C.1: TC and TD on RCV2 with 500 topics.

Figure D.1: Sinkhorn distance with varied K. Vertical axis: the average Sinkhorn distance over all the training documents, i.e., mean d_M(x̄, z). Horizontal axis: the number of topics, i.e., K.

Statistics of the datasets


Table 2: top-Purity and top-NMI for document clustering. The best and second-best scores of each dataset are highlighted in boldface and with an underline, respectively.

ACKNOWLEDGMENTS

Trung Le was supported by AOARD grant FA2386-19-1-4040. Wray Buntine was supported by the Australian Research Council under award DP190100017.

We can observe that although the words of the topics are different, the semantic similarity between the topics captured by the embeddings is highly interpretable. In addition, we take the GloVe embedding of the polysemantic word "apple" and find the 10 closest related words among the 0.4 million words of the GloVe vocabulary according to cosine similarity. It can be seen that by default "apple" refers more to the Apple company in GloVe. Either adding the embeddings of topic 1, which describes the concept of "food", or subtracting the embeddings of topic 46, which describes the concept of "tech companies", reveals the fruit sense of the word "apple". More qualitative analysis of topics is provided in Section E of the appendix.

6. CONCLUSION

In this paper, we presented a novel neural topic model based on optimal transport, where a document is endowed with two representations: the word distribution x̄ and the topic distribution z. An OT distance is leveraged to measure the semantic distance between the two distributions, whose cost function is defined according to the cosine similarities between topics and words in an embedding space. z is obtained from an encoder that takes x̄ as input and is trained by minimising the OT distance between z and x̄. With pretrained word embeddings, topic embeddings are learned through the same minimisation of the OT distance via the cost function. Our model has appealing properties that overcome several shortcomings of existing neural topic models. Extensive experiments show that our model achieves state-of-the-art performance both in discovering quality topics and in deriving useful document representations, for both regular and short texts.

