NEURAL TOPIC MODEL VIA OPTIMAL TRANSPORT

Abstract

Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, their performance often degrades severely on short documents. The requirement of reparameterisation could also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distribution. Importantly, the cost matrix of the OT distance models the weights between topics and words, and is constructed from the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms state-of-the-art NTMs, discovering more coherent and diverse topics and deriving better document representations for both regular and short texts.

1. INTRODUCTION

As an unsupervised approach, topic modelling has enjoyed great success in automatic text analysis. In general, a topic model aims to discover a set of latent topics from a collection of documents, each of which describes an interpretable semantic concept. Topic models like Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its hierarchical/Bayesian extensions, e.g., in Blei et al. (2010); Paisley et al. (2015); Gan et al. (2015); Zhou et al. (2016), have achieved impressive performance for document analysis. Recently, the developments of Variational AutoEncoders (VAEs) and Autoencoding Variational Inference (AVI) (Kingma & Welling, 2013; Rezende et al., 2014) have facilitated the proposal of Neural Topic Models (NTMs), such as in Miao et al. (2016); Srivastava & Sutton (2017); Krishnan et al. (2018); Burkhardt & Kramer (2019). Inspired by VAEs, many NTMs use an encoder that takes the Bag-of-Words (BoW) representation of a document as input and approximates the posterior distribution of the latent topics. The posterior samples are further fed into a decoder to reconstruct the BoW representation. Compared with conventional topic models, NTMs usually enjoy better flexibility and scalability, which are important for applications on large-scale data.

Despite their promising performance and recent popularity, existing NTMs have several shortcomings that could hinder their usefulness and further extension. i) The training and inference processes of NTMs are typically complex due to the prior and posterior constructions of the latent topics. To encourage topic sparsity and smoothness, Dirichlet (Burkhardt & Kramer, 2019) or gamma (Zhang et al., 2018) distributions are usually used as the prior and posterior of topics, but the reparameterisation trick is inapplicable to them; thus, complex sampling schemes or approximations have to be used, which could limit model flexibility.
ii) A desideratum of a topic model is to generate good topical representations of documents together with coherent and diverse topics; but for many existing NTMs, it is hard to achieve both at the same time. This is because the objective of NTMs is to achieve a lower reconstruction error, which usually means less coherent and diverse topics, as observed and analysed in Srivastava & Sutton (2017); Burkhardt & Kramer (2019). iii) It is well known that the performance of topic models degrades severely on short documents such as tweets, news headlines and product reviews, as each individual document contains insufficient word co-occurrence information. This issue can be exacerbated for NTMs because of the use of the encoder and decoder networks, which are usually more vulnerable to data sparsity.

To address the above shortcomings of NTMs, in this paper we propose a neural topic model built upon a novel Optimal Transport (OT) framework derived from a new view of topic modelling. For a document, we consider its content to be encoded by two representations: the observed representation x, a distribution over all the words in the vocabulary, and the latent representation z, a distribution over all the topics. x can be obtained by normalising a document's word count vector, while z needs to be learned by a model. For a document collection, the vocabulary size (i.e., the number of unique words) can be very large, but one individual document usually contains only a tiny subset of the words. Therefore, x is a sparse and low-level representation of the semantic content of a document. As the number of topics is much smaller than the vocabulary size, z is a relatively dense and high-level representation of the same content. The learning of a topic model can therefore be viewed as the process of making the distribution z as close to the distribution x as possible.
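As a concrete illustration of the two representations described above (not taken from the paper), the following sketch builds the observed distribution x from a toy word-count vector and contrasts it with a hypothetical dense topic distribution z; the sizes and values are ours, chosen purely for illustration:

```python
import numpy as np

V, K = 10, 3  # hypothetical vocabulary size and number of topics

# A document's word counts: sparse, since it uses few vocabulary words.
counts = np.array([2, 0, 0, 1, 0, 0, 0, 3, 0, 0], dtype=float)

# Observed representation x: normalise counts into a distribution over words.
x = counts / counts.sum()

# Latent representation z: a dense distribution over topics that a model
# would learn; here we just pick example values.
z = np.array([0.7, 0.2, 0.1])
```

Both x and z sum to one, but they live on different supports (words vs. topics), which is exactly why a distance between distributions with different supports is needed.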
Accordingly, it is crucial to investigate how to measure the distance between two distributions with different supports (i.e., words for x and topics for z). Optimal transport is a powerful tool for measuring the distance travelled in transporting the mass of one distribution to match another, given a specific cost function, and recent developments in computational OT (e.g., Cuturi (2013); Frogner et al. (2015); Seguy et al. (2018); Peyré et al. (2019)) have shown that OT can be computed efficiently for large-scale problems; it is therefore natural for us to develop a new NTM based on the minimisation of OT. Specifically, our model leverages an encoder that outputs the topic distribution z of a document by taking its word count vector as input, like standard NTMs, but we minimise the OT distance between x and z, which are two discrete distributions supported on the words and topics, respectively. Notably, the cost function of the OT distance specifies the weights between topics and words, which we define as distances in an embedding space in which all the topics and words are embedded to represent their semantics. By leveraging pretrained word embeddings, the cost function becomes a function of the topic embeddings, which are learned jointly with the encoder. With the advanced properties of OT in modelling the geometric structure of spaces of probability distributions, our model is able to achieve a better balance between obtaining good document representations and generating coherent/diverse topics. In addition, our model eases the burden of designing complex sampling schemes for the posterior of NTMs. More interestingly, our model is a natural way of incorporating pretrained word embeddings, which have been demonstrated to alleviate the issue of insufficient word co-occurrence information in short texts (Zhao et al., 2017; Dieng et al., 2020). Extensive experiments show that our model enjoys state-of-the-art performance in terms of both topic quality and document representation for both regular and short texts.
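The OT computation underlying this framework can be sketched with the entropic-regularised Sinkhorn algorithm of Cuturi (2013), which the paper cites. The snippet below is a minimal illustration under our own assumptions (random embeddings, a cosine-based cost, and hypothetical variable names), not the paper's actual implementation:

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.1, n_iter=1000):
    """Entropic-regularised OT (Cuturi, 2013): transport distribution a
    (over topics) to b (over words) under cost matrix M; returns the plan."""
    Kmat = np.exp(-M / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (Kmat.T @ u)         # scale to match column marginal b
        u = a / (Kmat @ v)           # scale to match row marginal a
    return u[:, None] * Kmat * v[None, :]

rng = np.random.default_rng(0)
K_topics, V, d = 3, 8, 5             # toy sizes (ours)
topic_emb = rng.normal(size=(K_topics, d))  # learnable topic embeddings
word_emb = rng.normal(size=(V, d))          # stand-in for pretrained word embeddings

# Cost between topic i and word j, e.g. one minus cosine similarity.
tn = topic_emb / np.linalg.norm(topic_emb, axis=1, keepdims=True)
wn = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
M = 1.0 - tn @ wn.T

z = np.array([0.5, 0.3, 0.2])        # topic distribution of a document
x = rng.dirichlet(np.ones(V))        # observed word distribution

P = sinkhorn(z, x, M)                # transport plan between topics and words
ot_dist = np.sum(P * M)              # regularised OT distance, usable as a loss
```

Because every step is differentiable in M (and hence in the topic embeddings), gradients can flow back to the embeddings and the encoder, which is the property the model exploits.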

2. BACKGROUND

In this section, we recap the essential background of neural topic models and optimal transport.

2.1. NEURAL TOPIC MODELS

Most existing NTMs can be viewed as extensions of the VAE framework, where the latent variables are interpreted as topics. Suppose the document collection to be analysed has V unique words (i.e., the vocabulary size). Each document is represented by a word count vector x ∈ N^V and a latent distribution over K topics, z ∈ R^K. An NTM assumes that z for a document is generated from a prior distribution p(z) and that x is generated from the conditional distribution p_φ(x|z), modelled by a decoder φ. The model's goal is to infer the topic distribution given the word counts, i.e., to calculate the posterior p(z|x), which is approximated by the variational distribution q_θ(z|x) modelled by an encoder θ. Similar to VAEs, the training objective of NTMs is the maximisation of the Evidence Lower BOund (ELBO):

max_{θ,φ} E_{q_θ(z|x)} [log p_φ(x|z)] − KL [q_θ(z|x) || p(z)].  (1)

The first term above is the expected log-likelihood, or reconstruction error. As x is a count-valued vector, it is usually assumed to be generated from the multinomial distribution: p_φ(x|z) := Multi(φ(z)), where φ(z) is a probability vector output by the decoder. Therefore, the expected log-likelihood is proportional to x^T log φ(z). The second term is the Kullback-Leibler (KL) divergence that regularises q_θ(z|x) to be close to its prior p(z). To interpret topics with words, φ(z) is usually constructed by a single-layer network (Srivastava & Sutton, 2017): φ(z) := softmax(Wz), where W ∈ R^{V×K} is a learnable weight matrix whose columns can be interpreted as (unnormalised) distributions of topics over words.
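The two ELBO terms in Eq. (1) can be illustrated with a minimal numerical sketch. We assume, purely for illustration, a Gaussian encoder output pushed through a softmax (in the spirit of Srivastava & Sutton (2017)) and a standard normal prior; the variable names and toy sizes are ours, not the paper's:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(1)
V, K = 8, 3                          # toy vocabulary and topic sizes
W = rng.normal(size=(V, K))          # decoder weights: phi(z) = softmax(W z)
x = rng.integers(0, 4, size=V).astype(float)  # a document's word counts

# Pretend encoder outputs: mean and log-variance of q(z|x); prior N(0, I).
mu, logvar = rng.normal(size=K), np.zeros(K)
eps = rng.normal(size=K)
z = softmax(mu + np.exp(0.5 * logvar) * eps)  # reparameterised topic sample

# Expected log-likelihood term, proportional to x^T log phi(z).
recon = x @ np.log(softmax(W @ z))

# KL divergence between the Gaussian q(z|x) and the standard normal prior.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

elbo = recon - kl                    # the objective of Eq. (1), to be maximised
```

The reconstruction term is always non-positive (x is non-negative and log-probabilities are non-positive), so maximising the ELBO pushes φ(z) toward the empirical word distribution while the KL term pulls q_θ(z|x) toward the prior.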