NEURAL TOPIC MODELING WITH EMBEDDING CLUSTERING REGULARIZATION

Abstract

Topic models have been prevalent for decades, with various applications like automatic text analysis, due to their effectiveness and interpretability. However, existing topic models commonly suffer from the notorious topic collapsing issue: the discovered topics semantically collapse towards each other, leading to highly repetitive topics, insufficient topic discovery, and damaged model interpretability. In this paper, we propose a new neural topic model, the Embedding Clustering Regularization Topic Model (ECRTM), to solve the topic collapsing issue. In addition to the reconstruction error of existing work, we propose a novel Embedding Clustering Regularization (ECR), which forces each topic embedding to be the center of a separately aggregated word embedding cluster in the semantic space. Instead of collapsing together, topic embeddings are thus pushed away from each other and cover different semantics of word embeddings. Our ECR therefore enables each produced topic to contain distinct word semantics, which alleviates topic collapsing. Through jointly optimizing our ECR objective and the neural topic modeling objective, ECRTM generates diverse and coherent topics together with high-quality topic distributions of documents. Extensive experiments on benchmark datasets demonstrate that ECRTM effectively addresses the topic collapsing issue and consistently surpasses state-of-the-art baselines in terms of topic quality, topic distributions of documents, and downstream classification tasks.

1. INTRODUCTION

Topic models have achieved great success in document analysis via discovering latent semantics. They have various downstream applications (Boyd-Graber et al., 2017), like content recommendation (McAuley & Leskovec, 2013), summarization (Ma et al., 2012), and information retrieval (Wang et al., 2007). Current topic models can be roughly classified into two lines: conventional topic models based on probabilistic graphical models (Blei et al., 2003) or matrix factorization (Kim et al., 2015; Shi et al., 2018), and neural topic models (Miao et al., 2016; 2017; Srivastava & Sutton, 2017).

For example, the topics below, discovered by NSTM (Zhao et al., 2021b) on IMDB, semantically collapse towards each other with many uninformative and repetitive words:

    Topic#1: just show even come time one good really going know
    Topic#2: just even really something come going like actually things get
    Topic#3: just one even something come way really like always good
    Topic#4: just get going come one know even really something way
    Topic#5: just like inside get even look come one everything away

Besides producing such repetitive topics and causing insufficient topic discovery, topic collapsing damages the interpretability of topic models: it becomes difficult to infer the real underlying topics that a document contains (Huynh et al., 2020). In consequence, topic collapsing impedes the utilization and extension of topic models; it is therefore crucial to overcome this challenge to build reliable topic modeling applications.

To address the topic collapsing issue and achieve robust topic modeling, we propose in this paper a novel neural topic model, the Embedding Clustering Regularization Topic Model (ECRTM). First, we illustrate the reason for topic collapsing: Figure 1 shows that the topic embeddings of previous state-of-the-art methods mostly collapse together in the semantic space. This makes the discovered topics contain similar word semantics and thus results in topic collapsing. To avoid the collapsing of topic embeddings, we propose the novel Embedding Clustering Regularization (ECR) in addition to the reconstruction error of previous work. ECR considers topic embeddings as cluster centers and word embeddings as cluster samples.
To effectively regularize them, ECR models the clustering soft-assignments between topic embeddings and word embeddings by solving a specifically defined optimal transport problem over them. As such, ECR forces each topic embedding to be the center of a separately aggregated word embedding cluster. Instead of collapsing together, topic embeddings are pushed away from each other and cover different semantics of word embeddings. Thus our ECR enables each produced topic to contain distinct word semantics, which alleviates topic collapsing. Regularized by ECR, our ECRTM achieves robust topic modeling performance by producing diverse and coherent topics together with high-quality topic distributions of documents. Figure 1d shows the effectiveness of ECRTM.

We conclude the main contributions of our paper as follows:

• We propose a novel embedding clustering regularization that avoids the collapsing of topic embeddings by forcing each topic embedding to be the center of a separately aggregated word embedding cluster, which effectively mitigates topic collapsing.

• We further propose a new neural topic model that jointly optimizes the topic modeling objective and the embedding clustering regularization objective. Our model produces diverse and coherent topics together with high-quality topic distributions of documents at the same time.

• We conduct extensive experiments on benchmark datasets and demonstrate that our model effectively addresses the topic collapsing issue and surpasses state-of-the-art baselines with substantially improved topic modeling performance.

2. RELATED WORK

Conventional Topic Models. Conventional topic models (Hofmann, 1999; Blei et al., 2003; Blei & Lafferty, 2006; Mimno et al., 2009; Boyd-Graber et al., 2017) commonly employ probabilistic graphical models to model the generation process of documents with topics as latent variables. They infer model parameters with MCMC methods such as Gibbs sampling (Steyvers & Griffiths, 2007), or with variational inference (Blei et al., 2017). Another line of work uses matrix factorization to model topics (Yan et al., 2013; Kim et al., 2015; Shi et al., 2018). Conventional topic models usually need model-specific derivations for different modeling assumptions.

Neural Topic Models. Due to the success of the Variational AutoEncoder (VAE, Kingma & Welling, 2014; Rezende et al., 2014), several neural topic models have been proposed (Miao et al., 2016).
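As context for the VAE-based line of work cited above, a neural topic model of this kind can be sketched as follows: an encoder infers a Gaussian latent variable from a document's bag-of-words, a softmax turns a reparameterized sample into a document-topic distribution, and a linear decoder reconstructs the word distribution. This is a generic, hypothetical outline in the style of such models; the class name, layer sizes, and Gaussian-softmax construction are illustrative choices, not the exact architecture of any cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTopicModel(nn.Module):
    """Minimal VAE-style neural topic model sketch: encode bag-of-words
    to a Gaussian posterior, sample, softmax into a topic distribution,
    and linearly decode back to a word distribution."""

    def __init__(self, vocab_size, n_topics=50, hidden=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        self.decoder = nn.Linear(n_topics, vocab_size, bias=False)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        theta = F.softmax(z, dim=-1)          # document-topic distribution
        recon = F.log_softmax(self.decoder(theta), dim=-1)
        nll = -(bow * recon).sum(1).mean()    # reconstruction error
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(1).mean()
        return nll + kl, theta

# usage: one step on a toy batch (hypothetical sizes)
model = NeuralTopicModel(vocab_size=5000)
loss, theta = model(torch.rand(8, 5000))
loss.backward()
```

The reconstruction term plays the role of the "reconstruction error of previous work" that ECR is added to; a regularizer such as ECR would simply be summed into this loss.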



Figure 1: t-SNE visualization (van der Maaten & Hinton, 2008) of word embeddings (•) and topic embeddings under 50 topics. While the topic embeddings mostly collapse together in previous state-of-the-art models (ETM (Dieng et al., 2020), NSTM (Zhao et al., 2021b), and WeTe (Wang et al., 2022)), our ECRTM successfully avoids the collapsing by forcing each topic embedding to be the center of a separately aggregated word embedding cluster.

The repetitive topics listed in the introduction were discovered by NSTM (Zhao et al., 2021b) on IMDB; they semantically collapse towards each other, sharing many uninformative and repetitive words. Topic collapsing also incurs insufficient topic discovery: many latent topics remain undisclosed, leaving the discovered topics insufficient for understanding documents (Dieng et al., 2020).

