BT-CHAIN: BIDIRECTIONAL TRANSPORT CHAIN FOR TOPIC HIERARCHY DISCOVERY

Abstract

Topic modeling has long been an important tool for text analysis. Traditionally, the topics discovered by a model are assumed to be independent. However, as the semantic representation of a concept, a topic is naturally related to others, which motivates learning hierarchical topic structures. Most existing Bayesian models designed to learn such hierarchies require nontrivial posterior inference. Although recent transport-based topic models bypass posterior inference, none of them considers deep topic structures. In this paper, we represent a document by its word embeddings and propose a novel bidirectional transport chain to discover multi-level topic structures, where each layer learns a set of topic embeddings and the document's hierarchical representations are defined as a series of empirical distributions determined by the topic proportions and the corresponding topic embeddings. To fit such hierarchies, we develop an upward-downward optimizing strategy under the recent conditional transport theory: document information is first transported via the upward path, and the hierarchical representations are then refined layer-wise via the downward path, according to the adjacent upper and lower layers. Extensive experiments on text corpora show that our approach enjoys superior modeling accuracy and interpretability. Moreover, experiments on learning hierarchical visual topics from images demonstrate the adaptability and flexibility of our method.

1. INTRODUCTION

Topic models (TMs) such as latent Dirichlet allocation (LDA) (Blei et al., 2003), Poisson factor analysis (PFA) (Zhou et al., 2012), and their various extensions (Teh et al., 2006; Hoffman et al., 2010; Blei, 2012; Zhou et al., 2016) are a family of popular techniques for discovering the hidden semantic structure of a collection of documents in an unsupervised manner. Beyond learning shallow topics, mining latent hierarchical topic structures has attracted much research effort, since hierarchies are ubiquitous in large text corpora (Meng et al., 2020; Lee et al., 2022) and serve a wide range of applications (Grimmer, 2010; Zhang et al.; Guo et al., 2020). Hierarchical Bayesian probabilistic models have been commonly used to learn topic structures (Blei et al., 2010; Paisley et al., 2014; Gan et al., 2015; Henao et al., 2015; Zhou et al., 2016), where a hierarchy of topics is learned and the topics in higher layers serve as priors for the topics in lower layers. Despite the success of Bayesian models in topic structure mining, most of them rely on Bayesian posterior inference, e.g., Markov chain Monte Carlo (MCMC) or variational inference (VI), to optimize their parameters, which is usually nontrivial to derive and can be inflexible and inefficient for large text corpora (Zhang et al., 2018). Recent developments in autoencoding variational inference (AVI) (Kingma & Welling, 2013; Rezende et al., 2014) provide stronger inference tools for Bayesian models and have inspired several neural topic models (Zhang et al., 2018; Duan et al., 2021a), resulting in improved efficiency and flexibility. However, applying AVI to neural topic models still has limitations. First, being only asymptotically exact, the estimation of the variational posterior always involves a trade-off between accuracy and efficiency (Salimans et al., 2015).
Besides, the latent distributions are required to be reparameterizable and the KL divergence is expected to have an analytical form, both of which are hard to satisfy for topic models, since they usually depend on the Dirichlet or Gamma distribution (Blei et al., 2003; Zhou et al., 2015). Another concern comes from likelihood maximization, under which the inference of topic structure relies on word co-occurrence patterns within a document. This has recently been found to yield poor-quality topics when co-occurrence evidence is scarce, e.g., in corpora with few documents or short texts (Huynh et al., 2020; Wang et al., 2022). This concern is even more acute in hierarchy learning, where more topics and their correlations need to be inferred (Meng et al., 2020). Several existing studies have sought to incorporate meta knowledge to improve topic representations. Such side information may come from various sources, including knowledge graphs (Xie et al., 2015; Duan et al., 2021b), pre-trained language models (Bianchi et al., 2021; Meng et al., 2022), and word embeddings (Wang et al., 2022; Duan et al., 2021a). Another notable recent development is the conditional transport (CT) theory (Zheng & Zhou, 2021a). It provides an efficient tool to measure the distance between two probability distributions and has been employed in numerous machine learning problems, such as domain adaptation, generative modeling, and document representation (Zheng et al., 2021; Tanwisuth et al., 2021; Wang et al., 2022). The CT distance is defined by the bidirectional (forward and backward) transport cost between the source and target distributions, allowing the two distributions not to share the same support. Moreover, the CT distance can be unbiasedly approximated with discrete empirical distributions, making it amenable to stochastic gradient descent-based optimization. Wang et al. (2022) first introduced CT into topic modeling by minimizing the transport cost between the word and topic spaces, resulting in better topic quality and document representation. A similar idea is shared by recent optimal transport-based methods (Kusner et al., 2015; Huynh et al., 2020; Zhao et al., 2021). However, they all focus on single-layer topic discovery, ignoring multi-level topic dependencies. This paper goes beyond hierarchical Bayesian models for topic structure learning and aims to discover topic hierarchies based on the conditional transport between distributions. To formulate topic structure learning as a transport problem, we first provide a hierarchical, distributional view of topic modeling, where each layer of a topic hierarchy learns a set of topics represented as embedding vectors. Moreover, the to-be-learned topics share the same word embedding space and are organized in a taxonomy in which upper-level topics are more general and lower-level topics are more specific (Zhang et al., 2018). In detail, we view each document as an empirical distribution over word embeddings and consider that the document can also be represented by the topic embeddings (Wang et al., 2022) at each layer. These hierarchical empirical distributions have different supports but share semantic consistency across topical levels. With this view, we propose to learn topic hierarchies with a Bidirectional Transport chain (BT-chain), where a document's topic distributions at two adjacent layers are learned by being pushed close to each other in terms of the CT loss. This results in a more flexible and efficient method than AVI-based neural topic models, while keeping the interpretability of Bayesian models. With a mechanism different from previous hierarchical topic models, the proposed BT-chain is a straightforward and novel approach to topic structure learning that can be flexibly integrated with deep neural networks.
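To make this view concrete, the following minimal sketch treats one document as an empirical distribution over word embeddings and a single layer's topics as embedding vectors, then evaluates a CT-style bidirectional cost: a forward term that transports each word to the topics via a softmax-similarity conditional plan, and a backward term that transports each topic's mass (given by the topic proportions) back to the words. The function name `ct_cost`, the squared-Euclidean ground cost, and the softmax plans are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def ct_cost(W, T, theta):
    """Sketch of a CT-style bidirectional transport cost.

    W:     (n, H) word embeddings of one document (empirical distribution)
    T:     (K, H) topic embeddings at one layer
    theta: (K,)   topic proportions of the document at that layer
    """
    # squared-Euclidean ground cost between every word and every topic
    C = ((W[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    sim = W @ T.T  # similarity logits shared by both directions
    # forward: each word's unit mass split over topics by a softmax plan
    fwd = (softmax(sim, axis=1) * C).sum(1).mean()
    # backward: each topic's mass theta_k split over words by a softmax plan
    bwd = theta @ (softmax(sim.T, axis=1) * C.T).sum(1)
    return fwd + bwd

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 8))            # 6 words, 8-dim embeddings
T = rng.normal(size=(3, 8))            # 3 topics in the same space
theta = np.array([0.5, 0.3, 0.2])      # topic proportions (sum to 1)
print(ct_cost(W, T, theta))
```

In the hierarchical setting, a cost of this shape would be applied between each pair of adjacent layers, pushing their empirical distributions toward each other.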
To achieve an efficient, end-to-end training algorithm, we carefully develop an upward-downward optimizing strategy, which first warms up the empirical distributions by transporting the input along the bottom-to-top path, and then applies a backward layer-wise refinement that accounts for the bidirectional information stream from a Bayesian perspective. The main contributions of this paper are as follows: (1) We view hierarchical topic modeling from a new perspective of multi-layer conditional transport, which enables us to develop a novel bidirectional transport chain for topic structure learning. (2) To implement the proposed method effectively and efficiently, we propose an upward-downward training algorithm for BT-chain with proper amortization strategies. (3) We conduct extensive experiments on text corpora to show that our approach enjoys superior modeling accuracy and interpretability compared with state-of-the-art hierarchical topic models. To extend the application of topic modeling, we also apply our method to learning hierarchical visual topics from images, which yields interesting visualizations.
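The two-pass structure of the upward-downward strategy can be sketched as follows. Everything here is an illustrative stand-in: the layer sizes, the `proportions` encoder, and the squared-distance "gap" used in place of the per-layer CT loss are assumptions for the sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy 3-layer hierarchy: per-layer topic embeddings (sizes are illustrative)
topics = [rng.normal(size=(k, 8)) for k in (16, 8, 4)]
doc = rng.normal(size=(20, 8))        # one document as 20 word embeddings

def proportions(points, T):
    # crude stand-in for an amortized encoder: average softmax similarity
    sim = points @ T.T
    e = np.exp(sim - sim.max(1, keepdims=True))
    return (e / e.sum(1, keepdims=True)).mean(0)

# upward path: warm up per-layer representations bottom-to-top; each
# layer's representation is an empirical distribution over its topics
thetas, rep = [], doc
for T in topics:
    theta = proportions(rep, T)       # topic proportions at this layer
    thetas.append(theta)
    rep = theta[:, None] * T          # pass proportion-weighted topics upward

def barycentre(theta, T):
    # mean embedding of a layer's empirical distribution
    return theta @ T

# downward path: revisit layers top-to-bottom and measure how far each
# layer sits from the one below (a squared-distance proxy for the CT loss
# that would be minimized between adjacent layers)
for l in reversed(range(len(topics))):
    below = doc.mean(0) if l == 0 else barycentre(thetas[l - 1], topics[l - 1])
    gap = ((barycentre(thetas[l], topics[l]) - below) ** 2).sum()
    print(f"layer {l}: gap = {gap:.3f}")
```

In an actual trainer, the per-layer gaps would be differentiable CT losses whose gradients update both the topic embeddings and the encoder parameters.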

2. BACKGROUND

In this section, we recap the background of the transport distance between two discrete distributions. Consider two discrete probability distributions $p, q \in \mathcal{P}(\mathcal{X})$ on a space $\mathcal{X} \subseteq \mathbb{R}^H$: $p = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $q = \sum_{j=1}^{m} v_j \delta_{y_j}$, where $x_i$ and $y_j$ are points in the same space $\mathcal{X}$. Here $u \in \Sigma_n$ and $v \in \Sigma_m$, the probability simplices of $\mathbb{R}^n$ and $\mathbb{R}^m$, denote the probability weights of the discrete states, satisfying $\sum_{i=1}^{n} u_i = 1$ and $\sum_{j=1}^{m} v_j = 1$, and $\delta_x$ denotes a point mass located at $x \in \mathbb{R}^H$.
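As a small numerical illustration of this setup (with made-up support points and uniform weights), the sketch below builds $p$ and $q$ as weighted point clouds and evaluates the transport cost of one particular coupling, the independent plan $uv^\top$; since that plan is feasible but generally not optimal, its cost upper-bounds the optimal transport cost.

```python
import numpy as np

# illustrative sizes: embedding dimension H, support sizes n and m
H, n, m = 4, 3, 5
rng = np.random.default_rng(0)

X = rng.normal(size=(n, H))    # support points x_i of p
Y = rng.normal(size=(m, H))    # support points y_j of q
u = np.full(n, 1.0 / n)        # weights u in the simplex (uniform here)
v = np.full(m, 1.0 / m)        # weights v in the simplex

# pairwise ground cost c(x_i, y_j) = ||x_i - y_j||^2
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

# transport cost under the independent coupling u v^T, a feasible
# (though not optimal) plan, hence an upper bound on the OT cost
cost = (np.outer(u, v) * C).sum()
print(cost)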

