TOPIC-AWARE CONTEXTUALIZED TRANSFORMERS

Abstract

Trained on disjoint fixed-length segments, Transformers successfully transform static word embeddings into contextualized word representations. However, they often restrict the context of a token to the segment it resides in and hence neglect the flow of contextual information across segments, failing to capture longer-term dependencies beyond the predefined segment length. This paper uses a probabilistic deep topic model to provide contextualized embeddings at both the token and segment levels. It also introduces a contextual next-word-embedding-guided topic attention module that injects contextualized topic information into Transformer-based architectures. The proposed method not only captures the global semantic coherence of all segments and word co-occurrence patterns, but also enriches the representation of each token by adapting it to its local context, which goes beyond the segment it resides in and can be flexibly defined according to the target task while maintaining control over memory footprint and computation time. Experiments on various corpora show that, while adding only a few extra parameters, the proposed topic-aware contextualized Transformers consistently outperform their conventional counterparts and can be used to generate coherent sentences and paragraphs.

1. INTRODUCTION

Language models (LMs) play an important role across a range of natural language processing tasks, such as text summarization (Rush et al., 2015; Gehrmann et al., 2018), neural machine translation (NMT) (Sutskever et al., 2014; Cho et al., 2014a), and image captioning (Herdade et al., 2019; Anderson et al., 2018; Xu et al., 2015). Existing neural LMs are often built either on recurrent units, as used in recurrent neural networks (RNNs) (Cho et al., 2014b; Hochreiter and Schmidhuber, 1997), or on modules based purely on attention mechanisms, as used in the Transformer and its various generalizations (Vaswani et al., 2017; Dai et al., 2019; Radford et al., 2019). Moving beyond traditional recurrent units, Transformers rely mainly on attention mechanisms, in which the direct connections between long-distance word pairs ease optimization and enable the learning of long-range dependencies (Dai et al., 2019), and have recently demonstrated state-of-the-art performance on a wide range of sequence modeling tasks. Rather than representing a token with a predefined word embedding vector, each Transformer layer creates a contextualized representation of each token by attending to different parts of the input segment (Ethayarajh, 2019), allowing the same word to take different representations depending on its context. However, Transformers are usually trained on disjoint fixed-length segments, without any information flow across segments (Dai et al., 2019), limiting contextualization to the current segment. They therefore often fail to take full advantage of other rich contextual information, such as longer-range word dependencies beyond the segment length and semantic relationships between neighboring segments. While a naive way to explore richer contextual information is to increase the segment length, in practice this is usually infeasible due to limited resources, as self-attention requires O(N^2) time and memory for an input window of N tokens at each layer.
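The O(N^2) cost mentioned above comes from the N x N score matrix that every self-attention layer materializes over a segment. The following is a minimal, hypothetical numpy sketch of single-head scaled dot-product attention (identity projections stand in for the learned query/key/value weights, which the paper does not specify here); it is meant only to make the quadratic term visible, not to reproduce any particular implementation.

```python
import numpy as np

def self_attention(x, d_model):
    """Single-head scaled dot-product self-attention over one segment.

    x: (N, d_model) token embeddings for a segment of N tokens.
    The score matrix is N x N, so time and memory grow as O(N^2),
    which is why simply enlarging the segment is costly.
    """
    # Illustrative simplification: identity maps replace learned W_q, W_k, W_v.
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d_model)        # (N, N) -- the O(N^2) term
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights                # contextualized tokens, attention map

rng = np.random.default_rng(0)
segment = rng.normal(size=(8, 16))             # N = 8 tokens, d_model = 16
sa_out, sa_attn = self_attention(segment, d_model=16)
```

Doubling the segment length quadruples the size of `sa_attn`, which is the resource bottleneck the long-range variants below try to avoid.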
Some long-range Transformer variants (Dai et al., 2019; Rae et al., 2020; Rae and Razavi, 2020) aim to extend context via compression, using compressed memory cells to preserve information from previous segments. Transformer-XL (Dai et al., 2019) builds recurrent connections between segments, concatenating past activations with a memory cell of size M, which results in an attention cost of O(N(M + N)). However, the memory cell still requires considerable space, L x M x d_model in an L-layer Transformer with embedding size d_model, which is a non-negligible cost (Rae and Razavi, 2020). Rae et al. (2020) shorten the range of attention for Transformers by compressing past memories into fine-grained and coarser compressed memory slots, but still suffer from high memory consumption since the memory size is quite large (> 1000). In addition, efficient variants of the Transformer's self-attention mechanism have also recently been explored. These models reduce memory requirements by leveraging sparsity in the attention layers (Sukhbaatar et al., 2019), exploiting a factorized sparse representation (Child et al., 2019), replacing dot-product attention with locality-sensitive hashing to decrease complexity (Kitaev et al., 2020), or using product-key attention to increase the key space (Lample et al., 2019). Besides, Chen et al. (2019) represent sentence-level context as latent topic representations via a convolutional neural network and use these context representations to improve translation. However, to our knowledge, leveraging contextualized topic information by capturing semantic coherence with a deep probabilistic topic model has not previously been applied to Transformers. Furthermore, compared with pre-training, fine-tuning is relatively inexpensive (Devlin et al., 2019).
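The Transformer-XL recurrence described above can be pictured with a short, hypothetical numpy sketch (again single-head, no learned projections, and omitting Transformer-XL's relative positional encodings): queries come only from the current segment, while keys and values span the cached memory plus the current segment, giving the O(N(M + N)) cost and the L x M x d_model cache.

```python
import numpy as np

def xl_attention(x, memory, d_model):
    """Attend over [memory; current segment], Transformer-XL style.

    x:      (N, d_model) current segment activations
    memory: (M, d_model) cached activations from previous segments
    Keys/values span M + N positions, so the score matrix is
    (N, M + N): cost O(N * (M + N)) instead of O(N^2) with a
    longer segment, but an L-layer cache still holds
    L * M * d_model values.
    """
    q = x                                         # queries: current tokens only
    kv = np.concatenate([memory, x], axis=0)      # (M + N, d_model)
    scores = q @ kv.T / np.sqrt(d_model)          # (N, M + N)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv, w

rng = np.random.default_rng(1)
mem = rng.normal(size=(12, 16))                   # M = 12 cached positions
seg = rng.normal(size=(4, 16))                    # N = 4 current tokens
xl_out, xl_attn = xl_attention(seg, mem, d_model=16)
```

This is the memory footprint that the topic-model route below sidesteps: a topic summary of past context is far smaller than a cache of per-position activations.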
Nevertheless, most current contextualized models are trained independently on different datasets, without making good use of the publicly released pre-trained models (Radford et al., 2019; Devlin et al., 2019; Radford et al., 2018; Brown et al., 2020; Peters et al., 2018; Yang et al., 2019), which pair unsupervised pre-training with a large amount of training data. This motivates us to explore a general intervention on top of those predecessors that provides longer-range dependencies through a deep topic model, gaining performance at little computational cost. Unlike RNN- or Transformer-based LMs, topic models (Blei et al., 2003; Teh et al., 2006; Zhou and Carin, 2015; Gan et al., 2015; Zhou et al., 2016; Zhao et al., 2018) are well suited for capturing global semantic coherence by extracting word co-occurrence patterns into semantically meaningful topics, which can be viewed as contextualized word representations of the entire target corpus, including all segments. Since topic models are appropriate for capturing long-range dependencies, several approaches have attracted significant recent interest by leveraging topic models to improve RNN-based language models (Dieng et al., 2017; Ahn et al., 2016; Lau et al., 2017; Wang et al., 2018a; Guo et al., 2019). Dieng et al. (2017) and Ahn et al. (2016) integrate the syntactic dependencies of RNNs and the semantic topics of latent topic models. Lau et al. (2017) introduce an attention-based convolutional neural network to extract semantic topics for extending the RNN cell. Wang et al. (2018a) learn the global semantic coherence of a document via a neural topic model and use the learned latent topics to build a mixture-of-experts language model. Guo et al. (2019) extract recurrent hierarchical semantic structure via a dynamic deep topic model to guide natural language generation. Motivated by these successes in integrating topic information into RNN-based LMs, we focus on using a topic model to provide richer contextual information to improve the Transformer.
In particular, we consider using the Poisson gamma belief network (PGBN) (Zhou et al., 2016; Zhang et al., 2018), a state-of-the-art probabilistic topic model that can be equivalently represented as a multi-stochastic-layer deep generalization of vanilla topic models (Blei et al., 2003; Zhou et al., 2012), to extract globally shared semantic topic representations of user-defined contexts. To this end, three different types of contextual topic information are provided to introduce long-range semantic dependencies into Transformers. (i) We first introduce the contextual token embedding (TE), guided by the topic model, to enrich the representation of each token; it not only extracts global semantics from the corpus, but also provides a localized representation of a token given either its preceding or surrounding context (which one to use is task-dependent). (ii) To utilize the contextual information of a segment, we develop the contextual segment embedding (SE), which constructs a set of virtual words that is placed before the word sequence of the current segment and fed into the Transformer. As such, the generation of any token in one segment depends on semantic context from the previous segments. (iii) We then further develop a multi-head topic attention (TA) module within the Transformer, which selects semantically related topics for generating each token, a design inspired by how a topic model generates a token given the topics and corresponding topic proportions. To encourage the topic select-attention to focus on the topics to which the topic model is most likely to assign the predicted token, during training we add a constraint between the attention weights and the latent representation of the predicted word. In addition, a sparsity penalty is imposed on the topic select-attention, encouraging the network to focus on only a small subset of the extracted topics.
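To fix ideas, the topic attention (TA) component in (iii) can be caricatured as cross-attention from token hidden states to a small set of global topic vectors. The numpy sketch below is purely illustrative and hypothetical (single head, no learned projections, and it omits the paper's guiding constraint and sparsity penalty); the names `hidden` and `topic_vecs` are our own. The point is the shape of the computation: per-token topic proportions over K topics at O(N * K) cost, with K far smaller than a memory cache.

```python
import numpy as np

def topic_attention(h, topics, d_model):
    """Tokens (queries) attend over K global topic vectors (keys/values).

    h:      (N, d_model) token hidden states from a Transformer layer
    topics: (K, d_model) corpus-level topic embeddings (e.g. from a
            pre-trained topic model such as PGBN)
    Each token mixes in global semantics at O(N * K) cost.
    """
    scores = h @ topics.T / np.sqrt(d_model)      # (N, K)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # per-token topic proportions
    return h + w @ topics, w                      # residual topic injection

rng = np.random.default_rng(2)
hidden = rng.normal(size=(6, 16))                 # N = 6 tokens
topic_vecs = rng.normal(size=(10, 16))            # K = 10 topics
ta_out, ta_props = topic_attention(hidden, topic_vecs, d_model=16)
```

In the actual model, the rows of `ta_props` are what the training-time constraint and sparsity penalty act on, pushing each token's attention toward the few topics the topic model would assign it to.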
Moving beyond conventional Transformers, our model not only utilizes longer-range word dependencies beyond the segment length and semantic relationships across all segments, but also generalizes easily to any pre-trained Transformer-based model by jointly fine-tuning on the target corpus. It adds only minor memory and computation overhead compared with fine-tuning the Transformer-based model alone. We demonstrate the effectiveness of our method both quantitatively and qualitatively.




