TOPIC-AWARE CONTEXTUALIZED TRANSFORMERS

Abstract

Trained on disjoint fixed-length segments, Transformers successfully transform static word embeddings into contextualized word representations. However, they often restrict the context of a token to the segment it resides in, neglecting the flow of contextual information across segments and hence failing to capture longer-term dependencies beyond the predefined segment length. This paper uses a probabilistic deep topic model to provide contextualized embeddings at both the token and segment levels. It also introduces a contextual next-word-embedding-guided topic attention module that injects contextualized topic information into Transformer-based architectures. The proposed method not only captures the global semantic coherence of all segments and word co-occurrence patterns, but also enriches the representation of each token by adapting it to its local context, which goes beyond the segment it resides in and can be flexibly defined according to the target task while maintaining control over memory footprint and computational time. Experiments on various corpora show that, adding only a few extra parameters, the proposed topic-aware contextualized transformers consistently outperform their conventional counterparts and can be used to generate coherent sentences and paragraphs.

1. INTRODUCTION

Language models (LMs) play an important role across a range of natural language processing tasks, such as text summarization (Rush et al., 2015; Gehrmann et al., 2018), neural machine translation (NMT) (Sutskever et al., 2014; Cho et al., 2014a), and image captioning (Herdade et al., 2019; Anderson et al., 2018; Xu et al., 2015). Existing neural LMs are often built on either recurrent units, as used in recurrent neural networks (RNNs) (Cho et al., 2014b; Hochreiter and Schmidhuber, 1997), or purely attention-based modules, as used in the Transformer and its various generalizations (Vaswani et al., 2017; Dai et al., 2019; Radford et al., 2019). Moving beyond traditional recurrent units, Transformers rely mainly on attention mechanisms, in which direct connections between long-distance word pairs might ease optimization and enable the learning of long-range dependencies (Dai et al., 2019); they have recently demonstrated state-of-the-art performance on a wide range of sequence modeling tasks. Rather than representing a token with a predefined word embedding vector, each Transformer layer creates a contextualized representation of each token by attending to different parts of the input segment (Ethayarajh, 2019), allowing the same word to take different representations depending on its context. However, Transformers are usually trained on disjoint fixed-length segments, without any information flow across segments (Dai et al., 2019), limiting contextualization to the current segment. Therefore, they often fail to take full advantage of other rich contextual information, such as longer-range word dependencies beyond the segment length and semantic relationships between neighboring segments. While a naive solution for exploring richer contextual information is to increase the segment length, in practice this is usually infeasible due to limited resources, since self-attention requires O(N^2) computation and memory for a segment of length N at each layer.
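To make the quadratic scaling concrete, the following minimal sketch (segment lengths and head count are illustrative, not taken from any cited model) counts the entries of the attention-score matrix that self-attention materializes per head:

```python
def attention_score_entries(seg_len: int, num_heads: int = 1) -> int:
    """Entries of the attention-score matrices for one layer.

    Self-attention forms an N x N score matrix per head, so both
    memory and compute grow as O(N^2) in the segment length N.
    """
    return num_heads * seg_len * seg_len

# Doubling the segment length quadruples the score-matrix size:
print(attention_score_entries(512))   # 262144
print(attention_score_entries(1024))  # 1048576
```

This is why simply enlarging the segment is rarely a viable way to capture longer-range context: every doubling of N multiplies the per-layer attention cost by four.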
Some long-range Transformer variants (Dai et al., 2019; Rae et al., 2020; Rae and Razavi, 2020) aim to extend the context via compression, using compressed memory cells to preserve the information of previous segments. The Transformer-XL (Dai et al., 2019) builds recurrent connections between segments, concatenating past activations with a memory cell of size M ≥ N, which results in an attention cost of O(N(M + N)). However, the memory cell still requires considerable space, L × M × d_model in an L-layer transformer with embedding size d_model, which incurs a non-negligible cost (Rae and Razavi, 2020). Rae et al. (2020) shorten the range of attention
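A back-of-the-envelope calculation of the two costs described above can be sketched as follows; the layer count, memory length, and embedding size used are hypothetical, chosen only for illustration and not taken from the cited papers:

```python
def xl_memory_floats(num_layers: int, mem_len: int, d_model: int) -> int:
    """Cached activations in a Transformer-XL-style memory.

    Each of the L layers stores M past activation vectors of
    dimension d_model, giving L x M x d_model values in total.
    """
    return num_layers * mem_len * d_model

def xl_attention_score_entries(seg_len: int, mem_len: int) -> int:
    """Per-head attention-score entries with segment-level recurrence.

    Each of the N queries attends over M cached plus N current
    keys, giving the O(N(M + N)) cost noted in the text.
    """
    return seg_len * (mem_len + seg_len)

# Illustrative 18-layer model with d_model = 1024 and a 384-token memory:
print(xl_memory_floats(18, 384, 1024))        # 7077888 cached values
print(xl_attention_score_entries(128, 384))   # 65536 entries per head
```

Even at these modest illustrative sizes, the cache holds several million values per stored precision unit, which is the non-negligible memory cost that motivates the compression schemes discussed next.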

