Topic Aware Transformer: Domain Shift for Unconditional Text Generation Models

Abstract

Our goal is to guide pre-trained language models (PLMs) toward unconditional text generation tasks while resolving the domain gap and avoiding catastrophic forgetting. Because Transformer-based models are pretrained on corpora far more massive and heterogeneous than any specific target corpus, the gap between these corpora and the target corpus raises the question of whether such PLMs will actually benefit this task even after fine-tuning. As the domain adaptation of PLMs needs to bridge this gap, we propose a framework, Topic Aware Transformer (TAT), that adapts PLMs for target-aware text generation while alleviating catastrophic forgetting. The motivation of TAT is to distill the target-specific knowledge as topics, and to steer PLMs toward these topics. This requirement and motivation lead us to introduce a topic steering layer (TSL) as an additional layer, and Topic Distribution Modeling (TDM) as a training task. Experiments show that these components resolve the gap via domain shift, and can tailor PLMs to generate text that better reflects a given small fine-tuning corpus.

1. INTRODUCTION

Our goal is to adapt pre-trained language models (PLMs) to achieve unconditional text generation toward a target domain. The success of Transformer-based PLMs motivates us to explore how to fine-tune them so as to well reflect a given target corpus, thereby generating more personalized texts with very little specialization. The size of the target corpus is generally much smaller than that of existing pre-training corpora, which may lead to catastrophic forgetting (Ramasesh et al., 2021). For example, the popular pre-training data sets English Gigaword Fifth Edition (Parker et al., 2011) and ClueWeb 2012-B occupy 16GB and 25TB, respectively. PLMs can become biased toward the patterns of language used in the training data (Keskar et al., 2019). Given the rapid diversification of applications, an approach is needed that effectively achieves domain shift without catastrophic forgetting. Toward this domain shift, we propose a framework, Topic Aware Transformer (TAT), that adapts PLMs to unconditional generative tasks while alleviating catastrophic forgetting. As domain knowledge consists of global (e.g., linguistic) and specific (e.g., semantic) knowledge, our intuition is that knowledge can be represented as a distribution of words, and that the gap between the source and the target domain can be taken to be the difference between these distributions. These intuitions motivate TAT to detect these distributions via topics, and to steer PLMs toward these topics so as to highlight the target-specific knowledge. Concretely, TAT introduces a topic steering layer (TSL), an additional layer that detects topics and supports the training of PLMs, and Topic Distribution Modeling (TDM), a training task that aligns text with the topic representation of the target domain. To prevent catastrophic forgetting, TAT fine-tunes PLMs while bridging the domain gap without updating PLM parameters.
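The intuition that the domain gap is a difference between word distributions can be made concrete with a toy example. The sketch below (illustrative only; the corpora, smoothing constant, and divergence choice are our assumptions, not details from the paper) estimates unigram distributions for a "source" and a "target" snippet over a shared vocabulary and measures their gap with KL divergence:

```python
from collections import Counter
import math

def word_dist(tokens, vocab, eps=1e-9):
    """Unigram word distribution over a shared vocabulary, with smoothing
    so that words absent from one corpus still get nonzero probability."""
    counts = Counter(tokens)
    total = sum(counts.get(w, 0) for w in vocab) + eps * len(vocab)
    return {w: (counts.get(w, 0) + eps) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q): how far the target distribution p is from the source q."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical stand-ins for a generic pre-training corpus and a target corpus.
source = "the model learns general language patterns from web text".split()
target = "the movie was thrilling and the acting was superb".split()
vocab = sorted(set(source) | set(target))

p = word_dist(target, vocab)
q = word_dist(source, vocab)
gap = kl_divergence(p, q)  # larger value -> larger domain gap
```

A larger divergence signals more target-specific vocabulary that the PLM's source distribution underrepresents, which is exactly the knowledge the paper proposes to isolate as topics.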
Experiments confirm that TAT supports PLMs and verify its advantages as follows:
• Theoretical contributions: TSL allows topics to act as unsupervised labels that represent global and target-specific word distributions as domain knowledge, and adapts PLMs to resolve the gap and perform domain shift over topics.
• Practical contributions: As TAT updates only the target-specific word distributions and does not need to update the parameters of PLMs, it generates more target-specific texts at lower computational cost than is possible when using previous PLMs alone, while preserving the PLMs' advantages.
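To make the "additional layer over a frozen PLM" idea tangible, the following sketch shows one way a topic steering layer could operate: infer a per-token topic mixture from frozen hidden states, then add a topic-weighted residual. All names, shapes, and the residual formulation here are our assumptions for illustration; the paper's actual TSL architecture may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_topics, seq_len = 16, 4, 5

# Hypothetical stand-in for a frozen PLM layer output: (seq_len, d_model).
hidden = rng.normal(size=(seq_len, d_model))

# Only these topic parameters would be trained; the PLM weights stay frozen,
# which is how catastrophic forgetting is avoided.
topic_query = rng.normal(size=(d_model, n_topics))   # detects the topic mixture
topic_embed = rng.normal(size=(n_topics, d_model))   # target-specific topic vectors

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topic_steering_layer(h):
    """Infer a per-token topic mixture, then steer hidden states toward
    the corresponding topic embeddings via a residual update."""
    mixture = softmax(h @ topic_query)     # (seq_len, n_topics)
    steered = h + mixture @ topic_embed    # residual steering toward topics
    return steered, mixture

steered, mixture = topic_steering_layer(hidden)
```

Because the PLM's own parameters never receive gradients in this setup, only the small topic matrices are updated during fine-tuning, which matches the paper's claim of lower computational cost.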

2. PREVIOUS WORK

Recently, pre-trained neural language models (NLMs), such as BERT (Devlin et al., 2019), GPT2 (Radford et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020), use Transformer (Vaswani et al., 2017) to learn contextualized text representations, and have yielded great advances in NLP tasks. Despite this appealing performance, these Transformer-based models are better at exploring the relationships among local tokens than global semantics (e.g., word collocation over a given corpus) (Wang et al., 2020). As no Transformer-based model considers these explicit semantics, Wang et al. (2020) rearranged and explored the semantics of topic models and developed a topic-friendly assistant for Transformer-based abstractive summarization models. The UNIfied pre-trained Language Model (Dong et al., 2019) supports both NLU and natural language generation (NLG) tasks by employing a shared Transformer network and specific self-attention masks that control which context the prediction is conditioned on. While He et al. (2018) improve existing NMT models through layer-wise coordination of the encoder and decoder, and use modified attention masks to train both the encoder and the decoder simultaneously, their framework cannot be applied directly to domain-specific text generation; our framework differs from their pre-training in terms of task objectives, topic introduction, and fine-tuning. BertSUM (Wang et al., 2020) notes that topic models are better at learning explicit document semantics than Transformers. Different from their work, TAT aims to adapt NLMs to text generation tasks by performing domain shift. As existing PLMs are trained on large raw text corpora that do not necessarily contain sufficient knowledge or patterns directly related to the target-specific task, they still suffer from several potential limitations. More precisely, texts of a specific task, e.g., movie reviews, can differ from PLM training data (Chen et al., 2022). To address the question of whether pre-training on a corpus more directly tied to the task can further improve performance, continual pretraining (Gururangan et al., 2020) has shown the benefit of optimizing a PLM on a target domain before further fine-tuning. UDALM (Karouzos et al., 2021) first trains PLMs by masked language modeling (MLM) on the target domain and then trains a target classifier with source-domain labeled data, while keeping the MLM objective on unlabeled target-domain data. AdaPrompt (Chen et al., 2022) is a framework that adapts a PLM for the end task by considering both the prompts and the verbalizer, and by adaptively continuing pretraining on retrieved data, which can benefit prompt-based methods on NLP downstream tasks.
While these training-based approaches improve handling of the domain gap, this practice of adapting and controlling pre-trained generative models poses the risk of catastrophic forgetting: most approaches to enforcing a control objective result in a dramatic loss of the original model's capabilities beyond the scope of the control objective. To build on these achievements, we note that some knowledge is universal across domains and some is not. Therefore, our approach aims to avoid this problem by incorporating a mechanism that recognizes these relative differences and intensively updates only the target-specific knowledge. With respect to global semantic information, topic models (Blei et al., 2003; Kawamae, 2018; Wang et al., 2020) and their extensions take a global statistical view and examine the word distributions of topics across a given corpus; they represent each document as a bag-of-words (BOW) vector. Although these models organize a given corpus into small sets of prominent topics and have proven to be powerful tools for uncovering latent structure, they and their applications (Chang et al., 2021; Wang et al., 2018; 2020) are not, in the strict sense, sequence models.

https://www.lemurproject.org/clueweb09.php/


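The bag-of-words view and its limitation as a sequence model can be illustrated directly. The sketch below (a toy example with hand-assigned topic groupings; a real topic model such as LDA would infer these) builds BOW vectors and row-normalizes topic counts into topic-word distributions:

```python
import numpy as np

docs = ["the plot was great great", "the actor was great", "stock prices fell"]
vocab = sorted({w for d in docs for w in d.split()})
w2i = {w: i for i, w in enumerate(vocab)}

# Bag-of-words representation: word order is discarded, only counts remain.
bow = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc.split():
        bow[d, w2i[w]] += 1

# Hand-fixed topic groupings for illustration ("movies" vs. "finance");
# an actual topic model would infer these assignments from the corpus.
topic_counts = np.vstack([bow[:2].sum(axis=0), bow[2]])
# Row-normalizing yields each topic's word distribution over the vocabulary.
topic_word = topic_counts / topic_counts.sum(axis=1, keepdims=True)
```

Note that any permutation of a document's words yields the identical BOW vector, which is precisely why such models capture global word distributions but are not, in the strict sense, sequence models.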