PRETRAIN KNOWLEDGE-AWARE LANGUAGE MODELS

Abstract

How much knowledge do pretrained language models hold? Recent research has observed that pretrained transformers are adept at modeling semantics, but it is unclear to what degree they grasp human knowledge, or how to ensure they do so. In this paper we incorporate knowledge-awareness into language model pretraining without changing the transformer architecture, inserting explicit knowledge layers, or adding external storage of semantic information. Rather, we simply signal the existence of entities at the input of the transformer in pretraining, with an entity-extended tokenizer, and at the output, with an additional entity prediction task. Our experiments show that solely by adding these entity signals in pretraining, significantly more knowledge is packed into the transformer parameters: we observe improved language modeling accuracy, factual correctness in LAMA knowledge probing tasks, and semantics in the hidden representations through edge probing. We also show that our knowledge-aware language model (KALM) can serve as a drop-in replacement for GPT-2 models, significantly improving downstream tasks like zero-shot question answering with no task-related training.

1. INTRODUCTION

The strong effectiveness and rich generalization ability of pretrained language models (PLMs) (1; 2; 3; 4; 5) have raised many questions about what is captured in transformer networks and why. Recent explorations found that pretrained language models may "rediscover" the linguistic pipeline at various transformer layers (6), can serve as implicit knowledge bases for relation extraction (7), perform soft reasoning tasks (8; 9), and conduct some language tasks reasonably well in a fully unsupervised, zero-shot fashion (3; 10). With a sufficiently large number of parameters, i.e., several billion, and enough task-specific supervision, pretrained language models can even directly generate answers to natural language questions, at the same accuracy as state-of-the-art reading comprehension systems, without using any context documents or knowledge graphs (11). Impressive as they are, language models are still far from ready to serve as an "unsupervised multi-task learner" that learns knowledge directly from human language and generalizes to downstream language tasks (3). There are notable gaps in language models' downstream-task performance between models with (11; 12) and without (3) large amounts of task-specific fine-tuning. Language models still (de-)generate dull, factually incorrect, or dream-like text when used for natural language generation (13; 14; 15; 16). These challenges often necessitate over-parameterization (17), grounding on external structural semantics (14; 16; 18; 19), or large amounts of task-specific fine-tuning (11), which are costly, complicated, and not always feasible for every language task. One potential limitation of these language models is their style of pretraining, e.g., auto-regressive language modeling (3) or masked language modeling (1), wherein transformer networks process a sequence of words and are asked to predict the next/masked words.
There is no explicit guidance to the transformers that humans prefer them to capture correct, real-world information. As a result, all the knowledge captured in these pretrained language models is signaled only by patterns of co-occurring words in the input sequence, learned implicitly during pretraining. In this paper, instead of creating bigger models or adding knowledge-specific architectures, we propose to more efficiently leverage the existing parameters in the standard transformer language model, by simply making it aware of the various forms in which an entity can manifest itself, and of its role in the surrounding text. More specifically, this knowledge-awareness is communicated via the input fed to PLMs and the output expected from them during pretraining. For input-awareness, we use an entity-name (surface form) dictionary that tokenizes word spans to their most popularly referred-to entity, e.g., as fuzzy frequency-based entity annotations (20), and feed these entity tokens as a parallel input channel alongside the word tokens. For output-awareness, in addition to the language modeling objective, we add an entity prediction task that guides the model to distinguish the correct entity from various negative distractors. The two objectives together explicitly guide the language model to predict not only the correct words, but also the correct entity behind those words during pretraining, without changing the network architecture. By adding knowledge-awareness to GPT-2 style auto-regressive language models, our pretrained language model, the "Knowledge-Aware Language Model" (KALM), shows significantly improved handling of knowledge-sensitive tasks. In the LAMA knowledge probing tasks (7), KALM outperforms its entity-unaware baseline, GPT-2, by about 25% across all tasks at both base and large transformer sizes. Our 24-layer KALM (Large) is even comparable with the 17-billion-parameter GPT-2 on some tasks.
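The input-awareness step above, which tokenizes word spans to their most popularly referred-to entities, can be sketched as a greedy longest-match dictionary lookup. This is a minimal illustration only: the surface-form dictionary, entity ids, and span length below are toy assumptions, whereas KALM's actual dictionary is built from large-scale entity-linking statistics.

```python
def annotate_entities(tokens, surface_dict, max_span=3):
    """Greedily link the longest matching span to its most popular entity.

    surface_dict maps a lower-cased surface form (tuple of tokens) to the
    entity id it most frequently refers to. Unlinked tokens get the null
    entity "O", giving an entity channel parallel to the word channel.
    """
    entities = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for length in range(min(max_span, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + length])
            if span in surface_dict:
                for j in range(i, i + length):
                    entities[j] = surface_dict[span]
                i += length
                break
        else:
            i += 1  # no entity found starting here; advance one token
    return entities

# Toy dictionary: each surface form points to its most-linked entity id.
surface_dict = {
    ("new", "york"): "Q60",             # New York City
    ("new", "york", "times"): "Q9684",  # The New York Times
}

tokens = ["The", "New", "York", "Times", "is", "based", "in", "New", "York"]
print(annotate_entities(tokens, surface_dict))
# → ['O', 'Q9684', 'Q9684', 'Q9684', 'O', 'O', 'O', 'Q60', 'Q60']
```

Longest-match resolution is what lets "New York Times" link to the newspaper rather than the city, mirroring the frequency-based disambiguation described above.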
KALM more accurately captures commonsense knowledge, factual semantics, and relation semantics in these LAMA tests. The knowledge signals also aid generic language understanding: we observed better language modeling perplexity and word prediction accuracy with KALM as well. The advantages in language modeling also transfer to downstream tasks. In zero-shot question answering, the exact match accuracy of the answers generated by KALM is 20%-100% better than that of an equivalent GPT-2 model. We did not use any task-specific supervision or additional gradient updates, relying solely on the unsupervised knowledge learned in KALM; we only feed in a few example question-answer pairs as templates to indicate how generated answers should look. Injecting rich knowledge signals leads to improvements approximately equal to those gained by doubling the transformer layers, indicating that PLMs can be trained more efficiently: growing the parameters exponentially is not the only way to improve language understanding. To better understand pretraining and the advantage of knowledge-awareness, we leverage the edge probing technique (6; 21) and dissect what is learned in the representations at various gradient steps throughout pretraining. We observe that the auto-regressive transformers start to learn the basics of language at the beginning of pretraining and gradually learn more complex semantics in the process; adding knowledge-awareness greatly accelerates the learning of higher-level semantics, e.g., coreference and entity types, and helps the model perform better on those more complicated tasks.
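The output-awareness side, the entity prediction task that distinguishes the correct entity from negative distractors, amounts to a softmax cross-entropy over dot-product scores between a hidden state and candidate entity embeddings. The sketch below is an illustrative assumption, not KALM's implementation: the hidden state, entity embeddings, and dimensionality are toy values, and in KALM the hidden state would come from the transformer while negatives are sampled from the entity vocabulary.

```python
import math
import random

def entity_prediction_loss(hidden, correct, negatives):
    """Softmax cross-entropy over candidate entities, scored by dot product.

    Returns -log p(correct entity | hidden state), where the candidate set
    is the correct entity's embedding plus the sampled negative embeddings.
    """
    candidates = [correct] + negatives
    scores = [sum(h * e for h, e in zip(hidden, emb)) for emb in candidates]
    # log-sum-exp with max subtraction for numerical stability
    max_s = max(scores)
    log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
    return log_z - scores[0]

random.seed(0)
dim = 4
hidden = [0.9, -0.1, 0.3, 0.5]          # toy contextual hidden state
correct = [1.0, 0.0, 0.2, 0.6]          # embedding of the true entity
negatives = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]

print(round(entity_prediction_loss(hidden, correct, negatives), 4))
```

Minimizing this loss alongside the standard next-word objective is what pushes the representations to encode which entity lies behind a span, rather than only which words surround it.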

2. PRETRAINING KNOWLEDGE-AWARE LANGUAGE MODELS

In this section we first present preliminaries in language modeling and then describe how we add knowledge-awareness to pretraining.

2.1. PRELIMINARIES

In this paper, without loss of generality, we mainly focus on auto-regressive language modeling. Considering the text X as a sequence of tokens (words or sub-words), X = {w_1, ..., w_i, ..., w_n}, the classical unidirectional factorization of language probabilities (22; 23) is: p(X) = ∏_i p(w_i | w_{<i}), where w_{<i} refers to all the tokens appearing before position i. This conditional probability can be parameterized in various ways. An effective choice is to use the uni-directional transformer, as done in GPT-2 (3): p(w_i | w_{<i}) = transformer(w_i | w_{<i}). The language modeling task provides a large amount of data to pretrain very deep transformer networks (5; 24; 25). Scaling up the transformer parameter size leads to significant improvements in language model capability: with wider and deeper transformer layers, it is observed that transformer language models start to capture more complex semantics beyond lexical and syntactic patterns (6; 8; 7). On the other hand, a roughly log-linear relationship between transformer size and output quality has been established, e.g., doubling the quality requires ten times more parameters and training data (3; 17; 10). Even in industry, the marginal gain of increasing parameters will eventually be outweighed by the cost of training and serving such models.

