PRETRAIN KNOWLEDGE-AWARE LANGUAGE MODELS

Abstract

How much knowledge do pretrained language models hold? Recent research has shown that pretrained transformers are adept at modeling semantics, but it is unclear to what degree they grasp human knowledge, or how to ensure they do so. In this paper we incorporate knowledge-awareness into language model pretraining without changing the transformer architecture, inserting explicit knowledge layers, or adding external storage of semantic information. Rather, we simply signal the existence of entities to the input of the transformer in pretraining, with an entity-extended tokenizer, and at the output, with an additional entity prediction task. Our experiments show that solely by adding these entity signals in pretraining, significantly more knowledge is packed into the transformer parameters: we observe improved language modeling accuracy, factual correctness in LAMA knowledge probing tasks, and semantics in the hidden representations through edge probing. We also show that our knowledge-aware language model (KALM) can serve as a drop-in replacement for GPT-2 models, significantly improving downstream tasks such as zero-shot question answering with no task-related training.

1. INTRODUCTION

The strong effectiveness and rich generalization ability of pretrained language models (PLMs) (1; 2; 3; 4; 5) have raised many questions about what is captured in transformer networks and why. Recent explorations found that pretrained language models may "rediscover" the linguistic pipeline at various transformer layers (6), can serve as implicit knowledge bases for relation extraction (7), perform soft reasoning tasks (8; 9), and conduct some language tasks reasonably well in a fully unsupervised, zero-shot fashion (3; 10). With a sufficiently large number of parameters, i.e., several billion, and enough task-specific supervision, pretrained language models can even directly generate answers to natural language questions, at the same accuracy as state-of-the-art reading comprehension systems, without using any context documents or knowledge graphs (11). Impressive as they are, language models are still far from ready to serve as "unsupervised multi-task learners" that learn knowledge directly from human language and generalize to downstream language tasks (3). There are notable gaps between the downstream-task performance of models with (11; 12) and without (3) large amounts of task-specific fine-tuning. Language models still (de-)generate dull, factually incorrect, or dream-like text when used for natural language generation (13; 14; 15; 16). These challenges often necessitate over-parameterization (17), grounding on external structural semantics (14; 16; 18; 19), or large amounts of task-specific fine-tuning (11), which are costly, complicated, and not always feasible for every language task. One potential limitation of these language models is their style of pretraining, e.g., auto-regressive language modeling (3) or masked language modeling (1), wherein transformer networks process a sequence of words and are asked to predict the next/masked words.
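The next-word objective just described, and the auxiliary entity-prediction signal introduced in the abstract, can be sketched as a weighted sum of two cross-entropy losses. This is a minimal illustration, not the paper's actual implementation: the function names and the mixing weight `alpha` are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, targets):
    # Mean negative log-likelihood of the target indices,
    # given per-position logits of shape (positions, classes).
    probs = softmax(logits)
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))


def knowledge_aware_loss(word_logits, word_targets,
                         entity_logits, entity_targets, alpha=0.5):
    # Standard auto-regressive LM loss (predict the next word) plus an
    # auxiliary entity-prediction loss; alpha is a hypothetical mixing weight.
    lm_loss = cross_entropy(word_logits, word_targets)
    entity_loss = cross_entropy(entity_logits, entity_targets)
    return lm_loss + alpha * entity_loss
```

With `alpha = 0`, this reduces to the ordinary language modeling objective; the entity term is the only added training signal, which is the point of the approach: knowledge-awareness without architectural changes.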
There is no explicit guidance to the transformers that humans prefer them to capture correct, real-world information. As a result, all the knowledge captured in these pretrained language models is signaled only by patterns of co-occurring words in the input sequence, learned implicitly during pretraining. In this paper, instead of creating bigger models or adding knowledge-specific architectures, we propose to more efficiently leverage the existing parameters in the standard transformer language model, simply by making it aware of the various forms in which an entity can manifest itself, and of its role in the surrounding text. More specifically, this knowledge-awareness is communicated via the input fed to PLMs and the output expected from them during pretraining. For input-awareness,

