K-ADAPTER: INFUSING KNOWLEDGE INTO PRE-TRAINED MODELS WITH ADAPTERS

Abstract

We study the problem of injecting knowledge into large pre-trained models like BERT and RoBERTa. Existing methods typically update the original parameters of pre-trained models when injecting knowledge. However, when multiple kinds of knowledge are injected, they may suffer from catastrophic forgetting. To address this, we propose K-ADAPTER, which keeps the original parameters of the pre-trained model fixed and supports continual knowledge infusion. Taking RoBERTa as the pre-trained model, K-ADAPTER has a neural adapter for each kind of infused knowledge, like a plug-in connected to RoBERTa. Since there is no information flow between different adapters, they can be trained efficiently in a distributed way. We inject two kinds of knowledge: factual knowledge obtained from automatically aligned text-triplets on Wikipedia and Wikidata, and linguistic knowledge obtained from dependency parsing. Results on three knowledge-driven tasks (six datasets in total), namely relation classification, entity typing and question answering, demonstrate that each adapter improves performance, and that combining both adapters brings further improvements. Probing experiments further indicate that K-ADAPTER captures richer factual and commonsense knowledge than RoBERTa.

1. INTRODUCTION

Language representation models, which are pre-trained on large-scale text corpora through unsupervised objectives like (masked) language modeling, such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018; 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2019), have established state-of-the-art performance on various downstream NLP tasks. Despite the huge empirical success of these pre-trained models, recent studies suggest that models learned in such an unsupervised manner struggle to capture rich knowledge. For example, Poerner et al. (2019) suggest that although language models do well in reasoning about the surface form of entity names, they fail to capture rich factual knowledge. Kassner & Schütze (2019) observe that BERT largely fails to learn the meaning of negation (e.g. "not"). These observations motivate us to study the injection of knowledge into pre-trained models like BERT and RoBERTa. Recently, some efforts have been made to inject knowledge into pre-trained language models (Zhang et al., 2019; Lauscher et al., 2019; Levine et al., 2019; Peters et al., 2019; He et al., 2019; Xiong et al., 2020). Most previous works (as shown in Table 1) augment the standard language modeling objective with knowledge-driven objectives and update model parameters in a multi-task learning manner. Although these methods, with updated pre-trained models, obtain better performance on downstream tasks, they fail at continual learning (Kirkpatrick et al., 2017): model parameters need to be retrained when new kinds of knowledge are injected, which may result in catastrophic forgetting of previously injected knowledge. Meanwhile, the resulting pre-trained models produce entangled representations, which makes it hard to investigate the effect of each kind of knowledge when multiple kinds are injected.
In this paper, we propose K-ADAPTER, a simple and flexible approach that infuses knowledge into large pre-trained models. K-ADAPTER has attractive properties, including support for continual knowledge infusion and disentangled representations. It leaves the original representation of a pre-trained model unchanged and exports a different representation for each type of infused knowledge. This is achieved by integrating compact neural models, dubbed adapters here. The contributions of this paper are summarized as follows:
• We propose K-ADAPTER, a flexible approach that supports continual knowledge infusion into large pre-trained models (e.g. RoBERTa in this work).
• We infuse factual knowledge and linguistic knowledge, and show that adapters for both kinds of knowledge work well on downstream tasks.
• With fine-tuning, K-ADAPTER achieves superior performance on three downstream tasks, and captures richer factual and commonsense knowledge than RoBERTa in probing experiments.
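To make the idea concrete, the sketch below illustrates the core mechanism described above: knowledge-specific adapter modules trained on top of frozen hidden states, with no information flow between adapters. This is a minimal illustration under simplifying assumptions; the class name `BottleneckAdapter` and the down/up-projection structure are hypothetical stand-ins (the paper's actual adapters are transformer-based modules with their own configuration), and a random matrix stands in for the frozen RoBERTa hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckAdapter:
    """Hypothetical knowledge-specific adapter: down-project, nonlinearity,
    up-project, plus a residual connection. Only these parameters would be
    trained; the pre-trained model's parameters stay frozen."""
    def __init__(self, hidden_dim, bottleneck_dim):
        self.w_down = rng.normal(0.0, 0.02, (hidden_dim, bottleneck_dim))
        self.w_up = rng.normal(0.0, 0.02, (bottleneck_dim, hidden_dim))

    def __call__(self, h):
        # h: (seq_len, hidden_dim) hidden states from an intermediate layer
        # The residual keeps the original pre-trained features intact.
        return h + relu(h @ self.w_down) @ self.w_up

# Two independent adapters, e.g. factual and linguistic knowledge. There is
# no information flow between them, so each could be trained separately.
hidden = rng.normal(size=(5, 16))   # stand-in for frozen RoBERTa hidden states
fac_adapter = BottleneckAdapter(16, 4)
lin_adapter = BottleneckAdapter(16, 4)

# Each adapter exports its own knowledge-specific representation; combining
# them (e.g. by concatenation) gives a disentangled joint representation.
combined = np.concatenate([fac_adapter(hidden), lin_adapter(hidden)], axis=-1)
print(combined.shape)  # (5, 32)
```

Because each adapter only reads the frozen hidden states and never another adapter's output, adding a new kind of knowledge means training one new small module, which is what makes continual infusion possible without catastrophic forgetting.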

2. RELATED WORK

Our work relates to the area of injecting knowledge into pre-trained models. As stated in Table 1, previous works differ mainly in the knowledge sources and the objectives used for training.

Table 1: Comparison between our approach (K-ADAPTER) and previous works on injecting knowledge into BERT.

Adapters are knowledge-specific models plugged outside of a pre-trained model, whose inputs are the output hidden-states of intermediate layers of the pre-trained model. We take RoBERTa (Liu et al., 2019) as the base pre-trained model and integrate two types of knowledge: factual knowledge obtained by aligning Wikipedia text to Wikidata triplets, and linguistic knowledge obtained by applying an off-the-shelf dependency parser to web texts. In the pre-training phase, we train the two adapters independently on a relation classification task and a dependency relation prediction task respectively, while keeping the original parameters of RoBERTa frozen. Since the adapters have far fewer trainable parameters than RoBERTa, the training process is memory efficient.

ERNIE (Zhang et al., 2019) injects a knowledge graph into BERT. It aligns entities in Wikipedia sentences to fact triples in WikiData, and discards sentences with fewer than three entities. During training, the input includes sentences and linked facts, and the knowledge-aware learning objective is to predict the correct token-entity alignment. Entity embeddings are trained on fact triples from WikiData via TransE (Bordes et al., 2013). LIBERT (Lauscher et al., 2019) injects pairs of words with synonym and hyponym-hypernym relations from WordNet. The model takes a pair of words separated by a special token as input, and is optimized with a binary classification objective that predicts whether the input pair holds a particular relation. SenseBERT (Levine et al., 2019) considers word-supersense knowledge. It injects knowledge by predicting the supersense of the masked word in the input, where the candidates are nouns and verbs and the ground truth comes
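For readers unfamiliar with the TransE objective that ERNIE's entity embeddings rely on, the sketch below shows its scoring function: a triple (head, relation, tail) is scored by the negative distance ||h + r − t||, so a relation acts as a translation from head to tail in embedding space. The 2-d embeddings here are toy values chosen for illustration, not learned ones.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score for a triple (head, relation, tail):
    the negative L2 distance ||h + r - t||; higher means more plausible."""
    return -np.linalg.norm(h + r - t)

# Toy 2-d embeddings (hypothetical; in practice they are learned on WikiData).
h = np.array([1.0, 0.0])       # head entity
r = np.array([0.0, 1.0])       # relation, modeled as a translation vector
t = np.array([1.0, 1.0])       # tail entity satisfying h + r = t
t_bad = np.array([-1.0, 0.0])  # implausible tail for this (head, relation)

print(transe_score(h, r, t) > transe_score(h, r, t_bad))  # True
```

Training pushes valid triples toward zero distance and corrupted triples away, which is how the entity embeddings that ERNIE consumes are obtained.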

