KNOWLEDGE-IN-CONTEXT: TOWARDS KNOWLEDGEABLE SEMI-PARAMETRIC LANGUAGE MODELS

Abstract

Fully-parametric language models generally require a huge number of model parameters to store the necessary knowledge for solving multiple natural language tasks in zero/few-shot settings. In addition, they are hard to adapt to evolving world knowledge without costly model re-training. In this paper, we develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC), which empowers a parametric text-to-text language model with a knowledge-rich external memory. Specifically, the external memory contains six different types of knowledge: entity, dictionary, commonsense, event, script, and causality knowledge. For each input instance, the KiC model adaptively selects a knowledge type and retrieves the most helpful pieces of knowledge. The input instance along with its knowledge augmentation is fed into a text-to-text model (e.g., T5) to generate the output answer, where both the input and the output are in natural language form after prompting. Interestingly, we find that KiC can be identified as a special mixture-of-experts (MoE) model, where the knowledge selector plays the role of the router that determines the sequence-to-expert assignment in MoE. This key observation inspires us to develop a novel algorithm for training KiC with an instance-adaptive knowledge selector. As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks. By evaluating on 40+ different tasks, we show that KiC Large with 770M parameters easily outperforms large language models that are 4-39x larger. In addition, KiC also exhibits emergent abilities at a much smaller model scale compared to fully-parametric models.

1. INTRODUCTION

Recently, large-scale fully-parametric language models have achieved great success in solving natural language processing (NLP) tasks (Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2022; Kaplan et al., 2020). However, they generally require a huge number of model parameters to store the necessary knowledge for solving multiple NLP tasks in the zero/few-shot setting. Meanwhile, their problem-solving capability only emerges after reaching a certain model scale (Wei et al., 2022). In addition, large parametric language models are hard to adapt to evolving world knowledge without expensive model re-training. To overcome these challenges, there has been increasing interest in developing semi-parametric language models, where a parametric language model is augmented with an external memory containing a large number of text chunks (Borgeaud et al., 2022; Izacard et al., 2022; Khandelwal et al., 2019; Zhong et al., 2022). Although these semi-parametric approaches are shown to be more effective than their much larger parametric counterparts, several challenges remain. First, useful knowledge pieces are generally sparsely distributed over a large textual corpus, so it is difficult to locate and retrieve the text chunk that contains the right knowledge to complement a given input instance. Second, it is difficult to determine the proper text-chunk granularity to cover the desired knowledge; as a result, people usually build indexes over oversized text chunks, which makes it even harder to determine whether the desired knowledge is contained. On the other hand, there is a rich collection of knowledge resources (e.g., knowledge graphs) in which different kinds of knowledge are densely and compactly organized in structured or semi-structured forms. In this paper, we leverage these knowledge resources to construct a semi-parametric language model, simply using off-the-shelf encoders and retrievers to index and search the external memory.

Figure 1: Overview of the KiC model architecture. KiC is augmented with a knowledge-rich memory that contains diverse categories of knowledge. For each input instance, KiC first selects a particular knowledge category and retrieves the most helpful knowledge pieces to augment the input. It then feeds the prompted input into a text-to-text backbone module (e.g., T5) to generate the output answer. (In the depicted example, the input and retrieved knowledge are concatenated as: "Here's a problem to solve: High-pressure systems stop air from ... The following is the reference: Persistent high pressure has a stabilizing ...", and the model outputs "Drought.")

In particular, our primary contribution is developing a novel semi-parametric language model architecture, Knowledge-in-Context (KiC), fueled by a large knowledge-rich external memory (Section 2). Specifically, the memory covers six broad categories of knowledge: entity, dictionary, commonsense, event, script, and causality (Section 2.2). Our comprehensive analysis reveals that a wide range of natural language tasks (31 out of 35 tasks) benefit from adding knowledge, where different knowledge resources help with different subsets of tasks; interestingly, some tasks even improve by more than 10% after adding suitable knowledge. To adaptively utilize knowledge, KiC dynamically identifies the most useful knowledge pieces for each input instance of a given task and places them in the current context for answering the question. We adopt a single text-to-text transformer (e.g., T5) to generate the output answer from the input. Specifically, we append the retrieved knowledge pieces to the input instance, and then feed them into the text-to-text model to generate the output answer (also in natural language). The major advantage of this text-to-text paradigm is that it handles multiple natural language tasks with the same interface and can also generalize to unseen tasks (Sanh et al., 2022; Raffel et al., 2020).
Moreover, we find this training paradigm suits our model design: it teaches the KiC model to select and use knowledge through various seen language tasks, and then to generalize to using knowledge for solving unseen tasks. Our experimental analysis further shows that such instance-adaptive (context-dependent) knowledge augmentation is critical to the success of the KiC model. However, due to its inherently discrete nature, it is difficult to train KiC in a fully-differentiable manner to select the correct knowledge category for each instance. To solve this problem, we find that KiC can be reformulated as a special mixture-of-experts (MoE) model (Jacobs et al., 1991; Jordan & Jacobs, 1994; Shazeer et al., 2017; Fedus et al., 2022), where the knowledge selector is identified as the router that determines the sequence-to-expert assignment in MoE (Section 2.3). Furthermore, the memory partition corresponding to each knowledge category, together with the text-to-text model, can be recognized as a special semi-parametric expert in MoE. This key observation inspires us to develop a novel learning algorithm to train KiC with instance-adaptive knowledge selection capabilities. In our experiments (Section 3), we adopt the same setting as T0 (Sanh et al., 2022): we train KiC models on a collection of tasks and then evaluate on another set of unseen tasks in a zero-shot manner. As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks. With only 0.77B parameters, KiC Large outperforms zero-shot baseline models such as GPT-NeoX-20B and OPT-30B that are 25-38x larger. It achieves 39.4% zero-shot performance on the MMLU benchmark, close to the 5-shot performance of 43.9% achieved by GPT-3, which has 175B parameters (227x larger). KiC also exhibits emergent abilities at a much smaller model scale than fully-parametric models.

2.1. OVERVIEW

In this section, we introduce our proposed KiC language model, which augments a parametric text-to-text Transformer (backbone) model with a knowledge-rich external memory (Figure 1). Overall, KiC consists of the following modules: (i) a parametric text-to-text backbone, (ii) an external knowledge memory with a retriever, and (iii) a knowledge selector. As shown in Figure 1, for each input instance, the knowledge selector first selects a particular knowledge category based on the input context and then retrieves the most helpful knowledge pieces for solving the current problem. The retrieved knowledge is used to complement the input context via concatenation, and the knowledge-augmented textual input is fed into the text-to-text backbone model, which generates the output solution in natural language. The text-to-text backbone can be any encoder-decoder model (e.g., T5, BART) or decoder-only model (e.g., GPT, PaLM). For convenience and without loss of generality, we adopt T5 as our backbone model throughout this paper. In the following subsections, we explain in detail how to construct the knowledge memory along with its retriever (Section 2.2), as well as how to learn the entire KiC model in a fully-differentiable end-to-end manner (Section 2.3).
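The pipeline above (select a knowledge category, retrieve knowledge pieces, concatenate, generate) can be sketched in a few lines. Everything here is an illustrative stand-in: the toy selector, the canned memory, and the prompt template only mirror the example in Figure 1, not the authors' actual implementation.

```python
# Illustrative sketch of the KiC inference pipeline. All component names
# (select_knowledge, retrieve, kic_answer) are hypothetical stand-ins.

KNOWLEDGE_TYPES = ["generalist", "dictionary", "commonsense", "entity",
                   "event", "script", "causality"]

def select_knowledge(x: str) -> str:
    """Toy selector: route inputs mentioning 'because' to causality."""
    return "causality" if "because" in x.lower() else "generalist"

def retrieve(x: str, ktype: str) -> str:
    """Toy retriever: return a canned knowledge piece per category."""
    memory = {"causality": "Persistent high pressure dries out the atmosphere."}
    return memory.get(ktype, "")

def kic_answer(x: str, generate) -> str:
    """generate: the text-to-text backbone (T5 in the paper), as a callable."""
    ktype = select_knowledge(x)
    knowledge = retrieve(x, ktype)
    # Concatenate instance and retrieved knowledge, then prompt the backbone.
    prompt = x if not knowledge else (
        f"Here's a problem to solve: {x}\nThe following is the reference: {knowledge}")
    return generate(prompt)
```

When no knowledge is selected (the "generalist" case discussed in Section 2.3), the input passes to the backbone unchanged.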

2.2. EXTERNAL KNOWLEDGE MEMORY AND RETRIEVER

Knowledge-rich external memory  A significant advantage of semi-parametric models over fully-parametric ones is that we can flexibly change the knowledge resources. As shown in Table 7, structured or semi-structured knowledge resources can often provide more relevant and accurate knowledge than plain text. In this work, we include the following popular, representative knowledge resources, where each knowledge piece is in the form of a < subject, relation, object > triplet. More details about the statistics and examples of these knowledge resources can be found in Appendix A.1.

• Dictionary: We consider dictionary (lexical) knowledge, which records definitions and example sentences of English words. We leverage the largest open-source dictionary, Wiktionary, as the lexical knowledge resource (e.g., < "apple", definition, "A common, round fruit ..." >). Specifically, we use the Wiktionary dump dated April 30, 2022, which contains 1.3M word definitions and 470K example sentences for 1M words/phrases.

• Commonsense: We include commonsense knowledge from ConceptNet (Liu & Singh, 2004), which covers broad knowledge about daily life. In ConceptNet, all knowledge pieces are represented as triplets with human-defined relations (e.g., < "bird", CapableOf, "fly" >). We follow previous work (Zhang et al., 2020) in including the core 600K high-quality triplets.

• Entity: We cover named-entity knowledge from Wikipedia and Wikidata (Vrandečić & Krötzsch, 2014). For each entity (e.g., United States), we collect its Wikidata properties (e.g., < "United States", capital, "Washington D.C." >) and its related Wikipedia sentences (e.g., < "United States", context, "It consists of 50 states ..." >). Here, related sentences refer to sentences from the entity's own article, or sentences of other articles that link to this entity.
• Event: We consider knowledge about daily events from human-constructed (ATOMIC (Hwang et al., 2021) and GLUCOSE (Mostafazadeh et al., 2020)) and auto-extracted event knowledge graphs (ASER (Zhang et al., 2022a)). Similar to commonsense knowledge, all event knowledge graphs store knowledge in triplet format, where the relations are human-defined or discourse relations and the subject and object are events (e.g., < "I am hungry", before, "I eat food" >).

• Script: We also include the script knowledge from Sun et al. (2022), which implicitly represents complex relations by situating argument pairs in a context (mostly natural conversations). Specifically, we use 325K triplets of the form < verbal information, context, nonverbal information >, where the verbal information is an utterance; the nonverbal information can be body movements, vocal tones, facial expressions, etc.; and the context is the entire text of the scene from which the verbal-nonverbal pair is extracted.

• Causality: The last external knowledge resource we include is the auto-extracted causal knowledge base CausalBank (Li et al., 2020), which collects large-scale English sentences expressing cause-effect relations. It consists of 133M because-mode sentences (i.e., sentences captured by 12 patterns such as "because" and "caused by") and 181M therefore-mode sentences (i.e., sentences captured by 19 patterns such as "therefore" and "result in"). We also convert each sentence into triplet form (e.g., < "babies cry", therefore-mode, "will lead to sleep problems" >).
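Before triplets like the ones above can be retrieved or appended to an input, they must be rendered as natural-language text. A minimal verbalizer might look as follows; the relation templates are our own illustrative guesses, not the paper's.

```python
# A minimal triplet verbalizer for <subject, relation, object> knowledge
# pieces. The templates below are illustrative assumptions; unknown
# relations fall back to plain concatenation.

TEMPLATES = {
    "definition":     "{s} is defined as: {o}",
    "CapableOf":      "{s} is capable of {o}",
    "capital":        "The capital of {s} is {o}",
    "before":         "{s} happens before {o}",
    "therefore-mode": "{s}; therefore, {o}",
}

def verbalize(s: str, r: str, o: str) -> str:
    """Render a knowledge triplet as a natural-language sentence."""
    template = TEMPLATES.get(r, "{s} {r} {o}")
    return template.format(s=s, r=r, o=o)
```

For example, `verbalize("bird", "CapableOf", "fly")` yields "bird is capable of fly", which can then be encoded or concatenated like any other text.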

Figure 2: The KiC model can be equivalently formulated as a mixture-of-experts (MoE) architecture. The knowledge selector acts as a router that determines the sequence-to-expert assignment in MoE. Each expert is made up of the (shared) text-to-text model and the external memory of a particular knowledge category; each expert is therefore in itself a stand-alone semi-parametric language model specialized in a certain type of knowledge. To allow the option of not using any knowledge, we also include a "generalist" module, which is the (shared) text-to-text model alone.

Note that although the effectiveness of certain knowledge types, such as entity and dictionary knowledge, has been demonstrated on a wide range of tasks (e.g., Zhang et al. (2019b)), other types, such as commonsense and script knowledge, have only been used on carefully selected tasks that tend to require them (Ye et al., 2019; Qiu et al., 2019). In this paper, we evaluate all the aforementioned knowledge types on broader sets of downstream tasks to better understand their contributions. Some examples of retrieved knowledge can be found in Appendix D, which illustrate their usefulness for different tasks.

Retriever  To retrieve knowledge from the knowledge memory effectively, we follow previous work (Borgeaud et al., 2022) and use dense retrieval techniques. Specifically, for each knowledge resource, we design one or more knowledge-specific strategies to generate key-value pairs from the original knowledge pieces (see Table 8 in the Appendix for details). We then encode all keys into dense vectors using the SOTA sentence encoder MPNet (Song et al., 2020). During retrieval, given a query, we encode it with the same sentence encoder and retrieve the most relevant knowledge using maximum inner product search (MIPS), which reduces the search complexity from O(n) to O(log n).
In KiC, we employ ScaNN (Guo et al., 2020) as the MIPS search algorithm.
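The retrieval step can be sketched as follows. The paper encodes keys with MPNet and searches with ScaNN; this sketch substitutes deterministic toy embeddings and exact inner-product search, so only the interface is meaningful, not the embedding quality.

```python
import numpy as np

# Dense-retrieval sketch. encode() is a stand-in for a sentence encoder
# (MPNet in the paper): it produces deterministic unit-norm pseudo-
# embeddings from a hash, so identical texts map to identical vectors.

def encode(texts, dim=8):
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.normal(size=dim)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

def mips_search(query_vec, key_matrix, top_k=1):
    """Exact maximum inner product search over encoded keys.

    A library like ScaNN would replace this with an approximate index
    for sub-linear search over millions of keys."""
    scores = key_matrix @ query_vec
    return np.argsort(-scores)[:top_k]
```

With unit-norm embeddings, inner product equals cosine similarity, so a query identical to a stored key is guaranteed to rank first.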

2.3. KIC: A MIXTURE OF SEMI-PARAMETRIC EXPERTS

As we will show in our comprehensive analysis (Table 1), for a particular task some knowledge categories help performance while others might hurt. It is therefore critical to dynamically select the correct knowledge type to facilitate solving the problem. In our work, instead of task-dependent knowledge selection, we consider a more fine-grained instance-dependent strategy: we adaptively choose the knowledge based on each input instance. We now explain how KiC learns to make such instance-dependent knowledge selections. Note that the discrete decision made by the knowledge selector enters the overall neural architecture in the form of a discrete latent variable. There are several alternative methods (such as reinforcement learning (Sutton & Barto, 2018)) for learning models with discrete latent variables; in this paper, we develop a simple yet effective approach for learning KiC in a fully-differentiable end-to-end manner. The key idea is based on the observation that KiC can be reformulated as a special one-layer mixture-of-experts architecture, as shown in Figure 2. The knowledge selector can be identified as the router that determines the sequence-to-expert assignment in MoE. This is slightly different from the settings of recent MoE works (Shazeer et al., 2017; Fedus et al., 2022), whose routers perform token-to-expert assignments. Meanwhile, each expert is made up of the text-to-text module together with a particular category of knowledge memory. Interestingly, each expert is in itself a stand-alone semi-parametric language model, which retrieves a particular kind of knowledge from its own memory to augment its inputs. In other words, each expert can be understood as a specialist with expertise in a specific knowledge category.
In addition, we include a special expert named the generalist, which handles situations where no knowledge from our memory is needed. Furthermore, by the original KiC design, the text-to-text modules in all the experts (and the generalist) share the same model parameters; the only difference lies in the non-parametric parts (i.e., the knowledge memories). Inspired by the above KiC-MoE equivalence, we now develop a fully-differentiable learning strategy for KiC by leveraging existing MoE learning approaches (Fedus et al., 2022). More formally, the knowledge selector S(x) is modeled as a (K + 1)-class linear classifier, which outputs a (K + 1)-dimensional normalized probability vector. We apply the encoder of our T5 backbone to the input text sequence of a particular task, which generates a sequence of hidden representation vectors. We then apply mean-pooling to obtain a fixed-dimension vector, which is fed into the (K + 1)-way linear classifier to produce the probabilities of selecting different knowledge categories. The k-th element, denoted S_k(x), represents the probability of choosing the k-th knowledge category for k = 0, 1, ..., K, where k = 0 represents the choice of the generalist (i.e., no external knowledge). Let T(·) denote the text-to-text transformer and c_k the knowledge retrieved from the k-th category. KiC then selects the top-1 knowledge category according to S(x) and computes the output as:

k̂ = argmax_k S_k(x)    (1)
ŷ = T(x ⊕ c_k̂) · S_k̂(x)    (2)

where ⊕ denotes concatenation of the input x and the retrieved knowledge c_k̂ (both in natural language form). Observe that KiC first selects the knowledge category k̂ that has the highest probability, and then retrieves the most relevant knowledge c_k̂ from that category to complement the input x.
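The selector described by Equations (1)-(2) can be sketched as follows, with placeholder classifier weights; `select_and_scale` and its signature are our own illustrative names, not the authors' code.

```python
import numpy as np

# Sketch of the instance-adaptive knowledge selector: mean-pool the
# encoder states, apply a (K+1)-way linear classifier, pick the top-1
# expert, and scale that expert's output logits by its probability.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def select_and_scale(hidden, W, b, expert_logits):
    """hidden: (seq_len, d) encoder states; W: (d, K+1); b: (K+1,).
    expert_logits: callable k -> output logits of expert k."""
    pooled = hidden.mean(axis=0)            # fixed-dimension vector
    probs = softmax(pooled @ W + b)         # S(x): routing probabilities
    k = int(np.argmax(probs))               # Eq. (1): top-1 knowledge category
    return probs[k] * expert_logits(k), k   # Eq. (2): probability-scaled logits
```

Multiplying the logits by the selected probability is what lets gradients flow back into the router, since the argmax itself is not differentiable.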
The knowledge-augmented input is fed into the text-to-text model to generate the logits for the output tokens. Similar to SwitchTransformer (Fedus et al., 2022), we multiply the output logits from T(·) by the probability S_k̂(x) from the selector to compute the final logits for the output tokens. This simple yet effective strategy enables differentiable learning in MoE and was successfully used in both Shazeer et al. (2017) and Fedus et al. (2022); our experiments in Section 3 demonstrate its effectiveness in KiC learning as well. (Note that we currently only consider top-1 knowledge selection (routing) for simplicity and leave the generalization to top-n selection as future work.) Finally, similar to MoE, we add an auxiliary load-balancing loss to the standard cross-entropy loss during KiC learning:

L(x, y) = Σ_{t=1}^{T} CrossEntropy(ŷ_t, y_t) + α · Balancing(S(x))    (3)

where y denotes the target sequence, the subscript t indexes the t-th output token, and α is a positive hyper-parameter that controls the tradeoff between the two losses. We find that, without a load-balancing term, the knowledge selector tends to select only one knowledge category throughout the entire training process, which has also been observed in MoE learning. There are different choices for the balancing loss, such as those used in Shazeer et al. (2017) and Fedus et al. (2022), which encourage the diversity of knowledge selection in different ways based on S(x). Without loss of generality, we use the same load-balancing loss as SwitchTransformer (Fedus et al., 2022) (see Equation 4). The above KiC-MoE equivalence may also lead to interesting observations that could benefit the study of both semi-parametric language models and MoEs. For example, in MoE works, the experts are generally designed as different parametric neural modules (e.g., different MLPs (Fedus et al., 2022; Shazeer et al., 2017)).
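The balancing term can be sketched as follows, using the standard SwitchTransformer formulation from Fedus et al. (2022): with K + 1 experts, f_k is the fraction of batch instances routed to expert k and P_k is the mean routing probability assigned to expert k.

```python
import numpy as np

# SwitchTransformer-style load-balancing loss, applied here at the
# sequence level (one routing decision per instance, as in KiC).

def load_balancing_loss(router_probs):
    """router_probs: (batch, K+1) array of selector probabilities S(x)."""
    n_experts = router_probs.shape[1]
    assignments = router_probs.argmax(axis=1)               # top-1 routing
    f = np.bincount(assignments, minlength=n_experts) / len(router_probs)
    P = router_probs.mean(axis=0)
    # Scaled so a perfectly uniform routing gives a value of 1.0;
    # collapsed routing (one dominant expert) gives a larger value.
    return n_experts * float(f @ P)
```

Adding α times this quantity to the cross-entropy loss penalizes the collapse described above, where the selector routes every instance to a single knowledge category.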
However, our work shows that this may not be the only option: we can construct different experts by using the same parametric module but with different inputs. By bridging these two active areas, we hope there could be more fruitful future outcomes.

3.1. ANALYSIS OF KNOWLEDGE USEFULNESS

To verify our assumption that external knowledge resources can facilitate LMs in general language understanding, and to see the effects of different types of knowledge, we conduct single-task fine-tuning experiments on a wide range of downstream tasks (Table 1). We evaluate 35 tasks in total and classify them into 10 categories following the P3 task categorization framework (Sanh et al., 2022). For each knowledge type (each column), we append retrieved knowledge pieces to the input sentence and truncate the entire sequence whenever it exceeds the sequence limit. The augmented input sentences are then fed into the standard text-to-text model (T5) to generate the target answer for optimization, using the training instances of each single task. We can see that performance on 30 out of 35 tasks improves after adding at least one type of knowledge, which demonstrates the effectiveness of high-quality external knowledge. Based on these results, we exploit KiC to dynamically identify the most useful knowledge pieces for each instance.

Following previous papers, we report the median accuracy (%) and the standard deviation over all prompts used. Note that T0 Base and T0 Large are reproduced using the same collection of tasks and hyper-parameters as the KiC models. The baseline models are: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-Neo (Black et al., 2021), GPT-J (Wang & Komatsuzaki, 2021), GPT-NeoX (Black et al., 2022), and OPT (Zhang et al., 2022c). We use the standard autoregressive (log) probabilities to score candidate choices and select the best one as the prediction for all baseline models, including masked LMs such as BERT and RoBERTa. Our main model, KiC, is initialized with T5 LM-adapt, an improved version of T5 that continues training for an additional 100K steps on the LM objective (Lester et al., 2021) to enhance its ability to generate natural language.
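The append-and-truncate augmentation used in this analysis can be sketched as follows; whitespace tokens stand in for the model's actual subword tokenizer, and truncating from the right (so the original instance survives whenever it fits) is our assumption.

```python
# Sketch of the knowledge-augmentation step in the single-task analysis:
# append retrieved pieces to the input, then truncate the combined
# sequence at the model's length limit.

def augment_and_truncate(instance: str, knowledge_pieces, max_tokens: int) -> str:
    tokens = instance.split()
    for piece in knowledge_pieces:
        tokens.extend(piece.split())
    # Truncate from the right so the original instance is kept intact
    # whenever it fits on its own.
    return " ".join(tokens[:max_tokens])
```

In practice the limit would be expressed in subword tokens of the T5 tokenizer rather than whitespace words.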
Similar to T0, we train our KiC model on a mixture of multiple tasks (39 tasks in total) by combining and shuffling all training instances from the different tasks (8.4M in total), and we evaluate on unseen (held-out) tasks to measure zero-shot generalization. Our final KiC Large model is trained on 128 V100 GPUs for 42 hours. More training details are in Appendix A.2.

Zero-shot generalization

We evaluate our KiC model on two groups of zero-shot datasets. 1) The held-out tasks of P3 contain two coreference tasks, three NLI tasks, three sentence-completion tasks, and one word sense disambiguation (WSD) task. On these tasks, KiC Large outperforms all zero-shot baseline models (e.g., GPT-NeoX, OPT) that are 25-38x larger. Moreover, with its adaptive knowledge selector and only 0.77B parameters, KiC Large beats the 3B-parameter T0 3B on all 9 tasks by a large margin. 2) The Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020) is designed to measure knowledge acquired during model pretraining. MMLU covers 57 subjects under four categories: STEM, Humanities, Social Sciences, and Other. Comparisons with SOTA LMs are shown in Table 3. KiC Large beats the fine-tuned baseline models RoBERTa Large and GPT-2 without using any training data from MMLU. Surprisingly, KiC Large achieves an average performance of 39.4% using only 0.77B parameters, just 4.5% below the 5-shot performance of GPT-3, which has 175B parameters (227x larger). To investigate how the KiC knowledge selector leverages different knowledge resources when applied to unseen tasks, we plot the distributions of the selected knowledge categories in Figure 4; more discussion and analysis can be found in Appendix B. Finally, to examine the importance of different KiC components (e.g., the knowledge selector, the external knowledge sources), we conduct extensive ablation studies comparing our full KiC model with the following baselines: (i) KiC without knowledge, (ii) KiC with an external memory that contains only plain text (English Wikipedia), (iii) KiC without a knowledge selector but retrieving from a mixture of all knowledge categories, (iv) KiC with a task-adaptive selector, and (v) KiC without the generalist. The results are reported in Table 12 of Appendix B.

KiC in multi-task training

To see whether our KiC learning can help with multi-task training, we reproduce T0 Large with the same collection of tasks and evaluate KiC Large on the validation set of each in-domain task (Table 4). Here, in-domain tasks can be divided into two groups: tasks used in multi-task training, and tasks not used in multi-task training but within an observed task category. Again, KiC Large outperforms T0 Large, with significant improvement on in-domain unseen tasks (marked with *) such as Race and BoolQ, and on knowledge-intensive tasks such as CosmosQA and DREAM. This demonstrates the superiority of our proposed KiC learning in multi-task training.

Emergent behavior  Wei et al. (2022) observe that language models usually achieve only near-random zero/few-shot performance when they are small, but show a substantial performance jump once they reach a certain critical scale (size). A language model is generally considered superior if it shows emergent behavior at a smaller model scale. We therefore compare our KiC model with T5 and T0 on held-out tasks to see how performance changes with respect to model size. From Figure 3, we can see that T5 performs around random guessing below 11B parameters. T0 is better than T5, showing emergent behavior when scaling from 3B to 11B. Surprisingly, our KiC model shows emergent behavior when scaling from 0.22B to 0.77B, which demonstrates that our semi-parametric model can achieve the same language understanding capacity with far fewer parameters, with the help of the adaptive knowledge selector and external knowledge.

4. RELATED WORK

Knowledge injection into PLMs  Although PLMs capture linguistic, semantic, commonsense, and world knowledge to some extent, they only memorize knowledge vaguely in their parameters, causing poor performance on knowledge-intensive tasks. Recent studies make great efforts to inject knowledge such as lexical knowledge, entity knowledge graphs, or syntactic knowledge into LM pre-training (Yang et al., 2021), for example through knowledge-aware objectives used alongside masked language modeling (MLM).

Semi-parametric language models  Most existing works on semi-parametric language models (Khandelwal et al., 2019; Zhong et al., 2022; Grave et al., 2017; Merity et al., 2017; de Masson d'Autume et al., 2019; Guu et al., 2020; Fan et al., 2021; Lewis et al., 2020) mainly focus on improving language modeling capability (e.g., perplexity) or a particular category of downstream task (e.g., open-domain question answering). Some recent works (Izacard et al., 2022; Borgeaud et al., 2022; Petroni et al., 2021) seek to improve diverse downstream tasks with an external memory. All these works augment the parametric language model with memories of plain text. In contrast, we focus on developing semi-parametric language models with a knowledge-rich memory for improving performance on a wide range of downstream language tasks.

5. CONCLUSIONS AND FUTURE WORK

This work develops a novel semi-parametric language model architecture, Knowledge-in-Context (KiC), which empowers a parametric text-to-text language model with a knowledge-rich external memory containing six different types of knowledge. We also design an instance-adaptive knowledge selector that retrieves the most helpful pieces of knowledge for each input instance. As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks, and it exhibits emergent abilities at a much smaller model scale than fully-parametric models. Future work includes further exploiting unstructured plain text to pre-train KiC.

APPENDIX A EXPERIMENTAL DETAILS

A.1 KNOWLEDGE PIECES

In this section, we give basic statistics of the different knowledge categories used in KiC (Table 5), along with examples of the knowledge pieces of each category (Table 6). The knowledge pieces are in the form of < subject, relation, object > triplets; they are further encoded into key-value pairs according to the strategies in Appendix A.2. For the script knowledge (Sun et al., 2022), the verbal information is an utterance; the nonverbal information can be body movements, vocal tones, facial expressions, etc.; and the context is the entire text of the scene from which the verbal-nonverbal pair is extracted. The verbal and nonverbal messages are conveyed within a short time period (usually mentioned in the same turn or adjacent turns). Note that the script knowledge can be viewed as a special kind of commonsense knowledge, where the relations are characterized by free text.

A.2 IMPLEMENTATION DETAILS

Key-value pair construction  Our knowledge memory consists of a large set of key-value pairs, which are constructed as follows. First, we build an initial set of key-value pairs (in textual form) from the original knowledge pieces (i.e., knowledge triplets) according to Table 8. Then, we encode the keys into dense vectors using MPNet. The encoded keys, together with their corresponding values (in textual form), are stored as the final key-value pairs in our knowledge memory. The encoded key vectors are used for knowledge-piece retrieval during MIPS search.

Example from Table 7 (structured vs. plain-text knowledge for the same question):
Question: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
Answer: Drought
CausalBank (structured): Persistent high pressure has a stabilizing effect on the weather, causing subsiding air that dries out the atmosphere.
Wikipedia (plain text): High-pressure systems are alternatively referred to as anticyclones. On English-language weather maps, high-pressure centers are identified by the letter H in English, within the isobar with the highest pressure value. On constant-pressure upper-level charts, it is located within the highest height line contour.

[Table 8 (layout lost in extraction): knowledge-specific strategies to construct key-value pairs from knowledge triplets < subject (s), relation (r), object (o) >, where ⊕ denotes concatenation; keys are field combinations such as s or s ⊕ o, and values are combinations such as s ⊕ r ⊕ o, per knowledge type. The keys are further encoded into vector form using MPNet, which are used for knowledge retrieval during MIPS search.]

Retriever  We use All-MPNet-base-v2 as the encoder for the keys in the knowledge memory as well as for the input query instance. The model is trained on one billion sentence pairs with a contrastive learning objective, and we use the publicly available model checkpoint. For most knowledge categories, we directly apply MIPS search to the encoded query and key vectors during retrieval. For the dictionary knowledge and the entity knowledge, we first pre-filter the knowledge pieces according to the following strategies before applying MIPS search.
• When retrieving from dictionary knowledge, we first use a domain-independent keyword extraction algorithm (Rose et al., 2010) to extract important words from the query. Then, we filter the knowledge pieces so that only the ones related to the important words are retained for MIPS search.
• When retrieving from entity knowledge, we follow previous work (Pan et al., 2019) to first extract concept mentions from the query and then link each mention to its corresponding page in Wikipedia. All the knowledge pieces that are not related to the linked concepts are excluded from MIPS search.
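The key-value construction step can be sketched as follows; the particular field combinations (e.g., key s with value s ⊕ r ⊕ o) come from Table 8, while the space-joined rendering and the function name are our own assumptions.

```python
# Build a textual (key, value) pair from a knowledge triplet, given the
# field-combination strategy for its knowledge type (per Table 8).

def build_pair(triplet, key_fields, value_fields):
    """triplet: dict with 's', 'r', 'o'; *_fields: e.g. ('s', 'r', 'o')."""
    key = " ".join(triplet[f] for f in key_fields)
    value = " ".join(triplet[f] for f in value_fields)
    return key, value
```

For a commonsense triplet, indexing by the subject alone while storing the full verbalized triplet as the value would look like: `build_pair({"s": "bird", "r": "CapableOf", "o": "fly"}, ("s",), ("s", "r", "o"))`.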
The above pre-filtering strategies are common practice when using these types of knowledge, and they allow us to locate relevant knowledge pieces more accurately. In addition, they reduce the MIPS search complexity by focusing only on the most relevant candidates.

Load Balancing Loss  To encourage the diversity of knowledge selection, we adopt the load balancing loss from Switch Transformer (Fedus et al., 2022), computed over the K + 1 experts and a batch B of sequences; its exact formulation is given later in this appendix.

The results of the ablation study are reported in Table 12. First of all, it is important to leverage the knowledge-rich memory: when removing the knowledge memory or replacing it with a plain-text memory consisting of English Wikipedia, the performance degrades greatly. Second, it is also important to use a knowledge selector to first pick a particular category of knowledge and then retrieve the relevant knowledge pieces from it; when we mix all the knowledge categories together with a single retriever, there is a significant performance drop. The main reason is that different knowledge categories generally require specific pre-filtering strategies during retrieval (see Appendix A.2). Furthermore, we find that the instance-adaptive knowledge selector in our KiC model is crucial for achieving good performance: when we replace it with a task-adaptive selector, which picks a fixed knowledge category for all instances of the same task based on the task description, performance is noticeably worse. Finally, comparing KiC without generalist to the original KiC, we observe a noticeable performance drop, which confirms the importance of allowing the model to ignore all external knowledge for some instances.

Table 12: Ablation study of the KiC Large model (columns: Dataset, Task, KiC Large, w/o knowledge, w/ plain texts, w/o selector, w/ task-adaptive, w/o generalist).
We consider the following five ablation models: (i) KiC without knowledge (i.e., T0), (ii) KiC with an external memory that contains only plain text (English Wikipedia), (iii) KiC without a knowledge selector, retrieving from a mixture of all knowledge categories, (iv) KiC with a task-adaptive selector, and (v) KiC without generalist. We report the mean, median, and standard deviation over different templates for the P3 tasks. For MMLU, we report results on the test set, consistent with other works in the literature.

Which categories of knowledge are useful for an unseen task?

To understand which knowledge categories are retrieved to help a particular task, we report the distribution of the knowledge selected by KiC Large for each task in Figure 4. The results show that most of the knowledge categories are useful for different tasks, and that the knowledge selector is able to pick the most helpful knowledge type for the task at hand. For example, in the Word-in-Context (WiC) task, the model mostly retrieves from the dictionary knowledge to help it disambiguate different word senses. In the StoryCloze task, it relies more heavily on commonsense knowledge to complete the story ending. For the MMLU tasks, which cover a large variety of subjects (57 in total), it is not surprising that more diverse categories of knowledge are needed. In addition, the results further show that the generalist in KiC is very important, as the model frequently chooses it across different tasks; this demonstrates the necessity of allowing the model to ignore all knowledge categories for some instances. Finally, we would like to highlight that we never use any direct supervision to train the knowledge selector; instead, it learns to make such decisions from the distant supervision of predicting the correct answer. This is valuable because learning to identify the most helpful knowledge for solving a particular task is an important step toward general intelligence. More importantly, the results also confirm the effectiveness of our learning strategy based on the KiC-MoE equivalence.
[Figure 4 panels: bar charts of the selected-knowledge distribution (GEN, ENT, DIC, COM, EVT, SCR, CAU) for WSC, Winogrande XL, ANLI R1, ANLI R2, ANLI R3, CB, RTE, COPA, Hellaswag, StoryCloze, Word-in-Context, and MMLU (STEM, Humanities, Social Sciences, Other, All).]
Figure 4: The distribution of the selected knowledge categories for each task. We examine the following categories of knowledge: entity (ENT), dictionary (DIC), commonsense (COM), event (EVT), script (SCR), and causality (CAU). In addition, the generalist (GEN) means that we do not choose any external knowledge and make predictions based solely on the input query.

Full results for zero-shot performance  In Table 13, we provide the full zero-shot results on the holdout unseen tasks, reporting both the mean and the median. We report both metrics to be consistent with the T0 paper (Sanh et al., 2022), which reports both; in the main paper, we keep only the median results for brevity.

Table 13: Full zero-shot evaluation results on holdout unseen tasks. We report mean/median accuracy (%) over all prompts for each task.

D CASE STUDY OF RETRIEVED KNOWLEDGE

We show examples of retrieved knowledge in Table 15. Different types of knowledge play critical roles in different tasks. For instance, in the Hellaswag task, the model can predict that a person will mow the lawn because it finds the commonsense knowledge that a "lawn mower" is used for cutting grass. Similarly, in the WiC task, the model recognizes that the two uses of "pocket" differ with the help of detailed explanations of the word's different synsets. Last but not least, in the Winogrande task, the model can successfully infer that the burglary is more likely to be investigated because it finds the event knowledge that a burglary is often investigated.

E PROMPT TEMPLATES FOR KNOWLEDGE-IN-CONTEXT

We provide the prompt templates for training and evaluating our KiC system. Note that we use the same naming convention for the templates as the original P3 dataset (Sanh et al., 2022).

Table 17: All evaluation datasets and templates from P3 (Sanh et al., 2022) and MMLU (Hendrycks et al., 2020) used for KiC. Since the original MMLU tasks do not include templates, we use the templates of ai2_arc/ARC_Challenge in P3 for MMLU evaluation.



Footnotes:
• https://en.wiktionary.org/wiki/Wiktionary:Main_Page
• Following the literature in the commonsense community (Zhang et al., 2021; 2022b), we use the term "causality" to refer to commonsense causality, which is mostly contributory (Bunge, 2017).
• To further enhance retrieval quality and decrease the search space, we employ an additional filtering step for dictionary and entity knowledge pieces. See Appendix A.2 for more knowledge-retrieval details.
• It might be tempting to use Gumbel-Softmax to handle the discrete latent variable in KiC. However, in order to use the straight-through estimator during backpropagation, one has to compute the hidden states for all the experts, i.e., execute the text-to-text transformer (K + 1) times, which is prohibitive as K increases.
• https://huggingface.co/sentence-transformers/all-mpnet-base-v2
• https://pypi.org/project/rake-nltk/



Figure 3: Emergent behaviors of T5, T0, and KiC models. Our KiC model shows emergent behavior at a much smaller model scale (when it increases from 0.22B to 0.77B parameters) compared to T0.

and next sentence prediction (NSP); Lauscher et al. (2020) add synonym and hyponym-hypernym relation prediction between words, and Levine et al. (2020) add supersense prediction of masked words to the LM training objectives. To use entity knowledge, ERNIE 2.0 (Sun et al., 2020) introduces named entity masking to learn better embeddings for semantic units, Peters et al. (2019) include entity linking and hypernym linking in pre-training, and K-BERT (Liu et al., 2020) uses entity knowledge triples to construct knowledge-rich sentence trees. For syntax knowledge injection, Wang et al. (2021) integrate dependency relation prediction into LM training, and Bai et al. (2021) incorporate syntax-tree information through a syntax-aware self-attention mechanism.



Comparison to state-of-the-art results on the test set of the MMLU tasks. Following standard practice, we choose the prompt that yields the best accuracy (%) on the validation set. Additional models used for comparison: Gopher (Rae et al., 2021) and Atlas (Izacard et al., 2022).

Table 2 shows that our KiC Large model outperforms T0 Large.

In-domain evaluation results measured in accuracy (%) and standard deviation. T0 Large and KiC Large are trained using the same collection of tasks and hyper-parameters, while KiC Large additionally has the knowledge selector during multitask learning. * indicates that the training data provided by this task are not used in multitask training; we thus regard tasks marked with * as in-domain zero-shot evaluation, because KiC has observed similar tasks (such as other multiple-choice QA tasks) in multitask training. † indicates the score on the test set; otherwise, we report the score on the validation set.

The statistics of different knowledge categories ("K": thousand, "M": million). Storage is the space required to store the original data ("MB": megabyte, "GB": gigabyte). The type "human" means the knowledge is collected by crowd-sourcing, and "auto" means it is automatically extracted.

Examples of knowledge pieces in the format of <subject, relation, object> triplets. For script knowledge, <subject, relation, object> becomes <verbal information, context, nonverbal information> extracted from movie scripts.
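To make the triplet-to-text conversion concrete, here is a small sketch of flattening <subject, relation, object> triplets into textual (key, value) pairs in the spirit of Table 8, with ⊕ realized as space-joined concatenation. The per-type strategies below are a simplified, illustrative reading, not the paper's exact configuration.

```python
def kv_pairs(s, r, o, knowledge_type):
    """Flatten a <subject (s), relation (r), object (o)> triplet into textual
    (key, value) pairs; the per-type strategies are illustrative only."""
    cat = lambda *parts: " ".join(parts)   # "⊕" as space-joined concatenation
    full = cat(s, r, o)
    if knowledge_type == "entity":
        return [(s, o)]                    # entity name -> description
    if knowledge_type == "causality":
        # index causal pairs from both the cause side and the effect side
        return [(cat(s, o), cat(s, o)), (cat(o, s), cat(o, s))]
    # commonsense/event style: keys at several granularities, full triplet value
    return [(s, full), (cat(s, o), full), (full, full)]
```

For a commonsense triplet such as <lawn mower, used for, cutting grass>, this produces three keys of increasing specificity, all mapping to the full triplet text; the keys would then be encoded by MPNet for MIPS retrieval.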

Examples of retrieved supporting knowledge from different resources (i.e., CausalBank vs. Wikipedia). Note that the retrieved knowledge pieces from CausalBank are generally more helpful for solving the problem than the retrieved plain-text pieces from Wikipedia.

Task: Hellaswag
Input: A first person view is seen of a man riding a riding lawn mower. he... How does the description likely end? Ending 1: takes turns quickly, mowing the lawn. Ending 2: creates a large puddle of water and a high rush of water around him as he heads back and forth back and forth. Ending 3: moves all around while there is a crowd watching. Ending 4: talks about how to properly ride an object while another man climbs up on the back of him.

Task: WiC
Input: Sentence 1: Lydia put the change in her left pocket. Sentence 2: Lydia pocketed the change. Determine whether the word "pocket" is used in the same sense in both sentences. Yes or no?
Output: no
Knowledge Type: Lexicon
Knowledge Piece: pocket: A bag stitched to an item of clothing, used for carrying small items. Such a receptacle seen as housing someone's money; hence, financial resources.

Examples of the improved instances and the corresponding selected knowledge.

All used training datasets and templates from P3 (Sanh et al., 2022) for KiC.


Given K + 1 experts and a batch B of sequences, the load balancing loss is computed according to

    L_balance = (K + 1) · Σ_{i=1}^{K+1} f_i · P_i,

where f_i is the fraction of sequences that are actually dispatched to expert i, and P_i is the fraction of the selector probability allocated to expert i, defined as

    f_i = (1/|B|) Σ_{x∈B} 1(argmax_j S_j(x) = i),
    P_i = (1/|B|) Σ_{x∈B} S_i(x).

The notation 1(·) denotes an indicator function that takes the value one when its argument is true and zero otherwise. Note that S_i(x) is the probability of assigning a particular sequence x to expert i, while P_i is the total probability fraction assigned to expert i from all the sequences in the batch B. Fedus et al. (2022) point out that the above load balancing loss encourages uniform routing, since it is minimized under a uniform distribution.

Hyper-parameters  The hyper-parameters for training KiC Base and KiC Large are listed in Table 9. In addition, the hyper-parameters of single-task finetuning are listed in Table 10. Note that we set a maximum number of retrieved knowledge pieces to concatenate; if a knowledge-augmented input sequence exceeds the maximum input length, it is truncated.
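The load balancing computation above can be sketched in a few lines of NumPy (function and variable names are ours; `S` holds the selector probabilities S_i(x), one row per sequence in the batch):

```python
import numpy as np

def load_balancing_loss(S):
    """Switch-Transformer-style load balancing loss.
    S: (B, E) array of selector probabilities, one row per sequence,
       with E = K + 1 experts. Returns E * sum_i f_i * P_i, where f_i is
       the fraction of sequences hard-dispatched to expert i and P_i is
       the mean selector probability allocated to expert i."""
    B, E = S.shape
    dispatched = np.argmax(S, axis=1)              # sequence-to-expert assignment
    f = np.bincount(dispatched, minlength=E) / B   # f_i: dispatch fractions
    P = S.mean(axis=0)                             # P_i: mean selector probability
    return E * float(np.sum(f * P))
```

When routing is perfectly balanced (f_i = P_i = 1/E), the loss attains its minimum value of 1, whereas collapsing all sequences onto one expert drives it up to E, which is why minimizing it encourages uniform routing.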

B ADDITIONAL EXPERIMENTAL RESULTS

In this section, we provide additional experimental and visualization results.

Ablation studies of KiC  We now further examine the contribution of different components of the KiC model by performing extensive ablation studies. Specifically, we implement the following ablation models: (i) KiC without knowledge, (ii) KiC with an external memory that contains only plain text (English Wikipedia), (iii) KiC without a knowledge selector, retrieving from a mixture of all knowledge categories, (iv) KiC with a task-adaptive selector, and (v) KiC without generalist.

C DESCRIPTIONS OF 35 EVALUATION TASKS IN TABLE 1

We show the description of all evaluation tasks in Table 14. We categorize these tasks in the same way as the T0 paper (Sanh et al., 2022), with a brief explanation for each category of tasks. For more detailed information, please refer to the original papers listed in Table 14.
• Natural Language Inference: the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise."
• Paraphrase: MRPC (Dolan & Brockett, 2005), QQP (Wang et al., 2018), PAWS (Zhang et al., 2019a). Paraphrase identification (PI) is concerned with the ability to identify alternative linguistic expressions of the same meaning at different textual levels.
• Closed QA: ARC (Easy and Challenge) (Clark et al., 2018), WikiQA (Yang et al., 2015). In closed QA, each question is associated with a document, and the models are required to answer the question with the document.
• Extractive QA: ReCoRD (Zhang et al., 2018). Extractive QA aims to extract a text span from the passage to answer the question.
• Multiple-Choice QA: CoS-E v1.11 (Rajani et al., 2019), CosmosQA (Huang et al., 2019), DREAM (Sun et al., 2019), OpenBookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), QASC (Khot et al., 2020), QuAIL (Rogers et al., 2020), QuaRTz (Tafjord et al., 2019), RACE (Middle and High) (Lai et al., 2017), SciQ (Welbl et al., 2017), SocialIQA (Sap et al., 2019), BoolQ (Clark et al., 2019), MultiRC (Khashabi et al., 2018), WikiHop (Welbl et al., 2018), WIQA (Tandon et al., 2019). In multiple-choice QA, each question is associated with several candidate answers, and the models are required to select the correct one(s).
• Sentiment Analysis: IMDB (Maas et al., 2011) and Rotten Tomatoes (Pang & Lee, 2005). Sentiment analysis aims to predict the sentiment attitude of a text span (mostly sentences or reviews).
• Sentence Completion: HellaSwag (Zellers et al., 2019), COPA (Roemmele et al., 2011), Story Cloze (Mostafazadeh et al., 2016). Decide which sentence is the most plausible ending of the given sentence(s).
• Topic Classification: AG News (Del Corso et al., 2005) and DBpedia14 (Lehmann et al., 2015). Classify a given sentence into one of the predefined topic categories.
• Word Sense Disambiguation: WiC (Pilehvar et al., 2019). The WSD task provides two sentences containing the same lemma word and asks whether the two target words have the same meaning.

