COMPOSITIONAL TASK REPRESENTATIONS FOR LARGE LANGUAGE MODELS

Abstract

Large language models have shown remarkable cross-task generalization ability. Most prior work assumes that prompts effectively extract knowledge from language models to facilitate generalization to new tasks, a perspective that has led to numerous studies on improving prompts. In contrast, we introduce a new perspective, compositional generalization, which views each task as a composition of latent codes and generalizes to test tasks through new compositions of seen codes. To this end, we propose a novel prompt-free approach, Compositional Task Representations (CTR), that employs multi-task training to learn a discrete, compositional codebook. Empirically, CTR substantially outperforms prompt-based methods on average in zero-label learning. Our analysis further shows that some of the learned CTR codes are interpretable to humans and exhibit a certain degree of controllability.

1. INTRODUCTION

Large language models (LLMs) have shown remarkable performance in cross-task generalization. Without using any labeled data for the target task, GPT-3 (Brown et al., 2020) obtains reasonable performance on a wide range of tasks. Later extensions such as FLAN (Wei et al., 2022) and T0 (Sanh et al., 2022) continue training the LLMs on a large number of supervised tasks, which further improves cross-task generalization performance. The aforementioned studies share an important assumption: natural language prompts extract knowledge from LLMs to facilitate generalization to new tasks. In this direction, numerous studies have focused on different aspects of improving prompt-based learning, such as designing better prompts (Xu et al., 2022), increasing the number of prompts (Wang et al., 2022; Aribandi et al., 2022), and improving the training efficiency of prompts (Lester et al., 2021).

In contrast, we explore an alternative perspective for cross-task generalization: compositional generalization. Specifically, we explore whether it is possible to represent tasks using discrete compositions of latent codes. This perspective enjoys several potential benefits. First, since the latent codes have been trained on seen tasks, we expect the LLMs to exhibit strong cross-task generalization abilities because new tasks can also be represented as compositions of these trained codes. Second, it provides a way to analyze and understand cross-task generalization by investigating the association between tasks and the learned representations. Third, it has the potential to be more controllable than prompts for task generalization due to its built-in compositionality.

Motivated by these potentials, we propose a new method, Compositional Task Representations (CTR), that employs multi-task training to learn a discrete, compositional codebook.
Specifically, given a large number of training tasks, we use an encoder to map each randomly-initialized task embedding to a fixed-length sequence of query vectors. Each query vector is used to retrieve a code from a codebook, which is formulated as an embedding lookup table. This produces a sequence of codes, a compositional representation of the current task. These compositional codes are fed as the input to an LLM in place of prompts to make predictions. At test time, given a new task, we use unlabeled data to search for a high-performing composition of codes, which enables zero-label cross-task generalization. CTR is also applicable to the few-shot setting, where the few labeled examples are used for code search. Empirically, we demonstrate improved performance under both the zero-label and few-shot settings, outperforming strong baselines including prompt tuning, model tuning, and genetic prompt search (Xu et al., 2022). Importantly, we analyze the learned task representations and show that they demonstrate a certain degree of interpretability and controllability. For example, as shown in Figure 1, CTR learns to generalize to a new task by a new composition of existing codes.

[Figure 1: An illustration of how CTR generalizes to zero-label tasks. In this real example produced by our model, CTR combines the abilities of reasoning-based QA, sentence generation, and multi-choice selection from training tasks to perform a new task, COPA.]

2. RELATED WORK

Language Model Prompting. Brown et al. (2020) showed that GPT-3 performs well in the few-shot setting when properly handcrafted prompts are provided. Other works (Shoeybi et al., 2019; Rae et al., 2021; Schick & Schütze, 2021) also report promising zero-shot or few-shot performance of LLMs. Wei et al. (2022) and Sanh et al. (2022) collect a set of labeled datasets and use manual templates to transform them into a sequence-to-sequence format; this formulation makes it possible to continue training LLMs on labeled datasets and improves cross-task generalization. Wang et al. (2022) and Mishra et al. (2022) introduced a benchmark of over 1,600 tasks and their expert-written instructions. Gao et al. (2021) and Shin et al. (2020) studied automating the search for discrete prompts, while Li & Liang (2021) and Liu et al. (2021) propose continuous soft prompts with gradient-based optimization. Compared to these approaches, we study a different direction that learns compositional task representations, which benefits cross-task generalization.

Compositional Architecture for LLMs. Previous work has explored designing compositional architectures. Sparsely Gated Mixture of Experts (MoE) (Lepikhin et al., 2021) activates a subset of a network given the input data. Artetxe et al. (2021) trained an MoE model with 1.1T parameters, which is shown to outperform a dense model with similar computational cost. SkillNet-NLU (Tang et al., 2022) and SkillNet-NLG (Liao et al., 2022b) employ a similar sparsely activated mechanism to handle different NLU or NLG tasks. Different from these approaches, our approach focuses on learning compositional task representations using a discrete codebook.

3. COMPOSITIONAL TASK REPRESENTATIONS

The motivation of CTR is to explore the cross-task generalization ability of LLMs from a brand-new perspective, compositional generalization, and to further improve cross-task generalization performance. Specifically, our main hypothesis is that, by being trained on a variety of natural language tasks, LLMs can learn to represent each task as a composition of discrete latent codes, where each latent code is associated with certain aspects of a task. As a result, CTR potentially enjoys better cross-task generalization, since it can represent new tasks by forming new compositions. This section introduces the overall architecture of CTR and how it is trained to overcome optimization challenges. As Figure 2 shows, CTR consists of the CTR learning module and an LLM, where the CTR learning module contains an encoder, a decoder, task embeddings, and a codebook.

3.1. DISCRETE LATENT TASK CODEBOOK

We define a latent task codebook embedding space C ∈ R^(S×D), where S is the size of the codebook (i.e., each task code can take one of S categorical values) and D is the dimension of each latent code embedding C_i ∈ R^D, i ∈ {1, 2, ..., S}. This is analogous to the idea of VQ-VAE (van den Oord et al., 2017), which also employs a discrete latent codebook.

As Figure 2 shows, given a training task as input (say with task id k), CTR first obtains its task embedding E_k ∈ R^D by retrieving it from a randomly-initialized task embedding lookup table in R^(N×D), where N is the number of training tasks. The task embedding E_k is then passed to the encoder module and mapped to a fixed-length sequence of query vectors Q ∈ R^(L×D), where L is the length of the sequence. Each query vector Q_l, l ∈ {1, 2, ..., L}, is used to retrieve a task code embedding from the codebook. Specifically, it computes the l2 distance to each latent code embedding C_i and selects the nearest neighbor:

    CTR_l = C_{z_l}, where z_l = argmin_i ||Q_l - C_i||_2    (1)

In this way, the query vectors together produce an L-length sequence of code embeddings, denoted the compositional task representation (CTR). It is then passed through a decoder and used as the input to an LLM in place of prompts to make predictions. Let z be the vector of L latent codes z_1, z_2, ..., z_L. We consider each latent code as describing an attribute or a necessary skill of a task, such as the task type or output space. Intuitively, multiple codes are needed to fully describe a task, and each task is formulated as a composition of codes.
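As a concrete sketch of this lookup, the following NumPy snippet implements Eq. (1) end to end for one task: a task embedding is mapped to L query vectors by a linear encoder, and each query is snapped to its nearest codebook entry. All names and dimensions here are illustrative stand-ins (we shrink D for brevity), not the released implementation.

```python
import numpy as np

# Illustrative sizes: N tasks, code length L, codebook size S, dimension D
N, L, S, D = 319, 10, 128, 64  # the paper uses D = 1024

rng = np.random.default_rng(0)
task_embeddings = rng.normal(size=(N, D))       # randomly-initialized task embedding table
encoder_W = rng.normal(size=(D, L * D)) * 0.01  # encoder sketched as a single linear map
codebook = rng.normal(size=(S, D))              # discrete latent task codebook C

def compute_ctr(task_id):
    """Return the L quantized code embeddings (the CTR) and code vector z."""
    E_k = task_embeddings[task_id]                           # task embedding E_k
    Q = (E_k @ encoder_W).reshape(L, D)                      # L query vectors
    dists = ((Q[:, None, :] - codebook[None]) ** 2).sum(-1)  # squared l2 to every code
    z = dists.argmin(axis=1)                                 # Eq. (1): nearest-neighbor indices
    return codebook[z], z                                    # CTR embeddings and code vector z
```

Because of the argmin, every position of the CTR is an exact row of the codebook, which is what makes the representation discrete and compositional.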

3.2. TRAINING

Training can be challenging due to the existence of discrete latent variables. Moreover, during the initial training phase, the CTR learning module is randomly initialized; as a result, the codebook embeddings C have a very different distribution from the query vectors Q, which increases the difficulty of optimization. We therefore decouple training into two phases. In the first phase, we freeze the LLM and only update the CTR learning module. This is followed by tuning all parameters.

In terms of the loss function, following van den Oord et al. (2017), we employ two separate losses, an embedding loss and a commitment loss, to match the query vectors with the compositional task representations. These losses are used in combination with a standard language modeling loss:

    L = L_LM + Σ_{l=1}^{L} ( ||sg[Q_l] - CTR_l||²_2 + β ||Q_l - sg[CTR_l]||²_2 )    (2)

where L_LM is a standard language modeling loss for solving the target task, sg denotes stopping the gradients, and β is a hyperparameter.
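Numerically, the objective in Eq. (2) can be sketched as below. The function name is ours; under an autograd framework the two sg[·] terms would be implemented with stop-gradient (e.g. `detach`), so that the embedding loss updates only the codebook and the commitment loss updates only the encoder, even though both reduce to the same squared distance.

```python
import numpy as np

def ctr_objective(Q, CTR, lm_loss, beta=0.1):
    """Sketch of Eq. (2): L = L_LM + sum_l (||sg[Q_l] - CTR_l||^2 + beta * ||Q_l - sg[CTR_l]||^2).

    Q, CTR: arrays of shape (L, D). sg[.] has no numeric effect, so both
    terms share the same per-position squared l2 distance; they differ
    only in which parameters receive gradients under autograd.
    """
    sq = ((Q - CTR) ** 2).sum(axis=1)   # per-position squared l2 distance, shape (L,)
    embedding_loss = sq.sum()           # would update the codebook embeddings
    commitment_loss = beta * sq.sum()   # would keep queries committed to their codes
    return lm_loss + embedding_loss + commitment_loss
```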

3.3. INFERENCE

We consider two settings, zero-label learning and few-shot learning, and describe how we apply CTR in each.

Code Ensemble for Zero-Label Learning. In the zero-label setting, we are given a new task along with a set of unlabeled data.* The question is how to decide the code for this new task without labeled data. Our main idea is to select one of the training tasks (319 in total in our experiments) and use its learned code for the new task. We first obtain a set of candidate codes by examining how much each code gives predictions that deviate from a uniform distribution on the unlabeled data (Zhao et al., 2021). The candidate set is formed by the N (set to 60 in our experiments) codes with the lowest deviations. We then ensemble the candidate codes to predict pseudo labels on the unlabeled data, and select the code with the highest pseudo-label accuracy.

Bitwise Search for Few-Shot Learning. In the few-shot setting, we are given a new task along with a set of labeled data, which we use as a validation set to search for a high-performing code. Our preliminary study shows that one can control the output of CTR by changing a single bit of the code vector z. Inspired by this, we first examine the validation-set accuracy of the codes of training tasks, and select the code with the best accuracy as the initialization. Then we iteratively change a single bit of the selected code and evaluate the validation-set performance, keeping the updated code whenever it performs better. Finally, the code vector that obtains the best result on the validation set is taken as the test-task code. Our preliminary study also shows that a certain code value usually occurs at a specific position of the code vector, and each position usually takes only a small set of code values. Motivated by this, we only consider a small set of valid code candidates for each position.
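The bitwise search above amounts to greedy coordinate search over code positions. A minimal sketch under stated assumptions (all names are ours; `val_acc` stands for evaluating a candidate code on the few labeled examples):

```python
def bitwise_search(init_code, candidates_per_pos, val_acc):
    """Greedily change one code position at a time, keeping improvements.

    init_code: best-performing training-task code on the validation set.
    candidates_per_pos: for each position, the small set of code values
    observed there across training tasks (per the paper's observation).
    val_acc: callable mapping a code (list of ints) to validation accuracy.
    """
    best, best_acc = list(init_code), val_acc(list(init_code))
    improved = True
    while improved:
        improved = False
        for pos, values in enumerate(candidates_per_pos):
            for v in values:
                if v == best[pos]:
                    continue
                trial = best[:pos] + [v] + best[pos + 1:]  # change a single bit
                acc = val_acc(trial)
                if acc > best_acc:                         # keep only improvements
                    best, best_acc = trial, acc
                    improved = True
    return best, best_acc
```

The outer loop repeats full sweeps until no single-bit change helps, matching the iterative procedure described above.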
In both the zero-label and few-shot settings, after we obtain a code vector z for the new task, we use it to obtain a compositional task representation by indexing z into the codebook C. The task representation is then passed through the decoder and the LLM to perform the new task, as at training time. This is also illustrated in the right part of Figure 2.
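For completeness, the zero-label Code Ensemble procedure can be sketched as follows. Here `code_probs` holds each candidate training-task code's predicted class distributions on the unlabeled data; the deviation scoring and all names are our illustrative reading of the procedure, not the released implementation.

```python
import numpy as np

def select_code_zero_label(code_probs, n_candidates=60):
    """Pick a training-task code for a new task using only unlabeled data.

    code_probs: {code_id: (M, K) array of class probabilities on the
    M unlabeled examples}. Steps follow Section 3.3: filter candidates by
    deviation from the uniform distribution, ensemble them into pseudo
    labels, then keep the candidate that best matches the pseudo labels.
    """
    def deviation(p):  # mean absolute deviation from the uniform distribution
        return np.abs(p - 1.0 / p.shape[1]).mean()

    # keep the n_candidates codes with the lowest deviations
    candidates = sorted(code_probs, key=lambda c: deviation(code_probs[c]))[:n_candidates]
    # ensemble the candidates to predict pseudo labels on the unlabeled data
    pseudo = np.mean([code_probs[c] for c in candidates], axis=0).argmax(axis=1)
    # select the candidate with the highest pseudo-label accuracy
    return max(candidates, key=lambda c: (code_probs[c].argmax(axis=1) == pseudo).mean())
```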

4.1. EXPERIMENTAL SETUP

We conduct extensive quantitative experiments to validate the cross-task generalization performance of CTR. We mainly consider two settings: the zero-label setting and the few-shot setting. Aside from quantitative experiments, we perform qualitative analyses to understand the cross-task generalization ability by investigating the association between the discrete latent codes and the tasks.

4.1.1. DATASETS

CTR requires a large multi-task set for training, and a held-out set of tasks whose types are never seen during training for evaluation. We follow the T0 benchmark. The training part consists of 39 tasks of 8 task types, including closed-book question answering (QA), multiple-choice QA, extractive QA, sentiment analysis, paraphrase identification, topic classification, summarization, and structure-to-text. The test part consists of 11 tasks of 4 task types, including natural language inference (RTE (Dagan et al., 2006), CB (De Marneffe et al., 2019), ANLI/R1-R3 (Nie et al., 2020)), coreference resolution (WSC (Levesque et al., 2012), Winogrande (Sakaguchi et al., 2020)), sentence completion (COPA (Roemmele et al., 2011), StoryCloze (Mostafazadeh et al., 2017), Hellaswag (Zellers et al., 2019)), and word sense disambiguation (WiC (Pilehvar & Camacho-Collados, 2019)). The training and test parts are disjoint in task types, ensuring the zero-label setting. We follow T0 and report accuracy on the validation split of the test tasks as the metric.

4.1.2. BASELINES

For the zero-label setting, we primarily compare CTR with the following baselines. It is noteworthy that all baselines share a similar model size to CTR (i.e., about 770M) and are thus comparable. We provide implementation details of the baselines in Appendix A.7.

• T0 (Sanh et al., 2022) shares a similar goal to CTR, using prompted multi-task training to improve generalization performance. T0 reports the average results over multiple prompts.†
• Self-Training. Since unlabeled data are accessible, we consider self-training (Schick & Schütze, 2020) as a baseline. Starting from T0, it uses T0 to label the unlabeled data and further finetunes T0 on the pseudo-labeled data. It also reports average performance over prompts.
• Manual-Code uses artificial feature vectors in place of the CTR vector. Specifically, we manually label a set of hand-designed features for each task, such as the number of input fields, whether it requires reasoning, and whether it is a classification task. Each task is thus represented as a discrete feature vector, with each dimension associated with one of these aspects.
• ZPS (Liao et al., 2022a) is a method for zero-label prompt selection. It first labels a set of unlabeled data through prompt ensembling, then uses the pseudo-labeled data to select the best natural language prompt for the test task. We apply ZPS to multi-task T0 as a baseline.

For the few-shot setting, where 32 labeled test-set examples are available, we experiment with the following five baseline methods on top of multi-task T0.

• Model Tuning directly finetunes the pretrained model on the labeled test-set data. Specifically, we follow the few-shot setting of Zheng et al. (2021), using 16 labeled examples for finetuning and another 16 for model selection.
• Prompt Tuning (Lester et al., 2021) introduces additional continuous prompts to the backbone model (i.e., T0) and trains the continuous prompts using the few labeled examples.
• GPS (Xu et al., 2022) is a genetic prompt search method. Based on T0, GPS gradually mutates the prompts with a generative model and uses the few labeled examples to select prompt candidates.
• GRIPS (Prasad et al., 2022) is a gradient-free, edit-based method for optimal prompt search.
• Black-Box Tuning (BBT) (Sun et al., 2022) is a gradient-free few-shot prompt selection method. Unlike GPS and GRIPS, it searches for the best soft prompt embedding in a continuous space.

4.1.3. TRAINING DETAILS

We instantiate our CTR with T5-Large (Raffel et al., 2019) as the LLM. We implement both the encoder and the decoder of the CTR learning module as linear networks; we have experimented with various architectures (see Appendix A.4 for more details). For the first training phase, where the LLM is frozen, we use the Adam optimizer with a learning rate of 1e-2, a decay rate of 0.1, and a batch size of 2048. We use a codebook embedding dimension of 1024, the same as the hidden dimension. The CTR length is set to 10 and each position can take values from 0 to 127, i.e., the codebook size is 128. The hyperparameter β is set to 0.1. We experimented with different codebook sizes and CTR lengths; detailed results are provided in Appendix A.5. For the second training phase, where all parameters are updated, we use the Adam optimizer with a learning rate of 1e-4 and a batch size of 1024. We follow the training recipe of T0 (Sanh et al., 2022) for the remaining hyperparameters.

4.2. MAIN RESULTS AND ANALYSIS

The cross-task generalization performance under the zero-label and few-shot settings is shown in Table 1. CTR outperforms all baseline methods on average under both settings. Compared with T0-Large and its self-training variant, CTR outperforms them by more than 4 points and by almost 1.5 points on average, respectively. The improvements potentially originate from two aspects: (a) the learned compositional task representations generalize better than the discrete manual prompts used by T0/self-training; (b) CTR can select a high-performing compositional task representation for the unseen task. Compared with Manual-Code, CTR demonstrates a significant advantage of more than 4.5 points, suggesting that hand-designed task features are unreliable and that CTR provides an effective way of automatically learning data-driven compositional task representations through multi-task training. Compared with Model Tuning, Prompt Tuning, and BBT, which require parameter updates on the test-task data, CTR shows better cross-task generalization performance on average. Compared with baselines that do not tune parameters (i.e., GPS, GRIPS), CTR shows even larger and more consistent advantages, dominating on 9 of the 11 test tasks.

On coreference tasks, CTR performs worse under the zero-label setting and better under the few-shot setting. The zero-label setting uses pseudo-labeled data to select test codes, while the few-shot setting uses real labeled data. The performance drop arises because the pseudo-labeled data for the coreference tasks were of low quality and did not select effective task codes. Please refer to Appendix A.3 for zero-label generations from CTR.

4.3.1. ARE MANUAL PROMPTS NECESSARY?

We are interested in whether CTR can be further improved when combined with discrete manual prompts, and conduct comparative experiments of CTR with and without manual prompts. Specifically, for CTR without manual prompts, the inputs are a direct concatenation of multiple text fields, with the compositional task representations prepended. For CTR with manual prompts, the inputs are constructed using T0 prompts, again with the CTR prepended. The results are presented in Table 2. We report both the CTR results and their "upper-bound" results. Note that the "upper-bound" results are post-hoc: they are obtained by selecting the best code/prompt after observing test-task performance, and are given merely to estimate the potential of the two methods. We observe that adding discrete manual prompts improves neither the CTR performance nor the "upper-bound". From these results, we conclude that (1) CTR does not rely on manual prompts to obtain optimal performance and can work as an alternative to prompt-based methods; and (2) the gap between the CTR performance and the CTR upper-bound shows there is still room to optimize the code-searching algorithm and thereby further improve generalization performance.

Table 2: Ablation study on manual prompts, showing the results of CTR without and with manual prompts under the zero-label setting. The "upper-bound" results of each method are obtained by using the best code/prompt after observing test-task performance and are given merely to estimate the potential of the method. Adding manual prompts does not necessarily improve performance, and the codebook learned by CTR can act as an alternative to manual prompts.

4.3.2. TRAINING OBJECTIVES

Our CTR training objective consists of three loss terms. Intuitively, the LM loss is optimized to predict the correct answer for the task, while the embedding loss and the commitment loss minimize the distance between the query embeddings and the CTR embeddings looked up from the codebook. To study the effect of each term, we conduct an ablation study; the results are shown in Table 3. Removing any of the loss terms drastically hurts zero-label performance, showing that each term is indispensable for training CTR.

Table 3: Ablation study on the CTR loss function. Removing any of the loss terms, i.e., the commitment loss, the embedding loss, or both, decreases performance to varying degrees.

4.4.1. INTERPRETABILITY

Since CTR is trained to represent tasks with compositional codes, with each code associated with a key aspect of the task, it demonstrates a certain degree of interpretability. Table 4 presents examples showing how each compositional task code may be interpreted. Interestingly, tasks that share similar features frequently have the same code. For example, code 52 occurs in tasks that require extracting information from the given context, including samsum* (summarization), wiki bio* (structured data to text), and paws labeled final paraphrase (paraphrase generation). As another example, code 111 appears in most of the tasks that require generating long sentences, including multi news* and samsum* (both summarization tasks). We also validate these explanations on unseen test tasks and find that they still hold. For example, the COPA task generates long answers given two candidate choices, and its compositional code contains code 111 (indicating sentence generation).

4.4.2. CONTROLLABILITY

CTR also demonstrates a certain degree of controllability: by modifying one bit of the compositional task code, CTR exhibits a different task behavior. We conclude that (1) since the inputs are randomly sampled from all tasks, some of which differ considerably in data distribution, CTR's good performance on them shows that it indeed learns the ability to perform different tasks, instead of simply memorizing or overfitting to a certain dataset; and (2) CTR is capable of switching between different tasks by changing a single bit of the compositional code, showing that CTR effectively encodes the "task behavior" factor into the bits of the compositional code while disentangling other factors.

4.4.3. HOW CTR GENERALIZES TO NEW TASKS

To reveal how CTR essentially works, we use Figure 1 to explain how CTR generalizes from training tasks to unseen tasks such as COPA. During the training phase, CTR learns the compositional task code for each task, with each code associated with certain features of the task. For example, code 69 is associated with many of the multiple-choice/reasoning-based QA tasks, and code 111 is associated with most of the long-sentence-generation tasks. For a new task, its task code is a re-arranged composition of codes existing in training tasks. If the new task shares a code snippet with the training tasks, it inherits the corresponding task features as well. In the example of Figure 1, a real case produced by our model, the new task COPA is performed by combining the abilities of reasoning-based QA, sentence generation, and multi-choice selection from training tasks. These observations are consistent with our motivation of learning CTR for cross-task generalization.

4.4.4. GENERALIZING TO NEW UNSEEN DOMAINS/CLASSES

Another interesting finding is that, for tasks with limited categorical classes, e.g., topic classification tasks, CTR is capable of generalizing to new unseen classes/domains. Table 6 shows several examples. Given different inputs that are randomly sampled from a large set of tasks and the compositional task code of AGNews, CTR not only predicts topic classes originally defined by AGNews but also predicts new classes that are never seen within AGNews.

INPUT: If Jim moves some particles of matter farther apart, what will happen to the rate at which they can pass vibrations on to nearby particles decrease or increase? Assuming that: When particles of matter are closer together, they can more quickly pass the energy of vibrations to nearby particles.
TASK CODE: [61, 96, 127, 113, 55, 107, 5, 87, , 5]
FILL 52 → DIALOGUE: W: Hi, Jim. I'm Jim. I'm a physicist at the University of California, Berkeley.
FILL 37 → TOPIC CLASSIFICATION: Science and Technology

INPUT:

A historic victory for the separatist Parti Quebecois was marred by an attack on a victory rally for premier-elect Pauline Marois. A man opened fire during her victory speech, killing one person and critically wounding another, reports CTV. ... "I have convictions and I am going to defend them." Marois said during her victory speech. "There will be a referendum when the Quebec population wants a referendum."
TASK CODE: [31, , 4, 113, 55, 107, 18, 87, 12, 63]
FILL 68 → REVIEW RATING: 1 star
FILL 93 → SUMMARIZATION: One person was killed and another critically injured in a shooting at a victory rally for the separatist Parti Quebecois in Quebec City on Sunday.

INPUT: Paragraph: I've been here a few times and I like that it's right up the street. On Tuesday and Wednesday they have specials on manicures and pedicures which is $5 cheaper then the original price.
TASK CODE: [23, 90, 76, 15, 2, 79, , 1, 119, 19]
FILL 13 → SENTIMENT ANALYSIS: Positive
FILL 5 → TOPIC CLASSIFICATION: Business

The code mapping does not necessarily need a complicated transformation; using complicated architectures instead increases the difficulty of the learning process.

A.5 CODEBOOK SIZE AND CTR LENGTH

There are two critical hyper-parameters for CTR: the codebook size and the CTR length. We conduct experiments with different selections of these two hyper-parameters to see how they influence zero-label generalization performance. Table 11 presents the results. We observe that when the codebook size decreases below 64, performance drops considerably. We conjecture that a codebook size below 64 does not provide sufficient capacity to represent the multiple aspects of the tasks. In addition, a CTR length of 10 generally outperforms larger CTR lengths.

We also want to verify whether different selections of codes affect the performance of the model. Intuitively, if a code were merely noise that the model eventually ignores, the model would not be sensitive to different selections of codes, and vice versa. Therefore, we randomly sample several codes of training tasks and evaluate their performance on different test tasks. The results are shown in Table 12. The model is highly sensitive to the selection of codes. For example, the code of task amazon polarity user satisfied performs well on RTE, significantly better than the code of task cnn dailymail 3.0.0 news card view.



† T0 uses natural language prompts from PromptSource(Bach et al., 2022).



Figure 2: An overview of the architecture of our proposed CTR. The left part illustrates the training phase, while the right part shows how CTR works during the inference phase.

Table 1: Main results of CTR and baselines on 11 test tasks under the zero-label setting and the few-shot setting. The zero-label setting allows using unlabeled data of the test task, while the few-shot setting uses 32 labeled examples of the test task. All methods share a similar model size (i.e., 770M).

Table 5 shows several examples of controlling CTR. We observe that, given the same inputs, by simply changing one bit of the compositional task code, the task behavior of CTR turns from DIALOGUE GENERATION to TOPIC CLASSIFICATION, from REVIEW RATING to SUMMARIZATION, from SENTIMENT ANALYSIS to TOPIC CLASSIFICATION, etc.

Table 4: Examples of how compositional task codes can be possibly interpreted, using a codebook size of 128. The table shows the co-occurrence of tasks and codes. By analyzing the common features shared by multiple tasks, we find that CTR offers a degree of interpretability.

Table 5: Examples of controlling the compositional codes. Given the input and a compositional task code, each time we modify one bit of the code, CTR begins to perform a different task.

With the code for AGNews (topic classification): [23, 90, 76, 15, 2, 79, 39, 1, 119, 19]
Original AGNews classes: {World, Sports, Business, Science and technology}

INPUT: Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
OUTPUT: Business

INPUT: There are 10 apples on an apple tree. Three fall off. Now there are X apples.

INPUT: Stuning even for the non-gamer This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate game music! It would impress anyone who cares to listen!
OUTPUT: Music

INPUT: Slack (2003) compares three groups that conducted biological research at Yale during overlapping periods between 1910 and 1970. Yale proved important as a site for this research. ... Hutchinson's example shows that new models for research groups are needed, especially for those that include extensive field research.

Table 6: Examples of CTR generalizing to new unseen classes/domains. The inputs are randomly selected from all tasks other than AGNews. The first case predicts the same classes as AGNews defines, while the latter three cases predict new classes that are never seen within AGNews. This shows that the codebook learned by CTR can generalize to new unseen classes/domains.

5. CONCLUSION

We propose the Compositional Task Representations (CTR) method, which learns a discrete compositional codebook for tasks and generalizes to new unseen tasks by forming new compositions of the task codes. For the inference of CTR, we propose two algorithms, Code Ensemble and Bitwise Search, for the zero-label and few-shot settings respectively. Experiments demonstrate that CTR significantly outperforms existing prompt-based methods in both settings. Analysis of the learned compositional task codes shows that some of the CTR codes exhibit a certain degree of interpretability and controllability.

A.2 EXAMPLES OF DBPEDIA 14

QUESTION: Federation of International Trade Associations - The Federation of International Trade Associations (FITA) based in Reston Virginia and New York New York USA was founded in 1984. It fosters international trade by seeking to strengthen the role of associations in the United States Mexico and Canada. FITA is the strategic partner of the United States Commercial Service for ecommerce. company, educational institution, artist, athlete, office holder, mean of transportation, building, natural place, village, animal, plant, album, film or written work
MODEL ANSWER (WITHIN THE ORIGINAL CLASSES): Company

QUESTION: Furian knife In this movie sequel, Vin Diesel returns as Riddick, an escaped convict with a price on his head. Riddick has been hiding on a snow planet for the last five years, when a group of mercenaries try to capture him. Riddick returns to the planet Helion, and finds out that his friend Jack, is in prison on Crematoria, a very hot planet. While on Helion, the planet is invaded by the Necromongers, led by Lord Marshal (Colm Feore), who wants to rule the universe. Riddick is captured by the mercs and flown to the same unsavoury and possibly illegal prison Kyra is in. Turns out Jack is now known as Kyra (Alexa Davalos) and is tough as nails.
MODEL ANSWER (WITHIN THE ORIGINAL CLASSES): Film

QUESTION: lila abu-lughod's: -name -alma mater -website -known for -birth date -employer -nationality -occupation Bio: lila abu-lughod -lrb-born 1952 -rrb-is an american with palestinian and jewish ancestry who is professor of anthropology and women's and gender studies at columbia university in new york city. a specialist of the arab world, her seven books, most based on long term ethnographic research, cover topics from sentiment and poetry to nationalism and media, from gender politics to the politics of memory.

INPUT: We know that, thanks to our DNA, each of us is a little bit different. Some of those differences are obvious, like eye and hair color. Others are not so obvious, like how our bodies react to medication. Researchers are beginning to look at how to tailor medical treatments to our genetic profiles, in a relatively new field called pharmacogenomics. Some of the biggest breakthroughs have been in cancer treatment.
MODEL OUTPUT (OUT OF ORIGINAL CLASSES): Science

INPUT: how did athenians make money? Other Greek cities set up democracies, and even though most followed an Athenian model, none were as powerful, stable, nor as well-documented as that of Athens.
MODEL OUTPUT (OUT OF ORIGINAL CLASSES): City

INPUT: Great food, portions could be smaller. A little pricey for Middleton. Would try somewhere else next time but would also not hesitate to return
MODEL OUTPUT (OUT OF ORIGINAL CLASSES): Restaurant

Examples of CTR with the compositional code of DBpedia 14. Given different inputs randomly sampled from a large set of tasks and the compositional task code of DBpedia 14, CTR not only predicts topic classes originally defined by DBpedia 14 but also predicts new topic classes that never occur in DBpedia 14.
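The Bitwise Search inference mentioned above can be sketched as a greedy coordinate search over code positions. This is only a schematic, assuming the code length and codebook size from the ablations; the toy `score` below stands in for the model's few-shot accuracy on labeled examples, which is what the real procedure would evaluate.

```python
import random

def bitwise_search(score_fn, code_length=10, codebook_size=128, init=None, seed=0):
    """Greedy coordinate search: sweep the code positions and keep any
    single-entry change that improves the score (e.g., few-shot accuracy)."""
    rng = random.Random(seed)
    code = list(init) if init else [rng.randrange(codebook_size) for _ in range(code_length)]
    best = score_fn(code)
    for pos in range(code_length):
        for idx in range(codebook_size):
            candidate = code[:pos] + [idx] + code[pos + 1:]
            s = score_fn(candidate)
            if s > best:
                best, code = s, candidate
    return code, best

# Toy score: how many positions of the code match a hidden target code.
target = [23, 90, 76, 15, 2, 79, 39, 1, 119, 19]
score = lambda c: sum(a == b for a, b in zip(c, target))
found, best = bitwise_search(score)
assert found == target  # the greedy sweep recovers the target under this separable score
```

With a separable score like this toy one, a single sweep suffices; with a model-based score, each position change costs one evaluation of the few-shot set, so the sweep order and early stopping become practical design choices.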

Ablation study on different codebook sizes and CTR lengths. All experiments are conducted under the zero-label setting. We experiment with three codebook sizes (128, 64, and 48) and two CTR lengths (10 and 20). Results show that the "128 - 10" combination achieves the best performance.
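As a back-of-the-envelope check on these settings, the number of distinct task codes a configuration can express is codebook_size raised to the code length. The arithmetic below only illustrates representational capacity, not the accuracy differences reported in the ablation.

```python
# Number of distinct compositional codes representable by each ablated setting.
def capacity(codebook_size, code_length):
    return codebook_size ** code_length

assert capacity(128, 10) == 2 ** 70             # 128**10 = (2**7)**10
assert capacity(64, 10) == 2 ** 60              # shrinking the codebook costs 2**10x
assert capacity(128, 20) == capacity(128, 10) ** 2  # doubling the length squares the space
```

Even the smallest setting vastly exceeds the number of training tasks, so the ablation result (128 - 10 best) reflects optimization and generalization behavior rather than raw capacity limits.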

Performance sensitivity to different selections of codes.

A ANNEX

Table 10: Ablation study on different architectures of the CTR encoder and decoder. All experiments are conducted under the zero-label setting. We experiment with two encoders, the linear net and the multi-layer perceptron (MLP), and five decoder alternatives: the linear net, a bidirectional RNN, a Transformer, an MLP, and removing the decoder entirely (None). Results show that the simple "Linear - Linear" combination achieves the best performance.

Table 10 shows the zero-label results when using different architectures for the encoder and the decoder. Simply using a linear network for both performs best, while complicated architectures, e.g., the Transformer and the MLP, yield poor performance. This could be because complicated architectures make the learning of the codebook embeddings more difficult.

A.7 EXPERIMENTAL DETAILS

For the data preprocessing of all experiments, to balance the amount of data across tasks, we restrict the maximum number of training examples per task to 50,000, which empirically yields better results.

Training details of each baseline method under the zero-label setting are as follows.

T0-Large: Based on T5-Large-LM-Adapted, it performs multi-task training for 10,000 steps. We set the maximum lengths of the input and target sequences to 384 and 32 respectively. We use the Adam optimizer with a learning rate of 1e-4, a dropout rate of 0.1, and a batch size of 1024. Following T0 (Sanh et al., 2022), we use the same task prompts from PromptSource (Bach et al., 2022) and report the average accuracy over multiple prompts for each test task. Note that our reproduced T0-Large results are much better than those reported in the original paper (Sanh et al., 2022), which sets a much stronger baseline for comparison. We believe our baseline is well optimized: its performance on the test tasks is comparable to that of T0-3B reported in Sanh et al. (2022), even though our baseline contains only 770M parameters.

Self-Training: For a fair comparison, we randomly sample 32 unlabeled examples for self-training, the same number as for CTR. Self-Training starts from T0-Large, trains on these pseudo-labeled examples for 5 epochs, and reports the average performance over prompts. For training, we use a batch size of 32 and the Adam optimizer with a learning rate of 1e-4.

Manual-Code: We manually label a set of artificially designed features for each task, including the number of input fields, whether it requires reasoning, whether it includes options in the inputs, whether it is a classification task, etc. Each task is thus represented as an artificially defined discrete feature vector, with each dimension associated with one of these aspects. Manual-Code follows exactly the same training recipe as the second training phase of our CTR; training details are presented in Section 4.1.3.

Zero-Label Prompt Selection (ZPS) (Liao et al., 2022a): For each task, we use 32 unlabeled examples to produce pseudo-labeled data, and we report the accuracy of the selected prompt.

For the few-shot setting, we consider the following five baseline methods. All few-shot baselines are based on our reproduced T0-Large.

Model Tuning: We use the Adam optimizer with a batch size of 256 and a learning rate of 1e-4. We combine all training data of the test tasks for training. The maximum number of training steps is 100. We use a validation set for model selection and report the average accuracy of the selected best checkpoint.

Prompt Tuning (Lester et al., 2021): We use the Adam optimizer with a batch size of 128 and a learning rate of 0.05. We combine all training data of the test tasks for training. The maximum number of training steps is 100, and the length of the continuous prompt for each task is 20. We use a validation set for model selection and report the average accuracy of the selected best checkpoint.

GPS (Xu et al., 2022): We follow the hyper-parameters reported in the original paper. Specifically, we run GPS for 6 steps; at each step, new prompts are generated by a T5-xxl-lm-adapted model.

GRIPS (Prasad et al., 2022): We set the max patience P = 2, the number of candidates l = 5, and the number of steps m = 5. For the remaining hyper-parameters, we follow the original GRIPS paper.

Black-Box Tuning (BBT) (Sun et al., 2022): We use the Adam optimizer with a learning rate of 0.05. For each test task, we train the soft prompt for 200 steps. We set the prompt length L = 10, the subspace dimension d = 500, and the CMA budget to 1000. For the remaining hyper-parameters, we follow the original BBT paper.
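The Manual-Code baseline can be pictured with a small sketch. The feature names and the per-task labels below are illustrative assumptions, not the paper's actual hand-labeled features; the point is only that each task is reduced to a fixed-length discrete vector, one dimension per aspect.

```python
# Each task gets a hand-labeled discrete feature vector, one dimension per aspect.
FEATURES = ["num_input_fields", "requires_reasoning", "includes_options", "is_classification"]

def manual_code(num_input_fields, requires_reasoning, includes_options, is_classification):
    """Assemble the artificially defined discrete feature vector for one task."""
    return [num_input_fields,
            int(requires_reasoning),
            int(includes_options),
            int(is_classification)]

# Hypothetical labels for two tasks (illustrative only).
agnews_code = manual_code(1, False, True, True)  # one text field, options shown, classification
nli_code = manual_code(2, True, True, True)      # premise + hypothesis, needs reasoning

assert len(agnews_code) == len(FEATURES)
assert agnews_code != nli_code  # different tasks receive different manual codes
```

These vectors then play the role of the learned CTR codes during the second training phase, which is why the baseline isolates the value of learning the codebook rather than hand-designing it.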

