COMPOSITIONAL TASK REPRESENTATIONS FOR LARGE LANGUAGE MODELS

Abstract

Large language models have shown a remarkable cross-task generalization ability. Most prior work assumes that prompts effectively extract knowledge from language models to facilitate generalization to new tasks, a perspective that has led to numerous studies on improving prompts. In contrast, we introduce a new perspective, compositional generalization, which views each task as a composition of latent codes and generalizes to test tasks through new compositions of seen codes. To this end, we propose a novel prompt-free approach, Compositional Task Representations (CTR), that employs multi-task training to learn a discrete, compositional codebook. Empirically, CTR substantially outperforms prompt-based methods in zero-label learning on average. Our analysis shows that some of the learned CTR codes are interpretable to humans and demonstrate a certain degree of controllability.

1. INTRODUCTION

Large language models (LLMs) have shown remarkable performance in cross-task generalization. Without using any labeled data for the target task, GPT-3 (Brown et al., 2020) obtains reasonable performance on a wide range of tasks. Later extensions such as FLAN (Wei et al., 2022) and T0 (Sanh et al., 2022) continue training LLMs on a large number of supervised tasks, which further improves cross-task generalization. The aforementioned studies share an important assumption: natural language prompts extract knowledge from LLMs to facilitate generalization to new tasks. In this direction, numerous studies have focused on different aspects of improving prompt-based learning, such as designing better prompts (Xu et al., 2022), increasing the number of prompts (Wang et al., 2022; Aribandi et al., 2022), and improving the training efficiency of prompts (Lester et al., 2021).

In contrast, we explore an alternative perspective for cross-task generalization: compositional generalization. Specifically, we explore whether it is possible to represent tasks as discrete compositions of latent codes. This perspective offers several potential benefits. First, since the latent codes are trained on seen tasks, we expect strong cross-task generalization because new tasks can also be represented as compositions of these trained codes. Second, it provides a way to analyze and understand cross-task generalization by investigating the association between tasks and the learned representations. Third, due to its built-in compositionality, it has the potential to be more controllable than prompts for task generalization.

Motivated by these potentials, we propose a new method, Compositional Task Representations (CTR), that employs multi-task training to learn a discrete, compositional codebook.
Specifically, given a large number of training tasks, we use an encoder to map each randomly initialized task embedding to a fixed-length sequence of query vectors. Each query vector is used to retrieve a code from a codebook, which is formulated as an embedding lookup table. This produces a sequence of codes, a compositional representation of the current task. These compositional codes are fed as input to an LLM in place of prompts to make predictions. At test time, given a new task, we use unlabeled data to search for a high-performing composition of codes, which enables zero-label cross-task generalization. CTR is also applicable to the few-shot setting, where the few labeled examples are used for code search. Empirically, we demonstrate improved performance in both the zero-label and few-shot settings, outperforming strong baselines including prompt tuning, model tuning, and genetic prompt search (Xu et al., 2022). Importantly, we analyze the learned task representations and show that they demonstrate a certain degree of interpretability and controllability. For example, as shown in Figure 1, CTR learns to generalize to a new task through a new composition of existing codes.

[Figure 1 appears here. Its panels show real examples at the zero-label inference phase: CosmosQA (Code 69 is shared by tasks that are multiple-choice/reasoning-based QA), AGNews (Code [15, 2, 79, 39, 1, 119, 19] occurs in tasks that select from multiple options), MultiNews (Code 111 exists in most long-sentence-generation tasks), and COPA (sentence completion), with task inputs, compositional task codes, and generated outputs.]
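To make the retrieval step concrete, the following is a minimal sketch of quantizing encoder query vectors against a codebook. The nearest-neighbor (L2) retrieval rule and all sizes (`CODEBOOK_SIZE`, `CODE_DIM`, `NUM_QUERIES`) are illustrative assumptions in the style of standard vector quantization, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 128   # number of discrete codes (assumed; Figure 1 shows code ids up to ~120)
CODE_DIM = 16         # embedding dimension of each code (assumed)
NUM_QUERIES = 10      # fixed-length code sequence per task (assumed)

# The codebook is an embedding lookup table: one row per discrete code.
codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))

def quantize(queries: np.ndarray) -> np.ndarray:
    """Map each query vector to the index of its nearest codebook entry.

    Assumption: retrieval is nearest-neighbor in L2 distance, as in
    standard vector quantization; the paper's retrieval rule may differ.
    """
    # (num_queries, codebook_size) pairwise squared distances
    d2 = ((queries[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# A task's encoder output: a fixed-length sequence of query vectors.
task_queries = rng.normal(size=(NUM_QUERIES, CODE_DIM))
codes = quantize(task_queries)       # discrete compositional task code
code_embeddings = codebook[codes]    # fed to the LLM in place of a prompt
```

The discrete indices in `codes` are what the paper analyzes for interpretability; the looked-up rows `code_embeddings` are what the LLM actually consumes.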

Figure 1: An illustration of how CTR generalizes to zero-label tasks. In this real example produced by our model, CTR combines the abilities of reasoning-based QA, sentence generation, and multi-choice selection from training tasks to perform a new task, COPA.
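The test-time code search can be sketched as a simple hill-climbing loop over code sequences. This is a stand-in for the paper's actual procedure: `score_fn` is a hypothetical placeholder for whatever unlabeled-data criterion ranks candidate compositions, and the real search algorithm may differ (e.g., a genetic search as in Xu et al., 2022).

```python
import random

def search_codes(score_fn, codebook_size=128, seq_len=10,
                 candidates=20, iters=30, seed=0):
    """Hill-climbing search for a high-scoring composition of codes.

    `score_fn(codes)` is a placeholder: in zero-label learning it would
    score a candidate composition using only unlabeled data; in the
    few-shot setting it could be accuracy on the few labeled examples.
    """
    rng = random.Random(seed)
    best = [rng.randrange(codebook_size) for _ in range(seq_len)]
    best_score = score_fn(best)
    for _ in range(iters):
        for _ in range(candidates):
            cand = list(best)
            # Mutate one slot of the current best composition.
            cand[rng.randrange(seq_len)] = rng.randrange(codebook_size)
            s = score_fn(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score

# Toy usage: the score counts slots matching a hidden target composition.
target = [23, 90, 69, 15, 2, 79, 25, 103, 120, 44]
codes, score = search_codes(lambda c: sum(a == b for a, b in zip(c, target)))
```

Because the search only needs a scalar score per candidate, it treats the LLM as a black box, which is what makes the zero-label setting feasible.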

2. RELATED WORK

Language Model Prompting. Brown et al. (2020) showed that GPT-3 performs well in the few-shot setting if properly handcrafted prompts are provided. Other works (Shoeybi et al., 2019; Rae et al., 2021; Schick & Schütze, 2021) also report promising zero-shot or few-shot performance of LLMs. Besides, Wei et al. (2022) and Sanh et al. (2022) collect a set of labeled datasets and use manual templates to transform them into a sequence-to-sequence format. Such a formulation makes it possible to continue training LLMs on these labeled datasets and improves cross-task generalization.

Compositional Architecture for LLMs. Previous work has explored designing compositional architectures. Sparsely Gated Mixture of Experts (MoE) (Lepikhin et al., 2021) activates a subset of a network given the input data. Artetxe et al. (2021) trained an MoE model with 1.1T parameters, which is shown to outperform a dense model with similar computational cost. SkillNet-NLU (Tang et al., 2022) and SkillNet-NLG (Liao et al., 2022b) employ a similar sparsely activated mechanism to handle different NLU or NLG tasks. Different from these approaches, our approach focuses on learning compositional task representations using a discrete codebook.
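The sparsely gated routing underlying MoE can be reduced to a small top-k gating function. This minimal sketch illustrates the general mechanism only, not the cited systems' implementations (which add gating noise, load balancing, and expert capacity constraints); the gate weights and sizes here are arbitrary.

```python
import numpy as np

def top_k_gating(x, gate_w, k=2):
    """Route input x to the top-k experts by gate score.

    Illustrative only: real MoE systems (e.g., Lepikhin et al., 2021)
    add noise, load-balancing losses, and capacity limits.
    """
    logits = x @ gate_w                    # one gate score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                # expert ids and mixture weights

rng = np.random.default_rng(0)
x = rng.normal(size=8)                     # a token representation
gate_w = rng.normal(size=(8, 4))           # learned gate for 4 experts
experts, weights = top_k_gating(x, gate_w, k=2)
```

Only the selected experts are evaluated, so compute grows with k rather than with the total number of experts; CTR's codebook differs in that the discrete selection represents the task itself rather than routing individual inputs.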

