GENERAL-PURPOSE IN-CONTEXT LEARNING BY META-LEARNING TRANSFORMERS

Abstract

Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose learning algorithms from scratch, using only black-box models with minimal inductive bias. Such a model takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general-purpose in-context learners. We characterize phase transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks, and meta-optimization. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count. Finally, we propose practical interventions such as biasing the training distribution that improve the meta-training and meta-generalization of general-purpose learning algorithms.

1. INTRODUCTION

Meta-learning is the process of automatically discovering new learning algorithms instead of designing them manually (Schmidhuber, 1987). An important quality of human-engineered learning algorithms is their applicability to a wide range of tasks or environments. For learning-to-learn to exceed those capabilities, the meta-learned learning algorithms must be similarly general-purpose. Recently, there has been significant progress toward this goal (Kirsch et al., 2019; Oh et al., 2020). The improved generality of the discovered learning algorithms has been achieved by introducing inductive bias, such as bottlenecking the architecture or hiding information, which encourages learning over memorization. Methods include restricting learning rules to use gradients (Metz et al., 2019; Kirsch et al., 2019; Oh et al., 2020), symbolic graphs (Real et al., 2020; Co-Reyes et al., 2021), or parameter sharing and symmetries (Kirsch & Schmidhuber, 2020; Kirsch et al., 2021). While enabling generalization, these inductive biases come at the cost of increased design effort and potentially restrict the space of discoverable learning algorithms. Instead, we seek to explore general-purpose meta-learning systems with minimal inductive bias. Good candidates for this are black-box sequence models as meta-learners, such as LSTMs (Hochreiter et al., 2001; Wang et al., 2016; Duan et al., 2016) or Transformers (Vaswani et al., 2017). These memory-based or in-context learners take in training data and produce test-set predictions without any explicit definition of an inference model, training loss, or optimization algorithm. This has led to strong few-shot learning ability within the context of, for example, language modeling (Brown et al., 2020). In this work, we investigate how such black-box meta-learners can be trained to (meta-)generalize and learn on significantly different datasets than those used during meta-training.
For this we propose a Transformer-based General-Purpose In-Context Learner (GPICL), which is described with an associated meta-training task distribution in Section 3. In Section 4.1 we characterize transitions, induced by scaling the number of tasks or the model size used for meta-training, between memorization, learning, and generalization. We further show in Section 4.2 that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size determining the next prediction (such as the hidden-state size in a recurrent network), unlike standard models, which are thought to be bottlenecked by parameter count. Finally, in Section 4.3, we propose practical interventions that improve the meta-training of general-purpose learning algorithms.

2. BACKGROUND

What is a (supervised) learning algorithm? In this paper, we focus on the setting of meta-learning supervised learning algorithms. Consider a mapping $f: (D, x') \mapsto y'$ from the training (support) set $D = \{x_i, y_i\}_{i=1}^{N_D}$ and a query input $x'$ to the query's prediction $y'$, where $x_i, x' \in \mathbb{R}^{N_x}$, $y_i, y' \in \mathbb{R}^{N_y}$, and $N_D, N_x, N_y \in \mathbb{N}^+$. The subset of these functions that qualify as learning algorithms are those that improve their predictions $y'$ given an increasingly larger training set $D$. Meta-learning then corresponds to finding these functions via meta-optimization. As in other black-box meta-learning models, we use a neural network to represent such functions.

What is a general-purpose learning algorithm? A learning algorithm can be considered general-purpose if it learns on a wide range of possible tasks $D$ and their respective related queries $x', y'$. For example, gradient descent on a suitable loss function can be considered a very general-purpose human-engineered learning algorithm (where the gradient is obtained via backpropagation or other means).
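As an illustration only (not the paper's Transformer model), the following sketch shows one concrete function of the form $f: (D, x') \mapsto y'$, a simple nearest-centroid classifier, and checks the defining property of a learning algorithm: predictions improve as the training set $D$ grows. The toy two-class Gaussian task is a hypothetical example chosen for the sketch.

```python
import numpy as np

def nearest_centroid_predict(D_x, D_y, x_query):
    """One concrete instance of a mapping f(D, x') -> y': predict the label
    of the class centroid closest to the query point."""
    labels = np.unique(D_y)
    centroids = np.stack([D_x[D_y == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(centroids - x_query, axis=1))]

# A mapping qualifies as a learning algorithm if its predictions improve
# with a larger training set D. Check this on a toy two-class task.
rng = np.random.default_rng(0)
class_means = np.array([[2.0, 0.0], [-2.0, 0.0]])

def sample(n):
    y = rng.integers(0, 2, size=n)
    return class_means[y] + rng.normal(size=(n, 2)), y

def accuracy(n_train, n_trials=300):
    correct = 0
    for _ in range(n_trials):
        D_x, D_y = sample(n_train)
        (x_q,), (y_q,) = sample(1)
        correct += int(nearest_centroid_predict(D_x, D_y, x_q) == y_q)
    return correct / n_trials
```

Averaged over many sampled tasks, accuracy with a 16-example support set exceeds accuracy with a 2-example one, which is exactly the "improves with larger $D$" criterion above.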

3. GENERAL-PURPOSE IN-CONTEXT LEARNING WITH TRANSFORMERS

Due to the small number of inductive biases in black-box models, we can only expect (meta-)generalization when meta-training with an appropriately broad data distribution. Thus, changes in the data distribution affect whether and how a model meta-learns and meta-generalizes. We classify algorithms along two different dimensions: to what extent they learn (improving predictions given increasingly larger training sets), and to what extent they generalize (performing well on instances, tasks, or datasets not seen before). Algorithms can then be categorized into four regimes: those that neither learn nor generalize (task memorization), learn without generalizing (task identification), generalize without learning (zero-shot generalization), and both learn and generalize (general-purpose learning). We demonstrate that sharp phase transitions occur between these learning modalities, and empirically investigate these transitions.
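The in-context learning setup behind such a black-box learner can be sketched as a sequence-packing step. A minimal sketch, assuming tokens are formed by concatenating each input with a one-hot encoding of the previous example's label, a common choice for black-box in-context learners; `build_context` and its exact token layout are illustrative, not the paper's specification:

```python
import numpy as np

def build_context(D_x, D_y, x_query, n_classes):
    """Pack a training set D and a query into one input sequence for a
    black-box sequence model. Each token concatenates an input x_t with the
    one-hot label of the *previous* example, so the model must infer the
    task from examples appearing earlier in the sequence."""
    one_hot = np.eye(n_classes)
    xs = np.concatenate([D_x, x_query[None]], axis=0)       # (N_D + 1, N_x)
    prev_y = np.concatenate([np.zeros((1, n_classes)),      # first token: no label yet
                             one_hot[D_y]], axis=0)         # (N_D + 1, N_y)
    return np.concatenate([xs, prev_y], axis=1)             # (N_D + 1, N_x + N_y)

# Example: 5 training pairs with 3-dimensional inputs and 2 classes, plus
# one query, yield a sequence of 6 tokens of width 3 + 2 = 5.
tokens = build_context(np.zeros((5, 3)), np.array([0, 1, 0, 1, 0]),
                       np.zeros(3), n_classes=2)
```

A sequence model consuming `tokens` then emits its prediction for the query at the final position; learning, in this view, is nothing more than the model's forward pass over the packed context.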

3.1. GENERATING TASKS FOR LEARNING-TO-LEARN

Neural networks are known to require datasets of significant size to generalize effectively. While large quantities of data are common in standard supervised learning, meta-learning algorithms may require a similarly large number of distinct tasks in order to learn and generalize. Unfortunately, the number of commonly available tasks is orders of magnitude smaller than the number of datapoints in each task.

Previous work has side-stepped this issue by building architectural or algorithmic structure into the learning algorithm, in effect drastically reducing the number of tasks required. For example, in Kirsch & Schmidhuber (2020); Kirsch et al. (2021), the authors included symmetries into the

