WHAT LEARNING ALGORITHM IS IN-CONTEXT LEARNING? INVESTIGATIONS WITH LINEAR MODELS

Abstract

Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples (x, f(x)) presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms.

1. INTRODUCTION

One of the most surprising behaviors observed in large neural sequence models is in-context learning (ICL; Brown et al., 2020). When trained appropriately, models can map from sequences of (x, f(x)) pairs to accurate predictions f(x′) on novel inputs x′. This behavior occurs both in models trained on collections of few-shot learning problems (Chen et al., 2022; Min et al., 2022) and, surprisingly, in large language models trained on open-domain text (Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022). ICL requires a model to implicitly construct a map from in-context examples to a predictor without any updates to the model's parameters themselves. How can a neural network with fixed parameters learn a new function from a new dataset on the fly?

This paper investigates the hypothesis that some instances of ICL can be understood as implicit implementation of known learning algorithms: in-context learners encode an implicit, context-dependent model in their hidden activations, and train this model on in-context examples in the course of computing these internal activations. As in recent investigations of empirical properties of ICL (Garg et al., 2022; Xie et al., 2022), we study the behavior of transformer-based predictors (Vaswani et al., 2017) on a restricted class of learning problems, here linear regression. Unlike in past work, our goal is not to understand what functions ICL can learn, but how it learns these functions: the specific inductive biases and algorithmic properties of transformer-based ICL.

In Section 3, we investigate theoretically what learning algorithms transformer decoders can implement. We prove by construction that they require only a modest number of layers and hidden units to train linear models: for d-dimensional regression problems, with O(d) hidden size and constant depth, a transformer can implement a single step of gradient descent; and with O(d^2) hidden size and constant depth, a transformer can compute the closed-form ridge regression solution.
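To make the gradient-descent construction concrete, the sketch below (ours, not the paper's; the learning rate and variable names are arbitrary choices) computes the predictor that a single gradient-descent step on the in-context examples would produce, starting from a zero-initialized implicit linear model:

```python
import numpy as np

def gd_step_predictor(X, y, x_query, lr=0.1):
    """Prediction after one gradient-descent step on the in-context
    least-squares loss L(w) = ||X w - y||^2 / (2 n), from w_0 = 0.

    X: (n, d) in-context inputs; y: (n,) targets; x_query: (d,) query.
    The update only manipulates d-dimensional vectors, which loosely
    mirrors the O(d) hidden-size claim for implementing one step.
    """
    n, d = X.shape
    w0 = np.zeros(d)                  # implicit model before the update
    grad = X.T @ (X @ w0 - y) / n     # gradient of the loss at w0
    w1 = w0 - lr * grad               # single gradient-descent step
    return w1 @ x_query
```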

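For the closed-form route, a matching sketch (again ours, under the same caveats): ridge regression on the in-context examples, whose λ → 0 limit is exact least squares and which, under a Gaussian prior w ~ N(0, τ²I) with noise variance σ² and λ = σ²/τ², coincides with the Bayesian posterior-mean estimator mentioned in the abstract.

```python
import numpy as np

def ridge_predictor(X, y, x_query, lam=0.1):
    """Closed-form ridge regression on the in-context examples:
    w = (X^T X + lam I)^{-1} X^T y.

    lam -> 0 recovers exact least squares; lam = sigma^2 / tau^2
    gives the Bayesian posterior mean under a Gaussian prior on w.
    """
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w @ x_query

# Toy check on noisy linear data (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
d, n = 8, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
x_q = rng.normal(size=d)
print(ridge_predictor(X, y, x_q, lam=0.1))   # ridge prediction
print(ridge_predictor(X, y, x_q, lam=1e-9))  # ~ exact least squares
print(w_true @ x_q)                          # ground-truth value
```

These two predictors are exactly the references that the paper's experiments compare trained in-context learners against.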
