WHAT LEARNING ALGORITHM IS IN-CONTEXT LEARNING? INVESTIGATIONS WITH LINEAR MODELS

Abstract

Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples (x, f (x)) presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms.

1. INTRODUCTION

One of the most surprising behaviors observed in large neural sequence models is in-context learning (ICL; Brown et al., 2020). When trained appropriately, models can map from sequences of (x, f(x)) pairs to accurate predictions f(x′) on novel inputs x′. This behavior occurs both in models trained on collections of few-shot learning problems (Chen et al., 2022; Min et al., 2022) and, surprisingly, in large language models trained on open-domain text (Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022). ICL requires a model to implicitly construct a map from in-context examples to a predictor without any updates to the model's parameters themselves. How can a neural network with fixed parameters learn a new function from a new dataset on the fly?

This paper investigates the hypothesis that some instances of ICL can be understood as implicit implementation of known learning algorithms: in-context learners encode an implicit, context-dependent model in their hidden activations, and train this model on in-context examples in the course of computing these internal activations. As in recent investigations of empirical properties of ICL (Garg et al., 2022; Xie et al., 2022), we study the behavior of transformer-based predictors (Vaswani et al., 2017) on a restricted class of learning problems, here linear regression. Unlike in past work, our goal is not to understand what functions ICL can learn, but how it learns these functions: the specific inductive biases and algorithmic properties of transformer-based ICL.

In Section 3, we investigate theoretically what learning algorithms transformer decoders can implement. We prove by construction that they require only a modest number of layers and hidden units to train linear models: for d-dimensional regression problems, with O(d) hidden size and constant depth, a transformer can implement a single step of gradient descent; and with O(d^2) hidden size and constant depth, a transformer can update a ridge regression solution to include a single new observation. Intuitively, n steps of these algorithms can be implemented with n times more layers.

In Section 4, we investigate empirical properties of trained in-context learners. We begin by constructing linear regression problems in which learner behavior is under-determined by training data (so different valid learning rules will give different predictions on held-out data). We show that model predictions are closely matched by existing predictors (including those studied in Section 3), and that they transition between different predictors as model depth and training set noise vary, behaving like Bayesian predictors at large hidden sizes and depths.

Finally, in Section 5, we present preliminary experiments showing how model predictions are computed algorithmically. We show that important intermediate quantities computed by learning algorithms for linear models, including parameter vectors and moment matrices, can be decoded from in-context learners' hidden activations.

A complete characterization of which learning algorithms are (or could be) implemented by deep networks has the potential to improve both our theoretical understanding of their capabilities and limitations, and our empirical understanding of how best to train them. This paper offers first steps toward such a characterization: some in-context learning appears to involve familiar algorithms, discovered and implemented by transformers from sequence modeling tasks alone.

Correspondence to akyurek@mit.edu. Ekin is a student at MIT, and began this work while he was an intern at Google Research. Code and reference implementations are released at this web page. The work was done while Tengyu Ma was a visiting researcher at Google Research.

2. PRELIMINARIES

Training a machine learning model involves many decisions, including the choice of model architecture, loss function and learning rule. Since the earliest days of the field, research has sought to understand whether these modeling decisions can be automated using the tools of machine learning itself. Such "meta-learning" approaches typically treat learning as a bi-level optimization problem (Schmidhuber et al., 1996; Andrychowicz et al., 2016; Finn et al., 2017): they define "inner" and "outer" models and learning procedures, then train an outer model to set parameters for an inner procedure (e.g. initializer or step size) to maximize inner model performance across tasks.

Recently, a more flexible family of approaches has gained popularity. In in-context learning (ICL), meta-learning is reduced to ordinary supervised learning: a large sequence model (typically implemented as a transformer network) is trained to map from sequences [x_1, f(x_1), x_2, f(x_2), ..., x_n] to predictions f(x_n) (Brown et al., 2020; Olsson et al., 2022; Laskin et al., 2022; Kirsch & Schmidhuber, 2021). ICL does not specify an explicit inner learning procedure; instead, this procedure exists only implicitly through the parameters of the sequence model. ICL has shown impressive results on synthetic tasks and naturalistic language and vision problems (Garg et al., 2022; Min et al., 2022; Zhou et al., 2022).

Past work has characterized what kinds of functions ICL can learn (Garg et al., 2022; Laskin et al., 2022) and the distributional properties of pretraining that can elicit in-context learning (Xie et al., 2021; Chan et al., 2022). But how ICL learns these functions has remained unclear. What learning algorithms (if any) are implementable by deep network models? Which algorithms are actually discovered in the course of training?
This paper takes first steps toward answering these questions, focusing on a widely used model architecture (the transformer) and an extremely well-understood class of learning problems (linear regression).

2.1. THE TRANSFORMER ARCHITECTURE

Transformers (Vaswani et al., 2017) are neural network models that map a sequence of input vectors x = [x_1, ..., x_n] to a sequence of output vectors y = [y_1, ..., y_n]. Each layer in a transformer maps a matrix H^(l) (interpreted as a sequence of vectors) to a matrix H^(l+1). To do so, a transformer layer processes each column h^(l)_i of H^(l) in parallel. Here, we are interested in autoregressive (or "decoder-only") transformer models in which each layer first computes a self-attention:

  a_i = Attention(h^(l)_i; W^F, W^Q, W^K, W^V)    (1)
      = W^F [b_1, ..., b_m] ,    (2)

where each b_j is the response of an "attention head" defined by:

  b_j = softmax((W^Q_j h_i)⊤ (W^K_j H_{:i})) (W^V_j H_{:i}) ,    (3)

then applies a feed-forward transformation:

  h^(l+1)_i = FF(a_i; W_1, W_2)    (4)
            = W_1 σ(W_2 λ(a_i + h^(l)_i)) + a_i + h^(l)_i .    (5)

Here σ denotes a nonlinearity, e.g. a Gaussian error linear unit (GeLU; Hendrycks & Gimpel, 2016):

  σ(x) = (x/2) (1 + erf(x/√2)) ,    (6)

and λ denotes layer normalization (Ba et al., 2016):

  λ(x) = (x − E[x]) / √(Var[x]) ,    (7)

where the expectation and variance are computed across the entries of x. To map from x to y, a transformer applies a sequence of such layers, each with its own parameters. We use θ to denote a model's full set of parameters (the complete collection of W matrices across layers). The three main factors governing the computational capacity of a transformer are its depth (the number of layers), its hidden size (the dimension of the vectors h), and the number of heads (denoted m above).
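The layer defined by Eqs. (1)-(5) can be sketched directly in numpy. This is a minimal reference implementation, not the trained models studied in the paper; the head count m, head dimension k, and feed-forward width are arbitrary choices here, and attention logits are unscaled, as in the equations above.

```python
import numpy as np
from math import erf, sqrt

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# GeLU, sigma(x) = (x/2)(1 + erf(x/sqrt(2)))  (Eq. 6)
gelu = np.vectorize(lambda v: 0.5 * v * (1.0 + erf(v / sqrt(2.0))))

def layer_norm(x, eps=1e-6):
    # lambda(x) = (x - E[x]) / sqrt(Var[x])  (Eq. 7), applied per column
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def decoder_layer(H, WQ, WK, WV, WF, W1, W2):
    """One autoregressive transformer layer (Eqs. 1-5).
    H: (hidden, T); WQ/WK/WV: lists of per-head (k, hidden) matrices."""
    hidden, T = H.shape
    A = np.zeros_like(H)
    for i in range(T):                  # position i attends only to the prefix H[:, :i+1]
        ctx = H[:, : i + 1]
        heads = []
        for Wq, Wk, Wv in zip(WQ, WK, WV):
            scores = (Wq @ H[:, i]) @ (Wk @ ctx)      # (i+1,) attention logits
            heads.append((Wv @ ctx) @ softmax(scores))
        A[:, i] = WF @ np.concatenate(heads)          # Eq. (2)
    Z = A + H                                         # residual stream
    normed = np.apply_along_axis(layer_norm, 0, Z)
    return W1 @ gelu(W2 @ normed) + Z                 # Eq. (5)
```

Because each column attends only to its own prefix, perturbing a later input column leaves all earlier output columns unchanged; this causal masking is what makes the decoder-only architecture usable for in-context learning.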

2.2. TRAINING FOR IN-CONTEXT LEARNING

We study transformer models directly trained on an ICL objective. (Some past work has found that ICL also "emerges" in models trained on general text datasets; Brown et al., 2020.) To train a transformer T with parameters θ to perform ICL, we first define a class of functions F, a distribution p(f) supported on F, a distribution p(x) over the domain of functions in F, and a loss function L. We then choose θ to optimize the auto-regressive objective:

  argmin_θ E_{x_1,...,x_n ∼ p(x), f ∼ p(f)} [ Σ_{i=1}^n L(f(x_i), T_θ([x_1, f(x_1), ..., x_i])) ] ,    (8)

where the resulting T_θ is an in-context learner.
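For the linear regression tasks of Section 2.3, the inner expectation of Eq. (8) can be estimated by sampling a task and scoring a candidate predictor on every prefix. A minimal sketch, with a min-norm least-squares predictor standing in for the trained T_θ (the function and variable names here are ours, not the paper's):

```python
import numpy as np

def sample_task(d, n, rng):
    """Sample w ~ p(w) = N(0, I) and x ~ p(x) = N(0, I); y = w.x (noiseless)."""
    w = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    return X, X @ w, w

def icl_loss(predict, X, y):
    """One Monte Carlo term of Eq. (8): summed squared error over all prefixes,
    where predict(Xc, yc, xq) maps a context and a query to a scalar."""
    return sum((y[i] - predict(X[:i], y[:i], X[i])) ** 2 for i in range(len(y)))

def ols_predict(Xc, yc, xq):
    """Reference in-context 'learner': min-norm least squares on the context."""
    if len(yc) == 0:
        return 0.0
    w_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return float(w_hat @ xq)

def zero_predict(Xc, yc, xq):
    return 0.0
```

On noiseless tasks the min-norm predictor fits every prefix with at least d examples exactly, so its objective value comes entirely from the under-determined prefixes.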

2.3. LINEAR REGRESSION

Our experiments focus on linear regression problems. In these problems, F is the space of linear functions f(x) = w⊤x where w, x ∈ R^d, and the loss function is the squared error L(y, y′) = (y − y′)². Linear regression is a model problem in machine learning and statistical estimation, with diverse algorithmic solutions. It thus offers an ideal test-bed for understanding ICL. Given a dataset with inputs X = [x_1, ..., x_n] and y = [y_1, ..., y_n], the (regularized) linear regression objective:

  Σ_i L(w⊤x_i, y_i) + λ∥w∥²_2    (9)

is minimized by:

  w* = (X⊤X + λI)^{-1} X⊤y .    (10)

With λ = 0, this objective is known as ordinary least squares regression (OLS); with λ > 0, it is known as ridge regression (Hoerl & Kennard, 1970). (As discussed further in Section 4, ridge regression can also be assigned a Bayesian interpretation.) To present a linear regression problem to a transformer, we encode both x and f(x) as (d+1)-dimensional vectors: x̃_i = [0, x_i], ỹ_i = [y_i, 0_d], where 0_d denotes the d-dimensional zero vector.
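Eqs. (9)-(10) and the token encoding can be written out directly. A minimal sketch (the helper names are ours):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimizer of Eq. (9): w* = (X^T X + lam*I)^{-1} X^T y; lam = 0 gives OLS."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def encode(X, y):
    """Encode a regression problem as (d+1)-dimensional columns,
    x~_i = [0, x_i] and y~_i = [y_i, 0_d], interleaved as in Section 2.3."""
    n, d = X.shape
    cols = []
    for i in range(n):
        cols.append(np.concatenate(([0.0], X[i])))           # x~_i
        cols.append(np.concatenate(([y[i]], np.zeros(d))))   # y~_i
    return np.stack(cols, axis=1)  # shape (d+1, 2n)
```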

3. WHAT LEARNING ALGORITHMS CAN A TRANSFORMER IMPLEMENT?

For a transformer-based model to solve Eq. (9) by implementing an explicit learning algorithm, that learning algorithm must be implementable via Eq. (1) and Eq. (4) with some fixed choice of transformer parameters θ. In this section, we prove constructively that such parameterizations exist, giving concrete implementations of two standard learning algorithms. These proofs yield upper bounds on how many layers and hidden units suffice to implement (though not necessarily learn) each algorithm. Proofs are given in Appendices A and B.

3.1. PRELIMINARIES

It will be useful to first establish a few computational primitives with simple transformer implementations. Consider the following four functions from R^{H×T} → R^{H×T}:

• mov(H; s, t, i, j, i′, j′): selects the entries of the s-th column of H between rows i and j, and copies them into the t-th column (t ≥ s) of H between rows i′ and j′, leaving the rest of H unchanged.

• mul(H; a, b, c, (i, j), (i′, j′), (i″, j″)): in each column h of H, interprets the entries between i and j as an a × b matrix A_1, and the entries between i′ and j′ as a b × c matrix A_2, multiplies these matrices together, and stores the result between rows i″ and j″, yielding a matrix in which each column has the form [h_{:i″−1}, A_1 A_2, h_{j″:}]⊤.

• div(H; (i, j), i′, (i″, j″)): in each column h of H, divides the entries between i and j by the absolute value of the entry at i′, and stores the result between rows i″ and j″, yielding a matrix in which every column has the form [h_{:i″−1}, h_{i:j}/|h_{i′}|, h_{j″:}]⊤.

• aff(H; (i, j), (i′, j′), (i″, j″), W_1, W_2, b): in each column h of H, applies an affine transformation to the entries between i and j and between i′ and j′, then stores the result between rows i″ and j″, yielding a matrix in which every column has the form [h_{:i″−1}, W_1 h_{i:j} + W_2 h_{i′:j′} + b, h_{j″:}]⊤.

Lemma 1. Each of mov, mul, div and aff can be implemented by a single transformer decoder layer: in Eq. (1) and Eq. (4), there exist matrices W^Q, W^K, W^V, W^F, W_1 and W_2 such that, given a matrix H as input, the layer's output has the form of the corresponding function output above.

With these operations, we can implement the building blocks of two important learning algorithms.
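As a functional reference (setting aside how each primitive is realized as attention and feed-forward weights in Lemma 1), the four operations can be sketched on a raw numpy array. Row ranges are half-open [i, j), an indexing convention we adopt here:

```python
import numpy as np

def mov(H, s, t, i, j, ip, jp):
    """Copy rows [i, j) of column s into rows [ip, jp) of column t."""
    H = H.copy()
    H[ip:jp, t] = H[i:j, s]
    return H

def mul(H, a, b, c, src1, src2, dst):
    """Per column: reshape rows src1 to (a, b) and src2 to (b, c); store the product in dst."""
    H = H.copy()
    for col in range(H.shape[1]):
        A1 = H[src1[0]:src1[1], col].reshape(a, b)
        A2 = H[src2[0]:src2[1], col].reshape(b, c)
        H[dst[0]:dst[1], col] = (A1 @ A2).ravel()
    return H

def div(H, src, ip, dst):
    """Per column: divide rows src by the absolute value of the entry at row ip."""
    H = H.copy()
    H[dst[0]:dst[1], :] = H[src[0]:src[1], :] / np.abs(H[ip, :])
    return H

def aff(H, src1, src2, dst, W1=None, W2=None, b=None):
    """Per column: store W1 h_src1 + W2 h_src2 + b in rows dst."""
    H = H.copy()
    out = np.zeros((dst[1] - dst[0], H.shape[1]))
    if W1 is not None:
        out += W1 @ H[src1[0]:src1[1], :]
    if W2 is not None:
        out += W2 @ H[src2[0]:src2[1], :]
    if b is not None:
        out += b[:, None]
    H[dst[0]:dst[1], :] = out
    return H
```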

3.2. GRADIENT DESCENT

Rather than directly solving linear regression problems by evaluating Eq. (10), a standard approach to learning exploits a generic loss minimization framework, and optimizes the ridge-regression objective in Eq. (9) via gradient descent on the parameters w. This involves repeatedly computing updates:

  w′ = w − α ∇_w [ L(w⊤x_i, y_i) + λ∥w∥²_2 ]
     = w − 2α (x_i (w⊤x_i) − y_i x_i + λw)    (11)

for different examples (x_i, y_i), and finally predicting w′⊤x_n on a new input x_n. A step of this gradient descent procedure can be implemented by a transformer:

Theorem 1. A transformer can compute Eq. (11) (i.e. the prediction resulting from a single step of gradient descent on an in-context example) with a constant number of layers and O(d) hidden space, where d is the problem dimension of the input x. Specifically, there exist transformer parameters θ such that, given an input matrix of the form:

  H^(0) = [ ··· 0    y_i  0    ··· ]
          [ ··· x_i  0    x_n  ··· ] ,    (12)

the transformer's output matrix H^(L) contains an entry equal to w′⊤x_n (Eq. (11)) at the column index where x_n is input.
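Eq. (11) admits a quick numerical check (the helpers below are ours, outside any transformer): one update computed this way decreases the regularized single-example loss for a small enough step size α.

```python
import numpy as np

def gd_step(w, x, y, alpha, lam):
    """Eq. (11): w' = w - 2*alpha*(x (w.x) - y x + lam*w)."""
    return w - 2.0 * alpha * ((w @ x) * x - y * x + lam * w)

def reg_loss(w, x, y, lam):
    """Single-example ridge objective from Eq. (9)."""
    return (w @ x - y) ** 2 + lam * (w @ w)
```

A useful special case: with λ = 0 and α = 1/(2∥x∥²), a single step fits the presented example exactly, since w′⊤x = w⊤x − (w⊤x − y) = y.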

3.3. CLOSED-FORM REGRESSION

Another way to solve the linear regression problem is to directly compute the closed-form solution Eq. (10). This is somewhat challenging computationally, as it requires inverting the regularized covariance matrix X⊤X + λI. However, one can exploit the Sherman-Morrison formula (Sherman & Morrison, 1950) to reduce the inverse to a sequence of rank-one updates performed example-by-example. For any invertible square matrix A and vectors u, v:

  (A + uv⊤)^{-1} = A^{-1} − (A^{-1} u v⊤ A^{-1}) / (1 + v⊤ A^{-1} u) .    (13)

Because the covariance matrix X⊤X in Eq. (10) can be expressed as a sum of rank-one terms, each involving a single training example x_i, this can be used to construct an iterative algorithm for computing the closed-form ridge-regression solution.

Theorem 2. A transformer can predict according to a single Sherman-Morrison update:

  w′ = (λI + x_i x_i⊤)^{-1} x_i y_i
     = [ I/λ − ((I/λ) x_i x_i⊤ (I/λ)) / (1 + x_i⊤ (I/λ) x_i) ] x_i y_i    (14)

with a constant number of layers and O(d²) hidden space. More precisely, there exists a set of transformer parameters θ such that, given an input matrix of the form in Eq. (12), the transformer's output matrix H^(L) contains an entry equal to w′⊤x_n (Eq. (14)) at the column index where x_n is input.

Discussion. There are various existing universality results for transformers (Yun et al., 2020; Wei et al., 2021), and for neural networks more generally (Hornik et al., 1989). These generally require very high precision, very deep models, or the use of an external "tape", none of which appear to be important for in-context learning in the real world. The results in this section establish sharper upper bounds on the capacity required to implement learning algorithms specifically, bringing theory closer to the range where it can explain existing empirical findings.
Related theoretical constructions have been given, in the context of meta-learning, for linear self-attention models (Schlag et al., 2021) and for other neural architectures such as recurrent neural networks (Kirsch & Schmidhuber, 2021). We emphasize that Theorem 1 and Theorem 2 each show the implementation of a single step of an iterative algorithm; these results can be straightforwardly generalized to the multi-step case by "stacking" groups of transformer layers. As described next, it is these iterative algorithms that capture the behavior of real learners.
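The rank-one scheme behind Theorem 2 can be checked numerically. The sketch below (our own code, not the transformer construction) maintains (λI + Σ_{j≤i} x_j x_j⊤)^{-1} via Eq. (13) and reproduces the closed-form ridge solution of Eq. (10) exactly:

```python
import numpy as np

def sherman_morrison_ridge(X, y, lam):
    """Ridge weights via example-by-example rank-one inverse updates (Eq. 13)."""
    n, d = X.shape
    A_inv = np.eye(d) / lam          # (lam * I)^{-1}: the state before any example
    b = np.zeros(d)
    for x_i, y_i in zip(X, y):
        Au = A_inv @ x_i             # A_inv is symmetric, so u = v = x_i gives outer(Au, Au)
        A_inv = A_inv - np.outer(Au, Au) / (1.0 + x_i @ Au)
        b += y_i * x_i               # accumulate X^T y
    return A_inv @ b                 # (X^T X + lam*I)^{-1} X^T y
```

Each loop iteration is the analogue of one "layer group" in the construction: a rank-one inverse update plus a moment accumulation, with O(d²) state carried between iterations.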

4. WHAT COMPUTATION DOES AN IN-CONTEXT LEARNER PERFORM?

The previous section showed that the building blocks for two specific procedures (gradient descent on the least-squares objective and closed-form computation of its minimizer) are implementable by transformer networks. These constructions show that, in principle, fixed transformer parameterizations are expressive enough to simulate these learning algorithms. When trained on real datasets, however, in-context learners might implement other learning algorithms. In this section, we investigate the empirical properties of trained in-context learners in terms of their behavior. In the framework of Marr's (2010) "levels of analysis", we aim to explain ICL at the computational level by identifying the kinds of algorithms for regression problems that transformer-based ICL implements.

4.1. BEHAVIORAL METRICS

Determining which learning algorithms best characterize ICL predictions requires first quantifying the degree to which two predictors agree. We use two metrics to do so:

Squared prediction difference. Given any learning algorithm A that maps from a set of input-output pairs D = [x_1, y_1, ..., x_n, y_n] to a predictor f(x) = A(D)(x), we define the squared prediction difference (SPD):

  SPD(A_1, A_2) = E_{D ∼ p(D), x′ ∼ p(x)} [ (A_1(D)(x′) − A_2(D)(x′))² ] ,    (15)

where D is sampled as in Eq. (8). SPD measures agreement at the output level, regardless of the algorithm used to compute this output.

Implicit linear weight difference. When ground-truth predictors all belong to a known, parametric function class (as with the linear functions here), we may also investigate the extent to which different learners agree on the parameters themselves. Given an algorithm A, we sample a context dataset D as above, and an additional collection of unlabeled test inputs D_X′ = {x′_i}. We then compute A's prediction on each x′_i, yielding a predictor-specific dataset D_A = {(x′_i, ŷ_i)} = {(x′_i, A(D)(x′_i))} encapsulating the function learned by A. Next we compute the implied parameters:

  ŵ_A = argmin_w Σ_i (ŷ_i − w⊤x′_i)² .    (16)

We can then quantify agreement between two predictors A_1 and A_2 by computing the distance between their implied weights in expectation over datasets:

  ILWD(A_1, A_2) = E_D E_{D_X′} [ ∥ŵ_{A_1} − ŵ_{A_2}∥²_2 ] .    (17)

When the predictors are not linear, ILWD measures the difference between the closest linear predictors (in the sense of Eq. (16)) to each algorithm. For algorithms with a linear hypothesis space (e.g. ridge regression), we use the actual value of ŵ_A instead of the estimated value.
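Both metrics admit straightforward Monte Carlo estimators. A sketch under our own conventions (an algorithm is a function from a context (X, y) to a predictor; p(D) and p(x) are standard Gaussians, as in Section 4.2):

```python
import numpy as np

def ridge_alg(lam):
    """A reference algorithm A: context -> linear predictor (lam = 0 is min-norm OLS)."""
    def fit(X, y):
        # ridge via an augmented least-squares system: min ||Xw - y||^2 + lam ||w||^2
        d = X.shape[1]
        w, *_ = np.linalg.lstsq(
            np.vstack([X, np.sqrt(lam) * np.eye(d)]),
            np.concatenate([y, np.zeros(d)]), rcond=None)
        return lambda xq: float(w @ xq)
    return fit

def spd(A1, A2, d=8, n=16, n_q=32, trials=20, seed=0):
    """Monte Carlo estimate of Eq. (15): mean squared disagreement on fresh queries."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(trials):
        w = rng.normal(size=d)
        X = rng.normal(size=(n, d))
        y = X @ w
        f1, f2 = A1(X, y), A2(X, y)
        for xq in rng.normal(size=(n_q, d)):
            diffs.append((f1(xq) - f2(xq)) ** 2)
    return float(np.mean(diffs))

def implied_weights(A, X, y, Xq):
    """Eq. (16): the linear weights best explaining A's predictions on queries Xq."""
    preds = np.array([A(X, y)(xq) for xq in Xq])
    w_hat, *_ = np.linalg.lstsq(Xq, preds, rcond=None)
    return w_hat
```

ILWD is then the expected squared distance between `implied_weights` of the two algorithms over sampled contexts and query sets.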

4.2. EXPERIMENTAL SETUP

We train a transformer decoder autoregressively on the objective in Eq. (8). For all experiments, we perform a hyperparameter search over depth L ∈ {1, 2, 4, 8, 12, 16}, hidden size H ∈ {16, 32, 64, 256, 512, 1024} and number of heads M ∈ {1, 2, 4, 8}. Other hyperparameters are noted in Appendix D. For our main experiments, we found that L = 16, H = 512, M = 4 minimized loss on a validation set. We follow the training guidelines in Garg et al. (2022), and trained models for 500,000 iterations, with each in-context dataset consisting of 40 (x, y) pairs. For the main experiments we generate data according to p(w) = N(0, I) and p(x) = N(0, I).

4.3. RESULTS

ICL matches ordinary least squares predictions on noiseless datasets. We begin by comparing a (L = 16, H = 512, M = 4) transformer against a variety of reference predictors:

• k-nearest neighbors: In the uniform variant, models predict ŷ_i = (1/3) Σ_j y_j, where j ranges over the 3 previous data points (j < i) closest to x_i. In the weighted variant, a weighted average ŷ_i ∝ Σ_j |x_i − x_j|^{-2} y_j is computed, normalized by the total weight assigned to the y_j s.

• One-pass stochastic gradient descent: ŷ_i = w_i⊤x_i, where w_i is obtained by stochastic gradient descent on the previous examples with batch size 1: w_i = w_{i−1} − 2α(x_{i−1}(w_{i−1}⊤x_{i−1}) − y_{i−1} x_{i−1} + λ w_{i−1}).

• One-step batch gradient descent: ŷ_i = w_i⊤x_i, where w_i is obtained by one step of gradient descent on the batch of previous examples: w_i = w_0 − 2α(X⊤X w_0 − X⊤Y + λ w_0).

• Ridge regression: ŷ_i = w′⊤x_i, where w′ = (X⊤X + λI)^{-1} X⊤Y. We denote the case of λ = 0 as OLS.

The agreement between the transformer-based ICL and these predictors is shown in Fig. 1. As can be seen, there are clear differences in fit to predictors: for almost any number of examples, normalized SPD and ILWD are small between the transformer and the OLS predictor (with squared error less than 0.01), while other predictors (especially nearest neighbors) agree considerably less well. When the number of examples is less than the input dimension d = 8, the linear regression problem is under-determined, in the sense that multiple linear models can exactly fit the in-context training dataset. In these cases, OLS regression selects the minimum-norm weight vector, and (as shown in Fig. 1) the in-context learner's predictions are reliably consistent with this minimum-norm predictor. Why, when presented with an ambiguous dataset, should ICL behave like this particular predictor?
One possibility is that, because the weights used to generate the training data are sampled from a Gaussian centered at zero, ICL learns to output the minimum-Bayes-risk solution when predicting under uncertainty. Building on these initial findings, our next set of experiments investigates whether ICL is behaviorally equivalent to Bayesian inference more generally.

ICL matches the minimum-Bayes-risk predictor on noisy datasets. To more closely examine the behavior of ICL algorithms under uncertainty, we add noise to the training data: now we present the in-context dataset as a sequence [x_1, f(x_1) + ϵ_1, ..., x_n, f(x_n) + ϵ_n], where each ϵ_i ∼ N(0, σ²). Recall that ground-truth weight vectors are themselves sampled from a Gaussian distribution; together, this choice of prior and noise means that the learner cannot be certain about the target function given any number of examples. Standard Bayesian statistics gives that the optimal predictor for minimizing the loss in Eq. (8) is:

  ŷ = E[y | x, D] .    (18)

For linear regression with Gaussian priors and Gaussian noise, the Bayesian estimator in Eq. (18) has a closed-form expression:

  ŵ = (X⊤X + (σ²/τ²) I)^{-1} X⊤Y ;  ŷ = ŵ⊤x .    (19)

Note that this predictor has the same form as the ridge predictor from Section 2.3, with the regularization parameter set to σ²/τ². In the presence of noisy labels, does ICL match this Bayesian predictor? We explore this by varying both the dataset noise σ² and the prior variance τ² (sampling w ∼ N(0, τ²I)). For these experiments, the SPD values between the in-context learner and various regularized linear models are shown in Fig. 2. As predicted, as the noise variance increases, the value of the ridge parameter that best explains ICL behavior also increases. For all values of σ² and τ², the ridge parameter that gives the best fit to the transformer's behavior is also the one that minimizes Bayes risk.
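The claim that ridge regression with λ = σ²/τ² minimizes Bayes risk can be spot-checked by simulation. A sketch under the paper's generative assumptions (w ∼ N(0, τ²I), ϵ ∼ N(0, σ²)); trial counts and problem sizes here are our own choices:

```python
import numpy as np

def ridge_w(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def avg_risk(lam, d=8, n=12, sigma=1.0, tau=1.0, trials=500, seed=0):
    """Average squared error of ridge(lam) predictions against the noiseless target."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        w = tau * rng.normal(size=d)                 # w ~ N(0, tau^2 I)
        X = rng.normal(size=(n, d))
        y = X @ w + sigma * rng.normal(size=n)       # noisy labels
        w_hat = ridge_w(X, y, lam)
        xq = rng.normal(size=d)                      # fresh query point
        total += (w_hat @ xq - w @ xq) ** 2
    return total / trials
```

With σ = τ = 1 the Bayes-optimal regularizer is λ = σ²/τ² = 1; its risk sits below both the near-OLS setting λ ≈ 0 (unstable when n is close to d) and a heavily over-regularized setting.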
These experiments clarify the finding above, showing that ICL in this setting behaviorally matches the minimum-Bayes-risk predictor. We also note that as the noise level σ → 0⁺, the Bayes predictor converges to the ordinary least squares predictor; the results on noiseless datasets at the beginning of this subsection can thus be viewed as corroborating the finding here in the σ → 0⁺ limit.

ICL exhibits algorithmic phase transitions as model depth increases. The two experiments above evaluated extremely high-capacity models in which (given the findings in Section 3) computational constraints are not likely to play a role in the choice of algorithm implemented by ICL. But what about smaller models: does the size of an in-context learner play a role in determining the learning algorithm it implements? To answer this question, we run two final behavioral experiments: one in which we vary the hidden size (while optimizing the depth and number of heads as in Section 4.2), and one in which we vary the depth of the transformer (while optimizing the hidden size and number of heads). These experiments are conducted without dataset noise.

Figure 3: Computational constraints on ICL. We show SPD averaged over the under-determined region of the linear regression problem. In-context learners behaviorally match ordinary least squares predictors given enough layers and hidden units. When model depth varies (left), algorithmic "phases" emerge: models transition between being closest to gradient descent (red background), ridge regression (green background), and OLS regression (blue background).

Results are shown in Fig. 3. When we vary the depth, learners occupy three distinct regimes: very shallow models (1L) are best approximated by a single step of gradient descent (though not well-approximated in an absolute sense). Slightly deeper models (2L-4L) are best approximated by ridge regression, while the deepest models (8L and above) match OLS, as observed in Fig. 3.
A similar phase shift occurs when we vary the hidden size on a 16-dimensional problem. Interestingly, the hidden size required to approach ridge-regression-like solutions can be read off as H ≥ 16 for 8-dimensional problems and H ≥ 32 for 16-dimensional problems, suggesting that ICL discovers more efficient ways to use the available hidden state than our theoretical constructions, which require O(d²) space. Together, these results show that ICL does not necessarily involve minimum-risk prediction. However, even in models too computationally constrained to perform Bayesian inference, alternative interpretable computations can emerge.

5. DOES ICL ENCODE MEANINGFUL INTERMEDIATE QUANTITIES?

Section 4 showed that transformers are a good fit to standard learning algorithms (including those constructed in Section 3) at the computational level. But these experiments leave open the question of how these computations are implemented at the algorithmic level. How do transformers arrive at the solutions in Section 4, and what quantities do they compute along the way? Research on extracting precise algorithmic descriptions of learned models is still in its infancy (Cammarata et al., 2020; Mu & Andreas, 2020). However, we can gain insight into ICL by inspecting learners' intermediate states: asking what information is encoded in these states, and where.

To do so, we identify two intermediate quantities that we expect to be computed by the gradient descent and ridge-regression variants: the moment vector X⊤Y and the (min-norm) least-squares estimated weight vector w_OLS, each calculated after feeding n exemplars. We take a trained in-context learner, freeze its weights, then train an auxiliary probing model (Alain & Bengio, 2016) to attempt to recover the target quantities from the learner's hidden representations. Specifically, the probe model takes the hidden states at a layer H^(l) as input, then outputs a prediction for the target variable. We define a probe with position-attention that computes (Appendix E):

  α = softmax(s_v)    (20)
  v̂ = FF_v(α⊤ W_v H^(l))    (21)

We train this probe to minimize the squared error between predictions and targets v: L(v̂, v) = ∥v̂ − v∥². The probe performs two functions simultaneously: its prediction error on held-out representations determines the extent to which the target quantity is encoded, while its attention mask α identifies the location in which the target quantity is encoded. For the FF term, we can insert the function approximator of our choosing; by changing this term we can determine the manner in which the target quantity is encoded: e.g. if FF is a linear model and the probe achieves low error, then we may infer that the target is encoded linearly.

For each target, we train a separate probe for the value of the target on each prefix of the dataset: i.e. one probe to decode the value of w computed from a single training example, a second probe to decode the value for two examples, etc. Results are shown in Fig. 4. For both targets, a 2-layer MLP probe outperforms a linear probe, meaning that these targets are encoded non-linearly (unlike in the constructions in Section 3). However, probing also reveals similarities. Both targets are decoded accurately deep in the network (but inaccurately in the input layer, indicating that probe success is non-trivial). Probes attend to the correct timestamps when decoding them. As in both constructions, X⊤Y appears to be computed first, becoming predictable by the probe relatively early in the computation (layer 7), while w becomes predictable later (around layer 12). For comparison, we additionally report results on a control task in which the transformer predicts ys generated with a fixed weight vector w = 1 (so no ICL is required). Probes applied to these models perform significantly worse at recovering moment matrices (see Appendix E for details).
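The forward pass of the position-attention probe in Eqs. (20)-(21) is simple to state in code. The sketch below is a toy forward pass only (training of s_v, W_v and FF_v by minimizing the squared error is omitted; shapes and names are our own), illustrating how the learned scores α localize where a target is stored:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def probe_forward(H, s, Wv, ff):
    """Eqs. (20)-(21): alpha = softmax(s); v_hat = ff(alpha^T Wv H).
    H: (T, hidden) hidden states; s: (T,) learned position scores."""
    alpha = softmax(s)                 # attention over positions
    pooled = alpha @ (H @ Wv.T)        # convex combination of projected hidden states
    return ff(pooled), alpha
```

If, for instance, a target vector were stored verbatim at one position, a probe whose scores concentrate there with an identity FF would read it out exactly; in the paper, the residual probe error and the choice of FF (linear vs. 2-layer MLP) are what carry the analysis.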

6. CONCLUSION

We have presented a set of experiments characterizing the computations underlying in-context learning of linear functions in transformer sequence models. We showed that these models are capable in theory of implementing multiple linear regression algorithms, that they empirically implement this range of algorithms (transitioning between algorithms depending on model capacity and dataset noise), and finally that they can be probed for intermediate quantities computed by these algorithms. While our experiments have focused on the linear case, they can be extended to many learning problems over richer function classes-e.g. to a network whose initial layers perform a non-linear feature computation. Even more generally, the experimental methodology here could be applied to larger-scale examples of ICL, especially language models, to determine whether their behaviors are also described by interpretable learning algorithms. While much work remains to be done, our results offer initial evidence that the apparently mysterious phenomenon of in-context learning can be understood with the standard ML toolkit, and that the solutions to learning problems discovered by machine learning researchers may be discovered by gradient descent as well.

A THEOREM 1

The operations for 1-step SGD with single exemplar can be expressed as following chain (please see proofs for the Transformer implementation of these operations (Lemma 1) in Appendix C): • mov(; 1, 0, (1, 1 + d), (1, 1 + d)) (move x) • aff(; (1, 1 + d), (), (1 + d, 2 + d), W 1 = w) (w ⊤ x) • aff(; (1 + d, 2 + d), (0, 1), (2 + d, 3 + d), W 1 = I, W 2 = -I) (w ⊤ x -y) • mul(; d, 1, 1, (1, 1 + d), (2 + d, 3 + d), (3 + d, 3 + 2d)) (x(w ⊤ x -y)) • aff(; (), (), (3 + 2d, 3 + 3d), b = w, ) (write w) • aff(; (3 + d, 3 + 2d), (3 + 2d, 3 + 3d), (3 + 3d, 3 + 4d), W 1 = I, W 2 = -λ) (x(w ⊤ x -y) -λw) • aff(; (3 + 2d, 3 + 3d), (3 + 3d, 3 + 4d), (3 + 2d, 3 + 3d), W 1 = I, W 2 = -2α, ) (w ′ ) • mov(; 2, 1, (3 + 2d, 3 + 3d), (3 + 2d, 3 + 3d)) (move w ′ ) • mul(; 1, d, 1, (3 + 2d, 3 + 3d), (1, 1 + d), (3 + 3d, 4 + 3d)) (w ′ ⊤ x 2 ) This will map:               0 y 1 0 x 1 0 x 2               →                0 y 1 0 x 1 x 1 x 2 w ⊤ x 1 w ⊤ x 1 w ⊤ x 2 w ⊤ x 1 w ⊤ x 1 -y w ⊤ x 2 x 1 w ⊤ x 1 x 1 (w ⊤ x 1 -y) x 2 w ⊤ x 1 w w w x 1 w ⊤ x 1 -λw x 1 (w ⊤ x 1 -y) -λw x 2 w ⊤ x 1 -λw w -2α(x 1 w ⊤ x 1 -λw) w ′ w -2α(x 2 w ⊤ x 1 -λw) w -2α(x 1 w ⊤ x 1 -λw) w ′ w ′ (w -2α(x 1 w ⊤ x 1 -λw)) ⊤ x1 w ′ ⊤ x 1 w ′ ⊤ x 2                We can verify the chain of operator step-by-step. In each step, we show only the non-zero rows. 
• mov(; 1, 0, (1, 1+d), (1, 1+d)) (move $x$): the row block $(1, 1+d)$ becomes $[x_1,\; x_1,\; x_2]$.
• aff(; (1, 1+d), (), (1+d, 2+d), $W_1 = w$) ($w^\top x$): writes $[w^\top x_1,\; w^\top x_1,\; w^\top x_2]$ into $(1+d, 2+d)$.
• aff(; (1+d, 2+d), (0, 1), (2+d, 3+d), $W_1 = I$, $W_2 = -I$) ($w^\top x - y$): writes $[w^\top x_1,\; w^\top x_1 - y_1,\; w^\top x_2]$ into $(2+d, 3+d)$.
• mul(; d, 1, 1, (1, 1+d), (2+d, 3+d), (3+d, 3+2d)) ($x(w^\top x - y)$): writes $[x_1 w^\top x_1,\; x_1(w^\top x_1 - y_1),\; x_2 w^\top x_1]$ into $(3+d, 3+2d)$.
• aff(; (), (), (3+2d, 3+3d), $b = w$) (write $w$): writes $[w,\; w,\; w]$ into $(3+2d, 3+3d)$.
• aff(; (3+d, 3+2d), (3+2d, 3+3d), (3+3d, 3+4d), $W_1 = I$, $W_2 = -\lambda$) ($x(w^\top x - y) - \lambda w$): writes $[x_1 w^\top x_1 - \lambda w,\; x_1(w^\top x_1 - y_1) - \lambda w,\; x_2 w^\top x_1 - \lambda w]$ into $(3+3d, 3+4d)$.
• aff(; (3+2d, 3+3d), (3+3d, 3+4d), (3+2d, 3+3d), $W_1 = I$, $W_2 = -2\alpha$) ($w'$): overwrites $(3+2d, 3+3d)$ with $[w - 2\alpha(x_1 w^\top x_1 - \lambda w),\; w',\; w - 2\alpha(x_2 w^\top x_1 - \lambda w)]$, where $w' = w - 2\alpha(x_1(w^\top x_1 - y_1) - \lambda w)$. Because this step overwrites the scratch space holding $w$, it illustrates how space is reused across iterations.
• mov(; 2, 1, (3+2d, 3+3d), (3+2d, 3+3d)) (move $w'$): copies $w'$ forward, so $(3+2d, 3+3d)$ becomes $[w - 2\alpha(x_1 w^\top x_1 - \lambda w),\; w',\; w']$.
• mul(; 1, d, 1, (3+2d, 3+3d), (1, 1+d), (3+3d, 4+3d)) ($w'^\top x_2$): writes $[(w - 2\alpha(x_1 w^\top x_1 - \lambda w))^\top x_1,\; w'^\top x_1,\; w'^\top x_2]$ into $(3+3d, 4+3d)$.

We obtain the updated prediction in the last hidden unit of the third timestep.

Generalizing to multiple steps of SGD. Since $w'$ is written into the hidden state, we may repeat this iteration to obtain $\hat{y}_3 = w''^\top x_3$, where $w'' = w' - 2\alpha(x_2(w'^\top x_2 - y_2) - \lambda w')$ is the one-step update on the second example; a single pass through a dataset of $n$ examples thus requires a total of $O(n)$ layers.
As an empirical demonstration of this procedure, the accompanying code release contains a reference implementation of SGD defined in terms of the base primitives, provided at an anonymous link: https://icl1.s3.us-east-2.amazonaws.com/theory/{primitives,sgd,ridge}.py (to preserve anonymity, we did not provide the library dependencies). This implementation predicts ŷ_n = w_n^⊤ x_n, where w_n is the weight vector resulting from n − 1 consecutive SGD updates on the previous examples. It can be verified there that the procedure requires O(n + d) hidden space. Note that it is not O(nd), because the space used for intermediate variables can be reused in the next iteration; an example of this reuse is the (w′) step above, which overwrites the slot holding w.
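As a plain-NumPy companion to that reference implementation, the following sketch (our own illustration, not the released code) computes the same quantity the construction produces: one gradient step on the first example, followed by a prediction on the second. The function name `one_step_sgd_predict` and the learning rate `alpha` are ours, and we write the regularizer with a conventional ridge-penalty sign.

```python
import numpy as np

def one_step_sgd_predict(x1, y1, x2, w, alpha, lam=0.0):
    """Mirror the appendix's operation chain: one (regularized) gradient
    step on (x1, y1), then a prediction on x2."""
    pred = w @ x1                      # aff: w^T x
    resid = pred - y1                  # aff: w^T x - y
    grad = x1 * resid + lam * w        # mul/aff: x (w^T x - y) plus penalty
    w_new = w - 2 * alpha * grad       # aff: w' = w - 2 * alpha * (...)
    return w_new, w_new @ x2           # mul: w'^T x2

rng = np.random.default_rng(0)
d = 8
x1, x2, w = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
y1 = 1.5  # arbitrary target
w_new, y2_hat = one_step_sgd_predict(x1, y1, x2, w, alpha=0.01)
# With lam = 0, a small step must reduce the squared error on (x1, y1):
assert (w_new @ x1 - y1) ** 2 < (w @ x1 - y1) ** 2
```

Repeating the update in a loop over examples mirrors the O(n)-layer generalization described above.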

B THEOREM 2

We provide a construction similar to Theorem 1 (please see proofs for the Transformer implementation of these operations in Appendix C, and specifically Appendix C.6 for div):

• mov(; 1, 0, (1, 1+d), (1, 1+d)) (move $x_1$)
• mul(; d, 1, 1, (1, 1+d), (0, 1), (1+d, 1+2d)) ($x_1 y$)
• aff(; (), (), (1+2d, 1+2d+d²), $b = I_\lambda$) ($A_0^{-1} = I_\lambda$)
• mul(; d, d, 1, (1+2d, 1+2d+d²), (1, 1+d), (1+2d+d², 1+3d+d²)) ($A_0^{-1}u = I_\lambda x_1$)
• mul(; 1, d, d, (1, 1+d), (1+2d, 1+2d+d²), (1+3d+d², 1+4d+d²)) ($v^\top A_0^{-1} = x_1^\top I_\lambda$)
• mul(; d, 1, d, (1+2d+d², 1+3d+d²), (1+3d+d², 1+4d+d²), (1+4d+d², 1+4d+2d²)) ($A_0^{-1}uv^\top A_0^{-1} = I_\lambda x_1 x_1^\top I_\lambda$)
• mul(; 1, d, 1, (1+3d+d², 1+4d+d²), (1, 1+d), (1+4d+2d², 2+4d+2d²)) ($v^\top A_0^{-1}u = x_1^\top I_\lambda x_1$)
• aff(; (1+4d+2d², 2+4d+2d²), (), (1+4d+2d², 2+4d+2d²), $W_1 = 1$, $b = 1$) ($1 + v^\top A_0^{-1}u = 1 + x_1^\top I_\lambda x_1$)
• div(; (1+4d+d², 1+4d+2d²), 1+4d+2d², (2+4d+2d², 2+4d+3d²)) (right term)
• aff(; (1+2d, 1+2d+d²), (2+4d+2d², 2+4d+3d²), (1+2d, 1+2d+d²), $W_1 = I$, $W_2 = -I$) ($A_1^{-1}$)
• mul(; d, d, 1, (1+2d, 1+2d+d²), (1, 1+d), (2+4d+3d², 2+5d+3d²)) ($A_1^{-1}x_1$)
• mul(; d, 1, 1, (2+4d+3d², 2+5d+3d²), (0, 1), (2+4d+3d², 2+5d+3d²)) ($A_1^{-1}x_1 y_1$)
• mov(; 2, 1, (2+4d+3d², 2+5d+3d²), (2+4d+3d², 2+5d+3d²)) (move $w'$)
• mul(; 1, d, 1, (2+4d+3d², 2+5d+3d²), (1, 1+d), (2+5d+3d², 3+5d+3d²)) ($w'^\top x_2$)

Note that, in contrast to Appendix A, we need O(d²) space to implement the matrix multiplications; the overall required hidden size is therefore O(d²). As in Theorem 1, generalizing to multiple iterations requires at least O(n) layers, as we repeat the process for each subsequent example.
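The chain above is a Sherman–Morrison update: it maintains $A_t^{-1} = (\lambda I + \sum_i x_i x_i^\top)^{-1}$ via rank-one corrections and reads out ridge weights as $A^{-1}X^\top Y$. A minimal NumPy sketch of the same computation (function name ours, not the paper's released code):

```python
import numpy as np

def ridge_by_sherman_morrison(X, Y, lam):
    """Incrementally build A^{-1} = (lam*I + sum_i x_i x_i^T)^{-1},
    one example at a time, as in the Theorem 2 construction."""
    d = X.shape[1]
    A_inv = np.eye(d) / lam                             # A_0^{-1} = I_lambda
    b = np.zeros(d)
    for x, y in zip(X, Y):
        u = A_inv @ x                                   # A^{-1} u
        A_inv = A_inv - np.outer(u, u) / (1.0 + x @ u)  # rank-one correction
        b = b + x * y                                   # accumulate X^T Y
    return A_inv @ b                                    # ridge weights

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(10, 4)), rng.normal(size=10)
w_sm = ridge_by_sherman_morrison(X, Y, lam=0.5)
w_direct = np.linalg.solve(0.5 * np.eye(4) + X.T @ X, X.T @ Y)
assert np.allclose(w_sm, w_direct)
```

The per-example update is exactly what the div primitive enables: dividing the outer-product term by the scalar $1 + v^\top A^{-1} u$.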
C LEMMA 1

All of the operators mentioned in this lemma share a common computational structure, and can in fact be implemented as special cases of a "base primitive" we call RAW (for Read-Arithmetic-Write). This operator may also be useful for future work aimed at implementing other algorithms. The structure of our proof of Lemma 1 is as follows:
1. Motivation of the base primitive RAW.
2. Formal definition of RAW.
3. Definition of dot, aff, mov in terms of RAW.
4. Implementation of RAW in terms of transformer parameters.
5. Brief discussion of how to parallelize RAW, making it possible to implement mul.
6. Separate proof for div, utilizing layer norm.

C.1 RAW OPERATOR: INTUITION

At a high level, all of the primitives in Lemma 1 involve a similar sequence of operations:

1) Operators read some hidden units from the current or previous timestep: dot and aff read from two subsets of indices in the current hidden state $h_i^{(l)}$, while mov reads from a previous hidden state $h_{t'}$. This selection is straightforwardly implemented using the attention component of a transformer layer. We may notate this reading operation as follows:

$$\underbrace{\frac{W_a}{|K(i)|} \sum_{k \in K(i)} h_k^{(l)}[r]}_{\text{Read with Attention}}$$

Here $r$ denotes a list of indices to read from, and $K$ denotes a map from current timesteps to target timesteps. For convenience, we use Numpy-like notation to denote indexing into a vector with another vector:

Definition C.1 (Bracket). $x[\cdot]$ is Python index notation: the resulting vector $y = x[r]$ has entries $y_j = x_{r_j}$ for $j = 1, \ldots, |r|$.

The first step of our proof below shows that the attention output $a^{(l)}$ can compute the expression above.

2) Operators perform element-wise arithmetic between the quantity read in step 1 and another set of entries from the current timestep. This step takes different forms for aff and mul (mov ignores values at the current timestep altogether):

$$\Big(\frac{W_a}{|K(i)|} \sum_{k \in K(i)} h_k^{(l)}[r]\Big) \odot W h_i^{(l)}[s] \quad \text{(multiplicative form)} \tag{23}$$
$$\Big(\frac{W_a}{|K(i)|} \sum_{k \in K(i)} h_k^{(l)}[r]\Big) + W h_i^{(l)}[s] \quad \text{(additive form)}$$

The second step of the proof below computes these operations inside the MLP component of the transformer layer.

3) Operators reduce, then write to the current hidden state. Once the underlying element-wise operation is calculated, the operator writes these values to some indices of the current hidden state, specified by a list of indices $w$. Writing may be preceded by a reduction step (e.g. for computing dot products), which can be expressed generically as a linear operator $W_o$.
The final form of the computation is thus:

$$h_i^{(l+1)}[w] \leftarrow W_o\Big(\underbrace{\frac{W_a}{|K(i)|} \sum_{k \in K(i)} h_k^{(l)}[r]}_{\text{Read with Attention}} \;\circledast\; W h_i^{(l)}[s]\Big)$$

Here $\leftarrow$ means that the remaining indices $j \notin w$ are copied from $h^{(l)}$.

C.2 RAW OPERATOR DEFINITION

We denote this "master operator" as RAW:

Definition C.2. RAW(h; $\circledast$, s, r, w, $W_o$, $W_a$, $W$, $K$) is a function $\mathbb{R}^{H \times T} \to \mathbb{R}^{H \times T}$. It is parameterized by an elementwise operator $\circledast \in \{+, \odot\}$, three matrices $W \in \mathbb{R}^{d \times |s|}$, $W_a \in \mathbb{R}^{d \times |r|}$, $W_o \in \mathbb{R}^{|w| \times d}$, three index sets $s$, $r$, and $w$, and a timestep map $K : \mathbb{Z}^+ \to 2^{\mathbb{Z}^+}$. Given an input matrix $h$, it outputs a matrix with entries:

$$h_{i,w}^{(l+1)} = W_o\Big(\Big(\frac{W_a}{|K(i)|}\sum_{k \in K(i)} h_k^{(l)}[r]\Big) \circledast W h_i^{(l)}[s]\Big) \qquad i = 1, \ldots, T$$
$$h_{i,j \notin w}^{(l+1)} = h_{i,j \notin w}^{(l)} \qquad i = 1, \ldots, T$$

We additionally require that $j \in K(i) \Rightarrow j < i$ (since self-attention is causal). For simplicity, we omit possible bias terms in the linear projections $W_o$, $W_a$, $W$; we can assume accompanying bias parameters $b_0$, $b_a$, $b$ when needed.

C.3 REDUCING LEMMA 1 OPERATORS TO THE RAW OPERATOR

Given this operator, we can define each primitive in Lemma 1 using a single RAW operator, except mul and div. Instead of the matrix multiplication operator mul, we first treat the dot product dot (a special case of mul); later in the proof, we argue that these dot products can be parallelized to obtain mul (Appendix C.5). We show how to implement div separately in Appendix C.6.

Lemma 2. We can define the mov and aff operators, and the dot-product case of mul, each using a single RAW operator:

dot(h; (i, j), (i′, j′), (i″, j″)) = mul(h; 1, |i − j|, 1, (i, j), (i′, j′), (i″, i″+1))
  = RAW(h; ⊙, W = I, $W_a$ = I, $W_o$ = $\mathbf{1}^\top$, s = (i, j), r = (i′, j′), w = (i″, i″+1), K = {(t, {t}) ∀t})

aff(h; (i, j), (i′, j′), (i″, j″), $W_1$, $W_2$, b)
  = RAW(h; +, W = $W_1$, $W_a$ = $W_2$, $W_o$ = I, $b_0$ = b, s = (i, j), r = (i′, j′), w = (i″, j″), K = {(t, {t}) ∀t})

mov(h; s, t, (i, j), (i′, j′))
  = RAW(h; +, W = 0, $W_a$ = I, $W_o$ = I, s = (), r = (i′, j′), w = (i, j), K = {(t, {s})})

Proof. Follows immediately by substituting parameters into Eq. (26).
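A minimal NumPy rendering of RAW and its dot-product reduction may make the bookkeeping concrete. This is our own sketch, not the paper's released code; it applies the operator to the columns of h directly rather than through attention, and the function names are ours.

```python
import numpy as np

def raw(h, op, s, r, w, W_o, W_a, W, K):
    """RAW(h; op, s, r, w, W_o, W_a, W, K): read rows r (averaged over the
    timesteps K(i)), combine with rows s of the current timestep, reduce
    with W_o, and write the result into rows w. h has shape (H, T)."""
    H, T = h.shape
    out = h.copy()
    for i in range(T):
        Ki = K(i)
        read = sum(W_a @ h[r, k] for k in Ki) / max(len(Ki), 1)
        cur = W @ h[s, i]
        val = read * cur if op == "mul" else read + cur
        out[w, i] = W_o @ val
    return out

# dot: elementwise-multiply two d-vectors stored in rows, reduce with 1^T.
d, T = 3, 2
h = np.zeros((10, T))
h[0:3, 0] = [1.0, 2.0, 3.0]
h[3:6, 0] = [4.0, 5.0, 6.0]
I, ones = np.eye(d), np.ones((1, d))
h2 = raw(h, "mul", s=[0, 1, 2], r=[3, 4, 5], w=[6],
         W_o=ones, W_a=I, W=I, K=lambda i: [i])
assert np.isclose(h2[6, 0], 1 * 4 + 2 * 5 + 3 * 6)
```

The mov and aff reductions follow the same pattern, with K selecting a different timestep and the additive branch of the elementwise step.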

C.4 IMPLEMENTING RAW

It remains only to show:

Lemma 3. A single transformer layer can implement the RAW operator: there exist settings of transformer parameters such that, given an arbitrary hidden matrix h as input, the transformer computes h′ (Eq. (26)) as output.

Our proof proceeds in stages. We begin by specifying the initial embedding and positional embedding layers, constructing inputs to the main transformer layer with the necessary positional information and scratch space. Next, we establish three useful procedures for bypassing (or exploiting) non-linearities in the feed-forward component of the transformer. Finally, we provide values for the remaining parameters, showing that we can implement the Elementwise and Reduction steps described above.

C.4.1 EMBEDDING LAYERS

Embedding Layer for Initialization: Rather than inserting the input matrix h directly into the transformer layer, we assume (as is standard) the existence of a linear embedding layer. We can set this layer to pad the input, providing extra scratch space that will be used by later steps of our implementation. We define the embedding matrix $W_e$ as:

$$W_e = \begin{bmatrix} I_{(d+1)\times(d+1)} & 0 \\ 0 & 0 \end{bmatrix}$$

Then the embedded inputs will be

$$\tilde{x}_i = W_e x_i = [0, x_i, 0_{H-d-1}]^\top \tag{29}$$
$$\tilde{y}_i = W_e y_i = [y_i, 0_{H-1}]^\top$$

Position Embeddings for Attention Manipulation: Implementing RAW ultimately requires controlling which position attends to which position in each layer. For example, we may wish to have layers in which each position attends only to the previous position, or in which even positions attend to other even positions. We can utilize position embeddings $p_i$ to control the attention weights. In a standard transformer, the position embedding matrix is a constant matrix added to the inputs of the transformer after the embedding layer (before the first layer), so the actual input to the transformer is:

$$h_i^0 = \tilde{x}_i + p_i \tag{31}$$

We will use these position embeddings to encode the timestep map K. To do this, we use 2p units per layer (p will be defined momentarily): p units encode attention keys $k_i$, and the other p encode queries $q_i$. We define the position embedding matrix as follows:

$$p_i = [0_{d+1}, k_i^0, q_i^0, \ldots, k_i^{(L)}, q_i^{(L)}, 0_{H-2pT-1}]^\top$$

With K encoded in the positional embeddings, the transformer matrices $W_Q$ and $W_K$ are easy to define: they need only retrieve the corresponding embedding values:

$$W_K^l = \big[\,\cdots \; I_{p\times p} \; 0_{p\times p} \; \cdots\,\big] \qquad W_Q^l = \big[\,\cdots \; 0_{p\times p} \; I_{p\times p} \; \cdots\,\big] \tag{33}$$

The constructions used in this paper rely on two specific timestep maps K, each of which can be implemented compactly in terms of k and q:

Case 1: Attend to previous token.
This can be constructed by setting:

$$k_i = e_i \qquad q_i = N e_{i-1}$$

where N is a sufficiently large number. In this case, the output of the attention mechanism will be:

$$\alpha = \mathrm{softmax}\big((W_Q h_i)^\top (W_K h_{:i})\big) = \mathrm{softmax}\big(q_i^\top [k_1, \ldots, k_i]\big) = \mathrm{softmax}([0, \ldots, N, \ldots, 0]) \approx [0, \ldots, 1_{(i-1)}, \ldots, 0]$$

Case 2: Attend to a single token. For simpler patterns, such as attention to a specific token t:

$$K(i) = \begin{cases} \{t\} & i \ge t \\ \{\} & i < t \end{cases} \tag{34}$$

only one hidden unit is required. We set:

$$k_i = \begin{cases} -N & i \ne t \\ N & i = t \end{cases} \qquad q_i = N$$

from which it can be verified (by the same procedure as in Case 1) that the desired attention pattern is produced.

Intricacy: how can K(i) be empty? We can cause K(i) to attend to an empty set by assuming the softmax has an extra ("imaginary") timestep, obtained by prepending a 0 to the attention vector post-hoc (Chen et al., 2021).

Cumulatively, the parameter matrices defined in this subsection implement the Read with Attention component of the RAW operator.
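The Case 1 construction can be checked numerically: with one-hot keys $k_j = e_j$ and query $q_i = N e_{i-1}$, the softmax over $q_i^\top [k_1, \ldots, k_i]$ puts nearly all mass on position $i-1$. A small sketch of ours:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

T, N = 6, 50.0
keys = np.eye(T)               # k_j = e_j: one-hot key per position
i = 4                          # current position (0-indexed)
q_i = N * keys[i - 1]          # q_i = N * e_{i-1}
scores = keys[: i + 1] @ q_i   # q_i^T k_j for all j <= i (causal prefix)
alpha = softmax(scores)
# Nearly all attention mass lands on the previous token:
assert alpha[i - 1] > 0.999
```

Replacing the key row for a fixed position t with +N (and all others with -N) reproduces Case 2 in the same way.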

C.4.2 HANDLING & UTILIZING NONLINEARITIES

The mul operator requires elementwise multiplication of quantities stored in hidden states. While transformers are often thought of as straightforwardly implementing only affine transformations on hidden vectors, their nonlinearities in fact allow elementwise multiplication to a high degree of approximation. We begin by observing the following property of the GeLU activation function used in the MLP layers of the Transformer network:

Lemma 4. The GeLU nonlinearity can be used to perform multiplication; specifically,

$$\sqrt{\tfrac{\pi}{2}}\,\big(\mathrm{GeLU}(x+y) - \mathrm{GeLU}(x) - \mathrm{GeLU}(y)\big) = xy + O(x^3 + y^3)$$

Proof. A standard implementation of the GeLU nonlinearity is defined as follows:

$$\mathrm{GeLU}(x) = \frac{x}{2}\Big(1 + \tanh\big(\sqrt{\tfrac{2}{\pi}}(x + 0.044715x^3)\big)\Big) \tag{36}$$

Thus

$$\mathrm{GeLU}(x) = \frac{x}{2} + \frac{x^2}{\sqrt{2\pi}} + O(x^3) \tag{37}$$
$$\mathrm{GeLU}(x+y) - \mathrm{GeLU}(x) - \mathrm{GeLU}(y) = \sqrt{\tfrac{2}{\pi}}\,xy + O(x^3 + y^3) \tag{38}$$
$$\Rightarrow \quad xy \approx \sqrt{\tfrac{\pi}{2}}\,\big(\mathrm{GeLU}(x+y) - \mathrm{GeLU}(x) - \mathrm{GeLU}(y)\big)$$

For small x and y, the third-order term vanishes. By scaling inputs down by a constant before the GeLU layer and scaling them back up afterwards, models may use the GeLU operator to perform elementwise multiplication. This argument generalizes to other smooth activation functions, as discussed further in Appendix G. Previous work also shows that, in practice, Transformers with ReLU activations exploit non-linearities to implement multiplication in other settings.

When implementing the aff operator, we have the opposite problem: we would like the result of addition to be transmitted to the output of the transformer layer without a nonlinearity. Fortunately, for large inputs, the GeLU nonlinearity is very close to linear; to bypass it, it suffices to add a large constant N to its inputs:

Lemma 5. The GeLU nonlinearity can be bypassed; specifically, for $N \gg 1$,

$$\mathrm{GeLU}(N + x) - N \approx x$$

Proof.

$$\mathrm{GeLU}(N+x) - N = \frac{N+x}{2}\Big(1 + \tanh\big(\sqrt{\tfrac{2}{\pi}}\big((N+x) + 0.044715(N+x)^3\big)\big)\Big) - N \approx \frac{N+x}{2}(1 + 1) - N = x$$

For all versions of the RAW operator, it is additionally necessary to bypass the LayerNorm operation. The following formula will be helpful for this:

Lemma 6.
Let N be a large number and λ the LayerNorm function. Then the following approximation holds for $N \gg 1$:

$$\frac{\sqrt{2}N}{\sqrt{L}}\,\lambda([x, N, -N-x, 0]) \approx [x, N, -N-x, 0]$$

Proof.

$$\mathbb{E}[v] = 0 \tag{45}$$
$$\mathrm{Var}[v] = \frac{1}{L}\big(x^2 + N^2 + (N + x)^2\big) \approx \frac{2N^2}{L} \tag{46}$$

Then,

$$\frac{\sqrt{2}N}{\sqrt{L}}\,\lambda([x, N, -N-x, 0]) \approx \frac{\sqrt{2}N}{\sqrt{L}}\Big[\frac{\sqrt{L}}{\sqrt{2}N}x,\; \sqrt{\tfrac{L}{2}},\; -\sqrt{\tfrac{L}{2}} - \frac{\sqrt{L}}{\sqrt{2}N}x,\; 0\Big] = [x, N, -N-x, 0] \tag{47}$$

RAW(+, ·), continued. The second-layer bias $b_2$ is set to:

$$(b_2)_{w[m]} = -N\sum_j (W_o)_{m,j} - N \qquad m \in 1, \ldots, |w| \tag{87}$$
$$(b_2)_{t[i]} = -N \tag{88}$$
$$(b_2)_{t' \notin t} = 0 \tag{89}$$

Therefore $m_i[w] = W_o\big(x_i[t] + W h_i^{(l)}[s]\big) - x_i[w]$, which equals what we promised in Eq. (60) for the + case. If we add back the residual term $x_i$ from Eq. (53), the output of this layer can be written as:

$$(h_i)^{(l+1)}_{t' \in w} = W_o\Big(\Big(\frac{W_a}{|K(i)|}\sum_{k \in K(i)} h_k^{(l)}[r]\Big) + W h_i^{(l)}[s]\Big) \tag{90}$$
$$(h_i)^{(l+1)}_{t' \notin w} = (h_i^{(l)})_{t' \notin w} \tag{91}$$

RAW(⊙, ·). If the operator $\circledast = \odot$, we need three extra blocks of hidden units, each the same size as $|t|$; name these extra index sets $t_a$, $t_b$, $t_c$, in addition to the output space $w$. The first-layer output $u_i$ gets the entries below, where N is a large number:

$$(u_i)_{t' \in t_a} = \big(W h_i^{(l)}[s] + x_i[t]\big)/N \tag{92}$$
$$(u_i)_{t' \in t_b} = x_i[t]/N \tag{93}$$
$$(u_i)_{t' \in t_c} = W h_i^{(l)}[s]/N \tag{94}$$
$$(u_i)_{t' \in t} = -x_i[t] + N \tag{95}$$
$$(u_i)_{t' \in w} = -x_i[w] + N \tag{96}$$
$$(u_i)_{t' \notin (t \cup t_a \cup t_b \cup t_c \cup w)} = -N \tag{97}$$

All of these operations are linear and can be implemented with $W_1$ zero except for the entries below:

$$(W_1)_{t_a[m], s[n]} = (W)_{m,n}/N \qquad m \in 1, \ldots, |t_a|,\; n \in 1, \ldots, |s|$$
$$(W_1)_{t_a[m], t[m]} = 1/N \qquad m \in 1, \ldots, |t_a|$$
$$(W_1)_{t_b[m], t[m]} = 1/N \qquad m \in 1, \ldots, |t_b|$$
$$(W_1)_{t_c[m], s[n]} = (W)_{m,n}/N \qquad m \in 1, \ldots, |t_c|,\; n \in 1, \ldots, |s|$$
$$(W_1)_{w[m], w[m]} = -1 \qquad m \in 1, \ldots, |w|$$
$$(W_1)_{t[m], t[m]} = -1 \qquad m \in 1, \ldots, |t|$$

and the bias $b_1$ to:

$$(b_1)_{t' \in (t_a \cup t_b \cup t_c)} = 0 \tag{106}$$
$$(b_1)_{t' \in (t \cup w)} = N \tag{107}$$
$$(b_1)_{t' \notin (t \cup t_a \cup t_b \cup t_c \cup w)} = -N \tag{108}$$

The resulting $v_i$, with the approximations above, becomes:

$$(v_i)_{t' \in t_a} = \mathrm{gelu}\big((W h_i^{(l)}[s] + x_i[t])/N\big) \tag{110}$$
$$(v_i)_{t' \in t_b} = \mathrm{gelu}\big(x_i[t]/N\big) \tag{111}$$
$$(v_i)_{t' \in t_c} = \mathrm{gelu}\big(W h_i^{(l)}[s]/N\big) \tag{112}$$
$$(v_i)_{t' \in t} = -x_i[t] + N \tag{113}$$
$$(v_i)_{t' \in w} = -x_i[w] + N \tag{114}$$
$$(v_i)_{t' \notin (t \cup t_a \cup t_b \cup t_c \cup w)} = 0 \tag{115}$$

Now we can use the GeLU trick of Lemma 4 by setting $W_2$ to zero except:

$$(W_2)_{w[m], t_a[n]} = (W_o)_{m,n}\, N^2 \sqrt{\tfrac{\pi}{2}} \qquad m \in 1, \ldots, |w|,\; n \in 1, \ldots, |t_a| \tag{117}$$
$$(W_2)_{w[m], t_b[n]} = -(W_o)_{m,n}\, N^2 \sqrt{\tfrac{\pi}{2}} \qquad m \in 1, \ldots, |w|,\; n \in 1, \ldots, |t_b| \tag{118}$$
$$(W_2)_{w[m], t_c[n]} = -(W_o)_{m,n}\, N^2 \sqrt{\tfrac{\pi}{2}} \qquad m \in 1, \ldots, |w|,\; n \in 1, \ldots, |t_c| \tag{119}$$
$$(W_2)_{w[m], w[m]} = 1 \qquad m \in 1, \ldots, |w| \tag{121}$$
$$(W_2)_{t[m], t[m]} = 1 \qquad m \in 1, \ldots, |t| \tag{122}$$

We then set $b_2$:

$$(b_2)_{t' \in (t \cup w)} = -N \tag{123}$$
$$(b_2)_{t' \notin (t \cup w)} = 0 \tag{124}$$

With this, $m_i[w] = W_o\big(x_i[t] \odot W h_i^{(l)}[s]\big) - x_i[w]$, so that

$$(h_i)^{(l+1)}_{t' \in w} = W_o\Big(\Big(\frac{W_a}{|K(i)|}\sum_{k \in K(i)} h_k^{(l)}[r]\Big) \odot W h_i^{(l)}[s]\Big) \tag{126}$$
$$(h_i)^{(l+1)}_{t' \notin w} = (h_i^{(l)})_{t' \notin w} \tag{127}$$

We have used $4|t|$ space for the internal computation of this operation, and $|w|$ space to write the final result. This shows that the RAW operator is implementable by setting the parameters of a Transformer.

C.5 PARALLELIZING THE RAW OPERATOR

Lemma 7. Provided that K is constant, the operators are independent (i.e. $(r_i \cup s_i \cup w_i) \cap w_{j \ne i} = \emptyset$), and there is $\sum_k (4|t_k| + |w_k|)$ available space in the hidden state, a Transformer layer can apply $k$ such RAW operations in parallel by setting different regions of the $W_1$, $W_2$, $W_f$ and $(W_V)_k$ matrices.

Proof. From the construction above, it is straightforward to modify the definition of the RAW operator to perform $k$ operations, since under the conditions of the lemma the matrix indices used in Appendix C.4.3 do not overlap.

In Fig. 4, dashed lines show probing results with a task model trained on a control task, in which $w$ is always the all-ones vector $\mathbf{1}$. This problem structurally resembles our main experimental setup, but does not require in-context learning. During probing, we feed this model data generated by $w$ sampled from a normal distribution, as in the original task model. We observe that the control probe has a significantly higher error rate, showing that the probing accuracy obtained with the actual task model is non-trivial. We present detailed error values of the control probe in Fig. 5.

F LINEARITY OF ICL

In Fig. 1b, we compare the implicit linear weights of ICL against the linear algorithms using the ILWD measure. Note that this measure does not assume the predictors are linear: when the predictors are not linear, ILWD measures the difference between the closest linear predictors (in the sense of Eq. (16)) to each algorithm. To gain more insight into ICL's algorithm, we can measure how linear ICL is in different regimes of the linear problem (underdetermined, determined) using the R² (coefficient of determination) measure. So, instead of asking what the best linear fit is in Eq. (16), we ask how good the linear fit is, which is the R² of the estimator. Interestingly, even though our model matches the min-norm least-squares solution in both metrics in Section 4.3, we find that ICL only gradually becomes linear in the underdetermined regime (Fig. 6). This result suggests that the in-context learner's hypothesis class is not purely linear.

(a) Approximating x² using GeLU, Eq. (136). (b) Approximating x² using tanh, Eq. (140), where δ = 1e-3. (c) A piecewise-linear approximation to x² using ReLU, Eq. (141).
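The R² computation described here can be sketched as follows: fit the best linear map from inputs to a predictor's outputs, then report the coefficient of determination. This is our own minimal rendering (names ours), with synthetic predictors standing in for ICL, so an exactly linear predictor should give R² of 1 up to numerical error.

```python
import numpy as np

def linearity_r2(X, preds):
    """R^2 of the best linear fit w to (x, prediction) pairs:
    a measure of how linear the predictor is on these inputs."""
    w, *_ = np.linalg.lstsq(X, preds, rcond=None)
    resid = preds - X @ w
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((preds - preds.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
w_true = rng.normal(size=8)
r2_linear = linearity_r2(X, X @ w_true)               # exactly linear predictor
r2_nonlinear = linearity_r2(X, np.tanh(X @ w_true))   # non-linear predictor
assert r2_linear > 0.999
assert r2_nonlinear < r2_linear
```

Applying the same measurement to ICL outputs across context lengths yields the regime-dependent linearity curves discussed above.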

G MULTIPLICATIVE INTERACTIONS WITH OTHER NON-LINEARITIES

We can show that for any real-valued, smooth non-linearity f(x), we can apply the same trick as in the paper body. In particular, we can write the Taylor expansion

$$f(x) = \sum_{i=0}^{\infty} a_i x^i = a_0 + a_1 x + a_2 x^2 + \ldots$$

which converges in some sufficiently small neighborhood $x \in [-\epsilon, \epsilon]$. First, assume that the second-order term dominates the higher-order terms in this domain, i.e. $a_2 x^2 \gg a_{i>2} x^i$ for $x \in [-\epsilon, \epsilon]$. It is easy to verify that the following is true:

$$\frac{1}{2a_2}\big(f(x+y) - f(x) - f(y) + a_0\big) = xy + O(x^3 + y^3) \tag{135}$$

So, given the expansion for GeLU in Eq. (37), we can use this generic formula to obtain the multiplication approximation; we plot this approximation against x² over the range [−0.1, 0.1] in Fig. 7a. If $a_2$ is zero, there is no second-order term to exploit, and if $a_2$ is negligible, the $O(x^3 + y^3)$ term dominates Eq. (135), so we cannot obtain a good approximation of xy. In this case, we can resort to numerical derivatives and utilize the $a_3$ term:

$$f'(x) = a_1 + 2a_2 x + 3a_3 x^2 + \ldots$$
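Eq. (135) can be sanity-checked numerically, both for a generic smooth f (below we use f = exp, with a₀ = 1 and a₂ = 1/2, as a stand-in example of ours) and for the GeLU case, where 1/(2a₂) = √(π/2) per Eq. (37):

```python
import math

def mult_via_f(f, a0, a2, x, y):
    # Eq. (135): (f(x+y) - f(x) - f(y) + a0) / (2 a2) ~= x*y for small x, y
    return (f(x + y) - f(x) - f(y) + a0) / (2 * a2)

def gelu(z):
    # tanh parameterization of GeLU, Eq. (36)
    return 0.5 * z * (1 + math.tanh(math.sqrt(2 / math.pi) * (z + 0.044715 * z**3)))

x, y = 0.01, 0.02

# Generic smooth f: exp has a0 = 1, a2 = 1/2.
assert abs(mult_via_f(math.exp, 1.0, 0.5, x, y) - x * y) < 1e-5

# GeLU: a2 = 1/sqrt(2*pi), so 1/(2*a2) = sqrt(pi/2), recovering Lemma 4.
assert abs(mult_via_f(gelu, 0.0, 1 / math.sqrt(2 * math.pi), x, y) - x * y) < 1e-5
```

For small inputs the error is dominated by the neglected higher-order terms, consistent with the O(x³ + y³) bound.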



We omit the trivial size preconditions, e.g. for mul: $i - j = a \cdot b$, $i' - j' = b \cdot c$, $i'' - j'' = a \cdot c$. For notational convenience, we use h to refer to the sequence of hidden states (instead of H in Eq. (1)); $h_{t'}$ is the hidden state at time step $t'$.



Predictor-ICL fit w.r.t. prediction differences.

Predictor-ICL fit w.r.t implicit weights.

Figure 1: Fit between ICL and standard learning algorithms: We plot (dimension normalized) SPD and ILWD values between textbook algorithms and ICL on noiseless linear regression with d = 8. GD(α) denotes one step of batch gradient descent and SGD(α) denotes one pass of stochastic gradient descent with learning rate α. Ridge(λ) denotes Ridge regression with regularization parameter λ. Under both evaluations, in-context learners agree closely with ordinary least squares, and are significantly less well approximated by other solutions to the linear regression problem.

This is because, conditioned on x and D, the scalar ŷ(x, D) := E[y | x, D] is the minimizer of the loss E[(y − ŷ)² | x, D], and thus the estimator ŷ is the minimizer of E[(y − ŷ)²] = E_{x,D}[E[(y − ŷ)² | x, D]].

Linear regression problem with d = 16

Figure 4: Probing results on the d = 4 problem: Both the moments X⊤Y (top) and the least-squares solution w_OLS (middle) are recoverable from learner representations. Plots in the left column show the accuracy of the probe for each target in different model layers. Dashed lines show the best probe accuracies obtained on a control task featuring a fixed weight vector w = 1. Plots in the right column show the attention heatmap for the best layer's probe, with the number of input examples on the x-axis. The value of the target after n examples is decoded primarily from the representation of y_n, or, after n = d examples, uniformly from y_{n≥4}.

Figure 5: Detailed error values of the control probe displayed in Fig. 4.

Figure 7: Approximations of multiplication via various non-linearities.

$$\sqrt{\tfrac{\pi}{2}}\,\big(\mathrm{GeLU}(x + y) - \mathrm{GeLU}(x) - \mathrm{GeLU}(y)\big) \approx xy \tag{136}$$

Figure 8: Empirical requirements on model parameters to satisfy SPD(Ridge(λ = 0.1), ICL) > SPD(OLS, ICL) when the other parameters are optimized.

ICL under uncertainty: With problem dimension d = 8, and for different values of prior variance τ² and data noise σ², we display (dimension-normalized) MSPD values for each predictor pair, where MSPD is the average SPD value over the underdetermined region of the linear problem. Brightness is proportional to 1/MSPD. ICL most closely follows the minimum-Bayes-risk Ridge regression output for all σ²/τ² values.

ACKNOWLEDGEMENTS

We thank Evan Hernandez, Andrew Drozdov, Ed Chi for their feedback on the early drafts of this paper. At MIT, Ekin Akyürek is supported by an MIT-Amazon ScienceHub fellowship and by the MIT-IBM Watson AI Lab.


By adding a large number N to two padding locations and summing with the part of the hidden state we want to preserve, we make x pass through LayerNorm unchanged. This addition can be done in the transformer's feed-forward computation (with parameter $W_f$) prior to the layer norm; the multiplication by $\sqrt{2}N/\sqrt{L}$ can then be done in the first layer of the following MLP, whose linear layer can output or reuse x. For convenience, we will henceforth omit the LayerNorm operation when it is not needed. We may make each of these operations as precise as desired (or as allowed by system precision). With them defined, we are ready to specify the final components of the RAW operator.
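The Lemma 6 bypass is easy to check numerically: pack x next to a large ±N pair (plus zero padding), apply LayerNorm, and rescale by √2·N/√L. This sketch follows our reconstruction of the lemma's scaling factor:

```python
import numpy as np

def layernorm(v):
    # standard LayerNorm over a vector (population std, no eps)
    return (v - v.mean()) / v.std()

L, N, x = 8, 1e4, 0.37
v = np.zeros(L)
v[:3] = [x, N, -N - x]      # mean is exactly zero by construction
out = (np.sqrt(2) * N / np.sqrt(L)) * layernorm(v)
# x passes through LayerNorm approximately unchanged:
assert abs(out[0] - x) < 1e-4
```

The approximation error shrinks as N grows, since the variance is then dominated by the ±N entries.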

C.4.3 PARAMETERIZING RAW

We want to show that a Transformer layer as defined above, parameterized by θ = {W_f, W_1, W_2, (W_Q, W_K, W_V)_m}, can well-approximate the RAW operator defined in Eq. (25). We provide step-by-step constructions of the parameters in θ. Begin by recalling the transformer layer definition.

Attention output. We use only m = 2 attention heads in this construction. We showed in Eq. (32) that we can make attention uniformly attend with a given pattern by setting the key and query matrices. Assume the first head's parameters $W_Q^1$, $W_K^1$ have been set in the described way to obtain the pattern function K. Now we set the remaining attention parameters of the two heads and show that we can make the $a_i + h_i^{(l)}$ term in Eq. (4) contain the corresponding term of Eq. (25) in some unused indices t. The Read-with-Attention term of the RAW operator can then be obtained from the first head's output. To achieve this, we fold $W_a$ into the attention value projection, so that $W_V^1$ is a sparse matrix, zero everywhere except the entries carrying $W_a$. Now the first head stores the desired term of Eq. (53) in the indices t. However, when we add the residual term $h_i^{(l)}$, this changes; to remove the residual term, we use the second head to output $-h_i^{(l)}$, with $W_f \in \mathbb{R}^{H \times 2H}$ zero except for the blocks combining the two heads. Having defined $(W_Q, W_K, W_V)_{1,2}$ and $W_f$, we have obtained the first term of Eq. (25) in $(a_i + h_i^{(l)})[t]$.

Arithmetic term. Next we calculate the term inside the parentheses of Eq. (25). We compute it through the MLP layer, store it in $m_i$, and subtract the first term. Denote the input to the MLP as $(a_i + h_i^{(l)})$, the output of the first layer as $u_i$, the output of the non-linearity as $v_i$, and the final output as $m_i$. We define the MLP to combine the attention term calculated above with part of the current hidden state by setting $W_1$ and $W_2$. Assume we bypass the LayerNorm using Lemma 6. We show this separately for the + and ⊙ operators.

RAW(+, ·). If the operator $\circledast = +$, the first layer of the MLP calculates the second term in Eq. (25), overwrites the space where the attention output term of Eq. (53) is written, and adds a large positive bias N to bypass the GeLU as explained in Lemma 5. We use an available scratch space in $x_i$ of the same size as t. This can be done by setting $W_1$ (the weight matrix of the first MLP layer) to zero except at the indicated entries (Eqs. 71-72), and setting the bias vector $b_1$ accordingly. The second bias term is added so that the unused indices outside $t \cup w$ become zero after the GeLU, which outputs zero for large negative inputs. Since we added a large positive term to the used indices, the GeLU behaves there like a linear layer.
Thus we have Eq. (85). This makes it possible to construct a Transformer layer that implements not only vector-vector dot products but general matrix-matrix products, as required by mul. With this, we have shown that mul can be implemented by a single layer of a Transformer.

C.6 LAYERNORM FOR DIVISION

Suppose we have the input $[c, y, 0]^\top$ calculated before the attention output in Eq. (53), and we want to divide y by c. The trick is very similar to the one in Lemma 6. We use the following formula:

Lemma 8 (using LayerNorm for division). Let N and M be large numbers and λ the LayerNorm function; then the following approximation holds:

Proof.

$$\mathbb{E}[v] = 0 \tag{129}$$

Published as a conference paper at ICLR 2023.

To get the input into the format used in this lemma, we can use $W_f$ to convert the head outputs. Then, after the layer norm, we can use $W_1$ to pull $y/c$ back and write it to the attention output. In this way, we can approximate scalar division in a single layer.

Lemma 1. By Lemmas 2, 3, 7 and 8, we have constructed each of the operators in Lemma 1 using a single layer of a Transformer, thus proving Lemma 1.

D DETAILS OF TRANSFORMER ARHITECTURE AND TRAINING

We perform these experiments using the Jax framework on P100 GPUs. The major hyperparameters used in these experiments are presented in Table 1. The code repository used for reproducing these experiments will be open-sourced at the time of publication. Most of the hyperparameters were adapted from previous work (Garg et al., 2022) for compatibility, and we adapted the Transformer architecture details. We use the Adam optimizer with a cosine learning-rate scheduler with warmup, where the number of warmup steps is set to 1/5 of the total iterations. We use learned absolute position embeddings. In the phase-shift plots in Fig. 3, we keep the value on the x-axis constant and use the best setting over the parameters {number of layers, hidden size, number of heads, learning rate}.

E DETAILS OF PROBE

We use the terms probe model and task model to distinguish the probe from ICL. Our probe is defined as follows. The position scores $s_v \in \mathbb{R}^T$ are learned parameters, where T is the max input sequence length (T = 80 in our experiments). A softmax over the position scores gives attention weights α for each position and for each target variable. This enables us to learn input-independent, optimal target locations for each target (displayed on the right side of Fig. 4). We then average hidden states using these attention weights. A linear projection $W_v$ is applied before averaging. FF is either a linear layer or a 2-layer MLP (hidden size 512) with a GeLU activation function. For each layer, we train a separate probe with its own parameters using stochastic gradient descent. H′ equals 512. The probe is trained using an Adam optimizer with a learning rate of 0.001 (chosen from among {0.01, 0.001, 0.0001} on validation data).

If $a_3$ is not negligible, i.e. $a_3 x^3 \gg a_{i>3} x^i$ in the same domain, we can use numerical derivatives to obtain a multiplication term (Eq. 138). For example, tanh has no second-order term in its Taylor expansion; using the formula above, we can obtain the corresponding expression (Eq. 140). Similar to our construction in Eq. (110), we can construct a Transformer layer that calculates these quantities (noting that δ is a small, input-independent scalar). We plot this approximation against x² over the range [−0.1, 0.1] in Fig. 7b. Note that if we use this approximation in our constructions, we need more hidden space, as there are 6 different tanh terms as opposed to 3 GeLU terms in Eq. (110).

Non-smooth non-linearities: ReLU is another commonly used non-linearity, and it is not differentiable at zero. With ReLU, we can only hope to obtain piecewise-linear approximations. For example, we can approximate x² with the piecewise-linear function of Eq. (141); we plot this approximation against x² over the range [−0.1, 0.1] in Fig. 7c.
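A minimal NumPy sketch of this probe architecture (our own rendering; the released probe may differ in details such as the projection shapes): learned position scores give a softmax attention over timesteps, hidden states are projected and averaged, and a small head maps the average to the target.

```python
import numpy as np

rng = np.random.default_rng(3)
T, H, H_prime, target_dim = 80, 64, 512, 8

hidden = rng.normal(size=(T, H))        # task-model hidden states, one per position
s_v = rng.normal(size=T)                # learned position scores
W_v = rng.normal(size=(H, H_prime)) * 0.01   # projection applied before averaging
W_ff = rng.normal(size=(H_prime, target_dim)) * 0.01  # linear readout head

alpha = np.exp(s_v - s_v.max())
alpha /= alpha.sum()                    # input-independent attention over positions
pooled = alpha @ (hidden @ W_v)         # project, then average with weights alpha
target_hat = pooled @ W_ff              # probe prediction (e.g. a w_OLS estimate)
assert target_hat.shape == (target_dim,)
```

In training, s_v, W_v and W_ff would be fit by gradient descent against the probed target; here random values only illustrate the data flow.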

H EMPIRICAL SCALING ANALYSIS WITH DIMENSIONALITY

In Figs. 3a and 3b, we showed that ICL needs different hidden sizes to enter the "Ridge regression phase" (orange background) or the "OLS phase" (green background) depending on the dimensionality d of the inputs x. However, we cannot reliably read the actual relation between size requirements and problem dimension from only two dimensions. To better understand size requirements, we ask the following empirical question for each dimension: how many layers, how much hidden size, and how many heads are needed to fit the least-squares solution better than the Ridge(λ = ϵ) regression solution (the green phase in Figs. 3a and 3b)?

To answer this question, we experimented with d = {1, 2, 4, 8, 12, 16, 20} and ran an experiment sweep for each dimension over:
• number of layers (L): {1, 2, 4, 8, 12, 16},
• hidden size (H): {16, 32, 64, 256, 512, 1024},
• number of heads (M): {1, 2, 4, 8},
• learning rate: {1e-4, 2.5e-4}.

For each feature that affects the computational capacity of the transformer (L, H, M), we optimize the other features and find the minimum value of that feature satisfying SPD(OLS, ICL) < SPD(Ridge(λ = ϵ), ICL). We plot the results for ϵ = 0.1 in Fig. 8. We find that a single head is enough for all problem dimensions, while the other parameters exhibit a step-function-like dependence on input size. Please note that other hyperparameters discussed in Appendix D (e.g. weight initialization) were not optimized independently for each dimension.

