PROVABLE MEMORIZATION CAPACITY OF TRANSFORMERS

Abstract

Quantifying memorization capacity is essential for understanding the expressiveness and generalizability of deep learning model architectures. However, the memorization capacity of the Transformer architecture has yet to be explored. In this work, we present the first study of the memorization capacity of the Transformer architecture. We prove that Transformers are capable of memorizing N sequence-to-sequence mappings of length n with d-dimensional input tokens using Õ(d + n + √(nN)) parameters. Our theory supports memorization both with and without permutation equivariance, utilizing positional encodings in the latter case. Building on our theory, we also analyze the memorization capacity of Transformers in the sequence classification and language modeling tasks. To verify these theoretical findings, we conduct experiments analyzing the memorization capacity of Transformers in the natural language domain.

1. INTRODUCTION

Transformer networks (Vaswani et al., 2017) have shown tremendous success in natural language processing tasks (Devlin et al., 2019; Yang et al., 2019; Brown et al., 2020; Fedus et al., 2022), rapidly becoming the standard architecture for natural language modeling. The success of Transformers has also transferred to various other sequence and set modeling tasks, including image recognition (Parmar et al., 2018; Dosovitskiy et al., 2021), semantic segmentation (Zheng et al., 2021), video understanding (Akbari et al., 2021; Bertasius et al., 2021), reinforcement learning (Parisotto et al., 2020; Chen et al., 2021; Janner et al., 2021), 3D point cloud processing (Zhao et al., 2021), protein structure prediction (Jumper et al., 2021), and automatic theorem proving (Polu & Sutskever, 2020).

Despite this success across various areas, the theoretical understanding of Transformers lags behind that of standard fully-connected networks. The major strength of Transformers is their efficient scaling, which is enabled through parallel token processing with parameter sharing and simple dot-product-based token interaction. Surprisingly, even though the parameter sharing and simple token interaction impose constraints on the function space of Transformers, Yun et al. (2020a) show that Transformers can approximate any continuous function from input to output sequences. However, their result focuses on the function approximation capacity with infinite precision, leaving the finite sample memorization capacity with finite precision unexplored. We note that universal function approximation does not automatically imply efficient memorization in terms of the number of parameters. Generalizing infinite precision results to the finite precision case is not straightforward and may not be possible in some cases. For example, Transformers are Turing complete only with infinite precision (Pérez et al., 2019), but not with finite precision (Dehghani et al., 2019).
Understanding the memorization capacity of a model is critical for choosing an appropriate model size. Practitioners often choose a model size with enough representation capacity to achieve zero training loss (i.e., a size larger than the memorization capacity). Moreover, the memorization capacity has generalization implications, as observed in the double descent phenomena (Belkin et al., 2019; Nakkiran et al., 2021): as the network size increases, generalization performance exhibits a bias-variance tradeoff until memorization is possible and then improves monotonically afterward. Understanding the memorization capacity of Transformers requires answers to the following questions: How large should the size and precision of the Transformer architecture be to enable memorization of any given number of input-output sequence pairs? How does the memorization capacity of Transformers differ across problem settings in practical application scenarios?

In this paper, we answer these questions by proving that Transformers can memorize N sequences of d-dimensional tokens with length n using Õ(d + n + √(nN)) parameters. Our proof constructs permutation equivariant Transformers that can memorize all permutations of N input sequences. We extend this construction to memorization without permutation equivariance by adding positional encodings. In addition, we derive the memorization capacity for the sequence classification task from our proposed theory. The key technical component of our construction is an efficient contextual mapping, which requires only n self-attention layers. Our contextual mapping also applies to sparse-attention Transformers, making fewer assumptions on sparsity patterns than Yun et al. (2020b) and Zaheer et al. (2020). Furthermore, we present a generalization of contextual mapping to function approximation settings, vastly improving the parameter efficiency of attention layers compared to the selective-shifting-based contextual mapping in Yun et al. (2020a).

Our main contributions are summarized as follows:

• We prove the memorization capacity of Transformers for sequence-to-sequence mappings with and without permutation equivariance. We analyze the memorization capacity in other standard task settings, such as sequence classification and language modeling.

• We show that the efficient contextual mapping presented in our theoretical analysis extends to sparse attention settings and improves the function approximation results.

• We provide experiments validating the memorization capacity of Transformers for token classification and sequence classification tasks.

1.1. RELATED WORKS

Memorization capacity. Characterizing the memorization capacity of neural networks has been an active research area with a long history (Baum, 1988; Sontag, 1997; Huang & Babri, 1998; Huang, 2003; Zhang et al., 2017; Yun et al., 2019; Bubeck et al., 2020; Vershynin, 2020; Rajput et al., 2021; Park et al., 2021; Vardi et al., 2022). Recently, Park et al. (2021) constructed neural networks with O(N^{2/3}) parameters to memorize N data points. They bypass the Ω(N) lower bound in Sontag (1997) by assuming a simple separation condition (i.e., ∥x_i − x_j∥ ≥ δ for all i ≠ j). Vardi et al. (2022) improve this further, showing that Õ(N^{1/2}) parameters are sufficient. They also prove the matching lower bound of Ω(N^{1/2}) through a VC-dimension analysis. Inspired by Park et al. (2021) and Vardi et al. (2022), our construction assumes a similar separation, but between pairs of distinct tokens rather than between whole sequence pairs. (See Definition 3.1 and the discussion that follows the definition.) In addition, our construction uses the same pipeline of projection, string matching, and bit extraction as in Vardi et al. (2022). However, we introduce an additional critical step: an efficient contextual mapping that complements the projection by summarizing all token information via self-attention layers.

In contrast to the extensive results on fully-connected networks, there are few studies on the memorization capacity of specific modern architectures. Hardt & Ma (2017) show that a residual network with ReLU activation and O(N) hidden neurons can memorize N data points under the separation assumption. Nguyen & Hein (2018) show that a convolutional network with O(N) hidden neurons can memorize N data points. To the best of our knowledge, there is no existing literature on the memorization capacity of the Transformer architecture.

Transformer expressivity. Given the recent empirical success of Transformers observed across multiple areas, several papers have studied the expressivity of Transformers.
Yun et al. (2020a) establish the first universal approximation theorem for Transformers, and the result is later extended to sparse-attention Transformers (Yun et al., 2020b; Zaheer et al., 2020) and Transformers with hard constraints (Kratsios et al., 2022). All these results study function approximation but not finite sample memorization as in our paper. We also note that our construction can reduce the number of self-attention layers in the function approximation setting. There are other lines of study focusing on different aspects of the representation capacity of Transformers. Some papers aim to characterize the representation capacity of a single self-attention layer. Bhojanapalli et al. (2020) suggest that the small size of attention heads limits the rank of a self-attention matrix. Dong et al. (2021) show that the rank of self-attention decays exponentially when self-attention layers are composed without skip-connections or feedforward layers. Likhosherstov et al. (2021) show that a fixed self-attention module can approximate any sparsity pattern. Other papers investigate a tradeoff between width and depth, a crucial issue when scaling Transformers. Levine et al. (2020) demonstrate depth efficiency in modeling feature interaction through a separation rank analysis of Transformers. Wies et al. (2021) identify the rank of the input embedding matrix as a bottleneck for the network width's contribution to expressivity. Our memorization study provides a complementary understanding of the capacity of Transformers.

2. PRELIMINARIES

This section establishes the notation and defines the Transformer architecture.

2.1. NOTATION

Denote the number of input-output pairs as N, the number of output classes as C, the token embedding dimension as d, and the sequence length as n. We use Õ(·) to hide logarithmic factors and O(·) to hide constant factors. We use σ_R to denote the ReLU activation function. We let σ_S and σ_H be the softmax and hardmax operators, respectively. These operators take a matrix as input, apply softmax/hardmax columnwise, and output a column-stochastic matrix of the same size. For m ∈ N, we define [m] = {1, 2, …, m}. We use |S| to denote the number of elements in a set S. We denote a set or a function by an upper-case calligraphic letter, a matrix by an upper-case bold letter, and a vector by a lower-case bold letter. We denote the standard unit vector whose i-th coordinate is 1 and all other coordinates are 0 as e_i, the m-dimensional all-ones vector as 1_m, and the m-dimensional all-zeros vector as 0_m. For a vector x, we write its i-th entry as x[i] and its Euclidean norm as ∥x∥. For a matrix X, we use X[i, j], X[i, :], and X[:, j] to denote the (i, j)-th entry, the i-th row, and the j-th column, respectively. We use ∥X∥_F to denote the Frobenius norm of X.

2.2. TRANSFORMER ARCHITECTURE

We define a Transformer N : R^{d×n} → R^{1×n} of depth L as a composition of L Transformer blocks with input and output embedding mappings: N = E_out ∘ F_L ∘ ⋯ ∘ F_2 ∘ F_1 ∘ E_in, where each Transformer block F_l : R^{m×n} → R^{m×n} is a sequence-to-sequence function consisting of two subblocks: a self-attention subblock and a tokenwise feedforward subblock. The input embedding block E_in : R^{d×n} → R^{m×n} and the output embedding block E_out : R^{m×n} → R^{1×n} are 1-layer tokenwise linear mappings.

The self-attention subblock represents the interaction among tokens. Formally, given an input Z ∈ R^{m×n}, the self-attention subblock F_l^{(SA)} with h heads and head size k computes

F_l^{(SA)}(Z) = Z + Σ_{i=1}^{h} W_{l,i}^{(O)} ( W_{l,i}^{(V)} Z ) σ_S( (W_{l,i}^{(K)} Z)^T (W_{l,i}^{(Q)} Z) ),

where W_{l,i}^{(O)} ∈ R^{m×k} and W_{l,i}^{(V)}, W_{l,i}^{(K)}, W_{l,i}^{(Q)} ∈ R^{k×m} are the weight matrices parametrizing the self-attention subblock. We include a skip-connection in the self-attention subblock.

The feedforward subblock processes each token independently in parallel by applying two feedforward layers. Given an input H ∈ R^{m×n}, the feedforward subblock F_l^{(FF)} with dimension q computes

F_l^{(FF)}(H) = H + W_l^{(2)} σ_R( W_l^{(1)} H + b_l^{(1)} 1_n^T ) + b_l^{(2)} 1_n^T,

where W_l^{(2)} ∈ R^{m×q}, b_l^{(2)} ∈ R^m, W_l^{(1)} ∈ R^{q×m}, and b_l^{(1)} ∈ R^q parametrize the feedforward subblock. The feedforward subblock also includes a skip-connection.

Finally, the Transformer block composes the two subblocks as F_l(Z) = F_l^{(FF)}(F_l^{(SA)}(Z)). Unlike the original formulation in Vaswani et al. (2017), our definition excludes layer normalization, as in Yun et al. (2020a), to simplify our analysis. Since layer normalization mainly contributes to optimization without much effect on expressivity, our definition still captures the representation aspect of the Transformer architecture. Since each Transformer block consists of a fixed number of layers even in the most fine-grained sense, we use the number of blocks L as the depth of the network.
We define the width of the network as max{m, kh, q}. The number of parameters is the number of non-zero weights in our network. We note that a single parameter is reused n times for a sequence length n, but is still counted as one parameter. The bit complexity of the network is the maximum bit complexity of its weights, where the bit complexity of a weight is the number of bits required to represent it. We adopt these definitions of the number of parameters and the bit complexity from the convention in the VC dimension literature (Bartlett et al., 2019) and the recent paper on the optimal memorization capacity of fully-connected networks (Vardi et al., 2022).
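The block definitions above can be written out directly. Below is a minimal NumPy sketch (ours, for illustration only; it is not the memorizing construction from the proofs): tokens are columns, so an input Z has shape (m, n), the softmax is applied columnwise, and both subblocks carry skip-connections with no layer normalization.

```python
import numpy as np

def softmax_cols(A):
    """Columnwise softmax sigma_S: returns a column-stochastic matrix."""
    A = A - A.max(axis=0, keepdims=True)  # subtract column max for stability
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def self_attention(Z, WO, WV, WK, WQ):
    """F_SA(Z) = Z + sum_i WO_i (WV_i Z) sigma_S((WK_i Z)^T (WQ_i Z)).
    WO, WV, WK, WQ are lists with one (m,k) / (k,m) matrix per head."""
    out = Z.copy()
    for wo, wv, wk, wq in zip(WO, WV, WK, WQ):
        attn = softmax_cols((wk @ Z).T @ (wq @ Z))  # (n, n), column-stochastic
        out += wo @ (wv @ Z) @ attn
    return out

def feedforward(H, W1, b1, W2, b2):
    """F_FF(H) = H + W2 relu(W1 H + b1 1^T) + b2 1^T, applied tokenwise."""
    return H + W2 @ np.maximum(W1 @ H + b1[:, None], 0.0) + b2[:, None]

def transformer_block(Z, params):
    """One block F_l = F_FF o F_SA."""
    return feedforward(self_attention(Z, *params["sa"]), *params["ff"])
```

A useful sanity check on this sketch is the permutation equivariance discussed in Section 3: permuting the columns of Z permutes the output columns the same way, since the columnwise softmax commutes with permutations and the feedforward subblock is tokenwise.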

3. MEMORIZATION CAPACITY OF TRANSFORMERS

In this section, we describe the problem setting and present our main theorem on the memorization capacity of the Transformer architecture. Then, we sketch the proof for our main theorem and discuss the memorization capacity of Transformers in other standard task settings.

3.1. PROBLEM SETTING

We consider the memorization of N input-output sequence pairs (X^(1), Y^(1)), …, (X^(N), Y^(N)), where each input X^(i) ∈ R^{d×n} is a sequence of n token vectors in dimension d. Each output Y^(i) ∈ [C]^{1×n} is a sequence of n labels, where each label Y^(i)[1, k] is assigned to the token X^(i)[:, k]. We define the context of each input sequence X^(i) as V^(i) = {v ∈ R^d : v = X^(i)[:, k] for some k ∈ [n]}, and the vocabulary V = ∪_{i∈[N]} V^(i) as the set of all tokens appearing in the input sequences. Note that |V| ≤ nN.

As pointed out in Park et al. (2021) and Vardi et al. (2022), we must assume some conditions on the dataset to bypass the lower bound in Sontag (1997) and memorize N data points with o(N) parameters. We present a natural generalization of the separation condition defined in Vardi et al. (2022) to sequence modeling settings.

Definition 3.1. Let r ≥ 1, 0 < δ ≤ 1. Let X^(1), …, X^(N) ∈ R^{d×n} be N input sequences with vocabulary V. We say that X^(1), …, X^(N) are tokenwise (r, δ)-separated if
1. ∥v∥ ≤ r for all v ∈ V, and
2. ∥v − v′∥ ≥ δ for all v, v′ ∈ V with v ≠ v′.

This condition requires (1) each token to have a bounded norm and (2) each pair of distinct tokens to be separated. We note that tokenwise separation is a stronger condition than separation of whole input sequences. However, the condition better captures many practical settings where the number of tokens in the vocabulary is much smaller than the number of input sequences. For permutation equivariant mappings, we need the following label consistency condition.

Definition 3.2. Let (X^(1), Y^(1)), …, (X^(N), Y^(N)) ∈ R^{d×n} × [C]^{1×n} be N input-output pairs of sequences. We say that (X^(1), Y^(1)), …, (X^(N), Y^(N)) are consistently labeled if X^(i)[:, k] = X^(i)[:, l] implies Y^(i)[1, k] = Y^(i)[1, l] for every i ∈ [N] and k, l ∈ [n].
We emphasize that we impose this condition only on the memorization with permutation equivariance, not on the memorization without permutation equivariance. The condition implies that two identical tokens appearing in the same context should have the same label. Consider a permutation equivariant mapping F : R^{d×n} → R^{1×n} and an input sequence X ∈ R^{d×n}. Define X′ by swapping two tokens X[:, k] and X[:, l] in X. Then, we have F(X)[1, k] = F(X′)[1, l] due to the permutation equivariance. If the two tokens X[:, k] and X[:, l] were identical, then X = X′ and consequently F(X)[1, k] = F(X)[1, l].
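The tokenwise (r, δ)-separation condition of Definition 3.1 is simple enough to check directly on a dataset. A small sketch (ours; exact floating-point equality is used to identify duplicate tokens, which suffices for discrete vocabularies):

```python
import numpy as np

def is_tokenwise_separated(sequences, r, delta):
    """Check Definition 3.1 on a list of (d, n) arrays whose columns are
    tokens: every token norm is <= r, and every pair of distinct tokens
    in the pooled vocabulary V is at least delta apart."""
    vocab = np.unique(np.concatenate(sequences, axis=1), axis=1)  # V, (d, |V|)
    if np.any(np.linalg.norm(vocab, axis=0) > r):
        return False  # condition (1) violated
    # pairwise Euclidean distances between distinct vocabulary tokens
    diff = vocab[:, :, None] - vocab[:, None, :]
    dist = np.linalg.norm(diff, axis=0)
    offdiag = dist[~np.eye(dist.shape[1], dtype=bool)]
    return bool(offdiag.size == 0 or offdiag.min() >= delta)
```

For example, one-hot token sequences are tokenwise (1, 1)-separated, since distinct one-hot vectors are √2 apart; this matches the remark that the condition is about the vocabulary, not about whole sequences.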

3.2. MAIN RESULTS

We now present our main theorem on the memorization capacity of Transformers.

Theorem 3.1. Let N, d, n, C ∈ N and r ≥ 1, 0 < δ ≤ 1. Let (X^(1), Y^(1)), …, (X^(N), Y^(N)) ∈ R^{d×n} × [C]^{1×n} be N input-output pairs of sequences where the input sequences are distinct and tokenwise (r, δ)-separated.

1. (With permutation equivariance) Suppose that the contexts V^(i) are distinct and the sequences are consistently labeled. Then, there exists a Transformer network N : R^{d×n} → R^{1×n} such that N(X^(i) P) = Y^(i) P for every i ∈ [N] and for every permutation matrix P ∈ R^{n×n}.

2. (Without permutation equivariance) There exists a Transformer network N : R^{d×n} → R^{1×n} and a positional encoding E ∈ R^{d×n} such that N(X^(i) + E) = Y^(i) for every i ∈ [N].

In both cases, the Transformer N has width 16 (m = 8, h = k = 1 and q = 16), depth

O( n + √(nN log(nN)) + √(nN log(nN)) · max{log C, log R} ),

and bit complexity bounded by

O( log d + √(nN log(nN)) · max{log C, log R} ),

where we denote R := 8000 r² δ⁻² d n⁵ N⁶.

Theorem 3.1 shows that Õ(d + n + √(nN)) parameters are enough to memorize N sequence-to-sequence mappings of length n with token dimension d, since the initial embedding layer has d parameters and the remaining layers have a constant number of parameters each. We provide the proof sketch in Section 3.3 and the full proof in Appendix A.

Remark 3.2. Extensions to real vector outputs. Some application scenarios of Transformers require real vector outputs. As proposed in Park et al. (2021) and Vardi et al. (2022), extension to real scalar values is easily achievable by using O(1/ε) classes when the output has a bounded range. More concretely, we partition the output range into ε-length intervals and match each class to one partition to perform regression with error ε per token. This replaces log C in Theorem 3.1 with log(1/ε). Similarly, extension to vector outputs is possible when the output has a bounded domain. Suppose that we aim to minimize the tokenwise L2 distances in dimension p. We partition the output range into (ε/√p)-length cubes and match each class to one cube. Then, we use O((√p/ε)^p) classes to perform regression with error ε per token. This construction replaces log C in Theorem 3.1 with p log(p/ε).

Remark 3.3. Large width and fixed bit complexity. Theorem 3.1 uses fixed width and large bit complexity to minimize the number of parameters. However, a common approach to scaling Transformers is to increase the width (Levine et al., 2020) while using the same number of bits per parameter. Using a similar argument as Vardi et al.
(2022), we extend Theorem 3.1 to the cases with a larger width and with bounded bit complexity. When a larger width is allowed, Transformers of width O(nN/L²), depth Õ(n + L) and bit complexity Õ(L) memorize the same dataset for some L ≤ √(nN). When the bit complexity is bounded, Transformers of width O(1), depth Õ(n + nN/B) and bit complexity Õ(B) memorize the same dataset for some B ≤ √(nN). We provide the formal theorem and the proof in Appendix B.

Remark 3.4. Tightness in the order of bit counts. Suppose the token dimension d and the sequence length n are both O(N). Theorem 3.1 shows that a Transformer memorizes N input-output sequence pairs using Õ(√(nN)) parameters of bit complexity Õ(√(nN)), which sum up to Õ(nN) bits. Without any additional assumption on a dataset, representing models that memorize all C^{nN} possible labelings of N input sequences requires Ω(nN) bits. Thus, Theorem 3.1 is tight up to logarithmic factors in the order of bit counts.

• The dependence on the number of data points N is the same Õ(√(nN)). However, for the permutation equivariant case, Transformers memorize all permutations of each input sequence. That is, Transformers are capable of memorizing up to n! times more data points at the cost of reusing each parameter n times.

• Our construction has a better dependence on d and n: O(d + n) for Transformers versus O(dn) for fully-connected ReLU networks. Transformers exploit the structure of sequence data through parameter sharing. As a result, Transformers do not need O(dn) parameters to read all dn values in the input sequence.

We note that this comparison is not completely fair because (1) our result makes a slightly stronger assumption of separation between tokens rather than between whole inputs; and (2) fully-connected networks are not designed for permutation equivariant datasets. However, the difference in assumption affects the final bound only in the logarithmic terms. Furthermore, there is no known result on the memorization capacity of other permutation equivariant architectures.
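The regression-to-classification reduction of Remark 3.2 is elementary arithmetic; a sketch of the scalar case (our illustration, assuming for concreteness an output range starting at `lo` with interval width `eps`):

```python
def to_class(y, eps, lo=0.0):
    """Map a scalar target in [lo, lo + C*eps) to one of C = O(1/eps) classes."""
    return int((y - lo) // eps)

def to_value(c, eps, lo=0.0):
    """Map a class back to the midpoint of its interval; the per-token
    regression error of round-tripping is at most eps/2 < eps."""
    return lo + (c + 0.5) * eps
```

Memorizing the class labels `to_class(y, eps)` with the Theorem 3.1 construction and decoding with `to_value` realizes per-token regression error below ε, which is exactly why log C becomes log(1/ε) in the bound.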

3.3. PROOF SKETCH

We outline the proof of Theorem 3.1. A more formal statement of each stage with detailed proofs is in Appendix A. Our proof adopts the approach from Vardi et al. (2022) and shares similar steps. We discuss our technical novelty after sketching the main ideas of our proof. Our proof constructs a Transformer in 4 stages. The first two stages assemble input values and encode them as a "contextual token id" that identifies each token within the context of its sequence. We ensure that the contextual token ids are permutation equivariant and that a contextual token id is uniquely assigned to each token in each context. Then, the last two stages map each contextual token id to the corresponding label using the bit-extraction network adopted from Vardi et al. (2022). We describe the key ideas of each stage.

• Stage 1. Tokenwise Projection. We project each token vector to a scalar token id while keeping distinct tokens well separated.

• Stage 2. Contextual Mapping. We compose a sequence id as a linear combination of token ids. The weight of each id in the linear combination depends on the order of the token id within its sequence. The resulting sequence ids are permutation invariant.⁷ We concatenate the token id and the sequence id to obtain a contextual token id.

• Stage 3. String Lookup. We partition all nN contextual token ids into intervals, each containing the same number of ids. We construct two encoding numbers for each interval by concatenating all corresponding contextual token ids and token labels. Then, we find which group the contextual token id falls into and retrieve the corresponding encoding numbers.

• Stage 4. Bit Extraction. We extract each contextual token id and token label from the encoding numbers. If the extracted contextual token id agrees with the one composed in stage 2, then we output the corresponding token label.
Our technical novelty in this proof lies in (1) the implementation of the contextual mapping in stage 2 using self-attention subblocks and (2) generalizing stages 3 and 4 to incorporate skip-connections.

Remark 3.6. Contextual mapping. The number of self-attention subblocks that our proof uses is proportional to the length of the sequence n but independent of the number of data points N. This requirement is in striking contrast to the selective-shifting-based contextual mapping from Yun et al. (2020a), which requires (1/δ)^{dn} layers, shifting one grid cell of side length δ at a time. In the memorization setting, we may remove layers for grid cells without any data point, but the selective-shifting-based contextual mapping would still need nN self-attention subblocks, which alone is already larger than the number of required layers in our result. In contrast, our efficient contextual mapping uses n layers, improving all of the above when δ, N > 1.⁸ We showcase the benefit of our contextual mapping in Appendix C. Specifically, we show that our contextual mapping is capable of incorporating sparse self-attention settings with minimal parameter overhead. Moreover, we show that the same idea behind our contextual mapping can be applied in the function approximation setting.

Remark 3.7. Parameter contribution. In our construction, self-attention subblocks contribute only O(n) parameters while feedforward subblocks contribute O(√(nN)). When n < N, most of the parameter count comes from the feedforward subblocks. Although the self-attention layers play a critical role in contextual mapping, we do not need many of them. This matches the model design in practice, where more than half of the parameters are in the tokenwise feedforward subblocks (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020; Geva et al., 2021). Moreover, Mandava et al. (2020) observe that the number of self-attention layers can be further reduced without much performance loss.
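To make stages 1 and 2 concrete, here is a toy numerical sketch (ours, not the paper's construction, which realizes these maps with feedforward and self-attention subblocks): each token is projected to a scalar token id, and a sequence id is formed as a linear combination of the sorted unique token ids, with weights depending only on rank; the choices of `base` and the rounding tolerance are illustrative assumptions.

```python
import numpy as np

def token_ids(X, u):
    """Stage 1 (toy): tokenwise projection of a (d, n) sequence to n scalar ids."""
    return u @ X

def sequence_id(ids, base=1000.0):
    """Stage 2 (toy): combine the sorted unique token ids with weights that
    depend only on their rank, so any permutation of the tokens in the
    sequence yields the same (permutation invariant) sequence id."""
    unique_sorted = np.sort(np.unique(np.round(ids, 6)))[::-1]  # descending
    weights = base ** np.arange(len(unique_sorted))
    return float(weights @ unique_sorted)

def contextual_ids(X, u):
    """Pair each token id with its sequence id: the 'contextual token id'."""
    ids = token_ids(X, u)
    s = sequence_id(ids)
    return [(t, s) for t in ids]
```

The point of the pairing is that the same token appearing in two different sequences receives two different contextual token ids, which is what lets stages 3-4 assign it context-dependent labels.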

4. MEMORIZATION CAPACITY IN OTHER TASKS

In this section, we generalize Theorem 3.1 to analyze the memorization capacity for other standard task settings. We consider sequence classification in Theorem 4.1, masked language modeling in Theorem D.1 and autoregressive language modeling in Theorem D.2. Due to the limited space, formal results on language modeling tasks and the proofs of the theorems are provided in Appendix D.

4.1. SEQUENCE CLASSIFICATION

In sequence classification, we assign a single label y^(i) ∈ [C] to each input sequence X^(i). We present our theorem for the sequence classification task.

Theorem 4.1. Let N, d, n, C ∈ N and r ≥ 1, 0 < δ ≤ 1. Let (X^(1), y^(1)), …, (X^(N), y^(N)) ∈ R^{d×n} × [C] be N input-output pairs where the input sequences are distinct and tokenwise (r, δ)-separated.

1. (With permutation invariance) Suppose that the contexts V^(i) are distinct. Then, there exists a Transformer network N : R^{d×n} → R^{1×n} such that N(X^(i) P)[1, k] = y^(i) for every i ∈ [N], k ∈ [n] and for every permutation matrix P ∈ R^{n×n}.

⁷An appropriate choice of positional encoding bypasses permutation invariance by collecting token ids in the position order. See the second part of Theorem 3.1 for the result and Section A.5 for the details.
⁸The most practical settings fall into this regime.

2. (Without permutation invariance)

There exists a Transformer network N : R^{d×n} → R^{1×n} and a positional encoding E ∈ R^{d×n} such that N(X^(i) + E)[1, k] = y^(i) for every i ∈ [N], k ∈ [n].

In both cases, the Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth

O( n + √(N log N) + √(N log N) · max{log C, log R} ),

and bit complexity bounded by

O( log d + √(N log N) · max{log C, log R} ),

where we denote R := 8000 r² δ⁻² d n⁵ N⁶.

4.2. LANGUAGE MODELING

We consider two language modeling tasks commonly used for pre-training Transformers: masked language modeling and autoregressive language modeling. We consider the memorization of all possible length-n sequences obtainable from a given corpus of length T. The input is embedded in the d-dimensional space while the output is mapped to one of the V tokens in the dictionary. The memorization of masked language modeling requires Õ(d + n + √(n^{m+1} m T)) parameters (Theorem D.1), while the memorization of autoregressive language modeling requires Õ(d + n + √T) parameters (Theorem D.2). Compared to autoregressive language modeling, masked language modeling has the additional factor n^m m that comes from memorizing all masking patterns separately and the factor n that comes from memorizing all masked tokens instead of one next token. For more details on the settings and formal statements of the results, we refer to Appendix D.

5. EXPERIMENTS

5.1. EXPERIMENTAL SETUP

We complement our theory with experiments on real-world datasets. We train encoder-only Transformer models (Vaswani et al., 2017) on a token classification task, where each token is assigned a label as in Theorem 3.1, and a sequence classification task, where each sequence is assigned a label as in Theorem 4.1. We study the relationship between the memorized dataset size and the model size. For token classification, we use 14,000 randomly selected examples among the 14,041 training examples in the named entity recognition dataset from CoNLL-2003 (Tjong Kim Sang & De Meulder, 2003). For sequence classification, we use 50,000 randomly selected examples among the 392,702 training examples in the MNLI dataset from the GLUE benchmark (Wang et al., 2019). We vary the model size through the embedding size m while fixing the number of layers at L = 6. We fix the number of attention heads at h = 12, the embedding-to-head-size ratio at m/k = h = 12, and the feedforward-to-embedding size ratio at q/m = 4, as commonly done in practice. More details on the experiments are in Appendix F.
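Under the fixed ratios above, the model size in the sweep is determined by the embedding size m alone. A rough back-of-the-envelope count (our own estimate; it ignores the embedding layers, biases, and layer normalization, so it only tracks the dominant quadratic term):

```python
def approx_param_count(m, L=6, ff_ratio=4):
    """Roughly count weights in L Transformer blocks: per block, the four
    attention projections (Q, K, V, O) total 4*m*m when kh = m, plus the
    two feedforward matrices of shape m x (ff_ratio*m)."""
    attn = 4 * m * m             # W_Q, W_K, W_V, W_O across all heads
    ff = 2 * m * (ff_ratio * m)  # W_1 and W_2 with q = ff_ratio * m
    return L * (attn + ff)
```

The count grows quadratically in m, so the per-step increments in a sweep over m grow as well; this is why a linear-in-N trend in parameters corresponds to roughly a square-root trend in m.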

5.2. RESULTS

Figure 1 shows heatmaps of training errors as the dataset size and the model size vary. There is a clear trend that the training error is smaller (darker in color) for smaller dataset sizes and larger model sizes. To see a clearer relationship between the model size and the dataset size, we plot in Figure 2 the minimum model size that achieves less than 0.005 training error for each dataset size.

(Figure 2 caption: The number of parameters required for memorization. We show the number of parameters (Y-axis) of the smallest model that achieves less than 0.005 training error for each dataset size (X-axis). We observe a linear trend in the number of parameters with a subtle concavity as our theory predicts, but only at the low data regime.)

In general, there is a linear increase in the model size as the memorized dataset size increases. We also observe a slight downward curvature (concavity) as our theory predicts, but only at low dataset sizes. We conjecture that the linear trend may be due to the fixed depth and bit complexity during the experiments. See Remark 3.3 for the discussion of this bounded depth and bit complexity regime and Appendix B for the formal results in these regimes. Indeed, Theorem B.1 and Theorem B.2 predict a linear dependence of the model size on the dataset size.

6. CONCLUSIONS

In this paper, we prove that Transformers are capable of memorizing N length-n sequence-to-sequence mappings with Õ(d + n + √(nN)) parameters. We extend our theory to analyze the memorization capacity of Transformers in other standard task settings. Our proof constructs a contextual mapping with O(n) self-attention layers, which significantly improves the previously proposed selective-shifting-based contextual mapping in terms of parameter efficiency. Finally, we provide experimental results that verify our theory.

A PROOF OF THEOREM 3.1

Our proof of Theorem 3.1 consists of four stages as described in Section 3.3. We state and prove the main lemma for each stage in the following subsections. Then, we combine all stages at the end. The lemmas in each stage assume permutation equivariance. We analyze how to circumvent permutation equivariance through the positional encoding in a separate subsection.

A.1 STAGE 1: TOKENWISE PROJECTION

The main lemma for stage 1 is stated below. Lemma A.1. Let N, d, n ∈ N and r ≥ 1, 0 < δ ≤ 1. Let X (1) , • • • , X (N ) ∈ R d×n be a set of N in- put sequences that are distinct and tokenwise (r, δ)-separated. Denote r ′ = 2⌈2n 2 N 2 √ πdδ -1 ⌉⌈r⌉. Then, there exists a network N 1 : R d×n → R 1×n consisting of the input embedding block and one tokenwise feedforward subblock with feedforward dimension q = 1 and bit complexity ⌈log(2rn 2 N 2 d √ πδ -1 )⌉ such that N 1 (X (i) ), i ∈ [N ] are non-negative and tokenwise (2r ′ , 2)- separated. Moreover, for i, j ∈ [N ] and k, l ∈ [n], N 1 (X (i) )[1, k] = N 1 (X (j) )[1, l] if and only if X (i) [:, k] = X (j) [:, l]. Proof. Our proof defines a vector u ∈ R d such that sequences x (i) = u T X (i) ∈ R 1×n , i ∈ [N ] are tokenwise (r ′ , 2)-separated and satisfy that, for i, j ∈ [N ] and k, l ∈ [n], x (i) [1, k] = x (j) [1, l] if and only if X (i) [:, k] = X (j) [:, l] Then, we construct the network N 1 with the required size and bit complexity that computes N 1 (X) = u T X + r ′ 1 T n . Construction of u. Recall the definition of the vocabulary V = i∈[N ] V (i) = {v ∈ R d : v = X (i) [:, k] for some i ∈ [N ], k ∈ [n]}. Note that |V| ≤ nN . We use Lemma E.1 on V to find ũ such that 1 n 2 N 2 8 πd ∥v -v ′ ∥ ≤ 1 |V| 2 8 πd ∥v -v ′ ∥ ≤ ũT (v -v ′ ) ≤ ∥v -v ′ ∥ for every v, v ′ ∈ V. Let û ∈ R d be a vector with each coordinate being the first ⌈log(n 2 N 2 d √ π)⌉ bits of the corre- sponding coordinate of ũ. We note that ∥ û -ũ∥ ≤ √ d 2 log(n 2 N 2 d √ π) = 1 n 2 N 2 1 πd . Define u = S û with S = ⌈2n 2 N 2 √ πdδ -1 ⌉ ≥ 2. We now check that x (i) = u T X (i) , i ∈ [N ] are tokenwise (r ′ , 2)-separated with r ′ = 2S⌈r⌉. Let i, j ∈ [N ], k, l ∈ [n] with X (i) [:, k] ̸ = X (j) [:, l] and v, v ′ ∈ V with v = X (i) [:, k], v ′ = X (j) [:, l] . Then, we have x (i) [1, k] = u T X (i) [:, k] = S ûT v ≤ S ũT v + ( û -ũ) T v ≤ S ∥v∥ + 1 n 2 N 2 1 πd ∥v∥ ≤ 2S∥v∥ ≤ 2Sr ≤ r ′ . 
We also have x (i) [1, k] -x (j) [1, l] = u T (X (i) [:, k] -X (j) [:, l]) = S ûT (v -v ′ ) ≥ S ũT (v -v ′ ) -( û -ũ) T (v -v ′ ) ≥ S 1 n 2 N 2 8 πd ∥v -v ′ ∥ - 1 n 2 N 2 1 πd ∥v -v ′ ∥ ≥ S 1 n 2 N 2 1 πd ∥v -v ′ ∥ ≥ S 1 n 2 N 2 1 πd δ ≥ 2, which also implies x (i) [1, k] = x (j) [1, l] if and only if X (i) [:, k] = X (j) [:, l]. Construction of N 1 . We construct N 1 as a composition of the input embedding block E in : R n×d → R 1×n and a tokenwise feedforward block F (F F ) : R 1×n → R 1×n with a skip-connection. We define E in (X) = ûT X and F (F F ) (z) = z + (S -1)σ R (z + 2⌈r⌉1 T n ) + 2⌈r⌉1 T n . Then, we have N 1 (X) = F (F F ) (E in (X)) = ûT X + (S -1)σ R ( ûT X + 2⌈r⌉1 T n ) + 2⌈r⌉1 T n = ûT X + (S -1)( ûT X + 2⌈r⌉1 T n ) + 2⌈r⌉1 T n = S ûT X + 2S⌈r⌉1 T n = u T X + r ′ 1 T n , where we removed the ReLU activation because ûT X + 2⌈r⌉1 T n have all positive values. It is straightforward from the definition of F (F F ) that the feedforward dimension of the network is 1. Moreover, we can represent each coordinate of û with ⌈log(n 2 N 2 d √ π)⌉ bits, the weights in F (F F ) with ⌈log S⌉ = ⌈log(2n 2 N 2 √ πdδ -1 )⌉ bits, and the biases in F (F F ) with ⌈log 2r⌉ bits. Thus, the bit complexity of the network N 1 is ⌈log(2rn 2 N 2 d √ πδ -1 )⌉. A.2 STAGE 2: CONTEXTUAL MAPPING Before proceeding on to the stage 2, we reindex the tokens of each output from Lemma A.1. Due to the permutation equivariance of the Transformer architecture, we may reorder tokens in any order as we want without loss of generality. For each i ∈ [N ], we consider x (i) ∈ R n as a vector and reindex as follows. Suppose that there are n i unique tokens in X (i) . Then, there are also n i unique values in x (i) . We assign indices for tokens so that the first n i tokens have n i unique values of x (i) in descending order: x (i) [1] > x (i) [2] > • • • > x (i) [n i ]. Then, we index the remaining redundant tokens in descending order: x (i) [n i + 1] ≥ x (i) [n i + 2] ≥ • • • ≥ x (i) [n]. 
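As a numeric sanity check of the projection in Lemma A.1 and the reindexing above, the sketch below projects tokens to scalar ids and reorders them with the unique values first in descending order. A random unit vector stands in for the constructive Lemma E.1 vector (an illustrative assumption: with a random direction the separation holds with high probability rather than deterministically, and the scaling and rounding to Sû are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 6
X = rng.normal(size=(d, n))
X[:, 3] = X[:, 1]                      # plant one repeated token

# Stage 1: project tokens to scalar ids with a (here random) unit vector,
# standing in for the near-isometric vector from Lemma E.1.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
x = u @ X

assert x[1] == x[3]                    # equal tokens -> equal ids
vals, counts = np.unique(x, return_counts=True)
assert len(vals) == n - 1              # distinct tokens -> distinct ids

# Reindexing before Stage 2: the n_i unique ids first, in descending
# order, then the redundant copies, also in descending order.
dup = np.concatenate([[v] * (c - 1) for v, c in zip(vals, counts)])
reindexed = np.concatenate([np.sort(vals)[::-1], np.sort(dup)[::-1]])
print(list(reindexed[: len(vals)]) == sorted(vals, reverse=True))   # True
```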
We note that the resulting vector x (i) is uniquely defined. That is, X (i) corresponds to one and only one valid resulting vector x (i) . We need the following definition in this section. Definition A.1. Let N, n ∈ N and r ≥ 1, 0 < δ ≤ 1. Let x (1) , • • • , x (N ) ∈ R n be N data instances. We say that x (1) , • • • , x (N ) are (r, δ)-separated if • ∥x (i) ∥ ≤ r for all i ∈ [N ] and • ∥x (i) -x (j) ∥ ≥ δ for all i, j ∈ [N ] with i ̸ = j. We now state the main lemma for stage 2. Lemma A.2. Let N, n, r ′ ∈ N. Let x (1) , • • • , x (N ) ∈ R n be a set of N input sequences that are non-negative and tokenwise (2r ′ , 2)-separated. Denote R ′ = 4⌈2N 2 √ πn⌉⌈r ′ ⌉⌈ √ n⌉. Then, there exists a network N 2 : R 3×n → R 3×n consisting of 2n Transformer blocks with the number of head h = 1, head size k = 1, feedforward dimension q = 4 and bit complexity ⌈log(2r ′ nN 2 √ π)⌉ such that N 2     x (i)T 0 T n 0 T n     =   0 T n 0 T n z (i) 1 T n   where z (i) ∈ R, i ∈ [N ] are (R ′ + 1, 2)-separated. Proof. Our proof defines a vector w ∈ R n such that values z(i) = w[1 : n i ] T x (i) [1 : n i ] ∈ R, i ∈ [N ] are (R ′ , 4 )-separated where we denote the number of unique values in x (i) as n i . Then, we construct the network N 2 with the required size and bit complexity that computes N 2     x (i)T 0 T n 0 T n     =   0 T n 0 T n z (i) 1 T n   where z (i) approximates z(i) within 1 as z (i) -z(i) ≤ 1. Since we have z (i) ≤ z(i) + z (i) -z(i) ≤ R ′ + 1 for i ∈ [N ] and z (i) -z (j) ≥ z(i) -z(j) -z (i) -z(i) -z (j) -z(j) ≥ 4 -1 -1 = 2 for i, j ∈ [N ] with i ̸ = j, we conclude that z (i) ∈ R, i ∈ [N ] are (R ′ + 1, 2)-separated. Construction of w. For i ∈ [N ], we define x(i) ∈ R n as a vector having the same values as x (i) in the first n i coordinates and 0 in the rest. We use Lemma E.1 on x(1) , x(2) , • • • , x(N) to find w such that 1 N 2 8 πn ∥ x(i) -x(j) ∥ ≤ wT x(i) -x(j) ≤ ∥ x(i) -x(j) ∥ for every i, j ∈ [N ]. 
Let ŵ ∈ R d be a vector with each coordinate being the first ⌈log(nN 2 √ π)⌉ bits of the corresponding coordinate of w. We note that ∥ ŵ -w∥ ≤ √ n 2 log(nN 2 √ π) = 1 N 2 1 πn . Define w = P ŵ with P = ⌈2N 2 √ πn⌉. We now check that z(i) = w[1 : n i ] T x (i) [1 : n i ] = w T x(i) , i ∈ [N ] are (R ′ , 4)-separated. Let i, j ∈ [N ] with i ̸ = j. Then, we have z(i) = w T x(i) = P ŵT x(i) ≤ P wT x(i) + ( ŵ -w) T x(i) ≤ P ∥ x(i) ∥ + 1 N 2 1 πn ∥ x(i) ∥ ≤ 2P ∥ x(i) ∥ = 2P n k=1 x(i) k 2 ≤ 4P r ′ √ n ≤ R ′ . and z(i) -z(j) = w T ( x(i) -x(j) ) = P ŵT ( x(i) -x(j) ) ≥ P wT ( x(i) -x(j) ) -( ŵ -w) T ( x(i) -x(j) ) ≥ P 1 N 2 8 πn ∥ x(i) -x(j) ∥ - 1 N 2 1 πn ∥ x(i) -x(j) ∥ ≥ P 1 N 2 1 πn ∥ x(i) -x(j) ∥ ≥ P 1 N 2 1 πn • 2 ≥ 4. Construction of N 2 . We construct N 2 = F 2n • F 2n-1 • • • • • F 1 in n steps. Each step l ∈ [n] consists of 2 Transformer blocks F 2l-1 , F 2l : R 3×n → R 3×n . First, we use Lemma E.2 to obtain a self-attention module F(SA) 2l-1 : R 1×n → R 1×n that computes a vector with all coordinates holding 1 2P √ n -approximation xmax of x max = max i∈[n] x[i] given x ∈ R n . We extend F(SA) 2l-1 to define a valid self-attention subblock F (SA) 2l-1 : R 3×n → R 3×n as F (SA) 2l-1     x T 0 T n z T     =   x T 0 T n z T   +   0 T n F(SA) 2l-1 (x T ) 0 T n   =   x T xmax 1 T n z T   for x, z ∈ R n . The subblock F (SA) 2l-1 finds the 1 2P √ n -approximate maximum token id among the first coordinate values and outputs in the second coordinate. Next, we use Lemma E.3 to obtain a tokenwise feedforward module F(F F ) 2l-1 : R 2×n → R 1×n that outputs r ′ for tokens with the same value in two coordinates. We define a tokenwise feedforward subblock F (F F ) 2l-1 : R 3×n → R 3×n by extending F(F F ) 2l-1 so that, for x, y, z ∈ R n , F (F F ) 2l-1     x T y T z T     =   x T y T z T   -   2 F(F F ) 2l-1 ([x, y] T ) 0 T n 0 T n   =   xT y T z T   , where x[i] = x[i] -2r ′ if |x[i] -y[i]| < 1 2 x[i] if |x[i] -y[i]| > 1 . 
The subblock F (F F ) 2l-1 compares the first two rows of the input and subtracts 2r ′ from the first row if two values are not separated. Since x[i] is bounded above by 2r ′ , the subtracted entries become negative. Since we do not need the self-attention subblock from F 2l , we set all weights of F (SA) 2l to zero. We define F (F F ) 2l : R 3×n → R 3×n as F (F F ) 2l     x T y T z T     =   x T y T z T   + 1 0 0 -1/P 0 ŵk σ R   -e T 1 P e T 2   x T y T z T     =   x T y T z T   + 1 0 0 -1/P 0 ŵk σ R -x T P y T =   x T + σ R (-x T ) y T -σ R (y T ) z T + ŵk P σ R (y T )   =   σ R (x T ) -σ R (-y T ) z T + w k σ R (y T )   , where P > 0 and w k = ŵk P are defined earlier. When y ≥ 0 n , we can further simplify as F (F F ) 2l     x T y T z T     =   σ R (x T ) 0 T n z T + w k y T   . Let Z (i,l) = F 2l • F 2l-1 • • • • • F 1     x (i)T 0 T n 0 T n     ∈ R 3×n be the output of the l-th step when the input is x (i) for l = 0, • • • , n. We show inductively that Z (i,l) [1, k] = x (i) [k] if l < n i and x (i) [k] < x (i) [l] 0 otherwise , Z (i,l) [2, k] = 0, Z (i,l) [3, k] = w[1 : min{l, n i }] T x(i) [1 : min{l, n i }] = min{l,ni} j=1 w[j] x(i) [j] (1) for i ∈ [N ], l, k ∈ [n], where x(i) [j] is 1 2P √ n -approximation of x (i) [j]. For l = 0, the conditions 1 hold for the input   x (i)T 0 T n 0 T n   . Suppose that the conditions 1 hold for l = l -1. When l > n i , the induction hypothesis implies that Z (i, l-1) [1, :] = 0 T n . Thus, we obtain the output as F (F F ) 2 l • F (F F ) 2 l-1 • F (SA) 2 l-1 Z (i, l-1) = F (F F ) 2 l • F (F F ) 2 l-1 • F (SA) 2 l-1     0 T n 0 T n Z (i, l-1) [3, :]     = F (F F ) 2 l • F (F F ) 2 l-1     0 T n 0 T n Z (i, l-1) [3, :]     = F (F F ) 2 l     -2r ′ 1 T n 0 T n Z (i, l-1) [3, :]     =   0 T n 0 T n Z (i, l-1) [3, :]   . 
When l ≤ n i , the induction hypothesis implies that Z (i, l-1) [1, :] has non-zero entries among which the largest value is x (i) [ l]. Then, it follows that F (F F ) 2 l • F (F F ) 2 l-1 • F (SA) 2 l-1 Z (i, l-1) = F (F F ) 2 l • F (F F ) 2 l-1 • F (SA) 2 l-1       Z (i, l-1) [1, :] Z (i, l-1) [2, :] Z (i, l-1) [3, :]       = F (F F ) 2 l • F (F F ) 2 l-1 • F (SA) 2 l-1     Z (i, l-1) [1, :] 0 T n Z (i, l-1) [3, :]     = F (F F ) 2 l • F (F F ) 2 l-1     Z (i, l-1) [1, :] x(i) [ l]1 T n Z (i, l-1) [3, :]     = F (F F ) 2 l     x(i, l-1)T x(i) [ l]1 T n Z (i, l-1) [3, :]     =    σ R x(i, l-1)T 0 T n Z (i, l-1) [3, :] + w[ l] x(i) [ l]1 T n    where x(i) [ l] is 1 2P √ n -approximation of x (i) [ l]. Here, x(i, l-1) denotes x(i, l-1) [k] = Z (i, l-1) [1, k] -2r ′ if |Z (i, l-1) [1, k] -x(i) [ l]| < 1 2 Z (i, l-1) [1, k] if |Z (i, l-1) [1, k] -x(i) [ l]| > 1 so σ R ( x(i, l-1) [k]) = 0 if |Z (i, l-1) [1, k] -x(i) [ l]| < 1 2 Z (i, l-1) [1, k] if |Z (i, l-1) [1, k] -x(i) [ l]| > 1 . Since distinct tokens are separated by 2, 1 2P √ n -approximations of them are separated by 1. On the other hand, since 1 2P √ n < 1 2 , the 1 2P √ n -approximation of the maximum element stays closer than 1 2 . Consequently, σ R ( x(i, l-1)T ) is the same as Z (i, l) [1, :]. Thus, the conditions 1 hold and we conclude our induction proof. In the end of n steps, the output is Z (i,n) =   0 T n 0 T n z (i) 1 T n   with z (i) = w T x(i) where each entry of x(i) 1 2P √ n -approximates the corresponding entries of x(i) . We check that z (i) approximates z(i) within 1 as z (i) -z(i) ≤ w T x(i) -w T x(i) ≤ ∥w∥∥ x(i) -x(i) ∥ ≤ P ∥ ŵ∥ √ n 2P √ n ≤ 1 2 (∥ w∥ + ∥ ŵ -w∥) ≤ 1 2 1 + 1 N 2 1 πn ≤ 1. Our construction of N 2 involves 2n Transformer blocks. From Lemma E.2, the subblock F (SA) 2l-1 uses 1 head (h = 1) with head size 1 (k = 1) and bit complexity ⌈log log(8n 3/2 r ′ P ) ⌉ = ⌈log log(16n 2 N 2 r ′ √ π) ⌉. 
From Lemma E.3, the subblock F^(FF)_{2l−1} uses feedforward dimension 4 with bit complexity ⌈log 2r′⌉. The subblock F^(FF)_{2l} uses feedforward dimension 2. All weights in F^(FF)_{2l} are either 0, ±1, P, −1/P or ŵ_k. Since we represent each coordinate of ŵ using ⌈log(nN²√π)⌉ bits, the bit complexity of F^(FF)_{2l} is max{⌈log P⌉, ⌈log(nN²√π)⌉} = ⌈log(2nN²√π)⌉. Thus, the network N₂ has 1 head (h = 1), head size 1 (k = 1), feedforward dimension 4, and bit complexity ⌈log(2r′nN²√π)⌉.
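The mapping computed by N₂ can be illustrated in plain Python: take the unique token ids in descending order (the order in which the 2n blocks extract them by repeated max-and-knock-out) and form a weighted sum. This is a sketch under two simplifications: the random positive weights stand in for the Lemma E.1 vector w, and exact maxima replace the softmax approximation of Lemma E.2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

def sequence_id(x, w):
    """Contextual-mapping sketch: weighted sum of the unique token ids
    taken in descending order (the order the 2n attention blocks
    extract them via repeated max-and-knock-out)."""
    vals = np.sort(np.unique(x))[::-1]      # unique token ids, descending
    return float(w[: len(vals)] @ vals)

w = rng.random(n) + 1.0                     # stand-in for Lemma E.1 weights

seqs = [np.array([9., 3., 3., 1., 7.]),
        np.array([9., 3., 1., 7., 7.]),     # same unique values -> same id
        np.array([9., 3., 2., 1., 7.])]     # different vocabulary -> new id

z = [sequence_id(x, w) for x in seqs]
print(z[0] == z[1], z[0] != z[2])           # True True
```

Note that, as in the proof, the id identifies the vocabulary of the sequence: two sequences with the same unique token values share a sequence id, and the contextual token id of stage 3 then distinguishes individual tokens.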

A.3 STAGE 3: STRING LOOKUP

In stages 3 and 4, we map each token X^(i)[:, k] to the corresponding label using both the token id x^(i)[k] from stage 1 and the sequence id z^(i) from stage 2. Now that the token id x^(i)[k] and the sequence id z^(i) hold enough information to identify the label, we process each token independently using tokenwise feedforward blocks from now on. We first combine the two ids into a single "contextual token" id, and map it to the corresponding token label in the last two stages. We adapt stages 2 and 3 of Vardi et al. (2022) to our architecture, which involves skip-connections.

We can extend stage 2 by using an extra dimension to pass x^(i) from stage 1 without any additional parameters. Then, after stage 2, the k-th token in the i-th sequence contains z^(i) and x^(i)[k], which are enough to identify the corresponding label y^(i)[k]. We note that 0 ≤ x^(i)[k] ≤ 2r′ and |z^(i)| ≤ R′ + 1 with r′, R′ > 6. We define

a^(i)[k] = (⌊z^(i)⌋ + R′ + 1)(2r′ + 1) + ⌊x^(i)[k]⌋ + 1

to be the unique integer id for each token in each sequence. Then, we have

• 1 ≤ a^(i)[k] < (2R′ + 3)(2r′ + 1) < 9r′R′ for i ∈ [N], k ∈ [n], and
• |a^(i)[k] − a^(j)[l]| ≥ 2 for i, j ∈ [N], k, l ∈ [n] with V^(i) ≠ V^(j) or X^(i)[:, k] ≠ X^(j)[:, l].

We denote R = 9r′R′. Now, our goal is to map a^(i)[k] ∈ [R] to y^(i)[k] ∈ [C] using tokenwise feedforward subblocks. We denote the number of distinct a^(i)[k]'s as N′ ≤ nN. We reindex each unique token and the corresponding label as ã^(i) and ỹ^(i) for i ∈ [N′], respectively. Without loss of generality, we suppose that 1 ≤ ã^(1) < ⋯ < ã^(N′) < R. Then, we partition the N′ contextual token ids into A groups of B ids. For each group g ∈ [A], we construct two strings u_g and w_g. The binary string u_g is a concatenation of the B ids in group g, each represented as an integer of ρ = ⌈log R⌉ bits. The binary string w_g is a concatenation of the B labels corresponding to the B ids in group g.
Each label is represented as an integer of γ = ⌈log C⌉ bits. Thus, u g and w g are strings of length ρB and γB, respectively. We now state the main lemma for stage 3. Lemma A.3. Let N ′ , R ∈ N and 1 ≤ ã(1) < • • • < ã(N ′ ) < R to be distinct integer ids that identify each token in each sequence. We suppose that ã (i) -ã(j) ≥ 2 for i, j ∈ [N ′ ] with i ̸ = j. Let A, B, b ∈ N with A < N ′ and B = ⌈ N ′ A ⌉. Let w 1 , • • • , w A ∈ N with the number of bits in their binary representation at most b. Then, there exists a network N 3 : R 2 → R 2 consisting of A feedforward blocks with skipconnections every 2 layers, feedforward dimension q = 4 and bit complexity b + ⌈log(2R + 1)⌉ such that N 3 ã(i) , 0 = ã(i) , w ⌈ i B ⌉ . Proof. We construct N 3 = F A •F A-1 •• • ••F 1 as a composition of A feedforward blocks F l : R 2 → R 2 , l ∈ [A] where each feedforward block contains a single hidden layer and a skip-connection. Let l ∈ [A]. We use Lemma E.4 to define Fl : R 2 → R 1 such that Fl (x) = w l if x ∈ ã((l-1)•B+1) , ã(l•B) and Fi (x) = 0 if x / ∈ ã((l-1)•B+1) -1 2 , ã(l•B) + 1 2 where we regard l • B > N ′ to be N ′ . We extend Fl to define a valid feedforward block F l as F l (x, y) = (x, y) + 0, Fi (x) = x, y + Fi (x) . Throughout the computation of N 3 , the first coordinate is the same across all blocks. For i ∈ [N ′ ], ã(i) only activates one of Fl , l ∈ [A]. In particular, Fl (ã (i) ) = w l if l = ⌈ i B ⌉ and Fl (ã (i) ) = 0 otherwise. Thus, we get N 3 ã(i) , 0 = ã(i) , w ⌈ i B ⌉ . Our construction involves A feedforward blocks with skip-connections every 2 layers. From Lemma E.4, feedforward dimension q = 4 and the bit complexity is b + ⌈log(2R + 1)⌉. Remark A.4. We can extend the construction from Lemma A.3 to output multiple strings corresponding to the same range without additional feedforward dimension. In particular, the first layer in the construction of Lemma E.4 does not depend on w. 
Thus, we can reuse these four units with different output weights to output additional strings corresponding to the same range. In stage 4, we need 2 strings for each range.
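The contextual token id and the group strings can be sketched numerically. The constants r′, R′, ρ, γ and B below are toy choices (an assumption for illustration), and least-significant-block-first packing is one convention for the concatenation:

```python
import math

r_, R_ = 10, 100          # toy stand-ins for r' and R' from stages 1 and 2

def contextual_id(z, x):
    """Unique integer id combining the sequence id z (|z| <= R' + 1)
    and the token id x (0 <= x <= 2r')."""
    return (math.floor(z) + R_ + 1) * (2 * r_ + 1) + math.floor(x) + 1

pairs = [(-50.2, 3.7), (-50.2, 5.1), (12.9, 3.7), (12.9, 5.1)]
ids = [contextual_id(z, x) for z, x in pairs]
assert len(set(ids)) == len(ids)            # distinct (z, x) -> distinct ids
assert all(1 <= a < 9 * r_ * R_ for a in ids)

# Stage 3 then packs B ids (rho bits each) and B labels (gamma bits each)
# into the per-group strings u_g and w_g; a range lookup selects the
# group string for a given id.
B, rho, gamma = 2, 14, 4
group_ids, labels = ids[:B], [3, 7]
u_g = sum(a << (rho * i) for i, a in enumerate(group_ids))
w_g = sum(y << (gamma * i) for i, y in enumerate(labels))
print(u_g > 0 and w_g == 7 * 16 + 3)        # True
```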

A.4 STAGE 4: BIT EXTRACTION

We first define BIN i:j (n) to be the substring of n with bits in places from i to j inclusive for n, i, j ∈ N with i ≤ j. We now state the main lemma for stage 4. Lemma A.5. Let B, ρ, γ ∈ N and u, w ∈ N. Suppose that the number of bits in binary representation of u and w are ρB and γB, respectively. We assume that BIN ρ•(i-1)+1:ρ•i (u) -BIN ρ•(j-1)+1:ρ•j (u) ≥ 2 for i, j ∈ [B] with i ̸ = j. Then, there exists a network N 4 : R 3 → R consisting of an output embedding block and (max{ρ, γ} + 2)B + 2 feedforward blocks with skip-connections every 2 layers, feedforward dimension q = 16 and bit complexity 2 max{ρ, γ}B such that N 4 (x, u, w) = BIN ρ•(i-1)+1:ρ•i (w), if there exist i ∈ [B] such that x = BIN ρ•(i-1)+1:ρ•i (u). Proof. We construct N 4 = F post • F B • F B-1 • • • • • F 1 • F pre in B steps with pre-processing F pre : R 3 → R 8 and post-processing F post : R 8 → R. We define the pre-processing network as F pre (x, u, w) = x, u 2 ρB + 1 2 ρB+1 , u 2 ρB + 1 2 ρB+2 , 0, w 2 γB + 1 2 γB+1 , w 2 γB + 1 2 γB+2 , 0, 0 and the post-processing network as F post (z 1 , z 2 , • • • , z 8 ) = z 8 . In each step l ∈ [B], F l : R 8 → R 8 first implements two sub-networks F u l : R 3 → R 3 and F w l : R 3 → R 3 in parallel on the middle 6 coordinates. Two sub-networks uses Lemma E.5 to compute F u l     ϕ (ρ(l-1)) u 2 ρB + 1 2 ρB+1 ϕ (ρ(l-1)) u 2 ρB + 1 2 ρB+2 0     =   ϕ (ρl) u 2 ρB + 1 2 ρB+1 ϕ (ρl) u 2 ρB + 1 2 ρB+2 BIN ρ(l-1)+1:ρl (u)   and F w l     ϕ (γ(l-1)) w 2 γB + 1 2 γB+1 ϕ (γ(l-1)) w 2 γB + 1 2 γB+2 0     =   ϕ (γl) w 2 γB + 1 2 γB+1 ϕ (γl) w 2 γB + 1 2 γB+2 BIN γ(l-1)+1:γl (w)   . 
Then, F l combines the result using two additional feedforward blocks as            x ϕ (ρl) u 2 ρB + 1 2 ρB+1 ϕ (ρl) u 2 ρB + 1 2 ρB+2 BIN ρ(l-1)+1:ρl (u) ϕ (γl) w 2 γB + 1 2 γB+1 ϕ (γl) w 2 γB + 1 2 γB+2 BIN γ(l-1)+1:γl (w) 0            F 1 l --→             x ϕ (ρl) u 2 ρB + 1 2 ρB+1 ϕ (ρl) u 2 ρB + 1 2 ρB+2 c l = F1 l (x, BIN ρ(l-1)+1:ρl (u)) ϕ (γl) w 2 γB + 1 2 γB+1 ϕ (γl) w 2 γB + 1 2 γB+2 BIN γ(l-1)+1:γl (w) 0             F 2 l --→             x ϕ (ρl) u 2 ρB + 1 2 ρB+1 ϕ (ρl) u 2 ρB + 1 2 ρB+2 0 ϕ (γl) w 2 γB + 1 2 γB+1 ϕ (γl) w 2 γB + 1 2 γB+2 0 F2 l (c l , BIN γ(l-1)+1:γl (w))             where F1 l uses Lemma E.3 to compute F1 l (x, y) = 2 γ if |x -y| < 1 2 0 if |x -y| > 1 and F2 l computes, for 0 ≤ y ≤ 2 γ F2 l (x, y) = σ R (x -2 γ + y) = y if x = 2 γ 0 if x = 0 . The resulting value in the last coordinate is BIN γ(l-1)+1:γl (w) if |x -BIN ρ(l-1)+1:ρl (u)| < 1 2 and 0 if |x -BIN ρ(l-1)+1:ρl (u)| > 1. Therefore, the last coordinate throughout each step of N 4 keeps the value 0 until it finds x = BIN ρ(l-1)+1:ρl (u). When such l is found, the value is updated to BIN γ(l-1)+1:γl (w) as the requirement. Finally, the parallel implementation of F u l and F w l requires max{ρ, γ} feedforward blocks, feedforward dimension 16 = 8 + 8 and bit complexity 2 max{ρ, γ}B. Moreover, each of F 1 l and F 2 l requires 1 feedforward block and bit complexity γ. The feedforward dimension of F 1 l is 5 where F1 l incurs 4 from Lemma E.3 and 1 additional unit wipes out the (positive) carried value in 4-th coordinate from the skip-connection. The feedforward dimension of F 2 l is 3 where F2 l incurs 1 and 2 additional units wipe out the carried values in 4-th and 7-th coordinates from the skip-connection. In total, each step l ∈ [B] consists of max{ρ, γ} + 2 feedforward blocks with skip-connections, feedforward dimension 16 and bit complexity 2 max{ρ, γ}B. 
The pre-processing network F pre and the post-processing network F post can be implemented with 1 feedforward block, feedforward dimension at most 2 (to carry u and w or z 8 ) and the bit complexity at most max{ρ, γ}B + 2. Thus, N 4 consists of (max{ρ, γ} + 2)B + 2 feedforward blocks with skip-connections, feedforward dimension 16 and bit complexity 2 max{ρ, γ}B.
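The scan performed by N₄ can be sketched with integer arithmetic: walk over the B id blocks of u and, when one matches the query x, output the corresponding label block of w. The block widths and values are toy choices, and least-significant-block-first ordering is one convention for BIN (an assumption for illustration):

```python
def bin_block(num, width, i):
    """BIN over the i-th block of `width` bits of num (blocks counted
    from the least-significant end here)."""
    return (num >> (width * (i - 1))) & ((1 << width) - 1)

def bit_extract(x, u, w, B, rho, gamma):
    """Lemma A.5 sketch: scan the B id blocks of u; when one matches x,
    return the corresponding label block of w (0 if no match)."""
    out = 0
    for l in range(1, B + 1):
        if bin_block(u, rho, l) == x:
            out = bin_block(w, gamma, l)
    return out

B, rho, gamma = 3, 5, 3
ids, labels = [4, 9, 27], [2, 6, 5]
u = sum(a << (rho * i) for i, a in enumerate(ids))
w = sum(y << (gamma * i) for i, y in enumerate(labels))

print([bit_extract(a, u, w, B, rho, gamma) for a in ids])   # [2, 6, 5]
```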

A.5 POSITIONAL ENCODING

Consider the token ids uᵀX^(i) from stage 1. If we define the positional encoding as E = r′u[n−1, n−2, …, 1, 0], then we have uᵀ(X^(i) + E) = uᵀX^(i) + r′[n−1, n−2, …, 1, 0]. Since every element of uᵀX^(i) is bounded in magnitude by r′, the positional encoding forces the decreasing order used in stage 2 to coincide with the usual sequential order. Thus, the sequence id is no longer permutation equivariant. The upper bound on the magnitude of the token ids increases by a factor of n, but this only affects the parameter complexity and bit complexity logarithmically through R.

A.6 PROOF OF THEOREM 3.1

The final Transformer network combines all 4 stages as N = N₄ ∘ N₃ ∘ N₂ ∘ N₁, where N₃ and N₄ apply the same function to each token independently. The mismatch in the embedding and feedforward dimensions is easily resolved by using the maximum dimension required and setting all weights in the unused dimensions to zero. We modify N₃ to output both u_{⌈i/B⌉} and w_{⌈i/B⌉} as mentioned in Remark A.4. We summarize all stages:

1. N₁ projects the input sequence X^(i) ∈ R^{d×n} to the token ids x^(i) ∈ Rⁿ. This stage consists of 1 feedforward block of dimension 1 and bit complexity ⌈log(2rn²N²√(πd) δ⁻¹)⌉ ≤ log(r′√d).

2. N₂ further projects the token ids x^(i) ∈ Rⁿ to the sequence id z^(i) ∈ R permutation equivariantly. This stage consists of 2n Transformer blocks of attention dimension 1, feedforward dimension 4 and bit complexity ⌈log(2r′nN²√π)⌉ ≤ log(R′).

3. N₃ combines a token id and a sequence id to obtain the contextual token id and finds the two group strings. The N′ ≤ nN memorized contextual token ids are partitioned into A groups of B ids so that N′ ≤ AB. The two group strings are crafted as concatenations of the contextual token ids and the corresponding labels in the group. This stage consists of A feedforward blocks of dimension 8 and bit complexity ⌈max{log R, log C}⌉B + ⌈log R⌉ + 1.

4. N₄ extracts the correct label from the crafted strings.
This stage consists of (max{ρ, γ} + 2)B + 2 feedforward blocks of dimension 16 and bit complexity 2⌈max{log R, log C}⌉B.

In total, the network N uses 1 + 2n + A + (max{ρ, γ} + 2)B + 2 = O(n + A + max{ρ, γ}B) Transformer blocks of dimension 16 and bit complexity log(r′√d) + log(R′) + ⌈max{log R, log C}⌉B + ⌈log R⌉ + 1 + 2⌈max{log R, log C}⌉B = O(log d + ⌈max{log R, log C}⌉B). We note that R = 9r′R′ ≤ 150r′²nN² ≤ 8000r²n⁵N⁶dδ⁻².
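The constants can be checked numerically. The sketch below evaluates r′ (Lemma A.1), R′ (Lemma A.2) and R = 9r′R′ on small instances with integer r (so that ⌈r⌉ = r) and verifies the chain of bounds:

```python
import math

def stage_constants(n, N, d, r, delta):
    """r' from Lemma A.1 and R' from Lemma A.2."""
    r1 = 2 * math.ceil(2 * n**2 * N**2 * math.sqrt(math.pi * d) / delta) \
           * math.ceil(r)
    R1 = (4 * math.ceil(2 * N**2 * math.sqrt(math.pi * n))
          * math.ceil(r1) * math.ceil(math.sqrt(n)))
    return r1, R1

for n, N, d, r, delta in [(4, 8, 16, 2, 0.5), (8, 32, 64, 3, 0.25)]:
    r1, R1 = stage_constants(n, N, d, r, delta)
    R = 9 * r1 * R1
    # The chain of bounds stated above.
    assert R <= 150 * r1**2 * n * N**2 \
             <= 8000 * r**2 * n**5 * N**6 * d / delta**2
print("R = 9 r' R' <= 150 r'^2 n N^2 <= 8000 r^2 n^5 N^6 d / delta^2 holds")
```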

B LARGE WIDTH AND FIXED BIT COMPLEXITY

In this section, we formally study the generalization of Theorem 3.1 in the case of large width (Section B.1) and fixed bit complexity (Section B.2) from Remark 3.3. To simplify the argument, we state and prove the result only for the case without permutation equivariance. The theorem for the permutation equivariance case is straightforwardly obtained from our stated theorem.

B.1 LARGE WIDTH

We state and prove the theorem for large width.

Theorem B.1. Assume the same setting as in Theorem 3.1, and let L ≤ √(nN). Then, there exists a Transformer network N : R^{d×n} → R^{1×n} and positional encoding E ∈ R^{d×n} such that N(X^(i) + E) = Y^(i) for every i ∈ [N]. The Transformer N has width 16nN/L² (m = 6nN/L², h = k = 1 and q = 16nN/L²), depth O(n + L log L + (L / log L) · max{log C, log R}) and bit complexity bounded by O(log d + (L / log L) · max{log C, log R}), where we denote R := 8000r²δ⁻²dn⁵N⁶.

Proof. Stages 1 and 2 are the same as in the proof of Theorem 3.1. For stages 3 and 4, instead of memorizing all nN contextual token ids with a single chain of blocks, we construct nN/L² subnetworks, each of which memorizes L² contextual token ids. By stacking the subnetworks horizontally across the width, we obtain the result. We remark that the width is not increased for the self-attention layers, which are not used in the parallelized stages 3 and 4.

Theorem B.1 shows that, if the depth is bounded above by Õ(L) with L > n, then Õ(d + n + nN/L) parameters are enough to memorize N sequence classification examples of length n with token dimension d. We count the number of parameters as linear in the width instead of quadratic in the width because our construction uses nN/L² parallel subnetworks without interaction among them.

B.2 FIXED BIT COMPLEXITY

We state and prove the theorem for fixed bit complexity.

Theorem B.2. Assume the same setting as in Theorem 3.1, and let B ≤ √(nN). Then, there exists a Transformer network N : R^{d×n} → R^{1×n} and positional encoding E ∈ R^{d×n} such that N(X^(i) + E) = Y^(i) for every i ∈ [N]. The Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth O(n + (nN/B) log B + (nN/(B log B)) · max{log C, log R}) and bit complexity bounded by O(log d + (B / log B) · max{log C, log R}), where we denote R := 8000r²δ⁻²dn⁵N⁶.

Proof. Stages 1 and 2 are the same as in the proof of Theorem 3.1. For stages 3 and 4, instead of memorizing all nN contextual token ids with a single chain of blocks, we construct nN/B² subnetworks, each of which memorizes B² contextual token ids. By stacking the subnetworks vertically across the depth, we obtain the result.

C.1 SPARSE ATTENTION

The sparse self-attention subblock is defined as

F^(SSA)_l(Z) = Z + Σ_{i=1}^h W^(O)_{l,i} [ W^(V)_{l,i} Z_{A^l_k} σ_S( (W^(K)_{l,i} Z_{A^l_k})ᵀ W^(Q)_{l,i} Z_{A^l_k} ) ]_{k∈[n]},

where Z_{A^l_k} ∈ R^{m×|A^l_k|} denotes the submatrix consisting of the columns of Z in the index set A^l_k. Again, W^(O)_{l,i} ∈ R^{m×k} and W^(V)_{l,i}, W^(K)_{l,i}, W^(Q)_{l,i} ∈ R^{k×m} are the weight matrices parametrizing the sparse self-attention subblock. We make the following assumption on the sparsity pattern, which is the last of the three conditions in Assumption 1 in Yun et al. (2020b).

Assumption C.1. Define S¹_k := A¹_k and Sᵗ_k := ∪_{j ∈ A^{((t−1) mod p)+1}_k} S^{t−1}_j. We assume that the sparsity patterns {A^l_k} are such that there exists a finite s ∈ N with s = min{u | S^u_k = [n] for all k ∈ [n]}.

We provide the sparse-attention version of our main result.

Theorem C.1. Let N, d, n, C, s ∈ N and r ≥ 1, 0 < δ ≤ 1. Let (X^(1), Y^(1)), …, (X^(N), Y^(N)) ∈ R^{d×n} × [C]^{1×n} be a set of sequence-to-sequence data with distinct and tokenwise (r, δ)-separated input sequences. Then, there exists a sparse-attention Transformer network F : R^{d×n} → R^{1×n} with bit complexity O(log d + (√(nN)/log N) · max{log C, log R}) such that F(X^(i)P) = Y^(i)P for every i ∈ [N] and for every permutation matrix P ∈ R^{n×n}.

Proof. Stages 1, 3 and 4 are the same as in the proof of Theorem 3.1.
In stage 2, each step of N₂ in Lemma A.2 computes the maximum token id over the whole sequence using 1 self-attention layer. Under Assumption C.1, we can instead compute the maximum token id over the allowed sparsity pattern. Since the whole sequence is covered within a recursion of s consecutive sparsity patterns and taking the maximum is associative, repeating s self-attention layers gives the desired maximum token id over the whole sequence. The other components in stage 2 then work as before, and the resulting memorization is achieved in the same way. The only overhead in this approach is the s times larger number of self-attention layers.

The simplicity of our contextual mapping enables easy generalization to sparse attention. Since the sparse-attention Transformer only constrains the sparsity pattern in the self-attention subblock, stages 1, 3 and 4 in our construction work without any modification. For the contextual mapping in stage 2, we achieve the same memorization capacity with only s times more self-attention layers. The number of parameters is Õ(d + sn + √N).
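The argument that repeated sparse maxima recover the global maximum can be sketched directly. Below, a simple local-window pattern (a toy stand-in for the patterns {A^l_k}) is iterated until every position holds max(x); the number of rounds plays the role of s:

```python
def sparse_max_rounds(x, patterns):
    """Repeatedly take maxima over a sparse attention pattern; once the
    recursive receptive field covers [n], every position holds max(x)."""
    n = len(x)
    rounds = 0
    while any(v != max(x) for v in x):
        x = [max(x[j] for j in patterns[k]) for k in range(n)]
        rounds += 1
    return x, rounds

n = 8
# Local window: each position attends to itself and its two neighbors.
patterns = [[j for j in (k - 1, k, k + 1) if 0 <= j < n] for k in range(n)]
x = [3, 1, 4, 1, 5, 9, 2, 6]
out, s = sparse_max_rounds(x, patterns)
print(out, s)   # every entry equals 9 after s = 5 rounds
```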

C.2 IMPROVED CONTEXTUAL MAPPING FOR FUNCTION APPROXIMATION

Our idea can also be used to reduce the number of self-attention layers for the contextual mapping in function approximation settings. We recall the formal definition of a contextual mapping.

Definition C.2 (Definition 3.1 in Yun et al. (2020a), Contextual Mapping). Consider a finite set L ⊂ R^{d×n}. A map q : L → R^{1×n} is a contextual mapping if it satisfies the following: 1. For any L ∈ L, the n entries in q(L) are distinct. 2. For any L, L′ ∈ L with L ≠ L′, all entries of q(L) and q(L′) are distinct.

We state our improvement of the contextual mapping for the grid G_δ = {0, δ, …, 1 − δ}^{d×n} and its subset with pairwise distinct columns G̃_δ := {L ∈ G_δ | L_{:,i} ≠ L_{:,j} for all i ≠ j}. Assume that n ≥ 2 and δ⁻¹ ≥ 2. Then, there exist a function g_c : R^{4×n} → R^{4×n} composed of 3n Transformer blocks with h = 1, k = 1 and q = 4 that employ the hardmax operator, vectors w ∈ R^d, u ∈ R⁴, and constants t_l, t_r ∈ R (0 < t_l < t_r), such that q(L) := uᵀ g_c(wᵀL, wᵀL, 0, 0) satisfies the following properties:

1. For any L ∈ G̃_δ, the entries of q(L) are all distinct.
2. For any L, L′ ∈ G̃_δ such that L is not a permutation of L′, all entries of q(L), q(L′) are distinct.
3. For any L ∈ G̃_δ, all the entries of q(L) are in [t_l, t_r].
4. For any L ∈ G⁺_δ \ G̃_δ, all the entries of q(L) are outside [t_l, t_r].

Proof. Since the token embeddings are δ-discretized, we can concatenate the coordinates to obtain the token id, as in Yun et al. (2020a). Let w be the vector that represents this concatenation as a linear operation. As in stage 2 of our proof of Theorem 3.1, we concatenate the token ids in decreasing order of magnitude to obtain the sequence id in n steps. Then, we set u appropriately to obtain the "contextual token id", i.e., the concatenation of the token id and the sequence id, in q(L). The first three conditions are then easy to check. The first condition is trivially true because distinct tokens have distinct token ids and consequently distinct contextual token ids.
The second condition is also true because if L is not a permutation of L′, then their sequence ids differ. The third condition holds since a linear function on a compact region is bounded.

Consider the final condition. In any step of our efficient contextual mapping, if the maximum token id is zero, there must be duplicate tokens in the input sequence. Conversely, if there are duplicate tokens in the input sequence, the maximum token id is zero at some step of our efficient contextual mapping. We may use one more feedforward block in each step of the efficient contextual mapping to subtract M from the sequence id if such a zero maximum token id is observed, where M is the maximum value possible for the sequence id. Then, the sequence id is still negative at the end, so q(L) is also negative. Thus, the final condition is also true.

We highlight the difference in the architecture. The major difference is the number of Transformer blocks used: we use 3n layers (linear in the sequence length), while Yun et al. (2020a)

D OTHER TASKS

We provide the proof of Theorem 4.1 (Section D.1) and formal results on language modeling tasks. We consider two language modeling tasks that are commonly used to pre-train Transformers: masked language modeling (Section D.2) and autoregressive language modeling (Section D.3). Here, we use the slice index notation i : j for items from i (inclusive) to j (exclusive).

We mask m out of the n tokens in a sequence, with Q = (n choose m) possible masking patterns M^(1), …, M^(Q) ∈ {0, 1}ⁿ. We define the masked sequences M^(j) ∘ X^(i) extracted from X as the sequence X^(i) with columns masked according to the pattern M^(j). We say that a Transformer network N : R^{d×n} → R^{1×n} and positional encoding E ∈ R^{d×n} memorize masked language modeling of X, Y if N(M^(j) ∘ X^(i) + E) = Y^(i) for every i ∈ [P], j ∈ [Q].

Theorem D.1. Let T, d, n, m, V ∈ N and r ≥ 1, 0 < δ ≤ 1. Let X ∈ R^{d×T}, Y ∈ [V]^T be corpus data of T tokens represented as embedding vectors and token ids, respectively. Suppose that the masked sequences extracted from X are distinct and tokenwise (r, δ)-separated. Then, there exists a Transformer network N : R^{d×n} → R^{1×n} and positional encoding E ∈ R^{d×n} that memorize masked language modeling of X, Y. The Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth O(n + √(nPQ) log(nPQ) + (√(nPQ)/log(nPQ)) · max{log V, log R}) and bit complexity bounded by O(log d + (√(nPQ)/log(nPQ)) · max{log V, log R}), where we denote P = T − n + 1, Q = (n choose m) and R := 8000r²δ⁻²dn⁵P⁶Q⁶.

Proof. We apply Theorem 3.1 to memorize the PQ masked sequences M^(j) ∘ X^(i).
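The data extraction behind Theorem D.1 can be sketched in a few lines: slide a length-n window over the corpus and apply each of the (n choose m) masking patterns. The corpus values are toy placeholders (an assumption for illustration), with None marking a masked position:

```python
from itertools import combinations
from math import comb

T, n, m = 12, 5, 2
X = list(range(100, 100 + T))              # toy corpus of T token ids

# P = T - n + 1 length-n windows and Q = C(n, m) masking patterns.
windows = [X[i:i + n] for i in range(T - n + 1)]
patterns = [[0 if j in masked else 1 for j in range(n)]
            for masked in combinations(range(n), m)]
masked_seqs = [[t if keep else None for t, keep in zip(w, p)]
               for w in windows for p in patterns]

assert len(windows) == T - n + 1
assert len(patterns) == comb(n, m)
print(len(masked_seqs))                    # (T - n + 1) * C(n, m) = 8 * 10 = 80
```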

D.3 AUTOREGRESSIVE LANGUAGE MODELING

Let X ∈ R^{d×T} and Y ∈ [V]^T be corpus data of T tokens represented as embedding vectors and token ids, respectively. The corpus data is divided into P = T − n input sequences X^(1), …, X^(P) ∈ R^{d×n} of length n with targets y^(1), …, y^(P) ∈ [V] by taking X^(i) = X[:, i : i + n] and y^(i) = Y[i + n]. We call the X^(i) the input sequences extracted from X. We say that a Transformer network N : R^{d×n} → R^{1×n} and positional encoding E ∈ R^{d×n} memorize autoregressive language modeling of X, Y if N(X^(i) + E) = y^(i) for every i ∈ [P].

Theorem D.2. Let T, d, n, V ∈ N and r ≥ 1, 0 < δ ≤ 1. Let X ∈ R^{d×T}, Y ∈ [V]^T be corpus data of T tokens represented as embedding vectors and token ids, respectively. Suppose that the input sequences extracted from X are distinct and tokenwise (r, δ)-separated. Then, there exists a Transformer network N : R^{d×n} → R^{1×n} and positional encoding E ∈ R^{d×n} that memorize autoregressive language modeling of X, Y. The Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth O(n + √P log P + (√P/log P) · max{log V, log R}) and bit complexity bounded by O(log d + (√P/log P) · max{log V, log R}), where we denote P = T − n and R := 8000r²δ⁻²dn⁵P⁶.

Proof. We apply Theorem 4.1 to memorize the P input sequences X^(i).
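The next-token example extraction used in Theorem D.2 amounts to one slide of a length-n window, with the following token as the target. The toy token strings are placeholders (an assumption for illustration):

```python
T, n = 10, 4
X = [f"tok{i}" for i in range(T)]          # toy corpus of T tokens

# P = T - n next-token examples: inputs X[i:i+n], targets X[i+n].
examples = [(X[i:i + n], X[i + n]) for i in range(T - n)]

assert len(examples) == T - n
print(examples[0])   # (['tok0', 'tok1', 'tok2', 'tok3'], 'tok4')
```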

E TECHNICAL LEMMAS

Here, we state the technical lemmas that are used in our proofs.

Lemma E.1. Let N, d ∈ N and x^(1), …, x^(N) ∈ R^d. Then, there exists a unit vector u ∈ R^d such that (1/N²)√(8/(πd)) ∥x^(i) − x^(j)∥ ≤ |uᵀ(x^(i) − x^(j))| ≤ ∥x^(i) − x^(j)∥ for every i, j ∈ [N].

Lemma E.2. Let n ∈ N and r′, P > 1. Then, there exists a neural network F : R^{1×n} → R^{1×n} consisting of a single softmax self-attention layer with 1 head, head size 1 and bit complexity ⌈log log(8n^{3/2}r′P)⌉ such that F(xᵀ) = c1ₙᵀ, where c is a 1/(2P√n)-approximation of x_max = max_{i∈[n]} x[i].

Proof. With a hardmax activation on the attention matrix, it is easy to construct a network that satisfies the condition. Consider the following self-attention: F(x) = 1 · (1 · x) σ_H((1 · x)ᵀ(0 · x + 1ₙᵀ)) = x σ_H(xᵀ1ₙᵀ) = max_{i∈[n]} x[i] · 1ₙᵀ. Indeed, the output is exactly x_max, and the bit complexity is 1 since all weights are either 0 or 1. To approximate the hardmax with a softmax, we introduce a large factor t > 0 in the attention matrix. Consider the following self-attention: F(x) = 1 · (1 · x) σ_S((t · x)ᵀ(0 · x + 1ₙᵀ)) = x σ_S(txᵀ1ₙᵀ) = c1ₙᵀ. Since c is a convex combination of the x[i]'s, x_max upper bounds c. It suffices to find t that satisfies the lower bound condition.
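The softmax-as-approximate-max argument of Lemma E.2 can be checked numerically: c = xᵀ softmax(tx) is a convex combination of the entries of x, so c ≤ max(x), and the gap vanishes as the temperature t grows. The input values below are toy choices:

```python
import math

def softmax_max(x, t):
    """Lemma E.2 sketch: c = x^T softmax(t x) is a convex combination of
    the entries of x, so c <= max(x), and c -> max(x) as t grows."""
    m = max(x)
    weights = [math.exp(t * (v - m)) for v in x]     # stable softmax
    Z = sum(weights)
    return sum(v * wgt for v, wgt in zip(x, weights)) / Z

x = [2.0, 5.0, 3.0, 1.0]
for t in (1.0, 10.0, 100.0):
    assert softmax_max(x, t) <= max(x) + 1e-12       # never overshoots

# The approximation error shrinks rapidly with t.
assert max(x) - softmax_max(x, 100.0) < 1e-6
print(round(softmax_max(x, 100.0), 6))               # 5.0
```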



We recall that m is the embedding dimension, h is the number of attention heads, k is the attention head size, and q is the feedforward dimension.

When X^(1), ..., X^(N) ∈ R^(d×n) are distinct and (r, δ)-separated, then (1) ∥X^(i)∥_F ≤ r√n for i ∈ [N] and (2) ∥X^(i) − X^(j)∥_F ≥ δ for i, j ∈ [N] with i ≠ j.

General sequence-to-sequence mappings do not satisfy the condition: for example, the same word appearing multiple times in a sentence may have different meanings.

There are C possible labels for each of the n tokens in the N sequences. We consider inputs as dn-dimensional flattened vectors and outputs as values from [C^n].

We choose a different balancing point between stages 2 and 3 in their construction to balance the extra factors.

A similar benefit of Transformers has been previously observed in function approximation (Yun et al., 2020a): the number of parameters is reduced by a factor of (n − 1)! to successfully approximate permutation equivariant functions.
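The two Frobenius-norm consequences of (r, δ)-separation quoted above can be verified mechanically. The checker below is our own illustration (it tests the derived conditions (1) and (2), not the paper's underlying tokenwise definition):

```python
import numpy as np

def is_separated(seqs, r, delta):
    """Check conditions (1) and (2): every sequence has Frobenius norm
    at most r * sqrt(n), and distinct sequences are at least delta apart
    in Frobenius norm."""
    n = seqs[0].shape[1]
    if any(np.linalg.norm(S) > r * np.sqrt(n) for S in seqs):
        return False
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if np.linalg.norm(seqs[i] - seqs[j]) < delta:
                return False
    return True

rng = np.random.default_rng(1)
seqs = [rng.standard_normal((4, 5)) for _ in range(3)]   # N=3 sequences, d=4, n=5
r = max(np.linalg.norm(S) for S in seqs) / np.sqrt(5)
assert is_separated(seqs, r=r + 1e-9, delta=1e-6)
assert not is_separated(seqs, r=r / 2, delta=1e-6)       # radius too small
```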



Vuckovic et al. (2020) and Kim et al. (2021) study the Lipschitz smoothness of attention operations. Edelman et al. (2022) derive norm-based generalization bounds from the Lipschitz smoothness of norm-bounded attention layers to analyze the inductive bias of attention layers. Wei et al. (2021) study the expressivity of Transformers under the constraint of statistical learnability.

Remark 3.5. Comparison against fully-connected ReLU networks. With a slight modification of the results in Vardi et al. (2022), fully-connected ReLU networks require Õ(dn + √(nN)) parameters.

Theorem 4.1 shows that Õ(d + n + √N) parameters are enough to memorize N sequence classification examples of length n with token dimension d. Compared to the sequence-to-sequence mapping, there is a √n factor of savings in the last term.

Figure 1: Heatmaps of training errors. We show color-coded training errors as the dataset size and the model size vary. The dataset size is shown on the Y-axis, and the model size, represented by the embedding size, on the X-axis. The training error tends to improve for smaller dataset sizes and larger model sizes.

, we balance A and B as A = √(nN log(nN)) and B = √(nN log(nN)) and conclude the proof of Theorem 3.1.

Theorem B.2 shows that, if the bit complexity is bounded above by Õ(B), then Õ(d + n + nN/B) parameters are enough to memorize N sequence classification examples of length n with token dimension d. We count the number of parameters as linear in width instead of quadratic in width because our construction uses nN/L² parallel subnetworks without interaction among them.

C CONTEXTUAL MAPPING IMPLICATIONS

C.1 SPARSE ATTENTION TRANSFORMERS

This section shows how our result generalizes to sparse-attention Transformers, which replace the self-attention subblocks with sparse counterparts. Let A_k^l ⊂ [n] be the l-th sparsity pattern of the k-th token, where k ∈ [n], l ∈ [p]. Given an input Z ∈ R^(m×n), the sparse self-attention subblock F^(SSA)_l with h heads and head size k computes
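A minimal sketch of one sparse self-attention head restricted to per-token patterns, in the spirit of the subblock described above (our own illustration; the weight names Wq, Wk, Wv are hypothetical placeholders, and a fixed pattern per token stands in for the family A_k^l):

```python
import numpy as np

def sparse_self_attention(Z, Wq, Wk, Wv, patterns):
    """One sparse self-attention head: token k attends only to the
    positions in patterns[k] (a subset of range(n)) instead of all n
    tokens as in dense softmax attention. Z has shape m x n."""
    m, n = Z.shape
    Q, K, V = Wq @ Z, Wk @ Z, Wv @ Z
    out = np.zeros((V.shape[0], n))
    for k in range(n):
        idx = sorted(patterns[k])
        scores = Q[:, k] @ K[:, idx]        # scores only over the pattern
        w = np.exp(scores - scores.max())   # stable softmax over the pattern
        w /= w.sum()
        out[:, k] = V[:, idx] @ w
    return out

m, n = 6, 5
rng = np.random.default_rng(2)
Z = rng.standard_normal((m, n))
Wq = Wk = Wv = np.eye(m)
patterns = {k: {k, (k + 1) % n} for k in range(n)}   # a toy local pattern
out = sparse_self_attention(Z, Wq, Wk, Wv, patterns)
assert out.shape == (m, n)
```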

1×n be N input-output pairs of sequences that satisfy the assumptions above. Denote R := 8000 r² δ⁻² d n⁵ N⁶. Then, there exists a Transformer network F : R^(d×n) → R^(1×n) with width 16 (m = 6, h = k = 1 and q = 16), depth O(ns + √(nN log N) + √(nN log N) · max{log C, log R}) and bit complexity bounded by

Theorem C.2. (Improved version of Lemma 6 in Yun et al. (2020a)) Consider the following subset of

use δ^(−d) + 1 layers (exponential in the embedding dimension). Since the sequence length and the embedding dimension are of the same order in practice, our construction exponentially improves Lemma 6 in Yun et al. (2020a). The minor differences in the architecture are the intermediate embedding dimension (4 in ours versus d in Yun et al. (2020a)) and the number of attention heads (1 in ours versus 2 in Yun et al. (2020a)).

PROOF OF THEOREM 4.1

Stages 1 and 2 are the same as in the proof of Theorem 3.1. Instead of classifying the contextual token id, we can directly classify sequence ids in stages 3 and 4. Since there are N possible sequence ids, we replace nN with N in the parameter complexity from Theorem 3.1.

D.2 MASKED LANGUAGE MODELING

Let X ∈ R^(d×T) and Y ∈ [V]^T be corpus data of T tokens represented as embedding vectors and token ids, respectively. The corpus data is divided into P = T − n + 1 sequences X^(1), ..., X^(P) ∈ R^(d×n) and Y^(1), ..., Y^(P) ∈ [V]^n of length n by taking X^(i) = X[:, i : i + n] and Y^(i) = Y[i : i + n].
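The masked-language-modeling slicing differs from the autoregressive one only in the window count (P = T − n + 1) and in pairing each input window with the full label window rather than a single next token. A sketch (our own illustration, 0-based indexing):

```python
import numpy as np

def extract_masked_lm_pairs(X, Y, n):
    """Slice the corpus into the P = T - n + 1 overlapping windows used
    for masked language modeling: inputs X[:, i:i+n] paired with the
    full label window Y[i:i+n]."""
    d, T = X.shape
    P = T - n + 1
    return ([X[:, i:i + n] for i in range(P)],
            [Y[i:i + n] for i in range(P)])

d, T, n = 3, 8, 4
rng = np.random.default_rng(3)
X = rng.standard_normal((d, T))
Y = rng.integers(1, 6, size=T)
Xs, Ys = extract_masked_lm_pairs(X, Y, n)
assert len(Xs) == T - n + 1                         # one more window than the AR case
assert Xs[0].shape == (d, n) and Ys[0].shape == (n,)
```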

Lemma E.1. (Lemma 13 from Park et al. (2021)) Let N, d ∈ N and x^(1), ..., x^(N) ∈ R^d. Then, there exists a unit vector u ∈ R^d such that (1 / (8 N² √(πd))) ∥x^(i) − x^(j)∥ ≤ |u^T (x^(i) − x^(j))| ≤ ∥x^(i) − x^(j)∥ for every i, j ∈ [N].
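The existence claim can be probed empirically: a random unit vector typically projects every pairwise difference to a nonzero fraction of its length. The search procedure below is our own illustration (the lemma is an existence statement, not this algorithm):

```python
import numpy as np

def best_random_projection(xs, trials=2000, seed=0):
    """Sample random unit vectors u and keep the one maximizing the
    worst-case ratio |u^T (x_i - x_j)| / ||x_i - x_j|| over all pairs,
    in the spirit of the lemma above."""
    rng = np.random.default_rng(seed)
    N, d = len(xs), xs[0].shape[0]
    diffs = [xs[i] - xs[j] for i in range(N) for j in range(i + 1, N)]
    best_u, best_ratio = None, -1.0
    for _ in range(trials):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        ratio = min(abs(u @ z) / np.linalg.norm(z) for z in diffs)
        if ratio > best_ratio:
            best_u, best_ratio = u, ratio
    return best_u, best_ratio

xs = [np.array(v, dtype=float) for v in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1])]
u, ratio = best_random_projection(xs)
assert abs(np.linalg.norm(u) - 1) < 1e-9   # u is a unit vector
assert ratio > 0                           # all pairs stay separated after projection
```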

1_n^T with x_max − 1/(2P√n) ≤ c ≤ x_max whenever x ∈ R^n satisfies (i) |x[i]| ≤ 2r′ for i ∈ [n] and (ii) x[i] ≤ x_max − 2 for every i ∈ [n] with x[i] ≠ x_max, where we denote x_max = max_{i∈[n]} x[i].

exp(t x[i]) / (Σ_{j=1}^n exp(t x[j]))

. Thus, our assumption is strictly weaker.

Assumption C.1. (Relaxed version of Assumption 1 in Yun et al. (2020b)) Define a sequence of sets


Choose t = ⌈(1/2) log(8 n^(3/2) r′ P)⌉. We lower bound the total softmax weight p_max placed on the maximal entries, where n_max := |{i : x[i] = x_max}|. Now, we can lower bound c as

c ≥ x_max p_max − 2r′(1 − p_max).

This self-attention module only has 1 head with head size 1. All weights are either 0, 1 or t, so the bit complexity is ⌈log t⌉ ≤ ⌈log log(8 n^(3/2) r′ P)⌉.

Lemma E.3. Let r′ ∈ N. Then, there exists a neural network F : R² → R with 1 hidden layer, width 4 and bit complexity ⌈log 2r′⌉ such that

Proof. Consider the following neural network: It is straightforward to see that the network F computes the desired function. This network has 1 hidden layer and width 4. All parameters are either ±1, ±2 or ±r′, so the bit complexity is ⌈log 2r′⌉.

Lemma E.4. Let a, b, w ∈ N with a < b. Then, there exists a neural network F : R → R with 1 hidden layer, width 4 and bit complexity ⌈log w⌉ + ⌈log(2b + 1)⌉ such that

Proof. Consider the following neural network: It is straightforward to see that the network F computes the desired function. This network has 1 hidden layer and width 4. All parameters are either ±w, 2 or 2a − 1, 2a, 2b, 2b + 1, so the bit complexity is ⌈log w⌉ + ⌈log(2b + 1)⌉.

Lemma E.5. Then, there exists a neural network F : R³ → R³ consisting of j − i + 1 feedforward blocks with skip-connections every 2 layers, feedforward dimension 8 and bit complexity 2n such that

Proof. We construct F where each feedforward block contains a single feedforward layer with 8 hidden units and a skip-connection. In step l, the block computation in Equation 2 ensures that F computes the desired function.

Construction of F_l. To obtain Equation 2, we define the feedforward block F_l with a skip-connection as follows. We first note that the triangle function ϕ can be implemented with one hidden layer. Also, we can implement the identity function. Thus, the following 8 hidden units are enough to represent F_l.

Published as a conference paper at ICLR 2023

We check Equation 2 as follows. Finally, since all parameters have the form 2^k for some −1 ≤ k ≤ 2n, the bit complexity is 2n.
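The exact ϕ used in the construction is not reproduced above, but a standard tent ("triangle") map admits a one-hidden-layer ReLU realization whose weights are all powers of 2, consistent with the 2^k parameter claim. A minimal sketch under that assumption:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def triangle(x):
    """One-hidden-layer ReLU realization of a tent/'triangle' map on
    [0, 1]: rises from 0 to 1 on [0, 1/2], falls back to 0 on [1/2, 1],
    and is 0 outside [0, 1]. All weights are 2 or 4 = 2^2."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

xs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
assert np.allclose(triangle(xs), [0.0, 0.5, 1.0, 0.5, 0.0])
assert triangle(np.array([1.5]))[0] == 0.0   # vanishes outside [0, 1]
```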

F EXPERIMENTAL SETUP

We use the HuggingFace 9 PyTorch implementation of the BERT model for our experiments. All experiments are conducted on an Nvidia Quadro RTX 5000 GPU with 16 GB memory in a machine with an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz.

As mentioned in the main paper, we use 14,000 random samples from the CoNLL-2003 named entity recognition dataset (Tjong Kim Sang & De Meulder, 2003) for token classification and 50,000 random samples from the MNLI dataset in the GLUE benchmark (Wang et al., 2019) for sequence classification. For token classification, the task is to classify the named entity type of each token among 9 possible classes. The sequence classification dataset aims to classify the relationship between sentence pairs into 3 classes: entailment, contradiction, and neutral. We vary the dataset size by randomly ordering the examples and picking the first p% for p = 10, 20, ..., 100.

We vary the model size through the embedding size m, which is varied in multiples of 12 and 96 for the token and sequence classification tasks, respectively. As mentioned in the main paper, we fix the number of layers as L = 6, the number of attention heads as h = 12, the embedding-to-head-size ratio as m/k = h = 12 and the feedforward-to-embedding-size ratio as q/m = 4, as commonly done in practice.

We optimize using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.00002, batch size 32 and dropout rate 10%. We train our models for 1,500 and 7,500 steps for token and sequence classification, respectively. We choose these numbers of steps to ensure that the training error does not improve for at least the last 3 epochs.

For Figure 2, we choose the minimum-size memorizing model as the smallest model that reaches a training error of 0.005. The maximum training errors of the selected models are 0.00499 and 0.00450 for the token and sequence classification tasks, respectively. The average training errors of the selected models are 0.00464 and 0.00259 for the token and sequence classification tasks, respectively.
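The tied hyperparameter ratios above (m/k = h = 12, q/m = 4, L = 6) determine every dimension from the embedding size alone. A small helper (our own illustration; the function name is hypothetical) makes the bookkeeping explicit:

```python
def bert_dims(m, h=12, ff_ratio=4, layers=6):
    """Derive the tied BERT dimensions used in the experiments from the
    embedding size m: head size k = m / h and feedforward size q = 4 * m."""
    assert m % h == 0, "embedding size must be divisible by the head count"
    return {"embedding": m, "heads": h, "head_size": m // h,
            "feedforward": ff_ratio * m, "layers": layers}

cfg = bert_dims(96)
assert cfg["head_size"] == 8 and cfg["feedforward"] == 384
```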
9 https://huggingface.co/ 

