PROVABLE MEMORIZATION CAPACITY OF TRANSFORMERS

Abstract

Quantifying memorization capacity is essential for understanding the expressiveness and generalizability of deep learning model architectures. However, the memorization capacity of the Transformer architecture has yet to be explored. In this work, we present the first study of the memorization capacity of the Transformer architecture. We prove that Transformers are capable of memorizing N sequence-to-sequence mappings of length n with d-dimensional input tokens using Õ(d + n + √(nN)) parameters. Our theory supports memorization both with and without permutation equivariance, utilizing positional encodings in the latter case. Building on our theory, we also analyze the memorization capacity of Transformers in the sequence classification and language modeling tasks. To verify these theoretical findings, we conduct experiments analyzing the memorization capacity of Transformers in the natural language domain.

1. INTRODUCTION

Transformer networks (Vaswani et al., 2017) have shown tremendous success in natural language processing tasks (Devlin et al., 2019; Yang et al., 2019; Brown et al., 2020; Fedus et al., 2022), rapidly becoming the standard architecture for natural language modeling. The success of Transformers has also transferred to various other sequence and set modeling tasks, including image recognition (Parmar et al., 2018; Dosovitskiy et al., 2021), semantic segmentation (Zheng et al., 2021), video understanding (Akbari et al., 2021; Bertasius et al., 2021), reinforcement learning (Parisotto et al., 2020; Chen et al., 2021; Janner et al., 2021), 3D point cloud processing (Zhao et al., 2021), protein structure prediction (Jumper et al., 2021), and automatic theorem proving (Polu & Sutskever, 2020).

Despite this success across various areas, the theoretical understanding of Transformers lags behind that of standard fully-connected networks. The major strength of Transformers is their efficient scaling, which is enabled through parallel token processing with parameter sharing and simple dot-product-based token interaction. Surprisingly, even though the parameter sharing and simple token interaction impose constraints on the function space of Transformers, Yun et al. (2020a) show that Transformers can approximate any continuous function from input to output sequences. However, their result focuses on function approximation capacity with infinite precision, leaving the finite-sample memorization capacity with finite precision unexplored. We note that universal function approximation does not automatically imply efficient memorization in terms of the number of parameters. Generalizing infinite-precision results to the finite-precision case is not straightforward and may not be possible in some cases. For example, Transformers are Turing complete only with infinite precision (Pérez et al., 2019), but not with finite precision (Dehghani et al., 2019).
Understanding the memorization capacity of a model is critical for choosing an appropriate model size. Practitioners often choose a model size with enough representation capacity to achieve zero training loss (i.e., a size larger than the memorization capacity). Moreover, the memorization capacity has generalization implications, as observed in the double descent phenomenon (Belkin et al., 2019; Nakkiran et al., 2021): as the network size increases, generalization performance exhibits a bias-variance tradeoff until memorization becomes possible and then improves monotonically afterward.

Understanding the memorization capacity of Transformers requires answers to the following questions: How large should the size and precision of the Transformer architecture be to enable memorization of any given number of input-output sequence pairs? How does the memorization capacity of Transformers differ across various problem settings in practical application scenarios?

In this paper, we answer these questions by proving that Transformers can memorize N sequences of d-dimensional tokens with length n using Õ(d + n + √(nN)) parameters. Our proof constructs permutation equivariant Transformers that can memorize all permutations of N input sequences. We extend this construction to memorization without permutation equivariance by adding positional encodings. In addition, we derive the memorization capacity for the sequence classification task from our proposed theory. The key technical component of our construction is an efficient contextual mapping, which requires only n self-attention layers. Our contextual mapping also applies to sparse-attention Transformers, making fewer assumptions on sparsity patterns than Yun et al. (2020b). Our main contributions are summarized as follows:

• We prove the memorization capacity of Transformers for sequence-to-sequence mappings with and without permutation equivariance.
• We analyze the memorization capacity in other standard task settings, such as sequence classification and language modeling.
• We show that the efficient contextual mapping presented in our theoretical analysis extends to sparse attention settings and improves the function approximation results.
• We provide experiments validating the memorization capacity of Transformers for token classification and sequence classification tasks.
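To make the headline bound concrete, the following toy calculation compares Õ(d + n + √(nN)) against the N·n·d scalars needed to store the dataset verbatim. This is purely illustrative: the function names are ours, and the constant c stands in for the hidden constants and logarithmic factors that the Õ(·) notation absorbs.

```python
import math

def memorization_bound(d, n, N, c=1.0):
    """Illustrative evaluation of the O~(d + n + sqrt(n*N)) parameter
    bound; c stands in for hidden constants and log factors."""
    return c * (d + n + math.sqrt(n * N))

def raw_storage(d, n, N):
    """Scalars needed to store N length-n sequences of d-dim tokens."""
    return N * n * d

# Example: N = 10_000 sequences, n = 128 tokens, d = 64 dimensions.
d, n, N = 64, 128, 10_000
print(f"bound ~ {memorization_bound(d, n, N):,.0f} parameters")  # sublinear in N
print(f"raw storage = {raw_storage(d, n, N):,} scalars")
```

The gap widens as N grows: the bound scales as √N for fixed n and d, while verbatim storage scales linearly in N.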

1.1. RELATED WORKS

Memorization capacity. Characterizing the memorization capacity of neural networks has been an active research area with a long history (Baum, 1988; Sontag, 1997; Huang & Babri, 1998; Huang, 2003; Zhang et al., 2017; Yun et al., 2019; Bubeck et al., 2020; Vershynin, 2020; Rajput et al., 2021; Park et al., 2021; Vardi et al., 2022). Recently, Park et al. (2021) constructed neural networks with O(N^{2/3}) parameters to memorize N data points. They bypass the Ω(N) lower bound in Sontag (1997) by assuming a simple separation (i.e., ∥x_i − x_j∥ ≥ δ for all i ≠ j). Vardi et al. (2022) improve this further, showing that Õ(N^{1/2}) parameters are sufficient. They also prove the matching lower bound of Ω(N^{1/2}) through a VC-dimension analysis. Inspired by Park et al. (2021) and Vardi et al. (2022), our construction assumes a similar separation, but between pairs of distinct tokens rather than between whole sequence pairs. (See Definition 3.1 and the discussion that follows the definition.) In addition, our construction uses the same pipeline of projection, string matching, and bit extraction as in Vardi et al. (2022). However, we introduce an additional critical step: an efficient contextual mapping that complements projection by summarizing all token information via self-attention layers.

In contrast to the extensive results on fully-connected networks, there are few studies on the memorization capacity of specific modern architectures. Hardt & Ma (2017) show that a residual network with ReLU activation and O(N) hidden neurons can memorize N data points under the separation assumption. Nguyen & Hein (2018) show that a convolutional network with O(N) hidden neurons can memorize N data points. To the best of our knowledge, there is no existing literature on the memorization capacity of the Transformer architecture.

Transformer expressivity.
Given the recent empirical success of Transformers observed across multiple areas, several papers have studied the expressivity of Transformers. Yun et al. (2020a) establish the first universal approximation theorem for Transformers, and the result is later extended to sparse-attention Transformers (Yun et al., 2020b; Zaheer et al., 2020) and Transformers with hard constraints (Kratsios et al., 2022). All these results study the function approximation but not

(2020b) and Zaheer et al. (2020). Furthermore, we present the generalization of contextual mapping to function approximation settings, vastly improving the parameter efficiency of attention layers compared to the selective-shifting-based contextual mapping in Yun et al. (2020a).
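The token-wise separation discussed in Section 1.1 (Definition 3.1) requires distinct tokens, rather than whole sequences, to be at least δ apart. A minimal sketch of that check is below; the helper and its name are ours for illustration and are not part of the paper's construction.

```python
import numpy as np

def tokenwise_separated(sequences, delta):
    """Return True if every pair of *distinct* tokens across the dataset
    is at least delta apart in Euclidean norm. Repeated tokens (distance
    zero) are allowed, unlike in sequence-level separation."""
    tokens = np.concatenate(sequences, axis=0)  # shape: (total_tokens, d)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            dist = np.linalg.norm(tokens[i] - tokens[j])
            if 0.0 < dist < delta:  # distinct but too close together
                return False
    return True

# Two sequences of 2-d tokens; the closest distinct pair is 1.0 apart,
# and one token repeats across sequences (which is permitted).
seqs = [np.array([[0.0, 0.0], [1.0, 0.0]]),
        np.array([[0.0, 0.0], [3.0, 0.0]])]
print(tokenwise_separated(seqs, delta=1.0))  # -> True
print(tokenwise_separated(seqs, delta=1.5))  # -> False (pair at distance 1.0)
```

Because the condition is imposed per token rather than per sequence, datasets whose sequences share many identical tokens can still satisfy it.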

