PROVABLE MEMORIZATION CAPACITY OF TRANSFORMERS

Abstract

Quantifying memorization capacity is essential for understanding the expressiveness and generalizability of deep learning model architectures. However, the memorization capacity of the Transformer architecture has yet to be explored. In this work, we present the first study of the memorization capacity of the Transformer architecture. We prove that Transformers are capable of memorizing N sequence-to-sequence mappings of length n with d-dimensional input tokens using Õ(d + n + √(nN)) parameters. Our theory supports memorization both with and without permutation equivariance, utilizing positional encodings in the latter case. Building on our theory, we also analyze the memorization capacity of Transformers in the sequence classification and language modeling tasks. To verify these theoretical findings, we conduct experiments analyzing the memorization capacity of Transformers in the natural language domain.
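To give a feel for the scale of the Õ(d + n + √(nN)) bound, the following sketch evaluates it (ignoring the polylogarithmic factors hidden in the Õ notation) for hypothetical, illustrative values of N, n, and d, and compares it against the size of a naive lookup table storing every token of every mapping. The function name and the chosen values are our own, not from the paper.

```python
import math

def memorization_param_bound(N: int, n: int, d: int) -> float:
    """Parameter count suggested by the Õ(d + n + sqrt(nN)) bound,
    with polylogarithmic factors dropped.

    N: number of sequence-to-sequence mappings to memorize
    n: sequence length
    d: input token dimension
    """
    return d + n + math.sqrt(n * N)

# Hypothetical example: N = 10,000 mappings of length n = 128
# with d = 768-dimensional tokens (values chosen for illustration only).
N, n, d = 10_000, 128, 768
bound = memorization_param_bound(N, n, d)
naive = N * n * d  # explicitly storing every token of every mapping

print(f"O-tilde bound (log factors dropped): ~{bound:.0f} parameters")
print(f"naive lookup table: {naive:,} values")
```

Note the sublinear √N dependence on the number of mappings: quadrupling N only doubles the √(nN) term, whereas the naive table grows linearly in N.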

1. INTRODUCTION

Transformer networks (Vaswani et al., 2017) have shown tremendous success in natural language processing tasks (Devlin et al., 2019; Yang et al., 2019; Brown et al., 2020; Fedus et al., 2022), rapidly becoming the standard architecture for natural language modeling. The success of Transformers has also transferred to various other sequence and set modeling tasks, including image recognition (Parmar et al., 2018; Dosovitskiy et al., 2021), semantic segmentation (Zheng et al., 2021), video understanding (Akbari et al., 2021; Bertasius et al., 2021), reinforcement learning (Parisotto et al., 2020; Chen et al., 2021; Janner et al., 2021), 3D point cloud processing (Zhao et al., 2021), protein structure prediction (Jumper et al., 2021), and automatic theorem proving (Polu & Sutskever, 2020).

Despite this success across various areas, the theoretical understanding of Transformers lags behind that of standard fully-connected networks. The major strength of Transformers is in their efficient scaling, which is enabled through parallel token processing with parameter sharing and simple dot-product-based token interaction. Surprisingly, even though the parameter sharing and simple token interaction impose constraints on the function space of Transformers, Yun et al. (2020a) show that Transformers can approximate any continuous function from input to output sequences. However, their result focuses on the function approximation capacity with infinite precision, leaving the finite sample memorization capacity with finite precision unexplored. We note that universal function approximation does not automatically imply efficient memorization in terms of the number of parameters. Generalizing infinite precision results to the finite precision case is not straightforward and may not be possible in some cases. For example, Transformers are Turing complete only with infinite precision (Pérez et al., 2019), but not with finite precision (Dehghani et al., 2019).
Understanding the memorization capacity of a model is critical for choosing an appropriate model size. Practitioners often choose a model size with enough representation capacity to achieve zero training loss (i.e., a size larger than the memorization capacity). Moreover, the memorization capacity has generalization implications, as observed in the double descent phenomena (Belkin et al., 2019; Nakkiran et al., 2021). As the network size increases, generalization performance exhibits a bias-variance tradeoff until memorization is possible, and then improves monotonically afterward.

