THE LAZY NEURON PHENOMENON: ON EMERGENCE OF ACTIVATION SPARSITY IN TRANSFORMERS

Abstract

This paper studies a curious phenomenon: machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B/16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small k brings a collection of desired properties, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of prediction confidence.

1. INTRODUCTION

The great success of modern machine learning for tasks in computer vision, natural language processing, game playing and beyond is driven primarily by the computational model known as deep neural networks (DNNs) (LeCun et al., 2015). With inspirations drawn from biological intelligent systems, DNNs are massive systems of distributed computational nodes (a.k.a. neurons) with learned inter-connections, which possess the capacity to accomplish complex real-world tasks. Although DNNs were originally motivated by biological brains, they differ from biological neural networks at very fundamental levels. One such difference is in the sparsity of neural activities. Evidence from neuroscience suggests that neural activity in biological brains is sparse, namely, only a small percentage of all neurons fire at any given time (Ahmed et al., 2020; Barth & Poulet, 2012; Kerr et al., 2005; Poo & Isaacson, 2009). Sparse firing suggests that despite having billions of neurons, only a small fraction of the brain participates in computation at any given time, which may explain why brains can operate at a very low energy cost. In contrast, learning and inference with DNNs rely primarily on dense computations where all neurons are involved for any input. In fact, modern computational hardware for deep neural networks, such as GPUs and TPUs, is designed to facilitate massive-scale dense computations. Even with such dedicated hardware, DNNs are still notoriously resource-demanding to train and deploy. Aside from computational efficiency, DNNs also lag far behind biological brains in terms of robustness to input perturbation, error correction for erroneous training labels, confidence calibration for the predictions, etc.

1.1. AN INTRIGUING OBSERVATION: ACTIVATIONS ARE SPARSE IN TRAINED TRANSFORMERS

This paper provides an extensive study of a surprising observation: despite performing dense computations, DNNs produce very sparse activations in their intermediate layers once trained. Specifically, we study the Transformer (Vaswani et al., 2017), a DNN model architecture that has become a workhorse for modern applications. Transformers are constructed by interleaving a self-attention module and a multi-layer perceptron (MLP) of depth 2, and the focus of this paper is on the activation map of the first MLP layer. Figure 1 shows the sparsity of the activation maps, measured by the percentage of nonzeros, in all MLP layers of a T5-Base model (Raffel et al., 2020) computed on the training set of C4. We see that the percentage of nonzero entries is around 50% at initialization, which is expected: randomly initialized weights produce roughly equal numbers of positive and negative entries in the pre-activation map, resulting in about 50% nonzeros after the ReLU. However, at the end of training the percentage of nonzero entries reduces drastically: the average value across all encoder-decoder layers is 2.7%, with the largest one being 12.0% and the smallest one being only 1.1%. The emergence of sparse activation in Transformers bears a similarity to the sparsity of neural activities in biological brains, revealing an interesting connection between artificial and biological networks. Moreover, unlike classical sparse methods where such a connection is established via explicit sparse regularization (Olshausen & Field, 1996), the sparsity observed in Transformers is emergent without any explicit design.
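The claim that about half of the entries are nonzero at initialization can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: weights and input are drawn from a zero-mean Gaussian (the actual T5 initializer is not specified here), the dimensions are toy values, and the function name `fraction_nonzero_at_init` is hypothetical.

```python
import random


def fraction_nonzero_at_init(d_model=64, d_ff=2000, seed=0):
    """Fraction of nonzero entries after ReLU for one random input.

    Assumption: both the first-layer weights and the input are i.i.d.
    zero-mean Gaussian, so each pre-activation is symmetric around zero.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
    nonzero = 0
    for _ in range(d_ff):
        # One randomly initialized neuron (one column of the first layer).
        k_i = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
        pre_activation = sum(k * xj for k, xj in zip(k_i, x))
        # ReLU keeps the entry only if the pre-activation is positive.
        if pre_activation > 0.0:
            nonzero += 1
    return nonzero / d_ff


frac = fraction_nonzero_at_init()
print(f"{100 * frac:.1f}% of entries are nonzero at random initialization")
```

Because the pre-activations are symmetric around zero, the printed value should land close to 50%, which is the baseline against which the trained-model figures (a few percent) are compared.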

1.2. PREVALENCE, BENEFITS, AND CAUSES OF SPARSITY

This paper studies the aforementioned phenomenon of sparse activation in trained Transformers, with a focus on the following two questions. First, is the phenomenon shown in Figure 1 a corner case or does it occur broadly? Second, why should we care about the sparsity in DNNs, other than the appeal of its similarity to biological brains? Our main results along these two lines are summarized below.

1. Sparsity is a prevalent phenomenon. We show in Section 2 that the emergence of sparse activation reported in Figure 1 is not an isolated, cherry-picked case. Rather, sparsity is prevalent and occurs broadly in Transformer models: it emerges in all layers of a Transformer, for Transformers trained on both vision and natural language data, for Transformers of various configurations, and for activation maps computed on both training and test data. Moreover, through controlled experiments on the width and depth of Transformers, we reveal that larger models are sparser, as measured by the percentage of nonzero entries. We also show in Appendix B that sparsity emerges with many other architectures and with different optimizers.

2. Sparsity improves efficiency. Sparsity of the activation map in trained Transformers implies that a large proportion of the computation during inference is spent on multiplying values by zero. Hence, FLOPs can be drastically reduced by avoiding all such computations, which we discuss in Section 3.1. Motivated by this observation, and to obtain reduced FLOPs not only after training but throughout training, we introduce the Top-k Transformer in Section 3.2, a simple modification of Transformers where a Top-k thresholding is applied to the activation maps. We show that Top-k Transformers with a reasonably sized k perform on par with vanilla Transformers. To demonstrate the computational benefits of Top-k Transformers, we provide proof-of-concept results on wall time reduction for the task of unbatched decoding on TPUv4 with a large Top-k T5. Meanwhile, we emphasize that this result is far from fully realizing the benefit of sparse activation, due to a lack of hardware support for sparse computation.

3. Sparsity improves robustness and calibration. We further show in Section 3.3 that enforcing explicit sparsity via Top-k Transformers improves model performance in terms of less sensitivity to noisy training data, less sensitivity to input corruptions, and better confidence calibration.

In addition, we provide a study on the causes of sparsity in Appendix D, showing that sparsity is likely not an artifact of the training data, and may be attributed to the training dynamics in the optimization process.

1.3. EXPERIMENTAL SETUP

We study the sparsity in activation maps of Transformers with two commonly used Transformer models, namely Text-to-Text Transfer Transformer (i.e., T5) and Vision Transformer (i.e., ViT).

• T5 is an encoder-decoder model for natural language processing tasks (Raffel et al., 2020). We train T5 on the Colossal Clean Crawled Corpus (C4) using the span corruption task.

• ViT is an encoder model for vision tasks (Dosovitskiy et al., 2021). Unless specified otherwise, we train ViT on ImageNet-21k (Deng et al., 2009), an image classification dataset with 14M images and 21k classes. For certain cases we also use ImageNet-1k, which is a subset of ImageNet-21k with 1.3M images and 1k classes.

We measure the sparsity level (computed on the training set unless specified otherwise) at the intermediate output of the two-layer MLPs in a Transformer. Recall that an MLP performs the following mapping

$f(x; K, V) \doteq \sum_{i=1}^{d_{\text{ff}}} \sigma(\langle k_i, x \rangle) \cdot v_i$, or equivalently, $f(x; K, V) \doteq V \sigma(K^\top x)$, (1)

where $x \in \mathbb{R}^{d_{\text{model}}}$ is the input, $K = [k_1, \ldots, k_{d_{\text{ff}}}] \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $V = [v_1, \ldots, v_{d_{\text{ff}}}] \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ are learnable layer parameters, and $\sigma(\cdot)$ is a nonlinear activation function. We use ReLU as the activation function $\sigma(\cdot)$ for both T5 and ViT. A two-layer MLP may be regarded as having $d_{\text{ff}}$ neurons in the hidden layer, where the $i$-th neuron performs the computation $\sigma(\langle k_i, x \rangle) \cdot v_i$, and the final layer output is the sum of the outputs of all neurons. A neuron is called activated if $\sigma(\langle k_i, x \rangle)$ is strictly positive. Hence, the sparsity of neuron activation can be measured by the number of nonzero entries in the feature map

$a \doteq \sigma(K^\top x) \in \mathbb{R}^{d_{\text{ff}}}$. (2)

Both T5 and ViT come with several configurations for $d_{\text{model}}$, $d_{\text{ff}}$, number of layers, etc. Unless specified otherwise, we will use the Base models (i.e., T5-Base and ViT-B/16), which have $d_{\text{model}} = 768$, $d_{\text{ff}} = 3072$, and 12 layers (for ViT) or 12 encoder layers + 12 decoder layers (for T5). Our experiments with T5 and ViT use the T5X (Roberts et al., 2022) and the Scenic (Dehghani et al., 2022) codebases, respectively. More training details of T5 and ViT are provided in Appendix A.
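To make the two equivalent forms of the MLP mapping and the sparsity measure concrete, the following is a minimal pure-Python sketch. It is not the T5X or Scenic implementation; the storage layout (K and V held as lists of their columns k_i and v_i) and all function names are assumptions chosen for illustration.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def activation_map(K, x):
    """Feature map a = relu(K^T x); K is a list of d_ff columns k_i."""
    return [relu(sum(kj * xj for kj, xj in zip(k_i, x))) for k_i in K]


def mlp_neuron_sum(K, V, x):
    """Sum over neurons of relu(<k_i, x>) * v_i."""
    out = [0.0] * len(x)
    for k_i, v_i in zip(K, V):
        a_i = relu(sum(kj * xj for kj, xj in zip(k_i, x)))
        for j in range(len(out)):
            out[j] += a_i * v_i[j]
    return out


def mlp_matrix_form(K, V, x):
    """Matrix form V a, where a is the activation map."""
    a = activation_map(K, x)
    return [sum(a[i] * V[i][j] for i in range(len(a))) for j in range(len(x))]


def percent_nonzero(a):
    """Sparsity measure used throughout: percentage of nonzero entries."""
    return 100.0 * sum(1 for v in a if v > 0.0) / len(a)


# Toy sizes (hypothetical; the Base models use d_model = 768, d_ff = 3072).
rng = random.Random(0)
d_model, d_ff = 8, 32
K = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_ff)]
V = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_ff)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

y_sum = mlp_neuron_sum(K, V, x)
y_mat = mlp_matrix_form(K, V, x)
print(percent_nonzero(activation_map(K, x)))
```

The two functions return the same output up to floating-point rounding, and `percent_nonzero` is the quantity plotted on the y-axis of the sparsity figures.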

2. PREVALENCE OF SPARSITY IN LEARNED TRANSFORMERS

This section shows, through experiments on commonly used Transformers, that sparsity in activation maps is a prevalent phenomenon. We also show through controlled experiments that deeper and wider Transformers tend to be sparser, as measured by the percentage of nonzero entries in activation maps.

2.1. SPARSITY IS A UBIQUITOUS PHENOMENON

We start by providing experimental evidence that the emergence of sparse activation in trained Transformers is a ubiquitous phenomenon. To this end, we plot the percentage of nonzero entries of activation maps in different Transformers, and present the results in Figure 2. These results demonstrate the following.

• Sparsity emerges for both vision and NLP tasks. Figure 2a shows the percentage of nonzero entries of trained T5 and ViT models evaluated on their respective training datasets. We see that the encoder and decoder of T5, as well as the ViT, all exhibit sparsity.

• Sparsity emerges on both training and evaluation data. The presence of sparsity in activation maps does not rule out the possibility that a small percentage of the neurons are always activated for all inputs, whereas the rest of the neurons are never activated. To illustrate that this is not the case, we experiment with a pretrained T5-Base model to plot the percentage of layer inputs for which each of the $d_{\text{ff}}$ neurons is activated when evaluated on 800 examples taken from the C4 dataset with the span corruption task. Note that there are 800 × 512 = 409600 samples, as the MLP activation is computed per token. The results are presented in Figure 3, with the x-axis being the indices of neurons in the first encoder layer of T5, sorted in descending order according to the percentage of layer inputs on which they are activated. It can be seen that while a few neurons are activated around 50% of the time, the vast majority of neurons (around 93.5%) are activated less than 10% of the time. Moreover, there are no dead neurons that are never activated: the least activated neuron is activated around 0.001% of the time, and 99% of neurons are activated over 1% of the time. Finally, while the results here are for neurons in the first MLP layer of a pretrained T5-Base encoder, all other MLP layers show qualitatively similar behavior.

2.2. THE LARGER, THE SPARSER

We next examine the effect of model size on the sparsity level of activation maps. Note that Figure 2e and Figure 2f provide evidence, with T5 models of varying configuration, that larger models tend to be sparser. Here we perform controlled experiments to examine separately the effect of model depth, measured by the number of Transformer layers, and the effect of model width, measured by the dimension of the activation map of the MLPs (i.e., $d_{\text{ff}}$). To that end, we take a standard T5 model and vary the depth and width, respectively, while keeping the rest of the configuration fixed, and examine the sparsity level after training. The results are presented in Figure 4 for the encoder; we omit the results for the decoder as they are qualitatively the same as those for the encoder. It can be seen from Figure 4a that deeper Transformers are arguably sparser. For example, many of the middle layers of the 32-layer model have less than 1% nonzero entries, while all shallower models have more than 1% nonzero entries across all layers. For comparing networks of different widths, we measure the sparsity with the percentage and the count of nonzero entries in Figure 4b and Figure 4c, respectively. It can be seen that wider models have a lower percentage of nonzero entries, though a higher count of nonzero entries.

3. EFFICIENT, ROBUST, AND CALIBRATED: SPARSITY IS ALL YOU NEED?

In this section we show that activation sparsity provides several practical benefits. In Section 3. [...], where the first term comes from computing the key, query, and value matrices, the second term comes from computing the self-attention matrix, and the third term comes from the MLP. For a fixed sequence length $N$, and considering the fact that $d_{\text{ff}}$ is often much larger than $d_{\text{model}}$, it is arguable that the MLP poses the computational bottleneck in large Transformers.
In the following, we explain how sparsity in the activation map of the MLP can be leveraged to significantly reduce its computational cost without affecting model performance.

Efficiency for the Second MLP Layer. The sparse activation immediately suggests that a lot of the computation for inference with Transformers is not needed at all. That is, in the dense matrix multiplications, much of the work consists of multiplying a vector by a value of zero, which can be avoided to save computation. Specifically, we consider the second layer of the MLP in (1), which performs the computation

$V a$, (3)

where $a \in \mathbb{R}^{d_{\text{ff}}}$ is the intermediate activation map of the MLP (see (2)) and $V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ is the layer parameter. Eq. (3) involves a simple matrix-vector multiplication which has a FLOP count of $2 d_{\text{model}} \times d_{\text{ff}}$. However, if $a$ is sparse with, say, $s$ nonzero entries, then the FLOP count for (3) reduces to $2 d_{\text{model}} \times s$. Hence, the FLOP count of the second MLP layer is reduced by a fraction of $1 - \frac{s}{d_{\text{ff}}}$. Note that $\frac{s}{d_{\text{ff}}}$ is exactly the percentage of nonzeros plotted on the y-axis of, e.g., Figure 1, which is 2.7% averaged across all layers. Hence, the computational cost of the second MLP layer can be reduced by a significant amount. More excitingly, the reduction $1 - \frac{s}{d_{\text{ff}}}$ is likely to be even bigger for larger Transformer models (see Figures 4a and 4b), pointing to a greater reduction in computation.

Efficiency for the First MLP Layer. The sparsity in the intermediate activation map of the MLP does not immediately suggest a reduction in computation for the first MLP layer. Nonetheless, it is possible to significantly reduce the computation in the first MLP layer by leveraging approximate nearest neighbor search, which we explain next. Recall from (1) that the computation in the first MLP layer is given by

$\sigma(K^\top x)$, (4)

with $K = [k_1, \ldots, k_{d_{\text{ff}}}] \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ being the layer parameter and $x$ being the layer input. If the output is sparse with $k$ nonzero entries, then the calculation in (4) may be formulated as finding the $k$ points from the set $\{k_i\}_{i=1}^{d_{\text{ff}}}$ that are "closest" to the input $x$ as measured by the value of the inner product. This problem is well known as the nearest neighbor search (NNS) problem or the maximum inner product search problem. While a naive solution of the NNS problem has linear complexity in $d_{\text{ff}}$, there exist approximate algorithms (Guo et al., 2020; Johnson et al., 2019; Shrivastava & Li, 2014) of sublinear complexity, and using them in Transformers means that the FLOP count of the first MLP layer may be reduced to sublinear complexity in $d_{\text{ff}}$. There is of course the question of whether such approximate NNS algorithms could hurt Transformer performance, which we leave for future study.
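The FLOP argument for the second MLP layer can be illustrated with a small sketch that skips every zero entry of the activation map. This is a pure-Python illustration with a hypothetical column-wise layout for V, not the TPU implementation used in the experiments; the Base-model arithmetic at the end assumes s is 2.7% of 3072 rounded to the nearest integer.

```python
def dense_second_layer(V, a):
    """Compute V a, touching every column v_i of V. Returns (output, flops)."""
    d_model = len(V[0])
    out = [0.0] * d_model
    flops = 0
    for v_i, a_i in zip(V, a):
        for j in range(d_model):
            out[j] += a_i * v_i[j]
        flops += 2 * d_model  # one multiply and one add per output entry
    return out, flops


def sparse_second_layer(V, a):
    """Compute V a, skipping columns whose activation is exactly zero."""
    d_model = len(V[0])
    out = [0.0] * d_model
    flops = 0
    for v_i, a_i in zip(V, a):
        if a_i == 0.0:
            continue  # multiplying by zero contributes nothing
        for j in range(d_model):
            out[j] += a_i * v_i[j]
        flops += 2 * d_model
    return out, flops


# Toy example: d_model = 2, d_ff = 4, two nonzero activations.
V_cols = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
a = [0.0, 2.0, 0.0, 0.5]
dense_out, dense_flops = dense_second_layer(V_cols, a)
sparse_out, sparse_flops = sparse_second_layer(V_cols, a)
print(dense_out, dense_flops)    # [9.5, 12.0] 16
print(sparse_out, sparse_flops)  # [9.5, 12.0] 8

# Base-model arithmetic: d_model = 768, d_ff = 3072, about 2.7% nonzeros.
d_model, d_ff = 768, 3072
s = round(0.027 * d_ff)              # 83 nonzero entries
base_dense_flops = 2 * d_model * d_ff   # 4718592
base_sparse_flops = 2 * d_model * s     # 127488
```

The output is unchanged while the FLOP count scales with the number of nonzero entries s rather than with the full hidden width. Whether this translates into wall-time savings depends on hardware support for unstructured sparsity.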

3.2. SPARSITY IN TRAINING VIA TOP-k TRANSFORMERS

The efficiency benefit of sparsity described in Section 3.1 comes with caveats. First, while the activation maps are sparse on average, it is possible that the activation maps for certain inputs are denser and hence cannot benefit from sparse computation. Second, sparsity occurs only in trained Transformers, while the computation is dense during training, particularly at its beginning.

Here we present the Top-k Transformer, a simple modification to the Transformer architecture that allows us to control the sparsity level for all model inputs and throughout training. The Top-k Transformer is built upon a regular Transformer, with the only modification being in the MLP layers, where at the output of the activation function $\sigma(\cdot)$ (see (1)) we add a Top-k thresholding operator. That is, the MLPs of Top-k Transformers perform the following computation

$f(x; K, V) = V \cdot \mathrm{Top}_k(\sigma(K^\top x))$,

where $\mathrm{Top}_k(\cdot)$ performs a thresholding in which all entries other than the $k$ largest are set to zero, with $k$ being a hyper-parameter subject to design choices. Note that the Top-k Transformer reduces to a regular Transformer if we set $k = d_{\text{ff}}$. By using a small value of $k$, the efficiency benefit in terms of FLOP reduction discussed in Section 3.1 applies to Transformer training as well.

We now provide experimental results with Top-k Transformers on the wall-time benefits of the FLOP reduction discussed in Section 3.1. In particular, we evaluate the inference-time latency reduction of the Top-k Transformer. In our experiment, we add a Top-k thresholding to T5X (Roberts et al., 2022). We gain efficiency in the second MLP layer through an implementation that avoids all multiplications by zero, as described in Section 3.1. The decoder per-token wall time for unbatched greedy decoding during inference on a single TPUv4 chip is presented in Figure 6. We observe that larger models have a greater wall time reduction, because they have a larger $d_{\text{ff}}$ and hence a greater FLOP reduction. In particular, for T5-11B we observe around 10% wall time reduction with $k \le 128$, though this amount becomes smaller with a larger $k = 256$.

Finally, we emphasize that the sparsity in Top-k Transformers is unstructured and data-dependent, which is not well supported on existing computation hardware such as TPUs and GPUs. Hence, the results in Figure 6 are for proof-of-concept purposes, and are far from fully realizing the benefit of FLOP reduction via sparsity. We leave a study of better implementations of sparse computation for obtaining higher wall time reduction to future work.
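The Top-k thresholding operator described in this section can be sketched in a few lines. This is an illustrative pure-Python version, not the T5X modification used in the experiments; the function name and the tie-breaking rule (earlier indices win among equal values) are assumptions.

```python
def top_k_threshold(a, k):
    """Keep the k largest entries of a and set all other entries to zero.

    Assumption: ties are broken in favor of earlier indices, so exactly k
    positions are kept whenever 0 < k < len(a).
    """
    if k >= len(a):
        return list(a)  # k = d_ff leaves the activation map unchanged
    if k <= 0:
        return [0.0] * len(a)
    # Python's sort is stable, so equal values keep their original order.
    order = sorted(range(len(a)), key=lambda i: a[i], reverse=True)
    keep = set(order[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(a)]


# Example on a post-ReLU activation map with d_ff = 5.
a = [0.0, 3.0, 1.0, 0.0, 2.0]
print(top_k_threshold(a, 2))  # [0.0, 3.0, 0.0, 0.0, 2.0]
print(top_k_threshold(a, 5))  # [0.0, 3.0, 1.0, 0.0, 2.0]
```

Since the operator is applied after the ReLU, the result has at most k nonzero entries; it has fewer when the activation map already contains fewer than k positive values. This gives a per-input upper bound on the FLOP count of the second MLP layer.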

3.3. BONUS! IMPROVED ROBUSTNESS AND CALIBRATION

Despite not being explicitly designed for such purposes, inducing sparse activation via the Top-k Transformer has the benefit of improving model robustness and confidence calibration. We demonstrate this using the image classification task with the ImageNet-1k dataset, and present the results in [...]. We conduct experiments using the ImageNet-1k dataset, for which we replace p% of the labels in the training set with a random label drawn uniformly from the set of all possible labels. The evaluation performance under p ∈ {40%, 80%} label noise is presented in Table 1. It shows that Top-k offers a consistent performance gain under label noise.

Confidence Calibration. Aside from sensitivity to label noise, another symptom of over-parameterization of DNNs is that they tend to be overly confident in their predictions. In the context of classification problems, they tend to assign a high (i.e., close to 1) probability to the class of their prediction, while it is more desirable that they produce a probability that is commensurate with their confidence level (Guo et al., 2017). A commonly used metric for confidence calibration is the expected calibration error (ECE) (Naeini et al., 2015), which is the discrepancy between the probability assigned to the class of a model's prediction and the probability that its prediction is actually correct. Here we measure the calibration of Top-k ViT via ECE and report the results in Table 1. It shows that Top-k with k = 128 enables the Transformer to be better calibrated than a vanilla Transformer. Furthermore, results reported in Appendix C show that ECE monotonically decreases as k is decreased from 128 to 32.

Robustness to Input Perturbation. Another important challenge with DNNs is that their outputs tend to be sensitive to naturally occurring image corruptions, which limits their application to mission-critical tasks (Bhojanapalli et al., 2021). Here we evaluate the robustness of Top-k ViT to three types of additive noise, namely Gaussian noise, impulse noise, and shot noise. For that purpose, we train Top-k ViT on the standard ImageNet-1k training data and report its classification accuracy on ImageNet-C (Hendrycks & Dietterich, 2019), a benchmark that contains algorithmically generated Gaussian, impulse, and shot noise (among many other types) applied to the ImageNet-1k test dataset. For each noise type, there are five severity levels. We report the performance averaged over all severity levels of each corruption type in Table 1 for k = 128, and in Appendix C for a few other values of k. We see that robust accuracy is the highest with k = 64, while taking k = 128 or k = 32 also provides benefits compared to the vanilla Transformer.
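For readers unfamiliar with the calibration metric used above, here is a minimal sketch of one common binned estimate of ECE. The number of bins, the equal-width binning with left-closed bins, and the function name are assumptions; the exact protocol behind the reported numbers is not specified in this section.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the predicted class, in [0, 1].
    correct: True/False (or 1/0) for whether each prediction was right.
    """
    n = len(confidences)
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for c, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        b = min(int(c * num_bins), num_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += 1.0 if ok else 0.0
        count[b] += 1
    ece = 0.0
    for b in range(num_bins):
        if count[b] == 0:
            continue
        accuracy = hit_sum[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy example (made-up values): two predictions at 95% confidence, both
# correct, and two at 55% confidence, one of which is correct.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
print(round(ece, 4))  # 0.05
```

A perfectly calibrated model has an ECE of zero, while an over-confident model (high confidence, lower accuracy) has a larger gap in its high-confidence bins and therefore a larger ECE.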

4. RELATED WORK

Prior efforts on introducing sparsity in deep neural networks abound, though often with diverse motivations and objectives. Here we provide a brief overview of several popular lines of work.

Sparsity for Efficiency. Sparsity in either model weights or activation maps is often used for improving training and inference efficiency (see, e.g., Hoefler et al. (2021) for a review). For activation sparsity in particular, sparsity for efficiency was explored perhaps first in ConvNets (Georgiadis, 2019; Kurtz et al., 2020; Rhu et al., 2018) before subsequently becoming a key design component in many of the largest Transformer-based language and vision models (Du et al., 2022; Fedus et al., 2022a; b; Rajbhandari et al., 2022). The Top-k thresholding that we use in the Top-k Transformer has also been previously used in Gupta et al. (2021) to improve the memory efficiency of Transformers. However, it has been unclear a priori whether sparsity hurts model performance, hence the practice often relies on wishful design, trial-and-error, and post-hoc justification (Baykal et al., 2022). Our discovery that Transformers naturally produce sparse activation maps, and that larger models are even sparser, may provide principled perspectives towards efficiently training future large models.

Sparsity for Robustness. Many works find that smaller and sparser networks obtained by model compression are more robust to adversarial perturbation (Chen et al., 2022; Guo et al., 2018; Jordao & Pedrini, 2021) and label noise (Xue et al., 2022). Another line of work that uses sparsity for robustness leverages the property that practical data corruption is often sparse (Ghosh et al., 2017; Liu et al., 2022; You et al., 2020). None of the work mentioned above is based on sparsity in activation maps. More closely related to ours is the work of Ahmad & Scheinkman (2019), where sparsity in the activation maps of convolutional DNNs is shown to improve robustness to input perturbation, and that of Muthukumar & Sulam (2022), which leverages sparse activation to derive robust generalization error bounds.

Sparsity for Explainability. Work on leveraging sparsity for interpreting deep learning models has long existed, but often in a post-hoc fashion, for examining the semantic meanings encoded by a neuron of a trained model (Dalvi et al., 2019). For Transformers, evidence suggests that the learned knowledge is encoded mainly in the MLPs, with individual neurons expressing specific factual knowledge (Dai et al., 2022). Moreover, enforcing neuron activation sparsity in MLPs helps to improve the percentage of neurons that are interpretable (Elhage et al., 2022). Hence, our discovery may point to new directions towards developing more interpretable DNNs (Cuadros et al., 2022; Sajjad et al., 2021).

Sparsity for Data Modeling. Following the seminal work of Olshausen & Field (1996), there has been a lot of interest in sparsity as an effective model of natural signals (Mairal et al., 2014). Given the close resemblance between the computational structure of ReLU networks and sparse encoding algorithms (Gregor & LeCun, 2010), it became natural to study a DNN as a multi-layer sparse model of the data (Papyan et al., 2018). Along with substantial theoretical understanding of such modeling (Papyan et al., 2017; Sulam et al., 2018), there are also experimental results on its practical benefits (Sun et al., 2018), though less often on modern large-scale data.

Sparsity for Theory of Over-parameterized Models. Because of its simplicity and well-developed theory in classical machine learning (Candès & Wakin, 2008; Vidal et al., 2015; Wright & Ma, 2022), sparse modeling is often used to provide theoretical understanding of modern large and over-parameterized models. This includes work on implicit regularization (Chou et al., 2021; Nacson et al., 2022; Vaskevicius et al., 2019; Woodworth et al., 2020; Zhao et al., 2019), nonconvex optimization (Buhai et al., 2020; Sulam et al., 2022), noise interpolators (Chinot et al., 2022; Donhauser et al., 2022; Koehler et al., 2021), etc. However, the aforementioned work uses sparsity as a testbed or toy model to gain insights, without implying the existence of sparsity in DNNs.

5. DISCUSSION

This work demonstrates the natural emergence of sparse activation in commonly used Transformer models (Section 2). The notion of sparsity pertains to the law of parsimony, a.k.a. Occam's razor, where among all possible explanations of observed data, the simplest ones are preferred. It is a fundamental scientific principle broadly used in various scientific and engineering subjects (Domingos, 1999; Epstein, 1984), including classical machine learning (Tibshirani, 1996). Hence, our discovery may suggest that the law of parsimony plays a role in Transformers even though they are not explicitly designed for it, resonating with recent views on the role of sparsity in intelligent systems (LeCun, 2022; Ma et al., 2022; Roberts, 2021; Vasudevan et al., 2021). More importantly, we back such a perspective by providing evidence of improved robustness and calibration via enforcing sparsity using Top-k thresholding (Section 3), which indicates that sparsity is indeed a pertinent prior for good generalization. We hope that our work may motivate future efforts on introducing sparsity in deep learning models in a more principled way for obtaining more efficient, robust, and calibrated models. Finally, while our motivation for studying sparse activation in Transformers comes (partly) from the study of biological brains, establishing such a connection may reciprocally benefit efforts on applying artificial intelligence to the study of biology and neuroscience (Richards et al., 2022).

Appendices

The appendices are organized as follows. In Section A we provide the implementation details for experiments conducted in this paper. In Section B we demonstrate the emergence of sparse activation in other architectures and with other optimizers than those used in Section 2. In Section C we provide additional experiments upon those in Section 3 to demonstrate the benefits of sparsity. In Section D we explore the potential causes of sparsity, with a focus on the effect of training data. In Section E we present a derivation to show that during early training, the final MLP layer's intermediate activation tends to get sparse. Finally in Section F we present insights on the emergence of activation sparsity from experiments on 2-layer MLP models.

A IMPLEMENTATION DETAILS

A.1 T5

For most of the experiments, except those with the Top-k Transformer, we used the vanilla T5 architecture (Raffel et al., 2020). We trained the model with the Adafactor optimizer, an inverse square root learning rate schedule, and no dropout. For the first 10,000 steps we also use a fixed learning rate of 0.01 as warm-up. The training task is span corruption without any mixture, and unless specified otherwise, we train the model for 100,000 steps with a batch size of 256 to save compute and time, as the sparsity or accuracy trend is already clear by then. We used 512 tokens on the encoder side and 114 tokens on the decoder side.

A [...]

We evaluate the sparsity level of the activation map in several commonly used network architectures beyond T5 and ViT. This includes BERT, which is also a Transformer-based architecture, as well as non-Transformer-based architectures such as MLP-Mixer and ConvNets. We also examine whether residual connections account for the emergence of sparsity.

MLP-Mixer.

We evaluate the sparsity level of the MLP-Mixer (Tolstikhin et al., 2021), an all-MLP architecture constructed from cascading token-mixing and channel-mixing MLPs. Specifically, we use Mixer-B16 as the architecture, ADAM with β1 = 0.9, β2 = 0.999 as the optimizer, and train on ImageNet-21k for 300 epochs. While Tolstikhin et al. (2021) sweep over a product set of hyper-parameters, here for simplicity we use a fixed set of hyper-parameters with weight decay of 0.03, gradient norm clipping at 1.0, base learning rate of 0.003, RandAugment magnitude of 10, no mixup, no stochastic depth, and no dropout.

Convolutional Neural Network (ConvNet). Sparsity in activation maps has been studied for ConvNets such as AlexNet (Krizhevsky et al., 2017) at least as early as the work of Rhu et al. (2018). There is also follow-up work (Georgiadis, 2019; Kurtz et al., 2020) on how enforcing sparse activation maps can help to gain computational efficiency. For completeness, we evaluate and present results for the sparsity level of residual networks (ResNets) (He et al., 2016), which are among the most commonly used ConvNets, trained on ImageNet-1k. In particular, we focus on ResNet-18 and ResNet-50, which are constructed by stacking 8 standard residual blocks and 16 "bottleneck" residual blocks, respectively, where each block has two and three convolutional and ReLU layers, respectively. We examine the sparsity of activation maps after each of the ReLU layers in each residual block. Here, the x-axis is the index of the residual block, and the sparsity of different layers in the residual blocks is plotted with separate curves in each figure. It can be observed that

• Layers near the network output tend to produce sparser activation maps than layers near the network input. This is aligned with the observation for ViT trained on ImageNet-1k (see Figure 2b).

• For each residual block, the intermediate layers (i.e., the 1st layer for ResNet-18 and the 1st & 2nd layers for ResNet-50) produce sparser activation maps than the output layer (i.e., the 2nd layer for ResNet-18 and the 3rd layer for ResNet-50).

In addition, all residual blocks are divided into four stages that have different output feature map sizes. For ResNet-50, the four stages are composed of blocks 0-2, 3-6, 7-12, and 13-15. [...] shows that there are patterns in how the sparsity level varies within each stage and across the boundaries of the stages.

• For the 1st layers, the percentage of nonzeros decreases within each stage, and jumps up from the last layer of each stage to the first layer of the next stage.

• For the 2nd layers, the percentage of nonzeros decreases quickly at the beginning of each stage, then becomes stable.

• For the 3rd layers, the percentage of nonzeros tends to increase slightly within each stage, and jumps down from the last layer of each stage to the first layer of the next stage.

Sparsity and Residual Learning. We provide a study on the effect of residual connections on activation sparsity. Each Transformer block contains two types of residual connections: the one that is in parallel with the attention blocks, and the one that is in parallel with the MLP blocks. We focus on the residual connection parallel to the MLP blocks. We perform two different studies.

• Effect of shortcut connection. To study this, we train two T5-Large models, one using the vanilla Transformer block and the other with the residual connection removed from the Transformer block at encoder layer 6 (i.e., the 7th encoder layer, as we count from 0). There is a 1.6% evaluation accuracy drop with the latter model compared to the former model. [...] training compared to the corresponding layer of the vanilla Transformer. Moreover, the sparsity level at all other layers also changes, though to a much smaller extent.

• Effect of initialization scale of the residual branch.
Many works have found that having the residual branch initialized at a smaller scale helps with stabilizing and accelerating the training of residual (convolutional) networks (Goyal et al., 2017; Zhang et al., 2019) and Transformers (Touvron et al., 2021). Here for simplicity we consider the idea from Bachlechner et al. (2021); Qi et al. (2020), where a trainable scalar multiplier that is initialized at zero is applied to the residual branch (a.k.a. ReZero). We consider ViT trained on ImageNet-21k with ReZero added to the MLP modules. We find that this increases the training accuracy from 46.15% to 46.85% but reduces the validation accuracy from 47.58% to 46.75%. We plot the sparsity level of ViT with ReZero and compare it with the vanilla ViT in Figure B.9. It can be seen that ReZero reduces the percentage of nonzeros in layers near the network output.

A natural question is whether the emergence of sparsity is specific to such optimizers and whether other optimizers, such as stochastic gradient descent (SGD), also lead to sparse activation maps. However, we find that SGD cannot effectively train Transformer architectures such as T5 and ViT. Hence, we study the effect of the optimizer on activation sparsity by looking at ResNet trained on ImageNet-1k following the setup in Section B.1, since both SGD and ADAM can effectively train the network. To train ResNet with ADAM, we use the same hyper-parameters as those used in SGD, with the only difference being that the optimizer is ADAM with $\beta_1 = 0.9$, $\beta_2 = 0.999$. To make the comparison with SGD fair, we tune the base learning rate for ADAM and select 3e-3, which is the one that gives the highest training accuracy among the set {1e-4, 3e-4, 1e-3, 3e-3, 1e-2}. The training accuracy obtained by ADAM with base learning rate 3e-3 is similar to that obtained by SGD, namely, 67.8% by ADAM vs 69.3% by SGD with ResNet-18, and 75.0% by ADAM vs 78.5% by SGD with ResNet-50.
The results for ResNet-18 and ResNet-50 are presented in Figure B.10 and Figure B.11, respectively. For ResNet-18, we see that ADAM leads to a smaller percentage of nonzero entries, particularly towards the output of the network, for the first layers of each residual block. In contrast, ADAM and SGD have very similar sparsity levels at the second layers of each residual block. A similar observation holds for ResNet-50, where the percentage of nonzero entries is smaller with ADAM for the first and second layers of each residual block, while for the third layer the sparsity level does not change much.

B.3 SPARSITY IN FINETUNING

In this section we show that activation sparsity not only occurs after model pretraining but also persists after further finetuning on downstream tasks. Here we take a T5 that has been pretrained on C4 as described in Section 1.3, and finetune the model on an open-domain Natural Questions (Kwiatkowski et al., 2019) QA task. We follow the setup in Li et al. (2022), where the retrieved passages are independently encoded by the encoder, and then passed to the decoder via cross attention. The decoder takes the question as the prefix and produces the answer. The decoder is a standard auto-

C.2 BENEFITS OF SPARSITY PERSISTS WITH $\ell_1$-NORM INDUCED SPARSITY

While Top-k thresholding is used in Section 3.3 to demonstrate the benefit of sparsity, we show that other means of obtaining sparsity, such as an explicit $\ell_1$ norm regularization, also provide such benefits. We experiment with ViT for ImageNet-1k classification under the same setup as in Section 3.3. Here instead of the Top-k ViT, we train a regular ViT but with an additional loss term, which is the sum of the $\ell_1$ norms of all activation maps of ViT across all layers. We refer to the method as L1-ViT. We vary the weight λ on the $\ell_1$ loss in the set λ ∈ {0.1, 0.5, 1.0} to control the strength of the regularization, and denote the corresponding methods as L1-ViT-{0.1, 0.5, 1.0}. The sparsity level, natural accuracy, robust accuracy under input perturbation, and ECE of L1-ViT are reported in Table C.1. We see that with λ = 0.1 or 0.5, the averaged percentage of nonzero entries does not change much, but the model already demonstrates performance gains in terms of accuracy under input perturbation and calibration without hurting the natural accuracy. Using a λ = 1.0 drastically reduces

We use a random label experiment with ViT for image classification to test Hypothesis D.1. Specifically, we generate a new training dataset by replacing p% of the labels in the ImageNet-21k dataset with random labels drawn uniformly at random from the set of all possible labels, where p is varied to examine the effects. With such a dataset, the labels for a certain percentage of images do not provide a meaningful description of the content of the image. Hence, if Hypothesis D.1 is valid, then the activation map will become dense. The sparsity level of ViT trained on the random label datasets is shown in Figure D.1a. It can be seen that the percentage of activated neurons decreases with an increasing percentage of label noise up to 70%.
An even higher label noise level at 100% changes the sparsity level across layers, as the shallow layers (i.e., layers 0-4) become sparser while the deep layers (i.e., layers 5-11) become denser. Nonetheless, even with 100% label noise, all layers have < 10% activated neurons.
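The label-replacement step of the random label experiment can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' data pipeline; the function name `corrupt_labels`, the toy label list, and the choice to fix exactly a fraction `p` of positions are assumptions made for the example.

```python
import random


def corrupt_labels(labels, num_classes, p, seed=0):
    """Replace a fraction p (between 0 and 1) of labels with uniform random classes.

    Returns the noisy label list and the set of positions that were resampled.
    A resampled label may coincide with the original one by chance.
    """
    rng = random.Random(seed)
    n = len(labels)
    n_corrupt = round(p * n)
    chosen = rng.sample(range(n), n_corrupt)  # positions to resample
    noisy = list(labels)
    for idx in chosen:
        noisy[idx] = rng.randrange(num_classes)  # uniform over all classes
    return noisy, set(chosen)


# Hypothetical toy dataset: 100 labels over 5 classes, 70% label noise.
labels = [i % 5 for i in range(100)]
noisy_labels, corrupted_positions = corrupt_labels(labels, num_classes=5, p=0.7)
print(len(corrupted_positions))
```

With `p = 1.0` every label is resampled, which corresponds to the 100% label noise setting discussed above.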

D.2 SPARSITY FROM DATA?

While modern image and text data are often high-dimensional, their intrinsic degree of freedom is much smaller, i.e., they are low-dimensional and admit compact representations (Vidal et al., 2015; Wright & Ma, 2022). Hence, even if the labels do not provide meaningful descriptions of the data, it may still be possible that Transformers extract low-dimensional structures from data and produce compact representations in the form of sparse activation maps. This motivates the following hypothesis.

Our results point to the possibility that sparsity comes from the training dynamics. Namely, at the early training stage with any training data and a random initialization of network parameters, the descending direction of the gradient on the Transformer parameters tends to point to a regime where their MLPs produce sparse activation maps. In the following, we provide theoretical evidence for this argument by looking at the gradient on the positive activation maps for a DNN whose last two layers are a ReLU followed by a fully connected layer. In particular, we have the following result.


Theorem D.1. Let $f(x; V, \theta): \mathbb{R}^n \to \mathbb{R}^K$ be a neural network given by

$$f(x) = V \sigma\big(p(x; \theta)\big), \tag{D.1}$$

where $V = [v_1, \ldots, v_{d_{\text{ff}}}] \in \mathbb{R}^{K \times d_{\text{ff}}}$ is the network parameter for the last layer drawn from a random distribution, $\sigma(\cdot)$ is the ReLU activation function, and $p(x; \theta)$ denotes all other layers with parameter $\theta$. We write $p = p(x; \theta)$ for simplicity.

• Consider the mean squared error (MSE) loss $\ell_{\text{MSE}}(f(x), y) \doteq \frac{1}{2}\|f(x) - y\|_2^2$, where $y$ is an arbitrary vector independent of $V$. Assume that $V$ satisfies

$$\mathbb{E}[V] = 0, \quad \text{and} \quad \mathbb{E}[\langle v_i, v_j \rangle] \begin{cases} = 0, & \text{if } i \neq j, \\ > 0, & \text{otherwise}^{6}. \end{cases} \tag{D.2}$$

If there exists an $i^*$ such that $p_{i^*} > 0$, then we have

$$\mathbb{E}\left[\frac{\partial \ell_{\text{MSE}}(f(x), y)}{\partial p_{i^*}}\right] > 0, \tag{D.3}$$

where the expectation is taken with respect to randomness in $V$.

• Consider the cross-entropy (CE) loss $\ell_{\text{CE}}(f(x), y) = -\left\langle y, \log \frac{\exp(f(x))}{\langle \exp(f(x)), \mathbf{1} \rangle} \right\rangle$, where $y$ is an arbitrary vector that sums up to one and is independent of $V$. Assume that the entries of $V$ are drawn from independent distributions, the probability of any entry of $V$ being 0 is less than 1, and $\mathbb{E}[V] = 0$. If there exists an $i^*$ such that $p_{i^*} > 0$, then we have

$$\mathbb{E}\left[\frac{\partial \ell_{\text{CE}}(f(x), y)}{\partial p_{i^*}}\right] > 0, \tag{D.4}$$

where the expectation is taken with respect to randomness in $V$.

The proof of Theorem D.1 is provided in Appendix E. Theorem D.1 states that the gradient of either the MSE or CE loss with respect to any positive activation $p_{i^*}$ is positive in expectation. Hence, any training algorithm based on negative gradient directions tends to reduce the magnitude of such positive activations, which will lead to a smaller training loss. Here, the expectation is taken with respect to the randomness in the last layer parameter $V$. Hence, our result can be considered as an analysis for DNNs at initialization, where weights are often chosen randomly from a fixed distribution. In particular, the required properties for the distribution of $V$ in Theorem D.1 for both MSE and CE losses are satisfied by commonly used initialization methods, such as the one in He et al. (2015). On the other hand, Theorem D.1 does not apply to subsequent training iterations since the label $y$ is no longer independent of $V$. However, it can be seen empirically from Figure 1 that the trend of a decreasing percentage of nonzero entries persists for a certain number of iterations during the beginning of training until such a percentage reaches a low level and stays relatively stable until the end of training.
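The MSE part of Theorem D.1 can be checked numerically with a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions rather than part of the original analysis: the entries of $V$ are drawn i.i.d. from a zero-mean Gaussian (one distribution that satisfies (D.2)), and the vectors `p` and `y`, the standard deviation, and the number of trials are arbitrary choices made for the example.

```python
import random


def mse_grad_wrt_p(V, p, y, i_star):
    """Gradient of 0.5 * ||V relu(p) - y||^2 with respect to p[i_star].

    Valid when p[i_star] > 0, where the ReLU derivative equals 1, so the
    gradient is the inner product <f - y, v_{i_star}>.
    """
    K, d = len(V), len(p)
    act = [max(0.0, pi) for pi in p]
    f = [sum(V[m][i] * act[i] for i in range(d)) for m in range(K)]
    return sum((f[m] - y[m]) * V[m][i_star] for m in range(K))


def expected_grad(p, y, i_star, std=0.1, trials=20000, seed=0):
    """Average the gradient over random V with i.i.d. N(0, std^2) entries."""
    rng = random.Random(seed)
    K, d = len(y), len(p)
    total = 0.0
    for _ in range(trials):
        V = [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(K)]
        total += mse_grad_wrt_p(V, p, y, i_star)
    return total / trials


p = [0.8, -0.3, 1.5, -1.0]   # hypothetical pre-activations; p[0] > 0
y = [1.0, 0.0, 0.0]          # hypothetical target, independent of V
estimate = expected_grad(p, y, i_star=0)
# For this Gaussian choice, the expected gradient is p[0] * K * std^2.
predicted = p[0] * len(y) * 0.1 ** 2
print(estimate, predicted)
```

The estimate should be positive and close to the predicted value, in line with (D.3); the cross terms and the label term average out because the entries of `V` have zero mean.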

E PROOF OF THEOREM D.1

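Proof of Theorem D.1. For an arbitrary loss $\ell(f(x), y)$, we have

$$\frac{\partial \ell}{\partial p_{i^*}} = \left\langle \frac{\partial \ell}{\partial f}, \frac{\partial f}{\partial p_{i^*}} \right\rangle = \left\langle \frac{\partial \ell}{\partial f}, v_{i^*} \right\rangle, \tag{E.1}$$

where the last equality uses $p_{i^*} > 0$, so that $\sigma'(p_{i^*}) = 1$.

First, consider $\ell = \ell_{\text{MSE}}$. We have

$$\frac{\partial \ell_{\text{MSE}}}{\partial f} = f(x) - y = \sum_i \sigma(p_i) \cdot v_i - y. \tag{E.2}$$

Plugging this into (E.1), we obtain

$$\frac{\partial \ell_{\text{MSE}}}{\partial p_{i^*}} = \sum_i \sigma(p_i) \langle v_i, v_{i^*} \rangle - \langle v_{i^*}, y \rangle = \Big( \sum_{i \neq i^*} \sigma(p_i) \langle v_i, v_{i^*} \rangle \Big) + \sigma(p_{i^*}) \langle v_{i^*}, v_{i^*} \rangle - \langle v_{i^*}, y \rangle. \tag{E.3}$$

Taking the expectation, and noting the conditions in (D.2), we have

$$\mathbb{E}\left[\frac{\partial \ell_{\text{MSE}}}{\partial p_{i^*}}\right] = 0 + \sigma(p_{i^*}) \, \mathbb{E}[\langle v_{i^*}, v_{i^*} \rangle] + 0 > 0. \tag{E.4}$$

This finishes the proof for MSE loss.

Published as a conference paper at ICLR 2023

In the rest of the proof we consider $\ell = \ell_{\text{CE}}$. We have

$$\frac{\partial \ell_{\text{CE}}}{\partial f} = \frac{\exp(f(x))}{\langle \exp(f(x)), \mathbf{1} \rangle} - y = \frac{\exp\big(\sum_i \sigma(p_i) \cdot v_i\big)}{\big\langle \exp\big(\sum_i \sigma(p_i) \cdot v_i\big), \mathbf{1} \big\rangle} - y. \tag{E.5}$$

Plugging this into (E.1), we obtain

$$\frac{\partial \ell_{\text{CE}}}{\partial p_{i^*}} = \frac{\big\langle \exp\big(\sum_i \sigma(p_i) \cdot v_i\big), v_{i^*} \big\rangle}{\big\langle \exp\big(\sum_i \sigma(p_i) \cdot v_i\big), \mathbf{1} \big\rangle} - \langle v_{i^*}, y \rangle. \tag{E.6}$$

For the numerator in the first term on the RHS of the equation above, we have

$$\Big\langle \exp\Big(\sum_i \sigma(p_i) \cdot v_i\Big), v_{i^*} \Big\rangle = \sum_m v_{i^*,m} \cdot \exp\Big(\sum_i \sigma(p_i) \cdot v_{i,m}\Big) = \sum_m \Big( v_{i^*,m} \cdot \exp(p_{i^*} \cdot v_{i^*,m}) \cdot \exp\Big(\sum_{i \neq i^*} \sigma(p_i) \cdot v_{i,m}\Big) \Big). \tag{E.7}$$

Plugging this into (E.6) and denoting $C^{(1)}_m = \exp\big(\sum_{i \neq i^*} \sigma(p_i) \cdot v_{i,m}\big)$, we obtain

$$\frac{\partial \ell_{\text{CE}}}{\partial p_{i^*}} = \frac{\sum_m v_{i^*,m} \cdot \exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(1)}_m}{\big\langle \exp\big(\sum_i \sigma(p_i) \cdot v_i\big), \mathbf{1} \big\rangle} - \langle v_{i^*}, y \rangle. \tag{E.8}$$

For the denominator in the first term on the RHS of the equation above, we have, for any fixed index $m$,

$$\Big\langle \exp\Big(\sum_i \sigma(p_i) \cdot v_i\Big), \mathbf{1} \Big\rangle = \sum_{m'} \exp\Big(\sum_i \sigma(p_i) \cdot v_{i,m'}\Big) = \sum_{m'} \Big( \exp(p_{i^*} \cdot v_{i^*,m'}) \cdot \exp\Big(\sum_{i \neq i^*} \sigma(p_i) \cdot v_{i,m'}\Big) \Big)$$
$$= \exp(p_{i^*} \cdot v_{i^*,m}) \cdot \exp\Big(\sum_{i \neq i^*} \sigma(p_i) \cdot v_{i,m}\Big) + \sum_{m' \neq m} \Big( \exp(p_{i^*} \cdot v_{i^*,m'}) \cdot \exp\Big(\sum_{i \neq i^*} \sigma(p_i) \cdot v_{i,m'}\Big) \Big). \tag{E.9}$$

Plugging this into (E.8) and denoting

$$C^{(2)}_m = \exp\Big(\sum_{i \neq i^*} \sigma(p_i) \cdot v_{i,m}\Big), \tag{E.10}$$

$$C^{(3)}_m = \sum_{m' \neq m} \Big( \exp(p_{i^*} \cdot v_{i^*,m'}) \cdot \exp\Big(\sum_{i \neq i^*} \sigma(p_i) \cdot v_{i,m'}\Big) \Big), \tag{E.11}$$

we obtain

$$\frac{\partial \ell_{\text{CE}}}{\partial p_{i^*}} = \sum_m \frac{v_{i^*,m} \cdot \exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(1)}_m}{\exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(2)}_m + C^{(3)}_m} - \langle v_{i^*}, y \rangle. \tag{E.12}$$

Taking expectation with respect to $V$ on both sides, and using the assumption that all entries of $V$ are independent, we have

$$\mathbb{E}\left[\frac{\partial \ell_{\text{CE}}}{\partial p_{i^*}}\right] = \sum_m \mathbb{E}\left[\frac{v_{i^*,m} \cdot \exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(1)}_m}{\exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(2)}_m + C^{(3)}_m}\right] - \mathbb{E}[\langle v_{i^*}, y \rangle]$$
$$= \sum_m \mathbb{E}_{\{v_{i,l} \mid (i,l) \neq (i^*,m)\}} \left[ \mathbb{E}_{v_{i^*,m}} \left[\frac{v_{i^*,m} \cdot \exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(1)}_m}{\exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(2)}_m + C^{(3)}_m}\right] \right] - \mathbb{E}[\langle v_{i^*}, y \rangle]. \tag{E.13}$$

In the above, $\mathbb{E}_{v_{i^*,m}}[\cdot]$ means expectation with respect to $v_{i^*,m}$, and $\mathbb{E}_{\{v_{i,l} \mid (i,l) \neq (i^*,m)\}}[\cdot]$ means expectation with respect to all other entries in $V$. Note that $C^{(1)}_m$, $C^{(2)}_m$ and $C^{(3)}_m$ are independent of $v_{i^*,m}$. By Lemma E.1 and using the assumption that the expectation of $V$ is zero, we have

$$\mathbb{E}_{v_{i^*,m}} \left[\frac{v_{i^*,m} \cdot \exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(1)}_m}{\exp(p_{i^*} \cdot v_{i^*,m}) \cdot C^{(2)}_m + C^{(3)}_m}\right] > \frac{C^{(1)}_m}{C^{(2)}_m + C^{(3)}_m} \, \mathbb{E}[v_{i^*,m}] = 0.$$

Moreover, since $y$ is independent of $V$ and $\mathbb{E}[V] = 0$, we have $\mathbb{E}[\langle v_{i^*}, y \rangle] = 0$. Combining these with (E.13) gives $\mathbb{E}\left[\partial \ell_{\text{CE}} / \partial p_{i^*}\right] > 0$, which finishes the proof for the CE loss.

The following lemma is used in the proof above.

Lemma E.1. Let $V$ be a random variable with a probabilistic density function $p(v)$ that satisfies $P(V = 0) \neq 1$. Let $C_1$, $C_2$, $C_3$ and $p$ be positive numbers. Then,

$$\mathbb{E}\left[\frac{C_1 V \cdot \exp(pV)}{C_2 \exp(pV) + C_3}\right] > \frac{C_1}{C_2 + C_3} \, \mathbb{E}[V]. \tag{E.17}$$

Proof. We may calculate the expectation by using the probabilistic density function $p(v)$ as

$$\mathbb{E}\left[\frac{C_1 V \cdot \exp(pV)}{C_2 \exp(pV) + C_3}\right] = \mathbb{E}\left[\frac{C_1 V}{C_2 + C_3 \exp(-pV)}\right] = \int_{-\infty}^{\infty} \frac{C_1 v}{C_2 + C_3 \exp(-pv)} \, p(v) \, dv \doteq \int_{-\infty}^{\infty} g(v) \cdot v \, p(v) \, dv, \tag{E.18}$$

where $g(v) = \frac{C_1}{C_2 + C_3 \exp(-pv)}$. Since $g(v)$ is monotonically increasing for $v \in \mathbb{R}$, we have $g(v) \geq g(0)$ for $v \geq 0$ and $g(v) \leq g(0)$ for $v \leq 0$. Hence, $g(v) \cdot v \geq g(0) \cdot v$ for all $v$, and therefore

$$\int_{-\infty}^{\infty} g(v) \cdot v \, p(v) \, dv \geq g(0) \int_{-\infty}^{\infty} v \, p(v) \, dv.$$

Since $P(V = 0) \neq 1$, there exists an interval $(a, b)$ such that $\int_a^b p(v) \, dv > 0$. Without loss of generality we assume that $b > a \geq 0$. Then, since $g(v) > g(0)$ and $v > 0$ on $(a, b)$,

$$\int_a^b g(v) \cdot v \, p(v) \, dv > g(0) \int_a^b v \, p(v) \, dv.$$

That is, the inequality above holds with strict inequality. Hence we have

$$\mathbb{E}\left[\frac{C_1 V \cdot \exp(pV)}{C_2 \exp(pV) + C_3}\right] = \int_{-\infty}^{\infty} g(v) \cdot v \, p(v) \, dv > g(0) \, \mathbb{E}[V] = \frac{C_1}{C_2 + C_3} \, \mathbb{E}[V]. \tag{E.22}$$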
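As a sanity check on Lemma E.1, both sides of (E.17) can be evaluated numerically for a concrete distribution. The sketch below assumes $V$ is uniform on an interval and uses a midpoint rule for the integral; the function name `lemma_sides`, the constants, and the intervals are illustrative assumptions, not values from the paper.

```python
import math


def lemma_sides(lo, hi, c1, c2, c3, p, n=20000):
    """Midpoint-rule estimate of both sides of (E.17) for V ~ Uniform[lo, hi].

    Returns (lhs, rhs), where lhs approximates
    E[c1 * V * exp(p V) / (c2 * exp(p V) + c3)] and rhs = c1 / (c2 + c3) * E[V].
    """
    h = (hi - lo) / n
    density = 1.0 / (hi - lo)
    lhs = 0.0
    mean = 0.0
    for j in range(n):
        v = lo + (j + 0.5) * h
        e = math.exp(p * v)
        lhs += c1 * v * e / (c2 * e + c3) * density * h
        mean += v * density * h
    rhs = c1 / (c2 + c3) * mean
    return lhs, rhs


# Zero-mean case, matching how the lemma is used in the proof of Theorem D.1.
lhs_sym, rhs_sym = lemma_sides(-1.0, 1.0, c1=1.0, c2=2.0, c3=3.0, p=1.5)
# Nonzero-mean case: the inequality does not require E[V] = 0.
lhs_neg, rhs_neg = lemma_sides(-2.0, 1.0, c1=1.0, c2=2.0, c3=3.0, p=1.5)
print(lhs_sym > rhs_sym, lhs_neg > rhs_neg)
```

In the zero-mean case the right-hand side vanishes, so the check reduces to the left-hand side being strictly positive, which is the fact the CE part of the proof relies on.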

F INSIGHTS FROM SPARSITY IN MLPS

We study the sparsity of activation maps in two-layer MLPs. By showing that sparsity emerges, the result here extends the scope of prevalence of activation sparsity from modern DNNs to two-layer



The approach is previously adopted in ConvNets for improving model robustness (Ahmad & Scheinkman, 2019), and more recently in Gupta et al. (2021) for improving memory efficiency of Transformers.
ViT uses GeLU as its activation function (Dosovitskiy et al., 2021). Here we switch to ReLU as it allows us to more easily measure the sparsity level using the number of nonzero entries with a very small performance drop (e.g., 47.78% with GeLU vs 47.58% with ReLU for Top-1 evaluation accuracy on ImageNet-21K).
https://github.com/google-research/t5x/blob/main/docs/models.md#t5-checkpoints
We use the implementation of jax.lax.approx_max_k (Chern et al., 2022) with a recall target of 0.95.
This is previously demonstrated in Ahmad & Scheinkman (2019) for ConvNets.
This requirement is generally satisfied unless the probability of $v_i = 0$ is 1.



Figure 1: Percentage of nonzero entries (y-axis, log scale) in the activation map as a function of number of training steps (x-axis) for a T5-Base model trained with the span corruption objective on the C4 dataset. Left: layers (from shallow to deep) of the encoder. Right: layers of the decoder.

Figure 2: Percentage of nonzero entries across different layers of trained Transformers (a) for both language data with T5 and vision data with ViT, (b) on both train and evaluation data, (c) for ViT trained on ImageNet of 21k vs 1k classes, (d) on ViT of varying configurations, and (e, f) on T5 of varying configurations. Note that the y-axis is in log scale. Sparsity emerges in all cases.

Figure 2b shows the percentage of nonzero entries in a trained T5 model measured on both the training data and the evaluation data. We see that the property of sparsity generalizes very well to evaluation data, as the curves for training and evaluation data align very closely with each other.

• Sparsity emerges on datasets of varying scale. Figure 2c shows the percentage of nonzero entries in ViT trained on both ImageNet-21k and ImageNet-1k, where the former is a superset of the latter with approximately 10× more images and 21× more classes. We see that the scale of data does not much affect the sparsity level.
• Sparsity emerges on Transformers of varying configurations. Figure 2d shows the percentage of nonzero entries for ViT of varying configurations in model size. Figures 2e and 2f show the percentage of nonzero entries for the encoder and decoder, respectively, of T5 with varying configurations in model size. We see that sparsity persists in all cases.
• Sparsity emerges across all layers of a Transformer. Finally, all plots in Figure 2 show that sparsity emerges in all layers of a Transformer. Moreover, in all cases the first few and last few layers tend to be denser than intermediate layers.
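The quantity reported throughout these plots, the percentage of nonzero entries in a post-ReLU activation map, can be computed as in the following minimal sketch. The helper names and the toy pre-activation values are hypothetical and only illustrate the measurement, not the evaluation code used for the figures.

```python
def relu(values):
    """Apply ReLU elementwise to a list of floats."""
    return [v if v > 0.0 else 0.0 for v in values]


def percent_nonzero(activation_map):
    """Return the percentage of nonzero entries in a 2-D activation map.

    `activation_map` is a list of rows (one per token), each a list of
    post-ReLU values for the MLP hidden units.
    """
    total = sum(len(row) for row in activation_map)
    nonzero = sum(1 for row in activation_map for v in row if v != 0.0)
    return 100.0 * nonzero / total


# Hypothetical pre-activation values for 2 tokens and 4 hidden units.
pre_activations = [[-1.2, 0.3, -0.7, -0.1],
                   [-0.5, -2.0, 1.1, -0.4]]
activations = [relu(row) for row in pre_activations]
print(percent_nonzero(activations))  # 2 of 8 entries are nonzero -> 25.0
```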

Figure 3: Percentage of times that each neuron in the first MLP layer of a trained T5 is activated on C4 dataset.

Figure 4: Activation sparsity across different encoder layers of trained T5 Transformers of (a) varying depth and (b, c) varying width (i.e., $d_{\text{ff}}$). Since with varying width the dimension of activation maps also changes, we evaluate sparsity both in terms of the percentage (as in (b)) and the count (as in (c)) of nonzeros. Deeper and wider models are sparser in terms of percentage of activated neurons.

Figure 5: Training and evaluation accuracy of Top-k T5 for three different sizes: base, large and 3B (left) and Top-k ViT (right) with varying k. Top-k Transformer is on par with regular Transformer for a large enough k; e.g., for T5 3B with k = 128 and ViT with k = 256, the drop is around 0.3%.
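Top-k thresholding keeps the k largest entries of an activation vector and sets the rest to zero. The sketch below illustrates the operation on a single vector with exact selection from the Python standard library; it is a stand-in for illustration only, since the footnotes state that the experiments use the approximate routine jax.lax.approx_max_k. The function name `top_k_threshold` and the toy vector are assumptions.

```python
import heapq


def top_k_threshold(activations, k):
    """Keep the k largest entries of one activation vector; zero out the rest."""
    if k >= len(activations):
        return list(activations)
    # Indices of the k largest values (exact selection, not approximate).
    keep = set(heapq.nlargest(k, range(len(activations)),
                              key=activations.__getitem__))
    return [v if i in keep else 0.0 for i, v in enumerate(activations)]


row = [0.0, 0.9, 0.1, 0.0, 0.4, 0.7]   # hypothetical post-ReLU activations
print(top_k_threshold(row, 2))          # [0.0, 0.9, 0.0, 0.0, 0.0, 0.7]
```

A smaller k gives a sparser activation vector; with k at least the vector length the input is returned unchanged.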

Figure 6: Latency reduction for unbatched greedy decoding in decoder of Top-k Transformers on TPUv4.

Figure B.2: Plots a, b: Percentage of nonzero entries in activation maps of BERT Base and Large models (Devlin et al., 2019) trained on Wikipedia dataset. We observe high levels of sparsity (<10% nonzero entries) similar to other Transformer models. Plots c, d: Histograms of pre-activation values for layers 1 and 12 of a BERT Base model. We notice that while at initialization the activations are distributed with mean 0, the mean quickly shifts negative as the training progresses, resulting in high levels of sparse activation values.

Figure B.1: Percentage of nonzero entries in activation maps of MLP-Mixer trained on ImageNet-21k. Results for token-mixing and channel-mixing MLPs are plotted in separate curves.

Figure B.1 shows the sparsity level at the intermediate layer of both token mixing and channel mixing MLPs of Mixer-B16. We also plot the sparsity level of ViT (i.e., the plot in Figure 2a) to Figure B.1

The results for ResNet-18 and ResNet-50 are presented in Figure B.4a and Figure B.4b, respectively.

Figure B.6: Effect of network width ∈ {64, 128, 256} on sparsity level across layers of ResNet-18.

Such observations may help to understand the role of each stage in ResNets.

Comparing the percentage of nonzero entries in ResNets (shown in Figure B.4a and Figure B.4b) and in Transformers (shown in Figure 2b), both of which are trained on ImageNet-1k, we see that ResNets produce much denser activation maps, with more than 10% nonzero entries in all layers. One possible explanation is that ResNet uses batch normalization (BN) before each activation function, while the Transformer's MLP does not have BN before the activation function. To understand the effect of BN on sparsity, we conduct an experiment with BN in ResNet removed. Because ResNet cannot be effectively trained without BN, we decrease the learning rate from standard ResNet training by a factor of 10. Moreover, we add a learnable scalar multiplier that is initialized as 0 to all the residual branches, following the study in Bachlechner et al. (2021); Qi et al. (2020). The results of the comparison with standard ResNet are reported in Figure B.5, where, to separate the effect of using a smaller learning rate, we also compare with a regular ResNet trained with a smaller learning rate than in standard training. The two subfigures of Figure B.5 show the effect of width on sparsity of the first and second layers in each residual block, respectively. It can be observed that removing BN does not significantly change the sparsity level, except for a small set of layers.

Meanwhile, the trend that larger models are sparser for Transformers (see Section 2.2) holds for ResNets as well, as seen in Figure B.6. Here, we vary the width of ResNet-18 by multiplying the number of output channels of each convolutional layer by a factor of 1 (for width = 64), 2 (for width = 128), and 4 (for width = 256). The two subfigures show the effect of width on sparsity of the first and second layers in each residual block, respectively.
In both cases, wider models have a smaller percentage of nonzero entries across all layers, except for the very last layer (i.e., the 2nd layer in block #7 shown in Figure B.6b).

The percentage of nonzero entries of these two Transformers are presented in Figure B.7 for the encoder layers and in Figure B.8 for the decoder layers. It can be seen that in encoder layer 6 for which the residual connection is removed, the sparsity has a very different trend during

Figure B.7: Percentage of nonzero entries in activation maps in vanilla T5-Large and in a T5-Large with the residual connection parallel to MLP removed in the 7 th encoder layer (i.e., encoder layer 6). Different subplots correspond to different encoder layers (see Figure B.8 for results on decoder layers). The encoder layer 6, which has its residual connection removed, shows a significant difference in both sparsity and the trend of sparsity during training. Sparsity level in other layers changes from vanilla T5-Large as well, though to a smaller extent.

Figure B.8: Same setup as Figure B.7, but showing the results for the last 12 layers of the decoder.

Figure B.9: Effect of initialization scale of the residual branch. We add a scalar multiplier that is initialized at 0 on the residual branch (a.k.a. ReZero; Bachlechner et al., 2021) of the MLP modules of a ViT, and train the model on ImageNet-21k. The percentage of nonzero entries is compared with those obtained with a regular ViT.
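The ReZero modification described in this caption can be sketched as a residual block of the form y = x + α · F(x), where α is a scalar that starts at zero. The code below is a toy illustration in plain Python; the class name `ReZeroBlock` and the stand-in branch function `toy_mlp` are hypothetical, and in a real model α would be a trainable parameter updated by the optimizer.

```python
class ReZeroBlock:
    """Residual block y = x + alpha * F(x) with alpha initialized at 0."""

    def __init__(self, branch_fn):
        self.branch_fn = branch_fn
        self.alpha = 0.0  # trainable scalar in a real model; starts at zero

    def __call__(self, x):
        fx = self.branch_fn(x)
        return [xi + self.alpha * fi for xi, fi in zip(x, fx)]


def toy_mlp(x):
    # Hypothetical stand-in for the MLP module: ReLU of a shifted input.
    return [max(0.0, xi - 0.5) for xi in x]


block = ReZeroBlock(toy_mlp)
x = [1.0, 0.2, 2.0]
print(block(x))      # identity at initialization: [1.0, 0.2, 2.0]
block.alpha = 0.5    # pretend training has moved alpha away from zero
print(block(x))      # [1.25, 0.2, 2.75]
```

Because α is zero at initialization, the block starts as the identity map and the residual branch only contributes once α moves away from zero during training.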

Figure B.10: Effect of optimizer on sparsity level across layers of ResNet-18.

results for ResNet-18 and ResNet-50 are presented in Figure B.10 and Figure B.11, respectively.

Figure B.12: Percentage of nonzero entries across different layers of trained T5 after pretraining vs after finetuning on a question answering task. Left: Results on encoder layers. Right: Results on decoder layers.

Figure C.3: Performance of Top-k ViT on corrupted ImageNet-1k test data with Gaussian noise (left), impulse noise (middle), and shot noise (right), each under five severity levels. Top-k improves robustness for all noise types and on all corruption levels with a suitable choice of k.

Figure D.1: Percentage of nonzero entries in ViT trained on ImageNet-21k (IM-21K) with (a) random labels where p% labels are replaced by labels drawn from a uniform distribution with p ∈ {50%, 70%, 100%}, (b) random images where each image is replaced by one where the pixels are drawn from i.i.d. uniform distribution in [-1, 1], and (c) infinite data where sufficient training data is generated by drawing random image and random label pairs so that the model is never trained on the same pair twice.

1) m exp (p i * • v i * ,m ) • C E [ v i * , y ] . (E.13) In above, E v i * ,m []means expectation with respect to v i * ,m , and E {v i,l |(i,l) =(i * ,m)} [] means expectation with respect to all other entries in V . Note that C

P (V = 0) = 1, there exists an interval (a, b) such that b a p(v)dv > 0. Without loss of generality we assume that b > a ≥ 0. Then,

model and an MLP intermediate dimension $d_{\text{ff}}$, the computational complexity of a Transformer for an input sequence of length $N$ is $O(N d_{\text{model}}^2 + N^2 d_{\text{model}} + N d_{\text{model}} d_{\text{ff}})$
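To get a feel for the relative size of the three terms in this complexity expression, they can be evaluated for a concrete configuration. The sketch below drops all constant factors, so the numbers are orders of magnitude rather than exact FLOP counts; the function name and the example values of N, d_model and d_ff are assumptions chosen for illustration.

```python
def transformer_cost_terms(seq_len, d_model, d_ff):
    """Return the three terms of O(N d_model^2 + N^2 d_model + N d_model d_ff).

    Constant factors are omitted, so the values only indicate relative scale.
    """
    return {
        "N * d_model^2": seq_len * d_model ** 2,
        "N^2 * d_model": seq_len ** 2 * d_model,
        "N * d_model * d_ff": seq_len * d_model * d_ff,  # term involving the MLP width
    }


# Hypothetical configuration: N = 512, d_model = 768, d_ff = 3072.
terms = transformer_cost_terms(seq_len=512, d_model=768, d_ff=3072)
total = sum(terms.values())
for name, value in terms.items():
    print(name, value, round(value / total, 3))
```

For this configuration the term involving $d_{\text{ff}}$ is the largest of the three, which is the term that activation sparsity in the MLP can help reduce.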



Evaluation of Top-128 ViT for ImageNet-1k classification in terms of 1) natural accuracy with ImageNet-1k evaluation set, 2) robust accuracy with {40%, 80%} corrupted training labels, 3) robust accuracy under input perturbation with additive {Gaussian, Impulse, Shot} noise on evaluation images, and 4) calibration error on evaluation data measured by ECE. Top-128 ViT is on par with ViT for natural accuracy while being significantly better for model robustness and calibration.

Robustness to Label Noise. An important challenge for DNNs is that they are highly susceptible to label noise, the problem where a certain percentage of training labels are corrupted or erroneously generated. This may be attributed to the fact that DNNs are often over-parameterized, and hence so "capable" that they tend to overfit, or "memorize," the noisy labels without generalizing to test data.

1: Configuration of T5 and ViT that are used in the experiments. $d_{\text{model}}$ and $d_{\text{ff}}$ are defined in Section 1.3. # Layers is the number of encoder + decoder layers for T5 and encoder layers for ViT.

Table C.1: Evaluation of ViT with a varying weight λ ∈ {0.1, 0.5, 1.0} on an $\ell_1$ regularization upon activation maps for ImageNet-1k classification in terms of 1) averaged percentage of nonzero entries in activation maps across all layers, 2) natural accuracy (i.e., on ImageNet-1k evaluation set), 3) robust accuracy under input perturbation with additive {Gaussian, Impulse, Shot} noise, and 4) calibration error measured by ECE.
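The regularized objective behind these results adds a weighted sum of the $\ell_1$ norms of all activation maps to the task loss. A minimal sketch of that objective is given below; the function name `l1_regularized_loss` and the toy values are hypothetical, and whether the penalty is averaged or otherwise normalized in the actual experiments is not specified here, so the plain sum is an assumption.

```python
def l1_regularized_loss(task_loss, activation_maps, lam):
    """Return task_loss + lam * (sum of l1 norms of all activation maps).

    `activation_maps` is a list with one 2-D map (list of rows) per layer.
    The unnormalized sum is an assumption made for this sketch.
    """
    penalty = sum(abs(v)
                  for amap in activation_maps
                  for row in amap
                  for v in row)
    return task_loss + lam * penalty


# Hypothetical activation maps from two layers.
maps = [[[0.5, 0.0], [0.0, 1.5]],
        [[0.25, 0.0]]]
print(l1_regularized_loss(task_loss=1.0, activation_maps=maps, lam=0.5))  # 2.125
```

Larger values of the weight push more activations towards zero, at the risk of hurting the task loss.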

Hypothesis D.2 (Sparsity from natural data). Sparsity in trained Transformers arises from natural training data (e.g., images for ViT and texts for T5).

We use a random image experiment to test Hypothesis D.2. With the ImageNet-21k dataset, we replace each image with a random image generated by drawing pixel values from an i.i.d. uniform distribution in the range of [0, 255], and use these images (instead of the original images in ImageNet-21k) for model training. Such random images contain neither low-dimensional structures nor compact representations. The percentage of nonzero entries of a ViT trained on the random image dataset is shown in Figure D.1b. It can be seen that the first four layers become sparser while the last few layers become relatively denser compared to training with natural images in ImageNet-21k. Nonetheless, all layers have < 10% activated neurons.

D.3 SPARSITY FROM DATA-FITTING?

Modern deep neural networks are often over-parameterized, with sufficient capacity to fit practical training datasets and obtain close-to-zero training error. There is evidence suggesting that this result holds true even if the data and labels are generated at random (Zhang et al., 2021). Hence, there is the possibility that sparsity arises because the training data, even if generated at random, is scarce relative to the scale of modern over-parameterized models.

Hypothesis D.3 (Sparsity from data-fitting). Sparsity in trained Transformers arises from the fact that models have more than sufficient capacity to fit training data of practical scale.

To test Hypothesis D.3, we design an infinite data experiment where the amount of training data is infinitely large, so that any practical Transformer becomes under-parameterized relative to the data and cannot fit the data. The way we generate infinite training data is to sample images with random pixels as in the random image experiment, and for each image we sample a random label as in the random label experiment. Moreover, we generate a sufficient amount of such training data to make sure that the model never sees the same data point twice during training. The number of training iterations in the infinite data experiment is kept the same as that of the random image and random label experiments.

The results of the random label, random image, and infinite data experiments in Figure D.1 show that labels, data, and data-fitting, as conjectured in Hypotheses D.1, D.2, and D.3, respectively, all affect the sparsity level of the activation map. Nonetheless, none of them fully explains the emergence of sparsity since, for all results in Figure D.1, the percentage of nonzero entries is considerably smaller than at initialization (i.e., 50%).
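The 50% reference level at initialization can be reproduced with a toy layer: when the weights are drawn from a zero-mean symmetric distribution and there is no bias, each pre-activation is positive with probability one half. The sketch below assumes Gaussian weights, zero bias, and a single random input vector; these choices, the dimensions, and the function name are illustrative assumptions rather than the paper's setup.

```python
import random


def init_nonzero_percent(d_in=64, d_hidden=2048, seed=0):
    """Percentage of nonzero ReLU outputs for a randomly initialized layer.

    Assumes zero bias and i.i.d. N(0, 1) weights, so each pre-activation is
    symmetric around zero for a fixed input.
    """
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]  # hypothetical input
    active = 0
    for _ in range(d_hidden):
        w = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
        pre_activation = sum(wi * xi for wi, xi in zip(w, x))
        if pre_activation > 0.0:  # ReLU output is nonzero
            active += 1
    return 100.0 * active / d_hidden


print(init_nonzero_percent())
```

The printed value should be close to 50, which makes the much smaller percentages observed after training in Figure D.1 stand out.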

ACKNOWLEDGMENTS

We would like to acknowledge helpful discussions with René Vidal and Jeremias Sulam from Johns Hopkins University, with Weijie Su from UPenn, with Yuxiang Wang from UC Santa Barbara, with Atlas Wang from UT Austin, with Nishanth Dikkala, Nikhil Vyas, Preston McAfee and Mukund Sundararajan from Google, with Subutai Ahmad from Numenta, with Wei Hu, Salar Fattahi, and Jianhao Ma from University of Michigan, with Tuo Zhao from Georgia Tech. We particularly thank Donhauser Konstantin from ETH Zurich for interesting discussion on hypothesis for emergence of sparsity.


MLPs, which are one of the simplest neural network architectures. Moreover, by training such two-layer MLPs with different types of data, we provide additional insights on the causes for emergence of sparsity.

Datasets. We conduct our experiment with the MNIST dataset, which contains 60,000 grey scale images of handwritten digits. Similar to the experiment in Section D, we also consider a dataset with random data, as well as a dataset with infinite data. For the random data, we replace each image of MNIST with a random one drawn from sampling i.i.d. pixels from a uniform distribution, and each label with a random class amongst 10. Note that the image-label pairs are fixed throughout training. For the infinite data, the random images and random labels are generated on-the-fly, representing a random dataset of infinite size.
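The difference between the fixed random dataset and the infinite data setting can be sketched with two small generators. This is an illustration using only the Python standard library; the function names, the pixel range [0, 1), and the seeds are assumptions, while the 784 pixels per image follow from the 28 × 28 size of MNIST digits.

```python
import itertools
import random


def random_pair(rng, num_pixels=784, num_classes=10):
    """Draw one random image (i.i.d. uniform pixels) and one random label."""
    image = [rng.random() for _ in range(num_pixels)]  # pixel range is assumed
    label = rng.randrange(num_classes)
    return image, label


def fixed_random_dataset(size, seed=0):
    """Random data: pairs are sampled once and reused in every epoch."""
    rng = random.Random(seed)
    return [random_pair(rng) for _ in range(size)]


def infinite_random_stream(seed=0):
    """Infinite data: pairs are generated on the fly and never stored."""
    rng = random.Random(seed)
    while True:
        yield random_pair(rng)


dataset = fixed_random_dataset(size=5)
first_batch = list(itertools.islice(infinite_random_stream(seed=1), 3))
print(len(dataset), len(first_batch), len(dataset[0][0]))
```

A model trained on `fixed_random_dataset` can in principle memorize its pairs, whereas a model trained on `infinite_random_stream` sees fresh pairs at every step.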

Models and Training. We train two-layer MLPs with ReLU activation with varying width (i.e., hidden dimension): 32, 128, 512, 2048, 8192, 32768 and 131072. We use three different optimizers: SGD, SGD with momentum, and Adam, all for 200 epochs (for the infinite data case, we use the same number of iterations as that for training on MNIST and random data). We find that 200 epochs is sufficient for the reported metrics to converge in most of the cases.

Results

• For random data, we observe a uni-modal shaped curve for the sparsity level. Namely, when the model width is small, and hence the model cannot fit the training data well, the percentage of nonzero entries is small. As the model width increases, where the model is able to fit the training data as evidenced by the fact that the training accuracy increases, we observe that the percentage of nonzero entries starts to increase. However, as we further increase the model size into the regime where the model is able to perfectly fit the training data, we see that the percentage of nonzeros starts to decrease.
• For infinite data, where the model cannot fit the training data (hence training accuracy is 0.1, which is the same as the result from random guessing), the percentage of nonzero entries is close to 0. This is aligned with the result of the random data experiment with a small model width.
• For MNIST, where the model of varying width in our experiment is able to fit the training data, we observe that the percentage of nonzero entries decreases. This trend aligns with the random data experiment with large model width.

The evidence above suggests that the sparsity level may be associated with the under- and over-parameterization of the models. Namely, the percentage of nonzero entries is the highest when the model size is close to the point that the model can start to fit the training data (i.e., the interpolation threshold), and is lower in both under- and over-parameterized regimes. It may be intriguing to note that a similar pattern exists for the variance (as in the bias-variance tradeoff) curve of deep learning models, which is shown in Yang et al. (2020) to exhibit a uni-modal shape as well. Such a connection may help us understand the interplay between generalization and sparsity of activation in deep learning models.

