THE LAZY NEURON PHENOMENON: ON EMERGENCE OF ACTIVATION SPARSITY IN TRANSFORMERS

Abstract

This paper studies a curious phenomenon: machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, and at layers of all depths. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small k brings a collection of desirable properties, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of prediction confidence.

1. INTRODUCTION

The great success of modern machine learning for tasks in computer vision, natural language processing, game playing and beyond is driven primarily by the computational model known as deep neural networks (DNNs) (LeCun et al., 2015). With inspiration drawn from biological intelligent systems, DNNs are massive systems of distributed computational nodes (a.k.a. neurons) with learned inter-connections, which possess the capacity to accomplish complex real-world tasks. Although originally motivated by biological brains, DNNs differ from biological neural networks at very fundamental levels. One such difference is in the sparsity of neural activities. Evidence from neuroscience suggests that neural activity in biological brains is sparse, namely, only a small percentage of all neurons fire at each time (Ahmed et al., 2020; Barth & Poulet, 2012; Kerr et al., 2005; Poo & Isaacson, 2009). Sparse firing suggests that despite having billions of neurons, only a small fraction of the brain participates in computation at any time, which may explain how brains can operate at a very low energy cost. In contrast, learning and inference with DNNs rely primarily on dense computations where all neurons are involved for any input. In fact, modern computational hardware for deep neural networks, such as GPUs and TPUs, is designed to facilitate massive-scale dense computations. Even with such dedicated hardware, DNNs are still notoriously resource-demanding to train and deploy. Aside from computational efficiency, DNNs also lag far behind biological brains in terms of robustness to input perturbations, error correction for erroneous training labels, confidence calibration for predictions, etc.

1.1. AN INTRIGUING OBSERVATION: ACTIVATIONS ARE SPARSE IN TRAINED TRANSFORMERS

This paper provides an extensive study of a surprising observation: despite performing dense computations, DNNs produce very sparse activations in their intermediate layers once trained. Specifically, we see that the percentage of nonzero entries is around 50% at initialization, which is expected: randomly initialized weights produce roughly equal numbers of positive and negative entries in the pre-activation map, resulting in about 50% nonzeros after the ReLU. However, by the end of training the percentage of nonzero entries drops drastically: the average across all encoder and decoder layers is 2.7%, with the largest being 12.0% and the smallest only 1.1%. The emergence of sparse activation in Transformers bears a similarity to the sparsity of neural activities in biological brains, revealing an interesting connection between artificial and biological networks. Moreover, unlike classical sparse methods where such a connection is established via explicit sparse regularization (Olshausen & Field, 1996), the sparsity observed in Transformers is emergent without any explicit design.
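The measurement behind these percentages can be sketched in a few lines. The following is a minimal illustration (the function name and shapes are our own, chosen for exposition; the hidden width 3072 matches T5-Base), showing why a randomly initialized layer sits near 50% nonzeros:

```python
import numpy as np

def activation_sparsity(pre_activation: np.ndarray) -> float:
    """Fraction of nonzero entries in the post-ReLU activation map."""
    post = np.maximum(pre_activation, 0.0)  # ReLU
    return float(np.count_nonzero(post) / post.size)

# Roughly zero-mean pre-activations, as produced by random initialization:
# about half the entries survive the ReLU.
rng = np.random.default_rng(0)
x = rng.standard_normal((512, 3072))  # (tokens, MLP hidden width)
print(f"{activation_sparsity(x):.2%}")  # close to 50% at initialization
```

After training, the same measurement applied to real activation maps yields the few-percent figures reported above.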

1.2. PREVALENCE, BENEFITS, AND CAUSES OF SPARSITY

This paper studies the aforementioned phenomenon of sparse activation in trained Transformers, with a focus on the following two questions. First, is the phenomenon shown in Figure 1 a corner case, or does it occur broadly? Second, why should we care about sparsity in DNNs, other than the appeal of its similarity to biological brains? Our main results along these two lines are summarized below.

1. Sparsity is a prevalent phenomenon. We show in Section 2 that the emergence of sparse activation reported in Figure 1 is not an isolated, cherry-picked case. Rather, sparsity is prevalent and occurs broadly in Transformer models: it emerges in all layers of a Transformer, for Transformers trained on both vision and natural language data, for Transformers of various configurations, and for activation maps computed on both train and test data. Moreover, through controlled experiments on the width and depth of Transformers, we reveal that larger models are sparser, as measured by the percentage of nonzero entries. We also show in Appendix B that sparsity emerges with many other architectures and with different optimizers.

2. Sparsity improves efficiency. Sparsity of the activation maps in trained Transformers implies that a large proportion of the computation during inference is spent multiplying values by zero. Hence, FLOPs can be drastically reduced by avoiding all such computations, which we discuss in Section 3.1. Motivated by this observation, and to obtain reduced FLOPs not only after training but throughout training, we introduce the Top-k Transformer in Section 3.2, a simple modification of Transformers in which Top-k thresholding is applied to the activation maps.¹ We show that Top-k Transformers with a reasonably sized k perform on par with vanilla Transformers.
To demonstrate the computational benefits of Top-k Transformers, we provide proof-of-concept results on wall-time reduction for the task of unbatched decoding on TPUv4 with a large Top-k T5. Meanwhile, we emphasize that this result is far from fully realizing the benefit of sparse activation, due to a lack of hardware support for sparse computation.
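The Top-k thresholding described above can be sketched as follows. This is an illustrative NumPy version (the function name and shapes are ours; the paper's models apply the operation per token inside each MLP) that keeps only the k largest post-ReLU entries in each row and zeros the rest:

```python
import numpy as np

def top_k_relu(pre_activation: np.ndarray, k: int) -> np.ndarray:
    """Apply ReLU, then keep only the k largest entries per row (token)."""
    post = np.maximum(pre_activation, 0.0)  # ordinary ReLU
    # k-th largest value of each row, used as a per-row threshold.
    kth = np.partition(post, -k, axis=-1)[..., -k, None]
    return np.where(post >= kth, post, 0.0)

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 3072))        # (tokens, MLP hidden width)
sparse_h = top_k_relu(h, k=128)           # at most k nonzeros per token
print(np.count_nonzero(sparse_h, axis=-1))
```

A small k thus enforces a fixed, predictable sparsity level throughout training, rather than waiting for sparsity to emerge.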



¹ This approach was previously adopted in ConvNets for improving model robustness (Ahmad & Scheinkman, 2019), and more recently in Gupta et al. (2021) for improving the memory efficiency of Transformers.



Figure 1: Percentage of nonzero entries (y-axis, log scale) in the activation map as a function of the number of training steps (x-axis) for a T5-Base model trained with the span corruption objective on the C4 dataset. Left: layers (from shallow to deep) of the encoder. Right: layers of the decoder.

We study the Transformer (Vaswani et al., 2017), a DNN architecture that has become a workhorse for modern applications. Transformers are constructed by interleaving a self-attention module with a multi-layer perceptron (MLP) of depth 2, and the focus of this paper is on the activation map of the first MLP layer. Figure 1 shows the sparsity of the activation maps, measured by the percentage of nonzeros, in all MLP layers of a T5-Base model (Raffel et al., 2020) computed on the training set of C4.
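To make the object of study concrete, here is a minimal sketch of the depth-2 Transformer MLP (dimensions follow T5-Base; the function name and the plain random initialization are our own simplifications, not the paper's exact training setup). The activation map in question is `h`, the post-ReLU output of the first linear layer:

```python
import numpy as np

def transformer_mlp(x, w1, b1, w2, b2):
    """Depth-2 MLP block: the 'activation map' is h, the post-ReLU
    output of the first linear layer."""
    h = np.maximum(x @ w1 + b1, 0.0)   # activation map studied in this paper
    return h @ w2 + b2, h

rng = np.random.default_rng(0)
d_model, d_ff = 768, 3072              # T5-Base model and MLP hidden widths
x = rng.standard_normal((16, d_model))
w1 = rng.standard_normal((d_model, d_ff)) * d_model ** -0.5
w2 = rng.standard_normal((d_ff, d_model)) * d_ff ** -0.5
y, h = transformer_mlp(x, w1, np.zeros(d_ff), w2, np.zeros(d_model))
print(np.count_nonzero(h) / h.size)    # about 0.5 at random initialization
```

Tracking the nonzero fraction of `h` over training steps is what produces the curves in Figure 1.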

