THE LAZY NEURON PHENOMENON: ON EMERGENCE OF ACTIVATION SPARSITY IN TRANSFORMERS

Abstract

This paper studies a curious phenomenon: machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, and at layers of all depths. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small k brings a collection of desired properties, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of prediction confidence.
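The Top-k thresholding mentioned above can be sketched in a few lines: keep only the k largest entries of each activation vector and zero out the rest. The function name and NumPy formulation below are illustrative, not the paper's actual implementation.

```python
import numpy as np

def topk_activation(h, k):
    """Keep the k largest entries of each activation vector, zeroing the rest.

    A minimal sketch of Top-k activation thresholding; operates along the
    last axis so it applies to batched activation maps as well.
    """
    h = np.asarray(h, dtype=float)
    # Indices of the k largest entries along the last axis.
    idx = np.argpartition(h, -k, axis=-1)[..., -k:]
    out = np.zeros_like(h)
    np.put_along_axis(out, idx, np.take_along_axis(h, idx, axis=-1), axis=-1)
    return out

# Example: with k=2, only the two largest activations survive.
print(topk_activation([3.0, 1.0, 2.0, 0.5], 2))  # [3. 0. 2. 0.]
```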

1. INTRODUCTION

The great success of modern machine learning for tasks in computer vision, natural language processing, game playing, and beyond is driven primarily by the computational model known as deep neural networks (DNNs) (LeCun et al., 2015). With inspiration drawn from biological intelligent systems, DNNs are massive systems of distributed computational nodes (a.k.a. neurons) with learned inter-connections, which possess the capacity to accomplish complex real-world tasks. Although originally motivated by biological brains, DNNs differ from biological neural networks at very fundamental levels. One such difference is in the sparsity of neural activities. Evidence from neuroscience suggests that neural activity in biological brains is sparse, namely, only a small percentage of all neurons fire at each time (Ahmed et al., 2020; Barth & Poulet, 2012; Kerr et al., 2005; Poo & Isaacson, 2009). Sparse firing suggests that despite having billions of neurons, only a small fraction of the brain participates in computation at each time, which may explain why brains can operate at a very low energy cost. In contrast, learning and inference with DNNs rely primarily on dense computations where all neurons are involved for any input. In fact, modern computational hardware for deep neural networks, such as GPUs and TPUs, is designed to facilitate massive-scale dense computations. Even with such dedicated hardware, DNNs are still notoriously resource-demanding to train and deploy. Aside from computational efficiency, DNNs also lag far behind biological brains in terms of robustness to input perturbations, error correction for erroneous training labels, confidence calibration for predictions, etc.

1.1. AN INTRIGUING OBSERVATION: ACTIVATIONS ARE SPARSE IN TRAINED TRANSFORMERS

This paper provides an extensive study of a surprising observation: despite performing dense computations, DNNs produce very sparse activations in their intermediate layers once trained. Specifically,

