COMPARATIVE ANALYSIS BETWEEN VISION TRANSFORMERS AND CNNS FROM THE VIEW OF NEUROSCIENCE

Abstract

Neuroscience has provided many inspirations for the development of artificial intelligence, especially for neural networks for computer vision tasks. Recent research on animals' visual systems establishes a connection between neural sparsity and animals' levels of evolution, based on which we compare the two most influential vision architectures, the Transformer and the CNN. In particular, the sparsity of attentions in Transformers is comprehensively studied, and previous knowledge on the sparsity of neurons in CNNs is reviewed. In addition, a novel metric for neural sparsity is defined, and ablation experiments are carried out on various types of Transformer and CNN models. Finally, we draw the conclusion that more layers in models result in higher sparsity; however, too many heads in Transformers may reduce sparsity, which we attribute to the significant overlap among the effects of attention units.

1. INTRODUCTION

Visual perception is not only the most significant kind of human perception, but also the most typical characteristic of higher animals' intelligence (see footnote 1). As a consequence, computer vision has become one of the most prominent research fields in the history of artificial intelligence: over the past several decades, various machine vision tasks have been defined for practical applications, and numerous algorithms and models have emerged to improve the performance of computers on them. Among all vision architectures, the CNN (convolutional neural network) is the most influential; it led machine learning into the deep learning era and dominated almost all fundamental vision tasks in the 2010s, including image classification (Krizhevsky et al., 2012; He et al., 2016; Tan & Le, 2019), object detection (Redmon et al., 2016; Ren et al., 2015; He et al., 2017) and semantic segmentation (Ronneberger et al., 2015; Chen et al., 2018). The CNN architecture was initially inspired by studies of animals' visual systems. Through biological experiments on mammals (among the most evolved groups in the animal kingdom) (Hubel & Wiesel, 1959), some essential properties of visual systems were observed, such as hierarchical structure, receptive fields and translation invariance. These discoveries laid the foundation for the design of the CNN architecture, which in turn first demonstrated its striking performance in vision tasks (LeCun et al., 1989). In recent years, several works comparing CNNs with higher animals such as humans have appeared, providing helpful insights for research on the interpretation of deep learning and on brain-inspired intelligence (Geirhos et al., 2020). Starting from 2020, the Transformer architecture began to replace the CNN as the new focus of research in computer vision.
Though the Transformer had already swept the natural language processing field (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020), models applying the attention mechanism could not at first surpass the performance of ResNet-based CNNs (He et al., 2016; Xie et al., 2017; Tan & Le, 2019; Radosavovic et al., 2020) on vision problems. ViT (Vision Transformer), proposed by Dosovitskiy et al. (2021), is the milestone marking the new era of computer vision: it depends completely on the attention mechanism, uses no convolution, and became the state of the art in image recognition at scale (represented by ImageNet (Russakovsky et al., 2015)). Since then, hundreds of works on Transformer-based computer vision have been published, contributing architectural innovations (Liu et al., 2021; Wang et al., 2021; Chu et al., 2021), novel training techniques (Touvron et al., 2021; 2022), extensions to other tasks (Carion et al., 2020; Chen et al., 2021b; Jiang et al., 2021; Chen et al., 2021a), etc. The attention mechanism is also recognized as an essential property of animals' perception; therefore, some researchers have attempted to observe and study the Transformer using prior knowledge from bioscience. Meanwhile, since Transformers currently perform better than CNNs in many tasks, people tend to look for evidence that the Transformer is a more advanced architecture than the CNN. For instance, Tuli et al. (2021) propose that the Transformer is more similar to the human visual system in terms of behavioral analyses. However, this statement is not well supported, since many other properties of human and animal visual systems have not yet been measured and analysed in vision models.
Inspired by recent research on sparsity in animals' visual systems, we discuss the sparsity of attentions in Vision Transformers in depth, and compare it with the sparsity of neurons in CNNs through systematic experiments on a set of vision models, including classic CNNs and Transformers of different configurations. From the experimental results, we conclude that adding layers to models enhances the effect of sparsity, but adding heads to Transformers may have the opposite effect when the number of heads is too large. Specifically, our contributions mainly include:

• In section 2, some related works are reviewed.
• In section 3, sparsity of attentions in Vision Transformers is discovered and strictly defined, and its distribution is analysed from different perspectives.
• In section 4, previous works on sparsity of neurons in CNNs and in animals' visual systems are reviewed.
• In section 5, ablation experiments and a metric for neural sparsity are designed, and experimental results are reported and analysed.

Please refer to Appendix A for experimental details. The code for our experiments is publicly available at https://github.com/SmartAnonymous/Codes-for-ICLR-2023.

In a Transformer containing L (in ViT-base L = 12) Transformer Encoders (layers), each one carries out the following process during inference:

2. RELATED WORKS

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, 2, \dots, L$$
$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, 2, \dots, L$$

in which $z_0$ is the original patch embeddings, MSA refers to the multi-head self-attention function, LN represents layer normalization and MLP is a multi-layer perceptron. Specifically, each head in an MSA (in ViT-base one MSA contains 12 heads) is calculated as:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) = \mathrm{softmax}\left(\frac{(QW_i^Q)(KW_i^K)^T}{\sqrt{d}}\right)(VW_i^V) \tag{2}$$

in which $Q$, $K$, $V$ are respectively the query, key and value matrices, $W_i^Q$, $W_i^K$, $W_i^V$ are the corresponding weights, and $d$ is a scaling factor determined by the model. Intuitively, $VW_i^V$ can be regarded as containing the information in the features, and the softmax term is a coefficient matrix for transferring information between pairs of features, which plays a pivotal role in the attention mechanism. We call it the attention map, denoted AttnMap:

$$\mathrm{AttnMap} = \mathrm{softmax}\left(\frac{(QW_i^Q)(KW_i^K)^T}{\sqrt{d}}\right)$$

Here $\mathrm{AttnMap} \in [0, 1]^{N \times N}$, in which $N$ is the number of embeddings (in ViT-base $N = 197$). Each row of the attention map sums to 1, as ensured by the softmax. In the following parts we visualize and analyse the AttnMaps in ViT-base and demonstrate the patterns we discovered in attentions.
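To make the definition of AttnMap concrete, the following is a minimal sketch in plain Python, not the implementation used in the experiments. It assumes that the projected queries and keys of one head are already available as N × d lists, and that the scaling factor d equals the per-head dimension, which is the usual choice but is an assumption here. The toy sizes and the function names are illustrative only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """AttnMap = softmax(q k^T / sqrt(d)), applied row by row.

    q and k are N x d lists standing in for the projected queries (Q W^Q)
    and keys (K W^K) of a single head. Assumption: d is the per-head
    dimension, used as the scaling factor.
    """
    d = len(q[0])
    scale = math.sqrt(d)
    attn = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k]
        attn.append(softmax(scores))
    return attn


# Toy sizes (hypothetical): N = 5 embeddings, d = 4 dimensions per head.
rng = random.Random(0)
N, d = 5, 4
q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]

attn = attention_map(q, k)
row_sums = [sum(row) for row in attn]
# Because every row sums to 1, the mean of all N*N entries is 1/N.
mean_value = sum(row_sums) / (N * N)
```

With N = 197, as in ViT-base, the same argument gives a mean entry of 1/197, which is why most of an attention map looks dark when a few entries are large.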

3.2. SPARSITY OF ATTENTIONS

Sparse activation is a common phenomenon in deep neural networks, which has already been observed in CNNs and in Transformers. In the attention maps of deeper layers in Vision Transformers, we also discover evident sparsity of columns (vertical lines), as shown in Figure 1. Figure 1 (b) presents a typical attention map, in which several vertical lines are significantly "brighter" than others. In other words, those columns contain coefficients which are particularly large. As the mean of all values in an AttnMap is $1/N = 1/197 < 0.01$, it is not surprising that most of the area in an attention map is "black". Large values in the AttnMaps of deeper layers are concentrated in certain columns, instead of being scattered across different columns. This general phenomenon indicates that certain features (vectors) receive more attention in deeper layers, and are therefore likely to be more significant than other features. Because these features are considerably more prominent than the rest, we refer to this phenomenon as sparsity of attentions, and we are interested in its distribution (see the following parts of section 3) and effectiveness (see section 5).

The distributions of sparsity of attentions among heads in the same layer are similar. It is observed in Figure S1 that all twelve heads in a deeper layer share similar locations (indexes) of "bright" columns, which further verifies that the corresponding features receive more attention in all heads. More generally, as shown in Figure 2 (a), most of the means of the correlation coefficients between the AttnMaps of pairs of heads are large, demonstrating that all the heads in one Transformer model share similar patterns of sparse activation.

The distributions of sparsity of attentions among input images are dissimilar. This statement is verified only to guarantee that attention is not always concentrated on certain fixed features, but is distributed differently for different input images. Otherwise, the distribution of sparsity of attentions would be determined only by the Transformer model and its weights, and all of our analysis would be meaningless. This argument is further supported by the results in subsection 3.4.
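The comparison between heads can be illustrated with a small sketch. The text does not specify how the correlation coefficient between two AttnMaps is computed, so the code below assumes the Pearson correlation of the flattened maps. The three 3 × 3 maps are made-up toy values, not measured attentions: `head_a` and `head_b` share a bright first column, while `head_c` concentrates on the last column.

```python
import math


def flatten(matrix):
    """Flatten an N x N attention map into a list of N*N values."""
    return [v for row in matrix for v in row]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def map_correlation(map_a, map_b):
    """Assumed similarity measure: Pearson correlation of flattened maps."""
    return pearson(flatten(map_a), flatten(map_b))


# Toy attention maps (hypothetical values; every row sums to 1).
head_a = [[0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.90, 0.05, 0.05]]
head_b = [[0.70, 0.20, 0.10], [0.80, 0.10, 0.10], [0.85, 0.10, 0.05]]
head_c = [[0.10, 0.10, 0.80], [0.20, 0.10, 0.70], [0.05, 0.05, 0.90]]

corr_ab = map_correlation(head_a, head_b)  # same bright column
corr_ac = map_correlation(head_a, head_c)  # different bright columns
```

Heads that attend to the same columns give a correlation close to 1, whereas heads with different bright columns give a low or negative value, which is the contrast that Figure 2 (a) summarizes over pairs of heads.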

3.3. NUMERICAL DISTRIBUTION OF SPARSITY OF ATTENTIONS

For a more intuitive demonstration of the numerical distribution of sparsity of attentions, $\mu$ is defined as the mean of the values in the columns of a given index over the attention maps of one layer, and $\nu$ is defined as the negative denary logarithm of $\mu$:

$$\mu_{l,j} = \frac{1}{HN} \sum_{h=1}^{H} \sum_{i=1}^{N} \mathrm{AttnMap}_{l,h,ij} \in [0, 1], \qquad \nu_{l,j} = -\log_{10} \mu_{l,j} \in [0, +\infty)$$

in which $l$, $h$, $i$, $j$ are the indexes of layers, heads, rows of AttnMaps and columns of AttnMaps, respectively. As shown in Figure 3 (a), most of $\nu$ of the last layer is larger than $-\log_{10} \bar{\mu}$, where $\bar{\mu}$ denotes the mean of $\mu$, while a small portion of $\nu$ lies around a peak below $\nu = 1$. In other words, most of $\mu$ is around the order of magnitude of $10^{-3}$, while a small portion of $\mu$ is gathered around the order of magnitude of $10^{-1}$. This result provides direct evidence for the existence and significance of sparsity of attentions in the deeper layers of Transformers (the density curves of $\nu$ for all layers are shown in Figure S4).
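The definitions of µ and ν translate directly into code. The sketch below is illustrative only: it assumes the attention maps of one layer are given as a list of H maps, each an N × N list of rows, and uses two made-up 3 × 3 maps as input. Note that ν is undefined for µ = 0, which can happen numerically if softmax outputs underflow; the sketch does not handle that case.

```python
import math


def column_means(attn_maps):
    """mu_j = 1/(H*N) * sum over heads h and rows i of AttnMap[h][i][j]."""
    num_heads = len(attn_maps)
    n = len(attn_maps[0])
    totals = [0.0] * n
    for head in attn_maps:
        for row in head:
            for j, value in enumerate(row):
                totals[j] += value
    return [t / (num_heads * n) for t in totals]


def negative_log10(mu):
    """nu_j = -log10(mu_j); requires every mu_j to be strictly positive."""
    return [-math.log10(m) for m in mu]


# Toy layer with H = 2 heads and N = 3 features (hypothetical values).
layer_maps = [
    [[0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.90, 0.05, 0.05]],
    [[0.70, 0.20, 0.10], [0.80, 0.10, 0.10], [0.85, 0.10, 0.05]],
]

mu = column_means(layer_maps)
nu = negative_log10(mu)
# Since every row sums to 1, the mu_j sum to 1, so their mean is 1/N.
mean_mu = sum(mu) / len(mu)
```

A bright column has a large µ and therefore a small ν, so the sparse features are those in the low-ν peak of Figure 3 (a).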

3.4. SPATIAL DISTRIBUTION OF SPARSITY OF ATTENTIONS

It has been illustrated that attentions are distributed sparsely among the columns of attention maps, which correspond to certain features. Another question then emerges: are all the features equally likely to become the focus of attention? Figure 3 (b) shows, for the last layer, the probability that each feature's µ falls within the top 5% (not considering the 0th feature), drawn on the patches of the input images to which the features correspond. The distribution is not completely uniform, but all the probabilities lie in [0.002, 0.028], which does not indicate great dispersion. It is therefore reasonable to carry out the ablation experiments of section 5 on Transformers. Interestingly, the locations of the four largest probabilities are symmetrical in a sense, which we cannot currently explain. In summary, this section shows in detail that sparsity of attentions exists in the deeper layers of the Vision Transformer, and that its distribution is similar to sparsity in animals' visual systems (illustrated in subsection 4.2).
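As an illustration of how a top-5% selection can be mapped back onto the image, the sketch below selects the features with the largest µ for one image and converts each index to a patch position. It rests on two assumptions that the text does not state: index 0 is the class token, and the remaining 196 features are ordered row by row on the 14 × 14 patch grid. The µ values are made up. Counting how often each patch is selected over many images would then give a frequency map of the kind shown in Figure 3 (b).

```python
def top_fraction_indices(mu, fraction, skip_first=True):
    """Indices of the largest `fraction` of mu, optionally ignoring index 0."""
    start = 1 if skip_first else 0
    candidates = list(range(start, len(mu)))
    k = max(1, round(fraction * len(candidates)))
    return sorted(candidates, key=lambda j: mu[j], reverse=True)[:k]


def patch_position(j, grid=14):
    """Map feature index j (1 .. grid*grid) to (row, col) on the patch grid.

    Assumption: index 0 is the class token and patches are stored in
    row-major order.
    """
    return divmod(j - 1, grid)


# Hypothetical mu for one image: 197 features, two of them prominent.
mu = [0.001] * 197
mu[0] = 0.5     # class token, excluded from the selection
mu[100] = 0.2
mu[15] = 0.1

top = top_fraction_indices(mu, fraction=0.05)
positions = [patch_position(j) for j in top]
```

Here 5% of the 196 patch features rounds to 10 selected indices, of which the two prominent ones come first.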

4. SPARSITY IN CONVOLUTIONAL NEURAL NETWORKS AND ANIMALS' VISUAL SYSTEMS

4.1. SPARSITY IN CONVOLUTIONAL NEURAL NETWORKS

In contrast with Transformers, sparsity in CNNs usually refers to the phenomenon that connections between neurons are sparsely activated (some values are zero or do not affect the calculation significantly), which is mainly caused by nonlinear activation functions such as ReLU (rectified linear unit) (Hara et al., 2015). In fact, the CNN itself is an architecture sparsified from the fully connected network, mainly according to the locality principle. So far, sparsity in CNNs has been well studied and widely applied in model compression and in efficient inference and training, through approaches such as pruning and sparse training (Cheng et al., 2017; Hoefler et al., 2021; Perrinet, 2017). Meanwhile, sparsity has also been used to analyze and explain CNN model performance from the view of neuroscience (Zhao & Zhang, 2022). In our experiments on CNNs, the $\ell_1$ norms of the outputs of neurons after the activation layers are calculated and sorted in decreasing order, and the neurons corresponding to a given percentage of the largest norms are recognized as the sparse ones.
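The selection rule for CNNs described above can be sketched as follows. The code is a simplified illustration, assuming that each unit's output is available as a flat list of pre-activation values and that the activation is ReLU; what exactly constitutes a "neuron" (for example, a channel) is left open here. The numbers are made up.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0 else 0.0


def l1_norm(values):
    """Sum of absolute values."""
    return sum(abs(v) for v in values)


def top_p_units(pre_activations, p):
    """Rank units by the l1 norm of their post-ReLU outputs.

    Returns the indices of the top `p` fraction of units (largest norm
    first) together with all norms.
    """
    norms = [l1_norm([relu(v) for v in unit]) for unit in pre_activations]
    k = max(1, round(p * len(norms)))
    order = sorted(range(len(norms)), key=lambda u: norms[u], reverse=True)
    return order[:k], norms


# Hypothetical pre-activation outputs of 4 units, 3 values each.
pre_activations = [
    [-1.0, 2.0, 3.0],
    [0.5, -0.5, 0.1],
    [-2.0, -3.0, -1.0],
    [4.0, 4.0, -4.0],
]

selected, norms = top_p_units(pre_activations, p=0.5)
```

In this toy case the third unit is silenced entirely by ReLU, and the two units with the largest norms are the ones treated as sparse units.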

4.2. SPARSITY IN ANIMALS' VISUAL SYSTEMS

The basic units of animals' neural systems are neurons, which process information by generating sequences of electrical impulses. Sparse encoding, in which states and events are encoded using only a small subset of neurons, has been theoretically justified and physically observed to be commonly adopted in brains (Dayan & Abbott, 2001). In particular, experimental evidence for sparse firing in animals' visual cortex has been reported (Willmore et al., 2011; Barth & Poulet, 2012), especially in V1 (the primary visual cortex of primates). Moreover, recent biological research points out that neural sparsity is more prominent in higher animals than in lower animals (Wildenberg et al., 2021). This result inspires us to carry out a comparative analysis of sparsity among vision models.

5.1. DESIGN OF ABLATION EXPERIMENTS

In order to fairly compare the effectiveness of sparsity in different vision models, a series of ablation experiments is designed. As the name implies, the effect of sparsity is measured by the change in prediction accuracy when a certain percentage of the basic units (attentions in Transformers and neurons in CNNs) of certain layers are dropped (set to 0). The prediction accuracies when dropping the top-p sparse units of the last n layers (see footnote 2) and when randomly dropping p units of the last n layers are denoted $A_t(p, n)$ and $A_r(p, n)$ respectively, and the effect of sparsity is reported by $\psi(p, n)$:

$$\psi(p, n) = \frac{A_0 - A_t(p, n)}{A_0 - A_r(p, n)}$$

in which $A_0$ is the prediction accuracy of the full model. $\psi(p, n)$ is a reasonable metric for functional sparsity, and is expected to be larger than 1 if the sparsity is effective; the larger $\psi(p, n)$ is, the more effective the sparsity is. To comprehensively study the sparsity of Vision Transformers and CNNs, ablation experiments are carried out on the following models of different configurations: ViT (Dosovitskiy et al., 2021; Steiner et al., 2021), DeiT (Touvron et al., 2021), Swin (Liu et al., 2021), VGG (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016). Parameters are selected as p ∈ {5%, 10%, 20%, 30%} and n ∈ {1, 2, 3}. Additionally, every $A_r(p, n)$ is reported as the mean over 3 replications with different seeds.
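A minimal sketch of the metric follows, averaging the random-drop accuracy over several replications as described above. The accuracies in the example are invented purely to show the arithmetic and are not results from the experiments. Returning `None` when the random drop causes no accuracy loss is a choice made for this sketch, since the ratio is then undefined.

```python
def psi(acc_full, acc_top, acc_random_runs):
    """psi = (A_0 - A_t) / (A_0 - A_r), with A_r averaged over replications.

    Returns None when the averaged random drop does not reduce accuracy,
    because the ratio is undefined (or meaningless) in that case.
    """
    acc_random = sum(acc_random_runs) / len(acc_random_runs)
    denominator = acc_full - acc_random
    if denominator <= 0:
        return None
    return (acc_full - acc_top) / denominator


# Hypothetical accuracies, for illustration only.
example = psi(acc_full=0.80, acc_top=0.60, acc_random_runs=[0.76, 0.74, 0.75])
undefined = psi(acc_full=0.80, acc_top=0.70, acc_random_runs=[0.80, 0.80, 0.80])
```

In the first call, dropping the top units costs 0.20 in accuracy against 0.05 for random dropping, so ψ is about 4, well above the threshold of 1 that indicates effective sparsity.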

5.2. COMPARING SPARSITY IN TRANSFORMERS AND CNNS

Results of the ablation experiments are reported in detail in Tables 1 and 2 and Figure 4.

2. In the same model, A r and A t always decrease as the percentage of dropped units increases or the number of layers with dropped units increases, which is consistent with the intuitive expectation that all the units have a positive impact on model performance.

3. The attention mechanism is more robust than convolution to dropping basic computing units (see footnote 3). As shown in Figure 4 (a), the loss of accuracy of CNNs (VGG, ResNet) when a certain percentage of units is dropped is much larger than that of Transformers (ViT, DeiT, Swin). A great loss of accuracy is not surprising in models without residual connections such as VGG, but in the comparison between state-of-the-art Transformers and CNNs, the discovery is meaningful. Furthermore, Figure 4 (b) shows that in ViT-base, the loss of accuracy when randomly dropping p attention units is less than p times the loss of accuracy when dropping the whole attention layer (the dotted lines are "upper convex"), and the loss of accuracy when dropping the top-p attention units is also smaller than their share of the attention values would suggest. This means that the effects of those attention units on prediction overlap greatly.

4. According to D in Tables 1 and 2, for CNN models more layers lead to higher sparsity, while for Transformer models more heads do not always result in higher sparsity (see footnote 4). The sparsity of Swin is significantly lower than that of ViT and DeiT, which is likely caused by its larger number of heads (Appendix A). This result reveals a side effect of using too many heads in a Transformer model, i.e., a loss of sparsity and less similarity to higher animals' neural systems.

6. CONCLUSION

In this work, sparsity of attentions in Transformers is shown to exist, and its distribution is quantitatively analysed. Furthermore, inspired by recent achievements in neuroscience, a metric for the effect of sparsity in vision models is defined based on ablation experiments, which are conducted on Vision Transformer models and CNN models of different structures and configurations. We finally conclude that, in general, increasing the number of layers in CNNs (and likely also in Transformers) tends to improve neural sparsity in deep layers, while excessively increasing the number of heads in Transformers does not, likely because the effects of attention units then overlap. This discovery will be helpful for understanding the attention mechanism and for designing more efficient and neurally advanced models for vision tasks.



Footnote 1: For animals, "lower" and "higher" describe relative levels of evolution of biological complexity. For instance, primates are higher than non-primate mammals, mammals are higher than other vertebrates, and vertebrates are higher than invertebrates.

Footnote 2: Here we choose to drop units in the last n layers, because: (1) sparsity exists only in the deeper layers of Transformers, and the sparsity of the shallower layers of CNNs is mainly due to locality, which is not our concern; (2) once the sparse units of one layer are dropped, the sparsity of the following layers will change instead of disappearing, which is not in line with our needs.

Footnote 3: Here we concentrate only on the deeper layers, whose units have global receptive fields, since discussion of units with local receptive fields is meaningless.

Footnote 4: For VGG and ResNet models, configurations mainly differ in the number of layers, while for ViT and DeiT, the {base, small, tiny} models mainly differ in the number of heads. For Swin, the small model contains fewer heads than the base one, and the tiny model contains fewer layers than the small one.



Figure 1: (a) The input image (transformed to 224 × 224) selected from ImageNet (Russakovsky et al., 2015); (b) the attention map of one head of the last layer in the ViT-base, generated by inputting image (a); (c) the corresponding patches (by index) of the top-3 "bright" lines in (b).

Figure 2: (a) The mean of the correlation matrix of AttnMaps of heads in the last layer of ViT-base, in which each value represents the mean of the correlation coefficient between a pair of heads; (b) the mean of the correlation matrix of AttnMaps of layers in ViT-base, in which each value represents the mean of the correlation coefficient between a pair of layers. Both results are calculated from attentions generated during inference on images of all categories of ImageNet.

Figure 3: (a) The numerical distribution of ν 11,j (i.e. ν of the last layer) of ViT-base, in which the green line refers to the negative denary logarithm of the mean of µ; (b) the spatial distribution of top-5% large µ 11,j (i.e. µ of the last layer) of ViT-base, shown on the 14 × 14 patches corresponding to the input images. Both results are calculated from attentions generated during inference on images of all categories of ImageNet.

Figure 4: (a) The accuracy curves of several models for ablation on their last 2 layers; (b) the accuracy curves of ViT-base for ablation on its last n layers.

Table 1: Results of ablation experiments on Transformer models. For experimental details, see Appendix A.

Table 2: Results of ablation experiments on CNN models. For experimental details, see Appendix A.

A EXPERIMENTAL DETAILS

A.1 BASIC INFORMATION

We adopt ImageNet (Russakovsky et al., 2015), the most widely acknowledged dataset for image recognition, and the timm code library (Wightman, 2019), a library of various image models (with pretrained weights) implemented in PyTorch, for our experiments. Both are available for non-commercial research purposes.

The specific process of the ablation experiments is as follows. For each image:
1. input it into the Transformer or CNN model and sort µ l,j;
2. for a percentage p, drop the top-p sparse units of the last n layers and run inference.
Then calculate the classification accuracy for each (n, p) over all input images.

It must be pointed out that in the ablation experiments, the values of ψ may not be precise on account of the randomness in measuring A r, especially when A r is close to A 0. Replications of the random experiments are adopted to alleviate this problem, and multiple experiments with different configurations also help to draw stable conclusions.

For details of implementation, please refer to our code at https://github.com/SmartAnonymous/Codes-for-ICLR-2023.
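The dropping step of this procedure can be sketched for a single attention map as follows. This is an illustrative reading, not the released implementation: it assumes that dropping an attention unit means setting the corresponding column of the attention map to 0 without renormalizing the rows, and that index 0 (the class token) is never dropped. The function names and the toy map are hypothetical.

```python
import random


def candidate_columns(num_features, protected=(0,)):
    """Column indices that may be dropped (the class token is protected)."""
    return [j for j in range(num_features) if j not in protected]


def top_p_columns(mu, p, protected=(0,)):
    """Columns with the largest mu, covering a fraction p of the candidates."""
    candidates = candidate_columns(len(mu), protected)
    k = round(p * len(candidates))
    return sorted(candidates, key=lambda j: mu[j], reverse=True)[:k]


def random_p_columns(mu, p, seed, protected=(0,)):
    """The same number of columns, chosen uniformly at random."""
    candidates = candidate_columns(len(mu), protected)
    k = round(p * len(candidates))
    return random.Random(seed).sample(candidates, k)


def drop_columns(attn_map, columns):
    """Return a copy of the map with the given columns set to 0.

    Assumption: rows are not renormalized after dropping.
    """
    dropped = set(columns)
    return [[0.0 if j in dropped else v for j, v in enumerate(row)]
            for row in attn_map]


# Toy 5 x 5 attention map (hypothetical values; every row sums to 1).
attn = [
    [0.10, 0.60, 0.05, 0.20, 0.05],
    [0.10, 0.50, 0.10, 0.25, 0.05],
    [0.20, 0.55, 0.05, 0.15, 0.05],
    [0.10, 0.65, 0.05, 0.15, 0.05],
    [0.15, 0.60, 0.05, 0.15, 0.05],
]
mu = [sum(row[j] for row in attn) / len(attn) for j in range(len(attn[0]))]

top_cols = top_p_columns(mu, p=0.5)
rand_cols = random_p_columns(mu, p=0.5, seed=0)
ablated_top = drop_columns(attn, top_cols)
ablated_rand = drop_columns(attn, rand_cols)
```

Running inference with `ablated_top` and with `ablated_rand` in place of the original map, over all images, would give the two accuracies that enter the ψ metric of subsection 5.1.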

A.2 MODEL CONFIGURATIONS AND DETAILS

• In the ablation experiments, ViT (Dosovitskiy et al., 2021; Steiner et al., 2021), DeiT (Touvron et al., 2021) and Swin (Liu et al., 2021) are selected among Transformer models, and VGG (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016) are selected among CNN models.
• For ViT and DeiT, the base, small and tiny versions of the models are selected, which all contain 12 layers and contain 12, 6 and 3 heads respectively. All the models take 224 × 224 as the size of input images and 16 × 16 as the size of patches.
• ViT has a class token for prediction, and DeiT has a class token and a distillation token; none of these tokens is considered in the discussion of sparsity or in the ablation experiments, because they do not have the same status as the other features.
• For Swin, the base, small and tiny versions of the models are selected, and the numbers of layers and heads are shown in the table below. All the models take 224 × 224 as the size of input images, 4 × 4 as the size of patches and 7 × 7 as the size of windows.

Models        Layers           Heads
Swin-base     (2, 2, 18, 2)    (4, 8, 16, 32)
Swin-small    (2, 2, 18, 2)    (3, 6, 12, 24)
Swin-tiny     (2, 2, 6, 2)     (3, 6, 12, 24)

• For VGG, the 11, 13, 16 and 19 layer versions of the models are selected, and all the models take 224 × 224 as the size of input images.
• For ResNet, the 34, 50, 101 and 152 layer versions of the models are selected, and all the models take 224 × 224 as the size of input images. In ResNet, we only consider the sparsity in layers with 3 × 3 convolution kernels.

