COMPARATIVE ANALYSIS BETWEEN VISION TRANSFORMERS AND CNNS FROM THE VIEW OF NEUROSCIENCE

Abstract

Neuroscience has provided many inspirations for the development of artificial intelligence, especially for neural networks applied to computer vision tasks. Recent research on animals' visual systems establishes a connection between neural sparsity and animals' levels of evolution, based on which we compare the two most influential vision architectures, Transformer and CNN. In particular, the sparsity of attention maps in Transformers is studied comprehensively, and previous knowledge on the sparsity of neurons in CNNs is reviewed. In addition, a novel metric for neural sparsity is defined, and ablation experiments are conducted on various types of Transformer and CNN models. Finally, we draw the conclusion that more layers in a model result in higher sparsity; however, too many heads in Transformers may reduce sparsity, which we attribute to the significant overlap among the effects of attention units.

1. INTRODUCTION

Visual perception is not only the most significant form of human perception, but also the most typical characteristic of higher animals' intelligence¹. As a consequence, computer vision has become one of the most high-profile research fields in the history of artificial intelligence: over the past several decades, various machine vision tasks were uniformly defined for practical applications, and numerous algorithms and models emerged to improve computers' performance on them. Among all vision architectures, the CNN (convolutional neural network) is the most influential one; it led machine learning into the deep era and dominated almost all fundamental vision tasks in the 2010s, including image classification (Krizhevsky et al., 2012; He et al., 2016; Tan & Le, 2019), object detection (Redmon et al., 2016; Ren et al., 2015; He et al., 2017) and semantic segmentation (Ronneberger et al., 2015; Chen et al., 2018). The CNN architecture was initially inspired by the study of animals' visual systems. Through biological experiments on mammals (among the most evolved species in the animal kingdom) (Hubel & Wiesel, 1959), some essential properties of visual systems were observed, such as hierarchical structure, receptive fields and translation invariance. These discoveries laid the foundation for the design of the CNN architecture, which, in turn, first demonstrated striking performance in vision tasks (LeCun et al., 1989). In recent years, works comparing CNNs with higher animals such as humans have been launched, providing helpful insights for research on the interpretation of deep learning and on brain-inspired intelligence (Geirhos et al., 2020). Starting from 2020, the Transformer architecture began to replace the CNN as the new focus of research in the computer vision field.
Though the Transformer had already swept the natural language processing field (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020), models applying the attention mechanism could not surpass the performance of ResNet-based CNNs (He et al., 2016; Xie et al., 2017; Tan & Le, 2019; Radosavovic et al., 2020) on vision problems. ViT (Vision Transformer), put forward in Dosovitskiy et al. (2021) and the milestone marking the new era of computer vision, relies entirely on the attention mechanism and involves no convolution, and it became state of the art in image recognition at scale (represented by ImageNet (Russakovsky et al., 2015)). Since then, hundreds of Transformer-based computer vision works have been published, contributing innovations in architecture (Liu et al., 2021; Wang et al., 2021; Chu et al., 2021), novel training techniques (Touvron et al., 2021; 2022), extensions to other tasks (Carion et al., 2020; Chen et al., 2021b; Jiang et al., 2021; Chen et al., 2021a), etc. The attention mechanism is also recognized as an essential property of animals' perception; therefore, some researchers have attempted to observe and study Transformers with prior knowledge from bioscience. Meanwhile, since Transformers currently outperform CNNs on many tasks, people tend to seek evidence that the Transformer is a more advanced architecture than the CNN. For instance, Tuli et al. (2021) proposes that Transformers are more similar to the human visual system in terms of behavioral analyses. However, this claim is not well supported, since many other properties of human and animal visual systems have not yet been measured and analysed in vision models.
Inspired by recent research on sparsity in animals' visual systems, we discuss the sparsity of attention maps in Vision Transformers in depth, and compare it with the sparsity of neurons in CNNs through systematic experiments on a set of vision models, including classic CNNs and Transformers of different configurations. From the experimental results, we conclude that adding layers to models enhances sparsity, but adding heads to Transformers may play the opposite role when the number of heads becomes too large. Specifically, our contributions mainly include:
• In section 2, related works are reviewed.
• In section 3, the sparsity of attention maps in Vision Transformers is discovered and strictly defined, and its distribution is analysed from different perspectives.
• In section 4, previous works on the sparsity of neurons in CNNs and in animals' visual systems are reviewed.
• In section 5, ablation experiments and a metric for neural sparsity are designed, and experimental results are reported and analysed.
Please refer to Appendix A for experimental details; code for our experiments is publicly available at https://github.com/SmartAnonymous/Codes-for-ICLR-2023.
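To make the notion of attention sparsity concrete before the formal treatment, here is a minimal illustrative proxy (a hypothetical sketch, not the metric defined in section 5): the fraction of softmax attention weights that fall below a small threshold. Sharply peaked attention maps score near 1; uniform maps score 0.

```python
import numpy as np

def attention_sparsity(attn, eps=1e-3):
    """Fraction of attention weights below eps.

    attn: array of shape (heads, tokens, tokens), where each row is a
    softmax distribution over tokens. A higher return value means the
    attention maps are more sharply concentrated, i.e. sparser.
    """
    return float((attn < eps).mean())

# Toy example: one head, 4 tokens, sharp (near one-hot) attention rows.
rng = np.random.default_rng(0)
logits = 10.0 * rng.normal(size=(1, 4, 4))                  # sharp logits
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print(attention_sparsity(attn))   # high for peaked maps; 0 for uniform maps
```

This thresholded fraction is only one of several possible proxies (entropy or Gini-style measures are alternatives); it is shown here purely to illustrate what "sparse attention" means operationally.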



¹ For animals, lower and higher describe relative levels of evolutionary complexity. For instance, primates are higher than non-primate mammals, mammals are higher than other vertebrates, and vertebrates are higher than invertebrates.



2. RELATED WORK

2.1 COMPARISONS BETWEEN TRANSFORMERS AND CNNS

Intuitively, Transformers have less inductive bias for vision than CNNs, which is generally acknowledged. Besides, Raghu et al. (2021) points out that Transformers have more uniform internal representations than CNNs and depend more on dataset scale. In addition, Park & Kim (2022) observes that MSAs (multi-head self-attentions) are low-pass filters while convolutions are high-pass filters, so the two are complementary to some degree. Similarly, Zhao et al. (2021) argues that a hybrid design containing both convolution and Transformer modules is better than either one alone. Moreover, Li et al. (2021) gives a theoretical proof that an MSA layer with enough heads can perform any convolution operation. In terms of behavior, Bai et al. (2021) claims that Transformers are not more robust than CNNs, and that the opposite results obtained by previous works may be caused by unfair experimental settings.

2.2 ANALYSIS OF NEURAL NETWORKS FROM THE VIEW OF NEUROSCIENCE

Understanding of brains and understanding of artificial networks have always promoted each other, with observations in neuroscience providing many inspirations for the design of both algorithms and hardware (Roy et al., 2019). Beyond the historical contribution of studies on mammals' visual systems to the development of visual computing, Marblestone et al. (2016) puts forward several hypotheses about the mechanisms of human brains, which may guide researchers toward novel directions of network modeling. Additionally, Yang et al. (2019) finds that network models can spontaneously become functionally specialized for different cognitive processes of brains during training.
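The low-pass/high-pass contrast noted by Park & Kim (2022) can be illustrated with a simple 1-D analogy (a hypothetical sketch, not their analysis): uniform averaging, which is what a uniform attention map computes, suppresses high frequencies, while a discrete Laplacian kernel, a classic convolutional filter, suppresses low frequencies.

```python
import numpy as np

# Signal with one low- and one high-frequency component.
n = 256
t = np.arange(n)
signal = np.sin(2 * np.pi * 2 * t / n) + np.sin(2 * np.pi * 60 * t / n)

def band_energy(x, k):
    """Magnitude of frequency bin k of the real FFT."""
    return np.abs(np.fft.rfft(x))[k]

# Uniform averaging (what a uniform attention map computes): low-pass.
avg = np.convolve(signal, np.ones(9) / 9, mode="same")
# Discrete Laplacian kernel (a classic convolutional filter): high-pass.
lap = np.convolve(signal, np.array([-1.0, 2.0, -1.0]), mode="same")

# Averaging keeps the low band (bin 2) and suppresses the high band
# (bin 60); the Laplacian does the opposite.
print(band_energy(avg, 2) / band_energy(signal, 2))    # near 1
print(band_energy(avg, 60) / band_energy(signal, 60))  # near 0
print(band_energy(lap, 2) / band_energy(signal, 2))    # near 0
print(band_energy(lap, 60) / band_energy(signal, 60))  # amplified
```

Real MSAs are content-dependent rather than uniform, so this only captures the averaging character of attention in the simplest case; the full spectral argument is given in the cited work.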

