COMPARATIVE ANALYSIS BETWEEN VISION TRANSFORMERS AND CNNS FROM THE VIEW OF NEUROSCIENCE

Abstract

Neuroscience has provided many inspirations for the development of artificial intelligence, especially for neural networks applied to computer vision tasks. Recent research on animals' visual systems establishes a connection between neural sparsity and animals' levels of evolution; building on this, we compare the two most influential vision architectures, the Transformer and the CNN. In particular, the sparsity of attention in Transformers is comprehensively studied, and previous knowledge on the sparsity of neurons in CNNs is reviewed. In addition, a novel metric for neural sparsity is defined, and ablation experiments are conducted on various types of Transformer and CNN models. Finally, we conclude that more layers in a model result in higher sparsity, whereas too many heads in a Transformer may reduce sparsity, which we attribute to the significant overlap among the effects of attention units.
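To make the notion of attention sparsity concrete, the sketch below quantifies it as the fraction of near-zero weights in an attention map. This is a generic illustration under assumed conventions (the thresholded definition, the `eps` value, and the toy logits are all assumptions of this sketch), not the metric defined in the paper.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax, numerically stabilized by subtracting the row max."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_sparsity(attn: np.ndarray, eps: float = 1e-3) -> float:
    """Fraction of attention weights below eps (an illustrative, not the
    paper's, definition). attn is a (tokens, tokens) row-stochastic map."""
    return float(np.mean(attn < eps))

# Toy comparison: sharper logits concentrate each row's attention on a
# few tokens, pushing the remaining weights toward zero (higher sparsity).
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 4))
sharp = softmax(logits[0] * 10.0)   # peaked attention -> sparser
diffuse = softmax(logits[1])        # flat attention -> denser
print(attention_sparsity(sharp), attention_sparsity(diffuse))
```

Under this toy definition, the peaked map scores at least as high as the diffuse one, mirroring the intuition that concentrated attention is sparse.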

1. INTRODUCTION

Visual perception is not only the most significant mode of human perception, but also the most typical characteristic of higher animals' intelligence[1]. As a consequence, computer vision has become one of the most high-profile research fields in the history of artificial intelligence, in which various machine vision tasks were uniformly defined for practical applications over the past several decades, and numerous algorithms and models emerged to improve computers' performance on them. Among all vision architectures, the CNN (convolutional neural network) is the most influential one: it led machine learning into the deep era and dominated almost all the fundamental vision tasks in the 2010s, including image classification (Krizhevsky et al., 2012; He et al., 2016; Tan & Le, 2019), object detection (Redmon et al., 2016; Ren et al., 2015; He et al., 2017) and semantic segmentation (Ronneberger et al., 2015; Chen et al., 2018). The CNN architecture was initially inspired by studies of animals' visual systems. Through biological experiments on mammals (among the most evolved species in the animal kingdom) (Hubel & Wiesel, 1959), some essential properties of visual systems were observed, such as hierarchical structure, receptive fields and translation invariance. These discoveries laid the foundation for the design of the CNN architecture, which, in turn, demonstrated its striking performance first in vision tasks (LeCun et al., 1989). In recent years, several works comparing CNNs with higher animals such as humans have been carried out, providing helpful insights for research on the interpretability of deep learning and on brain-inspired intelligence (Geirhos et al., 2020). Starting from 2020, the Transformer architecture began to replace the CNN as the new focus of research in the computer vision field.
Though the Transformer had swept the natural language processing field before then (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020), models applying the attention mechanism could not surpass the performance of ResNet-based CNNs (He et al., 2016; Xie et al., 2017; Tan & Le, 2019; Radosavovic et al., 2020) on vision problems. ViT (Vision Transformer), put forward in Dosovitskiy et al. (2021) and the milestone marking the new era of computer vision, depends completely on the attention mechanism and involves no convolution; it became state



[1] For animals, "lower" and "higher" describe relative levels of evolved biological complexity. For instance, primates are higher than non-primate mammals, mammals are higher than other vertebrates, and vertebrates are higher than invertebrates.

