MORE CONVNETS IN THE 2020S: SCALING UP KERNELS BEYOND 51 × 51 USING SPARSITY

Abstract

Transformers have quickly risen to prominence in computer vision since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) seems to be challenged by increasingly effective transformer-based models. Very recently, a couple of advanced convolutional models have struck back with large kernels motivated by the local-window attention mechanism, showing appealing performance and efficiency. While one of them, RepLKNet, impressively manages to scale the kernel size to 31×31 with improved performance, its performance starts to saturate as the kernel size continues to grow, in contrast to the scaling trend of advanced ViTs such as Swin Transformer. In this paper, we explore the possibility of training extreme convolutions larger than 31×31 and test whether the performance gap can be eliminated by strategically enlarging convolutions. This study yields a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale up kernels to 61×61 with better performance. Built on this recipe, we propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with sparse factorized 51×51 kernels that performs on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures such as ConvNeXt and RepLKNet, on ImageNet classification as well as a wide range of downstream tasks including semantic segmentation on ADE20K, object detection on PASCAL VOC 2007, and object detection/segmentation on MS COCO.

1. INTRODUCTION

Since their invention (Fukushima & Miyake, 1982; LeCun et al., 1989; 1998), convolutional neural networks (CNNs) (Krizhevsky et al., 2012a; Simonyan & Zisserman, 2015; He et al., 2016; Huang et al., 2017; Howard et al., 2017; Xie et al., 2017; Tan & Le, 2019) have quickly evolved into one of the most indispensable architectures in machine learning over the last decades. However, the dominance of CNNs has been significantly challenged by Transformers (Vaswani et al., 2017) over the past few years. Stemming from natural language processing, Vision Transformers (ViTs) (Dosovitskiy et al., 2021; d'Ascoli et al., 2021; Touvron et al., 2021b; Wang et al., 2021b; Liu et al., 2021e; Vaswani et al., 2021) have demonstrated strong results in various computer vision tasks including image classification (Dosovitskiy et al., 2021; Yuan et al., 2021b), object detection (Dai et al., 2021; Liu et al., 2021e), and segmentation (Xie et al., 2021; Wang et al., 2021a; c; Cheng et al., 2021). Meanwhile, work on understanding ViTs has blossomed. Plausible reasons behind the success of ViTs include fewer inductive biases (Dosovitskiy et al., 2021), long-range dependence (Vaswani et al., 2017), advanced architecture (Yu et al., 2021), and more human-like representations (Tuli et al., 2021). Recently, there has been a rising trend of attributing the superior performance of ViTs to their ability to capture a large receptive field. In contrast to CNNs, which perform convolution in a small sliding window (e.g., 3×3 and 5×5) with shared weights, global attention or local attention with larger window sizes in ViTs (Liu et al., 2021e) directly enables each layer to capture a large receptive field. Inspired by this trend, some recent works on CNNs (Liu et al., 2022b; Ding et al., 2022) strike back by designing advanced pure CNN architectures and plugging large kernels into them.
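The contrast drawn above between small sliding windows and a large per-layer receptive field can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original method: the function names `stacked_receptive_field` and `layers_needed` are hypothetical, and the calculation covers only the theoretical receptive field along one spatial dimension of a plain stack of convolutions without dilation.

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Theoretical receptive field (one spatial dimension) of stacked convolutions.

    kernel_sizes: kernel size of each layer, from input to output.
    strides: stride of each layer; defaults to 1 everywhere.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf = 1      # a single output position sees one input position before any layer
    jump = 1    # distance in input pixels between adjacent positions at this depth
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


def layers_needed(target, kernel_size=3):
    """Smallest number of stride-1 layers of `kernel_size` whose receptive field >= target."""
    n = 0
    while stacked_receptive_field([kernel_size] * n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for target in (7, 31, 51):
        print(f"{target}x{target}: {layers_needed(target)} stacked 3x3 layers")
```

Under these assumptions, a single 31×31 kernel covers in one layer what a stride-1 stack of 3×3 kernels reaches only after 15 layers, and a 51×51 kernel corresponds to 25 such layers. This counts reachable input positions only and says nothing about how strongly each position contributes.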
For instance, RepLKNet (Ding et al., 2022) successfully scales the kernel size to 31×31 while achieving results comparable to Swin Transformer (Liu et al., 2021e). However, large kernels are notoriously difficult to train. Even with the assistance of a parallel branch with small kernels, the performance of RepLKNet starts to saturate as the kernel size continues to increase, in contrast to the scaling trend of advanced ViTs such as Swin Transformer. Therefore, it remains unclear whether we can exceed Transformer-based models by further scaling the kernel size beyond 31×31. In this paper, we attempt to answer this research question by leveraging sparsity, which is commonly observed in the human visual system. Sparsity has been seen as one of the most important principles in the primary visual cortex (V1) (Tong, 2003), where the incoming stimuli have been hypothesized to be sparsely coded and selected (Desimone & Duncan, 1995; Olshausen & Field, 1997; Vinje & Gallant, 2000). We extensively study the trainability of large kernels and unveil three main observations: (i) et al., 2016; Howard et al., 2017; Xie et al., 2017; Huang et al., 2017). Until very recently, some existing



Figure 1: Large depth-wise kernel (e.g., 51×51) paradigms of ConvNeXt, RepLKNet, and SLaK. Dark blue squares refer to the dense weights in convolutional kernels. Light blue squares refer to the sparse weights in convolutional kernels.
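To give a rough sense of the weight budgets behind the dense and sparse paradigms in Figure 1, the sketch below counts depth-wise weights for a dense M×M kernel and for one possible factorization into two parallel rectangular kernels of shape M×N and N×M. Everything specific here is an assumption made for illustration and is not stated in this section: the two-rectangle form of the factorization, the short side N = 5, the channel count, the sparsity level, and the helper names.

```python
def dense_depthwise_params(channels, m):
    """Weights in a dense depth-wise M x M kernel (one M x M filter per channel)."""
    return channels * m * m


def factorized_depthwise_params(channels, m, n):
    """Weights in two parallel depth-wise kernels of shape M x N and N x M.

    Assumed factorization for illustration only.
    """
    return channels * 2 * m * n


def remaining_after_sparsity(params, sparsity):
    """Weights left when a fraction `sparsity` of them is zeroed out."""
    return round(params * (1.0 - sparsity))


if __name__ == "__main__":
    channels, m, n = 64, 51, 5          # hypothetical layer width and short side
    dense = dense_depthwise_params(channels, m)
    factorized = factorized_depthwise_params(channels, m, n)
    sparse = remaining_after_sparsity(factorized, 0.5)  # hypothetical sparsity
    print("dense 51x51:      ", dense)
    print("51x5 + 5x51:      ", factorized)
    print("factorized, 50%:  ", sparse)
```

Per channel, a dense 51×51 kernel holds 2601 weights, while the assumed 51×5 plus 5×51 pair holds 510, so the dense count grows quadratically in M and the factorized count only linearly. Sparsifying the factorized kernels reduces the count further by the chosen fraction.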

Originally introduced for natural language processing (Vaswani et al., 2017) and extended to computer vision by Dosovitskiy et al. (2021), self-attention can be viewed as a global depth-wise kernel that enables each layer to have a global receptive field. Swin Transformer (Liu et al., 2021e) is a ViT variant that adopts local attention in a shifted-window manner. Compared with global attention, local attention (Ramachandran et al., 2019; Vaswani et al., 2021; Chu et al., 2021; Liu et al., 2021d; Dong et al., 2022) can greatly improve memory and computation efficiency while retaining appealing performance. Since the size of attention windows is at least 7, local attention can be seen as an alternative class of large kernels. A recent work (Guo et al., 2022b) proposes a novel large kernel attention module that uses stacked depth-wise small convolutions, dilated convolutions, and pointwise convolutions to capture both local and global structure.

1×M convolutions. However, the proposed method leads to performance degradation on ImageNet. The Inception family (Szegedy et al., 2016; 2017) allows for the use of varying convolutional kernel sizes to learn spatial patterns at different scales. With the popularity of VGG (Simonyan & Zisserman, 2014), it has been common over the past decade to use a stack of small kernels (1×1 or 3×3) to obtain a large receptive field (He

availability

https://github.com/VITA-Group/

