MORE CONVNETS IN THE 2020S: SCALING UP KERNELS BEYOND 51 × 51 USING SPARSITY

Abstract

Transformers have risen quickly in the computer vision world since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) seems to be challenged by increasingly effective transformer-based models. Very recently, a couple of advanced convolutional models have struck back with large kernels motivated by the local-window attention mechanism, showing appealing performance and efficiency. While one of them, i.e., RepLKNet, impressively manages to scale the kernel size to 31×31 with improved performance, its performance starts to saturate as the kernel size continues to grow, in contrast to the scaling trend of advanced ViTs such as Swin Transformer. In this paper, we explore the possibility of training extreme convolutions larger than 31×31 and test whether the performance gap can be eliminated by strategically enlarging convolutions. This study yields a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale up kernels to 61×61 with better performance. Built on this recipe, we propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with sparse factorized 51×51 kernels that performs on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures such as ConvNeXt and RepLKNet, on ImageNet classification as well as a wide range of downstream tasks, including semantic segmentation on ADE20K, object detection on PASCAL VOC 2007, and object detection/segmentation on MS COCO.

1. INTRODUCTION

Since their invention (Fukushima & Miyake, 1982; LeCun et al., 1989; 1998), convolutional neural networks (CNNs) (Krizhevsky et al., 2012a; Simonyan & Zisserman, 2015; He et al., 2016; Huang et al., 2017; Howard et al., 2017; Xie et al., 2017; Tan & Le, 2019) have quickly evolved into one of the most indispensable architectures in machine learning over the last decades. However, the dominance of CNNs has been significantly challenged by Transformers (Vaswani et al., 2017) over the past few years. Stemming from natural language processing, Vision Transformers (ViTs) (Dosovitskiy et al., 2021; d'Ascoli et al., 2021; Touvron et al., 2021b; Wang et al., 2021b; Liu et al., 2021e; Vaswani et al., 2021) have demonstrated strong results in various computer vision tasks including image classification (Dosovitskiy et al., 2021; Yuan et al., 2021b), object detection (Dai et al., 2021; Liu et al., 2021e), and segmentation (Xie et al., 2021; Wang et al., 2021a;c; Cheng et al., 2021). Meanwhile, work on understanding ViTs has blossomed. Plausible reasons behind the success of ViTs include fewer inductive biases (Dosovitskiy et al., 2021), long-range dependence (Vaswani et al., 2017), advanced architecture (Yu et al., 2021), and more human-like representations (Tuli et al., 2021).

Recently, there has been a rising trend that attributes the superior performance of ViTs to their ability to capture a large receptive field. In contrast to CNNs, which perform convolution in a small sliding window (e.g., 3×3 and 5×5) with shared weights, global attention or local attention with larger window sizes in ViTs (Liu et al., 2021e) directly enables each layer to capture a large receptive field. Inspired by this trend, some recent works on CNNs (Liu et al., 2022b; Ding et al., 2022) strike back by designing advanced pure CNN architectures and plugging large kernels into them. For instance, RepLKNet (Ding et al., 2022) successfully scales the kernel size to 31×31, while achieving

availability

https://github.com/VITA-Group/

