DESIGNING BERT FOR CONVOLUTIONAL NETWORKS: SPARSE AND HIERARCHICAL MASKED MODELING

Abstract

We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets): (i) the convolution operation cannot handle irregular, randomly masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with a convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode them. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method, called Sparse masKed modeling (SparK), is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). The improvements on object detection and instance segmentation are more significant (up to +3.5%), validating the strong transferability of the learned features. We also observe favorable scaling behavior: larger networks gain more from SparK. All of these findings support the promising future of generative pre-training on convnets. Code and pre-trained models have been released at https://github.
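The sparse-voxel view described in (i) can be made concrete with a minimal NumPy sketch: the visible (unmasked) pixels of an image are gathered into a coordinate list plus a feature list, the same input format that sparse-convolution libraries for 3D point clouds consume, so computation is skipped entirely at masked sites. The function name and shapes below are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def to_sparse(image, visible):
    """Convert a masked image to (coordinates, features) lists.

    image:   (H, W, C) array of pixel features.
    visible: (H, W) boolean mask, True = unmasked (kept) pixel.

    Sparse-convolution libraries encode 3D point clouds from exactly
    this kind of (coords, feats) pair; treating visible pixels as
    "sparse voxels" lets a masked 2D image be encoded the same way,
    with no computation at masked positions.
    """
    coords = np.argwhere(visible)   # (N, 2) row/col indices of visible pixels
    feats = image[visible]          # (N, C) their channel values
    return coords, feats

# Example: a 4x4 single-channel image with roughly half the pixels visible.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 4, 1))
keep = rng.random((4, 4)) < 0.5
coords, feats = to_sparse(img, keep)
assert len(coords) == keep.sum()    # one (coord, feat) entry per visible pixel
```

Scattering `feats` back to a zero-initialized dense grid at `coords` recovers the masked image exactly, which is the property the hierarchical decoder relies on when densifying multi-scale features.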

1. INTRODUCTION

The pretrain-finetune paradigm in natural language processing (NLP), as exemplified by BERT and GPT (Devlin et al., 2018; Clark et al., 2020; Radford et al., 2019; Brown et al., 2020), is remarkably effective and thus long envied by our vision community. It is the emerging masked image modeling (Bao et al., 2021; He et al., 2021; Xie et al., 2021; Chen et al., 2022) that initially extended the success of BERT from language transformers to vision transformers (ViTs). A bold move that increases the mask ratio to a staggering level (60~75%) is largely credited with this success (He et al., 2021; Xie et al., 2021). As a result, the field of visual self-supervised learning on ViTs (Dosovitskiy et al., 2020; Liu et al., 2021) has now shifted from contrastive learning (Grill et al., 2020; Chen et al., 2021; Caron et al., 2021) to BERT-style masked modeling, or a fusion of the two (Zhou et al., 2021).

Despite this progress, extending the success of BERT pre-training from transformers to convolutional networks (convnets) remains a desirable but unrealized goal. Early pioneering work (Pathak et al., 2016) predated BERT but performed much worse than supervised pre-training. Although there have been efforts over the past year to port BERT to convnets, they ultimately compromise by proposing a non-convolutional model (Gao et al., 2022) or non-masked modeling (Fang et al., 2022). One might therefore wonder: what exactly is impeding the application of BERT to convnets?

We conclude that, in essence, the difficulty is rooted in the fundamental differences in data processing between language and vision (Bateman, 2014; Cheng et al., 2022). While typical NLP models like recurrent networks or transformers process text as a variable-length sequence of words (well-defined semantic units), convnets have to recognize objects of different sizes from raw pixels ("units" at different scales). This large disparity raises two challenges: (i) Removing the information
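The "bold move" credited above, masking at a 60~75% ratio, operates on whole patches rather than individual pixels: a fixed fraction of the patch grid is hidden, forcing the model to reconstruct large missing regions from sparse visible context. The following is a minimal sketch of such a random patch mask; the function name and the 14x14 grid (a 224-pixel image with 16-pixel patches) are illustrative assumptions.

```python
import numpy as np

def random_patch_mask(h_patches, w_patches, mask_ratio=0.6, seed=0):
    """Return a boolean patch grid where True = masked patch.

    BERT-style image pre-training hides a large fraction (commonly
    60~75%) of patches uniformly at random; the exact count is fixed
    so every sample has the same number of visible patches.
    """
    rng = np.random.default_rng(seed)
    n = h_patches * w_patches
    n_mask = int(round(n * mask_ratio))
    flat = np.zeros(n, dtype=bool)
    # Choose n_mask distinct patch indices to hide.
    flat[rng.choice(n, size=n_mask, replace=False)] = True
    return flat.reshape(h_patches, w_patches)

mask = random_patch_mask(14, 14, mask_ratio=0.75)
# 147 of the 196 patches are masked; only 49 remain visible.
assert mask.sum() == 147
```

For a transformer, the visible patches simply form a shorter token sequence; for a convnet the same irregular mask breaks the dense sliding-window computation, which is exactly the obstacle that motivates the sparse-convolution encoding in this paper.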

