DESIGNING BERT FOR CONVOLUTIONAL NETWORKS: SPARSE AND HIERARCHICAL MASKED MODELING

Abstract

We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets): (i) the convolution operation cannot handle irregular, randomly masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with the hierarchical structure of convnets. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode them. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method, called Sparse masKed modeling (SparK), is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). The improvements on object detection and instance segmentation are more significant (up to +3.5%), validating the strong transferability of the learned features. We also observe favorable scaling behavior, with larger networks gaining more from SparK. All of these findings support a promising future for generative pre-training on convnets. Code and pre-trained models have been released at https://github.

1. INTRODUCTION

The pretrain-finetune paradigm in natural language processing (NLP), exemplified by BERT and GPT (Devlin et al., 2018; Clark et al., 2020; Radford et al., 2019; Brown et al., 2020), is remarkably effective and has thus long been envied by the vision community. The emerging masked image modeling (Bao et al., 2021; He et al., 2021; Xie et al., 2021; Chen et al., 2022) initially extended the success of BERT from language transformers to vision transformers (ViTs). A bold move that raises the mask ratio to a staggering level (60~75%) is largely credited with this success (He et al., 2021; Xie et al., 2021). As a result, the field of visual self-supervised learning on ViTs (Dosovitskiy et al., 2020; Liu et al., 2021) has now shifted from contrastive learning (Grill et al., 2020; Chen et al., 2021; Caron et al., 2021) to BERT-style masked modeling or a fusion of the two (Zhou et al., 2021). Despite this progress, extending the success of BERT pre-training from transformers to convolutional networks (convnets) remains a desirable but unrealized goal. Early pioneering work (Pathak et al., 2016) predated BERT but performed much worse than supervised pre-training. Although there have been efforts over the past year to port BERT to convnets, they ultimately compromise by proposing a non-convolutional model (Gao et al., 2022) or non-masked modeling (Fang et al., 2022). One might therefore wonder: what exactly is impeding the application of BERT to convnets? We conclude that, in essence, the difficulty is rooted in the fundamental differences in data processing between language and vision (Bateman, 2014; Cheng et al., 2022). While typical NLP models like recurrent networks or transformers process text as a variable-length sequence of words (well-defined semantic units), convnets have to recognize objects of different sizes from raw pixels ("units" at different scales).
This large disparity raises two challenges. (i) Removing the information of masked "words" is difficult for convnets. In ViTs, an input image is divided into non-overlapping patches, and simply dropping masked patches or replacing them with mask tokens removes their information. This ease relies on the transformer's ability to handle irregular (variable-length) input of non-overlapping patches, and thus cannot be replicated in convnets, which operate on regular grids and slide overlapping windows. One may zero out all masked pixels and feed this "mosaic" into a convnet, but this leads to a significant distribution shift (figure 1) and other issues (discussed further in section 3.1 and figure 3), and is therefore not an ideal solution. (ii) Single-scale algorithms are inadequate for learning multi-scale (hierarchical) features. Multi-scale structures have long been a gold standard in computer vision, allowing visual processing systems like SIFT descriptors (Lowe, 1999; Bay et al., 2006) and pyramid networks (He et al., 2015; Lin et al., 2017) to handle variations in object scale. In contrast, the masked modeling approach from NLP operates in a single-scale manner; applying it directly to convnets would forgo the advantage of model hierarchy.

In this work, we clear these hurdles and make BERT suitable for convnets by proposing Sparse masKed modeling with hierarchy (SparK). We first randomly mask an image in a patch-wise manner. Observing that the unmasked patches share the sparse nature of point clouds, we treat them as a flattened point cloud and use sparse convolution for encoding. This enables convnets to handle irregular masked images. For decoding and reconstruction, the sparse features are filled with mask embeddings and fed into a multi-scale decoder, leveraging the hierarchical structure of convnets. SparK is a general method that does not constrain the specific encoder to be pre-trained.
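To make challenge (i) concrete, the toy sketch below (our own illustration, not the released SparK code; the numpy-only setup, function name, and sizes are our assumptions) masks a toy image patch-wise at a 60% ratio and compares pixel statistics. Zero-filling masked patches drags the intensity distribution toward zero, the distribution shift shown in figure 1(b), while statistics computed over only the unmasked (sparse) pixels stay consistent with the original image.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchwise_mask(h_patches, w_patches, mask_ratio=0.6, rng=rng):
    """Randomly choose which patches to mask; True = masked."""
    n = h_patches * w_patches
    n_masked = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    return mask.reshape(h_patches, w_patches)

# A toy 64x64 "image" of bright pixels, split into 8x8 patches of size 8.
img = rng.uniform(100, 200, size=(64, 64))
mask = patchwise_mask(8, 8, mask_ratio=0.6)
pixel_mask = mask.repeat(8, axis=0).repeat(8, axis=1)  # patch -> pixel mask

# Strategy (b): zero out masked pixels. A large spike of zeros
# appears in the histogram and the mean collapses.
zeroed = np.where(pixel_mask, 0.0, img)

# Strategy (c): keep only unmasked pixels, treated like the sparse
# voxels of a point cloud; their statistics match the original image.
sparse_pixels = img[~pixel_mask]

print(f"original mean: {img.mean():.1f}")
print(f"zero-out mean: {zeroed.mean():.1f}")        # shifted far down
print(f"sparse mean:   {sparse_pixels.mean():.1f}")  # preserved
```

With a 60% mask ratio the zero-filled mean is roughly 40% of the original, while the sparse view's mean stays near it, which is exactly why feeding the "mosaic" image to a convnet is problematic.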
We test it with two representative convnet families: classical ResNets (He et al., 2016) and modern ConvNeXts (Liu et al., 2022). All models benefit from SparK, with larger models gaining more, demonstrating its favorable scaling ability. On standard downstream tasks (classification, object detection and instance segmentation), convnet-based SparK outperforms both (i) state-of-the-art contrastive learning and (ii) transformer-based masked modeling by similarly large margins (around +1.0%). The improvements over COCO baselines are more significant than those on ImageNet (up to +3.5%), indicating that the representations learned by SparK are highly transferable. To summarize, SparK provides:
• The first BERT-style pre-training method that can be directly applied to any convnet without backbone modifications, overcoming convnets' inability to handle irregular masked inputs.
• Insights into the design of generative pre-training for convnets, e.g., the first use of sparse convolution for masked image modeling and a hierarchical design for BERT-style pre-training.
• A leap in convnet performance across downstream tasks, with gains of up to 3.5 points, showing the promise of extending the success of the transformer's pretrain-finetune paradigm to convnets.

The recent surge of interest in vision transformers (Liu et al., 2021; He et al., 2021) has shifted the focus away from convnets in the computer vision community. However, convnets embody the core principles of many classical vision processing systems, such as scale- and translation-equivariance, locality, weight-sharing, and hardware-friendliness (Lowe, 1999; Csurka et al., 2004). These networks continue to be indispensable in addressing a variety of challenging and structural real-world tasks beyond classification (Jaderberg et al., 2015; Liu et al., 2017; 2022).
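The encoder's key ingredient, submanifold-style sparse convolution, computes outputs only at unmasked ("active") positions, so the active set never dilates and masked regions contribute nothing. For a bias-free kernel this is mathematically equivalent to running a dense convolution over the zero-filled image and then re-masking the output: inactive neighbors are zeros, so they add nothing to the sum. The minimal numpy sketch below illustrates this equivalence trick; the function names are ours and this is not the SparK release's implementation.

```python
import numpy as np

def dense_conv2d(x, k):
    """Plain 'same' 2D cross-correlation with zero padding, for illustration."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

def submanifold_sparse_conv2d(x, k, active):
    """Emulated submanifold sparse conv (bias-free): convolve the
    zero-filled sparse input densely, then keep outputs only at
    active positions so the active set never grows."""
    y = dense_conv2d(np.where(active, x, 0.0), k)
    return np.where(active, y, 0.0)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
active = rng.random((8, 8)) > 0.6   # ~40% of positions are unmasked
k = np.ones((3, 3)) / 9.0           # a simple averaging kernel

y = submanifold_sparse_conv2d(x, k, active)
# Masked positions stay exactly empty after convolution:
print(np.all(y[~active] == 0.0))  # prints True
```

Because masked positions remain empty after every layer, no "mosaic" zeros leak into the encoded features, and after encoding the empty positions can be filled with mask embeddings before being passed to the multi-scale decoder.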
We hope SparK's inspiring performance will prompt the community to revisit convnets as generic backbones for computer vision and motivate future work on exploiting their potential through generative pre-training.



Figure 1: Different masking strategies, with pixel-intensity histograms plotted before (in gray) and after (in blue) masking. (b) is a straightforward way to apply masked modeling to convnets, but it results in a distribution shift. (a) illustrates MAE (He et al., 2021), which has no such side effect thanks to the transformer's ability to process variable-length input. We propose (c) to adapt convnets to irregular masked input without a distribution shift.

