CONTEXTUAL CONVOLUTIONAL NETWORKS

Abstract

This paper presents a new convolutional neural network, named Contextual Convolutional Network, that capably serves as a general-purpose backbone for visual recognition. Most existing convolutional backbones follow the representation-to-classification paradigm, where representations of the input are first generated by category-agnostic convolutional operations and then fed into classifiers for specific perceptual tasks (e.g., classification and segmentation). In this paper, we deviate from this classic paradigm and propose to inject potential category memberships as contextual priors into the convolution for contextualized representation learning. Specifically, the top-k likely classes from the preceding stage are encoded as a contextual prior vector. Based on this vector and the preceding features, offsets for spatial sampling locations and kernel weights are generated to modulate the convolution operations. The new convolutions can readily replace their plain counterparts in existing CNNs and can be trained end-to-end by standard back-propagation without additional supervision. These qualities make Contextual Convolutional Networks compatible with a broad range of vision tasks, and our model boosts the state-of-the-art architecture ConvNeXt-Tiny by 1.8% top-1 accuracy on ImageNet classification. The superiority of the proposed model reveals the potential of contextualized representation learning for vision tasks.
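To make the mechanism concrete, the following is a minimal NumPy sketch of the kernel-modulation half of the idea: the top-k class scores from the preceding stage are encoded as a contextual prior vector, which then rescales the convolution kernel before a plain 3x3 convolution is applied. All names (`contextual_conv`, `w_prior`) are hypothetical, the prior-to-modulation map is a single fixed linear projection rather than a learned end-to-end module, and the spatial-offset branch of the paper is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def contextual_conv(x, kernel, logits, k=3, w_prior=None):
    """Sketch of a context-modulated convolution (names hypothetical).

    x:      input feature map, shape (C, H, W)
    kernel: base kernel, shape (C_out, C, 3, 3)
    logits: class scores from the preceding stage, shape (num_classes,)
    """
    # 1. Encode the top-k likely classes as a contextual prior vector:
    #    keep the k largest scores, zero the rest, normalize.
    topk = np.argsort(logits)[-k:]
    prior = np.zeros_like(logits, dtype=float)
    prior[topk] = logits[topk]
    prior = prior / (np.abs(prior).sum() + 1e-8)

    # 2. Generate a per-output-channel kernel modulation from the prior
    #    (a fixed random linear map here; the paper learns this mapping).
    if w_prior is None:
        w_prior = 0.1 * rng.standard_normal((kernel.shape[0], logits.shape[0]))
    scale = 1.0 + w_prior @ prior                      # shape (C_out,)

    # 3. Modulate the kernel and run a plain 3x3 conv (stride 1, pad 1).
    kmod = kernel * scale[:, None, None, None]
    c_out, _, kh, kw = kmod.shape
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + kh, j:j + kw]          # (C, 3, 3)
            out[:, i, j] = (kmod * patch).sum(axis=(1, 2, 3))
    return out
```

In the paper's setting, `logits` would come from the classifier attached to the preceding stage, so earlier predictions condition later feature extraction; with `w_prior` set to zeros the operation reduces exactly to the plain convolution it replaces.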

1. INTRODUCTION

Beginning with AlexNet (Krizhevsky et al., 2012) and its revolutionary performance on the ImageNet image classification challenge, convolutional neural networks (CNNs) have achieved significant success on visual recognition tasks such as image classification (Deng et al., 2009), instance segmentation (Zhou et al., 2017), and object detection (Lin et al., 2014). Many powerful CNN backbones have been proposed to improve performance, through greater scale (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016), more extensive connections (Huang et al., 2017; Xie et al., 2017; Sun et al., 2019; Yang et al., 2018), and more sophisticated forms of convolution (Dai et al., 2017; Zhu et al., 2019; Yang et al., 2019). Most of these architectures follow the representation-to-classification paradigm, where representations of the input are first generated by category-agnostic convolutional operations and then fed into classifiers for specific perceptual tasks. Consequently, all inputs are processed by consecutive static convolutional operations and expressed as universal representations.

In parallel, evidence has accumulated in the neuroscience community that the human visual system integrates both bottom-up processing from the retina and top-down modulation from higher-order cortical areas (Rao & Ballard, 1999; Lee & Mumford, 2003; Friston, 2005). On the one hand, bottom-up processing is based on feedforward connections along a hierarchy that represents progressively more complex aspects of visual scenes (Gilbert & Sigman, 2007). This property is shared with the aforementioned representation-to-classification paradigm (Zeiler & Fergus, 2014; Yamins et al., 2014). On the other hand, recent findings suggest that top-down modulation affects bottom-up processing in a way that enables the neurons to carry more information about the stimulus being

Code availability: https://github.com/liang4sx/contextual_cnn.

