CONTEXTUAL CONVOLUTIONAL NETWORKS

Abstract

This paper presents a new convolutional neural network, named Contextual Convolutional Network, that capably serves as a general-purpose backbone for visual recognition. Most existing convolutional backbones follow the representation-to-classification paradigm, where representations of the input are first generated by category-agnostic convolutional operations and then fed into classifiers for specific perceptual tasks (e.g., classification and segmentation). In this paper, we deviate from this classic paradigm and propose to augment potential category memberships as contextual priors in the convolution for contextualized representation learning. Specifically, the top-k likely classes from the preceding stage are encoded as a contextual prior vector. Based on this vector and the preceding features, offsets for spatial sampling locations and kernel weights are generated to modulate the convolution operations. The new convolutions can readily replace their plain counterparts in existing CNNs and can be trained end-to-end by standard back-propagation without additional supervision. These qualities make Contextual Convolutional Networks compatible with a broad range of vision tasks, and the proposed model boosts the state-of-the-art ConvNeXt-Tiny architecture by 1.8% top-1 accuracy on ImageNet classification. The superiority of the proposed model reveals the potential of contextualized representation learning for vision tasks.

1. INTRODUCTION

Beginning with AlexNet (Krizhevsky et al., 2012) and its revolutionary performance on the ImageNet image classification challenge, convolutional neural networks (CNNs) have achieved significant success on visual recognition tasks such as image classification (Deng et al., 2009), instance segmentation (Zhou et al., 2017) and object detection (Lin et al., 2014). Many powerful CNN backbones have been proposed to improve performance, featuring greater scale (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016), more extensive connections (Huang et al., 2017; Xie et al., 2017; Sun et al., 2019; Yang et al., 2018), and more sophisticated forms of convolution (Dai et al., 2017; Zhu et al., 2019; Yang et al., 2019). Most of these architectures follow the representation-to-classification paradigm, where representations of the input are first generated by category-agnostic convolutional operations and then fed into classifiers for specific perceptual tasks. Consequently, all inputs are processed by consecutive static convolutional operations and expressed as universal representations. In parallel, evidence is accumulating in the neuroscience community that the human visual system integrates both bottom-up processing from the retina and top-down modulation from higher-order cortical areas (Rao & Ballard, 1999; Lee & Mumford, 2003; Friston, 2005). On the one hand, bottom-up processing is based on feedforward connections along a hierarchy that represents progressively more complex aspects of visual scenes (Gilbert & Sigman, 2007). This property is shared with the aforementioned representation-to-classification paradigm (Zeiler & Fergus, 2014; Yamins et al., 2014). On the other hand, recent findings suggest that top-down modulation affects bottom-up processing in a way that enables neurons to carry more information about the stimulus being discriminated (Gilbert & Li, 2013).
For example, recordings in the prefrontal cortex reveal that the same neuron can be modulated to express different categorical representations as the categorical context changes (e.g., from discriminating animals to discriminating cars) (Cromer et al., 2010; Gilbert & Li, 2013). Moreover, words with categorical labels (e.g., "chair") can set visual priors that alter how visual information is processed from the very beginning, allowing for more effective representational separation of category members and non-members (Lupyan & Ward, 2013; Boutonnet & Lupyan, 2015). Top-down modulation can help resolve challenging vision tasks with complex scenes or visual distractors; however, recent CNN backbones have yet to exploit this property.

Motivated by top-down modulation with categorical context in the brain, we present a novel architecture, namely Contextual Convolutional Networks (Contextual CNN), which augments potential category memberships as contextual priors in the convolution for representation learning. Specifically, the k most likely classes identified so far are encoded as a contextual vector. Based on this vector and the preceding features, offsets for spatial sampling locations and kernel weights are generated to modulate the convolutional operations in the current stage (illustrated in Fig. 1a). The sampling offsets enable free-form deformation of the local sampling grid conditioned on the likely classes and the input instance, which modulates where to locate information about the image being discriminated. The weight offsets allow the adjustment of specific convolutional kernels (e.g., "edges" to "corners"), which modulates how to extract discriminative features from the input image. Meanwhile, the set of considered classes is reduced from k to m (m < k) and fed to the following stage for further discrimination. By doing so, each stage of convolutions is conditioned on the results of the previous one, rendering the convolutions dynamic in a principled way.
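The modulation described above can be illustrated with a minimal numpy sketch. All shapes, the toy class-embedding table, and the linear offset generator are hypothetical placeholders, and the spatial sampling offsets (which require bilinear interpolation, as in deformable convolution) are omitted for brevity; only the weight-offset path of a 3 × 3 contextual convolution is shown.

```python
import numpy as np

def contextual_conv(feat, kernel, logits, k=5, seed=0):
    """Toy sketch of a contextual convolution (weight-offset path only).

    feat:   (C, H, W) preceding feature map
    kernel: (C_out, C, 3, 3) static kernel weights
    logits: (num_classes,) class scores from the preceding stage
    """
    rng = np.random.default_rng(seed)

    # 1. Encode the top-k likely classes as a contextual prior vector
    #    (a learned embedding table in the real model; random here).
    topk_idx = np.argsort(logits)[-k:]
    class_embed = rng.standard_normal((logits.shape[0], 16))
    context = class_embed[topk_idx].mean(axis=0)          # (16,) prior vector

    # 2. Generate weight offsets from the context via a (toy) linear layer
    #    and add them to the static kernel: modulates *how* to extract features.
    w_gen = rng.standard_normal((kernel.size, 16)) * 0.01
    modulated = kernel + (w_gen @ context).reshape(kernel.shape)

    # 3. Apply the modulated convolution (stride 1, no padding).
    c_out = kernel.shape[0]
    h, w = feat.shape[1] - 2, feat.shape[2] - 2
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            out[:, i, j] = (modulated * feat[:, i:i+3, j:j+3]).sum(axis=(1, 2, 3))
    return out, topk_idx
```

In the actual architecture the offset generators are trained end-to-end with the backbone, and a second branch produces per-location sampling offsets for the grid, analogous to deformable convolution.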
The proposed contextual convolution can be used as a drop-in replacement for existing convolutions in CNNs and trained end-to-end without additional supervision. As shown in Fig. 1b, the counterpart model assigns a high but incorrect score to "boathouse" w.r.t. the ground-truth class "pier", based on features of the oceanside house that are shared across images of both classes. In contrast, the proposed model predicts correctly by generating features of the long walkway stretching from the shore to the water, which is a crucial cue for discriminating "pier" from "boathouse". We hope that Contextual CNN's strong performance on various vision problems can promote research on a new paradigm of convolutional backbone architectures.
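The stage-wise narrowing of candidate classes (from k to m) can be sketched as a simple cascade. The schedule below (1000 → 100 → 10 classes) and the single shared score vector are hypothetical simplifications; in the real model each stage re-scores the remaining candidates from its own contextualized features.

```python
import numpy as np

def prune_classes(logits, active, m):
    """Keep the m most likely classes among the currently active candidates
    (toy sketch of the k -> m reduction between backbone stages)."""
    keep = np.argsort(logits[active])[-m:]
    return active[keep]

rng = np.random.default_rng(0)
logits = rng.standard_normal(1000)   # toy class scores for 1000 classes
active = np.arange(1000)             # all classes start as candidates
for m in (100, 10):                  # hypothetical per-stage schedule
    active = prune_classes(logits, active, m)
```

Because each stage only discards low-scoring candidates, later convolutions can specialize their sampling and weights toward a progressively smaller, harder-to-separate set of classes.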



Figure 1: Left: Illustration of a 3 × 3 contextual convolution. Given preceding instance features and top-k likely classes, sampling offsets and weight offsets are generated via non-linear layers. These offsets are added to regular grid sampling locations and static kernel weights of a standard convolution, respectively. Right: Grad-CAM visualization (Selvaraju et al., 2017) of the learned features of ConvNeXt-T (Liu et al., 2022) and our model. Grad-CAM is used to interpret the learned features by highlighting corresponding regions that discriminate the predicted class from other classes.

Code availability: //github.com/liang4sx/contextual_cnn.

