REVERSIBLE COLUMN NETWORKS

Abstract

We propose a new neural network design paradigm, the Reversible Column Network (RevCol). The main body of RevCol is composed of multiple copies of a subnetwork, named columns, between which multi-level reversible connections are employed. Such an architectural scheme gives RevCol a behavior very different from that of conventional networks: during forward propagation, features in RevCol are learned to be gradually disentangled as they pass through each column, while their total information is maintained rather than compressed or discarded as in other networks. Our experiments suggest that CNN-style RevCol models can achieve very competitive performance on multiple computer vision tasks such as image classification, object detection and semantic segmentation, especially with large parameter budgets and large datasets. For example, after ImageNet-22K pre-training, RevCol-XL obtains 88.2% ImageNet-1K accuracy. Given more pre-training data, our largest model RevCol-H reaches 90.0% on ImageNet-1K, 63.8% box AP on the COCO detection minival set, and 61.0% mIoU on ADE20K segmentation. To our knowledge, these are the best COCO detection and ADE20K segmentation results among pure (static) CNN models. Moreover, as a general macro-architecture design, RevCol can also be introduced into transformers or other neural networks, and is demonstrated to improve performance on both computer vision and NLP tasks. We release code and models at https://github.com/megvii-research/RevCol.

1. INTRODUCTION

The Information Bottleneck principle (IB) (Tishby et al., 2000; Tishby & Zaslavsky, 2015) rules the deep learning world. Consider a typical supervised learning network as in Fig. 1 (a): layers close to the input contain more low-level information, while features close to the output are rich in semantic meaning. In other words, information unrelated to the target is gradually compressed during the layer-by-layer propagation. Although such a learning paradigm achieves great success in many practical applications, it might not be the optimal choice from the view of feature learning: downstream tasks may suffer from inferior performance if the learned features are over-compressed, or if the learned semantic information is irrelevant to the target tasks, especially when a significant domain gap exists between the source and the target tasks (Zamir et al., 2018). Researchers have devoted great effort to making the learned features more universally applicable, e.g. via self-supervised pre-training (Oord et al., 2018; Devlin et al., 2018; He et al., 2022; Xie et al., 2022) or multi-task learning (Ruder, 2017; Caruana, 1997; Sener & Koltun, 2018).

In this paper, we mainly focus on an alternative approach: building a network to learn disentangled representations. Unlike IB learning, disentangled feature learning (Desjardins et al., 2012; Bengio et al., 2013; Hinton, 2021) does not intend to extract the most related information while discarding the less related; instead, it aims to embed the task-relevant concepts or semantic words into a few decoupled dimensions respectively, while the whole feature vector roughly maintains as much information as the input. It is quite analogous to the mechanism in biological cells (Hinton, 2021; Lillicrap et al., 2020): each cell shares an identical copy of the whole genome but has different expression intensities. Accordingly, in computer vision tasks, learning disentangled features is also reasonable: for instance, high-level semantic representations are tuned during ImageNet pre-training, while the low-level information (e.g. locations of the edges) should also be maintained in other feature dimensions in case downstream tasks such as object detection demand it.

Fig. 1 (b) sketches our main idea: Reversible Column Networks (RevCol), which is greatly inspired by the big picture of GLOM (Hinton, 2021). Our network is composed of N subnetworks (named columns) of identical structure (whose weights, however, are not necessarily the same), each of which receives a copy of the input and generates a prediction. Hence multi-level embeddings, i.e. from low-level to highly semantic representations, are stored in each column. Moreover, reversible transformations are introduced to propagate the multi-level features from the i-th column to the (i+1)-th column without information loss. During the propagation, since the complexity and nonlinearity increase, the quality of all feature levels is expected to gradually improve. Hence the last column (Col N in Fig. 1 (b)) predicts the final disentangled representations of the input.

In RevCol, one of our key contributions is the design of the reversible transformations between adjacent columns. The concept is borrowed from the family of Reversible Networks (Chang et al., 2018; Gomez et al., 2017; Jacobsen et al., 2018; Mangalam et al., 2022); however, conventional reversible structures such as RevNets (Gomez et al., 2017) (Fig. 2 (a)) usually have two drawbacks: first, feature maps within a reversible block are restricted to have the same shape; second, the last two feature maps in RevNets have to contain both low-level and high-level information due to the reversible nature, which may be difficult to optimize since it conflicts with the IB principle. In this paper, we overcome these drawbacks by introducing a novel reversible multi-level fusion module. The details are discussed in Sec. 2.
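To make the critique concrete, below is a minimal sketch of the RevNet-style coupling referred to above. The class and module names (ReversibleBlock, f, g) are ours, not from the paper or any released code; the additive coupling itself follows Gomez et al. (2017). Note how both streams must keep identical shapes, which is exactly the first drawback mentioned, and how the inverse allows recomputing inputs from outputs, which is the source of the memory saving discussed later.

```python
# A minimal sketch (PyTorch assumed) of a RevNet-style reversible block.
import torch
import torch.nn as nn


class ReversibleBlock(nn.Module):
    """Additive coupling: y1 = x1 + F(x2); y2 = x2 + G(y1). Exactly invertible."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # arbitrary residual functions, but they must preserve shape
        self.g = g

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Inputs are recomputed from outputs, so intermediate activations need
        # not be cached during the forward pass -- the memory-saving property.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


# Quick check of exact invertibility (up to floating-point error):
blk = ReversibleBlock(nn.Conv2d(16, 16, 3, padding=1),
                      nn.Conv2d(16, 16, 3, padding=1))
x1, x2 = torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8)
r1, r2 = blk.inverse(*blk.forward(x1, x2))
assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)
```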



Figure 1: Sketch of the information propagation in: (a) Vanilla single-column network. (b) Our reversible column network. Yellow color denotes low-level information and blue color denotes semantic information.

We build a series of CNN-based RevCol models under different complexity budgets and evaluate them on mainstream computer vision tasks, such as ImageNet classification, COCO object detection and instance segmentation, and ADE20K semantic segmentation. Our models achieve comparable or better results than sophisticated CNNs or vision transformers such as ConvNeXt (Liu et al., 2022b) and Swin (Liu et al., 2021). For example, after ImageNet-22K pre-training, our RevCol-XL model obtains 88.2% accuracy on ImageNet-1K without using transformers or large convolutional kernels (Ding et al., 2022b; Liu et al., 2022b; Han et al., 2021). More importantly, we find RevCol scales up well to large models and large datasets. Given a larger private pre-training dataset, our biggest model RevCol-H obtains 90.0% accuracy on ImageNet-1K classification, 63.8% box AP on the COCO detection minival set, and 61.0% mIoU on ADE20K segmentation, respectively. To our knowledge, it is the best reversible model on those tasks, as well as the best pure CNN model on COCO and ADE20K that involves only static kernels without dynamic convolutions (Dai et al., 2017; Ma et al., 2020). In the appendix, we further demonstrate that RevCol can work with transformers (Dosovitskiy et al., 2020; Devlin et al., 2018) and obtain improved results on both computer vision and NLP tasks. Finally, similar to RevNets (Gomez et al., 2017), RevCol also shares the bonus of memory saving from the reversible nature, which is particularly important for large model training.
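As a rough illustration of the multi-column macro design described in this introduction, the sketch below instantiates N columns of identical structure with independent weights, each fed a copy of a shared input embedding, with features handed from one column to the next through an invertible additive update. This is a deliberately simplified stand-in: the paper's actual multi-level reversible fusion is defined in its Sec. 2, and every name here (RevColSketch, make_column, gamma) is hypothetical.

```python
# A hedged sketch of the RevCol macro architecture: columns with identical
# structure, each receiving a copy of the input, linked by invertible updates.
import torch
import torch.nn as nn


def make_column(dim: int, depth: int = 2) -> nn.Module:
    # Placeholder column body; real columns hold multi-level feature maps.
    return nn.Sequential(*[nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                         nn.GELU())
                           for _ in range(depth)])


class RevColSketch(nn.Module):
    def __init__(self, dim: int = 64, num_columns: int = 4):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, 4, stride=4)  # shared input embedding
        # Identical structure per column, but independent weights.
        self.columns = nn.ModuleList(make_column(dim) for _ in range(num_columns))
        self.gamma = nn.Parameter(torch.ones(num_columns))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.stem(img)       # every column sees a copy of this embedding
        feat = torch.zeros_like(x)
        for col, g in zip(self.columns, self.gamma):
            # feat_i = F_i(x) + gamma_i * feat_{i-1}: given feat_i and x, the
            # previous column's feat_{i-1} is recoverable when gamma_i != 0,
            # so no information is lost between columns.
            feat = col(x) + g * feat
        return feat              # last column carries the final representation


out = RevColSketch()(torch.randn(1, 3, 32, 32))
```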


* Corresponding author. This work is supported by The National Key Research and Development Program of China (No. 2017YFA0700800) and Beijing Academy of Artificial Intelligence (BAAI).



