FACS: FAST ADAPTIVE CHANNEL SQUEEZING

Abstract

Channel squeezing is one of the central operations performed in CNN bottlenecks to reduce the number of channels in a feature map. This operation is carried out by a 1 × 1 pointwise convolution, which accounts for a significant share of the computation and parameters in a given network. ResNet-50, for instance, contains 16 such layers, which form 33% of total layers and 25% (1.05B/4.12B) of total FLOPs. In light of their predominance, we propose a novel "Fast Adaptive Channel Squeezing" (FACS) module which carries out the squeezing operation in a computationally efficient manner. The key benefit of FACS is that it neither alters the number of parameters nor affects the accuracy of a given network. When plugged into diverse CNN architectures, namely ResNet, VGG, and MobileNet-v2, FACS achieves state-of-the-art performance on ImageNet and CIFAR datasets at dramatically reduced FLOPs. FACS also cuts training time significantly and lowers latency, which is particularly advantageous for fast inference on edge devices. The source code will be made publicly available.

1. INTRODUCTION

Introduced by ResNet (He et al., 2016), squeeze-and-expand units form the basis of state-of-the-art CNNs. Such a unit is essentially a three-layer block in which the first layer (1 × 1) performs channel squeezing while the third layer (1 × 1) performs channel expansion. The middle layer (3 × 3), on the other hand, maintains the channel count and governs the network's receptive field. Interestingly, in CNNs inheriting squeeze-and-expand units, we make a key structural observation: the squeeze and expand layers dominate both in count and in computation, yet contribute nothing to the network's receptive field due to their pointwise nature. For instance, ResNet-50 contains 32 such layers out of 50 total, accounting for ∼54% (2.23B/4.12B) of overall FLOPs, whereas ResNet-101 contains 66 out of 101, accounting for ∼50% (3.98B/7.85B) of total FLOPs. As CNNs are widely used in machine vision (Ren et al., 2015; Zhao et al., 2017; Carion et al., 2020), bigger networks are now preferred to achieve higher accuracy (Gao et al., 2018) in the face of increased task complexity. For this reason, VGG- and ResNet-style networks remain dominant in both academia and industry (Kumar & Behera, 2019; Ding et al., 2021; Kumar et al., 2020) due to their architectural simplicity, customizability, and high representation power, in contrast to newer, more complex networks (Tan & Le, 2019). However, as ResNet-like CNNs are built on squeeze-and-expand units, our preceding observation raises an important question: "can computations in 1 × 1 layers be reduced without sacrificing network parameters and accuracy?" If so, inference in such networks can be significantly accelerated on edge computing devices, benefiting a whole spectrum of applications such as autonomous driving, autonomous robotics, and so on. To the best of our knowledge, this problem has not been addressed previously, yet it is of great practical importance.
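The FLOP figures above follow directly from the standard cost of a 1 × 1 pointwise convolution, C_in × C_out × H × W multiply-accumulates per layer. A minimal sketch of this accounting (the layer shapes below are taken from a ResNet-50 conv2_x bottleneck; exact aggregate totals depend on the FLOP-counting convention used):

```python
def conv1x1_flops(c_in, c_out, h, w):
    """Multiply-accumulate count of a 1x1 pointwise convolution."""
    return c_in * c_out * h * w

# Squeeze layer of a ResNet-50 conv2_x bottleneck: 256 -> 64 channels at 56x56.
squeeze = conv1x1_flops(256, 64, 56, 56)   # 51,380,224 MACs (~51.4M)
# Matching expand layer: 64 -> 256 channels at 56x56.
expand = conv1x1_flops(64, 256, 56, 56)    # 51,380,224 MACs (~51.4M)
# The 3x3 layer in between (64 -> 64) costs 9 * 64 * 64 * 56 * 56
# = ~115.6M MACs, so the two pointwise layers together (~102.8M)
# are comparable in cost despite contributing no receptive field.
print(squeeze, expand)
```

Summed over all 32 squeeze and expand layers of ResNet-50, costs of this form add up to the ∼54% figure quoted above.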
Hence, in this paper, we show that this objective is indeed achievable by examining channel squeezing through the lens of computational complexity. To this end, we propose a novel "fast adaptive channel squeezing" (FACS) module that transforms a feature map X ∈ R^{C×H×W} into another map Y ∈ R^{(C/R)×H×W}, mimicking the functionality of a squeeze layer but with fewer computations, while retaining a network's parameters, accuracy, and non-linearity. We evaluate FACS by embedding it into three CNN architectures: plain (VGG), residual (ResNet), and mobile-series (MobileNet-v2), and on three datasets: ImageNet (Deng et al., 2009), CIFAR-10, and CIFAR-100. FACS is backed by a comprehensive ablation study. Since FACS is novel, we visualize the intermediate representations it learns using GradCAM (Selvaraju et al., 2017) and show that they improve upon the baselines. FACS brings substantial improvements: for example, ResNet-50 becomes ∼23% faster with 0.47% higher accuracy on ImageNet, whereas VGG becomes ∼6% faster with 0.23% higher accuracy, without bells and whistles. Sec. 2 reviews related work. Sec. 3 describes the FACS module and its integration into different CNNs. Sec. 4 presents experiments in support of FACS. Sec. 5 concludes our findings.
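For reference, the baseline operation that FACS replaces, a 1 × 1 squeeze, is simply a per-pixel linear map over channels. A minimal NumPy sketch of the shape contract X ∈ R^{C×H×W} → Y ∈ R^{(C/R)×H×W} (the weights here are random placeholders illustrating the baseline squeeze layer, not the FACS module itself):

```python
import numpy as np

def pointwise_squeeze(x, weight):
    """1x1 channel squeeze: (C, H, W) -> (C_out, H, W) via a per-pixel
    linear map over the channel dimension (bias omitted for brevity)."""
    # y[r, h, w] = sum_c weight[r, c] * x[c, h, w]
    return np.einsum('rc,chw->rhw', weight, x)

C, H, W, R = 256, 14, 14, 4
x = np.random.randn(C, H, W)
w = np.random.randn(C // R, C)   # squeeze ratio R = 4
y = pointwise_squeeze(x, w)
print(y.shape)                   # (64, 14, 14)
```

Every spatial position is processed independently with the same C/R × C weight matrix, which is why these layers add no receptive field while still costing C_in × C_out × H × W multiply-accumulates.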

