FACS: FAST ADAPTIVE CHANNEL SQUEEZING

Abstract

Channel squeezing is one of the central operations performed in CNN bottlenecks to reduce the number of channels in a feature map. This operation is carried out by a 1 × 1 pointwise convolution, which accounts for a significant share of the computations and parameters in a given network. ResNet-50, for instance, contains 16 such layers, which form 33% of its total layers and 25% (1.05B/4.12B) of its total FLOPs. In light of their predominance, we propose a novel "Fast Adaptive Channel Squeezing" (FACS) module which carries out the squeezing operation in a computationally efficient manner. The key benefit of FACS is that it neither alters the number of parameters nor affects the accuracy of a given network. When plugged into diverse CNN architectures, namely ResNet, VGG, and MobileNet-v2, FACS achieves state-of-the-art performance on the ImageNet and CIFAR datasets at dramatically reduced FLOPs. FACS also cuts training time significantly and lowers latency, which is particularly advantageous for fast inference on edge devices. The source code will be made publicly available.

1. INTRODUCTION

Introduced by ResNet (He et al., 2016), squeeze-and-expand units form the basis of state-of-the-art CNNs. Such a unit is essentially three-layered: the first layer (1 × 1) performs channel squeezing and the third layer (1 × 1) performs channel expansion, while the middle layer (3 × 3) maintains the channel count and governs the network's receptive field. Interestingly, in CNNs built from squeeze-and-expand units, we make a key structural observation: the squeeze and expand layers dominate both in number and in computations, while contributing nothing to the network's receptive field due to their pointwise nature. For instance, ResNet-50 contains 32 such layers out of 50 in total, accounting for ∼54% (2.23B/4.12B) of its overall FLOPs, whereas ResNet-101 contains 66 out of 101, accounting for ∼50% (3.98B/7.85B) of its total FLOPs. As CNNs are widely used in machine vision (Ren et al., 2015; Zhao et al., 2017; Carion et al., 2020), bigger networks are now preferred to achieve higher accuracy (Gao et al., 2018) and to cope with increased task complexity. For this reason, VGG- and ResNet-style networks still dominate both academia and industry (Kumar & Behera, 2019; Ding et al., 2021; Kumar et al., 2020) owing to their architectural simplicity, customizability, and high representation power, in contrast to newer, more complex networks (Tan & Le, 2019). However, since ResNet-like CNNs are built on squeeze-expand units, our observation raises an important question: "can computations in the 1 × 1 layers be reduced without sacrificing network parameters and accuracy?" If so, inference of such networks can be significantly accelerated on edge computing devices, benefiting a whole spectrum of applications such as autonomous driving and autonomous robotics. To the best of our knowledge, this problem has not been addressed previously, yet it is of great practical importance.
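The FLOP shares quoted above follow directly from the cost of a pointwise convolution: a 1 × 1 layer mapping C_in channels to C_out over an H × W map costs H·W·C_in·C_out multiply-accumulates. A minimal sketch, using one illustrative bottleneck configuration (256 → 64 → 64 → 256 channels at 56 × 56 resolution; shapes are for illustration, not a claim about the exact ResNet-50 layout):

```python
def conv_flops(h, w, c_in, c_out, k=1):
    """Multiply-accumulate count of a k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

# One ResNet-style bottleneck at 56x56 with 256 -> 64 -> 64 -> 256 channels:
squeeze = conv_flops(56, 56, 256, 64, k=1)   # 1x1 squeeze layer
spatial = conv_flops(56, 56, 64, 64, k=3)    # 3x3 middle layer
expand  = conv_flops(56, 56, 64, 256, k=1)   # 1x1 expand layer
total = squeeze + spatial + expand

print(round((squeeze + expand) / total, 2))  # pointwise share of the block's FLOPs
```

For this configuration the two pointwise layers already account for roughly half of the block's FLOPs, consistent with the network-level shares reported above.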
Hence, in this paper, we show that the desired objective can indeed be achieved by examining channel squeezing through the lens of computational complexity. To this end, we propose a novel "fast adaptive channel squeezing" (FACS) module that transforms a feature map X ∈ R^(C×H×W) into another map Y ∈ R^((C/R)×H×W), mimicking the functionality of a squeeze layer but with fewer computations, while retaining a network's parameters, accuracy, and non-linearity. We evaluate FACS by embedding it into three CNN architectures, namely plain (VGG), residual (ResNet), and mobile-series (MobileNet-v2), and on three datasets: ImageNet (Deng et al., 2009), CIFAR-10, and CIFAR-100. FACS is backed by a comprehensive ablation study. Since FACS is novel, we visualize the intermediate data representations it learns using GradCAM (Selvaraju et al., 2017), and show that they improve upon the baselines. FACS brings large improvements, e.g., ResNet-50 becomes ∼23% faster with 0.47% higher accuracy on ImageNet (Deng et al., 2009), whereas VGG becomes ∼6% faster with 0.23% higher accuracy, without bells and whistles. The next section reviews related work. Sec. 3 describes the FACS module and its integration into different CNNs. Sec. 4 presents experiments in support of FACS. Sec. 5 concludes our findings.

2. RELATED WORK

2.1. CONVNETS

Earlier CNNs (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016) are accuracy-oriented, and later designs, despite reaching higher accuracy, have grown complex in terms of branching (Huang et al., 2017). Mobile networks (Sandler et al., 2018; Zhang et al., 2018) target low FLOPs for high-speed inference by using depthwise convolutions (Sifre & Mallat). However, they suffer from low representation power and need prolonged training schedules of 200-400 epochs, in contrast to earlier networks, which are typically trained for 90-120 epochs. The limited representation power hinders their performance on downstream tasks; e.g., (Howard et al., 2017) needs 200 epochs on ImageNet to perform similarly to (Simonyan & Zisserman, 2014) (trained for 75 epochs), yet performs poorly on object detection. (Tan & Le, 2019) tackles this issue by being efficient in parameters and FLOPs, but ends up highly branched and deep, memory-hungry, and sluggish (Ding et al., 2021). Recent Transformer-based methods are computationally hungry due to their attention mechanisms (Vaswani et al., 2017; Dosovitskiy et al., 2020), which puts them out of reach of edge devices. ResNet design-space exploration (Schneider et al., 2017) provides several variants competitive with (Tan & Le, 2019)-style designs while being simpler, which is quite advantageous for edge devices and real-time applications. Similarly, (Ding et al., 2021) improves the speed of VGG; however, its training-time design has many branches and remains inferior to ResNet. Furthermore, (Hu et al., 2018; Woo et al., 2018) propose novel units which improve ResNet's accuracy at the expense of increased parameters, due to additional convolutions, and a marginal computational overhead. The above discussion shows that the older architectures still have room for improvement, and that real-world applications can benefit from work in this area.

2.2. FAST INFERENCE

Fast inference of CNNs has been widely explored, mainly via static pruning and network compression/quantization. These methods are generally employed post-training and are agnostic to the given network. More recently, dynamic pruning methods such as (Gao et al., 2018) have become state-of-the-art in this area, addressing the limitations of static pruning. Pruning works mainly by disabling or suppressing a set of channels in a feature map, for which the convolution computations are then inhibited. For example, (Gao et al., 2018) consumes a tensor X ∈ R^(C×H×W) and outputs another tensor Y ∈ R^(C×H×W), but some of the channels in Y are zero, and the corresponding convolution operations are skipped, saving computations at the cost of accuracy. (Gao et al., 2018) achieves this by introducing new convolution layers into the network. The proposed FACS is orthogonal to the above methods since it performs computationally efficient channel squeezing, transforming a tensor X ∈ R^(C×H×W) into Y ∈ R^((C/R)×H×W). In addition, FACS preserves a network's parameters and accuracy, and introduces no new convolution layers; rather, it reuses the parameters of the original squeeze layer while redefining the information flow. Hence, pruning cannot substitute for FACS. However, as pruning is applicable to any network architecture, and as FACS is an architectural enhancement, FACS can be seamlessly combined with any pruning method to speed up the overall network. In this paper, however, we limit ourselves to the study of FACS in deep networks.
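To make the contrast concrete at the shape level, the following sketch (hypothetical tensors and gate values, with NumPy standing in for a deep-learning framework) shows the difference: dynamic pruning keeps all C output channels but zeroes some of them, whereas channel squeezing emits only C/R channels:

```python
import numpy as np

C, R, H, W = 8, 4, 2, 2
x = np.random.rand(C, H, W)

# Dynamic pruning (in the style of Gao et al., 2018): output keeps all C
# channels, but a gate zeroes some of them so their convolutions can be skipped.
gate = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # hypothetical binary gate
pruned = x * gate[:, None, None]                   # shape stays (C, H, W)

# Channel squeezing: a 1x1 convolution, written as a matrix, emits C/R channels.
w_squeeze = np.random.rand(C // R, C)              # squeeze weights, (C/R, C)
squeezed = np.einsum('oc,chw->ohw', w_squeeze, x)  # shape (C/R, H, W)

print(pruned.shape, squeezed.shape)                # (8, 2, 2) (2, 2, 2)
```

The pruned tensor still carries the full channel count (with dead channels), while the squeezed tensor is genuinely smaller, which is why the two approaches are orthogonal and can be combined.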

3. FACS

Given an input feature map X ∈ R^(C×H×W), FACS progressively infers intermediate 1D descriptors z ∈ R^(C×1×1) and p ∈ R^((C/R)×1×1), and an output feature map Y ∈ R^((C/R)×H×W) of reduced dimensionality. FACS can be sectioned into three stages: first, global context aggregation, which provides channel-wise global spatial context in the form of a 1D descriptor z; second, cross-channel information blending, which transforms z into another descriptor p, referred to as the adaptive channel fusion probability; third, channel fusion, which utilizes p and X to produce Y. The overall structure of the FACS module is illustrated in Figure 2. For reference, we also show the baseline channel-squeezing method in Figure 1. Notice that FACS differs substantially from the baseline in terms of structure. The only component shared between them is the convolution layer, yet the tensors on which it operates are different. Further, FACS employs global pooling, maximum, and averaging operations, which are among the most fundamental operations in neural networks; however, their use in the FACS configuration is entirely novel. For this reason, we term FACS a novel module for fast channel squeezing.
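The exact stage-wise wiring is given in Figure 2; the following is only a shape-level sketch of the three stages in NumPy. The fusion details are our illustrative assumptions, not the paper's specification: we assume global average pooling for stage one, a sigmoid applied to the reused squeeze-layer weights for stage two, and a p-gated blend of per-group max and average maps for stage three.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def facs_sketch(x, w, r):
    """Hypothetical FACS forward pass: x has shape (C, H, W), w is the reused
    squeeze-layer weight of shape (C // r, C); returns a (C // r, H, W) map."""
    c, h, wd = x.shape
    # Stage 1: global context aggregation -> z in R^C (global average pooling).
    z = x.mean(axis=(1, 2))
    # Stage 2: cross-channel blending -> fusion probability p in R^(C/r),
    # reusing the original squeeze layer's parameters (assumed form).
    p = sigmoid(w @ z)
    # Stage 3 (assumed form): fuse each group of r input channels as a
    # p-gated blend of the group's max map and average map.
    groups = x.reshape(c // r, r, h, wd)
    fused = (p[:, None, None] * groups.max(axis=1)
             + (1 - p)[:, None, None] * groups.mean(axis=1))
    return fused

x = np.random.rand(16, 4, 4)
w = np.random.rand(4, 16)   # squeeze-layer weights for C = 16, R = 4
y = facs_sketch(x, w, 4)
print(y.shape)              # (4, 4, 4)
```

Whatever the exact fusion rule, the key property the sketch preserves is that the only learned parameters involved are those of the original squeeze layer, applied to a pooled 1D descriptor rather than to the full H × W map, which is where the computational saving comes from.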

