RETHINKING CONVOLUTION: TOWARDS AN OPTIMAL EFFICIENCY

Abstract

In this paper, we present our recent research on the computational efficiency of convolution. The convolution operation is the most critical component in the recent surge of deep learning research. A conventional 2D convolution takes O(C^2 K^2 HW) operations to calculate, where C is the channel size, K is the kernel size, and H and W are the output height and width. Such computation has become costly as these parameters have grown over the past few years to meet the needs of demanding applications. Among the various implementations of convolution, separable convolution has proven effective at reducing the computational demand. For example, depth separable convolution reduces the complexity to O(CHW · (C + K^2)), while spatial separable convolution reduces it to O(C^2 KHW). However, these are ad hoc designs and cannot guarantee an optimal separation in general. In this research, we propose a novel operator called optimal separable convolution, which can be calculated at O(C^{3/2} KHW) by optimally designing the internal number of groups and kernel sizes of a general separable convolution. When there is no restriction on the number of separated convolutions, an even lower complexity of O(CHW · log(CK^2)) can be achieved. Experimental results demonstrate that the proposed optimal separable convolution achieves improved accuracy-FLOPs and accuracy-#Params trade-offs over both conventional and depth/spatial separable convolutions.

1. INTRODUCTION

Tremendous progress has been made in recent years toward more accurate image analysis tasks, such as image classification, with deep convolutional neural networks (DCNNs) (Krizhevsky et al., 2012; Srivastava et al., 2015; He et al., 2016; Real et al., 2019; Tan & Le, 2019; Dai et al., 2020). However, state-of-the-art DCNN models have also become increasingly large and computationally expensive. This can significantly hinder their deployment to real-world applications, such as mobile platforms and robotics, where resources are highly constrained (Howard et al., 2017; Dai et al., 2020). It is therefore highly desirable for a DCNN to achieve better performance with less computation and fewer model parameters.

The most time-consuming building block of a DCNN is the convolutional layer, and many previous works aim at reducing its computation. Historically, researchers applied the Fast Fourier Transform (FFT) (Nussbaumer, 1981; Quarteroni et al., 2010) to implement convolution and obtained great speed-ups for large convolutional kernels; for small kernels, a direct implementation is often still cheaper (Podlozhnyuk, 2007). Researchers have also explored low-rank approximation (Jaderberg et al., 2014; Ioannou et al., 2015) to implement convolutions. However, most existing methods start from a pre-trained model and mainly focus on network pruning and compression.

In this research, we study how to design a separable convolution that achieves an optimal implementation in terms of computational complexity. Making a convolution separable has been proven to be an efficient way to reduce its computational complexity (Sifre & Mallat, 2014; Howard et al., 2017; Szegedy et al., 2016). Compared to the FFT and low-rank approximation approaches, a well-designed separable convolution is efficient for both small and large kernel sizes and does not require a pre-trained model to operate on.
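As a concrete illustration of the savings that separability buys (an illustrative sketch, not code from the paper), the following pure-Python routine runs a naive depth-wise convolution followed by a point-wise convolution and counts the multiplications actually performed; the count lands exactly on the CHW(C + K^2) figure discussed below.

```python
import random

def depthwise_pointwise(x, dw_kernels, pw_kernels):
    """Naive depth separable convolution on a C x H x W input (zero padding).

    Returns the output tensor and the number of multiplications performed.
    Illustrative sketch only; real frameworks fuse and vectorize these loops.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K = len(dw_kernels[0])
    pad = K // 2
    muls = 0

    # Depth-wise stage: one K x K filter per input channel, no channel mixing.
    mid = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                s = 0.0
                for di in range(K):
                    for dj in range(K):
                        ii, jj = i + di - pad, j + dj - pad
                        v = x[c][ii][jj] if 0 <= ii < H and 0 <= jj < W else 0.0
                        s += v * dw_kernels[c][di][dj]
                        muls += 1
                mid[c][i][j] = s

    # Point-wise stage: 1 x 1 convolution mixing channels.
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for co in range(C):
        for i in range(H):
            for j in range(W):
                s = 0.0
                for ci in range(C):
                    s += mid[ci][i][j] * pw_kernels[co][ci]
                    muls += 1
                out[co][i][j] = s
    return out, muls

C, K, H, W = 4, 3, 5, 5
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
dw = [[[random.random() for _ in range(K)] for _ in range(K)] for _ in range(C)]
pw = [[random.random() for _ in range(C)] for _ in range(C)]
out, muls = depthwise_pointwise(x, dw, pw)
assert muls == C * H * W * (C + K * K)  # depth-wise CK^2HW + point-wise C^2HW
```

A conventional convolution would instead perform C^2 K^2 HW multiplications for the same input and output shapes, since every output value mixes all C input channels over a full K x K window.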
Table 1: A comparison of the computational complexity and number of parameters of the proposed optimal separable convolution and existing approaches. The proposed optimal separable convolution is much more efficient. In this table, C represents the channel size of the convolution, K is the kernel size, H and W are the output height and width, and g is the number of groups. "Vol. RF" refers to whether the corresponding convolution satisfies the proposed volumetric receptive field condition.

                                   FLOPs               #Params            Note
  Conv2D                           C^2 K^2 HW          C^2 K^2            -
  Group Conv2D                     C^2 K^2 HW / g      C^2 K^2 / g        -
  Depth-wise Conv2D                C K^2 HW            C K^2              g = C
  Point-wise Conv2D                C^2 HW              C^2                K = 1
  Depth separable Conv2D           CHW (C + K^2)       C (C + K^2)        depth-wise + point-wise
  Spatial separable Conv2D         2 C^2 K HW          2 C^2 K            K^2 -> 2K
  Optimal separable (N = 2)        2 C^{3/2} K HW      2 C^{3/2} K        -
  Optimal separable (optimized N)  e CHW log(CK^2)     e C log(CK^2)      e = 2.71828...

[Figure 1] Volumetric receptive field and the proposed optimal separable convolution. (a) The volumetric receptive field (RF) of a convolution is the Cartesian product of its (spatial) RF and its channel RF. (b) Illustrations of the channel connections for conventional (Conv2D), depth separable, and the proposed optimal separable convolutions.

In DCNN research, the two best-known separable convolutions are depth separable (Sifre & Mallat, 2014) and spatial separable (Szegedy et al., 2016) convolutions. Both are able to reduce the computational complexity of a convolution. The complexity of a conventional 2D convolution is quadratic in three quantities: the number of channels (C), the kernel size (K), and the spatial scale (H and W jointly), giving a computational complexity of O(C^2 K^2 HW). A depth separable convolution is constructed as a depth-wise convolution followed by a point-wise convolution, where the depth-wise convolution is a group convolution with g = C groups and the point-wise convolution is a 1 × 1 convolution. A spatial separable convolution replaces a K × K kernel with a K × 1 kernel followed by a 1 × K kernel. The different types of convolutions and their computational costs are summarized in Table 1, from which one can easily verify that depth separable convolution has a complexity of O(CHW · (C + K^2)) and spatial separable convolution has a complexity of O(C^2 KHW).

Both depth separable and spatial separable convolutions follow an ad hoc design: they reduce the computational cost to some degree but in general do not achieve an optimal separation. A separable convolution has three sets of parameters: the internal number of groups, the channel size, and the kernel size of each separated convolution. Instead of setting these parameters in an ad hoc fashion, we design a scheme that achieves an optimal separation. The resulting operator is called optimal separable convolution in this research.
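Plugging a typical layer configuration into the FLOP formulas from Table 1 makes the gap between the designs concrete. The snippet below (an illustrative calculation, not code from the paper) evaluates the formulas for a 256-channel layer with a 3 × 3 kernel on a 32 × 32 output:

```python
import math

# FLOP formulas from Table 1 (multiply-adds), evaluated for a typical layer:
# C channels, K x K kernel, H x W output.
C, K, H, W = 256, 3, 32, 32

flops = {
    "Conv2D":                     C * C * K * K * H * W,
    "Depth separable":            C * H * W * (C + K * K),
    "Spatial separable":          2 * C * C * K * H * W,
    "Optimal separable (N = 2)":  2 * C ** 1.5 * K * H * W,
    "Optimal separable (opt. N)": math.e * C * H * W * math.log(C * K * K),
}

# Print from most to least expensive.
for name, f in sorted(flops.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {f / 1e6:10.2f} MFLOPs")
```

For this configuration the ordering is Conv2D > spatial separable > depth separable > optimal separable (N = 2) > optimal separable (optimized N), i.e. the proposed operator is cheaper than both ad hoc separable designs.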
To prevent the proposed optimal separable convolution from degenerating, we assume that the internal channel size is of order O(C) and propose the following volumetric receptive field condition. As illustrated in Fig. 1a, similar to the receptive field (RF) of a convolution, which is defined as the region in the input space that a particular CNN feature is looking at (or affected by) (Lindeberg, 2013), we define the volumetric RF of a convolution to be the volume in the input space that affects the CNN's output. The volumetric RF condition requires that a properly decomposed separable convolution maintain the same volumetric RF as the original convolution before decomposition. Hence, constructing the proposed optimal separable convolution is equivalent to optimizing the internal numbers of groups and kernel sizes to minimize the computational cost (measured in FLOPs¹) while satisfying the volumetric RF condition. Formally, the objective function is defined by Equation (2) under the constraints defined by Equations (3)-(6). The solution to this optimization problem is described in detail in Section 2.
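The two "optimal separable" entries in Table 1 are consistent with an N-stage separation whose total cost is N · C · (CK^2)^{1/N} · HW: at N = 2 this is 2 C^{3/2} K HW, and the continuous minimizer N = ln(CK^2) yields e · CHW · log(CK^2). The snippet below uses that inferred cost model to locate the best integer N numerically; note the per-stage cost expression is reconstructed from the table entries, while the exact constrained objective is given by the paper's Equations (2)-(6).

```python
import math

def n_stage_flops(N, C, K, H, W):
    # Total cost of an N-stage separation under the cost model inferred from
    # Table 1: N stages, each costing C * (C K^2)^(1/N) * HW multiply-adds.
    # N = 2 recovers 2 C^{3/2} K HW; N = ln(C K^2) recovers e CHW log(C K^2).
    return N * C * (C * K * K) ** (1.0 / N) * H * W

C, K, H, W = 256, 3, 32, 32
best_N = min(range(1, 30), key=lambda N: n_stage_flops(N, C, K, H, W))
print(best_N, math.log(C * K * K))  # the best integer N sits near ln(C K^2)
```

This matches the intuition behind the optimized-N entry: as N grows, each stage becomes cheaper but there are more stages, and the product is minimized when N is close to ln(CK^2).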



¹ In this research, similar to (He et al., 2016), FLOPs are measured in the number of multiply-adds.

