RETHINKING CONVOLUTION: TOWARDS AN OPTIMAL EFFICIENCY

Abstract

In this paper, we present our recent research on the computational efficiency of convolution. The convolution operation is the most critical component in the recent surge of deep learning research. Conventional 2D convolution takes O(C^2 K^2 HW) operations to compute, where C is the channel size, K is the kernel size, and H and W are the output height and width. Such computation has become very costly, as these parameters have grown over the past few years to meet the needs of demanding applications. Among the various implementations of convolution, separable convolution has proven effective at reducing the computational demand. For example, depth separable convolution reduces the complexity to O(CHW · (C + K^2)), while spatial separable convolution reduces the complexity to O(C^2 KHW). However, these are ad hoc designs and cannot guarantee optimal separation in general. In this research, we propose a novel operator called optimal separable convolution, which can be computed in O(C^{3/2} KHW) by optimally choosing the internal number of groups and kernel sizes for general separable convolutions. When there is no restriction on the number of separated convolutions, an even lower complexity of O(CHW · log(CK^2)) can be achieved. Experimental results demonstrate that the proposed optimal separable convolution achieves improved accuracy-FLOPs and accuracy-#Params trade-offs over both conventional and depth/spatial separable convolutions.
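The complexity comparison above can be made concrete with a small FLOP-count sketch. The functions below simply evaluate the asymptotic operation counts stated in the abstract (constant factors ignored); the example parameter values are illustrative choices, not from the paper.

```python
# FLOP-count sketch (up to constant factors) for the convolution variants
# compared in the abstract; C = channels, K = kernel size, H, W = output dims.

def conv_flops(C, K, H, W):
    return C * C * K * K * H * W          # conventional 2D convolution: O(C^2 K^2 HW)

def spatial_separable_flops(C, K, H, W):
    return C * C * K * H * W              # KxK factored into Kx1 then 1xK: O(C^2 K HW)

def depth_separable_flops(C, K, H, W):
    return C * H * W * (C + K * K)        # depthwise KxK + pointwise 1x1: O(CHW (C + K^2))

def optimal_separable_flops(C, K, H, W):
    return C ** 1.5 * K * H * W           # proposed optimum: O(C^{3/2} K HW)

# Illustrative layer shape (assumed, not from the paper): 256 channels,
# 3x3 kernel, 56x56 output.
C, K, H, W = 256, 3, 56, 56
for f in (conv_flops, spatial_separable_flops,
          depth_separable_flops, optimal_separable_flops):
    print(f.__name__, int(f(C, K, H, W)))
```

For this shape, each successive variant is cheaper than the last, matching the ordering implied in the abstract.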

1. INTRODUCTION

Tremendous progress has been made in recent years towards more accurate image analysis tasks, such as image classification, with deep convolutional neural networks (DCNNs) (Krizhevsky et al., 2012; Srivastava et al., 2015; He et al., 2016; Real et al., 2019; Tan & Le, 2019; Dai et al., 2020). However, state-of-the-art DCNN models have also become increasingly large and computationally expensive. This can significantly hinder their deployment in real-world applications, such as mobile platforms and robotics, where resources are highly constrained (Howard et al., 2017; Dai et al., 2020). It is therefore highly desirable for a DCNN to achieve better performance with less computation and fewer model parameters.

The most time-consuming building block of a DCNN is the convolutional layer, and many previous works have aimed at reducing its computational cost. Historically, researchers applied the Fast Fourier Transform (FFT) (Nussbaumer, 1981; Quarteroni et al., 2010) to implement convolution and gained great speed-ups for large convolutional kernels; for small kernels, a direct implementation is often still cheaper (Podlozhnyuk, 2007). Researchers have also explored low-rank approximation (Jaderberg et al., 2014; Ioannou et al., 2015) to implement convolutions. However, most existing methods start from a pre-trained model and mainly focus on network pruning and compression.

In this research, we study how to design a separable convolution that achieves an optimal implementation in terms of computational complexity. Making convolution separable has proven to be an efficient way to reduce computational complexity (Sifre & Mallat, 2014; Howard et al., 2017; Szegedy et al., 2016). Compared to the FFT and low-rank approximation approaches, a well-designed separable convolution should be efficient for both small and large kernel sizes and should not require a pre-trained model to operate on.
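The FFT route mentioned above rests on the convolution theorem: pointwise multiplication in the frequency domain replaces the per-pixel O(K^2) inner loop. A minimal NumPy sketch, assuming circular (periodic) boundary handling for simplicity (real convolutional layers pad instead):

```python
import numpy as np

def fft_conv2d(x, k):
    """2D circular convolution via the FFT (convolution theorem)."""
    kp = np.zeros_like(x)          # zero-pad kernel to the image size
    kh, kw = k.shape
    kp[:kh, :kw] = k
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(kp)))

def direct_circular_conv2d(x, k):
    """Reference direct implementation with the explicit O(K^2) inner loop."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += k[a, b] * x[(i - a) % h, (j - b) % w]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))
print(np.allclose(fft_conv2d(x, k), direct_circular_conv2d(x, k)))  # True
```

The FFT version costs O(HW log(HW)) per channel regardless of K, which is why it pays off for large kernels but loses to the direct method when K is small.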

