RETHINKING CONVOLUTION: TOWARDS AN OPTIMAL EFFICIENCY

Abstract

In this paper, we present our recent research on the computational efficiency of convolution. The convolution operation is the most critical component in the recent surge of deep learning research. A conventional 2D convolution takes O(C^2 K^2 HW) operations to calculate, where C is the channel size, K is the kernel size, and H and W are the output height and width. Such computation has become increasingly costly as these parameters have grown over the past few years to meet the needs of demanding applications. Among the various implementations of convolution, separable convolution has been proven effective in reducing the computational demand. For example, depth separable convolution reduces the complexity to O(CHW · (C + K^2)), while spatial separable convolution reduces the complexity to O(C^2 KHW). However, these are ad hoc designs that cannot guarantee an optimal separation in general. In this research, we propose a novel operator called optimal separable convolution, which can be calculated at O(C^{3/2} KHW) by optimally designing the internal number of groups and kernel sizes for general separable convolutions. When there is no restriction on the number of separated convolutions, an even lower complexity of O(CHW · log(CK^2)) can be achieved. Experimental results demonstrate that the proposed optimal separable convolution achieves improved accuracy-FLOPs and accuracy-#Params trade-offs over both conventional and depth/spatial separable convolutions.

1. INTRODUCTION

Tremendous progress has been made in recent years towards more accurate image analysis tasks, such as image classification, with deep convolutional neural networks (DCNNs) (Krizhevsky et al., 2012; Dai et al., 2020). However, the computational complexity of state-of-the-art DCNN models has also become increasingly high. This can significantly defer their deployment to real-world applications, such as mobile platforms and robotics, where resources are highly constrained (Howard et al., 2017; Dai et al., 2020). It is very much desired that a DCNN achieve better performance with less computation and fewer model parameters. The most time-consuming building block of a DCNN is the convolutional layer, and there have been many previous works aiming at reducing its amount of computation. Historically, researchers applied the Fast Fourier Transform (FFT) (Nussbaumer, 1981; Quarteroni et al., 2010) to implement convolution and gained great speed-ups for large convolutional kernels. For small convolutional kernels, a direct implementation is often still cheaper (Podlozhnyuk, 2007). Researchers have also explored low-rank approximation (Jaderberg et al., 2014; Ioannou et al., 2015) to implement convolutions. However, most of the existing methods start from a pre-trained model and mainly focus on network pruning and compression. In this research, we study how to design a separable convolution that achieves an optimal implementation in terms of computational complexity. Making a convolution separable has been proven to be an efficient way to reduce the computational complexity (Sifre & Mallat, 2014; Howard et al., 2017; Szegedy et al., 2016). Compared to the FFT and low-rank approximation approaches, a well-designed separable convolution is efficient for both small and large kernel sizes and does not require a pre-trained model to operate on.
Table 1 : A comparison of computational complexity and the number of parameters of the proposed optimal separable convolution and existing approaches. The proposed optimal separable convolution is much more efficient. In this table, C represents the channel size of convolution, K is the kernel size, H and W are the output height and width, g is the number of groups. "Vol. RF" represents whether the corresponding convolution satisfies the proposed volumetric receptive field condition. 

Figure 1: Volumetric receptive field and the proposed optimal separable convolution. (a) The volumetric receptive field (RF) of a convolution is the Cartesian product of its (spatial) RF and channel RF. (b) Illustrations of the channel connections for conventional (Conv2D), depth separable, and the proposed optimal separable convolutions.

In the DCNN research, the two most well-known separable convolutions are depth separable (Sifre & Mallat, 2014) and spatial separable (Szegedy et al., 2016) convolutions. Both are able to reduce the computational complexity of a convolution. A conventional 2D convolution has three hyper-parameters: the number of channels (C), the kernel size (K), and the spatial dimensions (H and W); its complexity is quadratic in both C and K, with an overall computational cost of O(C^2 K^2 HW). Depth separable convolution is constructed as a depth-wise convolution followed by a point-wise convolution, where a depth-wise convolution is a group convolution with its number of groups g = C and a point-wise convolution is a 1 × 1 convolution. Spatial separable convolution replaces a K × K kernel with K × 1 and 1 × K kernels. Different types of convolutions and their computational costs are summarized in Table 1. From this table, we can easily verify that depth separable convolution has a complexity of O(CHW · (C + K^2)) and spatial separable convolution has a complexity of O(C^2 KHW). Both depth separable and spatial separable convolutions follow an ad hoc design. They reduce the computational cost to some degree but will not normally achieve an optimal separation. A separable convolution in general has three sets of parameters: the internal number of groups, channel sizes, and kernel sizes of each separated convolution. Instead of setting these parameters in an ad hoc fashion, we design a scheme to achieve an optimal separation. The resulting separable convolution is called optimal separable convolution in this research.
To prevent the proposed optimal separable convolution from degenerating, we assume that the internal channel size is of order O(C) and propose the following volumetric receptive field condition. As illustrated in Fig. 1a, similar to the receptive field (RF) of a convolution, which is defined as the region in the input space that a particular CNN feature is looking at (or affected by) (Lindeberg, 2013), we define the volumetric RF of a convolution to be the volume in the input space that affects the CNN's output. The volumetric RF condition requires that a properly decomposed separable convolution maintains the same volumetric RF as the original convolution before decomposition. Hence, the proposed optimal separable convolution is equivalent to optimizing the internal number of groups and kernel sizes to achieve the computational objective (measured in FLOPs) while satisfying the volumetric RF condition. Formally, the objective function is defined by Equation (2) under the constraints defined by Equations (3)-(6). The solution to this optimization problem is described in detail in Section 2. We shall show that the proposed optimal separable convolution can be calculated at the order of O(C^{3/2} KHW). This is at least a factor of √C more efficient than the depth separable and spatial separable convolutions. The proposed optimal separable convolution can easily be generalized to an N-separable case, where the number of separated convolutions N can be optimized further. In such a generalized case, an even lower complexity of O(CHW · log(CK^2)) can be achieved. Extensive experiments are carried out to demonstrate the effectiveness of the proposed optimal separable convolution. As illustrated in Fig. 3, on the CIFAR10 dataset (Krizhevsky et al., 2009), the proposed optimal separable convolution achieves a better Pareto-frontier than both conventional and depth separable convolutions using the ResNet (He et al., 2016) architecture. To demonstrate that the proposed optimal separable convolution generalizes well to other DCNN architectures, we adopt the DARTS (Liu et al., 2018) architecture by replacing the depth separable convolution with the proposed optimal separable convolution. The accuracy is improved from 97.24% to 97.67% with fewer parameters. On the ImageNet dataset (Deng et al., 2009), the proposed optimal separable convolution also achieves improved performance. For the DARTS architecture, the proposed approach achieves 74.2% top-1 accuracy with only 4.5 million parameters.

2. THE PROPOSED APPROACH

2.1. CONVOLUTION AND ITS COMPUTATIONAL COMPLEXITY

A convolutional layer takes an input tensor B_{l-1} of shape (C_{l-1}, H_{l-1}, W_{l-1}) and produces an output tensor B_l of shape (C_l, H_l, W_l), where C_*, H_*, W_* are the input and output channels, feature heights, and widths. The convolutional layer is parameterized with a convolutional kernel F_l of shape (C_l, C_{l-1}, K^H_l, K^W_l), where K^*_l are the kernel sizes and the superscript indicates whether the size is aligned with the features in height or width. In this research, we take K^H_l, K^W_l = O(K) for complexity analysis. Formally, we have C_* = O(C), H_* = O(H), W_* = O(W), and

B_l(c_l, h_l, w_l) = Σ_{c_{l-1}} Σ_{k^H_l} Σ_{k^W_l} B_{l-1}(c_{l-1}, h_{l-1}, w_{l-1}) · F_l(c_l, c_{l-1}, k^H_l, k^W_l),

where h_l = h_{l-1} + k^H_l and w_l = w_{l-1} + k^W_l. Hence, the number of FLOPs (multiply-adds) for convolution is C_l H_l W_l · C_{l-1} K^H_l K^W_l = O(C^2 K^2 HW) and the number of parameters is C_l C_{l-1} K^H_l K^W_l = O(C^2 K^2). A group convolution consists of g convolutions with kernels of shape (C_l/g, C_{l-1}/g, K^H_l, K^W_l). Hence, it has O(C^2 K^2 HW / g) FLOPs and O(C^2 K^2 / g) parameters, where g is the number of groups. A depth-wise convolution is equivalent to a group convolution with g = C_* = C. A point-wise convolution is a 1 × 1 convolution. A depth separable convolution is composed of a depth-wise convolution and a point-wise convolution. A spatial separable convolution replaces a K × K kernel with K × 1 and 1 × K kernels. Different types of convolutions are summarized in Table 1, from which their FLOPs and numbers of parameters can be easily verified.
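The FLOPs and parameter counts in Table 1 can be tallied with a few lines of code. The following is a minimal sketch (not from the authors' implementation; the function names are ours) covering the conventional, grouped, depth separable, and spatial separable cases:

```python
# Multiply-add FLOPs and parameter counts for the convolution variants
# in Table 1 (a sketch; function names are ours, not the paper's).

def conv2d_cost(C_in, C_out, K_h, K_w, H, W, groups=1):
    """FLOPs (multiply-adds) and #params of a (grouped) 2D convolution."""
    flops = C_out * H * W * (C_in // groups) * K_h * K_w
    params = C_out * (C_in // groups) * K_h * K_w
    return flops, params

def depth_separable_cost(C, K, H, W):
    """Depth-wise (groups=C) followed by point-wise (1x1) convolution."""
    dw = conv2d_cost(C, C, K, K, H, W, groups=C)
    pw = conv2d_cost(C, C, 1, 1, H, W)
    return dw[0] + pw[0], dw[1] + pw[1]

def spatial_separable_cost(C, K, H, W):
    """A Kx1 convolution followed by a 1xK convolution."""
    a = conv2d_cost(C, C, K, 1, H, W)
    b = conv2d_cost(C, C, 1, K, H, W)
    return a[0] + b[0], a[1] + b[1]

C, K, H, W = 64, 3, 32, 32
print(conv2d_cost(C, C, K, K, H, W))       # O(C^2 K^2 HW): (37748736, 36864)
print(depth_separable_cost(C, K, H, W))    # O(CHW (C + K^2)): (4784128, 4672)
print(spatial_separable_cost(C, K, H, W))  # O(C^2 K HW): (25165824, 24576)
```

The printed numbers reproduce the ordering in Table 1: depth separable is the cheapest of the two ad hoc designs, and both beat the conventional convolution.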

2.2. RETHINKING CONVOLUTION AND THE VOLUMETRIC RECEPTIVE FIELD CONDITION

Separable convolution has been proven to be efficient in reducing the computational demand of convolution. However, existing approaches, including both depth separable and spatial separable convolutions, follow an ad hoc design. They are able to reduce the computational cost to some extent but will not normally achieve an optimal separation. In this research, we shall design an efficient convolution operator that achieves the computational objective by optimally designing its internal hyper-parameters. The resulting operator is called optimal separable convolution. One difficulty is that, if we do not pose any restriction on a separable convolution, optimizing the FLOPs target will result in a separable convolution that is equivalent to a degenerated channel scaling operator. Hence, we propose the following volumetric receptive field condition. As illustrated in Fig. 1a, the receptive field (RF) of a convolution is defined to be the region in the input space that a particular CNN feature is affected by (Lindeberg, 2013). We define the channel RF to be the channels that affect the CNN's output and the volumetric RF to be the Cartesian product of the RF and channel RF of this convolution. The volumetric RF of a convolution is thus the volume in the input space that affects the CNN's output. The volumetric RF condition requires that a properly decomposed separable convolution maintains (at least) the same volumetric RF as the original convolution before decomposition. Hence, the proposed optimal separable convolution is equivalent to optimizing its internal parameters while satisfying the volumetric RF condition.
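As a concrete illustration of the channel RF, the sketch below composes two group convolutions and checks whether every output channel's channel RF covers all input channels. The paper does not spell out the channel arrangement between the two convolutions; this sketch assumes contiguous groups with a ShuffleNet-style channel shuffle in between, which is one arrangement under which the channel condition g1 · g2 ≤ C2 yields a full channel RF:

```python
# Toy check of the channel receptive field (RF) when two group convolutions
# are stacked (our own sketch; the grouping/shuffle layout is an assumption).

def group_conv_rf(c_out, c_in, g):
    """For each output channel of a group conv (contiguous grouping),
    the set of input channels it can see."""
    return [set(range((o // (c_out // g)) * (c_in // g),
                      (o // (c_out // g) + 1) * (c_in // g)))
            for o in range(c_out)]

def shuffle_perm(c, g):
    """Channel shuffle: view as (g, c//g), transpose, flatten.
    perm[j] is the pre-shuffle index feeding post-shuffle channel j."""
    n = c // g
    return [(j % g) * n + j // g for j in range(c)]

def two_layer_channel_rf(C1, C2, C3, g1, g2):
    """Channel RF of each output of conv(g2) . shuffle . conv(g1)."""
    rf1 = group_conv_rf(C2, C1, g1)
    rf2 = group_conv_rf(C3, C2, g2)
    perm = shuffle_perm(C2, g1)
    return [set().union(*(rf1[perm[m]] for m in rf2[o])) for o in range(C3)]

# g1*g2 = 8 <= C2 = 8: every output sees all 8 input channels
print(all(rf == set(range(8)) for rf in two_layer_channel_rf(8, 8, 8, 2, 4)))  # True
# g1*g2 = 16 > C2 = 8: the channel RF is no longer full
print(all(rf == set(range(8)) for rf in two_layer_channel_rf(8, 8, 8, 4, 4)))  # False
```

This matches the channel condition of Section 2.3: once the product of the numbers of groups exceeds the internal channel size, some input channels can no longer influence some outputs, so the volumetric RF shrinks.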

2.3. OPTIMAL SEPARABLE CONVOLUTION

In this section, we discuss the case of a two-separable convolution. We present the discussion informally to gain intuition into the proposed approach; a formal proof is provided in the next section. Suppose that the shape of the original convolutional kernel is (C_out, C_in, K^H, K^W), where C_in, C_out are the input and output channels and (K^H, K^W) is the kernel size. Let C_1 = C_in and C_3 = C_out. For the proposed optimal separable convolution, we optimize the FLOPs as the computational objective while maintaining the original convolution's volumetric RF. Formally, the computational demand of the proposed separable convolution is

f(g_1, g_2, C_2, K^{H|W}_*) = C_2 C_1 K^H_1 K^W_1 HW / g_1 + C_3 C_2 K^H_2 K^W_2 HW / g_2.    (2)

In order to satisfy the volumetric RF condition, the following conditions need to be satisfied:

K^H_1 + K^H_2 - 1 = K^H    (Receptive Field Condition)    (3)
K^W_1 + K^W_2 - 1 = K^W    (4)
g_1 · g_2 ≤ C_2    (Channel Condition)    (5)
g_l ≤ min(C_l, C_{l+1})    (Group Convolution Condition)    (6)

We have three sets of parameters: the numbers of groups g_1, g_2, the internal channel size C_2, and the internal kernel sizes K^{H|W}_*. In this research, we shall assume that the internal channel size C_2 is of order O(C) and is preset according to a given policy; otherwise, g_1 = g_2 = C_2 = 1 would be a trivial solution, which could lead the separable convolution to be over-simplified and not applicable in practice. Typical policies for presetting C_2 include C_2 = min(C_1, C_3) (normal architecture), C_2 = (C_1 + C_3)/2 (linear architecture), and C_2 = max(C_1, C_3). The proposed problem is a constrained optimization, which is usually hard to solve directly. However, we shall show that the optimal solution satisfies g_1, g_2 ∼ √C. For large channel sizes, the optimal solution is usually an interior point of the solution space rather than on the boundary.

Let K^{H|W}_* be constants. By substituting g_2 = C_2/g_1 and setting f'(g_1) = 0, one can derive that

g_1 = √(C_1 C_2 K^H_1 K^W_1 / (C_3 K^H_2 K^W_2)) ∼ √C,    (7)

and

min_{g_1} f(g_1) = 2 √(C_1 C_2 C_3 K^H_1 K^W_1 K^H_2 K^W_2) HW = O(C^{3/2} KHW)    (8)

if we set K^{H|W}_1 = K^{H|W} and K^{H|W}_2 = 1. One interesting question is whether we can optimize the internal numbers of groups g_1, g_2 and the internal kernel sizes K^{H|W}_* simultaneously. For simplicity, we assume that the kernel sizes aligned in height and width are equal. By setting ∂f/∂g_1 = 0 and ∂f/∂K_1 = 0, one can derive that g_1 is the same as in Equation (7) and K_1 = K_2 = (K+1)/2; substituting them into Equation (8), one gets f(g_1, K_1) = O(C^{3/2} K^2 HW). This results in a higher complexity than O(C^{3/2} KHW). In fact, the solution to ∂f/∂g_1 = 0 and ∂f/∂K_1 = 0 is a saddle point. As illustrated in Fig. 2, given input channels C_1 = 64, output channels C_3 = 64, and kernel size (K^H, K^W) = (5, 5), we take C_2 = min(C_1, C_3) = 64. Setting ∂f/∂g_1 = 0 and ∂f/∂K_1 = 0 yields g_1 = √64 = 8 and K_1 = (5+1)/2 = 3, which is a saddle point.
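Equations (7)-(8) can be sanity-checked numerically. The sketch below (our own, with hypothetical helper names) substitutes g_2 = C_2/g_1 into Equation (2) and brute-forces over integer g_1; the minimizer matches the closed form:

```python
import math

# Numeric sanity check (ours) of Equations (7)-(8) for the two-separable
# case. K1 and K2 below denote the kernel-size products K_1^H * K_1^W and
# K_2^H * K_2^W of the two internal convolutions.

def flops_two_sep(g1, C1, C2, C3, K1, K2, HW=1):
    """Equation (2) after substituting g2 = C2 / g1."""
    g2 = C2 / g1
    return C2 * C1 * K1 * HW / g1 + C3 * C2 * K2 * HW / g2

C1 = C2 = C3 = 64
K1, K2 = 9, 1        # optimal split of a 3x3 kernel: (3, 3) and (1, 1)

g1_star = math.sqrt(C1 * C2 * K1 / (C3 * K2))   # Equation (7): 24.0
f_min = 2 * math.sqrt(C1 * C2 * C3 * K1 * K2)   # Equation (8) with HW=1: 3072.0

# brute force over integer g1 confirms the closed-form minimizer
best = min(range(1, C2 + 1),
           key=lambda g: flops_two_sep(g, C1, C2, C3, K1, K2))
print(g1_star, best, f_min)                     # 24.0 24 3072.0
```

In a real layer, g_1 and g_2 must additionally divide the channel sizes, so the continuous optimum is rounded to a nearby feasible divisor; the O(C^{3/2} KHW) order is unaffected.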

2.4. OPTIMAL SEPARABLE CONVOLUTION (GENERAL CASE)

In this section, we shall generalize the proposed optimal separable convolution from N = 2 to an optimal N and provide a formal proof. Suppose that the shape of the original convolutional kernel is (C_out, C_in, K^H, K^W). Let C_1 = C_in and C_{N+1} = C_out (C_{N+1} = C_3 for N = 2). The computational demand of an N-separable convolution is

f({g_*}, {K^{H|W}_*}) = C_2 C_1 K^H_1 K^W_1 HW / g_1 + ··· + C_{N+1} C_N K^H_N K^W_N HW / g_N.    (9)

For ease of analysis, we first introduce the notion of channels per group, n_l = C_l / g_l, which simply means: channels per group × number of groups = number of channels. Then we have

f({n_*}, {K^{H|W}_*}) = C_2 n_1 K^H_1 K^W_1 HW + ··· + C_{N+1} n_N K^H_N K^W_N HW    (10)

subject to the volumetric RF condition:

K^H_1 + K^H_2 + ··· + K^H_N = K^H + (N - 1)    (Receptive Field Condition)    (11)
K^W_1 + K^W_2 + ··· + K^W_N = K^W + (N - 1)    (12)
n_1 ··· n_N ≥ C_1  ⇔  g_1 ··· g_N ≤ C_2 ··· C_N    (Channel Condition)    (13)
n_l ≥ max(1, C_{l+1}/C_l)  ⇔  g_l ≤ min(C_l, C_{l+1})    (Group Convolution Condition)    (14)

We keep both notations g_l and n_l because, for the channel condition, it is intuitive that n_1 ··· n_N ≥ C_1 means the product n_1 ··· n_N needs to cover every node in the input channels C_1 = C_in; this is equivalent to the less intuitive condition g_1 ··· g_N ≤ C_2 ··· C_N. Similarly, for the group convolution condition, g_l ≤ min(C_l, C_{l+1}) means the number of groups cannot exceed the input and output channels of that group convolution, while n_l ≥ max(1, C_{l+1}/C_l) is less intuitive. Applying the arithmetic-geometric mean inequality to Equation (10), we get

f({n_*}, {K^{H|W}_*}) ≥ N · (C_1 C_2^2 ··· C_N^2 C_{N+1} K^H_1 ··· K^H_N K^W_1 ··· K^W_N / (g_1 ··· g_N))^{1/N} HW    (15)
                      ≥ N · (C_1 ··· C_{N+1} K^H_1 ··· K^H_N K^W_1 ··· K^W_N)^{1/N} HW.    (16)

The equality holds if and only if C_2 n_1 K^H_1 K^W_1 = ··· = C_{N+1} n_N K^H_N K^W_N. Let n_l = β_l n_1, where β_l = C_2 K^H_1 K^W_1 / (C_{l+1} K^H_l K^W_l).
Let β = Π β_i. We can solve n_1 = (C_1 / β)^{1/N} = (Π C_i Π K^H_i Π K^W_i)^{1/N} / (C_2 K^H_1 K^W_1) and, in general,

n_l = (Π_{i=1}^{N+1} C_i · Π_{i=1}^{N} K^H_i · Π_{i=1}^{N} K^W_i)^{1/N} / (C_{l+1} K^H_l K^W_l) ∼ C^{1/N}.    (17)

Note that the inequality (16) holds for arbitrary K^{H|W}_*; we need to further optimize over K^{H|W}_*. From the arithmetic-geometric mean inequality again, we get K^H_1 ··· K^H_N ≤ ((K^H_1 + ··· + K^H_N)/N)^N = ((K^H + N - 1)/N)^N, with equality if and only if K^H_1 = ··· = K^H_N = (K^H + N - 1)/N. However, we want the inequality reversed: instead of the maximum of this product, we seek its minimum. This still gives us a hint: the maximum is achieved when the internal kernel sizes are as even as possible, so the minimum should be achieved when they are as diverse as possible. In the extreme case, one of the internal kernel sizes takes K^H and all the rest take 1. A formal proof of this claim can be derived. Hence, we have

f({n_*}, {K^{H|W}_*}) ≥ N · (C_1 ··· C_{N+1} K^H K^W)^{1/N} HW = O(N C^{1+1/N} K^{2/N} HW).    (18)

It can be verified that, for N = 2, Equation (8) and Equation (18) match with the same complexity. By setting f'(N) = 0, we can derive that N = log(CK^2) and

min f({n_*}, {K^{H|W}_*}) = e · CHW · log(CK^2) = O(CHW · log(CK^2)),    (19)

where e = 2.71828... is the base of the natural logarithm. The proposed optimal separable convolution can have a spatial separable configuration: a single kernel takes (K^H, K^W), or two kernels take (K^H, 1) and (1, K^W). Besides, the proposed optimal separable convolution allows using a mask on the internal numbers of groups to solve an M-separable sub-problem (M < N). Details are discussed in Appendix A, where the implementation of the proposed optimal separable convolution is also presented in Algorithm 1.
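The optimal N can likewise be checked numerically. Taking all C_i = C in Equation (18) gives f(N) = N · C · (CK^2)^{1/N} · HW, whose continuous minimizer is N = ln(CK^2) with minimum e · C · HW · ln(CK^2). A quick scan (our own sketch) confirms this:

```python
import math

# Sketch (ours) of the N-separable bound, Equation (18), with all C_i = C:
#   f(N) = N * C * (C * K**2) ** (1/N) * HW,
# minimized at the continuous point N = ln(C * K**2), where the minimum is
# e * C * HW * ln(C * K**2), matching the O(CHW * log(C K^2)) result above.

def f_bound(N, C, K, HW=1):
    return N * C * (C * K * K) ** (1.0 / N) * HW

C, K = 256, 3
N_star = math.log(C * K * K)                    # ln(2304) ~ 7.74
best_N = min(range(1, 30), key=lambda n: f_bound(n, C, K))
print(N_star, best_N, f_bound(best_N, C, K), math.e * C * N_star)
```

For C = 256 and K = 3, the integer scan picks best_N = 8, the integer nearest the continuous optimum, and f_bound(8) is within a fraction of a percent of the e · C · ln(CK^2) lower bound.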

3. EXPERIMENTAL RESULTS

In this section, we carry out extensive experiments on benchmark datasets to demonstrate the effectiveness of the proposed optimal separable convolution scheme. In the proposed experiments, we use the prefix "d-" or "o-" to indicate that the conventional or depth separable convolutions in the baseline networks are replaced with depth separable (dsep) or the proposed optimal separable (osep) convolutions, respectively. In this research, we set the number of separated convolutions to N = 2. The details of the training settings for the proposed experiments are described in Appendix B. The proposed osep scheme can significantly reduce FLOPs/#Params. In Section 2, we proved that this reduction factor can be √C · K in theory. As illustrated by the solid lines in Fig. 3 (a) and (b), the orange solid curve lies in a region with significantly smaller x-values than the blue solid curve. This indicates that o-ResNet has significantly smaller FLOPs and fewer parameters than the ResNet baseline. For example, o-ResNet110 has even lower FLOPs (0.033 billion vs 0.041 billion) and fewer parameters (0.177 million vs 0.270 million) than ResNet20, yet with noticeably higher accuracy (92.12% vs 91.25%). This demonstrates that the proposed osep scheme can significantly reduce both the computational cost and the number of parameters of conventional convolutions. For dsep, the reduction factor is 1/(1/K^2 + 1/C), which is bounded by K^2. For 3 × 3 kernels, this reduction can be at most 9, whereas for the proposed osep scheme no such bound exists. The advantage of the proposed osep scheme over dsep is illustrated in Fig. 3 (a) and (b) by the orange and green solid curves, from which we can see that the proposed osep scheme is more efficient, with smaller x-values. The proposed o-ResNets can have 8x-16x smaller FLOPs and 10x-18x fewer parameters than the ResNet baselines in the proposed experiments. For fair comparisons, we introduce a channel multiplier in order to approximately match the FLOPs.
We use the suffix "_m<multiplier>" to indicate the channel multiplier. Note that FLOPs/#Params is proportional to the channel multiplier to the power 3/2 for osep. As illustrated in Fig. 3, the proposed optimal separable convolution scheme is much more efficient than conventional convolutions. The orange curve, including both solid and dashed parts, achieves a better accuracy-FLOPs Pareto-frontier than the blue curve. It is worth noting that, even under the same FLOPs, the number of o-ResNet parameters is smaller than that of ResNet by a large margin. This could result in a more regularized network with fewer parameters, preventing over-fitting and possibly contributing to the final performance. In Fig. 3, we also present the d-ResNet curves in dashed green, obtained by replacing the conventional convolutions with depth separable convolutions. As can be seen, d-ResNet achieves good accuracy-FLOPs balances for small networks (e.g., d-ResNet20 and d-ResNet32) but performs comparably or no better than conventional convolutions for large ones (e.g., d-ResNet56 and d-ResNet110). In summary, the proposed optimal separable convolution achieves better accuracy-FLOPs and accuracy-#Params Pareto-frontiers than both conventional and depth separable convolutions. To demonstrate that the proposed osep scheme generalizes well to other DCNN architectures, we adopt the DARTS (V2) (Liu et al., 2018) network as the baseline. The DARTS evaluation network has 20 cells and 36 initial channels; we increase the initial channels to 42 to match the FLOPs. By replacing the dsep convolutions in DARTS with the proposed osep convolutions, as illustrated in Table 2, the resulting o-DARTS improves the accuracy from 97.24% to 97.67% with fewer parameters (3.25 million vs 3.35 million). It is worth noting that it is very hard to significantly improve upon the DARTS search space. In Table 2, we also include three variants of DARTS, i.e.,
P-DARTS (Chen et al., 2019) , PC-DARTS (Xu et al., 2019) , and GOLD-DARTS (Bi et al., 2020) , with more advanced search strategies for comparison. As can be seen, o-DARTS even achieved higher accuracies than these advanced network architectures.
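As a side note, the FLOPs reduction factors discussed above (√C · K for osep, 1/(1/K^2 + 1/C) for dsep) follow directly from the Table 1 complexities and can be tabulated with a small sketch (function names are ours):

```python
# Reduction factors over a conventional convolution, from Table 1:
#   osep: C^2 K^2 HW / (C^{3/2} K HW) = sqrt(C) * K   (grows with C)
#   dsep: C^2 K^2 HW / (CHW (C + K^2)) = 1 / (1/K^2 + 1/C) < K^2

def osep_reduction(C, K):
    return (C ** 2 * K ** 2) / (C ** 1.5 * K)

def dsep_reduction(C, K):
    return (C ** 2 * K ** 2) / (C * (C + K ** 2))

for C in (16, 64, 256):
    print(C, osep_reduction(C, 3), dsep_reduction(C, 3))
```

For K = 3, the dsep factor stays below 9 at every channel size (5.76 at C = 16, 8.69 at C = 256), while the osep factor keeps growing (12, 24, 48), consistent with the unbounded-reduction claim in the text.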

Ablation Studies

We carry out ablation studies on the effects of the internal BatchNorms and non-linearities, and of the spatial separable configuration. We conclude that internal BatchNorms and non-linearities have no effect on the results yet introduce extra computation and parameters, while the spatial separable configuration leads to slightly worse performance. Hence, they are not adopted in this research. Details are presented in Table 5, with discussion in Appendix C.

3.2. EXPERIMENTAL RESULTS ON IMAGENET

We evaluate the proposed optimal separable convolution scheme on the benchmark ImageNet (Deng et al., 2009) dataset, which contains 1.28 million training images and 50,000 testing images.

3.2.1. IMAGENET40

Because carrying out experiments directly on the ImageNet dataset can be resource- and time-consuming, we resized all the images to 40×40 pixels. A 32×32 patch is randomly cropped and a random horizontal flip with a probability of 0.5 is applied before feeding into the network. No extra data augmentation strategies are used. The baseline ResNet architecture is a modified version of the one used on the CIFAR10 dataset, except that the channel sizes are set to be 4× larger. As can be seen in Table 3, by substituting conventional convolutions with the proposed optimal separable convolutions, the resulting o-ResNet achieves 4-5% performance gains (e.g., 49.97% vs 44.93% for 56 layers and 50.72% vs 46.74% for 110 layers) compared against the ResNet baselines. This demonstrates that the proposed optimal separable convolution scheme is much more efficient. o-ResNet56 and o-ResNet110 also have fewer parameters, which could contribute to a more regularized model. o-ResNet20 and o-ResNet32 have slightly more parameters because the last FC layer accounts for a large portion of the overhead with 1000 classes.

3.2.2. FULL IMAGENET

Similar to the experiments on CIFAR10, we replace the dsep convolutions in the DARTS (V2) network with the proposed osep convolutions to demonstrate that the proposed approach generalizes to other network architectures. The experiment is carried out on the full ImageNet dataset. The DARTS evaluation network has 14 cells and 48 initial channels; we increase the initial channel size to 56 to match the FLOPs. The resulting network is called o-DARTS. Experimental results are illustrated in Table 4. It can be seen that, with fewer parameters (4.50 million vs 4.72 million), the proposed o-DARTS network achieves higher top-1 (74.2% vs 73.3%) and top-5 (91.9% vs 91.3%) accuracies than the DARTS baseline. This indicates that the proposed osep is able to achieve better accuracy-FLOPs and accuracy-#Params balances than dsep convolutions. It is worth noting that in the proposed experiments we adopt ResNet and DARTS as the baselines because these are two of the most well-known architectures. In practice, one may simply replace the conventional or depth separable convolutions in a DCNN with the proposed optimal separable convolutions to reduce the computation and model parameters. By increasing the channel sizes, better-performing models can be expected. The proposed optimal separable convolution achieves improved accuracy-FLOPs and accuracy-#Params Pareto-frontiers over both conventional and depth separable convolutions. Hence, one can either match the accuracy to get a smaller model with reduced computation and model parameters, or match the FLOPs to get a better-performing model.

4. CONCLUSIONS

In this paper, we have presented a novel scheme called optimal separable convolution to improve the computational efficiency of convolution. Conventional convolution has a costly complexity of O(C^2 K^2 HW). The proposed optimal separable convolution scheme achieves a complexity of O(C^{3/2} KHW), which is even lower than that of depth separable convolution at O(CHW · (C + K^2)). Hence, the proposed optimal separable convolution has the potential to replace the usage of depth separable convolutions in a DCNN. Examples include but are not limited to the ResNet and DARTS architectures. The proposed optimal separable convolution also has a spatial separable configuration. A generalized N-separable case can achieve an even lower complexity of O(CHW · log(CK^2)). Another potential impact of the proposed optimal separable convolution is on the AutoML community: the proposed novel operator is able to enlarge the neural architecture search space. In a multi-objective optimization formulation, where both accuracy and FLOPs are optimized, we expect that more efficient network architectures can be discovered in the future using the proposed optimal separable convolution operator.

B TRAINING SETTINGS

Experiments on CIFAR10 for the ResNet architecture. The images are padded with 4 pixels and randomly cropped to 32 × 32 before being fed into the network. A random horizontal flip with a probability of 0.5 is also applied. All networks are trained with a standard SGD optimizer for 200 epochs. The initial learning rate is set to 0.1, with a decay of 0.1 at epochs 100 and 150. The batch size is 128. A weight decay of 0.0001 and a momentum of 0.9 are used.

Experiments on CIFAR10 for the DARTS architecture. We follow the same training settings as in (Liu et al., 2018): the network is trained with a standard SGD optimizer for 600 epochs with a batch size of 96. The initial learning rate is set to 0.025 with a cosine learning rate scheduler. A weight decay of 0.0003 and a momentum of 0.9 are used. Additional enhancements include cutout, path dropout with probability 0.2, and auxiliary towers with weight 0.4.

Experiments on ImageNet40 for the ResNet architecture. Each network is trained with a standard SGD optimizer for 20 epochs with the initial learning rate set to 0.1 and a decay of 0.1 at epochs 10 and 15. The batch size is 256, the weight decay is 0.0001, and the momentum is 0.9.

C ABLATION STUDIES

Internal BatchNorm and Non-linearity. For a DCNN, it is generally good practice to add a BatchNorm (BN) (Ioffe & Szegedy, 2015) and a non-linearity after each convolution. For the proposed optimal separable convolution, we ask whether it is still necessary to add such a BN and non-linearity after each of the internal separated convolutions. Experimental results are illustrated in Table 5. Comparing the "Internal BN and Non-linearity" column against the "Accuracy" column, we conclude that, with or without internal BN and non-linearity, the results are similar up to statistical variance. This is reasonable because the network has already been regularized by the outer BN and non-linearity layers of the macro architecture, so internal ones offer little to no additional improvement. Because internal BN and non-linearity introduce extra computation and parameters, we do not use them in this research.

Spatial Separable. Another variation of the proposed optimal separable convolution scheme is the spatial separable configuration. For Equation (16), the optimal solution is achieved when one of the internal kernel sizes takes K^{H|W} and all the rest take 1; it does not matter which of the internal kernel sizes takes K^{H|W}. Hence, we have this spatial separable variant: a single kernel takes (K^H, K^W), or two kernels take (K^H, 1) and (1, K^W). The detailed implementation is illustrated in Algorithm 1. While being spatial separable or not affects neither the FLOPs nor the number of parameters of the proposed optimal separable convolution, the results can differ slightly. As illustrated by the "Spatial Separable" column in Table 5, the spatial separable configuration leads to slightly worse performance. The reason might be that spatial separation fuses horizontal and vertical features separately, which could be less efficient than fusing them simultaneously.

D INFERENCE TIME FOR THE PROPOSED OPTIMAL SEPARABLE CONVOLUTION

FLOPs measures the best possible theoretical speed we are able to achieve. In this section, we further report the wall-clock inference time of the proposed optimal separable convolution scheme. The inference time is measured on a laptop computer with the Windows 10 operating system and an Intel i5-8250 CPU, which we use to simulate a mobile platform. The results are illustrated in Table 6. As can be seen, ResNet20, o-ResNet20_m3.875, and d-ResNet20_m2.75 have similar FLOPs (≈0.0405 billion), yet o-ResNet20_m3.875 and d-ResNet20_m2.75 take a slightly longer inference time (0.0468s vs 0.0310s). This is because the current implementation of grouped convolution in PyTorch is not optimized.

Depth separable convolution reduces the complexity to O(CHW · (C + K^2)). However, the proposed optimal separable convolution is even more efficient: it can be calculated at O(C^{3/2} KHW) and has the potential to replace the usage of depth separable convolutions. A second advantage of the proposed optimal separable convolution is that it can be applied to fully connected layers if we view them as 1 × 1 convolutional layers, whereas depth separable convolution cannot. Further, depth separable convolution requires the middle channel size to be equal to the input channel size, whereas for the proposed optimal separable convolution, the middle channel size can be set freely. Spatial separable convolution was originally developed to speed up image processing operations. For example, the Sobel kernel is a 3 × 3 kernel that can be written as (1, 2, 1)^T · (-1, 0, 1). The spatial separable form requires 6 instead of 9 parameters while performing the same operation. Spatial separable convolution is also adopted in the design of modern DCNNs. For example, in (Szegedy et al., 2016), the authors introduce spatial separation to the GoogLeNet (Szegedy et al., 2015) architecture. For the proposed optimal separable convolution, there is also a spatial separable configuration.
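The Sobel factorization mentioned above is easy to verify: the 3 × 3 kernel is the outer product of a 3 × 1 column and a 1 × 3 row, so two 1-D passes reproduce the 2-D filtering. A minimal pure-Python check (our own sketch, using correlation without kernel flipping):

```python
# Verify the Sobel separability claim: (1,2,1)^T * (-1,0,1) gives the 3x3
# kernel, and filtering with the column then the row matches the 2-D filter.

col = [1, 2, 1]      # vertical smoothing
row = [-1, 0, 1]     # horizontal derivative

sobel = [[c * r for r in row] for c in col]      # outer product, 6 params

def conv2d_valid(img, ker):
    """'Valid' 2-D correlation of a nested-list image with a nested-list kernel."""
    kh, kw = len(ker), len(ker[0])
    H, W = len(img), len(img[0])
    return [[sum(ker[i][j] * img[y + i][x + j]
                 for i in range(kh) for j in range(kw))
             for x in range(W - kw + 1)] for y in range(H - kh + 1)]

img = [[(x * y + x + 2 * y) % 7 for x in range(5)] for y in range(5)]
full = conv2d_valid(img, sobel)                              # one 3x3 pass
separated = conv2d_valid(conv2d_valid(img, [[c] for c in col]), [row])  # 3x1 then 1x3
print(full == separated)   # True
```

The equality holds for any separable kernel, not just Sobel, which is exactly why the spatial separable configuration preserves the output while storing only the two 1-D factors.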
In the body of literature, separable convolution is also referred to as factorized convolution or convolution decomposition. In this research, the proposed scheme is called optimal separable convolution, following the naming conventions of depth and spatial separable convolutions.



In this research, similar to (He et al., 2016), FLOPs are measured in the number of multiply-adds.

In multi-objective optimization, a Pareto-frontier is the set of parameterizations (allocations) that are all Pareto-optimal. An allocation is Pareto-optimal if there is no alternative allocation in which one participant's well-being can be improved without sacrificing any other's. Here, the Pareto-frontier represents the curve of the accuracies we are able to achieve for different FLOPs/#Params.

From Table 1, letting g = C and K = 1, a convolution will have C parameters and CHW FLOPs. This is in fact a channel scaling operator. Composition of such operators is not meaningful because the composition itself is equivalent to a single channel scaling operator.

The channel condition (5), g1·g2 ≤ C2 ⇔ (C1/g1)·(C2/g2) ≥ C1, means the product (C1/g1)·(C2/g2) needs to occupy each node in the input channel C1 = Cin to maintain the volumetric receptive field. This is further explained for the general case of the channel condition (13) in Section 2.4.

It is trivial to verify that, for any solution (g1, g2) with g1·g2 < C2, (g1, g2' = C2/g1 > g2) is another feasible solution with smaller FLOPs. Hence, the optimal solution must satisfy g1·g2 = C2.

For optimal separable, the speed-up over a conventional convolution is C²K²HW / (C^{3/2}KHW) = √C·K. For depth separable, it is C²K²HW / (CHW·(C + K²)) = 1/(1/K² + 1/C) < K².

Because there is an overhead and the channels are relatively small (e.g., 16) for CIFAR10 models, the advantage margin is noticeable but not remarkable. This margin can be bigger for larger channels.

We usually omit this channel multiplier suffix for simplicity when it is clear from the context that we are comparing different schemes under the same FLOPs/#Params and there is no confusion.
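The channel condition and the optimality claim in these footnotes can be checked by brute force for a two-layer separation. The sketch below uses illustrative channel sizes (C1 = C2 = C3 = 16, with one internal convolution carrying a 3 × 3 kernel and the other a 1 × 1 kernel); these are not the paper's configurations.

```python
# Brute-force check of the channel condition and the two-layer optimality claim.
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

C1 = C2 = C3 = 16   # illustrative channel sizes
K1, K2 = 3, 1       # one internal conv carries the spatial kernel, the other is 1x1

# Channel condition: g1*g2 <= C2  <=>  (C1/g1)*(C2/g2) >= C1.
for g1 in divisors(C1):
    for g2 in divisors(C2):
        assert (g1 * g2 <= C2) == ((C1 / g1) * (C2 / g2) >= C1)

# FLOPs per output pixel of two grouped convolutions.
def flops(g1, g2):
    return (C1 * C2 // g1) * K1 * K1 + (C2 * C3 // g2) * K2 * K2

# Among feasible pairs, the FLOPs minimizer satisfies g1 * g2 = C2.
feasible = [(g1, g2) for g1 in divisors(C1) for g2 in divisors(C2)
            if g1 * g2 <= C2]
best = min(feasible, key=lambda p: flops(*p))
assert best[0] * best[1] == C2
```

The double loop confirms that the two forms of the condition are algebraically equivalent (C1 cancels), and the search confirms that any slack in g1·g2 < C2 can be converted into fewer FLOPs.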






e.g., C2 = min(C1, C3)/4 (bottleneck architecture (He et al., 2016)), or C2 = 4·min(C1, C3) (inverted residual architecture (Sandler et al., 2018)).

Figure 2: Given channels C1 = C2 = C3 = 64 and kernel sizes K_H = K_W = 5 in Equation (2), by setting ∂f/∂g1 = 0 and ∂f/∂K1 = 0, the solution g1 = 8, K1 = 3 is a saddle point.

Experimental results on CIFAR10 for the ResNet architecture (best viewed in color). The proposed optimal separable convolution (o-ResNet) achieves improved (a) accuracy-FLOPs and (b) accuracy-#Params Pareto-frontiers over both the conventional (ResNet) and depth separable (d-ResNet) convolutions.

EXPERIMENTAL RESULTS ON CIFAR10

CIFAR10 (Krizhevsky et al., 2009) is a dataset consisting of 50,000 training images and 10,000 testing images. The images have a resolution of 32 × 32 and are categorized into 10 object classes. In the proposed experiments, we use ResNet (He et al., 2016) as baselines and replace the conventional convolutions in ResNet with dsep and osep convolutions, resulting in d-ResNet and o-ResNet.
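The m2.75 suffix on d-ResNet20 denotes a channel multiplier applied so that the separable variant matches the conventional baseline's FLOPs. Under one reading of that convention (an assumption on our part), the required multiplier for depth separable convolution can be solved in closed form; with the base channel size of 16 mentioned in the footnotes for CIFAR10 models and 3 × 3 kernels, it comes out near 2.7, consistent with the m2.75 setting:

```python
import math

def match_multiplier_dsep(C, K):
    """Multiplier m such that a depth separable layer at channel m*C costs the
    same multiply-adds as a conventional layer at channel C:
        (m*C) * (m*C + K^2) = C^2 * K^2   =>   C*m^2 + K^2*m - C*K^2 = 0.
    (Our interpretation of the channel-multiplier suffix; an assumption.)"""
    a, b, c = C, K * K, -C * K * K
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

m = match_multiplier_dsep(16, 3)   # base channel 16, 3x3 kernels
assert 2.5 < m < 3.0               # close to the d-ResNet20 m2.75 setting
```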

ImageNet for the DARTS architecture We follow the training settings in (Chen et al., 2019) for multi-GPU training: the images are randomly resized and cropped into 224 × 224 patches with a random scale in [0.08, 1.0] and a random aspect ratio in [0.75, 1.33]. Random horizontal flips and color jitter are also applied. The network is trained from scratch for 250 epochs with batch size 1024 on 8 GPUs, using an SGD optimizer with an initial learning rate of 0.5, a momentum of 0.9, and a weight decay of 3e-5. The learning rate is decayed linearly after each epoch. Additional enhancements include label smoothing with weight 0.1 and auxiliary towers with weight 0.4.
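The linear decay of the learning rate can be sketched in a few lines; the zero endpoint at the final epoch is our assumption of what "decayed linearly" means here, and the constants come from the settings above (initial rate 0.5, 250 epochs).

```python
def linear_lr(epoch, base_lr=0.5, total_epochs=250):
    # Linear decay after each epoch, from base_lr at epoch 0 towards 0 at the
    # last epoch; the zero endpoint is our assumption of "decayed linearly".
    return base_lr * (1.0 - epoch / total_epochs)

schedule = [linear_lr(e) for e in range(250)]
assert schedule[0] == 0.5
assert all(a > b for a, b in zip(schedule, schedule[1:]))  # strictly decreasing
```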


Experimental results on CIFAR10 for DARTS. The proposed optimal separable convolution (o-DARTS) generalizes well to the DARTS architecture, and achieves improved accuracy with approximately the same FLOPs and fewer parameters. DARTS uses depth separable convolution, and an optional 'd-' is prefixed.

Experimental results on ImageNet40 for the ResNet architecture. The proposed optimal separable convolution (o-ResNet) achieves a 4-5% performance gain over the ResNet baseline.

Experimental results on full ImageNet for the DARTS architecture. The proposed o-DARTS achieves 74.2% top-1 accuracy with only 4.5 million parameters.

Experimental results on CIFAR10 for the ResNet architecture, with ablation studies of internal BN and non-linearity and of the spatial separable configuration.

Experimental results on CIFAR10 for the ResNet architecture, with inference time measured on a Windows 10 laptop with an Intel i5-8250 CPU.

RELATED WORK

There have been many previous works aiming at reducing the amount of computation in convolution. Historically, researchers apply the Fast Fourier Transform (FFT) (Nussbaumer, 1981; Quarteroni et al., 2010) to implement convolution. For 1D convolution, the FFT reduces the number of computations for H points from O(H²) to O(H log H). For 2D convolution, FFT-2D reduces the computational complexity from O(HW · K²) to O(HW · (log H + log W)) (Podlozhnyuk, 2007). Hence, it can be easily concluded that the FFT gains great speed up for large convolutional kernels. For small convolutional kernels (K ≪ H or W), a direct implementation is often still cheaper. Researchers also explore low rank approximation (Jaderberg et al., 2014; Ioannou et al., 2015) to implement convolutions. However, most of the existing methods obtain moderate efficiency improvements, and they usually require a pre-trained model and mainly focus on network pruning and compression.

In recent state-of-the-art deep CNN models, several heuristics are adopted to reduce the heavy computation in convolution. For example, in (He et al., 2016), the authors use a bottleneck structure, while in (Sandler et al., 2018), the authors adopt an inverted bottleneck structure. Such heuristics may require further ad hoc design to work in practice, and they offer no guarantee of optimality.

Among various implementations of convolution, separable convolution has been proven to be more efficient in reducing the computational demand. Depth separable convolution is explored extensively in modern DCNNs (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019; Liu et al., 2018; Tan & Le, 2019). It reduces the computational cost of a conventional convolution from O(C²K²HW) to O(CHW · (C + K²)).
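The FFT route mentioned above can be illustrated in a few lines: zero-pad both signals to the full output length, multiply their spectra, and invert. The sketch below assumes NumPy is available and checks the result against a direct convolution.

```python
import numpy as np

def fft_conv1d(x, k):
    """Linear 1D convolution via the FFT: zero-pad to length H + K - 1,
    multiply the spectra, and invert -- O(H log H) versus the direct
    method's O(H * K) multiply-adds."""
    n = len(x) + len(k) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])       # a small difference kernel
assert np.allclose(fft_conv1d(x, k), np.convolve(x, k))
```

The zero-padding converts the FFT's circular convolution into the linear convolution a convolutional layer needs, which is why the direct method stays cheaper when K is much smaller than H.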

Algorithm 1 The Algorithm for Optimal Separable Convolution

Input: input channel C1 = Cin, output channel C_{N+1} = Cout, kernel size (K_H, K_W), number of separated convolutions N.
Optional input: internal kernel sizes (preset), internal number of groups (masked values), spatial separable (True or False).
For certain l, re-optimize the remaining internal number of groups with a masked number of groups by presetting g_l = 1. Because n_l ∼ N√C, for large channel sizes we rarely need to re-optimize.

A ALGORITHMIC DETAILS OF THE PROPOSED OPTIMAL SEPARABLE CONVOLUTION

For the proposed optimal separable convolution, one of the internal kernel sizes can take K_{H|W}, while the rest take 1. In this research, we simply select the middle kernel size as (K_H, K_W). In a spatial separable configuration, we select the middle two convolutions to have kernel sizes (K_H, 1) and (1, K_W). It is worth noting that all these configurations have the same FLOPs. Unlike spatial separable convolution, where the spatial separable configuration reduces the complexity from K² to 2K, no further reduction occurs here because the complexity has already been reduced to O(K) for the proposed optimal separable convolution. Another interesting property of the proposed optimal separable convolution is that it prefers large kernel sizes over small ones.

For the proposed optimal separable convolution, we are able to preset the internal convolutional kernel sizes according to a custom policy and optimize only the internal number of groups. Furthermore, we are able to preset a portion of the internal number of groups to certain values and optimize only the remaining internal number of groups. Suppose that the internal channel and kernel sizes are given. Without loss of generality, we assume that g_{M+1}, …, g_N are preset. The proposed optimal separable problem then becomes an M-separable convolution sub-problem (M < N), given in Equation (24). This M-separable sub-problem can be solved by the same algorithm. A detailed implementation of the proposed optimal separable convolution is described in Algorithm 1.
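The kernel-size assignment policy above can be sketched as follows. This is our reading of the stated policy, not the paper's reference code; the helper name is our own.

```python
def internal_kernel_sizes(N, K_H, K_W, spatial_separable=False):
    """Assign internal kernel sizes for an N-way separable convolution:
    the middle internal convolution takes (K_H, K_W) and the rest take
    (1, 1); in the spatial separable configuration the middle two take
    (K_H, 1) and (1, K_W).  (A sketch of the stated policy.)"""
    sizes = [(1, 1)] * N
    mid = (N - 1) // 2
    if spatial_separable:
        assert N >= 2
        sizes[mid] = (K_H, 1)
        sizes[mid + 1] = (1, K_W)
    else:
        sizes[mid] = (K_H, K_W)
    return sizes

assert internal_kernel_sizes(3, 5, 5) == [(1, 1), (5, 5), (1, 1)]
assert internal_kernel_sizes(4, 5, 5, spatial_separable=True) == \
       [(1, 1), (5, 1), (1, 5), (1, 1)]
```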

