DILATED CONVOLUTION WITH LEARNABLE SPACINGS

Abstract

Recent works indicate that convolutional neural networks (CNNs) need large receptive fields (RF) to compete with visual transformers and their attention mechanism. In CNNs, RFs can simply be enlarged by increasing the convolution kernel sizes. Yet the number of trainable parameters, which scales quadratically with the kernel's size in the 2D case, rapidly becomes prohibitive, and the training is notoriously difficult. This paper presents a new method to increase the RF size without increasing the number of parameters. The dilated convolution (DC) has already been proposed for the same purpose. DC can be seen as a convolution with a kernel that contains only a few non-zero elements placed on a regular grid. Here we present a new version of the DC in which the spacings between the non-zero elements, or equivalently their positions, are no longer fixed but learnable via backpropagation, thanks to an interpolation technique. We call this method "Dilated Convolution with Learnable Spacings" (DCLS) and generalize it to the n-dimensional convolution case. However, our main focus here will be on the 2D case for computer vision only. We first tried our approach on ResNet50: we drop-in replaced the standard convolutions with DCLS ones, which increased the accuracy of ImageNet1k classification at iso-parameters, but at the expense of the throughput. Next, we used the recent ConvNeXt state-of-the-art convolutional architecture and drop-in replaced the depthwise convolutions with DCLS ones. This not only increased the accuracy of ImageNet1k classification but also of typical downstream and robustness tasks, again at iso-parameters but this time with negligible cost on throughput, as ConvNeXt uses separable convolutions. Conversely, classic DC led to poor performance with both ResNet50 and ConvNeXt. The code of the method is based on PyTorch and is available.

1. INTRODUCTION

The receptive field of a deep convolutional network is a crucial element to consider when dealing with recognition and downstream tasks in computer vision. For instance, a logarithmic relationship between classification accuracy and receptive field size was observed in Araujo et al. (2019). This tells us that large receptive fields are necessary for high-level vision tasks, but with logarithmically decreasing rewards and thus a higher computational cost to reach them.

Recent advances in vision transformers (Dosovitskiy et al., 2020) and in CNNs (Liu et al., 2022b; Ding et al., 2022; Trockman & Kolter, 2022; Liu et al., 2022a) highlight the beneficial effect that a large convolution kernel can have, compared to the 3 × 3 kernels traditionally used in previous state-of-the-art CNN models (He et al., 2016). However, when naively increasing the kernel size, the accuracy rapidly plateaus or even decreases. For example, in ConvNeXt, the best accuracy was achieved by a 7 × 7 kernel (Liu et al., 2022b;a). Using a structural re-parameterization trick, Ding et al. (2022) demonstrated the benefit of increasing the kernel size up to 31 by 31. Thereafter, Liu et al. (2022a) showed that there was still room for improvement by moving to 51 by 51, using the depthwise implicit matrix multiplication (gemm) method developed by Ding et al. (2022), for which the implementation has been integrated into the open-sourced framework MegEngine (Megvii, 2020), in addition to a spatial separation of the depthwise kernel followed by an accumulation of the resulting activations. Yet, all these improvements have a cost in terms of memory and computation, and it does not seem possible to increase the size of the kernels indefinitely.

One of the first approaches that allows inflating the receptive field of a convolutional layer without increasing the number of learnable parameters nor the computational cost is called dilated convolution (DC). DC, or "atrous convolution", was first described in Holschneider et al. (1990) and Shensa (1992) under the name "convolution with a dilated filter", before being referred to as "dilated convolution" in Yu & Koltun (2015). The purpose of this approach is to inflate the convolutional kernel by regularly inserting spaces (i.e. zeros) between the kernel elements, as depicted in Figure 2b. The spacing between elements is thus constant; it is a hyper-parameter usually referred to as "dilation" or "dilation rate". Despite its early successes in classification since Yu et al. (2017), and its even more convincing results in semantic segmentation Sandler et al. (2018); Chen et al. (2017; 2018) and object detection Lu et al. (2019), DC has gradually fallen out of favor and has been confined to downstream tasks such as those described above. Without much success, Ding et al. (2022) tried to implement DC in their RepLKNet architecture. Our own investigation on ResNet and ConvNeXt with standard dilated convolution (Section 4.2) will lead to a similar conclusion. The failure of this method for classification tasks could be attributed to the great rigidity imposed by its regular grid, as discussed in Wang & Ji (2018).
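As a concrete illustration of this fixed spacing, the following minimal PyTorch snippet (ours, with illustrative values, not taken from the paper) declares a standard dilated convolution: with kernel size k and dilation rate d, the receptive field of the layer grows to d(k - 1) + 1 along each spatial axis while the number of trainable weights stays at k × k per input/output channel pair.

import torch
import torch.nn as nn

# Standard dilated convolution: kernel elements lie on a regular grid with a
# fixed, non-learnable spacing ("dilation rate") d.
k, d = 3, 3                                   # illustrative kernel size and dilation
conv = nn.Conv2d(64, 64, kernel_size=k, dilation=d,
                 padding=d * (k - 1) // 2)    # "same" padding for odd k, stride 1
x = torch.randn(1, 64, 56, 56)
print(conv(x).shape)                          # torch.Size([1, 64, 56, 56])
# Effective kernel extent: d * (k - 1) + 1 = 7, with only k * k = 9 weights per
# input/output channel pair.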



In this context, we propose DCLS (Dilated Convolution with Learnable Spacings), a new convolution method. In DCLS, the positions of the non-zero elements within the convolutional kernels are learned in a gradient-based manner. The inherent problem of non-differentiability due to the integer nature of the positions in the kernel is circumvented by interpolation (Fig. 2c). DCLS is a differentiable method that only constructs the convolutional kernel; to actually apply it, we can either use the native convolution provided by PyTorch or a more advanced one such as the depthwise implicit gemm convolution method (Ding et al., 2022), using the constructed kernel. DCLS comes in six sub-versions: 1D, 2D, 3D and what we call N-MD methods, namely "2-1D, 3-1D and 3-2D", where an N-dimensional kernel is used but positions are learned only along M dimension(s). The main focus of this paper is the 2D version.
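To make the kernel-construction step concrete, here is a minimal, self-contained PyTorch sketch of a DCLS-like depthwise layer. It is not the paper's implementation (the authors release their own PyTorch code); the class name, the initialization and the sizes (9 elements inside a 17 × 17 dilated kernel) are illustrative assumptions, and only bilinear interpolation is shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseDCLS2d(nn.Module):
    """Hypothetical, simplified DCLS-style depthwise convolution (a sketch, not
    the authors' implementation). Each channel owns `count` weighted elements
    whose real-valued 2D positions inside a `size` x `size` dilated kernel are
    learned by backpropagation; bilinear interpolation makes them differentiable."""

    def __init__(self, channels, count=9, size=17):
        super().__init__()
        self.channels, self.count, self.size = channels, count, size
        self.weight = nn.Parameter(torch.randn(channels, count) * 0.1)
        # Real-valued (y, x) positions, initialized uniformly inside the kernel.
        self.pos = nn.Parameter(torch.rand(channels, count, 2) * (size - 1))

    def build_kernel(self):
        # Bilinear "scattering": each weight is spread over the 4 nearest
        # integer grid points, so gradients also flow to the positions.
        p = self.pos.clamp(0, self.size - 1 - 1e-4)     # (channels, count, 2)
        p0 = p.floor()
        frac = p - p0                                    # fractional parts in [0, 1)
        p0 = p0.long()
        kernel = self.weight.new_zeros(self.channels, self.size * self.size)
        for dy in (0, 1):
            for dx in (0, 1):
                cy = frac[..., 0] if dy else 1.0 - frac[..., 0]
                cx = frac[..., 1] if dx else 1.0 - frac[..., 1]
                iy = (p0[..., 0] + dy).clamp(max=self.size - 1)
                ix = (p0[..., 1] + dx).clamp(max=self.size - 1)
                kernel = kernel.scatter_add(1, iy * self.size + ix,
                                            self.weight * cy * cx)
        return kernel.view(self.channels, 1, self.size, self.size)

    def forward(self, x):
        # Depthwise convolution with the kernel constructed on the fly.
        return F.conv2d(x, self.build_kernel(),
                        padding=self.size // 2, groups=self.channels)

# Usage sketch: drop-in replacement for a k x k depthwise convolution.
layer = DepthwiseDCLS2d(channels=96)
y = layer(torch.randn(2, 96, 56, 56))   # -> torch.Size([2, 96, 56, 56])

Note that the two position coordinates per element are themselves trainable, so they must be counted when comparing models at iso-parameters.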

Figure 1: Classification accuracy on ImageNet-1K as a function of latency (i.e. inverse of the throughput). Dot diameter corresponds to the number of parameters.

