CONVOLUTION AND POOLING OPERATION MODULE WITH ADAPTIVE STRIDE PROCESSING EFFECT

Abstract

Convolutional neural networks are among the representative models of deep learning and have a wide range of applications. Convolution and pooling are two key operations in convolutional neural networks: they extract input features and map low-level semantic features to high-level semantic features. Stride is an important parameter of both operations; it is the distance the convolution kernel (or pooling kernel) slides at each step of the convolution (or pooling) operation. The stride affects the granularity of feature extraction and the selection (filtering) of features, and thus the performance of the network. At present, when training a convolutional neural network, the contents of the convolution and pooling kernels can be determined by gradient-descent-based optimization. The stride, however, usually cannot be treated the same way and must be selected manually as a hyperparameter. Most existing work chooses a fixed stride, for example 1. In practice, different tasks or inputs may require different strides for better processing. This paper therefore views the role of the stride in convolution and pooling from the perspective of sampling and proposes a convolution and pooling operation module with an adaptive-stride processing effect. With the proposed module, the feature map produced by convolution or pooling is no longer limited to equal-interval downsampling (feature extraction) with a fixed stride, but is extracted adaptively according to changes in the input features. We apply the proposed module to many convolutional neural network models, including VGG, AlexNet, and MobileNet for image classification, YOLOX-S for object detection, and U-Net for image segmentation.
Simulation results show that the proposed module can effectively improve the performance of existing models.

1. INTRODUCTION

Research on convolutional neural networks began in the 1980s and 1990s; the time-delay neural network and LeNet-5 were among the earliest convolutional neural networks. After the turn of the 21st century, with the development of deep learning theory and improvements in computing hardware, convolutional neural networks advanced rapidly and were applied in computer vision (Krizhevsky et al., 2012), natural language processing (Qiuqiang Kong, 2020), and other fields. The operators in a convolutional neural network include the convolution operator and the pooling operator. The elements of the convolution operator include the size of the convolution kernel, the values of the kernel, the stride of the convolution operation, and so on; the elements of the pooling operator include the stride, padding, and so on. In a convolutional neural network, the stride refers to the distance the convolution kernel or pooling operator slides each time over the region being convolved or pooled. Convolution and pooling extract features from the input and downsample it; the stride determines the granularity of feature extraction and the trade-off (filtering) among features, and thereby influences the performance of the network. In current convolutional neural networks, the stride of the convolution or pooling operation is manually selected as a hyperparameter and kept fixed, meaning the convolution kernel (pooling kernel) slides the same distance each time. A stride fixed to 1 causes redundancy in the extracted features, while a larger fixed stride, for example a pooling stride fixed to 2, may cause a loss of features; different tasks may therefore require different strides. Moreover, a stride added to the model as a directly learned parameter cannot be updated by backpropagation during training.
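To make the role of the stride concrete, the short sketch below (pure Python; the function name is our own) shows how the stride controls the number of positions a kernel visits, and hence how densely features are extracted: a size-k kernel over a length-n input with stride s visits ⌊(n − k)/s⌋ + 1 valid positions.

```python
def num_positions(n, k, s):
    # Number of positions a size-k kernel visits over a length-n input
    # when it slides s elements at a time (valid positions, no padding).
    return (n - k) // s + 1

# Stride 1 extracts a feature at every position (fine-grained, redundant);
# stride 2 keeps every other position (coarser, fewer features).
print(num_positions(7, 3, 1))  # -> 5
print(num_positions(7, 3, 2))  # -> 3
```

Because `s` appears only inside integer indexing, it cannot receive a gradient, which is why the stride is normally fixed as a hyperparameter rather than learned.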
Self-adaptation is the process of automatically adjusting the processing method, order, parameters, boundary conditions, or constraints according to the characteristics of the data being processed, so that the processing adapts to the statistical distribution and structure of that data and achieves the best effect. Introducing self-adaptation into a convolutional neural network makes the network more responsive to changes in the input and allows it to perform better, for example through improvements in processing speed, precision, and recall. Existing forms of adaptation in convolutional neural networks include adaptation of the depth of convolutional operators (Veit & Belongie, 2019), of the width of convolutional operators (Jiaqi Ma, 2018), of the parameters of the network (Adam W. Harley, 2017), of the receptive field of the convolution operator (Xizhou Zhu, 2019), and of the pooling operator (Chen-Yu Lee, 2015), among others. Since a directly learned stride cannot be updated by backpropagation when added to a model, it cannot be determined by gradient-descent-based optimization and can only be selected manually as a hyperparameter; as a result, academic research on stride adaptation in convolutional neural networks is largely absent. (Rachid Riad, 2022) proposed an adaptive-stride method for images in the frequency domain, learning a stride that filters out high-frequency signals. However, that learned stride differs from the stride concept in traditional convolutional neural networks.
In this paper, a convolution and pooling operation module with adaptive stride for the traditional convolution operation, namely the CAS (Convolution operation Adaptive Stride) and PAS (Pooling operation Adaptive Stride) modules, is proposed. We observe that in a traditional convolution or pooling operator, the result obtained with a fixed stride S (S > 1) can essentially be regarded as the result obtained with stride 1 followed by equal-interval downsampling, that is, by discarding or suppressing part of the stride-1 results. Generalizing this insight, a more general adaptive-stride convolution or pooling effect can be obtained as follows: first, apply convolution or pooling with stride 1 to the input feature map to obtain a preliminary result (the initial feature map); then, generate a suitable mask feature map based on the input features; finally, combine the mask feature map with the initial feature map so that some results are discarded or suppressed (for example, insignificant results are set to 0), yielding a final feature map with an adaptive-stride convolution or pooling effect. The key aspect of the adaptation is that the final feature map is no longer limited to equal-interval downsampling (feature extraction) with a fixed stride S, but is extracted adaptively according to changes in the input features.
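The observation underlying the module can be verified directly: a stride-S convolution equals a stride-1 convolution followed by keeping every S-th result. A minimal 1D sketch (pure Python; the function name and example values are our own):

```python
def conv1d(x, k, stride=1):
    """Valid 1D cross-correlation of signal x with kernel k at a given stride."""
    n, m = len(x), len(k)
    return [sum(x[i + j] * k[j] for j in range(m))
            for i in range(0, n - m + 1, stride)]

x = [1, 3, 2, 5, 4, 7, 6]
k = [1, 1, 1]

# Stride-2 convolution ...
stride2 = conv1d(x, k, stride=2)
# ... equals stride-1 convolution followed by equal-interval downsampling.
stride1_subsampled = conv1d(x, k, stride=1)[::2]
assert stride2 == stride1_subsampled  # both are [6, 11, 17]
```

The proposed module replaces the fixed `[::S]` subsampling step with a mask generated from the input, so that which stride-1 results survive depends on the input itself rather than on a fixed interval.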

2. METHODS

In convolutional neural networks, convolution and pooling operations extract features, map low-level semantic features to higher-level semantic features, and perform downsampling. Classification relies more on local features, while localization relies more on global features. A convolution or pooling stride fixed to 1 extracts detailed features but causes feature redundancy; because the position of the object differs from image to image, a fixed stride larger than 1, which slides the same distance each time, may cause a loss of features. Therefore, a better approach is to adaptively determine the stride for each movement of the convolution kernel or pooling operator according to the input feature map, so that the convolution or pooling operation can focus on the key regions. This differs slightly from the traditional concept of a fixed stride, in which the convolution kernel or pooling operator moves the same distance every time. As shown in Figure 1, the red region is the convolved or pooled region.
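The adaptive-stride pooling effect described above can be sketched in a few lines. Note that this is a hand-crafted illustration, not the paper's actual module: the importance score below (local variation of the input) is a stand-in for the learned mask-generating branch, which in the real PAS module would be a small trainable network.

```python
def adaptive_stride_pool(x, k=2, keep=3):
    """Illustrative 1D adaptive-stride max pooling.

    Computes dense (stride-1) pooling, then keeps only the `keep` positions
    whose hand-crafted importance score is highest, suppressing the rest to 0.
    """
    # Step 1: stride-1 max pooling with window size k (the initial feature map).
    dense = [max(x[i:i + k]) for i in range(len(x) - k + 1)]
    # Step 2: importance score per position (here: local variation of the input;
    # a placeholder for the learned mask-generating branch).
    scores = [abs(x[i + k - 1] - x[i]) for i in range(len(x) - k + 1)]
    # Step 3: mask that keeps the `keep` highest-scoring positions.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    mask = [1 if i in top else 0 for i in range(len(dense))]
    # Step 4: combine mask and initial feature map (suppressed results become 0).
    return [d * m for d, m in zip(dense, mask)]

print(adaptive_stride_pool([1, 9, 2, 3, 8, 2, 1, 7]))  # -> [9, 9, 0, 0, 8, 0, 0]
```

Unlike a fixed stride of 2, which would keep every other position regardless of content, the surviving positions here depend on where the input changes most, which is the adaptive behaviour the module aims for.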

