MIND THE POOL: CONVOLUTIONAL NEURAL NETWORKS CAN OVERFIT INPUT SIZE

Abstract

We demonstrate how convolutional neural networks can overfit the input size: accuracy drops significantly at certain input sizes compared with favorable ones. This issue is inherent to pooling arithmetic, with standard downsampling layers playing a major role in favoring certain input sizes and skewing the weights accordingly. We present a solution to this problem by depriving these layers of the arithmetic cues they use to overfit the input size. Through various examples, we show how our proposed spatially-balanced pooling improves the generalization of the network to arbitrary input sizes and its robustness to translational shifts.

* Work done mainly while at Meta AI.
1 We refer to this as size overfitting for brevity.
2 By pooling we refer to any stride-based downsampling, such as max pooling and strided convolution.

1. INTRODUCTION

Convolutional neural networks (CNNs) are versatile models in machine learning. Early CNN architectures used in image classification were restricted to a fixed input size. For example, AlexNet (Krizhevsky et al., 2012) was designed to classify 224 × 224 images from ImageNet (Deng et al., 2009). To facilitate model comparison, this size has been adopted in subsequent ImageNet classifiers such as VGGNet (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016).

The adoption of fully-convolutional architectures (Long et al., 2015; Springenberg et al., 2015) and global pooling methods (Lin et al., 2014; He et al., 2015) demonstrated how CNNs can process inputs of arbitrary size. Fully-convolutional networks eliminate the fully-connected layers in CNN backbones, preserving 2D feature maps as the output of these backbones. Global pooling summarizes feature maps of arbitrary sizes into fixed-size vectors that can be processed by fully-connected classification layers. This ability to process inputs of varying sizes enables CNN-based classifiers to leverage the full resolution of the inputs and preserve their aspect ratios.

The role of input size in CNNs has mainly been studied with respect to computational efficiency, receptive field adequacy, and model performance (Richter et al., 2021). In this paper, we study the impact of input size on the robustness and generalization of CNNs. In particular, we analyze the sensitivity of CNNs with flexible input size to variations in this size, as illustrated in Figure 1. We demonstrate how the input size(s) used during training can strongly impact this sensitivity, and in turn, the robustness of CNNs to input shifts. We further introduce a solution to reduce this sensitivity. Our contributions are:

• Demonstrating how CNNs can overfit the boundary conditions dictated by the input size used during training 1, and identifying pooling arithmetic 2 as the culprit (Section 2).
• Introducing a modification to stride-based downsampling layers, such as max pooling and strided convolution, to mitigate size overfitting (Section 3), and demonstrating how it can improve the accuracy and shift robustness of CNNs in two exemplary tasks.

In Section 4 we discuss the implications of size overfitting and link our observations with relevant findings in the literature.
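As a toy illustration of how a stride-based layer could be deprived of fixed arithmetic cues, consider 1D max pooling whose window grid is given a random offset during training. This is our own simplified sketch for intuition only, not the spatially-balanced formulation developed in Section 3; the function name and interface are ours.

```python
import random

def max_pool_1d(x, kernel=2, stride=2, offset=0):
    """Toy 1D max pooling: windows start at `offset`, and a trailing
    partial window is kept (ceil mode). A real layer would also pad."""
    out = []
    i = offset
    while i < len(x):
        out.append(max(x[i:i + kernel]))
        i += stride
    return out

x = [1, 2, 3, 4, 5, 6, 7]
# Standard pooling always anchors windows at even indices, so which
# positions survive is fully determined by the input size:
print(max_pool_1d(x))                 # [2, 4, 6, 7]
# Randomizing the grid offset at training time removes that fixed cue:
print(max_pool_1d(x, offset=random.randint(0, 1)))
```

With `offset=1` the same input yields `[3, 5, 7]`: the surviving positions shift, so the network cannot rely on a fixed alignment between the pooling grid and the image border.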

2.1. VARYING THE INPUT SIZE

For the purpose of our analysis, we vary the input size from 192 × 192 to 299 × 299 by simultaneously increasing the width and the height by 1 pixel, limiting the input to a square shape. This simplifies the analysis and preserves the aspect ratio used during training. When possible, we follow the same resizing method used during training: the image is first resized so that its smaller dimension equals s = 256, and a random crop of size 224 × 224 is then applied. We use the same steps, changing mainly the crop size and applying a centered crop instead of a random one. This maintains an object scale matching the training images; centered crops are typically used in the validation phase to eliminate randomness. A crop smaller than 224 × 224 incurs a loss of information at the periphery, while a crop s′ × s′ larger than 256 × 256 would require padding. To avoid padding artifacts, we change the first step to use s = max(s′, 256). The information loss in crops smaller than 224 × 224 and the increased object scale in crops larger than 256 × 256 can potentially impact the classification result of certain instances. Nevertheless, the analysis helps identify a fundamental impact of input size on CNNs, as we explain next.
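The resizing procedure above can be sketched as a short computation. The helper below is our illustration of the described steps (the paper does not specify an implementation); it returns the resized dimensions and the centered crop box for a desired crop size s′, using s = max(s′, 256) for the shorter side.

```python
def eval_resize_params(orig_w, orig_h, crop):
    """Return (resized_w, resized_h, crop_box) for a centered crop of
    size `crop` x `crop`, following the evaluation protocol above.

    The shorter side is first scaled to s = max(crop, 256), matching the
    training-time convention of s = 256 while avoiding padding for
    crops larger than 256.
    """
    s = max(crop, 256)
    if orig_w <= orig_h:
        new_w = s
        new_h = round(orig_h * s / orig_w)
    else:
        new_h = s
        new_w = round(orig_w * s / orig_h)
    # Centered crop box as (left, top, right, bottom).
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return new_w, new_h, (left, top, left + crop, top + crop)

# A 500 x 375 image evaluated at crop size 224: shorter side -> 256.
print(eval_resize_params(500, 375, 224))  # (341, 256, (58, 16, 282, 240))
# The same image at crop size 288: shorter side -> max(288, 256) = 288.
print(eval_resize_params(500, 375, 288))  # (384, 288, (48, 0, 336, 288))
```

Because the shorter side is rescaled to s = max(s′, 256), crops up to 256 keep the training-time object scale, while larger crops enlarge the object rather than introduce padding.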

2.2. ANALYZING SENSITIVITY TO INPUT SIZE

For each input size, we compute the accuracy of the pretrained model on the ImageNet validation set after resizing the images as described above. Figure 1 depicts in blue the accuracy as a function of the input dimension, where both dimensions are increased simultaneously. The validation accuracy generally increases with the input size in the range we considered. However, there are remarkable drops in accuracy that occur periodically at an interval of 32, immediately after reaching a peak. This suggests that the model favors specific input sizes that correspond to these peaks, while it struggles with inputs that are 1 pixel larger in width and in height. We next demonstrate how these peaks and drops in accuracy are a byproduct of pooling arithmetic.
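The 32-pixel period can be traced with a short computation. ResNet-18 halves the spatial resolution five times (a stride-2 convolution, a stride-2 max pooling, and three stride-2 stages), and with the usual kernel/padding combinations (k = 3, p = 1, or k = 7, p = 3) each stride-2 layer maps an n-pixel dimension to ⌈n/2⌉. The sketch below is our illustration; the stage structure follows the standard torchvision ResNet-18.

```python
import math

def feature_sizes(n, num_stages=5):
    """Spatial size after each of ResNet-18's five stride-2 downsamplings.

    A stride-2 layer with k = 3, p = 1 (or k = 7, p = 3) maps
    n -> floor((n + 2p - k) / 2) + 1 = ceil(n / 2).
    """
    sizes = []
    for _ in range(num_stages):
        n = math.ceil(n / 2)
        sizes.append(n)
    return sizes

# 224 = 32 * 7 divides evenly through all five stages ...
print(feature_sizes(224))  # [112, 56, 28, 14, 7]
# ... while 225 forces an extra boundary row/column at every stage.
print(feature_sizes(225))  # [113, 57, 29, 15, 8]
```

Crossing a multiple of 32 thus grows the final grid from 7 × 7 to 8 × 8, where the newly added boundary cells are computed largely from padding; these changed boundary conditions are the arithmetic cue behind the periodic drops.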



Figure 1: The ImageNet top-1 accuracy of two ResNet-18 models as a function of input size. Both models are trained on 224 × 224 images. The standard CNN represents the baseline available in PyTorch. Our spatially-balanced CNN mitigates periodic size overfitting.

