MIND THE POOL: CONVOLUTIONAL NEURAL NETWORKS CAN OVERFIT INPUT SIZE

Abstract

We demonstrate how convolutional neural networks can overfit the input size: accuracy drops significantly for certain input sizes compared with favorable ones. This issue is inherent to pooling arithmetic, with standard downsampling layers playing a major role in favoring certain input sizes and skewing the weights accordingly. We present a solution to this problem by depriving these layers of the arithmetic cues they use to overfit the input size. Through various examples, we show how our proposed spatially-balanced pooling improves the generalization of the network to arbitrary input sizes and its robustness to translational shifts.

* Work done mainly while at Meta AI.
1 We refer to this as size overfitting for brevity.
2 By pooling we refer to any stride-based downsampling such as maxpooling and strided convolution.

1. INTRODUCTION

Convolutional neural networks (CNNs) are versatile models in machine learning. Early CNN architectures used in image classification were restricted to a fixed input size. For example, AlexNet (Krizhevsky et al., 2012) was designed to classify 224 × 224 images from ImageNet (Deng et al., 2009). To facilitate model comparison, this size has been adopted in subsequent ImageNet classifiers such as VGGNet (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016). The adoption of fully-convolutional architectures (Long et al., 2015; Springenberg et al., 2015) and global pooling methods (Lin et al., 2014; He et al., 2015) demonstrated how CNNs can process inputs of arbitrary size. Fully-convolutional networks eliminate the use of fully-connected layers in CNN backbones, preserving 2D feature maps as the output of these backbones. Global pooling summarizes feature maps of arbitrary sizes into fixed-size vectors that can be processed by fully-connected classification layers. This ability to process inputs of varying sizes enables CNN-based classifiers to leverage the full resolution of their inputs and to preserve their aspect ratios.

The role of input size in CNNs has mainly been studied with respect to computational efficiency, receptive field adequacy, and model performance (Richter et al., 2021). In this paper, we study the impact of input size on the robustness and generalization of CNNs. In particular, we are interested in analyzing the sensitivity of CNNs with flexible input size to variations in this size, as illustrated in Figure 1. We demonstrate how the input size(s) used during training can strongly impact this sensitivity, and in turn, the robustness of CNNs to input shifts. We further introduce a solution to reduce this sensitivity. Our contributions are:

• Demonstrating how CNNs can overfit the boundary conditions dictated by the input size used during training 1 , and identifying pooling arithmetic 2 as the culprit (Section 2).
• Introducing a modification to stride-based downsampling layers such as maxpooling and strided convolution to mitigate size overfitting (Section 3), and demonstrating how it can improve the accuracy and shift robustness of CNNs in two exemplary tasks.

In Section 4 we discuss the implications of size overfitting and link our observations with relevant findings in the literature.
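To make the pooling arithmetic concrete, the following sketch (a toy 1D illustration, not the paper's implementation) shows two boundary effects of standard stride-based pooling: when the input length does not tile evenly into stride-2 windows, trailing values are silently dropped, and a one-element shift of the input changes which values share a window, so the pooled output is not shift-equivariant:

```python
import numpy as np

def maxpool1d(x, k=2, s=2):
    # Standard stride-based max pooling: output length is
    # floor((len(x) - k) / s) + 1, so inputs that do not fill
    # a complete final window are silently discarded.
    n_out = (len(x) - k) // s + 1
    return np.array([x[i * s : i * s + k].max() for i in range(n_out)])

# Boundary effect 1: input size determines whether windows tile exactly.
print(maxpool1d(np.arange(8)))  # [1 3 5 7]  (windows tile exactly)
print(maxpool1d(np.arange(9)))  # [1 3 5 7]  (the value 8 never reaches the output)

# Boundary effect 2: the window grid is anchored at index 0, so a
# one-element shift regroups the inputs and changes the pooled output.
x = np.array([0, 9, 0, 0, 9, 0, 0, 9])
print(maxpool1d(x))             # [9 0 9 9]
print(maxpool1d(np.roll(x, 1)))  # [9 9 9 0]
```

During training at a fixed input size, only one such window alignment is ever seen, which is the arithmetic cue a network can latch onto.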
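The size flexibility that global pooling provides can be sketched as follows (a minimal example with hypothetical feature-map shapes; the channel count and spatial sizes are illustrative, not taken from any specific architecture):

```python
import numpy as np

def global_avg_pool(feat):
    # Collapse the spatial dimensions of a C x H x W feature map
    # into a fixed-size C-dimensional vector, regardless of H and W.
    return feat.mean(axis=(1, 2))

rng = np.random.default_rng(0)
f_small = rng.random((16, 7, 7))    # e.g. backbone output for a small input
f_large = rng.random((16, 10, 13))  # same backbone, larger non-square input

# Both inputs yield a vector of the same fixed size,
# ready for a fully-connected classification head.
print(global_avg_pool(f_small).shape)  # (16,)
print(global_avg_pool(f_large).shape)  # (16,)
```

This is what lets a fully-convolutional backbone accept arbitrary input sizes, and also why any input-size sensitivity originates upstream, in the stride-based downsampling layers, rather than in the classification head.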

