MIXSIZE: TRAINING CONVNETS WITH MIXED IMAGE SIZES FOR IMPROVED ACCURACY, SPEED AND SCALE RESILIENCY

Abstract

Convolutional neural networks (CNNs) are commonly trained using a fixed spatial image size predetermined for a given model. Although trained on images of a specific size, it is well established that CNNs can be used to evaluate a wide range of image sizes at test time, by adjusting the size of intermediate feature maps. In this work, we describe and evaluate a novel mixed-size training regime that uses several image sizes at training time. We demonstrate that models trained using our method are more resilient to image size changes and generalize well even on small images. This allows faster inference by using smaller images at test time. For instance, we achieve a 76.43% top-1 accuracy (ResNet50 on ImageNet) with an image size of 160, which matches the accuracy of the baseline model with 2× fewer computations. Furthermore, for a target image size used at test time, we show this method can be exploited either to accelerate training or to improve the final test accuracy. For example, we are able to reach a 79.27% accuracy with the same model evaluated at a 288 spatial size, a 14% relative reduction in error over the baseline. MixSize regimes pave the way for faster and more accurate training and inference using convolutional networks. Our PyTorch implementation and pre-trained models are publicly available.1

1. INTRODUCTION

Convolutional neural networks are successfully used to solve various tasks across multiple domains such as visual (Krizhevsky et al., 2012; Ren et al., 2015), audio (van den Oord et al., 2016), language (Gehring et al., 2017) and speech (Abdel-Hamid et al., 2014). While scale-invariance is considered important for visual representations (Lowe, 1999), convolutional networks are not scale invariant with respect to the spatial resolution of the image input, as a change in image dimensions may lead to a non-linear change of their output. Even though CNNs are able to achieve state-of-the-art results in many tasks and domains, their sensitivity to image size is an inherent deficiency that limits practical use cases and requires that images at evaluation time match the training image size. For example, Touvron et al. (2019) demonstrated that networks trained on a specific image size perform poorly on other image sizes at evaluation time, as confirmed in Figure 1. The most common method to improve scale invariance in CNNs is to artificially enlarge the dataset using a set of label-preserving transformations, also known as "data augmentation" (Howard, 2013; Krizhevsky et al., 2012). Several of these transformations scale and crop objects appearing within the data, thus increasing the network's robustness to inputs of different scales. Several works attempted to achieve scale invariance by modifying the network structure to learn over multiple possible target input scales (Takahashi et al., 2017; Xu et al., 2014; Zhang et al., 2019). These methods
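To make the mixed-size idea concrete, below is a minimal sketch of a training step in which each batch is resized on the fly to a spatial size sampled from a small set. This is an illustrative simplification, not the authors' exact MixSize regime: the size set, the tiny model, and the uniform sampling are all assumptions made for the example. Global average pooling keeps the classifier size-agnostic, which is how standard ResNets accept varying input resolutions.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative set of training sizes (an assumption, not the paper's exact schedule).
SIZES = [128, 160, 224, 288]

# Tiny stand-in for ResNet50: AdaptiveAvgPool2d makes the head
# independent of the input's spatial resolution.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def mixed_size_step(images, labels):
    # Sample one spatial size for this batch and resize the whole batch to it.
    s = random.choice(SIZES)
    images = F.interpolate(images, size=(s, s), mode="bilinear",
                           align_corners=False)
    opt.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    opt.step()
    return s, loss.item()

# Dummy batch standing in for ImageNet data.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 10, (8,))
size, loss = mixed_size_step(x, y)
```

Because only the input tensor's spatial dimensions change between batches, no architectural modification is needed; the same weights are trained across all sampled resolutions.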



1 https://github.com/paper-submissions/mixsize



Figure 1: Top-1 test accuracy per image size, models trained on specific sizes (ResNet50, ImageNet).

