MIXSIZE: TRAINING CONVNETS WITH MIXED IMAGE SIZES FOR IMPROVED ACCURACY, SPEED AND SCALE RESILIENCY

Abstract

Convolutional neural networks (CNNs) are commonly trained using a fixed spatial image size predetermined for a given model. Although trained on images of a specific size, it is well established that CNNs can evaluate a wide range of image sizes at test time, by adjusting the size of their intermediate feature maps. In this work, we describe and evaluate a novel mixed-size training regime that uses several image sizes at training time. We demonstrate that models trained using our method are more resilient to image size changes and generalize well even on small images. This allows faster inference by using smaller images at test time. For instance, we achieve 76.43% top-1 accuracy (ResNet50 on ImageNet) with an image size of 160, which matches the accuracy of the baseline model with 2× fewer computations. Furthermore, for a target image size used at test time, we show this method can be exploited either to accelerate training or to improve the final test accuracy. For example, we reach 79.27% accuracy with the same model evaluated at a 288 spatial size, a 14% relative improvement over the baseline. MixSize regimes pave the way for faster and more accurate training and inference using convolutional networks. Our PyTorch implementation and pre-trained models are publicly available.¹

1. INTRODUCTION

Convolutional neural networks are successfully used to solve various tasks across multiple domains such as vision (Krizhevsky et al., 2012; Ren et al., 2015), audio (van den Oord et al., 2016), language (Gehring et al., 2017) and speech (Abdel-Hamid et al., 2014). While scale-invariance is considered important for visual representations (Lowe, 1999), convolutional networks are not scale-invariant with respect to the spatial resolution of the input image, as a change in image dimensions may lead to a non-linear change in their output. Even though CNNs achieve state-of-the-art results in many tasks and domains, their sensitivity to image size is an inherent deficiency that limits practical use cases and requires that images at evaluation time match the training image size. For example, Touvron et al. (2019) demonstrated that networks trained on a specific image size perform poorly on other image sizes at evaluation time, as confirmed in Figure 1. The most common method to improve scale invariance in CNNs is to artificially enlarge the dataset using a set of label-preserving transformations, also known as "data augmentation" (Howard, 2013; Krizhevsky et al., 2012). Several of these transformations scale and crop objects appearing within the data, thus increasing the network's robustness to inputs of different scales. Several works attempted to achieve scale invariance by modifying the network structure to learn over multiple possible target input scales (Takahashi et al., 2017; Xu et al., 2014; Zhang et al., 2019). These methods explicitly change the model for a specific input size, and thus benefit neither from the lower computational cost of smaller image sizes nor from the ability to infer on sizes not observed during training. Another approach, suggested by Cai et al. (2020), modifies the network structure and training regime to account for a variety of inference modes without additional specialized training.
In this work, we introduce "MixSize", a novel training regime for convolutional networks that uses stochastic image and batch sizes. The main contributions of the MixSize regime are:

• Reduced image size sensitivity. We show that the MixSize training regime can improve model performance over a wide range of sizes used at evaluation.

• Faster inference. As our mixed-size models can be evaluated at smaller image sizes, we show up to a 2× reduction in the computation required at inference to reach the same accuracy as the baseline model.

• Faster training vs. higher accuracy. We show that reducing the average image size during training leads to a trade-off between the time required to train the model and its final accuracy.
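The stochastic size regime sketched above can be illustrated in a few lines of Python. This is a schematic sketch only, not the paper's actual sampling procedure: the size set, the reference batch size of 256, and the rule of scaling the batch inversely with the squared image size (so that per-step compute stays roughly constant) are illustrative assumptions.

```python
import random

# Candidate spatial sizes for a training step (illustrative values).
SIZES = [128, 160, 192, 224, 288]

def sample_step(base_batch=256, base_size=224, rng=random):
    """Sample a (size, batch) pair for one training step.

    The batch size is scaled so that batch * size^2 -- a rough proxy for
    per-step compute and memory -- stays close to base_batch * base_size^2.
    """
    size = rng.choice(SIZES)
    batch = max(1, round(base_batch * (base_size / size) ** 2))
    return size, batch

rng = random.Random(0)
schedule = [sample_step(rng=rng) for _ in range(5)]
```

With this rule, a step at size 128 uses a batch of 784 while a step at size 288 uses a batch of 155, both matching the compute of a 256-image batch at size 224 to within about 0.1%.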

2.1. USING MULTIPLE IMAGE SIZES

Deep convolutional networks are traditionally trained using fixed-size inputs, with spatial dimensions H × W and a batch size B. The network architecture is configured such that the spatial dimensions are reduced through strided pooling or convolutions, with the last classification layer applied on a 1 × 1 spatial dimension. Modern convolutional networks usually conclude with a final "global" average pooling (Lin et al., 2013; Szegedy et al., 2015), which reduces any remaining spatial dimensions with a simple averaging operation. Scaling the spatial size of an input to a convolutional layer by a factor γ yields an output scaled by the same factor γ. This modification requires no change to the number of parameters of the given convolutional layer, nor to its underlying operation. It was observed by practitioners and in previous works that a network trained on a specific input dimension can, to some extent, still be used at inference with a modified image size (Simonyan & Zisserman, 2014). Moreover, evaluating with an image size that is larger than the one used for training can improve accuracy up to a threshold, after which it quickly deteriorates (Touvron et al., 2019). Although not explicitly trained to handle varying image sizes, CNNs are commonly evaluated on multiple scales post-training, as in detection (Lin et al., 2017; Redmon & Farhadi, 2018; Liu et al., 2020) and segmentation (He et al., 2017) tasks. In these tasks, a network pretrained with a fixed image size for classification is used as the backbone of a larger model that is expected to adapt to a wide variety of image sizes. Recently, Tan & Le (2019) showed a computation-vs-accuracy trade-off in scaling the image size used for training and evaluating convolutional networks.
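The size arithmetic described above can be checked directly. The sketch below computes the spatial size of the final feature map for a ResNet-like downsampling path (an illustrative overall stride of 32, not the exact architecture of any model in the paper): scaling the input by γ scales the final feature map by the same γ, and a global average pool then collapses it to 1 × 1 regardless of the input size.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def backbone_out(size):
    """Final feature-map size of an illustrative stride-32 backbone:
    a 7x7/2 stem, a 3x3/2 pool, then three stride-2 stages."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # stem
    size = conv_out(size, kernel=3, stride=2, padding=1)  # max-pool
    for _ in range(3):                                    # stride-2 stages
        size = conv_out(size, kernel=3, stride=2, padding=1)
    return size

# 224 -> 7, 160 -> 5, 448 -> 14: doubling the input (γ = 2) doubles the
# final feature map, which global average pooling reduces either way.
```

This is why only the batch-norm statistics and the effective receptive field, not the parameter count, change when the same network is fed a different image size.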
This finding is consistent with past results demonstrating that training with a larger image size can yield better classification accuracy (Huang et al., 2018; Szegedy et al., 2016). In addition, previous works explored the notion of "progressive resizing" (Howard, 2018; Karras et al., 2017), increasing the image size as training progresses to improve model performance and time to convergence. In this work we further explore the use of multiple image sizes at training time, so that CNN performance is resilient to test-time changes of the image size.

2.2. LARGE BATCH TRAINING OF DEEP NETWORKS

Deep neural network training can be distributed across many computational units and devices. The most common distribution method is "data-parallelism": computing an average estimate of the gradients over a large batch of samples that is split across multiple devices.
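The gradient averaging at the heart of data-parallelism can be sketched in plain Python. This is a minimal illustration with lists standing in for gradient tensors; a real implementation would use a collective such as an all-reduce over device tensors (e.g. in PyTorch's distributed package) rather than this loop.

```python
def allreduce_mean(worker_grads):
    """Average per-worker gradients element-wise.

    worker_grads: a list of equal-length gradient vectors, one per worker.
    Returns the mean gradient, equivalent to computing the gradient of one
    large batch made of all the workers' local mini-batches.
    """
    num_workers = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / num_workers for i in range(dim)]

# Four workers, each holding a local 2-dimensional gradient estimate.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg = allreduce_mean(grads)  # -> [4.0, 5.0]
```

Because the averaged gradient matches that of one large batch, adding workers effectively enlarges the batch size, which is what ties batch size to image size in a fixed per-device memory budget.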



¹ https://github.com/paper-submissions/mixsize



Figure 1: Top-1 test accuracy per image size, models trained on specific sizes (ResNet50, ImageNet).

A similar idea by Wu et al. (2020) was used to improve the performance of training on video data by balancing resolution with batch size. Another related work by Touvron et al. (2019) demonstrated that CNNs can be trained using a fixed small image size and fine-tuned post-training to the larger size at which evaluation will be performed. This procedure reduces the train-test discrepancy caused by the change in image size and allows faster training and improved accuracy, at the cost of an additional fine-tuning procedure and additional computation at inference time.

