USING OBJECT-FOCUSED IMAGES AS AN IMAGE AUGMENTATION TECHNIQUE TO IMPROVE THE ACCURACY OF IMAGE-CLASSIFICATION MODELS WHEN VERY LIMITED DATA SETS ARE AVAILABLE

Anonymous

Abstract

Many machine learning models today are extremely data hungry, and the accuracy of the algorithms used is often limited by the amount of training data available, which is rarely abundant. Image augmentation is a powerful technique that computer-vision engineers can use to expand their existing image data sets. This paper presents an innovative way of creating variations of existing images and introduces the idea of the Object-Focused Image (OFI): an image that contains only the labeled object, with everything else made transparent. The objective of the OFI method is to expand the existing image data set and hence improve the accuracy of the image-classification model. This paper also elaborates on the OFI approach and compares the accuracy of five models with the same network design and settings but with different training-data content. The experiments presented in this paper show that using OFIs along with the original images can increase the validation accuracy of the model; indeed, when the OFI technique is used, the number of supplied images nearly doubles.

1. INTRODUCTION

Nowadays, Convolutional Neural Networks (CNNs) are among the most common tools used for image classification. For machine learning (ML) problems such as image classification, the size of the training image set is critical to building high-accuracy classifiers. As the popularity of CNNs grows, so does the interest in data augmentation. Although augmented data sets are artificial, they remain very similar to the original data, so augmentation can help a network learn more useful representations thanks to the increased amount of training data. In this paper, the researchers propose a new method to produce new images from existing ones and thereby augment the data. Several experiments were conducted to validate that the model can benefit further from a mixed data set of original and new images. This paper compares five models with the same design and settings; the only difference is the set of supplied images. The first model uses the original 2,000 images of dogs and cats. The second and third use the OFI versions of those 2,000 images. The fourth and fifth use all the images: the originals as well as the OFIs. There are two methods to obtain the OFIs. The automated method uses an Application Programming Interface (API) to remove the background of the image and leave only the labeled object (a cat or a dog) in the foreground; the OFIs it produces are called automatic OFIs. The other method is executed manually: every image is edited by a human expert who removes the background and any additional objects in the image; the OFIs produced this way are called manual OFIs. The manual method is more accurate than the automated API method and might lead to better results. The only difference between the second and third models is that the second uses the automatic method while the third uses the manual method; the same distinction holds between the fourth and fifth models.
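The OFI construction described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes the object mask is already available, whether produced by a background-removal API (automatic OFIs) or by a human editor (manual OFIs), and simply zeroes the alpha channel everywhere outside the object.

```python
import numpy as np

def make_ofi(image_rgb, object_mask):
    """Build an Object-Focused Image (OFI): keep only the labeled
    object and make every background pixel fully transparent.

    image_rgb   : (H, W, 3) uint8 array.
    object_mask : (H, W) boolean array, True where the object is.
                  (Hypothetical input: in the paper the mask comes from
                  a background-removal API or a human expert.)
    Returns an (H, W, 4) RGBA array with alpha 0 on the background.
    """
    h, w, _ = image_rgb.shape
    ofi = np.zeros((h, w, 4), dtype=np.uint8)
    ofi[..., :3] = image_rgb                      # copy the color channels
    ofi[..., 3] = np.where(object_mask, 255, 0)   # opaque object, transparent background
    return ofi
```

Saving the result as PNG (an alpha-capable format) would then yield an OFI that can be added to the training set alongside the original image.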
In this paper, the five models are tested to answer the following questions:
• Will the model have better validation accuracy when only the original set of images is used?
• Will it have better validation accuracy when only the set of OFIs is used?
• Will it have better validation accuracy when both sets are used?
• Are models that use manual OFIs more accurate than those that use automatic OFIs?

2. RELATED WORK

In deep learning, augmentation has been a common practice since the 1980s and early 1990s (Simard et al., 1992). It is considered a critical component of many ML models (Ciresan et al., 2010; Krizhevsky et al., 2012; LeCun et al., 2015). Augmentation is of paramount importance in extreme cases where only a few training examples are available (Vinyals et al., 2016), and it has been an essential element of numerous successful modern models, among them AlexNet (Krizhevsky et al., 2012), All-CNN (Springenberg et al., 2014), and ResNet (He et al., 2016). In some implementations, experts were able to rely on data augmentation, apply it heavily, and achieve successful results (Wu et al., 2015). This is not limited to computer vision: data augmentation has also been effective in other domains, such as text categorization (Lu et al., 2006), speech recognition (Jaitly & Hinton, 2013), and music source separation (Uhlich et al., 2017). Given sufficient data, deep neural networks provide exceptional results, which have been studied and demonstrated in many domains (Gu et al., 2015), including image classification (Krizhevsky et al., 2012; Huang et al., 2016), natural language processing (Gu et al., 2015), reinforcement learning (Mnih et al., 2015; Foerster et al., 2016; Silver et al., 2016; Gu et al., 2016; Van Hasselt et al., 2016), machine translation (Wu et al., 2016), synthesis (Wang et al., 2017), and many more. In all these experiments and implementations, the data sets used were very large. Augmentation is a vital technique not only when data sets are small but for data sets of any size: even models trained on enormous data sets such as ImageNet (Deng et al., 2009) can benefit from data augmentation.
Without data augmentation, deep neural networks may suffer from a lack of generalization (Perez & Wang, 2017) and adversarial vulnerability (Zhang et al., 2017). Although CNNs have led to substantial achievements in image-processing tasks and image classification (Zeiler and Fergus, 2014; Sermanet et al., 2014), CNNs with abundant parameters may overfit, because they learn very detailed features of the supplied training images that do not generalize to images unseen during training (Zeiler and Fergus, 2014; Zintgraf et al., 2017). Data augmentation has been considered a solution to this overfitting problem (Krizhevsky et al., 2012; He et al., 2016; DeVries and Taylor, 2017), and augmentation techniques are gaining continuous attention (DeVries and Taylor, 2017; Zhong et al., 2017; Zhang et al., 2017). In image recognition, data augmentation is one of the fundamental building blocks of almost all state-of-the-art results (Ciresan et al., 2010; Dosovitskiy et al., 2016; Graham, 2014; Sajjadi et al., 2016). Nonetheless, because augmentation strategies can have a large impact on model performance, they require extensive selection and tuning (Ratner et al., 2017). For instance, cutout (DeVries and Taylor, 2017), an extension of dropout, randomly masks out a square region in an image at every training step. Random erasing (Zhong et al., 2017) is similar to cutout in that it masks out a subregion of an image, but with three differences: i) it randomly chooses whether or not to mask out; ii) it uses a random size; and iii) it uses a random aspect ratio for the masked region. Mixup, on the other hand, α-blends two images to form a new image (Zhang et al., 2017); it also behaves like class-label smoothing, mixing the class labels of the two images with the ratio α : 1 − α (Szegedy et al., 2016).
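Two of the techniques cited above, cutout and mixup, can be sketched in a few lines. These are illustrative sketches of the published ideas, not the cited authors' code; the function names, the fixed patch size for cutout, and the Beta-distributed blending coefficient for mixup follow the papers' descriptions but the details here are this sketch's own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def cutout(image, size):
    """Cutout (DeVries and Taylor, 2017): zero out a square patch of
    fixed `size` centered at a random location (clipped at the borders)."""
    h, w = image.shape[:2]
    out = image.copy()
    y, x = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(0, y - size // 2), min(h, y + size // 2)
    x0, x1 = max(0, x - size // 2), min(w, x + size // 2)
    out[y0:y1, x0:x1] = 0
    return out

def mixup(img_a, label_a, img_b, label_b, alpha=0.2):
    """Mixup (Zhang et al., 2017): alpha-blend two images and mix their
    one-hot labels with the same ratio lam : 1 - lam, lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    image = lam * img_a + (1.0 - lam) * img_b
    label = lam * label_a + (1.0 - lam) * label_b
    return image, label
```

Random erasing would differ from `cutout` only in sampling whether to mask at all, plus a random size and aspect ratio for the patch.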
Other researchers have proposed techniques such as random cropping and horizontal flipping (Krizhevsky et al., 2012).
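For completeness, these two classic augmentations are equally simple to express on image arrays. Again a minimal sketch rather than the cited implementation; crop size and flip probability are free parameters chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, crop_h, crop_w):
    """Take a random crop_h x crop_w window from the image."""
    h, w = image.shape[:2]
    y = rng.integers(0, h - crop_h + 1)
    x = rng.integers(0, w - crop_w + 1)
    return image[y:y + crop_h, x:x + crop_w]

def horizontal_flip(image, p=0.5):
    """Mirror the image left-right with probability p."""
    return image[:, ::-1] if rng.random() < p else image
```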

