SELF-PRETRAINING FOR SMALL DATASETS BY EXPLOITING PATCH INFORMATION

Abstract

Deep learning tasks with small datasets are often tackled by pretraining models on large datasets from relevant tasks. Although pretraining mitigates overfitting, it can sometimes be difficult to find an appropriate pretrained model. In this paper, we propose a self-pretraining method that exploits patch information in the dataset itself, without pretraining on other datasets. Our experiments show that the self-pretraining method leads to better performance than training from scratch, while using no external data.

1. INTRODUCTION

Transfer learning has become the de facto approach for deep learning tasks on small datasets. Because of the data-hungry nature of deep learning methods, training from scratch on small datasets usually leads to overfitting. Although transfer learning with models pretrained on additional large datasets mitigates overfitting, it is hard to find an appropriate pretrained model, such as the ImageNet-classification pretrained models used in detection and segmentation tasks, when the appearance of the input data or the goal of the target task is unusual. Research on training with small datasets without using external information has emerged in recent years. Barz et al. (Barz & Denzler, 2020) proposed training from scratch on small datasets using the cosine loss, which achieved substantially better performance than the cross-entropy loss on fine-grained classification tasks. Zhang et al. (Zhang et al., 2019) introduced a generative adversarial network into the process of training with limited datasets, without using external data or prior knowledge. In contrast to applying data augmentation or special loss functions to small-dataset tasks, we propose a self-pretraining method that transfers patch information from the dataset itself to the model in a weakly supervised manner. Patches can represent image information to some extent. In (Kang et al., 2014), Kang et al. predicted the image quality score as the average quality score of image patches, which are trained with image-level quality labels. Similarly, BagNet (Brendel & Bethge, 2019) showed that small image patches containing class evidence can do well on the ImageNet classification challenge by aggregating their scores over the image without considering spatial order. For fine-grained classification, (Wang et al., 2017) extracted features of image patches by training on external large datasets in a weakly supervised way.
Inspired by (Gatys et al., 2016), which pointed out that convolutional neural networks capture local features in lower layers and global structure in higher layers, our self-pretraining method pretrains the model from lower layers to higher layers using image patches of incrementally increasing size, step by step.

2. PROPOSED METHOD

In this section, we describe our self-pretraining method, which uses patch information from the dataset itself. Despite the small number of training images, the number of patches sampled from them can be large enough to meet the data-hungry demand. Our self-pretraining method draws on two insights. First, large numbers of patches, each containing part of the information in an image, can be used to train the network with image-level labels in a weakly supervised manner. Although a small patch does not hold the complete information needed for a correct prediction, a model trained on such patches probably learns parameters in its lower layers similar to those of models trained discriminatively on large relevant datasets. Therefore, training on image patches with not-exactly-correct image-level labels may yield image feature extraction close to that of standard pretraining methods. Second, training with randomly shuffled image patches means that self-pretrained models do not rely on global structure to make predictions; particular global structures in limited data can cause overfitting more easily than local patches, which have more intra-class samples in the dataset. The self-pretraining method pretrains the model using patches of incrementally increasing size, step by step, as shown in Figure 1. In the early stages of self-pretraining, we uniformly sample smaller image patches as training data, of which a larger number is available. Each patch is assigned the label of the image containing it. After training on these patches in a weakly supervised manner, we remove the higher layers of the pretrained model while retaining the lower layers as a frozen feature extractor for the next training step, in which bigger patches are sampled as training data. The procedure continues iteratively until the size of the sampled patches equals the full image size.
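The uniform patch sampling with inherited image-level labels can be sketched as follows; the function name, patch size, and label value here are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_patches(image, label, patch_size, num_patches, rng):
    """Uniformly sample square patches from one image.

    Every patch inherits the image-level label, producing the
    weakly supervised patch dataset used in the early steps.
    """
    h, w = image.shape[:2]
    patches, labels = [], []
    for _ in range(num_patches):
        top = rng.integers(0, h - patch_size + 1)
        left = rng.integers(0, w - patch_size + 1)
        patches.append(image[top:top + patch_size, left:left + patch_size])
        labels.append(label)  # weak label: the class of the whole image
    return np.stack(patches), np.array(labels)

# Example: one 64x64 RGB image yields many 16x16 training patches,
# illustrating how a small dataset expands into a large patch set.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patches, labels = sample_patches(image, label=3, patch_size=16,
                                 num_patches=100, rng=rng)
```

Sampling with replacement means the number of patches per image can be chosen freely, so even a few hundred images can supply enough training examples for the early, small-patch steps.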
Using this step-by-step pretraining method, we obtain a model that takes full-size images as input and performs better than training from scratch on the small dataset. A detailed description of the self-pretraining method is given in Algorithm 1.
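The control flow of the multi-step procedure might look like the following schematic; the per-step layer schedule, class names, and patch-size sequence are our illustrative assumptions, and the actual training call is elided.

```python
class Layer:
    """Stand-in for one stage of a CNN (e.g. a conv block)."""
    def __init__(self, name):
        self.name = name
        self.frozen = False

def self_pretrain(layers, patch_sizes, full_size):
    """Pretrain lower-to-higher layers on patches of growing size.

    At each step, the layers trained so far are frozen and reused
    as the feature extractor for the next, larger-patch step.
    """
    trained = 0
    for size in patch_sizes:
        # Hypothetical schedule: one more layer becomes trainable per step.
        active = layers[:trained + 1]
        # ... sample patches of side `size`, train `active` plus a
        #     temporary head with image-level labels, discard the head ...
        for layer in active:
            layer.frozen = True  # keep as frozen extractor for the next step
        trained += 1
        if size == full_size:    # final step trains on whole images
            break
    return layers

layers = [Layer(f"block{i}") for i in range(3)]
self_pretrain(layers, patch_sizes=[16, 32, 64], full_size=64)
```

The sketch only captures the freeze-and-grow structure; in practice each step would also rebuild the classification head, since the paper removes the higher layers before the next step.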



Figure 1: We obtain the prediction model, whose input is full-size images, by pretraining with patches in a multi-step way. In the beginning (Step 1), small patches are sampled as training data with image-level labels. Then (Step 2), the lower layers pretrained in the previous step (Step 1) are frozen as the early feature extractor of the model at this step. This procedure continues iteratively until (Step 3) the size of the sampled patches equals the full image size.

