HOW SAMPLING AFFECTS TRAINING: AN EFFECTIVE SAMPLING THEORY STUDY FOR LONG-TAILED IMAGE CLASSIFICATION

Abstract

The long-tailed image classification problem has been challenging for a long time. Suffering from the imbalanced distribution of categories, many deep vision classification methods perform well on the head classes but poorly on the tail ones. This paper proposes an effective sampling theory, attempting to provide a theoretical explanation for decoupling representation learning and classifier learning in long-tailed image classification. To apply this sampling theory in practice, a general jitter sampling strategy is proposed. Experiments show that a variety of long-tailed distribution algorithms exhibit better performance when trained according to the effective sampling theory. The code will be released soon.

1. INTRODUCTION

Image classification is a fundamental task in computer vision, and many deep learning based methods have so far achieved gratifying results on artificially constructed datasets. However, due to the large discrepancy between the sample distributions of different classes, a classification model often performs very well on the head categories while giving inaccurate predictions for the tail ones. This phenomenon occurs not only in image classification but also in other common vision tasks such as semantic segmentation He et al. (2021; 2018), and has motivated transfer learning strategies at the embedding level Liu et al. (2020). The main idea for solving the imbalanced classification problem is to enhance the training proportion of the tail categories so as to alleviate overfitting to the head ones. Kang et al. (2019) points out the strong dependence between representation learning for the backbone network and classifier learning for the last fully connected layer, and concludes that the optimal gradients for training the backbone and the classifier are obtained from the original sampling distribution and from a re-sampled distribution (such as class-balanced sampling), respectively; from this observation, the two-stage optimization strategy has gradually been accepted by more researchers. Xiang et al. (2020) further alleviates the strong dependence of the single-expert model on a specific training distribution, improving classification accuracy for both head and tail categories. Kang et al. (2019); Zhou et al. (2020) note that mainstream methods for long-tailed distributions require two learning stages, and that sampling in the first, representation-learning stage should be conducted under the original distribution; however, no ample theoretical explanation for this phenomenon has been given. Inspired by Cui et al. (2019), we realized that the growth of the number of effective samples does not keep pace with the growth of the actual number of samples in the first training stage, where the effective sample growth formula is given by Cui et al. (2019). Based on the concept of effective samples, we propose an expanded effective sampling theory with two important findings: the total number of effective samples is the primary factor affecting training on long-tailed distributions, and the effective sample utilization is the secondary one. The accuracy on long-tailed distributions can be improved by maximizing the total number of effective samples and by balancing the effective sample utilization among categories. The main contributions of this paper are as follows:

1. We build a complete effective sampling theory, which can be used to study the properties of sampling with and without replacement, and from which optimal sampling methods are derived.

2. A general jitter sampling strategy is proposed for practical application, and experiments on various public datasets have been carried out. The experimental results reach competitive performance, further verifying the core of our theory: the total number of effective samples is the core factor affecting the first learning stage, and equalizing effective samples among classes benefits model training.
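The effective sample growth formula of Cui et al. (2019) referenced above is E_n = (1 - β^n)/(1 - β) with β = (N - 1)/N. A minimal sketch of how this quantity saturates (the class counts and the β value below are illustrative choices, not values from this paper):

```python
# Effective number of samples, Cui et al. (2019): E_n = (1 - beta^n) / (1 - beta).
# beta close to 1 models weak overlap between samples; beta and the class
# counts below are illustrative assumptions, not values from this paper.

def effective_num(n, beta=0.999):
    """Effective number of samples for a class with n raw samples."""
    return (1.0 - beta ** n) / (1.0 - beta)

# A toy long-tailed distribution: head classes dwarf tail classes.
class_counts = [5000, 1000, 200, 40, 10]
eff = [effective_num(n) for n in class_counts]

for n, e in zip(class_counts, eff):
    # The effective number saturates: raw counts grow linearly,
    # effective counts do not.
    print(f"n = {n:5d} -> effective samples ~ {e:.1f}")
```

At β = 0.999 the effective number is bounded by 1/(1 - β) = 1000 regardless of class size, which is the saturation effect behind the observation that effective samples and raw samples do not grow synchronously.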

2. RELATED WORK

The long-tailed distribution affects many common vision tasks, including semantic segmentation Wang et al. (2020a), object detection Ouyang et al. (2016); Li et al. (2020), and so on. Research on long-tailed classification mainly focuses on the following perspectives: loss function re-weighting Cao et al. (2019), training data re-sampling Mahajan et al. (2018), and transfer learning.

Re-sampling Re-sampling based strategies redesign the sampling frequencies of different classes. Early ideas mainly focus on under-sampling the head classes and over-sampling the tail classes. Drummond et al. (2003) argues that over-sampling is better than under-sampling, because the latter may discard important samples, while over-sampling the tail classes may lead to over-fitting at the same time. Chawla et al. (2002); Han et al. (2005); He et al. (2008) alleviate this by generating new tail-class data through interpolation; however, imprecise interpolation may also introduce new noise. Kang et al. (2019); Zhou et al. (2020) argue that representation learning and classifier learning should be decoupled, each with its own suitable distribution.

Re-weighting Re-weighting refers to assigning different weights to the loss terms of the corresponding classes. Early studies adopt the reciprocal of the sample frequency as the class weight Huang et al. (2016); Wang et al. (2017). Re-weighting by the number of effective samples of each class is utilized in Mahajan et al. (2018); Mikolov et al. (2013). LDAM Cao et al. (2019) adopts a loss determined by the distance to the classification decision boundary, where categories with more samples are allowed to lie closer to the boundary. Meta-learning based methods Jamal et al. (2020) are also used for better weight estimation, and Zhang et al. (2021a) considers the difficulty and the total number of the data to determine loss weights. In addition, methods based on hard samples Zhang et al. (2021a) and logit adjustment Menon et al. (2020) also belong to re-weighting.

Transfer learning Transfer learning attempts to transfer knowledge from a source domain to enhance performance on the target domain Zhang et al. (2021b). BBN Zhou et al. (2020) is trained on the original distribution in the early steps and then transfers to a class-balanced distribution for classifier optimization. LEAP Liu et al. (2020) constructs a "feature cloud" for tail classes, transferred from head-class features, to better support the classification boundaries. LFME Xiang et al. (2020) trains multiple expert models separately on multiple sub-datasets and produces a student model through knowledge distillation. RIDE Wang et al. (2020b) uses dynamic routing to control the number of experts involved.

Research on long-tailed classification mainly focuses on the above aspects. In addition, there are some theoretical studies on training strategies for long-tailed distributions. Kang et al. (2019) and Zhou et al. (2020) reveal an empirical law of long-tailed classification research: the processes of representation learning and classifier learning can be decoupled. Menon et al. (2020) points out that Adam-type optimizers may not be conducive to training on long-tailed datasets. Cui et al. (2019) introduces the concept of the effective number of samples, based on the finding that the number of non-repeated samples actually participating in training may not be as large as expected.

3. EFFECTIVE SAMPLING THEORY

Inspired by the concept of effective samples Cui et al. (2019), this paper proposes a hypothesis to explain effective sampling in the training process. We believe that the total number of effective samples is the primary factor in representation learning, and the utilization of effective samples across categories is the secondary one. The performance of representation learning can be improved by increasing the total number of effective samples and by equalizing the effective sample utilization.
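The empirical observation that representation learning benefits from sampling the original distribution can be connected to effective numbers with a small numerical sketch. This is our illustrative construction, not a derivation from any of the cited papers: it ties β to each class's pool size N_c via β_c = (N_c - 1)/N_c and compares the total effective samples gathered in one epoch of draws under instance-balanced versus class-balanced sampling; the class sizes and the budget are toy assumptions.

```python
# Hypothetical sketch: total effective samples under two sampling schemes.
# beta_c = (N_c - 1) / N_c ties the effective-number formula of Cui et al.
# (2019) to each class's pool size N_c; all numbers here are illustrative.

def effective_num(draws, pool_size):
    """Effective number of samples after `draws` draws from a pool of `pool_size`."""
    if pool_size <= 1:
        return float(min(draws, pool_size))
    beta = (pool_size - 1) / pool_size
    return (1.0 - beta ** draws) / (1.0 - beta)

class_sizes = [5000, 1000, 200, 40, 10]  # toy long-tailed dataset
budget = sum(class_sizes)                # one epoch worth of draws

# Instance-balanced sampling: expected draws per class match the class size.
inst = sum(effective_num(n, n) for n in class_sizes)

# Class-balanced sampling: the same budget split evenly across classes;
# tail classes saturate at their small pools, head classes go under-drawn.
per_class = budget // len(class_sizes)
bal = sum(effective_num(per_class, n) for n in class_sizes)

print(f"instance-balanced: ~{inst:.0f} effective samples")
print(f"class-balanced:    ~{bal:.0f} effective samples")
```

Under these toy numbers the instance-balanced total comes out larger, consistent with the observation that the first (representation) stage benefits from the original distribution, while class-balanced sampling remains preferable for the second (classifier) stage.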

