HOW SAMPLING AFFECTS TRAINING: AN EFFECTIVE SAMPLING THEORY STUDY FOR LONG-TAILED IMAGE CLASSIFICATION

Abstract

Long-tailed image classification has long been a challenging problem. Because of the imbalanced distribution of categories, many deep vision classification methods perform well on the head classes but poorly on the tail classes. This paper proposes an effective sampling theory that attempts to provide a theoretical explanation for decoupling representation learning and classifier learning in long-tailed image classification. To apply this theory in practice, a general jitter sampling strategy is proposed. Experiments show that a variety of long-tailed classification algorithms perform better when combined with the effective sampling theory. The code will be released soon.

1. INTRODUCTION

Image classification is a fundamental task in computer vision, and many deep learning methods have achieved gratifying results on artificially constructed datasets. However, when the class distribution is highly imbalanced, a classification model performs very well on head categories but usually gives inaccurate predictions for tail categories. This phenomenon occurs not only in image classification but also in other common vision tasks such as semantic segmentation He et al. (2021); Wang et al. (2020a) and object detection Ouyang et al. (2016); Li et al. (2020). Research on long-tailed classification mainly follows three directions: loss function re-weighting Cao et al. (2019), training data re-sampling Mahajan et al. (2018), and transfer learning at the embedding level Liu et al. (2020). The main idea for solving the imbalanced classification problem is to increase the training proportion of the tail categories so as to alleviate overfitting to the head ones. Kang et al. (2019) points out the strong dependence between representation learning for the backbone network and classifier learning for the last fully connected layer, and concludes that the best gradients for training the backbone and the classifier are obtained from the original sampling distribution and from a re-sampling distribution such as class-balanced sampling, respectively. Since then, two-stage optimization has gradually been accepted as the mainstream strategy. Xiang et al. (2020) further alleviates the strong dependence of a single-expert model on a specific training distribution, improving classification accuracy for both head and tail categories. Kang et al. (2019); Zhou et al. (2020) mention that the mainstream methods for long-tailed distributions require two-stage learning.
In the first stage, which learns the representation, sampling needs to be conducted from the original distribution; however, this phenomenon still lacks an ample theoretical explanation. Inspired by Cui et al. (2019), we observe that the number of effective samples and the actual number of samples do not grow synchronously in the first training stage, where the effective sample growth formula is given by Cui et al. (2019). Based on the concept of effective samples, we propose an expanded effective sampling theory. We report two important findings. The total number of effective samples is the primary factor affecting training on long-tailed distributions, and the second factor is the effective sample utilization. Accuracy on long-tailed distributions can be improved by maximizing the total number of effective samples and balancing the effective sample utilization among categories. The main contributions of this paper are as follows:
1. We build a complete theory of effective sampling, which can be used to study the properties of sampling with and without replacement, and from which the optimal sampling methods are derived.
2. A general jitter sampling strategy is proposed for practical application, and experiments on various public datasets have been carried out. The experimental results reach competitive performance and further support the core claim of our theory: the total number of effective samples is the core factor affecting the first learning stage, and equalizing effective samples among classes benefits model training.

2. RELATED WORK

Re-sampling. Re-sampling strategies redesign the sampling frequencies of different classes. Early ideas mainly focus on under-sampling the head classes and over-sampling the tail classes. Drummond et al. (2003) argues that over-sampling is better than under-sampling because the latter may lose important samples, while over-sampling the tail classes may in turn lead to over-fitting. Research on long-tailed classification mainly focuses on the above aspects. In addition, there are some theoretical studies on training strategies for long-tailed distributions. Kang et al. (2019) and Zhou et al. (2020) show an empirical law of long-tailed classification research, namely that representation learning and classifier learning are decoupled. Menon et al. (2020) points out that using Adam-type optimizers may not be conducive to training on long-tailed datasets. Cui et al. (2019) introduces the concept of the effective number of samples, based on the finding that the total number of non-repeated samples actually participating in training may not be as large as expected.

3. EFFECTIVE SAMPLING THEORY

Inspired by the concept of effective samples Cui et al. (2019), this paper proposes a hypothesis to explain effective sampling in the training process. We believe that the total number of effective samples is the primary factor in representation learning, and that the next factor is the utilization of effective samples between categories. The performance of representation learning can be improved by increasing the total number of effective samples and equalizing the effective sample utilization.

3.1. CONCEPT DESCRIPTION

Information redundancy between images arises during model training when objects have similar features. As the number of instances of a category increases, the probability of redundant samples usually increases, because the difficulty of collecting data varies across sources. For example, when collecting data for the cat category, the number of samples of hairless cats is probably much smaller than that of any haired breed. In addition, repeatedly sampling the same source from different angles generalizes worse than sampling separately from different sources of the same category. Redundancy causes category frequency and information content to grow asynchronously. For samples $x_1$, $x_2$ and an encoder $f_{\text{encoder}}$, if there are image substructures $s_1 \in x_1$ and $s_2 \in x_2$ with $\|f_{\text{encoder}}(s_1) - f_{\text{encoder}}(s_2)\| = \|z_1 - z_2\| < \delta$, then these samples are redundant. We then propose an updated concept of effective sampling, which refers to a sampling process that generates no new redundancy with respect to the existing dataset. Specifically, for a specific structure $a_i$ and category $k$, if a sample is drawn that belongs to category $k$ and produces no redundancy with $a_i$, then an effective sampling occurs and the total number of effective samples of category $k$ increases by 1. This sampling process is called effective sampling. The ratio of the number of effective samples to the total number of samples of a category is defined as the effective sample proportion. Based on the concepts of effective sampling and effective sample proportion, the effective sample theory is established. It studies how the category sampling distribution affects the actual training efficiency.
In this process, we define two quantities, the total number of effective samples and the utilization rate of effective samples, and then analyze them quantitatively under different sampling methods. Suppose there are $N$ samples in the dataset with $m$ class labels. The number of samples in each category is $(n_1, n_2, \dots, n_m)$, the effective sample proportion of each category is $(a_1, a_2, \dots, a_m)$, and the actual sampling frequency is set to $(u_1, u_2, \dots, u_m)$.
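To make the redundancy criterion concrete, the following is a minimal sketch rather than the authors' procedure. It assumes that each sample has already been mapped to a single embedding vector (the text defines redundancy on image substructures, which is simplified away here), and the function name `count_effective` and the toy data are hypothetical.

```python
import math

def count_effective(embeddings, delta):
    """Greedily count samples whose embedding lies at least `delta` away
    from every embedding already kept as effective."""
    kept = []
    for z in embeddings:
        # A draw is effective only if it is not redundant with any kept sample.
        if all(math.dist(z, k) >= delta for k in kept):
            kept.append(z)
    return len(kept)

# Toy 2-D embeddings for one category (assumed data):
# two near-duplicates and one clearly distinct sample.
zs = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]
n_effective = count_effective(zs, delta=0.1)
proportion = n_effective / len(zs)  # plays the role of a_i for this category
```

Under this simplification, the second embedding is redundant with the first, so only two of the three samples count as effective.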

3.1.1. SAMPLING WITH REPLACEMENT

Sampling with replacement means that a sample of any category, once drawn, can still be drawn again in a later iteration.

Total number of effective samples

Let $E_{i,n}$ be the expected number of effective samples of category $i$ after $n$ draws. It satisfies the following equation:

$$E_{i,n} = \frac{u_i \cdot \max(a_i n_i - E_{i,n-1}, 0)}{n_i} \cdot (E_{i,n-1} + 1) + \left(1 - \frac{u_i \cdot \max(a_i n_i - E_{i,n-1}, 0)}{n_i}\right) \cdot E_{i,n-1}$$

Simplifying the above formula (see the appendix on the effective sampling theory) gives

$$E_{i,n} = n_i a_i \left(1 - \left(1 - \frac{u_i}{n_i}\right)^{n}\right).$$

We denote the total number of effective samples of the overall dataset after $n$ draws by $S_n$, and we have

$$S_n = \sum_{j=1}^{m} a_j n_j (1 - w_j^{n}), \quad \text{where } w_j = 1 - \frac{u_j}{n_j}.$$

When $n$ is large enough, the analytical solution for the optimal $u_i$ satisfies

$$u_i = \frac{1 - \sum_{j \neq i} (1 - A_{ijn})\, n_j}{1 + \sum_{j \neq i} A_{ijn}\, \frac{n_j}{n_i}}, \qquad A_{ijn} = \left(\frac{a_i}{a_j}\right)^{\frac{1}{n}}.$$

This formula shows that the optimal sampling frequency is approximately equal to the class frequency of the original distribution when the number of draws is large enough, that is, $u_i \propto n_i$.
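As a sanity check on the closed form, the sketch below iterates the recursion directly and compares it with $n_i a_i (1 - (1 - u_i/n_i)^n)$, then evaluates $S_n$ for two sampling distributions. The class sizes, effective sample proportions, and function names are illustrative assumptions, not values from the paper.

```python
def effective_recursive(u, n_i, a, draws):
    """Iterate E_n = E_{n-1} + u * max(a*n_i - E_{n-1}, 0) / n_i, starting from E_0 = 0."""
    e = 0.0
    for _ in range(draws):
        e += u * max(a * n_i - e, 0.0) / n_i
    return e

def effective_closed_form(u, n_i, a, draws):
    """Closed form a * n_i * (1 - (1 - u/n_i)**draws)."""
    return a * n_i * (1.0 - (1.0 - u / n_i) ** draws)

def total_effective(us, ns, a_s, draws):
    """S_n: sum of the closed-form expectation over all classes."""
    return sum(effective_closed_form(u, n, a, draws) for u, n, a in zip(us, ns, a_s))

# Assumed toy long-tailed dataset: tail classes are less redundant.
ns = [10, 100, 1000]
a_s = [0.9, 0.5, 0.2]
N = sum(ns)
draws = 5000

original = [n / N for n in ns]         # u_i proportional to n_i
balanced = [1.0 / len(ns)] * len(ns)   # class-balanced sampling

s_original = total_effective(original, ns, a_s, draws)
s_balanced = total_effective(balanced, ns, a_s, draws)
```

In this toy setting the original distribution yields more effective samples than class-balanced sampling, which is the behaviour the formula above predicts for a large number of draws.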

Effective sample utilization

The effective sample utilization is defined as $R_{i,n} = \frac{E_{i,n}}{u_i n}$. It describes the proportion of effective samples among all samples drawn from category $i$ after $n$ draws. Under sampling with replacement, this expression simplifies to

$$R_{i,n} = \frac{a_i n_i (1 - w_i^{n})}{u_i n}.$$

Consider the ratio $Q_{i,j,n}$ of the effective sample utilization of any two classes $i$ and $j$:

$$Q_{i,j,n} = \frac{R_{i,n}}{R_{j,n}} = \frac{a_i n_i (1 - w_i^{n})\, u_j}{a_j n_j (1 - w_j^{n})\, u_i}.$$

When $n$ is large enough,

$$Q_{i,j,n} = \frac{R_{i,n}}{R_{j,n}} = \frac{a_i n_i u_j}{a_j n_j u_i}.$$

For sampling with replacement, balanced utilization therefore requires the sampling frequency to be approximately proportional to the product of the number of samples of class $i$ and its effective sample proportion:

$$u_i \propto n_i a_i.$$

3.1.2. SAMPLING WITHOUT REPLACEMENT

Sampling without replacement means that the data of each category are drawn according to the preset sampling frequency, and a sample, once drawn, does not return to its category set until the next epoch.

Total number of effective samples

Under sampling without replacement, $E_{i,n}$ satisfies the following equation:

$$E_{i,n} = u_i \min(E_{i,n-1} + a_i,\; a_i n_i) + (1 - u_i)\, E_{i,n-1}.$$

Simplifying the above formula (see the appendix on the effective sampling theory) gives $E_{i,n} = n u_i a_i$ before the class is exhausted, and

$$S_n = \sum_{j=1}^{m} \min(a_j u_j n,\; a_j n_j).$$

Sort the tuples $(a_i, n_i, u_i)$ in descending order of $a_i$ to obtain the sequence

$$(a_{x_1}, n_{x_1}, u_{x_1}),\ (a_{x_2}, n_{x_2}, u_{x_2}),\ \dots,\ (a_{x_m}, n_{x_m}, u_{x_m}).$$

When $n$ satisfies $\sum_{j=1}^{s} n_{x_j} \le n <$ [...] $u_i = \frac{n_i}{N}$. Under sampling without replacement, the total number of effective samples grows much faster than under sampling with replacement. Theoretically, it is the optimal sampling strategy for increasing the total number of effective samples.
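The growth-rate comparison can be illustrated numerically. The sketch below evaluates both $S_n$ formulas after one pass over an assumed toy dataset (so the number of draws equals $N$) with $u_i = n_i / N$; the numbers and function names are hypothetical.

```python
def total_without_replacement(us, ns, a_s, draws):
    """S_n = sum_j min(a_j * u_j * n, a_j * n_j)."""
    return sum(min(a * u * draws, a * n) for u, n, a in zip(us, ns, a_s))

def total_with_replacement(us, ns, a_s, draws):
    """S_n = sum_j a_j * n_j * (1 - (1 - u_j / n_j)**n)."""
    return sum(a * n * (1.0 - (1.0 - u / n) ** draws) for u, n, a in zip(us, ns, a_s))

# Assumed toy dataset.
ns = [10, 100, 1000]
a_s = [0.9, 0.5, 0.2]
N = sum(ns)
us = [n / N for n in ns]  # original distribution, u_i = n_i / N

s_without = total_without_replacement(us, ns, a_s, draws=N)
s_with = total_with_replacement(us, ns, a_s, draws=N)
upper_bound = sum(a * n for a, n in zip(a_s, ns))
```

After a single pass, sampling without replacement has already collected every effective sample in this example, while sampling with replacement is still well below the upper bound because some draws repeat earlier samples.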

Effective sample utilization

The effective sample utilization $R_{i,n}$ and the utilization ratio $Q_{i,j,n}$ can be expressed as follows:

$$R_{i,n} = \frac{\min(a_i u_i n,\; a_i n_i)}{u_i n} = \begin{cases} a_i, & \text{if } n < \frac{n_i}{u_i} \\ \frac{a_i n_i}{u_i n}, & \text{otherwise} \end{cases}$$

$$Q_{i,j,n} = \frac{R_{i,n}}{R_{j,n}} = \frac{n_i a_i u_j}{n_j a_j u_i} = 1; \qquad u_i \propto n_i a_i.$$

Under sampling without replacement, balanced utilization of effective samples among classes requires the sampling frequency $u_i$ to be proportional to the product of the number of samples of class $i$ and its effective sample proportion.
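The balance condition can be checked on a small example. This sketch assumes a toy dataset and a number of draws large enough that every class is exhausted; it sets $u_i \propto n_i a_i$ and confirms that all classes then share the same utilization. Names and values are illustrative only.

```python
def utilization_without_replacement(u, n_i, a, draws):
    """R_{i,n} = min(a*u*n, a*n_i) / (u*n)."""
    return min(a * u * draws, a * n_i) / (u * draws)

# Assumed toy dataset.
ns = [10, 100, 1000]
a_s = [0.9, 0.5, 0.2]
draws = 5000  # larger than n_i / u_i for every class below

weights = [n * a for n, a in zip(ns, a_s)]
total = sum(weights)
us = [w / total for w in weights]  # u_i proportional to n_i * a_i

rs = [utilization_without_replacement(u, n, a, draws) for u, n, a in zip(us, ns, a_s)]
spread = max(rs) - min(rs)  # zero up to rounding when utilization is balanced

# Before a class is exhausted, its utilization is simply a_i.
early = utilization_without_replacement(0.5, 100, 0.3, draws=10)
```

With the original distribution $u_i \propto n_i$ instead, the same computation gives utilizations proportional to $a_i$, which is the imbalance the jitter strategy later tries to reduce.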

3.1.3. SOLUTION

First, we find that the primary factor affecting training under long-tailed distributions is the total number of effective samples. By maximizing it, the encoder obtains gradients generated from samples with less redundant information, which increases training efficiency in the first stage of representation learning. Second, as training progresses, the actual utilization rate of individual effective samples diverges, which leads to different learning efficiency for different structures. Differences in these two factors between sampling methods further affect the final classification accuracy. Previous studies found that using the original sampling distribution for the first stage of training is more effective than class-balanced sampling Kang et al. (2019). Our theory supports this finding: one reason is that the total number of effective samples is near its maximum when the sampling frequency is simply proportional to the class frequency. However, according to the formulas above, maximizing the total number of effective samples requires $u_i \propto n_i$, while balancing the effective sample utilization between classes requires $u_i \propto n_i a_i$. Because of sample redundancy, the two can therefore never be achieved at the same time in theory. A reasonable trade-off is to first keep the total number of effective samples close to its maximum and then balance the utilization of effective samples between categories.

3.2. JITTER SAMPLING STRATEGY

The effective sampling theory reveals the contradiction between optimizing the total number of effective samples and optimizing the utilization of effective samples between categories, and directly estimating the exact redundancy of real-world category samples can also be difficult. Fortunately, the effective sample theory suggests that the optimal sampling frequency is actually close to the original sampling distribution. This implies that the maximum total number of effective samples can be approached as long as the distance between the sampling frequency and the original distribution stays within a controlled range, and the deviation from the original distribution makes it possible to balance the utilization of effective samples. Another reasonable assumption is that, for a given category, a higher sample frequency usually brings a lower effective sample proportion, which is explained in Appendix 4. Based on the above analysis, the jitter sampling strategy is proposed. We design a sampling schedule in which the sampling frequency fluctuates around the original sampling distribution, exploring through random walks how to maximize the total number of effective samples and balance the sample utilization. For sampling with replacement, we build a meta-dataloader that contains multiple sub-dataloaders. During each iteration, the meta-dataloader randomly selects one sub-dataloader with a preset probability and samples a data batch at a sampling frequency that approximates the original distribution. For sampling without replacement, a single dataloader is used to sample a data batch with a preset sample frequency close to the original distribution. In this process, we dynamically adjust the actual sampling distribution by introducing a control factor related to training time:

$$u_{i,t} \propto f(n_i, r); \quad r = g(t); \quad t: 0 \to 1$$

In the early stage of training, our strategy is relatively conservative and adopts a sampling strategy almost identical to the original distribution; it then gradually explores multiple sub-distributions as training progresses. In the appendix, we prove that if the hyperparameters are properly chosen, the jitter sampling strategy can perform better than the original strategy by trading off the two key factors above.

3.2.1. SAMPLING WITH REPLACEMENT

Under sampling with replacement, two different strategies are proposed. The first method mainly controls the range of variation of the sampling frequency through a temperature (the jitter factor), while the selection probability of each sub-dataloader is fixed. In the second method, we use the fixed original sampling distribution and the reverse sampling distribution for the two sub-dataloaders, and complete the actual sampling process by dynamically adjusting their selection probabilities. (See Appendix 1 for more details on the effectiveness proof.)

Method 1. Three sub-dataloaders are initialized with varying sampling frequencies as follows:

dataloader 1: $(u_1, u_2, \dots, u_m)$
dataloader 2: $(u_1^{1+\delta t}, u_2^{1+\delta t}, \dots, u_m^{1+\delta t})$
dataloader 3: $(u_1^{1-\delta t}, u_2^{1-\delta t}, \dots, u_m^{1-\delta t})$

where $u_i \propto n_i$ and $\delta t$ is the jitter factor, which is updated at each epoch. The rule to update $\delta t$ is:

$$\delta t = \text{random}(0, 1) \cdot \alpha \cdot \max\left(\left(\frac{\text{epoch}}{\text{epoch}_{\text{total}}}\right)^{\beta},\ \gamma\right)$$

Each draw selects one of the sub-dataloaders with a preset probability $[p_1, p_2, p_3]$. (See Appendix 2 for more details on the effectiveness proof.)

Method 2. Two sub-dataloaders are initialized in accordance with:

dataloader 1: $(u_1, u_2, \dots, u_m)$
dataloader 2: $(u_1^{-1}, u_2^{-1}, \dots, u_m^{-1})$

where $u_i \propto n_i$. The jitter factor $\delta t$ is updated at the start of each epoch with the following rule:

$$\delta t = \alpha \cdot \max\left(\left(\frac{\text{epoch}}{\text{epoch}_{\text{total}}}\right)^{\beta},\ \gamma\right)$$

Each draw selects one of the two dataloaders with probability $[1 - \delta t,\ \delta t]$. The first jitter sampling method is more general, and we demonstrate its effectiveness in detail in the appendix. The second jitter method cannot theoretically guarantee that the total number of effective samples is maximized, but it can be considered if one assumes that, in real datasets, categories with more sample instances have greater redundancy.

For sampling without replacement, we use only one dataloader:

dataloader 0: $(u_1^{1+\delta t}, u_2^{1+\delta t}, \dots, u_m^{1+\delta t})$, $u_i \propto n_i$

where $\delta t$ varies as follows:

$$\delta t = \text{random}(-0.5, 0.5) \cdot \alpha \cdot \max\left(\left(\frac{\text{epoch}}{\text{epoch}_{\text{total}}}\right)^{\beta},\ \gamma\right)$$

For the actual sampling, a separate queue is maintained for each category, initially filled with the samples of that category in random order. In each epoch, $u_i \cdot N$ samples are drawn from class $i$, and when a class's sample queue is emptied, it is refilled in a new random order. When a new epoch starts, the dataloader first draws the samples still in the queue from the previous round until all of them have been drawn. The purpose is to avoid sampling with replacement across epochs. In theory, jittering under sampling without replacement loses some effective samples, but it also increases the probability of equalizing the utilization of effective samples; we prove this in detail in the appendix.

43.9 †De-confound [Tang et al. (2020)] 47.3 †De-confound-TDE [Tang et al. (2020)] 48.3 #* RIDE(4experts+reduce) [Wang et al. (2020b)] 49.5 #* RIDE(4experts) [Wang et al. (2020b)] 50 # TLC(4experts) [Li et al. (2022)]

Evaluation Metrics. In long-tailed learning, the overall performance on all classes and the performance on head, middle, and tail classes are usually reported. This paper reports the overall performance on all classes: we average the class-specific accuracy and use the averaged accuracy as the metric.
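The two methods for sampling with replacement can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the powered frequencies are renormalised to sum to one (as in the appendix definition $u_i(x) = n_i^x / \sum_k n_k^x$), the helper names are hypothetical, and the selection probabilities `[0.5, 0.25, 0.25]` are an arbitrary example of $[p_1, p_2, p_3]$.

```python
import random

def jitter_factor(epoch, epoch_total, alpha, beta, gamma, rng=None):
    """alpha * max((epoch/epoch_total)**beta, gamma), optionally scaled by random(0, 1).
    Pass an `rng` for Method 1; leave it as None for Method 2."""
    base = alpha * max((epoch / epoch_total) ** beta, gamma)
    return base if rng is None else rng.random() * base

def power_frequencies(counts, exponent):
    """Normalised class sampling frequencies proportional to n_i ** exponent."""
    powered = [c ** exponent for c in counts]
    total = sum(powered)
    return [p / total for p in powered]

def method1_frequencies(counts, delta_t):
    """Sub-dataloaders with exponents 1, 1 + delta_t and 1 - delta_t."""
    return [power_frequencies(counts, e) for e in (1.0, 1.0 + delta_t, 1.0 - delta_t)]

def method2_frequencies(counts):
    """Original distribution and reverse (inverse-frequency) distribution."""
    return [power_frequencies(counts, 1.0), power_frequencies(counts, -1.0)]

def pick_loader(probabilities, rng):
    """Choose the index of the sub-dataloader used for the next batch."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

rng = random.Random(0)
counts = [1000, 100, 10]  # assumed class sizes, head to tail

# Method 1: random jitter factor, fixed selection probabilities.
delta_1 = jitter_factor(100, 200, alpha=0.05, beta=2, gamma=0.01, rng=rng)
loaders_1 = method1_frequencies(counts, delta_1)
chosen_1 = pick_loader([0.5, 0.25, 0.25], rng)

# Method 2: deterministic jitter factor used as the selection probability.
delta_2 = jitter_factor(350, 500, alpha=1.0, beta=1.5, gamma=0.0)
loaders_2 = method2_frequencies(counts)
chosen_2 = pick_loader([1.0 - delta_2, delta_2], rng)
```

In a training loop, the chosen index would decide which class-frequency vector drives the batch sampler for that iteration; how batches are then assembled is left open here.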
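The class-averaged accuracy used as the metric can be computed as in the short sketch below. The function name and the toy labels are assumptions for illustration; classes that never appear in the ground truth are simply ignored.

```python
from collections import defaultdict

def mean_per_class_accuracy(labels, predictions):
    """Average of the per-class accuracies over all classes present in `labels`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example: class 0 is the head class, class 2 the tail class.
labels      = [0, 0, 0, 0, 1, 1, 2]
predictions = [0, 0, 0, 1, 1, 0, 2]
score = mean_per_class_accuracy(labels, predictions)
plain_accuracy = sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)
```

Unlike plain accuracy, this metric weights every class equally, so errors on tail classes are not masked by the head classes.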

4. EXPERIMENTS

The experiments cover two aspects. First, we evaluate on major long-tailed classification benchmarks, mainly to verify the practical effectiveness of the proposed jitter sampling strategy. For a fair comparison, we follow the RIDE Wang et al. (2020b) ensemble learning framework. Second, we conduct ablation studies on the effectiveness of each component.

4.2. EVALUATION RESULTS

Experiments for single-expert model

CIFAR100-LT and CIFAR10-LT are used as the experimental datasets, and ResNet-32 He et al. (2016) is selected as the backbone. To improve classification accuracy without greatly increasing the amount of computation, we use a specially designed module, called group L2norm, which expands the features before the original fully connected layer of the classifier by applying L2 normalization within pre-designed groups. The batch size is set to 256, and SGD is used as the optimizer. Training runs for 500 epochs; the learning rate is initialized to 0.5, with a learning rate decay of 0.01 at epochs 350 and 450. The number of warmup epochs is set to 5. The jitter strategy uses the second method for sampling with replacement. In the first stage (epochs 0-350), we set α = 1 and β = 1.5, and in the second stage (epochs 350-500), we directly set δt = 0.5. Cross-entropy is used as the loss function. The motivation for group L2norm is to increase the richness of features and to ensure that the output features of each group have a controllable norm, so as to avoid the phenomenon that the neural network reduces the loss by simply increasing the feature norm while the classification boundaries are not well optimized. The disadvantage is that longer training is required when the norm module is added to an existing network. Table 2 shows that our proposed method (J-sampling + group L2norm + long training) surpasses most current methods with a single ResNet-32 backbone (it is slightly below De-confound-TDE), with an acceptable increase in computational complexity.

Experiments for multi-expert model

The jitter sampling strategy combined with RIDE is compared with the current state-of-the-art ensemble learning methods on CIFAR100-LT and CIFAR10-LT. In this group of experiments, the jitter sampling strategy for sampling with replacement is adopted. The experimental settings are as follows: the batch size is set to 128, the number of training epochs is set to 200, and the learning rate is initially set to 0.1 with a decay rate of 0.01 at epochs 160 and 180. The number of warmup epochs is set to 5. In the first stage (epochs 0-160), we set α = 0.05, β = 2, γ = 0.01, and select cross-entropy as the loss function. In the second stage, the LDAM loss is adopted. Here we remove the group L2norm feature enhancement, because the L2norm operation duplicates the role of the NormFC module of the classifier in the original RIDE. Table 1 shows that on the CIFAR100-LT dataset, the single-model results are better than most existing results (slightly below De-confound-TDE), and the multi-model results are better than the existing results (in particular, J-sampling (ours) + RIDE (4 experts) is 0.6 points higher than RIDE (4 experts)). Table 2 shows that on the CIFAR10-LT dataset, J-sampling (ours) + RIDE (4 experts) outperforms all existing algorithms.

† denotes results copied from the original papers. * denotes results reported by the public code with sampling with replacement. # denotes results reported by multi-expert methods.

4.3. ABLATION STUDIES

In this subsection, we conduct ablation studies on the effectiveness of each component, including the replacement strategy, the jitter strategy, and the training time. We further discuss the two key factors that affect representation learning: the total number of effective samples and the effective sample utilization. The basic experimental setting is the same as in the experiments for the multi-expert model.

Comparison of the jitter strategy. We compare jitter with and without replacement sampling on CIFAR100-LT and ImageNet-LT. We have proved the effect of jitter theoretically in the appendix, and the experiments also confirm that adding jitter to the sampling frequency can improve accuracy with a certain probability. For sampling without replacement, although the total number of effective samples theoretically decreases slightly, the utilization of effective samples becomes more balanced at the same time. One possible reason why there is no obvious accuracy improvement on ImageNet-LT, compared with CIFAR100-LT, is that the images in ImageNet-LT are larger than those in the CIFAR datasets, so the redundancy between images is not high (the foreground occupies only part of the image, while the background differs noticeably between instances).


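A.1 APPENDIX 1

Consider the derivative of $G$ with respect to $x$, where $G(x)$ denotes the total number of effective samples $S_n$ under sampling with replacement when the sampling frequencies are $u_j(x) = n_j^{x} / \sum_{k=1}^{m} n_k^{x}$:

$$\frac{\partial G}{\partial x} = -\sum_{j=1}^{m} a_j n_j \cdot n \left(1 - \frac{u_j(x)}{n_j}\right)^{n-1} \cdot \left(-\frac{1}{n_j}\right) \cdot \frac{\partial u_j(x)}{\partial x} = \sum_{j=1}^{m} a_j n \left(1 - \frac{u_j(x)}{n_j}\right)^{n-1} \cdot \frac{n_j^{x} \ln n_j \sum_{k=1}^{m} n_k^{x} - n_j^{x} \sum_{k=1}^{m} (n_k^{x} \ln n_k)}{\left(\sum_{k=1}^{m} n_k^{x}\right)^{2}}$$

When $x = 1$:

$$\left.\frac{\partial G}{\partial x}\right|_{x=1} = n \left(1 - \frac{1}{N}\right)^{n-1} \cdot \sum_{j=1}^{m} a_j \cdot \frac{n_j \ln n_j \sum_{k=1}^{m} n_k - n_j \sum_{k=1}^{m} (n_k \ln n_k)}{\left(\sum_{k=1}^{m} n_k\right)^{2}}$$

If $\left.\frac{\partial G}{\partial x}\right|_{x=1} = 0$, then

$$\sum_{j=1}^{m} a_j n_j \ln n_j \sum_{k=1}^{m} n_k = \sum_{j=1}^{m} a_j n_j \sum_{k=1}^{m} (n_k \ln n_k).$$

This usually does not hold. It is worth mentioning that when $a_1 = a_2 = \dots = a_m$, that is, when the redundancy of each class of the dataset is the same, the above equation holds. Therefore, normally

$$\left.\frac{\partial G}{\partial x}\right|_{x=1} > 0 \quad \text{or} \quad \left.\frac{\partial G}{\partial x}\right|_{x=1} < 0.$$

Then there must exist $\delta t$ such that $G(1 + \delta t) > G(1)$ or $G(1 - \delta t) > G(1)$. So a probability combination $(p_1, p_2, p_3)$ must exist such that $p_1 G(1) + p_2 G(1 + \delta t) + p_3 G(1 - \delta t) > G(1)$. In summary, we prove that there must exist parameters $\delta t$ and $(p_1, p_2, p_3)$ for which the total number of effective samples obtained with jitter method 1 under sampling with replacement is greater than that obtained with the original sampling. When $n$ is sufficiently large, the effective sample utilization between classes follows the same formula as under sampling without replacement. The sample utilization between classes is balanced in this case, as we will prove in Appendix 2.

Proof 2: In sampling with replacement, the utilization of effective samples between classes obtained by jitter method 2 is more balanced than that of the original sampling with replacement. The effective sample utilization for the $i$-th category is $R_{i,n}$. In method 2 of sampling with replacement, with $x$ the probability of selecting the original dataloader, when $n$ is large enough it is:

$$R_{i,n} = x \cdot \frac{a_i n_i (1 - w_i^{n})}{u_i n} + (1 - x) \cdot \frac{a_i n_i (1 - \bar{w}_i^{n})}{\bar{u}_i n} \approx x \cdot \frac{a_i n_i}{u_i n} + (1 - x) \cdot \frac{a_i n_i}{\bar{u}_i n}$$

where $w_i = 1 - \frac{u_i}{n_i}$, $\bar{w}_i = 1 - \frac{\bar{u}_i}{n_i}$, and $\bar{u}_i = \frac{u_i^{-1}}{\sum_{j=1}^{m} u_j^{-1}}$.

The derivative of $R_{i,n}$, with $\bar{u}_i = u_i^{-1} / \sum_{j=1}^{m} u_j^{-1}$ the reverse sampling frequency, is:

$$\frac{\partial R_{i,n}}{\partial x} = \frac{a_i n_i}{u_i n} - \frac{a_i n_i}{\bar{u}_i n}$$

A more balanced effective sample utilization corresponds to a smaller KL divergence between the distribution of category effective sample utilization $P = \left\{\frac{R_{1,n}}{\sum_k R_{k,n}}, \dots, \frac{R_{m,n}}{\sum_k R_{k,n}}\right\}$ and the uniform distribution $Q = \{\underbrace{\tfrac{1}{m}, \dots, \tfrac{1}{m}}_{m}\}$:

$$KL(Q \| P) = \sum_{i=1}^{m} Q_i \log \frac{Q_i}{P_i} = -\log m - \frac{1}{m} \sum_{i=1}^{m} \log R_{i,n} + \log \sum_{i=1}^{m} R_{i,n}$$

Consider the derivative of the KL divergence:

$$\frac{\partial KL}{\partial x} = -\frac{1}{m} \sum_{i=1}^{m} \frac{\partial R_{i,n} / \partial x}{R_{i,n}} + \frac{\sum_{i=1}^{m} \partial R_{i,n} / \partial x}{\sum_{i=1}^{m} R_{i,n}}$$

When $x = 1$, $R_{i,n} = \frac{a_i n_i}{u_i n}$, and using $u_i \propto n_i$ (so that $n_i / u_i = N$ for every class):

$$\left.\frac{\partial KL}{\partial x}\right|_{x=1} = -\frac{1}{m} \sum_{i=1}^{m} \frac{\frac{a_i n_i}{u_i n} - \frac{a_i n_i}{\bar{u}_i n}}{\frac{a_i n_i}{u_i n}} + \frac{\sum_{i=1}^{m} \left(\frac{a_i n_i}{u_i n} - \frac{a_i n_i}{\bar{u}_i n}\right)}{\sum_{i=1}^{m} \frac{a_i n_i}{u_i n}} = \frac{1}{m} \sum_{i=1}^{m} \frac{u_i}{\bar{u}_i} - \frac{\sum_{j=1}^{m} \frac{a_j n_j}{\bar{u}_j n}}{\sum_{j=1}^{m} \frac{a_j n_j}{u_j n}} = \sum_{k=1}^{m} u_k^{-1} \cdot \left(\frac{1}{m} \sum_{i=1}^{m} u_i^{2} - \frac{\sum_{i=1}^{m} a_i u_i^{2}}{\sum_{j=1}^{m} a_j}\right)$$

From Appendix 4, the proportion of effective samples is low for categories with a large sample size, which means that the larger $u_i$ is, the smaller $a_i$ is. Therefore,

$$\frac{1}{m} \sum_{i=1}^{m} u_i^{2} - \frac{\sum_{i=1}^{m} a_i u_i^{2}}{\sum_{j=1}^{m} a_j} > 0.$$

That means $\left.\frac{\partial KL}{\partial x}\right|_{x=1} > 0$. Therefore, there exists $\delta t \in (0, 1)$ such that $KL(1 - \delta t) < KL(1)$. In other words, jitter method 2 can make the effective sample utilization more balanced.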
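The claim of Proof 2 can be checked numerically on a toy example. The sketch below uses the large-$n$ approximation of $R_{i,n}$ from the proof, assumes a dataset in which larger classes have smaller effective sample proportions, and compares $KL(Q \| P)$ at $x = 1$ (original sampling only) with $x = 0.99$. All names and numbers are illustrative assumptions.

```python
import math

def mixed_utilization(x, ns, a_s, draws):
    """Large-n approximation R_i = x*a_i*n_i/(u_i*n) + (1-x)*a_i*n_i/(u_rev_i*n)."""
    N = sum(ns)
    u = [n / N for n in ns]                           # original distribution
    inv_total = sum(1.0 / ui for ui in u)
    u_rev = [(1.0 / ui) / inv_total for ui in u]      # reverse distribution
    return [x * a * n / (ui * draws) + (1.0 - x) * a * n / (uri * draws)
            for a, n, ui, uri in zip(a_s, ns, u, u_rev)]

def kl_uniform_to(rs):
    """KL(Q || P) with Q uniform and P the normalised utilizations."""
    m = len(rs)
    total = sum(rs)
    return sum((1.0 / m) * math.log((1.0 / m) / (r / total)) for r in rs)

# Assumed toy dataset: larger classes are more redundant (smaller a_i).
ns = [10, 100, 1000]
a_s = [0.9, 0.5, 0.2]

kl_original = kl_uniform_to(mixed_utilization(1.0, ns, a_s, draws=10**6))
kl_jittered = kl_uniform_to(mixed_utilization(0.99, ns, a_s, draws=10**6))
```

In this example a small amount of mixing with the reverse distribution lowers the KL divergence, in line with the sign of the derivative at $x = 1$; the size of the effect depends on the assumed $a_i$.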

A.2 APPENDIX 2

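The jitter method in sampling without replacement is effective.

Proof: In sampling without replacement, the effective sample utilization with jitter is more balanced than without jitter. The effective sample utilization for the $i$-th class is:

$$R_{i,n} = \frac{a_i n_i}{u_i(x)\, n}$$

A more balanced effective sample utilization corresponds to a smaller KL divergence between the distribution of category effective sample utilization $P = \left\{\frac{R_{1,n}}{\sum_k R_{k,n}}, \dots, \frac{R_{m,n}}{\sum_k R_{k,n}}\right\}$ and the uniform distribution $Q = \{\underbrace{\tfrac{1}{m}, \dots, \tfrac{1}{m}}_{m}\}$:

$$KL(Q \| P) = \sum_{i=1}^{m} Q_i \log \frac{Q_i}{P_i} = \frac{1}{m} \sum_{i=1}^{m} \log \frac{1}{m P_i} = -\log m - \frac{1}{m} \sum_{i=1}^{m} \log P_i$$

Under review as a conference paper at ICLR 2023

Consider the sampling rate function $u_i(x)$:

$$u_i(x) = \frac{n_i^{x}}{\sum_{k=1}^{m} n_k^{x}}$$

For $x = 1 + \delta t$ and $\delta t > 0$, the sampling rate increases for the head classes and decreases for the tail classes. We have $n < \frac{n_i}{u_i(x)}$ for tail classes. So the effective sample utilization of tail classes is $R_{i,n} = a_i$, and the effective sample utilization of head classes is $R_{i,n} = \frac{a_i n_i}{u_i(x)\, n}$. Without loss of generality, let the $m$-th class have the largest number of samples. There exists a sufficiently small $\delta t$ such that $R_{m,n} = \frac{a_m n_m}{u_m(x)\, n}$ and the utilization of every other class is $a_i$.

So the KL divergence is:

$$KL(Q \| P) = -\log m - \frac{1}{m} \left[\sum_{i=1}^{m-1} \log \frac{a_i}{\sum_{k=1}^{m-1} a_k + \frac{a_m n_m}{u_m(x) n}} + \log \frac{\frac{a_m n_m}{u_m(x) n}}{\sum_{k=1}^{m-1} a_k + \frac{a_m n_m}{u_m(x) n}}\right] = -\log m - \frac{1}{m} \left[\sum_{i=1}^{m-1} \log a_i + \log \frac{a_m n_m}{u_m(x) n} - m \log \left(\sum_{k=1}^{m-1} a_k + \frac{a_m n_m}{u_m(x) n}\right)\right]$$

The derivative of the KL divergence with respect to $x$ is:

$$\frac{\partial KL}{\partial x} = -\frac{1}{m} \left[-\frac{u_m(x) n}{a_m n_m} \cdot \frac{a_m n_m}{u_m^{2}(x) n} \cdot \frac{\partial u_m(x)}{\partial x} + m \cdot \frac{1}{\sum_{k=1}^{m-1} a_k + \frac{a_m n_m}{u_m(x) n}} \cdot \frac{a_m n_m}{u_m^{2}(x) n} \cdot \frac{\partial u_m(x)}{\partial x}\right] = \frac{1}{m} \left[\frac{u_m(x) n}{a_m n_m} - \frac{m}{\sum_{k=1}^{m-1} a_k + \frac{a_m n_m}{u_m(x) n}}\right] \cdot \frac{a_m n_m}{u_m^{2}(x) n} \cdot \frac{\partial u_m(x)}{\partial x}$$

where $\frac{\partial u_m(x)}{\partial x}$ is:

$$\frac{\partial u_m(x)}{\partial x} = \frac{n_m^{x} \ln n_m \cdot \sum_{k=1}^{m} n_k^{x} - n_m^{x} \sum_{k=1}^{m} (n_k^{x} \ln n_k)}{\left(\sum_{k=1}^{m} n_k^{x}\right)^{2}}$$

When $x = 1$ (taking $n = N$, so that $u_m(1)\, n = n_m$), we have:

$$\left.\frac{\partial KL}{\partial x}\right|_{x=1} = \frac{1}{m} \left[\frac{1}{a_m} - \frac{m}{\sum_{k=1}^{m} a_k}\right] \cdot \frac{a_m N}{n_m} \cdot \frac{n_m \ln n_m \cdot \sum_{k=1}^{m} n_k - n_m \cdot \sum_{k=1}^{m} (n_k \ln n_k)}{\left(\sum_{k=1}^{m} n_k\right)^{2}}$$

Since $n_m$ is the largest and $a_m$ is the smallest (from Appendix 4), we have

$$\left.\frac{\partial KL}{\partial x}\right|_{x=1} < 0.$$

So when $\delta t > 0$ is close to 0, the jittering of $u_i(1 + \delta t)$ causes $KL(Q \| P)$ to drop, that is, the effective sample utilization between classes is more balanced. Similarly, it can be shown that when $\delta t > 0$ is close to 0, the jittering of $u_i(1 - \delta t)$ makes the effective sample utilization between classes more balanced.

A.3 APPENDIX 3

A.3.1 SAMPLING WITH REPLACEMENT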

Total number of effective samples

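Define the expected number of effective samples obtained for the $i$-th class after $n$ draws as $E_{i,n}$, whose recursive formula is

$$E_{i,n} = \frac{u_i \cdot \max(a_i n_i - E_{i,n-1}, 0)}{n_i} \cdot (E_{i,n-1} + 1) + \left(1 - \frac{u_i \cdot \max(a_i n_i - E_{i,n-1}, 0)}{n_i}\right) \cdot E_{i,n-1}$$

When the total number of effective samples reaches $a_i n_i$, no new effective samples are added. Therefore, to simplify the discussion, we consider the case where the upper bound has not yet been reached. Simplifying the above equation yields:

$$E_{i,n} = a_i u_i + \left(1 - \frac{u_i}{n_i}\right) E_{i,n-1}$$

Let $w_i = 1 - \frac{u_i}{n_i}$. Dividing both sides by $w_i^{n}$ and summing over the recursion gives

$$\frac{E_{i,n}}{w_i^{n}} = \frac{a_i u_i}{w_i^{n}} + \frac{E_{i,n-1}}{w_i^{n-1}}, \qquad \frac{E_{i,n}}{w_i^{n}} = \sum_{j=1}^{n} \frac{a_i u_i}{w_i^{j}} + E_{i,0},$$

and, with $E_{i,0} = 0$,

$$E_{i,n} = a_i u_i \cdot \frac{w_i^{n} - 1}{w_i - 1} = a_i n_i (1 - w_i^{n}).$$

We denote the total number of effective samples of the overall dataset after $n$ draws by $S_n$, and we have:

$$S_n = \sum_{j=1}^{m} a_j n_j (1 - w_j^{n})$$

We introduce a Lagrange multiplier and solve for the conditions satisfied when $S_n$ reaches its extreme value under the constraint $\sum_i u_i = 1$:

$$L(u_1, \dots, u_m, \lambda) = S_n + \lambda \left(1 - \sum_{i=1}^{m} u_i\right)$$

Its derivative is:

$$\frac{\partial L}{\partial u_i} = a_i n \left(1 - \frac{u_i}{n_i}\right)^{n-1} - \lambda$$

Setting $\frac{\partial L}{\partial u_i} = 0$ for every class gives $\frac{\partial L}{\partial u_i} = \frac{\partial L}{\partial u_j}$, that is, $a_i \left(1 - \frac{u_i}{n_i}\right)^{n-1} = a_j \left(1 - \frac{u_j}{n_j}\right)^{n-1}$. For large $n$, replacing the exponent $\frac{1}{n-1}$ by $\frac{1}{n}$:

$$\left(\frac{a_i}{a_j}\right)^{1/n} = \frac{1 - \frac{u_j}{n_j}}{1 - \frac{u_i}{n_i}}$$

When $n$ is large enough, the analytic solution of $u_i$ satisfies the following equation:

$$u_i = \frac{1 - \sum_{j \neq i} n_j (1 - A_{i,j,n})}{1 + \sum_{j \neq i} n_j A_{i,j,n} / n_i}, \qquad A_{i,j,n} = \left(\frac{a_i}{a_j}\right)^{\frac{1}{n}}$$

This equation shows that, in sampling with replacement, the optimal sampling frequency is approximately equal to the category frequency of the original distribution when the number of draws is large enough. This also implies that we can theoretically get close to the upper limit of the total number of effective samples by using a sampling ratio that approximates the original distribution. The optimal sampling rate $u_i$ satisfies:

$$u_i \propto n_i$$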
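The limiting behaviour of the analytic solution can be checked with a few lines of code. This sketch evaluates the formula for $u_i$ with an assumed toy dataset and a very large number of draws, and compares the result with $n_i / N$. The function name and the numbers are hypothetical.

```python
def optimal_frequency(i, ns, a_s, draws):
    """u_i = (1 - sum_{j!=i} n_j (1 - A_ij)) / (1 + sum_{j!=i} n_j A_ij / n_i),
    with A_ij = (a_i / a_j) ** (1 / draws)."""
    numerator = 1.0
    denominator = 1.0
    for j, (n_j, a_j) in enumerate(zip(ns, a_s)):
        if j == i:
            continue
        A = (a_s[i] / a_j) ** (1.0 / draws)
        numerator -= n_j * (1.0 - A)
        denominator += n_j * A / ns[i]
    return numerator / denominator

# Assumed toy dataset.
ns = [10, 100, 1000]
a_s = [0.9, 0.5, 0.2]
N = sum(ns)

us_large_n = [optimal_frequency(i, ns, a_s, draws=10**9) for i in range(len(ns))]
original = [n / N for n in ns]
max_gap = max(abs(u - o) for u, o in zip(us_large_n, original))
```

For a moderate number of draws the same formula moves probability mass toward classes with a larger $a_i$, so how close the optimum is to the original distribution depends on how large $n$ is relative to the class sizes.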

Effective sample utilization

The effective sample utilization is defined as follows:

$$R_{i,n} = \frac{E_{i,n}}{u_i n}$$

In sampling with replacement, this expression simplifies to:

$$R_{i,n} = \frac{a_i n_i (1 - w_i^{n})}{u_i n}, \quad \text{where } w_i = 1 - \frac{u_i}{n_i}$$

Consider the ratio $Q_{i,j}$ of the effective sample utilization of any two classes:

$$Q_{i,j} = \frac{R_{i,n}}{R_{j,n}} = \frac{a_i n_i u_j (1 - w_i^{n})}{a_j n_j u_i (1 - w_j^{n})}$$

When $n$ is sufficiently large, $Q_{i,j}$ satisfies:

$$Q_{i,j} = \frac{a_i n_i \cdot u_j}{a_j n_j \cdot u_i}$$

Therefore, in sampling with replacement, the effective sample utilization is balanced when the sampling frequency is approximately proportional to the effective number of samples of the class. That is:

$$u_i \propto a_i n_i$$

A.3.2 SAMPLING WITHOUT REPLACEMENT

Total number of effective samples

In sampling without replacement, $E_{i,n}$ satisfies the following relation:

$$E_{i,n} = u_i \min(E_{i,n-1} + a_i \cdot 1,\; a_i n_i) + (1 - u_i) E_{i,n-1}$$

Simplifying (before the upper bound $a_i n_i$ is reached) gives:

$$E_{i,n} = a_i u_i n$$

Therefore $S_n$ satisfies:

$$S_n = \sum_{j=1}^{m} \min(a_j u_j n,\; a_j n_j)$$
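As an illustration of the two expressions for $S_n$, the sketch below evaluates the without-replacement total $\sum_j \min(a_j u_j n, a_j n_j)$ and the with-replacement total $\sum_j a_j n_j (1 - (1 - u_j/n_j)^n)$ for the same sampling rates. The class sizes, the values of $a_j$, and the function names are hypothetical and chosen only to make the comparison concrete.

```python
def total_without_replacement(a, u, counts, draws):
    """S_n = sum_j min(a_j * u_j * n, a_j * n_j)."""
    return sum(min(aj * uj * draws, aj * nj) for aj, uj, nj in zip(a, u, counts))


def total_with_replacement(a, u, counts, draws):
    """S_n = sum_j a_j * n_j * (1 - (1 - u_j / n_j)^n)."""
    return sum(
        aj * nj * (1.0 - (1.0 - uj / nj) ** draws)
        for aj, uj, nj in zip(a, u, counts)
    )


# Hypothetical long-tailed setting: u_j proportional to n_j, one pass (n = N).
counts = [5000, 500, 50]
a = [0.9, 0.8, 0.7]
N = sum(counts)
u = [c / N for c in counts]

s_without = total_without_replacement(a, u, counts, draws=N)
s_with = total_with_replacement(a, u, counts, draws=N)
upper_bound = sum(aj * nj for aj, nj in zip(a, counts))

print(f"without replacement: {s_without:.1f}")
print(f"with replacement:    {s_with:.1f}")
print(f"upper bound:         {upper_bound:.1f}")
```

Under these assumptions one pass without replacement reaches the upper bound $\sum_j a_j n_j$, while the same number of draws with replacement stays below it, since $1 - (1 - x)^n \le nx$ for $0 \le x \le 1$.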



Chawla et al. (2002), Han et al. (2005), and He et al. (2008) address the above problem by generating new tail-class data through interpolation. However, imprecise interpolation may also introduce new noise. The processes of representation learning and classifier learning should be decoupled, each with its own suitable distribution Kang et al. (2019); Zhou et al. (2020).

Re-weighting Re-weighting refers to assigning different weights to the loss terms of the corresponding classes. Early studies set the weight according to the reciprocal of the sample frequency Huang et al. (2016); Wang et al. (2017). Re-weighting by the number of effective samples of each class is utilized in Mahajan et al. (2018); Mikolov et al. (2013). LDAM Cao et al. (2019) adopts a loss determined by the distance to the classification decision boundary, where categories with larger magnitude are closer to the decision boundary. A meta-learning based method Jamal et al. (2020) is also used for better weight estimation. Zhang et al. (2021a) considers the difficulty and total number of the data to determine loss weights. In addition, some methods based on difficult samples Zhang et al. (2021a) and logit adjustment Menon et al. (2020) also belong to re-weighting.

Transfer learning Transfer learning attempts to transfer knowledge from a source domain to enhance performance on the target domain Zhang et al. (2021b). BBN Zhou et al. (2020) is trained on the original distribution in the early steps and later shifts to a class-balanced distribution to optimize the classifier. LEAP Liu et al. (2020) constructs a "feature cloud" for tail classes, transferred from head-class features, to better support the classification boundaries. LFME Xiang et al. (2020) trains multi-expert models separately on multiple sub-datasets and produces a student model through knowledge distillation. RIDE Wang et al. (2020b) uses dynamic routing to control the number of experts involved.

$\sum_{j=1}^{s+1} n_{x_j}$, $S_n$ to reach its maximum value, on which condition the sampling frequency satisfies the following equation:

$$u_{x_1} = \frac{n_{x_1}}{n}, \;\ldots,\; u_{x_s} = \frac{n_{x_s}}{n}, \quad u_{x_{s+1}} = \frac{n - \sum_{j=1}^{s} n_{x_j}}{n}, \quad u_{x_{s+2}} = 0, \;\ldots,\; u_{x_m} = 0$$

Obviously, $S_n$ attains the maximum value if and only if $\sum_{j=1}^{m} n_j = n = N$, where

4.1 EXPERIMENTAL SETUP

Dataset The commonly used long-tailed benchmark datasets are selected: CIFAR10-LT, CIFAR100-LT Cao et al. (2019) and ImageNet-LT Liu et al. (2019), which are sampled from CIFAR10 Krizhevsky et al. (2009), CIFAR100 Krizhevsky et al. (2009) and ImageNet Deng et al. (2009), respectively. The imbalance ratios of CIFAR10-LT, CIFAR100-LT and ImageNet-LT are 100, 100 and 256, respectively.
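For readers who want to reproduce class sizes of this kind, the sketch below builds a long-tailed class-count profile for a given imbalance ratio. It assumes an exponential decay of class sizes and takes the imbalance ratio to mean the largest class size divided by the smallest; neither assumption is stated in this section, and the function name is hypothetical. The value 5000 is the number of training images per class in CIFAR-10.

```python
def long_tailed_counts(max_count, num_classes, imbalance_ratio):
    """Class sizes decaying exponentially from max_count to max_count / imbalance_ratio.

    Assumption for illustration: class k keeps
    max_count * (1 / imbalance_ratio) ** (k / (num_classes - 1)) samples.
    """
    counts = []
    for k in range(num_classes):
        fraction = k / (num_classes - 1)
        counts.append(int(max_count * (1.0 / imbalance_ratio) ** fraction))
    return counts


# CIFAR-10 has 5000 training images per class; the ratio 100 is taken from the text.
cifar10_lt_counts = long_tailed_counts(max_count=5000, num_classes=10, imbalance_ratio=100)
total = sum(cifar10_lt_counts)
# Sampling rates proportional to class size (the original distribution).
original_rates = [c / total for c in cifar10_lt_counts]

print(cifar10_lt_counts)
print([round(r, 4) for r in original_rates])
```

The resulting rates can serve as the $u_i \propto n_i$ sampling rates used throughout the effective sampling analysis.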

Table 3: with / without replacement

Comparison of the replacement strategy

When the sampling rate is proportional to the number of class samples, the total number of effective samples without replacement is theoretically greater than the total number of effective samples with replacement (3.1.1). As shown in Table 3, the actual accuracy of sampling without replacement is significantly higher than that of sampling with replacement.

Comparison of the training time

The number of training epochs is positively correlated with the total number of effective samples. Under a normal training schedule, the total number of effective samples obtained by sampling with replacement is lower than that obtained by sampling without replacement. The jitter strategy helps increase the total number of effective samples and balances the effective sample utilization between categories. For longer training, the effective samples are already saturated, which limits the accuracy improvement.

5. CONCLUSION

We have established an effective sampling theory to explain the sampling efficiency gap between different sampling methods, and developed a jitter sampling strategy to improve the actual training effect. Based on these, our proposed methods perform well on many long-tailed datasets. If we can find a way to eliminate information redundancy precisely, our theory may be further optimized. We will explore this in future experiments.

A APPENDIX

A.1 APPENDIX 1

Jittering method in sampling with replacement is effective.

Proof 1: In sampling with replacement, the total number of effective samples obtained using jittering method 1 is greater than that of the original sampling with replacement, and the effective sample utilization is equalized between classes.

Let the expectation of the total number of effective samples sampled by jittering method 1 be $J_n$, whose values are:

where $u^{(i)}_j$ is the sampling rate of the $i$-th dataloader for the $j$-th class. Let $u_j(x)$ be a function that calculates the sampling rate of the $j$-th class in the dataloaders. The sampling rate of dataloader 1 is u

And $G(x)$ is a function with:

The condition for $S_n$ to reach its extreme value is

It is easy to know that $S_n$ attains the maximum value of the above equation if and only if $\sum_{j=1}^{m} n_j = n = N$, in which case it only needs to satisfy:

Effective sample utilization

Define the effective sample utilization $R_{i,n}$ after $n$ samples as follows.

When $n$ is large enough, for any $i, j$, considering the condition of 1, there are:

For sampling without replacement, the condition for achieving a balanced utilization of effective samples among classes is that the sampling frequency must be proportional to the number of effective samples in a class:

$$u_i \propto a_i n_i$$
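The balance condition $u_i \propto a_i n_i$ can be illustrated numerically. The sketch below is an assumption-laden example, not the authors' derivation: it evaluates $R_{i,n} = E_{i,n} / (u_i n)$ with the with-replacement closed form $E_{i,n} = a_i n_i (1 - (1 - u_i/n_i)^n)$ and the without-replacement form $E_{i,n} = \min(a_i u_i n, a_i n_i)$ from Appendix A.3, for a large $n$. Class sizes, $a_i$ values, the number of draws, and all function names are hypothetical.

```python
def utilization_with_replacement(a, u, n_i, draws):
    """R_{i,n} = a_i * n_i * (1 - w_i^n) / (u_i * n), with w_i = 1 - u_i / n_i."""
    w = 1.0 - u / n_i
    return a * n_i * (1.0 - w ** draws) / (u * draws)


def utilization_without_replacement(a, u, n_i, draws):
    """R_{i,n} = min(a_i * u_i * n, a_i * n_i) / (u_i * n)."""
    return min(a * u * draws, a * n_i) / (u * draws)


def spread(values):
    """Ratio of the largest to the smallest utilization (1.0 means balanced)."""
    return max(values) / min(values)


counts = [5000, 500, 50]   # hypothetical class sizes n_i
a = [0.9, 0.6, 0.3]        # hypothetical effective sample proportions a_i
draws = 200_000            # large n, so every class is saturated

effective = [ai * ni for ai, ni in zip(a, counts)]
u_effective = [e / sum(effective) for e in effective]  # u_i proportional to a_i * n_i
u_original = [c / sum(counts) for c in counts]         # u_i proportional to n_i

spread_with = spread(
    [utilization_with_replacement(ai, ui, ni, draws)
     for ai, ui, ni in zip(a, u_effective, counts)]
)
spread_without = spread(
    [utilization_without_replacement(ai, ui, ni, draws)
     for ai, ui, ni in zip(a, u_effective, counts)]
)
spread_original = spread(
    [utilization_without_replacement(ai, ui, ni, draws)
     for ai, ui, ni in zip(a, u_original, counts)]
)

print(f"u ~ a_i n_i, with replacement:    {spread_with:.4f}")
print(f"u ~ a_i n_i, without replacement: {spread_without:.4f}")
print(f"u ~ n_i, without replacement:     {spread_original:.4f}")
```

In this toy setting the spread stays close to 1 when $u_i \propto a_i n_i$, whereas sampling proportional to $n_i$ leaves the utilization differing across classes by the ratio of their $a_i$.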

A.4 APPENDIX 4:

Proof: When the number of samples in a class is sufficiently large, the class redundancy and the number of class samples are negatively correlated.

We assume that class $i$ obeys a prior Gaussian distribution, and that the actual data of class $i$ is drawn from that Gaussian distribution. Each point on the numerical axis represents a sample we actually drew, and the probability density of the location at which it was sampled is

We define two samples $t_i, t_j$ as redundant if their sample positions satisfy $|t_i - t_j| < \delta$. Given a sampling position $t_0$, the probability that a sampling position is greater than $t_0$ is $P = P(t > t_0)$.

For $N$ independent acquisitions, the expected number of sample locations greater than $t_0$ is $NP$.

Consider the case $N \cdot P = 1$, whose physical meaning is that, after collecting $N$ times, we expect only one sampling position to be larger than $t_0$. $t_0$ can then be considered the average upper bound of the sample positions over $N$ samplings. By symmetry, the lower bound of the sampling position is $2\mu_i - t_0$. Then the upper bound of the effective sample proportion $a_i$ satisfies:

It's easy to know:

Although a rigorous derivation of $a_i$ cannot be given, we show that the upper bound of $a_i$ decreases monotonically after sampling a certain range, and that $a_i$ tends to 0 when the number of samples tends to infinity. This completes the derivation of the negative correlation between the category redundancy and the number of class samples.

The above is based on the assumption that the sample dimension is 1, while the actual data we deal with, such as images, is much more complicated.

In the real case, we assume that an image sample $I_i$ can be represented by its latent variable $X_i$, which can be generated by an autoencoder network, that is:

In the study of VAE, $X_i$ is defined as a latent variable obeying a specific Gaussian distribution:

and then the $a_i$ of $I_i$ can be expressed as follows:

$$a_i \le a_{1i} \cdot a_{2i} \cdot a_{3i} \cdots a_{si}$$

$a_{si}$ denotes the effective sample proportion of $X_i$, and equality holds when the $X_i$ are statistically independent of each other ($X_j$).
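The one-dimensional argument can be illustrated with a small Monte Carlo experiment. This is a sketch under stated assumptions rather than the bound derived in the paper: samples are drawn from a standard Gaussian, two samples count as redundant when they are closer than $\delta$, and the effective proportion is estimated by greedily keeping sorted samples that are at least $\delta$ apart. The function name, the greedy counting rule, $\delta = 0.1$, and the seed are all hypothetical choices.

```python
import random


def effective_proportion(num_samples, delta, mu=0.0, sigma=1.0, seed=0):
    """Estimate the fraction of non-redundant samples among num_samples Gaussian draws.

    Greedy rule (an assumption for illustration): walk through the sorted
    positions and keep a sample only if it lies at least delta above the
    last kept sample.
    """
    rng = random.Random(seed)
    positions = sorted(rng.gauss(mu, sigma) for _ in range(num_samples))
    kept = 0
    last_kept = None
    for t in positions:
        if last_kept is None or t - last_kept >= delta:
            kept += 1
            last_kept = t
    return kept / num_samples


sample_sizes = (100, 1000, 10000)
proportions = [effective_proportion(n, delta=0.1) for n in sample_sizes]

for n, p in zip(sample_sizes, proportions):
    print(f"N={n:>6}  estimated effective proportion={p:.4f}")
```

Because the occupied range of a Gaussian grows only slowly with $N$ while the number of draws grows linearly, the estimated effective proportion shrinks as $N$ increases, in line with the qualitative conclusion above.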

