ROGA: RANDOM OVER-SAMPLING BASED ON GENETIC ALGORITHM

Abstract

When using machine learning to solve practical tasks, we often face the problem of class imbalance. Imbalanced classes cause the model to develop a preference during learning and to ignore the classes with fewer samples. Oversampling algorithms balance the class sizes by generating minority-class samples, so the quality of the artificial samples determines how oversampling affects model training. A central challenge for oversampling algorithms is therefore finding a suitable sample-generation space. Strong conditional constraints keep the generated samples from becoming noise points, but they also limit the search space, which hinders the discovery of higher-quality new samples. Motivated by this problem, we propose ROGA, an oversampling algorithm based on a genetic algorithm. Starting from random sampling, ROGA gradually generates new samples and filters out those likely to become noise. ROGA keeps the sample-generation space as wide as possible while reducing the number of noisy samples generated. Experiments on multiple datasets show that ROGA achieves good results.

1. INTRODUCTION

When modeling a classification problem, balanced classes ensure that information is balanced during model training, but classes are often imbalanced in real tasks (Kotsiantis et al. (2005); He & Garcia (2009)), which leads machine learning models to favor majority-class samples. A common remedy is to apply an oversampling algorithm that generates additional minority-class samples to close the gap between the minority and majority classes. The quality of the generated samples therefore determines the training quality of the model after oversampling, but this effect is difficult to characterize. In past studies, oversampling methods could only estimate which samples were noise, such as overlapping samples or outliers, and eliminate them as much as possible, ensuring that the generated samples do not make model training more difficult.
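The simplest way to close the gap described above is random oversampling: duplicate randomly chosen minority samples until the classes are balanced. The following is a minimal sketch of that baseline (not ROGA itself); the function name and the binary-label assumption are ours for illustration.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes
    have equal size. Assumes a binary classification problem."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority = np.flatnonzero(y == minority_label)
    # number of extra minority samples needed to match the majority class
    deficit = (len(y) - len(minority)) - len(minority)
    picks = rng.choice(minority, size=deficit, replace=True)
    return np.vstack([X, X[picks]]), np.concatenate([y, y[picks]])
```

Because it only copies existing points, this baseline introduces no noise, but it also adds no new information; generative methods such as SMOTE and ROGA exist precisely to go beyond it.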

SMOTE (Chawla et al. (2002)) and its derivative algorithms generate new samples by interpolation.
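The interpolation idea can be sketched as follows: each synthetic sample lies on the line segment between a minority point and one of its k nearest minority neighbours. This is a simplified sketch of the core SMOTE mechanism, not the full algorithm; the function name and parameters are ours for illustration.

```python
import numpy as np

def smote_sample(X_min, k=3, n_new=5, seed=0):
    """Generate n_new synthetic samples by linear interpolation between
    a minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude each point itself
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbour indices
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # pick a minority sample
        j = rng.choice(neighbours[i])           # pick one of its neighbours
        lam = rng.random()                      # interpolation factor in [0, 1]
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```

Because every synthetic point is a convex combination of two minority points, the generated samples stay inside the convex hull of the minority class, which is exactly the kind of restriction on the generation space discussed below.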

There has been a lot of work on selecting the samples to be interpolated, the number of interpolations, and the interpolation method, with the common goal of reducing noise generation by restricting the interpolation position. Some researchers have pointed out that sampling should not be performed in the original sample space but after projection into another space (Wang et al. (2007), Zhang & Yang (2018)). Others have proposed sampling from the minority-class distribution (Bellinger et al. (2015), Das et al. (2015)) to ensure distributional consistency. In all these approaches, reducing noise generation requires restricting the sample-generation space; that is, the oversampling algorithm can only sample within a limited range. However, focusing only on noise generation may not achieve better results: overly strong constraints limit the sample-generation space and trap the generated samples in a local optimum, making it impossible to find samples that are more conducive to model training. Therefore, to loosen the limits on the sample-generation space while still reducing noise as much as possible, this paper proposes ROGA, an oversampling algorithm based on a genetic algorithm. Before the first iteration, ROGA samples randomly in the feature space and generates a

