Abstract: To address class imbalance in data sets, this paper proposes a natural neighborhood hypersphere oversampling method (NNHOS) for high-performance classification of imbalanced data. First, the natural neighbors of each sample point in the imbalanced data set are searched until a stable natural neighborhood is formed. Then, based on the labels of each sample point's natural neighbors, all sample points are divided into five categories: outliers, noise points, safe points of the majority class, safe points of the minority class, and boundary points of the minority class. Subsequently, a hypersphere is constructed for each boundary point of the minority class, and small hyperspheres that lie completely within a larger hypersphere are merged, yielding a set of hyperspheres. Finally, to balance the data set, each hypersphere is adaptively assigned a sampling ratio based on its radius, and the corresponding number of new sample points is generated within it. The method is used to oversample synthetic and real data sets, and the resulting sample sets are evaluated with the CART, SVM, and KNN classifiers against eight other commonly used methods. The results, measured by AUC, F1, and G-mean (Gm), indicate that the method classifies imbalanced data sets more effectively.
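The last two steps (merging contained hyperspheres and generating radius-weighted samples) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how centers and radii are obtained or how the sampling ratio depends on the radius, so the sketch assumes that each hypersphere is centered on a minority boundary point with a radius supplied by the caller, that merging amounts to dropping a sphere fully covered by another, and that the number of new points is proportional to the radius. The function names `drop_contained`, `allocate_counts`, `sample_in_hypersphere`, and `oversample` are hypothetical.

```python
import numpy as np


def drop_contained(centers, radii):
    """Keep only hyperspheres that are not completely inside another one.

    Sphere i lies inside sphere j when ||c_i - c_j|| + r_i <= r_j.
    """
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    keep = []
    for i in range(len(radii)):
        contained = False
        for j in range(len(radii)):
            if i == j:
                continue
            gap = np.linalg.norm(centers[i] - centers[j])
            inside = gap + radii[i] <= radii[j]
            # For identical spheres, keep only the one with the lower index.
            if inside and (radii[i] < radii[j] or j < i):
                contained = True
                break
        if not contained:
            keep.append(i)
    return centers[keep], radii[keep]


def allocate_counts(radii, n_new):
    """Split n_new samples across spheres in proportion to radius (assumed rule)."""
    raw = radii / radii.sum() * n_new
    counts = np.floor(raw).astype(int)
    # Hand out the leftover samples to the largest fractional parts.
    leftover = max(n_new - counts.sum(), 0)
    order = np.argsort(-(raw - counts))
    counts[order[:leftover]] += 1
    return counts


def sample_in_hypersphere(center, radius, n, rng):
    """Draw n points uniformly from the ball of the given center and radius."""
    dim = center.shape[0]
    directions = rng.normal(size=(n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling by u**(1/dim) makes the points uniform in volume.
    lengths = radius * rng.random(n) ** (1.0 / dim)
    return center + directions * lengths[:, None]


def oversample(centers, radii, n_new, seed=0):
    """Generate n_new synthetic minority points inside the merged hyperspheres."""
    rng = np.random.default_rng(seed)
    centers, radii = drop_contained(centers, radii)
    counts = allocate_counts(radii, n_new)
    parts = [
        sample_in_hypersphere(c, r, k, rng)
        for c, r, k in zip(centers, radii, counts)
    ]
    return np.vstack(parts)
```

For example, `oversample([[0, 0], [0.1, 0], [5, 5]], [1.0, 0.2, 2.0], n_new=9)` discards the second sphere, which lies inside the first, and returns nine two-dimensional points split between the two remaining spheres. If the method instead favors small hyperspheres, only `allocate_counts` would need to change.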