Cite this article: ZHOU Yu, YUE Xuezhen, LIU Xing, WANG Peichong. A natural neighborhood hypersphere oversampling method for imbalanced data sets[J]. Journal of Harbin Institute of Technology, 2024, 56(12): 81. DOI:10.11918/202311030
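Abstract:
To address class imbalance in data sets, a natural neighborhood hypersphere oversampling method (NNHOS) is proposed for high-performance classification of imbalanced data sets. First, the natural neighbors of every sample point in the imbalanced data set are searched until a stable natural neighborhood is formed. Next, according to the labels of each sample point's natural neighbors, all sample points are divided into five regions: outliers, noise points, majority-class safe points, minority-class safe points, and minority-class boundary points. Then, a hypersphere is constructed for each minority-class boundary point, and small hyperspheres lying entirely within a larger hypersphere are merged, yielding a set of hyperspheres. Finally, a sampling ratio is adaptively assigned to each hypersphere according to its radius, and the specified number of new sample points is generated inside each hypersphere to obtain a balanced data set. The method is used to oversample artificial and real data sets; experiments are run with three classifiers (CART, SVM, and KNN) and compared against eight other commonly used methods. With AUC, F1, and Gm as evaluation metrics, the results further show that the method classifies imbalanced data sets better.
Keywords: imbalanced data sets; oversampling; natural neighbors; hypersphere; classification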
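The first step described in the abstract, searching natural neighbors until the neighborhood is stable, can be sketched as follows. This is a minimal illustration rather than the authors' implementation. The stopping rule (stop once every point has a reverse neighbor, or once the number of points without one stops changing) is the rule commonly used for natural-neighbor search and is an assumption here, as are the function name `natural_neighbors` and the `max_rounds` parameter.

```python
import numpy as np

def natural_neighbors(X, max_rounds=None):
    """Return (neighbors, rounds); neighbors[i] is the set of natural neighbors of point i."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; the diagonal is set to inf so a point never picks itself.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)
    knn = [set() for _ in range(n)]    # points that i has selected so far
    rknn = [set() for _ in range(n)]   # points that have selected i so far
    max_rounds = max_rounds or n - 1
    prev_orphans = -1
    rounds = 0
    for r in range(max_rounds):
        # Round r adds each point's (r + 1)-th nearest neighbor.
        for i in range(n):
            j = int(order[i, r])
            knn[i].add(j)
            rknn[j].add(i)
        rounds = r + 1
        orphans = sum(1 for s in rknn if not s)
        # Assumed stability rule: no orphan points left, or the orphan count is unchanged.
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
    # Natural neighbors are mutual: j is near i and i is near j.
    neighbors = [knn[i] & rknn[i] for i in range(n)]
    return neighbors, rounds
```

The labels of each point's natural neighbors could then drive the five-way division into regions; the abstract does not state those rules, so they are left out of the sketch.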
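DOI: 10.11918/202311030
CLC number: TP181
Document code: A
Funding: National Natural Science Foundation of China (U2,0); Training Program for Young Backbone Teachers in Higher Education Institutions of Henan Province (2018GGJS079); Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2020344); 15th Graduate Innovation Project of North China University of Water Resources and Electric Power (NCWUYC-202315048)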
A natural neighborhood hypersphere oversampling method for imbalanced data sets
ZHOU Yu¹, YUE Xuezhen¹, LIU Xing¹, WANG Peichong²
(1. School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450011, China; 2. School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China)
Abstract:
To address class imbalance in data sets, this paper proposes a natural neighborhood hypersphere oversampling method (NNHOS) for high-performance classification of imbalanced data sets. First, the natural neighbors of each sample point in the imbalanced data set are searched until a stable natural neighborhood is formed. Then, based on the labels of each sample point's natural neighbors, all sample points are divided into five regions: outliers, noise points, safe points of the majority class, safe points of the minority class, and boundary points of the minority class. Subsequently, a hypersphere is constructed for each boundary point of the minority class, and small hyperspheres that lie completely within a larger hypersphere are merged to form a set of hyperspheres. Finally, each hypersphere is adaptively assigned a sampling ratio based on its radius, and the specified number of new sample points is generated within each hypersphere to obtain a balanced data set. The method is applied to oversample synthetic and real data sets, and the resulting sample sets are evaluated with the CART, SVM, and KNN classifiers and compared with eight other commonly used methods. With AUC, F1, and Gm as evaluation metrics, the experiments further demonstrate that the method classifies imbalanced data sets more effectively.
Key words: imbalanced data sets; oversampling; natural neighborhood; hypersphere; classification
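The last two steps of the abstract (merging contained hyperspheres, then assigning a radius-based sampling ratio and generating points inside each hypersphere) can be sketched as below. This is an illustration under stated assumptions, not the paper's procedure. The abstract does not say how a hypersphere's radius is chosen, so centers and radii are taken as inputs. "Merging" is read here as dropping a sphere that lies entirely inside a larger one. The allocation is assumed to be directly proportional to radius, and points are assumed to be drawn uniformly within each ball. All function names are hypothetical.

```python
import numpy as np

def merge_contained(centers, radii):
    """Drop every hypersphere that lies completely inside another one."""
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    keep = []
    for i in range(len(radii)):
        contained = False
        for j in range(len(radii)):
            if i == j:
                continue
            gap = np.linalg.norm(centers[i] - centers[j])
            # Sphere i is inside sphere j when its farthest point is still within r_j.
            inside = gap + radii[i] <= radii[j]
            # Identical spheres contain each other; keep the one with the lower index.
            identical = gap == 0 and radii[i] == radii[j]
            if inside and (not identical or j < i):
                contained = True
                break
        if not contained:
            keep.append(i)
    return centers[keep], radii[keep]

def allocate(radii, n_new):
    """Split n_new samples across spheres in proportion to radius (assumed rule; radii > 0)."""
    radii = np.asarray(radii, dtype=float)
    share = radii / radii.sum() * n_new
    counts = np.floor(share).astype(int)
    # Give the samples lost to rounding to the largest fractional parts.
    leftover = int(n_new - counts.sum())
    for idx in np.argsort(share - counts)[::-1][:leftover]:
        counts[idx] += 1
    return counts

def sample_in_ball(center, radius, count, rng):
    """Draw `count` points uniformly from the ball with the given center and radius."""
    center = np.asarray(center, dtype=float)
    d = center.shape[0]
    direction = rng.normal(size=(count, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    # Scaling by u**(1/d) spreads points uniformly over volume instead of crowding the center.
    scale = radius * rng.random(count) ** (1.0 / d)
    return center + direction * scale[:, None]
```

A typical call sequence would be `merge_contained`, then `allocate` with the number of minority samples still needed for balance, then `sample_in_ball` once per remaining sphere with a `numpy.random.default_rng()` generator. If larger spheres should receive fewer rather than more samples, only `allocate` would need to change.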