A natural neighborhood hypersphere oversampling method for imbalanced data sets

doi:10.11918/202311030

2024, Vol. 56, No. 12, pp. 81-95

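Affiliations: 1. School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450011, China; 2. School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
About the author: ZHOU Yu (born 1979), male, associate professor, master's supervisor
Corresponding author: ZHOU Yu, zhouyu_beijing@126.com
CLC number: TP181
Funding: National Natural Science Foundation of China (U2,0); Training Program for Young Backbone Teachers in Higher Education Institutions of Henan Province (2018GGJS079); Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2020344); 15th Graduate Innovation Project of North China University of Water Resources and Electric Power (NCWUYC-202315048)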

Abstract:

To address class imbalance in data sets, this paper proposes a natural neighborhood hypersphere oversampling method (NNHOS) aimed at high-performance classification of imbalanced data. First, the natural neighbors of each sample point in the imbalanced data set are searched until a stable natural neighborhood forms. Next, based on the labels of each point's natural neighbors, all sample points are divided into five regions: outliers, noise points, majority-class safe points, minority-class safe points, and minority-class boundary points. A hypersphere is then constructed around each minority-class boundary point, and any small hypersphere lying entirely inside a larger one is merged into it, yielding a set of hyperspheres. Finally, each hypersphere is adaptively assigned a sampling ratio according to its radius, and the specified number of new sample points is generated inside it to produce a balanced data set. The method is used to oversample both synthetic and real data sets, and the resulting sample sets are evaluated with three classifiers (CART, SVM, and KNN) and compared with eight other commonly used methods. With AUC, F₁, and G_m as evaluation metrics, the results show that the method classifies imbalanced data sets more effectively.
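
The hypersphere merging and sampling steps described in the abstract can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how hypersphere centers and radii are chosen or what exact form the adaptive sampling ratio takes, so the sketch takes centers and radii as given inputs, assumes the number of new points per hypersphere is proportional to its radius, and assumes points are drawn uniformly inside each hypersphere. The function names `drop_contained`, `allocate`, `sample_in_ball`, and `oversample` are hypothetical.

```python
import numpy as np

def drop_contained(centers, radii):
    """Keep only hyperspheres that do not lie entirely inside another one."""
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    keep = []
    for i in range(len(radii)):
        contained = False
        for j in range(len(radii)):
            if i == j:
                continue
            d = np.linalg.norm(centers[i] - centers[j])
            # Sphere i lies inside sphere j when d + r_i <= r_j.
            # Identical spheres are tie-broken by index so one copy survives.
            if d + radii[i] <= radii[j] and (radii[i] < radii[j] or j < i):
                contained = True
                break
        if not contained:
            keep.append(i)
    return centers[keep], radii[keep]

def allocate(radii, n_new):
    """Split n_new samples across spheres in proportion to radius (assumed rule)."""
    radii = np.asarray(radii, dtype=float)
    shares = radii / radii.sum() * n_new
    counts = np.floor(shares).astype(int)
    # Give the leftover samples to the largest fractional remainders.
    leftover = n_new - counts.sum()
    order = np.argsort(shares - counts)[::-1]
    counts[order[:leftover]] += 1
    return counts

def sample_in_ball(center, radius, n, rng):
    """Draw n points uniformly from the ball with the given center and radius."""
    center = np.asarray(center, dtype=float)
    dim = center.shape[0]
    directions = rng.normal(size=(n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling by U^(1/dim) spreads points uniformly over the ball's volume.
    scale = radius * rng.random(n) ** (1.0 / dim)
    return center + directions * scale[:, None]

def oversample(centers, radii, n_new, seed=0):
    """Merge contained spheres, allocate counts, and generate new points."""
    rng = np.random.default_rng(seed)
    centers, radii = drop_contained(centers, radii)
    counts = allocate(radii, n_new)
    parts = [sample_in_ball(c, r, k, rng) for c, r, k in zip(centers, radii, counts)]
    return np.vstack(parts)

# Hypothetical input: three minority-class boundary points with their radii.
centers = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
radii = [1.0, 0.2, 0.5]
new_points = oversample(centers, radii, n_new=30)
print(new_points.shape)
```

In this example the second hypersphere lies entirely inside the first and is dropped, so the 30 new points are split between the two remaining hyperspheres. In a full pipeline the returned points would be labeled as minority-class samples and appended to the training set.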

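Cite this article

ZHOU Yu, YUE Xuezhen, LIU Xing, WANG Peichong. A natural neighborhood hypersphere oversampling method for imbalanced data sets[J]. Journal of Harbin Institute of Technology, 2024, 56(12): 81. DOI:10.11918/202311030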

History

Received: 2023-11-10
Published online: 2024-12-24
