A natural neighborhood hypersphere oversampling method for imbalanced data sets

ZHOU Yu; YUE Xuezhen; LIU Xing; WANG Peichong

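Supervised by: Ministry of Industry and Information Technology of the People's Republic of China. Sponsored by: Harbin Institute of Technology. Editor-in-Chief: LI Longqiu. ISSN 0367-6234. CN 23-1235/T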

Cite this article: ZHOU Yu, YUE Xuezhen, LIU Xing, WANG Peichong. A natural neighborhood hypersphere oversampling method for imbalanced data sets[J]. Journal of Harbin Institute of Technology, 2024, 56(12): 81. DOI:10.11918/202311030

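A natural neighborhood hypersphere oversampling method for imbalanced data sets
ZHOU Yu¹, YUE Xuezhen¹, LIU Xing¹, WANG Peichong²
(1. School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450011, China; 2. School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China)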

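Abstract:

To address class imbalance in data sets, a natural neighborhood hypersphere oversampling method (NNHOS) is proposed for high-performance classification of imbalanced data sets. First, the natural neighbors of each sample point in the imbalanced data set are searched until a stable natural neighborhood is formed. Next, according to the labels of each sample point's natural neighbors, all sample points are divided into five regions: outliers, noise points, majority-class safe points, minority-class safe points, and minority-class boundary points. Then, a hypersphere is constructed for each minority-class boundary point, and small hyperspheres lying entirely inside a larger hypersphere are merged into it, yielding a set of hyperspheres. Finally, each hypersphere is adaptively assigned a sampling ratio according to its radius, and the specified number of new sample points is generated inside each hypersphere to obtain a balanced data set. The method was used to oversample artificial and real data sets, and the resulting sample sets were evaluated with three classifiers (CART, SVM, and KNN) and compared with eight other commonly used methods. With AUC, F₁, and G_m as evaluation metrics, the results show that the method classifies imbalanced data sets more effectively.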
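The first step, searching natural neighbors until the neighborhood is stable, can be illustrated with a minimal sketch. The abstract does not state the stopping rule, so the code assumes the criterion commonly used in natural-neighbor search: the search radius grows by one neighbor per round and stops once every point is some other point's neighbor, or once the number of points that nobody lists as a neighbor no longer changes. The function name `natural_neighbors` and the toy data are hypothetical, and the sketch is not the authors' implementation.

```python
import math


def natural_neighbors(points):
    """Return (rounds, neighbors); neighbors[i] is the set of natural neighbors of point i."""
    n = len(points)
    # For each point, the indices of all other points sorted by increasing distance.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    knn = [set() for _ in range(n)]  # nearest neighbors collected so far
    reverse_count = [0] * n          # how many points list i as a neighbor
    prev_orphans = None
    rounds = 0
    while rounds < n - 1:
        # Extend every point's neighbor list by its next-nearest point.
        for i in range(n):
            j = order[i][rounds]
            knn[i].add(j)
            reverse_count[j] += 1
        rounds += 1
        orphans = sum(1 for c in reverse_count if c == 0)
        # Assumed stability rule: no orphans left, or orphan count unchanged.
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
    # Natural neighbors are mutual: j is near i and i is near j.
    neighbors = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return rounds, neighbors


# Toy data: five points on a line plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (100, 100)]
rounds, neighbors = natural_neighbors(points)
print(rounds, neighbors)
```

On this toy data the isolated point ends with an empty natural-neighbor set, which is the kind of signal a later partitioning step could use. How NNHOS maps neighbor labels to its five regions is not specified in the abstract, so that step is not sketched here.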

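Keywords: imbalanced data sets; oversampling; natural neighbors; hypersphere; classification

DOI: 10.11918/202311030

CLC number: TP181

Document code: A

Funding: National Natural Science Foundation of China (U2,0); Training Program for Young Backbone Teachers in Higher Education Institutions of Henan Province (2018GGJS079); Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2020344); 15th Graduate Innovation Project of North China University of Water Resources and Electric Power (NCWUYC-202315048)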

A natural neighborhood hypersphere oversampling method for imbalanced data sets

ZHOU Yu¹,YUE Xuezhen¹,LIU Xing¹,WANG Peichong²

(1.School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450011, China; 2.School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China)

Abstract:

To address the issue of class imbalance in data sets, a natural neighborhood hypersphere oversampling method (NNHOS) for high-performance classification of imbalanced data sets is proposed in this paper. First, for each sample point in the imbalanced data set, its natural neighbors are searched until a stable natural neighborhood is formed. Then, based on the labels of each sample point's natural neighbors, all sample points are divided into five regions: outliers, noise points, safe points of the majority class, safe points of the minority class, and boundary points of the minority class. Subsequently, a hypersphere is constructed for each boundary point of the minority class, and small hyperspheres that lie completely within a larger hypersphere are merged into it to form a set of hyperspheres. Finally, each hypersphere is adaptively assigned a sampling ratio based on its radius, and the specified number of new sample points is generated within each hypersphere to obtain a balanced data set. The method is applied to synthetic and real data sets to generate new sample sets, experiments are conducted using the CART, SVM, and KNN classifiers, and the results are compared with those of eight other commonly used methods. With AUC, F₁, and G_m as evaluation metrics, the results indicate that the method classifies imbalanced data sets more effectively.
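The last two steps (merging contained hyperspheres, then sampling inside each one) can be sketched as follows. The abstract does not say how each radius is chosen or how exactly the sampling ratio depends on it, so this sketch takes the spheres as given `(center, radius)` pairs and assumes an allocation proportional to radius; both are assumptions, and all function names are hypothetical. The containment test itself is plain geometry: a sphere lies inside another when the distance between centers plus the smaller radius does not exceed the larger radius.

```python
import math
import random


def merge_contained(spheres):
    """Drop every hypersphere that lies completely inside a larger kept one."""
    kept = []
    # Visit larger spheres first so each kept sphere can absorb later, smaller ones.
    for center, radius in sorted(spheres, key=lambda s: s[1], reverse=True):
        inside = any(math.dist(center, c) + radius <= r for c, r in kept)
        if not inside:
            kept.append((center, radius))
    return kept


def allocate(spheres, total):
    """Split `total` new samples across spheres in proportion to radius (assumed rule)."""
    radius_sum = sum(r for _, r in spheres)
    counts = [int(total * r / radius_sum) for _, r in spheres]
    # Give the samples lost to rounding down to the largest spheres first.
    by_radius = sorted(range(len(spheres)), key=lambda i: spheres[i][1], reverse=True)
    for i in by_radius[: total - sum(counts)]:
        counts[i] += 1
    return counts


def sample_in_sphere(center, radius, rng):
    """Draw one point uniformly from the ball around `center`."""
    d = len(center)
    # A normalized Gaussian vector gives a uniformly random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in direction)) or 1.0
    # Scaling by u**(1/d) makes the point density uniform over the ball's volume.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return tuple(c + scale * v for c, v in zip(center, direction))


def oversample(spheres, total, seed=0):
    """Merge contained spheres, then generate `total` synthetic points inside the rest."""
    rng = random.Random(seed)
    merged = merge_contained(spheres)
    new_points = []
    for (center, radius), count in zip(merged, allocate(merged, total)):
        new_points.extend(sample_in_sphere(center, radius, rng) for _ in range(count))
    return merged, new_points


# Toy spheres: the second one lies inside the first and is merged away.
spheres = [((0.0, 0.0), 2.0), ((0.5, 0.0), 1.0), ((5.0, 5.0), 1.0)]
merged, new_points = oversample(spheres, total=9)
print(merged, len(new_points))
```

Generating points inside a ball rather than on a line segment between two minority samples is what distinguishes a hypersphere scheme from SMOTE-style interpolation; whether NNHOS samples uniformly within each ball is not stated in the abstract.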

Key words: imbalanced data sets; oversampling; natural neighborhood; hypersphere; classification
