python_imbalanced-learn (imbalanced learning package)_02_Over-sampling

2023-05-16

python_imbalanced-learn (imbalanced learning package)_01_Introduction

python_imbalanced-learn (imbalanced learning package)_02_Over-sampling

Later chapters are still to be decided; I hope you find the installments published so far useful, as your support keeps this series going.

Over-sampling

1. A practical guide

You can refer to Compare over-sampling samplers

1.1 Naive random over-sampling

One way to fight this issue is to generate new samples in the classes which are under-represented. The most naive strategy is to generate new samples by randomly sampling with replacement from the currently available samples. The RandomOverSampler offers such a scheme:

# build an imbalanced three-class toy data set (~1% / 5% / 94%)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

# duplicate minority samples at random until every class matches the majority
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
from collections import Counter
print(sorted(Counter(y_resampled).items()))

[(0, 4674), (1, 4674), (2, 4674)]
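
By default, RandomOverSampler resamples every class except the majority until all class counts match the majority class. If only a partial rebalancing is wanted, the sampling_strategy parameter accepts a dictionary of target counts per class. The sketch below is illustrative only; the chosen counts (1000 per minority class) are hypothetical values that merely have to be at least as large as the original class counts.

# Sketch: partial rebalancing with an explicit sampling_strategy dictionary.
# The target counts below are illustrative; classes omitted from the dict
# (here, the majority class 2) are left untouched.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros_partial = RandomOverSampler(sampling_strategy={0: 1000, 1: 1000}, random_state=0)
X_partial, y_partial = ros_partial.fit_resample(X, y)
print(sorted(Counter(y_partial).items()))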

The augmented data set should be used instead of the original data set to train a classifier:

from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_resampled, y_resampled)


LinearSVC()

In the figure below, we compare the decision functions of a classifier trained using the over-sampled data set and the original data set.

[Figure: decision functions of a classifier trained on the original data set versus the over-sampled data set]

It would also work with pandas dataframe:

from sklearn.datasets import fetch_openml
df_adult, y_adult = fetch_openml(
    'adult', version=2, as_frame=True, return_X_y=True)
df_adult.head()  
    age  workclass   fnlwgt    education     education-num  marital-status      occupation         relationship  race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  25.0  Private     226802.0  11th          7.0            Never-married       Machine-op-inspct  Own-child     Black  Male    0.0           0.0           40.0            United-States
1  38.0  Private     89814.0   HS-grad       9.0            Married-civ-spouse  Farming-fishing    Husband       White  Male    0.0           0.0           50.0            United-States
2  28.0  Local-gov   336951.0  Assoc-acdm    12.0           Married-civ-spouse  Protective-serv    Husband       White  Male    0.0           0.0           40.0            United-States
3  44.0  Private     160323.0  Some-college  10.0           Married-civ-spouse  Machine-op-inspct  Husband       Black  Male    7688.0        0.0           40.0            United-States
4  18.0  NaN         103497.0  Some-college  10.0           Never-married       NaN                Own-child     White  Female  0.0           0.0           30.0            United-States
type(df_adult)
pandas.core.frame.DataFrame
df_resampled, y_resampled = ros.fit_resample(df_adult, y_adult)
df_resampled.head() 
    age  workclass   fnlwgt    education     education-num  marital-status      occupation         relationship  race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  25.0  Private     226802.0  11th          7.0            Never-married       Machine-op-inspct  Own-child     Black  Male    0.0           0.0           40.0            United-States
1  38.0  Private     89814.0   HS-grad       9.0            Married-civ-spouse  Farming-fishing    Husband       White  Male    0.0           0.0           50.0            United-States
2  28.0  Local-gov   336951.0  Assoc-acdm    12.0           Married-civ-spouse  Protective-serv    Husband       White  Male    0.0           0.0           40.0            United-States
3  44.0  Private     160323.0  Some-college  10.0           Married-civ-spouse  Machine-op-inspct  Husband       Black  Male    7688.0        0.0           40.0            United-States
4  18.0  NaN         103497.0  Some-college  10.0           Never-married       NaN                Own-child     White  Female  0.0           0.0           30.0            United-States
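
Because the input was a pandas DataFrame, fit_resample returns pandas objects as well, so the usual pandas tools can be used to inspect the result. A minimal check of the new class balance, assuming y_resampled is the Series returned above:

# The resampled data keep their pandas types.
print(type(df_resampled))          # pandas.core.frame.DataFrame
print(y_resampled.value_counts())  # both classes now have the same count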

If repeating samples is an issue, the parameter shrinkage allows creating a smoothed bootstrap. However, the original data need to be numerical. The shrinkage parameter controls the dispersion of the newly generated samples. We show an example illustrating that the new samples are no longer overlapping once a smoothed bootstrap is used. This way of generating a smoothed bootstrap is also known as Random Over-Sampling Examples (ROSE) [MT14].

[Figure: random over-sampling with and without the smoothed bootstrap (shrinkage)]
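
A minimal sketch of the smoothed bootstrap, reusing the purely numerical X and y created with make_classification above; the shrinkage value of 0.2 is an illustrative choice, not a recommendation:

# Smoothed bootstrap (ROSE-style): new samples are drawn around the original
# points instead of being exact duplicates. Requires numerical features only.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros_smooth = RandomOverSampler(shrinkage=0.2, random_state=0)
X_smooth, y_smooth = ros_smooth.fit_resample(X, y)
print(sorted(Counter(y_smooth).items()))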

1.2 From random over-sampling to SMOTE and ADASYN

Apart from random sampling with replacement, there are two popular methods to over-sample minority classes: (i) the Synthetic Minority Oversampling Technique (SMOTE) [CBHK02] and (ii) the Adaptive Synthetic (ADASYN) [HBGL08] sampling method. These algorithms can be used in the same manner:

from imblearn.over_sampling import SMOTE, ADASYN
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

clf_smote = LinearSVC().fit(X_resampled, y_resampled)
X_resampled, y_resampled = ADASYN().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

clf_adasyn = LinearSVC().fit(X_resampled, y_resampled)
[(0, 4674), (1, 4674), (2, 4674)]
[(0, 4673), (1, 4662), (2, 4674)]

The figure below illustrates the major difference of the different over-sampling methods.

[Figure: comparison of the different over-sampling methods]

1.3 Ill-posed examples

While the RandomOverSampler over-samples by duplicating some of the original samples of the minority class, SMOTE and ADASYN generate new samples by interpolation. However, the samples used to interpolate/generate new synthetic samples differ. In fact, ADASYN focuses on generating samples next to the original samples which are wrongly classified using a k-Nearest Neighbors classifier, while the basic implementation of SMOTE will not make any distinction between easy and hard samples to be classified using the nearest neighbors rule. Therefore, the decision function found during training will differ among the algorithms.

[Figure: decision functions obtained with RandomOverSampler, SMOTE, and ADASYN]
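
Both samplers expose the size of the neighborhood they work with; the sketch below only spells out those parameters (k_neighbors for SMOTE, n_neighbors for ADASYN), using 5, which is their usual default rather than a tuned value:

# Sketch: the neighborhood size each sampler uses.
# SMOTE interpolates between a minority sample and one of its k_neighbors;
# ADASYN generates more samples around minority points whose n_neighbors
# mostly belong to other classes (the hard-to-classify ones).
from imblearn.over_sampling import SMOTE, ADASYN

X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
X_ada, y_ada = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)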

The sampling particularities of these two algorithms can lead to some peculiar behavior as shown below.

[Figure: peculiar resampling behavior of SMOTE and ADASYN]

1.4 SMOTE variants

SMOTE might connect inliers and outliers while ADASYN might focus solely on outliers which, in both cases, might lead to a sub-optimal decision function. In this regard, SMOTE offers three additional options to generate samples. Those methods focus on samples near the border of the optimal decision function and will generate samples in the opposite direction of the nearest neighbors class. Those variants are presented in the figure below.

[Figure: SMOTE variants]

The BorderlineSMOTE [HWM05], SVMSMOTE [NCK09], and KMeansSMOTE [LDB17] offer some variants of the SMOTE algorithm:

from imblearn.over_sampling import BorderlineSMOTE
X_resampled, y_resampled = BorderlineSMOTE().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
[(0, 4674), (1, 4674), (2, 4674)]
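
SVMSMOTE and KMeansSMOTE follow the same fit_resample interface; a minimal sketch is shown below. Note that KMeansSMOTE can raise an error on some data sets when no cluster contains enough minority samples, in which case its cluster_balance_threshold parameter may need to be lowered.

# Sketch: the other two SMOTE variants are used exactly like BorderlineSMOTE.
from collections import Counter
from imblearn.over_sampling import SVMSMOTE, KMeansSMOTE

X_svm, y_svm = SVMSMOTE(random_state=0).fit_resample(X, y)
print(sorted(Counter(y_svm).items()))

try:
    X_km, y_km = KMeansSMOTE(random_state=0).fit_resample(X, y)
    print(sorted(Counter(y_km).items()))
except RuntimeError as exc:
    # may happen when the clustering finds no sufficiently minority-rich cluster
    print(f"KMeansSMOTE needs tuning on this data: {exc}")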

When dealing with mixed data types such as continuous and categorical features, none of the presented methods (apart from the class RandomOverSampler) can deal with the categorical features. The SMOTENC [CBHK02] is an extension of the SMOTE algorithm for which categorical features are treated differently:

# create a synthetic data set with continuous and categorical features
import numpy as np
rng = np.random.RandomState(42)
n_samples = 50
X = np.empty((n_samples, 3), dtype=object)
X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
X[:, 1] = rng.randn(n_samples)
X[:, 2] = rng.randint(3, size=n_samples)
y = np.array([0] * 20 + [1] * 30)
print(sorted(Counter(y).items()))

[(0, 20), (1, 30)]

In this data set, the first and last features are considered as categorical features. One needs to provide this information to SMOTENC via the parameter categorical_features, either by passing the indices of these features or a boolean mask marking these features:

from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
print(X_resampled[-5:])
[(0, 30), (1, 30)]
[['A' 0.5246469549655818 2]
 ['B' -0.3657680728116921 2]
 ['B' 0.9344237230779993 2]
 ['B' 0.3710891618824609 2]
 ['B' 0.3327240726719727 2]]

Therefore, it can be seen that the samples generated in the first and last columns belong to the same categories as originally presented, without any extra interpolation.
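
Equivalently, the categorical columns can be flagged with a boolean mask instead of their indices; a small sketch under that assumption:

# Sketch: marking categorical columns with a boolean mask rather than indices.
# True flags a categorical feature; the mask must cover every column of X.
from collections import Counter
from imblearn.over_sampling import SMOTENC

smote_nc_mask = SMOTENC(categorical_features=[True, False, True], random_state=0)
X_res_mask, y_res_mask = smote_nc_mask.fit_resample(X, y)
print(sorted(Counter(y_res_mask).items()))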

However, SMOTENC only works when data are a mix of numerical and categorical features. If the data are made of only categorical features, one can use the SMOTEN variant [CBHK02]. The algorithm changes in two ways:

1. the nearest neighbors search does not rely on the Euclidean distance; instead, the Value Difference Metric (VDM), also implemented in the class ValueDifferenceMetric, is used;

2. a new sample is generated where each feature value corresponds to the most common category seen in the neighbor samples belonging to the same class.

Let’s take the following example:

import numpy as np
X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7,
             dtype=object).reshape(-1, 1)
y = np.array(["apple"] * 5 + ["not apple"] * 3 + ["apple"] * 7 +
             ["not apple"] * 5 + ["apple"] * 2, dtype=object)
from imblearn.over_sampling import SMOTEN
sampler = SMOTEN(random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
X_res[y.size:]
y_res[y.size:]
array(['not apple', 'not apple', 'not apple', 'not apple', 'not apple',
       'not apple'], dtype=object)

We generate a dataset associating a color with being an apple or not. "Green" and "red" are strongly associated with being an apple. The minority class being "not apple", we expect the newly generated samples to belong to the category "blue".
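
A quick check of that expectation is sketched below; it simply counts the colors of the newly generated rows (the claim that they are mostly "blue" comes from the paragraph above, not from a stored output):

# Count the colors of the synthetic samples appended after the original data.
from collections import Counter

new_colors = X_res[y.size:, 0]  # X_res is a 2-D object array with one color column
print(Counter(new_colors))      # expected to be dominated by 'blue'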

