Support-vector-based data resampler

2024-01-06

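I am trying to implement a data resampler based on support vectors. The idea is to fit an SVM classifier, get the support vectors of each class, and then balance the data by keeping only the data points near each class's support vectors, so that the classes end up with the same number of examples. All other points (those far from the support vectors) are ignored.

I am doing this in a multi-class setting, so I need to resample the classes pairwise (i.e., one-against-one). I know that in sklearn's SVM https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html "...internally, one-vs-one ('ovo') is always used as a multi-class strategy to train models". However, since I am not sure how to change the training behavior of sklearn's SVM so that each pair is resampled during training, I implemented a custom class to do that.

Currently the custom class runs fine. However, my implementation has a (logic) bug: it changes the class labels of every pair to 0 and 1, which messes up my class labels. The code below is an MWE that demonstrates this: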

# required imports
import random
from collections import Counter
from math import dist
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import check_random_state
from sklearn.multiclass import OneVsOneClassifier
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

np.random.seed(7)
random.seed(7)

# resampler class
class DataUndersampler():
  def __init__(self, random_state=None):
    self.random_state = random_state
    print('DataUndersampler()')

  def fit_resample(self, X, y):
    random_state = check_random_state(self.random_state)
    # class distribution
    counter = Counter(y)
    print(f'Original class distribution: {counter}')
    maj_class = counter.most_common()[0][0]
    min_class = counter.most_common()[-1][0]
    # number of minority examples
    num_minority = len(X[ y == min_class])
    #num_majority = len(X[ y == maj_class]) # check on with maj now
    svc = SVC(kernel='rbf', random_state=32)
    svc.fit(X,y)
    # majority class support vectors
    maj_sup_vectors = svc.support_vectors_[maj_class]
    #min_sup_vectors = svc.support_vectors_[min_class] # minority sup vect
    # compute distances to support vectors' point
    distances = []
    for i, x in enumerate(X[y == maj_class]): 
      #input(f'sv: {maj_sup_vectors}, x: {x}') # check value passed
      d = dist(maj_sup_vectors, x) 
      distances.append((i, d))
    # sort distances (reverse=False -> ascending)
    distances.sort(reverse=False, key=lambda tup: tup[1])
    index = [i for i, d in distances][:num_minority] 
    X_ds = np.concatenate((X[y == maj_class][index], X[y == min_class]))
    y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))
    print(f"Resampled class distribution ('ovo'): {Counter(y_ds)} \n")

    return X_ds, y_ds

So, using this:

# synthetic data
X, y = make_classification(n_samples=10_000, n_classes=5, weights=[22.6, 3.7, 16.4, 51.9],
                           n_informative=4)

# actual class distribution
Counter(y)
Counter({0: 9924, 1: 22, 2: 15, 3: 13, 4: 26})

resampler = DataUndersampler(random_state=234)
rf_clf = model = RandomForestClassifier()

pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)
DataUndersampler()

classifier.fit(X, y)

Original class distribution: Counter({0: 9924, 1: 22})  
Resampled class distribution ('ovo'): Counter({0: 22, 1: 22}) 

Original class distribution: Counter({0: 9924, 1: 15}) # this should be {0: 9924, 2: 15}
Resampled class distribution ('ovo'): Counter({0: 15, 1: 15}) # should be-> {0: 9924, 2: 15}

Original class distribution: Counter({0: 9924, 1: 13}) # should be -> {0: 9924, 3: 13}
Resampled class distribution ('ovo'): Counter({0: 13, 1: 13}) # -> {0: 9924, 3: 13}

Original class distribution: Counter({0: 9924, 1: 26}) # should be-> {0: 9924, 4: 26}
Resampled class distribution ('ovo'): Counter({0: 26, 1: 26}) # -> {0: 9924, 4: 26}

Original class distribution: Counter({0: 22, 1: 15}) # should be > {1: 22, 2: 15}
Resampled class distribution ('ovo'): Counter({0: 15, 1: 15}) # -> {1: 22, 2: 15}

Original class distribution: Counter({0: 22, 1: 13}) # -> {1: 22, 3: 13}
Resampled class distribution ('ovo'): Counter({0: 13, 1: 13}) ## -> {1: 22, 3: 13}

Original class distribution: Counter({1: 26, 0: 22}) # -> {4: 26, 1: 22}
Resampled class distribution ('ovo'): Counter({1: 22, 0: 22}) # -> {4: 26, 1: 22}

Original class distribution: Counter({0: 15, 1: 13}) # -> {2: 15, 3: 13}
Resampled class distribution ('ovo'): Counter({0: 13, 1: 13}) # -> {2: 15, 3: 13}

Original class distribution: Counter({1: 26, 0: 15}) # -> {4: 26, 2: 15}
Resampled class distribution ('ovo'): Counter({1: 15, 0: 15}) # -> {4: 26, 2: 15}

Original class distribution: Counter({1: 26, 0: 13}) # -> {4: 26, 3: 13}
Resampled class distribution ('ovo'): Counter({1: 13, 0: 13}) # -> {4: 26, 3: 13}

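How can I fix this?

The problem:

In your code, the class labels get messed up because of the way OneVsOneClassifier works internally https://scikit-learn.org/stable/modules/multiclass.html#onevsoneclassifier. It turns the original multi-class problem into several binary classification problems. For each binary problem the classes are relabeled as 0 and 1, which is why you only see 0 and 1 in your output.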

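The problem, in detail:

When you use OneVsOneClassifier, it builds several binary classifiers internally, each trained on only two of the original classes. For each binary classifier the class labels are converted to 0 and 1. OneVsOneClassifier performs this conversion itself so that it can treat each pair as a binary classification problem.

As a result, the labels y that you receive inside your DataUndersampler class are these converted labels 0 and 1, not the original labels of the multi-class problem. That is why the print statements inside DataUndersampler.fit_resample() show Counter objects with the keys 0 and 1.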
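One way to check this for yourself is to wrap an estimator so that it records the labels it is fitted on. The sketch below is only an illustration: RecordingTree and seen are hypothetical names introduced here, a decision tree stands in for your pipeline, and the data are random.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.tree import DecisionTreeClassifier

seen = []  # hypothetical helper: labels observed by each inner binary estimator


class RecordingTree(DecisionTreeClassifier):
    """Decision tree that records the labels passed to fit()."""

    def fit(self, X, y, sample_weight=None, check_input=True):
        seen.append(np.unique(y).tolist())
        return super().fit(X, y, sample_weight=sample_weight, check_input=check_input)


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.repeat([10, 20, 30], 20)  # original labels are 10, 20 and 30

ovo = OneVsOneClassifier(RecordingTree(random_state=0)).fit(X, y)

print(seen)                       # labels seen inside each pairwise fit
print(ovo.classes_)               # labels kept by the outer wrapper
print(np.unique(ovo.predict(X)))  # labels used in the final predictions
```

The inner estimators only ever see the binary labels, while the outer OneVsOneClassifier keeps the original ones in classes_ and uses them for its predictions.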

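Here is an example of how this happens:

Suppose you have a multi-class problem with 3 classes, labeled 0, 1, and 2. When OneVsOneClassifier is applied, it creates 3 binary classifiers: one for class 0 vs. class 1, one for class 0 vs. class 2, and one for class 1 vs. class 2.

For each binary classifier, the two classes are relabeled as 0 and 1. For the first classifier (class 0 vs. class 1), the original class 0 becomes 0 and the original class 1 becomes 1. For the second classifier (class 0 vs. class 2), the original class 0 becomes 0 and the original class 2 becomes 1. Likewise, for the third classifier (class 1 vs. class 2), the original class 1 becomes 0 and the original class 2 becomes 1.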
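The three-class example can be reproduced by hand. This is a minimal sketch that mimics the relabeling described above (first class of the pair becomes 0, second becomes 1); it is not scikit-learn's own code, and the toy label array and the name relabeled are made up for illustration.

```python
from itertools import combinations

import numpy as np

y = np.array([0, 0, 1, 1, 1, 2, 2])  # toy labels, assumed for illustration
classes = np.unique(y)

relabeled = {}  # (first, second) -> binary labels for that pair
for first, second in combinations(classes.tolist(), 2):
    mask = (y == first) | (y == second)         # keep only the two classes of this pair
    y_binary = (y[mask] == second).astype(int)  # first -> 0, second -> 1
    relabeled[(first, second)] = y_binary.tolist()
    print(f"{first} vs {second}: {y[mask].tolist()} -> {y_binary.tolist()}")
```

Every pair ends up with the same two labels, so the information about which original classes were involved is lost unless the pair itself is remembered.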

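When your DataUndersampler.fit_resample() method receives y, it receives these converted labels, not the original labels of the multi-class problem.

The key point is that the relabeling to 0 and 1 is done independently for each binary classifier and does not preserve the original labels. That is why you only see 0 and 1 in your output, and that is what I mean by "the class labels get messed up". It is not that the labels are assigned incorrectly; rather, the original labels are converted to 0 and 1 for each binary classification problem, which is not what you expected.

To keep track of the original labels, you need to store them before the conversion and then map the binary labels back to the original labels once resampling is done.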
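Storing the pair and mapping back can be sketched as follows. The helpers to_binary and to_original are hypothetical names, and the sketch assumes you build the class pairs yourself, since a sampler placed inside OneVsOneClassifier is not told which pair it is working on.

```python
import numpy as np


def to_binary(y, pair):
    """Relabel the two classes in `pair` as 0 and 1 (hypothetical helper)."""
    return np.where(y == pair[1], 1, 0)


def to_original(y_binary, pair):
    """Map 0/1 labels back to the original labels in `pair` (hypothetical helper)."""
    return np.asarray(pair)[y_binary]


pair = (2, 4)                         # the original labels of this pair, stored up front
y_pair = np.array([2, 4, 4, 2, 4])    # toy labels for illustration
y_binary = to_binary(y_pair, pair)    # what a binary sub-problem would see
y_restored = to_original(y_binary, pair)

print(y_binary.tolist(), y_restored.tolist())
```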

Possible solution:

To solve this, you can use the scikit-learn-contrib/imbalanced-learn https://github.com/scikit-learn-contrib/imbalanced-learn library (pip install -U imbalanced-learn).
Its RandomUnderSampler https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html does no relabeling of its own: the resampled data keep whatever class labels were passed in.

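In the original implementation, the class labels got "messed up" because OneVsOneClassifier converts the multi-class problem into several binary classification problems. For each binary problem the classes are relabeled as 0 and 1. That is why you only saw 0 and 1 in the output, even though your original data had different labels.

With RandomUnderSampler, the class labels are preserved. RandomUnderSampler creates a new, balanced data set by randomly selecting a subset of the majority class. The class labels of the original data set are used in the new data set.

So in the new implementation there is no need to maintain a mapping from the original class labels to binary labels, because RandomUnderSampler handles this for you. This is one of the benefits of using a specialized library such as imbalanced-learn, which provides reliable solutions to common problems in machine learning.

Here is a modified version of your DataUndersampler class, which keeps the original labels, together with how to use it: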
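To make "the labels are preserved" concrete, here is a minimal NumPy sketch of random undersampling. It is not imbalanced-learn's implementation, and random_undersample is a hypothetical name; it only shows that picking a subset of rows leaves the labels of those rows untouched.

```python
from collections import Counter

import numpy as np


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    counts = Counter(y.tolist())
    n_keep = min(counts.values())
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in counts
    ])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]


X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([3] * 9 + [7] * 3)  # labels deliberately not 0 and 1
X_res, y_res = random_undersample(X, y)

print(Counter(y_res.tolist()))
```

The resampled labels are still 3 and 7. If the same function were handed labels 0 and 1 by OneVsOneClassifier, it would return 0 and 1 just as faithfully.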

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
import numpy as np

class DataUndersampler:
    def __init__(self):
        self.sampler = RandomUnderSampler(random_state=42)

    def fit(self, X, y):
        self.sampler.fit_resample(X, y)
        return self

    def transform(self, X, y):
        X_res, y_res = self.sampler.fit_resample(X, y)
        return X_res, y_res

# Create a dummy dataset (n_informative=3 so that n_classes * n_clusters_per_class <= 2**n_informative)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=3, n_redundant=10, n_classes=3, weights=[0.01, 0.01, 0.98], class_sep=0.8, random_state=42)

# initialize your undersampler
undersampler = DataUndersampler()

# fit the undersampler and transform the data
X_resampled, y_resampled = undersampler.fit(X, y).transform(X, y)

print(f"Original class distribution: {Counter(y)}")
print(f"Resampled class distribution: {Counter(y_resampled)}")

# initialize the pipeline (without the undersampler)
pipeline = Pipeline([
    ('clf', OneVsOneClassifier(RandomForestClassifier(random_state=42)))
])

# fit the pipeline on the resampled data
pipeline.fit(X_resampled, y_resampled)

# now you can use your pipeline to predict
# y_pred = pipeline.predict(X_test)  # assuming you have a test set X_test

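I have commented out the last line because X_test is not defined in this code. If you have a separate test set, you can uncomment that line to make predictions.

The main changes are:

RandomUnderSampler is used instead of a manual undersampling implementation. This removes the need for the _undersample function and considerably simplifies the fit and transform methods.
The fit method now simply fits RandomUnderSampler to the data and returns self, because the fit method of a transformer in a scikit-learn pipeline is expected to return self.
The transform method applies the fitted RandomUnderSampler to the data and returns the undersampled data.

The main idea behind these changes is to rely on existing libraries and conventions as much as possible, so that the code is simpler, easier to understand, and easier to maintain.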

MWE

The minimal working example (MWE) is now:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers (fit_resample) as steps
from sklearn.multiclass import OneVsOneClassifier
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# synthetic data
X, y = make_classification(n_samples=10_000, n_classes=5, weights=[22.6, 3.7, 16.4, 51.9],
                           n_informative=4)

print("Original class distribution:", Counter(y))

resampler = RandomUnderSampler(random_state=234)
rf_clf = RandomForestClassifier()

pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)

classifier.fit(X, y)

# predict and evaluate
y_pred = classifier.predict(X)
print("Predicted class distribution:", Counter(y_pred))

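In this updated code:

We import RandomUnderSampler from imbalanced-learn.
We replace DataUndersampler with RandomUnderSampler in the pipeline.
We remove the print statements about the resampled class distribution, because RandomUnderSampler does not report this directly. After training the classifier, you can still get the distribution of the predicted classes.

This code should work without the label problem you ran into before. It should also be shorter and more concise than the original MWE.

We want to fit an SVC to determine the support vectors in each pair of classes, and then ignore the majority-class examples that are far from its support vectors until the data are balanced (n_majority = n_minority examples).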

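Support-vector-based undersampling

So your goal is to undersample the majority class in a more informed way, taking the structure of the data into account rather than sampling at random.

We need to modify DataUndersampler to carry out this strategy.
The main idea is to fit an SVC (C-Support Vector Classification) https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html to the data, find the support vectors, and then undersample the majority class based on the distance to those support vectors.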

from sklearn.base import BaseEstimator
from sklearn.svm import SVC
import numpy as np
import pandas as pd

class DataUndersampler(BaseEstimator):
    def __init__(self, random_state=None):
        self.random_state = random_state

    # Implementing fit_resample (and not transform) lets imblearn's Pipeline
    # treat this class as a sampler.
    def fit_resample(self, X, y):
        # Accept NumPy arrays as well as pandas objects
        X = pd.DataFrame(X).reset_index(drop=True)
        y = pd.Series(y).reset_index(drop=True)

        # Fit SVC to data and get its support vectors
        self.svc_ = SVC(kernel='linear', random_state=self.random_state)
        self.svc_.fit(X.values, y.values)
        support_vectors = self.svc_.support_vectors_

        # Separate majority and minority classes
        # (value_counts sorts by count in descending order)
        counts = y.value_counts()
        majority_class, minority_class = counts.index[0], counts.index[-1]
        X_majority = X[y == majority_class]
        y_majority = y[y == majority_class]
        X_minority = X[y == minority_class]
        y_minority = y[y == minority_class]

        # Calculate distances of majority class samples to nearest support vector
        distances = np.min(np.linalg.norm(X_majority.values[:, np.newaxis] - support_vectors, axis=2), axis=1)

        # Sort the majority class samples by distance and take only as many as there are in minority class
        sorted_indices = np.argsort(distances)
        indices_to_keep = sorted_indices[:len(y_minority)]

        # Combine the undersampled majority class with the minority class
        X_resampled = pd.concat([X_majority.iloc[indices_to_keep], X_minority])
        y_resampled = pd.concat([y_majority.iloc[indices_to_keep], y_minority])

        return X_resampled.to_numpy(), y_resampled.to_numpy()

You can use this class in the pipeline as before (with Pipeline imported from imblearn.pipeline):

resampler = DataUndersampler(random_state=234)
pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)
classifier.fit(X, y)

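This approach respects the structure of the data when undersampling, because it uses the SVM's support vectors to guide the undersampling. The final predictions still carry the original labels, because OneVsOneClassifier maps each binary result back to the original classes.
Note, however, that this is computationally more expensive than random undersampling, since an SVM has to be fitted and the distances to the support vectors computed for every pair of classes.

The new DataUndersampler class is quite different from the previous one, because it uses a different undersampling strategy.
Here are the main differences: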

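Support vector classifier (SVC): the new class fits an SVC to the data. This is a major difference, because the previous class did not use any learning algorithm. The SVC is used to find the support vectors, the data points that define the decision boundary between the classes.
Support vectors and distances: the new class uses the support vectors to compute the distance from each majority-class data point to its nearest support vector. This information is used to undersample the majority class, keeping the data points closest to the support vectors. By contrast, the previous class used random undersampling and did not take the structure of the data into account.
Resampling: the new class undersamples the majority class according to the computed distances, keeping as many data points as there are in the minority class. This balances the classes and keeps the majority-class data points that are likely to be the most informative, because they lie close to the decision boundary.
The previous class also aimed to balance the classes, but did so by randomly discarding majority-class data points.
No relabeling: the new class does not relabel the classes itself.
It keeps whatever labels it is given, because SVC can handle them as they are.
Pandas: the new code uses pandas for data manipulation (e.g., separating the majority and minority classes and combining the resampled data). The previous class used numpy arrays.
Scikit-learn compatibility: the new class extends scikit-learn's BaseEstimator, so that it can be used as part of a pipeline. Its methods fit the SVC and undersample the data.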

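The new undersampling strategy used in the revised DataUndersampler class is essentially what is called support-vector-based undersampling.

The core idea of this strategy is to fit a support vector machine (SVM) https://scikit-learn.org/stable/modules/svm.html classifier to the data. The classifier identifies the data points, called support vectors, that define the decision boundary between the classes.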
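As a minimal sketch of this first step, the support vectors of a fitted SVC can be split by class using the support_ indices. The toy data below are made up for illustration. Note that indexing support_vectors_ directly with a class label, as in svc.support_vectors_[maj_class], selects a single row by position rather than the vectors of that class.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy data (assumed): 40 points of class 0 and 10 points of class 1
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(3.0, 1.0, size=(10, 2))])
y = np.array([0] * 40 + [1] * 10)

svc = SVC(kernel="linear").fit(X, y)

sv_labels = y[svc.support_]                    # class of each support vector
maj_sv = svc.support_vectors_[sv_labels == 0]  # support vectors of class 0 only
min_sv = svc.support_vectors_[sv_labels == 1]  # support vectors of class 1 only

print(svc.n_support_)  # number of support vectors per class
```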

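Then, for each data point in the majority class, the distance to the nearest support vector is computed. The rationale is that the majority-class data points closest to the decision boundary (i.e., to the support vectors) are the most informative for the classification task, because they lie at the "edge" of the majority class, nearest the minority class.

The data points in the majority class are then sorted by this distance, and those farthest from the decision boundary are discarded until the number of majority-class data points equals the number of minority-class data points. This effectively undersamples the majority class while keeping the most informative data points.

This strategy differs from the earlier DataUndersampler class, which simply discards majority-class data points at random until the classes are balanced. Support-vector-based undersampling is a more sophisticated and more targeted approach, because it takes the structure of the data into account when deciding which data points to discard.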
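The distance-and-sort step can be sketched on its own with NumPy. The function name closest_to_support_vectors and the coordinates are hypothetical; in practice the support vectors would come from the fitted SVC.

```python
import numpy as np


def closest_to_support_vectors(X_majority, support_vectors, n_keep):
    """Return indices of the n_keep rows of X_majority nearest to any support vector."""
    # Pairwise differences, shape (n_majority, n_support_vectors, n_features)
    diffs = X_majority[:, np.newaxis, :] - support_vectors[np.newaxis, :, :]
    # Euclidean distance from each majority point to its nearest support vector
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
    # Ascending sort, so the closest points come first
    return np.argsort(nearest, kind="stable")[:n_keep]


X_majority = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [2.0, 2.0]])
support_vectors = np.array([[1.0, 1.0], [2.0, 3.0]])  # made-up values for illustration
kept = closest_to_support_vectors(X_majority, support_vectors, n_keep=2)

print(kept.tolist())  # indices of the two majority points closest to a support vector
```

Here n_keep plays the role of the minority-class size. The intermediate array grows with n_majority * n_support_vectors * n_features, so for large data it may be worth computing the distances in chunks.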
