Machine Learning: How to Handle Missing Values

2023-05-16

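Note: the data used here comes from Kaggle (see here for details), and the original reference article is linked here. This post is fairly long; its aim is to walk through some of the ideas and details of the EDA process.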

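Contents

Machine Learning: How to Handle Missing Values
I. Introduction
II. Distribution of Missing Values
- 1. Impact of null values on each feature
- 2. Impact of null values on each sample
- 3. Linear relationship between null values and the label
- 4. Distribution of the label by number of null values
III. Impact of Null Values and Ways to Fill Them
- 1. A simple LightGBM classifier
- 2. Mean fill
- 3. Constant fill
- 4. IterativeImputer fill
- 5. Using null count as a new feature
IV. Conclusions
V. How HyperGBM Handles Null Values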

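I. Introduction

The purpose of this EDA (Exploratory Data Analysis) is to explore the missing values in the dataset: by understanding how they are distributed, we hope to find a suitable way to fill them and so improve model performance.

II. Distribution of Missing Values

Let's look at how null values are distributed in the dataset. Understanding that distribution is very useful in feature engineering, and it matters no less than choosing a suitable model. For example, once we know how the nulls are distributed, we can avoid simply filling them with the mean or with 0. Next, let's see what effect null values have on an ML model.

1. Impact of null values on each feature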

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

## load data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Define the features we want to examine, names for one-hot encoding them, and the total number of records
features = [x for x in train.columns if x.startswith("f") and not x.endswith("_is_null")]
null_features = ["{}_is_null".format(x) for x in features]
total_rows = float(train.shape[0])

# One-hot encode whether a feature is null per row
for feature, null_feature in zip(features, null_features):
    train[null_feature] = train[feature].isnull().astype(int)

# Generate counts of number of null values we see per feature
null_counts = pd.DataFrame.from_dict({k : [round((train[(train[k]) == 1][k].count() / total_rows) * 100, 3)] for k in null_features})

# Plot percentage of rows impacted by feature
sns.set_style("whitegrid")
bar, ax = plt.subplots(figsize=(10, 35))
# On seaborn >= 0.12, `errorbar=None` is the non-deprecated spelling of `ci=None`
ax = sns.barplot(data=null_counts, ci=None, palette="muted", orient="h")
ax.set_title("Percentage of Null Values Per Feature (Train Data)", fontsize=15)
ax.set_xlabel("Percentage")
ax.set_ylabel("Feature")
for rect in ax.patches:
    ax.text(rect.get_width(), rect.get_y() + rect.get_height() / 2, "%.3f%%" % rect.get_width())
plt.show()

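[Figure: Percentage of Null Values Per Feature (Train Data)]
Here we can see that every feature contains null values, at roughly 1.6% each. Next we should check whether the test set shows the same distribution.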

## Almost the same as the train code.

[Figure: percentage of null values per feature, test data]
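Since the test-set code is elided above, here is a minimal sketch of how the per-feature null percentage could be computed once and reused for both sets. The helper name `null_percentages` and the two toy frames are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_percentages(df: pd.DataFrame, prefix: str = "f") -> pd.Series:
    """Percentage of null values for every column whose name starts with `prefix`."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return (df[cols].isnull().mean() * 100).round(3)


# Tiny hypothetical frames standing in for train.csv / test.csv
toy_train = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [np.nan, np.nan, 1.0, 2.0],
    "claim": [0, 1, 0, 1],
})
toy_test = pd.DataFrame({
    "f1": [np.nan, 2.0, 3.0, 4.0],
    "f2": [1.0, np.nan, np.nan, 2.0],
})

# Side-by-side comparison of the two distributions
comparison = pd.DataFrame({
    "train_pct": null_percentages(toy_train),
    "test_pct": null_percentages(toy_test),
})
print(comparison)
```

Putting both sets in one table makes it easy to spot a feature whose missing rate differs between train and test.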

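We can see that null values are distributed the same way in the training and test sets. The question now becomes: how do the nulls overlap? In other words, is every sample affected by null values, or are there samples with no missing values at all?

2. Impact of null values on each sample

To analyze how null values affect each row, we can build a 'null count' feature that records how many null values each sample contains, and then count how many samples contain 0, 1, 2, 3, ... null values.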

# Count the number of null values that occur in each row
train["null_count"] = train.isnull().sum(axis=1)
# Group the null counts
counts = train.groupby("null_count")["claim"].count().to_dict()
null_data = {"{} Null Value(s)".format(k) : v for k, v in counts.items() if k < 6}
# Sum over the actual null_count keys (enumerate() would yield positions, not null counts)
null_data["6 or More Null Values"] = sum(v for k, v in counts.items() if k > 5)

# Plot the null count results
pie, ax = plt.subplots(figsize=[20, 10])
plt.pie(x=null_data.values(), autopct="%.2f%%", explode=[0.05]*len(null_data.keys()), labels=null_data.keys(), pctdistance=0.5)
_ = plt.title("Percentage of Null Values Per Row (Train Data)", fontsize=14)
plt.show()

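[Figure: Percentage of Null Values Per Row (Train Data)]
We can see that 37.5% of the samples contain no missing values at all, while 6.05% of the samples contain 6 or more. Again, let's check whether the test set looks the same.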

## Almost the same as the train code.

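[Figure: percentage of null values per row, test data]

The per-sample distribution of null values is the same in the test set as in the training set; here too, roughly a third of the data contains no missing values.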
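The row-level view can be illustrated on a tiny frame. This is only a sketch on made-up data; the bucket labels are hypothetical, and `clip` is used here as one possible way to collapse the tail into an "N or more" bucket.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan],
    "f2": [5.0, 6.0, np.nan, np.nan],
    "f3": [7.0, 8.0, 9.0, 10.0],
})

# Number of null values in each row
null_count = toy.isnull().sum(axis=1)

# Share of rows (in percent) for every distinct null count
share_by_count = null_count.value_counts(normalize=True).sort_index() * 100

# Collapse everything above a threshold into a single "or more" bucket
bucketed = null_count.clip(upper=1).map({0: "0 nulls", 1: "1 or more nulls"})
bucket_share = bucketed.value_counts(normalize=True) * 100

print(share_by_count)
print(bucket_share)
```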

3. Linear relationship between null values and the label

Let's check whether the null-indicator features have a strong linear correlation with the label.

# Define the features we want to examine, names for one-hot encoding them, and the total number of records
features = [x for x in train.columns if x.startswith("f") and not x.endswith("_is_null")]
null_features = ["{}_is_null".format(x) for x in features]
total_rows = float(train.shape[0])

# One-hot encode whether a feature is null per row
for feature, null_feature in zip(features, null_features):
    train[null_feature] = train[feature].isnull().astype(int)

correlation_features = null_features.copy()
correlation_features.append("claim")
null_correlation = train[correlation_features].corr()
null_correlation.style.background_gradient(cmap='coolwarm')

f, ax = plt.subplots(figsize=(30, 30))

# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(
    null_correlation,
    mask=np.triu(np.ones_like(null_correlation, dtype=bool)),
    cmap=sns.diverging_palette(230, 20, as_cmap=True),
    vmax=.3,
    center=0,
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .5}
)
plt.show()

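[Figure: correlation heatmap of the null-indicator features and the label]
We can see that there is no obvious linear relationship between null values and the label.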
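For readers who only care about the label row of that heatmap, the same Pearson correlations can be computed directly. This is a sketch on made-up data in which `f1` is deliberately missing exactly when `claim` is 1, so unlike the real dataset it shows a perfect correlation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],
    "f2": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
    "claim": [0, 1, 0, 1, 0, 1],
})

# 0/1 indicator per feature: 1 where the value is missing
indicators = toy[["f1", "f2"]].isnull().astype(int).add_suffix("_is_null")

# Pearson correlation of each indicator with the label
corr_with_label = indicators.corrwith(toy["claim"])
print(corr_with_label)
```

A near-zero value for every single indicator, as in the article's data, does not rule out a signal in their sum, which is what the next section looks at.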

4. Distribution of the label by number of null values

Now let's look at how the label is distributed across the number of null values.

train["null_count"] = train.isnull().sum(axis=1)

z = dict()
for (null_count, claim_status), value_count in train[["null_count", "claim"]].value_counts().to_dict().items():
    if null_count not in z:
        z[null_count] = dict()
    z[null_count][claim_status] = value_count
# One row per observed null count: [null count, rows with claim == 0, rows with claim == 1]
a = []
for null_values in sorted(z):
    a.append([null_values, z[null_values].get(0, 0), z[null_values].get(1, 0)])
df = pd.DataFrame(a, columns=["Number of Null Values", "Not Claimed (0)", "Claimed (1)"])
ax = df.plot(x="Number of Null Values", y=["Not Claimed (0)", "Claimed (1)"], kind="bar", figsize=(20, 10))
_ = ax.set_title("Number of Null Values by Claim Status", fontsize=15)
_ = ax.set_ylabel("Number of Rows")
plt.show()

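[Figure: Number of Null Values by Claim Status]
We can see that once a sample contains 2 or more null values, the outcome leans towards claim. So adding null_count as a new feature should help the classifier considerably. It also suggests that if we could fill these null values effectively, classification would improve further.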
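The same relationship can be read as a rate rather than as two bars per group. The following sketch uses made-up numbers and a hypothetical frame name; it only illustrates the groupby pattern.

```python
import pandas as pd

toy = pd.DataFrame({
    "null_count": [0, 0, 0, 0, 1, 1, 2, 2, 3, 3],
    "claim":      [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

# For each null count: how many rows there are and what fraction of them claimed
by_null_count = toy.groupby("null_count")["claim"].agg(rows="count", claim_rate="mean")
print(by_null_count)
```

A claim rate that climbs with the null count is exactly the kind of monotone signal a tree model can split on.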

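III. Impact of Null Values and Ways to Fill Them

Now that we know null values are distributed the same way in the training and test sets, let's run a few small experiments to see how they affect model performance.

1. A simple LightGBM classifier

LightGBM has its own built-in mechanism for handling null values. Before trying any fill strategy, we first build a baseline with a default LGB classification model.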

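import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

features = [x for x in train.columns if x.startswith("f") and not x.endswith("_is_null")]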
train["null_count"] = train.isnull().sum(axis=1)
new_train = train.copy()
target = train["claim"]
k_fold = StratifiedKFold(
    n_splits=3,
    random_state=2021,
    shuffle=True,
)

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )
for fold, (train_index, test_index) in enumerate(k_fold.split(new_train[features], target)):
    x_train = pd.DataFrame(new_train[features].iloc[train_index])
    y_train = target.iloc[train_index]

    x_valid = pd.DataFrame(new_train[features].iloc[test_index])
    y_valid = target.iloc[test_index]
    model = LGBMClassifier(
        random_state=2021,
        metric="auc",
        n_estimators=16000,
        verbose=-1,
    )
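    # Note: LightGBM >= 4.0 no longer accepts early_stopping_rounds/verbose in fit();
    # there, pass callbacks=[lightgbm.early_stopping(200, verbose=False)] instead.
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=200,
        verbose=0,
    )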

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas

    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))


train["unmodified_preds"] = train_preds
train["unmodified_probas"] = train_probas
misclassified = train[(train["claim"] != train["unmodified_preds"])]["null_count"].value_counts().to_dict()

# Show the confusion matrix
confusion = confusion_matrix(train["claim"], train["unmodified_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Unmodified Dataset)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

# Plot percentage of rows impacted by feature
sns.set_style("whitegrid")
bar, ax = plt.subplots(figsize=(10, 10))
ax = sns.barplot(x=list(misclassified.keys()), y=list(misclassified.values()))
_ = ax.set_title("Number of Misclassifications by Null Values in Row (Unmodified Dataset)", fontsize=15)
_ = ax.set_xlabel("Number of Null Values in Row")
_ = ax.set_ylabel("Number of Misclassified Predictions")
for p in ax.patches:
    height = p.get_height()
    ax.text(
        x=p.get_x()+(p.get_width()/2),
        y=height,
        s=round(height),
        ha="center"
    )
plt.show()

...
-- Overall:
              precision    recall  f1-score   support

           0       0.75      0.76      0.75    480404
           1       0.75      0.74      0.75    477515

    accuracy                           0.75    957919
   macro avg       0.75      0.75      0.75    957919
weighted avg       0.75      0.75      0.75    957919

-- ROC AUC: 0.8040679784649976

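[Figure: confusion matrix and number of misclassifications by null count, unmodified dataset]

With 3-fold cross-validation, precision and recall are very stable, which indicates that the data distribution is consistent across folds. But we also see something else: although samples with exactly one null value make up only 14% of the dataset, they produce the largest number of misclassifications (72,118 samples), whereas samples with no null values make up as much as 37% yet account for only 48,553 misclassifications. Samples with 2 null values show the same pattern, and as the number of nulls increases further the number of misclassifications drops sharply. So the question is: can we improve classification by finding suitable fill values?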
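Because the groups differ so much in size, a per-group error rate is easier to compare than raw counts. Using the rounded shares quoted above, 72,118 errors among roughly 14% of 957,919 rows is an error rate of about 54%, against about 14% for the rows without nulls. A minimal sketch of that normalization on made-up predictions (the column name `pred` is hypothetical; the article stores predictions in `unmodified_preds`):

```python
import pandas as pd

toy = pd.DataFrame({
    "null_count": [0, 0, 0, 0, 1, 1, 2, 2],
    "claim":      [0, 1, 0, 1, 0, 1, 1, 1],
    "pred":       [0, 1, 0, 0, 1, 0, 1, 1],
})

# Flag every row whose prediction disagrees with the label
toy["misclassified"] = toy["claim"] != toy["pred"]

# Rows, absolute errors and error rate for each null count
error_by_null_count = toy.groupby("null_count")["misclassified"].agg(
    rows="count", errors="sum", error_rate="mean"
)
print(error_by_null_count)
```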

2. Mean fill

Now we fill every null value with the column mean and build the same model, to see whether this beats the baseline score.

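new_train = train.copy()
# Assign the filled columns back; fillna(..., inplace=True) on a column selection is unreliable in recent pandas
new_train[features] = new_train[features].fillna(new_train[features].mean())
## The rest is almost the same as the baseline model's code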

...
-- Overall:
              precision    recall  f1-score   support

           0       0.70      0.78      0.74    480404
           1       0.75      0.67      0.71    477515

    accuracy                           0.73    957919
   macro avg       0.73      0.73      0.72    957919
weighted avg       0.73      0.73      0.72    957919

-- ROC AUC: 0.7903046619321749

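[Figure: confusion matrix and number of misclassifications by null count, mean-filled dataset]

Compared with LightGBM's default handling of nulls, mean fill actually reduces performance, and samples with 1 or 2 null values are misclassified even more often. The classification report suggests that this comes mainly from false negatives (class-1 samples predicted as 0): recall for class 1 falls from 0.74 to 0.67, while recall for class 0 rises slightly from 0.76 to 0.78. In other words, mean fill throws away important information, for example if the nulls originally stood for outliers or boundary values. Mean fill is clearly not a good choice here.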
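The information loss is easy to see on a single made-up column: after mean fill, a row that was missing looks exactly like a row that genuinely held the mean value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"f1": [10.0, np.nan, 30.0, np.nan, 50.0]})

was_null = toy["f1"].isnull()
filled = toy["f1"].fillna(toy["f1"].mean())  # mean of 10, 30, 50 is 30

# Rows 1 and 3 were missing, row 2 was a real 30: all three are now identical
indistinguishable = filled[was_null].eq(filled[2]).all()
print(filled.tolist(), indistinguishable)
```

Once the column looks like this, no model can tell the formerly missing rows apart from ordinary mid-range rows using this column alone.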

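3. Constant fill

Now we try filling every null value with 0. Zero fill is not obviously a good idea, because we have not looked at the value distribution of each column, and filling with 0 may skew the data. For example, if a column's values normally lie between 10,000 and 100,000 and we insert a 0, we change that column's distribution. On the other hand, a GBM model can pick up on exactly this kind of skew, so it might even improve performance. Who knows, let's try it.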

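new_train = train.copy()
# Assign the filled columns back; fillna(..., inplace=True) on a column selection is unreliable in recent pandas
new_train[features] = new_train[features].fillna(0)
## The rest is almost the same as the baseline model's code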

...
-- Overall:
              precision    recall  f1-score   support

           0       0.72      0.77      0.75    480404
           1       0.75      0.70      0.73    477515

    accuracy                           0.74    957919
   macro avg       0.74      0.74      0.74    957919
weighted avg       0.74      0.74      0.74    957919

-- ROC AUC: 0.7969909218144446

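[Figure: confusion matrix and number of misclassifications by null count, zero-filled dataset]
Zero fill still falls short of the baseline, but compared with mean fill, misclassification for samples with 2 null values is lower and ROC AUC improves slightly.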
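The skew concern raised at the start of this section can be checked on a made-up column whose real values sit far from zero. This is only an illustration; the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.Series([20000.0, 40000.0, np.nan, 60000.0, np.nan, 80000.0], name="f1")
zero_filled = toy.fillna(0)

# describe() ignores NaN, so the "original" column summarizes only observed values
summary = pd.DataFrame({
    "original": toy.describe(),
    "zero_filled": zero_filled.describe(),
})
print(summary.loc[["count", "mean", "min", "max"]])
```

The filled zeros sit well outside the observed range, which distorts the column's statistics but also leaves the formerly missing rows separable by a single tree split. That may be why zero fill hurts less than mean fill in this experiment.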

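4. IterativeImputer fill

We can also try the various imputers. Scikit-learn's IterativeImputer, for example, models each feature that has missing values as a function of the other features and uses that estimate to fill the gaps. Here we set n_nearest_features=5, so each feature is estimated from 5 other features. Note that this differs from KNNImputer, which fills a gap from the most similar samples.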

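from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer
new_train = train.copy()  # start again from the data that still contains nulls
imputer = IterativeImputer(random_state=2021, n_nearest_features=5)
new_train[features] = imputer.fit_transform(new_train[features])
## The rest is almost the same as the baseline model's code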

...
-- Overall:
              precision    recall  f1-score   support

           0       0.68      0.75      0.71    480404
           1       0.72      0.64      0.68    477515

    accuracy                           0.70    957919
   macro avg       0.70      0.70      0.70    957919
weighted avg       0.70      0.70      0.70    957919

-- ROC AUC: 0.7552119652576256

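[Figure: confusion matrix and number of misclassifications by null count, IterativeImputer-filled dataset]
This time something different happens. While lowering misclassification for samples with exactly 1 null value, we also changed the distribution of the other columns: misclassifications among samples with no null values rose from 48,553 to 63,879. This may be because the filled values depart from the real data distribution. At this point it is clear that the null values must carry important information, and that IterativeImputer cannot recover it.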
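For contrast, here is a sketch of a situation where IterativeImputer does work well: synthetic data in which the missing feature is almost a linear function of another one. All data and names below are made up, and the default estimator settings are used.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2021)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)  # x2 is nearly determined by x1
X = np.column_stack([x1, x2])

# Hide 20 values of x2
X_missing = X.copy()
missing_rows = rng.choice(200, size=20, replace=False)
X_missing[missing_rows, 1] = np.nan

# Each feature with gaps is regressed on the other features, round-robin
imputer = IterativeImputer(random_state=2021)
X_imputed = imputer.fit_transform(X_missing)

# Largest gap between an imputed value and the value that was hidden
max_abs_error = np.abs(X_imputed[missing_rows, 1] - X[missing_rows, 1]).max()
print(max_abs_error)
```

When the hidden values are predictable from other columns, the imputed values land close to the truth. The result above suggests that the article's dataset is not like this: what matters there seems to be the fact that a value is missing, not the value itself.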

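5. Using null count as a new feature

Instead of trying to recover the information the null values originally carried, we can add the number of null values per row to the original data as a new feature.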

features.append("null_count")
## The rest is almost the same as the baseline model's code

...
-- Overall:
              precision    recall  f1-score   support

           0       0.86      0.66      0.74    480404
           1       0.72      0.89      0.80    477515

    accuracy                           0.77    957919
   macro avg       0.79      0.77      0.77    957919
weighted avg       0.79      0.77      0.77    957919

-- ROC AUC: 0.8126761574315831

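[Figure: confusion matrix and number of misclassifications by null count, with the null_count feature]

This gives the best result so far. We greatly reduced misclassification among samples that contain null values, although misclassification among samples without nulls is unchanged. In fact, based on this property we can build an even simpler classifier that labels every sample containing a null value as 1, and its performance will not be too bad. Try it yourself if you are interested.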
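The rule-based classifier suggested above fits in a few lines. This sketch runs on a made-up frame, so the accuracy it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "f2": [1.0, 2.0, np.nan, np.nan, 5.0, 6.0],
    "claim": [0, 1, 1, 1, 0, 1],
})

# Rule: predict claim = 1 whenever the row contains at least one null value
null_count = toy[["f1", "f2"]].isnull().sum(axis=1)
rule_pred = (null_count > 0).astype(int)

accuracy = (rule_pred == toy["claim"]).mean()
print(rule_pred.tolist(), accuracy)
```

On the real data the same three lines can be applied to `train[features]` and scored against `train["claim"]` to see how far the rule alone gets.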

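IV. Conclusions

From the experiments above we can draw the following conclusions:

1. Null values carry very important information, in this dataset perhaps the most important information. Building a null count feature gives the model a very effective discriminating signal and improves performance.

2. It is hard to find a suitable way to fill the null values. For this dataset at least, there is not enough information to recover them.

For any ML algorithm, we can leave the null values in place or use an algorithm that handles nulls directly; we can also use null count as a new feature or one-hot encode the nulls; and of course we can fill the nulls with an imputer.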
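Filling and indicator encoding can also be combined in one step. One way to do this, shown here as a sketch on made-up data rather than as the method used in the experiments above, is scikit-learn's SimpleImputer with add_indicator=True, which appends one 0/1 column per feature that had missing values during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, np.nan],
    [2.0, 20.0],
    [np.nan, 30.0],
    [4.0, 40.0],
])

# Median fill plus one missing-indicator column per feature that had gaps
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns: filled feature 0, filled feature 1, indicator 0, indicator 1
print(X_out)
```

This keeps the model input free of NaN for algorithms that cannot handle it, while preserving the "was missing" signal that the experiments showed to be valuable.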

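V. How HyperGBM Handles Null Values

HyperGBM supports 4 imputation strategies: ['mean', 'median', 'constant', 'most_frequent'].

With num_pipeline_mode='simple', null values are filled with the mean.
With num_pipeline_mode='complex' (the default), the framework searches for the best imputation strategy.

For users who ensemble or stack models, different imputers produce different input data, which increases the diversity among the models.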

...
search_space_ = GeneralSearchSpaceGenerator(num_pipeline_mode='simple')
experiment = make_experiment(train_data=train,target='claim',log_level='info',search_space=search_space_)
...
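The four strategy names listed above coincide with the strategies of scikit-learn's SimpleImputer. Assuming they carry the same meaning (how HyperGBM implements them internally is not shown here), the following sketch on a made-up column shows what value each strategy would fill in.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up numeric column with a single missing value in the last row
column = np.array([[1.0], [1.0], [2.0], [8.0], [np.nan]])

fill_values = {}
for strategy in ["mean", "median", "constant", "most_frequent"]:
    # fill_value is only meaningful for the "constant" strategy
    kwargs = {"fill_value": 0.0} if strategy == "constant" else {}
    imputer = SimpleImputer(strategy=strategy, **kwargs)
    filled = imputer.fit_transform(column)
    fill_values[strategy] = float(filled[-1, 0])

print(fill_values)
```

Each strategy gives the model a different view of the same rows, which is the source of the diversity mentioned above for ensembling and stacking.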
