sklearn学习之使用sklearn进行特征选择

2023-11-16

文章目录

在本节中我们将使用 sklearn.feature_selection模块中的类在高维度的样本集上进行特征选择、降维来提升估计器的性能。

1. Removing features with low variance方差选择法

sklearn.feature_selection.VarianceThreshold(threshold=0.0)

方差选择法是一种进行特征选择的简单的baseline方法，它移除所有不满足给定阈值要求的特征。阈值默认为0，此方法默认移除方差为0的特征，即在所有样本上取值相同的特征。
example:

from sklearn.feature_selection import VarianceThreshold
# X_train的方差为[0.        , 0.22222222, 2.88888889, 0.        ]
X_train = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
# X_test的方差为[1.55555556, 8.66666667, 0.22222222, 0.22222222]
X_test = [[1, 2, 0, 3], [3, 4, 0, 3], [4, 9, 1, 2]]
# 实例化一个阈值为0.5的方差选择器对象
selector = VarianceThreshold(threshold=0.5)
# fit方法计算给定数据（这里为X_train）的方差，返回一个实例化对象
selector.fit(X_train)
# variances属性返回X_train每个特征的方差值
selector.variances_
# 使用方差选择器去选择X_train的特征，输出为第二列[[0],[4],[1]]
X_train = selector.transform(X_train)
# 使用方差选择器去选择X_test的特征，输出为第二列[[0],[0],[1]]
X_test = selector.transform(X_test)

如上面的例子，第一步实例化一个阈值为0.5的方差选择器对象，选择器此时只知道它的阈值为0.5；
第二步对于给定的训练数据X_train进行fit操作，在fit阶段选择器selector首先计算X_train每个特征的方差，然后按照threshold进行选择，如上面的例子按照计算出来的方差的大小以及给定的threshold只有第二列（从0开始）满足条件，将这个条件保存下来，具体的保存方式可能是以mask的形式，如[False,False,True,False]，（注意此时还没有进行特征选择的行为，只是选择器将如何选择特征的方法已经保存在了选择器对象中），
第三步transform，在transform传入的数据和fit传入的数据在特征维度上相等的情况下，按照mask选择第二列作为特征选择后的结果，注意在transform的过程中不需要计算给定数据（X_train或X_test）的特征方差。
其实这里的fit和transform方法可以把它们类比为模型的train和predict过程进行理解，在这个例子中我们的特征选择器按照给定的训练数据方差进行特征选择，然后选择训练数据特征和测试数据特征，虽然测试数据中第0列和第1列的方差更大，所以在实际的处理过程中对数据的要求很高，一方面样本很多的情况下不会出现测试数据和训练数据对应特征相差过大的情况，另一方面训练数据和测试数据在采样的时候应该保证它们尽可能服从同一个分布，不会出现我们例子中的极端情况。
方差选择法根据实际的训练数据的不同和给定阈值的不同单独考虑特征的分布，更倾向于选择分布比较离散的特征，每次可以剔除不定数量的特征。

2. Univariate feature selection 单变量特征选择

单变量特征选择是在单变量统计检验的基础上选择最佳特征。它可以看做估计器的预处理阶段，sklearn将特征选择流程看做实现transform方法的对象。
单变量特征选择可以从多个方面观察每个独立特征对于目标变量的影响，清楚直观，易于理解；但忽视了特征之间可能存在的某些关系。

2.1 特征选择方法

SelectKBest removes all but the highest scoring features
SelectPercentile removes all but a user-specified highest scoring percentage of features
using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family wise error SelectFwe.
GenericUnivariateSelect allows to perform univariate feature selection with a configurable strategy. This allows to select the best univariate selection strategy with hyper-parameter search estimator.

SelectKBest按照scores保留K个特征；SelectPercentile按照scores保留指定百分比的特征；SelectFpr、SelectFdr和SelectFwe对每个特征使用通用的单变量统计检验；GenericUnivariateSelect允许使用可配置策略如超参数搜索估计器选择最佳的单变量选择策略。

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):

For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif

The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.

2.2 特征选择指标

method	describe	scenery
f_classif(X, y)	计算样本的方差分析f值	分类
mutual_info_classif(X, y)	估计离散目标变量的互信息	分类
chi2(X, y)	计算非负特征和类别之间的卡方统计数据	分类
f_regression(X, y)	计算单个特征和目标变量线性回归测试的f值	回归
mutual_info_regression(X, y)	估计连续目标变量的互信息	回归

`f-regression`

sklearn.feature_selection.f_regression(X, y, center=True)

X是特征矩阵，y是目标变量，center表示是否对X和y进行中心化
第一步，计算每个特征与目标变量的相关系数
- ( ( X [ : , i ] − m e a n ( X [ : , i ] ) ) ∗ ( y − m e a n y ) ) / ( s t d ( X [ : , i ] ) ∗ s t d ( y ) ) ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)) ((X[:,i]−mean(X[:,i]))∗(y−meany))/(std(X[:,i])∗std(y))
第二步，将相关系数转换为F score和p-value

`mutual_info_regression`

sklearn.feature_selection.mutual_info_regression(X, y, discrete_features=’auto’, n_neighbors=3, copy=True, random_state=None)

X是特征矩阵，y是目标变量，discrete_features用于指定哪些是离散的特征，n_neighbors是[1][2]中用来估计随机变量互信息的近邻数，大的n_neighbors可以降低估计器的方差，但会造成偏差
两个随机变量之间的互信息是一个非负值，度量变量之间的依赖关系。独立随机变量之间的互信息等于0，互信息越大表示依赖关系越强。
该函数依赖于[1]和[2]中描述的基于k近邻距离的熵估计的非参数方法。这两种方法都基于[3]最初提出的思想。

2.3 示例

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

3.Recursive feature elimination 递归特征消除

给定一个可以给出特征权重的外部estimator（如线性模型的系数），recursive feature elimination (RFE)就是递归地考虑越来越小的特征集合。第一步，estimator在所有原始特征上训练，通过coef_或者feature_importances_获得每个特征的重要性；第二步，最不重要的特征将会从现有的特征集中剪去，这个过程重复迭代直到特征的数目达到期望值。

RFECV实现了在RFE中使用cross-validation循环寻找最优特征数量的过程。

A recursive feature elimination example showing the relevance of pixels in a digit classification task.

A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation.

4.基于模型的特征选择

sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)

SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
example: Feature selection using SelectFromModel and LassoCV

4.1 基于L1的特征选择

4.2 基于数的特征选择

5. Feature selection as part of a pipeline

使用sklearn.pipeline.Pipeline流程化预处理和训练过程

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1"))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)