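The sklearn.decomposition module provides matrix decomposition algorithms such as PCA, NMF and ICA; most of them can be used as dimensionality reduction techniques.
① Principal component analysis: sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
Main parameters:
n_components: the number of principal components to keep; it may be an integer, a float, None or a string. If n_components is None, all components are kept; if it is an integer, it is the number of components to keep; if it is a float between 0 and 1, it is the minimum fraction of the total variance that the kept components must explain; if n_components = 'mle' and svd_solver = 'full', Minka's MLE is used to choose the number of components.
whiten: whether to rescale the projected data so that every kept component has unit variance.
svd_solver: the SVD solver to use; one of 'auto', 'full', 'arpack', 'randomized'.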
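To make the effect of whiten concrete, here is a minimal sketch. The data set X_demo is synthetic and made up purely for illustration; svd_solver='full' is chosen so that the result does not depend on which solver 'auto' would pick.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): three features on very different scales.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3)) * np.array([5.0, 1.0, 0.2])

# Without whitening, each projected component keeps its own variance.
plain = PCA(n_components=2, whiten=False, svd_solver='full').fit_transform(X_demo)
# With whitening, each projected component is rescaled to unit variance.
white = PCA(n_components=2, whiten=True, svd_solver='full').fit_transform(X_demo)

plain_var = plain.var(axis=0, ddof=1)  # sample variance per component
white_var = white.var(axis=0, ddof=1)
print(plain_var)
print(white_var)
```

The unwhitened variances should match explained_variance_ of the fitted model, while the whitened ones should all be close to 1.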
A simple example
In [1]: import numpy as np
...: import matplotlib.pyplot as plt
...: from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
...: from sklearn.datasets import make_blobs
...: # 10000 samples with 3 features, drawn around 4 cluster centers
...: X, y = make_blobs(n_samples=10000, n_features=3,
...: centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
...: cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=9)
...: fig = plt.figure()
...: ax = fig.add_subplot(projection='3d')
...: ax.view_init(elev=30, azim=20)
...: # draw on the 3D axes; plt.scatter would treat X[:, 2] as marker sizes
...: ax.scatter(X[:, 0], X[:, 1], X[:, 2], marker='o')
...: plt.show()
...:
Fitting PCA to the data:
a. n_components=None: keep all components
In [2]: from sklearn.decomposition import PCA
...: pca = PCA()
...: pca.fit(X)
...: print(pca.n_components_)
...:
3
After fitting, inspect the variance and the variance ratio explained by each of the three components
In [3]: pca.explained_variance_
Out[3]: array([ 3.78483785, 0.03272285, 0.03201892])
In [4]: pca.explained_variance_ratio_
Out[4]: array([ 0.98318212, 0.00850037, 0.00831751])
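The two attributes can be cross-checked by hand. The sketch below assumes the same make_blobs data as In [1] and relies on the fact that the explained variances are the eigenvalues of the sample covariance matrix (with the n-1 denominator), sorted in descending order; svd_solver='full' is set explicitly so the comparison is against an exact decomposition.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Same data as in the example above.
X, y = make_blobs(n_samples=10000, n_features=3,
                  centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                  cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=9)

pca = PCA(svd_solver='full').fit(X)

# Eigenvalues of the sample covariance matrix, largest first.
cov = np.cov(X, rowvar=False)                      # np.cov uses the n-1 denominator
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
ratio = eigvals / eigvals.sum()                    # share of the total variance

print(np.allclose(eigvals, pca.explained_variance_))
print(np.allclose(ratio, pca.explained_variance_ratio_))
```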
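b. n_components is an integer M: if M is smaller than the number of features of X, the M principal components with the largest variance are kept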
In [5]: from sklearn.decomposition import PCA
...: pca = PCA(n_components=2)  # keep 2 components
...: pca.fit(X)
...: print(pca.explained_variance_)
...: print(pca.explained_variance_ratio_)
...:
[ 3.78483785 0.03272285]
[ 0.98318212 0.00850037]
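Fitting only estimates the components; the actual dimensionality reduction happens in transform. The following sketch, which assumes the same make_blobs data as In [1], projects X onto two components, maps it back with inverse_transform, and measures how much information was lost.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Same data as in the example above.
X, y = make_blobs(n_samples=10000, n_features=3,
                  centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                  cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=9)

pca = PCA(n_components=2, svd_solver='full')
X_2d = pca.fit_transform(X)            # project onto the 2 leading components
X_back = pca.inverse_transform(X_2d)   # map back to the original 3-D space

# Mean squared reconstruction error per sample.
mse = np.mean(np.sum((X - X_back) ** 2, axis=1))
print(X_2d.shape, X_back.shape, mse)
```

The reconstruction error corresponds to the variance of the discarded third component, which is why dropping it costs so little on this data.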
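c. n_components is a float: the smallest number of leading components is kept such that their cumulative explained variance ratio exceeds the threshold n_components. Here the first component alone (about 0.983) already exceeds 0.006, so only one component is kept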
In [6]: pca = PCA(n_components=0.006)
...: pca.fit(X)
...: print(pca.explained_variance_)
...: print(pca.explained_variance_ratio_)
...: print(pca.n_components_)
...:
[ 3.78483785]
[ 0.98318212]
1
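The selection rule for a float threshold can be sketched in a few lines of NumPy. The helper components_for_threshold is hypothetical (it is not part of scikit-learn) and only mimics the rule described above; the ratios are the values printed in Out[4].

```python
import numpy as np

def components_for_threshold(variance_ratio, threshold):
    """Hypothetical helper: smallest k whose first k ratios sum to more than `threshold`."""
    cumulative = np.cumsum(variance_ratio)
    # With side='right', searchsorted counts the cumulative sums that are <= threshold;
    # one more component is needed to get strictly above it.
    return int(np.searchsorted(cumulative, threshold, side='right')) + 1

ratios = np.array([0.98318212, 0.00850037, 0.00831751])  # from Out[4]

k_small = components_for_threshold(ratios, 0.006)  # first component is already enough
k_mid = components_for_threshold(ratios, 0.99)     # first component alone is not enough
k_large = components_for_threshold(ratios, 0.995)  # all three components are needed
print(k_small, k_mid, k_large)
```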
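d. When n_components is 'mle', svd_solver must be 'full'. With 'arpack' or 'randomized' a ValueError is raised, and in the scikit-learn version used here the default 'auto' fails with a TypeError on this data set (more than 500 samples)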
In [7]: pca = PCA(n_components='mle',svd_solver='full')
...: pca.fit(X)
...: print(pca.explained_variance_)
...: print(pca.explained_variance_ratio_)
...: print(pca.n_components_)
...:
[ 3.78483785]
[ 0.98318212]
1
In [8]: pca = PCA(n_components='mle',svd_solver='arpack')
...: pca.fit(X)
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-b62bafac46ff> in <module>()
1 pca = PCA(n_components='mle',svd_solver='arpack')
----> 2 pca.fit(X)
d:\softwore\python\lib\site-packages\sklearn\decomposition\pca.py in fit(self, X, y)
305 Returns the instance itself.
306 """
--> 307 self._fit(X)
308 return self
309
d:\softwore\python\lib\site-packages\sklearn\decomposition\pca.py in _fit(self, X)
368 return self._fit_full(X, n_components)
369 elif svd_solver in ['arpack', 'randomized']:
--> 370 return self._fit_truncated(X, n_components, svd_solver)
371
372 def _fit_full(self, X, n_components):
d:\softwore\python\lib\site-packages\sklearn\decomposition\pca.py in _fit_truncated(self, X, n_components, svd_solver)
433 raise ValueError("n_components=%r cannot be a string "
434 "with svd_solver='%s'"
--> 435 % (n_components, svd_solver))
436 elif not 1 <= n_components <= n_features:
437 raise ValueError("n_components=%r must be between 1 and "
ValueError: n_components='mle' cannot be a string with svd_solver='arpack'
In [9]: pca = PCA(n_components='mle',svd_solver='randomized')
...: pca.fit(X)
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-1f9c5b9ac3af> in <module>()
1 pca = PCA(n_components='mle',svd_solver='randomized')
----> 2 pca.fit(X)
d:\softwore\python\lib\site-packages\sklearn\decomposition\pca.py in fit(self, X, y)
305 Returns the instance itself.
306 """
--> 307 self._fit(X)
308 return self
309
d:\softwore\python\lib\site-packages\sklearn\decomposition\pca.py in _fit(self, X)
368 return self._fit_full(X, n_components)
369 elif svd_solver in ['arpack', 'randomized']:
--> 370 return self._fit_truncated(X, n_components, svd_solver)
371
372 def _fit_full(self, X, n_components):
d:\softwore\python\lib\site-packages\sklearn\decomposition\pca.py in _fit_truncated(self, X, n_components, svd_solver)
433 raise ValueError("n_components=%r cannot be a string "
434 "with svd_solver='%s'"
--> 435 % (n_components, svd_solver))
436 elif not 1 <= n_components <= n_features:
437 raise ValueError("n_components=%r must be between 1 and "
ValueError: n_components='mle' cannot be a string with svd_solver='randomized'
In [10]: pca = PCA(n_components='mle')
...: pca.fit(X)
...:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-92060cf30409> in <module>()
1 pca = PCA(n_components='mle')
----> 2 pca.fit(X)
d:\softwore\python\lib\site-packages\sklearn\decomposition\pca.py in _fit(self, X)
358 if max(X.shape) <= 500:
359 svd_solver = 'full'
--> 360 elif n_components >= 1 and n_components < .8 * min(X.shape):
361 svd_solver = 'randomized'
362 # This is also the case of n_components in (0,1)
TypeError: unorderable types: str() >= int()
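One way to avoid these errors is to set the solver explicitly whenever the chosen n_components needs it. The wrapper below is only a sketch: make_pca is a hypothetical helper, not a scikit-learn function, and it simply forces svd_solver='full' for 'mle' and for float thresholds between 0 and 1.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

def make_pca(n_components=None, **kwargs):
    """Hypothetical helper: build a PCA and force the full solver when it is required."""
    is_fraction = isinstance(n_components, float) and 0 < n_components < 1
    if n_components == 'mle' or is_fraction:
        # Overrides any svd_solver passed by the caller for these two cases.
        kwargs['svd_solver'] = 'full'
    return PCA(n_components=n_components, **kwargs)

# Same data as in the example above.
X, y = make_blobs(n_samples=10000, n_features=3,
                  centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                  cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=9)

pca = make_pca(n_components='mle', svd_solver='arpack')  # 'arpack' is replaced by 'full'
pca.fit(X)
print(pca.svd_solver, pca.n_components_)
```

Silently overriding the caller's solver is a design choice made here for brevity; raising a clear error instead would be an equally reasonable option.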