Python Correlation Matrix Tutorial (Using Pandas)

2023-10-12

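In this blog, we will go through an important descriptive statistic of multi-variable data called the correlation matrix. We will learn how to create, plot, and manipulate correlation matrices in Python using Pandas.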

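We will cover the following topics:

Table of Contents

1 What is the correlation matrix?
- 1.1 What is the correlation coefficient?
2 Finding the correlation matrix of the given data
3 Plotting the correlation matrix
4 Interpreting the correlation matrix
5 Adding title and labels to the plot
6 Sorting the correlation matrix
7 Selecting negative correlation pairs
8 Selecting strong correlation pairs (magnitude greater than 0.5)
9 Converting a covariance matrix into the correlation matrix
10 Exporting the correlation matrix to an image
11 Conclusion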

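What is the correlation matrix?

A correlation matrix is a table that shows the 'correlations' between pairs of variables in a given dataset.

We will construct this correlation matrix by the end of this blog.

Each row and each column represents a variable, and each value in the matrix is the correlation coefficient between the variables represented by the corresponding row and column.

The correlation matrix is an important data analysis metric. It is computed to summarize data, to understand the relationships between the variables, and to make decisions accordingly.

It is also an important pre-processing step in machine learning pipelines: computing and analyzing the correlation matrix helps when dimensionality reduction is needed on high-dimensional data.

We mentioned that each cell in the correlation matrix is a 'correlation coefficient' between the two variables corresponding to the row and column of the cell.

Before we proceed, let us understand what a correlation coefficient is.

What is the correlation coefficient?

A correlation coefficient is a number that denotes the strength of the relationship between two variables.

There are several types of correlation coefficients, but the most common of them all is Pearson's coefficient, denoted by the Greek letter ρ (rho).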

It is defined as the covariance between two variables divided by the product of the standard deviations of the two variables.

Here the covariance between X and Y, COV(X, Y), is further defined as the 'expected value of the product of the deviations of X and Y from their respective means'.
The formula for covariance makes this clearer, with μ_X and μ_Y denoting the means of X and Y:

$$\mathrm{COV}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$$

So the formula for Pearson's correlation, with σ_X and σ_Y denoting the standard deviations of X and Y, becomes:

$$\rho_{X,Y} = \frac{\mathrm{COV}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \sigma_Y}$$

The value of ρ lies between -1 and +1.
Values near +1 indicate a strong positive relation between X and Y, whereas those near -1 indicate a strong negative relation between X and Y.
Values near zero indicate little or no linear relationship between X and Y.
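
To make the definition concrete, here is a minimal sketch that computes Pearson's ρ directly from the covariance and the standard deviations and compares it with the value returned by NumPy's np.corrcoef. The synthetic data, the seed, and the variable names are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

# Covariance: mean of the product of the deviations from the respective means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson's rho: covariance divided by the product of the standard deviations
# (np.std uses ddof=0 by default, which matches the np.mean used above)
rho_manual = cov_xy / (np.std(x) * np.std(y))

# Reference value computed by NumPy
rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho_manual, rho_numpy)
```

Both numbers should agree up to floating-point error, because the normalization factors cancel out in the ratio.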

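Finding the correlation matrix of the given data

Let us generate random data for two variables and then construct the correlation matrix for them.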


import numpy as np
np.random.seed(10)
# generating 10 random values for each of the two variables
X = np.random.randn(10)
Y = np.random.randn(10)
# computing the correlation matrix
C = np.corrcoef(X,Y)
print(C)

Output:

Since we compute the correlation matrix of 2 variables, its dimensions are 2 x 2.
The value 0.02 indicates that there is almost no linear relationship between the two variables. This was expected since their values were generated randomly and independently of each other.

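In this example, we used NumPy's `corrcoef` method to generate the correlation matrix.
This method is not limited to two variables: it also accepts a 2-D array in which each row is a variable. However, it returns a plain NumPy array without any variable labels, which makes larger matrices harder to read.

Hence, going ahead, we will use DataFrames to store the data and to compute the correlation matrix on them.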

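Plotting the correlation matrix

For this explanation, we will use a dataset that has more than just two features.

We will use the Breast Cancer data, a popular binary classification dataset used in introductory machine learning classes.
We will load this dataset from scikit-learn's datasets module.
It is returned in the form of NumPy arrays, but we will convert them into a Pandas DataFrame.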


from sklearn.datasets import load_breast_cancer
import pandas as pd
breast_cancer = load_breast_cancer()
data = breast_cancer.data
features = breast_cancer.feature_names
df = pd.DataFrame(data, columns = features)
print(df.shape)
print(features)

Output:

There are 30 features in the data, all of which are listed in the output above.

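Our goal now is to determine the relationship between each pair of these columns. We will do so by plotting the correlation matrix.

To keep things simple, we will use only the first six columns and plot their correlation matrix.
To plot the matrix, we will use a popular visualization library called seaborn, which is built on top of matplotlib.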


import seaborn as sns
import matplotlib.pyplot as plt
# taking all rows but only 6 columns
df_small = df.iloc[:,:6]
correlation_mat = df_small.corr()
sns.heatmap(correlation_mat, annot = True)
plt.show()

Output:

The plot shows a 6 x 6 matrix and color-fills each cell based on the correlation coefficient of the pair representing it.

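The Pandas DataFrame corr() method is used to compute the matrix. By default, it computes the Pearson correlation coefficient.
We could also use other methods, such as the Spearman coefficient or the Kendall Tau correlation coefficient, by passing an appropriate value to the parameter 'method'.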
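
As a small illustration of the 'method' parameter, the sketch below compares the Pearson and Spearman coefficients on a toy DataFrame. The data and the names toy, pearson, and spearman are hypothetical and are not part of the breast cancer example.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr()                    # default: method="pearson"
spearman = toy.corr(method="spearman")  # rank-based correlation
# toy.corr(method="kendall") is called the same way
# (to my knowledge, pandas relies on SciPy being installed for that method)

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Because Spearman's coefficient works on ranks, it reports a perfect monotonic relationship here, while Pearson's coefficient stays below 1 since the relationship is not linear.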

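We have used seaborn's heatmap() method to plot the matrix. The parameter 'annot=True' displays the value of the correlation coefficient in each cell.

Let us now understand how to interpret the plotted correlation coefficient matrix.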

Interpreting the correlation matrix

Let’s first reproduce the matrix generated in the earlier section and then discuss it.

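Keep the following points in mind with regard to correlation matrices such as the one shown above:

- Each cell in the grid represents the value of the correlation coefficient between two variables.
- The value at position (a, b) represents the correlation coefficient between the features at row a and column b. It is equal to the value at position (b, a).
- It is a square matrix: each row represents a variable, and the columns represent the same variables as the rows, hence the number of rows = the number of columns.
- It is a symmetric matrix: this makes sense because the correlation between a and b is the same as that between b and a.
- All diagonal elements are 1. Since a diagonal element represents the correlation of a variable with itself, it is always equal to 1.
- The axis ticks denote the feature each of them represents.
- A large positive value (near 1.0) indicates a strong positive correlation, i.e., if the value of one of the variables increases, the value of the other variable tends to increase as well.
- A large negative value (near -1.0) indicates a strong negative correlation, i.e., the value of one variable tends to decrease as the other increases, and vice versa.
- A value near 0 (whether positive or negative) indicates little or no linear correlation between the two variables. This alone does not imply that the variables are independent: for example, y = x² with x spread symmetrically around zero has a correlation near 0 even though y is fully determined by x.
- Each cell in the above matrix is also represented by a shade of a color. Here darker shades indicate smaller values, while brighter shades correspond to larger values (near 1). This scale is given by the color bar on the right side of the plot.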
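
The structural properties listed above (square, symmetric, unit diagonal, values between -1 and +1) can be checked numerically. The following is a minimal sketch on a randomly generated DataFrame; the data and the column names a to d are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 observations of 4 random features
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()
values = corr.to_numpy()

is_square = values.shape[0] == values.shape[1]
is_symmetric = np.allclose(values, values.T)          # value at (a, b) equals value at (b, a)
unit_diagonal = np.allclose(np.diag(values), 1.0)     # each feature vs. itself
in_range = bool(np.all(np.abs(values) <= 1.0 + 1e-12))  # small tolerance for rounding

print(is_square, is_symmetric, unit_diagonal, in_range)
```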

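Adding title and labels to the plot

We can tweak the generated correlation matrix plot, just like any other Matplotlib plot. Let us see how we can add a title to the matrix and labels to the axes.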


correlation_mat = df_small.corr()
sns.heatmap(correlation_mat, annot = True)
plt.title("Correlation matrix of Breast Cancer data")
plt.xlabel("cell nucleus features")
plt.ylabel("cell nucleus features")
plt.show()

Output:

If we want, we could also change the position of the title to bottom by specifying the y position.


correlation_mat = df_small.corr()
sns.heatmap(correlation_mat, annot = True)
plt.title("Correlation matrix of Breast Cancer data", y=-0.75)
plt.xlabel("cell nucleus features")
plt.ylabel("cell nucleus features")
plt.show()

Output:

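Sorting the correlation matrix

If the given data has a large number of features, the correlation matrix can become very big and hence difficult to interpret.

Sometimes we might want to sort the values in the matrix and see the strength of correlation between the various feature pairs in increasing or decreasing order.
Let us see how we can achieve this.

First, we will convert the given matrix into a one-dimensional Series of values.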


correlation_mat = df_small.corr()
corr_pairs = correlation_mat.unstack()
print(corr_pairs)

Output:

The unstack method on the Pandas DataFrame returns a Series with a MultiIndex. That is, each value in the Series is identified by more than one index, which in this case are the row and column labels, i.e. the feature names.
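
To show what the MultiIndex means in practice, here is a minimal sketch of looking up values in an unstacked correlation Series. The toy DataFrame and the column names a, b, and c are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 observations of 3 random features
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr = toy.corr()
pairs = corr.unstack()

# A single pair is addressed by a tuple of two labels
ab = pairs.loc[("a", "b")]

# Passing only the first label returns all pairs that start with it
a_all = pairs.loc["a"]

print(ab)
print(a_all)
```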

Let us now sort these values using the sort_values() method of the Pandas Series.


sorted_pairs = corr_pairs.sort_values(kind="quicksort")
print(sorted_pairs)

Output:

We can see each value is repeated twice in the sorted output. This is because our correlation matrix was a symmetric matrix, and each pair of features occurred twice in it.

Nonetheless, we now have the sorted correlation coefficient values of all pairs of features and can make decisions accordingly.
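
If the repeated values get in the way, one possible approach (a sketch, not the method used in this tutorial) is to keep only the cells strictly above the diagonal before flattening, so that each pair appears once and the self-correlations are dropped. The toy data and the names upper and unique_pairs are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 60 observations of 4 random features
rng = np.random.default_rng(3)
toy = pd.DataFrame(rng.normal(size=(60, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()

# Boolean mask that is True only strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep the upper triangle, flatten it, drop the masked (NaN) cells, then sort
unique_pairs = corr.where(upper).unstack().dropna().sort_values()

print(unique_pairs)
```

With 4 features this leaves 4 * 3 / 2 = 6 pairs instead of the 16 entries of the full matrix.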

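Selecting negative correlation pairs

We may want to select feature pairs whose correlation coefficient falls in a particular range of values.
Let us see how we can choose the pairs with a negative correlation from the sorted pairs we generated in the previous section.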


negative_pairs = sorted_pairs[sorted_pairs < 0]
print(negative_pairs)

Output:

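Selecting strong correlation pairs (magnitude greater than 0.5)

Let us use the same approach to choose strongly related features. That is, we will keep only those feature pairs whose correlation coefficient is greater than 0.5 or less than -0.5.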


strong_pairs = sorted_pairs[abs(sorted_pairs) > 0.5]
print(strong_pairs)

Output:
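
Note that a selection by magnitude also keeps each feature's correlation with itself, which is always 1.0. The sketch below shows one way to exclude those self-pairs by comparing the two levels of the MultiIndex. The toy data and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b and c are built from a, d is unrelated noise
rng = np.random.default_rng(4)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=200),   # strongly positively related to a
    "c": -base + 0.1 * rng.normal(size=200),  # strongly negatively related to a
    "d": rng.normal(size=200),                # unrelated to the others
})
pairs = toy.corr().unstack().sort_values()

# True wherever the two labels of a pair differ, i.e. not a self-correlation
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
not_self = first != second

strong_pairs = pairs[(pairs.abs() > 0.5) & not_self]
print(strong_pairs)
```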

Converting a covariance matrix into the correlation matrix

We have seen the relationship between the covariance and correlation between a pair of variables in the introductory sections of this blog.

Let us understand how we can compute the covariance matrix of a given data in Python and then convert it into a correlation matrix. We’ll compare it with the correlation matrix we had generated using a direct method call.

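Pandas DataFrames do have a cov() method that computes the covariance between all pairs of columns, but here we will use NumPy's cov() method so that we work with the raw array.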


cov = np.cov(df_small.T)
print(cov)

Output:

We’re passing the transpose of the matrix because the method expects a matrix in which each of the features is represented by a row rather than a column.
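
As an alternative to transposing, np.cov accepts the argument rowvar=False, which tells it that the features are in the columns. The following minimal sketch, on hypothetical random data, shows that both calls give the same matrix.

```python
import numpy as np

# Hypothetical data: 100 observations (rows) of 3 features (columns)
rng = np.random.default_rng(5)
data = rng.normal(size=(100, 3))

cov_transposed = np.cov(data.T)          # features as rows, as done above
cov_rowvar = np.cov(data, rowvar=False)  # features as columns, no transpose needed

print(cov_rowvar.shape)
print(np.allclose(cov_transposed, cov_rowvar))
```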

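So we now have our numerator.
Next, we need to compute a 6×6 matrix in which the value at position i, j is the product of the standard deviations of the features at positions i and j.

We will then divide the covariance matrix by this standard deviations matrix to compute the correlation matrix.

Let us first construct the standard deviations matrix.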


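# compute the sample standard deviation of each of the 6 features
# (ddof=1 matches the normalization that np.cov uses by default)
stds = df_small.std(axis = 0, ddof = 1).to_numpy() # shape = (6,)
# entry (i, j) is the product of the standard deviations of features i and j
stds_matrix = np.outer(stds, stds)
print("standard deviations matrix of shape:", stds_matrix.shape)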

Output:

Now that we have the covariance matrix of shape (6,6) for the 6 features, and the matrix of pairwise products of standard deviations of shape (6,6), we can divide the two and see if we get the desired correlation matrix.


new_corr = cov/stds_matrix

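We have stored the new correlation matrix (derived from the covariance matrix) in the variable new_corr.

Let us check whether we got it right by plotting the new correlation matrix and juxtaposing it with the earlier one generated directly using the Pandas method corr().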


plt.figure(figsize=(18,4))
plt.subplot(1,2,1)
sns.heatmap(correlation_mat, annot = True)
plt.title("Earlier correlation matrix (from Pandas)")
plt.xlabel("cell nucleus features")
plt.ylabel("cell nucleus features")
plt.subplot(1,2,2)
sns.heatmap(new_corr, annot = True, xticklabels = df_small.columns, yticklabels = df_small.columns)
plt.title("Newer correlation matrix (from Covariance mat)")
plt.xlabel("cell nucleus features")
plt.ylabel("cell nucleus features")
plt.show()

Output:

We can compare the two matrices and notice that they are identical.
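
A visual comparison can hide small differences, so it can help to confirm the result numerically as well. The sketch below repeats the conversion on hypothetical random data and checks it against corr() with np.allclose. It takes the standard deviations from the diagonal of the covariance matrix, which is a slightly different route from the one used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 80 observations of 4 features, with b made to depend on a
rng = np.random.default_rng(6)
toy = pd.DataFrame(rng.normal(size=(80, 4)), columns=["a", "b", "c", "d"])
toy["b"] += toy["a"]

# Covariance matrix with features in columns (np.cov uses ddof=1 by default)
cov = np.cov(toy.to_numpy(), rowvar=False)

# The diagonal of the covariance matrix holds the variances
stds = np.sqrt(np.diag(cov))

# Divide each covariance by the product of the two standard deviations
corr_from_cov = cov / np.outer(stds, stds)

print(np.allclose(corr_from_cov, toy.corr().to_numpy()))
```

Taking the standard deviations from the covariance matrix itself guarantees that numerator and denominator use the same normalization.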

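Exporting the correlation matrix to an image

Plotting the correlation matrix in a Python script is not enough. We might want to save it for later use.
We can save the generated plot as an image file on disk using the plt.savefig() method.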


correlation_mat = df_small.corr()
sns.heatmap(correlation_mat, annot = True)
plt.title("Correlation matrix of Breast Cancer data")
plt.xlabel("cell nucleus features")
plt.ylabel("cell nucleus features")
plt.savefig("breast_cancer_correlation.png")

After you run this code, you can see an image file with the name 'breast_cancer_correlation.png' in the same working directory.
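
When saving figures for reports, the savefig arguments dpi and bbox_inches="tight" are often useful for controlling the resolution and trimming the surrounding whitespace. The following is a minimal sketch under a few assumptions: it uses a small hand-written matrix, plain Matplotlib instead of seaborn, and a hypothetical file name in the system's temporary directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 3 x 3 correlation matrix, written by hand for illustration
corr = np.array([
    [1.0, 0.8, -0.3],
    [0.8, 1.0, 0.1],
    [-0.3, 0.1, 1.0],
])

fig, ax = plt.subplots()
image = ax.imshow(corr, vmin=-1, vmax=1)  # fix the color scale to the range of rho
fig.colorbar(image, ax=ax)
ax.set_title("Correlation matrix (toy data)")

# Hypothetical output path in the temporary directory
out_path = os.path.join(tempfile.gettempdir(), "toy_correlation.png")
fig.savefig(out_path, dpi=200, bbox_inches="tight")
plt.close(fig)

print(os.path.exists(out_path))
```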

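Conclusion

In this tutorial, we learned what a correlation matrix is and how to generate one in Python. We began by focusing on the concepts of the correlation matrix and the correlation coefficient.

Then we generated the correlation matrix as a NumPy array and then as a Pandas DataFrame. Next, we learned how to plot the correlation matrix and manipulate the plot labels, title, etc. We also discussed the various properties used for interpreting the output correlation matrix.

We also saw how we could perform certain operations on the correlation matrix, such as sorting the matrix, finding negatively correlated pairs, finding strongly correlated pairs, etc.

Then we discussed how we could take the covariance matrix of the data and generate the correlation matrix from it by dividing it by the product of the standard deviations of the individual features.
Finally, we saw how we could save the generated plot as an image file.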
