First, let me put down some general guidelines:
- In practice, if you have many features and fewer samples, the Mahalanobis algorithm tends to give misleading results (you can try it for yourself), so the more features you have, the more samples you should provide.
- The covariance matrix must be symmetric https://en.wikipedia.org/wiki/Symmetric_matrix and positive definite https://en.wikipedia.org/wiki/Positive-definite_matrix for the algorithm to work, so you should check for that before proceeding.
As already mentioned, the Euclidean metric https://en.wikipedia.org/wiki/Euclidean_distance fails to find the correct distance because it tries to get an ordinary straight-line distance. Thus, in a multi-dimensional space of variables, two points may look to have the same distance from the mean, yet one of them is far away from the data cloud (i.e. it's an outlier).
The solution is the Mahalanobis distance https://en.wikipedia.org/wiki/Mahalanobis_distance, which does something like feature scaling by taking the eigenvectors https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors of the variables instead of the original axes.
It applies the following formula:

D(x) = √((x − m)ᵀ · S⁻¹ · (x − m))

where:
- x is the observation to find its distance;
- m is the mean of the observations;
- S is the covariance matrix https://en.wikipedia.org/wiki/Covariance_matrix.
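To make the Euclidean-vs-Mahalanobis point concrete, here is a minimal numeric sketch (the covariance matrix S below is made up purely for illustration): two points at the same straight-line distance from the mean get very different Mahalanobis distances once the correlation between the features is taken into account.

import numpy as np

# made-up covariance matrix of two strongly correlated features
S = np.array([[1.0, 0.9],
              [0.9, 1.0]])
S_inv = np.linalg.inv(S)
m = np.zeros(2)  # assume the mean is at the origin

a = np.array([2.0, 2.0])   # lies along the data cloud
b = np.array([2.0, -2.0])  # lies across it (an outlier)

for x in (a, b):
    euclidean = np.linalg.norm(x - m)
    mahalanobis = np.sqrt((x - m).dot(S_inv).dot(x - m))
    print(euclidean, mahalanobis)
# both points are ~2.83 away in Euclidean terms,
# but ~2.05 (a) vs ~8.94 (b) in Mahalanobis terms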
A refresher:
The covariance https://en.wikipedia.org/wiki/Covariance represents the direction of the relationship between two variables (i.e. positive, negative, or zero), so it shows the strength of how one variable is related to the changes of the others.
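As a tiny illustration of that sign, consider np.cov on two made-up vectors:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = 2 * x + 1   # moves with x
y_neg = -2 * x + 1  # moves against x

print(np.cov(x, y_pos)[0, 1])  #  5.0 -> positive covariance
print(np.cov(x, y_neg)[0, 1])  # -5.0 -> negative covariance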
Implementation
Consider this 6x3 dataset, in which each row represents a sample and each column represents a feature of the given sample:
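(The table in the original post was an image; a stand-in array of the same 6x3 shape might look like the following, where the last row is deliberately extreme.)

import numpy as np

data = np.array([[ 7, 10,  9],
                 [ 8,  9, 11],
                 [ 9, 12, 10],
                 [10,  9,  8],
                 [11, 11, 10],
                 [95, 98, 97]])  # hypothetical outlier row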
First, we need to create a covariance matrix of the features; that's why we set the parameter rowvar to False in the numpy.cov https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.cov.html function, so that each column now represents a variable:
covariance_matrix = np.cov(data, rowvar=False)
# data here is the 2D array shown above
Next, we find the inverse of the covariance matrix:
inv_covariance_matrix = np.linalg.inv(covariance_matrix)
But before proceeding, we should check, as mentioned above, whether the matrix and its inverse are symmetric and positive definite. We use for that the Cholesky decomposition https://en.wikipedia.org/wiki/Cholesky_decomposition algorithm, which, fortunately, is already implemented in numpy.linalg.cholesky https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.cholesky.html:
def is_pos_def(A):
    # symmetric check first
    if np.allclose(A, A.T):
        try:
            # Cholesky succeeds only for positive-definite matrices
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False
    else:
        return False
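For example, applying the check to the matrices from above:

print(is_pos_def(covariance_matrix))      # expect: True
print(is_pos_def(inv_covariance_matrix))  # expect: True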
Then we find the mean m of the variables on each feature (I should say dimension) and save them in an array like this:
vars_mean = []
for i in range(data.shape[0]):
    vars_mean.append(list(data.mean(axis=0)))
# axis=0 means each column in the 2D array
Note that I repeated each row just to take advantage of matrix subtraction, as shown below.
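As a side note, NumPy broadcasting would let you skip the repetition entirely; a minimal alternative sketch:

vars_mean = data.mean(axis=0)  # 1-D vector of shape (number_of_features,)
# data - vars_mean then broadcasts the mean across every row automatically

The rest of the steps work unchanged either way.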
Next, we find x - m (i.e. the differential), but since we already have the vectorized vars_mean, all we need to do is:
diff = data - vars_mean
# here we subtract the mean of each feature
# from the corresponding feature of each example
Finally, we apply the formula like this:
md = []
for i in range(len(diff)):
    md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))
Note the following:
- The dimensions of the inverse of the covariance matrix are: number_of_features x number_of_features.
- The dimensions of the diff matrix are the same as those of the original data matrix: number_of_examples x number_of_features.
- Thus, each diff[i] (i.e. row) is 1 x number_of_features.
- So, according to the matrix multiplication rules, the matrix resulting from diff[i].dot(inv_covariance_matrix) will be 1 x number_of_features; and when we multiply it again by diff[i], numpy automatically considers the latter as a column matrix (i.e. number_of_features x 1), so the final result becomes a single value (i.e. no need for transposing).
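Given the shapes described above, the whole loop can also be collapsed into one vectorized expression; a sketch using np.einsum, which computes the same row-by-row squared distances:

md = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_covariance_matrix, diff))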
In order to detect outliers, we should specify a threshold; since the square of the Mahalanobis distance follows a chi-square distribution with degrees of freedom = number of features in the dataset (assuming the data is roughly multivariate normal), we can choose a threshold of, say, 0.1, and then use the chi2.cdf https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html method from Scipy, like this:
1 - chi2.cdf(square_of_mahalanobis_distances, degree_of_freedom)
So any point whose (1 - chi-square CDF) value is less than or equal to the threshold can be classified as an outlier.
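A minimal sketch of that classification step (assuming md holds the distances computed above):

import numpy as np
from scipy.stats import chi2

threshold = 0.1                    # the significance level chosen above
degree_of_freedom = data.shape[1]  # number of features

p_values = 1 - chi2.cdf(np.square(md), degree_of_freedom)
outlier_indices = np.where(p_values <= threshold)[0]
print(outlier_indices)  # row indices classified as outliers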
Putting everything together
import numpy as np


def create_data(examples=50, features=5, upper_bound=10, outliers_fraction=0.1, extreme=False):
    '''
    This method is for testing (i.e. to generate a 2D array of data),
    where the last outliers_fraction of the examples are outliers.
    '''
    data = []
    magnitude = 4 if extreme else 3
    for i in range(examples):
        if (examples - i) <= round((float(examples) * outliers_fraction)):
            # outlier rows: sampled around upper_bound ** magnitude
            data.append(np.random.poisson(upper_bound ** magnitude, features).tolist())
        else:
            data.append(np.random.poisson(upper_bound, features).tolist())
    return np.array(data)


def MahalanobisDist(data, verbose=False):
    covariance_matrix = np.cov(data, rowvar=False)
    if is_pos_def(covariance_matrix):
        inv_covariance_matrix = np.linalg.inv(covariance_matrix)
        if is_pos_def(inv_covariance_matrix):
            # mean of each feature, repeated per row for matrix subtraction
            vars_mean = []
            for i in range(data.shape[0]):
                vars_mean.append(list(data.mean(axis=0)))
            diff = data - vars_mean
            md = []
            for i in range(len(diff)):
                md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))

            if verbose:
                print("Covariance Matrix:\n {}\n".format(covariance_matrix))
                print("Inverse of Covariance Matrix:\n {}\n".format(inv_covariance_matrix))
                print("Variables Mean Vector:\n {}\n".format(vars_mean))
                print("Variables - Variables Mean Vector:\n {}\n".format(diff))
                print("Mahalanobis Distance:\n {}\n".format(md))
            return md
        else:
            print("Error: Inverse of Covariance Matrix is not positive definite!")
    else:
        print("Error: Covariance Matrix is not positive definite!")


def is_pos_def(A):
    # symmetric and positive-definite check via Cholesky decomposition
    if np.allclose(A, A.T):
        try:
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False
    else:
        return False
data = create_data(15, 3, 10, 0.1)
print("data:\n {}\n".format(data))
MahalanobisDist(data, verbose=True)
Result
data:
[[ 12 7 9]
[ 9 16 7]
[ 14 11 10]
[ 14 5 5]
[ 12 8 7]
[ 8 8 10]
[ 9 14 8]
[ 12 12 10]
[ 18 10 6]
[ 6 12 11]
[ 4 12 15]
[ 5 13 10]
[ 8 9 8]
[106 116 97]
[ 90 116 114]]
Covariance Matrix:
[[ 980.17142857 1143.62857143 1035.6 ]
[1143.62857143 1385.11428571 1263.12857143]
[1035.6 1263.12857143 1170.74285714]]
Inverse of Covariance Matrix:
[[ 0.03021777 -0.03563241 0.0117146 ]
[-0.03563241 0.08684092 -0.06217448]
[ 0.0117146 -0.06217448 0.05757261]]
Variables Mean Vector:
[[21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8]]
Variables - Variables Mean Vector:
[[ -9.8 -17.6 -12.8]
[-12.8 -8.6 -14.8]
[ -7.8 -13.6 -11.8]
[ -7.8 -19.6 -16.8]
[ -9.8 -16.6 -14.8]
[-13.8 -16.6 -11.8]
[-12.8 -10.6 -13.8]
[ -9.8 -12.6 -11.8]
[ -3.8 -14.6 -15.8]
[-15.8 -12.6 -10.8]
[-17.8 -12.6 -6.8]
[-16.8 -11.6 -11.8]
[-13.8 -15.6 -13.8]
[ 84.2 91.4 75.2]
[ 68.2 91.4 92.2]]
Mahalanobis Distance:
[1.3669401667524865, 2.1796331318432967, 0.7470525416547134, 1.6364973119931507, 0.8351423113609481, 0.9128858131134882, 1.397144258271586, 0.35603382066414996, 1.4449501739129382, 0.9668775289588046, 1.490503433100514, 1.4021488309805878, 0.4500345257064412, 3.239353067840299, 3.260149280200771]