我真的认为 z 分数使用scipy.stats.zscore() https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html是去这里的路。查看相关问题这个帖子 https://stackoverflow.com/questions/50904088/deleting-the-same-outliers-in-two-timeseries/50906623#50906623。他们关注的是使用哪种方法before消除潜在的异常值。在我看来,您的挑战要简单一些,因为根据提供的数据判断,识别潜在的异常值将非常简单,而无需转换数据。下面是执行此操作的代码片段。但请记住,什么看起来像异常值,什么看起来不像异常值将完全取决于您的数据集。并且删除后some异常值,以前看起来不像异常值的东西,现在突然变得像异常值。看一看:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')
# Function to identify and remove outliers
def outliers(df, level):
# 1. temporary dataframe
df = df1.copy(deep = True)
# 2. Select a level for a Z-score to identify and remove outliers
df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
ix_keep = df_Z.index
# 3. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df.loc[ix_keep]
return(df_keep)
原始数据:
测试运行 1:Z 分数 = 4:
如您所见,由于级别设置得太高,没有删除任何数据。
测试运行 2:Z 分数 = 2:
现在我们已经取得进展了。两个异常值已被删除,但仍然存在一些可疑数据。
测试运行 3:Z 分数 = 1.2:
这看起来真的很好。现在剩下的数据似乎比以前分布得更均匀了一些。但现在原始数据点突出显示的数据点开始看起来有点像潜在的异常值。那么该停在哪里呢?这完全取决于你!
编辑:这是简单复制粘贴的全部内容:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')
# Function to identify and remove outliers
def outliers(df, level):
# 1. temporary dataframe
df = df1.copy(deep = True)
# 2. Select a level for a Z-score to identify and remove outliers
df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
ix_keep = df_Z.index
# 3. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df.loc[ix_keep]
return(df_keep)
# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level)+')')
df_clean = outliers(df = df1, level = level)
# final plot
df_clean.plot(style = 'o')