Mean (df.mean()):
$$\bar{x}=\frac{1}{n} \sum_{i=1}^{n} x_{i}$$
Variance (df.var()) and standard deviation (df.std()):
$$S^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}, \quad S=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$$
(Unlike the population variance, which divides by n, the sample variance here divides by n-1.)
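As a quick check of the mean, variance, and standard deviation formulas above, the following sketch computes them by hand and compares the results with pandas. The sample values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(s)
mean_manual = s.sum() / n
var_manual = ((s - mean_manual) ** 2).sum() / (n - 1)  # divide by n-1
std_manual = np.sqrt(var_manual)

# pandas defaults to ddof=1 (divide by n-1), matching the formulas above
assert np.isclose(mean_manual, s.mean())
assert np.isclose(var_manual, s.var())
assert np.isclose(std_manual, s.std())

# NumPy defaults to ddof=0 (divide by n), so it returns the population variance
var_population = np.var(s.to_numpy())
print(var_manual, var_population)
```

Passing ddof=1 to np.var or ddof=0 to s.var switches either library to the other convention.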
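Skewness (df.skew()): the standardized third central moment, which measures the asymmetry of the distribution. A value greater than 0 indicates a right-skewed distribution: the right tail (values far above the mean) is longer than the left.
$$s_{k}=\frac{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{3}}{s^{3}}$$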
Kurtosis (df.kurt()): the standardized fourth central moment. The formula below subtracts 3, so a normal distribution scores 0; a value greater than 0 (that is, a raw standardized fourth moment greater than 3) indicates heavy tails, meaning the sample contains relatively many values far from the mean.
$$G_{2}=\frac{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{4}}{\left(\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\right)^{2}}-3$$
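The moment-based skewness and kurtosis formulas above can be sketched directly with NumPy. This sketch assumes that the s in the skewness denominator is the square root of the second central moment (dividing by n), which is one common reading; the sample data is hypothetical. pandas' skew() and kurt() apply a small-sample bias correction, so their values differ somewhat from the plain moment formulas on short series.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: one large value stretches the right tail
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])

d = s.to_numpy() - s.mean()
m2 = np.mean(d ** 2)  # second central moment (divides by n)
m3 = np.mean(d ** 3)  # third central moment
m4 = np.mean(d ** 4)  # fourth central moment

skew_manual = m3 / m2 ** 1.5     # standardized third moment
kurt_manual = m4 / m2 ** 2 - 3   # excess kurtosis (normal distribution -> 0)

# pandas reports bias-corrected estimates; the sign agrees, the magnitude differs
print(skew_manual, s.skew())
print(kurt_manual, s.kurt())
```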
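Quantile (df.quantile(p)): for a probability 0 < p < 1, the p-quantile $Z_{p}$ of a random variable X (or of its probability distribution) is the real number that satisfies $P(X \le Z_{p}) = p$.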
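For a finite sample there is usually no value that satisfies the definition exactly, so pandas interpolates linearly between neighboring sorted values by default. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical sample, for illustration only
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

median = s.quantile(0.5)                  # 3.0
quartiles = s.quantile([0.25, 0.5, 0.75])  # 2.0, 3.0, 4.0
p10 = s.quantile(0.1)                     # falls between 1.0 and 2.0

print(median)
print(quartiles)
print(p10)
```

Passing a list of probabilities returns a Series indexed by those probabilities.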
3. Data Preprocessing
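Handling missing data
Drop the rows that contain missing values: df.dropna(axis=0, how="any", inplace=False)
Fill with the mean (col is the column name):
means = df[col].mean()
df[col] = df[col].fillna(means)
Fill with the median:
medians = df[col].median()
df[col] = df[col].fillna(medians)
Fill with the mode (mode() returns a Series, so take its first entry):
modes = df[col].mode()[0]
df[col] = df[col].fillna(modes)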
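The options above can be tried side by side on a tiny example. The column name score and its values are hypothetical. Note that fillna returns a new object, so the result must be assigned if it is to be kept, and that mode() returns a Series (there can be ties), so a single value has to be picked from it.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"score": [1.0, 2.0, np.nan, 2.0, 5.0]})

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna(axis=0, how="any")

# Options 2-4: fill the gap with a summary statistic of the observed values
filled_mean = df["score"].fillna(df["score"].mean())
filled_median = df["score"].fillna(df["score"].median())
filled_mode = df["score"].fillna(df["score"].mode()[0])  # first mode if tied

print(len(dropped))
print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```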
Data standardization: min-max normalization and mean (z-score) standardization
$$x_{i}^{\prime}=\frac{x_{i}-x_{\min }}{x_{\max }-x_{\min }}, \quad x_{i}^{\prime}=\frac{x_{i}-\bar{x}}{s}$$
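# Min-max normalization: rescale each column to [0, 1]
def max_min_std(data):
    m_max = data.max(axis=0)
    m_min = data.min(axis=0)
    data = (data - m_min) / (m_max - m_min)
    return data

# Mean (z-score) standardization: zero mean, unit standard deviation per column
def mean_std(data):
    m_mean = data.mean(axis=0)
    m_std = data.std(axis=0)
    data = (data - m_mean) / m_std
    return data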
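A short check of what the two transformations guarantee, written inline on a hypothetical DataFrame (the column names and values are invented). After min-max normalization each column spans exactly [0, 1]; after z-score standardization each column has mean 0 and standard deviation 1. A constant column would make either denominator zero, which this sketch does not handle.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column data set, for illustration only
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0],
                   "weight": [50.0, 55.0, 65.0, 80.0]})

# Min-max normalization: every column is rescaled to the interval [0, 1]
scaled = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))

# Z-score standardization: every column gets mean 0 and standard deviation 1
# (pandas' std divides by n-1 by default)
zscored = (df - df.mean(axis=0)) / df.std(axis=0)

print(scaled)
print(zscored)
```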
4. Correlation Analysis
How can we tell whether the factors are correlated with each other?
1. Pearson correlation coefficient (df.corr(method=...), where method is "pearson", "spearman", or "kendall"):
$$r=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}}$$
2. Spearman and Kendall correlation coefficients
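To make the difference between the coefficients concrete, the sketch below evaluates the Pearson formula by hand and compares it with df.corr. The data is hypothetical: y is a monotonic but nonlinear function of x, so the rank-based Spearman coefficient is 1 while Pearson, which measures linear association, is below 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["y"] = df["x"] ** 3

dx = df["x"] - df["x"].mean()
dy = df["y"] - df["y"].mean()

# Pearson r computed term by term from the formula
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = df.corr(method="pearson").loc["x", "y"]
rho_spearman = df.corr(method="spearman").loc["x", "y"]

print(r_manual, r_pandas, rho_spearman)
```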
Multiple linear regression model:
$$y=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\ldots+\beta_{p} x_{p}+\varepsilon$$
where the $\beta_{i}$ are the regression coefficients.
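from sklearn.linear_model import LinearRegression

# x: feature matrix of shape (n_samples, n_features); y: target values
linear = LinearRegression()
model = linear.fit(x, y)
print("Intercept:")
print(linear.intercept_)
print("Regression coefficients:")
print(linear.coef_)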
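scikit-learn documents LinearRegression as ordinary least squares, so the same intercept and coefficients can be cross-checked with plain NumPy. The sketch below is an illustration under stated assumptions, not the library's internal code: the data is synthetic, generated without a noise term from the hypothetical coefficients β0 = 1, β1 = 2, β2 = 3, so the fit should recover them up to floating-point error.

```python
import numpy as np

# Synthetic, noise-free data from assumed coefficients (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first fitted coefficient is the intercept
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ beta ≈ y
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = beta[0], beta[1:]

print("Intercept:", intercept)
print("Regression coefficients:", coef)
```

With real, noisy data the recovered values are estimates rather than the exact generating coefficients.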