As a beginner in both machine learning and Python, it took me a long time to work out how the training dataset and the testing dataset feed into the formula above (at first I computed $P(C\vert F)$ directly from the training dataset, which was painful). For $P(C\vert F)$, what we actually need from the training dataset are the parameters of the probability distribution involved. A normal distribution is used here, and it has two parameters, $\sigma$ and $\mu$, the standard deviation and the mean; since Python can compute the variance directly, we can work with the variance $\sigma^2$ instead. One more probability has to be obtained: $P(C)$. This one is relatively simple: count how many times each class appears in the training dataset and divide by the total number of training samples. In summary, the parameters we need from the training phase are $\sigma^2$, $\mu$, and $P(C)$.
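As a quick illustration, all three training-phase quantities can be read off a toy training set with plain numpy (a minimal sketch; the arrays below are made up):

import numpy as np

# hypothetical toy training set: two features, two classes
Xtrain = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [2.9, 3.9]])
ytrain = np.array([0, 0, 1, 1])

for c in range(2):
    rows = Xtrain[ytrain == c]
    print(c, rows.mean(axis=0),        # mu: per-feature mean for class c
          rows.var(axis=0),            # sigma^2: per-feature variance
          len(rows) / len(ytrain))     # P(C=c): the prior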
3. Predict Phase
By the naive conditional-independence assumption, the joint likelihood factorizes as

$$P(F\vert C)=P(F_1\vert C)\times P(F_2\vert C)\times P(F_3\vert C)\times\cdots\times P(F_n\vert C)=\prod_{i=1}^{n}P(F_i\vert C)$$
Using the properties of logarithms, the product turns into a sum:

$$\log P(F\vert C)=\log\left[P(F_1\vert C)\times P(F_2\vert C)\times P(F_3\vert C)\times\cdots\times P(F_n\vert C)\right]=\sum_{i=1}^{n}\log P(F_i\vert C)$$
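The switch to logs is not just cosmetic: multiplying many small densities underflows to zero in floating point, while the log-sum stays finite. A minimal sketch (the values are made up):

import numpy as np

probs = np.full(1000, 1e-5)   # 1000 small per-feature likelihoods
print(np.prod(probs))         # 0.0 -- the raw product underflows
print(np.log(probs).sum())    # approx. -11512.9 -- the log-sum stays usable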
Next, adding the log prior gives the score we need, the log of the numerator of Bayes' rule:

$$\log P(C)+\sum_{i=1}^{n}\log P(F_i\vert C)$$

($P(F)$ is a constant relative to the numerator, so it does not need to be computed.) The class $C$ with the largest score is the class our NBC assigns.
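In log space the decision rule is then a single argmax per test row; a minimal sketch with made-up numbers:

import numpy as np

log_prior = np.log(np.array([0.3, 0.5, 0.2]))  # log P(C) for three classes
log_like = np.array([-12.4, -9.8, -15.1])      # sum of log P(F_i|C) for one row
print(np.argmax(log_prior + log_like))         # -> 1, the predicted class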
4. Code Walkthrough
Since I have only just picked up Python, some of this code could probably be written more concisely; corrections from more experienced readers are welcome.
The full NBC class is as follows:
import numpy as np
from numpy import sqrt, exp, pi, log


class NBC:
    def __init__(self, feature_types, num_classes, landa=1e-6):
        """
        Args:
            feature_types:
            num_classes:
            landa: avoid the scenario of log(0), default 1e-6
        """
        self.feature_types = feature_types
        self.num_classes = num_classes
        self.landa = landa
        self.avg = None
        self.var = None
        self.prior = None

    def fit(self, Xtrain, ytrain):
        """
        Xtrain holds the four features, ytrain is the label of every row;
        we estimate the constants (average, variance, prior probability)
        needed to predict on the test dataset.
        Args:
            Xtrain:
            ytrain:
        """
        self.prior = self.get_y_pri(ytrain)
        # the per-class average of each of the four features
        self.avg = self.get_x_avg(Xtrain, ytrain)
        # the per-class variance of each feature; var = std ** 2
        self.var = self.get_x_var(Xtrain, ytrain)

    def predict_prob(self, Xtest):
        """
        Calculate the score of every row in the test dataset
        so the closest label of that row can be chosen.
        Args:
            Xtest:
        Returns:
            array
        """
        # apply_along_axis slices Xtest into rows so the likelihood is easier to compute
        likelihood = np.apply_along_axis(self.get_likelihood, axis=1, arr=Xtest)
        # combine in log space: log P(C) + sum_i log P(F_i|C)
        return np.log(self.prior) + likelihood

    def predict(self, Xtest):
        """
        Choose the label with the largest probability for each row; return the label array.
        Args:
            Xtest:
        Returns:
            array
        """
        return np.apply_along_axis(self.get_prediction_label, axis=1,
                                   arr=self.predict_prob(Xtest))

    def get_prediction_label(self, prob_row):
        """
        Get the label corresponding to the largest probability of a row.
        Args:
            prob_row:
        Returns:
            int label
        """
        return np.argmax(prob_row)

    def get_count(self, ytrain, c):
        """
        Get the total number of occurrences of label c in the train dataset.
        Args:
            ytrain:
            c: class label
        Returns:
            int count
        """
        count = 0
        for y in ytrain:
            if y == c:
                count += 1
        return count

    def get_y_pri(self, ytrain):
        """
        Get the prior probability of every label.
        Args:
            ytrain:
        Returns:
            array
        """
        ytrain_len = len(ytrain)
        res = []
        for y in range(self.num_classes):
            pri_p = self.get_count(ytrain, y) / ytrain_len
            res.append(pri_p)
        return np.array(res)

    def get_x_var(self, Xtrain, ytrain):
        """
        Get the per-class variance of every feature in the train dataset;
        the result is needed for predicting on the test dataset.
        Args:
            Xtrain:
            ytrain:
        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].var(axis=0))
        return np.array(res)

    def get_likelihood(self, label_row):
        """
        Get the likelihood of one row of the test dataset under each class;
        the landa parameter is added manually because the Gaussian density
        may otherwise evaluate to exactly zero.
        Args:
            label_row:
        Returns:
            array
        """
        # the landa parameter is very important: it keeps log() away from zero
        gauss_dis = (1 / sqrt(2 * pi * self.var)
                     * exp(-1 * (label_row - self.avg) ** 2 / (2 * self.var))) + self.landa
        # log(abc) = log a + log b + log c
        return log(gauss_dis).sum(axis=1)

    def get_x_avg(self, Xtrain, ytrain):
        """
        Get the per-class average of every feature in the train dataset;
        the result is needed for predicting on the test dataset.
        Args:
            Xtrain:
            ytrain:
        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].mean(axis=0))
        return np.array(res)
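The docstrings mention four features and three labels, which suggests the iris dataset; here is a hedged usage sketch under that assumption (sklearn's loader with my own split settings, and the feature_types value is a placeholder since the class never actually reads it):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

nbc = NBC(feature_types=['r'] * 4, num_classes=3)  # feature_types value is a guess
nbc.fit(Xtrain, ytrain)
print("accuracy:", np.mean(nbc.predict(Xtest) == ytest))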
Matching the parameters summarized in Section 2: avg is $\mu$, var is $\sigma^2$, and prior is $P(C)$. There is also one extra parameter with a default value, landa, which guards against $\log 0$.
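To see why landa matters: a test value far from a class mean drives the Gaussian density to underflow to exactly 0.0, and log(0) is -inf; adding a tiny constant keeps the logarithm finite. A quick check:

import numpy as np

var, avg, x = 1.0, 0.0, 60.0  # a test value 60 standard deviations from the mean
dens = 1 / np.sqrt(2 * np.pi * var) * np.exp(-(x - avg) ** 2 / (2 * var))
print(dens)                   # 0.0 -- the density underflows
print(np.log(dens + 1e-6))    # approx. -13.8 -- landa keeps the log finite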
Training Phase
def fit(self, Xtrain, ytrain):
    """
    Xtrain holds the four features, ytrain is the label of every row;
    we estimate the constants (average, variance, prior probability)
    needed to predict on the test dataset.
    Args:
        Xtrain:
        ytrain:
    """
    self.prior = self.get_y_pri(ytrain)
    # the per-class average of each of the four features
    self.avg = self.get_x_avg(Xtrain, ytrain)
    # the per-class variance of each feature; var = std ** 2
    self.var = self.get_x_var(Xtrain, ytrain)
def get_y_pri(self, ytrain):
    """
    Get the prior probability of every label.
    Args:
        ytrain:
    Returns:
        array
    """
    ytrain_len = len(ytrain)
    res = []
    for y in range(self.num_classes):
        pri_p = self.get_count(ytrain, y) / ytrain_len
        res.append(pri_p)
    return np.array(res)

def get_count(self, ytrain, c):
    """
    Get the total number of occurrences of label c in the train dataset.
    Args:
        ytrain:
        c: class label
    Returns:
        int count
    """
    count = 0
    for y in ytrain:
        if y == c:
            count += 1
    return count
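As an aside, get_count loops in pure Python; for integer labels 0..num_classes-1 the same priors come out of a single numpy call. This is only a stylistic alternative, not what the code above does:

import numpy as np

def get_y_pri_vectorized(ytrain, num_classes):
    # np.bincount tallies each integer label in one pass
    return np.bincount(ytrain, minlength=num_classes) / len(ytrain)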
Getting the mean $\mu$:
def get_x_avg(self, Xtrain, ytrain):
    """
    Get the per-class average of every feature in the train dataset;
    the result is needed for predicting on the test dataset.
    Args:
        Xtrain:
        ytrain:
    Returns:
        array
    """
    res = []
    for i in range(self.num_classes):
        res.append(Xtrain[ytrain == i].mean(axis=0))
    return np.array(res)
Getting the variance $\sigma^2$:
def get_x_var(self, Xtrain, ytrain):
    """
    Get the per-class variance of every feature in the train dataset;
    the result is needed for predicting on the test dataset.
    Args:
        Xtrain:
        ytrain:
    Returns:
        array
    """
    res = []
    for i in range(self.num_classes):
        res.append(Xtrain[ytrain == i].var(axis=0))
    return np.array(res)
That completes the training of the NBC; now on to the exciting prediction phase.
Predicting Phase
def predict(self, Xtest):
    """
    Choose the label with the largest probability for each row; return the label array.
    Args:
        Xtest:
    Returns:
        array
    """
    return np.apply_along_axis(self.get_prediction_label, axis=1,
                               arr=self.predict_prob(Xtest))

def predict_prob(self, Xtest):
    """
    Calculate the score of every row in the test dataset
    so the closest label of that row can be chosen.
    Args:
        Xtest:
    Returns:
        array
    """
    # apply_along_axis slices Xtest into rows so the likelihood is easier to compute
    likelihood = np.apply_along_axis(self.get_likelihood, axis=1, arr=Xtest)
    # combine in log space: log P(C) + sum_i log P(F_i|C)
    return np.log(self.prior) + likelihood
The predict_prob function computes the score from Section 3, i.e. the prior combined with the likelihood $P(F_i\vert C)$, added in log space. The testing dataset is sliced row by row, and get_likelihood is called on each row. The get_likelihood function is as follows:
def get_likelihood(self, label_row):
    """
    Get the likelihood of one row of the test dataset under each class;
    the landa parameter is added manually because the Gaussian density
    may otherwise evaluate to exactly zero.
    Args:
        label_row:
    Returns:
        array
    """
    # the landa parameter is very important: it keeps log() away from zero
    gauss_dis = (1 / sqrt(2 * pi * self.var)
                 * exp(-1 * (label_row - self.avg) ** 2 / (2 * self.var))) + self.landa
    # log(abc) = log a + log b + log c
    return log(gauss_dis).sum(axis=1)
This implements the normal distribution density

$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

whose values are log-transformed and summed. After predict_prob returns, the argmax function picks the class with the largest score:
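As a sanity check, the hand-written density can be compared against scipy's norm.pdf (assuming scipy is available; the numbers are arbitrary):

import numpy as np
from scipy.stats import norm

x, mu, var = 5.1, 5.0, 0.12   # arbitrary sample value and parameters
manual = 1 / np.sqrt(2 * np.pi * var) * np.exp(-(x - mu) ** 2 / (2 * var))
print(manual, norm.pdf(x, loc=mu, scale=np.sqrt(var)))  # the two values agree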
def get_prediction_label(self, prob_row):
    """
    Get the label corresponding to the largest probability of a row.
    Args:
        prob_row:
    Returns:
        int label
    """
    return np.argmax(prob_row)