I just want to know whether this is a legitimate way to compute classification accuracy:
- get the precision-recall thresholds
- for each threshold, binarize the continuous y_scores
- compute the accuracy for each from the contingency table (confusion matrix)
- return the average accuracy over the thresholds
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.preprocessing import binarize

# note: precision_recall_curve returns (precision, recall, thresholds), in that order
precision, recall, thresholds = precision_recall_curve(np.array(np_y_true), np.array(np_y_scores))
accuracy = 0
for threshold in thresholds:
    # binarize expects a 2D array, hence the reshape and the trailing [0]
    contingency_table = confusion_matrix(np_y_true, binarize(np.array(np_y_scores).reshape(1, -1), threshold=threshold)[0])
    accuracy += (float(contingency_table[0][0]) + float(contingency_table[1][1])) / float(np.sum(contingency_table))
print("Classification accuracy is: {}".format(accuracy / len(thresholds)))
You are heading in the right direction.
The confusion matrix is definitely the right starting point for computing a classifier's accuracy. In my opinion, what you are really aiming at is the receiver operating characteristic.
In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The AUC (area under the curve) is a measure of classifier performance. More information and explanations can be found here:
https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it
http://mlwiki.org/index.php/ROC_Analysis
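scikit-learn also ships this threshold sweep out of the box; a minimal sketch with made-up labels and scores (assuming the positive class is labelled 1) would be:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# made-up data, purely for illustration
y_true = np.array([1, 1, -1, 1, -1, -1, 1, -1])
y_scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.35, 0.3, 0.1])

# fpr/tpr for every distinct threshold, plus the area under the resulting curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC: {}".format(roc_auc_score(y_true, y_scores)))
plt.plot(fpr, tpr)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.show()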
Here is my implementation, which you are welcome to improve on or comment on:
import numpy as np
import matplotlib.pyplot as plt

def auc(y_true, y_val, plot=False):
    # check input
    if len(y_true) != len(y_val):
        raise ValueError('Label vector (y_true) and corresponding value vector (y_val) must have the same length.\n')
    # work on numpy arrays so the elementwise comparison below behaves
    y_true = np.asarray(y_true)
    y_val = np.asarray(y_val)
    # empty arrays, true positive and false positive numbers
    tp = []
    fp = []
    # count 1's and -1's in y_true
    cond_positive = list(y_true).count(1)
    cond_negative = list(y_true).count(-1)
    # all possibly relevant bias parameters stored in a list
    bias_set = sorted(list(set(y_val)), key=float, reverse=True)
    bias_set.append(min(bias_set)*0.9)
    # initialize y_pred array full of negative predictions (-1)
    y_pred = np.ones(len(y_true))*(-1)
    # the computation time is mainly influenced by this for loop
    # for a contamination rate of 1% it already takes ~8s to terminate
    for bias in bias_set:
        # "lower values tend to correspond to label -1"
        # indices of values which exceed the bias
        posIdx = np.where(y_val > bias)
        # set predicted values to 1
        y_pred[posIdx] = 1
        # the following expression simply calculates results which enable a distinction
        # between the cases of true positive (3) and false positive (1)
        results = np.asarray(y_true) + 2*np.asarray(y_pred)
        # append the amount of tp's and fp's
        tp.append(float(list(results).count(3)))
        fp.append(float(list(results).count(1)))
    # calculate true positive / false positive rates
    tpr = np.asarray(tp)/cond_positive
    fpr = np.asarray(fp)/cond_negative
    # optional scatterplot
    if plot:
        plt.scatter(fpr, tpr)
        plt.show()
    # calculate AUC by trapezoidal integration
    AUC = np.trapz(tpr, fpr)
    return AUC
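A quick way to sanity-check the implementation (labels must be in {-1, 1}, as the function assumes) is to compare it against sklearn.metrics.roc_auc_score on made-up data; with no tied scores the two values should agree:

from sklearn.metrics import roc_auc_score

# made-up data, purely for illustration
y_true = [1, -1, 1, 1, -1, -1, 1, -1]
y_val = [0.8, 0.65, 0.9, 0.6, 0.4, 0.1, 0.7, 0.2]

print(auc(y_true, y_val))              # implementation above
print(roc_auc_score(y_true, y_val))    # library reference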