Why does CalibratedClassifierCV underperform a direct classifier?

2023-12-20

I noticed that sklearn's new CalibratedClassifierCV seems to underperform the direct base_estimator when the base_estimator is GradientBoostingClassifier (I haven't tested other classifiers). Interestingly, if the make_classification parameters are:

n_features = 10
n_informative = 3
n_classes = 2

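then CalibratedClassifierCV seems to perform slightly better (as measured by log loss).

However, on the following classification dataset, CalibratedClassifierCV generally seems to perform worse: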

from sklearn.datasets import make_classification
from sklearn import ensemble
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn import model_selection

# Build a 9-class classification task using 30 informative features
X, y = make_classification(n_samples=1000,
                           n_features=100,
                           n_informative=30,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=9,
                           random_state=0,
                           shuffle=False)

# 5 random stratified splits; 10% of the samples are held out by default
skf = model_selection.StratifiedShuffleSplit(n_splits=5)

for train, test in skf.split(X, y):

    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]

    # Gradient boosting wrapped in 3-fold isotonic calibration
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf_cv = CalibratedClassifierCV(clf, cv=3, method='isotonic')
    clf_cv.fit(X_train, y_train)
    probas_cv = clf_cv.predict_proba(X_test)
    cv_score = log_loss(y_test, probas_cv)

    # The same classifier trained directly on the full training split
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    probas = clf.predict_proba(X_test)
    clf_score = log_loss(y_test, probas)

    print('calibrated score:', cv_score)
    print('direct clf score:', clf_score)
    print()

One run yielded:

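Maybe I'm missing something about how CalibratedClassifierCV works, or I'm not using it correctly, but I was under the impression that, if anything, passing a classifier to CalibratedClassifierCV would improve performance relative to the base_estimator alone.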

Can anyone explain this observed underperformance?

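Probability calibration itself requires cross-validation, so CalibratedClassifierCV trains one calibrated classifier per fold (in this case using StratifiedKFold) and, when predict_proba() is called, returns the mean of the probabilities predicted by each of those classifiers. This may explain the effect.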
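The averaging described above can be checked directly. The following is a minimal sketch, not part of the original experiment: it assumes a scikit-learn version where the fitted per-fold models are exposed as `calibrated_classifiers_`, and it uses a small binary dataset and `n_estimators=20` purely to keep the run short.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small binary task (sizes chosen only for speed; they are assumptions).
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_classes=2, random_state=0)

base = GradientBoostingClassifier(n_estimators=20, random_state=0)
calibrated = CalibratedClassifierCV(base, cv=3, method='isotonic')
calibrated.fit(X, y)

# One (classifier, calibrator) pair is kept per fold.
n_models = len(calibrated.calibrated_classifiers_)
print('fitted per-fold models:', n_models)

# Average the per-fold probabilities by hand and compare with predict_proba.
manual_mean = np.mean(
    [model.predict_proba(X) for model in calibrated.calibrated_classifiers_],
    axis=0)
matches = np.allclose(manual_mean, calibrated.predict_proba(X))
print('predict_proba equals the per-fold mean:', matches)
```

Each per-fold model has seen only part of the training data, which is what the hypothesis in this answer relies on.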

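My hypothesis is that if the training set is small relative to the number of features and classes, the reduced training set available to each sub-classifier hurts performance, and the ensembling does not make up for it (or even makes it worse). In addition, GradientBoostingClassifier may already provide fairly good probability estimates from the start, because its loss function is optimized for probability estimation.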
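To make the "reduced training set" concrete, here is a rough back-of-the-envelope sketch. It assumes the question's setup: 1000 samples with the default 10% test split, so about 900 training samples, and 9 roughly balanced classes. The helper name `fold_sizes` is hypothetical and only illustrates the arithmetic.

```python
def fold_sizes(n_train, n_folds, n_classes):
    """Approximate sample counts inside one calibration fold.

    Returns (samples used to fit the sub-classifier,
             samples used to fit the calibrator,
             calibration samples per class, assuming balanced classes).
    """
    n_calibrate = n_train // n_folds
    n_fit = n_train - n_calibrate
    return n_fit, n_calibrate, n_calibrate / n_classes

for k in (3, 10):
    n_fit, n_calibrate, per_class = fold_sizes(900, k, 9)
    print('cv=%d: fit on %d, calibrate on %d (about %.0f per class)'
          % (k, n_fit, n_calibrate, per_class))
```

With cv=3 each sub-classifier is fit on about 600 of the 900 training samples, while with cv=10 it is fit on about 810, at the cost of a much smaller calibration set per fold.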

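If this is correct, then an ensemble of classifiers built the same way as in CalibratedClassifierCV, but without calibration, should be worse than the single classifier. Also, the effect should disappear when a larger number of folds is used for calibration.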

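To test this, I extended your script to increase the number of folds and to include an ensemble classifier without calibration, and I was able to confirm my predictions. The 10-fold calibrated classifier always performed better than the single classifier, while the uncalibrated ensemble was significantly worse. In my runs, the 3-fold calibrated classifier also did not really perform worse than the single classifier, so this may also be an unstable effect. These are the detailed results on the same dataset: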

This is the code for my experiment:

import numpy as np
from sklearn.datasets import make_classification
from sklearn import ensemble
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn import model_selection

X, y = make_classification(n_samples=1000,
                           n_features=100,
                           n_informative=30,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=9,
                           random_state=0,
                           shuffle=False)

# 5 random stratified splits; 10% of the samples are held out by default
skf = model_selection.StratifiedShuffleSplit(n_splits=5)

for train, test in skf.split(X, y):

    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]

    # Calibrated classifier, 3 folds
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf_cv = CalibratedClassifierCV(clf, cv=3, method='isotonic')
    clf_cv.fit(X_train, y_train)
    probas_cv = clf_cv.predict_proba(X_test)
    cv_score = log_loss(y_test, probas_cv)
    print('calibrated score (3-fold):', cv_score)

    # Calibrated classifier, 10 folds
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf_cv = CalibratedClassifierCV(clf, cv=10, method='isotonic')
    clf_cv.fit(X_train, y_train)
    probas_cv = clf_cv.predict_proba(X_test)
    cv_score = log_loss(y_test, probas_cv)
    print('calibrated score (10-fold):', cv_score)

    # Train 3 classifiers on folds of the training data (not the test data)
    # and average their predicted probabilities, without any calibration
    skf2 = model_selection.StratifiedKFold(n_splits=3)
    probas_list = []
    for sub_train, _ in skf2.split(X_train, y_train):
        X_sub_train = X_train[sub_train]
        y_sub_train = y_train[sub_train]
        clf = ensemble.GradientBoostingClassifier(n_estimators=100)
        clf.fit(X_sub_train, y_sub_train)
        probas_list.append(clf.predict_proba(X_test))
    probas = np.mean(probas_list, axis=0)
    clf_ensemble_score = log_loss(y_test, probas)
    print('uncalibrated ensemble clf (3-fold) score:', clf_ensemble_score)

    # Single classifier trained on the full training split
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    probas = clf.predict_proba(X_test)
    score = log_loss(y_test, probas)
    print('direct clf score:', score)
    print()
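Since the effect may be unstable from split to split, it can help to summarize each variant by the mean and spread of its log loss over the same splits instead of reading individual runs. This is a minimal sketch under assumptions of my own: the dataset and `n_estimators` are deliberately much smaller than in the experiment above so that it runs quickly, and the helper `make_gbc` and the `candidates` dictionary are hypothetical names.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Deliberately small, hypothetical settings so the sketch runs in seconds.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

def make_gbc():
    return GradientBoostingClassifier(n_estimators=20, random_state=0)

candidates = {
    'direct': make_gbc(),
    'calibrated (3-fold)': CalibratedClassifierCV(make_gbc(), cv=3,
                                                  method='isotonic'),
}

# A fixed random_state scores every candidate on the same splits.
splits = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

summary = {}
for name, estimator in candidates.items():
    # scikit-learn reports the negated log loss, so flip the sign back.
    losses = -cross_val_score(estimator, X, y, cv=splits,
                              scoring='neg_log_loss')
    summary[name] = (losses.mean(), losses.std())
    print('%-20s log loss: %.3f +/- %.3f'
          % (name, losses.mean(), losses.std()))
```

If the standard deviations overlap heavily, a difference seen in a single run says little about which variant is better.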

python

scikitlearn