Machine Learning with sklearn: Ensemble Learning (Part 3)

2023-11-03

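Random Forest

Ensemble learning methods generally fall into three families: boosting, bagging, and stacking. Random forest belongs to the bagging family. Its defining trait is that it fits many weak learners that do not depend on one another, so they can be trained in parallel. The final decision rule is simple: majority voting for classification and averaging for regression.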
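The two aggregation rules can be sketched in a few lines of plain Python. This is only an illustration: the helper names `majority_vote` and `average` and the toy predictions are made up here and are not part of scikit-learn.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Return the mean of the per-tree regression outputs."""
    return sum(predictions) / len(predictions)


# Hypothetical outputs of five classification trees and three regression trees
tree_votes = [1, 0, 1, 1, 0]
tree_values = [2.0, 3.0, 4.0]

print(majority_vote(tree_votes))  # 1
print(average(tree_values))       # 3.0
```

Note that scikit-learn's RandomForestClassifier averages the class probabilities predicted by the trees instead of counting hard votes, so the voting helper above shows the textbook rule, not the library's internals.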

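Each tree in a random forest is built as follows:

1. Draw one sample at random from the N training samples and repeat this N times (sampling with replacement, so the same sample can be drawn more than once).
2. Randomly select m of the M features and look for the best split only among those m features. An ordinary decision tree, by contrast, searches all M features for the split with the largest information gain (or the best Gini score).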
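The two random draws can be sketched with the standard library alone. The function names, the seed, and the sizes (10 samples, 9 features, m = 3) are assumptions chosen for illustration; a real implementation would go on to grow a tree on the selected rows and columns.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (duplicates are expected)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of n_features."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
rows = bootstrap_indices(10, rng)       # step 1: N draws from N samples
cols = random_feature_subset(9, 3, rng) # step 2: m of the M features

print("bootstrap rows:", rows)
print("candidate features:", cols)
```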

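Note that random forest and GBDT sample differently: GBDT subsamples without replacement, while random forest samples with replacement.
Sampling with replacement ensures that the training sets of different trees overlap; otherwise each tree would be trained on a completely different subset of the data and could end up "biased".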
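One consequence of sampling with replacement is that each tree sees only part of the data: the expected share of distinct samples in a bootstrap draw of size N is 1 - (1 - 1/N)^N, which approaches 1 - 1/e (about 63.2%) for large N. The rest are "out-of-bag" for that tree. The following sketch checks this by simulation; the sample size of 15000 is an assumption that matches a 75% training split of 20,000 rows.

```python
import random


def in_bag_fraction(n_samples, rng):
    """Fraction of distinct samples that appear in one bootstrap draw."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples


n = 15000                                # assumed training-set size
rng = random.Random(42)
simulated = in_bag_fraction(n, rng)
expected = 1 - (1 - 1 / n) ** n          # closed-form expectation

print(f"simulated in-bag fraction: {simulated:.3f}")
print(f"expected in-bag fraction:  {expected:.3f}")
```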

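Advantages of random forest:

1. Runs efficiently on large datasets and achieves high accuracy
2. Handles high-dimensional data without dimensionality reduction
3. Random sampling lowers the variance of the trained model, so it generalizes well
4. Can estimate how important each feature is for the classification task
5. Relatively insensitive to missing values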
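As a minimal sketch of point 4, a fitted scikit-learn forest exposes impurity-based importances through its `feature_importances_` attribute. The data below is synthetic (generated with `make_classification`, where only the first two columns are informative) and is an assumption for illustration, not the loan dataset used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 2 informative features are columns 0 and 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; print them from most to least important
importances = rf_demo.feature_importances_
for idx in sorted(range(len(importances)), key=lambda i: importances[i], reverse=True):
    print(f"feature {idx}: {importances[idx]:.3f}")
```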

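Continuing from the previous post, this post validates the algorithms on a dataset of 20,000 samples and compares CART, GBDT, and random forest.

Loading the data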

import pandas as pd
df = pd.read_csv("./train_modified.csv")
df.head()

	Existing_EMI	ID	Loan_Amount_Applied	Loan_Tenure_Applied	Monthly_Income	Var4	Var5	Age	EMI_Loan_Submitted_Missing	...	Var2_6	Mobile_Verified_0	Mobile_Verified_1	Source_0	Source_2
0	0.0	ID000002C20	300000	5	20000	1	0	37	1	...	1	1	0	1	0
1	0.0	ID000004E40	200000	2	35000	3	13	30	0	...	1	0	1	1	0
2	0.0	ID000007H20	600000	4	22500	1	0	34	1	...	0	0	1	0	1
3	0.0	ID000008I30	1000000	5	35000	3	10	28	1	...	0	0	1	0	1
4	25000.0	ID000009J40	500000	2	100000	3	17	31	1	...	0	0	1	0	1

5 rows × 51 columns

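Here the Disbursed column is the classification target, which makes this a binary classification problem. The class counts show that the classes are heavily imbalanced, so accuracy alone is misleading and the classifier should be evaluated with AUC.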

df["Disbursed"].value_counts()

0    19680
1      320
Name: Disbursed, dtype: int64
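To see why accuracy is a weak metric here, consider a trivial model that always predicts class 0. The short calculation below uses only the class counts printed above.

```python
# Class counts taken from the value_counts() output above
counts = {0: 19680, 1: 320}
total = sum(counts.values())

# A model that always predicts the majority class is right on every class-0 row
baseline_accuracy = max(counts.values()) / total
positive_rate = counts[1] / total

print(f"always-predict-0 accuracy: {baseline_accuracy:.3f}")  # 0.984
print(f"share of positive samples: {positive_rate:.3f}")      # 0.016
```

Any accuracy near 0.984 therefore says little about whether a model can actually find the positive class.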

Data visualization (two arbitrarily chosen features; the classes do not separate clearly -_-!)

from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
%matplotlib inline

x_columns = [x for x in df.columns if x not in ["Disbursed", "ID"]]  # use every column except the target (Disbursed) and the ID
X = df[x_columns]
y = df["Disbursed"]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Visualize the data
fig = plt.figure()
plt.scatter(x_train[y_train==0]["Loan_Tenure_Applied"], x_train[y_train==0]["Var4"])
plt.scatter(x_train[y_train==1]["Loan_Tenure_Applied"], x_train[y_train==1]["Var4"])
plt.legend([0, 1])
plt.show()

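[Figure: scatter plot of Loan_Tenure_Applied against Var4 for the training set, with classes 0 and 1 plotted as separate series]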

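First, a single CART decision tree with no parameter tuning. Accuracy is high, but with such imbalanced classes the AUC is close to 0.5, which means the classifier is barely better than random guessing.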

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, classification_report

dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)

accuracy = dtc.score(x_test, y_test)
print("Accuracy (test): \n", accuracy)

y_pred = dtc.predict(x_test)
print("Classification report:\n", classification_report(y_test, y_pred))

y_predprob = dtc.predict_proba(x_test)[:, 1]
print("AUC Score (test): %f" % roc_auc_score(y_test, y_predprob))

Accuracy (test): 
 0.9658
Classification report:
              precision    recall  f1-score   support

          0       0.99      0.98      0.98      4926
          1       0.05      0.07      0.06        74

avg / total       0.97      0.97      0.97      5000

AUC Score (test): 0.523336
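The table above comes from `classification_report`, which summarizes precision and recall per class. If the raw counts are wanted as well, `sklearn.metrics.confusion_matrix` provides them. The sketch below uses six made-up labels, not the loan data, purely to show how the matrix is laid out.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=1 TP=1
```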

Next come the ensemble models, starting with GBDT using the default parameters.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, classification_report

gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)

accuracy = gbc.score(x_test, y_test)
print("Accuracy (test): \n", accuracy)

y_pred = gbc.predict(x_test)
print("Classification report:\n", classification_report(y_test, y_pred))

y_predprob = gbc.predict_proba(x_test)[:, 1]
print("AUC Score (test): %f" % roc_auc_score(y_test, y_predprob))

Accuracy (test): 
 0.9848
Classification report:
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4926
          1       0.00      0.00      0.00        74

avg / total       0.97      0.98      0.98      5000

AUC Score (test): 0.824064

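Now the random forest model, again with default parameters. Both the out-of-bag (OOB) score and the accuracy are high, but the AUC is only about 0.60, not far above 0.5, so the model is still not a good classifier.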

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report

rfc = RandomForestClassifier(oob_score=True)
rfc.fit(x_train, y_train)

accuracy = rfc.score(x_test, y_test)
print("Accuracy (test): \n", accuracy)

y_pred = rfc.predict(x_test)
print("Classification report (test):\n", classification_report(y_test, y_pred))

y_predprob = rfc.predict_proba(x_test)[:, 1]
print("AUC Score (test): %f" % roc_auc_score(y_test, y_predprob))

print("OOB score:\n", rfc.oob_score_)

Accuracy (test): 
 0.9852
Classification report (test):
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4926
          1       0.50      0.01      0.03        74

avg / total       0.98      0.99      0.98      5000

AUC Score (test): 0.603095
OOB score:
 0.9803333333333333


D:\anaconda\setup\lib\site-packages\sklearn\ensemble\forest.py:453: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
  warn("Some inputs do not have OOB scores. "
D:\anaconda\setup\lib\site-packages\sklearn\ensemble\forest.py:458: RuntimeWarning: invalid value encountered in true_divide
  predictions[k].sum(axis=1)[:, np.newaxis])
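The "Some inputs do not have OOB scores" warning most likely appears because older scikit-learn releases defaulted to n_estimators=10. Each sample is in-bag for a given tree with probability of about 0.632, so with 10 trees roughly 1% of samples (0.632 to the power 10) are in-bag for every tree and never receive an OOB prediction. A sketch of the usual remedy, more trees, follows; it runs on synthetic data from `make_classification` (an assumption for illustration) so that it is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the loan dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# With 100 trees, essentially every sample is out-of-bag for at least one tree
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf_oob.fit(X_demo, y_demo)

print("OOB score:", rf_oob.oob_score_)
print("OOB probability matrix shape:", rf_oob.oob_decision_function_.shape)
```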

After tuning the random forest hyperparameters with a grid search, accuracy does not improve, but the AUC improves substantially.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV

rfc = RandomForestClassifier(oob_score=True, max_features="sqrt")
params = {"max_depth": list(range(3,15, 2)), 
          "n_estimators": list(range(50, 201, 20)), 
          'min_samples_split': list(range(80,150,20)), 
          'min_samples_leaf': list(range(10,60,10))}
gs = GridSearchCV(estimator=rfc, param_grid=params, cv=5)

gs.fit(x_train, y_train)

accuracy = gs.score(x_test, y_test)
print("Accuracy (test): \n", accuracy)

y_pred = gs.predict(x_test)
print("Classification report (test):\n", classification_report(y_test, y_pred))

y_predprob = gs.predict_proba(x_test)[:, 1]
print("AUC Score (test): %f" % roc_auc_score(y_test, y_predprob))

Accuracy (test): 
 0.9852
Classification report (test):
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4926
          1       0.00      0.00      0.00        74

avg / total       0.97      0.99      0.98      5000

AUC Score (test): 0.803139


D:\anaconda\setup\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

gs.best_estimator_, gs.best_score_, gs.best_params_

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=3, max_features='sqrt', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=10, min_samples_split=80,
             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
             oob_score=True, random_state=None, verbose=0, warm_start=False),
 0.9836,
 {'max_depth': 3,
  'min_samples_leaf': 10,
  'min_samples_split': 80,
  'n_estimators': 50})
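GridSearchCV was called without a `scoring` argument, so it ranks candidates by the estimator's default score, which is accuracy for classifiers. The best_score_ of 0.9836 above is therefore mean cross-validated accuracy, not AUC. Since AUC is the metric of interest in this post, one option is to pass `scoring="roc_auc"`. The sketch below shows this on synthetic data with a deliberately small grid; the data, the grid values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the loan dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [10, 20]},
    scoring="roc_auc",  # select hyperparameters by cross-validated AUC
    cv=3,
)
grid.fit(X_demo, y_demo)

print("best params:", grid.best_params_)
print("best CV AUC:", grid.best_score_)
```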

Finally, retrain with the best hyperparameters found by the grid search. The results are as follows.

# Pass only the arguments that were set explicitly or tuned; the rest keep their defaults.
# (min_impurity_split was removed in scikit-learn 1.0, so it is omitted here.)
rfc2 = RandomForestClassifier(max_depth=3, max_features='sqrt',
             min_samples_leaf=10, min_samples_split=80,
             n_estimators=50, oob_score=True)
rfc2.fit(x_train, y_train)

accuracy = rfc2.score(x_test, y_test)
print("Accuracy (test): \n", accuracy)

y_pred = rfc2.predict(x_test)
print("Classification report (test):\n", classification_report(y_test, y_pred))

y_predprob = rfc2.predict_proba(x_test)[:, 1]
print("AUC Score (test): %f" % roc_auc_score(y_test, y_predprob))

print("OOB score:\n", rfc2.oob_score_)

Accuracy (test): 
 0.9836
Classification report (test):
              precision    recall  f1-score   support

          0       0.98      1.00      0.99      4918
          1       0.00      0.00      0.00        82

avg / total       0.97      0.98      0.98      5000

AUC Score (test): 0.808931
OOB score:
 0.9841333333333333


D:\anaconda\setup\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
