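GitHub repository: DataScicence (stars welcome)
Ensemble Learning 4: Forward Stagewise Algorithm and GBDT, Principles and Examples
Ensemble Learning 3: Boosting, Principles and Examples
Ensemble Learning 2: Bagging, Principles and Case Analysis
Ensemble Learning 1: Voting, Principles and Case Analysis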
How XGBoost works
XGBoost follows broadly the same principle as GBDT, but improves on several of its steps.
Objective function
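The biggest difference between XGBoost and GBDT lies in the objective function:
$$Obj^{(t)}=\sum_{i=1}^{n} l\left(y_{i}, \hat y_{i}^{t-1}+f_{t}(\mathbf{x}_{i})\right)+\Omega(f_{t})+\text{constant}$$
In the equation above, $\hat y_{i}^{t-1}$ is the prediction of the weighted tree model built in the previous $t-1$ rounds, and $\Omega(f_{t})$ is the regularization term of the tree added in round $t$ (the regularization of the earlier trees is fixed and is absorbed into the constant).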
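Next comes the key step: approximate the objective with a second-order Taylor expansion around $\hat y_{i}^{t-1}$:
$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[l\left(y_{i}, \hat y_{i}^{t-1}\right)+g_{i} f_{t}(\mathbf{x}_{i})+\frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i})\right]+\Omega(f_{t})+\text{constant}$$
The training error of the previous $t-1$ rounds, $l(y_{i}, \hat y_{i}^{t-1})$, is already known, so it can be dropped together with the constant and the objective becomes:
$$Obj^{(t)}=\sum_{i=1}^{n}\left[g_{i} f_{t}(\mathbf{x}_{i})+\frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i})\right]+\Omega(f_{t})$$
Here $g_{i}$ and $h_{i}$ are the first and second derivatives of the loss $l(y_{i}, \hat y_{i}^{t-1})$ with respect to the previous prediction $\hat y_{i}^{t-1}$.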
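To make g and h concrete, here is a minimal sketch in plain Python. It assumes two common losses that the post itself does not spell out: squared error written as l = 1/2 (y - y_hat)^2, and binary log loss differentiated with respect to the raw score (log-odds). The function names are illustrative and are not part of the xgboost API.

```python
import math

def squared_error_grad_hess(y_true, y_pred):
    """g and h of l = 1/2 * (y - y_hat)^2 with respect to y_hat."""
    g = [p - y for y, p in zip(y_true, y_pred)]  # first derivative: y_hat - y
    h = [1.0 for _ in y_true]                    # second derivative: constant 1
    return g, h

def logistic_grad_hess(y_true, raw_score):
    """g and h of binary log loss (labels in {0, 1}) with respect to the raw score."""
    g, h = [], []
    for y, s in zip(y_true, raw_score):
        p = 1.0 / (1.0 + math.exp(-s))  # sigmoid turns the raw score into a probability
        g.append(p - y)                 # first derivative
        h.append(p * (1.0 - p))         # second derivative
    return g, h

y = [1.0, 0.0, 1.0]
prev_pred = [0.5, 0.5, 2.0]  # predictions after the previous t-1 rounds (made-up values)
g_sq, h_sq = squared_error_grad_hess(y, prev_pred)
g_log, h_log = logistic_grad_hess(y, [0.0, 0.0, 0.0])
print(g_sq, h_sq)
print(g_log, h_log)
```

For squared error h is constant, so the second-order term only rescales the step; for log loss h shrinks as the prediction becomes confident, which is where the second-order information starts to matter.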
Define the tree model:
$f_{t}(x)=w_{q(x)}$, where $w$ holds the weight of each leaf node;
$q(x)$ is the leaf node that each sample falls into;
$I_{j}=\left\{i \mid q\left(\mathbf{x}_{i}\right)=j\right\}$ is the set of samples that fall into leaf node $j$.
As shown in the figure above, $q(x_1)=1, q(x_2)=3, q(x_3)=1, q(x_4)=2, q(x_5)=3$, so $I_1=\{1,3\}, I_2=\{4\}, I_3=\{2,5\}$, and $w=(15,12,20)$.
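The example can be reproduced with two plain dictionaries. This is only a sketch of the notation: q maps a sample index to a leaf, w maps a leaf to its weight, and the helper names `predict` and `leaf_sample_sets` are made up for illustration.

```python
# Leaf assignment from the example: q(x_1)=1, q(x_2)=3, q(x_3)=1, q(x_4)=2, q(x_5)=3
q = {1: 1, 2: 3, 3: 1, 4: 2, 5: 3}
# Leaf weights w = (15, 12, 20), keyed by leaf index
w = {1: 15, 2: 12, 3: 20}

def predict(sample_id):
    """f_t(x_i) = w_{q(x_i)}: look up the leaf, then return that leaf's weight."""
    return w[q[sample_id]]

def leaf_sample_sets(q):
    """I_j = {i | q(x_i) = j}: group the sample indices by the leaf they fall into."""
    sets = {}
    for i, leaf in q.items():
        sets.setdefault(leaf, set()).add(i)
    return sets

I = leaf_sample_sets(q)
predictions = [predict(i) for i in sorted(q)]
print(I)
print(predictions)
```

Every sample in the same leaf receives the same output, which is why the sum over samples in the objective can later be regrouped as a sum over leaves.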
Redefine the complexity of a tree, where $T$ is the number of leaf nodes:
$$\Omega\left(f_{t}\right)=\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2}$$
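As a quick numeric check of the complexity term, the sketch below evaluates gamma * T + 1/2 * lambda * sum(w_j^2) for the leaf weights w = (15, 12, 20) from the example. The values gamma = 0.1 and lambda = 2 are borrowed from the parameter dictionary used later in this post; pairing them with these weights is purely illustrative.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum of squared leaf weights."""
    T = len(leaf_weights)  # number of leaves
    return gamma * T + 0.5 * lam * sum(wj ** 2 for wj in leaf_weights)

omega = tree_complexity([15, 12, 20], gamma=0.1, lam=2)
print(omega)
```

The gamma term charges a fixed price per leaf, while the lambda term penalizes large leaf weights, so both deeper trees and more extreme predictions raise the objective.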
Rewrite the objective function
$$
\begin{aligned}
Obj^{(t)} &=\sum_{i=1}^{n}\left[g_{i} f_{t}(\mathbf{x}_{i})+\frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i})\right]+\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2} \\
&=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T \\
&=\sum_{j=1}^{T}\left[G_{j} w_{j}+\frac{1}{2}\left(H_{j}+\lambda\right) w_{j}^{2}\right]+\gamma T
\end{aligned}
$$
where $G_{j}=\sum_{i \in I_{j}} g_{i}$ and $H_{j}=\sum_{i \in I_{j}} h_{i}$ are the sums of the first and second derivatives over the samples in leaf $j$.
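Solving for w and Obj:
Find the $w$ that minimizes Obj. (A quadratic $ax^2+bx+c$ with $a>0$ attains its minimum at $x^*=-\frac{b}{2a}$.) Each leaf contributes an independent quadratic in $w_j$ with $a=\frac{1}{2}(H_j+\lambda)$ and $b=G_j$, so:
$$w_{j}^{*}=-\frac{G_{j}}{H_{j}+\lambda}$$
Substituting $w_j^*$ into Obj gives the value of the objective function:
$$Obj^{(t)}=-\frac{1}{2} \sum_{j=1}^{T} \frac{G_{j}^{2}}{H_{j}+\lambda}+\gamma T$$
So after a split of one leaf into a left child $L$ and a right child $R$, the reduction in the objective function is:
$$Gain=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{H_{L}+H_{R}+\lambda}\right]-\gamma$$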
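As a sanity check on this step, the sketch below assumes the standard closed forms obtained by minimizing each leaf's quadratic G_j w_j + 1/2 (H_j + lambda) w_j^2: the optimal weight is -G_j / (H_j + lambda), the objective at the optimum is -1/2 * sum(G_j^2 / (H_j + lambda)) + gamma * T, and the gain of a split is the objective before the split minus the objective after it. The G and H values are made up, and the function names are illustrative.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lam)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """G^2 / (H + lam), the building block of the objective and of the gain."""
    return G * G / (H + lam)

def structure_obj(leaves, lam, gamma):
    """Objective at the optimal weights; leaves is a list of (G_j, H_j) pairs."""
    return -0.5 * sum(leaf_score(G, H, lam) for G, H in leaves) + gamma * len(leaves)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Reduction of the objective when one leaf is split into a left and a right child."""
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
                  - leaf_score(GL + GR, HL + HR, lam)) - gamma

lam, gamma = 1.0, 0.5
GL, HL = -4.0, 3.0  # gradient and Hessian sums of the left child (made-up values)
GR, HR = 6.0, 2.0   # gradient and Hessian sums of the right child (made-up values)

w_left = leaf_weight(GL, HL, lam)
w_right = leaf_weight(GR, HR, lam)
obj_before = structure_obj([(GL + GR, HL + HR)], lam, gamma)
obj_after = structure_obj([(GL, HL), (GR, HR)], lam, gamma)
gain = split_gain(GL, HL, GR, HR, lam, gamma)
print(w_left, w_right, obj_before, obj_after, gain)
```

The gain equals obj_before minus obj_after, and the extra gamma that is subtracted is the price of the one additional leaf created by the split.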
Finding the best split
Using either the exact algorithm or an approximate algorithm, choose at each step the split that maximizes Gain.
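A minimal sketch of the exact approach for a single feature: sort the samples by feature value, sweep the candidate split points from left to right while accumulating the left-side sums of g and h, and keep the split with the largest gain. It assumes the gain has the form 1/2 [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - G^2/(H+lambda)] - gamma, and it places the threshold at the midpoint between neighbouring values, which is an illustrative choice. This is not xgboost's actual implementation, and the data are made up.

```python
def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search over one feature; returns (best_gain, threshold)."""
    order = sorted(range(len(x)), key=lambda i: x[i])  # sample indices sorted by feature value
    G, H = sum(g), sum(h)                              # totals of the node being split
    GL = HL = 0.0
    best_gain, best_threshold = 0.0, None              # only accept splits with positive gain
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += g[i]                                     # move sample i to the left child
        HL += h[i]
        if x[i] == x[nxt]:
            continue                                   # cannot split between equal values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam)
                      - G * G / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[nxt]) / 2.0
    return best_gain, best_threshold

x = [1.0, 2.0, 3.0, 10.0, 11.0]
g = [-1.0, -1.0, -1.0, 2.0, 2.0]  # made-up first derivatives
h = [1.0] * 5                     # second derivatives of a squared-error loss
gain, threshold = best_split(x, g, h)
print(gain, threshold)
```

In a full tree learner the same sweep would be repeated for every feature, and the feature and threshold with the largest gain would be chosen; an approximate algorithm would evaluate only a subset of candidate thresholds instead of every one.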
XGBoost example
Load the dataset
from sklearn.datasets import load_iris
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report
iris = load_iris()
X,y = iris.data,iris.target
import pandas as pd
X = pd.DataFrame(X,columns=iris.feature_names)
X.head()
|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
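Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Algorithm parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,               # minimum loss reduction required to make a split
    'max_depth': 6,
    'lambda': 2,                # L2 regularization on the leaf weights
    'subsample': 0.7,           # fraction of rows sampled for each tree
    'colsample_bytree': 0.75,   # fraction of columns sampled for each tree
    'min_child_weight': 3,      # minimum sum of h required in a child node
    'eta': 0.1,                 # learning rate
    'seed': 1,
    'nthread': 4,
}
# DMatrix is the input format of the native xgb.train API; the scikit-learn wrapper below does not use it
dtrain = xgb.DMatrix(X_train, y_train)
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))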
[16:55:13] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        13
           1       1.00      0.86      0.93        22
           2       0.77      1.00      0.87        10
    accuracy                           0.93        45
   macro avg       0.92      0.95      0.93        45
weighted avg       0.95      0.93      0.94        45
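To show how the numbers in this report are computed, the sketch below derives precision, recall and F1 from a confusion matrix. The matrix is an assumption: the post does not show it, but the one below (3 of the 22 class-1 samples predicted as class 2) reproduces the reported per-class values and the overall accuracy.

```python
# Rows are true classes, columns are predicted classes (assumed, not shown in the post)
confusion = [
    [13, 0, 0],
    [0, 19, 3],
    [0, 0, 10],
]

def per_class_metrics(cm):
    """Return a (precision, recall, f1, support) tuple for every class."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))  # column sum: samples predicted as class k
        actual = sum(cm[k])                          # row sum: samples that truly are class k
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        rows.append((precision, recall, f1, actual))
    return rows

rows = per_class_metrics(confusion)
accuracy = sum(confusion[k][k] for k in range(3)) / sum(map(sum, confusion))
for label, (p, r, f1, support) in enumerate(rows):
    print(f"{label}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
print(f"accuracy  {accuracy:.2f}")
```

Under this assumed matrix, class 2 gets perfect recall but lower precision because the three misclassified class-1 samples land in its column.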
Plot feature importance
xgb.plot_importance(model)
<matplotlib.axes._subplots.AxesSubplot at 0x2c2d525cc10>
Hyperparameter tuning
For the commonly used parameters, see: 机器学习集成学习之XGBoost (Machine Learning, Ensemble Learning: XGBoost)
from sklearn.model_selection import GridSearchCV

def Tuning(cv_params, other_params, x_train_array, y_train_):
    # Grid-search cv_params with 5-fold cross-validation while other_params stay fixed
    model2 = xgb.XGBClassifier(**other_params)
    optimized_GBM = GridSearchCV(estimator=model2,
                                 param_grid=cv_params,
                                 scoring='accuracy',
                                 cv=5,
                                 n_jobs=-1)
    optimized_GBM.fit(x_train_array, y_train_)
    evaluate_result = optimized_GBM.cv_results_['mean_test_score']
    # print('Mean CV score of each candidate: {0}'.format(evaluate_result))
    print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
    print('Best model score: {0}'.format(optimized_GBM.best_score_))
    return optimized_GBM
other_params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3
}
cv_params = {
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
}
opt = Tuning(cv_params, other_params, X_train, y_train)
Best parameter values: {'learning_rate': 0.01}
Best model score: 0.9619047619047618
other_params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'learning_rate': 0.01,
}
cv_params = {
    'max_depth': [2, 3, 4, 5],
    'min_child_weight': [0, 2, 5, 10, 20],
}
opt = Tuning(cv_params, other_params, X_train, y_train)
Best parameter values: {'max_depth': 2, 'min_child_weight': 0}
Best model score: 0.9619047619047618
C:\Users\lipan\anaconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
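To get a feel for the cost of a grid like the one in the second tuning step (max_depth and min_child_weight), the sketch below expands a parameter grid into its candidate combinations with the standard library only. `expand_grid` is a hypothetical helper written for illustration, not a scikit-learn function, and the fold count of 5 mirrors the cv=5 setting used in this post.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of the listed parameter values as a dict."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

cv_params = {
    'max_depth': [2, 3, 4, 5],
    'min_child_weight': [0, 2, 5, 10, 20],
}
candidates = expand_grid(cv_params)
n_folds = 5
n_fits = len(candidates) * n_folds  # one model fit per candidate per fold
print(len(candidates), n_fits)
print(candidates[0])
```

Because the number of fits is the product of the grid sizes, tuning one or two parameters at a time while fixing the values already chosen keeps each search small, at the price of possibly missing interactions between parameters tuned in different steps.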
other_params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'learning_rate': 0.01,
    'max_depth': 2,
    'min_child_weight': 0,
}
cv_params = {
    'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
}
opt = Tuning(cv_params, other_params, X_train, y_train)
Best parameter values: {'colsample_bytree': 0.5, 'subsample': 0.95}
Best model score: 0.9619047619047618
other_params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'learning_rate': 0.01,
    'max_depth': 2,
    'min_child_weight': 0,
    'subsample': 0.95,
    'colsample_bytree': 0.5
}
cv_params = {
    'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
}
opt = Tuning(cv_params, other_params, X_train, y_train)
Best parameter values: {'reg_alpha': 0}
Best model score: 0.9619047619047618
y_pred = opt.best_estimator_.predict(X_test)
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        13
           1       1.00      0.86      0.93        22
           2       0.77      1.00      0.87        10
    accuracy                           0.93        45
   macro avg       0.92      0.95      0.93        45
weighted avg       0.95      0.93      0.94        45