集成模型(2)GBDT用于分类和回归及其python实现

2023-11-17

GBDT用于分类和回归及其python实现

adaboost用于分类的时候其实是模型为加法模型，损失函数为指数损失函数的算法，用于回归的时候是是损失函数为平方误差的损失函数，但是当损失函数为一般损失函数的时候，优化会变得比较复杂，例如我们分类使用对数损失函数，那么前面我们求解基函数权值和样本更新权值的过程就会变得比较复杂，这时候提出了一种新的解决方案——GBDT(Gradient Boosting Decision Tree，梯度提升树)。

1.GBDT回归

1.1基本思想

GBDT用于回归问题时，核心思想是利用损失函数的负梯度在当前模型的值作为回归问题提升树算法中残差的近似。然后拟合一个回归树，并且当损失函数为均方误差时，负梯度的值就是残差。其实现过程和AdaBoost用于回归时的差别就在于残差的计算方式上面。

1.2算法流程：

输入：训练集 T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , . . . , ( x N , y N ) } ， x i ∈ R n , y i ∈ R T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}，x_i\in R^n,y_i\in R T={(x1,y1),(x2,y2),...,(xN,yN)}，xi∈Rn,yi∈R，损失函数 L ( y , f ( x ) ) L(y,f(x)) L(y,f(x))；

初始化： f 0 ( x ) = a r g m i n c ∑ i = 1 N L ( y i , c ) f_0(x)=argmin_{c}\sum_{i=1}^NL(y_i,c) f0(x)=argminc∑i=1NL(yi,c)
对 m = 1 , 2 , . . . , M m=1,2,...,M m=1,2,...,M:
(a)对 i = 1 , 2 , . . . , N i=1,2,...,N i=1,2,...,N计算 r m , i = − [ L ( y i , f ( x i ) ) f ( x i ) ] f ( x ) = f m − 1 ( x ) r_{m,i}=-[\frac{L(y_i,f(x_i))}{f(x_i)}]_{f(x)=f_{m-1}(x)} rm,i=−[f(xi)L(yi,f(xi))]f(x)=fm−1(x)
(b)对 r m , i r_{m,i} rm,i拟合回归树，得到第m个基模型的叶节点区域 R m , j , j = 1 , 2 , . . , J R_{m,j}, j=1,2,..,J Rm,j,j=1,2,..,J
(c) 对 j = 1 , 2 , . . . , J j=1,2,...,J j=1,2,...,J计算叶节点区域 R m , j R_{m,j} Rm,j的最佳输出值： c m , j = a r g m i n c ∑ x i ∈ R m , j L ( y i , f m − 1 ( x i ) + c ) c_{m,j}=argmin_{c}\sum_{x_i\in R_{m,j}}L(y_i,f_{m-1}(x_i)+c) cm,j=argminc∑xi∈Rm,jL(yi,fm−1(xi)+c)
(d)更新 f m ( x ) = f m − 1 ( x ) + ∑ j = 1 J c m , j I ( x ∈ R m , j ) f_m(x)=f_{m-1}(x)+\sum_{j=1}^Jc_{m,j}I(x\in R_{m,j}) fm(x)=fm−1(x)+∑j=1Jcm,jI(x∈Rm,j)
生成加法模型， f ( x ) = ∑ m = 1 M ∑ j = 1 J c m , j I ( x ∈ R m , j ) f(x)=\sum_{m=1}^M\sum_{j=1}^Jc_{m,j}I(x\in R_{m,j}) f(x)=∑m=1M∑j=1Jcm,jI(x∈Rm,j)

对于上面的算法流程需要注意的是基学习器回归树的生成过程(b)(c)的节点分裂规则和叶节点输出值的生长规则不一定相同，节点分裂规则通常是采用最小化均方误差（sklearn中还使用了friedman_mse，mae），叶节点输出值的生成规则是要最小化我们的损失函数（因为我们是拿回归树来实现分类问题，所以最终输出是要最小化目标损失函数，节点分裂规则只是去拟合回归树），但是AdaBoost回归中使用均方误差时，这两个规则是相同的；此外初始化和AdaBoost中的初始化也有区别。

2.GBDT二分类

2.1基本思想

GBDT用于二分类问题时，其基学习器和AdaBoost的思路不一样，并不是使用分类树来做的，而和回归问题一样也是采用的回归树作为基学习器，因为GBDT每轮的训练是在上一轮训练模型的负梯度值基础之上训练的。这就要求每轮迭代的时候，真实标签减去当前模型的输出结果是有意义的，如果选用的弱分类器是分类树，类别相减是没有意义的。

这里主要考虑损失函数为对数损失函数时候的情况，当损失函数为指数函数的时候其实就可以直接用AdaBoost实现，并且在sklearn中也这么做的。

首先对于对数损失函数而言： L ( y , f ( x ) ) = − ( y l o g p + ( 1 − y ) l o g ( 1 − p ) ) L(y,f(x))=-(ylogp+(1-y)log(1-p)) L(y,f(x))=−(ylogp+(1−y)log(1−p))，其中 p = p ( y = 1 ∣ x ) = 1 / ( 1 + e − f ( x ) ) p=p(y=1|x)=1/(1+e^{-f(x)}) p=p(y=1∣x)=1/(1+e−f(x))是表示正样本的概率，其中f(x)是每一轮更新后的加法模型，为什么p的表达式是这样？其实这里是和逻辑回归的思想一样，使用sigmoid函数来来将回归模型（这里是回归树，逻辑回归中是线性回归模型）的输出和真实值联系起来，即将其映射到（0，1）之间。

然后将p带入损失函数进行化简:
L ( y , f ( x ) ) = − ( y l o g p + ( 1 − y ) l o g ( 1 − p ) ) = − ( y l o g 1 1 + e − f ( x ) + ( 1 − y ) l o g e − f ( x ) 1 + e − f ( x ) ) = y l o g ( 1 + e − f ( x ) ) + ( y − 1 ) ( l o g e − f ( x ) − l o g ( 1 + e − f ( x ) ) ) = y l o g ( 1 + e − f ( x ) ) + ( y − 1 ) ( − f ( x ) − l o g ( 1 + e − f ( x ) ) ) = − ( y − 1 ) f ( x ) + l o g ( 1 + e − f ( x ) ) = − ( y − 1 ) f ( x ) + l o g ( 1 + e f ( x ) e f ( x ) ) = − y f ( x ) + l o g ( 1 + e f ( x ) ) L(y,f(x))=-(ylogp+(1-y)log(1-p))\\=-(ylog\frac{1}{1+e^{-f(x)}}+(1-y)log\frac{e^{-f(x)}}{1+e^{-f(x)}})\\=ylog{(1+e^{-f(x)})}+(y-1)(loge^{-f(x)}-log(1+e^{-f(x)}))\\=ylog{(1+e^{-f(x)})}+(y-1)(-f(x)-log(1+e^{-f(x)}))\\=-(y-1)f(x)+log(1+e^{-f(x)})\\=-(y-1)f(x)+log(\frac{1+e^{f(x)}}{e^{f(x)}})\\=-yf(x)+log(1+e^{f(x)}) L(y,f(x))=−(ylogp+(1−y)log(1−p))=−(ylog1+e−f(x)1+(1−y)log1+e−f(x)e−f(x))=ylog(1+e−f(x))+(y−1)(loge−f(x)−log(1+e−f(x)))=ylog(1+e−f(x))+(y−1)(−f(x)−log(1+e−f(x)))=−(y−1)f(x)+log(1+e−f(x))=−(y−1)f(x)+log(ef(x)1+ef(x))=−yf(x)+log(1+ef(x))

2.2算法流程

输入：训练集 T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , . . . , ( x N , y N ) } ， x i ∈ R n , y i ∈ { 1 , 0 } T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}，x_i\in R^n,y_i\in \{1,0\} T={(x1,y1),(x2,y2),...,(xN,yN)}，xi∈Rn,yi∈{1,0}，损失函数 L ( y , f ( x ) ) L(y,f(x)) L(y,f(x))为对数损失函数时；

初始化： f 0 ( x ) = a r g m i n c ∑ i = 1 N L ( y i , f ( x i ) ) = l o g ∑ i = 1 N y i ∑ i = 1 N ( 1 − y i ) f_0(x)=argmin_{c}\sum_{i=1}^NL(y_i,f(x_i))=log\frac{\sum_{i=1}^Ny_i}{\sum_{i=1}^N(1-y_i)} f0(x)=argminc∑i=1NL(yi,f(xi))=log∑i=1N(1−yi)∑i=1Nyi
对 m = 1 , 2 , . . . , M m=1,2,...,M m=1,2,...,M:
(a)对 i = 1 , 2 , . . . , N i=1,2,...,N i=1,2,...,N计算 r m , i = − [ L ( y i , f ( x i ) ) f ( x i ) ] f ( x ) = f m − 1 ( x ) = y i − 1 1 + e f m − 1 ( x i ) r_{m,i}=-[\frac{L(y_i,f(x_i))}{f(x_i)}]_{f(x)=f_{m-1}(x)}=y_i-\frac{1}{1+e^{f_{m-1}(x_i)}} rm,i=−[f(xi)L(yi,f(xi))]f(x)=fm−1(x)=yi−1+efm−1(xi)1
(b)对 r m , i r_{m,i} rm,i拟合回归树，得到第m个基模型的叶节点区域 R m , j , j = 1 , 2 , . . , J R_{m,j}, j=1,2,..,J Rm,j,j=1,2,..,J
(c) 对 j = 1 , 2 , . . . , J j=1,2,...,J j=1,2,...,J计算叶节点区域 R m , j R_{m,j} Rm,j的最佳输出值： c m , j = a r g m i n c ∑ x i ∈ R m , j L ( y i , f m − 1 ( x i ) + c ) = ∑ x ∈ R m , j r m , i ∑ x ∈ R m , j ( y i − r m , i ) ( 1 − y i + r m , i ) c_{m,j}=argmin_{c}\sum_{x_i\in R_{m,j}}L(y_i,f_{m-1}(x_i)+c)\\=\frac{\sum_{x\in R_{m,j}}r_{m,i}}{\sum_{x\in R_{m,j}}(y_i-r_{m,i})(1-y_i+r_{m,i})} cm,j=argminc∑xi∈Rm,jL(yi,fm−1(xi)+c)=∑x∈Rm,j(yi−rm,i)(1−yi+rm,i)∑x∈Rm,jrm,i
(d)更新 f m ( x ) = f m − 1 ( x ) + ∑ j = 1 J c m , j I ( x ∈ R m , j ) f_m(x)=f_{m-1}(x)+\sum_{j=1}^Jc_{m,j}I(x\in R_{m,j}) fm(x)=fm−1(x)+∑j=1Jcm,jI(x∈Rm,j)
生成加法模型， f ( x ) = ∑ m = 1 M ∑ j = 1 J c m , j I ( x ∈ R m , j ) f(x)=\sum_{m=1}^M\sum_{j=1}^Jc_{m,j}I(x\in R_{m,j}) f(x)=∑m=1M∑j=1Jcm,jI(x∈Rm,j)
可以看到上面的流程和应用于回归时十分相似，但是对于分类任务而言要把最后累加结果转换成概率，对于对数损失函数而言： p ( y i = 1 ∣ x i ) = 1 1 + e − f ( x i ) p(y_i=1|x_i)=\frac{1}{1+e^{-f(x_i)}} p(yi=1∣xi)=1+e−f(xi)1。然后设置一个阈值，大于阈值的预测为正样本。

2.3python实现

这里因为对于回归的节点分裂规则和叶节点区域的生成规则不相同，所以对前面回归树的实现先进行修改：

2.3.1回归树

import numpy as np
import pandas as pd
import pygraphviz as pgv
import matplotlib.pyplot as plt

'''构建回归树'''
#计算均方误差
def cal_mse(feat_val_list,split_val,y_true):
    feat_val_list = np.array(feat_val_list)
    y_pred = np.zeros(len(y_true))
    c1 = np.mean(np.array(y_true)[feat_val_list<split_val])
    c2 = np.mean(np.array(y_true)[feat_val_list>split_val])
    y_pred[feat_val_list<split_val] = c1
    y_pred[feat_val_list>split_val] = c2

    return np.sum((y_pred-y_true)**2)/len(y_true),c1,c2

#选择最优划分特征以及划分点
def select_best_feature(data:pd.DataFrame, y_true, criterion='mse'):
    '''
    :param data: 待划分数据集
    :param y_true: 样本对应的真实值
    :param criterion: 节点分裂准则，默认使用'mse'
    :return: 最优的划分特征以及划分点
    '''
    min_mse = np.inf #最小均方误差
    best_feat = '' #最优划分特征
    best_split_val = 0 #最优划分点
    best_c1 = 0 #最优划分后每个区域的取值
    best_c2 = 0
    for feat in data.columns.values:
        feat_val_list = list(data[feat]) #该属性下的所有取值，含重复取值
        sorted_feat_val_list = sorted(list(data[feat].unique()))
        #所有划分点，不含重复值
        split_val_list = [(sorted_feat_val_list[i]+sorted_feat_val_list[i+1])/2 for i in np.arange(len(sorted_feat_val_list)-1)]
        for split_val in split_val_list:
            #计算最小的划分空间，以及其均方误差
            if criterion == 'mse':
                cur_mse,c1,c2 = cal_mse(feat_val_list, split_val, y_true)

            if cur_mse<min_mse:
                min_mse = cur_mse
                best_feat = feat
                best_split_val = split_val
                best_c1 = c1
                best_c2 = c2

    return best_feat, best_split_val, min_mse, best_c1, best_c2

def build_regressionTree(data:pd.DataFrame,y_true,cur_depth,min_samples_leaf=1,min_samples_split=2,max_depth=3,criterion='mse',
                         r=None,loss='ls'):
    '''
    :param data: 训练集
    :param y_true: 训练样本对应的真实值, 即需要进行拟合的值
    :param cur_depth: 当前树的层数
    :param min_samples_leaf: 每个叶节点最小样本数目，如果当前节点划分后叶节点数目小于该值则不进行划分
    :param min_samples_split: 每个划分节点所需要包含的最小样本树，如果小于该值则不进行划分
    :param max_depth: 数的最大深度
    :param criterion: 节点分裂准则, 默认使用'mse'
    *********************
    下面这两个参数主要为了GBDT里面的回归树的实现分类问题
    :param r: 当前模型的残差
    :param loss: 损失函数，默认使用'ls'，即最小化均方误差，主要决定了叶结点的输出
    :return:
    '''
    #当用于构建GBDT里面的回归树时，如果当前数据中已经都属于一个类别了，就停止划分
    if (r is not None) & (np.sum(y_true)==len(y_true) or np.sum(y_true)==0):
        if loss == 'logloss':
            return {'isLeaf': True,
                    'val': np.sum(r) / np.sum((y_true - r) * (np.ones(len(y_true)) - y_true + r))}  # 返回叶节点
    #当待划分数据集大小小于最小划分大小时，或者达到树最大深度时，停止划分
    if (len(data) < min_samples_split) or cur_depth>max_depth:
        if r is None: #构建单个决策树, 分裂准则和loss是一致的
            if criterion == 'mse':
                return {'isLeaf': True, 'val': np.mean(y_true)}
        else: #用于GBDT的时候
            if loss == 'logloss':
                return {'isLeaf': True,
                        'val': np.sum(r) / np.sum((y_true - r) * (np.ones(len(y_true)) - y_true + r))}  # 返回叶节点

    tree = {} #构建树
    if r is None:
        best_feat, best_split_val, min_mse, best_c1, best_c2 = select_best_feature(data, y_true, criterion)
    else: #在GBDT中，是去拟合当前模型残差，此时y_true为类别
        best_feat, best_split_val, min_mse, best_c1, best_c2 = select_best_feature(data, r, criterion)

    l_tree_index = data[best_feat]<best_split_val
    r_tree_index = data[best_feat]>best_split_val

    # 当划分后的叶节点数据量大小小于最小叶节点数据量时，停止划分
    if(len(data[l_tree_index]) < min_samples_leaf) or (len(data[r_tree_index]) < min_samples_leaf):
        if r is None:  # 构建单个决策树, 分裂准则和loss是一致的
            if criterion == 'mse':
                return {'isLeaf': True, 'val': np.mean(y_true)}
        else:  # 用于GBDT的时候
            if loss == 'logloss':
                return {'isLeaf': True,
                        'val': np.sum(r) / np.sum((y_true - r) * (np.ones(len(y_true)) - y_true + r))}  # 返回叶节点

    # print(best_feat, best_split_val)
    #继续递归划分
    tree['isLeaf'] = False
    tree['best_feat'] = best_feat
    tree['split_val'] = best_split_val
    if r is None:
        tree['l_tree'] = build_regressionTree(data[l_tree_index], np.array(y_true)[l_tree_index], cur_depth + 1,
                                              min_samples_leaf, min_samples_split, max_depth,
                                              criterion, None, loss)
        tree['r_tree'] = build_regressionTree(data[r_tree_index], np.array(y_true)[r_tree_index], cur_depth + 1,
                                              min_samples_leaf, min_samples_split, max_depth,
                                              criterion, None, loss)
    else:
        tree['l_tree'] = build_regressionTree(data[l_tree_index], np.array(y_true)[l_tree_index], cur_depth + 1,
                                              min_samples_leaf, min_samples_split, max_depth,
                                              criterion, np.array(r)[l_tree_index], loss)
        tree['r_tree'] = build_regressionTree(data[r_tree_index], np.array(y_true)[r_tree_index], cur_depth + 1,
                                              min_samples_leaf, min_samples_split, max_depth,
                                              criterion, np.array(r)[r_tree_index], loss)

    return tree

def predict(tree:{}, data:pd.DataFrame):
    y_pred = np.zeros(len(data))
    for i in np.arange(len(data)):
        tmp_tree = tree
        while(tmp_tree['isLeaf']==False):
            cur_feat = tmp_tree['best_feat']
            split_val = tmp_tree['split_val']
            if data.loc[i,cur_feat] <= split_val:
                tmp_tree = tmp_tree['l_tree']
            else:
                tmp_tree = tmp_tree['r_tree']
        y_pred[i] = tmp_tree['val']

    return y_pred


def plotTree(A,tree:{}, father_node,depth,label):
    #如果当前是根节点
    if depth == 1:
        A.add_node(father_node)
        #如果既是根节点又是叶节点，即树桩
        if tree['isLeaf'] == True:
            A.add_edge(father_node,tree['val'],label=label)
            return
        else:
            plotTree(A,tree['l_tree'], father_node,depth+1,'<=')
            plotTree(A,tree['r_tree'], father_node,depth+1,'>')
            return
    if tree['isLeaf'] == True:
        A.add_edge(father_node, tree['val'], label=label)
        return
    A.add_edge(father_node, tree['best_feat']+':'+str(tree['split_val']), label=label)
    plotTree(A,tree['l_tree'], tree['best_feat']+':'+str(tree['split_val']), depth+1,'<=')
    plotTree(A,tree['r_tree'], tree['best_feat']+':'+str(tree['split_val']), depth+1,'>')

if __name__ == '__main__':
    df = pd.DataFrame(columns=['x1'])
    df['x1'] = [1,2,3,4,5,6,7,8,9,10]
    y_true = [4.50,4.75,4.91,5.34,5.80,7.05,7.90,8.23,8.70,9.00]
    y_true = [5.56,5.70,5.91,6.40,6.80,7.05,8.90,8.70,9.00,9.05]
    tree = build_regressionTree(df, y_true, cur_depth=1,max_depth=3)
    print(tree)
    # print(select_best_feature(df, y_true))
    print(predict(tree,df))
    A = pgv.AGraph(directed=True, strict=False)
    plotTree(A, tree, tree['best_feat']+':'+str(tree['split_val']),depth=1,label='')
    A.layout('dot')  # layout with dot
    A.draw('tree2.png')  # write to file

2.3.2GBDT实现

import numpy as np
import pandas as pd
import pygraphviz as pgv
from DecisionTree import decisiontreeRegressor
'''构建梯度提升树，用于分类问题，使用logloss损失函数（对数损失函数，也是交叉熵损失函数）'''

#对数损失
def log_loss(y_true, y_pred):
    pass

#计算当前模型f m-1的负梯度, 这里只实现了计算对数损失函数的负梯度
def cal_negGradient(y_true, f_m):
    return y_true-1/(1+np.exp(-f_m))

#构建梯度提升树分类器
def build_gbdtClassifier(data:pd.DataFrame, y_true, n=5, lr=0.1, loss='logloss', criterion='mse', max_depth=1,
                         min_samples_leaf=1,min_samples_split=2):
    y_true = np.array(y_true)
    gbdt_classifier = [] #最终得到的一系列弱学习器
    f_0 = np.log(np.sum(y_true)/np.sum(1-y_true)) #step1 初始化f0
    f_m = f_0 #当前模型的输出值
    gbdt_classifier.append(f_0)
    for _ in np.arange(n):
        r = cal_negGradient(y_true, f_m) #使用负梯度作为残差的近似 step2 (a)
        tree = decisiontreeRegressor.build_regressionTree(data, y_true, 1, r=r, loss=loss, max_depth=max_depth,
                                                          min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split) #使用回归树拟合残差 step2 (b)(c)
        f_m += lr * decisiontreeRegressor.predict(tree,data) #线性相加 step2 (d)
        gbdt_classifier.append(tree)

    return gbdt_classifier

def predict(gbdt_classifier:[], data:pd.DataFrame, lr=0.1):
    f_0 = gbdt_classifier[0]
    f_m = f_0
    for clf in gbdt_classifier[1:]:
        f_m += lr * decisiontreeRegressor.predict(clf, data)

    y_pred_prob = 1/(1+np.exp(-f_m))
    y_pred_prob[y_pred_prob>=0.5] = 1
    y_pred_prob[y_pred_prob<0.5] = 0

    return y_pred_prob

if __name__ == '__main__':
    df = pd.DataFrame(columns=['x1'])
    df['x1'] = [1,2,3,4,5,6,7,8,9,10]
    y_true = [0,0,0,1,1,0,0,0,1,1]
    gbdt_classifier = build_gbdtClassifier(df, y_true, n=2,max_depth=3)
    print(predict(gbdt_classifier,df))

    df = pd.DataFrame(columns=['age','weight'])
    df['weight'] = [10,30,70,60]
    df['age'] = [5, 7, 21, 30]
    y_true = [0,0,1,1]

    gbdt_classifier = build_gbdtClassifier(df, y_true, n=5, max_depth=2)
    print(predict(gbdt_classifier, df))
    for i,clf in enumerate(gbdt_classifier[1:]):
        A = pgv.AGraph(directed=True, strict=False)
        decisiontreeRegressor.plotTree(A, clf, clf['best_feat']+':'+str(clf['split_val']),depth=1,label='')
        A.layout('dot')
        A.draw('output/tree'+str(i)+'.png')

参考博客：
深入理解GBDT二分类算法
 GBDT原理与Sklearn源码分析-分类篇

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)