Tuning LightGBM in R for an Imbalanced Binary-Classification Problem (with Code)

2023-10-27



Source: Big Data Digest (大数据文摘)

This article is about 10,000 words; estimated reading time 10 minutes. Using data from a Kaggle competition, it walks through how to handle an imbalanced binary-classification problem.

The data used in this case study come from the Kaggle "Santander Customer Satisfaction" competition. It is an imbalanced binary-classification problem, and the goal is to maximize AUC (the area under the ROC curve). The competition has since closed.

Competition link:

https://www.kaggle.com/c/santander-customer-satisfaction 

1. Modeling approach

This walkthrough uses Microsoft's open-source lightgbm algorithm for classification; it runs extremely fast. The steps are:

  • Read the data;

  • Parallel execution: the lightgbm package can parallelize via its own parameters, so the doParallel and foreach packages are not needed;

  • Feature selection: use the mlr package to keep the features covering 99% of the cumulative chi.squared score;

  • Tuning: adjust the parameters of lgb.cv step by step, iterating until the results are satisfactory;

  • Prediction: build the lightgbm model with the tuned parameter values and output predictions. The program in this case study reaches an AUC of 0.833386, above the top Private Leaderboard score (0.829072).

2. The lightgbm algorithm

Since the lightgbm project does not spell out its mathematical formulation in full, it is not covered here; if needed, see the GitHub repository.

Project page:

https://github.com/Microsoft/LightGBM

Reading the data

options(java.parameters = "-Xmx8g")  ## needed later for feature selection; must be set before any package is loaded
library(readr)
lgb_tr1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/train.csv")
lgb_te1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/test.csv")


Data exploration

1. Set up parallel execution

library(dplyr)
library(mlr)
library(parallelMap)
parallelStartSocket(2)


2. First look at each column

summarizeColumns(lgb_tr1) %>% View()


3. Handle missing values

# impute missing values by mean (integer and numeric columns)
imp_tr1 <- impute(
    as.data.frame(lgb_tr1), 
    classes = list(
        integer = imputeMean(), 
        numeric = imputeMean()
    )
)
imp_te1 <- impute(
    as.data.frame(lgb_te1), 
    classes = list(
        integer = imputeMean(), 
        numeric = imputeMean()
    )
)

After imputation:

summarizeColumns(imp_tr1$data) %>% View()


4. Check the class ratio in the training data -- the classes are imbalanced

table(lgb_tr1$TARGET)
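The imbalance ratio itself is worth computing here, since it sets the search range for the class weight later on:

# negatives per positive; about 24.27 on this training set
table(lgb_tr1$TARGET)[1] / table(lgb_tr1$TARGET)[2]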


5. Drop constant columns

lgb_tr2 <- removeConstantFeatures(imp_tr1$data)
lgb_te2 <- removeConstantFeatures(imp_te1$data)


6. Keep only the columns shared by the training and test sets

tr2_name <- data.frame(tr2_name = colnames(lgb_tr2))
te2_name <- data.frame(te2_name = colnames(lgb_te2))
tr2_name_inner <- tr2_name %>% 
    inner_join(te2_name, by = c('tr2_name' = 'te2_name'))
TARGET <- data.frame(TARGET = lgb_tr2$TARGET)
# keep the shared columns, skipping the first one (the ID column)
lgb_tr2 <- lgb_tr2[, c(tr2_name_inner$tr2_name[2:dim(tr2_name_inner)[1]])]
lgb_te2 <- lgb_te2[, c(tr2_name_inner$tr2_name[2:dim(tr2_name_inner)[1]])]
lgb_tr2 <- cbind(lgb_tr2, TARGET)


Notes:

1) Since lightgbm is used this time, the data are not standardized;

2) lightgbm is extremely efficient -- with data under 1 GB it runs very fast even without feature screening -- but screening is applied here to speed things up further;

3) Feature screening is applied directly, without constructing derived variables: the actual meaning of the features is unknown, so generating derived variables blindly is hard to justify.

Feature screening -- chi-square test

library(lightgbm)


1. Trial run to find the best weight value; refined further below

grid_search <- expand.grid(
    weight = seq(1, 30, 2)
    ## table(lgb_tr1$TARGET)[1] / table(lgb_tr1$TARGET)[2] = 24.27261,
    ## so weight is searched over [1, 30]
)
lgb_rate_1 <- numeric(length = nrow(grid_search))

set.seed(0)

for(i in 1:nrow(grid_search)){
    # weight positives by the current grid value
    lgb_weight <- (lgb_tr2$TARGET * grid_search[i, 'weight'] + 1) /
        sum(lgb_tr2$TARGET * grid_search[i, 'weight'] + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr2[, 1:300]),
        label = lgb_tr2$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc'
    )
    # cross-validation
    lgb_tr2_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        learning_rate = .1,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    # keep the last recorded validation AUC
    lgb_rate_1[i] <- unlist(lgb_tr2_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr2_mod$record_evals$valid$auc$eval))]
}

library(ggplot2)
grid_search$perf <- lgb_rate_1
ggplot(grid_search, aes(x = weight, y = perf)) +
    geom_point()


The plot shows that AUC is not very sensitive to the weight; it peaks at weight = 5.
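As an aside, instead of a hand-built per-row weight vector, LightGBM also exposes built-in imbalance handling via the scale_pos_weight (or is_unbalance) parameter. This is not what this case study uses; a minimal sketch of the same cross-validation, for comparison:

# Alternative imbalance handling: let LightGBM up-weight positives internally
# instead of passing a weight vector to lgb.Dataset.
lgb_train_alt <- lgb.Dataset(
    data = data.matrix(lgb_tr2[, 1:300]),
    label = lgb_tr2$TARGET,
    free_raw_data = FALSE
)
params_alt <- list(
    objective = 'binary',
    metric = 'auc',
    scale_pos_weight = 24.27  # ~ #negatives / #positives; is_unbalance = TRUE is similar
)
lgb_mod_alt <- lgb.cv(
    params_alt,
    data = lgb_train_alt,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = .1,
    num_threads = 2,
    early_stopping_rounds = 10
)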

2. Feature selection

1) Select features

lgb_tr2$TARGET <- factor(lgb_tr2$TARGET)
lgb.task <- makeClassifTask(data = lgb_tr2, target = 'TARGET')
lgb.task.smote <- oversample(lgb.task, rate = 5)
fv_time <- system.time(
    fv <- generateFilterValuesData(
        lgb.task.smote,
        method = c('chi.squared')
        ## information gain or the chi-square test both work here; random forest
        ## importance is not recommended -- it is far too slow
        ## IV (information value) filtering is also worth trying
        ## feature engineering sets the ceiling on the target metric (here, AUC);
        ## the filtering method itself can be treated as a hyperparameter
    )
)


2) Plot the filter values

# plotFilterValues(fv)
plotFilterValuesGGVIS(fv)


3) Keep the features covering 99% of cumulative chi.squared (lightgbm is efficient enough that a generous number of variables can be kept)

Note: the X in "keep X% of chi.squared" can itself be treated as a hyperparameter (see the sketch after the next code block).

fv_data2 <- fv$data %>% 
    arrange(desc(chi.squared)) %>% 
    mutate(chi_gain_cul = cumsum(chi.squared) / sum(chi.squared))
fv_data2_filter <- fv_data2 %>% filter(chi_gain_cul <= 0.99)
dim(fv_data2_filter)  ## roughly halves the number of predictors
fv_feature <- fv_data2_filter$name
lgb_tr3 <- lgb_tr2[, c(fv_feature, 'TARGET')]
lgb_te3 <- lgb_te2[, fv_feature]
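As noted above, the cutoff itself can be treated as a hyperparameter; a sketch of scanning several cutoffs (the candidate values are illustrative):

# Scan a few cumulative chi.squared cutoffs and report the feature count;
# each candidate feature set would then be compared via lgb.cv downstream.
for(cut in c(.90, .95, .99)){
    fv_feature_cut <- fv_data2 %>%
        filter(chi_gain_cul <= cut) %>%
        pull(name)
    message(sprintf('cutoff %.2f keeps %d features', cut, length(fv_feature_cut)))
}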


4) Write out the data

write_csv(lgb_tr3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
write_csv(lgb_te3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')


The algorithm

lgb_tr <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
lgb_te <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')
## better to read lgb_te only at prediction time, to save memory
library(lightgbm)
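rxImport() is part of RevoScaleR, which ships with Microsoft R (the "professional edition" recommended in the closing notes). On a standard R installation, the readr functions used earlier work just as well; a sketch:

# Plain-R alternative to rxImport (assumes only readr is installed)
library(readr)
lgb_tr <- as.data.frame(read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv'))
lgb_te <- as.data.frame(read_csv('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv'))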


1. Tune the weight parameter

grid_search <- expand.grid(
    weight = 1:30
)
perf_weight_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc'
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        learning_rate = .1,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_weight_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

library(ggplot2)
grid_search$perf <- perf_weight_1
ggplot(grid_search, aes(x = weight, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at weight = 4 and trending downward from there.

2. Tune learning_rate

grid_search <- expand.grid(
    learning_rate = 2 ^ (-(8:1))
)
perf_learning_rate_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_learning_rate_1
ggplot(grid_search, aes(x = learning_rate, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at learning_rate = 2^(-5), but the differences across 2^(-(6:3)) are tiny, so learning_rate = .125 is used to speed things up.

3. Tune num_leaves

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = seq(50, 800, 50)
)
perf_num_leaves_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_num_leaves_1
ggplot(grid_search, aes(x = num_leaves, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at num_leaves = 650.

4. Tune min_data_in_leaf

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    min_data_in_leaf = 2 ^ (1:7)
)
perf_min_data_in_leaf_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        min_data_in_leaf = grid_search[i, 'min_data_in_leaf']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_min_data_in_leaf_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_data_in_leaf_1
ggplot(grid_search, aes(x = min_data_in_leaf, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC is insensitive to min_data_in_leaf, so it is left unchanged.

5. Tune max_bin

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 2 ^ (5:10)
)
perf_max_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_bin_1
ggplot(grid_search, aes(x = max_bin, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at max_bin = 2^10, so the value needs further fine-tuning.

6. Fine-tune max_bin

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 100 * (6:15)
)
perf_max_bin_2 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_max_bin_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_bin_2
ggplot(grid_search, aes(x = max_bin, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at max_bin = 1000.

7. Tune min_data_in_bin

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 2 ^ (1:9)
)
perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_data_in_bin_1
ggplot(grid_search, aes(x = min_data_in_bin, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at min_data_in_bin = 8, though the variation is tiny, so no further adjustment is made.

8. Tune feature_fraction

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = seq(.5, 1, .02)
)
perf_feature_fraction_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_feature_fraction_1
ggplot(grid_search, aes(x = feature_fraction, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at feature_fraction = .62 and holding steady over [.60, .62]; from .64 on it declines.

9. Tune min_sum_hessian

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = seq(0, .02, .001)
)
perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_sum_hessian_1
ggplot(grid_search, aes(x = min_sum_hessian, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at min_sum_hessian = .005 and declining beyond it; values in [.002, .005] are recommended.

10. Tune the lambda parameters

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = seq(0, .01, .002),
    lambda_l2 = seq(0, .01, .002)
)
perf_lamda_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_lamda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_lamda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
    geom_point() +
    facet_wrap(~ lambda_l2, nrow = 5)


The plot suggests lambda_l1 = 0 and lambda_l2 = 0.

11. Tune drop_rate

Note: drop_rate and max_drop are DART-specific parameters; in current LightGBM they only take effect when boosting = 'dart', which the code below does not set.

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = seq(0, 1, .1)
)
perf_drop_rate_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_drop_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_drop_rate_1
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at drop_rate = .2; 0, .2, and .5 all perform well, and the variation over [0, 1] is small.

12. Tune max_drop

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = seq(1, 10, 2)
)
perf_max_drop_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_max_drop_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_drop_1
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC peaking at max_drop = 5, with little variation over [1, 10].

Second round of tuning

1. Tune the weight parameter

grid_search <- expand.grid(
    learning_rate = .125,
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)
perf_weight_2 <- numeric(length = 20)
for(i in 1:20){
    # all other parameters are fixed at the first-round optima; only the weight varies
    lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[1, 'learning_rate'],
        num_leaves = grid_search[1, 'num_leaves'],
        max_bin = grid_search[1, 'max_bin'],
        min_data_in_bin = grid_search[1, 'min_data_in_bin'],
        feature_fraction = grid_search[1, 'feature_fraction'],
        min_sum_hessian = grid_search[1, 'min_sum_hessian'],
        lambda_l1 = grid_search[1, 'lambda_l1'],
        lambda_l2 = grid_search[1, 'lambda_l2'],
        drop_rate = grid_search[1, 'drop_rate'],
        max_drop = grid_search[1, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        learning_rate = .1,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_weight_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

library(ggplot2)
ggplot(data.frame(num = 1:length(perf_weight_2), perf = perf_weight_2), aes(x = num, y = perf)) +
    geom_point() +
    geom_smooth()


The plot shows AUC stabilizing once weight >= 3, with the maximum at weight = 7.

2. Tune learning_rate

grid_search <- expand.grid(
    learning_rate = seq(.05, .5, .03),
    num_leaves = 650,
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)
perf_learning_rate_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_learning_rate_1
ggplot(data = grid_search, aes(x = learning_rate, y = perf)) +
    geom_point() +
    geom_smooth()


Conclusion: AUC is largest at learning_rate = .11.

3. Tune num_leaves


grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = seq(100, 800, 50),
    max_bin = 1000,
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)
perf_num_leaves_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_num_leaves_1
ggplot(data = grid_search, aes(x = num_leaves, y = perf)) +
    geom_point() +
    geom_smooth()


Conclusion: AUC is largest at num_leaves = 200.

4. Tune max_bin

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = seq(100, 1500, 100),
    min_data_in_bin = 8,
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)
perf_max_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_max_bin_1
ggplot(data = grid_search, aes(x = max_bin, y = perf)) +
    geom_point() +
    geom_smooth()


Conclusion: AUC is largest at max_bin = 600; 400 and 800 are also acceptable.

5. Tune min_data_in_bin

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = seq(5, 50, 5),
    feature_fraction = .62,
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)
perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_data_in_bin_1
ggplot(data = grid_search, aes(x = min_data_in_bin, y = perf)) +
    geom_point() +
    geom_smooth()


Conclusion: AUC is largest at min_data_in_bin = 45; 25 is also acceptable.

6. Tune feature_fraction

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = seq(.5, .9, .02),
    min_sum_hessian = .005,
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)
perf_feature_fraction_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_feature_fraction_1
ggplot(data = grid_search, aes(x = feature_fraction, y = perf)) +
    geom_point() +
    geom_smooth()


Conclusion: AUC is largest at feature_fraction = .54; .56 and .58 also perform well.

7. Tune min_sum_hessian

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = seq(.001, .008, .0005),
    lambda_l1 = 0,
    lambda_l2 = 0,
    drop_rate = .2,
    max_drop = 5
)
perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_min_sum_hessian_1
ggplot(data = grid_search, aes(x = min_sum_hessian, y = perf)) +
    geom_point() +
    geom_smooth()


Conclusion: AUC is largest at min_sum_hessian = .0065; .003 and .0055 are also acceptable.

8. Tune the lambda parameters

grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = 0.0065,
    lambda_l1 = seq(0, .001, .0002),
    lambda_l2 = seq(0, .001, .0002),
    drop_rate = .2,
    max_drop = 5
)
perf_lambda_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    # parameters
    params <- list(
        objective = 'binary',
        metric = 'auc',
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # cross-validation
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2,
        early_stopping_rounds = 10
    )
    perf_lambda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}
grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
    geom_point() +
    facet_wrap(~ lambda_l2, nrow = 5)


Conclusion: lambda is negatively correlated with AUC overall; lambda_l1 = .0002 and lambda_l2 = .0004 are chosen.

9. Tune drop_rate

Conclusion: AUC is largest at drop_rate = .4; .15 and .25 are acceptable. (The code for this step and the next follows the same pattern as steps 1-8; a sketch appears after the next conclusion.)

10. Tune max_drop

Conclusion: the final model below uses max_drop = 14 (see the prediction section).
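Neither of these last two steps ships with code; a minimal sketch under the same pattern as steps 1-8, jointly searching drop_rate and max_drop. The grid ranges are assumptions chosen to bracket the reported optima, and boosting = 'dart' is set explicitly because these DART-specific parameters are otherwise ignored:

# Sketch only: joint grid over the two DART parameters.
grid_search <- expand.grid(
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = .0065,
    lambda_l1 = .0002,
    lambda_l2 = .0004,
    drop_rate = seq(.1, .5, .05),   # assumed range, brackets the reported .4
    max_drop = seq(2, 20, 2)        # assumed range, brackets the reported 14
)
perf_dart_1 <- numeric(length = nrow(grid_search))
for(i in 1:nrow(grid_search)){
    lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

    lgb_train <- lgb.Dataset(
        data = data.matrix(lgb_tr[, 1:148]),
        label = lgb_tr$TARGET,
        free_raw_data = FALSE,
        weight = lgb_weight
    )

    params <- list(
        objective = 'binary',
        metric = 'auc',
        boosting = 'dart',   # required for drop_rate / max_drop to take effect
        learning_rate = grid_search[i, 'learning_rate'],
        num_leaves = grid_search[i, 'num_leaves'],
        max_bin = grid_search[i, 'max_bin'],
        min_data_in_bin = grid_search[i, 'min_data_in_bin'],
        feature_fraction = grid_search[i, 'feature_fraction'],
        min_sum_hessian = grid_search[i, 'min_sum_hessian'],
        lambda_l1 = grid_search[i, 'lambda_l1'],
        lambda_l2 = grid_search[i, 'lambda_l2'],
        drop_rate = grid_search[i, 'drop_rate'],
        max_drop = grid_search[i, 'max_drop']
    )
    # fixed nrounds: early stopping is not reliable under DART
    lgb_tr_mod <- lgb.cv(
        params,
        data = lgb_train,
        nrounds = 300,
        stratified = TRUE,
        nfold = 10,
        num_threads = 2
    )
    perf_dart_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}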

Prediction

1. Weights

Each positive case receives 7 + 1 = 8 relative units of weight versus 1 for each negative, normalized so the weights sum to one:

lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)


2. Training dataset

lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
)


3. Training

# parameters
params <- list(
    objective = 'binary',   # binary objective and AUC metric, as in all the tuning runs
    metric = 'auc',
    learning_rate = .11,
    num_leaves = 200,
    max_bin = 600,
    min_data_in_bin = 45,
    feature_fraction = .54,
    min_sum_hessian = 0.0065,
    lambda_l1 = .0002,
    lambda_l2 = .0004,
    drop_rate = .4,
    max_drop = 14
)
# model
lgb_mod <- lightgbm(
    params = params,
    data = lgb_train,
    nrounds = 300,
    early_stopping_rounds = 10,
    num_threads = 2
)
# prediction
lgb.pred <- predict(lgb_mod, data.matrix(lgb_te))

4. Results

lgb.pred2 <- matrix(unlist(lgb.pred), ncol = 1)
lgb.pred3 <- data.frame(lgb.pred2)

5. Output

write.csv(lgb.pred3, "C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb.pred1_tr.csv")
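The file written above contains only the raw probabilities; for an actual Kaggle submission they would typically be paired with the test IDs in the competition's ID,TARGET layout. A sketch, assuming the raw test file lgb_te1 (with its ID column) is still in memory:

# Hypothetical submission file: bind test IDs to the predicted probabilities.
submission <- data.frame(ID = lgb_te1$ID, TARGET = lgb.pred)
write_csv(submission, 'C:/Users/Administrator/Documents/kaggle/scs_lgb/submission.csv')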


A note with some advice for readers still in school:

1. When studying machine-learning algorithms at school, the test datasets are usually small, so you can try most algorithms and most R functions -- for example, the randomForest package when experimenting with random forests. With somewhat more data you can turn on parallel execution, but once the data reach the GB scale even a parallelized randomForest can no longer cope and memory overflows; the functions in the professional edition of R are recommended instead;

2. Coursework is mainly about theory and its test data are usually fairly clean; real-world data structures tend to be more complex.

Editor: Huang Jiyan (黄继彦)
