Machine Learning Notes 2: The General Steps for Building a Model

2023-11-11

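1. Feature engineering

Feature engineering here means preprocessing the attribute values of the samples. Numeric attributes are usually standardized (rescaled to zero mean and unit variance) so that differences in units and scale do not dominate the model. String or text (categorical) attributes are usually one-hot encoded, so that the categories become independent 0/1 columns and no artificial ordering or distance is implied between them.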
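
As a small illustration of the two transformations described above, the sketch below applies them to a made-up four-row table. The column names and values are invented for the example and are not taken from the dataset used in this post.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

# Standardization: subtract the column mean, divide by the column standard deviation
scaled = StandardScaler().fit_transform(toy[["income"]])

# One-hot encoding: one 0/1 column per distinct category
encoder = OneHotEncoder()
onehot = encoder.fit_transform(toy[["proximity"]]).toarray()

print(scaled.ravel())
print(encoder.categories_[0])
print(onehot)
```

After scaling, the numeric column has mean 0 and standard deviation 1; after encoding, every row has exactly one 1 among the category columns.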

The dataset used in this post is loaded and preprocessed as follows:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Split the dataset
data = pd.read_excel('123.xlsx')
x = data.drop('median_house_value', axis=1)  # all columns except the target
y = data['median_house_value']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# Feature engineering
# 1. Prepare the column lists
str_name = ["ocean_proximity"]  # string-valued column
num_name = [col for col in x_train.columns if col not in str_name]  # numeric columns
# 2. Pipeline for the numeric columns
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),  # fill missing values with the median
        ('std_scaler', StandardScaler()),  # standardize
    ])
# 3. Route each group of columns to its transformer
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_name),  # numeric columns go through the pipeline above
        ("cat", OneHotEncoder(), str_name),  # the text column is one-hot encoded
    ])
# 4. Fit on the training set only, then apply the same transform to the test set
x_train = full_pipeline.fit_transform(x_train)
x_test = full_pipeline.transform(x_test)

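2. Sampling methods

The two common choices are random sampling and stratified sampling. Random sampling is self-explanatory, so it is not covered further.

Stratified sampling: suppose you want to pick 1000 people from a population to survey average personal wealth, and society is currently 10% upper class, 50% middle class and 40% lower class. Selecting 100 upper-class, 500 middle-class and 400 lower-class people is more likely to represent the whole population than a purely random draw, which can over- or under-represent a class by chance. When the sample is large enough (99.99%), random and stratified sampling differ very little; otherwise stratified sampling is worth considering, although which attribute to stratify on is a judgment call. When choosing the number of strata, make sure each stratum contains enough samples, and do not create too many strata.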
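
The 10% / 50% / 40% example can be tried out on synthetic labels. This sketch assumes a made-up population of 10,000 people and uses the stratify argument of train_test_split to draw a stratified sample of 1000, next to a plain random draw of the same size. The population, the variable names and the helper class_counts are all invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical population: 10% upper, 50% middle, 40% lower class
strata = np.array(["upper"] * 1000 + ["middle"] * 5000 + ["lower"] * 4000)
people = np.arange(len(strata))  # one id per person

# Stratified draw of 1000 people (test_size given as an absolute count)
_, stratified_ids = train_test_split(
    people, test_size=1000, stratify=strata, random_state=0
)

# Plain random draw of the same size, for comparison
rng = np.random.default_rng(0)
random_ids = rng.choice(people, size=1000, replace=False)


def class_counts(ids):
    """Count how many sampled people fall in each class."""
    labels, counts = np.unique(strata[ids], return_counts=True)
    return {str(label): int(count) for label, count in zip(labels, counts)}


stratified_counts = class_counts(stratified_ids)
random_counts = class_counts(random_ids)
print("stratified:", stratified_counts)
print("random:    ", random_counts)
```

The stratified counts should follow the population proportions, while the random counts typically drift from them by chance.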

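2.1 Random sampling:

from sklearn.model_selection import train_test_split

# random_state: seed for the random number generator
# test_size: proportion of the data that goes to the test set (0.2 leaves 80% for training)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)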

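2.2 Stratified sampling:

First look at the histogram of the attribute to decide on the stratum boundaries.

# `housing` is assumed to be the full dataset as a DataFrame (it was loaded as `data` in section 1)
housing["median_income"].hist()

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Strata: (0, 1.5], (1.5, 3], (3, 4.5], (4.5, 6], (6, inf)
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

# StratifiedShuffleSplit shuffles the data and splits it while keeping each stratum's proportion
# Here: one train/test split, with the test set taking 20%
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # split() yields positional indices, so iloc is used rather than loc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]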

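3. Cross-validation

The purpose of cross-validation is to evaluate the model several times and take the average as the final result.

There are two ways to do the validation. The first (strictly speaking hold-out validation rather than cross-validation) is to split off part of the training set as a validation set that takes no part in training, which is easy to do with the train_test_split function.

The second is the K-fold cross-validation built into sklearn, which works internally as follows:

1. First, split all the samples into k subsets (folds) of roughly equal size.

2. Iterate over the k subsets; each time, use the current subset as the validation set and all remaining samples as the training set, then train and evaluate the model.

3. Finally, take the average of the k evaluation scores as the final metric. In practice k is usually 3, 5 or 10.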
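
The three steps above can be written out by hand. The following is a minimal sketch on synthetic regression data (generated with make_regression, not the dataset from this post), using LinearRegression as a stand-in model; it then compares the hand-written loop with cross_val_score given the same folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, for illustration only
X, y = make_regression(n_samples=90, n_features=5, noise=10.0, random_state=0)

k = 3
kfold = KFold(n_splits=k)  # step 1: split the samples into k folds

manual_scores = []
for train_idx, val_idx in kfold.split(X):
    # step 2: train on the other k-1 folds, evaluate on the current fold
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 on the held-out fold

# step 3: average the k scores
manual_mean = float(np.mean(manual_scores))

# The library helper, run on the same folds
library_scores = cross_val_score(LinearRegression(), X, y, cv=kfold)

print(manual_scores, manual_mean)
print(library_scores, library_scores.mean())
```

Passing the same KFold object to both paths keeps the folds identical, so the two sets of scores should agree.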

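# Cross-validated accuracy
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation of the classifier sgd_clf (not defined in this post)
# cross_val_score fits a fresh clone on each fold, so prior training is not required
# scoring: the evaluation metric, e.g. accuracy or F1; accuracy is used here
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")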

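4. Fine-tuning the model

Fine-tuning usually aims to improve a particular evaluation metric. There are many ways to do it; grid search and random search, for example, are used to tune hyperparameters. Grid search tries every combination of the values you list, one by one. Random search usually takes a range (or distribution) for each hyperparameter and draws combinations from it at random. A recommended workflow is to use random search first to find the rough region, then grid search for a finer search within it.

4.1 Grid search:

The grid below contains (3*4) + (1*2*3) = 18 combinations, and each one is evaluated with 3-fold cross-validation, for 54 training runs in total.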

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=42)
param_grid = [
    # try 12 (3×4) hyperparameter combinations
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then, with bootstrap set to False, try 6 (2×3) combinations
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

# scoring sets the metric used to rank the combinations; its per-combination values are printed in the next step
# refit defaults to True: once cross-validation has found the best combination, the estimator is retrained on the whole training set
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Best model
grid_search.best_estimator_

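The score of each combination can be inspected, for example as RMSE:

import numpy as np

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    # the score is a negated MSE, so flip the sign before taking the square root
    print(np.sqrt(-mean_score), params)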
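
The combination count for the grid in this section can be checked without running any training. This short sketch uses sklearn's ParameterGrid to expand the same param_grid; the variable names n_combinations and n_fits are introduced only for the example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search of this section
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
cv = 3  # number of folds

n_combinations = len(ParameterGrid(param_grid))  # every combination across both dicts
n_fits = n_combinations * cv                     # cross-validation fits, excluding the final refit
print(n_combinations, n_fits)
```

Counting the fits before launching a search is a cheap way to estimate how long it will take.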

4.2 Random search:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# randint draws integers from [low, high), so high itself is never sampled
param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
        'bootstrap': [True, False]
    }

# n_iter: number of random combinations to try
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=3, scoring='neg_mean_squared_error', random_state=42)
# x_train and y_train come from the house price prediction dataset
rnd_search.fit(x_train, y_train)

The score of each sampled combination:

import numpy as np

cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
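
A common follow-up, not shown in the original notes, is to read off the best combination and check it once on held-out data. The sketch below does this on synthetic data from make_regression with a deliberately small search so that it runs quickly; the data, the parameter ranges and the variable names are assumptions made for the example.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(low=5, high=30),
        "max_features": randint(low=1, high=8),
    },
    n_iter=5, cv=3, scoring="neg_mean_squared_error", random_state=42,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_               # best sampled combination
cv_rmse = float(np.sqrt(-search.best_score_))   # its cross-validated RMSE

# best_estimator_ was refit on the whole training set (refit=True by default)
test_pred = search.best_estimator_.predict(X_te)
test_rmse = float(np.sqrt(mean_squared_error(y_te, test_pred)))
print(best_params, cv_rmse, test_rmse)
```

The test set is touched only once, after the search has finished, so the test RMSE is not influenced by the tuning.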
