Machine Learning Algorithms in Python -- Decision Trees

2023-05-16

Table of Contents

  • Mind Map
    • 3.1.2 Loading the dataset with pandas
    • 3.1.3 Cleaning the dataset
      • Now compute the actual values for these
      • Did the home and visitor teams win their last game?
  • 3.2 Decision Trees
    • 3.2.1 Decision tree parameters
    • 3.2.2 Using decision trees
  • 3.3 Predicting sports match outcomes
  • 3.4 Random Forests

Mind Map

[Mind map image]

import os
import numpy as np
import pandas as pd
# Path to the NBA 2013-14 season games file
home_folder = "./PythonDataMining/"
data_folder = os.path.join(home_folder,'data')
data_filename = os.path.join(data_folder, "leagues_NBA_2014_games_games.csv")

3.1.2 Loading the dataset with pandas

results = pd.read_csv(data_filename)
results.iloc[:5]
  | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes
0 | Tue Oct 29 2013 | Box Score | Orlando Magic | 87 | Indiana Pacers | 97 | NaN | NaN
1 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN
2 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN
3 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN
4 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN

3.1.3 Cleaning the dataset

results = pd.read_csv(data_filename, skiprows=[0,])
# Fix the name of the columns
results.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]
results.iloc[:5]
  | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN
results['HomeWin'] = results['VisitorPts'] < results['HomePts']
y_true = results['HomeWin'].values
results.iloc[:5]
  | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True
print("Home Win 百分比: {0:.1f}%".format(100 * results["HomeWin"].sum() / results["HomeWin"].count()))
results["HomeLastWin"] = False
results["VisitorLastWin"] = False
# This creates two new columns, all set to False
results.iloc[:5]
Home Win percentage: 58.0%
  | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | False | False
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | False | False
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | False | False
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | False | False
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | False | False
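
This 58% is the baseline every model below has to beat: always predicting a home win would already be right that often. As a quick sanity check, a minimal sketch of that naive baseline using scikit-learn's DummyClassifier (the all-zeros feature matrix is just a placeholder, since the dummy model ignores its inputs):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always predict the most frequent class, i.e. a home win
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, np.zeros((len(y_true), 1)), y_true, scoring='accuracy')
print("Baseline accuracy: {0:.1f}%".format(np.mean(baseline_scores) * 100))  # roughly 58%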

Now compute the actual values for these

Did the home and visitor teams win their last game?

# Now compute the actual values for these
# Did the home and visitor teams win their last game?
from collections import defaultdict
won_last = defaultdict(int)

for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    results.loc[index] = row
    # Set current win
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
results.iloc[20:25]
   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin
20 | Fri Nov 1 2013 | Box Score | Miami Heat | 100 | Brooklyn Nets | 101 | NaN | NaN | True | False | False
21 | Fri Nov 1 2013 | Box Score | Cleveland Cavaliers | 84 | Charlotte Bobcats | 90 | NaN | NaN | True | False | True
22 | Fri Nov 1 2013 | Box Score | Portland Trail Blazers | 113 | Denver Nuggets | 98 | NaN | NaN | False | False | False
23 | Fri Nov 1 2013 | Box Score | Dallas Mavericks | 105 | Houston Rockets | 113 | NaN | NaN | True | True | True
24 | Fri Nov 1 2013 | Box Score | San Antonio Spurs | 91 | Los Angeles Lakers | 85 | NaN | NaN | False | False | True
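
As the comment in the loop notes, iterating with iterrows and writing whole rows back is slow. A sketch of a faster variant of the same loop (assuming the default RangeIndex, as here), which writes individual cells with .at instead of copying rows:

# Same logic as above, but .at writes single cells instead of whole rows
won_last = defaultdict(bool)
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    results.at[index, "HomeLastWin"] = won_last[home_team]
    results.at[index, "VisitorLastWin"] = won_last[visitor_team]
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]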

3.2 Decision Trees

A decision tree is a supervised machine learning algorithm. It looks like a flowchart made up of a series of nodes, where the value at an upper node determines which node to move to next.

%%html
<img src="./image/决策树1.png" width="100" height="100">

Like most classification algorithms, decision trees work in two major steps.

• First, the training phase: build a tree from the training data. The nearest-neighbor algorithm in the previous chapter had no training phase; in that sense it is a lazy algorithm that only does its work when asked to classify. A decision tree, like most machine learning methods, is an eager learner: the model is created during the training phase.
• Second, the prediction phase: use the trained tree to predict the class of new data. Taking the figure above as an example, ["is raining", "very windy"] is predicted as "Bad" (bad weather).

There are many algorithms for building decision trees, and most grow the tree iteratively: start at the root node, pick the best feature for the first decision, move to the next node, pick the next best feature there, and so on. The algorithm stops when it finds that adding more levels to the tree yields no further information.

scikit-learn implements the Classification and Regression Trees (CART) algorithm as its default for building decision trees, and it supports both continuous and categorical features.
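
As a minimal sketch of the two phases on made-up weather data (the features and labels here are hypothetical, just to mirror the figure):

from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: columns are [is_raining, very_windy]
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = ["Good", "Good", "Bad", "Bad"]

tree = DecisionTreeClassifier()   # training phase: build the tree
tree.fit(X_train, y_train)
print(tree.predict([[1, 1]]))     # prediction phase: ['Bad']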

3.2.1 Decision tree parameters
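
The most important parameters of DecisionTreeClassifier control when the algorithm stops building the tree:

• criterion: the impurity measure used to choose splits, either "gini" (Gini impurity, the default) or "entropy" (information gain);
• max_depth: the maximum depth the tree may reach;
• min_samples_split: how many samples a node needs before it may be split;
• min_samples_leaf: how many samples every leaf must keep.

The first parameter changes how splits are scored; the others trade fit against overfitting, since a deeper tree matches the training data more closely but generalizes worse. A short sketch (the values are illustrative, not tuned):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",  # score splits by information gain instead of Gini
    max_depth=5,          # never grow past depth 5
    min_samples_leaf=4,   # every leaf must cover at least 4 samples
    random_state=14,
)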

3.2.2 Using decision trees

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)

from sklearn.model_selection import cross_val_score

X_previouswins = results[["HomeLastWin", "VisitorLastWin"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print('Using just the last result from the home and visitor teams')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
Using just the last result from the home and visitor teams
Accuracy: 59.1%

3.3 Predicting sports match outcomes

# What about win streaks?
results["HomeWinStreak"] = 0
results["VisitorWinStreak"] = 0
# How long is each team's current winning streak?
win_streak = defaultdict(int)

for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeWinStreak"] = win_streak[home_team]
    row["VisitorWinStreak"] = win_streak[visitor_team]
    results.loc[index] = row    
    # Set current win
    if row["HomeWin"]:
        win_streak[home_team] += 1
        win_streak[visitor_team] = 0
    else:
        win_streak[home_team] = 0
        win_streak[visitor_team] += 1
clf = DecisionTreeClassifier(random_state=14)
X_winstreak =  results[["HomeLastWin", "VisitorLastWin", "HomeWinStreak", "VisitorWinStreak"]].values
scores = cross_val_score(clf, X_winstreak, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
Accuracy: 58.4%

Next, let's see whether the team that sits higher on the ladder plays better, using last season's standings.

ladder_filename = os.path.join(data_folder, "leagues_NBA_2013_standings_expanded-standings.csv")
ladder = pd.read_csv(ladder_filename)
ladder.head()
  | Rk | Team | Overall | Home | Road | E | W | A | C | SE | ... | Post | ≤3 | ≥10 | Oct | Nov | Dec | Jan | Feb | Mar | Apr
0 | 1 | Miami Heat | 66-16 | 37-4 | 29-12 | 41-11 | 25-5 | 14-4 | 12-6 | 15-1 | ... | 30-2 | 9-3 | 39-8 | 1-0 | 10-3 | 10-5 | 8-5 | 12-1 | 17-1 | 8-1
1 | 2 | Oklahoma City Thunder | 60-22 | 34-7 | 26-15 | 21-9 | 39-13 | 7-3 | 8-2 | 6-4 | ... | 21-8 | 3-6 | 44-6 | NaN | 13-4 | 11-2 | 11-5 | 7-4 | 12-5 | 6-2
2 | 3 | San Antonio Spurs | 58-24 | 35-6 | 23-18 | 25-5 | 33-19 | 8-2 | 9-1 | 8-2 | ... | 16-12 | 9-5 | 31-10 | 1-0 | 12-4 | 12-4 | 12-3 | 8-3 | 10-4 | 3-6
3 | 4 | Denver Nuggets | 57-25 | 38-3 | 19-22 | 19-11 | 38-14 | 5-5 | 10-0 | 4-6 | ... | 24-4 | 11-7 | 28-8 | 0-1 | 8-8 | 9-6 | 12-3 | 8-4 | 13-2 | 7-1
4 | 5 | Los Angeles Clippers | 56-26 | 32-9 | 24-17 | 21-9 | 35-17 | 7-3 | 8-2 | 6-4 | ... | 17-9 | 3-5 | 38-12 | 1-0 | 8-6 | 16-0 | 9-7 | 8-5 | 7-7 | 7-1

5 rows × 24 columns

# Note how all the features here get reduced to just a few categories, e.g. True and False; otherwise there would be far more information-gain computation

# We can create a new feature -- HomeTeamRanksHigher
results["HomeTeamRanksHigher"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = ladder[ladder["Team"] == home_team]["Rk"].values[0]
    visitor_rank = ladder[ladder["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    results.loc[index] = row
results[:5]
  | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0
X_homehigher =  results[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
准确率: 60.2%
from sklearn.model_selection import GridSearchCV

parameter_space = {
    "max_depth": list(range(1, 21)),  # try depths 1 through 20
}
clf = DecisionTreeClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_homehigher, y_true)
print("准确率: {0:.1f}%".format(grid.best_score_ * 100))
准确率: 60.5%
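
It's also worth checking which depth actually won; a short usage sketch:

print(grid.best_params_)          # the winning max_depth
best_tree = grid.best_estimator_  # already refit on the full data by default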

# Who won the last meeting between the two teams? Here we ignore which one was home and which was visiting

last_match_winner = defaultdict(int)
results['HomeTeamWonLast'] = 0

for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    teams = tuple(sorted([home_team, visitor_team]))  # Sort for a consistent ordering
    # Record in the current row whether the home team won the last meeting between these two teams
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    results.loc[index] = row
    # The winner of this game
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner
results.loc[:5]
  | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher | HomeTeamWonLast
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0
5 | Wed Oct 30 2013 | Box Score | Los Angeles Lakers | 94 | Golden State Warriors | 125 | NaN | NaN | True | 0 | True | 0 | 1 | 0 | 0
X_home_higher =  results[["HomeTeamRanksHigher", "HomeTeamWonLast"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_home_higher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
准确率: 60.5%

Finally, let's see whether a decision tree can produce an effective classification model when given much more training data. We'll feed the teams themselves into the tree to test whether it can integrate the new information.
Although decision trees can in principle handle categorical features, the implementation in scikit-learn requires such features to be encoded numerically first. The LabelEncoder transformer converts the string team names into integers. Code as follows:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
encoding = LabelEncoder()
encoding.fit(results["Home Team"].values)
home_teams = encoding.transform(results["Home Team"].values)
visitor_teams = encoding.transform(results["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T  # one (home, visitor) row per game

A decision tree could be trained on these features, but DecisionTreeClassifier would still treat them as continuous. For example, with 17 teams numbered 0 through 16, the algorithm would consider teams 1 and 2 similar and teams 4 and 10 different. That makes no sense here: two teams are either the same team or they are not, with nothing in between.
To remove this mismatch with reality, we can use the OneHotEncoder transformer to turn the integers into binary one-hot features, one binary column per possible value. For example, if LabelEncoder assigned the Chicago Bulls the value 7, then the seventh binary feature produced by OneHotEncoder is 1 for the Bulls and 0 for every other team. Doing this for every possible value makes the dataset much wider. Code as follows:

onehot = OneHotEncoder()
X_teams = onehot.fit_transform(X_teams).todense()

clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
准确率: 60.1%

The accuracy is about 60%, higher than the baseline but worse than before. The likely reason is that the decision tree copes poorly with the larger number of features. Given that, let's try changing the algorithm and see whether that helps. Data mining is often exactly this process of trying out new algorithms and new features.

3.4 Random Forests

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Using full team labels is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using full team labels is ranked higher
准确率: 61.5%
X_all = np.hstack([X_home_higher, X_teams])
print(X_all.shape)
(1229, 62)
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
准确率: 62.9%

We can also try other parameters with the GridSearchCV class.

parameter_space = {
    "max_features": [2, 10, 'auto'],
    "n_estimators": [100],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("准确率: {0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_)
准确率: 65.4%
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=6, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=14, verbose=0,
                       warm_start=False)
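
With the tuned forest in hand, feature_importances_ on the refit estimator shows where the predictive power sits. A short sketch (the column order follows how X_all was assembled: the two hand-made features first, then the one-hot team columns):

importances = grid.best_estimator_.feature_importances_
print(importances[:2])        # HomeTeamRanksHigher, HomeTeamWonLast
print(importances[2:].sum())  # total weight carried by the one-hot team columns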

