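This section builds a predictive model of long-term loyalty for bank customers. Loyalty is measured by the customer's churn status, a binary indicator: 1 denotes a loyal long-term customer and 0 a customer who is not loyal.
The model is a random forest classifier, one of the classification methods in machine learning.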
import pandas as pd
long_data26=pd.read_excel('result4.xlsx')
long_data26.Age
0 52
1 41
2 42
3 61
4 39
..
9175 37
9176 37
9177 39
9178 34
9179 40
Name: Age, Length: 9180, dtype: int64
long_data26
|  | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Status | AssetStage | IsActiveStatus | IsActiveAssetStage | CrCardAssetStage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15553251 | 713 | 1 | 52 | 0 | 185891.54 | 1 | 1 | 1 | 46369.57 | 1 | 1 | 3 | 3 | 9 | 9 |
| 1 | 15553256 | 619 | 1 | 41 | 8 | 0.00 | 3 | 1 | 1 | 79866.73 | 1 | 2 | 2 | 5 | 6 | 6 |
| 2 | 15553283 | 603 | 1 | 42 | 8 | 91611.12 | 1 | 0 | 0 | 144675.30 | 1 | 2 | 0 | 2 | 2 | 5 |
| 3 | 15553308 | 589 | 1 | 61 | 1 | 0.00 | 1 | 1 | 0 | 61108.56 | 1 | 1 | 2 | 0 | 0 | 6 |
| 4 | 15553387 | 687 | 1 | 39 | 2 | 0.00 | 3 | 0 | 0 | 188150.60 | 1 | 1 | 2 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9175 | 15815628 | 711 | 1 | 37 | 8 | 113899.92 | 1 | 0 | 0 | 80215.20 | 0 | 2 | 0 | 2 | 2 | 5 |
| 9176 | 15815645 | 481 | 0 | 37 | 8 | 152303.66 | 2 | 1 | 1 | 175082.20 | 0 | 2 | 3 | 5 | 9 | 9 |
| 9177 | 15815656 | 541 | 1 | 39 | 9 | 100116.67 | 1 | 1 | 1 | 199808.10 | 1 | 2 | 0 | 5 | 8 | 9 |
| 9178 | 15815660 | 758 | 1 | 34 | 1 | 154139.45 | 1 | 1 | 1 | 60728.89 | 0 | 1 | 3 | 3 | 9 | 9 |
| 9179 | 15815690 | 614 | 1 | 40 | 3 | 113348.50 | 1 | 1 | 1 | 77789.01 | 0 | 1 | 0 | 3 | 8 | 9 |

9180 rows × 16 columns
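A look at the data shows that some columns vary over a wide range, so selected columns are binned or standardized below.
Age is discretized into 7 age groups, i.e. 7 categories.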
long_data0=long_data26.copy()
long_data0.loc[(long_data0['Age']<18),'离散化年龄']=1
long_data0.loc[(long_data0['Age']>=18) & (long_data0['Age']<30),'离散化年龄']=2
long_data0.loc[(long_data0['Age']>=30) & (long_data0['Age']<40),'离散化年龄']=3
long_data0.loc[(long_data0['Age']>=40) & (long_data0['Age']<50),'离散化年龄']=4
long_data0.loc[(long_data0['Age']>=50) & (long_data0['Age']<60),'离散化年龄']=5
long_data0.loc[(long_data0['Age']>=60) & (long_data0['Age']<70),'离散化年龄']=6
long_data0.loc[(long_data0['Age']>=70),'离散化年龄']=7
long_data0
|  | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Status | AssetStage | IsActiveStatus | IsActiveAssetStage | CrCardAssetStage | 离散化年龄 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15553251 | 713 | 1 | 52 | 0 | 185891.54 | 1 | 1 | 1 | 46369.57 | 1 | 1 | 3 | 3 | 9 | 9 | 5.0 |
| 1 | 15553256 | 619 | 1 | 41 | 8 | 0.00 | 3 | 1 | 1 | 79866.73 | 1 | 2 | 2 | 5 | 6 | 6 | 4.0 |
| 2 | 15553283 | 603 | 1 | 42 | 8 | 91611.12 | 1 | 0 | 0 | 144675.30 | 1 | 2 | 0 | 2 | 2 | 5 | 4.0 |
| 3 | 15553308 | 589 | 1 | 61 | 1 | 0.00 | 1 | 1 | 0 | 61108.56 | 1 | 1 | 2 | 0 | 0 | 6 | 6.0 |
| 4 | 15553387 | 687 | 1 | 39 | 2 | 0.00 | 3 | 0 | 0 | 188150.60 | 1 | 1 | 2 | 0 | 0 | 0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9175 | 15815628 | 711 | 1 | 37 | 8 | 113899.92 | 1 | 0 | 0 | 80215.20 | 0 | 2 | 0 | 2 | 2 | 5 | 3.0 |
| 9176 | 15815645 | 481 | 0 | 37 | 8 | 152303.66 | 2 | 1 | 1 | 175082.20 | 0 | 2 | 3 | 5 | 9 | 9 | 3.0 |
| 9177 | 15815656 | 541 | 1 | 39 | 9 | 100116.67 | 1 | 1 | 1 | 199808.10 | 1 | 2 | 0 | 5 | 8 | 9 | 3.0 |
| 9178 | 15815660 | 758 | 1 | 34 | 1 | 154139.45 | 1 | 1 | 1 | 60728.89 | 0 | 1 | 3 | 3 | 9 | 9 | 3.0 |
| 9179 | 15815690 | 614 | 1 | 40 | 3 | 113348.50 | 1 | 1 | 1 | 77789.01 | 0 | 1 | 0 | 3 | 8 | 9 | 4.0 |

9180 rows × 17 columns
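The seven `.loc` assignments above can also be written as a single binning call. The following is a minimal sketch, assuming `pd.cut` with left-closed bins reproduces the same boundaries; the sample ages and the column name `age_bin` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample ages; the real values come from result4.xlsx
ages = pd.DataFrame({'Age': [17, 18, 29, 30, 41, 52, 61, 70, 85]})

# Left-closed bins [0, 18), [18, 30), ..., [70, inf), matching the comparisons above
bins = [0, 18, 30, 40, 50, 60, 70, float('inf')]
labels = [1, 2, 3, 4, 5, 6, 7]

# right=False makes each interval include its lower edge and exclude its upper edge
ages['age_bin'] = pd.cut(ages['Age'], bins=bins, labels=labels, right=False).astype(int)
print(ages)
```

Unlike the `.loc` version, which produces a float column (5.0, 4.0, ...), this variant yields integer codes; which one is preferable depends on how the column is used later.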
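The CreditScore and EstimatedSalary columns are min-max scaled and multiplied by 10, which maps them to the range [0, 10]. The same scaling for Balance is present in the code but commented out.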
the_min1=long_data0['CreditScore'].min()
the_max1=long_data0['CreditScore'].max()
the_min2=long_data0['EstimatedSalary'].min()
the_max2=long_data0['EstimatedSalary'].max()
# the_min3=long_data0['Balance'].min()
# the_max3=long_data0['Balance'].max()
long_data0['标准化信用']=((long_data0['CreditScore']-the_min1)/(the_max1-the_min1))*10
long_data0['标准化个人年收入']=((long_data0['EstimatedSalary']-the_min2)/(the_max2-the_min2))*10
#long_data0['标准化金融资产']=((long_data0['Balance']-the_min3)/(the_max3-the_min3))*10
long_data0
|  | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Status | AssetStage | IsActiveStatus | IsActiveAssetStage | CrCardAssetStage | 离散化年龄 | 标准化信用 | 标准化个人年收入 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15553251 | 713 | 1 | 52 | 0 | 185891.54 | 1 | 1 | 1 | 46369.57 | 1 | 1 | 3 | 3 | 9 | 9 | 5.0 | 7.26 | 2.318373 |
| 1 | 15553256 | 619 | 1 | 41 | 8 | 0.00 | 3 | 1 | 1 | 79866.73 | 1 | 2 | 2 | 5 | 6 | 6 | 4.0 | 5.38 | 3.993573 |
| 2 | 15553283 | 603 | 1 | 42 | 8 | 91611.12 | 1 | 0 | 0 | 144675.30 | 1 | 2 | 0 | 2 | 2 | 5 | 4.0 | 5.06 | 7.234663 |
| 3 | 15553308 | 589 | 1 | 61 | 1 | 0.00 | 1 | 1 | 0 | 61108.56 | 1 | 1 | 2 | 0 | 0 | 6 | 6.0 | 4.78 | 3.055473 |
| 4 | 15553387 | 687 | 1 | 39 | 2 | 0.00 | 3 | 0 | 0 | 188150.60 | 1 | 1 | 2 | 0 | 0 | 0 | 3.0 | 6.74 | 9.408872 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9175 | 15815628 | 711 | 1 | 37 | 8 | 113899.92 | 1 | 0 | 0 | 80215.20 | 0 | 2 | 0 | 2 | 2 | 5 | 3.0 | 7.22 | 4.011000 |
| 9176 | 15815645 | 481 | 0 | 37 | 8 | 152303.66 | 2 | 1 | 1 | 175082.20 | 0 | 2 | 3 | 5 | 9 | 9 | 3.0 | 2.62 | 8.755319 |
| 9177 | 15815656 | 541 | 1 | 39 | 9 | 100116.67 | 1 | 1 | 1 | 199808.10 | 1 | 2 | 0 | 5 | 8 | 9 | 3.0 | 3.82 | 9.991866 |
| 9178 | 15815660 | 758 | 1 | 34 | 1 | 154139.45 | 1 | 1 | 1 | 60728.89 | 0 | 1 | 3 | 3 | 9 | 9 | 3.0 | 8.16 | 3.036486 |
| 9179 | 15815690 | 614 | 1 | 40 | 3 | 113348.50 | 1 | 1 | 1 | 77789.01 | 0 | 1 | 0 | 3 | 8 | 9 | 4.0 | 5.28 | 3.889666 |

9180 rows × 19 columns
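Min-max scaling depends on the minimum and maximum of the column it is computed from, so scaling two data sets separately only gives comparable values if their ranges coincide. As a hedged sketch of one way to keep the scaling consistent, the bounds can be learned once and reused; the helper names `fit_min_max` and `apply_min_max` and the sample values are hypothetical, not taken from the notebook.

```python
import pandas as pd

def fit_min_max(series):
    """Return the (min, max) pair learned from a reference column."""
    return series.min(), series.max()

def apply_min_max(series, lo, hi, scale=10):
    """Map values to [0, scale] using previously learned bounds."""
    return (series - lo) / (hi - lo) * scale

# Hypothetical reference and new values for a CreditScore-like column
train = pd.Series([350, 600, 850], name='CreditScore')
new = pd.Series([475, 850], name='CreditScore')

lo, hi = fit_min_max(train)                   # bounds learned once
train_scaled = apply_min_max(train, lo, hi)   # scale the reference data
new_scaled = apply_min_max(new, lo, hi)       # reuse the same bounds for new data
print(train_scaled.tolist(), new_scaled.tolist())
```

With the bounds fixed, a given raw value always maps to the same scaled value, whichever data set it appears in.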
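A correlation-coefficient analysis in SPSS, visualized as a heat map, shows that the HasCrCard column (whether the customer holds a credit card) has very little correlation with the long-term loyalty indicator, so this column is dropped before training the model.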
long_data99=long_data0.loc[:,['CustomerId','标准化信用','Gender','离散化年龄','Status','AssetStage','NumOfProducts','IsActiveMember','标准化个人年收入']]
long_data99['Exited']=long_data0['Exited']
long_data99
|  | CustomerId | 标准化信用 | Gender | 离散化年龄 | Status | AssetStage | NumOfProducts | IsActiveMember | 标准化个人年收入 | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15553251 | 7.26 | 1 | 5.0 | 1 | 3 | 1 | 1 | 2.318373 | 1 |
| 1 | 15553256 | 5.38 | 1 | 4.0 | 2 | 2 | 3 | 1 | 3.993573 | 1 |
| 2 | 15553283 | 5.06 | 1 | 4.0 | 2 | 0 | 1 | 0 | 7.234663 | 1 |
| 3 | 15553308 | 4.78 | 1 | 6.0 | 1 | 2 | 1 | 0 | 3.055473 | 1 |
| 4 | 15553387 | 6.74 | 1 | 3.0 | 1 | 2 | 3 | 0 | 9.408872 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9175 | 15815628 | 7.22 | 1 | 3.0 | 2 | 0 | 1 | 0 | 4.011000 | 0 |
| 9176 | 15815645 | 2.62 | 0 | 3.0 | 2 | 3 | 2 | 1 | 8.755319 | 0 |
| 9177 | 15815656 | 3.82 | 1 | 3.0 | 2 | 0 | 1 | 1 | 9.991866 | 1 |
| 9178 | 15815660 | 8.16 | 1 | 3.0 | 1 | 3 | 1 | 1 | 3.036486 | 0 |
| 9179 | 15815690 | 5.28 | 1 | 4.0 | 1 | 0 | 1 | 1 | 3.889666 | 0 |

9180 rows × 10 columns
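The correlation screening that motivated dropping HasCrCard was done in SPSS and is not shown in the notebook. A rough pandas equivalent might look like the sketch below; it assumes a Spearman rank correlation, and the data values and the 0.1 cutoff are made up purely for illustration.

```python
import pandas as pd

# Tiny hypothetical frame; column names mirror the data set, values are invented
df = pd.DataFrame({
    'HasCrCard':      [1, 0, 1, 0, 1, 0, 1, 0],
    'IsActiveMember': [1, 1, 0, 0, 1, 1, 0, 1],
    'Exited':         [0, 0, 1, 1, 0, 0, 1, 1],
})

# Rank correlation of every feature with the target column
corr_with_target = df.corr(method='spearman')['Exited'].drop('Exited')
print(corr_with_target)

# Keep only features whose absolute correlation exceeds a hypothetical cutoff
threshold = 0.1
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
print(selected)
```

In this toy frame HasCrCard is uncorrelated with the target by construction, so only IsActiveMember survives the cutoff; on real data the cutoff would need to be chosen with care.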
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier # random forest classifier from sklearn
from sklearn import metrics # functions for evaluating classification results
from matplotlib import pyplot as plt
x=long_data99.iloc[:,1:-1] # features: every column except CustomerId and Exited
y=long_data99.iloc[:,-1] # label: Exited
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, train_size=0.8)
model = RandomForestClassifier() # instantiate the model
model.fit(x_train, y_train) # train the model on the training set
# evaluate the model on the test set
expected = y_test # expected output for the test samples
predicted = model.predict(x_test) # predictions for the test samples
print(metrics.classification_report(expected, predicted)) # report precision, recall and F1 score
              precision    recall  f1-score   support

           0       0.86      0.94      0.90      1457
           1       0.66      0.43      0.52       379

    accuracy                           0.84      1836
   macro avg       0.76      0.69      0.71      1836
weighted avg       0.82      0.84      0.82      1836
print(metrics.confusion_matrix(expected, predicted)) # confusion matrix
[[1372   85]
 [ 215  164]]
auc_value = metrics.roc_auc_score(y_test, predicted) # ROC AUC, computed here from the hard 0/1 predictions
accuracy = metrics.accuracy_score(y_test, predicted) # accuracy
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Accuracy: 83.66%
# y_test.shape
# x_test.shape
predicted
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
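The ROC AUC above is computed from hard 0/1 predictions, which reduces the ROC curve to a single operating point. The sketch below contrasts that with the more common practice of passing class-1 probabilities from `predict_proba`; it runs on synthetic data generated with `make_classification`, not on the bank data, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the bank data set)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# AUC from hard labels: only one threshold is evaluated
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))
# AUC from class-1 probabilities: every threshold contributes to the curve
auc_from_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(auc_from_labels, auc_from_scores)
```

Fixing `random_state` on the classifier, as done here, also makes repeated runs reproducible.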
5.2
Evaluate the model using the confusion matrix and the F1 score.
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # write each cell's value on top of the heat map
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cnf_matrix = confusion_matrix(expected, predicted) # compute the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes = class_names, title = 'Confusion matrix') # plot the confusion matrix
np.set_printoptions(precision=2)
print('Accuracy:', (cnf_matrix[1,1]+cnf_matrix[0,0])/(cnf_matrix[1,1]+cnf_matrix[0,1]+cnf_matrix[0,0]+cnf_matrix[1,0]))
print('Recall:', cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[1,0]))
print('Precision:', cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[0,1]))
print('Specificity:', cnf_matrix[0,0]/(cnf_matrix[0,1]+cnf_matrix[0,0]))
plt.show()
Confusion matrix, without normalization
[[1372   85]
 [ 215  164]]
Accuracy: 0.8366013071895425
Recall: 0.43271767810026385
Precision: 0.6586345381526104
Specificity: 0.9416609471516816
[Figure: heat map of the confusion matrix (output_24_1.png)]
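The plotting function accepts `normalize=True`, which divides each row of the confusion matrix by its row total. As a small worked sketch using the counts reported above, the row-normalized matrix can be computed directly; the variable names are hypothetical.

```python
import numpy as np

# Counts reported above: rows are true classes, columns are predicted classes
cm = np.array([[1372, 85],
               [215, 164]])

# Divide each row by its total so every row sums to 1
cm_normalized = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]
print(np.round(cm_normalized, 4))
```

The diagonal of the normalized matrix gives the per-class recall: about 0.9417 for class 0 (the specificity printed above) and about 0.4327 for class 1 (the recall printed above).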
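# Because the classes are imbalanced, the per-class scores are averaged with weights proportional to class support, hence average='weighted'.
# The F1 score is a standard metric for classification problems and is often used as the final evaluation measure.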
from sklearn.metrics import f1_score
print(f1_score(expected, predicted, average='weighted'))
0.8231781531939228
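The confusion-matrix heat map and the scores above give an overall accuracy of about 83.7% and a weighted F1 score of about 0.82, so the model performs reasonably well overall. Recall for class 1 is only about 0.43, however, with a precision of about 0.66, so the model is noticeably weaker on that minority class.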
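The weighted F1 score can be reproduced by hand from the confusion matrix, which makes explicit how the class imbalance enters the average. The sketch below uses the counts reported above; the variable names are hypothetical.

```python
import numpy as np

# Counts reported above: rows are true classes, columns are predicted classes
cm = np.array([[1372, 85],
               [215, 164]])

support = cm.sum(axis=1)                   # number of true samples per class
precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions / all predictions of that class
recall = np.diag(cm) / support             # correct predictions / all true samples of that class
f1_per_class = 2 * precision * recall / (precision + recall)

# Weight each class's F1 by its share of the test set
weighted_f1 = np.sum(f1_per_class * support) / support.sum()
print(f1_per_class, weighted_f1)
```

Because class 0 accounts for 1457 of the 1836 test samples, its F1 of about 0.90 dominates the weighted average, while the class-1 F1 of about 0.52 contributes much less.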
5.3
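To predict long-term loyalty for the prediction data set, that data must first go through the same standardization, discretization, and binning steps.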
long_data_test=pd.read_csv('long-customer-test.csv')
long_data0=long_data_test.copy()
long_data0.loc[(long_data0['Age']<18),'离散化年龄']=1
long_data0.loc[(long_data0['Age']>=18) & (long_data0['Age']<30),'离散化年龄']=2
long_data0.loc[(long_data0['Age']>=30) & (long_data0['Age']<40),'离散化年龄']=3
long_data0.loc[(long_data0['Age']>=40) & (long_data0['Age']<50),'离散化年龄']=4
long_data0.loc[(long_data0['Age']>=50) & (long_data0['Age']<60),'离散化年龄']=5
long_data0.loc[(long_data0['Age']>=60) & (long_data0['Age']<70),'离散化年龄']=6
long_data0.loc[(long_data0['Age']>=70),'离散化年龄']=7
long_data0
|  | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | 离散化年龄 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15647311 | 608 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 4.0 |
| 1 | 15737452 | 653 | 0 | 58 | 1 | 132602.88 | 1 | 1 | 0 | 5097.67 | 5.0 |
| 2 | 15577657 | 732 | 0 | 41 | 8 | 0.00 | 2 | 1 | 1 | 170886.17 | 4.0 |
| 3 | 15589475 | 591 | 1 | 39 | 3 | 0.00 | 3 | 1 | 0 | 140469.38 | 3.0 |
| 4 | 15687946 | 556 | 1 | 61 | 2 | 117419.35 | 1 | 1 | 1 | 94153.83 | 6.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 15732202 | 615 | 0 | 34 | 1 | 83503.11 | 2 | 1 | 1 | 73124.53 | 3.0 |
| 996 | 15735078 | 724 | 1 | 53 | 1 | 139687.66 | 2 | 1 | 1 | 12913.92 | 5.0 |
| 997 | 15707861 | 520 | 1 | 46 | 10 | 85216.61 | 1 | 1 | 0 | 117369.52 | 4.0 |
| 998 | 15594612 | 702 | 0 | 44 | 9 | 0.00 | 1 | 0 | 0 | 59207.41 | 4.0 |
| 999 | 15806360 | 609 | 0 | 41 | 6 | 0.00 | 1 | 0 | 1 | 112585.19 | 4.0 |

1000 rows × 11 columns
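# Status from Tenure: '老客户' (long-standing customer) if Tenure > 6, '稳定客户' (stable customer) if Tenure <= 3, otherwise '新客户' (new customer)
long_data0.loc[(long_data0.Tenure>6),'Status']='老客户'
long_data0.loc[(long_data0.Tenure<=3),'Status']='稳定客户'
long_data0.loc[(long_data0.Status.isna()),'Status']='新客户'
# AssetStage from Balance: '高资产' (high), '中上资产' (upper-middle), '中下资产' (lower-middle), '低资产' (low)
long_data0.loc[(long_data0.Balance>120000),'AssetStage']='高资产'
long_data0.loc[(long_data0.Balance>90000) & (long_data0.Balance<=120000),'AssetStage']='中上资产'
long_data0.loc[(long_data0.Balance>50000) & (long_data0.Balance<=90000),'AssetStage']='中下资产'
long_data0.loc[(long_data0.Balance<=50000),'AssetStage']='低资产'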
long_data0
|  | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | 离散化年龄 | Status | AssetStage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15647311 | 608 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 4.0 | 稳定客户 | 中下资产 |
| 1 | 15737452 | 653 | 0 | 58 | 1 | 132602.88 | 1 | 1 | 0 | 5097.67 | 5.0 | 稳定客户 | 高资产 |
| 2 | 15577657 | 732 | 0 | 41 | 8 | 0.00 | 2 | 1 | 1 | 170886.17 | 4.0 | 老客户 | 低资产 |
| 3 | 15589475 | 591 | 1 | 39 | 3 | 0.00 | 3 | 1 | 0 | 140469.38 | 3.0 | 稳定客户 | 低资产 |
| 4 | 15687946 | 556 | 1 | 61 | 2 | 117419.35 | 1 | 1 | 1 | 94153.83 | 6.0 | 稳定客户 | 中上资产 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 15732202 | 615 | 0 | 34 | 1 | 83503.11 | 2 | 1 | 1 | 73124.53 | 3.0 | 稳定客户 | 中下资产 |
| 996 | 15735078 | 724 | 1 | 53 | 1 | 139687.66 | 2 | 1 | 1 | 12913.92 | 5.0 | 稳定客户 | 高资产 |
| 997 | 15707861 | 520 | 1 | 46 | 10 | 85216.61 | 1 | 1 | 0 | 117369.52 | 4.0 | 老客户 | 中下资产 |
| 998 | 15594612 | 702 | 0 | 44 | 9 | 0.00 | 1 | 0 | 0 | 59207.41 | 4.0 | 老客户 | 低资产 |
| 999 | 15806360 | 609 | 0 | 41 | 6 | 0.00 | 1 | 0 | 1 | 112585.19 | 4.0 | 新客户 | 低资产 |

1000 rows × 13 columns
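The Tenure and Balance rules above can be expressed more compactly. The following sketch assumes `np.select` for the Tenure rule and `pd.cut` with right-closed bins for the Balance thresholds; the four sample rows are taken from the table above, and the category strings are kept exactly as in the notebook.

```python
import numpy as np
import pandas as pd

# Four (Tenure, Balance) pairs taken from the table above
df = pd.DataFrame({'Tenure': [1, 8, 6, 10],
                   'Balance': [83807.86, 0.00, 117419.35, 132602.88]})

# Tenure > 6 -> '老客户', Tenure <= 3 -> '稳定客户', everything else -> '新客户'
df['Status'] = np.select([df['Tenure'] > 6, df['Tenure'] <= 3],
                         ['老客户', '稳定客户'], default='新客户')

# Right-closed bins: (-inf, 50000], (50000, 90000], (90000, 120000], (120000, inf)
df['AssetStage'] = pd.cut(df['Balance'],
                          bins=[-np.inf, 50000, 90000, 120000, np.inf],
                          labels=['低资产', '中下资产', '中上资产', '高资产']).astype(str)
print(df)
```

Keeping all thresholds in one list makes it easier to verify that the bins cover every value exactly once.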
from sklearn.preprocessing import LabelEncoder
t1=long_data0.loc[:,'Status'] # LabelEncoder takes labels rather than a feature matrix, so 1-D input is allowed
t2=long_data0.loc[:,'AssetStage']
le1 = LabelEncoder() # instantiate the encoder
le1 = le1.fit(t1) # fit on the data
label1 = le1.transform(t1) # transform returns the integer codes
long_data0.loc[:,"Status"] = label1
print(long_data0['Status'].unique())
le2 = LabelEncoder() # instantiate the encoder
le2 = le2.fit(t2) # fit on the data
label2 = le2.transform(t2) # transform returns the integer codes
long_data0.loc[:,"AssetStage"] = label2
long_data0['AssetStage'].unique()
[1 2 0]
array([1, 3, 2, 0])
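`LabelEncoder` assigns integer codes in the sorted order of the categories it sees during `fit`, so the codes depend on which categories are present. The notebook does not show how the training data was encoded, so the sketch below only illustrates the general point with hypothetical label lists: reusing a fitted encoder keeps codes consistent, while a fresh fit on data lacking a category shifts them.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists; the second one lacks the '新客户' category
train_status = ['老客户', '稳定客户', '新客户', '老客户']
new_status = ['老客户', '稳定客户']

enc = LabelEncoder().fit(train_status)
print(list(enc.classes_))   # categories in sorted order; position = integer code

# Reusing the fitted encoder keeps the training codes
codes_reused = enc.transform(new_status).tolist()
# A fresh fit on the smaller set renumbers the categories
codes_refit = LabelEncoder().fit_transform(new_status).tolist()
print(codes_reused, codes_refit)
```

When every category occurs in both data sets, as appears to be the case for the outputs shown above, separate fits produce the same codes; an explicit mapping dictionary is another way to rule out a mismatch.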
the_min1=long_data0['CreditScore'].min()
the_max1=long_data0['CreditScore'].max()
the_min2=long_data0['EstimatedSalary'].min()
the_max2=long_data0['EstimatedSalary'].max()
# the_min3=long_data0['Balance'].min()
# the_max3=long_data0['Balance'].max()
long_data0['标准化信用']=((long_data0['CreditScore']-the_min1)/(the_max1-the_min1))*10
long_data0['标准化个人年收入']=((long_data0['EstimatedSalary']-the_min2)/(the_max2-the_min2))*10
#long_data0['标准化金融资产']=((long_data0['Balance']-the_min3)/(the_max3-the_min3))*10
long_data0
|  | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | 离散化年龄 | Status | AssetStage | 标准化信用 | 标准化个人年收入 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15647311 | 608 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 4.0 | 1 | 1 | 5.16 | 5.631343 |
| 1 | 15737452 | 653 | 0 | 58 | 1 | 132602.88 | 1 | 1 | 0 | 5097.67 | 5.0 | 1 | 3 | 6.06 | 0.250472 |
| 2 | 15577657 | 732 | 0 | 41 | 8 | 0.00 | 2 | 1 | 1 | 170886.17 | 4.0 | 2 | 2 | 7.64 | 8.553206 |
| 3 | 15589475 | 591 | 1 | 39 | 3 | 0.00 | 3 | 1 | 0 | 140469.38 | 3.0 | 1 | 2 | 4.82 | 7.029924 |
| 4 | 15687946 | 556 | 1 | 61 | 2 | 117419.35 | 1 | 1 | 1 | 94153.83 | 6.0 | 1 | 0 | 4.12 | 4.710429 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 15732202 | 615 | 0 | 34 | 1 | 83503.11 | 2 | 1 | 1 | 73124.53 | 3.0 | 1 | 1 | 5.30 | 3.657276 |
| 996 | 15735078 | 724 | 1 | 53 | 1 | 139687.66 | 2 | 1 | 1 | 12913.92 | 5.0 | 1 | 3 | 7.48 | 0.641911 |
| 997 | 15707861 | 520 | 1 | 46 | 10 | 85216.61 | 1 | 1 | 0 | 117369.52 | 4.0 | 2 | 1 | 3.40 | 5.873077 |
| 998 | 15594612 | 702 | 0 | 44 | 9 | 0.00 | 1 | 0 | 0 | 59207.41 | 4.0 | 2 | 2 | 7.04 | 2.960302 |
| 999 | 15806360 | 609 | 0 | 41 | 6 | 0.00 | 1 | 0 | 1 | 112585.19 | 4.0 | 0 | 2 | 5.18 | 5.633476 |

1000 rows × 15 columns
The processed result is as follows:
long_data99=long_data0.loc[:,['CustomerId','标准化信用','Gender','离散化年龄','Status','AssetStage','NumOfProducts','IsActiveMember','标准化个人年收入']]
long_data99
|  | CustomerId | 标准化信用 | Gender | 离散化年龄 | Status | AssetStage | NumOfProducts | IsActiveMember | 标准化个人年收入 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15647311 | 5.16 | 1 | 4.0 | 1 | 1 | 1 | 1 | 5.631343 |
| 1 | 15737452 | 6.06 | 0 | 5.0 | 1 | 3 | 1 | 0 | 0.250472 |
| 2 | 15577657 | 7.64 | 0 | 4.0 | 2 | 2 | 2 | 1 | 8.553206 |
| 3 | 15589475 | 4.82 | 1 | 3.0 | 1 | 2 | 3 | 0 | 7.029924 |
| 4 | 15687946 | 4.12 | 1 | 6.0 | 1 | 0 | 1 | 1 | 4.710429 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 15732202 | 5.30 | 0 | 3.0 | 1 | 1 | 2 | 1 | 3.657276 |
| 996 | 15735078 | 7.48 | 1 | 5.0 | 1 | 3 | 2 | 1 | 0.641911 |
| 997 | 15707861 | 3.40 | 1 | 4.0 | 2 | 1 | 1 | 0 | 5.873077 |
| 998 | 15594612 | 7.04 | 0 | 4.0 | 2 | 2 | 1 | 0 | 2.960302 |
| 999 | 15806360 | 5.18 | 0 | 4.0 | 0 | 2 | 1 | 1 | 5.633476 |

1000 rows × 9 columns
Use the model to predict on the data:
x_test=long_data99.iloc[:,1:] # features: every column except CustomerId
predicted = model.predict(x_test) # predictions for the new samples
len(predicted)
1000
long_data99['Exited']=predicted
long_data99
|  | CustomerId | 标准化信用 | Gender | 离散化年龄 | Status | AssetStage | NumOfProducts | IsActiveMember | 标准化个人年收入 | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15647311 | 5.16 | 1 | 4.0 | 1 | 1 | 1 | 1 | 5.631343 | 1 |
| 1 | 15737452 | 6.06 | 0 | 5.0 | 1 | 3 | 1 | 0 | 0.250472 | 1 |
| 2 | 15577657 | 7.64 | 0 | 4.0 | 2 | 2 | 2 | 1 | 8.553206 | 0 |
| 3 | 15589475 | 4.82 | 1 | 3.0 | 1 | 2 | 3 | 0 | 7.029924 | 1 |
| 4 | 15687946 | 4.12 | 1 | 6.0 | 1 | 0 | 1 | 1 | 4.710429 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 15732202 | 5.30 | 0 | 3.0 | 1 | 1 | 2 | 1 | 3.657276 | 0 |
| 996 | 15735078 | 7.48 | 1 | 5.0 | 1 | 3 | 2 | 1 | 0.641911 | 0 |
| 997 | 15707861 | 3.40 | 1 | 4.0 | 2 | 1 | 1 | 0 | 5.873077 | 0 |
| 998 | 15594612 | 7.04 | 0 | 4.0 | 2 | 2 | 1 | 0 | 2.960302 | 1 |
| 999 | 15806360 | 5.18 | 0 | 4.0 | 0 | 2 | 1 | 1 | 5.633476 | 0 |

1000 rows × 10 columns
result=long_data99.loc[:,['CustomerId','Exited']]
result.set_index("CustomerId",inplace=True)
result.to_excel("result5.xlsx", engine='openpyxl')  # openpyxl is a writer engine, not an encoding
The predictions for the 5 specified customer IDs are shown below:
result1=result[result.index.isin([15579131,15674442,15719508,15730076,15792228])].sort_index()
result1
| CustomerId | Exited |
| --- | --- |
| 15579131 | 0 |
| 15674442 | 0 |
| 15719508 | 1 |
| 15730076 | 0 |
| 15792228 | 1 |