ValueError: X.shape[1] = 2 should be equal to 13, the number of features at training time

2024-06-24

I am trying to predict lung cancer with scikit-learn's SVM classifier using the code below, but I am getting an error. I used matplotlib.pyplot as plt to plot the data, and that is where the error occurs.

Here I am using the lung cancer risk factors as the features.

Input file

GENDER  AGE SMOKING YELLOW_FINGERS  ANXIETY PEER_PRESSURE   CHRONIC DISEASE FATIGUE     ALLERGY     WHEEZING    ALCOHOL CONSUMING   COUGHING    SHORTNESS OF BREATH SWALLOWING DIFFICULTY   CHEST PAIN  LUNG_CANCER
F   59  0   0   0   1   0   1   0   1   0   1   1   0   1   0
F   63  0   1   0   0   0   0   0   1   0   1   1   0   0   0
F   75  0   1   0   0   1   1   1   1   0   1   1   0   0   1
M   69  0   1   1   0   0   1   0   1   1   1   1   1   1   1
M   74  1   0   0   0   1   1   1   0   0   0   1   1   1   1
M   63  1   1   1   0   0   0   0   0   1   0   0   1   1   0

Support Vector Machine script

# Support Vector Machine (SVM)

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('C:/Users/Vishnu/Desktop/Lung Cancer/lung_cancer.csv')
X = dataset.iloc[:, [2,3,4,5,6,7,8,9,10,11,12,13,14]].values
y = dataset.iloc[:, 15].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # moved from sklearn.cross_validation in scikit-learn >= 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Lung Cancer Risk Factor')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Lung Cancer Risk Factor')
plt.legend()
plt.show()

Error

ValueError: X.shape[1] = 2 should be equal to 13, the number of features at training time

This is the line where I get the error:

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
         alpha = 0.75, cmap = ListedColormap(('red', 'green')))

Why am I getting this error? Please advise. Thanks in advance.

Edit_1

[SVM test set output plot]

[SVM training set output plot]

Can anyone tell me whether this is the correct output?

Thanks in advance.


Regardless of the exception itself, I think there are a few things that need to be addressed.

  1. The exception itself is caused by the fact that you provide only 2 variables as input to classifier.predict, while your model was trained on 13 variables. If you want to plot contours over 2 of them, you have to set the other 11 variables to some default value.

    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                         np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
    Xpred = np.array([X1.ravel(), X2.ravel()] + [np.repeat(0, X1.ravel().size) for _ in range(11)]).T
    # Xpred now has a grid for x1 and x2 and average value (0) for x3 through x13
    pred = classifier.predict(Xpred).reshape(X1.shape)   # is a matrix of 0's and 1's !
    plt.contourf(X1, X2, pred,
                 alpha = 0.75, cmap = ListedColormap(('red', 'green')))
    

    This snippet will work, but it will probably not give you what you want. With some random binomial data you get a two-tone red-green plot like the one below, because the output of SVC.predict is a binary matrix of class labels, not probabilities. [SVM prediction (binary)] If you want actual class probabilities instead, see the predict_proba sketch after this list.

  2. You could plot the decision_function (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.decision_function) instead of the prediction, which visualises the distance to the separating hyperplane. This could be interpreted as a risk factor, although it is not a probability.

    pred = classifier.decision_function(Xpred).reshape(X1.shape)    
    plt.contourf(X1, X2, pred,
                 alpha=1.0, cmap="RdYlGn", levels=np.linspace(pred.min(), pred.max(), 100))
    
  3. I see another issue with your dataset: it appears to have 15 columns, so I would expect the line y = dataset.iloc[:, 15].values to raise an IndexError. If it does not, check the integrity of your dataset. Was it imported correctly with pd.read_csv (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)? A quick check is sketched after this list.

  4. Also, you are throwing away the information in the first two columns: gender and age. For gender you could, for example, map F to 0 and M to 1, and include age in X as well:

    dataset = pd.read_csv('C:/Users/Vishnu/Desktop/Lung Cancer/lung_cancer.csv')
    dataset.loc[dataset['GENDER'] == 'F', 'GENDER'] = 0
    dataset.loc[dataset['GENDER'] == 'M', 'GENDER'] = 1
    X = dataset.iloc[:, 0:14].values
    y = dataset.iloc[:, 14].values
    
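Regarding the probability question from point 1: if you want the grid coloured by class probability rather than by the hard 0/1 prediction or the raw decision-function value, SVC can estimate probabilities (via Platt scaling) when it is constructed with probability=True. The following is only a minimal sketch under that assumption; it reuses X_train, y_train, X1, X2 and the padded grid Xpred from the snippet in point 1 and is not part of your original code.

    from sklearn.svm import SVC

    # Refit with probability estimates enabled (Platt scaling); training is slower.
    prob_classifier = SVC(kernel='linear', probability=True, random_state=0)
    prob_classifier.fit(X_train, y_train)

    # Probability of the positive class on the padded grid from point 1.
    proba = prob_classifier.predict_proba(Xpred)[:, 1].reshape(X1.shape)
    plt.contourf(X1, X2, proba, alpha=1.0, cmap='RdYlGn',
                 levels=np.linspace(0.0, 1.0, 100))
    plt.colorbar(label='P(LUNG_CANCER = 1)')
    plt.show()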
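And a quick integrity check for point 3: before indexing columns by position, it helps to print what pandas actually parsed. This is just a sketch using the file path from your question; adjust the path (and the sep argument if your file is not comma-separated).

    import pandas as pd

    dataset = pd.read_csv('C:/Users/Vishnu/Desktop/Lung Cancer/lung_cancer.csv')

    print(dataset.shape)              # (rows, columns) -- how many columns were parsed
    print(dataset.columns.tolist())   # the column names as pandas sees them
    print(dataset.dtypes)             # GENDER should be object/str, the rest numeric
    print(dataset.head())             # a wrong delimiter is usually obvious here

    # Selecting the label by name is safer than by position:
    y = dataset['LUNG_CANCER'].values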

I hope this helps. If another question comes up while you work towards the solution you want and you cannot find the answer through your own research, feel free to ask :)

EDIT

To address your second question about the correctness of the scatter plot: I do not know how you produced your figure, but with your scatter-plot code, drawn on top of the decision function, I get the following result (with the lung cancer data you provided: https://drive.google.com/file/d/1DGe3ZKeoW7UsGGq-lPQZmK0_8UXK1D6I/view).

y is a binary variable, which is why np.unique(y_set) is simply [0, 1]. I do not see how you could get the columnar structure of data points with this code. Sorry, but I do not even know what you are actually trying to achieve with this plot, so I cannot judge whether it shows what you want to show.
