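This kind of thing can usually be done with sklearn.cross_validation.LeaveOneLabelOut. You just need to construct a label vector that encodes your groups. That is, all samples in K1 get label 1, all samples in K2 get label 2, and so on.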
Here is a fully runnable example with fake data. The important lines are the one that creates the cv object and the call to cross_val_score.
import numpy as np
n_features = 10
# Make some data
A = np.random.randn(3, n_features)
B = np.random.randn(5, n_features)
C = np.random.randn(4, n_features)
D = np.random.randn(7, n_features)
E = np.random.randn(9, n_features)
# Group it
K1 = np.concatenate([A, B])
K2 = np.concatenate([C, D])
K3 = E
data = np.concatenate([K1, K2, K3])
# Make some dummy prediction target
target = np.random.randn(len(data)) > 0
# Make the corresponding labels
labels = np.concatenate([[i] * len(K) for i, K in enumerate([K1, K2, K3])])
from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score
cv = LeaveOneLabelOut(labels)
# Use some classifier in crossvalidation on data
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
scores = cross_val_score(lr, data, target, cv=cv)
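Note that the sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20. On newer releases, a rough equivalent of the example above is sketched below. It assumes sklearn.model_selection.LeaveOneGroupOut as the replacement splitter, with the label vector passed through the groups argument of cross_val_score. The group sizes mirror K1, K2 and K3, and the alternating target is just a deterministic dummy so that every training fold contains both classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)
n_features = 10

# Sizes of the three groups K1, K2, K3 from the example above
group_sizes = [8, 11, 9]
data = rng.randn(sum(group_sizes), n_features)

# Deterministic dummy target: alternating 0/1 labels
target = np.arange(len(data)) % 2

# One group id per sample: 0 for K1, 1 for K2, 2 for K3
groups = np.repeat(np.arange(len(group_sizes)), group_sizes)

# The splitter takes no arguments; the groups are supplied at scoring time
cv = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(), data, target, groups=groups, cv=cv)
print(scores)
```

Each entry of scores corresponds to one held-out group, so there are as many scores as there are distinct group ids.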
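However, you may of course run into a situation where you want to define the folds entirely by hand. In that case, you need to create an iterable (e.g. a list) of (train, test) pairs that indicate, by index, which samples go into the training and test sets of each fold. Let's check this: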
# create train and test folds from our labels:
cv_by_hand = [(np.where(labels != label)[0], np.where(labels == label)[0])
              for label in np.unique(labels)]
# We check this against our existing cv by converting the latter to a list
cv_to_list = list(cv)
print(cv_by_hand)
print(cv_to_list)
# Check equality
for (train1, test1), (train2, test2) in zip(cv_by_hand, cv_to_list):
    assert (train1 == train2).all() and (test1 == test2).all()
# Use the created cv_by_hand in cross validation
scores2 = cross_val_score(lr, data, target, cv=cv_by_hand)
# assert equality again
assert (scores == scores2).all()
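For newer scikit-learn versions, where sklearn.cross_validation is no longer available, the same hand-built folds can be compared against a splitter from sklearn.model_selection. The following is a minimal sketch under that assumption: it uses LeaveOneGroupOut and its split method, and the placeholder feature matrix X is hypothetical, since only its number of rows matters for splitting.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Group ids for 8, 11 and 9 samples (the sizes of K1, K2, K3)
groups = np.repeat([0, 1, 2], [8, 11, 9])

# Placeholder features: only the number of rows is used when splitting
X = np.zeros((len(groups), 1))

# Hand-made folds: train on every other group, test on the current one
folds_by_hand = [(np.where(groups != g)[0], np.where(groups == g)[0])
                 for g in np.unique(groups)]

# Folds produced by the splitter, materialized as a list of (train, test) pairs
folds_from_splitter = list(LeaveOneGroupOut().split(X, groups=groups))

# Both constructions should yield the same index arrays, fold by fold
for (train1, test1), (train2, test2) in zip(folds_by_hand, folds_from_splitter):
    assert np.array_equal(train1, train2) and np.array_equal(test1, test2)
```

A list such as folds_by_hand can be passed directly as the cv argument of cross_val_score, just like cv_by_hand above.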