A problem that bothered me for two days: why do datasets split with StratifiedShuffleSplit and train_test_split train to different results?
A frustrating problem
Recently, while working on a classification task with a convolutional model, I ran into what looks like a bug in the StratifiedShuffleSplit function.
As we all know, before training a model we usually split the dataset into a training set and a validation set so that the model's performance can be evaluated later. In my recent experiments, however, I found that splitting the dataset with StratifiedShuffleSplit versus train_test_split led to a huge difference in training results (the former reached about 95% validation accuracy, while the latter only reached about 80%).
Version information
sklearn 1.2.2
torch 2.0.0
cuda 11.8
python 3.9
Setting up synthetic data
import numpy as np

x = np.random.rand(10000, 1)
y1 = np.ones(5000)
y0 = np.zeros(5000)
y = np.concatenate([y1,y0], axis=0)
print(x.shape)
print(y.shape)
# (10000, 1)
# (10000,)
Splitting with train_test_split
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(x, y, test_size=0.2, random_state=30, stratify=y)
stratify=y is set so that, like StratifiedShuffleSplit, this split is also stratified, keeping the two methods comparable.
import torch

train_subset_x = torch.FloatTensor(train_data)
train_subset_y = torch.LongTensor(train_labels)
valid_subset_x = torch.FloatTensor(test_data)
valid_subset_y = torch.LongTensor(test_labels)
from collections import Counter
print(Counter([np.int32(train_subset_y[i]) for i in range(len(train_subset_y))]))
print(Counter([np.int32(valid_subset_y[i]) for i in range(len(valid_subset_y))]))
# Counter({0: 4000, 1: 4000})
# Counter({1: 1000, 0: 1000})
Splitting with StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit
from collections import Counter
from torch.utils.data.dataset import Subset
def generate_train_indices(n_splits, ratio, data, lab):
    # ss = StratifiedShuffleSplit(n_splits=n_splits, train_size=ratio, random_state=20)
    ss = StratifiedShuffleSplit(n_splits=n_splits, train_size=ratio)
    return [i.tolist() for i, _ in ss.split(data, lab)], [j.tolist() for _, j in ss.split(data, lab)]
train_indices, valid_indices = generate_train_indices(1, 0.8, x, y)
print(Counter([y[i] for i in train_indices[0]]))
print(Counter([y[i] for i in valid_indices[0]]))
# Counter({0: 4000, 1: 4000})
# Counter({1: 1000, 0: 1000})
With the datasets created by these two methods, the same model trained in exactly the same way gives completely different results: about 80% for the former and about 95% for the latter. I then set StratifiedShuffleSplit's random_state to a fixed number, for example:
from sklearn.model_selection import StratifiedShuffleSplit
from collections import Counter
from torch.utils.data.dataset import Subset
def generate_train_indices(n_splits, ratio, data, lab):
    ss = StratifiedShuffleSplit(n_splits=n_splits, train_size=ratio, random_state=20)
    # ss = StratifiedShuffleSplit(n_splits=n_splits, train_size=ratio)
    return [i.tolist() for i, _ in ss.split(data, lab)], [j.tolist() for _, j in ss.split(data, lab)]
train_indices, valid_indices = generate_train_indices(1, 0.8, x, y)
With this change, the experimental results became identical to those obtained with the train_test_split split. Isn't that strange?
After trying all kinds of ablation experiments, I still couldn't find the cause. Then it occurred to me that the validation data might be leaking, so I printed the sampling results produced without setting random_state:
from sklearn.model_selection import StratifiedShuffleSplit
from collections import Counter
from torch.utils.data.dataset import Subset
def generate_train_indices(n_splits, ratio, data, lab):
    # ss = StratifiedShuffleSplit(n_splits=n_splits, train_size=ratio, random_state=20)
    ss = StratifiedShuffleSplit(n_splits=n_splits, train_size=ratio)
    return [i.tolist() for i, _ in ss.split(data, lab)], [j.tolist() for _, j in ss.split(data, lab)]
train_indices, valid_indices = generate_train_indices(1, 0.8, x, y)
l = []
# collect every index that appears in both the train and validation index lists
for i in train_indices[0]:
    for j in valid_indices[0]:
        if i == j:
            l.append(i)
# print(l)
print(len(l))
# 1602
It turns out that the training set and the validation set actually share 1602 identical samples. Once I set the random_state parameter, the overlap disappeared entirely.
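For reference, the same overlap check can be written more compactly with a set intersection; this is just a sketch of my own, equivalent to the nested loop above:

# count the indices shared by the two index lists returned above
overlap = set(train_indices[0]) & set(valid_indices[0])
print(len(overlap))
# 1602 without random_state; 0 once random_state is fixed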
From what I could find online, random_state is only supposed to fix a random seed and should not otherwise affect how StratifiedShuffleSplit behaves. What I actually observe, though, is that with random_state set the stratified split is without replacement, whereas without random_state it behaves as if the stratified sampling were done with replacement.
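To sanity-check that observation, here is a minimal sketch of my own (not part of the original experiment; the helper name first_train_indices is made up) that calls split twice on the same splitter and compares the returned index arrays, with and without random_state:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

x = np.random.rand(10000, 1)
y = np.concatenate([np.ones(5000), np.zeros(5000)])

def first_train_indices(ss):
    # take the train indices of the first (and only) split
    return next(iter(ss.split(x, y)))[0]

# without random_state, each call to split() draws a fresh shuffle,
# so two calls on the same splitter give different index sets
ss_no_seed = StratifiedShuffleSplit(n_splits=1, train_size=0.8)
a = first_train_indices(ss_no_seed)
b = first_train_indices(ss_no_seed)
print(np.array_equal(np.sort(a), np.sort(b)))  # expected: False

# with random_state fixed, every call to split() reproduces the same shuffle
ss_seeded = StratifiedShuffleSplit(n_splits=1, train_size=0.8, random_state=20)
a = first_train_indices(ss_seeded)
b = first_train_indices(ss_seeded)
print(np.array_equal(np.sort(a), np.sort(b)))  # expected: True

If that is really what is happening, then the two separate ss.split calls inside generate_train_indices would be drawing two independent shuffles whenever random_state is left unset, which could explain the overlap.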
So is this a bug in StratifiedShuffleSplit? Is the function simply designed to work this way? Or is there a problem in my code? I'd be grateful if someone could explain.