Another solution is to use memory-mapped tensors. It is similar to the other solution (https://stackoverflow.com/a/64408076/2790047), but in my opinion it is nicer because it abstracts away the direct interaction with the binary data and operates at a higher level of abstraction.
Every tensor stores its data using a Storage object. This mechanism lets us define a memory-mapped storage with FloatStorage.from_file (https://pytorch.org/docs/master/storage.html#torch.FloatStorage.from_file). Using memory-mapped tensors allows us to write the dataset to disk and then read from it as if it were an ordinary tensor of shape (3600000, 32, 30), without ever holding all of that data in RAM.
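As a minimal illustration of the Storage mechanism (the file name small.bin and the tiny sizes here are just placeholders for the sake of the example):

import torch

# a memory-mapped storage holding 2 * 3 floats, backed by small.bin
storage = torch.FloatStorage.from_file('small.bin', shared=True, size=6)
t = torch.FloatTensor(storage).reshape(2, 3)

# the tensor's data lives in the memory-mapped file rather than ordinary RAM;
# with shared=True this write is reflected in small.bin
t[0, 0] = 1.0
print(t.storage().size())  # 6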
For example, we could write the dataset to disk with something like:
import torch
from tqdm import tqdm

filename = 'data.bin'
num_samples = 3600000
rows, cols = 32, 30

# shared=True allows us to save the tensor to disk as we perform in-place modifications to it
samples = torch.FloatTensor(torch.FloatStorage.from_file(filename, shared=True, size=num_samples * rows * cols)).reshape(num_samples, rows, cols)

for idx in tqdm(range(num_samples)):
    # placeholder random samples, insert your actual samples here;
    # every in-place assignment to samples is automatically reflected on disk
    samples[idx] = torch.randn(rows, cols)
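As an optional sanity check (a small sketch, reusing filename, num_samples, rows and cols from above), the file on disk should occupy num_samples * rows * cols * 4 bytes, since each float32 element takes 4 bytes:

import os

expected_bytes = num_samples * rows * cols * 4  # float32 -> 4 bytes per element
print(os.path.getsize(filename), expected_bytes)  # both should be 13824000000 (~12.9 GiB)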
The nice thing about this is that it is compatible with the built-in TensorDataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset):
import torch
from tqdm import tqdm
from torch.utils.data import TensorDataset, DataLoader

filename = 'data.bin'
num_samples = 3600000
rows, cols = 32, 30

# shared=False prevents changes to samples from affecting the data on disk
samples = torch.FloatTensor(torch.FloatStorage.from_file(filename, shared=False, size=num_samples * rows * cols)).reshape(num_samples, rows, cols)

dataset = TensorDataset(samples)
loader = DataLoader(dataset, batch_size=256, num_workers=0)

for batch in tqdm(loader):
    # batch is a list holding a single (256, 32, 30) tensor
    pass
100%|██████████| 14063/14063 [00:11<00:00, 1216.80it/s]
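Because samples is backed by the memory-mapped file, you can also index into it directly and only the pages you actually touch are read from disk (a rough sketch, reusing the samples tensor from above):

# random access into the memory-mapped tensor; indexing returns views, no full load into RAM
one_sample = samples[1_000_000]          # shape (32, 30)
a_slice = samples[2_000_000:2_000_256]   # shape (256, 32, 30)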