With tensorflow 2.x.x
您可以使用数据集 API `tf.data.Dataset.from_generator`
从生成器函数创建数据集。该生成器函数将通过 numpy memmap 完成读取工作。
下面的代码创建一个虚拟数据文件,然后从磁盘上的文件中一次读取一个示例。它可以轻松更新以读取多个示例以增加 IO 吞吐量(如果您需要在下面的代码示例中实现这一点,请告诉我)。
# imports
import numpy as np
import pathlib
import tensorflow as tf
# create huge numpy array and save it to disk
file = pathlib.Path("huge_data.npy")
examples = 5000                       # number of examples in the dataset
example_shape = (256, 256)            # shape of one example
huge_data_shape = (examples, *example_shape)
huge_data_dtype = np.float64          # NOTE: 5000 * 256 * 256 * 8 bytes ≈ 2.6 GB on disk
# create file if does not exist
if not file.is_file():
    print("creating file with random data and saving to disk")
    numpy_data = np.random.rand(*huge_data_shape).astype(huge_data_dtype)
    np.save(file, numpy_data)
# memmap the file: mmap_mode='r' maps the array read-only from disk,
# so the full array is never loaded into RAM at once
numpy_data_memmap = np.load(file, mmap_mode='r')
# generator function handed to tf.data: yields one example (one 2-D slice
# of the memmapped array) at a time, keeping reads lazy
def data_generator():
    """Lazily yield examples from the on-disk memmap, one slice per step."""
    yield from numpy_data_memmap
# build a tf.data pipeline that pulls examples from the generator fn
# NOTE(review): output_types/output_shapes are deprecated in newer TF 2.x
# releases in favor of output_signature=tf.TensorSpec(...) — consider
# switching if the minimum supported TF version allows it
dataset = tf.data.Dataset.from_generator(
    data_generator,
    huge_data_dtype,
    example_shape,
)

# consume the huge dataset one example at a time
for index, example in enumerate(dataset):
    print(index, example.shape, example.dtype)
Output:
0 (256, 256) <dtype: 'float64'>
1 (256, 256) <dtype: 'float64'>
2 (256, 256) <dtype: 'float64'>
3 (256, 256) <dtype: 'float64'>
...
4995 (256, 256) <dtype: 'float64'>
4996 (256, 256) <dtype: 'float64'>
4997 (256, 256) <dtype: 'float64'>
4998 (256, 256) <dtype: 'float64'>
4999 (256, 256) <dtype: 'float64'>