I'm deserializing large numpy arrays (500 MB in this example), and I'm seeing order-of-magnitude differences between methods. Below are the three approaches I timed.
I'm receiving the data from the multiprocessing.shared_memory package, so it arrives as a memoryview object. In these simplified examples, though, I just pre-create a bytes array to run the tests.
I'd like to know whether any of these methods are doing something wrong, or whether there are other techniques I haven't tried. Deserialization is a genuinely tricky problem in Python if you want to move data quickly without holding the GIL just for IO. A good explanation of why these methods differ so much would also make a good answer.
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results:
Deserialize using pickle...
Time: 0.2509949207 sec
Deserialize from bytes...
Time: 0.0204288960 sec
Deserialize using numpy load from BytesIO...
Time: 28.9850852489 sec
The second option is the fastest, but clearly less elegant, because I need to serialize the shape and dtype information explicitly.
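One way to keep the speed of the raw-bytes option while making the payload self-describing is to prepend a small header carrying the dtype and shape. The pack/unpack helpers below are hypothetical (my own sketch, not part of numpy); np.frombuffer with an offset then reconstructs the array without copying the payload:

```python
import struct
import numpy as np

def pack(arr):
    # Header: [dtype-string length][dtype string][ndim][shape as int64s],
    # followed by the raw array bytes.
    dt = arr.dtype.str.encode()
    header = struct.pack("<B", len(dt)) + dt
    header += struct.pack("<B", arr.ndim)
    header += struct.pack(f"<{arr.ndim}q", *arr.shape)
    return header + arr.tobytes()

def unpack(buf):
    n = buf[0]
    dt = np.dtype(buf[1:1 + n].decode())
    pos = 1 + n
    ndim = buf[pos]
    pos += 1
    shape = struct.unpack_from(f"<{ndim}q", buf, pos)
    pos += 8 * ndim
    # frombuffer with offset reuses the payload bytes; no copy is made.
    return np.frombuffer(buf, dtype=dt, offset=pos).reshape(shape)

a = np.arange(12, dtype=np.uint8).reshape(3, 4)
b = unpack(pack(a))
```

This is essentially what the .npy format does for you, just with a much smaller and simpler header.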
I found your question useful while looking for the best numpy serialization myself, and I confirmed that np.load() is the best option, except that it is beaten by pyarrow in my additional test below. Arrow is now a very popular data-serialization framework for distributed computing (e.g. Spark, ...).
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
import pyarrow as pa
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
pa_buf = pa.serialize(sample).to_buffer()
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize pyarrow')
t0 = time.time()
restored_data = pa.deserialize(pa_buf)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results on an i3.2xlarge, Databricks Runtime 8.3 ML, Python 3.8, Numpy 1.19.2, Pyarrow 1.0.1:
Deserialize using pickle...
Time: 0.4069395065 sec
Deserialize from bytes...
Time: 0.0281322002 sec
Deserialize using numpy load from BytesIO...
Time: 0.3059172630 sec
Deserialize pyarrow
Time: 0.0031735897 sec
Your BytesIO result is about 100× mine; I don't know why.
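One more option that neither benchmark above times (my own addition, not from the original posts): if the array can live in an actual .npy file rather than a BytesIO, np.load with mmap_mode returns a memory-mapped view, so pages are faulted in lazily on first access instead of being read and copied up front:

```python
import os
import tempfile
import numpy as np

arr = np.random.randint(0, 255, size=1_000_000, dtype=np.uint8)
path = os.path.join(tempfile.mkdtemp(), "sample.npy")
np.save(path, arr)

# mmap_mode="r" maps the file read-only; np.load returns immediately
# and the OS pages data in only as it is touched.
loaded = np.load(path, mmap_mode="r")
```

For shared-memory pipelines this matters less, since the data is already in RAM, but for disk-backed arrays it can make the "load" step nearly free.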