The system I am working with currently processes large (> 5 GB) .csv files. To improve performance, I am testing (A) different ways of creating a dataframe from disk (pandas vs. dask, http://pythondata.com/dask-large-csv-python/) and (B) different ways of storing results to disk (.csv vs. HDF5 files, https://dzone.com/articles/quick-hdf5-pandas).
To benchmark performance, I did the following:
import gc

import dask.dataframe as dd
import pandas as pd

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()
print("dask hdf performance")
%timeit dask_read_from_hdf()
gc.collect()
print("")
print("pandas hdf performance")
%timeit pandas_read_from_hdf()
gc.collect()
print("")
print("dask csv performance")
%timeit dask_read_from_csv()
gc.collect()
print("")
print("pandas csv performance")
%timeit pandas_read_from_csv()
gc.collect()
My findings are:
dask hdf performance
10 loops, best of 3: 133 ms per loop
pandas hdf performance
1 loop, best of 3: 1.42 s per loop
dask csv performance
1 loop, best of 3: 7.88 ms per loop
pandas csv performance
1 loop, best of 3: 827 ms per loop
If an HDF5 store is faster to access than .csv, and dask creates dataframes faster than pandas, why is dask-from-HDF5 slower than dask-from-csv? Am I doing something wrong?
When does it make sense, performance-wise, to create a dask dataframe from an HDF5 store object?