HDF5 - concurrency, compression & I/O performance [closed]


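I have the following questions about HDF5 performance and concurrency:

  1. Does HDF5 support concurrent write access?
  2. Concurrency aside, how does HDF5 perform in terms of I/O (does the compression rate affect performance)?
  3. Since I use HDF5 with Python, how does its performance compare to Sqlite?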


  • http://www.sqlite.org/faq.html#q5
  • Locking sqlite file on NFS filesystem possible? https://stackoverflow.com/questions/9907429/locking-sqlite-file-on-nfs-filesystem-possible
  • http://pandas.pydata.org/

Updated to use pandas 0.13.1

  1. No. See http://pandas.pydata.org/pandas-docs/dev/io.html#notes-caveats. There are various ways to work around this, e.g. have your different threads/processes write out their computation results, then have a single process combine them.

  2. Depending on the type of data you store, how you store it, and how you want to retrieve it, HDF5 can offer vastly better performance. Float data stored in an HDFStore as a single compressed array (in other words, not in a format that allows for querying) is written and read amazingly fast. Even the table format, which slows down writes, still offers quite good write performance. For detailed comparisons, see PyTables, which is what HDFStore uses under the hood: http://www.pytables.org/

Since PyTables 2.3, queries are indexed, so performance is actually much better than those comparisons suggest.
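To make the fixed versus table distinction concrete, here is a minimal sketch, assuming a recent pandas with PyTables installed. The helper names `roundtrip` and `query_table`, the file names, and the choice of `zlib` at a given `complevel` are illustrative assumptions, not part of the original answer.

```python
import os

import numpy as np
import pandas as pd


def roundtrip(df, fmt, directory, complevel=None):
    """Write df in the given format ('fixed' or 'table'), read it back.

    Returns the frame that was read and the file size in bytes.
    """
    path = os.path.join(directory, f"demo_{fmt}.h5")  # hypothetical file name
    kwargs = {}
    if complevel is not None:
        # Compression is optional; zlib is one of the supported libraries.
        kwargs.update(complevel=complevel, complib="zlib")
    df.to_hdf(path, key="test", mode="w", format=fmt, **kwargs)
    return pd.read_hdf(path, key="test"), os.path.getsize(path)


def query_table(df, directory):
    """Store df in table format and read back only the rows where A > 0."""
    path = os.path.join(directory, "demo_query.h5")  # hypothetical file name
    # data_columns makes column A usable in a where= expression;
    # the fixed format does not support this kind of query.
    df.to_hdf(path, key="test", mode="w", format="table", data_columns=["A"])
    return pd.read_hdf(path, key="test", where="A > 0")
```

Comparing the sizes returned by `roundtrip` at different `complevel` values, together with your own timings, is one way to see how much compression costs or saves for a given dataset.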

So, to answer your question: if you want any kind of performance, HDF5 is the way to go.
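On the concurrency point from item 1, one variation of the "many workers, one writer" idea can be sketched as follows. This is only an illustration under assumptions: the function names, the random data standing in for real computation, and the output file name are all hypothetical, and here the workers hand their results back in memory rather than writing intermediate files.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def compute_chunk(seed, rows=1000):
    """Stand-in for real work: each worker builds its own DataFrame."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((rows, 2)), columns=list("AB"))


def compute_all(seeds, rows=1000, max_workers=4):
    """Run the workers concurrently; none of them touches the HDF5 file."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(compute_chunk, seed, rows) for seed in seeds]
        parts = [future.result() for future in futures]  # keeps seed order
    return pd.concat(parts, ignore_index=True)


def write_combined(df, path="combined.h5", key="results"):
    """Single writer: only this function opens the file for writing."""
    df.to_hdf(path, key=key, mode="w", format="table")
```

Calling `write_combined(compute_all(range(4)))` from the main thread keeps all HDF5 writes in one place, which is the property the workaround relies on.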


In [14]: %timeit test_sql_write(df)
1 loops, best of 3: 6.24 s per loop

In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop

In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop

In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop


In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop
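To put the timings above side by side, the following short sketch converts them to seconds and expresses each one relative to the SQL result. The numbers are copied from the `%timeit` output above; the helper name `speedup_vs` is just an illustrative choice.

```python
# Timings reported above, converted to seconds.
WRITE = {"sql": 6.24, "hdf_fixed": 0.237, "hdf_table": 0.901, "csv": 3.44}
READ = {"sql": 0.766, "hdf_fixed": 0.0191, "hdf_table": 0.039, "csv": 0.620}


def speedup_vs(baseline, timings):
    """Return how many times faster each entry is than the baseline entry."""
    base = timings[baseline]
    return {name: base / seconds for name, seconds in timings.items()}


for label, timings in (("write", WRITE), ("read", READ)):
    for name, factor in sorted(speedup_vs("sql", timings).items()):
        print(f"{label:5s} {name:10s} {factor:6.1f}x relative to sql")
```

On these figures the fixed HDF5 format comes out roughly 26 times faster than SQL for writing and roughly 40 times faster for reading.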


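import os
import sqlite3

import pandas as pd
from numpy.random import randn
from pandas import DataFrame
from pandas.io import sql

In [3]: df = DataFrame(randn(1000000,2),columns=list('AB'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000  non-null values
B    1000000  non-null values
dtypes: float64(2)

# Bodies that were missing are filled in from the function names;
# file names other than 'test.sql' are assumptions.

def test_sql_write(df):
    # Start from a fresh database file on every run
    if os.path.exists('test.sql'):
        os.remove('test.sql')
    sql_db = sqlite3.connect('test.sql')
    # pandas 0.13 API; newer pandas uses df.to_sql()
    sql.write_frame(df, name='test_table', con=sql_db)
    sql_db.close()

def test_sql_read():
    sql_db = sqlite3.connect('test.sql')
    # pandas 0.13 API; newer pandas uses pd.read_sql()
    sql.read_frame("select * from test_table", sql_db)
    sql_db.close()

def test_hdf_fixed_write(df):
    df.to_hdf('test_fixed.hdf', 'test', mode='w')

def test_hdf_fixed_read():
    pd.read_hdf('test_fixed.hdf', 'test')

def test_hdf_table_write(df):
    df.to_hdf('test_table.hdf', 'test', format='table', mode='w')

def test_hdf_table_read():
    pd.read_hdf('test_table.hdf', 'test')

def test_csv_write(df):
    df.to_csv('test.csv', mode='w')

def test_csv_read():
    pd.read_csv('test.csv', index_col=0)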


