Yes, reading CSV files into numpy is pretty slow. There's a lot of pure Python along the code path. These days, even when I'm otherwise using pure numpy, I still use pandas for IO:
>>> import numpy as np, pandas as pd
>>> %time d = np.genfromtxt("./test.csv", delimiter=",")
CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
Wall time: 14.9 s
>>> %time d = np.loadtxt("./test.csv", delimiter=",")
CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
Wall time: 25.8 s
>>> %time d = pd.read_csv("./test.csv", delimiter=",").values
CPU times: user 740 ms, sys: 36 ms, total: 776 ms
Wall time: 780 ms
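One caveat with the pandas call above: `pd.read_csv` treats the first line of the file as a header by default, so for a purely numeric file you likely want `header=None` to avoid losing the first data row. A minimal self-contained sketch (it writes its own tiny `test.csv` rather than assuming the benchmark file exists):

```python
import numpy as np
import pandas as pd

# Write a tiny numeric CSV so the example is self-contained.
with open("test.csv", "w") as f:
    f.write("1.0,2.0\n3.0,4.0\n")

# pandas treats the first line as a header by default; header=None
# keeps every row as data for a purely numeric file.
arr = pd.read_csv("test.csv", header=None).to_numpy()
print(arr.shape)  # (2, 2)
```

`.to_numpy()` is the modern spelling of the `.values` attribute used in the timing above; both return the underlying ndarray.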
Alternatively, in simple-enough cases like this one, you can use something along the lines of what Joe Kington wrote here: https://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy
>>> %time data = iter_loadtxt("test.csv")
CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
Wall time: 2.86 s
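For reference, the generator-based loader from Joe Kington's linked answer looks roughly like this (a sketch following that answer; it assumes a rectangular, purely numeric file):

```python
import numpy as np

def iter_loadtxt(filename, delimiter=",", skiprows=0, dtype=float):
    """Parse a delimited numeric text file via a generator and
    np.fromiter, avoiding much of the per-line overhead of
    np.genfromtxt / np.loadtxt."""
    def iter_func():
        with open(filename, "r") as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                items = line.rstrip().split(delimiter)
                # Remember the row width so we can reshape afterwards.
                iter_func.rowlength = len(items)
                for item in items:
                    yield dtype(item)

    # np.fromiter builds a flat 1-D array from the generator...
    data = np.fromiter(iter_func(), dtype=dtype)
    # ...which we reshape into rows of the width seen while parsing.
    return data.reshape((-1, iter_func.rowlength))
```

The win comes from `np.fromiter` filling the array directly from Python floats, instead of the more general (and slower) dtype-inference machinery in `genfromtxt`.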
There is also Warren Weckesser's textreader library (https://github.com/WarrenWeckesser/textreader), in case pandas is too heavy a dependency:
>>> import textreader
>>> %time d = textreader.readrows("test.csv", float, ",")
readrows: numrows = 1500000
CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
Wall time: 1.34 s