This may be too late, but I'll post it anyway for future users. Another poster mentioned using multiprocessing. I can vouch for it and can go into more detail. We process files in the hundreds of MB / multi-GB range with Python daily, so it definitely depends on the task. Some of the files we handle aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. Regardless of the file type, though, the approach is the same.
You can process sections of a large file concurrently. Here is pseudocode of how we do it:
import os, multiprocessing as mp

# process file function
def processfile(filename, start=0, stop=0):
    results = []
    if start == 0 and stop == 0:
        # ... process the entire file ...
        pass
    else:
        # open in binary mode so start/stop are exact byte offsets
        with open(filename, 'rb') as fh:
            fh.seek(start)
            # read only this chunk's bytes and split them into complete lines
            lines = fh.read(stop - start).splitlines()
            # ... process these lines, appending to results ...
    return results

if __name__ == "__main__":
    filename = "really_big_file.txt"  # placeholder path to the input file
    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100 * 1024 * 1024

    # determine if it needs to be split
    if filesize > split_size:
        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'rb') as fh:
            # for every chunk in the file...
            for chunk in range(filesize // split_size + 1):
                # determine where the chunk ends, is it the last one?
                if cursor + split_size > filesize:
                    end = filesize
                else:
                    end = cursor + split_size
                # seek to the end of the chunk and read the next line so
                # only entire lines are passed to the processfile function
                fh.seek(end)
                fh.readline()
                # get current file location
                end = fh.tell()
                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)
                # set up the next chunk
                cursor = end
                # stop once the cursor has reached the end of the file
                if cursor >= filesize:
                    break
        # close and wait for pool to finish
        pool.close()
        pool.join()
        # iterate through results
        for proc in results:
            processfile_result = proc.get()
    else:
        # ... process normally ...
        pass
Like I said, that's just pseudocode. It should get anyone started who needs to do something similar. I don't have the code in front of me and am just doing it from memory.
But we got more than a 2x speedup on the first run without fine-tuning it. Depending on your setup, you can fine-tune the number of processes in the pool and how large the chunks are to squeeze out even more speed. If you have multiple files as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.
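For the multiple-files case, a rough sketch along these lines might work; the file list and the stub worker (standing in for the fuller processfile above) are placeholders, not our actual code:

import multiprocessing as mp

# hypothetical per-file worker; in practice this would be the
# processfile function shown above
def processfile(filename, start=0, stop=0):
    # ... parse the file and return its results ...
    return filename

if __name__ == "__main__":
    # hypothetical list of input paths
    filenames = ["file_a.txt", "file_b.txt", "file_c.txt"]
    # cap the pool size so the box isn't overloaded with processes
    with mp.Pool(processes=min(len(filenames), mp.cpu_count())) as pool:
        # one task per file; each worker processes its whole file
        all_results = pool.map(processfile, filenames)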
Note: you need to put it inside an "if main" block to ensure infinite processes aren't created.