After playing around with this for a while, I think it's because joblib spends all its time coordinating the parallel running of everything, and no time actually doing any useful work. At least that's the case for me under OSX and Linux; I don't have any MS Windows machines.
I start by loading the packages, pulling in your code, and generating a dummy file:
from random import choice
import re
from multiprocessing import Pool
from joblib import delayed, Parallel
regex = re.compile(r'a *a|b *b') # of course more complex IRL, with lookbehind/forward
mydict = {'aa': 'A', 'bb': 'B'}
def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)
examples = [
"Regex replace isn't that computationally expensive... I would suggest using Pandas, though, rather than just a plain loop",
"Hmm I don't use pandas anywhere else, but if it makes it faster, I'll try! Thanks for the suggestion. Regarding the question: expensive or not, if there is no reason for it to use only 19%, it should use 100%"
"Well, is tweets a generator, or an actual list?",
"an actual list of strings",
"That might be causing the main process to have the 419MB of memory, however, that doesn't mean that list will be copied over to the other processes, which only need to work over slices of the list",
"I think joblib splits the list in roughly equal chunks and sends these chunks to the worker processes.",
"Maybe, but if you use something like this code, 2 million lines should be done in less than a minute (assuming an SSD, and reasonable memory speeds).",
"My point is that you don't need the whole file in memory. You could type tweets.txt | python replacer.py > tweets_replaced.txt, and use the OS's native speeds to replace data line-by-line",
"I will try this",
"no, this is actually slower. My code takes 12mn using joblib.parallel and for line in f_in: f_out.write(re.sub(..., line)) takes 21mn. Concerning CPU and memory usage: CPU is same (17%) and memory much lower (60Mb) using files. But I want to minimize time spent, not memory usage.",
"I moved this to chat because StackOverflow suggested it",
"I don't have experience with joblib. Could you try the same with Pandas? pandas.pydata.org/pandas-docs/…",
]
with open('tweets.txt', 'w') as fd:
    for i in range(2_000_000):
        print(choice(examples), file=fd)
(See if you can guess where I got these lines from!)
As a baseline, I tried the simple solution:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    for l in fin:
        fout.write(replace_in(l))
This takes 14.0 s (wall-clock time) on my OSX laptop and 5.15 s on my Linux desktop. Note that changing your definition of replace_in to use regex.sub(handler, tweet) instead of re.sub(regex, handler, tweet) reduces the above to 8.6 s on my laptop, but I won't use that change below.
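For concreteness, that faster variant of replace_in just calls sub on the compiled pattern (shown for reference only; the timings below keep the original definition):

# faster variant, not used below: calling sub() on the compiled pattern
# avoids the module-level re.sub() dispatch on every call
def replace_in(tweet):
    return regex.sub(handler, tweet)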
I then tried your joblib version:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Parallel(n_jobs=-1) as parallel:
        for l in parallel(delayed(replace_in)(tweet) for tweet in fin):
            fout.write(l)
This takes 1 min 16 s on my laptop and 34.2 s on my desktop. CPU utilisation is very low, as the child/worker tasks spend most of their time waiting for the coordinator to send them work.
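(In case it's useful: wall-clock figures like these can be reproduced by putting a small timer around each variant; the timed helper below is just my illustration, any timing method works.)

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # print the wall-clock time of the enclosed block
    start = time.perf_counter()
    yield
    print(f'{label}: {time.perf_counter() - start:.2f} s')

# e.g. wrap any of the variants above or below:
# with timed('joblib'):
#     ...run the joblib loop...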
I then tried the multiprocessing package:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Pool() as pool:
        for l in pool.map(replace_in, fin, chunksize=1024):
            fout.write(l)
This took 5.95 s on my laptop and 2.60 s on my desktop. I also tried a chunk size of 8, which took 22.1 s and 8.29 s respectively. The chunksize lets the pool send large chunks of work to its children, so it spends less time coordinating and more time doing useful work.
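Conceptually, chunksize is doing something like the following by hand: batching the input so that each pickle/dispatch round-trip between the coordinator and a worker carries 1024 lines instead of one. This is just a sketch to make the idea explicit (the names replace_chunk, chunks and tweets3.txt are mine; Pool.map with chunksize does all of this internally):

from itertools import islice

def replace_chunk(lines):
    # process a whole batch of lines in one task
    return [replace_in(l) for l in lines]

def chunks(iterable, size):
    # yield successive lists of `size` items from an iterable
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

with open('tweets.txt') as fin, open('tweets3.txt', 'w') as fout:
    with Pool() as pool:
        # each task now ships 1024 lines at a time to a worker
        for batch in pool.imap(replace_chunk, chunks(fin, 1024)):
            fout.writelines(batch)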
I would therefore hazard a guess that joblib isn't particularly useful for this sort of usage, as it doesn't seem to have a notion of chunksize (https://github.com/joblib/joblib/issues/50).