Why do I get [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?

2023-12-01

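It seems that the whole point of a MapReduce framework is to process many files. So when I get errors telling me I am using too many files, I suspect I am doing something wrong.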

If I run the job with the inline runner on three directories, it works:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

But if I use the local runner (with the same three directories), it fails:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/

[...output clipped...]

> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
    self._invoke_step(step_num, 'mapper')
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
    working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
    procs_args, output_path, working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
    cwd=working_dir, env=env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
    proc = Popen(args, **proc_kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files

Furthermore, if I go back to the inline runner and include more directories in the input (11 in total), I get a different error:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

[...clipped...]

Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run 
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run 
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
    self._invoke_sort(self._step_input_paths(), sort_output_path)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
    check_call(args, stdout=output, stderr=err, env=env)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long

The mrjob documentation includes a discussion of the differences between the inline and local runners, but I don't see how it explains this behavior.

Finally, I should mention that the number of files in the directories I am globbing is not large (credit):

$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do   printf "%-25.25s : " "$dir";   find "$dir" -type f | wc -l; done | sort
./01                      :      236
./02                      :      169
./03                      :      176
./04                      :      185
./05                      :      176
./06                      :      235
./07                      :      275
./08                      :      265
./09                      :      186
./10                      :      171
./11                      :      161

I don't think this has anything to do with the job itself, but here it is:

from mrjob.job import MRJob
import numpy as np
import geohash

class MRGPSQuality(MRJob):

    def mapper(self, _, line):

        try:
            lat = float(line.split(',')[1])
            lng = float(line.split(',')[2])
            horizontalAccuracy = float(line.split(',')[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except:
            pass

    def reducer(self, key, values):
        # Convert the generator straight back to array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        if count > 50:
            yield key, [count, mean]

if __name__ == '__main__':
    MRGPSQuality.run()

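The "Argument list too long" problem is not caused by the job or by Python; it comes from the length of the command line. Bash expands the asterisk in the command used to launch the job to every matching file, which produces a very long argument list that exceeds the system limit on argument length. In the traceback above, the error is raised when mrjob's _invoke_sort passes the input paths on to a subprocess.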

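That error has nothing to do with ulimit, but the "Too many open files" error does: it is governed by the per-process limit on open files (ulimit -n). So even if the command did run, you would still hit the ulimit.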
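If you would rather inspect the open-file limit from Python than from the shell, the standard-library resource module reports it on Unix-like systems. This is only a minimal sketch; the helper name open_file_limits is made up for illustration and is not part of mrjob.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process.

    The soft limit is the one that triggers "[Errno 24] Too many open files".
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit on open files:", soft)
    print("hard limit on open files:", hard)
```

The soft limit is the same value that ulimit -n shows in the shell that launched the job.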

You can check the limit like this (if you're interested): getconf ARG_MAX
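To see roughly how close a glob comes to that limit, you can compare the size of the expanded paths with ARG_MAX. The sketch below assumes Python 3 on a Unix-like system; the helper names arg_max and argv_bytes are hypothetical. It is only an estimate, because the environment variables and some per-argument overhead also count toward the limit.

```python
import glob
import os


def arg_max():
    """Return the system limit reported by `getconf ARG_MAX`."""
    return os.sysconf("SC_ARG_MAX")


def argv_bytes(paths):
    """Estimate the bytes needed to pass `paths` as command-line arguments.

    Each argument is passed as a NUL-terminated byte string, hence the +1.
    """
    return sum(len(os.fsencode(p)) + 1 for p in paths)


if __name__ == "__main__":
    # The glob pattern from the question; adjust it to your own layout.
    paths = glob.glob("/Volumes/Logs/gps/ByCityLogs/city1/*/*.log")
    print("files matched:", len(paths))
    print("approximate argument bytes:", argv_bytes(paths))
    print("ARG_MAX:", arg_max())
```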

To work around the maximum-arguments problem, you can concatenate all the files into one, like this:

for f in *; do cat "$f" >> ../directory/bigfile.log; done

Then run mrjob pointing at the big file.
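The same concatenation can be done from Python, which avoids putting the file names on a command line at all. This is a rough sketch rather than a drop-in tool: the function name concatenate and the output name bigfile.log are assumptions, and the destination must live outside the directories matched by the pattern so that it is not read back in as input.

```python
import glob
import shutil


def concatenate(pattern, dest):
    """Append every file matching `pattern` to `dest`, in sorted order.

    Returns the number of files that were concatenated.
    """
    paths = sorted(glob.glob(pattern))
    with open(dest, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Copy in chunks so large logs are never held fully in memory.
                shutil.copyfileobj(src, out)
    return len(paths)


if __name__ == "__main__":
    # Glob pattern taken from the question; bigfile.log is a hypothetical name.
    count = concatenate("/Volumes/Logs/gps/ByCityLogs/city1/*/*.log", "bigfile.log")
    print("concatenated", count, "files")
```

Like cat, this joins the files byte for byte, so a file that does not end with a newline will have its last line merged with the first line of the next file.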

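If there are a lot of files, you can concatenate them with several jobs at once using GNU parallel, because the command above runs one cat at a time and is slow.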

ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"

* Change 8 to the degree of parallelism you want.
