我正在编写一个自定义文件系统爬虫,它通过 sys.stdin 传递数百万个 glob 来进行处理。我发现运行脚本时,其内存使用量随着时间的推移而大幅增加,整个过程几乎停止了。我在下面写了一个最小的案例来说明问题。我是否做错了什么,或者我在 Python / glob 模块中发现了错误? (我使用的是python 2.5.2)。
#!/usr/bin/env python
import glob
import sys
import gc
previous_num_objects = 0
for count, line in enumerate(sys.stdin):
glob_result = glob.glob(line.rstrip('\n'))
current_num_objects = len(gc.get_objects())
new_objects = current_num_objects - previous_num_objects
print "(%d) This: %d, New: %d, Garbage: %d, Collection Counts: %s"\
% (count, current_num_objects, new_objects, len(gc.garbage), gc.get_count())
previous_num_objects = current_num_objects
输出看起来像:
(0) This: 4042, New: 4042, Python Garbage: 0, Python Collection Counts: (660, 5, 0)
(1) This: 4061, New: 19, Python Garbage: 0, Python Collection Counts: (90, 6, 0)
(2) This: 4064, New: 3, Python Garbage: 0, Python Collection Counts: (127, 6, 0)
(3) This: 4067, New: 3, Python Garbage: 0, Python Collection Counts: (130, 6, 0)
(4) This: 4070, New: 3, Python Garbage: 0, Python Collection Counts: (133, 6, 0)
(5) This: 4073, New: 3, Python Garbage: 0, Python Collection Counts: (136, 6, 0)
(6) This: 4076, New: 3, Python Garbage: 0, Python Collection Counts: (139, 6, 0)
(7) This: 4079, New: 3, Python Garbage: 0, Python Collection Counts: (142, 6, 0)
(8) This: 4082, New: 3, Python Garbage: 0, Python Collection Counts: (145, 6, 0)
(9) This: 4085, New: 3, Python Garbage: 0, Python Collection Counts: (148, 6, 0)
每第 100 次迭代,就会释放 100 个对象,因此len(gc.get_objects()
每 100 次迭代增加 200。len(gc.garbage)
永远不会从0开始变化。第2代收集计数缓慢增加,而第0次和第1次计数则上下波动。