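Here's a script that parses one million <instrumentConfiguration/> elements (a 967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.
The throughput is 24MB/s. The cElementTree page (2005) reports 47MB/s.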
    #!/usr/bin/env python
    from itertools import imap, islice, izip
    from operator import itemgetter
    from xml.etree import cElementTree as etree

    def parsexml(filename):
        # iterate over elements only (drop the event name from each pair)
        it = imap(itemgetter(1),
                  iter(etree.iterparse(filename, events=('start',))))
        root = next(it) # get root element
        for elem in it:
            if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
                values = [('Id', elem.get('id')),
                          ('Parameter1', next(it).get('name'))] # cvParam
                componentList_count = int(next(it).get('count'))
                # consume (component, cvParam) pairs from the same iterator
                for parent, child in islice(izip(it, it), componentList_count):
                    key = parent.tag.partition('}')[2] # strip the namespace
                    value = child.get('name')
                    assert child.tag.endswith('cvParam')
                    values.append((key, value))
                yield values
                root.clear() # preserve memory

    def print_values(it):
        for line in (': '.join(val) for conf in it for val in conf):
            print(line)

    # `filename` must hold the path to the mzML file
    print_values(parsexml(filename))
Output
    $ /usr/bin/time python parse_mxml.py
    Id: QTOF
    Parameter1: Q-Tof ultima
    source: nanoelectrospray
    analyzer: quadrupole
    analyzer: time-of-flight
    detector: microchannel plate detector
    38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
    1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps
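The script as written targets Python 2: itertools.imap and itertools.izip do not exist in Python 3, and xml.etree.cElementTree was removed in Python 3.9. Below is a rough sketch of the same start-event approach for Python 3, using the built-in zip and xml.etree.ElementTree. The inline SAMPLE document is made up to resemble the output above, not taken from a real mzML file, and the timings quoted in this answer were not re-measured with this version.

```python
import io
from itertools import islice
from xml.etree import ElementTree as etree

NS = '{http://psi.hupo.org/ms/mzml}'

def parsexml(source):
    # keep only the element from each (event, element) pair
    it = (elem for _, elem in etree.iterparse(source, events=('start',)))
    root = next(it)  # the first 'start' event is the root element
    for elem in it:
        if elem.tag == NS + 'instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))]  # cvParam
            count = int(next(it).get('count'))  # componentList
            # each component is followed by its cvParam child
            for parent, child in islice(zip(it, it), count):
                values.append((parent.tag.partition('}')[2],
                               child.get('name')))
            yield values
            root.clear()  # preserve memory

# Hypothetical miniature document, for illustration only.
SAMPLE = b'''<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="QTOF">
      <cvParam name="Q-Tof ultima"/>
      <componentList count="4">
        <source order="1"><cvParam name="nanoelectrospray"/></source>
        <analyzer order="2"><cvParam name="quadrupole"/></analyzer>
        <analyzer order="3"><cvParam name="time-of-flight"/></analyzer>
        <detector order="4"><cvParam name="microchannel plate detector"/></detector>
      </componentList>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>'''

if __name__ == '__main__':
    for conf in parsexml(io.BytesIO(SAMPLE)):
        for key, value in conf:
            print(': '.join((key, value)))
```

iterparse accepts either a filename or a file object, so passing the path of a real file instead of io.BytesIO(SAMPLE) should work the same way.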
Note: The code is fragile: it assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
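One way to relax the ordering assumption, sketched here for Python 3, is to wait for the 'end' event of each <instrumentConfiguration/> and look its children up by tag rather than by position. This is only an illustration, not the script that was timed: the function name parse_by_tag and the SAMPLE document are made up, and unlike the original it collects every <cvParam/> directly under a component rather than exactly one. It also holds one complete <instrumentConfiguration/> subtree in memory at a time, which may cost some speed.

```python
import io
from xml.etree import ElementTree as etree

NS = '{http://psi.hupo.org/ms/mzml}'

def parse_by_tag(source):
    """Yield (key, value) lists, finding children by tag, not by position."""
    context = etree.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event != 'end' or elem.tag != NS + 'instrumentConfiguration':
            continue
        values = [('Id', elem.get('id'))]
        param = elem.find(NS + 'cvParam')  # first direct cvParam child, if any
        if param is not None:
            values.append(('Parameter1', param.get('name')))
        component_list = elem.find(NS + 'componentList')
        if component_list is not None:
            for component in component_list:
                key = component.tag.partition('}')[2]  # strip the namespace
                for child in component.findall(NS + 'cvParam'):
                    values.append((key, child.get('name')))
        yield values
        root.clear()  # preserve memory, as in the original script

# Hypothetical document with <componentList/> placed before <cvParam/>.
SAMPLE = b'''<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <componentList count="2">
      <source order="1"><cvParam name="nanoelectrospray"/></source>
      <detector order="2"><cvParam name="microchannel plate detector"/></detector>
    </componentList>
    <cvParam name="Q-Tof ultima"/>
  </instrumentConfiguration>
</mzML>'''

if __name__ == '__main__':
    for conf in parse_by_tag(io.BytesIO(SAMPLE)):
        print(conf)
```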
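On performance
ElementTree 1.3 is about 6 times slower than cElementTree 1.0.6 in this case.
If you replace root.clear() by elem.clear(), the code is 10% faster but uses 10 times more memory. With lxml.etree and the elem.clear() variant, the performance is the same as for cElementTree, but it consumes 20 (root.clear()) / 2 (elem.clear()) times as much memory (500MB).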
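A likely reason for the memory gap between the two variants is that elem.clear() empties an element but leaves it attached to its parent, so the root keeps accumulating (empty) children, whereas root.clear() drops them. The Python 3 sketch below illustrates only that structural difference on a made-up document; the helper name attached_after_parse is hypothetical, and the sketch measures neither memory nor speed.

```python
import io
from xml.etree import ElementTree as etree

def attached_after_parse(xml_bytes, clear_root):
    """Return how many children remain attached to the root after parsing."""
    context = etree.iterparse(io.BytesIO(xml_bytes), events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == 'end' and elem.tag == 'item':
            if clear_root:
                root.clear()  # detaches every child seen so far
            else:
                elem.clear()  # empties elem, which stays attached to root
    return len(root)

# Made-up document: 1000 small <item/> elements under one root.
SAMPLE = (b'<root>'
          + b'<item><payload>x</payload></item>' * 1000
          + b'</root>')

if __name__ == '__main__':
    print(attached_after_parse(SAMPLE, clear_root=False))
    print(attached_after_parse(SAMPLE, clear_root=True))
```

With elem.clear() every processed <item/> is still referenced by the root when parsing finishes; with root.clear() none are.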