I have a very large XML file that I need to split into multiple files based on a particular tag.
The XML file looks like this:
<xml>
  <file id="13">
    <head>
      <talkid>2458</talkid>
      <transcription>
        <seekvideo id="645">So in college,</seekvideo>
        ...
      </transcription>
    </head>
    <content> *** This is the content I am trying to save *** </content>
  </file>
  <file>
    ...
  </file>
</xml>
I want to extract the content of each file and save it to a file named after its talkid.
Here is the code I tried:
import xml.etree.ElementTree as ET

all_talks = 'path\\to\\big\\file'
context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')
But I get the following error:
AttributeError: 'NoneType' object has no attribute 'text'
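The error most likely comes from elem.find('talkid'): given a bare tag name, find() looks only at the direct children of <file>, while <talkid> sits inside <head>, so the call returns None. A minimal sketch of this behaviour follows; the inline sample string and the 'head/talkid' path are assumptions based on the XML shown above, not code from the original post:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for one <file> entry of the real document (assumed shape).
sample = """<file id="13">
  <head>
    <talkid>2458</talkid>
  </head>
  <content>text to save</content>
</file>"""

elem = ET.fromstring(sample)

# A bare tag name matches direct children only, so this finds nothing.
print(elem.find('talkid'))            # None
# A path through <head> reaches the nested element.
print(elem.find('head/talkid').text)  # 2458
# <content> is a direct child of <file>, so the bare name works here.
print(elem.find('content').text)      # text to save
```

Under that assumption, replacing 'talkid' with 'head/talkid' in the original loop would get past this particular error.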
Since you are already using .iterparse(), relying on events alone is a more general approach:
import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))
for event, element in context:
    if event == 'end':
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file' and title and content:
            # Both values were seen inside this <file>: save the content
            # next to the source file as <talkid>.txt.
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write(content)
    elif element.tag == 'file':
        # 'start' of a new <file>: reset the values from the previous one.
        content = title = None
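The loop relies on the order in which iterparse() reports events: the 'start' event for <file> arrives before any of its children, and its 'end' event arrives after all of them. A small sketch that prints this order for an inline document (the byte string below is a made-up stand-in for the real file):

```python
import io
import xml.etree.ElementTree as ET

# Tiny inline document with the same nesting as the sample above (assumed).
data = b"<xml><file><head><talkid>1</talkid></head><content>a</content></file></xml>"

# Collect (event, tag) pairs in the order iterparse() yields them.
events = [(event, element.tag)
          for event, element in ET.iterparse(io.BytesIO(data), events=('start', 'end'))]

for event, tag in events:
    print(event, tag)
```

In the printed sequence, 'end talkid' and 'end content' both appear before 'end file', which is why title and content are already set when the file is written.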
Upd. In a similar question (https://stackoverflow.com/q/74182062/10824407), @Leila (https://stackoverflow.com/users/11926527/leila) asked how to write the text of all <seekvideo> tags to the file instead of the <content> text, so here is a solution:
import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))
for event, element in context:
    if event == 'end':
        if element.tag == 'file' and title and parts:
            # End of a <file>: write the collected <seekvideo> texts,
            # one per line, to <talkid>.txt next to the source file.
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write('\n'.join(parts))
        elif element.text:
            if element.tag == 'talkid':
                title = element.text
            elif element.tag == 'seekvideo':
                parts.append(element.text)
    elif element.tag == 'file':
        # 'start' of a new <file>: reset the values from the previous one.
        title = None
        parts = []
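Since the input is described as very large, it may be worth releasing each <file> element once it has been handled. The sketch below is one possible variant of the <content> version, not part of the original answer: the function name split_talks and the out_dir parameter are hypothetical, and it calls element.clear() at the end of every <file>.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_talks(source, out_dir):
    """Write one <talkid>.txt per <file>, holding the text of its <content>.

    `split_talks` and `out_dir` are illustrative names, not from the answer.
    Returns the list of paths written.
    """
    out_dir = Path(out_dir)
    written = []
    title = content = None
    for event, element in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'file':
                # New <file>: forget values left over from the previous one.
                title = content = None
        elif element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file':
            if title and content:
                path = out_dir / (title + '.txt')
                path.write_text(content, encoding='utf-8')
                written.append(path)
            # Drop the children and text of the finished <file> to limit memory use.
            element.clear()
    return written
```

For example, split_talks('file.xml', '.') would write the text files into the current directory. The emptied <file> elements stay attached to the root, so memory can still grow slowly on very long inputs.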