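I have a Python script that cleans up scraped HTML content. It uses BeautifulSoup4 and works well. Recently I decided to learn lxml, but I find its tutorials harder to follow. For example, I use the code below to merge multiple consecutive <br /> tags into one, i.e. if several <br /> tags appear in a row, remove all of them except one: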
from bs4 import BeautifulSoup, Tag

data = 'foo<br /><br>bar. <p>foo<br/><br id="1"><br/>bar'
soup = BeautifulSoup(data)
for br in soup.find_all("br"):
    # remove every <br> that immediately follows this one
    while isinstance(br.next_sibling, Tag) and br.next_sibling.name == 'br':
        br.next_sibling.extract()

print(soup)
# -> <html><body><p>foo<br/>bar. </p><p>foo<br/>bar</p></body></html>
How can I achieve something similar in lxml? Thanks,
You could try the .drop_tag() method to remove duplicate consecutive occurrences of the <br/> tag:
from lxml import html

doc = html.fromstring(data)
for br in doc.findall('.//br'):
    if br.tail is None:  # no text immediately after <br> tag
        for dup in br.itersiblings():
            if dup.tag != 'br':  # don't merge if there is another tag in between
                break
            dup.drop_tag()
            if dup.tail is not None:  # don't merge if there is text in between
                break

print(html.tostring(doc))
# -> <div><p>foo<br>bar. </p><p>foo<br>bar</p></div>
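The tail checks above rely on the ElementTree data model that lxml follows: text that comes right after an element is stored in that element's .tail, so a <br> whose .tail is None is followed directly by its next sibling. As a rough illustration of the same idea without lxml, here is a minimal sketch using only the standard library's xml.etree.ElementTree. The helper name collapse_br is made up for this example, the input must already be well-formed XML (the standard library will not repair broken HTML the way lxml.html does), and unlike the code above it treats a whitespace-only tail as "no text".

```python
import xml.etree.ElementTree as ET


def collapse_br(root):
    """Remove each <br> that directly follows another <br> with no text between.

    Hypothetical helper for illustration; not part of lxml or ElementTree.
    """
    # Snapshot the elements first so the tree can be modified safely.
    for parent in list(root.iter()):
        prev = None  # last child that was kept
        for child in list(parent):
            is_dup = (
                child.tag == "br"
                and prev is not None
                and prev.tag == "br"
                and not (prev.tail or "").strip()  # no real text in between
            )
            if is_dup:
                # remove() discards the child's tail too, so move the
                # trailing text onto the <br> that survives.
                prev.tail = child.tail
                parent.remove(child)
            else:
                prev = child
    return root


root = ET.fromstring(
    '<div><p>foo<br/><br/>bar. </p><p>foo<br/><br id="1"/><br/>bar</p></div>'
)
collapse_br(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

The main difference from the lxml version is that ElementTree has no drop_tag(), so the sketch has to carry the removed element's tail over by hand before calling remove().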