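I am trying to parse many web pages that contain text, tables, and HTML. Each page has a different number of paragraphs, but every paragraph begins with an opening <div>, and the closing </div> does not appear until the very end. I just want to get the content, filtering out certain elements and replacing them with others.
Desired result: text1 <b>text2</b> (table_deleted) text3
Actual result: text1\n\ntext2some text heretext 3text2some text heretext 3 (table deleted)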
from bs4 import BeautifulSoup

html = """
<h1>title</h1>
<h3>extra data</h3>
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>"""

soup = BeautifulSoup(html, 'html5lib')
# find_all_next() returns every tag after the h3, including nested ones
tags = soup.find('h3').find_all_next()
contents = ""
for tag in tags:
    if tag.name == 'table':
        contents += " (table deleted) "
    # .text includes the text of all descendants, so nested text repeats
    contents += tag.text.strip()
print(contents)
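Don't use html5lib as the parser; use html.parser instead. html5lib rebuilds the document the way a browser would, which can move stray text out of the <table>. That said, you can access the "div" that immediately follows the "h3" tag using a CSS selector (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) and the select_one method.
From there, you can unwrap (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#replace-with) the nested "div" tag and replace the "table" tag using the replace_with (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#replace-with) method: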
In [107]: from bs4 import BeautifulSoup
In [108]: html = """
...: <h1>title</h1>
...: <h3>extra data</h3>
...: <div>
...: text1
...: <div>
...: <b>next2</b><table>some text here</table>text 3
...: </div>
...: </div>"""
In [109]: soup = BeautifulSoup(html, 'html.parser')
In [110]: my_div = soup.select_one('h3 + div')
In [111]: my_div
Out[111]:
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>
In [112]: my_div.div.unwrap()
Out[112]: <div></div>
In [113]: my_div
Out[113]:
<div>
text1
<b>next2</b><table>some text here</table>text 3
</div>
In [114]: my_div.table.replace_with('(table deleted)')
Out[114]: <table>some text here</table>
In [115]: my_div
Out[115]:
<div>
text1
<b>next2</b>(table deleted)text 3
</div>
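The interactive steps above can be folded into a small script. This is only a minimal sketch, assuming the layout from the question (a single "div" directly after the "h3"). The function name extract_content and the placeholder argument are illustrative and not part of BeautifulSoup. Unlike the session above, it replaces every table and unwraps every nested "div", not just the first of each.

```python
from bs4 import BeautifulSoup

HTML = """
<h1>title</h1>
<h3>extra data</h3>
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>"""


def extract_content(html, placeholder=" (table deleted) "):
    """Return the inner markup of the div after <h3>, with tables replaced.

    extract_content is a hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("h3 + div")
    if container is None:
        # Assumption: pages without the expected layout yield no content.
        return ""
    # Swap every table for the placeholder string.
    for table in container.find_all("table"):
        table.replace_with(placeholder)
    # Remove nested div wrappers but keep their children in place.
    for inner in container.find_all("div"):
        inner.unwrap()
    # decode_contents() renders the children without the outer <div>;
    # split/join then collapses runs of whitespace into single spaces.
    return " ".join(container.decode_contents().split())


print(extract_content(HTML))
# text1 <b>next2</b> (table deleted) text 3
```

Note that collapsing whitespace with split/join also normalizes whitespace inside attribute values, so skip that step if exact markup matters.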