I want to get the wrapper element of a key piece of text. For example, in HTML:
…
<div class="target">chicken</div>
<div class="not-target">apple</div>
…
and, based on the text "chicken", I want to get back <div class="target">chicken</div>.
Currently, I fetch the HTML like this:
import requests
from bs4 import BeautifulSoup
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
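and then I have to call soup.find_all('div', …) and loop over all the available divs to find the wrapper I'm looking for.
Without iterating over every div, what is the proper and most efficient way to get the wrapper in the HTML based on a given text?
Thanks in advance, I will definitely accept/upvote answers!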
# coding: utf-8
html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title> Last chicken leg on stock! Only 500$ !!! </title>
</head>
<body>
<div id="layer1" class="class1">
<div id="layer2" class="class2">
<div id="layer3" class="class3">
<div id="layer4" class="class4">
<div id="layer5" class="class5">
<p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
<div id="layer6" class="class6">
<div id="layer7" class="class7">
<div id="chicken_surname" class="chicken">eat me</div>
<div id="layer8" class="class8">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>"""
from bs4 import BeautifulSoup as BS
import re
soup = BS(html_doc, "lxml")
# going from a tag to its text is straightforward: look the tag up by class or id
tag = soup.find('div', class_="chicken")
tag2 = soup.find('div', {'id':"chicken_surname"})
print('\n###### by_cls:')
print(tag)
print('\n###### by_id:')
print(tag2)
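# but it can be tricky when you need to find a tag by a substring:
# string="..." must equal the whole string, so "eat" on its own finds nothing
tag_by_str = soup.find(string="eat me")
tag_by_sub = soup.find(string="eat")  # None: no string is exactly "eat"
tag_by_resub = soup.find(string=re.compile("eat"))  # a regex matches substrings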
print('\n###### tag_by_str:')
print(tag_by_str)
print('\n###### tag_by_sub:')
print(tag_by_sub)
print('\n###### tag_by_resub:')
print(tag_by_resub)
# there is more than one way to access the underlying strings
# and they behave differently - see the results
tag = soup.find('p')
print('\n###### .text attr:')
print(tag.text, type(tag.text))
print('\n###### .strings generator:')
for s in tag.strings:  # .strings is a generator
    print(s, type(s))
# note that the .strings generator yields bs4.element.NavigableString objects,
# so we can use them to navigate, for example by accessing their parents:
print('\n###### NavigableString parents:')
for s in tag.strings:
    print(s.parent)
# or even grandparents :)
print('\n###### grandparents:')
for s in tag.strings:
    print(s.parent.parent)
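Applied to the markup from the question, the same idea could look roughly like the sketch below. It is only an illustration, not the one definitive way to do it: the helper name find_wrapper and the sample html string are made up for this example, and it assumes a bs4 version recent enough to accept the string= argument (older releases call it text=).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the question
html = """
<div class="target">chicken</div>
<div class="not-target">apple</div>
"""

soup = BeautifulSoup(html, "html.parser")


def find_wrapper(soup, text):
    """Return the tag directly enclosing the first string that contains `text`, or None."""
    # re.escape keeps characters such as "$" or "." in `text` literal
    node = soup.find(string=re.compile(re.escape(text)))
    return node.parent if node is not None else None


wrapper = find_wrapper(soup, "chicken")
print(wrapper)

# Alternative: restrict the search to <div> tags. This form only matches
# tags whose .string is set, i.e. tags with a single string child.
div = soup.find("div", string=re.compile("chicken"))
print(div)
```

Beautiful Soup still walks the parsed tree internally, but this avoids writing the loop over every div by hand. If the text may be wrapped in inline tags such as <span>, .parent returns that inner tag, so you may need to climb further with .parent or find_parent('div').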