I am trying to scrape people's birthdays from Wikipedia pages.
Here is my existing code:
import urllib2
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Mozilla/5.0'}
site = "http://en.wikipedia.org/wiki/" + "january" + "_" + "1"
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup
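This all works fine and I get the whole HTML page, but I want specific data, and I don't know how to access it with Beautiful Soup without an id. The <ul> tag has no id, and neither do the <li> tags. Also, I can't just ask for every <li> tag, because there are other lists on the page. Is there a specific way to select a given list? (I can't use a fix that only works for this one page, because I plan to iterate over all dates and get the birthdays from each page, and I can't guarantee that every page has exactly the same layout as this one.)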
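The idea is to get the span with the Births id, find its parent's next sibling (which is the ul), and iterate over that list's li elements. Here is a complete example using requests (though that choice is not relevant):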
from bs4 import BeautifulSoup as Soup, Tag
import requests

response = requests.get("http://en.wikipedia.org/wiki/January_1")
soup = Soup(response.content, "html.parser")

# The section title is a span with id="Births"; its parent is the heading tag
births_span = soup.find("span", {"id": "Births"})
# The list of births is the next sibling element after that heading
births_ul = births_span.parent.find_next_sibling()

for item in births_ul.find_all('li'):
    if isinstance(item, Tag):
        print(item.text)
prints:
871 – Zwentibold, Frankish son of Arnulf of Carinthia (d. 900)
1431 – Pope Alexander VI (d. 1503)
1449 – Lorenzo de' Medici, Italian politician (d. 1492)
1467 – Sigismund I the Old, Polish king (d. 1548)
1484 – Huldrych Zwingli, Swiss pastor and theologian (d. 1531)
1511 – Henry, Duke of Cornwall (d. 1511)
1516 – Margaret Leijonhufvud, Swedish wife of Gustav I of Sweden (d. 1551)
...
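Since the question mentions that other date pages may not share exactly the same layout, the lookup above can be wrapped in a small helper that returns an empty list instead of raising when the heading or the list is missing. This is only a sketch under assumptions: the function name `list_items_after_heading` and the inline `sample` HTML are made up for illustration, and it uses `find_next("ul")` rather than `parent.find_next_sibling()`, which is a slightly different technique that tolerates extra elements between the heading and the list.

```python
from bs4 import BeautifulSoup


def list_items_after_heading(html, heading_id="Births"):
    """Return the text of each <li> in the first <ul> after the given id.

    Hypothetical helper: returns [] when the id or a following <ul> is
    missing, so a page with a different layout does not raise.
    """
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find(id=heading_id)
    if anchor is None:
        return []
    # find_next walks forward in document order, so the <ul> does not
    # have to be the immediate sibling of the heading.
    ul = anchor.find_next("ul")
    if ul is None:
        return []
    return [li.get_text(strip=True) for li in ul.find_all("li")]


# Small inline sample (an assumption, not the real Wikipedia markup).
sample = """
<ul><li>Unrelated navigation link</li></ul>
<h2><span id="Births">Births</span></h2>
<ul>
<li>1431 – Pope Alexander VI (d. 1503)</li>
<li>1449 – Lorenzo de' Medici, Italian politician (d. 1492)</li>
</ul>
<h2><span id="Deaths">Deaths</span></h2>
<ul><li>placeholder entry</li></ul>
"""

for entry in list_items_after_heading(sample):
    print(entry)
```

Passing `response.content` from the example above instead of `sample` should work the same way, as long as the live page still carries an element with the Births id.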
Hope that helps.
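For the plan of iterating over all dates, one possible way to build the page URLs is sketched below. It assumes every date page follows the same `Month_Day` naming as `January_1`; the names `MONTHS`, `BASE_URL`, and `date_page_urls` are hypothetical, and a leap year is used only so that February 29 is included.

```python
import calendar

BASE_URL = "http://en.wikipedia.org/wiki/"

# Month names are spelled out explicitly so the result does not depend
# on the locale settings of the machine running the script.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def date_page_urls(year=2000):
    """Build one URL per calendar day, assuming the Month_Day page naming.

    The default year 2000 is a leap year, so February_29 is included.
    """
    urls = []
    for month_number, month_name in enumerate(MONTHS, start=1):
        # monthrange returns (weekday of first day, number of days in month)
        days_in_month = calendar.monthrange(year, month_number)[1]
        for day in range(1, days_in_month + 1):
            urls.append("{}{}_{}".format(BASE_URL, month_name, day))
    return urls


urls = date_page_urls()
print(len(urls))
print(urls[0])
```

Each URL can then be fetched and parsed with the same Births span lookup shown earlier; adding a short pause between requests is a reasonable courtesy when requesting several hundred pages.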