I am trying to scrape the tables from the following page using BeautifulSoup: https://www.pro-football-reference.com/boxscores/201702050atl.htm
import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')
Most of the tables on the page sit inside HTML comment tags, so they cannot be accessed in a straightforward way.
print(soup.table.text)
returns:
1
2
3
4
OT
Final
via Sports Logos.net
About logos
New England Patriots
0
3
6
19
6
34
via Sports Logos.net
About logos
Atlanta Falcons
0
21
7
0
0
28
That is, the main tables containing the player statistics are missing. I tried simply stripping the comment tags with

html = html.replace('<!--', '')
html = html.replace('-->', '')

but to no avail. How can I access these commented-out tables?
You can get any table from that page simply by changing the index number.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm').text
soup = BeautifulSoup(page, 'lxml')
table = soup.find_all('table')[1]  # Index of a table on the page; change it to get a different table.
tab_data = [[celldata.text for celldata in rowdata.find_all(["th", "td"])]
            for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))
Since all of the tables except the first two are rendered via JavaScript, you need to use Selenium to load the page and then parse them. Now you can definitely access any table on that page. Here is the modified version.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
table = soup.find_all('table')[7]  # Index of a table on the page; change it to get a different table.
tab_data = [[celldata.text for celldata in rowdata.find_all(["th", "td"])]
            for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))
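As an aside, the commented-out tables can also be reached without Selenium: BeautifulSoup exposes HTML comments as `Comment` string nodes, and re-parsing a comment's contents yields any tables hidden inside it. A minimal self-contained sketch (the inline HTML snippet below is a stand-in that mimics the page's structure, not the real page):

```python
from bs4 import BeautifulSoup, Comment

# Stand-in for the page: one visible table, and one table hidden
# inside an HTML comment, like the player-stats tables on the site.
html = """
<div id="scores"><table><tr><td>34</td><td>28</td></tr></table></div>
<div id="stats"><!--
<table id="player_stats"><tr><th>Player</th><th>Yds</th></tr>
<tr><td>Tom Brady</td><td>466</td></tr></table>
--></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the visible tables, then re-parse every comment node
# and collect any tables it contains.
tables = soup.find_all("table")
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    tables.extend(BeautifulSoup(comment, "html.parser").find_all("table"))

for table in tables:
    for row in table.find_all("tr"):
        print(" ".join(cell.text for cell in row.find_all(["th", "td"])))
```

Against the real page you would fetch `html` with `requests.get(url).text` as above; the comment-extraction step is the same.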