Overcoming the Limitations of pandas.read_html with Python Web Scraping

2023-10-21

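The pandas read_html function is a very useful tool for quickly extracting HTML tables from web pages.

It lets you extract tabular data from HTML content with a single line of code.
However, read_html has some limitations.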

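This tutorial walks through some of these limitations and shows how to work around each one.

Throughout the tutorial, we extract data from a sample HTML file, which the examples load from http://localhost/test.html.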

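Table of Contents

1 Navigating Dynamic Content (JavaScript Interaction)
2 Form Submission and Authentication
3 Complex CSS Selectors
4 Multi-Page Scraping
5 Handling Non-Tabular Data
6 HTTP Headers and Cookies
7 Respecting robots.txt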

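Navigating Dynamic Content (JavaScript Interaction)

Web pages often contain dynamic content: the structure of the page changes after it loads, typically when JavaScript runs in response to a user action.

In that case, we can combine Beautiful Soup and Selenium to interact with the dynamic content.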


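from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://localhost/test.html'
driver = webdriver.Firefox()
try:
    driver.get(url)
    # Click the button that loads the remaining rows via JavaScript.
    driver.find_element(By.ID, 'loadMoreButton').click()
    # If the rows load asynchronously, wait for them before reading page_source.
    html_content = driver.page_source
finally:
    driver.quit()  # Always close the browser, even if something fails.

soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all('table')[0]
# Wrap the markup in StringIO: recent pandas versions deprecate raw HTML strings.
dfs = pd.read_html(StringIO(str(table)))

print(dfs[0])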

Output


   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3
3       a4      b4      c4
4       a5      b5      c5

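In this script, Selenium is used not only to open the web page but also to interact with it.

Clicking the loadMoreButton element loads additional data into the table, which is then extracted with BeautifulSoup and pandas.read_html.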

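Form Submission and Authentication

Another limitation of pandas.read_html is that it does not support form submission or authentication. Both tasks can be handled with Selenium and BeautifulSoup.
Here is an example of form submission and authentication: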


from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://localhost/test.html'
driver = webdriver.Firefox()
try:
    driver.get(url)
    # Fill in the login form and submit it.
    driver.find_element(By.ID, 'username').send_keys('test_user')
    driver.find_element(By.ID, 'password').send_keys('test_password')
    driver.find_element(By.ID, 'login').click()
    # Grab the HTML of the page shown after logging in.
    html_content = driver.page_source
finally:
    driver.quit()  # Always close the browser, even if something fails.

soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all('table')[0]
# Wrap the markup in StringIO: recent pandas versions deprecate raw HTML strings.
dfs = pd.read_html(StringIO(str(table)))
print(dfs[0])

Output


   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3

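In the script above, Selenium automates the process of logging in to the site. The script locates the username and password fields, enters the credentials, and clicks the login button.

Once authenticated, it grabs the page's HTML and extracts the table data.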

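Complex CSS Selectors

While pandas.read_html works well for well-structured tables, it falls short when the table you want can only be pinned down with a complex CSS selector.

BeautifulSoup comes in handy here, because it lets us use CSS selectors to navigate and search the parse tree.
Here is an example of using CSS selectors with BeautifulSoup: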


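from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'http://localhost/test.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
# Pick the first table.table-class that is a direct child of div.content.
table = soup.select('div.content > table.table-class')[0]
# Wrap the markup in StringIO: recent pandas versions deprecate raw HTML strings.
dfs = pd.read_html(StringIO(str(table)))
print(dfs[0])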

Output


   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3

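In this script, BeautifulSoup's select function is used with a CSS selector that targets the first table with the class table-class that is a direct child of a div with the class content.

This level of precision in targeting a specific part of the HTML cannot be achieved with pandas.read_html alone.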
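
To make the difference concrete, here is a small sketch comparing read_html's attrs filter with a CSS selector. The inline HTML is a made-up stand-in (not the tutorial's sample file) in which two tables share a class but sit in different containers.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page: two tables share a class but live in different containers.
HTML = """
<div class="sidebar">
  <table class="table-class"><tr><th>Ad</th></tr><tr><td>x</td></tr></table>
</div>
<div class="content">
  <table class="table-class">
    <tr><th>Column1</th><th>Column2</th></tr>
    <tr><td>a1</td><td>b1</td></tr>
  </table>
</div>
"""

# read_html can filter on the table's own attributes, but not on where it sits,
# so both tables come back.
by_attrs = pd.read_html(StringIO(HTML), attrs={"class": "table-class"})

# A CSS selector can also express the parent/child relationship.
soup = BeautifulSoup(HTML, "lxml")
content_table = soup.select_one("div.content > table.table-class")
by_selector = pd.read_html(StringIO(str(content_table)))

print(len(by_attrs), len(by_selector))
print(by_selector[0])
```

The attrs filter matches every table carrying the class, while the selector narrows the result to the one inside div.content.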

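Multi-Page Scraping

A common problem in web scraping is handling paginated content. pandas.read_html has no built-in support for moving through multiple pages automatically.

We can use Scrapy, a powerful Python scraping framework, to handle this: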


from io import StringIO

import pandas as pd
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://localhost/test.html']

    def parse(self, response):
        # Extract the tables on the current page.
        dfs = pd.read_html(StringIO(response.text))
        print(dfs[0])
        # Follow the "next page" link, if there is one.
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Output


Page 1
   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3
...
Page 2
   Column1 Column2 Column3
0       a4      b4      c4
1       a5      b5      c5
2       a6      b6      c6
...
...

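In the script above, a Scrapy spider is created to move through multiple pages. The parse method is called for the initial URL and for each subsequent page.

It extracts the HTML tables from the current page with pandas.read_html, then uses a CSS selector to find the link to the next page. If a next page is found, the spider follows the link and repeats the process.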
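
The spider prints each page's table separately. If a single DataFrame is wanted instead, the per-page tables can be stacked with pd.concat. The sketch below is independent of Scrapy: combine_pages is a hypothetical helper, and the two inline strings stand in for the HTML of consecutive pages.

```python
from io import StringIO

import pandas as pd


def combine_pages(html_pages):
    """Parse the first table on each page and stack them into one DataFrame."""
    frames = [pd.read_html(StringIO(html))[0] for html in html_pages]
    # ignore_index renumbers the rows so the index does not restart on each page.
    return pd.concat(frames, ignore_index=True)


# Stand-ins for the HTML of two consecutive pages.
page_1 = "<table><tr><th>Column1</th></tr><tr><td>a1</td></tr><tr><td>a2</td></tr></table>"
page_2 = "<table><tr><th>Column1</th></tr><tr><td>a3</td></tr></table>"

combined = combine_pages([page_1, page_2])
print(combined)
```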

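Handling Non-Tabular Data

Another limitation of pandas.read_html is that it only extracts tabular data. If the data you are interested in is stored in another HTML structure, such as a list, you will need a different tool.

Here is an example of extracting a list of items with BeautifulSoup: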


from bs4 import BeautifulSoup
import requests
url = 'http://localhost/test.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
items = soup.select('ul.item-list > li')
items_text = [item.get_text() for item in items]
print(items_text)

Output


['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']

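In this script, BeautifulSoup's select function is used with a CSS selector that targets all li elements directly inside a ul with the class item-list. The text of each item is extracted into a list.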
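
Once the items are in a Python list, they can be loaded into a DataFrame so the rest of a pandas workflow stays the same. A minimal sketch, using a made-up inline list and an assumed column name item:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the list on the sample page.
HTML = """
<ul class="item-list">
  <li> Item 1 </li>
  <li>Item 2</li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
# strip=True trims the whitespace around each item's text.
items = [li.get_text(strip=True) for li in soup.select("ul.item-list > li")]
# "item" is an arbitrary column name chosen for this example.
df = pd.DataFrame({"item": items})
print(df)
```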

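HTTP Headers and Cookies

Yet another limitation of pandas.read_html is that it gives you little control over the HTTP headers or cookies sent with the request, and these are often required to access certain pages. We can use the requests library to handle this: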


from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'http://localhost/test.html'
# Headers and cookies to send along with the request.
headers = {'User-Agent': 'Mozilla/5.0'}
cookies = {'session_id': '1234567890'}
response = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find_all('table')[0]
# Wrap the markup in StringIO: recent pandas versions deprecate raw HTML strings.
dfs = pd.read_html(StringIO(str(table)))
print(dfs[0])

Output


   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3

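In this script, requests.get is used to fetch the HTML content, but this time we pass a dictionary of HTTP headers and a dictionary of cookies.

This lets us, for example, present ourselves as a particular type of browser or maintain a session across multiple requests.

The HTML content is then parsed with BeautifulSoup, and the table is extracted with pandas.read_html.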
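
When several requests need the same headers and cookies, a requests.Session can carry them so they do not have to be passed on every call. This is a sketch of that setup only; build_session is a hypothetical helper name, and the header and cookie values are the placeholder values from the script above.

```python
import requests


def build_session():
    """Return a Session that sends the same header and cookie on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    session.cookies.set("session_id", "1234567890")
    return session


session = build_session()
# Requests made through `session` reuse the header and cookie above, and any
# cookies the server sets are kept for later requests, for example:
# response = session.get("http://localhost/test.html")
```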

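Respecting robots.txt

Web scraping should always be done respectfully and ethically. Most websites include a robots.txt file that states which parts of the site should not be crawled or scraped.

To follow these rules, you can check the robots.txt file manually (it is usually at the root of the site, for example http://localhost/robots.txt), or use a library such as robotexclusionrulesparser to check them programmatically: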


from robotexclusionrulesparser import RobotExclusionRulesParser
url = 'http://localhost'
rp = RobotExclusionRulesParser()
rp.fetch(url + '/robots.txt')
can_fetch_page = rp.can_fetch('*', url + '/test.html')
print(can_fetch_page)

Output


True

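In this script, we use RobotExclusionRulesParser to fetch and parse the robots.txt file from the target site.

The can_fetch method checks whether a specific page may be scraped according to the robots.txt rules.

The '*' argument means we are checking the rules that apply to all web crawlers. If the output is True, we are allowed to scrape the page.
Remember that following these rules matters, not only out of respect for the site's operators but also to avoid potential legal problems.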
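
If you would rather not install a third-party package, Python's standard library offers urllib.robotparser, which performs a similar check. The sketch below uses it in place of robotexclusionrulesparser and parses a made-up robots.txt from a string, so it runs without a network request; the rules and the /private/ path are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, parsed in memory so no request is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check two URLs against the rules that apply to all crawlers.
allowed = rp.can_fetch("*", "http://localhost/test.html")
blocked = rp.can_fetch("*", "http://localhost/private/data.html")
print(allowed, blocked)
```

Against a live site, rp.set_url(...) followed by rp.read() would fetch the real file instead of the inline string.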
