OK, here is what I want to achieve:
- Call a URL with a dynamically filtered list of search results
- Click the first search result (5 per page)
- Scrape the headline, paragraphs and images, and store them as a JSON object in a separate file, e.g.
{
"Title": "Headline element of the individual entry",
"Content": "Paragraphs and images of the individual entry, in DOM order"
}
- Navigate back to the search results overview page and repeat steps 2 - 3
- After scraping result 5/5, go to the next page (click the pagination link)
- Repeat steps 2 - 5 until no entries are left
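The per-entry storage in step 3 can be sketched with the standard json module (the file naming scheme and field values below are illustrative, not taken from the original post):

```python
import json

# Illustrative entry: a title plus content items kept in DOM order
entry = {
    "Title": "Headline element of the individual entry",
    "Content": [
        "First paragraph text",
        "https://example.com/image-1.png",  # image URL kept in its DOM position
        "Second paragraph text",
    ],
}

# One file per entry, named after the title (assumed naming scheme)
filename = f"{entry['Title']}.json"
with open(filename, "w", encoding="utf-8") as fh:
    json.dump(entry, fh, ensure_ascii=False, indent=2)
```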
What I have so far:
#import libraries
from selenium import webdriver
from bs4 import BeautifulSoup

#URL
url = "https://URL.com"

#Create a browser session
driver = webdriver.Chrome("PATH TO chromedriver.exe")
driver.implicitly_wait(30)
driver.get(url)

#click consent btn on destination URL (overlays rest of the content)
python_consentButton = driver.find_element_by_id('acceptAllCookies')
python_consentButton.click()  #click cookie consent btn

#Selenium hands the page source to Beautiful Soup
soup_results_overview = BeautifulSoup(driver.page_source, 'lxml')

for link in soup_results_overview.findAll("a", class_="searchResults__detail"):
    #Selenium visits each Search Result Page
    searchResult = driver.find_element_by_class_name('searchResults__detail')
    searchResult.click()  #click Search Result
    #Ask Selenium to go back to the search results overview page
    driver.back()

#Tell Selenium to click paginate "next" link
#probably needs to be in a surrounding for loop?
paginate = driver.find_element_by_class_name('pagination-link-next')
paginate.click()  #click paginate next

driver.quit()
Problem
Every time Selenium navigates back to the search results overview page, the list count resets,
so it clicks the first entry 5 times, moves on to the next 5 items and stops.
This may be a case for a recursive approach, but I'm not sure.
Any suggestions on how to solve this are appreciated.
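Since going back reloads the DOM, the usual fix is to re-locate the list of result links after every `driver.back()` and advance by index within each page, instead of always clicking the first match. Below is a minimal sketch of that control flow, with stubbed page data standing in for the Selenium calls (all names and data are illustrative):

```python
# Stubbed search result pages: each inner list is one overview page of links
pages = [
    ["entry-1", "entry-2", "entry-3", "entry-4", "entry-5"],
    ["entry-6", "entry-7"],  # the last page may hold fewer than 5 entries
]

visited = []

for page_links in pages:              # step 5: paginate through overview pages
    for i in range(len(page_links)):  # steps 2-4: one entry at a time, by index
        # With Selenium you would re-find the link elements here after
        # driver.back(), then click the i-th element, not always the first.
        current_links = page_links    # stand-in for re-locating the elements
        visited.append(current_links[i])
    # stand-in for clicking the "next" pagination link

print(visited)
```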
You can scrape this with just requests
and BeautifulSoup
, without Selenium. It will be faster and consume far fewer resources:
import json
import requests
from bs4 import BeautifulSoup

# Get 1000 results
params = {"$filter": "TemplateName eq 'Application Article'", "$orderby": "ArticleDate desc", "$top": "1000",
          "$inlinecount": "allpages", }
response = requests.get("https://www.cst.com/odata/Articles", params=params).json()

# iterate 1000 results
articles = response["value"]
for article in articles:
    article_json = {}
    article_content = []
    # title of article
    article_title = article["Title"]
    # article url
    article_url = str(article["Url"]).split("|")[1]
    print(article_title)
    # request article page and parse it
    article_page = requests.get(article_url).text
    page = BeautifulSoup(article_page, "html.parser")
    # get header
    header = page.select_one("h1.head--bordered").text
    article_json["Title"] = str(header).strip()
    # get body content with images links and descriptions
    content = page.select("section.content p, section.content img, section.content span.imageDescription, "
                          "section.content em")
    # collect content to json format
    for x in content:
        if x.name == "img":
            article_content.append("https://cst.com/solutions/article/" + x.attrs["src"])
        else:
            article_content.append(x.text)
    article_json["Content"] = article_content
    # write to json file
    with open(f"{article_json['Title']}.json", 'w') as to_json_file:
        to_json_file.write(json.dumps(article_json))
print("the end")
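One caveat about the snippet above: `article_json['Title']` goes straight into a filename, and titles containing characters like `/` or `:` will fail on some filesystems. A small stdlib sanitizer (a hypothetical helper, not part of the original answer) avoids that:

```python
import re

def safe_filename(title: str, max_len: int = 100) -> str:
    """Replace characters that are unsafe in filenames and cap the length."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return cleaned[:max_len] or "untitled"

# Usage: open(f"{safe_filename(article_json['Title'])}.json", "w")
print(safe_filename("RF Design: Antennas / Filters"))
```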