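I have been trying to scrape all the reviews available at a single URL by repeatedly clicking the "Show 6 more reviews" button. I believe this question applies to anyone using Selenium to scrape many dynamic elements from a single URL.
Problem: once the number of reviews exceeds a few hundred, the loop becomes very slow.
I am using Selenium because the site relies on JavaScript.
HTML of the button I am clicking (toward the bottom of the page):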
<button type="button" class="css-1e0935c" data-comp="Link Box">Show 6 more reviews<svg viewBox="0 0 95 57" class="css-1ymrwr7" data-comp="Chevron Box"><path d="M47.5 57L95 9.5 85.5 0l-38 38-38-38L0 9.5 47.5 57z"></path></svg></button>
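Things I have tried:
- Not loading images: no improvement (not shown below)
- Using what may be the most efficient selector in the loop (https://csswizardry.com/2011/09/writing-efficient-css-selectors/)
Things I have thought about:
- Replacing Chrome with PhantomJS. I had trouble scrolling with
PhantomJS. I did not pursue it because the gains looked
incremental, not the order of magnitude I need (I could be wrong).
- Collecting reviews as they become available, rather than once the whole page is "expanded". I do not think this
would solve the performance problem.
- Finding a faster way to locate the button. I read about how browsers match CSS selectors (https://stackoverflow.com/questions/5797014/why-do-browsers-match-css-selectors-from-right-to-left) but could not find a way to improve my code.
This is my first question. Thank you very much for your patience and help.
Reproducible code in Python 2 or 3 (the slow loop is at the bottom):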
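The "collect as you go" idea above could be sketched roughly as follows. This is only a minimal sketch, not something I have confirmed helps with speed. The names `collect_incrementally`, `click_more` and `get_review_texts` are hypothetical: the two callables stand in for whatever Selenium calls click the button and read the reviews currently on the page.

```python
def collect_incrementally(click_more, get_review_texts, max_rounds=10000):
    """Store reviews after every click instead of once at the very end.

    click_more: hypothetical callable; clicks the button once and returns
        False when there is no button left to click.
    get_review_texts: hypothetical callable; returns the texts of all
        reviews currently on the page, in page order.
    """
    collected = []
    for _ in range(max_rounds):
        # Only the reviews past the ones already stored are new.
        collected.extend(get_review_texts()[len(collected):])
        if not click_more():
            break
    # Pick up anything loaded by the last successful click.
    collected.extend(get_review_texts()[len(collected):])
    return collected
```

With Selenium, `get_review_texts` would presumably wrap a `driver.find_elements(...)` call on the review containers. Note that such a call still scans the whole page on every round, so this sketch alone may not remove the slowdown.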
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
# Page with 5591 reviews
url = "https://www.sephora.com/product/soy-face-cleanser-P7880?icid2=:p7880:product"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(4)
# Navigation steps (feel free to skip)
# scroll to section 'Similar Products' (above Reviews)
timeout = 10
wait_driver = WebDriverWait(driver, timeout)
section_title = wait_driver.until(EC.presence_of_element_located(
    (By.XPATH, '//h2[@class="css-1orm38z"]')))
driver.execute_script("arguments[0].scrollIntoView();", section_title)
# Sort by newest review
wait_driver.until(EC.presence_of_element_located(
    (By.XPATH, '//button[@class="css-u2mtre"]'))).click()
wait_driver.until(EC.presence_of_element_located(
    (By.XPATH, '//div/span[text()="Newest"]'))).click()
# This is the loop that is way too slow
# First expand all reviews by clicking button
numReviews = 0
while True:
    try:
        # Fastest selector I could come up with
        button = driver.find_element(By.CSS_SELECTOR, '.css-1e0935c')
        button.click()
        numReviews += 6
        print("Loading 6 more reviews... (" + str(numReviews) + ")")
    except Exception:
        # No button found (or the click failed): assume everything is loaded
        break
# Now that full page is loaded, store all reviews
# [...]
Output:
Loading 6 more reviews... (6)
Loading 6 more reviews... (12)
Loading 6 more reviews... (18)
Loading 6 more reviews... (24)
Loading 6 more reviews... (30)
Loading 6 more reviews... (36)
Loading 6 more reviews... (42)
Loading 6 more reviews... (48)
Loading 6 more reviews... (54)
Loading 6 more reviews... (60)
Loading 6 more reviews... (66)
Loading 6 more reviews... (72)
Loading 6 more reviews... (78)
Loading 6 more reviews... (84)
ETC...
My program works fine for products with 200 reviews, but as the number of reviews grows (e.g., >5000 at my example URL), the step above takes longer and longer.
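To put numbers on the slowdown, each iteration could be timed. The following is a minimal sketch, assuming the body of the loop is wrapped in a function; `time_clicks` and `click_once` are hypothetical names, and `click_once` is expected to return False once the button is gone.

```python
import time


def time_clicks(click_once, max_clicks, clock=time.time):
    """Return how long each call to click_once took, in seconds.

    click_once: hypothetical callable that finds and clicks the button
        once, returning False when no button is left.
    clock: time source; replaceable so the helper can be tested
        without a browser.
    """
    durations = []
    for _ in range(max_clicks):
        start = clock()
        more = click_once()
        durations.append(clock() - start)
        if not more:
            break
    return durations
```

Comparing the first few durations with those after several hundred clicks would show whether the cost per click grows with the number of reviews already on the page.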