Python, Selenium, and Beautiful Soup for URLs

2024-01-08

I am trying to write a script with Selenium that opens Pastebin, runs a search, and prints the result URLs as text. I need the visible URL of each result and nothing else.

<div class="gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" dir="ltr" style="word-break:break-all;">pastebin.com/VYQTSbzY</div>

The current script is:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://www.pastebin.com')

# type the query into the search box and submit it
search = browser.find_element(By.NAME, 'q')
search.send_keys("test")
search.send_keys(Keys.RETURN)

# parse the current page source with Beautiful Soup
soup = BeautifulSoup(browser.page_source, "html.parser")

# print the href and text of every link on the page
for link in soup.find_all('a'):
    print(link.get('href', None), link.get_text())

You don't actually need BeautifulSoup here. Selenium itself is quite capable of locating elements:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys


browser = webdriver.Firefox()
browser.get('http://www.pastebin.com')

# type the query into the search box and submit it
search = browser.find_element(By.NAME, 'q')
search.send_keys("test")
search.send_keys(Keys.RETURN)

# wait for results to appear
wait = WebDriverWait(browser, 10)
results = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.gsc-resultsbox-visible")))

# grab results
for link in results.find_elements(By.CSS_SELECTOR, "a.gs-title"):
    print(link.get_attribute("href"))

browser.close()

Prints:

http://pastebin.com/VYQTSbzY
http://pastebin.com/VYQTSbzY
http://pastebin.com/VAAQCjkj
...
http://pastebin.com/fVUejyRK
http://pastebin.com/fVUejyRK
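
The output above lists each result twice and includes the scheme, while the question asks for the visible form (pastebin.com/VYQTSbzY). One way to get there is to post-process the collected href values. The following is a minimal sketch using only the standard library; the helper name visible_urls and the hard-coded hrefs list are assumptions made for illustration, and in the real script the list would be filled from link.get_attribute("href").

```python
from urllib.parse import urlsplit


def visible_urls(hrefs):
    """Return scheme-less URLs (host + path), de-duplicated in first-seen order."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for href in hrefs:
        if not href:  # skip anchors that have no href attribute
            continue
        parts = urlsplit(href)
        seen.setdefault(parts.netloc + parts.path, None)
    return list(seen)


# Hypothetical sample input, copied from the printed output above
hrefs = [
    "http://pastebin.com/VYQTSbzY",
    "http://pastebin.com/VYQTSbzY",
    "http://pastebin.com/VAAQCjkj",
    "http://pastebin.com/fVUejyRK",
    "http://pastebin.com/fVUejyRK",
]

for url in visible_urls(hrefs):
    print(url)
```

For the sample input this prints pastebin.com/VYQTSbzY, pastebin.com/VAAQCjkj and pastebin.com/fVUejyRK, one per line.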

Note the use of an explicit wait (https://selenium-python.readthedocs.org/waits.html), which makes the script wait for the search results to appear before reading them.
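
To make the idea behind an explicit wait concrete, here is a simplified, Selenium-free sketch of the polling loop: call a condition repeatedly until it returns something truthy or a timeout expires. This is only an illustration of the concept, not Selenium's actual implementation; the names wait_until and results_ready are hypothetical, and the fake condition stands in for a real check such as "the results box is visible".

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Hypothetical condition: pretends the results show up on the third check
attempts = {"count": 0}


def results_ready():
    attempts["count"] += 1
    return "results" if attempts["count"] >= 3 else None


print(wait_until(results_ready, timeout=2.0, poll=0.01))
```

Compared with a fixed time.sleep(), this returns as soon as the condition holds and fails loudly if it never does.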
