Decoding class names on Facebook through Selenium

2023-12-29

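I noticed that Facebook has some odd class names that look computer generated. What I don't know is whether these classes at least stay constant over time or whether they change at some interval. Maybe someone with experience of this can answer. The only thing I can see is that when I quit Chrome and open it again, the names are still the same, so at least they don't change with every browser session.

So I would guess the best way to go about scraping Facebook is to use some elements in the user interface and assume the structure is always the same, for example to get the address from the About section, something like this: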

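from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: the chromedriver path is passed through a Service object
driver = webdriver.Chrome(service=Service("C:/chromedriver.exe"))

driver.get("https://www.facebook.com/pg/Burma-Superstar-620442791345784/about/?ref=page_internal")
# wait some time
# Locate the address spans relative to the visible 'FIND US' and 'Get Directions' labels
address_elements = driver.find_elements(By.XPATH, "//span[text()='FIND US']/../following-sibling::div//button[text()='Get Directions']/../../preceding-sibling::div[1]/div/span")
for item in address_elements:
    print(item.text)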

You are quite right. Facebook (https://www.facebook.com/) is built with ReactJS (https://reactjs.org/), which is evident from the presence of the following keywords and tags within the HTML DOM (https://www.w3schools.com/js/js_htmldom.asp):

{"react_render":true,"reflow":true}

["React-prod"]
["ReactDOM-prod"]
ReactComposerTaggerType:{r:["t5r69"],be:1}

So the dynamically generated class names are bound to change after certain time gaps.
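As a rough illustration of that check, the sketch below scans a page's HTML source for the React markers quoted above. The function name react_markers_found and the marker tuple are hypothetical helpers introduced here, and finding these strings is only a heuristic hint that the page is React-rendered, not an official detection method.

```python
# Marker strings taken from the snippets quoted in the answer above.
REACT_MARKERS = ('"react_render":true', "React-prod", "ReactDOM-prod")


def react_markers_found(page_source):
    """Return the markers that occur in the given HTML source, in order."""
    return [marker for marker in REACT_MARKERS if marker in page_source]


if __name__ == "__main__":
    sample = '{"react_render":true,"reflow":true} ["React-prod"]'
    print(react_markers_found(sample))
```

With a live session, the same function could be called on driver.page_source.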

Solution

The solution is to use static attributes to construct a dynamic locator strategy (https://stackoverflow.com/questions/48369043/official-locator-strategies-for-the-webdriver/48376890#48376890).
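One way to read this advice is to anchor the XPath on stable visible text, such as a section label, rather than on generated class names. The sketch below builds such an expression; xpath_literal and xpath_following_label are hypothetical helper names, and the assumption that the label sits in a span element is taken from the question's markup.

```python
def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal."""
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    # Both quote types present: join the single-quote-free pieces with concat()
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in parts) + ")"


def xpath_following_label(label, tag="span", position=1):
    """XPath for the position-th <tag> that follows a span showing `label`."""
    return "//span[normalize-space()={}]//following::{}[{}]".format(
        xpath_literal(label), tag, position
    )


if __name__ == "__main__":
    print(xpath_following_label("FIND US", position=2))
```

The resulting string can be passed to driver.find_elements(By.XPATH, ...). Because it depends only on the label text, it should survive class-name changes, although it will still break if the visible label or the page layout changes.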

To retrieve the first line of the address just below the text FIND US, you need to induce WebDriverWait (https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.wait.html#module-selenium.webdriver.support.wait) in conjunction with expected_conditions (https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html#module-selenium.webdriver.support.expected_conditions) as visibility_of_element_located() (https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html#selenium.webdriver.support.expected_conditions.invisibility_of_element_located), and you can use the following optimized solution:

# Needs WebDriverWait, expected_conditions (as EC) and By imported from selenium
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[normalize-space()='FIND US']//following::span[2]"))).text)
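For reuse, the wait can be wrapped in a small function that includes the required imports and returns None when the element never becomes visible. This is a minimal sketch: the function name address_first_line and the constant FIND_US_XPATH are hypothetical, and it assumes the page still shows a FIND US label followed by the address.

```python
# XPath from the answer above, anchored on the visible 'FIND US' label.
FIND_US_XPATH = "//span[normalize-space()='FIND US']//following::span[2]"


def address_first_line(driver, timeout=20):
    """Return the text of the located element, or None if the wait times out."""
    # Selenium is imported inside the function so that this module can be
    # loaded on a machine where Selenium is not installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        element = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.XPATH, FIND_US_XPATH))
        )
    except TimeoutException:
        return None
    return element.text
```

A call such as address_first_line(driver) after driver.get(...) would then replace both the fixed "wait some time" pause and the print statement.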

References

You can find a couple of relevant discussions in:

Logging Facebook using Selenium https://stackoverflow.com/questions/45635190/logging-facebook-using-selenium/45636091#45636091
Why Selenium driver fails to recognize ID element of Facebook login page? https://stackoverflow.com/questions/47741832/why-selenium-driver-fail-to-recognize-id-element-of-facebook-login-page/47746472#47746472

Outro

Note: Scraping Facebook violates section 3.2.3 of their Terms of Service (https://www.facebook.com/legal/terms), and you are liable to be questioned and may even land up in Facebook Jail (https://www.facebook.com/help/community/question/?id=804287426255468). Use the Facebook Graph API (https://developers.facebook.com/docs/graph-api) instead.
