I need to scroll a web page (for example, Twitter) and scrape the new elements that appear on it. I am trying to do this with Python 3.x, selenium, and PhantomJS. Here is my code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
user = 'ciroylospersas'
# Start web browser
#browser = webdriver.Firefox()
browser = webdriver.PhantomJS()
browser.set_window_size(1024, 768)
browser.get("https://twitter.com/")
# Fill username in login
element = browser.find_element_by_id("signin-email")
element.clear()
element.send_keys('your twitter user')
# Fill password in login
element = browser.find_element_by_id("signin-password")
element.clear()
element.send_keys('your twitter pass')
browser.save_screenshot('screen.png') # save a screenshot to disk
# Submit the login
element.submit()
time.sleep(5)
browser.save_screenshot('screen1.png') # save a screenshot to disk
# Move to the following url
browser.get("https://twitter.com/" + user + "/following")
browser.save_screenshot('screen2.png') # save a screenshot to disk
scroll_script = "var h = document.body.scrollHeight; window.scrollTo(0, h); return h;"
newHeight = browser.execute_script(scroll_script)
print(newHeight)
browser.save_screenshot('screen3.png') # save a screenshot to disk
The problem is that I cannot scroll to the bottom: screen2.png and screen3.png are identical. But if I change the webdriver from PhantomJS to Firefox, the same code works fine. Why?
While trying to solve a similar problem, I was able to get it working in PhantomJS:
# Page height before the first scroll
check_height = browser.execute_script("return document.body.scrollHeight;")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give the page time to load more content
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break  # height unchanged, so nothing more was loaded
    check_height = height
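It scrolls to the current "bottom", waits, and checks whether the page has loaded more. If nothing new has loaded, it gives up (on the assumption that a matching height means everything has been loaded).
In my original code, I also checked a "maximum" value alongside the matching height, because I was only interested in the first 10 or so "pages". If there were more, I wanted it to stop loading and skip them.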
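The original code with the "maximum" check is not shown, so the following is only a sketch of how such a cap might be combined with the height comparison. The function name `scroll_to_bottom` and the parameters `max_scrolls` and `pause` are hypothetical names chosen for illustration. It assumes only that `browser` exposes selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(browser, max_scrolls=10, pause=5.0):
    """Scroll until the page height stops growing or max_scrolls is reached.

    `browser` is any object with an execute_script(script) method, such as a
    selenium webdriver. Returns the number of scrolls actually performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight;")
    scrolls = 0
    while scrolls < max_scrolls:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # wait for any lazily loaded content
        height = browser.execute_script("return document.body.scrollHeight;")
        if height == last_height:
            break  # height unchanged, so assume nothing more will load
        last_height = height
    return scrolls
```

With the `browser` object from the question, this could be called as `scroll_to_bottom(browser, max_scrolls=10)` after navigating to the page. A fixed `pause` is the simplest choice, but a slow connection may need a longer one for the height comparison to be reliable.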
Also, this is the answer I used as an example: https://stackoverflow.com/questions/28928068/scroll-down-to-bottom-of-infinite-page-with-phantomjs-in-python