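I've written a scraper using Python Scrapy in combination with Selenium to scrape some titles from a website. The CSS selectors defined in my scraper are correct. I want the scraper to keep clicking through to the next page and parse the information embedded in each page. It handles the first page fine, but once the Selenium part takes over, the scraper keeps clicking on the same link over and over again.
As this is my first time using Selenium together with Scrapy, I have no idea how to proceed. Any fix will be highly appreciated.
If I try it like this, it works smoothly (there is nothing wrong with the selectors):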
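import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"
    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            # Read the titles from the live page that Selenium is currently showing
            for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h1.faqsno-heading"))):
                name = elem.find_element(By.CSS_SELECTOR, "div[id^='arrowex']").text
                print(name)
            try:
                self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
                # Wait until the old element goes stale, i.e. the next page has replaced it
                self.wait.until(EC.staleness_of(elem))
            except TimeoutException:
                break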
But my intention is to make my script run this way:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"
    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self, link):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))
        # It keeps clicking on the same link over and over again
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))

    def parse(self, response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}
            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
            except TimeoutException:
                break
These are the titles visible on that landing page (to let you know what I'm after):
INDIA INCLUSION FOUNDATION
INDIAN WILDLIFE CONSERVATION TRUST
VATSALYA URBAN AND RURAL DEVELOPMENT TRUST
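I'm not interested in the data from that site as such, so any alternative approach other than what I've tried above is of no use to me. My only goal is a solution along the lines of my second approach.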
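The repeated click in the second approach can be reasoned about without a browser. The sketch below is only a model, not Selenium or Scrapy code: `FakePagedSite`, `scrape_reloading`, `scrape_once_loaded`, and the page 2 and page 3 titles are hypothetical names and placeholder data made up for illustration. It assumes that loading the URL always lands on the first page, and that `response` is a fixed snapshot of that first page. Under those assumptions, reloading the URL before every click always goes from page 1 to page 2, while loading once and re-reading the live page after each click moves forward.

```python
class FakePagedSite:
    """Hypothetical stand-in for a browser session (not Selenium's API)."""

    def __init__(self, pages):
        self.pages = pages  # one list of titles per page
        self.index = 0

    def get(self, url):
        # Loading the URL always lands on the first page, like a fresh page load.
        self.index = 0

    def click_next(self):
        # Advance one page; return False when there is no next page
        # (the code in the question signals this with TimeoutException).
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True

    @property
    def page_source(self):
        # Titles on the page currently shown.
        return self.pages[self.index]


def scrape_reloading(site, url, max_rounds):
    """Model of the second approach: fixed snapshot, reload before each click."""
    site.get(url)
    response = site.page_source  # static snapshot of page 1
    seen = []
    for _ in range(max_rounds):  # bounded here, since the loop never ends by itself
        seen.extend(response)  # always the first page's titles
        site.get(url)  # resets the session to page 1
        if not site.click_next():  # always page 1 -> page 2
            break
    return seen


def scrape_once_loaded(site, url):
    """Model of the first approach: load once, re-read the live page each round."""
    site.get(url)
    seen = []
    while True:
        seen.extend(site.page_source)  # current page, refreshed after each click
        if not site.click_next():
            break
    return seen


if __name__ == "__main__":
    pages = [
        ["INDIA INCLUSION FOUNDATION", "INDIAN WILDLIFE CONSERVATION TRUST"],
        ["PAGE 2 TITLE (placeholder)"],
        ["PAGE 3 TITLE (placeholder)"],
    ]
    print(scrape_reloading(FakePagedSite(pages), "start-url", max_rounds=3))
    print(scrape_once_loaded(FakePagedSite(pages), "start-url"))
```

If this model matches what the real spider does, the equivalent change in the second approach would presumably be to call `self.driver.get(...)` only once and, inside the loop, build the selector from the live page (for example `scrapy.Selector(text=self.driver.page_source)`) instead of reusing `response`. That is untested against the actual site.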