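I'm using Scrapy 1.0.3 and can't figure out how the CLOSESPIDER extension works.
For the command:
scrapy crawl domain_links --set=CLOSESPIDER_PAGECOUNT=1
exactly one request is made, as expected, but with a page count of two:
scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2
the requests never stop.
So please explain how it works, with a simple example.
Here is my spider code: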
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# PathsSpiderItem is the project's own Item class (its import is not shown here).


class DomainLinksSpider(CrawlSpider):
    name = "domain_links"
    #allowed_domains = ["www.example.org"]
    # Start URLs need a scheme, otherwise Scrapy rejects them.
    start_urls = ["http://www.example.org/"]

    rules = (
        # Extract links on www.example.org and parse them with the spider's method parse_page
        Rule(LinkExtractor(allow_domains="www.example.org"), callback='parse_page'),
    )

    def parse_page(self, response):
        print '<<<', response.url
        items = []
        item = PathsSpiderItem()
        selected_links = response.selector.xpath('//a[@href]')
        # Build one item per unique link found on the page.
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            items.append(item)
        return items
It doesn't even work for this simple spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow_domains="www.karen.pl"), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        return item
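but here the request count is not infinite:
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=1
'downloader/request_count': 1,
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2
'downloader/request_count': 17,
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=3
'downloader/request_count': 19,
Maybe it's because of parallel downloads.
Yes, with CONCURRENT_REQUESTS = 1 the CLOSESPIDER_PAGECOUNT setting works as expected for the second example. I checked the first one too, and it also works.
It only looked almost infinite to me because a sitemap with many URLs (my items) was crawled as the next page :)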
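
To make the parallel-download explanation concrete, here is a toy model in plain Python. It is not Scrapy code and only sketches the behaviour under stated assumptions: the spider starts closing once the number of received responses reaches CLOSESPIDER_PAGECOUNT, requests already handed to the downloader are still completed, and no new requests are started after that. The name simulate_crawl, its parameters, and the fixed links_per_page value are invented for illustration.

```python
from collections import deque


def simulate_crawl(pagecount_limit, concurrent_requests, links_per_page=50):
    """Toy model: return how many requests are sent before the crawl stops.

    Assumptions (hypothetical, for illustration only):
    - every page yields `links_per_page` new links to follow;
    - responses arrive one at a time;
    - once the page count limit is reached, no new request is started,
      but requests already in flight are allowed to finish.
    """
    pending = deque(["start-url"])  # requests waiting in the scheduler
    in_flight = 0                   # requests currently in the downloader
    requests_sent = 0
    responses_seen = 0
    closing = False

    while True:
        # Fill the downloader up to the concurrency limit, unless closing.
        while not closing and pending and in_flight < concurrent_requests:
            pending.popleft()
            in_flight += 1
            requests_sent += 1

        if in_flight == 0:
            break  # nothing left to wait for

        # One response arrives and is counted.
        in_flight -= 1
        responses_seen += 1
        if responses_seen >= pagecount_limit:
            closing = True

        # While still running, the page's links are queued for crawling.
        if not closing:
            pending.extend("link-%d" % i for i in range(links_per_page))

    return requests_sent


# Default-like concurrency of 16 versus a single request at a time.
with_default_concurrency = simulate_crawl(2, 16)   # 17
with_single_request = simulate_crawl(2, 1)         # 2
```

With a limit of 2 and 16 concurrent requests the model sends 17 requests: the start page plus the 16 that were already in the downloader when the second response arrived, which matches the count reported above. Real crawls depend on timing, so other limits will not necessarily match this model exactly (it gives 18 for a limit of 3, while the run above reported 19). Passing --set CONCURRENT_REQUESTS=1 on the command line makes the limit exact, at the cost of crawl speed.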