Scrapy: getting the websites that fail with the error "DNS lookup failed"


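I'm trying to use Scrapy to collect all the links to websites for which the DNS lookup failed.

The problem is that every website that loads without errors is printed in the parse_obj method, but when a URL returns "DNS lookup failed", the parse_obj callback is not called.

I want to get all the domains that fail with the error "DNS lookup failed". How can I do that?

Logs: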

2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 12:55:12 [scrapy] DEBUG: Telnet console listening on
2016-03-08 12:55:12 [scrapy] DEBUG: Crawled (200) <GET> (referer: None)
2016-03-08 12:55:12 [scrapy] DEBUG: Retrying <GET> (failed 1 times): DNS lookup failed: address '' not found: [Errno 11001] getaddrinfo failed.

Code:

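from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

from scrapy.item import Field, Item
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyItem(Item):
    url = Field()


class someSpider(CrawlSpider):
    name = 'Crawler'
    start_urls = ['']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=()).extract_links(response):
            parsed_uri = urlparse(link.url)
            # keep only scheme and host, e.g. "http://example.com/"
            url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            print(url)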

CrawlSpider rules do not allow passing errbacks (which is a shame).

Here is a variation of another answer I gave for catching DNS errors:

# -*- coding: utf-8 -*-
import random

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError

class HttpbinSpider(CrawlSpider):
    name = "httpbin"

    # this will generate test links so that we can see CrawlSpider in action
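    start_urls = (
        # NOTE: the URL was lost from this post; this one is an assumption based
        # on the httpbin.organisation/links/10/N URLs in the log output below
        'https://httpbin.org/links/10/0',
    )
    rules = (
        Rule(LinkExtractor(),
             callback='parse_page',
             # hook to be called when this Rule generates a Request
             process_request='add_errback'),
    )

    # this is just to not retry errors for this example spider
    custom_settings = {
        'RETRY_ENABLED': False
    }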

    # method to be called for each Request generated by the Rules above,
    # here, adding an errback to catch all sorts of errors
    # (Scrapy 2.0+ also passes the originating response, hence the optional argument)
    def add_errback(self, request, response=None):
        self.logger.debug("add_errback: patching %r" % request)

        # this is a hack to trigger a DNS error randomly
        rn = random.randint(0, 2)
        if rn == 1:
            newurl = request.url.replace('httpbin.org', 'httpbin.organisation')
            self.logger.debug("add_errback: patching url to %s" % newurl)
            return request.replace(url=newurl,
                                   errback=self.errback_httpbin)

        # this is the general case: adding errback to all requests
        return request.replace(errback=self.errback_httpbin)

    def parse_page(self, response):
        self.logger.info("parse_page: %r" % response)

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
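
To answer the original question directly (a list of the domains whose DNS lookup failed), the DNSLookupError branch above could hand the request URL to a small collector. The sketch below is only one possible approach: FailedDomains and its record method are hypothetical names, not part of Scrapy, and the wiring into the spider is shown only in comments.

```python
from urllib.parse import urlparse


class FailedDomains:
    """Collects the host names of URLs whose requests failed (hypothetical helper)."""

    def __init__(self):
        self.domains = set()

    def record(self, url):
        # urlparse(...).hostname is the lower-cased host without port, or None
        host = urlparse(url).hostname
        if host:
            self.domains.add(host)
        return host


# Assumed wiring inside the spider (not executed here):
#     failed = FailedDomains()          # attribute on the spider
#     ...
#     elif failure.check(DNSLookupError):
#         self.failed.record(failure.request.url)

if __name__ == "__main__":
    failed = FailedDomains()
    failed.record("https://httpbin.organisation/links/10/5")
    failed.record("https://httpbin.organisation/links/10/9")
    print(sorted(failed.domains))  # prints ['httpbin.organisation']
```

Using a set means each failing domain is kept once, however many of its links were followed.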


$ scrapy crawl httpbin
2016-03-08 15:16:30 [scrapy] INFO: Scrapy 1.0.5 started (bot: httpbinlinks)
2016-03-08 15:16:30 [scrapy] INFO: Optional features available: ssl, http11
2016-03-08 15:16:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'httpbinlinks.spiders', 'SPIDER_MODULES': ['httpbinlinks.spiders'], 'BOT_NAME': 'httpbinlinks'}
2016-03-08 15:16:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-08 15:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-08 15:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-08 15:16:30 [scrapy] INFO: Enabled item pipelines: 
2016-03-08 15:16:30 [scrapy] INFO: Spider opened
2016-03-08 15:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 15:16:30 [scrapy] DEBUG: Telnet console listening on
2016-03-08 15:16:30 [scrapy] DEBUG: Crawled (200) <GET> (referer: None)
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET> (referer:
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET> (referer:
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET> (referer:
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET> (referer:
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET> (referer:
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET> (referer:
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET> (referer:
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>
2016-03-08 15:16:31 [scrapy] INFO: Closing spider (finished)
2016-03-08 15:16:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/request_bytes': 2577,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 3968,
 'downloader/response_count': 8,
 'downloader/response_status_count/200': 8,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 8, 14, 16, 31, 761515),
 'log_count/DEBUG': 20,
 'log_count/ERROR': 4,
 'log_count/INFO': 14,
 'request_depth_max': 1,
 'response_received_count': 8,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2016, 3, 8, 14, 16, 30, 427657)}
2016-03-08 15:16:31 [scrapy] INFO: Spider closed (finished)
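
If only a saved log is available, the failing hosts can also be recovered from the ERROR lines that errback_httpbin writes. This is a minimal sketch that assumes the log format shown above; dns_failures is a hypothetical helper name, not a Scrapy function.

```python
import re
from urllib.parse import urlparse

# Matches the message logged by errback_httpbin: "DNSLookupError on <url>"
DNS_ERROR = re.compile(r"ERROR: DNSLookupError on (\S+)")


def dns_failures(log_lines):
    """Return the sorted, de-duplicated hosts named in DNSLookupError log lines."""
    hosts = set()
    for line in log_lines:
        match = DNS_ERROR.search(line)
        if match:
            host = urlparse(match.group(1)).hostname
            if host:
                hosts.add(host)
    return sorted(hosts)


sample = [
    "2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/5",
    "2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/9",
    "2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200>",
]
print(dns_failures(sample))  # prints ['httpbin.organisation']
```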

