Scrapy: getting websites that fail with "DNS lookup failed"

2023-12-02

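I am trying to use Scrapy to get all the linked websites for which the "DNS lookup failed".

The problem is that every website that loads without errors is printed in the parse_obj method, but when a URL returns "DNS lookup failed", the parse_obj callback is not called.

I want to get all the domains that fail with the "DNS lookup failed" error. How can I do that?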

Logs :

2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 12:55:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 12:55:12 [scrapy] DEBUG: Crawled (200) <GET http://domain.com> (referer: None)
2016-03-08 12:55:12 [scrapy] DEBUG: Retrying <GET http://expired-domain.com/> (failed 1 times): DNS lookup failed: address 'expired-domain.com' not found: [Errno 11001] getaddrinfo failed.

Code :

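from urllib.parse import urlparse

from scrapy import Field, Item
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyItem(Item):
    url = Field()


class someSpider(CrawlSpider):
    name = 'Crawler'
    start_urls = ['http://domain.com']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    # only called for responses that were downloaded successfully;
    # a request that fails with a DNS error never reaches this callback
    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=()).extract_links(response):
            parsed_uri = urlparse(link.url)
            url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            print(url)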

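CrawlSpider rules do not let you pass an errback (at least in Scrapy 1.0.5, the version used in the run further down), which is a shame.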

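Here is a variation of another answer I gave for catching DNS errors; it attaches the errback to each request through the rule's process_request hook: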

# -*- coding: utf-8 -*-
import random

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class HttpbinSpider(CrawlSpider):
    name = "httpbin"

    # this will generate test links so that we can see CrawlSpider in action
    start_urls = (
        'https://httpbin.org/links/10/0',
    )
    rules = (
        Rule(LinkExtractor(),
             callback='parse_page',
             # hook to be called when this Rule generates a Request
             process_request='add_errback'),
    )

    # this is just to disable retries for this example spider
    custom_settings = {
        'RETRY_ENABLED': False
    }

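    # method to be called for each Request generated by the Rules above,
    # here, adding an errback to catch all sorts of errors
    # (newer Scrapy releases also pass the originating response as a second
    # argument; the default value keeps this working with either call style)
    def add_errback(self, request, response=None):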
        self.logger.debug("add_errback: patching %r" % request)

        # this is a hack to trigger a DNS error randomly
        rn = random.randint(0, 2)
        if rn == 1:
            newurl = request.url.replace('httpbin.org', 'httpbin.organisation')
            self.logger.debug("add_errback: patching url to %s" % newurl)
            return request.replace(url=newurl,
                                   errback=self.errback_httpbin)

        # this is the general case: adding errback to all requests
        return request.replace(errback=self.errback_httpbin)

    def parse_page(self, response):
        self.logger.info("parse_page: %r" % response)

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

This is what you see on the console:

$ scrapy crawl httpbin
2016-03-08 15:16:30 [scrapy] INFO: Scrapy 1.0.5 started (bot: httpbinlinks)
2016-03-08 15:16:30 [scrapy] INFO: Optional features available: ssl, http11
2016-03-08 15:16:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'httpbinlinks.spiders', 'SPIDER_MODULES': ['httpbinlinks.spiders'], 'BOT_NAME': 'httpbinlinks'}
2016-03-08 15:16:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-08 15:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-08 15:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-08 15:16:30 [scrapy] INFO: Enabled item pipelines: 
2016-03-08 15:16:30 [scrapy] INFO: Spider opened
2016-03-08 15:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 15:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 15:16:30 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/0> (referer: None)
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/5>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/9>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/8> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/7> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/6> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/3> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/4> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/1> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/2> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [scrapy] INFO: Closing spider (finished)
2016-03-08 15:16:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/request_bytes': 2577,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 3968,
 'downloader/response_count': 8,
 'downloader/response_status_count/200': 8,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 8, 14, 16, 31, 761515),
 'log_count/DEBUG': 20,
 'log_count/ERROR': 4,
 'log_count/INFO': 14,
 'request_depth_max': 1,
 'response_received_count': 8,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2016, 3, 8, 14, 16, 30, 427657)}
2016-03-08 15:16:31 [scrapy] INFO: Spider closed (finished)
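
To get from these log lines to the list of failing domains that the question asks for, the DNSLookupError branch of the errback could record the hostname instead of only logging the URL. The following is a minimal sketch that uses only the standard library. `FailedDomains` and `add_url` are hypothetical names, not part of Scrapy, and the two URLs are the ones that failed in the run above.

```python
from urllib.parse import urlparse


class FailedDomains:
    """Collects the distinct hostnames of URLs whose DNS lookup failed."""

    def __init__(self):
        self._seen = set()
        self.domains = []  # hostnames in first-seen order

    def add_url(self, url):
        # urlparse(...).hostname is None when the string has no network location
        host = urlparse(url).hostname
        if host and host not in self._seen:
            self._seen.add(host)
            self.domains.append(host)
        return host


failed = FailedDomains()

# In the spider, this call would sit in the DNSLookupError branch:
#     failed.add_url(failure.request.url)
for url in ("https://httpbin.organisation/links/10/5",
            "https://httpbin.organisation/links/10/9"):
    failed.add_url(url)

print(failed.domains)  # ['httpbin.organisation']
```

Keeping the collector separate from the spider keeps the errback short, and the collected list can be written out when the spider closes.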
