How do I crawl an entire website with Scrapy?

2024-02-01

I can't crawl the whole website; Scrapy only scrapes the surface pages, and I want to crawl deeper. I've been googling for the past 5-6 hours without finding anything that helps. My code is below:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [Rule(LinkExtractor(allow=()),
                  follow=True),
             Rule(LinkExtractor(allow=()), callback='parse_item')
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)

The rules short-circuit: the first rule a link satisfies is the one that gets applied, so your second rule (the one with the callback) is never invoked.
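This first-match behavior can be illustrated with a simplified pure-Python sketch. Note that `Rule` and `match_rule` below are hypothetical stand-ins for demonstration, not Scrapy's actual classes; Scrapy's real implementation extracts links per rule and deduplicates them, but the effect for a link matched by several rules is the same: the earliest matching rule claims it.

```python
import re

class Rule:
    """Toy stand-in for scrapy.spiders.Rule: a URL pattern plus options."""
    def __init__(self, pattern, callback=None, follow=False):
        self.pattern = re.compile(pattern)
        self.callback = callback
        self.follow = follow

def match_rule(rules, url):
    """Return the first rule whose pattern matches the URL.

    Mimics how a link matched by multiple rules is handled: rules are
    checked in order and the first match wins; later rules never see it.
    """
    for rule in rules:
        if rule.pattern.search(url):
            return rule
    return None

# Two rules that both match every link, as in the question:
rules = [
    Rule(r".", follow=True),            # matches first, has no callback
    Rule(r".", callback="parse_item"),  # shadowed: never reached
]

winner = match_rule(rules, "http://www.example.com/page")
print(winner.callback)  # -> None: the rule with the callback was never applied
```

Because both extractors match every link, the follow-only rule always wins and `parse_item` is never scheduled for any response.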

Change your rules to a single rule that both follows links and fires the callback:

rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]
