I am writing a Scrapy spider to scrape today's New York Times articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get the crawl command (scrapy crawl nyt -o out.json) to scrape anything other than the homepage. I'm somewhat at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
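For reference, the shell check described above looks roughly like this (the allow pattern and the le name are just what I used interactively; they mirror the rule in the spider below):

$ scrapy shell http://www.nytimes.com
>>> from datetime import date
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> today = date.today().strftime('%Y/%m/%d')
>>> le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
>>> le.extract_links(response)   # returns a list of Link objects for today's articles

My spider code is below: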
from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor
from ..items import NewsArticle


with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()
with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today


class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls
    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
        yield article
I have found the solution to my problem. I was doing two things wrong:

- I needed to subclass CrawlSpider rather than Spider if I wanted it to crawl sublinks automatically.
- When using CrawlSpider, I needed to use a callback function rather than overriding parse. Per the documentation, overriding parse breaks CrawlSpider functionality.
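Putting both fixes together, a minimal corrected sketch of the spider looks roughly like this (the callback name parse_article is just an illustrative choice; the XPaths and the date-based allow pattern are carried over unchanged from the original code):

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):  # subclass CrawlSpider so the rules are applied automatically
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]
    rules = (
        # point the rule at a custom callback; overriding parse() would break CrawlSpider
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        for story in response.xpath('//article[@id="story"]'):
            article = NewsArticle()
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name').extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article

With this version, scrapy crawl nyt -o out.json follows the homepage links matched by the rule and runs parse_article on each article page instead of only scraping the homepage.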