How to properly use Rules and restrict_xpaths to crawl and parse URLs with Scrapy?

2024-06-25

I am trying to write a crawl spider that crawls a website's RSS feeds and then parses the meta tags of the articles.

The first RSS page lists the RSS categories. I managed to extract the links because each <a> tag sits inside a <td> tag. It looks like this:

        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject1">subject1</a>
           </td>   
        </tr>
        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject2">subject2</a>
           </td>
        </tr>

Clicking one of those links takes you to the articles for that RSS category, which look like this:

   <li class="regularitem">
    <h4 class="itemtitle">
        <a href="http://example.com/article1">article1</a>
    </h4>
  </li>
  <li class="regularitem">
     <h4 class="itemtitle">
        <a href="http://example.com/article2">article2</a>
     </h4>
  </li>

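As you can see, I can get the link with XPath again if I use the <h4> tag. I want my crawler to follow the link inside that tag and parse the meta tags for me.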

Here is my crawler code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles')]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
           item = exampleItem()
           item['link'] = response.url
           item['meta_name'] =m.select('@name').extract()
           item['meta_value'] = m.select('@content').extract()
           items.append(item)
        return items

However, this is the output when I run the crawler:

DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

What am I doing wrong here? I have read the documentation over and over, but I feel I keep overlooking something. Any help would be appreciated.

EDIT: Added items.append(item). I forgot it in the original post.

EDIT: I also tried the following, and it produced the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*',], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')),follow=True),]


    def parse(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_link)


    def parse_link(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_again)    

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

You are returning an empty items list; you need to append each item to items.
You can also yield item inside the loop.
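
As a rough illustration of the "yield inside the loop" suggestion, here is a minimal sketch. To keep it runnable without a crawl, it swaps Scrapy's selectors for the standard library's html.parser; MetaTagParser, the standalone parse_articles(url, html) signature, and the sample HTML are all assumptions made for this example, not part of the original spider.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Hypothetical helper: collects the attributes of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.meta_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep them as a dict.
        if tag == "meta":
            self.meta_tags.append(dict(attrs))


def parse_articles(url, html):
    """Yield one item per <meta> tag instead of building and returning a list.

    Stand-in for the spider callback: takes a URL and an HTML string
    (an assumption) rather than a Scrapy response object.
    """
    parser = MetaTagParser()
    parser.feed(html)
    for attrs in parser.meta_tags:
        # Yielding here means no item can be lost by a forgotten append().
        yield {
            "link": url,
            "meta_name": attrs.get("name"),
            "meta_value": attrs.get("content"),
        }


if __name__ == "__main__":
    sample = (
        "<html><head>"
        '<meta name="author" content="Jane Doe">'
        '<meta name="keywords" content="rss, scrapy">'
        "</head><body></body></html>"
    )
    for item in parse_articles("http://example.com/article1", sample):
        print(item)
```

In an actual Scrapy callback, the same loop would typically iterate over response.xpath('//meta') and yield each item; Scrapy accepts callbacks written as generators, so no intermediate list is needed.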
