How to crawl an XML URL with Scrapy

2024-05-13

Hi, I am using Scrapy to crawl an XML URL.

Suppose the following is my Spider.py code:

from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "test"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "https://example.com/jobxml.asp"
    ]

    def parse(self, response):
        print response, "??????????????????????"

result:

2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines: 
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
    {'downloader/request_bytes': 651,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 504,
     'downloader/response_count': 3,
     'downloader/response_status_count/400': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
     'scheduler/memory_enqueued': 3,
     'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 263143424, 'memusage/startup': 263143424}

Is Scrapy not suitable for crawling XML? If so, can anyone give me an example of how to scrape data from XML tags?

Thanks in advance...


There is a spider made specifically for scraping XML feeds. This is from the Scrapy documentation:

XMLFeedSpider example

These spiders are very easy to use; let's look at an example:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        item = TestItem()
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item
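
The example above imports TestItem from myproject.items but never shows it. A minimal sketch of what that item class could look like, assuming only the three fields used in parse_node, would be:

# myproject/items.py -- hypothetical item definition matching the fields
# used in parse_node above (id, name, description).
from scrapy.item import Item, Field

class TestItem(Item):
    id = Field()
    name = Field()
    description = Field()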

Here is another approach that does not use Scrapy:

This is a function for downloading the XML from a given URL (the imports it needs are included below). It also gives you a nice progress readout while the XML file downloads.

import sys
import urllib2


# Presumably a method on some downloader class, hence the self argument.
def get_file(self, dir, url, name):
    # Open the remote XML and a local file to write it to. The output path
    # is hardcoded: dir is unused and name only appears in the progress message.
    s = urllib2.urlopen(url)
    f = open('xml/test.xml', 'w')
    meta = s.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (name, file_size)
    current_file_size = 0
    block_size = 4096
    while True:
        buf = s.read(block_size)
        if not buf:
            break
        current_file_size += len(buf)
        f.write(buf)
        # Overwrite the same console line with bytes written and percent done.
        status = ("\r%10d  [%3.2f%%]" %
                 (current_file_size, current_file_size * 100. / file_size))
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write(status)
        sys.stdout.flush()
    f.close()
    print "\nDone getting feed"
    return 1
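
A hedged usage sketch, assuming get_file is a method of whatever downloader class it came from; remember that dir and name are only used for display, since the output path is hardcoded:

# Hypothetical call; "downloader" stands in for the object get_file belongs to.
downloader.get_file('xml', 'https://example.com/jobxml.asp', 'jobxml')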

Then you parse the downloaded and saved XML file with iterparse, like so:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text

This is just an example of how to walk the XML tree.
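
If the nodes you care about carry attributes or child elements rather than plain text, the same loop can be extended. This is only a sketch, still assuming the hypothetical properties tag and the xml/test.xml path, with made-up id/name/description fields:

from xml.etree.ElementTree import iterparse

# Walk the saved feed, pulling an attribute and two child elements from
# each node; elem.clear() frees memory, which matters for large feeds.
for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.get('id'), elem.findtext('name'), elem.findtext('description')
        elem.clear()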

Also, this is old code of mine, so you would be better off using a with statement to open the file.
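
A minimal sketch of the same downloader with that change applied (the progress printing is dropped for brevity):

import urllib2

# Same download loop, but the with block closes the file automatically,
# even if the read or write raises midway through the download.
def get_file(self, dir, url, name):
    s = urllib2.urlopen(url)
    file_size = int(s.info().getheaders("Content-Length")[0])
    current_file_size = 0
    block_size = 4096
    with open('xml/test.xml', 'w') as f:
        while True:
            buf = s.read(block_size)
            if not buf:
                break
            current_file_size += len(buf)
            f.write(buf)
    print "Downloaded %d of %d bytes of %s" % (current_file_size, file_size, name)
    return 1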
