Scrapy: Automatic Crawling Spiders

2023-05-16

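In the previous posts, the spiders we wrote with the Scrapy framework could only crawl the pages listed in the start URLs. Sometimes we need a spider that keeps crawling multiple pages automatically, so we have to write one that requests further pages on its own. In this post, I use my own blog as the example to show how to write a Scrapy spider that crawls pages automatically.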

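Writing the Items

After creating the project with the scrapy startproject blog --nolog command, the first step is to edit items.py and define the data we care about. The pages we crawl contain a lot of data, but we are not interested in all of it; storing everything would be messy and would waste a lot of server resources. Here we care about each article's title, view count, and publication date, so we extract only these 3 fields for every article on each page. The modified items.py looks like this: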

import scrapy

class BlogItem(scrapy.Item):
    title = scrapy.Field()
    page_views = scrapy.Field()
    published_date = scrapy.Field()

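With items.py written, the structured data we care about is now defined.

Customizing the Item Pipeline

After writing items.py, we still need to process the scraped data further, for example by storing it in a JSON file. This is where the Item Pipeline comes in. Once an Item has been collected in the Spider, it is passed to the Item Pipeline, where several components process it in a defined order. Typical uses of an Item Pipeline are:

Cleaning HTML data
Validating the scraped data, for example checking that an Item contains certain fields
Checking for duplicates and dropping them
Saving the scraped results to a file or a database

Customizing an Item Pipeline is simple. Each Item Pipeline component is a standalone Python class that must implement the process_item method, whose prototype is: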

def process_item(self, item, spider)

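Scrapy calls this method on every Item Pipeline component. It must return an Item object or raise a DropItem exception; a dropped Item is not processed by any later pipeline components. The parameters are:

item: the Item that was scraped
spider: the Spider that scraped the Item

We want to store the scraped Items in a local JSON file. The project's pipelines.py, after modification, looks like this: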
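To make the return-or-drop contract concrete, here is a minimal sketch of a validation pipeline. The class name RequiredFieldsPipeline and its required_fields tuple are hypothetical and not part of this project; the fallback DropItem class is only a stand-in so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in for illustration only, so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    """Hypothetical pipeline: drop any item that is missing a required field."""

    required_fields = ("title", "page_views", "published_date")

    def process_item(self, item, spider):
        for field in self.required_fields:
            # A missing or empty field means the item is incomplete
            if not item.get(field):
                raise DropItem(f"missing field: {field}")
        # Returning the item hands it on to the next pipeline component
        return item
```

An item that passes the check is returned unchanged, while an incomplete one raises DropItem and never reaches the later components.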

import json

class BlogPipeline:

    def __init__(self):
        # Opened in write mode, so each run overwrites the previous blog.json
        self.file = open("blog.json", 'w', encoding='utf-8')

    def process_item(self, item, *args, **kwargs):
        # Each field holds a list with one entry per article on the page
        for i in range(len(item['title'])):
            record = {
                'title': item['title'][i],
                'page_views': item['page_views'][i],
                'published_date': item['published_date'][i],
            }
            # ensure_ascii=False keeps Chinese text readable in the output
            line = json.dumps(record, ensure_ascii=False) + '\n'
            self.file.write(line)
        return item

    def close_spider(self, *args, **kwargs):
        # Called when the spider finishes; release the file handle
        self.file.close()

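The pipelines.py file now writes the scraped blog data to blog.json in the current directory, one JSON object per line.

Activating the Item Pipeline

A customized Item Pipeline does nothing until it is activated. To enable an Item Pipeline component, add its class to the ITEM_PIPELINES setting in settings.py, as follows: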
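Because the pipeline writes one JSON object per line, the file as a whole is not a single JSON document and has to be read back line by line. The helper below is a small sketch of that; the function name load_json_lines is my own and not part of the project.

```python
import json


def load_json_lines(path):
    """Read a file that holds one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines instead of failing on them
            if line:
                records.append(json.loads(line))
    return records
```

After a crawl, calling load_json_lines("blog.json") should return one dict per article, each with the title, page_views and published_date keys.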

ITEM_PIPELINES = {
    'blog.pipelines.BlogPipeline': 300,
}
COOKIES_ENABLED = False  # disable cookies
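The integer assigned to each class sets the order in which the components run: lower values run first, and values are conventionally chosen in the 0 to 1000 range. As a sketch, a project with a second pipeline might be configured as below; blog.pipelines.RequiredFieldsPipeline is a hypothetical class name used only for illustration.

```python
# settings.py (sketch): components with lower values run first
ITEM_PIPELINES = {
    # Hypothetical validation step, not part of the original project
    "blog.pipelines.RequiredFieldsPipeline": 200,
    # The JSON writer from this post runs after validation
    "blog.pipelines.BlogPipeline": 300,
}
```

With this ordering, an item dropped by the validation step never reaches the JSON writer.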

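Writing the Spider

With settings.py configured, we can write the core of the project, the spider file, which crawls the pages automatically and extracts the key information. It looks like this: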

import scrapy
from scrapy.http import Request
from blog.items import BlogItem


class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ["csdn.net"]
    start_urls = ["https://blog.csdn.net/y472360651/article/list/1"]

    def parse(self, response, **kwargs):
        item = BlogItem()
        res = response.xpath(
            '//*[@id="articleMeList-blog"]/div[2]/div/h4/a/text()').extract()
        item['title'] = []
        for i in res:
            i = i.replace('\n', '').replace(' ', '')
            if i:
                item['title'].append(i)

        item['page_views'] = response.xpath('//*[@id="articleMeList-blog"]/div[2]/div/div[1]/p/span[2]/text()').extract()
        item['published_date'] = response.xpath('//*[@id="articleMeList-blog"]/div[2]/div/div[1]/p/span[1]/text()').extract()
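        # Yield the item once extraction is done
        yield item
        # Request pages 2 to 6 in a loop, so 6 pages are crawled in total
        for i in range(2, 7):
            url = f"https://blog.csdn.net/y472360651/article/list/{i}"
            # Yielding a Request with parse as the callback continues the crawl automatically;
            # Scrapy's default duplicate filter skips URLs that were already requested
            yield Request(url=url, callback=self.parse)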

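The key parts of the spider file above are extracting the data we care about with XPath expressions, constructing the URLs to crawl, and yielding Request objects so that the pages are crawled automatically.

We can now run the spider: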
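The title cleanup and the URL construction are plain Python and can be tried without running a crawl. The sketch below pulls them out into two helper functions; the names clean_titles and build_page_urls are my own, and the spider above does this work inline.

```python
def clean_titles(raw_titles):
    """Strip newlines and spaces from extracted text nodes and drop empty results."""
    cleaned = []
    for text in raw_titles:
        text = text.replace("\n", "").replace(" ", "")
        if text:
            cleaned.append(text)
    return cleaned


def build_page_urls(first_page, last_page):
    """Build the article-list URLs for pages first_page..last_page inclusive."""
    base = "https://blog.csdn.net/y472360651/article/list/{}"
    return [base.format(page) for page in range(first_page, last_page + 1)]
```

Note that, just like the spider, clean_titles removes every space, including spaces inside a title. If that is not wanted, str.strip() would trim only the surrounding whitespace.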

scrapy crawl blog --nolog

When the run finishes, a file named blog.json appears in the current directory, containing the blog data we scraped!

That's it. Over~~~
