18. Python Crawlers: The Scrapy Framework

2023-05-16

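The Scrapy Framework

01. Scrapy links
02. The Scrapy crawl workflow
03. Getting started with Scrapy
04. Common settings in settings.py
- 4.1. Using the logging module
- 4.2. Common settings.py configuration in a Scrapy project (to be continued)
05. Scrapy example: the Qiushibaike spider
06. Notes on scrapy.Request
07. How the parse() method works
08. CrawlSpider
- WeChat mini-program CrawlSpider example
09. Sending POST requests with Scrapy (Renren login example)
10. Scrapy Douban login example (captcha recognition) (to do)
11. Downloading images and files with Scrapy (Autohome BMW 5 Series HD images)
12. Downloading images and files with a CrawlSpider (Autohome BMW 5 Series HD images)
13. Downloader middleware: setting a random request header
14. [IP proxy middleware (Kuaidaili)](https://pan.baidu.com/s/1U6KnIFOYhS9NT7iXd4t84g)
15. Scrapy Shell
16. Getting past Boss Zhipin's anti-crawler measures (needs revision)
17. Scraping dynamic web pages
- 17.1. Installing Selenium
- 17.2. Installing chromedriver
- 17.3. A first small example
- 17.4. Locating elements
- 17.5. Working with form elements in Selenium
- 17.6. Action chains
- 17.7. Working with cookies
- 17.8. Waiting for the page
- 17.9. Switching pages
- 17.10. Using a proxy with Selenium
- WebElement objects
18. Selenium Lagou spider
19. Crawling the whole Jianshu site with Scrapy + Selenium and storing it in MySQL
20. Setting a proxy and User-Agent in Selenium
21. [The http://httpbin.org test endpoints explained](https://blog.csdn.net/chang995196962/article/details/91362364)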

01. Scrapy links

Scrapy Chinese-language community site
Official Scrapy website

02. The Scrapy crawl workflow

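Scrapy Engine
- The coordinator: passes data and signals between the other components (provided by Scrapy)
Scheduler
- A queue that holds the requests sent over by the engine (provided by Scrapy)
Downloader
- Downloads the requests sent over by the engine and returns the responses to the engine (provided by Scrapy)
Spider
- Handles the responses sent over by the engine, extracts data and URLs, and hands them back to the engine (you write this)
Item Pipeline
- Processes the data passed on by the engine, for example storing it (you write this)
Downloader Middlewares
- Customizable download extensions, for example setting a proxy, request headers, or cookies
Spider Middlewares
- Can customize requests and filter responses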
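
The hand-offs between these components can be pictured with a toy model. The sketch below is not Scrapy code: `download`, `spider_parse`, and `run` are hypothetical stand-ins written with the standard library only, and the URLs are made up. It only illustrates how the engine moves requests from the scheduler to the downloader, responses to the spider, and items to the pipeline.

```python
from collections import deque


def download(url):
    # Stand-in for the Downloader: returns a fake response instead of doing network I/O.
    return {"url": url, "body": "page for " + url}


def spider_parse(response):
    # Stand-in for the Spider: yields one item and, for the first page only, a follow-up URL.
    yield {"item": response["body"]}
    if response["url"].endswith("/1"):
        yield "http://example.com/2"


def run(start_urls):
    scheduler = deque(start_urls)  # Scheduler: a queue of pending requests
    stored = []                    # Item Pipeline: here it only collects the items
    while scheduler:               # Engine: passes data between the components
        url = scheduler.popleft()
        response = download(url)
        for result in spider_parse(response):
            if isinstance(result, str):
                scheduler.append(result)  # new request goes back to the scheduler
            else:
                stored.append(result)     # item goes to the pipeline
    return stored


items = run(["http://example.com/1"])
print(items)
```

The real engine is asynchronous and event-driven; the synchronous loop here only shows who hands what to whom.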

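03. Getting started with Scrapy

Install: conda install scrapy
Create a Scrapy project
scrapy startproject mySpider
Generate a spider
scrapy genspider xiaofan "xiaofan.com" (usage: scrapy genspider <spider name> <allowed domain>)
Extract data
Flesh out the spider, using XPath or similar methods
Save data
Save the data in a pipeline
Run the spider (from the command line)
scrapy crawl <spider name>
Run the spider from a script
- Create a script start.py in the project root and run it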
```
from scrapy import cmdline

cmdline.execute('scrapy crawl qsbk_spider'.split())
```

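Running several Scrapy spiders at once (crawlall is not a built-in Scrapy command; it is a custom command registered through COMMANDS_MODULE, see setting 8 in section 4.2)

from scrapy import cmdline

cmdline.execute('scrapy crawlall'.split())

The simplest way to save scraped data is the -o option, which writes a file in the given format. Four common formats:

# JSON; non-ASCII characters are written as \uXXXX escapes by default
scrapy crawl itcast -o teachers.json

# JSON Lines; non-ASCII characters are written as \uXXXX escapes by default
scrapy crawl itcast -o teachers.jsonlines

# CSV (comma-separated values), can be opened in Excel
scrapy crawl itcast -o teachers.csv

# XML
scrapy crawl itcast -o teachers.xml

Project structure and the role of the main files (screenshot)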

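04. Common settings in settings.py

4.1. Using the logging module

In a Scrapy project
- Set LOG_LEVEL = "WARNING" in settings
- Set LOG_FILE = "./a.log" in settings  # where the log is saved; once set, log output no longer appears in the terminal
- import logging, create a logger, and use it to write log messages from any file
In an ordinary project
- import logging
- logging.basicConfig(...)  # configure the style and format of the log output
- Create a logger: logger = logging.getLogger(__name__)
- Use that logger from any .py file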
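
The steps for an ordinary project can be sketched as follows. The format string and the `check_price` function are illustrative assumptions, not something taken from the original project.

```python
import logging

# Configure the root logger once, typically in the program's entry point.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

# One logger per module, named after the module.
logger = logging.getLogger(__name__)


def check_price(value):
    # Hypothetical helper: logs a warning for suspicious data.
    if value < 0:
        logger.warning("negative price: %s", value)
        return False
    return True


check_price(-1)  # emits a WARNING line
check_price(10)  # emits nothing
```

In a Scrapy project the same `logging.getLogger(__name__)` call works inside spiders and pipelines, with LOG_LEVEL and LOG_FILE taking the place of `basicConfig`.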

4.2. Common settings.py configuration in a Scrapy project (to be continued)

# 1. Imports
import logging
import datetime
import os


# 2. Project name  TODO: change for your project
BOT_NAME = 'position_project'

# 3. Spider modules
SPIDER_MODULES = ['{}.spiders'.format(BOT_NAME)]
NEWSPIDER_MODULE = '{}.spiders'.format(BOT_NAME)

# 4. Whether to obey robots.txt (the settings.py generated by startproject sets this to True)
ROBOTSTXT_OBEY = False

# 5. User agent (the browser the crawler identifies itself as)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 ' \
             'Safari/537.36 '

# 6. Default request headers (USER_AGENT is configured separately)
DEFAULT_REQUEST_HEADERS = {
    "authority": "www.zhipin.com",
    "method": "GET",
    "path": "/c101010100/?query=python&page=1",
    "scheme": "https",
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"zh-CN,zh;q=0.9",
    "cache-control":"max-age=0",
    "sec-fetch-mode":"navigate",
    "sec-fetch-site":"none",
    "sec-fetch-user":"?1",
    "upgrade-insecure-requests":"1",
    "cookie":"_uab_collina=155192752626463196786582; lastCity=101010100; _bl_uid=nCk6U2X3qyL0knn41r97gqj6tbaI; __c=1577356639; __g=-; __l=l=%2Fwww.zhipin.com%2Fweb%2Fcommon%2Fsecurity-check.html%3Fseed%3D4xwicvOb7q2EkZGCt80nTLZ0vDg%252BzlibDrgh%252F8ybn%252BU%253D%26name%3D89ea5a4b%26ts%3D1577356638307%26callbackUrl%3D%252Fc101010100%252F%253Fquery%253Dpython%2526page%253D1%26srcReferer%3D&r=&friend_source=0&friend_source=0; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1577356640; toUrl=https%3A%2F%2Fwww.zhipin.com%2Fc101010100%2F%3Fquery%3Dpython%26page%3D1%26ka%3Dpage-1; __a=29781409.1551927520.1573210066.1577356639.145.7.53.84; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1577413477; __zp_stoken__=7afdOJ%2Bdzh7nyTlE0EwBT40ChjblHK0zWyGrgNKjNseeImeToJrFVjotrvwrJmc4SAz4ALJJLFiwM6VXR8%2FhRZvbdbnbdscb5I9tbPbE0vSsxADMIDYNDK7qJTzOfZJNR7%2BP",
    "referer":"https://www.zhipin.com/c101010100/?query=python&page=1",

}


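# 7. Log file name with minute resolution, so runs started in different minutes get separate files
time_str = datetime.datetime.now().strftime('%Y-%m-%d %H-%M')
# os.path.join keeps the path portable (the original hard-coded Windows backslashes)
LOG_FILE = os.path.join(os.getcwd(), BOT_NAME, 'logs', '{}.log'.format(time_str))
LOG_LEVEL = 'DEBUG'

# 8. Module holding the custom command that runs several spiders
COMMANDS_MODULE = '{}.commands'.format(BOT_NAME)

# 9. Write readable Chinese text in exported JSON files (https://www.cnblogs.com/linkr/p/7995454.html)
FEED_EXPORT_ENCODING = 'utf-8'

# 10. Item pipelines; the lower the value, the earlier the pipeline runs  TODO: change for your project
ITEM_PIPELINES = {
   '{}.pipelines.PositionProjectPipeline'.format(BOT_NAME): 300,
}

# 11. Throttle the crawl: delay between downloads, in seconds
DOWNLOAD_DELAY = 1

# 12. Downloader middlewares  TODO: change for your project
DOWNLOADER_MIDDLEWARES = {
   '{}.middlewares.RandomUserAgent'.format(BOT_NAME): 1,
}

# 13. Disable cookies
COOKIES_ENABLED = False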

05. Scrapy example: the Qiushibaike spider

qsbk_spider.py

# -*- coding: utf-8 -*-
import scrapy

from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain = "https://www.qiushibaike.com"

    def parse(self, response):
        duanzidivs = response.xpath("//div[@id='content-left']/div")
        for duanzidiv in duanzidivs:
            author = duanzidiv.xpath(".//h2/text()").extract_first().strip()
            content = duanzidiv.xpath(".//div[@class='content']//text()").extract()
            item = QsbkItem(author=author, content=content)
            yield item
        # Follow the link to the next page, if there is one
        next_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()
        if next_url:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)
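
The spider builds the next-page URL by concatenating `base_domain` with the extracted `href`, which works as long as the `href` is a root-relative path. A more general way to resolve links is URL joining; the sketch below uses only the standard library, and `next_page_url` is a hypothetical helper name. Inside a spider, `response.urljoin(href)` does the equivalent relative to the response URL.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    # Return an absolute URL for the next page, or None when there is no link.
    if not href:
        return None
    return urljoin(current_url, href)


print(next_page_url("https://www.qiushibaike.com/text/page/1/", "/text/page/2/"))
print(next_page_url("https://www.qiushibaike.com/text/page/1/", None))
```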

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

pipelines.py, basic approach

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "w", encoding="utf-8")

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), indent=4, ensure_ascii=False)
        self.fp.write(item_json+"\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished...")

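pipelines.py, improved approach 1: JsonItemExporter (writes a single JSON array, so the output has to be handled as one whole document; heavier on memory)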

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
from scrapy.exporters import JsonItemExporter


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding="utf-8", indent=4)
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("Spider finished...")

pipelines.py, improved approach 2: JsonLinesItemExporter (one JSON object per line)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        # No indent here: JSON Lines keeps each item on a single line
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished...")

Exporting to a CSV file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter, CsvItemExporter


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open("qsbk.csv", "wb")
        self.exporter = CsvItemExporter(self.fp,  encoding='utf-8')

    def open_spider(self, spider):
        print('Spider started...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print('Spider finished...')
        self.fp.close()
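
To see why a line-per-item format is convenient for large crawls, here is a minimal standard-library sketch of the JSON Lines idea. It does not use Scrapy's exporters; `write_json_lines` and `read_json_lines` are hypothetical helper names and the sample items are made up.

```python
import io
import json


def write_json_lines(items, fp):
    # Each item becomes one self-contained JSON document on its own line.
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(fp):
    # Items can be read back one at a time, without loading the whole file.
    for line in fp:
        if line.strip():
            yield json.loads(line)


buffer = io.StringIO()
write_json_lines(
    [{"author": "a", "content": ["x"]}, {"author": "b", "content": ["y"]}],
    buffer,
)
buffer.seek(0)
authors = [item["author"] for item in read_json_lines(buffer)]
print(authors)
```

A file holding one JSON array, by contrast, is only valid once the closing bracket has been written and is normally parsed in one go.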

06. Notes on scrapy.Request

Part of the Request source code:

# Excerpt
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None, 
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta

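The most commonly used parameters:

url: the URL to request and process in the next step

callback: the function that handles the Response returned for this request

method: usually not needed; defaults to GET. Can be "GET", "POST", "PUT", and so on; Scrapy upper-cases the string

headers: the request headers. Usually not needed. They typically look like this:
        # Familiar to anyone who has written a crawler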
        Host: media.readthedocs.org
        User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
        Accept: text/css,*/*;q=0.1
        Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
        Accept-Encoding: gzip, deflate
        Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
        Cookie: _ga=GA1.2.1612165614.1415584110;
        Connection: keep-alive
        If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
        Cache-Control: max-age=0

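meta: used frequently; a dict for passing data between requests, for example from a request to the callback that handles its response (where it is available as response.meta). Some keys, such as dont_merge_cookies in the example, have a special meaning to Scrapy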

        request_with_cookies = Request(
            url="http://www.example.com",
            cookies={'currency': 'USD', 'country': 'UY'},
            meta={'dont_merge_cookies': True}
        )

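encoding: the default 'utf-8' is fine.

dont_filter: tells the scheduler not to filter this request. Use it when you want to send an identical request more than once and bypass the duplicate filter. Defaults to False.

errback: the function that handles errors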

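07. How the parse() method works

Because parse() uses yield rather than return, it is a generator. Scrapy pulls the results from it one at a time and checks the type of each result.
A Request is added to the crawl queue, an item is handed to the pipelines, and any other type is reported as an error.
When Scrapy receives a Request it does not send it immediately; it only puts it in the scheduler's queue and then continues pulling from the generator.
When it receives an item, it passes the item to the enabled pipelines for processing.
parse() is attached to a Request as its callback, which designates it to handle the response: scrapy.Request(url, callback=self.parse)
After scheduling and downloading, each Request produces a scrapy.http.Response object that is passed back to parse(); this repeats until the scheduler has no Requests left (a recursive pattern).
Once the generator is exhausted, that call to parse() is finished, and the engine carries on with whatever is still in the queue and the pipelines.
Downloads run asynchronously, so the order in which pages are fetched and their items reach the pipelines is not guaranteed to match the order in which the requests were yielded.
The Scrapy engine and scheduler take care of all of this.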
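
The type-based routing of generator results can be illustrated with a toy consumer. This is a sketch of the idea, not Scrapy's implementation: `FakeRequest`, `parse`, and `consume` are hypothetical names, and items are plain dicts.

```python
from collections import deque


class FakeRequest:
    # Stand-in for scrapy.Request; it only carries a URL.
    def __init__(self, url):
        self.url = url


def parse(page):
    # A generator callback: yields an item first, then a follow-up request.
    yield {"title": "item from " + page}
    yield FakeRequest(page + "/next")


def consume(results, queue, pipeline):
    # Pull results one at a time and route each by its type.
    for result in results:
        if isinstance(result, FakeRequest):
            queue.append(result)      # queued, not sent immediately
        elif isinstance(result, dict):
            pipeline.append(result)   # handed to the pipeline
        else:
            raise TypeError("unsupported result: {!r}".format(result))


queue, pipeline = deque(), []
consume(parse("http://example.com"), queue, pipeline)
print(len(queue), pipeline)
```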

08. CrawlSpider

Command to create one: scrapy genspider -t crawl <spider name> <spider domain>

WeChat mini-program CrawlSpider example

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2']

    rules = (
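        # Follow every list page to find detail links; list pages themselves need no callback
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d+'), follow=True),
        # Rule for detail pages; do not follow links from them, to avoid duplicates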
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False)
    )

    def parse_detail(self, response):
        title = response.xpath("//div[@class='cl']/h1/text()").get()
        item = WxappItem(title=title)
        return item

Note: never name the callback parse. CrawlSpider uses the parse method to implement its own logic, so if you override parse the crawl spider will stop working.
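
A quick way to check which URLs a rule's `allow` pattern would accept is to try the regular expressions directly. The sketch below assumes the patterns are applied with search semantics against the full URL; the article URL and the `classify` helper are made up for illustration, and the list pattern uses `page=\d+` so that every list page matches.

```python
import re

LIST_PAGE = re.compile(r".+mod=list&catid=2&page=\d+")
DETAIL_PAGE = re.compile(r".+article-.+\.html")


def classify(url):
    # Mirror the two rules: detail pages get parsed, list pages are only followed.
    if DETAIL_PAGE.search(url):
        return "detail"
    if LIST_PAGE.search(url):
        return "list"
    return "ignored"


print(classify("http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3"))
print(classify("http://www.wxapp-union.com/article-5798-1.html"))
print(classify("http://www.wxapp-union.com/about"))
```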

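09. Sending POST requests with Scrapy (Renren login example)

Use yield scrapy.FormRequest(url, formdata, callback) to send a POST request.
If the program should start with a POST request, override the Spider's start_requests(self) method; the URLs in start_urls are then no longer requested.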

# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
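        """
        Overrides start_requests to simulate logging in to Renren
        """
        url = "http://www.renren.com/PLogin.do"
        # Placeholder credentials; replace with your own account
        data = {"email": "your_account@example.com", "password": "your_password"}
        # A POST request needs FormRequest; this simulates the login
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)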
        yield request

    def parse_page(self, response):
        """
        After a successful login, visit the personal profile page
        """
        # GET request that fetches the profile page
        request = scrapy.Request(url="http://www.renren.com/446858319/profile", callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open("profile.html", "w", encoding="utf-8") as fp:
            fp.write(response.text)

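10. Scrapy Douban login example (captcha recognition) (to do)

11. Downloading images and files with Scrapy (Autohome BMW 5 Series HD images)

Approach 1: the traditional way of downloading
bmw5_spider.py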

# -*- coding: utf-8 -*-
import scrapy
from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(scrapy.Spider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxs:
            category = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            print(category)
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = Bmw5Item(category=category, urls=urls)
            yield item

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request


class Bmw5Pipeline(object):

    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']

        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Bmw5Item(scrapy.Item):
    category = scrapy.Field()
    urls = scrapy.Field()

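Approach 2
Downloading images with the Images Pipeline
1. Define an Item with two fields, image_urls and images. image_urls holds the URLs of the images to download and must be a list.
2. When a download finishes, information about it (the storage path, the source URL, the checksum of the image, and so on) is stored in the item's images field.
3. In settings.py, set IMAGES_STORE to the directory the images should be saved in.
4. Enable the pipeline by adding scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.

Downloading files with the Files Pipeline

1. Define an Item with two fields, file_urls and files. file_urls holds the URLs of the files to download and must be a list.
2. When a download finishes, information about it (the storage path, the source URL, the checksum of the file, and so on) is stored in the item's files field.
3. In settings.py, set FILES_STORE to the directory the files should be saved in.
4. Enable the pipeline by adding scrapy.pipelines.files.FilesPipeline: 1 to ITEM_PIPELINES.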
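
The settings named in both lists amount to a few lines in settings.py. The fragment below is a sketch: the directory names `images` and `files` are assumptions, and in practice you would enable only the pipeline you need. ImagesPipeline additionally requires the Pillow package to be installed.

```python
import os

# Directories where downloaded images and files are stored (names are illustrative).
IMAGES_STORE = os.path.join(os.getcwd(), "images")
FILES_STORE = os.path.join(os.getcwd(), "files")

# Enable the built-in media pipelines.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
    "scrapy.pipelines.files.FilesPipeline": 1,
}
```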

A custom Images Pipeline

bmw5_spider.py

# -*- coding: utf-8 -*-
import scrapy
from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(scrapy.Spider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxs:
            category = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            print(category)
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = Bmw5Item(category=category, image_urls=urls)
            yield item

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os

from scrapy.pipelines.images import ImagesPipeline

from bmw5.settings import IMAGES_STORE


class BMWImagesPipeline(ImagesPipeline):
    """
    Custom image downloader
    """
    def get_media_requests(self, item, info):
        # Called before the download requests are sent;
        # this method is in fact what creates those requests
        request_objs = super(BMWImagesPipeline, self).get_media_requests(item, info)

        for request_obj in request_objs:
            request_obj.item = item

        return request_objs

    def file_path(self, request, response=None, info=None):
        # Called when an image is about to be stored, to obtain its storage path
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        # Get the category from the item attached in get_media_requests
        category = request.item['category']
        image_store = IMAGES_STORE
        category_path = os.path.join(image_store, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)

        image_name = path.replace("full/", "")
        image_path = os.path.join(category_path, image_name)

        return image_path

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Bmw5Item(scrapy.Item):
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py

ITEM_PIPELINES = {
    # 'bmw5.pipelines.Bmw5Pipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'bmw5.pipelines.BMWImagesPipeline': 1,
}

12. Downloading images and files with a CrawlSpider (Autohome BMW 5 Series HD images)

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(CrawlSpider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    rules = {
        Rule(LinkExtractor(allow=r"https://car.autohome.com.cn/pic/series/65.+"), callback="parse_page", follow=True),
    }

    def parse_page(self, response):
        category = response.xpath("//div[@class='uibox']/div/text()").get()
        srcs = response.xpath("//div[contains(@class,'uibox-con')]/ul/li//img/@src").getall()
        srcs = list(map(lambda url: response.urljoin(url.replace("240x180_0_q95_c42", "1024x0_1_q95")), srcs))
        item = Bmw5Item(category=category, image_urls=srcs)
        yield item

13. Downloader middleware: setting a random request header

Set a random request header (Chrome, Firefox, Safari)
User-Agent strings (link)

httpbin.py

# -*- coding: utf-8 -*-
import scrapy
import json


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        useragent = json.loads(response.text)['user-agent']
        print('=' * 30)
        print(useragent)
        print('=' * 30)
        yield scrapy.Request(self.start_urls[0], dont_filter=True)

middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import random


class UserAgentDownloadMiddleware(object):
    USER_AGENTS = ['Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
                   'Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)',
                   'Mozilla/4.0 (compatible; MSIE 7.0; America Online Browser 1.1; Windows NT 5.1; (R1 1.5); .NET CLR 2.0.50727; InfoPath.1)']

    def process_request(self, request, spider):
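        """
        Called before the downloader sends the request. This is the usual place to set a random proxy IP, request headers, and so on.
        request: the Request object being sent
        spider: the Spider that issued the request
        Return value:
            1. None: Scrapy continues processing this request and runs the remaining middlewares
            2. A Response object: Scrapy does not call any further process_request methods and returns this response directly.
                The process_response() methods of the active middlewares are still called for every response returned
            3. A Request object: the original request is dropped and the returned one is downloaded instead
            4. If this method raises an exception, process_exception is called
        """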
        useragent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = useragent

middlewares.py, improved version
- Note: USER_AGENT_LIST has been moved into its own module (see the notes on setting a random request header)

import random
from position_project.conf.user_agent import USER_AGENT_LIST


class RandomUserAgent(object):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
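
The contents of `position_project.conf.user_agent` are not shown. A minimal sketch of what such a module and the random choice might look like, using two of the User-Agent strings from the first version; `with_random_user_agent` is a hypothetical helper that works on a plain dict so it can be tried without Scrapy.

```python
import random

# Assumed contents of the user_agent module: a plain list of User-Agent strings.
USER_AGENT_LIST = [
    "Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)",
    "Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)",
]


def with_random_user_agent(headers):
    # Return a copy of the headers with a randomly chosen User-Agent.
    updated = dict(headers)
    updated["User-Agent"] = random.choice(USER_AGENT_LIST)
    return updated


headers = with_random_user_agent({"Accept": "*/*"})
print(headers["User-Agent"])
```

The middleware still has to be enabled in DOWNLOADER_MIDDLEWARES, as in setting 12 of section 4.2.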

14. IP proxy middleware (Kuaidaili)

Open proxy

import random


class IPProxyDownloadMiddleware(object):
    """
    Open proxy (not the same thing as a free proxy)
    """
    PROXIES = ["178.44.170.152:8080", "110.44.113.182:8000"]

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        # The documented form of a proxy URL includes the scheme
        request.meta['proxy'] = "http://" + proxy

Dedicated proxy

import base64


class IPProxyDownloadMiddleware(object):
    """
    Dedicated (private) proxy
    """
    def process_request(self, request, spider):
        proxy = 'http://121.199.6.124:16816'
        # Placeholder credentials in the form "username:password"; use your own
        user_password = 'username:password'
        request.meta['proxy'] = proxy
        # b64encode takes and returns bytes
        b64_user_password = base64.b64encode(user_password.encode("utf-8"))
        request.headers["Proxy-Authorization"] = 'Basic ' + b64_user_password.decode("utf-8")
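
The Proxy-Authorization value is ordinary HTTP Basic authentication: the text `username:password`, Base64-encoded, after the word `Basic`. A small standard-library sketch with made-up credentials; `basic_proxy_auth` is a hypothetical helper name.

```python
import base64


def basic_proxy_auth(user, password):
    # Base64 works on bytes, so encode the text first and decode the result back.
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


header = basic_proxy_auth("user", "secret")
print(header)
```

Base64 is an encoding, not encryption, so anyone who sees the header can recover the credentials.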

15. Scrapy Shell

Start command

scrapy shell "http://www.itcast.cn/channel/teacher.shtml"

Use the response object to write and debug XPath expressions

16. Getting past Boss Zhipin's anti-crawler measures (needs revision)

spiders.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from boss.items import BossItem


class ZhipingSpider(CrawlSpider):
    name = 'zhipin'
    allowed_domains = ['zhipin.com']
    start_urls = ['http://www.zhipin.com/c101010100/?query=python&page=1']

    rules = (
        # Rule matching list pages such as https://www.zhipin.com/c101010100/?query=python&page=1
        Rule(LinkExtractor(allow=r'.+\?query=python&page=\d+'), follow=True),
        # Rule matching detail pages
        Rule(LinkExtractor(allow=r'.+job_detail/.+\.html'), callback="parse_job", follow=False),

    )

    def parse_job(self, response):
        print("*" * 100)
        name = response.xpath("//div[@class='name']/h1/text()").get()
        salary = response.xpath("//div[@class='name']/span[@class='salary']/text()").get()
        job_info = response.xpath("//div[@class='job-sec']//text()").getall()
        job_info = list(map(lambda x: x.strip(), job_info))
        job_info = "".join(job_info)
        job_info = job_info.strip()
        print(job_info)
        item = BossItem(name=name, salary=salary, job_info=job_info)
        yield item

settings.py

DEFAULT_REQUEST_HEADERS = {
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"zh-CN,zh;q=0.9",
    "cache-control":"max-age=0",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "sec-fetch-mode":"navigate",
    "sec-fetch-site":"none",
    "sec-fetch-user":"?1",
    "upgrade-insecure-requests":"1",
    # The original value was a long, session-specific cookie string copied from a browser
    # (broken across two lines, which is a syntax error); paste your own cookie string here
    "cookie": "<cookie string copied from your browser session>"
}

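17. Scraping dynamic web pages

Analyse the interface the page calls via AJAX, then request that interface directly in code
Use Selenium + Chromedriver to simulate browser behaviour and collect the data
- Common Selenium operations
- Selenium-Python Chinese documentation (link)

17.1. Installing Selenium

conda install selenium

17.2. Installing chromedriver

Download link
After downloading, put it in a directory that needs no special permissions and whose path contains only ASCII characters
The chromedriver version must match the browser version; the 32-bit driver also works with a 64-bit browser

17.3. A first small example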

from selenium import webdriver
import time

driver_path = r"D:\chromedriver\chromedriver.exe"

driver = webdriver.Chrome(executable_path=driver_path)

driver.get("http://www.baidu.com")
# page_source returns the source code of the page
print(driver.page_source)

time.sleep(3)
driver.close()

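17.4. Locating elements

If you only need to parse data out of the page, hand the page source to lxml: lxml is implemented in C, so parsing is faster.
If you need to interact with an element, for example typing into a text box or clicking a button, you must use the element-finding methods that Selenium provides.
Note that the find_element_by_* methods and the executable_path argument used in these examples belong to the Selenium 3 API; Selenium 4 replaces them with find_element(By.ID, ...) and a Service object.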

from selenium import webdriver
import time
from lxml import etree

driver_path = r"D:\chromedriver\chromedriver.exe"

driver = webdriver.Chrome(executable_path=driver_path)

driver.get("http://www.baidu.com")
# page_source returns the source code of the page
print(driver.page_source)

# inputTag = driver.find_element_by_id("kw")
# inputTag = driver.find_element_by_name("wd")
# inputTag = driver.find_element_by_class_name("s_ipt")
inputTag = driver.find_element_by_xpath("//input[@class='s_ipt']")
inputTag.send_keys("迪丽热巴")
htmlE = etree.HTML(driver.page_source)

print(htmlE)
time.sleep(3)
driver.close()

17.5. Working with form elements in Selenium

Text boxes

inputTag = driver.find_element_by_xpath("//input[@class='s_ipt']")
inputTag.send_keys("迪丽热巴")

time.sleep(3)

inputTag.clear()

Checkboxes

inputTag = driver.find_element_by_name("remember")
inputTag.click()

Select elements
Buttons

17.6. Action chains

from selenium import webdriver
import time
from selenium.webdriver.common.action_chains import  ActionChains

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("http://www.baidu.com")

inputTag = driver.find_element_by_xpath("//input[@class='s_ipt']")
submitBtn = driver.find_element_by_id('su')

actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag, '黄渤')
actions.move_to_element(submitBtn)
actions.click(submitBtn)
actions.perform()

time.sleep(6)

inputTag.clear()


driver.close()

17.7. Working with cookies

17.8. Waiting for the page

Implicit wait

driver.implicitly_wait(10)

Explicit wait

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("http://www.douban.com")
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'app-title'))
)
print(element)
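
An explicit wait is essentially a polling loop: evaluate a condition repeatedly until it returns something truthy or a timeout expires. The sketch below shows that mechanism with the standard library only; `wait_until` is a hypothetical helper, not Selenium's implementation. WebDriverWait itself polls every 0.5 seconds by default and raises TimeoutException when the time runs out.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    # Call condition() until it returns a truthy value or the timeout expires.
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(poll_interval)


attempts = []


def ready():
    # Becomes true on the third call, like an element that appears after a delay.
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(ready, timeout=2, poll_interval=0.01))
```

An implicit wait, by contrast, is a single driver-wide timeout applied to every element lookup.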

17.9. Switching pages

from selenium import webdriver

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("http://www.jd.com")

driver.execute_script("window.open('https://www.douban.com/')")
print(driver.window_handles)
driver.switch_to.window(driver.window_handles[1])

print(driver.current_url)

17.10. Using a proxy with Selenium

from selenium import webdriver

driver_path = r"D:\chromedriver\chromedriver.exe"

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://60.17.239.207:31032")
# options= replaces the deprecated chrome_options= keyword
driver = webdriver.Chrome(executable_path=driver_path, options=options)

driver.get("http://www.jd.com")

WebElement objects

18. Selenium Lagou spider

The traditional approach

import requests
from lxml import etree
import time
import re

# Request headers
HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Referer": "https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput=",
    "Host": "www.lagou.com",
}

def request_list_page():
    url1 = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='

    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'

    # Paging is controlled through the POST data

    for page in range(1, 2):
        data = {
            'first': 'false',
            'pn': page,
            'kd': 'python'
        }
        s = requests.Session()  # create a session
        response = s.get(url=url1, headers=HEADERS, timeout=3)
        cookie = s.cookies  # cookies set by the list page
        respon = s.post(url=url, headers=HEADERS, data=data, cookies=cookie, timeout=3)
        time.sleep(7)
        result = respon.json()
        positions = result['content']['positionResult']['result']
        for position in positions:
            positionId = position['positionId']
            position_url = "https://www.lagou.com/jobs/{}.html".format(positionId)
            parse_position_detail(position_url, s)
            break

def parse_position_detail(url, s):
    response = s.get(url, headers=HEADERS)
    text = response.text
    htmlE = etree.HTML(text)
    position_name = htmlE.xpath("//div[@class='job-name']/@title")[0]
    job_request_spans = htmlE.xpath("//dd[@class='job_request']//span")
    salary = job_request_spans[0].xpath("./text()")[0].strip()
    education = job_request_spans[3].xpath("./text()")[0]
    education = re.sub(r"[/ \s]", "", education)
    print(education)
    job_detail = htmlE.xpath("//div[@class='job-detail']//text()")
    job_detail = "".join(job_detail).strip()
    print(job_detail)


if __name__ == '__main__':
    request_list_page()
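
The `re.sub` call in `parse_position_detail` strips slashes and whitespace from the text of a span. A standalone sketch of that cleanup step; `clean_field` is a hypothetical helper and the sample strings are made up.

```python
import re


def clean_field(text):
    # Remove every slash and every whitespace character (spaces, tabs, newlines).
    return re.sub(r"[/\s]", "", text)


print(clean_field("Bachelor /"))
print(clean_field(" 3-5 years /\n"))
```

Because `\s` already matches a space, the character class `[/ \s]` in the original behaves the same as `[/\s]`.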

The Selenium + Chromedriver approach

import re
import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class LagouSpider(object):
    """
    Lagou spider built on Selenium + ChromeDriver
    """
    driver_path = r"D:\chromedriver\chromedriver.exe"

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=LagouSpider.driver_path)
        # This URL is the search page, not the endpoint that actually serves the job data
        self.url = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='
        # Collected job postings
        self.positions = []

    def run(self):
        self.driver.get(self.url)
        while True:
            WebDriverWait(self.driver, 10).until(
                # A locator can only target an element, not an attribute of that element
                EC.presence_of_element_located((By.XPATH, "//div[@class='pager_container']/span[last()]"))
            )

            source = self.driver.page_source
            self.parse_list_page(source)
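            # find_element(By.XPATH, ...) works in Selenium 3 and 4; find_element_by_xpath was removed in Selenium 4
            next_btn = self.driver.find_element(By.XPATH, "//div[@class='pager_container']/span[last()]")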
            if "pager_next_disabled" in next_btn.get_attribute("class"):
                break
            else:
                next_btn.click()

    def parse_list_page(self, source):
        htmlE = etree.HTML(source)
        links = htmlE.xpath("//a[@class='position_link']/@href")
        for link in links:
            self.request_detail_page(link)
            time.sleep(1)

    def request_detail_page(self, url):
        # self.driver.get(url)
        self.driver.execute_script("window.open('{}')".format(url))
        self.driver.switch_to.window(self.driver.window_handles[1])

        WebDriverWait(self.driver, 10).until(
            # EC.presence_of_element_located((By.XPATH, "//div[@class='job-name']/@title"))
            # Only the element itself can be located here, not one of its attributes
            EC.presence_of_element_located((By.XPATH, "//div[@class='job-name']"))
        )

        page_source = self.driver.page_source
        self.parse_detail_page(page_source)
        # Close the detail tab
        self.driver.close()
        # Switch back to the job list tab
        self.driver.switch_to.window(self.driver.window_handles[0])

    def parse_detail_page(self, source):
        htmlE = etree.HTML(source)
        position_name = htmlE.xpath("//div[@class='job-name']/h2/text()")[0]
        company = htmlE.xpath("//div[@class='job-name']/h4/text()")[0]
        job_request_spans = htmlE.xpath("//dd[@class='job_request']//span")
        salary = job_request_spans[0].xpath("./text()")[0].strip()
        salary = re.sub(r"[/ \s]", "", salary)
        city = job_request_spans[1].xpath("./text()")[0].strip()
        city = re.sub(r"[/ \s]", "", city)
        experience = job_request_spans[2].xpath("./text()")[0].strip()
        experience = re.sub(r"[/ \s]", "", experience)
        education = job_request_spans[3].xpath("./text()")[0]
        education = re.sub(r"[/ \s]", "", education)
        # Named job_type to avoid shadowing the built-in type()
        job_type = job_request_spans[4].xpath("./text()")[0]
        job_type = re.sub(r"[/ \s]", "", job_type)
        job_detail = htmlE.xpath("//div[@class='job-detail']//text()")
        job_detail = "".join(job_detail).strip()
        print("Position: %s" % position_name)
        print("Company: %s" % company)
        print("")
        print(salary + "/" + city + "/" + experience + "/" + education + "/" + job_type)
        print("")
        print(job_detail)

        position = {
            'name': position_name,
            'company': company,
            'salary': salary,
            'city': city,
            'experience': experience,
            'education': education,
            'desc': job_detail
        }
        self.positions.append(position)
        # print(position)
        print("=" * 100)


if __name__ == '__main__':
    spider = LagouSpider()
    spider.run()
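
`parse_detail_page` repeats the same `re.sub` call for every field. As a sketch of how that cleanup could be factored out, the helper below applies one pattern to a list of raw span texts. The names `clean_field` and `clean_fields` and the sample values are hypothetical, not part of the original spider.

```python
import re


def clean_field(text):
    """Remove slashes and all whitespace, as the re.sub calls above do."""
    return re.sub(r"[/\s]", "", text)


def clean_fields(raw_values, names):
    """Pair each field name with its cleaned value."""
    return {name: clean_field(value) for name, value in zip(names, raw_values)}


# Hypothetical span texts, shaped like the job_request spans parsed above.
raw = ["15k-30k /", "/Beijing /", "3-5 years /", "Bachelor /", "Full-time"]
fields = clean_fields(raw, ["salary", "city", "experience", "education", "job_type"])
```

Note that the pattern removes every whitespace character, including spaces inside a value, so `"3-5 years /"` becomes `"3-5years"`. If inner spaces matter, stripping only the ends with `text.strip(" /\n\t")` may be a better fit.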

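19. Scrapy + Selenium: crawling the whole Jianshu site and storing it in MySQL

Note: the site currently obfuscates its CSS (see the generated class names such as `_2mYfmT` in the selectors below), so the XPath expressions may stop matching when the page changes.

Item class (target fields)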

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class JianshuProjectItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()

Settings (settings.py)

# 1. Imports
import logging
import datetime
import os


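# 2. Project name. TODO: adjust per project
BOT_NAME = 'jianshu_project'

# 3. Spider modules
SPIDER_MODULES = ['{}.spiders'.format(BOT_NAME)]
NEWSPIDER_MODULE = '{}.spiders'.format(BOT_NAME)

# 4. Whether to obey robots.txt (the generated project template sets True; disabled here)
ROBOTSTXT_OBEY = False

# 5. User agent (the browser the crawler identifies as)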
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 ' \
             'Safari/537.36 '

# 6. Default request headers (USER_AGENT is configured separately)
DEFAULT_REQUEST_HEADERS = {
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"zh-CN,zh;q=0.9",
}


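# 7. Log file: the name carries a minute-resolution timestamp, so runs started in
#    different minutes get separate files (settings are evaluated once at startup,
#    so the file does not rotate during a run)
time_str = datetime.datetime.now().strftime('%Y-%m-%d %H-%M')
LOG_FILE = os.path.join(os.getcwd(), BOT_NAME, 'logs', '{}.log'.format(time_str))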
LOG_LEVEL = 'DEBUG'

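# 8. Module holding the custom command that runs several spiders at once
COMMANDS_MODULE = '{}.commands'.format(BOT_NAME)

# 9. Write non-ASCII text (e.g. Chinese) as-is in exported JSON (https://www.cnblogs.com/linkr/p/7995454.html)
FEED_EXPORT_ENCODING = 'utf-8'

# 10. Item pipelines; items pass through lower values first. TODO: adjust per project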
ITEM_PIPELINES = {
   # '{}.pipelines.JianshuProjectPipeline'.format(BOT_NAME): 300,
   '{}.pipelines.JianshuTwistedPipeline'.format(BOT_NAME): 300,
}

# 11. Throttle the crawl: delay between requests, in seconds
DOWNLOAD_DELAY = 1

# 12. Downloader middlewares. TODO: adjust per project
DOWNLOADER_MIDDLEWARES = {
   '{}.middlewares.RandomUserAgent'.format(BOT_NAME): 1,
   '{}.middlewares.SeleniumDownloadMiddleware'.format(BOT_NAME): 2
}

# 13. Disable cookies
COOKIES_ENABLED = False
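
The numbers in `ITEM_PIPELINES` and `DOWNLOADER_MIDDLEWARES` are priorities: for `process_request`, components with lower values are called first. The sketch below only illustrates that ordering by sorting the dictionary. It is not how Scrapy is implemented internally, and the helper name `call_order` is hypothetical.

```python
def call_order(component_priorities):
    """Return component paths sorted by ascending priority value."""
    return [path for path, _ in
            sorted(component_priorities.items(), key=lambda kv: kv[1])]


BOT_NAME = 'jianshu_project'
DOWNLOADER_MIDDLEWARES = {
    '{}.middlewares.SeleniumDownloadMiddleware'.format(BOT_NAME): 2,
    '{}.middlewares.RandomUserAgent'.format(BOT_NAME): 1,
}

# RandomUserAgent (1) sorts ahead of SeleniumDownloadMiddleware (2)
order = call_order(DOWNLOADER_MIDDLEWARES)
```

The ordering matters in this project: `RandomUserAgent` has to run before `SeleniumDownloadMiddleware`, because the latter returns a response from `process_request` and may read the `User-Agent` header that the former sets.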

Spider class

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu_project.items import JianshuProjectItem


class JianshuSpider(CrawlSpider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath('//div[@id="__next"]/div[1]/div/div/section[1]/h1/text()').extract_first()
        avatar = response.xpath("//div[@class='_2mYfmT']//a[@class='_1OhGeD']/img/@src").extract_first()
        author = response.xpath("//span[@class='FxYr8x']/a/text()").extract_first()
        pub_time = response.xpath(
            '//div[@id="__next"]/div[1]/div/div/section[1]/div[1]/div/div/div[2]/time/text()').extract_first()
        url = response.url
        url1 = url.split('?')[0]
        article_id = url1.split('/')[-1]

        content = response.xpath("//article[@class='_2rhmJa']").extract_first()

        item = JianshuProjectItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=url,
            article_id=article_id,
            content=content
        )
        yield item
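
The link rule and the `article_id` extraction in `parse_detail` can be tried out on their own. This is a minimal sketch using only the standard library; the URL is a made-up example and `article_id_from_url` is a hypothetical helper that mirrors the `split('?')` / `split('/')` steps above.

```python
import re
from urllib.parse import urlsplit

# Same pattern as the LinkExtractor rule above
ARTICLE_LINK = re.compile(r'.*/p/[0-9a-z]{12}.*')


def article_id_from_url(url):
    """Return the last path segment of the URL, ignoring query and fragment."""
    path = urlsplit(url).path
    return path.rstrip('/').split('/')[-1]


url = 'https://www.jianshu.com/p/0123456789ab?utm_source=desktop'  # hypothetical URL
is_article = bool(ARTICLE_LINK.match(url))
article_id = article_id_from_url(url)
```

Compared with splitting on `?` by hand, `urlsplit` also drops a `#fragment`, and `rstrip('/')` handles a trailing slash.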

Pipelines

import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi


class JianshuProjectPipeline(object):
    """Synchronous insert: each item blocks until it is committed."""
    def __init__(self):
        dbparams = {
            'host': 'mini1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'db_jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        print('*' * 300)
        print(item)
        self.cursor.execute(self.sql, (item['title'], item['content'],
                                       item['author'], item['avatar'],
                                       item['pub_time'], item['article_id'],
                                       item['origin_url']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
            insert into tb_article (id,title,content,author,avatar,pub_time,article_id, origin_url) values(null,%s,%s,%s,%s,%s,%s,%s)
            """
            return self._sql
        return self._sql


class JianshuTwistedPipeline(object):
    """Asynchronous insert through Twisted's adbapi connection pool."""

    def __init__(self):
        dbparams = {
            'host': 'mini1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'db_jianshu',
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self._sql = None

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
                insert into tb_article (id,title,content,author,avatar,pub_time,article_id, origin_url) values(null,%s,%s,%s,%s,%s,%s,%s)
                """
            return self._sql
        return self._sql

    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)
        return item

    def insert_item(self, cursor, item):
        cursor.execute(self.sql, (item['title'], item['content'],
                                  item['author'], item['avatar'],
                                  item['pub_time'], item['article_id'],
                                  item['origin_url']))

    def handle_error(self, error, item, spider):
        print('*' * 100)
        print('error:', error)
        print('*' * 100)
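
Both pipelines rely on a parameterized `insert` statement, with the values bound by the database driver rather than formatted into the SQL string. To try the idea without a MySQL server, the sketch below swaps in the standard-library `sqlite3` module as a stand-in. That substitution is an assumption for illustration only, and the helper names and sample item are hypothetical.

```python
import sqlite3

COLUMNS = ('title', 'content', 'author', 'avatar',
           'pub_time', 'article_id', 'origin_url')


def build_insert_sql(table, columns, placeholder):
    """Build a parameterized INSERT; values are bound by the driver, never formatted in."""
    marks = ', '.join([placeholder] * len(columns))
    return 'insert into {} ({}) values ({})'.format(table, ', '.join(columns), marks)


def insert_item(cursor, item):
    # sqlite3 uses '?' placeholders; pymysql, as used above, uses '%s'.
    sql = build_insert_sql('tb_article', COLUMNS, '?')
    cursor.execute(sql, tuple(item[c] for c in COLUMNS))


# In-memory database standing in for db_jianshu
conn = sqlite3.connect(':memory:')
conn.execute('create table tb_article (id integer primary key, {})'.format(
    ', '.join('{} text'.format(c) for c in COLUMNS)))

item = {c: 'demo-{}'.format(c) for c in COLUMNS}  # hypothetical item
insert_item(conn.cursor(), item)
conn.commit()
rows = conn.execute('select title, article_id from tb_article').fetchall()
```

`insert_item(cursor, item)` has the same shape as the method passed to `runInteraction` in `JianshuTwistedPipeline`, which supplies the cursor as the first argument.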

Middlewares

import random
from jianshu_project.conf.user_agent import USER_AGENT_LIST
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse


class RandomUserAgent(object):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)


class SeleniumDownloadMiddleware(object):

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'D:\chromedriver\chromedriver.exe')

    def process_request(self, request, spider):

        self.driver.get(request.url)
        time.sleep(1)
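        try:
            # Keep clicking "load more" until the button is gone; at that point
            # find_element_by_class_name raises and the loop ends via the except
            while True:
                loadMore = self.driver.find_element_by_class_name('load-more')
                loadMore.click()
                time.sleep(0.3)
        except Exception:
            pass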

        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response

Any parts omitted here are covered in the earlier sections.

20. Setting a proxy and User-Agent in Selenium

import random
from useragent_demo.conf.user_agent import USER_AGENT_LIST
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse


class RandomUserAgent(object):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)


class SeleniumDownloadMiddleware1(object):

    def process_request(self, request, spider):
        options = webdriver.ChromeOptions()
        # Scrapy stores header values as bytes, so decode before formatting
        user_agent = request.headers['User-Agent'].decode('utf-8')
        options.add_argument('user-agent={}'.format(user_agent))  # apply the random User-Agent
        # options.add_argument('--proxy-server={}'.format(request.headers['proxy']))  # set a proxy
        driver = webdriver.Chrome(options=options, executable_path=r'D:\chromedriver\chromedriver.exe')

        driver.get(request.url)
        source = driver.page_source
        response = HtmlResponse(url=driver.current_url, body=source, request=request, encoding='utf-8')
        driver.quit()  # quit() also stops the chromedriver process; close() only closes the window
        return response


class SeleniumDownloadMiddleware(object):

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'D:\chromedriver\chromedriver.exe')
        self.options = self.driver.create_options()

    def process_request(self, request, spider):
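        # NOTE: Chrome reads its options only at startup, so adding an argument here
        # does not change the User-Agent of the already running driver
        self.options.add_argument('user-agent={}'.format(request.headers['User-Agent'].decode('utf-8')))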
        self.driver.get(request.url)
        time.sleep(1)
        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response
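
Both middlewares build Chrome command-line arguments from request data. The sketch below isolates that step so it can be checked without starting a browser. The helper names `to_text` and `chrome_arguments` and the sample values are hypothetical. The only assumption about Scrapy is that header values arrive as bytes.

```python
def to_text(value, encoding='utf-8'):
    """Scrapy header values are bytes; Chrome arguments must be str."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def chrome_arguments(user_agent, proxy=None):
    """Return the Chrome command-line arguments for a UA and an optional proxy."""
    args = ['user-agent={}'.format(to_text(user_agent))]
    if proxy:
        args.append('--proxy-server={}'.format(to_text(proxy)))
    return args


# Hypothetical values; in a middleware they would come from the request.
args = chrome_arguments(b'Mozilla/5.0 (demo)', proxy='http://127.0.0.1:8888')
```

Each returned string would be passed to `options.add_argument(...)` on a `webdriver.ChromeOptions()` instance before the driver is created, since options only take effect at browser startup.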

21. http://httpbin.org test endpoint reference
