1. Working with scraped data through pipelines
Point one: a spider file can only hand data to the pipelines by producing it with the yield keyword (i.e. as a generator). Once the spider runs, the execution order is roughly: open_spider() runs first, then the spider file executes, and close_spider() runs when the spider shuts down; overriding these modules or internal methods carelessly can cause trouble. Scrapy's execution order resembles nested function calls: the spider module wraps around the pipeline module. A minimal spider sketch appears at the end of this section.
Point two: open the file first, then operate on it, then close it. Note that with multiple pipelines, priority is determined by the number given when registering each pipeline in settings: the smaller the number, the higher the priority (see the settings sketch below). Unless it checks the source, a second-level pipeline operates on the first-level pipeline's data; and if the first-level pipeline has no return, the second-level pipeline has nothing to work with.
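For reference, a minimal sketch of how the two pipelines below would be registered in settings.py (the module path myspider.pipelines is an assumption based on the class names; the numbers are the priorities):

# settings.py -- smaller number = higher priority (runs first)
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,   # first-level pipeline
    'myspider.pipelines.MyspiderPipeline1': 301,  # second-level pipeline
}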
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

class MyspiderPipeline:
    def __init__(self):
        self.f = open('demo.json', 'w', encoding='utf-8')

    def open_spider(self, spider):  # the second argument is the spider, not an item
        print('spider started')

    def process_item(self, item, spider):
        # print('1'*10)
        # print(spider.name)
        item['age'] = '18'
        item_json = json.dumps(item, ensure_ascii=False)
        self.f.write(item_json + '\n')
        # item['age'] = '18'  # too late: the item has already been written out
        # print(item)
        # print('1'*20)
        # return item  # pipelines are chained through the returned item; without a
        #              # return, lower-priority pipelines cannot touch this pipeline's data

    def close_spider(self, spider):
        print('spider finished')
        self.f.close()  # note the (): self.f.close without them never closes the file
from demo2.items import Demo2Item  # assumed import path, taken from the items example in section 4

class MyspiderPipeline1:
    def process_item(self, item, spider):
        if spider.name == 'don':  # no autocompletion here, just type the spider name
            if isinstance(item, Demo2Item):
                print('it works')
        # print(spider.name)
        print('0'*20)
        print(item)
# output:
# 11111111111111111111
# 00000000000000000000
# {"name": "影讯&购票", "age": "18"}
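As point one says, nothing reaches a pipeline unless the spider yields it. A minimal sketch of such a spider, assuming the spider name 'don' from the pipeline check above and a start URL invented for illustration:

import scrapy

class DonSpider(scrapy.Spider):
    name = 'don'
    start_urls = ['http://example.com/']  # placeholder URL

    def parse(self, response):
        # yielding the dict is what hands it to process_item() in the pipelines
        yield {'name': '影讯&购票'}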
2. Using the logging module
To save error messages that may crop up to a file for later inspection, add logging to whichever module might go wrong.
Step 1: instrument the module that may fail
import scrapy
import logging

logger = logging.getLogger(__name__)

class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['http://jd.com/']

    def parse(self, response):
        # logging.warning('this is warning')
        # print(__name__)
        logger.warning('this is warning')
Step 2: in settings, add LOG_FILE = './LOG.log' to save the messages to a LOG.log file in the current directory. There is also a module-based approach that leaves settings alone: do step 1 as above, then create a module like the following anywhere in the Scrapy project:
import logging

logging.basicConfig(
    # level=log_level,
    level=logging.INFO,
    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
    datefmt='%a, %d %b %Y %H:%M:%S',
    filename='parser_result.log',
    filemode='w')
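With that module in place, any other module just imports it once (which runs the basicConfig call) and grabs its own logger, as in step 1. A minimal sketch, assuming the module above was saved as log_conf.py:

import logging
import log_conf  # assumed module name; importing it executes the basicConfig call

logger = logging.getLogger(__name__)
logger.info('this message ends up in parser_result.log')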
3. Tencent crawler case study
Create the project
scrapy startproject tencent
Create the spider
scrapy genspider hr tencent.com
3.1 scrapy.Request essentials
scrapy.Request(url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, flags=None)
Commonly used parameters:
callback: which parse function the response for this URL is handed to
meta: passes data between parse functions; meta also carries some information by default, such as the download delay and request depth
dont_filter: exempts this URL from Scrapy's deduplication; Scrapy filters duplicate URLs by default, so this matters for URLs that must be requested repeatedly (a combined example follows this list)
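For illustration, one hedged call combining these parameters (url, item and parse_first are assumed from the spider code below):

yield scrapy.Request(url,
                     callback=self.parse_first,  # parse function that handles the response
                     meta={'item': item},        # data carried into that function
                     dont_filter=True)           # bypass the duplicate-URL filter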
3.2 Full spider code
Note: set LOG_LEVEL = 'WARNING' in settings so only warnings and above are shown.
import scrapy

# listing page
# https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1597974603010&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
# detail page
# https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1597974781014&postId=1254638329626894336&language=zh-cn

class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    first_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1597974603010&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    second_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1597974781014&postId={}&language=zh-cn'
    start_urls = [first_url.format(1)]

    def parse(self, response):
        for page in range(1, 3):
            url = self.first_url.format(page)
            yield scrapy.Request(url, callback=self.parse_first)

    def parse_first(self, response):
        data = response.json()
        page = data['Data']['Posts']
        for job in page:
            item = {}  # one fresh dict per job, so concurrent detail requests don't share state through meta
            job_index = job['PostId']
            # print(job_index)
            item['job_name'] = job['CategoryName']
            detail_url = self.second_url.format(job_index)
            yield scrapy.Request(detail_url,
                                 meta={'item': item},
                                 callback=self.parse_second)

    def parse_second(self, response):
        item = response.meta['item']
        # item = response.meta.get('item')  # equivalent lookup
        data = response.json()
        item['zhi_duty'] = data['Data']['Requirement']
        item['zhi_bility'] = data['Data']['Responsibility']
        print(item)
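Run the spider
scrapy crawl hr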
4. Using items
In a spider you could simply define your own dict to hold the scraped data; to keep the code loosely coupled, let items declare which fields you intend to extract, and have the spider import the class from items.
Syntax
import scrapy

class Demo2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # This is the template: name stands for the dict key, i.e. the piece of data you
    # want to extract, and Field() creates the dict entry under that key. It is a
    # method Scrapy wraps for you, so there is no need to dig into it; what matters
    # is that a dict is built with name as its key.
    job_name = scrapy.Field()
    zhi_duty = scrapy.Field()
    zhi_bility = scrapy.Field()
Usage
Import the items module in the spider and instantiate the class inside it. The instance can then be used like an empty dict with predeclared keys: you still add keys and values the dict way, but you no longer define the dict yourself. The keys you assign must be exactly the ones declared in items: if items declares job_name1, you cannot write job_name, you must use job_name1. The spider may fill fewer keys than declared, but never extra ones, and whatever you extract for those keys is what gets returned. That is the difference between an items-based dict and an ordinary one; their printed output also differs slightly. A minimal sketch of the key restriction follows.
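A minimal sketch of that key restriction, reusing Demo2Item from above (values are placeholders):

from demo2.items import Demo2Item

item = Demo2Item()
item['job_name'] = 'engineer'  # fine: job_name is declared in Demo2Item
# item['job_name1'] = 'x'      # raises KeyError: job_name1 is not a declared field
print(item)                    # prints dict-like: {'job_name': 'engineer'}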
A small step-by-step case: the Sunshine Government platform (wz.sun0769.com)
import scrapy
from demo2.items import YgItem

class YgSpider(scrapy.Spider):
    name = 'yg'
    allowed_domains = ['wz.sun0769.com']  # domain only; a trailing slash breaks the offsite filter
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&type=4']

    def parse(self, response):
        li_list = response.xpath("//ul[@class='title-state-ul']/li")
        for li in li_list:
            item = YgItem()  # one item per entry, so meta doesn't carry a shared object
            # item = {}
            item['title'] = li.xpath("./span[3]/a/text()").extract_first()  # title
            item['href'] = 'http://wz.sun0769.com' + li.xpath("./span[3]/a/@href").extract_first()  # detail-page url
            yield scrapy.Request(item['href'],
                                 callback=self.parse_detail,
                                 meta={'item': item},)
            # print(item)
        # next_url = 'http://wz.sun0769.com' + response.xpath(
        #     "//div[@class='mr-three paging-box']/a[2]/@href").extract_first()
        # for url in next_url:
        #     yield scrapy.Request(url, callback=self.parse)

    def parse_detail(self, response):
        # response.xpath()
        item = response.meta['item']
        # item = response.meta.get('item')
        print(item)
        item['content'] = response.xpath("//div[@class='details-box']/pre/text()").extract_first()  # note the closing quote in the class selector
        print(item)
        # yield item
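The commented-out block in parse() was aiming at pagination; a hedged sketch of the idea it was reaching for, meant to sit at the end of parse() (the XPath comes straight from that comment and has not been re-verified against the live page):

# follow the single "next page" link back into parse()
next_href = response.xpath("//div[@class='mr-three paging-box']/a[2]/@href").extract_first()
if next_href:
    yield scrapy.Request('http://wz.sun0769.com' + next_href, callback=self.parse)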