Python: One-Click Scraping of Images, Audio, and Video

2023-12-19

Preface

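To scrape resource files such as images, audio, and video from an arbitrary web page with Python, the usual approach is to request the page's HTML and then pull out the resources you want with XPath or regular expressions. I built a crawler tool that scrapes media files in one click. Note, however, that it only finds files that are already referenced in the HTML. Anything that needs a second request, such as the KuGou Music player page, cannot be scraped, because the tool is meant to be generic and to work across different websites.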
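As a rough illustration of the "request the HTML, then extract" idea, here is a minimal sketch that collects media links from markup that is already in memory. It uses the standard library's html.parser as a stand-in for XPath or regular expressions; the names MediaSrcCollector, collect_media_urls, and sample_html are hypothetical and are not part of the tool described in this post.

```python
from html.parser import HTMLParser


class MediaSrcCollector(HTMLParser):
    """Collect src attributes of media tags that are present in the HTML itself."""

    MEDIA_TAGS = {"img", "audio", "video", "source"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser
        if tag not in self.MEDIA_TAGS:
            return
        for name, value in attrs:
            if name == "src" and value and value not in self.urls:
                self.urls.append(value)


def collect_media_urls(html_text):
    """Return the unique media URLs found in html_text, in document order."""
    collector = MediaSrcCollector()
    collector.feed(html_text)
    return collector.urls


# Hypothetical page content; a real run would use the downloaded HTML instead
sample_html = """
<html><body>
  <img src="https://example.com/a.png">
  <video><source src="https://example.com/clip.mp4"></video>
  <img src="https://example.com/a.png">
</body></html>
"""
print(collect_media_urls(sample_html))
```

Resources that a page loads through later requests never appear in html_text, which is the limitation described above.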

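The main focus here is image scraping: if you need image assets, enter a URL and scrape them in one click.

When scraping videos, the tool also picks up magnet links, which you can then download with a third-party download tool.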

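Code

Scraping resource files

One thing to note: some image resources are not URL links but are embedded in data:image format, so they have to be decoded before they can be stored.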
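As a minimal sketch of that conversion, assuming the embedded image is a base64-encoded data:image URI, the helpers below split the URI into a file extension and raw bytes and then write the bytes to disk. The names decode_data_uri and save_data_uri are hypothetical and do not come from the tool itself.

```python
import base64


def decode_data_uri(data_uri):
    """Split a base64 data:image URI into (extension, raw bytes)."""
    header, _, payload = data_uri.partition("base64,")
    if not header.startswith("data:image/") or not payload:
        raise ValueError("not a base64-encoded data:image URI")
    # header looks like "data:image/png;" -> "png"
    extension = header[len("data:image/"):].split(";")[0]
    # Restore any '=' padding that was stripped; -len % 4 is 0 when none is needed
    payload += "=" * (-len(payload) % 4)
    return extension, base64.b64decode(payload)


def save_data_uri(data_uri, stem):
    """Write the decoded image to '<stem>.<extension>' and return the file name."""
    extension, raw = decode_data_uri(data_uri)
    file_name = f"{stem}.{extension}"
    with open(file_name, "wb") as image_file:
        image_file.write(raw)
    return file_name


# "iVBORw0KGgo=" is the base64 encoding of the 8-byte PNG file signature
extension, raw = decode_data_uri("data:image/png;base64,iVBORw0KGgo=")
print(extension, raw)
```

Opening the file in binary mode ("wb") matters here, because the decoded payload is bytes rather than text.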

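import base64

# requestsDataBase, repairUrl, zfjTools and the *Type_list globals are not shown in this excerpt
def getResourceUrlList(url, isImage, isAudio, isVideo):
	global imgType_list, audioType_list, videoType_list
	imageUrlList = []
	audioUrlList = []
	videoUrlList = []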
 
	url = url.rstrip().rstrip('/')
	htmlStr = str(requestsDataBase(url))
	# print(htmlStr)
	
	# Save the fetched HTML to reptileHtml.txt
	with open('reptileHtml.txt', 'w', encoding='utf-8') as htmlFile:
		htmlFile.write(htmlStr)

	for line in htmlStr.splitlines():
		# Normalize single quotes to double quotes so that one delimiter
		# splits out every quoted attribute value
		line = line.replace("'", '"')
		segmenterStr = '"'

		lineList = line.split(segmenterStr)
		for partLine in lineList:
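			if isImage:
				# Inline images: decode the base64 payload of a data:image URI
				if 'data:image' in partLine and 'base64,' in partLine:
					base64List = partLine.split('base64,')
					# Add only as many '=' as needed to make the length a multiple of 4
					padding = '=' * (-len(base64List[-1]) % 4)
					# imgData holds the raw image bytes to be stored under imageName
					imgData = base64.urlsafe_b64decode(base64List[-1] + padding)
					base64ImgType = base64List[0].split('/')[-1].rstrip(';')
					imageName = zfjTools.getTimestamp() + '.' + base64ImgType
					imageUrlList.append(imageName + '$==$' + base64ImgType)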
 
				# Linked images: look for URLs that end in a known image extension
				for imageType in imgType_list:
					if imageType in partLine:
						imgUrl = partLine[:partLine.find(imageType) + len(imageType)].split(segmenterStr)[-1]

						# Repair the URL
						imgUrl = repairUrl(imgUrl, url)

						# Drop the "_{size}" template placeholder if the URL contains one
						imgUrl = imgUrl.replace('_{size}', '').strip()

						# Keep only absolute http(s) URLs that have not been collected yet
						if imgUrl.startswith(('http://', 'https://')) and imgUrl not in imageUrlList:
							imageUrlList.append(imgUrl)
 
			if isAudio:
				# Look for audio files
				for audioType in audioType_list:
					if audioType in partLine or audioType.lower() in partLine:
						# Match the extension in whichever case the page uses
						audioType = audioType.lower() if audioType.lower() in partLine else audioType
						audioUrl = partLine[:partLine.find(audioType) + len(audioType)].split(segmenterStr)[-1]

						# Repair the URL
						audioUrl = repairUrl(audioUrl, url)

						# Keep only absolute http(s) URLs that have not been collected yet
						if audioUrl.startswith(('http://', 'https://')) and audioUrl not in audioUrlList:
							audioUrlList.append(audioUrl)
 
			if isVideo:
				# Look for video files
				for videoType in videoType_list:
					if videoType in partLine or videoType.lower() in partLine:
						# Match the extension in whichever case the page uses
						videoType = videoType.lower() if videoType.lower() in partLine else videoType
						videoUrl = partLine[:partLine.find(videoType) + len(videoType)].split(segmenterStr)[-1]

						# Repair the URL
						videoUrl = repairUrl(videoUrl, url)

						# Keep http(s), ed2k, magnet and ftp links that have not been collected yet
						videoSchemes = ('http://', 'https://', 'ed2k://', 'magnet:?', 'ftp://')
						if videoUrl.startswith(videoSchemes) and videoUrl not in videoUrlList:
							videoUrlList.append(videoUrl)
 
	return (imageUrlList, audioUrlList, videoUrlList)
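The function above passes every candidate through repairUrl, which is not shown in this post. As a hedged guess at what such a step might involve, the sketch below resolves relative and protocol-relative links against the page URL with the standard library's urljoin. The name resolve_url and the example URLs are hypothetical, and the real repairUrl may behave differently.

```python
from urllib.parse import urljoin


def resolve_url(candidate, page_url):
    """Turn a relative or protocol-relative link into an absolute URL."""
    # urljoin leaves links with their own scheme (https, magnet, ...) unchanged
    return urljoin(page_url, candidate.strip())


page = "https://example.com/gallery/index.html"
print(resolve_url("img/a.png", page))
print(resolve_url("/static/a.png", page))
print(resolve_url("//cdn.example.com/a.png", page))
print(resolve_url("magnet:?xt=urn:btih:abc", page))
```

Note that urljoin treats the last path segment of the base as a file name, so the result for a relative link depends on whether the page URL ends with a slash.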

Scraping custom nodes

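from lxml import etree

# Wildcard node scraping: fatherNode and childNode are XPath expressions
def getNoteInfors(url, fatherNode, childNode):
	url = url.rstrip().rstrip('/')
	htmlStr = requestsDataBase(url)

	# Save the fetched HTML to reptileHtml.txt
	with open('reptileHtml.txt', 'w', encoding='utf-8') as htmlFile:
		htmlFile.write(htmlStr)

	html_etree = etree.HTML(htmlStr)

	dataArray = []

	# Skip pages that could not be parsed
	if html_etree is not None:
		nodes_list = html_etree.xpath(fatherNode)
		for k_value in nodes_list:
			# childNode is evaluated relative to each parent match; keep the first hit
			partValue = k_value.xpath(childNode)
			if len(partValue) > 0:
				dataArray.append(partValue[0])

	return dataArray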
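To show how the parent and child expressions work together, here is a small sketch of the same two-step XPath lookup applied to an in-memory HTML string instead of a downloaded page. It assumes lxml is installed; extract_nodes, sample_html, and the XPath strings are illustrative only.

```python
from lxml import etree


def extract_nodes(html_text, parent_xpath, child_xpath):
    """Return the first child_xpath match under every parent_xpath match."""
    tree = etree.HTML(html_text)
    results = []
    if tree is None:
        return results
    for parent in tree.xpath(parent_xpath):
        # A child expression starting with "./" is evaluated relative to this parent
        matches = parent.xpath(child_xpath)
        if matches:
            results.append(matches[0])
    return results


# Hypothetical markup standing in for a fetched page
sample_html = """
<ul>
  <li class="item"><a href="/post/1">First post</a></li>
  <li class="item"><a href="/post/2">Second post</a></li>
  <li class="ad">Sponsored</li>
</ul>
"""
print(extract_nodes(sample_html, '//li[@class="item"]', './a/@href'))
print(extract_nodes(sample_html, '//li[@class="item"]', './a/text()'))
```

Parents without a matching child are skipped, so the result can be shorter than the number of parent nodes.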
