Task Description
Use a Python scraper to fetch rental listings from the Douban "北京租房" (Beijing rentals) group, filter out the listings that suit you, and save them to an Excel file. Usage notes are written in the code comments, so please read them carefully.
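The listings live in `<td class="title">` cells of the group's discussion table, each holding an `<a>` tag whose `title` attribute is the post title and whose `href` is the link. A minimal parsing sketch against a small hypothetical HTML snippet (the snippet itself is made up; only the tag structure matches what the script below relies on):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the Douban group discussion table:
# each listing sits in a <td class="title"> cell.
html = '''
<table>
  <tr><td class="title"><a href="https://www.douban.com/group/topic/1/" title="牡丹园 两居室出租">A</a></td></tr>
  <tr><td class="title"><a href="https://www.douban.com/group/topic/2/" title="通州 一居室">B</a></td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
# Collect (title, link) pairs exactly as the full script does
listings = [(td.find("a")["title"], td.find("a")["href"])
            for td in soup.find_all("td", class_="title")]
print(listings)
```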
Complete Code
import time
import requests
import xlwt
from bs4 import BeautifulSoup
"""
获取豆瓣租房信息
获取excel后可能会产生空白行,为了表示每一页的信息独立开
也可以根据该操作去除 https://jingyan.baidu.com/article/cbcede075ad25202f50b4d52.html
"""
def get_douban_books(url, num):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    row = num  # first Excel row for this page's results
    # Keep a listing when its title mentions a nearby station, line 10,
    # or a short walk to the subway
    contains = ["牡丹园", "健德门", "西土城", "北土城", "安贞门", "惠新西街南口", "芍药居", "十号线",
                "10号线", "1分钟", "2分钟", "3分钟", "4分钟", "5分钟"]
    item_a_title = soup.find_all("td", class_="title")
    for item in item_a_title:
        tag_a = item.find("a")
        name = tag_a["title"]
        link = tag_a["href"]
        for c in contains:
            if c in name:
                sheet.write(row, 0, name)
                sheet.col(0).width = 256 * len(name)
                sheet.write(row, 1, link)
                sheet.col(1).width = 256 * len(link)
                row += 1
                break  # write each listing once even if several keywords match
workbook = xlwt.Workbook()
sheet = workbook.add_sheet('豆瓣租房')
head = ['租房信息', '地址']  # column headers: listing title, link
for h in range(len(head)):
    sheet.write(0, h, head[h])
sheet.col(0).width = 512 * 50
sheet.col(1).width = 256 * 50
all_page = int(input("How many pages do you want to fetch? "))
page_size = 30  # the Douban group shows 30 posts per page
url = 'https://www.douban.com/group/beijingzufang/discussion?start={}'
urls = [url.format(num * page_size) for num in range(all_page)]
# Starting Excel row for each page; rows left unused stay blank,
# which is what separates one page's results from the next
page_num = [num * page_size + 1 for num in range(all_page)]
for i in range(all_page):
    get_douban_books(urls[i], page_num[i])
    print("========== Page " + str(i + 1) + " done ==========")
    time.sleep(1)  # pause between requests to avoid hammering the site
workbook.save('./douban_zufang.xls')
print("All pages written!")
Run Results
Since this is web scraping, the site may change its layout or add anti-scraping measures at any time. If you run into problems, or have any other questions, leave a comment below and I'll reply as soon as I see it.
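If Douban starts throttling or rejecting requests, wrapping `requests.get` in a small retry helper makes the script fail more gracefully. This is a hypothetical sketch, not part of the original script: the `fetch` helper, its parameters, and the backoff schedule are all assumptions.

```python
import time
import requests

def fetch(url, retries=3, timeout=10):
    # Hypothetical helper: check the HTTP status and retry with
    # exponential backoff instead of assuming every request succeeds.
    headers = {'user-agent': 'Mozilla/5.0'}
    for attempt in range(retries):
        try:
            res = requests.get(url, headers=headers, timeout=timeout)
            if res.status_code == 200:
                return res.text
        except requests.RequestException:
            pass  # network error or timeout: fall through to retry
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return None  # give up; the caller decides how to handle a dead page
```

The caller would then replace `requests.get(url, headers=headers)` with `fetch(url)` and skip the page when `None` comes back.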