Advanced Web Scraping (14. Selenium: If You Can See It, You Can Scrape It)

2023-11-09

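In this chapter

1. Installing Selenium and WebDriver
2. Basic Selenium usage
3. Finding nodes
4. Interacting with nodes
5. Managing cookies
6. Executing JavaScript code
7. Changing node attribute values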

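Selenium's main capabilities:

1. Open a browser
2. Retrieve specific content from the page shown in the browser
3. Control the widgets on the page, for example typing a string into a text box
4. Close the browser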

14.1 first_selenium: a first example

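Before using Selenium, download a browser driver first; for Chrome this is chromedriver, available from the chromedriver download page.
Chrome is not the only option: drivers can also be downloaded for Microsoft Edge, Firefox, IE, and other browsers.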

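This example uses the Selenium API to demonstrate all four capabilities.

It first creates a Chrome object, which opens the Chrome browser, and has Chrome load the JD.com home page.
It then types "Python从菜鸟到高手" into the search box at the top of the home page and presses Enter to start the search.
Finally, it gets and prints the page title, the current URL, and the full page source.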

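from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

# Selenium 4 style: the driver path is passed through a Service object
# (Selenium 3 accepted it as the first positional argument)
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))

try:
    browser.get('https://www.jd.com')
    # Find the search box by the value of its id attribute
    # (find_element_by_id('key') in Selenium 3; removed in Selenium 4.3)
    search_box = browser.find_element(By.ID, 'key')
    # Use send_keys to type the text Python从菜鸟到高手 into the search box
    search_box.send_keys('Python从菜鸟到高手')
    # Use send_keys to simulate pressing the Enter key
    search_box.send_keys(Keys.ENTER)
    # Create a WebDriverWait object with a maximum wait time of 4 seconds
    wait = WebDriverWait(browser, 4)
    # Wait for the search results page to load
    wait.until(ec.presence_of_element_located((By.ID, 'J_goodsList')))
    # Print the title of the search results page
    print(browser.title)
    # Print the URL of the search results page
    print(browser.current_url)
    # Print the source of the search results page
    print(browser.page_source)
except Exception as e:
    print(e)
finally:
    # Close the browser whether or not an error occurred
    browser.quit()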
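
The WebDriverWait(...).until(...) call in the example keeps checking a condition until it returns a truthy value or the time limit runs out. The following plain-Python sketch illustrates that polling idea only. The wait_until helper and its parameter names are hypothetical and are not part of Selenium, and the real class has more features (ignored exceptions, for instance).

```python
import time


def wait_until(condition, timeout=4.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper that mimics the polling idea behind an explicit
    wait; it is not Selenium's implementation.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(poll_interval)


# Simulated condition: the "element" only shows up on the third check
checks = []


def element_present():
    checks.append(1)
    return 'element' if len(checks) >= 3 else None


found = wait_until(element_present, timeout=2, poll_interval=0.01)
print(found, 'after', len(checks), 'checks')
```

With a real browser, the condition would be something like a function that tries to locate the results container and returns it once it exists.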

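14.2 find_node: finding a node

find_element_by_id: finds a node by id (Selenium 4: find_element(By.ID, ...))
find_element_by_name: finds a node by name (Selenium 4: find_element(By.NAME, ...))
send_keys: types text into an input text box

The find_element_by_* helpers were removed in Selenium 4.3, so the find_element(By..., ...) form is the one that works on current versions.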

demo.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Form</title>
</head>
<body>
    <script>
        function onclick_form(){
            alert(document.getElementById('name').value +
                document.getElementById('age').value +
                document.getElementsByName('country')[0].value+
                document.getElementsByClassName('myclass')[0].value)
        }

    </script>
    Name: <input id="name"><p></p>
    Age: <input id="age"><p></p>
    Country: <input name="country"><p></p>
    Income: <input class="myclass"><p></p>
    <button onclick="onclick_form()">Submit</button>
</body>
</html>

爬虫.py (the scraper script)

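from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
try:
    # Open the demo.html page
    browser.get('http://localhost/demo.html')
    # Find the name input node by its id attribute
    # (find_element_by_id('name') in Selenium 3)
    field = browser.find_element(By.ID, 'name')
    # Type 王军 automatically
    field.send_keys('王军')
    field = browser.find_element(By.ID, 'age')
    field.send_keys('30')
    # Find the country input node by its name attribute
    field = browser.find_element(By.NAME, 'country')
    # Type 中国 (China) automatically
    field.send_keys('中国')
    # The income input has only a class attribute, so it must be found by
    # class name (a lookup by name 'myclass' would find nothing)
    field = browser.find_element(By.CLASS_NAME, 'myclass')
    field.send_keys('4000')

    # Get the income input node again by its class attribute
    field = browser.find_element(By.CLASS_NAME, 'myclass')
    # To overwrite the earlier input, clear the node first; otherwise the
    # new text is appended after the existing content
    field.clear()
    field.send_keys('8000')
except Exception as e:
    print(e)
    browser.quit()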
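
The find, clear, type sequence repeats for every field in the script. As a minimal sketch, it could be folded into one helper that takes a list of (locator strategy, locator value, text) tuples. The fill_fields name and the tuple layout are assumptions made for this illustration; the helper only relies on the driver having find_element and on the returned element having clear and send_keys.

```python
def fill_fields(driver, fields):
    """Clear and fill each input described by a (by, value, text) tuple.

    Hypothetical helper: `driver` is any object with a Selenium-style
    find_element(by, value) method.
    """
    for by, value, text in fields:
        element = driver.find_element(by, value)
        # Clear first so the new text replaces any existing content
        element.clear()
        element.send_keys(text)


# Example call with a real driver (requires selenium and a running browser):
#
#     from selenium.webdriver.common.by import By
#     fill_fields(browser, [
#         (By.ID, 'name', '王军'),
#         (By.ID, 'age', '30'),
#         (By.NAME, 'country', '中国'),
#         (By.CLASS_NAME, 'myclass', '8000'),
#     ])
```

Keeping the field list as data makes it easy to see at a glance which locator strategy each input uses.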

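14.3 find_multi_node: finding multiple nodes

This example uses Selenium to find all nodes that match a tag name, then prints the nodes themselves, the number of matching nodes, and the text of the first match.

find_elements_by_tag_name('li'): finds all nodes whose tag name is li (Selenium 4: find_elements(By.TAG_NAME, 'li'))
find_elements(By.TAG_NAME, 'ul'): the second way, here finding all nodes whose tag name is ul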

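from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
try:
    browser.get('https://www.jd.com')
    # Find all nodes whose tag name is li
    # (find_elements_by_tag_name('li') in Selenium 3; removed in Selenium 4.3)
    li_list = browser.find_elements(By.TAG_NAME, 'li')
    print(li_list)
    print(len(li_list))
    # Print the text of the first matching node
    print(li_list[0].text)
    # The same find_elements call, this time for all nodes named ul
    ul_list = browser.find_elements(By.TAG_NAME, 'ul')
    print(ul_list)
    print(ul_list[0].text)
except Exception as e:
    print(e)
finally:
    browser.quit()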

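14.4 node_interactive: interacting with button nodes

This example uses Selenium to simulate browser clicks, cycling through the 6 buttons on the page. After each click, the div below the buttons takes on the background color of the button that was clicked.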

demo1.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Colored Buttons</title>
</head>
<body>
    <script>

        function onclick_color(e) {
            document.getElementById("bgcolor").style.background = e.style.background
        }
    </script>

  <button class="mybutton" style="background: red" onclick="onclick_color(this)">Button 1</button>
  <button class="mybutton" style="background: blue" onclick="onclick_color(this)">Button 2</button>
  <button class="mybutton" style="background: yellow" onclick="onclick_color(this)">Button 3</button>
  <br>
  <button class="mybutton" style="background: green" onclick="onclick_color(this)">Button 4</button>
  <button class="mybutton" style="background: blueviolet" onclick="onclick_color(this)">Button 5</button>
  <button class="mybutton" style="background: gold" onclick="onclick_color(this)">Button 6</button>
  <p></p>
  <div id="bgcolor" style="width: 200px; height: 200px">


  </div>
</body>
</html>

爬虫.py (the scraper script)

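from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver'))
try:
    browser.get('http://localhost/demo1.html')
    # Find all nodes whose class attribute is mybutton
    buttons = browser.find_elements(By.CLASS_NAME, 'mybutton')
    i = 0
    # Loop indefinitely, simulating browser clicks on the buttons
    while True:
        # Send a click action to the current button
        buttons[i].click()
        # Pause for 1 second
        time.sleep(1)
        i += 1
        # After the last button, reset the counter i and start again
        # from the first button
        if i == len(buttons):
            i = 0
except Exception as e:
    print(e)
    browser.quit()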
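
The script loops forever and wraps its counter by hand. As a minimal sketch of an alternative, itertools.cycle can do the wrapping and itertools.islice can bound the number of clicks. The click_cyclically name and its total_clicks parameter are assumptions for this illustration; the function only needs objects that have a click() method.

```python
import itertools
import time


def click_cyclically(buttons, total_clicks, delay=1.0):
    """Click the buttons in order, wrapping around, total_clicks times."""
    # cycle() repeats the sequence endlessly; islice() stops after total_clicks
    for button in itertools.islice(itertools.cycle(buttons), total_clicks):
        button.click()
        time.sleep(delay)


# Example call with a real driver (requires selenium and a running browser):
#
#     buttons = browser.find_elements(By.CLASS_NAME, 'mybutton')
#     click_cyclically(buttons, total_clicks=12)  # two passes over 6 buttons
```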

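14.5 js_menu: controlling the mouse with action chains

Clicking and typing target a specific node, whereas mouse dragging and keyboard presses are global operations and are done with an action chain.
An action chain requires creating an ActionChains object and then sending one or more actions to the browser through its methods.

This example uses the action chain's move_to_element method to simulate mouse movement, automatically showing each second-level navigation menu on the left side of the JD.com home page.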

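from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
try:
    browser.get('https://www.jd.com')
    # Use a CSS selector to find all li nodes whose class attribute is
    # cate_menu_item; each li node opens one second-level navigation menu
    li_list = browser.find_elements(By.CSS_SELECTOR, '.cate_menu_item')
    # Iterate to show each second-level menu. After queuing actions on a
    # chain, perform() must be called for them to take effect
    for li in li_list:
        # A fresh ActionChains object per move keeps each perform()
        # limited to that single move
        ActionChains(browser).move_to_element(li).perform()
        time.sleep(1)
except Exception as e:
    print(e)
    browser.quit()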
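
The key point about action chains is that calling an action method only queues the action; nothing reaches the browser until perform() is called. The toy class below models just that queue-then-perform behavior. MiniActionChain and its methods are made up for illustration and do not reflect how Selenium implements ActionChains.

```python
class MiniActionChain:
    """Toy model of an action chain: queue actions, run them on perform().

    Illustrative only; not Selenium's implementation.
    """

    def __init__(self):
        self._queue = []   # actions waiting to be performed
        self.log = []      # actions that have actually run

    def move_to(self, target):
        # Only record the action; nothing is executed yet
        self._queue.append(('move_to', target))
        return self  # returning self allows method chaining

    def click(self):
        self._queue.append(('click', None))
        return self

    def perform(self):
        # Run the queued actions in order, then empty the queue
        for action in self._queue:
            self.log.append(action)
        self._queue = []


chain = MiniActionChain()
chain.move_to('menu item').click()
log_before = list(chain.log)   # nothing has run yet
chain.perform()
log_after = list(chain.log)    # both actions have now run, in order
print(log_before, log_after)
```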

14.6 dragdrop: dragging and dropping a node

This example uses the action chain's drag_and_drop method to drag one node onto another.

import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
try:
    browser.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
    # Switch into the frame whose id attribute is iframeResult
    browser.switch_to.frame('iframeResult')
    # Use a CSS selector to get the node to drag (id draggable)
    source = browser.find_element(By.CSS_SELECTOR, '#draggable')
    # Use a CSS selector to get the drop target (id droppable)
    target = browser.find_element(By.CSS_SELECTOR, '#droppable')
    # Create the ActionChains object
    actions = ActionChains(browser)
    # Queue the drag with drag_and_drop
    actions.drag_and_drop(source, target)
    # Call perform so the drag actually happens
    actions.perform()
    time.sleep(1)
except Exception as e:
    print(e)
finally:
    browser.quit()

14.7 exec_javascript: executing JavaScript

This example uses Selenium's execute_script method to scroll the JD.com home page to the very bottom and then pop up a dialog box.

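from selenium import webdriver
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
browser.get('https://www.jd.com')
# Scroll the JD.com home page to the very bottom
browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
# Pop up a dialog box. execute_script is used here because alert() never
# invokes the completion callback that execute_async_script waits for
browser.execute_script('alert("Reached the bottom")')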
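
execute_script can also return a value to Python when the JavaScript uses return. The sketch below relies on that to keep scrolling until the page height stops growing, which is one common way to handle pages that load more content as you scroll. Whether a given page behaves that way is an assumption, and scroll_until_stable, pause, and max_rounds are hypothetical names chosen for this illustration.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until scrollHeight stops changing.

    `driver` is any object with a Selenium-style execute_script method.
    Returns the final page height.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        # Give the page time to load more content
        time.sleep(pause)
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return last_height


# Example call with a real driver (requires selenium and a running browser):
#
#     final_height = scroll_until_stable(browser)
```

The max_rounds cap keeps the loop from running forever on pages that never stop growing.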

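14.8 get_node_info: getting node information

This example uses the Selenium API to get information about the ul node whose id is navitems-group1 in the HTML of the JD.com home page, along with information about the li child nodes inside it.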

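from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
# Add an argument so Chrome runs in the background without showing a window
options.add_argument('--headless')
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'), options=options)
browser.get('https://www.jd.com')
# Find the first node in the page whose id attribute is navitems-group1
ul = browser.find_element(By.ID, 'navitems-group1')
print(ul.text)
# Print the id Selenium uses internally for the node, not its id attribute
print('id=',ul.id)
# Print the node's coordinates on the page
print('location=',ul.location)
# Print the node's tag name
print('tag_name=',ul.tag_name)
# Print the node's size
print('size=',ul.size)
# Search inside this node for all child nodes named li
li_list = ul.find_elements(By.TAG_NAME, 'li')
for li in li_list:
    print(type(li))
    # Print the li node's text and class attribute value;
    # get_attribute returns None if the attribute is not found
    print('<{}> class={}'.format(li.text,li.get_attribute('class')))
    # Find the child node named a inside the li node
    a = li.find_element(By.TAG_NAME, 'a')
    # Print the href attribute value of the a node
    print('href=',a.get_attribute('href'))
browser.quit()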
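
The location and size properties printed in the script are plain dictionaries: location has the keys x and y, and size has the keys width and height. As a small sketch of how the two combine, the hypothetical helper below computes the center point of a node; the sample numbers are made up.

```python
def element_center(location, size):
    """Return the (x, y) center of a node from its location and size dicts."""
    center_x = location['x'] + size['width'] / 2
    center_y = location['y'] + size['height'] / 2
    return center_x, center_y


# Made-up values shaped like ul.location and ul.size
sample_location = {'x': 10, 'y': 20}
sample_size = {'width': 100, 'height': 50}
center = element_center(sample_location, sample_size)
print(center)
```

With a real node, the call would be element_center(ul.location, ul.size).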

14.9 get_cookies: getting the cookie list

This example uses the Selenium API to get the cookie list, add a new cookie, and delete all cookies.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
browser.get('https://www.jd.com')
# Get the cookie list
print(browser.get_cookies())
# Add a new cookie
browser.add_cookie({'name':'name','value':'jd','domain':'www.jd.com'})
print(browser.get_cookies())
# Delete all cookies
browser.delete_all_cookies()
print(browser.get_cookies())
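
get_cookies() returns a list of dictionaries, one per cookie, each with at least a name and a value. A common next step is to reuse those cookies outside the browser, for example in an HTTP client. The sketch below shows two hypothetical helpers for that conversion; the sample cookie values are made up.

```python
def cookies_to_dict(cookies):
    """Collapse a Selenium-style cookie list into a name -> value mapping."""
    return {cookie['name']: cookie['value'] for cookie in cookies}


def cookie_header(cookies):
    """Build the value of an HTTP Cookie header from the same list."""
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in cookies)


# Made-up data shaped like the output of browser.get_cookies()
sample = [
    {'name': 'name', 'value': 'jd', 'domain': 'www.jd.com', 'path': '/'},
    {'name': 'token', 'value': 'abc123', 'domain': '.jd.com', 'path': '/'},
]
print(cookies_to_dict(sample))
print(cookie_header(sample))
```

The dictionary form drops the domain, path, and expiry fields, so it only suits cases where the name/value pairs are enough.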

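14.10 baidu_move_search_button: moving the Baidu search button

This example uses JavaScript code to change the position of the Baidu search button, moving it between several positions at 2-second intervals.
The first argument of execute_script is the JavaScript code. The variable arguments after it are passed to that code, which reads each one through the arguments variable.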

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
driver.get('https://www.baidu.com')
# Find the Baidu search button
search_button = driver.find_element(By.ID, 'su')
# x coordinates the search button can move to
x_positions = [50,90,130,170]
# y coordinates the search button can move to
y_positions = [100,10,160,90]
# Iterate over the positions, moving the search button every 2 seconds
for x, y in zip(x_positions, y_positions):
    # JavaScript that moves the search button; arguments[0] is the DOM
    # node that corresponds to the search button
    js = '''
        arguments[0].style.position = "absolute";
        arguments[0].style.left = "{}px";
        arguments[0].style.top = "{}px";
    '''.format(x, y)
    # Run the JavaScript code to move the search button
    driver.execute_script(js,search_button)
    time.sleep(2)
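
The script passes only the button through arguments and formats the coordinates into the JavaScript string. Since execute_script forwards every extra argument, the coordinates can be passed the same way, which keeps the JavaScript text constant. A minimal sketch, where MOVE_JS and move_element are hypothetical names chosen for this illustration:

```python
# arguments[0] is the node, arguments[1] and arguments[2] are the x and y
# coordinates in pixels
MOVE_JS = '''
    arguments[0].style.position = "absolute";
    arguments[0].style.left = arguments[1] + "px";
    arguments[0].style.top = arguments[2] + "px";
'''


def move_element(driver, element, x, y):
    """Move a node to (x, y) by passing all values through arguments."""
    driver.execute_script(MOVE_JS, element, x, y)


# Example call with a real driver (requires selenium and a running browser):
#
#     for x, y in zip(x_positions, y_positions):
#         move_element(driver, search_button, x, y)
#         time.sleep(2)
```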

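14.11 jd_change_node: changing nodes

This example uses JavaScript code to change the text of the first two navigation items at the top of the JD.com home page to "Python从菜鸟到高手" and "极客起源", and to change their links as well.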

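from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(service=Service('./webdriver/chromedriver.exe'))
# Load the JD.com home page before looking up any nodes
driver.get('https://www.jd.com')
# Find the node whose id attribute is navitems-group1
ul = driver.find_element(By.ID, 'navitems-group1')
# Find all child nodes named li under that node
li_list = ul.find_elements(By.TAG_NAME, 'li')
# Find the first node named a inside the first li node
a1 = li_list[0].find_element(By.TAG_NAME, 'a')
# Find the first node named a inside the second li node
a2 = li_list[1].find_element(By.TAG_NAME, 'a')
# The JavaScript below changes the text and link (href attribute value)
# of the two a nodes found above
js = '''
        arguments[0].text = 'Python从菜鸟到高手';
        arguments[0].href = 'https://item.jd.com/12417265.html';
        arguments[1].text = '极客起源';
        arguments[1].href = 'https://geekori.com';
'''
driver.execute_script(js,a1,a2)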

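Closing notes

Selenium is a tool for testing web applications.
Selenium tests run directly in the browser, just as a real user would operate it.
Selenium is very good at getting pages: as the title says, if you can see it, you can scrape it. For any page that requests cannot fetch, Selenium is worth a try and often works surprisingly well.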
