Using XPath in Python Web Scraping

2023-11-09

I. Getting to know XPath and XML data

lxml is a Python library for parsing data with XPath.

from lxml import etree

1. XPath parsing - select tags by giving their path (an XPath is simply the path to a tag)

1) Basic XPath concepts

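Tree: the whole HTML or XML document
Node: every tag (element) in the tree is a node
Root node: the first node in the tree (for a web page, the root node is the html tag)
Node content: the content between the opening and closing tag of a paired tag
Node attribute: an attribute of the tag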
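The terms above can be mapped onto lxml objects. The following is only a minimal sketch: the one-line document and the variable names are made up for illustration.

```python
from lxml import etree

# A tiny, made-up document used only to illustrate the terms above.
xml_text = '<supermarket><name class="c1">永辉超市</name></supermarket>'

root = etree.XML(xml_text)          # the tree's root node
first_child = root[0]               # a node (element) under the root

root_tag = root.tag                 # tag name of the root node
content = first_child.text          # node content
attrs = dict(first_child.attrib)    # node attributes as a plain dict

print(root_tag, content, attrs)
```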

2) Path - the location of the target node within the whole tree

2. The XML data format

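XML and JSON are both general-purpose data formats that programs written in different languages can use to exchange data.
JSON is usually smaller and faster to parse; XML is more verbose, but is often considered safer because documents can be validated against a schema.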

The information of one class, transmitted in each of the two formats:
1) JSON
{
    "name": "goodstudy",
    "teacher": {
        "name": "niuzi",
        "tel": "1100",
        "age": 18
    },
    "students": [
        {"name": "小明", "age": 18, "tel": "120", "gender": "男"},
        {"name": "张三", "age": 22, "tel": "119", "gender": "女"},
        {"name": "老王", "age": 30, "tel": "140", "gender": "男"}
    ]
}

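2) XML (the teacher record only, with element names assumed to mirror the JSON keys above)

<teacher>
    <name>niuzi</name>
    <tel>1100</tel>
    <age>18</age>
</teacher>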
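To make the comparison concrete, here is a rough sketch that reads the teacher's record from both formats in Python. The XML element names (teacher, name, tel, age) are an assumption that mirrors the JSON keys; they are not taken from the original example.

```python
import json
from lxml import etree

# The same teacher record in both formats.
# The XML layout is an assumption mirroring the JSON keys.
json_text = '{"teacher": {"name": "niuzi", "tel": "1100", "age": 18}}'
xml_text = '<teacher><name>niuzi</name><tel>1100</tel><age>18</age></teacher>'

teacher_from_json = json.loads(json_text)['teacher']
teacher_from_xml = etree.XML(xml_text)

json_tel = teacher_from_json['tel']
xml_tel = teacher_from_xml.xpath('./tel/text()')[0]

# JSON keeps the number type; XML content always comes back as text.
json_age = teacher_from_json['age']
xml_age = teacher_from_xml.xpath('./age/text()')[0]

print(json_tel, xml_tel, type(json_age).__name__, type(xml_age).__name__)
```

One practical difference shows up here: the age is an int after json.loads, while the XML version is a string and has to be converted by hand.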

II. XPath syntax

Before going through the syntax, create an XML file named data.xml in the current directory.
The content of data.xml is as follows:

<supermarket>
    <name>永辉超市</name>
    <staffs>
        <staff>
            <name class="c1">张三</name>
            <position>收营员</position>
            <salary>3500</salary>
        </staff>
        <staff>
            <name>小明</name>
            <position class="c1">收营员</position>
            <salary>3800</salary>
        </staff>
        <staff>
            <name class="c1">小花</name>
            <position>导购</position>
            <salary>4500</salary>
        </staff>
        <staff>
            <name>李华</name>
            <position>导购</position>
            <salary>5500</salary>
        </staff>
    </staffs>

    <goodsList>
        <goods>
            <name class="c1">面包</name>
            <price>5.5</price>
            <count>12</count>
        </goods>
        <goods tag="hot">
            <name>泡面</name>
            <price class="c1">3.5</price>
            <count>59</count>
        </goods>
        <goods tag="discount" discount="0.8">
            <name>火腿肠</name>
            <price>1.5</price>
            <count>30</count>
        </goods>
        <goods tag="hot">
            <name>矿泉水</name>
            <price>2</price>
            <count>210</count>
        </goods>
    </goodsList>

</supermarket>

Let's get started. First, import the module:

from lxml import etree

1. Build the tree and get its root node

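etree.XML(xml_data) - build a tree from XML text and return its root node
etree.HTML(html_data) - build a tree from HTML text and return its root node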

# Read the file inside a with-block so that it is closed afterwards
with open('data.xml', encoding='utf-8') as f:
    root = etree.XML(f.read())
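As a quick illustration of the two constructors, the sketch below parses a made-up HTML fragment and a made-up XML string. Note that etree.HTML wraps a fragment in html and body tags, so the root it returns is the html node.

```python
from lxml import etree

# Made-up inputs, only for illustration.
html_text = '<div><p class="c1">hello</p><p>world</p></div>'
xml_text = '<a><b>1</b></a>'

html_root = etree.HTML(html_text)   # the HTML parser adds <html><body> around the fragment
xml_root = etree.XML(xml_text)      # the XML parser keeps the document as written

html_root_tag = html_root.tag
xml_root_tag = xml_root.tag
paragraphs = html_root.xpath('//p/text()')

print(html_root_tag, xml_root_tag, paragraphs)
```

For a file on disk, etree.parse('data.xml').getroot() is another way to obtain the root node without reading the text yourself.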

2. Select nodes (tags) with an XPath path

node.xpath(path) - returns all tags that match the given path

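XPath syntax (how to write a path):
1) Absolute path: starts with '/' and is written from the root node down, no matter which node .xpath() is called on
2) Relative path: uses '.' for the current node and '..' for the parent of the current node. The current node is whichever node .xpath() is called on
3) Full-document path: starts with '//' and matches every tag anywhere in the tree that fits the path structure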
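The three styles can reach the same nodes. The sketch below uses a cut-down version of data.xml embedded as a string (an assumption made so the example runs on its own) and calls .xpath() on the staffs node each time.

```python
from lxml import etree

# A reduced copy of data.xml, inlined so the example is self-contained.
xml_text = '''<supermarket>
    <staffs>
        <staff><name>张三</name></staff>
        <staff><name>小明</name></staff>
    </staffs>
    <goodsList>
        <goods><name>面包</name></goods>
    </goodsList>
</supermarket>'''

root = etree.XML(xml_text)
staffs = root.xpath('/supermarket/staffs')[0]

# Absolute: always written from the root, even though staffs is the caller.
absolute = staffs.xpath('/supermarket/staffs/staff/name/text()')
# Relative: '.' is the node .xpath() is called on (staffs here).
relative = staffs.xpath('./staff/name/text()')
# Full-document: matches the structure anywhere in the tree.
anywhere = staffs.xpath('//staff/name/text()')

print(absolute, relative, anywhere)
```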

1) Absolute path

staff_names = root.xpath('/supermarket/staffs/staff/name')
print(staff_names)

Appending '/text()' to the end of a path returns the tag content instead of the tag itself:
result = root.xpath('/supermarket/staffs/staff/name/text()')
print(result)       # ['张三', '小明', '小花', '李华']

Note: no matter which node .xpath() is called on, an absolute path must always be written from the root node.

goodsList = root.xpath('/supermarket/goodsList')[0]
result = goodsList.xpath('/supermarket/goodsList/goods/price/text()')
print(result)

2) Relative path

result = root.xpath('./staffs/staff/name/text()')
print(result)

goodsList = root.xpath('/supermarket/goodsList')[0]
result = goodsList.xpath('./goods/price/text()')
print(result)

When a relative path starts with './', the './' can be omitted.

result = goodsList.xpath('goods/price/text()')
print(result)
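The relative examples only use '.', so here is a small sketch of '..' as well. It assumes a reduced copy of data.xml inlined as a string; starting from a name node, '..' steps up to its staff node, from which a sibling tag can be reached.

```python
from lxml import etree

# A reduced copy of data.xml, inlined so the example is self-contained.
xml_text = '''<supermarket>
    <staffs>
        <staff><name>张三</name><salary>3500</salary></staff>
        <staff><name>小明</name><salary>3800</salary></staff>
    </staffs>
</supermarket>'''

root = etree.XML(xml_text)
first_name = root.xpath('/supermarket/staffs/staff/name')[0]

# '..' selects the parent of the current node.
parent = first_name.xpath('..')[0]
parent_tag = parent.tag

# Go up to the parent, then down to a sibling of the current node.
sibling_salary = first_name.xpath('../salary/text()')

print(parent_tag, sibling_salary)
```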

3) Full-document path

result = root.xpath('//name/text()')
print(result)

result = root.xpath('//goods/name/text()')
print(result)
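One detail worth checking for yourself: a path that starts with '//' searches the whole document even when .xpath() is called on a child node, whereas './/' limits the search to the descendants of the current node. The sketch below assumes a reduced copy of data.xml inlined as a string.

```python
from lxml import etree

# A reduced copy of data.xml, inlined so the example is self-contained.
xml_text = '''<supermarket>
    <name>永辉超市</name>
    <staffs><staff><name>张三</name></staff></staffs>
    <goodsList><goods><name>面包</name></goods></goodsList>
</supermarket>'''

root = etree.XML(xml_text)
goods_list = root.xpath('/supermarket/goodsList')[0]

# '//' starts from the document root, regardless of the calling node.
whole_tree = goods_list.xpath('//name/text()')
# './/' searches only below the calling node (goodsList here).
subtree = goods_list.xpath('.//name/text()')

print(whole_tree)
print(subtree)
```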

3. XPath predicates (conditions) - append '[predicate]' after the node in the path that the condition applies to

1) Position-based conditions

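[N]     -   the Nth node (counting starts at 1)
[last()]    -   the last node
[last()-N]      -  the (N+1)th node from the end, e.g. [last()-1] is the second to last
[position()>N], [position()<N], [position()>=N], [position()<=N]    -   nodes whose position satisfies the comparison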
result = root.xpath('//staffs/staff[2]/name/text()')
print(result)

result = root.xpath('//staffs/staff[last()]/name/text()')
print(result)

result = root.xpath('//staffs/staff[last()-1]/name/text()')
print(result)

result = root.xpath('//staffs/staff[position()<=2]/name/text()')
print(result)
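For reference, the following self-contained sketch runs the same four position predicates against the staff list from data.xml (inlined here as a string, reduced to the name tags) so the results can be checked without the file.

```python
from lxml import etree

# The staff names from data.xml, inlined so the example is self-contained.
xml_text = '''<supermarket>
    <staffs>
        <staff><name>张三</name></staff>
        <staff><name>小明</name></staff>
        <staff><name>小花</name></staff>
        <staff><name>李华</name></staff>
    </staffs>
</supermarket>'''

root = etree.XML(xml_text)

second = root.xpath('//staffs/staff[2]/name/text()')                # 2nd staff
last = root.xpath('//staffs/staff[last()]/name/text()')             # last staff
second_last = root.xpath('//staffs/staff[last()-1]/name/text()')    # 2nd to last
first_two = root.xpath('//staffs/staff[position()<=2]/name/text()') # first two

print(second, last, second_last, first_two)
```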

2) Attribute-based conditions

[@attribute=value] - select tags whose given attribute has the given value
[@attribute] - select tags that have the given attribute

result = root.xpath('//goodsList/goods[@tag]/name/text()')
print(result)

result = root.xpath('//goods[@tag="hot"]/name/text()')
print(result)
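The two attribute predicates can be tried on the goods list from data.xml. The sketch below inlines a reduced copy of that list (names and attributes only) so it runs on its own.

```python
from lxml import etree

# The goods from data.xml with their attributes, inlined for a self-contained run.
xml_text = '''<goodsList>
    <goods><name>面包</name></goods>
    <goods tag="hot"><name>泡面</name></goods>
    <goods tag="discount" discount="0.8"><name>火腿肠</name></goods>
    <goods tag="hot"><name>矿泉水</name></goods>
</goodsList>'''

root = etree.XML(xml_text)

# Goods that have a tag attribute at all, whatever its value.
with_tag = root.xpath('//goodsList/goods[@tag]/name/text()')
# Goods whose tag attribute equals "hot".
hot = root.xpath('//goods[@tag="hot"]/name/text()')

print(with_tag)
print(hot)
```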

3) Conditions on the content of a child tag

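[child=value] - select tags whose given child tag has content equal to the value
[child>value] - select tags whose given child tag has content greater than the value (the content is compared as a number)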

result = root.xpath('//goods[price=3.5]/name/text()')
print(result)

result = root.xpath('//goods[count>=50]/name/text()')
print(result)
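A self-contained sketch of child-content predicates follows, using the goods from data.xml inlined as a string. The last query, which compares a child tag against a string, is an extra illustration not present in the examples above.

```python
from lxml import etree

# The goods from data.xml, inlined so the example is self-contained.
xml_text = '''<goodsList>
    <goods><name>面包</name><price>5.5</price><count>12</count></goods>
    <goods><name>泡面</name><price>3.5</price><count>59</count></goods>
    <goods><name>火腿肠</name><price>1.5</price><count>30</count></goods>
    <goods><name>矿泉水</name><price>2</price><count>210</count></goods>
</goodsList>'''

root = etree.XML(xml_text)

# Numeric comparisons: the child's text is converted to a number.
price_is_3_5 = root.xpath('//goods[price=3.5]/name/text()')
count_at_least_50 = root.xpath('//goods[count>=50]/name/text()')
price_below_3 = root.xpath('//goods[price<3]/name/text()')

# String comparison: quote the value inside the path.
noodle_price = root.xpath('//goods[name="泡面"]/price/text()')

print(price_is_3_5, count_at_least_50, price_below_3, noodle_price)
```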

4. Getting tag content and tag attributes

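Get tag content: path/text() - returns the content of every tag selected by the path
Get tag attributes: path/@attribute - returns the value of the given attribute for every tag selected by the path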

result = root.xpath('//goods[2]/@tag')
print(result)

goods_names = root.xpath('//goods/name')
for x in goods_names:
    print(x.xpath('./text()')[0])
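Besides /text() and /@attribute, lxml elements also expose the same information through the .text attribute and the .get() method. The sketch below shows both routes on a reduced, inlined copy of the goods list; the variable names are illustrative only.

```python
from lxml import etree

# The goods from data.xml (names and tag attributes), inlined for a self-contained run.
xml_text = '''<goodsList>
    <goods><name>面包</name></goods>
    <goods tag="hot"><name>泡面</name></goods>
    <goods tag="discount" discount="0.8"><name>火腿肠</name></goods>
    <goods tag="hot"><name>矿泉水</name></goods>
</goodsList>'''

root = etree.XML(xml_text)

# XPath route: only goods that actually have the attribute contribute a value.
tags = root.xpath('//goods/@tag')

# Element route: .text for content, .get() for an attribute (None if missing).
rows = []
for goods in root.xpath('//goods'):
    name_node = goods.xpath('./name')[0]
    rows.append((name_node.text, goods.get('tag')))

print(tags)
print(rows)
```

Looping over the goods nodes keeps each name paired with its own attribute, which is easy to lose when two separate XPath result lists have different lengths.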

5. In an XPath path, * can stand for any tag or any attribute

result = root.xpath('//goods[1]/*/text()')
print(result)

result = root.xpath('//*[@class="c1"]/text()')
print(result)

result = root.xpath('//goodsList/goods[3]/@*')
print(result)
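The wildcard queries can be checked against a small inlined sample. The sketch below assumes the first three goods from data.xml; the attribute values are sorted before printing so the example does not depend on the order in which they are returned.

```python
from lxml import etree

# The first three goods from data.xml, inlined so the example is self-contained.
xml_text = '''<goodsList>
    <goods><name class="c1">面包</name><price>5.5</price><count>12</count></goods>
    <goods tag="hot"><name>泡面</name><price class="c1">3.5</price><count>59</count></goods>
    <goods tag="discount" discount="0.8"><name>火腿肠</name><price>1.5</price><count>30</count></goods>
</goodsList>'''

root = etree.XML(xml_text)

# * as a tag: the content of every child tag of the first goods.
first_goods_values = root.xpath('//goods[1]/*/text()')
# * as a tag: any tag, anywhere, whose class attribute is "c1".
c1_values = root.xpath('//*[@class="c1"]/text()')
# * as an attribute: every attribute value of the third goods.
third_goods_attrs = sorted(root.xpath('//goodsList/goods[3]/@*'))

print(first_goods_values, c1_values, third_goods_attrs)
```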

6. Union (branches) - |

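path1|path2 - returns everything matched by any of the paths separated by | (lxml returns the combined results in document order)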

result = root.xpath('//goods/name/text()|//staff/name/text()')
print(result)
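A small self-contained sketch of the union operator follows, using a reduced copy of data.xml inlined as a string. Although the goods path is written first, the staff names appear first in the result because staffs comes before goodsList in the document.

```python
from lxml import etree

# A reduced copy of data.xml, inlined so the example is self-contained.
xml_text = '''<supermarket>
    <staffs>
        <staff><name>张三</name></staff>
        <staff><name>小明</name></staff>
    </staffs>
    <goodsList>
        <goods><name>面包</name></goods>
        <goods><name>泡面</name></goods>
    </goodsList>
</supermarket>'''

root = etree.XML(xml_text)

# Two paths joined by |: everything matched by either path is returned.
names = root.xpath('//goods/name/text()|//staff/name/text()')

print(names)
```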
