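Nokogiri supports two main types of searches, `search` and `at`. `search` returns a NodeSet, which you should think of as an array. `at` returns a single Node. Both can take either a CSS or an XPath expression. I prefer CSS because it is more readable, but sometimes you can't easily get where you want with one, so try the other.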
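For your question, the important part is to call `text` on the specific node you want to extract text from. If your search is too broad, you'll get the text from between the tags in addition to the text inside the tag you want. To avoid that, drill down to the node closest to the content you want to read: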
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<release>
  <artists>
    <artist>
      <name>Johnny Mnemonic</name>
    </artist>
    <artist>
      <name>Constantine</name>
    </artist>
  </artists>
</release>
EOT
Because these look for the `name` node specifically, the desired text is easy to get, with no junk:
doc.at('name').text # => "Johnny Mnemonic"
doc.at('artist name').text # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"
These are looser searches, so more junk is returned:
doc.at('artist').text # => "\n      Johnny Mnemonic\n    "
doc.at('artists').text # => "\n    \n      Johnny Mnemonic\n    \n    \n      Constantine\n    \n  "
Using `search` returns multiple nodes:
doc.search('name').map(&:text) # => ["Johnny Mnemonic", "Constantine"]
doc.search('artist').map(&:text) # => ["\n      Johnny Mnemonic\n    ", "\n      Constantine\n    "]
The only real difference between `search` and `at` is that `at` acts like `search(...).first`.
See also "How to avoid joining all text from nodes when scraping": https://stackoverflow.com/questions/43594656/how-to-avoid-joining-all-text-from-nodes-when-scraping
For convenience, Nokogiri also has more specific variants: `at_css` and `css` for CSS selectors, and `at_xpath` and `xpath` for XPath expressions.
Here are some alternate ways to do it, using CSS and XPath accessors to get at the names, clipped from Pry:
[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]