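When using R's XML package, how can I keep a node's data associated with a related node, for example in the same list? I am trying to put data scraped from the web into a data frame, with related information grouped into rows. There are <span> elements with no class attribute to tell them apart, and each related group (a row of the data frame) may contain one or two of them.
Here is some sample HTML, which I saved as html_example.html.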
<!DOCTYPE html>
<html>
<body>
<div class="foo">
<div class="fooname">Name of 1st foo</div>
<span>1st span in 1st foo</span>
<span>2nd span in 1st foo</span>
</div>
<div class="foo">
<div class="fooname">Name of 2nd foo</div>
<span>Only 1 span in 2nd foo</span>
</div>
</body>
</html>
Here is the current parsing code and its output:
library(XML)
html <- readLines("html_example.html")
parse <- htmlParse(html)
fooname <- xpathSApply(parse, "//div[@class='foo']/div[@class='fooname']"
, xmlValue)
print(fooname)
# > print(fooname)
# [1] "Name of 1st foo" "Name of 2nd foo"
span <- xpathSApply(parse, "//div[@class='foo']/span"
, xmlValue)
print(span)
# > print(span)
# [1] "1st span in 1st foo" "2nd span in 1st foo" "Only 1 span in 2nd foo"
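At the moment there is no way to associate the "fooname" values with the "span" values. Is there a way to make the scraped output look like this?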
foo1 <- list(fooname[1], span[1:2])
foo2 <- list(fooname[2], span[3])
list1 <- list(foo1, foo2)
list1
# > list1
# [[1]]
# [[1]][[1]]
# [1] "Name of 1st foo"
#
# [[1]][[2]]
# [1] "1st span in 1st foo" "2nd span in 1st foo"
#
#
# [[2]]
# [[2]][[1]]
# [1] "Name of 2nd foo"
#
# [[2]][[2]]
# [1] "Only 1 span in 2nd foo"
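Ultimately, though it does not have to happen during the scraping step itself, I would like to create a data frame like the one below. A related discussion of the NA padding is here: https://stackoverflow.com/questions/29188801/using-ldply-with-unequal-lengths-from-strsplit/29189798#29189798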
FooNames <- c(fooname[1], fooname[2])
Span1 <- c(span[1], span[3])
Span2 <- c(span[2], NA)
df <- data.frame(FooNames, Span1, Span2, stringsAsFactors = FALSE)
df
# > df
#          FooNames                  Span1               Span2
# 1 Name of 1st foo    1st span in 1st foo 2nd span in 1st foo
# 2 Name of 2nd foo Only 1 span in 2nd foo                <NA>
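For reference, the grouping described above could probably be obtained by selecting each div.foo node first and then running XPath queries relative to that node, instead of querying the whole document. The following is only a rough sketch under that assumption: the names doc, foos, groups, max_spans and span_mat are made up for illustration, and the sample HTML is inlined so the snippet runs on its own.

```r
library(XML)

# Inline copy of the sample HTML so the sketch is self-contained
html <- '<!DOCTYPE html>
<html>
<body>
<div class="foo">
<div class="fooname">Name of 1st foo</div>
<span>1st span in 1st foo</span>
<span>2nd span in 1st foo</span>
</div>
<div class="foo">
<div class="fooname">Name of 2nd foo</div>
<span>Only 1 span in 2nd foo</span>
</div>
</body>
</html>'
doc <- htmlParse(html, asText = TRUE)

# Select one node per group, then query relative to each node ("./...")
foos <- getNodeSet(doc, "//div[@class='foo']")
groups <- lapply(foos, function(node) {
  list(
    fooname = xpathSApply(node, "./div[@class='fooname']", xmlValue),
    # as.character() turns an empty result into character(0)
    span    = as.character(xpathSApply(node, "./span", xmlValue))
  )
})

# Pad each span vector with NA up to the longest one, one row per group
max_spans <- max(vapply(groups, function(g) length(g$span), integer(1)))
span_mat <- do.call(rbind, lapply(groups, function(g) g$span[seq_len(max_spans)]))

df <- data.frame(
  FooNames = vapply(groups, function(g) g$fooname, character(1)),
  span_mat,
  stringsAsFactors = FALSE
)
names(df) <- c("FooNames", paste0("Span", seq_len(max_spans)))
print(df)
```

Here groups holds one fooname plus its span vector per div.foo, which is the nested shape asked about, and df is the padded data frame built from it. The sketch assumes every group contains exactly one fooname element.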