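Ugh, I think I got it. You won't be able to use readHTMLTable (and I now understand the XML package code far better than I did before... there is some serious R-fu in that code), and I'm using rvest only because I mix XPath and CSS selectors (though I ended up thinking more in XPath). dplyr is only there for glimpse.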
library(XML)
library(dplyr)
library(rvest)
trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)
# neither rvest::html nor rvest::html_session liked it, hence using XML::htmlParse
doc <- htmlParse("http://nomads.ncep.noaa.gov/")
ds <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
descendant::td[contains(., 'http')]/
preceding-sibling::td[3]")
data_set <- ds %>% html_text() %>% trim()
data_set_descr_link <- ds %>% html_nodes("a") %>% html_attr("href")
freq <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
descendant::td[contains(., 'hourly') or
contains(., 'hours') or
contains(., 'daily') or
contains(., '06Z')]") %>%
html_text() %>% trim()
grib_filter <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
descendant::td[contains(., 'http')]/preceding-sibling::td[1]") %>%
  sapply(function(x) {
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })
http_link <- doc %>% html_nodes("a[href^='/pub/data/']") %>% html_attr("href")
gds_alt <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
descendant::td[contains(., 'http')]/following-sibling::td[1]") %>%
  sapply(function(x) {
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })
nom <- data.frame(data_set,
data_set_descr_link,
freq,
grib_filter,
gds_alt)
glimpse(nom)
## Variables:
## $ data_set (fctr) FNL, GFS 1.0x1.0 Degree, GFS 0.5x0.5 Degr...
## $ data_set_descr_link (fctr) txt_descriptions/fnl_doc.shtml, txt_descr...
## $ freq (fctr) 6 hours, 6 hours, 6 hours, 12 hours, 6 ho...
## $ grib_filter (fctr) cgi-bin/filter_fnl.pl, cgi-bin/filter_gfs...
## $ gds_alt (fctr) dods-alt/fnl, dods-alt/gfs, dods-alt/gfs_...
head(nom)
##                             data_set
## 1                                FNL
## 2                 GFS 1.0x1.0 Degree
## 3                 GFS 0.5x0.5 Degree
## 4                 GFS 2.5x2.5 Degree
## 5       GFS Ensemble high resolution
## 6 GFS Ensemble Precip Bias-Corrected
##
##                                             data_set_descr_link     freq
## 1                                txt_descriptions/fnl_doc.shtml  6 hours
## 2                txt_descriptions/GFS_high_resolution_doc.shtml  6 hours
## 3                    txt_descriptions/GFS_half_degree_doc.shtml  6 hours
## 4                 txt_descriptions/GFS_Low_Resolution_doc.shtml 12 hours
## 5       txt_descriptions/GFS_Ensemble_high_resolution_doc.shtml  6 hours
## 6 txt_descriptions/GFS_Ensemble_precip_bias_corrected_doc.shtml    daily
##
##                       grib_filter          gds_alt
## 1           cgi-bin/filter_fnl.pl     dods-alt/fnl
## 2           cgi-bin/filter_gfs.pl     dods-alt/gfs
## 3        cgi-bin/filter_gfs_hd.pl  dods-alt/gfs_hd
## 4       cgi-bin/filter_gfs_2p5.pl dods-alt/gfs_2p5
## 5          cgi-bin/filter_gens.pl    dods-alt/gens
## 6 cgi-bin/filter_gensbc_precip.pl dods-alt/gens_bc
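Please make sure the columns match up. I eyeballed it, but verification would be great. NOTE: there may be a better way to do the sapply calls (anyone should feel free to edit that in, crediting yourself).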
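One possible way to drop the sapply calls, sketched under the assumption that you are on an xml2-based rvest (0.3.0 or later): there, html_node (singular) applied to a node set returns one result per input node, and html_attr gives NA where no match was found, so the per-cell ifelse is not needed. The HTML below is a made-up two-row stand-in for the NOMADS table, not the real page, and grib_filter_toy is just an illustrative name.

```r
library(rvest)  # assumes the xml2-based rvest (0.3.0 or later)

# Hypothetical stand-in for the real table: the second row has no filter link.
toy <- read_html('
<table>
  <tr><td><a href="cgi-bin/filter_fnl.pl">grib filter</a></td><td>http</td></tr>
  <tr><td>-</td><td>http</td></tr>
</table>')

# Same idea as the XPath in the answer: find the "http" cell, step one cell left.
cells <- toy %>%
  html_nodes(xpath = "//td[contains(., 'http')]/preceding-sibling::td[1]")

# html_node() keeps one slot per cell, so a cell without an <a>
# comes back as NA instead of being dropped.
grib_filter_toy <- cells %>% html_node("a") %>% html_attr("href")
grib_filter_toy
```

The result has one element per cell, with NA for the cell that has no link, which is what keeps the columns aligned when they are later combined.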
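It's really brittle code, i.e. if the format changes, it will croak (but that's true of all scraping). It should survive them actually producing valid HTML (this is wretched HTML, by the way), but most of the code depends on the http column remaining valid, since most of the other column extractions key off it. Your missing model is in there too. If any of the XPath is confusing, drop a comment and I'll try to 'splain.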
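On the brittleness: data.frame() silently recycles vectors whose lengths divide evenly, so a page change could misalign columns without raising an error. A minimal sketch of a guard that fails loudly instead; check_same_length is a hypothetical helper name, and the vectors are toy stand-ins for the scraped columns.

```r
# Hypothetical helper (not part of the code above): stop unless every
# scraped vector has the same length.
check_same_length <- function(...) {
  cols <- list(...)
  lens <- vapply(cols, length, integer(1))
  if (length(unique(lens)) > 1) {
    stop("scraped columns differ in length: ",
         paste(names(lens), lens, sep = "=", collapse = ", "))
  }
  invisible(TRUE)
}

# Toy vectors standing in for the scraped columns
ok <- check_same_length(data_set = c("FNL", "GFS"),
                        freq     = c("6 hours", "6 hours"))

# A mismatch is reported with the offending lengths
bad <- tryCatch(
  check_same_length(data_set = c("FNL", "GFS"), freq = "6 hours"),
  error = function(e) conditionMessage(e)
)
bad
```

Calling it with the real vectors (data_set, data_set_descr_link, freq, grib_filter, gds_alt) right before data.frame() would turn a silent misalignment into an error message naming the column lengths.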