I am trying to search the ProQuest archive using R. I am interested in finding the number of newspaper articles that contain a particular keyword. This usually works well with the rvest toolkit. However, the program sometimes crashes. See this minimal example:
```r
library(xml2)
library(rvest)

# Retrieve the title of the first search hit on each page of search results
for (p in seq(0, 150, 10)) {
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", p, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, ".text tr:nth-child(1) .result_title a")
  textWeb <- html_text(nodeWeb)
  print(textWeb)
  Sys.sleep(0.1)
}
```
This sometimes works for me. However, if I run this script (or a similar one) a few times, it crashes at the same point, with an error on the 12th iteration (p = 120):

```
Error in open.connection(x, "rb") : HTTP error 503.
```
I tried to work around the problem with pauses of increasing length, but that did not help.
I have also considered:
- saving the result pages that could not be retrieved and writing a separate script for those cases
- changing my IP programmatically
- quitting and restarting R programmatically for a while

I would appreciate any comments.
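One way to make the script survive an occasional 503 is to retry the failing call with a back-off instead of letting it abort the whole run. A minimal sketch (`with_retry()` is my own helper, not part of the original code or any library):

```r
# Call `fun` up to `times` times, sleeping `pause * attempt` seconds
# after each failure (linear back-off), and return the first success.
with_retry <- function(fun, times = 5, pause = 1) {
  last_error <- NULL
  for (attempt in seq_len(times)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    Sys.sleep(pause * attempt)
  }
  stop("all ", times, " attempts failed: ", conditionMessage(last_error))
}

# In the loop above you would wrap the fragile call, e.g.:
# htmlWeb <- with_retry(function() read_html(searchURL))
```

`httr::RETRY()` offers a ready-made version of the same idea with exponential back-off, if you prefer not to roll your own.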
Try to act a bit more human with the delays. This worked for me (across multiple attempts):
```r
library(xml2)
library(httr)
library(rvest)
library(purrr)
library(dplyr)

to_get <- seq(0, 150, 10)
pb <- progress_estimated(length(to_get))

map_chr(to_get, function(i) {
  pb$tick()$print()
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", i, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, "td > font.result_title > a")
  textWeb <- html_text(nodeWeb)
  Sys.sleep(sample(10, 1) * 0.1)
  textWeb
}) -> titles

print(trimws(titles))
```
```
## [1] "NEWSPAPER SPECIALS."
## [2] "NEWSPAPER SPECIALS."
## [3] "New Jersey Ice Co. Insolvent."
## [4] "NEWSPAPER SPECIALS."
## [5] "NEWSPAPER SPECIALS"
## [6] "AMERICAN ICE BEGINNING BUSY SEASON IN IMPROVED CONDITION."
## [7] "NEWSPAPER SPECIALS"
## [8] "THE GERMAN REICHSBANK."
## [9] "U.S. Exploration Co. Bankrupt."
## [10] "CHICAGO TRACTION."
## [11] "INCREASING FREIGHT RATES."
## [12] "A.O. BROWN & CO."
## [13] "BROAD STREET GOSSIP"
## [14] "Meadows, Williams & Co."
## [15] "FAILURES IN OCTOBER."
## [16] "Supplementary Receiver for Heinze & Co."
```
I randomized the sleep value, simplified the CSS target a bit, added a progress bar, and had the result vector built automatically. You will probably want a data.frame out of that data eventually, so see ?purrr::map_df for that.
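If you do go the map_df route, here is a minimal sketch of the shape it takes. (`scrape_first_hit()` is a placeholder of my own standing in for the `read_html()`/`html_node()` body above, so this runs without touching the site.)

```r
library(purrr)
library(tibble)

to_get <- seq(0, 150, 10)

# Placeholder for the actual scraping body, which would fetch the
# results page at offset i and return the first hit's title.
scrape_first_hit <- function(i) paste("TITLE AT", i)

# map_df() row-binds the one-row tibbles into a single data.frame,
# keeping the page offset next to each title.
titles_df <- map_df(to_get, function(i) {
  tibble(start = i, title = scrape_first_hit(i))
})

head(titles_df, 2)
```

The `start` column makes it easy to see later which results page a title came from, which also helps with the "save the pages that failed" idea from the question.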