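Here you can get the data by reading each HTML page. This is just one way to get reasonably good results, but note that the gsub calls will need editing to get better ones. The reason is that the pages in that URL list are not uniform in how they display the data, and that is something you have to deal with. For example, screenshots of two different pages show differences in how the data is presented.
In any case, you can handle this by adapting the following code: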
library(purrr)
library(rvest)
# [...] here is your data (all_ppl_urls must already be defined)
all_ppl_urls[100] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
# [3] "Gender: MaleReligion: Eastern OrthodoxRace or Ethnicity: Middle EasternSexual orientation: StraightOccupation: PoliticianParty Affiliation: Republican"
#-----------------------------------------------------------------------------------------------
# NEW WAY
toString(read_html(all_ppl_urls[100])) #get example of how html looks...
#><b>AKA</b> Edmund Spencer Abraham</p>\n<p><b>Born:</b> <a href=\"/lists/681/000106363/\" class=\"proflink\">12-Jun</a>-<a href=\"/lists/951/000105636/\" class=\"proflink\">1952</a><br><b>Birthplace:</b> <a href=\"/geo/604/000080364/\" class=\"proflink\">East Lansing, MI</a><br></p>\n<p><b>Gender:</b> Male<br><b>
#1. remove NA urls (avoid problems later on)
urls <- all_ppl_urls[!is.na(all_ppl_urls)]
length(all_ppl_urls)
length(urls)
#function that creates a list with your data
GetLife <- function(htmlurl) {
  htmltext <- toString(read_html(htmlurl))
  # each gsub strips everything before the label and everything after the value
  name <- gsub('^.*AKA</b>\\s*|\\s*</p>\n.*$', '', htmltext)
  gender <- gsub('^.*Gender:</b>\\s*|\\s*<br>.*$', '', htmltext)
  race <- gsub('^.*Race or Ethnicity:</b>\\s*|\\s*<br>.*$', '', htmltext)
  occupation <- gsub('^.*Occupation:</b>\\s*|\\s*<br>.*$|\\s*</a>.*$|\\s*</p>.*$', '', htmltext)
  # occupation often contains hyperlinks, so strip any remaining tags in a second step
  occupation <- gsub("<[^>]+>", "", occupation)
  nationality <- gsub('^.*Nationality:</b>\\s*|\\s*<br>.*$', '', htmltext)
  # a result longer than 100 chars means the pattern did not match cleanly, so use NA
  res <- c(ifelse(nchar(name) > 100, NA, name),
           ifelse(nchar(gender) > 100, NA, gender),
           ifelse(nchar(race) > 100, NA, race),
           ifelse(nchar(occupation) > 100, NA, occupation),
           ifelse(nchar(nationality) > 100, NA, nationality),
           htmlurl)
  return(res)
}
emptydf <- data.frame(matrix(ncol = 6, nrow = 0)) # create an empty data frame
colnames(emptydf) <- c("name", "gender", "race", "occupation", "nationality", "url") # set its column names
urls <- urls[2020:2030] # sample some of the urls
for (i in seq_along(urls)) {
  emptydf[i, ] <- GetLife(urls[i])
}
emptydf
Here is the result of parsing these 11 URLs:
   name                       gender race     occupation nationality   url
1  <NA>                       Male   White    Business   United States http://www.nndb.com/people/214/000128827/
2  Mark Alexander Ballas, Jr. Male   White    Dancer     United States http://www.nndb.com/people/162/000346121/
3  Thomas Cass Ballenger      Male   White    Politician United States http://www.nndb.com/people/354/000032258/
4  Severiano Ballesteros Sota Male   Hispanic Golf       Spain         http://www.nndb.com/people/778/000116430/
5  Richard Achilles Ballinger Male   White    Government United States http://www.nndb.com/people/511/000168007/
6  Steven Anthony Ballmer     Male   White    Business   United States http://www.nndb.com/people/644/000022578/
7  Edward Michael Balls       Male   White    Politician England       http://www.nndb.com/people/846/000141423/
8  <NA>                       Male   White    Judge      United States http://www.nndb.com/people/533/000168029/
9  <NA>                       Male   Asian    Engineer   England       http://www.nndb.com/people/100/000123728/
10 Michael A. Balmuth         Male   White    Business   United States http://www.nndb.com/people/635/000175110/
11 Aristotle N. Balogh        Male   White    Business   United States http://www.nndb.com/people/311/000172792/
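The NA names above come from pages where a label is missing or laid out differently. In that case gsub leaves a long chunk of unrelated HTML behind, and the 100-character check turns it into NA. If you want to tune the patterns without hitting the site each time, one option is to test them on a small HTML string first. The following is only a sketch: `extract_field` and `parse_person` are hypothetical helpers (not part of the code above), and `sample_html` is a shortened fragment modeled on the `toString()` output shown earlier, with the occupation link markup assumed.

```r
# Shortened sample modeled on the toString() output above.
# The Occupation markup (a "proflink" anchor) is an assumption for illustration.
sample_html <- paste0(
  "<p><b>AKA</b> Edmund Spencer Abraham</p>\n",
  "<p><b>Gender:</b> Male<br>",
  "<b>Occupation:</b> <a href=\"#\" class=\"proflink\">Politician</a><br></p>"
)

# Hypothetical helper: return the text that follows "<b>label:</b>" up to the
# next <br> or </p>, with any tags removed, or NA if the label is not found.
extract_field <- function(html, label) {
  # (?s) lets "." span newlines; the colon is optional because "AKA" has none
  pattern <- paste0("(?s)", label, ":?</b>\\s*(.*?)\\s*(?:<br>|</p>)")
  m <- regmatches(html, regexec(pattern, html, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)   # label absent on this page
  trimws(gsub("<[^>]+>", "", m[2]))          # drop hyperlink tags inside the value
}

# Hypothetical wrapper: extract the same five fields as GetLife from one HTML string.
parse_person <- function(html) {
  labels <- c(name = "AKA", gender = "Gender", race = "Race or Ethnicity",
              occupation = "Occupation", nationality = "Nationality")
  vapply(labels, function(l) extract_field(html, l), character(1))
}

person <- parse_person(sample_html)
print(person)
```

Because a missing label yields NA directly, this variant does not need the length check. To run it on real pages you would pass `toString(read_html(url))` as `html`, and you may still have to adjust the pattern for pages that use a different layout.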