这一切都可以在没有 Selenium 的情况下使用 rvest 来完成。不幸的是,Google 在不同的区域设置中的工作方式有所不同,因此例如在我的区域设置中,必须先导航一个同意页面,然后才能向 Google 发送请求。
似乎这在 OP 语言环境中不是必需的,但对于那些在英国的人来说,您可能需要先运行以下代码才能使其余代码正常工作:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
google_handle <- httr::handle('https://www.google.com')
httr::GET('https://www.google.com', handle = google_handle)
httr::POST(paste0('https://consent.google.com/save?continue=',
'https://www.google.com/',
'&gl=GB&m=0&pc=shp&x=5&src=2',
'&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
'set_eom=false&set_aps=true&set_sc=true'),
handle = google_handle)
url <- httr::GET(url, handle = google_handle)
对于OP和那些没有Google同意页面的人来说,设置很简单:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
接下来,我们定义将用于提取标题(如之前的问答)以及标题下方的文本(与此问题相关)的 xpath
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
现在我们可以应用这些 xpath 来获取正确的节点并从中提取文本:
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
text = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#> title text
#> <chr> <chr>
#> 1 "Mario García Torres - Wikipedia" "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (@mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~