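I have two datasets: one containing locations (latitude/longitude), i.e. test, and another containing latitude/longitude information for all NYC zip codes, i.e. test2.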
test <- structure(list(trip_count = 1:10, dropoff_longitude = c(-73.959862,
-73.882202, -73.934113, -73.992203, -74.00563, -73.975189, -73.97448,
-73.974838, -73.981377, -73.955093), dropoff_latitude = c(40.773617,
40.744175, 40.715923, 40.749203, 40.726158, 40.729824, 40.763599,
40.754135, 40.759987, 40.765224)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
test2 <- structure(list(latitude = c(40.853017, 40.791586, 40.762174,
40.706903, 40.825727, 40.739022, 40.750824, 40.673138, 40.815559,
40.754591), longitude = c(-73.91214, -73.94575, -73.94917, -73.82973,
-73.81752, -73.98205, -73.99289, -73.81443, -73.90771, -73.976238
), borough = c("Bronx", "Manhattan", "Manhattan", "Queens", "Bronx",
"Manhattan", "Manhattan", "Queens", "Bronx", "Manhattan")), class = "data.frame", row.names = c(NA,
-10L))
I am now trying to join these two datasets so that I end up with one borough for each trip_count. So far I have used difference_left_join for this, like so:
test %>% fuzzyjoin::difference_left_join(test2,by = c("dropoff_longitude" = "longitude" , "dropoff_latitude" = "latitude"), max_dist = 0.01)
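Although this approach works, the join creates many multiple matches as the datasets get larger, so I sometimes end up with a result that is ten times the size of the initial dataset test. Does anyone have a different approach that solves this without creating multiple matches? Or is there a way to force the join to always use only one match for each row of test? I would greatly appreciate it!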
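EDIT: A solution to this question, "R dplyr left join - multiple returned values and new rows: how to ask for the first match only?" (https://stackoverflow.com/questions/42431974/r-dplyr-left-join-multiple-returned-values-and-new-rows-how-to-ask-for-the-fi), would also solve my problem. So maybe one of you has an idea for that!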
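To make the "only one match per row" goal concrete, here is a minimal base R sketch of the behaviour I am after. It is not a fuzzyjoin feature: the helper name nearest_borough and its max_dist argument are hypothetical. It keeps only the single closest zip code for each trip, and it measures closeness as straight-line distance in degrees, which is an assumption and not quite the same as the per-column tolerance that difference_left_join applies.

```r
# Hypothetical helper: attach the borough of the single nearest zip code to each trip.
# Assumes the column names used in test and test2 above.
nearest_borough <- function(points, zips, max_dist = Inf) {
  # Squared distance (in degrees) between every trip (rows) and every zip code (columns)
  d2 <- outer(points$dropoff_longitude, zips$longitude, "-")^2 +
    outer(points$dropoff_latitude, zips$latitude, "-")^2
  # Column index of the smallest distance in each row; ties resolved to the first zip code
  nearest <- max.col(-d2, ties.method = "first")
  min_d <- sqrt(d2[cbind(seq_len(nrow(points)), nearest)])
  # Keep the borough only when the nearest zip code lies within max_dist, otherwise NA
  points$borough <- ifelse(min_d <= max_dist, zips$borough[nearest], NA_character_)
  points
}

# Small usage example built from a few of the rows above
pts <- data.frame(
  trip_count = 1:2,
  dropoff_longitude = c(-73.992203, -73.882202),
  dropoff_latitude = c(40.749203, 40.744175)
)
zips <- data.frame(
  latitude = c(40.750824, 40.706903, 40.853017),
  longitude = c(-73.99289, -73.82973, -73.91214),
  borough = c("Manhattan", "Queens", "Bronx")
)
result <- nearest_borough(pts, zips, max_dist = 0.01)
```

The result always has exactly as many rows as the input. The catch is that outer() builds a full trips-by-zip-codes matrix, so for large data this would probably have to be run in chunks.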