我有 2 个数据集,想要进行模糊连接。
这是两个数据集。
library(data.table)
# data1
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS", header = T)
# data2
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI", header = T)
两个数据集具有相同的字符State
and type
;然而,列NAME
不一样。他们很相似。
虽然我可以减去列NAME
每个数据有3或4个图表,然后将它们合并,似乎由于观察量较大,正确率可能不高。
dt1$NameSubstr <- substr(dt1$NAME, 1, 4)
dt2$NameSubstr <- substr(dt2$NAME, 1, 4)
merge(dt1, dt2, by = c("NameSubstr", "State", "type"), all = T)
方法不好。
我检查包裹fuzzyjoin
。但不确定我是否正确。
library(fuzzyjoin)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
# Results
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: <NA> <NA> <NA> ADA OHIO OH CI
9: <NA> <NA> <NA> ADAMS MASS MA CI
10: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
本练习的结果是正确的,见下文。但是如果这两个数据中的任意一个NAME相同,则答案将不正确。
我在这两个数据中创建了一个新的观察结果。
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS
THE SAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI
THE SAME AA BB
", header = T)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: THE SAME AA BB <NA> <NA> <NA>
9: <NA> <NA> <NA> ADA OHIO OH CI
10: <NA> <NA> <NA> ADAMS MASS MA CI
11: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
12: <NA> <NA> <NA> THE SAME AA BB
这是不正确的结果。
有什么建议吗?
看来我不能使用fuzzy_full_join
.