Here is an example input:
df <- tibble::tribble(
~name, ~number, ~ind,
"ARPO", "405162", 5,
"ARPO S.L.", "504653", 22,
"ARPOS", "900232", 1,
"ARPO", "504694", 12,
"ARPO", "400304", 42,
"JJJJ", "401605", 2,
"JJJJ", "900029", 31,
"BBBBB", "400090", 25,
"BBBBB", "403004", 33,
"JJJJ", "900222", 2,
"BBBBB", "403967", 11,
"BBBB", "400304", 52,
"JJJJ", "404308", 200,
"ARPO", "403898", 2,
"ARPO", "158159", 24,
"BBBBBBB", "700805", 2,
"ARPO S.L.", "900245", 24,
"JJJJ", "501486", 2,
"JJJJ", "400215", 210,
"JJJJ", "504379", 26,
"HARPO", "900222", 400,
"BBBBB", "109700", 46,
"ARPO", "142173", 14,
"BBBBB", "400586", 22,
"ARPO", "401605", 322
)
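I found a similar solution here: "Group together levels with similar names R" https://stackoverflow.com/questions/24825215/group-together-levels-with-similar-names-r
It repeatedly takes the first remaining name, collects every remaining name that approximately matches it with agrep(), and removes those names from the pool: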
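x <- df$name
groups <- list()
i <- 1
while (length(x) > 0) {
  # indices of all remaining names that approximately match the first one
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]
  # drop the matched names so each name ends up in exactly one group
  x <- x[-id]
  i <- i + 1
}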
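To see why names such as "HARPO" or "ARPO S.L." end up in the same group as "ARPO", it can help to call agrep() on its own. As far as I understand it, agrep() looks for an approximate match of the pattern anywhere inside each element, so longer names that contain the pattern also match. The `candidates` vector below is just a small illustrative subset of the names in `df`, not part of the original code:

```r
# Illustrative subset of the names from the example data
candidates <- c("ARPO", "ARPO S.L.", "ARPOS", "HARPO", "JJJJ")

# value = TRUE returns the matching strings instead of their indices
matched <- agrep("ARPO", candidates,
                 ignore.case = TRUE, max.distance = 0.1, value = TRUE)
print(matched)
```

If the groups come out too broad for your data, lowering max.distance is the first thing I would try.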
Once `groups` has been built, you can create a group variable:
df$group <- ""
for (j in seq_along(groups)) {
  # label every row whose name belongs to group j
  df$group <- ifelse(df$name %in% groups[[j]], paste0("group_", j), df$group)
}
Maybe you can find a simpler solution, but this works!
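For what it's worth, the two steps could probably be folded into one function. This is only a sketch: `group_similar()` is a name I made up (it is not from the linked answer), and it assumes the names are non-empty, non-missing strings. It works on the unique names and writes the group label directly, so the separate labelling loop is not needed:

```r
# Hypothetical helper, not part of the original answer.
# Returns one group label per element of x.
group_similar <- function(x, max.distance = 0.1) {
  remaining <- unique(x)          # each distinct name is processed once
  group <- character(length(x))
  i <- 1
  while (length(remaining) > 0) {
    id <- agrep(remaining[1], remaining,
                ignore.case = TRUE, max.distance = max.distance)
    # always include the first element so the loop is guaranteed to advance
    id <- union(1L, id)
    group[x %in% remaining[id]] <- paste0("group_", i)
    remaining <- remaining[-id]
    i <- i + 1
  }
  group
}

# Small example using a few names from the data above
names_vec <- c("ARPO", "ARPO S.L.", "JJJJ", "ARPO", "BBBBB")
result <- group_similar(names_vec)
print(result)
```

With the example data it would be used as df$group <- group_similar(df$name). Note that the result still depends on the order of the names, because each group is seeded by the first name not yet assigned.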