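Here is the approach I would take. I would write a function that encodes the conditions I need to consider, and then apply it to the input. I've added comments to explain what is happening inside the function.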
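The function has 4 arguments:
- invec: The input character vector.
- thresh: How many leading characters to use to determine the "base" dish. Default = 5.
- minlen: Your "bonus" question, that is, how many extra characters beyond the base dish a variant may have and still be returned. Default = 3.
- strict: Logical. If there is a base dish whose nchar is shorter than your thresh, do you want to lower the threshold, or be strict about your requirement for the base? Default = FALSE. See the last example for how strict might work.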
myfun <- function(invec, thresh = 5, minlen = 3, strict = FALSE) {
  # Bookkeeping -- sort, unique, all upper case
  invec <- sort(unique(toupper(invec)))
  # More bookkeeping -- thresh should not be longer than the
  # shortest dish name unless strict = TRUE
  thresh <- if (isTRUE(strict)) thresh else min(min(nchar(invec)), thresh)
  # Use `thresh` to get the `stubs`
  stubs <- invec[!duplicated(substr(invec, 1, thresh))]
  # Loop through the stubs and do two things:
  # - Match the dish with the stub
  # - Return the base dish and any dishes within the minlen
  unlist(
    lapply(stubs, function(x) {
      temp <- grep(x, invec, value = TRUE, fixed = TRUE)
      temp[temp == x | nchar(temp) <= nchar(x) + minlen]
    }),
    use.names = FALSE)
}
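To make the bookkeeping steps concrete, here is a small trace of what the function computes before it reaches the lapply loop. The input vector demo is hypothetical (it is not from the question) and the code only mirrors the first three steps of myfun; treat it as an illustration rather than a replacement.

```r
# Hypothetical input, used only to trace the steps inside myfun
demo <- c("pizza", "PIZZA_BOLOGNESE", "HAMBURGER2", "HAMBURGER", "DAL BHAT", "PIZZA")

# Step 1: upper-case, de-duplicate, sort
invec <- sort(unique(toupper(demo)))

# Step 2: with strict = FALSE, thresh is capped at the shortest dish name
thresh <- min(min(nchar(invec)), 5)
thresh
# [1] 5

# Step 3: the first dish seen for each distinct prefix becomes a stub
stubs <- invec[!duplicated(substr(invec, 1, thresh))]
stubs
# [1] "DAL BHAT" "HAMBURGER" "PIZZA"
```

Because the input is sorted first, the shortest name in each group comes first and ends up as that group's stub.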
Your sample data:
dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")
Here are the results:
myfun(dishes, minlen = 0)
# [1] "DAL BHAT" "HAMBURGER" "PIZZA"
myfun(dishes)
# [1] "DAL BHAT" "HAMBURGER" "HAMBURGER2" "PIZZA"
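The difference between the two calls comes down to the length check in the last line of the lapply body. A quick sketch of that arithmetic for the HAMBURGER group (the variable names base and variants are illustrative, not part of the function):

```r
base <- "HAMBURGER"
variants <- c("HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2")

# Extra characters each variant has beyond the base dish
extra <- nchar(variants) - nchar(base)
extra
# [1] 0 4 1

# minlen = 3 keeps variants at most 3 characters longer than the base
keep3 <- variants[nchar(variants) <= nchar(base) + 3]
keep3
# [1] "HAMBURGER" "HAMBURGER2"

# minlen = 0 keeps only the base dish itself
keep0 <- variants[nchar(variants) <= nchar(base) + 0]
keep0
# [1] "HAMBURGER"
```

"HAMBURGER-BIG" is 4 characters longer than the base, so it is dropped under the default minlen of 3.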
Here is some more sample data. Note that in "dishes2" the data are no longer sorted and there is a new item, "DAL"; in "dishes3" you also have a dish in lower case.
dishes2 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL")
dishes3 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL", "pizza!!")
Here is the function applied to those vectors:
myfun(dishes2, 4)
# [1] "DAL" "HAMBURGER" "HAMBURGER2" "PIZZA"
myfun(dishes3)
# [1] "DAL" "HAMBURGER" "HAMBURGER2" "PIZZA" "PIZZA!!"
myfun(dishes3, strict = TRUE)
# [1] "DAL" "DAL BHAT" "HAMBURGER" "HAMBURGER2" "PIZZA" "PIZZA!!"
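Why does strict = TRUE bring "DAL BHAT" back? A short sketch of the prefix comparison, using only the two DAL dishes (the vector x is illustrative):

```r
x <- c("DAL", "DAL BHAT")

# strict = FALSE: thresh drops to nchar("DAL") = 3, so both share one prefix
substr(x, 1, 3)
# [1] "DAL" "DAL"
loose <- x[!duplicated(substr(x, 1, 3))]
loose
# [1] "DAL"

# strict = TRUE: thresh stays at 5, so the prefixes differ and both are stubs
substr(x, 1, 5)
# [1] "DAL" "DAL B"
strict_stubs <- x[!duplicated(substr(x, 1, 5))]
strict_stubs
# [1] "DAL" "DAL BHAT"
```

With the lowered threshold, "DAL BHAT" is treated as a variant of "DAL" and then filtered out because it is more than minlen characters longer; with strict = TRUE it counts as its own base dish.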