Remove similar but longer duplicates from a vector

2024-02-08

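For a database cleanup, I have a vector of, say, dishes, and I want to remove all variants of a "base" dish, keeping only the base dish. For example, if I have...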

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

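...I want to remove every entry for which a shorter matching version already exists in the vector. The resulting vector would therefore contain only "DAL BHAT", "HAMBURGER" and "PIZZA".

Nested for loops that check every entry against all the others would work for this example, but would take a long time on the large dataset at hand, and I would call it ugly coding.

It can be assumed that all entries are upper case and that the vector is already sorted. It cannot be assumed that the first entry of the next base dish is always shorter than the previous entry.

Any suggestions on how to solve this efficiently?

Bonus question: ideally, I would only remove an item from the initial vector if it is at least 3 characters longer than its shorter counterpart. In the case above, this means "HAMBURGER2" would also be kept in the resulting vector.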


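Here's the approach I would take. I would write a function that covers a few of the conditions I need to consider and apply it to the input. I've added comments to explain what is happening inside the function.

The function has 4 arguments:

  • invec: the input character vector.
  • thresh: how many leading characters are used to identify a "base" dish. Default = 5.
  • minlen: your "bonus" question. Entries that are at most this many characters longer than their base dish are kept. Default = 3.
  • strict: logical. If there are base dishes with an nchar shorter than your thresh, do you want to lower the threshold, or be strict about your requirement for a base? Default = FALSE. See the last example for how strict might work.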
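To see what thresh and strict do to the choice of base dishes, the two relevant lines of the function can be run on their own. This is only a sketch on a made-up vector x (it is not one of the vectors from the question), and stubs_loose and stubs_strict are names chosen here for illustration.

```r
# Made-up input that contains a base dish ("DAL") shorter than the default thresh of 5
x <- sort(unique(toupper(c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL",
                           "PIZZA", "PIZZA_BOLOGNESE"))))

# strict = FALSE: the threshold is lowered to the length of the shortest entry
thresh_loose <- min(min(nchar(x)), 5)
stubs_loose <- x[!duplicated(substr(x, 1, thresh_loose))]

# strict = TRUE: the threshold stays at 5
stubs_strict <- x[!duplicated(substr(x, 1, 5))]

print(thresh_loose)
print(stubs_loose)
print(stubs_strict)
```

With strict = FALSE the threshold drops to 3 because of "DAL", so "DAL BHAT" is treated as a variant of "DAL". With strict = TRUE the first 5 characters are compared, so "DAL BHAT" counts as a base dish of its own.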

myfun <- function(invec, thresh = 5, minlen = 3, strict = FALSE) {
  # Bookkeeping -- sort, unique, all upper case
  invec <- sort(unique(toupper(invec)))
  # More bookkeeping -- thresh should not be longer
  # than the shortest dish unless strict = TRUE
  thresh <- if (isTRUE(strict)) thresh else min(min(nchar(invec)), thresh)
  # Use `thresh` to get the `stubs`
  stubs <- invec[!duplicated(substr(invec, 1, thresh))]
  # loop through the stubs and do two things:
  #   - Match the dish with the stub
  #   - Return the base dish and any dishes within the minlen
  unlist(
    lapply(stubs, function(x) {
      temp <- grep(x, invec, value = TRUE, fixed = TRUE)
      temp[temp == x | nchar(temp) <= nchar(x) + minlen]
    }),
    use.names = FALSE)
}
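One thing to be aware of: grep(x, invec, fixed = TRUE) matches x anywhere in a string, not only at the start, so an entry such as "MY PIZZA" would also be picked up under the "PIZZA" stub. If only prefix matches should count as variants, a possible variation is sketched below. The name myfun_prefix is hypothetical, and the only change from the function above is the use of startsWith() (base R 3.3.0 or later) in place of grep().

```r
# Hypothetical variant of myfun(): identical except that variants are matched
# by prefix (startsWith) instead of by substring (grep with fixed = TRUE)
myfun_prefix <- function(invec, thresh = 5, minlen = 3, strict = FALSE) {
  invec <- sort(unique(toupper(invec)))
  thresh <- if (isTRUE(strict)) thresh else min(min(nchar(invec)), thresh)
  stubs <- invec[!duplicated(substr(invec, 1, thresh))]
  unlist(
    lapply(stubs, function(x) {
      # keep only entries that begin with the base dish
      temp <- invec[startsWith(invec, x)]
      temp[temp == x | nchar(temp) <= nchar(x) + minlen]
    }),
    use.names = FALSE)
}

res <- myfun_prefix(c("PIZZA", "MY PIZZA", "PIZZA_BOLOGNESE"))
print(res)
# [1] "MY PIZZA" "PIZZA"
```

Here "MY PIZZA" is returned once, as its own base dish, and is not treated as a variant of "PIZZA".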

Your sample data:

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")    

The results:

myfun(dishes, minlen = 0)
# [1] "DAL BHAT"  "HAMBURGER" "PIZZA" 

myfun(dishes)
# [1] "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA" 

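Here's some more sample data. Note that in "dishes2" the data are no longer sorted and there is a new item, "DAL", and that "dishes3" also contains a dish in lower case.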

dishes2 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL")

dishes3 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL", "pizza!!")

Here's the function applied to those vectors:

myfun(dishes2, 4)
# [1] "DAL"        "HAMBURGER"  "HAMBURGER2" "PIZZA"   

myfun(dishes3)
# [1] "DAL"        "HAMBURGER"  "HAMBURGER2" "PIZZA"      "PIZZA!!"  

myfun(dishes3, strict = TRUE)
# [1] "DAL"        "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA"      "PIZZA!!"  
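
Since the question says the vector is already sorted, a single pass over it may be an alternative to looking up every stub with grep. The sketch below is not the function above but a different technique, and it rests on some assumptions: the function name drop_longer_variants and the argument min_extra are made up for illustration, a "variant" is taken to mean an entry that starts with the current base dish, and the cutoff follows the wording of the bonus question (a variant is removed when it is at least min_extra characters longer than its base). It sorts with method = "radix", which orders character vectors as in the C locale, so that all entries sharing a prefix sit directly after the base dish.

```r
# Hypothetical single-pass alternative: walk the sorted vector once and
# compare each entry with the most recent base dish
drop_longer_variants <- function(x, min_extra = 1) {
  # radix sorting uses C-locale order, which keeps prefix groups contiguous
  x <- sort(unique(x), method = "radix")
  keep <- logical(length(x))
  base <- NULL
  for (i in seq_along(x)) {
    if (is.null(base) || !startsWith(x[i], base)) {
      # not a variant of the current base: this entry starts a new group
      base <- x[i]
      keep[i] <- TRUE
    } else {
      # variant of the current base: keep it only if it is fewer than
      # min_extra characters longer than the base
      keep[i] <- (nchar(x[i]) - nchar(base)) < min_extra
    }
  }
  x[keep]
}

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

res_all <- drop_longer_variants(dishes)
print(res_all)
# [1] "DAL BHAT"  "HAMBURGER" "PIZZA"

res_bonus <- drop_longer_variants(dishes, min_extra = 3)
print(res_bonus)
# [1] "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA"
```

Each entry is compared with one base dish only, so the work grows roughly linearly with the length of the vector once it is sorted. Unlike myfun, this sketch does not use a thresh, so it only groups entries that begin with the full base dish.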