Select rows with common IDs in a grouped data frame

2023-12-09

I'm looking for a simpler solution to the following problem. Here is my setup:

test <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_3",     5,   "g",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_3",     8,   "k",
     "groupC",   "id_4",     9,   "l",
     "groupC",   "id_5",     0,   "o",
     "groupC",   "id_6",    12,   "u"
  )

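I want to keep only the rows whose id_name is common to all groups in group_name, i.e. remove every row whose id is not present in all groups. My actual data is large (200k rows) and has 4-20 groups (I don't know the number of groups in advance, so the solution must work for an arbitrary number of groups). The id_name is not unique within each group. The desired result is: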

test_result <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_4",     9,   "l",
  )

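(rows whose id is missing from at least one group are removed). Ideally, I don't want extra joined columns at the end of my output. I want to "simply" drop the rows whose id is missing from any group while keeping the shape of the data frame.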

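I know I could extract all ids from each group, intersect them all to get the list of unique ids present in every group, and then filter the main data frame down to those ids. But that sounds like a lot of work ;-)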

Any hints would be much appreciated.


In base R, we can split id_name by group_name, find the ids common to all groups, and then subset

subset(test, id_name %in% Reduce(intersect, split(id_name, group_name)))

#   group_name id_name  varA varB 
#   <chr>      <chr>   <dbl> <chr>
# 1 groupA     id_1        1 a    
# 2 groupA     id_2        4 f    
# 3 groupA     id_4        6 x    
# 4 groupA     id_4        6 h    
# 5 groupB     id_1        2 s    
# 6 groupB     id_2       13 y    
# 7 groupB     id_4       14 t    
# 8 groupC     id_1        3 d    
# 9 groupC     id_2        7 j    
#10 groupC     id_4        9 l    
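
To see what the one-liner does, its steps can be unpacked as in the sketch below. It runs on a small made-up data frame rather than the question's data; the names toy, ids_by_group, common_ids and toy_common are introduced here purely for illustration.

```r
# Toy data, assumed for illustration only (not the data from the question)
toy <- data.frame(
  group_name = c("g1", "g1", "g1", "g2", "g2", "g3", "g3", "g3"),
  id_name    = c("a",  "b",  "b",  "a",  "b",  "a",  "b",  "c"),
  stringsAsFactors = FALSE
)

# Step 1: one character vector of ids per group
ids_by_group <- split(toy$id_name, toy$group_name)

# Step 2: fold intersect() over the list, so it works for any number of groups
common_ids <- Reduce(intersect, ids_by_group)

# Step 3: keep rows whose id occurs in every group;
# ids repeated within a group (here "b" in g1) are all kept
toy_common <- toy[toy$id_name %in% common_ids, ]
```

Because intersect() returns unique values, repeated ids within a group do not affect the set of common ids, and the final %in% filter leaves the columns untouched.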

Using a similar concept in the tidyverse, this would be

library(tidyverse)
test %>%
  filter(id_name %in% (test %>%
                         group_split(group_name)  %>%
                         map(~pull(., id_name)) %>%
                         reduce(intersect)))
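
A different technique, not the one used above, is to count the distinct groups each id appears in and keep the ids that reach the total number of groups. The sketch below is only a hedged illustration on a made-up data frame; toy, n_groups and toy_common are hypothetical names, and it has not been benchmarked against the intersect approach on 200k rows.

```r
library(dplyr)

# Toy data, assumed for illustration only (not the data from the question)
toy <- data.frame(
  group_name = c("g1", "g1", "g1", "g2", "g2", "g3", "g3", "g3"),
  id_name    = c("a",  "b",  "b",  "a",  "b",  "a",  "b",  "c"),
  stringsAsFactors = FALSE
)

# Total number of groups, computed from the data rather than hard-coded
n_groups <- n_distinct(toy$group_name)

toy_common <- toy %>%
  group_by(id_name) %>%
  # keep an id only if it appears in every group
  filter(n_distinct(group_name) == n_groups) %>%
  ungroup()
```

This also adds no helper columns to the output, and filter() preserves the original row order.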