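I have a data.frame with several columns and want to filter out low-frequency rows based on a combination of variables. For example, think of a Sex variable with Male/Female and a cholesterol variable with High/Low (named Age in the code below). My data frame looks like this: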
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
   index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low
Now I want to keep only the Sex/Age combinations whose frequency is higher than 3:
table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4
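In other words, I want to keep the indices for Female-High, Male-Low and Male-High.
Note that 1) my real data frame has several variables (unlike the example above), 2) I do not want to use any third-party R package, and 3) I want it to be fast.
Here is a simple approach in base R: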
lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low
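To see what each of the three lines contributes, here is a minimal sketch that runs the same steps on a small hand-written data frame and keeps the intermediate results. The data and the names `keep_levels` and `kept` are made up for illustration and are not part of the question.

```r
# Small illustrative data frame (not the question's data)
df <- data.frame(
  index = 1:6,
  Sex   = c("Male", "Male", "Male", "Female", "Female", "Male"),
  Age   = c("High", "High", "High", "Low", "High", "High")
)

# One factor whose levels are the Sex/Age combinations, e.g. "Male.High"
lvls <- interaction(df$Sex, df$Age)

# Frequency of each combination; combinations that never occur get a count of 0
counts <- table(lvls)
print(counts)

# Names of the combinations that occur more than 3 times
keep_levels <- names(counts)[counts > 3]

# Keep only the rows whose combination is in that set
kept <- df[lvls %in% keep_levels, ]
print(kept)
```

Here only the Male/High combination occurs more than 3 times, so `kept` holds the rows with index 1, 2, 3 and 6.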
If you have many variables, you can store their names in a vector:
vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
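If you need this in several places, the same idea could be wrapped in a small helper. This is only a sketch: the function name `keep_frequent_combos`, its arguments and the `Smoker` column are hypothetical. It uses `data[vars]` instead of `data[, vars]` so that a single variable name still yields a list of factors, and `drop = TRUE` so that combinations that never occur are not counted.

```r
# Hypothetical helper: keep rows whose combination of `vars`
# occurs more than `min_count` times in `data`
keep_frequent_combos <- function(data, vars, min_count = 3) {
  # data[vars] is a list of columns, which interaction() accepts directly
  lvls <- interaction(data[vars], drop = TRUE)
  counts <- table(lvls)
  data[lvls %in% names(counts)[counts > min_count], , drop = FALSE]
}

# Illustrative data with a third grouping variable
df <- data.frame(
  index  = 1:8,
  Sex    = c("Male", "Male", "Male", "Male", "Female", "Female", "Male", "Female"),
  Age    = c("High", "High", "High", "High", "Low", "Low", "Low", "High"),
  Smoker = c("No", "No", "No", "Yes", "No", "Yes", "No", "No")
)

# Two variables: Male/High occurs 4 times, so rows 1 to 4 are kept
res2 <- keep_frequent_combos(df, c("Sex", "Age"))
print(res2)

# Three variables: no combination occurs more than 3 times, so nothing is kept
res3 <- keep_frequent_combos(df, c("Sex", "Age", "Smoker"))
print(nrow(res3))
```

Adding a variable can only split groups further, so the result with more variables is never larger than with fewer.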
Here is a second base R approach, using ave:
subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
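The `ave` call works because it applies `FUN` within each Sex/Age group and returns a vector aligned with the rows, so every row ends up carrying the size of its own group. The sketch below makes that visible on illustrative data and cross-checks the result against the `interaction`/`table` approach. It passes `seq_len(nrow(df))` as the first argument, which is an assumption of convenience: any vector of the right length would do, since only its length per group is used.

```r
# Small illustrative data frame (not the question's data)
df <- data.frame(
  index = 1:6,
  Sex   = c("Male", "Male", "Male", "Female", "Female", "Male"),
  Age   = c("High", "High", "High", "Low", "High", "High")
)

# Size of the Sex/Age group each row belongs to, one value per row
group_size <- ave(seq_len(nrow(df)), df$Sex, df$Age, FUN = length)
print(group_size)

# Keep rows whose group has more than 3 members
kept_ave <- df[group_size > 3, ]

# Same filter via interaction() and table(), for comparison
lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
kept_tab <- df[lvls %in% names(counts)[counts > 3], ]

# Both approaches should select the same rows
print(identical(kept_ave$index, kept_tab$index))
```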