我有按主题(“id”)分组的时间序列数据,这些数据保留在某个“站点”上,并且在每个“时间”步骤中都有某个“阶段”。
有时受试者从一个站点切换到另一个站点,并且可能会再次返回。如果个人更换站点来回(例如,从站点“a”到站点“b”,然后返回站点“a”)and如果只有中间站点一次注册(在 a-b-a 转换中,站点“b”将被视为“中间站点”)and该个人处于一定阶段(此处,阶段 = 2)在中间站点,那么我希望remove此时的注册步骤。
我的虚拟数据由四个主题组成。其中三人(对象 1-3)从地点 a 移至地点 b,然后又回到地点 b,其中一名从地点移至地点 b。
前两个科目都在中间站点上进行了单一注册。主题 1 处于中间站点的第 1 阶段,我希望保留该注册。另一方面,主题 2 处于中间站点的第 2 阶段,应删除此注册。对象3也在a和b之间来回移动。然而,虽然处于中间站点b的第2阶段,但它已经two那里的注册并且两个注册都被保留。对象 4 已从地点 a 移至地点 b,但没有再回来。因此,虽然b站点处于第2阶段,但b站点的注册不是“中间站点”,应该保留。
数据:
df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
site = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2)),
.Names = c("id", "time", "site", "stage"),
row.names = c(NA, -12L), class = "data.frame")
df
# id time site stage
# 1 1 1 a 1
# 2 1 2 b 1 <~~ A single middle registration on site 2
# 3 1 3 a 1 However, the individual is in stage 1: -> keep
# 4 2 1 a 1
# 5 2 2 b 2 <~~ A single middle registration on site 2 with stage 2: -> remove
# 6 2 3 a 1
# 7 3 1 a 1
# 8 3 2 b 2 <~~ Two middle registrations with stage 2: -> keep both rows
# 9 3 3 b 2 <~~
# 10 3 4 a 1
# 11 4 1 a 1
# 12 4 2 b 2 <~~ A single registration on site 2 with stage 2,
# but it is not in between two sites: -> keep
因此,在测试数据中,只有 id = 2 在 time = 2 时的注册应该被删除。
之前我用过plyr::ddply
并得出结果rle
解决问题:
对于每个人,计算站点的游程长度(rle(x$site)
)
If:
- 在站点之间来回(例如从 a 到 b,然后返回 a)
(length(r$values) > 2
) &
- 中间站点只有一次注册(r$lengths[2] == 1
) &
- 中间站点的舞台为 2 (x$stage[x$site == r$values[2]][1] == 2
)
Then:删除中间站点的注册x[!(x$site == r$values[2]), ]
)
library(plyr)
ddply(df, .(id), function(x){
r <- rle(x$site)
if(length(r$values) > 2 & r$lengths[2] == 1 & x$stage[x$site == r$values[2]][1] == 2){
x[x$site != r$values[2], ]
} else x
})
# id time site stage
# 1 1 1 a 1
# 2 1 2 b 1
# 3 1 3 a 1
# 4 2 1 a 1 <~~ the single middle site with stage = 2 at time 2 is removed
# 5 2 3 a 1 <~~
# 6 3 1 a 1
# 7 3 2 b 2
# 8 3 3 b 2
# 9 3 4 a 1
# 10 4 1 a 1
# 11 4 2 b 2
detach("package:plyr")
现在我在正确处理这个问题时遇到了一些麻烦dplyr
。我在 SO 上找到了一些相关帖子(例如this https://stackoverflow.com/questions/22182442/dplyr-how-to-apply-do-on-result-of-group-by/22182914#22182914 and this https://stackoverflow.com/questions/24376765/dplyrdo-requires-named-function),以及在 github 上(this https://github.com/hadley/dplyr/issues/574 and this https://github.com/hadley/dplyr/issues/418),但我很难使它们适应我的需要。以下是一些绝望的尝试:
library(dplyr)
df %>%
group_by(id) %>%
do((function(x){
r = rle(x$site)
if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
filter(x, x$site != r$values[2])
} else x
})(.))
# desired row is not removed
df %>%
group_by(id) %>%
do(function(x){
r = rle(x$site)
if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
x[!(x$site == r$values[2]), ]
} else x
})
# Error: Results are not data frames at positions: 1, 2, 3
此尝试恰好有效(结果与ddply
上面),但远非优雅,我怀疑这是“正确的方式”:
df %>%
group_by(id) %>%
do(r = rle(.$site)) %>%
do(data.frame(id = .$id,
len = length(.$r$values),
site = .$r$values[2],
len2 = .$r$lengths[2])) %>%
filter(len == 3, len2 == 1) %>%
select(-len) %>%
left_join(df, ., by = c("id", "site")) %>%
filter(!(len2 %in% 1 & stage == 2)) %>%
select(-len2)
How to do
这对吗?万维网?