如同这个问题 https://stackoverflow.com/questions/46151452/find-overlapping-dates-for-each-id-and-create-a-new-row-for-the-overlap/46173303?noredirect=1#comment86774699_46173303,我想使用以下方法找到时间戳对之间重叠的持续时间data.table
.
这是我当前的代码:
library(data.table)
DT <- fread(
"stage,ID,date1,date2
1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
sep = ","
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
breaks <- DT[, {
tmp <- unique(sort(c(date1, date2)))
.(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),
by = .EACHI, allow.cartesian = T] %>%
mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))
返回:
stage date1 date2 V1 lengthinseconds
1 1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A 900
2 1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C 300
3 1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C 2400
4 2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B 1200
5 2 2018-04-17 00:50:00 2018-04-17 01:10:00 B 1200
但我只想返回用户二元组之间的重叠(即不超过两个重叠用户)。我可以想到几种巧妙的方法来实现这一目标,例如:
library(dplyr)
library(tidyr)
result %>%
filter(nchar(V1)==3) %>%
tidyr::separate(V1, c("ID1", "ID2"))
返回:
stage date1 date2 ID1 ID2 lengthinseconds
1 1 2018-04-17 00:00:00 2018-04-17 00:15:00 B A 900
2 1 2018-04-17 00:20:00 2018-04-17 01:00:00 A C 2400
3 2 2018-04-17 00:30:00 2018-04-17 00:50:00 D B 1200
但这似乎不优雅,尤其是在处理较长的数据时ID
字符串和可能数百个ID
每个重叠 s。
理想情况下,我想知道是否有办法修改原始内容data.table
代码直接返回它。