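I want to compute a set of descriptive statistics grouped by several hundred grouping variables. I know from "How to group data.table by multiple columns?" https://stackoverflow.com/questions/12478943/how-to-group-data-table-by-multiple-columns that if I want statistics for each combination of grouping variables, I can pass list() to the by argument. In my case, I want the mean of X for each level of Y, and then separately the mean of X for each level of Z, not the mean for each Y-Z combination.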
# example data
set.seed(007)
DF <- data.frame(X=1:50000, Y=sample(c(0,1), 50000, TRUE), Z=sample(0:5, 50000, TRUE))
library(data.table)
DT <- data.table(DF)
# I tried this - but this gives the mean for each combination of Y and Z
DT[, mean(X), by=list(Y, Z)]
# so does this
DT[, mean(X), by=c("Y", "Z")]
# This works but....
out <- lapply( c( "Y","Z") , FUN= function(K){ DT[, mean(X), by=get(K)]})
out <- do.call( rbind, out )
#...but it is really slow.
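I have 100 million records and more than 400 grouping variables, so I need something reasonably efficient. The lapply option would add several days of extra processing time.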
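For reference, the per-variable result described above (one mean per level of Y, then one per level of Z) can be sketched as follows. This is only an illustration under assumptions: the names group_vars, res, variable, level and mean_X are made up here, the column name is passed to by as a character string instead of through get(), and the pieces are stacked with rbindlist(). Whether this is fast enough for 100 million rows and 400 variables is not something the sketch establishes.

```r
library(data.table)

# Same example data as in the question
set.seed(7)
DT <- data.table(X = 1:50000,
                 Y = sample(c(0, 1), 50000, TRUE),
                 Z = sample(0:5, 50000, TRUE))

# Hypothetical vector of grouping column names
group_vars <- c("Y", "Z")

res <- rbindlist(lapply(group_vars, function(K) {
  # by accepts a character vector of column names, so get() is not needed
  out <- DT[, list(mean_X = mean(X)), by = K]
  # Rename the grouping column so every piece has the same layout
  setnames(out, K, "level")
  # Store levels as character so columns of different types stack cleanly
  out[, level := as.character(level)]
  # Record which grouping variable the rows came from
  out[, variable := K]
  out
}))

print(res)
```

Each row of res holds one level of one grouping variable, so the output has one row per level (2 for Y plus 6 for Z here) and never one row per Y-Z combination.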
options( digits=15 )
start.time <- Sys.time()
out <- lapply( c( "Y","Z") , FUN= function(K){ DT[, mean(X), by=get(K)]})
end.time <- Sys.time()
time.taken <- end.time - start.time
start.time <- Sys.time()
DT[, mean(X), by=c("Y")]
DT[, mean(X), by=c("Z")]
end.time <- Sys.time()
time.taken2 <- end.time - start.time
time.taken - time.taken2
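A single Sys.time() difference on a table this small can be dominated by noise, so a slightly more robust way to compare the two variants is to repeat each one and read the elapsed time from system.time(). The sketch below is an assumption-laden illustration: n_rep, group_vars and the elapsed vector are hypothetical names, and the repetition count is arbitrary.

```r
library(data.table)

# Same example data as in the question
set.seed(7)
DT <- data.table(X = 1:50000,
                 Y = sample(c(0, 1), 50000, TRUE),
                 Z = sample(0:5, 50000, TRUE))

group_vars <- c("Y", "Z")  # hypothetical vector of grouping column names
n_rep <- 20                # arbitrary number of repetitions

# Variant from the question: grouping column looked up with get()
t_get <- system.time(
  for (i in seq_len(n_rep)) {
    lapply(group_vars, function(K) DT[, mean(X), by = get(K)])
  }
)

# Same loop, but passing the column name to by as a character string
t_name <- system.time(
  for (i in seq_len(n_rep)) {
    lapply(group_vars, function(K) DT[, mean(X), by = K])
  }
)

# Elapsed wall-clock seconds for each variant
elapsed <- c(by_get = t_get[["elapsed"]], by_name = t_name[["elapsed"]])
print(elapsed)
```

The numbers depend on the machine and the data.table version, so no expected output is shown.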