我正在尝试定义一个函数,该函数将数据框或表作为具有特定数量的 ID 列(例如 2 或 3 个 ID 列)的输入,其余列是 NAME1、NAME2、...、NAMEK(数字列) )。输出应该是一个数据表,其中包含与之前相同的 ID 列以及一个附加 ID 列,该 ID 列对列名称的每个唯一的成对组合进行分组(NAME1、NAME2、...)。另外,我们必须根据ID列将数字列的实际值收集到两个新列中;具有两个 ID 列和三个数字列的示例:
ID1 <- c("A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3)
NAME1 <- c(10,11,9,22,25,22)
NAME2 <- c(7,9,8,20,22,21)
NAME3 <- c(10,12,11,15,19,30)
DT <- data.table(ID1,ID2,NAME1,NAME2,NAME3)
我希望以 DT 作为输入的函数的输出为
ID.new <- c("NAME1 - NAME2","NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME2",
"NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME3", "NAME1 - NAME3",
"NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3",
"NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3",
"NAME2 - NAME3", "NAME2 - NAME3")
ID1 <- c("A","A","A","B","B","B","A","A","A","B","B","B","A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
value.left <- c(10,11,9,22,25,22,10,11,9,22,25,22,7,9,8,20,22,21)
value.right <- c(7,9,8,20,22,21,10,12,11,15,19,30,10,12,11,15,19,30)
DT.output <- data.table(ID.new,ID1,ID2,value.left,value.right)
我发现 fun() (见下文)可以完成这项工作,但对于我来说太慢了:
fun <- function(data, ID.cols){
data <- data.table(data)
# Which of the columns are ID columns
ids <- which(colnames(data) %in% ID.cols)
# Obtain all pairwise combinations of numeric columns into a list
numeric.combs <- combn(x = data.table(data)[,!ids, with = FALSE], m = 2, simplify = FALSE)
id.cols <- data[,ids, with = FALSE]
# bind the ID columns to each pairwise combination of numeric columns inside the list
bind.columns.each.numeric.comb <- lapply(X = numeric.combs, FUN = function(x) cbind(id.cols,x))
# Create generic names for the numeric columns so that rbindlist() may be applied. In addition we make a new column that groups based on which columns we are considering
generalize <- suppressWarnings(lapply(X = bind.columns.each.numeric.comb, FUN = function(x)
setattr(x = x[,ID.NEW:=paste(colnames(x[,!ids,with=FALSE]),collapse=" - ")], name =
'names', value = c(ID.cols,"value.left","value.right","ID.NEW"))))
return(rbindlist(l=generalize))
}
# Performance
print(microbenchmark(fun(DT,ID.cols=c("ID1","ID2")),times=1000))
有没有更快、更优雅的方法来做到这一点?