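I want to select columns from a data frame so that the resulting contiguous column sequences are as long as possible while the number of rows containing NAs is as small as possible, because those rows have to be dropped afterwards.
(The reason is that I want to run TraMineR::seqsubm() to automatically get a matrix of transition costs (based on transition probabilities) and later run cluster::agnes() on it. TraMineR::seqsubm() doesn't like NA states, and cluster::agnes() with NA states in the matrix doesn't necessarily make sense.)
To this end I have already written a working function https://codereview.stackexchange.com/q/208826/185901 that in principle computes all possible subsets (the power set) and checks them for NAs. It works well with this toy data d, which represents an 11x5 matrix (plus an id column):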
> d
id X1 X2 X3 X4 X5
1 A 1 11 21 31 41
2 B 2 12 22 32 42
3 C 3 13 23 33 NA
4 D 4 14 24 34 NA
5 E 5 15 25 NA NA
6 F 6 16 26 NA NA
7 G 7 17 NA NA NA
8 H 8 18 NA NA NA
9 I 9 NA NA NA NA
10 J 10 NA NA NA NA
11 K NA NA NA NA NA
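To make the NA check concrete: for a single candidate set of columns, the number of usable rows is the count of rows with no NA in those columns. Below is a minimal sketch using base R's complete.cases(). The toy data are rebuilt inline so the snippet runs on its own, and the names cols and n_obs are chosen only for illustration.

```r
# Rebuild the toy data: column Xk has 12 - 2k observed values, then NAs.
d <- data.frame(id = LETTERS[1:11])
for (k in 1:5) {
  n_ok <- 12 - 2 * k                       # 10, 8, 6, 4, 2 observed values
  d[[paste0("X", k)]] <- c(seq_len(n_ok) + 10L * (k - 1L), rep(NA, 11 - n_ok))
}

cols <- c("X1", "X2", "X3")                # one candidate column subset
# Count the rows that have no NA in any of the chosen columns.
n_obs <- sum(complete.cases(d[, cols, drop = FALSE]))
n_obs                                      # 6, i.e. rows A to F
```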
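The problem now is that I actually want to apply the algorithm to data that represent a 34235 x 17 matrix!
My code has been reviewed on Code Review, but it still can't be applied to the real data.
I'm aware that this approach involves a huge amount of computation. (Probably too huge for non-supercomputers?!)
Does anyone know a more suitable approach?
Here is the enhanced function by @minem https://codereview.stackexchange.com/a/208928/185901 from Code Review: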
seqRank2 <- function(d, id = "id") {
  require(matrixStats)

  # change structure, convert to matrix
  ii <- as.character(d[, id])
  dm <- d
  dm[[id]] <- NULL
  dm <- as.matrix(dm)
  rownames(dm) <- ii

  your.powerset = function(s){
    l = vector(mode = "list", length = 2^length(s))
    l[[1]] = numeric()
    counter = 1L
    for (x in 1L:length(s)) {
      for (subset in 1L:counter) {
        counter = counter + 1L
        l[[counter]] = c(l[[subset]], s[x])
      }
    }
    return(l[-1])
  }

  psr <- your.powerset(ii)
  psc <- your.powerset(colnames(dm))

  sss <- lapply(psr, function(x) {
    i <- ii %in% x
    lapply(psc, function(y) dm[i, y, drop = F])
  })

  cn <- sapply(sss, function(x)
    lapply(x, function(y) {
      if (ncol(y) == 1) {
        if (any(is.na(y))) return(NULL)
        return(y)
      }
      isna2 <- matrixStats::colAnyNAs(y)
      if (all(isna2)) return(NULL)
      if (sum(isna2) == 0) return(NA)
      r <- y[, !isna2, drop = F]
      return(r)
    }))

  scr <- sapply(cn, nrow)
  scc <- sapply(cn, ncol)
  namesCN <- sapply(cn, function(x) paste0(colnames(x), collapse = ", "))
  names(scr) <- namesCN
  scr <- unlist(scr)
  names(scc) <- namesCN
  scc <- unlist(scc)

  m <- t(rbind(n.obs = scr, sq.len = scc))
  ag <- aggregate(m, by = list(sequence = rownames(m)), max)
  ag <- ag[order(-ag$sq.len, -ag$n.obs), ]
  rownames(ag) <- NULL
  return(ag)
}
Yields:
> seqRank2(d)
sequence n.obs sq.len
1 X1, X2, X3, X4 4 4
2 X1, X2, X3 6 3
3 X1, X2, X4 4 3
4 X1, X3, X4 4 3
5 X2, X3, X4 4 3
6 X1, X2 8 2
7 X1, X3 6 2
8 X2, X3 6 2
9 X1, X4 4 2
10 X2, X4 4 2
11 X3, X4 4 2
12 X1 10 1
13 X2 8 1
14 X3 6 1
15 X4 4 1
16 X5 2 1
> system.time(x <- seqRank2(d))
user system elapsed
1.93 0.14 2.93
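In this case I would choose X1, X2, X3, X4; X1, X2, X3; or X2, X3, X4, because they are contiguous and yield a decent number of observations.
Expected output:
So for the toy data d the expected output would be something like: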
> seqRank2(d)
sequence n.obs sq.len
1 X1, X2, X3, X4 4 4
2 X1, X2, X3 6 3
3 X2, X3, X4 4 3
4 X1, X2 8 2
5 X2, X3 6 2
6 X3, X4 4 2
7 X1 10 1
8 X2 8 1
9 X3 6 1
10 X4 4 1
11 X5 2 1
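Since only contiguous runs of columns are of interest, one possible way to get a table of this shape is to skip the power set entirely and walk over column windows, counting complete rows for each. The following is only a sketch under that assumption, not the reviewed function: seq_rank_contig is a hypothetical name, and it also lists windows that reach into X5 (each with 2 observations), which the table above leaves out.

```r
# Hypothetical helper: rank contiguous column windows by length and complete rows.
seq_rank_contig <- function(d, id = "id") {
  # TRUE where a value is observed; the id column is dropped first
  ok <- !is.na(as.matrix(d[, setdiff(names(d), id), drop = FALSE]))
  p <- ncol(ok)
  res <- vector("list", p)
  for (start in seq_len(p)) {
    keep <- rep(TRUE, nrow(ok))            # rows complete in the window so far
    out <- vector("list", p - start + 1L)
    for (end in start:p) {
      keep <- keep & ok[, end]             # extend the window by one column
      out[[end - start + 1L]] <- data.frame(
        sequence = paste(colnames(ok)[start:end], collapse = ", "),
        n.obs = sum(keep),
        sq.len = end - start + 1L
      )
    }
    res[[start]] <- do.call(rbind, out)
  }
  res <- do.call(rbind, res)
  res <- res[res$n.obs > 0, ]              # drop windows with no complete row
  res <- res[order(-res$sq.len, -res$n.obs), ]
  rownames(res) <- NULL
  res
}

# Toy data rebuilt inline so the sketch runs on its own.
d <- data.frame(id = LETTERS[1:11])
for (k in 1:5) {
  n_ok <- 12 - 2 * k
  d[[paste0("X", k)]] <- c(seq_len(n_ok) + 10L * (k - 1L), rep(NA, 11 - n_ok))
}
res <- seq_rank_contig(d)
res
```

With p columns there are only p(p+1)/2 contiguous windows (153 for 17 columns), and each one costs a single pass over the rows, so the work grows linearly with the number of rows instead of exponentially.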
Finally, the function should run properly on the huge matrix d.huge, which currently leads to an error:
> seqRank2(d.huge)
Error in vector(mode = "list", length = 2^length(s)) :
vector size cannot be infinite
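The message appears to come from the first call, your.powerset(ii), which asks vector() for a list with 2^length(s) elements, one per subset of row ids. A short sketch of the arithmetic, using the dimensions of d.huge (the variable names are illustrative only):

```r
n_rows <- 3e4                    # rows in d.huge, i.e. length(s) for the row power set
n_subsets <- 2^n_rows            # the length requested from vector()
is.infinite(n_subsets)           # TRUE: beyond the largest double (about 1.8e308)

n_cols <- 50
n_col_subsets <- 2^n_cols        # about 1.1e15 column subsets, still far too many to store
n_windows <- n_cols * (n_cols + 1) / 2
n_windows                        # 1275 contiguous column windows
```

So even the column power set alone is out of reach for 50 columns, whereas the number of contiguous windows stays small.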
Toy data d:
d <- structure(list(id = structure(1:11, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), X1 = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, NA), X2 = c(11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, NA, NA, NA), X3 = c(21L, 22L, 23L, 24L,
25L, 26L, NA, NA, NA, NA, NA), X4 = c(31L, 32L, 33L, 34L, NA,
NA, NA, NA, NA, NA, NA), X5 = c(41L, 42L, NA, NA, NA, NA, NA,
NA, NA, NA, NA)), row.names = c(NA, -11L), class = "data.frame")
Toy data d.huge:
d.huge <- setNames(data.frame(matrix(1:15.3e5, 3e4, 51)),
                   c("id", paste0("X", 1:50)))
d.huge[, 41:51] <- lapply(d.huge[, 41:51], function(x){
  x[which(x %in% sample(x, .05*length(x)))] <- NA
  x
})
Appendix (see the comments on the latest answer):
d.huge <- read.csv("d.huge.csv")
d.huge.1 <- d.huge[sample(nrow(d.huge), 3/4*nrow(d.huge)), ]
d1 <- seqRank3(d.huge.1, 1.27e-1, 1.780e1)
d2 <- d1[complete.cases(d1), ]
dim(d2)
names(d2)