Applying a nontrivial function to ordered subsets of a data.table

2024-05-09

Problem

I'm trying to use my newfound data.table powers (for good) to compute the frequency content of a bunch of data that looks like this:

|  Sample|  Channel|  Trial|     Voltage|Class  |  Subject|
|-------:|--------:|------:|-----------:|:------|--------:|
|       1|        1|      1|  -196.82253|1      |        1|
|       1|        2|      1|   488.15166|1      |        1|
|       1|        3|      1|  -311.92386|1      |        1|
|       1|        4|      1|  -297.06078|1      |        1|
|       1|        5|      1|  -244.95824|1      |        1|
|       1|        6|      1|  -265.96525|1      |        1|
|       1|        7|      1|  -258.93263|1      |        1|
|       1|        8|      1|  -224.07819|1      |        1|
|       1|        9|      1|   -87.06051|1      |        1|
|       1|       10|      1|  -183.72961|1      |        1|

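There are about 57 million rows, and every variable except Voltage is an integer. Sample is an index that runs from 1 to 350, and Channel is an index that runs from 1 to 118. There are 280 trials.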

Sample Data

I believe Martin's example data is valid (the number of categorical variables is not an issue for the errors):

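library(data.table)
n <- 10e5  # rep_len() recycles each column explicitly to n rows
big.table <- data.table(Sample = rep_len(1:350, n), Channel = rep_len(1:118, n), Trial = rep_len(letters, n),
             Voltage = rnorm(n, -150, 100), Class = rep_len(LETTERS, n), Subject = rep_len(1:20, n))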

Process

The first thing I do is set the key to Sample, since I want whatever I do to an individual data series to happen in a sensible order:

setkey(big.table,Sample)
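
Keying on Sample alone sorts the whole table by Sample. A hedged alternative, if the aim is for each series to sit in one contiguous, sample-ordered block, is to key on the grouping columns first and on Sample last. The tiny table `dt` below is made up for illustration and is not part of the original data:

```r
library(data.table)

# Made-up rows, deliberately out of order.
dt <- data.table(Channel = c(2L, 1L, 2L, 1L),
                 Sample  = c(2L, 2L, 1L, 1L),
                 Voltage = c(4, 3, 2, 1))

# Sort by Channel first, then by Sample within each Channel.
setkey(dt, Channel, Sample)

# The voltages are now grouped by channel and ordered by sample.
dt$Voltage
```

On the real table the corresponding key would presumably be Subject, Trial, Channel, Sample.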

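Then I apply a high-pass filter to the voltage signal to remove low frequencies. (The filtering function returns a vector of the same length as its second argument.)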

require(signal)
high.pass <- cheby1(cheb1ord(Wp = 0.14, Ws = 0.0156, Rp = 0.5, Rs = 10))
big.table[,Voltage:=filtfilt(high.pass,Voltage),by=Subject]

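Initial Error

I want to check that I am processing the data properly (that is, subject by subject, trial by trial, channel by channel, in sample order), so I add columns holding the spectral content of the Voltage column: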

get.spectrum <- function(x) {
    spec.obj <- spectrum(x,method="ar",plot=FALSE)
    outlist <- list()
    outlist$spec <- 20*log10(spec.obj$spec)
    outlist$freq <- spec.obj$freq
    return(outlist)
  }
big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),by=Subject]

Error: cannot allocate vector of size 6.1 Gb

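I think the problem is that get.spectrum() is trying to eat the entire column at once, given that the whole table is only about 1.7 GB. Is that the case? What are my options?

What have you tried?

Increasing grouping granularity

If I call get.spectrum with every column I want to group by listed in by, I get a more promising error: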

big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),
        by=c("Subject","Trial","Channel","Sample")]

Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action,  : 
  'order.max' must be >= 1

This implies that the spectrum() function I'm calling is receiving data of the wrong shape.
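
One way to see why spectrum() complains here is to count the rows in each group. The sketch below uses a small made-up table (`toy` and its dimensions are illustrative assumptions, not the real data) with the same column layout. With Sample included in by, every group holds a single voltage value, which is too short for an autoregressive fit:

```r
library(data.table)

# Toy table: 2 subjects x 2 trials x 3 channels x 350 samples (sizes are made up).
toy <- CJ(Subject = 1:2, Trial = 1:2, Channel = 1:3, Sample = 1:350)
toy[, Voltage := rnorm(.N, -150, 100)]

# Grouping by Sample as well: every group holds exactly one row.
n.with.sample <- toy[, .N, by = .(Subject, Trial, Channel, Sample)]
range(n.with.sample$N)

# Dropping Sample from `by`: every group is one full 350-sample series.
n.without.sample <- toy[, .N, by = .(Subject, Trial, Channel)]
range(n.without.sample$N)
```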

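Reducing the number of points, trying a different "where" condition

Following Roland's suggestion, I reduced the number of points to about 20 million and tried the following: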

big.table[,"Spectrum":=get.spectrum(Voltage),
        by=c("Subject","Trial","Channel")]

Error in `[.data.table`(big.table, , `:=`("Spectrum", get.spectrum(Voltage)),  :
  All items in j=list(...) should be atomic vectors or lists. If you are trying something like
  j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge 
  afterwards.

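My thinking is that I should not group by Sample, since I want to apply this function to each set of 350 samples defined by the by vector above.

Improving on this with a few things gleaned from section 2.16 of the data.table FAQ, I added the equivalent of ORDER BY. I know that the Sample column needs to run from 1 to 350 for each input to the spectrum() function: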

> big.table[Sample==c(1:350),c("Spectrum","Frequency"):=as.list(get.spectrum(Voltage)),
+             by=c("Subject","Trial","Channel")]
Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action,  : 
  'order.max' must be >= 1

Again I run into trouble with non-unique inputs.
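
One possible contributor to this error is that `Sample == c(1:350)` does not select rows whose Sample lies in 1 to 350. The `==` operator compares position by position and recycles the shorter vector, so on a table sorted by Sample only a few scattered rows match and some groups can end up with too few values. A minimal sketch with a made-up vector (`sample.idx` is hypothetical):

```r
# Made-up index, sorted the way a keyed Sample column would be.
sample.idx <- c(1, 1, 2, 2, 3, 3)

# `==` recycles 1:3 to c(1, 2, 3, 1, 2, 3) and compares element by element.
eq.result <- sample.idx == 1:3
eq.result

# `%in%` tests membership, which is what a range filter needs.
in.result <- sample.idx %in% 1:3
in.result
```

Since every Sample value already lies in 1 to 350, the filter in i is arguably unnecessary; ordering can be handled by the key instead.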

Maybe this can be a start toward solving the problem:

I believe the error data.table gives arises because get.spectrum returns a list with two elements, spec and freq.

Using this example dataset:
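n <- 10e5  # rep_len() recycles each column explicitly to n rows
big.table <- data.table(Sample = rep_len(1:350, n), Channel = rep_len(1:118, n), Trial = rep_len(letters, n),
                 Voltage = rnorm(n, -150, 100), Class = rep_len(LETTERS, n), Subject = rep_len(1:20, n))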

str(big.table)
setkey(big.table,Sample)

get.spectrum <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  outlist <- list()
  outlist$spec <- 20*log10(spec.obj$spec)
  outlist$freq <- spec.obj$freq
  return(outlist)
}

VT <- get.spectrum(big.table$Voltage)
str(VT)

# Then decide which value you would like to insert into big.table
get.spectrum(big.table$Voltage)$spec
# or
get.spectrum(big.table$Voltage)$freq

This should work. You could also use set().

big.table[, Spectrum:= get.spectrum(Voltage)$spec, by=Subject]
big.table[, Frequency:= get.spectrum(Voltage)$freq, by=Subject]

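EDIT: As mentioned in the comments, I tried to provide an answer using set(), but I don't know how to "group" by Subject. This is what I tried; I'm not sure it is the expected answer.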

cols = c("spec", "freq")
for(inx in cols){
  set(big.table, i=NULL, j=inx, value = get.spectrum(big.table[["Voltage"]])[[inx]])
}

EDIT2: Two functions, one for each column, using a different combination of grouping variables.

spec_fun <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  spec <- 20*log10(spec.obj$spec)
  spec
}

freq_fun <- function(x) {
  freq <- spectrum(x,method="ar",plot=FALSE)$freq
  freq
}

big.table[, Spectrum:= spec_fun(Voltage), by=c("Subject","Trial","Channel")]
big.table[, Frequency:= freq_fun(Voltage), by=c("Subject","Trial","Channel")]

# It gives some warnings(), probably because of the made-up data.
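
The warnings may also come from a length mismatch. With method = "ar", spectrum() evaluates the spectrum at 500 frequencies by default, whatever the input length, while `:=` by group expects either a single value or one value per row of the group (recent data.table versions raise an error otherwise). A hedged sketch on a made-up toy table: `get.spectrum.n` is a hypothetical variant of get.spectrum, not part of the original answer, that requests as many frequencies as there are samples through the n.freq argument of spec.ar() and fills both columns in one pass:

```r
library(data.table)

set.seed(1)
# Toy table: 2 subjects x 2 trials x 3 channels x 350 samples (sizes are made up).
toy <- CJ(Subject = 1:2, Trial = 1:2, Channel = 1:3, Sample = 1:350)
toy[, Voltage := rnorm(.N, -150, 100)]

# Hypothetical variant: one spectrum value and one frequency per input sample.
get.spectrum.n <- function(x) {
  spec.obj <- spectrum(x, method = "ar", plot = FALSE, n.freq = length(x))
  # as.numeric() drops the matrix shape that spec.ar() may return.
  list(spec = 20 * log10(as.numeric(spec.obj$spec)), freq = spec.obj$freq)
}

# Each list element has the group's length, so both columns fill without recycling.
toy[, c("Spectrum", "Frequency") := get.spectrum.n(Voltage),
    by = .(Subject, Trial, Channel)]
```

Note that the resulting rows pair each sample index with a frequency bin only by position; the Sample and Frequency columns are otherwise unrelated.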

