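I do data wrangling (ETL) to load new data into a database, and I want to let users implement their own data validation functions on top of my existing data.table object, which contains the prepared data.
How can I prevent a user from changing (modifying) my data.table inside a validation function, intentionally or by accident, without making a copy? Copying would slow down the validation and therefore the whole ETL process.
The data only needs to be locked/write-protected temporarily, while the validation functions run.
Simplified example: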
library(data.table)
DT <- as.data.table(mtcars)
DT[, row.number := 1:(.N)] # add a row number column to allow identification of invalid rows
DT[c(3,6,7), cyl := 100] # create some data errors (100 cylinders in the car)
# correctly implemented validate function (does not change the data)
validate <- function(data) {
  data[cyl > 10, .(row.number = row.number, col.name = "cyl", col.value = cyl,
                   severity = "ERROR", msg = "More than 10 cylinders")]
}
validate(DT)
The output is fine, and the data.table passed to the validate() function is not changed:
   row.number col.name col.value severity                    msg
1:          3      cyl       100    ERROR More than 10 cylinders
2:          6      cyl       100    ERROR More than 10 cylinders
3:          7      cyl       100    ERROR More than 10 cylinders
What I want to avoid is a user implementing validate() like this, which modifies the original data:
validate.with.side.effects <- function(data) {
  data[, max.cyl := 10] # this adds a new column into the original data.table!
  data[cyl > max.cyl, .(row.number = row.number, col.name = "cyl", col.value = cyl,
                        severity = "ERROR", msg = "More than 10 cylinders")]
}
This implementation modifies the original data.table DT!
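To make the side effect concrete, here is a minimal, self-contained sketch that rebuilds the example data and calls the side-effecting function twice: once on a deep copy made with copy() (the workaround I would like to avoid because of its cost) and once on DT directly. The helper variables DT.copy, res.copy, res.ref, changed.after.copy and changed.after.ref are only introduced for this illustration.

```r
library(data.table)

# Rebuild the example data from above
DT <- as.data.table(mtcars)
DT[, row.number := 1:(.N)]
DT[c(3, 6, 7), cyl := 100]

validate.with.side.effects <- function(data) {
  data[, max.cyl := 10] # adds a column by reference
  data[cyl > max.cyl, .(row.number = row.number, col.name = "cyl", col.value = cyl,
                        severity = "ERROR", msg = "More than 10 cylinders")]
}

# Workaround to avoid: a deep copy protects DT but duplicates every column
DT.copy <- copy(DT)
res.copy <- validate.with.side.effects(DT.copy)
changed.after.copy <- "max.cyl" %in% names(DT)
print(changed.after.copy) # FALSE: only the copy was modified

# Passing DT directly: the new column ends up in the original object
res.ref <- validate.with.side.effects(DT)
changed.after.ref <- "max.cyl" %in% names(DT)
print(changed.after.ref) # TRUE: DT itself was modified
```

Both calls return the same three validation rows; the only difference is whether DT is left untouched.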
Update:
There is an open feature request: https://github.com/Rdatatable/data.table/issues/1086
That feature request is based on a similar need in a different context (immutable data in packages): Lock or protect a data.table in R https://stackoverflow.com/q/29085334/4468078
Update 2:
There is another similar feature request: https://github.com/Rdatatable/data.table/issues/778
And another similar question: How to return a "const" data.table from an R function? https://stackoverflow.com/q/25426810/4468078
Update 3: Could an immutable data.frame be used?
In general yes (it makes a data.frame read-only), but this solution does not meet the requirements of the question:
library(plyr)
idf <- idata.frame(DT)
idf[, max.cyl := 10]
# Error in `:=`(max.cyl, 10) :
# Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
validate(idf)
# Error in `[.idf`(data, cyl > 10, .(row.number = row.number, col.name = "cyl", :
# object 'cyl' not found
# it is no longer a data.table...
class(idata.frame(DT))
# [1] "idf" "environment"
# so that the data.table syntax does NOT work anymore...
idf[cyl > 10, .(row.number = row.number, col.name = "cyl", col.value = cyl,
severity = "ERROR", msg = "More than 10 cylinders")]
# Error in `[.idf`(idf, cyl > 10, .(row.number = row.number, col.name = "cyl", :
# object 'cyl' not found
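Since the immutable data.frame loses the data.table syntax, the only cheap safeguard I can sketch so far is detection after the fact, not prevention. The wrapper below is a hypothetical helper (run.validation is my own name, not part of data.table): it remembers the column names before calling the user's function and raises an error if they differ afterwards. It is a partial check under that assumption only: it does not notice values changed in place inside existing columns, and it does not undo the modification.

```r
library(data.table)

# Hypothetical helper, not part of data.table: runs a user-supplied validation
# function and fails if the column set of the input changed during the call.
run.validation <- function(data, validate.fun) {
  # copy() guards against the names vector itself being updated by reference
  cols.before <- copy(names(data))
  result <- validate.fun(data)
  cols.after <- names(data)
  if (!identical(cols.before, cols.after)) {
    changed <- union(setdiff(cols.after, cols.before), setdiff(cols.before, cols.after))
    stop("validation function changed the columns of the input: ",
         paste(changed, collapse = ", "))
  }
  result
}

DT <- as.data.table(mtcars)

# A validator with a side effect is reported (but DT has already been modified)
outcome <- tryCatch(
  run.validation(DT, function(d) { d[, max.cyl := 10]; d[cyl > max.cyl] }),
  error = function(e) conditionMessage(e)
)

# A validator that only reads the data passes through unchanged
ok <- run.validation(DT, function(d) d[cyl > 6])
```

Only a vector of column names is copied, so the cost does not depend on the number of rows.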