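I am working with the R programming language.
I have this dataset, which records the exam results of a group of students at different points in time (1 = pass, 0 = fail):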
library(data.table)
library(doParallel)
# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
# Create a data frame from the sample data
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
# Convert the data frame to a data.table
my_data = setDT(my_data)
# Create a cluster with 4 workers
cl = makeCluster(4)
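I have this function, which counts four things for each student: the number of times the student failed an exam given that they failed the previous exam, passed given that they passed the previous exam, passed given that they failed the previous exam, and failed given that they passed the previous exam. Here is the function: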
my_function <- function(i) {
  # Use tryCatch to handle the case where there are no rows in the start_i data frame
  tryCatch({
    start_i = my_data[my_data$id == i,]
    pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = as.data.frame(table(pairs_i))
    frame_i$i = i
    return(frame_i)
  }, error = function(err) {
    # Return an empty data frame if there are no rows in the start_i data frame
    return(data.frame())
  })
}
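To make the intended output concrete, here is a small sketch of the same pair-counting step applied to a made-up sequence of results for a single student. The vector `toy_results` and the other `toy_` names are hypothetical and not part of my data:

```r
# Hypothetical results for one student, in exam order (1 = pass, 0 = fail)
toy_results <- c(1, 0, 0, 1, 1, 0)

# Pair each exam with the one that follows it
toy_pairs <- data.frame(
  first  = head(toy_results, -1),  # result of the previous exam
  second = tail(toy_results, -1)   # result of the next exam
)

# Count how often each (previous, next) combination occurs
toy_counts <- as.data.frame(table(toy_pairs))
print(toy_counts)
```

Six exams give five consecutive pairs, so the `Freq` column should sum to 5, with one row per (previous, next) combination.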
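Now I would like to run this function on my data in parallel. That is, I want to assign the data belonging to different students to different cores on my computer, to reduce the time it takes to run the function. Here is my attempt: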
# Export the data frames and the my_function to the workers on the cluster
clusterExport(cl, c("my_data", "my_function", "data.table"))
# Assign each worker a different subset of the data to work on
clusterSetRNGStream(cl)
n = nrow(my_data)
chunks = rep(1:4, each = n / 4)
my_data = my_data[chunks == 1,]
# Evaluate the code on the cluster (final_out is the final result)
final_out = parLapply(cl, unique(my_data$id), my_function)
# alternate version
final_out = clusterApply(cl, unique(my_data$id), my_function)
# Stop the cluster when finished
stopCluster(cl)
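For comparison, here is a variant I am considering, sketched on a small made-up data set. It rests on a few assumptions that I have not confirmed are the recommended approach: that the package should be loaded on each worker with `clusterEvalQ` rather than exported by name, that `parLapply` already splits the vector of IDs across the workers (so the manual `chunks` step, which keeps only the first quarter of the rows, is not needed), and that the data can be passed to the function as an argument. The names `toy_data`, `count_pairs`, and `cl2` are hypothetical.

```r
library(data.table)
library(parallel)

# Small made-up data set (hypothetical; stands in for my_data)
set.seed(1)
toy_data <- data.table(
  id = sample.int(50, 500, replace = TRUE),
  results = sample(c(1, 0), 500, replace = TRUE),
  date_exam_taken = sample(
    seq(as.Date("1999-01-01"), as.Date("2020-01-01"), by = "day"),
    500, replace = TRUE
  )
)
setorder(toy_data, id, date_exam_taken)

# Same pair-counting logic as my_function, but the data is passed in explicitly
count_pairs <- function(i, dat) {
  start_i <- dat[dat$id == i, ]
  # A student with fewer than two exams has no consecutive pairs
  if (nrow(start_i) < 2) return(data.frame())
  pairs_i <- data.frame(first = head(start_i$results, -1),
                        second = tail(start_i$results, -1))
  frame_i <- as.data.frame(table(pairs_i))
  frame_i$i <- i
  frame_i
}

cl2 <- makeCluster(2)
# Load data.table on every worker so its methods are available there
invisible(clusterEvalQ(cl2, library(data.table)))

ids <- unique(toy_data$id)
# parLapply divides ids among the workers itself; all rows of toy_data are kept
par_out <- parLapply(cl2, ids, count_pairs, dat = toy_data)
stopCluster(cl2)

# Combine the per-student tables and compare against a sequential run
par_result <- rbindlist(par_out)
seq_result <- rbindlist(lapply(ids, count_pairs, dat = toy_data))
print(isTRUE(all.equal(par_result, seq_result)))
```

Comparing the parallel result with a plain `lapply` run on the same IDs is the check I would use to decide whether the parallel version is doing what I expect.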
The code appears to run without errors, but I am not sure whether I have done everything correctly.
Could someone please comment on this?
Thanks!