使用 R,我想根据另一个数据框中包含的条件将数据框中的数值替换为 NA:
# An example data frame with numeric values I want to
# change to `NA` based on values given in another data frame.
df1 <- tibble::tribble(
~col_1, ~col_2, ~col_3, ~group, ~subgroup,
1, 3, 5, 'A', 'p',
6, 8, 5, 'A', 'q',
5, 3, 3, 'B', 'p',
1, 7, 7, 'B', 'q'
)
# A second data frame containing conditions
# to be used for subsetting the first data frame.
df2 <- tibble::tribble(
~group, ~subgroup, ~cols,
'A', 'q', 'col_1',
'A', 'q', 'col_3',
'B', 'p', 'col_2',
'B', 'p', 'col_3'
)
# My problematic approach to subsetting df1 and replacing
# values with `NA` based on the conditions given in df2.
df1[df1$group %in% unique(df2$group) &
df1$subgroup %in% unique(df2$subgroup),
unique(df2$cols)] <- NA
# The incorrect result of my approach.
print(df1)
# A tibble: 4 × 5
col_1 col_2 col_3 group subgroup
<dbl> <dbl> <dbl> <chr> <chr>
1 NA NA NA A p
2 NA NA NA A q
3 NA NA NA B p
4 NA NA NA B q
Created on 2021-09-20 by the reprex package https://reprex.tidyverse.org (v2.0.1)
我的策略是使用 TRUE 索引对 df1 进行子集化,其中 df1 观测值与 df2 匹配并使用]<-NA
用 NA 替换这些观察结果。然而,我的方法选择了所有观察结果,而不是仅替换 df2 中指示的观察结果的期望结果。
如何在不手动替换的情况下以功能/编程方式执行此操作?此示例数据集足够小,可以使用]<-
我想要替换的每个值的方法,但我想在更大、更复杂的数据集上功能性地执行此操作。
解决方案和注意事项:@Ronak 和 @akrun 提供的两种解决方案都适用于本问题中的示例数据集。然而,在我的真实数据集中发现重复的子组和组值的罕见情况后,我发现只有 @akrun 的解决方案有效。下面我添加了另一个示例,该示例重新创建了我在真实数据中观察到的罕见情况,并添加了 @Ronak 对解决方案的修改,使其适用于这些重复。
# Unique numeric observations were added
# in rows 1 and 2 with group and subgroup
# values that are duplicated with existing
# group and subgroup values.
df1 <- tibble::tribble(
~col_1, ~col_2, ~col_3, ~group, ~subgroup,
7, 4, 9, "A", "p",
1, 3, 5, "A", "p",
6, 8, 5, "A", "q",
5, 3, 3, "B", "p",
1, 7, 7, "B", "q"
)
# Conditions were added in rows 1 and 2
# to indicate which values to replace
# in df1 with NA.
df2 <- tibble::tribble(
~group, ~subgroup, ~cols,
"A", "p", "col_1",
"A", "p", "col_2",
"A", "q", "col_1",
"A", "q", "col_3",
"B", "p", "col_2",
"B", "p", "col_3"
)
# Modifications of @Ronak's solution
df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2)
key1 <- lapply(
setNames(names(df1)[grep("col_\\d", x = names(df1))], 1:3),
function(x) {
paste(x, df1$group, df1$subgroup)
}
)
key2 <- with(df2, paste(cols, group, subgroup))
indices <- lapply(
key1,
function(x) {
which(x %in% key2)
}
)
indices <- indices[sapply(indices, function(x) length(x) > 0)]
selection <- lapply(
1:length(indices),
function(x) {
cbind(indices[[x]], as.numeric(names(indices)[x]))
}
)
selection <- do.call(rbind, selection)
df1[selection] <- NA
df1
# col_1 col_2 col_3 group subgroup
# 1 NA NA 9 A p
# 2 NA NA 5 A p
# 3 NA 8 NA A q
# 4 5 NA NA B p
# 5 1 7 7 B q