This question is a follow-up to this question: https://stackoverflow.com/questions/74135095/adding-a-column-to-the-data-that-looks-for-a-list-of-words-and-adds-them-if-foun
Example data
I have example data as follows:
library(data.table)
example_dat <- fread("var_nam description
some_var this_is_some_var_kg
other_var this_is_meters_for_another_var
extra_var the_price_of_apples
another_var cost_of_goods_sold")
example_dat$description <- gsub("_", " ", example_dat$description)
var_nam description
1: some_var this is some var kg
2: other_var this is meters for another var
3: extra_var the price of apples
4: another_var cost of goods sold
vector_of_units <- c("kg", "meters", "var")
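Previous solutions
I first asked how to create a separate column in this data that looks for certain units listed in a vector (vector_of_units). One option is this answer: https://stackoverflow.com/a/74135271/8071608. It returns all matches.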
library(tidyverse)
setDT(example_dat)[, unit := unlist(lapply(example_dat$description,function(x)
paste0(vector_of_units[str_detect(x,vector_of_units)],
collapse = ",")))]
var_nam description unit
1: some_var this is some var kg kg,var
2: other_var this is meters for another var meters,var
3: extra_var the price of apples
4: another_var cost of goods sold
I also found langtang's answer https://stackoverflow.com/a/71280592/8071608, which returns only the first match (which is actually preferable in my case):
example_dat[, unit:=stringr::str_extract(description, paste0(vector_of_units,collapse = "|"))]
var_nam description unit
1: some_var this is some var kg var
2: other_var this is meters for another var meters
3: extra_var the price of apples <NA>
4: another_var cost of goods sold <NA>
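Based on vector of strings, extract string from data.table column into new column: https://stackoverflow.com/questions/71280466/based-on-vector-of-strings-extract-string-from-data-table-column-into-new-column
More flexibility with an ifelse statement
However, I would like more flexibility.
First, I would like to supply one vector to match against and a separate vector of values to paste, so that I can change a hit into something else: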
vector_of_units_in <- c("kg", "meters", "var")
vector_of_units_out <- c("kilogram", "meters", "variable")
vector_of_units_euro <- c("cost", "price")
vector_of_units_euro_out <- "euro"
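Second, I would like to be able to choose what happens when there is no hit. For example, when applying langtang's solution, I would like it not to overwrite with NA.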
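To make the second requirement concrete, here is a minimal sketch of the behaviour I have in mind, assuming `fcoalesce()` from data.table is an acceptable way to fall back to the existing value. The table `dat` and the placeholder `old_value` are hypothetical and only serve the illustration.

```r
library(data.table)
library(stringr)

# Hypothetical two-row table with a pre-existing unit column
dat <- data.table(
  description = c("this is some var kg", "the price of apples"),
  unit = c("old_value", "old_value")
)

# Single regex alternation, as in langtang's approach
pattern <- paste0(c("kg", "meters", "var"), collapse = "|")

# Where str_extract() returns NA (no hit), keep the value already in unit
dat[, unit := fcoalesce(str_extract(description, pattern), unit)]
print(dat)
```

The second row has no hit, so it keeps `old_value` instead of being overwritten with NA.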
I have been trying to adapt langtang's solution:
setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_in)), paste0(vector_of_units_out, collapse = "|"), NA)]
# NA has been replaced by unit, so that it is not overwritten in case of no match
setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_euro)), paste0(vector_of_units_euro_out, collapse = "|"), unit)]
But I end up with this:
var_nam description unit
1: some_var this is some var kg kilogram|meters|variable
2: other_var this is meters for another var kilogram|meters|variable
3: extra_var the price of apples <NA>
4: another_var cost of goods sold <NA>
How should I write this syntax?
Desired output
var_nam description unit
1: some_var this is some var kg kilogram
2: other_var this is meters for another var meters
3: extra_var the price of apples euro
4: another_var cost of goods sold euro
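For reference, the following is a rough sketch of one way the desired output might be produced. It is not a definitive solution. The helper `map_first_hit()` and its `no_hit` argument are hypothetical names introduced here, and the sketch assumes that priority follows the order of the input vector (so "kg" wins over "var" in row 1, as in the desired output).

```r
library(data.table)
library(stringr)

example_dat <- data.table(
  var_nam = c("some_var", "other_var", "extra_var", "another_var"),
  description = c("this is some var kg",
                  "this is meters for another var",
                  "the price of apples",
                  "cost of goods sold")
)

vector_of_units_in <- c("kg", "meters", "var")
vector_of_units_out <- c("kilogram", "meters", "variable")
vector_of_units_euro <- c("cost", "price")
vector_of_units_euro_out <- "euro"

# Hypothetical helper: for each text, return the label that belongs to the
# first pattern (in vector order) found in it, or `no_hit` if none matches.
map_first_hit <- function(text, patterns_in, labels_out, no_hit = NA_character_) {
  # Recycle a single label (e.g. "euro") across all patterns
  labels_out <- rep_len(labels_out, length(patterns_in))
  vapply(text, function(x) {
    hit <- which(str_detect(x, patterns_in))
    if (length(hit) > 0) labels_out[hit[1]] else no_hit
  }, character(1), USE.NAMES = FALSE)
}

# First pass: physical units
example_dat[, unit := map_first_hit(description, vector_of_units_in, vector_of_units_out)]

# Second pass: only fill rows that are still NA, so earlier hits are kept
example_dat[is.na(unit), unit := map_first_hit(description, vector_of_units_euro, vector_of_units_euro_out)]

print(example_dat)
```

Passing something like `no_hit = "unknown"` in the last pass would be one way to control what rows without any hit receive.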