Wrapping a string extraction function in an ifelse statement

2024-03-14

The following question is an extension of this question: https://stackoverflow.com/questions/74135095/adding-a-column-to-the-data-that-looks-for-a-list-of-words-and-adds-them-if-foun

Example data

I have example data as follows:

library(data.table)
example_dat <- fread("var_nam description
      some_var this_is_some_var_kg
      other_var this_is_meters_for_another_var
      extra_var the_price_of_apples
      another_var cost_of_goods_sold")
example_dat$description  <- gsub("_", " ", example_dat$description)

       var_nam                    description
1:    some_var            this is some var kg
2:   other_var this is meters for another var
3:   extra_var            the price of apples
4: another_var             cost of goods sold

vector_of_units <- c("kg", "meters", "var")

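Previous solutions

I first asked how to create a separate column in this data that looks for certain units listed in a vector (vector_of_units). One option is this answer by 梅丁: https://stackoverflow.com/a/74135271/8071608. It returns all matches.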

library(tidyverse)
setDT(example_dat)[, unit :=    unlist(lapply(example_dat$description,function(x) 
                    paste0(vector_of_units[str_detect(x,vector_of_units)],
                    collapse = ",")))]

       var_nam                    description       unit
1:    some_var            this is some var kg     kg,var
2:   other_var this is meters for another var meters,var
3:   extra_var            the price of apples           
4: another_var             cost of goods sold           

I also found langtang's answer https://stackoverflow.com/a/71280592/8071608, which returns the first match (this is actually preferable in my case):

example_dat[, unit:=stringr::str_extract(description, paste0(vector_of_units,collapse = "|"))]

       var_nam                    description   unit
1:    some_var            this is some var kg    var
2:   other_var this is meters for another var meters
3:   extra_var            the price of apples   <NA>
4: another_var             cost of goods sold   <NA>

Related question: Based on vector of strings, extract string from data.table column into new column https://stackoverflow.com/questions/71280466/based-on-vector-of-strings-extract-string-from-data-table-column-into-new-column

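More flexibility using an ifelse statement

However, I would like more flexibility.

First, I would like to supply one vector to match against and a separate vector of values to paste, so that I can change a hit into something else: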

vector_of_units_in <- c("kg", "meters", "var")
vector_of_units_out <- c("kilogram", "meters", "variable")

vector_of_units_euro <- c("cost", "price")
vector_of_units_euro_out <- "euro"

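Second, I would like to be able to choose what happens when there is no hit. For example, when applying langtang's solution, I would like it not to overwrite existing values with NA.

I have been trying to adapt langtang's solution: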

setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_in)), paste0(vector_of_units_out, collapse = "|"), NA)]

# NA has been replaced by unit, so that it is not overwritten in case of no match
setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_euro)), paste0(vector_of_units_euro_out, collapse = "|"), unit)]

But I end up with this:

       var_nam                    description                     unit
1:    some_var            this is some var kg kilogram|meters|variable
2:   other_var this is meters for another var kilogram|meters|variable
3:   extra_var            the price of apples                     <NA>
4: another_var             cost of goods sold                     <NA>

How should I write this syntax?

Desired output

       var_nam                    description     unit
1:    some_var            this is some var kg kilogram
2:   other_var this is meters for another var   meters
3:   extra_var            the price of apples     euro
4: another_var             cost of goods sold     euro
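
A note on why the attempt above writes the same string into every matching row: `paste0(..., collapse = "|")` collapses the whole output vector into a single string, and `ifelse` then recycles that one string over all rows where the test is TRUE. In addition, passing a vector of patterns to `str_extract` pairs patterns with rows element by element instead of testing every pattern against every row (and newer stringr versions may reject the length mismatch outright). The small sketch below isolates the first effect; the `matched` vector is a made-up stand-in for the result of the match test.

```r
vector_of_units_out <- c("kilogram", "meters", "variable")

# collapse = "|" turns the three labels into ONE string
label <- paste0(vector_of_units_out, collapse = "|")

# Hypothetical match result for four rows (an assumption for illustration)
matched <- c(TRUE, TRUE, FALSE, FALSE)

# ifelse recycles the single string over every TRUE position
res <- ifelse(matched, label, NA)
print(res)
```

So the label has to be looked up per row rather than collapsed once for the whole column.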

You could use a named units vector and Vectorize grepl for outer. If no match is found, we can return NA.

units <- c(kilogram="kg", meters="meters", euro="cost", euro="price", variable='var')

dat[, unit:=apply(outer(units, description, Vectorize(grepl)), 2, \(x) 
                  if (any(x)) names(which(x)) else NA)]
dat
#        var_nam                    description              unit
# 1:    some_var            this is some var kg kilogram,variable
# 2:    some_var                   this is some                NA
# 3:   other_var this is meters for another var   meters,variable
# 4:   extra_var            the price of apples              euro
# 5: another_var             cost of goods sold              euro
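
The code above returns every matching label for a row. For the single label per row shown in the desired output, a minimal sketch might look like the following. It assumes that priority follows the order of the lookup vector (so "kg" wins over "var" in the first row) and that the patterns are literal strings rather than regular expressions. `first_unit` and its `no_match` argument are hypothetical names introduced here, not part of data.table or stringr.

```r
library(data.table)

# Names are the output labels, values are the strings to search for.
# Assumption: earlier entries take priority when several match.
units <- c(kilogram = "kg", meters = "meters", variable = "var",
           euro = "cost", euro = "price")

# Hypothetical helper: label of the first matching entry, or `no_match`.
first_unit <- function(description, lookup, no_match = NA_character_) {
  no_match <- as.character(no_match)
  vapply(description, function(d) {
    # Test every lookup string against this one description (literal match)
    hit <- vapply(lookup, grepl, logical(1), x = d, fixed = TRUE)
    if (any(hit)) names(lookup)[which(hit)[1]] else no_match
  }, character(1), USE.NAMES = FALSE)
}

example_dat <- data.table(
  var_nam = c("some_var", "other_var", "extra_var", "another_var"),
  description = c("this is some var kg", "this is meters for another var",
                  "the price of apples", "cost of goods sold"))

example_dat[, unit := first_unit(description, units)]
print(example_dat)
```

Passing something like `no_match = "none"` changes what is written when nothing matches, which covers the "choose what happens on no hit" requirement.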

Data:

library(data.table)
# Rebuilt without the non-parseable .internal.selfref pointer from dput()
dat <- data.table(
  var_nam = c("some_var", "some_var", "other_var", "extra_var", "another_var"),
  description = c("this is some var kg", "this is some",
                  "this is meters for another var", "the price of apples",
                  "cost of goods sold"))
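
On the second requirement in the question (leaving existing values alone when nothing matches), one option that avoids ifelse entirely is to assign only on the matching rows. The sketch below uses a small made-up table together with the question's `vector_of_units_euro`; the name `euro_pattern` is an assumption for illustration.

```r
library(data.table)

# Made-up table in which `unit` is already partly filled in
example_dat <- data.table(
  description = c("this is some var kg", "the price of apples",
                  "cost of goods sold", "number of items"),
  unit = c("kilogram", NA, NA, NA))

vector_of_units_euro <- c("cost", "price")
euro_pattern <- paste0(vector_of_units_euro, collapse = "|")

# Subset-assign: only rows whose description matches are updated,
# every other row keeps whatever `unit` already held.
example_dat[grepl(euro_pattern, description), unit := "euro"]
print(example_dat)
```

Rows that still hold NA afterwards can be given a default in the same way, for example with a filter on `is.na(unit)`.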
