Suppose I have data like this:
df <- read.table(text= "title date text
blablabla 22.07.2023 'blablablabla Blue blablabla'
blablabla 23.06.2023 'bala Blue blabla Blue Night Blue'
blablabla 23.08.2023 'bala Mountain blabla House Night Blue'",
header = TRUE, stringsAsFactors = FALSE)
and a vector words holding the keywords I am interested in:
words <- c("House", "Mountain", "Blue", "Night")
What I want to do is count how many times each of the words occurs in df$text, with the count for each word in its own column. So far I have this code:
library(tidyverse)
df %>%
  # extract all instances of the keywords (case-insensitive):
  mutate(
    keyword = str_extract_all(
      text,
      str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")
    )
  ) %>%
  # collapse the matches of each row into one `|`-separated string:
  mutate(keyword = lapply(keyword, function(x) str_c(x, collapse = "|"))) %>%
  # create row ID:
  mutate(row = row_number()) %>%
  # separate into rows, splitting by `|`:
  separate_rows(keyword, sep = '\\|') %>%
  # cast each keyword into its own column:
  pivot_wider(
    names_from = keyword, values_from = keyword,
    values_fn = function(x) 1, values_fill = 0
  ) %>%
  select(-row)
# A tibble: 3 × 7
  title     date       text                                   Blue Night Mountain House
  <chr>     <chr>      <chr>                                 <dbl> <dbl>    <dbl> <dbl>
1 blablabla 22.07.2023 blablablabla Blue blablabla               1     0        0     0
2 blablabla 23.06.2023 bala Blue blabla Blue Night Blue          1     1        0     0
3 blablabla 23.08.2023 bala Mountain blabla House Night Blue     1     1        1     1
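This is not what I want, because the function(x) 1 part does not sum anything; it only records whether a word is present or absent. How must the code be changed to produce this output: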
# A tibble: 3 × 7
  title     date       text                                   Blue Night Mountain House
  <chr>     <chr>      <chr>                                 <dbl> <dbl>    <dbl> <dbl>
1 blablabla 22.07.2023 blablablabla Blue blablabla               1     0        0     0
2 blablabla 23.06.2023 bala Blue blabla Blue Night Blue          3     1        0     0
3 blablabla 23.08.2023 bala Mountain blabla House Night Blue     1     1        1     1
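
For reference, the desired counts can be reproduced independently of the pipeline with a minimal base R sketch. It is only a cross-check under stated assumptions, not a drop-in fix: count_word is a hypothetical helper name, only the text column is rebuilt, and the keywords are assumed to contain no regex metacharacters.

```r
# Sample text column from the question (title and date omitted for brevity).
df <- data.frame(
  text = c("blablablabla Blue blablabla",
           "bala Blue blabla Blue Night Blue",
           "bala Mountain blabla House Night Blue"),
  stringsAsFactors = FALSE
)
words <- c("House", "Mountain", "Blue", "Night")

# count_word() is a hypothetical helper: for one keyword, it returns the
# number of whole-word, case-insensitive matches in each element of `text`.
# Assumption: `word` contains no regex metacharacters.
count_word <- function(word, text) {
  pattern <- paste0("\\b", word, "\\b")
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  # regmatches() yields character(0) for rows without a match, so lengths() gives 0
  lengths(regmatches(text, hits))
}

# One integer column per keyword, named after the keyword.
counts <- as.data.frame(lapply(setNames(words, words), count_word, text = df$text))
result <- cbind(df, counts)
print(result)
```

Within the pipeline itself, replacing values_fn = function(x) 1 with values_fn = length looks like the smallest change, since pivot_wider() would then return the number of matches per cell instead of a constant 1; that variant is untested here, and its count columns would be integer rather than double.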