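Great question. We don't have approximate matching built in as a "valuetype", but that's an interesting idea for future development. In the meantime, I'd suggest generating a list of fixed fuzzy matches with base::agrep() and then matching on those. So this would look like: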
library("quanteda")
## Package version: 1.5.2
dataset <- data.frame(
  "patient" = 1:9, "text" = c(
    "On his first appointment, the patient was suicidal when he showed up in my office",
    "On his first appointment, the patient was suicidaa when he showed up in my office",
    "On his first appointment, the patient was suiciaaa when he showed up in my office",
    "On his first appointment, the patient was suicaaal when he showed up in my office",
    "On his first appointment, the patient was suiaaaal when he showed up in my office",
    "On his first appointment, the patient was saacidal when he showed up in my office",
    "On his first appointment, the patient was suaaadal when he showed up in my office",
    "On his first appointment, the patient was icidal when he showed up in my office",
    "On his first appointment, the patient was uicida when he showed up in my office"
  ),
  stringsAsFactors = FALSE
)
corp <- corpus(dataset)
# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
  types()
Then use agrep() to generate the closest fuzzy matches. Here I ran this a few times, increasing max.distance slightly each time from the default of 0.1.
# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal" "uicida"
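If you want to see why the default of 0.1 was too strict here, a rough sketch like the one below may help. It uses adist() from utils (shipped with base R) to compute whole-string edit distances to "suicidal". The names candidates, edit_dist and budget are made up for illustration, and the budget() helper only mirrors my reading of the agrep() documentation (a fractional max.distance is replaced by the smallest integer not less than the fraction times the pattern length, assuming the default transformation costs of 1). Since agrep() looks for an approximate match within each string, these whole-string distances should be read as an upper bound on what agrep() needs, not as its exact internals.

```r
# Candidate spellings copied from the near_matches output above
candidates <- c(
  "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
  "saacidal", "suaaadal", "icidal", "uicida"
)

# adist() returns a matrix of generalized Levenshtein distances;
# with a single pattern it has one row, so drop() gives a plain vector
edit_dist <- drop(adist("suicidal", candidates))
names(edit_dist) <- candidates
edit_dist

# Hypothetical helper: integer edit budget implied by a fractional
# max.distance (assumes the default transformation costs of 1)
budget <- function(fraction, pattern) ceiling(fraction * nchar(pattern))

budget(0.1, "suicidal") # budget at the default max.distance
budget(0.3, "suicidal") # budget at the value used above

# Spellings whose whole-string distance fits within the larger budget
candidates[edit_dist <= budget(0.3, "suicidal")]
```

Comparing the largest value in edit_dist with the two budgets gives a quick way to pick max.distance for your own target word, instead of raising it by trial and error.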
Then, use these matches as the pattern argument to kwic():
# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##
## [text1, 9] the patient was | suicidal | when he showed
## [text2, 9] the patient was | suicidaa | when he showed
## [text3, 9] the patient was | suiciaaa | when he showed
## [text4, 9] the patient was | suicaaal | when he showed
## [text5, 9] the patient was | suiaaaal | when he showed
## [text6, 9] the patient was | saacidal | when he showed
## [text7, 9] the patient was | suaaadal | when he showed
## [text8, 9] the patient was | icidal | when he showed
## [text9, 9] the patient was | uicida | when he showed
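There are other possibilities based on similar solutions, such as the fuzzyjoin or stringdist packages, but this is a simple solution that relies only on the base package and should work pretty well.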