I am doing string processing in R. I have a data frame with a column of strings, for example:
df <- data.frame(textcol=c("I would like to find the position of this substring",
"I would also like to find the position of thes substring",
"No match here","No mention of this substrangy thing"))
matchPattern <- "this substring"
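I am looking for a function (depending on some distance parameter, e.g. Jaro-Winkler) that takes my matchPattern, compares it against each row of the data frame's text column, and returns the exact position of the match within each string: 36 for the first element (unless I miscounted), (possibly) 43 for the second, NA for the third, and 14 (?) for the fourth.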
You can use aregexec:
## Get positions (-1 instead of NA)
positions <- aregexec(matchPattern, df$textcol, max.distance = 0.1)
unlist(positions)
# [1] 38 43 -1 15
## Extract matches
regmatches(df$textcol, positions)
# [[1]]
# [1] "this substring"
#
# [[2]]
# [1] "thes substring"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "this substrang"
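The question asked for NA rather than -1 on non-matches, and the end of each match can be useful too. Below is a minimal sketch of that conversion. The example strings are restated so the snippet runs on its own, and the names `start` and `len` are introduced here purely for illustration. Note that aregexec measures generalized Levenshtein (edit) distance, not Jaro-Winkler, so this is an approximation of what was asked rather than an exact equivalent.

```r
textcol <- c("I would like to find the position of this substring",
             "I would also like to find the position of thes substring",
             "No match here",
             "No mention of this substrangy thing")
matchPattern <- "this substring"

positions <- aregexec(matchPattern, textcol, max.distance = 0.1)

# The first element of each list entry is the start of the whole match;
# its "match.length" attribute holds the length of that match.
start <- vapply(positions, function(p) as.integer(p[1]), integer(1))
len   <- vapply(positions,
                function(p) as.integer(attr(p, "match.length")[1]),
                integer(1))

# aregexec reports -1 for "no match"; turn that into NA as requested
start[start == -1L] <- NA
len[len == -1L]     <- NA

data.frame(start = start, end = start + len - 1L)
```

Raising max.distance makes the matching more permissive; whether 0.1 is the right threshold depends on the data.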
Edit
## A possibility for replacing matches, or maybe `regmatches<-`
res <- regmatches(df$textcol, positions)
res[lengths(res)==0] <- "XXXX" # deal with 0 length matches somehow
df$out <- Vectorize(gsub)(unlist(res), "Censored", df$textcol)
df$out
# [1] "I would like to find the position of Censored"
# [2] "I would also like to find the position of Censored"
# [3] "No match here"
# [4] "No mention of Censoredy thing"
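The gsub approach above reuses the matched text as a regular expression, so it could misbehave if a match contains regex metacharacters, and the "XXXX" placeholder could in principle occur in the text. One possible alternative, sketched here under the same example data, splices the replacement in by position with substr. The names `start`, `len`, `hit` and `out` are illustrative only.

```r
textcol <- c("I would like to find the position of this substring",
             "I would also like to find the position of thes substring",
             "No match here",
             "No mention of this substrangy thing")
matchPattern <- "this substring"

positions <- aregexec(matchPattern, textcol, max.distance = 0.1)

# Start and length of the whole match for each string (-1 when no match)
start <- vapply(positions, function(p) as.integer(p[1]), integer(1))
len   <- vapply(positions,
                function(p) as.integer(attr(p, "match.length")[1]),
                integer(1))
hit <- start > 0

# Rebuild only the matched strings: text before the match, the replacement,
# then text after the match. Non-matching strings are left untouched.
out <- textcol
out[hit] <- paste0(
  substr(textcol[hit], 1, start[hit] - 1),
  "Censored",
  substr(textcol[hit], start[hit] + len[hit], nchar(textcol[hit]))
)
out
```

This replaces only the first approximate match in each string, which is all that aregexec reports.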