我想比较两个文本的相似性,因此我需要一个简单的函数来清楚地按时间顺序列出两个文本中出现的单词和短语。这些单词/句子应突出显示或下划线以便更好地可视化)
基于@joris Meys的想法,我添加了一个数组来将文本分为句子和从属句子。
它是这样的:
textparts <- function (text){
textparts <- c("\\,", "\\.")
i <- 1
while(i<=length(textparts)){
text <- unlist(strsplit(text, textparts[i]))
i <- i+1
}
return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))
然而,有时它有效,有时则无效。
我希望得到这样的结果:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
而我没有得到任何结果。