Simple comparison of two texts in R



Building on @joris Meys's idea, I added an array of separators to split the text into sentences and dependent clauses.


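  textparts <- function (text){
    # separators (comma and period), escaped for use as regular expressions
    separators <- c("\\,", "\\.")
    # split on each separator in turn, flattening the result every time
    for (sep in separators){
      text <- unlist(strsplit(text, sep))
    }
    return (text)
  }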

textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")

  commonWords <- intersect(textparts1, textparts2)
  commonWords <- paste("\\<(",commonWords,")\\>",sep="")

  for(x in commonWords){
    # append a star to every part that occurs in both texts
    textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
    textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
  }

> list(textparts1,textparts2)
[1] "This is a complete sentence"         " whereas this is a dependent clause*" " This thing works*"                  

[1] "This could be a sentence"            " whereas this is a dependent clause*" " Plagiarism is not cool"             " This thing works*"           
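
The clause-level comparison above can also be sketched without any regular expressions in the marking step. This is only a sketch under my own assumptions: `split_clauses` and `mark_shared` are hypothetical helper names, not functions from the answer, and unlike the code above the clauses are trimmed of surrounding whitespace before they are compared.

```r
# Hypothetical helper (not from the original post): split a text into
# sentences and clauses on "," and ".", then trim surrounding whitespace.
split_clauses <- function(text) {
  parts <- trimws(unlist(strsplit(text, "[,.]")))
  parts[nzchar(parts)]            # drop empty pieces
}

# Hypothetical helper: append "*" to every clause of `a` that also
# occurs in `b`, ignoring case.
mark_shared <- function(a, b) {
  shared <- tolower(a) %in% tolower(b)
  ifelse(shared, paste0(a, "*"), a)
}

t1 <- split_clauses("This is a complete sentence, whereas this is a dependent clause. This thing works.")
t2 <- split_clauses("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")

mark_shared(t1, t2)
mark_shared(t2, t1)
```

Because the clauses are compared as whole strings with `%in%`, characters that are special in a regular expression cannot interfere with the match.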



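  • Differences in capitalization are not taken into account
  • Punctuation can mess up the results
  • If there are several similar words, you will get a lot of warnings from the gsub calls.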
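
The first two points can be sidestepped by normalising the texts before any matching happens. A minimal sketch, assuming a hypothetical helper `tokenize` (my own name, not part of either answer) that lower-cases the text and splits on runs of non-alphanumeric characters:

```r
# Hypothetical helper (an assumption, not from the original answers):
# lower-case the text and split on runs of non-alphanumeric characters,
# so that case and punctuation no longer affect the comparison.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^[:alnum:]]+"))
  unique(words[nzchar(words)])    # drop empty tokens and duplicates
}

shared <- intersect(tokenize("This is a test. Weather is fine"),
                    tokenize("This text is a test. This weather is fine."))
shared
```

Deduplicating the tokens also means each shared word is handled only once in any later gsub loop.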


compareSentences <- function(sentence1, sentence2) {
  # split everything on "not a word" and put all to lowercase
  x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
  x2 <- tolower(unlist(strsplit(sentence2, "\\W")))

  commonWords <- intersect(x1, x2)
  #add word beginning and ending and put words between ()
  # to allow for match referencing in gsub
  commonWords <- paste("\\<(",commonWords,")\\>",sep="")

  for(x in commonWords){
    # replace the match by the match with star added
    sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
    sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
  }
  return(list(sentence1,sentence2))
}

text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
compareSentences(text1,text2)

[1] "This* is* a* test*. Weather* is* fine*"

[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
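
The per-word loop in compareSentences can also be collapsed into one pattern per call. The following is a sketch rather than the original method: `compare_once` is a hypothetical name, and it swaps the `\\<` and `\\>` anchors for PCRE's `\\b` word boundary with `perl=TRUE`.

```r
# Hypothetical variant of compareSentences (an assumption, not the original
# answer): mark all shared words with a single gsub call per text.
compare_once <- function(sentence1, sentence2) {
  w1 <- unlist(strsplit(tolower(sentence1), "\\W+"))
  w2 <- unlist(strsplit(tolower(sentence2), "\\W+"))
  common <- intersect(w1, w2)
  common <- common[nzchar(common)]          # drop empty tokens
  if (length(common) == 0) return(list(sentence1, sentence2))
  # The tokens contain only word characters, so they need no escaping.
  # \\b is a word boundary in PCRE, hence perl = TRUE below.
  pattern <- paste0("\\b(", paste(common, collapse = "|"), ")\\b")
  list(gsub(pattern, "\\1*", sentence1, ignore.case = TRUE, perl = TRUE),
       gsub(pattern, "\\1*", sentence2, ignore.case = TRUE, perl = TRUE))
}

text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
res <- compare_once(text1, text2)
res
```

For the two example texts this should mark the same words as the loop version. For texts with non-ASCII letters the word-boundary behaviour may differ, so treat it as a starting point.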

