A simple comparison of two texts in R

2024-01-15

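I want to compare two texts for similarity, so I need a simple function that lists, clearly and in order of appearance, the words and phrases that occur in both texts. These words/sentences should be highlighted or underlined for better visualization.

Building on @joris Meys's idea, I added an array to split the text into sentences and subordinate clauses.

It looks like this: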

textparts <- function(text) {
  # separators: comma (subordinate clauses) and period (sentences)
  separators <- c("\\,", "\\.")
  for (sep in separators) {
    text <- unlist(strsplit(text, sep))
  }
  return(text)
}

textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")

# parts that occur in both texts, wrapped as \\<( ... )\\> patterns
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(", commonWords, ")\\>", sep = "")

for (x in commonWords) {
  textparts1 <- gsub(x, "\\1*", textparts1, ignore.case = TRUE)
  textparts2 <- gsub(x, "\\1*", textparts2, ignore.case = TRUE)
}
list(textparts1, textparts2)

However, sometimes it works and sometimes it doesn't.

I would like to get a result like this:

> list(textparts1, textparts2)
[[1]]
[1] "This is a complete sentence"         " whereas this is a dependent clause*" " This thing works*"                  

[[2]]
[1] "This could be a sentence"            " whereas this is a dependent clause*" " Plagiarism is not cool"             " This thing works*"

Instead, I get no result at all.
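One likely reason the stars never appear: every part after the first starts with a space (the text is split on `,` and `.` but not trimmed), so a pattern such as `\\<( This thing works)\\>` asks for a word start directly in front of a space, which cannot match. The sketch below avoids regular expressions for the comparison altogether. It is only an illustration under that assumption; `splitClauses` and `markCommonClauses` are hypothetical names, not part of the original code.

```r
# Hypothetical helper: split a text on commas and periods,
# trim surrounding whitespace and drop empty pieces.
splitClauses <- function(text) {
  parts <- trimws(unlist(strsplit(text, "[,.]")))
  parts[nzchar(parts)]
}

# Hypothetical helper: append "*" to every clause that also occurs
# in the other text (compared case-insensitively).
markCommonClauses <- function(text1, text2) {
  p1 <- splitClauses(text1)
  p2 <- splitClauses(text2)
  common <- intersect(tolower(p1), tolower(p2))
  mark <- function(p) ifelse(tolower(p) %in% common, paste0(p, "*"), p)
  list(mark(p1), mark(p2))
}

res <- markCommonClauses(
  "This is a complete sentence, whereas this is a dependent clause. This thing works.",
  "This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works."
)
print(res)
```

Because whole clauses are compared with `%in%` rather than turned into patterns, punctuation or regex metacharacters inside a clause cannot break the match.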

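@Chase's answer has a few problems:

It does not account for differences in case.
Punctuation can mess up the results.
If there is more than one similar word, you get a lot of warnings from the gsub call.

Building on his idea, here is a solution that makes use of tolower() and some nice features of regular expressions: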

compareSentences <- function(sentence1, sentence2) {
  # split on runs of "not a word" characters and put everything in lowercase
  x1 <- tolower(unlist(strsplit(sentence1, "\\W+")))
  x2 <- tolower(unlist(strsplit(sentence2, "\\W+")))

  # words occurring in both sentences; drop empty strings left by the split
  commonWords <- intersect(x1, x2)
  commonWords <- commonWords[nzchar(commonWords)]

  for (w in commonWords) {
    # add word beginning and ending, and put the word between ()
    # to allow for match referencing in gsub
    pattern <- paste0("\\<(", w, ")\\>")
    # replace the match by the match with a star added
    sentence1 <- gsub(pattern, "\\1*", sentence1, ignore.case = TRUE)
    sentence2 <- gsub(pattern, "\\1*", sentence2, ignore.case = TRUE)
  }
  return(list(sentence1, sentence2))
}

This gives the following result:

text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "

compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"

[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
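The effect of the word-boundary anchors `\\<` and `\\>` and of the back-reference `\\1` can be checked in isolation. A minimal sketch with a made-up sample string (the variable names `loose` and `strict` are only for illustration):

```r
sample_text <- "This is fine"

# Without word boundaries, "is" is also matched inside "This".
loose <- gsub("(is)", "\\1*", sample_text, ignore.case = TRUE)

# With \\< and \\>, only the standalone word "is" is matched.
# \\1 puts the matched text back, so the original casing is kept.
strict <- gsub("\\<(is)\\>", "\\1*", sample_text, ignore.case = TRUE)

print(loose)   # "This* is* fine"
print(strict)  # "This is* fine"
```

This is why a lowercased word list can be used for matching while the stars are still attached to the words exactly as they are written in the input.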
