Removing rows from a corpus with multiple documents

2024-03-20

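I have 4000 text documents in my corpus. As part of data cleaning, I would like to remove the rows containing a specific word from each document.

For example: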

library(tm)
doc.corpus<-  VCorpus(DirSource("C:\\TextMining\\Prototype",pattern="*.txt",encoding= "UTF8",mode = "text"),readerControl=list(language="en"))

doc.corpus<- tm_map(doc.corpus, PlainTextDocument)

doc.corpus[[1]]

#PlainTextDocument
Metadata:  7
Content:  chars: 16542

    as.character(doc.corpus)[[1]]


$content


"Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities."
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation."

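My problem is to remove the second row, which contains the word "trademark", from this document and from all the other documents. Currently I use the grepl() function to identify the rows and try to exclude them the way I normally would with a data frame, but that approach does not work: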

corpus.copy<-corpus.doc
corpus.doc[[1]]<-corpus.copy[[1]][!grepl("trademark",as.character(corpus.copy[[1]]),ignore.case = TRUE),]

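As long as it works for the first document, I can easily use a for loop to apply it to every document in the corpus.

Any hints or solutions are appreciated. I could also take the alternative route of converting the corpus to a data frame, removing the unwanted rows, and converting it back to a corpus. Thanks.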

System.info:
[1] "x86_64-w64-mingw32"; 
[1] "R version 3.1.0 (2014-04-10)"
[1] tm_0.6-2 

No for loop is needed, although it has long been a frustrating feature of tm that texts are hard to access once they are inside a corpus object.

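I have interpreted "rows" to mean documents, so the example above consists of two "rows". If that is not the case, this answer will need to be adjusted, though that should be easy.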
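If "rows" instead means individual lines inside each document, one possible adjustment is sketched below. It assumes a recent tm version that provides `content_transformer()`; the helper `drop_lines()` and the toy input are hypothetical, not part of tm or of the original question.

```r
library(tm)

# Hypothetical helper (not part of tm): drop every line of a text that
# matches `pattern`. It accepts a single string with embedded newlines or
# a character vector with one element per line. Empty vector elements are
# dropped by strsplit().
drop_lines <- function(x, pattern = "trademark") {
  lines <- unlist(strsplit(x, "\n", fixed = TRUE))
  lines[!grepl(pattern, lines, ignore.case = TRUE)]
}

# Toy input: two documents, the first with a legal notice on its second line.
docs <- c("Easy to deploy.\nAcme is a registered trademark of Acme Corp.\nEasy to use.",
          "No legal notices here.")
corp <- VCorpus(VectorSource(docs))

# content_transformer() wraps a character function so that tm_map()
# applies it to the content of each document and keeps the metadata.
cleaned <- tm_map(corp, content_transformer(drop_lines))

as.character(cleaned[[1]])
```

Because the filtering happens inside `tm_map()`, every document stays in the corpus and only the matching lines are removed from it.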

To drop whole documents that contain the word, try this:

txt <- c("Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities.",
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation.")

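library(tm)
corp <- VCorpus(VectorSource(txt))
# Extract the text of each document as a plain character vector
textVector <- sapply(corp, as.character)
# Keep only the documents that do not mention "trademark".
# Logical indexing with !grepl() is used rather than -grep(), which would
# drop every document if there were no matches.
newCorp <- VCorpus(VectorSource(textVector[!grepl("trademark", textVector,
                                                  ignore.case = TRUE)]))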

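newCorp now excludes the documents containing "trademark". Note that this pattern also matches plural forms (e.g. "trademarks"); if you do not want that, the regular expression needs to be tightened.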
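As a small illustration of the difference between a substring match and a whole-word match, the sketch below compares the two patterns on a made-up vector. The sample sentences are assumptions for demonstration only, and `perl = TRUE` is one of several ways to get `\b` word boundaries in R.

```r
# Made-up sample sentences (not from the original corpus)
x <- c("Acme is a registered trademark of Acme Corp.",
       "All trademarks are property of their owners.",
       "Nothing legal here.")

# Plain substring match: hits the singular and the plural alike
loose <- grepl("trademark", x, ignore.case = TRUE)

# Word boundaries (\\b) restrict the match to the exact word,
# so "trademarks" no longer matches
strict <- grepl("\\btrademark\\b", x, ignore.case = TRUE, perl = TRUE)

loose   # TRUE  TRUE FALSE
strict  # TRUE FALSE FALSE
```

Either pattern can be substituted for the `"trademark"` argument in the filtering step, depending on which forms should be removed.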
