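My corpus contains 4,000 text documents. As part of data cleaning, I want to remove from each document every line that contains a specific word.
For example: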
library(tm)
doc.corpus<- VCorpus(DirSource("C:\\TextMining\\Prototype",pattern="*.txt",encoding= "UTF8",mode = "text"),readerControl=list(language="en"))
doc.corpus<- tm_map(doc.corpus, PlainTextDocument)
doc.corpus[[1]]
#PlainTextDocument
Metadata: 7
Content: chars: 16542
as.character(doc.corpus)[[1]]
$content
"Quick to deploy, easy to use, and offering complete investment
protection, our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities."
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation."
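My problem is to remove the second line, the one containing the word "trademarks", from this document and from all the other documents. Currently I use the grepl() function to identify the lines, and I tried to exclude them with the approach I normally use on data frames, but it does not work: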
corpus.copy<-doc.corpus
doc.corpus[[1]]<-corpus.copy[[1]][!grepl("trademark",as.character(corpus.copy[[1]]),ignore.case = TRUE),]
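One likely reason the attempt above fails is the trailing comma: the two-index form x[rows, ] selects rows of an object that has rows and columns, such as a data frame, whereas a document's text is a one-dimensional character vector. A minimal base R sketch of the one-dimensional form, assuming the content holds one element per line (doc_lines is a made-up stand-in, not the real corpus):

```r
# Hypothetical stand-in for one document's content: one element per line.
doc_lines <- c(
  "Quick to deploy, easy to use, and offering complete investment protection.",
  "Microsoft is a U.S. registered trademark of Microsoft Corporation.",
  "Support for industry standards."
)

# grepl() returns one logical value per element; negating it marks the
# lines that do NOT mention the word.
keep <- !grepl("trademark", doc_lines, ignore.case = TRUE)

# One-dimensional indexing: no trailing comma, unlike data-frame row selection.
cleaned <- doc_lines[keep]
print(cleaned)
```

If the content were instead stored as a single long string, it would first have to be split into lines (for example with strsplit()) before this filter could apply.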
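Once this works for the first document, I can easily apply it to every document in the corpus with a for loop.
Any hints or solutions are appreciated. I could also take an alternative route: convert the corpus to a data frame, remove the unwanted lines, and convert it back to a corpus. Thanks.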
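For reference, one possible way to avoid both the for loop and the data-frame round trip is to wrap the line filter in tm's content_transformer() and apply it with tm_map(). The sketch below is not a confirmed solution. It assumes each document's content is a character vector with one element per line, and the helper drop_matching_lines, the temporary directory and the two sample files are made up for illustration. The tm part runs only if the package is installed.

```r
# Hypothetical helper: keep only the lines that do not match `pattern`.
drop_matching_lines <- function(lines, pattern) {
  lines[!grepl(pattern, lines, ignore.case = TRUE)]
}

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # Build a tiny two-file corpus in a temporary directory (illustrative data).
  corpus_dir <- file.path(tempdir(), "corpus_demo")
  dir.create(corpus_dir, showWarnings = FALSE)
  writeLines(c("Quick to deploy and easy to use.",
               "Apple is a U.S. registered trademark of Apple Corporation."),
             file.path(corpus_dir, "doc1.txt"))
  writeLines(c("Support for industry standards.",
               "Oracle is a U.S. registered trademark of Oracle Corporation."),
             file.path(corpus_dir, "doc2.txt"))

  demo.corpus <- VCorpus(
    DirSource(corpus_dir, pattern = "\\.txt$", encoding = "UTF-8", mode = "text"),
    readerControl = list(language = "en")
  )

  # content_transformer() makes the helper operate on each document's content
  # while leaving its metadata in place; `pattern` is forwarded by tm_map().
  cleaned.corpus <- tm_map(demo.corpus,
                           content_transformer(drop_matching_lines),
                           pattern = "trademark")

  # Inspect the remaining lines of the first document.
  print(content(cleaned.corpus[[1]]))
}
```

Keeping the filter as a plain function on character vectors means it can be checked on its own, without a corpus, before being applied to all 4,000 documents.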
System.info:
[1] "x86_64-w64-mingw32";
[1] "R version 3.1.0 (2014-04-10)"
[1] tm_0.6-2