

I am creating a word cloud with the wordcloud package in R, with the help of "Word Cloud in R" (http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html).

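I can do this easily enough, but I want to remove certain words from the word cloud. I have the words in a file (actually an Excel file, but I could change that), and I want to exclude all of them; there are a couple of hundred. Any suggestions?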

ap.corpus=tm_map(ap.corpus, removePunctuation)
ap.corpus=tm_map(ap.corpus, tolower)
ap.corpus=tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
ap.d=data.frame(word = names(ap.v),freq=ap.v)

@Tyler Rinker has already given the answer: just add another removeWords() line. Here is a bit more detail.

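Suppose your Excel file is called nuts.xls and has a single column of words, headed stopwords, containing words such as peanut, cashew, walnut, almond and macadamia.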


In R you can then proceed like this

     library(gdata) # provides read.xls() for importing Excel files
     library(tm)    # provides Corpus(), tm_map(), removeWords() and stopwords()
     # now load the excel file with the custom stoplist. strip.white=TRUE removes the
     # spaces that excel seems to insert, and stringsAsFactors=FALSE prevents the words
     # from being imported as factors. You can use any args from read.table(), which is
     # handy
     nuts <- read.xls("nuts.xls", header=TRUE, stringsAsFactors=FALSE, strip.white=TRUE)

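     # now make some words to build a corpus to test the two-step stopword removal process...
     words1 <- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
     words2 <- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
     words3 <- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
     # build the test corpus from the three strings (VectorSource() is an assumption here,
     # since the step that creates words.corpus is not shown)
     words.corpus <- Corpus(VectorSource(c(words1, words2, words3)))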

     # now remove the standard list of stopwords, like you've already worked out
     words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
     # now remove the second set of stopwords, this time your custom set from the excel file, 
     # note that it has to be a reference to a character vector containing the custom stopwords
     words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)

     # have a look to see if it worked
     inspect(words.corpus.nostopwords)
     A corpus with 3 text documents

     The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
          create_date creator 
     Available variables in the data frame are:

        , , , , apple, pear, orange, lime, mandarin, , , 

        , , , , apple, pear, orange, lime, mandarin, , , 

        , , , , apple, pear, orange, lime, mandarin, , , 

Success! The standard stopwords are gone, and so are the words from the custom list in the Excel file. No doubt there are other ways to do this.
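One such alternative, since you said you can change the file format, is to save the word column as a plain text file with one word per line and read it with readLines(), which avoids the gdata dependency. The following is only a rough sketch: the file is created on the fly as a temporary stand-in for your real list, and the names stoplist_file, custom_stopwords and docs are made up for illustration. It relies on removeWords() also accepting a plain character vector, so the standard and custom lists can be removed in one call.

```r
library(tm)  # provides removeWords() and stopwords()

# Hypothetical stand-in for your exported word list: one word per line,
# with some stray whitespace and a blank line to show the clean-up step.
stoplist_file <- tempfile(fileext = ".txt")
writeLines(c("peanut", " cashew ", "walnut", "almond", "macadamia", ""),
           stoplist_file)

# Read the custom stoplist, trim whitespace and drop empty lines.
custom_stopwords <- trimws(readLines(stoplist_file))
custom_stopwords <- custom_stopwords[nzchar(custom_stopwords)]

# Small test documents (made up for this sketch).
docs <- c("peanut, cashew, walnut, apple, pear, and, or, but",
          "almond, macadamia, apple, lime, mandarin, if, then, on")

# Remove the standard English stopwords and the custom list in one pass.
cleaned <- removeWords(docs, c(stopwords("english"), custom_stopwords))
print(cleaned)
```

With a corpus, the same combined vector can be passed as the last argument of tm_map(words.corpus, removeWords, ...), just as in the two-step version above.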


