I am having a problem with the tm package in R. I am using version 0.6.2. The issue below (two different errors) has already been answered at https://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument and at https://stackoverflow.com/questions/24771165/r-project-no-applicable-method-for-meta-applied-to-an-object-of-class-charact, but I still get the errors after applying the posted solutions. The dataset (only 93 rows) can be downloaded from https://drive.google.com/file/d/0B2YVITpwU9nPTjdDLUV4YXJFSlU/view?usp=sharing. This is a reproducible example.
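The two errors are:
(Solved) Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
Error: inherits(doc, "TextDocument") is not TRUE
- In this case, tm_map(ds.corpus, PlainTextDocument) does not create a plain text document.
inherits(ds.cleanCorpus, "TextDocument") # returns FALSE
Please tell me what is wrong with my approach.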
--
# Data import
df.imp <- read.csv("Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)
##### Data Pre-Processing
install.packages("tm")
require(tm)
ds.corpus <- Corpus(VectorSource(df.imp$Content))
ds.corpus <- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus <- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus <- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus <- tm_map(ds.corpus, removeURL)
stopwords.default <- stopwords("english")
stopWordsNotDeleted <- c("isn't", "aren't", "wasn't", "weren't", "hasn't",
                         "haven't", "hadn't", "doesn't", "don't", "didn't",
                         "won't", "wouldn't", "shan't", "shouldn't", "can't",
                         "cannot", "couldn't", "mustn't", "but", "no", "nor", "not", "too", "very")
stopWord.new <- stopwords.default[!stopwords.default %in% stopWordsNotDeleted] ## new stopwords list
ds.corpus <- tm_map(ds.corpus, removeWords, stopWord.new)
copy <- ds.corpus ## creating a copy to be used as a dictionary
ds.corpus <- tm_map(ds.corpus, stemDocument)
## error Statement #1
ds.corpus<- stemCompletion(ds.corpus, dictionary = copy)
## Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"
ds.cleanCorpus<- tm_map(ds.corpus, PlainTextDocument) ## creating plain text document
class(ds.cleanCorpus) ## output is "VCorpus" "Corpus"; what should it be?
## error Statement #2
tdm<- TermDocumentMatrix(ds.corpus) ## creating term document matrix
inherits(ds.cleanCorpus, "TextDocument") ## returns FALSE
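A note on the check above, offered as a possible explanation rather than a confirmed diagnosis: inherits(ds.cleanCorpus, "TextDocument") tests the corpus object itself, and a corpus is not a TextDocument, so FALSE would be expected there no matter what the documents look like. The individual documents are what the check applies to. The linked answers also describe tm_map() with a plain function (like removeURL above) as leaving bare character vectors in the corpus, whereas wrapping the function in content_transformer() keeps each element a text document. The following is a minimal sketch on a made-up two-document corpus (the toy strings and the names toy and wrapped are assumptions, not part of the dataset):

```r
library(tm)

# Same URL-stripping function as in the question.
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Toy corpus, made up for illustration (punctuation already removed).
toy <- VCorpus(VectorSource(c("see httpexamplecom now", "no link here")))

# Wrapping the plain function keeps each element a text document.
wrapped <- tm_map(toy, content_transformer(removeURL))

inherits(wrapped, "TextDocument")       # tests the corpus itself
inherits(wrapped[[1]], "TextDocument")  # tests the first document
content(wrapped[[1]])                   # text of the first document
```

If the second check is TRUE for every document, TermDocumentMatrix() should at least get past the inherits(doc, "TextDocument") assertion.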
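Update: I figured out the first error. The x argument of stemCompletion must be a character vector, while the dictionary can be either a corpus or a character vector. However, when I tried it on the first document of ds.corpus (a character vector), as shown below, the stemmed words were not completed; the output is just the same stemmed character vector as before.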
stemCompletion(ds.corpus[[1]]$content, dictionary = copy)
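So my main question now is: how do I complete a stemmed corpus from a dictionary (tm package)? The stemCompletion method does not seem to work (on a character vector). Second, how do I apply stem completion to the whole corpus? Should I loop over the content of each document with a for loop?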
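One thing that may matter here: stemCompletion() treats each element of x as a single stem and looks for dictionary words that start with it, so a whole document passed as one string would be looked up as one long "word" and come back unchanged. The following is a minimal sketch, assuming the document is split on whitespace first; completeDocument is a hypothetical helper (not part of tm), and the dictionary and text are made up for illustration:

```r
library(tm)

# Hypothetical helper (not part of tm): complete one document word by word.
completeDocument <- function(text, dictionary) {
  # Split the document into single stems, dropping empty pieces.
  stems <- unlist(strsplit(text, "[[:space:]]+"))
  stems <- stems[nzchar(stems)]
  completed <- unname(stemCompletion(stems, dictionary = dictionary))
  # Keep the original stem wherever no completion was found (NA or "").
  notFound <- is.na(completed) | !nzchar(completed)
  completed[notFound] <- stems[notFound]
  paste(completed, collapse = " ")
}

# Toy dictionary and stemmed text, made up for illustration.
dict <- c("charging", "screen", "working")
result <- completeDocument("screen charg work xyz", dict)
print(result)
```

For the whole corpus, something like tm_map(ds.corpus, content_transformer(completeDocument), dictionary = copy) might avoid an explicit for loop, though this has not been verified against tm 0.6.2. Note that completion only works when the stem is a literal prefix of a dictionary word, so stems such as "batteri" would stay as they are.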