What if you wrote a removeCommonTerms function like this?
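removeCommonTerms <- function(x, pct) {
  stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
            is.numeric(pct), pct > 0, pct < 1)
  # Work in term-document orientation so that m$i indexes terms
  m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
  # m$i has one entry per non-zero cell, so table(m$i) counts the
  # documents containing each term; keep terms below the threshold
  keep <- table(m$i) < m$ncol * pct
  termIndex <- as.numeric(names(keep[keep]))
  if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
}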
Then, if you want to remove terms that appear in >= 80% of the documents, you can do:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm
# <<DocumentTermMatrix (documents: 20, terms: 1266)>>
# Non-/sparse entries: 2255/23065
# Sparsity : 91%
# Maximal term length: 17
# Weighting : term frequency (tf)
removeCommonTerms(dtm, 0.8)
# <<DocumentTermMatrix (documents: 20, terms: 1259)>>
# Non-/sparse entries: 2129/23051
# Sparsity : 92%
# Maximal term length: 17
# Weighting : term frequency (tf)
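
To see the filtering rule in isolation, here is a minimal base-R sketch on a small made-up count matrix. The matrix m, its term names, and the variable names docFreq, keep, and filtered are hypothetical and not taken from crude or from tm. It counts the documents containing each term and keeps only the terms whose document count is below pct times the number of documents, which is the same comparison removeCommonTerms makes.

```r
# Toy document-term counts: 5 documents (rows) x 3 terms (columns).
# The data are invented purely for illustration.
m <- matrix(c(1, 2, 1, 1, 3,   # "the":  present in all 5 documents
              0, 1, 0, 0, 2,   # "oil":  present in 2 documents
              1, 1, 1, 0, 0),  # "said": present in 3 documents
            nrow = 5,
            dimnames = list(paste0("doc", 1:5), c("the", "oil", "said")))

pct <- 0.8

# Number of documents in which each term occurs at least once
docFreq <- colSums(m > 0)

# Keep terms present in fewer than pct * (number of documents) documents
keep <- docFreq < nrow(m) * pct
filtered <- m[, keep, drop = FALSE]

colnames(filtered)
# [1] "oil"  "said"
```

Here the threshold is 5 * 0.8 = 4 documents, so only "the" (5 documents) is dropped. On a real DocumentTermMatrix the counts come from the sparse triplet representation rather than a dense matrix, but the comparison is the same.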