I want to identify the main n-grams in a batch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords.
I have about 100 PDF files. I converted them to plain-text files with an Adobe batch command and collected them in a single directory. From there I work in R. (The code is a patchwork, as I am just getting started with text mining.)
My code:
library(tm)
# Make path for sub-dir which contains corpus files
path <- file.path(getwd(), "txt")
# Load corpus files
docs <- Corpus(DirSource(path), readerControl=list(reader=readPlain, language="en"))
#Cleaning
docs <- tm_map(docs, content_transformer(tolower))  # wrap base functions so the corpus structure is preserved
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
# Merge corpus (Corpus class to character vector)
txt <- sapply(docs, as.character)
# Find trigrams (but I might look for other ngrams as well)
library(quanteda)
toks <- tokens(txt)
myDfm <- dfm(tokens_ngrams(toks, n = 3))  # quanteda >= 2 builds ngrams at the tokens stage
# Remove sparse features
myDfm <- dfm_trim(myDfm, min_termfreq = 5)
# Display top features
topfeatures(myDfm)
#                   as_well_as             of_the_ecosystem                  in_order_to 
#                          603                          543                          458 
#         a_business_ecosystem       the_business_ecosystem strategic_management_journal 
#                          431                          431                          359 
#             in_the_ecosystem        academy_of_management                  the_role_of 
#                          336                          311                          289 
#                the_number_of 
#                          276 
For example, among the top ngrams shown above, I would want to keep "academy_of_management" but drop "as_well_as" and "the_role_of". I would like the code to work for any n-gram length (preferably including n-grams shorter than 3, although I understand that in that case it would be simpler to remove the stopwords first).
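To make the goal concrete, here is a minimal sketch of the filter I am after, staying with the quanteda objects built above (this assumes quanteda's default "_" concatenator for ngram features):
# Drop any feature whose first or last component is a stopword
feats <- featnames(myDfm)
parts <- strsplit(feats, "_", fixed = TRUE)
drop <- vapply(parts, function(p) {
  p[1] %in% stopwords("en") || p[length(p)] %in% stopwords("en")
}, logical(1))
myDfm <- dfm_select(myDfm, pattern = feats[!drop], valuetype = "fixed")
This keeps "academy_of_management" (the stopword "of" is nested) while dropping "the_role_of" (leading and trailing stopwords).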
You can do this with the corpus R package, using The Wizard of Oz as an example (Project Gutenberg ID #55):
library(corpus)
library(Matrix) # needed for sparse matrix operations
# download the corpus
corpus <- gutenberg_corpus(55)
# set the preprocessing options
text_filter(corpus) <- text_filter(drop_punct = TRUE, drop_number = TRUE)
# compute trigram statistics for terms appearing at least 5 times;
# specify `types = TRUE` to report component types as well
stats <- term_stats(corpus, ngrams = 3, min_count = 5, types = TRUE)
# discard trigrams starting or ending with a stopword
stats2 <- subset(stats, !type1 %in% stopwords_en & !type3 %in% stopwords_en)
# print first five results:
print(stats2, 5)
##    term               type1 type2 type3     count support
## 4  said the scarecrow said  the   scarecrow    36       1
## 7  back to kansas     back  to    kansas       28       1
## 16 said the lion      said  the   lion         19       1
## 17 said the tin       said  the   tin          19       1
## 48 road of yellow     road  of    yellow       12       1
## ⋮  (35 rows total)
# form a document-by-term count matrix for these terms
x <- term_matrix(corpus, select = stats2$term)
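If you want a quick look at which of the retained trigrams dominate across the whole corpus, the column sums of this sparse matrix give the totals (a usage sketch; it relies only on the Matrix package loaded above):
# total occurrences per retained trigram, highest first
head(sort(colSums(x), decreasing = TRUE))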
In your case, you can convert from your tm corpus object with
corpus <- as_corpus_frame(docs)
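To cover the "any n-gram" part of the question, the same filter extends to mixed lengths. This is a sketch under two assumptions I have not verified here: that term_stats() accepts a vector of lengths such as ngrams = 2:3, and that with types = TRUE the unused trailing type columns come back as NA, so the last word of each term must be picked per row:
# compute 2-gram and 3-gram statistics together
stats <- term_stats(corpus, ngrams = 2:3, min_count = 5, types = TRUE)
# the trailing word is type2 for bigrams, type3 for trigrams
t2 <- as.character(stats$type2)
t3 <- as.character(stats$type3)
last <- ifelse(is.na(t3), t2, t3)
keep <- !(stats$type1 %in% stopwords_en) & !(last %in% stopwords_en)
stats2 <- stats[keep, ]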