I am trying to cluster text with LDA, but it is not giving me distinct clusters. Here is my code:
#Import libraries
from gensim import corpora, models
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS
from itertools import chain
#stop words
# a set gives fast membership tests when filtering a large corpus
stoplist = set(STOPWORDS) | {'education', 'certification', 'certificate', 'certified'}
#read data
# raw string so the backslash in the Windows path is not read as an escape
dat = pd.read_csv(r'D:\data_800k.csv', encoding='latin').Certi.tolist()
#remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist] for document in dat]
#dictionary
dictionary = corpora.Dictionary(texts)
#corpus
corpus = [dictionary.doc2bow(text) for text in texts]
#train model
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=25, workers=4,minimum_probability=0)
#print topics
lda.print_topics(num_topics=25, num_words=7)
#get corpus
lda_corpus = lda[corpus]
#calculate cutoff score
scores = [score for doc in lda_corpus for topic_id, score in doc]
#threshold
threshold = sum(scores)/len(scores)
threshold
# Output: 0.039999999971137644
#cluster1
cluster1 = [j for i,j in zip(lda_corpus,dat) if i[0][1] > threshold]
#cluster2
cluster2 = [j for i,j in zip(lda_corpus,dat) if i[1][1] > threshold]
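The problem is that cluster1 contains elements that also appear in cluster2, and the same happens with the other clusters.
I also tried manually raising the threshold to 0.5, but I get the same problem.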
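That is simply realistic.
Neither documents nor words can usually be assigned uniquely to a single cluster.
If you label some data by hand, you will also quickly find documents that cannot be clearly labeled as one cluster or the other. So it is good that the algorithm does not pretend there is a clean, unique assignment.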
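To make the overlap concrete, here is a minimal sketch in plain Python. It assumes document-topic distributions in the same shape that `lda[corpus]` returns, a list of `(topic_id, probability)` pairs per document. The data, document names, and the helper functions `soft_clusters` and `hard_clusters` are hypothetical and only for illustration. A per-topic threshold yields soft, overlapping clusters, while picking the most probable topic forces a single cluster per document, which may hide real ambiguity.

```python
from collections import defaultdict

# Hypothetical document-topic distributions (made-up numbers), shaped like
# the per-document output of lda[corpus]: a list of (topic_id, probability).
doc_topics = [
    [(0, 0.70), (1, 0.20), (2, 0.10)],  # clearly topic 0
    [(0, 0.45), (1, 0.45), (2, 0.10)],  # genuinely split between 0 and 1
    [(0, 0.05), (1, 0.15), (2, 0.80)],  # clearly topic 2
]
docs = ["doc A", "doc B", "doc C"]


def soft_clusters(doc_topics, docs, threshold):
    """Put a document in every topic whose probability exceeds the threshold."""
    clusters = defaultdict(list)
    for doc, topics in zip(docs, doc_topics):
        for topic_id, prob in topics:
            if prob > threshold:
                clusters[topic_id].append(doc)
    return dict(clusters)


def hard_clusters(doc_topics, docs):
    """Put each document only in its single most probable topic."""
    clusters = defaultdict(list)
    for doc, topics in zip(docs, doc_topics):
        # max() returns the first maximal pair, so ties are broken arbitrarily
        best_topic, _ = max(topics, key=lambda pair: pair[1])
        clusters[best_topic].append(doc)
    return dict(clusters)


soft = soft_clusters(doc_topics, docs, threshold=1 / 3)
hard = hard_clusters(doc_topics, docs)
print(soft)
print(hard)
```

In this made-up example "doc B" lands in two soft clusters because its probability mass really is split. The hard assignment puts it in one cluster only by breaking the tie arbitrarily, which is the kind of pretended unique assignment described above.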