我遇到了同样的问题并通过包含参数解决了它minimum_probability=0
当呼叫get_document_topics
的方法gensim.models.ldamodel.LdaModel
对象。
topic_assignments = lda.get_document_topics(corpus,minimum_probability=0)
默认情况下,gensim 不输出低于 0.01 的概率,因此对于任何特定文档,如果有任何主题分配的概率低于此阈值,则该文档的主题概率之和将不等于 1。
这是一个例子:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=100)
# Try values of minimum_probability argument of None (default) and 0
for minimum_probability in (None, 0):
# Get topic probabilites for each document
topic_assignments = lda.get_document_topics(common_corpus,minimum_probability=minimum_probability)
probabilities = [ [entry[1] for entry in doc] for doc in topic_assignments ]
# Print output
print(f"Calculating topic probabilities with minimum_probability argument = {str(minimum_probability)}")
print(f"Sum of probabilites:")
for i, P in enumerate(probabilities):
sum_P = sum(P)
print(f"\tdoc {i} = {sum_P}")
输出将是:
Calculating topic probabilities with minimum_probability argument = None
Sum of probabilities:
doc 0 = 0.6733324527740479
doc 1 = 0.8585712909698486
doc 2 = 0.7549994885921478
doc 3 = 0.8019999265670776
doc 4 = 0.7524996995925903
doc 5 = 0
doc 6 = 0
doc 7 = 0
doc 8 = 0.5049992203712463
Calculating topic probabilities with minimum_probability argument = 0
Sum of probabilites:
doc 0 = 1.0000000400468707
doc 1 = 1.0000000337604433
doc 2 = 1.0000000079162419
doc 3 = 1.0000000284053385
doc 4 = 0.9999999937135726
doc 5 = 0.9999999776482582
doc 6 = 0.9999999776482582
doc 7 = 0.9999999776482582
doc 8 = 0.9999999930150807
文档中没有非常清楚地说明这种默认行为。默认值为minimum_probability
为了get_document_topics
方法是None
,但这并不会将概率设置为零。相反的价值minimum_probability
设置为值minimum_probability
of the gensim.models.ldamodel.LdaModel
对象,默认值为 0.01,如您在源代码 https://github.com/RaRe-Technologies/gensim/blob/996801bb3fb8c4e10a84eefa70f5e2ac738dd47b/gensim/models/ldamodel.py#L347:
def __init__(self, corpus=None, num_topics=100, id2word=None,
distributed=False, chunksize=2000, passes=1, update_every=1,
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
random_state=None, ns_conf=None, minimum_phi_value=0.01,
per_word_topics=False, callbacks=None, dtype=np.float32):