Topic distribution: how to see which document belongs to which topic after running LDA in Python

2024-01-23

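I was able to run the LDA code from gensim and got the top 10 topics with their respective keywords.

Now I would like to go a step further and see how accurate the LDA algorithm is by looking at which documents it clusters into each topic. Is this possible in gensim LDA?

Basically I would like to do something like this, but in Python and using gensim.

LDA with topicmodels, how can I see which topics different documents belong to? https://stackoverflow.com/questions/14875493/lda-with-topicmodels-how-can-i-see-which-topics-different-documents-belong-to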

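Using the topic probabilities, you can try setting a threshold and using it as a clustering baseline, but I am sure there are better ways to cluster than this "hacky" method.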

from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

# Create Dictionary.
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                               update_every=1, chunksize=10000, passes=1)

# Print the topics.
for top in lda.print_topics():
    print(top)
print()

# Assign the topics to the documents in the corpus.
lda_corpus = lda[mm]

# Find the threshold; let's set it to 1/#clusters.
# To check that the threshold is sane, we average all the topic probabilities:
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores) / len(scores)
print(threshold)
print()

# Note: i[k] is topic k only if no topic was dropped from the document's list;
# gensim omits topics whose probability is below the model's minimum_probability.
cluster1 = [j for i, j in zip(lda_corpus, documents) if i[0][1] > threshold]
cluster2 = [j for i, j in zip(lda_corpus, documents) if i[1][1] > threshold]
cluster3 = [j for i, j in zip(lda_corpus, documents) if i[2][1] > threshold]

print(cluster1)
print(cluster2)
print(cluster3)

[out]:

0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

0.333333333333

['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']
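
One caveat about the positional lookups `i[0][1]`, `i[1][1]` and `i[2][1]`: gensim leaves out topics whose probability falls below the model's `minimum_probability`, so a document's list can be shorter than `num_topics`. Below is a minimal sketch of the same threshold clustering keyed by topic id instead of position. The helper name `cluster_by_threshold` and the sample probabilities are made up for illustration; the input is assumed to have the same shape as `lda[mm]`, that is, one list of `(topic_id, probability)` pairs per document.

```python
from collections import defaultdict


def cluster_by_threshold(doc_topics, documents, threshold):
    """Group documents under every topic whose probability exceeds threshold.

    doc_topics: one list of (topic_id, probability) pairs per document,
    the shape produced by iterating over lda[mm].
    """
    clusters = defaultdict(list)
    for topics, document in zip(doc_topics, documents):
        for topic_id, score in topics:
            if score > threshold:
                clusters[topic_id].append(document)
    return dict(clusters)


# Hypothetical probabilities, not real model output.
sample_docs = ["doc a", "doc b", "doc c"]
sample_topics = [
    [(0, 0.80), (1, 0.15), (2, 0.05)],
    [(1, 0.90), (2, 0.10)],  # topic 0 was filtered out
    [(0, 0.40), (2, 0.60)],  # topic 1 was filtered out
]

clusters = cluster_by_threshold(sample_topics, sample_docs, threshold=1.0 / 3)
print(clusters)
```

With the trained model, the call would presumably be `cluster_by_threshold(lda[mm], documents, threshold)`. As with any threshold, a document can end up in several clusters or in none.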

To spell out the threshold computation more clearly:

# Find the threshold; let's set it to 1/#clusters.
# To check that the threshold is sane, we average all the topic probabilities:
scores = []
for doc in lda_corpus:
    # Each doc is a list of (topic_id, score) pairs.
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores) / len(scores)

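The code above sums the topic probabilities over all documents and then divides that sum by the number of scores. Since each document's topic probabilities sum to roughly 1, the average comes out close to 1/num_topics, which matches the 0.333 printed for three topics.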
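
An alternative to thresholding, which is not part of the answer above, is to assign each document to its single most probable topic. The sketch below assumes the same input shape as `lda[mm]` (one non-empty list of `(topic_id, probability)` pairs per document); the function names and the sample values are hypothetical.

```python
def dominant_topic(topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topics, key=lambda pair: pair[1])


def group_by_dominant_topic(doc_topics, documents):
    """Map each topic id to the documents for which it is the top topic."""
    groups = {}
    for topics, document in zip(doc_topics, documents):
        topic_id, _ = dominant_topic(topics)
        groups.setdefault(topic_id, []).append(document)
    return groups


# Hypothetical probabilities, not real model output.
sample_docs = ["doc a", "doc b", "doc c"]
sample_topics = [
    [(0, 0.80), (1, 0.15), (2, 0.05)],
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.40), (1, 0.00), (2, 0.60)],
]

groups = group_by_dominant_topic(sample_topics, sample_docs)
print(groups)
```

This puts every document in exactly one group, at the cost of hiding documents that are split almost evenly between topics, so it may be worth inspecting the winning probability as well.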

python

NLTK

LDA

gensim
