How can I better visualize the word associations of a given text?

2024-01-06

Specifically, I want to visualize all the verbs and adjectives associated with each noun in a document, based on how they occur together in the document.

I couldn't find any existing function for this in Python, so I wrote my own basic one, shown below. However, the visualization still leaves something to be desired:

import nltk
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

def word_association_graph(text):
    # Penn Treebank tag prefixes: 'NN' = noun, 'JJ' = adjective, 'VB' = verb
    is_noun = lambda pos: pos[:2] == 'NN'
    is_adjective_or_verb = lambda pos: pos[:2] == 'JJ' or pos[:2] == 'VB'

    # Collect the nouns of each sentence (naively splitting sentences on '.')
    nouns_in_text = []
    for sent in text.split('.')[:-1]:
        tokenized = nltk.word_tokenize(sent)
        nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
        nouns_in_text.append(' '.join(word for word in nouns if not (word == '' or len(word) == 1)))

    # Deduplicate nouns, preserving order of first appearance
    nouns_list = []
    for sent in nouns_in_text:
        for word in sent.split(' '):
            if word and word not in nouns_list:  # skip '' left by sentences without nouns
                nouns_list.append(word)

    # One row per noun; the second column holds lists, so build it as an object column
    df = pd.DataFrame({'Nouns': nouns_list, 'Verbs & Adjectives': [[] for _ in nouns_list]})

    # For each sentence a noun appears in, record that sentence's verbs and adjectives
    for sent in text.split('.'):
        for noun in nouns_list:
            if noun in sent:
                tokenized = nltk.word_tokenize(sent)
                adjectives_or_verbs = [word for (word, pos) in nltk.pos_tag(tokenized)
                                       if is_adjective_or_verb(pos)]
                ind = df[df['Nouns'] == noun].index[0]
                df.at[ind, 'Verbs & Adjectives'] = adjectives_or_verbs  # .at avoids chained assignment

    # Build the graph: connect each noun to its verbs/adjectives
    fig = plt.figure(figsize=(30, 20))
    G = nx.Graph()
    for i in range(len(df)):
        G.add_node(df['Nouns'][i])
        for word in df['Verbs & Adjectives'][i]:
            G.add_edge(df['Nouns'][i], word)

    pos = nx.spring_layout(G)
    nx.draw(G, pos, with_labels=True, font_size=20)  # pass pos, or the computed layout is ignored
    return fig  # so the caller can save the figure, e.g. as SVG
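For context, the tag prefixes checked above ('NN', 'JJ', 'VB') come from the Penn Treebank tagset that nltk.pos_tag uses; a quick sanity check (assuming the tokenizer and tagger models have been downloaded):

import nltk
# One-time downloads, if not already present:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
print(nltk.pos_tag(nltk.word_tokenize("Wikipedia was launched quickly")))
# e.g. [('Wikipedia', 'NNP'), ('was', 'VBD'), ('launched', 'VBN'), ('quickly', 'RB')]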

So, if we take the first paragraph of Wikipedia's own description of Wikipedia as the example text we want to visualize, it produces the following graph:

import re
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018." 
text = re.sub(r"[\[].*?[\]]", "", text)  # strip bracketed citations; do more processing (lemmatization, stemming, etc.) if you want
fig = word_association_graph(text)

My main problem with this graph is that I can't seem to find a way to increase the separation within the clusters. I tried all the layouts mentioned in the drawing documentation (https://networkx.github.io/documentation/stable/reference/drawing.html), but none of them solved the problem; swapping them in looks like the sketch below.
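For reference, trying the documented layouts amounts to swapping the layout function (a sketch, assuming G is the graph built inside the function above; kamada_kawai_layout additionally requires scipy):

import matplotlib.pyplot as plt
import networkx as nx

# Try a few of the documented layouts on the same graph G
for layout in (nx.spring_layout, nx.kamada_kawai_layout, nx.circular_layout, nx.shell_layout):
    plt.figure(figsize=(30, 20))
    nx.draw(G, layout(G), with_labels=True, font_size=20)
    plt.title(layout.__name__)
    plt.show()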

If anyone knows how to increase the within-cluster separation between the words, that would be great. Otherwise, if there are other good existing libraries that make fancier word-association visualizations, that would be great too.

For now, the "fix" I'm using is to save the plot in SVG format and view it in a browser, so I can look more closely inside the clusters:

fig.savefig(r'path\wiki_net.svg', format='svg', dpi=1200)  # raw string, so '\w' isn't treated as an escape

You can get better separation by tuning the layout and the parameters used to build it. More specifically, if you stay with spring_layout, use the k parameter to get better separation between the nodes:

...
pos = nx.spring_layout(G, k=0.5)
nx.draw(G, pos, with_labels=True, font_size=20)
plt.show() 

k (float (default=None)) – Optimal distance between nodes. If None the distance is set to 1/sqrt(n) where n is the number of nodes. Increase this value to move nodes farther apart.

With k=0.5 I got: (figure: the same graph, with the nodes spread farther apart)
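Since the default is 1/sqrt(n), one way to pick k (a sketch, not part of the original answer) is as a multiple of that default, so the setting adapts to the graph's size:

import numpy as np

# Spread nodes to roughly 3x the default optimal distance of 1/sqrt(n)
n = G.number_of_nodes()
pos = nx.spring_layout(G, k=3 / np.sqrt(n))
nx.draw(G, pos, with_labels=True, font_size=20)
plt.show()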
