Are the words returned by Text.similar() and ContextIndex.similar_words() in NLTK sorted by frequency?

2024-03-11

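I use these two functions to find similar words, and they return different lists. Are the results of these functions sorted from the most frequently associated word to the least frequently associated one?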

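ContextIndex.similar_words(word) (http://nltk.org/_modules/nltk/text.html#ContextIndex.similar_words) computes each word's similarity score as the sum, over every context it shares with the target word, of the product of the two words' frequencies in that context. Text.similar() (http://nltk.org/_modules/nltk/text.html#Text.similar) simply counts the number of distinct contexts the two words share.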
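
To see how the two scores can rank words differently, here is a minimal sketch on a made-up toy corpus. The token list and the helper names product_scores and shared_context_counts are hypothetical, and the context is simplified to the (previous word, next word) pair without NLTK's sentence-boundary handling, so this only illustrates the two formulas and is not NLTK's own code.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a context is the (previous word, next word) pair.
tokens = ("a man who . a man who . a man who . a woman who . "
          "the woman said . the girl said . a girl who .").split()

context_to_words = defaultdict(Counter)   # context -> Counter of words seen in it
word_to_contexts = defaultdict(Counter)   # word -> Counter of contexts it occurs in
for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
    context_to_words[(prev, nxt)][word] += 1
    word_to_contexts[word][(prev, nxt)] += 1

def product_scores(target):
    """similar_words()-style score: sum of frequency products over shared contexts."""
    scores = Counter()
    for c, target_freq in word_to_contexts[target].items():
        for w, w_freq in context_to_words[c].items():
            if w != target:
                scores[w] += target_freq * w_freq
    return scores

def shared_context_counts(target):
    """similar()-style score: number of distinct contexts shared with the target."""
    contexts = set(word_to_contexts[target])
    return Counter(w for w in word_to_contexts if w != target
                   for c in word_to_contexts[w] if c in contexts)

print(product_scores("woman").most_common())         # [('man', 3), ('girl', 2)]
print(shared_context_counts("woman").most_common())  # [('girl', 2), ('man', 1)]
```

In this toy data 'man' wins on the product score because 'a man who' is frequent, while 'girl' wins on the shared-context count because it shares two distinct contexts with 'woman'.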

similar_words() in NLTK 2.0 appears to contain a bug. See its definition in nltk/text.py (http://nltk.org/_modules/nltk/text.html#ContextIndex.similar_words):

def similar_words(self, word, n=20):
    scores = defaultdict(int)
    for c in self._word_to_contexts[self._key(word)]:
        for w in self._context_to_words[c]:
            if w != word:
                # Debugging line added for this answer; it produces the trace shown further down.
                print w, c, self._context_to_words[c][word], self._context_to_words[c][w]
                scores[w] += self._context_to_words[c][word] * self._context_to_words[c][w]
    return sorted(scores, key=scores.get)[:n]

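The returned word list should be sorted by similarity score in descending order, but sorted() sorts in ascending order, so the function returns the n lowest-scoring words. Replace the return statement with: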

return sorted(scores, key=scores.get)[::-1][:n]
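
The effect of the sort direction can be checked on a small dictionary. This is only a sketch with made-up scores; passing reverse=True is an alternative spelling of the [::-1] fix above and gives the same order whenever there are no tied scores.

```python
# Hypothetical similarity scores, for illustration only.
scores = {"man": 12, "girl": 5, "boy": 2}
n = 2

ascending = sorted(scores, key=scores.get)[:n]                  # original return statement
reversed_slice = sorted(scores, key=scores.get)[::-1][:n]       # the fix shown above
reverse_flag = sorted(scores, key=scores.get, reverse=True)[:n] # equivalent alternative

print(ascending)       # ['boy', 'girl']  (the least similar words)
print(reversed_slice)  # ['man', 'girl']
print(reverse_flag)    # ['man', 'girl']
```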

In similar(), the call to similar_words() is commented out, possibly because of this bug.

def similar(self, word, num=20):
    if '_word_context_index' not in self.__dict__:
        print 'Building word-context index...'
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x:x.isalpha(),
                                                key=lambda s:s.lower())

#   words = self._word_context_index.similar_words(word, num)

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = FreqDist(w for w in wci.conditions() for c in wci[w]
                      if c in contexts and not w == word)
        words = fd.keys()[:num]
        print tokenwrap(words)
    else:
        print "No matches"

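Note: unlike a dict, a FreqDist in NLTK 2.0 returns its keys() as a sorted list, ordered by decreasing frequency, so fd.keys()[:num] yields the num most frequent entries.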
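
That ordering is specific to NLTK 2.x. To my knowledge, in NLTK 3 FreqDist is a subclass of collections.Counter, so keys() is no longer sorted and most_common(n) is the usual way to get the top entries. The sketch below uses a plain Counter with made-up counts as a stand-in for the fd built inside similar(), so it runs without NLTK installed.

```python
from collections import Counter

# Hypothetical shared-context counts, standing in for the fd built in similar().
fd = Counter({"day": 30, "man": 52, "car": 24, "year": 28})

# most_common(n) returns (word, count) pairs ordered by decreasing count.
top_words = [word for word, count in fd.most_common(3)]
print(top_words)  # ['man', 'day', 'year']
```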

Example:

import nltk

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

similar_words = text._word_context_index.similar_words('woman')
print ' '.join(similar_words)

Output:

man day time year car moment world family house boy child country
job state girl place war way case question   # Text.similar()

#man ('a', 'who') 9 39   # output from similar_words(); see following explanation
#girl ('a', 'who') 9 6
#[...]

man number time world fact end year state house way day use part
kind boy matter problem result girl group   # ContextIndex.similar_words()

fd, the frequency distribution built in similar(), tallies for each word the number of contexts it shares with the target word:

fd = [('man', 52), ('day', 30), ('time', 30), ('year', 28), ('car', 24), ('moment', 24), ('world', 23) ...]

For each word in each shared context, similar_words() adds the product of the two frequencies to that word's running score:

man ('a', 'who') 9 39  # 'a man who' occurs 39 times in text;
                       # 'a woman who' occurs 9 times
                       # Similarity score for the context is the product:
                       #     score['man'] = 9 * 39
girl ('a', 'who') 9 6
writer ('a', 'who') 9 4
boy ('a', 'who') 9 3
child ('a', 'who') 9 2
dealer ('a', 'who') 9 2
...
man ('a', 'and') 6 11  # score += 6 * 11
...
man ('a', 'he') 4 6    # score += 4 * 6
...
[49 more occurrences of 'man']
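
As a worked check of the arithmetic, the three contexts listed above contribute 9 * 39 + 6 * 11 + 4 * 6 to the score of 'man'. The snippet below just sums those three rows; the other 49 contexts are not listed, so the result is a partial score and not the final one.

```python
# (context, frequency of 'woman', frequency of 'man'), copied from the trace above.
rows = [
    (("a", "who"), 9, 39),
    (("a", "and"), 6, 11),
    (("a", "he"), 4, 6),
]

# Sum of frequency products over the listed contexts only.
partial_score = sum(woman_freq * man_freq for _, woman_freq, man_freq in rows)
print(partial_score)  # 441
```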
