Assigning 0 to certain words when the words are not present

2024-01-07

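This is my first post on stackoverflow and I am fairly new to coding. So, please bear with me.

I am working on an experiment with two sets of data documents. Doc1 looks like this: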

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398

TOPIC:topic_2 ....
.....
.....

TOPIC:topic_3 1066.0
say 0.062
word 0.182

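and so on, up to 100 topics.

In this document, some words appear in every topic and others in only a few. So I want to run a process where, if a word is not present in a topic, that word gets the value 0 in that topic. For example, the word BBC exists in the second topic (topic_1) but not in the first (topic_0), so I would like my list to be: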

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
Mr 0
s 0
president 0
tell 0
BBC 0

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
site 0
Internet 0
online 0
web 0
image 0

I have to multiply these values by another set of values stored in a second document. For that, I have this code:

from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # all multipliers from doc2, as a list so it can be reused for every word
    values = list(map(float, f2.read().split()))
    for line in f:
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))

My doc2 has this format:

  0.566667 0.0333333 0.133333 0 0 0  2.43333 0 0.13333......... till 100 values.

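Take the word "say" in the code above. The code finds that the word appears in 3 topics and collects its values in a list such as [0.015, 0.045, 0.062]. That list is multiplied with the values from doc2: 0.015 by the 0th value in doc2, 0.045 by the 1st value, and 0.062 by the 2nd value. But this is not what I want. As we can see, the word "say" is not in topic_2, so the list must be [0.015, 0.045, 0, 0.062]. Then, when these values are multiplied by the values at their respective positions in doc2, they give: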

P(SAY) = (0.566667*0.015) + (0.0333333*0.045) + (0.133333 *0) + (0*0.062)

So the code is fine and only needs this one modification.
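As a quick check of the target arithmetic for P(SAY), here is a minimal sketch in Python. It uses only the rounded values quoted above; the names `probs` and `weights` are made up for illustration and do not come from the original code.

```python
# Rounded values of "say" per topic, with an explicit 0 for topic_2
probs = [0.015, 0.045, 0, 0.062]
# First four values from doc2, as quoted in the question
weights = [0.566667, 0.0333333, 0.133333, 0]

# Multiply position by position and add up the products
p_say = sum(p * w for p, w in zip(probs, weights))
print("P(say) = {:.6f}".format(p_say))
```

Without the explicit 0 for topic_2, the value 0.062 would be paired with 0.133333 instead of with the fourth value of doc2, which is the misalignment described above.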

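The problem is that you are treating the topics as one whole; if you want the individual sections, use groupby. The code from the original answer (https://stackoverflow.com/a/31506466/2141635) first collects a set of all names, then compares that set against the defaultdict keys to find what is missing in each section: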

from collections import defaultdict
from itertools import groupby

d = defaultdict(float)
with open("doc1") as f, open("doc2") as f2:
    # iterator over the per-topic multipliers from doc2
    values = iter(map(float, f2.read().split()))
    # find every word in every TOPIC
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)  # reset pointer
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            # first line of the group is the TOPIC header
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * flt
            # get difference in all_words vs words in current TOPIC
            # giving 0 as default for missing values
            for word in all_words.difference(d):
                d[word] = 0
            for name, prob in d.items():
                print("Prob for {} is {}".format(name, prob))
            d = defaultdict(float)
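To see how the groupby key splits the input into sections, here is a small self-contained sketch. It runs on an in-memory list instead of a file, and the shortened values and the names `lines` and `sections` are made up for illustration.

```python
from itertools import groupby

# A tiny in-memory stand-in for doc1 (made-up, shortened values)
lines = [
    "TOPIC:topic_0 5892.0\n",
    "site 0.037\n",
    "say 0.016\n",
    "\n",
    "TOPIC:topic_1 12366.0\n",
    "Mr 0.150\n",
    "say 0.045\n",
]

sections = []
# The key is True for blank lines and False otherwise, so consecutive
# non-blank lines end up together in one group.
for is_blank, group in groupby(lines, key=lambda x: not x.strip()):
    if not is_blank:  # skip the groups made of empty lines
        header = next(group).rstrip()          # first line is the TOPIC header
        words = [s.split()[0] for s in group]  # remaining lines are "word value"
        sections.append((header, words))

for header, words in sections:
    print(header, words)
```

Because groupby only groups consecutive items, this relies on the topics being separated by at least one empty line, as they are in doc1.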

To store all the output, you can append the dicts to a list:

from collections import defaultdict
from itertools import groupby

d = defaultdict(float)
with open("doc1") as f, open("doc2") as f2:
    # iterator over the per-topic multipliers from doc2
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    out = []
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            # first line of the group is the TOPIC header
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * flt
            for word in all_words.difference(d):
                d[word] = 0
            out.append(d)
            d = defaultdict(float)

Then iterate over the list:

for top in out:
    for k, v in top.items():
        print("Prob for {} is {}".format(k, v))

Or forget the defaultdict and use dict.fromkeys:

from itertools import groupby

with open("doc1") as f, open("doc2") as f2:
    # iterator over the per-topic multipliers from doc2
    values = iter(map(float, f2.read().split()))
    all_words = [line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")]
    f.seek(0)
    # every word starts at 0.0; repeated words collapse into a single key
    out, d = [], dict.fromkeys(all_words, 0.0)
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            # first line of the group is the TOPIC header
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * flt
            out.append(d)
            d = dict.fromkeys(all_words, 0.0)
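As a small illustration of why dict.fromkeys works here even though all_words is a list that may repeat a word, consider the sketch below. The word list is made up; only `dict.fromkeys` itself comes from the code above.

```python
# "say" appears twice, as it would when it occurs in two topics
words = ["site", "say", "Mr", "say"]

# One key per distinct word, each starting at 0.0
d = dict.fromkeys(words, 0.0)
print(d)

# Floats are immutable, so the shared default value is safe to update per key
d["say"] += 0.5

# A fresh dict for the next topic starts at 0.0 again
fresh = dict.fromkeys(words, 0.0)
print(fresh["say"])
```

The shared default would be a problem only with a mutable value such as a list, which is not the case in this answer.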

If you always want the missing words at the end, use a collections.OrderedDict and, as in the first approach, add the missing values at the end of the dict:

from collections import OrderedDict
from itertools import groupby

with open("doc1") as f, open("doc2") as f2:
    # iterator over the per-topic multipliers from doc2
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    out = []
    d = OrderedDict()  # must exist before the first group is processed
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            # first line of the group is the TOPIC header
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d.setdefault(name, float(val) * flt)
            # missing words are added last, so they end up at the end
            for word in all_words.difference(d):
                d[word] = 0
            out.append(d)
            d = OrderedDict()

for top in out:
    for k, v in top.items():
        print("Prob for {} is {}".format(k, v))

Finally, to store everything in order and by topic:

from collections import OrderedDict
from itertools import groupby

with open("doc1") as f, open("doc2") as f2:
    # iterator over the per-topic multipliers from doc2
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    out = OrderedDict()
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            topic = next(v).rstrip()
            # create OrderedDict for each topic
            out[topic] = OrderedDict()
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                out[topic].setdefault(name, float(val) * flt)
            # find words missing from TOPIC and set to 0
            for word in all_words.difference(out[topic]):
                out[topic][word] = 0

for topic, probs in out.items():
    print(topic)  # each TOPIC
    for name, prob in probs.items():
        print("Prob for {} is {}".format(name, prob))  # the OrderedDict items
    print("\n")

doc1:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398

doc2:

0.345 0.566667

Output:

TOPIC:topic_0 5892.0
Prob for site is 0.0128233197556
Prob for Internet is 0.00901731160895
Prob for online is 0.00790478615073
Prob for web is 0.00755346232181
Prob for say is 0.00550407331974
Prob for image is 0.00521130346231
Prob for BBC is 0
Prob for Mr is 0
Prob for s is 0
Prob for president is 0
Prob for tell is 0


TOPIC:topic_1 12366.0
Prob for Mr is 0.085187930859
Prob for s is 0.0293277438137
Prob for say is 0.0255701266375
Prob for president is 0.00870667394471
Prob for tell is 0.0076985327511
Prob for BBC is 0.0076985327511
Prob for web is 0
Prob for image is 0
Prob for online is 0
Prob for site is 0
Prob for Internet is 0

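You can apply exactly the same logic with a regular for loop; groupby just does all the grouping work for you.

If you actually just want to write to a file, the code is even simpler: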
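For comparison, a regular-loop version might look like the sketch below. It is only an illustration under a few assumptions: `per_topic_products` is a hypothetical helper name, it takes in-memory lines rather than file handles, and it starts a new section at each line beginning with "TOPIC" instead of relying on the empty lines.

```python
def per_topic_products(lines, weights):
    """Return a list of (topic_header, {word: value * weight}) pairs.

    Hypothetical helper: `lines` is any iterable of doc1 lines and
    `weights` holds the floats from doc2, one per topic.
    """
    lines = list(lines)
    # every word that appears in any topic
    all_words = {ln.split()[0] for ln in lines
                 if ln.strip() and not ln.startswith("TOPIC")}
    weights = iter(weights)
    out, current, flt = [], None, None
    for ln in lines:
        if not ln.strip():
            continue  # empty separator line
        if ln.startswith("TOPIC"):
            # a header starts a new section and consumes the next weight
            current = {}
            flt = next(weights)
            out.append((ln.rstrip(), current))
        else:
            name, val = ln.split()
            current[name] = float(val) * flt
    # fill in the words each topic is missing
    for _, probs in out:
        for word in all_words.difference(probs):
            probs[word] = 0
    return out


# Made-up, shortened sample in the shape of doc1
sample = [
    "TOPIC:topic_0 5892.0", "site 0.037", "say 0.016", "",
    "TOPIC:topic_1 12366.0", "Mr 0.150", "say 0.045",
]
result = per_topic_products(sample, [0.345, 0.566667])
for header, probs in result:
    print(header, probs)
```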

from itertools import groupby

with open("doc1") as f, open("doc2") as f2, open("prob.txt", "w") as f3:
    # iterator over the per-topic multipliers from doc2
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            topic, words = next(v), []
            flt = next(values)
            f3.write(topic)  # the header line still carries its newline
            for s in v:
                name, val = s.split()
                words.append(name)
                f3.write("{} {}\n".format(name, float(val) * flt))
            # words missing from this TOPIC are written with 0
            for word in all_words.difference(words):
                f3.write("{} {}\n".format(word, 0))
            f3.write("\n")

prob.txt:

TOPIC:topic_0 5892.0
site 0.0128233197556
Internet 0.00901731160895
online 0.00790478615073
web 0.00755346232181
say 0.00550407331974
image 0.00521130346231
BBC 0
Mr 0
s 0
president 0
tell 0

TOPIC:topic_1 12366.0
Mr 0.085187930859
s 0.0293277438137
say 0.0255701266375
president 0.00870667394471
tell 0.0076985327511
BBC 0.0076985327511
web 0
image 0
online 0
site 0
Internet 0
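The question's end goal was a single sum per word, such as P(SAY). Assuming the per-topic dicts have been collected in a list like `out` from the earlier snippet, the products can be added up as sketched below. The sample numbers are rounded and made up for illustration, and `totals` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical per-topic results in the shape of the answer's `out` list:
# one {word: probability * weight} dict per topic, zeros already filled in.
out = [
    {"site": 0.0128, "say": 0.0055, "Mr": 0, "BBC": 0},
    {"site": 0, "say": 0.0256, "Mr": 0.0852, "BBC": 0.0077},
]

# Add up each word's products across all topics
totals = defaultdict(float)
for topic_probs in out:
    for word, value in topic_probs.items():
        totals[word] += value

for word, total in totals.items():
    print("P({}) = {}".format(word, total))
```

The zero entries do not change the sums; they only guarantee that every word is present in every topic.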
