Membership testing in a large list that has some wildcards

2024-04-30

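How do I test whether a phrase is in a large (650k) list of phrases when that list includes special categories?

For instance, I want to test whether the phrase ["he", "had", "the", "nerve"] is in the list. It is, but it is stored as ["he", "had", "!DETERMINER", "nerve"], where "!DETERMINER" is the name of a word class that contains several choices (a, an, the). I have about 350 word classes, and some of them are quite long, so I don't think it is feasible to enumerate every expansion of each list item that contains one (or more) word classes.

I would like to use a set of these phrases instead of slowly working my way through a list, but I don't know how to deal with the variability of the word classes. Speed is pretty important, since I need to make this comparison hundreds of thousands of times per run.

Similar to pjwerneck's suggestion, you could use a tree (or more specifically a trie, https://en.wikipedia.org/wiki/Trie) to store the phrases word by word, but extend it to treat the categories specially.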

# phrase_trie.py

from collections import defaultdict

CATEGORIES = {"!DETERMINER": set(["a","an","the"]),
              "!VERB": set(["walked","talked","had"])}

def get_category(word):
    for name,words in CATEGORIES.items():
        if word in words:
            return name
    return None

class PhraseTrie(object):
    def __init__(self):
        self.children = defaultdict(PhraseTrie)
        self.categories = defaultdict(PhraseTrie)

    def insert(self, phrase):
        if not phrase: # nothing to insert
            return

        this=phrase[0]
        rest=phrase[1:]

        if this in CATEGORIES: # it's a category name
            self.categories[this].insert(rest)
        else:
            self.children[this].insert(rest)

    def contains(self, phrase):
        if not phrase:
            return True # the empty phrase is in everything

        this=phrase[0]
        rest=phrase[1:]

        test = False

        # the `if not test` are because if the phrase satisfies one of the
        # previous tests we don't need to bother searching more

        # allow search for ["!DETERMINER", "cat"]
        if this in self.categories: 
            test = self.categories[this].contains(rest)

        # the word is literally contained
        if not test and this in self.children:
            test = self.children[this].contains(rest)

        if not test:
            # check for the word being in a category class like "a" in
            # "!DETERMINER"
            cat = get_category(this)
            if cat in self.categories:
                test = self.categories[cat].contains(rest)
        return test

    def __str__(self):
        return '(%s,%s)' % (dict(self.children), dict(self.categories))
    def __repr__(self):
        return str(self)

if __name__ == '__main__':
    words = PhraseTrie()
    words.insert(["he", "had", "!DETERMINER", "nerve"])
    words.insert(["he", "had", "the", "evren"])
    words.insert(["she", "!VERB", "the", "nerve"])
    words.insert(["no","categories","here"])

    for phrase in ("he had the nerve",
                   "he had the evren",
                   "she had the nerve",
                   "no categories here",
                   "he didn't have the nerve",
                   "she had the nerve more"):
        # print() as a function so the script runs on Python 3
        print('%25s =>' % phrase, words.contains(phrase.split()))

Running python phrase_trie.py:

         he had the nerve => True
         he had the evren => True
        she had the nerve => True
       no categories here => True
 he didn't have the nerve => False
   she had the nerve more => False
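
One property of contains as written: because the empty phrase counts as present in every node, any prefix of a matching phrase (for example "he had the") also tests True. If only complete phrases should match, one option is to flag the nodes where an inserted phrase ends. The following is a minimal sketch of that variation, not the original answer's code; the class name ExactPhraseTrie, the is_end attribute, and the index parameter i are names introduced here for illustration.

```python
from collections import defaultdict

CATEGORIES = {"!DETERMINER": {"a", "an", "the"},
              "!VERB": {"walked", "talked", "had"}}

def get_category(word):
    for name, words in CATEGORIES.items():
        if word in words:
            return name
    return None

class ExactPhraseTrie(object):
    def __init__(self):
        self.children = defaultdict(ExactPhraseTrie)
        self.categories = defaultdict(ExactPhraseTrie)
        self.is_end = False  # True only at a node where an inserted phrase ends

    def insert(self, phrase):
        node = self
        for word in phrase:
            # category names go in their own branch, as in PhraseTrie
            branch = node.categories if word in CATEGORIES else node.children
            node = branch[word]
        node.is_end = True

    def contains(self, phrase, i=0):
        if i == len(phrase):
            return self.is_end  # a prefix alone no longer counts as a match
        this = phrase[i]
        # searching for a category name directly, e.g. ["!DETERMINER", "cat"]
        if this in self.categories and self.categories[this].contains(phrase, i + 1):
            return True
        # the word is literally contained
        if this in self.children and self.children[this].contains(phrase, i + 1):
            return True
        # the word belongs to a category stored at this node
        cat = get_category(this)
        if cat is not None and cat in self.categories:
            return self.categories[cat].contains(phrase, i + 1)
        return False

if __name__ == '__main__':
    words = ExactPhraseTrie()
    words.insert(["he", "had", "!DETERMINER", "nerve"])
    for phrase in ("he had the nerve", "he had the", "he had the nerve more"):
        print('%25s =>' % phrase, words.contains(phrase.split()))
```

Passing an index instead of slicing phrase[1:] at every level also avoids creating a new list per word, which may matter when the test runs hundreds of thousands of times.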

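A few notes on the code:

The use of defaultdict avoids having to check whether a sub-trie exists before calling insert; it is created and initialised automatically when needed.
If there are going to be a lot of calls to get_category, it might be worth constructing a reverse look-up dictionary for speed. (Or, even better, memoise the calls to get_category, so that common words have fast look-ups but you don't waste memory storing words you never look up.)
The code assumes that each word belongs to only one category. (If not, the only changes are that get_category returns a list and the relevant section of PhraseTrie loops through that list.)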
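
The reverse look-up mentioned in the notes could be sketched as follows. This is an illustration under assumptions, not part of the original answer: build_reverse_lookup, WORD_TO_CATEGORIES, and get_categories are hypothetical names, and the "!AUXILIARY" category is made up here only to show a word that belongs to more than one category.

```python
from collections import defaultdict

# Same toy categories as in phrase_trie.py, plus a made-up overlapping
# category ("!AUXILIARY") so that "had" belongs to two categories.
CATEGORIES = {
    "!DETERMINER": {"a", "an", "the"},
    "!VERB": {"walked", "talked", "had"},
    "!AUXILIARY": {"had", "has"},
}

def build_reverse_lookup(categories):
    """Map each word to the set of category names that contain it."""
    reverse = defaultdict(set)
    for name, words in categories.items():
        for word in words:
            reverse[word].add(name)
    return dict(reverse)

# Built once up front; each later look-up is a single dictionary access
# instead of a scan over all ~350 categories.
WORD_TO_CATEGORIES = build_reverse_lookup(CATEGORIES)

def get_categories(word):
    """Return every category containing `word` (empty if there is none)."""
    return WORD_TO_CATEGORIES.get(word, frozenset())

if __name__ == '__main__':
    print(sorted(get_categories("had")))    # ['!AUXILIARY', '!VERB']
    print(sorted(get_categories("the")))    # ['!DETERMINER']
    print(sorted(get_categories("nerve")))  # []
```

With a function like this in place, the last branch of contains would loop over get_categories(this) and return True as soon as one of the matching category sub-tries contains the rest of the phrase. The trade-off against memoising get_category is memory: the reverse dictionary stores every word of every category, whether or not it is ever looked up.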
