Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words

2024-03-12

I am making a chatbot in Python. Code:

import nltk
import numpy as np
import random
import string
# Read the corpus and lowercase it; the with block closes the file afterwards.
with open('/home/hostbooks/ML/stewy/speech/chatbot.txt', 'r', errors='ignore') as f:
    raw = f.read().lower()

sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences 
word_tokens = nltk.word_tokenize(raw)# converts to list of words

lemmer = nltk.stem.WordNetLemmatizer()    

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey","hii")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]


def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

def response(user_response):
    robo_response=''
    sent_tokens.append(user_response)    

    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx=vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]    

    if(req_tfidf==0):
        robo_response=robo_response+"I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response+sent_tokens[idx]
        return robo_response

flag=True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while(flag==True):
    user_response = input()
    user_response=user_response.lower()
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_response)!=None):
                print("ROBO: "+greeting(user_response))
            else:
                print("ROBO: ",end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag=False
        print("ROBO: Bye! take care..")

It runs fine, but this warning appears on every turn of the conversation:

/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. 

Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.

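Here are some conversations from the command line:

ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.

what is india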

    /home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))

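ROBO: india's wildlife, which has traditionally been viewed with tolerance in india's culture, is supported among these forests, and elsewhere, in protected habitats.

what is a chatbot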

    /home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))

ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.

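The reason is that you pass a custom tokenizer together with stop_words='english'. When features are extracted, scikit-learn checks whether stop_words and the tokenizer are consistent with each other. It runs every stop word through your tokenizer, and because LemNormalize lemmatizes its tokens, stop words such as 'has', 'less', 'us' and 'was' most likely come out as 'ha', 'le', 'u' and 'wa', which are not in the stop word list.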
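One possible way to silence the warning is to run the built-in stop words through the same tokenizer and pass the result as stop_words. The sketch below assumes a hypothetical toy_normalize function as a stand-in for LemNormalize (it only strips trailing "s" characters, so it runs without the NLTK WordNet data), and the three documents are made up for illustration. With the real lemmatizer the idea is the same, although it only works if tokenizing an already normalized stop word returns it unchanged.

```python
import warnings

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def toy_normalize(text):
    """Hypothetical stand-in for LemNormalize: lowercase, split, strip trailing 's'."""
    stripped = (word.rstrip("s") for word in text.lower().split())
    return [token for token in stripped if token]


# Run every built-in stop word through the tokenizer the vectorizer will use,
# so the stop word list and the tokenizer output agree.
normalized_stop_words = sorted(
    {token for word in ENGLISH_STOP_WORDS for token in toy_normalize(word)}
)

# Made-up documents, only for illustration.
docs = [
    "a chatbot is a piece of software",
    "india has diverse wildlife",
    "what is a chatbot",
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    vectorizer = TfidfVectorizer(tokenizer=toy_normalize,
                                 stop_words=normalized_stop_words)
    tfidf = vectorizer.fit_transform(docs)

# Keep only warnings that mention stop_words; other warnings are ignored here.
stop_word_warnings = [w for w in caught if "stop_words" in str(w.message)]
print(len(stop_word_warnings))
```

In the chatbot code the equivalent change would be to build the list with LemNormalize instead of toy_normalize and pass it to TfidfVectorizer in place of 'english'.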

If you dig deeper into the code in sklearn/feature_extraction/text.py, you will find this snippet, which performs the consistency check:

def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
    """Check if stop words are consistent

    Returns
    -------
    is_consistent : True if stop words are consistent with the preprocessor
                    and tokenizer, False if they are not, None if the check
                    was previously performed, "error" if it could not be
                    performed (e.g. because of the use of a custom
                    preprocessor / tokenizer)
    """
    if id(self.stop_words) == getattr(self, '_stop_words_id', None):
        # Stop words are were previously validated
        return None

    # NB: stop_words is validated, unlike self.stop_words
    try:
        inconsistent = set()
        for w in stop_words or ():
            tokens = list(tokenize(preprocess(w)))
            for token in tokens:
                if token not in stop_words:
                    inconsistent.add(token)
        self._stop_words_id = id(self.stop_words)

        if inconsistent:
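            warnings.warn('Your stop_words may be inconsistent with '
                          'your preprocessing. Tokenizing the stop '
                          'words generated tokens %r not in '
                          'stop_words.' % sorted(inconsistent))
    # (excerpt: the rest of the method, including the except branch
    # that closes this try block, is omitted)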

As you can see, it issues a warning whenever it finds an inconsistency.
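The core of that check can be reproduced in a few lines of plain Python. This is a simplified sketch, not scikit-learn's code: find_inconsistent_tokens and toy_tokenizer are hypothetical names, the five stop words are a hand-picked sample, and the tokenizer is a crude stand-in for a lemmatizer that just strips trailing "s" characters.

```python
def find_inconsistent_tokens(stop_words, tokenize):
    """Return the sorted tokens that tokenizing the stop words produces
    but that are not themselves in the stop word list."""
    stop_words = set(stop_words)
    inconsistent = set()
    for word in stop_words:
        for token in tokenize(word):
            if token not in stop_words:
                inconsistent.add(token)
    return sorted(inconsistent)


def toy_tokenizer(text):
    """Crude stand-in for a lemmatizing tokenizer: strip trailing 's'."""
    return [word.rstrip("s") for word in text.split()]


# Hand-picked sample of stop words, chosen for illustration.
sample_stop_words = ["has", "was", "us", "less", "the"]

print(find_inconsistent_tokens(sample_stop_words, toy_tokenizer))
# ['ha', 'le', 'u', 'wa']
```

"the" passes through the tokenizer unchanged, so it is consistent; the other four words are shortened to tokens that the list does not contain, which is the situation the warning reports.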

Hope this helps.


python

python3x

NLP

NLTK

Chatbot
