How can I make NLTK's word_tokenize ignore punctuation between words?

2024-06-24

I would like NLTK's word_tokenize to ignore punctuation characters that appear between words.

Suppose I have this sentence:

test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email [email protected]'

word_tokenize splits S&P into

'S','&','P','?'

Is there a way to make the library ignore punctuation between words or letters? Expected output: 'S&P','?'

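Let me know how this works for your sentence.
I added an additional test with a bunch of punctuation.
The regular expression in the final part is modified from the WordPunctTokenizer regular expression.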
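For comparison, here is a minimal sketch of the unmodified starting point. It assumes the pattern NLTK documents for WordPunctTokenizer, `\w+|[^\w\s]+`, and applies it with the standard-library `re` module so it runs without NLTK installed. The helper name `word_punct_tokens` is hypothetical.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT_RE = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokens(text):
    """Split text with the unmodified pattern (hypothetical helper)."""
    return WORD_PUNCT_RE.findall(text)


# 'S&P' is split at the ampersand, which is the behaviour the question
# wants to avoid.
print(word_punct_tokens("Should I trade on the S&P?"))
# ['Should', 'I', 'trade', 'on', 'the', 'S', '&', 'P', '?']
```

Because `&` is neither a word character nor whitespace, it always becomes its own token here. The modification has to let a punctuation mark sit inside a word match.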

from nltk.tokenize import RegexpTokenizer

punctuation = r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]?'
tokenizer = RegexpTokenizer(r'\w+' + punctuation + r'\w+?|[^\s]+?')

# result: 
In [156]: tokenizer.tokenize(test)
Out[156]: ['Should', 'I', 'trade', 'on', 'the', 'S&P', '?']

# additional test:
In [225]: tokenizer.tokenize('"I am tired," she said.')
Out[225]: ['"', 'I', 'am', 'tired', ',', '"', 'she', 'said', '.']
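The pattern above can also be checked without NLTK, since a RegexpTokenizer in its default mode returns the regular expression's successive matches. The sketch below rebuilds the same pattern with the standard-library `re` module. `ANSWER_RE` is the pattern from the answer. `JOINED_RE` is a hypothetical variant that is not part of the answer and is shown only to illustrate one possible adjustment.

```python
import re

# Character class copied from the answer (without its trailing "?").
PUNCT = r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]'

# The answer's pattern: a word run, one optional punctuation mark, a lazy
# word run; otherwise a single non-whitespace character.
ANSWER_RE = re.compile(r'\w+' + PUNCT + r'?\w+?|[^\s]+?')

# Hypothetical variant (an assumption, not from the answer): any number of
# punctuation-joined word runs, each matched greedily.
JOINED_RE = re.compile(r'\w+(?:' + PUNCT + r'\w+)*|\S')

for text in ('Should I trade on the S&P?', 'rock&roll 333-445-6635'):
    print(ANSWER_RE.findall(text))
    print(JOINED_RE.findall(text))
```

As far as I can tell, the trailing `\w+?` in the answer's pattern is lazy, so after a punctuation mark it takes only one character. `S&P` comes out whole, but `rock&roll` would be cut into `rock&r` and `oll`, and the phone number is split as well. That may be one reason the edit below switches to a different tokenizer.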

Edit: The requirements have changed a bit, so we can slightly modify the Potts TweetTokenizer (https://github.com/adonoho/TweetTokenizers/blob/master/PottsTweetTokenizer.py) for this purpose.

import html.entities
import re

emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth      
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
    )"""
# Twitter symbols/cashtags:  # Added by awd, 20140410.
# Based upon Twitter's regex described here: <https://blog.twitter.com/2013/symbols-entities-tweets>.
cashtag_string = r"""(?:\$[a-zA-Z]{1,6}([._][a-zA-Z]{1,2})?)"""

# The components of the tokenizer:
regex_strings = (
    # Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?            
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?    
      \d{3}          # exchange
      [\-\s.]*   
      \d{4}          # base
    )"""
    ,
    # Emoticons:
    emoticon_string
    ,
    # HTML tags:
    r"""(?:<[^>]+>)"""
    ,
    # URLs:
    r"""(?:http[s]?://t.co/[a-zA-Z0-9]+)"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Twitter hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Twitter symbols/cashtags:
    cashtag_string
    ,
    # email addresses
    r"""(?:[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-])""",
    # Remaining word types:
    r"""
    (?:[a-z][^\s]+[a-z])           # Words with punctuation (modification here).
    |
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots. 
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
    )
word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
# The emoticon and cashtag strings get their own regex so that we can preserve case for them as needed:
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)
cashtag_re = re.compile(cashtag_string, re.VERBOSE | re.I | re.UNICODE)

# These are for regularizing HTML entities to Unicode:
html_entity_digit_re = re.compile(r"&#\d+;")
html_entity_alpha_re = re.compile(r"&\w+;")
amp = "&amp;"

class CustomTweetTokenizer(object):
    def __init__(self, *, preserve_case: bool=False):
        self.preserve_case = preserve_case

    def tokenize(self, tweet: str) -> list:
        """
        Argument: tweet -- any string object.
        Value: a list of token strings; tokens are lowercased unless preserve_case=True
        (emoticons keep their case and cashtags are uppercased).
        """
        # Fix HTML character entities:
        tweet = self._html2unicode(tweet)
        # Tokenize:
        matches = word_re.finditer(tweet)
        if self.preserve_case:
            return [match.group() for match in matches]
        return [self._normalize_token(match.group()) for match in matches]

    @staticmethod
    def _normalize_token(token: str) -> str:

        if emoticon_re.search(token):
            # Avoid changing emoticons like :D into :d
            return token
        if token.startswith('$') and cashtag_re.search(token):
            return token.upper()
        return token.lower()

    @staticmethod
    def _html2unicode(tweet: str) -> str:
        """
        Internal method that seeks to replace all the HTML entities in
        tweet with their corresponding unicode characters.
        """
        # First the digits:
        ents = set(html_entity_digit_re.findall(tweet))
        for ent in ents:
            entnum = ent[2:-1]
            try:
                entnum = int(entnum)
                tweet = tweet.replace(ent, chr(entnum))
            except (ValueError, OverflowError):
                # Skip numeric entities that chr() cannot represent.
                pass
        # Now the alpha versions:
        ents = set(html_entity_alpha_re.findall(tweet))
        ents = filter((lambda x: x != amp), ents)
        for ent in ents:
            entname = ent[1:-1]
            try:
                tweet = tweet.replace(ent, chr(html.entities.name2codepoint[entname]))
            except KeyError:
                # Unknown entity name: leave it untouched.
                pass
        # Replace "&amp;" once, outside the loop, so it also runs when no other entity is present:
        tweet = tweet.replace(amp, " and ")
        return tweet

Testing it:

tknzr = CustomTweetTokenizer(preserve_case=True)
tknzr.tokenize(test)

# result:
['Should',
 'I',
 'trade',
 'on',
 'the',
 'S&P',
 '?',
 'This',
 'works',
 'with',
 'a',
 'phone',
 'number',
 '333-445-6635',
 'and',
 'email',
 '[email protected]']
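The branch marked "modification here" is what keeps S&P together. A reduced sketch of just that idea follows, using only three of the alternatives and the standard-library `re` module. The names `WORD_RE` and `tokens` are hypothetical, and the full tokenizer above tries phone numbers, emoticons, e-mail addresses and the other patterns before reaching these branches.

```python
import re

WORD_RE = re.compile(
    r"""
    (?:[a-z][^\s]+[a-z])   # letter, any non-space run, letter (the modified branch)
    |
    (?:\w+)                # plain run of word characters
    |
    (?:\S)                 # any other single non-whitespace character
    """,
    re.VERBOSE | re.I,
)


def tokens(text):
    """Return all matches of the reduced pattern (hypothetical helper)."""
    return WORD_RE.findall(text)


print(tokens("Should I trade on the S&P?"))
print(tokens('"I am tired," she said.'))
print(tokens("well,then"))
```

The first branch backtracks to the last letter of each non-space run, so trailing punctuation such as `?` or `,"` is split off while inner punctuation is kept. The trade-off, as far as I can see, is that anything written without a space is also kept together: `well,then` stays a single token.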
