Classification using my own corpus instead of the movie_reviews corpus in NLTK

2023-12-30

I am using the following code, which I got from "Classification using movie review corpus in NLTK/Python" https://stackoverflow.com/questions/21107075/classification-using-movie-review-corpus-in-nltk-python

import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, count in word_features.most_common(100)]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = nbc.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

output:

0.655
Most Informative Features
                 bad = True              neg : pos    =      2.0 : 1.0
              script = True              neg : pos    =      1.5 : 1.0
               world = True              pos : neg    =      1.5 : 1.0
             nothing = True              neg : pos    =      1.5 : 1.0
                 bad = False             pos : neg    =      1.5 : 1.0

I want to create my own folder in NLTK instead of movie_reviews and put my own files in it.

If your data has exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to "hack" your way through:

1. Put your corpus directory where your nltk.data is saved

First check where your nltk.data is saved:

>>> import nltk
>>> nltk.data.find('corpora/movie_reviews')
FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews')
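If the movie_reviews corpus has not been downloaded, the lookup above raises a LookupError. As a rough alternative, you can inspect `nltk.data.path`, the list of directories NLTK searches for its data. The variable names below are only illustrative:

```python
import os

import nltk

# nltk.data.path is the list of directories NLTK searches, in order.
search_dirs = list(nltk.data.path)

# Keep only the search directories that already contain a corpora/ folder.
existing = [d for d in search_dirs if os.path.isdir(os.path.join(d, "corpora"))]

for directory in search_dirs:
    marker = "has corpora/" if directory in existing else "no corpora/"
    print(directory, "-", marker)
```

Any directory reported as having a corpora/ folder should be a valid target for the move described next. Appending your own directory to `nltk.data.path` should also make a `corpora/my_movie_reviews` folder under it visible to the loader.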

Then move your directory to where nltk_data/corpora is saved:

# Let's make a test corpus like `nltk.corpus.movie_reviews`
~$ mkdir my_movie_reviews
~$ mkdir my_movie_reviews/pos
~$ mkdir my_movie_reviews/neg
~$ echo "This is a great restaurant." > my_movie_reviews/pos/1.txt
~$ echo "Had a great time at chez jerome." > my_movie_reviews/pos/2.txt
~$ echo "Food fit for the ****" > my_movie_reviews/neg/1.txt
~$ echo "Slow service." > my_movie_reviews/neg/2.txt
~$ echo "README please" > my_movie_reviews/README
# Move it to `nltk_data/corpora/`
~$ mv my_movie_reviews/ nltk_data/corpora/
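If you are not on a Unix shell, the same test corpus can be created from Python. This is only a sketch: `build_corpus` and `SAMPLE_REVIEWS` are names made up for the example, and it writes to a temporary directory rather than to nltk_data/corpora.

```python
import tempfile
from pathlib import Path

# Same toy reviews as the shell commands above, keyed by "category/filename".
SAMPLE_REVIEWS = {
    "pos/1.txt": "This is a great restaurant.",
    "pos/2.txt": "Had a great time at chez jerome.",
    "neg/1.txt": "Food fit for the ****",
    "neg/2.txt": "Slow service.",
}


def build_corpus(root):
    """Write one file per review under root/<category>/ and return the root."""
    root = Path(root)
    for relative_name, text in SAMPLE_REVIEWS.items():
        target = root / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        # echo appends a newline, so do the same here.
        target.write_text(text + "\n", encoding="ascii")
    (root / "README").write_text("README please\n", encoding="ascii")
    return root


# Assumption: a throwaway temp directory stands in for the real corpus location.
corpus_root = build_corpus(Path(tempfile.mkdtemp()) / "my_movie_reviews")
created = sorted(
    p.relative_to(corpus_root).as_posix()
    for p in corpus_root.rglob("*")
    if p.is_file()
)
print(created)
```

To use it with the first approach, pass a path inside your own nltk_data/corpora directory instead of the temporary one.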

In your Python code:

>>> import string
>>> from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
>>> from nltk.corpus import stopwords
>>> my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
>>> mr = my_movie_reviews
>>>
>>> stop = stopwords.words('english')
>>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
>>> for i in documents:
...     print i
... 
([u'Food', u'fit', u'****'], u'neg')
([u'Slow', u'service'], u'neg')
([u'great', u'restaurant'], u'pos')
([u'great', u'time', u'chez', u'jerome'], u'pos')

(For more details, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/util.py#L21 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L144)

2. Create your own CategorizedPlaintextCorpusReader

If you have no access to the nltk.data directory and you want to use your own corpus, try this:

# Let's say that your corpus is saved on `/home/alvas/my_movie_reviews/`

>>> import string; from nltk.corpus import stopwords
>>> from nltk.corpus import CategorizedPlaintextCorpusReader
>>> mr = CategorizedPlaintextCorpusReader('/home/alvas/my_movie_reviews', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
>>> stop = stopwords.words('english')
>>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
>>> 
>>> for doc in documents:
...     print doc
... 
([u'Food', u'fit', u'****'], 'neg')
([u'Slow', u'service'], 'neg')
([u'great', u'restaurant'], 'pos')
([u'great', u'time', u'chez', u'jerome'], 'pos')
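The reader also knows each file's category, so the label does not have to be parsed out of the file id with `split('/')`. Here is a minimal sketch, assuming Python 3 and a throwaway copy of the toy corpus in a temporary directory (that location and the `reviews` dict are made up for the example). Stopword filtering is left out so that no extra NLTK data is needed.

```python
import tempfile
from pathlib import Path

from nltk.corpus import CategorizedPlaintextCorpusReader

# Toy reviews mirroring the example corpus above.
reviews = {
    "pos/1.txt": "This is a great restaurant.",
    "pos/2.txt": "Had a great time at chez jerome.",
    "neg/1.txt": "Food fit for the ****",
    "neg/2.txt": "Slow service.",
}

# Assumption: a temp directory stands in for /home/alvas/my_movie_reviews.
root = Path(tempfile.mkdtemp()) / "my_movie_reviews"
for name, text in reviews.items():
    path = root / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text + "\n", encoding="ascii")

reader = CategorizedPlaintextCorpusReader(
    str(root), r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

categories = reader.categories()              # category names found by cat_pattern
pos_files = reader.fileids(categories='pos')  # file ids in one category

# One (tokens, label) pair per file, with the label taken from the reader.
labelled = [(list(reader.words(fid)), reader.categories(fid)[0])
            for fid in reader.fileids()]

for tokens, label in labelled:
    print(label, tokens)
```

The stopword and punctuation filter from the snippets above can be applied to `reader.words(fid)` in the same way as before.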

Similar questions have been asked in "Creating a custom categorized corpus in NLTK and Python" https://stackoverflow.com/questions/10463898/creating-a-custom-categorized-corpus-in-nltk-and-python and "Using my own corpus for category classification in Python NLTK" https://stackoverflow.com/questions/8818265/using-my-own-corpus-for-category-classification-in-python-nltk

Here is the complete code that will work:

import string
from itertools import chain

import nltk
from nltk.corpus import stopwords
from nltk.corpus import CategorizedPlaintextCorpusReader
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc

# Root directory that contains the pos/ and neg/ subdirectories.
mydir = '/home/alvas/my_movie_reviews'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
# One (tokens, label) pair per file; the label is the directory name.
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Use the 100 most frequent tokens as features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, count in word_features.most_common(100)]

# First 90% of the documents for training, the rest for testing.
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
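One thing to watch in the code above: `fileids()` returns the files sorted by path, so all neg documents come before the pos ones, and taking the last 10% as the test set can leave it with a single class. Here is a rough sketch of shuffling before the split, using only the standard library. The `documents` list is a hand-written stand-in for the one built above, and `document_features` and `split_documents` are names made up for the example.

```python
import random
from collections import Counter

# Stand-in for the `documents` list built above: (tokens, label) pairs.
documents = [
    (["Food", "fit", "****"], "neg"),
    (["Slow", "service"], "neg"),
    (["great", "restaurant"], "pos"),
    (["great", "time", "chez", "jerome"], "pos"),
]


def document_features(tokens, vocabulary):
    """Map every vocabulary word to True/False depending on its presence."""
    token_set = set(tokens)
    return {word: (word in token_set) for word in vocabulary}


def split_documents(docs, train_fraction=0.9, seed=0):
    """Shuffle a copy of docs reproducibly, then cut it into train and test."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# The 100 most frequent tokens over all documents serve as the vocabulary.
counts = Counter(token for tokens, _ in documents for token in tokens)
vocabulary = [word for word, _ in counts.most_common(100)]

train_docs, test_docs = split_documents(documents)
train_set = [(document_features(tokens, vocabulary), tag) for tokens, tag in train_docs]
test_set = [(document_features(tokens, vocabulary), tag) for tokens, tag in test_docs]

print(len(train_set), "training documents,", len(test_set), "test documents")
```

The resulting `train_set` and `test_set` have the same shape as in the complete code, so they can be passed to `nbc.train` and `nltk.classify.accuracy` unchanged.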
