Unsupervised HMM training in NLTK

2024-05-03

I just want to do very simple unsupervised HMM training with nltk (http://www.nltk.org/).

Consider:

import nltk
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer()
from nltk.corpus import gutenberg
emma = gutenberg.words('austen-emma.txt')
m = trainer.train_unsupervised(emma)
ValueError: A Uniform probability distribution must have at least one sample.

Is there an example of how to use nltk.tag.hmm.HiddenMarkovModelTrainer.train_unsupervised (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.hmm-pysrc.html#HiddenMarkovModelTrainer.train_supervised)?

Apparently, nltk requires us to specify the sets of observed symbols and hidden states by hand, and it also requires the unlabeled sequences to have the form [ [(symb,tag),(symb,tag),...], [(symb,tag),(symb,tag),...], ...].

So we have

s = """"Your humble writer knows a little bit about a lot of things, but despite writing a fair amount about text processing (a book, for example), linguistic processing is a relatively novel area for me. Forgive me if I stumble through my explanations of the quite remarkable Natural Language Toolkit (NLTK), a wonderful tool for teaching, and working in, computational linguistics using Python. Computational linguistics, moreover, is closely related to the fields of artificial intelligence, language/speech recognition, translation, and grammar checking.\nWhat NLTK includes\nIt is natural to think of NLTK as a stacked series of layers that build on each other. Readers familiar with lexing and parsing of artificial languages (like, say, Python) will not have too much of a leap to understand the similar -- but deeper -- layers involved in natural language modeling.\nGlossary of terms\nCorpora: Collections of related texts. For example, the works of Shakespeare might, collectively, by called a corpus; the works of several authors, corpora.\nHistogram: The statistic distribution of the frequency of different words, letters, or other items within a data set.\nSyntagmatic: The study of syntagma; namely, the statistical relations in the contiguous occurrence of letters, words, or phrases in corpora.\nContext-free grammar: Type-2 in Noam Chomsky's hierarchy of the four types of formal grammars. See Resources for a thorough description.\nWhile NLTK comes with a number of corpora that have been pre-processed (often manually) to various degrees, conceptually each layer relies on the processing in the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, like noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); and finally sentences or other grammatical units can be classified. 
Along the way, NLTK gives you the ability to generate statistics about occurrences of various elements, and draw graphs that represent either the processing itself, or statistical aggregates in results.\nIn this article, you'll see some relatively fleshed-out examples from the lower-level capabilities, but most of the higher-level capabilities will be simply described abstractly. Let's now take the first steps past text processing, narrowly construed. """
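import random

# Split the text into rough sentences on '.', dropping the trailing remainder.
sentences = s.split('.')[:-1]
# Each token becomes a (symbol, tag) pair with an empty tag. Build real lists:
# in Python 3, map() returns a one-shot iterator that would already be used up
# by the time the trainer sees it. split() with no argument avoids '' tokens.
seq = [[(w, '') for w in ss.split()] for ss in sentences]
symbols = list(set(w for sent in seq for (w, _) in sent))
states = list(range(5))  # five hidden states
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer(states=states, symbols=symbols)
m = trainer.train_unsupervised(seq)
print(m.random_sample(random.Random(), 10))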


NLP

NLTK

hiddenmarkovmodels
