How do I get n-gram collocations and associations in Python NLTK?

2024-03-10

In this documentation, http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html, there is an example that uses nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder.

It includes an example method for finding the n-best bigrams and trigrams based on PMI. For example:

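>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)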

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.

How can I use the methods (e.g. nbest()) in AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?

Should I create a class that inherits from AbstractCollocationFinder?

Thanks.


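If you want n-grams longer than 2 or 3 words, you can use the scikit package (http://scikit-learn.org/stable/) together with the FreqDist function to get the counts of those n-grams. I tried to do this with nltk.collocations, but I don't think it can score anything longer than 3-grams, so I decided to go with raw n-gram counts instead. I hope this helps a little. Thanks.

Here is the code: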

from sklearn.feature_extraction.text import CountVectorizer
import nltk

query = "This document gives a very short introduction to machine learning problems"

# Build an analyzer that emits every 1-gram through 4-gram of the text.
# The default tokenizer lowercases and drops one-character tokens such as "a".
vect = CountVectorizer(ngram_range=(1, 4))
analyzer = vect.build_analyzer()
listNgramQuery = analyzer(query)
listNgramQuery.reverse()  # put the longest n-grams first
print("listNgramQuery=", listNgramQuery)

# Count how often each n-gram occurs.
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print("\nNgramQueryWeights=", NgramQueryWeights)

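This gives the following output (captured under Python 2, hence the u'' prefixes; newer NLTK versions may print the FreqDist differently):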

listNgramQuery= [u'to machine learning problems', u'introduction to machine learning', u'short introduction to machine', u'very short introduction to', u'gives very short introduction', u'document gives very short', u'this document gives very', u'machine learning problems', u'to machine learning', u'introduction to machine', u'short introduction to', u'very short introduction', u'gives very short', u'document gives very', u'this document gives', u'learning problems', u'machine learning', u'to machine', u'introduction to', u'short introduction', u'very short', u'gives very', u'document gives', u'this document', u'problems', u'learning', u'machine', u'to', u'introduction', u'short', u'very', u'gives', u'document', u'this']

NgramQueryWeights= <FreqDist: u'document': 1, u'document gives': 1, u'document gives very': 1, u'document gives very short': 1, u'gives': 1, u'gives very': 1, u'gives very short': 1, u'gives very short introduction': 1, u'introduction': 1, u'introduction to': 1, ...>
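
The counts above do not give an association score. As a rough illustration of how a PMI-style ranking could be extended to any n, here is a minimal sketch that uses only the standard library. It assumes the common generalization PMI = log2(P(w1..wn) / (P(w1) * ... * P(wn))); the helper names `ngrams` and `nbest_pmi` and the `min_freq` parameter are made up for this example and are not part of NLTK or scikit-learn, and the scores are not guaranteed to match what NLTK computes.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-token tuple from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def nbest_pmi(tokens, n, k, min_freq=1):
    """Return the k n-grams with the highest generalized PMI (hypothetical helper)."""
    total_words = len(tokens)
    word_counts = Counter(tokens)
    gram_list = ngrams(tokens, n)
    gram_counts = Counter(gram_list)
    total_grams = len(gram_list)

    scores = {}
    for gram, count in gram_counts.items():
        if count < min_freq:
            continue  # rare n-grams get unreliable, inflated PMI values
        p_gram = count / total_grams
        # Probability of the same words appearing together if they were independent.
        p_independent = math.prod(word_counts[w] / total_words for w in gram)
        scores[gram] = math.log2(p_gram / p_independent)

    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:k]


tokens = ("machine learning problems are hard . "
          "machine learning problems are fun . "
          "hard problems are fun").split()
best = nbest_pmi(tokens, n=4, k=3, min_freq=2)
print(best)
```

With min_freq=2, only the 4-gram that occurs twice in the toy text survives the filter. On a real corpus you would probably want a higher threshold, since PMI tends to favor n-grams made of very rare words.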