NLTK - 获取并简化标签列表

2023-12-24

我正在使用布朗语料库。我想要某种方法来打印所有可能的标签及其名称(而不仅仅是标签缩写)。标签也不少,有没有办法“简化”标签呢?我所说的简化是指将两个极其相似的标签合并为一个,然后用另一个标签重新标记合并后的单词?


之前以某种方式讨论过:

  • Java Stanley NLP:语音标签的一部分? https://stackoverflow.com/questions/1833252/java-stanford-nlp-part-of-speech-labels

  • 使用 NLTK 简化法语 POS 标签集 https://stackoverflow.com/questions/27513185/simplifying-the-french-pos-tag-set-with-nltk

  • https://linguistics.stackexchange.com/questions/2249/turn-penn-treebank-into-simler-pos-tags https://linguistics.stackexchange.com/questions/2249/turn-penn-treebank-into-simpler-pos-tags

POS 标签输出nltk.pos_tag是 PennTreeBank 标签集,https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html, see NLTK 可能的 pos 标签有哪些? https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk

有几种方法但是最简单的可能是仅使用 POS 的前 2 个字符作为主要的 POS 标签集。这是因为 POS 标签中的前两个字符代表 Penn Tree Bank 标签集中 POS 的广泛类别。

例如NNS表示复数名词,并且NNP表示专有名词,并且NN标签通过表示通用名词来包含所有内容。

这是一个代码示例:

>>> from nltk.corpus import brown
>>> from collections import Counter

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'DTI': [u'any'], u'BEN': [u'been'], u'VBD': [u'said', u'produced', u'took', u'said'], u'NP$': [u"Atlanta's"], u'NN-TL': [u'County', u'Jury', u'City', u'Committee', u'City', u'Court', u'Judge', u'Mayor-nominate'], u'VBN': [u'conducted', u'charged', u'won'], u"''": [u"''", u"''", u"''"], u'WDT': [u'which', u'which', u'which'], u'JJ': [u'recent', u'over-all', u'possible', u'hard-fought'], u'VBZ': [u'deserves'], u'NN': [u'investigation', u'primary', u'election', u'evidence', u'place', u'jury', u'term-end', u'charge', u'election', u'praise', u'manner', u'election', u'term', u'jury', u'primary'], u',': [u',', u','], u'.': [u'.', u'.'], u'TO': [u'to'], u'NP': [u'September-October', u'Durwood', u'Pye', u'Ivan'], u'BEDZ': [u'was', u'was'], u'NR': [u'Friday'], u'NNS': [u'irregularities', u'presentments', u'thanks', u'reports', u'irregularities'], u'``': [u'``', u'``', u'``'], u'CC': [u'and'], u'RBR': [u'further'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'IN': [u'of', u'in', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'CS': [u'that', u'that'], u'NP-TL': [u'Fulton', u'Atlanta', u'Fulton'], u'HVD': [u'had', u'had'], u'IN-TL': [u'of'], u'VB': [u'investigate'], u'JJ-TL': [u'Grand', u'Executive', u'Superior']})
>>> len(x)
29

缩短版本如下所示:

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos[:2]].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'BE': [u'was', u'been', u'was'], u'VB': [u'said', u'produced', u'took', u'said', u'deserves', u'conducted', u'charged', u'investigate', u'won'], u'WD': [u'which', u'which', u'which'], u'RB': [u'further'], u'NN': [u'County', u'Jury', u'investigation', u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'manner', u'election', u'term', u'jury', u'Court', u'Judge', u'reports', u'irregularities', u'primary', u'Mayor-nominate'], u'TO': [u'to'], u'CC': [u'and'], u'HV': [u'had', u'had'], u'``': [u'``', u'``', u'``'], u',': [u',', u','], u'.': [u'.', u'.'], u"''": [u"''", u"''", u"''"], u'CS': [u'that', u'that'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'JJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought'], u'IN': [u'of', u'in', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'NP': [u'Fulton', u"Atlanta's", u'Atlanta', u'September-October', u'Fulton', u'Durwood', u'Pye', u'Ivan'], u'NR': [u'Friday'], u'DT': [u'any']})
>>> len(x)
19

另一种解决方案是使用通用邮资, see http://www.nltk.org/book/ch05.html http://www.nltk.org/book/ch05.html

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words(tagset='universal')[1:100]:
...     x[pos].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'ADV': [u'further'], u'NOUN': [u'Fulton', u'County', u'Jury', u'Friday', u'investigation', u"Atlanta's", u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'Atlanta', u'manner', u'election', u'September-October', u'term', u'jury', u'Fulton', u'Court', u'Judge', u'Durwood', u'Pye', u'reports', u'irregularities', u'primary', u'Mayor-nominate', u'Ivan'], u'ADP': [u'of', u'that', u'in', u'that', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'DET': [u'an', u'no', u'any', u'The', u'the', u'which', u'the', u'the', u'the', u'the', u'which', u'the', u'The', u'the', u'which'], u'.': [u'``', u"''", u'.', u',', u',', u'``', u"''", u'.', u'``', u"''"], u'PRT': [u'to'], u'VERB': [u'said', u'produced', u'took', u'said', u'had', u'deserves', u'was', u'conducted', u'had', u'been', u'charged', u'investigate', u'was', u'won'], u'CONJ': [u'and'], u'ADJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought']})
>>> len(x)
9
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

NLTK - 获取并简化标签列表 的相关文章

随机推荐