If you want to try tagging without context, you are looking for some kind of unigram tagger, a.k.a. a lookup tagger. A unigram tagger tags a word based solely on the frequency of the tags that word appears with, so it avoids contextual heuristics. But as with any tagging task, you need data: for a unigram tagger you need annotated data to train it. See the lookup tagger section in the NLTK tutorial: http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
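The idea behind a lookup tagger can be sketched in plain Python without NLTK: count how often each word occurs with each tag in the annotated training data, then always assign the most frequent tag (and some default, e.g. None, for unseen words). The helper name and toy data below are illustrative; in practice you would feed in real tagged sentences such as brown.tagged_sents().

```python
from collections import Counter, defaultdict

def train_lookup_tagger(tagged_sents, default_tag=None):
    """Map each word to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    table = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    return lambda words: [(w, table.get(w, default_tag)) for w in words]

# Toy annotated data standing in for e.g. brown.tagged_sents()
train_data = [
    [("the", "AT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "AT"), ("sat", "NN")],   # "sat" seen once as NN...
    [("the", "AT"), ("cat", "NN"), ("sat", "VBD")],  # ...and twice as VBD
]
tagger = train_lookup_tagger(train_data)
print(tagger("the cat sat here".split()))
# [('the', 'AT'), ('cat', 'NN'), ('sat', 'VBD'), ('here', None)]
```

Unseen words come back with the default tag (None here), which is exactly the behavior you see from NLTK's UnigramTagger for out-of-vocabulary words like "foo" below.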
Here is another way to train/test a unigram tagger in NLTK:
>>> from nltk.corpus import brown
>>> from nltk import UnigramTagger as ut
>>> brown_sents = brown.tagged_sents()
# Split the data into train and test sets.
>>> train = int(len(brown_sents)*90/100) # use 90% for training
# Trains the tagger
>>> uni_tag = ut(brown_sents[:train]) # this will take some time, ~1-2 mins
# Tags a random sentence
>>> uni_tag.tag("this is a foo bar sentence .".split())
[('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('foo', None), ('bar', 'NN'), ('sentence', 'NN'), ('.', '.')]
# Test the tagger's accuracy.
>>> uni_tag.evaluate(brown_sents[train+1:]) # evaluate on 10%, will also take ~1-2 mins
0.8851469586629643
I would not recommend using WordNet for POS tagging, since too many words still have no entry in WordNet. But you could look into using lemma frequencies in WordNet; see How to get the wordnet sense frequency of a synset in NLTK? https://stackoverflow.com/questions/15551195/how-to-get-the-wordnet-sense-frequency-of-a-synset-in-nltk. Those frequencies are based on the SemCor corpus (http://www.cse.unt.edu/~rada/downloads.html).