Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:

PP(W) = P(w1 w2 ... wN)^(-1/N) = ( ∏_{i=1..N} 1/P(w_i) )^(1/N)
Now, you say that you have already constructed the unigram model, meaning that for each word you have the relevant probability. Then you just need to apply the formula. I assume you have a big dictionary unigram[word] that provides the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I can adapt my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:          # only count words the model knows
        N += 1
        perplexity = perplexity * (1 / unigram[word])
perplexity = pow(perplexity, 1 / float(N))   # take the N-th root
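As a quick sanity check, here is the same loop run on a toy model with made-up probabilities (my own numbers, just to see the formula at work):

# toy unigram "model": three words with assumed probabilities
unigram = {"the": 0.5, "cat": 0.25, "sat": 0.25}
testset = ["the", "cat", "sat"]

perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1 / unigram[word])
perplexity = pow(perplexity, 1 / float(N))
print(perplexity)  # (2 * 4 * 4) ** (1/3.) ≈ 3.17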
UPDATE:
Since you asked for a complete working example, here is a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here is how we first construct the unigram model:
import collections, nltk

# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    # unseen words fall back to the defaultdict's factory value, 0.01
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        model[f] += 1
    # normalize counts into probabilities
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word] / N
    return model
Our model here is smoothed: for words outside of its knowledge, it assigns a low probability of 0.01
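You can check that behaviour directly (a small illustration of mine, reusing the model built above):

model = unigram(tokens)
print(model["Monty"])        # a probability estimated from the corpus counts
print(model["abracadabra"])  # 0.01, the smoothing default for an unseen word
# note: looking up a missing key on a defaultdict also inserts it; harmless here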
I already showed you above how to compute the perplexity:
# computes the perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1 / model[word])
    perplexity = pow(perplexity, 1 / float(N))
    return perplexity
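One practical caveat: the running product of 1/p factors overflows quickly on long test sets. A sketch of an equivalent, numerically safer variant (my addition, not part of the code above) accumulates log-probabilities instead:

import math

# same quantity as perplexity() above, computed in log space
def perplexity_log(testset, model):
    words = testset.split()
    log_sum = sum(-math.log(model[word]) for word in words)
    return math.exp(log_sum / len(words))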
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print(perplexity(testset1, model))
print(perplexity(testset2, model))
You would get the following results:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to minimize it: a language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty is included in the unigram model, so the corresponding perplexity is smaller. In the second test set every word is unseen, so each one falls back to the default probability 0.01, which gives (1/0.01 × 1/0.01 × 1/0.01)^(1/3) = 100; the printed 99.999... is just floating-point error.