I'm using NLTK to analyze some classic texts, but I'm running into trouble tokenizing the text by sentence. For example, here's a snippet from Moby Dick (http://www.gutenberg.org/cache/epub/2701/pg2701.txt):
import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
'''
(Chapter 16)
"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = '"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'
print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
-----
Hussey?
-----
"
'''
Given that Melville's syntax is a bit dated, I don't expect perfection here, but NLTK really ought to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't see how to tinker with it.
Does anyone have a recommendation for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.
You need to supply the tokenizer with a list of abbreviations, like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)
Now sentences is:
['is THAT what you mean, Mrs. Hussey?']
Update: this doesn't work well when the last word of a sentence has an apostrophe or quotation mark attached to it (like Hussey?'). A quick-and-dirty workaround is to put a space between the sentence-ending punctuation (.!?) and the apostrophe or quote:
text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
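That one-liner is easy to generalize. A small helper (the name `space_out_quotes` is mine, not anything from NLTK) that covers all three terminators and can be run over the text before feeding it to any tokenizer:

```python
def space_out_quotes(text):
    """Put a space between sentence-ending punctuation and a following
    double quote, so the tokenizer sees the quote as its own token
    instead of gluing it onto the start of the next sentence."""
    for end in ('?', '!', '.'):
        text = text.replace(end + '"', end + ' "')
    return text

print(space_out_quotes('is THAT what you mean, Mrs. Hussey?"'))
# is THAT what you mean, Mrs. Hussey? "
```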