spaCy version: 2.0.11
Python version: 3.6.5
OS: Ubuntu 16.04
My sample sentences:
Marketing-Representative- won't die in car accident.
or
Out-of-box implementation
Expected tokens:
["Marketing-Representative", "-", "wo", "n't", "die", "in", "car", "accident", "."]
["Out-of-box", "implementation"]
spaCy tokens (default tokenizer):
["Marketing", "-", "Representative-", "wo", "n't", "die", "in", "car", "accident", "."]
["Out", "-", "of", "-", "box", "implementation"]
I tried creating a custom tokenizer, but it doesn't handle all the edge cases that spaCy covers via tokenizer_exceptions (code below):
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re
nlp = spacy.load('en')
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Marketing-Representative- won't die in car accident.")
for token in doc:
    print(token.text)
Output:
Marketing-Representative-
won
'
t
die
in
car
accident
.
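The broken contraction in the output above appears to come from constructing `Tokenizer` without its `rules` argument: spaCy consults the exception table before applying infix patterns, so without it "won't" falls through to the apostrophe in the infix regex. A minimal stand-in illustrating that lookup order (the exception table here is a hypothetical two-entry sample, not spaCy's actual data):

```python
import re

# Hypothetical mini exception table standing in for spaCy's
# tokenizer_exceptions (the real table is far larger).
EXCEPTIONS = {"won't": ["wo", "n't"], "can't": ["ca", "n't"]}

# Apostrophe as an infix, as in the question's infix regex.
INFIX_RE = re.compile(r"[']")

def split_token(text, use_exceptions=True):
    # Exceptions win before any infix splitting is attempted.
    if use_exceptions and text in EXCEPTIONS:
        return list(EXCEPTIONS[text])
    # Otherwise split on every infix match, keeping the match itself.
    pieces, start = [], 0
    for m in INFIX_RE.finditer(text):
        pieces.append(text[start:m.start()])
        pieces.append(m.group())
        start = m.end()
    pieces.append(text[start:])
    return [p for p in pieces if p]

print(split_token("won't"))                        # ['wo', "n't"]
print(split_token("won't", use_exceptions=False))  # ['won', "'", 't']
```

The second call reproduces the `won / ' / t` output shown above, which suggests the fix is to keep (or pass through) the default exception rules rather than to refine the regex alone.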
I need someone to point me to the right approach here.
Changing the regex above might do it, or any other approach would be fine; I even tried spaCy's rule-based Matcher, but I could not write a rule that handles hyphens joining more than two words (e.g. "out-of-box") so that a Matcher could be built for use with span.merge().
Either way, I need words containing intra-word hyphens to become single tokens, the way Stanford CoreNLP handles them.
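One common direction is to keep the default tokenizer (so the exception table still handles "won't") and only remove the hyphen pattern from the infix list. The sketch below illustrates the mechanics with plain `re` and simplified stand-in patterns, not spaCy's real `nlp.Defaults.infixes`:

```python
import re

# Simplified stand-ins for spaCy's default infix patterns (assumption:
# the real nlp.Defaults.infixes is a longer tuple of such regexes).
infixes = [
    r"(?<=[a-zA-Z])[\-](?=[a-zA-Z])",   # hyphen between letters -> splits "out-of-box"
    r"(?<=[a-zA-Z])[,!?](?=[a-zA-Z])",  # punctuation between letters
]

def split_on_infixes(token, patterns):
    """Split a token wherever any infix pattern matches, keeping the match."""
    if not patterns:
        return [token]
    combined = re.compile("|".join(patterns))
    pieces, start = [], 0
    for m in combined.finditer(token):
        if m.start() > start:
            pieces.append(token[start:m.start()])
        pieces.append(m.group())
        start = m.end()
    if start < len(token):
        pieces.append(token[start:])
    return pieces

# With the hyphen infix present, the word splits:
print(split_on_infixes("Out-of-box", infixes))
# ['Out', '-', 'of', '-', 'box']

# Dropping the hyphen infix keeps it whole:
no_hyphen = [p for p in infixes if r"[\-]" not in p]
print(split_on_infixes("Out-of-box", no_hyphen))
# ['Out-of-box']
```

Applied to spaCy, the same filtering idea would mean rebuilding the infix regex from `nlp.Defaults.infixes` with the hyphen pattern removed and assigning it to the existing tokenizer's `infix_finditer`, which leaves prefixes, suffixes, and the exception rules untouched.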