I want spaCy to use the sentence boundaries I provide instead of doing its own segmentation.
For example:
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."] # two sents
get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."] # ONE sent
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sents
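To pin down the behaviour I'm after, here is a plain-Python sketch that produces the expected outputs above without spaCy. It only splits strings, so it returns text rather than spaCy spans, and the names `MARKER` and `split_on_marker` are made up for illustration; it is a description of the target, not the solution I'm looking for.

```python
MARKER = "@SentBoundary@"  # illustrative constant, not part of spaCy


def split_on_marker(text, marker=MARKER):
    """Split text at each marker, dropping the marker and surrounding whitespace."""
    pieces = (piece.strip() for piece in text.split(marker))
    # Skip empty pieces, e.g. when the marker is at the very start or end.
    return [piece for piece in pieces if piece]


print(split_on_marker("Bob meets Alice. @SentBoundary@ They play together."))
print(split_on_marker("Bob meets Alice. They play together."))
print(split_on_marker("Bob meets Alice, @SentBoundary@ they play together."))
```

Text with no marker comes back as a single sentence, which is what I want from the spaCy pipeline as well.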
This is what I have so far (borrowed from the documentation here: https://spacy.io/usage/processing-pipelines#component-example1):
import spacy

nlp = spacy.load('en_core_web_sm')

def mark_sentence_boundaries(doc):
    for i, token in enumerate(doc):
        if token.text == '@SentBoundary@':
            doc[i+1].sent_start = True
    return doc

nlp.add_pipe(mark_sentence_boundaries, before='parser')

def get_sentences(text):
    doc = nlp(text)
    return (list(doc.sents))
But the results I get are as follows:
# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
#=> ["Bob meets Alice.", "@SentBoundary@", "They play together."]
# Ex2
get_sentences("Bob meets Alice. They play together.")
#=> ["Bob meets Alice.", "They play together."]
# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
#=> ["Bob meets Alice, @SentBoundary@", "they play together."]
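These are the main problems I'm facing:
- How to remove the @SentBoundary@ token once a sentence break has been found.
- How to stop spaCy from splitting when @SentBoundary@ is not present.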