One way to achieve this seems to be to make the tokenizer both
- split apart tokens that contain tags with no surrounding whitespace, and
- "chunk" tag-like sequences into single tokens.
To split tokens as in the example, you can modify the tokenizer infixes (in the way described here: https://github.com/explosion/spaCy/issues/3673):
infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
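As a quick sanity check (a sketch using a blank English pipeline so no trained model is needed), the infix rule on its own splits a tag into its brackets and name, which is exactly why the special cases below are also needed:

```python
import spacy
from spacy.util import compile_infix_regex

# blank pipeline: tokenizer only, no model download required
nlp = spacy.blank("en")
infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# the tag comes apart into pieces rather than staying one token
print([t.text for t in nlp("documentation<br/>The")])
```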
To ensure that tags are treated as single tokens, you can use "special cases" (see the tokenizer overview https://spacy.io/usage/spacy-101#annotations-token or the method documentation https://spacy.io/api/tokenizer#add_special_case). You can add special cases for opening, closing, and empty tags, for example:
# open and close
for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])
# empty
for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])
Putting it all together:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_trf")

infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])
for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])
This seems to produce the expected result. For example, applying...
text = """<body>documentation<br/>The Observatory <p> Safety </p> System</body>"""
print("Tokenized:")
for t in nlp(text):
    print(t)
...will print the tags whole and as separate tokens:
# ... snip
documentation
<br/>
The
# ... snip
I found the tokenizer's explain method https://spacy.io/api/tokenizer#explain very helpful in this regard. It gives you details on the reasons behind the tokenization.
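For instance, a minimal sketch of explain() (again on a blank English pipeline rather than en_core_web_trf, with a single hard-coded special case for illustration) shows which rule produced each token:

```python
import spacy
from spacy.symbols import ORTH
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
nlp.tokenizer.add_special_case("<br/>", [{ORTH: "<br/>"}])

# explain() yields (rule label, token text) pairs, e.g. TOKEN,
# INFIX, or SPECIAL-1 for a token produced by a special case
for rule, token_text in nlp.tokenizer.explain("documentation<br/>The"):
    print(rule, token_text)
```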