I have a variable trainData with the following simplified format:
[
('Paragraph_A', {"entities": [(15, 26, 'DiseaseClass'), (443, 449, 'DiseaseClass'), (483, 496, 'DiseaseClass')]}),
('Paragraph_B', {"entities": [(969, 975, 'DiseaseClass'), (1257, 1271, 'SpecificDisease')]}),
('Paragraph_C', {"entities": [(0, 27, 'SpecificDisease')]})
]
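To make the offset convention concrete, here is a tiny made-up example (the text and offsets are hypothetical, not from the actual dataset): each entity tuple holds (start_char, end_char, label), indexing directly into the paragraph string.

```python
# Hypothetical miniature example (not from the real trainData) showing how
# each (start, end, label) tuple indexes into the paragraph string.
text = "Patients with lung cancer were enrolled."
entities = [(14, 25, "DiseaseClass")]

for start, end, label in entities:
    print(text[start:end], "->", label)  # prints: lung cancer -> DiseaseClass
```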
I am trying to convert trainData to the .spacy format by first converting it to Doc objects and then to a DocBin. The full trainData file is available on Google Drive: https://drive.google.com/file/d/1Njb5hoPGU1sqaQzEgvx-Bld4LRUkrChm/view?usp=sharing.
I tried to reproduce what is described in this tutorial, but it does not work for me: Using spaCy 3.0 to Build a Custom NER Model https://towardsdatascience.com/using-spacy-3-0-to-build-a-custom-ner-model-c9256bea098
Here is what I tried:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # load a new spaCy model
db = DocBin()  # create a DocBin object

for text, annot in trainData:  # data in the format shown above
    doc = nlp.make_doc(text)  # create a Doc object from the text
    ents = []
    for start, end, label in annot["entities"]:  # character offsets
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        ents.append(span)
    doc.ents = span  # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy")  # save the DocBin object
However, my code for converting the data from the spaCy v2 format to the spaCy v3 format has an error.
With the code snippet above, I get a traceback: TypeError: 'spacy.tokens.token.Token' object is not iterable.
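For reference, an isolated snippet that does work for me (hypothetical text, assuming spaCy 3.x), where doc.ents is assigned an iterable of Span objects:

```python
import spacy

# Isolated sketch (hypothetical text, assumes spaCy 3.x): doc.ents accepts
# an iterable of Span objects.
nlp = spacy.blank("en")
doc = nlp.make_doc("gout is a disease")
span = doc.char_span(0, 4, label="DiseaseClass")

doc.ents = [span]  # a list of spans
print([(ent.text, ent.label_) for ent in doc.ents])
```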