There are two things to keep in mind:
First: train_new_from_iterator works only with fast tokenizers
(you can read more here: https://github.com/huggingface/transformers/issues/15077).
Second: the training corpus should be a generator of batches of texts, e.g., a list of lists of texts
(official docs: https://huggingface.co/docs/transformers/main_classes/tokenizer).
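To make the "generator of batches of texts" requirement concrete, here is a minimal plain-Python sketch (the sample strings are made up for illustration):

```python
texts = ['fghijk', 'wxyz', 'abc', 'defg']  # made-up sample corpus

def batch_iterator(batch_size=2):
    # Each iteration yields one batch: a list of strings.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

print(list(batch_iterator()))  # [['fghijk', 'wxyz'], ['abc', 'defg']]
```

Any iterable that yields lists of strings like this can be passed to train_new_from_iterator.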
import pandas as pd
from transformers import AutoTokenizer

def batch_iterator(batch_size=3, size=8):
    df = pd.DataFrame({"note_text": ['fghijk', 'wxyz']})
    # Yield one batch (a list of strings) per iteration, not the whole column every time
    for i in range(0, size, batch_size):
        yield df['note_text'].iloc[i:i + batch_size].to_list()
old_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
print(old_tokenizer(['fghijk', 'wxyz']))
print(new_tokenizer(['fghijk', 'wxyz']))
output:
{'input_ids': [[0, 506, 4147, 18474, 2], [0, 605, 32027, 329, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
{'input_ids': [[0, 22, 2], [0, 21, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}