This is a notes-style article: mainly relearning fastai, and putting into practice how to hook "pytorch-pretrained-BERT" and "pytorch-transformers" up to fastai so we can train and run a BERT model concisely and quickly.
I am still a beginner, so if any experts feel the urge to throw bricks at this article, please aim away from my face. I still need it to get by. Thanks!
The Basics of Fast AI
Since fastai is built on top of PyTorch, it uses the same underlying primitives to handle data (datasets and dataloaders). However, unlike many other frameworks, it doesn't directly expose the datasets and dataloaders, and instead wraps them up in a DataBunch, which makes usage more concise (how practical that is varies from person to person).
Training a model with fastai roughly takes three steps:
1. Prepare the data.
2. Define the architecture (in fastai this is a Learner): the Learner ties together the data, the model definition, the loss function and the optimizer.
3. Choose the loss function and optimizer.
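As a concrete illustration of these three steps, here is a minimal generic sketch using fastai v1's bundled IMDB sample (not part of the BERT pipeline below):
from fastai.text import *

# 1. prepare data: fastai downloads a small IMDB sample and builds a DataBunch
path = untar_data(URLs.IMDB_SAMPLE)
data = TextClasDataBunch.from_csv(path, "texts.csv")
# 2 + 3. the Learner ties together the data, a model, the loss function and the optimizer
learn = text_classifier_learner(data, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1)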
A framework like this is quite friendly when using BERT for a task. Of course, the wrapping is heavy; if you want to make your own changes or build a model from scratch, I'd still suggest plain PyTorch or TensorFlow.
Using BERT with fastai
Huggingface released PyTorch versions of the BERT pretrained models, and later refactored "pytorch-pretrained-BERT" into "pytorch-transformers".
Both packages can be hooked up to fastai to train a BERT model.
There are three things we need to be careful of when using BERT with fastai.
- BERT uses its own wordpiece tokenizer.
- BERT needs [CLS] and [SEP] tokens added to each sequence.
- BERT uses its own pre-built vocabulary.
pytorch_pretrained_bert ships with this tokenizer:
from pytorch_pretrained_bert import BertTokenizer
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
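To see the wordpiece behaviour, you can tokenize a sentence directly; rare words get split into subword units (the output shown is illustrative):
print(bert_tok.tokenize("fastai simplifies training"))
# something like: ['fast', '##ai', 'simpl', '##ifies', 'training']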
But fastai wraps the tokenizer when it runs tokenization, and we use that wrapper to add the [CLS] and [SEP] tokens:
from typing import List
from fastai.text import BaseTokenizer, Tokenizer, Vocab

class FastAiBertTokenizer(BaseTokenizer):
    """Wrapper around BertTokenizer to be compatible with fast.ai"""
    def __init__(self, tokenizer: BertTokenizer, max_seq_len: int = 128, **kwargs):
        self._pretrained_tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __call__(self, *args, **kwargs):
        # fastai calls tok_func(lang) to build a tokenizer; returning self
        # lets this instance be used directly as tok_func
        return self

    def tokenizer(self, t: str) -> List[str]:
        """Limits the maximum sequence length and adds the BERT special tokens"""
        return ["[CLS]"] + self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2] + ["[SEP]"]
The final wrapped form looks like this (fastai_tokenizer is the tokenizer on the fastai side):
fastai_tokenizer = Tokenizer(
tok_func=FastAiBertTokenizer(bert_tok, max_seq_len=config.max_seq_len),
pre_rules=[],
post_rules=[])
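As a quick sanity check, calling the wrapper class directly shows the special tokens in place (a small sketch; the middle pieces depend on the vocabulary):
pieces = FastAiBertTokenizer(bert_tok, max_seq_len=config.max_seq_len).tokenizer("I love fastai")
print(pieces[0], pieces[-1])  # [CLS] [SEP]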
The vocabulary also needs to be wrapped:
fastai_bert_vocab = Vocab(list(bert_tok.vocab.keys()))
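Because the Vocab is built straight from BERT's ordered token-to-id mapping, fastai's numericalization should line up with BERT's own ids; a quick check (the values shown are for bert-base-uncased):
print(len(fastai_bert_vocab.itos))      # 30522 tokens in bert-base-uncased
print(fastai_bert_vocab.stoi["[CLS]"])  # 101, matching BERT's own id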
Now, build the databunch:
import pandas as pd
from pathlib import Path
from functools import partial
from sklearn.model_selection import train_test_split
from fastai.text import TextDataBunch, pad_collate

DATA_ROOT = Path("..") / "input"
train, test = [pd.read_csv(DATA_ROOT / fname) for fname in ["train.csv", "test.csv"]]
train, val = train_test_split(train)
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
# read the data into a databunch
databunch = TextDataBunch.from_df(".", train, val, test,
tokenizer=fastai_tokenizer, vocab=fastai_bert_vocab,
include_bos=False,
include_eos=False,
text_cols="comment_text",
label_cols=label_cols,
bs=config.bs,
collate_fn=partial(pad_collate, pad_first=False),
)
Notice we're passing the include_bos=False and include_eos=False options. This is because fastai adds its own bos and eos tokens by default, which interferes with the [CLS] and [SEP] tokens added by BERT. Note that these options are new and might not be available in older versions of fastai.
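To confirm the batch layout (end-padded sequences because of pad_first=False, and one column per label), one can peek at a training batch; a small sketch:
x, y = next(iter(databunch.train_dl))
print(x.shape, y.shape)  # roughly (bs, seq_len) and (bs, 6)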
With the data ready, we turn to the architecture and build the Learner:
import torch
import torch.nn as nn
from fastai.basic_train import Learner
from pytorch_pretrained_bert.modeling import BertConfig, BertForSequenceClassification
bert_model = BertForSequenceClassification.from_pretrained(config.bert_model_name, num_labels=6)
# since this is a multilabel classification problem, we use the BCEWithLogitsLoss
loss_func = nn.BCEWithLogitsLoss()
learner = Learner(
databunch, bert_model,
loss_func=loss_func,
)
That's the whole recipe; yes, it really is that simple.
Then it can run:
learner.lr_find()        # explore a range of learning rates
learner.recorder.plot()  # plot the losses lr_find just recorded
Start training:
learner.fit_one_cycle(4, max_lr=3e-5)
Yep, it's just that one line of code.
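After training, test-set predictions come out of the same Learner (a sketch using fastai v1's get_preds; depending on the fastai version, it may already apply the sigmoid that matches BCEWithLogitsLoss):
from fastai.basic_data import DatasetType

preds, _ = learner.get_preds(ds_type=DatasetType.Test)
print(preds.shape)  # one score per label for every test row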
Using BertTokenizer and BertForSequenceClassification from "pytorch-transformers" works the same way.
One caveat: in the new pytorch-transformers, every model returns a tuple. The output that the old version returned directly is now wrapped in a tuple, which may also carry extra data, but the original output generally sits in the tuple's first element.
So when specifying the model, take the following approach (code from here):
# when using pytorch-transformers, import the tuple-returning class from there
from pytorch_transformers import BertForSequenceClassification

# take the first item of the tuple as the result
class MyNoTupleModel(BertForSequenceClassification):
    def forward(self, *args, **kwargs):
        return super().forward(*args, **kwargs)[0]

# build the model (num_labels must match your task: 2 here, 6 for the multilabel example above)
bert_pretrained_model = MyNoTupleModel.from_pretrained(config.bert_model_name, num_labels=2)
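The rest of the pipeline stays the same; with num_labels set to match the databunch's labels, the de-tupled model drops straight into the same Learner call:
learner = Learner(databunch, bert_pretrained_model, loss_func=loss_func)
learner.fit_one_cycle(4, max_lr=3e-5)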
OK, thanks!