How to fine-tune a HuggingFace BERT model for text classification [closed]

2024-01-05

Is there a step-by-step explanation of how to fine-tune a HuggingFace BERT model for text classification?


Fine-tuning Approaches

There are multiple approaches to fine-tune BERT for the target task.

  1. Further pre-train the base BERT model
  2. Custom classification layer(s) on top of the base BERT model, with the base trainable
  3. Custom classification layer(s) on top of the base BERT model, with the base non-trainable (frozen)

Note that the base BERT model has been pre-trained only for two tasks, as in the original paper.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805

3.1 Pre-training BERT ... we pre-train BERT using two unsupervised tasks

  • Task #1: Masked LM
  • Task #2: Next Sentence Prediction (NSP)

Hence, the base BERT model is like a half-baked model which can be fully baked for the target domain (the 1st approach). Alternatively, we can use it as part of our custom model training with the base either trainable (2nd) or non-trainable (3rd).
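As a quick orientation, the only difference between the 2nd and 3rd approaches is whether the base model's layers stay trainable during fine-tuning; below is a minimal sketch of that idea (the same pattern appears in the full implementation further down).

from transformers import TFDistilBertModel

base = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

# 3rd approach: freeze the base model so only the custom classification head is trained.
# Leave the layers trainable (the default) for the 2nd approach.
for layer in base.layers:
    layer.trainable = False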


1st Approach

How to Fine-Tune BERT for Text Classification? https://arxiv.org/abs/1905.05583 demonstrated the 1st approach of further pre-training, and pointed out that the learning rate is the key to avoiding catastrophic forgetting, where the pre-trained knowledge is erased while learning new knowledge.

We find that a lower learning rate, such as 2e-5, is necessary to make BERT overcome the catastrophic forgetting problem. With an aggressive learn rate of 4e-4, the training set fails to converge.

Presumably this is why the BERT paper https://arxiv.org/abs/1810.04805 used 5e-5, 4e-5, 3e-5, and 2e-5 for fine-tuning.

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set.

Note that the base model pre-training itself used a higher learning rate.

  • bert-base-uncased - pretraining https://huggingface.co/bert-base-uncased#pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
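For reference, a minimal sketch contrasting the two regimes in Keras, using the values quoted above (the optimizer objects here are illustrative only):

import tensorflow as tf

# Pre-training of the base model used a comparatively high learning rate (1e-4 per the model card).
pretraining_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)

# Fine-tuning should use a much smaller learning rate (2e-5 to 5e-5) to avoid catastrophic forgetting.
fine_tuning_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)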

The 1st approach is described below as part of the 3rd approach.

FYI: TFDistilBertModel https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertmodel is the bare base model named distilbert.

Model: "tf_distil_bert_model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
distilbert (TFDistilBertMain multiple                  66362880  
=================================================================
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0

2nd Approach

Huggingface takes the 2nd approach, as in Fine-tuning with native PyTorch/TensorFlow https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-native-pytorch-tensorflow, where TFDistilBertForSequenceClassification has added the custom classification layer classifier on top of the base distilbert model, with the base trainable. The small learning rate requirement applies here as well, to avoid catastrophic forgetting.

import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)

Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_59 (Dropout)         multiple                  0         
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010  <--- All parameters are trainable
Non-trainable params: 0

Implementation of the 2nd Approach

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertForSequenceClassification,
)


DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512
LEARNING_RATE = 5e-5
BATCH_SIZE = 16
NUM_EPOCHS = 3


# --------------------------------------------------------------------------------
# Tokenizer
# --------------------------------------------------------------------------------
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )

# --------------------------------------------------------------------------------
# Load data
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())  # number of target classes

train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)

# --------------------------------------------------------------------------------
# Prepare TF dataset
# --------------------------------------------------------------------------------
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label
)).shuffle(1000).batch(BATCH_SIZE).prefetch(1)
validation_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),
    validation_label
)).batch(BATCH_SIZE).prefetch(1)

# --------------------------------------------------------------------------------
# training
# --------------------------------------------------------------------------------
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=NUM_LABELS
)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(
    x=train_dataset,             # the tf.data.Dataset already yields batches
    y=None,
    validation_data=validation_dataset,
    epochs=NUM_EPOCHS,
)
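Once training finishes, a short usage sketch for classifying new text could look like the following. This is hypothetical and not part of the original code; the example sentences are arbitrary, and the logits attribute assumes a recent transformers version.

# Hypothetical inference example reusing the tokenize() helper and model defined above.
texts = ["sample text to classify", "another sample text"]
inputs = dict(tokenize(texts))              # input_ids and attention_mask tensors
outputs = model(inputs)                     # TFSequenceClassifierOutput
predictions = tf.math.argmax(outputs.logits, axis=-1)
print(predictions.numpy())                  # predicted category indices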

3rd Approach

Basics

Note that the images were taken from A Visual Guide to Using BERT for the First Time http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/ and modified.

Tokenizer

The Tokenizer https://huggingface.co/transformers/main_classes/tokenizer.html generates an instance of BatchEncoding, which can be used like a Python dictionary and serves as the input to the BERT model.

  • BatchEncoding https://huggingface.co/transformers/main_classes/tokenizer.html#batchencoding

Holds the output of the encode_plus() and batch_encode() methods (tokens, attention_masks, etc).
This class is derived from a Python dictionary and can be used as a dictionary. In addition, this class exposes utility methods to map from word/character space to token space.

Parameters

  • data (dict) – Dictionary of lists/arrays/tensors returned by the encode/batch_encode methods ('input_ids', 'attention_mask', etc.).

The data attribute of the class holds the generated tokens, which include the input_ids and attention_mask elements.
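A small sketch of what this looks like in practice (the sample text is arbitrary):

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
encoding = tokenizer("Hello, world!", return_tensors="tf")

print(encoding.keys())         # dict_keys(['input_ids', 'attention_mask'])
print(encoding['input_ids'])   # token indices, accessed like a dictionary entry
print(encoding.data)           # the underlying data dictionary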

input_ids

  • Input IDs https://huggingface.co/transformers/glossary.html#input-ids

The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of the tokens building the sequences that will be used as input by the model.

attention_mask

  • Attention mask https://huggingface.co/transformers/glossary.html#attention-mask

This argument indicates to the model which tokens should be attended to, and which should not.

If the attention_mask is 0, the token id is ignored. For instance, if a sequence is padded to adjust the sequence length, the padded tokens should be ignored, hence their attention_mask is 0.
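For example, padding a batch of sentences of different lengths produces zeros in the attention_mask at the padded positions (a minimal sketch; the sentences are arbitrary):

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
batch = tokenizer(
    ["a short sentence", "a noticeably longer sentence that needs no padding here"],
    padding='longest',       # pad the shorter sequence up to the longest one in the batch
    return_tensors="tf",
)
print(batch['attention_mask'])  # padded positions of the shorter sentence are 0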

Special Tokens

BertTokenizer adds special tokens, enclosing a sequence with [CLS] and [SEP]. [CLS] represents Classification and [SEP] separates sequences. For Question Answering or Paraphrase tasks, [SEP] separates the two sentences to compare.

BertTokenizer https://huggingface.co/transformers/model_doc/bert.html#berttokenizer

  • cls_token (str, optional, defaults to "[CLS]")
    The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
  • sep_token (str, optional, defaults to "[SEP]")
    The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

A Visual Guide to Using BERT for the First Time http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/ shows the tokenization.
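A small sketch showing the special tokens added by the tokenizer (the sentence pair is arbitrary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
ids = tokenizer("hello world", "how are you")['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'hello', 'world', '[SEP]', 'how', 'are', 'you', '[SEP]']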

[CLS]

The embedding vector for [CLS] in the output of the base model's final layer represents the classification learned by the base model. Hence, feed the embedding vector of the [CLS] token into the classification layer added on top of the base model.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.

The model structure is illustrated in the figure below.

Vector Size

In the distilbert-base-uncased model, each token is embedded into a vector of size 768. The shape of the output from the base model is (batch_size, max_sequence_length, embedding_vector_size=768). This accords with the BERT paper regarding the BERT/BASE model (as indicated in the name distilbert-base-uncased).

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805

BERT/BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT/LARGE (L=24, H=1024, A=16, Total Parameters=340M).
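A minimal sketch verifying the output shape and extracting the [CLS] vector (the input sentence and the max_length value are arbitrary):

from transformers import DistilBertTokenizerFast, TFDistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
base = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

inputs = dict(tokenizer(
    ["check the output shape"],
    padding='max_length', max_length=128, truncation=True, return_tensors="tf"
))
outputs = base(inputs)
print(outputs.last_hidden_state.shape)              # (1, 128, 768)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # (1, 768) - the [CLS] vector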

Base Model - TFDistilBertModel

  • Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379

The TFDistilBertModel class instantiates the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification, which do have an added classification head).

We do not want any task-specific head attached because we simply want the pre-trained weights of the base model to provide a general understanding of the English language, and it will be our job to add our own classification head during the fine-tuning process in order to help the model distinguish between toxic comments.

TFDistilBertModel generates an instance of TFBaseModelOutput whose last_hidden_state parameter is the output from the model's last layer.

TFBaseModelOutput([(
    'last_hidden_state',
    <tf.Tensor: shape=(batch_size, sequence_length, 768), dtype=float32, numpy=array([[[...]]], dtype=float32)>
)])
  • TFBaseModelOutput https://huggingface.co/transformers/main_classes/output.html#tfbasemodeloutput

Parameters

  • last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden states at the output of the last layer of the model.

Implementation

Python modules

import datetime

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertModel,
)

Configuration

TIMESTAMP = datetime.datetime.now().strftime("%Y%b%d%H%M").upper()

DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'

MAX_SEQUENCE_LENGTH = 512   # Max length allowed for BERT is 512.
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())  # requires raw_train loaded first (see Data Allocation below)

MODEL_NAME = 'distilbert-base-uncased'
NUM_BASE_MODEL_OUTPUT = 768

# Flag to freeze base model
FREEZE_BASE = True

# Flag to add custom classification heads
USE_CUSTOM_HEAD = True
if not USE_CUSTOM_HEAD:
    # Make the base trainable when no classification head exists.
    FREEZE_BASE = False


BATCH_SIZE = 16
NUM_EPOCHS = 3
LEARNING_RATE = 1e-2 if FREEZE_BASE else 5e-5
L2 = 0.01

Tokenizer

tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )

Input Layer

The base model expects input_ids and attention_mask, whose shape is (max_sequence_length,). Generate Keras Tensors for each of them with the Input layer.

# Inputs for token indices and attention masks
input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')

Base Model Layer

Generate the output from the base model. The base model produces a TFBaseModelOutput. Feed the embedding of [CLS] to the next layer.

base = TFDistilBertModel.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

# Freeze the base model weights.
if FREEZE_BASE:
    for layer in base.layers:
        layer.trainable = False
    base.summary()

# [CLS] embedding is last_hidden_state[:, 0, :]
output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]

Classification Layer

if USE_CUSTOM_HEAD:
    # -------------------------------------------------------------------------------
    # Classification layer 01
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dropout(
        rate=0.15,
        name="01_dropout",
    )(output)
    
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="01_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="01_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="01_relu"
    )(output)

    # --------------------------------------------------------------------------------
    # Classification layer 02
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="02_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="02_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="02_relu"
    )(output)

Softmax Layer

output = tf.keras.layers.Dense(
    units=NUM_LABELS,
    kernel_initializer='glorot_uniform',
    kernel_regularizer=tf.keras.regularizers.l2(l2=L2),
    activation='softmax',
    name="softmax"
)(output)

Final Custom Model

name = f"{TIMESTAMP}_{MODEL_NAME.upper()}"
model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output, name=name)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    metrics=['accuracy']
)
model.summary()
---
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_ids (InputLayer)          [(None, 256)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 256)]        0                                            
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880    input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None, 768)          0           tf_distil_bert_model[1][0]       
__________________________________________________________________________________________________
01_dropout (Dropout)            (None, 768)          0           tf.__operators__.getitem_1[0][0] 
__________________________________________________________________________________________________
01_dense_relu_no_regularizer (D (None, 768)          590592      01_dropout[0][0]                 
__________________________________________________________________________________________________
01_bn (BatchNormalization)      (None, 768)          3072        01_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
01_relu (Activation)            (None, 768)          0           01_bn[0][0]                      
__________________________________________________________________________________________________
02_dense_relu_no_regularizer (D (None, 768)          590592      01_relu[0][0]                    
__________________________________________________________________________________________________
02_bn (BatchNormalization)      (None, 768)          3072        02_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
02_relu (Activation)            (None, 768)          0           02_bn[0][0]                      
__________________________________________________________________________________________________
softmax (Dense)                 (None, 2)            1538        02_relu[0][0]                    
==================================================================================================
Total params: 67,551,746
Trainable params: 1,185,794
Non-trainable params: 66,365,952   <--- Base BERT model is frozen

Data Allocation

# --------------------------------------------------------------------------------
# Split data into training and validation
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)

# X = dict(tokenize(train_data))
# Y = tf.convert_to_tensor(train_label)
X = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label
)).batch(BATCH_SIZE).prefetch(1)

V = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),  # Convert BatchEncoding instance to dictionary
    validation_label
)).batch(BATCH_SIZE).prefetch(1)

Train

# --------------------------------------------------------------------------------
# Train the model
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
# Input data x can be a dict mapping input names to the corresponding array/tensors, 
# if the model has named inputs. Beware of the "names". y should be consistent with x 
# (you cannot have Numpy inputs and tensor targets, or inversely). 
# --------------------------------------------------------------------------------
history = model.fit(
    x=X,      # tf.data.Dataset yielding ({'input_ids', 'attention_mask'}, label) batches
    y=None,   # labels are provided by the dataset
    epochs=NUM_EPOCHS,
    validation_data=V,
)

To implement the 1st approach, change the configuration as below.

USE_CUSTOM_HEAD = False

FREEZE_BASE then changes to False and LEARNING_RATE to 5e-5, and the script will run further pre-training on the base BERT model.
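In other words, the effective configuration for the 1st approach becomes:

USE_CUSTOM_HEAD = False   # no custom classification head
FREEZE_BASE = False       # train the base BERT model itself
LEARNING_RATE = 5e-5      # small learning rate to avoid catastrophic forgetting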

Save the Model

For the 3rd approach, saving the model causes issues. The save_pretrained https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.save_pretrained method of the Huggingface model cannot be used, because the model is not a direct subclass of the Huggingface PreTrainedModel https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.

Keras save_model https://www.tensorflow.org/api_docs/python/tf/keras/models/save_model causes an error with the default save_traces=True, and a different error with save_traces=False when loading the model with Keras load_model https://www.tensorflow.org/api_docs/python/tf/keras/models/load_model.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-71-01d66991d115> in <module>()
----> 1 tf.keras.models.load_model(MODEL_DIRECTORY)
 
11 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in _unable_to_call_layer_due_to_serialization_issue(layer, *unused_args, **unused_kwargs)
    865       'recorded when the object is called, and used when saving. To manually '
    866       'specify the input shape/dtype, decorate the call function with '
--> 867       '`@tf.function(input_signature=...)`.'.format(layer.name, type(layer)))
    868 
    869 
 
ValueError: Cannot call custom layer tf_distil_bert_model of type <class 'tensorflow.python.keras.saving.saved_model.load.TFDistilBertModel'>, because the call function was not serialized to the SavedModel.Please try one of the following methods to fix this issue:
 
(1) Implement `get_config` and `from_config` in the layer/model class, and pass the object to the `custom_objects` argument when loading the model. For more details, see: https://www.tensorflow.org/guide/keras/save_and_serialize
 
(2) Ensure that the subclassed model or layer overwrites `call` and not `__call__`. The input shape and dtype will be automatically recorded when the object is called, and used when saving. To manually specify the input shape/dtype, decorate the call function with `@tf.function(input_signature=...)`.

Only the Keras Model save_weights https://www.tensorflow.org/api_docs/python/tf/keras/Model#save_weights approach works, as far as I have tested.
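A minimal sketch of that weights-only workaround (the directory path and the build_model() helper that would rebuild the architecture above are assumptions):

# Save only the weights of the custom model (path is arbitrary).
model.save_weights("./saved/model_weights")

# To restore, rebuild the exact same architecture (input layers, base model,
# classification head, softmax) and load the weights into it.
restored_model = build_model()   # hypothetical helper repeating the construction steps above
restored_model.load_weights("./saved/model_weights")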

Experiments

As far as I have tested with the Toxic Comment Classification Challenge https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, the 1st approach gave better recall (identifying true toxic comments and true non-toxic comments). The code can be accessed below. Please provide corrections/suggestions if any.

  • Code for the 1st and 3rd approach https://nbviewer.jupyter.org/github/omontasama/nlp-huggingface/blob/main/fine_tuning/huggingface_fine_tuning.ipynb

Related

  • BERT Document Classification Tutorial with Code https://www.youtube.com/watch?v=_eSGWNqKeeY - fine-tuning with TFDistilBertForSequenceClassification and PyTorch
  • Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379 - fine-tuning with TFDistilBertModel