Reproducing the GPT v1 Model with TensorFlow


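OpenAI's ChatGPT showed the potential of artificial general intelligence, which prompted me to study the GPT papers. OpenAI's 2018 paper, Improving Language Understanding by Generative Pre-Training, introduced the first version of GPT, and I reproduced it in TensorFlow based on that paper.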



from datasets import load_dataset
dataset = load_dataset("bookcorpusopen", split="train")

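This dataset contains 17,868 books in total, each with a title field and a text field; we train on the text field. According to the GPT paper, the text is tokenized with BPE. The Hugging Face NLP Course has an article, Byte-Pair Encoding tokenization, that explains how BPE works and how it is trained. Here I simply use the pretrained GPT tokenizer from Hugging Face.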

from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

block_size = 512  # context length used in the GPT paper

def tokenize_function(examples):
    # Tokenize every book in the batch.
    token_ids = [tokenizer(text)["input_ids"] for text in examples["text"]]
    chunk = block_size + 1  # 512 inputs plus one extra token for the shifted labels
    result = []
    for ids in token_ids:
        # Drop the tail that does not fill a whole chunk.
        usable_length = (len(ids) // chunk) * chunk
        result.extend(ids[j:j + chunk] for j in range(0, usable_length, chunk))
    return {"token_ids": result}

# load_dataset(..., split="train") already returns the train split.
ds_test = dataset.select(range(10000))

tokenized_datasets = ds_test.map(
    tokenize_function, batched=True, num_proc=8, remove_columns=["title", "text"], batch_size=100
)


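The code above converts the text of each book into token ids with the tokenizer and stores every 513 token ids as one record. The GPT paper trains on sequences of 512 tokens, so during training the first 512 tokens of each record serve as the input and tokens 2 through 513 serve as the labels. Finally, the processed dataset is saved locally.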
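To make the chunking and the one-token shift concrete, here is a small dependency-free sketch. The helper names chunk_tokens and split_input_label are mine, not part of the original code, and block_size=4 is a toy stand-in for the 512 used in this post.

```python
def chunk_tokens(ids, block_size):
    """Split a token list into records of block_size + 1 tokens, dropping the remainder."""
    chunk = block_size + 1
    usable = (len(ids) // chunk) * chunk
    return [ids[i:i + chunk] for i in range(0, usable, chunk)]


def split_input_label(record):
    """Inputs are all tokens but the last; labels are the same tokens shifted by one."""
    return record[:-1], record[1:]


# Toy example: 12 token ids, block_size=4, so each record holds 5 ids.
records = chunk_tokens(list(range(12)), block_size=4)
inputs, labels = split_input_label(records[0])
print(records)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(inputs)   # [0, 1, 2, 3]
print(labels)   # [1, 2, 3, 4]
```

The last two ids (10 and 11) are dropped because they do not fill a whole record, which mirrors what tokenize_function does with the tail of each book.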

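Because I train with a TensorFlow model, the dataset also has to be converted to the TensorFlow dataset format. Calling tokenized_datasets.to_tf_dataset does the conversion directly, but I found that reading from the resulting dataset is very slow. I therefore convert the dataset to TFRecord files first, which makes reading much faster. The code below writes every 100,000 records to one tfrecord file of roughly 100 MB.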

import tensorflow as tf
from tqdm import tqdm

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def serialize_example(token_ids):
    # Wrap the token ids of one record in a tf.train.Example and serialize it.
    feature = {
        'token_ids': _int64_feature(token_ids)
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

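records_num = 100000  # records per tfrecord file
count = 0
writer = None
# ds is the tokenized dataset produced in the previous step.
for record in tqdm(ds):
    if count%records_num == 0:
        writer = tf.io.TFRecordWriter("bookcorpus_"+str(count//records_num)+".tfrecords")
    writer.write(serialize_example(record['token_ids']))
    count += 1
    if count%records_num == 0:
        # The current file is full, so close it.
        writer.close()
        writer = None
if writer:
    writer.close()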


feature_description = {
    'token_ids': tf.io.FixedLenFeature([513], tf.int64)
}

def _parse_function(example_proto):
    # Parse the input `tf.Example` proto using the dictionary above.
    return tf.io.parse_single_example(example_proto, feature_description)

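import os

batch_size = 16  # the value I used for local training
data_dir = "/data/datasets/bookcorpus_tf/"
filenames = [os.path.join(data_dir, f) for f in os.listdir(data_dir)]
tf_ds = tf.data.TFRecordDataset(filenames)
# The steps after the first map are an assumed continuation of the pipeline:
# keep only the token ids and batch them, which is what the training loop expects.
tf_ds = tf_ds\
    .map(_parse_function, num_parallel_calls=tf.data.AUTOTUNE)\
    .map(lambda x: x['token_ids'])\
    .batch(batch_size, drop_remainder=True)\
    .prefetch(tf.data.AUTOTUNE)
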

data = next(iter(tf_ds))


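According to the paper, GPT uses only the Decoder part of the Transformer. An Encoder relates tokens to each other by looking at the full context on both sides, but text generation can only use the preceding text to predict the next token, so only the Decoder is suitable. The paper's architecture stacks 12 Decoder blocks, each with 12 attention heads.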

For a walkthrough of the Transformer model, see my earlier blog post, 基于Tensorflow实现一个Transformer翻译器 (implementing a Transformer translator with TensorFlow).
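The "only look at the preceding text" constraint is enforced with a look-ahead mask. As a rough illustration in plain Python (the function name look_ahead_mask is mine), a 1 marks a key position that a query position is not allowed to see:

```python
def look_ahead_mask(size):
    """Return a size x size matrix where 1 marks a future position that must be hidden."""
    return [[1 if key > query else 0 for key in range(size)]
            for query in range(size)]


mask = look_ahead_mask(4)
for row in mask:
    print(row)
# [0, 1, 1, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 0]
```

Row i is the query at position i: it may attend to positions 0..i and nothing after them. This is the same pattern that the TensorFlow version builds with 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0).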

First, define multi-head attention:

def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead)
    but it must be broadcastable for addition.
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.
    Returns:
      output, attention_weights
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self,*, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)
    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        return output, attention_weights
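To see what the masked attention computes, here is a single-head sketch in plain Python on a three-token toy input. It is only an illustration of the formula softmax(QK^T / sqrt(d_k) + mask * -1e9) V, not a replacement for the TensorFlow code; the function names and the toy vectors are my own.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(q, k, v, mask):
    """q, k, v: lists of vectors (seq_len x depth); mask[i][j] == 1 hides key j from query i."""
    depth = len(k[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        # Scaled dot product between query i and every key.
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(depth) for kj in k]
        # Masked positions get a very large negative logit, so softmax gives them ~0 weight.
        logits = [logit + mask[i][j] * -1e9 for j, logit in enumerate(logits)]
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][d] for j in range(len(v)))
                        for d in range(len(v[0]))])
    return outputs, weights


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mask = [[1 if j > i else 0 for j in range(3)] for i in range(3)]
out, w = masked_attention(q, k, v, mask)
print(w[0])    # the first token can only attend to itself
print(out[0])  # so its output equals v[0]
```

Every row of w sums to 1, and all weights above the diagonal are 0, so no token receives information from tokens that come after it.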

Next is the feed-forward layer:

def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

Define a decoder layer that combines the two layers above:

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self,*, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
    def call(self, x, training, look_ahead_mask):
        attn, attn_weights_block = self.mha(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn = self.dropout1(attn, training=training)
        out1 = self.layernorm1(attn + x)
        ffn_output = self.ffn(out1)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(ffn_output + out1)  # (batch_size, target_seq_len, d_model)
        return out2, attn_weights_block

Finally, define the GPT model, which contains 12 Decoder layers.

class Decoder(tf.keras.layers.Layer):
    def __init__(self,*, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = tf.reshape(tf.range(target_vocab_size-block_size, target_vocab_size), shape=[1, -1])
        self.dec_layers = [
            DecoderLayer(d_model=d_model, num_heads=num_heads, dff=dff, rate=rate)
            for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)
    def call(self, x, training, look_ahead_mask):
        #seq_len = tf.shape(x)[1]
        attention_weights = {}
        x = self.embedding(x)  # (batch_size, block_size, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.embedding(self.pos_encoding)
        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x, block1 = self.dec_layers[i](x, training, look_ahead_mask)
            attention_weights[f'decoder_layer{i+1}_block1'] = block1
        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

target_vocab_size = vocab_size + block_size

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

class Transformer(tf.keras.Model):
    def __init__(self,*, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):
        super().__init__()
        self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                               num_heads=num_heads, dff=dff,
                               target_vocab_size=target_vocab_size, rate=rate)
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)
    def call(self, inp, training):
        # Keras models prefer if you pass all your inputs in the first argument
        look_ahead_mask = self.create_masks(inp)
        dec_output, attention_weights = self.decoder(inp, training, look_ahead_mask)
        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
        return final_output, attention_weights
    def create_masks(self, tar):
        # Used in the 1st attention block in the decoder.
        # It is used to pad and mask future tokens in the input received by
        # the decoder.
        look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
        return look_ahead_mask
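# The arguments of the original call are not shown; the values below follow the
# configuration described in the GPT paper and are an assumption.
transformer = Transformer(
    num_layers=12, d_model=768, num_heads=12, dff=3072,
    target_vocab_size=target_vocab_size, rate=0.1)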
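One detail of the Decoder is easy to miss: it does not keep a separate positional table. The embedding table is enlarged to vocab_size + block_size rows, and the last block_size rows act as learned position embeddings. The sketch below only reproduces the index arithmetic with toy sizes; the helper name position_ids is mine.

```python
def position_ids(vocab_size, block_size):
    """Ids of the embedding rows reserved for positions 0..block_size-1."""
    target_vocab_size = vocab_size + block_size
    return list(range(target_vocab_size - block_size, target_vocab_size))


# Toy sizes; the post uses the tokenizer's vocabulary size and block_size = 512.
ids = position_ids(vocab_size=10, block_size=4)
print(ids)  # [10, 11, 12, 13]
```

Token ids occupy rows 0..vocab_size-1 and position ids occupy the rows after them, so the two never collide. It also means the output layer produces logits for the position rows, which is why the generation code slices the logits with [:vocab_size] before picking a token.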





loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)

train_loss = tf.keras.metrics.Mean(name='train_loss')


def accuracy_function(real, pred):
    accuracies = tf.equal(real, tf.argmax(pred, axis=2))
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    accuracies = tf.math.logical_and(mask, accuracies)
    accuracies = tf.cast(accuracies, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)

train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

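According to the paper, the model is optimized with the Adam optimizer. The learning rate increases linearly from 0 to 0.00025 over the first 2,000 batches and then follows a cosine decay, reaching 0 after 100 epochs. Recent versions of TensorFlow provide a CosineDecay schedule with warmup support that can be used directly: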

epoch_steps = 1680000//batch_size
epochs = 100
decay_steps = epoch_steps*epochs
initial_learning_rate = 0
warmup_steps = 2000
target_learning_rate = 0.00025
lr_warmup_decayed_fn = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate, decay_steps, warmup_target=target_learning_rate,
    warmup_steps=warmup_steps
)
optimizer = tf.keras.optimizers.Adam(lr_warmup_decayed_fn, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
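For intuition, the shape of this schedule can be written out by hand. The function below is my own sketch of linear warmup followed by cosine decay; the exact step accounting inside tf.keras.optimizers.schedules.CosineDecay may differ in detail, and decay_steps=100000 is just a toy value.

```python
import math


def warmup_cosine_lr(step, warmup_steps, decay_steps, target_lr, initial_lr=0.0):
    """Linear warmup to target_lr, then cosine decay to 0 over decay_steps."""
    if step < warmup_steps:
        return initial_lr + (target_lr - initial_lr) * step / warmup_steps
    # Fraction of the decay phase completed, clamped to 1 once the decay is over.
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return target_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 52000, 102000):
    print(step, warmup_cosine_lr(step, warmup_steps=2000, decay_steps=100000,
                                 target_lr=0.00025))
```

The printed values rise from 0 to the 0.00025 peak at step 2000, fall to about half of the peak midway through the decay, and reach 0 at the end.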


checkpoint_path = './checkpoints/train'
# Two trackable objects need to be saved: the model and the optimizer.
ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')


train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64)
]

# Fixing the input signature avoids retracing the graph for every new batch shape.
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    with tf.GradientTape() as tape:
        predictions, _ = transformer(inp, training = True)
        loss = loss_function(tar, predictions)
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
    # Update the running metrics that the training loop prints.
    train_loss(loss)
    train_accuracy(accuracy_function(tar, predictions))

import time

for epoch in range(epochs):
    start = time.time()
    # Each record holds 513 tokens: the first 512 are the input, the last 512 are the labels.
    for (batch, inputs) in enumerate(tf_ds):
        try:
            train_step(inputs[...,:-1], inputs[...,1:])
        except ValueError:
            # The original handler is not shown; skipping the batch is an assumption.
            continue
        if batch % 10 == 0:
            print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    if (epoch + 1) % 5 == 0:
        ckpt_save_path = ckpt_manager.save()
        print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')
    print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')


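On my local 2080 Ti GPU with 11 GB of memory, I set batch_size to 16, used 6 Decoder layers, and enabled half-precision training; about 10,000 batches are trained per hour. I trained 80,000 batches in total, which took 8 hours, and ended with a per-batch loss of about 3.5 and an accuracy of about 35%. According to the paper, the original model was trained for 100 epochs with a batch_size of 64 and 12 Decoder layers, on 8 P600 GPUs for 30 days. I also rented an 80 GB A100 on AutoDL to try it out. The A100 can train with the batch size of 64 and 12 Decoder layers described in the paper, but I did not train for long there; it was only a test, and the results were roughly the same as on the local 2080.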
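To put these numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes that the 1,680,000 used for epoch_steps in the learning-rate setup is the number of 513-token records in the training set; the other figures come from the paragraph above.

```python
records = 1_680_000       # assumed number of training records (from epoch_steps)
batch_size = 16           # local 2080 Ti setting
block_size = 512          # tokens per training sequence
batches_trained = 80_000  # total batches in the 8-hour run

batches_per_epoch = records // batch_size
epochs_completed = batches_trained / batches_per_epoch
tokens_seen = batches_trained * batch_size * block_size

print(batches_per_epoch)           # 105000
print(round(epochs_completed, 2))  # 0.76
print(tokens_seen)                 # 655360000
```

Under that assumption, the 8-hour local run covers roughly three quarters of a single epoch, compared with the 100 epochs described in the paper.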



import tensorflow as tf
from transformers import OpenAIGPTTokenizer
import tensorflow_text as tf_text
from tqdm import trange

model = tf.keras.models.load_model('saved_model/gpt1_model')
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
vocab_size = len(tokenizer.get_vocab())

input_sentence = "it was saturday night, the street"
token_id = tokenizer.encode(input_sentence)
token_len = len(token_id)
gen_seq_len = 50
block_size = 512
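for i in trange(token_len, gen_seq_len):
    # Pad the current sequence to block_size, the fixed length the model expects.
    input_data, mask = tf_text.pad_model_inputs(tf.reshape(tf.constant(token_id, tf.int64), [1,-1]), max_seq_length=block_size)
    prediction, _ = model(input_data, training=False)
    # Logits at the last real token; the slice drops the ids reserved for positions.
    next_token_logit = prediction[0, len(token_id)-1, :vocab_size]
    # Greedy decoding: always pick the most probable token.
    predict_token = tf.math.argmax(tf.math.softmax(next_token_logit)).numpy()
    token_id.append(int(predict_token))
print(tokenizer.decode(token_id))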


it was saturday night, the street was packed with people and the girls were in the car. 
 " what's up? " i asked. 
 " i'm not sure. " 
 " what? " 
 " i'm not sure. "



input_sentence = "it was saturday night, the street"
token_id = tokenizer.encode(input_sentence)
token_len = len(token_id)
gen_seq_len = 512
block_size = 512
import random

k = 5
for i in trange(token_len, gen_seq_len):
    input_data, mask = tf_text.pad_model_inputs(tf.reshape(tf.constant(token_id, tf.int64), [1,-1]), max_seq_length=block_size)
    prediction, _ = model(input_data, training=False)
    next_token_logit = prediction[0, len(token_id)-1, :vocab_size]
    next_token_prob = tf.nn.softmax(next_token_logit)
    # Keep the k most probable tokens and sample among them in proportion to their probabilities.
    next_token_topK = tf.math.top_k(next_token_prob, k=k, sorted=True)
    top_k_probs = next_token_topK.values.numpy()
    predict_token = random.choices(next_token_topK.indices.numpy(), top_k_probs/top_k_probs.sum())[0]
    token_id.append(int(predict_token))
print(tokenizer.decode(token_id))
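The difference between this loop and the earlier one is the selection rule: the first always takes the argmax, while this one samples among the top k tokens. A minimal pure-Python sketch of the two rules on a made-up five-token distribution (the helper names and the probabilities are mine, not model output):

```python
import random


def greedy_pick(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_pick(probs, k, rng=random):
    """Sample one index among the k most probable, weighted by renormalized probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]
    return rng.choices(top, weights=weights)[0]


probs = [0.05, 0.40, 0.30, 0.15, 0.10]
rng = random.Random(0)  # seeded only to make the demo repeatable
print(greedy_pick(probs))  # 1
samples = [top_k_pick(probs, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only ids among the three most likely tokens can appear
```

Greedy decoding is deterministic, so once the model enters a loop it tends to repeat itself. Top-k sampling trades some likelihood for variety while still excluding the unlikely tail of the distribution.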


it was saturday night, the street had already been deserted. 
 " you're not going to let him in, " i said, " he's not going to let me in, and i'm leaving. " 
 " i don't care. " he turned and looked at me with a smile. 
 i looked at him, confused. 
 " he said it was you. you don't know how he 'll react, " i said, and he looked down at my hands. 
 " i'm not going to hurt you. i don't want to. " he looked at me with a look of concern. 
 " no. i don't want you to hurt anyone. i want you to hurt someone, and i want him to hurt you and i want you to hurt someone. " 
 " no. " he said, " you're going to be okay. " 
 i looked at him, my eyes burning with tears and i nodded. " i want you to hurt someone, and i want you to feel safe. i want to hurt you. i want you to heal. " 
 " i know. " i turned and looked at my hands. " i want you to heal yourself, so you can rest. " 
 " no, i'm going to. " i looked at him with a smile. " it's not your fault. you're going to be all over me. i can't let you hurt anyone. " 
 " no, i can't do that. you're going to have to do that. i can't let you heal. " 
 he looked back at me, then back at me. " you're going to do this. " 
 " i can't. " 
 " i'm not going to do this. " he said, looking at me. i wanted to cry. he didn't want me to, and he wanted me to. i needed to get rid of him and he was going to do this again. 
 he took my arms and i hugged him back. we kissed and i kissed him back, but i didn't want him to stop. i wanted him to do it again. 
 # chapter 22 
 " hey, i'm sorry about that. " i said. " what are you going to do with him? " 
 " i have to do something. i'm not doing something. " he said. " what's up? " 
 " he's not going to help you, " i said, but he shook his head. " i do


Finally, all of my code is available in the repo gpt1_tf2: GPT1 implementation base on Tensorflow 2.13.0

