Reproducing the GPT v1 Model with TensorFlow

2023-11-07

OpenAI's ChatGPT has shown us the potential of general-purpose AI, so I dug into the GPT papers. OpenAI's 2018 paper Improving Language Understanding by Generative Pre-Training introduced the first version of GPT, and I reproduced it with TensorFlow based on that paper.

Downloading the Dataset

GPT was pre-trained on the BookCorpus dataset, and huggingface.co hosts a comparable dataset. The following code downloads it and shows the first record:

from datasets import load_dataset

# Load the train split of the open BookCorpus dataset
dataset = load_dataset("bookcorpusopen", split="train")
dataset[0]

The dataset contains 17,868 books in total. Each book has a title field and a text field, and we will train on text. As the GPT paper describes, the text is tokenized with BPE; the Hugging Face NLP Course has an article explaining how BPE works and how it is trained (Byte-Pair Encoding tokenization - Hugging Face NLP Course). Here I simply use the pretrained GPT tokenizer from Hugging Face.
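Before wiring the tokenizer into the pipeline, it is worth a quick standalone check of what BPE does (a minimal sketch; the exact subword splits depend on the pretrained merges):

from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
# BPE splits a rare word into smaller learned subword units
print(tokenizer.tokenize("unbelievably"))
# Each subword maps to an integer id that the model consumes
print(tokenizer("unbelievably")["input_ids"])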

from transformers import OpenAIGPTTokenizer
block_size=513

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def tokenize_function(examples):
    # Tokenize each book's text and cut the token ids into chunks of
    # block_size+1, dropping the tail that does not fill a whole chunk
    token_ids = [tokenizer(text) for text in examples["text"]]
    total_length = [len(t["input_ids"]) for t in token_ids]
    total_length = [(l//(block_size+1))*(block_size+1) for l in total_length]
    result = []

    for i in range(len(total_length)):
        result.extend([token_ids[i]["input_ids"][j:j+block_size+1] for j in range(0, total_length[i], block_size+1)])
    return {"token_ids": result}

ds_test = dataset.select(range(10000))

tokenized_datasets = ds_test.map(
    tokenize_function, batched=True, num_proc=8, remove_columns=["title", "text"], batch_size=100
)

tokenized_datasets.save_to_disk("data/bookcorpusopen_10000_512tokens")

In the code above, each book's text is converted to token ids with the tokenizer, and every 513 consecutive token ids are saved as one record. The GPT paper trains on sequences of 512 tokens, so at training time the first 512 of the 513 tokens serve as the input and tokens 2 through 513 serve as the labels. Finally the processed dataset is saved to disk.
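To make the input/label split concrete, this is how one 513-token record becomes a training pair (the same slicing the training loop below applies with inputs[...,:-1] and inputs[...,1:]):

record = list(range(513))   # one saved record of block_size+1 token ids
inputs = record[:-1]        # tokens 1..512: what the model sees
labels = record[1:]         # tokens 2..513: the target next token at each position
assert len(inputs) == len(labels) == 512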

Since I will train the model in TensorFlow, the dataset also has to be converted to TensorFlow dataset format. We could call tokenized_datasets.to_tf_dataset directly, but I found that reading from the converted dataset is very slow. So I first convert the dataset to TFRecord files, which speeds up reading considerably. The following code saves every 100,000 records to one tfrecord file, each file about 100 MB:

import tensorflow as tf
from tqdm import tqdm

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def serialize_example(token_ids):
    feature = {
        'token_ids': _int64_feature(token_ids)
    }

    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

records_num = 100000
count = 0
writer = None
for record in tqdm(tokenized_datasets):
    # Start a new TFRecord file every records_num records
    if count%records_num == 0:
        if writer:
            writer.close()
        writer = tf.io.TFRecordWriter("bookcorpus_"+str(count//records_num)+".tfrecords")
    writer.write(serialize_example(record['token_ids']))
    count += 1
if writer:
    writer.close()

After that we can read the data back:

feature_description = {
    'token_ids': tf.io.FixedLenFeature([513], tf.int64)
}

def _parse_function(example_proto):
    # Parse the input `tf.Example` proto using the dictionary above.
    return tf.io.parse_single_example(example_proto, feature_description)

import os

batch_size = 16  # what fits on my local GPU, see the training results below

data_dir = "/data/datasets/bookcorpus_tf/"
filenames = os.listdir(data_dir)
filenames = [data_dir+f for f in filenames]
tf_ds = tf.data.TFRecordDataset(filenames)
tf_ds = tf_ds\
    .map(_parse_function, num_parallel_calls=tf.data.AUTOTUNE)\
    .shuffle(buffer_size=batch_size*100)\
    .batch(batch_size)\
    .prefetch(tf.data.AUTOTUNE)

We can check a batch of the data by taking one record and decoding it with the tokenizer:

data = next(iter(tf_ds))
tokenizer.decode(data['token_ids'][0])

The result is as follows:

"i was sitting to the left of alex, and tinker was to his right, with lilla sitting to the right of her. tinker was leaning over to alex, chatting away, when her hand suddenly gently slid along his thigh, and onto his crotch. oops! now that was a surprise. unexpected, to say the least. right out of the blue. alex looked at me in desperation, and i initially laughed. we hadn't really thought about any sexual side to the situation, we had just considered the girls to be friends. plus, honestly, while they were very nice people, they weren't at all our cup of tea, so to speak. besides, what exactly was the situation down there, in the lady garden? had operations been done? would we end up comparing who had the bigger penis? we didn't know, and we didn't want to find out. it was time to get out of there. we both went into time - to - go mode. \n'you know, we got ta get up early in the morning, so we better hit the road.'i yelled to all and sundry. \n alex was already on his feet, yelling out something similar as well. we started heading to the door. the girls were calling after us, but i couldn't hear what they were saying, over the music, and the blood pumping in my head. i just waved back at them. \n it was with some relief that we found ourselves back out in the industrial wasteland. \n'fuck, what a surprise!'alex said as we ran off in the darkness.'i didn't see that coming. i hope they won't be upset with us. i thought they had sex for money, anyway.'\n'maybe on their night off they like a bit of young cock? fucked if i know. you should have seen the look on your face, man!'\n'shit, i don't want to think about it.'\n'don't worry, i won't be letting you forget this one.'\n we were pretty relieved, and happy to be out of that situation, and pretty much laughed about it all the way home. i would be pulling alex's leg over that one for a long time to come. mind you, probably lilla hadn't been far away from making a move on me, if it had all gone successfully with tinker and alex. sometimes it all comes down to who is the person closest to the door, or, as in that case, who is in the"

The dataset is now ready.

Building the GPT Model

As the paper describes, GPT uses only the decoder part of the Transformer. The encoder builds the relations between tokens by looking at the full context of the training data, but text generation can only predict the next token from the preceding text, so only the decoder is applicable. The model stacks 12 decoder layers, each containing 12 attention heads.
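To make the "only look at the preceding text" constraint concrete, here is a standalone look at the look-ahead (causal) mask used in the model code below; positions marked 1 are excluded from attention:

import tensorflow as tf

# Strictly upper-triangular ones: position i may not attend to any position j > i
mask = 1 - tf.linalg.band_part(tf.ones((4, 4)), -1, 0)
print(mask)
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]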

For a walkthrough of the Transformer model itself, see my earlier CSDN blog post 基于Tensorflow实现一个Transformer翻译器.

First, define multi-head attention:

def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead)
    but it must be broadcastable for addition.
    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.
    Returns:
    output, attention_weights
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
 
    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
 
    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
 
    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
 
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
 
    return output, attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self,*, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
 
        assert d_model % self.num_heads == 0
 
        self.depth = d_model // self.num_heads
 
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
 
        self.dense = tf.keras.layers.Dense(d_model)
 
    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
 
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
 
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
 
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
 
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)
 
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
 
        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
 
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
 
        return output, attention_weights

Next is the feed-forward layer:

def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

Define a decoder layer that combines the two layers above:

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self,*, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)
 
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
 
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
 
    def call(self, x, training, look_ahead_mask):
        attn, attn_weights_block = self.mha(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn = self.dropout1(attn, training=training)
        out1 = self.layernorm1(attn + x)
 
        ffn_output = self.ffn(out1)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(ffn_output + out1)  # (batch_size, target_seq_len, d_model)
 
        return out2, attn_weights_block

Finally, define the GPT model itself, which stacks 12 of these decoder layers.

class Decoder(tf.keras.layers.Layer):
    def __init__(self,*, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):
        super(Decoder, self).__init__()
 
        self.d_model = d_model
        self.num_layers = num_layers
 
        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        # Learned positional embeddings: reuse the token embedding table by
        # assigning the last block_size ids of the enlarged vocabulary to the positions
        self.pos_encoding = tf.reshape(tf.range(target_vocab_size-block_size, target_vocab_size), shape=[1, -1])
 
        self.dec_layers = [
            DecoderLayer(d_model=d_model, num_heads=num_heads, dff=dff, rate=rate)
            for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)
 
    def call(self, x, training, look_ahead_mask):
        attention_weights = {}
 
        x = self.embedding(x)  # (batch_size, block_size, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.embedding(self.pos_encoding)
 
        x = self.dropout(x, training=training)
 
        for i in range(self.num_layers):
            x, block1 = self.dec_layers[i](x, training, look_ahead_mask)
 
            attention_weights[f'decoder_layer{i+1}_block1'] = block1
 
        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

vocab_size = len(tokenizer.get_vocab())
# Enlarge the vocabulary with block_size extra ids that act as position tokens
target_vocab_size = vocab_size + block_size

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

class Transformer(tf.keras.Model):
    def __init__(self,*, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):
        super().__init__()
 
        self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                               num_heads=num_heads, dff=dff,
                               target_vocab_size=target_vocab_size, rate=rate)
 
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)
 
    def call(self, inp, training):
        # Keras models prefer if you pass all your inputs in the first argument
        look_ahead_mask = self.create_masks(inp)
        dec_output, attention_weights = self.decoder(inp, training, look_ahead_mask)
        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
        return final_output, attention_weights
 
    def create_masks(self, tar):
        # Used in the 1st attention block in the decoder.
        # It is used to pad and mask future tokens in the input received by
        # the decoder.
        look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
 
        return look_ahead_mask
 
# Hyperparameters from the GPT-1 paper
num_layers = 12
d_model = 768
num_heads = 12
dff = 3072
dropout_rate = 0.1

transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    target_vocab_size=target_vocab_size,
    rate=dropout_rate)

To explain the code above: the input token sequence goes through an embedding layer that maps each token to a 768-dimensional vector, and position information then has to be added to those vectors. As the paper explains, GPT does not use the sinusoidal position encoding but learned position embeddings instead. For example, if the vocabulary holds 40,000 words (40,000 tokens) and the input sequence has 512 tokens, we add 512 new token ids 40000-40511, one for each position of the input sequence, and add each position token's embedding vector to the input embedding at that position, so that the input carries every token's position information.
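Here is the same trick in isolation, with toy sizes (a minimal sketch assuming a vocabulary of 10 tokens and sequences of 4 positions):

import tensorflow as tf

vocab, positions, d = 10, 4, 8
# One embedding table covers both the real tokens (ids 0..9)
# and the position tokens (ids 10..13)
emb = tf.keras.layers.Embedding(vocab + positions, d)

token_ids = tf.constant([[3, 1, 4, 1]])                      # (batch, seq)
pos_ids = tf.range(vocab, vocab + positions)[tf.newaxis, :]  # [[10, 11, 12, 13]]

x = emb(token_ids) + emb(pos_ids)  # token embedding + learned position embedding
print(x.shape)                     # (1, 4, 8)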

Training the Model

To train the model, we need to define a loss function that computes the model's loss value.

Since the model predicts the next token given a number of preceding tokens, we compute the sparse categorical cross-entropy of that prediction, as in the following code:

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
 
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
 
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
 
    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)

train_loss = tf.keras.metrics.Mean(name='train_loss')

To track the model's prediction performance during training, we also define an accuracy metric:

def accuracy_function(real, pred):
    accuracies = tf.equal(real, tf.argmax(pred, axis=2))
 
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    accuracies = tf.math.logical_and(mask, accuracies)
 
    accuracies = tf.cast(accuracies, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)

train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

As the paper describes, the model is optimized with Adam; the learning rate is increased from 0 to 0.00025 over the first 2000 batches of training and then follows a cosine decay, reaching 0 after 100 epochs. Newer versions of TensorFlow provide a CosineDecay schedule with built-in warmup that we can call directly:

epoch_steps = 1680000//batch_size
epochs = 100
decay_steps = epoch_steps*epochs
initial_learning_rate = 0
warmup_steps = 2000
target_learning_rate = 0.00025
lr_warmup_decayed_fn = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate, decay_steps, warmup_target=target_learning_rate,
    warmup_steps=warmup_steps
)
 
optimizer = tf.keras.optimizers.Adam(lr_warmup_decayed_fn, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

During training we want to save intermediate results, so we define a checkpoint:

checkpoint_path = './checkpoints/train'
 
# The two trackable objects we need to save
ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)
 
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
 
# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')

Define a training function that computes the loss and calls the optimizer to update the model:

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64)
]
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    with tf.GradientTape() as tape:
        predictions, _ = transformer(inp, training = True)
        loss = loss_function(tar, predictions)
 
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
 
    train_loss(loss)
    train_accuracy(accuracy_function(tar, predictions))

Finally we can train the model. During training we print the loss and prediction accuracy every 100 batches, and save the training state with a checkpoint every 5 epochs.

import time

for epoch in range(epochs):
    start = time.time()

    train_loss.reset_states()
    train_accuracy.reset_states()

    # Each record holds 513 token ids: the first 512 are the input,
    # the last 512 (shifted by one) are the labels
    for (batch, inputs) in enumerate(tf_ds):
        train_step(inputs['token_ids'][..., :-1], inputs['token_ids'][..., 1:])

        if batch % 100 == 0:
            print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    
    if (epoch + 1) % 5 == 0:
        ckpt_save_path = ckpt_manager.save()
        print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')
    
    print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
 
    print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')

Training Results

On my local RTX 2080 Ti with 11 GB of VRAM, with batch_size set to 16, 6 decoder layers, and mixed precision enabled, training took about one hour per 10,000 batches. I trained for 80,000 batches in total over 8 hours; the final loss was around 3.5 per batch with an accuracy of 35%. According to the paper, OpenAI trained for 100 epochs with a batch size of 64 and 12 decoder layers on 8 P600 GPUs for 30 days. I also rented an 80 GB A100 on AutoDL, which can fit the paper's configuration of batch size 64 and 12 decoders; I only ran a short test rather than a long training run, and the results were roughly in line with the local 2080 runs.
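For reference, this is roughly how mixed precision can be switched on in Keras (a sketch using the standard tf.keras.mixed_precision API, not the exact code I ran; the loss-scale wrapper guards against float16 gradient underflow in a custom training loop):

import tensorflow as tf

# Run compute in float16 while keeping variables in float32
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Wrap the optimizer for dynamic loss scaling; inside train_step one would use
# optimizer.get_scaled_loss(loss) before tape.gradient and
# optimizer.get_unscaled_gradients(...) before apply_gradients
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(lr_warmup_decayed_fn, beta_1=0.9, beta_2=0.98, epsilon=1e-9))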

I observed an interesting phenomenon during training: sometimes the loss would suddenly spike and then never come back down. From the material I found online, this seems to be a common occurrence in large-model training; Hugging Face and Meta both ran into it while training large models, and their fix was simple: once the loss spike is spotted, resume training from a previously saved checkpoint. Some sources also suggest that when mixed precision is enabled, the epsilon=1e-8 value used by the Adam optimizer in the original paper is unsuitable and should be adjusted; I tried training with 1e-4, but it did not seem to solve the problem. This is something to investigate further.
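Rolling back is straightforward with the CheckpointManager set up above, since it retains the last five checkpoints (a sketch; which checkpoint predates the spike is a judgment call):

# ckpt_manager.checkpoints lists the retained checkpoint paths, oldest first;
# restore one from before the loss spike and continue training from there
print(ckpt_manager.checkpoints)
ckpt.restore(ckpt_manager.checkpoints[0])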

Finally, we can use the following code to test whether the model can generate text from a prompt we give it.

import tensorflow as tf
from transformers import OpenAIGPTTokenizer
import tensorflow_text as tf_text
from tqdm import trange

model = tf.keras.models.load_model('saved_model/gpt1_model')
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
vocab_size = len(tokenizer.get_vocab())

input_sentence = "it was saturday night, the street"
token_id = tokenizer.encode(input_sentence)
token_len = len(token_id)
gen_seq_len = 50
block_size = 512
for i in trange(token_len, gen_seq_len):
    # Pad the current token ids to the model's fixed input length
    input_data, mask = tf_text.pad_model_inputs(tf.reshape(tf.constant(token_id, tf.int64), [1,-1]), max_seq_length=block_size)
    prediction, _ = model(input_data, training=False)
    # Logits at the last real position, restricted to the real vocabulary
    # (ids beyond vocab_size are the position tokens)
    next_token_logit = prediction[0, len(token_id)-1, :vocab_size]
    predict_token = tf.math.argmax(tf.math.softmax(next_token_logit)).numpy()
    token_id.append(predict_token)
print(tokenizer.decode(token_id))

The result is as follows:

it was saturday night, the street was packed with people and the girls were in the car. 
 " what's up? " i asked. 
 " i'm not sure. " 
 " what? " 
 " i'm not sure. "

The generated text continues our sentence quite well, though what comes after that sentence is not very meaningful.

We can improve this further. The model actually assigns a probability to every candidate next token, and so far we always take the single most likely one, so the output never varies. Instead, we can take, say, the 5 most likely tokens, sample one of them according to their probability distribution, and then keep predicting the next token by the same rule. The improved code is as follows:

import random

input_sentence = "it was saturday night, the street"
token_id = tokenizer.encode(input_sentence)
token_len = len(token_id)
gen_seq_len = 512
block_size = 512
k = 5
for i in trange(token_len, gen_seq_len):
    input_data, mask = tf_text.pad_model_inputs(tf.reshape(tf.constant(token_id, tf.int64), [1,-1]), max_seq_length=block_size)
    prediction, _ = model(input_data, training=False)
    next_token_logit = prediction[0, len(token_id)-1, :vocab_size]
    next_token_prob = tf.nn.softmax(next_token_logit)
    # Keep the k most likely tokens and sample one according to their probabilities
    next_token_topK = tf.math.top_k(next_token_prob, k=k, sorted=True)
    predict_token = random.choices(next_token_topK.indices.numpy(), next_token_topK.values.numpy()/next_token_topK.values.numpy().sum())[0]
    token_id.append(predict_token)
print(tokenizer.decode(token_id))

The regenerated text is as follows:

it was saturday night, the street had already been deserted. 
 " you're not going to let him in, " i said, " he's not going to let me in, and i'm leaving. " 
 " i don't care. " he turned and looked at me with a smile. 
 i looked at him, confused. 
 " he said it was you. you don't know how he 'll react, " i said, and he looked down at my hands. 
 " i'm not going to hurt you. i don't want to. " he looked at me with a look of concern. 
 " no. i don't want you to hurt anyone. i want you to hurt someone, and i want him to hurt you and i want you to hurt someone. " 
 " no. " he said, " you're going to be okay. " 
 i looked at him, my eyes burning with tears and i nodded. " i want you to hurt someone, and i want you to feel safe. i want to hurt you. i want you to heal. " 
 " i know. " i turned and looked at my hands. " i want you to heal yourself, so you can rest. " 
 " no, i'm going to. " i looked at him with a smile. " it's not your fault. you're going to be all over me. i can't let you hurt anyone. " 
 " no, i can't do that. you're going to have to do that. i can't let you heal. " 
 he looked back at me, then back at me. " you're going to do this. " 
 " i can't. " 
 " i'm not going to do this. " he said, looking at me. i wanted to cry. he didn't want me to, and he wanted me to. i needed to get rid of him and he was going to do this again. 
 he took my arms and i hugged him back. we kissed and i kissed him back, but i didn't want him to stop. i wanted him to do it again. 
 # chapter 22 
 " hey, i'm sorry about that. " i said. " what are you going to do with him? " 
 " i have to do something. i'm not doing something. " he said. " what's up? " 
 " he's not going to help you, " i said, but he shook his head. " i do

This time the generated text is much more interesting. It is not entirely logical, but on the whole it does read like the style of a novel.

Finally, all my code is in the repo gpt1_tf2: GPT1 implementation base on Tensorflow 2.13.0.

本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

基于Tensorflow来重现GPT v1模型 的相关文章

随机推荐

  • api的封装

    这是以 cnode中文社区的api 为例 值得一提的是有些请求可能要先进行url的编码 这是简易版的 use strict api 路径 get topics 主题首页 get topic id 主题详情 post accesstoken
  • 基于libevent, libuv和android Looper不断演进socket编程

    最近在做websocket porting的工作中 需要实现最底层socket读和写 基于同步读 libevent libuv和android Looper都写了一套 从中体会不少 1 同步阻塞读写 最开始采用同步阻塞读写 主要是为了快速实
  • 织梦网站搬迁后服务器错误,如何解决DEDECMS 5.7 将data目录迁移后网站地图无法打开和更新的问题...

    如何解决DEDECMS 5 7 将data目录迁移后网站地图无法打开和更新的问题 发布时间 2020 09 15 11 53 29 来源 亿速云 阅读 88 作者 小新 这篇文章主要介绍如何解决DEDECMS 5 7 将data目录迁移后网
  • Firefox about:config设置

    以下内容来自于转载 原文 https www jianshu com p 6e6937a9574c 地址栏输入about config 打开 搜索 书签在新标签页中打开 browser tabs loadBookmarksInTabs 默认
  • 安卓开发日志捕获,错误日志捕获catch,崩溃日志捕获,抓取崩溃日志

    import android content Context import android content SharedPreferences import android content pm PackageInfo import and
  • 从街边小吃到网上爆款,螺蛳粉是如何逆袭走红的呢?

    要说现当下最火的食物是什么 那螺蛳粉肯定占有一席之地 喝奶茶已经不是当下年轻人的续命方式了 现在只有会嗦粉的才能称得上是整条街最靓的崽 在今年五花八门的热搜中 可以说螺蛳粉长在了热搜上 从西瓜微数热搜榜来看 关于螺蛳粉的热搜可是数不胜数 在
  • windows系统升级node

    直接去官网下载对应版本的安装包 覆盖到原来的下载路径就可以了 注意一定要下载稳定版本的下载 Node js nodejs org https nodejs org zh cn download 查看node下载路径where node 查看
  • 线程池基础入门

    文章目录 线程池的状态 ThreadPoolExecutor 构造方法 Executors 固定大小的线程池 Executors 定时线程池 Executors 带缓冲线程池 Executors 单线程线程池 线程池常用方法 线程池的状态
  • 对接阿里云弹性收缩小结

    1 垂直伸缩 执行垂直伸缩任务时 系统自动完成停止目标实例 调整实例规格 启动目标实例一系列操作 这个相对简单 直接增加实例配置 2 弹性伸缩 参考 阿里云弹性伸缩初体验 偶影独行的博客 CSDN博客 Sina Visitor System
  • Android电池信息

    Android中电池信息 Battery information 的取得 这里介绍电池信息的取得 Android content BroadcastReceiver类 Android os BatteryManager类 电池信息的取得 调
  • Jenkins连接k8s的多种姿势

    目录 1 概述 2 同集群 3 跨集群 3 1 端口有什么 3 2 网络策略打通 3 3 证书的生成和配置 3 4 配置连接外部的 k8s 集群 4 测试验证 4 1 配置 pod template 4 2 自由风格构建测试 4 3 流水线
  • Vue计算属性实现及简写

    计算属性 1 定义 要用的属性不存在 要通过已有的属性计算得来 2 原理 底层借助了Object defineproperty方法提供的getter和setter 3 get函数什么时候执行 1 初次读取时会执行一次 2 当依赖的数据发生改
  • 博客网址

    博客不在更新 转到www fulus wang 转载于 https my oschina net fuluS blog 713434
  • pandas整表写入excel指定位置_pandas处理excel的常用方法技巧(上)

    1 导库 import pandas as pd 2 读取excel文件 这里要注意的就是第二个参数header如果不设置 pandas会默认把excel的第一行当作columns header None的时候pandas会为我们新生成从0
  • 使用深度学习模型CNN进行实时情绪检测研究(Matlab代码实现)

    欢迎来到本博客 博主优势 博客内容尽量做到思维缜密 逻辑清晰 为了方便读者 座右铭 行百里者 半于九十 本文目录如下 目录 1 概述 2 运行结果 3 参考文献 4 Matlab代码实现 1 概述 使用深度学习模型CNN进行实时情绪检测是一
  • 字符串、字符数组的截取函数:strncpy、strsub

    字符数组的截取函数 字符串截取函数
  • 【材质和贴图】

    1 贴图坐标的换算公式 a1 a0 Offset 1 Tilling
  • C语言——每日一题

    1 倒置字符串 倒置字符串 要将每一个单词逆序输出 首先可以将整个字符串内容都逆序输出 然后再将字符串中的每一个单词再进行逆序 例如 逆序 i like beijing 先逆序成 gnijieb ekil i 再将每个单词逆序 beijin
  • 用ram实现寄存器堆_51单片机RAM数据存储区学习笔记

    1 RAM keil C语言编程 RAM是程序运行中存放随机变量的数据空间 在keil中编写程序 如果当前模式为small模式 如果总的变量大小未超过128B 则未初始化的变量的初值默认为0 如果所有的变量超过单片机small模式下的128
  • 基于Tensorflow来重现GPT v1模型

    OpenAI推出的ChatGPT模型让我们看到了通用人工智能的发展潜力 我也找了GPT的相关论文来进行研究 OpenAI在2017年的论文Improving Language Understanding by Generative Pre