Transformer Explained

2023-11-13

Transformer
- What is a transformer
- Why we need the transformer
  - encoder
    - sub-encoder block
      - multi-head self-attention
      - FFN
    - input
  - decoder
    - input with look-ahead mask
    - sub-decoder block
  - output layer
  - summary
- Drawbacks of the transformer
- Applications of the transformer
- ref

Transformer-XL
- The motivation for Transformer-XL.
- Transformer-XL: the proposed solution: Basic idea.
  - combine hidden states
  - how to compute self-attention
- Absolute Positional Encoding & Memory:
- summary
- Applications and limitations
- ref

Self-Attention with Relative Position Representations
- Relation-aware Self-Attention
- Relative Position Representations
- Implement
- ref

Reformer

Transformer

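What is a transformer

Let's start with the conclusion: the transformer proposed in Attention Is All You Need is essentially seq2seq + self-attention. A very clear code implementation is also available.

A seq2seq task is one where both the input and the output are sequences, for example translating French into English.

Seq2seq tasks are most commonly handled with an encoder + decoder: the encoder first encodes the input sequence into a context representation, and the decoder then decodes from it. In the basic setup, this context vector is the only thing passed from the encoder to the decoder.

This often performs poorly, because a single vector cannot hold all of the information in the input, and because each input word matters to a different degree at each decoding step. That is why an attention mechanism was added for NMT tasks.

So for now we can treat the transformer as a black box: a sequence-to-sequence model that uses a special self-attention mechanism internally, as shown in the figure below: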

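Why we need the transformer

To see why the transformer is needed, we should first ask what was used for this family of tasks before it existed.

Previously, the encoder and decoder were RNNs (or their variants such as LSTM/GRU, unidirectional or bidirectional).

An RNN cell consumes only one input token plus the previous hidden state at each step and then produces an output. This sequential structure lets the model capture long-range dependencies, but it also prevents parallel computation, so the model is very inefficient.

The RNN can of course be replaced with a CNN to gain parallelism. In the figure below there are two convolutional layers: the first layer shows two filters, each a 1D filter of size 2, and the filters in the second layer have size 3.

A first-layer filter looks at the relation between two adjacent tokens, while a second-layer filter looks at the interaction of three first-layer outputs, and therefore covers a longer span of the sequence. For example, say the sequence is $a_1, a_2, a_3, a_4$: the first layer only considers the interactions $(a_1, a_2)$, $(a_2, a_3)$, $(a_3, a_4)$; the second layer considers $b_1, b_2, b_3$, where each $b_i$ is the result of one of those pairwise interactions, so the second layer covers the whole sequence.

With a CNN, however, the kernel length is usually set to a small value such as 3 or 5. For long sequences, for example length 512, many convolutional layers have to be stacked, which makes the model overly bulky.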
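
To put a rough number on this, here is a small sketch of the receptive-field arithmetic. It assumes stride-1 convolutions without dilation or pooling (an assumption made here, not something stated in the post), and `receptive_field` and `layers_needed` are illustrative helper names.

```python
import math

def receptive_field(kernel_sizes):
    # Assumption: stride 1, no dilation. Each layer with kernel size k
    # widens the receptive field by k - 1 input positions.
    return 1 + sum(k - 1 for k in kernel_sizes)

def layers_needed(seq_len, kernel_size):
    # Smallest n such that 1 + n * (kernel_size - 1) >= seq_len.
    return math.ceil((seq_len - 1) / (kernel_size - 1))

print(receptive_field([2, 3]))   # two layers with filter sizes 2 and 3
print(layers_needed(512, 3))     # layers needed to span 512 tokens, kernel 3
print(layers_needed(512, 5))     # same with kernel 5
```

Under these assumptions a kernel of size 3 needs 256 stacked layers before a single output position can see all 512 tokens, whereas one self-attention layer connects every pair of positions directly.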

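So, can we design a new model that runs in parallel and also accounts for the different weights of the tokens in the input sequence? Researchers proposed such a model, called the transformer.

It is still an encoder + decoder model, except that the encoder and decoder are built on self-attention.

But is the transformer really better than the RNN? Some have claimed that any model built with RNNs can be replaced directly with self-attention. We discuss this in the section on the drawbacks of the transformer.

The internal structure of the transformer

The encoder and decoder of a transformer are not single modules; each is made up of several smaller blocks, the sub-encoder blocks and sub-decoder blocks.

Let's look at the detailed architecture. As the figure below shows, it consists mainly of the encoder + input on the left and the decoder + input + output on the right. We will go through them one by one.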

encoder

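The encoder consists of the input and several sub-encoder blocks. We cover the sub-encoder block first and the input afterwards, because the input is designed to compensate for a weakness of self-attention.

sub-encoder block

Each sub-encoder block has two main parts (some details are skipped here and covered later): a self-attention layer and an FFN layer.

Concretely: each input word is first embedded, then passes through the self-attention layer, where each position follows its own path and is transformed into a new output vector, and finally passes through the FFN layer. The result is the input to the next sub-encoder block.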

multi-head self-attention

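First, what does self-attention do? The idea is familiar. Take the sentence "The animal didn't cross the street because it was too tired." Here "it" refers to "the animal", so when translating "it" we pay more attention to "the animal". Self-attention plays a similar role: for every token in the sentence it looks at how that token relates to the other tokens. See the reference implementation.

Let's see how the tokens are transformed by multi-head attention.

The input vector $x$ of each token is transformed into three vectors: a query $q$, a key $k$, and a value $v$. They are obtained by multiplying the input with the query matrix, key matrix, and value matrix respectively, all of which are learned during training.

Multiplying x1 by the WQ weight matrix gives q1, the query vector of this token. In the same way we obtain the token's query, key, and value vectors.

- query vector: as the name suggests, the query is responsible for finding how relevant this token is to the other tokens (through their keys)
- key vector: the key is matched against query vectors to produce a relevance score
- value vector: the value vectors are the actual representations of the tokens; once the relevance scores are known, these representations are what get combined in the weighted sum

Once every token has its $q$, $k$, $v$, we get its relation to the other tokens by matching its query against their keys to obtain scores, and then taking the weighted sum of their values (with those scores as weights) as the token's final output.

Let's look at how the score is computed. So far we have looked at self-attention for a single token, but on a GPU the whole process runs in parallel: the Q, K, V for every position in the sequence are obtained at once, through matrix multiplication.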

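The score between each token and every other token is computed with Scaled Dot-Product Attention.

Concretely, the formula is

\begin{equation} Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V \end{equation}

where $Q = XW^Q$, $K = XW^K$, $V = XW^V$.
The scale factor $\sqrt{d_k}$ is the square root of the key (and query) vector size.

In summary: each token's output is a weighted sum of all value vectors, with weights given by the softmax of its scaled query-key scores.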
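
The computation can be sketched in a few lines of NumPy. This is a minimal single-head sketch with random toy weights, not the post's TensorFlow code; the sizes (4 tokens, embedding size 8) and the names `w_q`, `w_k`, `w_v` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (seq_q, d_k), k: (seq_k, d_k), v: (seq_k, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_q, seq_k) relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # weighted sum of the value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 tokens, embedding size 8 (toy values)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(out.shape, attn.shape)
```

All four tokens are processed in one matrix multiplication, which is the parallelism the text describes.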

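So what is multi-head? It is simple: in the sub-encoder block above, self-attention was performed only once. If we introduce several different sets of $W^Q$, $W^K$, $W^V$ and repeat the steps above for each set, we get multi-head attention.

After obtaining the output vectors of all heads, we concatenate them and apply one more linear transformation to get the final output.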
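
Here is a hedged NumPy sketch of those steps. It uses the common trick of one large projection that is then split into heads, which is equivalent to giving every head its own smaller set of matrices; the function name, the toy sizes, and the output matrix `w_o` are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    depth = d_model // num_heads          # size of each head

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, depth)
        return t.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(depth)   # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                 # (heads, seq, depth)
    # Concatenate the heads again, then apply the final linear transformation.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                             # 5 tokens, d_model = 16
w_q, w_k, w_v, w_o = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=4)
print(out.shape)
```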

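Why do we need multiple heads? Because they increase the capacity of the model: they let it attend to different positions. Take a sentence like "The economy ..., education ..., this made the city develop": the word "this" can focus on different parts in different heads, for example on the economy in one and on education in another. See also the example below.

It is similar to using different kernels in a CNN: different kernels pick up different information, and likewise different heads extend the model to different representation subspaces. Each head has its own Q, K, V matrices, which are randomly initialized and then trained, and they usually end up different. As a not entirely fitting illustration of different representation subspaces: when you browse a web page, in terms of color you may pay more attention to dark text, and in terms of font you may notice large, bold text. Color and font are two different representation subspaces here. Attending to both at once helps locate the content the page emphasizes. Multi-head attention likewise combines information/features from several aspects.
I think multi-head attention can also be viewed as a kind of ensemble inside the model.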

FFN

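After the self-attention layer the model goes through the FFN layer. \begin{equation} FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2 \end{equation} This is implemented as two Dense layers, with ReLU as the activation of the first layer and no activation on the second.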
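
A direct NumPy transcription of the formula, as a sketch with random toy weights. The sizes `d_model = 8` and `d_ff = 32` are made up for the example.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)   # first dense layer + ReLU
    return hidden @ w2 + b2                 # second dense layer, no activation

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                        # toy sizes
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))            # 4 token vectors
y = ffn(x, w1, b1, w2, b2)
print(y.shape)
```

The same weights are applied to every position independently, so feeding a single token through `ffn` gives the same row as feeding the whole sequence.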

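The two sub-layers are not connected directly; each is followed by an Add & Normalize layer, which consists of the following two steps:

- Add: the input of the sub-layer is added to its output (a residual connection), for example input + self-attention output. In the original paper, dropout is applied to the sub-layer output before this addition.
- Normalize: layer normalization is applied to the sum.

layer normalization: simply put, each sample is transformed as (x - mean) / std, where mean and std are computed over the features of that single sample rather than over the batch.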
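
A minimal sketch of Add & Normalize in NumPy. It is simplified on purpose: the learned scale and shift that layer-normalization layers usually carry, and the dropout step, are left out, and the small `eps` is an assumption added to avoid division by zero.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Statistics are computed per sample (per row), over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # sub-layer input (toy values)
sublayer_out = rng.normal(size=(4, 8))      # e.g. the self-attention output
y = add_and_norm(x, sublayer_out)
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(4))
```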

input

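The self-attention mechanism does not take the order of the input sequence into account, yet word order matters a great deal. For example, 你喜欢苹果不 ("do you like apples?") and 你不喜欢苹果 ("you don't like apples") contain the same characters in a different order and mean different things. So we add a position encoding to the input embedding of the encoder.

By how positions are related, position encodings can be divided into:

- absolute position: for every sequence the positions simply count 0, 1, ..., n
- relative position: position is represented by the offset between tokens. Relative Position Representations (RPR) were proposed by Shaw et al., 2018, which reports that relative positions work better within a sequence.

By how the encoding is produced, they can also be divided into:

- functional encoding: a fixed function maps the position index to an embedding
- parametric encoding: an embedding lookup lets the model learn the position embeddings itself

The two approaches perform about the same, but the functional one reduces the number of model parameters.

BERT uses parametric absolute positional encoding (PAPE), while the transformer uses functional absolute positional encoding (FAPE).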

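The function used here is the sinusoidal position encoding:

\begin{equation} PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \end{equation}

$d_{model}$ is the embedding size of the model output
pos is the position of the token in the sequence
$i$ indexes the dimensions of the position embedding: dimension $2i$ uses sin and dimension $2i+1$ uses cos. The formula can be confusing, especially as written in the paper, and is easier to follow with an example. With pos=3 and d(model)=128, the position vector for 3 is:

\begin{equation} [\sin(3/10000^{0/128}), \cos(3/10000^{0/128}), \sin(3/10000^{2/128}), \cos(3/10000^{2/128}), \sin(3/10000^{4/128}), \cos(3/10000^{4/128}), \ldots] \end{equation}

[Figure: visualization of this encoding function]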
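
The sinusoidal encoding is easy to reproduce. The sketch below is a NumPy version that assumes an even `d_model`; `positional_encoding` and `max_len` are names chosen for this example.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions 2i + 1
    return pe

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)
print(pe[3, :4])   # first four entries of the vector for pos = 3
```

The row `pe[3]` is the pos=3, d_model=128 case: its first two entries are sin(3) and cos(3), and the next two use the exponent 2/128.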

decoder

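Once the encoder has finished, the decoder takes over. The output of the last encoder layer is transformed into a set of attention vectors K and V, which serve as the K and V matrices of the encoder-decoder attention layer and help the decoder focus on the appropriate positions of the input.

The output of each time step is fed back into the decoder: we embed it and add the position encoding. Decoding stops when the special token \<end of sentence> is produced.

input with look-ahead mask

The decoder input differs slightly from the encoder input, because the decoder's self-attention layer may only attend to the current position and the positions before it in the output sequence, not to later ones. The later positions are therefore masked: negative infinity (-inf) is added to their q*k scores, so that their weights become 0 after the softmax.

The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used. Concretely: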

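import tensorflow as tf

def create_look_ahead_mask(size):
  # band_part(..., -1, 0) keeps the lower triangle, so 1 - ... marks the
  # future positions (above the diagonal) with 1 and everything else with 0
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)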

x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])
>><tf.Tensor: shape=(3, 3), dtype=float32, numpy=
>>array([[0., 1., 1.],
>>       [0., 0., 1.],
>>       [0., 0., 0.]], dtype=float32)>
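
To see what such a mask does to the attention weights, here is a small NumPy sketch (not the TensorFlow code above). It uses -1e9 as a stand-in for negative infinity, and all scores are set to zero so the effect of the mask is easy to read off.

```python
import numpy as np

def look_ahead_mask(size):
    # 1 above the diagonal (future positions), 0 elsewhere.
    return np.triu(np.ones((size, size)), k=1)

def masked_softmax(scores, mask):
    scores = scores + mask * -1e9                       # push masked scores towards -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                               # toy q*k scores
weights = masked_softmax(scores, look_ahead_mask(3))
print(weights)
```

Row t only has non-zero weights for positions 0..t: the first token attends only to itself, the second splits its weight over the first two, and so on.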

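When I first read this I had a question: isn't the decoder input at each time step the output of the previous step? How can that be parallel, and how is it different from an RNN? The answer is that during training the entire target sequence is fed to the decoder at once, and the look-ahead mask simulates the individual time steps.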

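# Decoder and sample_encoder_output are not defined in this post; they are
# assumed to come from the referenced code
sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8,
                         dff=2048, target_vocab_size=8000,
                         maximum_position_encoding=5000)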
target_input = tf.random.uniform((64, 26), dtype=tf.int64, minval=0, maxval=200)

output, attn = sample_decoder(target_input,
                              enc_output=sample_encoder_output,
                              training=False,
                              look_ahead_mask=None,
                              padding_mask=None)
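
A minimal sketch of how the full target sequence and the look-ahead mask together imitate step-by-step decoding during training. The token ids and the START/END values are made up for illustration, and shifting the target by one position to get the decoder input and the labels is the usual teacher-forcing setup, assumed here rather than taken from the post.

```python
import numpy as np

START, END = 1, 2                              # hypothetical special-token ids
target = np.array([START, 16, 13, 7, END])     # hypothetical target sentence

decoder_input = target[:-1]   # what the decoder is fed, all positions at once
labels = target[1:]           # what it should predict at each position

n = len(decoder_input)
mask = np.triu(np.ones((n, n)), k=1)           # 1 = future position, hidden

for t in range(n):
    visible = decoder_input[mask[t] == 0]      # tokens position t may attend to
    print(f"step {t}: sees {visible.tolist()} -> predicts {int(labels[t])}")
```

Every "step" here is just one row of the same batched computation, which is why training does not need a sequential loop.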

Only at prediction time is the decoder output actually fed back as the next input. By then the model is simply used as a black box.

def evaluate(inp_sentence):
  start_token = [tokenizer_pt.vocab_size]

  end_token = [tokenizer_pt.vocab_size + 1]

  # inp sentence is portuguese, hence adding the start and end token
  inp_sentence = start_token + tokenizer_pt.encode(inp_sentence) + end_token
  encoder_input = tf.expand_dims(inp_sentence, 0)

  # as the target is english, the first word to the transformer should be the
  # english start token.
  decoder_input = [tokenizer_en.vocab_size] # <start_of_sentence>
  output = tf.expand_dims(decoder_input, 0)

  for i in range(MAX_LENGTH):
    print(output)
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
        encoder_input, output)
    predictions, attention_weights = transformer(encoder_input,
                                                 output,
                                                 False,
                                                 enc_padding_mask,
                                                 combined_mask,
                                                 dec_padding_mask)

    # select the last word from the seq_len dimension
    predictions = predictions[: ,-1:, :]  # (batch_size, 1, vocab_size)

    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

    # return the result if the predicted_id is equal to the end token
    if predicted_id == tokenizer_en.vocab_size+1: # <end_of_sentence>
      return tf.squeeze(output, axis=0), attention_weights

    # concatenate the predicted_id to the output which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0), attention_weights

# translate() is not defined in this post; it is assumed to wrap evaluate()
# and print the input and the predicted translation
translate("este é um problema que temos que resolver.")
print("Real translation: this is a problem we have to solve .")
>> tf.Tensor([[8087]], shape=(1, 1), dtype=int32)
>> tf.Tensor([[8087   16]], shape=(1, 2), dtype=int32)
>> tf.Tensor([[8087   16   13]], shape=(1, 3), dtype=int32)
>> tf.Tensor([[8087   16   13    7]], shape=(1, 4), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328]], shape=(1, 5), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328   10]], shape=(1, 6), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328   10   14]], shape=(1, 7), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328   10   14   24]], shape=(1, 8), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328   10   14   24    5]], shape=(1, 9), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328   10   14   24    5  966]], shape=(1, 10), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328   10   14   24    5  966   19]], shape=(1, 11), dtype=int32)
>> tf.Tensor([[8087   16   13    7  328   10   14   24    5  966   19    2]], shape=(1, 12), dtype=int32)
Input: este é um problema que temos que resolver.
Predicted translation: this is a problem that we have to solve it .
Real translation: this is a problem we have to solve .

sub-decoder block

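The sub-decoder block is almost the same as the sub-encoder block, except that it has one extra layer, the Encoder-Decoder Attention. This layer works just like multi-head self-attention, except that its Keys and Values matrices come from the encoder output while its queries come from the decoder. This means each decoder query can take every token of the encoder input into account.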
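
The only change relative to self-attention is where Q, K, and V come from, which a short single-head NumPy sketch makes visible. The sizes (3 decoder positions, 5 encoder positions, width 8) and all weight names are assumptions for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def encoder_decoder_attention(dec_x, enc_out, w_q, w_k, w_v):
    q = dec_x @ w_q                      # queries come from the decoder
    k = enc_out @ w_k                    # keys come from the encoder output
    v = enc_out @ w_v                    # values come from the encoder output
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
dec_x = rng.normal(size=(3, 8))          # 3 target positions decoded so far
enc_out = rng.normal(size=(5, 8))        # 5 source tokens from the encoder
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = encoder_decoder_attention(dec_x, enc_out, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The weight matrix has one row per decoder position and one column per encoder token, so every decoder query is scored against the whole input sentence.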

output layer

The decoder output is a vector. It goes through a dense layer to produce logits of vocabulary size, then through a softmax, and the argmax gives the output token.
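
As a sketch, the output layer for a single decoder vector looks like this in NumPy. The vocabulary size of 20, the random weights, and the helper name `greedy_next_token` are placeholders for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def greedy_next_token(dec_vec, w_out, b_out):
    logits = dec_vec @ w_out + b_out     # dense layer: d_model -> vocab_size
    probs = softmax(logits)              # probabilities over the vocabulary
    return int(np.argmax(probs)), probs  # greedy choice of the next token id

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20              # toy sizes
dec_vec = rng.normal(size=(d_model,))    # decoder output for the last position
w_out = rng.normal(size=(d_model, vocab_size))
b_out = np.zeros(vocab_size)

token_id, probs = greedy_next_token(dec_vec, w_out, b_out)
print(token_id, probs.shape)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same token.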

summary

class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                           input_vocab_size, pe_input, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inp, tar, training, enc_padding_mask,
           look_ahead_mask, dec_padding_mask):

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, dec_padding_mask)

    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    return final_output, attention_weights

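Drawbacks of the transformer

- The space and time complexity of the transformer are very high: for sequence length $L$ they reach $O(L^2)$, because every self-attention layer has to store an $L \times L$ score matrix for the later update. $L$ therefore cannot be very large, or we run out of memory (OOM). If a text is very long, it has to be split into two sequences, and the relationship between the two parts is then lost, whereas an RNN can handle input of any length.
- It runs slowly and the model is large.
- The position encoding is absolute, while Self-Attention with Relative Position Representations reports that relative positions work better.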
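
A back-of-the-envelope sketch of the quadratic memory cost. It counts only the self-attention score matrices, and the defaults (float32 values, 8 heads, 6 layers, batch size 1) are assumptions for illustration rather than measurements.

```python
def score_matrix_bytes(seq_len, num_heads=8, num_layers=6,
                       batch_size=1, bytes_per_value=4):
    # One (seq_len x seq_len) score matrix per head, per layer, per sample.
    return batch_size * num_layers * num_heads * seq_len * seq_len * bytes_per_value

for seq_len in (512, 2048, 8192):
    mib = score_matrix_bytes(seq_len) / 2**20
    print(f"L = {seq_len:5d}: {mib:8.0f} MiB")
```

Under these assumptions the score matrices take 48 MiB at L = 512 and grow 16x every time the sequence length is multiplied by 4, which is why long inputs quickly run into OOM.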

Applications of the transformer

Translation, summarization, and so on.

ref

Hung-yi Lee's transformer lecture

Attention Is All You Need

the-illustrated-transformer

The Evolved Transformer – Enhancing Transformer with Neural Architecture Search

Transformer-XL – Combining Transformers and RNNs Into a State-of-the-art Language Model

code

Transformer-XL

The motivation for Transformer-XL.

Why was Transformer-XL proposed? Mainly to address the problems of the transformer. Let's first go over the strengths and weaknesses of the RNN and the transformer.

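RNN
- Pros:
  - supports variable-length input
  - has memory
  - preserves the order of the sequence
- Cons:
  - vanishing gradients
  - slow, cannot be parallelized

Transformer
- Pros:
  - parallel
  - captures long-term dependency information in the sequence (compared with the RNN)
  - interpretability
- Cons:
  - loses the relationship between sentences
  - the batch size cannot be large either
  - large memory footprint, because the score matrix of every encoder layer is sequenceLen * sequenceLen, i.e. $O(L^2)$ space complexity (BOOOOM!)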
