Transformer Paper and Source Code Notes: Attention Is All You Need
Overview
Paper title: Attention Is All You Need
Venue: Advances in Neural Information Processing Systems, 2017 (NIPS 2017)
Paper link: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Introduction
Advantages of the Transformer architecture:
Strong modeling of long-range dependencies: the self-attention mechanism builds a global representation over the entire input sequence;
Strong parallelism: all positions of the input sequence can be computed in parallel;
The network architecture is shown in the figure below:
The structure of the multi-head attention module is shown in the figure below:
Attention formulas:
$$
Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V\\
MultiHead(Q,K,V)=Concat(head_1,\dots,head_h)W^O\\
head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)
$$
Explanation of K, Q, V in multi-head attention:
We have a set of key-value pairs k, v, where each k corresponds to one v, and we want to compute the value corresponding to a query q. The idea: compute the similarity between q and each k to obtain a weight for each v, then take the weighted sum of the v's; the result is the value corresponding to q. This is why, in the decoding stage, the second multi-head attention takes k and v from the encoder features (the known key-value pairs) and q from the decoder features (fed in iteratively), and computes the features corresponding to the decoder queries (the attention-weighted encoder features, weighted by the query-key similarities).
Note: in one sentence, the k/q/v relationship is: given the key-value matching between k and v, predict the value corresponding to q by weighting and summing the v's according to the q-k similarities (see the sketch after the video link below).
Video reference: https://www.bilibili.com/video/BV1dt4y1J7ov/?spm_id_from=333.788.recommend_more_video.2&vd_source=b1b1710a3f74753e8bfc47c5c2e4d49e
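To make this concrete, here is a minimal sketch of scaled dot-product attention following the formula above; the function name scaled_dot_product_attention and the tensor sizes are my own illustrative assumptions, not code from the paper or its official implementation.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q: (n_q, d_k), k: (n_kv, d_k), v: (n_kv, d_v)
    d_k = q.size(-1)
    # Similarity between each query and every key
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (n_q, n_kv)
    # Softmax over the keys turns similarities into weights for the v's
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    # Weighted sum of the values: the attended output for each query
    return weights @ v                              # (n_q, d_v)

q = torch.randn(2, 8)    # 2 queries
k = torch.randn(5, 8)    # 5 key-value pairs
v = torch.randn(5, 16)
out = scaled_dot_product_attention(q, k, v)   # -> (2, 16)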
Multi-head attention computation process:
First apply a linear map to k, q, v, then split them along the feature dimension into multiple "heads" and run the computation on the split vectors → matrix-multiply q and k to get the attention scores → divide the scores by the scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of each head → apply Softmax → apply Dropout (optional) → matrix-multiply the resulting weights with v → concatenate all the heads → finish with one more linear map; a sketch of these steps follows.
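The pipeline above can be written out directly. Below is a simplified sketch assuming batch-first tensors of shape (batch, seq, d_model); the class name SimpleMultiHeadAttention and its parameter names are my own, and masking (which nn.MultiheadAttention supports) is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads   # d_k: per-head dimension
        # Linear maps for q, k, v, plus the final output projection
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v):
        b, n_q, _ = q.shape
        n_kv = k.shape[1]
        # 1) Linear maps, then split the feature dimension into heads
        q = self.w_q(q).view(b, n_q, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(k).view(b, n_kv, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(v).view(b, n_kv, self.num_heads, self.d_head).transpose(1, 2)
        # 2) q times k^T, divided by the scaling factor sqrt(d_k)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        # 3) Softmax, then (optional) Dropout on the weights
        weights = self.dropout(F.softmax(scores, dim=-1))
        # 4) Weighted sum of v, then merge the heads and project
        out = (weights @ v).transpose(1, 2).reshape(b, n_q, -1)
        return self.w_o(out)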
Code implementation
Encoder module
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional
from torch import Tensor

def _get_activation_fn(activation):
    # Map the activation name to its functional form
    if activation == "relu":
        return F.relu
    if activation == "gelu":
        return F.gelu
    raise RuntimeError(f"activation should be relu/gelu, not {activation}")

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu"):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of the feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.activation = _get_activation_fn(activation)

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward(self, src, pos: Optional[Tensor] = None):
        # Add the positional encoding to the features first
        v = q = k = self.with_pos_embed(src, pos)
        # Self-attention, then residual connection and LayerNorm
        src2 = self.self_attn(q, k, v)[0]
        src = src + src2
        src = self.norm1(src)
        # Feedforward network, then residual connection and LayerNorm
        src2 = self.linear2(self.activation(self.linear1(src)))
        src = src + src2
        src = self.norm2(src)
        return src
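A quick smoke test of the layer above, with illustrative sizes; note that nn.MultiheadAttention defaults to the (seq_len, batch, d_model) tensor layout.

src = torch.randn(10, 2, 256)    # (seq_len, batch, d_model)
pos = torch.randn(10, 2, 256)    # positional encoding, same shape as src
encoder_layer = TransformerEncoderLayer(d_model=256, nhead=8)
memory = encoder_layer(src, pos=pos)   # -> (10, 2, 256)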
Decoder module
class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu"):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of the feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.activation = _get_activation_fn(activation)

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward(self, encoding, decoding, pos: Optional[Tensor] = None):
        q = k = v = self.with_pos_embed(decoding, pos)
        # The decoder features first go through a self-attention pass
        decoding2 = self.self_attn(q, k, v)[0]
        decoding = decoding + decoding2
        decoding = self.norm1(decoding)
        # Cross-attention: q from the decoder, k/v from the encoder output
        decoding2 = self.multihead_attn(query=decoding, key=encoding, value=encoding)[0]
        decoding = decoding + decoding2
        decoding = self.norm2(decoding)
        # Feedforward network, then residual connection and LayerNorm
        decoding2 = self.linear2(self.activation(self.linear1(decoding)))
        decoding = decoding + decoding2
        decoding = self.norm3(decoding)
        return decoding
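A matching smoke test for the decoder layer, feeding it the encoder output from the example above as k/v; the 100 decoder queries are an illustrative size, and the query positional encoding is an assumption on my part.

decoder_layer = TransformerDecoderLayer(d_model=256, nhead=8)
decoding = torch.randn(100, 2, 256)     # e.g. 100 query embeddings
query_pos = torch.randn(100, 2, 256)    # positional encoding for the queries
out = decoder_layer(memory, decoding, pos=query_pos)   # -> (100, 2, 256)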
Note: the above is only the author's personal understanding; if anything is wrong, corrections are welcome.