My own understanding of this point: earlier explanations of why pretraining helps were largely speculative, amounting to "it finds a good initialization point." The argument here is that, in the course of learning to model language, an LM naturally picks up the information that supervised tasks require, i.e. the LM is itself an unsupervised multitask learner. That is what justifies why pretraining helps downstream tasks, in other words why it yields a good initialization. More concretely, the paper argues that supervised tasks are really just subsets of the sequences a language model is trained on. Some examples of my own: language-modeling the sequence "The translation of apple in Chinese is 苹果" naturally teaches translation knowledge; modeling "Yao Ming's height is 2.26 m" naturally teaches question-answering knowledge; and so on.
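A toy sketch of this idea (the strings and the lm_generate name are my own made-up placeholders, not from the paper): once a task is written out as ordinary text, next-token prediction on that text is implicitly supervised training for the task, and zero-shot inference is just conditional generation.

```python
# Hypothetical training text: each line carries a "task" as plain language.
corpus = [
    "The translation of apple in Chinese is 苹果",  # translation knowledge
    "Yao Ming's height is 2.26 m",                  # QA / factual knowledge
]

# Zero-shot use: phrase the task as a prefix and let the LM continue it.
prompt = "The translation of banana in Chinese is"
# answer = lm_generate(prompt)  # placeholder for any autoregressive decoder
```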
Turning to the official TensorFlow implementation (model.py in the openai/gpt-2 repository), each Transformer block is defined as:

```python
def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        # pre-norm: LayerNorm is applied to the sub-block input, not after the residual add
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a  # residual connection around attention
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
        x = x + m  # residual connection around the feed-forward sub-block
        return x, present
```
The main difference from GPT lies in where the normalization sits: GPT applies LayerNorm after the residual addition (post-norm), whereas GPT-2 moves it to the input of each sub-block (pre-norm), leaving the residual path untouched.
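Schematically, the contrast looks like this (stub functions standing in for the real layers, just to show the ordering):

```python
def post_norm_block(x, attn, mlp, norm):
    # GPT-1 style: normalize after adding the residual
    x = norm(x + attn(x))
    x = norm(x + mlp(x))
    return x

def pre_norm_block(x, attn, mlp, norm):
    # GPT-2 style, matching block() above: normalize the sub-block input
    x = x + attn(norm(x))
    x = x + mlp(norm(x))
    return x
```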
The two sub-modules it calls, attn and mlp, are implemented as follows:
```python
def attn(x, scope, n_state, *, past, hparams):
    assert x.shape.ndims == 3  # Should be [batch, sequence, features]
    assert n_state % hparams.n_head == 0
    if past is not None:
        assert past.shape.ndims == 5  # Should be [batch, 2, heads, sequence, features], where 2 is [k, v]

    def split_heads(x):
        # From [batch, sequence, features] to [batch, heads, sequence, features]
        return tf.transpose(split_states(x, hparams.n_head), [0, 2, 1, 3])

    def merge_heads(x):
        # Reverse of split_heads
        return merge_states(tf.transpose(x, [0, 2, 1, 3]))

    def mask_attn_weights(w):
        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.
        _, _, nd, ns = shape_list(w)
        b = attention_mask(nd, ns, dtype=w.dtype)
        b = tf.reshape(b, [1, 1, nd, ns])
        w = w*b - tf.cast(1e10, w.dtype)*(1-b)
        return w

    def multihead_attn(q, k, v):
        # q, k, v have shape [batch, heads, sequence, features]
        w = tf.matmul(q, k, transpose_b=True)
        w = w * tf.rsqrt(tf.cast(v.shape[-1].value, w.dtype))
        w = mask_attn_weights(w)
        w = softmax(w)
        a = tf.matmul(w, v)
        return a

    with tf.variable_scope(scope):
        c = conv1d(x, 'c_attn', n_state*3)
        q, k, v = map(split_heads, tf.split(c, 3, axis=2))
        present = tf.stack([k, v], axis=1)
        if past is not None:
            pk, pv = tf.unstack(past, axis=1)
            k = tf.concat([pk, k], axis=-2)
            v = tf.concat([pv, v], axis=-2)
        a = multihead_attn(q, k, v)
        a = merge_heads(a)
        a = conv1d(a, 'c_proj', n_state)
        return a, present
```
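Two details here are worth unpacking. First, past/present implement key/value caching for incremental decoding: the freshly computed k and v are returned as present, and on the next step the cached pair is concatenated along the sequence axis, so each new token attends over the whole prefix without recomputing it. Second, mask_attn_weights enforces causality by multiplying the logits with a 0/1 band matrix b and subtracting 1e10 at the masked positions, which drives their probability to effectively zero after the softmax. A small NumPy check (illustrative only, mirroring the `i >= j - ns + nd` rule inside attention_mask) shows the band for nd=2 fresh queries over ns=5 keys, 3 of them cached:

```python
import numpy as np

nd, ns = 2, 5  # 2 new query tokens, 5 keys in total (3 come from the cache)
i = np.arange(nd)[:, None]
j = np.arange(ns)
mask = (i >= j - ns + nd).astype(np.float32)
print(mask)
# [[1. 1. 1. 1. 0.]    the first new token sees the 3 cached tokens plus itself
#  [1. 1. 1. 1. 1.]]   the second new token additionally sees the first
```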
The mlp sub-module is simpler:

```python
def mlp(x, scope, n_state, *, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        # position-wise feed-forward: expand to n_state (4*nx when called from block), GELU, project back
        h = gelu(conv1d(x, 'c_fc', n_state))
        h2 = conv1d(h, 'c_proj', nx)
        return h2
```
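For the snippets above to be self-contained, here are the helpers they rely on (norm, conv1d, split_states, merge_states, shape_list, attention_mask, softmax, gelu), reproduced from the same model.py; this is my transcription, so consult the repository for the authoritative version:

```python
import numpy as np
import tensorflow as tf

def shape_list(x):
    """Deal with dynamic shape in tensorflow cleanly."""
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]

def softmax(x, axis=-1):
    x = x - tf.reduce_max(x, axis=axis, keepdims=True)
    ex = tf.exp(x)
    return ex / tf.reduce_sum(ex, axis=axis, keepdims=True)

def gelu(x):
    return 0.5*x*(1+tf.tanh(np.sqrt(2/np.pi)*(x+0.044715*tf.pow(x, 3))))

def norm(x, scope, *, axis=-1, epsilon=1e-5):
    """Normalize to mean = 0, std = 1, then do a diagonal affine transform."""
    with tf.variable_scope(scope):
        n_state = x.shape[-1].value
        g = tf.get_variable('g', [n_state], initializer=tf.constant_initializer(1))
        b = tf.get_variable('b', [n_state], initializer=tf.constant_initializer(0))
        u = tf.reduce_mean(x, axis=axis, keepdims=True)
        s = tf.reduce_mean(tf.square(x-u), axis=axis, keepdims=True)
        x = (x - u) * tf.rsqrt(s + epsilon)
        x = x*g + b
        return x

def split_states(x, n):
    """Reshape the last dimension of x into [n, x.shape[-1]/n]."""
    *start, m = shape_list(x)
    return tf.reshape(x, start + [n, m//n])

def merge_states(x):
    """Smash the last two dimensions of x into a single dimension."""
    *start, a, b = shape_list(x)
    return tf.reshape(x, start + [a*b])

def conv1d(x, scope, nf, *, w_init_stdev=0.02):
    # a 1x1 convolution, i.e. a position-wise linear layer
    with tf.variable_scope(scope):
        *start, nx = shape_list(x)
        w = tf.get_variable('w', [1, nx, nf], initializer=tf.random_normal_initializer(stddev=w_init_stdev))
        b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0))
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
        return c

def attention_mask(nd, ns, *, dtype):
    """1's in the lower triangle, counting from the lower right corner."""
    i = tf.range(nd)[:, None]
    j = tf.range(ns)
    m = i >= j - ns + nd
    return tf.cast(m, dtype)
```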